Listing by author "Yang, Dai"
1 - 3 of 3
- Journal article: Enabling Malleability for Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics using LAIK (PARS-Mitteilungen: Vol. 35, No. 1, 2020)
  Raoofy, Amir; Yang, Dai; Weidendorfer, Josef; Trinitis, Carsten; Schulz, Martin
  Malleability, i.e., the ability for an application to release or acquire resources at runtime, has many benefits for current and future HPC systems. Implementing such functionality, however, is already difficult in newly written code and an even more daunting challenge when considering a dynamic and flexible parallel programming model that separates data and execution into two orthogonal concerns. These properties promise easier malleability as the runtime can partition resources dynamically as needed, as well as easier incremental porting of existing MPI code. In this paper, we explore the malleability of LAIK with the help of laik-lulesh, a LAIK-based port of LULESH, a proxy application from the CORAL benchmark suite. We show the steps required for porting the application to LAIK, and we present detailed scaling experiments that show promising results.
- Journal article: LAIK: A Library for Fault Tolerant Distribution of Global Data for Parallel Applications (PARS-Mitteilungen: Vol. 34, No. 1, 2017)
  Weidendorfer, Josef; Yang, Dai; Trinitis, Carsten
  HPC applications usually are not written in a way that they can cope with dynamic changes in the execution environment, such as removing or integrating new nodes or node components. However, for higher flexibility with regard to scheduling and fault tolerance strategies, adequate application-integrated reaction would be worthwhile. However, with legacy MPI codes, this is difficult to achieve. In this paper, we present Lightweight Application-Integrated data distribution for parallel worKers (LAIK), a lightweight library for distributed index spaces and associated data containers for parallel programs supporting fault tolerance features. By giving LAIK control over data and its partitioning, the library can free compute nodes before they fail and do replication for rollback schemes on demand. Applications become more adaptive to changes of available resources. We show a simple example which integrates our LAIK library and present first results on a prototype implementation.
- Journal article: Symptom-based Fault Detection in Modern Computer Systems (PARS-Mitteilungen: Vol. 35, No. 1, 2020)
  Becker, Thomas; Rudolf, Nico; Yang, Dai; Karl, Wolfgang
  Miniaturization and the increasing number of components, which get steadily more complex, lead to a rising failure rate in modern computer systems. Especially soft hardware errors are a major problem because they are usually temporary and therefore hard to detect. As classical fault-tolerance methods are very costly and reduce system efficiency, light-weight methods are needed to increase system reliability. A method that copes with this requirement is symptom-based fault detection. In this work, we evaluate the ability to detect different faults with symptom-based fault detection by using hardware performance counters. As the knowledge of a fault occurrence is usually not enough, we also evaluate the possibility to make conclusions about which fault occurred. For the evaluation, we used the fault-injection library FINJ and manually manipulated loops. The results show that symptom-based fault detection enables the system to detect faulty application behavior; however, fine-grained conclusions about the causing fault are hardly possible.