Crash management for distributed parallel systems

Haase, Jan; Eschmann, Frank; Dadam, Peter; Reichert, Manfred
2004
2019-10-11
ISBN 3-88579-380-6
ISSN 1617-5468
https://dl.gi.de/handle/20.500.12116/28720
Text/Conference Paper
en

With the growing complexity of parallel architectures, the probability of system failures grows too. One approach to this problem is self-healing, one of organic computing's self-x features. Self-healing in this context means that computer clusters detect and handle failures automatically. This paper presents a self-healing mechanism based on checkpointing, so that a cluster remains operational even if some sites or the connections between them fail. The proposed method has been implemented and tested on the Self Distributing Virtual Machine (SDVM).