Listing by keyword "Fault Tolerance"
Results 1-6 of 6
- Text document: Efficient Checkpointing in Byzantine Fault-Tolerant Systems (Tagungsband des FB-SYS Herbsttreffens 2019, 2019). Eischer, Michael; Distler, Tobias. Distributed Byzantine fault-tolerant systems require frequent checkpoints of the application state to perform periodic garbage collection and to enable faulty replicas to recover efficiently. State-of-the-art checkpointing approaches for replicated systems either cause significant service disruption when the application state is large, or they are unable to produce checkpoints that are verifiable across replicas. To address these problems, we developed and evaluated deterministic fuzzy checkpointing, a technique for creating consistent and verifiable checkpoints in parallel with request execution.
- Journal article: Fault-Tolerant and Fail-Safe Control Systems Using Remote Redundancy (FERS-Mitteilungen: Vol. 28, No. 1, 2010). Echtle, Klaus; Kimmeskamp, Thorsten. University of Duisburg-Essen, Institute for Computer Science and Business Information Systems, 45141 Essen, Germany; (echtle | kimmeskamp)@dc.uni-due.de
- Text document: On-the-fly Reconfiguration of Query Plans for Stateful Stream Processing Engines (BTW 2019, 2019). Bartnik, Adrian; Del Monte, Bonaventura; Rabl, Tilmann; Markl, Volker. Stream Processing Engines (SPEs) must tolerate the dynamic nature of unbounded data streams and provide means to adapt quickly to fluctuations in the data rate. However, many major SPEs provide very little functionality for adjusting the execution of a potentially infinite streaming query at runtime. Each modification requires a complete query restart, which involves an expensive redistribution of the query's state and may require external systems to guarantee correct processing semantics. This results in significant downtime, which increases the operational cost of those SPEs. We present a modification protocol that enables modifying specific operators as well as the data flow of a running query while ensuring exactly-once processing semantics. We provide an implementation for Apache Flink that enables stateful operator migration across machines, the introduction of new operators into a running query, and changes to a specific operator based on external triggers. Our results on two benchmarks show that migrating operators for queries with small state is as fast as using Flink's savepoint mechanism. For queries with large state, operator migration even outperforms the savepoint mechanism by a factor of more than 2.3. Introducing and replacing operators at runtime takes less than 10 s. Our modification protocol demonstrates the general feasibility of runtime modifications and opens the door to many other modification use cases, such as online algorithm tweaking and up- or downscaling of operator instances.
- Text document: Systems Support For Efficient State-Machine Replication (Tagungsband des FB-SYS Herbsttreffens 2019, 2019). Habiger, Gerhard; Hauck, Franz J. State-Machine Replication (SMR) is a well-known approach for deploying highly fault-tolerant services. Recent research has focused on efficiency improvements, performance optimisation, and novel approaches to underlying concepts of SMR, such as consensus with trusted components, dynamic weights for quorums, or parallelisation of application code. To increase adoption of SMR as a basic fault-tolerance technique, we see the need to advance the state of the art of SMR even further, and we describe four specific ways in which our research contributes to this goal. In particular, we present two approaches that make the development and deployment of SMR services both easier and more efficient, and we discuss two further areas of improvement concerning the internal mechanisms of common SMR architectures. The goal of this paper is to present our current understanding of important issues in current SMR systems and to outline possible future solutions to them.
- Text document: Towards Resilient Data Management for the Internet of Moving Things (BTW 2021, 2021). Paz, Elena Beatriz Ouro; Zacharatou, Eleni Tzirita; Markl, Volker. Mobile devices have become ubiquitous; smartphones, tablets, and wearables are essential commodities for many people. The ubiquity of mobile devices, combined with their ever-increasing capabilities, opens new possibilities for Internet-of-Things (IoT) applications in which mobile devices act as both data generators and processing nodes. However, deploying a stream processing system (SPS) over mobile devices is particularly challenging, as mobile devices change their position within the network very frequently and are notoriously prone to transient disconnections. To deal with faults arising from disconnections and mobility, existing fault tolerance strategies in SPSs are either checkpointing-based or replication-based. Checkpointing-based strategies are too heavyweight for mobile devices, as they save and broadcast state periodically, even when there are no failures. Replication-based strategies, on the other hand, cannot provide fault tolerance at the level of the data source, as the data source itself cannot always be replicated. Finally, existing systems exclude mobile devices from data processing upon a disconnection even when the disconnection is very short, thus failing to exploit the computing capabilities of the offline devices. This paper proposes a buffering-based reactive fault tolerance strategy to handle transient disconnections of mobile devices that both generate and process data, even when the devices move through the network during the disconnection.
The main components of our strategy are: (a) a circular buffer that stores the data generated and processed locally during a device disconnection, (b) a query-aware buffer replacement policy, and (c) a query restart process that ensures the correct forwarding of the buffered data upon reconnection, taking the new network topology into account. We integrate our fault tolerance strategy with NebulaStream, a novel stream processing system specifically designed for the IoT. We evaluate our strategy using a custom benchmark based on real data, showing reduced data loss and query runtime compared to the baseline NebulaStream.
- Journal article: Workshop “Dependability and Fault Tolerance” (FERS-Mitteilungen: Vol. 28, No. 1, 2010). Großpietsch, K.-E.; Herkersdorf, A.