Browsing by keyword "Stream Processing"
1 - 7 of 7
- Conference paper: Benchmarking Scalability of Stream Processing Frameworks Deployed as Microservices in the Cloud (Software Engineering 2024 (SE 2024), 2024) Henning, Sören; Hasselbring, Wilhelm
- Conference paper: Communication-Optimal Parallel Reservoir Sampling (BTW 2023, 2023) Winter, Christian; Sichert, Moritz; Birler, Altan; Neumann, Thomas; Kemper, Alfons
  When evaluating complex analytical queries on high-velocity data streams, many systems cannot run those queries on all elements of a stream. Sampling is a widely used method to reduce the system load by replacing the input with a representative yet manageable subset. For unbounded data, reservoir sampling generates a fixed-size uniform sample independent of the input cardinality. However, the collection of reservoir samples itself can already be a bottleneck for high-velocity data. In this paper, we introduce a technique that allows fully parallelizing reservoir sampling for many-core architectures. Our approach relies on the efficient combination of thread-local samples taken over chunks of the input without necessitating communication during the sampling phase and with minimal communication when merging. We show how our efficient merge guarantees uniform random samples while allowing data to be distributed over worker threads arbitrarily. Our analysis of this approach within the Umbra database system demonstrates linear scaling along the available threads and the ability to sustain high-velocity workloads.
- Conference paper: A Data Center Infrastructure Monitoring Platform Based on Storm and Trident (Datenbanksysteme für Business, Technologie und Web (BTW 2017) - Workshopband, 2017) Dreissig, Felix; Pollner, Niko
  Sensor data of a modern data center’s cooling and power infrastructure fulfil the characteristics of data streams and are therefore suitable for stream processing. We present a stream-based monitoring platform for data center infrastructure. It is based on multiple independent collectors, which query measurements from sensors and forward them to an Apache Kafka queue. At the platform’s core is a processing cluster based on Apache Storm and its high-level Trident API. From there, results get forwarded to one or multiple data sinks. Using our system, analytical queries can be developed using a collection of universal, generic stream operators including CORRELATE, a novel operator which combines elements from multiple streams with unique semantics. Besides the platform’s general concept, the characteristics and pitfalls of our real-world implementation are also discussed in this work.
- Text document: Lock-free Data Structures for Data Stream Processing (BTW 2019 – Workshopband, 2019) Baumstark, Alexander
  The ever-growing amounts of data in the digital world require more and more computing power. Especially in the areas of social media, sensor data processing, and the Internet of Things, the data need to be handled on the fly as they are created. A common way to handle these data, in the form of endless data streams, is data stream processing. The key requirements for data stream processing are high throughput and low latency. These requirements can be met through the parallelization of operators and multithreading. However, in order to realize a higher degree of parallelism, the efficient synchronization of threads is a necessity. This work examines the design principles of lock-free data structures and how this synchronization method can improve the performance of algorithms in data stream processing. For this purpose, lock-free data structures are implemented for the data stream processing engine PipeFabric and compared with current implementations. The result is an improvement for the tuple exchange between threads and a significant improvement for the symmetric hash join algorithm based on lock-free hash maps.
- Journal article: Lock-free Data Structures for Data Stream Processing (Datenbank-Spektrum: Vol. 19, No. 3, 2019) Baumstark, Alexander; Pohl, Constantin
  Processing data in real time instead of storing and reading from tables has led to a specialization of DBMS into the so-called data stream processing paradigm. While high throughput and low latency are key requirements to keep up with varying stream behavior and to allow fast reaction to incoming events, there are many ways to achieve them. In combination with modern hardware, like server CPUs with tens of cores, the parallelization of stream queries for multithreading and vectorization is a common approach. High degrees of parallelism, however, need efficient synchronization mechanisms to allow good scaling with threads for shared memory access. In this work, we identify the most time-consuming operations for stream processing, using our own stream processing engine PipeFabric as an example. In addition, we present different design principles of lock-free data structures which are suited to overcome those bottlenecks. We finally demonstrate how lock-freedom greatly improves performance for join processing and tuple exchange between operators under different workloads. Nevertheless, the efficient usage of lock-free data structures comes with additional effort and pitfalls, which we also discuss in this paper.
- Text document: NoSQL & Real-Time Data Management in Research & Practice (BTW 2019 – Workshopband, 2019) Wingerath, Wolfram; Gessert, Felix; Ritter, Norbert
  Users have come to expect reactivity from mobile and web applications, i.e. they assume that changes made by other users become visible immediately. However, developers are challenged with building reactive applications on top of traditional pull-oriented databases, because they are ill-equipped to push new information to the client. Systems for data stream management and processing, on the other hand, are natively push-oriented and thus facilitate reactive behavior, but they do not follow the same collection-based semantics as traditional databases: Instead of database collections, stream-oriented systems are based on a notion of potentially unbounded sequences of data items. In this tutorial, we survey and categorize the system space between pull-oriented databases and push-oriented stream management systems, using their respectively facilitated means of data retrieval as a reference point. We start with an in-depth survey of the most relevant NoSQL databases to provide a comparative classification and highlight open challenges. To this end, we analyze the approach of each system to derive its scalability, availability, consistency, data modeling, and querying characteristics. We present how each system’s design is governed by a central set of trade-offs over irreconcilable system properties. We then cover recent research results in distributed data management to illustrate that some shortcomings of NoSQL systems could already be solved in practice, whereas other NoSQL data management problems pose interesting and unsolved research challenges. A particular emphasis lies on the novel system class of real-time databases which combine the push-based access paradigm of stream-oriented systems with the collection-based query semantics of traditional databases. We explore why real-time databases deserve distinction in a separate system class and dissect their different architectures to highlight issues, derive open challenges, and discuss avenues for addressing them.
- Text document: Query Planning for Transactional Stream Processing on Heterogeneous Hardware: Opportunities and Limitations (BTW 2019 – Workshopband, 2019) Götze, Philipp; Pohl, Constantin; Sattler, Kai-Uwe
  In a heterogeneous hardware landscape consisting of various processing units and memory types, it is crucial to decide which device should be used when running a query. A lot of research has already been done on placement decisions for CPUs, coprocessors, GPUs, or FPGAs. However, those decisions can be further extended to the various types of memory within the same layer of the memory hierarchy. For storage, a division between SSDs, HDDs or even NVM is possible, whereas for main memory, types like DDR4 and HBM exist. In this paper, we focus on query planning for the transactional stream processing model. We give an overview of several techniques and necessary parameters when optimizing a stateful query for various memory types, outlined with chosen experimental measurements to support our claims.
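
Illustration for the entry "Communication-Optimal Parallel Reservoir Sampling" above: the abstract describes taking thread-local reservoir samples over chunks of the input and merging them into one uniform sample. The sketch below is not the authors' algorithm and not Umbra code. It combines the classic Algorithm R with a standard weighted merge, and it processes the chunks sequentially as a stand-in for worker threads. All function names (`reservoir_sample`, `merge_samples`, `sample_over_chunks`) are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng):
    """Algorithm R: uniform sample of up to k items. Returns (sample, items_seen)."""
    sample = []
    seen = 0
    for item in stream:
        seen += 1
        if len(sample) < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(seen)      # uniform in [0, seen)
            if j < k:
                sample[j] = item         # replace with probability k / seen
    return sample, seen


def merge_samples(a, n_a, b, n_b, k, rng):
    """Merge reservoirs a (drawn from n_a items) and b (drawn from n_b items).

    Each output slot comes from side a with probability proportional to the
    number of not-yet-represented input items on that side, so the result is
    a uniform sample of the combined n_a + n_b items.
    """
    a, b = list(a), list(b)              # work on copies
    total = n_a + n_b
    merged = []
    while len(merged) < k and n_a + n_b > 0:
        if rng.randrange(n_a + n_b) < n_a:
            merged.append(a.pop(rng.randrange(len(a))))
            n_a -= 1
        else:
            merged.append(b.pop(rng.randrange(len(b))))
            n_b -= 1
    return merged, total


def sample_over_chunks(chunks, k, seed=0):
    """Sample each chunk independently, then fold the local samples together."""
    rng = random.Random(seed)
    merged, total = [], 0
    for chunk in chunks:                 # each chunk stands in for one worker
        local, n = reservoir_sample(chunk, k, rng)
        merged, total = merge_samples(merged, total, local, n, k, rng)
    return merged, total
```

The sampling step needs no shared state between chunks. Only the reservoir and its item count are exchanged at merge time, which is the property the abstract emphasizes. The chunks may have arbitrary and unequal sizes, because the merge weights each side by its input count.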