Browsing by subject "Data Provenance"
Results 1 - 2 of 2
- Journal article
  Collecting and visualizing data lineage of Spark jobs
  (Datenbank-Spektrum: Vol. 21, No. 3, 2021)
  Schoenenwald, Alexander; Kern, Simon; Viehhauser, Josef; Schildgen, Johannes
  Metadata management constitutes a key prerequisite for enterprises as they engage in data analytics and governance. Today, however, the context of data is often only manually documented by subject matter experts, and lacks completeness and reliability due to the complex nature of data pipelines. Thus, collecting data lineage (describing the origin, structure, and dependencies of data) in an automated fashion increases the quality of the provided metadata and reduces manual effort, making it critical for the development and operation of data pipelines. In our practice report, we propose an end-to-end solution that digests lineage via (Py-)Spark execution plans. We build upon the open-source component Spline, allowing us to reliably consume lineage metadata and identify interdependencies. We map the digested data into an expandable data model, enabling us to extract graph structures for both coarse- and fine-grained data lineage. Lastly, our solution visualizes the extracted data lineage via a modern web app, and integrates with BMW Group's soon-to-be open-sourced Cloud Data Hub.
- Conference paper
  Data Spaces as the Distributed Communication Means for Industrial Automation and Control Systems
  (INFORMATIK 2023 - Designing Futures: Zukünfte gestalten, 2023)
  deMeer, Jan
  A data space is more than just a repository for data. Even data itself is more than an item of data. In research, however, a data space is not a new philosophy of communication. In this paper, the concept of a data space is developed for application in industrial automation and control systems (IACS). For this purpose, the paper draws on existing reference architecture models, e.g. for I4.0 manufacturing or electricity transportation and distribution, but also for altruistic data dissemination in smart infrastructures such as cities, buildings, and agriculture. Almost all of these infrastructure examples are to be extended with what this paper calls a 'fourth dimension', in addition to the three regular dimensions: life cycle value streams, communication protocols, and system component hierarchies. The fourth dimension of the various reference architecture models can be represented by combining two axes: the life cycle value stream of the system assets (i.e., data, products, energy, etc.) and layered interoperability, which deals with the representation of semantics in the given reference model. Here, semantics means the state changes that the considered assets undergo over time, which requires semantic interoperability between locations of a site or between devices of a production chain. A state change issued by actors or processes during the life cycle value stream is an event that is represented in a data model and shall be accessible to other actors via the data space. Thus, the communicating actors or processes interconnected by the data space do not need the traditional layered communication protocols of the architectural models, since they are interconnected through a distributed data space that acts as a distributed data repository for all actors and recipients of an application.