Recent Publications
- Journal issue: PARS-Mitteilungen 2024 (PARS-Mitteilungen: Vol. 36, 2024)
- Journal article: TosKonnect: A Modular Queue-based Communication Layer for Heterogeneous High Performance Computing (PARS-Mitteilungen: Vol. 36, 2024). Fuentes Grau, Laura; Eiling, Niklas; Lankes, Stefan; Monti, Antonello. Modern HPC clusters increasingly make use of accelerators, such as GPUs, to achieve the computational throughput that today's applications require. Distributing computations across heterogeneous computing nodes necessitates a vast amount of inter-device data transfers, not only between, but also within, nodes. Each type of device defines unique APIs to handle these transfers. They differ in their implementation but fulfill the same task: exchanging data between memory regions. To meet the high requirements of bandwidth and latency, many device interfaces offer asynchronous APIs that enable hardware offloading of data transfers. This paper introduces TosKonnect to unify asynchronous device communication, while keeping configurability and interoperability in mind. TosKonnect is a queue-based communication layer that defines a vendor-neutral and device-independent API for inter-device data transfers while hiding the intricate details of device communication APIs. With the low overhead TosKonnect introduces into device communication, it provides developers with a performant tool to organize data transfers. (A minimal sketch of such a queue-based transfer abstraction follows after this list.)
- Journal article: The DEEP-SEA project: a software stack for heterogeneous and modular supercomputers (PARS-Mitteilungen: Vol. 36, 2024). Suarez, Estela; Eicker, Norbert; Hoppe, Hans-Christian. Today's most powerful supercomputers achieve their performance through heterogeneous system architectures that integrate CPUs with accelerators, especially GPUs, and advanced multi-level memory systems. This hardware diversity challenges application developers to adapt legacy code, requiring significant effort in code evolution and optimisation. The European DEEP-SEA project has developed an integrated software stack for heterogeneous HPC systems, including kernel modules, libraries, management systems and programming abstractions. It supports heterogeneous hardware configurations, including modular supercomputers, enabling optimal resource allocation, application of malleability and programming model composability. Enhanced tools and data placement policies improved performance on DRAM and fast memory. Results were made publicly available, ensuring sustainability through integration with upstream open source projects and extension of HPC standards. This paper summarises the DEEP-SEA project's contributions to a wide variety of software packages and developments.
- Journal article: Automatic Code Transformation of NetCDF Code for I/O Optimisation (PARS-Mitteilungen: Vol. 36, 2024). Squar, Jannek; Fuchs, Anna; Kuhn, Michael; Ludwig, Thomas. Even small improvements to applications can have a huge impact when running on massively parallel systems. Domain experts often lack sufficient computer science expertise or face significant challenges when trying to implement new features such as data compression or parallel I/O. We present an extension to CATO, a code transformation tool that automatically inserts new features and optimisations into scientific code to demonstrate their use and benefits. It helps to overcome initial barriers and supports guided self-learning in a user-friendly way. In this work we implement and evaluate an LLVM pass to automatically find, analyse and transform an application using the netCDF API, optimising the runtime and memory as well as the storage footprint during the application's I/O phase by inserting parallelisation and compression. Our evaluation shows good speedup and near-optimal memory usage when the modified application is run on distributed hardware using Lustre as the parallel file system backend. (A hand-written example of the targeted parallel netCDF pattern follows after this list.)
- Journal article: Comparing GPU and TPU in an Iterative Scenario: A Study on Neural Network-based Image Generation (PARS-Mitteilungen: Vol. 36, 2024). Lehmann, Roman; Schaarschmidt, Paul; Karl, Wolfgang. This paper explores the utilization of TPUs (Tensor Processing Units) and GPUs (Graphics Processing Units) in iterative applications involving neural networks. We employ a Pix2Pix approach for computing sequential flows, evaluating its effectiveness in scenarios where NNs are only one component of the system. While TPUs demonstrate performance improvements during training with large batch sizes, we observe no significant acceleration during inference compared to GPUs. The study highlights the need to carefully consider workload and system architecture when incorporating TPUs, emphasizing that their advantages are more prominent in training tasks.
- Journal article: Gaining Cross-Platform Parallelism for HAL's Molecular Dynamics Package using SYCL (PARS-Mitteilungen: Vol. 36, 2024). Skoblin, Viktor; Höfling, Felix; Christgau, Steffen. Molecular dynamics simulations are one of the methods in scientific computing that benefit from GPU acceleration. For those devices, SYCL is a promising API for writing portable codes. In this paper, we present the case study of HAL's MD package, which has been successfully migrated from CUDA to SYCL. We describe the different strategies that we followed in the process of porting the code. Following these strategies, we achieved code portability across major GPU vendors. Depending on the actual kernels, both significant performance improvements and regressions are observed. As a side effect of the migration process, we also obtained impressive speedups for execution on CPUs. (A toy SYCL kernel illustrating the portable style follows after this list.)
- Journal article: Evaluation of GPU-Compression Algorithms for CUDA-Aware MPI (PARS-Mitteilungen: Vol. 36, 2024). Vogel, Marco; Oden, Lena. This study evaluates an efficient compression algorithm suitable for use with CUDA-aware MPI, aiming to reduce the latency of large GPU message transfers. We examine the performance of various compression algorithms on distinct datasets. Ndzip emerges as the optimal compression algorithm for our needs. Our findings reveal that the latency of large messages can improve depending on the dataset. However, latency may increase for non-compressible data due to the overhead of compression. With well-compressible data, the Cannon algorithm for dense matrix-matrix multiplication can improve performance by up to 30%. For data that is not highly compressible, there is only a minor performance penalty, as the compression overhead remains relatively small. (A sketch of the CUDA-aware MPI part of such a pipeline follows after this list.)
- Journal article: Modelling MPI Communication using Coloured Petri Nets (PARS-Mitteilungen: Vol. 36, 2024). Krabbe, Tronje; Blesel, Michael; Kuhn, Michael. The Message Passing Interface (MPI) is a widely used standard for distributed-memory parallel computing. Coloured Petri Nets (CPNs) are a powerful, high-level modelling framework, well suited for modelling distributed systems. This paper presents a novel approach to modelling the communication in MPI programs using Coloured Petri Nets. The paper investigates how this approach can be used for correctness checking of communication schemes. A proof-of-concept software implementation is able to detect several errors and shows promising performance. (A toy illustration of the modelling idea follows after this list.)
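
The queue-based transfer abstraction described in the TosKonnect abstract can be pictured roughly as follows. This is a minimal, hypothetical C++ sketch of a vendor-neutral transfer queue; the names (TransferQueue, TransferRequest, enqueue_copy, HostCopyQueue) are invented for illustration and are not taken from the actual TosKonnect API.

```cpp
// Hypothetical sketch of a vendor-neutral, queue-based transfer layer.
// All names are illustrative only; they do not reflect the real TosKonnect API.
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

// A single asynchronous copy request between two memory regions.
struct TransferRequest {
    const void* src;
    void*       dst;
    std::size_t bytes;
};

// Device-independent queue interface: backends (CUDA streams, HIP streams,
// DMA engines, plain host threads, ...) would implement the same contract.
class TransferQueue {
public:
    virtual ~TransferQueue() = default;
    // Enqueue an asynchronous copy; the returned future completes
    // once the data has arrived at the destination.
    virtual std::future<void> enqueue_copy(const TransferRequest& req) = 0;
};

// Trivial host-memory backend: offloads the copy to an asynchronous task.
class HostCopyQueue final : public TransferQueue {
public:
    std::future<void> enqueue_copy(const TransferRequest& req) override {
        return std::async(std::launch::async, [req] {
            std::memcpy(req.dst, req.src, req.bytes);
        });
    }
};

int main() {
    std::vector<char> src(1 << 20, 'x'), dst(1 << 20);
    HostCopyQueue queue;
    auto done = queue.enqueue_copy({src.data(), dst.data(), src.size()});
    done.wait();  // synchronise only when the transferred data is needed
    return dst[0] == 'x' ? 0 : 1;
}
```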
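The CATO article describes automatically rewriting serial netCDF I/O into its parallel counterpart. The sketch below shows, by hand, the general parallel netCDF pattern such a transformation targets (nc_create_par, collective variable access, per-rank hyperslab writes); it is not CATO's actual output, it assumes a netCDF-4 build with MPI support, and the file and variable names are made up. Error checking is omitted for brevity.

```cpp
// Hand-written example of the parallel netCDF pattern that an automatic
// transformation (as in CATO) would insert; requires netCDF-4 with MPI support.
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Create the file for parallel access instead of plain nc_create().
    int ncid, dimid, varid;
    nc_create_par("output.nc", NC_NETCDF4 | NC_CLOBBER,
                  MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);

    const std::size_t n_per_rank = 1024;
    nc_def_dim(ncid, "x", n_per_rank * size, &dimid);
    nc_def_var(ncid, "data", NC_DOUBLE, 1, &dimid, &varid);
    nc_enddef(ncid);

    // Collective access lets the MPI-IO layer aggregate requests.
    nc_var_par_access(ncid, varid, NC_COLLECTIVE);

    // Each rank writes its own contiguous slice of the variable.
    std::vector<double> local(n_per_rank, static_cast<double>(rank));
    std::size_t start = rank * n_per_rank, count = n_per_rank;
    nc_put_vara_double(ncid, varid, &start, &count, local.data());

    nc_close(ncid);
    MPI_Finalize();
    return 0;
}
```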
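The SYCL migration described for HAL's MD package relies on kernels written against the standard SYCL queue/buffer/parallel_for API. The toy kernel below (a simple force-scaling loop, not taken from HAL's MD) illustrates the portable pattern that a CUDA-to-SYCL port produces: the same code runs on GPUs of different vendors or on the CPU, depending on the selected backend.

```cpp
// Minimal SYCL 2020 example of the portable kernel style a CUDA-to-SYCL port
// produces; this toy kernel is illustrative and not taken from HAL's MD.
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    constexpr std::size_t n = 1 << 20;
    std::vector<float> force(n, 1.0f);

    sycl::queue q{sycl::default_selector_v};  // picks a GPU or CPU backend

    {
        sycl::buffer<float, 1> buf{force.data(), sycl::range<1>{n}};
        q.submit([&](sycl::handler& h) {
            sycl::accessor acc{buf, h, sycl::read_write};
            // Scale every force component, analogous to a simple CUDA kernel.
            h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
                acc[i] *= 0.5f;
            });
        });
    }  // buffer destructor synchronises and copies the data back to 'force'

    return force[0] == 0.5f ? 0 : 1;
}
```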
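The compression study combines an on-GPU compressor (ndzip) with CUDA-aware MPI, which accepts device pointers directly in MPI calls. The sketch below shows only the CUDA-aware MPI part of such a pipeline: compress_on_gpu is a hypothetical placeholder (an identity copy, not the ndzip API), and running it requires at least two ranks and a CUDA-aware MPI build.

```cpp
// CUDA-aware MPI sketch: device pointers are passed straight to MPI calls.
// compress_on_gpu() is a hypothetical placeholder for an ndzip-style on-GPU
// compressor; its interface is invented for this illustration.
#include <cuda_runtime.h>
#include <mpi.h>
#include <cstddef>

// Placeholder: would launch a compression kernel and return the compressed size.
unsigned long compress_on_gpu(const float* d_in, std::size_t n, void* d_out) {
    cudaMemcpy(d_out, d_in, n * sizeof(float), cudaMemcpyDeviceToDevice);
    return static_cast<unsigned long>(n * sizeof(float));  // identity "compression"
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const std::size_t n = 1 << 20;
    float* d_data = nullptr;
    void*  d_comp = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&d_data), n * sizeof(float));
    cudaMalloc(&d_comp, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));  // give the buffer defined contents

    if (rank == 0) {
        // Compress on the GPU, then hand the *device* pointer to MPI.
        unsigned long bytes = compress_on_gpu(d_data, n, d_comp);
        MPI_Send(&bytes, 1, MPI_UNSIGNED_LONG, 1, 0, MPI_COMM_WORLD);
        MPI_Send(d_comp, static_cast<int>(bytes), MPI_BYTE, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        unsigned long bytes = 0;
        MPI_Recv(&bytes, 1, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(d_comp, static_cast<int>(bytes), MPI_BYTE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // A decompression kernel would run here before the data is used.
    }

    cudaFree(d_data);
    cudaFree(d_comp);
    MPI_Finalize();
    return 0;
}
```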
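Finally, the Coloured Petri Net article models MPI communication as coloured tokens moving between places, with transitions that fire only when a matching token is available. The toy below is not the paper's model and real CPN tooling (e.g. CPN ML) differs; it merely illustrates how envelope matching of MPI_Recv can be expressed as a transition that is only enabled for a matching token colour, which is the kind of property a correctness check can exploit.

```cpp
// Toy illustration of modelling MPI point-to-point matching with coloured
// tokens; not the paper's model, and real CPN tooling (CPN ML) differs.
#include <iostream>
#include <optional>
#include <vector>

// Token colour: an in-flight MPI message envelope.
struct Msg { int src; int dst; int tag; };

// A place holds a multiset of coloured tokens.
using Place = std::vector<Msg>;

// "Send" transition: MPI_Send adds a token to the network place.
void send(Place& network, Msg m) { network.push_back(m); }

// "Receive" transition: enabled only if a token matching (dst, tag) exists,
// mirroring MPI_Recv's envelope matching; otherwise the receive blocks.
std::optional<Msg> receive(Place& network, int dst, int tag) {
    for (std::size_t i = 0; i < network.size(); ++i) {
        if (network[i].dst == dst && network[i].tag == tag) {
            Msg m = network[i];
            network.erase(network.begin() + static_cast<long>(i));
            return m;
        }
    }
    return std::nullopt;  // no enabled binding: a potential deadlock to report
}

int main() {
    Place network;
    send(network, {0, 1, 42});             // rank 0 sends tag 42 to rank 1
    if (auto m = receive(network, 1, 7))   // rank 1 waits for tag 7 instead
        std::cout << "matched message from rank " << m->src << "\n";
    else
        std::cout << "receive cannot fire: possible communication error\n";
    return 0;
}
```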