- Journal article: A Quantitative Analysis of Processor Memory Bandwidth of an FPGA-MPSoC (PARS-Mitteilungen: Vol. 35, Nr. 1, 2020). Drehmel, Robert; Göbel, Matthias; Juurlink, Ben. System designers have to choose between a variety of different memories available on modern FPGA-MPSoCs. Our intention is to shed light on the achievable bandwidth when accessing them under diverse circumstances and to hint at their suitability for general-purpose applications. We conducted a systematic quantitative analysis of the memory bandwidth of two processing units using a sophisticated standalone bandwidth measurement tool. The results show a maximum cacheable memory bandwidth of 7.11 GiB/s for reads and 11.78 GiB/s for writes for the general-purpose processing unit, and 2.56 GiB/s for reads and 1.83 GiB/s for writes for the special-purpose (real-time) processing unit. In contrast, the achieved non-cacheable read bandwidth lies between 60 MiB/s and 207 MiB/s, with an outlier of 2.67 GiB/s. We conclude that for most applications, relying on DRAM and hardware cache coherency management is the best choice in terms of benefit-cost ratio.
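Bandwidth figures of this kind come from streaming reads and writes over large buffers. The authors' tool runs standalone on the MPSoC; as a rough illustration of the measurement principle only, a minimal host-side sketch in Python (function name and defaults are our own):

```python
import time

def copy_bandwidth(n_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate bulk-copy memory bandwidth in GiB/s.

    Each iteration copies n_bytes (CPython performs a bulk memcpy for
    the slice assignment); the best of several runs is reported, as
    warm-up and scheduling noise depress the first iterations.
    """
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst[:] = src  # streaming read of src + streaming write of dst
        elapsed = time.perf_counter() - t0
        best = max(best, n_bytes / elapsed / 2**30)
    return best
```

Real measurements of cacheable vs. non-cacheable accesses, as in the paper, require bare-metal control over MMU attributes and caches that a managed runtime cannot provide.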
- Journal article: Influence of Discretization of Frequencies and Processor Allocation on Static Scheduling of Parallelizable Tasks with Deadlines (PARS-Mitteilungen: Vol. 35, Nr. 1, 2020). Litzinger, Sebastian; Keller, Jörg. Models for energy-efficient static scheduling of parallelizable tasks with deadlines on frequency-scalable parallel machines comprise moldable vs. malleable tasks and continuous vs. discrete frequency levels. We investigate the tradeoff between scheduling time and energy efficiency when going from continuous to discrete processor allocation and frequency levels. To this end, we present a tool to convert a schedule computed for malleable tasks on machines with continuous frequency scaling (P. Sanders, J. Speck, Euro-Par 2012) into one for moldable tasks on a machine with discrete frequency levels. We compare the energy efficiency of the converted schedule to the energy consumed by a schedule produced by the integrated crown scheduler (N. Melot et al., ACM TACO 2015) for moldable tasks and a machine with discrete frequency levels. Our experiments indicate that the converted Sanders-Speck schedules, while computed faster, consume more energy on average than crown schedules. Surprisingly, it is not the step from malleable to moldable tasks that is responsible, but the step from continuous to discrete frequency levels.
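The energy cost of frequency discretization can be seen in a toy model. Assuming the common convex model where dynamic energy grows as E = work * f^2 (the papers' actual models are more detailed), rounding the continuous optimum up to the next available level directly yields the overhead the abstract identifies:

```python
def discretize_frequency(work_cycles, deadline_s, levels):
    """Pick the lowest discrete frequency meeting the deadline and
    report the energy ratio vs. the continuous optimum.

    Illustrative model: a task of work_cycles run at frequency f
    finishes in work_cycles / f seconds and consumes work_cycles * f**2
    energy units (convex dynamic-energy assumption).
    """
    f_cont = work_cycles / deadline_s  # continuous optimum: finish exactly at the deadline
    feasible = [f for f in levels if f >= f_cont]
    if not feasible:
        raise ValueError("no frequency level meets the deadline")
    f_disc = min(feasible)  # lowest level that still meets the deadline
    e_cont = work_cycles * f_cont ** 2
    e_disc = work_cycles * f_disc ** 2
    return f_disc, e_disc / e_cont
```

For example, a task needing 1.2 GHz to meet its deadline must run at 1.5 GHz on a machine offering {1.0, 1.5, 2.0} GHz, costing (1.5/1.2)^2 = 1.5625 times the continuous-optimum energy.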
- Journal article: GPU-Accelerated Time Warping Distances (PARS-Mitteilungen: Vol. 35, Nr. 1, 2020). Bachmann, Jörg P.; Trogant, Kevin M.; Freytag, Johann-C. More and more algorithms have been accelerated by several orders of magnitude through implementation on GPUs. In particular, highly parallel implementations exist of Dynamic Time Warping (DTW), an algorithm widely used in time series analysis. DTW computes a similarity value for two time series (e.g., temperature curves) while accounting for temporal variations such as time shifts. Unfortunately, the existing GPU implementations of DTW cannot handle arbitrary temporal variations. In this work, we present GPU implementations that are not subject to this restriction. Our evaluation shows that they achieve a speedup of about two orders of magnitude over a CPU implementation.
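For reference, the sequential DTW recurrence that such GPU implementations parallelize (typically along anti-diagonal wavefronts of the cost matrix) is a simple dynamic program:

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW distance
    between two numeric sequences, with absolute difference as the
    local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a[i-1]
                                 cost[i][j - 1],      # repeat b[j-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Because a time shift only stretches the warping path, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is still 0, which is exactly the tolerance to temporal variation the abstract describes.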
- Journal article: Weight Pruning for Deep Neural Networks on GPUs (PARS-Mitteilungen: Vol. 35, Nr. 1, 2020). Hartenstein, Thomas; Maier, Daniel; Cosenza, Biagio; Juurlink, Ben. Neural networks are getting more complex than ever before, leading to resource-demanding training processes that have been the target of optimization. With embedded real-time applications such as traffic identification in self-driving cars relying on neural networks, the inference latency is becoming more important. The size of the model has been identified as an important target of optimization, as smaller networks also require fewer computations for inference. One way to shrink a network in size is to remove small weights: weight pruning. This technique has been exploited in a number of ways and has been shown to significantly lower the number of weights while maintaining an accuracy very close to that of the original network. However, current pruning techniques require the removal of up to 90% of the weights, and thus a high amount of redundancy in the original network, to be able to speed up inference, as sparse data structures induce overhead. We propose a novel technique for the selection of the weights to be pruned. Our technique is specifically designed to take the architecture of GPUs into account. By selecting the weights to be removed in adjacent groups that are aligned to the memory architecture, we are able to fully exploit the memory bandwidth. Our results show that with the same amount of weights removed, our technique is able to speed up a neural network by a factor of 1.57 at a pruning rate of 90% while maintaining the same accuracy as state-of-the-art pruning techniques.
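The core idea of pruning in adjacent, memory-aligned groups can be sketched in a few lines. The group-selection criterion below (smallest group L2 norm first) is an illustrative stand-in, not necessarily the paper's exact criterion:

```python
def prune_aligned_groups(weights, group_size, prune_rate):
    """Zero out whole aligned groups of weights, smallest-norm groups
    first.

    Removing weights in contiguous groups aligned to the memory layout
    keeps the surviving data dense per group, so GPU loads stay
    coalesced; this mirrors the GPU-aware selection the paper proposes.
    """
    n_groups = len(weights) // group_size
    norms = []
    for g in range(n_groups):
        chunk = weights[g * group_size:(g + 1) * group_size]
        norms.append((sum(w * w for w in chunk), g))  # squared L2 norm per group
    n_prune = int(n_groups * prune_rate)
    pruned = list(weights)
    for _, g in sorted(norms)[:n_prune]:  # weakest groups get zeroed
        for i in range(g * group_size, (g + 1) * group_size):
            pruned[i] = 0.0
    return pruned
```

Compared to element-wise magnitude pruning, the zeros here land in aligned runs of `group_size`, which is what lets a GPU kernel skip them without scattered, bandwidth-wasting accesses.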
- Journal article: Generating Optimized FPGA Based MPSoCs to Parallelize Legacy Embedded Software with Customizable Throughput (PARS-Mitteilungen: Vol. 35, Nr. 1, 2020). Heid, Kris; Hochberger, Christian. Executing legacy software on newly developed systems can lead to problems regarding the required throughput of the software. Automatic software parallelization can help to achieve a desired execution time even if a single-core version would be too slow. In this contribution, we present a toolset that automatically parallelizes a given legacy software and distributes it to multiple soft-cores forming a processing pipeline. As a goal for the parallelization, the user can provide a minimum throughput that has to be achieved. Although this concept is limited to repetitive tasks, it can be applied well to most embedded system applications. The results show that the tool achieves remarkable speedups without any manual intervention or code restructuring for a spectrum of benchmarks.
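The pipeline structure the toolset generates can be illustrated in software: a repetitive task is split into stages, each stage runs on its own core, and items stream through stage by stage. A minimal sketch with threads standing in for soft-cores (the function and its queue-based wiring are our own illustration):

```python
import queue
import threading

def run_pipeline(stages, items):
    """Run items through a chain of stage functions, one thread per
    stage, connected by FIFO queues; None is the end-of-stream marker.

    Throughput is set by the slowest stage, which is why the paper's
    tool balances stage workloads against a user-given minimum
    throughput."""
    qs = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(stage, qin, qout):
        while True:
            x = qin.get()
            if x is None:          # propagate end-of-stream downstream
                qout.put(None)
                return
            qout.put(stage(x))

    threads = [threading.Thread(target=worker, args=(s, qs[i], qs[i + 1]))
               for i, s in enumerate(stages)]
    for t in threads:
        t.start()
    for x in items:
        qs[0].put(x)
    qs[0].put(None)
    out = []
    while (y := qs[-1].get()) is not None:
        out.append(y)
    for t in threads:
        t.join()
    return out
```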
- Journal article: Symptom-based Fault Detection in Modern Computer Systems (PARS-Mitteilungen: Vol. 35, Nr. 1, 2020). Becker, Thomas; Rudolf, Nico; Yang, Dai; Karl, Wolfgang. Miniaturization and the increasing number of components, which get steadily more complex, lead to a rising failure rate in modern computer systems. Soft hardware errors in particular are a major problem because they are usually temporary and therefore hard to detect. As classical fault-tolerance methods are very costly and reduce system efficiency, light-weight methods are needed to increase system reliability. A method that copes with this requirement is symptom-based fault detection. In this work, we evaluate the ability to detect different faults with symptom-based fault detection by using hardware performance counters. As knowing that a fault occurred is usually not enough, we also evaluate whether conclusions can be drawn about which fault occurred. For the evaluation, we used the fault-injection library FINJ and manually manipulated loops. The results show that symptom-based fault detection enables the system to detect faulty application behavior; however, fine-grained conclusions about the causing fault are hardly possible.
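The principle of symptom-based detection on performance counters is to learn a fault-free baseline and flag deviations. A minimal sketch, assuming a simple mean-plus-k-sigma threshold (the paper's actual detection criterion may differ):

```python
from statistics import mean, stdev

class SymptomDetector:
    """Flag performance-counter samples that deviate from a fault-free
    baseline.

    Train on counter values collected from fault-free runs; a new
    sample outside mean +/- k standard deviations is reported as a
    symptom of a possible fault."""

    def __init__(self, baseline, k=3.0):
        self.mu = mean(baseline)
        self.sigma = stdev(baseline)
        self.k = k

    def is_symptom(self, sample):
        return abs(sample - self.mu) > self.k * self.sigma
```

As the abstract notes, such a detector can only signal *that* behavior is anomalous; mapping the symptom back to a specific fault needs further analysis.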
- Journal article: Evaluating the Usability of Asynchronous Runge-Kutta Methods for Solving ODEs (PARS-Mitteilungen: Vol. 35, Nr. 1, 2020). Greene, Christopher; Hoffmann, Markus. Combining asynchronous methods with scientific computing is a great challenge. In this paper, we attempt to combine such methods with an ODE solver. Although the results do not yet yield a fully usable asynchronous method, this paper points out the direction of development needed to obtain one.
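As background, the synchronous baseline such work starts from is the classical Runge-Kutta step, whose stages must be evaluated in strict order; asynchronous variants relax exactly this lockstep dependency. A standard RK4 sketch:

```python
def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for y' = f(t, y).

    Each stage k2..k4 depends on the previous stage, which is the
    synchronization point asynchronous methods try to relax."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def solve(f, t0, y0, t_end, h):
    """Integrate from t0 to t_end with fixed step size h."""
    t, y = t0, y0
    while t < t_end - 1e-12:
        y = rk4_step(f, t, y, h)
        t += h
    return y
```

For y' = y with y(0) = 1, integrating to t = 1 reproduces e to high accuracy, a standard sanity check for any variant of the method.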
- Journal article: Reducing DRAM Accesses through Pseudo-Channel Mode (PARS-Mitteilungen: Vol. 35, Nr. 1, 2020). Salehiminapour, Farzaneh; Lucas, Jan; Goebel, Matthias; Juurlink, Ben. Applications once exclusive to high-performance computing are now common in systems ranging from mobile devices to clusters. They typically require large amounts of memory bandwidth. The graphics DRAM interface standards GDDR5X and GDDR6 are new DRAM technologies that promise almost doubled data rates compared to GDDR5. However, these higher data rates require a longer burst length of 16 words, which would typically increase the memory access granularity. However, GDDR5X and GDDR6 support a feature called pseudo-channel mode, in which the memory is split into two 16-bit pseudo channels. This split keeps the memory access granularity constant compared to GDDR5. The pseudo channels are not fully independent channels, however: two accesses can be performed at the same time, but access type, bank, and page must match, while the column address can be selected separately for each pseudo channel. With this restriction, we argue that GDDR5X can best be seen as a GDDR5 memory that allows performing an additional request to the same page without extra cost. We therefore propose a DRAM buffer scheduling algorithm to make effective use of the pseudo-channel mode and the additional memory bandwidth offered by GDDR5X. Compared to the GDDR5X regular mode, our proposed algorithm achieves a 12.5% to 18% memory access reduction on average in pseudo-channel mode.
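The pairing rule described above (same access type, bank, and page; columns free) can be sketched as a simple counting model. This is an illustration of the constraint, not the paper's actual buffer scheduler:

```python
from collections import Counter
from math import ceil

def pseudo_channel_accesses(requests):
    """Count DRAM accesses needed in pseudo-channel mode under greedy
    pairing.

    Requests are dicts with 'type', 'bank', 'page', and 'col'. Two
    requests matching in type, bank, and page can ride one access
    (one per pseudo channel, each with its own column); unmatched
    requests cost a full access. In regular mode, every request costs
    one access, so len(requests) is the baseline."""
    groups = Counter((r["type"], r["bank"], r["page"]) for r in requests)
    return sum(ceil(k / 2) for k in groups.values())
```

For example, four reads to the same bank and page plus one unrelated write need 2 + 1 = 3 accesses in pseudo-channel mode versus 5 in regular mode, which is the kind of reduction the proposed scheduler exploits.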
- Journal article: Comparing MPI Passive Target Synchronization Schemes on a Non-Cache-Coherent Shared-Memory Processor (PARS-Mitteilungen: Vol. 35, Nr. 1, 2020). Christgau, Steffen; Schnor, Bettina. MPI passive target synchronisation offers exclusive and shared locks. These are the building blocks for the implementation of applications with Readers & Writers semantics, such as distributed hash tables. This paper discusses the implementation of MPI passive target synchronisation on a non-cache-coherent multicore, the Intel Single-Chip Cloud Computer. The considered algorithms differ in their communication style, their data structures, and their semantics. It is shown that shared memory approaches scale very well and deliver good performance, even in the absence of cache coherence.
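MPI's exclusive and shared passive-target locks correspond to the writer and reader sides of a readers-writer lock. As a shared-memory analogue of those semantics (a plain Python threading sketch, not the paper's SCC implementation, which must work without cache coherence):

```python
import threading

class ReadersWriterLock:
    """Readers & Writers semantics: many concurrent shared (reader)
    holders, or exactly one exclusive (writer) holder."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:           # readers wait out any writer
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:        # last reader wakes writers
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

In MPI terms, `acquire_shared`/`acquire_exclusive` play the role of `MPI_Win_lock` with `MPI_LOCK_SHARED` and `MPI_LOCK_EXCLUSIVE`, and both releases correspond to `MPI_Win_unlock`.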
- Journal article: Enabling Malleability for Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics using LAIK (PARS-Mitteilungen: Vol. 35, Nr. 1, 2020). Raoofy, Amir; Yang, Dai; Weidendorfer, Josef; Trinitis, Carsten; Schulz, Martin. Malleability, i.e., the ability of an application to release or acquire resources at runtime, has many benefits for current and future HPC systems. Implementing such functionality, however, is already difficult in newly written code and an even more daunting challenge in existing code. LAIK is a dynamic and flexible parallel programming model that separates data and execution into two orthogonal concerns. These properties promise easier malleability, as the runtime can partition resources dynamically as needed, as well as easier incremental porting of existing MPI code. In this paper, we explore the malleability of LAIK with the help of laik-lulesh, a LAIK-based port of LULESH, a proxy application from the CORAL benchmark suite. We show the steps required for porting the application to LAIK, and we present detailed scaling experiments that show promising results.