PARS-Mitteilungen 2015
The complete PARS-Mitteilungen 2015 issue is also available as a PDF file.
Listing of PARS-Mitteilungen 2015 by title
- Journal article: Energy-aware mixed precision iterative refinement for linear systems on GPU-accelerated multi-node HPC clusters (PARS-Mitteilungen: Vol. 32, Nr. 1, 2015)
  Authors: Wlotzka, Martin; Heuveline, Vincent
  Abstract: Modern high-performance computing systems are often built as a cluster of interconnected compute nodes, where each node is built upon a hybrid hardware stack of multi-core processors and many-core accelerators. To use such systems efficiently, numerical methods must embrace the different levels of parallelism, from coarse-grained parallelism at the distributed-memory cluster level to fine-grained parallelism at the shared-memory node level. Synchronization requirements of numerical methods may diminish parallel performance and result in increased energy consumption. We investigate block-asynchronous iteration methods in combination with mixed precision iterative refinement to address this issue. We describe our implementation for multi-node distributed systems using MPI with a hybrid node-level parallelization for multi-core CPUs using OpenMP and multiple CUDA-capable accelerators. Our numerical experiments are based on a linear system arising from the finite element discretization of the Poisson equation. We present energy and runtime measurements for a quad-CPU and dual-GPU test system. We achieve runtime and energy savings of up to 70% for block-asynchronous GPU-accelerated iteration using mixed precision compared to CPU-only computation. We also encounter configurations where the CPU-only computation is advantageous over the GPU-accelerated method.
  (See the mixed precision refinement sketch after this listing.)
- Journal article: Extended Pattern-based Parallelization Approach for Hard Real-Time Systems and its Tool Support (PARS-Mitteilungen: Vol. 32, Nr. 1, 2015)
  Authors: Stegmeier, Alexander; Frieb, Martin; Ungerer, Theo
  Abstract: The transformation of sequential legacy code into parallel applications is hard, especially when timing requirements have to be met. A systematic parallelization approach dealing with this topic already exists. Based on practical experience, we extend it and present our modifications. Our extensions comprise an additional phase dealing with implementation details and another one for quality assurance, whose results may be used to further improve the parallel program. Moreover, we propose tool support which further facilitates the parallelization process.
- Journal article: High performance CCSDS image data compression using GPGPUs for space applications (PARS-Mitteilungen: Vol. 32, Nr. 1, 2015)
  Authors: Ramanarayanan, Sunil Chokkanathapuram; Manthey, Kristian; Juurlink, Ben
  Abstract: Graphics processing units (GPUs) have become popular computing architectures for inherently data-parallel signal processing applications. In principle, GPUs can achieve significant speed-ups over central processing units (CPUs), especially for data-parallel applications that require high throughput. The paper investigates the use of GPUs for running space-borne image data compression algorithms, taking the CCSDS 122.0-B-1 standard as a case study. It proposes an architecture that parallelizes the Bit-Plane Encoder (BPE) stage of CCSDS 122.0-B-1 in lossless mode on a GPU to achieve the high throughput needed for real-time compression of satellite image data streams. Experimental results compare the compression time of the GPU implementation with a state-of-the-art single-threaded CPU implementation and a field-programmable gate array (FPGA) implementation. The GPU implementation on an NVIDIA® GeForce® GTX 670 achieves a peak throughput of 162.382 Mbyte/s (932.288 Mbit/s) and an average speed-up of at least 15 compared to the software implementation running on a 3.47 GHz single-core Intel® Xeon™ processor. The high-throughput CUDA implementation could potentially be suitable for airborne and space-borne applications in the future, if GPU technology evolves to become radiation-tolerant and space-qualified.
  (See the bit-plane decomposition sketch after this listing.)
- Journal article: Novel Image Processing Architecture for 3D Integrated Circuits (PARS-Mitteilungen: Vol. 32, Nr. 1, 2015)
  Authors: Pfundt, Benjamin; Reichenbach, Marc; Söll, Christopher; Fey, Dietmar
  Abstract: Utilizing highly parallel processors for high-speed embedded image processing is a well-known approach. However, the question of how to provide a sufficiently fast data rate from the image sensor to the processing unit is still not solved. As Through-Silicon Vias (TSVs), a new chip-stacking technology, become available, parallel image transmission from the image sensor to the processing unit becomes possible. Nevertheless, using a new technology requires architectural changes in the processing units. With this technology at hand, we present a novel image preprocessing architecture suitable for image processing in 3D chip stacks. The architecture was developed in parallel with a customized image sensor to make a real assembly possible. It has been fully functionally verified and laid out for a 150 nm process. Our performance estimation shows a processing speed of 770 up to 14,400 fps (frames per second) for 5 × 5 filters.
- Journal article: Parallel Processing for Data Deduplication (PARS-Mitteilungen: Vol. 32, Nr. 1, 2015)
  Authors: Sobe, Peter; Pazak, Denny; Stiehr, Martin
  Abstract: Data deduplication is a technique for detecting and eliminating duplicated data blocks in storage systems. It creates a set of unique data blocks and places references accordingly, which allows the original data to be accessed through a reduced number of data blocks. For deduplication, hashes of data blocks are calculated and compared in order to detect and remove duplicates. It can be seen as an alternative to data compression that saves storage capacity in large storage systems. This capacity saving comes at the cost of additional computational effort whenever data blocks are written or updated, and this effort increases with the size of the storage system. On a single-processor system, deduplication degrades performance; in particular, write and update rates drop. Exploiting parallelism is a rewarding way to compensate for this performance drop, particularly for hash calculation and hash comparison. In this paper we explain which parts of a deduplication system are worth parallelizing, and how. As examples, we show performance results for two deduplication algorithms and their parallel implementations, based on multithreading and on parallel GPU computations.
  (See the parallel hashing sketch after this listing.)
- Journal article: Parallel Symbolic Relationship-Counting (PARS-Mitteilungen: Vol. 32, Nr. 1, 2015)
  Authors: Doering, Andreas C.
  Abstract: One of the most basic operations, transforming a relationship into a function that gives the number of fulfilling elements, does not seem to have been widely investigated. In this article a new algorithm for this problem is proposed. The algorithm can be implemented using Binary Decision Diagrams. It transforms a relation given as a symbolic expression into a symbolic function, which can be used further, e.g. for finding maxima. The performance of an implementation based on JINC is given for a scalable example problem.
  (See the relationship-counting sketch after this listing.)
- Journal article: Parallelisierung von Embedded Realtime Systemen: Probleme und Lösungsstrategien in Migrationsprojekten (PARS-Mitteilungen: Vol. 32, Nr. 1, 2015)
  Authors: Abu-Khalil, Marwan
  Abstract: This article distills experience from a series of successful as well as failed industrial parallelization projects in which embedded real-time systems were ported from single-core CPUs to multi-core SMP platforms. The central thesis is that the parallelization of embedded real-time systems faces specific challenges that are of only minor relevance for other system classes such as server or desktop software. The article analyzes and categorizes these specific challenges. As a result, generally applicable approaches are proposed that lead to successful parallelization in the embedded domain.
- Journal article: Parallelization of the Particle-in-cell-Code PATRIC with GPU-Programming (PARS-Mitteilungen: Vol. 32, Nr. 1, 2015)
  Authors: Fitzek, Jutta
  Abstract: The Particle-in-cell (PIC) code PATRIC (Particle Tracking Code) is used at the GSI Helmholtz Center for Heavy Ion Research to simulate particles in circular particle accelerators. Parallelization of PIC codes is an open research field and solutions depend very much on the specific problem. The possibilities and limits of GPU integration are evaluated. General GPU aspects and problems arising from collective particle effects are put into focus, with an emphasis on code maintainability and reuse of existing modules. The studies have been performed using NVIDIA®'s Tesla C2075 GPU. This contribution summarizes the findings.
  (See the particle-push sketch after this listing.)
- Journal issue: PARS-Mitteilungen 2015 (PARS-Mitteilungen: Vol. 32, Nr. 1, 2015)
- Journal article: Particle-in-Cell algorithms on DEEP: The iPiC3D case study (PARS-Mitteilungen: Vol. 32, Nr. 1, 2015)
  Authors: Jakobs, Anna; Zitz, Anke; Eicker, Norbert; Lapenta, Giovanni
  Abstract: The DEEP (Dynamical Exascale Entry Platform) project aims to provide a first implementation of a novel architecture for heterogeneous high-performance computing. This architecture consists of a standard HPC Cluster and, tightly coupled to it, a cluster of many-core processors called Booster. The concept offers application developers the opportunity to run different parts of their program on the best-fitting part of the machine, striving for optimal overall performance. In order to take advantage of this architecture, applications require some adaptation. To support application developers, the DEEP concept includes a high-level programming model that helps to split a given program between the Cluster and Booster parts of the DEEP System. This paper presents the adaptation work, done in the course of the DEEP project, for a Particle-in-Cell space weather application developed by KU Leuven (Katholieke Universiteit Leuven). It discusses all crucial steps of the work: a scalability analysis of the different parts of the program, performance projections for the Cluster and the Booster leading to the separation decisions for the application, and finally the actual implementation work. In addition, some performance results are presented.
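
The mixed precision iterative refinement mentioned in the first abstract above can be summarized in a few lines. Below is a minimal, CPU-only NumPy sketch, assuming a dense, well-conditioned test matrix and a plain dense solver; it is not the authors' block-asynchronous MPI/OpenMP/CUDA implementation, only the basic single/double precision refinement loop.

```python
# Minimal sketch of mixed precision iterative refinement (illustrative only):
# solves are performed in single precision, the residual is accumulated in
# double precision. Matrix, tolerance and solver choice are assumptions.
import numpy as np

def mixed_precision_refinement(A, b, tol=1e-10, max_iter=20):
    A32 = A.astype(np.float32)                          # low-precision working copy
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                   # residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        d = np.linalg.solve(A32, r.astype(np.float32))  # correction in single precision
        x += d.astype(np.float64)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n)) + n * np.eye(n)     # diagonally dominant test matrix
    b = rng.standard_normal(n)
    x = mixed_precision_refinement(A, b)
    print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

In a production code the single precision factorization would be computed once and reused for every correction step; repeating np.linalg.solve here simply keeps the sketch short.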
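
For the CCSDS 122.0-B-1 article, the following sketch only illustrates bit-plane decomposition, the data layout the Bit-Plane Encoder operates on; it is not the standard's coding procedure, and the block values and bit depth are assumptions. The point is that each bit plane of a coefficient block can be extracted independently, which is the kind of regular, data-parallel work that maps well to GPUs.

```python
# Illustrative bit-plane decomposition (not the CCSDS 122.0-B-1 BPE itself):
# split integer coefficient magnitudes into binary planes, MSB first.
import numpy as np

def bit_planes(coefficients, num_planes=8):
    """Return one binary array per bit plane, most significant plane first."""
    c = np.abs(coefficients).astype(np.uint32)
    return [((c >> p) & 1).astype(np.uint8) for p in reversed(range(num_planes))]

if __name__ == "__main__":
    block = np.array([[12, 5],
                      [255, 0]])                # toy coefficient block
    for idx, plane in enumerate(bit_planes(block)):
        print(f"bit plane {7 - idx}:")
        print(plane)
```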
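
For the data deduplication article, the sketch below shows the basic idea of block-level deduplication with parallel hash computation. The fixed 4 KiB block size, the use of SHA-256 via hashlib and the thread pool are illustrative assumptions; the paper itself compares multithreaded and GPU-based variants of two deduplication algorithms.

```python
# Block-level deduplication sketch: hash every block (in parallel), keep one copy
# per distinct hash, and store references for all blocks. Illustrative only.
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # assumed fixed block size

def deduplicate(data: bytes, workers: int = 4):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    # Hashing is the data-parallel part: every block is hashed independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hashes = list(pool.map(lambda blk: hashlib.sha256(blk).digest(), blocks))
    unique = {}      # hash -> unique block payload
    references = []  # per-block reference into the unique store
    for h, blk in zip(hashes, blocks):
        unique.setdefault(h, blk)
        references.append(h)
    return unique, references

if __name__ == "__main__":
    payload = b"abcd" * 2048 + b"efgh" * 2048 + b"abcd" * 2048
    unique, refs = deduplicate(payload)
    print(len(refs), "blocks,", len(unique), "unique")   # 6 blocks, 2 unique
```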
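
For the symbolic relationship-counting article, the sketch below shows the transformation itself on an explicitly enumerated relation: R(x, y) becomes f(x) = |{ y : R(x, y) }|, which can then be queried for maxima. The article performs this symbolically on Binary Decision Diagrams (via JINC); the toy "divides" relation here is an assumption used only to make the operation concrete.

```python
# Naive, explicit-set version of relationship counting (illustrative only):
# turn a relation given as (x, y) pairs into the function x -> |{y : R(x, y)}|.
from collections import defaultdict

def count_relation(pairs):
    counts = defaultdict(int)
    for x, _y in pairs:
        counts[x] += 1
    return dict(counts)

if __name__ == "__main__":
    # Toy relation: x divides y, for x in 1..5 and y in 1..20.
    divides = {(x, y) for x in range(1, 6) for y in range(1, 21) if y % x == 0}
    counts = count_relation(divides)
    print(counts)                          # number of related y for each x
    print(max(counts, key=counts.get))     # element with the maximum count
```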
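
For the PATRIC article, the sketch below shows the data-parallel core of a particle-in-cell time step, a simple kick-drift update applied to all particles at once. The toy field, step size and units are assumptions; it deliberately leaves out the collective effects and the existing module structure that make the GPU parallelization of PATRIC non-trivial.

```python
# Minimal particle-push sketch (illustrative only): one kick-drift step for all
# particles. Each particle is updated independently, which is the part of a PIC
# code that maps naturally onto a GPU.
import numpy as np

def push_particles(x, v, e_field, dt, qm=1.0):
    """Advance particle positions and velocities by one kick-drift step."""
    v = v + qm * e_field(x) * dt   # kick: independent per particle
    x = x + v * dt                 # drift
    return x, v

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 100_000
    x = rng.uniform(0.0, 1.0, n)
    v = rng.normal(0.0, 0.1, n)
    toy_field = lambda pos: -np.sin(2.0 * np.pi * pos)   # assumed external field
    for _ in range(100):
        x, v = push_particles(x, v, toy_field, dt=1e-3)
    print("mean position:", x.mean(), "velocity spread:", v.std())
```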