PARS-Mitteilungen 2015


Latest publications

1 - 10 of 15
  • Journal article
    High performance CCSDS image data compression using GPGPUs for space applications
    (PARS-Mitteilungen: Vol. 32, No. 1, 2015) Ramanarayanan, Sunil Chokkanathapuram; Manthey, Kristian; Juurlink, Ben
    The use of graphics processing units (GPUs) as computing architectures for inherently data-parallel signal processing applications is very popular. In principle, GPUs can achieve significant speed-ups over central processing units (CPUs), especially for data-parallel applications that require high throughput. The paper investigates the use of GPUs for running spaceborne image data compression algorithms, taking the CCSDS 122.0-B-1 standard as a case study. It proposes an architecture that parallelizes the Bit-Plane Encoder (BPE) stage of CCSDS 122.0-B-1 in lossless mode on a GPU, to achieve the high throughput needed for real-time compression of satellite image data streams. Experimental results compare the compression time of the GPU implementation with a state-of-the-art single-threaded CPU implementation and a field-programmable gate array (FPGA) implementation. The GPU implementation on an NVIDIA® GeForce® GTX 670 achieves a peak throughput of 162.382 Mbyte/s (932.288 Mbit/s) and an average speed-up of at least 15 compared to the software implementation running on a 3.47 GHz single-core Intel® Xeon™ processor. The high-throughput CUDA implementation could potentially be suitable for airborne and spaceborne applications in the future, if GPU technology evolves to become radiation-tolerant and space-qualified.
  • Journal article
    Real-Time Vision System for License Plate Detection and Recognition on FPGA
    (PARS-Mitteilungen: Vol. 32, No. 1, 2015) Rosli, Faird; Elhossini, Ahmed; Juurlink, Ben
    The rapid development of Field Programmable Gate Arrays (FPGAs) offers an alternative way to accelerate computationally intensive tasks such as digital signal and image processing. Their capacity for parallel processing shows potential for implementing high-speed vision systems. Out of the numerous applications of computer vision, this paper focuses on the hardware implementation of one that is commercially known as Automatic Number Plate Recognition (ANPR). Morphological operations and Optical Character Recognition (OCR) algorithms have been implemented on a Xilinx Zynq-7000 All-Programmable SoC to realize the functions of an ANPR system. Test results show that the implemented processing pipeline, which consumes 63 % of the logic resources, delivers results with a relatively low error rate. Most importantly, the computation time satisfies the real-time requirement of many ANPR applications.
  • Journal article
    A run-time reconfigurable NoC Monitoring System for performance analysis and debugging support
    (PARS-Mitteilungen: Vol. 32, No. 1, 2015) Koser, Erol; Stabernack, Benno
    Network-on-Chip (NoC) based architectures have recently become more and more important because of their design flexibility and scalable system bandwidth, since today's systems typically consist of a large number of processing elements (e.g. heterogeneous multi-processor systems). In contrast to typical shared-memory based systems, predicting and monitoring the runtime behaviour of the system, e.g. data throughput, link utilization and contention, becomes more complex and requires special architectural features. Besides the traditional approach of simulation at design time, features usable at runtime promise a number of advantages. In this paper we present a flexible, reusable and run-time reconfigurable NoC monitoring system for performance analysis and debugging purposes. Evaluating the monitoring data enables the system designer to achieve better resource utilization by adjusting the system architecture and the programming model.
  • Journal article
    Parallelization of the Particle-in-cell-Code PATRIC with GPU-Programming
    (PARS-Mitteilungen: Vol. 32, No. 1, 2015) Fitzek, Jutta
    The Particle-in-cell (PIC) code PATRIC (Particle Tracking Code) is used at the GSI Helmholtz Center for Heavy Ion Research to simulate particles in circular particle accelerators. Parallelization of PIC codes is an open research field and solutions depend very much on the specific problem. The possibilities and limits of GPU integration are being evaluated. General GPU aspects and problems arising from collective particle effects are put into focus, with an emphasis on code maintainability and reuse of existing modules. The studies have been performed using NVIDIA®'s Tesla C2075 GPU. This contribution summarizes the findings.
  • Journal article
    Proximity Scheme for Instruction Caches in Tiled CMP Architectures
    (PARS-Mitteilungen: Vol. 32, No. 1, 2015) Alawneh, Tareq; Chi, Chi Ching; Elhossini, Ahmed; Juurlink, Ben
    Recent research results show that there is a high degree of code sharing between cores in multi-core architectures. In this paper we propose a proximity scheme for instruction caches, in which code blocks shared among neighbouring L2 caches in tiled multi-core architectures are exploited to reduce the average cache miss penalty and the on-chip network traffic. We evaluate the proposed proximity scheme using a full-system simulator running an n-core tiled CMP. The experimental results reveal a significant execution time improvement of up to 91.4% for microbenchmarks whose instruction footprint does not fit in the private L2 cache. For real applications from the PARSEC benchmark suite, the proposed scheme results in speedups of up to 8%.
  • Journal article
    Parallelisierung von Embedded Realtime Systemen: Probleme und Lösungsstrategien in Migrationsprojekten [Parallelization of Embedded Real-Time Systems: Problems and Solution Strategies in Migration Projects]
    (PARS-Mitteilungen: Vol. 32, No. 1, 2015) Abu-Khalil, Marwan
    This article distills lessons from a series of successful and failed industrial parallelization projects in which embedded real-time systems were ported from single-core CPUs to multi-core SMP platforms. The central thesis of the talk is that the parallelization of embedded real-time systems faces specific challenges that are of only minor relevance for other classes of systems, such as server or desktop software. The article analyzes and categorizes these specific challenges. As a result, it proposes generally applicable approaches that lead to successful parallelization in the embedded domain.
  • Journal article
    Particle-in-Cell algorithms on DEEP: The iPiC3D case study
    (PARS-Mitteilungen: Vol. 32, No. 1, 2015) Jakobs, Anna; Zitz, Anke; Eicker, Norbert; Lapenta, Giovanni
    The DEEP (Dynamical Exascale Entry Platform) project aims to provide a first implementation of a novel architecture for heterogeneous high-performance computing. This architecture consists of a standard HPC Cluster and, tightly coupled to it, a cluster of many-core processors called the Booster. This concept lets application developers run different parts of their program on the best-fitting part of the machine, striving for optimal overall performance. To take advantage of this architecture, applications require some adaptation. To support application developers, the DEEP concept includes a high-level programming model that helps to partition a given program between the Cluster and Booster parts of the DEEP System. This paper presents the adaptation work required for a Particle-in-Cell space weather application developed by KU Leuven (Katholieke Universiteit Leuven), carried out in the course of the DEEP project. It discusses all crucial steps of the work: a scalability analysis of the different parts of the program, their performance projections for the Cluster and the Booster, the resulting separation decisions for the application, and finally the actual implementation work. In addition, some performance results are presented.
  • Journal article
    Extended Pattern-based Parallelization Approach for Hard Real-Time Systems and its Tool Support
    (PARS-Mitteilungen: Vol. 32, No. 1, 2015) Stegmeier, Alexander; Frieb, Martin; Ungerer, Theo
    The transformation of sequential legacy code to parallel applications is hard, especially when timing requirements have to be met. There exists a systematic parallelization approach dealing with this topic. Based on practical experience, we extend it and present our modifications. Our extensions comprise an additional phase dealing with implementation details and another one for quality assurance. Its results may be used to further improve the parallel program. Moreover, we propose tool support which further facilitates the parallelization process.
  • Journal article
    Energy-aware mixed precision iterative refinement for linear systems on GPU-accelerated multi-node HPC clusters
    (PARS-Mitteilungen: Vol. 32, No. 1, 2015) Wlotzka, Martin; Heuveline, Vincent
    Modern high-performance computing systems are often built as a cluster of interconnected compute nodes, where each node is built upon a hybrid hardware stack of multi-core processors and many-core accelerators. To use such systems efficiently, numerical methods must embrace the different levels of parallelism, from the coarse-grained distributed-memory cluster level to the fine-grained shared-memory node level. Synchronization requirements of numerical methods may diminish parallel performance and result in increased energy consumption. We investigate block-asynchronous iteration methods in combination with mixed precision iterative refinement to address this issue. We describe our implementation for multi-node distributed systems using MPI, with a hybrid node-level parallelization for multi-core CPUs using OpenMP and multiple CUDA-capable accelerators. Our numerical experiments are based on a linear system arising from the finite element discretization of the Poisson equation. We present energy and runtime measurements for a quad-CPU and dual-GPU test system. We achieve runtime and energy savings of up to 70% for block-asynchronous GPU-accelerated iteration using mixed precision compared to CPU-only computation. We also encounter configurations where the CPU-only computation is advantageous over the GPU-accelerated method.
  • Journal article
    Parallel Processing for Data Deduplication
    (PARS-Mitteilungen: Vol. 32, No. 1, 2015) Sobe, Peter; Pazak, Denny; Stiehr, Martin
    Data deduplication is a technique for detecting and eliminating duplicated data blocks in storage systems. It creates a set of unique data blocks and places references accordingly, so that the original data can be accessed through a reduced number of data blocks. For deduplication, hashes of data blocks are calculated and compared in order to detect and remove duplicates. It can be seen as an alternative to data compression that saves storage capacity in large storage systems. The capacity saving comes at the cost of additional computational effort whenever data blocks are written and updated. This computational effort increases with the size of the storage system. On a single-processor system, deduplication degrades performance; in particular, the write and update rates drop. Exploiting parallelism is a rewarding way to compensate for this performance drop, particularly for hash value calculations and hash comparisons. In this paper we explain which parts of a deduplication system are worth parallelizing and how. As an example, we show the performance results of two deduplication algorithms and their parallel implementations, based on multithreading and on parallel GPU computations.
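
The hash-and-compare step described in the deduplication abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed block size, the choice of SHA-256, the thread pool, and all function names (`split_blocks`, `block_hash`, `deduplicate`, `restore`) are assumptions made for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # assumed fixed block size in bytes (hypothetical choice)


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_hash(block: bytes) -> str:
    """Fingerprint one block; SHA-256 is an assumed choice of hash function."""
    return hashlib.sha256(block).hexdigest()


def deduplicate(data: bytes, workers: int = 4):
    """Return (store, recipe).

    store  maps each distinct hash to one copy of the block,
    recipe lists the hashes in original order (the "references").
    """
    blocks = split_blocks(data)
    # Each block is hashed independently, so this step can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hashes = list(pool.map(block_hash, blocks))
    store = {}
    for digest, block in zip(hashes, blocks):
        store.setdefault(digest, block)  # keep only the first copy of a block
    return store, hashes


def restore(store: dict, recipe: list) -> bytes:
    """Rebuild the original data by following the references."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    data = b"A" * (3 * BLOCK_SIZE) + b"B" * BLOCK_SIZE
    store, recipe = deduplicate(data)
    print(len(recipe), "blocks referenced,", len(store), "blocks stored")
    print("round trip ok:", restore(store, recipe) == data)
```

For an input of three identical blocks followed by one different block, `deduplicate` keeps two unique blocks and a recipe of four references, and `restore` reproduces the original bytes. Only the hashing is parallelized here; the comparison is a sequential dictionary lookup, which is one simple design choice among several.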