- Journal article: An Architecture Framework for Porting Applications to FPGAs (PARS-Mitteilungen: Vol. 31, No. 1, 2014). Nowak, Fabian; Bromberger, Michael; Karl, Wolfgang. High-level language converters help create FPGA-based accelerators and allow developers to arrive rapidly at a working prototype. However, the generated state machines often do not perform as well as hand-designed control units, and they require considerable area. Also, the deep pipelines they create are not very efficient for small amounts of data. Our approach is an architecture framework of hand-coded building blocks (BBs). A microprogrammable control unit allows programming the BBs to perform computations in a data-flow style. We accelerate applications further by executing independent tasks in parallel on different BBs. Our microprogram implementation of the Conjugate-Gradient method on our data-driven, microprogrammable, task-parallel architecture framework on the Convey HC-1 is competitive with a 24-thread Intel Westmere system. It is 1.2× faster using only one of the four available FPGAs, demonstrating its potential for accelerating numerical applications. Moreover, we show that hardware developers can change the BBs and thereby reduce the iteration count of a numerical algorithm such as the Conjugate-Gradient method to less than 0.5×, owing to more precise operations inside the BBs, which speeds up execution by 2.47×.
- Journal article: ScaFES: An Open-Source Framework for Explicit Solvers Combining High-Scalability with User-Friendliness (PARS-Mitteilungen: Vol. 31, No. 1, 2014). Flehmig, Martin; Feldhoff, Kim; Markwardt, Ulf. We present ScaFES, an open-source HPC framework written in C++11 for solving initial boundary value problems on structured grids using numerical methods that are explicit in time. It is designed to be highly scalable and very user-friendly, i.e. to exploit all levels of parallelism and to provide easy-to-use interfaces. In addition, the numerical nomenclature is reflected in a nearly one-to-one mapping. We describe how the framework works internally by presenting the core components of ScaFES, the modern C++ technologies that are used, the parallelization methods that are employed, and how communication can be hidden behind the update phase of a time step. Finally, we show how a multidimensional heat equation problem, discretized in space via the finite difference method and in time via the explicit Euler scheme, can be implemented and solved using ScaFES in about 60 lines. To demonstrate the performance of ScaFES, we compare ScaFES to PETSc on the basis of the implemented heat equation example in two dimensions and present scalability results w.r.t. MPI and OpenMP achieved on HPC clusters at the ZIH.
- Journal article: A comparison of CUDA and OpenACC: Accelerating the Tsunami Simulation EasyWave (PARS-Mitteilungen: Vol. 31, No. 1, 2014). Christgau, Steffen; Spazier, Johannes; Schnor, Bettina; Hammitzsch, Martin; Babeyko, Andrey; Wächter, Joachim. This paper presents a GPU-accelerated version of the tsunami simulation EasyWave. Using two different GPU generations (Nvidia Tesla and Fermi), different optimization techniques were applied to the application following the principle of locality. Their performance impact was analyzed for both hardware generations. The Fermi GPU not only has more cores, but also possesses an L2 cache shared by all streaming multiprocessors. It is revealed that even the most highly tuned code on the Tesla does not reach the performance of the unoptimized code on the Fermi GPU. Further, a comparison between CUDA and OpenACC shows that the platform-independent approach does not reach the speed of the native CUDA code. A deeper analysis shows that memory access patterns have a critical impact on the compute kernels' performance, although this seems to be caused by the compiler in use.
- Journal article: A Performance Study of Parallel Cauchy Reed/Solomon Coding (PARS-Mitteilungen: Vol. 31, No. 1, 2014). Sobe, Peter; Schumann, Peter. Cauchy-Reed/Solomon coding is applied to tolerate failures of memories and data storage devices in computer systems. To obtain a high data access bandwidth, the coding calculations must be fast, which requires exploiting parallelism. For a software-based system, the most promising approach is data parallelism, which can be easily implemented with OpenMP on a multicore or multiprocessor computer. A beneficial aspect is the clear mathematical nature of the coding operations, which supports functional parallelism as well. We report on a storage system application that automatically generates the encoder and decoder as C code from a parametric description of the system and automatically inserts OpenMP directives into the code. We compare the achieved data throughput for data parallelism and for functional parallelism, both generated using OpenMP.
- Journal article: Evaluation of Adaptive Memory Management Techniques on the Tilera TILE-Gx Platform (PARS-Mitteilungen: Vol. 31, No. 1, 2014). Fleig, Tobias; Mattes, Oliver; Karl, Wolfgang. Manycore processor systems are likely to be the future system structure and are even within reach for use in desktop or mobile systems. Up to now, manycore processors such as Intel SCC, Tilera TILE or KALRAY's MPPA have been primarily intended for high-performance applications, utilizing several cores with direct inter-core communication to avoid access to external memory. The spread of these manycore systems brings up new application scenarios with multiple concurrently running, highly dynamic applications, changing I/O characteristics and unpredictable memory usage. Highly dynamic workloads with varying memory usage have to be supported. This paper addresses the memory management of various manycore platforms. The Tilera TILE-Gx platform is explained in more detail, along with the results of our own evaluations of accesses to its memory system. Based on that, the concept of the autonomous, self-optimizing memory architecture Self-aware Memory (SaM) was implemented as an example in the form of a software layer on the Tilera platform. The results show that adaptive memory management techniques can be realized without much management overhead, in return achieving higher flexibility and simple usage of memory in future system architectures.
- Journal article: PBA2CUDA - A Framework for Parallelizing Population Based Algorithms Using CUDA (PARS-Mitteilungen: Vol. 31, No. 1, 2014). Zgeras, Ioannis; Brehm, Jürgen; Knoppik, Michael. Owing to trends in modern hardware development, developers have to parallelize their code to increase the performance of a program. Since parallelizing source code involves additional programming effort, it is desirable to provide developers with tools that help them do so. PBA2CUDA is a framework for the semi-automatic parallelization of source code, specialized in the algorithm class of Population Based Algorithms.
- Journal article: Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator (PARS-Mitteilungen: Vol. 31, No. 1, 2014). Hofmann, Johannes; Treibig, Jan; Hager, Georg; Wellein, Gerhard. We examine the Xeon Phi, which is based on Intel's Many Integrated Cores architecture, for its suitability to run the FDK algorithm, the most commonly used algorithm for 3D image reconstruction in cone-beam computed tomography. We study the challenges of efficiently parallelizing the application and means to enable sensible data sharing between threads despite the lack of a shared last-level cache. Apart from parallelization, SIMD vectorization is critical for good performance on the Xeon Phi; we perform various micro-benchmarks to investigate the platform's new set of vector instructions and put a special emphasis on the newly introduced vector gather capability. We refine a previous performance model for the application and adapt it for the Xeon Phi to validate the performance of our optimized hand-written assembly implementation, as well as the performance of several different auto-vectorization approaches.
- Journal article: A Quantitative Comparison of PRAM based Emulated Shared Memory Architectures to Current Multicore CPUs and GPUs (PARS-Mitteilungen: Vol. 31, No. 1, 2014). Hansson, Erik; Alnervik, Erik; Kessler, Christoph; Forsell, Martti. The performance of current multicore CPUs and GPUs is limited in computations that make frequent use of communication/synchronization between the subtasks executed in parallel. This is because the directory-based cache systems scale weakly and/or the cost of synchronization is high. Emulated Shared Memory (ESM) architectures, which rely on multithreading and efficient synchronization mechanisms, have been developed to solve these problems, which affect both the performance and the programmability of current machines. In this paper, we present a preliminary comparison of the performance of three hardware-implemented ESM architectures with state-of-the-art multicore CPUs and GPUs. The benchmarks are selected to cover different patterns of parallel computation and therefore reveal the performance potential of ESM architectures with respect to current multicores.
- Journal article: Ein Cloud-basierter Workflow für die effektive Fehlerdiagnose von Loop-Back-Strukturen [A Cloud-Based Workflow for Effective Fault Diagnosis of Loop-Back Structures] (PARS-Mitteilungen: Vol. 31, No. 1, 2014). Gulbins, Matthias; Schneider, André; Rülke, Steffen. Fault diagnosis is a highly complex and time-consuming task in the design of integrated mixed-signal circuits. This contribution presents an approach based on cloud technologies that localizes faults in structures of analog-to-digital and digital-to-analog converters, which are typical of such circuits. The diagnosis method (a result of the BMBF project DIANA) is based on the so-called loop-back test, which simplifies the generation of test data but requires a large number of variant simulations with different simulation principles and considerable volumes of data. These simulations are now to be carried out in the cloud in a problem-adapted and therefore efficient manner. For the corresponding information processing in the cloud, the GridWorker framework developed in the OptiNum-Grid project was adapted. Experiments with initial application examples confirm the performance and practicality of the approach for data-intensive and processing-intensive circuit design tasks.
- Journal issue: PARS-Mitteilungen 2014 (PARS-Mitteilungen: Vol. 31, No. 1, 2014)