- Conference paper: A Generic Tool Supporting Cache Design and Optimisation on Shared Memory Systems (9th workshop on parallel systems and algorithms – workshop of the GI/ITG special interest groups PARS and PARVA, 2008). Schindewolf, Martin; Tao, Jie; Karl, Wolfgang; Cintra, Marcelo; Nagel, Wolfgang E.; Hoffmann, Rolf; Koch, Andreas. For multi-core architectures, improving cache performance is crucial for overall system performance. In contrast to the common approach of designing caches with the best trade-off between performance and cost, this work favours an application-specific cache design. To this end, an analysis tool capable of revealing the causes of cache misses has been developed. The results of the analysis can be used by system developers to improve cache architectures and can help programmers to improve the data-locality behaviour of their programs. The SPLASH-2 benchmark suite is used to demonstrate the abilities of the analysis model.
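The abstract does not spell out how the tool attributes a cause to each cache miss; a common scheme for this is the classic "3C" classification (compulsory / capacity / conflict). The sketch below is a hypothetical illustration of that idea for a direct-mapped cache, not the paper's actual analysis model: a miss is compulsory if the block was never seen, a capacity miss if a fully associative cache of the same size would also miss, and a conflict miss otherwise.

```python
def classify_misses(trace, num_sets, block_size):
    """Classify hits and miss causes ("3C" scheme) for a direct-mapped
    cache over a list of byte addresses. Illustrative only."""
    cache = {}           # set index -> block currently stored
    seen_blocks = set()  # blocks referenced at least once (compulsory check)
    lru = []             # fully associative LRU of the same capacity (capacity check)
    stats = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for addr in trace:
        block = addr // block_size
        index = block % num_sets
        if cache.get(index) == block:
            stats["hit"] += 1
        elif block not in seen_blocks:
            stats["compulsory"] += 1       # first touch of this block
        elif block not in lru:
            stats["capacity"] += 1         # would miss even fully associatively
        else:
            stats["conflict"] += 1         # caused by the set mapping alone
        cache[index] = block
        seen_blocks.add(block)
        if block in lru:
            lru.remove(block)
        lru.append(block)
        if len(lru) > num_sets:            # same total capacity as the real cache
            lru.pop(0)
    return stats
```

A programmer could feed an address trace of a hot loop into such a classifier to decide whether padding (against conflict misses) or blocking (against capacity misses) is the right locality optimization.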
- Conference paper: How efficient are creatures with time-shuffled behaviors? (9th workshop on parallel systems and algorithms – workshop of the GI/ITG special interest groups PARS and PARVA, 2008). Ediger, Patrick; Hoffmann, Rolf; Halbach, Mathias; Nagel, Wolfgang E.; Hoffmann, Rolf; Koch, Andreas. The task of the creatures in the "creatures' exploration problem" is to visit all empty cells in an environment in a minimum number of steps. We have analyzed this multi-agent problem with time-shuffled algorithms (behaviors) in the cellular automata model. Ten different "uniform" (non-time-shuffled) algorithms with good performance from former investigations were used alternately in time. We designed three time-shuffling types that differ in how the algorithms are interleaved. New metrics, such as absolute and relative efficiency, were defined for such a multi-agent system. The efficiency relates the work of an agent system to the work of a reference system, i.e., a system that can solve the problem with the lowest number of creatures using uniform or time-shuffled algorithms. Some time-shuffled systems reached high efficiency rates, but the most efficient system was a uniform one with 32 creatures. Among the most efficient successful systems, the uniform ones are dominant. Shuffling algorithms resulted in better success rates for a single creature, but this is not always the case for more than one creature.
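The abstract defines the efficiency metrics only informally. One plausible reading, shown here as a hedged sketch (the paper's precise definitions may differ), takes "work" to be the number of creatures times the number of steps to complete the exploration, and relates a system's work to that of the reference system:

```python
def work(creatures, steps):
    """Total work of an agent system: creatures x steps to visit all cells.
    This product is an assumption; the paper's definition may differ."""
    return creatures * steps

def relative_efficiency(system, reference):
    """Work of the reference system divided by the work of the given
    system, each given as a (creatures, steps) tuple. A value of 1.0
    means the system needs no more work than the reference."""
    return work(*reference) / work(*system)
```

Under this reading, a 32-creature system finishing in 100 steps compared against a single-creature reference needing 2000 steps would have relative efficiency 2000 / 3200 = 0.625.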
- Conference paper: Adaptive Cache Infrastructure: Supporting dynamic Program Changes following dynamic Program Behavior (9th workshop on parallel systems and algorithms – workshop of the GI/ITG special interest groups PARS and PARVA, 2008). Nowak, Fabian; Buchty, Rainer; Karl, Wolfgang; Nagel, Wolfgang E.; Hoffmann, Rolf; Koch, Andreas. Recent examinations of program behavior at run time revealed distinct phases. It is thus evident that a framework for supporting hardware adaptation to phase behavior is needed. With memory access behavior being most important, and cache accesses forming a large subset of it, we propose an infrastructure for fitting the cache to a program's requirements in each distinct phase.
- Conference paper: Parallel derivative computation using ADOL-C (9th workshop on parallel systems and algorithms – workshop of the GI/ITG special interest groups PARS and PARVA, 2008). Kowarz, Andreas; Walther, Andrea; Nagel, Wolfgang E.; Hoffmann, Rolf; Koch, Andreas. Derivative computation using Automatic Differentiation (AD) is often considered a purely serial task. Performing the differentiation in parallel may require the applied AD tool to extract parallelization information from the user function, transform it, and apply this new strategy in the differentiation process. Furthermore, when using the reverse mode of AD, it must be ensured that no data races are introduced by the reversed data-access scheme. For an operator-overloading-based AD tool, an additional challenge must be met: parallelization statements are typically not recognized. In this paper, we present and discuss the parallelization approach that we have integrated into ADOL-C, an operator-overloading-based AD tool for the differentiation of C/C++ programs. The advantages of the approach are illustrated by the parallel differentiation of a function that handles the time evolution of a 1D quantum plasma.
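ADOL-C itself uses C++ operator overloading to record a "tape" of operations that is then replayed backwards for the reverse mode. The minimal Python sketch below imitates only that core idea; the class and method names are illustrative and are not ADOL-C's API. It also shows why the reversed data-access scheme mentioned above is race-prone: values read in the forward sweep become accumulation targets in the adjoint sweep.

```python
class Var:
    """A taped value: operator overloading records each operation so the
    adjoint (reverse) sweep can replay the tape backwards."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0
        self._parents = ()
        self._backward = lambda: None

    def __add__(self, other):
        out = Var(self.value + other.value)
        out._parents = (self, other)
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        out._parents = (self, other)
        def backward():
            # Reversed data-access scheme: forward-sweep *inputs* are
            # written here, which is why a parallel reverse sweep must
            # guard these accumulations against data races.
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        out._backward = backward
        return out

    def backprop(self):
        # Topologically order the tape, then sweep it in reverse.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# f(x, y) = x*y + x, so df/dx = y + 1 and df/dy = x.
x, y = Var(3.0), Var(4.0)
f = x * y + x
f.backprop()
```

Note that `x` feeds two operations, so both adjoint contributions accumulate into `x.grad`; in a parallel reverse sweep these two writes would be the data race the abstract warns about.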
- Conference paper: Specifying and Processing Co-Reservations in the Grid (9th workshop on parallel systems and algorithms – workshop of the GI/ITG special interest groups PARS and PARVA, 2008). Röblitz, Thomas; Nagel, Wolfgang E.; Hoffmann, Rolf; Koch, Andreas. Executing complex applications on Grid infrastructures necessitates the guaranteed allocation of multiple resources. Such guarantees are often implemented by means of advance reservations. Reserving resources in advance requires multiple steps, from their description to their actual allocation. In a Grid, a client possesses little knowledge about the future status of resources; manually specifying successful parameters of a co-reservation is therefore a tedious task. Instead, we propose to parametrize certain reservation characteristics (e.g., the start time) and to let a client define criteria for selecting appropriate values. A Grid reservation service then processes such requests by determining the future status of resources and calculating a co-reservation candidate which satisfies the criteria. In this paper, we present the Simple Reservation Language (SRL) for describing the requests, demonstrate the transformation of an example request into an integer program using the Zuse Institute Mathematical Programming Language (ZIMPL), and experimentally evaluate the time needed to find the optimal co-reservation using CPLEX.
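The SRL/ZIMPL/CPLEX pipeline itself is not reproduced here, but the underlying selection problem can be sketched as a toy: given forecast free capacities per resource over discrete time slots, find a common start time at which every resource can satisfy the request for the whole duration, under a client criterion (here simply "earliest start"). All names and the data layout below are illustrative assumptions.

```python
def earliest_co_reservation(free_capacity, needed, duration):
    """Toy co-reservation search. free_capacity maps each resource to a
    list of its forecast free capacity per time slot; needed maps each
    resource to the capacity the request requires. Returns the earliest
    start slot t such that every resource has the needed capacity during
    [t, t + duration), or None if no common slot exists."""
    horizon = min(len(slots) for slots in free_capacity.values())
    for t in range(horizon - duration + 1):
        if all(min(free_capacity[r][t:t + duration]) >= needed[r]
               for r in needed):
            return t
    return None
```

A real solver encodes the same feasibility constraints as linear inequalities over binary start-time variables and lets CPLEX optimize the client's criterion instead of scanning slots linearly.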
- Conference paper: SDVMR: A Scalable Firmware for FPGA-based Multi-Core Systems-on-Chip (9th workshop on parallel systems and algorithms – workshop of the GI/ITG special interest groups PARS and PARVA, 2008). Hofmann, Andreas; Waldschmidt, Klaus; Nagel, Wolfgang E.; Hoffmann, Rolf; Koch, Andreas. As the main scope of mobile embedded systems shifts from control to data-processing tasks, high performance demands and limited energy budgets are often seen as conflicting design goals. Heterogeneous, adaptive multicore systems are one approach to meeting these challenges, and the importance of multicore FPGAs as an implementation platform is growing steadily. However, efficient exploitation of parallelism and dynamic runtime reconfiguration poses new challenges for application software development. In this paper, the implementation of a virtualization layer between applications and the multicore FPGA is described. This virtualization allows transparent runtime reconfiguration of the underlying system for adaptation to changing system environments; the parallel application does not see the underlying, possibly heterogeneous, multicore system. Many of the requirements for an adaptive FPGA realization are met by the SDVM, the scalable dataflow-driven virtual machine. This paper describes the concept of the FPGA firmware based on a reimplementation and adaptation of the SDVM.
- Edited book: 9th workshop on parallel systems and algorithms – workshop of the GI/ITG special interest groups PARS and PARVA (2008). Nagel, Wolfgang E.; Hoffmann, Rolf; Koch, Andreas.
- Conference paper: High Performance Multigrid on Current Large Scale Parallel Computers (9th workshop on parallel systems and algorithms – workshop of the GI/ITG special interest groups PARS and PARVA, 2008). Gradl, Tobias; Rüde, Ulrich; Nagel, Wolfgang E.; Hoffmann, Rolf; Koch, Andreas. Making multigrid algorithms run efficiently on large parallel computers is a challenge: without clever data structures, the communication overhead leads to an unacceptable performance drop when using thousands of processors. We show that with a good implementation it is possible to solve a linear system with 10^11 unknowns in about 1.5 minutes on almost 10,000 processors. The data structures also allow for efficient adaptive mesh refinement, opening a wide range of applications to our solver.
- Conference paper: Grid Virtualization Engine: Providing Virtual Resources for Grid Infrastructure (9th workshop on parallel systems and algorithms – workshop of the GI/ITG special interest groups PARS and PARVA, 2008). Kwemou, Emeric; Wang, Lizhe; Tao, Jie; Kunze, Marcel; Kramer, David; Karl, Wolfgang; Nagel, Wolfgang E.; Hoffmann, Rolf; Koch, Andreas. Virtual machines offer many advantages, such as easy configuration and management, and can simplify the development and deployment of Grid infrastructures. Although various virtualization implementations provide similar functions, they often expose different management and access interfaces. These heterogeneous virtualization technologies make it challenging to employ virtual machines as computing resources for building Grid infrastructures. The work proposed in this paper focuses on a Web-service-based virtual machine provider for Grid infrastructures. The Grid Virtualization Engine (GVE) creates an abstraction layer between users and the underlying virtualization technologies. The GVE implements a scalable distributed architecture in which a GVE Agent represents a computing center. The Agent talks to the different virtualization products inside the computing center and provides virtual machine resources to the GVE Site Service, through which users can request computing resources. The system is designed and implemented using state-of-the-art distributed computing technologies: Web services and Grid standards.
- Conference paper: An optimized ZGEMM implementation for the Cell BE (9th workshop on parallel systems and algorithms – workshop of the GI/ITG special interest groups PARS and PARVA, 2008). Schneider, Timo; Hoefler, Torsten; Wunderlich, Simon; Mehlan, Torsten; Rehm, Wolfgang; Nagel, Wolfgang E.; Hoffmann, Rolf; Koch, Andreas. The architecture of the IBM Cell BE processor represents a new approach to designing CPUs: fast execution of legacy software has to stand back in order to achieve very high performance for new scientific software. The Cell BE consists of 9 independent cores and represents a promising new architecture for HPC systems. The programmer has to write parallel software that is distributed to the cores and executes subtasks of the program in parallel. The simplified vector-CPU design achieves higher clock rates and power efficiency and exhibits predictable behavior. But to exploit the capabilities of this upcoming CPU architecture, it is necessary to provide optimized libraries for frequently used algorithms. The Basic Linear Algebra Subprograms (BLAS) provide functions that are crucial for many scientific applications. The routine ZGEMM, which computes a complex matrix-matrix product, is one of these functions. This article describes strategies for implementing the ZGEMM routine on the Cell BE processor; the main goal is to achieve the highest possible performance. We compare this optimized ZGEMM implementation with several math libraries on the Cell and on other modern architectures, and show that our ZGEMM algorithm performs best among the fastest publicly available ZGEMM and DGEMM implementations for the Cell BE and reasonably well in the league of other BLAS implementations.
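For reference, the BLAS ZGEMM semantics are C := alpha*A*B + beta*C on complex matrices. The plain triple loop below (Python complex numbers standing in for BLAS double-complex, row-major nested lists instead of column-major arrays) states those semantics; an optimized kernel like the paper's Cell BE implementation must produce the same values while blocking for the local stores and vector units.

```python
def zgemm(alpha, A, B, beta, C):
    """Reference complex matrix-matrix product:
    result[i][j] = alpha * sum_p A[i][p]*B[p][j] + beta * C[i][j].
    A is m x k, B is k x n, C is m x n; no transposition options."""
    m, k, n = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
             for j in range(n)]
            for i in range(m)]
```

Such a naive reference runs orders of magnitude slower than a tuned BLAS, but it is the standard oracle for validating an optimized implementation on random inputs.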