- Journal article: An Image Processing Operator Language for Design and Synthesis of Smart Camera Architectures (PARS-Mitteilungen: Vol. 34, No. 1, 2017). Hartmann, Christian; Häublein, Konrad; Pfundt, Benjamin; Reichenbach, Marc; Fey, Dietmar. Recent trends show a rise of heterogeneous hardware architectures for image processing applications. Because these camera systems are used in the embedded field, reducing area and power consumption has become essential. Standard CPUs are not suitable in the embedded field because of their high power and area consumption, and embedded applications have strict constraints on these parameters. Therefore, optimized and specialized hardware is required, resulting in a heterogeneous system architecture. Designing such a system is a challenging and error-prone task. The design process requires both software and hardware skills, as well as proficiency in several programming and design languages. To reduce this complexity, a common language that can easily be mapped onto different hardware architectures, combined with a synthesis framework, is needed. With the Image Processing Operator Language (IPOL), the description of heterogeneous systems with one language becomes possible. The synthesis framework called Image Processing Architecture Synthesis (IPAS) completes the domain-specific language (DSL) as an underlying mapping methodology.
- Journal article: Predicting Efficient Execution with Source Code Analysis in a Heterogeneous Environment (PARS-Mitteilungen: Vol. 34, No. 1, 2017). Hellwig, Markus; Becker, Thomas. Finding a good schedule for the tasks of an application is a critical step for the efficient usage of heterogeneous systems. A good schedule can only be found with information about the tasks to be scheduled. In a dynamic system, this information is normally only available after each task has been executed at least once, which creates an initial overhead until a good schedule can be created. Therefore, we introduce a method based on static code analysis and machine learning algorithms that predicts, by classification, the fastest processor for a given OpenCL task before runtime, which helps to reduce this initial overhead. We show how we used a static code analysis implementation based on Clang to generate training data on a set of 10 different heterogeneous processors including Intel, AMD and Nvidia GPUs, an Intel Xeon Phi and Intel CPUs. This training data was used to generate prediction models via several different machine learning algorithms, including Random Forest and k-Nearest Neighbour. The models were then evaluated by predicting, via classification, the fastest processor out of two or more processors.
- Journal article: Tracing of Multi-Threaded Java Applications in Score-P Using JVMTI and User Instrumentation (PARS-Mitteilungen: Vol. 34, No. 1, 2017). Frenzel, Jan; Feldhoff, Kim; Jäkel, René; Müller-Pfefferkorn, Ralph. Over the past years, parallel Java applications have received a substantial boost in the research field of High Performance Computing, especially in Big Data Analytics, through the development of Java-based frameworks for processing large-scale datasets, such as Apache Hadoop, Flink or Spark. Analyzing the performance of these Big Data frameworks in particular, and of multi-threaded Java applications in general, is indispensable for efficient execution. Due to the high number of threads, this requires a scalable runtime performance measurement infrastructure. The established, open-source tracing framework Score-P provides such an infrastructure, but previously did not support (parallel) Java applications. We added support for tracing multi-threaded Java applications to Score-P by implementing two instrumentation approaches. The first approach is based on the Java Virtual Machine Tool Interface (JVMTI) and allows an application to be traced easily without source code modifications. The second approach lets the user modify the sources manually via API functions, so that only those parts of an application the user is interested in are recorded. Both instrumentation approaches were successfully applied to the LU kernel of the established Java benchmark suite SPECjvm2008 on a modern HPC machine. We show the quality of the implementations by determining the tracing overheads of the instrumented versions for different test scenarios using varying numbers of Java threads and, thus, varying numbers of recorded events.
- Journal article: Minimizing Energy Cost in Task-Graph Execution on Parallel Platforms (PARS-Mitteilungen: Vol. 34, No. 1, 2017). Gerhards, Rainer; Keller, Jörg. We investigate minimization of energy cost for execution of statically scheduled task graphs on parallel machines with frequency scaling and given deadlines, assuming that the power profile of the processing elements and the energy price curve over time are known or can be predicted. We present both a mixed integer linear program and a heuristic to solve this problem, using time slots of fixed length and discrete frequency levels for both approaches and a fixed budget per time slot for the heuristic. We evaluate the heuristic by comparison to cost-optimal schedules. For price curves occurring in practice, and for deadlines not too close to the minimum makespan, the heuristic produces about 15% more energy cost than the optimal solution.
- Journal article: A Distributed Hash Table using One-sided Communication in MPI (PARS-Mitteilungen: Vol. 34, No. 1, 2017). Sobe, Peter; Graupner, Tom; Hennig, Florian. The Message Passing Interface (MPI) can be applied to implement data structures that are distributed across process memory, such as distributed arrays or hash tables. In this paper a hash table implementation is described that employs one-sided communication in case of collision-free access. Collisions of data entries within the hash table are treated using dynamic overflow memory and two-sided communication. This leads to a two-level communication architecture that combines one-sided and two-sided operations in a data structure and the related access operations. This approach circumvents the problem of dynamic and unforeseen size and arrangement of data structures in shared memory that would be hard to manage using solely one-sided communication.
- Journal article: Design Space Exploration Including Approximate Computing for OpenCL-based Stereo Vision Hardware (PARS-Mitteilungen: Vol. 34, No. 1, 2017). Bromberger, Michael; Ehrle, Steffen; Scharrer, Michael; Erlinghagen, Lukas; Schick, Jens. Calculating distances from objects to a subject, for instance a car, is a central task in many applications. Such distances can be calculated by stereo vision exploiting stereo camera images. The high complexity of this approach, which has to be performed under high-performance and low-power constraints, limits its widespread use. Hardware acceleration is a promising solution to meet the above constraints. Two main approaches exist: local ones work on a pixel-wise scheme, whereas global ones consider all pixels at the same time, which greatly increases the memory and time complexity. Several optimization methods exist to find Pareto-optimal designs in the design space spanned by accuracy, performance, and resource consumption. Besides well-known techniques, we design, implement, and evaluate new methods, which include the current research trend of approximate computing. In this paper we therefore evaluate different optimization techniques on an OpenCL level for local as well as semi-global approaches. While we target resource reduction for local approaches, we tackle the memory issue of semi-global approaches. We implement all methods on a low-power and low-cost FPGA-based system on chip and evaluate them on available benchmarks as well as on a real-world scenario. The novel semi-global approximate computing design provides a high frame rate, supports a high number of disparities, and achieves good accuracy on typical traffic scenes.
- Journal article: Design of MPI Passive Target Synchronization for a Non-Cache-Coherent Many-Core Processor (PARS-Mitteilungen: Vol. 34, No. 1, 2017). Christgau, Steffen; Schnor, Bettina. Distributed hash tables are a common approach for fast data access. For this kind of application, a synchronization scheme with Readers and Writers semantics is well suited. This paper presents the design of an implementation of MPI passive target synchronization with Readers and Writers semantics. The implementation is discussed for the Single-Chip Cloud Computer, a non-cache-coherent many-core CPU with shared memory.
- Journal article: Evaluating the Influence of Data Type Precision On Numerical Algorithms (PARS-Mitteilungen: Vol. 34, No. 1, 2017). Bromberger, Michael; Hoffmann, Markus; Hampp, Andreas. IEEE 32 or 64 bit floating-point arithmetic is often sufficient for different kinds of algorithms, including scientific applications. However, there is a growing body of applications in which significant computational errors during the calculation lead to incorrect results. Such applications range from numerical algorithms and probabilistic timing analysis to long-time simulations. While designing numerically stable algorithms or using interval arithmetic are possible solutions for certain problems, most scientific programmers are not familiar with such deep numerical analyses. In addition, not all issues can be solved using the above methods. High precision arithmetic, which is provided by software libraries or coprocessor designs, is a promising solution to overcome these numerical issues. Therefore, we investigate the influence of data type precision on a numerical algorithm, namely the Lanczos algorithm, and compare different high precision arithmetic software libraries regarding accuracy and execution time. Additionally, we examine the usage of an exact scalar product for the Lanczos algorithm. While we show that high precision arithmetic is crucial for numerical algorithms, such arithmetic is still far slower than hardware-supported data types.
- Journal issue: PARS-Mitteilungen 2017 (PARS-Mitteilungen: Vol. 34, No. 1, 2017)
- Journal article: LAIK: A Library for Fault Tolerant Distribution of Global Data for Parallel Applications (PARS-Mitteilungen: Vol. 34, No. 1, 2017). Weidendorfer, Josef; Yang, Dai; Trinitis, Carsten. HPC applications usually are not written in a way that lets them cope with dynamic changes in the execution environment, such as removing or integrating nodes or node components. For higher flexibility with regard to scheduling and fault tolerance strategies, however, an adequate application-integrated reaction would be worthwhile. With legacy MPI codes, this is difficult to achieve. In this paper, we present Lightweight Application-Integrated data distribution for parallel worKers (LAIK), a lightweight library for distributed index spaces and associated data containers for parallel programs supporting fault tolerance features. By giving LAIK control over data and its partitioning, the library can free compute nodes before they fail and do replication for rollback schemes on demand. Applications become more adaptive to changes of available resources. We show a simple example which integrates our LAIK library and present first results on a prototype implementation.
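
As a small illustration of the effect discussed in the entry on data type precision above, the following sketch sums the value 0.1 repeatedly with 32-bit and 64-bit accumulation and compares both against an exact rational reference. It is not taken from the article and does not involve the Lanczos algorithm or any of the libraries evaluated there. The helper names (`to_f32`, `sum_f32`, `sum_f64`, `summation_errors`) and the choice of test data are assumptions made for this example, and `math.fsum` serves only as a readily available stand-in for a correctly rounded accumulation.

```python
import math
import struct
from fractions import Fraction


def to_f32(x: float) -> float:
    """Round a 64-bit Python float to the nearest IEEE 754 32-bit value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sum_f32(values):
    """Sum left to right, rounding every intermediate result to 32 bits."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc


def sum_f64(values):
    """Sum left to right in ordinary 64-bit floating point."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def summation_errors(n: int = 100_000):
    """Absolute errors of summing 0.1 n times, relative to the exact value n/10."""
    values = [0.1] * n
    exact = Fraction(n, 10)  # exact rational reference, no rounding

    def error(result: float) -> float:
        # Fraction(result) converts the float exactly, so the difference is exact.
        return float(abs(Fraction(result) - exact))

    return {
        "float32": error(sum_f32(values)),
        "float64": error(sum_f64(values)),
        "fsum": error(math.fsum(values)),  # correctly rounded sum of the inputs
    }


if __name__ == "__main__":
    for name, err in summation_errors().items():
        print(f"{name:8s} absolute error: {err:.3e}")
```

Running the script prints one absolute error per accumulation strategy. The 32-bit accumulation drifts furthest from the reference, which mirrors, on a toy scale, why precision matters for long-running numerical computations.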