- ZeitschriftenartikelReduktion von False-Sharing in Software-Transactional-Memory(PARS: Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware: Vol. 30, No. 1, 2013) Kempf, Stefan; Veldema, Ronald; Philippsen, MichaelSoftware-Transactional-Memory (STM) erleichtert das parallele Programmieren, jedoch hat STM noch einen zu hohen Laufzeitaufwand, da gegenseitiger Ausschluss beim Zugriff auf gemeinsame Daten meist mittels einer Lock-Tabelle fester Gr¨ oße realisiert wird. F¨ ur Programme mit wenigen konkurrierenden Zugriffen und ¨ berwiegend Lesezugriffen ist diese Tabelle gr¨ u oßer als notwendig, so dass beim Commit einer Transaktion mehr Locks zur Konsistenzpr¨ ufung zu inspizieren sind als n¨ otig. F¨ ur große Datenmengen ist die Tabelle zu klein. Dann begrenzt False-Sharing (unterschiedliche Adressen werden auf das gleiche Lock abgebildet) die Parallelit¨ at, da sogar unabh¨ angige Transaktionen sich gegenseitig ausschließen. Diese Arbeit beschreibt eine Technik, die die Lock-Tabelle bei False-Sharing vergr¨ oßert. Zus¨ atzlich kann ein Programmierer mit Annotationen unterschiedliche Lock-Tabellen f¨ ur voneinander unabh¨ angige Daten verlangen, was die M¨ oglichkeit von False-Sharing und den Speicherbedarf f¨ ur die Locks weiter verringert In Benchmarks erreichen wir einen maximalen Speedup von 10.3 gegen¨ uber TL2, wobei die Lock-Tabelle bis zu 1024 mal kleiner ist.
- ZeitschriftenartikelAcceleration of Optical Flow Computations on Tightly-Coupled Processor Arrays(PARS: Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware: Vol. 30, No. 1, 2013) Sousa, Éricles; Tanase, Alexandru; Lari, Vahid; Hannig, Frank; Teich, Jürgen; Paul, Johny; Stechele, Walter; Kröhnert, Manfred; Asfour, TaminOptical flow is widely used in many applications of portable mobile devices and automotive embedded systems for the determination of motion of objects in a visual scene. Also in robotics, it is used for motion detection, object segmentation, time-to-contact information, focus of expansion calculations, robot navigation, and automatic parking for vehicles. Similar to many other image processing algorithms, optical flow processes pixel operations repeatedly over whole image frames. Thus, it provides a high degree of fine-grained parallelism which can be efficiently exploited on massively parallel processor arrays. In this context, we propose to accelerate the computation of complex motion estimation vectors on programmable tightly-coupled processor arrays, which offer a high flexibility enabled by coarse-grained reconfiguration capabilities. Novel is also that the degree of parallelism may be adapted to the number of processors that are available to the application. Finally, we present an implementation that is 18 times faster when compared to (a) an FPGA-based soft processor implementation, and (b) may be adapted regarding different QoS requirements, hence, being more flexible than a dedicated hardware implementation.
- ZeitschriftenartikelSelf-organizing Core Allocation(PARS: Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware: Vol. 30, No. 1, 2013) Ziermann, Tobias; Wildermann, Stefan; Teich, JürgenThis paper deals with the problem of dynamic allocation of cores to parallel applications on homogeneous many-core systems such as, for example, MultiProcessor System-on-Chips (MPSoCs). For a given number of thread-parallel applications, the goal is to find a core assignment that maximizes the average speedup. However, the difficulty is that some applications may have a higher speedup variation than others when assigned additional cores. This paper first presents a centralized algorithm to calculate an optimal assignment for the above objective. However, as the number of cores and the dynamics of applications will significantly increase in the future, decentralized concepts are necessary to scale with this development. Therefore, a decentralized (self-organizing) algorithm is developed in order to minimize the amount of global information that has to be exchanged between applications. The experimental results show that this approach can reach the optimal result of the centralized version in average by 98.95%.
- ZeitschriftenartikelComparison of PGAS Languages on a Linked Cell Algorithm(PARS: Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware: Vol. 30, No. 1, 2013) Bauer, Martin; Kuschel, Christian; Ritter, Daniel; Sembritzki, KlausThe intention of partitioned global address space (PGAS) languages is to decrease developing time of parallel programs by abstracting the view on the memory and communication. Despite the abstraction a decent speed-up is promised. In this paper the performance and implementation time of Co-Array Fortran (CAF), Unified Parallel C (UPC) and Cascade High Productivity Language (Chapel) are compared by means of a linked cell algorithm. An MPI parallel reference implementation in C is ported to CAF, Chapel and UPC, respectively, and is optimized with respect to the available features of the corresponding language. Our tests show parallel programs are developed faster with the above mentioned PGAS languages as compared to MPI. We experienced a performance penalty for the PGAS versions that can be reduced at the expense of a similar programming effort as for MPI. Programmers should be aware that the utilization of PGAS languages may lead to higher administrative effort for compiling and executing programs on different super-computers.
- ZeitschriftenartikelReal-time Simulation of Human Vision using Temporal Compositing with CUDA on the GPU(PARS: Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware: Vol. 30, No. 1, 2013) Nießner, Matthias; Kuhnert, Nadine; Selgrad, Kai; Stamminger, Marc; Michelson, GeorgWe present a novel approach that simulates human vision including visual defects such as glaucoma by temporal composition of human vision in real-time on the GPU. Therefore, we determine eye focus points every time step and adapt the lens accommodation of our virtual eye model accordingly. The focal distance is then used to determine bluriness of observed scene regions; i.e., we compute defocus for all visible pixels. In order to simulate the visual memory we introduce a sharpness field where we integrate defocus values temporally. That allows for memorizing sharply perceived scene points. For visualization, we ray trace the virtual scene environment while incorporating depth of field based on the sharpness field data. Thus, our algorithm facilitates the simulation of human vision mimicing the visual memory. We consider this to be particularly useful for illustration purposes for patients with visual defects such as glaucoma. In order to run our algorithm in real-time we employ massively parallel graphics hardware.
- ZeitschriftenartikelDynamic Low-Latency Distributed Event Processing of Sensor Data Streams(PARS: Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware: Vol. 30, No. 1, 2013) Mutschler, Christopher; Philippsen, MichaelEvent-based systems (EBS) are used to detect meaningful events with low latency in surveillance, sports, finances, etc. However, with rising data and event rates and with correlations among these events, processing can no longer be sequential but it needs to be distributed. However, naively distributing existing approaches not only cause failures as their order-less processing of events cannot deal with the ubiquity of out-of-order event arrival. It is also hard to achieve a minimal detection latency. This paper illustrates the combination of our building blocks towards a scalable publish/subscribe-based EBS that analyzes high data rate sensor streams with low latency: a parameter calibration to put out-of-order events in order without a-priori knowledge on event delays, a runtime migration of event detectors across system resources, and an online optimization algorithm that uses migration to improve performance. We evaluate our EBS and its building blocks on position data streams from a Realtime Locating System in a sports application.
- ZeitschriftenartikelEnergy-Efficient Static Scheduling of Streaming Task Collections with Malleable Tasks(PARS: Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware: Vol. 30, No. 1, 2013) Kessler, Christoph; Eitschberger, Patrick; Keller, JörgWe investigate the energy-efficiency of streaming task collections with parallelizable or malleable tasks on a manycore processor with frequency scaling. Streaming task collections differ from classical task sets in that all tasks are running concurrently, so that cores typically run several tasks that are scheduled round-robin on user level. A stream of data flows through the tasks and intermediate results are forwarded to other tasks like in a pipelined task graph. We first show the equivalence of task mapping for streaming task collections and normal task collections in the case of continuous frequency scaling, under reasonable assumptions for the user-level scheduler, if a makespan, i.e. a throughput requirement of the streaming application, is given and the energy consumed is to be minimized. We then show that in the case of discrete frequency scaling, it might be necessary for processors to switch frequencies, and that idle times still can occur, in contrast to continuous frequency scaling. We formulate the mapping of (streaming) task collections on a manycore processor with discrete frequency levels as an integer linear program. Finally, we propose two heuristics to reduce energy consumption compared to the previous results by improved load balancing through the parallel execution of a parallelizable task. We evaluate the effects of the heuristics analytically and experimentally on the Intel SCC.
- ZeitschriftenartikelFast Evolutionary Algorithms: Comparing High Performance Capabilities of CPUs and GPUs(PARS: Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware: Vol. 30, No. 1, 2013) Hofmann, Johannes; Fey, DietmarWe use Evolutionary Algorithms (EAs) to evaluate different aspects of high performance computing on CPUs and GPUs. EAs have the distinct property of being made up of parts that behave rather differently from each other, and display different requirements for the underlying hardware as well as software. We can use these motives to answer crucial questions for each platform: How do we make best use of the hardware using manual optimization? Which platform offers the better software libraries to perform standard operations such as sorting? Which platform has the higher net floating-point performance and bandwidth? We draw the conclusion that GPUs are able to outperform CPUs in all categories; thus, considering time-to-solution, EAs should be run on GPUs whenever possible.
- ZeitschriftenartikelA highly configurable and efficient simulator for job schedulers on supercomputers(PARS: Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware: Vol. 30, No. 1, 2013) Karbach, CarstenThe supercomputers maintained by the J¨ ulich Supercomputing Centre (JSC) are run in batch mode and are shared among all active users. Jobs are submitted to the job scheduler, which determines for each job the time of execution. It manages running and waiting jobs, sorts them by a given priority and executes the jobs by fitting them into the current system state. For users it is often difficult to comprehend the job scheduler's algorithm. Thus, a prediction of the future system allocation of currently running and queued jobs is valuable. It helps users to plan job submissions and supports administrators by optimising the system load. This paper summarises the design and implementation of a configurable simulation for various job schedulers. The developed simulation program called JuFo focuses on the job schedulers Loadleveler and Moab as these are the most important batch systems for the supercomputers provided by the JSC. A major design goal is to keep the simulation independent from special job schedulers, which is achieved by the generic configuration of JuFo. Finally, a test framework is developed, which allows for evaluating the accuracy of the generated schedules.
- ZeitschriftenartikelParallel Prefiltering for Accelerating HHblits on the Convey HC-1(PARS: Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware: Vol. 30, No. 1, 2013) Bromberger, Michael; Nowak, FabianHHblits is a bioinformatics application for finding proteins with common ancestors. To achieve more sensitivity, the protein sequences of the query are not compared directly against the database protein sequences, but rather their Hidden Markov Models are compared. Thus, HHblits is very time-consuming and therefore needs to be accelerated. A multi-FPGA system such as the Convey HC-1 is a promising candidate to achieve acceleration. We present the design and implementation of a parallel coprocessor on the Convey HC-1 to accelerate HHblits after analyzing the application toward acceleration candidates. We achieve a speedup of 117.5× against a sequential implementation for FPGA-suitable data sizes per kernel and negligible speedup for the entire uniprot20 protein database against an optimized SSE implementation.