PATUS: A Code Generation and Auto-Tuning Framework for Stencil Kernels on Modern Microarchitectures
Matthias Christen
(Universität Basel)
PATUS is a code generation and auto-tuning framework for the class of stencil computations,
targeted at modern multi- and many-core processors such as multicore CPUs and graphics processing units.
The ultimate goals of the framework are productivity, portability (of both the code and its performance),
and high performance on the target platform.
The key ingredients for achieving productivity, portability, and performance are domain-specific
languages (DSLs) and the auto-tuning methodology.
The PATUS stencil specification DSL lets the programmer express a stencil computation concisely,
independently of hardware architecture-specific details. It thus increases programmer productivity by
relieving him or her of low-level programming-model issues and of manually applying hardware platform-specific
code optimization techniques. The use of domain-specific languages also implies code reusability: once
written, the same stencil specification can be reused on different hardware platforms, i.e., the
specification code is portable across hardware architectures. Gearing the language
towards a special purpose makes it amenable to more aggressive optimizations and therefore to
potentially higher performance.
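To convey what a stencil computation looks like (in plain Python as an illustration; this is not PATUS's actual DSL syntax), a five-point Jacobi stencil updates each interior grid point from itself and its four neighbors:

```python
def jacobi_step(u):
    """One sweep of a 2D five-point Jacobi stencil (illustrative
    sketch, not PATUS DSL syntax): every interior point becomes
    the average of itself and its four neighbors."""
    rows, cols = len(u), len(u[0])
    v = [row[:] for row in u]  # copy: the stencil reads u, writes v
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            v[i][j] = 0.2 * (u[i][j] + u[i - 1][j] + u[i + 1][j]
                             + u[i][j - 1] + u[i][j + 1])
    return v
```

A PATUS stencil specification captures exactly this per-point update rule, leaving loop structure, parallelization, and blocking to the code generator.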
Auto-tuning provides performance and performance portability by automatically adapting implementation-specific
parameters to the characteristics of the hardware on which the code will run. Automating parameter
tuning also makes the system more productive to use than manually fine-tuning the code.
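The core of the auto-tuning idea can be sketched in a few lines: benchmark each point of a parameter space and keep the fastest configuration. (A minimal, hypothetical sketch; real tuners such as PATUS's also use search heuristics rather than exhaustive enumeration.)

```python
import itertools
import timeit

def autotune(kernel, param_space, data):
    """Benchmark every parameter combination and return the fastest
    one -- a minimal sketch of exhaustive auto-tuning."""
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*param_space.values()):
        cfg = dict(zip(param_space.keys(), values))
        t = min(timeit.repeat(lambda: kernel(data, **cfg),
                              number=3, repeat=3))
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg

# Hypothetical tunable kernel: summation with a block-size parameter.
def sum_blocked(xs, block):
    total = 0.0
    for i in range(0, len(xs), block):
        total += sum(xs[i:i + block])
    return total

best = autotune(sum_blocked, {"block": [8, 64, 512]}, list(range(4096)))
```

In practice the tuned parameters are things like cache-block sizes, unrolling factors, or thread counts, and the winning configuration differs from machine to machine, which is precisely what makes the automation worthwhile.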
A Skeleton Library for Heterogeneous Multi-/Many-Core Systems
Michel Steuwer,
Philipp Kegel,
Sergei Gorlatch
(Universität Münster)
Modern parallel systems are becoming increasingly heterogeneous. Even desktop
PCs usually comprise GPUs (Graphics Processing Units) and multi-core
CPUs. Application programming for such heterogeneous parallel systems
is complex and error-prone, because no suitable high-level models exist.
Existing programming models either:
- target only a homogeneous subset of the available compute devices,
e.g., only multi-core CPUs or only GPUs, but not both,
- or target all available compute devices, but are intrinsically
low-level, like OpenCL.
Using multiple compute devices at the same time, e.g., multiple GPUs,
brings additional challenges such as work and data distribution,
for which state-of-the-art programming models
like OpenCL provide no special assistance.
In this talk, we present the SkelCL library, being developed
in Münster, which aims to replace these low-level approaches.
The use of high-level algorithmic skeletons greatly
simplifies programming for systems comprising multiple compute
devices. SkelCL is based on OpenCL and allows for ad-hoc
parallelism, as its skeletons can be mixed with low-level OpenCL code.
SkelCL also provides an abstract vector data type and a high-level data
(re)distribution mechanism to free the programmer from
organizing the low-level data transfers between a system's main memory
and multiple compute devices.
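The essence of an algorithmic skeleton is that the user supplies only the per-element computation while the skeleton hides parallel execution and work distribution. A minimal sketch in Python (purely illustrative; SkelCL's actual API is C++ and generates OpenCL kernels for the devices):

```python
from multiprocessing.dummy import Pool  # thread pool stands in for compute devices

class Map:
    """Minimal map skeleton: the caller provides only the per-element
    function; the skeleton hides how work is split across workers.
    (Illustrative sketch, not SkelCL's C++/OpenCL API.)"""
    def __init__(self, f):
        self.f = f

    def __call__(self, vec, workers=4):
        with Pool(workers) as pool:
            return pool.map(self.f, vec)

square = Map(lambda x: x * x)
result = square([1, 2, 3, 4])  # distribution across workers is transparent
```

In SkelCL the analogous abstraction additionally manages the transfers between host memory and device memories, which is what the vector data type and (re)distribution mechanism provide.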
We will describe some implementation details, and in particular
we will discuss the problem of wrapping OpenCL into skeletons,
since OpenCL uses runtime compilation.
Finally, we will give a brief overview of our dOpenCL project.
In this project, we develop a middleware that is used together
with SkelCL in order to extend it to distributed systems.
We use a real-world application study from the area of medical imaging
to demonstrate the reduced programming effort and competitive
performance of SkelCL compared to OpenCL and CUDA. In addition, we
illustrate how SkelCL adapts to large-scale, distributed heterogeneous
systems in order to simplify their programming.
Delite: A Framework For High Performance Embedded Domain-Specific Languages
HyoukJoong Lee,
Arvind Sujeeth,
Kevin Brown,
Kunle Olukotun (Stanford University)
Fully utilizing heterogeneous systems has been a challenging problem for application programmers, especially with an ever-increasing number of architecture-specific programming models (e.g., Pthreads, OpenCL, CUDA). Domain-specific languages (DSLs) are a potential solution to this problem, as they can provide productivity, performance, and portability within the confines of a specific domain. However, the cost of developing such DSLs needs to be lowered to make the DSL approach useful on a large scale. We implemented the Delite compiler framework, a reusable compiler infrastructure for the rapid development of performance-oriented DSLs. Using the concept of a multi-view intermediate representation (IR), the Delite framework provides static optimizations and code generation for heterogeneous hardware, so that DSL developers can easily implement DSL operations by extending the framework. We also implemented the Delite runtime, which automatically schedules and executes DSL operations on heterogeneous hardware. In this talk, we will present the internals of the framework and show how DSLs can extend and interoperate with one another. We will also explain how the framework and runtime efficiently target heterogeneous hardware, and walk through the process of implementing a new DSL as a DSL developer.
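The idea of a shared compiler infrastructure for DSLs can be illustrated with a drastically simplified sketch (hypothetical and in Python, whereas Delite itself is implemented in Scala): DSL operations build IR nodes, and generic passes provided by the framework optimize them before code generation, so each DSL only has to define its node types.

```python
# Miniature illustration of the shared-IR idea (not Delite's actual
# implementation): DSL ops construct IR nodes, and a framework-provided
# pass optimizes them independently of any particular DSL.
class Const:
    def __init__(self, v):
        self.v = v

class Add:
    def __init__(self, a, b):
        self.a, self.b = a, b

def fold(node):
    """Generic constant-folding pass, reusable by every DSL that
    builds its operations on these IR nodes."""
    if isinstance(node, Add):
        a, b = fold(node.a), fold(node.b)
        if isinstance(a, Const) and isinstance(b, Const):
            return Const(a.v + b.v)
        return Add(a, b)
    return node

expr = Add(Const(2), Add(Const(3), Const(4)))  # built by DSL front end
folded = fold(expr)                            # collapses to Const(9)
```

Delite's multi-view IR generalizes this: the same nodes can be viewed generically (for framework-wide optimizations such as fusion) or domain-specifically (for DSL-level rewrites), and a separate backend generates code for each target device.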
The ParaPhrase Project
Horacio Gonzalez-Velez (Robert Gordon University)
Chris Brown (University of St Andrews)
The ParaPhrase project aims to produce a new structured design and implementation process for heterogeneous parallel architectures, in which developers exploit a variety of parallel patterns to develop component-based applications that can be mapped to the available hardware resources and then dynamically re-mapped to meet application needs and hardware availability. Key features are sustainable parallel computing through enhanced programmability and lower power consumption, reduced cost of programming and implementing parallel systems, and better resource utilisation of heterogeneous parallel CPU/GPU architectures. This work will enable major progress in programming both current and future (parallel) computer systems. Using ParaPhrase technologies, we anticipate achieving significant parallel speedups for realistic applications, and that these results will scale with larger systems.
Skeletons and Autotuning at Edinburgh
Murray Cole,
Chris Fensch,
Alex Collins,
Fabricio Goes,
Zoe Leiper,
Thibaut Lutz,
Siddharth Mohanty
(University of Edinburgh)
We present an overview of recent and ongoing projects within the skeletons group at Edinburgh. These include work on transactional worklists, wavefronts, stencils, map-reduce, divide-and-conquer, and image-processing skeleton hierarchies.