Compiling and Optimizing for Decoupled Architectures

Authors:
Nigel Topham: Department of Computer Science
The University of Edinburgh
Mayfield Road
Edinburgh, UK
e-mail: npt@dcs.ed.ac.uk
Alasdair Rawsthorne: Department of Computer Science
The University of Manchester
Oxford Road
Manchester, UK
e-mail: alasdair@cs.man.ac.uk
Callum McLean: Department of Computer Science
The University of Edinburgh
Mayfield Road
Edinburgh, UK
e-mail: callum@quadstone.co.uk
Muriel Mewissen: Department of Computer Science
The University of Edinburgh
Mayfield Road
Edinburgh, UK
e-mail: mu@quadstone.co.uk
Peter Bird: e-mail: plbird@aol.com

Keywords: Decoupled architecture, Compiling, Performance, Benchmarks, Optimization, Quantitative analysis.

Abstract

Decoupled architectures provide a key to the problem of sustained supercomputer performance through their ability to hide large memory latencies. When a program executes in decoupled mode, the memory latency perceived at the processor is zero: effectively, the entire physical memory has an access time equivalent to that of the processor's register file. However, the asynchronous functional units within a decoupled architecture must occasionally synchronize, and each synchronization incurs a high penalty. The goal of compiling and optimizing for decoupled architectures is to partition the program between the asynchronous functional units so that latencies are hidden and synchronization events occur infrequently. This paper describes a model for decoupled compilation and examines how effectively programs can be compiled for decoupled systems. A number of new compiler optimizations are introduced and evaluated quantitatively using the Perfect Club scientific benchmarks. We show that with a suitable repertoire of optimizations, it is possible to hide large latencies most of the time for most of the programs in the Perfect Club.
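The trade-off between latency hiding and synchronization cost can be made concrete with a toy cycle-count model. The sketch below is a simplified illustration, not the decoupled machine model studied in this paper. It assumes a hypothetical access unit (AU) that issues one load per cycle into an unbounded queue, and an execute unit (EU) that consumes one operand per cycle in program order. It also assumes that at each synchronization point the AU must wait until the EU has consumed every earlier operand. The function names `decoupled_cycles` and `coupled_cycles` and the parameter `sync_points` are illustrative only.

```python
def decoupled_cycles(n_ops: int, latency: int, sync_points: set[int]) -> int:
    """Cycle count for a hypothetical two-unit decoupled machine.

    Toy assumptions (not the paper's machine model):
    * the access unit (AU) issues at most one load per cycle;
    * a load issued at cycle c delivers its operand to the execute unit (EU)
      at cycle c + latency, through an unbounded queue;
    * the EU consumes one operand per cycle, in program order;
    * at each index in sync_points, the AU may not issue until the EU has
      consumed every earlier operand (a synchronization event).
    """
    au_last_issue = -1  # cycle of the most recent AU issue
    eu_last_exec = -1   # cycle in which the EU consumed the previous operand
    for i in range(n_ops):
        issue = au_last_issue + 1
        if i in sync_points:
            # Synchronization: the AU stalls until the EU has drained its queue.
            issue = max(issue, eu_last_exec + 1)
        ready = issue + latency  # operand arrives at the EU queue
        eu_last_exec = max(eu_last_exec + 1, ready)
        au_last_issue = issue
    return eu_last_exec + 1


def coupled_cycles(n_ops: int, latency: int) -> int:
    """Toy blocking machine: each operand waits the full latency, then takes 1 cycle."""
    return n_ops * (latency + 1)


if __name__ == "__main__":
    print(decoupled_cycles(1000, 100, set()))            # fully decoupled
    print(decoupled_cycles(1000, 100, {250, 500, 750}))  # three synchronizations
    print(coupled_cycles(1000, 100))                     # no latency hiding
```

In this toy model the EU never stalls once the queue has filled, so the total is n + L(1 + s) cycles for n operations, latency L and s synchronization points after the first operation. The initial L is a one-time fill cost, and each synchronization re-exposes one full memory latency. That is why the compiler should aim to make such events rare.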