
1. Introduction

The principal goal of a decoupled architecture is to hide the latency of certain operations, usually memory accesses. This is achieved by creating a macro pipeline of variable depth, with some degree of asynchrony between the producer and the consumer. In a decoupled access/execute architecture, such as the ZS-1 [2], the producer is an Address Unit and the consumer is an Execute Unit. Such a system can hide all memory access latency once the memory pipeline has been suitably primed. To keep a decoupled access/execute pipeline full, the Address Unit computations must not depend on the Execute Unit computations. More complex forms of decoupling, with deeper pipelines and more units, have been proposed [6] as a means of increasing performance and overall efficiency. However, as the number of units and the asynchrony between them increase, the task of compiling for a decoupled architecture becomes more complex. The compilation techniques described in this paper were originally developed for the ACRI supercomputer; we nevertheless attempt to place them in a general context wherever possible.
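The latency-hiding argument above can be illustrated with a minimal sketch. It is not the performance model developed later in this paper, and it does not describe the ZS-1 or the ACRI machine. It assumes a hypothetical Address Unit that issues at most one load per cycle, a fixed memory latency, a FIFO of bounded depth between the two units, and an Execute Unit that consumes one operand per cycle. The function name and all parameters are illustrative assumptions.

```python
def last_consume_cycle(n_loads, mem_latency, queue_depth, au_depends_on_eu=False):
    """Cycle at which the Execute Unit (EU) consumes the final operand.

    Hypothetical toy model (not taken from the paper), assuming n_loads >= 1:
      - the Address Unit (AU) issues at most one load per cycle;
      - each load arrives mem_latency cycles after it is issued;
      - a load reserves one of queue_depth FIFO slots when it is issued,
        and the slot is freed when the EU consumes the operand;
      - the EU consumes at most one operand per cycle, in order;
      - if au_depends_on_eu is True, each address depends on the result of
        the previous EU operation, so the AU cannot run ahead.
    """
    issue = []    # cycle at which the AU issues load i
    consume = []  # cycle at which the EU consumes operand i
    for i in range(n_loads):
        # Earliest issue slot: one cycle after the previous load.
        t = issue[i - 1] + 1 if i > 0 else 0
        # Wait for a free FIFO slot if the queue is full.
        if i >= queue_depth:
            t = max(t, consume[i - queue_depth] + 1)
        # Loss of decoupling: the address needs the previous EU result.
        if au_depends_on_eu and i > 0:
            t = max(t, consume[i - 1] + 1)
        issue.append(t)
        arrive = t + mem_latency
        # The EU consumes in order, one operand per cycle.
        c = max(arrive, consume[i - 1] + 1) if i > 0 else arrive
        consume.append(c)
    return consume[-1]


# Memory latency is paid once when the pipeline is primed, and again on
# every load when the AU depends on the EU.
primed = last_consume_cycle(100, mem_latency=20, queue_depth=32)
coupled = last_consume_cycle(100, mem_latency=20, queue_depth=32,
                             au_depends_on_eu=True)
```

Under these assumptions, a sufficiently deep queue lets the Address Unit run ahead so that the latency is paid only once. When each address depends on the previous Execute Unit result, or when the queue is too shallow to cover the latency, the latency is exposed repeatedly.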

Previous performance studies of decoupled architectures have considered relatively small pieces of code, either hand-translated or compiled using naive compilers. In this paper we present results from the compilation of supercomputer benchmark programs for decoupled architectures, using an experimental compiler developed within a collaborative project between university and industry. The compiler performs standard scalar optimizations, decouples the code for an abstract decoupled architecture, and performs modulo scheduling [15].

We assume a target architecture similar in many respects to the architecture formerly under development at the Advanced Computer Research Institute and previously described in [6]. We present a simple model of performance for this class of decoupled architectures, and use it to differentiate the types of constructs in source programs that lead to performance degradation on decoupled architectures. We also use the performance model to demonstrate some techniques that compilers may employ to reduce the impact of these particular program constructs.

The experimental compiler uses HTML [8] to visualize synchronization events and their dependences by creating an annotated version of the original source code. HTML output, obtained directly from the experimental compiler, is used to illustrate program constructs which lead to performance degradation. We propose idiomatic transformations to improve decoupling.

1.1 Overview of decoupled architecture


npt@dcs.ed.ac.uk