A computer's instruction set may be conceptually partitioned into subsets, which may then be assigned to specialised hardware for execution. In the field of high-performance computing, many different partitionings have been tried. Early high-performance computers such as the Cray-1, as well as subsequent vector computers, implemented a functional partitioning that permitted memory access and arithmetic operations to be overlapped. More recently, the trend towards more explicit separation of scalar instruction sets has led to VLIW [12] and superscalar [4] architectures. Both address the need for higher instruction issue rates; however, since they are not specifically optimized for hiding memory latency, they can be highly sensitive to the locality characteristics of applications.
Early decoupled architectures, such as PIPE [1] and the Astronautics ZS-1 [2], introduced a controlled form of out-of-order execution by exploiting a functional partitioning between memory-access instructions and computational (typically floating-point) instructions. This partitioning enabled memory-access latencies to be hidden. The concept of decoupled memory access is now finding a place in high-performance microprocessors such as the MIPS R8000 [5].
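The latency-hiding effect of access decoupling can be illustrated with an idealized cycle-count model. All parameters below are assumed values for illustration only, not figures for any of the machines cited: the access processor slips ahead of the execute processor, streaming loads into an operand queue, so that after the first load's latency the remaining memory traffic overlaps with computation.

```python
def coupled_cycles(n, mem_latency, compute):
    """Coupled pipeline: each iteration stalls for its load and
    then computes, so memory latencies accumulate serially."""
    return n * (mem_latency + compute)

def decoupled_cycles(n, mem_latency, compute):
    """Access-decoupled pipeline: the access processor issues one
    load per cycle into a queue and the execute processor consumes
    operands as they arrive.  Only the first load's latency is
    exposed; thereafter loads and computation overlap."""
    issue_rate = 1  # loads issued per cycle by the access processor
    return mem_latency + n * max(issue_rate, compute)

# With an assumed 20-cycle memory and a 2-cycle operation over
# 1000 elements, decoupling hides almost all of the memory latency:
print(coupled_cycles(1000, 20, 2))    # 22000
print(decoupled_cycles(1000, 20, 2))  # 2020
```

In this simplified model the decoupled pipeline's runtime is dominated by whichever unit is the bottleneck, plus a single exposed latency, rather than by the sum of latency and computation on every iteration.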
The R8000, the ZS-1, PIPE and Wulf's WM architecture [3] are termed access decoupled. In such architectures, control transfers often require the processing units to synchronize; this occurs, for example, when a branch depends upon a comparison computed by the X-processor. A further architectural optimization, termed control decoupling, was introduced in the Advanced Computer Research Institute's ACRI-1 architecture [6]. In a control decoupled architecture there are three independent units, responsible respectively for control, memory access and execution. The additional benefit of control decoupling is that the majority of control-transfer decisions can be pre-computed, opening up many opportunities to prepare the access and execute pipelines for the operations which follow. Access and control decoupling are fully described elsewhere [6]; here we summarize the specific architectural model assumed by the compilation techniques described in this paper.
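The three-unit organization can be sketched as follows. This is a hypothetical model of our own, not the ACRI-1's design: the unit names, the queue discipline, and the use of simple trip counts as "control decisions" are all assumptions made for illustration. The control unit resolves branches ahead of time and dispatches block descriptors; the access unit expands each descriptor into a stream of operands; the execute unit consumes operands and never computes an address or a branch.

```python
from collections import deque

def control_unit(trip_counts):
    """Pre-compute control decisions: decide each branch (here,
    whether a loop body runs at all) and emit one descriptor per
    taken block into a dispatch queue."""
    dispatch_q = deque()
    for n in trip_counts:
        if n > 0:                  # branch resolved ahead of execution
            dispatch_q.append(n)   # descriptor: run the block n times
    return dispatch_q

def access_unit(dispatch_q, memory):
    """Expand descriptors into an operand stream, issuing loads
    well ahead of the execute unit."""
    operand_q = deque()
    addr = 0
    while dispatch_q:
        n = dispatch_q.popleft()
        for _ in range(n):
            operand_q.append(memory[addr % len(memory)])
            addr += 1
    return operand_q

def execute_unit(operand_q):
    """Consume queued operands; performs arithmetic only."""
    total = 0.0
    while operand_q:
        total += operand_q.popleft()
    return total

memory = [1.0] * 8
result = execute_unit(access_unit(control_unit([3, 0, 5]), memory))
print(result)  # 8.0
```

Note that the units interact only through the queues, which is what allows the control unit to run arbitrarily far ahead of the other two in this model.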
Given this high degree of decoupling, the extent to which real-world applications can exploit it is of prime importance. In common with many recent architectural innovations, the performance of the architecture is greatly influenced by the capabilities of the compiler and by the structure of the source code presented to it. In that sense, the analysis in this paper should be seen as a combined evaluation of control decoupled architectures and of the compilation techniques developed for them. Our analysis consists of compiler-driven measurements and profile-driven modelling of the Perfect Club suite of scientific programs [9].
This paper addresses the performance of a single processor in a control decoupled architecture; further information on the scalability implications of decoupling can be found in [10].