
Overview of the ACRI-1 implementation

The ACRI-1 architecture is a high-performance implementation of a control- and access-decoupled architecture. Each DU has two independent floating-point units designed for a target clock period of 6 ns. Each processor therefore has a design peak floating-point throughput of 333 MFLOPS (64-bit precision), corresponding to two results per 6 ns cycle. A node may contain up to six processors, for a peak floating-point throughput of 2 GFLOPS per node. This paper addresses the performance of a single processor in this multiprocessor architecture; further information on the scalability implications of decoupling can be found in [9].
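The peak figures quoted above follow from simple arithmetic, which the following minimal Python sketch reproduces. It assumes that each floating-point unit completes one 64-bit operation per clock cycle, which the text implies but does not state outright; the function and constant names are illustrative only.

```python
# Sketch of the peak-throughput arithmetic. Assumption (implied, not stated
# in the text): each floating-point unit completes one 64-bit operation
# per clock cycle.

CLOCK_PERIOD_NS = 6.0          # target clock period from the text
FP_UNITS_PER_DU = 2            # independent floating-point units per DU
MAX_PROCESSORS_PER_NODE = 6    # upper bound on processors per node


def peak_mflops(clock_period_ns: float, fp_units: int) -> float:
    """Peak MFLOPS of one processor, one result per unit per cycle."""
    clock_mhz = 1000.0 / clock_period_ns   # 1 / (6 ns) is about 166.7 MHz
    return fp_units * clock_mhz


processor_peak = peak_mflops(CLOCK_PERIOD_NS, FP_UNITS_PER_DU)
node_peak_gflops = processor_peak * MAX_PROCESSORS_PER_NODE / 1000.0

print(f"per-processor peak: {processor_peak:.0f} MFLOPS")
print(f"per-node peak:      {node_peak_gflops:.1f} GFLOPS")
```

Under that assumption the sketch gives roughly 333 MFLOPS per processor and 2 GFLOPS for a fully populated node, matching the design figures in the text.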

The ACRI-1 memory system comprises up to eight boards, each containing up to eight segments. Each segment may contain up to 16 independently addressable banks of DRAM. The memory boards are connected to the processors and I/O subsystems via a two-stage parallel network. Both the network and the memory boards contain request and response queues, so the round-trip latency of any particular request depends to some degree on the memory loading. Register-transfer simulations of the network and memory subsystems have shown that latencies will be in the range 100 to 200 processor cycles, and will vary dynamically during program execution. In the execution time models presented in section 4.2, we use a nominal value of 150 cycles for the mean cost of uncoupling the DU from the CU and the AU. This uncoupling time is dominated by the round-trip delay of the memory system.
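To put these figures in concrete terms, the sketch below computes the maximum number of independently addressable banks and converts the quoted latencies from processor cycles to nanoseconds using the 6 ns target clock period. The last quantity, the number of peak-rate floating-point operations that fit into one nominal uncoupling, is an illustrative derived figure rather than one given in the text, and it assumes one result per floating-point unit per cycle. All names are hypothetical.

```python
# Sketch of the memory-system arithmetic. All inputs are taken from the
# text; the "lost operations" figure is an illustrative derivation that
# assumes one result per floating-point unit per cycle.

MAX_BOARDS = 8
MAX_SEGMENTS_PER_BOARD = 8
MAX_BANKS_PER_SEGMENT = 16
CLOCK_PERIOD_NS = 6.0          # target processor clock period
FP_UNITS_PER_DU = 2

LATENCY_MIN_CYCLES = 100       # simulated round-trip range
LATENCY_MAX_CYCLES = 200
UNCOUPLING_CYCLES = 150        # nominal mean cost used in the models


def cycles_to_ns(cycles: int, clock_period_ns: float = CLOCK_PERIOD_NS) -> float:
    """Convert a latency in processor cycles to nanoseconds."""
    return cycles * clock_period_ns


total_banks = MAX_BOARDS * MAX_SEGMENTS_PER_BOARD * MAX_BANKS_PER_SEGMENT
latency_ns = {
    c: cycles_to_ns(c)
    for c in (LATENCY_MIN_CYCLES, UNCOUPLING_CYCLES, LATENCY_MAX_CYCLES)
}
# Peak-rate operations that could have issued during one nominal uncoupling.
ops_per_uncoupling = UNCOUPLING_CYCLES * FP_UNITS_PER_DU

print(f"maximum DRAM banks: {total_banks}")
for cycles, ns in latency_ns.items():
    print(f"{cycles} cycles = {ns:.0f} ns")
print(f"peak-rate operations per nominal uncoupling: {ops_per_uncoupling}")
```

A fully populated system thus has 1024 banks, and the nominal 150-cycle uncoupling cost corresponds to 900 ns at the target clock period.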

The ACRI-1 processor contains a cache which is used primarily for communication between the units, and as a level-2 instruction cache. The vast majority of memory operands are obtained directly from memory. Further information on the behaviour of caches in decoupled systems can be found in [10].

ships@dcs.ed.ac.uk
Wed Mar 1 16:43:22 GMT 1995