
Losses of Decoupling (LODs)

In an access and control decoupled architecture, the CU ``runs ahead'' of the AU and DU, and the AU ``runs ahead'' of the DU. It is conceivable for the CU and the DU to be separated in space and time by many thousands of program statements. When the system is fully decoupled, the AU will typically be ahead of the DU by an amount equal to the largest memory latency experienced by any load operation since the last recoupling of the AU and DU. The CU, AU and DU therefore constitute an elastic pipeline, with the dispatch queue linking the CU to the AU and DU, and the memory system linking the AU to the DU.
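The relationship between load latency and AU-DU separation described above can be illustrated with a minimal sketch. It is not taken from the ACRI-1 simulator: the event names ("load", "recouple"), the function name au_du_slip and the latency values are assumptions made purely for illustration. The sketch tracks the separation as the largest load latency seen since the last recoupling.

```python
def au_du_slip(events):
    """Return the AU-DU separation (in cycles) after each event.

    `events` is a sequence of (kind, value) pairs, where kind is either
    "load" (value = memory latency of that load, in cycles) or
    "recouple" (value ignored). Both names are hypothetical.
    """
    slip = 0          # separation in cycles since the last recoupling
    history = []
    for kind, value in events:
        if kind == "recouple":
            # The AU and DU are synchronized again; the lead is lost.
            slip = 0
        elif kind == "load":
            # The lead grows to the largest latency seen so far.
            slip = max(slip, value)
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
        history.append(slip)
    return history


# Illustrative trace with made-up latencies.
trace = [("load", 20), ("load", 80), ("load", 30),
         ("recouple", None), ("load", 40)]
slips = au_du_slip(trace)
```

In this trace the separation rises to 80 cycles, stays there when a shorter load follows, drops to zero at the recoupling, and begins to grow again with the next load.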

In figure 1, from [5], the solid arrows show the typical direction of flow of information during decoupled execution. The broken arrows represent inter-unit dependencies which cause decoupling to temporarily break down, requiring the CU or AU to wait for operations to complete on a unit which is normally downstream in the system pipeline.

These dependencies can be carried either through registers or memory locations. For example, a value in a DU register which the compiler knows is needed by the CU will cause a register-based flow dependence from the DU to the CU. This will be satisfied by sending the value from the DU to the CU via a transfer queue. The CU must wait for this value, and this then defines a synchronization point between DU and CU. This effectively flushes the memory pipeline and decoupling is lost. We term such a synchronization point a loss of decoupling (LOD).

When the DU defines a location in memory and there is a reaching use of that location on the CU, then a read-after-write hazard exists between the CU and the DU through that location. The compiler detects all such hazards and inserts explicit synchronization operators at appropriate points in the code to force the CU to wait for the hazard to be resolved. This results in the DU and the CU becoming synchronized, so this is also a form of LOD. Both of the above types of LOD are termed algorithmic LODs, as their presence is a direct result of the structure of the program. In the tables which follow, such LODs are labelled A-LODs.

The calling standard of the ACRI-1 defines a common shared stack for the CU, the AU and the DU. This has implications for data synchronization across function call and return boundaries. When a function is called while the CU and DU are decoupled, the CU will reach the stack frame manipulation code before the DU finishes using the current stack frame. To prevent stack frame corruption, the CU must wait for the DU to reach the call point before the stack frame is modified. Again, this enforces an LOD. In the tables which follow, we term this a ``call LOD'' (or C-LOD). A similar problem may occur if the units are decoupled at a function return point. Here the CU would like to relinquish a stack frame, but the DU may not yet have finished using some of the local variables declared in that frame. The CU must therefore wait for the DU to reach the return point, and then perform the return sequence. This is a ``return LOD'' (or R-LOD).

Calls to external routines, such as I/O and timer calls, must also be synchronized. However, by appropriate engineering it is possible to avoid LODs across the call boundaries with intrinsic functions (information on the use and modification of parameters is more clearly specified, and intrinsics do not modify global variables). In this analysis, LODs due to calls to external routines are termed ``eXternal LODs'' (or X-LODs), and intrinsic functions do not lead to X-LODs.
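The four LOD categories defined so far can be summarized as a small lookup, sketched below. The cause names and the functions classify and tally are hypothetical and are not part of the ACRI-1 toolchain; each cause is assumed to describe an event reached while the units are decoupled.

```python
from collections import Counter
from enum import Enum


class LOD(Enum):
    A = "algorithmic"   # register or memory dependence from DU to CU
    C = "call"          # CU waits for DU at a function call point
    R = "return"        # CU waits for DU at a function return point
    X = "external"      # call to an external routine (I/O, timers)


# Hypothetical cause names mapped to the categories used in the text.
# Intrinsic calls map to None because they do not cause an LOD here.
_CAUSE_TO_LOD = {
    "du_to_cu_register_flow": LOD.A,
    "du_to_cu_memory_hazard": LOD.A,
    "function_call": LOD.C,
    "function_return": LOD.R,
    "external_call": LOD.X,
    "intrinsic_call": None,
}


def classify(cause):
    """Return the LOD category for a cause, or None if it causes no LOD."""
    if cause not in _CAUSE_TO_LOD:
        raise ValueError(f"unknown cause: {cause!r}")
    return _CAUSE_TO_LOD[cause]


def tally(causes):
    """Count LODs per category, skipping causes that produce no LOD."""
    counts = Counter()
    for cause in causes:
        kind = classify(cause)
        if kind is not None:
            counts[kind] += 1
    return counts


# Illustrative sequence of causes, not measured data.
counts = tally([
    "du_to_cu_register_flow", "function_call", "intrinsic_call",
    "du_to_cu_memory_hazard", "function_return", "external_call",
])
```

Both register-based and memory-based dependences fall into the single A-LOD category, so the illustrative sequence yields two A-LODs and one each of the C, R and X kinds.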

To summarize, when a program is executing in a decoupled mode the perceived memory latency is zero; effectively the entire physical memory has an access time equivalent to the processor's register file, and latency is completely hidden. However, when an LOD occurs, the penalty is significant. Our model assumes a nominal penalty of 150 cycles, and this means that the frequency of LODs is of paramount importance when determining the expected application performance. We now present measurements of the LOD frequencies in the Perfect Club, and in section 4.2 we use these frequencies to predict bounds on the execution time of the Perfect Club programs on the ACRI-1 architecture.
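To see why LOD frequency dominates, consider a simple additive cost sketch: every LOD adds the nominal 150-cycle penalty to the cycles spent in fully decoupled execution. This is an illustrative assumption and not necessarily the bounding model of section 4.2; the function name and all counts below are made up.

```python
LOD_PENALTY_CYCLES = 150  # nominal penalty per LOD, as stated in the text


def estimated_cycles(decoupled_cycles, lod_counts,
                     penalty=LOD_PENALTY_CYCLES):
    """Estimate total cycles under an assumed additive model:
    decoupled execution time plus a fixed penalty for every LOD.

    `lod_counts` maps an LOD label ("A", "C", "R", "X") to its count.
    """
    total_lods = sum(lod_counts.values())
    return decoupled_cycles + total_lods * penalty


# Hypothetical figures for illustration only (not Perfect Club data).
example_counts = {"A": 1000, "C": 200, "R": 200, "X": 50}
total = estimated_cycles(1_000_000, example_counts)
overhead = total - 1_000_000
```

With these made-up figures, 1450 LODs add 217,500 cycles to a 1,000,000-cycle decoupled run, so even a modest LOD count contributes a substantial share of the estimated execution time.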


ships@dcs.ed.ac.uk
Wed Mar 1 16:43:22 GMT 1995