
2.4 Causes of loss of decoupling

  In this section we examine the dominant causes of Loss of Decoupling events. We illustrate each cause with a short fragment of source code, and describe the compilation strategy and flow of data which result in an LOD.

Indirect accesses

 

Arc (a) in figure 1 represents the AU performing a memory reference for itself before initiating a memory request for the DU. This occurs during all indirect references, causing a break in the decoupled execution pipeline. A short, synthetic example of this type of construct is shown below. In practice this type of LOD can have a significant impact on some programs; a classical example is SPICE.


          do I = 1, N
            SUM = SUM + A(IX(I))
          end do

The straightforward compilation of a loop containing an indirect reference would cause a Loss of Decoupling penalty on each iteration of the loop. Indirection is an infrequent aspect of scientific codes in general, but is heavily used in sparse matrix codes. Consequently, a compilation technique to minimize this impact is desirable.
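One plausible transformation is sketched below. This is an illustration only, not necessarily the technique adopted by the experimental compiler: the loop is distributed so that all of the indirection is confined to a first loop, which gathers the indexed elements into a hypothetical compiler-generated temporary T. The second loop then streams a dense vector and decouples fully.

C     A sketch of one possible transformation; T is a hypothetical
C     compiler temporary, not part of the original program.
C     All indirect (gather) accesses are grouped in the first loop.
          do I = 1, N
            T(I) = A(IX(I))
          end do
C     The second loop is a dense, fully decoupled streaming reduction.
          do I = 1, N
            SUM = SUM + T(I)
          end do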

Computed array indices

 

Arc (b) in figure 1 shows another case in which the principal pipelined computation optimized by the decoupled architecture is compromised by a data dependency. In this case, data computed in the DU is transferred to the AU to form addresses for future operands. Whilst this type of cross-unit computation simply does not occur in the vast majority of scientific codes [6], it is seen in certain Particle-in-Cell codes. These programs simulate the effect of particles within a field, which is divided into a number of cells. The field properties are assumed to be homogeneous throughout an individual cell, but may vary between adjacent cells. As the particles move during the lifetime of the simulation, their new absolute positions are calculated from generation to generation, and each position is quantized to determine within which cell boundaries it lies, in order to select the appropriate field properties for the next generation of the simulation.

An example of this type of loop is contained in the Livermore Loops collection of kernel benchmarks: Loop 13, the 2-D PIC (Particle-In-Cell) kernel, is shown below.


C*******************************************************************************
C***  KERNEL 13     2-D PIC   Particle In Cell
C*******************************************************************************
C
       fw = 1.000d0
C
 1013  do 13 k = 1, n
         i1 = p(1, k)
         j1 = p(2, k)
         i1 = 1 + mod2n(i1, 64)
         j1 = 1 + mod2n(j1, 64)
         p(3, k) = p(3, k) + b(i1, j1)
         p(4, k) = p(4, k) + c(i1, j1)
         p(1, k) = p(1, k) + p(3, k)
         p(2, k) = p(2, k) + p(4, k)
         i2 = p(1, k)
         j2 = p(2, k)
         i2 = mod2n(i2, 64)
         j2 = mod2n(j2, 64)
         p(1, k) = p(1, k) + y(i2 + 32)
         p(2, k) = p(2, k) + z(j2 + 32)
         i2 = i2 + e(i2 + 32)
         j2 = j2 + f(j2 + 32)
         h(i2, j2) = h(i2, j2) + fw
 13    continue


A number of address computations take place in this code: the indices for the accesses to arrays B, C, Y, Z, E, F and H are all computed within the loop.

In a naive compilation of a loop of this form, a Loss of Decoupling event would be generated for each of these accesses, on every loop iteration. One way to reduce the impact of this type of LOD is to apply loop distribution.
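As a hedged illustration of how loop distribution might help, the sketch below applies it to a simplified fragment of the kernel above; t1 and t2 are hypothetical compiler temporaries, and this is not the compiler's actual output. The DU-side index arithmetic is hoisted into a loop of its own, so the cross-unit transfer is paid once, between the two loops, rather than on every iteration; the remaining accesses through t1 and t2 are simple indirection, which can then be handled by the techniques used for indirect accesses.

C     A sketch only; t1 and t2 are hypothetical temporaries
C     holding the computed indices for the whole loop.
       do 131 k = 1, n
         i1 = p(1, k)
         j1 = p(2, k)
         t1(k) = 1 + mod2n(i1, 64)
         t2(k) = 1 + mod2n(j1, 64)
 131   continue
C     One cross-unit transfer here, instead of one per iteration.
       do 132 k = 1, n
         p(3, k) = p(3, k) + b(t1(k), t2(k))
         p(4, k) = p(4, k) + c(t1(k), t2(k))
 132   continue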

I-cache disruption

 

Arc (c) in figure 1 shows the disruption that occurs in the execution pipeline when the instruction cache cannot supply the next instruction to the AU without delay. This stalls the AU, which in turn creates a pipeline bubble that propagates through memory to the DU, eventually stalling the DU as well.

The most common circumstance that gives rise to this occurrence is an I-cache miss. This problem is no different in a decoupled machine than in a conventional architecture, whether superscalar or vector-based, and may be alleviated in many ways. These are outside the scope of this paper; the interested reader is directed to the compiler writers' appendix in [7].

Control transfers

 

Arc (d) in figure 1 shows the effect of a control transfer, or branch, whose outcome is determined by a value computed on the DU. While this might represent the outcome of a computed jump, such as that found when compiling a computed GOTO in Fortran, the most frequent disruptions of this type occur when a condition is computed on the DU and used to determine which of two possible sequences of code is to be executed. Until the condition has been determined, it is not possible to select the direction in which the branch will proceed: the sequencer must wait for the DU, draining the pipeline formed by the I-cache, the AU, and the memory system. When the condition becomes available, the sequencer may issue further instructions and the AU may execute them, initiating memory accesses; the resulting operands become available to the DU after a memory latency has elapsed, and fully decoupled execution can then continue.

This cause of disruption is again common to other architectures: both superscalar and vector architectures suffer from execution penalties induced by control transfers. A large body of optimization work on predicting the outcome of branches has been carried out in the context of conventional architectures, and most of this work is applicable to decoupled architectures.

If the DU-computed value is an element of an array, then the DU will store the value and the CU will subsequently read the value. Such array dependencies require an explicit CU/DU synchronization operation to be introduced into the code during compilation to prevent a read-after-write hazard between the CU and the DU. An excellent example of such a hazard exists in the OLDA subroutine from TRFD in the Perfect Club.
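A synthetic fragment illustrating the shape of the hazard (assumed for illustration, not taken from OLDA; the arrays and the index k are hypothetical) is shown below: the DU stores an array element which the CU subsequently reads to resolve a branch, so an explicit synchronization must separate the store from the read.

C     Synthetic illustration of a CU/DU read-after-write hazard.
C     The DU computes and stores t(i); the CU later reads t(k).
          do 60 i = 1, n
            t(i) = a(i) - b(i)
 60       continue
C     --- explicit CU/DU synchronization required here ---
          if (t(k) .lt. 0.0) then
            m = m + 1
          end if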

If the code that is guarded by the condition can be executed entirely on the DU and AU, then the control dependence can be converted into a data dependence under a transformation known as IF-conversion.
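As a sketch of the idea on a simple guarded update (a synthetic example, not one from the paper), IF-conversion replaces the branch with an unconditional computation and a data-dependent select. The Fortran 90 intrinsic MERGE is used below purely as shorthand; a decoupled compiler would emit a predicated or vector-merge instruction at the machine level.

C     Before IF-conversion: the branch depends on a DU-computed value.
C         do 10 i = 1, n
C           if (a(i) .gt. 0.0) b(i) = b(i) + a(i)
C  10     continue
C     After IF-conversion: the control dependence becomes a data
C     dependence; MERGE stands in for a machine-level select.
          do 20 i = 1, n
            b(i) = b(i) + merge(a(i), 0.0, a(i) .gt. 0.0)
 20       continue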

While loops and premature loop exits

 

A WHILE loop is defined as any loop with an iteration count that is not loop invariant. In the traditional numerical supercomputing domain, WHILE loops appear to be comparatively infrequent. They are often used to control the convergence condition of an indirect algorithm, but may also be found at the leaf-level in searching or table lookup algorithms. Such loops require special treatment if decoupling is to be preserved.
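A synthetic convergence loop of this kind is shown below (x, b, tol and the update formula are assumed purely for illustration). The trip count depends on the DU-computed residual err, and so cannot be determined before the loop begins executing.

C     Synthetic WHILE loop: the trip count depends on the
C     DU-computed residual err, so it is not loop invariant.
 100      continue
            err = 0.0d0
            do 110 i = 1, n
              xnew = 0.5d0 * (x(i) + b(i))
              err = max(err, abs(xnew - x(i)))
              x(i) = xnew
 110        continue
          if (err .gt. tol) goto 100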

Any DO loop with a conditional jump out of the loop also has an iteration count which is not loop invariant. Again, some special treatment is required to preserve decoupling across iteration boundaries.
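For example (again synthetic, not from the paper), the conditional jump out of the following DO loop makes its effective trip count depend on DU-computed data:

C     Synthetic premature-exit loop: the exit condition depends on
C     sum, computed on the DU, so the trip count is data dependent.
          do 30 i = 1, n
            sum = sum + a(i) * a(i)
            if (sum .gt. tol) goto 40
 30       continue
 40       continue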


