4.1 Pre-queueing for indirect accesses

Indirect access LODs typically occur when an array reference contains a subscripted index. We can improve performance by prefetching enough of the subscript vector to fill the memory pipeline. This can be done by a prologue loop whose sole purpose is to prefetch the subscript array. Once the memory pipeline is full, the main body of the loop can use the subscripts as they arrive. The main body alternates decoupled accesses to the actual data elements with accesses to future subscript elements. When the fetch of the last required element of the subscript vector has been initiated, the epilogue loop is entered. This fetches the remaining elements of the data array, using the subscripts that have already been prefetched.

Where should we put the elements of the subscript vector while we are prefetching it? Since we are attempting to decouple the direct accesses of the AU from its indirect accesses, it is tempting to provide a decoupling queue, a Load Data Queue, in the AU for this purpose. Indeed, the ZS-1, PIPE and ACRI-1 architectures all provide queues suitable for this purpose.

The transformed version of the original loop is thus:

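C     Prologue: prequeue the first N1 subscripts.
      DO 10 I = 1, N1
         LDQA[TAIL++] = IX(I)
 10   CONTINUE
      
C     Main body, N2 = N - N1: queue IX(I+N1), consume entry I.
      DO 11 I = 1, N2
         LDQA[TAIL++] = IX(I+N1)
         SUM = SUM + A(LDQA[HEAD++])
 11   CONTINUE
      
C     Epilogue: consume the N1 subscripts still queued.
      DO 12 I = N2+1, N
         SUM = SUM + A(LDQA[HEAD++])
 12   CONTINUE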
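
The three-phase schedule can be checked in software. The following is a minimal Python sketch, not a model of any particular machine: the functions indirect_sum and prequeued_sum are hypothetical names, the LDQA is modelled by an ordinary FIFO (append stands for LDQA[TAIL++], popleft for LDQA[HEAD++]), indices are 0-based, and the main body is assumed to run N - N1 times so that every queued subscript is consumed exactly once.

```python
from collections import deque


def indirect_sum(a, ix):
    """Reference loop: SUM = SUM + A(IX(I)) for every I."""
    return sum(a[j] for j in ix)


def prequeued_sum(a, ix, n1):
    """Same sum, with subscripts staged through a FIFO standing in for LDQA.

    Returns the sum and the largest number of subscripts queued at once.
    """
    n = len(ix)
    n1 = min(n1, n)      # cannot prequeue more subscripts than exist
    n2 = n - n1          # assumed split: the main body runs n2 times
    ldqa = deque()
    total = 0

    # Prologue: prequeue the first n1 subscripts.
    for i in range(n1):
        ldqa.append(ix[i])
    max_occupancy = len(ldqa)

    # Main body: queue subscript i + n1, then consume the oldest one.
    for i in range(n2):
        ldqa.append(ix[i + n1])
        max_occupancy = max(max_occupancy, len(ldqa))
        total += a[ldqa.popleft()]

    # Epilogue: consume the n1 subscripts that are still queued.
    for _ in range(n2, n):
        total += a[ldqa.popleft()]

    assert not ldqa      # every prefetched subscript has been used
    return total, max_occupancy
```

Under these assumptions the result matches the untransformed loop for any prequeue depth. Because this sketch enqueues before it dequeues in the main body, the queue briefly holds one more subscript than the prequeue depth.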

The values of N1 and N2 determine the number of IX elements to prequeue; this number depends on the available length of LDQA and on the latency that must be hidden by this mechanism. If N is large enough, it is possible to prevent all AU and DU stalls after the initial pipeline priming period.
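
One way to picture this trade-off is the short Python sketch below. It rests on a simplified model that the text does not state: the AU is assumed to issue one subscript load every issue_interval_cycles cycles, memory is assumed to have a fixed latency of latency_cycles cycles, and the queue is assumed to have queue_slots entries. The function name prequeue_depth and all of its parameters are hypothetical.

```python
def prequeue_depth(latency_cycles, issue_interval_cycles, queue_slots, n):
    """Estimate how many subscripts to prequeue (the prologue trip count).

    Assumed model: enough loads must be in flight to cover the memory
    latency, but never more than the queue can hold or the loop contains.
    """
    if issue_interval_cycles <= 0:
        raise ValueError("issue interval must be positive")
    # Loads needed in flight to cover the latency (ceiling division).
    wanted = -(-latency_cycles // issue_interval_cycles)
    return max(0, min(wanted, queue_slots, n))
```

For instance, with a latency of 20 cycles, one load issued every 2 cycles, and a 16-entry queue, this model suggests prequeueing 10 subscripts. With a latency of 100 cycles and one load per cycle, the 16-entry queue becomes the limit and part of the latency stays exposed.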

npt@dcs.ed.ac.uk