Indirect access LODs typically occur when an array reference contains a subscripted index. We can improve performance by prefetching enough of the subscript vector to fill the memory pipeline. This can be done by a prologue loop whose sole purpose is to prefetch the subscript array. When the memory pipeline is full, the main body of the loop can use the subscripts as they arrive. The main body alternates decoupled accesses to the actual data elements with accesses to future subscript elements. When the fetch of the last required element of the subscript vector has been initiated, the epilogue loop is entered. This fetches the remaining elements of the data array, using the subscripts that have already been prefetched.
Where should we put the elements of the subscript vector while we are prefetching it? Since we are attempting to decouple the direct accesses of the AU from its indirect accesses, it is tempting to provide a decoupling queue, a Load Data Queue, in the AU for this purpose. Indeed, the ZS-1, PIPE and ACRI-1 architectures all provide suitable queues.
The transformed version of the original loop is thus:
      DO 10 I = 1, N1
         LDQA[TAIL++] = IX(I)
   10 CONTINUE
      DO 11 I = N1+1, N2
         LDQA[TAIL++] = IX(I)
         SUM = SUM + A(LDQA[HEAD++])
   11 CONTINUE
      DO 12 I = N2+1, N
         SUM = SUM + A(LDQA[HEAD++])
   12 CONTINUE
The values of N1 and N2 determine the number of IX elements to prequeue: this value depends upon the length of LDQA available, and on the latency that must be hidden using this mechanism. If N is large enough it is possible to prevent all AU and DU stalls, after the initial pipeline priming period.
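The relationship between the prequeue depth and the latency to be hidden can be illustrated with a small simulation. The following is a minimal sketch in Python, not the mechanism of any particular machine. It assumes a simplified timing model: one loop iteration per cycle, a fixed subscript-fetch latency, and no latency on the data access itself. The function name `indirect_sum` and the parameters `n1` and `latency` are hypothetical. The sketch also assumes that the main body issues fetches up to the last subscript and that the epilogue drains the `n1` prequeued entries. The capacity of the queue is not modelled.

```python
from collections import deque


def indirect_sum(a, ix, n1, latency):
    """Sum a[ix[i]] with a prologue / main body / epilogue schedule.

    n1      -- subscripts prequeued by the prologue (hypothetical knob)
    latency -- cycles from issuing a subscript fetch to its arrival
    Returns (total, stall_cycles) under a one-iteration-per-cycle model.
    """
    n = len(ix)
    n1 = min(n1, n)
    ldqa = deque()  # stand-in for the Load Data Queue: (subscript, arrival cycle)
    cycle = 0
    stalls = 0
    total = 0

    def consume():
        # Use the subscript at the head of the queue, waiting if it is still in flight.
        nonlocal cycle, stalls, total
        subscript, ready = ldqa.popleft()
        if ready > cycle:
            stalls += ready - cycle
            cycle = ready
        total += a[subscript]

    # Prologue: only prefetch subscripts, to fill the memory pipeline.
    for i in range(n1):
        ldqa.append((ix[i], cycle + latency))
        cycle += 1

    # Main body: issue one future subscript fetch, consume one queued subscript.
    for i in range(n1, n):
        ldqa.append((ix[i], cycle + latency))
        consume()
        cycle += 1

    # Epilogue: every subscript fetch has been issued; drain the queue.
    for _ in range(n1):
        consume()
        cycle += 1

    return total, stalls
```

Under this model, calling `indirect_sum(a, ix, n1, latency)` with `n1` at least as large as `latency` reports no stall cycles, because every subscript has arrived by the time it reaches the head of the queue. A smaller `n1` leaves part of the latency exposed, and the stall count becomes non-zero.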