We have seen how computed indices result in LODs in the absence of special optimizations. The challenge for compilation is to find a way of reducing the frequency of LODs in the type of code shown below.
DO 10 I=1,N IX = IFIX(X(I)+XV(I)*T) IY = IFIX(Y(I)+YV(I)*T) COUNT(IX,IY) = COUNT(IX,IY) + 1 10 CONTINUE
The key to the optimization appropriate for this situation is already visible in the example shown: notice that two values may be transferred from DU to AU at an LOD point in the program. Since the time penalty occurs simply when synchronizing the two units in order to permit the safe transfer of data, the compiler can arrange to transfer an arbitrary amount of data between DU an AU whilst only suffering one LOD penalty.
The first transformation step in this process is to promote the IX
and
IY
variables to
vectors, in order to maintain distinct values from different loop iterations. Secondly,
the original loop is split into two parts, so that the first loop computes all values for
IX
and IY
, while the second updates the COUNT
array.
The transformed code is given below.
DO 10 I=1,N IX(I) = IFIX(X(I)+XV(I)*T) IY(I) = IFIX(Y(I)+YV(I)*T) 10 CONTINUE DO 20 I=1,N COUNT(IX(I),IY(I)) = COUNT(IX(I),IY(I)) + 1 20 CONTINUE
The transformed code will not require an LOD between the loops if N
is sufficiently large to permit the first IX
and IY
values to be computed and stored in the first loop before they need to be fetched for the
second loop. Only if the loop count is small is any decoupling-related time penalty
introduced between the two loops.
The reader will observe that the second loop in the transformed code contains indirect references: this requires an additional prequeueing transformation.