4.2 Loop distribution for computed indices

Next: 4.3 IF-conversion Up: 4. Idiomatic Transformations Previous: 4.1 Pre-queueing for indirect

4.2 Loop distribution for computed indices

We have seen how computed indices result in LODs in the absence of special optimizations. The challenge for compilation is to find a way of reducing the frequency of LODs in the type of code shown below.

      DO 10 I=1,N
         IX = IFIX(X(I)+XV(I)*T)
         IY = IFIX(Y(I)+YV(I)*T)
         COUNT(IX,IY) = COUNT(IX,IY) + 1
 10   CONTINUE

The key to the optimization appropriate for this situation is already visible in the example shown: notice that two values may be transferred from DU to AU at an LOD point in the program. Since the time penalty occurs simply when synchronizing the two units in order to permit the safe transfer of data, the compiler can arrange to transfer an arbitrary amount of data between DU an AU whilst only suffering one LOD penalty.

The first transformation step in this process is to promote the IX and IY variables to vectors, in order to maintain distinct values from different loop iterations. Secondly, the original loop is split into two parts, so that the first loop computes all values for IX and IY, while the second updates the COUNT array. The transformed code is given below.

      DO 10 I=1,N
         IX(I) = IFIX(X(I)+XV(I)*T)
         IY(I) = IFIX(Y(I)+YV(I)*T)
 10   CONTINUE
      
      DO 20 I=1,N
         COUNT(IX(I),IY(I)) = COUNT(IX(I),IY(I)) + 1
 20   CONTINUE

The transformed code will not require an LOD between the loops if N is sufficiently large to permit the first IX and IY values to be computed and stored in the first loop before they need to be fetched for the second loop. Only if the loop count is small is any decoupling-related time penalty introduced between the two loops.

The reader will observe that the second loop in the transformed code contains indirect references: this requires an additional prequeueing transformation.

npt@dcs.ed.ac.uk