Store Interleaving

Until parallel arithmetic hardware units were introduced in the late 1950s, the speed of random access storage devices was not a limiting factor on processor performance. In 1957, the time for a floating-point addition in the Ferranti Mercury, for example, was 180 microsecs, while the core store cycle time was 10 microsecs. By 1961, however, the time for a floating-point addition in the Atlas computer was 1.6 microsecs, while the core store cycle time was 2 microsecs. Thus, in the absence of some special technique, the time required to access both an instruction and its operand would have amounted to 4 microsecs, leaving the floating-point unit idle for much of its time. Part of the solution to this problem was to overlap arithmetic and store accessing operations for successive instructions. In addition, the core store was made up of a number of independent storage modules, known as stacks, each with its own access circuits, so that if successive accesses were to different stacks, one access would be able to proceed immediately after another without being held up for a complete store cycle. Had it been possible to ensure that instructions and operands always occupied different stacks, then given the appropriate degree of overlap, instructions could have been executed at a rate of one per core cycle. Even in the absence of this possibility, benefit could still be gained from the independent access facilities of the stacks. The addressing of words in the four stacks of the 16 Kword version of Atlas installed at the University of Manchester was arranged as shown in Figure 1.

STACK 0    STACK 1
 00000      00001
 00002      00003
   .          .
 08190      08191

STACK 2    STACK 3
 08192      08193
 08194      08195
   .          .
 16382      16383

Figure 1 Store Word Addressing in Atlas
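The addressing rule of Figure 1 can be expressed as a short function (a modern sketch for illustration only; the function name is ours, and Atlas of course performed this mapping in hardware):

```python
# Word-address to stack mapping for the 16 Kword Atlas store of Figure 1.

def atlas_stack(address):
    # Addresses 0-8191 alternate between stacks 0 and 1;
    # addresses 8192-16383 alternate between stacks 2 and 3.
    return 2 * (address // 8192) + (address % 2)
```

Thus `atlas_stack(0)` is 0, `atlas_stack(1)` is 1 and `atlas_stack(8192)` is 2: consecutive addresses always fall in different stacks, so sequential accesses never wait on the same access circuits.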

Since instructions more often than not follow in sequence, and since successive instructions occupied adjacent stacks in this arrangement, instructions could be accessed in pairs in Atlas, so that two instructions were accessed (and subsequently buffered) in one core cycle. Assuming a relatively small program occupying stacks 0 and 1, the two operands required by each instruction pair could come from any of the stacks, giving different limitations on performance. In a sequence of instructions in which the two operands accessed by each pair of instructions were both in stack 0, or both in stack 1, for example, the minimum time for the execution of the pair would be 6 microsecs (2 microsecs for the instruction pair access plus 2 microsecs for each operand); for a sequence in which alternate operands were in stacks 0 and 1, the minimum execution time would be 4 microsecs, and for a sequence in which alternate operands were in stacks 2 and 3, the minimum execution time would be 2 microsecs (with the instruction pair and operand accesses fully overlapped). Assuming a random distribution of operands, however, all possibilities must be considered, as shown in Figure 2, and an average taken to give an indication of the expected performance.

OPERAND PAIR DISTRIBUTION    EXECUTION TIME (microsecs)
          0 0                          6
          0 1                          4
          0 2                          4
          0 3                          4
          1 0                          4
          1 1                          6
          1 2                          4
          1 3                          4
          2 0                          4
          2 1                          4
          2 2                          4
          2 3                          2
          3 0                          4
          3 1                          4
          3 2                          2
          3 3                          4

Figure 2 Instruction Times in Atlas

The total of the execution times for these sixteen possibilities is 64 microsecs, corresponding to an average of 4 microsecs for the execution of a pair of instructions or 2 microsecs per instruction. For sequences of computational instructions this figure was in fact achieved in Atlas. The overall average instruction time was higher, but this was due to the effects of branch instructions, and not to store clashes. Thus, by having a store made up of four independent stacks with two-way interleaving of addresses, the execution time of computational instructions was improved by a factor of two.
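The timing model behind Figure 2 can be reproduced with a short simulation (a sketch under simplified assumptions: each store access holds its stack for one full cycle and starts as soon as that stack is free, and the instruction-pair fetch occupies stacks 0 and 1 for one cycle; all names are ours):

```python
# Timing model for Atlas instruction pairs, reproducing Figure 2.

CYCLE = 2  # core store cycle time, microsecs

def pair_time(op1, op2):
    free = [0] * 4              # time at which each stack is next free
    free[0] = free[1] = CYCLE   # instruction-pair fetch uses stacks 0 and 1
    finish = []
    for stack in (op1, op2):    # the pair's two operand accesses
        start = free[stack]     # wait until the stack is free
        free[stack] = start + CYCLE
        finish.append(free[stack])
    return max(finish)

times = {(a, b): pair_time(a, b) for a in range(4) for b in range(4)}
total = sum(times.values())        # 64 microsecs over the 16 cases
per_instruction = total / 16 / 2   # 2 microsecs per instruction
```

The simulation yields the same sixteen entries as Figure 2 and the same average of 2 microsecs per instruction.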

In Atlas the single floating-point arithmetic unit was used not only for addition and subtraction, but also for lengthier operations such as multiplication and division (which took 4.9 microsecs and between 10.7 and 29.8 microsecs respectively), and the average rate at which this unit could execute arithmetic operations was well matched to the average store accessing rate. In the CDC 6600, on the other hand, the use of multiple functional units meant that instructions could be issued at the much higher maximum rate of one every 100 ns. The cycle time of main storage modules (or banks in CDC terminology) exceeded the basic processor clock period by a factor of 10 and, since the 24 computational registers offered a relatively small amount of buffering, a much higher level of store interleaving was required to sustain the instruction execution rate. This same ratio of processor clock period to store cycle occurred in the CDC 7600 (27.5 ns and 275 ns respectively) and in order to minimise the effects of store clashes on overall performance, the 60-bit word main stores of both these machines were 32-way interleaved.

Because of the factor of 10 difference between the processor clock and the store cycle time, a maximum of 10 storage banks could be in operation at any one time. This situation arose only during block copy operations between the Small Core Memory (SCM) and Large Core Memory (LCM) in the 7600, and between Central Storage and Extended Core Storage in the 6600. With random addressing of the 7600 SCM for instructions, program data and input-output channel data, an average of four SCM banks in operation at one time was more typical. This meant that there was a high probability that any single request would be to a free storage bank, and would be serviced immediately, so that much of the benefit of this storage arrangement came not from interleaving per se, but from the fact that the store was made up of a large number of relatively small independent modules. The interleaving of these modules had a significant effect on performance when many successive accesses were to sequential addresses, as during instruction fetch sequences and, more importantly, during the processing of long vectors. Hence interleaving was extended to even higher levels in the Cray-1 and in the CDC CYBER 200 series of vector processors. The 32-bit wide storage banks of the CYBER 205, for example, were 256-way interleaved, and evidence of the need for such a high level of interleaving came from the fact that reducing the interleaving to 128-way could cause a reduction in performance of nearly 40 per cent for some types of operation.
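The sensitivity of access streams to the degree of interleaving can be illustrated with a toy bank-conflict model (an illustrative sketch, not a model of any particular machine: addresses map to banks modulo the number of banks, one request is issued per processor clock, and a request to a busy bank must wait):

```python
# Toy model of an interleaved store: returns the time, in processor
# clocks, to complete a stream of storage requests.

def stream_time(addresses, n_banks, bank_cycle):
    free = [0] * n_banks        # time at which each bank is next free
    t = 0                       # current issue time, in clocks
    for a in addresses:
        bank = a % n_banks
        t = max(t, free[bank])  # stall while the bank is still busy
        free[bank] = t + bank_cycle
        t += 1                  # next request one clock later
    return max(free)

sequential = stream_time(range(64), 16, 10)          # spreads over all banks
same_bank = stream_time(range(0, 640, 16), 16, 10)   # stride hits one bank
```

Sequential addresses revisit each bank only after a full rotation through the others, so with enough banks no request ever waits; a stride equal to the interleave degrades the stream to the speed of a single bank, which is one reason vector machines pushed interleaving so high.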
