| STACK 0 | STACK 1 |
|---------|---------|
| 00000   | 00001   |
| 00002   | 00003   |
| .       | .       |
| 08190   | 08191   |

| STACK 2 | STACK 3 |
|---------|---------|
| 08192   | 08193   |
| 08194   | 08195   |
| .       | .       |
| 16382   | 16383   |
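Under this arrangement the stack holding a given address follows directly from two bits of that address: the least significant bit selects between the two stacks of a pair, and the most significant address bit selects the pair. A minimal sketch of this mapping (the function name `stack_for` is illustrative, not Atlas terminology):

```python
# Two-way interleaved 16384-word store split across four stacks.
# Addresses 0..8191 alternate between stacks 0 and 1;
# addresses 8192..16383 alternate between stacks 2 and 3.
def stack_for(address):
    pair = address >> 13          # 0 for addresses 0..8191, 1 for 8192..16383
    return pair * 2 + (address & 1)

# Successive addresses fall in adjacent stacks, so an instruction
# pair at addresses (2n, 2n+1) can be fetched in one core cycle.
assert [stack_for(a) for a in (0, 1, 2, 3)] == [0, 1, 0, 1]
assert [stack_for(a) for a in (8190, 8191, 8192, 8193)] == [0, 1, 2, 3]
```

Because consecutive addresses never share a stack, the two halves of an instruction pair can always be read in parallel, which is what makes the paired fetch described below possible.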
Since instructions more often than not follow in sequence, and since successive instructions occupied adjacent stacks in this arrangement, instructions could be accessed in pairs in Atlas, so that two instructions were accessed (and subsequently buffered) in one core cycle. Assuming a relatively small program occupying stacks 0 and 1, the two operands required by each instruction pair could come from any of the stacks, giving different limitations on performance. In a sequence of instructions in which the operands accessed by each pair were all in stack 0 (or all in stack 1), for example, the minimum time for the execution of the pair would be 6 microsecs (2 microsecs for the instruction pair access plus 2 microsecs for each operand); for a sequence in which alternate operands were in stacks 0 and 1, the minimum execution time would be 4 microsecs; and for a sequence in which alternate operands were in stacks 2 and 3, the minimum execution time would be 2 microsecs (with the instruction pair and operand accesses fully overlapped). Assuming a random distribution of operands, however, all sixteen possibilities must be considered, as shown in Figure 2, and an average taken to obtain an indication of the expected performance.
| OPERAND 1 STACK | OPERAND 2 STACK | EXECUTION TIME (microsecs) |
|-----------------|-----------------|----------------------------|
| 0 | 0 | 6 |
| 0 | 1 | 4 |
| 0 | 2 | 4 |
| 0 | 3 | 4 |
| 1 | 0 | 4 |
| 1 | 1 | 6 |
| 1 | 2 | 4 |
| 1 | 3 | 4 |
| 2 | 0 | 4 |
| 2 | 1 | 4 |
| 2 | 2 | 4 |
| 2 | 3 | 2 |
| 3 | 0 | 4 |
| 3 | 1 | 4 |
| 3 | 2 | 2 |
| 3 | 3 | 4 |
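The entries in Figure 2 follow from a simple occupancy model: the instruction-pair fetch holds stacks 0 and 1 for one 2-microsec core cycle, and each operand access then holds its own stack for a further cycle, starting as soon as that stack is free. A sketch of that model (an assumed abstraction, not an Atlas circuit description), which reproduces the table and its 4-microsec average:

```python
CORE_CYCLE = 2  # microsecs per core store access

def pair_execution_time(op1_stack, op2_stack):
    # The instruction pair is fetched from stacks 0 and 1,
    # occupying both from t=0 to t=2.
    stack_free = {0: CORE_CYCLE, 1: CORE_CYCLE, 2: 0, 3: 0}
    finish = CORE_CYCLE  # execution cannot complete before the fetch does
    for stack in (op1_stack, op2_stack):
        start = stack_free[stack]            # wait until the stack is free
        stack_free[stack] = start + CORE_CYCLE
        finish = max(finish, stack_free[stack])
    return finish

times = [pair_execution_time(a, b) for a in range(4) for b in range(4)]
assert pair_execution_time(0, 0) == 6   # both operands in stack 0
assert pair_execution_time(2, 3) == 2   # fully overlapped with the fetch
assert sum(times) == 64                 # total over the sixteen cases
assert sum(times) / len(times) == 4.0   # 4 microsecs per instruction pair
```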
The total of the execution times for these sixteen possibilities is 64 microsecs, corresponding to an average of 4 microsecs for the execution of a pair of instructions or 2 microsecs per instruction. For sequences of computational instructions this figure was in fact achieved in Atlas. The overall average instruction time was higher, but this was due to the effects of branch instructions, and not to store clashes. Thus, by having a store made up of four independent stacks with two-way interleaving of addresses, the execution time of computational instructions was improved by a factor of two.
In Atlas the single floating-point arithmetic unit was used not only for addition and subtraction, but also for lengthier operations such as multiplication and division (which took 4.9 microsecs and between 10.7 and 29.8 microsecs respectively), and the average rate at which this unit could execute arithmetic operations was well matched to the average store accessing rate. In the CDC 6600, on the other hand, the use of multiple functional units meant that instructions could be issued at the much higher maximum rate of one every 100 ns. The cycle time of the main storage modules (or banks in CDC terminology) exceeded the basic processor clock period by a factor of 10 and, since the 24 computational registers offered a relatively small amount of buffering, a much higher level of store interleaving was required to sustain the instruction execution rate. The same ratio of processor clock period to store cycle time occurred in the CDC 7600 (27.5 ns and 275 ns respectively), and in order to minimise the effects of store clashes on overall performance, the 60-bit word main stores of both these machines were 32-way interleaved.
Because of the factor of 10 difference between the processor clock period and the store cycle time, a maximum of 10 storage banks could be in operation at any one time. This situation arose only during block copy operations between the Small Core Memory (SCM) and Large Core Memory (LCM) in the 7600, and between Central Storage and Extended Core Storage in the 6600. With random addressing of the 7600 SCM for instructions, program data and input-output channel data, an average of four SCM banks in operation at one time was more normal. This meant that there was a high probability that any single request would be to a free storage bank and would be serviced immediately, so that much of the benefit of this storage arrangement came not from interleaving per se, but from the fact that the store was made up of a large number of relatively small independent modules. The interleaving of these modules had a significant effect on performance when many successive accesses were to sequential addresses, as during instruction fetch sequences and, more importantly, during the processing of long vectors. Hence interleaving was extended to even higher levels in the Cray-1 and in the CDC CYBER 200 series of vector processors. The 32-bit wide storage banks of the CYBER 205, for example, were 256-way interleaved, and evidence of the need for such a high level of interleaving came from the fact that reducing it to 128-way could cause a reduction in performance of nearly 40 per cent for some types of operation.
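The benefit of interleaving for sequential access can be captured by a simple steady-state model: if each bank is busy for C processor clocks per access and sequential addresses rotate through N banks, a new request can be issued every clock so long as N is at least C, and the sustained rate falls to N/C of the peak when N is smaller. A sketch of this reasoning, using the CDC 7600 figures from the text; the under-provisioned case is illustrative, not a measured result:

```python
def sustained_fraction(n_banks, bank_busy_clocks):
    # Sequential addresses cycle round-robin through banks 0..n_banks-1.
    # A request stalls only if its bank is still busy with the access
    # issued n_banks requests earlier, i.e. when n_banks < bank_busy_clocks.
    return min(1.0, n_banks / bank_busy_clocks)

# CDC 7600: 27.5 ns clock, 275 ns SCM cycle -> each bank busy 10 clocks,
# so 10 banks suffice for a sequential stream and 32-way gives headroom.
assert sustained_fraction(32, 10) == 1.0
assert sustained_fraction(10, 10) == 1.0
assert sustained_fraction(5, 10) == 0.5   # too few banks: half of peak
```

This model also suggests why random addressing benefits mainly from the sheer number of independent banks rather than from interleaving itself: with many small banks, the chance that a random request finds its bank busy is low regardless of how addresses are distributed among them.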