Performance

In the typical two-pipeline CYBER 205 system, two 64-bit operands can be transferred into the Floating-point Pipeline on each of its two input highways (four operands in all), and two 64-bit results transferred out, in one 20 ns clock period. Two results every 20 ns corresponds to an execution rate of 100 MFLOPS for 64-bit floating-point numbers, or 200 MFLOPS for 32-bit numbers. For linked triadic operations, or during the execution of the scalar product operation, this rate is doubled to 200 MFLOPS for 64-bit numbers or 400 MFLOPS for 32-bit numbers. The memory bandwidth required to support this execution rate is six 64-bit words (four operands and two results) per 20 ns interval, or 300 Mwords/s. Since the data highway between each one-million-word block of central memory and the memory interface is 512 bits (eight 64-bit words) wide, and a read or write transfer can occur every 20 ns, the available memory bandwidth is 400 Mwords/s, and a one-million-word memory configuration can therefore support a two-pipeline system without difficulty. The four-pipeline system requires at least two million words of central memory, with each one-million-word block connected to the memory interface via its own data highway. In this case the maximum execution rate is 800 MFLOPS for 32-bit linked triadic or scalar product operations.
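The peak figures quoted above follow from simple arithmetic on the 20 ns clock. The sketch below reproduces them; the function names are illustrative, and it assumes (consistent with the quoted rates) that each pipeline delivers one 64-bit result per clock, that 32-bit operation doubles the result rate, and that a linked triad or scalar product counts as two floating-point operations per result.

```python
CLOCK_NS = 20  # clock period quoted in the text


def peak_mflops(pipelines, bits=64, linked=False):
    """Peak rate in MFLOPS for a given configuration (illustrative model)."""
    per_pipe = 2 if bits == 32 else 1       # assumed: 32-bit mode doubles the rate
    ops_per_result = 2 if linked else 1     # assumed: linked triad = 2 operations
    ops_per_clock = pipelines * per_pipe * ops_per_result
    # operations per nanosecond * 1000 = millions of operations per second
    return ops_per_clock * 1000 / CLOCK_NS


def mwords_per_s(words_per_clock):
    """Bandwidth in Mwords/s for a given number of 64-bit words per clock."""
    return words_per_clock * 1000 / CLOCK_NS


print(peak_mflops(2))                        # two pipelines, 64-bit
print(peak_mflops(2, bits=32, linked=True))  # two pipelines, 32-bit, linked
print(peak_mflops(4, bits=32, linked=True))  # four pipelines, 32-bit, linked
print(mwords_per_s(6))                       # required: six words per clock
print(mwords_per_s(512 // 64))               # available: 512-bit highway
```

Under these assumptions the required 300 Mwords/s sits comfortably below the 400 Mwords/s that a single 512-bit highway can supply.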

Where access patterns involving non-unit increments are required, performance is reduced because the vector arithmetic instruction must be preceded by a gather instruction or followed by a scatter instruction. These instructions themselves run at less than the full arithmetic rate and have been measured to proceed at an average rate of 40 million operations per second. This figure is an average: most operations run at 50 million per second, with occasional periods at 12.5 million per second.
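The quoted average of 40 million operations per second constrains how often the slower 12.5 million per second rate can occur. The sketch below works this out under a simple two-rate model, which is an assumption: the text does not say whether the mix is measured by elapsed time or by number of operations, so both readings are shown. The function names are illustrative.

```python
def fast_fraction_by_time(avg, fast, slow):
    """Fraction of elapsed time spent at the fast rate.

    Time-weighted mean: avg = t * fast + (1 - t) * slow.
    """
    return (avg - slow) / (fast - slow)


def fast_fraction_by_ops(avg, fast, slow):
    """Fraction of operations performed at the fast rate.

    Operation-weighted (harmonic) mean: 1/avg = f/fast + (1 - f)/slow.
    """
    return (1 / slow - 1 / avg) / (1 / slow - 1 / fast)


# Rates in millions of operations per second, as quoted in the text.
t = fast_fraction_by_time(40.0, 50.0, 12.5)
f = fast_fraction_by_ops(40.0, 50.0, 12.5)
print(f"time at fast rate:       {t:.1%}")
print(f"operations at fast rate: {f:.1%}")
```

On this model roughly three quarters of the elapsed time, covering roughly nine tenths of the operations, would run at the 50 million per second rate.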

Perhaps a more serious criticism of the CYBER 205, from the point of view of performance on general problems, is the long vector start-up time. Although an execution rate of 100 MFLOPS for 64-bit numbers can in principle be obtained with a two-pipeline configuration, this can only be approached in practice when very long vectors are used. As the vectors become shorter, the start-up time has a progressively more serious effect, and Hockney and Jesshope [1] have used the vector length at which performance is halved as a measure of vector efficiency (see under Performance of Vector Processors).

For the CYBER 205 the nominal start-up time is 1 µs, and for a nominal result rate of 100 MFLOPS this half-performance length is 100 elements, since 100 results could have been produced during the start-up time. For the CRAY-1 Hockney and Jesshope quote values in the range 10 to 20, though even lower values are possible. However, the CYBER 205 was intended to be used for problems involving long vectors, and part of the start-up time can be eliminated for successive vector operations, since the processing of one instruction can often begin before the last few elements of the previous instruction have been fully processed. Any attempt at a generalised comparison between the CYBER 205 and the CRAY-1 is largely irrelevant, since the performance of each is critically dependent both on the application being run and on the way it is mapped onto the hardware.
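The effect of start-up time on short vectors can be illustrated with a minimal sketch. It assumes a simple linear timing model, in which a vector operation on n elements takes the start-up time plus n divided by the asymptotic result rate; the function names are illustrative, and the model ignores the overlap between successive instructions mentioned above.

```python
def half_performance_length(r_inf_mflops, startup_us):
    """Vector length at which the achieved rate is half the asymptotic rate.

    MFLOPS * microseconds = number of results that fit in the start-up time.
    """
    return r_inf_mflops * startup_us


def achieved_mflops(n, r_inf_mflops, startup_us):
    """Achieved rate for vector length n under the assumed linear model."""
    time_us = startup_us + n / r_inf_mflops  # start-up plus streaming time
    return n / time_us


R_INF = 100.0   # nominal two-pipeline 64-bit rate from the text, MFLOPS
STARTUP = 1.0   # nominal start-up time from the text, microseconds

print(half_performance_length(R_INF, STARTUP))
for n in (10, 100, 1000, 10000):
    print(n, round(achieved_mflops(n, R_INF, STARTUP), 1))
```

With the nominal CYBER 205 figures, this model gives half the peak rate at a vector length of 100, and vectors of about 1000 elements are needed to exceed 90 per cent of peak.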