Cray-1 Performance

Vector processing is of value only if, first, users' numerical problems involve vectors to a sufficient extent, in terms of both number and length, and second, these vectors can be identified when the numerical problem is mapped on to the hardware. For Cray-1 users the mapping problem is largely the responsibility of the FORTRAN compiler, which vectorises DO loops and splits long vectors into 64-element segments. Cray Research Inc. have published performance studies [1] which show that the vector subroutines in their FORTRAN library of mathematical functions out-perform repetitive use of the equivalent scalar subroutines for vectors having as few as two elements. These results are reproduced in diagram (a), in which the cost, in clock periods per result, is plotted against vector length. For scalar subroutines the cost per result remains constant, since the subroutine must be called once for every element. For vector subroutines the cost per result drops rapidly as the vector length increases from one, approaching a lower limit equal to about one-seventh of the scalar cost. A more detailed analysis of the performance characteristics of vector processors in general can be found under Performance of Vector Processors.
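The shape of the curves in diagram (a) can be reproduced with a simple timing model, sketched in C below. This is a hypothetical model, not Cray's measurement code: the constants are assumed purely for illustration, chosen so that the vector cost per result tends to one-seventh of the scalar cost, as in the published results.

```c
#include <stdio.h>

/* Illustrative timing model for diagram (a).  A scalar subroutine costs
 * a fixed number of clock periods per result, since it is called once
 * per element; a vector subroutine pays a one-off startup cost plus a
 * small per-element cost, so its cost per result falls towards that
 * per-element cost as the vector lengthens.  The constants below are
 * assumptions, not Cray's published figures. */
#define SCALAR_COST  70.0   /* clocks per result, scalar call      (assumed) */
#define VEC_STARTUP  60.0   /* one-off vector startup, in clocks   (assumed) */
#define VEC_PER_ELEM 10.0   /* clocks per element, vector version  (assumed) */

int main(void)
{
    for (int n = 1; n <= 64; n *= 2) {
        double vec_cost = VEC_STARTUP / n + VEC_PER_ELEM;
        printf("n = %2d   scalar = %5.1f   vector = %5.1f clocks/result\n",
               n, SCALAR_COST, vec_cost);
    }
    return 0;
}
```

Even with a startup cost several times the per-element cost, the vector version in this model already wins at a vector length of two, which is the behaviour reported in [1].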

A second set of results, shown in diagram (b), relates to matrix multiplication. This involves repeated scalar product operations, an operation which lends itself particularly well to vector processing. In the Cray-1 a vector loop is used to multiply together up to 64 pairs of vector elements to form the same number of sub-products. These sub-products must then be added together, in what is essentially a scalar loop, to form the scalar product. However, by executing a vector add instruction in which one of the operand registers is the result register of the multiplication and the other is identical with the result register of the addition itself, a recursive effect is achieved in which 64 sub-products can be reduced to eight in an operation chained with the multiplication. This recursive effect occurs because the element selection pointer of the result/operand vector does not move on from the first element until receipt of the first result. The value contained in this first element (initially set to zero) is therefore sent repeatedly until the first result element (with a value equal to the first sub-product) is received in the eighth clock period of the operation. In the ninth clock period the value of this first result is returned as an operand to the Floating-point Add Unit, to be added to the ninth sub-product, with subsequent elements following on similarly. The result of this addition is in turn returned as an operand in the 17th clock period, to be added to the 17th sub-product, and so on, until at the end of the add operation elements 56 to 63 of the result vector contain the eight partially summed sub-products.
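The feedback described above can be modelled in a few lines of C. This is a minimal sketch of the behaviour as described, not Cray hardware or library code; DEPTH stands for the eight-clock delay between sending the first operand and receiving the first result, and the names are invented for the example.

```c
#include <stdio.h>

#define VLEN  64   /* Cray-1 vector register length                  */
#define DEPTH  8   /* clocks from first operand sent to first result */

/* Model of the chained recursive add v3 = prod + v3, where v3 is both
 * an operand register and the result register.  The operand pointer of
 * v3 lags the result pointer by DEPTH elements, so element i of the
 * result is prod[i] plus the result produced DEPTH clocks earlier. */
double recursive_sum(const double prod[VLEN])
{
    double v3[VLEN] = {0.0};   /* result/operand register, zeroed      */
    double seed = v3[0];       /* first element, sent repeatedly until
                                  the first result arrives             */

    for (int i = 0; i < VLEN; i++)
        v3[i] = prod[i] + ((i < DEPTH) ? seed : v3[i - DEPTH]);

    /* Elements 56 to 63 now hold the eight partially summed
     * sub-products; a short scalar loop completes the reduction. */
    double sum = 0.0;
    for (int j = VLEN - DEPTH; j < VLEN; j++)
        sum += v3[j];
    return sum;
}

int main(void)
{
    double prod[VLEN];
    for (int i = 0; i < VLEN; i++)
        prod[i] = (double)(i + 1);   /* 1 + 2 + ... + 64 = 2080 */
    printf("sum of the sub-products = %g\n", recursive_sum(prod));
    return 0;
}
```

A single pass of the loop reduces the 64 sub-products to eight partial sums, exactly the 64-to-8 reduction described above.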

In diagram (b) the execution rate in MFLOPS is plotted against increasing matrix dimension for matrix multiplication operations on square matrices. Matrix sizes which are multiples of 64 give the best results, as would be expected, and execution rates of up to 140 MFLOPS are clearly shown to be possible. This is, of course, a best-case example, and such execution rates cannot be sustained over a more typical mix of user programs. Average rates in the range 20 to 50 MFLOPS have been reported by Hockney and Jesshope [2]. One of the reasons why these rates are considerably lower than the 160 MFLOPS peak performance can be seen in the following analysis.
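The peak figure itself follows from the clock: the Cray-1 has a 12.5 ns clock period, and with the multiply and add units chained it can deliver two floating-point results per clock period:

\[
2\ \text{FLOPS/clock} \times \frac{1\ \text{clock}}{12.5\ \text{ns}} = 160 \times 10^{6}\ \text{FLOPS} = 160\ \text{MFLOPS}.
\]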

In order to sustain an execution rate of 2 FLOPS/CLOCK it must be possible to transfer four operands per clock from the vector registers into the floating-point units and two results per clock from the units back into the vector registers. Since there are eight vector registers, each capable of transferring operands in or out at a rate of one operand per clock, this traffic can easily be supported. However, because there are only eight vector registers, transfers between these registers and the store are required at regular intervals. For these transfers the data rate is also one operand per clock, but here there is only one data path, so that the bandwidth between the store and the registers is one-sixth of that needed between the registers and the floating-point units to support 2 FLOPS/CLOCK.
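Making the arithmetic explicit: the two chained operations together consume two operands each and produce one result each per clock, so the register-to-unit traffic is

\[
2 \times 2 + 2 \times 1 = 6\ \text{words per clock},
\]

against which the single path to the store carries one word per clock, a ratio of one to six.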

In the case of matrix multiplication, where a sequence of scalar product operations may be carried out using the same column vector of one matrix repeatedly, it is in fact only necessary to transfer row vectors of the other matrix from store at this rate in order to support 2 FLOPS/CLOCK. In many other situations, however, the restricted bandwidth between the store and the registers is a performance bottleneck. This bottleneck was overcome in the Cray X-MP by providing multiple data paths between the store and the vector registers.
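This access pattern can be rendered schematically in C. The sketch below is an illustration under assumed names, not the Cray library routine: each 64-element segment of a column of b is loaded from store once and reused against every row of a, so only the row segments of a must stream from store while the scalar products accumulate.

```c
#include <string.h>

#define VL 64   /* Cray-1 vector register length */

/* c = a * b for n x n matrices in row-major order.  Each 64-element
 * segment of column j of b is loaded once (the reused column vector),
 * then every row segment of a streams past it from store. */
void matmul_strip(int n, const double *a, const double *b, double *c)
{
    memset(c, 0, (size_t)n * n * sizeof(double));

    for (int j = 0; j < n; j++) {
        for (int k0 = 0; k0 < n; k0 += VL) {
            int vl = (n - k0 < VL) ? (n - k0) : VL;

            double bcol[VL];                    /* held in a vector register */
            for (int k = 0; k < vl; k++)
                bcol[k] = b[(size_t)(k0 + k) * n + j];  /* loaded from store once */

            for (int i = 0; i < n; i++) {
                double s = 0.0;
                for (int k = 0; k < vl; k++)    /* row segment streams from store */
                    s += a[(size_t)i * n + k0 + k] * bcol[k];
                c[(size_t)i * n + j] += s;
            }
        }
    }
}
```

Because bcol stays resident while the rows stream past it, the inner loop loads roughly one word from store per multiply-add pair, that is, one word per 2 FLOPS, which is just what the single store path can sustain at one word per clock.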