A second set of results, shown in diagram (b), relates to matrix multiplication. This involves repeated scalar product operations, an operation lends itself particularly well to vector processing. In the Cray-1 a vector loop is used to multiply together up to 64 pairs of vector elements to form the same number of sub-products. These sub-products must then be added together in what is essentially a scalar loop to form the scalar product. However, by executing a vector add instruction in which one of the operand registers is the result register of the multiplication, and the other is identical with the result register of the addition itself, a recursive effect is achieved in which 64 sub-products can be reduced to 8 in an operation chained with the multiplication. This recursive effect occurs because the element selection pointer of the result/operand vector does not move on from the first element until receipt of the first result. The value contained in this first element (initially set to zero) is therefore sent repeatedly until the first result element (with a value equal to the first multiplication result) is received in the eighth clock period of the operation. In the ninth clock period the value of this first result is returned as an operand to the Floating-point Add Unit to be added to the ninth result, with subsequent elements following on similarly. The result of this addition is then returned as an operand in the 17th clock period to be added to the 17th result, and so on, until at the end of the add operation elements 56 to 63 of the result vector contain the eight partially summed sub-products.

In diagram (b) the execution rate in MFLOPS is plotted against increasing matrix dimension for matrix multiplication operations on square matrices. Matrix sizes which are multiples of 64 give the best results, as would be expected, and execution rates of up to 140 MFLOPS are clearly shown to be possible. This is a best case example, of course, and execution rates such as this cannot be sustained over a more typical mix of user programs. Average rates in the range 20 to 50 MFLOPS have been reported by Hockney and Jesshope [2]. One of the reasons why these rates are considerably lower than the 160 MFLOPS peak performance can be seen in the following analysis.

In order to sustain an execution rate of 2 FLOPS/CLOCK it must be
possible to transfer four operands per clock from the vector registers
into the floating-point units and two operands per clock from the
units back into the vector registers. Since there are eight vector
registers, each capable of transferring operands in or out at a rate
of one operand per clock, 2 FLOPS/CLOCK can easily be supported.
However, because there are *only* eight vector registers,
transfers between these registers and the store are required at
regular intervals. For these transfers the data rate is also one
operand per clock, but here there is only one data path, so that the
bandwidth between the store and the registers is one-sixth of that
used between the registers and the floating-point units to support 2
FLOPS/CLOCK.

In the case of matrix multiplication, where a sequence of scalar product operations may be carried out using the same column vector of one matrix repeatedly, it is in fact only necessary to transfer row vectors of the other matrix from store at this rate in order to support 2 FLOPS/CLOCK. In many other situations, however, the restricted bandwidth between the store and the registers is a performance bottleneck. This bottleneck was overcome in the Cray X-MP by the use of multiple paths between the store and the vector register.