## Functional units in the CRAY-1

The thirteen functional units in the CRAY-1 can be classified into four groups according to the kind of operations they perform and the operating registers to which they are connected. These groups are address, scalar, vector and floating-point. Each unit is pipelined into single clock period segments, so that each can start an operation on a new pair of input operands in each successive clock period.

## Address Units

The Address Add and Address Multiply Units perform 24-bit 2's complement integer arithmetic on operands obtained from the A registers, and both return their results to an A register. The Address Add Unit (which performs both addition and subtraction) is the equivalent of the Increment Unit in the CDC 7600, and is similarly pipelined into two stages. The Address Multiply Unit is pipelined into six stages and produces as its result the least significant 24 bits of the integer product of two 24-bit operands. Address multiplication is frequently used in the handling of multi-dimensional arrays. Whereas the number format used in the CDC 7600 allowed integer multiplication to be carried out quite straightforwardly in the floating-point multiply unit, the use of a sign-and-magnitude, fractional mantissa in the CRAY-1 makes this option much less desirable. Furthermore, the floating-point multiply unit in the CRAY-1 is frequently reserved for long periods by vector operations, and a separate address multiplier is therefore essential for performance reasons.
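The arithmetic performed by the two address units can be illustrated with a short Python sketch. It is only a model of the results described above (24-bit wrap-around, low-order 24 bits of the product); the function names and the row-major array layout in `element_address` are assumptions made for the example and are not taken from the CRAY-1 instruction set.

```python
MASK24 = (1 << 24) - 1  # A registers hold 24-bit values


def address_add(a, b):
    """24-bit 2's complement addition: keep only the low 24 bits of the sum."""
    return (a + b) & MASK24


def address_multiply(a, b):
    """Least significant 24 bits of the integer product of two 24-bit operands."""
    return ((a & MASK24) * (b & MASK24)) & MASK24


def element_address(base, i, j, row_length):
    """Address of element (i, j) of a 2-D array.

    Assumes a row-major layout with one word per element (hypothetical;
    chosen only to show why index calculations need a multiply).
    """
    offset = address_add(address_multiply(i, row_length), j)
    return address_add(base, offset)
```

For instance, `element_address(512, 3, 5, 100)` evaluates to `817`, while `address_multiply(1 << 12, 1 << 12)` evaluates to `0` because the product `2**24` lies entirely above the 24 bits that are kept.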

## Scalar Units

The Scalar Add, Scalar Shift and Scalar Logical Units perform operations on 64-bit operands taken from S registers and each delivers a 64-bit result to an S register. The Scalar Add Unit performs 2's complement integer addition and subtraction and is pipelined into three stages. The Scalar Shift Unit performs single and double-length shifts on one or two S register operands, the former requiring two clock periods and the latter three. The Scalar Logical Unit is the equivalent of the Boolean Unit in the CDC 6600 and 7600, but it produces its results in one clock period rather than the two required in the 7600.

The fourth unit in this group is the Population & Leading Zero Count Unit, which takes a 64-bit operand from an S register and returns a 7-bit result, equal to the number of ones in the operand or the number of zeros preceding the most significant 1 in the operand, to an A register. The first of these operations requires four clock periods for its execution, and the second three.
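The two operations of this unit can be stated precisely in a few lines of Python. This is a sketch of the results only, not of the hardware; the function names are chosen for the example. Both results lie in the range 0 to 64, which is why a 7-bit result field is needed.

```python
MASK64 = (1 << 64) - 1  # S registers hold 64-bit values


def population_count(s):
    """Number of 1 bits in a 64-bit operand (0..64)."""
    return bin(s & MASK64).count("1")


def leading_zero_count(s):
    """Number of 0 bits preceding the most significant 1 (64 if the operand is zero)."""
    return 64 - (s & MASK64).bit_length()
```

An all-zero operand gives a population count of 0 and a leading zero count of 64; an all-ones operand gives 64 and 0 respectively.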

## Vector Units

The Vector Add, Vector Shift and Vector Logical Units take operands from two V registers and return their results to a V register. Successive operand pairs are transmitted to a vector unit in successive clock periods, and after a start-up delay equal to the pipeline length of the unit, results are also copied back to the result register in successive clock periods. The Vector Add Unit performs 64-bit 2's complement integer addition and subtraction and is pipelined into three stages. The Vector Shift Unit is a four-stage pipeline which performs single-length shifts on individual elements of a V register or double-length shifts on consecutive pairs of V register elements. The Vector Logical Unit performs operations similar to those in the Scalar Logical Unit, but acts on operands taken from V registers rather than S registers, and is implemented as a two-stage pipeline. There is also a Vector Population Unit which operates on vectors in a manner similar to that in which the Scalar Population Unit operates on scalars.
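The timing behaviour described above can be modelled in a few lines of Python. The sketch assumes a deliberately simplified model in which operand pair `i` enters the unit in clock period `i` and its result emerges after a delay equal to the number of pipeline stages; any time spent moving data between the V registers and the unit is ignored, so the figures illustrate the pipelining principle and are not exact CRAY-1 instruction timings.

```python
# Pipeline lengths of the vector units, as given in the text.
VECTOR_UNIT_STAGES = {"add": 3, "shift": 4, "logical": 2}


def result_times(unit, n):
    """Clock period in which each of n results emerges from the unit.

    Simplified model: operand pair i enters in clock period i and its
    result appears `stages` clock periods later.
    """
    stages = VECTOR_UNIT_STAGES[unit]
    return [i + stages for i in range(n)]
```

Under this model `result_times("add", 4)` gives `[3, 4, 5, 6]`: after the three-period start-up delay one result is produced per clock period, so for long vectors the cost per element approaches one clock period regardless of the pipeline length.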

None of the address, scalar or vector arithmetic units detects overflows; CRAY-1 users are supposed to know what they are doing and to write error-free programs. Floating-point out-of-range errors are detected, however, in the Floating-point Units.

## Floating-point Units

The Floating-point Add, Floating-point Multiply and Reciprocal Approximation Units perform floating-point arithmetic for both scalar and vector operations. For scalar instructions the operands are obtained from S registers and the results returned to an S register, while for vector instructions the operands are obtained from a pair of V registers or a V register and an S register, and results returned to a V register. When executing vector instructions, successive operand pairs are transmitted to a unit in successive clock periods, and results are similarly obtained.

The Floating-point Add Unit performs addition or subtraction of 64-bit operands in floating-point format (see figure) and always produces normalised results. This is a departure from the 6600 and 7600 tradition where normalisation was an 'optional extra' which had to be paid for in extra instructions. The consequence is a longer pipeline in the CRAY-1 Floating-point Add Unit, involving six stages rather than the four used in the 7600.

The Floating-point Multiply Unit is pipelined into single clock period segments, like all other functional units in the CRAY-1, and in contrast to the CDC 7600 Multiply Unit, which has a two clock period segment time. The CRAY-1 Multiply Unit uses a multiply pyramid to produce the mantissa product, and requires seven clock periods to produce any one result, rather than the five required in the 7600.

The CRAY-1 Floating-point Multiply Unit also participates in division. The standard subtract and test algorithm used in most machines to implement division cannot easily be pipelined, and a Newton-Raphson iteration algorithm is therefore used in the CRAY-1, similar to that used in the IBM System/360 Model 91 and in MU5. In the CRAY-1, however, division is not implemented directly, but instead an approximation of the reciprocal of the divisor is formed in the Reciprocal Approximation Unit, and a separate instruction must be used to multiply the result by the dividend in the Floating-point Multiply Unit. Furthermore, the Reciprocal Approximation Unit produces a result accurate to only 30 bits (in a 14-stage pipeline), and in order to produce a result accurate to 47 bits (which is still one bit short of the full mantissa), an additional iteration must be performed, again using the Floating-point Multiply Unit in a separate instruction. Thus a scalar quotient is normally computed in 29 clock periods, and forming an n-element vector quotient requires approximately 3n clock periods.
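The arithmetic behind this scheme can be sketched in Python. For a divisor d, one Newton-Raphson step on f(x) = 1/x - d replaces an estimate x by x(2 - dx), which roughly squares the relative error and therefore roughly doubles the number of correct bits. The sketch below is an illustration under stated assumptions: `reciprocal_approximation` merely truncates Python's own 1/d to about 30 bits as a stand-in for the Reciprocal Approximation Unit, Python floats carry a 53-bit mantissa rather than the CRAY-1's 48, and the order of the multiplications is not meant to reproduce the actual instruction sequence.

```python
import math


def reciprocal_approximation(d, bits=30):
    """Stand-in for the Reciprocal Approximation Unit.

    Returns 1/d with its mantissa truncated to about `bits` bits
    (an assumption made to mimic a limited-accuracy first estimate).
    """
    mantissa, exponent = math.frexp(1.0 / d)
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)


def refine(d, x):
    """One Newton-Raphson step for the reciprocal of d: x * (2 - d*x)."""
    return x * (2.0 - d * x)


def divide(n, d):
    """Quotient n/d formed from an approximate reciprocal plus one refinement."""
    x0 = reciprocal_approximation(d)
    return n * refine(d, x0)
```

With `d = 3.0`, the first estimate satisfies `abs(3.0 * x0 - 1.0) < 2**-29`, and after a single call to `refine` the error falls below `2**-47`, which is the doubling of accuracy that makes one extra iteration sufficient.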