The vector floating-point pipeline

The Vector Floating-point Pipeline (Figure 1) provides logical and arithmetic operand processing for all vector instructions. It is made up of five pipelined operand processing units, a Data Interchange, and associated control logic. The typical CYBER 205 system has a two-pipeline configuration in which two 64-bit or four 32-bit operands presented on each of the A and B inputs can be processed simultaneously. Except in the case of divide and square root operations, these pipelines can accept a new pair of inputs in every clock period. A one-pipeline configuration processes the two halves of the 128-bit words supplied to it sequentially and therefore runs at half the speed of the two-pipeline version. The four-pipeline version is essentially two two-pipeline processors, giving a total data path width of 256 bits. Floating-point numbers use a binary exponent and have the formats shown in Figure 2, where all exponents and mantissae are represented by 2's complement integers.

In a normal two-pipeline configuration the Add Unit receives operands from the Data Interchange over two 128-bit data highways and returns results to the Data Interchange over a single 128-bit highway. In a straightforward add or subtract operation successive elements of vector B are added to or subtracted from successive elements of vector A, and the results written to successive elements of vector C. Feedback paths within the Add Unit also allow other types of add operation to be implemented. For example, a single result element C may be obtained by summing all the elements of vector A, or a vector C may be produced by repeated addition of a scalar element B to the initial value of scalar element A.
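The three kinds of add operation just described can be sketched functionally as follows. This is an illustrative model only, not CYBER 205 code: the function names are hypothetical, Python lists stand in for vectors, and it is an assumption that the first element produced by the repeated-addition form is the initial value of A itself.

```python
def vector_add(a, b):
    """Element-wise add: C[i] = A[i] + B[i]."""
    return [x + y for x, y in zip(a, b)]


def vector_sum(a):
    """Single result element C formed by summing all elements of A."""
    total = 0.0
    for x in a:
        total += x
    return total


def repeated_add(a0, b, n):
    """Vector C of length n built by repeatedly adding scalar b to scalar a0.

    ASSUMPTION: C[0] is a0 itself; the text does not say whether the
    first element is A or A + B.
    """
    out = []
    value = a0
    for _ in range(n):
        out.append(value)
        value += b
    return out
```

For instance, repeated_add(5, 2, 4) yields the four elements 5, 7, 9 and 11 under the stated assumption.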

The Multiply Unit is similarly connected to the Data Interchange by two 128-bit input operand highways and one 128-bit result highway. In a typical multiply operation successive elements of vector A are multiplied by successive elements of vector B and the results written to successive elements of vector C. An internal feedback path allows a single result to be formed by multiplying together all the elements of vector A. The Multiply Unit also contains the logic for divide and square root operations.

The Shift Unit has one 128-bit data input highway (A) and one 14-bit input highway (B) which supplies a 7-bit shift count for each half of the pipeline. Each 64-bit element of vector A is shifted according to the corresponding 7-bit element of vector B: the most significant bit of the count selects a left or right shift, and its six least significant bits give the number of bit positions to be shifted.
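The decoding of the 7-bit shift count can be illustrated with a short sketch. Two points are assumptions, since the text does not settle them: that a most significant bit of 1 selects a right shift, and that bits shifted out are simply discarded with zeros shifted in. The name shift_element is hypothetical.

```python
MASK64 = (1 << 64) - 1  # keep results within one 64-bit element


def shift_element(a, count):
    """Shift the 64-bit value `a` as directed by a 7-bit `count`."""
    direction = (count >> 6) & 1   # most significant bit of the 7-bit count
    amount = count & 0x3F          # six least significant bits: 0 to 63 places
    if direction:
        # ASSUMPTION: a direction bit of 1 means shift right,
        # with zeros shifted in at the top.
        return (a & MASK64) >> amount
    # ASSUMPTION: left shifts discard bits moved past bit 63.
    return (a << amount) & MASK64
```

Six magnitude bits are enough to move a bit from one end of a 64-bit element to the other, which is why a 7-bit count suffices for each half of the pipeline.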

The Logical Unit carries out bit-by-bit logical operations between pairs of A and B elements supplied via 128-bit input highways and returns its results to the Data Interchange via a 128-bit result highway. It also carries out pack and unpack operations on floating-point numbers (similar to those implemented in the CDC 6600 and 7600) and a masked compare instruction, which requires a third input containing the 128-bit mask. This instruction searches elements of vector A in sequence for a bit-by-bit match with the single element B; bit positions for which the bit in the mask is zero are assumed to match. As each word is examined, the Register File location containing the index of A is updated, so that, if a match is found, the index gives the position of the matching element. If no match is found, the index is left pointing to the end of the vector. When the instruction terminates, a condition code is set to indicate the result.
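The masked compare search can be sketched as below. The function name and the returned (index, found) pair are illustrative stand-ins for the Register File index and the condition code, and taking "the end of the vector" to mean an index equal to the vector length is an assumption.

```python
def masked_compare(a, b, mask):
    """Search vector `a` for an element matching `b` under `mask`.

    Bit positions where `mask` is 0 are treated as matching; only the
    positions where `mask` is 1 are actually compared.
    Returns (index, found).
    """
    for index, word in enumerate(a):
        # XOR marks differing bits; the mask keeps only those that count.
        if (word ^ b) & mask == 0:
            return index, True
    # ASSUMPTION: "end of the vector" is modelled as index == len(a).
    return len(a), False
```

With a mask of 0xF0, for example, only the upper four bits of each 8-bit value take part in the comparison, so 0x2F matches a search element of 0x20.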

For simple vector instructions the Data Interchange is configured to connect the input and output highways to the appropriate processing unit. The Select Link instruction, however, causes the Data Interchange to be configured such that the succeeding two instructions in the code sequence become chained together. In this case the output of the unit used by the first instruction of the pair is routed to the input of the unit used by the second. Only two vector streams may be used in total, but this does allow commonly occurring triadic operations such as

Vector C = Vector A * Constant - Vector B

to be implemented in this manner. Not only does this allow the Multiply and Add Units to operate in parallel, but it also avoids the need to write the intermediate result vector into central memory and then read it out again.
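The effect of chaining on this triadic operation can be shown with a functional sketch. It models only the data flow, not the timing, and the function name is hypothetical: each product passes straight from the multiply step to the subtract step, and no intermediate vector is ever stored.

```python
def linked_triad(a, constant, b):
    """Compute C[i] = A[i] * constant - B[i] without an intermediate vector."""
    c = []
    for x, y in zip(a, b):
        product = x * constant   # stage performed by the Multiply Unit
        c.append(product - y)    # fed directly to the Add Unit
    return c
```

Without chaining, the same result would take two separate vector instructions, with the full vector of products written to memory by the first and read back by the second.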

Chaining of this sort occurs automatically during the execution of vector macro instructions such as the scalar product instruction, in which pairs of input operands are multiplied together in the Multiply Unit and their results then summed in the Add Unit. This summation also involves the use of the Delay Unit, which contains a 16-word temporary buffer store. Successive 128-bit words sent to the Delay Unit are written into successive locations selected on a cyclic basis by a write counter. These same words are then read out again, and returned to the Data Interchange, under the control of a read counter. The delay function is implemented by offsetting the read and write counters by the required number of clock cycles of delay.
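A minimal model of the Delay Unit's cyclic buffer is sketched below. The class and method names are hypothetical, one call to clock() stands for one clock period, and restricting the delay to between 1 and 15 periods is an assumption made to keep the model simple.

```python
class DelayUnit:
    """16-word cyclic buffer with offset read and write counters."""

    SIZE = 16

    def __init__(self, delay):
        # ASSUMPTION: delays of 1 to SIZE - 1 clock periods are supported.
        if not 1 <= delay < self.SIZE:
            raise ValueError("delay must be between 1 and 15 clock periods")
        self.buffer = [None] * self.SIZE
        self.read = 0
        self.write = delay  # write counter leads the read counter by `delay`

    def clock(self, word):
        """Write `word`, return the word written `delay` periods earlier."""
        self.buffer[self.write] = word
        out = self.buffer[self.read]
        self.write = (self.write + 1) % self.SIZE
        self.read = (self.read + 1) % self.SIZE
        return out
```

With a delay of four, the first four calls return None (nothing has yet emerged from the buffer) and the fifth call returns the word supplied on the first.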

In the summation of results from the Multiply Unit in the scalar product operation (and similarly in the summation of all the elements of a single vector), the addition is performed by feeding values into input A of the Add Unit, and routing the output of the Add Unit back into input B. This produces a recursive effect similar to that found in the CRAY-1 (q.v.). Because of the delay through the Add Unit pipeline, the zeroth input value, having passed through the Add Unit, is returned to input B in time to be added to the eighth value, the first in time to be added to the ninth, and so on. After a further pass through the Add Unit the sum of the zeroth and eighth values is ready to be added to the sixteenth value, and so on, until the input stream is exhausted and eight partial sums are left circulating in the Add Unit. Adding these partial sums together involves the use of the Delay Unit. The first four partial sums are delayed by four clock periods so that they become aligned with the last four at the input to the Add Unit. The four results obtained from this operation are then added in pairs using a delay of two clock periods, and a final add involving a delay of a single clock period produces the desired result ready to be written into the Register File.
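The order in which values are combined in this summation can be sketched as follows. The model is purely functional and ignores timing: the eight circulating partial sums become eight accumulators, and the delays of four, two and one clock periods become three folding steps. The function name and the depth parameter are illustrative.

```python
def pipelined_sum(values, depth=8):
    """Sum `values` in the order implied by a feedback path of length `depth`.

    `depth` is assumed to be a power of two (eight in the text).
    """
    # Phase 1: input i is added to the partial sum holding inputs
    # i - depth, i - 2*depth, ... (zeroth + eighth + sixteenth, etc.).
    partial = [0.0] * depth
    for i, v in enumerate(values):
        partial[i % depth] += v

    # Phase 2: fold the partial sums. Delaying the first half by
    # `half` periods aligns partial[i] with partial[i + half].
    width = depth
    while width > 1:
        half = width // 2   # 4, then 2, then 1 clock periods of delay
        partial = [partial[i] + partial[i + half] for i in range(half)]
        width = half
    return partial[0]
```

Because floating-point addition is not associative, a sum formed in this order can differ in its last bits from one accumulated strictly left to right.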