Chaining

Chaining starts when a match occurs between one of the V register operand designators of an instruction awaiting issue in CIP, and the V register result designator of a previously issued instruction which has not yet returned its first result element. When this element becomes available for delivery to the result register, the instruction in CIP is issued (provided there are no other hold-ups) and the result element is forwarded with this instruction to the appropriate functional unit. Successive elements follow until the whole vector has been both written into its result register and forwarded to the second functional unit. The results of this second vector operation may themselves be chained into a third operation, and so on, as shown in the example in diagram (a).
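The matching condition described above can be sketched in a few lines of Python. This is only an illustrative model, not a description of the actual issue logic: the `Instruction` class, its field names and the `chains_with` helper are all hypothetical, and other hold-ups (such as functional unit reservations) are ignored.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    """Hypothetical model of a vector instruction (names are assumptions)."""
    result: str                          # V register result designator, e.g. "V1"
    operands: tuple                      # operand designators, e.g. ("V0", "S1")
    first_result_returned: bool = False  # set once the first element has come back


def chains_with(waiting, issued):
    """Return the previously issued instruction that `waiting` can chain to.

    A chain is possible when one of the operand designators of the waiting
    instruction matches the result designator of an issued instruction that
    has not yet returned its first result element. Returns None otherwise.
    """
    for prev in issued:
        if not prev.first_result_returned and prev.result in waiting.operands:
            return prev
    return None


# The three-instruction sequence used as the example in this section.
load = Instruction(result="V0", operands=())
mult = Instruction(result="V1", operands=("V0", "S1"))
add = Instruction(result="V3", operands=("V1", "V2"))

producer_for_mult = chains_with(mult, [load])        # the memory read
producer_for_add = chains_with(add, [load, mult])    # the multiply
```

In this model the multiply chains to the memory read through V0, and the add chains to the multiply through V1.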

Diagram (a) shows a schematic representation of the execution of the instruction sequence

V0 <- Memory
V1 <- V0 * S1
V3 <- V1 + V2

The first instruction causes 64 operands from a designated area in memory to be read out and copied in sequence into the 64 element positions in V0. Store requests are pipelined in such a way that the store appears to the processor as a pseudo functional unit. Thus after a start-up delay of seven clock periods, the first element of the vector from store becomes available for delivery to V0, and successive elements follow in successive clock periods.

In the clock period following the issue of the first instruction, the second instruction in the sequence is copied into CIP, but the reservation on V0 prevents it from being issued immediately. During the clock period in which the first vector element arrives from store ready for delivery to V0, however, this reservation is lifted and the instruction is allowed to issue. This clock period is known as chain slot time. Chaining allows the vector elements being copied into V0 to flow directly from the memory read pipeline into the Floating-point Multiply Unit pipeline, where each element is multiplied by the value taken from S1 at the start of the operation, to produce the vector V1.

The third instruction in the sequence becomes ready for issue in the clock period following issue of the second instruction, and it too is held up by a reservation on one of its input operands, this time V1. When the first element of V1 appears from the Floating-point Multiply Unit, the reservation on V1 is lifted, allowing this third instruction to issue. Now the elements emanating from the Floating-point Multiply Unit can flow directly into the Floating-point Add Unit pipeline as well as into the result register V1. Thus the memory read pipeline and the Floating-point Multiply and Floating-point Add Unit pipelines are all chained together to produce the elements of V3. The need for all pipelines to have the same segment time now becomes very apparent: pipelines such as those in the CDC 7600, which have different segment times, could not be chained together in this way.
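The data flow through the three chained pipelines can be imitated with Python generators, where each element is passed on as soon as it is produced instead of waiting for the whole vector. This is a sketch of the data flow only and carries no timing information. The function names, the memory contents and the values chosen for S1 and V2 are assumptions made for illustration.

```python
def memory_read(memory, base, length=64):
    """Yield `length` consecutive words from memory, one element at a time."""
    for i in range(length):
        yield memory[base + i]


def write_and_forward(register, elements):
    """Write each element into its result register and pass it on to the next unit."""
    for x in elements:
        register.append(x)
        yield x


def multiply_by_scalar(elements, scalar):
    """Model of the multiply pipeline: vector element times a fixed scalar."""
    for x in elements:
        yield x * scalar


def add_vectors(elements, other):
    """Model of the add pipeline: element-wise sum with a second vector."""
    for x, y in zip(elements, other):
        yield x + y


# Illustrative data (assumed values, not taken from the text).
memory = [float(i) for i in range(128)]
S1 = 2.0
V2 = [1.0] * 64

V0, V1 = [], []
stream_v0 = write_and_forward(V0, memory_read(memory, 0))                # V0 <- Memory
stream_v1 = write_and_forward(V1, multiply_by_scalar(stream_v0, S1))     # V1 <- V0 * S1
V3 = list(add_vectors(stream_v1, V2))                                    # V3 <- V1 + V2
```

Each element read from memory is written to V0, multiplied, written to V1 and added before the next element is read, which mirrors the way the chained pipelines pass elements along without waiting for a complete intermediate vector.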

An alternative representation of the sequence of events illustrated by diagram (a) is shown in the timing diagrams (b) and (c). Diagram (b) shows in detail the activities involved in producing the first element of V3. The memory read instruction issues in clock period 0 and the first memory word arrives at V0 in clock period 8. During clock period 9 (chain slot time, marked *) this element is transmitted to the Floating-point Multiply Unit, together with the value in S1. In clock period 17 the first result element from the Multiply Unit is transmitted to V1, and in clock period 18 (chain slot time) this element is transmitted to the Floating-point Add Unit, together with the first element of V2. In clock period 25 the first result element of the composite operation (V3 = V0*S1 + V2) becomes available and is transmitted from the Floating-point Add Unit to V3. Diagram (c) shows how successive elements of V3 are produced, each one staggered one clock period behind the element preceding it.
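The clock periods quoted above can be collected into a small timing model. The figures for the first element are taken directly from the text. The sketch assumes that every event for element i occurs exactly i clock periods after the corresponding event for the first element, and that the vector length is 64. The dictionary keys and function names are hypothetical.

```python
# Clock periods for the first element, as quoted in the text for diagram (b).
FIRST_ELEMENT_EVENTS = {
    "arrives at V0": 8,
    "enters Multiply Unit (chain slot)": 9,
    "arrives at V1": 17,
    "enters Add Unit (chain slot)": 18,
    "arrives at V3": 25,
}

VECTOR_LENGTH = 64  # assumed full-length vector


def event_times(element_index):
    """Clock period of each event for a given element (0-based index).

    Assumes each successive element is staggered exactly one clock period
    behind its predecessor at every stage.
    """
    return {event: cp + element_index
            for event, cp in FIRST_ELEMENT_EVENTS.items()}


def v3_arrival(element_index):
    """Clock period in which the given element of V3 is delivered."""
    return event_times(element_index)["arrives at V3"]


first_v3 = v3_arrival(0)
last_v3 = v3_arrival(VECTOR_LENGTH - 1)
```

Under these assumptions the first element of V3 is delivered in clock period 25 and the last in clock period 88, so the whole chained operation takes only 63 clock periods beyond the time needed to produce its first result.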