Arithmetic Pipelines: the TI ASC

As pipelining techniques developed, it became clear that pipelining the hardware used to implement complex arithmetic operations such as floating-point addition and multiplication could enhance performance. Floating-point addition, for example, consists of four distinct operations (exponent subtraction, mantissa shifting, mantissa addition, normalisation) which can be pipelined in a very straightforward manner.

In a one-address system no benefit would be gained from pipelining arithmetic operations: since one of the input operands to an addition operation in a one-address system is always the Accumulator, it cannot be used as an input to a subsequent operation until the current operation has finished. This implies that temporal overlap of two successive additions is not possible if the result must always pass through the Accumulator. For successful operation of a pipelined arithmetic unit each instruction must reference at least two and preferably three operands.
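The cost of the Accumulator dependency can be illustrated with a small timing model. This is only a sketch: the function name `clocks_needed`, the four-stage depth and the rule of at most one issue per clock are assumptions made for illustration, not details of any particular machine.

```python
def clocks_needed(n_ops, stages, needs_previous_result):
    """Clocks until the last of n_ops results emerges from a pipeline.

    Hypothetical model: an operation takes `stages` clocks to pass through
    the pipeline and at most one operation may be issued per clock.
    """
    next_issue = 0      # earliest clock at which the next operation may issue
    last_finish = 0     # clock at which the most recent result emerges
    for _ in range(n_ops):
        if needs_previous_result:
            # One-address case: the Accumulator is both an input and the
            # result, so the next operation must wait for the previous one.
            start = max(next_issue, last_finish)
        else:
            # Independent operands: issue on the very next clock.
            start = next_issue
        last_finish = start + stages
        next_issue = start + 1
    return last_finish


# Ten additions through a four-stage adder.
print(clocks_needed(10, 4, needs_previous_result=True))   # 40
print(clocks_needed(10, 4, needs_previous_result=False))  # 13
```

In this model the dependent chain gains nothing from the pipelining, since each addition occupies the whole unit for four clocks, whereas independent operands approach one result per clock.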

In practice, even when this requirement is satisfied, the possibility of dependencies between successive instructions requires complex control mechanisms and may cause hold-ups in instruction execution. Early examples of these control techniques are the data forwarding arrangement used in the IBM System/360 Model 91 and the Scoreboard used in the CDC 6600 (see The IBM Common Data Bus and The CDC 6600 in the section on Parallel Function Units).

The ideal arrangement is for one instruction to cause two independent operand streams to be combined in the arithmetic unit to form a third independent result stream. Vector processing systems such as the CDC Cyber 200 Series, the Cray-1 and the Texas Instruments Advanced Scientific Computer (TI ASC) were designed to do just this.

Texas Instruments was originally a supplier of instrumentation to the oil industry and it was this industry's need for a powerful seismic processing capability that led to the development from 1966 onwards of the Advanced Scientific Computer [1]. An ASC processor could have up to four pipelined arithmetic units (AUs). Each AU was linked to the memory through its own Memory Buffer Unit (MBU), the primary function of the latter being to supply, from memory, a continuous stream of operands to the arithmetic unit, and to return to memory a continuous stream of arithmetic results.

Each AU was made up of eight distinct sections, each of which performed a separate arithmetic or logical operation (see figure). Each section could be connected to any other section to allow the correct sequence of operations to be executed for a particular instruction, with the appropriate configuration being established at the start of a vector instruction. In any given configuration the various sections formed a pipeline into which a new pair of operands could, in principle, be entered at each 60 ns clock, and after a start-up time, corresponding to as many clock periods as there were sections in use, result operands emerged at a rate of one per clock period. At the end of a vector instruction there was a similar run-down time between the entry of the last operand pair and the emergence of the corresponding result.
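The start-up behaviour described above can be put into numbers. The following is an idealised sketch that assumes a new operand pair really is available at every clock (no memory hold-ups); the function names are hypothetical, and the six-section list is the floating-point addition configuration named in the text.

```python
CLOCK_NS = 60  # clock period quoted in the text

# Sections used for floating-point addition, as listed in the text.
FLOAT_ADD_SECTIONS = [
    "Receiver Register", "Exponent Subtract", "Align",
    "Add", "Normalise", "Output",
]


def emergence_clocks(n_pairs, n_sections):
    """Clock at which each result emerges (idealised, no stalls).

    The first result appears after `n_sections` clocks (the start-up time);
    each later result follows one clock behind its predecessor.
    """
    return [n_sections + i for i in range(n_pairs)]


def vector_time_ns(n_pairs, n_sections, clock_ns=CLOCK_NS):
    """Total time for a vector instruction under the same assumptions."""
    if n_pairs == 0:
        return 0
    return emergence_clocks(n_pairs, n_sections)[-1] * clock_ns


sections = len(FLOAT_ADD_SECTIONS)
print(vector_time_ns(1, sections))    # 360
print(vector_time_ns(100, sections))  # 6300
```

Under these assumptions a 100-element vector needs 105 clocks, so the start-up cost becomes negligible only for long vectors.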

Floating-point addition, for example, required the use of the Receiver Register, Exponent Subtract, Align, Add, Normalise and Output sections, connected as shown by the solid line in the figure. Pairs of operands from the MBU were first copied into the Receiver Register, the cable delays between the MBU and AU effectively forming a complete stage in the overall pipeline arrangement. The Exponent Subtract section then performed a 7-bit subtraction to determine the difference between the exponents of the two floating-point operands, or in the case of equal exponents, used logic to determine which of the fractional mantissae was larger (this logic was also used by those instructions that tested for greater than, less than or equal to, in order to avoid duplication of hardware).

The exponent difference was used in the Align section to shift right the mantissa of the operand with the smaller exponent. In one cycle any shift which was a multiple of four could be carried out, this being all that is required for floating-point numbers represented in base 16. (Fixed-point right shifts required two cycles, one shifting by the largest multiple of four in the shift value, and a second in which the result of the first was re-entered and shifted by the residue of 0, 1, 2 or 3.)
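The alignment step and the two-pass fixed-point shift can be sketched as follows. This is an illustration under stated assumptions, not a description of the ASC logic: mantissas are modelled as plain unsigned integers, exponents as small integers counting powers of 16, signs and rounding are ignored, and the function names are hypothetical.

```python
def align(mant_a, exp_a, mant_b, exp_b):
    """Shift right the mantissa of the operand with the smaller exponent.

    Exponents count powers of 16, so a difference of d means a shift of
    d hexadecimal digits, i.e. 4 * d bits: always a multiple of four.
    Returns (mant_a, mant_b, common_exponent).
    """
    diff = exp_a - exp_b            # the Exponent Subtract step
    if diff >= 0:
        return mant_a, mant_b >> (4 * diff), exp_a
    return mant_a >> (4 * -diff), mant_b, exp_b


def fixed_right_shift(value, count):
    """Fixed-point right shift carried out in two passes."""
    residue = count % 4
    coarse = count - residue        # largest multiple of four in the count
    first_pass = value >> coarse    # first cycle: multiple-of-four shift
    return first_pass >> residue    # second cycle: residue of 0, 1, 2 or 3


print(align(0x123456, 3, 0xABCDEF, 1) == (0x123456, 0xABCD, 3))  # True
print(hex(fixed_right_shift(0xFF00, 7)))                         # 0x1fe
```

The first example shifts the second mantissa by two hexadecimal digits (eight bits) because its exponent is smaller by two.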

Having been correctly aligned, the fractional parts of the two floating-point numbers were added in the Add section, and the result passed on to the Normalise section. This section closely resembled the Align section in that floating-point operations only required one cycle, while the fixed-point left shifts which it also carried out required two. The major difference between these two sections was that Align received information concerning the length of shift required in floating-point operations, while the Normalise section had to compute the shift length by determining which four-bit group contained the most significant digit. It also contained an adder to update the exponent value when a normalisation shift occurred. The results of all arithmetic operations passed through the Output section before being returned to the Memory Buffer Unit. The partitioning of the arithmetic unit into these various sections was primarily intended to give high throughput of floating-point addition and subtraction. Each section was capable of operating on double length operands so that vector double length instructions could proceed at the clock rate. Double length multiplication, and all divides (which were performed by an iterative technique), proceeded more slowly.
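The work of the Normalise section described above, finding the four-bit group that holds the most significant digit and adjusting the exponent to match, can be sketched as below. The 24-bit fraction width, the function name and the treatment of a zero mantissa are assumptions made for illustration; a carry out of the addition and exponent underflow are not modelled.

```python
def normalise(mantissa, exponent, fraction_bits=24):
    """Base-16 normalisation of an unsigned mantissa.

    Assumes 0 <= mantissa < 2**fraction_bits. The shift length is found by
    locating the most significant non-zero four-bit group; the mantissa is
    then shifted left by that many hexadecimal digits in one step and the
    exponent is reduced by the same number.
    """
    if mantissa == 0:
        return 0, exponent                      # nothing to shift (assumed convention)
    leading_zero_bits = fraction_bits - mantissa.bit_length()
    shift_digits = leading_zero_bits // 4       # whole hexadecimal digits only
    return mantissa << (4 * shift_digits), exponent - shift_digits


m, e = normalise(0x000ABC, 5)
print(hex(m), e)   # 0xabc000 2
```

Because shifts are by whole hexadecimal digits, a normalised fraction may still have up to three leading zero bits, as in `0x1ABCDE`, which is left unchanged.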

The dashed line in the figure shows the interconnection used for fixed-point multiplication. The Multiply section could perform a 32 by 32-bit multiplication in one clock period, so that the results of both fixed-point and single-length floating-point multiplication were available after one pass through the multiplier. Because a carry-save addition technique was used, the output of the Multiply section consisted of a 64-bit pseudo-sum and a 64-bit pseudo-carry. These were added in the Add section to produce the true result. Double-length multiplication required three separate 32 by 32-bit multiplications to be performed and these could therefore proceed at a rate of only one every three clocks. After passing through the Add section the three separate results were added together in their proper bit positions in the Accumulate section.
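The idea of a pseudo-sum and pseudo-carry can be shown with a generic carry-save multiplier. This sketch is not the ASC circuit: it handles unsigned operands only, reduces the partial products one at a time in a loop rather than in a hardware tree, and all names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def carry_save_add(a, b, c):
    """Reduce three addends to two without propagating carries.

    Each bit position is handled independently: the pseudo-sum is the
    bitwise sum without carries, the pseudo-carry holds the carries,
    moved one place to the left.
    """
    pseudo_sum = a ^ b ^ c
    pseudo_carry = ((a & b) | (a & c) | (b & c)) << 1
    return pseudo_sum & MASK64, pseudo_carry & MASK64


def multiply_carry_save(x, y):
    """Unsigned 32 by 32-bit multiply returning (pseudo_sum, pseudo_carry)."""
    pseudo_sum, pseudo_carry = 0, 0
    for bit in range(32):
        if (y >> bit) & 1:
            partial_product = x << bit
            pseudo_sum, pseudo_carry = carry_save_add(
                pseudo_sum, pseudo_carry, partial_product)
    return pseudo_sum, pseudo_carry


def add_section(pseudo_sum, pseudo_carry):
    """The single carry-propagating addition that yields the true result."""
    return (pseudo_sum + pseudo_carry) & MASK64


ps, pc = multiply_carry_save(0xFFFFFFFF, 0xFFFFFFFF)
print(hex(add_section(ps, pc)))   # 0xfffffffe00000001
```

Only the final step propagates carries across the full 64 bits, which is why it makes sense to leave that step to a separate adder.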

The Accumulate section was similar to the Add section and was used in all instructions which required a running total to be maintained. An important example of this type of instruction is the Vector Dot Product, which is used repeatedly, for example, in matrix multiplication. Pairs of operands are multiplied together in this instruction and a single scalar result, equal to the sum of the products of the pairs, is produced. Because the running total was maintained in the arithmetic unit, the read-after-write problems which occur in scalar implementations of this operation were avoided in the ASC.
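The arithmetic performed by a dot product instruction, and its repeated use in matrix multiplication, can be sketched as follows. This is a functional illustration only, with hypothetical function names; it says nothing about the timing of the hardware, and the local variable simply stands in for the running total held in the Accumulate section.

```python
def vector_dot_product(xs, ys):
    """Sum of the products of corresponding elements of two vectors."""
    if len(xs) != len(ys):
        raise ValueError("vectors must have the same length")
    running_total = 0
    for x, y in zip(xs, ys):
        running_total += x * y   # one multiply, then one accumulate
    return running_total


def matrix_multiply(a, b):
    """Matrix product built from one dot product per result element."""
    columns_of_b = list(zip(*b))
    return [[vector_dot_product(row, col) for col in columns_of_b]
            for row in a]


print(vector_dot_product([1, 2, 3], [4, 5, 6]))           # 32
print(matrix_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```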