Principles of pipeline concurrency

Pipelining is an implementation technique for high performance architectures which is normally invisible to the programmer. Consider an arbitrary computing structure for evaluating a function F, on a stream of input values x1, ..., xn, to produce a stream of output values F(x1), ..., F(xn). The performance of this computing structure can be characterised firstly in terms of its latency, and secondly in terms of its throughput. The latency is defined as the delay between supplying input value xi and receiving at the output of F the value F(xi). The throughput is defined simply as the rate at which output values are produced by F. In the simplest case, where output i appears at the same time that input i+1 is presented to F, the throughput of F will be the reciprocal of the latency.

Improving the performance of such a computing structure means increasing the throughput. One way to do this is to reduce the latency, usually by implementing F using faster logic circuits. However, there are limitations on the performance improvements attainable in this way, and so other techniques have been sought. If the function F consists of a sequence of sub-functions, for example a, b and c, such that F(xi) = a(b(c(xi))), then it is possible to implement F using pipelined logic. The decomposition of F into three sub-functions is illustrated in figure 1.

Each stage in the pipeline contains some function evaluation logic, and between the stages there are latches for storing the intermediate values. Since the latency of each sub-function is less than the latency of F, the rate at which new input values can be presented to each stage is greater than the rate at which they can be presented to a non-pipelined implementation of F. If a common clock is used to latch the intermediate values at all stages in the pipeline, and the logic at stage i has a latency of τi, then assuming a latch delay of φ, the clock period, τ, of a k-stage pipeline can be defined as

τ = max{τ1, ..., τk} + φ = τm + φ
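As a check on this relation, the following sketch computes the clock period for a small set of stage latencies; the values used here are purely illustrative, not taken from any particular machine:

```python
# Clock period of a synchronous k-stage pipeline: the slowest stage
# plus the latch delay determines the period (illustrative values).
stage_latency = [4.0, 6.0, 5.0]   # tau_i for each stage, in ns
latch_delay = 1.0                 # phi, in ns

tau = max(stage_latency) + latch_delay   # tau = tau_m + phi
print(tau)  # 7.0: dominated by the slowest (6.0 ns) stage
```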

The activity within a pipeline can be represented pictorially using a space-time diagram, as shown in figure 2, where the space-time behaviour of a four-stage pipeline is depicted for a sequence of six input values x1 ... x6. From figure 2 it can be seen that for a k-stage pipeline a time equivalent to k clock periods elapses before the pipeline is full, after which one result appears at the output every clock period. Therefore, if a sequence of n data items is processed by a k-stage pipeline, it will take a time of kτ to produce the first result, and a time of (n - 1)τ to produce the remaining (n - 1) results. Hence the total time to process n data items in a k-stage pipeline is

Tk = kτ + (n - 1)τ = (k + n - 1)τ
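The fill-then-stream behaviour described above can be visualised with a short sketch that prints a textual space-time diagram in the manner of figure 2; the four-stage, six-input case is assumed purely for illustration:

```python
# Textual space-time diagram: rows are stages, columns are clock
# periods, and each entry names the item occupying that stage.
# A four-stage pipeline with six inputs is assumed (as in figure 2).
k, n = 4, 6
total = k + n - 1   # clock periods until the last result emerges

for stage in range(k):
    row = []
    for t in range(total):
        item = t - stage   # index of the item in this stage at time t
        row.append(f"x{item + 1}" if 0 <= item < n else "--")
    print(f"stage {stage + 1}: " + " ".join(row))
```

The first row shows a new item entering every clock period; the last row shows the first result emerging after k periods, with one result per period thereafter.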

If we now make the assumption that a non-pipelined implementation of F has a latency of kτ, then the same n data items will take a time T1 = nkτ when processed on a non-pipelined processor, effectively a one-stage pipeline, which evaluates an equivalent function.

The speedup resulting from the use of a k-stage pipelined implementation is defined as Sk = T1/Tk, and this is given by

Sk = T1/Tk = nk / (k + (n - 1))
and in the limit, as the number of data items tends to infinity,

lim(n→∞) T1/Tk = k
This defines the asymptotic maximum speedup, in terms of throughput, and this is equal to the number of pipeline stages. In general, of course, the flow of data items in any real pipeline is subject to disruptions of one kind or another, and maximum throughput cannot be maintained continuously. For a pipeline which processes a finite stream of data items between disruptions we can define two further parameters which characterise its performance: efficiency and net throughput. The efficiency, η, of a k-stage pipeline processing n data items is defined as
η = speedup / number of stages = n / (k + (n - 1))
while the net throughput under these conditions, r, is defined as the actual number of results produced per second, and therefore
r = efficiency × peak processing rate = η/τ

When the number of data items being processed in a pipeline approaches infinity, η → 1 and r → 1/τ.
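The formulas for Tk, T1, speedup, efficiency and net throughput can be exercised together in a short sketch; the stage count and clock period below are illustrative values only. As n grows, Sk approaches k, η approaches 1, and r approaches 1/τ, as derived above:

```python
# Evaluate T_k, T_1, speedup, efficiency and net throughput for a
# k-stage pipeline; k and tau are illustrative values only.
k = 4        # number of pipeline stages
tau = 7.0    # clock period (tau_m + phi), in ns

for n in (6, 100, 1_000_000):
    t_pipe = (k + n - 1) * tau    # T_k: fill time plus streaming
    t_seq = n * k * tau           # T_1: non-pipelined equivalent
    speedup = t_seq / t_pipe      # S_k = nk / (k + (n - 1))
    eff = speedup / k             # eta = n / (k + (n - 1))
    rate = eff / tau              # r = eta / tau
    print(f"n={n}: S_k={speedup:.3f}  eta={eff:.3f}  r={rate:.4f} per ns")
```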

Given that a pipeline cannot be maintained in a continuously busy state, there are a number of conditions which the designer should attempt to satisfy in order to maintain throughput as far as possible. These are

  1. The latency of all pipeline stages should be equal.
  2. The computations at each stage in the pipeline should be naturally sequential, and should be independent of each other.
  3. A continuous flow of input values should be presented to the pipeline, and the resulting output values should be processed by the succeeding logic at the rate at which they are produced.
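A brief sketch, using invented latency figures, illustrates the cost of violating conditions 1 and 3: with a common clock the slowest stage sets the period, so unbalanced stages waste time in the faster ones, and each break in the input stream adds a whole clock period to the total time:

```python
# Cost of violating the conditions above (invented latency figures).
latch = 1.0
balanced = [5.0, 5.0, 5.0, 5.0]     # condition 1 satisfied
unbalanced = [2.0, 8.0, 4.0, 6.0]   # same total logic, badly split

# With a common clock the slowest stage sets the period, so the
# unbalanced split wastes time in the three faster stages.
for stages in (balanced, unbalanced):
    tau = max(stages) + latch
    print(f"{stages} -> period {tau}, peak rate {1 / tau:.3f} per ns")

# A break in the input stream (condition 3) inserts bubbles; each
# missing input delays the final result by one whole clock period.
k, n, bubbles = 4, 100, 10
tau = max(balanced) + latch
print("time without bubbles:", (k + n - 1) * tau)            # 618.0
print("time with bubbles:   ", (k + n - 1 + bubbles) * tau)  # 678.0
```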

In considering the design of real pipelines we shall see not only how the designers have attempted to satisfy these conditions, but also how a variety of problems arises to thwart their best intentions.

