Tomasulo's Algorithm
Tomasulo's Algorithm, so called because it was first described in a
paper by R.M. Tomasulo [1], was first used in the IBM System/360
Model 91 Floating-point Unit. Just like the Scoreboard mechanism
originally designed for the CDC 6600, Tomasulo's Algorithm was
designed to control the flow of data between a set of programmable
floating-point registers and a group of parallel arithmetic
units. Both are designed to ensure that constraints imposed by
instruction dependencies are satisfied. Tomasulo's Algorithm is
different in that it systematically attempts to minimise delays
between the production of a result by one operation and the start of a
subsequent operation which requires that result as an input.
The algorithm works by using additional registers, known as
Reservation Stations, at the inputs to the arithmetic units and a
system of tags which direct result operands to where they are next
needed, rather than necessarily to where they would have gone when the
instructions which produced them were issued. This is essentially a
register renaming mechanism, and Tomasulo's Algorithm was one
of the first examples of this mechanism, now used in a variety of high
performance processors. Tomasulo's Algorithm itself is still used
today in processors such as the PowerPC 620.
The IBM System/360 Model 91
Tomasulo's Algorithm was first used in conjunction with the IBM Common
Data Bus used to interconnect the component parts of the System/360
Model 91 Floating-point Unit. The Model 91 was developed in the mid
1960s with the primary aim of providing "the highest performance
capability that advanced design philosophy and System/360 circuit
technology extensions could achieve, within a balanced development
schedule". Performance was taken to mean general
computer availability and high-speed execution of general problem
programs, and a factor of between one and two orders of magnitude over
the earlier 7090 was achieved, depending on the nature of the
problem. Only a small number of Model 91s were actually produced, but
most of its architectural features were carried over into the
commercially more successful Model 195.
One of the problems facing the designers of high-speed computer
systems at the time was the difficulty of achieving the fastest
possible execution times for a particular technology in universal
execution units. Circuitry designed to carry out both multiplication
and addition, for example, could do neither as fast as two units each
limited to one kind of operation. Thus not only did the Model 91
contain separate fixed-point and floating-point execution areas, but
the floating-point area contained separate add and multiply/divide
units capable of concurrent operation. Although both units were
internally pipelined, only the add unit (a two-stage pipeline) could
start a new operation in each (60 ns) clock cycle. The multiply/divide
unit took three cycles to execute a multiply and 12 cycles to execute
a divide, but could only deal with one instruction at a time.