Tomasulo's Algorithm

Tomasulo's Algorithm, so called because it was first described in a paper by R.M. Tomasulo [1], was first used in the IBM System/360 Model 91 Floating-point Unit. Just like the Scoreboard mechanism originally designed for the CDC 6600, Tomasulo's Algorithm was designed to control the flow of data between a set of programmable floating-point registers and a group of parallel arithmetic units. Both are designed to ensure that constraints imposed by instruction dependencies are satisfied. Tomasulo's Algorithm is different in that it systematically attempts to minimise delays between the production of a result by one operation and the start of a subsequent operation which requires that result as an input.

The algorithm works by using additional registers, known as Reservation Stations, at the inputs to the arithmetic units and a system of tags which direct result operands to where they are next needed, rather than necessarily to where they would have gone when the instructions which produced them were issued. This is essentially a register renaming mechanism, and Tomasulo's Algorithm was one of the first examples of this mechanism, now used in a variety of high performance processors. Tomasulo's Algorithm itself is still used today in processors such as the PowerPC 620.

The IBM System/360 Model 91

Tomasulo's Algorithm was first used in conjunction with the IBM Common Data Bus used to interconnect the component parts of the System/360 Model 91 Floating-point Unit. The Model 91 was developed in the mid 1960s with the primary aim of providing "the highest performance capability that advanced design philosophy and System/360 circuit technology extensions could achieve, within a balanced development schedule". Performance was taken to mean general computer availability and high-speed execution of general problem programs, and a factor of between one and two orders of magnitude over the earlier 7090 was achieved, depending on the nature of the problem. Only a small number of Model 91s were actually produced, but most of its architectural features were carried over into the commercially more successful Model 195.

One of the problems facing the designers of high-speed computer systems at the time was the difficulty of achieving the fastest possible execution times for a particular technology in universal execution units. Circuitry designed to carry out both multiplication and addition, for example, could do neither as fast as two units each limited to one kind of operation. Thus not only did the Model 91 contain separate fixed-point and floating-point execution areas, but the floating-point area contained separate add and multiply/divide units capable of concurrent operation. Although both units were internally pipelined, only the add unit (a two-stage pipeline) could start a new operation in each (60 ns) clock cycle. The multiply/divide unit took three cycles to execute a multiply and 12 cycles to execute a divide, but could only deal with one instruction at a time.