Hardware Prefetching

In hardware-controlled schemes, the decision to prefetch is based on accesses to the current cache block. Many different schemes of this type have been proposed, and we will look at several of them. One of the advantages of hardware-based prefetching is that prefetches are handled dynamically by the hardware at runtime, without compiler intervention. However, extra hardware resources are required, and unwanted data is more likely to be prefetched.

One Block Lookahead (OBL)

This policy is the simplest of these schemes: upon referencing block i, block i+1 is prefetched. Extensions of this scheme have been proposed but, while they take advantage of limited (sequential) spatial locality, none of them can deal with large strides.
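As a concrete illustration, the sketch below applies OBL to a toy block-addressed cache model; the SimpleCache class and its methods are purely illustrative and not taken from any of the cited proposals.

    class SimpleCache:
        """Toy cache model: the set of currently resident block numbers."""
        def __init__(self):
            self.blocks = set()

        def contains(self, block):
            return block in self.blocks

        def fetch(self, block):
            # Demand fetches and prefetches both simply install the block.
            self.blocks.add(block)

    def access(cache, block):
        """Demand access to a block under One Block Lookahead."""
        if not cache.contains(block):
            cache.fetch(block)            # demand miss
        if not cache.contains(block + 1):
            cache.fetch(block + 1)        # OBL: prefetch the next block

    cache = SimpleCache()
    for b in range(4):    # sequential scan: after the miss on block 0,
        access(cache, b)  # every later block has already been prefetched

As the usage loop shows, a purely sequential scan incurs only the initial miss, which is exactly the limited spatial locality OBL exploits.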

Stride Directed Prefetching

This scheme was proposed by Fu and Patel [5] to improve numerical program cache performance on a vector processor. The proposed scheme uses the relationship between the vector stride distance (carried by the vector instructions) and the cache block size to direct prefetching.
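The exact policy of [5] is not reproduced here; the sketch below only illustrates the underlying idea of comparing the vector stride (in bytes) against the cache block size to choose the next block to prefetch. The block size and function name are assumptions.

    BLOCK_SIZE = 64  # assumed cache block size in bytes

    def stride_directed_target(addr, stride_bytes):
        """Return the block number to prefetch after accessing addr
        with the given vector stride (in bytes)."""
        if abs(stride_bytes) < BLOCK_SIZE:
            # Consecutive elements fall within adjacent blocks:
            # behave sequentially and prefetch the next block.
            return addr // BLOCK_SIZE + 1
        # Large stride: prefetch the block holding the next element.
        return (addr + stride_bytes) // BLOCK_SIZE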

They later extended this method to scalar multiprocessors [6], where the stride is not explicitly known, by using a stride prediction table (SPT). This method can significantly reduce the number of cache misses, with low overhead, for numerically intensive programs that can be highly vectorized. For other programs, however, especially those with a large number of stride changes, the cache miss rate is still reduced but the overhead is generally more significant, and Fu and Patel conclude that stride directed prefetching is not appropriate for such programs.
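A minimal model of such a table, indexed by the load instruction's address and remembering the last data address it touched, might look as follows; this is a simplification, and the organization and replacement details of the actual SPT in [6] are not modeled.

    class StridePredictionTable:
        """Simplified SPT: maps a load instruction's PC to the last
        data address it referenced, so a stride can be inferred."""
        def __init__(self):
            self.last_addr = {}

        def predict(self, pc, addr):
            """Record this access; return a predicted prefetch address,
            or None if no stride can be inferred yet."""
            prev = self.last_addr.get(pc)
            self.last_addr[pc] = addr
            if prev is None or addr == prev:
                return None
            return addr + (addr - prev)   # assume the stride repeats

    spt = StridePredictionTable()
    for i in range(4):
        # A stride-8 load stream: from the second access onward the
        # table predicts the next address (0x1010, 0x1018, ...).
        spt.predict(pc=0x400, addr=0x1000 + 8 * i)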

Lookahead Data Prefetching

This scheme was originally proposed by Baer and Chen in [1], and versions of it are examined in [2, 4]. It combines the advantages of stride information and instruction lookahead.

The required hardware consists of a support unit for a data cache, whose design is based on the prediction of the instruction stream's execution and of the operand references in load/store instructions. The latter, and their referencing patterns, are kept in a reference prediction table (RPT) (see figure). "The RPT will be accessed ahead of the regular program counter (PC) by a look-ahead program counter (LA-PC). The LA-PC is incremented and maintained in the same fashion as the PC with the help of a dynamic branch prediction mechanism. The LA-PC/RPT combination is used to detect regular data accesses in the RPT and to generate prefetching requests." [2] The prefetched data blocks are put in the data cache. Note that the cache concerned may be blocking or non-blocking.
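Each RPT entry can be modeled as holding the last address referenced by a load/store instruction, its current stride, and a small state machine that decides whether the stride is trustworthy enough to prefetch on. The sketch below is a simplified rendering of the four-state scheme described in [1, 2]; the field names and the choice to prefetch only in the steady state are modeling decisions, not details taken from the papers.

    from dataclasses import dataclass

    INITIAL, TRANSIENT, STEADY, NO_PRED = range(4)

    @dataclass
    class RPTEntry:
        prev_addr: int
        stride: int = 0
        state: int = INITIAL

    class ReferencePredictionTable:
        """Simplified RPT, indexed by the load/store instruction's PC."""
        def __init__(self):
            self.entries = {}

        def update(self, pc, addr):
            """Update the entry for this instruction; return a prefetch
            address when its state machine trusts the current stride."""
            e = self.entries.get(pc)
            if e is None:
                self.entries[pc] = RPTEntry(prev_addr=addr)
                return None
            correct = (addr - e.prev_addr == e.stride)
            old = e.state
            if old == INITIAL:
                e.state = STEADY if correct else TRANSIENT
            elif old == TRANSIENT:
                e.state = STEADY if correct else NO_PRED
            elif old == STEADY:
                e.state = STEADY if correct else INITIAL
            else:  # NO_PRED
                e.state = TRANSIENT if correct else NO_PRED
            if not correct and old != STEADY:
                e.stride = addr - e.prev_addr   # retrain on a mispredict
            e.prev_addr = addr
            return addr + e.stride if e.state == STEADY else None

A single mispredict in the steady state demotes the entry without retraining the stride, so one irregular access (for example at the boundary of a nested loop) does not destroy an otherwise stable pattern.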

The idea is to keep enough distance between the PC and the LA-PC so that the prefetched data arrives "just before" it is needed. The LA-PC initially points to the instruction following the current PC and moves ahead of the PC when execution stalls on real misses. Incorrect branch predictions and other mechanisms stop the lookahead distance from growing too large. Since the support unit is not on the critical path, its presence should not increase the processor cycle time.
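How the lookahead distance might be throttled can be sketched as follows; the distance cap and the predict_next helper are illustrative, not taken from [1].

    MAX_LOOKAHEAD = 64   # illustrative cap (in instructions) on LA-PC - PC

    def advance_la_pc(pc, la_pc, predict_next, mispredicted):
        """Advance the lookahead PC by one step, keeping it within a
        bounded distance of the real PC."""
        if mispredicted:
            return pc + 1           # wrong-path lookahead: resynchronize
        if la_pc - pc >= MAX_LOOKAHEAD:
            return la_pc            # far enough ahead: let the PC catch up
        return predict_next(la_pc)  # follow the predicted instruction stream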

Chen and Baer claim that data pollution resulting from erroneous prefetches can be almost totally avoided by fine-tuning the RPT finite state machines. However, the scheme relies heavily on good branch prediction to predict the lookahead stream.