Reliability and fault-tolerance

The reliability of high performance computer systems is often treated as secondary to the primary goal of designing for maximum throughput. However, the operating efficiency of a high performance system is the product of its throughput and its availability, and the availability of a machine depends on both its mean time between failures (MTBF) and its average down time (ADT). This is nowhere more important than in multiprocessor architectures since, as we shall see, the MTBF of very highly parallel MIMD systems can be extremely poor.
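This relationship can be made concrete with a small sketch. Availability is commonly expressed as MTBF / (MTBF + ADT); the figures below are purely illustrative and not drawn from any particular machine:

    # Hypothetical figures illustrating efficiency = throughput x availability.
    mtbf = 100.0          # mean time between failures, hours (assumed)
    adt = 8.0             # average down time per failure, hours (assumed)
    throughput = 250.0    # nominal throughput, MFLOPS (assumed)

    availability = mtbf / (mtbf + adt)      # fraction of time the machine is usable
    efficiency = throughput * availability  # effective sustained throughput

    print(f"availability = {availability:.3f}")
    print(f"effective throughput = {efficiency:.1f} MFLOPS")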

It is a commonly held belief that multiprocessor architectures are inherently fault-tolerant, since the replication of processing elements provides a natural supply of spares which can be switched in when the occasional processor fails. However, designing machines which are fault-tolerant (and hence reliable) involves a great deal more than simply providing spares. For example, each fault must be located before any hardware reconfiguration can be performed, and once the fault has been rectified the state of the computation prior to the fault must be reinstated if fault processing is to be transparent.

Consider a hypothetical multiprocessor system containing 1000 processing elements, each consisting of just 100 components. If it is assumed that the failure rate for each component is 10⁻⁷ failures per hour (λ = 10⁻⁷), then the MTBF for each component is 1/λ = 10 million hours. The MTBF for the whole system, treated as a series system, is the MTBF of a single component divided by the total number of components. Since there are 1000 × 100 = 100,000 components, the MTBF for the multiprocessor system will be just 100 hours, or approximately four days. This calculation includes only failures caused by faulty components. In addition there are transient faults caused by environmental factors, such as changes in ambient temperature or even cosmic radiation, and intermittent faults caused by poor production quality; these normally occur more often than component faults. Beyond the problems of hardware reliability there are further problems associated with software reliability, and these are compounded in multiprocessor systems by the added software complexity of process synchronisation and communication. Diagnosing software faults in a multi-process environment can be a particularly difficult task.
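The arithmetic above can be reproduced directly; a minimal sketch using the failure-rate figures quoted in the text:

    # Series-system MTBF from a per-component failure rate (figures from the text).
    failures_per_hour = 1e-7          # lambda, per component
    components_per_pe = 100           # components per processing element
    processing_elements = 1000

    total_components = processing_elements * components_per_pe   # 100,000
    component_mtbf = 1 / failures_per_hour                       # 10 million hours
    system_mtbf = component_mtbf / total_components              # 100 hours

    print(f"system MTBF = {system_mtbf:.0f} hours "
          f"(about {system_mtbf / 24:.1f} days)")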

Improvements in technology could, in the future, reduce the failure rate per gate within multiprocessor systems through the use of higher levels of integration. However, higher levels of integration will also result in systems with larger numbers of processors, and hence the problem of reliability will remain. Since faults cannot be avoided, and prolonged unavailability of high performance systems is unacceptable, the only alternative is to design high performance multiprocessors for maximum resilience and fault-tolerance.

Designing for maximum resilience means discovering which components are least reliable, and either minimising their use or making them more reliable. Designing for fault-tolerance means two things: first, designing systems with the ability to detect the occurrence of an error, and second, imbuing those systems with the ability to correct and recover from an error. There are many ways in which the detection of and recovery from faults can be implemented. One well-known method replicates sensitive components (usually three times) and accepts the behaviour exhibited by the majority; this is known as triple-modular-redundancy, or TMR. This level of redundancy can be very costly, and of course the logic which compares the behaviour of the replicated modules may itself be faulty.
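A minimal sketch of TMR voting follows; the interface is illustrative rather than taken from any real machine, and note that in practice the voter itself must also be protected:

    # Triple-modular redundancy: run three replicas, accept the majority result.
    def tmr_vote(replica_outputs):
        a, b, c = replica_outputs
        if a == b or a == c:
            return a            # at least two replicas agree with a
        if b == c:
            return b
        raise RuntimeError("no majority: more than one replica has failed")

    # Example: the second replica has suffered a transient fault.
    result = tmr_vote([42, 41, 42])
    print(result)               # 42 -- the faulty replica is outvoted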

The techniques applied in uniprocessor architectures to detect faults, such as error-detecting codes (SECDED and parity checks), can be applied within each processing element of a multiprocessor system. However, a multiprocessor system faces the further problem of the reliability of the network logic which connects processors to memories, or processor-memory pairs to each other. It is well known that the most unreliable elements in a computer system are the electrical connections between physically distinct components, and these are a particularly common source of intermittent faults. Hence, in a large multiprocessor system it is reasonable to expect faults in the interconnection network, since it normally contains large numbers of wires and connectors. The protocol for data movement through the network should therefore be robust, and capable of detecting and correcting transient errors. More permanent errors in the network logic will result in one or more paths becoming unusable. This may in turn reduce the connectivity of the network, and leave one or more processors unable to communicate with the rest of the system. This can be overcome by designing networks with multiple paths between every pair of connectable components, so that if one path becomes inoperable another can be used.
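One way such a robust data-movement protocol might combine these ideas is sketched below, under assumed interfaces: send_over(path, frame) is a hypothetical transmission routine returning the frame as received at the far end, and a per-message CRC stands in for whatever error-detecting code the network actually uses. Transient errors are masked by retrying; a persistently failing path is abandoned in favour of an alternative:

    import zlib

    MAX_RETRIES = 3

    def send_reliably(message: bytes, paths, send_over):
        """Deliver a message with checksummed retries, falling back to
        alternative paths when a route appears permanently dead."""
        frame = message + zlib.crc32(message).to_bytes(4, "big")
        for path in paths:                      # multiple paths between each pair
            for _ in range(MAX_RETRIES):        # retries mask transient errors
                received = send_over(path, frame)
                data, crc = received[:-4], received[-4:]
                if zlib.crc32(data).to_bytes(4, "big") == crc:
                    return data                 # delivered intact
            # persistent failure: treat this path as unusable, try the next
        raise ConnectionError("all paths between this pair are inoperable")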

Large high performance computers are sometimes designed with a particularly time-consuming application in mind. For example, the IBM GF11 project† was designed primarily for the solution of numerical problems in quantum chromodynamics. A calculation of particular interest has been estimated to take approximately one year on the GF11 machine, and under these circumstances reliability is a very important consideration. Since it is highly unlikely that a year-long computation could ever proceed to completion without encountering a system failure, such lengthy calculations must be partitioned into a sequence of computational segments which each occupy a time-span somewhat shorter than the system MTBF. At the end of each segment the state of the computation must be saved, allowing the computation to be rolled back to a previously known correct position and restarted in the event of failure. This ensures that the amount of time wasted as a result of each system failure is limited to the time for one segment of the computation.
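A minimal sketch of this checkpoint-and-rollback scheme follows; the segment and state-saving interfaces are hypothetical, chosen only to illustrate the structure:

    import pickle

    def run_with_checkpoints(segments, initial_state, checkpoint_file):
        """Run a long computation as a sequence of segments, saving state
        after each so a failure costs at most one segment's work.
        'segments' is a list of functions, each mapping state -> state."""
        state, start = initial_state, 0
        try:
            with open(checkpoint_file, "rb") as f:  # resume after a failure
                start, state = pickle.load(f)
        except FileNotFoundError:
            pass                                    # fresh run: no checkpoint yet
        for i in range(start, len(segments)):
            state = segments[i](state)              # one segment << system MTBF
            with open(checkpoint_file, "wb") as f:
                pickle.dump((i + 1, state), f)      # new roll-back point
        return state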


† The component count for the GF11 machine was approximately 4 × 10⁵, 1296 of which were located in the network [1].