The BSP was a 48-bit machine, and each arithmetic unit (AU) performed floating-point addition and multiplication in two 160 ns clock periods. The four units which constitute the array (the AUs, memories, result routing switch, and operand routing switch) formed a five-stage macro-pipeline, and by careful scheduling of micro-instructions the Array Control Unit (ACU) was able to overlap instructions in order to maximise the utilisation of the macro-pipeline. The BSP operated by partitioning multi-dimensional array operations between the AUs on an element-by-element basis. The ACU received instructions from the scalar processor and decomposed them into micro-operations which were then scheduled using templates. These were effectively pre-computed assignments of the five stages in the circular macro-pipeline to the micro-cycles within each instruction.
A typical sequence of micro-operations required to process each group of sixteen operands would be
In the BSP the unit of parallelism is the vector, and elements of these vectors are accessed at index locations which can vary by a fixed increment. This increment may be any integer value, and this allows rows, columns and diagonals of multi-dimensional arrays which are mapped on to a BSP vector to be extracted by the ACU with ease. For example, a two-dimensional array X may be defined in Pascal notation as
X : array [1..column_length, 1..row_length] of real;
and in the BSP this would be laid out in memory in a column-wise manner. Therefore, to extract a column requires an inter-element stride equal to 1, and to extract a row requires a stride equal to column length. Arbitrary diagonals can be extracted by using a stride equal to column_length + 1 or column_length - 1. High performance processing of these arrays is achieved by extracting sixteen elemental operand sets and presenting them to the sixteen arithmetic units in parallel, and naturally maximum throughput of the array can only occur if all elements are located in different memory modules. The interleaving scheme in the BSP therefore incorporated 17 memory modules, the lowest prime number greater than 16. For memory address a, the module number m containing that address is hence given by
m = | a |17
and the offset i within module m is given by
i = ⌊ a/17 ⌋
This means that if we pick 16 values for a, separated by a constant value d, each will yield a different value for i provided d is not a multiple of 17. This results in a high probability of conflict-free access to many common array structures.
The movement of operands between memory modules and arithmetic modules was performed by two routing switches, one for input and one for output operands. Each routing switch comprised a full 16 x 17 cross-bar switch, moving data in units of 48 bits, plus error control bits. These switches had a maximum throughput of 16 words every 160nS, or 100 Mwords per second. The memory modules had a cycle time of 160nS, and most arithmetic functions required two clock cycles. This gave the BSP a peak operating speed of 50 MFLOPS.
Only a single pre-production version of the BSP was ever produced; by the time it had completed its development phase it had been superseded by the CRAY-1, and in 1979 Burroughs suspended the BSP project. One of the primary reasons for the demise of the BSP was arguably the decision to go for a slow clock speed and non-pipelined logic in the arithmetic units. This resulted in a lower peak performance than would otherwise have been possible from the ALUs, but made tractable the problem of connecting a small but significant numbers of ALUs to a physically common memory. The designers of the BSP believed that the ease with which its peak performance could be approached would counteract the slow clock speed argument, and as Austin observed [1]
"Simply put, the clock frequency does not indicate how fast a machine runs, just how often it stops !"If the arithmetic units of the BSP had been pipelined internally the whole machine would have had a structure similar to a multiple-pipe vector machine, such as the CYBER 205 or the NEC SX Series machines [2].
Despite the curtailment of its commercial career the BSP is often cited as an example of a global-memory array processor, and excellent accounts of its detailed architecture and engineering can be found in Kuck & Stokes [3] and Hockney & Jesshope [4] (pages 198-211). The cost of providing full access to all memory modules in a shared-memory SIMD architecture has meant that to date no large-scale commercial systems have used this form of architecture.