Fundamentals: Introduction

Visual data constitutes practically all of the estimated 10^9 bits of information received by human sensory receptors each second. Computer vision is an enormously complex task, not just because of the high throughput of data, but also due to the complexity of the process of perception. Humans can recognise common objects like cups and saucers, which vary widely in shape and appear under very different lighting conditions, yet cannot always recognise the words ``hello world'' when spoken in an unusual dialect.

Generic vision systems are the ``holy grail'' of vision research. Currently, an effective image processing or computer vision system will be highly specialised, containing task-oriented features which are specific to the particular application area. Indeed, it is this task-specific quality which has held back the widely predicted growth of applied vision systems, since each individual application may require a disproportionate amount of development effort which is not readily transferable to other tasks. This is particularly true of systems which do not ``understand'' a 2D or 3D image as such, but extract features and classify objects into pre-determined categories: for example, a fruit grading system which examines the colour, texture and diameter of oranges, or a PCB inspection system which compares a circuit board to a master template on a point-for-point basis.
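To make the task-specific flavour of such systems concrete, the following is a minimal sketch of the point-for-point template comparison mentioned above. It is not taken from any particular inspection product; the choice of Python with NumPy and Pillow, the file names, and the threshold values are all illustrative assumptions.

import numpy as np
from PIL import Image

def compare_to_template(board_path, template_path, threshold=40, max_defect_pixels=50):
    """Pass the board only if few pixels differ markedly from the master template."""
    board = np.asarray(Image.open(board_path).convert("L"), dtype=np.int16)
    template = np.asarray(Image.open(template_path).convert("L"), dtype=np.int16)
    if board.shape != template.shape:
        raise ValueError("board and template must be registered to the same size")
    diff = np.abs(board - template)                  # point-for-point intensity difference
    defect_pixels = int(np.count_nonzero(diff > threshold))
    return defect_pixels <= max_defect_pixels        # True if the board passes inspection

A real inspection system would first register the board image to the template and tolerate small misalignments and lighting changes; this sketch ignores those steps, which is precisely why such systems need so much task-specific development effort.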

However, it is now possible to buy a standard image processing system, based for example on an IBM-PC or a SUN workstation, which contains all the basic facilities for the acquisition and display of 2D image data. Figure 1, below, illustrates schematically the architecture of a computer vision system.

Figure 1: The components of a computer vision system

In conjunction with the basic hardware, there are a number of standard software libraries for low- and intermediate-level image processing that can substantially reduce the overall development time of a particular application.
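As an illustration of how such a library shortens the routine part of the work, here is a minimal sketch using the Pillow library as an assumed stand-in for the standard packages described above; the file name is a placeholder.

from PIL import Image, ImageFilter, ImageOps

image = Image.open("input.png").convert("L")               # acquire a 2D grey-level image
denoised = image.filter(ImageFilter.MedianFilter(size=3))  # low-level noise suppression
enhanced = ImageOps.equalize(denoised)                     # routine contrast enhancement
enhanced.show()                                            # display the result

None of these steps is specific to any one application, which is what makes them good candidates for off-the-shelf library support.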

The goal of current work in applied computer vision is to develop a generic class of systems which can be readily reconfigured to undertake a variety of different tasks with as short a lead-time as possible. A general vision paradigm is illustrated in Figure 2, below. A wide range of visual sources can be employed, supplying not only conventional intensity or colour video data but also data from other parts of the electromagnetic spectrum, texture data, range or depth data, X-ray images and so on.

At the lowest level, primitives are extracted from the data by context-independent filtering operations; for example, the detection of abrupt changes in image intensity or surface depth which might correspond to significant boundaries of particular objects. At the intermediate level, this primitive data may be analysed with regard to general knowledge of image properties to determine how it may be grouped into more meaningful entities; for example, the discontinuities of the previous stage may be linked to form boundaries of known parametric form. At the highest level, knowledge of the particular task constraints may be brought to bear: the interpretation of the feature data may be based on a finite database of possible components, as might be required for an assembly task, for example.

In general, the balance between hardware and software processing shifts with the complexity of the succeeding tasks. Image sensing, acquisition and display are hardware tasks, and many feature extraction algorithms, such as edge detectors, have been implemented in hardware in various forms. Feature analysis is primarily software-oriented, although developments in multiple-instruction-single-data (MISD) and single-instruction-multiple-data (SIMD) architectures, in addition to custom VLSI realisations, have accelerated this process. Cognitive processing, however, is usually a software exercise, normally performed on sequential machines, although some research teams are now encoding high-level algorithms on multiple-instruction-multiple-data (MIMD) machines.

Figure 2: Processing stages in a computer vision system
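The low-level filtering stage described above can be made concrete with a small sketch of gradient-based edge detection. This is a generic Sobel-style operator written directly in NumPy for illustration, not the specific filter of any particular system, and the threshold value is an arbitrary assumption.

import numpy as np

def sobel_edges(image, threshold=0.25):
    """Context-independent low-level filtering: mark abrupt changes in intensity."""
    img = np.asarray(image, dtype=np.float64)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    # direct 3x3 correlation, kept explicit so the sketch is self-contained
    for i in range(3):
        for j in range(3):
            patch = img[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    magnitude = np.hypot(gx, gy)
    magnitude /= magnitude.max() + 1e-12      # normalise to [0, 1]
    return magnitude > threshold              # binary map of candidate boundary pixels

Grouping the resulting edge pixels into boundaries of known parametric form, for instance with a Hough transform, would then correspond to the intermediate level, and matching those boundaries against a finite database of known components to the highest, task-dependent level.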

There are several possible representations of visual data, ranging from purely iconic structures at the lowest level to higher-level, ``knowledge-based'' structures, such as the network model of a chair shown in Figure 5, below.

Figure 5: A network model of a chair


Models of Image Formation
