H. I. Christensen
Throughout the last three decades there have been a number of efforts in systems integration. One of the first approaches to building a complete system is probably that of Roberts [Roberts63]. Since then a large number of research efforts have been undertaken to build more or less complete systems [Brooks81, Hanson-Riseman76, VAP, Navlab, Dickmanns]. It is, however, characteristic that the construction of full systems remains a major problem, both for research prototypes and for commercial systems.
The problems in building systems are apparent from a number of indications. The most obvious indicator is probably that so far few commercial systems are available for 'general' domains. Almost all commercial systems are highly tailored to the specific applications for which they are used, which implies that a significant engineering effort is required for each new application. Such a requirement means that vision is only a competitive technology for application domains characterised by high volume, high risk, or high cost. The potential for computer vision is vast and diverse, but more structured implementation techniques must be provided before widespread use can be expected. In addition, a number of basic problems must be addressed before more advanced applications can be provided.
Some of the primary reasons why systems integration is still a major problem are:
Another problem in system design and implementation is the need for computational resources. It is still characteristic that complex applications can only be implemented on state-of-the-art computing facilities. At the same time it is difficult to scale applications in relation to available resources. Both of these problems are related to the issue of control, i.e. how limited resources in terms of memory, CPU power, etc. are most efficiently exploited, and how real-time response is guaranteed.
In the remainder of this chapter each of the problems outlined above is discussed in more detail, and at the end a set of research and development issues that are considered particularly important is outlined.
In computer vision systems there is a need for a certain minimum of hardware resources. The obvious components are cameras and frame grabbers; in addition, digital signal processors (DSPs) or dedicated hardware are often utilised. Each of these hardware components has its own problems.
First of all, almost all commercially available cameras are designed with broadcasting in mind, which implies that the output signal follows the PAL or NTSC standard. These standards are excellent for the transmission of images and subsequent viewing on monitors, but it is not obvious that they are well suited for computer vision. The first problem is that the sampling frequency is 50 or 60 Hz. Through the use of shutters etc. it is possible to reduce the effective integration period, but it is very difficult to increase the sampling frequency, which implicitly places an upper limit on the speed of a system. The other problem is that the spatial resolution is often inadequate, i.e. one has to compromise between field of view and spatial resolution. Access to better cameras would simplify a large number of problems. Recent progress in CMOS technology indicates that it might soon be possible to get cameras with higher resolution and with the ability to do pixel-by-pixel addressing, which would enable more efficient use of the sensing system.
In terms of frame grabbers, it is characteristic that there is a large number of different brands, almost all of which have different interfaces. A few years ago the VME bus was the standard interface, but due to the limited bandwidth of this bus most commercial boards today are equipped with a PCI bus to enable integration into PC-based systems. In addition, progress in multi-media applications has resulted in a number of workstations with built-in frame grabbers, but the interfaces are still quite different and in most cases only a 'high-level' software interface is provided. These high-level interfaces make it difficult to change specific parameters in the hardware, and significant resources must be devoted to adapting such hardware devices to particular applications. This in turn makes it difficult to choose the most appropriate hardware platform for different applications. There is thus a need for standardisation of such hardware devices and the associated set of controllable parameters.
Some frame grabbers are equipped with dedicated processing hardware that enables real-time pre-attentive processing of images to generate gradient or edge maps. Unfortunately, the programming required to use such facilities is typically extensive and non-portable, which again implies a high cost for the use of such devices. There is thus also a need for standardisation of such DSP interfaces, and it would in addition be most useful to have standard libraries for the most common operations. As a minimum it would be desirable to have common definitions for application programmer interfaces (APIs) to enable easy migration from one platform to another.
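As an illustration only, such a common API could be sketched as an abstract interface that each vendor implements; all class, method, and parameter names below are hypothetical, chosen purely for this example:

```python
from abc import ABC, abstractmethod

class FrameGrabber(ABC):
    """Hypothetical minimal API that any vendor driver could implement."""

    @abstractmethod
    def set_parameter(self, name: str, value) -> None:
        """Expose low-level hardware parameters (gain, shutter, ...)."""

    @abstractmethod
    def grab(self) -> list:
        """Return the next frame as a row-major pixel buffer."""

class SimulatedGrabber(FrameGrabber):
    """Software stand-in, used here only to show the interface in action."""

    def __init__(self, width: int, height: int):
        self.width, self.height = width, height
        self.params = {}

    def set_parameter(self, name, value):
        self.params[name] = value

    def grab(self):
        # A real driver would transfer pixels from the board here.
        return [0] * (self.width * self.height)

grabber = SimulatedGrabber(640, 480)
grabber.set_parameter("shutter_us", 2000)
frame = grabber.grab()
```

Application code written against the abstract interface would then migrate between boards by exchanging only the concrete driver class.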
Recently there has been an effort in the US to provide a standard library for the programming of vision applications. This effort is named the 'Image Understanding Environment' (IUE). It is a response, by a particular funding agency, to the fact that a significant amount of resources has been devoted to the re-implementation of existing algorithms. It is, however, characteristic of this effort that it has its origin in image understanding. This implies that particular emphasis has been placed on the interpretation of single images and on interfacing to many different kinds of imaging devices. It is still uncertain how, and to what extent, the effort will also include methods for more general computer vision research. Another characteristic of the effort is that it has tried to be extensive, in the sense that it has tried to capture as much as possible in a single standard. It might be wiser to try to capture the minimum needed to enable the transfer of software from one platform to another.
The IUE initiative also lacks proper support for the handling of uncertainty and of continuous streams of data, and so far no support is available beyond single-image characteristics.
A scaled-down environment has been developed in parallel by General Electric and Oxford University. This system is named TargetJr; recently it has also been adopted at INRIA, KU Leuven, and other European universities. Due to its limited size and less extensive standardisation it is better suited for computer vision research, but again only limited facilities are provided for problems beyond image-based measures.
It is considered crucial, for both academia and industry, to have some minimum standard for the integration of systems, as the reuse of algorithms will otherwise remain expensive. The standardisation initiatives so far have tried to provide standards at a very low level of abstraction, where one must decide on function parameters, return types, ... To satisfy the diversity of different companies and universities, the end result is often a compromise: a function with a huge number of parameters and a complex return structure. To circumvent this problem it might be useful initially to consider more abstract methods for the definition of systems. That is, design today takes place almost at the level of machine code, whereas it might be more useful to have a method for defining systems in terms of a behavioural description. Ideally the system definition should be a behavioural definition using a graphical interface. Once a system has been defined and debugged, it should be possible to compile the system description into dedicated code that can be executed on a specific application platform. Each vendor could then provide dedicated 'modules' that could be linked or compiled into applications, as long as they also provide a behavioural description that can be used for development and debugging.
It is unrealistic to assume that an individual company or university can provide an adequate methodology for standardisation of software components and integration tools and there is thus a need for a community wide effort to define system definition/implementation paradigms. In this effort it is considered wise to use a top-down methodology where abstract methods are defined before the details are finalised.
Almost all computer vision research so far has focussed on the development of individual techniques such as 'edge detection', 'shape from motion', 'optical flow', 'stereo reconstruction', ... These techniques are all characterised by a limited domain of operation. To provide robust systems with a certain generality it is necessary to exploit several of these cues in a coherent fashion, i.e. a set of methods with partially overlapping operating ranges should be chosen so that the entire application domain is covered by at least one technique, and preferably by several. Almost all of the available systems are rather brittle: even small changes in the environment in which a system operates may cause it to break down. This in turn leads to extensive engineering of the environment for particular applications, which makes it difficult to change the process in which the system is used.
Many of the available methods have more or less well-defined operating ranges, but so far little research has been devoted to the integration of techniques. A few notable exceptions are [Aloimonos-Schullman89, Ayache89, Uhlin96]. There is, however, a need for a larger-scale study of how different cues can be exploited to provide robustness. The lack of good cue integration methods also implies that an inadequate basis is available for comparing different techniques for particular applications. This implies that companies and institutions tend to rely on techniques they are familiar with rather than on the best techniques. It should be noted here that there is no general consensus on the concept of 'best', as there is no adequate characterisation of individual algorithms.
To accommodate the integration of different cues, it is considered important that individual cues or features have associated confidence measures that indicate both the semantic and the parametric uncertainties. Today most algorithms have not been designed to provide such measures. It is, however, characteristic that the theoretical parametric uncertainty is often known, so it is largely a matter of integrating these descriptions into the actual algorithms. There is further a need to adapt methods from uncertainty theory and statistical signal processing to enable cue integration.
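As one illustration of how parametric uncertainties enable cue integration, the sketch below fuses independent cue estimates by inverse-variance weighting, a standard result from statistical estimation; the depth values and variances are invented for the example:

```python
def fuse(estimates):
    """Fuse independent (value, variance) cue estimates by
    inverse-variance weighting; returns the fused value and its variance.
    More certain cues (smaller variance) receive larger weights."""
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    return value, 1.0 / total

# Invented example: stereo estimates a depth of 2.0 m (variance 0.04),
# while shape-from-motion estimates 2.2 m (variance 0.16).
depth, var = fuse([(2.0, 0.04), (2.2, 0.16)])
```

Note that the fused variance is smaller than that of either input cue, which is precisely the robustness gain that motivates integrating overlapping techniques.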
A major problem in computer vision today is that most published algorithms do not have an associated description of their performance characteristics; results are typically presented for a few selected images or scenes. It is consequently difficult for other researchers and industrial engineers to determine to what extent a particular algorithm or method can be applied to another application. For most algorithms there is a need for a description of the underlying assumptions about the scene; these assumptions range from scale, contrast, ... to specifications of resolution, type of objects, noise sensitivity, ... In addition to these more qualitative descriptions there is also a need for a characterisation of the parametric characteristics of the algorithm. Without such measures it is difficult for other researchers and engineers to adopt these algorithms. In addition, it is often difficult to diagnose the performance of a complete system without access to such information.
Consequently, it is difficult to evaluate both single processing modules and complete systems under relevant conditions. Benchmarking can only be performed over a sufficiently long chain of modules to provide sufficient invariance to the particular representations and operations used in the next stage; at worst, this involves the entire system.
A consequence of the preceding is the conviction that every descriptor should contain two parts: a statement of class membership and a confidence measure for that statement. This makes it possible to build up robust sequences of modules which interact bi-directionally to combine partially reliable data with model restrictions.
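A minimal sketch of such a two-part descriptor, together with one naive way in which two modules might combine their statements, is given below; the names and the combination rule are illustrative assumptions, not a prescription:

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    """A descriptor carries both a class statement and a confidence."""
    label: str         # statement of class membership
    confidence: float  # in [0, 1]: how much the statement can be trusted

def combine(a: Descriptor, b: Descriptor) -> Descriptor:
    """Naive bi-directional combination: agreeing modules reinforce
    each other; disagreeing ones fall back to the more confident one."""
    if a.label == b.label:
        # Independent agreement raises confidence.
        conf = 1.0 - (1.0 - a.confidence) * (1.0 - b.confidence)
        return Descriptor(a.label, conf)
    return a if a.confidence >= b.confidence else b

edge_vote = Descriptor("edge", 0.7)
texture_vote = Descriptor("edge", 0.6)
fused = combine(edge_vote, texture_vote)
```

The point is structural: because every module emits both parts, a downstream module can always decide how much weight to give the statement.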
In the case of continuously operating vision systems, which interact with their environment (such as an industrial robot on a mobile platform), it is highly desirable that it is possible to achieve a "closed loop" of information flow.
We would like to re-emphasise the investigation into mechanisms and control structures for restrictive processing in continuously operating vision systems. This is due to the computational demands of performing robot vision in real time. It includes mechanisms for goal-directed perceptual strategies such as focus of attention, foveation, and spatio-temporal scale-space, with variable spatial and temporal resolution sampling to provide sufficiently good representations for the control mechanisms. The real-time restriction is interesting and important for a number of reasons:
In order to be acceptable, such a system must be:
Therefore, such a system consists of a small kernel to which problem-oriented code is added as modules. However, it is crucial that the kernel is properly decomposed and layered, with well-defined interfaces to every part of the kernel, so that it will be possible to choose alternative implementations of one or several services in the kernel. One reason for choosing an alternative implementation is increased efficiency, another is integration and compatibility with other systems.
For the same reason, the kernel must not contain anything that could just as well be implemented in add-on libraries. For example, support for different data structures should be kept out of the kernel. For distributed systems, the naming of processes, objects, and services could be implemented by a general, replaceable name service in user space. Scheduling should be implemented by a user-defined scheduler, and must not be hard-wired into the kernel. Support for the transportation and distribution of objects should be based on protocols which allow for multiple programming languages, transport layers, and object persistence implementations.
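The replaceability argued for here can be sketched as a service interface whose implementation lives outside the kernel and can be swapped freely; all names below are purely illustrative:

```python
from abc import ABC, abstractmethod

class NameService(ABC):
    """Interface seen by the kernel; the implementation lives in user
    space and can be replaced without touching the kernel itself."""

    @abstractmethod
    def register(self, name: str, obj) -> None: ...

    @abstractmethod
    def lookup(self, name: str): ...

class LocalNameService(NameService):
    """Trivial in-process implementation; a distributed system could
    drop in a networked replacement behind the same interface."""

    def __init__(self):
        self._table = {}

    def register(self, name, obj):
        self._table[name] = obj

    def lookup(self, name):
        return self._table.get(name)

class Kernel:
    """Small kernel parameterised by its services rather than
    hard-wiring them."""

    def __init__(self, names: NameService):
        self.names = names

kernel = Kernel(LocalNameService())
kernel.names.register("edge-detector", object())
```

The kernel depends only on the interface, so choosing an alternative implementation, for efficiency or for compatibility with other systems, requires no kernel changes.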
Some systems invent their own specification languages, scripting languages, and graphical interfaces even when existing, generally accepted alternatives are available. The use of mainstream components increases the probability that the system will be accepted and used.
Many existing distributed image processing systems (AVS, Khoros, etc.) are based on computational networks consisting of filters which operate on data streams. While this abstraction might be appropriate for some image processing problems, reactive real-time systems need more control over the computations. There is a strong trend towards distributed object-oriented systems, in which the programmer gets the feeling that all objects exist in a single process, but where some method invocations are actually implemented by transparent remote procedure calls, shared-buffer access, etc. (CORBA, DOE, Spring, and others).
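The filter-network abstraction can be sketched as generator-based filters chained over a stream of frames; this toy thresholding pipeline is only illustrative and does not reflect the actual AVS or Khoros APIs:

```python
def source(frames):
    """Source node: emit raw frames (here, lists of pixel values)."""
    for frame in frames:
        yield frame

def threshold(stream, level):
    """Filter node: binarise each frame against a fixed level."""
    for frame in stream:
        yield [1 if p >= level else 0 for p in frame]

def count_on(stream):
    """Sink node: number of 'on' pixels per frame."""
    return [sum(frame) for frame in stream]

# Two tiny invented "frames" of three pixels each.
frames = [[10, 200, 130], [5, 5, 250]]
pipeline = threshold(source(frames), level=128)
counts = count_on(pipeline)
```

The limitation noted above is visible even in this sketch: data flows strictly forward through the network, so a reactive system that must redirect processing based on intermediate results needs a richer control mechanism than filter chaining.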
It is also important for creating embedded applications, where one should not need to "download" more than what is actually used. Moreover, it could enable a "progressive learning scheme", so that a newcomer need not learn more than is basically required to get an application started.
In terms of building real-time systems there are several commercially available operating systems, and recently tools for the specification of such systems have also become available in the academic community. It is, however, characteristic that the real-time tools available today all assume that the real-time system is homogeneous from a complexity point of view, and that typically only a few selected routines require real-time response. In fully fledged computer vision systems the need for guaranteed response times varies from a few milliseconds (low-level control loops) to several seconds (symbolic interpretation and planning tasks); the requirements in terms of response times are consequently highly heterogeneous. In addition, computer vision and image analysis are characterised by an excessive data flow (6-20 MB/s), which in turn requires that an execution environment be able to create and destroy large data structures without any effect on the performance of the system. On top of this, facilities for the control of processing, analysis, and interpretation must be provided to ensure limited-size models. So far no tools, be they academic or commercial, are available that satisfy these requirements.
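One common way to keep the creation and destruction of large image buffers from disturbing guaranteed response times is to recycle pre-allocated buffers from a fixed pool, so that the steady-state loop never touches the general allocator; a minimal sketch (pool and buffer sizes are invented):

```python
class BufferPool:
    """Pre-allocates a fixed number of fixed-size buffers so that frame
    processing never allocates or frees memory at run time."""

    def __init__(self, count: int, size: int):
        # All allocation happens once, up front.
        self._free = [bytearray(size) for _ in range(count)]

    def acquire(self) -> bytearray:
        if not self._free:
            # A real-time design sizes the pool for the worst case
            # rather than falling back to dynamic allocation.
            raise RuntimeError("pool exhausted; size it for the worst case")
        return self._free.pop()

    def release(self, buf: bytearray) -> None:
        self._free.append(buf)

pool = BufferPool(count=4, size=640 * 480)
buf = pool.acquire()
# ... fill buf with a frame and process it ...
pool.release(buf)
```

With the pool sized for the worst-case number of in-flight frames, buffer turnover at 6-20 MB/s has a constant, predictable cost.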
To enable technology transfer, the exchange of results between companies, and the formation of an OEM structure within vision, a standardisation effort is necessary to accommodate exchange and interfacing between different components. While earlier efforts have tried to cover all possible applications and to provide specifications at a very low level of abstraction, it is suggested that more abstract behavioural specifications are needed.
In addition, there has been much too little work on the benchmarking and performance characterisation of algorithms and systems to enable an assessment of methods. This problem is not only of interest for industrial applications but is a general scientific issue. In general it is considered most unfortunate that computational vision does not have methods for the assessment of methods and algorithms; it is consequently very difficult to specify the utility of new work. To assist better characterisation of methods, ECVnet has set up a benchmarking committee, and there has also been recent progress, as witnessed by the Machine Vision and Applications special issue on 'performance characterisation' (December 1996).
There are thus a number of system issues that should be attended to in order to enable a much more efficient approach to the design and implementation of systems and methods.
[Aloimonos-Schullman89] Y. Aloimonos and D. Shulman, Integration of Visual Modules, Academic Press, New York, 1989.
[Ayache89] N. Ayache, Artificial Vision for Mobile Robots, MIT Press, Cambridge, Massachusetts, 1989.
[Brooks81] R. A. Brooks, Symbolic reasoning among 3-D models and 2-D images, Artificial Intelligence, Vol. 17, 1981, pp. 285-348.
[Dickmanns] E. D. Dickmanns and A. Zapp, Autonomous high speed road vehicle guidance by computer vision, in "Automatic Control World Congress 1987, Selected Papers from the 10th Triennial Congress of IFAC", Munich, Germany, Pergamon Press, 1987, pp. 221-226.
[Hanson-Riseman76] A. R. Hanson and E. M. Riseman, Constructing semantic models in the visual analysis of scenes, Proceedings Milwaukee Symposium on Automatic Control, Vol. 4, 1976, pp. 97-102.
[Navlab] C. Thorpe, M. H. Hebert, T. Kanade and S. A. Shafer, Vision and navigation for the Carnegie-Mellon Navlab, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, No. 3, May 1988, pp. 362-373.
[Roberts63] L. G. Roberts, Machine Perception of Three-Dimensional Solids, MIT Lincoln Laboratory, TR 315, PhD dissertation, 1963.
[Uhlin96] T. Uhlin, Fixation and Seeing Systems, PhD thesis, Royal Institute of Technology, Stockholm, May 1996.
[VAP] J. L. Crowley and H. I. Christensen (Eds.), Vision as Process, Springer-Verlag, 1995.