Detection and interpretation of a face image can have a number of applications in machine vision. The most obvious use is to know whether a person is present in front of a computer screen. This makes a cute, but very expensive, screen saver. It is also possible to use face recognition as a substitute for a login password, presenting a person with his preferred workspace as soon as he appears in front of the computer system. Slightly more practical is the use of computer vision to watch the eyes and lips of a user. Eye tracking can be used to determine whether the user is looking at the computer screen and on which part of the screen his gaze is fixed. This could conceivably be used to select the currently active window in an interface. Observing the mouth to detect lip movements can be used to detect speech acts in order to trigger a speech recognition system.
None of the above uses would appear to be compelling enough to justify the cost of a camera and digitizer. However, there is an application for which people are ready to pay the costs of such hardware: video communications. Recognizing and tracking faces can have several important uses in video telephones and video conferencing. We are currently experimenting with combining face interpretation with a rudimentary sound processing system to determine the origin of spoken words and associate speech with faces. Each of the applications that we envisage requires active computer control of the direction (pan and tilt), zoom (focal length), focus and aperture of a camera. Fortunately, such cameras are appearing on the market.
In the video-telephone application, we use an active camera to regulate zoom, pan, tilt, focus and aperture so as to keep the face of the user centered in the image at the proper size and focus, and with an appropriate light level. Such active camera control is not simply for esthetics. Keeping the face at the same place, same scale and same intensity level can dramatically reduce the information to be transmitted. One possible such coding is to define (on-line) a face space using principal components analysis (defined below) of the sample images from the last few minutes. Once the face basis vectors are transmitted, subsequent images can be transmitted as a short vector of face space coefficients. Effective use of this technique is only possible with active camera control. Other image codings can also be accelerated if the face image is normalized in position, size and gray level, and held in focus.
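As an illustration of the coding idea, the following is a minimal sketch in Python with NumPy, not the implementation used in our system. It assumes that the sample face images have already been normalized in position, size and gray level and flattened into the rows of an array. The function names `build_face_space`, `encode` and `decode`, and the use of a singular value decomposition to obtain the principal components, are choices made for this example only.

```python
import numpy as np

def build_face_space(images, k):
    """Compute a face space from sample images.

    images: array of shape (n_samples, n_pixels), one flattened,
            normalized face image per row (an assumption of this sketch).
    k:      number of basis vectors to keep.
    Returns the mean image and a (k, n_pixels) array of orthonormal
    basis vectors (the first k principal components).
    """
    mean = images.mean(axis=0)
    centered = images - mean
    # The rows of vt are the principal directions of the centered samples,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def encode(image, mean, basis):
    """Project one image onto the face space: a short coefficient vector."""
    return basis @ (image - mean)

def decode(coeffs, mean, basis):
    """Reconstruct an approximate image from face space coefficients."""
    return mean + basis.T @ coeffs
```

In such a scheme the sender would transmit `mean` and `basis` once, and then only the `k` coefficients returned by `encode` for each subsequent image. The receiver rebuilds an approximation with `decode`. The approximation is only good if new images resemble the samples, which is why the face must be held at a fixed position, scale and intensity.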
In the video-conference scenario, a passive camera with a wide-angle lens provides a global view of the persons seated around a conference table. An active camera provides a close-up of whichever person is speaking. When no one is speaking, and during transitions of the close-up camera, the wide-angle camera view can be transmitted. Face detection, operating on the wide-angle images, can be used to estimate the location of conference participants around the table. When a participant speaks, the high-resolution camera can zoom onto the face of the speaker and hold the face in the center of the image.
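The choice of which view to transmit in this scenario can be sketched as a small decision rule. The sketch below is hypothetical: the function `select_view`, its arguments and the string labels are invented for illustration, and it assumes that some other process reports who is speaking and whether the close-up camera has finished moving.

```python
def select_view(speaker, closeup_target, closeup_settled):
    """Choose which camera view to transmit.

    speaker:         identifier of the participant currently speaking,
                     or None when no one is speaking.
    closeup_target:  identifier of the participant the active camera
                     is aimed at (or moving toward), or None.
    closeup_settled: True once the active camera has stopped moving.
    """
    if speaker is None:
        return "wide"      # no one is speaking
    if closeup_target != speaker or not closeup_settled:
        return "wide"      # close-up camera is still in transition
    return "closeup"       # active camera is holding the speaker's face
```

With this rule the wide-angle view covers both silences and the time the active camera needs to reach a new speaker, so the viewer never sees the close-up camera in motion.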
What are the technologies required for the above applications? Both scenarios require an ability to detect and track faces, and an ability to servo control pan, tilt, zoom, focus and aperture so as to maintain a selected face in the desired position, scale and contrast level. From a hardware standpoint, such an application requires a camera for which these axes can be controlled. Such camera heads are increasingly appearing on the market. For example, we have purchased a small RS232 controllable camera from a Japanese manufacturer which produces excellent color images for little more than the price of a normal color camera.
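To make the servo requirement concrete, the following is a minimal sketch of one step of a proportional control law for pan, tilt and zoom. It is an assumption of this example, not a description of our controller or of any camera's command set: the `FaceBox` type, the `servo_step` function, the gain value and the normalized, unitless corrections are all hypothetical, and focus and aperture are left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class FaceBox:
    cx: float     # x coordinate of the face center, in pixels
    cy: float     # y coordinate of the face center, in pixels
    width: float  # width of the detected face, in pixels

def servo_step(face, image_w, image_h, target_width, gain=0.5):
    """Return (pan, tilt, zoom) corrections for one control cycle.

    Each correction is proportional to a normalized error:
    pan and tilt to the offset of the face from the image center,
    zoom to the difference between the desired and observed face width.
    A result of (0, 0, 0) means the face is centered at the target size.
    """
    pan = gain * (face.cx - image_w / 2) / image_w
    tilt = gain * (face.cy - image_h / 2) / image_h
    zoom = gain * (target_width - face.width) / target_width
    return pan, tilt, zoom
```

For instance, a face detected to the right of center and at half the desired width yields a positive pan correction and a positive zoom correction. A gain below 1 corrects only part of the error on each cycle, which trades speed of response for stability when face detection is noisy.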
A second hardware requirement is the ability to digitize and process images at something close to video rates. The classic bottleneck here is communication of the image between the frame-grabber and the processor. Fortunately, the rush to multi-media applications has pushed a number of vendors to produce workstations in which a frame-grabber is linked to the processor by a high-speed bus. Typical hardware available for a reasonable cost permits acquisition of up to 20 frames per second at full image size, and full video rates for reduced-resolution images. Adding simple image processing can reduce frame rates to 2 to 10 Hz (depending on image resolution). Such workstations are suitable for concept demonstrations and for experiments needed to define performance specifications. An additional factor of 2 in bandwidth and processing power (roughly 18 months of progress) will bring us to full video rates.
The questions we ask in the laboratory are: What are the software algorithms that can be used for face detection, tracking and recognition, and what are the systems concepts needed to tie these processes together? Systems concepts have been the subject of our ESPRIT Basic Research Project ``Vision as Process''. I refer the interested reader to the book [CC94] or the paper [CB94] for more details. In the following I will address software algorithms. These algorithms should be seen as complementary. They are combined in a multi-process architecture to provide robust face detection, tracking and recognition.