Computer Vision in the Broadcast and Entertainment Industries

The use of computer vision technology in the broadcast and entertainment industries

ZIRST - 61, Chemin du vieux chêne
BP 177 - 38243 MEYLAN Cedex
Telephone : +33 4 76 41 40 00
FAX: +33 4 76 41 28 05

11 October 1996


Production:
    1. image acquisition
    2. data acquisition
    3. sets
    4. post-production
    5. archive management

Delivery:
    1. format conversion
    2. compression
    3. advert insertion
    4. protection
    5. interaction
INTRODUCTION

This report concerns the use of computer vision technology within the entertainment industries. It seeks to describe existing usage of this and alternative technology, and to highlight current trends and possible future needs and uses, with a view to defining future research directions. In this report the term computer vision technology is taken to cover both software (image signal processing, pattern recognition, visual control, object recognition, scene reconstruction and scene understanding) and hardware (sensors, platforms, etc). The term entertainment industry is taken to cover principally the visual media of TV and film for news, sports, drama and comedy programming, as well as games and advertising. It does not deal with print media or market research, nor will it explicitly cover education or culture.

During the course of their development the film and TV industries have enthusiastically exploited technological developments, from sound and colour through to more recent innovations in computer animation and current experiments in high definition and digital broadcasting. Their markets are global and touch the greater part of the world's population. The equipment market has been estimated at $15 billion per annum. Technically, the needs are very demanding. In TV production they are characterised by high bandwidths with multiple high quality colour image streams. For live events this is coupled with the need for low latency (a few frames at most) and a requirement for high reliability and robustness to failure.

Two stages can be distinguished: the production of the entertainment product, and the delivery (distribution and transmission) of the finished product. Each calls upon different functions and technologies in which computer vision technology has or can play a role.


2.1 image acquisition

Image capture relies upon high quality cameras. Early cameras were large, fragile and required careful control of lighting. Improvements in technology have produced cameras which are smaller and more robust, opening up new uses, for example news reporting under difficult conditions and on-board cameras in motor racing.

In situations where the cameras are subjected to shocks or where a cameraman is unavailable, stabilisation of the image may be necessary. Methods of ensuring this include:

dynamically balanced platforms suitable for handheld use, such as the Steadicam (USA).

gyroscopically balanced platforms suitable for high magnification use on aircraft such as the systems of BSS (UK) and Wescam (Canada).

opto-mechanical stabilisation using elements at the lens eg Canon (Japan).

electronic stabilisation using motion estimation offered by JVC (Japan) and others.
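The electronic approach works by estimating the dominant frame-to-frame motion and shifting each frame to cancel it. A minimal sketch of the estimation step, using an exhaustive integer block search over small greyscale frames (the function name and search range are illustrative, not any vendor's method):

```python
def estimate_global_motion(prev, curr, search=2):
    """Find the integer offset (dx, dy) that best aligns curr with prev
    by minimising the sum of absolute differences over the overlap."""
    h, w = len(prev), len(prev[0])
    best, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            sad = 0
            # Compare only the region where both frames overlap.
            for y in range(max(0, -dy), min(h, h - dy)):
                for x in range(max(0, -dx), min(w, w - dx)):
                    sad += abs(curr[y + dy][x + dx] - prev[y][x])
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best
```

A stabiliser would then shift each frame by the negated estimate. Production systems use hierarchical or gradient-based estimators over multiple subwindows and handle subpixel motion.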

Cameras are often mounted on mobile heads or cranes to obtain particular views. A wide variety of such devices are available in the form of simple pan-tilt heads through to pedestal systems for XYZ motions and arms with 8 or more degrees of freedom. Additional optical degrees of freedom are obtained by controlling the zoom and focus and occasionally the aperture. These systems may be manually driven (with or without coders) or via numerical control, joystick or programmed offline with resolutions described in microns and minutes of arc. Current suppliers include Ultimatte (USA), Vinten (UK), Egripment (Netherlands), A&C (UK), Radamec (UK), MRMC (UK) and Panther (Germany).

The cameras themselves increasingly employ digital signal processing as a way of dealing with the varying preprocessing requirements (gamma and colour matrix) as well as to provide increased stability, noise reduction and defect elimination capability.

2.2 data acquisition

For certain types of programming there is a requirement to acquire information other than pure image data. An example of this is in performance animation (also known as puppeteering) whereby the motions of a real actor are used to guide an animated character. Canal+ (France) have used a bodysuit fitted with electromagnetic sensors to animate a virtual presenter in realtime. Other users have carried out offline motion capture to produce animations for video games. A range of technical approaches exist, including mechanical, electromagnetic or acoustic sensors. Wire-free optical systems using active and passive detection are available from some 30 suppliers; examples include a system based on infrared diodes from Adaptive Optics (USA) and one using reflective markers from Oxford Metrics (UK). These systems work under conditions of controlled illumination and can give good precision, but require good calibration and tend to suffer from occlusion problems. Animation based on facial movements is possible by sticking reflective labels to the face and tracking them under infrared illumination, for example X-IST (Germany).
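As an illustration of the geometry behind optical marker capture, the 3D position of a marker seen by a calibrated, rectified stereo camera pair follows directly from its horizontal disparity. This is the textbook relation, not a description of any listed product; the focal length (in pixels) and baseline (in metres) are assumed known from calibration:

```python
def marker_position(xl, yl, xr, cx, cy, focal_px, baseline_m):
    """Triangulate a marker from a rectified stereo pair:
    depth Z = f*B/disparity, then back-project to X and Y."""
    disparity = xl - xr
    if disparity <= 0:
        raise ValueError("marker must lie in front of both cameras")
    z = focal_px * baseline_m / disparity
    x = z * (xl - cx) / focal_px   # lateral offset from the left camera axis
    y = z * (yl - cy) / focal_px   # vertical offset
    return x, y, z
```

The precision and occlusion problems noted above follow from this relation: small disparity errors translate into large depth errors at range, and a marker hidden from either camera cannot be triangulated at all.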

Orad (Israel) offer an interactive annotation system for sports commentating, which operates on a video feed to provide tracking and naming of players, automatic tracking of players and ball, distance and speed estimations, panoramic reconstruction by view combination, offside checks, etc. It has been used in football, ice hockey, tennis, basketball and athletics for replay commentaries.

Other technologies have also been employed to track objects of interest for outdoor and sporting events. Medialab (France) have used localisation information from GPS receivers to synthesise images of yachts for the America's Cup. In ice hockey, a radio transmitter fitted inside the puck has enabled trajectory and speed information to be obtained for display on screen.

Increasing use is being made of offline 3D data acquisition systems to create models for animation in programming and video games. The techniques used are either laser scanning, for which there are several suppliers such as 3D Scanners (UK) and Cyberware (USA), or fusing of views, as in the Sphinx 3D modeler from Dimension (Germany).

2.3 sets

Actors are conventionally placed in environments of decorated sets which are expensive and constraining. Over time methods have been developed to give the impression that the action is happening elsewhere, by mixing in moving backgrounds etc. This can be done using a technique called chroma-keying. The action is shot against a background of a certain colour (usually blue), and the resulting video signal is processed to substitute the blue in the image with another image source. This technique is widely used in film and TV, for example for weather forecasters, and works under the constraints that the illumination be well controlled, blue can be avoided in the foreground and the cameras remain fixed.
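The substitution step can be sketched in a few lines. This toy version keys on blue dominance per RGB pixel; the 1.3 dominance ratio is an arbitrary illustrative threshold, and real keyers compute a soft matte rather than a hard per-pixel switch:

```python
def chroma_key(fg, bg, threshold=1.3):
    """Blue-screen key sketch: where a foreground pixel is dominantly
    blue, substitute the corresponding background pixel."""
    out = []
    for fg_row, bg_row in zip(fg, bg):
        row = []
        for (r, g, b), bg_px in zip(fg_row, bg_row):
            if b > threshold * max(r, g, 1):
                row.append(bg_px)       # keyed: backdrop shows through
            else:
                row.append((r, g, b))   # kept: foreground (actor) pixel
        out.append(row)
    return out
```

The constraints listed above fall directly out of this logic: uneven illumination makes blue dominance unreliable, and any blue in the foreground is wrongly keyed out.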

Recently it has become technically feasible to synthesise background images in real time and to match them to the camera viewing parameters such as position, zoom and depth of field. This allows much more complex virtual sets, including ones where there is an interaction between the actors and the synthetic objects. In order to do this it is necessary to have precise knowledge about the cameras. The precision required depends on viewpoint but can be rather high (1mm in XYZ, <0.1deg. in orientation). This information is currently obtained in one of two ways:

using optical or mechanical sensors fitted to a remotely driven robotic camera, either a pan/tilt unit or a pedestal camera capable of XYZ motions. Suppliers include Radamec (UK) and Vinten (UK). Coders of 24 bit resolution are common on the pan and tilt axes, while 12 to 16 bit coders are used for zoom and focus. Such systems are said to suffer from backlash and drift and are of limited use for handheld or wide area use.

using the recognition of a pattern in the background and using this to determine position. This can either take the form of a bar code read by a laser scanner to determine the XY position of a pedestal (Radamec), or a monochrome grid pattern on the blue backdrop to determine the 7 image projection parameters for a handheld camera (Orad). Such systems have problems when the pattern becomes hard to see due to the angle of view, focus or motion blur, or occlusion.
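For a single recognised landmark and a pinhole camera, the orientation part of the registration problem reduces to two arctangents. A toy sketch, assuming a known focal length in pixels and a landmark that projects to the principal point at the reference orientation (far simpler than the full seven-parameter problem the commercial systems solve):

```python
import math

def pan_tilt_from_landmark(u, v, cx, cy, focal_px):
    """Recover camera pan/tilt in degrees from the image position (u, v)
    of one known landmark, under a pinhole model with principal point
    (cx, cy). Positive pan is rightward, positive tilt upward."""
    pan = math.degrees(math.atan2(u - cx, focal_px))
    tilt = math.degrees(math.atan2(cy - v, focal_px))
    return pan, tilt
```

The failure modes quoted above show up here too: if blur or occlusion shifts the detected landmark by a few pixels, the recovered angles shift proportionally.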

There are a number of commercial virtual set systems on the market, including those from Accom (USA/Poland), Brainstorm (USA/Spain), Electrogig (Netherlands), Discreet Logic (Canada/UK), Orad (Israel), and RTset (Israel). Most major broadcasters are currently experimenting with such systems and they are in regular use in Germany, UK, Spain and elsewhere. While these systems have the potential to reduce costs (set construction and storage) they are still very expensive (up to $750k), and the present use is to provide increased functionality (unavailable, dynamic or otherwise impossible backdrops) rather than cost reduction.

2.4 post-production

Once the footage has been shot, the final product must be produced. This can be as simple as editing together the different segments, but in general it is necessary to carry out some kind of processing such as:

mixing between several image sources, including animated or synthetic sequences.

transforming or warping one image onto another.

tracking of objects for motion compensation, image stabilisation and alignment of clips.

removing or adding motion blur.

grain management (sampling, modelling and matching) when working with film.

segmentation of an object moving against a bluescreen by tracking.

There exist a number of editing and production workstations to carry out these tasks, such as the systems from Quantel (UK), Discreet Logic (Canada), Alias/Wavefront (Canada), etc. They draw heavily on methods from surface reconstruction, sampling, filtering, correlation, segmentation and interpolation. In general the tools used are interactive, with manual setting of the parameters and the operator supervising the process running over a sequence at near realtime, tuning parameters to obtain a satisfactory result.
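The first operation in the list above, mixing between sources, is at heart a per-pixel weighted sum. A minimal dissolve over flat greyscale frames (function names illustrative):

```python
def dissolve_frame(frame_a, frame_b, alpha):
    """Blend two frames: alpha=0 gives frame_a, alpha=1 gives frame_b."""
    return [round((1 - alpha) * a + alpha * b)
            for a, b in zip(frame_a, frame_b)]

def cross_dissolve(frame_a, frame_b, n_steps):
    """Linear cross-dissolve: n_steps (>= 2) mixes ramping from a to b."""
    return [dissolve_frame(frame_a, frame_b, i / (n_steps - 1))
            for i in range(n_steps)]
```

Workstation mixers generalise this with per-pixel mattes, wipes and non-linear ramps, but the underlying arithmetic is the same weighted sum.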

2.5 archive management

There now exists a huge volume of film and video footage of considerable value. Its effective management is an important issue for a number of broadcasters and producers. Operations for which computer vision technology has found a use include:

indexing and searching: this is traditionally performed using keywords associated with each shot, and there are several suppliers such as Artsum (France). For more automated searching, Dubner (USA) offer a system which detects shot or clip transitions according to change detection (cuts) or fast motion in live filming, for subsequent manual annotation. Demonstrations have also been carried out of experimental image indexing-by-content, such as the Impact system from Hitachi (Japan) for finding shot starts according to content, while Illustra/Virage (USA) and IBM (USA) have systems able to carry out searches based on colour, texture or shape queries. These systems are still under development with regard to processing capability and applications.

archive restoration: the transfer of film stock to digital media is becoming increasingly important as the volume of older material threatened by loss grows and the resolution of broadcast media continues to rise. Colouring of black and white films can be included in this category. Suppliers such as DigitalVision (Sweden) provide sophisticated conversion workstations with interactive processing for motion compensation, colour correction, scratch, dirt and noise filtering, etc.
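The change-detection style of transition spotting mentioned above can be sketched with greyscale histograms: consecutive frames whose normalised histogram difference exceeds a threshold are flagged as cuts. Bin count and threshold are illustrative; commercial systems add motion analysis and fade/dissolve handling:

```python
def histogram(frame, bins=8, max_val=256):
    """Coarse greyscale histogram of a flat list of 0..max_val-1 values."""
    h = [0] * bins
    for v in frame:
        h[v * bins // max_val] += 1
    return h

def detect_cuts(frames, threshold=0.5):
    """Return indices where a hard cut is detected between frame i-1 and i."""
    cuts = []
    for i in range(1, len(frames)):
        h1, h2 = histogram(frames[i - 1]), histogram(frames[i])
        # Normalised L1 histogram distance: 0 = identical, 1 = disjoint.
        diff = sum(abs(a - b) for a, b in zip(h1, h2)) / (2 * len(frames[i]))
        if diff > threshold:
            cuts.append(i)
    return cuts
```

Histogram comparison is robust to small object motion within a shot, which is why it outperforms naive frame differencing for cut detection.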


3.1 format conversion

Programme sources come in many different formats and this calls for the capability to convert between them. This activity is called standards conversion and may include the following operations:

interpolation between images of a film shot at 24 frames per second to a stream at 60 fields per second.

rescanning a 525 line image to create a 625 line image.

motion compensation when converting from one scan rate to another.

converting between one colour system and another.

digitising a movie suffering from grain and scratch damage.

The methods used to carry out these tasks are those drawn from signal and image processing and are present in the transfer machines, as well as in specialised boxes from companies such as Snell and Wilcox (UK) and DigitalVision (Sweden).
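The first conversion listed, 24 frame/s film to 60 field/s video, is classically done with a 3:2 pulldown cadence. A sketch that ignores the odd/even field split and any motion compensation:

```python
def three_two_pulldown(film_frames):
    """Map film frames (24 fps) to video fields (60 fields/s): successive
    frames contribute 2, 3, 2, 3, ... fields, so 4 frames -> 10 fields."""
    fields = []
    cadence = [2, 3]
    for i, frame in enumerate(film_frames):
        fields.extend([frame] * cadence[i % 2])
    return fields
```

The repeated-field judder this cadence introduces is one reason motion-compensated interpolation, as performed by the specialised converters mentioned above, is preferred for high-quality work.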

3.2 compression

The importance of efficient use of communications links and storage media during production and delivery has meant that image compression has received increasing attention. After initial fears for image quality, compression techniques have appeared on the market in the following forms:

JPEG and MPEG-based compression for codecs, data transmission and storage.

transcoding between different compression standards and variants.

preprocessing to remove noise prior to compression.

These draw on methods from signal processing (transforms, quantisation, correlation and non-linear filtering).

Since the coding standards specify a bitstream rather than a coding method, the evaluation of codecs, through test generators and quality assessment systems, is becoming increasingly important. Evaluation frameworks have been developed based on models of human sensitivity to contrast and orientation at various scales. Recent work has added colour, motion and memory effects.
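The baseline objective score that such perceptual frameworks refine is the peak signal-to-noise ratio. A sketch over flat 8-bit frames:

```python
import math

def psnr(reference, coded, max_val=255):
    """Peak signal-to-noise ratio in dB between a reference frame and its
    coded/decoded version, both flat lists of 0..max_val values."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, coded)) / len(reference)
    if mse == 0:
        return float("inf")   # identical frames
    return 10 * math.log10(max_val ** 2 / mse)
```

PSNR correlates only loosely with perceived quality, which is precisely why the contrast- and orientation-weighted models described above were developed.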

3.3 advert insertion

When broadcasting sponsored events such as football matches and motor racing, the same advertising hoardings appear to all viewers, whereas many sponsors wish to fine-tune their advertising message to local sensibilities. Systems able to replace such panels in realtime using image processing are on offer from Matra Datavision (France), Orad (Israel) and VDI (USA). Actual use to generate additional advertising revenue seems to be low at present.

3.4 protection

The protection of programming content is necessary to prevent piracy. The conventional approach has been the addition of a logo or icon in the corner of the image. Premium services are scrambled using one of several random line-shifting algorithms. During news gathering, some news agencies encrypt their signal using similar techniques to protect their source from competitors.

3.5 interaction

In virtual set applications where real and synthetic characters interact, it is necessary to be able to align the two in space and synchronise them in time. The common solution is to place reference marks on the floor, to use an overhead camera to segment the actor or simply to play the combined image back to the actor.

Within some interactive games, the body motions of players are used to control the game character using the blue-screen chroma-keying technique. Such systems include those from Artificial Reality and Vivid (USA).


The kinds of processing described in the preceding sections place severe demands on the platforms used. Whereas low-end workstations may run on high-end Mac or PC platforms, high-end realtime use requires greater processing power, memory and internal bandwidth. Several of the major computer firms such as IBM, Hewlett-Packard and Digital offer media servers. Video editing stations tend to run on either custom hardware or the Silicon Graphics range. At the top end is the Onyx with its 2 to 24 processors and 64 Mbytes to 64 Gbytes of 8-way interleaved memory. The graphics rendering is carried out by 1-3 specialised engines, each with separate dedicated image data buses, somewhat similar to the prototype parallel vision engines used in computer vision research. The Orad pattern recognition system uses 6-9 specialised DSPs (Texas C80) to carry out realtime camera pose estimation.


For a number of suppliers the technology they offer is the result of work in other domains. For example virtual sets and performance animation systems use techniques developed for flight simulators and military target tracking respectively (Orad and RTset). Some standards conversion and advanced filtering was based on work carried out in national telecoms labs (DigitalVision). Image processing workstation supplier Quantel used to have military and medical image processing divisions. The gyroscopic platforms used for stabilising news and sports images were developed for law enforcement and military applications.


The entertainment industries are experiencing pressure as the increasing number of distribution channels (from TV, cable and video to DVD, CD-Rom, games and on-line services) leads to fragmentation of audiences and thus the need to cost-effectively address small target groups. The plethora of media also makes it attractive to reuse content more widely than ever before. Increasingly, computer mediation makes new levels of interactivity possible, and everywhere there is a pressure to reduce costs, which often means live work. These trends give rise to a number of possible future needs and uses of computer vision technology.

image acquisition:

- smart cameras able to automate some stereotyped shooting work, including framing of scenes, tracking, zoom and focus, especially for programming needing many cameras such as minority sports, interactive games, etc. As channels become cheaper and several viewpoints are sent to the viewer, he may take control of the choice of shots. Such systems would also allow acquisition of footage not otherwise available due to the dynamics of the scene, for example the faces of sportsmen, possibly integrating other data such as GPS etc. Smart cameras are under development at MIT and Microsoft.

- three dimensional image acquisition, especially for non-TV applications such as interactive games. Work within the EC projects DISTIMA and MIRAGE, based on polarised left and right image views, has already led to the development of a high-quality 3D camera and displays by Thomson and AEA.

data acquisition:

- existing performance animation systems are limited in the range of information that they can obtain (arms and legs of single cooperating humans) and the conditions under which they will work (indoors, close range). Systems capable of overcoming these limitations are required.

- there is scope for additional systems capable of extracting conceptual information such as speed and distance, as well as 3D shape, from arbitrary image sequences for sports, news and games programming. This is also of interest for the cross-media reuse of content. 3D model acquisition from uncalibrated image sequences is the subject of the EC VANGUARD and REALISE projects.


sets:

- virtual sets systems are limited by their current high cost, due in part to the problem of obtaining precise camera imaging information (position and optical modelling).

- the constraints of using a blue-screen room remain frustrating to many. The ideal is true pixel-precise depth keying at field rate and operating over a wide area, outdoors if possible.

post-production: already rich in functionality, post-production workstations would benefit from more automatic and adaptive methods. Automedia (Israel) have recently offered colour image segmentation by contour following. Other possibilities include lip synchronisation for more precise dubbing.

archive management:

- there is an increasing need to allow the searching of archives by content, by activity, and by abstract meaning, including the searching of compressed data and mixed audio and video.

- restoration and conversion of archives will benefit from sophisticated noise and degradation models and adaptive filtering methods. The AURORA project is currently pursuing these goals.

format conversions: the rising number of formats, including compressed, means that format conversion will be with us for quite some time, especially for forwards, backwards and compression transcoding.


compression:

- increased compression ratios are needed to make more efficient use of communication channels, either using model-based coding (MPEG4) or prior knowledge concerning programme content.

- analysis of subjective performance continues to be important, especially for new media such as immersive VR. The TAPESTRIES project is examining these factors.

protection: this will become an increasingly important issue as material is distributed in digital form and image manipulation tools become more widespread. There is a certain amount of work going on in the area of digital fingerprinting, both in research programmes (ACCOPI, TALISMAN and IMPRIMATUR) and in systems available from EMI (UK), Highwater (UK), AT&T (USA) and NEC (USA).
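A toy illustration of the fingerprinting idea: hide an identifying bit string in the least significant bits of pixel values. This is only a sketch of the principle; robust schemes of the kind the listed projects pursue embed in transform coefficients so the mark survives compression and manipulation:

```python
def embed_watermark(pixels, bits):
    """Overwrite the least significant bit of the first len(bits) pixels
    with the watermark bits; remaining pixels are left untouched."""
    marked = [(p & ~1) | b for p, b in zip(pixels, bits)]
    return marked + pixels[len(bits):]

def extract_watermark(pixels, n_bits):
    """Read the watermark back from the least significant bits."""
    return [p & 1 for p in pixels[:n_bits]]
```

Because only the lowest bit changes, the marked image is visually indistinguishable from the original, yet the embedded identity can be recovered from any unmodified copy.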

interaction: mixing of real and virtual objects requires knowledge of position and event synchronisation.

In parallel with the increased performance and functionality of systems in the professional market, equipment will also begin to migrate to the home and domestic markets.


The following table summarises the types of functions currently provided by computer vision technologies:
function                                    | current application
--------------------------------------------|---------------------------------------------
Image signal processing and transformation  | image acquisition: in-camera processing; post-production workstations; format conversion: estimation and filtering; archive management: image restoration
Barcode reading                             | image acquisition: robotic pedestal camera
Scene reconstruction and visualisation      | data acquisition: performance animation; post-production workstations
Motion, gesture and facial expression       | data acquisition: performance animation; user-end interaction: games interfaces
Image compression                           | compression: delivery and programme exchange
Positioning, registration and metrology     | data acquisition: performance animation and sports annotation; sets: camera position registration
Pattern, object and event recognition       | data acquisition: performance animation; sets: camera registration in virtual sets; archive management: searching
Biological vision                           | compression: image codec performance