
Discontinuous Event Tracking

Jamie Sherrah and Shaogang Gong

Tracking objects in space over time is generally approached using spatio-temporal models of movement, such as Kalman filters or Hidden Markov Models. In some cases, however, a continuous stream of observations is not available. Instead, one obtains a series of checkpoints showing the location of the object at successive time instants. The tracking problem then becomes one of associating objects over time using several sources of information, rather than relying on spatio-temporal association alone.

An example of discontinuous motion occurs in real-time tracking of a person's head and hands from a single camera view [1]. Body motion often appears discontinuous because the hands can move quickly and seemingly erratically, or be occluded by other body parts. Methods such as Kalman filtering, which rely strongly on well-defined dynamics and temporal continuity, are therefore generally inadequate. A human observer, by contrast, typically draws on a wide range of domain knowledge beyond the visual input to reduce reliance on spatio-temporal consistency.

To illustrate the nature of discontinuous body motion under these conditions, Figures 1 and 2 show the head and hand positions and accelerations (as vectors) for two video sequences, along with sample frames. The video was sampled at 18 frames per second. Even so, there are many significant temporal changes in both the magnitude and orientation of the hand accelerations. It is unrealistic to attempt to model the dynamics of the body under these circumstances.
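As a rough illustration of how acceleration vectors like those in Figures 1 and 2 could be obtained from tracked positions, the sketch below applies a central second difference to a sequence of 2D points. The text does not say how the plotted accelerations were actually computed, so the function name `accelerations`, the finite-difference scheme and the pixel units are all assumptions; only the 18 frames per second default comes from the text.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position of one body part, assumed in pixels


def accelerations(positions: List[Point], fps: float = 18.0) -> List[Point]:
    """Estimate 2D acceleration by a central second difference.

    a_t = (p_{t+1} - 2 * p_t + p_{t-1}) / dt^2, with dt = 1 / fps.
    The result has two fewer entries than `positions`.
    """
    dt = 1.0 / fps
    result = []
    # Slide a window of three consecutive positions over the sequence.
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        ax = (nxt[0] - 2.0 * cur[0] + prev[0]) / dt ** 2
        ay = (nxt[1] - 2.0 * cur[1] + prev[1]) / dt ** 2
        result.append((ax, ay))
    return result
```

A part moving at constant velocity yields zero acceleration under this estimate, so frequent large swings in magnitude or direction are a sign that a smooth dynamic model fits the data poorly.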

**Figure 1:** First example of behaviour sequences and their tracked head and hand positions and accelerations. At each time frame, the 2D acceleration is shown as an arrow with arrowhead size proportional to the acceleration magnitude. From left to right, the plots correspond to the head, left hand and right hand.

**Figure 2:** Second example of behaviour sequences and their tracked head and hand positions and accelerations. At each time frame, the 2D acceleration is shown as an arrow with arrowhead size proportional to the acceleration magnitude. From left to right, the plots correspond to the head, left hand and right hand.

Let us consider the tracking problem to be equivalent to watching a mime artist dressed in black, wearing a white face mask and white gloves, against a black background. In Figure 3, for example, given that the subject is facing the camera, it is easy for a human to see that he is touching his face with his right hand. The reader can do this without the benefit of depth information or spatio-temporal continuity, by exploiting considerable prior knowledge instead. We suggest that there is sufficient information in the two-dimensional image to perform consistent and robust tracking of the head and hands. A method is required that can bring our domain knowledge to bear on the problem.

**Figure 3:** Example binary skin image frames from a video sequence of a person facing the camera and scratching his nose; time increases from left to right. This representation can be compared with a mime artist wearing white gloves and mask against a black background.
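A binary skin image such as those in Figure 3 can be turned into a small set of candidate regions by connected-component labelling. The sketch below is one simple way to do this, assuming a mask given as nested lists of 0 and 1 and 4-connectivity. The function name `skin_clusters` and the `keep` parameter are hypothetical, motion is ignored, and nothing here is taken from the method of [1].

```python
from collections import deque
from typing import List, Tuple

Region = Tuple[int, float, float]  # (area in pixels, centroid x, centroid y)


def skin_clusters(mask: List[List[int]], keep: int = 3) -> List[Region]:
    """Return the `keep` largest 4-connected regions of non-zero pixels."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions: List[Region] = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill from an unvisited skin pixel.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            area, sum_x, sum_y = 0, 0, 0
            while queue:
                x, y = queue.popleft()
                area += 1
                sum_x += x
                sum_y += y
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            regions.append((area, sum_x / area, sum_y / area))
    # Largest regions first; smaller ones are treated as noise and dropped.
    regions.sort(key=lambda region: region[0], reverse=True)
    return regions[:keep]
```

When a hand touches the face, as in Figure 3, the two body parts merge into a single region, so the number of regions returned can be smaller than the number of body parts being tracked.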

Under the assumption that the head and hands form the largest moving connected skin-coloured regions in the image, tracking the head and hands reduces to matching the previously tracked body parts to the skin clusters in the current frame. However, skin clusters can be indistinguishable from one another, and only discontinuous information is available, as though a strobe light were operating, creating a "jerky" effect. Under these conditions, explicit modelling of body dynamics inevitably makes too strong an assumption about the image data. Instead, tracking can be performed better and more robustly through a process of deduction, which requires full exploitation of both visual cues and high-level contextual knowledge.

Robust, real-time human tracking systems must be designed to work with a source of discontinuous visual information. Any vision system operates under constraints that attenuate the bandwidth of visual input: in some cases the data may simply be unavailable; in others, computation time is limited by finite resources.

In [1], the benefits of using contextual knowledge to track discontinuous motion by inference rather than temporal continuity are found to be significant. In that work, a Bayesian Belief Network (BBN) encodes high-level domain knowledge about the tracking task and is used to infer probabilistically the body part associations over time. Because motion discontinuities are common, consistent temporal dynamics cannot be relied upon; instead, the network fuses spatio-temporal information with a number of other cues, giving them equal importance. Observations from the whole spatial domain are considered during inference so that "unexpected" observations do not cause the system to lose track.
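To make the idea of association by inference more concrete, the sketch below scores every possible assignment of body parts to skin clusters and keeps the best one. It is a deliberately naive stand-in, not the Bayesian Belief Network of [1]: the three cues (`distance_cue`, `layout_cue`, `coverage_cue`), their numeric constants and the function `associate` are all hypothetical, and the cues are simply multiplied as if they were independent.

```python
import itertools
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) in pixels, with y growing downward
PARTS = ("head", "left_hand", "right_hand")


def distance_cue(prev: Point, cluster: Point, sigma: float = 80.0) -> float:
    """Soft spatial cue: near clusters are preferred, large jumps stay possible."""
    d2 = (prev[0] - cluster[0]) ** 2 + (prev[1] - cluster[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))


def layout_cue(head: Point, hands: List[Point]) -> float:
    """Hypothetical contextual cue: the head is rarely below either hand."""
    return 1.0 if all(head[1] <= hand[1] for hand in hands) else 0.3


def coverage_cue(used: Tuple[int, ...], n_clusters: int) -> float:
    """Hypothetical cue: every skin cluster should be explained by some part."""
    return 0.1 ** (n_clusters - len(set(used)))


def associate(prev: Dict[str, Point], clusters: List[Point]) -> Dict[str, int]:
    """Return the highest-scoring map from body part to cluster index."""
    best: Dict[str, int] = {}
    best_score = -1.0
    # Parts may share a cluster, e.g. when a hand occludes the face.
    for idx in itertools.product(range(len(clusters)), repeat=len(PARTS)):
        points = [clusters[i] for i in idx]
        score = layout_cue(points[0], points[1:]) * coverage_cue(idx, len(clusters))
        for part, point in zip(PARTS, points):
            score *= distance_cue(prev[part], point)
        if score > best_score:
            best, best_score = dict(zip(PARTS, idx)), score
    return best
```

Because every cluster is considered for every part, a hand that jumps a long way between frames can still be matched to the cluster it landed on, and two parts can map to the same cluster when one occludes the other. A full BBN would replace the hand-set constants with conditional probabilities and model the dependencies between cues.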


Shaogang Gong
2001-05-29