Jamie Sherrah and Shaogang Gong
Tracking human body parts and motion is a challenging but essential task for the modelling, recognition and interpretation of human behaviour. In particular, tracking of at least the head and hands is required for gesture recognition in human-computer interface applications such as sign-language recognition. Existing methods for markerless tracking can be categorised according to the measurements and models used [5]. In terms of measurements, tracking usually relies on intensity information such as edges [6,2,13,4], skin colour and/or motion segmentation [19,10,7,12], or a combination of these with other cues including depth [9,19,14,1]. The choice of model depends on the application of the tracker. If the tracker output is to be used for some recognition process, then a 2D model of the body suffices [12,7]. On the other hand, a 3D model of the body may be required for generative purposes, for example to drive an avatar, in which case skeletal constraints can be exploited [19,14,4], or deformable 3D models can be matched to 2D images [6,13].

When tracking multiple overlapping objects such as the human head and hands under real-world conditions, three main problems are encountered: ambiguity, occlusion and motion discontinuity. Ambiguities arise from distracting noise, mismatching of the tracked objects, and the possibility of occlusion. If the objects are parts of the same articulated entity, such as the human body, domain knowledge can be used to resolve some of the ambiguities. A common and robust approach for real-time tracking is to combine multiple visual cues [17,18]. However, with cues such as skin colour and motion, occlusion still presents a problem because body parts such as the hands can become virtually indistinguishable from one another. Joint tracking of the body parts must therefore be performed with an exclusion principle on observations [15,11].
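An exclusion principle on observations can be illustrated with a minimal sketch: each candidate measurement (e.g. a skin blob) may be claimed by at most one tracker, so two hand trackers cannot lock onto the same blob. The greedy nearest-first assignment below and all names in it are illustrative assumptions, not the specific methods of [15,11]:

```python
import numpy as np

def assign_exclusively(trackers, observations):
    """Greedily assign each observation to at most one tracker.

    trackers: (T, 2) array of predicted positions (e.g. left/right hand).
    observations: (N, 2) array of candidate measurements (e.g. skin blobs).
    Returns {tracker index: observation index}.
    """
    # Pairwise Euclidean distances between trackers and observations.
    d = np.linalg.norm(trackers[:, None, :] - observations[None, :, :], axis=2)
    assignment = {}
    for _ in range(min(len(trackers), len(observations))):
        # Take the globally closest remaining (tracker, observation) pair.
        t, o = np.unravel_index(np.argmin(d), d.shape)
        assignment[int(t)] = int(o)
        d[t, :] = np.inf  # this tracker has been served
        d[:, o] = np.inf  # exclusion: no other tracker may claim this observation
    return assignment

# Two hand trackers and three candidate skin blobs (toy numbers).
hands = np.array([[0.0, 0.0], [10.0, 10.0]])
blobs = np.array([[1.0, 1.0], [9.0, 9.0], [50.0, 50.0]])
matches = assign_exclusively(hands, blobs)
```

Without the column exclusion (`d[:, o] = np.inf`), both trackers would drift to the single best blob during an occlusion, which is precisely the failure mode the exclusion principle prevents.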
Depth information and temporal dynamic models are often exploited to overcome the occlusion problem, for example in [19,10]. However, depth information requires multiple cameras and introduces its own problems of calibration and inaccuracy. Furthermore, spatio-temporal continuity cannot always be assumed as the basis for tracking: body motion may appear discontinuous because the hands can move quickly and seemingly erratically, or undergo occlusion by other body parts. Methods such as Kalman filtering that rely strongly on well-defined dynamics and temporal continuity are therefore generally inadequate. A human observer, on the other hand, typically exploits a wide range of domain knowledge beyond the visual input to reduce reliance on spatio-temporal consistency.

The problem of tracking a person's head and hands in 2D from a single camera view is addressed in [16]. To deal with noise and ambiguity, a view-based data fusion approach is adopted. Inexpensive visual cues, namely motion, skin colour and coarse intensity-based orientation measures, are extracted from a near-frontal view of a subject; examples of these cues are shown in Figure 1. Skin colour and motion are natural cues for focusing attention and computational resources on salient regions of the image. The hand orientation information is used to disambiguate the hands when they cross over in the image. Although distracting noise and background clutter appear in the skin image, these can be eliminated at a low level by ``AND''ing directly with the motion information. However, fusion of the cues at this low level of processing is premature and causes loss of information: for example, motion information generally occurs only at the edges of a moving object, making the fused information too sparse. In this approach the cues are instead fused at a higher level using a Bayesian Belief Network.
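The low-level ``AND''ing of cues, and the sparsity problem it causes, can be sketched in a few lines of NumPy; the arrays are contrived toy examples, not the actual cues of Figure 1:

```python
import numpy as np

def fuse_low_level(skin_mask, motion_mask):
    """Low-level cue fusion: keep only pixels flagged by both cues."""
    return np.logical_and(skin_mask, motion_mask)

# Toy 1-D scanline across a moving hand. Frame differencing fires mainly
# at the edges of the moving region, so the conjunction is far sparser
# than the skin evidence alone -- the loss of information noted above.
skin   = np.array([0, 1, 1, 1, 1, 1, 0], dtype=bool)  # whole hand is skin-coloured
motion = np.array([0, 1, 0, 0, 0, 1, 0], dtype=bool)  # motion only at the edges
fused = fuse_low_level(skin, motion)
```

Here the conjunction retains 2 of the 5 skin pixels, discarding the interior of the hand; this is the kind of information loss that motivates deferring fusion to a higher, probabilistic level.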