Modelling and Tracking Articulated Bodies using Multiple Cameras
An Overview

Maurice Ringer and Joan Lasenby
Signal Processing Group, Engineering Department
Cambridge University, UK

Optical motion capture is the problem of reconstructing the three-dimensional position and orientation of moving articulated bodies, such as people, from observations made by a number of video cameras. Such information can be used to analyse a person's movements for medical purposes or sports performance, or to animate virtual characters for film or computer games.

[Stages of Optical Motion Capture]

More specifically, the desired information, x_k, is a vector of angles which specify the orientation of each limb of the body relative to the limb it is joined to (the body being captured is modelled by a number of limbs joined in a known hierarchical structure). Subscript k is an index into frame number/time. For example, the complete human body can be represented by 13 articulated limbs connected with 39 degrees of freedom [1]. In this case, x_k would contain 39 elements.
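To make the idea of a pose vector of relative angles concrete, the following is a minimal sketch, assuming a planar chain with one angle per limb (a real body model is three-dimensional with up to three angles per joint). The function name `limb_endpoints` and the limb lengths are hypothetical and chosen only for illustration.

```python
import math

def limb_endpoints(relative_angles, limb_lengths):
    """Return the 2D end point of every limb in a planar chain rooted at the origin.

    relative_angles[i] is the angle of limb i measured relative to limb i-1
    (limb 0 is measured relative to the x-axis), mirroring how x_k stores
    each limb's orientation relative to the limb it is joined to.
    """
    points = []
    x = y = 0.0
    absolute_angle = 0.0
    for angle, length in zip(relative_angles, limb_lengths):
        absolute_angle += angle  # accumulate angles down the hierarchy
        x += length * math.cos(absolute_angle)
        y += length * math.sin(absolute_angle)
        points.append((x, y))
    return points

# Hypothetical two-limb pose: first limb points up, second is bent back by 90 degrees.
x_k = [math.pi / 2, -math.pi / 2]
endpoints = limb_endpoints(x_k, [1.0, 1.0])
print(endpoints)
```

Because each angle is relative to the parent limb, changing the first element of x_k moves every limb further down the chain, which is the property the hierarchical model relies on.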

Features of the body are extracted from the images detected at each camera at each time frame. These may be the edges of arms and legs, blobs from the hands or head or (as in the case of marker-based motion capture) bright points generated by reflective markers placed on the body. The information which parameterises these features constitutes the measurement, z_k. For example, in the case of marker-based motion capture, z_k is a stacked vector of the 2D marker positions on the cameras' image planes.
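As an illustration of how such a stacked measurement might be built in the marker-based case, here is a small sketch assuming an ideal pinhole camera with no lens distortion and with every marker visible in every camera. The camera representation (rotation rows, translation, focal length) and the names `project` and `stack_measurement` are assumptions, not part of the original description.

```python
def project(point, camera):
    """Pinhole projection of a 3D point onto one camera's image plane."""
    rotation, translation, focal = camera
    # Transform the point into the camera frame: p_cam = R p + t
    xc, yc, zc = (
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )
    return (focal * xc / zc, focal * yc / zc)

def stack_measurement(markers, cameras):
    """Stack the 2D image positions of all markers, camera by camera, into z_k."""
    z = []
    for camera in cameras:
        for marker in markers:
            z.extend(project(marker, camera))
    return z

IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Two hypothetical cameras with identical orientation but different positions.
cameras = [(IDENTITY, [0, 0, 5], 1.0), (IDENTITY, [1, 0, 5], 1.0)]
markers = [(0, 0, 0), (1, 2, 5)]
z_k = stack_measurement(markers, cameras)
print(len(z_k))  # 2 cameras x 2 markers x 2 image coordinates
```

In a real system the ordering of the entries of z_k is not known in advance, which is why the association variable is needed.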

The problem of optical motion capture is also a function of a third variable, q_k, which details which features were detected, which feature (if any) generated each detection, which features were not detected and which detections were erroneous (not due to features).

Estimating the 3D pose at some time, k, is an example of sequential tracking of a hidden dynamic state in a non-linear system. Such problems have been analysed over the past few decades in many facets of engineering control, particularly in the engineering of radar systems [2]. It is typically a three-step process:

Using estimates of the pose from previous times and some knowledge of how the body might move (the kinematic model), predict the pose at time k.

Estimate the association information, q_k, using the measurement, z_k, and the predicted value of the pose.

Estimate the pose at time k using z_k, q_k, and the predicted pose.

With some Monte Carlo (MC) techniques, however, the boundaries between these stages are less well defined.

Predicting the pose

A prediction of the pose, x'_k, at time k is calculated using previous estimates of the pose, x_{k-D}, ..., x_{k-1}, where D is some amount of time delay. Typical techniques for performing this calculation include:

Using the estimated pose from the last time step, x_{k-1}, as the predicted pose. This is the simplest form of prediction.
Modelling the next pose as an auto-regressive (AR) model. This method involves calculating the AR coefficients, which is done beforehand using known pose trajectories (training sequences) of motions similar to those being observed. This is done, for example, in [3]. The disadvantages of this technique are that training is required and that the motions that can be tracked are limited to those the system has been trained for.
Simultaneously tracking the rate of change (velocity) of the elements of x_k and predicting the new pose assuming this velocity is constant. This technique is often used in radar tracking systems, and an example of its use in motion capture occurs in [4]. One disadvantage is that the dimension of the state, x_k, doubles.
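The three prediction schemes listed above can be sketched as follows. This is only an illustration: the AR model here uses one scalar coefficient per lag shared by all pose elements, which is a simplifying assumption (a trained model would generally use matrix coefficients), and all function names are hypothetical.

```python
def predict_hold(history):
    """Simplest prediction: reuse the last estimated pose."""
    return list(history[-1])

def predict_ar(history, coefficients):
    """AR prediction: x'_k = sum over d of a_d * x_{k-d}, for d = 1..D.

    coefficients[d-1] is the (assumed scalar) weight on the pose d frames ago.
    """
    recent = history[-len(coefficients):]
    dim = len(recent[-1])
    return [
        sum(a * recent[-d][i] for d, a in enumerate(coefficients, start=1))
        for i in range(dim)
    ]

def predict_constant_velocity(pose, velocity, dt=1.0):
    """Constant-velocity prediction; the velocity must be tracked alongside the pose."""
    return [p + v * dt for p, v in zip(pose, velocity)]

history = [[0.0, 1.0], [0.1, 0.9]]  # hypothetical poses at k-2 and k-1
held = predict_hold(history)
ar = predict_ar(history, [2.0, -1.0])
cv = predict_constant_velocity(history[-1], [0.1, -0.1])
print(held, ar, cv)
```

With the coefficients [2, -1] the AR prediction extrapolates the last two poses linearly, so in this example it coincides with the constant-velocity prediction.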

Estimating the Feature Association

The predicted pose is projected (rendered) onto the image planes of the cameras and the resulting 2D features are compared to the detections. Typically, measurement noise is considered Gaussian, so detections are assigned to features by minimising the total distance (on the camera planes) between the detections and their assigned features. For marker-based systems, this is a point matching problem, for which there are fast and efficient solutions [5].
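The point matching step can be illustrated with a small sketch. It assumes a single camera, squared Euclidean distance as the cost, and an exhaustive search over assignments, which is only practical for a handful of markers; an efficient assignment algorithm such as the one in [5] would replace the search in practice. The function names are hypothetical.

```python
from itertools import permutations

def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def associate(predicted, detections):
    """Assign one detection to each predicted feature, minimising total squared distance.

    Returns (assignment, cost), where assignment[i] is the index of the detection
    assigned to predicted feature i. Detections left over are treated as unassigned.
    Assumes there are at least as many detections as predicted features.
    """
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(detections)), len(predicted)):
        cost = sum(
            squared_distance(predicted[i], detections[j]) for i, j in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best = cost, perm
    return best, best_cost

predicted = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]       # rendered marker positions
detections = [(0.9, 0.1), (0.1, 0.9), (0.05, -0.05)]   # detected bright points
assignment, cost = associate(predicted, detections)
print(assignment)  # (2, 0, 1)
```

This sketch ignores missed detections and occlusion, both of which a full association variable q_k has to account for.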

Note that some features will be occluded by other parts of the body when the predicted pose is rendered to the camera image planes. The system should be less inclined to assign detections to these features, as shown in [4,6].

It is important that the association is correct. To ensure this, multiple hypothesis tracking (MHT) techniques are beginning to gain popularity (for example, [7,8]). In these systems, more than just the most likely association is retained and used to estimate the pose (discussed below), resulting in more than one pose estimate. At each time step, only the most likely N poses are retained to stop the tree of possible association hypotheses and pose estimates growing exponentially.
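The pruning step that keeps the hypothesis tree bounded might look like the sketch below. The hypothesis representation, a tuple of log-likelihood, association and pose, is an assumption made for illustration and is not taken from [7,8].

```python
import heapq

def prune_hypotheses(hypotheses, n_keep):
    """Keep only the n_keep most likely hypotheses, best first.

    Each hypothesis is assumed to be a (log_likelihood, association, pose) tuple.
    """
    return heapq.nlargest(n_keep, hypotheses, key=lambda hyp: hyp[0])

# Hypothetical hypotheses produced at one time step.
hypotheses = [
    (-4.2, (0, 1, 2), [0.10, 0.52]),
    (-1.3, (1, 0, 2), [0.12, 0.49]),
    (-9.8, (2, 1, 0), [0.40, 0.10]),
    (-2.7, (0, 2, 1), [0.09, 0.55]),
]
survivors = prune_hypotheses(hypotheses, n_keep=2)
print([hyp[0] for hyp in survivors])  # [-1.3, -2.7]
```

Each survivor would then be expanded with the associations possible at the next frame before pruning again.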

Updating the Pose Estimate

With knowledge of the association, q_k, the final stage in the tracking system is to actually estimate the pose at time k. Formulating the problem in a Bayesian framework, it is desired to maximise the posterior,
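One form of this posterior that is consistent with the three terms discussed next (a likelihood, a term involving the association, and the distribution of the predicted pose) is sketched here. The exact factorisation is an assumption: it follows from Bayes' rule if the measurement and association depend on the predicted pose only through x_k.

```latex
p(x_k \mid z_k, q_k, x'_k) \;\propto\;
  p(z_k \mid x_k, q_k)\; p(q_k \mid x_k)\; p(x_k \mid x'_k)
```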

Typically, the distributions in the second and third terms of this equation are considered very flat (high variance), so the best pose is usually estimated by maximising only the first term, the likelihood function. If the detection error is considered Gaussian with covariance matrix R, the maximum likelihood estimate is found by minimising

    (z_k - h(x_k))^T R^{-1} (z_k - h(x_k))

where h is the rendering function, which returns the locations of the features when projected onto the camera planes given the pose x_k. As h is complicated and non-linear, one technique is to expand it about the predicted pose, x'_k, using a first-order Taylor series. The best pose is then given by the weighted least squares solution,

    x_k = x'_k + (J_h^T R^{-1} J_h)^{-1} J_h^T R^{-1} (z_k - h(x'_k))

where J_h is the Jacobian of h evaluated at x'_k. Examples of such systems include [9,10].
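A single linearised weighted least squares update can be sketched for a toy problem. The example assumes a planar two-limb arm whose elbow and hand positions are measured directly, a diagonal R, and a finite-difference Jacobian; none of these choices come from [9,10], and the function names are hypothetical.

```python
import math

def h(pose, lengths=(1.0, 1.0)):
    """Toy rendering function: elbow and hand positions of a planar two-limb arm."""
    a1, a2 = pose
    ex, ey = lengths[0] * math.cos(a1), lengths[0] * math.sin(a1)
    hx = ex + lengths[1] * math.cos(a1 + a2)
    hy = ey + lengths[1] * math.sin(a1 + a2)
    return [ex, ey, hx, hy]

def jacobian(f, x, eps=1e-6):
    """Forward-difference Jacobian; J[m][i] approximates d f_m / d x_i."""
    fx = f(x)
    columns = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += eps
        columns.append([(a - b) / eps for a, b in zip(f(shifted), fx)])
    return [[columns[i][m] for i in range(len(x))] for m in range(len(fx))]

def wls_update(predicted, z, variances):
    """One step x = x' + (J^T R^-1 J)^-1 J^T R^-1 (z - h(x')) for a 2-element pose."""
    J = jacobian(h, predicted)
    residual = [zi - hi for zi, hi in zip(z, h(predicted))]
    w = [1.0 / v for v in variances]  # diagonal of R^-1
    rows = range(len(z))
    A = [[sum(w[m] * J[m][i] * J[m][j] for m in rows) for j in range(2)] for i in range(2)]
    b = [sum(w[m] * J[m][i] * residual[m] for m in rows) for i in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    # Solve the 2x2 normal equations A dx = b by Cramer's rule.
    dx = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
          (A[0][0] * b[1] - A[1][0] * b[0]) / det]
    return [p + d for p, d in zip(predicted, dx)]

true_pose = [0.5, 0.3]
z_k = h(true_pose)  # noise-free measurement for the illustration
updated = wls_update([0.45, 0.35], z_k, [1e-4] * 4)
print(updated)
```

Because the linearisation is only valid near x'_k, the update is more accurate when the prediction is close to the true pose; repeating the step from the updated pose gives a Gauss-Newton iteration.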

The extended Kalman filter (EKF) [2] is an extension of this least squares solution in which the likelihood and the distribution of the predicted pose (the third term in the original posterior equation) are simultaneously maximised. A further extension of this technique, which incorporated the effect of the second term of the posterior equation into the pose estimate, was proposed in [11].
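For reference, the measurement update of the EKF in its standard textbook form is given below. The symbols P'_k (covariance of the predicted pose), P_k (covariance of the updated pose) and K_k (the Kalman gain) are not defined in the text above and are introduced here as assumptions.

```latex
\begin{aligned}
K_k &= P'_k J_h^T \left( J_h P'_k J_h^T + R \right)^{-1} \\
x_k &= x'_k + K_k \left( z_k - h(x'_k) \right) \\
P_k &= \left( I - K_k J_h \right) P'_k
\end{aligned}
```

As the prediction covariance P'_k grows very large (a flat prior on the predicted pose), this update approaches the weighted least squares solution, provided J_h^T R^{-1} J_h is invertible.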

Another popular method for estimating the pose is the sequential Monte Carlo (MC) filter (also called the particle filter or Condensation) [12,13]. In this technique, a large number of estimates of the pose (the 'particles') are retained, the distribution of which represents the distribution of the unknown pose. At each new time frame, each particle is propagated forward in time using the kinematic model (the prediction step) and then rendered to the camera plane and assigned a weight by the likelihood function. The new distribution of particles is formed by resampling these weighted ones. Particle filters have the advantage that a full distribution of the pose is maintained (the EKF and LS estimators effectively assume the pose is Gaussian distributed) and no linearisation assumptions are made. Their problem is that they are computationally very expensive, especially when the pose has many degrees of freedom (as many particles are required).
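The propagate, weight and resample cycle can be sketched for a one-dimensional toy pose. The example assumes a random-walk kinematic model, a scalar measurement z = sin(angle) with Gaussian noise, and simple multinomial resampling; these choices and the function name are illustrative assumptions and are not taken from [12,13].

```python
import math
import random

def particle_filter_step(particles, z, h, process_std, meas_std, rng):
    """One cycle of a basic particle filter for a scalar pose."""
    # Prediction: propagate each particle with a random-walk kinematic model.
    predicted = [p + rng.gauss(0.0, process_std) for p in particles]
    # Weighting: Gaussian likelihood of the measurement given each particle.
    weights = [math.exp(-0.5 * ((z - h(p)) / meas_std) ** 2) for p in predicted]
    total = sum(weights)
    if total == 0.0:
        # Every particle is far from the measurement; fall back to uniform weights.
        weights = [1.0] * len(predicted)
        total = float(len(predicted))
    weights = [w / total for w in weights]
    # Resampling: draw a new, equally weighted particle set.
    return rng.choices(predicted, weights=weights, k=len(predicted))

rng = random.Random(0)
true_angle = 0.4
particles = [rng.uniform(-1.0, 1.0) for _ in range(500)]
for _ in range(10):
    z_k = math.sin(true_angle)  # noise-free measurement for the illustration
    particles = particle_filter_step(particles, z_k, math.sin, 0.02, 0.05, rng)
estimate = sum(particles) / len(particles)
print(estimate)
```

The number of particles needed to cover the pose space grows quickly with its dimension, which is the computational cost noted above for poses with many degrees of freedom.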

The resulting sequence of pose estimates, x_k, for all time frames, k = 1 ... K, provides the complete position and orientation information of the moving body over that time.

References

[1] A. Menache, Understanding Motion Capture for Computer Animation and Video Games, Morgan Kaufmann Publishers, 2000.

[2] S. Blackman and R. Popoli, Design and Analysis of Modern Tracking Systems, Artech House, 1999.

[3] J. Rittscher and A. Blake, Classification of Human Body Motion, Proc. Int. Conf. Computer Vision (ICCV), pp. 634-639, 1999.

[4] M. Ringer and J. Lasenby, Modelling and Tracking of Articulated Motion from Multiple Camera Views, Proc. British Machine Vision Conf (BMVC), pp 172-181, 2000.

[5] G. Carpaneto and P. Toth, Algorithm 548: Solution of the Assignment Problem [H], ACM Transactions on Mathematical Software, 6(1):104-111, March 1980.

[6] L. Herda, P. Fua, R. Plankers, R. Boulic and D. Thalmann, Skeleton-Based Motion Capture for Robust Reconstruction of Human Motion, Computer Animation, IEEE press, May 2000.

[7] T. Cham and J. Rehg, A Multiple Hypothesis Approach to Figure Tracking, Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol 2, pp 239-245, Fort Collins USA, 1999.

[8] M. Ringer and J. Lasenby, Multiple Hypothesis Tracking for Automatic Visual Motion Capture, (to be published).

[9] J. Rehg and T. Kanade, Model-based tracking of self-occluding articulated objects, Proc. Int. Conf on Computer Vision (ICCV), pp 612-617, Cambridge USA, 1995.

[10] T. Drummond and R. Cipolla, Real-time Tracking of Multiple Articulated Structures in Multiple Views, Proc. 6th European Conf on Computer Vision (ECCV), Dublin Ireland, 2000.

[11] M. Ringer, T. Drummond and J. Lasenby, Using occlusions to aid position estimation for visual motion capture (to be published).

[12] M. Isard and A. Blake, Contour tracking by stochastic propagation of conditional density, Proc. European Conf on Computer Vision (ECCV), vol 1, pp. 343-356, Cambridge UK, 1996.

[13] S. J. Godsill, A. Doucet and M. West, Maximum a posteriori sequence estimation using Monte Carlo particle filters, Ann. Inst. Statist. Math., 52(1), March 2001.

This page last modified: 30 May 2001
