
Motion-based segmentation

This page is by no means an up-to-date or thorough review; it only aims at giving beginners a first insight into the problems and proposed approaches. I'll try to update it from time to time.

What is the problem and why is it important?

Motion-based segmentation of images refers, here, to partitioning an image into regions of homogeneous 2D (apparent) motion. "Homogeneous" generally implies continuity of the motion field, or the possibility of describing the motion field by a parametric motion model.

Motion-based segmentation is a multi-purpose task in computer vision:

If the goal is image understanding, color/grey-level/texture can help locate interesting zones/objects. However, a partition based on such criteria will often contain too many regions to be exploitable, interesting objects thus being split into several regions. Scenes often consist of animated regions of interest (people, vehicles...) over some background scene, and in such cases, motion is a far more appropriate criterion.

A simple example:

This segmentation was obtained using the method described in [7]. Though you probably can't see it here, the camera is panning left while tracking the little boat, and the larger boat is moving right.

Image at time t

Image at time t+1

Image at time t, with the boundaries of the motion-based regions superimposed, as well as the motion field corresponding to the affine motion model estimated per region.

!! Although boundaries are shown, the method used has nothing to do with contours - it operates on regions !!

What do motion-based regions correspond to?

The apparent (projected) motion which is to be segmented depends not only on the 3D motion of the scene and on camera motion, but also on the position of every point (the 3D structure of the scene), all of which are generally unknown. Because of (1) discontinuities or important variations in surface orientation, and (2) depth discontinuities or important variations relative to the camera combined with sideways camera translation, the rigid motion of a single physical entity may lead to only piecewise homogeneous apparent motion (parallax / model insufficiency), hence to over-segmentation relative to an image-content understanding goal. What looks like a drawback can, however, be exploited to segment according to scene depth. In nice cases, as in the example above, motion-based regions can correspond to objects of interest. Notice that a grey-level or texture-based segmentation wouldn't have been able to locate the boats...

Why is the problem difficult and how can it be tackled?

Segmenting an image (actually, a pair of frames) according to a motion-based criterion can be expressed as the need to determine three entities: the number of regions in the partition, the spatial support of each region, and the motion (model) attached to each region.

While segmentation is in itself a difficult issue, segmentation based on motion suffers from the fact that motion observations are partially hidden variables. Basic understanding of optical flow computation and of the aperture problem is a prerequisite not dealt with here.

One starts by characterizing motion on somewhat arbitrary supports, since the relevant supports are not known. These supports can range from local (a dense field) to global (the whole image), or be semi-local: a partition of the image into spatially regular regions (square blocks) or data-dependent regions (intensity, color, texture...). Here arises the following problem: if the chosen support straddles several distinct motions, the observations form a mixture, and a single motion estimate may describe none of the actual motions.

Motion estimation on supports larger than a single pixel generally relies on parametric motion models (affine, quadratic, projective, etc.). In such cases, the estimated motion model aims at describing the global motion of the whole region on which it is estimated, hence possibly falling into the mixture problem mentioned above. In practice, the same issue arises with a dense motion field, because such approaches generally include a spatial continuity constraint on the motion field to regularize the ill-posed inverse problem of motion estimation. Besides, one would like to end up with regions that are spatially compact rather than noisy, so as to correspond to the physical reality.
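Parametric models of this kind are easy to make concrete. The sketch below (an illustrative assumption of mine, not code from any cited paper) evaluates a common 6-parameter affine motion model on a pixel grid:

```python
import numpy as np

def affine_flow(params, h, w):
    """Evaluate a 6-parameter affine motion model on an h x w pixel grid.

    A common parameterization (one choice among several) maps each pixel
    (x, y) to a displacement:
        u(x, y) = a0 + a1*x + a2*y
        v(x, y) = a3 + a4*x + a5*y
    """
    a0, a1, a2, a3, a4, a5 = params
    y, x = np.mgrid[0:h, 0:w].astype(float)
    u = a0 + a1 * x + a2 * y
    v = a3 + a4 * x + a5 * y
    return u, v

# A pure horizontal translation of 2 pixels:
u, v = affine_flow([2.0, 0, 0, 0, 0, 0], 4, 4)
```

The constant terms encode translation of the region, while the linear terms capture divergence, rotation and shear of the apparent motion.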

Reading the literature, one may come across a classification into direct and indirect methods. The former operate directly on the observed image pair, whereas the latter take a pre-computed motion field as data input. We only focus here on the first category.

A few classes of approaches

Top-down techniques

Let us consider first techniques that take the whole image as the initial estimation support. It is likely that several motions are present within this support. A global motion model may be estimated, and areas that do not conform to the estimated global motion are detected, so as to obtain a two-class partition. We now discuss how the global motion is estimated, and how the decision upon the conformity of a pixel to that motion may be taken. A motion model can then be estimated on the non-conforming pixels and another class found, and so forth. This principle assumes the presence of a dominant motion. This assumption is strong if motion is estimated using a least-squares technique [6], and less strong if a robust estimation technique is used [7], so as to discard non-conforming pixels from the estimation. The matter is then, given the computed global motion model, to decide, for each pixel, whether the intensity data is well explained by this model. The simplest criterion, i.e. the DFD, has severe limitations, and [6, 7] discuss other, more relevant conformity measures. A problem is that simple rejection of non-conforming pixels does not necessarily produce regions that are spatially compact, as one would like. Segmentation in a Markovian framework [7] enables the addition of a spatial consistency constraint. The estimation/detection process may be iterated, for instance, until no region of significant size is detected; the number of regions in the partition is thereby straightforwardly provided.

Joint estimation / segmentation

A family of approaches tries to avoid the assumption of a dominant motion by considering the data as produced by a mixture of motion models [1, 10]. The problem is then to associate each pixel with the right motion model, while simultaneously estimating these motions. The association weights provide the estimation supports. This is carried out in [1, 10] as the two steps of the EM algorithm. These approaches estimate motion models and supports simultaneously rather than sequentially as above, and hence have the advantage of treating all regions in the same manner and avoiding the need for a dominant motion. Introducing spatial smoothness of regions is not straightforward, but [10] proposed a formulation in this direction.
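The E/M alternation can be shown on a toy version of the problem. The sketch below is my simplification ([1, 10] use richer parametric models, likelihoods and initialization): it soft-assigns each pixel of a precomputed flow field to one of k translational models, the responsibilities playing the role of the estimation supports:

```python
import numpy as np

def em_motion_mixture(u, v, k=2, n_iter=20, sigma=0.5):
    """Fit a k-component mixture of translational motion models to a flow
    field (u, v) with EM. Toy illustration of the principle in [1, 10]."""
    d = np.stack([u.ravel(), v.ravel()], axis=1)     # (n, 2) observations
    # Crude deterministic initialization from spread-out samples;
    # real implementations initialize far more carefully.
    models = d[np.linspace(0, len(d) - 1, k).astype(int)]
    for _ in range(n_iter):
        # E-step: responsibility of model j for pixel i, from a Gaussian
        # likelihood of the residual displacement.
        r2 = ((d[:, None, :] - models[None, :, :]) ** 2).sum(-1)
        resp = np.exp(-r2 / (2 * sigma ** 2))
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        # M-step: each model re-estimated on its soft support.
        models = (resp.T @ d) / (resp.sum(axis=0)[:, None] + 1e-12)
    return models, resp.reshape(u.shape + (k,))
```

On a field whose left half moves by (1, 0) and right half by (0, 1), the two recovered models land close to those displacements, with no dominant-motion assumption involved.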

The problem of estimating the number of models remains. For mixture models and other techniques, the MDL criterion was proposed in [1, 12]. However, image understanding and image coding purposes do not necessarily coincide.
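One common way to make such a criterion concrete is a two-part code: the cost of coding the residuals plus the cost of coding the model parameters. The sketch below is a BIC-style approximation under Gaussian residuals (my illustrative assumption, not the exact criteria of [1, 12]):

```python
import numpy as np

def mdl_choose_k(rss_per_k, n, params_per_model=6):
    """Pick the number of motion models by a two-part MDL/BIC-style score:
    data coding cost (from the residual variance) plus model coding cost.

    rss_per_k: dict {k: residual sum of squares after fitting k models}
    n: number of observations (pixels)."""
    def score(k, rss):
        data_cost = 0.5 * n * np.log(max(rss, 1e-12) / n)
        model_cost = 0.5 * params_per_model * k * np.log(n)
        return data_cost + model_cost
    return min(rss_per_k, key=lambda k: score(k, rss_per_k[k]))
```

Adding a model is accepted only when the drop in residual coding cost outweighs the extra parameter cost, which is exactly the trade-off that makes the number of regions depend on the purpose (understanding vs. coding).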

In [8], a Markovian approach to joint estimation and segmentation is proposed, through the definition of several appropriate constraints and energy minimization.

Grouping of elementary regions

These approaches start by building a partition of the image into elementary regions. Often, the motion-based regions sought for are assumed to be clusters of these elementary regions. The elementary regions can for instance be square blocks, intensity-based or texture-based regions.

Taking both intensity and motion information into account in segmentation procedures is, among other reasons, motivated by the ability of intensity cues to locate boundaries accurately and to cope with image areas with poor intensity gradient information. These are often shortcomings for segmentation exploiting only motion information. On the other hand, motion-based segmentation generally leads to a more semantic description of the image, involving fewer and often more significant regions than a spatial segmentation. In several approaches, intensity is involved at pixel level through a spatial segmentation, providing a set of regions that are handled by a region-level motion-based scheme. In [2, 3], a spatial segmentation stage is followed by a motion-based region-merging phase. In [2], regions are grouped by iterating estimation of the dominant motion and grouping of regions that conform to that motion, while in [3], a k-medoid clustering algorithm is used. Similar methods involve, in contrast, motion-based intermediate regions. A variety of methods have been proposed in this direction, generally carrying out grouping also on a motion-based criterion. A k-means clustering algorithm in motion parameter space was used in [9]. With clustering methods in particular, determination of the number of clusters is a key issue. This problem was addressed in [12] with an MDL-based approach. An explicit region-level merging procedure has been embedded in a Markovian framework in [5] and [11].
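As an illustration of clustering in motion-parameter space (in the spirit of [9], with my own simplifications), each elementary region can be represented by its estimated motion parameter vector and grouped by a plain k-means:

```python
import numpy as np

def cluster_region_motions(motion_params, k, n_iter=30):
    """Group elementary regions by k-means in motion-parameter space.

    motion_params: (n_regions, p) array, one estimated motion model per
    elementary region. Returns a cluster label in 0..k-1 per region.
    Illustrative sketch: a real system would normalize or weight the
    parameters, since translation and linear terms have different units."""
    x = np.asarray(motion_params, dtype=float)
    # Deterministic initialization from spread-out samples (for brevity).
    centers = x[np.linspace(0, len(x) - 1, k).astype(int)]
    for _ in range(n_iter):
        # Assign each region to its nearest cluster center...
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # ...then re-estimate each center on its assigned regions.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels
```

Here k is fixed in advance, which is precisely the weak point the MDL-based approach of [12] addresses.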

Thanks for sending any suggestions/corrections to Marc.Gelgon@irisa.fr .





Marc.Gelgon
Mon Feb 2 11:59:29 MET 1998