
Multiple view reconstruction

Reconstruction of 3D models from a single image can produce only limited, simplified representations of the scene as seen from that specific viewpoint. To resolve occlusion ambiguities, multiple images that sufficiently cover the whole scene should be acquired. PhotoBuilder [46] is an interactive working application for modelling architectural buildings from video that can be considered an extension of [56]. Unlike [56], however, motion and structure estimates are computed using a batch bundle adjustment method, and planes are determined by a best-fit algorithm over their manually grouped features. The final model is a textured triangulated representation constrained to the computed structure primitives.
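The text does not specify which best-fit algorithm PhotoBuilder uses; one common choice for fitting a plane to a set of manually grouped 3D features is a total least squares fit via the SVD, sketched below in Python. The sample data and the function name are illustrative assumptions, not details taken from [46].

import numpy as np

def fit_plane(points):
    """Total-least-squares plane fit to a set of 3D points.

    Returns (n, d) such that n . x + d = 0 for points x on the plane,
    with n a unit normal.  This is only one plausible realisation of a
    'best fit' plane, not necessarily the one used in PhotoBuilder.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The normal is the right singular vector associated with the
    # smallest singular value of the centred point cloud.
    _, _, vt = np.linalg.svd(pts - centroid)
    n = vt[-1]
    d = -np.dot(n, centroid)
    return n, d

# Example: features manually grouped onto one wall of a building.
wall_points = np.array([[0.0, 0.0, 0.00],
                        [1.0, 0.0, 0.10],
                        [0.0, 2.0, -0.05],
                        [1.0, 2.0, 0.02]])
normal, offset = fit_plane(wall_points)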

The problem of automatically associating individual features with higher level surfaces in 3D is very complicated because no knowledge of the spatial extent of these surface patches can be accurately inferred. The simple solution would be to use manual intervention to gain this knowledge. However, automatic plane reconstruction methods have been presented [3, 14]. Baillard et al. [3] have proposed a batch approach in which 3D line positions are estimated by tracking their projections over successive frames, and a plane sweep strategy is then applied by hypothesising a planar facet attached to each line for every angle around it. The half plane with the highest similarity correlation score over the whole set of images is assigned to the line, and a set of heuristic rules is subsequently used to support line grouping and to outline the boundaries of each estimated plane. The approach is very computationally expensive, and the heuristics used may fail under complex planar configurations.

A more efficient algorithm, optimised for the reconstruction of architectural scenes, has been presented in [14]. Estimates of 3D points are computed and an initial plane grouping is achieved by recursively applying a RANSAC plane estimation algorithm. All structure is then projected onto the ground plane. As the remaining planes correspond to walls that can be considered perpendicular to the ground, line segments are fitted to these projected 2D points and the plane boundaries are hypothesised from the highest and lowest points in each cluster. Unfortunately, this process is only applicable to a restricted class of scenes, as it can only recover the ground and the side planes of buildings. Nonetheless, the resulting model is not merely a coarse planar approximation, as a region growing algorithm is proposed to search for areas within each plane whose likelihood is maximised at similar offset values. The position and size of these offsets are manually fed to the system and their shape is automatically decided from a set of predefined models.
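The recursive RANSAC plane grouping used as the initial step in [14] can be sketched as follows: dominant planes are extracted one at a time, removing their inliers from the point set before the next iteration. The thresholds, iteration counts and minimum support are illustrative assumptions rather than values reported in [14].

import numpy as np

def ransac_plane(points, n_iters=500, tol=0.05, seed=0):
    """Single RANSAC plane hypothesis: returns ((normal, d), inlier mask)."""
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        n /= norm
        d = -np.dot(n, sample[0])
        inliers = np.abs(points @ n + d) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, d)
    return best_plane, best_inliers

def recursive_plane_grouping(points, min_support=30):
    """Peel off dominant planes one at a time until too few points remain."""
    planes, remaining = [], np.asarray(points, dtype=float).copy()
    while len(remaining) >= min_support:
        plane, inliers = ransac_plane(remaining)
        if plane is None or inliers.sum() < min_support:
            break
        planes.append((plane, remaining[inliers]))
        remaining = remaining[~inliers]
    return planes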

The idea of incorporating geometric models common to architectural buildings has previously been presented by Debevec et al. [12]. They developed Facade, an interactive tool for the reconstruction of realistic models from a sparse set of images. Initially the user builds a hierarchical geometric model that closely resembles the scene. The model consists of predefined parametric polyhedral primitives and the process used is similar to solid modelling. A reference image is chosen and an estimate of the corresponding camera rotation and translation relative to this model's coordinate system is computed. Model parameters are optimised by minimising the disparity between interactively matched projected edges of the model and edges marked in the image. Although the resulting model accurately approximates the geometry and topology of the scene, additional detail may be captured using a dense model-based stereo method [13]. While the approach requires significant manual intervention, it combines geometric and photogrammetric techniques, resulting in highly realistic reconstructions.
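As an illustration of the kind of edge-based objective such a system minimises, the sketch below measures the disparity for a single correspondence as the perpendicular distance of the marked image edge's endpoints from the line through the projected model edge. The actual formulation in [12] integrates the error along the edge and minimises it jointly over camera and model parameters, so this is only a simplified, assumed stand-in.

import numpy as np

def project(P, X):
    """Project 3D points X (Nx3) with a 3x4 camera matrix P into the image."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3]

def edge_disparity(P, model_edge, image_edge):
    """Sum of perpendicular distances from the endpoints of a marked image
    edge (2x2) to the infinite line through the projected model edge (2x3).
    A simplified stand-in for Facade's edge-based objective."""
    a, b = project(P, np.asarray(model_edge, dtype=float))
    direction = b - a
    direction /= np.linalg.norm(direction)
    normal = np.array([-direction[1], direction[0]])   # 2D line normal
    c = -np.dot(normal, a)
    return sum(abs(np.dot(normal, p) + c)
               for p in np.asarray(image_edge, dtype=float))

A full objective would sum this residual over all matched edges and be minimised over the camera pose and the parameters of the polyhedral primitives.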

A method aiming at scene topology estimation, assuming only a sparse set of 3D point estimates computed by a structure-from-motion (SFM) process, has been proposed by Morris et al. [40]. A triangulation of this point set is initially performed and the resulting model is then projected onto each of the images in the sequence. Using successive edge swapping and measuring the variance of the intensity of pixels belonging to corresponding triangles, they formulate the problem of finding a consistent triangulation as a minimum variance estimation problem over the space of all possible triangulations. Although the method appears to produce good surface approximations for complex real scenes, its principal disadvantage is that it cannot handle occlusions.
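A hedged sketch of the kind of cost such an approach could use is given below: each 3D triangle is projected into every image and the variance of the intensities it covers is accumulated, with an edge swap accepted whenever it lowers the total. The projection model, image format and handling of degenerate or out-of-bounds triangles are assumptions, not details taken from [40].

import numpy as np

def project(P, X):
    """Project 3D points X (Nx3) with a 3x4 camera matrix P into the image."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3]

def triangle_variance(image, tri2d):
    """Variance of the intensities covered by a projected triangle (3x2)."""
    a, b, c = tri2d
    T = np.array([b - a, c - a]).T
    if abs(np.linalg.det(T)) < 1e-9:                  # degenerate triangle
        return 0.0
    (x0, y0), (x1, y1) = tri2d.min(axis=0), tri2d.max(axis=0)
    ys, xs = np.mgrid[int(y0):int(np.ceil(y1)) + 1,
                      int(x0):int(np.ceil(x1)) + 1]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    # Barycentric point-in-triangle test over the bounding box.
    uv = np.linalg.solve(T, (pts - a).T).T
    inside = (uv[:, 0] >= 0) & (uv[:, 1] >= 0) & (uv.sum(axis=1) <= 1)
    h, w = image.shape[:2]
    xi = np.clip(pts[inside, 0].astype(int), 0, w - 1)
    yi = np.clip(pts[inside, 1].astype(int), 0, h - 1)
    vals = image[yi, xi]
    return float(vals.var()) if len(vals) else 0.0

def triangulation_cost(images, cameras, vertices3d, triangles):
    """Total intensity variance of a candidate triangulation over all views."""
    return sum(triangle_variance(img, project(P, vertices3d[list(tri)]))
               for img, P in zip(images, cameras)
               for tri in triangles)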

A system that attempts to address both problems of estimating scene geometry and topology automatically has been developed by Faugeras et al. [19]. They use a trinocular video head mounted on a mobile robot that moves around the scene in order to build a 3D model representation of the real-world environment. For each position a local 3D geometric description, called the ``Visual Map'', is computed from the stereo reconstruction. Relying on a batch Extended Kalman Filtering approach, they achieve registration and fusion of these local maps into a common global coordinate system. Scene topology is estimated off-line, after the whole sequence has been processed, by applying a constrained 3D Delaunay triangulation over the reconstructed 3D structure primitives. This process yields a volumetric triangular-face representation that is difficult to texture or render. For each viewpoint, the visibility of scene features from the corresponding ``Visual Map'' is tested in order to remove parts of the model that occlude the features, a process that requires the accumulation of all the ``Visual Maps'' during the reconstruction stage. Furthermore, no uncertainty over the feature positions is used in the visibility tests, and the method is therefore highly susceptible to noise, as even small perturbations in feature localisation can result in the rejection of valid scene regions. Finally, the process is strongly dependent on the locality heuristic, as the generation of the tetrahedral hypotheses is performed in a single global coordinate frame. Removal of estimated tetrahedra according to visibility constraints from previous viewpoints can thus yield holes in the model with no means of filling the resulting gaps.
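A minimal sketch of this style of visibility-based carving is given below, assuming a tetrahedral model, a known camera centre and a list of features known to be visible from that viewpoint: any tetrahedron crossed by a camera-to-feature line of sight is discarded. The geometric test and its tolerances are assumptions for illustration, not the implementation of [19], and, as criticised above, no feature-position uncertainty is taken into account.

import numpy as np

def ray_hits_triangle(orig, dest, tri, eps=1e-9):
    """Moller-Trumbore test: does the open segment orig->dest cross triangle tri?"""
    d = dest - orig
    v0, v1, v2 = tri
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(d, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:                         # segment parallel to triangle
        return False
    inv = 1.0 / det
    t0 = orig - v0
    u = np.dot(t0, p) * inv
    if u < 0 or u > 1:
        return False
    q = np.cross(t0, e1)
    v = np.dot(d, q) * inv
    if v < 0 or u + v > 1:
        return False
    t = np.dot(e2, q) * inv
    return eps < t < 1 - eps                   # strictly between camera and feature

def carve(tetrahedra, vertices, camera_centre, visible_features):
    """Drop every tetrahedron that blocks a line of sight from the camera
    centre to a feature known to be visible from that viewpoint."""
    faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
    kept = []
    for tet in tetrahedra:
        pts = vertices[list(tet)]
        blocks = any(ray_hits_triangle(camera_centre, f, pts[list(face)])
                     for f in visible_features for face in faces)
        if not blocks:
            kept.append(tet)
    return kept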

More recently, an incremental geometric theory for estimating scene topology has been presented by Manessis et al. [37], which has been proved to converge to a triangular surface approximation of the real scene as the number of views increases. An algorithm has also been introduced that provides a computationally efficient approximation of the general methodology, in which a consistent model is progressively built without storing history information from previous frames. The robustness of the reconstruction process to outliers and noisy input 3D measurements has further been demonstrated in [36]. This is a complete planar scene topology estimation system operating on sparse input data that addresses all the problems identified in [19].

