
Reconstruction from dense passive sensor range data

Stereo vision is the process of acquiring 3D range information about a scene from two or more images taken from different viewpoints. It is analogous to the human visual system, where the different perspectives of our two eyes produce a slight displacement of the scene in the two monocular views, which permits us to estimate depth. Computer stereo vision is a passive sensing method based on triangulation between the pixels that correspond to the projection of the same scene structure in each of the images.
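As a minimal illustration of the triangulation principle, the sketch below recovers depth from disparity for a rectified stereo pair, where corresponding pixels lie on the same scanline and depth follows Z = fB/d. The function name and the assumption of a known focal length (in pixels) and baseline (in metres) are ours, not taken from any of the cited systems.

    import numpy as np

    def depth_from_disparity(disparity, focal_px, baseline_m):
        # For a rectified pair, triangulation reduces to Z = f * B / d:
        # depth is inversely proportional to the pixel disparity d.
        d = np.asarray(disparity, dtype=np.float64)
        depth = np.full(d.shape, np.inf)
        valid = d > 0  # zero disparity corresponds to a point at infinity
        depth[valid] = focal_px * baseline_m / d[valid]
        return depth

With f expressed in pixels and B in metres, the returned depth map is in metres.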

Two views are sufficient to compute 3D depth information. However, this does not always correspond to the number of physical cameras: triangulation can be achieved with a single camera that moves around the scene, treating its previously captured image as the second member of a stereo pair. A system that uses a single uncalibrated hand-held camera for full scene reconstruction has been proposed by Pollefeys et al. [44]. They first compute the epipolar geometry relating the first pair of images in the sequence and define a reference frame based on it. The camera pose for each subsequent frame is then estimated in this reference projective frame. Reconstruction using uncalibrated stereo can only be determined up to a projective transformation, so the next step upgrades the projective reconstruction to a metric one. Having calculated the relative position and orientation of all cameras, image rectification is applied to facilitate dense depth estimation. The raw 3D data are subsequently smoothed using a thin-plate spline method, and in the final step a triangulation in the reference view is applied to build a piecewise surface model. The principal weakness of this system is the lack of integration of data from different viewpoints, which limits the reconstruction to selected regions of the scene.
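The sketch below illustrates the rectification and dense-matching stage of such a pipeline using standard OpenCV calls. It is not the authors' implementation: OpenCV's uncalibrated rectification and semi-global matcher stand in here for the corresponding steps described in [44], and the matched point arrays pts1 and pts2 are assumed to come from a feature tracker.

    import cv2
    import numpy as np

    def uncalibrated_dense_disparity(img1, img2, pts1, pts2):
        # img1, img2: 8-bit grayscale views; pts1, pts2: (N, 2) float32
        # arrays of matched feature points between the two views.
        h, w = img1.shape[:2]
        # Epipolar geometry from the matches; RANSAC rejects outliers.
        F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
        # Warp both images so corresponding points share a scanline.
        ok, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, F, (w, h))
        rect1 = cv2.warpPerspective(img1, H1, (w, h))
        rect2 = cv2.warpPerspective(img2, H2, (w, h))
        # Dense matching along the rectified scanlines.
        sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=96,
                                     blockSize=5)
        disparity = sgbm.compute(rect1, rect2).astype(np.float32) / 16.0
        return F, disparity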

The use of a single camera imposes the constraint of a short baseline between successive views in order to achieve accurate automatic matching. On the other hand, this results in relatively inaccurate depth data confined to a narrow field of view. A system that overcomes this problem has been presented by Kang et al. [27], which makes use of panoramic images taken from a single weakly calibrated camera. At each camera position a panorama is created by stitching images captured while rotating the camera 360° about its centre. Features are extracted and matched among the panoramic images, and these matches feed an eight-point algorithm from which the essential matrix between a reference panorama and each of the other images is computed. Dense matching is then achieved by a search constrained by the epipolar geometry, resulting in a dense depth map, and a 3D triangulation of the estimated point cloud yields a model of the real scene. Unfortunately, this technique is not applicable to indoor environment reconstructions because of its inherent limitation in modelling the floor and ceiling of the scene.
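The eight-point algorithm mentioned above amounts to a few lines of linear algebra, sketched below. Here x1 and x2 are assumed to be correspondences already normalised by the camera intrinsics (pixel coordinates premultiplied by the inverse calibration matrix), and the final step projects the linear estimate onto the set of valid essential matrices (two equal singular values and one zero).

    import numpy as np

    def essential_eight_point(x1, x2):
        # x1, x2: (N, 2) arrays, N >= 8, in normalised camera coordinates,
        # satisfying the epipolar constraint x2^T E x1 = 0.
        N = x1.shape[0]
        A = np.column_stack([
            x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
            x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
            x1[:, 0],            x1[:, 1],            np.ones(N),
        ])
        # E is the least-squares null vector of A, found via SVD.
        _, _, Vt = np.linalg.svd(A)
        E = Vt[-1].reshape(3, 3)
        # Enforce the essential-matrix constraint on the singular values.
        U, S, Vt = np.linalg.svd(E)
        sigma = (S[0] + S[1]) / 2.0
        return U @ np.diag([sigma, sigma, 0.0]) @ Vt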

Contrary to [44] and [27], where monocular video was used, Narayanan et al. [41] employed an acquisition method based on 51 synchronised calibrated cameras mounted on a 5-metre-diameter dome. Their ``virtualised reality'' system gave the user the ability to control the viewing angle of a dynamic event taking place in the dome. At each time instant a multi-baseline stereo algorithm is used to compute a dense range map, and nearest-neighbour triangulation builds a polygonised model for each viewpoint. Each model is then texture-mapped from the corresponding source images, making it possible to render the view from each of the original camera directions directly. To synthesise a view from an arbitrary viewpoint, the model from the nearest camera is chosen as reference and rendered; holes due to regions that were occluded in the reference frame but become visible from the new viewing direction are covered by rendering neighbouring models. Alternatively, the system can provide a complete surface model by fusing the multiple depth maps with a volumetric integration method [11]. Although both dynamic and static scenes can be modelled, the hardware configuration is both costly and immobile: only events inside the dome can be modelled, making the approach infeasible for applications where cameras must be deployed inside the scene for exploration and mapping.
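A schematic version of the multi-baseline idea is sketched below: for each candidate inverse depth, matching costs are summed over all camera pairs, which sharpens the cost minimum compared with a single pair. The assumption of rectified horizontal baselines, the box-filter aggregation, and all parameter names are our simplifications for illustration, not details of the system in [41].

    import numpy as np
    from scipy.ndimage import uniform_filter

    def multibaseline_depth(ref, others, baselines, focal_px,
                            inv_depths, win=5):
        # ref: (H, W) reference image; others: images whose horizontal
        # baselines to the reference camera are listed in `baselines`.
        # inv_depths: candidate inverse depths, all strictly positive.
        H, W = ref.shape
        cost = np.empty((len(inv_depths), H, W))
        for k, iz in enumerate(inv_depths):
            ssd = np.zeros((H, W))
            for img, B in zip(others, baselines):
                # Disparity grows linearly with inverse depth: d = f*B/Z.
                d = int(round(focal_px * B * iz))
                shifted = np.roll(img, d, axis=1)  # sketch: ignores borders
                ssd += (ref.astype(float) - shifted) ** 2
            # Aggregate the summed SSD over a win x win window.
            cost[k] = uniform_filter(ssd, size=win)
        best = np.argmin(cost, axis=0)  # winner-take-all per pixel
        return 1.0 / np.asarray(inv_depths)[best]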

The methodologies for reconstruction from dense depth data acquired by laser range scanners (LRS) and by stereo images are very similar, as both rely on registration and integration steps. Calibration, which can be considered the equivalent of registration in stereo techniques, is a much simpler process that can usually be performed off-line. Stereo vision, however, requires an additional step, consisting of correspondence and triangulation, in order to estimate depth, and achieving reliable, robust and accurate automatic correspondence between multiple views of an arbitrary scene is still an open problem in computer vision. Because an LRS performs this depth estimation in hardware, it achieves superior range resolution and accuracy. Despite their cost, size and acquisition time, this is the main reason that active scanners are still used for scene reconstruction applications where accuracy is the principal requirement.
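The triangulation step that stereo adds over an LRS can itself be written as a small linear system, the classical direct linear transform (DLT), sketched below. The 3x4 projection matrices P1 and P2 are assumed known from calibration; the function name is ours.

    import numpy as np

    def triangulate_dlt(P1, P2, x1, x2):
        # x1, x2: pixel coordinates (2,) of one correspondence in each view.
        # Each view contributes two rows of the homogeneous system A X = 0.
        A = np.stack([
            x1[0] * P1[2] - P1[0],
            x1[1] * P1[2] - P1[1],
            x2[0] * P2[2] - P2[0],
            x2[1] * P2[2] - P2[1],
        ])
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]
        return X[:3] / X[3]  # dehomogenise to a 3D point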

