Ullman and Basri (1991) discuss the problems associated with three-dimensional object models and propose a viewer-centred scheme based on two-dimensional views, with several benefits over the previously presented approach: it is no longer restricted to rigid transformations, it does not require the explicit reconstruction and representation of three-dimensional structure for storing the objects, and it reduces the computational requirements. Another noteworthy aspect is the proof that, under certain assumptions, all views of a three-dimensional object that may arise from rotation, translation and scaling with subsequent orthographic projection can be expressed as linear combinations of a small number of 2D views. On the other hand, this method presupposes a correspondence between features of the input image and the model views as well as among all the models. Moreover, it demands that all object points be visible from every perspective. Even though these assumptions are hardly realizable for real scenes or automated model acquisition, the scheme gives an impressive indication of the potential information content of two-dimensional views.
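The linear-combination result can be illustrated numerically. The sketch below is a hedged illustration, not code from the paper: the point set, rotation angles and translation are arbitrary assumptions. It projects a random rigid point set orthographically and verifies that the image coordinates of a novel view are an exact linear combination of the coordinates of two stored views plus a constant column (which absorbs translation):

```python
import numpy as np

rng = np.random.default_rng(0)

def rotation(ax, ay):
    # rotate about the y-axis by ay, then about the x-axis by ax
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    return Rx @ Ry

def ortho_view(P, R, t=(0.0, 0.0)):
    # rotate, orthographically project (drop z), then translate in the image
    Q = P @ R.T
    return Q[:, 0] + t[0], Q[:, 1] + t[1]

# hypothetical object: 20 random 3-D feature points
P = rng.normal(size=(20, 3))

# two stored model views
x1, y1 = ortho_view(P, rotation(0.0, 0.0))
x2, y2 = ortho_view(P, rotation(0.3, 0.5))

# a novel view: unseen rotation plus a translation
xn, yn = ortho_view(P, rotation(-0.4, 0.8), t=(0.2, -0.1))

# basis spanning all orthographic views: x1, y1, x2 and a constant
B = np.column_stack([x1, y1, x2, np.ones(len(P))])
an, *_ = np.linalg.lstsq(B, xn, rcond=None)
bn, *_ = np.linalg.lstsq(B, yn, rcond=None)
print(np.abs(B @ an - xn).max(), np.abs(B @ bn - yn).max())  # both ~0
```

The reconstruction is exact (up to rounding) because the projection rows of any rotation matrix lie in the three-dimensional span of the basis views' projection rows, and the constant column accounts for translation.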
An early implementation of a view-based recognition system by means of an artificial neural network is presented by Poggio and Edelman (1990). They postulate that for every object a function can be found which transforms all possible views of that object into a single standard view. Approximations of these functions are learned by RBF (radial basis function) networks, each trained separately on different views of its corresponding object. Recognition then consists of applying the transformation functions to the input view and comparing the resulting outputs with the stored standard views. Because the approach requires a constant number of feature points together with an exact correspondence between image and model, the previously mentioned drawbacks also hold here.
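A minimal sketch of this idea follows, under assumptions not taken from the paper (a random object, views generated by rotation about one axis, a Gaussian RBF with a width chosen from the training-view spacing). Each training view is mapped to the same target, the standard view; an unseen intermediate view is then transformed and compared against that target:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(8, 3))  # hypothetical object with 8 feature points

def view(angle):
    # flattened image coordinates of an orthographic view, rotated about y
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    Q = P @ R.T
    return np.concatenate([Q[:, 0], Q[:, 1]])

train_angles = np.linspace(-0.6, 0.6, 7)
V = np.stack([view(a) for a in train_angles])   # training views
v_std = view(0.0)                               # the stored standard view

# Gaussian RBF centred on each training view; width from neighbour spacing
d = np.linalg.norm(V[:, None] - V[None, :], axis=-1)
sigma = np.mean(np.sort(d, axis=1)[:, 1])       # mean nearest-neighbour dist
G = np.exp(-d**2 / (2 * sigma**2))

# every training view maps to the same target: the standard view
T = np.tile(v_std, (len(V), 1))
C = np.linalg.solve(G, T)                       # RBF coefficients

def transform(v):
    # evaluate the trained RBF network on an arbitrary input view
    k = np.exp(-np.linalg.norm(V - v, axis=1)**2 / (2 * sigma**2))
    return k @ C

v_nov = view(0.3)  # unseen intermediate view
print(np.linalg.norm(transform(v_nov) - v_std),
      np.linalg.norm(v_nov - v_std))  # transformed error well below raw
```

Training views are mapped to the standard view exactly by construction; the interesting part is that an interpolated view between training angles also lands close to it, which is what recognition by comparison against stored standard views relies on.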
In contrast, the CLF (Conjunctions of Localized Features) network suggested by Edelman and Weinshall (1991) does not require the computation of an explicit correspondence but instead uses topological feature maps. While CLF networks can reproduce error rates and recognition measures found in psychological tests, they seem inadequate for general image processing purposes because their generalization method is based mainly on Gaussian blurring of the stored model representations. Accordingly, preliminary experiments with real images instead of artificially designed objects showed a tendency towards improper matches with models containing a larger number of feature points. This systematic fault arises because a single feature point in the input image can overlap several widened model points simultaneously; it can only be circumvented by keeping the number of feature points constant over all views and all objects, as was the case in the study of Edelman and Weinshall.
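This failure mode can be reproduced with a small sketch. The data here are hypothetical and the blur width and point counts are arbitrary choices, not values from the cited experiments: an input view that actually depicts a sparse model is scored against that model and against an unrelated model with many feature points, each model represented by Gaussian-blurred point locations:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.15  # assumed blur width of the stored feature maps

def match_score(input_pts, model_pts, sigma):
    # each input point accumulates the blurred model map at its location,
    # i.e. it responds to every widened model point simultaneously
    d2 = ((input_pts[:, None, :] - model_pts[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2)).sum()

# sparse model A (5 points) and dense model B (60 points) in the unit square
model_a = rng.uniform(size=(5, 2))
model_b = rng.uniform(size=(60, 2))

# input: a slightly noisy view of model A
inp = model_a + rng.normal(scale=0.02, size=model_a.shape)

score_a = match_score(inp, model_a, sigma)
score_b = match_score(inp, model_b, sigma)
print(score_a, score_b)  # the dense, wrong model typically outscores the true one
```

Because every input point overlaps many blurred points of the dense model, model B accumulates the higher score even though the input was generated from model A, which is exactly the systematic bias described above.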