CV will constitute a major, but clearly not the only, type of sensory channel required to offer a driver comprehensive support for his activities. In order to indicate more precisely the role which CV might play in this context, we subdivide the overall capabilities of a DSS into four sectors:
The DSS is considered to be an agent, i. e. an encapsulated digital process endowed not only with its own internal state space and control, but in addition with an explicitly statable goal, externally visible actions which enable the agent to sense the state of its environment as well as to communicate with and to influence its environment, and a planning capability which concatenates actions into a plan in order to achieve the goal of the agent. Such an `agent concept' provides a computational model which makes it possible to talk about the DSS in a precise manner: it is assumed that a DSS is permanently alive once the vehicle electronics have been switched on, that the DSS may at any time initiate activities -- for example communicate with the driver -- and that it may spawn subagents to which it delegates clearly circumscribed subtasks as their (sub)goals.
The multi-faceted potential contributions of CV to a DSS can best be discussed by first collecting the basic driving maneuvers which a driver has to perform in order to guide a vehicle safely through road traffic. A set of generic driving maneuvers will be defined for this purpose. Generic in this context means that the parameters of the following nine maneuvers have to be determined on the one hand by the current subgoal which should be achieved, and on the other hand by the currently prevailing boundary conditions, in particular by the current traffic situation around the DSS vehicle:
The execution of a driving maneuver implies the exertion of both lateral and longitudinal control. Both types of control require sensory input about the currently prevailing relations between the vehicle and its environment. Providing such input constitutes the principal contribution of CV to a DSS. More specifically, the following pieces of information have to be obtained by CV:
It is by now established practice to place between 3 and 10 small windows on the expected image plane locations of bright lines marking the left and right delimitation of a lane. Edge elements within each window are associated with the projection of the model lane border into the image plane (sometimes called a `model segment'). (Weighted) deviations between edge elements and associated model segments are usually fed into an Extended Kalman Filter (EKF) in order to update the parameters characterizing the current camera position and orientation -- and thus, for a camera fixed to the vehicle, the vehicle coordinate system -- with respect to the lane.
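The update step described above can be illustrated by a minimal sketch. It assumes a strongly simplified geometry which is not taken from the systems discussed here: a flat road, a pinhole camera, a state consisting only of the lateral offset `y0` and the heading angle `psi` relative to the lane, and one scalar measurement per window (the image column of the right lane border at a known look-ahead distance). The names `border_column` and `ekf_lane_update` as well as all numerical constants are hypothetical.

```python
import math

FOCAL_PX = 800.0      # assumed focal length in pixels
HALF_WIDTH_M = 1.75   # assumed half lane width in meters


def border_column(y0, psi, distance_m):
    """Image column of the right lane border (the 'model segment') at a
    given look-ahead distance, under the simplified geometry assumed here."""
    return FOCAL_PX * ((HALF_WIDTH_M - y0) / distance_m - math.tan(psi))


def ekf_lane_update(state, cov, windows, meas_var=4.0):
    """Sequential scalar EKF measurement updates, one per image window.

    state   : [y0, psi], lateral offset (m) and heading angle (rad)
    cov     : 2x2 covariance matrix as nested lists
    windows : list of (look_ahead_m, measured_column_px)
    """
    y0, psi = state
    P = [row[:] for row in cov]
    for d, u_meas in windows:
        u_pred = border_column(y0, psi, d)
        # Jacobian of the predicted column with respect to (y0, psi).
        H = [-FOCAL_PX / d, -FOCAL_PX / math.cos(psi) ** 2]
        PHt = [P[0][0] * H[0] + P[0][1] * H[1],
               P[1][0] * H[0] + P[1][1] * H[1]]
        S = H[0] * PHt[0] + H[1] * PHt[1] + meas_var   # innovation variance
        K = [PHt[0] / S, PHt[1] / S]                   # Kalman gain
        residual = u_meas - u_pred                     # deviation in pixels
        y0 += K[0] * residual
        psi += K[1] * residual
        # Covariance update P <- (I - K H) P.
        HP = [H[0] * P[0][0] + H[1] * P[1][0],
              H[0] * P[0][1] + H[1] * P[1][1]]
        P = [[P[i][j] - K[i] * HP[j] for j in range(2)] for i in range(2)]
    return [y0, psi], P


# Synthetic measurements for a vehicle 0.3 m off center, heading 0.02 rad.
windows = [(d, border_column(0.3, 0.02, d)) for d in (8.0, 12.0, 16.0, 20.0, 25.0)]
state, cov = ekf_lane_update([0.0, 0.0], [[1.0, 0.0], [0.0, 0.01]], windows)
print(state)
```

A complete tracker would add a prediction step based on vehicle motion and further state components such as lane curvature and camera pitch. The sketch covers only the measurement update.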
Automatic lateral control of road vehicles on highways based on such a CV approach is considered state of the art, for both continuous and discontinuous lane markings. Essentially similar approaches have been used in order to detect lines delimiting adjacent highway lanes on either side of the one used by the camera-carrying vehicle itself. Until recently, special purpose computers, configurations of Digital Signal Processors, or a network of standard processors have been required in order to achieve this in real-time. A standard, 1996 vintage VLSI CPU can roughly cope with the computations required for real-time tracking of a well-marked, reasonably illuminated highway lane.
Inner-city roads frequently turn much more sharply than highways or rural roads: clothoid or extended parabolic arc models used for the latter cannot be easily adapted to the more complicated conditions of inner-city roads and intersections, in particular since significant parts of the road border may be occluded by, e. g., parked vehicles. As soon as more computing power can be made available within vehicles, model-based approaches are likely to be investigated for inner-city lane detection and tracking. Similarly, the lane width is not yet routinely estimated from recorded video images as a variable parameter of a generic lane model. Roads are not restricted to planes, although for highways and larger roads outside cities and mountainous areas, the vertical curvatures are so small that they have usually been neglected.
Whether additional computing power will be used in order to refine road models as discussed above or rather to increase the robustness of lane detection and tracking under more adverse operating conditions such as driving by night or in fog, heavy rain, or snow, remains to be seen.
Relatively few experiments have been reported so far which have been directed specifically at the detection and handling of intersections by CV. An intersection or junction is usually specified as either a gap in the lateral lane or road marking or as a stop line across the approach lane. Since most driving maneuvers so far have been performed on highway-like roads, the most frequent maneuver upon encountering an intersection or junction area consisted in driving straight across. In these cases, a gap in the lateral lane markings is simply considered as a failure to pick them up correctly: it is handled by straight extrapolation of previously estimated lane parameters.
If, however, a sharp turning maneuver had to be performed at an intersection or junction, a road model of the specific intersection/junction has been used. Complications arise if the visible sections of lane boundary markings become difficult to detect or if lane boundaries temporarily disappear from the field of view of a camera oriented straight ahead, i. e. along the tangent of the current vehicle trajectory. In such cases, the camera has to be mounted on a panning head or more than one camera has to be used. Very few systematic experiments have been reported so far regarding such conditions.
Additional complications may arise if the intersection or junction area becomes so complicated that the section essential for lateral control can no longer be captured entirely by a single monocular video stream. Apart from the need to relate a virtual vehicle trajectory to two or more images at the same time, the initialization and tracking of lane boundary markings in the different image planes may become difficult due to combinatorial problems during the association step between edge elements and model segments.
In addition to continuous or interrupted bright lines indicating lateral lane boundaries, other bright marks are occasionally painted onto a lane, for example arrows indicating obligatory driving directions (straight ahead, compulsory right turn, etc.), signs indicating that stopping or parking is prohibited, marks for special lanes reserved for buses or bicyclists, etc. Although such symbols are standardized, it has been observed that the actually painted road mark frequently does not (fully) conform to the standard. Recognition of such symbols within a DSS thus constitutes a problem which has rarely been addressed by CV in vehicles.
A multitude of shapes will have to be handled although it should be possible to adapt methods from workpiece recognition on conveyor belts for these recognition tasks. The problem is aggravated in heavy traffic if parts of a road marking are occluded by other vehicles such that a virtual image of a marking has to be built by concatenating successively visible parts.
Practically nothing has been published about the detection and tracking of lane marks constituted by rows of reflectors. Nor have the difficulties been treated which arise when lane markings that have become irrelevant -- possibly only temporarily -- have not been completely removed or are superseded by new ones in a different color (for example yellow instead of white).
Traffic signs and traffic lights mounted on posts close to a lane will appear in a roughly known region of an image recorded by a vehicle-mounted, forward-looking camera. Since traffic lights and signs appear in shapes and colors known a priori, color cues are exploited within such search regions to quickly collect subregions which should be tested for compatibility with the hypothesis of representing a traffic sign or traffic light. Approaches which rely only on shape cues in order to detect and classify traffic signs seem to be less reliable.
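The color-cued collection of candidate subregions can be sketched as follows. This is only an illustration under simplifying assumptions: the image is a list of rows of (r, g, b) tuples, the color cue is a crude "red dominates" test, and candidate subregions are the bounding boxes of 4-connected pixel groups inside the search region. The function names and thresholds are hypothetical. As noted further on in the text, practical systems have to postprocess raw color data much more carefully.

```python
def is_sign_red(pixel, min_red=150, ratio=1.5):
    """Crude color cue: red channel is bright and dominates green and blue."""
    r, g, b = pixel
    return r >= min_red and r >= ratio * max(g, b, 1)


def candidate_regions(image, search_region, min_pixels=4):
    """Return bounding boxes (x0, y0, x1, y1) of red pixel groups.

    image         : list of rows, each a list of (r, g, b) tuples
    search_region : (x_lo, y_lo, x_hi, y_hi), inclusive image coordinates
    """
    x_lo, y_lo, x_hi, y_hi = search_region
    seen = set()
    boxes = []
    for y in range(y_lo, y_hi + 1):
        for x in range(x_lo, x_hi + 1):
            if (x, y) in seen or not is_sign_red(image[y][x]):
                continue
            # Flood fill over 4-connected red pixels inside the search region.
            stack, blob = [(x, y)], []
            seen.add((x, y))
            while stack:
                cx, cy = stack.pop()
                blob.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (x_lo <= nx <= x_hi and y_lo <= ny <= y_hi
                            and (nx, ny) not in seen
                            and is_sign_red(image[ny][nx])):
                        seen.add((nx, ny))
                        stack.append((nx, ny))
            if len(blob) >= min_pixels:   # discard isolated noise pixels
                xs = [p[0] for p in blob]
                ys = [p[1] for p in blob]
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

Each returned box would subsequently be tested, for example by shape evaluation, for compatibility with the hypothesis of representing a traffic sign or traffic light.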
It appears as if the `quick-and-dirty' approaches towards traffic sign detection and recognition have been exhausted. The challenge of gradually decreasing the failure and false alarm rates as well as increasing the correct recognition rate even under adverse conditions is likely to be taken up by specialized research groups within companies. Relevant results thus are less likely to be described in detail in the scientific literature.
The standards regarding shape and color coding within traffic signs support a systematic approach towards their detection and classification. In order to obtain reliable detection and classification results, however, significant efforts are required even for isolated traffic signs positioned along highways and rural roads. Raw color data have to be carefully postprocessed in order to limit detrimental influences of illumination or recording conditions. A multi-step hierarchical approach, intermixing color as well as shape evaluation in the image plane, appears to be necessary just in order to classify an image region into one of the many categories of traffic signs. Given a 1996 vintage VLSI CPU, this is possible in real time provided there are not too many distracting similar signs or shapes. When it comes to a reliable deciphering of symbols within a traffic sign, less is known.
Traffic sign recognition on highways is usually attempted with a tele-camera (fixed to the vehicle) in order to facilitate early detection. Not much has been published about repeated evaluation of the same signs in consecutive image frames, nor about combined evaluation of traffic sign images in windows extracted initially from sequences recorded by a tele-camera, with a subsequent switch to image regions from frames recorded by a wide-angle lens camera as soon as the vehicle approaches the traffic sign. Only cursory reports are available regarding the advantages of actually tracking the hypothetical image of a traffic sign by a video-camera mounted on a pan/tilt-head within a vehicle in order to increase the signal-to-noise ratio for a more precise evaluation.
So far, driving tasks and the associated CV approaches have been discussed on the basis of essentially planar scene models. Even traffic sign recognition could be handled by a picture domain approach, given `normal' highway or rural road conditions. Problems become more severe, however, if traffic signs have to be looked for not only on posts planted near the road border, but also on arms or bridges extending across multiple-lane roads. Likewise, the 3-D structure of the environment has to be taken into account if the lane marking is provided by rows of beacons or posts or if the vehicle has to maneuver through narrow underpasses or gates. The latter conditions rarely occur on highways, but rather tight situations may have to be coped with along road construction sites and in the vicinity of traffic accidents.
In inner-city road traffic, 3-D analysis cannot be avoided, not least because vehicles and other objects may occlude part of the road limits or traffic lights. Not much has been published so far about such problems.
The common approach towards the detection of obstacles searches for cues which are incompatible with the assumption that a certain area of the image corresponds to a road surface: such cues could be significant gray value transitions, texture boundaries, or -- alternatively -- regions with unexpected gray values or colors. Although these approaches are relatively cheap regarding the necessary computing power, usually their detection rate is low or their false alarm rate is high.
More reliable are approaches which essentially exploit the phenomenon that anything extending from the -- assumed planar -- road will exhibit a disparity which differs from that of points on the road plane itself if image frames recorded from different vantage points are compared. So far, all such approaches assume that the external camera parameters are known. Three types of approaches can be distinguished here, depending on whether the image frames based on which the disparity is estimated are recorded
In case (2) above, the relative calibration of a stereo camera pair is replaced by knowledge about motion of a monocular camera with respect to the road between two recorded image frames. Both approaches (explicit calculation of the expected shear mapping or the determination of a disparity value between corresponding image locations) sketched for the comparison of stereo image frames can then be applied to this case, too.
The approach mentioned above under (3) differs from the disparity variant of approach (2) only insofar as the optical flow is estimated instead of the displacement between corresponding image locations from two frames taken some time apart. Optical flow estimation is still time consuming. If sufficient computing power is available, it has the advantage of being performed by a non-search calculation restricted to a local spatio-temporal (x,y,t)-volume from the recorded monocular gray value stream. Even non-prominent texture or gradual gray value transitions such as those due to illumination gradients can be exploited in this manner, resulting in a more densely populated optical flow field. Feature-based approaches towards the estimation of optical flow usually result in less densely populated optical flow fields, which makes it more difficult to reliably segment the image area corresponding to a potential obstacle.
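To illustrate what a non-search calculation restricted to a local (x,y,t)-volume may look like, the following sketch estimates a single displacement vector for an image window from gray value derivatives alone. It uses a gradient-based least-squares scheme in the style of Lucas and Kanade, chosen here purely for illustration and not taken from the approaches surveyed in this section. The function names and the synthetic test pattern are hypothetical.

```python
import math


def local_flow(frame0, frame1, x_lo, y_lo, x_hi, y_hi):
    """Estimate one (u, v) displacement in pixels per frame for a window.

    frame0, frame1 : gray value images as lists of rows, indexed [y][x]
    Window bounds are inclusive and must leave a one-pixel image margin.
    Solves the least-squares system built from Ix*u + Iy*v + It = 0.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(y_lo, y_hi + 1):
        for x in range(x_lo, x_hi + 1):
            # Spatial derivatives: central differences on the temporal mean.
            ix = (frame0[y][x + 1] + frame1[y][x + 1]
                  - frame0[y][x - 1] - frame1[y][x - 1]) / 4.0
            iy = (frame0[y + 1][x] + frame1[y + 1][x]
                  - frame0[y - 1][x] - frame1[y - 1][x]) / 4.0
            it = frame1[y][x] - frame0[y][x]   # temporal derivative
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        return None   # gray value structure too poor (aperture problem)
    u = (-sxt * syy + syt * sxy) / det
    v = (-syt * sxx + sxt * sxy) / det
    return u, v


def synthetic_frame(t, u, v, size=16):
    """Smooth test pattern translating by (u, v) pixels per frame."""
    return [[math.sin(0.3 * (x - u * t)) + math.cos(0.2 * (y - v * t))
             for x in range(size)] for y in range(size)]


flow = local_flow(synthetic_frame(0, 0.5, 0.25), synthetic_frame(1, 0.5, 0.25),
                  2, 2, 13, 13)
print(flow)
```

No correspondence search is involved: every quantity is computed from gray values within the window of two consecutive frames. This is also why gradual gray value transitions can contribute to the estimate.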
The advantage of all these approaches -- versus those based on the detection of unexpected gray value configurations in the image area associated with the road surface in front of the vehicle -- consists in the fact that knowledge about the geometry of a projective transformation from one view to another can be exploited. Even high contrast marks or shadows on the road surface can thus be easily distinguished from objects extending vertically from the road plane. Special purpose processor arrangements already allow optical flow fields or warping transformations to be computed in real time, although the resolution and reliability still leave ample room for improvement.
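The geometric argument can be made concrete for the simplest stereo configuration. The sketch below assumes a rectified camera pair with horizontal optical axes at height h above a planar road and with baseline b. A road point imaged y pixels below the horizon lies at distance Z = f h / y and therefore has disparity d = f b / Z = b y / h, independent of the focal length f. Anything extending vertically from the road keeps the disparity of its base point and thus violates this relation above the base. All names and numerical values are hypothetical.

```python
def expected_road_disparity(row_below_horizon_px, baseline_m, camera_height_m):
    """Disparity (pixels) of a road plane point seen at the given image row."""
    return baseline_m * row_below_horizon_px / camera_height_m


def flag_off_plane(measurements, baseline_m=0.3, camera_height_m=1.2, tol_px=1.0):
    """Return the measurements whose disparity contradicts the planar road.

    measurements : list of (row_below_horizon_px, measured_disparity_px)
    """
    flagged = []
    for row, disparity in measurements:
        expected = expected_road_disparity(row, baseline_m, camera_height_m)
        if abs(disparity - expected) > tol_px:
            flagged.append((row, disparity))
    return flagged


# A road marking at rows 20 and 40, and an upright object whose base is at
# row 80 (disparity 20 px) and which is still seen with that disparity at row 40.
samples = [(20, 5.0), (40, 10.0), (80, 20.0), (40, 20.0)]
print(flag_off_plane(samples))
```

The road marking and the base point of the object are compatible with the plane, whereas the measurement taken higher up on the object is flagged. A high contrast mark or shadow on the road would pass the test regardless of its gray values.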
A vehicle in the same lane ahead of the DSS vehicle constitutes an obstacle, albeit one of a special kind if it moves at a limited velocity differential with respect to the DSS vehicle. On highways and rural roads, preceding vehicles have to be detected and tracked in order to decide whether they should be followed or overtaken.
Two approaches have been devised which both take advantage of the special type of `obstacle' expected in this case. The first approach assumes that the rear side of the preceding vehicle exhibits a marked symmetry around its central vertical axis in the image and that the silhouette contrasts well against the background along the horizontal direction. A Hough Transform is used with tentatively paired edge elements in order to search for edge element pairings with a lateral distance compatible with the rear view of a vehicle and a center position corresponding approximately to the middle of the lane. Depending on the sophistication of the approach, a simple rectangle might be fitted to image gradients, initialized by the size and position estimates obtained by the Hough search for symmetric pairs of vehicle side edge segments. A further step in sophistication consists in fitting the projection of a simple 3-D vehicle model in the form of a parallelepiped to the image edges, initializing a Kalman Filter, and tracking such a model from frame to frame.
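The voting step of such a symmetry search might be sketched as follows. The sketch assumes that edge elements with roughly vertical orientation are already available as (x, y) image positions, pairs edge elements within the same image row, and accumulates votes in a (center, width) parameter space. The admissible width range, the lane center column, and the function name are hypothetical inputs which a real system would derive from the lane model and the camera calibration.

```python
from collections import Counter


def find_symmetric_pairs(edges, min_width, max_width, lane_center,
                         center_tol, bin_px=2):
    """Vote for (center, width) of paired vertical edge elements.

    edges : list of (x, y) positions of roughly vertical edge elements
    Returns (center_px, width_px, votes) of the best accumulator cell,
    or None if no admissible pairing exists.
    """
    by_row = {}
    for x, y in edges:
        by_row.setdefault(y, []).append(x)

    votes = Counter()
    for xs in by_row.values():
        xs.sort()
        for i, left in enumerate(xs):
            for right in xs[i + 1:]:
                width = right - left
                center = (left + right) / 2
                # Keep only pairings compatible with a vehicle rear view
                # located near the middle of the lane.
                if (min_width <= width <= max_width
                        and abs(center - lane_center) <= center_tol):
                    cell = (round(center / bin_px), round(width / bin_px))
                    votes[cell] += 1
    if not votes:
        return None
    (c_bin, w_bin), count = votes.most_common(1)[0]
    return c_bin * bin_px, w_bin * bin_px, count
```

The center and width of the winning cell can then serve to initialize the rectangle fit mentioned above.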
A different detection approach exploits the observation that the rear side of a road vehicle casts a shadow on the road which very frequently can be clearly detected as a fairly sharp gray value transition to very dark values (corresponding to the visible road area underneath the preceding vehicle). As soon as an approximately horizontal edge segment of appropriate length and location can be detected in the image area corresponding to the road in front of the DSS, it is hypothesized to constitute the rear shadow edge of a preceding vehicle. One may then search for vertical edge segments corresponding to the left and right sides of the preceding vehicle and use these three cues in order to initialize either a 2-D rectangle or a 3-D box model for tracking.
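A minimal sketch of the shadow edge hypothesis step is given below. It assumes a gray value image stored as a list of rows with the image row index growing downwards, scans the road region from the bottom upwards, and reports the first horizontal run of very dark pixels which have a clearly brighter pixel directly below them. The thresholds and the function name are hypothetical.

```python
def find_shadow_edge(gray, roi, dark_max=50, min_contrast=40, min_len=5):
    """Search a road region for the rear shadow edge of a preceding vehicle.

    gray : gray value image as a list of rows, indexed [y][x]
    roi  : (x_lo, y_lo, x_hi, y_hi), inclusive; y grows downwards
    Returns (row, x_start, x_end) of the nearest dark edge run, or None.
    """
    x_lo, y_lo, x_hi, y_hi = roi
    # Bottom-up scan, so the edge nearest to the camera is found first.
    for y in range(y_hi - 1, y_lo - 1, -1):
        run_start = None
        # One step beyond x_hi closes a run that reaches the region border.
        for x in range(x_lo, x_hi + 2):
            on_edge = (x <= x_hi
                       and gray[y][x] <= dark_max
                       and gray[y + 1][x] - gray[y][x] >= min_contrast)
            if on_edge and run_start is None:
                run_start = x
            elif not on_edge and run_start is not None:
                if x - run_start >= min_len:
                    return y, run_start, x - 1
                run_start = None
    return None
```

The returned run delimits the columns in which vertical edge segments for the left and right vehicle sides would subsequently be searched.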
Both of the approaches mentioned can be implemented to work in real-time on a 1996 vintage VLSI CPU. Whether more complex model-based tracking and recognition procedures are initialized depends on what information about the preceding vehicle is required for the driving maneuver to be supported by the DSS. In any case, once the hypothesis that a preceding vehicle has been detected in the lane of the DSS has been firmly established, the distance to this preceding vehicle can be easily inferred from its rear end projected onto the road plane, exploiting the camera calibration of the DSS. Given a reliable estimate of this distance to the preceding vehicle, longitudinal control of the DSS vehicle can be based on this distance estimate (Follow_Preceding_Vehicle). Alternatively, a warning to the driver can be generated if this distance drops below a safety threshold.
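The distance inference can be sketched for the simplest calibration: a pinhole camera with horizontal optical axis at height h above a planar road. The rear end of the preceding vehicle touches the road at an image row which lies y - y_horizon pixels below the horizon, so that its distance is Z = f h / (y - y_horizon). The safety threshold is expressed here as a minimum time gap, which is one possible choice among several. All names and numerical values are hypothetical.

```python
def distance_on_road_plane(row_px, horizon_row_px, focal_px, camera_height_m):
    """Distance (m) to a road plane point imaged at the given row."""
    dy = row_px - horizon_row_px
    if dy <= 0:
        raise ValueError("image row must lie below the horizon")
    return focal_px * camera_height_m / dy


def headway_advice(distance_m, own_speed_mps, min_time_gap_s=1.8):
    """Return 'warn' if the distance falls below the speed-dependent threshold."""
    threshold_m = own_speed_mps * min_time_gap_s
    return "warn" if distance_m < threshold_m else "ok"


# Rear shadow edge found 60 rows below the horizon.
distance = distance_on_road_plane(row_px=300, horizon_row_px=240,
                                  focal_px=800.0, camera_height_m=1.2)
print(distance, headway_advice(distance, own_speed_mps=25.0))
```

Since the distance varies inversely with the row offset, an error of one image row has a much larger effect for distant vehicles than for close ones, which limits the range over which such an estimate is useful for longitudinal control.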
The principal road signs on highways and major roads are standardized. Their detection and initial treatment essentially resemble that of traffic signs. Problems arise once the written content has to be evaluated. Considerable variations in the size of characters -- even within the same road sign -- as well as the mixture of symbols and text make it appear advisable to develop subagents responsible for the detection, tracking, and interpretation of road signs.
Experience with address label reading machines suggests that a mere capability to read single characters will not suffice in general. A dictionary of names which may appear on road signs as well as a database of relations -- capturing likely clusters of related names as well as rankings indicating which names are likely to appear together in a certain area on a particular road -- may turn out to be required. Given the fact that a lot of related information has already been stored on CD-ROMs in vehicle navigation systems, it appears reasonable to exploit this information in order to increase the reliability of road sign interpretation.
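How a dictionary of expected names can compensate for single-character reading errors is illustrated by the following sketch. It assumes that a character reader delivers one raw string per name and that the navigation database supplies the list of names plausible for the current road section. Matching uses `difflib.get_close_matches` from the Python standard library. The function name, the cutoff value, and the example names are hypothetical.

```python
import difflib


def interpret_sign_text(raw_tokens, expected_names, cutoff=0.7):
    """Map raw character reader output onto names expected in this area.

    raw_tokens     : strings as delivered by a (hypothetical) character reader
    expected_names : upper-case names supplied by the navigation database
    Returns one entry per token: the best matching name, or None if no
    expected name is sufficiently similar.
    """
    interpreted = []
    for token in raw_tokens:
        matches = difflib.get_close_matches(token.upper(), expected_names,
                                            n=1, cutoff=cutoff)
        interpreted.append(matches[0] if matches else None)
    return interpreted


names_nearby = ["KARLSRUHE", "STUTTGART", "MANNHEIM"]
print(interpret_sign_text(["KARLSRUNE", "STUTTCART", "XQZ"], names_nearby))
```

A token which matches none of the expected names is left uninterpreted rather than forced onto the closest entry. Such a rejection may itself be informative, for instance as a hint that the database no longer reflects the road network.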
One might even think about timely advice to the driver to slow down in order to secure the reliable scanning of road signs prior to critical junctions. Even if a navigation system may be advertised as providing independence from road signs, continuously cross-checking the database of the navigation system against road signs might decrease the risk of missing recent changes in the road network.