
Improving Stereo Camera Depth Measurements and Benefiting from Intermediate Results

C. Høilund, T. B. Moeslund, C. B. Madsen
Laboratory of Computer Vision and Media Technology, Aalborg University, Denmark
{ch, tbm, cbm}@imi.aau.dk

M. M. Trivedi
Computer Vision and Robotics Research Laboratory, University of California, San Diego, USA
[email protected]

Abstract— This paper presents a method for improving the disparity values obtained with a stereo camera by applying an iconic Kalman filter and known ego-motion. The improvements are demonstrated in an application of determining the free space in a scene as viewed by a vehicle-mounted camera. Using disparity maps from a stereo camera and known camera motion, the disparity maps are first filtered by an iconic Kalman filter, operating on each pixel individually, thereby reducing variance and increasing the density of the filtered disparity map. Then, a stochastic occupancy grid is calculated from the filtered disparity map, providing a top-down view of the scene where the uncertainty of the disparity measurements is taken into account. These occupancy grids are segmented to indicate a maximum depth free of obstacles, enabling the marking of free space in the accompanying intensity image. Even without motion of the camera, the quality of the disparity map is increased significantly. Applications of the intermediate results are discussed, enabling features such as motion detection and quantifying the certainty of the measurements. The evaluation shows significant improvement in disparity variance and disparity map density, and consequently an improvement in the application of marking free space.

I. INTRODUCTION

Until recently, sensors in vehicles were typically used to monitor basic features such as engine RPM, fuel level and whether the turn signal was activated. Today, intelligent vehicles use sensors to measure an ever-growing number of features inside and outside of the vehicle with the purpose of heightening safety and comfort. These features range from the head orientation of the driver to the distance to whatever is in front of the car, and are applied for determining driver attention [1], driver intention [2], vehicle detection [3] and so on. Purposes such as lane detection [4], detecting traffic signs [5] and obstacles [6] can use image data and accompanying depth. Image data can be obtained with regular cameras, while depth can be obtained with a number of different technologies such as monocular cameras [7] [8], stereo cameras [9] [10], omni-directional cameras [11] [12], and multimodal stereo vision [13] [14]. In order to reduce the number of components on board, it would be beneficial to use stereo cameras to simultaneously provide both image data and depth. Issues surrounding stereo cameras are sensor noise, leading to inaccurate and sporadic measurements, and low resolution, making it impossible to measure the correct depth, especially at increasing distance. However, with the addition of ego-motion it becomes possible to combine several frames into one higher-resolution or higher-quality image.

In order to improve the quality of the disparity maps, a filter can be applied. The general idea is to employ temporal information which is readily available. There exist several filters, such as the median filter, Kalman filters [15], and particle filters [16]. However, the filter must operate on a per-pixel level while taking advantage of all available information. The particle filter, while applicable, is unnecessarily complex compared with the Kalman filter. A median filter is fast, but not optimal, since it ignores additional knowledge of e.g. noise and provides neither variance estimation nor motion detection. The Kalman filter can operate on a per-pixel level while applying additional knowledge, and is therefore chosen.

The system presented in this work is an implementation of an iconic Kalman filter operating on disparity from stereo images. To investigate the effect of Kalman filtering, an application is developed and evaluated. As an example application, determining the free space around a vehicle is chosen, since it is an important and usable aspect of intelligent vehicles. Fig. 1 presents an overview of the system. The stereo images and initial depth estimation are provided by a TYZX stereo camera with a 22 cm baseline [17]. We will not assume forward-pointing cameras, as seen in e.g. [10], but rather allow for arbitrary camera poses. The system allows output of intermediate results useful in determining additional properties of a scene, evaluated in Section IV.

978-1-4244-7868-2/10/$26.00 ©2010 IEEE

Fig. 1: Simplified block diagram.

In Section II we define disparity and relate it to depth, and explain the iconic Kalman filter used to improve the disparity values. Section III shows how to use disparity to determine free space via stochastic occupancy grids. Lastly, Section IV presents an evaluation, followed by the conclusion in Section V.


II. DEPTH ESTIMATION

Disparity is the measure of difference in position of a feature in one image compared to another image. This can be used to infer the depth of the feature. Several methods of estimating disparity exist, and the crucial point in this work is to have an accurate measurement to reliably determine free space. Knowledge of the noise in the estimation process can significantly improve the system. In our case the distribution of noise has been found to be Gaussian in image space, and hence the Kalman filter [15] can be applied.

A. Iconic Kalman Filter

The Kalman filter is a recursive estimation method that efficiently estimates and predicts the state of a discretized, dynamic system from a set of potentially noisy measurements such that the mean square error is minimized, giving rise to a statistically optimal estimate. Only the previous estimate and the current measurement are needed. The general Kalman filter equations are not directly applicable to a stream of images without some additional considerations. If an N × M disparity map were treated as a single input to one Kalman filter, the measurement covariance matrix R would consist of (N × M)² values, which quickly becomes a prohibitively large number. To make an implementation feasible, each pixel is assumed independent and consequently assigned a separate Kalman filter. A pixel-wise Kalman filter is said to be iconic [18] [10]. The filter is applied to disparities, meaning it is the disparity value that is estimated. The pixel coordinates are still used when applying the motion model, but are not Kalman filtered. The state vector is therefore just a scalar, the disparity: x_t = d_t.

State, Measurement and Motion Models: The measurement is the disparity value supplied by the stereo camera. When the camera moves, the previous disparity map will in general not represent the new camera viewpoint.
Since depth is available for each pixel, the pixels can be moved individually according to how the camera was moved, effectively predicting the scene based on the previous disparity map and the ego-motion information of the camera, yielding two disparity maps depicting the same scene from the same viewpoint: a prediction and a measurement. The ego-motion in this context is linear in world space, but since image formation is non-linear, the motion becomes non-linear in image space. The standard Kalman filter assumes a linear system, which means it cannot handle the application of ego-motion when working in image space. Therefore, the state prediction is handled outside of the Kalman filter by triangulating the input into world space, applying ego-motion as X_t = M_R X_{t−1} + M_T (where M_T is translation and M_R is rotation), and then back-projecting the result into image space. The Kalman filter therefore only handles the state variance. Support for arbitrary camera poses is accomplished by including a transformation of the ego-motion information to the coordinate system of the camera, given as ω_T and ω_R for translation and rotation, respectively.
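The triangulate, transform, and back-project prediction step described above might be sketched as follows. This is a simplified pinhole-model sketch, not the paper's implementation: the function name, the nearest-pixel splatting, and the absence of a z-buffer for handling occlusions are our own simplifications.

```python
import numpy as np

def predict_disparity(disp, K, R, t, baseline):
    """Warp a disparity map to the next camera pose (sketch of the
    motion model Phi; names are ours, not the paper's).

    disp     : HxW disparity map, 0 where no measurement exists
    K        : 3x3 camera intrinsics
    R, t     : ego-motion rotation (3x3) and translation (3,), already
               expressed in the camera coordinate system
    baseline : stereo baseline, same units as t
    """
    H, W = disp.shape
    fx = K[0, 0]
    pred = np.zeros_like(disp)
    v, u = np.nonzero(disp)                   # pixels with a measurement
    d = disp[v, u]
    Z = fx * baseline / d                     # triangulate: depth from disparity
    X = (u - K[0, 2]) * Z / fx
    Y = (v - K[1, 2]) * Z / K[1, 1]
    P = R @ np.stack([X, Y, Z]) + t[:, None]  # apply ego-motion in world space
    p = K @ P                                 # back-project into image space
    u2 = np.round(p[0] / p[2]).astype(int)
    v2 = np.round(p[1] / p[2]).astype(int)
    d2 = fx * baseline / P[2]                 # predicted disparity
    ok = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H) & (P[2] > 0)
    pred[v2[ok], u2[ok]] = d2[ok]             # nearest-pixel splat, no z-buffer
    return pred
```

With identity ego-motion the prediction reproduces the input map, which is a convenient sanity check of the round trip through world space.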

The complete motion model will be referred to as the function Φ(·):

    x_{t|t−1} = Φ(x_{t−1|t−1}, M_T, M_R, ω_T, ω_R)    (1)

Noise models: The process and measurement variance are both scalars and assumed to be constant, i.e. Q = q_d and R = r_d. The process variance q_d reflects the confidence in the predictions and is set to a small number to filter out moving objects (e.g. 0.001). The measurement variance r_d represents the noise in the disparity values and is determined by measuring the precision of the stereo camera. The Kalman gain K is effectively a weighting between the prediction and the measurement based on the relative values of q_d and r_d. The state variance P is also a scalar. It is initially set to r_d, and from there on maintained by the Kalman filter.

Predict Phase and Update Phase: The finished iconic Kalman filter equations become:

Predict:

    x_{t|t−1} = Φ(x_{t−1|t−1}, M_T, M_R, ω_T, ω_R)    (2)
    P_{t|t−1} = P_{t−1|t−1} + q_d                     (3)

Update:

    K_t = P_{t|t−1} (P_{t|t−1} + r_d)^{−1}            (4)
    x_{t|t} = x_{t|t−1} + K_t (y_t − x_{t|t−1})       (5)
    P_{t|t} = (1 − K_t) P_{t|t−1}                     (6)
where x is the estimated state, Φ(·) is the motion model, M_T/M_R are the ego-motion translation/rotation matrices, ω_T/ω_R are the translation/rotation transforming the ego-motion to the camera coordinate system, P is the state variance, q_d is the process variance, y is the measurement, K is the Kalman gain, and r_d is the measurement variance.

Update Phase Merging: After the prediction phase has completed and a new disparity map has been calculated, the update phase begins. For each pixel, one of three scenarios occurs:
1) No measurement exists for the position of the predicted pixel. The disparity present is simply kept as-is. A disparity is only carried over a certain number of consecutive frames before being discarded, to prevent stale pixels from degrading the result.
2) No prediction exists for the position of the measured pixel, and the disparity is kept as-is.
3) Both exist. To filter out noise, a measurement is checked against the prediction it is set to be merged with by applying the Mahalanobis 3-sigma test [19]. If a measurement is within three standard deviations (three times the square root of r_d) of the mean (the prediction) in a given Gaussian distribution, it is merged with the prediction by the Kalman filter. If it fails, it replaces the prediction only if the prediction is young (to prevent spurious pixels from keeping correct measurements out) or deemed stale (to filter out pixels that have drifted).
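A minimal per-pixel sketch of equations (2) to (6) together with the 3-sigma gate of scenario 3 could look like this. Names are ours, and the age-based replacement rules of scenarios 1 and 3 are reduced to an unconditional keep/restart for brevity; the paper also tracks pixel age and a no-measurement count.

```python
import math

Q_D, R_D = 0.001, 0.2   # process / measurement variance (values from the paper)

def kalman_step(x_pred, P_prev, y):
    """One predict/update cycle for a single pixel's disparity.
    x_pred is the motion-compensated prediction Phi(x); y is the new
    measurement, or None if none exists. Returns the new (x, P)."""
    P_pred = P_prev + Q_D                         # (3)
    if y is None:                                 # scenario 1: prediction only
        return x_pred, P_pred
    if abs(y - x_pred) > 3.0 * math.sqrt(R_D):    # Mahalanobis 3-sigma gate fails
        return y, R_D                             # simplified: restart the filter
    K = P_pred / (P_pred + R_D)                   # (4)
    x = x_pred + K * (y - x_pred)                 # (5)
    P = (1.0 - K) * P_pred                        # (6)
    return x, P
```

Repeated updates with consistent measurements drive P well below the initial r_d, which is the variance reduction evaluated in Section IV-A.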


III. FREE SPACE ESTIMATION

A. Stochastic Occupancy Grid

With a disparity map (and consequently a reconstruction of the world points), the free space where a vehicle can navigate without collision can be computed by way of an occupancy grid [20]. An occupancy grid can be thought of as a top-down view of the scene, with lateral position along the horizontal axis and depth along the vertical axis. In a stochastic occupancy grid [10] the intensity of each cell denotes the likelihood that a world point is at the lateral position and depth represented by the cell. A world point can therefore occupy more than one cell. How large an area the world point affects depends on the variance, or noise, associated with the point. For each cell D(i, j) in the occupancy grid, the intensity is calculated as:

    D(i, j) = Σ_{k=1}^{m} L_ij(p_k)    (7)

where m is the number of data points available, p is a pixel with associated disparity, and L_ij is the likelihood function mapping the data point to the cell with coordinates (i, j) in the occupancy grid. Since the measurements are noisy, there will be some difference between the measurement p and the real but unknown ground truth p′. It is assumed this difference, ξ_k = p − p′, can be modelled by a zero-mean multivariate Gaussian probability density function:

    φ(ξ_k) = (2π)^{−3/2} |Σ|^{−1/2} exp(−(1/2) ξ_k^T Σ^{−1} ξ_k)    (8)

where Σ is the measurement covariance matrix and |·| is the determinant. (8) yields the probability of obtaining an error ξ_k, or, in other words, given a measurement p, it yields the probability of the actual state being p′.

There exist different kinds of occupancy grid representations. In this paper, the polar occupancy grid is chosen. In the polar occupancy grid, the columns correspond to the columns of the disparity map and the rows represent depth, resulting in an even distribution of depth along the rows. The likelihood function is:

    L_ij(p_k) = φ((x_ij − x, 0, d′_ij − d)^T)    (9)

where d′_ij is the back-projected depth of the occupancy grid cell under consideration. The polar occupancy grid is directly applicable for marking free space in an intensity image, and its linear resolution of depth allows for better separation of distant objects.

B. Free Space

Free space can be determined with a combination of the occupancy grid and the underlying disparity map [10].

Fig. 2: Kalman filtered vs. non-filtered depth values for a single pixel in a static scene, q_d = 0.001. (Axes: distance in cm, 1530 to 1600, over sample index, 0 to 150; legend: Original, Kalman filtered.)
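The accumulation of equation (7) with the Gaussian likelihood of (8) and (9) can be sketched for the polar grid as follows. Since each polar-grid column follows a disparity-map column, the lateral error term vanishes and each measurement spreads a one-dimensional Gaussian in disparity over the cells of its column; function and parameter names are ours, and the per-pixel variance map stands in for Σ.

```python
import numpy as np

def polar_occupancy_grid(disp, variance, fx_b, rows=64, z_max=57.0):
    """Accumulate a stochastic polar occupancy grid (sketch of eqs. 7-9).

    disp     : HxW disparity map, 0 where no measurement exists
    variance : HxW per-pixel disparity variance (the Kalman state variance)
    fx_b     : focal length times baseline, so depth Z = fx_b / d
    rows     : number of depth rows; depth is sampled linearly up to z_max
    """
    H, W = disp.shape
    z = np.linspace(z_max / rows, z_max, rows)   # depth of each grid row
    d_cell = fx_b / z                            # back-projected disparity d'_ij
    grid = np.zeros((rows, W))
    v, u = np.nonzero(disp)
    for vi, ui in zip(v, u):
        var = variance[vi, ui]
        err = d_cell - disp[vi, ui]              # (d'_ij - d) for every row
        grid[:, ui] += np.exp(-0.5 * err**2 / var) / np.sqrt(2 * np.pi * var)
    return grid, z
```

A measurement with low variance concentrates its mass in few cells, while a noisy one smears over many, which is exactly the behaviour visible in the full occupancy grids of Fig. 5.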

1) Occupancy Grid Segmentation: Each cell in an occupancy grid contains a value representing the probability of that cell being occupied. The grid is segmented by traversing it from near to far, stopping when the value of a cell exceeds a threshold. The depth of that cell is the maximum depth likely to be free of obstacles.

2) Free Space Marking: The occupancy grid is a direct top-down view of the intensity image. If no occupied cell is found in a particular column, i.e. the direction of a ray from the camera's focal point, all points in the real scene above or below that ray are part of the free space up to the vertical resolution of the occupancy grid.

IV. EVALUATION

To evaluate the application and the effect of Kalman filtering, a number of data sets have been collected: one in a static, indoor scene, and three on an electric vehicle with the camera at three different angles relative to the (forward) direction of travel. The angles are 0° (i.e. facing forward), 65°, and 90° (perpendicular). Ego-motion is estimated. The filter parameters are q_d = 0.001 and r_d = 0.2.

A. State and Variance

To determine the effect on the state (the depth) of a pixel in a stationary scene with a non-moving camera, the first 150 frames of a single pixel are compared to their Kalman filtered counterpart. Fig. 2 shows this for a depth of approximately 16 m. On average, the variance was reduced from 52.14 to 4.10.

B. Density

The prediction property of the Kalman filter leads to denser disparity maps due to the accumulation of measurements over time. Pixels which are predicted for a given number of frames without being updated with a new measurement are discarded to ensure stale pixels do not degrade the occupancy grid. Fig. 3 shows four examples. Fig. 3a is the static, indoor scene. Fig. 3b is an outdoor scene with ego-motion and moving objects. The camera is facing forward. Low lighting conditions yield sparse disparity maps, which is compensated for to a certain extent by the Kalman filter.


(a) Static scene. (b) Camera at 0°. (c) Camera at 65°. (d) Camera at 90°.

Fig. 3: Examples of the effect on density by Kalman filtering. The top row contains the original disparity maps while the bottom row contains the Kalman filtered disparity maps. Black areas denote missing measurements. Fig. a is a static scene while figures b-d are from sequences with ego-motion.

In Fig. 3c, the camera is at a 65° angle. Note that areas occluded in the original disparity map have an estimated depth in the Kalman filtered version, leading to a denser disparity map than would have been possible without ego-motion. In Fig. 3d, the camera is at a 90° angle. As in Fig. 3b, the low lighting conditions yield sparse disparity maps. However, the Kalman filtered disparity map is relatively denser. This is due to the lateral motion, which leads to greater variation in the scene than forward motion; hence the areas where estimation of disparity is possible vary more. The predictive property of the Kalman filter then accumulates these measurements over time. A quantitative analysis of the difference in density caused by Kalman filtering is presented in Table I. Ego-motion, moving objects, and range of depth are all properties affecting the obtainable improvement. The interesting number is therefore the relative increase. Unfortunately, some of the increase in density stems from keeping spurious measurements, visible in the top center of Fig. 3b and top left of Fig. 3d.

TABLE I: Absolute and relative increase in density before and after Kalman filtering. A scene consists of 156,000 px.

    Scene         | Valid px before | Valid px after | Increase abs. [px] | Increase rel. [%]
    Lab, 8 m      |          82,185 |        121,589 |             39,404 |             47.95
    Vehicle, 0°   |          73,588 |        100,800 |             27,212 |             36.98
    Vehicle, 65°  |         117,378 |        135,850 |             18,472 |             15.74
    Vehicle, 90°  |          60,404 |         99,333 |             38,929 |             64.44

C. Per-pixel Variance & Activity

The per-pixel variance map is an image where the intensity of each pixel denotes the variance of that pixel. The variance is the state variance maintained by the Kalman filter. Fig. 4 shows the variance image for a static scene (Fig. 4a) and for a scene with ego-motion involved (Fig. 4b). The variances in the right image never decrease as much as in the left image because the pixels do not live long enough for this to happen. The variance of each pixel, along with its age and no-measurement count, is an indication of the reliability of the estimate. This comes into effect when calculating the occupancy grids. Fig. 4 also contains two examples of activity maps. The activity map of a scene is an overview of what operation each pixel has undergone in the latest Kalman filter iteration. The possible activities are:

• Red: "No measurement, prediction only"
• Yellow: "No prediction, measurement only"
• Orange: "Prediction and measurement merged"
• Blue: "Prediction replaced by new measurement"
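The four activities amount to a simple per-pixel case analysis over the prediction, the measurement, and the 3-sigma gate. A sketch of that classification (names are ours; None encodes a missing value, and the age-based refinements of the merging rules are omitted):

```python
import math

def activity(pred, meas, r_d=0.2):
    """Classify one pixel's latest Kalman iteration into the four
    activity-map colours listed above (simplified sketch)."""
    if meas is None:
        return "red"        # no measurement, prediction only
    if pred is None:
        return "yellow"     # no prediction, measurement only
    if abs(meas - pred) <= 3.0 * math.sqrt(r_d):
        return "orange"     # prediction and measurement merged
    return "blue"           # prediction replaced by new measurement
```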

Fig. 4c is the static scene. Most of the activity is merging, indicated by the orange color. Fig. 4d is an example of an activity map from a scene with motion. Notice the blue fringe around the person, due to the predictions being replaced by the new measurements. This is an indication of motion, because the shift in disparity from one image to the next is too large to be within the Gaussian distribution of noise associated with the measurements. Hence, the pixels must represent a different object than before, which means there is motion in that area. Worthy of note is that the increase in density is from prediction of actual measurements, and not e.g. interpolation.

D. Occupancy Grid

Examples of occupancy grids can be seen in Fig. 5a-c, with corresponding segmented occupancy grids in Fig. 5d-f,


(a) Variance for Fig. 3a.

(b) Variance for Fig. 3c.

(c) Activity for Fig. 3a.

(d) Activity for Fig. 3c.

Fig. 4: Variance: Orange denotes high variance, blue denotes medium variance, and green denotes low variance. Activity: Red: no measurement, prediction only. Yellow: no prediction, measurement only. Orange: Prediction and measurement merged. Blue: Prediction replaced by new measurement.

(a) Full occupancy grid. (b) Full occupancy grid. (c) Default variance with unfiltered disparity map. (d) Segmented occupancy grid. (e) Segmented occupancy grid. (f) Segmented occupancy grid.

Fig. 5: The top row contains the full occupancy grids while the bottom row contains the resulting segmentation (columns span 1 px to 500 px). Red: Low likelihood. Blue: High likelihood. Black: No measurement. Notice the higher likelihood and greater coverage of the cells in (b) as compared to (c). This is due to a lower variance for a large number of pixels, as well as the increased density.

respectively. The range of depth represented by the occupancy grids is from 0 m to 57 m. Fig. 5a is the static scene as in Fig. 3a, using the bottom disparity map. The depth is limited; only the lower part of the occupancy grid is in use. The upper cells, representing objects far away, are colored by spurious measurements. In Fig. 5b (bottom of Fig. 3c), the full depth of the occupancy grid is in use. Fig. 5c is like Fig. 5b but with the raw, unfiltered disparity map of the upper Fig. 3c used for calculating the occupancy grid. The difference is noticeable, with a sparser occupancy grid and lower likelihood of the cells. Likewise, the segmented occupancy grid (Fig. 5f) shows significant change compared to Fig. 5e.

E. Free Space

Fig. 6 shows four different scenes. First, the static scene before (Fig. 6a) and after (Fig. 6d) free space has been marked, using the disparity map in the bottom of Fig. 3a and the segmented occupancy grid in Fig. 5d. Fig. 6b, c, and e are captured with a moving camera. Fig. 6b is with a frontal camera using the bottom disparity map in Fig. 3b (occupancy grid not shown). Fig. 6c is with the camera at a 65° angle, using the bottom disparity map in Fig. 3c and the segmented occupancy grid in Fig. 5e. Fig. 6e is with the camera at a 90° angle, using the bottom disparity map in Fig. 3d (occupancy grid not shown). With dense disparity maps it is possible to determine the accessibility of a greater number of pixels and hence a greater area of the scene. Fig. 6c and f show the free space calculations of the same scene, based on the segmented occupancy grids of Fig. 5e and 5f, respectively. Fig. 6f is with an unfiltered disparity map (upper Fig. 3c), yielding a sparser and less accurate estimation of free space, demonstrating the end effect of Kalman filtering the disparity maps.


(a) Original intensity image.

(b) Free space.

(c) Free space.

(d) Free space.

(e) Free space.

(f) Unfiltered.

Fig. 6: The green color denotes the free space. As an example, Fig. (d) is created by taking each pixel in (a) and first looking up the depth in Fig. 3a. If it is below the line in Fig. 5d it is regarded as free space and marked green in (d).
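The lookup described in the caption, together with the near-to-far grid segmentation of Section III, can be sketched as follows. This is a minimal sketch; the function names, the threshold parameter, and the boolean-mask output are our own choices rather than the paper's implementation.

```python
import numpy as np

def segment_free_depth(grid, z, threshold):
    """Traverse each column of the occupancy grid from near to far and
    return, per column, the greatest depth likely to be free of obstacles.
    z holds the depth of each grid row; threshold is application-chosen."""
    rows, cols = grid.shape
    free = np.full(cols, z[-1])                  # no obstacle: whole column free
    for c in range(cols):
        hit = np.nonzero(grid[:, c] > threshold)[0]
        if hit.size:
            free[c] = z[hit[0]]                  # depth of first occupied cell
    return free

def mark_free_space(depth, free):
    """Boolean mask of pixels whose depth lies in front of the free-depth
    limit of their column (the green overlay of Fig. 6); depth == 0 means
    no measurement and is never marked."""
    return (depth > 0) & (depth < free[None, :])
```

The mask can then be blended into the intensity image to produce the green free-space overlay.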

V. CONCLUSIONS AND FUTURE WORKS

A. Conclusions

Kalman filtering the disparity maps increases density and reduces per-pixel variance, which leads to more accurate occupancy grids and ultimately a more confident estimate of the occupancy of the scene. We have shown how the activity map indicates where motion is taking place on a disparity level, enabling detection of motion important for applications utilizing scene depth information. The variance map estimates the confidence of the depth measurement of each individual pixel, making it possible to decide if a measurement should be taken into account when determining the output of a system.

B. Future Works

The disparity map and the activity map can be used for motion detection and object detection, while the variance map can be used to determine the confidence ascribed to each pixel, enabling more fine-grained decision making.

VI. ACKNOWLEDGEMENTS

This work is funded by the Danish National Research Councils-FTP via the project "Big Brother is watching you".

REFERENCES

[1] K. Guo, G. Yu, and Z. Li, "A new algorithm for analyzing driver's attention state," Intelligent Vehicles Symposium, 2009 IEEE, pp. 21–23, June 2009.
[2] A. Doshi and M. M. Trivedi, "On the roles of eye gaze and head dynamics in predicting driver's intent to change lanes," IEEE Trans. Intell. Transport. Sys., vol. 10, pp. 453–462, September 2009.
[3] S. Sivaraman and M. M. Trivedi, "Active learning based robust vehicle detection for on road active safety systems," IEEE Intelligent Vehicles Symposium, June 2009.
[4] K. H. Lim, K. P. Seng, A. C. L. Ngo, and L.-M. Ang, "Real-time implementation of vision-based lane detection and tracking," Intelligent Human-Machine Systems and Cybernetics, 2009. IHMSC '09. International Conference on, vol. 2, pp. 364–367, August 2009.

[5] Y. Xie, L.-F. Liu, C.-H. Li, and Y.-Y. Qu, "Unifying visual saliency with HOG feature learning for traffic sign detection," Intelligent Vehicles Symposium, 2009 IEEE, pp. 24–29, June 2009.
[6] Y.-C. Lim, C.-H. Lee, S. Kwon, and J. hun Lee, "Position estimation and multiple obstacles tracking method based on stereo vision system," Intelligent Vehicles Symposium, 2009 IEEE, pp. 72–77, June 2009.
[7] M.-H. Li, B.-R. Hong, Z.-S. Cai, S.-H. Piao, and Q.-C. Huang, "Novel indoor mobile robot navigation using monocular vision," Engineering Applications of Artificial Intelligence, vol. 21, no. 3, 2008.
[8] W. Shi and J. Samarabandu, "Corridor line detection for vision based indoor robot navigation," in CCECE. IEEE, 2006, pp. 1988–1991.
[9] M. Agrawal, K. Konolige, and R. C. Bolles, "Localization and mapping for autonomous navigation in outdoor terrains: A stereo vision approach," WACV '07: Proceedings of the Eighth IEEE Workshop on Applications of Computer Vision, p. 7, 2007.
[10] H. Badino, U. Franke, and R. Mester, "Free space computation using stochastic occupancy grids and dynamic programming," Workshop on Dynamical Vision, ICCV, Rio de Janeiro, Brazil, October 2007.
[11] N. Winters, J. Gaspar, G. Lacey, and J. Santos-Victor, "Omnidirectional vision for robot navigation," Omnidirectional Vision, 2000. Proceedings. IEEE Workshop on, pp. 21–28, 2000.
[12] M. M. Trivedi, T. Gandhi, and J. McCall, "Looking-in and looking-out of a vehicle: Computer-vision-based enhanced vehicle safety," Intelligent Transportation Systems, IEEE Transactions on, vol. 8, no. 1, pp. 108–120, March 2007.
[13] S. Jia, J. Sheng, E. Shang, and K. Takase, "Robot localization in indoor environments using radio frequency identification technology and stereo vision," Advanced Robotics, vol. 22, March 2008.
[14] S. Ahn, J. Choi, N. L. Doh, and W. K. Chung, "A practical approach for EKF-SLAM in an indoor environment: fusing ultrasonic sensors and stereo camera," Auton. Robots, vol. 24, no. 3, pp. 315–335, 2008.
[15] R. E. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME, Journal of Basic Engineering, vol. 82, no. Series D, pp. 35–45, 1960.
[16] N. Gordon, D. Salmond, and A. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," Radar and Signal Processing, IEE Proceedings F, vol. 140, no. 2, April 1993.
[17] TYZX, "TYZX DeepSea Stereo Camera," 2009. [Online; accessed 20 November 2009]. http://www.tyzx.com/.
[18] L. Matthies, T. Kanade, and R. Szeliski, "Kalman filter-based algorithms for estimating depth from image sequences," International Journal of Computer Vision, vol. 3, no. 3, pp. 209–238, 1989.
[19] P. C. Mahalanobis, "On the generalised distance in statistics," in Proc. National Institute of Science, India, vol. 2, no. 1, April 1936.
[20] A. Elfes, "Using occupancy grids for mobile robot perception and navigation," Computer, vol. 22, no. 6, pp. 46–57, June 1989.
