Light Field Assisted Stereo Matching using Depth from Focus and Image-Guided Cost-Volume Filtering

Jędrzej Kowalczuk, Eric T. Psota, and Lance C. Pérez
Department of Electrical Engineering, University of Nebraska, Lincoln, NE, U.S.A.

Abstract— Light field photography advances upon current digital imaging technology by making it possible to adjust focus after capturing a photograph. This capability is enabled by an array of microlenses mounted above the image sensor, allowing the camera to simultaneously capture both light intensity and approximate angle of incidence. The ability to adjust focus after capture makes light field photography well-suited for computer vision techniques that aim to determine the depth of objects in a scene, such as depth from focus/defocus. Another commonly known method for extracting depth from images is stereo matching, which seeks to obtain a disparity map linking corresponding objects that appear in both images. This disparity map, along with the known geometry of the cameras, can be used to compute the depth of objects in the scene. Whereas depth from focus/defocus using traditional cameras requires multiple sequential image captures, light field cameras capture these images simultaneously, making them more suitable for use in conjunction with stereo matching. A method is presented that combines both light field photography and stereo matching to achieve an enhanced disparity map. This joint approach takes advantage of the unique strengths of both methods to produce a result that is more accurate than either method produces in isolation.

Keywords: Stereo matching, stereo correspondence, light field photography, depth from focus.

1. Introduction

Over the last three decades, computer vision researchers have developed a variety of methods for computing the three-dimensional structure of a scene. These methods can generally be divided into active and passive methods. Active methods typically require light to be projected onto the scene, and include laser scanning [1], structured light [2], [3], and infrared range sensing [4]. While these methods produce accurate depth maps of the scene, they require additional hardware and are often restricted to small-scale indoor environments. In contrast, passive methods compute scene depths without requiring the projection of light, thus making them suitable for a broad range of applications. Commonly known passive methods include stereo matching [5], structure from motion [6], and depth from focus/defocus [7], [8].

The goal of stereo matching is to find correspondences between objects that appear in two images of the scene. By finding positional offsets, i.e., disparities, relating objects that appear in the two images, it is possible to use the known geometry of the stereo camera configuration to compute the
depth and scale of these objects. Unfortunately, the accuracy of stereo matching suffers from several inherent difficulties, including low-texture areas, depth discontinuities, occlusions, slanted surfaces, repeating patterns, reflections, and specularities. Global stereo matching methods, such as graph cuts and image segmentation [9], have significantly reduced the detrimental effects of low-texture surfaces, often by assuming they are part of one large, flat surface. The introduction of adaptive support weights for stereo matching cost aggregation [10] addresses the challenge posed by depth discontinuities, and has been adopted by many recently proposed stereo matching algorithms [11], [12]. Occlusions, on the other hand, require algorithms to effectively "guess" the disparity values, since occluded surfaces are only visible in one of the two images.

Techniques that extract depth using focal plane characteristics can be divided into two sub-categories: depth from focus and depth from defocus [8]. A depth from focus (DFF) system adjusts the camera's focus settings, or the distance of the focal plane, until the point of interest comes into focus. Given that there is a one-to-one relationship between the depth of an object point and the depth of the focal plane that puts that point in focus, it is possible to use depth from focus to obtain the depth of every point in the scene. In contrast, depth from defocus (DFD) operates by comparing a reference image obtained with a small aperture to images obtained with a large aperture. Depth is then approximated by determining the amount of blur (defocus) that must be applied to the reference image to reproduce portions of the large-aperture images. The amount of blur, or the size of the blur kernel, corresponds to a particular depth value, and larger apertures generally produce more precise depth measurements. An advantage of depth from focus/defocus is that these methods do not suffer from many of the difficulties associated with stereo matching. For example, a repeating pattern will not confuse DFF or DFD, as they operate locally and do not require a scan-line search. Until recently, DFF and DFD were largely limited to post-processing of images of static scenes, where a set of focus-adjusted images is obtained in sequence. With the availability of light field cameras produced by Lytro (Lytro Inc., Mountain View, CA), it has become possible to obtain a large set of images with different focus settings simultaneously [13]. Light field cameras effectively capture both light intensity and angle of incidence on the same image sensor using an array of microlenses.

In this paper, a method is proposed that uses a stereo configuration of light field cameras. The method begins by
capturing a pair of light field photographs, performing depth estimation using DFF, and constructing a pair of images such that every point in the images is in focus. These images are then processed using stereo matching with image-guided filtering, and the reliability of stereo matching is assessed using a back-and-forth consistency check. Finally, the DFF and stereo matching costs are combined using a weighted summation, and the Winner-Takes-All decision criterion is used to find minimum-cost matches and assign disparities. It is shown that this approach improves upon the accuracy of the disparity map produced using either stereo matching or DFF in isolation.

2. Background

Stereo matching is formulated as the process of computing a disparity map that defines correspondences between pixels in a pair of stereo images. Stereo matching methods are commonly built on top of a Winner-Takes-All (WTA) framework, which operates on a set of precomputed per-pixel matching costs. For a particular pixel of interest in the reference image, the WTA approach selects the candidate match characterized by the minimum matching cost. Typically, these costs are aggregated to form a stereo matching cost volume C_SM (sometimes referred to as the disparity space image [5]), such that the element C_SM(x, y, d) holds the per-pixel dissimilarity metric, e.g., the sum of absolute or squared color differences, evaluated between pixel p at location (x, y) in the reference image and pixel p̄ at location (x − d, y) in the target image. The disparities associated with minimum-cost matches are recovered by finding the disparity arguments that minimize C_SM(x, y, d). For a pixel p in the reference image, the disparity is found using the decision criterion

    d_p = \arg\min_d C_{SM}(x, y, d) .    (1)

To enforce local consistency of matches and reduce the noise present in the resulting disparity maps, window-based cost aggregation techniques have been introduced into the WTA framework. Among these methods, the adaptive support-weight aggregation presented by Yoon and Kweon [14] has been shown to produce highly accurate disparity maps and has since been employed by many top-performing local and non-local stereo matching algorithms [15], [16], [17], [18]. Adaptive support-weight cost aggregation considers a square support window Ω_p centered at the pixel of interest p, and assigns a support weight to each pixel q ∈ Ω_p. The support weight w(p, q) is given by

    w(p, q) = \exp\left( -\frac{\Delta_c(p, q)}{\gamma_c} - \frac{\Delta_g(p, q)}{\gamma_g} \right) ,    (2)

where Δ_g(p, q) is the geometric distance and Δ_c(p, q) is the color difference between pixels p and q, and γ_c and γ_g are chosen empirically. Given pixel p in the reference frame, pixel p̄ in the target frame, and their support windows Ω_p and Ω_p̄, respectively, the cost of matching p at location (x, y) and p̄ at (x − d, y) is computed as

    C_{SM}(x, y, d) = \frac{\sum_{q \in \Omega_p,\, \bar{q} \in \Omega_{\bar{p}}} w(p, q)\, w(\bar{p}, \bar{q})\, \delta(q, \bar{q})}{\sum_{q \in \Omega_p,\, \bar{q} \in \Omega_{\bar{p}}} w(p, q)\, w(\bar{p}, \bar{q})} ,    (3)

where δ(q, q̄) is the dissimilarity measure between pixels q and q̄.
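To make the WTA decision of (1) and the aggregation of (2)–(3) concrete, the following Python sketch evaluates the support weights and the aggregated cost for a single pixel pair. It is a minimal illustration rather than the authors' implementation: the window radius and the values of γ_c and γ_g are assumed, border handling is omitted, and the absolute color difference is used for δ.

```python
import numpy as np

def support_weights(img, cy, cx, radius=8, gamma_c=7.0, gamma_g=36.0):
    """Adaptive support weights of Eq. (2) for the window centered at (cy, cx).

    img is an H x W x 3 float array; gamma_c and gamma_g are assumed values.
    The center pixel must lie at least `radius` pixels from the image border."""
    ys = np.arange(cy - radius, cy + radius + 1)
    xs = np.arange(cx - radius, cx + radius + 1)
    window = img[np.ix_(ys, xs)]                                # (2r+1, 2r+1, 3) patch
    color_diff = np.linalg.norm(window - img[cy, cx], axis=2)   # Delta_c(p, q)
    gy, gx = np.meshgrid(ys - cy, xs - cx, indexing="ij")
    geom_dist = np.sqrt(gy ** 2 + gx ** 2)                      # Delta_g(p, q)
    return np.exp(-color_diff / gamma_c - geom_dist / gamma_g)

def aggregated_cost(left, right, y, x, d, radius=8):
    """Support-weighted matching cost of Eq. (3) between p = (x, y) in the
    reference image and p_bar = (x - d, y) in the target image, using the
    absolute color difference as the dissimilarity delta."""
    w_p = support_weights(left, y, x, radius)
    w_pbar = support_weights(right, y, x - d, radius)
    ys = np.arange(y - radius, y + radius + 1)
    xs_ref = np.arange(x - radius, x + radius + 1)
    delta = np.abs(left[np.ix_(ys, xs_ref)] - right[np.ix_(ys, xs_ref - d)]).sum(axis=2)
    weights = w_p * w_pbar
    return (weights * delta).sum() / weights.sum()

def wta_disparity(cost_volume):
    """Winner-Takes-All selection of Eq. (1) over an H x W x D cost volume."""
    return np.argmin(cost_volume, axis=2)
```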

2.1 Stereo Matching using Cost-Volume Filtering

The adaptive support-weight cost aggregation is, in fact, equivalent to smoothing the initial cost volume (layer by layer) using a bilateral filter. The edge-preserving property of the bilateral filter captures object boundaries in the disparity maps; however, the complexity of filtering grows quadratically with the window size, preventing the application of larger filter kernels. To address this, Rhemann et al. [19] proposed cost-volume filtering using a guided image filter [20] that closely mimics the behavior of the bilateral filter, yet can be computed in constant time regardless of the window size. Apart from fast operation, the ability to aggregate matching cost over a larger window results in exceptionally accurate disparity maps, making their approach the best-performing local stereo matching algorithm among those listed on the Middlebury stereo benchmark [5], [2], approaching the accuracy of algorithms that perform global energy minimization. The weights of the guided filter in a support window Ω_p are given by

    w(p, q) = \frac{1}{|\Omega_p|} \sum_{q \in \Omega_p} \left( 1 + \frac{\left(I(p) - \mu_{\Omega_p}\right)\left(I(q) - \mu_{\Omega_p}\right)}{\sigma_{\Omega_p}^2 + \epsilon} \right) ,    (4)

where |Ω_p| is the number of pixels in the support window Ω_p, I is the guidance image (here, a grayscale version of the reference image), µ_Ωp and σ²_Ωp are the mean and the variance of the guidance image in the support window, and ε is a regularization parameter used to control the strength of smoothing. Once the weights are computed, the cost volume is updated as

    C_{SM}'(x, y, d) = \sum_{q \in \Omega_p} w(p, q)\, C_{SM}(x_q, y_q, d) ,    (5)

where (x_q, y_q) denotes the location of pixel q, and matches are selected according to (1).
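A minimal sketch of this filtering step is given below. It uses He et al.'s box-filter formulation of the guided filter [20], which avoids forming the weights of (4) explicitly; the window radius and ε are assumed values, not the parameters reported in [19].

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(guide, src, radius=16, eps=1e-4):
    """Guided image filter (He et al. [20]) built from box (mean) filters.

    guide : 2-D grayscale guidance image in [0, 1]; src : 2-D image to smooth."""
    size = 2 * radius + 1
    mean = lambda a: uniform_filter(a, size=size, mode="reflect")
    mean_i, mean_s = mean(guide), mean(src)
    cov_is = mean(guide * src) - mean_i * mean_s
    var_i = mean(guide * guide) - mean_i * mean_i
    a = cov_is / (var_i + eps)          # per-window linear coefficients
    b = mean_s - a * mean_i
    return mean(a) * guide + mean(b)    # smoothed output, edges of `guide` preserved

def filter_cost_volume(guide, cost_volume, radius=16, eps=1e-4):
    """Smooth every disparity slice of an H x W x D cost volume, as in Eq. (5)."""
    return np.stack([guided_filter(guide, cost_volume[:, :, d], radius, eps)
                     for d in range(cost_volume.shape[2])], axis=2)
```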

2.2 Light Field Photography

In February 2012, Lytro released the first consumer-level light field camera. The primary advantage for photography enthusiasts is the ability to adjust focus after capturing the photo. Figure 1 illustrates a simplified model of the Lytro camera. Much like ordinary cameras, which capture the intensity of light on an image sensor, the Lytro camera incorporates a lens assembly that allows for zooming and focusing. However, unlike ordinary cameras, the Lytro camera uses a fixed aperture of f/2.0, and light passes through an array of microlenses that allow the camera to separate light captured from a variety of different angles before it makes contact with the image sensor. In effect, the Lytro camera captures both the light's intensity and its angle of incidence. By post-processing this data, it is possible to mimic movement of the image sensor in relation to the lens assembly, thus reproducing the effects of a standard focus adjustment.

Fig. 1: Image capture using Lytro's light field camera. [Diagram: main lens, microlens array, and sensor array, with the focal plane and near/far virtual focal planes.]

This technology lends itself well to the depth from focus and depth from defocus research areas which, prior to the introduction of light field cameras, required the acquisition of multiple images of a static environment captured using different focus settings. After capturing a photograph with Lytro's light field camera, the user can extract multiple images of the scene with different focus settings. Hence, to extract depth information at a particular point in the scene, it is possible to scan through the image set in search of the one that provides the highest level of sharpness, and then interpolate the depth from that image.

3. Light Field Assisted Stereo Matching

Both stereo matching and depth from focus provide the capability to produce depth maps of the observed scene. While image blur is key to the successful operation of DFF, it is undesirable for stereo matching. Thus, the first stage of the proposed method involves creating all-in-focus images from the photographs captured by the light field cameras. Figure 2 shows three of the twelve images of a test scene obtained with both the left and right light field cameras. In the top row of Figure 2, the background is in focus, and the second and third rows correspond to focal planes moving toward the camera.

Fig. 2: A stereo image pair captured using the Lytro light field camera, from which three focus plane settings were used to produce the images (left and right images shown side by side). From top to bottom, the focal plane moves from far to near.

To create an all-in-focus image I_Sharp, the sharpness of each pixel is evaluated for each of the twelve images I_1, ..., I_12 obtained from the light field photograph. Sharpness in an image can be quantified by integrating the amount of high-frequency content surrounding a pixel. To create a high-frequency image, a low-pass filtered image is first generated using I_{LPF,k} = I_k ∗ H_1, where H_1 is a 5 × 5 Gaussian filter with standard deviation σ = 2. Then, the low-pass filtered image is subtracted from the original image to create a high-pass filtered image I_{HPF,k} = |I_k − I_{LPF,k}|. The amount of high-frequency content surrounding a pixel is aggregated using a second stage of low-pass filtering, and the resulting cost for evaluating the high-frequency content in image k is given by

    C_{HPF,k} = I_{HPF,k} ∗ H_2 ,    (6)

where H_2 is an 11 × 11 Gaussian filter with standard deviation σ = 5. Once the costs are computed for each of the 12 images, the sharpest image for each pixel is determined using

    I_{Sharp}(x, y) = I_k(x, y), \quad k = \arg\max_{j = 1, \dots, 12} C_{HPF,j}(x, y) ,    (7)

for all pixel locations (x, y) within the image.
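A small Python sketch of (6)–(7) follows; it assumes a grayscale focal stack and uses truncated scipy Gaussians sized to match the 5×5 (σ = 2) and 11×11 (σ = 5) kernels described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def all_in_focus(stack):
    """All-in-focus image of Eq. (7) from a focal stack.

    stack : K x H x W float array of grayscale refocused images (K = 12 here).
    Returns the all-in-focus image I_Sharp and the sharpness cost volume C_HPF."""
    cost = np.empty(stack.shape, dtype=np.float64)
    for k, img in enumerate(stack):
        lpf = gaussian_filter(img, sigma=2, truncate=1.0)      # 5x5 kernel, sigma = 2
        hpf = np.abs(img - lpf)                                # high-pass magnitude
        cost[k] = gaussian_filter(hpf, sigma=5, truncate=1.0)  # 11x11 kernel: Eq. (6)
    sharpest = np.argmax(cost, axis=0)                         # sharpest slice per pixel
    rows, cols = np.indices(sharpest.shape)
    return stack[sharpest, rows, cols], cost
```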

Fig. 3: A grayscale map of indices of the sharpest images calculated for each pixel (top row) and all-in-focus images obtained by combining sharp pixels from the image sets together (bottom row); left and right views are shown side by side.

[Fig. 4, panels (a)–(c): (a) left patch, (b) right search space, (c) stereo and DFF matching costs plotted against disparity, comparing the DFF measurements, the interpolated DFF cost (negative PSD integrated over high frequencies), and the guided filter matching cost over a 33×33 window. Panel (d), the combined cost, appears before the caption below.]

Figure 3 provides a grayscale visualization of the indices of the sharpest images chosen for each pixel, along with the resulting all-in-focus images I_Sharp created from both the left and right light field captures. This pair of all-in-focus images is now suitable for processing using stereo matching.

3.1 Depth-from-Focus Cost Volume

The cost volume C_HPF computed in Equation (6) can be used to approximate the disparity between the left and right images after assigning a unique disparity to each of the images extracted from the light field photograph. However, the range of stereo disparity values between the 1024 × 1024 images given in Figure 3 is much larger than 12; in fact, the range of disparities between these images is approximately 20 to 100. Thus, it is necessary to interpolate the cost volume C_HPF to achieve the same level of precision as stereo matching for this image set, resulting in a high-precision cost volume that will be denoted C_DFF. Figure 4(c) shows a set of high-frequency costs evaluated for the 12 images extracted from the light field capture, labeled as "DFF: Measurements". An illustration of the interpolated cost is also given in Figure 4(c), producing a minimum-cost disparity of approximately d = 41.
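One way to realize this interpolation is sketched below: each focal slice is first assigned a disparity, and the sampled costs at every pixel are then interpolated over the full disparity range to form C_DFF. The slice-to-disparity mapping is an assumed per-setup calibration, and the per-pixel loop is kept for clarity rather than speed.

```python
import numpy as np

def interpolate_dff_cost(c_hpf, slice_disparities, d_min=20, d_max=100):
    """Interpolate the K-slice DFF cost of Eq. (6) to one value per disparity.

    c_hpf : K x H x W sharpness cost (negated beforehand so that lower values
            mean sharper, matching the minimization in Eq. (1)).
    slice_disparities : length-K sequence assigning a disparity to each focal
            slice; this calibration is an assumption made for the sketch."""
    disparities = np.arange(d_min, d_max + 1)
    order = np.argsort(slice_disparities)            # np.interp needs increasing x
    xs = np.asarray(slice_disparities, dtype=float)[order]
    k, h, w = c_hpf.shape
    c_dff = np.empty((h, w, disparities.size))
    for y in range(h):
        for x in range(w):
            c_dff[y, x] = np.interp(disparities, xs, c_hpf[order, y, x])
    return c_dff
```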

3.2 Cost Volume Merging and Joint Disparity Selection

Using the image-guided cost-volume filtering approach given in [19], the cost of matching the all-in-focus images can be evaluated over the entire range of disparity hypotheses.

[Fig. 4, panel (d): the combined cost 0.05 × DFF + 20 × SM plotted against disparity.]

Fig. 4: The matching cost ambiguity caused by repeating patterns results in a matching error using image-guided stereo matching. A match for the left image patch (a) is scanned for in the right search space (b) during stereo matching. The matching costs show that the patch at disparity d = 81 has lower cost than the correct disparity d = 44. By combining the cost of stereo matching with the cost of depth from focus, both shown in (c), a joint cost (d) is produced to allow the algorithm to resolve this ambiguity.

Figure 4(c) illustrates the matching cost between the reference image patch in Figure 4(a) and its corresponding search space given in Figure 4(b). Note that, while the correct disparity is d = 44 for this particular pixel, the cost of choosing d = 81 is lower, and d = 81 would be chosen by a pure image-guided stereo matching algorithm. Also given in Figure 4(c) are the sampled cost volume C_HPF and the interpolated cost volume C_DFF. While pure stereo matching fails to select the correct disparity due to the repeating pattern in the target search space, the interpolated cost C_DFF clearly favors the disparity d = 44 over d = 81. It is worth noting that the precision of stereo matching is much higher than that achieved using depth from focus, thus creating a tradeoff between the two methods.

A joint cost volume C_DFF+SM is created by merging the cost volume C_DFF and the stereo matching cost volume C_SM using a weighted summation. Figure 4(d) shows the joint cost equal to C_DFF+SM = 0.05 × C_DFF + 20 × C_SM. In this example, the stereo matching ambiguity can be correctly resolved using WTA after incorporating information from depth from focus.

Stereo matching errors can often be detected using a back-and-forth consistency check. If the match chosen for pixel p in the reference image is p̄, yet the match chosen for pixel p̄ in the target image is not p, the disparity assigned to pixel p is deemed inconsistent. The most common cause of inconsistencies in a well-calibrated stereo image pair is the presence of occlusions, where an area of the scene is visible in only one of the images. Low-contrast areas and specularities can also cause inconsistent matching. In order to take advantage of consistency checking for error detection, the cost volume summation can be modified to favor the DFF matching costs at pixel locations where stereo matching is inconsistent.
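The back-and-forth check reduces to comparing each pixel's chosen disparity with the disparity chosen at its match in the other image, as in the sketch below; the one-level tolerance is an assumption, not a value stated in the paper.

```python
import numpy as np

def consistency_mask(disp_left, disp_right, tol=1):
    """Back-and-forth consistency check on a pair of integer disparity maps.

    disp_left[y, x] maps pixel (x, y) in the left (reference) image to column
    x - disp_left[y, x] in the right image; the match is consistent when the
    right image's disparity at that column agrees to within `tol` levels."""
    h, w = disp_left.shape
    cols = np.tile(np.arange(w), (h, 1))
    matched_cols = np.clip(cols - disp_left, 0, w - 1)          # x - d, clamped
    back = disp_right[np.arange(h)[:, None], matched_cols]      # right image's choice
    return np.abs(disp_left - back) <= tol                      # True where consistent
```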

[Fig. 5 panels: (a) stereo disparity map, (b) stereo inconsistencies, (c) DFF disparity map, (d) joint disparity map, (e) region 1: stereo, (f) region 1: joint method, (g) region 2: stereo, (h) region 2: joint method.]

4. Results

The stereo matching results given in Figures 5(a) and 5(b) were generated using the image-guided cost-volume filtering approach described in Section 2.1, with a support window Ω_p of size 33 × 33. Figure 5(a) shows the disparity map obtained for the left image using the Winner-Takes-All match selection criterion, and Figure 5(b) highlights inconsistent disparities where the mapping between pixels is not bijective. Figure 5(c) shows the corresponding disparity map extracted from the depth-from-focus cost volume C_DFF. Figure 5(d) shows a disparity map produced using the proposed joint method incorporating depth from focus and image-guided stereo matching. The joint matching cost volume at every pixel is computed using

    C_{DFF+SM} = \begin{cases} 0.05 \times C_{DFF} + 20 \times C_{SM} & \text{if } p \text{ is consistent} \\ 0.075 \times C_{DFF} + 20 \times C_{SM} & \text{otherwise,} \end{cases}    (8)

where the DFF matching costs are favored in cases where stereo matching produces inconsistent disparities. Conventional stereo matching fails in the two regions highlighted within Figure 5(a): the region marked with a green border illustrates errors caused by repeating patterns, and the region marked with a blue border illustrates errors caused by occlusions. Within these two regions, the benefits of applying the joint method are apparent, in that the joint method successfully resolves the disparities associated with the repeating metallic mesh and correctly handles the occlusions around the pens and markers.
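The merge rule of (8) and the final WTA selection amount to a few array operations; the sketch below uses the weights quoted above and the consistency mask from the previous step, and returns disparity indices relative to the minimum disparity hypothesis.

```python
import numpy as np

def merge_and_select(c_dff, c_sm, consistent,
                     w_dff=0.05, w_dff_bad=0.075, w_sm=20.0):
    """Joint cost volume of Eq. (8) followed by Winner-Takes-All selection.

    c_dff, c_sm : H x W x D cost volumes over the same disparity hypotheses.
    consistent  : H x W boolean mask from the back-and-forth check."""
    dff_weight = np.where(consistent, w_dff, w_dff_bad)[:, :, None]
    c_joint = dff_weight * c_dff + w_sm * c_sm        # favor DFF where inconsistent
    return np.argmin(c_joint, axis=2)                 # index of the minimum joint cost
```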

Fig. 5: Disparity maps produced using image-guided stereo matching (a), with its inconsistencies labeled in (b); depth from focus (c); and the proposed joint method (d). The highlighted regions, shown enlarged in subfigures (e) to (h), illustrate the joint method's successful handling of areas with repeated textures and occlusions.

5. Conclusion

A method is given for using stereo images obtained via light field photography to enhance the accuracy of the disparity map obtained using conventional stereo matching. The precision of stereo matching and the accuracy of depth from focus are combined by computing a weighted sum of their respective cost volumes over the entire disparity space. It is shown that many of the common difficulties associated with stereo matching, including repeated patterns and occlusions, can be resolved by using interpolated depth-from-focus information obtained from images extracted from the light field photograph.

While light field photography is still in its infancy, the results given in this paper demonstrate the potential gains that can be achieved using this technology. Applying depth from focus to light field photographs produces reliable depth for objects close to the camera; however, for objects that are far away from the camera, the reliability quickly diminishes. This limitation, imposed by the size of the aperture, is similar to the limitation of stereo matching imposed by the distance between the two cameras. Therefore, future research and applications involving the integration of light field photography and stereo matching should consider this tradeoff in order to design passive systems that capture accurate depth information over a wide range.

References

[1] E. P. Baltsavias, "A comparison between photogrammetry and laser scanning," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 54, no. 2–3, pp. 83–94, 1999.
[2] D. Scharstein and R. Szeliski, "High-accuracy stereo depth maps using structured light," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 195–202, June 2003.
[3] R. Yang, S. Cheng, W. Yang, and Y. Chen, "Robust and accurate surface measurement using structured light," IEEE Transactions on Instrumentation and Measurement, vol. 57, pp. 1275–1280, June 2008.
[4] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon, "KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11), New York, NY, USA, pp. 559–568, ACM, 2011.
[5] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," International Journal of Computer Vision, pp. 7–42, 2002.
[6] P. Sturm and B. Triggs, "A factorization based algorithm for multi-image projective structure and motion," in Computer Vision — ECCV '96 (B. Buxton and R. Cipolla, eds.), vol. 1065 of Lecture Notes in Computer Science, pp. 709–720, Springer Berlin/Heidelberg, 1996.
[7] S. Nayar, "Shape from focus system," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '92), pp. 302–308, June 1992.
[8] Y. Y. Schechner and N. Kiryati, "Depth from defocus vs. stereo: How different really are they?," International Journal of Computer Vision, vol. 39, pp. 141–162, 2000.
[9] Q. Yang, L. Wang, R. Yang, H. Stewenius, and D. Nister, "Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 492–504, March 2009.
[10] K.-J. Yoon and I. S. Kweon, "Adaptive support-weight approach for correspondence search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 650–656, April 2006.
[11] W. Yu, T. Chen, F. Franchetti, and J. C. Hoe, "High performance stereo vision designed for massively data parallel platforms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, pp. 1509–1519, November 2010.
[12] E. T. Psota, J. Kowalczuk, J. Carlson, and L. C. Pérez, "A local iterative refinement method for adaptive support-weight stereo matching," in International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV), pp. 271–277, July 2011.
[13] R. Ng, M. Levoy, M. Bredif, G. Duval, M. Horowitz, and P. Hanrahan, "Light field photography with a hand-held plenoptic camera," Technical Report CTSR 2005-02, Stanford University, 2005.
[14] K.-J. Yoon and I.-S. Kweon, "Locally adaptive support-weight approach for visual correspondence search," in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), vol. 2, Washington, DC, USA, pp. 924–931, IEEE Computer Society, 2005.
[15] Z. Gu, X. Su, Y. Liu, and Q. Zhang, "Local stereo matching with adaptive support-weight, rank transform and disparity calibration," Pattern Recognition Letters, vol. 29, no. 9, pp. 1230–1235, 2008.
[16] A. Hosni, M. Bleyer, M. Gelautz, and C. Rhemann, "Local stereo matching using geodesic support weights," in 16th IEEE International Conference on Image Processing, pp. 2093–2096, November 2009.
[17] S. Mattoccia, M. Viti, and F. Ries, "Near real-time fast bilateral stereo on the GPU," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 136–143, June 2011.
[18] J. Kowalczuk, E. T. Psota, and L. C. Pérez, "Real-time stereo matching on CUDA using an iterative refinement method for adaptive support-weight correspondences," accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology, 2012.
[19] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz, "Fast cost-volume filtering for visual correspondence and beyond," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3017–3024, June 2011.
[20] K. He, J. Sun, and X. Tang, "Guided image filtering," in Computer Vision – ECCV 2010, vol. 6311 of Lecture Notes in Computer Science, pp. 1–14, Springer Berlin/Heidelberg, 2010.
