High Dynamic Range Video

Online Submission ID: 125

Category: Research

Figure 1: High dynamic range video of a driving scene. Top row: Input video with alternating short and long exposures. Bottom row: High dynamic range video (tonemapped).

Abstract

Typical video footage captured using an off-the-shelf camcorder suffers from limited dynamic range. This paper describes our approach to generating a high dynamic range (HDR) video from an image sequence of a dynamic scene captured while rapidly varying the exposure. Our approach consists of three parts: automatic exposure control during capture, HDR stitching across neighboring frames, and tone mapping for viewing. HDR stitching requires accurately registering neighboring frames and choosing appropriate pixels for computing the radiance map. We show examples for a variety of dynamic scenes.

CR Categories: I.3.3 [Computer Graphics]: Picture/Image Generation—display algorithms; I.4.1 [Image Processing and Computer Vision]: Enhancement—Digitization and image capture.

Keywords: Image processing, video processing, high dynamic range, tone mapping.

1 Introduction

The real world has far more brightness variation than can be captured by the sensors available in most cameras today. The radiance of a single scene may span four orders of magnitude, from shadows to fully lit regions. Typical CCD or CMOS sensors only capture about 256–1024 levels. (The non-linear allocation of levels in a gamma curve can improve this slightly.) The limited dynamic range problem has inspired many solutions in recent years. One method of capturing the full radiance of a static scene is to take multiple exposures of the scene and to combine them to create a High Dynamic Range (HDR) map [DM97; MN99; TRK01]. The static scene requirement can be eliminated using multiple image detectors, novel sensors (e.g., the National LM9628 sensor, IMS Chips HDRC sensors, and Silicon Vision Products sensors), or spatially varying pixel exposures [MN00] (which also includes a good review of other relevant prior work). Mann et al. [MMF02] register differently exposed frames using homographies, which allows them to estimate the camera response function and hence produce an HDR image from a panning video. Once an HDR image is computed, it must then be rendered to a display. Since typical displays can only reproduce about two orders of magnitude, a contrast reduction must be performed on the HDR image. This tone mapping problem has recently been explored by a number of researchers [DD02; FLW02; RSSF02].

Our work addresses the problem of capturing and rendering high dynamic range video. The result of applying our approach to a driving video can be seen in Figure 1. Our approach consists of automatically determining the temporal exposure bracketing during capture, transferring information between neighboring images to produce radiance maps, and tonemapping for viewing (Figure 2). Our capture solution differs from previous efforts in that it employs a simple reprogramming of the auto-gain mechanism in a video camera. This allows us to use the inexpensive, high-resolution sensors available today, unlike novel sensor designs, which are not yet commercially available and may suffer from a lack of resolution. In Section 2, we present an auto-gain algorithm that intelligently varies the exposure from frame to frame in order to capture different parts of a scene's radiance map. The acquisition process is followed by an offline process that motion-compensates the captured video and estimates the full radiance map at each frame time (Section 3).


This operation, which we call HDR stitching, establishes dense correspondences between images in order to combine pixels at different exposures to produce their HDR version. When compared to the spatially varying pixel exposures approach [MN00], our approach can be viewed as subsampling along a different dimension. In their work, the trade-off is spatial resolution for greater dynamic range; here the trade-off is temporal resolution for greater dynamic range. Before we can view the HDR video, it must be tonemapped. Applying one of the existing algorithms on a frame-by-frame basis is not sufficient, as this can lead to visible temporal inconsistencies in the mapping. To compensate for this, we extend one of these techniques to operate on HDR video, using statistics from temporally neighboring frames to produce tonemapped images that vary smoothly in time (Section 4).

Figure 2: The processing stages involved in producing an HDR video: capture with automatically varying exposures, HDR stitching (registration and radiance map extraction), and tonemapping.

2 Real-time exposure control

The auto gain control (AGC) of a typical video camera measures the brightness of the scene and computes an appropriate exposure. Most scenes have a greater dynamic range than can be captured by the camera's 8-bit-per-pixel sensor. Because of this, regardless of the exposure settings, some pixels will be saturated and some will be under-exposed. In order to capture a greater dynamic range, we have developed a system that varies exposure settings on a per-frame basis. The basic idea is to sequence the settings between different values that appropriately expose dark and bright regions of the scene in turn. A post-processing step (Section 3) then combines these differently exposed frames.

Our system is similar to the auto-bracketing found in many still picture cameras today. When auto-bracketing a scene, the camera determines the correct exposure via the current metering mode and then additionally captures the scene at shorter and longer exposures. Our system takes a similar approach. However, instead of bracketing with a fixed multiple of the middle exposure, we automatically determine an exposure ratio more appropriate for the scene.

We used a 1394 camera from Pt. Grey Research that has a programmable control unit. The firmware was updated with a bank of four shutter (CCD integration time) and gain (ADC gain) registers. During normal operation, the camera does a round-robin through the bank, using a different register set at every frame time. Additionally, the camera tags every frame with the current settings so that they can be used during the radiance map computation. A real-time AGC algorithm (running on a PC tethered to the camera) determines the next group of four settings. In our current implementation, the exposure settings alternate between two different values. The appropriate exposures are automatically determined from scene statistics, which are computed on a sub-sampled frame. All portions of the frame are weighted equally because, in generating HDR imagery, the goal is normally to provide tonal detail everywhere. The two exposures are continuously updated to reflect scene changes.

Figure 3 shows successive frames captured by the camera along with their corresponding histograms in radiance space. The figure shows how a single exposure is not sufficient to capture the radiance range of this scene. Our system is designed so that the exposure ratio between the long and short exposures is minimized while simultaneously allowing the full range of scene radiances to be accommodated. This increases the number of pixels that are useful for matching in both frames.

Figure 3: Two input exposures from the driving video. The radiance histogram is shown on top. The red graph goes with the long exposure frame (bottom left), while the green graph goes with the short exposure frame (bottom right). Notice that the combination of these graphs spans a radiance range greater than any one exposure can capture.

Figure 4: Auto gain flowchart. (Legend: R = (long exposure)/(short exposure); oec = over-exposed count; uec = under-exposed count; oet = over-exposed target; uet = under-exposed target.)

The first step in calculating exposures is to compute an intensity histogram for each of the current exposures. The system uses these histograms, along with several programmable constraints, to update itself. The constraints are the maximum exposure ratio, the over-exposed (saturated) target pixel count, and the under-exposed (black) target pixel count. Figure 4 shows a flowchart of our algorithm, which works as follows. If the two exposures are equal (ratio of 1) and both target counts are in range, the next exposure is chosen such that the histogram is centered within the range of pixel values. If the over-exposed count is much less than its target, then the short exposure is increased until the target count is hit or the exposure ratio becomes 1. The long exposure time is decreased in a like fashion. If the ratio of exposures is still greater than the maximum allowed, then the under-exposed and over-exposed counts are permitted to exceed their targets; in this case, the exposures are chosen to balance these two counts while not going above the maximum ratio. Finally, if either of the counts is greater than its target, the exposures are updated accordingly while limiting the ratio of exposures to the maximum.

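To make the update rule concrete, the following is a minimal Python sketch of one iteration of this loop. It is an illustration only: the statistics, the thresholds (MAX_RATIO and the target fractions), and the multiplicative step size are assumptions rather than the values used in our firmware, and the programming of the camera's register bank is omitted.

```python
import numpy as np

MAX_RATIO = 16.0   # assumed maximum long/short exposure ratio
OE_TARGET = 0.01   # assumed target fraction of over-exposed (saturated) pixels
UE_TARGET = 0.02   # assumed target fraction of under-exposed (black) pixels
STEP = 1.25        # assumed multiplicative adjustment per update

def update_exposures(short_exp, long_exp, short_frame, long_frame):
    """One auto-gain update in the spirit of Figure 4 (sketch, not the firmware logic).

    short_frame / long_frame are sub-sampled 8-bit frames tagged with the
    exposures short_exp / long_exp used to capture them.
    """
    oec = np.mean(long_frame >= 255)   # over-exposed count, measured on the long exposure
    uec = np.mean(short_frame <= 0)    # under-exposed count, measured on the short exposure
    ratio = long_exp / short_exp

    if ratio == 1.0 and oec < OE_TARGET and uec < UE_TARGET:
        # A single exposure suffices; re-center the histogram (omitted here).
        return short_exp, long_exp

    # Pull the exposures together when the counts are comfortably under target...
    if oec < OE_TARGET:
        short_exp = min(short_exp * STEP, long_exp)   # raise short exposure toward ratio 1
    if uec < UE_TARGET:
        long_exp = max(long_exp / STEP, short_exp)    # lower long exposure toward ratio 1

    # ...and push them apart when either count exceeds its target.
    if oec > OE_TARGET:
        short_exp /= STEP
    if uec > UE_TARGET:
        long_exp *= STEP

    # Never exceed the maximum allowed exposure ratio; at the limit the two
    # counts are simply balanced against each other.
    long_exp = min(long_exp, short_exp * MAX_RATIO)
    return short_exp, long_exp
```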
3 HDR stitching

Since frames are captured with temporally varying exposures, generating an HDR frame at any given time requires the transfer of pixel color information from neighboring frames. This, in turn, requires that the pixel correspondences across different frames be highly accurate. We call the process of transferring color information from neighboring frames and extracting the HDR image HDR stitching.

The source video contains alternating long and short exposure frames. The first step in HDR stitching is to generate both a long and a short exposure frame at every instant so that a radiance map can be computed from the pair. This requires that we synthesize the missing exposures using a warping process. Our HDR stitching process generates three intermediate warped frames: a bidirectionally warped (interpolated) frame from the left and right neighbors, a unidirectionally warped left frame, and a unidirectionally warped right frame. This redundancy is later exploited to increase tolerance to registration errors.

Let us assume that the current frame is captured at a long exposure Lk with adjacent frames captured at short exposures (Sk−1 and Sk+1). First, we register the past and future frames with the current frame. This is done only after boosting the intensity of the past and future frames to match the long exposure range. The warped past and future frames are called Sk∗^F0 and Sk∗^B0, respectively. After boosting, we compute the bidirectionally warped frame using all three frames, as illustrated in Figure 5. (The algorithm for the reverse case is similar.) For the case where the adjacent exposures (associated with Sk−1 and Sk+1 in this case) are identical, the steps involved are as follows:

1. Compute the bidirectional flow fields (forward warp fk,F for Sk−1 and backward warp fk,B for Sk+1) using a gradient-based technique (Section 3.1). Use these flow fields to warp the respective images and combine the two warped images to produce an intermediate image Sk∗. This intermediate image should be close in appearance to Lk.

2. Boost the pixel intensities of Sk∗ to match those of Lk. This operation produces the image Lk∗.

3. Use a hierarchical global registration technique (Section 3.2) to compute the flow fk∗ that best maps Lk∗ to Lk. The flow fk∗ is then used to warp Sk∗ to Sk∗′. The images Lk, Sk∗^F0, Sk∗^B0, and Sk∗′ are together used to compute an HDR image at time k.

Figure 5: Bidirectional warping for HDR computation. This diagram shows the case where the sequence is SLS, but the algorithm for LSL is similar. (S = short exposure image, L = long exposure image.) See the text for a full description.

Note that the procedure is similar for the reverse condition, i.e., Lk−1, Sk, Lk+1. For the case where the past and future exposures are different, the modifications are as follows (see Figure 6):

• The intensity of the past or future frame with the lower exposure is boosted to match the image on the other side before Sk∗ (Figure 5) is computed.

• Once fk,F, fk,B, and fk∗ have been computed, we warp Sk−1 to Sk∗^F using f′k,F = fk∗ ∘ fk,F and warp Sk+1 to Sk∗^B using f′k,B = fk∗ ∘ fk,B.

• Lk, Sk∗^F0, Sk∗^B0, Sk∗^F, and Sk∗^B are then used to compute an HDR image at time k (see the sketch below).

Figure 6: Bidirectional warping for HDR computation where the past and future exposures are different. Again, this is shown for the case of the SLS sequence, but the algorithm for LSL is similar. (S = short exposure image, L = long exposure image.)

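The sequence of operations above can be summarized in a short sketch for the equal-neighbor (S, L, S) case. Everything here is illustrative: bidirectional_flow, warp, and hierarchical_register are hypothetical placeholders for the components described in Sections 3.1 and 3.2, the intensity boost is modeled as a simple scaling by the exposure ratio, the blend of the two warped images is shown as a plain average, and the choice of registration routine for the unidirectional warps is an assumption. This outlines the data flow rather than our actual implementation.

```python
def stitch_warps_for_long_frame(S_prev, L_k, S_next, exposure_ratio,
                                bidirectional_flow, warp, hierarchical_register):
    """Produce the warped frames used to build the HDR image at a long-exposure
    frame L_k with short-exposure neighbors (sketch of Figure 5)."""
    def boost(image):
        # Hypothetical intensity boost: scale a linearized short exposure up to
        # the long-exposure range by the exposure ratio.
        return image * exposure_ratio

    # Unidirectional warps: boost each neighbor, register it against L_k, and
    # warp the (un-boosted) neighbor with the recovered flow.
    S_F0 = warp(S_prev, hierarchical_register(boost(S_prev), L_k))
    S_B0 = warp(S_next, hierarchical_register(boost(S_next), L_k))

    # Step 1: bidirectional flow between the two equal (short) exposures,
    # warped to time k and blended into the intermediate image S*_k.
    f_kF, f_kB = bidirectional_flow(S_prev, S_next)
    S_star = 0.5 * (warp(S_prev, f_kF) + warp(S_next, f_kB))

    # Step 2: boost S*_k to the long-exposure range, giving L*_k.
    L_star = boost(S_star)

    # Step 3: refine the registration of L*_k against L_k (Section 3.2) and
    # carry the correction f*_k back onto S*_k.
    f_star = hierarchical_register(L_star, L_k)
    S_star_refined = warp(S_star, f_star)

    # L_k, S_F0, S_B0 and the refined S*_k feed the radiance map computation of
    # Section 3.3. (When the two neighbors have different exposures, f*_k is
    # additionally composed with f_kF and f_kB, as in the bullets above.)
    return S_F0, S_B0, S_star_refined
```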
3.1 Motion estimation

Frame interpolation involves synthesizing the missing exposures at intermediate times using information from a pair of adjacent frames. To do this, we compute a dense motion match between the equal exposures Sk−1 and Sk+1 and use it to warp pixel information forwards and backwards along the motion trajectories to produce Sk∗. This procedure is also used to generate the missing Linterp frames from their L neighbors. Our motion estimation algorithm consists of two stages. First, we globally register the two frames by estimating an affine transform that maps one onto the other. We then use gradient-based optical flow to compute a dense motion field that forms a local correction to the global transform.

Rather than computing forward or backward flow fields at times k − 1 or k + 1, we compute the bidirectional field at the intermediate time k. This allows us to avoid the hole-filling problems of forward-warping when generating each interpolated frame. At each pixel in the output frame k, we obtain composite vectors that point into the future frame k + 1 and the past frame k − 1.

These vectors are each the sum of an affine component and a local component. The affine component is derived from the global warping parameters, re-scaled to warp either from k − 1 to k or from k + 1 to k, and the local component is generated by our symmetrical optical flow algorithm.

For local motion estimation, we use a variant of the Lucas and Kanade [LK81] technique in a Laplacian pyramid framework. We add to this a number of techniques to handle degenerate flow cases. Rather than simply warping one source image progressively towards the other at each iteration, we warp both source images towards the output time k and estimate the residual flow vectors between these two warped images. As the residuals are accumulated down the pyramid, they give rise to a symmetric flow field centered at time k. We augment this technique by including the global affine flow during the warping, so the accumulated residuals are always represented in terms of a symmetrical local correction to this asymmetric global flow. To obtain the final intermediate image, we use bicubic warping to transfer pixels along the appropriate vectors from times k − 1 and k + 1 to each location in the output frame. We average the forward and backward warped pixels if they are available. If both source pixels are outside the frame, we average together the two pixels obtained using a zero motion vector.

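As an illustration of this final resampling step, the sketch below pulls pixels from the two neighbors along per-pixel composite vectors and averages them, falling back to a zero motion vector when both samples land outside the frame. It assumes single-channel images and that the composite vectors (affine plus local, in pixel units) have already been computed; scipy's map_coordinates stands in for the bicubic warp.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def interpolate_midframe(prev_img, next_img, vec_to_prev, vec_to_next):
    """Synthesize the frame at time k from frames k-1 and k+1 (sketch).

    vec_to_prev / vec_to_next: (H, W, 2) composite motion vectors (dy, dx) at
    each output pixel, pointing into the past and future frames respectively.
    """
    H, W = prev_img.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)

    def sample(img, vec):
        coords = np.stack([ys + vec[..., 0], xs + vec[..., 1]])
        valid = ((coords[0] >= 0) & (coords[0] <= H - 1) &
                 (coords[1] >= 0) & (coords[1] <= W - 1))
        # order=3 gives bicubic interpolation.
        return map_coordinates(img.astype(np.float64), coords,
                               order=3, mode='nearest'), valid

    back, back_ok = sample(prev_img, vec_to_prev)
    fwd, fwd_ok = sample(next_img, vec_to_next)

    out = np.where(back_ok & fwd_ok, 0.5 * (back + fwd),
                   np.where(back_ok, back, fwd))
    # If both source pixels fall outside the frame, fall back to averaging the
    # two pixels fetched with a zero motion vector.
    both_bad = ~back_ok & ~fwd_ok
    out[both_bad] = 0.5 * (prev_img[both_bad] + next_img[both_bad])
    return out
```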
Figure 7: Hierarchical homography computation. Note that only the first two levels and one of the quadrants at Level 1 are shown.

3.2 Hierarchical homography

We use a novel image registration method to refine the registration between the interpolated frame and the actual frame. Constraining the flow is desirable at this point, as it reduces the possibility of erroneous mapping in unreliable regions of saturated and low-contrast pixels. The idea of hierarchical homography is shown in Figure 7, which is simplified to illustrate two levels and one quadrant only. At the highest resolution, full-frame registration is performed to find the best 2D perspective transform (i.e., homography) between the two input images, producing homography H0. The reference image (Image 1) is then broken up into overlapping quadrants, shown partially in dotted lines. The global motion for each quadrant is inherited from its parent. If there is insufficient intensity variation within the quadrant (we set this threshold to 10 gray levels), it is left alone. Otherwise, its global motion is refined by performing full image registration between that quadrant from the reference image and the appropriately sampled counterpart from the second image. The boundary of the sub-image from the second image is computed based on H0 in this illustration. In the example shown in Figure 7, the refined transform between the sub-image pair is H1,1. This operation is repeated for all the levels (two in our case) and all the quadrants. The resulting full-image flow is then computed using the local homographies. At and near the boundaries of each quadrant, the flows are feathered to minimize flow discontinuities.

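The quadrant refinement can be written as a short procedure, sketched below for the two-level case. register_pair is a hypothetical callable that performs full-image (or sub-image) registration and returns a refined 3x3 homography given an initial estimate; the overlapping-quadrant bookkeeping, the cropping of the matching sub-image from the target, and the feathering of the final flow field are all omitted.

```python
import numpy as np

VARIATION_THRESHOLD = 10   # gray levels, as in the text

def two_level_homographies(ref, target, register_pair):
    """Two-level hierarchical homography (sketch).

    register_pair(ref_region, target, H_init) -> H is a hypothetical routine
    that refines a 3x3 homography seeded with an initial estimate.
    Returns H0 plus one (row_slice, col_slice, H) entry per quadrant.
    """
    H0 = register_pair(ref, target, np.eye(3))   # full-frame homography

    h, w = ref.shape[:2]
    quadrant_transforms = []
    for rows in (slice(0, h // 2), slice(h // 2, h)):
        for cols in (slice(0, w // 2), slice(w // 2, w)):
            patch = ref[rows, cols]
            if patch.max() - patch.min() < VARIATION_THRESHOLD:
                # Too little intensity variation: keep the inherited transform.
                H = H0
            else:
                # Refine the inherited transform for this quadrant. A full
                # implementation would also crop the matching counterpart from
                # the target using H0 before registering.
                H = register_pair(patch, target, H0)
            quadrant_transforms.append((rows, cols, H))

    return H0, quadrant_transforms
```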
Figure 8: Radiance map computation. Top left: global weight fW vs. intensity. Top right: modulation function fM based on radiance consistency of matched pixels. Bottom: contributing regions from the current, bidirectionally warped, left warped, and right warped frames.

3.3 Radiance map recovery

In this section, we describe the process of combining the input images with their warped neighbors to produce a radiance map. Several techniques have been proposed to do this [DM97; MN99; TRK01]. In each of these techniques, the input images are converted to radiance images using the known exposure value and a computed response function. The final radiance value at a pixel is then computed as the weighted sum of the corresponding pixels in these radiance images. In our system, we compute the response function fR of our camera using the technique of [MN99]. The weighting function fW, derived from this calibration step, is shown in Figure 8.

Existing approaches assume perfectly registered input images. Due to the possibility of mis-registrations in the first step of HDR stitching, we relax this requirement; our system is more tolerant of errors in pixel registration. The following steps are taken for the case where the input image is a long exposure and the adjacent frames are short exposures.

1. Convert L, S∗^F0, S∗^B0, and S∗ to radiance images using the response function and their respective exposure values (we drop the subscript k here). These radiance images are denoted by L̂, Ŝ∗^F0, Ŝ∗^B0, and Ŝ∗, respectively.

2. Identify pixels in the input image L that are above a maximum value as saturated. These pixels are assumed to produce poor registration with adjacent frames; as a result, the final radiance map in these regions is filled in with values from the bidirectionally warped frame Ŝ∗.

3. In other regions, compute the radiance map using (1):

$$R = \frac{f_{WM}(p_F, p_c)\, p_F + f_{WM}(p_B, p_c)\, p_B + f_W(p_c)\, p_c}{f_{WM}(p_F, p_c) + f_{WM}(p_B, p_c) + f_W(p_c)} \qquad (1)$$

The subscripts c, F, and B refer to pixels in the current, left warped, and right warped radiance images, respectively. In this case, the current image is L̂, the left warped image is Ŝ∗^F0, and the right warped image is Ŝ∗^B0. fWM(pw, pc) = fM(|pw − pc|) fW(pw) is the weight function fW modulated by fM (Figure 8). fM is defined by

$$f_M(\delta) = \begin{cases} 2\left(\dfrac{\delta}{\delta_{\max}}\right)^{3} - 3\left(\dfrac{\delta}{\delta_{\max}}\right)^{2} + 1 & \text{if } \delta < \delta_{\max} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

which is a modulation function (Figure 8) that downplays warped radiance values that are too different from the corresponding input radiance value. δmax is a user-specified parameter, which we set to a radiance value equivalent to 16 intensity levels in the current image. Currently, fM is estimated empirically; a more principled approach would be to use the noise statistics of the camera.

Computing the radiance values for the case where the current image is a short exposure follows the same reasoning, except that in step 2 dark pixels are discarded instead of saturated ones. Figure 8 (bottom) illustrates our radiance map recovery algorithm for a current frame taken with a short exposure; we show only the pixels that contribute to the radiance map.

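A compact sketch of the per-pixel combination follows. It assumes the inputs are already radiance images (pixel values passed through the inverse response function fR and divided by exposure time); fW here is a simple triangle-shaped weight used only as a stand-in for the calibrated weighting function of Figure 8, delta_max corresponds to the 16-intensity-level threshold mentioned above, and the saturated/dark special case of step 2 is omitted.

```python
import numpy as np

def f_M(delta, delta_max):
    """Modulation function of Eq. (2): a smooth step that falls to zero as the
    warped radiance deviates from the current radiance."""
    t = np.clip(delta / delta_max, 0.0, 1.0)
    return np.where(delta < delta_max, 2.0 * t**3 - 3.0 * t**2 + 1.0, 0.0)

def f_W(p, p_max):
    """Stand-in global weight favoring mid-range values (the calibrated curve
    of Figure 8 would be used in practice)."""
    return np.clip(1.0 - np.abs(2.0 * p / p_max - 1.0), 0.0, 1.0)

def combine_radiance(p_c, p_F, p_B, delta_max, p_max):
    """Eq. (1): weighted combination of current, left-warped, and right-warped
    radiance values (all arrays in radiance units)."""
    w_c = f_W(p_c, p_max)
    w_F = f_M(np.abs(p_F - p_c), delta_max) * f_W(p_F, p_max)
    w_B = f_M(np.abs(p_B - p_c), delta_max) * f_W(p_B, p_max)
    # Small epsilon guards against an all-zero weight sum.
    return (w_F * p_F + w_B * p_B + w_c * p_c) / (w_F + w_B + w_c + 1e-12)
```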
4 Temporal tone mapping

Tone mapping is used to convert floating-point radiance maps into an 8-bit representation suitable for rendering. This process must reduce the dynamic range of each frame while also maintaining a good contrast level for both brightly and darkly illuminated regions. In addition, the transform must be consistent across captured frames so that there are no temporal artifacts such as flickering.

We make use of the tonemapper presented by [RSSF02], which is based on the photographic technique of dodging and burning. In the first stage, the radiance image is converted to CIE space and the chromaticity coordinates are recovered. The luminance image is then processed to compress the dynamic range. Finally, the chrominance is re-inserted to give the final byte-range RGB image.

Our temporal tonemapper consists of global and local stages. For the global mapping, we compute the average and maximum luminances, which control the transfer function that provides a good initial luminance mapping. The log-average luminance is given by

$$\bar{F}_w = \exp\!\left(\frac{1}{N} \sum_{x,y,i} \log\bigl(\epsilon + F_i(x, y)\bigr)\right) \qquad (3)$$

where ε is a small value (10^−6), N is the total number of pixels, and Fi ranges over the causal temporal neighborhood consisting of the frames at times k − 1 and k. Using a set of frames to control the global mapping helps to prevent flicker in the tonemapped sequence. The tone mapper also contains a local normalization, which is computed using a scale-space-based edge-preserving filter; this is described in detail in [RSSF02].

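The global stage can be sketched as follows. The key point is that the log-average of Eq. (3) is taken over the current and previous frames' luminances, which damps frame-to-frame variation in the mapping. The photographic-style transfer function below, with a key value a and white point L_white, is an assumed stand-in for the initial global mapping of [RSSF02]; the scale-space local normalization and the chromaticity handling are omitted.

```python
import numpy as np

def temporal_log_average(lum_prev, lum_curr, eps=1e-6):
    """Eq. (3): log-average luminance over the causal neighborhood {k-1, k}."""
    both = np.concatenate([lum_prev.ravel(), lum_curr.ravel()])
    return float(np.exp(np.mean(np.log(eps + both))))

def global_tonemap(lum_curr, lum_prev, a=0.18, L_white=None):
    """Global stage of a temporally smoothed tonemapper (sketch).

    Scales luminance by the temporally shared log-average, then applies a
    photographic-style compression curve.
    """
    F_w = temporal_log_average(lum_prev, lum_curr)
    L = a * lum_curr / F_w                 # scaled luminance; the key a is an assumption
    if L_white is None:
        # Use the (temporally shared) maximum luminance as the white point.
        L_white = a * max(lum_prev.max(), lum_curr.max()) / F_w
    L_d = L * (1.0 + L / (L_white ** 2)) / (1.0 + L)
    return np.clip(L_d, 0.0, 1.0)
```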
5 Results

In this section, we show results for three different dynamic scenes: a fish market, a harbor, and a drive along a busy street. The full impact of our work is better observed from our video submission (which also shows a crowd scene not in this paper). We also describe an example involving static images of a sunrise scene taken with a handheld moving camera. Figure 9 shows representative stills from the fish market and harbor scenes. For each scene, the top left quadrant is a short exposure frame, and the top right quadrant is a long exposure frame. The bottom left quadrant shows what the frame would look like for an exposure equal to the geometric mean of the short and long exposures; this is reasonable, given that radiance is normally handled in logarithmic space. The image in the bottom right quadrant is generated using our method.

Fish market scene. A snapshot from the fish market scene can be seen on the left of Figure 9. While the single exposure version looks reasonable, there is some saturation (especially in the middle and upper right corner) as well as low contrast areas. In the frame generated using our HDR approach, good details can be seen almost everywhere (except for the base of the counter, where even the long exposure frame shows little detail).

Harbor scene. The video was captured inside an office overlooking a harbor. In the video, the ferry can be seen moving outside the window while some human activity can be observed inside the office. As shown in Figure 9, the single exposure has both significantly large saturated regions and low contrast regions. In the frame generated using our approach, on the other hand, the ferry and water can be clearly seen, and more details can be seen inside the office.

Driving scene. The results for the driving scene can be seen in Figure 1. In this example, the driver drives through a busy street at about 25 mph. This was a particularly difficult scene because there is occasionally large frame-to-frame displacement due to the fast motion of the driver's hand. Our optical flow algorithm sometimes fails for such large motions, but this problem could be alleviated using a higher frame-rate camera. The capture rate of our camera is currently 15 fps, so similar exposures are sampled at 7.5 fps.

Sunrise scene. The ideas used in creating high dynamic range video can also be applied to still images, especially when there is camera or relative scene motion. The sunrise scene (Figure 10(a)) is one such example. Here there is both camera motion and cloud motion relative to the ground. If we were to perform only global registration (2D perspective or homography), we would obtain the result shown in Figure 10(b,c), with (c) being a magnified version of the middle right part of (b). If we apply global registration followed by local registration, we obtain a significantly better result, as shown in Figure 10(d,e); notice the crisper appearance of the tree branches. We used the standard metadata information (EXIF tags) stored in each of the stills to automate our radiance map computation.

6 Discussion

Looking at the results, one can see that our technique produces videos with increased dynamic range while handling reasonably large amounts of visual motion. For very fast motions, however, given our current low sampling rate of 15 frames per second, our technique sometimes produces artifacts. Using a camera with a faster capture rate would certainly help, as would improvements in the image registration algorithms. In particular, the ability to deal with occlusions (perhaps by extracting object boundaries and constructing a layered motion model) would be useful.

For scenes with very large brightness ranges, such as a dark indoor scene looking out on a bright day, using just two exposures may not be adequate. Increasing the exposure gap between successive frames would capture the dynamic range better, but would make the image registration step more brittle and would lead to increased image noise in the mid-tones. Using more than two exposures is another option we considered, but similar exposures (where registration has the best chance of success) are then temporally farther apart, again leading to potential registration and interpolation problems.

Currently, our weighting and modulation functions are based on the work of [MN99; TRK01]. We plan to use an integrating sphere (whose interior surface is totally Lambertian) to more accurately characterize both the response curve and the noise characteristics of the camera.


Figure 9: Representative stills from two HDR video examples: Left: Fish market scene, Right: Harbor scene. For each scene, the top left quadrant is a short exposure frame, and the top right quadrant is a long exposure frame. The bottom left quadrant shows what the frame would look like for an exposure equal to the geometric mean of the short and long exposures.


Figure 10: Sunrise example (still photographs). (a) The five input images, (b,c) Result of using only global (2D perspective) registration, (d,e) Result of using both global and local registration.

7 Conclusions

In this paper, we have presented a technique for creating high dynamic range video from a sequence of alternating light and dark exposures. The first part of our system is a novel gain control algorithm that selects the best pair of exposures as a function of the pixel brightness distribution. The central component of our approach is the HDR stitching process, which includes global and local registration steps to compensate for pixel motion, as well as an algorithm to select the most trustworthy pixels for radiance map computation. The third part is a tone-mapping algorithm adapted to produce temporally coherent results. The resulting system can be used to produce high dynamic range videos, as well as high dynamic range still images from handheld captures of scenes with motion.

References

P. E. Debevec and J. Malik. Recovering high dynamic range radiance maps from photographs. In Proc. of SIGGRAPH 97, pages 369–378, August 1997.

F. Durand and J. Dorsey. Fast bilateral filtering for the display of high-dynamic-range images. ACM Trans. on Graphics, 21(3):257–266, 2002.

R. Fattal, D. Lischinski, and M. Werman. Gradient domain high dynamic range compression. ACM Trans. on Graphics, 21(3):249–256, 2002.

B. D. Lucas and T. Kanade. An iterative image registration technique with an application in stereo vision. In Int'l Joint Conf. on Artificial Intelligence, pages 674–679, 1981.

S. Mann, C. Manders, and J. Fung. Painting with Looks: Photographic images from video using quantimetric processing. In ACM Multimedia, Dec. 2002.

T. Mitsunaga and S. K. Nayar. Radiometric self calibration. In IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 374–380, June 1999.

T. Mitsunaga and S. K. Nayar. High dynamic range imaging: Spatially varying pixel exposures. In IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 472–479, June 2000.

E. Reinhard, M. Stark, P. Shirley, and J. Ferwerda. Photographic tone reproduction for digital images. ACM Trans. on Graphics, 21(3):267–276, 2002.

Y. Tsin, V. Ramesh, and T. Kanade. Statistical calibration of CCD imaging process. In Int'l Conf. on Computer Vision, volume I, pages 480–487, July 2001.

