A 3D-TV System Based On Video Plus Depth Information
Christoph Fehn
Image Processing Department, Fraunhofer-Institut für Nachrichtentechnik, Heinrich-Hertz-Institut
Einsteinufer 37, 10587 Berlin, Germany
tel.: +49 - (0)30 31002-611, fax: +49 - (0)30 3927200, email: [email protected]
Abstract
This paper presents details of a system that allows for an evolutionary introduction of depth perception into the existing 2D digital TV framework. The work is part of the European Information Society Technologies (IST) project “Advanced Three-Dimensional Television System Technologies” (ATTEST), an activity in which industries, research centers and universities have joined forces to design a backwards-compatible, flexible and modular broadcast 3D-TV system. At the very heart of this new idea is a novel data representation format, which consists of monoscopic color video and associated per-pixel depth information. From these data, one or more “virtual” views of the 3D scene can be synthesized in real-time at the receiver side by means of so-called depth-image-based rendering (DIBR) techniques. After describing the basics of this new approach on 3D-TV, this paper will focus on (a) the efficient generation of high-quality “virtual” stereoscopic views and (b) the backwards-compatible compression and transmission of 3D imagery using state-of-the-art MPEG tools.
A New Approach on 3D-TV
Figure 1: The ATTEST signal processing and data transmission chain. It consists of four different functional building blocks: 1) 3D content generation; 2) 3D video coding; 3) Transmission; 4) “Virtual” view generation and 3D display.
The ambitious aim of the European IST project ATTEST is to design a novel, backwards-compatible and flexible broadcast 3D-TV system [1]. In contrast to former proposals, which often relied on the basic concept of “stereoscopic” video, i. e. the capturing, transmission and display of two separate video streams - one for the left eye and one for the right eye -, this new idea is based on a more flexible joint transmission of monoscopic color video and associated per-pixel depth information. From this data representation, one or more “virtual” views of the 3D scene can then be generated in real-time at the receiver side by means of so-called depth-image-based rendering (DIBR) techniques. The modular and open architecture of the proposed ATTEST system provides important features,
0-7803-8104-1/03/$17.00 ©2003 IEEE
such as backwards-compatibility to today’s 2D digital TV, scalability in terms of receiver complexity and easy adaptability to a wide range of different 2D and 3D displays [2]. To allow for an easier understanding of the main ideas, the envisioned signal processing and data transmission chain of the ATTEST 3D-TV concept is illustrated in Fig. 1. It consists of four different functional building blocks: 1) 3D content creation; 2) 3D video coding; 3) Transmission; 4) “Virtual” view generation and 3D display.
For the generation of future 3D content two complementary approaches are anticipated. In the first case, novel three-dimensional material is created by simultaneously capturing video and associated per-pixel depth information with a so-called Zcam™, an active range camera developed by 3DV Systems [3]. This system integrates a high-speed pulsed infrared light source into a conventional broadcast TV camera and relates the time of flight of the emitted and reflected light walls to direct measurements of the depth
of the scene. However, especially during the introductory phase of the new 3D-TV technology, it will be necessary to satisfy the need for sufficient, high-quality three-dimensional programs by converting already existing 2D video material into 3D using so-called “structure from motion” algorithms [4, 5]. In principle, such (offline) methods process one or more monoscopic video sequences to establish a dense set of image point correspondences from which information about the 3D structure of the scene can be derived. Whatever 3D content generation approach is used in the end, the outcome in all cases consists of regular 2D color video in European digital TV format and an accompanying depth-image sequence with the same spatio-temporal resolution. Each of these depth-images stores depth information as 8-bit grayvalues with the graylevel 0 specifying the furthest value and the graylevel 255 defining the closest value (see Fig. 2). To translate this data format to real, metric depth values - which are required for the “virtual” view generation (see also Section 2) - and to be flexible with respect to 3D scenes with different depth characteristics, the grayvalues are normalized to two main depth clipping planes. The near clipping plane Z_near (graylevel 255) defines the smallest metric depth value Z that can be represented in the particular depth-image. Accordingly, the far clipping plane Z_far (graylevel 0) defines the largest representable metric depth value.
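As an illustration of this normalization, the following minimal sketch converts an 8-bit depth-image grayvalue into a metric depth value. It assumes a linear mapping between the two clipping planes, which the text does not state explicitly; the function name `gray_to_depth` and the example clipping distances are hypothetical.

```python
def gray_to_depth(v, z_near, z_far):
    """Map an 8-bit grayvalue v (0..255) to a metric depth value.

    Assumption (not stated in the text): the mapping is linear, with
    graylevel 255 corresponding to z_near and graylevel 0 to z_far.
    """
    if not 0 <= v <= 255:
        raise ValueError("grayvalue must lie in the range 0..255")
    return z_far + (v / 255.0) * (z_near - z_far)


# Hypothetical clipping planes, in meters.
closest = gray_to_depth(255, z_near=1.0, z_far=5.0)   # near clipping plane
furthest = gray_to_depth(0, z_near=1.0, z_far=5.0)    # far clipping plane
```

Because the clipping planes are chosen per depth-image sequence, the full 8-bit range can be spent on the depth interval that a particular scene actually occupies.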
still or moving images and associated per-pixel depth information [6]. Conceptually, this novel view generation can be understood as the following two-step process: At first, the original image points are reprojected into the 3D world, utilizing the respective depth data. Thereafter, these 3D space points are projected into the image plane of a “virtual” camera, which is located at the required viewing position. The concatenation of reprojection (2D-to-3D) and subsequent projection (3D-to-2D) is usually called 3D image warping in the Computer Graphics (CG) literature.
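The two-step process can be sketched for a single pixel as follows. This is only an illustration under simplifying assumptions that are not taken from the paper: an ideal pinhole camera with focal length `f` in pixels and principal point `(cx, cy)`, and a “virtual” camera that differs from the original one by a pure horizontal translation `cam_x`. All function and parameter names are hypothetical.

```python
def reproject(u, v, z, f, cx, cy):
    """2D-to-3D: back-project pixel (u, v) with metric depth z into 3D space."""
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return (x, y, z)


def project(point, f, cx, cy, cam_x=0.0):
    """3D-to-2D: project a 3D point into a camera shifted by cam_x along x."""
    x, y, z = point
    u = f * (x - cam_x) / z + cx
    v = f * y / z + cy
    return (u, v)


def warp_pixel(u, v, z, f, cx, cy, cam_x):
    """3D image warping: reprojection followed by projection."""
    return project(reproject(u, v, z, f, cx, cy), f, cx, cy, cam_x)


# With cam_x = 0 the pixel maps back onto itself; a horizontal camera
# translation changes only the horizontal image coordinate.
same = warp_pixel(100.0, 50.0, 2.0, f=500.0, cx=320.0, cy=240.0, cam_x=0.0)
moved = warp_pixel(100.0, 50.0, 2.0, f=500.0, cx=320.0, cy=240.0, cam_x=0.1)
```

In this restricted setting the vertical coordinate never changes, which is why a purely horizontal camera displacement leads to one-dimensional signal processing.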
Stereoscopic Image Creation
On a stereoscopic- or autostereoscopic 3D-TV display, two slightly different perspective views of a 3D scene are reproduced (quasi-)simultaneously on a joint image plane (see Fig. 3). The horizontal differences between these left- and right-eye views, the so-called screen parallax values, are interpreted by the human brain and the two images are fused into a single, three-dimensional percept.
Figure 3: Binocular depth reproduction on a stereoscopic 3D-TV display. Two different perspective views, i. e. one for the left eye and one for the right eye, are reproduced (quasi-)simultaneously on a joint image plane.
Figure 2: The ATTEST data representation format. It consists of: (a) Regular 2D color video in European digital TV format; (b) Associated 8-bit depth-images that are normalized to a near clipping plane Z_near and a far clipping plane Z_far.
In the ATTEST approach on 3D-TV, such stereoscopic images are not captured directly with a “real” stereo camera; rather, they are synthesized from monoscopic color video and associated per-pixel depth information. How this is done in a simple and effective way will be described briefly in the following. A much more detailed presentation of this topic including implementation details can be found in .
The remainder of this paper is organized as follows: Section 2 describes very briefly the generation of “virtual” stereoscopic views using depth-image-based rendering (DIBR) techniques. Thereafter, the coding and transmission of the 3D imagery are explained in Section 3. Finally, Section 4 provides some very realistic-looking 3D synthesis results that were created from strongly compressed depth information.
2.2 Shift-Sensor Algorithm
In “real”, high-quality stereo cameras, usually one of two different methods is utilized to establish the so-called zero-parallax setting (ZPS), i. e. to choose
Depth-Image Based Rendering
Depth-image-based rendering (DIBR) is the process of synthesizing “virtual” views of a scene from
the convergence distance Z_c in the 3D scene [7]. In the “toed-in” approach, the ZPS is chosen by a joint inward-rotation of the left- and right-eye camera. In the so-called shift-sensor approach, a plane of convergence is established by a small shift h of the parallel positioned cameras’ CCD sensors (see Fig. 4).
a “real” stereo camera when these system parameters are manually adjusted.

Parameter                   +/-   Parallax   Perceived depth   Object size
Interaxial distance t_c     +     Increase   Increase          Constant
                            -     Decrease   Decrease          Constant
Focal length f              +     Increase   Increase          Increase
                            -     Decrease   Decrease          Decrease
Convergence distance Z_c    +     Decrease   Shift (fore)      Constant
                            -     Increase   Shift (aft)       Constant
Table 1: Effects of “virtual” stereo camera setup parameters. Qualitative changes in screen parallax values, perceived depth and object size when varying the interaxial distance t_c, the focal length f or the convergence distance Z_c of the “virtual” stereo camera setup (after [8]).
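The shift-sensor setup and the parallax trends in Table 1 can be illustrated with a small sketch. The formulas below are the usual parallel-camera stereo geometry and are an assumption here, since the paper gives no equations: the focal length `f` is expressed in pixels, `t_c` and the depths share one metric unit, and positive parallax means that a point appears behind the plane of convergence. All names are hypothetical.

```python
def sensor_shift(f, t_c, z_c):
    """Principal-point shift h (pixels) that yields zero parallax at depth z_c."""
    return f * t_c / (2.0 * z_c)


def view_offsets(z, f, t_c, z_c):
    """Horizontal pixel offsets (left, right) of a point at depth z,
    measured relative to its position in the original view."""
    d = 0.5 * f * t_c * (1.0 / z - 1.0 / z_c)
    return (d, -d)


def screen_parallax(z, f, t_c, z_c):
    """Screen parallax: right-eye position minus left-eye position (pixels)."""
    left, right = view_offsets(z, f, t_c, z_c)
    return right - left


# Hypothetical setup: f = 1000 px, t_c = 0.06 m, convergence at 2 m.
h = sensor_shift(1000.0, 0.06, 2.0)
behind = screen_parallax(4.0, 1000.0, 0.06, 2.0)    # point behind Z_c
in_front = screen_parallax(1.0, 1000.0, 0.06, 2.0)  # point in front of Z_c
```

Under these assumptions the parallax equals f * t_c * (1/Z_c - 1/Z), so increasing t_c or f scales the parallax values up, while increasing Z_c lowers them, in line with the qualitative entries of Table 1.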
Coding of 3D Imagery
To provide the future 3D-TV viewers with the three-dimensional content, the monoscopic color video and the associated per-pixel depth information are first compressed and then transmitted over the conventional 2D digital TV broadcast infrastructure. To ensure the required backwards-compatibility with existing 2D-TV set-top boxes, the basic 2D color video has to be encoded using the standard MPEG-2 tools currently required by the DVB (Digital Video Broadcast) project in Europe, while the supplementary depth-images can - in principle - be compressed using any of the newer, more efficient additions to the MPEG family of standards such as MPEG-4 Visual [9] or Advanced Video Coding (AVC) [10]. The suitability of the different MPEG technologies for the efficient compression of depth-images was evaluated in a comparative coding experiment. The test group consisted of the following four codecs: a) the MPEG-2 reference model codec (TM-5); b) the [email protected]
MPEG-4 Visual reference model codec (MS-Ref.); c) a rate-distortion (R/D) optimized MPEG-4 Visual codec developed at FhG/HHI¹ (R/D opt.); d) the R/D optimized AVC reference model codec (v6.1a). The compression results for the two ATTEST test sequences ‘Interview’ and ‘Orbi’ are shown in Fig. 5 for typical broadcast encoder settings, i. e. a GOP (Group of Pictures) length equal to 12 with a GOP structure of IBBPBBP..., by means of rate-distortion curves over a wide range of different bitrates.
Figure 4: Shift-sensor stereo camera setup. In a shift-sensor stereo camera setup, the convergence distance Z_c is established by a shift h of the camera’s CCD sensors.
While, technically, the “toed-in” approach is easier to realize in “real” stereo cameras, the shift-sensor approach is usually preferred because it doesn’t introduce unwanted vertical differences - which are known to be a potential source of eye-strain - between the left- and the right-eye view. Fortunately, this method is actually easier to implement with depth-image-based rendering (DIBR) as the required signal processing is only one-dimensional. All that is needed is the definition of two “virtual” cameras, one for the left eye and one for the right eye. With respect to the original view, these cameras are symmetrically displaced by half the interaxial distance t_c and their CCD sensors are shifted relative to the position of the lenses. Mathematically, this sensor shift can be formulated as a displacement of a camera’s principal point c. The intrinsic parameters of the two “virtual” cameras are therefore chosen to exactly correspond to the intrinsic camera parameters of the original view except for the horizontal shift h of the respective principal point. Table 1 shows how the 3D reproduction that results from this setup is influenced by the choice of the three main system variables, i. e. by the choice of the interaxial distance t_c, the focal length f of the original camera and the convergence distance Z_c. The respective changes in screen parallax values, perceived depth and object size are qualitatively equal to what happens in
¹ Rate-distortion optimization refers to the process of jointly optimizing both the resulting ‘image quality’ and the required bitrate by systematically varying and testing different video encoder parameters [11].
Figure 5: Test sequences ‘Interview’ and ‘Orbi’ with rate-distortion curves for four different MPEG codecs. (a,d) Monoscopic color video; (b,e) Accompanying per-pixel depth information; (c,f) Coding results shown as rate-distortion curves.
4 Experimental Results
The two graphs (c,f) show, first of all, that AVC as well as MPEG-4 Visual are very well suited for the coding of per-pixel depth information (with AVC being even more efficient). The smoothness of the graylevel depth data (b,e) as well as the relatively slow camera or in-scene motion exhibited by these particular sequences lead to extremely high compression ratios. If a typical broadcast bitrate of 3 Mbit/s is assumed for the MPEG-2 encoded monoscopic color information (a,d), it follows from the R/D curves that the depth-images can be compressed to target rates significantly below 20% of this value. For example, AVC compression of the ‘Interview’ sequence at 105 kbit/s still leads to a very high PSNR of 46.29 dB. For the more complex ‘Orbi’ scene, this value can be reached at a bitrate of approximately 184 kbit/s. While it is clear that the above-described findings have to be confirmed with other, more challenging test data, the results nonetheless indicate that it is possible to introduce the described new approach on 3D-TV with only a very minor transmission overhead compared to today’s conventional 2D digital TV.
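To make the overhead figures concrete, the short sketch below relates the two depth bitrates quoted above to the assumed 3 Mbit/s color bitrate, taking 3 Mbit/s as 3000 kbit/s. It also includes the common PSNR definition for 8-bit data, which is a standard formula and not one given in the paper; the function names are hypothetical.

```python
import math


def depth_overhead(depth_kbps, color_kbps=3000.0):
    """Depth bitrate as a fraction of the color video bitrate."""
    return depth_kbps / color_kbps


def psnr_from_mse(mse, peak=255.0):
    """PSNR in dB for a given mean squared error (8-bit peak by default)."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(peak * peak / mse)


interview = depth_overhead(105.0)   # 'Interview' depth at 105 kbit/s
orbi = depth_overhead(184.0)        # 'Orbi' depth at 184 kbit/s
```

With these numbers the depth stream costs 3.5% of the color bitrate for ‘Interview’ and roughly 6.1% for ‘Orbi’, both well under the 20% mark mentioned in the text.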
Fig. 6 displays some further experimental results. Each of the two images on the left side of the graphic (a,c) shows AVC compressed depth information from one of the two test sequences ‘Interview’ and ‘Orbi’. For the first scene, the bitrate is equal to 105 kbit/s with a PSNR of 48.29 dB, for the second video the bitrate equals 115 kbit/s with a PSNR of 44.16 dB. The figure shows that even at these very low data rates the visual quality of the displayed frames is only slightly degraded in comparison to the originals. The two images on the right side of the graphic (b,d) show overlaid “virtual” left- and right-eye views that were synthesized from the impaired depth-images using the previously described shift-sensor algorithm (see again Section 2). The “virtual” stereo camera system parameters were chosen such that the screen parallax values didn’t exceed a maximum of about 3% of the image width². Human-factors experiments conducted on a single-user, autostereoscopic 3D-TV display (lenticular lens raster) developed by FhG/HHI within the ATTEST project showed that the synthesis results were visually indistinguishable from corresponding 3D sequences that were created with the original, uncompressed depth information.
² This limit is generally considered to be well suited to provide a visually pleasing 3D impression on stereoscopic- or autostereoscopic 3D displays.
References
[1] C. Fehn, P. Kauff, M. Op de Beeck, M. Ernst, W. IJsselsteijn, M. Pollefeys, L. Van Gool, E. Ofek, and I. Sexton, An Evolutionary and Optimized Approach on 3D-TV, Proceedings of International Broadcast Conference '02, Amsterdam, The Netherlands, 2002, 357-365.
[2] C. Fehn, A 3D-TV Approach Using Depth-Image-Based Rendering (DIBR), Proceedings of Visualization, Imaging, and Image Processing '03, Benalmádena, Spain, 2003, 482-487.
[3] G. J. Iddan and G. Yahav, 3D Imaging in the Studio and Elsewhere..., Proceedings of SPIE Videometrics and Optical Methods for 3D Shape Measurements, San Jose, CA, USA, 2001, 48-55.
[4] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, Cambridge, UK, 2000.
[5] M. Pollefeys, M. Vergauwen, K. Cornelis, J. Tops, F. Verbiest, and L. Van Gool, Structure and Motion From Image Sequences, Proceedings of Conference on Optical 3-D Measurement Techniques '01, Vienna, 2001, 251-258.
[6] L. McMillan, An Image-Based Approach on Three-Dimensional Computer Graphics, PhD thesis, University of North Carolina at Chapel Hill, 1997.
Figure 6: Coding and synthesis results for the 'Interview' and 'Orbi' test sequences. (a,c) AVC compressed per-pixel depth information; (b,d) Overlaid "virtual" left- and right-eye views.
[7] A. Woods, T. Docherty, and R. Koch, Image Distortions in Stereoscopic Video Systems, Proceedings of SPIE Stereoscopic Displays and Applications '93, San Jose, CA, USA, 1993.
This paper provided details of a new approach on 3D-TV using depth-image-based rendering (DIBR). After a short description of the basics of the concept and a brief explanation of the efficient generation of high-quality "virtual" stereoscopic views, it mainly dealt with the backwards-compatible compression and transmission of 3D imagery using state-of-the-art MPEG tools. The given results indicate that it would be possible to introduce the described 3D-TV scenario with only a very minor transmission overhead compared to today's conventional 2D digital TV.
[8] P. Milgram and M. Krüger, Adaptation Effects in Stereo Due to On-line Changes in Camera Configuration, Proceedings of SPIE Stereoscopic Displays and Applications '92, San Jose, CA, USA, 1992, 122-134.
[9] ISO/IEC JTC 1/SC 29/WG 11, Coding of Audio-Visual Objects - Part 2: Visual, ISO/IEC 14496-2:2001, Geneva, Switzerland, 2001.
[10] ISO/IEC JTC 1/SC 29/WG 11, Joint Video Specification (ITU-T Rec. H.264 - ISO/IEC 14496-10 AVC), JVT Document E146d34, Geneva, Switzerland, 2002.
This work has been sponsored by the European Commission (EC) through their Information Society Technologies (IST) program under proposal No. IST-2001-34396. The author would like to thank the project officers as well as all project partners for their support and for their input to this publication.
[11] I. E. G. Richardson, Video CODEC Design, John Wiley and Sons, 2002.