Multiview Imaging and 3D TV. A Survey

Research Assignment

Multiview Imaging and 3D TV. A Survey.
Anastasia Manta
January 2008

Supervisors: Dr. Emile A. Hendriks, Dr. ir. Andre Redert


Contents

1   Introduction
2   Capturing system
    2.1   Static Scenes
    2.2   Dynamic Scenes
          2.2.1   Camera Setup
          2.2.2   Synchronization of cameras
          2.2.3   Camera calibration
    2.3   Multiview correspondence
3   3-D scene representation
    3.1   Geometry-based modeling
    3.2   Image-based modeling
    3.3   Hybrid Image-Based modeling techniques
4   Rendering
    4.1   Model-based rendering (MBR) - rendering with explicit geometry
    4.2   Image-based rendering (IBR) - rendering with no geometry
    4.3   Rendering with implicit geometry
5   Coding
6   Transporting 3D Video
7   3-D Display
8   Discussion and conclusions
9   Appendix A
10  References


1 Introduction

Multiview imaging has attracted increasing attention, thanks to the rapidly dropping cost of digital cameras. This opens a wide variety of interesting research topics and applications, such as virtual view synthesis, high-performance imaging, image/video segmentation, object tracking/recognition, environmental surveillance, remote education, industrial inspection and 3DTV. While some of these tasks can be handled with conventional single view images/video, the availability of multiple views of the scene significantly broadens the field of applications, while enhancing performance and user experience.

3DTV is one of the most important applications of multiview imaging and a new type of media that expands the user experience beyond what is offered by traditional media. It has been developed by the convergence of new technologies from computer graphics, computer vision, multimedia, and related fields. 3DTV, also referred to as stereo TV, offers a three-dimensional (3-D) depth impression of the observed scene.

To enable the use of 3DTV in real-world applications, the entire processing chain, including multiview image capture, 3-D scene representation, coding, transmission, rendering and display, needs to be considered [1]. There are numerous challenges in this chain. A system that can capture and store large numbers of videos in real time poses many engineering difficulties. An accurate calibration of camera position and color properties is required. From the acquired multiview data, one should consider how to represent the 3-D scene in a way that suits the subsequent processes. Depth reconstruction is a central task in 3-D representation but remains a very difficult problem when novel images must be rendered precisely. The amount of multiview image data is usually huge, hence compressing and streaming the data over limited bandwidth with minimal degradation and delay are also challenging tasks.

In addition, there are strong interrelations between all of the processes involved. The camera configuration (array or dome) and density (number of cameras) impose practical limitations on navigation and on the quality of rendered views at a given virtual position. There is therefore a classical trade-off between cost (equipment, cameras, processors) and quality (navigation range, quality of virtual views). In general, denser capturing of multiview images with a larger number of cameras provides a more precise 3-D representation, resulting in higher quality views through the rendering and display processes, but requires stronger compression in the coding process, and vice versa. An interactive display that requires random access to 3-D data affects the performance of a coding scheme that is based on data prediction. Various, quite diverse types of 3-D scene representation can be employed, which implies a number of different data types.


This report aims to explore some of the challenges in multiview imaging technology that must be addressed to fulfill the ultimate research goal. It provides an overview of multiview imaging based on the available literature, with the focus placed on the rendering part of the chain. The report does not propose a new algorithm; instead, it reviews up-to-date algorithms from the literature and recommends some improvements.

The report is structured as follows. The second chapter introduces the capturing system and possible camera configurations, and discusses the important issues of camera calibration and stereo correspondence. The third chapter reviews the different data representations used in current projects. Chapter four deals with rendering, relates rendering to the data representations, and assesses rendering algorithms for the corresponding applications. Chapters five and six deal with the coding and the transport of 3-D video. Chapter seven outlines the available 3-D displays. Finally, chapter eight addresses some open issues with respect to the literature reviewed.


2 Capturing system

For the generation of future 3D content two complementary approaches are anticipated. In the first case, novel three–dimensional material is created by simultaneously capturing video and associated per-pixel depth information. The techniques involved in this procedure are explained in this chapter. The second approach satisfies the need for sufficient three-dimensional content by converting already existing 2D video material into 3D, but is out of the scope of this report.

2.1 Static Scenes

Capturing multiple views of a static scene is relatively simple because only a single camera is needed. The camera can be moved along a predetermined path to take multiple images of the scene. Novel views can then be synthesized, provided that the camera position/geometry is known.

The camera geometry can be established in two ways. The first is to use a robotic arm or a similar mechanism to control the movement of the camera. For instance, a camera gantry is used in [5] to capture a light field, which assumes that the camera locations form a uniform grid on a 2-D plane. In concentric mosaics [6], a camera is mounted on the tip of a rotating arm, which captures a series of images whose centers of projection lie along a circle.

The second approach to obtain the camera geometry is through calibration. In the Lumigraph work [7], the authors used a handheld camera to capture the scene. The scene contains three planar patterns, which are used for camera calibration. In [8], a camera attached to a spherical gantry arm is used to capture images roughly evenly over the sphere. Calibration is still performed to register the camera locations to the scene geometry obtained through range scanning. When the scene itself contains many points of interest [9], it is possible to extract and match feature points directly for camera calibration.

2.2 Dynamic Scenes

For the acquisition of dynamic scenes, an array of cameras is in most cases needed. Most existing camera arrays contain a set of static cameras. One exception is the self-reconfigurable camera array developed in [10], which has 48 cameras mounted on robotic servos. In this case, cameras move during capturing to acquire better images for rendering (they have to be calibrated on-the-fly using a calibration pattern in the scene).


Capturing dynamic scenes with multiple cameras involves a number of challenges. For instance, the cameras need to be synchronized if correspondence between images is to be exploited (e.g., in the rendering stage). The amount of data captured by a camera array is often huge, and it is necessary to write these data to storage devices as fast as possible. Color calibration is another issue that needs to be addressed in order to render seamless synthetic views.

2.2.1 Camera Setup

The camera setups range from dense configurations (Stanford Light Field Camera, [11]) to intermediate camera spacing [12] to wide camera distributions (Virtualized Reality™, [13]). The wider spacing between the cameras in the latter system makes it more challenging to produce locally consistent geometries and hence photorealistic views, mainly because of occlusions.

A significantly denser camera configuration such as that of the Stanford Light Field Camera allows effects such as synthetic aperture and refocusing: synthetic aperture imagery allows objects that are occluded with respect to any given camera to be seen. In general, dense sampling permits photorealistic rendering with either a simple planar geometric representation or a rough geometric approximation. The disadvantage, however, is the large number of images required for rendering. There is thus an apparent image-geometry trade-off.
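The synthetic-aperture effect can be sketched in a few lines: each camera image is shifted in proportion to its offset in the array and a chosen focal depth, then all views are averaged, so that points on the focal plane align and occluders off that plane blur away. This is only a simplified illustration (integer pixel shifts, wrap-around borders, an illustrative shift factor), not the implementation of any particular system described here.

```python
# Simplified synthetic-aperture refocusing sketch for a planar camera array.
import numpy as np

def synthetic_aperture(images, offsets, depth):
    """images: list of HxW arrays; offsets: per-camera (dx, dy) positions in the
    array plane; depth: focal depth controlling the per-view shift (assumed
    proportionality, larger depth -> smaller shift)."""
    acc = np.zeros_like(images[0], dtype=np.float64)
    for img, (dx, dy) in zip(images, offsets):
        # Shift each view so that points at the focal depth line up.
        sx, sy = int(round(dx / depth)), int(round(dy / depth))
        acc += np.roll(np.roll(img, sy, axis=0), sx, axis=1)
    return acc / len(images)
```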

Approaches in the middle try to reduce the number of required cameras and compensate for this by providing high-quality stereo data, for example. Zitnick et al. [12] proposed a layered depth image representation using an eight-camera configuration. This approach, however, still needs a quite dense camera setup for a limited viewing range (a horizontal field of view of about 30°).

For configurations which cover an entire hemisphere with a small number of cameras, either model-based approaches need to be employed (e.g. Carranza et al. [6] with 8 cameras) or degradation in visual quality has to be accepted (e.g. Wurmlin et al. [33] with 16 cameras). The latter two systems are also limited by the employed reconstruction algorithms to the capture of foreground objects, or even humans only.

Scalability in terms of camera configurations is another important issue, which Waschbusch et al. [11] try to solve. In their work they introduce sparsely placed, scalable 3D video bricks which act as low-cost z-cameras. The importance of z-cameras in the content acquisition process will become clearer in the stereo correspondence section. A single brick consists of a projector, two grayscale cameras and one color camera. To fully cover 360° in all dimensions, about 8 to 10 3D video bricks are needed.


Resolution plays an important role in achieving photorealism as well, but a higher resolution will not help if rendering artifacts are not properly handled. These artifacts include boundary or cut-out effects, incorrect or blurred texturing, missing data, and flickering. Humans are highly sensitive to high-frequency spatial and temporal artifacts. Although using a reduced resolution would conveniently help to mask or ameliorate such artifacts, it should not be viewed as a solution. For example, Zitnick et al. [12] use high-resolution (1024×768) color cameras capturing at 15 fps, whereas Matusik et al. [19] used an array of cameras with 1300×1030 resolution and a frame rate of 12 frames per second.

2.2.2 Synchronization of cameras

When the number of cameras in the array is small, synchronization between cameras is often simple. A series of 1394 FireWire cameras can be daisy-chained to capture multiple videos, and the synchronization of the exposure start of all the cameras is guaranteed on the same 1394 bus. Alternatively, the cameras' exposure can be synchronized using a common external trigger. This is a very widely used configuration and can scale up to large camera arrays [12], [15]–[18]. In the worst case, where the cameras in the system cannot be genlocked, camera synchronization can still be roughly achieved by pulling images from the cameras at a common pace from the computer. Slightly unsynchronized images may cause artifacts in scene geometry reconstruction for fast-moving objects, but the rendering results may still be acceptable since human eyes are not very sensitive to details in moving objects.

When multiple videos are recorded simultaneously, the amount of data that needs to be stored/processed is huge. Most existing systems employ multiple computers to record and process the data from the cameras. The Stanford multicamera array [29] used a modular embedded design based on the IEEE 1394 high-speed serial bus, with an image sensor and MPEG2 compression at each node. Since video compression is performed on the fly, the system is capable of recording a synchronized video data set from over 100 cameras to a hard disk.

2.2.3 Camera calibration

Camera calibration is the process of determining the internal camera geometric and optical characteristics (intrinsic parameters) and/or the 3-D position and orientation of the camera frame relative to a certain world coordinate system (extrinsic parameters). The purpose of the calibration is to establish the relationship between 3-D world coordinates and their corresponding 2-D image coordinates. Once this relationship is established, 3-D information can be inferred from 2-D information and vice versa. In an application involving multiple cameras this step is necessary to guarantee geometric consistency across the different terminals.

These techniques can be roughly classified into two categories, as discussed in [21]: photogrammetric calibration and self-calibration.

In photogrammetric calibration methods, calibration is performed by observing a calibration object whose geometry in 3-D space is known with very good precision. Calibration can be done very efficiently [22]. The calibration object usually consists of two or three planes orthogonal to each other. Sometimes, a plane undergoing a precisely known translation is also used [23]. These approaches require an expensive calibration apparatus and an elaborate setup.
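As an illustration of the photogrammetric approach, the sketch below calibrates a camera from several views of a planar chessboard using OpenCV. The pattern size, square size and file names are assumptions for the example; the methods cited above use orthogonal-plane or precisely translated targets rather than a single chessboard.

```python
# Minimal photogrammetric calibration sketch with a planar chessboard target.
import cv2
import numpy as np

pattern = (9, 6)      # inner corner grid of the chessboard (assumed)
square = 0.025        # square size in meters (assumed)

# 3-D corner coordinates in the target's own frame (Z = 0 on the plane).
obj = np.zeros((pattern[0] * pattern[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for fname in ["view0.png", "view1.png", "view2.png"]:   # hypothetical images
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(obj)
        img_points.append(corners)

# Intrinsics (camera matrix K, distortion) and per-view extrinsics (rvecs, tvecs).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS error:", rms)
```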

Self-calibration techniques do not use any calibration object. Just by moving a camera in a static scene, the rigidity of the scene in general provides two constraints on the camera's internal parameters per camera displacement, using image information alone. Therefore, if images are taken by the same camera with fixed internal parameters, correspondences between three images are sufficient to recover both the internal and external parameters, which allows us to reconstruct the 3-D structure up to a similarity. While this approach is very flexible, it is not yet mature. Recent research on camera calibration has focused on the problem of self-calibration. A critical review of self-calibration techniques can be found in [24].

2.3 Multiview correspondence

Multiview correspondence, or multiple view matching, is the fundamental problem of determining which parts of two or more images (views) are projections of the same scene element. The output is a disparity map for each pair of cameras, giving the relative displacement, or disparity, of corresponding image elements (Figures 1 and 2). Disparity maps allow us to estimate the 3-D structure of the scene and the geometry of the cameras in space.
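For a rectified camera pair, disparity relates to depth through Z = f·B/d, where f is the focal length in pixels, B the baseline and d the disparity. A minimal sketch with illustrative values for f and B:

```python
# Disparity-to-depth conversion sketch; focal length and baseline are
# illustrative placeholders, not values from any system described here.
import numpy as np

def disparity_to_depth(disparity, f=1000.0, baseline=0.1, eps=1e-6):
    """Convert a disparity map (pixels) to a depth map (meters): Z = f*B/d."""
    return f * baseline / np.maximum(disparity, eps)

depth = disparity_to_depth(np.array([[20.0, 40.0], [10.0, 80.0]]))
```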

Passive stereo, which relies only on the captured images without structured-light projectors, [25], remains one of the fundamental technologies for estimating 3-D geometry. It is desirable in many applications because it requires no modifications to the scene and because dense information (that is, at each image pixel) can nowadays be obtained at video rate on standard processors for medium-resolution images (e.g., CIF, CCIR) [26]–[28]. For instance, systems in the late 1990s already reported a frame rate of 22 Hz for images of size 320 × 240 on a Pentium III at 500 MHz [29].

The availability of real-time disparity maps also enables segmentation by depth, which can be useful for layered scene representation [30], [31]–[33]. Large-baseline stereo, generating significantly different images, can be of great importance for some virtual environment applications, as it is not always possible to position cameras close enough to achieve small baselines or because doing so would imply using too many cameras given speed or bandwidth constraints. The VIRTUE system [34] is an example: four cameras can only be positioned around a large plasma screen, and using more than four cameras would increase delay and latency beyond acceptable levels for usability (but see recent systems using high numbers of cameras [13], [35], [36]).

Figure 1: Original stereo pair acquired from two cameras a and b

Figure 2: Visualization of the associated disparity maps from camera a to b (left) and from b to a (right)

There are two broad classes of correspondence algorithms seeking to achieve, respectively, a sparse set of corresponding points (yielding a sparse disparity map) or a dense set (yielding a dense disparity map).

1) Sparse Disparities and Rectification: Determining a sparse set of correspondences among the images is a key problem for multiview analysis. It is usually performed as the first step, in order to calibrate (fully or weakly) the system, when nothing about the geometry of the imaging system is known yet and no geometric constraint can be used to help the search. The algorithms presented in the literature so far can be classified into two categories: feature matching and template matching. Algorithms in the first category select feature points independently in the two images, then match them using tree searching, relaxation, maximal clique detection, or string matching [37]–[40]. A different algorithm is given in [41], which presents an interesting and easy-to-implement algebraic approach based on point positions and correlation measures. Algorithms in the second category select templates in one image (usually patches with some texture information) and then look for corresponding points in the other image using a similarity measure [22], [42], [43]. The algorithms in this class tend to be slower than those in the first class, as the search is less constrained, but it is possible to speed up the search in some particular cases [44].

The search for matches between two images is simplified and sped up if the two images are warped in such a way that corresponding points lie on the same scanline in both images. This process is called rectification [25]. The rectified images can be regarded as acquired by cameras rotated with respect to the original ones, or as the original images projected onto a common plane. Most of the stereo algorithms in the literature assume rectified images.
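A minimal rectification sketch using OpenCV, assuming the intrinsics (K1, K2), distortion coefficients (d1, d2) and the relative pose (R, T) of the pair are already known from a calibration step:

```python
# Rectification sketch: warp a calibrated stereo pair so that corresponding
# points end up on the same scanline.
import cv2

def rectify_pair(img_l, img_r, K1, d1, K2, d2, R, T):
    size = (img_l.shape[1], img_l.shape[0])
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, d1, K2, d2, size, R, T)
    map1l, map2l = cv2.initUndistortRectifyMap(K1, d1, R1, P1, size, cv2.CV_32FC1)
    map1r, map2r = cv2.initUndistortRectifyMap(K2, d2, R2, P2, size, cv2.CV_32FC1)
    rect_l = cv2.remap(img_l, map1l, map2l, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, map1r, map2r, cv2.INTER_LINEAR)
    return rect_l, rect_r, Q   # Q reprojects disparities to 3-D points
```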

2) Dense Disparities: Dense stereo matching is a well-studied topic in image analysis [22], [25]. Here we focus on large-baseline matching given its importance for advanced visual communication systems.

The output of a dense matching algorithm is a disparity map. The matching image points must satisfy geometric constraints imposed by algebraic structures such as the fundamental matrix for two views, plus other constraints (physical and photometric). These include: order (if two points in two images match, then matches of nearby points should maintain the same order); smoothness (the disparities should change smoothly around each pixel); and uniqueness (each pixel cannot match more than one pixel in any of the other images).

Points are usually matched using correlation-like correspondence methods [45]: given a window in one frame, standard methods in this class explore all possible candidate windows within a given search region in the other frame and pick the one optimizing an image similarity (or dissimilarity) metric. Typical metrics include the sum of squared differences (SSD), the sum of absolute differences (SAD), or correlation. Typically the windows are centered around the pixel for which we are computing the disparity. This choice can give poor performance in some cases (e.g., around edges). Results can be improved by adopting multiple-window matching, where different windows centered at different pixels are used [46], [47], at the cost of a higher computational time.
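A brute-force sketch of window-based SAD matching on a rectified grayscale pair; the window size and disparity range are illustrative, and real implementations use the box-filtering and partial-distance tricks mentioned below rather than this naive loop:

```python
# Naive window-based SAD block matching for a rectified stereo pair.
import numpy as np

def sad_disparity(left, right, max_disp=64, half=4):
    h, w = left.shape
    disp = np.zeros((h, w), np.float32)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            ref = left[y - half:y + half + 1, x - half:x + half + 1].astype(np.int32)
            best, best_d = None, 0
            for d in range(max_disp):
                cand = right[y - half:y + half + 1,
                             x - d - half:x - d + half + 1].astype(np.int32)
                cost = np.abs(ref - cand).sum()   # sum of absolute differences
                if best is None or cost < best:
                    best, best_d = cost, d
            disp[y, x] = best_d
    return disp
```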

Computation of disparity maps can be expensive, but some tricks can be used to speed up the computation of the similarity measure by using box filtering techniques [45] and partial distances [48]. Guidelines for an efficient implementation of stereo-matching algorithms on state-of-the-art hardware are given in [28] and [49].

3) Large-Baseline Matching: This is the difficult problem of determining correspondences between significantly different images, typically because the cameras' relative displacement or rotation is large. As a consequence of the significant difference between the images, direct correlation-based matching fails at many more locations than in small-baseline stereo. From an algorithmic point of view, the images of a large-baseline stereo pair exhibit large disparities and may present considerable amounts of relative distortion and occlusion.

Large camera translations and rotations induce large disparities in pixels, thus forcing search algorithms to cover large areas and increasing the computational effort. Large displacements between cameras may also introduce geometric and photometric distortions, which complicate image matching. As for occlusions, the farther apart the viewpoints, the more likely occluded areas (visible to one camera but not to the other) become. The problem of occlusions can be partially solved, at the cost of extra computation, in multi-camera systems as long as every scene point is imaged by at least two cameras [50]–[52]. However, in practice, increasing the number of cameras may increase the risk of unacceptably high delay and latency.

Solutions to the problem of large-baseline matching include intrinsic curves, coarse-to-fine approaches, maximal regions [53], and other invariant regions [53], [54]. Intrinsic curves [55] are an image representation that transforms the stereo-matching problem into a nearest-neighbor problem in a different space. Their interest here is that they are ideally invariant to disparity, so that they support matching, theoretically, irrespective of disparity values. In coarse-to-fine approaches [56]–[58], matching is performed at increasing image resolutions. The advantages are that an exhaustive search is performed only on the coarsest-resolution image, where the computational effort is minimal, and only a localized search takes place on high-resolution images. Approaches based on invariant features rely on properties that remain unchanged under (potentially strong) geometric and photometric changes between images. Very good results have been reported, but computational costs are usually high, and direct application to real-time telepresence systems is infeasible.
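The coarse-to-fine idea can be sketched as follows, reusing the sad_disparity helper from the block-matching sketch above; the pyramid depth, window size and search radius are illustrative assumptions rather than values from the cited methods.

```python
# Coarse-to-fine matching sketch: exhaustive search only at the coarsest
# pyramid level, local refinement of the upsampled estimate at finer levels.
import cv2
import numpy as np

def refine(left, right, init, radius=2, half=4):
    h, w = left.shape
    out = init.copy()
    for y in range(half, h - half):
        for x in range(half, w - half):
            d0, best = int(round(init[y, x])), None
            ref = left[y - half:y + half + 1, x - half:x + half + 1].astype(np.int32)
            for d in range(max(d0 - radius, 0), d0 + radius + 1):
                if x - d - half < 0 or x - d + half + 1 > w:
                    continue
                cand = right[y - half:y + half + 1,
                             x - d - half:x - d + half + 1].astype(np.int32)
                cost = np.abs(ref - cand).sum()
                if best is None or cost < best:
                    best, out[y, x] = cost, d
    return out

def coarse_to_fine(left, right, levels=3, max_disp=64):
    pyr = [(left, right)]
    for _ in range(levels - 1):
        l, r = pyr[-1]
        pyr.append((cv2.pyrDown(l), cv2.pyrDown(r)))
    disp = sad_disparity(*pyr[-1], max_disp >> (levels - 1))    # coarsest level
    for lvl in range(levels - 2, -1, -1):
        l, r = pyr[lvl]
        disp = cv2.resize(disp, (l.shape[1], l.shape[0])) * 2.0  # upsample, rescale
        disp = refine(l, r, disp)
    return disp
```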

All of the techniques above are still too time-consuming if the target is a full-resolution disparity map for full-size video at frame rate. The methods in [26] and [59] are two approaches that address this point by exploiting the redundancy of information in video sequences. The former [26] unifies the advantages of block-recursive disparity estimation and a pixel-recursive optical flow estimation in one common scheme, leading to a fast matching algorithm. The latter [59] uses motion detection to reduce the quantity of pixels at which disparity is computed.


3 3-D scene representation

The choice of a 3-D scene representation format is of central importance for the design of any 3DTV system. On one hand, the scene representation sets the requirements for multiview image processing. On the other hand 3-D scene representation determines the rendering algorithms, interactivity, as well as compression and transmission if necessary [1].

In the computer graphics literature, methods for 3-D scene representation are often classified along a continuum between two extremes. One extreme is represented by classical 3-D computer graphics. This approach can also be called geometry-based modeling. In most cases, scene geometry is described on the basis of 3-D meshes. Real-world objects are reproduced using geometric 3-D surfaces with an associated texture mapped onto them. More sophisticated attributes can be assigned as well. For instance, appearance properties (opacity, reflectance, specular highlights, etc.) can significantly enhance the realism of the models.

The other extreme in 3-D scene representation is called image-based modeling and does not use any 3D geometry at all. For applications in 3D-TV, purely image-based representations are well suited but need many densely spaced cameras.

In between the two extremes there exists a number of methods that make more or less use of both approaches and combine their advantages in some way.

3.1 Geometry-based modeling

Geometry-based modeling is used in applications such as games, the Internet, TV, and movies. The achievable performance with these models can be excellent, typically when the scenes are purely computer generated. The available technology for both production and rendering has been highly optimized over the last few years, especially in the case of common 3-D mesh representations. In addition, state-of-the-art PC graphics cards are able to render highly complex scenes with impressive quality in terms of refresh rate, level of detail, spatial resolution, reproduction of motion, and accuracy of textures [1].

With the use of geometry-based representations, quite sparse camera configurations are feasible, but most existing systems are restricted to foreground objects only [15, 60, 61]. In other words, despite the advances in 3-D reconstruction algorithms, reliable computation of 3-D scene models remains difficult. In many situations, only a certain object in the scene, such as the human body, is of interest. Here, prior knowledge of the object model can be used to improve the reconstruction quality.

Voxel-based representations, [77], can easily integrate information from multiple cameras but are limited in resolution. The work of Vedula et al., [77], based on the explicit recovery of 3D scene properties, first uses the voxel coloring algorithm to recover a 3D voxel model of the scene at each time instant. The 3D scene flow algorithm is then used to recover the 3D non-rigid motion of the scene between consecutive time instants, Figure 3. In the next stage, the voxel models and scene flow become inputs to a spatio-temporal view interpolation algorithm.

Figure 3: A set of calibrated images at two consecutive time instants. From these images, 3D voxel models are computed at each time instant using the voxel coloring algorithm. After computing the 3D voxel models, the dense non-rigid 3D motion or “scene flow” between these models is computed.

A prominent class of geometry reconstruction algorithms is that of shape-from-silhouette approaches. Shape-from-silhouette reconstructs models of a scene from multiple silhouette images or video streams. Starting from the silhouettes extracted from the camera pictures, a conservative shell enveloping the true geometry of the object is computed by reprojecting the silhouette cones into the 3-D scene and intersecting them. This generated shell is called the visual hull. While visual hull algorithms are efficient and many systems allow for real-time reconstruction, the geometry models they reconstruct are often not accurate enough for high-quality reconstruction of human actors.

Since scenes involving human actors are among the most difficult to reconstruct, research studies focused on free-viewpoint video of human actors can be found in the literature. A typical example is the work of Carranza et al. [15], recently updated by Theobalt et al. [60], where a triangle mesh representation is employed because it offers a closed and detailed surface representation. Since the model must be able to perform the same complex motion as its real-world counterpart, it is composed of multiple rigid-body parts that are linked by a kinematic chain. The joints between segments are parameterized to reflect the object's kinematic degrees of freedom. Besides object pose, the dimensions of the separate body parts must also be kept adaptable so as to be able to match the model to the object's individual stature.

A virtual reality modeling language (VRML) geometry model of a human body is used [Figure 4(a)]. In this example, the model consists of 16 rigid body segments: one each for the upper torso, lower torso, neck and head, and pairs of upper arms, lower arms, hands, upper legs, lower legs, and feet. In total, more than 21,000 triangles make up the human body model.

Figure 4: (a) Surface model and the underlying skeletal structure; spheres indicate joints. (b) Typical camera and light arrangement during recording.

A drawback of the geometry-based modeling approaches is that high cost and human assistance are typically required for content creation. Aiming at photorealism, 3-D scene and object modeling is often complex and time consuming, and it becomes even more complex when dynamically changing scenes are considered. Furthermore, automatic 3-D object and scene reconstruction implies an estimation of camera geometry, depth structure, and 3-D shape. All of these estimation processes are likely to generate errors in the geometric model, which then have an impact on the rendered images. Therefore, high-quality production of geometry models, e.g. for movies, is typically done with user assistance.

3.2 Image-based modeling

The other extreme in 3-D scene representation is called image-based modeling and does not use any 3-D geometry at all. The main advantage is a potentially high quality of virtual view synthesis, avoiding any 3-D scene reconstruction. However, this benefit comes at the price of dense sampling of the real world with a sufficiently large number of natural camera views. In general, the synthesis quality increases with the number of available views. Hence, typically a large number of cameras has to be set up to achieve high-performance rendering, and a tremendous amount of image data needs to be processed. Conversely, if the number of cameras used is too low, interpolation and occlusion artifacts will appear in the synthesized images, possibly affecting the quality.

Examples of image-based representations are ray space or light field [64, 5] and panoramic configurations including concentric and cylindrical mosaics [6]. The underlying idea in most of these methods is to capture the complete flow of light in a region of the environment. Such a flow is described by a plenoptic function. The plenoptic function was introduced by Adelson and Bergen [62] in order to describe the visual information available from any point in space. It is characterized by seven dimensions, namely the viewing position (Vx, Vy, Vz), the viewing direction (θ, φ) (or (x, y) in Cartesian coordinates), the time t and the wavelength λ: l(7)(Vx, Vy, Vz, θ, φ, λ, t).

Figure 5: The 7D plenoptic function

The image-based representation stage is in fact a sampling stage: samples are taken from the plenoptic function for representation and storage. The problem with the 7D plenoptic function is that it is so general that, due to the tremendous amount of data required, sampling the full function into one representation is not feasible. Research on image-based modeling is mostly about how to make reasonable assumptions that reduce the sample data size while keeping the rendering quality (image-based rendering techniques are connected with image-based representations and will be discussed in the next chapter). One major strategy to reduce the data size is restraining the viewing space of the viewers. There is a common set of assumptions that are made for restraining the viewing space. Some of them are preferable, as they do not impact much on the viewers' experience. Others are more restrictive and are used only when the storage size is a critical concern.

By ignoring the wavelength and time dimensions, McMillan and Bishop [65] introduced plenoptic modeling, which is a 5D function: l(5)(Vx, Vy, Vz, θ, φ). They record a static scene by positioning cameras in the 3D viewing space, each on a tripod capable of continuous panning. At each position, a cylindrically projected image is composed from the images captured during the panning. This forms a 5D image-based representation: 3D for the camera position, 2D for the cylindrical image. To render a novel view from the 5D representation, the nearby cylindrically projected images are warped to the viewing position based on their epipolar relationship and some visibility tests.

The most well-known image-based representations are the light field [5] and the Lumigraph [7] (4D). They both ignore the wavelength and time dimensions and assume that radiance does not change along a line in free space. However, parameterizing the space of oriented lines is still a tricky problem. The solutions they came up with happened to be the same: light rays are recorded by their intersections with two planes. One of the planes is indexed with coordinate (u, v) and the other with coordinate (s, t), i.e., l(4)(s, t, u, v). Figure 6 shows an example where the two planes, namely the camera plane and the focal plane, are parallel; this is the most widely used setup. An example light ray is shown and indexed as (u0, v0, s0, t0). The two planes are then discretized so that a finite number of light rays is recorded. If all the discretized points on the focal plane are connected to one discretized point on the camera plane, we get an image (a 2D array of light rays). Therefore, the 4D representation is also a 2D image array, as is shown in Figure 7.

Figure 6: One parameterization of the light field.

Figure 7: A sample light field image array: fruit plate.

The difference between the light field and the Lumigraph is that the light field assumes no knowledge about the scene geometry. As a result, the number of sample images required by the light field for capturing a normal scene is huge. On the other hand, the Lumigraph reconstructs a rough geometry of the scene with an octree algorithm to facilitate the rendering (discussed in the next chapter) with a small number of images. For this reason the Lumigraph is sometimes classified as a hybrid rather than a pure image-based modeling technique. In this report it is classified among the pure image-based modeling techniques for readability reasons.

Beyond the assumptions made for the light field, concentric mosaics [6] further restrict both the cameras and the viewers to a plane, which reduces the dimension of the plenoptic function to three. In concentric mosaics, the scene is captured by mounting a camera at the end of a level beam and shooting images at regular intervals as the beam rotates, as shown in Figure 8. The light rays are then indexed by the camera position, or the beam rotation angle α, and the pixel location (u, v): l(3)(α, u, v). This parameterization is equivalent to having many slit cameras (a long and slender crevice called a slit is placed in front of the lens of the camera) rotating around a common center and taking images along the tangent direction. Each slit camera captures a manifold mosaic, inside which the pixels can be indexed by (α, u); thus the name concentric mosaics. During rendering, the viewer may move freely inside a rendering circle (Figure 8) with radius R·sin(FOV/2), where R is the camera path radius and FOV is the field of view of the cameras.

Figure 8: Concentric mosaic capturing

None of these methods makes any use of geometry, but they either have to cope with an enormous complexity in terms of data acquisition or they rely on simplifications that restrict the level of interactivity.

3.3 Hybrid Image-Based modeling techniques

In between the two extremes above, there are a number of methods that make use of both approaches.


The Lumigraph [7], though already mentioned in the previous section, in fact uses a representation similar to the light field but adds a rough 3-D model. This provides information on the depth structure of the scene and therefore allows the number of necessary natural camera views to be reduced.

Other representations do not use explicit 3-D models but depth or disparity maps. Such maps assign a depth value to each sample of an image. Together with the original two-dimensional (2-D) image the depth map builds a 3-D-like representation, sometimes called 2.5-D, [67]. This can be extended to layered depth images [31], where multiple color and depth values are stored in consecutively ordered depth layers. A different extension is to use multiview video plus depth, where multiple depth maps are assigned to the multiple color images [68], [12], whereas the ATTEST project proposal, [67], is based on the distribution of video-plus-depth data corresponding to a single, central viewing position.

Zitnick et al. [12] propose a quite sophisticated representation. First, guided by the fact that using a 3D impostor or proxy for the scene geometry can greatly improve the quality of the interpolated views, they generate and add per-pixel depth maps (multiple depth maps for the multiple views). However, even multiple depth maps still exhibit artifacts (at the rendering stage) when generating novel views. This is mainly because of the erroneous assumption at the stereo computation stage that each pixel has a unique disparity. This is not the case for pixels along the boundary of objects, which receive contributions from both the foreground and background colors. Zitnick et al. address this problem using a novel two-layer representation inspired by Layered Depth Images [31]. Matting information is computed within a neighbourhood of four pixels from all depth discontinuities. A depth discontinuity is defined as any disparity jump greater than λ (= 4) pixels. Within these neighborhoods, foreground and background colors along with opacities (alpha values) are computed using Bayesian matting [66]. The foreground information is combined to form the boundary layer, as shown in Figure 9. The main layer consists of the background information along with the rest of the image information located away from the depth discontinuities. Chuang et al.'s algorithm does not estimate depths, only colors and opacities. Depths are estimated by using alpha-weighted averages of nearby depths in the boundary and main layers. To prevent cracks from appearing during rendering, the boundary matte is dilated by one pixel toward the inside of the boundary region.

Figure 9: Two-layer representation: (a) discontinuities in the depth are found and a boundary strip is created around these; (b) a matting algorithm is used to pull the boundary and main layers Bi and Mi (the boundary layer is drawn with variable transparency to suggest partial opacity values).
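The layer-splitting step can be sketched as follows; the Bayesian matting that recovers per-pixel colors and opacities is omitted, and the threshold and strip width simply mirror the values quoted above (λ = 4, a four-pixel neighbourhood):

```python
# Sketch of the two-layer split: pixels near a depth discontinuity form the
# boundary strip, everything else goes to the main layer.
import numpy as np
from scipy.ndimage import binary_dilation

def split_layers(disparity, lam=4.0, strip_width=4):
    # Depth discontinuities: disparity jumps greater than lam pixels.
    dx = np.abs(np.diff(disparity, axis=1, prepend=disparity[:, :1])) > lam
    dy = np.abs(np.diff(disparity, axis=0, prepend=disparity[:1, :])) > lam
    discont = dx | dy
    # Grow a strip of `strip_width` pixels around every discontinuity.
    boundary = binary_dilation(discont, iterations=strip_width)
    main = ~boundary
    return boundary, main
```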


Figure 10 shows the results of applying the stereo reconstruction and two-layer matting process to a complete image frame. Note that only a small amount of information needs to be transmitted to account for the soft object boundaries, and that the boundary opacities and boundary/main layer colors are cleanly recovered.

Figure 10: Sample results from matting stage: (a) main color estimates; (b) main depth estimates; (c) boundary color estimates; (d) boundary depth estimates; (e) boundary alpha (opacity) estimates. For ease of printing the boundary images are negated, so that transparent empty pixels show up as white.

Closer to the geometry-based end of the spectrum, methods are reported that use view-dependent geometry and/or view-dependent texture [69]. Instead of explicit 3-D mesh models, point-based representations or 3-D video fragments can also be used [70] (ETH).

Waschbusch et al. [70, 71] (Appendix A), for example, propose a view-independent point-based representation of the depth information. As already mentioned in the second chapter, the ETH Institute in its research uses several so-called 3D video bricks that capture high-quality depth maps from their respective viewpoints using pairs of stereo cameras. The matching algorithm used for depth extraction is assisted by projectors illuminating the scene with binary structured light patterns. Texture and depth are acquired simultaneously. The depth maps are post-processed to optimize discontinuities, and the results from different viewpoints are unified into a view-independent, point-based data representation consisting of Gaussian ellipsoids. All reconstructed views, from the different video bricks, are merged into a common world reference frame. The authors claim that they achieve a convenient, scalable representation, since additional views can be added easily by back-projecting their image pixels. The model is in principle capable of providing a full 360° view if the scene has been acquired from enough viewpoints. Unlike image-based structures, it is possible to keep the amount of data low by removing redundant points from the geometry. Compared to mesh-based methods, points provide advantages in terms of scene complexity because they reduce the representation to the absolutely necessary data and do not carry any topological information, which is often difficult to acquire and maintain. As each point in the model has its own assigned color, they also do not have to deal with texturing issues. Moreover, a view-independent representation is very suitable for 3D video editing applications, since tasks like object selection or re-lighting can be achieved easily with standard point processing methods.

The point-based model consists of an irregular set of samples, where each sample corresponds to a point on a surface and describes its properties such as location and color. The samples can be considered a generalization of conventional 2D image pixels towards 3D video. If required, the samples can be extended with additional attributes like surface normals for re-lighting. To avoid artifacts in re-rendering, full surface coverage by the samples has to be ensured. Thus, samples cannot be represented by infinitesimal points, but need to be considered as small surface or volume elements. The authors chose an approach similar to that of Hofsetz et al. [72]. Every point is modeled by a three-dimensional Gaussian ellipsoid spanned by the vectors t1, t2 and t3 around its center p. This corresponds to a probabilistic model describing the positional uncertainty of each point by a trivariate normal distribution (see Appendix A).
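A sketch of such a point sample as a data structure; the attribute names are illustrative, and the covariance built from t1, t2, t3 is one plausible reading of the ellipsoid model rather than the authors' exact formulation:

```python
# Point sample with color and a Gaussian positional-uncertainty ellipsoid.
from dataclasses import dataclass
import numpy as np

@dataclass
class SurfacePoint:
    center: np.ndarray   # p, 3-vector
    axes: np.ndarray     # 3x3 matrix whose columns are t1, t2, t3
    color: np.ndarray    # RGB

    def covariance(self) -> np.ndarray:
        """Covariance of the trivariate normal N(p, T T^T) spanned by the axes."""
        return self.axes @ self.axes.T

    def density(self, x: np.ndarray) -> float:
        """Unnormalized Gaussian weight of the sample at a 3-D location x."""
        d = x - self.center
        return float(np.exp(-0.5 * d @ np.linalg.inv(self.covariance()) @ d))
```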

Outliers that the point model still contains are removed in a photo-consistency enforcement stage involving all the images acquired from the texture camera. Editing operations like compositing and spatio-temporal effects can be applied to the view-independent representation, which can directly benefit from a large variety of available point processing algorithms.


4 Rendering

In the rendering stage, new views of the scene are reconstructed. Depending on the functionality required, there is a spectrum of rendering techniques, as shown in Figure 11. The technologies differ from each other in the amount of scene/object geometry information being used.

At one end of the spectrum, there are very accurate geometric models of the scenes and objects, for instance, generated by animation techniques, but only a few images are required to generate the textures. Given the 3-D models, as described in chapter 3, novel views can be rendered using conventional graphic techniques. Moreover, interactive rendering with movable objects and light sources can be supported using advanced graphics hardware.

At the other extreme, light field [5] or Lumigraph [7] rendering relies on dense sampling (by capturing more images/videos) with no or very little geometry information, rendering without recovering exact 3-D models. An important advantage of the latter is its superior image quality compared with 3-D model building for complicated real-world scenes. Another important advantage is that it requires considerably less computational resources for rendering regardless of the scene complexity, because most of the quantities involved are precomputed or recorded.

Image-based representations can therefore be classified according to the geometry information used into three main categories: representations with explicit geometry, representations with implicit geometry, and representations with no geometry.

Conventional 3-D computer graphics models and more sophisticated representations [25]–[27] belong to the first category. Layer-based or object-based representations using depth maps [10], [11], [23] fall into the second. 3-D concentric mosaics [6], the five-dimensional (5-D) plenoptic modeling of McMillan and Bishop [65], the four-dimensional (4-D) ray space representation [64], light fields [5] and the Lumigraph [7] belong to the last category.

Figure 11: Spectrum of rendering representations


4.1 Model-based rendering (MBR) - rendering with explicit geometry

Model-based rendering (MBR) is a framework for generating new views via recovery of a 3-D model. MBR techniques have the advantage of handling the occlusion problem, as they make use of the 3-D models. However, registration errors when texture mapping onto the reconstructed 3-D model may blur the synthesized virtual images [4].

Model-based representation approaches were discussed in the third chapter. In many 3DTV scenarios, the object that is being recorded is known in advance, therefore such a priori knowledge can be used to bias the scene reconstruction outcome. Of course, a suitable model of the recorded object(s) must be available. A model also enables enforcing low-level as well as high-level constraints on the object's motion, from temporally coherent movement to anatomically consistent motion. Another advantage of model-based approaches is that the a priori model geometry can be highly detailed, which facilitates high-quality rendering results and circumvents rendering inaccuracies due to poorly resolved geometry. Furthermore, 3DTV imposes the demand that the resulting model must be able to produce convincing rendering results. The challenge therefore is how to automatically, robustly and visually consistently match a parameterized 3-D geometry model to recorded image content.

One model-based method that is suitable for synchronized multiview video footage consists of matching the human model discussed in Section 3.1 to object silhouettes based on an analysis-by-synthesis approach [15], as shown in Figure 12. The object's silhouettes, as seen from the different camera viewpoints, are used to match the model to the recorded video images: the foreground in all video images is segmented and binarized. At the same time, the 3-D object model is rendered from all camera viewpoints using conventional graphics hardware, after which the rendered images are thresholded to yield binary masks of the model's silhouettes. Then, the rendered model silhouettes are compared to the corresponding image silhouettes: as comparison measure, or matching score, the number of silhouette pixels that do not overlap when putting the rendered silhouette on top of the recorded silhouette is used. Conveniently, the logical exclusive-or (XOR) operation between the rendered image and the recorded image yields those silhouette pixels that are not overlapping. By summing over the non-overlapping pixels for all images, the matching score is obtained. This matching score can be evaluated very efficiently on contemporary graphics hardware. To adapt the model parameter values such that the matching score becomes minimal, a standard numerical nonlinear optimization algorithm runs on the CPU. For each new set of model parameter values, the optimization routine invokes the matching-score evaluation routine on the graphics card, which can be evaluated many hundred times per second. After convergence, the object texture can additionally be exploited for pose refinement.
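A minimal CPU-only sketch of this loop; the actual system evaluates the score on graphics hardware, and render_silhouettes here is a hypothetical stand-in for the hardware-accelerated rendering of the body model:

```python
# Silhouette XOR matching score and its numerical minimization (sketch).
import numpy as np
from scipy.optimize import minimize

def matching_score(params, recorded_masks, render_silhouettes):
    rendered = render_silhouettes(params)      # binary model masks, one per camera
    score = 0
    for rec, ren in zip(recorded_masks, rendered):
        # XOR counts the silhouette pixels that do not overlap.
        score += np.count_nonzero(np.logical_xor(rec, ren))
    return score

def fit_pose(initial_params, recorded_masks, render_silhouettes):
    res = minimize(matching_score, initial_params,
                   args=(recorded_masks, render_silhouettes),
                   method="Powell")            # derivative-free optimizer
    return res.x
```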


One advantage of model-based analysis is the low-dimensional parameter space when compared to general reconstruction methods (Figure 13): The parameterized 3-D model may provide only a few dozen degrees of freedom that need to be determined, which greatly reduces the number of potential local minima. Many high-level constraints are already implicitly incorporated into the model, such as kinematic capabilities. Additional constraints can be easily enforced by making sure that all parameter values stay within their anatomically plausible range during optimization. Finally, temporal coherence is straightforwardly maintained by allowing only some maximal rate of change in parameter value from one time step to the next.

Figure 12: Analysis-by-synthesis: To match the geometry model to the multiview video footage, the foreground object is segmented and binarized, and the 3-D model is rendered from all camera viewpoints. The boolean XOR operation is executed between the reference images and the corresponding model renderings. The number of non-overlapping pixels serves as matching score. Via numerical optimization, model parameter values are varied until the matching score is minimal.

Figure 13: From eight video cameras spaced all around the scene, model-based method can capture the complex motion of a dancer.

After model-based motion capture, a high-quality 3-D geometry model is available that closely, but not exactly, matches the dynamic object in the scene. For photorealistic rendering results, the original video footage must be applied as texture to the model. By making efficient use of the multiview video footage, time-varying cloth folds and creases, shadows, and facial expressions can be faithfully reproduced to lend a very natural, dynamic appearance to the rendered object. Projective texture mapping is a well-known technique to apply images as texture to triangle-mesh models. To achieve optimal rendering quality, however, it is necessary to process the video textures offline prior to real-time rendering [15]: local visibility must be considered correctly to avoid any rendering artifacts due to the inevitable small differences between the model geometry and the true 3-D object surface. Also, the video images, which are taken from different viewpoints, must be blended appropriately to achieve the impression of one consistent object surface texture.

Because the model geometry is not exact, the reference image silhouettes do not correspond exactly to the rendered model silhouettes. When projecting the reference images onto the model, texture belonging to some frontal body segment potentially leaks onto other segments farther back [Figure 14(a)]. To avoid such artifacts, each reference view's penumbral region must be excluded during texturing. To determine the penumbral region of a camera, vertices of zero visibility are determined not only from the camera's actual position but also from a few slightly displaced virtual camera positions [Figure 14(b)]. For each reference view, each vertex is checked as to whether it is visible from all camera positions, actual as well as virtual. A triangle is projectively textured using a reference image only if all three of its vertices are completely visible from that camera.


Figure 14: Penumbral region determination: (a) Small differences between object silhouette and model outline can cause texture of frontal model segments to leak onto segments farther back. (b) By projecting each reference image onto the model also from slightly displaced camera positions, regions of dubious visibility are determined. These are excluded from texturing by the respective reference image.

Most surface areas of the model are seen from more than one camera. If the model geometry corresponded exactly to that of the recorded object, all camera views could be weighted according to their proximity to the desired viewing direction and blended without loss of detail. However, the model geometry has been adapted to the recorded person by optimizing only a comparatively small number of free parameters. The model is also composed of rigid body elements, which is clearly an approximation whose validity varies, e.g., with the person's apparel. In summary, the available model surface can be expected to locally deviate from the true object geometry. Accordingly, projectively texturing the model by simply blending multiple reference images causes blurred rendering results, and the model texture varies discontinuously when the viewpoint is moving. Instead, by taking into account triangle orientation with respect to camera direction, high-quality rendering results can still be obtained for predominantly diffuse surfaces [15]. After uploading the 3-D model mesh and the video cameras' projection matrices to the graphics card, the animated model is ready to be interactively rendered. During rendering, the multiview imagery, predetermined model pose parameter values, visibility information, and blending coefficients must be continuously uploaded, while the view-dependent texture weights are computed on the fly on the GPU. Real-time rendering frame rates are easily achieved, and views of the object from arbitrary perspectives are possible, as well as freeze-and-rotate shots, fly-around sequences, close-ups, slow motion, fast forward, or reverse play (Figure 15).

Figure 15: The user can freely move around the dynamic object at real-time rendering frame rates.

4.2 Image-based rendering (IBR) - rendering with no geometry

Image-based rendering (IBR) has been a very active research topic in recent years. By capturing a set of images or light rays in space (see chapter 3, image-based modeling), the goal of IBR is to reproduce the scene correctly at an arbitrary viewpoint, with unknown geometry. There is no need for complex 3-D geometric models to achieve realism, as the realism is in the images themselves. Representative techniques for rendering with unknown scene geometry rely on the characterization of the plenoptic function.

Image-based rendering therefore becomes the problem of constructing a continuous representation of the plenoptic function from observed discrete samples (complete or incomplete). How to sample the plenoptic function and how to reconstruct a continuous function from discrete samples are important
research topics. For example, the samples used in [65] are cylindrical panoramas. Disparity of each pixel in stereo pairs of cylindrical panoramas is computed and used for generating new plenoptic function samples.

For the light field and Lumigraph systems, mentioned in the previous chapter, methods for rendering without geometry have been proposed [5, 7]. To create a new view of the object, the view is split into its light rays, which are then computed by quadrilinearly interpolating existing nearby light rays in the image array. For example, the light ray (u0,v0,s0,t0) in Fig. 6 is interpolated from the 16 light rays connecting the solid discrete points on the two planes. The new view is then generated by reassembling the split rays. Such rendering can be done in real time and is independent of the scene complexity. The Lumigraph also has the advantage that it can be constructed from a set of images taken from arbitrarily placed viewpoints; a re-binning process is therefore required. Geometric information is used to guide the choice of the basis functions, and because of this geometric information the sampling density can be reduced.
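A minimal sketch of the quadrilinear interpolation step is given below, assuming the light field is stored as a regular 5-D array `L[u, v, s, t, channel]` on the two-plane parameterization (the array name and layout are illustrative, not taken from [5, 7]):

```python
import numpy as np

def sample_light_field(L, u, v, s, t):
    """Quadrilinearly interpolate one light ray from a discretised light field.

    L is assumed to be a 5-D array indexed as L[u, v, s, t, channel], i.e. a
    regular grid of sampled rays on the two parameterisation planes.  The
    colour of the continuous ray (u, v, s, t) is blended from the 16
    surrounding discrete rays, one weight per corner of the 4-D cell.
    """
    coords = []
    for x, n in zip((u, v, s, t), L.shape[:4]):
        x0 = int(np.clip(np.floor(x), 0, n - 2))
        coords.append((x0, x - x0))               # lower grid index and fraction
    color = np.zeros(L.shape[-1])
    for du in (0, 1):
        for dv in (0, 1):
            for ds in (0, 1):
                for dt in (0, 1):
                    w = 1.0
                    idx = []
                    for (x0, f), d in zip(coords, (du, dv, ds, dt)):
                        w *= f if d else (1.0 - f)
                        idx.append(x0 + d)
                    color += w * L[tuple(idx)]
    return color
```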

Obviously, the more constraints are placed on the camera location (Vx, Vy, Vz), the simpler the plenoptic function becomes. In Concentric Mosaics [6], camera motion is constrained to concentric circles on a plane, and the mosaics are created by compositing slit images taken at different locations on each circle. Concentric mosaics naturally index all input image rays by three parameters: radius, rotation angle and vertical elevation. Rendering is slit-based: the novel view is split into vertical slits, and for each slit the neighboring slits in the captured images are located and used for interpolation; the rendered view is then reassembled from these interpolated slits. Although vertical distortions exist in the rendered images, they can be alleviated by depth correction. Concentric mosaics do not require the difficult modeling process of recovering geometric and photometric scene models, yet they provide a much richer user experience by allowing the user to move freely in a circular region and observe significant parallax and lighting changes.
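The slit-based lookup can be sketched roughly as below, assuming the mosaics are stored as an array indexed by circle, rotation angle and vertical slit. The depth correction mentioned above is omitted, so this is only an illustration of the indexing, not the full method of [6]:

```python
import numpy as np

def render_slit(mosaics, radii, target_radius, target_angle):
    """Fetch one vertical slit of a novel view from concentric mosaics.

    `mosaics` is assumed to have shape (n_circles, n_angles, H, 3): for every
    circle radius and every rotation angle, one captured vertical slit.  A
    novel ray is approximated by picking the nearest circle and linearly
    interpolating the two captured slits whose angles bracket the requested
    one (vertical distortion / depth correction is omitted here).
    """
    n_angles = mosaics.shape[1]
    c = int(np.argmin(np.abs(np.asarray(radii) - target_radius)))
    a = (target_angle % (2 * np.pi)) / (2 * np.pi) * n_angles
    a0 = int(np.floor(a)) % n_angles
    a1 = (a0 + 1) % n_angles
    f = a - np.floor(a)
    return (1.0 - f) * mosaics[c, a0] + f * mosaics[c, a1]
```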

4.3 Rendering with implicit geometry

When 3-D models of the objects and scene are unavailable, user interaction is limited to a change of viewpoint. In contrast, more user interaction becomes feasible when IBR concepts are combined with associated 3-D models: if approximate geometry of the objects in a scene can be recovered, then interactive editing of real scenes is in principle feasible. Rendering with implicit geometry is therefore an active research area. A classical approach to generating synthetic views is image interpolation, introduced in [73] and also
adopted in the PANORAMA system [74]. The disadvantage of this method is that, in principle, it can only produce images that are intermediate views between two original images (i.e., the virtual camera lies on the baseline between the two real cameras). From two input images, Seitz and Dyer's view morphing technique [75] reconstructs any viewpoint on the line linking the two optical centers of the original cameras. Intermediate views are exactly linear combinations of the two views only if the camera motion associated with the intermediate views is perpendicular to the camera viewing direction. If the two input images are not parallel, a pre-warp stage can be employed to rectify the two input images so that corresponding scan lines are parallel, and a post-warp stage can then be used to unrectify the intermediate images.
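A minimal sketch of the interpolation step of view morphing is given below; the pre-warp and post-warp homographies are assumed to have been applied already, so only the linear combination of corresponding positions and colors in the rectified images is shown (names are illustrative):

```python
import numpy as np

def morph(pts0, colors0, pts1, colors1, alpha):
    """Linear interpolation step of view morphing for a rectified image pair.

    pts0, pts1     : (N,2) corresponding pixel positions in the two pre-warped
                     (rectified) images
    colors0/colors1: (N,3) colours of those correspondences
    alpha in [0,1] selects the virtual camera on the baseline.  For parallel
    views the interpolated positions are exactly the projections seen from the
    in-between viewpoint; a post-warp homography (not shown) maps the result
    back to the desired image plane.
    """
    pts = (1.0 - alpha) * np.asarray(pts0, float) + alpha * np.asarray(pts1, float)
    col = (1.0 - alpha) * np.asarray(colors0, float) + alpha * np.asarray(colors1, float)
    return pts, col
```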

The point where implicit geometry becomes explicit is not very clear. In this report systems that use geometry information in the form of disparity maps are regarded as systems with implicit geometry, therefore rendering techniques using depth information are included in this chapter.

When depth information is available for every point in one or more images, 3D warping techniques can be used to render nearby viewpoints. An image can be rendered from any nearby point of view by projecting the pixels of the original image to their proper 3D locations and re-projecting them onto the new picture. The most significant problem in 3D warping is how to deal with the holes generated in the warped image. Holes are due to the difference in sampling resolution between the input and output images, and to disocclusions, where part of the scene is seen in the output image but not in the input images. To fill in holes, the most commonly used method is to splat a pixel in the input image onto several pixels in the output image.
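The following sketch illustrates forward 3D warping with simple splat-based hole filling, assuming pinhole cameras with the convention X_cam = R X + t and source and target images of equal size; it is an illustration of the principle rather than any particular published implementation:

```python
import numpy as np

def warp_3d(color, depth, K, R, t, K_new, R_new, t_new, splat=1):
    """Forward 3-D warping of an image with per-pixel depth (minimal sketch).

    Pixels of the source view are back-projected with their depth, expressed
    in world coordinates, and re-projected into the target camera.  Each
    source pixel is splatted onto a (2*splat+1)^2 neighbourhood in the target
    image, which fills most resampling holes; disocclusion holes remain as
    untouched (zero) pixels.  Pose convention assumed: X_cam = R @ X + t.
    """
    h, w = depth.shape
    out = np.zeros_like(color)
    zbuf = np.full((h, w), np.inf)
    Kinv = np.linalg.inv(K)
    for v in range(h):
        for u in range(w):
            z = depth[v, u]
            if z <= 0:
                continue
            X_cam = Kinv @ np.array([u, v, 1.0]) * z          # source camera coords
            X = R.T @ (X_cam - t)                             # world coordinates
            x = K_new @ (R_new @ X + t_new)                   # target projection
            if x[2] <= 0:
                continue
            un, vn = int(round(x[0] / x[2])), int(round(x[1] / x[2]))
            for dv in range(-splat, splat + 1):
                for du in range(-splat, splat + 1):
                    uu, vv = un + du, vn + dv
                    if 0 <= uu < w and 0 <= vv < h and x[2] < zbuf[vv, uu]:
                        zbuf[vv, uu] = x[2]
                        out[vv, uu] = color[v, u]
    return out
```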

To deal with the disocclusion artifacts in 3D warping, Shade et al. proposed Layered Depth Image, or LDI [31], to store not only what is visible in the input image, but also what is behind the visible surface. In LDI, each pixel in the input image contains a list of depth and color values where the ray from the pixel intersects with the environment. The LDI approach was adopted by Zitnick et al., [12]. In their case the cameras are arranged along a one-dimensional (1-D) arc. During rendering, the two reference views nearest to the novel view are chosen, warped, and combined for view synthesis.
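Conceptually, an LDI pixel is just a depth-ordered list of the surface samples encountered along the pixel's ray; a minimal sketch of such a structure (names are illustrative, not the data layout of [31]) could look like this:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LDIPixel:
    """One pixel of a Layered Depth Image: every surface the pixel ray intersects."""
    depths: List[float] = field(default_factory=list)
    colors: List[Tuple[int, int, int]] = field(default_factory=list)  # (r, g, b) per layer

    def insert(self, depth: float, color: Tuple[int, int, int]) -> None:
        """Keep the layers sorted front-to-back so warping can stop early."""
        i = 0
        while i < len(self.depths) and self.depths[i] < depth:
            i += 1
        self.depths.insert(i, depth)
        self.colors.insert(i, color)
```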

In chapter three the work of Waschbusch et al., using a view-independent point-based data representation, was discussed (see also Appendix A). In the rendering stage they adopt an interesting approach described earlier by Broadhurst et al. [76]. Broadhurst uses probabilistic volume ray casting to generate smooth images. Each ray is intersected with the Gaussians of the scene model. At a specific intersection point x with sample i, the evaluation N(x; p_i, V_i) of the Gaussian describes the probability that the ray hits the corresponding surface point. To compute the final pixel color, two
different approaches are described. The maximum likelihood method associates a color with the ray using only the sample which has the most probable intersection. The second approach employs the Bayes rule: It integrates all colors along each ray weighted by the probabilities without considering occlusions. Thus, the color of a ray R is computed as:

c_R = \frac{\sum_i c_i \int_{x \in R} N(x; p_i, V_i)\, dx}{\sum_i \int_{x \in R} N(x; p_i, V_i)\, dx}

The maximum likelihood method generates crisp images, but it also sharply renders noise in the geometry. The Bayesian approach produces very smooth images with less noise, but is incapable of handling occlusions and rendering solid surfaces in an opaque way. The rendering method that the authors propose combines both approaches in order to benefit from their respective advantages. The idea is to accumulate the colors along each ray as in the Bayesian setting, but to stop as soon as a maximum accumulated probability has been reached. Reasonably, a Gaussian sample should be completely opaque if the ray passes its center. The line integral through the center of a three-dimensional Gaussian has a value of 1/2π, and for any ray R it holds that:

\int_{x \in R} N(x; p, V)\, dx \le \frac{1}{2\pi}

Thus, they accumulate the solution of the integrals of the above equation by traversing along the ray from the camera into the scene and stop as soon as the denominator of the equation reaches 1/2π. Assuming that solid surfaces are densely sampled, the probabilities within the surface boundaries will be high enough so that the rays will stop within the front surface.
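Assuming the per-sample line integrals along a ray have already been computed and sorted front to back, the accumulate-and-stop rule described above can be sketched as follows (variable names are illustrative):

```python
import numpy as np

def shade_ray(samples, stop_threshold=1.0 / (2.0 * np.pi)):
    """Accumulate colours along one ray and stop at a maximum opacity.

    `samples` is assumed to be a front-to-back sorted list of
    (line_integral, color) pairs, where line_integral is the integral of the
    Gaussian N(x; p_i, V_i) along the ray for point sample i.  Colours are
    blended as in the Bayesian renderer, but traversal stops as soon as the
    accumulated probability mass reaches 1/(2*pi), the value of a line
    integral through a Gaussian's centre, which keeps front surfaces opaque.
    """
    num = np.zeros(3)
    den = 0.0
    for weight, color in samples:
        num += weight * np.asarray(color, float)
        den += weight
        if den >= stop_threshold:
            break
    return num / den if den > 0 else num
```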

The results of comparing this approach with maximum likelihood and Bayesian rendering on noisy data are shown in Figure 16. The large distortions in the maximum likelihood image are smoothed out by the other two methods. However, the Bayesian renderer blends all the points, including those from occluded surfaces, while the current method renders opaque surfaces and maintains the blending. Thus, the proposed renderer provides the advantages of both previous methods.

Figure 16: Comparison of maximum likelihood (left) and Bayesian rendering (center) with Waschbusch et al.'s approach (right).


5 Coding

Having defined the data, efficient compression and coding is the next block in the 3D video processing chain, and that is the scope of this section. There are many different data compression techniques corresponding to the different data representations [3]; for example, there are different techniques for 3D meshes, depth data, multiple view video, etc. However, their level of maturity varies widely and is strongly related to the age and (commercial) usage of the corresponding data representation.

One class of data concerns the compression of any kind of pixel data, such as video, stereo video and multi-view video, but also associated per-pixel depth data. This wide field is in part well established, but in part still very innovative.

Compression of classical 2D video, for instance, has been studied intensely for decades by a very large number of researchers and institutions. As a result, the latest generation of video codecs, such as the H.264/AVC standard, provides excellent performance. Scalability features are being added to H.264/AVC in the current SVC activity in MPEG. Nevertheless, there is still room for improvement in basic 2D video coding, for example through better pre-analysis and exploitation of semantics, as well as wavelet approaches.

Similar conclusions can be drawn for stereo video, which can be regarded as a first-order extension. Its commercial usage is not as large as that of 2D video, but the technology is quite mature. However, segmentation and object-based representation play a more important role for stereo video, and these fields still present major algorithmic challenges.

The extension to N views, called multi-view coding (MVC), is relatively young; however, it currently receives a great deal of attention. MPEG issued a related "Call for Proposals" that was evaluated in January 2005, and this will lead to a new dedicated standard for MVC. MVC is a basic component for certain 3DTV and free viewpoint video systems.

The nature of depth and disparity data is similar to 2D video (i.e., a temporal succession of matrices of integers). Compression of such data has also been studied to some extent. Available standards such as MPEG-4 already allow for compression and transmission of such data. However, in this area too there is still room for improvement, using specific algorithms that better exploit the (e.g., statistical) nature of depth data. The concept of layered depth images (LDI) can be regarded as a natural extension to N views with depth of the same scene. This type of data representation is relatively young
but highly interesting for certain 3DTV applications, and there is also a strong relation to MVC. Principles from depth compression can be extended to LDI, but further improvement can be expected by exploiting, e.g., inter-view dependencies, as is done in the case of MVC.

A light-field representation is also a relatively young data type, which stores images of a scene from different viewing angles. So far mainly static light fields have been investigated. Dedicated compression algorithms have been presented in some pioneering work, and in principle there is a strong relation to MVC; for instance, dynamic light-field compression is handled in MPEG as a specific case of MVC. The practical relevance of very dense dynamic light fields is still questionable. Nevertheless, significant improvements of compression performance using dedicated algorithms can be expected.

3D meshes are widely used in computer graphics. Compression of such data has therefore also been widely studied. However, further improvements are possible especially for progressive and dynamic (i.e. time varying) meshes. For the latter, there is a related activity in the SNHC group of MPEG. Dynamic meshes have not received much interest in the past. Significant improvements can be expected by incorporating basic principles from video coding.

A point cloud representation is an alternative to classical 3D meshes. Such a representation might be very interesting for certain 3D video applications. Pioneering work on compression and streaming has been presented, but there seems to be a lot of room for improvement.

Multiple description coding and channel adaptation also currently receive significant attention. It has been shown that improvements are possible for specific application fields if some of the basic coding paradigms of standard video coding are abandoned. This research direction should be pursued further with a specific focus on 3D video data.

As for any type of media, security and rights management is an important issue for 3D video as well. Some research has been done for classical 3D models; however, a lot remains to be done, and for other data types this is still an open field.

In general conclusion, we may state that the very diverse research area of 3D video compression is highly active and relevant at the moment. Market relevance and the interest of manufacturers, content providers and users in 3D video systems are growing rapidly. However, there are still important challenges that need to be resolved. One of the goals of the European Community funded 3DTV project is to integrate the European research efforts in 3D video compression to ensure a strong European participation in this highly relevant future market.


6 Transporting 3D Video

Determination of the best techniques for transporting 3DTV data over communication networks in real-time requires a thorough investigation of several classical communication techniques together with their adaptation to the unique requirements of this new application. Experiences gained in the early implementations of 3DTV systems, as discussed in the previous sections, are extremely important in reaching a clear understanding of 3DTV transport issues, and therefore must be carefully studied, [3].

It is logical to expect that the transport infrastructure for any new communication application will be based on packet network technology and employ the Internet Protocol (IP) suite. The IP architecture is proving to be flexible and successful in accommodating a wide array of communication applications, as can be seen from the ongoing replacement of classical telephone services by voice-over-IP applications. Transport of TV signals over IP packet networks is a natural extension of such applications: video-on-demand services, both for news and for entertainment, are already being offered over the Internet, and 2.5G and 3G mobile network operators have started to use IP successfully to offer wireless video services. A 3DTV transport system is therefore also envisioned to be based on packet network technology and IP. Systems for streaming 3D video over the Internet can be built on the vast experience obtained in 2D applications. However, 3D video can have a much larger bandwidth demand and very specific dependency structures in the transmitted data. The modalities used for 3DTV have significant effects on streaming system implementations. These modalities, particularly when viewed from the transmission aspect, can be arranged along a linear spectrum. At the leftmost end of this spectrum are the techniques for completely synthetic video generation, that is, techniques based on computer graphics. Towards the middle of the spectrum are techniques that mix graphics with real images, such as those that use depth information together with image data for 3D scene generation. Purely image-based rendering techniques, such as light fields, are located at the rightmost end of the spectrum. Moving along this spectrum of modalities, the transmission issues vary a great deal. For example, graphics techniques do not require a very large transmission bandwidth, but their loss tolerance may be extremely low; purely image-based techniques are much more loss tolerant, but their bandwidth demand is much larger.

The large bandwidth demand of the image-based techniques makes the use of efficient compression a vital necessity. As discussed in the previous section, several effective compression techniques for multi-view video have been developed, and this continues to be an active research area. From the transmission viewpoint, two important consequences of compression are reduced loss resilience and data dependency. As the redundancy in the data is removed, so is the inherent loss resilience. Moreover, a significant part of the compression gain in multi-view video coding is obtained through inter-view prediction, which creates dependencies between the views. Nevertheless, the techniques for handling 2D compressed video transport over lossy networks are well developed, and similar approaches are applicable to 3DTV transport. These include the use of application layer framing and layered coding with unequal error protection. Techniques for the concealment of packet loss effects become very important as well. Loss concealment in 3D cannot be accomplished by a straightforward extension of the techniques used for 2D video; new approaches have been one of the active research areas.

Another aspect of 3DTV video that does not exist in its 2D counterpart is the dependency of the displayed video on the viewpoint of the viewer. The video must be adjusted when the viewer moves around, changing his or her viewpoint of the display; otherwise, the displayed scene will look quite unrealistic. For image-based techniques in particular, this requires transmission of a multitude of views to the end points, multiplying the bandwidth requirements many times over. Efficient networking techniques for multi-view video delivery over multicast networks are therefore an active research area [78].

Finally, cross-layer approaches, where several layers of the communication architecture, from application to physical, are considered together and jointly optimized, have recently been shown to be very successful in 2D applications. Their extension to 3D looks very promising. This approach is particularly important in wireless applications, which may be one of the leading applications of 3DTV, because wireless operators tend to feature new applications much earlier than their wired counterparts.



7 3-D Display

The display is the last, but definitely not least, significant aspect in the development of 3D vision. As has already been outlined, there is a long chain of activity from image acquisition, compression, transmission and reconstruction of 3D images before we get to the display itself. However, the display is the most visible aspect of 3DTV and is probably the one by which the general public will judge its success. The concept of a three-dimensional display has a long and varied history, stretching back to the 3D stereo-photographs made in the late 19th century, through 3D movies in the 1950s and holography in the 1960s and 70s, to the 3D computer graphics and virtual reality of today.

The need for 3D displays and vision grows in importance by the day, as does the number of applications, such as scientific visualization and measurement, medical imaging, telepresence, gaming, as well as movies and television itself. Many different 3D display methods have been presented over the last few decades, but none has been able to capture the mass market. Much of the development in 3D imaging and displays in the latter part of the 20th century was spurred on by the invention of holography, which was certainly the catalyst for some of the significant advances in autostereoscopic and volumetric methods, whereas advances in virtual reality techniques have helped drive the computer and optics industries to produce better head-mounted displays and other 3D displays.

The main requirement of a 3D display is to create the illusion of depth or distance by using a series of depth cues such as disparity, motion parallax, and ocular accommodation [79]. Additional cues are also needed for image recognition. Conflicting cues are one of the leading causes of discomfort and fatigue when viewing 3D displays. The form that such displays would take is one aspect which needs considerable thought and is a major concern for consumer acceptance. Important aspects to be considered include image resolution, field of view, brightness, whether they are single- or multi-user, viewing distance and cost. The technologies being pursued for 3D display can be broadly divided into the following categories, as shown in Figure 17 (although various other classifications are used and the terminology is not always clear):
• Holographic displays
• Volumetric displays
• Autostereoscopic displays
• Head mounted displays (HMD)
• Stereoscopic displays



The term “autostereoscopic”, strictly speaking, describes all those displays which create a stereoscopic image without any special glasses or other user-mounted devices, and in this respect might be considered to include holographic, volumetric and multiple-image displays. However, we restrict the use of the term to displays such as binocular, multi-view and holoform systems, where only multiple two-dimensional images across the field of view are considered. Autostereo systems are limited in the number of viewers, and eye or head tracking is usually needed. In holographic displays the image is formed by wave-front reconstruction, and includes both real and virtual image reconstruction. Holography is at present handicapped by the vast amount of information which has to be recorded, stored, transmitted and displayed, putting severe constraints on the display technology employed. Furthermore, holography can be deployed in reduced-parallax (e.g. stereo-holographic or lenticular) systems, which relax some of these constraints. Volumetric displays form the image by projection within a volume of space without the use of light interference, but have limited resolution. Head-mounted displays, such as those using liquid-crystal-on-silicon (LCOS) devices or retinal scanning devices (RSD), are unlikely to find mass-market acceptance because of user discomfort similar to motion sickness and the public reluctance to wear such devices, but may find some well-defined niche markets. The more conventional stereo technologies all require the use of viewing aids such as red/green or polarizing glasses.

Clearly no display method is without its problems or limitations. The development paths which have to be followed before a full 3D display can be realised are very complex. Given the current state-of-the-art, non-holographic displays, such as volumetric or autostereo displays, are in a more advanced state of development and are more likely to reach the market place in a shorter time frame. A full, large-area, interactive, colour holographic display, which is thought by many to be the ideal goal, requires the parallel development of many essential areas of technology before it can be brought to fruition.

Figure 17: Classification of 3D display techniques.



8 Discussion and conclusions

This report studies and categorizes the literature on multiview imaging. It investigates all the processes involved in multiview imaging in relation to 3D TV. The final goal for 3DTV applications is the ability to synthesize several viewpoints as needed by multi-viewpoint displays, with an image quality equal to current 2D broadcasts.

The processes of calibration and stereo correspondence have already been studied for years and the results are quite promising; a multi-camera setup relying on automatic calibration, for example, is nowadays feasible. The issues that are still open, and on which our future work will focus, are data representation and rendering.

Current multi-camera 3D acquisition systems are limited either in quality or in functionality by the state-of-the-art in 3D image analysis and synthesis processing. For example, model-based systems in principle use a simplified scene with one or two objects (mostly humans) on a specifically textured floor or blue-screen background. On the other hand, systems that can capture full natural dynamic scenes in 3D do not rely on complex 3D image processing, but use massive numbers of cameras, effectively capturing all viewpoints (image-based systems).

Hybrid systems, combining characteristics from both model-based and image-based systems, seem more promising, but there are still many open issues. As far as data representation is concerned, image-based approaches are suitable for 3DTV applications since they can ensure high quality, but a rough 3D model of the scene is nevertheless necessary in order to reduce the number of natural camera views and to handle occlusions.

3D video, as one of the emerging technologies, is expected to provide the user with the possibility to change the viewpoint at will during playback. Free navigation in time and space through streams of visual data directly enhances the viewing experience and interactivity. Unfortunately, in most existing systems virtual viewpoint effects have to be planned precisely, and changes are no longer feasible after the scene has been shot.

Towards the direction of editing capabilities, the work of ETH is quite promising. Although the point-based data representation is quite suitable for 3D TV applications, thanks to its easy editing and the fact that it can represent the whole scene, it is still quite complex, and the current version of the model contains many redundancies. In a future 3D TV application the insertion, update and delete operators could be manipulated appropriately in order to decrease these redundancies. For example, in the case of a specific known viewpoint, selecting the cameras closest to the viewpoint and inserting their samples into the model, while ignoring or not updating the others, could minimize computations. Deleting points that vanish from the view of the input cameras is also important. Coherence in time should also be taken into account: the information from previous frames can be used for building the current frame, and the background and static parts of the images can benefit from this information even more. Therefore, introducing an adaptive update rate that is high for foreground/moving parts and low for static parts could enhance the model.

The representation can also directly benefit from the large variety of available point-based processing algorithms for further redundancy elimination, for example clustering methods [81] that split the point cloud into a number of subsets, each of which is replaced by one representative sample.

Apart from the complexity of the point-based model, there still remain some issues with mixed pixels, which may be solved using boundary matting techniques. The two-layer representation that Zitnick et al. used in [12] could eliminate this kind of artifact at the borders of silhouettes. A representation that treats the main layer (mostly the background) differently from the boundary layer (a layer around pixels with depth discontinuities, i.e., the area around silhouettes) is particularly valuable for 3D sport scenarios, for example. In a sport scenario or a similar application, large areas are covered by uniform color texture. Segmenting the scene into layers would give a main layer containing most of the data and a boundary layer containing fairly sparse data. Since the amount of information at the object boundaries is small but very valuable, either a finer representation or more precise compression could be devoted to it.

The overall conclusion after the evaluation of state-of-the-art methods is that the future of 3DTV has not one but multiple views. Many factors, such as advances in video acquisition technology, novel image analysis algorithms, and the pace of progress in computer graphics hardware, together drive the development of this new type of visual entertainment medium.



9 Appendix A

The work of Waschbusch et al., conducted at ETH Zurich, was investigated in detail and is therefore presented separately in this appendix. The method [71] is of special interest due to its compatibility with 3DTV applications: the whole scene is represented in the 3D video, which gives this method a great advantage over methods that represent only silhouettes.

The basic building blocks of the 3D video setup are movable bricks, each containing three cameras (one color and two grayscale) and a projector illuminating the scene with alternating patterns. Each brick concurrently acquires texture information with the color camera and depth information using the stereo pair of grayscale cameras. Stereo vision generally fails to reconstruct even simple geometry of uniformly colored objects; artificial texture is therefore added to the scene by projecting structured light patterns, so that correspondences between the different views can be found more easily. Alternating projections of structured light patterns and their inverses allow simultaneous acquisition of the scene textures using an appropriately synchronized texture camera.

Figure 18: The 3D video brick with cameras and projector (left), simultaneously acquiring textures (middle) and structured light patterns (right)

Each brick acquires the scene geometry using a depth-from-stereo algorithm. Depth maps are computed for the images of the left and right grayscale cameras by searching for corresponding pixels. Stereo matching is formulated as a maximization problem over an energy which defines a matching criterion between two pixel correlation windows. An adaptive correlation window is used that covers multiple time steps only in the static parts of the images: for moving parts, discontinuities in image space would extend into the temporal domain and make the correlation harder, so the correlation window is extended in the temporal dimension, to three or more images, only for static parts.
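A rough sketch of such an adaptive space-time correlation window is shown below; the static/moving decision is made here with a simple patch-difference threshold, which is only an illustrative stand-in for the actual criterion used in [71]:

```python
import numpy as np

def correlation_window(frames, t, v, u, half=3, n_temporal=1, static_thresh=4.0):
    """Build an adaptive space-time correlation window around pixel (v, u).

    `frames` is assumed to be a (T, H, W) grayscale sequence and (v, u) an
    interior pixel.  If the local patch changes little over neighbouring time
    steps, the window is extended temporally (static scene part); for moving
    parts only the current frame is used, since temporal discontinuities
    would otherwise corrupt the matching window.
    """
    patch = lambda k: frames[k, v - half:v + half + 1, u - half:u + half + 1].astype(float)
    current = patch(t)
    windows = [current]
    for dt in range(1, n_temporal + 1):
        for k in (t - dt, t + dt):
            if 0 <= k < frames.shape[0]:
                cand = patch(k)
                if np.mean(np.abs(cand - current)) < static_thresh:  # static enough
                    windows.append(cand)
    return np.stack(windows)
```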

A two-phase post-processing is applied to the disparity images. First, regions of wrong disparity are identified, and then new disparities are extrapolated into these regions from their neighborhoods.


To model the resulting three-dimensional scene, a view-independent, point-based data representation is used. All reconstructed views are merged into a common world reference frame. Three video bricks, and thus three different views, were used in the experiments of [71]. The model is in principle capable of providing a full 360° view if the scene has been acquired from enough viewpoints. It consists of a set of samples, where each sample corresponds to a point on a surface and describes its properties, such as location and color.

Every point is modeled by a three-dimensional Gaussian ellipsoid spanned by the vectors t_1, t_2, t_3 around its center p. This corresponds to a probabilistic model describing the positional uncertainty of each point by a trivariate normal distribution

p_x(x) = N(x; p, V) = \frac{1}{\sqrt{(2\pi)^3 |V|}}\, e^{-\frac{1}{2}(x-p)^T V^{-1}(x-p)}

with expectation value p and covariance matrix

V = (t_1\ t_2\ t_3)(t_1\ t_2\ t_3)^T

composed of the 3 x 1 column vectors t_i. Assuming a Gaussian model for each image pixel uncertainty, first the back-projection of the pixel into three-space is computed, which is a 2D Gaussian parallel to the image plane spanned by two vectors t_u and t_v. Extrusion into the third dimension by adding a vector t_z guarantees full surface coverage under all possible views. This is illustrated in Figure 19.

Figure 19: Construction of a 3D Gaussian ellipsoid.

Each pixel (u, v) is spanned by the orthogonal vectors σ_u (1, 0)^T and σ_v (0, 1)^T in the image plane. Assuming a positional deviation σ_c, the pixel width and height under uncertainty are σ_u = σ_v = 1 + σ_c, where σ_c is estimated as the average re-projection error of the calibration routine.



The depth of each pixel is inversely proportional to its disparity d, as defined by the equation

z = \frac{f_L\, \| c_L - c_R \|}{d - p_L + p_R}

where f_L is the focal length of the rectified camera, c_L and c_R are the centers of projection, and p_L and p_R are the u-coordinates of the principal points. The depth uncertainty σ_z is obtained by differentiating the above equation and augmenting the gradient Δd of the disparity with its uncertainty σ_c:

\sigma_z = \frac{f_L\, \| c_L - c_R \|}{(d - p_L + p_R)^2}\, (\Delta d + \sigma_c)
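The construction can be sketched as follows, following the formulas above; the sign convention in the disparity denominator is taken as given there and should be treated as an assumption, as should the helper names:

```python
import numpy as np

def depth_and_uncertainty(d, grad_d, f_L, baseline, p_L, p_R, sigma_c):
    """Depth from disparity and its uncertainty (signs follow the text above)."""
    denom = d - p_L + p_R
    z = f_L * baseline / denom
    sigma_z = f_L * baseline / denom ** 2 * (grad_d + sigma_c)
    return z, sigma_z

def gaussian_ellipsoid(t1, t2, t3):
    """Covariance of a point sample from its spanning vectors: V = T T^T."""
    T = np.column_stack((t1, t2, t3))
    return T @ T.T

def eval_gaussian(x, p, V):
    """Trivariate normal density N(x; p, V) used as the positional uncertainty."""
    diff = np.asarray(x, float) - np.asarray(p, float)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** 3 * np.linalg.det(V))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(V, diff))
```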

After back-projection the point model still contains outliers and falsely projected samples. Some points originating from a specific view may look wrong from extrapolated views due to reconstruction errors, especially at depth discontinuities. In the 3D model, they may cover correct points reconstructed from other views, disturbing the overall appearance of the 3D video. Thus, those points are removed by checking the whole model for photo consistency with all texture cameras.

After selecting a specific texture camera, each ellipsoid is successively projected into the camera image in increasing depth order, starting with the points closest to the camera. All pixels of the original image which are covered by the projection and not yet occluded by previously tested, valid ellipsoids are determined. The average color of those pixels is compared with the color of the ellipsoid. If the two colors differ too much, the point sample is removed; otherwise, the ellipsoid is rasterized into a z-buffer which is used for occlusion tests for all subsequent points.
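A simplified sketch of this photo-consistency test against one texture camera is given below; the point representation and the `project` footprint function are illustrative assumptions, not the data structures of [71]:

```python
import numpy as np

def remove_inconsistent_points(points, project, image, color_thresh=30.0):
    """Photo-consistency test against one texture camera (simplified sketch).

    `points` is assumed to be a list of dicts with 'center', 'color' and a
    precomputed camera-space 'depth'; `project(center)` is assumed to return
    the integer pixel footprint of the projected ellipsoid as (rows, cols).
    Points are processed front-to-back; pixels already claimed by accepted
    points count as occluded via a z-buffer, and a point whose colour deviates
    too much from the unoccluded image pixels is discarded.
    """
    h, w = image.shape[:2]
    zbuf = np.full((h, w), np.inf)
    kept = []
    for pt in sorted(points, key=lambda q: q['depth']):
        rows, cols = project(pt['center'])
        free = [(r, c) for r, c in zip(rows, cols)
                if 0 <= r < h and 0 <= c < w and pt['depth'] < zbuf[r, c]]
        if not free:
            continue
        avg = np.mean([image[r, c] for r, c in free], axis=0)
        if np.linalg.norm(avg - np.asarray(pt['color'], float)) <= color_thresh:
            kept.append(pt)
            for r, c in free:                    # rasterise into the z-buffer
                zbuf[r, c] = pt['depth']
    return kept
```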

As a result, enforcing photo consistency improves the fit of multiple acquired depth maps in the model. The reduction of artifacts can be clearly seen in Figure 20.

Figure 20: Enforcing photo consistency during view merging: without (left) and with (right) enforcement

At the rendering stage, smooth images are generated using the uncertainties of the Gaussian ellipsoids. The probabilistic rendering method used is a combination of two probabilistic image generation techniques, the Bayes rule, and the maximum likelihood method.

As already discussed in chapter 4, probabilistic ray casting is used: each ray is intersected with the Gaussians of the scene model. At a specific intersection point x with sample i, the evaluation N(x; p_i, V_i) of the Gaussian describes the probability that the ray hits the corresponding surface point. To compute the final pixel color, a combination of the two methods is used. The idea is to accumulate the colors along each ray but stop as soon as a maximum accumulated probability has been reached. A Gaussian sample should be completely opaque if the ray passes its center. The line integral through the center of a three-dimensional Gaussian has a value of 1/2π, and for any ray R it holds that:

\int_{x \in R} N(x; p, V)\, dx \le \frac{1}{2\pi}

Thus they accumulate the solution of the integrals of the above equation by traversing along the ray from the camera into the scene and stop as soon as the denominator of the equation reaches 1/2π.

A representative result of the method is shown in Figure 21, where novel views of the acquired scene are rendered with the reconstructed 3D model. A sample video can be found in [80].

Figure 21: re-renderings of the 3D video from novel viewpoints.

The point-based data representation provides possibilities for novel effects and 3D video editing, such as actor cloning or motion trails (Figure 22). The actor is cloned by copying the corresponding point cloud to other places in the scene.

Figure 22: Actor cloning effect

The point-based data representation model is in general suitable for 3DTV applications but still contains many redundancies. As discussed in the main body of this report, there are possible ways to eliminate these redundancies and improve the model.



10 References

[1] A.Kubota, A. Smolic, M. Magnor, M. Tanimoto, Ts. Chen, Ch. Zhang, Multiview Imaging and 3DTV, Signal Processing Magazine IEEE, v.24, issue(6), 2007

[2] C. L. Zitnick, S. B. Kang, Stereo for Image-Based Rendering using Image Over-Segmentation, IJCV, v.75, n.1, 2007

[3] L. Onural, Th. Sikora, J. Ostermann, A. Smolic, M. R. Civanlar, J. Watson, An Assessment of 3DTV Technologies, 2006 NAB BEC Proceedings

[4] S.C. Chan, H.Y. Shum, K.T. Ng, Image-Based Rendering and Synthesis, Signal Processing Magazine IEEE, v.24, issue(6), 2007

[5] M. Levoy, P. Hanrahan, Light field rendering, in Proc. ACM SIGGRAPH, Aug. 1996, pp. 31–42.

[6] H.Y. Shum, L.W. He, Rendering with concentric mosaics, in Proc. ACM SIGGRAPH, Aug. 1999, pp. 299–306.

[7] S. J. Gortler, R. Grzeszczuk, R. Szeliski, M. F. Cohen, The Lumigraph, in Proc. ACM SIGGRAPH '96, Aug. 1996, pp. 43–54

[8] D. Wood, D. Azuma, W. Aldinger, B. Curless, T. Duchamp, D. Salesin, and W. Stuetzle, Surface Light Fields for 3D Photography, in Proc. SIGGRAPH 2000.

[9] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, R. Koch, Visual modeling with a hand-held camera, Int. J. Comput. Vis., vol. 59, no. 3, pp. 207–232, 2004.

[10] C. Zhang, T. Chen, A self-reconfigurable camera array, Eurograph. Symp. Rendering 2004, Norrkoping, Sweden, Jun. 2004.

[11] Wilburn, B., N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy: 2005, High performance imaging using large camera arrays, Proceedings of SIGGRAPH (ACM Transactions on Graphics) 24(3), 765–776.

[12] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, R. Szeliski, High-quality video view interpolation using a layered representation, Proceedings of SIGGRAPH (ACM Transactions on Graphics), pp. 600–608, 2004.

[13] Kanade, T., P. W. Rander, P. J. Narayanan: 1997, Virtualized Reality: Constructing virtual worlds from real scenes, IEEE MultiMedia Magazine 1(1), 34–47.

[14] Isaksen, A., L. McMillan, S. Gortler: 2000, Dynamically reparameterized light fields, Computer Graphics (SIGGRAPH) pp. 297–306.

[15] J. Carranza, C. Theobalt, M. Magnor, H.-P. Seidel, Free-viewpoint video of human actors, in Proc. ACM Conf. Comput. Graph. (SIGGRAPH '03), 2003, pp. 569–577.

[16] B. Wilburn, M. Smulski, H.-H. K. Lee, M. Horowitz, The light field video camera, in Proc. Media Processors 2002, SPIE Electronic Imaging, 2002.

[17] T. Kanade, H. Saito, S. Vedula, The 3D room: Digitizing time-varying 3D events by synchronized multiple video streams, Tech. Rep. CMU-RITR-98-34, 1998.

[18] T. Fujii, K. Mori, K. Takeda, K. Mase, M. Tanimoto, Y. Suenaga, Multipoint measuring system for video and sound: 100-camera and microphone system, IEEE 2006 Int. Conf. Multimedia & Expo, July 2006, pp. 437–440

[19] W.Matusik, H. Pfister, 3D TV: a scalable system for real-time acquisition, transmission, and autostereoscopic display of dynamic scenes, International Conference on Computer Graphics and Interactive Techniques archive ACM SIGGRAPH 2004

[20] F.Isgro, Em. Trucco, P. Kauff, Ol. Schreer, 3-D Image Processing in the Future of Immersive Media, IEEE Transactions on Circuits and Systems for Video Technology, 2004, v.14(3)

[21] Z.Zhang, A Flexible New Technique for Camera Calibration, IEEE Transactions on Pattern Analysis and Machine Intelligence, v(22), issue(11), 2000

[22] O. Faugeras, Three-Dimensional Computer Vision: a Geometric Viewpoint, MIT Press, 1993.

[23] R. Y. Tsai, A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE Journal of Robotics and Automation, 3(4):323–344, Aug. 1987


[24] A.Fusiello, Uncalibrated Euclidean reconstruction: A review, Image and Vision Computing, 18(6-7):555-563,2000

[25] E. Trucco, A. Verri, Introductory Techniques for 3-D Computer Vision, Upper Saddle River, NJ: Prentice-Hall, 1998.

[26] O. Schreer, N. Brandenburg, S. Askar, P. Kauff, Hybrid recursive matching and segmentationbased postprocessing in real-time immersive video conferencing, in Proc. Conf. Vision, Modeling and Visualization, Stuttgart, Germany, Nov. 21–23, 2001.

[27] R. Zabih, J. Woodfill, Non-parametric local transforms for computing visual correspondence, in Proc. Eur. Conf. Computer Visions, vol. 2, Stockholm, Sweden, May 2–6, 1994, pp. 151–158.

[28] K. Muhlmann, D. Maier, J. Hesser, R. Manner, Calculating dense disparity maps from color stereo images, an efficient implementation, Int. J. Comput. Vis., vol. 47, no. 1/2/3, pp. 79–88, 2002.

[29] K. Konolige, The SRI Small Vision System, [Online]. Available: http://www.ai.sri.com~konolige/svs

[30] C. Fehn, P. Kauff, Interactive virtual view video (IVVV) – The bridge between 3D-TV and immersive TV, in Proc. 3D-TV, Video & Display, SPIE Int. Symp., Boston, MA, Aug. 2002.

[31] J. W. Shade, Layered depth images, in Proc. SIGGRAPH, Orlando, FL, 1998, pp. 231–242.

[32] J. Snyder, J. Lengyel, Visibility sorting and compositing without splitting for image layer decomposition, in Proc. SIGGRAPH, Orlando, FL, 1998, pp. 219–230.

[33] E. Trucco, F. Isgrò, F. Bracchi, Plane detection in disparity space, in Proc. IEE Int. Conf. Visual Information Engineering, 2003, pp. 73–76.

[34] VIRTUE Home, European Union's Information Societies Technology Programme, Project IST 1999–10 044, British Telecom. [Online]. Available: http://www.virtue.eu.com

[35] [online]. Available: http://www.ri.cmu.edu/labs/lab_62.html

[36] H. Fuchs, G. Bishop, K. Arthur, L. McMillan, R. Bajcsy, S. W. Lee, H. Farid, T. Kanade, Virtual space teleconferencing using a sea of cameras, in Proc. 1st Int. Symp. Medical Robotics and Computer Assisted Surgery, Pittsburgh, PA, 1994, pp. 161–167.

[37] J. K. Cheng, T. S. Huang, Image registration by matching relational structures, Pattern Recognit., vol. 17, no. 1, pp. 149–159, 1984.

[38] R. Horaud, T. Skordas, Stereo correspondence through feature grouping and maximal clique, IEEE Trans. Pattern Anal. Machine Intell., vol. 11, pp. 1168–1180, Nov. 1989.

[39] S. Ullman, The Interpretation of Visual Motion, Cambridge, MA: MIT Press, 1989.

[40] D. Tell, S. Carlsson, Combining appearance and topology for wide baseline matching, in Proc. Eur. Conf. Computer Vision, vol. I, 2002, pp. 68–81.

[41] M. Pilu, A direct method for stereo correspondence based on singular value decomposition, in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1997, pp. 261–266.

[42] A. Goshtasby, S. H. Gage, J. F. Bartholic, A two stage cross correlation approach to template matching, IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-6, pp. 374–378, Mar. 1984.

[43] C. H. Chou, Y. C. Chen, Moment-preserving pattern matching, Pattern Recognit., vol. 23, no. 5, pp. 461–474, 1990.

[44] M. Pilu, F. Isgrò, A fast and reliable planar registration method with applications to document stitching, in Proc. British Machine Vision Conf., Cardiff, U.K., Sept. 2–5, 2002, pp. 688–697.

[45] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, Int. J. Comput. Vis., vol. 47, no. 1-3, pp. 7–42, Apr. 2002.

[46] A. Fusiello, E. Trucco, A. Verri, Efficient stereo with multiple windowing, in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 1997, pp. 858–863.

[47] T. Kanade, M. Okutomi, A stereo matching algorithm with an adaptive window: Theory and experiments, IEEE Trans. Pattern Anal. Machine Intell., vol. 16, pp. 920–932, Sept. 1994.

[48] K. Lengwehasarit, A. Ortega, Probabilistic partial-distance fast matching algorithms for motion estimation, IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 139–152, Feb. 2001.



[49] M. Perez, F. Cabestaing, A comparison of hardware resources required by real-time stereo dense algorithms, in Proc. IEEE Int. Workshop Computer Architecture for Machine Perception, New Orleans, LA, May 12–14, 2003.

[50] J. Mulligan, V. Isler, K. Daniilidis, Trinocular stereo: A real-time algorithm and its evaluation, Int. J. Computer Vision, vol. 47, no. 1/2/3, pp. 51–61, 2002.

[51] M. Okutomi, T. Kanade, A multiple-baseline stereo, IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp. 353–363, Apr. 1993.

[52] S. B. Kang, R. Szeliski, J. Chai, Handling occlusions in dense multi-view stereo, in Proc. Int. Conf. Computer Vision and Pattern Recognition, vol. 1, Kuaui, HI, Dec. 8–14, 2001, pp. 103–110

[53] C. Schmid, R. Mohr, Local grayvalue invariants for image retrieval, IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 530–535, May 1997.

[54] A. Baumberg, Reliable feature matching across widely separated views, in Proc. IEEE Int. Conf. Comp. Vision and Pattern Recognition, vol. I, 2000, pp. 774–781.

[55] C. Tomasi, R. Manduchi, Stereo matching as nearest-neighbor problem, IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 333–340, Mar. 1998.

[56] S. Crossley, N. A. Thacker, N. L. Seed, Benchmarking of bootstrap temporal stereo using statistical and physical scene modeling, in Proc. British Machine Vision Conf., 1998, pp. 346–355.

[57] L. Matthies, M. Okutomi, Bootstrap algorithms for dynamic stereo vision, in Proc. 6th Multidimensional Signal Processing Workshop, 1989, pp. 12–22.

[58] M. O'Neil, M. Demos, Automated system for coarse to fine pyramidal area correlation stereo matching, Image Vis. Comput., vol. 14, pp. 225–136, 1996.

[59] F. Isgrò, E. Trucco, L. Q. Xu, Toward teleconferencing by view synthesis and large-baseline stereo, in Proc. IAPR Int. Conf. Image Analysis and Processing, Sept. 2001, pp. 198–203.

[60] C. Theobalt, N. Ahmed, G. Ziegler, H. Seidel, High-Quality Reconstruction from Multiview Video Streams, Dynamic representation of 3-D human actors, Signal Processing Magazine IEEE, v.24, issue(6), 2007



[61] T. Matsuyama, X. Wu, T. Takai, T. Wada, Real-Time Dynamic 3-D Object Shape Reconstruction and High-Fidelity Texture Mapping for 3-D Video, IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 3, March 2004.

[62] E. H. Adelson, J. R. Bergen, The plenoptic function and the elements of early vision, in Computational Models of Visual Processing, Landy and Movshon, Eds., MIT Press, Cambridge, Massachusetts, 1991, ch. 1.

[63] J. Berent, P. L. Dragotti, Plenoptic Manifolds: Exploiting structure and coherence in multiview images, IEEE Signal Processing Magazine, v.24, issue(6), 2007.

[64] T. Fujii, T. Kimoto, M. Tanimoto, Ray space coding for 3D visual communication, Picture Coding Symp. 1996, Mar. 1996, pp. 447–451.

[65] L. McMillan, G. Bishop, Plenoptic modeling: an image-based rendering system, in Proc. Comput. Graphics (SIGGRAPH '95), 1995, pp. 39–46.

[66] Y. Chuang, A Bayesian Approach to Digital Matting, in Conference on Computer Vision and Pattern Recognition (CVPR), vol. 11, pp. 264–271, 2001.

[67] C. Fehn, P. Kauff, M. Op de Beeck, F. Ernst, W. Ijsselsteijn, M. Pollefeys, L. Vangool, E. Ofek, I. Sexton, An evolutionary and optimised approach on 3DTV, IBC 2002, Int. Broadcast Convention, Amsterdam, Netherlands, Sept. 2002.

[68] P. Kauff, N. Atzpadin, C. Fehn, M. Müller, O. Schreer, A. Smolic, R. Tanger, Depth map creation and image based rendering for advanced 3DTV services providing interoperability and scalability, Signal Process. Image Commun., Special Issue on 3DTV, Feb. 2007.

[69] P. Debevec, C. Taylor, J. Malik, Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach, in Proc. SIGGRAPH 1996, 1996, pp. 11–20.

[70] S. Wurmlin, E. Lamboray, M. Gross, 3D video fragments: Dynamic point samples for real-time free-viewpoint video, Comput. Graph., (Special issue on coding, compression and streaming techniques for 3D and multimedia data), vol.28, no. 1, pp. 3–14, 2004.

[71] M. Waschbusch, St. Wurmlin, D. Cotting, F. Sadlo, M. Gross, Scalable 3D Video of Dynamic Scenes, The Visual Computer 21(8-10):629–638, 2005.


[72] C. Hofsetz, K. Ng, N. Max, G. Chen, Y. Liu, McGuinness, Image-based rendering of range data with estimated depth uncertainty, IEEE CG&A 24(4), 34–42 (2005).

[73] S. E. Chen, L. Williams, View interpolation for image synthesis, in Proceedings of SIGGRAPH, pp. 353–363, 1998.

[74] J. Ohm, K. Grueneberg, E. Hendriks, E. Izquierdo, M. Karl, A. Redert, D. Kalivas, D. Papadimatos, Realtime hardware system for stereoscopic videoconferencing with viewpoint adaptation, Signal Processing: Image Communication, vol. 14, no. 1-2, pp. 147–171, Nov. 1998.

[75] S. M. Seitz, C. R. Dyer, View morphing, in Computer Graphics Proceedings, Annual Conference Series, pp. 21–30, Proc. SIGGRAPH '96 (New Orleans), August 1996, ACM SIGGRAPH.

[76] A. Broadhurst, T. Drummond, R. Cipolla, A probabilistic framework for the space carving algorithm, in ICCV '01, pp. 388–393, 2001.

[77] S. Vedula, S. Baker, T. Kanade, Spatio-temporal view interpolation, in EGRW '02, pp. 65–76, 2002.

[78] E. Kurutepe, M. R. Civanlar, A. Murat Tekalp, A Receiver-Driven Multicasting Framework For 3DTV Transmission, Proceedings, EUSIPCO 2005, Antalya, Sept. 2005.

[79] W. A. IJsselstein, P. J. H. Seuntiens, L. M. J. Meesters, State-of-the-art in Human Factors and Quality Issues of Stereoscopic Broadcast Television, ATTEST Proj. Deliverable 1, 2002.

[80] http://graphics.ethz.ch/research/3dvideo/

[81] M. Pauly, M. Gross, L. P. Kobbelt, Efficient Simplification of Point-Sampled Surfaces, Proceedings of the conference on Visualization '02, session P5, p.163-170.
