Low-complexity techniques for 2D-to-3D conversion

Low-complexity techniques for 2D-to-3D conversion Tessa Bosschem

Promoter: prof. dr. ir. Rik Van de Walle
Supervisors: ir. Sebastiaan Van Leuven, ir. Glenn Van Wallendael
Master's dissertation submitted in order to obtain the academic degree of Master in Engineering Science: Computer Science

Department of Electronics and Information Systems
Chairman: prof. dr. ir. Jan Van Campenhout
Faculty of Engineering and Architecture
Academic Year 2010-2011


Acknowledgments

During the realization of this thesis I have been accompanied and helped by many people. It is now a great pleasure to have the opportunity to thank them. First of all, I would like to show my gratitude to my promoter Rik Van de Walle and my supervisors Sebastiaan Van Leuven, Glenn Van Wallendael and Jan De Cock. Their encouragement, guidance and enthusiasm enabled me to develop an understanding of the subject. Without their help and good advice this dissertation would not have been possible. I also owe my gratitude to the people of the Vlaamse Radio- en Televisieomroep (VRT), for providing me with the necessary material for my thesis to succeed. I wish to thank my friends, who supported me during the difficult times and provided emotional support whenever necessary. Last but not least, it is an honor for me to thank my family, my parents and my sister, for helping me in every possible way and for supporting me from the beginning until the end. I want to thank my parents for giving me the opportunity to undertake my studies and for believing in me.


Permission for use

"The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the limitations of the copyright have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation."

Tessa Bosschem, June 2011


Low-complexity techniques for 2D-to-3D conversion Tessa Bosschem

A thesis submitted in partial fulfillment of the requirements for the degree of Master in Engineering Science: Computer Science
Academic Year 2010-2011

Promoter: prof. dr. ir. Rik Van de Walle Supervisors: ir. Sebastiaan Van Leuven, ir. Glenn Van Wallendael

Faculty of Engineering and Architecture Ghent University

Summary

The upswing of 3D television has resulted in the need for the efficient creation of 3D video material. In addition to methods that retrieve 3D information directly from a scene, there is an increasing need for well-working 2D-to-3D conversion techniques. This dissertation aims at developing a new 2D-to-3D conversion technique, specifically designed to meet the needs of a chroma keying application. Therefore, in this work a new algorithm is proposed that exploits these characteristics and provides a straightforward technique to create a 3D experience from 2D video streams. A comparison was made between distinct 2D-to-3D techniques, and visual comfort user assessments were carried out to obtain subjective results. Based on these results, conclusions were drawn about the perceived depth, the preferred parallax effect and the technique that produces the most enjoyable 3D effect. The most significant result was the overall preference of the observers for 3D videos generated with the proposed depth extraction method, evaluated for the use case of a weather forecast.

Keywords: 2D-to-3D conversion, visual comfort, stereoscopy, chroma key


Low-complexity techniques for 2D-to-3D conversion

Tessa Bosschem
Supervisor(s): ir. Sebastiaan Van Leuven, ir. Glenn Van Wallendael, prof. dr. ir. Rik Van de Walle

Abstract—This article describes the development of a new 2D-to-3D conversion algorithm for the conversion of existing 2D video material. A low-complexity technique is developed and applied to a chroma keying application to illustrate its usefulness. Additional functionalities, such as the extraction of objects with a more pronounced 3D effect and the possibility to choose between 3D “behind” and “in front of” the display screen, are available. To be able to evaluate the new system, subjective experiments regarding visual discomfort were executed. The performance of distinct 2D-to-3D conversion methods was investigated, with an additional exploration of the effect of increasing depth levels, parallax effect and the number of 3D animations. The experiments show that, compared to state of the art implementations, the newly designed algorithm generates the most pleasing results in the case of a chroma keying situation. This demonstrates the applicability of our technique.

Keywords—2D-to-3D conversion, visual comfort, stereoscopy, chroma key, background subtraction

I. INTRODUCTION

Three-dimensional television is nowadays often seen as a major milestone in the ultimate visual experience of multimedia. The concept of stereoscopy(1) has been around for ages, but only recently has a real breakthrough from conventional 2D television to 3D television come within reach. The mass production of 3D video material is essential for 3D television to become a success and penetrate the living rooms. One possible approach for generating 3D content from existing 2D content is 2D-to-3D conversion. These techniques pose several difficulties, such as disocclusion problems and the need for flawless extraction of depth information. In this work, the goal is to avoid these issues by exploiting the characteristics of chroma keying applications for the conversion to three-dimensional material.

II. BACKGROUND SUBTRACTION METHOD

In this research, the main focus lies on the conversion of two-dimensional video used in chroma keying applications. Such an application allows two video streams to be employed as input for the conversion algorithm: the video with the foreground object recorded in front of the chroma key screen and an additional background video stream. In a first phase, a background subtraction algorithm is applied to separate foreground objects from the chroma key screen. Afterwards, pixels classified as background can be substituted by the corresponding pixels in the background video stream. This generates a merged media file, the video as seen on television (in applications such as weather forecasts or news broadcasting). For this article we evaluate the use case of a weather forecast situation.

1 Stereoscopy refers to the technique used to obtain a 3D effect by presenting two images separately to the left and right eye of the viewer.
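The thesis evaluates several background subtraction models (Section 3.3); as a purely illustrative stand-in, the sketch below (Python/NumPy, with a hypothetical threshold) marks a pixel as foreground when it differs sufficiently from a reference frame of the empty chroma-key screen, and substitutes the remaining pixels with the background video frame.

    import numpy as np

    def subtract_background(frame, reference, background, threshold=30):
        """frame: studio frame with presenter; reference: frame of the empty
        chroma-key screen; background: substitute background frame.
        All inputs are assumed to be H x W x 3 uint8 arrays of equal size."""
        # Sum of absolute colour differences against the empty key screen.
        diff = np.abs(frame.astype(np.int16) - reference.astype(np.int16)).sum(axis=2)
        foreground = diff > threshold                      # presenter and props
        # Keep foreground pixels, replace everything else with the new background.
        merged = np.where(foreground[..., None], frame, background)
        return merged.astype(np.uint8), foreground

A real implementation would calibrate the threshold on the recorded key screen and clean the mask up (for example with edge information or morphological operators, as discussed in the thesis); this sketch only shows the data flow.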

III. 2D- TO -3D CONVERSION TECHNIQUES In this section, different 2D-to-3D conversion techniques will be discussed. The first subsection describes state of the art techniques while the following subsections propose the two designed algorithms. Each algorithm was designed to retrieve depth information from a 2D video stream and generate left and right view sequences. These left eye and right eye views can be offered to a stereoscopic player which redirects them to the correct eye. The brain does the remaining work by creating a perception of depth. These techniques can be applied to the weather forecast situation and the results are used for testing purposes. A. State of the art techniques Investigating state of the art 2D-to-3D conversion methods enabled the making of a well-founded comparison between the current techniques and the newly designed algorithms. Two distinct state of the art methods are considered during the evaluation of the algorithms. The first method, the open source MakeMe3D software[4], uses self-developed object recognition and motion analysis to extract depth information. The second method employs the input video stream as first view and generates the second view by introducing a frame difference. A frame is shown to the left eye, while the previous frame is shown to the right eye. Both algorithms accept one merged video stream as input for their algorithm. B. Video plus depth After the classification made by the background subtraction method (described in paragraph II), it is possible to immediately generate depth maps corresponding to each frame of the merged video stream. The depth maps are represented by YUV sequences, only containing non-zero values for the luminance component. Left and right view are then generated using a Depth Image Based Rendering (DIBR) toolbox[3]. This approach will result in disoccluded regions in the generated views, which are gaps of missing pixel information. The depth maps are interpreted according to the MPEG N8038 informative recommendations and disocclusions are filled by background pixel extrapolation. C. Proposed generation of left and right view The classification, as made by the background subtraction method, remains the same but left and right view are generated earlier in the process, during the classification of foreground and background pixels. The possibility to assign depth before the video streams are merged together is one of the major advantages of the proposed method.

Fig. 1. Left and right frame generated using the new algorithm, with resulting stereo image.

A cut-and-paste technique is applied to obtain the desired depth effect and to assign similar depth levels to the same pixels as in the previous method. However, in this case there is no need to apply a hole filling technique, considering the fact that all pixel values of the background stream are defined. Although this approach basically assigns the same depth level to each pixel of the presenter, the cardboard effect(2) is not dominantly present. When viewing the stereoscopic video one can, for example, clearly see the presenter's hands appearing more to the front. Fig. 1 depicts an example of the resulting views after applying the proposed conversion scheme to the weather forecast video streams. The left image represents the left eye view while the right image represents the right eye view of an extracted frame from the video sequences. By exploiting the characteristics of a chroma keying application, artifacts around the edges of the foreground object can be avoided. These artifacts, hazy regions surrounding the foreground area, are present in the previous method (described in paragraph III-B), which makes the current approach preferable.
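A minimal sketch of this cut-and-paste idea is given below; it assumes the foreground mask produced by the background subtraction step and NumPy image arrays, and the disparity value and helper name are illustrative rather than the thesis code. Because every pixel of the background stream is defined, the pasted views need no hole filling.

    import numpy as np

    def make_stereo_views(studio_frame, background_frame, foreground_mask, disparity=8):
        """Paste the extracted foreground onto the (fully defined) background twice,
        shifted horizontally in opposite directions, to obtain left and right views."""
        h, w, _ = studio_frame.shape
        half = disparity // 2
        views = []
        for shift in (half, -half):                        # left view, then right view
            view = background_frame.copy()                 # no hole filling needed: every
            ys, xs = np.nonzero(foreground_mask)           # background pixel is defined
            view[ys, np.clip(xs + shift, 0, w - 1)] = studio_frame[ys, xs]
            views.append(view)
        return views[0], views[1]

Reversing the sign of the shift applied to the two views (or offsetting both by a constant) moves the presenter from in front of the display screen to behind it, which mirrors the negative/positive parallax option described above.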

IV. EXPERIMENTAL RESULTS

Subjective experiments were conducted in which 25 observers participated. Test sets consisted of several evaluations, in which two video fragments were compared pairwise. These fragments differed in depth level, applied 2D-to-3D conversion algorithm, negative or positive parallax(3) effect and/or number of 3D animations. The test was performed on a stereoscopic display screen with active shutter glasses.

A. Comparison of algorithms

Fig. 2. Comparison of different 2D-to-3D conversion techniques with a large depth effect.

A first conclusion was that the newly designed algorithms (Algorithm 1 and 3 in Fig. 2) outperform the other state of the art techniques. Algorithm 1 corresponds to the technique explained in paragraph III-C, while Algorithm 3 corresponds to paragraph III-B. Reducing or enlarging the depth level does not affect the opinion of the observers regarding the most enjoyable conversion technique.

B. Comparison of additional criteria

Additional experiments compared video fragments with small and large depth effects, for each algorithm. For the majority of the evaluated algorithms, a larger depth effect was preferred over a smaller one. Furthermore, observers tend to show a clear preference for viewing 3D “behind” the display screen (the positive parallax effect). Finally, it could be inferred that the majority of the observers enjoys additional three-dimensional effects.

V. CONCLUSIONS

The proposed algorithm is capable of converting 2D video content into 3D video material with an enjoyable depth impression. The usefulness of the designed technique depends strongly on the type of application. Extensive subjective testing, with an unbiased test panel, shows that this technique of assigning depth outperforms existing techniques for in total 73.92% of the viewers. Although the complexity increases slightly, many possibilities exist for speeding up the application, which should be investigated in further research. The application creates entertaining 3D animations, for which the assigned depth can be adjusted to the user's preferences. Moreover, the user can decide whether more or fewer 3D animations should be added, and whether a 3D effect “behind” or “in front of” the screen should be obtained. These are options that are lacking in the described state of the art methods, making our method preferable for chroma keying applications. The application is not restricted to weather forecast situations but could also be used in other TV applications such as news broadcasting. Moreover, this research demonstrated the necessity of developing well-functioning and high-quality 2D-to-3D conversion techniques in the future.

2 The “cardboard effect” is a distortion peculiar to stereoscopic images and describes the phenomenon whereby objects appear flattened in depth.
3 A negative parallax results in objects appearing in front of the display screen, a positive parallax in objects appearing behind the display screen.

REFERENCES

[1] Kauff, P., Atzpadin, N., Fehn, C., Müller, M., Schreer, O., Smolic, A. and Tanger, R., "Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability".
[2] W. J. Tam and Liang Zhang, "3D-TV Content Generation: 2D-to-3D Conversion".
[3] V. De Silva, "Depth Image Based Stereoscopic View Rendering Toolbox", http://www.mathworks.com/matlabcentral/fileexchange/27538-depthimage-based-stereoscopic-view-rendering, 2010.
[4] MakeMe3D software, http://www.makeme3d.net/convert 2d to 3d.php

2D-to-3D conversion techniques with low complexity

Tessa Bosschem
Supervisor(s): ir. Sebastiaan Van Leuven, ir. Glenn Van Wallendael, prof. dr. ir. Rik Van de Walle

Abstract—This article introduces a low-complexity 2D-to-3D conversion method that can convert existing two-dimensional video sequences into 3D. A new algorithm was developed and applied to an application that relies on the chroma keying technique. Extra functionalities have been added, such as the extraction of objects in the background to be substituted, in order to create a larger 3D effect. It is also possible to choose between a 3D effect that makes objects appear "in front of" or "behind" the screen. Subjective tests were carried out with a non-expert user group, in which stereoscopic video material was compared. The evaluated video material consists of fragments generated with different algorithms, in which additional parameters such as the amount of assigned depth, positive or negative parallax effect and the number of extra 3D animations can also vary. The results indicate that the developed algorithm is preferred over the other evaluated algorithms when applied to the video streams of a weather forecast.

Keywords—2D-to-3D conversion, visual comfort, stereoscopy, chroma key, background subtraction

I. INTRODUCTION

Three-dimensional television (3DTV) is seen today as the next major milestone in achieving the ultimate viewing experience. The concept of stereoscopy(1) has existed for many years, but a real breakthrough from conventional 2D to 3D television only now seems to be getting under way. For 3DTV to become a success and enter the living rooms, the mass production of 3D video material is indispensable. One possible approach is 2D-to-3D conversion, in which existing 2D video material is converted into videos that can be viewed in a 3D format. These techniques come with a number of difficulties, such as disocclusion problems and the error-free extraction of depth information. In the proposed technique these problems are circumvented by exploiting the properties of chroma key(2) applications during the conversion from 2D to 3D.

II. FOREGROUND EXTRACTION ALGORITHM

The focus here lies on the conversion of two-dimensional video sequences as recorded in chroma keying applications. Such an application allows two video streams to be used as arguments of the conversion algorithm: the video in which one (or more) persons stand in front of a chroma key screen, and a video representing the background. In a first phase we apply a background subtraction model to distinguish the foreground objects from the chroma key screen. Subsequently, the pixels classified as background can be substituted by the corresponding pixels in the background video. The result of these consecutive steps is a merged video as seen in, for example, a weather forecast or a news broadcast. To test the proposed implementation, video streams of a weather forecast are used.

1 Stereoscopy refers to obtaining a depth dimension by generating two images, one for each eye.
2 Chroma keying refers to the use of a uniformly coloured background while recording the video material.

III. 2D-TO-3D TECHNIQUES

In this section, several 2D-to-3D conversion techniques are discussed. The first subsection describes existing techniques, while the following subsections focus on the proposed methods. Every technique aims to extract depth information and generate a left and a right view. These views are offered to a stereoscopic player, which makes sure that each view reaches the correct eye. The brain is then responsible for creating a sense of depth. These techniques were applied to the specific situation of a weather forecast, and the resulting video streams were then used for the experimental research.

A. Current techniques

Research into existing 2D-to-3D conversion techniques allowed us to make a thorough comparison between the current and the proposed algorithms. Two different existing techniques were taken into consideration. The first method, the MakeMe3D software[4], uses object recognition and motion analysis to extract depth information. The second algorithm, one of the most straightforward techniques, generates the second video sequence by introducing a frame difference: a given frame is shown to the left eye while the previous frame is shown to the right eye. Both methods expect a single merged video sequence as the argument of their algorithm.

B. Video plus depth map

After the classification of pixels as foreground or background (by the algorithm described in paragraph II), depth maps can be generated directly for every frame of the merged video sequence. The depth maps are represented by YUV sequences, in which only the luminance component contains non-zero values. The left and right view are then generated with an existing Matlab toolbox[3] that performs Depth Image Based Rendering (DIBR) on a 2D video and a depth map. This way of working results in regions in which some pixel values are undefined as a consequence of disocclusions. The depth maps are interpreted according to the MPEG N8038 informative recommendations, and missing pixel values are filled in by applying background pixel extrapolation.

Fig. 1. Left and right view generated by the new algorithm, with the resulting stereo image.

C. Designed algorithm

The classification, as obtained by the foreground extraction algorithm, remains the same. This approach, however, creates the left and right view earlier in the process, namely during the classification of the pixels. A cut-and-paste technique is applied to obtain the desired depth effect, and the pixels receive similar depth values. The problem of missing information can be avoided here, since every pixel value is defined. Fig. 1 shows an example of the proposed conversion algorithm, applied to the video sequences of a weather forecast. The left image is the view intended for the left eye, the right image the one for the right eye. Because the properties of a chroma keying application can be exploited, the hazy zones around the edges of foreground objects, which are prominently present with the previous method, are avoided.

IV. EXPERIMENTAL RESULTS

Subjective experiments were carried out in which 25 people from different age groups participated. Several test sequences were developed, in which video fragments were evaluated against each other. These fragments differed from each other by varying several parameters: the applied 2D-to-3D conversion technique, the amount of introduced depth, positive or negative parallax(3) and the number of 3D animations. All video fragments were viewed on a stereoscopic screen using active shutter glasses.

A. Comparison of the algorithms

Fig. 2. Comparison of different 2D-to-3D techniques.

From the experiments it could be deduced that the developed algorithms ("Algorithm 1" and "Algorithm 3" in Fig. 2) are preferred over the existing techniques. Algorithm 1 corresponds to the technique described in paragraph III-C, while Algorithm 3 corresponds to paragraph III-B. Increasing or decreasing the amount of depth has, in most of the evaluations, no effect on the viewers' opinion regarding the best conversion technique.

B. Effect of adjusting extra parameters

Additional experiments focused on the comparison of video fragments that differed from each other, among other things in depth effect, but that used the same conversion technique. For most of the algorithms, users showed a preference for a somewhat larger depth effect. Moreover, results of previous research regarding the parallax effect were confirmed: viewers prefer effects in which objects appear behind the screen, the so-called positive parallax effect. A final conclusion concerns the number of 3D animations, where users clearly prefer more objects with a pronounced 3D effect.

V. CONCLUSIONS

The proposed algorithm is capable of converting 2D video material into 3D video with a minimum of perceptible depth artifacts. The usability of the developed technique depends strongly on the type of application. Extensive subjective tests, with an independent test panel, show that this way of assigning depth outperforms other techniques for in total 73.92% of the viewers. Although the approach comes with a small increase in complexity, the much more pleasant 3D viewing experience makes this technique preferable. Moreover, many possibilities exist to speed up the application, which should be investigated in the future. The application creates entertaining 3D animations, in which the user can set the amount of depth according to their own preference. The user can also decide whether more or fewer 3D animations should be added, and whether a 3D effect behind or in front of the screen is desired. These options are not present in the evaluated existing techniques, which makes our application preferable for chroma keying applications. This technique is thus not only applicable to a weather forecast, but also to other chroma keying applications such as a news broadcast. With this research, the need for the development of efficient 2D-to-3D conversion algorithms in the future was thus demonstrated.

3 Negative or positive parallax makes objects appear in front of or behind the screen, respectively.

REFERENCES

[1] Kauff, P., Atzpadin, N., Fehn, C., Müller, M., Schreer, O., Smolic, A. and Tanger, R., "Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability".
[2] W. J. Tam and Liang Zhang, "3D-TV Content Generation: 2D-to-3D Conversion".
[3] V. De Silva, "Depth Image Based Stereoscopic View Rendering Toolbox", http://www.mathworks.com/matlabcentral/fileexchange/27538-depthimage-based-stereoscopic-view-rendering, 2010.
[4] MakeMe3D software, http://www.makeme3d.net/convert 2d to 3d.php

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Goal
  1.3 Overview

2 State of the Art
  2.1 Introduction
  2.2 Free Viewpoint Video vs. 3D Video
    2.2.1 Free Viewpoint Video
    2.2.2 3D Video
  2.3 3D content generation
    2.3.1 Scene Capturing Technologies
      2.3.1.1 Single Camera Techniques
      2.3.1.2 Multicamera Techniques
      2.3.1.3 Holographic Techniques
    2.3.2 2D-to-3D conversion
      2.3.2.1 Cut-and-Paste Techniques
      2.3.2.2 Depth Map Generation
      2.3.2.3 Other state of the art conversion methods
      2.3.2.4 Computer Generated Imagery
  2.4 Scene Representation Technologies
    2.4.1 Dense Depth Representations
    2.4.2 Surface-Based Representations
    2.4.3 Volumetric Representations
    2.4.4 Texture Representations
    2.4.5 Pseudo 3D Representations
    2.4.6 Object-Based 3D Scene Modeling
    2.4.7 Head and Body Specific Representations
    2.4.8 Existing standards
  2.5 Coding algorithms
    2.5.1 Conventional Stereo Coding
      2.5.1.1 Basic Principle
      2.5.1.2 Stereo Encoding Formats
      2.5.1.3 The Multi-View Profile specification
    2.5.2 Video plus Depth
    2.5.3 Multi-View Video Coding
    2.5.4 3D Mesh Compression
    2.5.5 Multiple Description Coding
      2.5.5.1 MDC of Single-View Visual Data
      2.5.5.2 MDC of Multi-View Video
  2.6 Transmission Technologies
    2.6.1 3DTV broadcast
    2.6.2 3DTV over IP Networks
    2.6.3 Packaged material
  2.7 Display technologies
    2.7.1 Classical Stereoscopic Displays
    2.7.2 Multiple Image Displays
    2.7.3 Volumetric
  2.8 Existing 3DTV Standards and Techniques
    2.8.1 MPEG Standards
      2.8.1.1 Depth Enhanced Stereo
      2.8.1.2 Multi-view Video plus Depth
      2.8.1.3 Layered Depth Video
      2.8.1.4 Recent MPEG activities
    2.8.2 Image Based Rendering
      2.8.2.1 Single Video plus Depth
      2.8.2.2 Multi-view
  2.9 Visual discomfort in stereoscopic displays
    2.9.1 Human perception of depth
    2.9.2 Visual fatigue
    2.9.3 Factors that cause visual discomfort

3 Implementation
  3.1 Introduction
  3.2 Preparation
    3.2.1 Provided material
    3.2.2 Analyzing video material
      3.2.2.1 The video format
      3.2.2.2 The audio format
    3.2.3 Editing video material
  3.3 Background/Foreground classification
    3.3.1 Difference between reference frame and current frame
    3.3.2 Background classification using morphological operators
    3.3.3 SACON: A consensus based model for background subtraction
    3.3.4 Background subtraction based on absolute value classification and edge detection
  3.4 Generation of 3D video
    3.4.1 Video plus depth
      3.4.1.1 Performance
    3.4.2 Proposed generation of left and right view
      3.4.2.1 Basic implementation
      3.4.2.2 Extended implementation
      3.4.2.3 Extended implementation for positive parallax effect
      3.4.2.4 Performance
  3.5 Discussion on complexity

4 Experimental work
  4.1 Introduction
  4.2 Experimental setup
    4.2.1 Design
    4.2.2 Observers
    4.2.3 Equipment and test environment
    4.2.4 Evaluation
  4.3 Experiments
    4.3.1 Experiment 1: perception of depth for changing depth values
      4.3.1.1 Introduction
      4.3.1.2 Stimuli
      4.3.1.3 Results
    4.3.2 Experiment 2: comparing 2D-to-3D conversion methods for constant depth level
      4.3.2.1 Introduction
      4.3.2.2 Stimuli
      4.3.2.3 Results
    4.3.3 Experiment 3: adding additional 3D animations
      4.3.3.1 Introduction
      4.3.3.2 Stimuli
      4.3.3.3 Results
    4.3.4 Experiment 4: comparing negative and positive parallax effects
      4.3.4.1 Introduction
      4.3.4.2 Stimuli
      4.3.4.3 Results
  4.4 Summary of results

5 General discussion
  5.1 Conclusions
    5.1.1 Introduction
    5.1.2 Conclusions concerning the proposed implementation
    5.1.3 Conclusions concerning experimental results
    5.1.4 General conclusion
  5.2 Future work

A Evaluation form
B Video editing scripts
C Avisynth script
D YUV/WMV conversion scripts
E Source Code and Video Sequences

Bibliography

List of abbreviations

3DAV    3D Audio and Video
3DTV    Three-Dimensional Television
AC/A    Accommodative Convergence/Accommodation
AES3    Audio Engineering Society standard 3
AFX     Animation Framework eXtension
ATTEST  Advanced Three-Dimensional Television Systems Technologies
CAD     Computer Aided Design
CCD     Charge-Coupled Device
CGIs    Computer Generated Images
CMOS    Complementary Metal Oxide Semiconductor
DCT     Discrete Cosine Transform
DES     Depth Enhanced Stereo
DIBR    Depth Image Based Rendering
FVV     Free Viewpoint Video
HBM     Human Body Model
HOEs    Holographic Optical Elements
IBR     Image Based Rendering
IEC     International Electrotechnical Commission
IPD     InterPupillary Distance
IPTV    Internet Protocol Television
ISO     International Organization for Standardization
ITU-T   International Telecommunication Union - Telecommunications
LDI     Layered Depth Images
LDV     Layered Depth Video
MDC     Multiple Description Coding
MPEG-4  Moving Picture Experts Group - 4
MVC     Multiview Video Coding
MVD     Multiview Video plus Depth
MVP     MultiView Profile
MXF     Material eXchange Format
NCC     Normalized Cross Correlation
NURBS   Non-Uniform Rational B-Splines
PCM     Pulse-Code Modulation
RGB     Red Green Blue format
SfM     Shape-from-Motion
SfS     Shape-from-Shading
SMP     Shared-Memory Multiprocessor
SSD     Sum of Squared Differences
TOF     Time-Of-Flight
VoD     Video-on-Demand
VRML    Virtual Reality Modeling Language
VRT     Vlaamse Radio- en Televisieomroep
WMV     Windows Media Video
X3D     eXtensible 3D
ZPS     Zero Parallax Setting

Chapter 1

Introduction

1.1 Problem Statement

Three-dimensional television (3DTV) is nowadays often seen as a major milestone in the ultimate visual experience of media. The concept of stereoscopy has been around for ages, but only recently has a real breakthrough from conventional 2D broadcasting to 3D broadcasting come within reach. Recent innovations such as the development of auto-stereoscopic display screens, for example by Philips[1], make it possible to view 3D images and movies without 3D glasses and allow people to dream of a 3D experience in the living room. However, the mass production of 3D video content is still pending and is holding back the big breakthrough of 3DTV. The development of well-working 2D-to-3D conversion techniques is essential for this breakthrough to happen. These are techniques that convert 2D material to 3D material, but they have many difficulties to overcome. The flawless extraction of depth maps, needed to send depth information to the display screen, is one of these problems and is addressed throughout this thesis.

1.2 Goal

This dissertation's main goal is to develop a new 2D-to-3D conversion technique that avoids the difficulties present in state-of-the-art conversion methods. A straightforward approach is to use a chroma key screen onto which the background can be projected after the video has been recorded. This may not be a realistic situation for every application, since not all productions use these recording techniques, but it demonstrates the potential of 3DTV. One situation in which this approach is applicable is a weather forecast. The foreground can easily be extracted from the image, after which depth information can be assigned without the risk of assigning imperfect depth values to pixels. Afterwards, more background animations can be added through an application that focuses on segmenting the background image. A second important goal is the evaluation of this new conversion technique by comparing stereoscopic material generated by distinct low-complexity algorithms. This is done by conducting subjective experiments that explore the effects of perceived depth and of increasing or decreasing the number of 3D animations.

1.3 Overview

This master's thesis starts with a general description of current 3D television technologies. Every step in the 3DTV chain is explained thoroughly, covering the capturing of the 3D data, generating a 3D representation, coding, transmission, rendering and finally displaying 3D video. Afterwards, a discussion of current research in the field of visual discomfort is included, so that the reader has a good understanding of the experimental work in this dissertation. Chapter 3 focuses on the implementation aspect and covers different methods; the chapter ends with a description of the final implementation. The next chapter is dedicated to the experimental work and is divided into four subjective experiments. The effects of increasing depth level, parallax variations, an increasing number of 3D animations and changing the 2D-to-3D conversion technique are explored consecutively. Chapter 5 concludes this thesis with a discussion of the results and a general conclusion.

Chapter 2

State of the Art

2.1 Introduction

This chapter gives an impression of what is currently going on in the field of 3DTV. The first part explains some common terms such as Free Viewpoint Video (FVV) and three-dimensional television (3DTV). Thereafter, the focus remains on the concept of 3DTV, more specifically on every phase of the 3DTV chain: 3D content creation, generation of 3D representations, coding, transmission technologies, rendering and finally displaying 3D video. In addition, existing 3DTV standards and techniques, such as the well-known MPEG video formats, are discussed briefly. Finally, current research in the field of visual discomfort when viewing stereoscopic image sequences is included, to help the reader understand the experimental work covered in this dissertation.

2.2 Free Viewpoint Video vs. 3D Video

Basically, there is no clear distinction between 3D video and Free Viewpoint Video. The classification is related to their main focus: more on depth impression or more on free navigation. The following sections highlight the techniques used in applications for FVV and 3DTV[2].

2.2.1 Free Viewpoint Video

Free Viewpoint Video allows the user to navigate freely within virtual renderings of real-world scenes, choosing their own viewpoint and viewing direction. In contrast to pure computer graphics applications, FVV targets real-world scenes as captured by real cameras. This is interesting for user applications as well as for post-production systems, e.g. for sports and movies.


Figure 2.2.1: General 3DTV scheme[3].

Different technologies can be used for acquisition, processing, representation and rendering, but all make use of multiple views of the same scene. The multiple camera signals are processed and transformed into a certain scene representation format that allows the rendering of virtual intermediate views in between the real camera positions. The choice of a 3D scene representation format for FVV sets the requirements for acquisition and multi-view signal processing, and determines the rendering algorithms, interactivity, compression and transmission. These methods are often classified somewhere in between two extremes: geometry-based and image-based modeling. Using the first method, real-world objects are reproduced using geometric 3D surfaces with an associated texture mapped onto them. The other extreme uses virtual intermediate views that are generated from the available real views using interpolation. Other representations do not use explicit 3D models but depth or disparity maps, which assign respectively a depth value or a disparity value(1) to each pixel of an image.

1 The disparity can be described as the relative displacement between corresponding features in two images.

2.2.2 3D Video

3D video offers the user a 3D depth impression of the observed scenery. This functionality is based on the concept of stereoscopy, which has been investigated for a long time. This dissertation focuses on the concept of 3DTV, for which the technical steps to be followed are outlined in the scheme in Figure 2.2.1. There are various techniques to capture 3D scene information, but also many different ways of displaying it. An important goal of 3DTV systems is to decouple the capturing and the displaying as much as possible. To achieve this decoupling, an intermediate representation of the captured data is necessary. Furthermore, the 3D video data has to be transported through the available channels, after applying a certain coding technique.

Figure 2.2.2: Example of 3DTV system[2].

An example of a 3DTV system is shown in Figure 2.2.2, which represents a multi-camera system. In this case, a scene is captured by N synchronized cameras and multiple video signals are encoded and transmitted. At the receiver they are decoded, rendered and displayed on a 3D display.

In the following we will walk through the details of the consecutive steps in every 3DTV system. The capturing of the 3D data, generating a 3D representation, coding, transmitting the data over an error prone channel, rendering and nally displaying will be discussed.

2.3 3D content generation

The generation of 3D data suitable for 3DTV can be handled in different ways. First of all, it is possible to capture 3D information from a scene directly and generate the content while recording. Other approaches make it possible to reuse existing 2D content and convert it to 3D content using conversion methods such as adding depth information. Comparing different 2D-to-3D conversion techniques is the main subject of this thesis, so current approaches will be discussed thoroughly in the following sections. Finally, Computer Generated Images (CGIs) can be used to create several views by simulating the presence of several cameras using a Computer Aided Design (CAD) program.


2.3.1 Scene Capturing Technologies

The capturing of 3D information of dynamic scenes is without any doubt crucial for 3DTV implementations. In what follows, methods for 3D scene extraction from single-camera and multi-camera data streams are outlined, as well as the holographic technique[4]. While all techniques have their difficulties, the main point remains that additional hardware and/or processing power is necessary to capture the depth information.

2.3.1.1 Single Camera Techniques

There are several methods for capturing 3D scenes from a single-camera video sequence, described as Shape-from-X methods. Currently, Shape-from-Motion (SfM) seems to be the best solution because of its applicability to general scenarios. All the other techniques (e.g. Shape-from-Texture, Shape-from-Focus) are mainly used to extract 3D shapes in controlled 3D environments.

Shape-from-Motion
The Shape-from-Motion[5] technique tries to solve for 3D geometry by using the relative motion between the viewing camera and the observed scene, which is an expected situation in practice. This kind of relative motion provides an important cue to depth perception and can be seen as a form of disparity over time. The motion field contains the 2D velocity vectors of the image points, induced by this relative motion. Objects are assumed to be undeformable and their movements to be linear. The majority of Shape-from-Motion algorithms are either optical flow based or feature based (see [5] for further reading).

Shape-from-Texture
Patterned textures can be used to create a good 3D impression(2) because of two main features: the distortion of individual texels and the rate of change of texel distortion across the texture region. Perspective distortion makes texels far from the camera appear smaller, while foreshortening distortion makes texels that are not parallel to the image plane appear compressed in the direction of inclination of the surface. The output of Shape-from-Texture[4] algorithms generally consists of a dense map of surface normals. This can be used to recover the 3D shape under the assumption of a smooth textured surface. Figure 2.3.1 shows a typical shape reconstruction based on texture features. Conventional Shape-from-Texture algorithms appear to be quite restrictive due to a number of simplifying assumptions. For further reading on more advanced Shape-from-Texture techniques, see [4].

2 A texel is a texture element, representing the smallest graphical unit.


Figure 2.3.1: Shape-from-texture (From left to right: original image, segmented texture region, surface normals, depth map and reconstructed 3D shape)[4].

Figure 2.3.2: ZCam[6].

Shape-from-Defocus
Shape-from-Defocus[4] methods rely on one or multiple images to estimate the amount of blur. In a thin-lens system, objects that are in focus are observed clearly whilst others are blurred. Because ambiguity can occur when estimating the blur parameter necessary for this algorithm, most Shape-from-Defocus algorithms rely on two or more images of the same scene taken from the same position with different camera focal settings. In addition, a blur estimation technique using a single image has been proposed. For example, the focus settings of the camera are changed when the attention of the viewer needs to be redirected from foreground to background. With blur estimation techniques, the relative depth level can still be estimated by mapping a large blur value to a higher depth and a smaller blur value to a lower depth level, even when the camera parameters are unknown.
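As a compact illustration of the two-image variant, the sketch below (SciPy-based, not from [4]) uses the local squared Laplacian response as a stand-in for the blur parameter and maps the ratio of the two responses to a coarse relative depth ordering; the window size is an arbitrary example value.

    import numpy as np
    from scipy.ndimage import laplace, uniform_filter

    def relative_depth_from_defocus(img_near_focus, img_far_focus, window=15):
        """Two grayscale float images of the same scene, taken with different
        focal settings; returns a relative depth map in [0, 1]."""
        sharp_near = uniform_filter(laplace(img_near_focus) ** 2, size=window)
        sharp_far = uniform_filter(laplace(img_far_focus) ** 2, size=window)
        # Pixels that are sharper in the near-focused exposure are assumed closer.
        return sharp_far / (sharp_near + sharp_far + 1e-8)   # 0 = near, 1 = far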

Example of a state-of-the-art single camera system
ZCam (Figure 2.3.2) is an example of a time-of-flight (TOF)(3) camera product designed by the Israeli developer 3DV Systems. The ZCam supplements full-color video camera imaging with real-time range imaging information, allowing for the capture of video in 3D. The ZCam can be used to generate a depth map.

3 The TOF principle refers to measuring the distance of objects by using light pulses.

2.3.1.2 Multicamera Techniques

Several multicamera systems that aim to produce 3DTV are related to teleconferencing, since these applications obviously require live video feeds. Issues that relate to this kind of system are the synchronization and calibration of the cameras. Multicamera systems capture the appearance of a dynamic scene from multiple viewpoints at the same time. A variety of techniques exist to extract FVV from the resulting footage. For further reading, see [5].

Figure 2.3.3: Device for recording digital holograms[5].

2.3.1.3 Holographic Techniques

Holography[5] is a unique technique for recording and reconstructing 3D information of an object. A hologram is essentially a record of the interference pattern obtained from the superposition of a reference beam and the beam scattered by the object. In classical holography, photographic films are used to record holographic patterns and the reconstruction is performed optically. Recently, however, digital holography allows replacing holographic films with charge-coupled device (CCD) and complementary metal-oxide-semiconductor (CMOS) image sensors to record and numerically reconstruct holograms. The basic process of recording a digital hologram is shown in Fig. 2.3.3: the light coming from the object interferes with the reference beam, and this interference pattern is recorded by the CCD camera.

2.3.2 2D-to-3D conversion

Whereas the previous section described active methods of creating 3D video, this section gives an overview of passive state-of-the-art conversion techniques. Due to the upcoming success of 3D television, the need for well-working 2D-to-3D conversion algorithms arises. Not only are 3D scene capturing technologies important; a lot of existing 2D video content also needs to be converted to 3D video content. Because this process often requires segmentation of the video frames, real-time conversion is particularly problematic.


2.3.2.1 Cut-and-Paste Techniques

The easiest way to create a stereoscopic 3D image is to use the original 2D image as the left eye view and generate the right eye image by horizontally shifting local regions of the original image, using a cut-and-paste technique[7]. For the isolation of the local regions, image segmentation techniques can be used. This technique works well for images where the objects are well separated, but for images with small objects, large areas with low texture and little gradation of depth, other techniques should be used. Furthermore, as the depth is assigned to each object and not to each pixel, the objects rendered with these techniques may appear flat and unrealistic.

2.3.2.2 Depth Map Generation

As stated before in Section 2.3.1.1, the ZCam can be used as an active method to generate a depth map(4) while recording a 2D video sequence. To be able to understand the following paragraphs, the concept of depth cues needs to be explained. Depth cues are sources of information, from the environment or from within our body, that help us perceive how far away an object is and where it is perceptually located. Monocular cues are those cues that exist for a single eye, while binocular cues involve both eyes. Different types of depth cues are employed in the following passive methods for depth map generation.

4 An auxiliary image denoting the depth of every pixel.

those cues that exist for a single eye while binocular cues involve both eyes. Dierent types of depth cues are employed in the following passive methods for depth map generation.

Depth Map Generation from Disparity Estimation

The fundamental principle

behind most 2D-to-3D conversion relies on the binocular processing of the human visual system of two slightly dierent images, in the case of stereoscopic viewing. The horizontal disparities between the left eye and right eye images can be transformed into distance information

5

so that objects can be observed at a dierent depth outside the 2D plane.

The extent of the horizontal shift between the pixels of both images depends on the interlens separation and the distance of the feature of the object to the camera. It is important to remark that this approach requires the use of a stereoscopic camera for recording the two views and cannot be compared to any of the single camera techniques described in Section 2.3.1.1.

Depth Map Generation from Blur

The concept of extracting depth from blur analysis[7],

relies on the fact that a camera's focal parameters have an eect on the image. For a given focal length, there is a direct relationship between the depth of an object (being his distance to the camera) and the amount of blur present in the image.

4 5

This approach for

Auxiliary image denoting the depth of every pixel. A distinction can be made between depth maps containing relative depth information (disparity maps)

and depth maps containing absolute depth information (range maps). Range maps can be derived from disparity maps when sucient camera and capture information is available.

2.3.

10

3D CONTENT GENERATION

extracting depth information is not straightforward because blur can also arise from other factors such as lens aberration, atmospheric interference, fuzzy objects and motion. More information can be found in [7]. This method is strongly related to the Shape-from-Defocus method, as described in paragraph 2.3.1.1, in which a blur estimation technique is applied to obtain information about the amount of blur while recording.

Depth Map Generation from Focus

The depth-from-focus[4] approach is very closely

related to the family of algorithms using blur.

The main dierence is that depth map

generation from focus requires a series of images of the scene with dierent focus levels by varying and registering the distance between camera and scene, whilst depth map generation from defocus only requires one, two or more images with xed object and camera positions and use dierent camera focal settings.

Depth Map Generation from Geometric Perspective

Another way to generate

depth maps is through exploitation of gradient and linear perspective cues[7].

This ap-

proach is expected to be suitable for 3DTV. A single input image is rst classied based on color segmentation. Guided by this category, the vanishing points and lines are then determined by identifying straight lines in the image. The region with the most intersections is considered to be the vanishing point and the major straight lines passing close to the vanishing point are considered to be vanishing lines that provide linear perspective of depth. In general, converging lines that are actually parallel indicate a surface that recedes in depth. Thus, depending on the slopes of the vanishing lines, dierent depth gradient planes can be generated with the vanishing point being at the farthest distance. This geometric depth information can then be fused with depth information of objects generated by the initial image segmentation and classication process to end up with a "natural" depth map.

Depth Map Generation from Edge Information

This approach uses surrogate

depth maps, containing depth information that is concentrated mainly at the edges and object boundaries in the 2D images. These depth maps have large regions with missing and incorrect depth information but it was speculated that the human visual system combines the available information together with pictorial depth cues to ll in the missing areas. These depth maps are relatively simple to generate but are only suitable to use in applications where depth accuracy is not critical. More information can be found in [7].

Depth Map Generation from Motion

The basic principle is underlying on the motion

parallax, how points move relatively to one another with respect to head movements. Near objects move faster than far objects and can thus receive a greater depth value than the

2.3.

11

3D CONTENT GENERATION

background objects.

This way relative motion can be used to estimate the depth map.

The dierence with Shape-from-Motion as described in paragraph 2.3.1.1 is limited to the fact that, in this approach, processing occurs after recording the video sequence.

Depth Map Generation from Shading

This Shape-from-Shading[4] technique refers to the reconstruction of 3D shapes from intensity images using the relationship between surface geometry and image brightness. The gradual variation of surface shading in an image reveals the shape information of the objects. Most Shape-from-Shading (SfS) algorithms make use of the Lambertian reflectance model: a uniformly illuminated Lambertian surface appears equally bright from all viewpoints. Calculation of the depth map requires solving complex equations and is beyond the scope of this thesis (for further reading see [4]).

Depth Map Generation by Short-Term Motion Assisted Color Segmentation

The previous depth map generation methods focus on only one depth cue. In this approach[8], the monocular and binocular depth cues are considered together to produce a temporally and spatially smooth depth map. To ensure this, video segmentation techniques such as color and motion segmentation and background registration are necessary. The color segmentation utilizes the color information in the image to detect the objects. Because there is a lot of color variance between adjacent video frames, wrong segments could be produced. To deal with this, motion segmentation helps to extract the object boundaries from the moving areas. In a static scene, however, motion segmentation extracts nothing from the video. The background registration technique can then be used to register the background information in memory and subtract the background from the captured image.

The method outlined in Figure 2.3.4 contains four parts: motion/edge detection, a K-means algorithm for color segmentation, a connected component algorithm and motion/image segment adaptation. The current frame is first processed by the K-means algorithm, which clusters the pixels into K partitions based on the color of each pixel. The connected component algorithm then finds the connected pixels within the same group after the K-means step. At the same time, the frame buffer stores the previous frame, and this information is sent to the motion and edge detection part together with the current frame. The motion detection method calculates the differences between the current image frame and the previous image frame, taking motion jitter into account by using a weight registration. For edge detection, a Sobel edge detector (or a more advanced method) can be used. After the motion/edge detection and the connected component algorithm, the position and range of the moving objects are estimated by the range estimation, which performs a raster-scan and an inverse raster-scan on the image with the following procedure: if there is an edge registered in the neighborhood of the current scanning pixel, the pixel is recognized as a pixel in the estimated range. A component separation algorithm separates one component into two if two connected components occlude each other. If a component is inside the estimated range and is surrounded by both image and motion edge segments, the component is set as the foreground component and its depth value is assigned to the nearest depth.

Figure 2.3.4: The proposed Short-Term Motion Assisted Color Segmentation [8].
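A heavily simplified Matlab sketch of the combination of color clustering and motion detection is given below. It only mimics the spirit of the scheme in Figure 2.3.4 (no background registration, no raster-scan range estimation, no component separation), and all file names and thresholds are assumptions.

```matlab
% Minimal sketch: K-means color segmentation fused with a motion mask (assumptions)
prev = imread('frame_000.png');  curr = imread('frame_001.png');   % hypothetical frames
[rows, cols, ~] = size(curr);
K = 5;                                           % assumed number of color clusters

% 1) Color segmentation: cluster pixels on their RGB values
pix = double(reshape(curr, [], 3));
labels = kmeans(pix, K);                         % Statistics Toolbox
segMap = reshape(labels, rows, cols);

% 2) Motion detection: simple thresholded frame difference
diffImg = sum(abs(double(curr) - double(prev)), 3);
motion  = diffImg > 30;                          % assumed threshold

% 3) Fusion: a color segment is foreground if enough of its pixels are moving
fg = false(rows, cols);
for k = 1:K
    seg = (segMap == k);
    if nnz(seg) > 0 && nnz(seg & motion) / nnz(seg) > 0.2   % assumed ratio
        fg = fg | seg;
    end
end

% 4) Depth assignment: foreground gets the nearest depth value
depth = zeros(rows, cols, 'uint8');
depth(fg) = 255;
```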

This system has been tested for a weather forecast, and results show that it is possible to distinguish a foreground person from the background (see Figure 2.3.5). For better results, this method could be combined with the depth-from-geometric-perspective approach and other depth cues to produce more accurate depth maps. Keeping one of this dissertation's main goals in mind, the development of a 2D-to-3D conversion technique for a weather forecast, it should be noted that this method is applicable to a single 2D video stream only. It therefore does not exploit the advantages such an application possesses and is not suitable for use in the remainder of this work.


Figure 2.3.5: The depth map of a weather forecast[8].

2.3.2.3 Other state-of-the-art conversion methods

MakeMe3D software

The MakeMe3D software[9] uses self-developed object recognition and motion analysis to perform the 2D-to-3D conversion. This method is mainly based on the depth-from-motion concept as explained in paragraph 2.3.2.2. The technique used by MakeMe3D is called Morph3D and is supposed to achieve good results for movies in which the image movement rotates around an object, but has difficulties creating a depth effect in scenes without image movement. To enhance the 3D effect, two parameters can be adjusted: a positive or negative percentage indicating the amount of depth, and the frame offset used. Negative percentages push the image to the back, positive percentages bring it to the front. MakeMe3D bases the computation of the depth effect on the analysis of a number of single frames. The default offset value is 1, which means that the analysis considers frame 0 as well as frame 1. When the value is adjusted to 2, the algorithm considers frames 0 and 2 and skips frame 1. Skipping the image in between results in a better three-dimensionality for movies with little motion.

Frame difference method

An easy way of creating a depth effect is to show a certain frame to the left eye and the previous frame to the right eye. A bigger frame difference is also possible but will result in large artifacts in videos with a lot of movement. The two latter methods will be evaluated and compared in Chapter 4, which covers the experimental work of this dissertation.
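A minimal sketch of the frame difference method, assuming the decoded frames are available as a 4-D array, is:

```matlab
% Minimal sketch: stereo pair by frame offset (synthetic placeholder data)
video = rand(240, 320, 3, 50);               % stand-in for a decoded RGB sequence
offset = 1;                                  % 1 = show the previous frame to the right eye
numFrames = size(video, 4);
left  = video;                               % left eye: the current frames
right = video(:, :, :, max((1:numFrames) - offset, 1));   % right eye: delayed frames
% Larger offsets exaggerate the depth impression but cause artifacts in fast motion.
```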

2.3.2.4 Computer Generated Imagery

In this case, a stereo pair of views (in the case of stereoscopic 3DTV) can be generated by simulating the presence of two cameras using a CAD program[10]. All the information about depth, occlusion and transparency is assumed to be already known and available in the form of a dense map that can easily be integrated with the digital imagery.

2.4 Scene Representation Technologies

During the different stages of a 3DTV system, being content generation, transmission and display, a certain 3D scene representation is used. The content input may be synthetic (computer-generated) or captured from real scenes, the available bandwidth of the transmission channel may vary, and the display technology can change. Because the requirements can differ, the system has to support various representation techniques[11]. Dense depth, surface-based and volumetric representations are the three main ways of representing 3D scene information.

2.4.1 Dense Depth Representations

Dense depth representations can be created by storing the distances of the points in a 3D scene from the camera in a lattice, defined by the reference image and denoted as a depth map.

The concept of Layered Depth Images (LDI) represents a 3D object or a scene by one or more views with associated depth maps. For simple camera configurations, LDI can be used for automatic real-time depth reconstruction in 3DTV applications. Animated Framework eXtension (AFX) is the data and rendering format for LDI used by MPEG-4. The most important problem concerns depth discontinuities, such as object boundaries. These boundary effects can be reduced by alpha-matting, where over-blending of depth values is used over object boundaries.

The latter method faces a bandwidth problem, because the depth map information has to be transmitted over a capacity-limited channel. Therefore, the depth field is extracted in such a way that the resulting dense depth representation is easy to compress efficiently.

2.4.2 Surface-Based Representations

Four different types can be distinguished:

- Polygonal meshes: the surface is defined by polygonal meshes.

- Non-Uniform Rational B-Splines (NURBS): the surface is represented by a function of two parameters which defines a mapping of a 2D region into 3D Euclidean space.

- Subdivision surfaces: the surface is constructed from an arbitrary polygonal mesh by recursively subdividing each face.

- Point-based modeling: surface points, particles or surfels are used instead of polygons as simpler display primitives. This approach is more efficient regarding storage and memory.

2.4.3 Volumetric Representations

Volumetric representations refer to the parametrization of the reconstruction volume in a world reference frame. The representation can contain data associated with any location within a volume of interest, where the smallest representable amount of space is called a voxel. In multi-view stereo techniques, the 3D scene is reconstructed from multiple images that come from different viewpoints. There is no notion of a single depth dimension.

2.4.4 Texture Representations

The texture of a real object is usually extracted from a set of images that capture the surface of the object from various angles. Single-texture representations transform a true object surface to a 2D plane approximation. Multi-texture representations use the original texture plus a number of artificial textures, representing illumination and reflection effects.

2.4.5 Pseudo 3D Representations

These representations avoid using any 3D geometry to realize a 3D impression starting from 2D video. This can be obtained by image interpolation and image warping, or alternatively by light field representations. Using image warping, the virtual viewing position is not restricted to the line between two camera centers, which is the case when using image interpolation. Research has been done in the area of image interpolation of objects in dynamic scenes. Some of these methods can be used for 3DTV applications, where for example natural video objects obtained by image warping are augmented with synthetic content with the help of the MPEG-4 standard. Another approach uses a light field to make a 4D simplification of the plenoptic function, i.e. the 5-dimensional function representing the intensity or chromaticity of the light observed from every position and direction in 3-dimensional space. Virtual views are then obtained by interpolation from a parametrized representation that uses the coordinates of the intersection points of light rays with two known surfaces. The major problem of this representation is the large amount of data.


2.4.6 Object-Based 3D Scene Modeling

All the objects in a 3D scene have different characteristics, so the choice of the best representation technology depends on the given object's appearance, geometry and motion. Furthermore, a 3D model can be animated by changing certain model parameter values over time. Finally, the scene has to be rendered; rendering techniques try to model the interaction between light and environment to generate pictures of scenes.

2.4.7 Head and Body Specific Representations

Human head and body representations deserve special attention. A Human Body Model (HBM) can be used as a chain of rigid bodies, called links, interconnected by joints. Furthermore, 3D face modeling and animation is essential to accurately describe the human face using a computer. From this point of view, such representations may be less useful for 3DTV applications.

2.4.8 Existing standards

Because interoperability is an important requirement, standardized formats are essential for the success of 3D technologies.

- The Virtual Reality Modeling Language (VRML) is an International Organization for Standardization (ISO) standard for 3D scene representations, but has limited applicability for 3DTV applications because it cannot handle real-time behavior. Extensible 3D (X3D) is a successor of VRML providing real-time capabilities.

- MPEG-4 meets all the needs of 3D representations for 3DTV (see 3DAV[2]) and will form the basic format for most future 3DTV applications.

2.5 Coding algorithms

So far, different kinds of content generation approaches and scene representation technologies have been outlined in the previous sections. These different types of data must be encoded efficiently to ensure the success of 3DTV. Compression of stereo video, multi-view video and associated depth information requires other methods than classical video coding. As we already know from Figure 2.2.2, most 3DTV systems use a number of cameras to capture a scene. The simplest case is classical stereo[12] with two views. In multi-view coding[12], the correlations between adjacent camera views can be exploited to obtain a greater amount of compression.


The following sections consecutively describe conventional stereo coding, video plus depth, multi-view coding, 3D mesh compression and multiple description coding.

2.5.1 Conventional Stereo Coding

2.5.1.1 Basic Principle

Conventional stereo uses two images showing the scene from a slightly different viewpoint, corresponding to the distance between the human eyes. Because the images are very similar, they are well suited for compression when predicting one image from the other, already encoded one. Although the same principles for motion compensation can be used for disparity compensation, some differences must be taken into account. Zero disparity indicates a very large depth of the corresponding point in 3D, while points close to the camera have a large disparity. Because the statistics of these disparity vectors differ from those of motion vectors, adjustments in entropy coding could be necessary. Furthermore, differences in a stereo pair can also stem from scene lighting and surface reflectance. The major drawback of this approach is that the 3D impression cannot be modified: depth perception cannot be adjusted to different displays and sizes, the number of views cannot be changed, and head motion parallax (the movements the head makes to see the scene from different viewpoints) cannot be supported.

2.5.1.2 Stereo Encoding Formats

Frame compatible formats[13] refer to the class of stereo video formats in which the two views are essentially multiplexed into a single coded frame or sequence of frames. Half of the coded samples represent the left view and the other half represent the right view. Each coded view has half the resolution of the full coded frame. Different ways of packing exist; for example, each view may have half vertical or half horizontal resolution. The two views can then be interleaved taking alternating columns or rows, or can be positioned either side-by-side or top-bottom. Recently, checkerboard or quincunx sampling is becoming more popular, in which the two views are interleaved in alternating samples in both the horizontal and vertical dimensions (see Figure 2.5.1).


Figure 2.5.1: Frame compatible formats where 'x' represents the samples from one view and 'o' represents the samples from the other view[13].

Besides this form of spatial multiplexing, temporal multiplexing is also possible. The left and right views are interleaved as alternating frames or fields of a coded video sequence. These formats are called frame sequential (or frame interleaved) and field sequential. The resulting video stream will have a doubled frame rate, but the final impact on the bandwidth will depend on a number of factors. It is safe to say that the total bandwidth will be significantly less than twice the bit rate of a single-view video stream. The total bit rate will mainly be affected by the encoding algorithm, reference frame management and the correlation between both views. Thus, temporal multiplexing does not necessarily lead to a doubling of the required bandwidth.
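As a small illustration of the spatial multiplexing described above, the following sketch packs two views into one side-by-side frame with half horizontal resolution per view (image names are assumptions):

```matlab
% Minimal sketch: side-by-side frame-compatible packing (assumed input images)
L = imread('left.png');  R = imread('right.png');          % hypothetical stereo pair
halfL = imresize(L, [size(L, 1), round(size(L, 2) / 2)]);   % half horizontal resolution
halfR = imresize(R, [size(R, 1), round(size(R, 2) / 2)]);
packed = [halfL, halfR];          % one coded frame: left half | right half
% A decoder splits the frame again and upsamples each view back to full width.
```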


Figure 2.5.2: Illustration of prediction in the H.262/MPEG-2 Video Multi-View Profile[12].

2.5.1.3 The Multi-View Profile specification

A standard specification for the combination of inter-view coding and temporal coding has been defined in International Telecommunication Union - Telecommunications (ITU-T) Rec. H.262 / ISO/IEC 13818-2 MPEG-2 Video, the Multi-View Profile[12], as illustrated in Figure 2.5.2. The left view is encoded without any reference to the right view, using standard MPEG-2, which ensures backward compatibility since it is possible to decode the left-eye view and display 2D video. The right view is encoded using inter-view prediction in addition to temporal prediction. However, the gain in compression efficiency is limited, due to the fact that temporal prediction already performs very well. Typically, when temporal prediction is efficient for a certain frame, additional inter-view prediction does not increase the coding performance significantly. Research on compression of conventional stereo will continue in several directions, including for example abandoning backward compatibility to design more efficient inter-view prediction structures.

2.5.2 Video plus Depth

An alternative to using multiple views is coding the video signal with an associated depth map, from which a stereo pair can then be rendered at the decoder. This approach can result in efficient compression. The depth map can be seen as a monochromatic, luminance-only video signal, with the depth restricted to a range between the minimum and maximum distance of the corresponding 3D point from the camera. The depth range is linearly quantized with 8 bits, i.e., the closest point is associated with the value 255 and the most distant point with the value 0. If the chrominance is set to a constant value, the resulting grayscale image representing the depth map can be processed by any standard video codec. To be able to transmit additional depth information, MPEG specified a corresponding container format, ISO/IEC 23002-3 "Representation of Auxiliary Video and Supplemental Information", also known as MPEG-C Part 3, for video plus depth data in 2007. Available video codecs can be used; it is only necessary to specify high-level syntax that allows a decoder to interpret two incoming video streams correctly as color and depth. While this approach has the advantage of being backwards compatible with legacy devices, it is only capable of rendering a limited depth range since it does not directly handle occlusions.
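The 8-bit quantization mentioned above amounts to a simple linear mapping; a minimal sketch, with assumed near and far clipping planes, is:

```matlab
% Minimal sketch: linear 8-bit depth quantization (Znear -> 255, Zfar -> 0)
Znear = 1.0;  Zfar = 10.0;                    % assumed clipping planes in meters
Z = Znear + (Zfar - Znear) * rand(288, 360);  % placeholder depth map
Z = min(max(Z, Znear), Zfar);                 % clip to the representable range
v = uint8(round(255 * (Zfar - Z) / (Zfar - Znear)));   % quantized depth map
Zrec = Zfar - double(v) / 255 * (Zfar - Znear);         % approximate reconstruction
```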


Figure 2.5.3: Temporal/inter-view prediction structure for MVC[12].

2.5.3 Multi-View Video Coding

When the same scene is captured by many cameras from different viewpoints, it is logical that a great amount of inter-view dependencies exists. This can be exploited as illustrated in Figure 2.5.3, where images are not only predicted from temporally neighboring images but also from corresponding images in adjacent views[12]. The major drawbacks of this approach are the computational complexity, memory requirements and delay.

2.5.4 3D Mesh Compression

3D meshes are used in 3D video representations to represent the shape of 3D objects. Static 3D meshes consist of two types of data: connectivity, which describes the triangulation of the mesh vertices, and geometry, which assigns 3D locations to the vertices. Dynamic 3D meshes can be described as a sequence of static 3D meshes with a common connectivity, called frames. Both types of meshes show geometrical as well as temporal dependencies, which can be exploited in compression methods. Compression of dynamic 3D meshes[12] is a young but nevertheless active and prospective research area, which receives many impulses from static mesh compression.

2.5.5 Multiple Description Coding

Compressed 3D content has to be transmitted over error-prone channels, so it is necessary to utilize error resilience methods to ensure robust transmission. Multiple Description Coding (MDC)[12] is a promising approach that relies on splitting the data stream into two or more independently decodable bit streams, called descriptions. At the decoder side, the initial data can be recovered with an acceptable quality, and when more descriptions are received a higher fidelity is reached. The drawback of this approach is the creation of redundant data. The following subsections describe MDC of single-view visual data as well as MDC of multi-view data.

2.5.5.1 MDC of Single-View Visual Data

As we know, in many systems only one view is encoded independently while the others are predicted from this view. It is crucial that this leading view is coded in the most error-resilient way possible. For MDC of still images, the concept relies on adding controllable redundancy to the image bit stream, thus compromising between error resilience and compression efficiency. Some approaches add redundancy in the quantization stage; others operate in the transform domain, rearranging the transform coefficients and utilizing their spatial and scale positions to create redundancy. For video, MDC has to address both the coding of prediction errors and of motion vectors. Techniques for motion vectors are based on overlapped block motion compensation, which generates a smoother motion field. Blocks forming the generated motion fields are then partitioned into two or more coarser fields using quincunx subsampling and are split between the two or more descriptions.

2.5.5.2 MDC of Multi-View Video

In this section we review only two approaches for MDC of stereoscopic video.

In the first approach, two descriptions are formed by combining a leading view and a spatially scaled view. Frames from the leading view are predicted only from frames of the same view, while frames from the other view are predicted from frames of both views. In the first description the left-eye view is the leading view and the right-eye view is the spatially scaled view, while in the second description it is the other way around. If one description is lost due to channel failures, the reconstructed stereoscopic video pair gets one of the views with a lower spatial resolution.

The second approach is based on temporal subsampling. Odd frames of both the left and right views go to the first description, while even frames go to the second description. Motion-compensated prediction is performed separately in each description, where left frames are predicted from preceding left frames, and right frames are predicted from preceding right frames or from the left frames corresponding to the same time instant. If one description is lost, the sequence is reconstructed with half the frame rate.


Figure 2.6.1: Transmission technologies of 3D content[10].

2.6 Transmission Technologies

The goal of most 3DTV system developers is to create a backwards compatible and efficient 3DTV transport technology. Different display technologies may require different data representations, which may affect the chosen transmission method. The evolution of 3DTV transport[14] follows the path of analog broadcast, digital broadcast and finally Internet Protocol Television (IPTV). Figure 2.6.1 shows the possible platforms for the delivery of 3D video content.

2.6.1 3DTV broadcast

With the transition from analog to digital broadcast, MPEG developed the MPEG-2 Multi-view Profile (MVP) standard as described in Section 2.5.1. The Australian 3D company DDD proposed a system that encodes the depth data in a proprietary, very low bit rate format, which is then transmitted in the "private" or "user data" field of an MPEG-2 Transport Stream. A similar approach was followed by the European IST project ATTEST (Advanced Three-Dimensional Television Systems Technologies). Again, the video plus depth representation was used for transmitting 3D visual information. In contrast to the DDD concept, standard MPEG technologies were used for the compression of the depth information as well. Digital broadcasting is probably the most restrictive channel because of the limitation in bandwidth: 6 MHz per channel in the US and Japan and 8 MHz in Europe.


2.6.2 3DTV over IP Networks

The next step involves moving to transmission of video over the Internet and is currently an active research area. There are already many Video-on-Demand (VoD) services offered over the Internet, both for news and entertainment applications. The playback of IPTV requires either a PC or a set-top box connected to a TV. Video content is typically compressed using either an MPEG-2 or an MPEG-4 codec and then sent in an MPEG transport stream, delivered via IP multicast in the case of live TV or via IP unicast in the case of VoD.

2.6.3 Packaged material

Stereoscopic image sequences often require high-capacity devices to support the storage of these large video files. The discussion about future storage devices is still going on, and no single device has yet been claimed to be the best solution for the storage of stereoscopic material. Both storage costs and performance enhancement must be considered. For applications where fast playback is important, flash memory cards are becoming increasingly popular.

2.7 Display technologies

Many different methods for 3D displays exist, but no single approach has yet been able to capture the mass market. The more traditional stereo technologies[15] use one view for each eye and some kind of glasses (polarization, shutter, anaglyph) to filter the corresponding view. The newer technologies[15] do not need any kind of viewing aids and can be divided into three categories: autostereoscopic displays, volumetric displays and holographic displays. Normally, the term autostereoscopic applies to all display systems without the need for glasses, but here it is restricted to cover displays such as binocular, multi-view and holoform systems where only multiple 2D images across the field of view are considered. In holographic displays the image is formed by wave-front reconstruction and includes both real and virtual image reconstruction. In holography[15] the 3D scene is encoded into an interference pattern. This requires a high-resolution recording medium and replay medium, which places severe constraints on the employed display technology. Volumetric displays form the image by projection onto a volume, or use discrete locations of luminescence within a volume, without the use of light interference.

The biggest disadvantage of stereo-based display systems is the impact of binocular asymmetries, the differences between the left and right eye images of a stereo pair, causing viewing discomfort. The factors that determine the viewing comfort are crosstalk between the left and right image and vertical disparity, as discussed more thoroughly in Section 2.9.3.


Figure 2.7.1: Binocular parallax[16].

2.7.1 Classical Stereoscopic Displays

The principle of stereoscopic imaging systems[17] is based on displaying two views with a slightly different perspective in such a way that the left view is only seen by the left eye and the right view is only seen by the right eye. The horizontal distance on the display screen between corresponding points in the left and right view is called the screen parallax. When the parallax for a certain point in the image is zero, there is no difference between the two views and this point will be seen at the screen plane. A negative parallax occurs when R-L < 0 and results in objects in front of the screen. A positive parallax occurs when R-L > 0 and results in objects behind the display screen (see Figure 2.7.1).

Different setups can be used to view stereo pairs. The uncrossed or parallel setup directs the right image to the right eye and the left image to the left eye; this requires focusing beyond the images. The crossed setup makes the right eye see the left image and the left eye see the right image and requires crossing the eyes. Viewing the opposite way around reverses the sense of depth. Classical stereoscopic displays require the observer to wear some kind of glasses. We can distinguish three main types of 3D glasses.

Figure 2.7.2: Dolby RGB filter color wheel[18].

Anaglyph glasses

The method is based on wavelength-specific encoding, using for example blue and red wavelength filters as high-pass and low-pass filter, respectively. The two wavelengths must be separated from each other to prevent crosstalk (see Section 2.9.3), so opposite colors of the spectrum are preferred. The wavelengths of light opposite to the particular color filter in front of the eye are allowed to pass through and reach the eye, whereas wavelengths matching the color filter's range are blocked. This passive technique is one of the first popular and widely used 3D technologies and is still used today, as anaglyph glasses do not require special hardware other than cheaply made color filter glasses. Crosstalk can, however, cause viewing problems if the color filters are not perfectly matched to the colors being shown on the display. An example of red-blue anaglyph glasses is shown in Figure 2.7.3a. In Dolby 3D theaters[18] a technology called Infitec is used, which stands for interference filter technology (see Figure 2.7.2). Infitec uses a technique for channel separation in stereo projections based on interference filters: the channel separation of the projected images is done with a pair of glasses provided with selective interference filters, and each lens passes the suitable wavelength triple for the corresponding eye.
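For illustration only, a classical red-cyan anaglyph (not the Dolby interference-filter approach) can be composed from a stereo pair by combining complementary color channels, assuming two aligned input images:

```matlab
% Minimal sketch: red-cyan anaglyph from a stereo pair (assumed input images)
L = imread('left.png');  R = imread('right.png');
anaglyph = cat(3, L(:, :, 1), R(:, :, 2), R(:, :, 3));  % red from left view, green/blue from right
imshow(anaglyph);   % to be viewed with red (left eye) / cyan (right eye) filter glasses
```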


(a) Anaglyph glasses[19].

(b) RealD polarized glasses[20].

(c) Active shutter glasses[20].

Figure 2.7.3: Different types of 3D glasses.

Polarized glasses

Polarization is another method of displaying 3D content via passive glasses (see Figure 2.7.3b). When light shines on a polarizing filter, only the light with a specific polarization direction can pass through. When orthogonally linearly polarized light is used for the two image sources, projected in an overlapping way, and the observers wear polarized glasses with orthogonal polarizing filters, the left and right images are separated. Although the cost of the glasses is very low (less than 1 euro[21]), the cost of the polarizing system is high. This is primarily due to the special silver screen that is used to reflect the light while maintaining the polarization property. This technique does guarantee a flicker-free 3D experience and provides excellent rejection between left and right views, avoiding ghosting issues. This type of glasses makes it possible to watch 3D broadcasting in public places and is also used in most 3D cinemas.

LCD shutter glasses

Active shutter glasses (Figure 2.7.3c) are currently a very popular choice amongst consumer electronics giants who are investing in 3D display technology. The left and right images are presented alternately, with the synchronized glasses opening and closing in order to block or show the corresponding image. The left and right images are thus separated by time disparity. Although the switching frequency is supposed to be fast enough to make sure people will not perceive flicker, it might bother viewers who are very sensitive to low refresh rates. While the 3D displays using this technology are very cheap, the shutter glasses are more expensive than their passive counterparts. Furthermore, the glasses are battery-powered and must be recharged.

2.7.2 Multiple Image Displays

Multiple image displays, where two or more images are seen across the width of the viewing field, can be either holoform, multi-view or binocular. Holoform displays produce a large number of closely spaced views to provide smooth motion parallax; this gives the image a hologram-like appearance. In multi-view displays a series of discrete views is presented across the viewing field. One eye will lie in a region where one perspective is seen, and the other eye in a position where the adjacent perspective is seen. The simplest multiple image displays are binocular, where only a single pair of viewing zones is displayed. These can be of five basic types: the viewing zones can be formed by lenticular screens (see Figure 2.7.4, right side), by twin projectors, by parallax methods (see Figure 2.7.4, left side), by Holographic Optical Elements (HOEs) or by prismatic screens.

Figure 2.7.4: Parallax-barrier stereo display and lenticular stereo display[22].

2.7.3 Volumetric

Volumetric displays reproduce the surface of an image within a defined volume of space. Because these displays create an image in which each point has a real point of origin in space, the images can be viewed from different angles. The most important drawback is that they invariably suffer from image transparency, where parts of an image that are normally occluded are seen through the foreground object. Volumetric displays can be of two basic types: virtual image displays, where the voxels are formed by a moving or deformable lens or mirror, and real image displays, where the voxels are on a moving screen or are produced on static regions.

2.8 Existing 3DTV Standards and Techniques

2.8.1 MPEG Standards

Considering the need for interoperability and compatibility in home entertainment systems and mobile applications, it is essential to standardize media formats for representation, coding and transmission. This means that every step in the 3D chain is independent of the others, as long as it produces content in the standardized format. Furthermore, backward compatibility is also necessary in order for 3D systems to be successful in the mass market. In addition to ITU-T, which is responsible for the development of Recommendations such as the well-known H.264 standard, ISO-MPEG is one of the major institutions that create such specifications of media standards. In Section 2.5 about coding techniques the basic available 3D video formats and standards have already been outlined. The following subsections give an overview of more advanced MPEG video formats[23], ending with a discussion on the current activities within MPEG.

Figure 2.8.1: DES, extending high quality stereo with advanced functionalities based on view synthesis[23].

2.8.1.1 Depth Enhanced Stereo

The first home user systems are still based on conventional stereo. Therefore a concept called Depth Enhanced Stereo (DES) was proposed as a generic 3D video format (Figure 2.8.1). It provides backward compatibility and additional depth and possibly occlusion layers. On top of that, content is decoupled from the display. DES combines the important features of all other basic 3D video formats.


Figure 2.8.2: MVD format and view synthesis for efficient support of multi-view autostereoscopic displays[23].

2.8.1.2 Multi-view Video plus Depth

Multi-view Video plus Depth (MVD) can be regarded as an extension of the regular video plus depth format. Efficient support of multi-view autostereoscopic displays is illustrated in Figure 2.8.2. A display is used that shows 9 views (V1-V9) simultaneously. From a specific position a user can see only one stereo pair of them (Pos1, Pos2, Pos3), depending on the actual position. Transmitting these 9 display views directly, e.g. using MVC, would be very inefficient. Therefore, in this example only 3 original views V1, V5 and V9 are in the decoded stream, together with the corresponding depth maps D1, D5 and D9. From these decoded data the remaining views can be synthesized by depth image based rendering (see Section 2.8.2).

2.8.1.3 Layered Depth Video

Different types of Layered Depth Video (LDV) exist: one type uses one color video with an associated depth map and a background layer with an associated depth map (see Figure 2.8.3). Other types of LDV include one color video with associated depth as the main view, together with one or more residual layers of color and depth. The residual layers include data from other viewing directions, not covered by the main view.


Figure 2.8.3: Layered depth video[23].

Figure 2.8.4: Target of 3D video format[23].



Figure 2.8.5: The 3DTV data representation using video plus depth[24].

2.8.1.4 Recent MPEG activities

Based on evolving market needs, MPEG is considering a new phase of standardization with the targets illustrated in Figure 2.8.4. The first objective is to enable stereo devices to cope with varying displays and viewing preferences. The second is to facilitate support for high-quality autostereoscopic displays. Furthermore, the new format aims to enhance 3D rendering capabilities beyond video plus depth, while not incurring a substantial rate increase. Finally, it should reduce the rate requirements relative to sending multiple views directly with MVC or a multicast format.

2.8.2 Image Based Rendering

The main idea of Image Based Rendering (IBR)[24] is to derive an almost generic depth-based data representation from captured images in order to decouple camera and display geometry. This can be achieved by estimating depth information from a given stereo or multi-view camera system and using these depth data at the receiver side to recalculate a virtual stereo pair. The ATTEST system is based on the transmission of regular video images with additional depth maps providing a Z-value for each pixel, a data representation which is often called video plus depth (see Figure 2.8.5), as also described in Section 2.5.2. The final stereo images are then reconstructed at the receiver side by using depth image based rendering (DIBR) techniques. These techniques consist of a two-step process: first, the original image points are reprojected into the 3D world, utilizing the respective depth data. Thereafter, these 3D space points are projected into the image plane of a virtual camera, which is located at the required viewing position. The concatenation of reprojection (2D-to-3D) and subsequent projection (3D-to-2D) is usually called 3D image warping. With this, the ATTEST concept has some crucial advantages over former 3DTV proposals, such as backward compatibility to existing 2D services, efficient compression capabilities and a high adaptability to 3D display properties, viewing conditions and user preferences.

Figure 2.8.6: Virtual shift-sensor cameras for stereo reproduction from video plus depth representations (ZPS = Zero Parallax Setting)[24].

2.8.2.1 Single Video plus Depth

The concept of a virtual stereo setup is outlined in Figure 2.8.6 and shows the adaptation possibilities at the receiver side. Two virtual cameras are positioned perpendicularly to the display surface, where the parameter tc denotes the inter-axial distance between the two views. Both cameras have the same focal length f, and the convergence distance Zc defines the depth layer of zero parallax values. Objects which are displayed with negative parallax values will appear in front of the 3D display and, vice versa, those with positive parallax values will appear behind the display surface. In general, the convergence distance Zc is defined by the convergence angle between the two cameras (i.e., the intersection between the two optical axes). However, as the virtual setup from Figure 2.8.6 already uses a parallel camera geometry and is therefore already rectified, the parameter Zc is in this case given by the sensor shift hr,l (i.e. the intersection of the optical rays through the centers of the two shifted sensors). The 3D reproduction can then be influenced by the choice of the parameters as shown in Table 2.1.


Parameter                  +/-   Parallax   Perceived depth                       Object size
Inter-axial distance tc     +    Increase   Increase                              Constant
                            -    Decrease   Decrease                              Constant
Focal length f              +    Increase   Increase                              Increase
                            -    Decrease   Decrease                              Decrease
Convergence distance Zc     +    Decrease   Shift (in front of display surface)   Constant
                            -    Increase   Shift (behind display surface)        Constant

Table 2.1: Control parameters for adaptation of depth reproduction[24].
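These trends can be made concrete with the commonly used shift-sensor parallax relation p = f * tc * (1/Zc - 1/Z), with f expressed in pixels and the parallax defined as R-L as in Figure 2.7.1. The sketch below only illustrates the behaviour summarized in Table 2.1 and is not the full warping procedure of [24]; all parameter values are assumptions.

```matlab
% Minimal sketch: screen parallax of the shift-sensor setup, p = f*tc*(1/Zc - 1/Z)
f  = 1000;          % focal length in pixels (assumed)
tc = 0.06;          % inter-axial distance in meters (assumed)
Zc = 3.0;           % convergence distance: objects at Zc get zero parallax
Z  = linspace(1, 10, 200);          % object depths in meters
p  = f * tc * (1 ./ Zc - 1 ./ Z);   % parallax in pixels (negative = in front of the screen)

% Trends of Table 2.1: doubling tc doubles the parallax (and hence the perceived depth),
% while increasing Zc shifts the whole scene towards the viewer at constant parallax range.
p_doubled_tc = f * (2 * tc) * (1 ./ Zc - 1 ./ Z);
p_larger_Zc  = f * tc * (1 ./ 5.0 - 1 ./ Z);
plot(Z, p, Z, p_doubled_tc, Z, p_larger_Zc);
xlabel('object depth Z [m]'); ylabel('parallax [pixels]');
legend('reference', '2 x t_c', 'Z_c = 5 m');
```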

An inherent drawback of the video plus depth concept is the possibility that areas which are occluded in the original view become visible in any of the virtual views, an effect called disocclusion. Within the ATTEST project, the missing regions were concealed by smoothing the original depth data with a 2D Gaussian low-pass filter in order to avoid the appearance of holes.

2.8.2.2 Multi-view

As outlined in Figure 2.8.7, future 3DTV production will exploit different sources of 3D data acquisition and 3D content creation. This includes standard stereo cameras with two views, depth range cameras providing video plus depth streams directly, multi-baseline systems with more than two cameras, and post-processing tools which allow conventional 2D movies to be converted manually to the desired 3D representation format. The N video plus depth streams are coded with MVC (see Section 2.5.3) and then converted to the M display views using DIBR. Because the number of cameras is often smaller than the number of display views, the missing views can be reconstructed using interpolation methods. For more information on depth map creation for multi-view systems, see [24].

Figure 2.8.7: Advanced concept for 3DTV with multiple views[24].

2.9 Visual discomfort in stereoscopic displays

In contrast to 2D video, both image quality and visual comfort must be comparable to conventional standards to guarantee a good viewing experience. This has not been accomplished yet, so research in this area is very important to understand the factors that contribute to visual comfort in stereoscopic displays[25]. During this research I have evaluated different 2D-to-3D algorithms by subjectively measuring the visual comfort of observers when watching different video fragments. The following sections give an impression of the human perception of depth and of the factors that can cause visual discomfort, in order to provide a good understanding of the experiments described in Chapter 4.

2.9.1 Human perception of depth

Even though the retinal images from which depth information can be extracted are strictly 2D, humans can perceive depth. Two-dimensional scenes already contain monocular cues, depth perception cues that can be perceived with only one eye. One of the most common monocular cues is that if one of two equally large objects is located closer to the observer, this object will look larger. Apart from monocular cues, more depth information can be obtained from binocular cues. Because our eyes are horizontally separated, each eye receives a slightly different retinal image. Stereopsis is the perception of depth that is constructed based on the difference between these two retinal images. The brain fuses the left and right image, and from retinal disparity, i.e. the distance between corresponding points in these images, it is able to extract depth information. The horopter is defined as the line that connects all points with zero disparity. A small region around the horopter is the region where binocular single vision takes place, where the two retinal images are fused into a single image with depth perception.


Whenever we look at objects, our eyes are accommodated and converged by an amount that depends on the distance between us and the object of interest. Vergence can be defined as the movement of the two eyes in opposite directions to locate the area of interest on the fovea (the part of the eye located in the center of the macula region of the retina, responsible for sharp central vision), a process that is primarily disparity driven. Accommodation can be defined as the alteration of the lens to focus the area of interest on the fovea, a process that is primarily driven by blur. The ranges of accommodation and vergence where both systems do not introduce any errors form the zone of comfortable vision. The brain combines depth information from all depth cues. In stereoscopic systems, these cues may conflict, and there is still no clear theory about how depth perception is affected when this happens. Depth cue integration is an important area of research.

Finally, people not only have individual preferences when it comes to stereoscopic applications, but also their gender, race and age might affect their preference for stereoscopy. For example, an important characteristic that differs between individuals is the Inter-Pupillary Distance (IPD). Both angular disparity and perceived depth depend on the IPD. People with a smaller IPD perceive more depth than people with a large IPD for a fixed screen disparity and viewing distance, so they reach disparity limits more rapidly. The Accommodative Convergence/Accommodation (AC/A) ratio is another characteristic that differs between individuals. It describes the change in convergence due to accommodation per change in accommodation, i.e. the magnitude of the cross-link interaction. It seems that people with extremely high AC/A ratios have trouble with binocular fusion and depth perception. Finally, differences in pupil diameter between individuals also affect stereoscopy. Visual abilities also vary with age as a result of changes in the structures of the eye. Accommodative ability decreases with age, starting around 40 years up to approximately 55 years, when little or no accommodation remains.

2.9.2 Visual fatigue

Visual fatigue is a term that refers to a decrease in performance of the human visual system and can be objectively measured. The subjective counterpart of visual fatigue is visual discomfort; visual discomfort is often a consequence of visual fatigue. Research revealed that causes of visual fatigue when watching stereoscopic displays include anomalies of binocular vision, dichoptic errors (e.g. crosstalk), conflict between convergence eye movement and accommodation functions, and excessive binocular parallax. Possible objective indicators for the measurement of visual fatigue are pupillary diameter, near and light pupillary reactions, critical fusion frequency, visual acuity, near point, refractionability, visual field, stereo acuity, fixation stability, accommodative response, AC/A ratio, heterophoria, convergent eye movement, spatial contrast sensitivity, color vision, light sense, blink rate, tear film break-up time, pulse rate and respiration time. In order to be able to determine the amount of visual fatigue, both objective and subjective standards need to be combined.

2.9.3 Factors that cause visual discomfort

- Presence of vertical parallax: this can be described as the vertical disparity between the left and right image and can occur when the camera positions of stereo cameras are not aligned perfectly. Vertical parallax causes discomfort and should be kept as close to zero as possible.

- Excessive negative or positive parallax: excessive positive (uncrossed) disparity causes the eyes to diverge and fail to fuse correctly on the screen; similarly, excessive negative (crossed) disparity causes the eyes to converge at an uncomfortable 3D position in front of the display. Objects with excessive negative or positive parallax can cause visual discomfort (see Figure 2.7.1). The on-screen disparity can be calculated by multiplying the image disparity (in pixels) by the pixel size. A general limit of one degree of disparity can be applied; if this limit is not exceeded for a certain viewing distance, viewing should remain comfortable. (A small numerical check of this limit is sketched after this list.)

- Accommodation and convergence mismatch: when our eyes try to fixate or converge on the 3D object, they also try to focus or accommodate on the screen where the object is the sharpest. Accommodation distance is thus constant while vergence distance is not. Research shows that this can induce visual fatigue when objects are displayed far away from the display plane.

- Crosstalk: separation of the left and right eye view is one of the major challenges for every display developer. Imperfect separation makes a small proportion of one eye's view perceptible to the other eye, a phenomenon called crosstalk. Crosstalk generally becomes more noticeable with an increase in image separation between the left and right image. This separation, however, is what creates the depth effect, so an optimal balance needs to be found between the added depth value and the negative effect of crosstalk.
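The sketch below shows the arithmetic behind the one-degree guideline; pixel pitch, pixel disparity and viewing distance are assumed values for a typical Full HD monitor.

```matlab
% Minimal sketch: is the on-screen disparity within the ~1 degree comfort limit?
pixelPitch   = 0.000265;   % meters per pixel (assumed ~0.265 mm, 23" Full HD panel)
disparityPx  = 40;         % screen disparity in pixels (assumed)
viewDist     = 0.8;        % viewing distance in meters (assumed)

disparityM   = disparityPx * pixelPitch;                 % disparity on the screen surface
disparityDeg = 2 * atand(disparityM / (2 * viewDist));   % angle subtended at the eye
comfortable  = disparityDeg <= 1.0;                      % one-degree guideline
fprintf('disparity = %.2f deg, within limit: %d\n', disparityDeg, comfortable);
```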

To obtain a comfortable viewing experience it is important that the objects are displayed within the stereoscopic comfort zone (see Figure 2.9.1). The ideal situation is to provide maximum depth but lowest parallax, by placing principal objects so that approximately half of the parallax values are positive, half negative.


Figure 2.9.1: Stereoscopic comfort zone[16].


Chapter 3

Implementation

3.1 Introduction

The main objective of this chapter is to describe every step the 2D video sequences went through, from analyzing and editing to 3D rendering of the created 3D content. Figure 3.1.1 depicts the consecutive steps of the proposed implementation to create left and right views that can be combined into a 3D video.

Figure 3.1.1: Overview of consecutive steps to go from 2D to 3D representation.



The chroma key and background video were analyzed and edited to separate audio from video and to extract the raw YUV sequences. This is covered in the preparation section, which contains all the technical details. This editing phase is necessary for the background/foreground classification to be able to access the pixel values of the videos. Different approaches are discussed, but only the last algorithm (see Section 3.3.4) is used for the final implementation. The classification method generates a binary image, indicating foreground and background pixels. This binary image is passed on to the 2D-to-3D conversion block, which creates the left and right image sequences. The source code of the complete implementation can be found in Appendix E. In addition, WMV-compressed files of both the background and foreground streams of a weather forecast are provided, together with the corresponding audio file. These video files have to be converted to the raw YUV format (by executing the first script of Appendix D) so that they can be given as arguments to the implemented functions. Subsequently, the output left and right YUV sequences can be converted to WMV video files by executing the second script of Appendix D. These can be offered to a stereoscopic player as left and right videos, and the audio file can be specified to add sound to the 3D experience.

3.2 Preparation

3.2.1 Provided material

A 15.6 inch ASUS G51J laptop (with NVIDIA GeForce GTS 360M graphics card) was provided to work with, together with active shutter glasses (see Section 2.7.1). Stereoscopic video sequences were played with the stereoscopic player Stereoplayer, which relies on the NVIDIA 3D Vision technique. In the experiments, which are covered in Chapter 4, an Alienware OptX AW2310 3D 23 inch Full HD widescreen monitor is used as the display screen. In addition, 2D video streams were provided by the VRT to be able to test the 2D-to-3D conversion technique. This technique will be described in the forthcoming sections.

3.2.2 Analyzing video material

The offered video streams, both the background video and the chroma key plus foreground video, are compliant with the core MXF application specification for the VRT DMF DV25 format[26]. MXF is a container format which supports a number of different streams of coded essence, encoded with any of a variety of codecs, together with a metadata wrapper which describes the material contained within the MXF file. This section outlines the results of analyzing the MXF streams with the IRT MXF Analyzer[27], a tool which provides in-depth information about the complete file structure.


The essence container has three labels:

- MXF-GC IEC DV 625x50I 25Mbps: MXF-GC Frame-wrapped IEC-DV 625x50I 25Mbps (06.0e.2b.34.04.01.01.01.0d.01.03.01.02.02.02.01)

- MXF-GC AES-BWF Audio: MXF-GC Frame-wrapped BWF (Broadcast Wave Format) audio data (06.0e.2b.34.04.01.01.01.0d.01.03.01.02.06.01.00)

- MXF-GC Generic Essence Mappings: MXF-GC Generic Essence Multiple Wrappings (06.0e.2b.34.04.01.01.03.0d.01.03.01.02.7f.01.00)

The essence will be frame-wrapped and interleaved frame-by-frame. There is one GenericCompoundElement that contains the video as well as four audio channels.

3.2.2.1 The video format

As stated before, the PictureEssenceCoding is IEC-DV Video 25Mbps 625x50I. This means that the DV25 video is encoded in accordance with IEC 61834. SignalStandard shall be 1 (Standard Definition). StoredWidth, SampledWidth and DisplayWidth are all 720, while StoredHeight, SampledHeight and DisplayHeight are all 288. The chroma subsampling scheme is 4:2:0; therefore, the chroma subsampling factor is 2 in both the horizontal and vertical dimension.

3.2.2.2 The audio format

The audio component is 48 kHz, 16 bit linear PCM and contains 4 channels. Usually it is wrapped as AES3 audio data, but in this case BWF wrapping is used.

3.2.3 Editing video material

FFmpeg[28] is a complete cross-platform solution to convert and stream audio and video. For editing the original video material offered by the VRT, the FFmpeg command line tool offered enough functionality. It allows converting between different multimedia formats, splitting and merging audio and video, editing video files, etc.

Video streams (as recorded in the control room of the VRT) are provided, in which the preparation and comments of the director before and after the forecast are still present. First of all, the two offered video sequences, with a total length of approximately 380 seconds, have to be shortened. The final length of the real weather forecast is 163 seconds. Several parts need to be omitted and the remainder merged back together to obtain the forecast as seen on television. Only the parts corresponding to the fragment where the presenter is located in front of a chroma key screen will be used as arguments for the proposed algorithm. This is possible using the -ss option, which allows you to extract a certain number of frames. The intro, middle part and outro are separated from each other in the video where the presenter is located in front of the chroma key. From the background video only the part that matches that middle part is extracted. These parts are then merged together in order to obtain the desired video stream. The cut sequences were saved in the raw YUV format in order to be able to access the raw pixel values. Furthermore, the audio that corresponds to the middle part of the video is separated and saved in the WAV format. The scripts used to create these video fragments and audio files can be found in Appendix B.

3.3 Background/Foreground classification

The editing and analyzing phase resulted in two cut video streams of equal length and a complementary audio file. The goal is to separate the foreground from the background as accurately as possible to guarantee the generation of depth maps without artifacts in further processing. Therefore, a carefully designed background/foreground classification is necessary. The computing language Matlab is used to implement this classification, and for reading and writing YUV sequences an existing toolbox[29] is used. In the forthcoming section covering the implementation of the 2D-to-3D conversion technique, the same Matlab toolbox will be used.

The weather forecast implies that the scene is captured by a stationary camera. If there are no moving objects in the scene and no variations in illumination, the intensity values of each background pixel should remain constant. In practice, however, different effects such as moving objects or noise cause the pixels to deviate from this constant value. Therefore, it is impossible to classify the pixels by only comparing their intensity value to a constant value. The following sections outline the different background subtraction methods that were implemented and tested, ending with a description of the final classification method.

3.3.1 Difference between reference frame and current frame

The easiest and fastest implementation compares a sample of the background frame to each pixel in the current frame. When the absolute value of the difference between these two pixel values is smaller than some threshold, the pixel is classified as a background pixel. This method requires a carefully chosen threshold for each of the three channels (either RGB or YUV). Results showed (see the left image of Figure 3.3.1) that problems arise around the edges of the foreground object, and even inside the object when its pixel values are too close to the background pixel values. This classification method did not deliver acceptable results for background subtraction and was discarded.
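A minimal Matlab version of this first attempt is sketched below; the background sample, the per-channel thresholds and the file names are assumptions, not the code of Appendix E.

```matlab
% Minimal sketch: per-pixel absolute difference against a background sample
bg    = double(imread('background_sample.png'));   % hypothetical chroma-key frame
frame = double(imread('current_frame.png'));       % hypothetical current frame
T = [25 15 15];                                    % assumed thresholds per channel

isBackground = true(size(frame, 1), size(frame, 2));
for c = 1:3
    isBackground = isBackground & (abs(frame(:, :, c) - bg(:, :, c)) < T(c));
end
foregroundMask = ~isBackground;                    % 1 = foreground pixel
imshow(foregroundMask);
```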


Figure 3.3.1: Applying morphological opening to the initial background classification.

3.3.2 Background classification using morphological operators

To overcome the problems described in the previous section, a classification method based on morphological operators can be used. First, a rough detection of the foreground pixels is performed using the algorithm discussed in Section 3.3.1, which classifies the pixels as either black or white. Inspecting the resulting image, small white regions are detected in the object and around the edges. In order to use the opening and closing operations it is necessary that a pixel value of 1 represents an object pixel, so in the following the reverse image is used. Applying the opening operation to the reverse image, which removes areas of black pixels smaller than a certain size, sets most of these areas to white, as can be seen in the second image of Figure 3.3.1. Some noise is still noticeable around the edges of the foreground object; this can be reduced by applying the opening operation to the reverse of the second image, this time removing small white noise areas. The definition of four-connectivity (pixels are four-connected neighbors when they touch along an edge, i.e. they are connected horizontally or vertically) is used to define a connected object. After this classification, the two channels can be merged again, yielding a better result than the previous method.
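A minimal Matlab sketch of this clean-up, assuming the Image Processing Toolbox is available and using an illustrative minimum region size of 50 pixels:

% 'rough' is the binary classification of Section 3.3.1 (1 = object pixel).
clean = bwareaopen(rough, 50, 4);      % remove small foreground specks (4-connectivity)
clean = ~bwareaopen(~clean, 50, 4);    % remove small holes inside the object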

3.3.3 SACON: A consensus-based model for background subtraction

This method[30] exploits color and motion information to detect foreground and background. It assumes that the camera is stable during the video sequence and relies on the difference between the pixels of the current frame and a set of reference background pixels. It keeps track of a number of previous background samples at each pixel position and uses these to classify the current pixel as a background or foreground pixel. The difference between the current pixel value and each background sample is calculated and compared to a threshold. Two approaches exist to determine this threshold value. The first is to set a global value for all pixels empirically. The second is to estimate the standard deviation at each image pixel and take a multiple of it (typically 2.5 or 3); the threshold at a certain pixel position is then the minimum of this value and a constant global value. If the difference between the current pixel and a background sample is smaller than this threshold, a binary variable is set to 1, otherwise to 0. When the sum of all these binary variables is bigger than another threshold (which is proportional to the number of samples), the pixel is classified as a background pixel.

After implementing and testing this approach, it turned out that the method is too complex for this simple background subtraction problem: the computational cost is too high to apply it to a large number of frames.
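For reference, the core of the per-pixel consensus test could be sketched in Matlab as follows; the variable names, the scalar threshold Tr and the proportionality constant alpha are assumptions made for illustration and do not follow the notation of [30].

% 'samples' is an H-by-W-by-N array with the N most recent background samples
% per pixel, 'frame' the current frame, Tr a scalar difference threshold.
N = size(samples, 3);
agree = abs(repmat(double(frame), [1 1 N]) - double(samples)) < Tr;
isBackground = sum(agree, 3) > alpha * N;   % consensus threshold proportional to N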

3.3.4 Background subtraction based on absolute value classification and edge detection

Finally, the best results are obtained using background subtraction based on several classification steps, as shown in the scheme of Figure 3.3.2.

Figure 3.3.2: Background subtraction scheme.


Figure 3.3.3: Histograms of Y, U and V.

Figure 3.3.4: Background model after classification steps 1 and 2.

1. After investigating the histograms of the Y, U and V values of a green key frame (see Figure 3.3.3), we can conclude that the luminance value, shown in the left histogram, varies a lot between the different green key pixels. This is due to the difference in exposure during the capturing of the scene. Therefore, we classify a pixel as background if its U or V value lies within a certain range, and do not consider the Y value. This range can be derived from the U and V histograms and lies between 106 and 130 for the U value and between 101 and 129 for the V value. Because these values are not the same for every green key frame, as stated before, we chose to classify a pixel as background when the difference is smaller than or equal to 20 for both the U and the V value. This initial classification results in a background model for a randomly chosen frame as shown in the left image of Figure 3.3.4.

2. As can be derived from the background model of the first classification step, the edges are very rough and need to be smoothed. Therefore, we use a classification based on RGB values. This classification uses a bigger range for the R, G and B values and classifies more edge pixels as foreground, which results in smoother edges. This is shown in the right image of Figure 3.3.4.


Figure 3.3.5: Edge model.

Figure 3.3.6: Final classification of background and foreground pixels.

3. Because neither of the two results in an ideal classification on its own, both background models are combined using the OR operator.

4. After these operations, some pixels at the edges are still incorrectly classified as foreground pixels. Therefore, an edge detection model is used to classify the edge pixels as background pixels. The edge model is shown in Figure 3.3.5 and the final background model in Figure 3.3.6.
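A condensed Matlab sketch of these four steps is given below. The RGB reference values and tolerances, the use of a Canny detector for the edge model and the upsampling of the chroma planes are assumptions made for illustration; the U and V ranges follow the histograms discussed above.

% Step 1: chroma range test on the U and V planes (the Y plane is ignored);
% U and V are assumed to be upsampled to the same resolution as the RGB planes.
bgUV  = (U >= 106 & U <= 130) & (V >= 101 & V <= 129);
% Step 2: wider test on the RGB values (uint8 planes), which smooths the edges;
% Rkey, Gkey, Bkey and tR, tG, tB are assumed reference values and tolerances.
bgRGB = abs(double(R) - Rkey) < tR & abs(double(G) - Gkey) < tG & abs(double(B) - Bkey) < tB;
% Step 3: combine both background models with the OR operator.
bg = bgUV | bgRGB;
% Step 4: reclassify the remaining edge pixels as background using an edge model.
edges = edge(rgb2gray(cat(3, R, G, B)), 'canny');
bg(edges) = true;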

3.4 Generation of 3D video

The generation of 3D content forms the final phase in the conversion of the two video streams into left and right views, ready to be played by a stereoscopic player. The major advantage in this area of research is the possibility to start very early in the 3DTV content generation chain. This allows 3D content to be generated while the two video streams are being merged to obtain the weather forecast. Two methods were implemented, which basically differ from each other only in the way they render the final views. The first implementation exploits the pixel classification method described in Section 3.3.4 to create an appropriate depth map; an existing rendering tool is used for the generation of the left and right view. The second implementation uses the same method to extract depth information but, in contrast to the first approach, deploys a different rendering method using a cut-and-paste technique.


Figure 3.4.1: Frames representing the stereo pair generated with the video plus depth algorithm (depth value = 44).

3.4.1 Video plus depth

After the classification made according to the algorithm described in Section 3.3.4, it is possible to generate a depth map immediately. The depth map is represented by a YUV sequence that only uses one plane (the luminance intensity value). When a pixel is classified as foreground (the presenter), it is assigned a certain depth value; a value of 40, for example, brings the foreground more to the front. The remaining pixels are assigned another depth value, e.g. 1, which corresponds to no depth.

In order to generate the left and right view, the Depth Image Based Stereoscopic View Rendering (DIBR) toolbox can be used. Its input arguments are limited to the 2D video sequence (in this case the weather forecast as seen on television) accompanied by the corresponding depth map YUV sequence. The program performs stereoscopic view generation, resulting in a left-eye and a right-eye view. The depth maps are interpreted according to the MPEG N8038 informative recommendations. Disocclusions are filled by background pixel extrapolation, so the resulting video shows blur in the disoccluded areas. An example is shown in Figure 3.4.1, where a depth value of 44 is assigned to every pixel of the foreground object. This algorithm will also be evaluated and compared to other conversion techniques in Chapter 4. An important remark regarding the applicability of video plus depth: every 2D-to-3D conversion technique has to create some sort of depth map; this algorithm only specifies the rendering method and requires a per-pixel depth map as input.


3.4.1.1 Performance

This technique remains very slow when applied to the long video sequence of a weather forecast. The main bottleneck lies in the hole filling and depth image based rendering functions. These functions come from existing toolboxes and speeding up their code would lead us too far. Finally, for this kind of application Matlab may not provide the fastest solution and other platforms should be considered.

3.4.2 Proposed generation of left and right view

The proposed 2D-to-3D conversion method uses the same depth extraction method as Section 3.4.1. In the latter, a color image and a corresponding per-pixel depth map are sufficient to obtain a 3D video representation, using an existing Depth Image Based Rendering toolbox[31] for the generation of the left and right view. The biggest drawback of this method is that the available information concerning the background pixels is not used. Hole filling is applied to deal with the disocclusion problem, resulting in hazy parts around the edges of the foreground object, as shown in Figure 3.4.1. The left view clearly demonstrates the created artifacts, for example the white regions around the edges of the foreground. For the specific application of a weather forecast, this hole filling is not necessary (in the basic implementation) because the background pixel values are always defined. Therefore, an improved rendering method is implemented that uses the background/foreground classification to generate a left-eye and right-eye image. This makes the use of depth maps obsolete, but requires an implementation in which both pixel classification and rendering of the left and right view are performed simultaneously.

3.4.2.1 Basic implementation

The easiest approach isolates only the presenter from the other pixels and shifts him a certain number of pixels to the right to create the left view. This operation simulates the effect of horizontal disparities that are introduced when recording stereoscopic video with virtual cameras separated by a certain horizontal distance. Possible vertical disparities, and their negative effects, cannot occur in this implementation. Because each background pixel is defined, the problem of disocclusions can be avoided. The original merged video stream will be the right-eye view, and the stream with the shifted foreground will be the left-eye view. Doing so, a depth effect is obtained when switching between these two views. A crossed setup is used, directing the left view to the right eye and the right view to the left eye. It should be noted that the perception of depth entirely depends on the settings of the stereoscopic player: left and right views can be swapped, the parallax can be adjusted and other parameters can be set.

Figure 3.4.2: Frames representing the stereo pair generated with the basic implementation (pixel shift = 20).

Results are depicted in Figure 3.4.2, where the left view contains clearly fewer artifacts than the corresponding view in Figure 3.4.1 created by the previous method. Although this approach basically assigns the same depth level to each pixel of the presenter, the perception of depth by the human visual system aids in creating a better 3D experience. The cardboard effect (a distortion peculiar to stereoscopic images whereby objects appear flattened in depth) is not dominantly present; when viewing the stereoscopic video one can, for example, clearly see the presenter's hands appearing more to the front (while this is actually not the case).
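A sketch of this view generation for a single image plane is given below; the variable names are illustrative and pixels shifted beyond the right border are simply discarded.

% 'merged' is the broadcast frame (right-eye view), 'bg' the corresponding frame
% of the separate background stream, 'fg' the foreground mask and 'shift' the
% horizontal disparity in pixels.
right = merged;
left  = bg;                                   % every background pixel is defined, so no holes
[r, c] = find(fg);
keep = c + shift <= size(bg, 2);              % discard pixels pushed outside the frame
src = sub2ind(size(bg), r(keep), c(keep));
dst = sub2ind(size(bg), r(keep), c(keep) + shift);
left(dst) = merged(src);                      % presenter copied 'shift' pixels to the right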

3.4.2.2 Extended implementation

To obtain more pronounced 3D effects, it is possible to shift certain areas of the background stream. The first step is to apply some kind of segmentation to the background stream. To accomplish this, an existing Matlab template matching toolbox[32] can be used. The applied method is a CPU-efficient function that calculates matching score images between a template and a color 2D image. It calculates the Sum of Squared Differences (SSD block matching) as well as the Normalized Cross-Correlation (NCC), which depends only on texture and not on illumination. These two measures are compared to a predefined threshold while scanning the image to find areas that match certain template images, e.g. a cloud on the weather map. Obviously, this method implies certain inherent difficulties, the most important one being the disocclusion problem: image information that is occluded in the original view may become visible in the shifted image.
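The matching step could be sketched as follows, here using Matlab's built-in normalized cross-correlation rather than the toolbox of [32]; the threshold of 0.8 is an assumed value.

% 'bgGray' is a grayscale background frame and 'tmpl' a grayscale template
% image, such as a cloud cut from the weather map.
ncc = normxcorr2(tmpl, bgGray);
[peak, idx] = max(ncc(:));
if peak > 0.8
    [py, px] = ind2sub(size(ncc), idx);
    row0 = py - size(tmpl, 1) + 1;    % top-left corner of the matched region
    col0 = px - size(tmpl, 2) + 1;
    % the matched region, masked by the binary template, is then shifted
    % horizontally in the same way as the foreground in Section 3.4.2.1
end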


Figure 3.4.3: Frames representing the stereo pair generated with the extended implementation (pixel shift foreground = 10, pixel shift background = 6).

Figure 3.4.4: Frames representing the stereo pair generated with the basic implementation (pixel shift = 10).

The question is how to cover these disocclusions in a visually acceptable manner. To address this issue we use horizontal extrapolation, which fills each hole with the average of four pixels: the boundary pixel with the greater depth and the three pixels next to it. An example is shown in Figure 3.4.3, where the foreground is shifted 10 pixels and the clouds in the weather map are shifted 6 pixels. The 3D effect accomplished here places the clouds between the weather map and the presenter. Figure 3.4.4 depicts the same stereo pair when the basic implementation is applied and no background shift is introduced.
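A simple sketch of this horizontal extrapolation for one image plane, under the simplifying assumption that the defined background always lies to the left of a hole:

% 'plane' is one image plane of the shifted view and 'hole' a logical mask of
% the disoccluded pixels.
[H, W] = size(plane);
for y = 1:H
    for x = 5:W                                             % four pixels to the left must exist
        if hole(y, x)
            plane(y, x) = mean(double(plane(y, x-4:x-1)));   % boundary pixel plus its three neighbours
        end
    end
end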


Template generation

It should be noted that, during the testing of the extended implementation, templates such as the clouds in the weather map have to be defined. No ground truth information about the composition of the background stream was available; therefore I retrieved the template images using Adobe Photoshop CS5 software[33], which offers tools for cutting out the desired template images. In the source code I use these template images (corresponding to objects in the weather map) as default arguments for the matching function. If perfect template images are available for matching, these should be offered to the program to ensure optimal detection of the objects; in that case the default templates are not used. Even so, this does not entirely solve the problem: the template matching toolbox only accepts rectangular templates, so additional binary templates are created to specify which pixels of the template image correspond to the object of interest. If this modification is not performed, the entire rectangular template receives the same depth value as the object, resulting in a loss of 3D effect for the objects matching the defined templates. The proposed approach can be optimized by extending the template matching function.

3.4.2.3 Extended implementation for a positive parallax effect

The extended implementation can be used for other purposes than introducing additional 3D animations. As discussed earlier, the effect of positive or negative parallax can be essential to create a pleasant viewing experience. The previous paragraphs solely focus on a negative parallax effect, making objects appear in front of the display screen. This paragraph describes the further extensions that were made to obtain a positive parallax effect. The entire background image can be offered to the extended implementation as one big template image. Changing the shifting directions then brings the background to the back, instead of bringing the foreground to the front; the background template image is shifted to both the left and the right. Rendering the resulting 3D video creates a whole new visual 3D experience compared to the previous effects. An example is shown in Figure 3.4.5, where the total shift is set to 6 pixels. It should be noted that the uncrossed setup is used here, where the left image is directed to the left eye and the right image to the right eye. The observers have the impression of 3D in a box, with template images appearing behind the display screen. Empirical work comparing positive and negative parallax effects is described in Chapter 4, Section 4.3.4.
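A minimal sketch of this variant is shown below. Splitting the total shift evenly over the two views and the use of circshift (whose wrapped border columns would still need the hole filling described earlier) are assumptions of the sketch, not the exact implementation.

% 'bg' is a frame of the background stream; 'totalShift' is the desired
% disparity in pixels (6 in the example of Figure 3.4.5).
half = floor(totalShift / 2);
viewA = circshift(bg, [0,  half]);     % background shifted one way for one view
viewB = circshift(bg, [0, -half]);     % and the opposite way for the other view
% with the presenter composited at his original position in both views, he stays
% at screen depth while the background appears to recede behind the screen.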


Figure 3.4.5: Left and right view frame with positive parallax effect (pixel shift background = 6).

3.4.2.4 Performance

For now, the biggest problem remains the execution time of the implemented methods. Segmenting each frame of the background video stream, with additional template matching, requires a great amount of time. Furthermore, hole filling of disoccluded areas in the background image is very time consuming as well. Consequently, the extended implementation is the slowest, due to the template matching and hole filling functions. As mentioned before, the template matching function comes from an existing toolbox, and speeding up this code would lead us too far, deviating from the main goals of this dissertation. Finally, as stated in previous sections, for such applications Matlab may not provide the fastest solution and other platforms should be considered.

3.5 Discussion on complexity

In the following chapter, the proposed methods will be evaluated and compared against other state-of-the-art algorithms. Therefore, a short discussion on the complexity of the designed methods is given in this section. The algorithms used in the chapter on experimental work are video plus depth (with DIBR) as described in Section 3.4.1, the basic implementation (Section 3.4.2.1), the extended implementation (Section 3.4.2.2) and the extended implementation for a positive parallax effect (Section 3.4.2.3). The evaluated state-of-the-art methods are limited to the algorithms described in Section 2.3.2.3, corresponding to the MakeMe3D software and the frame differencing method.

First of all, I will give some general remarks concerning the implemented technique.


2D-to-3D conversion technique                               Execution time (seconds per frame)

Implemented algorithms
  Video plus depth                                          30.4051
  Proposed generation of left and right view
    Basic implementation                                     2.3327
    Extended implementation                                 22.1346
    Extended implementation with positive parallax effect   12.1845

State-of-the-art algorithms
  MakeMe3D                                                   0.2261
  Frame difference                                           0.0667

Table 3.1: Comparison of execution times of the different algorithms.

The technique works in the pixel domain, so no transformations such as the Discrete Cosine Transform (DCT) were applied to the input video images. Pixel-domain operations imply a higher complexity but result in better quality and accuracy. Furthermore, one of the strengths of this application is that no encoding/decoding phase is necessary, because raw video files are provided. This way, additional artifacts created by consecutive encoding and decoding steps are avoided.

Table 3.1 depicts the execution time, in seconds per processed frame, of each technique applied to the weather forecast streams. The execution times were measured on a dual-core Intel processor (2.1 GHz). As can be derived from the table, the execution times of the proposed methods are a lot higher than those of the state-of-the-art algorithms.

This slow execution time is the result of several factors. Matlab, for instance, may not be the most efficient programming language for processing videos with a high number of frames. The basic implementation takes slightly over 2 seconds per frame to execute; the most time consuming functions are the reading, writing and editing of YUV images, which come from the YUV toolbox. The extended implementation used to obtain a positive parallax effect is slower, about 12 seconds per frame. The main reason for the larger execution time is the hole filling: when the entire background is shifted, hole filling becomes necessary, resulting in a slower execution. Next, the extended implementation needs more than 20 seconds per frame. In addition to the previously described time consuming functions, the extended implementation relies on segmentation and template matching functions, which require even more time to execute.

There is a lot of room for optimization in the proposed algorithm, especially in the case of the extended implementation. For foreground/background classification, it is possible to skip, for example, every other frame, so that only half of the frames need to be processed. This would double the speed of the algorithm while retaining a reasonable quality.


Another possibility is to enhance the existing template matching and segmentation functions. Furthermore, one could apply the segmentation only once every 10 frames (or more) instead of for every frame; for static background video sequences this would still yield good results. The video plus depth algorithm, which relies entirely on a DIBR toolbox for the rendering of the left and right view, accepts a 2D image accompanied by a depth map as input arguments. This approach turns out to be the most time consuming of all; unless we dive into the source code of these existing functions, there is little room for improvement here. Finally, the source code of each of the proposed techniques can easily be converted to multithreaded code, in order to utilize multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. The algorithm does not rely on previous frames, so each frame can basically be processed independently of the others. This parallelization could significantly reduce the execution time of the proposed algorithms.
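As an illustration, such frame-level parallelization could be sketched as follows; convertFrame is a hypothetical routine standing in for the per-frame processing chain.

% Each frame is independent, so the main loop can become a parfor loop
% (Parallel Computing Toolbox) to use all cores of an SMP machine.
parfor k = 1:numFrames
    [leftViews{k}, rightViews{k}] = convertFrame(frames{k});   % hypothetical per-frame routine
end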

In the following chapter, subjective user tests are conducted to evaluate the different 2D-to-3D techniques in terms of viewing comfort. It will become clear that visual comfort is a key factor in 2D-to-3D conversion techniques, and that fast algorithms often do not produce enjoyable stereoscopic video material. A summary of all advantages and drawbacks of the evaluated algorithms is given in the final chapter.

Chapter 4

Experimental work

4.1 Introduction

The main objective of this chapter is to explore whether the newly designed 2D-to-3D conversion algorithm is appropriate for application to a 2D weather forecast. Experiments were conducted in which different existing 2D-to-3D conversion techniques, all aiming to create the most enjoyable 3D video, were applied and evaluated. Variations in depth value, number of 3D animations and parallax effect were used to create a realistic set of test scenes to explore the visual perception of 3D video.

4.2 Experimental setup

First of all, the experiment was restricted to a subjective test setup, because objective measurements would lead us too far. The goal was to display 3D videos as pleasing as possible, while still retaining the three-dimensional effect that needs to be accomplished. For instance, disparities caused by an excessive pixel shift will cause viewing discomfort, resulting in an unpleasant 3D experience. Different methodologies for the subjective assessment of 3DTV applications are described in the ITU-R BT.1438[34] recommendation for stereoscopic television pictures. These are adopted from the ITU-R BT.500-11[35] recommendation for the subjective assessment of the quality of television pictures. The proposed methods are used to measure overall quality and impairment of distorted still images and image sequences. They can also be used to obtain ratings for certain attributes such as perceived depth and visual comfort. In this experimental setup, a variant of the double-stimulus impairment scale (DSIS) method was applied. The main goal of this research is to investigate which algorithm delivers the best results for the specific application of a weather forecast.


Figure 4.2.1: Structure of evaluation[35].

Supplementary results about the perceived depth are investigated, including the effect of negative or positive depth, the effect of enlarging the pixel shift and the effect of putting more objects in the three-dimensional space.

The design, observers, equipment and procedures were the same for every test objective. These are briefly discussed in the following sections.

4.2.1 Design

The test took approximately 30 minutes. According to the ITU-R BT.500-11 recommendation[35], a couple of training sequences were introduced to stabilize the observers' opinions and clarify the test procedure. A random presentation order was used and three different test sets were created. The test condition order was arranged in such a way that any effects of tiredness or adaptation on the grading were balanced out across the different test sets.

The test sets consisted of 22 evaluations, and in each evaluation two video fragments of 17 seconds had to be compared and graded. The fragments were converted from raw YUV files to Windows Media Video (WMV) files, resulting in compressed videos. The fragments were separated by a mid-gray image sequence lasting 3 seconds, as indicated by T2 in Figure 4.2.1. T4 denotes a mid-gray image sequence lasting 10 seconds, in which the observer had time to decide whether fragment 1 or 2 was the most enjoyable. Prior to the experiment, observers were given a brief introduction; questions concerning the test session were answered after the short training session that was conducted.

4.2.2 Observers

24 observers participated in the four experiments, and the entire test, covering every experiment, consisted of 22 evaluations. Each experiment consisted of a set of evaluations, which were assessed by the observers in a random order.


There were three different test sets, so in general each test set was assessed by 8 observers. The observers were non-experts, meaning that they were not directly concerned with stereoscopic video quality as part of their normal work. The majority of the observers were students from the same age group (18-30 years old); the others were between 50 and 60 years old. For the experiment, the assumption had to be made that every observer had good visual acuity, good stereo vision and good color vision. When analyzing the results, observers whose scores deviated strongly from the averaged overall judgements were not taken into consideration; only one set of results deviated from the common score in 45% of the questions and was left out of the results.

4.2.3 Equipment and test environment

A 23-inch Alienware stereoscopic display was used in this experimental setup, with active shutter glasses to provide the 3D experience. The recommended viewing distance depends on the diagonal width of the display: as a rule of thumb, the diagonal width can be multiplied by 1.56 to obtain the optimal viewing distance for the observers[36]. In this case, a viewing distance of approximately 100 cm was used. The lighting conditions of the room, slightly dimmed, were constant for all observers.
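As a quick back-of-the-envelope check of this rule of thumb for the display used here (this calculation is added for illustration): 23 in x 1.56 is roughly 36 in, or about 91 cm, which is indeed close to the adopted viewing distance of roughly 100 cm.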

4.2.4 Evaluation

Every experiment applied the same evaluation method, being a variant of the DSIS method as mentioned before. All assessments were made on paper, to avoid additional distraction caused by flickering of a laptop screen when wearing the 3D shutter glasses.

4.3 Experiments

This part describes the different experiments the observers were involved in. A subdivision into four distinct experiments was made in order to infer results on different aspects. Each experiment section gives a short introduction to the context, followed by a presentation of the video material used, and ends with a discussion of the results. The first experiment focuses on the perception of depth for changing assigned depth values. The second experiment compares different 2D-to-3D conversion techniques, including the proposed algorithms of Chapter 3 and several state-of-the-art algorithms. The third experiment tries to find out whether viewers enjoy more 3D animations, comparing only fragments generated with the proposed algorithm. Finally, the fourth experiment tries to derive viewers' preferences concerning the effect of a positive or negative parallax in 3D video.


4.3.1 Experiment 1: perception of depth for changing depth values

4.3.1.1 Introduction

This experiment explores the way observers perceive different levels of depth in video sequences such as a weather forecast. Its main goal was to evaluate whether users prefer a small or a large amount of depth, for a realistic set of video sequences generated with four different algorithms.

4.3.1.2 Stimuli

The video material used in this experiment consisted of 4 evaluation sets, each containing two video fragments of 17 seconds. Each evaluation set was related to a certain algorithm and consisted of a video fragment in which a small amount of depth is introduced, and a video fragment with a bigger amount of depth. All fragments consisted of the same video sequence of the weather forecast, being the first 17 seconds. The first 5 seconds consisted of the introduction part of the weather forecast and were necessary to stabilize the viewer's opinion. After these 5 seconds, the real weather forecast began and the 3D effect could be evaluated.

Not all 2D-to-3D conversion algorithms use the same input sequences: algorithms 1 and 3 expect two 2D video sequences as input, being the background and the foreground video; algorithms 2 and 4 expect one merged 2D video sequence (the weather forecast as seen on television). The first algorithm is the newly designed algorithm, where a pixel shift of 6 and 10 pixels respectively was used to obtain a small and a larger 3D effect (see Section 3.4.2.1). The resulting left and right images for a pixel shift of 10 are shown in Figure 4.3.1. The second algorithm uses the open source MakeMe3D software for conversion and requires the setting of a certain percentage to denote a small or larger depth effect (see Section 2.3.2.3). Values of 25% and 50% were used to obtain a small and a larger 3D effect, and the frame offset parameter was adjusted to its ideal value; this parameter had to be set to 2 to obtain the best result for a weather forecast with very few moving objects. Example left and right frames created by this technique are shown in Figure 4.3.2. The third algorithm (see Figure 4.3.3) uses a mixture of the first algorithm and the standard DIBR (see Section 3.4.1). Depth values of 12 and 22 were used, which correspond to approximately 6 and 10 pixels respectively in the first algorithm. Finally, the fourth algorithm uses a frame difference of 1 frame and 2 frames respectively to obtain the desired 3D effect (see Section 2.3.2.3). A simple AviSynth (a tool for video post-processing) script was used to create this video (see Appendix C). The same frame was extracted again for the left and right view in Figure 4.3.4, the right frame being the frame right before the left frame, thus with a frame offset of 1 frame.

Figure 4.3.1: Frame from left and right view created by the first algorithm (pixel shift = 10).

Figure 4.3.2: Frame from left and right view created by the second algorithm (depth level = 50%, frame offset = 2).


Figure 4.3.3: Frame from left and right view created by the third algorithm (depth value = 22).

Figure 4.3.4: Frame from left and right view created by the fourth algorithm (frame offset = 1).

4.3.1.3 Results

The graph in Figure 4.3.5 shows the results of comparing fragments of the four algorithms at a specific depth. The bars represent the number of times (expressed as a percentage) that a certain fragment was chosen as the most pleasant to watch.

Figure 4.3.5: Experiment 1: Comparison of the effect of different depth.

Looking at Figure 4.3.5, some results are more pronounced. A first conclusion is that in 41.3% of the evaluations the users had no idea which fragment was better, denoted by the leftmost bar in Figure 4.3.5. It can thus be concluded that a small or somewhat larger depth effect does not affect the visual viewing comfort of the user, as long as the amplification of depth remains acceptable. The two bars next to the no-preference bar represent the comparison of the effect of enlarged depth for the proposed algorithm: as can be derived from the graph, more than twice as many observers prefer the bigger depth effect. For the MakeMe3D algorithm (corresponding to Alg. 2) and the DIBR algorithm (corresponding to Alg. 3), the majority of the observers prefer the larger depth. The only exception is the fourth algorithm, where only a very small percentage of the observers chose the larger 3D effect over the small 3D effect; a difference of two frames between left and right view obviously results in too many artifacts with this technique. In general we can conclude that for the largest part of the observations, people do not notice much difference when the depth effect is enlarged, as long as it remains comfortable to watch; perceived depth is usually not affected by the different conversion algorithms.


4.3.2 Experiment 2: comparing 2D-to-3D conversion methods at a constant depth level

4.3.2.1 Introduction

The main goal of this experiment is to investigate which 2D-to-3D algorithm viewers prefer for a chroma key application, and whether the amount of introduced depth is important in their ranking of the four algorithms. In contrast to experiment 1, the emphasis is now put on the quality of the applied conversion algorithm, not on the added depth value.

4.3.2.2 Stimuli

The material used in this experiment consisted of two sets of 6 evaluations, each containing two video fragments. Each set compares the algorithms pairwise, to make it possible to determine the overall best algorithm. The first set is related to a pixel shift of 6 and the second set to a pixel shift of 10. The same stimuli are used as in the previous section: the four algorithms applied to the 2D weather forecast, once with a small depth level and once with a bigger depth level. Whereas Section 4.3.1 compares the perceived depth for different depth levels of the same algorithm, this section is dedicated to the comparison of different algorithms using the same depth level.

4.3.2.3 Results

Comparison of the different algorithms with a small depth effect

The winner in this experiment, with a little over 39% of the votes as shown in Figure 4.3.6, is the third algorithm, where in addition to the regular depth extraction a general hole filling method is used to fill the gaps. The newly designed algorithm, where no hole filling is necessary, is a close second with 33.33% of the votes. In the third algorithm a little bit of blur is introduced around the edges of the foreground object, unlike in the first algorithm, in which this blur could be avoided by taking precise pixel values from the background image. This is an unexpected result, and we can conclude that observers do not notice this small amount of blur in a video sequence. The fourth algorithm ends up in third place with 14.49% of the votes and the MakeMe3D algorithm finishes last with only 0.72%. Only 12.32% of the observers did not show any preference concerning the most enjoyable algorithm.

Figure 4.3.6: Experiment 2: Comparison of the different algorithms when introducing a small depth effect.

Comparison of the different algorithms with a larger depth effect

The ranking deduced in the previous paragraph remains the same when the depth level is increased. Once again, the winner turns out to be the third algorithm, with 38.41% of the votes (see Figure 4.3.7). For a larger depth effect, the difference between the first and the third algorithm is smaller. This points out that for a larger depth effect the amount of blur also increases, resulting in a little more discomfort. With an increasing depth effect, the result of the second and fourth algorithm becomes even more unpleasant, resulting in a lower score than for the small depth effect. However, it can be concluded that changing the depth level does not have any significant effect on the users' preferences.

4.3.3 Experiment 3: adding additional 3D animations

4.3.3.1 Introduction

The main goal of this experiment is to discover whether users enjoy a more vivid 3D environment, obtained by putting more objects of the weather map at a certain depth. The basic implementation of the designed algorithm is compared to the extended implementation; no other algorithms are considered.


Figure 4.3.7: Experiment 2: Comparison of the different algorithms when introducing a large depth effect.

4.3.3.2 Stimuli

The video material used in this experiment consisted of only two evaluations, each containing two video fragments. The first evaluation compared the basic implementation with a pixel shift of 6 pixels (see Figure 4.3.8) to the extended implementation with a pixel shift of 10 pixels for the presenter and a total pixel shift of 6 for the clouds on the weather map (see Figure 4.3.9). The second evaluation aimed at creating a bigger 3D effect by enlarging the pixel shift to 10 pixels for the clouds and 20 pixels for the presenter. A different part of the weather forecast was used here for evaluation: the effect of additional 3D animations was more obvious in a sequence where the weather map was present, and by changing the scene, observers were more likely to pay attention to the difference in the amount of 3D animations.


Figure 4.3.8: Frame from left and right view created by the basic implementation of the first algorithm (pixel shift = 10).

Figure 4.3.9: Frame from left and right view created by the extended implementation of the first algorithm (pixel shift foreground = 10, pixel shift background = 6).

4.3.3.3 Results

Exploring the effect of more objects in 3D for a small depth value

Figure 4.3.10 shows that about 65% of the observers enjoy the additional 3D animations, such as the clouds in the weather map. They prefer this conversion algorithm over the basic implementation, where depth is only assigned to the foreground. Only a small part (8.7%) of the observers claimed not to see any difference between the video sequences. This observation could indicate that there is genuinely a portion of the observers with a decreased ability to view stereoscopic images.


Figure 4.3.10: Experiment 3: Exploring the effect of more objects in 3D for a small depth effect.

Exploring the effect of more objects in 3D for a larger depth value

Even when introducing a larger depth effect, the extended implementation with more objects brought to the front is chosen over the basic implementation (see Figure 4.3.11). Every single observer had an explicit preference in this evaluation, and although the extended implementation received the most votes, the video with a smaller depth effect and more objects in 3D is more popular than the one with a larger 3D effect. It can be concluded that increasing the depth level does not change user preferences when exploring the effect of more objects in 3D.


Figure 4.3.11: Experiment 3: Exploring the effect of more objects in 3D for a large depth effect.

4.3.4 Experiment 4: comparing negative and positive parallax effects

4.3.4.1 Introduction

The use of negative parallax, where the object seems to be positioned in front of the display screen, is often considered more tiring than the use of positive parallax, where the object appears behind or in the screen. The negative parallax effect is often used to impress the viewer, but it can result in a very uncomfortable viewing experience. This experiment aimed at investigating whether this observation holds when applied to the designed technique. Again, only video sequences of the implemented 2D-to-3D conversion method were used; no other algorithms were considered.

4.3.4.2 Stimuli

The video material used in this experiment consisted of 4 evaluations, each containing two video fragments. The extended implementation described in Section 3.4.2.3 was used to shift the entire background image, create the left and right view and obtain the effect of positive parallax, as shown in Figure 3.4.5.


Figure 4.3.12: Experiment 4: Exploring the effect of a negative or positive parallax.

Two depth levels, a pixel shift of 6 and of 10, were used again. The basic implementation, where a negative parallax is used to obtain the 3D effect, was applied again to obtain two video sequences with two different depth levels. Each sequence with a positive parallax was compared to each sequence with a negative parallax.

4.3.4.3 Results

As can be derived from Figure 4.3.12, the video sequence with a small depth effect and a positive parallax received 31.52% of the votes and turns out to be the most pleasant to watch among the four video fragments. The second-ranked fragment, with 29.35% of the votes, had a larger depth effect and also a positive parallax. When positive and negative parallax effects are compared, the results clearly show that observers prefer positive parallax over negative parallax. Still, 23.91% of the evaluations were answered with no preference, indicating that not everybody can make a clear distinction between these effects.


4.4 Summary of results

The experiments conducted in this research led to several important results:

• The first experiment showed that observers tend to prefer a slightly larger depth value over a small one, as long as the applied 2D-to-3D conversion technique produces stereoscopic video material without too many depth artifacts.

• The second experiment showed that, for both small and larger depth levels, observers prefer the implemented 2D-to-3D techniques over the evaluated state-of-the-art algorithms. They adhere to the following ranking of most enjoyable algorithms, sorted from best to worst:

  1. Video plus depth algorithm with DIBR for rendering of the left and right view (Algorithm 3 in the previous figures, described in Section 3.4.1).
  2. Proposed implementation of left and right view (Algorithm 1 in the previous figures, described in Section 3.4.2).
  3. Generation of views by introducing a frame difference (Algorithm 4 in the previous figures, described in Section 2.3.2.3).
  4. MakeMe3D software (Algorithm 2 in the previous figures, described in Section 2.3.2.3).

• The third experiment demonstrated that observers enjoy an increased amount of 3D animations in the proposed implementation. The level of depth of these animations does not have a significant impact on their preferences.

• The fourth experiment showed that 3D behind the display screen is more comfortable and less tiring to watch than 3D "out of" the display screen. However, it is important not to exaggerate the assigned depth in order for the viewing experience to remain enjoyable.

Chapter 5

General discussion

5.1 Conclusions

5.1.1 Introduction

The concept of stereoscopic television has been around for a long time, but a real breakthrough from conventional 2D television to 3DTV is still pending. Over the last few years it has become obvious that 3DTV applications can only be a lasting success when the perceived image quality and the viewing comfort are at least comparable to conventional 2D television. In addition, 3D television should be compatible with 2D television to ensure a gradual transition. However, rapid progress in the fields of 3D content production, coding and display has brought the realm of 3D closer to reality than ever before.

An important step in every 3D system is the generation of 3D content. Different 3D content capturing methods have been discussed thoroughly throughout this dissertation, but these focus only on generating new 3D content. The tremendous amount of existing media in 2D format reinforces the need to view it in three dimensions. People want to be able to convert their own videos to 3D before they will invest in buying a 3D television. Efficient and fast conversion of 2D content to 3D content is therefore essential for the breakthrough of 3DTV. 2D-to-3D conversion techniques recover depth information by analyzing and processing 2D video.

5.1.2 Conclusions concerning the proposed implementation

This dissertation proposes a method for 2D-to-3D conversion. The software is designed to meet the needs of 2D-to-3D conversion for a chroma keying application, in which two video streams of background and foreground can be used to create the necessary depth effect. Video streams of a weather forecast provided by the VRT were used to test the resulting implementation. The use of a chroma key screen when recording a weather forecast allows a special technique to be applied for the extraction of 3D information. The foreground is detected using a background subtraction method and the background can be replaced by a separate background video stream. Creating left and right images by shifting the foreground pixels does not result in disocclusions, because each pixel value of the background image is defined. In order to enhance the three-dimensionality of the weather forecast, a segmentation algorithm can be applied to the background video sequence and a depth value is assigned to every pixel of the detected objects.

The main drawback of the proposed conversion method is the execution time. Due to the segmentation of the background images and the use of existing Matlab toolboxes that need a great amount of time to execute functions such as template matching and rendering of the left and right view, real-time conversion is not yet possible in the current implementation. However, speeding up the application is possible by using other programming environments and toolboxes. Several optimization possibilities were already suggested in Section 3.5, but further investigation in this field should be considered.

5.1.3 Conclusions concerning experimental results

2D-to-3D algorithms can only become successful when viewing comfort is taken into account. The second goal of this thesis was therefore dedicated to the comparison of different algorithms, including the newly developed method, in terms of visual viewing comfort. Visual discomfort is the subjective counterpart of visual fatigue and can occur with certain stereoscopic content: fast motion in depth, 3D artifacts resulting from insufficient depth information and unnatural amounts of blur can all contribute to an unpleasant viewing experience. Four different algorithms were compared at two different depth levels, with either a basic or an extended amount of 3D animations, and with a positive or negative parallax effect. Among these algorithms, two exploit the advantage of having two video streams to retrieve depth information. The first algorithm (see Section 3.4.2.1) is implemented solely on the basis of an existing toolbox to read and write YUV files and uses no other software for rendering the left and right view. The second algorithm (see Section 2.3.2.3) is an existing open source conversion tool that expects a 2D video (e.g. the weather forecast as shown on television in 2D) and returns the left and right view as output. The third algorithm (see Section 3.4.1) is implemented on the basis of an existing toolbox for DIBR, but still uses both video streams to create the depth map. Finally, the fourth algorithm (see Section 2.3.2.3) relies on a frame difference method and uses a simple AviSynth script to generate the left and right view.

Subjective results obtained from experiments comparing different video fragments led to the general conclusion that the algorithm using DIBR for rendering gives the best viewing experience. The first algorithm, however, comes as a close second, which is not unexpected considering that these videos are very similar. The only difference between these algorithms is the hole filling method in the third algorithm, whereas there is no need for hole filling in the first algorithm and precise pixel values can be assigned where gaps occur. It is important to remark that both techniques use the same approach for depth information extraction. It is a rather unexpected result that people do not notice the blur surrounding the 3D object for the third algorithm; the limited number of observers, however, requires some caution when drawing conclusions from a very small difference in the percentage of votes. An obvious result remains that the second and fourth algorithm offer insufficient results for the conversion of a weather forecast application. The MakeMe3D software relies on depth-from-motion analysis to retrieve depth information from the video sequence; due to the static nature of a weather forecast video sequence, 3D artifacts become inevitable. The fourth algorithm is based on frame differencing for creating the left and right image. When using a frame difference of only one frame, little variation between the left and right image is obtained; when using a frame difference of more than one frame, too many 3D artifacts are introduced, resulting in discomfort throughout the complete weather forecast.

Moreover, tests were conducted concerning the appearance of objects in front of or behind the display screen. Observers show a strong preference for background images appearing further away over foreground images appearing closer. Using a negative parallax value for the weather forecast, the foreground comes out of the display screen. Because some parts of the presenter are up against the edges of the display, the stereo effect can be broken, creating conflicting sensory information. Using the segmentation algorithm with predefined templates, more astonishing effects are created from the viewpoint of the observers, and 3D animations should definitely be included in the design of 3DTV applications.

5.1.4 General conclusion

Table 5.1 depicts the drawbacks and advantages of the evaluated implementations. In this research, the emphasis lay on creating stereoscopic video that is enjoyable to watch, rather than on creating a real-time application in which execution time is essential. From this point of view, the suggested 2D-to-3D conversion technique offers the best solution, resulting in 3D video material with a minimal amount of artifacts and a maximal depth effect. In general, an important conclusion drawn from this research is that no single conversion method outperforms the others for every application. Existing depth extraction algorithms based on motion analysis will perform better on video sequences with a lot of movement, while the newly designed technique is preferable in chroma keying situations with little motion. The choice of 2D-to-3D conversion algorithm strongly depends on the type of application and must be taken into careful consideration when creating the most enjoyable viewing experience.


Implemented algorithms

Video plus depth
  Drawbacks:  Very slow.
  Advantages: Most positive subjective scores concerning visual comfort. Some speed-up possible.

Proposed generation of left and right view
  Basic implementation
    Drawbacks:  Not very fast.
    Advantages: Very positive subjective scores concerning visual comfort. Speed-up possible.
  Extended implementation
    Drawbacks:  Very slow.
    Advantages: Creation of clear and enjoyable 3D animations with very few artifacts. Speed-up possible.
  Extended implementation with positive parallax effect
    Drawbacks:  Slow.
    Advantages: Preferred over the negative parallax effect according to subjective user tests. Speed-up possible.

State-of-the-art algorithms

MakeMe3D
  Drawbacks:  Very low subjective scores concerning visual comfort. Not suitable for conversion of static video scenes.
  Advantages: Fast. Open source software.

Frame difference
  Drawbacks:  Low subjective scores concerning visual comfort. 3D experience not impressive.
  Advantages: Very fast.

Table 5.1: Advantages and drawbacks of the evaluated algorithms.


5.2 Future work

The usability of the suggested implementation depends strongly on the type of application. Although the current approach is not suitable for real-time applications, adjustments can be made to make the implementation faster and more robust. Furthermore, changing certain arguments should make the application suitable for a more general system in which a chroma key video stream needs to be replaced with another background stream. One of the main purposes of this work was to underline that 3DTV applications with an enjoyable 3D effect will be possible in the future. In addition, future research and implementations should consider the fact that viewers prefer a box effect rather than objects appearing in front of the screen. The difficulty remains that increasing the depth level not only increases the 3D effect but also decreases the visual comfort. Therefore, it is important to gain a good understanding of the impact of adjusting these parameters.

The question lingers whether 3D television will ever reach the mass market and penetrate living rooms. Stereoscopic television has the big disadvantage of requiring special glasses and will probably vanish in the shadow of autostereoscopic television. However, for now, the gaming industry has already shown itself willing to follow recent trends by producing 3D laptops, while the film industry increasingly produces 3D content. The speed at which 3DTV is introduced will strongly depend on the effort made to create attractive 3D content. Nevertheless, one can be optimistic that new breathtaking visual experiences lie ahead of us.

Appendix A

Evaluation form

Thank you for taking part in this study of different 2D-to-3D conversion methods. Please fill in your personal details in the table below.

Name
Age
Occupation
Gender    M/F

Example evaluations

Two example evaluations are given first. You will first see a grey screen showing "Evaluation" followed by a number. You will then see two video fragments, separated from each other by a grey screen lasting 3 seconds. After that you will again see a grey screen for 10 seconds, during which you have time to indicate which fragment you found the most pleasant to watch.

Keep the 3D glasses on while the video fragments are playing; when you are given time to evaluate, briefly take the glasses off and indicate which fragment you consider the best. Note that you only have 10 seconds until the next evaluation, so make sure you put the 3D glasses back on in time.

Evaluatie 1 Welk fragment is volgens u het beste?

74

75

0 Fragment 1 0 Fragment 2 0 Geen idee

"Het eerste fragment was volledig 2D dus fragment 2 is hier het beste fragment."

Evaluatie 2 Welk fragment is volgens u het beste?

0 Fragment 1 0 Fragment 2 0 Geen idee

"Het tweede fragment is niet aangenaam om naar te kijken dus fragment 1 is hier het beste fragment."

Testevaluaties U ziet telkens een scherm verschijnen met het nummer van de evaluatie.

Er worden in

totaal 22 evaluaties uitgevoerd en de test zal ongeveer 20 min. in beslag nemen. Succes!

Evaluation 1: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 2: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 3: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 4: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 5: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 6: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 7: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 8: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 9: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 10: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 11: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 12: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 13: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 14: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 15: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 16: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 17: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 18: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 19: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 20: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 21: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Evaluation 22: Which fragment do you consider the best?

( ) Fragment 1   ( ) Fragment 2   ( ) No idea

Thank you for your participation!

Appendix B

Video editing scripts
The first script extracts the audio from the file.

ffmpeg -ss 00:02:37 -t 153 -i IN3PYNEU5STUDIO1FR1020067.mxf -vn -acodec pcm_s16le -ar 48000 -ac 2 -ab 768000 audio_forecast.wav
ffmpeg -f wav -i audio_forecast.wav -ss 00:02:20 -t 5 Intro.wav
ffmpeg -f wav -i audio_forecast.wav -ss 00:02:37 -t 153 CuttedVideo.wav
ffmpeg -f wav -i audio_forecast.wav -ss 00:05:51 -t 6 Outro.wav

The second script separates the introduction, middle and final part from the weather forecast and stores them in the raw YUV format, ready for processing.

ffmpeg -s 720x576 -i IN3PYNEWESTUDIO1VI1020075.mxf -deinterlace -sameq -s 720x504 IN3PYNEWESTUDIO1VI1020075.yuv
ffmpeg -s 720x576 -i IN3PYNEU5STUDIO1FR1020067.mxf -deinterlace -sameq -s 720x504 IN3PYNEU5STUDIO1FR1020067.yuv
ffmpeg -an -ss 00:02:20 -t 5 -s 720x504 -i IN3PYNEU5STUDIO1FR1020067.yuv -s 720x504 Intro.yuv
ffmpeg -an -ss 00:02:37 -t 153 -s 720x504 -i IN3PYNEU5STUDIO1FR1020067.yuv -s 720x504 CuttedVideo.yuv
ffmpeg -an -ss 00:05:51 -t 6 -s 720x504 -i IN3PYNEU5STUDIO1FR1020067.yuv -s 720x504 Outro.yuv
ffmpeg -an -ss 00:02:15 -t 153 -s 720x504 -i IN3PYNEWESTUDIO1VI1020075.yuv -s 720x504 CuttedBackground.yuv


Appendix C

AviSynth scripts
The first AviSynth script creates the right-eye view by deleting the first frame of the original 2D video sequence.

vid2d = RawSource("WeatherForecast_PROG.yuv", 720, 576, "I420")
v1 = vid2d
DeleteFrame(vid2d, 0)

The second AviSynth script creates the right-eye view by deleting the first two frames of the original 2D video sequence.

vid2d = RawSource("WeatherForecast_PROG.yuv", 720, 576, "I420")
v1 = vid2d
v2 = DeleteFrame(vid2d, 0)
DeleteFrame(v2, 0)
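The same frame-offset operation can also be applied directly to the raw I420 sequence instead of going through AviSynth. The following minimal MATLAB sketch is an illustration only (the helper name write_frame_offset_view is hypothetical and not part of the thesis code): it copies the input sequence while dropping the first offset frames, so the remaining frames can serve as the right-eye view.

% Hypothetical helper (not part of the provided source code): copy a raw
% I420 sequence while dropping the first `offset` frames, emulating the
% DeleteFrame calls above.
function write_frame_offset_view(inFile, outFile, width, height, offset)
    frameSize = width * height * 3 / 2;    % bytes per I420 frame (Y + U + V)
    fin  = fopen(inFile,  'rb');
    fout = fopen(outFile, 'wb');
    fseek(fin, offset * frameSize, 'bof'); % skip the first `offset` frames
    while true
        frame = fread(fin, frameSize, 'uint8=>uint8');
        if numel(frame) < frameSize
            break;                         % incomplete frame: end of sequence
        end
        fwrite(fout, frame, 'uint8');
    end
    fclose(fin);
    fclose(fout);
end

For example, write_frame_offset_view('WeatherForecast_PROG.yuv', 'RightView.yuv', 720, 576, 1) would correspond to the first script above.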


Appendix D

YUV/WMV conversion scripts
The first script converts WMV video files to raw YUV video files, ready to be offered to the Matlab functions.

ffmpeg -s 720x504 -i background.wmv -sameq -s 720x504 background.yuv
ffmpeg -s 720x504 -i foreground.wmv -sameq -s 720x504 foreground.yuv

The second script compresses the left- and right-eye YUV sequences (width = 720, height = 504), generated by the algorithms, to WMV video sequences, ready to be offered to a stereoscopic player. Make sure the stereoscopic player is set up correctly: when an uncomfortable 3D effect is observed, try swapping the left and right view in the Stereoplayer.

ffmpeg -sameq -s 720x504 -i Demo-L.yuv -s 720x504 Demo-L.wmv
ffmpeg -sameq -s 720x504 -i Demo-R.yuv -s 720x504 Demo-R.wmv


Appendix E

Source Code and Video Sequences
The CD-ROM contains Matlab source code, and foreground and background WMV video files of a weather forecast. Before using these video streams as arguments for the Matlab functions, the WMV video files have to be converted to the raw YUV format (by executing the first script of Appendix D). Furthermore, an example of left and right stereoscopic views generated by the extended implementation is also provided. The CD-ROM is attached to the cover of this book.
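Since the Matlab functions operate on raw planar I420 data, the sketch below shows one way such a file could be read frame by frame. The function name read_i420_frame is hypothetical and not part of the provided source code; it merely illustrates the layout produced by the conversion script of Appendix D, namely a full-resolution Y plane followed by quarter-resolution U and V planes.

% Hypothetical helper (not part of the provided source code): read one
% planar I420 frame from a raw YUV file, e.g. foreground.yuv (720x504).
function [Y, U, V] = read_i420_frame(filename, width, height, frameIdx)
    frameSize = width * height * 3 / 2;             % bytes per I420 frame
    fid = fopen(filename, 'rb');
    fseek(fid, (frameIdx - 1) * frameSize, 'bof');  % jump to the requested frame
    Y = reshape(fread(fid, width * height, 'uint8'), width, height)';
    U = reshape(fread(fid, width * height / 4, 'uint8'), width / 2, height / 2)';
    V = reshape(fread(fid, width * height / 4, 'uint8'), width / 2, height / 2)';
    fclose(fid);
end

For example, [Y, U, V] = read_i420_frame('foreground.yuv', 720, 504, 1) would return the three planes of the first foreground frame.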


Bibliography

[1] Philips 3D Solutions, "Technology Backgrounder: WOWvx for amazing viewing experiences," 2006. [Online]. Available: http://www.souvr.com/Soft/UploadSoft/200805/2008051309232509.pdf
[2] A. Smolic, H. Kimata, and A. Vetro, "Development of MPEG Standards for 3D and Free Viewpoint Video," 2005.
[3] C. Fehn, R. De La Barré, and S. Pastoor, "Interactive 3DTV Concepts and Key Technologies."
[4] Q. Wei, "Converting 2D to 3D: A Survey," 2005.
[5] E. Stoykova, A. Alatan, P. Benzie, N. Grammalidis, S. Malassiotis, J. Ostermann, S. Piekh, V. Sainov, C. Theobalt, T. Thevar, and X. Zabulis, "3DTV: 3D Time-varying Scene Capture Technologies: A Survey," IEEE Transactions on Circuits and Systems for Video Technology, 2007.
[6] "ZCam." [Online]. Available: http://www.techdigest.tv/2008/01/ces_2008_3dv_sy.html
[7] W. J. Tam and L. Zhang, "3D-TV Content Generation: 2D-to-3D Conversion," in Multimedia and Expo, IEEE International Conference on, vol. 0, pp. 1869-1872, 2006.
[8] Y.-L. Chang, C.-Y. Fang, L.-F. Ding, S.-Y. Chien, and L.-G. Chen, "Depth Map Generation for 2D-to-3D Conversion by Short-Term Motion Assisted Color Segmentation," in ICME'07, 2007, pp. 1958-1961.
[9] "MakeMe3D software." [Online]. Available: http://www.makeme3d.net/convert_2d_to_3d.php
[10] R. Piroddi, "Stereoscopic 3D Technologies," 2010. [Online]. Available: http://www.snellgroup.com/documents/white-papers/white-paper-stereoscopic_3d_technologies.pdf
[11] E. Stoykova, A. Alatan, P. Benzie, N. Grammalidis, S. Malassiotis, J. Ostermann, S. Piekh, V. Sainov, C. Theobalt, T. Thevar, and X. Zabulis, "3DTV: Scene Representation Technologies for 3DTV - A Survey," IEEE Transactions on Circuits and Systems for Video Technology, 2007.
[12] E. Stoykova, A. Alatan, P. Benzie, N. Grammalidis, S. Malassiotis, J. Ostermann, S. Piekh, V. Sainov, C. Theobalt, T. Thevar, and X. Zabulis, "3DTV: Coding Algorithms for 3DTV - A Survey," IEEE Transactions on Circuits and Systems for Video Technology, 2007.
[13] A. Vetro, T. Wiegand, and G. J. Sullivan, "Overview of the Stereo and Multiview Video Coding Extensions of the H.264/MPEG-4 AVC Standard," Proceedings of the IEEE, 2011.
[14] G. Bozdagi Akar, A. Murat Tekalp, C. Fehn, and M. Reha Civanlar, "Transport Methods in 3DTV - A Survey," IEEE Trans. Circuits Syst. Video Techn.
[15] P. Benzie, J. Watson, P. Surman, I. Rakkolainen, K. Hopf, H. Urey, V. Sainov, and C. von Kopylow, "A Survey of 3DTV Displays: Techniques and Technologies," IEEE Trans. Circuits Syst. Video Techn., vol. 17, no. 11, pp. 1647-1658, 2007.
[16] P. Nasiopoulos and L. Coria, "Quality of experience of stereoscopic content."
[17] S. Kejian and W. Fei, "The development of stereoscopic display technology," 2010.
[18] Dolby 3D digital cinema. [Online]. Available: http://videotechnology.blogspot.com/2009/06/dolby-3d-digital-cinema.html
[19] Anaglyph glasses. [Online]. Available: http://www.gadgetgear.nl/2009/12/faq-anaglyph-3dtv/
[20] Passive Polarized vs Active Shutter 3D technology. [Online]. Available: http://www.best-3dtvs.com/guides/3d-glasses-active-vs-passive/
[21] Price list of linear polarized and anaglyph glasses. [Online]. Available: http://www.berezin.com/3d/3dglasses.htm
[22] R. Lau, S. Ince, and J. Konrad, "Compression of still multiview images for 3-D automultiscopic spatially-multiplexed displays," 2007.
[23] A. Smolic, K. Müller, P. Merkle, and A. Vetro, "Development of a new MPEG standard for advanced 3D video applications," 2009.
[24] P. Kauff, N. Atzpadin, C. Fehn, M. Müller, O. Schreer, A. Smolic, and R. Tanger, "Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability," Image Commun., vol. 22, no. 2, pp. 217-234, 2007.
[25] M. T. M. Lambooij, "Visual discomfort in stereoscopic displays: a review," vol. 6490. [Online]. Available: http://link.aip.org/link/PSISDG/v6490/i1/p64900I/s1&Agg=doi
[26] VRT - Application specification for SD-DV25. [Online]. Available: http://www.vrtmedialab.be/media/AS-VRT-DV25-Core.pdf
[27] IRT MXF Analyzer. [Online]. Available: http://mxf.irt.de/tools/analyzer/index.php#
[28] FFmpeg documentation. [Online]. Available: http://www.ffmpeg.org/
[29] N. Sprljan, "MATLAB xyz toolbox," 2011. [Online]. Available: http://www.sprljan.com/nikola/matlab
[30] H. Wang and D. Suter, "A Consensus Based Method for Tracking: Modelling Background Scenario and Foreground Appearance."
[31] V. D. Silva, "Depth Image Based Stereoscopic View Rendering," 2010. [Online]. Available: http://www.mathworks.com/matlabcentral/fileexchange/27538-depth-image-based-stereoscopic-view-rendering
[32] D.-J. Kroon, "Fast/Robust Template Matching," 2011. [Online]. Available: http://www.mathworks.com/matlabcentral/fileexchange/24925-fastrobust-template-matching
[33] Photoshop. [Online]. Available: http://www.photoshop.com/products/photoshop
[34] International Telecommunication Union, "Subjective assessment of stereoscopic television pictures," in Recommendation ITU-R BT.1438, Geneva, Switzerland, 2000.
[35] International Telecommunication Union, "Methodology for the subjective assessment of the quality of television pictures," in ITU-R Recommendation BT.500-11, Geneva, Switzerland, 2000.
[36] Best 3D TV Screen Size and Viewing Distance. [Online]. Available: http://www.best-3dtvs.com/guides/best-screen-size-viewing-distance/

List of Figures

2.2.1 General 3DTV scheme [3].
2.2.2 Example of a 3DTV system [2].
2.3.1 Shape-from-texture (from left to right: original image, segmented texture region, surface normals, depth map and reconstructed 3D shape) [4].
2.3.2 ZCam [6].
2.3.3 Device for recording digital holograms [5].
2.3.4 The proposed "Short-Term Motion Assisted Color Segmentation" [8].
2.3.5 The depth map of a weather forecast [8].
2.5.1 Frame-compatible formats, where 'x' represents the samples from one view and 'o' represents the samples from the other view [13].
2.5.2 Illustration of prediction in the H.262/MPEG-2 Video multi-view profile [12].
2.5.3 Temporal/inter-view prediction structure for MVC [12].
2.6.1 Transmission technologies of 3D content [10].
2.7.1 Binocular parallax [16].
2.7.2 Dolby RGB filter color wheel [18].
2.7.3 Different types of 3D glasses.
2.7.4 Parallax-barrier stereo display and lenticular stereo display [22].
2.8.1 DES, extending high-quality stereo with advanced functionalities based on view synthesis [23].
2.8.2 MVD format and view synthesis for efficient support of multiview autostereoscopic displays [23].
2.8.3 Layered depth video [23].
2.8.4 Target of 3D video format [23].
2.8.5 The 3DTV data representation using video plus depth [24].
2.8.6 Virtual shift-sensor cameras for stereo reproduction from video-plus-depth representations (ZPS = Zero Parallax Setting) [24].
2.8.7 Advanced concept for 3DTV with multiple views [24].
2.9.1 Stereoscopic comfort zone [16].
3.1.1 Overview of the consecutive steps to go from a 2D to a 3D representation.
3.3.1 Applying morphological opening to the initial background classification.
3.3.2 Background subtraction scheme.
3.3.3 Histograms of Y, U and V.
3.3.4 Background model after classification steps 1 and 2.
3.3.5 Edge model.
3.3.6 Final classification of background and foreground pixels.
3.4.1 Frames representing the stereo pair generated with the video-plus-depth algorithm (depth value = 44).
3.4.2 Frames representing the stereo pair generated with the basic implementation (pixel shift = 20).
3.4.3 Frames representing the stereo pair generated with the extended implementation (pixel shift foreground = 10, pixel shift background = 6).
3.4.4 Frames representing the stereo pair generated with the basic implementation (pixel shift = 10).
3.4.5 Left and right view frame with positive parallax effect (pixel shift background = 6).
4.2.1 Structure of the evaluation [35].
4.3.1 Frame from the left and right view created by the first algorithm (pixel shift = 10).
4.3.2 Frame from the left and right view created by the second algorithm (depth level = 50%, frame offset = 2).
4.3.3 Frame from the left and right view created by the third algorithm (depth value = 22).
4.3.4 Frame from the left and right view created by the fourth algorithm (frame offset = 1).
4.3.5 Experiment 1: comparison of the effect of different depths.
4.3.6 Experiment 2: comparison of the different algorithms when introducing a small depth effect.
4.3.7 Experiment 2: comparison of the different algorithms when introducing a large depth effect.
4.3.8 Frame from the left and right view created by the basic implementation of the first algorithm (pixel shift = 10).
4.3.9 Frame from the left and right view created by the extended implementation of the first algorithm (pixel shift foreground = 10, pixel shift background = 6).
4.3.10 Experiment 3: exploring the effect of more objects in 3D for a small depth effect.
4.3.11 Experiment 3: exploring the effect of more objects in 3D for a large depth effect.
4.3.12 Experiment 4: exploring the effect of a negative or positive parallax.