Interactive Augmented Reality in Digital Broadcasting Environments

Interactive Augmented Reality in Digital Broadcasting Environments
Diplomarbeit

vorgelegt von

Tobias Daniel Kammann

als Teil der Voraussetzung zur Erlangung des Titels Diplom-Informatiker im Studiengang Computervisualistik

Institut für Computervisualistik Arbeitsgruppe Computergraphik

Prüfer: Prof. Dr.-Ing. Stefan Müller
Betreuer: Dipl.-Ing. Igor García Olaizola, VICOMTech Visual Communication Technologies

November 2005

Mitglied des Inigraphics.net

Declaration (Eidesstattliche Erklärung)

I hereby declare in lieu of oath that this thesis was written independently and that no sources or aids other than those stated were used. The thesis has not been submitted in the same or a similar form to any other examination board, nor has it been published.

Koblenz, 30 November 2005

Signature

Zusammenfassung (German Abstract)

Ever since television became a mass medium, ways have been explored to improve the quality of the broadcast images. This concerned not only the resolution and color quality of the scenes shown but, more broadly, the preparation and post-processing of the material. Image editing and manipulation became increasingly popular among producers. Today, neither cinema films nor television programs are shown exactly as they were recorded. The editing ranges from simple color correction to the complex and convincing integration of computer-generated objects into filmed scenes. The latter is seen above all in fictional scenarios, where unreal alternative worlds are created that often could never be built completely as real sets. This compositing is also used in everyday television, for example to overlay additional information in news broadcasts or to insert advertising into sports coverage. Yet this postproduction of the video image always has one decisive drawback: all viewers receive the same image – unchangeable for them. Many films and shows produced today are released and broadcast worldwide, and their number is constantly growing. The need to adapt this video material quickly and automatically for other markets and target groups grows with it. Exchanging elements in the video live, during the broadcast, becomes a very useful new method. These additional elements can be simple images or three-dimensional objects. Camera tracking makes insertion with correct perspective possible: the new elements are rendered from the same camera position from which the video was originally shot. Augmented Television becomes reality. In this thesis I introduce this new approach and examine it in depth. The interaction possibilities compared with conventional television are thereby greatly extended: the image modification previously done in postproduction can now be performed live by the viewer, who gains full control over the insertion of elements – ultimately making a complete blend of television and interactive applications or games possible. A concrete sample application is implemented. It is based on the Multimedia Home Platform (MHP) standard, which, in combination with Digital Video Broadcasting (DVB), enables interactive applications for television. This demonstration introduces augmented interactive television and points out problems, limitations and general design decisions encountered during development. It presents the ball sport "pelota", typical of the Basque Country, in a new guise. The digital video material is "augmented" in real time and can be influenced interactively by the user. Virtual camera flights and viewpoints within a 3D representation become possible. Future fields of application as well as possible extensions and improvements of this new technology are discussed in detail, to show that this approach represents the next logical step in the development of interactive television.

Interactive Augmented Reality in Digital Broadcasting Environments
Diploma Thesis

written by

Tobias Daniel Kammann

as part of the requirements to obtain the title of Diplom-Informatiker in the degree program Computervisualistik (Computational Visualistics)

Institute for Computational Visualistics Workgroup Computer Graphics

Approved by: Prof. Dr.-Ing. Stefan Müller
Tutor: Dipl.-Ing. Igor García Olaizola, VICOMTech Visual Communication Technologies

November 2005

Member of Inigraphics.net

Acknowledgements

My thanks go especially to Igor, for always being on the spot when I asked for something and for explaining the available digital television software and hardware in detail, and to Stefan Müller for supervising this thesis in Germany. I also want to say kaixo and milesker to everyone at VICOMTech! The time with you was great fun and I always felt at ease in your company – something very important when you are away from your own country writing your project. Gero arte.

San Sebastián, 28th of October, 2005

Abstract

Ever since television signals first hit the airwaves, ways have been sought to enrich the quality of the displayed images, both by adding information and by improving their appearance. Image editing and movie manipulation have gained popularity among producers, and nowadays almost no program seen on television or in the cinema is shown in the same way it was shot. Editing methods range from simple color correction to the complex insertion of computer generated objects, as seen in fictional movies that expose impossible or dreamlike alternative realities. This compositing of video material is also pursued in daily television, presenting additional information below a news report or adding commercials inside a sports show. These productions still share one major restriction: their a priori character – the viewer always receives a final image that cannot be edited or changed. As the global releases of movies and shows increase, so does the need for technology that can customize these broadcasts for specific audiences. A method to exchange video elements live, while broadcasting, is becoming very useful. These elements can be plain images or three dimensional objects. If the camera's position is known through tracking, the insertion fits perfectly, reusing the same point of view and perspective of the camera that shot the scene in the first place. Augmented Television is being realized. In this thesis I introduce and examine this approach. Compared to regular television, interactivity is extended considerably, since the viewer is granted full control over the editing process of the video material – possibly resulting in a perfect blend of TV and interactive applications or games. An application is implemented using the Multimedia Home Platform (MHP) television standard running alongside Digital Video Broadcasting (DVB). This example introduces the topic of augmented interactive television and shows restrictions, problems and pitfalls that occurred during development. The application presents the typical Basque ball sport pelota in a new way, offering augmented digital video material, interactivity, virtual camera flights and observer positions. Future usage and improvements of this technology are discussed in detail, arguing that this approach is the next logical step for interactive television.


Contents

1 Introduction
  1.1 Motivation
  1.2 Goals of this Work
  1.3 Summary

2 Overview
  2.1 Television
    2.1.1 Analog Technology
    2.1.2 Digital Revolution
    2.1.3 Adding Activity
    2.1.4 Adding Interactivity
  2.2 Visual Effects and Postproduction
  2.3 Augmented Reality
    2.3.1 Definition
    2.3.2 Current Developments

3 State of the Art in Interactive Digital Television
  3.1 Standards for Digital Television
    3.1.1 DVB (Digital Video Broadcasting)
    3.1.2 MPEG Encoding for Video and Audio
  3.2 Interactive Television Standards
    3.2.1 MHP (Multimedia Home Platform)
    3.2.2 OCAP (Open Cable Applications Platform)
    3.2.3 Other Approaches
  3.3 Set Top Boxes (STB)
  3.4 Summary on current Standards and Hardware

4 Designing Augmented Reality Applications for Digital Television
  4.1 Goals
  4.2 Considering Technical Issues and Limitations
  4.3 Different Approaches for Mixing the Realities
    4.3.1 Ready made
    4.3.2 Prepared Augmentation
    4.3.3 Higher Level
  4.4 Sample Application
    4.4.1 Basque Pelota
    4.4.2 Concept Design

5 Technical Realization of Basics for Augmented Reality TV
  5.1 Overview
  5.2 Available Software Suites
    5.2.1 Cardinal Studio
    5.2.2 iTV Suite
    5.2.3 Conclusion on Software Suites
  5.3 Writing Applications from Scratch
    5.3.1 Xlets in MHP
    5.3.2 Uploading and Running
    5.3.3 Set Top Boxes vs Emulators
  5.4 Stepping through Possibilities of Set Top Boxes
    5.4.1 Graphics Display
    5.4.2 User Activity and Interactivity
    5.4.3 Delays and Animations
    5.4.4 Network Access and Return Channel
  5.5 Summary on Sample Applications

6 Technical Realization of the Pelota Application
  6.1 Pelota Implementation Overview
  6.2 Displaying Video Images
  6.3 Tracking the Video Images
  6.4 Synchronization of Video Frame and Tracking Information
  6.5 Discussing different APIs for 3D Graphics in Java
    6.5.1 Java3D
    6.5.2 Occlusions in Java3D
    6.5.3 GL4Java
    6.5.4 JOGL and LWJGL
    6.5.5 Xith3D
    6.5.6 JME with M3G
    6.5.7 Conclusion on 3D APIs
  6.6 Cameras
    6.6.1 Paths
    6.6.2 Interactivity
  6.7 Obtaining Occlusion Information

7 Practical Results
  7.1 Usage of current Technology
  7.2 Pelota Application for a PC Simulation
    7.2.1 3D Rendering and Video Display
    7.2.2 Usage of Camera Tracking Data
    7.2.3 Multiple Streams and Selection
    7.2.4 Occlusion
    7.2.5 Augmentation Usage and Interactivity
    7.2.6 Portability
    7.2.7 User Tests and Usability

8 Conclusion and Outlook
  8.1 Future Vision for Augmented Television
    8.1.1 Pelota Application
    8.1.2 Augmented Television in General, an Outlook to the Future
  8.2 Summary

9 Appendix
  9.1 System Configuration
  9.2 CD-ROM

1 Introduction

Since the early days of moving pictures as an entertaining and informative medium, people have strived for ways to enrich or alter the projected images. These efforts were pursued from the very first days of celluloid, when artists scratched directly onto the footage or painted on the film. Nowadays, digital technology has become standard, and almost no movie or television production is released and broadcast without at least some kind of modification: from simple color adjustments to sophisticated postproduction – mixing the recorded video with virtual realities or painted objects – anything is possible and often used. Today, even complete shows are generated inside the computer without any real set ever being built. Up until now, this postproduction has been a process that always happens prior to the actual broadcast. The receiving device – television, cinema projector or hand-held device – reproducing the footage has no influence on what is going to be displayed. If a movie is shown to different audiences, e.g. localized versions for specific countries, added information for deaf people offering subtitle options, exchanged elements or a different edit of the whole movie, alternative versions have to be prepared and released independently. Up to now the viewer has only had the option to switch the channel or turn off the device. He or she was chained to the presented material as it was. No real-time composition was possible. Simultaneously to advancements in cinema and television technology, interactivity was tried and implemented into the medium to offer a value-added experience. During the early 70s teletext (or video text) was deployed throughout the world: digital data was transmitted alongside the TV signal, presenting a means to act or interact on a low impact level. Program information could be retrieved, as well as subtitles for the running show, but a real interaction – influencing the current program – was yet to be given. During the late 80s interactive game shows gained popularity. People could call into a live show and take part in quiz events, chatting with the host or controlling simple games through tone dialing. Nevertheless, these approaches only allowed interaction with a single "lucky" user who was able to get through on the line.

1.1 Motivation

With the advancements in digital broadcasting technologies and the possibility to transmit arbitrary digital data simultaneously, the limits of more sophisticated applications lie only in the restrictions of available hardware and the imagination of the developers. In this project the goal is to push both – interactivity and real-time enrichment of the video signal – to a higher level. The presented video will be changeable while it is on air: a preset user function or a different language profile should change whole elements of the images, not only subtitles, but entire graphics or parts of the video. The final compositing of a movie or show won't be decided by the TV studios or postproduction companies anymore; instead, the decision will be granted to the viewers themselves. The final combination seen by the viewer will happen live: if certain objects or the camera position are tracked and this information is submitted to the presenting device, any three dimensional object may be inserted with proper perspective. At the same time, this graphical composition provides a good breeding ground for new ideas in interactivity: the user can control the composition and change video elements, and thanks to the mentioned tracking, these elements might hold annotations, selectable by the viewer, leading to additional information or shops. Games within the presented TV scene also become possible. Today, different digital television standards are emerging, offering widely sophisticated means of graphical presentation and user interactivity. Cable modems and broadband Internet access in more and more homes permit the idea of real interaction between the viewer and the broadcasting station – just like in the quiz shows that offered feedback to a single elected user.

1.2 Goals of this Work

In this project, some of the most common television technologies and standards will be introduced and discussed. Focusing on the MHP standard, heavily deployed within Europe, the possibilities of the Augmented Television idea will be tried out and documented. A concrete sample application will serve as a running example to promote possible usage of this new blend of technologies. The gained experience will eventually allow a vision of how and where this approach could hit the market.


1.3 Summary

To get an overview of current standards and involved areas, I will start off with a description of television technology and classical postproduction. The terms augmented reality and Augmented Television will also be defined in chapter 2. With this basis, we will advance to the next chapter, describing current digital and interactive television standards and hardware. The following chapter 4 lists general issues that have to be considered during the development of a scenario for Augmented Television; the sample application to be implemented is introduced there. The next two chapters focus on the technical implementation, giving examples of pitfalls and restrictions encountered in realizing the goals; the second of them concentrates on three dimensional rendering and its final application. Chapter 7 displays results and screenshots and discusses the achieved goals and their usability. The last chapter predicts future development in this field. Videos, screenshots and a complete documentation of all implemented classes as well as all source code files can be found on the disc attached behind the last page.


2 Overview

This chapter will introduce the concepts of television, postproduction and augmented reality, as these parts form the basis of this work. Postproduction alters the recorded video in sometimes long lasting processes, while augmented reality usually tries to do the same trick interactively in real time. For Augmented Television these concepts have to be merged.

2.1 Television

Television sets are omnipresent in all modern societies today. In less than 60 years of public availability, television has become the number one information source and entertainment gadget for the population. It started with expensive, bulky devices with bad image quality and black and white transmission only. Nowadays flat screens with a diagonal of 82" and portable devices such as video cell phones are available. The following section gives a short historical overview of television from a technological point of view.

2.1.1 Analog Technology

The concept of moving pictures has been pursued ever since people had access to pen and paper. So-called flipbooks show the idea in its simplicity: a small book containing a series of pictures that produce an illusion of motion when the pages are flipped with the thumb at a fast pace. This is due to the effect of persistence of vision, a physical phenomenon of the human eye which leaves the impression of a seen image available to the brain longer than it is actually shown (the correct explanation for this is still being discussed – the terms beta movement and phi movement join the debate – but we won't go into detail here). Fast alternating pictures merge into fluid movements. The human eye can't easily distinguish single images if the rate is higher than ca. 25 Hz. Early steps towards professional movie and television productions can thus be found in the availability of photography during the 19th century. The concept of a "camera obscura" had been well known for centuries, but intensive research on chemical compounds finally led to persistent storage of the taken picture, reducing exposure times from several hours in the beginning of photography to split seconds. It is this split second that was needed to make moving pictures. Before, only a combination of several cameras or drawn images could be used to realize this effect. Now the light-sensitive medium called celluloid allowed cameras to record real world scenarios at rates of around 15 to 30 frames per second. Even higher speeds are possible with special cameras. (Today, photographic sensors – CMOS, CCDs – allow recording digital pictures directly, speeding up the production process further.)

As technology advanced, costs dropped and radio technology was used more widely for communication, the idea of broadcasting video signals arose. To keep things short, television in the form of live transmitted moving images has existed since the late 1920s. The first demonstration was held by Baird at the Royal Institution in London. The resolution of this electromechanical broadcast was limited to 30 lines, and the image measured only about 10 by 5 centimeters[17]. It was hardly possible to recognize displayed human faces, but since then the quality of the images and the integrated sound has risen constantly. Development led to a purely electronic television with a cathode ray tube (CRT) constructing the image. An electron beam moves in a fixed pattern over a raster, influencing the color of single raster elements by changing the intensity. In color TV one element contains color parts of red, green and blue, which construct a complete image like a stone mosaic, but with the possibility to change each "stone" several times a second, resulting in moving pictures. The invention of the CRT finally led to public television broadcasts. Countries started to define their own flavors of broadcast technology with different raster sizes and transmission formats. After a struggle lasting decades, three main analog color formats are used these days: NTSC, PAL and SÉCAM. These are introduced in the following sections.

NTSC

In 1952 the Federal Communications Commission (FCC) approved the color system proposed by the NTSC (National Television System Committee) for US television. The proposal offered compatibility with the existing monochrome standard on the US market, but red, green and blue components couldn't be transmitted directly if monochrome TVs were to decode the new signal. Therefore the signal is converted into a different separation of information: a luminance part offers brightness information, decoded by black and white televisions, while blue chrominance and red chrominance define additional color components, which are simply left out by monochrome devices but bring color to modern televisions. The RGB signal is converted to so-called YCbCr (or YUV) with a simple linear equation. To finally broadcast the video signal, the new information has to be encoded inside the available frequency range without interfering with the brightness information. A subcarrier is added within the video spectrum, modulated by the reduced-bandwidth chrominance signals. It is placed in the highest parts of the spectrum within the limits of the existing bandwidth (4.2 MHz in the US, 5-6 MHz in Europe). Up to this point there are no major differences between NTSC, PAL and SÉCAM; they differ in the way this subcarrier is modulated and in its frequency. Color information can be encoded using less space, since the human eye is more sensitive to brightness than to color differences. To eventually construct a video frame, a scanning standard had to be defined, setting frame (i.e. raster) size and frequency (frames per second, fps). Originally, NTSC (as well as PAL) only described a way of color encoding, usable for any scanning standard[10], but nowadays this usage has been diluted and referring to an NTSC signal usually describes a whole television broadcasting standard. NTSC uses 480 visible lines inside the raster out of a total of 525 lines (the 45 missing lines are used for purposes such as synchronization or additional data), with an aspect ratio of 4:3. The frame rate is set to 29.97 fps, and an interlaced method is used, resulting in a refresh frequency of 59.94 Hz. Interlacing displays only every 2nd line of the image first and every other line second (fields). This quick alternation results in a flicker-free final representation thanks to the already mentioned persistence of vision. A disadvantage arose during NTSC broadcasting: color changes or flickers could be recognized quite often. Especially flesh tones are affected, due to sensitivity to phase rotations caused by the transmission channel. A tint control mechanism had to be introduced to cope with this drawback.

SÉCAM

SÉCAM (Séquentiel couleur avec mémoire, French for "sequential color with memory") was first used in France and is considered the first European color TV standard, introduced in 1967. SÉCAM differs from PAL and NTSC in the way the color signals are carried. It uses frequency modulation to encode chrominance information on the subcarrier, and the same color information is used by two subsequent lines, giving a bandwidth advantage. The result is a quite robust image where no tint correction is needed. At the same time, reproduction on a monochrome TV is worse, since parts of the subcarrier remain visible, and sharp transients between highly saturated colors suffer. SÉCAM uses 576 interlaced scanlines (625 total), 50 fields, 25 frames per second and also a 4:3 ratio. Countries using this system include France, Russia, Eastern Europe and some parts of the Middle East.

PAL

This standard is today's world-dominant analog TV format. PAL was developed by Walter Bruch at Telefunken in Germany, and the format was introduced in 1967. It uses 576 interlaced scanlines (625 total), 50 fields (causing 50 Hz), a frame rate of 25 and a 4:3 aspect ratio. PAL is a closer relative to NTSC concerning color encoding, but looking at scanlines and frequency it holds the same attributes as SÉCAM. The advantage of PAL over NTSC is better color handling: PAL uses the same phase rotation approach, but avoids the problems that occurred with NTSC by inverting the phase information every second line, hence its name phase alternate line (or phase alternating line, phase alternation by line). This automatically corrects phase errors in the transmission of the signal by canceling them out. Next to the "main" PAL standard (sometimes referred to as PAL B/G), modifications exist such as PAL-M (used in Brazil, 525 lines at 59.94 Hz) or PAL-N (used in Argentina, 625 lines, 50 Hz). These basically define different subcarrier frequencies. A successor, PAL+, was defined to allow broadcasting of 16:9 aspect ratio movies with better quality within a regular PAL encoding. While a 16:9 image only occupies 432 lines of PAL (576 x 3/4) on a 4:3 receiver, an additional 144 lines of video information can be encoded – hidden, so to speak, from a normal PAL receiver. A PAL+ device can use these hidden lines to gain an overall higher image resolution.
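The "simple linear equation" mentioned in the NTSC section above is not spelled out in this thesis; for illustration, the commonly used ITU-R BT.601 form of the RGB-to-YCbCr conversion (an assumption about which exact variant is meant) is:

Y  = 0.299 R + 0.587 G + 0.114 B
Cb = 0.564 (B - Y)
Cr = 0.713 (R - Y)

A monochrome receiver simply displays Y, while a color receiver reconstructs R, G and B from all three components.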

Common Issues

As mentioned above, all three standards transmit more scanlines than are displayed, leaving space for additional information. In fact, the concept of leaving out lines was originally due to technical limitations of the CRT rendering; the idea to transmit extra data only came up later. These additional 45 (respectively 49) lines of the blanking interval are now used for teletext, closed captions (optional subtitles) or other services (for instance a signal informing a receiver that a following show is broadcast in PAL+ instead of regular PAL). A receiver may or may not be able to decode this additional information. No harm is caused if it is incapable of doing so, but extra services are offered for newer hardware.

2.1.2 Digital Revolution

The presented analog transmission standards have one major problem in common: they inseparably mix the encoding of the video information with its transmission. That is to say, the resolution of a signal can't be doubled or reduced: the 625 lines in PAL will always remain 625 lines with a refresh rate of 50 Hz, due to the frequencies used in signaling. Additional information has to fit into the blanking interval. If more space is needed, data has to take turns: sequential broadcast of this data results in lower availability and longer response times. To overcome this restriction, digital television (DTV) separates the transmission completely from the video and data information. Further advantages include better possibilities for error correction and an overall better image quality – while an analog broadcast drops in quality over distance, the digitally encoded image reaches the receiver with an accuracy of 100% until the maximum distance is reached. Moreover, convergence of different devices is an important factor: digital television simplifies interoperability with mobile PCs, Internet based broadcasting, etc., without asking for analog-digital converters. DTV uses different compression algorithms, resulting in a much higher video data throughput while resorting to the same frequency space as analog broadcasting. Bandwidth can be saved: while a typical PAL stream would occupy 216 Mbit/s represented digitally with a bit depth of 24, compression algorithms can reduce this amount heavily, to 8 Mbit/s or less. Higher resolution videos or even more channels can be sent. Additional data can be sent without restrictions such as those of a blanking interval. DTV hit the market in the United States in 1994 under the name of DirecTV. Around the same time, ongoing efforts in Europe led to the specification of a European digital transmission standard named DVB. But these standards only handle the transport of data; what we are going to see is as yet undefined. Frame resolutions usually stick to the analog standards, since current television sets would just not be able to display higher resolutions, but frame sizes can vary freely, as discussed in chapter 3, where we take a closer look at current digital standards.

2.1.3 Adding Activity

To put it plainly, in the beginning user activity only involved switching channels. But during the late 1980s the analog signal also carried video text information, allowing the viewer to use a remote control to display these additional services, containing mainly TV program schedules, weather, stock information or subtitles. This activity nevertheless was restricted to the simple presentation of text and limited by the available bandwidth. The user's actions had no further influence on the running program or the displayed text. Digital television did not extend this at first. Additional activity only ever included the concept that programs might hold more than one video image and that users might select their preferred camera themselves.

2.1.4 Adding Interactivity

Television is by nature a broadcast medium: a network operator sends one program to thousands of recipients. It is, so far, only a one-way medium. To allow real interaction between current programs and viewers, some kind of back or return channel needs to be established. Since the early days this has typically used the phone line: if a show is broadcast live on television, selected spectators can join in by talking to the show's host or by choosing options through tone dialing. One example of this was a popular children's game show on German television, where a 3D rendered character called Hugo could be controlled by voice commands as a simple jump and run game. Offering these kinds of games, or getting opinions from the homes onto TV shows, involves the viewers to a greater extent. Another early example of user participation was used in a different German game show in the 1970s called Wünsch Dir was (Make a wish), in the form of a "light test". Sympathy with candidates of the show was expressed by turning on as many electric devices as possible in a selected city. The local electric power company would register the difference in current consumption, resulting in a somewhat odd ranking mechanism. Of course, this non-ecological gag could not be the answer for realizing user surveys. In 1993 the national Spanish broadcaster Televisión Española tried an interactive TV with its product called TelePick. This additional device connected to the television was supposed to offer the missing link: viewers should get the opportunity to select detailed information on shows or even purchase advertised items directly. Obviously this step was taken too early, or the technology was just too limited, which resulted in too low sales of the product (13,000 sold against expected figures of 850,000). The project was canceled only one year after it went online. To this day, analog television has never offered a complete "out of the box" interactive TV. But the general idea of offering additional information on a presented show, plus some kind of return channel for interaction, remained present and was sought. Today new approaches are being released. They will be dealt with in the next chapter.

2.2 Visual Effects and Postproduction

Since the early days of moving pictures, producers and artists have intended to change the recorded real world images or to invent entirely new fictional scenarios. In early approaches, artists directly scratched or painted into the frames of a celluloid film. Alongside filmed stories with real sets and actors, animation became more and more popular. Characters and whole backgrounds were drawn entirely by hand and recorded frame by frame, resulting in a high-tech version of the already mentioned flipbook. "Stop motion animation" did the same frame by frame trick to give life to puppets or toys: after a frame was shot, animators had to move the objects and body parts bit by bit to get a smooth movement when presenting the frames at a speed of 10, 15 or 25 frames per second. Double exposure of film was used to combine two different scenes into one movie: real actors could escape from a Godzilla puppet animated using the stop motion technique. Digitalization spread throughout the production process, and it became standard to convert analog film to digital data, edit and change it using special workstations, and finally reconvert it to analog form by exposing a new film with the enhanced image material. Now alteration became easier: artists no longer had only one try (as with double exposure), and software sped up production drastically. Interpolation allowed editing of particular frames while the computer calculated the steps in between – not every single frame needed to be treated by hand as in classical animation. Nowadays, whole feature films are produced entirely inside computers; 3D rendered movies gain popularity and coexist with live-action movies that are enriched by digitally constructed characters, animals, objects, buildings or other special effects. To be able to insert those effects into real footage in a convincing manner, one can resort to the following techniques:

• Rotoscoping basically describes a way to trace animations on a frame by frame basis by overlaying original video footage with another drawing layer (e.g. to make a cartoon character out of a real actor). The technique is heavily used to separate objects from a background that needs to be changed.

• Keying techniques allow the same separation of objects, but a scene typically has to be prepared for this: blue or green background walls allow an easy masking of foreground objects by means of color comparison (a minimal sketch follows after this list). Other keys include depth (measuring distance from the camera) or difference keys.

• Tracking is used in many different ways; it provides knowledge of the camera position for correct insertion of objects during the postproduction process. If a camera's location is known for each frame, computer generated 3D objects may be placed into the scene perfectly while reusing the same camera parameters. Techniques include hardware motion tracking (saving all movements while shooting), image based analysis (determining the position only from the shot frames) and an opposing approach, where a computer is in charge of camera movements, steering cranes and dollies: a motion controlled camera knows its position before the scene is shot.

• Lightprobe usage – for proper lighting of inserted 3D objects. High dynamic range images are taken of a crystal orb to store all illumination attributes of a real scene. Rendering programs reuse this information to light their geometry, which can then be inserted into the footage with convincing lighting.

Keying and tracking will be of especially great interest for our Augmented Television purposes later.
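As a minimal illustration of the color-comparison keying mentioned in the list above, the following sketch masks out pixels close to a given key color. The class name, threshold value and ARGB pixel layout are assumptions made for this example, not part of the thesis implementation.

/** Naive chroma keying: pixels close to the key color become transparent. */
public class ChromaKey {

    /**
     * @param argb      frame pixels in 0xAARRGGBB layout (assumed format)
     * @param keyColor  background color to remove, e.g. 0xFF00FF00 for green
     * @param threshold maximum color distance still treated as background
     * @return a copy of the frame where background pixels have alpha 0
     */
    public static int[] key(int[] argb, int keyColor, int threshold) {
        int kr = (keyColor >> 16) & 0xFF;
        int kg = (keyColor >> 8) & 0xFF;
        int kb = keyColor & 0xFF;
        int[] out = new int[argb.length];
        for (int i = 0; i < argb.length; i++) {
            int r = (argb[i] >> 16) & 0xFF;
            int g = (argb[i] >> 8) & 0xFF;
            int b = argb[i] & 0xFF;
            // Euclidean distance in RGB space as a simple similarity measure
            int dr = r - kr, dg = g - kg, db = b - kb;
            double dist = Math.sqrt(dr * dr + dg * dg + db * db);
            if (dist < threshold) {
                out[i] = argb[i] & 0x00FFFFFF;   // background pixel: alpha = 0
            } else {
                out[i] = argb[i];                // foreground pixel: keep as is
            }
        }
        return out;
    }
}

Real keyers work in a color space that separates brightness from chroma and soften the matte at the edges, but the underlying principle is the same comparison per pixel.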


2.3 Augmented Reality

2.3.1 Definition

Generally defined, augmented reality describes the overlay of video or photo material with computer generated images, allowing the integration of additional information. To pin it down further, the definition used throughout this thesis includes three more attributes: the integration of three dimensional rendered objects that works in real time, allows interactivity with the user and fits into the scene with correct perspective. While postproduction alters a previously shot movie or video, augmented reality does the same work on live material in real time (meaning a fluid frame rate above 15 fps). These guidelines demand PCs with high computational power and advanced technology to insert objects with the right perspective relative to the camera recording the scene, and some form of tracking has to be used to determine the current camera position inside the scene. This can be done optically (detecting reference patterns inside the view) or with other sensor devices relying on electromagnetic emissions, GPS radio signals or infrared.
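To make the role of the tracked camera concrete, the sketch below builds a view matrix from a tracked camera position and viewing direction, which a renderer can then apply so that virtual objects appear from the same viewpoint as the filmed scene. The class and method names are illustrative assumptions; the tracking and rendering actually used in this work is discussed in chapter 6.

/** Builds a right-handed look-at view matrix from a tracked camera pose. */
public class TrackedCamera {

    /** Returns a 4x4 view matrix in row-major order. */
    public static double[] lookAt(double[] eye, double[] target, double[] up) {
        double[] f = normalize(sub(target, eye));   // forward direction
        double[] r = normalize(cross(f, up));       // right direction
        double[] u = cross(r, f);                   // true up direction
        return new double[] {
            r[0], r[1], r[2], -dot(r, eye),
            u[0], u[1], u[2], -dot(u, eye),
           -f[0], -f[1], -f[2], dot(f, eye),
            0, 0, 0, 1
        };
    }

    private static double[] sub(double[] a, double[] b) {
        return new double[] { a[0] - b[0], a[1] - b[1], a[2] - b[2] };
    }
    private static double dot(double[] a, double[] b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }
    private static double[] cross(double[] a, double[] b) {
        return new double[] { a[1] * b[2] - a[2] * b[1],
                              a[2] * b[0] - a[0] * b[2],
                              a[0] * b[1] - a[1] * b[0] };
    }
    private static double[] normalize(double[] a) {
        double l = Math.sqrt(dot(a, a));
        return new double[] { a[0] / l, a[1] / l, a[2] / l };
    }
}

Whenever the tracker delivers a new pose for a video frame, the virtual camera is updated with such a matrix before the overlay geometry is drawn, so that the perspectives of the real and the virtual content match.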

Figure 2.1: AR in the construction process of automobiles using visual patterns to determine perspective. Pictures kindly supplied by Metaio - Augmented Solutions[1].

2.3.2 Current Developments

The control of overlaid graphics in real time is not often seen in public commercial products. Sony made a first approach by releasing the EyeToy add-on for its PlayStation game console: a camera recognizes the player and his or her movements as a new means of user interaction. This has not yet reached the status of augmented reality, which includes correct perspective for CG objects. Games and sample applications are emerging at universities and research institutes worldwide. So-called head mounted displays (HMDs) allow an immersive impression of AR: the user can see the overlay of real and virtual objects from his personal viewpoint. Touristic usage, production prototype planning, medical use and previsualization purposes are almost at hand, but are seldom released as market-ready products. We propose another possible field of use: bringing AR to regular television environments inside households, without asking for expensive tracking or presentation hardware in each living room. "Augmented Television" will allow interactively personalized videos, not broadcast ready-made as in regular postproduction. We will offer this production in real time inside the TV set, summed up in a definition:

Augmented Television describes the alteration of a broadcast video in real time, where virtual objects fit into the recorded scene with correct perspective, giving a natural impression of belonging to the scenario. The user has to have the possibility to change and alter this presented blend of elements. A direct viewer influence on the broadcast video signal, being the basis for the augmentation, is optional.

Figure 2.2: An augmented reality game[3] using optical markers, presenting a virtual racing game in an arbitrarily chosen surrounding.


3 State of the Art in Interactive Digital Television

Here I will introduce current standards for digital television, regarding ways of transmission and encoding. New possibilities for interaction and the needed hardware are depicted. Eventually we will see the starting point for the augmented reality television that follows.

3.1 Standards for Digital Television

While the entertainment and communication industry moved on to the digital era quite some time ago, and the Internet and compact audio discs let us almost forget about a "predigital" time period, television stayed analog for much longer. Still, the majority of customers watch PAL, NTSC or SÉCAM encoded videos, broadcast terrestrially, via cable or by satellite, but the digital alternative has been prepared to take over for quite some time. People swap their VHS recorders for DVD players, and digital broadcasting is being advertised more and more. Governments of different countries have even set deadlines by which all TV and radio signals have to be digital only. The possibility to transmit many more video programs thanks to compression, and in addition arbitrary data – usable for additional program information, interactive services or purchases – lets network providers dream of a golden future. Considering this potential of digital broadcasting and its market value, one could expect that the industry would have hurried to define a worldwide standard for digital broadcasting – learning from the difficulties and incompatibilities of the analog era. Unfortunately this did not happen, and now the world has at least three major DTV systems. To sum it up, Europe now uses transmission technology specified as DVB (Digital Video Broadcasting), while the United States relies on ATSC (Advanced Television Systems Committee) and Japan releases its own isolated solution named BML (Broadcast Markup Language). But this only covers the transmission of signals; the video and audio information itself still has to be encoded. This compression is handled by algorithms defined by the Moving Picture Experts Group (MPEG) or by other standards (such as H.263). Since we stay inside Europe and follow the available hardware, we will only take a closer look at DVB transmission below, followed by a description of the compression algorithm used.

3.1.1 DVB (Digital Video Broadcasting)

The consortium of the Digital Video Broadcasting Project was formed in Europe in 1993 by different industry leading companies and is based in Geneva, Switzerland. The declared goal was to define one single open digital television standard for the whole of Europe. Today, this standard is used in countries all over the world, including Russia, China, Australia and Southern Africa, and the consortium lists over 300 members. Different countries started around 1998/1999 with first public tests of DVB. The United Kingdom did the first commercial broadcasting in late 1998. Germany's capital Berlin was the first area to completely stop broadcasting analog TV signals in 2003, replacing all services by DVB. Germany and other European countries aim to shut down all analog PAL/SÉCAM transmission by 2010. DVB comes in different profiles for specific environments. Currently, variants for cable broadcasting (DVB-C), terrestrial transmission (DVB-T), satellite distribution (DVB-S/DVB-S2) and hand-held devices (DVB-H) are available. They all define somewhat different encodings, due to the different fields of use. While cable broadcasting, for instance, can live with little error correction, a much bigger overhead has to be used within terrestrial DVB to counteract echoes and packet loss. All DVB profiles provide the transmission of video and audio information as well as additional digital data of any kind. The video quality varies heavily from profile to profile (cable transmission offers more bandwidth than is needed, for instance, for a hand-held device with a small screen resolution), and moreover the video encoding inside a profile may vary from program to program: the video stream is typically MPEG-2 encoded (restricted further, giving it the special acronym DVB-MPEG), and the data rate may be downscaled to free up space for additional information or more video streams belonging to the same program (called bouquets). DVB specifies teletext services as well (DVB-TXT), and subtitles are available separately (DVB-SUB). Moreover, inside its bandwidth DVB can carry any digital data along, decoded by the receiver: this can be an electronic program guide (EPG) or, in conjunction with the Multimedia Home Platform (MHP), interactive service information.

3.1.2 MPEG Encoding for Video and Audio

Representing a full frame of PAL resolution as plain pixel information would require 1.2 MB (768 width x 576 height x 3 color bytes); a whole second would need a throughput of almost 32 MB. While such lossless handling is useful for video production, where we don't want to lose any information (although even there lossless compression is used to reduce the data), these big frame sizes are just too much for a broadcasting environment. As a result, means of lossy compression were developed to reduce data rates drastically while still keeping high quality image results. This can be reached by converting samples or frames of a video/audio signal into a frequency space and quantizing them. Small, hopefully unnoticeable details are left out, and the new information is entropy encoded to gain even more compression. To reduce data further, a whole sequence is analyzed and only differences to a previous frame, which is sent as a whole (a key-frame), might be transmitted. MPEG stands for Moving Picture Experts Group. Since the first meeting of this working group in 1988 in Hannover, Germany, several versions of the MPEG standard have been released. To name a few:

• MPEG-1 offers a VHS (Video Home System) like quality with around 240 scanlines and typical data rates around 500 kbit/s. Audio is encoded with so-called layer I, layer II or layer III technology (the latter today better known as the stand-alone audio compression format mp3). Video frames are only progressively scanned. MPEG-1 is the smallest common ground for compressed digital video data and can be reproduced by all VideoCD or DVD players.

• MPEG-2 offers a PAL-like quality at data rates of usually 5 Mbps. Audio can be encoded with all layer technologies, and additionally with the AAC (Advanced Audio Coding) format. MPEG-2 was designed specifically for broadcasting environments. Interlaced video can now be encoded as well.

• MPEG-3 was intended for HDTV (High Definition Television) with signal rates of 20 to 40 Mbit/s. But slight changes to MPEG-2 did the same job, rendering this version obsolete.

• MPEG-4 is planned to extend the capabilities of the currently widespread MPEG-1 and MPEG-2. Since these versions drop heavily in quality below 1 Mbit/s, better support for low bit-rate applications is added (e.g. for Internet-based streaming media or hand-held transmission), and parts of a video stream can now be split up with an object-based model: different smaller videos and even 3D VRML data can be integrated into one video.

• MPEG-7 is a multimedia content description standard. Thus, it is not a standard which deals with the actual encoding of moving pictures and audio, like the above. It uses XML (Extensible Markup Language) descriptions to store metadata and can be attached to timecode in order to tag particular events, but it is currently not used in DVB.

DVB uses MPEG-2 in conjunction with layer II audio encoding, although layer III or AAC would offer lower data rates at the same quality. This is due to hardware costs for receivers: layer II decoding takes much less effort, so expensive hardware can be saved. The same still applies to the far better MPEG-4: almost no hardware is available capable of rendering all specified profiles. MPEG encoded videos may vary freely in screen resolution; no fixed restrictions as in analog technology exist. Only the market situation will decide on the formats used: currently digital video will usually not exceed analog resolutions, since the majority of customers will still connect digital receivers to analog TV sets – a higher resolution would just get lost during the downscaling for presentation. A disadvantage of digital encoding with MPEG remains: if a displayed scene includes many small details, data rates skyrocket to capture them. But if the bitrate is limited, these details are lost and blocks (artifacts) appear, giving the image an ugly mosaicking effect.
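The key-frame plus difference idea described in section 3.1.2 can be illustrated with a deliberately naive sketch. This is not MPEG itself – real encoders use motion-compensated blocks, frequency transforms and entropy coding – and the class name and threshold are assumptions made purely for this example.

/** Toy illustration of temporal prediction: send a key-frame or only changed pixels. */
public class FrameDelta {

    /** Fraction of changed pixels above which a full key-frame is cheaper. */
    private static final double KEY_FRAME_THRESHOLD = 0.5;

    /** Decides whether the current frame should be sent as a key-frame. */
    public static boolean needsKeyFrame(int[] previous, int[] current) {
        int changed = 0;
        for (int i = 0; i < current.length; i++) {
            if (current[i] != previous[i]) {
                changed++;
            }
        }
        return (double) changed / current.length > KEY_FRAME_THRESHOLD;
    }

    /** Encodes only the changed pixels as (index, value) pairs. */
    public static int[] encodeDelta(int[] previous, int[] current) {
        int changed = 0;
        for (int i = 0; i < current.length; i++) {
            if (current[i] != previous[i]) changed++;
        }
        int[] out = new int[changed * 2];
        int k = 0;
        for (int i = 0; i < current.length; i++) {
            if (current[i] != previous[i]) {
                out[k++] = i;           // position of the changed pixel
                out[k++] = current[i];  // its new value
            }
        }
        return out;
    }
}

If most of the image changes (at a scene cut, for instance), the delta becomes larger than the frame itself, which is why encoders insert a new key-frame in that case.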

3.2 Interactive Television Standards

When digital broadcasting emerged, TV companies and network operators started to think of ways to exploit the freshly available possibility to transmit additional digital information alongside the video and audio signals – only asking for another piece of software besides the receiver's MPEG decoder middleware to handle this added feature. In the beginning, TV stations would in most cases rely on proprietary middleware to add interactive services, since there was just no other widespread solution available. Big players were, and still are, among others Microsoft's "MSTV" and OpenTV. Because of their closed character and the missing interoperability between these standards, the end customer had to buy or rent special receiver hardware for a single digital channel. This vertical market structure, where one network operator controls programs, interactive services and hardware sales, making all customers dependent along the whole chain, is currently breaking up: open standards evolve and exist alongside, or even start to displace, proprietary approaches. An open, horizontal market is spreading. But the development is not as clear as it seems: recently (September 2005, as reported in [4]) Japan's leading cable and satellite TV provider contracted with OpenTV to use their new proprietary interactive TV (ITV) concept called OpenTV Participate. Today, the most common open standards are the Multimedia Home Platform (MHP) and the OpenCable Application Platform (OCAP), but a first noteworthy one had already been released in 1997 by the ISO (International Standards Organization) Multimedia and Hypermedia Experts Group – the MHEG standard. This specification offered a declarative approach to presenting multimedia content, providing an open interchange format for interactive applications. Besides the descriptive concept, MHEG-1 also defined objects containing procedural code to allow simple decisions (due to user interaction – pressing the remote, for example). MHEG-3 introduced a standardized virtual machine, but due to a huge overhead and an apparently too complex structure, MHEG-3 capable receivers never gained popularity. It was not before the introduction of MHEG-5 that the tide turned for the standard: the definitions were cleaned up and support for Java script objects was integrated. Today MHEG-5 is widely used in the United Kingdom, mainly to realize an improved teletext, taking advantage of the declarative character of MHEG-5, which allows rich and fast presentation of information. Further development of the standard was pursued (MHEG-6, granting Java even more access to capabilities of DTV receivers), but has not been deployed to the market up to now. Nevertheless, these additions to the standard helped during development of its successors and the DAVIC standard for ITV, which is reused in the MHP standard described below.

3.2.1 MHP (Multimedia Home Platform)

MHP is the open middleware standard for interactive television as designed by the DVB project – it extends the mostly transmission-related work done by the DVB standard. From 1994 through 1996 the MHP specification process started in a European community project (DG III–Unitel Project) on platform interoperability in digital television. In late 1997 the first commercial requirements were fixed, and the first release of MHP followed on February 23rd, 2000. In 2002 the first hardware receivers were deployed in Finland, now a leading force in DVB-MHP based digital television. The largest market presence can be seen in Italy (due in part to controversial heavy subsidies by the Italian government) and Korea. Other countries are much slower, and some voices already spread the rumor that MHP will never establish itself in the world market due to too complex and chaotic specifications. The MHP standard defines an extensive application execution environment for interactive DTV applications, independent of vendor-specific hardware and software. Moreover, it allows display of a subset of HTML (DVB-HTML). The virtual machine running the applications is able to execute Java code specially designed for MHP environments (DVB-J). To be able to develop MHP compliant Java classes, the MHP specification defines all restrictions and mandatory features, currently based on Sun's PersonalJava edition (pJava)[19], listing the specific behaviour in a technical description[5] weighing some 700 pages. Furthermore, MHP relies on the definition of generic APIs that provide access to the typical interactive TV receiver's resources and facilities. In some cases it relies on reused APIs such as the DAVIC API (Digital Audio Visual Council) for MPEG-2 private field filtering or tuning, the HAVi API (Home Audio/Video Interoperability: a software specification that defines architecture and primitives for the information interchange of audio and video equipment of different brands) for user interface issues, and the JavaTV API (developed by Sun Microsystems) for DVB service selection and to control the video image. The Java Media Framework (JMF) allows further control over streamed media, and DVB-API classes handle event handling or persistent storage of settings and other data.
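DVB-J applications are packaged as so-called Xlets, whose life cycle is controlled by the receiver through the JavaTV interface javax.tv.xlet.Xlet (chapter 5 returns to this in detail). A minimal skeleton is sketched below; the class name and printout are placeholders, and a real application would additionally have to be signalled to the receiver via the broadcast stream.

import javax.tv.xlet.Xlet;
import javax.tv.xlet.XletContext;
import javax.tv.xlet.XletStateChangeException;

/** Minimal DVB-J application skeleton; the receiver drives the life cycle. */
public class HelloXlet implements Xlet {

    private XletContext context;

    public void initXlet(XletContext ctx) throws XletStateChangeException {
        // called once after loading; keep a reference to the context
        this.context = ctx;
    }

    public void startXlet() throws XletStateChangeException {
        // called when the application becomes active: acquire resources,
        // build the user interface and start presenting
        System.out.println("Xlet started");
    }

    public void pauseXlet() {
        // release scarce resources; the receiver may resume the Xlet later
    }

    public void destroyXlet(boolean unconditional) throws XletStateChangeException {
        // final cleanup before the Xlet is discarded
    }
}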

Figure 3.1: MHP software stack

Today, MHP comes in different version numbers or revisions (1.1 being the latest; widely implemented is usually 1.0.1 or 1.0.2) and currently three distinct profiles:

• The enhanced broadcast profile (profile 1), aiming at low-cost receivers and implementing only a minimal mandatory set of available features. Only display issues and Java execution are defined. No interaction (no way back) between receiver and broadcaster is possible – only the sent data is received by the receiver, unidirectionally.

• The interactive broadcast profile (profile 2): besides minor differences, the second profile defines a return channel. Running applications may download additional data and can now communicate bidirectionally (see the sketch after this list).

• The Internet access profile (profile 3), defining wider support for communication, not only through dial-up connections but also over broadband. Higher level protocols such as SMTP (email), HTTPS and FTP are supported.

The already mentioned possibility to render websites with HTML has only been integrated into profile 2 and 3 in MHP release 1.1. But currently most set top box vendors including MHP, back off of using the 1.1 standard, since implementation is quite expensive. An MHP compliance test has to be passed and not yet demonstrated willingness of customers delay MHP deployment further. Not many running applications offered along with the DVB broadcast program are available and marketing just can’t get hold on new clients, if the promised value-adding can’t be recognized: entertaining game shows via MHP or online purchases are often just not yet possible due to missing return channel connections. Publicity could be discussed, thinking of the way how people recognize the Internet-connected television being afraid of just another opportunity to get ripped off (looking at cell phone ring tone purchases, etc.). The viewer keeps leaned back and currently still changes to the PC for checking mails and going online banking. But this issue is far to off topic for this work to continue psychological approaches to MHP vendings and marketing, but we will later take a close look at technical issues, problems and restrictions of MHP to see its potential.

Figure 3.2: MHP applications in public German Television: interactive entertainment shows using return-channel technology Nevertheless, the first public German television (ARD) for instance is supporting MHP heavily introducing more and more interactive content into news, sports and game shows. Among this little number of up and running applications are already some ideas, that make use of the return-channel – not restricting MHP to a pure informative remake of teletext: viewers take part in quizzes and can win small prices (figure 3.2). Still under development, ARD moreover offers a special channel for MHP-try-outs, broadcast as "ARD MHP Test1" on Astra H1 satellite. Less than 5000 German households are able to receive this additional service, while over 9 million[9] DVB-receivers without MHP were sold, but effort is still taken by some networks, certain of MHP success in the long run. Others claim the death of MHP[9] and withdraw their engagement.

18

3.3 Set Top Boxes (STB)

3.2.2 OCAP (Open Cable Applications Platform) Open Cable has been specified by CableLabs, based in Colorado. The goal is again to offer a standardized way of interactive television transmission and decoding. It is based on MHP and uses Java for application handling and HTML to display descriptive content, but remains incompatible with MHP programs. OCAP only focuses on cableTV and today it is only used within the US. The first version of OCAP 1.0 profile saw light in December 2001. Since then different revisions changed the standard – even causing incompatibility with former versions, giving headaches to all involved parties (hardware providers, middleware developers, application designers). Changes in the standard are also pursued to get closer to MHP specifications. Harmonization of these similar standards has been seeked, but the usage of this so-called Globally Executable MHP (GEM) is only theory today.

3.2.3 Other Approaches Besides these approaches of open middleware, new proprietary services are still emerging as well. Different cost efficient ideas try to offer an added value, reaching the customer faster than MHP profile 3 receivers. One example is BLUCOM, developed by SES Astra: small applications and information can be transmitted within DVB and a bluetooth enabled TV-receiver can communicate with the viewer’s bluetooth cell phone. Participation (feedback) can be realized through SMS services or GPRS connections. There will be no need to connect the television to the Internet, at the same time offering with a well-known environment (the cell phone) an easy interaction interface, which is even suitable for multiple users in front of a television: everyone can take part at the same time if he or she has a bluetooth device[7]. Other concepts don’t want to rely on cheap but rather weak receivers that are hardly capable of rendering true color shades and complex interfaces. To offer an interconnected multimedia home, for instance Apple and Microsoft pursue ideas of bringing the Mac/PC to the living room. While Apple releases the mac mini, followed by a re-release of a new iMac shipping with a remote control for all media and desktop control (Front Row), Microsoft enters the market with its Windows XP Media Center Edition. Advantages and disadvantages are obvious: while offering full PC power including 3D renderings and Internet access, the acquisition costs are multiplied.

3.3 Set Top Boxes (STB) Once a digital video broadcast goes online and may be received by households, the television sets have to be able to handle the signals. During analog days, only a high frequency plug was given and the television had to descramble the signal itself (with its integrated tuner). No additional hardware was needed, but once pay-TV-services or satellite offers became available, the TV sets were no longer able to decode the signals with their integrated tuner only capable of handling cable or terrestrial information. Pay-TV demanded intensive decoding of decrypted signals and satellite TV asked for other descrambling algorithms as well. Multiplexed channels had to be separated and handed over in the traditional video signal, that could be presented by the TV’s integrated tuner.

19

3 State of the Art in Interactive Digital Television This approach always comes with the inconvenience of another gadget you have to connect to the television (the set top box), but is by far the most cost efficient solution. Today, TVs often already include satellite receivers and integrated digital decoders are likely to become standard in new product lines as well.

Figure 3.3: Set Top Boxes: simple zapper To cope with DVB, set top boxes will at least include components for demodulation14 , a demux15 and a mpeg-decoder16 . Once the signal has been converted into a old fashioned PAL signal, it can be handed over to the TV set.

Figure 3.4: Set Top Boxes: allrounder 14

the process of recovering the audio/video information from the radiofrequency signal in a receiver In telecommunications, multiplexing (also muxing or MUXing) is the combining of two or more information channels onto a common transmission medium using hardware called a multiplexer or MUX. The reverse of this is known as inverse multiplexing, demultiplexing, or demuxing. 16 A single video stream is compressed with mpeg-algorithms 15

20

3.4 Summary on current Standards and Hardware But today more and more complex STBs hit the market as well: additional functionality is being integrated, such as personal video recorders (allowing recording digital video streams to an integrated hard drive) or even browser capabilities to allow the viewer email or newsreading in front of his or her television. To gain interactive TV, the already introduced middleware for MHP, OCAP or others has to be included as well. The demux will now untangle video (DVB) and additional service information (such as MHP) and hand these parts over to the appropriate device. The microprocessor might communicate with the broadcaster through the return-channel, ask for additional information and finally generate its part of the video signal, which will be combined with the decoded mpeg-stream to a PAL outcome. Figure 3.4 shows a multi-purpose STB, reflecting many possible fields of usage. To integrate all functionality into a TV-connected device is one goal of the industry: offer an easy to use and still fully equipped media and information center – without the needs of a PC. Talking of MHP and set top boxes, headaches arise: MHP defines a huge set of functionality, but at the same time stating many functions as being optional. Cheap hardware manufacturers will usually leave out all optional definitions, restricting MHP to a lowest common denominator of features. Since implementation for the middleware is quite complex, many companies rely on the same middleware company – Alticast[33] – implementing the MHP stack. The open standard, implemented only by a single market dominating company, practically turns into another monopoly as in proprietary standard times. But as long as the customer profits from this fact (to have a running MHP middleware in all different receivers) and while the standard keeps open, we could live with that.

3.4 Summary on current Standards and Hardware Focusing on Europe, it is safe to say that the DVB project eventually defined the digital TV standard, that will be used throughout the European countries – coverture will be gained in all states. But speaking of interactive standards, the case if far more complicated and not decided yet. Certain lobbies, companies or even governments try to enforce DVB in conjunction with MHP or GEM, but the market seems to trifle time with other approaches as well. Unwilling to pay licences for MHP certifications, hardware suppliers often just leave out MHP in their receivers, since the offer throughout European countries is still too low to be profitable. It is questionable if subventions are the right way to enforce the standard. A good publicity work with a real "value-add" should do the trick. But up to now the viewer is only used to teletext and electronic program guides – not asking for more. Other license-free or even cheaper proprietary solutions to MHP coexist. If MHP does not hurry up, the market could soon be clustered with different approaches for cable, hand-helds, satellite networks. Or even for different channels – following the old idea of a vertical market structure. Nevertheless, MHP seems to be a promising approach of a standard. Its possibilities for interactive television and graphical overlay seem to be a appropriate basis for the idea of an augmented TV in the future. In the next chapter I will discuss different concepts for realizing this augmentation using MHP, followed by research on the actual implementation possibilities.

21

3 State of the Art in Interactive Digital Television

22

4 Designing Augmented Reality Applications for Digital Television Having seen current developments in digital television, it is time to think of realistic approaches for blending AR within this TV context. Restrictions caused by hardware limitations and the conceptual nature given in a broadcasting service are discussed and a full sample application is designed and described.

4.1 Goals The advantage of a "typical" Augmented Reality application has been touched on already. The user can move freely and take a look at the augmented objects (or the real objects superimposed by virtually displayed information) from a free to choose point of view. The integration is seamless and new ways of interaction are tried out these days to have a more natural human interface. This could include recognition and evaluation of facial expressions or gestures to "speak" to the computer. Wearing a head mounted display, the user can act handsfree and many workflows, e.g. in factories, can be sped up heavily due to this portable information device, overlaying relevant information to the viewer’s sight directly – without the needs of walking to a terminal or looking up information in a 1000 page manual. Impossible visualizations are now possible. For example superimposing a human body with it’s innards (as aligned X-Ray images or 3D models, etc.) or adding repair instructions and needed tools in front of the hood of a yet to be fixed car.

Figure 4.1: left: concept of how ready-made AR is already used in soccer broadcasts; right: AR in broadcastings, picture by Orad[8] These advantages of free movement and thus the free selection of the angle towards the live scene are unfortunately not possible in a television environment. The user is chained to the offered point of views broadcast by the film studio. One channel might have a subset of numerous video streams, but still: the user may only select among these and can’t walk around in

23

4 Designing Augmented Reality Applications for Digital Television the displayed scene at free will. Again, the television dictates the per se "lean back" attitude. Augmented Television offers two major advantages towards the classical postproduction: • The viewer can alter the presentation and influence effects. • Live broadcastings can hold effects as well, while postproduction will always happen after shooting the material and before broadcasting it. With the augmentation done inside the receiver, interactivity could reach a level of console computer games or even a regular PC with all possible scenarios of usage. Most probable, Augmented Television could advance user specific customization of the broadcast images: embedded advertisements may alter depending on the viewer’s age or interests and signs and written text could be exchanged due to a selected language-profile or favored settings. Interactive games could be realized, where each viewer can use the mixture of video image and generated virtual scene objects at free will. A single user could also be selected by the broadcaster through the data channel, and his interaction could be transmitted to the other households, where people would lean back and enjoy the movements of the temporarily chosen participant. In sport shows, the viewer could select and deselect visual aid at free will.

4.2 Considering Technical Issues and Limitations To realize Augmented Television, different questions need to be addressed. I will list important ones. They will be dealt with in the following chapters. Where will the augmentation process take place? Before broadcasting? Or inside the viewer’s set top box? How can we supply a video image plus 3D geometry? How will it be possible to align the virtual objects to the video? How can needed tracking data be transmitted and how does a proper synchronization work? What options does the hardware for digital television offer? Do the technical standards used for DTV allow fast graphical overlay at all? If we get to that point, more uncertainties arise: how can the quality of the augmentation be improved? How might a convincing, useful and/or entertaining application look like? What kind of transmitted data do we need to allow this real time composition? How does real interaction define itself and could it be realized with given hard- and software possibilities? Limitations and issues of interest will include the following: • screen resolution and quality A suitable interface design has to be found, other conditions than in a PC environment apply. To what can we resort to in MHP? • interactivity design The user will be most probably in a lean back position and state of mind. Too demanding and tiny scaled interfaces, seen from a rather great distance (in comparison to a 50 cm distance for a PC screen) hinder a fast and complex interaction. A suitable balance for the design has to be found.

24

4.3 Different Approaches for Mixing the Realities • overall performance Set top boxes have to be checked for their performance, data throughput for transmitting needed information (such as 3D geometry and tracking data) and video and graphics rendering need to be examined. This includes the granularity that can be reached for synchronization, needed for the AR overlay. The alternatives of where the augmentation process could take place are listed in the following section.

4.3 Different Approaches for Mixing the Realities Unlike a stand-alone AR-environment or the classical case of postproduction, that offers the video image and the overlaid graphics all in one, we now have a combination of at least two participants: the broadcasting company offering the content and the user and his or her set top box connected to the television. The content offered by the station can be of one of the following stages.

4.3.1 Ready made The first option is to do all augmentation before the video image is transmitted. Thus we get the classic post- or studio-production with overlaid or embedded graphics. Leading in this compositing done in real time is the company Orad[8], but their combination of graphics and video is still done before broadcasting. Although, different releases for different countries are being realized, the viewer at home may never influence this compositing, no interaction for the user is possible unless the viewer has a direct connection to the content producer. This is most unlikely to be developed, since it would only allow a single user to interact with the program and all other users would have to watch his or her actions as well. The system could be realized for votings or polls, though. Receivers offering bidirectional data exchange with the broadcaster (see 3.2.1) could allow user feedback influencing the live program. Since this approach does not imply any direct user interaction or manipulation possibilities, it is very limited, and will therefore not be covered at a greater extent in this document.

4.3.2 Prepared Augmentation A higher degree of interactivity can be offered by transmitting the video image without additional information displayed. Besides, the data channel of the digital broadcasting environment will be used to transmit appropriate coordinate space information for inserting 3D objects. Moreover it would be possible to transmit a prepared mask, depth-information or other additional data along with the video material. These could be used to help the set top box in composing the images and supplying logics for running applications. The 3D objects to be displayed could be transmitted through the data channel, downloaded through a separate connection or pre-installed in the user’s set top box. The rendered elements can be laid over the video stream and a running application may trigger animations and changes due to user interaction. This stage offers

25

4 Designing Augmented Reality Applications for Digital Television more possibilities than the first simple option. A more detailed research on this more interesting approach is one of the main goals of this paper.

4.3.3 Higher Level If set top boxes will gain more popularity, as well as production costs will drop more, the near future will bring more calculation power and CPU speed to the devices. Once it is possible to render high definition 3D graphics in real time over the video stream, it also might not be too far fetched to think of the option to run image recognition algorithms and other calculations along in the box. For instance the tracking of single objects could be done by the end-user device. Even 3D space positioning could be calculated and 3D objects could be inserted. I opt for the second version of transmitting ready-made tracking data to the box. How these data will be generated (hand-made, semi-automatic, real time and automatic) won’t be issue of the client (i.e. the set top box). Thus, synchronization and composition of the images are of main interest concerning real time handling. Hardware restrictions won’t effect development that heavily, since the most intense processings as tracking are done beforehand. Therefore the task of set top box is limited to the combination of all data.

4.4 Sample Application Having described the general functionality of nowadays set top boxes and current DTV standards earlier, it is time to focus on a more specific and real life project environment. This will show the possibilities, limitations and chances of augmented reality in digital television environments in a more practical way: implementation with its problems will be covered in the next chapters.

4.4.1 Basque Pelota VICOMTech, working closely together with ETB1 , takes part in research on new ways to enrich the television experience and to introduce new functionality and comfort, offering additional information through the digital broadcast. Since a DVB server is available in the company as well as receiving set top boxes, it is possible to simulate enriched content in the same way it would be received in private homes, tuning into the digital offered video of ETB. For instance, different projects dealing with the "pelota sport" have been pursued at VICOMTech. "Pelota" is basically a sport, where two or four player volley a ball against a wall (called frontón in Spanish) using the bare hand (pelota a mano (Spanish) or eskupilota (Basque)), a racket, a wooden bat or a basket. The audience might think of a combination of tennis and squash. Playing with a basket (cesta) speeds of even 300 km/h can be reached, thus making pelota one of the fastest ball sports in the world. Pelota is quite popular in the Basque country. Almost every village has its own court to play and a lot of matches are broadcast in the Basque television. Skilled players are well known and usually a lot of bets are placed on these favorites. The overall popularity turned research in 1

The Basque television broadcaster (Euskal Telebista). Actually VICOMTech is partially owned by ETB.

26

4.4 Sample Application

Figure 4.2: pelota sport sport related applications in VICOMTech into one of the cornerstones in the digital television department. The first realized program offered player statistics on command. The viewer can browse through the information while the current live feed of the pelota game is reduced to a smaller window. Player statistics are retrieved from server-side, thus being always uptodate without user interaction. Moreover it is technically possible to place bets on selected competitors, although this has not yet been introduced in the free market and the broadcaster’s offer. Obviously this issue is getting more and more popular and the possibilities of a real2 interactive television will become standard. Placing bets and taking part in lotteries is one of the first logical steps offering the broadcasting stations an additional income. Currently other TV stations are running experiments with this as well[16]. Other projects currently running at the company aim at tracking objects out of video streams or sequential image files markerless. This is also to be used for sport application related research. Since the geometry of the pelota court is well known, efforts are taken to track the camera position during the match. With an adequate resolution of the generated matrix, holding the tracking information, the video images could be overlaid by objects in 2D or 3D. Realistically speaking this could first be utilized to display perspectively correct banner ads stuck to the walls or the floor of the court: a tool to select banner ads for the frontón and to position them has already been developed at VICOMTech: it offers a 3D representation of the court allowing redistribution and selection of image files as advertisements as a preview of how the real court will look like with set up posters. Moreover the tool can visualize a physical simulation of the bouncing ball (see figure 4.3). Tracking moving objects during the sports match are pursued in the company as well. The position of the ball is to be retrieved in 3D. To achieve this goal, pattern recognition is used with a mathematical projection models of the ball. With this method it is possible to predict the most probable follow-up position in a more precise way, aiding the image recognition algorithms. Possible methods for tracking moving objects will be covered later. Having these running projects on hand it was an evident option to take advantage of the existing pieces and to combine the different parts to gain a rich presentation for the company’s efforts in this field of tracking and pelota applications in a digital television environment. 2

by real interactivity meaning an easy bidirectional data exchange

27

4 Designing Augmented Reality Applications for Digital Television

Figure 4.3: Pelota Viewer program for putting up ads and simulating trajectories

4.4.2 Concept Design The idea arose to design a three dimensional interface for pelota sport broadcasts and at the same time integrate the concept of augmented reality for digital television into it. As a minimum a digital broadcast offers a video stream – just as the old analog version (not regarding the coding of the signal). The digital television offers much more: an electronic program guide (EPG) is available, more than one video stream may belong to one program and the newly available MHP can extend the possibilities even more by basically executing any java-program a developer can think of3 . Thus the concept of a useful and also entertaining concept for enriching the pelota matches had to be found – not only to integrate the Augmented Reality into the television, but also to have a whole package of a running and interactive demo application. Augmenting the Video Stream The first step should combine the previous and other current work of VICOMTech with the new augmentation idea. The video streams are recorded and displayed in a preview window. The user can select the preferred one and will instantly be set to full screen tuning into the adequate audio channel as well. Using the remote control the next stream can be selected directly or by returning to the menu first (see figure 4.4). Hidden to the viewer is the alongside transmission of tracking position for each camera (i.e. each video stream). Frame by frame the set top box not only receives the video image but also information on the camera’s position, direction of view, and up-vector. The novelty introduced into the television image is now the possibility to overlay it by additional materials put into the scene with correct perspective. All augmented objects are rendered in a 3D space, where the viewpoint will be set to the exact same viewpoint transmitted for each frame of the video signal. Thus the overlaid objects fit perfectly into the video scene. The set top boxes will receive all needed information, like video, augmentation objects and tracking data streams for camera positions. 3

while keeping the constraints of the slim java virtual machine in mind

28

4.4 Sample Application

Figure 4.4: video Stream selection concept

Figure 4.5: participants in augmentation process

29

4 Designing Augmented Reality Applications for Digital Television Insertion can now be realized easily. We will start off with the idea of putting two-dimensional banners (for advertisement or other informative purposes) into the sport game. The existing application for placing ads (see 4.4.1) is reused and the same file structure is kept. Giving this existing tool on hand of the broadcasting stations or the department of advertisement, it can offer a direct way of manipulating and defining the displayed sponsors with very little effort. The user can drag and drop image files on the walls of the simulated pelota court. The list of images and their positioning in 3D space is obtained locally from file or is transmitted in the real environment by stream or within the start-up package of the program while initializing the application. If a local cache is available, retransmissions of images could be reduced heavily. The program defines an unlimited list of augmentation bouquets: a set of advertisements or other objects might be selected at once. Thus, depending on the policy of the current program, the viewer might select material to add at free will – or even to add none. The augmentation could also be set automatically and be mandatory for advertisements or special viewer profiles, as discussed in 4.1. The file that lists advertisements also specifies if it is possible for the user to switch or not. These inserted flat images may also exchange signs or posters and text to realize a localization. Different languages or even appropriate cultural references may be inserted. Differences in culture or in taste may require censorship or adjustment. Integrating this convenient way of offering the augmentation bouquets, export to another country and their rules will be a walk in the park. Only the country-specific set has to be (pre)selected and all inserted objects will comply regional standards and laws. The integration of a file loader offers the option to include not only basic two dimensional image files or simple OpenGL-specified geometry, but rather to import complex and textured geometries. Content developers can use their favorite modeling software like Maya, 3DSMax or Blender and export their creations. Again, the 3D position and orientation will be defined beforehand and transmitted alongside the geometry data. Issues like realistic lighting and colors as well as depth blurriness of inserted objects need to be addressed to reach a totally convincing composition. Occluding Video Objects Up to now the video is superimposed by virtual objects by knowing the camera’s position for each frame. As long as the inserted geometry does not interfere with any real object’s positioning the impression of the augmentation is convincing. If the inserted objects are chosen wisely regarding colors and lighting situation, a perfect illusion might already be realized utilizing these few means. As soon as some moving objects out of the video move to the same or closer camera position, where the augmented geometries are inserted, the expected occlusion won’t happen: the illusion will fall apart and the viewer will realize the two independent layers (flat video and virtual, overlaid 3D objects). If this problem is not taken care of, the possibilities of Augmented Television are far too restricted to offer a reasonable advantage. A persuasive mixing of realities could still be achieved – but only if the inserted objects and their movements are well-defined in advance.

30

4.4 Sample Application Disadvantage in this case is, that no live broadcasts are possible and no interaction with the user is realizable – or at most to a quite restricted extent only. The problem is simulated in figure 4.6.

Figure 4.6: demonstration of augmentation of a video frame by inserted 2D images, left: broken illusion due to overlapping between real and virtual objects, right: convincing augmentation with occlusion To deal with this problem and to enrich the augmentation new components are introduced into the project. Two ways of implementing the desired occlusion are pursued: • Transmit a mask describing the depth information for each video pixel for each frame. • Transmit a (simplified) 3D geometry for the scene shown in the video. Both possibilities are tested and integrated into the program to a certain extent. The mask and the geometry information are transmitted again using the data channel of the digital broadcast, thus leaving the generation of this data to the server. The set top box only needs to combine the newly available information with the video and virtual material. Using a mask, it is possible to clip out occluded parts with a foreground/background separation for the video elements. The pixels of the virtual objects are shown if the according mask element is set to 1 and hidden if set to 0 (see figure 4.7). This approach is by all means the easiest and probably fastest way to implement occlusion and is in most cases sufficient to realize a far more convincing illusion. Ways to realize this masked videos and means to gain a higher distinction in depth-positions will be pointed out and discussed. If we plan to use Augmented Television for movies, masks could already be available due to earlier postproduction and rotoscoping, but this is not the case in live pelota broadcasts. Considering the alternative of transmitting a reconstructed 3D geometry of the video scene, leads to certain advantages and disadvantages. Regenerating geometry if only the video frames are available is a rather difficult task. Still, the idea is pursued and research is spent on extracting important 3D features out of single pictures or video sequences. To name one approach I refer to a work on 3D reconstruction from photos "Automatic Photo Pop-up" by Hoiem, Efros and Hebert[28]. The introduced algorithm extracts simplified 3D geometry automatically out of single photographs by comparing to a training set of outdoor images. A separation into ground, sky and verticals (for buildings and other objects that stand upon the ground) is possible. Like

31

4 Designing Augmented Reality Applications for Digital Television

Figure 4.7: video frame and predefined 1-bit mask in a children’s pop-up book a simple separation into different depth layers is obtained. Again, distinguishing into these few parts can be satisfactory depending on the needed accuracy of the reconstructed geometry. For the pelota game we have a well-known background scene (the frontón) and besides up to four players the ball has to be tracked and/or reconstructed. While the latter has a fixed, not altering shape with known dimensions, the moving players cause a much more complex task. My attempts will be described in section 6.5.5. For testing, a way to import occlusion geometry and to update its position on a frame by frame basis has been implemented. Geometry can fluctuate from simple OpenGL shapes to arbitrary objects loaded from a file or stream. Once having 3D player positions or even a 3D geometry on hand it is moreover possible to realize not only occlusion with virtual objects but also to calculate collisions. For interactive television where the user might control the inserted object this can advance the illusion to a much higher level. Virtual World Overview The main idea of superimposed digital video streams being realized, effort can be invested into the "surrounding" application, which will offer a more advanced interface to the viewer of the pelota program. The stream selection menu is replaced by a hopefully more pleasing and interesting one. The 3D world that has been designed to display the augmented parts can be reused to serve as an overview of the court as well. Virtual walls and floor are added and since the camera positions for the streams are known every selectable program is represented by a studio camera 3D model. A small preview of the video will be displayed above the appropriate position (see concepts in figures 4.8 and 4.9). The symbolic objects will change their position and aim synchronized to the cameras in the real pelota court moved around by the cameramen. Thus, the viewer can take a look into the court from a virtual position observing the available viewpoints and select the preferred stream with a single push on the remote control. The zapping through available streams searching the favorite point of view will become obsolete, because one can see the expected angle in advance. The application introduces a concept where alongside the video streams a list of virtual viewpoints can be defined by the content offering company. These positions in the 3D space are represented by camera symbols and have numbers attached. Real camera views and virtual viewpoints can be chosen directly by pressing the according number on the remote control.

32

4.4 Sample Application

Figure 4.8: 3D world, numbers to select virtual and real viewpoints; concepts

Figure 4.9: video stream preview above camera models in virtual view; concepts

33

4 Designing Augmented Reality Applications for Digital Television Camera Flights To enhance the program experience in a more interesting and entertaining manner, virtual camera flights are planned. Switching from one virtual or real camera position to another will not jump immediately to the selected location, but rather calculate a transition. This simulation is not only supposed to be pleasing for the eye, but is supposed to help orientation locating the different view angles. Being optional the broadcasting company may switch the feature on and off depending on the current program or let the viewer select at his or her preference. Acceptance of these interaction delaying functions has to be evaluated. A trade-off of entertaining animations and fast to use interfaces has to be found. I will deal with these questions for transitions during the implementation. Virtual View with tracked Objects A completely new approach to viewing television is the possibility to stay in the virtual representation rather then switching to the live streams. The 3D world is not only a menu overview, but can be used as the preferred single viewpoint. Depending on the quality of the player model and world design the user might favor virtual cameras, since he himself can control the view angle. There will be no restrictions to the edited broadcast version where positions change "without agreement" of the viewer. For now the player representation will only consist of basic geometry to point out the general idea and the working concept. Tracking data are evaluated and one can see the players moving and the pelota ball bouncing in 3D, offering a 100% freedom to the viewer’s experience. If these 3D models become more sophisticated in the future, we will actually be quite close to real three dimensional television. Augmentation Purposes As soon as the user chooses to view directly through one of the real cameras, the described virtual world is switched off and the video stream is displayed in full screen. But since camera and object positions are still completely tracked it is possible to overlay the video by some selected virtual objects. This idea of this augmentation is to offer value added television and to aid the viewer. As introduced earlier in this chapter the first step was to put up virtual advertisements in the frontón, since this idea was pursued in VICOMTech before and a front-end program to attach flat images to a virtual pelota court has already been designed. Other options for this special sports broadcast environment might include the following augmentation: • display extended information on an object or player, such as annotations, descriptions or translations • track selected parts (e.g. the ball) and surround those by markers to draw focus on them and to allow easier pursuit (see 4.10) • magnify selected parts to help viewers with reduced eye sight or just to point out areas of interest

34

4.4 Sample Application • display distance visualization aids, statistics, etc. • enrich the scene by 3D animations to draw attention and to augment the viewer’s pleasure • include simple games, where users can navigate through the video scene to pass time during delays or to take part in a raffle

Figure 4.10: augmentations of pelota game, concept arts

35

4 Designing Augmented Reality Applications for Digital Television

36

5 Technical Realization of Basics for Augmented Reality TV With a planed concept now at hand, I will document my try-outs with set top boxes. Sample applications are being developed and I will describe my research on current hardware and limitations in software.

5.1 Overview Java is used as the development platform, since nowadays the majority of digital set top boxes is able to handle Java Xlet applications, injected through the data channel embedded in the broadcast signal. In this chapter we will describe the trials to utilize Java to develop programs run directly on digital televisions as well as on a simulator/a standard PC. While the latter may use different installed Java Runtime Environments installed, the digital receiver will probably run with a much more limited Java Environment, such as Java Micro Edition or the older pJava (personal Java). Having introduced the different standards and technologies which are involved in DTV development in chapter 3, it is time to take a closer look at a possible work flow for deploying one’s application. The proposed description in the following sections imparts the concept and life cycle in DTV. Also, limitations in available hard- and software lead to an exemplary, descriptive explanation utilized at the company VICOMTech.

5.2 Available Software Suites An application for MHP basically consists of compiled Java sources and additional binary resources (e.g. image files). These special Java-applications, named "Xlets" are comparable to Java-Applets and will be introduced in 5.3.1. The easiest approach is to use front end applications for knitting your desired Xlet together. These helper programs offer a more or less "what you see is what you get"-interface with predefined templates, layouts and buttons. Comparable to other content development programs (e.g. Macromedia’s "Flash" for Web-applications) it lets you define objects, texts and different scenes, so called acts. The latters are used as the different states of your Xlet: the programs starts in one act and alters into the next act on user interaction (i.e. pressing a button on the remote control). The detailed concepts of state change will be covered more precisely in 5.3.1. Basically these front end programs only take care of the Java classes, so one does not have to deal with it on source code level. Having put together one’s application it will be compiled and

37

5 Technical Realization of Basics for Augmented Reality TV deployed as a normal Java class file or multiple files including additional resources. Three popular applications were examined: Cardinal Studio by Cardinal Systems, iTV Suite by Icareus and Jame by Fraunhofer institute. We will restrict the introduction to the first two programs.

5.2.1 Cardinal Studio An available solution for the design of Xlets is Cardinal Studio developed by Cardinal Systems[12]. This interactive TV developer and authoring tool mainly offers a graphical front end interface for the fast creation of application prototypes and demos. But moreover it gives Java programmers the possibility to develop their own components and add-ins using the Java beans concept. An integrated emulator offers a preview of the Xlet, displaying a video-stream overlaid by the application and giving the user a remote control at hand to be able to test navigation by clicking the simulated device. Promising as it sounds, for more complex programming Cardinal Studio is not suitable - or only as a starting point. The suite does not offer direct source code manipulation inside the program, restricting the developer too much. It generates the Java sources only before deploying the new application but does not read these saved Java files on start-up. Thus it is impossible to integrate one’s own functions or modifications into the sources, since re-deploying the Xlet will overwrite all hand-made changes. For setting up acts and a corporate design for all different stages of the program it is a good start, though. Having constructed the barebone of the application it is possible to continue adjusting the source code (using your favorite text editor or Eclipse for instance) and to compile the Xlet without the software suite from Cardinal Systems. The emulator claims to be MHP 1.0.2 compliant, but in practice it still does not offer a 100% security of the application´s correctness. Two main problems arose during the evaluation of the suite: • The background video-stream can’t be altered during a running emulation, synchronization between application and video-signal is not possible. Only a selected mpg-file is looped repeatedly. • The display of image files depends on your system configuration (graphics mode, JDK version) and not on the restrictions of set top boxes making a secure preview impossible.1 [15].

5.2.2 iTV Suite Another software suite to develop MHP applications is called iTV Suite by Icareus[13], which acquired Sublime Software Ltd., the original developer of the first version. Besides the graphical front end for designing Xlets, the software package ships with an optional iTV Integrator. This 1

Although this implementation issue is still MHP compliant, since a full palette support is optional for STB developers. But due to cost efficiency many hardware manufacturers only implement mandatory MHP features and none of the optional: Philips states in their SDK-FAQ the following: "We have implemented all mandatory features of MHP 1.0.2. This means effectively that all optional features are not implemented."

38

5.2 Available Software Suites

Figure 5.1: Cardinal Studio 3.1, Icareus iTV Suite

tool allows the automatic conversion of any XML-data source to the native .nkr file format, which can be uploaded through the data carousel and thus presents a means to dynamically exchange content used by a running application. Since only the restricted demo version could be evaluated, a full conclusion is not possible. But – regarding the software emulator for the MHP applications – the same problems occur compared to 5.2.1. The video-stream can only be an arbitrarily chosen MPEG-file from a hard drive, which will be continuously looped in the background without any possible synchronization. Color palettes are not restricted either.

5.2.3 Conclusion on Software Suites The available software suits pursuit an easy-to-use interface for the visual design of MHP applications and offer helpful tools for fast corporate identity layouts and dynamic content exchange. Source code manipulation for lower-level-access of set top box functionality is not offered. The task to get into the core of the MHP underlying engine and to manipulate video controls and graphics overlays is not obtainable. Thus, the enclosed emulators also only present the Xlet without any connection to streams, carousel-data or video-image-information. Moreover, the visual simulation is "too good": start-up latency, color-palette problems, file size restrictions and the speed of loading the visual elements on a PC is completely different to the hard- and software-restrictions in nowadays set top boxes. With emulators it is only possible to see if the interface works and e.g. no broken links occur. For speed issues and readability on a TV screen it is still necessary to upload the compiled Xlet to a consumer set top box. Since MHP implementations differ on the various set top boxes, it is still mandatory to test applications on a broad range of hardware to ensure flawless functionality.

39

5 Technical Realization of Basics for Augmented Reality TV

5.3 Writing Applications from Scratch 5.3.1 Xlets in MHP Xlet Concept To be able to give content developers extensive possibilities for their applications the platform independent Java-language was chosen to be integrated in MHP and the associated set top boxes. Since the environment in a digital television distinguishes heavily from the one in personal computers, it is not appropriate to port the Java Virtual Machine one-to-one. For example, in a conventional environment the Java model assumes that only one program is being executed in a given virtual machine and that the Java application itself has full control of its own life cycle, but in a set top box there might be several programs running parallel and the TV needs to be able to control, pause and restart running applications. Luckily, this concept is already used by Java applets, which are designed to run in web browsers. Since there are other needs for a digital television environment it is not possible to adopt applets directly. Other concepts of interaction, user input and presentation (TV screens have a worse resolution and input is mostly limited to a remote control with less buttons or keys than a normal keyboard of a PC) have to be defined. Also, as stated before, the applets in a TV are much more limited since used hardware does not offer much power for calculations and graphics. Besides these limitations the applets for the TV environment, called "Xlets", share the concept of regular web-applets: it is mandatory to define certain methods, which will be called by the executing instance (in applets this is the browser, in our case the set top box running MHP). Additional to the possibility to start and stop applications the Xlet-interface also includes methods to pause and resume running programs. The reason for including this feature is because of the hardware restrictions: if some Xlets are currently not displayed, the set top box might decide to change the state of those Xlets to paused to gain additional resources for other purposes. Thus the complete list of available states contains loaded, paused, started and destroyed. A Xlet is thus not a standard Java application. More like an applet there can be more than one application running at the same time – on a single Java virtual machine. Still, like a normal application one could call System.exit() for example, causing the whole VM (handling all started Xlets) to shut down. Examining the life cycle of an Xlet it is possible to see how they are connected and when they are triggered: At start up the set top box’s application manager loads the Xlet’s main class, which has to be determined by the broadcaster, who uploads the Xlet package. The default construcor will be called and an instance created. After the transmission of the complete package, the set top box can start up the Xlet right away or latest, when the user selects the application from a menu. Once the loading process is done, the Xlet remains in loaded state, waiting for execution. Depending on the program’s policy the Xlet might be executed automatically or by user intervention: the application moves on, the initXlet() method is called and the application receives an XletContext object, which can hold additionally needed information for initialization, e.g. to prefetch large assets of image files, that have to be obtained one by one through the data carousel. Once the initialization is complete, the Xlet switches to paused state and is

40

5.3 Writing Applications from Scratch

Figure 5.2: Xlet States immediately ready to be run or the application manager even requests start-up directly by calling startXlet(). While the Xlet is running in started state, the Xlet itself can call its pauseXlet() method or this might be triggered by the application manager, due to external reasons such as a lack of available sources or external events (switching to another Xlet, etc.). The loaded Xlet will remain paused until external reinitialization. The MHP specification recommends freeing up as many used resources as possible when set to paused state, but this is up to the developer’s decision. The last step for a Xlet is to move to the destroyed state by calling the destroyXlet() method (again by its own or from outside). All memory resources will be freed and another start-up is only possible by reloading the whole Xlet through the data carousel. By then it will be a new instance, no stored values from an earlier execution can be reused. The above mentioned XletContext is always encapsulating an Xlet, offering an interface for communication with the set top box’s application manager. Similar to Java applets’ AppletContexts it allows an Xlet to tell its context, that it is about to shut down or set itself to hold. Interesting is the newly available feature for a resume request: a Xlet can set itself to paused, listening for a certain event to occur (such as a certain time passed or an event in a mpeg stream) and to be woken up at that moment. This reinitialization is not granted automatically but rather controlled by the manager – if resources might be short, the restart can be delayed or even omitted. Using the getXletProperty() method it is possible for the Xlet to access additional information signaled by the broadcaster. Currently2 only one property is defined by MHP and JavaTV. XletContext.ARGS lets an application obtain all information data given to it through the application signaling (AIT)3 . MHP defines additional Xlet properties, which are also implemented in OCAP: • dvb.app.id - the application ID of the application, as set in the application signaling • dvb.org.id - the organization ID of the application, as set in the application signaling 2 3

August 2005 Since passing command-line arguments is not possible in the Xlet model this can be regarded as a work-around solution to access set values nevertheless

41

5 Technical Realization of Basics for Augmented Reality TV

p u b l i c interface XletContext { p u b l i c s t a t i c final String ARGS = " j a v a x . t v . x l e t . a r g s " p u b l i c v o i d notifyDestroyed ( ) ; p u b l i c v o i d notifyPaused ( ) ; p u b l i c v o i d resumeRequest ( ) ; p u b l i c Object getXletProperty ( String key ) ; }

Figure 5.3: Code: XletContext interface functions • dvb.caller.parameters - the parameters passed to this Xlet if it was initialized by a mechanism other than the AIT The XletContext.ARGS property refers to parameters that are passed in by the AIT, while the dvb.caller.parameters are passed in via the MHP application listing and launching API. Implementation A simple Xlet structure will be given below, to display the general concept. Annotations describe the different functionalities. / / t h e i n t e r f a c e m u s t be i m p l e m e n t e d , o t h e r w i s e t h e m i d d l e w a r e / / won ’ t be a b l e t o e x e c u t e t h e X l e t p u b l i c c l a s s XletDemo implements javax . tv . xlet . Xlet { / / p a s s e d by t h e i n i t X l e t ( ) method t o l e t t h e X l e t know i t s c o n t e x t / / − comparable to a p p l e t c o n t e x t s p r i v a t e javax . tv . xlet . XletContext xletcontext ;

/ / s i n c e t h e s t a r t X l e t ( ) method w i l l c a l l e d a s w e l l d u r i n g initialization / / a s d u r i n g r e s u m e we s t o r e i f we were a l r e a d y up and r u n n i n g o r not in t h i s boolean p r i v a t e boolean yetstarted ;

p u b l i c XletDemo ( ) { / / s h o u l d be empty , a l l / / in i n i t X l e t () }

42

i n i t i a l i z a t i o n i s t o be done

5.3 Writing Applications from Scratch / / i n i t h e r e , t h e c o n t e x t i s p a s s e d h e r e , a c o p y s h o u l d be / / g e n e r a t e d i f not too r e s o u r c e consuming p u b l i c v o i d initXlet ( javax . tv . xlet . XletContext context ) throws javax . tv . xlet . XletStateChangeException { t h i s . context = context ; / / f i r s t t i m e we r u n t h e x l e t yetstarted = f a l s e ; / / t e s t output System . out . println ( " I n i n i t X l e t ( ) . X l e t c o n t e x t = " + context ) ; } p u b l i c v o i d startXlet ( ) throws javax . tv . xlet . XletStateChangeException { / / started yet ? i f ( yetstarted ) { System . out . println ( " s t a r t X l e t ( ) method c a l l e d a g a i n . . . RESUMING ."); } else { System . out . println ( " s t a r t X l e t ( ) method c a l l e d 1 s t t i m e . INIT . " ) ; yetstarted = t r u e ; } / / f u n c t i o n not l i s t e d here : / / t h e s t a r t X l e t ( ) i s supposed t o r e t u r n as f a s t as p o s s i b l e t o the / / a p p l i c a t i o n manager , t h u s a l l f o l l o w i n g a c t i o n s h o u l d be outsourced / / to a separate thread startXletThread ( ) ; } / / s h o u l d s t o p a l l a c t i v i t y and f r e e up a s many r e s o u r c e s a s // possible p u b l i c v o i d pauseXlet ( ) { System . out . println ( " p a u s e X l e t ( ) c a l l e d . " ) ; }

/ / s t o p s t h e Xlet , boolean i n d i c a t e s whether X l e t has t o

43

5 Technical Realization of Basics for Augmented Reality TV / / follow t h i s request of q u i t t i n g . I f i t i s not forced / / ( b o o l e a n s e t t o f a l s e ) , i t can r e q u e s t t o k e e p l i v i n g / / by t h r o w i n g an X l e t S t a t e C h a n g e E x c e p t i o n . p u b l i c v o i d destroyXlet ( boolean unconditional ) throws javax . tv . xlet . XletStateChangeException { i f ( unconditional ) { System . out . println ( " d e s t r o y X l e t ( t r u e ) c a l l e d . Q u i t t i n g . " ) ; } else { / / We h a v e had a p o l i t e r e q u e s t t o d i e , s o we can / / r e f u s e t h i s r e q u e s t i f we want . System . out . println ( " d e s t r o y X l e t ( f a l s e ) c a l l e d . I g n o r i n g s u i c i d e mission . " ) ; / / throw a X l e t S t a t e C h a n g e E x c e p t i o n to t e l l t he / / m i d d l e w a r e t h a t t h e a p p l i c a t i o n would p r e f e r t o / / keep running throw new XletStateChangeException ( " Don ’ t k i l l me ! " ) ; } } }

5.3.2 Uploading and Running

TSDeveloper is used at VICOMTech for uploading an Xlet to the DVB server. Once an application is built and uploaded it can be accessed from the TV. A complete folder structure with different Java packages can be imported, including binary files such as images. The user has to pick a name (displayed on the television screen when selecting an application) and specify the start-up class that should be executed by the interpreter in the set top box. An origin (company) id has to be set as well as an application id; if more than one Xlet is transmitted, the set top box uses these ids to distinguish the uploaded programs. Once the Xlet is uploaded, the set top box might signal a freshly available Xlet via an icon in the corner of the screen, depending on the manufacturer's design. Loaded Xlets can be selected by the user or run automatically, if this flag has been set before upload. An Xlet keeps running until the user either switches off the TV or changes the channel. The Philips set top boxes used here (Philips DVB-T receiver DTR4600) lost all data defined inside an Xlet when switching channels. Moreover, the Java program may destroy itself at any given time, and the set top box might turn off or pause running applications if the available resources become scarce. Uploading an Xlet of a medium size of 400 kB takes around 10 to 15 seconds; startup afterwards is immediate.

Figure 5.4: TSDeveloper used to upload Xlet to DVB server

Figure 5.5: manual loading of Xlets, interface on a Philips set top box


5.3.3 Set Top Boxes vs Emulators

Instead of uploading a freshly written Xlet to a set top box for every tryout and test, different emulators for the PC come in handy. Besides the developer tools already presented (see 5.2), which include their own internal emulators, further solutions are available. A well known tool is xleTView[18]; I chose to use it throughout the development process. In the xleTView GUI one can directly load (and bookmark) Java class file sets and packages. The startup class has to be defined (as in TSDeveloper) and the program simulates a TV screen and offers a drawn remote control inside the interface, with which the user can navigate through the demo just like in a real environment. Since no real connection to a DVB broadcast and its included services such as DSM-CC and MHP is available, the emulator is mainly restricted to checking the layout of one's design. Background videos or a static image can be set in the initialization file. The majority of the MHP APIs are implemented as stub classes, which allows a faultless compilation of the Xlet but does not offer the full functionality. Main parts are missing in the packages org.davic.*, org.dvb.*, org.havi.ui and org.havi.ui.event. A complete and updated list of missing implementations can be found on the web[21]. In many cases an emulated function will only return null or 0; for example, all JMF functionality offering access to video streams returns no results. Currently the best use of xleTView is to check the design of the visual elements of an Xlet. During development we will compare Xlets running in the emulator with the "live" version running on a set top box. It is worth mentioning that the emulator offers stub classes for MHP that lack functionality while at the same time offering too much power in packages like the AWT: restrictions imposed on set top boxes are neglected, because a full Java Standard Edition runs underneath the emulator. Attempts to execute it with the Java Micro Edition – to come closer to real-life capabilities – failed.

5.4 Stepping through Possibilities of Set Top Boxes

Unfortunately, development is not as straightforward as hoped: MHP does not include the full JDK specification; it is rather a mixture of optional features loaded on top of a lowest common denominator defined by the MHP specifications and, in most cases, Sun's personalJava (pJava) definitions. Each manufacturer may include as many (or as few) optional MHP features as preferred. Usually only high-end (and highly priced) set top boxes implement the whole specification, so to stay compatible with as many boxes as possible, developers are still forced to restrict themselves to a minimal set of functions. The MHP definitions are moreover kept quite vague, leaving problems to the company implementing them. (Because of the high costs of implementing, testing and certifying an MHP compliant set top box, the majority of companies uses the same middleware implementing the MHP stack, developed by Alticast.)

Philips, for example, allowed Xlet implementation based on the JDK 1.1.8, restricted through pJava limitations. Since Sun no longer supports this API, a port to the Java Micro Edition has become reasonable and became reality with the latest receivers. In the following sections, the issues important for our purposes of augmented reality – such as graphics and synchronization – are considered. It will be described how these should and do work in MHP on a set top box and in an emulator.

5.4.1 Graphics Display

MHP Display Architecture

For our purposes the display of graphics is the most important part. At the same time, this part is the most complex in the MHP specification, and general problems arise in the conversion from a PC to a television environment. The aspect ratio of screens may vary (usually 4:3 on a PC, ranging from 4:3 to 16:9 (widescreen) or even 14:9 on a TV, the latter especially introduced in MHP, see the MHP specification, chapter 13.3), but the pixel aspect ratio changes as well: video and TV applications typically use non-square pixels, while square pixels are the standard in a PC graphics API. Rescalings and changes in positions will occur, causing overlay problems, distortions or offsets if not handled correctly. Further trouble may be brought on by color conversion: the RGB color space used in Java's AWT API has to be converted into the television's YUV signal.
To realize graphics overlays, one has to resort to the AWT classes that can be found inside the MHP implementation. They are restricted to the light-weight classes found in the personalJava specification. This mostly affects all window-manager related classes, as there is no window manager available in MHP. Instead, another API – HAVi – introduces an extension to the Java GUI, known as the HAVi Level 2 GUI. This specification allows applications to share resources and screen elements without a window manager and can be found inside the package org.havi.ui.
Each screen displaying MHP content can be logically split up into three layers: a background plane, a video plane and one for graphics. Moreover there might be another layer in between: MHP provides a means to display subtitles in a separate plane, but without specifying exactly at which position this layer has to be implemented (MHP technical specification, chapter 13.5.2). Again, the MHP specification does not fix a rule, but instead proposes to turn off subtitles while running an Xlet or to avoid problems by choosing different screen coordinates for the Xlet and the subtitles. The background plane is able to display a single color or (optionally) a still frame (encoded as an MPEG I-frame). The developer is not able to draw to the video layer. In general, access to these layers is realized through the HAVi classes. An instance of HScreen represents one physical display device and a set of HScreenDevices lists the layers.

Figure 5.6: graphical layers in MHP

Typically, one HScreen has at least one of the following HScreenDevice subclasses (mandatory, MHP specification, Annex G 1.1):
• HBackgroundDevice – background layer
• HVideoDevice – video layer
• HGraphicsDevice – graphics layer
There might be more than one instance of the latter two, and to handle access to all layers the HScreen class defines getter methods for this purpose (for details see the HAVi specification, chapter 8.3.3.3.1). The default graphics device's resolution should be determined by calling java.awt.Toolkit.getScreenSize. The HAVi specification deprecates certain AWT functions: java.awt.Toolkit.getScreenResolution should not be used, nor should java.awt.Toolkit.getNativeContainer. If the screen size cannot be determined (e.g. if an analog screen device does not offer a return value), 4:3 is to be used as the default ratio. The given devices can be altered by calling the setGraphicsConfiguration() methods; settings can be retrieved with getDefaultConfiguration() and getBestConfiguration(). The latter is of greater importance for us, since it speeds up configuring an application for a specific device. As argument we have to define a template, an HGraphicsConfigTemplate, that specifies our needs: if we must be able to resort to image rescaling, for example, the template must define the preference IMAGE_SCALING_SUPPORT as REQUIRED. Possible switches are REQUIRED, PREFERRED, UNNECESSARY, PREFERRED_NOT and REQUIRED_NOT. Several preferences can be controlled this way. Calling getBestConfiguration() with the template returns an HGraphicsConfiguration, which holds all important settings. Some settings might be exchanged afterwards and set active by calling setGraphicsConfiguration() with the configuration as argument. If a running application restricts certain settings through REQUIRED or REQUIRED_NOT, the initialization of a second application might fail and result in a null return value of getBestConfiguration(), if the template of the newly started program conflicts with the mandatory preferences. These restrictions cause headaches for all MHP developers – minimizing the available possibilities further.
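A compact sketch of this configuration step – the preference constants are the HAVi names used above, while the surrounding method and the error handling are only illustrative:

import org.havi.ui.*;

// illustrative: pick and activate a graphics configuration during Xlet start-up
void configureGraphics() {
    HScreen screen = HScreen.getDefaultHScreen();
    HGraphicsDevice gfx = screen.getDefaultHGraphicsDevice();

    HGraphicsConfigTemplate template = new HGraphicsConfigTemplate();
    // we insist on image scaling; pixel alignment with the video is only preferred
    template.setPreference(HGraphicsConfigTemplate.IMAGE_SCALING_SUPPORT,
                           HGraphicsConfigTemplate.REQUIRED);
    template.setPreference(HGraphicsConfigTemplate.VIDEO_GRAPHICS_PIXEL_ALIGNED,
                           HGraphicsConfigTemplate.PREFERRED);

    HGraphicsConfiguration config = gfx.getBestConfiguration(template);
    if (config != null) {
        try {
            gfx.setGraphicsConfiguration(config);
        } catch (Exception e) {
            // HPermissionDeniedException / HConfigurationException on contention
        }
    }
}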

Figure 5.7: images and shapes drawn in the xleTView emulator (left) and on a Philips set top box (right)

Overlaid Graphics in 2D

Through the HAVi API it is possible to easily display text, shapes and images in the graphics layer. HComponents can be added to the current HScene, analogous to AWT component definitions. For a static presentation (without e.g. tooltips or audio feedback) the classes HStaticText and HStaticIcon assist. Moreover we can draw arbitrary shapes to our component using a reference to the java.awt.Graphics object of the graphics layer. For instance, AWT functions such as fillOval or fill3DRect work flawlessly in the MHP environment (see figure 5.7). Apart from this plain graphics functionality, one has to resort to the widget objects as (re)defined by the HAVi specification rather than the AWT equivalents. All usable widgets are derived from org.havi.ui.HComponent (e.g. HStaticText, HTextButton, HGraphicButton). Example code describing these graphical issues with a complete step-by-step implementation is given in the Java package es.vicomtech.dtv.xletdemo, especially in the class VICOMXletDemoGfx01. All following examples (as well as those not mentioned here) can be found within the same package; we will not refer to them individually again. Aside from that, complete documentation is available for all sources.
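As a rough sketch of the drawing side – the HAVi and AWT calls follow the classes named above, while the component and scene handling is reduced to a hypothetical minimum:

import java.awt.Color;
import java.awt.Graphics;
import org.havi.ui.HComponent;
import org.havi.ui.HScene;
import org.havi.ui.HSceneFactory;
import org.havi.ui.HStaticText;

// hypothetical component painting a few shapes into the graphics layer
class DemoOverlay extends HComponent {
    public void paint(Graphics g) {
        g.setColor(Color.yellow);
        g.fillOval(20, 20, 120, 80);            // works on the set top box as well
        g.setColor(Color.blue);
        g.fill3DRect(180, 20, 120, 80, true);
    }
}

// illustrative use, e.g. somewhere in startXlet():
HScene scene = HSceneFactory.getInstance().getDefaultHScene();
scene.add(new HStaticText("Hello MHP"));
DemoOverlay overlay = new DemoOverlay();
overlay.setBounds(0, 0, 720, 576);
scene.add(overlay);
scene.setVisible(true);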


As can be seen in figure 5.7, the color palette problem restricts developers heavily. While the video layer is able to present the full spectrum of shades (24 bit color) available in a video, the other planes might be limited to a much smaller color space, often offering only 256 colors (using a CLUT) – and some of those are reserved, leaving only 188 for free usage. Only 256 color support is mandatory. Moreover, the non-square pixel problem is left to developers. Mandatory for an MHP implementation is only a minimum size of 720x576 for all three layers; 768x576 and 1024x576 are only optional supports for square pixel ratios. Since the graphics layer may have a different resolution (or even aspect ratio), aligning video and graphics can become troublesome. The configuration template preference VIDEO_GRAPHICS_PIXEL_ALIGNED is not mandatory for MHP receivers; if it is available, life for developers gets easier. In theory, all coordinates used for positioning elements with MHP follow one of these three options:
• Normalized coordinates, starting at the top left edge with (0,0) and ending at the bottom right with (1,1). Typically used by the HAVi classes.
• Screen coordinates, taking absolute pixel values (depending on resolution), starting at the top left with (0,0). Mandatory in MHP is a minimum of 720x576 pixels.
• AWT coordinates, using relative coordinates in normalized mode (up to a maximum of (1,1)) or in pixel mode (up to the x and y values of the resolution). The coordinates' starting point (0,0) depends on the parent container into which the AWT element is laid.
To influence the origin of the HScreen coordinates, it is possible to set position and size through HSceneTemplates using HSceneTemplate.SCENE_SCREEN_LOCATION and HSceneTemplate.SCENE_SCREEN_DIMENSION, giving values in absolute pixels as defined in the according HGraphicsDevice (see figure 5.8). To guarantee that all elements are displayed, we further have to take into consideration that TV screens might cut off parts of the presented video or graphics at the borders. This can be due to analog TV screens simply hiding parts because of hardware restrictions, or, for example, caused by rescaling a 16:9 video to a 4:3 screen. Moreover, a user might select a widescreen presentation for a regular PAL signal – leaving out stripes at the top and bottom on purpose (to avoid black borders at the sides). Generally speaking, it is recommended to only use a so-called safe area for the video as well as for the overlay information. While a PAL signal delivers 720x576 pixels for a single frame (called the production aperture), only a smaller image inside this space offers a 100% safe and stable image (called the clean aperture). No additional information encoded inside the borders of the transmission or sloppily designed image alterations that do not cover the entire frame will be found inside the clean aperture[35]. Usually a 5% safe zone on all sides is sufficient to guarantee flawless video reproduction. Overscan complicates the alignment between video image and graphics further: again to avoid displaying anything outside the clean aperture, many CRTs use what is called overscan – the video frame is displayed slightly larger than the visible area of the screen to hide any distortions.


Figure 5.8: Possible configuration of HAVi Devices

Unfortunately, different TVs will have slightly different overscan – causing, again, offsets in the sought alignment between overlaid graphics and the video frame. To state it plainly: if the video is displayed inside the video layer, there is no 100% assurance that overlaid graphics will appear at the exact same position – relative to the video stream – on all available set top boxes.
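A small helper for the 5% safe zone mentioned above – the percentage and the PAL production aperture come from the text, the method itself is only an illustration:

import java.awt.Rectangle;

// illustrative: shrink the PAL production aperture (720x576) by a margin on
// every side to obtain a conservative "safe area" for overlays
static Rectangle safeArea(int width, int height, double margin) {
    int dx = (int) Math.round(width * margin);
    int dy = (int) Math.round(height * margin);
    return new Rectangle(dx, dy, width - 2 * dx, height - 2 * dy);
}

// safeArea(720, 576, 0.05) yields roughly x=36, y=29, 648x518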

Transparencies and more GUI Design

Besides the restrictions imposed by color palettes and usable safe screen areas, we have to cope with even more design issues not seen in a computer screen environment. Interlacing causes especially thin lines and elements, as well as written text, to appear jagged or fringed. Although the graphics layer works without interlacing (in contrast to the video layer), the final image composed by the set top box will usually be a regular PAL signal transmitted to the TV, and thus always encoded interlaced. Screen fonts need to be even larger: text sizes of 18 points are an absolute minimum for a TV broadcast; sizes up to 28 points seem more suitable, taking into account that a viewer's distance to a TV is usually larger than to a PC. Graphics interfaces must not contain thin lines and are better off with plain colors and thick borders around elements and widgets. The MHP implementation ships with the Tiresias font, which is a good choice – a plain text style without serifs. It should also be used in a wisely chosen color with a high luminance contrast to the background; if a transparent background is used, an appropriate outline should be added. In movie and TV postproduction, for example, a scrolling text (as used for the credits at the end) is usually displayed with a vertical blur of radius 0.5 applied to intentionally defocus the words, resulting in an overall better rendering – if broadcast interlaced.


Figure 5.9: transparencies handled in xleTView emulator (left) and on a Philips set top box (right)

Figure 5.10: overlay transparencies via the DVB classes cannot be handled in the xleTView emulator (left), but work on a Philips set top box (right)


Figure 5.11: Component mattes, stacked

Transparencies supported by MHP can be set for graphical elements as well as for displayed image files. 1-bit alpha in PNG and GIF is supported completely, while a full gradient of 8-bit alpha values is only optional, leaving only 0%, 70% and 100% opacity as mandatory values (see figure 5.9). Graphical elements can either be rendered with transparency through the DVB API (defined in org.dvb.ui.*), which handles graphics through the graphics context (java.awt.Graphics – to be more precise, a generated instance will and must be an object of the class org.dvb.ui.DVBGraphics), or through so-called HMattes of the HAVi UI API. With DVB, the opacity can be controlled through a DVBAlphaComposite in conjunction with DVBGraphics. Implementations must only support the rendering modes SRC, CLEAR and SRC_OVER; the other five options (SRC_IN, SRC_OUT, DST_IN, DST_OUT, DST_OVER) might not be available (MHP specification, chapter 13.3.6.1; see figure 5.10). Relying on HAVi, the HMatte interface and its subclasses provide a way for applications to perform alpha-blending operations on HAVi components. While the DVB approach uses the graphics context, the HAVi version offers opacity on component level – it is "higher level". Each component can have its own matte, defined as a static value or by a bitmap mask. For animations, even time dependent changes are possible, alternating bitmaps or mask positions. Images with their own alpha, lying inside a component, multiply to a total alpha (see figure 5.11). Different outcomes can be obtained by grouping or not grouping components: in a grouped set all subelements also take the parent's alpha (in addition to their own), while this is not the case without groups (HAVi specification 1.1, chapter 8.3.6.2). Again, HMattes are not mandatory in MHP (MHP specification, Annex G.7, table G.4).
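A small sketch of the DVB route – only the mandatory SRC_OVER mode is used; the cast and the constant names follow org.dvb.ui, the surrounding fragment is illustrative:

import java.awt.Color;
import java.awt.Graphics;
import org.dvb.ui.DVBAlphaComposite;
import org.dvb.ui.DVBGraphics;

// illustrative paint() of an MHP component drawing a 70% opaque rectangle
public void paint(Graphics g) {
    DVBGraphics dvbg = (DVBGraphics) g;   // in MHP the graphics context is a DVBGraphics
    try {
        // source-over compositing with 70% opacity (a mandatory alpha level)
        dvbg.setDVBComposite(DVBAlphaComposite.getInstance(
                DVBAlphaComposite.SRC_OVER, 0.7f));
    } catch (Exception e) {
        // thrown on boxes that do not support the requested mode
    }
    dvbg.setColor(Color.red);
    dvbg.fillRect(50, 50, 200, 100);
}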


Access to Video Stream Data

To realize video access, MHP resorts to the Java Media Framework (JMF). This API has already been briefly introduced in section 3.2.1. To integrate the offered functionality, the package javax.media has to be imported. Within this set, two distinct classes are of main importance: first, a Player has to be defined, which is responsible for decoding and playing back the media file; secondly, each Player can have an unlimited number of bound controls. These linked classes are not mandatory for playback, but add extended functionality to the players: single frame grabbing or freeze frame control can only be realized with the control classes. To actually access a media source, a third object is needed, which will be defined indirectly in most cases: a data source holds the actual information. This class hides all differences in file access from our point of view: an HTTP source can be bound the same way a local or network file is integrated. The indirect usage of a data source means that a URL or a file name is given during instantiation. This causes the Manager to set a MediaLocator internally, giving us access to the data through the locator object. This link also examines the media format, and it is possible to ask for it by calling getContentType() on the data source object. For instance, in MHP, a locator (dvb://source) will always use a multipart/dvb.service MIME content type.

Restrictions in MHP

While the latest release of the JMF is version 2.1.1e[20], the MHP implementation defines usage of JMF 1.1. This decision is justified by cost efficiency – in hardware as well as in software: the old version includes all functionality needed for video and audio playback while still being slim and comparably fast. Unfortunately, direct video pixel access is not possible with version 1. The video frame will always be visible – but without any access into it. The concept of a "high level" set top box (as described in 4.3.3, speaking about smart boxes with integrated computer vision algorithms) is thus already impeded before hardware capabilities ever increase. If the MHP specification does not undergo a revision switching to the 2.x branch of the JMF, direct access to and/or manipulation of the video stream won't be achievable, and neither will recording features. The set top box is by MHP definition "doomed to stay dumb". Moreover, the MHP version of the JMF is still not one-to-one compliant with Sun's 1.1 specification. Certain methods work in a different way, or simply return an exception or null. This can be understood by looking at the different surroundings of JMF usage: in a broadcasting environment full control of the video data is just not possible – a stream cannot be rewound or sought through (in combination with a hard disk recorder and a caching mechanism it would be possible to step through the parts already broadcast, but in general this won't be the case). Freeze frame is possible, but after resuming the stream there will be a leap forward to the currently broadcast images. The Player is also more limited under the current MHP specification: e.g. the Player.setSource() function won't work (since the offered stream can't be changed from within – the DVB controls the video data and our Java application only runs on top of it). Also, functions referring to visual components of a Player might fail and return null: Player.getVisualComponent(), Player.getControlPanelComponent(), etc. This heavily depends on the implementation and the specific set top box. MHP only defines the video layer as mandatory[6] (see section 5.4.1); these get-methods might return the according AWT component (e.g. to directly draw into them) – but usually only on high-end set top boxes. Component based players – rendered into the graphics layer – are only optional. If getVisualComponent() successfully returns an instance of org.havi.ui.HVideoComponent we are in luck, since from then on we have converted a background player from the video layer into a component based layer lying inside the AWT hierarchy and the graphics layer: the problem of alignment between video and graphics is solved (as depicted in 5.4.1).


More restrictions concern the inability of MHP to tune to different transport streams. Once a media locator is bound to a stream, it can't be reconnected to another source or seek other streams; the application or the television has to switch completely before data of another transport stream can be accessed. This fact is only given implicitly by the MHP specification – because this part is not documented at all; Steven Morris pointed it out in [30]. How to access different transport streams through the other available API – JavaTV – is not specified either: depending on the implementation, JavaTV may or may not be able to tune. Besides these restrictions, JavaTV offers two extensions to the JMF controls[24]. These can be used, for instance, to select a certain stream from an offered bouquet (javax.tv.media.MediaSelectControl) or to resize the displayed video frames (javax.tv.media.AWTVideoSizeControl). Controlling the video using these controls is shown in my demo Xlets under "#2 TV Image Manipulation" (see 9.2); a small sketch also follows after the list below. Nevertheless, video support is only mandatory for a single stream (lying in the video layer). A multi-view selection window (as sketched in figure 4.4) can't be realized for all set top boxes. As stated before, JMF component based Players, presented inside the graphics layer, are only optional. But if this feature is offered, we win on two issues:

• It is possible to show more than one video stream (for multiple previews).
• We can position a video full screen with complete control over where it will lie, since it is set in the same coordinate system as our inserted graphics – allowing a 100% fitting overlay, without the headaches and uncertainties described earlier when trying to synchronize positions with the video layer.
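As announced above, a rough sketch of the JavaTV resize control – the control and class names come from javax.tv.media, the target rectangles are arbitrary example values:

import java.awt.Rectangle;
import javax.media.Player;
import javax.tv.media.AWTVideoSize;
import javax.tv.media.AWTVideoSizeControl;

// illustrative: shrink the broadcast video into a quarter-screen window
void shrinkVideo(Player player) {
    AWTVideoSizeControl sizeCtrl = (AWTVideoSizeControl)
            player.getControl("javax.tv.media.AWTVideoSizeControl");
    if (sizeCtrl == null) {
        return;                                        // control not offered by this receiver
    }
    Rectangle src  = new Rectangle(0, 0, 720, 576);    // full source frame
    Rectangle dest = new Rectangle(360, 0, 360, 288);  // top-right quarter of the screen
    AWTVideoSize requested = new AWTVideoSize(src, dest);
    sizeCtrl.setSize(sizeCtrl.checkSize(requested));   // use the closest supported size
}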

3D Objects

The MHP, HAVi UI and DVB specifications do not include any 3D rendering capabilities at all. Rendering is restricted to the described layered, component and java.awt.Graphics based two dimensional GUI elements. Seeing the restrictions already applied to these objects, asking for even more powerful and hardware demanding abilities seems quite far fetched. But if the MHP specification undergoes a future revision, switching the basis entirely to the Java Micro Edition, a software 3D renderer provided by Sun could be integrated easily: this combination is specifically developed for low-power CPU environments such as mobile devices. So why not for set top boxes as well? Similar restrictions apply: extremely limited memory and calculation power due to hardware (i.e. cost) restrictions. We will discuss different 3D rendering approaches later in section 6.5. To sum up the state of the art: currently no set top box is available that could handle real (software or hardware) 3D rendering.


Figure 5.12: Minimum set of input events

5.4.2 User Activity and Interactivity

For a viewer, interaction is only possible through a remote control or a connected keyboard or mouse; mandatory is only a small subset of a PC keyboard layout. To retrieve pressed buttons of a keyboard or remote control, MHP offers two different ways. The first is the AWT option, as defined in java.awt.Component: the application receives all input events if the component has focus. MHP only demands a minimum set of available keys (see figure 5.12); all other keys (as on a regular keyboard) are only optional. The second method to access input is through the org.dvb.event package. This concept of events and listeners is triggered before pressed keys are handed through to the AWT handler. The two approaches are similar, and since we did not see an advantage of the DVB version over the AWT implementation, we leave it at the second. Mostly, the set top box itself wants to keep control of inputs – for example the number keys – to change channels, etc. If an MHP Xlet uses the same keys, we have to avoid this conflict: the navigator should ignore the pressed key event and it should only trigger events inside the Xlet. Using HScene.setKeyEvents we can explicitly declare disinterest in certain keys – granting control back to the navigator. Testing this issue on the Philips set top boxes still led to problems: although we declared number key usage for the Xlet, the channel kept changing when a number button was pressed. Again, the MHP implementation is not flawless.
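A minimal sketch of the AWT route: the remote control key constants come from org.havi.ui.event.HRcEvent, while the commented key-restriction calls at the end (HEventGroup and its add methods) are an assumption about the setKeyEvents signature and may differ between MHP versions:

import java.awt.event.KeyEvent;
import java.awt.event.KeyListener;
import org.havi.ui.event.HRcEvent;

// illustrative: react to the coloured remote control keys via plain AWT events
class ColourKeyHandler implements KeyListener {
    public void keyPressed(KeyEvent e) {
        switch (e.getKeyCode()) {
            case HRcEvent.VK_COLORED_KEY_0: /* red key   */ break;
            case HRcEvent.VK_COLORED_KEY_1: /* green key */ break;
            case KeyEvent.VK_UP:            /* arrow up  */ break;
        }
    }
    public void keyReleased(KeyEvent e) { }
    public void keyTyped(KeyEvent e)    { }
}

// in startXlet(): the scene must have focus to receive events; declaring
// interest only in selected keys (assumption) hands the rest back to the navigator
// scene.addKeyListener(new ColourKeyHandler());
// HEventGroup keys = new HEventGroup();
// keys.addAllColourKeys();
// keys.addAllArrowKeys();
// scene.setKeyEvents(keys);
// scene.requestFocus();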

5.4.3 Delays and Animations

Timing support is defined through the javax.tv.util package. Implementations are required to meet certain specifications: a repeat interval of 40 milliseconds or less must be guaranteed, with a granularity of 10 milliseconds or less (MHP specification, chapter 11.9.1). The given interval corresponds to the typical frame rate of 25 fps (as in PAL signaling).
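A sketch of how such a periodic trigger is typically set up with javax.tv.util – the listener body and the delay value are placeholders:

import javax.tv.util.TVTimer;
import javax.tv.util.TVTimerSpec;
import javax.tv.util.TVTimerWentOffEvent;
import javax.tv.util.TVTimerWentOffListener;

// illustrative: fire roughly every 40 ms (about one PAL frame) to drive an animation
void startAnimationTimer() {
    TVTimerSpec spec = new TVTimerSpec();
    spec.setDelayTime(40);      // milliseconds
    spec.setRepeat(true);
    spec.addTVTimerWentOffListener(new TVTimerWentOffListener() {
        public void timerWentOff(TVTimerWentOffEvent e) {
            // advance the animation by one step and repaint
        }
    });
    try {
        TVTimer.getTimer().scheduleTimerSpec(spec);
    } catch (Exception e) {
        // TVTimerScheduleFailedException if the spec cannot be honoured
    }
}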


Test animations with simple 2D shapes and graphics already caused unpredictable delays within this given 40 ms time span: decoding the video stream ate up so much of the available CPU power that the animation stuttered whenever a complex frame had to be decoded. If the stream only consisted of a few plain colors (or even only black), the animation was triggered without delays. Hence, animations that need synchronization can't just be replayed with a follow-up position and a fixed timer delay. Instead, we need more sophisticated time stamps linking independent positions to the clock. Synchronization with the video images will be covered later (in 6.4) when discussing the sample application.

5.4.4 Network Access and Return Channel

To establish data transmission (besides the MPEG encoded video signal) in DVB, the standard defines five possibilities:
1. data piping,
2. data streaming,
3. multi-protocol encapsulation,
4. object carousels,
5. data carousels.
The last two are used in MHP to transmit the Java Xlets (4th) and to give a Java application access to carousel data (5th). This carousel technique allows the transmission of complete directory structures or simply single files. The data is split up into modules with a maximum size of 64 kByte. These packages are injected into the carousel in turns: starting with the first package, each piece is sent through the channel to the receiver; once the last package went through, the carousel restarts with the first item. If a receiver wants to obtain a complete data set, it has to wait for all pieces and reassemble them to decode the content. If a package was erroneous, the receiver has to wait for the next round. For instance, transmission of a 512 kByte package (our demo Xlet package weighs 506 kB) on a 128 kbit/s connection will consume more than 30 seconds. Hence, it is wise to split up the data into different modules and broadcast the most important ones more often (e.g. the Java classes that start the application), so the latency for the viewer can be reduced. The carousel concept only allows unidirectional, broadcast information transmission to the viewer's box. To establish a real two-way communication, one can take advantage of the return channel defined in the MHP standard. The currently most widespread version, MHP 1.0x, supports a V.90 modem dial-up connection relying on TCP/IP or UDP data transport. It allows HTTP 1.1 and DNS services, but no further higher level protocols such as HTTPS, SMTP or FTP. These will only be implemented in the latest MHP release 1.2, which also specifies broadband connections, but today hardly any set top box is based on this revision; the majority still runs MHP 1.01 or 1.02 (even fewer boxes). The return channel will connect to a specified server, which is not necessarily property of the TV broadcaster: the MHP application might only link to a 3rd party (e.g. for online bookings or advertisement redirection), which will then handle all further interaction. In that case no feedback is seen inside the video stream offered by the broadcaster; if such feedback is a planned outcome, the server additionally needs to establish a connection to the broadcaster. Poll results or highscore lists (e.g. for games running alongside a quiz show, involving the home viewer) might then be displayed inside the video, allowing a full interaction between viewer and the running program.
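Reading a file that arrives via the object carousel typically goes through org.dvb.dsmcc; the mount-and-load pattern below follows that API, while the locator string and the file name are invented for the example:

import java.io.BufferedReader;
import java.io.FileReader;
import org.davic.net.dvb.DvbLocator;
import org.dvb.dsmcc.DSMCCObject;
import org.dvb.dsmcc.ServiceDomain;

// illustrative: mount the broadcast object carousel and read a small text file
void readCarouselFile() throws Exception {
    ServiceDomain carousel = new ServiceDomain();
    carousel.attach(new DvbLocator("dvb://123.456.789"));   // hypothetical service locator

    DSMCCObject file = new DSMCCObject(carousel.getMountPoint(), "data/tracking.txt");
    file.synchronousLoad();                                  // blocks until the module has arrived

    BufferedReader in = new BufferedReader(new FileReader(file));
    String line;
    while ((line = in.readLine()) != null) {
        // parse the carousel data
    }
    in.close();
}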


Figure 5.13: demo program collection in the xleTView emulator (left) and on a Philips set top box (right)

If a set top box's MHP profile includes a return channel, it has to offer the complete java.net package (as stated in the MHP specification, chapter 11.5.3) – with some further restrictions. For instance, multicast over IP support is not mandatory, some static methods of java.net are left out, and java.net.SocketPermission is not required for unsigned applications (due to the lack of handling connections within the sandbox). Return channel establishment (e.g. connecting via dial-up) is defined in the extensions to java.net in the org.dvb.net.rc package.
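Once the dial-up connection has been established through org.dvb.net.rc, application data can travel over plain java.net. A hedged sketch of posting a poll result via HTTP 1.1 – the server URL and parameter name are invented:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// illustrative: send a viewer's vote back over the return channel via HTTP 1.1
void sendVote(int answer) throws Exception {
    URL url = new URL("http://example.com/poll");        // hypothetical 3rd-party server
    HttpURLConnection con = (HttpURLConnection) url.openConnection();
    con.setRequestMethod("POST");
    con.setDoOutput(true);
    OutputStream out = con.getOutputStream();
    out.write(("answer=" + answer).getBytes());
    out.close();
    int status = con.getResponseCode();                  // e.g. 200 if the vote was accepted
    con.disconnect();
}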

5.5 Summary on Sample Applications

To be able to realize a fast and rich appearance for Xlet applications, and especially for augmented reality scenarios, the current restrictions on a set top box are rather unpleasant. Especially the timing issues threaten proper reproduction, as seen in the animation subsection. Problems with transparencies and overlaid translucent shapes can be coped with, since at least the mandatory features are listed and always available; one might have to adjust and scale down textures and images, but this can be solved. The main inconvenience remains the speed problem (GUI rendering as well as loading times). Augmented television with MHP will only be possible if graphics can be aligned with the video, which is not yet guaranteed. To document all listed tests and others not included in this document, I again refer to the Xlets and their doc files inside the developed package es.vicomtech.dtv.*. The subdirectory xletdemo contains the startup Xlet TheXlet.java and the directory xlets holds all demo programs. These are all linked from within the startup application, which basically offers a dynamic menu listing all available demos. The offered demos cover topics such as text, graphics, transparency, layered transparency, 2D animation, 3D graphics (not running on the STB), video overlay, TV image manipulation (size, position, read/write pixel data), reading network carousel data, the return channel and interactivity through the remote control.

6 Technical Realization of the Pelota Application

In the following sections the major decisions and concepts that have been realized for the digital pelota environment will be pointed out. Technical implementations of the features listed in chapter 4 are given, as well as encountered problems, restrictions and explanations for certain design decisions.

Sticking to a Simulator

We heavily rely on 3D rendering capabilities for the integration of the virtual objects. The Java Developer's Kit running on a PC includes Java3D by default. Unfortunately, the slim Java versions running on set top boxes currently don't offer any 3D functionality. For instance, Java3D relies on (even platform dependent) accelerated hardware rendering, which a box without an integrated 3D graphics chip can't offer. This restriction is sad but logical – first the market has to prove the need for this new technology, then hardware costs will drop and probably more components will be integrated. Only "self-made" software rendering could be used up to now: in chapter 5.4 I already described the placement and animation of 2D shapes and images. All three dimensional graphics could be broken down to a flat representation. Pragmatically speaking, this could at least be used to insert flat 2D images into the 3D virtual space using scaling, distortion and shearing, calculating the perspective ourselves. But implementing a fast simulated three dimensional rendering would have cost too much time, keeping in mind that hardware acceleration will most probably soon be available – which would render such an implementation futile after a short period of time. Also, I want to concentrate on a rich and strong sample application rather than lose development time reinventing the wheel. As a result, I chose to implement the following scenario entirely on a PC, simulating a digital set top box environment. As far as possible the same APIs and structures are used to keep a theoretical future port to the special hardware flawless and uncomplicated. Differences or important issues that have to be taken into consideration will be pointed out. The MHP APIs operate with the Java Media Framework to access video streams and I will use it in the simulation as well. Furthermore, I will try out the different available 3D interfaces for Java and determine the most suitable one.

6.1 Pelota Implementation Overview

To combine all the different elements, a framework has been defined, linking all objects and parts needed. The so-called EoraStarter class represents the Xlet, which will be started from outside (by a normal console under the PC simulation environment, or by the application manager in an MHP set top box). During initialization it instantiates an EoraDTVWorld, which contains all pelota application issues. In a real MHP environment we hopefully only have to adjust the start class and rebind e.g. the user action commands (listeners for pressed remote control keys instead of keyboard and mouse). Access to our virtual and augmented pelota world is only obtainable through the public methods of the EoraDTVWorld, thus leaving all internals untouched. From outside, it is possible to trigger the following commands (only the most important are listed; more can be found in the appropriate javadoc):
• addCamera() – define an additional video stream source with tracking data
• addCameraVirtual() – add a camera position for a virtual viewport, no stream
• switchToCameraPerspective() – change the point of view
• addAugmentationSet() – add an augmentation bouquet, which can be selected
• switchToAugmentationSet() – change the bouquet by hand
• setVideoQuality() – change the video stream preview quality; full-screen is not affected
• setObjectTrackingFile() – source for the moving objects' tracking data
• showOSD() – display HUD-like helper information
• setBallTrail() – toggle the 1st augmentation example: highlighting the ball's trail
• setBallHighlighted() – toggle the 2nd augmentation example: replaced ball shape
Only the main class structure will be introduced at this point; it is shown in figure 6.1. The most important classes lie in the branches AugmentationObject, ViewPoint and Tracker. The first allows two and three dimensional objects to be linked into the virtual world, with derived classes specializing in more specific functionality like importing external 3D geometries from files. Common to all is the function createChild(), which generates and returns a scenegraph object that can be linked into the virtual scene. It already defines a translation and rotation in the world (always using a meter scale), so those objects are usually attached into the branch without any further matrix changes. The viewpoint related classes define real and virtual camera positions (setting the three vectors position, aim and up). A camera class defines a video stream and a current viewpoint, updated through the bound tracker source. These trackers return a single vector as a position, at the highest level totally abstracting from how the data are transmitted and decoded. We can access the tracking data through getPosition(int index), giving the index of the set of data we want to retrieve or – by leaving out the parameter – receiving the current set. update() advances the tracker to the next set. Derived classes that handle more complex tracking information (e.g. not only one vector but a whole 4x4 matrix, or more than one matrix) have to extend the tracker and implement their own getter functions.

Figure 6.1: pelota application class design (excerpt)

The derived TrackerfileReader* classes resort to local files containing the tracking data sets (for local simulation purposes), while the TrackerStreamReader*s serve as a link to streamed data that have to be retrieved from the digital video signal (see 6.3 and 6.4) or from the data carousel. The Linker classes build the bridge to the media content, which can be accessed from streams or from local files, using MPEG encoded videos as well as QuickTime data. An overview of the main class design is displayed in figure 6.1.
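To give an impression of how the framework is meant to be driven from the startup Xlet, here is a hedged usage sketch; the method names are the ones listed above, but the exact signatures (argument types and counts) are my assumption and the file names are invented:

// illustrative only – parameters are guesses, not the real signatures
EoraDTVWorld world = new EoraDTVWorld();
world.addCamera("court_main.mpg", "tracking_main.txt");   // real camera with tracking data
world.addCameraVirtual("birdseye");                       // free virtual viewpoint
world.addAugmentationSet("advertisement_es");             // selectable augmentation bouquet
world.switchToCameraPerspective(0);
world.setBallTrail(true);                                  // 1st augmentation example
world.showOSD(true);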

6.2 Displaying Video Images

How to gain access to video stream data in MHP has been described in section 5.4.1. To use this for the pelota application, the complete list of steps is:


1. Include the javax.media.* packages, define a Player object, define a FrameGrabbingControl object, and have a media file and its URL at hand.

2. A Player is created through the javax.media.Manager class:

URL url = new URL("file:" + filenamestring);
p = Manager.createRealizedPlayer(url);

3. To link the control object to the player, we call:

fg = (FrameGrabbingControl) p.getControl("javax.media.control.FrameGrabbingControl");

4. We start the player and wait until the controller is linked and everything is prepared for usage:

p.start();
Object waitSync = new Object();
synchronized (waitSync) {
    try {
        while (p.getState() != Controller.Started)
            waitSync.wait();
    } catch (Exception e) { }
}

Using AWT with, for example, a BorderLayout it is now easily possible to display the media file in a component of the GUI:

Component comp;
if ((comp = p.getVisualComponent()) != null)
    add("Center", comp);
// add controls: position bar, buttons for pause, fast forward, etc.
if ((comp = p.getControlPanelComponent()) != null)
    add("South", comp);

Nevertheless, as will be described later in detail, this possibility of direct display won't be used any more once the 3D display is combined with the video. Instead, the Player will run in the background and the current frame will be grabbed via the FrameGrabbingControl using the grabFrame() function. Afterwards the saved buffer is converted to a Java Image and from there to a DirectBufferedImage.

When combining 2D with 3D objects this will be used to display the video. In contrast to the PC simulation, a real MHP environment may cause problems displaying the video data in an AWT component: only the background video display is mandatory while the component embedded alternative is not[6], and it will depend on the manufacturer's implementation of the MHP specifications.
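A sketch of the grab-and-convert step under JMF 2.x as used in the simulation – the conversion goes through javax.media.util.BufferToImage; variable names follow the listing above, the wrapping method is illustrative:

import java.awt.Image;
import javax.media.Buffer;
import javax.media.control.FrameGrabbingControl;
import javax.media.format.VideoFormat;
import javax.media.util.BufferToImage;

// illustrative: grab the current frame from the background Player as an AWT Image
Image grabCurrentFrame(FrameGrabbingControl fg) {
    Buffer buffer = fg.grabFrame();
    BufferToImage converter = new BufferToImage((VideoFormat) buffer.getFormat());
    return converter.createImage(buffer);   // afterwards copied into a DirectBufferedImage
}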

6.3 Tracking the Video Images

To be able to align the 3D space with the video, tracking data for the camera's position have to be present. They either have to be calculated in real time by the set top box or be transmitted alongside the video signal. For the pelota application we chose to only transfer data computed in advance: not only is it still impossible to access the image pixels in a DVB stream to do video based tracking, the computational power is also not yet sufficient to attain a reliable positioning. For the pelota application the tracking information puts the origin at the front-left corner of the court, using a meter scale. Furthermore there might be scenarios where an independent "self-detection" of the perspective is impossible for the set top box: if both scenario objects and camera are moving inside the frame, a completely automatic detection won't be possible within a 25 fps restriction. Sometimes an image based recognition is simply impossible or not even wanted: the broadcast scene might come with motion control camera data, or the whole scene was done in the computer, so the camera positions can just be passed on one by one. For our application we rely on such pregenerated data from a 3D renderer (3ds Max has been used at VICOMTech) and on computer vision based optical tracking: the company's in-house tools were used as well as other professional tracking software (2d3's boujou three or bullet). We modified the other VICOMTech project including the pelota court Viewer (see 4.3) to write camera information to files as well. Joined with recorded frames of the program, we get a 100% properly tracked simulation, which is used during development. As a real time tracking option I integrated the possibility to load tracking data from the ARToolkit[22]. Up to now the option to run the toolkit's grabber and transmit video plus coordinate information in real time is not offered; instead the toolkit's code has been modified to use a video from a file and – at the same time – write a text file including matrices for all frames. Integration into the pelota program environment currently still has to be done "by hand" and afterwards. To distinguish the different kinds of tracking data we can use, a header is transmitted at the beginning of the frame-by-frame information:
• eora_cam Needs 9 values: vectors for position, aim and up, used by the 3ds Max export and the VICOMTech Viewer program.


• eora_cam_matrix Demands 12 values: a 4x3 matrix – 3x3 rotation, last column translation – used for compatibility with the in-house tool from VICOMTech.
• eora_cam_matrix_artoolkit ARToolkit[22] matrix: 4x3 (3x3 rotation, 3x1 translation), different internal scaling.
• eora_cam_matrix_boujou Also uses a 4x3 matrix, other offsets and scaling – see the ViewPoint class for details.

Figure 6.2: Boujou bullet: optical tracking, ready-made data before broadcasting. Software access kindly granted by tvt postproduction, Berlin[2].
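A hedged sketch of how such a header could be evaluated when a tracking file is opened – the header names and value counts come from the list above, everything else about the file layout (one frame per line, whitespace separated values, the file name) is an assumption:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.StringTokenizer;

// illustrative: read the header line and the per-frame values of a tracking file
void loadTrackingFile(String filename) throws Exception {
    BufferedReader in = new BufferedReader(new FileReader(filename));
    String header = in.readLine();                 // e.g. "eora_cam"
    int valuesPerFrame = header.equals("eora_cam") ? 9 : 12;
    String line;
    while ((line = in.readLine()) != null) {
        StringTokenizer tok = new StringTokenizer(line);
        double[] frame = new double[valuesPerFrame];
        for (int i = 0; i < valuesPerFrame && tok.hasMoreTokens(); i++) {
            frame[i] = Double.parseDouble(tok.nextToken());
        }
        // hand the vectors or matrix over to the ViewPoint of the bound camera
    }
    in.close();
}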

6.4 Synchronization of Video Frame and Tracking Information

The Java Media Framework uses the concept of a media time to represent the current status of a playback. This is split into two components: a clock and a time base. The latter only describes a constantly increasing time value – no function calls or resets of the system will affect it. The former is directly integrated into the JMF Player, since the Player is derived from Sun's JMF clock class. Here, a floating-point value is used as a rate to map the time base to the media time (1.0 will play at 100% speed, 2.0 at double rate, -1.0 at normal speed backwards, 0.0 will stop playback, etc.). Unlike in a PC environment, it is not possible to stop or rewind a digital video stream in this manner. Moreover, it is not clearly defined when a stream has started and the clock was set to zero. MHP defines a Normal Play Time (NPT), which can be used by the broadcaster, but this value is implementation dependent: the timecode may be an arbitrary, constantly increasing value, which does not help with our synchronization between video and tracking data, since we have no reference to where we actually are inside our show.

Instead, we could resort to DSM-CC (Digital Storage Media, Command and Control) stream events. The events are embedded into a transport stream via the private section of an MPEG-2 field and can be identified uniquely. A time reference tells when the event should be triggered. DSM-CC stream event objects are stored inside the object carousel and behave like any other DSM-CC object. These events hold a unique event ID and a human-readable name. The receiver will now be able to recognize and distinguish sent events, but so far these event objects only tell us what kind of events we are going to expect. To trigger them we have to use stream event descriptors: they contain additional unique information and trigger one of the events defined earlier. Events can be re-used and retriggered. The included information defines an NPT value as well, telling the receiver when the effect should take action. To assure that an event is not skipped, the MHP specification recommends a repeated transmission of the descriptors "at least once every second for a minimum of five seconds before the time they should trigger"[14] (chapter 12, stream events, p. 302). Besides, it is also possible to leave out an NPT timed execution of an event: a "do-it-now" descriptor handed over to the receiver should trigger immediately. For implementing the stream events, the package org.dvb.dsmcc has the two classes DSMCCStreamEvent (the event class) and StreamEvent (the actual triggered descriptor holding the NPT or the do-it-now command). Catching of events is handled by listeners, namely a StreamEventListener and its receiveStreamEvent() method. We can then look into the received StreamEvent descriptor and look up the time when it should be triggered by calling getEventNPT(). If we hand over tracking data inside another private field, we would have all we need. Unfortunately, decoding of an occurred event synchronized to a specific frame time cannot be vouched for; depending on the middleware and the hardware's speed, offsets of some frames or even seconds might occur. This is no problem when events are used for score listings of a soccer game or updated poll or bet results, but a frame by frame synchronization is absolutely needed for our purposes of tracking and aligning the graphics. To finally realize synchronization we consider two scenarios: use NPT references only every few seconds to get an assurance of where we are, and transmit all (or larger sets of) tracking data through the data carousel in advance (or even within the Xlet application itself). Then we don't have to rely on event descriptors to deliver the needed data in time; we only need to know the current NPT and can select the correct tracking matrix. This will only work if control of the NPT value is possible – if it is only an arbitrarily increasing value we still don't know the exact instant. Relying on a global timing (using CET timestamps for all tracking data – with a granularity of 0.04 seconds (25 fps) for instance) won't help either, since some receivers will be slower in decoding the video signal, again causing offsets. All participants will know when something should be broadcast, but it still might not be possible to harmonize the signals. López, González et al.[32] document trials of synchronization with the NPT and the do-it-now broadcast for DVB-T usage: a do-it-now command is sent to the receiver once a video stream starts (or the point of first-time synchronization is reached), resetting an internal clock. Afterwards, private metadata fields inside the MPEG-7 encoded video hold the data sets that need synchronization. The performance of their tests reached an accuracy of almost 5 frames.

For our purposes we would have to re-sync as often as possible to get the best results, but synchronization errors of 5 frames can still be noticed if fast camera movements occur: virtual objects that are tied down and fixed to the background would shake or lag, not moving in accordance with the rest of the world. Therefore we consider another approach, described below. The alternative demands access to the MPEG data, not caring about NPTs or DSM-CC. If we can manipulate the video before broadcasting, it is possible to insert tracking data, or a reference ID to tracking data, into each frame. To make sure that the data can be processed by the receiver in time, to minimize erroneous code and to give the option of interpolation, we suggest transmission of 25 tracking data sets at once. Hence, a given frame includes information about its own camera position as well as about the following 24 frames. This buffer of one second should serve us well. To actually encode the data, we plan to set pixels outside the visible safe area to special values – storing the needed tracking data encoded in an (unused) RGB value. Moreover, private fields in MPEG encoding can offer the possibility to directly transmit additional information, but since hardware implementations of set top boxes may vary, the storage of information inside the frame is a safer way to go. If access to private data is restricted or not offered for a specific decoder, we still get the whole frame information. The pelota application uses the 2.x version of the JMF, already allowing pixel manipulation of a stream. We implement this access to read/write pixel data inside the Linker classes (getFrame() methods). If MHP moves on to the 2.x version of the JMF as well, an implementation of this tracking synchronization could easily be realized. Up to now, this issue remains on the future "to do" list – waiting for the next MHP revision.
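For completeness, the event subscription path described above would look roughly like this in code – the class and method names come from org.dvb.dsmcc as referenced in the text, while the carousel path and the event name are invented:

import org.dvb.dsmcc.DSMCCStreamEvent;
import org.dvb.dsmcc.StreamEvent;
import org.dvb.dsmcc.StreamEventListener;

// illustrative: listen for a broadcast stream event named "tracking_sync"
class SyncListener implements StreamEventListener {
    public void receiveStreamEvent(StreamEvent e) {
        long npt = e.getEventNPT();   // scheduled trigger time (do-it-now events fire immediately)
        // select the tracking matrix that belongs to this NPT value
    }
}

// inside the Xlet, once the carousel is mounted (path and event name are hypothetical)
void subscribeToSyncEvents() {
    try {
        DSMCCStreamEvent events = new DSMCCStreamEvent("/carousel/events/tracking");
        events.subscribe("tracking_sync", new SyncListener());
    } catch (Exception ex) {
        // UnknownEventException etc. if the event object or name is not present
    }
}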

6.5 Discussing different APIs for 3D Graphics in Java

Since no set top boxes capable of 3D graphics acceleration are available yet, it is impossible to predict which Java API for 3D representation will be chosen once hardware capabilities are sufficient. The decision highly depends on the rate of set top box deployment and the acceptance by the customer. An upgradeable and extendable runtime environment would be the best solution for the needs of future applications, but this scenario is unfortunately far away from reality. Current devices usually still run on Java Personal Edition, which is a slim derivation from the Java SDK – by now outdated and unsupported by Sun Microsystems (see chapter 3.2.1). Instead, Sun emphasizes and recommends the usage of the Java Micro Edition, which is also installed in mobile devices such as consumer cell phones. At this stage it is only possible to simulate the integration of 3D graphics on a standard PC. For the video rendering and stream access the JMF will be used, just as in the MHP driven set top boxes. Keeping the design simple, small and in the already known Xlet structure will help to point out possibilities and restrictions for porting the simulated 3D applications to digital televisions. This section focuses on different approaches for bringing virtual 3D objects to life using Java. Different APIs will be discussed: open source and/or (partly) supported by Sun Microsystems. It points out advantages, disadvantages and restrictions. The most important issue is whether practical ways to realize augmented reality work well, taking into account three parts:

• how to realize fast video playback
• how to overlay this video with virtual (3D) objects
• how to achieve occlusions between video elements and rendered items

Figure 6.3: Java3D scene graph design

6.5.1 Java3D

Java3D is an extension to the standard Java 2 SDK package offered by Sun Microsystems and thus obviously the first choice for trying out 3D renderings. It is an application programming interface for writing three-dimensional graphics programs and applets, where all properties of Java carry over to this extension. The API was designed in a cooperation between Intel, SGI, Apple and Sun and combines ideas from other lower level APIs, such as Direct3D, OpenGL, OpenInventor and XGL. Java3D takes care of all lower level graphics work, so it is possible to focus on the design of the virtual world. Furthermore it already offers the complete integration of a scene graph: all objects created by the user are linked into a graph representing all functional and spatial dependencies, thus allowing an object-oriented view and helping to accelerate the development of large applications. The scenegraph takes care of all rendering related issues, taking a huge burden off the developer – while sacrificing a certain amount of freedom at the same time: it is, for example, not yet possible to control all OpenGL states (the last released version of Java3D at the time of writing is 1.3.2). The scene graph design of Java3D is split into two parts: a content branch (see figure 6.3, marked blue) and a view branch (red). The content branch may include as many objects, transform groups, lights and shapes – including rendering and color descriptions – as the developer would like to store there. However, only one viewing branch is allowed (although the division is optional and the branches might be interleaved, reusing transformations for example). In practice the viewing branch is quite small, containing only a few nodes, while the content branch may contain thousands of nodes forming complicated 3D worlds.


TransformGroup objSpin = new TransformGroup();
objSpin.setCapability(TransformGroup.ALLOW_TRANSFORM_WRITE);
objSpin.addChild(new Some3DShape());
objRoot.addChild(objSpin);

Figure 6.4: set access to transform scenegraph objects

As the logic of a scenegraph implies, transforms of a group affect all branches and leaves below it. For speed reasons every node has to be "activated" explicitly before it can be transformed (see figure 6.4). If these so-called capabilities are not set to true, the virtual world won't change.

Shapes

Objects of or derived from the class Shape3D represent all geometrical objects in the virtual world. Such a shape always defines two parts: a Geometry (containing information about the geometric structure, e.g. points, edges, normals, vertex colors) and an Appearance (holding attributes like color, texture and material). The appearance also holds controls over the rendering of an object. Different shapes in the world may share the same appearances or geometries, to avoid redundancy or to define needed logic.

Behaviours

To actually modify scenegraph objects it is necessary to resort to so-called Behaviours. They provide the means for animating objects, processing keyboard and mouse input, reacting to movement and using pick events. Behaviours basically contain two parts (a small sketch follows at the end of this subsection):
• initialization – specifies wake-up conditions (timers, pressed keys)
• processStimulus() – function that will be called when a wake-up condition is fulfilled

Java3D Details

The Java3D structure is kept quite simple and without overly sophisticated structures inside the core (such as ready-made geometrical objects). But a variety of additional classes can be utilized by developers to speed up development: on top of the basic shapes exists a variety of objects like cubes, spheres, etc., as well as loader classes for external 3D file formats (such as VRML or X3D). (Unfortunately Sun only provides the loader interfaces for porting a whole 3D world from a file into the scenegraph; the actual loading process has to be written by someone else, so functionality might be limited, untested or unreliable.) Moreover, behaviours are extended to offer so-called Interpolators: helper classes that can easily be used to interpolate animations under certain user-defined constraints.
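As announced above, a hedged sketch of a custom Behaviour – the wake-up condition and the rotation step are arbitrary choices, only the initialize()/processStimulus() structure is prescribed by Java3D:

import java.util.Enumeration;
import javax.media.j3d.Behavior;
import javax.media.j3d.BoundingSphere;
import javax.media.j3d.Transform3D;
import javax.media.j3d.TransformGroup;
import javax.media.j3d.WakeupOnElapsedFrames;

// illustrative Behavior: rotates a TransformGroup a little further every frame
class SpinBehavior extends Behavior {
    private TransformGroup target;
    private Transform3D rotation = new Transform3D();
    private double angle = 0.0;

    SpinBehavior(TransformGroup target) { this.target = target; }

    public void initialize() {
        wakeupOn(new WakeupOnElapsedFrames(0));     // wake-up condition: every rendered frame
    }

    public void processStimulus(Enumeration criteria) {
        angle += 0.02;
        rotation.rotY(angle);
        target.setTransform(rotation);              // requires ALLOW_TRANSFORM_WRITE (see above)
        wakeupOn(new WakeupOnElapsedFrames(0));     // re-arm the wake-up condition
    }
}

// usage: a Behavior only runs inside its scheduling bounds, e.g.
// spin.setSchedulingBounds(new BoundingSphere()); objRoot.addChild(spin);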

Figure 6.5: Java3D scene graph view port

Once the scene graph is constructed and the application is running, the finally displayed image relies on a so-called View, which retrieves its viewport information from a ViewingPlatform. Each frame the view is generated and handed over to a Canvas3D object, which can be integrated into one's Java application in the same way as any other Canvas object known from AWT or SWT9.

Moreover, Java3D supports spatial sound services, which usually wouldn't be considered part of a 3D API, but which simplify the development of immersive applications to a great extent. Many useful add-ons by Sun itself and other third parties are available nowadays, e.g. extensive support for runtime loaders. This allows Java3D to accommodate a wide variety of file formats, such as vendor-specific CAD formats, interchange formats, VRML97 and already X3D.

Today the source code of Java3D is publicly available, making it easier to study problems or difficult issues. However, this step does not turn Java3D into an open standard: it is still owned by Sun, and if manufacturers' set top boxes use this API, they still have to follow the defined standards and won't be able to "tune" the interface at will. Moreover, Sun seems to back away from Java3D to a certain extent. At least they are withdrawing manpower from the project: today10 Java3D is listed as a Java community project on java.net. This could imply that Sun no longer pursues great interest in the 3D interface, or simply relies heavily on the open source community.

9 Abstract Window Toolkit and Standard Widget Toolkit; collections of functions that allow Java programs to manipulate virtual graphics (windows, images, buttons, and so on). These abstract graphics can be translated into user-visible windows and controls on the client platform. While AWT is developed by Sun, SWT was introduced by the Eclipse Foundation. The latter cannot be described as "pure", since it relies on a platform-dependent library to access the native windowing system and designs.
10 July 2005

If development moves on (perhaps even faster than in closed-source times) and the management of Java3D as a rich and straightforward 3D interface standard is followed, it could definitely still be an option for hardware manufacturers. Besides, Java3D is so far only suitable for standard PCs, since hardware acceleration is provided by OpenGL or DirectX11: set top boxes would have to provide either OpenGL support or Microsoft's interface.

Tryouts with Java3D

To accomplish augmented reality within a Java environment it is absolutely necessary to combine 2D video images and 3D renderings somehow. Since hardware-accelerated virtual worlds can only be rendered into the so-called Canvas3D, there are two ways to realize the overlay:

• render all 3D objects in offscreen mode and copy them into a "normal" Canvas object, where the 2D drawing (the video) is realized using the JMF
• render everything inside the Canvas3D; fast ways for integrating the 2D video image have to be found

Offscreen Rendering

To combine the 3D image with the JMF-generated video, a straightforward approach is to use a Canvas element that draws the video first and overlays it with the rendered image from the 3D view, using standard Java2D routines for drawing and adding images. If the virtual objects are rendered in one pass, an alpha channel has to be used to distinguish between virtual-world objects and the unwanted background of the rendering. The video frame can then be overlaid by the captured 3D screenshot (a minimal compositing sketch is given below).

Using this offscreen mode it is possible to manipulate every single frame without restrictions, since every pixel can be altered at will. Superimposing virtual objects onto the video is easy this way – but as soon as the depth position of the augmented shapes relative to the video objects' positions becomes relevant, an unanswered question arises: what actually is the depth of those inserted objects? How can the clipping be realized? If this problem is not addressed, the video will always be stuck behind the computer-generated elements, for which we don't know the depth position a priori.

Using offscreen rendering we unfortunately only get the 3D image as a flat 2D output. The depth information is discarded at this stage, and a combination with the video is only possible by using all rendered pixels or leaving parts out, e.g. by comparing with a 1-bit mask, without taking depth distances into consideration. The advantage would have been the possibility to draw everything into a regular AWT Canvas, which can be directly addressed in MHP12. But besides the unsolvable occlusion issue, the speed of the application was too slow – even in the simulation on a PC13: the rendered 3D buffer had to be copied onto the video image buffer at at least 25 fps, which did not run as smoothly as hoped.

11 The user has to select the 3D interface when downloading the installation files for Java3D. A later change is only possible by reinstalling the whole package.
12 Depending on the implementation, though: AWT component access is not mandatory.
13 For the configuration see 9.1.
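The compositing step of the offscreen approach can be sketched as follows. This is only an illustration, not the code used in the application: it assumes that the 3D view has already been captured into an ARGB BufferedImage whose background pixels are fully transparent, so that standard Java2D alpha blending leaves the video visible wherever no virtual object was rendered.

import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public static BufferedImage compose(BufferedImage videoFrame, BufferedImage rendered3D) {
    BufferedImage out = new BufferedImage(videoFrame.getWidth(), videoFrame.getHeight(),
                                          BufferedImage.TYPE_INT_ARGB);
    Graphics2D g = out.createGraphics();
    g.drawImage(videoFrame, 0, 0, null);   // video frame as the background
    g.drawImage(rendered3D, 0, 0, null);   // alpha-blended 3D overlay on top
    g.dispose();
    return out;
}

Exactly this copy has to happen at least 25 times per second, which is where the approach failed in practice.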

Using the offscreen buffer method, a frame will only be rendered if the method renderOffScreenBuffer is called explicitly. If there are any animations running in the virtual world, they won't progress, and thus only the first frame will be handed over to the screen. This missing update can be avoided if the offscreen Canvas3D is linked to an active view in the application; but since the plan was not to render the 3D space directly to the screen, but rather to obtain the frame in the background for buffer copying, this is not expedient at all. Offscreen buffering was therefore abandoned and onscreen rendering was tried next.

Onscreen Rendering

For onscreen rendering the Canvas3D must be used. Thus, the 2D video rendering has to be integrated into this container and not vice versa. This turned out to work quite well using a quad placed in 3D space, textured with the current frame of the video and updated every 0.04 seconds (i.e. 25 frames per second, as used in PAL encoding). The video loading and the positioning and synchronization of the screen in 3D space will be described later in 6.5.5. Being entirely in 3D space, with the possibility to set video background images, provided all that was needed for a start. The next issue was to see how occlusions could be realized within Java3D.

6.5.2 Occlusions in Java3D

To be able to occlude real objects of the video stream with the inserted virtual objects, it would be desirable to have a well-defined mask separating foreground and background areas, or to have complete knowledge of the real scene in 3D. Then it would be easy to fulfill the needs for occlusion by drawing the real scene geometry only into the z-buffer and thus getting the impression of clipped virtual objects. Unfortunately this scenario is more than utopian. Moreover, research with Java3D led to the result that the interface for using the stencil buffer is not yet implemented in Java3D [26]. The current release 1.3 does not allow any operations that demand access to that buffer. Access to the color buffer is also forbidden, preventing us from overwriting distance values for individual pixels. The idea of a pre-defined depth mask seems to be rendered useless as well.

One way to gain a certain level of occlusion would be to use the clipping planes of OpenGL: each view frustum defines six planes where objects are cut off. Usually these positions are aligned with the outside borders of the visible area, reducing rendering times. If we move these planes into the field of view, we can cut off parts of our inserted 3D elements. Nevertheless, these planes cut the whole scene and are always axis-aligned, and thus do not offer a flexible alternative.

2D Masking using Rasters

With these restrictions, another solution had to be found for implementing occlusion. In a first approach the Canvas3D offscreen rendering (see 6.5.1) was used to render the 3D components of the scene. The buffered, currently not displayed frame could then be used to calculate the

final image, combining it with the background image and a prepared mask. Unfortunately this only allowed the distinction between foreground and background video material, offering only the worst depth resolution imaginable. An improvement could be achieved by reading out the depth buffer for each 3D frame and combining it with a depth mask of better resolution: if an 8-bit depth mask were used instead of a 1-bit mask, inserted objects could move between particular objects of the video without the restriction of only two or three layers. Nevertheless, this track was not pursued further, since speed issues14 already kept the offscreen rendering from reaching 25 fps on a full PAL video signal.

Figure 6.6: Offscreen rendering; automatic composition of 2D frames: video frame, mask and 3D rendered image lead to the results below (note the occlusion on the right)

The alternative using an onscreen renderer in Java3D led to the following solution, taking advantage of the class javax.media.j3d.Raster: a two-dimensional raster image may be rendered into 3D space. This raster requires a 3D point as its origin and will always be placed facing the camera.

raster = new Raster(new Point3f(-1.0f, 0.75f, 0.0f),
                    Raster.RASTER_COLOR,
                    0, 0, frame_width, frame_height,
                    mask, null);
raster.setClipMode(Raster.CLIP_IMAGE);
Appearance app = createAppearance();
Shape3D shapeI = new Shape3D(raster, app);
shapeI.setCapability(Shape3D.ENABLE_PICK_REPORTING);
shapeI.setBounds(new BoundingSphere(new Point3d(), 0.5));
transGroupRaster = new TransformGroup();
transGroupRaster.addChild(shapeI);

To link a 2D image into the 3D scene, one must define a javax.media.j3d.Raster object as shown in the above code. The raster is connected to a 3D point in the virtual world defining its upper left position. The raster is put at z-value 0.0f, thus covering the whole window and displaying the background image (figure 6.7) at exactly the same size.

14 Using the development platform listed in 9.1.

Now it is possible to copy the background image into the mask, leaving out the image parts that are supposed to be behind the virtual objects. The raster is generated by comparing the 1-bit mask file with the background frame: where the mask is set to 1, the pixel is made transparent; where the mask is set to 0, the alpha value is left untouched (thus displaying the pixel of the background frame). Hence, the background image is drawn twice, and this offers only 1-bit depth culling with an assumed fixed position of the real-world objects of the video. We gain a foreground, a middleground (the masked portion) and a background area. This way, the occlusion will always occur at z-value 0.0f.

Figure 6.7: overlaid 2D mask fitting in perfectly without rescaling, as long as the mask's position is not changed

To overcome this restriction of a fixed clipping position, the 2D mask carries a distance-to-camera value. If the raster is moved to the given z-position, the occlusion occurs at the right point. Inserted virtual objects can now even float in front of the "foreground parts" of the video, thus offering background, middleground and foreground areas. The distance for the middleground is a fixed value, but depending on the environment where it is used, this method can be satisfactory (e.g. a presenter reading the news without moving much along the depth axis). If the distance value is adjusted according to the objects' movements along the z-axis, a convincing occlusion is even feasible for more lively scenes. Each frame then carries its own mask-distance-to-camera value.

Since the raster displays a masked part of the video, it has to be rescaled while being moved. Otherwise the foreground parts of the video would no longer fit the background parts. Depending on the quality of the graphics card's texture rendering and the distance of the mask from its original position (where the raster could be rendered without rescaling), the output quality may degrade: the shifting of the mask and its rescaling cause anti-aliasing in the texture rendering. A difference can be noticeable and might even result in "double borders" of objects if the scaling calculation does not fit 100% due to rounding errors. In

figure 6.8, anti-aliasing divergences can be noticed in the area of the castle (the masked part).

Figure 6.8: left: unscaled mask; right: rescaled mask causing anti-aliasing

Since this loss of video quality cannot be tolerated in a digital television environment, another solution had to be found. The alternative of not scaling the mask, but instead moving and rescaling all virtual objects (by grabbing the topmost transform node of the scene graph) so that their positions fit the clipping, could also do the trick at the price of permanent scene graph transformations. Instead, I opted for another promising approach given in the Java3D API description: besides drawing pixel information to the color buffer, it is also possible to pass information for a DepthComponent during construction of the Raster. The idea seems to be to define a depth value for each color pixel right away (using RASTER_COLOR_DEPTH as the raster type for the constructor). Unfortunately, the latest Java API specification15 at this point does not describe the usage of the DepthComponent at all. As there was no literature available on this issue, a developer of Java3D was contacted, and the answer confirmed that this undocumented parameter was never fully implemented (since OpenGL does not currently support usage of this depth raster directly).

Besides this drawback, Java3D's specification allows drawing a raster using RASTER_COLOR or RASTER_DEPTH. Thus, two different rasters together could reach the same goal. Since the video image is already drawn in the background, it would even be sufficient to draw only a RASTER_DEPTH using the mask information, without drawing parts of the video twice – provided the depth information still lets the background render. In the current official release of the API this feature still does not work: the depth mask's pixels are always stuck to the origin position of the mask (like the color mask), and no advanced occlusion occurs. The latest beta release of Java3D16 only displays erroneous masks and does not even draw the set background at all.

15 1.4.2-07
16 jdk 1.5.0_03 beta

Other Approaches

With the Raster class not working properly, the stencil buffer could offer the same functionality: this buffer controls rendering via a pixel-by-pixel mask, "switching" each position on or off and thereby cutting out parts of the rendering. To take advantage of this auxiliary buffer, depth testing and color modifications are first turned off. Then the mask that represents the occlusion for the video is drawn into the stencil buffer for each frame with the stencil test enabled. After re-enabling depth testing and color writing, the stencil buffer is set up so that the renderer can only draw into pixels where the stencil value equals the reference value 1. This way a 1-bit mask is supplied, and hardware support for the stencil buffer in today's graphics cards sounds promising. Nevertheless, this has not yet been implemented in Java3D. As Doug Twilleager stated in the Java3D web forums [26], all ancillary buffers cause problems for scene-graph-based 3D APIs. Java3D seems to make it even more complicated, since it has no semantics for application-controlled traversal of the scene graph tree. No low-level functionality of the OpenGL pipeline is accessible yet. Stencil buffer support is not planned before the upcoming release 1.4, and complete control of OpenGL states not before the next major release 2.0 [29].

Summing up, Java3D offers a quite comfortable scene graph with many possible extensions. The Behaviours and the need to switch on capabilities for every single node before its attributes can be modified sometimes feel like a hassle, and occlusions cannot yet be handled adequately either. Hence I continued looking for alternatives.

6.5.3 GL4Java

GL4Java is an open source project mapping the complete OpenGL 1.2 API and the complete GLU 1.2 API to the Java language. All native and platform-independent window handling functions are integrated by using the Java Native Interface (JNI) and/or the JDirect interface of the Microsoft Java Virtual Machine. The interface extends the Canvas class in java.awt to allow the creation of OpenGL windows. GL4Java can be regarded as a Java extension with a native and a Java part; this split is owed to the requirement of hardware acceleration. The native part is currently17 prepared for Unix systems, GNU/Linux + XFree86 3.Y.Z–4.Y.Z, Solaris, Irix, Windows 9x/NT and Macintosh OS 9.Y.Z.

The direct mapping of all OpenGL functionality actually offers everything needed to develop virtual worlds using Java. But two reasons led to the decision to keep looking for other options. First, the project itself was last updated in 2001. Although it is open source, the available manpower to fix errors or integrate OpenGL extensions could be too small – or all missing implementations would have to be dealt with by ourselves. Second, this lowest-level 3D access would mean too much overhead for our project development: no loaders for external file formats are provided, and no helper classes or scene graph options are available.

17 The last updated version listed on SourceForge is unfortunately dated November 5th, 2001.

The only noticeable advantage would clearly be the direct manipulation of OpenGL states, which could help in realizing the desired occlusion by means of direct buffer access. But as the search continued, the right mixture of higher- and lower-level 3D access was found, which is described in the following sections.

6.5.4 JOGL and LWJGL

The acronyms stand for Java OpenGL and Lightweight Java Game Library. Both are OpenGL bindings for Java, giving access to all API 2.0 functionality. JOGL, developed by the "Game Technology Group" at Sun Microsystems, implements almost all vendor extensions offered for OpenGL. The company itself describes the interface as a reference implementation for hardware-supported 3D graphics. It is open source, well documented and available for all major platforms (Solaris, Linux, Macintosh, Windows).

Besides the OpenGL access, LWJGL integrates full support for OpenAL for audio playback as well as handling of mouse and keyboard input. As an external program library it offers a platform-independent alternative to Microsoft's DirectX, for instance. It mainly focuses on game development, aiming at the fastest possible interaction and rendering. Only Java-portable functions have been implemented, leaving out major parts of OpenGL's GLU. LWJGL aims at portability and a small footprint – planning to run as soon as possible on mobile devices with the Java Micro Edition. This also makes it a good choice for our purposes, since the restrictions of the virtual machines running in set top boxes could follow the same path as mobile devices in order to keep hardware expenses as low as possible.

For testing, the JOGL interface was chosen, and the still missing occlusion could be implemented rapidly. The stencil test works flawlessly, and another way without resorting to this auxiliary buffer was realizable as well: since GL states can be switched on and off at will, glColorMask can be set to false while keeping glDepthMask enabled. Rendering geometry with these flags set fills the depth buffer while leaving the color buffer untouched (see figure 6.9). Resetting these values and rendering other geometry afterwards results in occluded geometry. The relevant part of the draw() routine of the JOGL-based program is shown below.

gl.glColorMask(false, false, false, false);
gl.glDepthMask(true);
// draw occlusion geometry now
gl.glColorMask(true, true, true, false);
// draw normal geometry afterwards
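For completeness, the stencil-based variant described in 6.5.2 can also be expressed directly in JOGL. The following fragment is a sketch of the principle only (the drawing calls are placeholders, and a stencil buffer has to be requested for the GL drawable): the mask is written into the stencil buffer first, and virtual geometry is then drawn only where the stencil value equals the reference value 1.

gl.glClear(GL.GL_STENCIL_BUFFER_BIT);
gl.glEnable(GL.GL_STENCIL_TEST);

// 1) write the mask into the stencil buffer only (no color, no depth)
gl.glColorMask(false, false, false, false);
gl.glDepthMask(false);
gl.glStencilFunc(GL.GL_ALWAYS, 1, 0xff);
gl.glStencilOp(GL.GL_REPLACE, GL.GL_REPLACE, GL.GL_REPLACE);
// ... draw the mask geometry here ...

// 2) draw the virtual objects only where the stencil value is 1
gl.glColorMask(true, true, true, true);
gl.glDepthMask(true);
gl.glStencilFunc(GL.GL_EQUAL, 1, 0xff);
gl.glStencilOp(GL.GL_KEEP, GL.GL_KEEP, GL.GL_KEEP);
// ... draw the augmented 3D objects here ...

gl.glDisable(GL.GL_STENCIL_TEST);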

6.5.5 Xith3D

Xith3D is another 3D scene graph for Java, but it includes a renderer as well and is managed as open source. It uses the same basic scene graph structure as Java3D (e.g. comparable functionality of Nodes, Views, TransformGroups, Appearances, etc.).

Figure 6.9: left: geometry with glColorMask enabled for all objects; right: glColorMask disabled for one of the geometries, to be used for occlusion

As its target group, the Xith3D documentation explicitly points to game developers and real-time applications. The implementation is meant to be more of a "game orientated alternative" [27], simplifying some issues and leaving out overhead and irrelevant parts, keeping it leaner than Java3D. A lot of code written for Java3D can be ported to Xith3D by only changing the imports from the Java3D libraries to the corresponding Xith3D files (most of the classes have the same name and hierarchy). Existing differences will be described later; some approaches have an altered structure, some classes additional or changed constructors. Some of the most striking design concept differences to Java3D are listed here:

• branches don't hold special capability bits to be manipulable during execution, simplifying access and control to a great extent
• the scene graph is thread-unsafe, speeding things up but risking undefined results if changes are made at an inappropriate moment
• copies of object content are rarely made; calling a method with a reference to an object will usually re-link the supplied instance for use in the new context
• only floats are supported, no doubles
• all geometry is accessed by reference; no intermediate copies between the geometry arrays and the 3D graphics card are inserted
• no Behaviour classes with their additional time triggers for events are implemented (yet)
• the render loop has to be called manually, giving the developer full control over timing and frame progression
• the underlying renderer can be controlled and even exchanged

Especially the last item deserves special attention: unlike the Java3D scene graph, it is possible to access the underlying rendering API (usually OpenGL) by directly calling the OpenGL

commands. This might break with the idea of a high-level scene graph, where the developer does not have to deal with these basic controls, but it offers a lot more freedom. Sooner or later, Xith3D classes might be implemented that offer all imaginable effects and tricks for which a manual switching of OpenGL states and buffers is unavoidable up to now (e.g. realizing reflections with the stencil buffer or generating cast shadows). But at this stage of development, this possibility offers an unlimited opportunity for our needs: we can resort to the luxury and convenience of a scene graph and at the same time go deep into the rendering process. The renderer can be exchanged at will, and since the code base is open source we could even start writing our own translator. Currently the already mentioned renderers LWJGL and JOGL have been fully integrated and tested.

Pelota Application Scenegraph Design

For the pelota application the scene graph has been set up as described in figure 6.10. The important nodes are below the BranchGroup sceneBG: the Switch m_viewMode is set to the first branch if the application is running in augmented view: only the linked background is drawn, including the MovieBackground showing the video. If the second branch is selected, all objects that belong only to the virtual world are drawn (e.g. the frontón or the camera models), and the background is set to a static image. Since only one Background node is allowed, we have to switch between the two this way. Other objects are always set to active, but might still be deactivated, i.e. drawn invisible; the onscreen display is one example. Under m_scene_alwaysDrawnObjects we define another switch responsible for handling the different augmentation bouquets: depending on the selected user profile, a branch is chosen. These branches may include as many subsets as desired. Occlusion geometry is also linked under m_scene_alwaysDrawnObjects, here represented by the pelota: during virtual view rendering the ball is drawn normally; switching to augmentation view alters the pelota's state and draws it accordingly for occlusion purposes (see 6.5.5).

Viewpoint Selection and Update

As in Java3D, a view branch is used to define the current viewpoint. An instance of a View object gathers all information needed to render a frame of the current scene graph geometries. It is the central spot where all viewing parameters can be adjusted: the field of view (according to the chosen camera lens of the video that is to be augmented) can be set, and the eye position can be updated on camera movement. While in Java3D one has to link to a ViewPlatform object, which resides inside the scene graph as a leaf node, this has been left out in Xith3D, and camera orientations can be defined directly by altering the transform of the View. Throughout the implementation I define cameras by setting position, aim and up vectors. To have a running universe, the View is linked into a VirtualUniverse and all geometry is concatenated under the BranchGroup scene. A short code overview is given in listing 6.1.

// create the virtual world
VirtualUniverse universe = new VirtualUniverse();

// add a view to the universe
view = new View();
universe.addView(view);

// add a locale
Locale locale = new Locale();
universe.addLocale(locale);

// create a BranchGroup
scene = new BranchGroup();
locale.addBranchGraph(scene);

// eye location
Point3f m_cam_pos = new Point3f(8, 3, 19);
// center of view
Point3f m_cam_aim = new Point3f(0, 0, 0);
// vector pointing up
private Point3f m_cam_up = new Point3f(0, 1, 0);

RenderPeer rp = new RenderPeerImpl();
CanvasPeer cp = rp.makeCanvas(null, 768, 576, 32, false);
canvas3D = new Canvas3D();
canvas3D.set3DPeer(cp);
view.addCanvas3D(canvas3D);
view.getTransform().lookAt(m_cam_pos, m_cam_aim, m_cam_up);

// field of view, value will be overwritten by
// camera specific settings (lens)
float fov = 45.0f;
myfov = ((fov / 2.0f) * (float) Math.PI) / 180.0f;
view.setFieldOfView(myfov);

Listing 6.1: basic virtual world with viewpoint definition

To manage the updated point of view, our ViewPoint class is accessed. If the current eye position is fixed to a certain video, the Camera class is queried for its current positioning by calling getViewPoint() frame by frame. If a new tracking position is available, it updates the viewpoint accordingly. Thus, using the given tracking data, we can fly around in the virtual world with all objects that are set to augment the video. But since we are currently only viewing the virtual part, it is time to look at how the video can be inserted into our world. Technically, the video image is defined as the background image of the three-dimensional world, and thus we can put any objects we would like to see in front of it.
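The per-frame viewpoint update can be summarized in a short sketch. The class and method names (Camera, ViewPoint, getViewPoint()) follow the text, while the ViewPoint accessors shown here are assumptions for illustration; the View calls correspond to listing 6.1.

// called once per frame while the user is locked to a real camera
void updateView(Camera activeCamera, View view) {
    ViewPoint vp = activeCamera.getViewPoint();   // latest tracked camera pose
    view.getTransform().lookAt(vp.getPosition(), vp.getAim(), vp.getUp());
    view.setFieldOfView(vp.getFieldOfView());     // lens-dependent, per camera
}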

Figure 6.10: Pelota application scene graph overview

Video Loading and Full-Screen Background

To realize the video integration, an abstract Linker class was designed, which represents the wrapper for the underlying streaming and playback classes. The video frame is always retrieved through the getFrame method, returning an image as a DirectBufferedImage, which is handed over into the 3D world. The existing functionality to draw the video directly (even using 2D hardware acceleration) is not utilized; playback is not rendered directly to the screen, but instead "hidden", only letting us extract the buffer. A video can be a local file (for simulation) as well as a streamed media source. The Linker class may be derived further if adjustments are needed.

For testing, two child classes have been realized: JMFLinker, offering a binding to the Java Media Framework, already extensively described in sections 3.2.1 and 6.2, and QTLinker, integrating the QuickTime API developed by Apple. JMF can play back all MPEG-encoded media files (*.avi, *.mpg) and, depending on the platform, even more (e.g. by resorting to codecs installed in the system under Windows XP). Using this API it was not possible to access the video on a frame-by-frame basis. Instead, depending on the current time code, the realized player returns the appropriate frame for the current time. This can lead to dropped images, since the video playback tries to avoid delays. This scenario actually corresponds 100% to DVB streaming, where frame-by-frame access won't be possible either if delays stall the stream access. To evade synchronization problems, a timer has to be used to keep video image, tracking data and overlaid graphics matched.

To have more freedom during development, the QuickTime API was integrated as a second link to media files (*.mov). Here the getFrame method always returns the next frame, waiting for the next call without advancing the video internally. It is even possible to directly select the desired frame. As a drawback, the playback of mov files was a little slower, and the frame rate dropped more easily below 25 frames per second while displaying high-quality videos (full or half PAL signal). The integration was useful, as full control over playback was possible during development. In the final version running on set top boxes this won't be used though, since only the JMF ships with MHP and JavaTV. For this reason, the QuickTime API and the realization of the QTLinker won't be explained further.

For local use (without streamed media or network), I define ASCII files for each simulated video stream: these .cam files define the video format (QuickTime, MPEG), the source file, a human-readable name for the camera (to be used in the preview and stream selection mode of the application) and all tracking data for this particular camera's position. The tracking data may vary in format and will be converted accordingly (see section 6.3).

To display the loaded video, two different classes have been derived from the scene graph's Shape3D class: MovieScreen and MovieBackground. The former is used to display a video frame at an arbitrary position in the virtual world, while the latter defines an always full-screen background video image. For both, the shape's geometry only sets four points in 3D space, which are connected by a quad array to a flat surface and covered by a DirectBufferedImage as a texture.

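Before looking at how the texture is kept up to date, the Linker abstraction introduced above can be sketched as follows. The class and method names (Linker, getFrame, JMFLinker) come from the text, while the frame-grabbing body is a simplified assumption (JMF offers, for example, a FrameGrabbingControl on suitable players); a plain BufferedImage is used here for brevity instead of the DirectBufferedImage of the actual implementation.

import java.awt.image.BufferedImage;

public abstract class Linker {
    // returns the most recent video frame as an image usable as a texture
    public abstract BufferedImage getFrame();
}

class JMFLinker extends Linker {
    public BufferedImage getFrame() {
        // internally a hidden JMF player is running; its current frame is
        // converted into a BufferedImage and handed over to the 3D world
        return grabCurrentFrameFromPlayer();   // illustrative helper, not shown here
    }

    private BufferedImage grabCurrentFrameFromPlayer() {
        // ... grab and convert the player's current frame ...
        return null;
    }
}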
The image data is connected directly to the getFrame() method of the Linker and usually updated 25 times a second. The coordinates of the geometry are recalculated on each frame if the position of the video frame has to change in the virtual space.

The MovieBackground is used for the normal augmented view of the digital video stream: the video is set full-screen into the background. To realize this, the special scene graph node Background comes in handy, switching off the depth buffer and always rendering the connected objects first. Thus, all other geometry is drawn in front of the video texture. Without this special node, the occlusion geometry would punch holes into the background frames (see 6.5.5).

An important difference between Java3D and Xith3D is the initialization of a Background object. In Java3D, the geometry bound to the node is placed below the current viewpoint node, thus moving the shape's position when the eye position changes: this caused an unpredictable, erroneous "shaking" of the video background as soon as the viewpoint moved. The image was tilted or sheared once the camera was not pointing at the screen at a perpendicular angle. To overcome this issue, we update the geometry's quad coordinates every time the viewpoint changes (i.e. each time updated camera values are received): we recreate the shape by calling the createGeometry() function of the MovieBackground, passing in the new point of view and calculating the corresponding perpendicular position ourselves. This way, the whole background billboard plus the camera's point of view move together through the virtual world, leaving the other rendered objects in place. This recalculation of a perpendicular placement of the screen at a fixed distance from the viewpoint solved Java3D's unexpected behaviour.

This work-around is not needed in Xith3D. It offers the possibility to decide where the background geometry should be linked into the scene graph by calling the constructor with a cameraMode parameter:

• View.VIEW_NORMAL, same behaviour as in Java3D, background linked below the viewpoint
• View.VIEW_FIXED, static position, not changing with the eye position
• View.VIEW_FIXED_POSITION, geometry rendered at the camera's location but facing the camera's current direction

Using Xith3D we can simply fix the background or "move it around" with our camera, as done before in the Java3D version. No difference is noticeable, although the recalculation of the textured quad's position consumes a small amount of calculation time18. Now the video is set behind the inserted virtual objects, and by using the tracking data to control the viewpoint we have a perfectly working augmentation up and running.

18 No information was found on how Xith3D handles the fixed option internally; speed differences were unnoticeable on the testing platform.

Video Previews

To integrate video screens for the small preview panels floating above the virtual camera models, the MovieScreen class is used. Each camera class has its own MovieScreen for the preview (in contrast to the MovieBackground, which exists only once in the virtual world). For the small preview, the coordinates are attached to the camera tracking position and the quality is set to a lower resolution to guarantee fast rendering. The resolution can be selected, and the preview windows can also be deactivated if resources are running low or the user does not want these "distractions" in the 3D space.

To accomplish the concept of the more classical preview of available streams (described in section 4.4.2), we also rely on the 3D rendering with textured shapes, although we only display two-dimensional layers. This is due to restrictions of the MHP specification: only one video stream can be displayed in the background, and component-based players are not mandatory ([6]). Again, future development and changes to the MHP standard will show what can be achieved and how (including possible 3D renderings). Currently, we render an empty virtual universe, including only one background image, overlaid by camera-facing video quads with text which the user can select.

Occlusions in Xith3D

To realize the occlusions in Xith3D, I resort to the direct access to the OpenGL states provided through JOGL. As described in section 6.5.4, occlusion geometry is rendered invisibly but fills the depth buffer. The option of using a Raster image is not implemented in Xith3D, a difference to Java3D's design. Instead, a Foreground class exists, displaying bound geometry always in front of all other rendered objects. By setting a textured quad with transparency into the foreground, I could fulfill the same needs as with the raster object, but without the depth position of the mask (thus again only achieving a foreground/background distinction). Another try-out was done to combine video textures with occlusion geometry: I render a textured quad in 3D space at the appropriate spatial position, but use transparency to draw only the texture pixels that belong to the shape (e.g. of the player). Unfortunately, when switching to draw only into the depth buffer, Xith3D still draws the whole geometry into the buffer instead of only the textured pixels, which would be consistent with what is actually seen on the screen. Hence, this approach had to be dropped. In contrast to Java3D, the stencil buffer access works in Xith3D (i.e. through JOGL).

Example: Ball Occlusion

Using the geometry alternative, I exemplarily realized the occlusion for the pelota ball19. The AugmentedObject3D class has been extended to offer the possibility to switch between normal and occlusion geometry representation. The latter deactivates color writing, writes to the depth buffer and optionally changes the geometry.

19 As well as for players, using simple geometry like boxes.

Since an AugmentedObject3D is also used in the virtual world preview, it can often be useful to deactivate all superfluous details of an object when rendering it for occlusion. Although it is feasible to import complex geometries from files and to use them for occlusion purposes, I restrict myself to the pelota as a good first example, because only the size of the ball and its coordinate in 3D space have to be known: the object itself won't change shape or size. Shape-shifting geometry could also be loaded and updated through the tracking information, then giving not only the overall position but also changes for each limb's or body part's angle or position. How to create further occlusion geometries will be covered later in section 6.7. The ball has been tracked beforehand, and now, while it is being presented on the TV, the transmitted tracking data is read to align the geometry with the position in the video. During the 3D preview everything is rendered visibly, and the application switches to the occlusion version as soon as a full-screen video stream has been selected. The functions setGeometryOcclusion() and setGeometryNormal() allow this alternation for each AugmentedObject3D.

Geometry Loader

To overcome the restriction of only displaying simple geometry constructed from OpenGL primitives, I looked into different geometry loaders in order to use more complex and textured objects designed externally in modeling programs. Eventually I integrated a loader for so-called ASE files. These are ASCII-based geometry descriptions, similar to the VRML97 format [23]. In Java3D it is possible to integrate external file formats into the program environment and the scene graph through a loader interface. The concept was adopted in Xith3D as well, and important details are described below.

Sample geometries (for instance the camera models) were designed in discreet's 3DSMax. Texturing and modeling can be done as usual; grouping of elements is also possible. 3DSMax already ships with an ASCII/ASE exporter, making life easy. To reach the best possible results, some export settings have to be taken care of: for output options one has to tick mesh definitions and materials, for mesh options we want to export normals, mapping coordinates and vertex colors, and object types should activate geometric, shapes and helpers. Thus, we get a complete export, including all UV coordinates set for texturing.

Only static meshes could be exported and imported. But the ability to control groups defined in 3DSMax makes it possible to animate the geometries within the pelota application: using the getNamedNodesMap() function of the AseFile class, it is possible to retrieve a map (java.util.Map) of all existing nodes inside the ASE file. Afterwards, single nodes can be selected and bound to a BranchGroup in Xith3D. The integration of the loader into the project is realized in AugmentedObject3DAse, extending the 3D class. Once loaded, an instance of this class returns a BranchGroup as usual. Manipulation of child nodes is possible by calling manipulateTransformGroupOfNode(String nodeID), but this still lacks a full implementation. All groups have to be linked manually into a small graph including individual TransformGroups.

The loading process, splitting 3DSMax groups into a manipulable set of nodes, is briefly summarized in the following function:

private BranchGroup load(String source) {
    BranchGroup m_root = new BranchGroup();
    BranchGroup m_subset1 = new BranchGroup();
    BranchGroup m_subset2 = new BranchGroup();
    try {
        AseFile af = new AseFile();
        BufferedReader br = null;
        try {
            br = new BufferedReader(new FileReader(source));
        } catch (IOException e) {
            // ...
        }
        AseReader r = new AseReader(br);
        af.parse(r);
        Map nodes = af.getNamedNodesMap();
        m_root.addChild((Node) nodes.get("rootname"));
        m_subset1.addChild((Node) nodes.get("subset1name"));
        m_subset2.addChild((Node) nodes.get("subset2name"));
        // link the subsets into the returned graph
        m_root.addChild(m_subset1);
        m_root.addChild(m_subset2);
    } catch (Exception e) {
        // ...
    }
    return m_root;
}

Listing 6.2: example for geometry loader

6.5.6 JME with M3G

While development with Xith3D was already underway, the usage of the Java Micro Edition (JME) in combination with the Mobile 3D Graphics API (M3G) was inspected as well. M3G is an optional package for the JME, offering 3D graphics capabilities. Its main targets are devices with very little and restricted computing and memory power, such as mobile devices and hand-helds. The renderer does not rely on hardware acceleration,

allowing usage in low-budget environments. However, the API scales up to higher-end devices featuring bigger color displays, floating point units or even 3D graphics chip support. M3G was defined in Java Specification Request 184; its cornerstones and its list of features, "must"s and "must-not"s are given in [31]. These include (excerpt):

• The API must support retained mode access (that is, a scene graph).
• The API must support mixing and matching of immediate and retained mode access.
• The API must not include optional parts (that is, all methods must be implemented).
• The API must have importers for meshes, textures, entire scene graphs, etc.
• The API must be efficiently implementable without floating point hardware.
• The API should be implementable within 150 kB on a real mobile terminal.
• The API must be efficiently implementable on top of OpenGL ES.

This specification fulfills all our needs as well: a scene graph for convenience, external file and texture loaders, and direct OpenGL access (immediate and retained mode) for occlusion renderings. Up to now, some set top boxes at the company run the Java Personal Edition, but some already exist resorting to the JME; since the JME is the official successor of PersonalJava – as stated by Sun – an optional integration of M3G seems quite possible. Lack of time did not allow further research on M3G. It is questionable whether the next MHP specification will use JME as a basis in all implementations. If so, a port from the Personal Edition would be a great step towards an up-to-date API, capable of the latest developments and scalable for future extensions.
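Purely as an illustration of the JSR-184 programming model (this was not implemented in this project), a retained-mode rendering pass might look roughly like the following fragment. The resource name is hypothetical, the cast assumes that the first object in the .m3g file is the World node, and exception handling is omitted.

import javax.microedition.m3g.Graphics3D;
import javax.microedition.m3g.Loader;
import javax.microedition.m3g.World;

// 'g' is the javax.microedition.lcdui.Graphics object of a MIDP Canvas
World world = (World) Loader.load("/scene.m3g")[0];   // hypothetical resource name
Graphics3D g3d = Graphics3D.getInstance();
try {
    g3d.bindTarget(g);        // bind the rendering target
    g3d.render(world);        // retained mode: render the whole scene graph
} finally {
    g3d.releaseTarget();
}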

6.5.7 Conclusion on 3D APIs

Browsing through the available APIs, Java3D seemed the best choice at first, offering the broadest support and the most add-ons such as file loaders, but due to still missing features regarding rastered images and the inability to access OpenGL states, we backed away from the official Sun project. Low-level APIs such as JOGL and LWJGL provide a solid basis and the advantage of full control and a reasonably slim file size, but they sacrifice convenience during development and place a huge burden on the programmer. The eventually chosen Xith3D implementation combines all the previously missing individual advantages, but comes with problems as well: the open-source-based specification is still under development and lacks certain functionality, and it also relies on hardware rendering. Advantages over Java3D include OpenGL state access as well as easier management, since no capabilities have to be set recursively inside the scene graph and synchronization is done directly, without externally triggered timers or events.

The freshly reviewed Mobile 3D Graphics API for the Java Micro Edition seems to offer all features needed in our research while not demanding hardware-accelerated graphics rendering. An integration into MHP-based set top boxes seems most probable, and future

development in the project described in this paper should follow M3G as well, to allow a possible port to this renderer. But since no hardware implementing this combination of JME and M3G for TV boxes is yet available, a prognosis cannot single out one of these alternatives. An open standard driven and supported directly by Sun could have better chances than community-driven APIs such as Xith3D. But these options need to be compared, and in the end quality should decide – if not clouded by better marketing and lobbying for a possibly worse product.

6.6 Cameras

In our 3D world, viewpoints are defined to observe the pelota presentation. A Camera class and a second concept of virtual cameras are introduced. Each video stream in the DVB environment creates a camera, binding a Linker inside a MovieScreen (compare 6.5.5) for video playback. Frame by frame, the camera's position in the 3D world is updated from the provided tracking data, as described in 6.3. The ViewPoint class is used to store the spatial information. The second concept, virtual cameras, defines 3D coordinates inside the virtual world from which the user may observe the virtual representation of the game. Currently, CameraVirtual extends the ViewPoint class; no binding to a video is needed, of course. Up to now the virtual cameras have fixed positions, but this could easily be extended by integrating paths.

6.6.1 Paths

These paths are currently used to calculate smooth transitions from the current viewpoint to a desired eye position. The ViewPointPath* classes provide a follow-up position out of a fixed set of transition steps. Currently, a linear interpolation between the current and the final position is calculated. Movements between virtual and real cameras (including moving cameras) are possible.
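A minimal sketch of such a linear transition is given below; the class name follows the ViewPointPath* naming from the text, while the fields and the step logic are assumptions for illustration. The same interpolation is applied analogously to the aim and up vectors.

import javax.vecmath.Point3f;

public class ViewPointPathLinear {
    private final Point3f from, to;
    private final int steps;
    private int current = 0;

    public ViewPointPathLinear(Point3f from, Point3f to, int steps) {
        this.from = from;
        this.to = to;
        this.steps = steps;
    }

    // returns the next interpolated eye position; after 'steps' calls it stays at the target
    public Point3f next() {
        float s = Math.min(1.0f, (float) ++current / steps);
        Point3f p = new Point3f();
        p.interpolate(from, to, s);   // p = (1 - s) * from + s * to
        return p;
    }
}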

6.6.2 Interactivity

User interactivity is currently restricted to a selection among the following:

• choose real camera streams
• switch augmentation bouquets
• choose a virtual world position freely or select virtual camera spots
• toggle visual aids: pelota trails, etc.
• show the 2D overview panel or display assistance information

More manipulation is easily possible and can be integrated by passing controls from EoraStarter to EoraDTVWorld by calling scheduleUserCommand(int cmd). Currently these calls are triggered by AWTEventListeners and KeyEvents. In the MHP environment the same AWT functionality is used, but the number of available buttons is more restricted.
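As an illustration of this mechanism, the following sketch maps a few AWT key events to user commands; scheduleUserCommand(int) is taken from the text, while the world reference (an EoraDTVWorld instance), the command constants and the chosen keys are assumptions. On a set top box, the remote control's events arrive through the same AWT listener mechanism.

import java.awt.event.KeyAdapter;
import java.awt.event.KeyEvent;

KeyAdapter remoteControl = new KeyAdapter() {
    public void keyPressed(KeyEvent e) {
        switch (e.getKeyCode()) {
            case KeyEvent.VK_1: world.scheduleUserCommand(CMD_SELECT_CAMERA_1); break;
            case KeyEvent.VK_B: world.scheduleUserCommand(CMD_SWITCH_BOUQUET);  break;
            case KeyEvent.VK_V: world.scheduleUserCommand(CMD_VIRTUAL_VIEW);    break;
            case KeyEvent.VK_T: world.scheduleUserCommand(CMD_TOGGLE_TRAILS);   break;
        }
    }
};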

Since interactive manipulation of inserted virtual objects is straightforward and can easily be realized by updating transformation nodes in the scene graph, I did not implement any further examples. The main task would be the design of the corresponding logic and the planning of the behaviour of the different elements. As we already have a fully established scene graph and a well-known geometry of the pelota court, physics including collisions could be set up quite fast. I described an implementation of an interactive augmented reality application (in that case a game) in great detail in [3]; it covers important issues such as collisions, usability and also occlusions. Possible fields of use will be discussed in the following chapters. The usage of a DVB back channel will also be analyzed in 8.1.

6.7 Obtaining Occlusion Information

Obtaining player figure shapes may vary heavily in effort and quality. I describe two approaches: 2D masks and 3D geometries. There won't be a silver bullet for obtaining occlusion information, but the results can reach a satisfying level for our specific scenario.

Masks

Aiming at the most efficient solution, I concentrate on the image version first. Since no prepared masks are offered, a complete generation from scratch has to be implemented. As no additional information except the PAL signals of the cameras is available, all confidence has to be put in image recognition algorithms. For a first version, a locked-off camera position is needed. Since the frontón is lit evenly and it is possible to record reference images of the empty court (a so-called clean plate), a difference keyer was programmed, delivering good results. Objects not belonging to the frontón cause a difference value, while the unchanged parts remain black – comparing distances between RGB values or brightness differences. The best results are obtained with the brightness version, applying noise-reduction filters (mean filter) and a Gaussian blur to the two frames (empty scene, scene with person). Then the difference image is generated and mapped to gray scale. To remove small left-over errors, two erosions are applied, and the result is thresholded to a binary image. To get rid of fringes, one more erosion and a dilation are calculated, expanding the mask and closing smaller gaps and cracks. Fortunately, the color difference is sufficient, as television stations decided to repaint the courts green instead of the traditional white when they first started to show pelota games. Examples can be seen in figure 6.11; a best-case result is a mask as shown in figure 6.12.

By now, the resulting mask could already be used in the program – disregarding the depth positions of the players. To include this missing information as well, the player's 3D position can be estimated roughly: since all camera parameters are known and the court stays the same, a ray can be shot through the camera and the lowest part of the masked area. Since the application is restricted to the pelota game, I assume that a player always stands on the ground of the court (not moving freely in 3D space): the intersection with the virtual object

representing the court then returns the depth position of the person (see figure 6.13). If more than one player position is found, the one closest to the camera is transmitted for the mask. Implementation and usage of one single mask bitmap plus depth information has been described earlier.

Figure 6.11: 1) clean plate; 2) scene with person; 3) difference image with erosions; 4) thresholded binary image; 5) eroded and dilated version used for the mask

Problems arose during the determination of player positions on reflective surfaces: the difference image returns the mirrored part as well, resulting in wrong positioning. During the video tests I neglect this issue by filming in outdoor courts (typically with stone grounds). To overcome the problem in the future, one should rely on more sophisticated image recognition algorithms: differences in color saturation or the mixture with the ground color could hint that the lower parts of the feigned mask have to be discarded. To better distinguish between two players, I plan to split the two areas of the mask by searching for two pixel-weight areas: as soon as a free line through the frame can be found without intersecting masked pixels, the split succeeds. If players overlap, the same depth position is assumed. Now each position has an assigned depth value, and partial masks can be provided for the video frame. Once the Raster class including depth values works properly, one could simply copy these masks into one bitmap covering the whole screen – setting two different distance values, as determined through the ray shot.

Another way to include separate masks is to use textured billboards (a single quad) with the size of the players (see figure 6.14). If the video texture is copied partly onto the quad, leaving out transparent parts as defined by the mask, the occlusion happens as planned, but the blurring problems of rescaled textures remain (as seen in figure 6.8). The idea of using a transparent textured quad only to fill the depth buffer (without redrawing the video parts, using only the mask) could not be realized, since Xith3D always uses the complete geometry to fill the buffer, not caring about transparent parts (due to textures) of the object.

Figure 6.12: left: source video frame; right: best case mask
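A simplified sketch of the difference keyer that produces such a mask is given below. It follows the steps described above but omits the mean filter and the Gaussian blur; the threshold value and the 3x3 structuring element are illustrative choices.

import java.awt.image.BufferedImage;

public class DifferenceKeyer {

    // brightness of an RGB pixel (0..255)
    private static int luma(int rgb) {
        int r = (rgb >> 16) & 0xff, g = (rgb >> 8) & 0xff, b = rgb & 0xff;
        return (r + g + b) / 3;
    }

    // true = pixel differs from the clean plate, i.e. belongs to a foreground object
    public static boolean[][] key(BufferedImage cleanPlate, BufferedImage frame, int threshold) {
        int w = frame.getWidth(), h = frame.getHeight();
        boolean[][] mask = new boolean[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                mask[y][x] = Math.abs(luma(frame.getRGB(x, y))
                                    - luma(cleanPlate.getRGB(x, y))) > threshold;
        mask = erode(erode(mask));   // remove small speckles
        mask = dilate(mask);         // close gaps and cracks
        return mask;
    }

    private static boolean[][] erode(boolean[][] m)  { return morph(m, true);  }
    private static boolean[][] dilate(boolean[][] m) { return morph(m, false); }

    // 3x3 morphological operator: erosion keeps a pixel only if all neighbours are set,
    // dilation sets a pixel if any neighbour is set
    private static boolean[][] morph(boolean[][] m, boolean erode) {
        int h = m.length, w = m[0].length;
        boolean[][] out = new boolean[h][w];
        for (int y = 1; y < h - 1; y++)
            for (int x = 1; x < w - 1; x++) {
                boolean all = true, any = false;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++) {
                        all &= m[y + dy][x + dx];
                        any |= m[y + dy][x + dx];
                    }
                out[y][x] = erode ? all : any;
            }
        return out;
    }
}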

Figure 6.13: shooting ray through camera and player mask to determine 3D position
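The depth estimation illustrated in figure 6.13 boils down to intersecting the viewing ray through the lowest masked pixel with the court's ground plane. A sketch of that intersection is given below, assuming the ground plane is y = 0 in world coordinates; computing the ray direction from the camera parameters and the pixel coordinate is omitted.

import javax.vecmath.Point3f;
import javax.vecmath.Vector3f;

public static Point3f estimatePlayerPosition(Point3f camPos, Vector3f rayDir) {
    if (Math.abs(rayDir.y) < 1e-6f) {
        return null;                         // ray runs parallel to the ground
    }
    float t = -camPos.y / rayDir.y;          // solve camPos.y + t * rayDir.y == 0
    if (t <= 0) {
        return null;                         // intersection lies behind the camera
    }
    return new Point3f(camPos.x + t * rayDir.x, 0.0f, camPos.z + t * rayDir.z);
}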

Figure 6.14: player model rectangles, cut out with a transparent texture mask (green)

To be able to use the mask, it has to be transmitted alongside the video signal. Currently only local files are used, but two possible future solutions are imaginable:

• Transmit an extra stream with the mask – since DVB allows bouquets in one channel, theoretically sending out an unlimited number of streams, this could offer an easy way to recover the mask: JMF access to the stream is already implemented and could be extended for the mask video.
• Encode masks inside MPEG-4 videos – if the whole profile of MPEG-4 becomes available one day, object-based encoding, defining separate areas of a stream, could be used to distinguish depth layers.

Geometry

Other scenarios might offer ready-made geometries or supply hardware for real-time reconstruction of the real world's geometry (e.g. infrared depth cameras), but the pelota application for ETB can only resort to the bare video signal as a source. Hence, to reconstruct geometry, I first rely on the player mask retrieved through the difference key again. Afterwards, knowing the depth position of the "player's pixels", each video pixel that belongs to a player could be represented by a small 3D cube: a paper-thin but real 3D object is generated. Typical situations of a broadcast pelota game display a player at up to a third of the height of the video frame, resulting in huge amounts of cubes. For instance, a 200 by 100 pixel rectangle including the player (with 50% of the pixels occupied) would result in 10,000 cubes to render and to transmit to the set top box (i.e. their positions). A better solution had to be found. Instead, the cut-out information of a player's shape is stored as a list of vertices reconstructing the outline of the (flat) player representation object: the depth position is set to

the same value for each vertex and the list spans a triangle strip. This information occupies less space and can easily be imported into the application.
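A sketch of how such a flat silhouette could be turned into scene graph geometry is given below, using the Java3D-style TriangleStripArray that Xith3D mirrors; the ordering of the outline points into a strip (alternating bottom/top) is assumed to have happened beforehand.

import javax.media.j3d.GeometryArray;
import javax.media.j3d.Shape3D;
import javax.media.j3d.TriangleStripArray;
import javax.vecmath.Point3f;

// stripVertices: outline points of the player silhouette, all with the same z (depth) value
public static Shape3D buildPlayerShape(Point3f[] stripVertices) {
    TriangleStripArray strip = new TriangleStripArray(
            stripVertices.length,
            GeometryArray.COORDINATES,
            new int[] { stripVertices.length });   // one single strip
    strip.setCoordinates(0, stripVertices);
    return new Shape3D(strip);
}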

Figure 6.15: cut out player models using a list of vertices; left: viewpoint from camera; right: player representation

Figure 6.16: overview of placed player shapes in virtual frontón, according to their 3D position

7 Practical Results

This chapter gives an overview of the implementation results for the planned concepts. Screenshots are presented, overall functionality is described and discussed, and future prerequisites for a complete implementation of Augmented Television are outlined.

7.1 Usage of Current Technology

After a description of the state of the art in digital television, practical work started and a set of sample applications was developed using Java technology (see figure 7.1). They run on a PC STB emulator as well as on currently available real digital television receivers implementing DVB-MHP. The different class files are combined within one single program package and can be uploaded to a set top box via the MHP injection server available at VICOMTech. The user can use the remote control to select a single function. The following menus and demonstrations are selectable:

• Graphics, offering demo Xlets for text display, shapes, transparency, layered transparencies, 2D animation, 3D graphics (only simulated on a PC, not currently available in the MHP specification) and video overlays.

• TV image manipulation: video size and position manipulation, read/write pixel data (only simulated on a PC, not currently available in the MHP specification).

• Networks: read carousel data, return channel.

• Interactivity: remote control usage, return channel interactivity.

• System information.

The main class files allow for future extensions. New submenus and demos can be integrated to demonstrate more MHP functionality once the appropriate class files have been incorporated into the package. The current demos come with a complete javadoc and source code description to serve as a future reference for other staff at VICOMTech who plan to use MHP as well. The demonstration phase ended with a description of hardware possibilities and limitations as documented in chapter 5.4. While color, palette and transparency problems can be handled, frame-by-frame synchronization between video and overlaid graphics, as well as predictive positioning on a pixel basis, are currently not feasible. The mandatory MHP feature set is far too restrictive. Since it was not possible to extend a receiver with a 3D renderer, I had to rely on a pure PC simulation for the sample, whose results will be described below. To sum up, the current technology is not capable of realizing a real-time augmentation of the video signal.
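For reference, each of the demo Xlets mentioned above follows the standard JavaTV life cycle. The skeleton below is a minimal sketch of that life cycle, not one of the actual demo classes:

    import javax.tv.xlet.Xlet;
    import javax.tv.xlet.XletContext;
    import javax.tv.xlet.XletStateChangeException;

    // Minimal MHP/JavaTV Xlet skeleton: the middleware drives the life cycle
    // through these four methods; a real demo builds its GUI in startXlet().
    public class DemoXlet implements Xlet {
        private XletContext context;

        public void initXlet(XletContext ctx) throws XletStateChangeException {
            this.context = ctx;          // keep the context for signalling state changes
        }

        public void startXlet() throws XletStateChangeException {
            // create GUI components, register remote control listeners, start video
        }

        public void pauseXlet() {
            // release scarce resources (decoder, graphics) so another Xlet can run
        }

        public void destroyXlet(boolean unconditional) throws XletStateChangeException {
            // free all resources; the application manager removes the Xlet afterwards
        }
    }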


Figure 7.1: MHP Xlet demo package and documentation

7.2 Pelota Application for a PC Simulation

7.2.1 3D Rendering and Video Display

Different 3D renderers were discussed extensively and their advantages and disadvantages have been highlighted. Xith3D, the most promising one, was chosen for the simulation, knowing well that the code programmed during the following month would probably never be directly runnable on a set top box, since it is not predictable which system hardware manufacturers and middleware providers will eventually choose, if any. But due to a quite clear separation between program logic and the classes representing 3D content, porting to another renderer is realistic within a short amount of time. To display video, the Java Media Framework was used, as it is also integrated into MHP set top boxes. The same interfaces and function calls allow a realistic view of how video could be controlled once the pelota program runs as a whole on receiver hardware. Videos can be read from files and streams. For easier research on synchronization, the QuickTime Java interface for reproducing movies in the Apple format has also been integrated, as it allowed better control over frame access. Deviating from a streamed broadcast medium, it became possible to rewind, pause and skip frames at will, simplifying development. This full control of a JMF-based video player within the AWT hierarchy is not mandatory in MHP; as with 3D rendering, we have to rely on a concept that is not yet available there. Once a single frame of a video is retrieved, it is possible to present it on a textured quad inside the 3D world. Setting this geometry orthogonal to the 3D world's camera at a size and distance filling the screen entirely, we gain a "regular" video presentation – but within our 3D space. It is now possible to display additional two- or three-dimensional objects in front of this video. First, the file structure from an earlier VICOMTech program was reused to load advertisement data from external files. Image files and their desired positions on the floor or walls of the frontón are defined. Moreover, other attributes such as repetitions, size and opacity can be specified. A whole set of prepared advertisements can now be loaded with ease.
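Coming back to the video display itself: the following is a hedged sketch of how single frames could be pulled out of a JMF player to update the video quad's texture. It assumes the player exposes the optional FrameGrabbingControl, which is not guaranteed on every platform; the file URL is a placeholder.

    import javax.media.*;
    import javax.media.control.FrameGrabbingControl;
    import javax.media.format.VideoFormat;
    import javax.media.util.BufferToImage;
    import java.awt.Image;

    public class VideoFrameGrabber {

        // Creates and starts a realized player for a local file (URL is a placeholder).
        public static Player createPlayer(String url) throws Exception {
            Player p = Manager.createRealizedPlayer(new MediaLocator(url));
            p.start();
            return p;
        }

        // Returns the current frame as an AWT Image, or null if the optional
        // FrameGrabbingControl is not supported by the player's renderer.
        public static Image grabCurrentFrame(Player player) {
            FrameGrabbingControl fg = (FrameGrabbingControl)
                    player.getControl("javax.media.control.FrameGrabbingControl");
            if (fg == null) return null;
            Buffer buf = fg.grabFrame();
            return new BufferToImage((VideoFormat) buf.getFormat()).createImage(buf);
            // the returned Image can then be copied onto the screen-filling quad's texture
        }
    }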


It is possible to display prepared simple OpenGL geometry as well as textured geometries loaded from external files. A file loader has been incorporated to hook more complex geometry into the scene graph of the program using the ASE format, which is based on plain ASCII files. Texture files as well as UV coordinates can also be loaded. I use modeling programs such as 3DStudioMax and Maya to export ASE geometry. If DVB-MHP incorporates MPEG-4 or MPEG-7 for data transmission one day, other XML-based geometry descriptions such as X3D[39] could become a good alternative. Once a 3D renderer is integrated into the Java Runtime Environment inside set top boxes and synchronization gets more accurate, the described augmentation could be realized. Optional MHP features have to be declared mandatory in future revisions, such as the display of AWT-based JMF players for correct alignment. For 3D rendering, the more widespread usage of the Java Micro Edition in mobile or cost-efficient devices such as cell phones, organizers, portable gaming platforms and receivers, together with emerging 3D hardware acceleration for those gadgets, points in the right direction. It seems realistic that the M3G renderer will be integrated on top of the JME. More research and try-outs with this combination could have documented a promising alternative to the renderer used. The listed facts and documented features of M3G sound suitable, but a real-life scenario for the purposes of Augmented Television has not been tested yet. Since a port of Xith3D to run on top of the JME has not been tried yet, a closer look at M3G could have brought more insight into the restrictions of a possible alternative, but I opted for a full PC simulation using Xith3D to be able to realize a more advanced prototype.

7.2.2 Usage of Camera Tracking Data

As real-time camera tracking on heavily hardware-restricted set top boxes seemed too far fetched, I opted for transmitting tracking data alongside the video signal. Each video frame has its own associated tracking set, which is currently only read from local files. In a receiver environment this would restrict us heavily, since all tracking data would have to be transmitted within the Xlet's package beforehand. Having all data at hand, frame-by-frame synchronization can be realized easily in the simulation by picking the set with the correct timecode. The transmitted camera position adjusts the viewpoint in our 3D space, aligning the virtual frontón with the tracked real one. Additional objects put in front of the video screen then appear on top of the original footage with correct perspective. Unfortunately, this synchronization is not that easy inside a set top box environment. The available means and tools in MHP don't offer an assured timing link between stream and application. An accuracy of up to five frames could be reached, but not guaranteed. Video frame access through JMF 2.x would help a lot by allowing tracking data stored inside the frame information to be read. Synchronization problems with the data carousel and delayed events would not occur that way, but it is questionable whether hardware producers will integrate this frame access. Currently, the video is often handled completely separately from other MHP application parts.
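A minimal sketch of this synchronization step in the simulation is given below: the tracking set whose timecode matches the current video frame is looked up and applied to the virtual camera. The class and interface names (TrackingSet, VirtualCamera) are illustrative stand-ins, not the actual classes of the application.

    import java.util.Map;

    // Per-frame camera data as delivered alongside the video (illustrative class).
    class TrackingSet {
        long timecode;           // frame number or presentation time
        float[] cameraMatrix;    // 4x4 view transform, row major
    }

    // Illustrative stand-in for the renderer's view object.
    interface VirtualCamera {
        void setViewTransform(float[] rowMajor4x4);
    }

    class CameraSync {
        private final Map<Long, TrackingSet> sets;   // loaded from file beforehand

        CameraSync(Map<Long, TrackingSet> sets) { this.sets = sets; }

        // Called once per rendered frame: pick the set with the matching timecode
        // and hand its matrix to the virtual camera, aligning virtual and real frontón.
        void apply(long currentFrame, VirtualCamera camera) {
            TrackingSet t = sets.get(Long.valueOf(currentFrame));
            if (t != null) {
                camera.setViewTransform(t.cameraMatrix);
            }
            // if no set exists for this frame, the previous camera pose is kept
        }
    }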


Generating camera tracking data is done by the use of third-party products and in-house tools of VICOMTech. Another program developed at VICOMTech displays a 3D frontón, which allows the representation of physically correct ball trajectories and free selection of camera positions. This application has been modified to write the current camera positions to a file. Recording the video of the virtual world and combining this video with the stored tracking information gave 100% clean laboratory conditions to demonstrate the functionality of the developed Augmented Television pelota application. Other staff at VICOMTech programmed the camera tracking for pelota sports, and their calculated matrices can be imported into the program as well. Moreover, the ARToolkit was used to export its tracking data, and the commercial product boujou bullet provided another alternative. A differently set header distinguishes the different formats of tracking data, and the application converts them accordingly. Using these ready-made tracking data sets intentionally takes a burden off the set top box. Quite little power is needed to combine the 3D objects with a video background frame; only the virtual camera viewpoint has to be adjusted. Nevertheless, the question remains whether the generation of camera tracking data in advance is useful or even profitable for the broadcaster. Does the added value justify the extra production costs? The pelota demonstration works well with optical tracking using fixed or slowly changing camera positions. No extra costs occur, but extra time before broadcasting is needed to calculate all tracking data.
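The format handling could look roughly like the following sketch, where the first line of a tracking file names the producing tool and selects the matching converter. The header strings and reader methods are made up for illustration and do not mirror the actual file layout.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Reads a tracking file whose first line names the producing tool and
    // dispatches to the matching converter (header strings are illustrative).
    public class TrackingLoader {

        public static float[][] load(String path) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(path));
            try {
                String header = in.readLine();
                if ("ARTOOLKIT".equals(header)) return readArToolkit(in);
                if ("BOUJOU".equals(header))    return readBoujou(in);
                if ("VICOMTECH".equals(header)) return readInHouse(in);
                throw new IOException("unknown tracking format: " + header);
            } finally {
                in.close();
            }
        }

        // Converters returning one 4x4 matrix per frame; bodies omitted here.
        private static float[][] readArToolkit(BufferedReader in) throws IOException { return new float[0][]; }
        private static float[][] readBoujou(BufferedReader in)    throws IOException { return new float[0][]; }
        private static float[][] readInHouse(BufferedReader in)   throws IOException { return new float[0][]; }
    }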

7.2.3 Multiple Streams and Selection

Once the overlay of video and 3D objects had been achieved, a way to choose from different streams had to be found. A flat 2D interface was the first concept, showing a running preview of the available streams (see figure 7.2). By pushing numbers 0 through 9 on the remote control one can select a video stream.

Figure 7.2: video stream selection

To extend this quite simple approach, the complete frontón was built inside a virtual world, not only fixing the viewer to different camera perspectives, but also granting movement inside the CG environment. Since different court sizes exist, the settings used are read from a file, allowing adjustment for different broadcast shows. All available camera streams are placed at their correct positions, represented by small simplified camera models with attached tripods. Optionally, a small frame floating above them shows the respective video stream (see figure 7.4). Initially the viewer sees the scene from a bird's eye view. He or she may then select other positions inside the virtual space or switch to actual streamed video content. To distinguish these from virtual cameras, the numbers for real streams carry a celluloid film strip icon (see figure 7.3), while virtual positions show a VR sign and don't have tripods. Numbered icons don't resize with distance to the viewpoint, breaking the illusion of being placed inside the 3D space, but offering a GUI more suitable for a TV screen. Since the images are to be shown at TV resolution (which can be rather low), important menu items and fonts have to keep a certain minimal size. Scaling images down from their original dimension would always cause quality loss, which might be too severe on a TV screen. Thus, only inserted 3D objects rescale to fit in perfectly, while additional text is kept above the video on a separate layer to guarantee readability. To allow for better orientation and also to give the application more of an entertainment factor, transitions between different camera positions can be calculated. These were only used at a fast pace, since delays in the viewing experience and channel selection can get annoying rapidly.

Figure 7.3: 3D world, added numbers to select virtual and real viewpoints

7.2.4 Occlusion

To get hold of the occlusion problem of moving bodies in front of the augmented objects, two different approaches have been devised.

Precalculated 1-bit masks split a video into foreground and background, offering a fast and simple solution. I demonstrated the generation of such a mask by resorting to a calculated difference image and other computer vision algorithms, which demand a known, static and well illuminated background – which is the case in our specific scenario. Masks are currently only read from file, but I proposed ways of encoding them in a real DVB-MHP environment.

Figure 7.4: video streams preview above camera models

The other option uses 3D geometry to describe real-world objects. These geometries are rendered invisibly, but occlude the inserted augmented parts. The accuracy of this approach depends heavily on a good geometric representation of the players' shapes. Since shape detection was not implemented, the demonstration was limited to simulated frontón situations, where players are represented by cubes and boxes and the pelota by a simple sphere (figure 7.5). Nevertheless, one can load arbitrary and complex geometry from external ASE files. Additional tracking data for moving objects is submitted, allowing correct alignment of position and shape. Loaded geometry is linked to the scene graph with its internal hierarchies kept. Control over inner transformations is granted: for instance, a human body shape could raise its forearm, resulting in a change of the angle of the occlusion geometry's forearm object and its children, the palm and fingers. Up to now, however, it is only possible to move objects according to tracking data; the alteration of hierarchical components is prepared but not yet implemented with a running example. While a perfectly fitting mask offers pixel-exact occlusion information, a 3D geometry might lead to inaccurate coverage of the represented video objects. The generation of a mask is also much easier than the reconstruction of scenic geometry. Masks (from a possible postproduction stage) might be available already or could be generated automatically by depth cameras. At the same time, the geometry approach comes with the handy advantage of reusable objects for the virtual world overview, whereby the viewer can observe the pelota game from a non-camera viewpoint. A good representation of scene objects in 3D finally leads to a 3D television, where the viewer can choose his or her position freely.


Figure 7.5: Occlusion geometry simulation with simple boxes

Since the reconstruction of 3D geometry remains rather difficult, this scenario is still far away. What is realistic is tracking all players to represent them in a drawn or rendered 2D or 3D space without the video; but for occlusion purposes in real (or even live) scenes, the geometry approach turned out to be too complex compared to the mask version. Insertion of static elements could always be achieved satisfactorily using masks for occlusion. The generation of these masks has been presented successfully, and transmission could be realized easily within a DVB bouquet. A depth camera in conjunction with a normal camera could "capture" the mask directly alongside the video. Nevertheless, today's depth cameras still have a rather low resolution, making their use impossible for sport events where distances from 20 to 100 meters or more have to be filmed. For a studio environment they come in handy already. Occlusions play a crucial role for Augmented Television and can't be neglected; the advantage is definitely worth the trouble for interactive AR-TV. The usage of masks turns out to be the most appropriate solution, while player positions, tracked without shape recognition, could be used for a simple virtual world overview instead.
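To recapitulate the mask pipeline in code, the sketch below composes one output frame by copying original video pixels back over the rendered image wherever the 1-bit mask marks a foreground (player) pixel. It mirrors the offscreen composition of the simulation only in spirit; the real implementation differs in detail.

    import java.awt.image.BufferedImage;

    // Composites one output frame: start from the rendered image (video background
    // plus augmented objects) and copy the original video back wherever the mask
    // marks a foreground (player) pixel, so players occlude the augmentation.
    public class MaskCompositor {

        public static BufferedImage compose(BufferedImage rendered, BufferedImage video, boolean[][] mask) {
            int w = rendered.getWidth(), h = rendered.getHeight();
            BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    int rgb = mask[y][x] ? video.getRGB(x, y) : rendered.getRGB(x, y);
                    out.setRGB(x, y, rgb);
                }
            }
            return out;
        }
    }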

7.2.5 Augmentation Usage and Interactivity

Textured and lit geometry may be imported and placed into the virtual space and thus into the video. The pelota application uses the augmentation mainly for putting up advertisement images on the walls and floor (figure 7.8). Different augmentation bouquets can be selected to suit the consumer target group. In a real receiver environment, a user could select a personal TV profile with his or her favorites and with stored information about his or her age or sex. Thus, the MHP application could select the corresponding advertisements automatically, based on the personal profile. In feature films, augmentation could cover scenes that are inappropriate due to age restrictions, or prohibited or undesirable displays of certain symbols. Aside from the placed ads, I brought to life two examples from the possible fields of use described in 4.4.2. The pelota ball's trail can be visualized, helping the viewer to follow the fast-moving object. In 3D space the viewer can zoom freely towards the trail and, for example, check whether the ball crossed special lines of the court, making it possible to verify the referee's decision on a given or not given foul. Another augmentation draws an animated, lit sphere around the ball to draw attention to it and, again, to amplify the object.

Figure 7.6: Augmentation of pelota: tracked trail, amplified, animated ball representation

Limits of augmentation in a real environment also lie in the available bandwidth for transmission. All textured geometry data or simple image files have to be present before the application picks them for display. This results either in a long start-up time or in the need for good caching mechanisms, offering reuse of data even across different shows or days later. At first, user interactivity is allowed in the form of viewpoint selection. Remote control numbers 0 through 9 allow the selection of real video streams and virtual viewpoints. The four colored buttons switch augmentation bouquets or toggle the visual supports (ball trail, etc.). As I concentrated more on the graphical issues of the pelota application, there was no time left to implement a simulated usage of the MHP return channel: bet placements and user polls are imaginable, but are currently missing. Interactivity will probably remain on a much simpler level than that reached in computer games: the "lean back" attitude will always be present. It is better to include a couple of accepted, useful (or money-making) features and leave other features out in order not to confuse or overwhelm the relaxed viewer. On-screen display options as in figure 7.7 already list too many features in a non-intuitive way. Simple toggles based on the four colored keys might be a better choice for success.
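A hedged sketch of how such color-key toggles could be wired up with the HAVi remote control events is given below. The augmentation controller interface and its method names are placeholders; only the HRcEvent key codes are part of the standard.

    import org.havi.ui.HScene;
    import org.havi.ui.event.HRcEvent;
    import java.awt.event.KeyEvent;
    import java.awt.event.KeyListener;

    // Maps the four colored remote control keys to augmentation toggles.
    public class AugmentationKeys implements KeyListener {

        private final PelotaAugmentation augmentation;   // hypothetical controller

        public AugmentationKeys(HScene scene, PelotaAugmentation augmentation) {
            this.augmentation = augmentation;
            scene.addKeyListener(this);   // the scene must have focus to receive events
        }

        public void keyPressed(KeyEvent e) {
            switch (e.getKeyCode()) {
                case HRcEvent.VK_COLORED_KEY_0: augmentation.toggleBallTrail(); break;
                case HRcEvent.VK_COLORED_KEY_1: augmentation.toggleBallHighlight(); break;
                case HRcEvent.VK_COLORED_KEY_2: augmentation.nextAdvertisementBouquet(); break;
                case HRcEvent.VK_COLORED_KEY_3: augmentation.toggleAnnotations(); break;
            }
        }

        public void keyReleased(KeyEvent e) { }
        public void keyTyped(KeyEvent e) { }
    }

    // Hypothetical interface standing in for the application's augmentation logic.
    interface PelotaAugmentation {
        void toggleBallTrail();
        void toggleBallHighlight();
        void nextAdvertisementBouquet();
        void toggleAnnotations();
    }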

7.2.6 Portability

As mentioned throughout the process of development, the current technology for digital television does not enable us to implement augmented TV with receivers relying on DVB-MHP. However, some big changes could empower developers to do so. Some were already mentioned before:

• Optional MHP features have to be declared mandatory for future revisions, such as the display of AWT-based JMF players (for correct alignment).

• Direct video frame access to MPEG data would allow a guaranteed synchronization.


• The personalJava base should be substituted by its official successor, the JME (Java Micro Edition).

• A decision on a 3D renderer has to be made. If STBs' Java versions eventually move to the JME, M3G would probably be the best option, since it allows 3D software rendering on the JME. If Java3D eventually supports OpenGL state and buffer access, a port from the used Xith3D could also be done quickly due to the very similar structure.

• MHP middleware implementations have to opt for the latest MHP version (including the above changes), as currently most receivers are too far behind the latest MHP revision number.

If video display can eventually be overlaid by 3D renderings, hardware demands are not too high anymore. I opted to transmit precalculated occlusion information and tracking data to the set top boxes; their task is then only to select the right perspective and apply the occlusion data. No expensive calculations have to be performed within the box: I intentionally restrict the receiver's work to this simple composition of elements so that a realization of the current simulation in real STBs can be expected soon.

Figure 7.7: Onscreen display overlay

7.2.7 User Tests and Usability

To test the sample application I did a user survey among co-workers and other persons who knew nothing about the project. Since the pelota presentation could only be run as a PC simulation, a 100% accurate test scenario could not be established. I concentrated on explaining the idea briefly and showing different example videos (available on the attached CD-ROM). Anonymous survey forms had to be filled in, and a discussion afterwards led to the following results. I interviewed 9 adults and 7 children (aged 7 to 14).


The augmentation of a pelota match returned only positive feedback. The option to display ball trajectories was considered very useful. While adults were more skeptical about exchanging background banner advertisements, children asked further questions and wanted to see the whole court customized – letting the real players stand in an abstract virtual room of 2D posters or among added 3D objects. Clearly, children wanted to add more entertaining features and have their "own version" of the show. The augmentation was considered natural and no irritation arose during observation. Adults didn't seek too much interactivity and pointed out visual supporters such as distance overlays, trajectory drawings or player annotations as being useful and worth having. Integration of more interactivity within the 3D space (such as simple games) could not be presented in the survey, but the majority of children were interested in this approach, although a clever design idea was still missing. Selecting video streams from the simple 2D preview screen remained rather popular. The virtual world overview was mainly used from a far-away position to select a stream. Other virtual viewpoints were only selected by children to see the transitional ("matrix style") flights, skipping to the next position quickly. Too many virtual positions don't add much value to the presentation, since a complete observation of the game is currently only possible through a real camera stream – no full 3D representation of the players is available yet. With this eye-pleasing animated 3D view still missing, camera selection could have been realized sufficiently with a simple static court overview from a bird's eye view. Most adults preferred a simple and fast interface for selecting the real streams, without fancy animations: with a single push of a button they want to switch to an overview (2D or 3D) and swiftly back to the next video. Augmented Television found some early friends after the survey, but the discussion showed some issues that need to be addressed. Questions such as "Can I switch off augmentation if I'm fed up with it?" or "Why should I bother customizing an only temporarily available TV show?" show justified criticism. Turning off the augmentation will probably not be in the interest of all broadcasters when it comes to advertisements; hacked set top boxes without ads are imaginable. If the TV receives one final image, as is standard today, there is no way of avoiding ads or product placements. Nevertheless, other purposes, such as localizations and visual supporters (as in sports games), remain very useful and may be switched on and off without problems – and without any risk of being hacked or disabled. Customization, on the other hand, could be stored in a user profile and reused for the next similar show. Again, users seem to prefer an easy-to-use interface with a limited set of nice features – not one with all possible options, confusing and disturbing their relaxed free time.


Figure 7.8: Augmented Frontón


8 Conclusion and Outlook

With all research documented and the implementation described, it is time to draw an overall conclusion and question the future of Augmented Television. Will it possibly become reality, and how could it eventually hit the mass market?

8.1 Future Vision for Augmented Television

8.1.1 Pelota Application

The current application defines a ready-to-use framework for Augmented Television; however, the quality depends heavily on the submitted tracking data and occlusion information, and there are still some issues to be addressed. The camera positions are currently only transmitted beforehand and as a complete set; a version using streamed data is not running as of yet. The same goes for the objects used for augmentation: no server transmits and exchanges the available sets during the presentation. The return channel was also not yet used for the planned feedback (surveys, bet placement). Other possible future extensions are listed below, divided into parts concerning graphics and application concepts.

Graphical Issues

Concerning the transmission of occlusion masks, it would be best to stream them as separate video streams alongside the presented video. If MPEG-4 could be used by broadcasters and receiver hardware, the mask could be defined inside the video stream, using the object-based separation of the codec. Augmentation currently only allows the display of images and 3D geometry with the same or no illumination for all objects. The objects fit into the scene with correct perspective, but neither the scene's real illumination nor the camera distance is taken into consideration during rendering. The latter would call for a blur effect depending on the camera's focus. While this could be applied rather quickly, correct lighting in real time is a far more futuristic scenario. The usage of light probes and HDRI images allows a convincing synthesis of real footage and rendered objects, but is still too slow for this purpose. To get correct illumination for the virtual posters, one could overlay them onto real white surfaces, mixing the real light with the presented addition – which demands a lot of preparation while the video is being shot. A toolkit could be programmed to allow movie editors or advertisement companies to extract certain parts of a scene, remove their distortions and edit their appearance. A filmed wall could be extracted, edited (e.g. changing written text) and afterwards stored in an augmentation profile, while the set top box handles the overlay.


A blend of virtual reality with the augmented TV is imaginable as well. The viewer could use a tracked HMD to watch the program. When he or she moves freely in real space, this could affect the virtual viewpoint inside the 3D frontón. Spatially approaching a virtual camera model could automatically switch to that camera's video stream.

Application Issues

Besides improvements in the interactivity used, small entertaining games could be integrated, making use of the given 3D geometry and the tracked players. If all player and ball positions are predetermined, it would also be possible to watch the entire game in the virtual view only, from a freely chosen perspective. Hand-held devices could leave out the video stream entirely and resort to simplified 3D rendering in the first place. Other sport events might be more suitable for this approach (like soccer, where the strategic positions of 22 players on the field would give a more interesting overview than in pelota), but the concept applies here as well. Usage of the data and return channel can be widened extensively. Anonymous feedback could provide statistics for the broadcaster, such as which camera perspective is preferred the most, resulting in a selectable favorite viewpoint. Single users could act as their own editors, selecting cameras and, moreover, commenting on them. By submitting their choice to a new kind of live log, others could join the same perspective, guided by the freshly born home moderator. The 3D overview could undergo a revision, since the user survey questioned the necessity of this approach. Simplicity is still winning in a TV environment – even in an augmented one.

8.1.2 Augmented Television in General, an Outlook to the Future

Augmented Television offers a customizable presentation of live programs or movies. Localizations can especially benefit from this, as some directors even today shoot scenes more than once in order to put in translated sections. Such sections can be very important for the scene inside the video (for instance, Stanley Kubrick exchanged the typewriter pages in "The Shining" for the localized Spanish, German and Italian versions), and without having to reshoot, Augmented Television can carry this out automatically in their place. As it is currently done on DVD releases with different embedded audio tracks, this becomes possible for image manipulation as well. Parts of the video stream can also hold additional information; sport players' names, clothing types, cars, buildings, etc. may carry annotations and explanations or directly lead to an online shop, which will probably be the most interesting issue for the entertainment and commercial industry – product placement is already used heavily within movies. What is questionable is whether Augmented Television will be possible with MHP or similar receiver boxes. This depends on the calculation power, the market situation and acceptance. Textual annotations are not too far fetched, but perspectively correct inserted video parts might take a lot more time to become deployed. Competitors to these set top boxes are PC-based versions (using PC as a general term for home computing systems, including Macs and other systems, not a particular platform or operating system), which integrate the TV into the PC and not vice versa by empowering set top boxes with more and more capacities that already exist in regular computer environments. A personal computer with a TV decoder card already offers all the technology needed for Augmented Television! The majority of customers still prefer a simple box to a whole PC system, which needs much longer startup times, is far more complex and thus might simply offer too much functionality for their living room. But if the idea of a complete home network gains popularity – connecting all available electrical devices inside a household – the PC could serve as an entertainment server, forwarding the image to be presented on the TV in the living room. The mentioned HAVi could provide this: it allows interoperability among devices while relying on FireWire network transmission, which is sufficient for the required TV signal. Emerging Internet stores for music and now also movie downloads (Apple's iTunes recently introduced video downloads as well, currently focusing on short movies and music clips; feature films are likely to follow) show us the direction of future developments. The gap between TV receiver and PC technology is still quite big – will the user ever want to have a whole interactive system inside his or her living room? Or will he or she just prefer to go to the office to do online banking and emailing – leaving the 'telly' alone, simply as it is, but with easily customized augmented video presentations?

8.2 Summary

Throughout this project, current developments in the field of digital television are portrayed, shedding light on their historical and technical attributes. Focusing on European standards, I take a closer look at the DVB-MHP specifications, which allow interactive television. I follow this by implementing sample applications which run in PC simulation environments and on regular, commercial MHP receivers. A DVB server is used to upload and change so-called Xlet applications, written in the Java programming language. Focusing mainly on graphical issues, these samples document the current state-of-the-art technology for the MHP service. A concept for a complete Augmented Television application based on Java technology is proposed, and all required parts are discussed. During the implementation of these ideas, different available APIs are introduced and discussed, eventually leading to the selection of the most appropriate ones for the required purpose. The final application covers all important aspects, such as access to and control of video data, the combination of video frames and three-dimensional rendered data, and their synchronization in time and position. Two different ways of handling occlusion are implemented, and user interactivity is established on a basic level. The overall result of this project is a PC-specific application not yet available on set top boxes. It also highlights the most important factors and gives recommendations as to what changes in the MHP have to be made to reach the set goals in the future of this technology. An outlook into the future of this possibly emerging technology is given, and improvements for the specific implementation are listed. I describe advantages and future scenarios for the newly introduced concept of Augmented Television.



9 Appendix

9.1 System Configuration

Development Platform
• Pentium M, 1.3 GHz
• 512 MB RAM
• ATI Radeon 7500
• Samsung Mini-DV VP-D15, Logitec QuickCam Express Pro for video recording

DTV Hardware
• Philips DVB-T Receiver DTR4600

Used Software
• Microsoft Windows XP Professional, Service Pack 1
• Microsoft Visual Studio.net 7.0
• Eclipse 2, Java Standard Edition: different versions
• Alias Wavefront Maya 6.0, discreet's 3DSMax 5
• ARToolkit 2.70.1
• 2d3's boujou bullet
• Adobe Photoshop 6.01

9.2 CD-ROM

The CD-ROM includes all documented source code, javadocs and binary files, such as the used geometry, texture files and the images put up as augmented posters. A slideshow presentation, several videos and this text as a PDF are also on hand.


List of Figures

2.1  AR in the construction process of automobiles using visual patterns to determine perspective. Pictures kindly supplied by Metaio - Augmented Solutions[1].
2.2  an augmented reality game[3] using optical markers, presenting a virtual racing game in an arbitrarily chosen surrounding
3.1  MHP software stack
3.2  MHP applications in public German Television: interactive entertainment shows using return-channel technology
3.3  Set Top Boxes: simple zapper
3.4  Set Top Boxes: allrounder
4.1  left: concept of how ready-made AR is already used in soccer broadcasts; right: AR in broadcastings, picture by Orad[8]
4.2  pelota sport
4.3  Pelota Viewer program for putting up ads and simulating trajectories
4.4  video Stream selection concept
4.5  participants in augmentation process
4.6  demonstration of augmentation of a video frame by inserted 2D images, left: broken illusion due to overlapping between real and virtual objects, right: convincing augmentation with occlusion
4.7  video frame and predefined 1-bit mask
4.8  3D world, numbers to select virtual and real viewpoints; concepts
4.9  video stream preview above camera models in virtual view; concepts
4.10 augmentations of pelota game, concept arts
5.1  Cardinal Studio 3.1, Icareus iTV Suite
5.2  Xlet States
5.3  Code: XletContext interface functions
5.4  TSDeveloper used to upload Xlet to DVB server
5.5  manual loading of Xlets, interface on a Philips set top box
5.6  graphical layers in MHP
5.7  images and shapes drawn in xleTView emulator (left) and on a Philips set top box (right)
5.8  Possible configuration of HAVi Devices
5.9  transparencies handled in xleTView emulator (left) and on a Philips set top box (right)
5.10 overlay DVB transparencies class can't be handled in xleTView emulator (left), but works on a Philips set top box (right)
5.11 Component mattes, stacked
5.12 Minimum set of input events
5.13 demo program collection in xleTView emulator (left); on a Philips set top box (right)
6.1  pelota application class design (excerpt)
6.2  Boujou bullet: optical tracking, ready-made data before broadcasting. Software access kindly granted by tvt postproduction, Berlin[2].
6.3  Java3D scene graph design
6.4  set access to transform scenegraph objects
6.5  Java3D scene graph view port
6.6  Offscreen rendering; automatic composition of 2D frames: video frame, mask and 3D rendered image lead to the results below (note the occlusion on the right)
6.7  overlaid 2D mask fitting perfectly in without rescaling as long as the mask's position is not changed
6.8  left: unscaled mask; right: rescaled mask causing anti-aliasing
6.9  left: geometry with enabled glColorMask for all objects; right: disabled glColorMask for one of the geometries to be used for occlusion
6.10 Pelota application scene graph overview
6.11 1) clean plate; 2) scene with person; 3) difference image with erosions; 4) thresholded binary image; 5) eroded and dilated version used for mask
6.12 left: source video frame; right: best case mask
6.13 shooting ray through camera and player mask to determine 3D position
6.14 player models rectangle, cut out with transparent texture mask (green)
6.15 cut out player models using a list of vertices; left: viewpoint from camera; right: player representation
6.16 overview of placed player shapes in virtual frontón, according to their 3D position
7.1  MHP Xlet demo package and documentation
7.2  video Stream selection
7.3  3D world, added numbers to select virtual and real viewpoints
7.4  video streams preview above camera models
7.5  Occlusion geometry simulation with simple boxes
7.6  Augmentation of pelota: tracked trail, amplified, animated ball representation
7.7  Onscreen display overlay
7.8  Augmented Frontón

Bibliography

[1] Metaio - Augmented Solutions. http://www.metaio.com, 06.08.2005, 17:30
[2] tvt postproduction, Berlin. http://www.tvtpost.de, 07.08.2005, 16:00
[3] T. Kammann. "Entwicklung eines Augmented-Reality-Spiels", 2003. http://www.augmented.org > Projects > AR-Race, 01.05.2005, 13:00
[4] digitalfernsehen.de. Interaktivität: Japaner setzen auf Open TV. http://www.digitalfernsehen.de/news/news_46594.html, 06.09.2005, 18:00
[5] Digital Video Broadcasting (DVB); Multimedia Home Platform (MHP) Specification 1.0.3. http://www.dvb.org and http://www.mhp.org, 04.07.2005, 17:00
[6] MHP Specification 1.3.1, chapter 13.4.1. http://www.mhp.org, 04.07.2005, 17:00
[7] SAS Astra BLUCOM. http://www.digitalfernsehen.de/news/news_46899.html, 20.09.2005, 18:00
[8] ORAD Inc. http://www.orad.tv, 31.10.2005, 18:00
[9] Infosat - Das Digitalmagazin. http://www.infosat.de, 19.08.2005, 18:00; "MHP-Abschaltungen nach der IFA?", http://www.infosat.de/Meldungen/?srID=6&msgID=16725, 19.08.2005, 18:00
[10] C. Poynton. Digital Video and HDTV - Algorithms and Interfaces. Morgan Kaufmann Publishers
[11] H. Benoit. Digital Television - MPEG-1, MPEG-2 and Principles of the DVB System. Focal Press
[12] Cardinal Systems. http://www.cardinal.fi, 27.09.2005, 10:00
[13] Icareus Entertainment Systems. http://www.icareus.com, 27.09.2005, 10:00
[14] S. Morris, A. Smith-Chaigneau. "Interactive TV Standards, A Guide to MHP, OCAP and JavaTV", 2005
[15] Philips MHP SDK Frequently Asked Questions. SDK package MHP-SDK-Download-323-and-421-795, http://www.philips.com, 10.05.2005, 18:00
[16] "Premiere plant Kanal für Pferdewetten", SPIEGEL online, 19.07.2005, 8:30
[17] Wikipedia, English version, "Television". http://en.wikipedia.org/wiki/Television, 12.08.2005, 10:00
[18] xleTView, by Martin Sveden. http://xletview.sourceforge.net, 10.07.2005, 18:00
[19] Java Personal Edition, "End Of Life Preannouncement". http://java.sun.com/products/personaljava, 28.10.2005, 18:00
[20] Java Media Framework. http://java.sun.com/products/java-media/jmf/2.1.1/download.html, 01.09.2005, 17:00
[21] xleTView, list of missing implementations for used APIs. http://xletview.sourceforge.net/status/status-current.html, 01.07.2005, 16:00
[22] ARToolkit. http://www.hitl.washington.edu/research/shared_space/download/, 10.07.2005, 15:00
[23] VRML97 Functional Specification and VRML97 External Authoring Interface (EAI), International Standard ISO/IEC 14772-1:1997 and ISO/IEC 14772-2:2002. http://www.web3d.org/x3d/specifications/vrml/, 15.06.2005, 14:00
[24] The JavaTV API White Paper, Sun Microsystems, Version 1.0, Chapter 5.1
[25] The JavaTV API White Paper, Sun Microsystems, Version 1.0, Chapter 5.2
[26] Java3D Interest Mailing List and Forums, Mon, 4 Jun 2001 11:28:53 -0700, by Doug Twilleager, Subject "Re: Stencil buffers"
[27] Xith3D Website, Intro Section - What is Xith3D? http://xith.org/, 09.08.2005, 14:00
[28] D. Hoiem, A.A. Efros, M. Hebert. "Automatic Photo Pop-up", ACM SIGGRAPH 2005. http://www-2.cs.cmu.edu/~dhoiem/projects/popup/, 02.07.2005, 17:00
[29] Java.net - List of proposed API changes; section "2.0 Possible Major Features" states "Access to the native context (JOGL integration)", which would offer direct access to OpenGL states. https://j3d-core.dev.java.net/j3d1_4/proposed-changes.html, 01.08.2005, 15:00
[30] "The Java Media Framework", "Restrictions on JMF Players in MHP - tuning". http://www.interactivetvweb.org, 12.09.2005, 17:00
[31] JSR-000184 Mobile 3D Graphics API for J2ME. Mobile 3D Graphics API, Technical Specification, version 1.1, June 22, 2005. Java Community Process (JCP), JSR-184 Expert Group. http://jcp.org/aboutJava/communityprocess/mrel/jsr184/index.html, 26.09.2005, 13:00
[32] A. Lopéz, D. González, J. Fabregat, A. Puig, J. Mas, M. Noé, E. Villalón, F. Enrich, V. Domingo, G. Fernàndez. 2003. Synchronized MPEG-7 Metadata Broadcasting over DVB Networks in an MHP Application Framework.
[33] Alticast. http://www.alticast.com, 31.10.2005, 17:00
[34] J.M. Van Verth, L.M. Bishop. 2004. Essential Mathematics for Games and Interactive Applications - A Programmer's Guide
[35] Evertz, Glossary of Technical Film and Broadcasting Terms. http://www.evertz.com/glossary.php, 10.10.2005, 17:00
[36] M. Woo, J. Neider, T. Davis. OpenGL Programming Guide, Third Edition. OpenGL Architecture Review Board
[37] M. Billinghurst, H. Kato, S. Weghorst, T. A. Furness (1999). A Mixed Reality 3D Conferencing Application (Technical Report R-99-1). Seattle: Human Interface Technology Laboratory, University of Washington.
[38] K. Hawkins, D. Astle. OpenGL Game Programming. Prima Tech
[39] Extensible 3D (X3D). http://www.web3d.org/x3d/specifications/, 16.10.2005, 16:00