Applying Stereoscopic 3D and Head Tracking for Virtual Reality in Games

Marries van de Hoef, Bas Zalmstra
May 5, 2011
Utrecht University

1 Introduction

This paper is meant as an introduction for developers who want to apply stereoscopic 3D and/or head tracking for virtual reality. Stereoscopy can be seen as a new dimension in display technology. The last time a new dimension of display information was added was with the transition to color television. Although stereoscopy has been introduced many times, it has never achieved a breakthrough in popularity. However, that breakthrough may happen very soon, because many manufacturers are now introducing stereoscopic displays.

There are multiple steps in delivering stereoscopic content, as visualized in Figure 1.

1. Two images have to be generated from different points of view. These images have to be generated with depth perception in mind, because depth expectations may not be violated. The images should not contain any depth violations, which occur when an object is drawn in front of an object that is actually closer (this can happen with, for example, HUD elements) [1]. The images should also not contain window violations, which occur when an object that pops out of the display is cut off by the edge of the display [2]. Finally, the images have to be comfortable for the human eye, to prevent headaches and nausea.

Figure 1: Steps in delivering stereoscopic 3D.

2. Both images have to be transported to the display. This has to be done in a specific format, enabling the display to separate the two images again.

3. The display has to show both images simultaneously, and it should use a technique to make sure each image reaches only the left or the right eye.

This paper focuses on the technical part of 3D. Depth and window violations will not be discussed further. In Section 2, common display methods for stereoscopic 3D are discussed. Section 3 describes the formats for transporting the images to the display. Techniques to generate the images for both eyes are discussed in Section 4. Another way of enhancing the 3D experience is to create the illusion of 3D by using head tracking. Without head tracking, the image on the display is the same regardless of the viewing position. However, when the position of the head is known, the point of view can be adjusted to that position. This way, the display does not show a flat image of the virtual world, but seems to be a window into the virtual world. This technique is discussed in Section 5.

2 Display Methods

To experience stereoscopic 3D, both eyes of the user have to perceive a different image simultaneously. The left eye should receive the image for the left eye, and only that image. Analogously, the right eye should only receive its corresponding image. To achieve this, a display has to present two different images at the same time, and there has to be a filter that prevents each eye from receiving the image meant for the other eye. If the filtering is not done correctly, crosstalk will occur, which disturbs the 3D experience. Crosstalk is the effect that an eye also receives a part of the image meant for the other eye. An inherent characteristic of a filter is that it filters depending on a property. In the case of a filter for stereoscopic 3D, light has to be filtered. The properties of light include: wavelength (color), polarization, time and location. All of these properties are used by stereoscopic display methods to filter on. The anaglyph method uses wavelength, passive displays use the polarization of light, and active displays use the time property. Stereoscopic head-mounted displays can be regarded as using location to "filter" the light, but these will not be discussed here.

2.1 Anaglyph

Anaglyph uses color to filter light for each eye. This inherently means that color perception is always distorted, which is the major disadvantage of this method. Because every display is capable of outputting color, this method works on every display. The user only needs a pair of color filters for his eyes. This accessibility is the major advantage of this method. The images for each eye have to be filtered with a different wavelength. The wavelength intervals for each eye should not overlap, because that would cause crosstalk.

Figure 2: An impression of the wavelength sensitivity of the color receptors in the human eye.

But the major constraint is that a display cannot output every wavelength. A display outputs just three specific wavelengths (for red, green and blue) and only varies their intensity. To the human eye this appears as real color, because the three wavelengths correspond with the three types of color cones in the human eye. The display effectively just activates the cones in the human eye, instead of correctly reproducing the light. Figure 2 shows the wavelength sensitivity of the cones in the human eye. The wavelengths the display outputs are close to the peaks of the cones.

In a straightforward approach, the three colors (red, green and blue) have to be divided amongst the eyes. The most common partitioning is red for the left eye and cyan (green and blue) for the right eye, but other divisions are possible. There are also other approaches, such as ColorCode 3-D, where one eye perceives nearly all color information and the other eye is only used for depth perception.

Figure 3: Red-cyan anaglyph glasses.

Our experience with the standard red-cyan glasses (shown in Figure 3) is that the depth perception is good while the color perception is mediocre. Our main complaint is that the image is very uneasy on the eyes because both eyes receive totally different colors. We think the red-cyan anaglyph method is only suitable for short 3D tests, and not for entertainment purposes.
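As a small illustration of the red-cyan partitioning, the sketch below composites an anaglyph frame by taking the red channel from the left image and the green and blue channels from the right image. This is a minimal sketch of our own, not from a particular library; the row-major RGBA8 pixel layout is an assumption.

```cpp
#include <cstdint>
#include <vector>

// Minimal red-cyan anaglyph compositor. Assumes both input images have the
// same size and are stored as row-major RGBA8 pixels.
std::vector<uint8_t> compositeAnaglyph(const std::vector<uint8_t>& left,
                                       const std::vector<uint8_t>& right,
                                       int width, int height)
{
    std::vector<uint8_t> out(static_cast<size_t>(width) * height * 4);
    for (size_t p = 0; p < out.size(); p += 4) {
        out[p + 0] = left[p + 0];   // red channel from the left image
        out[p + 1] = right[p + 1];  // green channel from the right image
        out[p + 2] = right[p + 2];  // blue channel from the right image
        out[p + 3] = 255;           // opaque alpha
    }
    return out;
}
```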

2.2 Active

Active 3D displays use time to separate the image for the left eye from the image for the right eye. The displayed image alternates between the image for the left eye and the image for the right eye. To make sure the left eye does not receive the image for the right eye and vice versa, the user has to wear glasses which blind either the left eye or the right eye. The side which the glasses blind has to alternate synchronously with the display. This effectively results in time-sharing of the display between the left and right eye. The alternating frequency has to be very high to prevent the user from noticing the flickering of the glasses.

Figure 4: An active 3D display as seen through shutter glasses. The display shows a calibration image, which is white for the left eye and black for the right eye. The right photo shows the display through the right glass when viewing the display at an angle.

Commonly, a frequency of 120 Hz is used for this. The glasses have to change translucency at the same frequency. Liquid crystal shutter glasses are used to achieve such fast changes in translucency. The result is a perfect stereoscopic 3D experience: both eyes get unique images in full resolution and color. However, this comes at the cost of some significant disadvantages:

Less light: The glasses are shut half of the time, causing the eyes to receive much less light. The eyes receive at least 50% less light, but probably more, because crosstalk has to be countered. In the left photo in Figure 4, the left glass should be uniformly white, but due to the exposure time of the camera, the flickering of the glass is captured. Because the white area in the left glass is smaller than the dark area, it can be concluded that the eyes receive less than 50% of the light.

Flickering: Some commonly used light sources flicker at high frequencies. This is not noticeable under normal conditions, but when wearing shutter glasses it can become a very apparent problem. The effect can be reduced by changing the frequency of the shutter glasses, but in our experience the reduced effect remained very unpleasant. This could cause headaches or nausea.

Crosstalk: We have experienced some crosstalk, but not an inconvenient amount. It happens in particular when the head (with glasses) is rotated horizontally. We have experienced this with multiple active 3D displays. The effect is shown in Figure 4. In the left image, the glasses are aligned with the display and the right glass is black. When the glasses are rotated, as in the right image, the right glass is no longer totally black. Apparently, the right eye receives some of the white color which was meant for the left eye (crosstalk). We cannot explain why crosstalk occurs in this situation.

High-frequency display: The display has to be able to reach a very high refresh rate. This used to be common with CRT displays, but that changed with the introduction of LCD and plasma displays. Special LCD or plasma displays have to be bought to reach the required frequencies.

High-cost glasses: The liquid crystal shutter glasses contain electronics. For every user, another pair of glasses has to be bought, which can quickly become costly.

Our experience is that the flickering is very annoying and inhibits prolonged usage of the display. The 3D experience itself, however, is unsurpassed.

2.3 Passive

Passive 3D displays use the polarization of light to separate the images. The polarization of light is essentially the orientation of the oscillation of the light wave. This is visualized in Figure 5.

Figure 5: From left to right: circular, elliptical and linear polarization of light.

2.3.1 Polarization types

Two types of polarization can be used for stereoscopic displays: linear and circular polarization. Both are visualized in Figure 5.


Figure 6: A polarized display as seen through polarized glasses. The top half is covered by a polarized glass.

Linear polarization: To use linear polarization, the polarization directions for the two filters have to be perpendicular. Because of this, none of the light waves meant for the other eye can pass through the filter. However, when the user rotates his head, the glasses and their polarization direction rotate with it. The result is that the polarization directions of the filters in the display and the glasses no longer match, so the filters in the glasses no longer block all the light meant for the other eye. This results in crosstalk.

Circular polarization: With circular polarization, light is filtered based on its clockwise or counter-clockwise rotation. This makes circularly polarized filters rotationally invariant, solving the crosstalk problem of linear polarization.

Linear polarization is rarely used today; most consumer displays use circular polarization [3].

2.3.2 Polarization filters

For the viewer, using the polarization of light offers substantial advantages. A difference in polarization cannot be perceived by the human eye, so the technique does not distort the image. This contrasts with, for example, the anaglyph method, which inherently has a form of color distortion. To filter the polarized light, the user has to wear a set of glasses with simple polarization filters. These can be similar to anaglyph glasses, but with a different type of filter. The difficulty of this method does not lie in the glasses, but in the display. Two different images have to be displayed simultaneously with a different polarization, and this causes some disadvantages:


Figure 7: Translucency of polarized glasses (top) and active shutter glasses (bottom).

Reflections: The polarization has the result that the screen cannot have a diffusing anti-glare coating; it has to be glossy to reduce crosstalk [4]. A glossy screen has the problem of reflecting the background of the user. Those reflections can be very distracting and can inhibit the depth perception. In our experience and computer setup, this was not a problem.

Resolution: The display has to show two images simultaneously. To show all this information, twice the number of pixels would have to be present on the display. Most consumer displays do not have an increased number of pixels. Instead, the polarization of the pixels alternates in the vertical direction, resulting in an alternating polarization between the even and odd horizontal lines. Because the number of pixels remains the same, the vertical resolution is effectively cut in half. In our experience this noticeably decreased the image quality. Also, a subtle pattern of horizontal black lines can be observed, as shown in Figure 6. This problem could be solved by doubling the vertical pixel density, but that would make the display more expensive.

Viewing angle: The vertical viewing angle can be very small with polarized displays. This is caused by several factors, such as the distance between the LCD and the polarization filter [4]. The vertical viewing angle of the Zalman Trimon ZM-M215W is only 10-12 degrees [5]. In our experience, such a small vertical viewing angle is very restrictive. When the vertical position of the head in front of the display is not exactly right, the depth perception is disturbed.

When filtering light, the result will always be darker. With active shutter glasses this effect is significant, but with polarized glasses it is much smaller. The difference is clearly visible in Figure 7. In our experience, the depth perception is significantly decreased by the resolution and viewing angle disadvantages. These problems might be solved

in the near future, as research in this area has already achieved results [4]. This makes passive displays a very promising solution for stereoscopic 3D.

3 Formats

To display stereoscopic 3D, both the image for the left eye and the image for the right eye have to be transported to the display. Transporting two images to a display simultaneously is poorly standardized, so there is a large variety of formats. The formats can be divided into two categories: formats which transport both images in full resolution, and formats which transport both images at half their resolution. Most formats pack both images into one image.

3.1 Full-resolution formats

When transporting both images in full resolution, the amount of data is doubled. At a 1080p resolution, the bandwidth of DVI is not sufficient; in this case two DVI connections must be used. HDMI 1.4 introduces 3D Over HDMI, which does support a 1080p resolution, although only at 24 frames per second. Some HDMI 1.3 devices do support stereoscopic 3D at 1080p, for example the PS3. 3D Over HDMI also standardizes formats, including both full-resolution and half-resolution formats [6].
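As a rough back-of-the-envelope check of why full-resolution 1080p stereo exceeds a single-link DVI connection (our own estimate, ignoring blanking intervals):

$$
\underbrace{1920 \times 1080 \times 24\,\text{bit} \times 60\,\text{Hz}}_{\text{one 1080p60 stream}} \approx 2.99\ \text{Gbit/s},
\qquad
2 \times 2.99\ \text{Gbit/s} \approx 5.97\ \text{Gbit/s} > 3.96\ \text{Gbit/s}\ \text{(single-link DVI)}.
$$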

3.1.1 Over/under

The name over/under is used for both a full-resolution and a half-resolution variant of this format. The full-resolution variant more than doubles the height of the image: the left image is placed above the right image, with an empty space between them. The empty space is 45 pixels high when showing both images at 1080p, and 30 pixels at 720p [6]. The format is shown in Figure 8. This format is, for example, used by active-shutter televisions.

Figure 8: The full-resolution over/under format.

3.1.2 Alternating

With the alternating format, the images for the left and right eye alternate: at any moment, either the image for the left eye or the image for the right eye is transported. This is done through quad-buffering, which is explained in Section 3.3. This is the common format for active 3D monitors.


3.2 Half-resolution formats

3.2.1 Vertical interlacing

In the vertical direction, a line for the left eye is alternated with a line for the right eye. The total height is not doubled, resulting in a major loss of resolution in the vertical direction. The format is shown in Figure 9. This format is commonly used for passive 3D monitors. Horizontal interlacing is also possible, but it is rarely used. A minimal compositing sketch is shown below.

Figure 9: The vertical interlacing format.
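The following CPU-side sketch packs a left and right image into one vertically interlaced frame. Real implementations typically do this in a shader; the row parity (left eye on even rows) is an assumption that depends on the display.

```cpp
#include <cstdint>
#include <vector>

// Pack a left and right image (same size, one uint32_t per RGBA8 pixel)
// into one vertically interlaced frame. Each eye keeps every other line,
// which is where the half vertical resolution comes from.
std::vector<uint32_t> interlaceVertical(const std::vector<uint32_t>& left,
                                        const std::vector<uint32_t>& right,
                                        int width, int height)
{
    std::vector<uint32_t> out(static_cast<size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        const std::vector<uint32_t>& src = (y % 2 == 0) ? left : right;
        for (int x = 0; x < width; ++x)
            out[y * width + x] = src[y * width + x];  // copy this eye's line
    }
    return out;
}
```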

3.2.2 Checkerboard interlacing

Both in the vertical and the horizontal direction, the pixels for the left and right eye are alternated. This results in a grid pattern, as shown in Figure 10. This format is used by DLP televisions [7].

Figure 10: The checkerboard interlacing format.

3.2.3 Side-by-side

Both images are scaled to half their width. The images are placed side by side, with the image for the left eye on the left side. The format is shown in Figure 11.

Figure 11: The side-by-side format.

3.2.4 Over/under (half resolution)

Both images are scaled to half their height. The left image is placed on top of the right image. The format is shown in Figure 12.

Figure 12: The half-resolution over/under format.

3.3 Quad-buffering

Quad-buffering is an alternative technique which moves the formatting issue away from the programmer. With quad-buffering, there is a separate front and back buffer for both the left and the right eye. In total there are four buffers, hence the name. The transportation of the images is handled by the graphics drivers and hardware. However, quad-buffering is not common on graphics hardware: only NVidia Quadro cards and ATI FireGL/FirePro cards support it [8, 9]. (Normal NVidia graphics cards could also support quad-buffering, but it is disabled in the driver [10].) Quad-buffering is well supported in the OpenGL graphics API [11], but not in the Direct3D graphics API [12]. A minimal OpenGL sketch is shown below.
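The GL_BACK_LEFT and GL_BACK_RIGHT draw-buffer targets are the documented OpenGL mechanism for quad-buffered stereo [11]. The Camera type and renderScene function below are hypothetical placeholders, and the context is assumed to have been created with a stereo-capable pixel format.

```cpp
#include <GL/gl.h>

struct Camera { /* view and off-axis projection parameters */ };
void renderScene(const Camera&) { /* issue draw calls here */ }

// Render one stereo frame into a quad-buffered framebuffer. Assumes the
// context was created with a stereo pixel format (e.g. GLUT_STEREO).
void renderStereoFrame(const Camera& leftEye, const Camera& rightEye)
{
    glDrawBuffer(GL_BACK_LEFT);   // all draws now go to the left back buffer
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    renderScene(leftEye);

    glDrawBuffer(GL_BACK_RIGHT);  // switch to the right back buffer
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    renderScene(rightEye);

    // Swapping buffers afterwards presents both images; the driver and
    // hardware handle the transport format to the display.
}
```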

Figure 13: The left (blue) and right (red) eye camera and the center 2D camera (black), shown with their projection plane from a top view. Different stereoscopic projection methods are shown, from left to right: cameras shifted apart, cameras toed in, and cameras with an off-axis projection.

4 Rendering Stereoscopic Images

For stereoscopic 3D, two images have to be rendered instead of one. The two images are rendered from two cameras, which represent the eyes in the virtual world. The two cameras should project onto the same area, which represents the display in the virtual world. When generating the two cameras by copying the 2D camera and shifting them apart, the two cameras do not project onto the same area. This can be resolved by rotating (toeing in) the cameras, but with that approach the projection planes are not aligned. Instead, an off-axis projection matrix is used to let both projection areas coincide. The off-axis projection matrix is discussed in Section 5.2. All approaches are shown in Figure 13. A sketch of the off-axis frustum computation is given below.
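The sketch below derives the asymmetric frustum bounds for one eye, assuming the eye camera is the 2D camera translated along its x-axis by half the eye separation and the virtual display is centered in front of the original camera. All names and the parameterization are illustrative, not from the paper.

```cpp
// Asymmetric (off-axis) frustum for one eye. 'eyeOffset' is +separation/2
// for the right eye and -separation/2 for the left eye. 'screenDist' is
// the distance from the viewer to the virtual display plane, and
// 'screenHalfW'/'screenHalfH' are half its width and height.
struct Frustum { double left, right, bottom, top, znear, zfar; };

Frustum eyeFrustum(double eyeOffset, double screenDist,
                   double screenHalfW, double screenHalfH,
                   double znear, double zfar)
{
    double scale = znear / screenDist;  // scale screen extents to the near plane
    Frustum f;
    f.left   = (-screenHalfW - eyeOffset) * scale;  // frustum shifts opposite to the eye
    f.right  = ( screenHalfW - eyeOffset) * scale;
    f.bottom = -screenHalfH * scale;
    f.top    =  screenHalfH * scale;
    f.znear  = znear;
    f.zfar   = zfar;
    return f;
}
// The eye camera itself is translated by 'eyeOffset' along its x-axis;
// with these bounds, both eye frustums project onto the same screen area.
```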

4.1 Reprojection

An image has to be rendered from both cameras. This can be done by rendering everything twice. However, for high-end applications this can result in an unacceptable loss of performance. Alternatively, the images for the left and right eye can be reconstructed from the center image using the depth of each pixel. This technique is used by Crytek [13], and an implementation is discussed in [14]. The technique does not perfectly reproduce the images for both eyes. That is impossible, because the cameras for the left and right eye receive some information which is not received by the center camera, as shown in Figure 14. However, it gives a good result when sensible values for the eye separation and focal distance are used [14]. A hedged sketch of the parallax computation is shown below.

Figure 14: The blue and red parts on the curve are unique to the respective eye.
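The function below is a hedged sketch of how such a reprojection might derive a per-pixel horizontal shift from depth; the exact formulation used by Crytek [13] and in [14] may differ, and the clamping and all parameter names are our assumptions. Real implementations run this in a pixel shader and resolve occlusions and gaps more carefully.

```cpp
#include <algorithm>

// Per-pixel parallax for depth-based reprojection (sketch).
float parallaxInPixels(float depth,            // view-space depth of the pixel
                       float convergence,      // depth that maps onto the screen plane
                       float maxSeparationPx)  // separation cap in pixels
{
    // Pixels at the convergence depth get zero parallax; nearer pixels pop
    // out of the screen, farther pixels recede behind it.
    float p = maxSeparationPx * (1.0f - convergence / std::max(depth, 1e-4f));
    return std::clamp(p, -maxSeparationPx, maxSeparationPx);
}
// The left/right images sample the center image shifted by this amount,
// which is why the regions unique to each eye (Figure 14) cannot be recovered.
```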

Figure 15: Head coupling changes the view into the virtual world.

5 Head Tracking for Virtual Reality

To enhance the 3D experience, the position of the head can be coupled to the position of the camera. When the user moves his head, the view is adjusted accordingly (as shown in Figure 15). This gives the user the impression that the display is a window into a virtual world, instead of a flat image of the virtual world. This technique is called Fish Tank Virtual Reality, or head coupling. Depth perception is possibly better with head coupling than with stereoscopic 3D, and even better when the techniques are combined [15]. Our experience is that the head coupling effect increases immersion significantly. However, we did not get the impression that the screen really is a window into the virtual world. We think this is because the eyes have to be focused on the screen, instead of at the depth of the scene. To achieve head coupling, the position of the head has to be detected, for example through a webcam. This position has to be transformed to the three-dimensional space in which the position and size of the screen are defined. From this information, a projection matrix can be generated which accommodates an arbitrary position of the head.

5.1 Tracking

An accessible way to detect the position of the head is through a camera such as a webcam. Face detection can be performed on the webcam image using, for example, OpenCV. However, face detection might not be as accurate, robust and fast as necessary. Alternatively, the user can wear a set of glasses with at least two markers, as shown in Figure 16. When using a passive or active stereoscopic display, a pair of glasses has to be worn anyway; adding two markers to that pair of glasses does not introduce additional discomfort.

Figure 16: Glasses with markers.

Figure 17: Calculating the 3D marker positions from 2D image points.

To detect the markers, the image from the webcam is first converted to the HSV color model. This allows us to introduce invariance to shading. A threshold is applied to the image to filter out the markers. The threshold on the hue is relatively tight; the thresholds on the saturation and value are looser, to achieve the invariance to shading. Alternatively, for a more invariant result, an infrared camera can be used in combination with infrared light sources. That results in a grayscale image, which requires only one threshold. Noise can be filtered out by eroding the thresholded image. A union-find algorithm can be used to quickly find the two largest blobs in the image. It essentially checks per line in the image which sections of that line overlap, and it connects those sections efficiently using a tree. The center of each blob can be found by taking the average position of all pixels in the blob. A minimal sketch of this marker-detection pipeline is shown below.

The 2D positions on the camera image have to be transformed to a point in 3D. This is visualized in Figure 17. First, the 2D positions i1 and i2 have to be transformed to lines in 3D. The lines l1 and l2 start at the camera c and intersect the respective 2D positions on the projection plane. These lines can be constructed geometrically, or by applying the inverse intrinsic and extrinsic matrices of the camera. Because the distance e between the two 3D points (the markers) and the angle α between the lines are known, the depth d can be calculated using d = (e/2) / sin(α/2). The depth is used to find the correct position m of each marker along its line l. Note that the user is assumed to be exactly facing the camera; additional markers would have to be used to avoid this assumption. The average of m1 and m2 is used as the head position. For stereoscopic 3D, the eye positions for both eyes lie on the line from m1 through m2; the exact positions depend on the desired eye separation. The points of the display have to be defined in the same space as the 3D head position.
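The sketch below condenses the detection steps using OpenCV. Note that cv::connectedComponentsWithStats is used here as a stand-in for the union-find blob labeling described above, and the HSV thresholds are illustrative assumptions that would have to be calibrated for the actual markers and lighting.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <utility>

// Returns the centroids of the two largest marker blobs in a BGR frame.
// Assumes at least two blobs pass the threshold.
std::pair<cv::Point2d, cv::Point2d> detectMarkers(const cv::Mat& frameBgr)
{
    cv::Mat hsv, mask;
    cv::cvtColor(frameBgr, hsv, cv::COLOR_BGR2HSV);
    // Tight hue band, loose saturation/value bands (shading invariance).
    cv::inRange(hsv, cv::Scalar(100, 80, 60), cv::Scalar(130, 255, 255), mask);
    cv::erode(mask, mask, cv::Mat());  // remove single-pixel noise

    // Connected-component labeling stands in for the union-find step;
    // the returned centroids are the per-blob average pixel positions.
    cv::Mat labels, stats, centroids;
    int n = cv::connectedComponentsWithStats(mask, labels, stats, centroids);

    int best = -1, second = -1;        // indices of the two largest blobs
    for (int i = 1; i < n; ++i) {      // label 0 is the background
        int area = stats.at<int>(i, cv::CC_STAT_AREA);
        if (best < 0 || area > stats.at<int>(best, cv::CC_STAT_AREA)) {
            second = best; best = i;
        } else if (second < 0 || area > stats.at<int>(second, cv::CC_STAT_AREA)) {
            second = i;
        }
    }
    return { cv::Point2d(centroids.at<double>(best, 0),
                         centroids.at<double>(best, 1)),
             cv::Point2d(centroids.at<double>(second, 0),
                         centroids.at<double>(second, 1)) };
}
```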

5.2 Projection

When using a normal projection matrix, it is assumed that the user is exactly centered in front of the display. To take the head position into account, an off-axis projection matrix has to be used [16]. The off-axis projection matrix allows the position of the camera to shift relative to the projection plane.

Figure 18: The frustum and the input variables required to calculate the off-axis projection matrix in OpenGL.

An off-axis projection matrix is also used in Section 4 to correctly render stereoscopic images; one off-axis projection matrix can be created to accommodate both head coupling and stereoscopic 3D. The off-axis projection matrix is generated under the condition that the head is located at the origin, looking down the negative z-axis. The input required to calculate the off-axis projection matrix consists of six variables: the distances to the near and far plane, and the distances from the z-axis to the four sides of the frustum at the depth of the near plane. These variables are visualized in the frustum in Figure 18. Note that right − left is equal to the width of the frustum at the near plane, and top − bottom is equal to its height. A right-handed off-axis projection matrix can be created with the glFrustum function in OpenGL, which uses the input variables to create the following matrix:

$$
P =
\begin{pmatrix}
\frac{2\,near}{right-left} & 0 & \frac{right+left}{right-left} & 0 \\
0 & \frac{2\,near}{top-bottom} & \frac{top+bottom}{top-bottom} & 0 \\
0 & 0 & -\frac{far+near}{far-near} & -\frac{2\,far \cdot near}{far-near} \\
0 & 0 & -1 & 0
\end{pmatrix}
$$

The input for the off-axis projection matrix can be found by transforming the space such that the head is located at the origin, looking down the negative z-axis. In this space, the position and size of the display can be used to find the input variables. Note that the values have to be scaled to the near plane. Because the head is located at the center, the display moves in the virtual world instead of the head. In reality, the position of the display is static and the head moves. The result is that a different part of the virtual world is visible, which breaks the illusion. To compensate for the head position, the off-axis projection matrix has to be multiplied with the inverse translation of the head. The resulting matrix is Q = P · T, where P is the off-axis projection matrix and T is the inverse translation matrix. In a more complex situation, such as a multi-monitor setup, the display might be rotated. This is accounted for by applying the corresponding inverse rotation matrix. In that setup, the resulting matrix is Q = P · R · T, where R is the inverse rotation matrix [17]. A minimal sketch of this construction is shown below.
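The sketch below assembles Q = P · T under the assumptions that the display is centered at the origin of the tracking space and faces the positive z-axis, and that the head position is given in that space. Matrices are row-major here purely for readability, and all helper and parameter names are illustrative.

```cpp
#include <array>

using Mat4 = std::array<double, 16>;  // row-major 4x4 matrix

Mat4 multiply(const Mat4& a, const Mat4& b)
{
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i * 4 + j] += a[i * 4 + k] * b[k * 4 + j];
    return r;
}

// The glFrustum-style off-axis projection matrix P from the text.
Mat4 frustum(double l, double r, double b, double t, double n, double f)
{
    return { 2*n/(r-l), 0,         (r+l)/(r-l),  0,
             0,         2*n/(t-b), (t+b)/(t-b),  0,
             0,         0,        -(f+n)/(f-n), -2*f*n/(f-n),
             0,         0,        -1,            0 };
}

// Q = P · T for a display of size 2*halfW x 2*halfH centered at the
// tracking-space origin, with the head at (hx, hy, hz), hz > 0.
Mat4 headCoupledProjection(double hx, double hy, double hz,
                           double halfW, double halfH,
                           double n, double f)
{
    double s = n / hz;  // scale the display extents to the near plane
    Mat4 P = frustum((-halfW - hx) * s, (halfW - hx) * s,
                     (-halfH - hy) * s, (halfH - hy) * s, n, f);
    Mat4 T = { 1, 0, 0, -hx,   // inverse translation of the head
               0, 1, 0, -hy,
               0, 0, 1, -hz,
               0, 0, 0, 1 };
    return multiply(P, T);
}
```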

The scale between the head-tracking space and the world space is very important to consider. The real-world space is now coupled to the virtual world space, and the scale between them has to match the expectations of the user. If the scale does not match the user's expectation, the head coupling effect is not convincing. The correctness of the scale is more important when the realism of the scene is high. For example, when a house has the scale of a shoebox, the user does not perceive the desired head coupling effect. To resolve this, the size of the head-tracking space has to be adjusted. The scale of the world space could also be adjusted, but that is not convenient, because a lot of other properties are relative to the world scale.

5.3 Limitations

The head coupling technique is only suited for one user. The projection is adapted to the head position of one user, giving other viewers a distorted image. Furthermore, the effect is only noticeable when the user moves his head. Because the head coupling effect is not common, the user might not even think about moving his head, or might become tired of moving his head all the time. This prevents the user from perceiving the head coupling effect. Not all equipment is suited for use with head tracking. A camera with a wide field of view is preferred, to track large movements. Also, the display needs a wide viewing angle. This can be a problem with passive 3D displays, because their vertical viewing angle can be very limited.

5.4 Future work

The combination of head coupling with stereoscopic 3D is promising. The depth perception in stereoscopic 3D changes when moving the head closer to or further away from the display, or when rotating the head. These depth distortions can be countered by using head coupling, allowing for a smoother 3D experience.

6 Acknowledgments

We would like to thank Cor Jansen for providing us with the equipment required for this study. We would also like to thank Wolfgang Hürst and Arno Kamphuis, our supervisors at Utrecht University. The images in Figure 2 and Figure 3 are taken from http://en.wikipedia.org/wiki/File:Cones_SMJ2_E.svg and http://en.wikipedia.org/wiki/File:Anaglyph_glasses.png, respectively.


References

[1] Neil Schneider. Crytek Interview Part I of III, August 2010. http://www.mtbs3d.com/index.php?view=article&id=11650

[2] Sébastien Schertenleib et al. Optimization for Making Stereoscopic 3D Games on PlayStation (PS3), 2010. http://www.technology.scee.net/files/presentations/nordic/OptimizationforMakingStereoscopic3DGamesonPlayStationPS3.pdf

[3] Andrew Woods. The Illustrated 3D HDTV (and 3D Monitor) List. http://www.3dmovielist.com/3dhdtvs.html#3dmon2009

[4] Y. Yoshihara et al. 3D Crosstalk of Stereoscopic Display Using Patterned Retarder and Corresponding Glasses. In Proceedings of the IDW '08, pages 1135-1138, 2008. http://wenku.baidu.com/view/c329926fb84ae45c3b358c3b.html

[5] Zalman ZM-M215W Specifications. http://www.zalman.co.kr/ENG/product/Product_Read.asp?idx=384

[6] HDMI Specification Version 1.4a Extraction of 3D Signaling Portion, March 2010. http://www.hdmi.org/manufacturer/specification.aspx

[7] DLP 3-D HDTV Technology, 2007. http://dlp.com/downloads/DLP 3D HDTV Technology.pdf

[8] Nvidia Quadro Stereo Technology. http://www.nvidia.com/object/quadro_stereo_technology.html

[9] ATI FirePro 3D Graphics. http://www.amd.com/us/Documents/AMDATI038_FamilyBro_Final_Web.pdf

[10] Ross Walker. How to enable Quad Buffered Open GL Stereo in a Window on any NVIDIA GeForce DDR, 2 or 3 card, 2009. http://www.rosswalker.co.uk/stereo/howto1_PatchedDrivers.htm

[11] OpenGL glDrawBuffer documentation. http://www.opengl.org/sdk/docs/man/xhtml/glDrawBuffer.xml

[12] MSDN D3DBACKBUFFER_TYPE documentation. http://msdn.microsoft.com/en-us/library/bb172506.aspx

[13] Nicolas Schulz. Bringing Stereo To Consoles. GDC Europe 2010. http://www.crytek.com/sites/default/files/Part 2 - Bringing Stereo to Consoles.ppt

[14] Marries van de Hoef and Bas Zalmstra. Fast Gather-based Construction of Stereoscopic Images Using Reprojection, May 2011. http://www.marries.nl/projects/stereoscopic-3d/

[15] Colin Ware et al. Fish Tank Virtual Reality. In Proceedings of INTERACT '93 and CHI '93, April 1993.

[16] Michael Deering. High Resolution Virtual Reality. In Computer Graphics 26, pages 195-202, July 1992.

[17] Robert Kooima. Generalized Perspective Projection, June 2009. http://aoeu.snth.net/static/gen-perspective.pdf
