A Hybrid CPU-GPU Implementation for Interactive Ray-Tracing of Dynamic Scenes

Brian C. Budge∗    John C. Anderson†    Christoph Garth‡    Kenneth I. Joy§

University of California, Davis

∗e-mail: [email protected]   †e-mail: [email protected]   ‡e-mail: [email protected]   §e-mail: [email protected]

ABSTRACT

In recent years, applying the powerful computational resources delivered by modern GPUs to ray tracing has resulted in a number of ray tracing implementations that allow rendering of moderately sized scenes at interactive speeds. For non-static scenes, besides ray tracing performance, fast construction of acceleration data structures such as kd-trees is of primary concern. In this paper, we present a novel implementation for the ray tracing of both static and dynamic scenes. We first describe an optimized GPU-based ray tracing approach within the CUDA framework that does not explicitly make use of ray coherency or architectural specifics and is therefore simple to implement, while still exceeding the performance of previously presented approaches. Optimal performance is achieved by empirically tuning the ray tracing kernel to the executing hardware. Furthermore, we describe a straightforward parallel approach for approximate-quality kd-tree construction, aimed at multi-core CPUs. The resulting hybrid ray tracer is able to render fully dynamic scenes with hundreds of thousands of triangles at interactive speeds. We describe our implementation in detail and provide a performance analysis and comparison to prior work.

Index Terms: I.3.6 [Computer Graphics]: Methodology and Techniques—Graphics data structures and data types; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Raytracing

1 Introduction

GPU-based ray tracers have recently achieved performance equal to and surpassing that of CPU implementations. Various techniques have been proposed to adapt ray tracing to GPU acceleration, and such optimizations are typically aimed at addressing both strengths and limitations of the underlying architecture and programming model. Typical approaches exploit SIMD hardware characteristics through the use of coherency between neighboring rays (e.g. [15, 5]), and limit the use of per-ray traversal state by employing specialized acceleration data structures (cf. [9, 7, 4]). Both approaches complicate the implementation of a GPU ray tracer, since the required data structures and algorithms deviate strongly from their CPU equivalents.

On latest-generation GPUs, under the CUDA programming model (cf. Section 3), graphics hardware is able to support the execution of general-purpose programs with a large number of concurrent threads. Typically, the number of simultaneous threads is much larger than the number of available execution units, and sophisticated hardware scheduling is applied to make maximal use of the available execution resources through techniques such as latency hiding and zero-cost thread switching. For maximum performance, each thread must limit its state, since overall state is limited. Therefore, ray tracing implementations under CUDA are often explicitly tuned to the sweet spot of a specific hardware implementation such as the G80 architecture; to guarantee optimal performance, a complex analysis of SIMD mapping, register counts, and per-thread local memory constraints is required ([5] provides an example). However, the resulting algorithms may not perform optimally on future hardware.

In this paper, we present a novel, optimized GPU-accelerated ray tracing implementation within the CUDA framework based on fast stack-based kd-tree traversal. Each ray is mapped to a thread, and a single kernel is used for the entire ray tracing pipeline including shading. Our ray tracer is not tailored to a specific number of execution units or state size, nor do we explicitly address ray coherency; instead, we implicitly make use of hardware scheduling characteristics by selecting execution parameters through experiment. This allows us to both achieve optimal performance and preserve simplicity of implementation. In particular, we base our code on generic optimization principles and deviate from an optimized CPU implementation only in the use of a small number of specialized instructions.

After briefly discussing previous work and describing the CUDA framework and G80 hardware, we provide an in-depth description of our ray tracing kernel in Section 4, and examine optimal execution parameters (Section 5); in addition, we give performance numbers for typical scenes and compare them to prior work. To facilitate ray tracing of animated scenes at interactive speeds using the ray tracer presented here, we describe a straightforward parallel CPU-based method for approximate-quality kd-tree construction, based on the work by Hunt et al. [8]. Using this approach, on a typical multi-core CPU with four cores, we are able to achieve interactive construction rates for moderately sized animated scenes. We give a brief discussion of our algorithm in Section 6, and provide a performance analysis for the coupled CPU-GPU ray tracing approach. Finally, we conclude on the presented material and discuss future work in Section 7.

2 Previous Work

Several combinations of data structures and traversal algorithms, targeted at GPU-based ray tracing implementations, appear in prior work. The discussion was started by Carr et al. [2], who employed early pixel shaders to achieve brute-force ray-triangle intersection; however, ray generation and shading still took place on the CPU. In [11] and [10], Purcell et al. went on to present a ray tracing framework that performed all calculations on the GPU, taking special care to formulate the ray tracing pipeline in terms of stream computation. Furthermore, secondary rays were mapped to additional render passes. While the corresponding implementations were highly innovative, their performance was not competitive with existing CPU implementations, and they suffered from limited computational precision and limited branching capabilities.

For static scenes, kd-trees have been identified as the fastest acceleration structure for ray tracing [6]. Before generic programming models for the GPU became available, the traversal of such hierarchical acceleration structures on the GPU was difficult because it requires a stack, which was hard to model in earlier programming environments. In this situation, Ernst et al. [3] demonstrated a GPU-based stack implementation and applied it to kd-tree traversal to accelerate GPU ray tracing. Foley and Sugerman [4] suggested stackless kd-tree traversal algorithms based on restarting the traversal or backtracking. The inherent redundancy of these approaches, however, severely limited their performance. This can be worked around using short stacks, as shown by Horn et al. in [7], or by removing the redundancy of stackless approaches through ropes, as suggested by Popov et al. in [9].

For dynamic scenes, bounding volume hierarchies (BVHs) are typically employed to accelerate ray tracing, as data structure updates are much easier to achieve (cf. [15] for a comprehensive discussion). Thrane and Simonsen showed in [14] that parallel stackless traversal is also possible in this context. More recently, Günther et al. [5] described fast ray packet traversal of BVHs. They further propose an algorithm for CPU-based BVH updates, and their combined rendering pipeline achieves interactive frame rates for moderately complex dynamic scenes. Most notably, their implementation is aimed at the CUDA platform, and they discuss specific optimizations for G80 hardware. Efficient dynamic construction of kd-trees, on the other hand, was only recently discussed. Hunt et al. [8] presented an approximate evaluation of the Surface Area Heuristic (SAH) that allows fast determination of nearly optimal split planes. The parallel build scheme we present in Section 6 is based on their work. More recently, Shevtsov et al. [12] implemented a highly parallel and scalable SIMD kd-tree build that can leverage multi-core CPUs.

In the following section, before we discuss our CUDA-based ray tracing implementation in more detail, we present a brief overview of the basic characteristics of the CUDA programming model and the G80 architecture that are relevant to our work.

3 Hardware Architecture and Programming Model

    union kdNode // 8 bytes
    {
        struct inner
        {
            uint  right : 29;
            uint  dim   : 2;
            uint  isLeaf: 1;
            float split;
        };
        struct leaf
        {
            uint numTri : 31;
            uint isLeaf : 1;
            uint tIndex;
        };
    };

Figure 2: kd-tree node layout.

With the introduction of the NVIDIA G80 architecture, the programmability and performance of GPUs have increased significantly. The CUDA programming abstraction essentially presents the GPU as a highly parallel general-purpose processor. Physically, the G80 architecture consists of a number of multiprocessors, or cores, each executing 32 threads in a SIMD fashion. Threads are grouped into logical blocks, with a maximum of 512 threads per block, and the multiprocessors work independently on disjoint sets of blocks. The individual cores support hardware multi-threading, allowing them to work on multiple blocks simultaneously and to hide hardware latencies such as memory access or instruction dependencies. The mapping of blocks to cores is deterministic, and a block is always executed by the same multiprocessor.

Memory access follows a hierarchy that determines latency and bandwidth. Each multiprocessor has a small amount of on-chip memory that is split into a register file and shared memory. Global memory, on the other hand, is used for storing textures and other data, and resides on the GPU board. While access to shared memory and registers is fast, global memory reads and writes incur significant latency. Shared memory is split among the blocks of threads executing on the corresponding core, and each thread within a block can only access the shared memory allocated to its block. The register file is partitioned per thread. Cached global memory access can be achieved by means of a single texture unit for each multiprocessor. The number of simultaneously executing threads that the scheduler can assign to a multiprocessor is determined by the resource usage of individual threads: the number of registers used, the shared memory size of a block, and the number of threads per block.

The CUDA programming model [1] presents a thin abstraction of the hardware layout of the G80. GPU code is essentially written in an extension of the C language that is tailored to the SIMD nature of the underlying hardware. Thread blocks executing identical code for each thread can be laid out in virtual grids of up to three dimensions, and the position of each thread in the grid can be accessed directly within the thread, allowing a simplified implementation of logically two- or three-dimensional algorithms. Scheduling of thread blocks is done transparently by the runtime environment, and hardware limits are not directly exposed to the programmer. Therefore, a CUDA program that runs on future hardware will automatically be able to take advantage of enhanced computing capabilities. However, to obtain good performance on current-generation hardware, it is necessary to be aware of current hardware limitations.

The G80 is available in a number of different configurations, regarding both the number of on-chip multiprocessors and the available on-board memory. The current consumer high-end configuration has 16 cores with 16 kB of shared memory per core and 768 MB of global memory. The number of threads per core is limited to 768, and to 12K for the entire GPU. Full utilization of all cores can be achieved if each thread uses no more than 10 scalar registers and 5 words of shared memory.
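To make the execution configuration concrete, the following is a minimal sketch of launching a kernel over a two-dimensional virtual grid, one thread per pixel. The kernel name, block dimensions, and output buffer are purely illustrative assumptions, not taken from the paper:

    // Minimal sketch of the execution model described above: each thread
    // computes its position in the virtual 2D grid and handles one pixel.
    __global__ void shadePixel( float4* out, int width, int height )
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if( x >= width || y >= height )
            return;
        out[y * width + x] = make_float4( 0.f, 0.f, 0.f, 1.f ); // trace ray here
    }

    // Host side (illustrative): 8x8 = 64 threads per block; the hardware
    // scheduler maps blocks to multiprocessors transparently.
    int width = 1024, height = 1024;
    dim3 block( 8, 8 );
    dim3 grid( (width  + block.x - 1) / block.x,
               (height + block.y - 1) / block.y );
    shadePixel<<< grid, block >>>( d_out, width, height ); // d_out: device buffer

The choice of block size directly affects occupancy: given the per-multiprocessor limits quoted above, the register and shared memory usage of each thread determines how many such blocks can be resident on a core at once.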

4 A CUDA Ray Tracing Kernel

In this section, we present a simplified variant of stack-based kd-tree traversal that is both simple to implement and outperforms previous work. The performance of our algorithm rests on simple optimization principles rather than on highly specialized data structures.

The basic element of our kd-tree memory representation is kdNode (cf. Figure 2). To keep the memory footprint of nodes small, we encode the information in the form of bitfields and treat kdNode as a union, depending on whether it represents an inner node or a leaf node. These two cases are distinguished by the isLeaf bit. The tree is laid out as a contiguous list of nodes; to save on redundancy, we adopt the convention that a node's left child is always the node immediately following it in memory. Hence, inner nodes only store the index of the right child (inner.right) in the list, along with the split plane orientation (inner.dim) and offset (inner.split). Leaf kdNodes, on the other hand, store the triangles contained within a kd-tree leaf as an offset and size into a precomputed triangle reference list. Again, to simplify the memory representation, we arrange the global triangle reference list such that the triangle references belonging to a kd-leaf form a contiguous subset of the global list. Then, leaf.tIndex and leaf.numTri denote the offset and length of this subset. In total, a kdNode is 8 bytes in size. Note that while CUDA currently does not support the C99-style bitfield notation employed in Figure 2, access to bitfield members is easily translated to elementary bit operations.

The resulting CUDA traversal code, given in abbreviated form in Figure 3, is stack-based, uses a per-thread stack, and is therefore very similar to typical CPU traversal code. This allows a straightforward implementation of the traversal while still preserving good performance. The number of GPU- or CUDA-specific optimizations is limited to the following three (see the highlighted sections of Figure 3): first, we reduce the cost of ray-node intersection using the reduced-precision division intrinsic __fdividef, which requires 20 instead of 36 cycles for division on current hardware (cf. [1]), thereby speeding up traversal intersections.
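To illustrate the translation of bitfield access into bit operations, the following sketch decodes the 8-byte kdNode of Figure 2 from a uint2, with the bitfield word assumed to be in .x and the split plane (or tIndex) in .y. The helper names follow those used in the traversal code of Figure 3, but the exact bit positions, derived from the declaration order in Figure 2, are our assumption rather than code from the paper:

    // Assumed decoding of the kdNode of Figure 2, fetched as a uint2.
    // Bitfields are assumed packed low-to-high: right (29 bits), dim (2),
    // isLeaf (1); the second word holds the split plane or tIndex.
    __device__ bool isLeaf( uint2 node )
    {
        return ( node.x & 0x80000000u ) != 0;   // top bit
    }

    __device__ uint getDim( uint2 node )
    {
        return ( node.x >> 29 ) & 0x3u;         // 2 bits below isLeaf
    }

    __device__ uint getRightChild( uint2 node )
    {
        return node.x & 0x1FFFFFFFu;            // lower 29 bits
    }

    __device__ float getSplit( uint2 node )
    {
        return __uint_as_float( node.y );       // reinterpret word as float
    }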

Figure 1: Ray traced images from our interactive ray tracer (BUDDHA, POWER PLANT, FAIRY FOREST, RT08). The BUDDHA and POWER PLANT scenes are shaded with a single point light source. The FAIRY FOREST and RT08 scenes are shaded using multiple materials and a single point light source.

    Scene          #Tris  | [9], kd-tree, G80  | [5], BVH, G80      | [12], kd-tree, CPU | our results
                          | primary  +shading  | primary  +shadows  | primary  +shadows  | primary  +shading      random
    TOYS           11.1K  |    –        –      |    –        –      |    –        –      |  59.6    17.1 (31.6)   14.9
    BUNNY          69.4K  |  12.7      5.9     |    –        –      |    –        –      |  37.9    14.0 (16.1)   9.61
    FAIRY FOREST   174K   |  10.6      4.0     |  14.6      4.8     |    –        –      |  27.7     7.6 (15.9)   11.2
    BUDDHA         1.08M  |    –        –      |    –        –      |    –       15.4    |  23.1    11.4 (12.3)   4.0
    POWER PLANT    12.7M  |    –        –      |   6.4      2.9     |    –        –      |  30.0    15.5          13.3

Table 1: Ray tracing performance in fps for a 1024×1024 viewport of our GPU-based kd-tree ray tracer, in comparison to recently published results. Note that [5] uses BVH acceleration structures, and [12] is a CPU implementation (Intel Core2 Duo 3 GHz ×2). The figures in parentheses were measured on a newer G92-based card.

Second, the kdNode list is accessed through a texture unit. This allows caching of fetched data (such as frequently traversed nodes near the root) and in general speeds up the otherwise incoherent memory accesses. Last, we preallocate the per-thread traversal stack (with its maximum size derived from the kd-tree) together with the ray origin and direction in shared memory for fast index-based access.

Triangle intersection tests are performed using the barycentric coordinate test (see e.g. [13]) in projected form (cf. [15], pp. 93, for a detailed presentation). The latter approach exploits the fact that projecting both a triangle and the hit point onto any plane that is not orthogonal to the triangle itself does not change the barycentric coordinates of the hit point. After projection, all calculations can be performed in two dimensions. Typically, one of the xy-, yz-, and xz-planes is selected as the projection plane so as to maximize the projected triangle area for numerical stability. The in-memory triangle representation (derived from [15], cf. Figure 4) stores the triangle plane equation (n_u, n_v, d) and the triangle edge line equations (b_nu, b_nv, b_d and c_nu, c_nv, c_d, respectively). Furthermore, we store an 8-bit material index (material) and three vertex indices (tIndex*) that can be used to index per-vertex information such as normals or texture coordinates for shading. Overall, the triangle structure has a size of 48 bytes. We access individual triangles through a one-dimensional texture (viewed as three uint4s), again exploiting the texture cache.

The performance of our system for general ray traversal including secondary rays (cf. Section 5) allows for a straightforward implementation ...
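To make the projected triangle test described above concrete, here is a minimal sketch of a Wald-style projected barycentric intersection. The structure layout follows the description in the text, but the packing of the projection axis k alongside the material index, as well as the function itself, are our assumptions rather than the paper's code:

    // Hypothetical triangle record following the layout described above:
    // plane equation, two edge line equations, material and vertex indices.
    struct TriAccel                // 48 bytes
    {
        float n_u, n_v, n_d;       // projected plane equation
        float b_nu, b_nv, b_d;     // edge equation for barycentric beta
        float c_nu, c_nv, c_d;     // edge equation for barycentric gamma
        uint  k_material;          // projection axis + 8-bit material (assumed packing)
        uint  tIndex[3];           // per-vertex indices for shading data
    };

    // Sketch of the projected barycentric test (after Wald [15]); u and v
    // span the projection plane, k is the dropped axis.
    __device__ bool intersect( const TriAccel& t, uint k,
                               const float o[3], const float dir[3],
                               float& tHit )
    {
        uint u = (k + 1) % 3, v = (k + 2) % 3;

        // Distance to the triangle plane (__fdividef could be used here,
        // as in the traversal code).
        float denom = dir[k] + t.n_u * dir[u] + t.n_v * dir[v];
        float dist  = (t.n_d - o[k] - t.n_u * o[u] - t.n_v * o[v]) / denom;
        if( dist < 0.f )
            return false;                    // plane behind the ray

        // 2D hit point in the projection plane.
        float hu = o[u] + dist * dir[u];
        float hv = o[v] + dist * dir[v];

        // Barycentric coordinates from the precomputed edge equations.
        float beta  = hu * t.b_nu + hv * t.b_nv + t.b_d;
        float gamma = hu * t.c_nu + hv * t.c_nv + t.c_d;
        if( beta < 0.f || gamma < 0.f || beta + gamma > 1.f )
            return false;

        tHit = dist;
        return true;
    }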

    #define KD_LEFT  (idx+1)
    #define KD_RIGHT (getRightChild(node))

    uint2 node = tex1Dfetch( kdtree, 0 );
    while( true )
    {
        while( !isLeaf(node) )
        {
            uint dim = getDim( node );
            float dnum = getSplit( node ) - origin[dim];
            dnum = dnum ? dnum : 1;
            float d = __fdividef( dnum, direc[dim] );
            // distance to plane is outside of valid interval
            if( d > upper || d ...

Figure 3: Abbreviated CUDA kd-tree traversal code (excerpt).

    struct materialInfo // 36 bytes
    {
        float3  Ks;     // specular color
        float3  Kd;     // diffuse color
        float   Ns;     // specular exponent
        float   Ni;     // index of refraction
        int16_t texId;  // diffuse texture
        int16_t type;   // material type
    };

Figure 5: Structure for storing material properties on the GPU.
