Maximizing OpenGL Performance for GPUs

GDC Tutorial: Advanced OpenGL Game Development Maximizing OpenGL Performance for GPUs March 8, 2000 John F. Spitzer [email protected] Develo...


GDC Tutorial: Advanced OpenGL Game Development

Maximizing OpenGL Performance for GPUs March 8, 2000

John F. Spitzer [email protected]

Developer Relations Engineer NVIDIA Corporation

Potential Bottlenecks
They mirror the OpenGL pipeline
• Data transfer from application to GPU
• Vertex lighting
• Texture coordinate generation (TexGen), other per-vertex or per-triangle operations
• Texture mapping
• Other per-fragment operations

What Else Can Slow You Down?
Pixel operations
• glDrawPixels, glReadPixels, glCopyPixels
• Texture image downloads
Other stuff
• Inefficient context management
• Inefficient state management

Transferring Geometric Data from App to GPU
So many ways to do it
• Immediate mode
• Display lists
• Vertex arrays
• Compiled vertex arrays
• Vertex array range extension

Immediate Mode
The old stand-by
• Has the most flexibility
• Makes the most calls
• Has the highest CPU overhead
• Varies in performance depending on CPU speed
• Not the most efficient

Display Lists
Fast, but limited
• Immutable
• Requires driver to allocate memory to hold data
• Allows large amount of driver optimization
• Can sometimes be cached on graphics subsystem
• Typically very fast

Vertex Arrays
Best of both worlds
• Data can be changed as often as you like
• Data can be interleaved or in separate arrays
• Can use straight lists or indices
• Reduces number of API calls vs. immediate mode
• Little room for driver optimization, since data referenced by pointers can change at any time
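
As an illustration of the "interleaved" and "indices" options on this slide, the sketch below lays out one quad as an interleaved, indexed vertex array. The `Vertex` struct, the quad data, and the helper names are assumptions made for this example, not part of the talk, and no OpenGL calls are issued; the stride and offset it reports are the values one would pass to calls such as glVertexPointer and glNormalPointer for this layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved layout: position, normal, and texture
   coordinates for one vertex sit next to each other in memory. */
typedef struct {
    float position[3];
    float normal[3];
    float texcoord[2];
} Vertex;

/* A quad drawn as two triangles. With indices, the two shared corners
   are stored once: 4 vertices and 6 indices instead of 6 vertices. */
static const Vertex quad_vertices[4] = {
    {{0.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 1.0f}, {0.0f, 0.0f}},
    {{1.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 1.0f}, {1.0f, 0.0f}},
    {{1.0f, 1.0f, 0.0f}, {0.0f, 0.0f, 1.0f}, {1.0f, 1.0f}},
    {{0.0f, 1.0f, 0.0f}, {0.0f, 0.0f, 1.0f}, {0.0f, 1.0f}},
};
static const unsigned short quad_indices[6] = {0, 1, 2, 0, 2, 3};

/* Byte stride between consecutive vertices in the interleaved array. */
size_t vertex_stride(void)
{
    return sizeof(Vertex);
}

/* Byte offset of the normal within one vertex. */
size_t normal_offset(void)
{
    return offsetof(Vertex, normal);
}

/* Bytes saved by indexing the quad rather than listing 6 full vertices. */
size_t indexed_savings(void)
{
    size_t flat = 6 * sizeof(Vertex);
    size_t indexed = sizeof(quad_vertices) + sizeof(quad_indices);
    assert(flat >= indexed);
    return flat - indexed;
}
```

The saving is small for a single quad, but it grows with meshes in which most vertices are shared by several triangles.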

Compiled Vertex Arrays
Solve part of the problem
• Allow user to lock portions of vertex array
• In turn, gives driver more optimization opportunities:
  – Shared vertices can be detected, allowing driver to eliminate superfluous operations
  – Locked data can be copied to higher bandwidth memory for more efficient transfer to the GPU
• Still requires transferring data twice

Vertex Array Range Extension
Eliminates the double copy
• Analogous to Direct3D vertex buffers
• wglAllocateMemoryNV returns a chunk of AGP or video memory depending upon the user’s needs
• Application manages AGP/video memory itself
• Video memory is fastest, but most restrictive
• AGP is often just as fast, but must be used with care
• AGP memory is uncached: write to it sequentially to maximize write combining (and, thus, memory bandwidth)
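
The advice to write uncached memory sequentially can be sketched as a fill routine that touches destination addresses strictly in ascending order and never reads them back. The function name and the six-floats-per-vertex layout are assumptions for this example; `dst` stands in for memory that a real application would obtain from wglAllocateMemoryNV, and an ordinary buffer is enough to run the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Copy separate position and color arrays into one interleaved
   destination buffer. The destination pointer only ever moves forward,
   and the destination is never read, which is the access pattern the
   slide recommends for write-combined memory.
   dst must have room for 6 * count floats. */
void write_vertices_sequential(float *dst,
                               const float *positions, /* 3 floats per vertex */
                               const float *colors,    /* 3 floats per vertex */
                               size_t count)
{
    size_t i;
    assert(dst != NULL && positions != NULL && colors != NULL);
    for (i = 0; i < count; ++i) {
        *dst++ = positions[3 * i + 0];
        *dst++ = positions[3 * i + 1];
        *dst++ = positions[3 * i + 2];
        *dst++ = colors[3 * i + 0];
        *dst++ = colors[3 * i + 1];
        *dst++ = colors[3 * i + 2];
    }
}
```

Filling all positions first and then going back to fill in the colors would produce the same buffer, but with scattered writes, which is the pattern to avoid here.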

Vertex Lighting
Gaining speed and popularity
• Not terribly fast when performed on CPU
• Very fast when performed on GPU!
• Different types have different performance characteristics
• Some lighting modes cost more than others
• 8 simultaneous lights allowed, minimize for best performance

Vertex Lighting Performance
[Chart: Quadro Lighting Performance. Millions of triangles per second (axis from 0 to 14) versus number of lights (1 to 8), with one series each for infinite lights with infinite viewer, infinite lights with local viewer, local lights, and spot lights. Data points are not recoverable.]

Light Types and Modes
Differing performance characteristics
• Infinite lights fastest with infinite viewer, since half angle vector need not be recomputed for every vertex
• Local lights are more computationally expensive, but often offer some features for “free”:
  • Local viewer
  • Attenuation
• Color material is typically not free
• Two-sided lighting is almost never free

Number of Lights to Enable
Minimize lights to maximize performance
• More is not necessarily better
• Saturation often occurs with over four active lights
• Quickly calculate distance squared from each object to each local light to determine whether it should be enabled or not
• Dot products can be used to determine whether an object is in the cone of a spot light or not
• If an object has more than four lights, disable the furthest
• You might have to reduce the size of your objects
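
The per-object light selection described above might look like the following sketch. All names (`Vec3`, `select_nearest_lights`, `in_spot_cone`) are invented for this example, and the budget of four lights is taken from the slide's saturation remark rather than from any API limit. Objects and lights are reduced to points, and the spot direction is assumed to be unit length.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ACTIVE_LIGHTS 4 /* budget taken from the slide's "four" */

typedef struct { float x, y, z; } Vec3;

static float dot3(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

static Vec3 sub3(Vec3 a, Vec3 b)
{
    Vec3 r;
    r.x = a.x - b.x; r.y = a.y - b.y; r.z = a.z - b.z;
    return r;
}

/* Squared distance avoids a square root; the ordering is unchanged. */
float distance_squared(Vec3 a, Vec3 b)
{
    Vec3 d = sub3(a, b);
    return dot3(d, d);
}

/* Spot cone test using dot products only. spot_dir must be unit length
   and cos_cutoff is the cosine of the cone half-angle (below 90
   degrees). Comparing squared terms avoids normalizing to_obj. */
int in_spot_cone(Vec3 light_pos, Vec3 spot_dir, float cos_cutoff, Vec3 obj)
{
    Vec3 to_obj = sub3(obj, light_pos);
    float d = dot3(spot_dir, to_obj);
    assert(cos_cutoff > 0.0f);
    if (d <= 0.0f)
        return 0; /* behind the light, or exactly at it */
    return d * d >= cos_cutoff * cos_cutoff * dot3(to_obj, to_obj);
}

/* Keep the indices of the nearest lights, nearest first, dropping the
   furthest once the budget is exceeded. Returns how many were kept. */
size_t select_nearest_lights(Vec3 obj, const Vec3 *lights, size_t n,
                             size_t out[MAX_ACTIVE_LIGHTS])
{
    size_t count = 0, i, j, k;
    for (i = 0; i < n; ++i) {
        float d2 = distance_squared(obj, lights[i]);
        /* Find the insertion point among the lights kept so far. */
        j = count;
        while (j > 0 && distance_squared(obj, lights[out[j - 1]]) > d2)
            --j;
        if (j >= MAX_ACTIVE_LIGHTS)
            continue; /* further away than every kept light */
        k = (count < MAX_ACTIVE_LIGHTS) ? count : MAX_ACTIVE_LIGHTS - 1;
        for (; k > j; --k)
            out[k] = out[k - 1];
        out[j] = i;
        if (count < MAX_ACTIVE_LIGHTS)
            ++count;
    }
    return count;
}
```

A renderer would then enable only the returned lights before drawing the object. Large objects make the point approximation poor, which is one reading of the slide's last bullet about reducing object size.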

Texture Coordinate Generation
TexGen often hardware accelerated, but not free
[Chart: Quadro Texture Coordinate Generation Performance. One horizontal bar per glTexGen mode (Reflection Map, Sphere Map, Eye Linear, Normal Map, Object Linear, Explicit), measured in millions of triangles per second on an axis from 0 to 16. Bar values are not recoverable.]

Texture Coordinate Transformation
Use the texture matrix wisely
• Like TexGen, the texture matrix is often hardware accelerated, but not free
• If texture coordinates are not changed on a per-frame basis, it might be better to pre-multiply them
• For calculating projected textures, shadows and so forth, using the texture matrix is encouraged
• If one texture matrix is hardware accelerated, the other (in the case of multitexturing) usually is too
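
Pre-multiplying static texture coordinates could be done once at load time along the lines of the sketch below. The function name is hypothetical. The matrix is stored column-major, as OpenGL stores matrices, and is assumed to be affine (its bottom row leaves q equal to 1), so no perspective divide is needed; projected textures, which the slide says should keep using the texture matrix, do not fit that assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Apply a column-major 4x4 matrix to each (s, t) pair in place,
   treating the pair as the point (s, t, 0, 1). Only the terms that
   contribute to s and t are evaluated. */
void premultiply_texcoords(const float m[16], float *st, size_t count)
{
    size_t i;
    assert(m != NULL && st != NULL);
    for (i = 0; i < count; ++i) {
        float s = st[2 * i];
        float t = st[2 * i + 1];
        st[2 * i]     = m[0] * s + m[4] * t + m[12];
        st[2 * i + 1] = m[1] * s + m[5] * t + m[13];
    }
}
```

After this step the application can leave the texture matrix at identity for those coordinates.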

Other Vertex/Face Calculations
Clipping
• Performed efficiently on GPU, no need to do yourself
• Per-object view-frustum culling still strongly encouraged
Culling
• Backface culling can cut fillrate requirements in half
• Enable whenever feasible (i.e. on closed objects)
Polygon offset
• Very useful for hidden line, decals, appliqués
• Typically little or no performance overhead

Other Vertex/Face Calculations (continued)
Dual matrix vertex weighting
• Great for doing simple skinning
• Not free, but can be done much faster than on CPU
Fog calculations
• Not free, but probably faster than the CPU
• Different modes usually have the same performance
• Can calculate your own, if you want
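
For readers unfamiliar with dual matrix vertex weighting, the sketch below shows the arithmetic on the CPU: a vertex is transformed by two matrices and the results are blended by a per-vertex weight. This is an illustrative reading of the technique with invented function names, not the hardware path the slide refers to, and it handles positions only (normals are omitted).

```c
#include <assert.h>

/* Transform the point (v, 1) by a column-major 4x4 matrix, keeping
   only x, y, z. */
static void transform_point(const float m[16], const float v[3], float out[3])
{
    int r;
    for (r = 0; r < 3; ++r)
        out[r] = m[r] * v[0] + m[4 + r] * v[1] + m[8 + r] * v[2] + m[12 + r];
}

/* Blend the two transformed positions:
   out = weight * (m0 * v) + (1 - weight) * (m1 * v). */
void blend_vertex(const float m0[16], const float m1[16],
                  const float v[3], float weight, float out[3])
{
    float a[3], b[3];
    int i;
    assert(weight >= 0.0f && weight <= 1.0f);
    transform_point(m0, v, a);
    transform_point(m1, v, b);
    for (i = 0; i < 3; ++i)
        out[i] = weight * a[i] + (1.0f - weight) * b[i];
}
```

In simple skinning, the two matrices would be the transforms of two adjacent bones, with weights near 0.5 around the joint.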

Texturing
Hard to optimize
• Speed vs. quality: make it a user settable option
• Pick the right filtering modes
• Pick the right texture formats
• Pick the right texture functions
• Load the textures efficiently
• Manage your textures effectively
• Use multitexture

Texture Filtering/Formats
Highest quality, not necessarily highest speed
• Use GL_LINEAR_MIPMAP_LINEAR filtering
• Optionally use anisotropic filtering (only 10% hit on GeForce)
• Use 24-bit or 32-bit internal texture formats
Highest speed, not necessarily highest quality
• Use bilinear mipmapping (GL_LINEAR_MIPMAP_NEAREST)
• Use packed pixel 16-bit internal (and external) formats
• Use S3TC texture compression, if available
• Use single/dual component formats, if practical
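
As an example of preparing a packed 16-bit external format, the sketch below converts 8-bit-per-channel RGB data to 5-6-5 pixels with red in the high bits, which is the bit layout of the GL_UNSIGNED_SHORT_5_6_5 packed pixel type. The function names are made up for this example, and the conversion simply truncates the low bits; whether this particular format is the fast one on a given board is something to verify on the target platform.

```c
#include <assert.h>
#include <stddef.h>

/* Pack one 8-bit-per-channel color into 16 bits: 5 bits red (high),
   6 bits green, 5 bits blue (low). Low-order bits are truncated. */
unsigned short pack_rgb565(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned short)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Convert a tightly packed RGB888 image to RGB565. src holds
   3 * pixel_count bytes and dst holds pixel_count shorts. */
void convert_rgb888_to_rgb565(const unsigned char *src, unsigned short *dst,
                              size_t pixel_count)
{
    size_t i;
    assert(src != NULL && dst != NULL);
    for (i = 0; i < pixel_count; ++i)
        dst[i] = pack_rgb565(src[3 * i], src[3 * i + 1], src[3 * i + 2]);
}
```

Doing this conversion once, offline or at load time, halves the data to download compared with 32-bit texels and spares the driver a per-upload format conversion when internal and external formats match.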

Maximizing Texture Download Performance
Very important if you have a lot of textures
• Use glTexSubImage2D rather than glTexImage2D
• Match external/internal formats
• Use texture compression, if available
• If using copy_texture, match texture internal format to that of framebuffer (e.g. 32-bit desktop to GL_RGBA8)
• If using paletted textures, share the palette between multiple textures

Other Texture Tips
Texture Binds
• Minimize these, possibly by sorting objects by texture ID
Multitexture
• Collapse two passes into one by using multitexture
• Use register combiners extension to reduce number of passes
  • Allows much more flexibility than standard OpenGL modes
  • Permits separate RGB and Alpha processing
  • Use only one general register combiner, if possible
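
Sorting by texture ID, as suggested under "Texture Binds", can be sketched with the standard library's qsort. The `DrawItem` struct and function names are hypothetical; the bind counter models a draw loop that calls glBindTexture only when the texture differs from the previous item's.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-object draw record. */
typedef struct {
    unsigned int texture_id;
    int mesh_id;
} DrawItem;

static int compare_by_texture(const void *a, const void *b)
{
    const DrawItem *x = (const DrawItem *)a;
    const DrawItem *y = (const DrawItem *)b;
    return (x->texture_id > y->texture_id) - (x->texture_id < y->texture_id);
}

/* Order the draw list so items sharing a texture are adjacent. */
void sort_by_texture(DrawItem *items, size_t n)
{
    assert(items != NULL || n == 0);
    if (n > 1)
        qsort(items, n, sizeof *items, compare_by_texture);
}

/* Number of binds a draw loop would issue if it rebinds only when the
   texture changes from one item to the next. */
size_t count_texture_binds(const DrawItem *items, size_t n)
{
    size_t binds = 0, i;
    for (i = 0; i < n; ++i)
        if (i == 0 || items[i].texture_id != items[i - 1].texture_id)
            ++binds;
    return binds;
}
```

Note that qsort is not stable, so items sharing a texture may be reordered among themselves; that matters only if draw order within a texture group is significant, for example with blending.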

Other Fragment Operations
Polygon stipple
• May be fast by itself, but not in conjunction with texturing
Specular color summation and fog application
• Free on many systems
Testing operations (scissor, alpha, stencil, depth)
• Testing for scissor/alpha usually free
• Depth/stencil can require a read/modify/write at some cost
• Render from front-to-back to minimize writing to depth buffer

Other Fragment Operations (continued)
Blending
• Most modes cut fill rates in half, because of read/modify/write
• Use only where necessary
Color logical operation (LogicOp)
• Can make system default to software rendering
• Avoid anything but default mode (GL_COPY)

Pixel Operations
Blitting between system and framebuffer memory
• Keep it simple: no weird formats or types
• No pixel maps, shifts, biases, or other operations
• On almost all systems, RGB/RGBA unsigned byte formats are somewhat optimized
• Other formats, which more closely match native framebuffer configuration, may be faster
• Avoid reading/writing depth buffer; instead, use GL_KTX_buffer_region extension for incremental updates

Context Switching
Do it wisely, or it will cost you
• Context switching is expensive, keep it to a minimum
• Try “faking” multiple windows by setting the viewport and scissor rectangle to restrict drawing to that “sub-window”
• If multiple windows are necessary, try re-using a single context by binding it to separate windows
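
The "faked" sub-window approach comes down to computing one rectangle per pane and passing it to both glViewport and glScissor. The sketch below computes such rectangles for a regular grid; the `Rect` type, the grid layout, and the function name are assumptions for this example, and the OpenGL calls appear only in a comment so the code runs without a GL context.

```c
#include <assert.h>

typedef struct { int x, y, width, height; } Rect;

/* Rectangle of cell (col, row) in a cols x rows grid covering a window
   of win_w x win_h pixels. The origin is the lower-left corner, as in
   OpenGL window coordinates. Cell edges are computed from the grid
   lines so that neighbouring cells tile the window without gaps even
   when the size is not evenly divisible. */
Rect sub_window(int win_w, int win_h, int cols, int rows, int col, int row)
{
    Rect r;
    assert(win_w > 0 && win_h > 0 && cols > 0 && rows > 0);
    assert(col >= 0 && col < cols && row >= 0 && row < rows);
    r.x = col * win_w / cols;
    r.y = row * win_h / rows;
    r.width = (col + 1) * win_w / cols - r.x;
    r.height = (row + 1) * win_h / rows - r.y;
    return r;
}

/* Intended use in a render loop, with a single context:
 *
 *     Rect r = sub_window(width, height, 2, 2, col, row);
 *     glEnable(GL_SCISSOR_TEST);
 *     glViewport(r.x, r.y, r.width, r.height);
 *     glScissor(r.x, r.y, r.width, r.height);
 *     ... draw the scene for this pane ...
 */
```

The scissor rectangle matters because glClear ignores the viewport but honours the scissor test, so clearing one pane would otherwise wipe the others.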

General Performance Concerns
State management
• Try to avoid setting redundant state (this is common)
• Minimize state changes by sorting in order of attributes, if possible (starting with most expensive to change)
Antialiasing
• Be sure that the system can support it in hardware
• Test at run-time to determine if it’s fast enough, and disable if it’s not
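
One common way to avoid setting redundant state, sketched here under the assumption that the application routes its state changes through a small wrapper, is to shadow the current value and skip the call when nothing changes. The `StateCache` type and its functions are hypothetical, only texture binding is tracked, and the real glBindTexture call is left as a comment so the sketch runs without a GL context.

```c
#include <assert.h>
#include <stddef.h>

/* Shadow copy of one piece of state, plus counters for illustration. */
typedef struct {
    int texture_valid;          /* 0 until the first bind */
    unsigned int bound_texture;
    unsigned long binds_issued;
    unsigned long binds_skipped;
} StateCache;

void state_cache_init(StateCache *c)
{
    assert(c != NULL);
    c->texture_valid = 0;
    c->bound_texture = 0;
    c->binds_issued = 0;
    c->binds_skipped = 0;
}

/* Bind a texture only if it differs from the one already bound.
   Returns 1 if a real bind would be issued, 0 if it was redundant. */
int cached_bind_texture(StateCache *c, unsigned int texture)
{
    assert(c != NULL);
    if (c->texture_valid && c->bound_texture == texture) {
        ++c->binds_skipped;
        return 0;
    }
    /* glBindTexture(GL_TEXTURE_2D, texture); */
    c->bound_texture = texture;
    c->texture_valid = 1;
    ++c->binds_issued;
    return 1;
}
```

The shadow copy is only valid if every change goes through the wrapper; code that sets state directly would need to reset the cache afterwards.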

Identifying Bottlenecks
Start with your application
• Use a profiling tool, like Intel’s VTUNE, to identify parts of your code where the most time is being spent
• Expect a graphics-intensive application (like a game) to spend a good amount of time in glBegin, glEnd, glFinish, etc.
Graphics bottlenecks
• Make the window smaller
• Assuming you don’t have a dynamic LOD selector, your performance will go up if raster bound, not if geometry bound
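
The window-size test can be turned into a small decision helper, as in the sketch below. The enum, the function name, and the `min_speedup` threshold are all assumptions for this example; the frame times are expected to be averages measured by the application at the full and the reduced window size, with the same scene and no dynamic LOD.

```c
#include <assert.h>

typedef enum {
    BOUND_RASTER,   /* shrinking the window helped */
    BOUND_GEOMETRY  /* shrinking the window did not help */
} Bottleneck;

/* Compare average frame times (in milliseconds) at full and reduced
   window size. min_speedup is a hypothetical noise threshold: 1.2
   means the small window must be at least 20% faster to count. */
Bottleneck classify_bottleneck(double full_ms, double small_ms,
                               double min_speedup)
{
    assert(full_ms > 0.0 && small_ms > 0.0 && min_speedup >= 1.0);
    return (full_ms / small_ms >= min_speedup) ? BOUND_RASTER
                                               : BOUND_GEOMETRY;
}
```

A result of BOUND_GEOMETRY here only rules out fill rate; the limit could equally be in the application itself, which is why the slide starts with profiling the application.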

Know What’s Fast
Before you start coding
• Use a performance benchmark, like SPECglperf, or a custom-written benchmark
• Investigate your target platform(s)
• Determine which modes are fast, and which aren’t
At runtime
• Build in a mini-benchmark to test performance
• Select rendering paths depending upon performance
• Allows scalability across many platforms
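
A run-time mini-benchmark might be structured as in the following sketch: time a fixed number of frames through a caller-supplied draw function, then map the result to a rendering path. The path names, the 30 and 60 frames-per-second thresholds, and the function names are invented for illustration. The timer used here is the C library's clock(), which reports processor time; a real benchmark would more likely use a wall-clock timer and call glFinish before reading it, so that queued rendering is included.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef enum { PATH_BASIC, PATH_STANDARD, PATH_HIGH_DETAIL } RenderPath;

/* Run draw_frame `frames` times and return frames per second, or -1.0
   if the run was too short for the clock to register (in which case
   the caller should retry with more frames). */
double measure_fps(void (*draw_frame)(void), int frames)
{
    clock_t start, end;
    int i;
    assert(draw_frame != NULL && frames > 0);
    start = clock();
    for (i = 0; i < frames; ++i)
        draw_frame();
    end = clock();
    if (end <= start)
        return -1.0;
    return (double)frames / ((double)(end - start) / CLOCKS_PER_SEC);
}

/* Map a measured frame rate to a rendering path. The thresholds are
   placeholders to be tuned per title. */
RenderPath select_render_path(double fps)
{
    assert(fps >= 0.0);
    if (fps >= 60.0)
        return PATH_HIGH_DETAIL;
    if (fps >= 30.0)
        return PATH_STANDARD;
    return PATH_BASIC;
}
```

Keeping measurement and selection separate lets the selection logic be tested with fixed numbers, and lets the application re-run the measurement when the user changes resolution.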

Questions?