Crom - CPU/GPU Hybrid Computation Platform for Visual Effects
Nathan Cournia, Casey Vanover, Bill Spitzak, Hans Rijpkema, Josh Tomlinson, Bradley Smith, Nathan Litke
Rhythm and Hues Studios
Who We Are
Motivation
● Modernize lighting/compositing workflows
● Unify user experience
  ● Workflow evolved across four proprietary packages
● Streamline pipeline
Look Development (Lighthouse)
Render (Wren)
Light Placement (Voodoo)
Scene Lighting (Lighthouse)
LightCmp (Icy)
Requirements
● Rethink our software designed up to 25 years ago
  ● Multiple cores, multiple GPUs, international locations, cloud
  ● Decouple interface from computation engines
● Seamless integration with other software:
  ● Pipelines: R+H, Shotgun, etc.
  ● Renderers: R+H, Mantra, etc.
● User extensible:
  ● C++
  ● Python (new nodes, Qt interfaces)
  ● Interface builder / Visual Programming
  ● Easily share networks / interfaces
Main Idea
● Crom is a VFX platform
VFX Platform
● Look Development
● Scene Lighting
● Compositing
● Misc. Tools
General Design
● Core data structure is a dependency graph
● Data passed between dependency graph nodes is strongly typed
● Dependency graph is stateless
● Can hook up anything to anything else
Stateless Nodes
● Multiple threads can traverse the graph in parallel
● "Global" state is passed up the dependency graph in a "Context / Request" object
  ● Multiple frames, tiles, layers, etc. can be computed concurrently
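The stateless-node idea can be sketched in a few lines: all per-request state (frame, tile, layer) rides in a context object, so a single shared graph can be evaluated by many threads at once. This is a minimal illustration, not Crom's actual API; the `Context`, `AddNode`, and `FrameValue` names are invented for the example.

```python
import threading

class Context:
    """Per-request state passed up the graph; nodes hold none of it."""
    def __init__(self, frame, tile=None):
        self.frame = frame
        self.tile = tile

class AddNode:
    """Stateless node: its output depends only on inputs and the context."""
    def __init__(self, a, b):
        self.inputs = (a, b)

    def evaluate(self, ctx):
        a, b = self.inputs
        return a.evaluate(ctx) + b.evaluate(ctx)

class FrameValue:
    """Leaf whose value varies per frame, read from the context."""
    def evaluate(self, ctx):
        return float(ctx.frame)

graph = AddNode(FrameValue(), FrameValue())

# Multiple frames evaluated concurrently over the same graph instance.
results = {}
def run(frame):
    results[frame] = graph.evaluate(Context(frame))

threads = [threading.Thread(target=run, args=(f,)) for f in (1, 2, 3)]
for t in threads: t.start()
for t in threads: t.join()
print(results)  # {1: 2.0, 2: 4.0, 3: 6.0} (insertion order may vary)
```

Because no node mutates shared state during evaluation, no locking is needed around graph traversal itself.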
Data
● Data passed between nodes is stored in a "property graph"
● Data representation is decoupled from the programming interface
  ● An interface (i.e. an Adapter/Wrapper) can be placed onto a property graph to define an object
  ● A property graph can be adapted to provide multiple interfaces
● Copy-on-write semantics allow for sharing of data
● Heuristics place subsets of data into a persistent cache
● Property graph is dynamically user extensible yet strongly typed
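The copy-on-write sharing mentioned above can be sketched as follows. This is a toy model, not Crom's property graph: a forked graph shares the underlying storage until one side writes, at which point only the writer duplicates its copy.

```python
class PropertyGraph:
    """Toy property store with copy-on-write forking."""
    def __init__(self, props=None):
        self._props = props if props is not None else {}
        self._shared = False

    def fork(self):
        # Both graphs now reference the same dict; neither may mutate it
        # in place while it is marked shared.
        self._shared = True
        child = PropertyGraph(self._props)
        child._shared = True
        return child

    def set(self, key, value):
        if self._shared:
            self._props = dict(self._props)  # the copy happens only on write
            self._shared = False
        self._props[key] = value

    def get(self, key):
        return self._props[key]

base = PropertyGraph()
base.set("color", (1.0, 0.0, 0.0))
copy = base.fork()
copy.set("color", (0.0, 1.0, 0.0))  # triggers the copy
assert base.get("color") == (1.0, 0.0, 0.0)
assert copy.get("color") == (0.0, 1.0, 0.0)
```

Until the write, the fork costs O(1); this is what makes passing large image metadata between many nodes cheap.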
VFX Compositor
● Compositor: assembles multiple images into a final image (or images)
● Example: Nuke
GPU/CPU Compositor
● Crom implements a hybrid GPU/CPU compositor
● Dependency graph traversal produces two main items in the property graph:
  ● Instruction Tree: low-level operations to be performed
  ● Data Callbacks: objects that will be invoked to populate the compositing engine with data from the dependency graph
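The two traversal outputs can be illustrated with a toy traversal that emits both an instruction tree and a list of data callbacks. Node types, op names, and the `engine.upload` hook are invented for this sketch and are not Crom's actual API.

```python
class Instr:
    """One instruction-tree node: an op plus child instructions."""
    def __init__(self, op, *children):
        self.op, self.children = op, children

def traverse(node, callbacks):
    """Walk a tiny dependency graph, emitting instructions and callbacks."""
    if node["type"] == "ReadImage":
        # Defer the actual data transfer: the callback runs later, when the
        # compositing engine asks for its inputs.
        callbacks.append(lambda engine: engine.upload(node["path"]))
        return Instr("sample:" + node["name"])
    if node["type"] == "Over":
        return Instr("over", *(traverse(c, callbacks) for c in node["inputs"]))

graph = {"type": "Over", "inputs": [
    {"type": "ReadImage", "name": "ReadImage1", "path": "fg.exr"},
    {"type": "ReadImage", "name": "ReadImage2", "path": "bg.exr"},
]}

callbacks = []
tree = traverse(graph, callbacks)
assert tree.op == "over" and len(callbacks) == 2
```

Separating the "what to compute" (instructions) from the "how to feed it data" (callbacks) is what lets the same tree be retargeted to GLSL or OpenCL later.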
Example cmp Node Graph
Example Instruction Tree
Callbacks
[Diagram: ReadImage1 callback, ReadImage2 callback, and RGB1 callback feeding the compositing engine]
Callbacks (cont.)
[Diagram: the same ReadImage1, ReadImage2, and RGB1 callbacks, continued]
Instruction Tree (cont.)
● Generic representation of the low-level operations that need to be done
● When working interactively, converted to GLSL
● When working on the render farm, converted to OpenCL
Instruction Tree (GLSL)
uniform sampler2D ReadImage1;
uniform sampler2D ReadImage2;
uniform vec4 RGB1;
varying vec2 v0000;

void main(void)
{
    vec4 t0001 = texture2D(ReadImage1, v0000);
    vec4 t0002 = t0001 + (texture2D(ReadImage2, v0000) * (1.0 - clamp(t0001.w, 0.0, 1.0)));
    gl_FragColor = vec4(t0002.xyz, clamp(t0002.w, 0.0, 1.0)) * RGB1;
}
Per-Pixel Expressions
● Instruction tree nodes can be created not only from the dependency graph but also from Crom's expression language
● Allows for fast per-pixel expressions!
  sample(ReadImage1.output, vec2(sin(pos.x), pos.y + cos(pos.x)))
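Compiling an expression into instruction-tree nodes can be sketched with Python's `ast` module standing in for Crom's actual expression parser; the tuple-based instruction encoding here is invented for the example.

```python
import ast

def to_instr(node):
    """Lower a parsed expression into (op, *children) instruction tuples."""
    if isinstance(node, ast.BinOp):
        op = {ast.Add: "add", ast.Mult: "mul", ast.Sub: "sub"}[type(node.op)]
        return (op, to_instr(node.left), to_instr(node.right))
    if isinstance(node, ast.Call):
        # Function calls (sample, sin, cos, ...) become ops directly.
        return (node.func.id, *[to_instr(a) for a in node.args])
    if isinstance(node, ast.Attribute):
        # e.g. ReadImage1.output: a reference to another node's output.
        return ("input", f"{node.value.id}.{node.attr}")
    if isinstance(node, ast.Name):
        return ("var", node.id)
    if isinstance(node, ast.Constant):
        return ("const", node.value)
    raise ValueError(f"unsupported expression node: {node!r}")

expr = "sample(ReadImage1.output, x + cos(x))"
tree = to_instr(ast.parse(expr, mode="eval").body)
print(tree)
# ('sample', ('input', 'ReadImage1.output'),
#  ('add', ('var', 'x'), ('cos', ('var', 'x'))))
```

Once the expression is in the same instruction-tree form as the node graph's output, the same GLSL/OpenCL code generation applies to both.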
Lazy Programmers
● cmp node library only has around 50 nodes
  ● These define low-level operations (cmp.Add, cmp.Translate, cmp.Crop, cmp.Text)
● Most nodes are user defined via "macro" nodes!
Macro Node (cmp.Gamma)
Macro Node (cmp.Gamma)
Macro Nodes
● Benefit of macro nodes is that they produce an instruction tree without the user writing any C++ / Python
● Macro nodes can be just as fast as built-in nodes
● Custom interfaces can be created, via the interface builder or Python, that are indistinguishable from built-in interfaces
● Macro nodes usually contain other macro nodes
  ● Production scripts contain well over 250k nodes
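A macro node is, conceptually, just a wiring of lower-level nodes that expands into the same instruction tree a built-in would produce. As an illustrative stand-in (not Crom's actual cmp.Gamma network), a gamma macro might expand to a single pow of the image by 1/gamma:

```python
def low_level(op, *args):
    """Stand-in for a built-in cmp node that emits one instruction."""
    return (op, *args)

def gamma_macro(image, gamma):
    # Expands to pow(image, 1/gamma). The expansion is itself a small node
    # network, and macros may nest other macros the same way.
    return low_level("pow", image, low_level("div", ("const", 1.0), gamma))

tree = gamma_macro(("input", "ReadImage1"), ("const", 2.2))
assert tree == ("pow", ("input", "ReadImage1"),
                ("div", ("const", 1.0), ("const", 2.2)))
```

Because the expansion bottoms out in the same instruction primitives as built-ins, the generated shader is identical to hand-written node code, which is why macros lose no speed.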
GPU Saturation
● Dependency graph traversal produces hundreds of GPU API calls
● When scrubbing controls, commands build up in the GPU
● Easy to saturate the GPU with tens of thousands of commands from a simple gesture
● GUI quickly becomes unresponsive as the GPU tries to process the given commands
● A cornerstone of the Crom platform is that sub-tasks can be interrupted/canceled
  ● Allows for fast feedback
● GPU APIs do not support canceling commands
Dispatch Queue
● Crom uses a global GPU dispatch queue
● All compute communication with the GPU happens on a single context/thread pair
● Compute threads locally queue commands
● Locally queued commands are enqueued to the global queue in logical batches
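The batching-plus-cancellation scheme can be sketched with a plain Python queue. This is a simplified model, not Crom's implementation: compute threads submit batches tagged with a task token, and the single dispatch thread drops batches whose task was canceled before they reached the GPU.

```python
import queue
import threading

class Task:
    """Cancellation token attached to every batch a compute thread submits."""
    def __init__(self):
        self.canceled = False

global_queue = queue.Queue()
executed = []

def dispatch_thread():
    # The only thread that talks to the GPU context.
    while True:
        item = global_queue.get()
        if item is None:          # shutdown sentinel
            break
        task, batch = item
        if not task.canceled:     # cancellation is checked per batch
            executed.extend(batch)  # stand-in for issuing real GPU calls

def compute(task, name):
    local = [f"{name}:cmd{i}" for i in range(3)]  # locally queued commands
    global_queue.put((task, local))               # enqueued as one batch

t = threading.Thread(target=dispatch_thread)
t.start()
live, dead = Task(), Task()
dead.canceled = True
compute(live, "A")
compute(dead, "B")   # dropped: its task was canceled before dispatch
global_queue.put(None)
t.join()
print(executed)  # ['A:cmd0', 'A:cmd1', 'A:cmd2']
```

Checking the token only at batch boundaries is what makes it safe to skip work without tearing a half-issued GPU object apart, which mirrors the "don't interrupt object creation/population" caveat on the next slide.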
Dispatch Queue Observations
● Global queue throttles commands to ensure the GPU driver's command buffer is not too deep
● Commands in the global dispatch queue can be interrupted
● Easy to support "native kernels" in the OpenGL backend
● GPU throughput is not optimal, but the overall system is more responsive
● Tricky to handle errors in the dispatch queue
● Must be careful not to interrupt object creation/population commands that are needed by later commands
● Single context/thread pair helps avoid nasty driver bugs
GPU Limitations
● In practice the GPU has several limitations:
  ● Memory
  ● Uniforms
  ● Varyings
  ● Image Units
  ● Instructions
Instruction Tree Splitting
● The instruction tree tells us:
  ● Memory requirements
  ● Uniform requirements
  ● Varying requirements
  ● Number of input images
  ● Estimate of instructions needed
● We break up the instruction tree into smaller sub-trees that "fit" on the GPU
● Use multiple shader/kernel invocations to composite the image
● Sub-tree output can be cached
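Splitting on one budget (instruction count) can be sketched as follows; real splitting would weigh all the limits listed above. The tuple encoding, cost model, and `passN` naming are invented for this example.

```python
MAX_INSTRUCTIONS = 2  # illustrative GPU limit; real budgets are far larger

def split(tree, passes):
    """tree = (op, *children) or a leaf name; returns (node, instr_count).
    Subtrees over budget are emitted as separate passes whose (cacheable)
    output feeds the parent as a plain input."""
    if not isinstance(tree, tuple):  # leaf input: a texture read, cost ~0
        return tree, 0
    op, *children = tree
    new_children, count = [], 1      # 1 instruction for this op itself
    for c in children:
        node, n = split(c, passes)
        new_children.append(node)
        count += n
    node = (op, *new_children)
    if count > MAX_INSTRUCTIONS:     # doesn't fit: cut into its own pass
        passes.append(node)
        return f"pass{len(passes) - 1}", 0
    return node, count

passes = []
root, _ = split(("mul", ("add", "a", ("add", "b", ("add", "c", "d"))), "e"),
                passes)
passes.append(root)  # the remainder is the final pass
print(passes)
# [('add', 'a', ('add', 'b', ('add', 'c', 'd'))), ('mul', 'pass0', 'e')]
```

Each emitted pass becomes one shader/kernel invocation, and because a pass's output is addressed by name, it is a natural unit to cache and reuse.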
Questions?
Nathan Cournia
[email protected]