Crom - CPU/GPU Hybrid Computation Platform for Visual Effects
Nathan Cournia, Casey Vanover, Bill Spitzak, Hans Rijpkema, Josh Tomlinson, Bradley Smith, Nathan Litke
Rhythm and Hues Studios
Who We Are
Motivation
● Modernize lighting/compositing workflows
● Unify user experience
● Workflow evolved across four proprietary packages
● Streamline pipeline
Look Development(Lighthouse)
Light Placement(Voodoo)
Scene Lighting(Lighthouse)
Render(Wren)
LightCmp(Icy)
Requirements
● Rethink our software designed up to 25 years ago
● Multiple cores, multiple GPUs, international locations, cloud
● Decouple interface from computation engines
● Seamless integration with other software:
● Pipelines: R+H, Shotgun, etc.
● Renderers: R+H, Mantra, etc.
● User extensible:
● C++
● Python (new nodes, Qt interfaces)
● Interface builder / Visual Programming
● Easily share networks / interfaces
Main Idea
● Crom is a VFX platform
VFX Platform
● Look Development
● Scene Lighting
● Compositing
● Misc. Tools
General Design
● Core data structure is a dependency graph
● Data passed between dependency graph nodes is strongly typed
● Dependency graph is stateless
● Can hook up anything to anything else
Stateless Nodes
● Multiple threads can traverse the graph in parallel
● "Global" state is passed up the dependency graph in a "Context / Request" object
● Multiple frames, tiles, layers, etc. can be concurrently computed
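A minimal Python sketch of the idea on this slide, not Crom's actual API: the `Context`, `Node`, `Constant`, `FrameNumber` and `Add` classes are hypothetical names chosen for illustration. Nodes keep only their wiring, and everything request-specific travels in an immutable context object, so several frames can be evaluated on different threads without locking.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Hypothetical immutable request handed up the graph (frame, tile, ...)."""
    frame: int
    tile: tuple = (0, 0)


class Node:
    """A node stores only its wiring; per-request state lives in the Context."""

    def __init__(self, *inputs):
        self.inputs = inputs

    def compute(self, ctx):
        raise NotImplementedError


class Constant(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value

    def compute(self, ctx):
        return self.value


class FrameNumber(Node):
    def compute(self, ctx):
        return ctx.frame


class Add(Node):
    def compute(self, ctx):
        # Pull from upstream nodes, handing the same context up the graph.
        return sum(node.compute(ctx) for node in self.inputs)


graph = Add(FrameNumber(), Constant(10))

# No node is mutated during evaluation, so several frames can be
# computed concurrently on the same graph.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda f: graph.compute(Context(frame=f)), range(1, 5)))
```

Because `Context` is frozen, a node that needs a different tile or layer upstream would build a new context rather than modify the one it received.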
Data
● Data passed between nodes is stored in a "property graph"
● Data representation is decoupled from programming interface
● An interface, i.e. Adapter/Wrapper, can be placed onto a property graph to define an object
● A property graph can be adapted to provide multiple interfaces
● Copy-on-write semantics allow for sharing of data
● Heuristics to place subsets of data into a persistent cache
● Property graph is dynamically user extensible yet strongly typed
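The bullets above can be sketched roughly as follows, assuming a very simple key/value model. `PropertyGraph`, `ImageAdapter` and `AspectAdapter` are hypothetical names, and the real property graph is certainly richer than a flat dictionary. The sketch shows typed values, copy-on-write updates that share untouched data, and two adapters reading the same underlying data.

```python
class PropertyGraph:
    """Hypothetical typed key/value store with copy-on-write updates."""

    def __init__(self, values=None, types=None):
        self._values = dict(values or {})
        self._types = dict(types or {})

    def get(self, key):
        return self._values[key]

    def with_value(self, key, value):
        """Return a new graph; the original is never modified."""
        # A new key takes the type of its first value (user extensible);
        # an existing key must keep its type (strongly typed).
        expected = self._types.get(key, type(value))
        if not isinstance(value, expected):
            raise TypeError(f"{key!r} expects {expected.__name__}")
        new_values = dict(self._values)  # shallow copy: untouched values are shared
        new_values[key] = value
        new_types = dict(self._types)
        new_types[key] = expected
        return PropertyGraph(new_values, new_types)


class ImageAdapter:
    """One interface placed onto a property graph."""

    def __init__(self, props):
        self._props = props

    @property
    def width(self):
        return self._props.get("width")

    @property
    def height(self):
        return self._props.get("height")


class AspectAdapter:
    """A second interface over the same data."""

    def __init__(self, props):
        self._props = props

    @property
    def ratio(self):
        return self._props.get("width") / self._props.get("height")


pixels = [0.0] * 4  # placeholder for a large pixel buffer
base = (PropertyGraph()
        .with_value("width", 1920)
        .with_value("height", 1080)
        .with_value("pixels", pixels))
edited = base.with_value("width", 960)  # base is unchanged; pixels are shared
```

Here `edited` and `base` refer to the same `pixels` list, which is the sharing that copy-on-write semantics make safe.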
VFX Compositor
● Compositor: Assembles multiple images into one or more final images.
● Example: Nuke
GPU/CPU Compositor
● Crom implements a hybrid GPU/CPU compositor
● Dependency graph traversal produces two main items in the property graph:
● Instruction Tree: Low-level operations to be performed
● Data Callbacks: Objects that will be invoked to populate the compositing engine with data from the dependency graph
Example cmp Node Graph
Example Instruction Tree
Callbacks
ReadImage1Callback
ReadImage2Callback
RGB1Callback
Callbacks (cont.)
ReadImage1Callback
ReadImage2Callback
RGB1Callback
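One way to picture the data callbacks named on these slides, as a hedged Python sketch: each callback is an object the engine invokes to fetch the data for one named input. `FakeEngine`, `read_image_callback`, `color_callback` and the file names are all hypothetical; the slides do not show the real callback interface.

```python
class FakeEngine:
    """Stand-in for the compositing engine; it only records what gets bound."""

    def __init__(self):
        self.bound = {}

    def bind(self, name, value):
        self.bound[name] = value


def read_image_callback(path):
    """Build a callback that would load an image when the engine asks for it."""
    def populate(engine, name):
        # Placeholder string instead of decoded pixels.
        engine.bind(name, f"<pixels of {path}>")
    return populate


def color_callback(rgba):
    """Build a callback that supplies a constant color."""
    def populate(engine, name):
        engine.bind(name, rgba)
    return populate


# Callbacks collected during graph traversal (hypothetical file names).
callbacks = {
    "ReadImage1": read_image_callback("fg.exr"),
    "ReadImage2": read_image_callback("bg.exr"),
    "RGB1": color_callback((1.0, 0.5, 0.5, 1.0)),
}

# Data is pulled only when the engine is ready to run the instruction tree.
engine = FakeEngine()
for name, populate in callbacks.items():
    populate(engine, name)
```

Deferring the loads this way keeps traversal cheap: nothing is read from disk until the engine actually needs it.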
Instruction Tree (cont.)
● Generic representation of low-level operations that need to be done.
● When working interactively, converted to GLSL.
● When working on the render farm, converted to OpenCL.
Instruction Tree (GLSL)
uniform sampler2D ReadImage1;
uniform sampler2D ReadImage2;
uniform vec4 RGB1;
varying vec2 v0000;
void main(void)
{
    vec4 t0001 = texture2D(ReadImage1, v0000);
    vec4 t0002 = t0001 + (texture2D(ReadImage2, v0000) * (1 - clamp(t0001.w, 0, 1)));
    gl_FragColor = (vec4(t0002.xyz, clamp(t0002.w, 0, 1)) * RGB1);
}
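To illustrate how a generic instruction tree might be turned into a shader like the one above, here is a small Python sketch. The `Sample`, `Uniform` and `BinOp` node types and the `emit_glsl` function are assumptions made for this example; Crom's real tree and code generator are not shown in the slides, and this sketch emits a simpler expression than the generated shader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    image: str  # name of a sampler2D input


@dataclass(frozen=True)
class Uniform:
    name: str  # name of a vec4 uniform


@dataclass(frozen=True)
class BinOp:
    op: str  # "+", "-" or "*"
    left: object
    right: object


def emit_expr(node):
    """Translate one tree node into a GLSL expression string."""
    if isinstance(node, Sample):
        return f"texture2D({node.image}, uv)"
    if isinstance(node, Uniform):
        return node.name
    if isinstance(node, BinOp):
        return f"({emit_expr(node.left)} {node.op} {emit_expr(node.right)})"
    raise TypeError(f"unknown instruction: {node!r}")


def collect(node, samplers, uniforms):
    """Gather sampler and uniform names in first-use order, without repeats."""
    if isinstance(node, Sample):
        if node.image not in samplers:
            samplers.append(node.image)
    elif isinstance(node, Uniform):
        if node.name not in uniforms:
            uniforms.append(node.name)
    elif isinstance(node, BinOp):
        collect(node.left, samplers, uniforms)
        collect(node.right, samplers, uniforms)


def emit_glsl(root):
    """Emit a complete fragment shader for the tree."""
    samplers, uniforms = [], []
    collect(root, samplers, uniforms)
    lines = [f"uniform sampler2D {name};" for name in samplers]
    lines += [f"uniform vec4 {name};" for name in uniforms]
    lines += [
        "varying vec2 uv;",
        "void main(void) {",
        f"    gl_FragColor = {emit_expr(root)};",
        "}",
    ]
    return "\n".join(lines)


tree = BinOp("*", BinOp("+", Sample("ReadImage1"), Sample("ReadImage2")), Uniform("RGB1"))
shader = emit_glsl(tree)
```

Since the tree itself is backend-neutral, a second emitter walking the same nodes could produce OpenCL for the render farm.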
Per-Pixel Expressions
● Instruction tree nodes can be created not only from the dependency graph but also from Crom's expression language
● Allows for fast per-pixel expressions!
sample(ReadImage1.output, vec2(sin(pos.x), pos.y + cos(pos.x)))
Lazy Programmers
● cmp node library only has around 50 nodes
● Define low-level operations (cmp.Add, cmp.Translate, cmp.Crop, cmp.Text)
● Most nodes are user defined via "macro" nodes!
Macro Node (cmp.Gamma)
Macro Nodes
● Benefit of macro nodes is that they produce an Instruction tree without the user writing any C++ / Python
● Macro nodes can be just as fast as built-in nodes
● Custom interfaces can be created that are indistinguishable from built-in interfaces via the interface builder or Python
● Macro nodes usually contain other macro nodes
● Production scripts contain well over 250k nodes
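A rough Python sketch of why macro nodes cost nothing extra at run time: a macro is only a recipe that wires up primitive operations, so what reaches the GPU is the same kind of instruction tree a built-in node would produce. The tuple-based tree, the helper names, and the assumption that gamma is implemented as `pow(x, 1/gamma)` are all illustrative; the actual contents of cmp.Gamma are not given in this transcript.

```python
# Primitive instructions, written as plain tuples: (op, *children).
def const(value):
    return ("const", value)


def image(name):
    return ("input", name)


def pow_(base, exponent):
    return ("pow", base, exponent)


def mul(a, b):
    return ("mul", a, b)


def gamma_macro(src, gamma):
    """Hypothetical gamma macro: one primitive pow with a constant exponent."""
    return pow_(src, const(1.0 / gamma))


def grade_macro(src, gain, gamma):
    """A macro that contains another macro."""
    return gamma_macro(mul(src, const(gain)), gamma)


def primitive_ops(tree):
    """Collect the set of primitive op names used anywhere in a tree."""
    ops = {tree[0]}
    for child in tree[1:]:
        if isinstance(child, tuple):
            ops |= primitive_ops(child)
    return ops


tree = grade_macro(image("ReadImage1"), gain=2.0, gamma=2.0)
```

After expansion, `tree` contains only primitive instructions; no trace of the macro boundaries remains.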
GPU Saturation
● Dependency graph traversal produces hundreds of GPU API calls
● When scrubbing controls, commands build up in the GPU
● Easy to saturate GPU with tens of thousands of commands with a simple gesture
● GUI quickly becomes unresponsive as GPU tries to process given commands
● A cornerstone of the Crom platform is that sub-tasks can be interrupted/canceled
● Allows for fast feedback
● GPU APIs do not support canceling commands
Dispatch Queue
● Crom uses a global GPU dispatch queue
● All compute communication with the GPU happens on a single context/thread pair
● Compute threads locally queue commands
● Locally queued commands are enqueued to global queue in logical batches
Dispatch Queue Observations
● Global queue throttles commands to ensure GPU driver's command buffer is not too deep
● Commands in global dispatch queue can be interrupted
● Easy to support "native kernels" in OpenGL backend
● GPU throughput is not optimal, but the overall system is more responsive
● Tricky to handle errors in dispatch queue
● Must be careful not to interrupt object creation/population commands that are needed for later commands
● Single context/thread pair helps avoid nasty driver bugs
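The dispatch queue described on the last two slides could be sketched in Python along these lines. This is an illustration under assumptions, not Crom's implementation: `Batch`, `DispatchQueue`, the `cancellable` flag and the `max_pending` limit are hypothetical, and plain Python callables stand in for GPU commands.

```python
import queue
import threading


class Batch:
    """A logical group of commands queued locally by one compute thread."""

    def __init__(self, cancellable=True):
        self.commands = []
        # Object creation/population batches would be marked non-cancellable
        # so that later commands can still rely on them.
        self.cancellable = cancellable
        self.cancelled = threading.Event()

    def add(self, command):
        self.commands.append(command)

    def cancel(self):
        if self.cancellable:
            self.cancelled.set()


class DispatchQueue:
    """Global queue drained by the single thread that owns the GPU context."""

    def __init__(self, max_pending=8):
        # Bounded queue: submit() blocks when full, which throttles producers.
        self._queue = queue.Queue(maxsize=max_pending)
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, batch):
        self._queue.put(batch)

    def shutdown(self):
        self._queue.put(None)  # sentinel: stop after the pending batches
        self._thread.join()

    def _run(self):
        while True:
            batch = self._queue.get()
            if batch is None:
                break
            for command in batch.commands:
                if batch.cancelled.is_set():
                    break  # drop the rest of an interrupted batch
                command()


log = []
gpu = DispatchQueue()

stale = Batch()
stale.add(lambda: log.append("stale draw"))
stale.cancel()  # e.g. the user moved the slider again

upload = Batch(cancellable=False)
upload.add(lambda: log.append("upload texture"))
upload.cancel()  # ignored: later commands depend on this upload

fresh = Batch()
fresh.add(lambda: log.append("fresh draw"))

for batch in (stale, upload, fresh):
    gpu.submit(batch)
gpu.shutdown()
```

Cancellation happens in the application's own queue, before commands reach the driver, which is how the design works around GPU APIs that cannot cancel submitted work.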
GPU Limitations
● In practice the GPU has several limitations:
● Memory
● Uniforms
● Varyings
● Image Units
● Instructions
Instruction Tree Splitting
● The instruction tree tells us:
● Memory requirements
● Uniform requirements
● Varying requirements
● Number of input images
● Estimate of instructions needed
● We break up the instruction tree into smaller sub-trees that "fit" on the GPU
● Use multiple shader/kernel invocations to composite image
● Sub-tree output can be cached
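A simplified Python sketch of the splitting step, tracking only one resource (image units) for brevity. The `Op` type and the greedy strategy in `split_tree` are assumptions for illustration; the slides do not describe the actual heuristic, which would also have to weigh memory, uniforms, varyings and instruction counts.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    inputs: list = field(default_factory=list)  # child Ops
    images: int = 0  # image units this op reads directly


def split_tree(root, max_images):
    """Split a tree into passes that each read at most max_images images.

    Returns a list of passes in execution order; each pass is a list of
    op names. A cut sub-tree is read back by its parent as one image.
    """
    passes = []

    def visit(node):
        parts = [visit(child) for child in node.inputs]  # (ops, images) per child
        total = node.images + sum(images for _, images in parts)
        while total > max_images:
            # Cut the most expensive child into its own pass.
            index = max(range(len(parts)), key=lambda i: parts[i][1], default=None)
            if index is None or parts[index][1] <= 1:
                raise ValueError(f"{node.name} cannot fit in {max_images} image units")
            ops, images = parts[index]
            passes.append(ops)
            parts[index] = ([], 1)  # now a single intermediate image
            total -= images - 1
        merged = [op for ops, _ in parts for op in ops] + [node.name]
        return merged, total

    ops, _ = visit(root)
    passes.append(ops)
    return passes


def read(name):
    return Op(name, images=1)


tree = Op("over", [
    Op("merge_a", [read("read1"), read("read2")]),
    Op("merge_b", [read("read3"), read("read4")]),
])

two_passes = split_tree(tree, max_images=3)
one_pass = split_tree(tree, max_images=4)
```

With a limit of three image units the tree needs two shader invocations; with four it fits in one. The intermediate image written by the first pass is the kind of sub-tree output that could be cached.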
Questions?
Nathan Cournia