Grass, Fur and all things hairy
Nicolas Thibieroz, Gaming Engineering Manager, AMD
Karl Hillesland, Senior Research Engineer, AMD
Next-gen Grass, Fur and Hair
● The time for next-gen quality is now
● Tomb Raider pioneered next-gen hair
  ● Even on PS4/XB1
● Users expect this level of quality for next-gen titles
● You need to start thinking about this
● This talk is about making high-quality fur, grass and hair run at real-time performance
TressFX applied to Grass, Fur and Hair
● Variations of the same technique can be used for all those applications
● In all cases the core principles of next-gen quality are still needed:
  ● Compute simulations
  ● Anti-aliasing
  ● Transparency
  ● Volumetric self-shadowing
  ● A good lighting model
Forward Rendering Pipeline – a refresher
● Consists of three steps:
  ● Hair simulation
  ● Shade and store fragments into buffers
  ● Fetch shaded fragments, sort and render
Per-Pixel Linked Lists
● Head UAV
  ● Each pixel location has a “head pointer” to a linked list in the PPLL UAV
● PPLL UAV
  ● As new fragments are rendered, they are added to the next open location in the PPLL (using UAV counter)
  ● A link is created to the fragment pointed to by the head pointer
  ● Head pointer then points to the new fragment

// Retrieve current pixel count and increase counter
uint uPixelCount = LinkedListUAV.IncrementCounter();
uint uOldStartOffset;

// Exchange indices in LinkedListHead texture corresponding to pixel location
InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset);

// Append new element at the end of the Fragment and Link Buffer
Element.uNext = uOldStartOffset;
LinkedListUAV[uPixelCount] = Element;

[Diagram: Head UAV and PPLL UAV]
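The insertion steps above can be mirrored on the CPU to make the data flow easier to follow. The following is only a minimal single-threaded sketch in Python, not the shader itself: the class name, the `END_OF_LIST` sentinel and the choice to drop fragments when the pool is full are assumptions made for illustration, and the two atomic operations (`IncrementCounter`, `InterlockedExchange`) are replaced by plain assignments.

```python
from dataclasses import dataclass

END_OF_LIST = 0xFFFFFFFF  # assumed sentinel meaning "no next node"


@dataclass
class Node:
    depth: float
    data: int
    next: int  # index of the next node in the pool, or END_OF_LIST


class PerPixelLinkedLists:
    """CPU stand-in for the Head UAV + PPLL UAV pair (illustrative only)."""

    def __init__(self, width, height, max_nodes):
        self.width = width
        self.head = [END_OF_LIST] * (width * height)  # plays the role of the Head UAV
        self.nodes = [None] * max_nodes               # plays the role of the PPLL UAV
        self.counter = 0                              # plays the role of the UAV counter

    def insert(self, x, y, depth, data):
        # Stand-in for IncrementCounter(): reserve the next open node.
        index = self.counter
        if index >= len(self.nodes):
            return False  # pool exhausted: this fragment goes missing
        self.counter += 1
        # Stand-in for InterlockedExchange(): new node becomes the head,
        # and the previous head is returned.
        address = y * self.width + x
        old_head = self.head[address]
        self.head[address] = index
        # Link the new node to the previous head and store it.
        self.nodes[index] = Node(depth, data, old_head)
        return True

    def fragments(self, x, y):
        """Walk one pixel's list, newest fragment first."""
        out = []
        i = self.head[y * self.width + x]
        while i != END_OF_LIST:
            node = self.nodes[i]
            out.append((node.depth, node.data))
            i = node.next
        return out
```

Walking a pixel's list returns fragments in reverse submission order, which is why the later pass has to sort them by depth.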
Forward Rendering Pipeline – a refresher: Hair Simulation
[Diagram: input geometry (model space) → CS → CS → CS → post-simulation geometry (UAV, world space); simulation parameters feed the compute shaders]
Forward Rendering Pipeline – a refresher: Shade and Store Fragments into Buffers
[Diagram: VS → PS. The VS extrudes world-space line segments into non-indexed triangles in homogeneous clip space. The PS computes coverage, lighting and shadows, then stores each fragment (depth, color, coverage, next) in the PPLL UAV and updates the Head UAV. Outputs: null RT, stencil.]
Forward Rendering Pipeline – a refresher: Fetch Shaded Fragments, Sort and Render
[Diagram: full screen quad → VS → PS. The PS reads the Head UAV and PPLL UAV under the stencil mask, performs fragment sorting and manual blending, and writes to the render target.]
Forward Rendering Performance
● Main cost in forward rendering mode is in the shading part
  ● All fragments are lit and shadowed before being stored
  ● PPLL storing is typically not the bottleneck!
● Don’t need maximum quality on all fragments
  ● “Tail” fragments need only “good enough” quality
● Solution: use shader LOD
Forward vs Deferred Rendering Pipeline
Deferred rendering pipeline
● Hair simulation
● Store fragment properties into buffers
● Fetch fragment properties, sort, shade and render
  ● Full shading on K frontmost fragments
  ● “Tail” fragments are shaded with a simpler light equation and shadowing algorithm
Forward rendering pipeline
● Hair simulation
● Full shading and store fragments into buffers
● Fetch shaded fragments, sort and render
Deferred Rendering Pipeline: Hair Simulation – unchanged!
[Diagram: input geometry (model space) → CS → CS → CS → post-simulation geometry (UAV, world space); simulation parameters feed the compute shaders]
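Deferred Rendering Pipeline: Store Fragment Properties into Buffers
[Diagram: VS → PS. The VS reads world-space geometry, is drawn as an indexed triangle list (index buffer) and outputs homogeneous clip space positions. The PS computes coverage and stores each fragment (depth, tangent, coverage, next) in the PPLL UAV and updates the Head UAV. Outputs: null RT, stencil.]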
Deferred Rendering Pipeline: Fetch Fragments, Sort, Shade and Render
[Diagram: full screen quad → VS → PS. The PS reads the Head UAV and PPLL UAV under the stencil mask, applies lighting and shadows, and writes to the render target.]
● K frontmost fragments: full shading, sorting and manual blending
● Tail fragments: cheap shading, no sorting and manual blending
Deferred Rendering Shading LOD Optimization
● Deferred approach allows a reduction in shading cost: “Shader LOD”
  ● Only sort and shade K frontmost fragments at high quality
  ● “Simple” shading and out-of-order rendering on tail fragments
  ● Single-tap shadowing on tail fragments
● Very little quality difference compared to full shading
  ● But much better performance!
Technique                    Cost
Out of order, no shading     1.31 ms
Out of order, shading        2.80 ms
Forward PPLL, shading        3.38 ms
Deferred PPLL, shading       2.13 ms

Fur model with ~130,000 fur strands, running on AMD Radeon 7970 @ 1080p
[Slide annotations: Shading cost is ~1.5 ms; PPLL cost is ~0.58 ms; Fast!; Full quality shading forced on for all fragments; Shading LOD]
Geometry Optimizations
● A great portion of time was spent in the GPU front-end
  ● 920,000 line segments for fur model
● Expansion from line segments to triangles was done in GS and then VS with Draw()
  ● Each segment would create a quad (two triangles) with 6 vertices
DrawIndexed() method
[Diagram: line segments expanded into quads that share vertices between neighbouring segments]
Indexed triangle list = { ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ), ( … ) };

Draw() method
[Diagram: line segments expanded into quads, each quad using its own 6 vertices]
Triangle list = { ( 0, 1, 2 ), ( 3, 4, 5 ), ( 6, 7, 8 ), ( 9, 10, 11 ), ( … ) };
●Offline creation of index buffer plus DrawIndexed() maximizes post vertex cache use!
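The indexed triangle list on the DrawIndexed() slide follows a regular pattern, so the index buffer can be generated offline in a few lines. A minimal Python sketch, assuming each strand vertex is expanded into two quad corners numbered 2i and 2i+1 (the function names are hypothetical):

```python
def quad_strip_indices(num_segments):
    """Index list for one strand: two triangles per line segment,
    sharing the two corner vertices between neighbouring segments."""
    indices = []
    for s in range(num_segments):
        v = 2 * s  # first corner of this segment's quad
        indices += [v, v + 1, v + 2,   # first triangle
                    v + 2, v + 1, v + 3]  # second triangle
    return indices


def vertex_counts(num_segments):
    """(vertices shaded with indexing, vertices shaded with plain Draw())."""
    indexed = 2 * (num_segments + 1)  # each corner is referenced by index
    non_indexed = 6 * num_segments    # every triangle carries its own 3 vertices
    return indexed, non_indexed
```

For two segments this reproduces the list from the slide, (0, 1, 2), (2, 1, 3), (2, 3, 4), (4, 3, 5). The indexed count is the best case in which every shared corner is served from the post-transform vertex cache.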
Distance-based LOD System Optimization
● Input line segments have a random order
● Just render fewer (but thicker) fragments when far away!
● Needs shading adjustments to ensure smooth quality transitions
● Increase alpha threshold for fragment inclusion when far away
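One way to read "fewer but thicker" is sketched below in Python. The slides do not give the actual formula, so everything here is an assumption: the linear ramp between `lod_start` and `lod_end`, the `min_density` floor, and the rule that thickness scales with the inverse of density to keep overall coverage roughly constant. Because the input segments are stored in random order, drawing only the first fraction of them gives an even thinning of the model.

```python
def strand_lod(distance, lod_start, lod_end, min_density=0.25):
    """Return (fraction of strands to draw, strand thickness multiplier).

    All parameters are hypothetical tuning values, not from the talk.
    """
    if lod_end <= lod_start:
        raise ValueError("lod_end must be greater than lod_start")
    if not 0.0 < min_density <= 1.0:
        raise ValueError("min_density must be in (0, 1]")
    # 0 at lod_start (full detail), 1 at lod_end (lowest detail)
    t = (distance - lod_start) / (lod_end - lod_start)
    t = min(max(t, 0.0), 1.0)
    density = 1.0 + t * (min_density - 1.0)
    # Assumed compensation: thicker strands as fewer are drawn
    thickness_scale = 1.0 / density
    return density, thickness_scale
```

With the defaults, a model closer than `lod_start` keeps every strand at normal thickness, and a model beyond `lod_end` draws a quarter of the strands at four times the thickness.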
Other Optimizations
● PPLL Head UAV uses a RWTexture2D instead of a Buffer
  ● Results in more efficient caching for UAV accesses
● Avoid GPR indexing for sorting
  ● Sorting K frontmost fragments required an array of General Purpose Registers with random indexing into it
  ● Used an ALU-based indexing approach to improve performance
● TO DO: compute shader simulation optimizations
  ● Currently a set of multiple compute shaders
  ● Looking at combining some of these, optimizing shaders and output formats
Per-Pixel Linked Lists UAV Memory Considerations
● How much memory is needed?
  ● Guesstimate for a given usage model
  ● Max (hair pixels x average overdraw) fragments
● What happens when I run out?
  ● Missing fragments
● What can be done about it?
k-Buffer in Memory
[Diagram: a PP linked list (PPLL) stores all fragments in a node pool (how big?); a k-buffer is a fixed-size array of k entries per pixel, which gives a simple memory bound]
The Front k
Approximation to avoid massive sorting
● Only sort the front k fragments per-pixel
● Blend the rest out-of-order
If deferring for shader LOD … also
● Full quality shade on front k
● Cheap shade on rest
[Image: depth complexity; 20 frags/pixel (ave), red = over 100]
k is 4, 8, 16
[Diagram: k-buffer and tail]
Can’t know front k until all fragments processed
k-Buffer
For each fragment in each pixel:
● If the new fragment is in k
  1. Swap with furthest
  2. Find new furthest
  3. Blend with tail
● If not in k
  1. Blend with tail
[Diagram: the new fragment is compared against the k-buffer entry at the index of furthest; the resulting tail fragment is blended into the tail color]
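The per-fragment update can be written out as a small CPU reference. This Python sketch is an illustration under several assumptions, not the shader from the talk: colors are single floats, smaller depth means closer, the tail is blended with a plain "over" operator in arrival order onto the background, and the function and parameter names are hypothetical.

```python
def resolve_pixel(fragments, k, background=0.0):
    """fragments: iterable of (depth, color, alpha) in arrival order.
    Keeps the k nearest fragments for sorted blending and blends the
    rest ("tail") out of order."""
    if k < 1:
        raise ValueError("k must be at least 1")

    def blend(dst, color, alpha):
        return color * alpha + dst * (1.0 - alpha)

    kbuf = []         # front-k candidates
    furthest = -1     # index of the furthest entry in kbuf
    tail_color = background

    for frag in fragments:
        if len(kbuf) < k:
            # The first k fragments simply fill the buffer.
            kbuf.append(frag)
            if furthest < 0 or frag[0] > kbuf[furthest][0]:
                furthest = len(kbuf) - 1
            continue
        if frag[0] < kbuf[furthest][0]:
            # In k: swap with furthest, then find the new furthest.
            frag, kbuf[furthest] = kbuf[furthest], frag
            furthest = max(range(k), key=lambda i: kbuf[i][0])
        # Whatever is left in frag is a tail fragment: blend out of order.
        tail_color = blend(tail_color, frag[1], frag[2])

    # Sort the front k back to front and blend them over the tail.
    result = tail_color
    for _depth, color, alpha in sorted(kbuf, key=lambda f: f[0], reverse=True):
        result = blend(result, color, alpha)
    return result
```

When k is at least the number of fragments the result matches a fully sorted blend. With a smaller k the tail is blended in arrival order, so the result can differ, which is the approximation the slides describe.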
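From PPLL to k-Buffer
PPLL:
● For each fragment in each pixel
  ● Write frags to mem
● For each pixel
  ● For each fragment: read fragment from mem, update k-buffer (reg), blend tail fragment (reg)
  ● Sort and blend k-buffer (reg)
k-Buffer in memory:
● For each fragment in each pixel
  ● Update k-buffer (mem)
  ● Blend tail fragment (mem)
● For each pixel
  ● Read k-buffer from mem
  ● Sort and blend k-buffer (reg)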
k-Buffer
[Diagram: a screen width x screen height x k array of entries]
● Entries are 8 bytes each (depth and data)
● PPLL nodes were 12 bytes (depth, data, next)
● k = 4, 8, 16
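The two sizing rules in the slides, "hair pixels x average overdraw" for the PPLL node pool and "screen width x screen height x k" for the k-buffer, are easy to compare with a little arithmetic. A rough Python sketch; the 12-byte and 8-byte sizes come from the slides, while the function names and the example inputs are only illustrative, and the Head UAV and mutex buffers are left out.

```python
PPLL_NODE_BYTES = 12     # depth, data, next (from the slides)
KBUFFER_ENTRY_BYTES = 8  # depth and data (from the slides)


def ppll_pool_bytes(hair_pixels, avg_overdraw):
    """Guesstimate of the PPLL node pool: one node per expected fragment."""
    return hair_pixels * avg_overdraw * PPLL_NODE_BYTES


def kbuffer_bytes(width, height, k):
    """Hard upper bound for a full-screen k-buffer."""
    return width * height * k * KBUFFER_ENTRY_BYTES


if __name__ == "__main__":
    # Hypothetical scenario: 300k hair pixels with an average overdraw of 10.
    print("PPLL pool:", ppll_pool_bytes(300_000, 10), "bytes")
    for k in (4, 8, 16):
        print(f"k-buffer, k={k}:", kbuffer_bytes(1920, 1080, k), "bytes")
```

The PPLL figure is an estimate that can be exceeded at run time (fragments then go missing), whereas the k-buffer figure is fixed by resolution and k regardless of scene content.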
PPLL: 2nd Pass
[Diagram: the k-buffer, index of furthest, new fragment, tail fragment and tail color all live in registers; the tail fragment is blended into the tail color]
k-Buffer in Memory: 1st Pass
[Diagram: the k-buffer and the per-pixel mutex, index, … live in memory; the new fragment updates the k-buffer and the tail fragment is blended into the tail color by the blend unit]
Mutex/Count/Index Buffer
[Diagram: a screen width x screen height buffer with 32 bits per pixel. Fields: mutex bit, initialized bit, max index (4 bits), count (remainder). The diagram marks the high bit.]
Spinlock Mutex
[allow_uav_condition]
for( ; i < MAX_LOOP_COUNT && !bStop; ++i )   // Paranoia: bounded number of attempts
{
    uint oldID;
    // Try: mark the pixel as reserved and fetch its previous value
    InterlockedExchange( tRWMutex[vScreenAddress], RESERVED, oldID );
    if( (oldID & RESERVED) != RESERVED )
    {
        // Do work
        [[ … Do work ]]
        DeviceMemoryBarrier();
        // Release: store the new max index and clear the reserved state
        tRWMutex[vScreenAddress] = (new_max_id << 28) + INITED;
        bStop = true;
    } // end mutex check
} // end spinlock loop
Find New Max Depth
uint new_max_depth = u_inDepth;
[unroll]
for( int t = 0; t < KBUFFER_SIZE; t++ )
{
    uint element_depth = DEPTH( vScreenAddress, t );
    if( element_depth > new_max_depth )
    {
        new_max_depth = element_depth;
        new_max_id = t;
    }
}
Generally more memory traffic than PPLL
Initialization: The first k
Options
● Clear k-buffer fullscreen (0,1)
● Clear k-buffer stenciled, 3rd pass
● Clear on first fragment
● Count
The first k
InterlockedAdd( tRWMutex[vScreenAddress], 1, oldCount );
[allow_uav_condition]
if( oldCount < KBUFFER_SIZE )
{
    DATA( vScreenAddress, oldCount ) = u_inData;
    DEPTH( vScreenAddress, oldCount ) = u_inDepth;
    return uint2( u_outDepth, u_outData );
}
Models
● 2k polygons
● ~20k hairs
● ~130k hairs
Stats
● 2-3.5 M fragments
● 200-300k pixels
Shading
● One point light & shadow
● 2 shifted specular lobes
Depth Complexity
[Image legend: grey = 1, blue = 8, green = 50, red = 100+]
Contention
Max attempts per pixel, k=4
[Image legend: dark blue = 1, aqua <= 4, bright aqua <= 8]
Performance
Time ratio to out-of-order blending
● Forward PPLL: 1.02 to 1.4
● Forward k-Buffer: 1.2 to 1.4
● Deferred PPLL: 0.7 to 0.9
● Deferred k-Buffer: 0.9 to 1.6
K-Buffer in Memory
● Simple memory bound
● Can be less memory
● Usually slower
  ● Increased memory traffic
Simulation
Hair Simulation
● Length Constraint
● Local Constraint
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)
Fur Simulation
● Length Constraint
● Local Constraint
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)
Grass Simulation
● Length Constraint
● Local Constraint (1D)
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)
Constraint Method (iterative)
● Used for length, local and global constraints
● Length is most difficult to converge
  ● Particularly under large movement
[Diagram: a strand of particles p0 … pn-1 joined by constraints C0 … Cn-2]
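The slow convergence of the length constraint is easy to reproduce with a small relaxation loop. The Python sketch below is a generic position-based length projection, offered as an assumption about what "iterative" means here rather than the TressFX compute shader: the root particle is pinned, every other segment splits its correction equally between its two particles, and the function name is hypothetical.

```python
import math


def enforce_length_constraints(points, rest_lengths, iterations=4):
    """Iteratively pull each segment of a strand back to its rest length.

    points: list of [x, y, z]; points[0] is the pinned root.
    rest_lengths: one rest length per segment (len(points) - 1 entries).
    Returns a corrected copy of the points.
    """
    pts = [list(p) for p in points]
    for _ in range(iterations):
        for i in range(len(pts) - 1):
            a, b = pts[i], pts[i + 1]
            d = [b[j] - a[j] for j in range(3)]
            length = math.sqrt(sum(c * c for c in d))
            if length == 0.0:
                continue  # degenerate segment: no direction to correct along
            err = (length - rest_lengths[i]) / length
            if i == 0:
                # Root is pinned, so the child takes the whole correction.
                for j in range(3):
                    b[j] -= d[j] * err
            else:
                # Split the correction between both particles.
                for j in range(3):
                    a[j] += 0.5 * d[j] * err
                    b[j] -= 0.5 * d[j] * err
    return pts
```

For a two-segment strand stretched to twice its rest length, one pass leaves the first segment still at double length, because fixing the second segment pulls the shared particle away again. The error then roughly halves per iteration, which illustrates why large movements need many iterations.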
Tridiagonal Matrix Formulation
● Direct solve for length constraint
● Almost zero stretch
● Limited to smaller time steps (stability)
● Still cheap
● Leverages matrix structure of strands
● Two sweeps of strand
Reference: “Tridiagonal Matrix Formulation for Inextensible Hair Strand Simulation”, VRIPHYS, 2013
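The "two sweeps" cost is what a tridiagonal system gives in general. The slides do not show the matrix entries used in the paper, so the Python sketch below is only the textbook Thomas algorithm for an arbitrary tridiagonal system, with a hypothetical function name; it shows one forward sweep and one backward sweep along the strand, each linear in the number of particles.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    All four lists have length n. Row i reads:
        lower[i] * x[i-1] + diag[i] * x[i] + upper[i] * x[i+1] = rhs[i]
    lower[0] and upper[n-1] are ignored. No pivoting is done, so the
    matrix is assumed to be well conditioned (e.g. diagonally dominant).
    """
    n = len(diag)
    if n == 0:
        return []
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side

    # Sweep 1: forward elimination from root to tip.
    c[0] = upper[0] / diag[0] if n > 1 else 0.0
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Sweep 2: back substitution from tip to root.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Unlike the iterative projection, a direct solve of this kind reaches its answer in a fixed amount of work per strand, which is consistent with the "almost zero stretch" and "still cheap" points above.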
Demos
Summary
● Next-gen look is possible now!
● Deferred Rendering for shading LOD is fastest
● k-buffer in memory is an option for memory-constrained situations
● High-quality grass and fur simulation with compute
Upcoming TressFX 2 SDK sample update with fur scenario at http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/
Questions?
Extras
Isoline Tessellation for hair/fur? 1/2
● Isoline tessellation has two tess factors
  ● First is line density (lines per invocation)
  ● Second is line detail (segments per line)
● In theory provides easy LOD system
  ● Variable line density and detail by increasing both tessellation factors based on distance
[Figure: isolines at Tess = (1,1), (2,1), (2,2), (2,3), (3,3)]
Isoline Tessellation for hair/fur? 2/2
● In practice isoline tessellation is not cost effective for this scenario
● Lines are always 1-pixel thick
  ● Need GS to extrude them into triangles for smooth edges
    ● Major impact on performance!
  ● Alternative is to enable MSAA
    ● Most engines are deferred so this causes a large performance impact
  ● No extrusion for smoothing edges and no MSAA = poor quality!
● Bottom line: a pure Vertex Shader solution is faster
  ● LOD benefit is easily done in VS (more on this later)
  ● Curvature is rarely a problem (dependent on vertices/strands at authoring time)
AA, Self-shadowing and Transparency
[Image comparison: basic rendering; antialiasing; antialiasing + self-shadowing; antialiasing + self-shadowing + transparency]