Fragment-Parallel Composite and Filter
Anjul Patney, Stanley Tzeng, and John D. Owens
University of California, Davis
Parallelism in Interactive Graphics
• Well expressed in hardware as well as APIs
• Consistently growing in degree and expression
  – More and more cores on upcoming GPUs
  – From programmable shaders to programmable pipelines
• We should rethink algorithms to exploit this
• This paper provides one example
  – Parallelization of the composite/filter stages
A Feed-Forward Rendering Pipeline
Primitives → Geometry Processing → Rasterization → Composite → Filter → Pixels
Composite & Filter
• Input
  – Unordered list of fragments
• Output
  – Pixel colors
• Assumption
  – No fragments are discarded
[Figure: a pixel and its sample locations]
Basic Idea
• Pixel-parallel: assign processors to pixels
  – Insufficient parallelism
  – Irregularity
• Fragment-parallel: assign processors to fragments
Motivation
• Most applications have low depth complexity
  – Pixel-level parallelism is sufficient
• We are interested in applications with
  – Very high depth complexity
  – High variation in depth complexity
• Further
  – Future platforms will demand more parallelism
  – High depth complexity can limit pixel-parallelism
Motivation
[Figure: distribution of depth complexity. Number of subpixels (log scale, 10 to 1,000,000) versus number of depth layers (10 to 730)]
Related Work
Order-Independent Transparency (OIT)
• Depth-Peeling [Everitt 01]
  – One pass per transparent layer
• Stencil-Routed A-buffer [Myers & Bavoil 07]
  – One pass per 8 depth layers (the maximum MSAA samples per pixel)
• Bucket Depth-Peeling [Liu et al. 09]
  – One pass per up to 32 layers (the maximum number of render targets)
Related Work
Order-Independent Transparency (OIT)
• OIT using Direct3D 11 [Gruen et al. 10]
  – Uses per-pixel fragment linked lists
  – Per-pixel sort and composite
• Hair Self-Shadowing [Sintorn et al. 09]
  – Each fragment computes its own contribution
  – Assumes constant opacity
Related Work
Programmable Rendering Pipelines
• RenderAnts [Zhou et al. 09]
  – Sorts fragments globally
  – Per-pixel composite/filter
• FreePipe [Liu et al. 10]
  – Sorts fragments globally
  – Per-pixel composite/filter
Pixel-Parallel Formulation
[Diagram: pixels P(i), P(i+1), P(i+2), each covering subsamples S(j) through S(j+6); thread IDs j through j+6, one thread per subsample]
Fragment-Parallel Formulation
[Diagram: the same pixels and subsamples, but with thread IDs j through j+23, one thread per fragment across all subsamples]
Fragment-Parallel Formulation
• How can this behavior be achieved?
• Revisit the composite equation (fragments 1…N in front-to-back order, background B):

Cs = α1C1 + (1-α1){α2C2 + (1-α2)(…(αNCN + (1-αN)CB)…)}

Expanding the recursion:

Cs = 1·α1·C1
   + (1-α1)·α2·C2
   + (1-α1)(1-α2)·α3·C3 + …
   + (1-α1)(1-α2)…(1-αk-1)·αk·Ck + …
   + (1-α1)(1-α2)…(1-αN)·CB

Each term splits into a local contribution Lk = αk·Ck and a global contribution Gk = (1-α1)(1-α2)…(1-αk-1).
Fragment-Parallel Formulation
• Lk is trivially parallel (a purely local computation)
• Gk is the result of a scan operation (running product)
• For the list of input fragments
  – Compute G[ ] and L[ ], multiply
  – Perform a reduction to add subpixel contributions

Cs = G1·L1 + G2·L2 + G3·L3 + … + GN·LN
Gk = (1-α1)·(1-α2)…(1-αk-1)
Lk = αk·Ck
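As a sanity check, the recursive composite and the flat sum of Gk·Lk terms (plus the background term) can be compared numerically. This is an illustrative sketch, not code from the paper; both function names are hypothetical.

```python
def composite_recursive(frags, c_bg):
    # frags: list of (alpha, color) pairs, front to back.
    if not frags:
        return c_bg
    a, c = frags[0]
    return a * c + (1 - a) * composite_recursive(frags[1:], c_bg)

def composite_flat(frags, c_bg):
    total, g = 0.0, 1.0          # g carries the running product (1-a1)...(1-a_{k-1}), i.e. Gk
    for a, c in frags:
        total += g * a * c       # Gk * Lk, with Lk = ak * Ck
        g *= 1 - a
    return total + g * c_bg      # background term (1-a1)...(1-aN) * CB

frags = [(0.5, 0.8), (0.25, 0.4), (0.9, 0.1)]
assert abs(composite_recursive(frags, 1.0) - composite_flat(frags, 1.0)) < 1e-12
```

The flat form is what makes the fragment-parallel mapping possible: every term Gk·Lk can be computed by an independent thread once the scan has produced Gk.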
Fragment-Parallel Formulation
• Filter, for every pixel, over its M subpixel colors:

Cp = Cs1·κ1 + Cs2·κ2 + … + CsM·κM

• This can be expressed as another reduction
  – After multiplying with the subpixel weights κm
  – Can be merged with the previous reduction
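Since the filter is just a weighted sum over a pixel's subpixel colors, a minimal sketch (hypothetical name, not from the paper) is:

```python
def filter_pixel(subpixel_colors, kappa):
    # Cp = sum over m of Cs_m * kappa_m
    assert len(subpixel_colors) == len(kappa)
    return sum(c * k for c, k in zip(subpixel_colors, kappa))

# A 4-sample box filter: every weight is 1/M.
cp = filter_pixel([0.2, 0.4, 0.6, 0.8], [0.25] * 4)
```

Because this has the same shape as the compositing reduction, the two can share one segmented-reduction pass, as the slide notes.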
Fragment-Parallel Composite & Filter
Final Algorithm
1. Two-key sort (Subpixel ID, depth)
2. Segmented Scan (obtain Gk)
3. Premultiply with weights (Lk, κm)
4. Segmented Reduction
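The four steps above can be sketched serially on the CPU. In the paper each step runs as a data-parallel primitive (CUDPP sort, segmented scan, segmented reduce) on the GPU; every name and data layout below is a hypothetical illustration, assuming fragments arrive as (subpixel_id, depth, alpha, color) tuples.

```python
def composite_and_filter(frags, n_subpixels, kappa, c_bg):
    # 1. Two-key sort: by subpixel ID, then by depth (front to back).
    frags = sorted(frags, key=lambda f: (f[0], f[1]))

    # 2. Segmented scan: exclusive running product of (1 - alpha),
    #    restarting at every subpixel boundary. This yields Gk.
    g = []
    prev_sid, running = None, 1.0
    for sid, _, a, _ in frags:
        if sid != prev_sid:
            running, prev_sid = 1.0, sid
        g.append(running)
        running *= 1 - a

    # 3. Premultiply: Gk * Lk, where Lk = alpha_k * C_k.
    contrib = [gk * a * c for gk, (_, _, a, c) in zip(g, frags)]

    # 4. Segmented reduction (sum) per subpixel, plus the background
    #    term weighted by the leftover transmittance; then one more
    #    weighted reduction with the filter weights kappa.
    sub = [0.0] * n_subpixels
    trans = [1.0] * n_subpixels
    for (sid, _, a, _), gl in zip(frags, contrib):
        sub[sid] += gl
        trans[sid] *= 1 - a
    sub = [s + t * c_bg for s, t in zip(sub, trans)]
    return sum(k * s for k, s in zip(kappa, sub))
```

On the GPU, steps 2 and 4 operate on the whole fragment list at once, so the available parallelism scales with the number of fragments rather than the number of pixels.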
Fragment-Parallel Formulation
[Diagram: the sorted fragment list is segmented per subsample; a segmented scan (product) runs within each segment to obtain Gk, followed by a segmented reduction (sum) that accumulates the subpixel colors]
Implementation
• Hardware used: NVIDIA GeForce GTX 280
• We require fast segmented scan and reduce
  – The CUDPP library provides both
  – This restricts the implementation to NVIDIA CUDA
• No direct access to the hardware rasterizer
  – We wrote our own
Example System – Polygons
• Applications
  – Games
• Depth complexity
  – 1 to a few tens of layers
  – Suited to pixel-parallel
• Fragment-parallel software rasterizer
Example System – Particles
• Applications
  – Simulations, games
• Depth complexity
  – Hundreds of layers
  – High depth variance
• Particle-parallel sprite rasterizer
Example System – Volumes
• Applications
  – Scientific visualization
• Depth complexity
  – Tens to hundreds of layers
  – Low depth variance
• Major-axis-slice rasterizer
Example System – Reyes
• Applications
  – Offline rendering
• Depth complexity
  – Tens of layers
  – Moderate depth variance
• Data-parallel micropolygon rasterizer
Performance Results
[Figure: rendering time (ms, 0 to 600) for the Particles, Volume, Reyes (grass), and Polygon scenes, split into fragment generation, pixel-parallel composite/filter, and fragment-parallel composite/filter]
Performance Variation
[Figure: fragments per second (1e5 to 1e8, log scale) versus depth complexity (0 to 1600), comparing fragment-parallel and pixel-parallel]
Limitations
• Increased memory traffic
  – Several passes through CUDPP primitives
• Unclear how to optimize for special cases
  – Threshold opacity
  – Threshold depth complexity
Summary and Conclusion
• Parallel formulation of the composite equation
  – Maps well to known primitives
  – Can be integrated with the filter
  – Consistent performance across varying workloads
• Fragment-parallel composite is applicable to future rendering pipelines
  – Exploits a higher degree of parallelism
  – Performance tracks the size of the rendering workload
• A tool for building programmable pipelines
Future Work
• Performance
  – Reduce memory traffic
  – Extend to special-case scenes
  – Hybrid pixel-parallel/fragment-parallel formulations
• Applications
  – Integration with the hardware rasterizer
  – Cinematic rendering, Photoshop
Acknowledgments
• NSF Award 0541448
• SciDAC Institute for Ultrascale Visualization
• NVIDIA Research Fellowship
• Equipment donated by NVIDIA
• Discussions and feedback
– Shubho Sengupta (UC Davis), Matt Pharr (Intel), Aaron Lefohn (Intel), Mike Houston (AMD)
– Anonymous reviewers
• Implementation assistance
  – Jeff Stuart, Shubho Sengupta
Thanks!