“Batch, Batch, Batch”
Transcript of “Batch, Batch, Batch”
7/31/2019 “Batch, Batch, Batch”
http://slidepdf.com/reader/full/batch-batch-batch 1/38
“Batch, Batch, Batch:” What Does It Really Mean?
Matthias Wloka
What Is a Batch?
• Every DrawIndexedPrimitive() is a batch
– Submits n number of triangles to GPU
– Same render state applies to all tris in batch
– SetState calls prior to Draw are part of batch
• Assuming efficient use of API
– No Draw*PrimitiveUP()
– DrawPrimitive() permissible if warranted
– No unnecessary state changes
• Changing state means at least two batches
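To make the last bullet concrete, here is a minimal sketch that counts the fewest batches a given submission order allows. It assumes consecutive draws with identical render state could be merged into one call; the `(state, triangle_count)` pair layout and the function name are hypothetical, not part of any graphics API.

```python
from itertools import groupby


def merge_into_batches(draws):
    """Merge consecutive draws that share the same render state.

    `draws` is a list of (state, triangle_count) pairs in submission order.
    Every change of state starts a new batch, so the result is the fewest
    batches this ordering allows.
    """
    batches = []
    for state, run in groupby(draws, key=lambda draw: draw[0]):
        batches.append((state, sum(tris for _, tris in run)))
    return batches


draws = [("A", 10), ("A", 20), ("B", 5), ("A", 1)]
print(merge_into_batches(draws))
```

Sorting draws by state before submission would reduce the count further, at the cost of changing the draw order.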
Why Are Small Batches Bad?
• Games would rather draw 1M objects/batches of 10 tris each
– versus 10 objects/batches of 1M tris each
• Lots of guesses
– Changing state inefficient on GPUs (WRONG)
– GPU triangle start-up costs (WRONG)
– OS kernel transitions (WRONG)
• Future GPUs will make it better!? Really?
Let’s Write Code!
Testing Small Batch Performance
• Test app does…
– Degenerate triangles (no fill cost)
– 100% PostTnL cache vertices (no xform cost)
– Static data (minimal AGP overhead)
– ~100k tris/frame, i.e., floor(100k/x) draws
– Toggles state between draw calls: (VBs, w/v/p matrix, tex-stage and alpha states)
• Timed across 1000 frames
• Theoretical maximum triangle rates!
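The bookkeeping behind such a test can be sketched as follows. This is only the arithmetic, not the test app itself: the function names are hypothetical, and the elapsed time would come from timing real draw calls.

```python
TRIS_PER_FRAME = 100_000  # "~100k tris/frame" from the slide
FRAMES_TIMED = 1000       # "Timed across 1000 frames"


def draws_per_frame(tris_per_batch, tris_per_frame=TRIS_PER_FRAME):
    """Number of draw calls per frame: floor(100k / x)."""
    if tris_per_batch < 1:
        raise ValueError("tris_per_batch must be at least 1")
    return tris_per_frame // tris_per_batch


def triangles_per_second(tris_per_batch, elapsed_seconds, frames=FRAMES_TIMED):
    """Measured triangle rate for one batch size, given the total time."""
    submitted = draws_per_frame(tris_per_batch) * tris_per_batch * frames
    return submitted / elapsed_seconds


print(draws_per_frame(10), draws_per_frame(130), draws_per_frame(1500))
```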
Measured Batch-Size Performance
[Chart: million triangles/s (0 to 100) versus triangles/batch (10 to 1500; the axis scale changes between 190 and 300). Series, all on an Athlon XP 2.7+: NVIDIA GeForce FX 5800, GeForce4 Ti 4600, GeForce3 Ti 500, GeForce4 MX 440, GeForce2 MX/MX 400.]
Optimization Opportunities
[Chart: million triangles/s (0 to 100) versus triangles/batch (10 to 1500; the axis scale changes between 190 and 300). Series, all on an Athlon XP 2.7+: NVIDIA GeForce FX 5800, GeForce4 Ti 4600, GeForce3 Ti 500, GeForce4 MX 440, GeForce2 MX/MX 400. Annotations on the chart: 40x and >100x.]
Measured Batch-Size Performance
[Chart: million triangles/s (0 to 100) versus triangles/batch (10 to 1500; the axis scale changes between 190 and 300). Series, all on an Athlon XP 2.7+: NVIDIA GeForce FX 5800, GeForce4 Ti 4600, GeForce3 Ti 500, GeForce4 MX 440, GeForce2 MX/MX 400.]
• <130 tris/batch:
– App is GPU-independent
– Completely CPU-limited
CPU-Limited?
• Then performance results only depend on
– How fast the CPU is
• Not GPU
– How much data the CPU processes
• Not how many triangles per batch!
• CPU processes draw calls (and SetStates), i.e., batches
• Let’s graph batches/s!
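Re-plotting the same measurements as batches/s is a single division. The sketch below uses hypothetical function names; it also shows why a CPU-limited app sees its triangle rate grow linearly with batch size.

```python
def batches_per_second(triangles_per_second, tris_per_batch):
    """Convert a measured triangle rate into a batch rate."""
    return triangles_per_second / tris_per_batch


def cpu_limited_triangle_rate(batch_rate, tris_per_batch):
    """If the CPU caps the batch rate, triangles/s scales with batch size."""
    return batch_rate * tris_per_batch


# With a fixed batch rate, a 10x larger batch yields 10x the triangle rate.
print(cpu_limited_triangle_rate(60_000, 10), cpu_limited_triangle_rate(60_000, 100))
```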
What To Expect If CPU Limited
[Schematic chart: batches/s versus batch size (triangles/batch), with curves for GPU 1, GPU 2, and GPU 3 on a fast CPU and on a slow CPU.]
Effects of Different CPU Speeds
Two distinct bands, corresponding to different CPU speeds
[Schematic chart: batches/s versus batch size (triangles/batch), with curves for GPU 1, GPU 2, and GPU 3 on a fast CPU and on a slow CPU.]
Effects of Number of Tris/Batch
Straight horizontal lines: batches/s independent of number of triangles per batch
[Schematic chart: batches/s versus batch size (triangles/batch), with curves for GPU 1, GPU 2, and GPU 3 on a fast CPU and on a slow CPU.]
Effects of Different GPUs
Different GPUs perform similarly; slight variations due to different driver paths
[Schematic chart: batches/s versus batch size (triangles/batch), with curves for GPU 1, GPU 2, and GPU 3 on a fast CPU and on a slow CPU.]
Measured Batches Per Second
[Chart: thousand batches/s (0 to 200) versus triangles/batch (10 to 200). Series on an Athlon XP 2.7+: NVIDIA GeForceFX 5800 Ultra, GeForce4 Ti 4600, GeForce3 Ti 500, GeForce4 MX 440, GeForce2 MX/MX 400. Series on a 1GHz Pentium 3: the same five NVIDIA GPUs plus Radeon 9700/9500 SERIES.]
• Athlon XP 2.7+: ~170k batches/s
• 1GHz Pentium 3: ~60k batches/s
• x ~2.7
Side Note: OpenGL Performance
[Chart: thousand batches/s (0 to 200) versus triangles/batch (10 to 200). Series: 1GHz Pentium 3; NVIDIA GeForce4 Ti 4600; OpenGL, and 1GHz Pentium 3; NVIDIA GeForce4 Ti 4600; Direct3D. Annotation between the OpenGL and Direct3D curves: x 1.7-2.3.]
CPU Limited?
• Yes, at < 130 tris/batch (avg) you are
– completely,
– utterly,
– totally,
– 100%
– CPU limited!
• CPU is busy doing nothing but submitting batches!
How ‘Real’ Is Test App?
• Test app only does SetState, Draw, repeat;
– Stays in CPU cache
– No frustum culling, no nothing
– So pretty much best case
• Test app changes arbitrary set of states
– Types of state changes?
– And how many states change?
– Maybe real apps do fewer/better state changes?
Real World Performance
• 353 batches/frame @ 16% 1.4GHz CPU: 26fps
• 326 batches/frame @ 18% 1.4GHz CPU: 25fps
• 467 batches/frame @ 20% 1.4GHz CPU: 25fps
• 450 batches/frame @ 21% 1.4GHz CPU: 25fps
• 700 batches/frame @ 100% (!) 1.5GHz CPU: 50fps
• 1000 batches/frame @ 100% (!) 1.5GHz CPU: 40fps
• 414 batches/frame @ 20% (?) 2.2GHz CPU: 27fps
• 263 batches/frame @ 20% (?) 3.0GHz CPU: 18fps
• 718 batches/frame @ 20% (?) 3.0GHz CPU: 21fps
Normalized Real World Performance
• ~41k batches/s @ 100% of 1GHz CPU
• ~32k batches/s @ 100% of 1GHz CPU
• ~42k batches/s @ 100% of 1GHz CPU
• ~38k batches/s @ 100% of 1GHz CPU
• ~25k batches/s @ 100% of 1GHz CPU
• ~25k batches/s @ 100% of 1GHz CPU
• ~25k batches/s @ 100% of 1GHz CPU
• ~8k batches/s @ 100% of 1GHz CPU
• ~25k batches/s @ 100% of 1GHz CPU
10k–40k batches/s (100% 1GHz CPU)
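The slides do not spell out how the raw measurements were normalized. One reading that reproduces most of the listed values is to scale batches/s linearly by CPU share and clock speed. The sketch below assumes that reading; the function name is hypothetical.

```python
def normalized_batches_per_second(batches_per_frame, fps, cpu_fraction, cpu_ghz):
    """Scale a measurement to 100% of a 1GHz CPU.

    Assumes batch throughput grows linearly with both the CPU share spent
    on submission and the clock frequency (an inferred model, not one the
    slides state).
    """
    return batches_per_frame * fps / (cpu_fraction * cpu_ghz)


# First raw entry: 353 batches/frame @ 16% of a 1.4GHz CPU, 26fps.
print(round(normalized_batches_per_second(353, 26, 0.16, 1.4)))
```

Under this model the two 100% entries come out at roughly 23k and 27k, which the slide lists as ~25k.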
Small Batches Feasible In Future?
• VTune (1GHz Pentium 3 w/ 2 tri/batch):
– 78% driver; 14% D3D; 6% Other32; rest noise
• Driver doing little per Draw/SetState, but
– Little times very large multiplier is still large
• Nvidia is optimizing drivers, but…
• Submitting X batches: O(X) work for CPU
– CPU (game, runtime, driver) processes batch
– Can reduce constants but not order O()
GPUs Getting Faster More Quickly Than CPUs
[Chart: GPU MTris, GPU 32-bit AA Fill, and GPU GFlops (GPU axis, 0 to 200) and CPU MHz (CPU axis, 0 to 5000), per half-year from 2H97 to 2H02, for Riva 128, Riva ZX, Riva TNT, TNT2, GeForce, GeForce2, GeForce2 Ultra, GeForce3, GeForce3 Ti, GeForce4 Ti, and GeForce FX.]
• Avg. 18-month CPU speedup: 2.2
• Avg. 18-month GPU speedup: 3.0-3.7
GPUs Continue To Outpace CPUs
• CPU processes batches, thus
– Number of batches/frame MUST scale with:
• Driver/Runtime optimizations
• CPU speed increases
• GPU processes triangles (per batch), thus
– Number of triangles/batch scales with:
• GPU speed increases
• GPUs getting faster more quickly than CPUs
– Batch sizes CAN increase
So, How Many Tris Per Batch?
• 500? 1000? It does not matter!
– Impossible to fit everything into large batches
– A few 2 tris/batch do NOT kill performance!
– N tris/batch: N increases every 6 months
• I am a donut! Ask not how many tris/batch, but rather how many batches/frame!
• You get X batches per frame, depending on:
– Target CPU spec
– Desired frame-rate
– How much % CPU available for submitting batches
You get X batches per frame,
X mainly depends on CPU spec
What is X?
• 25k batches/s @ 100% 1 GHz CPU
– Target: 30fps; 2GHz CPU; 20% (0.2) Draw/SetState:
– X = 333 batches/frame
• Formula: 25k * GHz * Percentage/Framerate
– GHz = target spec CPU frequency
– Percentage = value 0..1 corresponding to CPU percentage available for Draw/SetState calls
– Framerate = target frame rate in fps
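The formula is easy to turn into a budget calculator. A minimal sketch, in which the function and constant names are illustrative and the result is rounded down to whole batches:

```python
import math

BATCHES_PER_SECOND_AT_1GHZ = 25_000  # the slide's rule of thumb


def batch_budget(cpu_ghz, cpu_fraction, target_fps,
                 rate=BATCHES_PER_SECOND_AT_1GHZ):
    """X = 25k * GHz * Percentage / Framerate, in whole batches per frame."""
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    return math.floor(rate * cpu_ghz * cpu_fraction / target_fps)


# The slide's example: 30fps target, 2GHz CPU, 20% for Draw/SetState.
print(batch_budget(cpu_ghz=2.0, cpu_fraction=0.2, target_fps=30))
```

Halving the target frame rate or doubling the CPU share doubles the budget, which is why the budget has to be fixed alongside the target spec.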
Please Hang Over Your Bed
25k batches/s @ 100% 1GHz CPU
How Many Triangles Per Batch?
• Up to you!
– Anything between 1 to 10,000+ tris possible
• If small number, either
– Triangles are large or extremely expensive
– Only GPU vertex engines are idle
• Or
– Game is CPU bound, but don’t care because you budgeted your CPU ahead of time, right?
– GPU idle (available for upping visual quality)
GPU Idle? Add Triangles For Free!
GPU Idle? Complicate Pixel Shaders For Free!
300 Batches Per Frame Sucks
• (Ab)use GPU to pack multiple batches together
• Critical NOW!
– For increasing number of objects in game world
• Will only become more critical in the future
Batch Breaker: Texture Change
• Use all of GeForce FX’s 16 textures
– Fit 8 distinct dual-textured batches into 1 single batch
• Pack multiple textures into 1 surface
– Works as long as no wrap/repeat
– Requires tool support
– Potentially wastes texture space
– Potential problems w/ multi-sampling
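Packing textures into one surface means remapping each mesh's texture coordinates into its sub-rectangle of the shared surface, typically in an offline tool. A minimal sketch of that remap follows; the rectangle layout `(x, y, width, height)` in normalized units and the function name are assumptions for illustration.

```python
def remap_uv(u, v, rect):
    """Map a per-texture (u, v) into a sub-rectangle of a packed surface.

    `rect` is (x, y, width, height) in the packed surface's normalized
    [0, 1] coordinates. Coordinates outside [0, 1] would rely on
    wrap/repeat addressing, which a packed surface cannot provide.
    """
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        raise ValueError("wrap/repeat coordinates cannot be packed")
    x, y, width, height = rect
    return (x + u * width, y + v * height)


# A texture stored in the top-right quarter of the packed surface.
print(remap_uv(0.5, 0.5, (0.5, 0.0, 0.5, 0.5)))
```

Bilinear filtering and multi-sampling can still read texels from a neighboring sub-rectangle near the edges, which is one source of the multi-sampling problems the slide mentions.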
Batch Breaker: Transform Change
• Pre-transform static geometry
– Once in a while
– Video memory overhead: model replication
• 1-Bone matrix palette skinning
– Encode world matrix as 2 float4s
• axis/angle
• translate/uniform scale
– Video memory overhead: model replication
• Data-dependent vertex branching
– Render variable # of bones/lights in one batch
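The two-float4 encoding can be sketched as below. The slide only names the contents (axis/angle and translate/uniform scale); the component layout here (axis in xyz with angle in w, translation in xyz with scale in w) is an assumption. The decode would normally run in a vertex shader, so Python is used purely for illustration.

```python
import math


def encode_world(axis, angle, translation, scale):
    """Pack rotation, translation, and uniform scale into two 4-tuples."""
    ax, ay, az = axis
    length = math.sqrt(ax * ax + ay * ay + az * az)
    rotation = (ax / length, ay / length, az / length, angle)
    tx, ty, tz = translation
    return rotation, (tx, ty, tz, scale)


def transform_point(point, rotation, translate_scale):
    """Apply scale, then rotation (Rodrigues' formula), then translation."""
    kx, ky, kz, angle = rotation
    tx, ty, tz, scale = translate_scale
    x, y, z = (scale * c for c in point)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    dot = kx * x + ky * y + kz * z
    # Cross product of the unit axis k with the scaled point.
    cx, cy, cz = (ky * z - kz * y, kz * x - kx * z, kx * y - ky * x)
    rx = x * cos_a + cx * sin_a + kx * dot * (1.0 - cos_a)
    ry = y * cos_a + cy * sin_a + ky * dot * (1.0 - cos_a)
    rz = z * cos_a + cz * sin_a + kz * dot * (1.0 - cos_a)
    return (rx + tx, ry + ty, rz + tz)


rot, ts = encode_world((0, 0, 1), math.pi / 2, (0.0, 0.0, 5.0), 2.0)
print(transform_point((1.0, 0.0, 0.0), rot, ts))
```

Eight floats per object replace a full world matrix, so many objects' transforms fit in the constant registers of a single batch.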
Batch Breaker: Material Change
• Compute multiple materials in pixel-shaders
– Choose/Interpolate based on
• Per-vertex attribute
• Texture-map
• More performance optimization tips and tricks:
Friday 3:00pm
“Graphics Pipeline Performance”
C. Cebenoyan and M. Wloka
But Only High-End GPUs Have That Feature!?
• Yes, but high-end GPUs most likely CPU-bound
• High-End GPUs most suited to deal with:
– Longer vertex-shaders
– Longer pixel-shaders
– More texture accesses
– Bigger video memory requirements
• To improve batching
But These Things Slow GPU Down!?
• Remember: CPU-limited
– GPU is mostly idle
• Making GPU work, so CPU does NOT
• Overall effect: faster game
25k batches/s @ 100% 1GHz CPU
Acknowledgements
• Many thanks to
Gary McTaggart, Valve
Jay Patel, Blizzard
Tom Gambill, NCSoft
Scott Brown, NetDevil
Guillermo Garcia-Sampedro, PopTop
Questions, Comments, Feedback?
• Matthias Wloka: mw [email protected]
• http://developer.nvidia.com
Can You Afford to Lose These Speed-Ups?
• 2 tris/batch
– Max. of ~0.1 MTriangles/s for 1GHz Pentium 3
• Factor 1500x away from max. throughput
– Max. of ~0.4 MTriangles/s for Athlon XP 2.7+
• Factor 375x away from max. throughput
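Both factors point at the same ceiling. Multiplying each measured rate by its factor gives about 150 MTriangles/s, a value implied by the slide's numbers but not stated on it. A small sketch of that arithmetic, with hypothetical function names:

```python
def implied_max_throughput(measured_tris_per_second, factor):
    """Peak triangle rate implied by a measured rate and its shortfall factor."""
    return measured_tris_per_second * factor


def shortfall_factor(max_tris_per_second, measured_tris_per_second):
    """How many times slower than peak a measured triangle rate is."""
    return max_tris_per_second / measured_tris_per_second


pentium3 = implied_max_throughput(100_000, 1500)  # ~0.1 MTriangles/s, 1500x
athlon = implied_max_throughput(400_000, 375)     # ~0.4 MTriangles/s, 375x
print(pentium3, athlon)
```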