
7/31/2019 “Batch, Batch, Batch”

http://slidepdf.com/reader/full/batch-batch-batch 1/38

“Batch, Batch, Batch:” What Does It Really Mean?

Matthias Wloka

What Is a Batch?

• Every DrawIndexedPrimitive() is a batch

 –  Submits n number of triangles to GPU

 –  Same render state applies to all tris in batch

 –  SetState calls prior to Draw are part of batch

• Assuming efficient use of API

 –  No Draw*PrimitiveUP()

 –  DrawPrimitive() permissible if warranted

 –  No unnecessary state changes

• Changing state means at least two batches
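
The rule that every state change starts a new batch can be illustrated with a small counting sketch. It is only an illustration: the state keys ("A", "B") are hypothetical stand-ins for a full render-state setup, and it assumes that consecutive draws sharing the same state could be merged into one draw call, which is not always possible in a real engine.

```python
from itertools import groupby

def count_batches(state_keys):
    """Count batches for draws submitted in the given order.

    state_keys: one hashable key per object, standing in for its render state
    (hypothetical; a real key would combine textures, matrices, shaders, ...).
    Every run of identical consecutive keys counts as one batch, because a
    state change forces a new batch.
    """
    return sum(1 for _ in groupby(state_keys))

# Alternating states break the frame into one batch per object...
interleaved = ["A", "B", "A", "B"]
# ...while sorting by state leaves one batch per distinct state.
by_state = sorted(interleaved)

print(count_batches(interleaved), count_batches(by_state))
```

Sorting by state is only one way to reduce the count; the later "Batch Breaker" slides describe ways to avoid the state change itself.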

Why Are Small Batches Bad?

• Games would rather draw 1M objects/batches of 10 tris each

 –  versus 10 objects/batches of 1M tris each

• Lots of guesses

 –  Changing state inefficient on GPUs (WRONG)

 –  GPU triangle start-up costs (WRONG)

 –  OS kernel transitions (WRONG)

• Future GPUs will make it better!? Really?

Let’s Write Code! Testing Small Batch Performance

• Test app does…

 –  Degenerate triangles (no fill cost)

 –  100% PostTnL cache vertices (no xform cost)

 –  Static data (minimal AGP overhead)

 –  ~100k tris/frame, i.e., floor(100k/x) draws

 –  Toggles state between draw calls: (VBs, w/v/p matrix, tex-stage and alpha states)

• Timed across 1000 frames

• Theoretical maximum triangle rates!
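
The bookkeeping behind such a test can be sketched as follows. This is not the original test app: the function names are made up for illustration, and the elapsed time in the usage line is a hypothetical number, not a measurement. The sketch only shows how the draw count per frame follows from floor(100k/x) and how batches/s and triangles/s follow from a timing over 1000 frames.

```python
def draws_per_frame(tris_per_frame: int, tris_per_batch: int) -> int:
    """Number of draw calls per frame: floor(tris_per_frame / tris_per_batch)."""
    return tris_per_frame // tris_per_batch

def measured_rates(tris_per_batch: int, elapsed_seconds: float,
                   frames: int = 1000, tris_per_frame: int = 100_000):
    """Return (batches/s, triangles/s) for a run timed over `frames` frames."""
    draws = draws_per_frame(tris_per_frame, tris_per_batch)
    batches_per_second = draws * frames / elapsed_seconds
    triangles_per_second = batches_per_second * tris_per_batch
    return batches_per_second, triangles_per_second

# Hypothetical timing: 1000 frames of 10-triangle batches taking 100 seconds.
print(measured_rates(tris_per_batch=10, elapsed_seconds=100.0))
```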

Measured Batch-Size Performance

[Figure: million triangles/s (0 to 100) versus triangles/batch (10 to 1500, with an axis scale change after 190). One curve per system: Athlon XP 2.7+ with NVIDIA GeForce FX 5800, GeForce4 Ti 4600, GeForce3 Ti 500, GeForce4 MX 440, and GeForce2 MX/MX 400.]

Optimization Opportunities

[Figure: same plot as the previous slide, million triangles/s (0 to 100) versus triangles/batch (10 to 1500, with an axis scale change after 190), for Athlon XP 2.7+ with NVIDIA GeForce FX 5800, GeForce4 Ti 4600, GeForce3 Ti 500, GeForce4 MX 440, and GeForce2 MX/MX 400. Annotations: 40x and >100x.]

Measured Batch-Size Performance

[Figure: million triangles/s (0 to 100) versus triangles/batch (10 to 1500, with an axis scale change after 190), for Athlon XP 2.7+ with NVIDIA GeForce FX 5800, GeForce4 Ti 4600, GeForce3 Ti 500, GeForce4 MX 440, and GeForce2 MX/MX 400.]

<130 tris/batch:

 –  App is GPU-independent

 –  Completely CPU-limited

CPU-Limited?

• Then performance results only depend on

 –  How fast the CPU is

• Not GPU

 –  How much data the CPU processes

• Not how many triangles per batch!

• CPU processes draw calls (and SetStates), i.e., batches

• Let’s graph batches/s!

What To Expect If CPU Limited

[Figure: schematic of batches/s versus batch size (triangles/batch), with curves for GPU 1, GPU 2, and GPU 3 on a fast CPU and on a slow CPU.]

Effects of Different CPU Speeds

Two distinct bands, corresponding to different CPU speeds

[Figure: schematic of batches/s versus batch size (triangles/batch), with curves for GPU 1, GPU 2, and GPU 3 on a fast CPU and on a slow CPU.]

Effects of Number of Tris/Batch

Straight horizontal lines: batches/s independent of number of triangles per batch

[Figure: schematic of batches/s versus batch size (triangles/batch), with curves for GPU 1, GPU 2, and GPU 3 on a fast CPU and on a slow CPU.]

Effects of Different GPUs

Different GPUs perform similarly; slight variations due to different driver paths

[Figure: schematic of batches/s versus batch size (triangles/batch), with curves for GPU 1, GPU 2, and GPU 3 on a fast CPU and on a slow CPU.]

Measured Batches Per Second

[Figure: thousands of batches/s (0 to 200) versus triangles/batch (10 to 200). Systems plotted: Athlon XP 2.7+ with NVIDIA GeForceFX 5800 Ultra, GeForce4 Ti 4600, GeForce3 Ti 500, GeForce4 MX 440, and GeForce2 MX/MX 400; 1GHz Pentium 3 with NVIDIA GeForceFX 5800 Ultra, GeForce4 Ti 4600, GeForce3 Ti 500, GeForce4 MX 440, GeForce2 MX/MX 400, and Radeon 9700/9500 SERIES.]

• Athlon XP 2.7+: ~170k batches/s

• 1GHz Pentium 3: ~60k batches/s

• x ~2.7

Side Note: OpenGL Performance

[Figure: thousands of batches/s (0 to 200) versus triangles/batch (10 to 200) for a 1GHz Pentium 3 with NVIDIA GeForce4 Ti 4600, one curve for OpenGL and one for Direct3D. Annotation between the two curves: x 1.7-2.3.]

CPU Limited?

• Yes, at < 130 tris/batch (avg) you are

 –  completely,

 –  utterly,

 –  totally,

 –  100%

 –  CPU limited!

• CPU is busy doing nothing, but submitting batches!

How ‘Real’ Is Test App?

• Test app only does SetState, Draw, repeat;

 –  Stays in CPU cache

 –  No frustum culling, no nothing

 –  So pretty much best case

• Test app changes arbitrary set of states

 –  Types of state changes?

 –  And how many states change?

 –  Maybe real apps do fewer/better state changes?

Real World Performance

• 353 batches/frame @ 16% 1.4GHz CPU: 26fps

• 326 batches/frame @ 18% 1.4GHz CPU: 25fps

• 467 batches/frame @ 20% 1.4GHz CPU: 25fps

• 450 batches/frame @ 21% 1.4GHz CPU: 25fps

• 700 batches/frame @ 100% (!) 1.5GHz CPU: 50fps

• 1000 batches/frame @ 100% (!) 1.5GHz CPU: 40fps

• 414 batches/frame @ 20% (?) 2.2GHz CPU: 27fps

• 263 batches/frame @ 20% (?) 3.0GHz CPU: 18fps

• 718 batches/frame @ 20% (?) 3.0GHz CPU: 21fps

Normalized Real World Performance

• ~41k batches/s @ 100% of 1GHz CPU

• ~32k batches/s @ 100% of 1GHz CPU

• ~42k batches/s @ 100% of 1GHz CPU

• ~38k batches/s @ 100% of 1GHz CPU

• ~25k batches/s @ 100% of 1GHz CPU

• ~25k batches/s @ 100% of 1GHz CPU

• ~25k batches/s @ 100% of 1GHz CPU

• ~8k batches/s @ 100% of 1GHz CPU

• ~25k batches/s @ 100% of 1GHz CPU

10k – 40k batches/s (100% 1GHz CPU)
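
The slides do not spell out how the real-world numbers were normalized. A plausible reading, assumed here, is linear scaling: batches per second divided by the CPU fraction spent on submission and by the clock speed in GHz. The sketch below applies that assumed formula to the per-game figures from the "Real World Performance" slide; the function name is illustrative.

```python
def normalized_batches_per_second(batches_per_frame: float, fps: float,
                                  cpu_fraction: float, cpu_ghz: float) -> float:
    """Scale a measurement to batches/s at 100% of a 1GHz CPU (assumed linear)."""
    return batches_per_frame * fps / (cpu_fraction * cpu_ghz)

# (batches/frame, fps, CPU fraction, GHz) as listed on the slide.
measurements = [
    (353, 26, 0.16, 1.4), (326, 25, 0.18, 1.4), (467, 25, 0.20, 1.4),
    (450, 25, 0.21, 1.4), (700, 50, 1.00, 1.5), (1000, 40, 1.00, 1.5),
    (414, 27, 0.20, 2.2), (263, 18, 0.20, 3.0), (718, 21, 0.20, 3.0),
]

for sample in measurements:
    print(f"{sample}: ~{normalized_batches_per_second(*sample) / 1000:.0f}k batches/s")
```

Under this assumption most entries reproduce the slide's rounded values; the two 100% CPU entries come out near 23k and 27k, which the slide lists as ~25k.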

Small Batches Feasible In Future?

• VTune (1GHz Pentium 3 w/ 2 tri/batch):

 –  78% driver; 14% D3D; 6% Other32; rest noise

• Driver doing little per Draw/SetState, but

 –  Little times very large multiplier is still large

• Nvidia is optimizing drivers, but…

• Submitting X batches: O(X) work for CPU

 –  CPU (game, runtime, driver) processes batch

 –  Can reduce constants but not order O()

GPUs Getting Faster More Quickly Than CPUs

[Figure: timeline from 2H97 to 2H02 covering Riva 128, Riva ZX, Riva TNT, TNT2, GeForce, GeForce2, GeForce2 Ultra, GeForce3, GeForce3 Ti, GeForce4 Ti, and GeForce FX. Series: GPU MTris, GPU 32-bit AA Fill, GPU GFlops (GPU axis, 0 to 200) and CPU MHz (CPU axis, 0 to 5000).]

• Avg. 18-month CPU speedup: 2.2

• Avg. 18-month GPU speedup: 3.0-3.7

GPUs Continue To Outpace CPUs

• CPU processes batches, thus

 –  Number of batches/frame MUST scale with:

• Driver/Runtime optimizations

• CPU speed increases

• GPU processes triangles (per batch), thus

 –  Number of triangles/batch scales with:

• GPU speed increases

• GPUs getting faster more quickly than CPUs

 –  Batch sizes CAN increase

So, How Many Tris Per Batch?

• 500? 1000? It does not matter!

 –  Impossible to fit everything into large batches

 –  A few 2 tris/batch do NOT kill performance!

 –  N tris/batch: N increases every 6 months

• I am a donut! Ask not how many tris/batch, but rather how many batches/frame!

• You get X batches per frame, depending on:

 –  Target CPU spec

 –  Desired frame-rate

 –  How much % CPU available for submitting batches

You get X batches per frame,

X mainly depends on CPU spec

What is X?

• 25k batches/s @ 100% 1GHz CPU

 –  Target: 30fps; 2GHz CPU; 20% (0.2) Draw/SetState:

 –  X = 333 batches/frame

• Formula: 25k * GHz * Percentage / Framerate

 –  GHz = target spec CPU frequency

 –  Percentage = value 0..1 corresponding to CPU percentage available for Draw/SetState calls

 –  Framerate = target frame rate in fps
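
The budget formula above translates directly into a few lines of code. The function and constant names are illustrative; the 25k constant and the example target (30fps, 2GHz, 20%) come from the slide.

```python
# Rule of thumb from the slide: 25k batches/s at 100% of a 1GHz CPU.
BATCHES_PER_SECOND_PER_GHZ = 25_000

def batch_budget(cpu_ghz: float, cpu_fraction: float, target_fps: float) -> float:
    """Batches per frame: 25k * GHz * Percentage / Framerate.

    cpu_ghz:      target spec CPU frequency in GHz
    cpu_fraction: share of CPU time (0..1) available for Draw/SetState calls
    target_fps:   target frame rate in frames per second
    """
    return BATCHES_PER_SECOND_PER_GHZ * cpu_ghz * cpu_fraction / target_fps

# The slide's example target: 30fps on a 2GHz CPU with 20% for submission.
print(int(batch_budget(cpu_ghz=2.0, cpu_fraction=0.2, target_fps=30)))  # 333
```

Because the budget is linear in each input, halving the frame-rate target or doubling the CPU share doubles the number of batches available per frame.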

Please Hang Over Your Bed

25k batches/s @ 100% 1GHz CPU

How Many Triangles Per Batch?

• Up to you!

 –  Anything between 1 to 10,000+ tris possible

• If small number, either

 –  Triangles are large or extremely expensive

 –  Only GPU vertex engines are idle

• Or

 –  Game is CPU bound, but don’t care because you budgeted your CPU ahead of time, right?

 –  GPU idle (available for upping visual quality)

GPU Idle? Add Triangles For Free!

GPU Idle? Complicate Pixel Shaders For Free!

300 Batches Per Frame Sucks

• (Ab)use GPU to pack multiple batches together

• Critical NOW!

 –  For increasing number of objects in game world

• Will only become more critical in the future

Batch Breaker: Texture Change

• Use all of GeForce FX’s 16 textures

 –  Fit 8 distinct dual-textured batches into 1 single batch

• Pack multiple textures into 1 surface

 –  Works as long as no wrap/repeat

 –  Requires tool support

 –  Potentially wastes texture space

 –  Potential problems w/ multi-sampling
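
Packing several textures into one surface means each mesh's texture coordinates must be remapped into its sub-rectangle, which is also why wrap/repeat addressing stops working. The following is a minimal sketch of that remapping as an offline tool might do it; `AtlasRect` and `remap_uv` are hypothetical names, and the atlas layout in the usage line is made up.

```python
from typing import NamedTuple, Tuple

class AtlasRect(NamedTuple):
    """Sub-rectangle of the packed surface, in normalized [0, 1] coordinates."""
    u0: float      # left edge inside the atlas
    v0: float      # top edge inside the atlas
    width: float   # fraction of the atlas width occupied
    height: float  # fraction of the atlas height occupied

def remap_uv(u: float, v: float, rect: AtlasRect) -> Tuple[float, float]:
    """Map a texture coordinate of the original texture into the atlas."""
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        # Coordinates outside [0, 1] rely on wrap/repeat, which would sample
        # a neighboring texture once everything shares one surface.
        raise ValueError("wrap/repeat coordinates cannot be packed into an atlas")
    return rect.u0 + u * rect.width, rect.v0 + v * rect.height

# Hypothetical layout: this texture occupies the top-right quarter of the atlas.
top_right = AtlasRect(u0=0.5, v0=0.0, width=0.5, height=0.5)
print(remap_uv(0.5, 0.5, top_right))  # (0.75, 0.25)
```

A real tool would typically also leave padding between sub-rectangles so that filtering does not bleed across texture borders; that is omitted here.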

Batch Breaker: Transform Change

• Pre-transform static geometry

 –  Once in a while

 –  Video memory overhead: model replication

• 1-Bone matrix palette skinning

 –  Encode world matrix as 2 float4s

• axis/angle

• translate/uniform scale

 –  Video memory overhead: model replication

• Data-dependent vertex branching

 –  Render variable # of bones/lights in one batch
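
The "2 float4s" encoding can be made concrete with a CPU-side sketch of the decode a vertex shader would have to perform. The exact packing is not given on the slide; this sketch assumes one float4 holds (axis.x, axis.y, axis.z, angle in radians) and the other holds (translate.x, translate.y, translate.z, uniform scale), and it applies scale, then rotation, then translation. All names are illustrative.

```python
import math

def decode_world_matrix(axis_angle, translate_scale):
    """Expand two float4s into a 3x4 row-major world matrix (assumed packing)."""
    ax, ay, az, angle = axis_angle
    tx, ty, tz, scale = translate_scale
    length = math.sqrt(ax * ax + ay * ay + az * az)
    if length == 0.0:
        # Degenerate axis: treat as "no rotation".
        x, y, z, angle = 0.0, 0.0, 1.0, 0.0
    else:
        x, y, z = ax / length, ay / length, az / length
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    # Rodrigues' rotation formula written out as a 3x3 matrix.
    rotation = [
        [t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c],
    ]
    translation = (tx, ty, tz)
    # Uniform scale folds into the rotation; translation is the fourth column.
    return [[scale * r for r in row] + [translation[i]]
            for i, row in enumerate(rotation)]

def transform_point(matrix, point):
    """Apply a 3x4 matrix to a 3D point (column-vector convention)."""
    px, py, pz = point
    return tuple(row[0] * px + row[1] * py + row[2] * pz + row[3] for row in matrix)

# 90 degrees about +Z, scaled by 2, moved by (1, 2, 3).
world = decode_world_matrix((0.0, 0.0, 1.0, math.pi / 2), (1.0, 2.0, 3.0, 2.0))
print(transform_point(world, (1.0, 0.0, 0.0)))
```

Eight floats per object instead of a full matrix is what lets many objects' transforms fit into shader constants for a single batch, at the cost of restricting transforms to rotation, translation, and uniform scale.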

Batch Breaker: Material Change

• Compute multiple materials in pixel-shaders

 –  Choose/Interpolate based on

• Per-vertex attribute

• Texture-map

• More performance optimization tips and tricks:

Friday 3:00pm

“Graphics Pipeline Performance”, C. Cebenoyan and M. Wloka

But Only High-End GPUs Have That Feature!?

• Yes, but high-end GPUs most likely CPU-bound

• High-End GPUs most suited to deal with:

 –  Longer vertex-shaders

 –  Longer pixel-shaders

 –  More texture accesses

 –  Bigger video memory requirements

• To improve batching

But These Things Slow GPU Down!?

• Remember: CPU-limited

 –  GPU is mostly idle

• Making GPU work, so CPU does NOT

• Overall effect: faster game

25k batches/s @ 100% 1GHz CPU

Acknowledgements

• Many thanks to

Gary McTaggart, Valve

Jay Patel, Blizzard

Tom Gambill, NCSoft

Scott Brown, NetDevil

Guillermo Garcia-Sampedro, PopTop

Questions, Comments, Feedback?

• Matthias Wloka: mw [email protected]

• http://developer.nvidia.com

Can You Afford to Lose These Speed-Ups?

• 2 tris/batch

 –  Max. of ~0.1 MTriangles/s for 1GHz Pentium 3

• Factor 1500x away from max. throughput

 –  Max. of ~0.4 MTriangles/s for Athlon XP 2.7+

• Factor 375x away from max. throughput
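
The figures on this slide can be cross-checked with simple arithmetic. When an app is CPU-limited, triangle throughput is the batch rate times the batch size. The sketch below uses the ~60k and ~170k batches/s from the "Measured Batches Per Second" slide; the peak throughput it derives is inferred from the slide's own factors, not stated in the deck.

```python
def triangle_rate(batches_per_second: int, tris_per_batch: int) -> int:
    """Triangles/s when submission is CPU-limited: batch rate times batch size."""
    return batches_per_second * tris_per_batch

def implied_peak(triangles_per_second: int, factor_from_peak: int) -> int:
    """Peak throughput implied by being `factor_from_peak` times below it."""
    return triangles_per_second * factor_from_peak

# 2 tris/batch at the measured batch rates (slide rounds to ~0.1M and ~0.4M).
print(triangle_rate(60_000, 2), triangle_rate(170_000, 2))

# Both slide factors point at the same implied peak of 150 MTriangles/s.
print(implied_peak(100_000, 1500), implied_peak(400_000, 375))
```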