A Low-Power Handheld GPU using Logarithmic Arithmetic and Triple DVFS Power Domains Byeong-Gyu Nam,...
-
Upload
claude-wood -
Category
Documents
-
view
221 -
download
0
description
Transcript of A Low-Power Handheld GPU using Logarithmic Arithmetic and Triple DVFS Power Domains Byeong-Gyu Nam,...
A Low-Power Handheld GPU using Logarithmic Arithmetic and
Triple DVFS Power Domains
Byeong-Gyu Nam, Jeabin Lee, Kwanho Kim, Seung Jin Lee, and Hoi-Jun Yoo
Outline• Backgrounds• Proposed Handheld GPU
• Low-Power Vertex Shader• Low-Power Rendering Engine• Triple-Domain DVFS
• Evaluation Results• Conclusion
Motivations• Handheld 3D Graphics Systems
• Low-Power long-battery lifetime• Small-Area limited-footprint• High-Performance more realistic scenes
• Previous Works• [Lindholm et al., SIGGRAPH01] Conventional
Vertex Shader Architecture• [Kameyama et al., GH03] Low-cost, limited performance• [Sohn et al., GH04] Fixed-Point Vertex Shader• [Yu et al., ISSCC06] High-Performance Floating-Point
Vertex Shader
Contributions• Handheld GPU using Logarithmic Arithmetic
• Power-, area-efficient, and high-throughput graphics pipeline
• Vertex shader using multi-function unit• Merges matrix, vector, and elementary functions in a
single-arithmetic 4-way unit
• Handheld GPU with Triple-domain DVFS• Separate control of each power domain
according to its workload• Minimize power consumption for given workload
Logarithmic Arithmetic• Various Math Operations for 3DG Pipeline
• Matrix-vector multiplication• Vector operations• Elementary functionsToo complicated for resource limited handsets
• Logarithmic Number System (LNS)• Reduces arithmetic complexity• Conversion errors
• Handheld 3D Graphics• Small screen tolerates a little computation error Logarithmic arithmetic can be used
2 2
2 2
2
(log log )
(log log )
log
2 2
2 2
2 2
x y X Y
x y X Y
y xy y X
x y
x y
x
MAT,VMUL, VMAD, VDOT, VDIV, VDSQ,
POW, LOG, SIN, COS, ATAN
Dynamic Voltage Frequency Scaling• Power Consumption is Proportional to
• Square of supply voltage• Operating frequency Dynamic scaling supply voltage and frequency
is important for low-power design• DVFS in module-wise manner
• Can assign the optimal frequency and voltage to each module [Sheaffer et al. GH04]• cf) Chip-wise DVFS cannot exploit dynamic workload
characteristic of each module
2ddP C V f
Overall Architecture
• Integrates RISC, VS, and RE• Logarithmic arithmetic based pipeline• Triple power domains with DVFS• 3 PMUs for management and FIFO for communication
RISCVertex Shader
(VS)
Rendering Engine
(RE)VIB
PMU
ClkVdd
PMU PMU
ClkVdd ClkVdd
External asynchronous SRAM
RISC Domain
VS Domain
RE Domain
VOB
IndexFIFO
MatrixFIFO
Application Geometry Rendering
Vertex Shader• Multi-function Unit
• Unifies matrix, vector, and elementary ftn.
• Single-cycle throughput for vector and elem. ftn.
• 2-cycle Transformation• Matrix-vector
multiplication using LNS• 4-cycle in conventional
way• Operand Forwarding
• Log-domain forwarding
InstructionMemory
(2KB)
GeneralPurposeRegister(512B)
Instruction Decode and C
ontrol
ConstantMemory
(4KB)
Vertex InputBuffer(512B)
SWZ
VOR(256B)
Multifunctionunit
IndexFIFO(64B)
SWZ SWZ
VertexFetch
MatrixFIFO(4KB)
Matrix
Number System
• Logarithmic converter • Antilogarithmic Converter
15-entry LUT
aim+bi
x
m>>ci m>>di m>>ei
CPA**
CSA*
CSA*
ci di ei bi
8-entry LUT
ai f+bi
x
f>>ci f>>di
CSA*
CPA**
CSA*
ci di bi
( )log log ( )log ( )
2 2
2
2 11
1
e
i i
x mx e m
m a m b
2 2 2
2
x e f f
fi i
e
a f b
• Hybrid Number System (HNS)• Complex op. in LNS• Linear add, sub in FLP• Number Converters based on
piecewise linear approximation
FLP*(add,sub)
LNS**(complex op)
HNS
+
* FLP: Floating-Point Number System** LNS: Logarithmic Number System
* CSA: Carry Save Adder** CPA: Carry Propagate Adder
Matrix-Vector Multiplication00 01 02 03 00 030 01 02
10 11 12 13 10 131 11 120 1 2 3
20 21 22 23 2 20 21 22 23
3 31 3230 31 32 33 30 33
c c c c c cx c cc c c c c cx c c
x x x xc c c c x c c c c
x c cc c c c c c
• MAT requires 16 multiplications• MAT in HNS requires 20 LOGCs, 16 ADDs, 16 ALOGCs• Fixed matrix coeffis. for an object 4 LOGCs, 16 ADDs, 16 ALOGCs• 2-phase implementation requires 2 LOGCs, 8 ADDs, 8 ALOGCs / phase
2 00 2 032 01 2 022 10 2 132 11 2 12
2 0 2 1 2 22 20 2 232 21 2 222 30 2 31 2 32 2 33
log loglog loglog loglog loglog log loglog loglog loglog log log log2 2 2 2
c cc cc cc cx x xc cc cc c c c
2 3log x
MAT 1st phaseMAT 2nd phase
* PMUL: programmable multiplier
to Channel1,2,3
x0x2
log2 c00log2 c02
log2 c01log2 c03
PMUL*
Cha
nnel
0
AD
D
adde
rtr
ee
LOG
LUT
LOGC
Cha
nnel
1C
hann
el 2
Cha
nnel
3
Boo
th E
nc.
log2x1, for 1st phase, log2x3, for 2nd phase (from LOGC of Channel1 )
ALO
GLU
TLO
GLU
T
E1 E2
adde
rtr
ee
AD
D
AD
D
AC
C
E3 E4 E5
adde
rtr
ee
ALO
GLU
T
ALOGC
>>
PADD
Vector Operations• Defined as a Generic Operation
• Vector MUL,DIV,DSQ,MAD• Converted into HNS
• 2 LOGCs are required / Channel
{0,1,2,3}
where { , }, { , }, {0.5,1}
pi i i i
p
x y z
2 2log log
{0,1,2,3}
where { , }, {0,1}
2 i ix y qii
q
z
x0
y0
z0
PMUL
Cha
nnel
0
AD
D
adde
rtr
ee
LOG
LUT
LOGC
Cha
nnel
1C
hann
el 2
Cha
nnel
3
Boo
th E
nc.
ALO
GLU
TLO
GLU
T
E1 E2
adde
rtr
ee
AD
D
AD
D
AC
C
E3 E4 E5
adde
rtr
ee
ALO
GLU
T
ALOGC
>>
PADD*
* PADD: programmable adder
Channel 0 Channel 1 Channel 2 Channel 3
+
+
+
+
Elementary Functions• Requires Power Ftns. Evaluations
• POW, Taylor Series for Trigonometric (SIN,COS,ATAN...)• Converted into HNS
• Booth multiplier is required0 31 2 4
0 1 2 3 4
where { , }, and are coefficients
k kk k k
c ki i
c x c x c x c x c x
3 3 21 1 2 2 2 2 4 4 20 loglog log log0 [2 2 2 2 ]C k xC k x C k x C k xkc x
xyxy y
x 22 loglog 22
x0
y0
bias
PMUL
Cha
nnel
0
AD
D
adde
rtr
ee
LOG
LUT
LOGC
Cha
nnel
1C
hann
el 2
Cha
nnel
3
Boo
th E
nc.
ALO
GLU
TLO
GLU
TE1 E2
adde
rtr
ee
AD
D
AD
D
AC
C
E3 E4 E5
adde
rtr
ee
ALO
GLU
T
ALOGC
>>
PADD
Operand Forwarding in Log-domaint
c
op1
op2
op1
op2
LogAlog
Log
Log
Log
Alog
Log
Log
Log
Alog
Cancelled out
• Avoids back-to-back antilog and log converters• Forward intermediate log value to next instruction• Pipeline latency reduction• Computation error reduction
• Divisions required in RE are implemented as subtractors• SIMD Interpolators in TSE and RST• SIMD Divider in Texture Unit
Rendering EngineVe
rtex
buffe
r
Vert
ex S
hade
r Triangle SetupEngine (TSE)
Rasterizer(RST)
IntplIntpl
Texture Unit(TEX)
DIV
Pixel pipeline
2 2 2 2 2log ,log ,log ,log log, , , / 2 s t u v ws t u v w
LOGConv
SUB
LOGConv
LOGConv
LOGConv
ALOGConv
ALOGConv
ALOGConv
LOGConv
ALOGConv
SUBSUBSUB
s t u v w
s/w t/w u/w v/w
Triple-Domain DVFS
PMU
FIFO FIFO
PMU PMU
Objects Vertices Pixels
RISC VS RE• Workloads for each module in 3D pipeline are
different• Objects for RISC, vertices for VS, pixels for RE
• Separate power-domains with DVFS• Managed separately according to its workload
Performance Evaluations• Environment Setup
• Accuracy simulation• C libraries for HNS & FLP ops.
• Performance measurement• Cycle count evaluator
• Hardware model for graphics processor core• Fully synthesizable structural HDL model
3D application
Hardware model
Cycle countevaluator
Power managementmodel
HNS library FLP library
Accuracy Comparison
from FLP from HNS
• FLP and HNS libraries• QVGA Test Screens• Tr&Lighting , Tr&Texturing• Tolerable image difference
Performance Comparison
• Test vectors• Standard OpenGL TnL• O-N and C-T for advanced diffuse and specular lighting
• Comparison results• 19.1%, 49.3%, and 23.9% cycle count reduction for OpenGL, O-N,
and C-T models from latest work [YCK*06]
OpenGL Oren-Nayar Cook-Torrance0
20
40
60
80
100
120
140
160
180
Cyc
le c
ount
s
[AYH*04][SWY04][KCY*06][YCK*06]
This work
Implementation
RISC D$I$
Matrix FIFO
VertexShader
CMEM
VOB
Rendering Engine
IMEMVIB
DMEM
RISCPMU
VSPMU
REPMU
• Process Technology• TSMC 0.18um CMOS 1P
6M• Area
• 17.2mm2
• 393K gates, 29KB SRAM• Processing Speed
• 5.26 Mvertices/s • 50Mpixels/s
• Power Consumption• 52.4mW @ 60fps• 153mW @ full speed
• PMU Specifications• 1.0 – 1.8V, 89 – 200MHz• 0.45mm2, 5mW
Power Consumption
• DVFS for VS according to the workload• Higher workload makes a droop in FIFO entry level• FIFO droop increases Freq and Vdd of VS• Increased VS speed restores FIFO entry level
• For test model with 5,878 triangles, 5 DC, and OpenGL TnL• GPU consumes 52.4mW @ 60fps
11
0
180
16
1.789
1.0
50 Triangle size
FIFO entry level
Freq. of VS
Voltage of VS
# of
pixe
ls#
ofen
trie
sC
lkVd
d
Handheld GPUs
• Compared with the latest work [YCK*06]• 2.47x performance improvement• 50.5% power reduction• 38.5% area reduction• 23.5% Kvertices/MHz improvement
[KKF*03]
[SWY04]
Thiswork
[YCK*06]
Functions
GE+RE
GE
GE+RE
GE
Performance(Mvertices/s)
0.185
7.2
5.26
2.13
mW @ fullspeed
38
115
153(87 for GE)
157
Process /Freq.
0.18um /30MHz
0.13um /400MHz
0.18um /200MHz
0.18um /100MHz
Area(Kgates)
360
230
393(242 for GE)
375
Kvertices/s/MHz
6.17
18
26.3
21.3
Conclusion• A Handheld GPU for Low-Power Wireless
Applications• 2.47x performance improvement• 50.5% power reduction• 38.5% area reduction
• 3D Graphics Pipeline based on Logarithmic Arithmetic• 5.26Mvertices/s for OpenGL TnL
• Triple power domains with DVFS• 52.4mW @ 60fps