Lecture 2 Benchmarks, Economics, and Technology

Lecture 2 Benchmarks, Economics, and Technology EEC 171 Parallel Architectures John Owens UC Davis

Transcript of Lecture 2 Benchmarks, Economics, and Technology

Page 1: Lecture 2 Benchmarks, Economics, and Technology

Lecture 2: Benchmarks, Economics, and Technology

EEC 171 Parallel Architectures

John Owens, UC Davis

Page 2: Lecture 2 Benchmarks, Economics, and Technology

Credits

• © John Owens / UC Davis 2007–9.

• Thanks to many sources for slide material: Computer Organization and Design (Patterson & Hennessy) © 2005, Computer Architecture (Hennessy & Patterson) © 2007, Inside the Machine (Jon Stokes) © 2007, © Dan Connors / University of Colorado 2007, © Wen-Mei Hwu/David Kirk, University of Illinois 2007, © David Patterson / UCB 2003–6, © Mary Jane Irwin / Penn State 2005, © John Kubiatowicz / UCB 2002, © Krste Asanovic/Arvind / MIT 2002, © Morgan Kaufmann Publishers 1998.

Page 3: Lecture 2 Benchmarks, Economics, and Technology

Outline

• Benchmarks

• Economics

• Technology: In a power-constrained world, we can’t keep increasing performance by increasing clock speed.

• Architectural trends: Microarchitectures have historically increased performance by increasing ILP and pipeline depth. Today, we can’t keep increasing performance at historical levels by these methods.

• Bandwidth vs. Latency

Page 4: Lecture 2 Benchmarks, Economics, and Technology

SPEC Benchmarks

Page 5: Lecture 2 Benchmarks, Economics, and Technology

3DMark06

• Historically, a given 3DMark version becomes more CPU-bound over time

• 3DMark06 splits its tests into four graphics tests and two CPU tests

Introduction

This paper introduces 3DMark®06, the latest in the 3DMark benchmark series built by Futuremark® Corporation. The name 3DMark has become a standard in 3D graphics benchmarking; since the first version released in 1998 it has grown to be used by virtually all on-line and paper publications. It has proven itself as an impartial and highly accurate tool for benchmarking 3D graphics performance of the latest PC hardware. 3DMark has a very large following worldwide among individual PC owners. More than 30 million benchmark results have been submitted to Futuremark’s Online ResultBrowser database. It has become a point of great prestige to be the holder of the highest 3DMark score among PC enthusiasts as well as among the graphics hardware vendors. A compelling, easy download and installation, and easy-to-use interface have made 3DMark very popular among game enthusiasts. Futuremark’s latest benchmark, 3DMark06, continues this tradition by providing a state-of-the-art Microsoft® DirectX® 9 3D performance benchmark.

Figure 1: 3DMark06 is Futuremark’s latest DirectX 9.0 benchmark with support for Shader Model 2.0, Shader Model 3.0, and HDR.

3DMark06 is an all-new 3DMark version taking full advantage of Microsoft’s DirectX 9.0c. The previous version, 3DMark05, was the first benchmark requiring Shader Model 2.0. However, 3DMark05 used DirectX 9-specific features in a slightly limited manner, because hardware with full Shader Model 3.0 support was rare at the time of its development and launch. In contrast, 3DMark06 requires DirectX 9 hardware with full support for at least Shader Model 2.0, and for some of the tests Shader Model 3.0, and once again takes shader usage to levels never seen before.

3DMark06 Whitepaper v1.0.2 Page 3 of 41 © Futuremark Corporation 2006

Page 6: Lecture 2 Benchmarks, Economics, and Technology

TPC-C

• “The difficulty in designing TPC benchmarks lies in reducing the diversity of operations found in a production application, while retaining its essential performance characteristics, namely, the level of system utilization and the complexity of its operations.”

Overview of the TPC Benchmark C: The Order-Entry Benchmark

By Francois Raab, Walt Kohler, Amitabh Shah

Page 7: Lecture 2 Benchmarks, Economics, and Technology

TPC-C

• n warehouses

• Each warehouse: 10 sales districts, one terminal per sales district

• Each sales district: 3000 customers

• New order: 10 items, 10% chance an item comes from another warehouse

• Other transactions: payment, order status, delivery, stock query

• TPC-C measures:

• New-order transactions per minute (tpmC)

• Price-performance ($/tpmC)

Page 8: Lecture 2 Benchmarks, Economics, and Technology

Chip Economics

• Courtesy the Linley Group (Linley Gwennap) (thanks!)

• Scenario:

• Building an ASIC in 0.18 µm technology (6? years old)

• 50 mm2 non-CPU, 5 mm2 CPU, 55 mm2 total per die

• How much does it cost to build this chip?

Page 9: Lecture 2 Benchmarks, Economics, and Technology

Wafer to Chips

• 200 mm (diameter) wafer

• Costs $3400

• Gross dies/wafer: 523 dies can be cut per wafer

• “Effective area” fraction = 85%

• 0.5 defects per cm2

• Yield: 80%

• Yield ≈ 1 / (1 + defect density × die area × effective area fraction / 3)³

• Net dies per wafer: 418

• Untested die cost: $8.14
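A minimal sketch of the die-cost arithmetic above. The gross die count (523) is taken from the slide; the yield model is an assumption (the common clustered-defect form with α = 3, which reproduces the 80% figure quoted here):

```python
# Wafer-to-die arithmetic from the slide; yield model is an assumption.
wafer_cost      = 3400.0   # $ per 200 mm wafer
gross_dies      = 523      # dies cut per wafer (from the slide)
die_area_cm2    = 0.55     # 55 mm^2 total die
defects_per_cm2 = 0.5
effective_area  = 0.85     # defect-sensitive fraction of the die
alpha           = 3        # assumed clustering parameter

die_yield = (1 + defects_per_cm2 * die_area_cm2 * effective_area / alpha) ** -alpha
net_dies  = gross_dies * die_yield
print(round(die_yield, 2), round(net_dies), round(wafer_cost / net_dies, 2))
# -> 0.8 418 8.14   (slide: 80% yield, 418 net dies, $8.14 per untested die)
```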

Page 10: Lecture 2 Benchmarks, Economics, and Technology

Cost per Chip

• $8.14 per untested die

• $1.00 test cost

• $0.50 packaging and assembly

• $2.00 package cost (BGA)

• 98% final test yield ($0.23 yield loss per chip)

• Total manufacturing cost: $11.87
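Continuing the sketch, adding the test, assembly, and package costs from this slide, with the 2% final-test yield loss applied to the per-chip cost:

```python
# Die cost -> packaged, tested chip cost, per the slide's figures.
untested_die  = 8.14
test_cost     = 1.00
assembly_cost = 0.50
package_cost  = 2.00    # BGA
good_chip     = untested_die + test_cost + assembly_cost + package_cost   # $11.64
yield_loss    = good_chip * 0.02          # 98% final test yield
print(round(yield_loss, 2), round(good_chip + yield_loss, 2))   # -> 0.23 11.87
```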

Page 11: Lecture 2 Benchmarks, Economics, and Technology

Today’s VLSI Capability

[courtesy Bill Dally]

Page 12: Lecture 2 Benchmarks, Economics, and Technology

Today’s VLSI Capability

1. Exploits Ample Computation!

Page 13: Lecture 2 Benchmarks, Economics, and Technology

Today’s VLSI Capability

1. Exploits Ample Computation!

2. Requires Efficient Communication!

Page 14: Lecture 2 Benchmarks, Economics, and Technology

“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year (see graph on next page). Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.”

“Cramming more components onto integrated circuits” by Gordon E. Moore, Electronics, Volume 38, Number 8, April 19, 1965

Moore’s Law

!"#$%&'()$*+,-'"./#,01+,2./3#&,1+,45&)",67+,6789

!"#$%&'()"*+,!-."/!0"!)*%"/%0#!(0"*.1.-!)"2()%3'*"%4"-.*(*&#!0/."%-"!"4.$"5(%5.*6""73(*"!))%$*"!#").!*#"899"/%':%0.0#*:.-")(0.!-"(0/3"%-"!"+,!-#.-"'())(%0":.-"*+,!-."(0/36"73,*;<8;999"/%':%0.0#*"0..5"%//,:="%0)=" !>%,#"%0.&4%,-#3" !*+,!-."(0/36

?0"#3."*()(/%0"$!4.-"/,--.0#)=",*.5;",*,!))="!0"(0/3"%-'%-."(0"5(!'.#.-;"#3.-."(*"!':)."-%%'"4%-"*,/3"!"*#-,/#,-."(4#3."/%':%0.0#*"/!0">."/)%*.)=":!/2.5"$(#3"0%"*:!/."$!*#.54%-"(0#.-/%00./#(%0":!##.-0*6""73(*"(*"-.!)(*#(/;"*(0/.".44%-#*"#%!/3(.1."!").1.)"%4"/%':).@(#="!>%1."#3.":-.*.0#)="!1!()!>).(0#.A-!#.5" /(-/,(#*" !-." !)-.!5=",05.-$!=",*(0A"',)#()!=.-'.#!)(B!#(%0":!##.-0*"*.:!-!#.5">="5(.)./#-(/"4()'*6""C,/3"!5.0*(#="%4"/%':%0.0#*"/!0">."!/3(.1.5">=":-.*.0#"%:#(/!)#./30(+,.*"!05"5%.*"0%#"-.+,(-."#3."'%-.".@%#(/"#./30(+,.*;*,/3"!*".)./#-%0">.!'"%:.-!#(%0*;"$3(/3"!-.">.(0A"*#,5(.5"#%'!2.".1.0"*'!)).-"*#-,/#,-.*6

:($&#;*)(<,%=#,>)#"?73.-."(*"0%"4,05!'.0#!)"%>*#!/)."#%"!/3(.1(0A"5.1(/.

=(.)5*"%4"D99E6""F#":-.*.0#;":!/2!A(0A"/%*#*"*%"4!-".@/..5#3."/%*#"%4"#3."*.'(/%05,/#%-"*#-,/#,-."(#*.)4"#3!#"#3.-."(*"0%(0/.0#(1."#%"(':-%1."=(.)5*;">,#"#3.="/!0">."-!(*.5"!*"3(A3"!*

(*"./%0%'(/!))="G,*#(4(.56""H%">!--(.-".@(*#*"/%':!-!>)."#%#3."#3.-'%5=0!'(/".+,()(>-(,'"/%0*(5.-!#(%0*"#3!#"%4#.0"")('(#=(.)5*"(0"/3.'(/!)"-.!/#(%0*I"(#"(*"0%#".1.0"0./.**!-="#%"5%!0="4,05!'.0#!)"-.*.!-/3"%-" #%"-.:)!/.":-.*.0#":-%/.**.*6?0)="#3.".0A(0..-(0A".44%-#"(*"0..5.56

J0"#3.".!-)="5!=*"%4"(0#.A-!#.5"/(-/,(#-=;"$3.0"=(.)5*"$.-..@#-.'.)=")%$;"#3.-."$!*"*,/3"(0/.0#(1.6""7%5!="%-5(0!-="(0&#.A-!#.5"/(-/,(#*"!-."'!5."$(#3"=(.)5*"/%':!-!>)."$(#3"#3%*.%>#!(0.5"4%-"(05(1(5,!)"*.'(/%05,/#%-"5.1(/.*6""73."*!'.:!##.-0"$())"'!2.")!-A.-"!--!=*"./%0%'(/!);"(4"%#3.-"/%0*(5&.-!#(%0*"'!2."*,/3"!--!=*"5.*(-!>).6

@#;%,5&'3"#/K())"(#">.":%**(>)."#%"-.'%1."#3."3.!#"A.0.-!#.5">="#.0*

%4"#3%,*!05*"%4"/%':%0.0#*"(0"!"*(0A)."*()(/%0"/3(:LJ4"$."/%,)5"*3-(02"#3."1%),'."%4"!"*#!05!-5"3(A3&*:..5

5(A(#!)"/%':,#.-"#%"#3!#"-.+,(-.5"4%-"#3."/%':%0.0#*"#3.'&*.)1.*;"$."$%,)5".@:./#"(#"#%"A)%$">-(A3#)="$(#3":-.*.0#":%$.-5(**(:!#(%06" "M,#" (#"$%0N#"3!::.0"$(#3" (0#.A-!#.5"/(-/,(#*6C(0/."(0#.A-!#.5".)./#-%0(/"*#-,/#,-.*"!-."#$%&5('.0*(%0!);#3.="3!1."!"*,-4!/."!1!()!>)."4%-"/%%)(0A"/)%*."#%".!/3"/.0#.-%4"3.!#"A.0.-!#(%06""J0"!55(#(%0;":%$.-"(*"0..5.5":-('!-()="#%5-(1."#3."1!-(%,*")(0.*"!05"/!:!/(#!0/.*"!**%/(!#.5"$(#3"#3.*=*#.'6""F*")%0A"!*"!"4,0/#(%0"(*"/%04(0.5"#%"!"*'!))"!-.!"%0!"$!4.-;"#3."!'%,0#"%4"/!:!/(#!0/."$3(/3"',*#">."5-(1.0"(*5(*#(0/#)=")('(#.56""J0"4!/#;"*3-(02(0A"5('.0*(%0*"%0"!0"(0#.&A-!#.5"*#-,/#,-."'!2.*"(#":%**(>)."#%"%:.-!#."#3."*#-,/#,-."!#3(A3.-"*:..5"4%-"#3."*!'.":%$.-":.-",0(#"!-.!6

A;>,'B,&#$C'()(<O).!-)=;"$."$())" >." !>)." #%" >,()5" *,/3" /%':%0.0#&

/-!''.5".+,(:'.0#6""H.@#;"$."!*2",05.-"$3!#"/(-/,'*#!0/.*$."*3%,)5"5%"(#6""73."#%#!)"/%*#"%4"'!2(0A"!":!-#(/,)!-"*=*#.'4,0/#(%0"',*#">."'(0('(B.56""7%"5%"*%;"$."/%,)5"!'%-#(B.#3.".0A(0..-(0A"%1.-"*.1.-!)"(5.0#(/!)"(#.'*;"%-".1%)1."4).@&(>)."#./30(+,.*"4%-"#3.".0A(0..-(0A"%4")!-A."4,0/#(%0*"*%"#3!#0%"5(*:-%:%-#(%0!#.".@:.0*."0..5">.">%-0.">="!":!-#(/,)!-!--!=6""P.-3!:*"0.$)="5.1(*.5"5.*(A0"!,#%'!#(%0":-%/.5,-.*/%,)5"#-!0*)!#."4-%'")%A(/"5(!A-!'"#%"#./30%)%A(/!)"-.!)(B!&#(%0"$(#3%,#"!0="*:./(!)".0A(0..-(0A6

J#"'!=":-%1."#%">."'%-."./%0%'(/!)"#%">,()5")!-A.

Page 15: Lecture 2 Benchmarks, Economics, and Technology

Intel historical data

Page 16: Lecture 2 Benchmarks, Economics, and Technology

Semiconductor Scaling Rates
From Digital Systems Engineering, Dally and Poulton, 1998

Parameter | Current Value | Yearly Factor | Years to Double (Half)
Moore's Law (grids on a die)** | 1 B | 1.49 | 1.75
Gate Delay | 150 ps | 0.87 | (5)
Capability (grids / gate delay) | | 1.71 | 1.3
Device-length wire delay | | 1.00 |
Die-length wire delay / gate delay | | 1.71 | 1.3
Pins per package | 750 | 1.11 | 7
Aggregate off-chip bandwidth | | 1.28 | 3

** Ignores multi-layer metal, 8 layers in 2001
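A quick consistency check of the table: the “Years to Double (Half)” column follows from the yearly factor as ln(2)/ln(factor).

```python
import math

def years_to_double(yearly_factor):
    return math.log(2) / math.log(yearly_factor)

print(round(years_to_double(1.49), 2))           # -> 1.74  (Moore's Law: table says 1.75)
print(round(years_to_double(1.71), 2))           # -> 1.29  (capability: table says 1.3)
print(round(years_to_double(1.11), 1))           # -> 6.6   (pins per package: table says ~7)
print(round(years_to_double(1.28), 1))           # -> 2.8   (off-chip bandwidth: table says ~3)
print(round(math.log(0.5) / math.log(0.87), 1))  # -> 5.0   years to halve gate delay
```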

Page 17: Lecture 2 Benchmarks, Economics, and Technology

Int’l Technology Roadmap for Semiconductors

!"####$%&'())#*+(,-(.#/&012+)+34#51('(06&'786708!

!

!"#$%&#'()*+,-$.

/011$#, 2%34 5'

9:#;<2067+28=017.#>&'#9#?&('8

9@@A#B/*C#>'+,<06#/&012+)+34#/'&2,8#D#

;<2067+28#.&'#517.

6789:;

6789:6

678<::

678<:6

678<:;

678<:=

6>>? ;::: ;::? ;:6: ;:6? ;:;: ;:;?

?&('#+E#>'+,<067+2

>'+,<06#;<2067+28=517.

F#G73(#HI@JKL#D#M768N#6'(28786+'8#O

@A%,B'CD-,EFBDG'HIJD-,K0+A-D92#"#A9F#AA'HLJD-02FK

@A%,B'CD-,EFBDG'HIJD-,K0+A-D92#"#A9F#AA'H;JD-02FK

@A%,B'CD-,EFBDG'HIJD-,KMD)&A#92#"#A9F#AA'HM2F'K

NO!0'CD-,EFBDG'HIJD-,K

0PQ'IR$%),D,-1$,EFBDG9'BD&B9G#$S1$T%)U#'HBGK

0PQ'IR$%),D,-1$,EFBDG9'U1,-9G#$S1$T%)U'HUGK

9@@A#D 9@99#B/*C#*(23&

!"#$%&#'()*+,-$.

/011$#, 2%34 5'

9:#;<2067+28=017.#>&'#9#?&('8

!"#$%&#'()*+,-$.

/011$#, 2%34 5'

9:#;<2067+28=017.#>&'#9#?&('8

9@@A#B/*C#>'+,<06#/&012+)+34#/'&2,8#D#

;<2067+28#.&'#517.

6789:;

6789:6

678<::

678<:6

678<:;

678<:=

6>>? ;::: ;::? ;:6: ;:6? ;:;: ;:;?

?&('#+E#>'+,<067+2

>'+,<06#;<2067+28=517.

F#G73(#HI@JKL#D#M768N#6'(28786+'8#O

@A%,B'CD-,EFBDG'HIJD-,K0+A-D92#"#A9F#AA'HLJD-02FK

@A%,B'CD-,EFBDG'HIJD-,K0+A-D92#"#A9F#AA'H;JD-02FK

@A%,B'CD-,EFBDG'HIJD-,KMD)&A#92#"#A9F#AA'HM2F'K

NO!0'CD-,EFBDG'HIJD-,K

0PQ'IR$%),D,-1$,EFBDG9'BD&B9G#$S1$T%)U#'HBGK

0PQ'IR$%),D,-1$,EFBDG9'U1,-9G#$S1$T%)U'HUGK

9@@A#D 9@99#B/*C#*(23&

!

� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � # � � � � � � � � � �� � � � � � � � � � � � � � � � � � � � � � � � � � � � � # �� ! � � � � � � $ � � � � & � � � � " % � � � � � � � �

RV8'(WR8OW!R(XW!2'R8FVWX2XIY'OX!N0!P'@XO'M80(FXWNQFRXOM5''''!""#'

!"#$%&'(#)%"****+!

"#$$%&%'(! )*+,-.!)'!/""#(#0'1!(2%!)*34!(%/5-!206"!7896#:!)*34!:0'$%&%':%-!9#;/''8/66<! (0!7&%-%'(! (2%! 6/(%-(!30/"5/7!

#'$0&5/(#0'!/'"!(0!-06#:#(!$%%"9/:=!$&05!(2%!-%5#:0'"8:(0&!#'"8-(&<!/(;6/&>%.!!

*2%!)*34!#-!&%6%/-%"!/''8/66<1!?#(2!87"/(%-!/'"!:0&&%:(#0'-!(0!"/(/!(/96%-!%/:2!%@%';'859%&%"!<%/&!A-8:2!/-!BCCC1!BCCB1!

BCCD1!BCCEF!?2#6%!:0576%(%!%"#(#0'-!/&%!&%6%/-%"!%/:2!0"";'859%&%"!<%/&!ABCCG1!BCCH1!BCCI1!BCCJF.!*2#-!)*34!7&0:%--!

(28-!%'-8&%-!:0'(#'8/6!/--%--5%'(!0$!(2%!-%5#:0'"8:(0&!#'"8-(&<K-!'%/&!/'"!60'>;(%&5!'%%"-.!)(!/6-0!/660?-!(2%!(%/5-!(0!

:0&&%6/(%! #'! /! (#5%6<! $/-2#0'! (2%! )*34! 7&0L%:(#0'-! (0! 50-(! &%:%'(! &%-%/&:2! /'"! "%@%6075%'(! 9&%/=(2&08>2-! (2/(! 5/<!

7&0@#"%!-068(#0'-!(0!(20-%!'%%"-.!

,-./0.1*2-34534*

*2%!)*34!/--%--%-!(2%!7&#':#7/6!(%:2'060><!'%%"-!(0!>8#"%!(2%!-2/&%"!&%-%/&:21!-20?#'>!(2%!M(/&>%(-N!(2/(!'%%"!(0!9%!5%(.!

*2%-%!(/&>%(-!/&%!/-!58:2!/-!70--#96%!O8/'(#$#%"!/'"!%P7&%--%"!#'!(/96%-1!-20?#'>!(2%!%@068(#0'!0$!=%<!7/&/5%(%&-!0@%&!

(#5%.!Q::057/'<#'>!(%P(!%P76/#'-!/'"!:6/&#$#%-!(2%!'859%&-!:0'(/#'%"!#'!(2%!(/96%-!?2%&%!/77&07&#/(%.!

*2%!)*34!$8&(2%&!"#-(#'>8#-2%-!9%(?%%'!"#$$%&%'(!5/(8&#(<!0&!:0'$#"%':%! 6%@%6-1!&%7&%-%'(%"!9<!:060&-! #'! (2%! (/96%-1! $0&!

(2%-%!(/&>%(-R!

!

!"#$%"&'$(")*+,-.*$'/.#-,+0/-'1,"#2,"(+,)+/#3,.4'/5/6+2, !!

!"#$%"&'$(")*+,-.*$'/.#-,"(+,7#.8#, � �

9#'+(/5,-.*$'/.#-,"(+,7#.8#, !"

!"#$%"&'$(")*+,-.*$'/.#-,"(+,:;<,7#.8#, � �!

!

*2%!$#&-(!-#(8/(#0'1!MS/'8$/:(8&/96%!-068(#0'-!%P#-(1!/'"!/&%!9%#'>!07(#5#T%"1N!#'"#:/(%-!(2/(!(2%!(/&>%(!#-!/:2#%@/96%!?#(2!

(2%!:8&&%'(6<!/@/#6/96%!(%:2'060><!/'"!(006-1!/(!7&0"8:(#0';?0&(2<!:0-(!/'"!7%&$0&5/':%.!*2%!<%660?!:060&!#-!8-%"!?2%'!

/""#(#0'/6! "%@%6075%'(! #-! '%%"%"! (0! /:2#%@%! (2/(! (/&>%(.!U0?%@%&1! (2%! -068(#0'! #-! /6&%/"<! #"%'(#$#%"! /'"! %P7%&(-! /&%!

:0'$#"%'(!(2/(!#(!?#66!"%50'-(&/(%!(2%!&%O8#&%"!:/7/9#6#(#%-!#'!(#5%!$0&!7&0"8:(#0'!-(/&(.!*2%!-#(8/(#0'!M)'(%&#5!4068(#0'-!

/&%!V'0?'N!5%/'-!(2/(!6#5#(/(#0'-!0$!/@/#6/96%!-068(#0'-!?#66!'0(!"%6/<!(2%!-(/&(!0$!7&0"8:(#0'1!98(!?0&=;/&08'"-!?#66!9%!

#'#(#/66<!%5760<%"!#'!(2%-%!:/-%-.!489-%O8%'(!#57&0@%5%'(!#-!%P7%:(%"!(0!:60-%!/'<!>/7-!$0&!7&0"8:(#0'!7%&$0&5/':%!#'!

/&%/-!-8:2!/-!7&0:%--!:0'(&061!<#%6"1!/'"!7&0"8:(#@#(<.!*2%!$08&(2!/'"!6/-(!-#(8/(#0'!#-!2#>26#>2(%"!/-!M&%"N!0'!(2%!30/"5/7!

(%:2'060><!&%O8#&%5%'(-!(/96%-!/'"!2/-!9%%'!&%$%&&%"!(0!/-!(2%!M3%"!W&#:=!+/66N!-#':%!(2%!9%>#''#'>!0$!)*34.!A*2%!M&%"N!

#-!0$$#:#/66<!0'!(2%!30/"5/7!(0!:6%/&6<!?/&'!?2%&%!7&0>&%--!5#>2(!%'"!#$!(/'>#96%!9&%/=(2&08>2-!/&%!'0(!/:2#%@%"!#'!(2%!

$8(8&%.F!X859%&-!#'!(2%!&%"!&%>#5%1!(2%&%$0&%1!/&%!0'6<!5%/'(!/-!?/&'#'>-!/'"!-2086"!'0(!9%!#'(%&7&%(%"!/-!M(/&>%(-N!0'!

(2%! 30/"5/7.! Y0&! -05%!30/"5/7! &%/"%&-1! (2%! M&%"N! "%-#>'/(#0'!5/<! '0(! 2/@%! /"%O8/(%6<! -%&@%"! #(-!-.*+! 78&70-%! 0$!

2#>26#>2(#'>!-#>'#$#:/'(!/'"!%P:#(#'>!:2/66%'>%-.!*2%&%!:/'!9%!/!(%'"%':<!(0!@#%?!"#=!'859%&!#'!(2%!30/"5/7!/-!M0'!(2%!

&0/"!(0!-8&%!#576%5%'(/(#0'N!&%>/&"6%--!0$!#(-!:060&.!*0!"0!-0!?086"!9%!/!-%&#08-!5#-(/=%.!

M3%"N!#'"#:/(%-!?2%&%!(2%&%!/&%!'0!M='0?'!5/'8$/:(8&/96%!-068(#0'-N!A0$!&%/-0'/96%!:0'$#"%':%F!(0!:0'(#'8%"!-:/6#'>!#'!

-05%!/-7%:(!0$! (2%!-%5#:0'"8:(0&! (%:2'060><.!Q'!/'/6<-#-!0$!M&%"N!8-/>%!5#>2(!:6/--#$<!(2%!M&%"N!7/&/5%(%&-! #'(0!(?0!

:/(%>0&#%-R!

>?! ?2%&%!(2%!:0'-%'-8-!#-!(2/(!(2%!7/&(#:86/&!@/68%!?#66!86(#5/(%6<!9%!/:2#%@%"!A7%&2/7-!6/(%F1!98(!$0&!?2#:2!(2%!#'"8-(&<!

"0%-'K(!2/@%!58:2!:0'$#"%':%!#'!/'<!:8&&%'(6<!7&070-%"!-068(#0'A-F1!0&!

@?! ?2%&%!(2%!:0'-%'-8-!#-!(2/(!(2%!@/68%!?#66!'%@%&!9%!/:2#%@%"!A$0&!%P/576%1!-05%!M?0&=;/&08'"N!?#66!&%'"%&!#(!

#&&%6%@/'(!0&!7&0>&%--!?#66!#'"%%"!%'"F!

*0! /:2#%@%! (2%! &%"! 7/&/5%(%&-! 0$! (2%! $#&-(! :/(%>0&<1! 9&%/=(2&08>2-! #'! &%-%/&:2! /&%! '%%"%".! )(! #-! 207%"! (2/(! -8:2!

9&%/=(2&08>2-!?086"!&%-86(!#'!(2%!M&%"N!(8&'#'>!(0!M<%660?N!A5/'8$/:(8&/96%!-068(#0'-!/&%!='0?'F!/'"1!86(#5/(%6<!M?2#(%N!

A5/'8$/:(8&/96%!-068(#0'-!/&%!='0?'!/'"!/&%!9%#'>!07(#5#T%"F!#'!$8(8&%!%"#(#0'-!0$!)*34.!!

Q-!#'"#:/(%"!#'!(2%!0@%&@#%?1!(2%!30/"5/7!2/-!9%%'!78(!(0>%(2%&!#'!(2%!-7#&#(!0$!"%$#'#'>!?2/(!(%:2'#:/6!:/7/9#6#(#%-!(2%!

#'"8-(&<!'%%"-!(0!"%@%607!#'!0&"%&!(0!-(/<!0'!S00&%K-!Z/?!/'"!(2%!0(2%&!(&%'"-1!/'"!?2%'.!40!(2%!)*34!#-!'0(!-0!58:2!/!

$0&%:/-(#'>!%P%&:#-%!/-!/!?/<!(0!#'"#:/(%!?2%&%!&%-%/&:2!-2086"!$0:8-!(0!:0'(#'8%!S00&%K-!6/?.!)'!(2/(!#'#(#/6!M:2/66%'>%N!

-7#&#(1! (2%![@%&/66!30/"5/7!*%:2'060><!\2/&/:(%&#-(#:-! A[3*\F! (%/5!87"/(%-!=%<!2#>2;6%@%6! (%:2'060><!'%%"-1!?2#:2!

%-(/96#-2!-05%!:0550'!&%$%&%':%!70#'(-!(0!5/#'(/#'!:0'-#-(%':<!/50'>!(2%!:2/7(%&-.!*2%!2#>2;6%@%6!(/&>%(-!%P7&%--%"!#'!

(2%! [3*\! (/96%-! /&%! 9/-%"! #'! 7/&(! 0'! (2%! :057%66#'>! %:0'05#:! -(&/(%><! 0$! 5/#'(/#'#'>! (2%! 2#-(0&#:/6! 2#>2! &/(%! 0$!

/"@/':%5%'(!#'!#'(%>&/(%"!:#&:8#(!(%:2'060>#%-.!!

!"#$%&!#'&(!%)&(*$!#+"&)*),-$')(./(0$1)'$2#/%+)&.3+!)'24$$$$"##$$

Page 18: Lecture 2 Benchmarks, Economics, and Technology

10-Year GPU Projection (2004–2014)

• Transistors (NV40): 222M / 2237M

• Clock speed (NV40, MHz): 475 / 1890

• Capability: 105B / 4228B

• Memory bandwidth (NV40, GB/s): 35 / 322

• Memory latency (RAS, ns): 40 / 23

• Power/chip (maximum, W): 158 / 198 (then flat)

• Take-home point: Capability > mem bw > mem latency
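The 2004-to-2014 ratios implied by the numbers above back up the take-home point; a quick sketch of the arithmetic:

```python
# 2004 -> 2014 improvement ratios from the projections on this slide.
transistors = 2237 / 222      # ~10x
clock       = 1890 / 475      # ~4x
capability  = 4228 / 105      # ~40x
mem_bw      = 322 / 35        # ~9x
mem_latency = 40 / 23         # ~1.7x improvement
print(round(transistors, 1), round(clock, 1), round(capability),
      round(mem_bw, 1), round(mem_latency, 1))
# -> 10.1 4.0 40 9.2 1.7
```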

29.2 Keys to High-Performance Computing

In the previous section, we have seen that modern technology allows each new generation of hardware a substantial increase in capability. Effectively utilizing this wealth of computational resources requires addressing two important goals. First, we must organize our computational resources to allow high performance on our applications of interest. Simply providing large amounts of computation is not sufficient, however; efficient management of communication is necessary to feed the computation resources on the chip. In this section we describe techniques that allow us to achieve efficient computation and efficient communication, and we discuss why modern microprocessors (CPUs) are a poor match for these goals.

29.2.1 Methods for Efficient Computation

In Section 29.1 we saw that it is now possible to place hundreds to thousands of computation units on a single die. The keys to making the best use of transistors for computation are to maximize the hardware devoted to computation, to allow multiple computation units to operate at the same time through parallelism, and to ensure that each computation unit operates at maximum efficiency.

Figure 29-2. Changes in Key GPU Properties over Time

Excerpted from GPU Gems 2, Copyright 2005 by NVIDIA Corporation

From Owens, Streaming Architectures and Technology Trends, GPU Gems 2, March 2005

Page 19: Lecture 2 Benchmarks, Economics, and Technology

2004, 2008 (NV GF280), 2014

[Repeats the Figure 29-2 excerpt from the previous slide, with the chart extended: log-scale plot of transistors, clock speed (MHz), capability, and memory bandwidth (GB/s) versus year (2002–2016).]

Page 20: Lecture 2 Benchmarks, Economics, and Technology

Technology Theme

• In a power-constrained world, we can’t keep increasing performance by increasing clock speed.

Page 21: Lecture 2 Benchmarks, Economics, and Technology

The Raisin Theory

• Observing the data—the laws of physics:

• The silicon is not scaling in the same way it did in the past:

• Voltage scaling, the most significant contributor to power scaling, is decaying

• We can no longer use the same architectural approaches for increased performance

• At the same power:

• The die gets smaller (a raisin)

• The thermal density increases (a cooked raisin)

[courtesy Andy Riffel]

Page 22: Lecture 2 Benchmarks, Economics, and Technology

What’s this?

• And how much energy does it take to charge it?

Page 23: Lecture 2 Benchmarks, Economics, and Technology

What’s this?

From the Computer Desktop Encyclopedia

Page 24: Lecture 2 Benchmarks, Economics, and Technology

Process generations

• Let’s assume one process generation to the next makes new transistors 0.7 times the size of the old transistors (“linear shrink”)

• Recent example: 65 nm to 45 nm = 0.692

• What is this “feature size”?
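A quick check of the shrink arithmetic, plus the usual corollary (a 0.7x linear shrink gives about 0.5x area per device, i.e. roughly 2x device density, the factor used on the scaling slides that follow):

```python
# Linear shrink ratio and the implied area/density scaling.
print(round(45 / 65, 3))   # -> 0.692  (65 nm -> 45 nm, as on the slide)
print(round(0.7 ** 2, 2))  # -> 0.49   (area per device => ~2x device density)
```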

Page 25: Lecture 2 Benchmarks, Economics, and Technology

Historical scaling (0.7x)

• Power = Ceff [device] × # devices × Voltage² × MHz

• Not talking about wires, leakage, etc.

• With process shrink, assuming constant die size:

• Change in Ceff:

• Change in number of devices:

• Change in frequency:

• Plug that all together:

Page 26: Lecture 2 Benchmarks, Economics, and Technology

Historical scaling (0.7x)

• Power = Ceff [device] × # devices × Voltage² × MHz

• Not talking about wires, leakage, etc.

• With process shrink:

• Change in Ceff:

• Change in number of devices:

• Change in frequency:

• What else do we need to change to get constant power?

• Plug that all together:
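One worked version of “plug that all together,” assuming the classic historical case in which the supply voltage also scales by 0.7x per generation (the answer to the “what else” question above) and the frequency rises by roughly 1.4x:

```latex
% Idealized historical scaling with linear shrink x = 0.7
% (assumes voltage also scales by 0.7x and frequency by ~1.4x):
\begin{align*}
P  &= C_{\mathrm{eff}} \cdot N_{\mathrm{devices}} \cdot V^2 \cdot f \\
P' &= (0.7\,C_{\mathrm{eff}}) \cdot (2\,N_{\mathrm{devices}}) \cdot (0.7\,V)^2 \cdot (1.4\,f)
     \approx 0.96\,P \qquad \text{(roughly constant power)} \\
\text{Perf} &\propto N_{\mathrm{devices}} \cdot f
     \;\Rightarrow\; 2 \times 1.4 = 2.8\times \text{ per generation}
\end{align*}
```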

Page 27: Lecture 2 Benchmarks, Economics, and Technology

Historical scaling result

• Constant power from generation to generation

• What is our increase of (potential) performance from generation to generation?

Page 28: Lecture 2 Benchmarks, Economics, and Technology

Today’s scaling

• Today, for 0.7x transistor scaling:

• Ceff continues to scale (0.7x)

• Device density continues to scale (2x)

• Voltage does not scale as much (0.95x)

• Frequency increases only by 1.2x

• Result: 2.3x performance, 1.46x power
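A small sketch comparing the historical and today's scaling factors from the last few slides. The slide's stated 2.3x/1.46x result comes out if device count grows by about 1.9x per generation (the factor used on the "Raisin Theory" slide that follows) rather than a full 2x:

```python
# Power = Ceff * devices * V^2 * f, with per-generation scaling factors.
def scaled(ceff, devices, voltage, freq):
    power = ceff * devices * voltage**2 * freq
    perf  = devices * freq          # potential performance ~ devices * frequency
    return round(perf, 2), round(power, 2)

print(scaled(0.7, 2.0, 0.70, 1.4))  # historical: (2.8, 0.96)  -> ~constant power
print(scaled(0.7, 1.9, 0.95, 1.2))  # today:      (2.28, 1.44) -> slide: ~2.3x perf, ~1.46x power
```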

Page 29: Lecture 2 Benchmarks, Economics, and Technology

How about wiring? (x = 0.7)

• Capacitance: x

• Resistance of wire: 1/x

• Length of wire: x

• Cross-section of wire: 1/x²

• Resistance of constant-length wire: 1/x² (+32%/year)

• Resistance of cross-chip wire: 1/x³ (+49%/year)

• RC/τ: delay of wire compared to device delay

• Constant-length: +51%, cross-chip: +71%

Page 30: Lecture 2 Benchmarks, Economics, and Technology

Raisin Theory—The Real RHT

• Yellow line

• 1st slope = 2.0x devices × 1.40x frequency = 2.8x perf. scaling

• 2nd slope = 1.9x devices × 1.20x frequency = 2.3x perf. scaling

• Purple line

• 1st slope = 2.0x devices × 1.40x frequency = 2.8x perf. scaling

• 2nd slope = 1.9x devices × 0.82x frequency = 1.56x perf. scaling

The dashed purple line is performance gains at CONSTANT power

Page 31: Lecture 2 Benchmarks, Economics, and Technology

Why you are taking this class

2nd slope = 1.9x devices × 0.82x frequency = 1.56x perf. scaling

Page 32: Lecture 2 Benchmarks, Economics, and Technology

Technology Theme

• Microarchitectures have historically increased performance by increasing ILP and pipeline depth. Today, we can’t keep increasing performance at historical levels by these methods.

Page 33: Lecture 2 Benchmarks, Economics, and Technology

Why Do Processors Get Faster?

• Define “faster” as “more instructions per second”.

• Neglecting software improvements …

• 3 reasons:

• More parallelism (or more work per pipeline stage): fewer clocks/instruction

• Deeper pipelines: fewer gates/clock

• Transistors get faster (Moore’s Law): fewer ps/gate
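These three factors multiply, since time per instruction = (clocks/instruction) × (gates/clock) × (time/gate). Using the yearly improvement rates quoted later in the lecture (clocks/inst 18%, gates/clock 9%, ps/gate 19%), the product is roughly the historical 52%/year:

```python
# Per-instruction time factors and their compounded yearly improvement.
clocks_per_inst = 1.18   # more ILP / more work per pipeline stage
gates_per_clock = 1.09   # deeper pipelines
time_per_gate   = 1.19   # faster transistors
print(round(clocks_per_inst * gates_per_clock * time_per_gate, 2))  # -> 1.53 (~52%/year)
```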

Page 34: Lecture 2 Benchmarks, Economics, and Technology

Microarchitectural Theme

• Microarchitectures have historically increased performance by increasing ILP and pipeline depth. Today, we can’t keep increasing performance at historical levels by these methods.

Page 35: Lecture 2 Benchmarks, Economics, and Technology

Limits to Instruction Level Parallelism

[David Wall, Limits of Instruction-Level Parallelism, WRL Research Report 93/6]

Page 36: Lecture 2 Benchmarks, Economics, and Technology

Clock Scaling: Historical and Projected

[Chart, courtesy Steve Keckler: clock speed (MHz) of Intel processors versus year (1970–2014), annotated with logic depth per cycle (from 375 FO4 down to 9 FO4, with projected 8 FO4 and 16 FO4 trend lines) and process feature size (10 µm down to 0.035 µm).]

• “The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays” (ISCA02) [courtesy Steve Keckler]

Page 37: Lecture 2 Benchmarks, Economics, and Technology

Microprocessor Scaling is Slowing

[courtesy of Bill Dally]

[Charts: the Keckler clock-scaling plot from the previous slide, and Dally’s “Microprocessor Scaling is Slowing” plot of performance (ps/instruction) versus year (1980–2020): 52%/year historical improvement (ps/gate 19%, gates/clock 9%, clocks/inst 18%) slowing to a projected 19%/year. A third chart, “Future Potential is Large,” appears on the next slide.]

Page 38: Lecture 2 Benchmarks, Economics, and Technology

Future Potential is Large

• At the right-hand turn: 30:1

• 5 years: 1000:1 [courtesy of Bill Dally]

[Chart, courtesy Bill Dally: performance (ps/instruction) versus year (1980–2020), with trend lines at 52%/year, 74%/year, and 19%/year and the ratios 30:1, 1,000:1, and 30,000:1 marked. The Keckler and Dally charts from the previous slides are reproduced alongside.]

Page 39: Lecture 2 Benchmarks, Economics, and Technology

Summary

• Microprocessors have historically increased performance by:

• Increasing clock speed

• Increasing pipeline depth

• Increasing instruction-level parallelism (work per clock)

• We can’t do any of these things any more.

Page 40: Lecture 2 Benchmarks, Economics, and Technology

Technology rates of change

• Processor

• logic capacity: about 30% per year

• clock rate: about 20% per year

• Memory

• DRAM capacity: about 60% per year (4x every 3 years)

• Memory speed (latency): about 10% per year

• Memory bandwidth: about 25% per year

• Cost per bit: improves about 25% per year

• Disk

• capacity: about 60% per year

• Total use of data: 100% per 9 months!
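Two quick sanity checks on the rates above:

```python
# ~60%/year compounds to ~4x every 3 years; ~100% per 9 months is ~2.5x per year.
print(round(1.6 ** 3, 1))       # -> 4.1   (DRAM/disk capacity: "4x every 3 years")
print(round(2 ** (12 / 9), 2))  # -> 2.52  (data use doubling every 9 months)
```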

Page 41: Lecture 2 Benchmarks, Economics, and Technology

Disk Trends

MapReduce is a programming model for processing vast amounts of data. One of the reasons that it works so well is because it exploits a sweet spot of modern disk drive technology trends. In essence MapReduce works by repeatedly sorting and merging data that is streamed to and from disk at the transfer rate of the disk. Contrast this to accessing data from a relational database that operates at the seek rate of the disk (seeking is the process of moving the disk's head to a particular place on the disk to read or write data).

So why is this interesting? Well, look at the trends in seek time and transfer rate. Seek time has improved at about 5% a year, whereas transfer rate at about 20%. Seek time is improving more slowly than transfer rate—so it pays to use a model that operates at the transfer rate. Which is what MapReduce does. I first saw this observation in Doug Cutting's talk, with Eric Baldeschwieler, at OSCON last year, where he worked through the numbers for updating a 1 terabyte database using the two paradigms B-Tree (seek-limited) and Sort/Merge (transfer-limited).

http://www.lexemetech.com/2008/03/disks-have-become-tapes.html
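An illustrative back-of-the-envelope version of the seek-limited vs. transfer-limited comparison. The parameters below (10 ms seek, 100 MB/s transfer, 100-byte records, 10% of a 1 TB database updated) are assumptions for illustration, not the numbers from Cutting's talk:

```python
# Seek-limited (one random access per updated record) vs. transfer-limited
# (stream the whole database through a sort/merge) update of a 1 TB database.
db_bytes     = 1e12
record_bytes = 100
update_frac  = 0.10
seek_time_s  = 0.010     # assumed ~10 ms per random access
transfer_Bps = 100e6     # assumed ~100 MB/s streaming rate

updates = db_bytes / record_bytes * update_frac
seek_limited_days     = updates * seek_time_s / 86400          # one seek per update
transfer_limited_days = 2 * db_bytes / transfer_Bps / 86400    # read + rewrite everything
print(round(seek_limited_days), round(transfer_limited_days, 1))  # -> 116 0.2 (days)
```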

Page 42: Lecture 2 Benchmarks, Economics, and Technology

Disk Trends

The general point was well summed up by Jim Gray in an interview in ACM Queue from 2003:

... programmers have to start thinking of the disk as a sequential device rather than a random access device.

Or the more pithy: “Disks have become tapes.” (Quoted by David DeWitt.)

http://www.lexemetech.com/2008/03/disks-have-become-tapes.html

Page 43: Lecture 2 Benchmarks, Economics, and Technology

Tracking Technology Performance Trends

• Drill down into 4 technologies:

• Disks; Memory; Network; Processors

• Compare ~1980 Archaic (Nostalgic) vs. ~2000 Modern (Newfangled)

• Performance Milestones in each technology

• Compare for Bandwidth vs. Latency improvements in performance over time

• Bandwidth: number of events per unit time

• E.g., M bits / second over network, M bytes / second from disk

• Latency: elapsed time for a single event

• E.g., one-way network delay in microseconds, average disk access time in milliseconds

Page 44: Lecture 2 Benchmarks, Economics, and Technology

Disks: Archaic (Nostalgic) v. Modern (Newfangled)

• CDC Wren I, 1983

• 3600 RPM

• 0.03 GBytes capacity

• Tracks/Inch: 800

• Bits/Inch: 9550

• Three 5.25” platters

• Bandwidth: 0.6 MBytes/sec

• Latency: 48.3 ms

• Cache: none

• Seagate 373453, 2003

• 15000 RPM (4X)

• 73.4 GBytes (2500X)

• Tracks/Inch: 64000 (80X)

• Bits/Inch: 533,000 (60X)

• Four 2.5” platters (in 3.5” form factor)

• BW: 86 MB/sec (140X)

• Latency: 5.7 ms (8X)

• Cache: 8 MB
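The parenthesized ratios on this slide follow directly from the raw numbers; a quick check:

```python
# 1983 CDC Wren I vs. 2003 Seagate 373453.
print(round(73.4 / 0.03))      # -> 2447  (~2500x capacity)
print(round(86 / 0.6))         # -> 143   (~140x bandwidth)
print(round(48.3 / 5.7, 1))    # -> 8.5   (~8x latency)
print(round(15000 / 3600, 1))  # -> 4.2   (4x RPM)
```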

Page 45: Lecture 2 Benchmarks, Economics, and Technology

Disk Latency vs. Bandwidth

• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

• (latency = simple operation w/o contention; BW = best-case)

[Chart (from CS252, Lecture 2): relative bandwidth improvement (1–10000, log scale) versus relative latency improvement (1–100, log scale), with the disk milestones plotted against the reference line where latency improvement equals bandwidth improvement.]

Page 46: Lecture 2 Benchmarks, Economics, and Technology

Memory: Archaic (Nostalgic) v. Modern (Newfangled)

• 1980 DRAM (asynchronous)

• 0.06 Mbits/chip

• 64,000 xtors, 35 mm2

• 16-bit data bus per module, 16 pins/chip

• 13 Mbytes/sec

• Latency: 225 ns

• (no block transfer)

• 2000 Double Data Rate Synchr. (clocked) DRAM

• 256 Mb/chip (4000X)

• 256M xtors, 204 mm2

• 64b data bus / DIMM, 66 pins/chip (4X)

• 1600 MB/sec (120X)

• Latency: 52 ns (4X)

• Block transfers (page mode)

Page 47: Lecture 2 Benchmarks, Economics, and Technology

Latency Lags Bandwidth (last ~20 years)

• Memory Module: 16 bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x)

• (latency = simple operation w/o contention; BW = best-case)

[The same chart, now with both memory and disk milestones plotted: relative bandwidth improvement versus relative latency improvement, against the line where the two are equal.]

Page 48: Lecture 2 Benchmarks, Economics, and Technology

LANs: Archaic (Nostalgic) v. Modern (Newfangled)

• Ethernet 802.3 (1978)

• 10 Mbits/s link speed

• Latency: 3000 µsec

• Shared media

• Coaxial cable

• Ethernet 802.3ae (2003)

• 10,000 Mbits/s link speed (1000X)

• Latency: 190 µsec (15X)

• Switched media

• Category 5 copper wire

[Figure: twisted pair (“Cat 5” is 4 twisted pairs in a bundle; copper, 1 mm thick, twisted to avoid antenna effects) and coaxial cable (copper core, insulator, braided outer conductor, plastic covering).]

Page 49: Lecture 2 Benchmarks, Economics, and Technology

Latency Lags Bandwidth (last ~20 years)

• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x,1000x)

• (latency = simple operation w/o contention; BW = best-case)

[The same chart, now with network, memory, and disk milestones plotted against the line where latency improvement equals bandwidth improvement.]

Page 50: Lecture 2 Benchmarks, Economics, and Technology

CPUs: Archaic (Nostalgic) v. Modern (Newfangled)

• 1982 Intel 80286

• 12.5 MHz

• 2 MIPS (peak)

• Latency 320 ns

• 134,000 xtors, 47 mm2

• 16-bit data bus, 68 pins

• Microcode interpreter, separate FPU chip

• (no caches)

• 2001 Intel Pentium 4

• 1500 MHz (120X)

• 4500 MIPS (peak) (2250X)

• Latency 15 ns (20X)

• 42,000,000 xtors, 217 mm2

• 64-bit data bus, 423 pins

• 3-way superscalar, dynamic translate to RISC, Superpipelined (22 stage), Out-of-Order execution

• On-chip 8 KB Data caches, 96 KB Instr. Trace cache, 256 KB L2 cache


Page 52: Lecture 2 Benchmarks, Economics, and Technology

Latency Lags Bandwidth (last ~20 years)

• Processor: ‘286, ‘386, ‘486, Pentium, Pentium Pro, Pentium 4 (21x,2250x)

[The same chart with all four milestone series plotted (processor, network, memory, disk); the processor sits high and memory low relative to the reference line, illustrating the “Memory Wall.”]

Page 53: Lecture 2 Benchmarks, Economics, and Technology

Rule of Thumb for Latency Lagging BW

• In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4

• (and capacity improves faster than bandwidth)

• Stated alternatively:

Bandwidth improves by more than the square of the improvement in latency
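Checking the rule against the milestone ratios quoted in these slides; in each case the bandwidth gain exceeds the square of the latency gain:

```python
# (bandwidth improvement, latency improvement) per technology, from the slides.
milestones = {
    "Processor": (2250, 21),
    "Ethernet":  (1000, 16),
    "Memory":    (120, 4),
    "Disk":      (143, 8),
}
for name, (bw, lat) in milestones.items():
    print(name, bw, ">", lat**2, bw > lat**2)   # all True
```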

Page 54: Lecture 2 Benchmarks, Economics, and Technology

6 Reasons Latency Lags Bandwidth

1. Moore’s Law helps BW more than latency

• Faster transistors, more transistors, more pins help bandwidth

• MPU Transistors: 0.130 vs. 42 M xtors (300X)

• DRAM Transistors: 0.064 vs. 256 M xtors (4000X)

• MPU Pins: 68 vs. 423 pins (6X)

• DRAM Pins: 16 vs. 66 pins (4X)

Page 55: Lecture 2 Benchmarks, Economics, and Technology

6 Reasons Latency Lags Bandwidth

• Moore’s Law helps BW more than latency

• Smaller, faster transistors but communicate over (relatively) longer lines: limits latency improvements

• Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)

• MPU Die Size: 47 vs. 217 mm2 (ratio sqrt ⇒ 2X)

• DRAM Die Size: 35 vs. 204 mm2 (ratio sqrt ⇒ 2X)

Page 56: Lecture 2 Benchmarks, Economics, and Technology

6 Reasons Latency Lags Bandwidth (cont’d)

2. Distance limits latency

• Size of DRAM block ⇒ long bit and word lines ⇒ most

of DRAM access time

• Speed of light and computers on network

• 1. & 2. explains linear latency vs. square BW?

Page 57: Lecture 2 Benchmarks, Economics, and Technology

6 Reasons Latency Lags Bandwidth (cont’d)

3. Bandwidth easier to sell (“bigger = better”)

• E.g., 10 Gbits/s Ethernet (“10 Gig”) vs. 10 µsec latency Ethernet

• 4400 MB/s DIMM (“PC4400”) vs. 50 ns latency

• Even if just marketing, customers now trained

• Since bandwidth sells, more resources thrown at bandwidth, which further tips the balance

Page 58: Lecture 2 Benchmarks, Economics, and Technology

6 Reasons Latency Lags Bandwidth (cont’d)

4. Latency helps BW, but not vice versa

• Spinning the disk faster improves both bandwidth and rotational latency

• 3600 RPM ⇒ 15000 RPM = 4.2X

• Average rotational latency: 8.3 ms ⇒ 2.0 ms

• Other things being equal, also helps BW by 4.2X

• Lower DRAM latency ⇒ more accesses/second (higher bandwidth)

• Higher linear density helps disk BW (and capacity), but not disk latency

• 9,550 BPI ⇒ 533,000 BPI ⇒ 60X in BW
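The rotational-latency numbers above follow from the half-revolution rule (average rotational latency = 0.5 × 60 / RPM); a quick sketch:

```python
# Average rotational latency is the time for half a revolution.
def avg_rotational_latency_ms(rpm):
    return 0.5 * 60.0 / rpm * 1000

print(round(avg_rotational_latency_ms(3600), 1))   # -> 8.3 ms
print(round(avg_rotational_latency_ms(15000), 1))  # -> 2.0 ms
print(round(15000 / 3600, 1))                      # -> 4.2x, the ratio on this slide
```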

Page 59: Lecture 2 Benchmarks, Economics, and Technology

6 Reasons Latency Lags Bandwidth (cont’d)

5. Bandwidth hurts latency

• Queues help bandwidth, hurt latency (queuing theory)

• Adding chips to widen a memory module increases bandwidth, but higher fan-out on address lines may increase latency

6. Operating system overhead hurts latency more than bandwidth

• Long messages amortize overhead; overhead is a bigger part of short messages

Page 60: Lecture 2 Benchmarks, Economics, and Technology

Summary of Technology Trends

• For disk, LAN, memory, and microprocessor, bandwidth improves by the square of the latency improvement

• In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X

• Lag probably even larger in real systems, as bandwidth gains multiplied by replicated components

• Multiple processors in a cluster or even in a chip

• Multiple disks in a disk array

• Multiple memory modules in a large memory

• Simultaneous communication in switched LAN

• HW and SW developers should innovate assuming Latency Lags Bandwidth

• If everything improves at the same rate, then nothing really changes

• When rates vary, require real innovation

Caching, Replication, Prediction