A Multigrain Parallelizing Compiler with Power Control for Multicore Processors. Hironori Kasahara, Professor, Department of Computer Science; Director, Advanced Chip-Multiprocessor Research Institutes, Waseda University, Tokyo, Japan. http://www.kasahara.cs.waseda.ac.jp. Feb. 5, 2008 at Google Headquarters

Transcript of A Multigrain Parallelizing Compiler with Power Control for Multicore Processors

Page 1:

A Multigrain Parallelizing Compiler with Power Control for Multicore Processors

Hironori Kasahara

Professor, Department of Computer Science
Director, Advanced Chip-Multiprocessor Research Institutes
Waseda University, Tokyo, Japan
http://www.kasahara.cs.waseda.ac.jp

Feb. 5, 2008 at Google Headquarters

Page 2:

Hironori Kasahara

<Personal History>
B.S. (1980, Waseda), M.S. (1982, Waseda), Ph.D. (1985, EE, Waseda). Res. Assoc. (1983, Waseda), Special Research Fellow JSPS (1985), Visiting Scholar (1985, Univ. California at Berkeley), Assist. Prof. (1986, Waseda), Assoc. Prof. (1988, Waseda), Visiting Research Scholar (1989-1990, Center for Supercomputing R&D, Univ. of Illinois at Urbana-Champaign), Prof. (1997-, Dept. CS, Waseda). IFAC World Congress Young Author Prize (1987), IPSJ Sakai Memorial Special Award (1997), STARC Industry-Academia Cooperative Research Award (2004)

<Activities for Societies>
IPSJ: Sig. Computer Architecture (Chair), Trans. of IPSJ Editorial Board (HG Chair), Journal of IPSJ Editorial Board (HWG Chair), 2001 Journal of IPSJ Special Issue on Parallel Processing (Chair of Editorial Board: Guest Editor), JSPP2000 (Program Chair), etc.
ACM: International Conference on Supercomputing (ICS) (Program Committee; PC, esp. '96 ENIAC 50th Anniversary Co-Prog. Chair).
IEEE: Computer Society Japan Chapter Chair, Tokyo Section Board Member, SC07 PC
OTHER: PCs of many conferences on Supercomputing and Parallel Processing.

<Activities for Governments>
METI: IT Policy Proposal Forum (Architecture/HPC WG Chair), Super Advanced Electronic Basis Technology Investigation Committee
NEDO: Millennium Project IT21 "Advanced Parallelizing Compiler" (Project Leader), Computer Strategy WG (Chair), Multicore for Realtime Consumer Electronics Project Leader, etc.
MEXT: Earth Simulator project evaluation committee, 10PFLOPS Supercomputer evaluation committee
JAERI: Research accomplishment evaluation committee, CCSE 1st class invited researcher
JST: Scientific Research Fund Sub Committee, COINS Steering Committee, Precursory Research for Embryonic Science and Technology (Research Area Adviser)
Cabinet Office: CSTP Expert Panel on Basic Policy, Information & Communication Field Promotion Strategy, R&D Infrastructure WG, Software & Security WG

<Papers>
Papers 158, Invited Talks 64, Tech. Reports 114, Symposium 25, News Papers/TV/Web News/Magazine 139. IEEE Trans. Computer, IPSJ Trans., ISSCC, Cool Chips, Supercomputing, ACM ICS, etc.

Page 3:

Multi-core Everywhere

Multi-core from embedded to supercomputers

• Consumer Electronics (Embedded): Mobile Phone, Game, Digital TV, Car Navigation, DVD, Camera
– IBM/Sony/Toshiba Cell, Fujitsu FR1000, NEC/ARM MPCore & MP211, Panasonic Uniphier, Renesas SH multi-core (4 core RP1, 8 core RP2), Tilera Tile64, SPI Storm-1 (16 VLIW cores)
• PCs, Servers
– Intel Quad Xeon, Core 2 Quad, Montvale, Tukwila, 80 core
– AMD Quad Core Opteron, Phenom
• WSs, Deskside & Highend Servers
– IBM Power4, 5, 5+, 6
– Sun Niagara (SparcT1, T2), Rock
• Supercomputers
– Earth Simulator: 40TFLOPS, 2002, 5120 vector proc.
– IBM Blue Gene/L: 360TFLOPS, 2005, Low power CMP based 128K processor chips, BG/P 2008

[Figure: OSCAR Type Multi-core Chip by Renesas in METI/NEDO Multicore for Real-time Consumer Electronics Project (Leader: Prof. Kasahara)]

[Figure: IBM BlueGene/L, Lawrence Livermore National Laboratory, scheduled to start operation 2005/04. A supercomputer based on low-power chip multiprocessors: 65,536 processor chips (128K processors), 360TFLOPS (= 360 trillion floating-point operations per second), 1024 processor chips per cabinet, 2 processors integrated on each processor chip.]

High quality application software, productivity, cost-performance and low power consumption are important (e.g., mobile phones, games).

Compiler cooperated multi-core processors are promising to realize the above futures.

Page 4:

Market of Consumer Electronics

1 Trillion Dollars in 2010 (World Wide)

[Figure: worldwide market size (shipment value) by year, '02 to '06, with axis labels up to 16 trillion yen. Categories include mobile phones, digital cameras, flat-panel TVs, digital video, car navigation and DVD recorders.]

Annual Growth Rates

Product                             '03      '07      Annual growth rate (%)
Dig. Camera (M units)                49       76      12
Dig. TV (M units)                     6       27      45
DVD Recorder (M units)              3.6       33      74
DVD for PC, recordable (M units)     27      114      43
Mobile Phone (M units)              490      670       8
LSI for Cars (B$)                  14.0     20.9      11

Source: NEDO roadmap report meeting, 2005.5.11, Electronics and Information Technology Development Department, "Technology Development Strategy"

Page 5:

Roadmap of compiler cooperative multicore project (timeline: 2000 to 2010)

• Millennium Project IT21, NEDO Advanced Parallelizing Compiler (Waseda Univ., Fujitsu, Hitachi, JIPDEC, AIST): parallelizing compiler development for multiprocessor servers; applied to supercomputer compilers.
• STARC Compiler Cooperative Chip Multiprocessor: basic research and practical research. Participant groups shown on the slide: (Waseda, Toshiba, Fujitsu, Panasonic, Sony), (Waseda, Fujitsu, NEC, Toshiba), (Waseda Univ., Fujitsu, NEC, Toshiba, Panasonic, Sony). STARC: Semiconductor Technology Academic Research Center (Fujitsu, Toshiba, NEC, Renesas, Panasonic, Sony etc.).
• NEDO (2004.07-2007.06) Heterogeneous Multiprocessor (Waseda Univ., Hitachi): plan, arch. & compiler R&D, practical use (Waseda Univ., Hitachi, Renesas).
• NEDO (2005.06-2008.03) Multicore Technology for Realtime Consumer Electronics (Waseda Univ., Hitachi, Renesas, Fujitsu, NEC, Toshiba, Panasonic): power saving multicore architecture, parallelizing compiler, API; soft realtime. Plan, multicore arch. / compiler / API R&D, practical use.
• NEDO (2007.02-2010.03) Heterogeneous Multicore for Consumer Electronics (Waseda Univ., Hitachi, Renesas, Tokyo Inst. of Tech.): plan, hetero multicore arch. & compiler R&D.

Page 6:

METI/NEDO Advanced Parallelizing Compiler Technology Project

Millennium Project IT21, 2000.9.8 to 2003.3.31
Waseda Univ., Fujitsu, Hitachi, AIST

Background and Problems
① Adoption of parallel processing as a core technology on PC to HPC
② Increase of importance of software on IT
③ Need for improvement of cost-performance and usability

Contents of Research and Development
① R&D of advanced parallelizing compiler: multigrain, data localization, overhead hiding
② R&D of performance evaluation technology for parallelizing compilers

<Purpose>
Improvement of
① Effective performance
② Cost-performance
③ Ease of use

Goal: Double the effective performance

Ripple Effect
① Development of competitive next generation PC and HPC
② Putting the innovative automatic parallelizing compiler technology to practical use
③ Development and market acquisition of future single-chip multiprocessors
④ Boosting R&D in the following many fields: IT, Bio-tech., Device, Earth environment, Next-generation VLSI design, Financial engineering, Weather forecast, New clean energy, Space development, Automobile, Electronic Commerce, etc.

[Figure: Theoretical maximum performance vs. effective performance of HPC. Axes: year (1980, 1990, 2000) and performance (1G to 1T). Curves: hardware peak performance and effective performance (slow down), with the APC project marked.]

Page 7:

Performance of APC Compiler on IBM pSeries690 16 Processors High-end Server

• IBM XL Fortran for AIX Version 8.1
– Sequential execution : -O5 -qarch=pwr4
– Automatic loop parallelization : -O5 -qsmp=auto -qarch=pwr4
– OSCAR compiler : -O5 -qsmp=noauto -qarch=pwr4 (su2cor: -O4 -qstrict)

3.5 times speedup in average

[Figure: bar chart of speedup ratio (0.0 to 12.0) for XL Fortran(max) and APC(max), with execution times printed on the bars. Benchmarks: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5, wupwise, swim2k, mgrid2k, applu2k, sixtrack, apsi2k.]

Time labels extracted alongside the benchmark axis, in slide order: 23.1s, 38.3s, 39.5s, 27.8s, 35.1s, 30.3s, 21.5s, 126.5s, 307.6s, 291.2s, 279.1s, 22.5s, 85.8s, 18.8s, 282.4s, 321.4s

Page 8:

Performance of Multigrain Parallel Processing for 102.swim on IBM pSeries690

[Figure: speed up ratio (0.00 to 16.00) vs. number of processors (1 to 16) for XLF(AUTO) and OSCAR.]

Page 9:

OSCAR Parallelizing Compiler

• Improve effective performance, cost-performance and productivity and reduce consumed power
– Multigrain Parallelization
  • Exploitation of parallelism from the whole program by use of coarse-grain parallelism among loops and subroutines, near fine grain parallelism among statements in addition to loop parallelism
– Data Localization
  • Automatic data distribution for distributed shared memory, cache and local memory on multiprocessor systems.
– Data Transfer Overlapping
  • Data transfer overhead hiding by overlapping task execution and data transfer using DMA or data pre-fetching
– Power Reduction
  • Reduction of consumed power by compiler control of frequency, voltage and power shut down with hardware supports.

Page 10:

Generation of Coarse Grain Tasks

Macro-tasks (MTs)
– Block of Pseudo Assignments (BPA): Basic Block (BB)
– Repetition Block (RB): outermost natural loop
– Subroutine Block (SB): subroutine

[Figure: hierarchical decomposition of the program (total system) into BPA, RB and SB macro-tasks at the 1st, 2nd and 3rd layers. BPA: near fine grain parallelization. RB: loop level parallelization, near fine grain parallelization of the loop body, or coarse grain parallelization. SB: coarse grain parallelization.]

Page 11:

Earliest Executable Condition Analysis for coarse grain tasks (Macro-tasks)

[Figure, left: A Macro Flow Graph of macro-tasks 1 to 15 (BPA: Block of Pseudo Assignment Statements, RB: Repetition Block). Legend: data dependency, control flow, conditional branch.]

[Figure, right: A Macro Task Graph of macro-tasks 1 to 15 ending at END. Legend: data dependency, extended control dependency, conditional branch, OR, AND, original control flow.]
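The readiness test behind a macro task graph can be illustrated with a small sketch. This is not the OSCAR compiler's implementation: the `MacroTask` class, the `is_executable` helper and the four-task graph are hypothetical, and the condition is simplified to "the branch that selects this macro-task has been decided, and every data predecessor has either finished or is known not to execute".

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Set, Tuple


@dataclass(frozen=True)
class MacroTask:
    """Hypothetical macro-task record (names are illustrative only)."""
    name: str
    data_preds: FrozenSet[str] = frozenset()    # MTs whose results this MT reads
    control: Optional[Tuple[str, str]] = None   # (branching MT, required branch target)


def is_executable(mt: MacroTask, finished: Set[str], skipped: Set[str],
                  branches: Dict[str, str]) -> bool:
    """Simplified earliest-executable test for one macro-task."""
    # Control part: the deciding branch (if any) must already point at this MT.
    control_ok = mt.control is None or branches.get(mt.control[0]) == mt.control[1]
    # Data part: each producer has finished or is known not to execute.
    data_ok = all(p in finished or p in skipped for p in mt.data_preds)
    return control_ok and data_ok


# Hypothetical graph: MT1 branches to MT2 or MT3; MT4 reads data from both.
MT2 = MacroTask("MT2", control=("MT1", "MT2"))
MT3 = MacroTask("MT3", control=("MT1", "MT3"))
MT4 = MacroTask("MT4", data_preds=frozenset({"MT2", "MT3"}))

# MT1 has decided to branch to MT2, so MT3 will not execute.
branches = {"MT1": "MT2"}
skipped = {"MT3"}
print(is_executable(MT2, {"MT1"}, skipped, branches))          # True
print(is_executable(MT3, {"MT1"}, skipped, branches))          # False
print(is_executable(MT4, {"MT1"}, skipped, branches))          # False: MT2 not finished
print(is_executable(MT4, {"MT1", "MT2"}, skipped, branches))   # True
```

A dynamic scheduler could call such a test whenever a macro-task finishes or a branch direction becomes known, and dispatch every macro-task that has just become executable.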

Page 12:

Automatic processor assignment in 103.su2cor

• Using 14 processors
– Coarse grain parallelization within DO400 of subroutine LOOPS

Page 13:

MTG of Su2cor-LOOPS-DO400

Coarse grain parallelism PARA_ALD = 4.3

[Figure: macro task graph. Legend: DOALL, Sequential LOOP, BB, SB.]

Page 14:

Data-Localization: Loop Aligned Decomposition

• Decompose multiple loops (Doall and Seq) into CARs and LRs considering inter-loop data dependence.
– Most data in LR can be passed through LM.
– LR: Localizable Region, CAR: Commonly Accessed Region

Original loops:

C     RB1(Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C     RB2(Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C     RB3(Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

After decomposition:

       LR           CAR          LR            CAR          LR
RB1    DO I=1,33    DO I=34,35   DO I=36,66    DO I=67,68   DO I=69,101
RB2    DO I=1,33    DO I=34,34   DO I=35,66    DO I=67,67   DO I=68,100
RB3    DO I=2,34                 DO I=35,67                 DO I=68,100
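The iteration ranges on this slide can be reproduced with a short sketch. It is an illustration under assumptions, not the compiler's algorithm: it starts from a given three-way split of RB3, walks backwards through the array subscripts (RB3 reads B(I) and B(I-1), RB2 reads A(I) and A(I+1)), and marks iterations needed by two neighbouring partitions as CARs. The function names `producer_range` and `split_lr_car` are hypothetical, and the loop-carried dependence inside RB2 is ignored.

```python
def producer_range(consumer, offsets, bounds):
    """Iterations of the producer loop whose results a consumer range reads.

    consumer: (lo, hi) iteration range of the consumer loop
    offsets:  subscript offsets read by the consumer, e.g. (0, -1) for B(I), B(I-1)
    bounds:   (lo, hi) iteration range of the producer loop
    """
    lo, hi = consumer
    return (max(bounds[0], lo + min(offsets)), min(bounds[1], hi + max(offsets)))


def split_lr_car(needs):
    """Split per-partition producer ranges into LR and CAR regions.

    Assumes the ranges are sorted and only neighbouring ranges overlap.
    """
    regions = []
    for k, (lo, hi) in enumerate(needs):
        if k > 0 and needs[k - 1][1] >= lo:
            # Iterations shared with the previous partition form a CAR.
            regions.append(("CAR", lo, needs[k - 1][1]))
            lo = needs[k - 1][1] + 1
        if k < len(needs) - 1:
            hi = min(hi, needs[k + 1][0] - 1)
        regions.append(("LR", lo, hi))
    return regions


rb3_parts = [(2, 34), (35, 67), (68, 100)]                 # split of RB3 from the slide
rb2_needs = [producer_range(p, (0, -1), (1, 100)) for p in rb3_parts]
rb1_needs = [producer_range(p, (0, 1), (1, 101)) for p in rb2_needs]

print(split_lr_car(rb2_needs))
print(split_lr_car(rb1_needs))
```

With these inputs the sketch yields LR 1-33, CAR 34-34, LR 35-66, CAR 67-67, LR 68-100 for RB2 and LR 1-33, CAR 34-35, LR 36-66, CAR 67-68, LR 69-101 for RB1, matching the decomposition shown on the slide.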

Page 15:

Data Localization

[Figure: three panels. (1) MTG. (2) MTG after Division, with macro-tasks numbered 1 to 33 and grouped into Data Localization Groups dlg0, dlg1, dlg2 and dlg3. (3) A schedule for two processors (PE0, PE1).]

Page 16:

An Example of Data Localization for Spec95 Swim

      DO 200 J=1,N
      DO 200 I=1,M
      UNEW(I+1,J) = UOLD(I+1,J)+
     1    TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
     2    +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
      VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
     1    *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
     2    -TDTSDY*(H(I,J+1)-H(I,J))
      PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
     1    -TDTSDY*(CV(I,J+1)-CV(I,J))
  200 CONTINUE

      DO 210 J=1,N
      UNEW(1,J) = UNEW(M+1,J)
      VNEW(M+1,J+1) = VNEW(1,J+1)
      PNEW(M+1,J) = PNEW(1,J)
  210 CONTINUE

      DO 300 J=1,N
      DO 300 I=1,M
      UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
      VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
      POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  300 CONTINUE

(a) An example of target loop group for data localization

[Figure: positions of the arrays UN, VN, PN, UO, VO, PO, CU, CV, Z, H, U, V and P on a cache-size axis from 0 to 4MB.]

(b) Image of alignment of arrays on cache accessed by target loops

Cache line conflicts occur among arrays which share the same location on cache.

Page 17:

Data Layout for Removing Line Conflict Misses by Array Dimension Padding

Declaration part of arrays in spec95 swim

before padding:

      PARAMETER (N1=513, N2=513)

after padding:

      PARAMETER (N1=513, N2=544)

Array declarations (identical before and after padding):

      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

[Figure: array layout against the 4MB cache size before and after padding. Box: access range of DLG0.]
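The effect of the padding can be checked with a little arithmetic. The sketch below is an illustration under assumptions that are not stated on the slide: 4-byte REAL elements, arrays laid out contiguously in the order of the COMMON statement starting at cache offset 0, and a cache position taken as the byte offset modulo the 4MB cache size. The name `cache_offsets` is hypothetical.

```python
# Order of the arrays in the COMMON block on the slide.
ARRAYS = ["U", "V", "P", "UNEW", "VNEW", "PNEW", "UOLD",
          "VOLD", "POLD", "CU", "CV", "Z", "H"]


def cache_offsets(n1, n2, elem_bytes=4, cache_bytes=4 * 1024 * 1024):
    """Start position of each array inside the cache (assumed model:
    byte offset in the COMMON block modulo the cache size)."""
    array_bytes = n1 * n2 * elem_bytes
    return {name: (k * array_bytes) % cache_bytes
            for k, name in enumerate(ARRAYS)}


before = cache_offsets(513, 513)   # N2=513: each array is slightly over 1MB
after = cache_offsets(513, 544)    # N2=544: padded second dimension

# Distance in the cache between U and VNEW, which are four arrays apart.
print(before["VNEW"] - before["U"])   # 16400 bytes
print(after["VNEW"] - after["U"])     # 270848 bytes
```

Under this model, every fourth array starts only about 16KB away from U in the cache before padding, so the same parts of those arrays compete for nearly the same cache locations. After padding the starts are about 264KB apart.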

Page 18:

APC Compiler Organization

Source Program (FORTRAN) → APC Multigrain Parallelizing Compiler → OpenMP Code Generation → native compilers → shared memory parallel machines

APC Multigrain Parallelizing Compiler
• Data Dependence Analysis
– Inter-procedural
– Conditional
• Multigrain Parallelization
– Hierarchical Parallel
– Affine Partitioning
• Automatic Data Distribution
– Cache optimization
– DSM optimization
• Scheduling
– Static Scheduling
– Dynamic Scheduling
• Speculative Execution
– Coarse Grain
– Medium Grain
– Architecture Support

Parallelizing Tuning System (feedback of runtime info)
• Program Visualization Technique
• Techniques for Profiling & Utilizing Runtime Info
• Feedback-directed Selection Technique of Compiler Directives

Native compilers: IBM XL Fortran Ver.8.1, Sun Forte 6 Update 2, IBM XL Fortran Ver.7.1

Variety of Shared Memory Parallel machines: SUN Ultra 80, IBM pSeries690, IBM RS6000

Page 19:

Image of Generated OpenMP Code for Hierarchical Multigrain Parallel Processing

[Figure: generated code between SECTIONS and END SECTIONS, with one SECTION per thread. Threads T0 to T3 form thread group0 and T4 to T7 form thread group1. 1st layer: macro-tasks MT1_1, MT1_2 (DOALL), MT1_3 (SB) and MT1_4 (RB), synchronized with SYNC SEND / SYNC RECV. 2nd layer: macro-tasks 1_3_1 to 1_3_6 inside MT1_3 and 1_4_1 to 1_4_4 inside MT1_4. Labels: centralized scheduling code, distributed scheduling code.]

Page 20:

Performance of OSCAR Multigrain Parallelizing Compiler on an IBM p550q 8-core Deskside Server

2.7 times speedup against loop parallelizing compiler on 8 cores

[Figure: bar chart of speedup ratio (0 to 9) for loop parallelization and multigrain parallelization. Benchmarks: spec95 tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5; spec2000 swim, mgrid, applu, apsi.]

Page 21:

OSCAR Compiler Performance on 24 Processor IBM p690 Highend SMP Server

[Figure: Four 8-way MCM Features Assembled into a 32-way pSeries 690. MCM 0 to MCM 3, each with processors (P) sharing L2 caches, L3 caches, GX buses, memory slots and GX slots.]

[Figure: bar chart of speedup ratio (0 to 25) for XL Fortran Ver.8.1 and OSCAR. Benchmarks: spec95 tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5; spec2000 swim, mgrid, applu, apsi.]

4.82 times speedup against loop parallelization

Page 22:

Performance on SGI Altix 450 Montecito 16 processors cc-NUMA server

[Figure: bar chart of speedup ratio (0 to 9) for Intel Ver.8.1 and OSCAR. Benchmarks: spec95 tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5; spec2000 swim, mgrid, applu, apsi.]

• OSCAR compiler gave us 1.86 times speedup against Intel Fortran Itanium Compiler revision 8.1

Page 23:

Performance of OSCAR compiler on 16 cores SGI Altix 450 Montvale server

[Figure: bar chart of speedup ratio (0 to 6) for Intel Ver.10.1 and OSCAR. Benchmarks: spec95 tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5; spec2000 swim, mgrid, applu, apsi.]

• OSCAR compiler gave us 2.32 times speedup against Intel Fortran Itanium Compiler revision 10.1

Page 24:

NEC/ARM MPCore Embedded 4 core SMP

[Figure: bar chart of speedup ratio (0 to 4.5) for g77 and oscar on 1, 2, 3 and 4 cores. SPEC95 benchmarks: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d.]

3.48 times speedup by OSCAR compiler against sequential processing

Page 25:

Power Reduction by Power Supply, Clock Frequency and Voltage Control by OSCAR Compiler

• Shortest execution time mode

Page 26:

OSCAR Multi-Core Architecture

[Figure: chip multiprocessors CMP 0 to CMP m, each containing processor elements PE0, PE1, ..., PEn. Each PE has a CPU, LPM/I-Cache, LDM/D-cache, DSM, DTC, Network Interface and FVR. The PEs are connected by an intra-chip connection network (multiple buses, crossbar, etc.) to a CSM / L2 Cache. The chips, CSM j and I/O CMP k with its I/O devices are connected by an inter-chip connection network (crossbar, buses, multistage network, etc.). FVRs are also attached to the networks and shared memories.]

CSM: central shared mem.
LDM: local data mem.
DSM: distributed shared mem.
LPM: local program mem.
DTC: Data Transfer Controller
FVR: frequency / voltage control register

Page 27:

An Example of Machine Parameters for the Power Saving Scheme

• Functions of the multiprocessor
– Frequency of each proc. is changed to several levels
– Voltage is changed together with frequency
– Each proc. can be powered on/off

state            FULL    MID     LOW     OFF
frequency        1       1/2     1/4     0
voltage          1       0.87    0.71    0
dynamic energy   1       3/4     1/2     0
static power     1       1       1       0

• State transition overhead

delay time [u.t.]
state    FULL    MID     LOW     OFF
FULL     0       40k     40k     80k
MID      40k     0       40k     80k
LOW      40k     40k     0       80k
OFF      80k     80k     80k     0

energy overhead [μJ]
state    FULL    MID     LOW     OFF
FULL     0       20      20      40
MID      20      0       20      40
LOW      20      20      0       40
OFF      40      40      40      0
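One way to read these parameters is as input to a per-task state choice. The sketch below is a hypothetical illustration, not the OSCAR compiler's power control algorithm. It assumes that the work of a task is given in unit times at FULL frequency, that the task is currently in some powered-on state, and that only the transition delay from the table matters. Static power and the transition energy overhead are left out, and `pick_state` is an invented name.

```python
# Relative frequency and dynamic energy per state, taken from the table above.
FREQ = {"FULL": 1.0, "MID": 0.5, "LOW": 0.25}
DYN_ENERGY = {"FULL": 1.0, "MID": 0.75, "LOW": 0.5}
SWITCH_DELAY_UT = 40_000   # delay between two different powered-on states [u.t.]


def pick_state(work_ut, deadline_ut, current="FULL"):
    """Return the lowest-dynamic-energy state that still meets the deadline.

    work_ut:     execution time of the task at FULL frequency [u.t.]
    deadline_ut: time available for the task [u.t.]
    Returns None when even FULL cannot meet the deadline.
    """
    for state in ("LOW", "MID", "FULL"):          # ascending dynamic energy
        delay = 0 if state == current else SWITCH_DELAY_UT
        if work_ut / FREQ[state] + delay <= deadline_ut:
            return state
    return None


print(pick_state(100_000, 300_000))   # MID: 200k + 40k fits, LOW (440k) does not
print(pick_state(100_000, 500_000))   # LOW
print(pick_state(100_000, 150_000))   # FULL
print(DYN_ENERGY[pick_state(100_000, 300_000)])   # 0.75
```

In this simplified model a task with slack before its deadline runs at a lower frequency and voltage, which corresponds to the 3/4 and 1/2 dynamic-energy entries in the table.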

Page 28:

Speed-up in Fastest Execution Mode

[Figure: bar chart of speedup ratio (0 to 4.5), w/o Saving vs. w Saving, on 1, 2 and 4 processors. Benchmarks: tomcatv, swim, applu, mpeg2enc.]

Page 29:

Consumed Energy in Fastest Execution Mode

[Figure: bar charts of consumed energy, w/o Saving vs. w Saving, against number of proc. (1, 2, 4). Left: energy (J), 0 to 200, for tomcatv, swim and applu. Right: energy (mJ), 0 to 1600, for mpeg2_encode.]

Page 30:

Energy Reduction by OSCAR Compiler in Real-time Processing mode (10% Leak)

• deadline = sequential execution time, Leakage Power: 10%

Page 31:

METI/NEDO National Project: Multi-core for Real-time Consumer Electronics

<Goal> R&D of compiler cooperative multi-core processor technology for consumer electronics like mobile phones, games, DVD, digital TV, car navigation systems.

<Period> From July 2005 to March 2008 (2005.7~2008.3)**

<Features>
・Good cost performance
・Short hardware and software development periods
・Low power consumption
・Scalable performance improvement with the advancement of semiconductor
・Use of the same parallelizing compiler for multi-cores from different vendors using newly developed API

API: Application Programming Interface

**Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC

[Figure: multicore architecture, labels translated from Japanese. CMP 0 to CMP m (multicore chips), each with processor cores PC0, PC1, ..., PCn. Each core has a CPU (processor), LPM/I-Cache (local program memory / instruction cache), LDM/D-cache (local data memory / L1 data cache), DSM (distributed shared memory), DTC (data transfer controller) and NI (network interface). IntraCCN (intra-chip connection network: multiple buses, crossbar, etc.), CSM / L2 Cache (centralized shared memory or L2 cache), InterCCN (inter-chip connection network: multiple buses, crossbar, multistage network, etc.), CSM j (centralized shared memory), I/O CSP k (multicore chip for I/O), I/O devices.]

[Figure text, translated from Japanese: New multicore processor: high performance, low power consumption, short HW/SW development period, applications can be shared among chips, high reliability, performance improves along with semiconductor integration density. Multicore integrated ECU. The developed multicore chips go into consumer electronics.]

Page 32:

NEDO Multicore Technology for Realtime Consumer Electronics R&D Organization (2005.7-2008.3)

Integrated R&D Steering Committee
Chair: Hironori Kasahara (Waseda Univ.), Project Leader
Sub-chair: Kunio Uchiyama (Hitachi), Project Sub-leader

Grant: Multicore Architecture and Compiler Research and Development
Project Leader: Prof. Hironori Kasahara, Waseda Univ.
• R&D Steering Committee
• Standard Multicore Architecture & API Committee. Chair: Prof. Hironori Kasahara, Waseda Univ. (Hitachi, Fujitsu, Toshiba, NEC, Panasonic, Renesas Technology)
• Architecture & Compiler R&D Group. Group Leader: Associate Prof. Keiji Kimura, Waseda Univ. (Outsourcing: Fujitsu)

Subsidy: Evaluation Environment for Multicore Technology Research and Development
Project Sub-leader: Dr. Kunio Uchiyama, Chief Researcher, Hitachi
• Test Chip Development Group. Group Leader: Renesas Technology
• Evaluation System Development Group. Group Leader: Hitachi

Page 33:

Fujitsu FR-1000 Multicore Processor

[Figure: FR550 VLIW Processor. I-cache 32KB, D-cache 32KB, Integer Operation Unit (GR), Media Operation Unit (FR), instruction slots Inst. 0 to Inst. 7.]

[Figure: FR-V Multi-core Processor (FR1000). Four FR550 cores, a memory crossbar switch, two memory controllers with memory, DMA-I, DMA-E, Local BUS IF and IO.]

[Figure: system configuration with FR1000 chips (Core1, Core2), memories, switches, a crossbar, an IO chip and a fast I/O bus.]

• Memory Bus: 64bit x 2ch / 266MHz
• System Bus: 64bit / 178MHz

Page 34: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

Panasonic UniPhier


Page 35: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

CELL Processor Overview

• Power Processor Element (PPE)
– PowerCore processes OS and control tasks
– 2-way multi-threaded
• Synergistic Processor Element (SPE)
– 8 SPEs offer high performance
– Dual issue RISC architecture
– 128bit SIMD (16-way)
– 128 x 128bit general registers
– 256KB Local Store
– Dedicated DMA engines

[Block diagram, "Synergistic Processor Elements for High (Fl)ops / Watt": eight SPEs, each an SPU with a Local Store (LS), connected at 16B/cycle to the EIB (up to 96B/cycle). Also on the EIB: the PPE (PPU, L1 32KB+32KB, L2 512KB), the MIC to dual XDR memory, and the BIC to RRAC I/O. PPE caption: 64-bit Power Architecture with VMX for conventional processing.]

Page 36: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)

Page 37: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

OSCAR (Optimally Scheduled Advanced Multiprocessor)

[Architecture diagram: a host computer; a control & I/O processor (RISC processor, I/O processor, data memory, program memory); centralized shared memories CSM1, CSM2, CSM3 (simultaneous readable; Bank1 to Bank3, with a read & write requests arbitrator); a bus interface; and processor elements PE1 to PE16, each with a distributed shared memory (dual port). The PEs are shown grouped as 5PE clusters (SPC1, SPC2, SPC3) and as 8PE processor clusters (LPC1, LPC2).]

Processor element:
- 5MFLOPS, 32bit RISC processor (64 registers)
- 2 banks of program memory
- Data memory
- Stack memory
- DMA controller

Page 38: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

OSCAR PE (Processor Element)

[Block diagram: system bus, bus interface, local bus 1, local bus 2, instruction bus, and the units listed below.]

DMA: DMA controller
LPM: Local program memory (128KW * 2 banks)
INSC: Instruction control unit
DSM: Distributed shared memory (2KW)
LSM: Local stack memory (4KW)
LDM: Local data memory (256KW)
DP: Data path
IPU: Integer processing unit
FPU: Floating processing unit
REG: Register file (64 registers)

Page 39: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

1987 OSCAR PE Board


Page 40: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

OSCAR Memory Space

[Memory map figure with two address spaces, each running from 00000000 to FFFFFFFF.
System memory space: broadcast areas, CP (control processor), local memory areas for PE 0, PE 1, ... PE15, CSM1, 2, 3 (centralized shared memory), and regions marked UNDEFINED or NOT USE.
Local memory space: DSM (distributed shared memory), LPM bank 0 and bank 1 (local program memory), LDM (local data memory), control, system accessing area, and regions marked NOT USE.]

Page 41: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

OSCAR Multi-Core Architecture

[Architecture diagram: chip multiprocessors CMP 0 to CMP m. Each CMP holds processor elements PE0, PE1, ... PEn, and each PE contains a CPU, LPM/I-cache, LDM/D-cache, DSM, DTC, network interface, and FVR. The PEs share an intra-chip connection network (multiple buses, crossbar, etc.) and a CSM / L2 cache. An inter-chip connection network (crossbar, buses, multistage network, etc.) links the CMPs with CSM j, I/O CMP k, and I/O devices. FVRs are attached throughout the chip.]

CSM: central shared mem.
LDM: local data mem.
DSM: distributed shared mem.
LPM: local program mem.
DTC: Data Transfer Controller
FVR: frequency / voltage control register

Page 42: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

API and Parallelizing Compiler in METI/NEDO Advanced Multicore for Realtime Consumer Electronics Project

[Flow diagram:
1. Input: sequential application program (subset of C language) from realtime consumer electronics application programs (image, secure audio streaming, etc.).
2. The Waseda OSCAR compiler translates it into parallel codes for each vendor, using the API to specify data assignment, data transfer, and power reduction control.
3. For each vendor chip, a backend compiler (API decoder + sequential compiler) produces machine codes.
4. Output: executable codes for each vendor chip (SH multicore, FR-V, (CELL)).
Inset: scheduled tasks T1 to T6 on Proc0, Proc1, Proc2, with "Stop" and "Slow" power states and data transfer by DTC (DMAC).]

Page 43: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

OSCAR Multigrain Parallelizing Compiler

• Automatic Parallelization
– Multigrain Parallel Processing
– Data Localization
– Data Transfer Overlapping
– Compiler Controlled Power Saving Scheme
• Compiler cooperative Multi-core architecture
– OSCAR Multi-core Architecture
– OSCAR Heterogeneous Multiprocessor Architecture
• Commercial SMP machines

Page 44: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

Processor Block Diagram

[Block diagram: four cores (Core #0 to Core #3) running at 600MHz. Each core has a CPU, FPU, I$ 32K, D$ 32K, IL 8K, DL 16K, CCN, and URAM 128K. A snoop controller (SNC) and LCPG0-3 sit beside the cores. The on-chip system bus (SHwy) runs at 300MHz and connects DBSC (DDR2, 32bit), LBSC (SRAM, 32bit), PCIExp (4 lane), HW IP, and GCPG.]

CCN: Cache controller
IL, DL: Instruction/Data local memory
URAM: User RAM
GCPG: Global clock pulse generator
LCPG: Local CPG for each core
LBSC: SRAM controller
DBSC: DDR2 controller

Page 45: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

Chip Overview (SH4A Multicore SoC Chip)

Process Technology: 90nm, 8-layer, triple-Vth, CMOS
Chip Size: 97.6mm2 (9.88mm x 9.88mm)
Supply Voltage: 1.0V (internal), 1.8/3.3V (I/O)
Power Consumption: 0.6 mW/MHz/CPU @ 600MHz (90nm G)
Clock Frequency: 600MHz
CPU Performance: 4320 MIPS (Dhrystone 2.1)
FPU Performance: 16.8 GFLOPS
I/D Cache: 32KB 4way set-associative (each)
ILRAM/OLRAM: 8KB/16KB (each CPU)
URAM: 128KB (each CPU)
Package: FCBGA 554pin, 29mm x 29mm

[Chip photo labels: Core #0, Core #1, Core #2, Core #3, SNC, Peripherals]

ISSCC07 Paper No.5.3, Y. Yoshida, et al., “A 4320MIPS Four-Processor Core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption”
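The chip-level figures above can be broken down per core with a little arithmetic. The sketch below assumes that the 4320 MIPS and 16.8 GFLOPS figures are aggregates over the four cores (the slide does not say so explicitly) and that the power figure scales linearly with clock frequency. The variable names are illustrative only.

```python
# Per-core figures derived from the chip-level numbers quoted on the slide.
# Assumption: MIPS and GFLOPS are totals over all four cores.
CORES = 4
CLOCK_MHZ = 600
POWER_MW_PER_MHZ_PER_CPU = 0.6
TOTAL_MIPS = 4320
TOTAL_GFLOPS = 16.8

# Power of one CPU at full clock, in milliwatts.
cpu_power_mw = POWER_MW_PER_MHZ_PER_CPU * CLOCK_MHZ

# Integer throughput per core and per clock cycle.
mips_per_core = TOTAL_MIPS / CORES
mips_per_mhz = mips_per_core / CLOCK_MHZ

# Floating-point throughput per core and per clock cycle.
gflops_per_core = TOTAL_GFLOPS / CORES
flops_per_cycle = gflops_per_core * 1000 / CLOCK_MHZ

print(f"CPU power at {CLOCK_MHZ} MHz: {cpu_power_mw:.0f} mW")
print(f"MIPS per core: {mips_per_core:.0f} ({mips_per_mhz:.1f} MIPS/MHz)")
print(f"GFLOPS per core: {gflops_per_core:.1f} ({flops_per_cycle:.1f} FLOPs/cycle)")
```

Under these assumptions each CPU draws about 360 mW at 600MHz and delivers about 1080 MIPS and 4.2 GFLOPS.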

Page 46: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

Performance on a Developed SH Multi-core (RP1: SH-X3) Using Compiler and API

[Two bar charts. Vertical axis: speedup (0 to 5.0). Horizontal axis: number of processors (1, 2, 3, 4).]

Audio AAC* Encoder: 1.00, 1.91, 2.95, 3.82
Image Susan Smoothing**: 1.00, 1.86, 2.70, 3.43

*) ISO Advanced Audio Coding
**) Mibench embedded application benchmark by Michigan Univ.
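A common way to read speedup charts like these is parallel efficiency, the speedup divided by the number of processors. The sketch below is a small helper for that. The speedup values are the bar labels as read from the two charts (1.00, 1.91, 2.95, 3.82 for AAC encoding and 1.00, 1.86, 2.70, 3.43 for Susan smoothing), and the function name is hypothetical.

```python
def parallel_efficiency(speedups):
    """Return {processors: speedup / processors} for a speedup table."""
    return {p: s / p for p, s in speedups.items()}


# Speedups on 1 to 4 processors, as read from the charts on this slide.
aac_encoder = {1: 1.00, 2: 1.91, 3: 2.95, 4: 3.82}
susan_smoothing = {1: 1.00, 2: 1.86, 3: 2.70, 4: 3.43}

aac_eff = parallel_efficiency(aac_encoder)
susan_eff = parallel_efficiency(susan_smoothing)

for p in sorted(aac_eff):
    print(f"{p} processors: AAC {aac_eff[p]:.1%}, Susan {susan_eff[p]:.1%}")
```

With these values the AAC encoder keeps an efficiency above 95% on four processors, while Susan smoothing is closer to 86%.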

Page 47: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

RP2 Chip Photo and Specifications

Process Technology: 90nm, 8-layer, triple-Vth, CMOS
Chip Size: 104.8mm2 (10.61mm x 9.88mm)
CPU Core Size: 6.6mm2 (3.36mm x 1.96mm)
Supply Voltage: 1.0V–1.4V (internal), 1.8/3.3V (I/O)
Power Domains: 17 (8 CPUs, 8 URAMs, common)

[Chip photo labels: Core#0 to Core#7, SNC0, SNC1, SHWY, VSWC, LBSC, DBSC, CSM, GCPG, DDRPAD; within a core: ILRAM, DLRAM, URAM, I$, D$]

47

Page 48: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

8 Core RP2 Chip Block Diagram

[Block diagram: Cluster #0 (Core #0 to Core #3) and Cluster #1 (Core #4 to Core #7), linked by barrier sync. lines. Each core has a CPU, FPU, I$ 16K, D$ 16K, local memory (I: 8K, D: 32K), CCN/BAR, and User RAM (URAM) 64K. Each cluster has a snoop controller (snoop controller 0 and 1), an LCPG (LCPG0, LCPG1), and power control registers (PCR0 to PCR3, PCR4 to PCR7). The on-chip system bus (SuperHyway) connects DDR2 control, SRAM control, and DMA control.]

LCPG: Local clock pulse generator
PCR: Power Control Register
CCN/BAR: Cache controller/Barrier Register
URAM: User RAM

Page 49: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

Processing Performance on the Developed Multicore Using Automatic Parallelizing Compiler

Speedup against single core execution for audio AAC* encoding

[Bar chart. Vertical axis: speedups (0.0 to 7.0). Horizontal axis: numbers of processor cores.]

1 core: 1.0
2 cores: 1.9
4 cores: 3.6
8 cores: 5.8

*) Advanced Audio Coding
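One way to interpret the speedups above (1.9, 3.6, and 5.8 on 2, 4, and 8 cores) is the Karp-Flatt metric, which estimates the serial or overhead fraction that a measured speedup implies: e = (1/S - 1/p) / (1 - 1/p). This metric is not part of the slide; it is brought in here only as an analysis aid, and the function name is illustrative.

```python
def karp_flatt(speedup, cores):
    """Experimentally determined serial fraction for a measured speedup."""
    if cores < 2:
        raise ValueError("the metric is defined for two or more cores")
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


# AAC encoding speedups reported on this slide.
measured = {2: 1.9, 4: 3.6, 8: 5.8}

serial_fraction = {p: karp_flatt(s, p) for p, s in measured.items()}

for p, e in serial_fraction.items():
    print(f"{p} cores: implied serial fraction {e:.3f}")
```

For these numbers the implied serial fraction stays in the range of roughly 4% to 5.5% from 2 to 8 cores.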

Page 50: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

Power Reduction Using Power Shutdown and Voltage Control by OSCAR Parallelizing Compiler

[Two plots of consumed power [W], vertical axis 0 to 7.]

Without power control: average power 5.82 [W]
With power control (resume standby: power shutdown & voltage lowering): average power 0.81 [W]

86.0% Power Reduction
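The reduction figure follows directly from the two average power values, 5.82 W without power control and 0.81 W with it. The short sketch below reproduces the calculation; the function name is hypothetical.

```python
def power_reduction(baseline_w, controlled_w):
    """Fraction of average power saved relative to the baseline."""
    return 1.0 - controlled_w / baseline_w


# Average power values reported on this slide, in watts.
without_control_w = 5.82
with_control_w = 0.81

reduction = power_reduction(without_control_w, with_control_w)
print(f"Power reduction: {reduction:.1%}")
```

The result is about 86%, in line with the figure stated on the slide.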

Page 51: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

OSCAR Heterogeneous Multicore

• OSCAR Type Memory Architecture
• LPM
– Local Program Memory
• LDM
– Local Data Memory
• DSM
– Distributed Shared Memory
• CSM
– Centralized Shared Memory
• On Chip and/or Off Chip
• DTU
– Data Transfer Unit
• Interconnection Network
– Multiple Buses
– Split Transaction Buses
– CrossBar …

Page 52: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

Static Scheduling of Coarse Grain Tasks for a Heterogeneous Multi-core

[Figure: a macro-task graph (MT1 to MT13, ending in EMT) and its static schedule on three processors: CPU0, CPU1, DRP.]

Macro-tasks for CPU: MT1, MT2, MT3, MT6, MT8, MT11
Macro-tasks for DRP: MT4, MT5, MT7, MT9, MT10, MT12, MT13
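The idea of statically assigning macro-tasks to CPUs or to a DRP accelerator can be sketched with a simple greedy list scheduler. This is only an illustration, not the OSCAR compiler's actual algorithm. The task kinds for MT1 to MT5 follow the slide, but the costs and the dependence edges are invented for the example, and all names are hypothetical.

```python
from collections import namedtuple

# kind is "CPU" or "DRP"; preds lists the names of predecessor tasks.
Task = namedtuple("Task", "name kind cost preds")


def static_schedule(tasks, processors):
    """Greedy list scheduling on a heterogeneous machine.

    tasks: list of Task, already in a topological order.
    processors: dict mapping processor name -> kind.
    Returns a dict mapping task name -> (processor, start, finish).
    """
    free_at = {p: 0 for p in processors}
    placed = {}
    for t in tasks:
        # A task is ready once all of its predecessors have finished.
        ready = max((placed[p][2] for p in t.preds), default=0)
        # Only processors of the matching kind may run the task.
        candidates = [p for p, kind in processors.items() if kind == t.kind]
        best = min(candidates, key=lambda p: max(free_at[p], ready))
        start = max(free_at[best], ready)
        placed[t.name] = (best, start, start + t.cost)
        free_at[best] = start + t.cost
    return placed


processors = {"CPU0": "CPU", "CPU1": "CPU", "DRP": "DRP"}

# Costs and edges below are assumptions made for this example.
tasks = [
    Task("MT1", "CPU", 2, []),
    Task("MT2", "CPU", 3, ["MT1"]),
    Task("MT3", "CPU", 2, ["MT1"]),
    Task("MT4", "DRP", 4, ["MT2"]),
    Task("MT5", "DRP", 2, ["MT3"]),
]

schedule = static_schedule(tasks, processors)
makespan = max(finish for _, _, finish in schedule.values())
for name, (proc, start, finish) in schedule.items():
    print(f"{name}: {proc} [{start}, {finish})")
print("makespan:", makespan)
```

Because the tasks are visited in a fixed order, MT5 waits behind MT4 on the single DRP even though it becomes ready earlier. A production scheduler would normally rank ready tasks by a priority such as critical path length.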

Page 53: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

An Image of Static Schedule for Heterogeneous Multi-core with Data Transfer Overlapping and Power Control

[Schedule figure. Vertical axis: TIME.]
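The benefit of data transfer overlapping can be estimated with a small timing model. The sketch below assumes a double-buffered pipeline in which a DMA engine loads the data for the next task while the current task computes. The model, the function names, and the timings are all assumptions made for illustration and are not taken from the slide.

```python
def makespan_serial(transfers, computes):
    """Total time when every transfer and compute phase runs back to back."""
    return sum(transfers) + sum(computes)


def makespan_overlapped(transfers, computes):
    """Total time when the transfer for task i+1 overlaps compute of task i.

    The first transfer cannot be hidden. After that, each step takes as
    long as the slower of the current compute and the next transfer.
    """
    total = transfers[0]
    for i, compute in enumerate(computes):
        next_transfer = transfers[i + 1] if i + 1 < len(transfers) else 0
        total += max(compute, next_transfer)
    return total


# Hypothetical per-task times in arbitrary units.
transfers = [2, 2, 2]
computes = [3, 1, 3]

serial = makespan_serial(transfers, computes)
overlapped = makespan_overlapped(transfers, computes)
print("serial:", serial, "overlapped:", overlapped)
```

In this model a transfer is fully hidden whenever the preceding compute phase is at least as long as the transfer.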

Page 54: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

Compiler Performance on an OSCAR Hetero-multi-core

25.2 times speedup using 4 SH general purpose cores and 4 DRP accelerators against a single SH

[Bar chart based on execution clock cycles, vertical axis 0.0 to 20.0, baseline 1.00 for one SH core (SH4A, a general-purpose high-performance processor). Configurations on the horizontal axis: 1CPU, 4CPU, 2CPU+1DRP, 2CPU+2DRP, 4CPU+2DRP, 2CPU+4DRP, 4CPU+4DRP, 8CPU+4DRP, 4CPU+8DRP, 8CPU+8DRP. The set is repeated for three SH4A core systems with on-chip CSM: STB×1, BUS×3, STB×3. Chart annotations: 25.2 and 41.3 times speedup against 1 SH core.]

Page 55: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

Power Reduction by OSCAR Compiler (4SHs+4DRPs)

0.78 W: 22% power reduction by compiler control

[Stacked bar chart. Vertical axis: average power consumption [W] (0 to 1.2). Stack components: DRP, Clock, FPU, IEU, BPU, Register. Bars compare FVOFF and FVON for STB×1, Bus×3, and STB×3, with off-chip and on-chip CSM. Bar labels: 1.01 W with FVOFF, 0.78 W with FVON (-22%).]

2007.7.30

Page 56: A Multigggrain Parallelizing Compiler with Power …デジタルTV(M台) 6 27 45 用 ( 記録型 )( 台 ) 2727 114114 4343 3.63.6 3333 7474 6 Dig. Camera Dig. TV DVD

Conclusions

Compiler cooperative low power, high effective performance multi-core processors will be more important in a wide range of information systems, from games, mobile phones, and automobiles to peta-scale supercomputers.

Parallelizing compilers are essential for realization of:
• Good cost performance
• Short hardware and software development periods
• Low power consumption
• High software productivity
• Scalable performance improvement with advancement in semiconductor integration technology

Key technologies in multi-core compiler:
• Multigrain parallelization, Data localization, Data transfer overlapping using DMA, Low power control technologies