Partitioning and Analysis of the Network-on-Chip on a COTS Many-Core Platform
Matthias Becker, Borislav Nikolić, Dakshina Dasari, Benny Åkesson, Vincent Nélis, Moris Behnam, Thomas Nolte
RTAS, Pittsburgh, April 18, 2017
-
Many-core processors are developed with large core counts (64, 256, 1024 cores).
How to use them?
● Execute Large Applications that Utilize all/many Cores
● Consolidate Many Applications on the Cores/Clusters
-
● System Model
● Motivation
● Partitioning the NoC
● WCTT Analysis for the partitioned NoC
● Setting the traffic shaping parameters
● Evaluation
● Conclusions
Outline
-
Kalray MPPA Many-Core Platform: Overview
● 256 Cores on one Processor
● 16 Compute Clusters, each with 16 Compute Cores, 1 Resource Management Core, and Local Memory
● 4 I/O Subsystems, each containing 4 Compute Cores
[Diagram: 4x4 grid of compute clusters, surrounded by the I/O subsystems: North - DDR, South - DDR, West - Ethernet, East - Ethernet]
-
Kalray MPPA Many-Core Platform: The Network-on-Chip (1)
● 2D-Torus Topology
● 2 Topologically Identical NoCs
● D-NoC for data communication
● C-NoC for control messages
[Diagram: the NoC connecting the clusters and the I/O subsystems DDR 0, DDR 1, Ethernet 0, and Ethernet 1]
-
Kalray MPPA Many-Core Platform: The Network-on-Chip (2)
● Wormhole Switching
● Output Buffer
● Round Robin Arbitration
[Diagram: a router with input ports North, South, West, East, and Cluster; each output port has a FIFO with round-robin (RR) arbitration among the inputs]
-
Kalray MPPA Many-Core Platform: The Network-on-Chip (3)
● No Flow Control on Link Level
● Flow Regulation on Source Nodes
● Packet Shaper
● Traffic Limiter
[Diagram: the application payload is split by the packet shaper into NoC packets, each with a header H; the traffic limiter then injects them into the NoC]
● Window Size T_w
● Bandwidth Quota β
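The flow regulation on the source node can be pictured as a small simulation: the traffic limiter lets at most β flits enter the NoC in any window of T_w cycles. A minimal sketch, assuming a sliding window and at most one injected flit per cycle; the function and parameter names are ours, not the platform's API:

```python
from collections import deque

def traffic_limiter(flit_ready_times, window, quota):
    """Return the injection cycle of each flit so that at most `quota`
    flits enter the NoC in any `window` consecutive cycles."""
    recent = deque()   # injection cycles still inside the current window
    out = []
    t = 0
    for ready in flit_ready_times:
        t = max(t, ready)
        while recent and recent[0] <= t - window:
            recent.popleft()           # drop expired injections
        if len(recent) >= quota:       # window full: wait until the
            t = recent[0] + window     # oldest injection leaves it
            while recent and recent[0] <= t - window:
                recent.popleft()
        out.append(t)
        recent.append(t)
        t += 1                         # at most one flit per cycle
    return out
```

For example, with T_w = 4 and β = 2, five back-to-back flits are injected at cycles 0, 1, 4, 5, 8: never more than two flits in any four-cycle window, exactly the bandwidth quota.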
-
Application Model (1)
● Applications on each cluster need to access the NoC
● Exchanging messages
● Accessing off-chip memory
● Applications operate on read-execute-write semantics
-
Application Model
● Each application has a number of:
● Read requests
● Write requests
[Timeline: a read phase of length Δ_read, an execution phase of length Δ_exec, and a write phase of length Δ_write; the read and write requests are served via the NoC and the I/O subsystem]
-
Motivation
[Diagram: two applications communicating over the NoC]
● Analysis of the NoC is non-trivial
● Many architectural features pose challenges (buffer, traffic limiter, routing, …)
Pessimistic estimates → larger task WCETs → less efficient platform usage
-
Contributions
● NoC organization that reduces contention by partitioning
● Timing analysis for the partitioned NoC
● A method to configure the flow regulation on source nodes
-
Partitioning the NoC
[Diagram: the 16 compute clusters and the four I/O subsystems; each cluster is assigned to its closest I/O subsystem]
● Avoid horizontal communication
● Clusters communicate with the closest I/O subsystem
● Each cluster sends messages via the I/O subsystem
● Most NoC packets target loading of code
● Cluster-to-cluster messages go through the I/O subsystem
● 8 identical NoC partitions
-
WCTT Analysis in the Partitioned NoC: Overview
● 3 cases to analyze
● Sending a request message on the C-NoC ⎯ WCTT_CNoC
● Sending data on the D-NoC to the I/O subsystem ⎯ WCTT_CC→IO
● Receiving data on the D-NoC ⎯ WCTT_IO→CC
[Diagrams: paths between Cluster A, Cluster B, and the I/O subsystem for the compute-cluster-to-I/O and the I/O-to-compute-cluster cases]
-
WCTT Analysis in the Partitioned NoC: WCTT_CC→IO (1)
[Diagram: path from Cluster A through the routers to the I/O subsystem; the flow regulation delay arises at the source node, the round-robin delay at the shared output buffers]
-
WCTT Analysis in the Partitioned NoC: WCTT_CC→IO (2) ⎯ The Traffic Limiter
● Based on what criteria to select the bandwidth quota β?
● WCTT
● Buffer occupation in the first router
[Plot: WCTT (cycles, 0–100000) and buffer occupation (flits, 0–6000) versus the flow regulation budget β (flits, 67–567); curves: WCTT, Nmax, Max. Buffer, Available Buffer. Below β_min the WCTT is not minimal; above β_max the buffer in the router overflows.]
-
WCTT Analysis in the Partitioned NoC: WCTT_CC→IO (3)
● Observations from the traffic limiter settings
● The buffer in the first router transmits a flit in each cycle
● Faster injection at the source node has no impact on the WCTT

WCTT_CC→IO = (d_tx + d_RR) · (n_pkt − 1) + d_RR + C_NoC

● (d_tx + d_RR) · (n_pkt − 1): RR-blocking and transmission from the buffer of all but the last packet
● d_RR: RR-blocking of the last packet
● C_NoC: transmission of the last packet over the NoC without interference
(n_pkt: number of packets; d_tx: transmission time of one packet from the buffer; d_RR: worst-case round-robin blocking per packet)
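The three-term bound on this slide can be evaluated directly. A minimal sketch; the symbol names used here (n_pkt packets, per-packet buffer transmission time d_tx, worst-case round-robin blocking d_RR, interference-free NoC traversal C_NoC of the last packet) are readable stand-ins chosen for this transcript, not the paper's notation:

```python
def wctt_cc_to_io(n_pkt, d_tx, d_rr, c_noc):
    """Worst-case traversal time, compute cluster -> I/O subsystem."""
    blocking_and_tx = (d_tx + d_rr) * (n_pkt - 1)  # all but the last packet
    last_pkt_blocking = d_rr                       # RR-blocking of the last packet
    last_pkt_tx = c_noc                            # interference-free traversal
    return blocking_and_tx + last_pkt_blocking + last_pkt_tx
```

Note that a single packet (n_pkt = 1) only pays its own RR-blocking plus the interference-free traversal.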
-
Determining the Parameters for the Traffic Limiter (1)
● Two cases:
● Find β_min such that the WCTT_CC→IO is minimal
● Find β_max such that the buffer in the first router does not overflow
-
Determining the Parameters for the Traffic Limiter (2) ⎯ β_min
● The buffer in the first router:
[Plot: cumulative data (flits) over time (cycles); the flits that arrive in the buffer, shaped by the traffic limiter, and the flits that depart from the buffer, shaped by the RR-interference; one departure segment λ is marked]
Set β_min such that the flits that arrive during one departure segment equal the flits that leave the buffer in the same time:
T_w + d_tx ≤ (β / d_tx) · d_RR + β
Solved via binary search, ILP, …
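The slide leaves the solution technique open ("binary search, ILP, …"). A generic sketch of the binary-search option, under our assumption that feasibility (arrivals keeping up with departures in every departure segment) is monotone in the quota β; the predicate `feasible` stands in for the slide's balance condition:

```python
def smallest_feasible_quota(feasible, lo, hi):
    """Smallest integer quota beta in [lo, hi] with feasible(beta) True,
    assuming feasible is monotone (False ... False True ... True)."""
    if not feasible(hi):
        raise ValueError("no feasible quota in [lo, hi]")
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid          # mid works: search the lower half
        else:
            lo = mid + 1      # mid fails: search the upper half
    return lo
```

For instance, `smallest_feasible_quota(lambda b: b * b >= 50, 1, 100)` returns 8, the smallest quota satisfying that toy predicate.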
-
Evaluation
-
Evaluation
● Experiments evaluate different aspects of the work
● Measurements on the Kalray MPPA platform
● Case study of an engine management system
● All experiments are based on parameters of the Kalray MPPA
● D-NoC packet payload = 62 flits
● C-NoC packet payload = 2 flits
● Header size = 4 flits
● Router: switching delay = 1 cycle, channel delay = 1 cycle
● Buffer size = 401 flits
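With the packet parameters above, the number of D-NoC packets and flits for a given payload follows directly. A minimal sketch; the 4-byte flit width is our assumption, as it is not stated on the slide:

```python
import math

def dnoc_packet_count(payload_bytes, flit_bytes=4,
                      payload_flits_per_pkt=62, header_flits=4):
    """Split a payload into D-NoC packets (62 payload flits + 4 header
    flits per packet, as in the evaluation setup).
    Returns (number_of_packets, total_flits_on_the_wire)."""
    payload_flits = math.ceil(payload_bytes / flit_bytes)
    n_pkt = math.ceil(payload_flits / payload_flits_per_pkt)
    return n_pkt, payload_flits + n_pkt * header_flits
```

Under these assumptions a 1 KB payload needs 5 packets and 276 flits in total.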
-
Evaluation: Total Read Latency on the MPPA (1)
● Measuring the time to read data from off-chip memory to a compute cluster, [1, 16] KB
● Varying the number of clusters that access memory through the same I/O node: 16, 8, 4, 2 clusters
● Latencies on clusters 0 and 4 are observed
● They represent one NoC partition
● Each data point represents the maximum observed value out of 10000 samples
-
Evaluation: Total Read Latency on the MPPA (2)
[Plot: read latency in ms (0–60) versus payload size (1 KB–16 KB), for 16, 8, 4, and 2 clusters, each measured on cluster C0 and cluster C4]
-
Evaluation: Simulation-based Case Study (1)
● Engine Management System (EMS)
● 15 runnables with periods [5, 10, 20, 100] ms
● Footprint [7076, 17424] bytes (code + data)
● Each runnable loads its footprint at the beginning of execution and writes its footprint back to the off-chip memory at the end of execution
[Diagram: two clusters running the EMS, connected to the I/O subsystem; a Reorder Core (RC) on the I/O subsystem manages the memory requests]
-
Evaluation: Simulation-based Case Study (2)
[Two bar charts: latency (cycles, 0–10000) for runnables M1–M15, comparing Analysis, Max, and Isolation; top: writing to memory (WCTT_CC→IO); bottom: reading from memory (WCTT_CNoC + WCTT_IO→CC)]
-
Conclusions and Future Work
● The shared NoC is one of the main sources of interference
● Difficult to analyze due to its many architectural features
● Novel NoC partitioning scheme to reduce interference and ease analysis
● Tailored analysis for the partitioned NoC
● Configuration of the traffic limiter to avoid buffer overflow and to guarantee minimal transmission times
● Future work: focus on the memory access within the I/O subsystem
● Handling of requests affects the overall latency of memory access
-
Thank you for your attention! Questions?