Partitioning and Analysis of the Network-on-Chip on a COTS Many-Core Platform
Matthias Becker, Borislav Nikolić, Dakshina Dasari, Benny Åkesson, Vincent Nélis, Moris Behnam, Thomas Nolte
RTAS, Pittsburgh, April 18, 2017
-
Many-core processors are developed with large core counts (64, 256, 1024 cores).
How to use them?
● Execute Large Applications that Utilize all/many Cores
● Consolidate Many Applications on the Cores/Clusters
-
● System Model
● Motivation
● Partitioning the NoC
● WCTT Analysis for the partitioned NoC
● Setting the traffic shaping parameters
● Evaluation
● Conclusions
Outline
-
Kalray MPPA Many-Core Platform: Overview
● 256 Cores on one Processor
● 16 Compute Clusters, each with 16 Compute Cores, 1 Resource Management Core, and Local Memory
● 4 I/O Subsystems, each containing 4 Compute Cores
[Diagram: 4x4 grid of compute clusters, surrounded by the I/O subsystems: North - DDR, South - DDR, West - Ethernet, East - Ethernet]
-
Kalray MPPA Many-Core Platform: The Network-on-Chip (1)
● 2D-Torus Topology
● 2 Topologically Identical NoCs
● D-NoC for data communication
● C-NoC for control messages
[Diagram: the NoC connecting the clusters and the I/O subsystems DDR 0, DDR 1, Ethernet 0, and Ethernet 1]
-
Kalray MPPA Many-Core Platform: The Network-on-Chip (2)
● Wormhole Switching
● Output Buffer
● Round Robin Arbitration
[Diagram: a router with input ports North, South, West, East, and Cluster; each output port has a FIFO with round-robin (RR) arbitration among the inputs]
-
Kalray MPPA Many-Core Platform: The Network-on-Chip (3)
● No Flow Control on Link Level
● Flow Regulation on Source Nodes
● Packet Shaper
● Traffic Limiter
[Diagram: the application payload is split by the packet shaper into NoC packets, each with a header H; the traffic limiter then injects them into the NoC]
● Window Size T_w
● Bandwidth Quota β
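The flow regulation on the source node can be pictured as a small simulation: the traffic limiter lets at most β flits enter the NoC in any window of T_w cycles. A minimal sketch, assuming a sliding window and at most one injected flit per cycle; the function and parameter names are ours, not the platform's API:

```python
from collections import deque

def traffic_limiter(flit_ready_times, window, quota):
    """Return the injection cycle of each flit so that at most `quota`
    flits enter the NoC in any `window` consecutive cycles."""
    recent = deque()   # injection cycles still inside the current window
    out = []
    t = 0
    for ready in flit_ready_times:
        t = max(t, ready)
        while recent and recent[0] <= t - window:
            recent.popleft()           # drop expired injections
        if len(recent) >= quota:       # window full: wait until the
            t = recent[0] + window     # oldest injection leaves it
            while recent and recent[0] <= t - window:
                recent.popleft()
        out.append(t)
        recent.append(t)
        t += 1                         # at most one flit per cycle
    return out
```

For example, with T_w = 4 and β = 2, five back-to-back flits are injected at cycles 0, 1, 4, 5, 8: never more than two flits in any four-cycle window, exactly the bandwidth quota.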
-
Application Model (1)
● Applications on each cluster need to access the NoC
● Exchanging messages
● Accessing off-chip memory
● Applications operate on read-execute-write semantics
-
Application Model
● Each application has a number of:
● Read requests
● Write requests
[Timeline: a read phase of length Δ_read, an execution phase of length Δ_exec, and a write phase of length Δ_write; the read and write requests are served via the NoC and the I/O subsystem]
-
Motivation
[Diagram: two applications communicating over the NoC]
● Analysis of the NoC is non-trivial
● Many architectural features pose challenges (buffer, traffic limiter, routing, …)
Pessimistic estimates → larger task WCETs → less efficient platform usage
-
Contributions
● NoC organization that reduces contention by partitioning
● Timing analysis for the partitioned NoC
● A method to configure the flow regulation on source nodes
-
Partitioning the NoC
[Diagram: the 16 compute clusters and the four I/O subsystems; each cluster is assigned to its closest I/O subsystem]
● Avoid horizontal communication
● Clusters communicate with the closest I/O subsystem
● Each cluster sends messages via the I/O subsystem
● Most NoC packets target loading of code
● Cluster-to-cluster messages go through the I/O subsystem
● 8 identical NoC partitions
-
WCTT Analysis in the Partitioned NoC: Overview
● 3 cases to analyze
● Sending a request message on the C-NoC ⎯ WCTT_CNoC
● Sending data on the D-NoC to the I/O subsystem ⎯ WCTT_CC→IO
● Receiving data on the D-NoC ⎯ WCTT_IO→CC
[Diagrams: paths between Cluster A, Cluster B, and the I/O subsystem for the compute-cluster-to-I/O and the I/O-to-compute-cluster cases]
-
WCTT Analysis in the Partitioned NoC: WCTT_CC→IO (1)
[Diagram: path from Cluster A through the routers to the I/O subsystem; the flow regulation delay arises at the source node, the round-robin delay at the shared output buffers]
-
WCTT Analysis in the Partitioned NoC: WCTT_CC→IO (2) ⎯ The Traffic Limiter
● Based on what criteria to select the bandwidth quota β?
● WCTT
● Buffer occupation in the first router
[Plot: WCTT (cycles, 0–100000) and buffer occupation (flits, 0–6000) versus the flow regulation budget β (flits, 67–567); curves: WCTT, Nmax, Max. Buffer, Available Buffer. Below β_min the WCTT is not minimal; above β_max the buffer in the router overflows.]
-
WCTT Analysis in the Partitioned NoC: WCTT_CC→IO (3)
● Observations from the traffic limiter settings
● The buffer in the first router transmits a flit in each cycle
● Faster injection at the source node has no impact on the WCTT

WCTT_CC→IO = (d_tx + d_RR) · (n_pkt − 1) + d_RR + C_NoC

● (d_tx + d_RR) · (n_pkt − 1): RR-blocking and transmission from the buffer of all but the last packet
● d_RR: RR-blocking of the last packet
● C_NoC: transmission of the last packet over the NoC without interference
(n_pkt: number of packets; d_tx: transmission time of one packet from the buffer; d_RR: worst-case round-robin blocking per packet)
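The three-term bound on this slide can be evaluated directly. A minimal sketch; the symbol names used here (n_pkt packets, per-packet buffer transmission time d_tx, worst-case round-robin blocking d_RR, interference-free NoC traversal C_NoC of the last packet) are readable stand-ins chosen for this transcript, not the paper's notation:

```python
def wctt_cc_to_io(n_pkt, d_tx, d_rr, c_noc):
    """Worst-case traversal time, compute cluster -> I/O subsystem."""
    blocking_and_tx = (d_tx + d_rr) * (n_pkt - 1)  # all but the last packet
    last_pkt_blocking = d_rr                       # RR-blocking of the last packet
    last_pkt_tx = c_noc                            # interference-free traversal
    return blocking_and_tx + last_pkt_blocking + last_pkt_tx
```

Note that a single packet (n_pkt = 1) only pays its own RR-blocking plus the interference-free traversal.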
-
Determining the Parameters for the Traffic Limiter (1)
● Two cases:
● Find β_min such that the WCTT_CC→IO is minimal
● Find β_max such that the buffer in the first router does not overflow
-
Determining the Parameters for the Traffic Limiter (2) ⎯ β_min
● The buffer in the first router:
[Plot: cumulative data (flits) over time (cycles); the flits that arrive in the buffer, shaped by the traffic limiter, and the flits that depart from the buffer, shaped by the RR-interference; one departure segment λ is marked]
Set β_min such that the flits that arrive during one departure segment equal the flits that leave the buffer in the same time:
T_w + d_tx ≤ (β / d_tx) · d_RR + β
Solved via binary search, ILP, …
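The slide leaves the solution technique open ("binary search, ILP, …"). A generic sketch of the binary-search option, under our assumption that feasibility (arrivals keeping up with departures in every departure segment) is monotone in the quota β; the predicate `feasible` stands in for the slide's balance condition:

```python
def smallest_feasible_quota(feasible, lo, hi):
    """Smallest integer quota beta in [lo, hi] with feasible(beta) True,
    assuming feasible is monotone (False ... False True ... True)."""
    if not feasible(hi):
        raise ValueError("no feasible quota in [lo, hi]")
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid          # mid works: search the lower half
        else:
            lo = mid + 1      # mid fails: search the upper half
    return lo
```

For instance, `smallest_feasible_quota(lambda b: b * b >= 50, 1, 100)` returns 8, the smallest quota satisfying that toy predicate.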
-
Evaluation
-
Evaluation
● Experiments evaluate different aspects of the work
● Measurements on the Kalray MPPA platform
● Case study of an engine management system
● All experiments are based on parameters of the Kalray MPPA
● D-NoC packet payload = 62 flits
● C-NoC packet payload = 2 flits
● Header size = 4 flits
● Router: switching delay = 1 cycle, channel delay = 1 cycle
● Buffer size = 401 flits
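With the packet parameters above, the number of D-NoC packets and flits for a given payload follows directly. A minimal sketch; the 4-byte flit width is our assumption, as it is not stated on the slide:

```python
import math

def dnoc_packet_count(payload_bytes, flit_bytes=4,
                      payload_flits_per_pkt=62, header_flits=4):
    """Split a payload into D-NoC packets (62 payload flits + 4 header
    flits per packet, as in the evaluation setup).
    Returns (number_of_packets, total_flits_on_the_wire)."""
    payload_flits = math.ceil(payload_bytes / flit_bytes)
    n_pkt = math.ceil(payload_flits / payload_flits_per_pkt)
    return n_pkt, payload_flits + n_pkt * header_flits
```

Under these assumptions a 1 KB payload needs 5 packets and 276 flits in total.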
-
Evaluation: Total Read Latency on the MPPA (1)
● Measuring the time to read data from off-chip memory to a compute cluster, [1, 16] KB
● Varying the number of clusters that access memory through the same I/O node: 16, 8, 4, 2 clusters
● Latencies on clusters 0 and 4 are observed
● They represent one NoC partition
● Each data point represents the maximum observed value out of 10000 samples
-
Evaluation: Total Read Latency on the MPPA (2)
[Plot: read latency in ms (0–60) versus payload size (1 KB–16 KB), for 16, 8, 4, and 2 clusters, each measured on cluster C0 and cluster C4]
-
Evaluation: Simulation-based Case Study (1)
● Engine Management System (EMS)
● 15 runnables with periods [5, 10, 20, 100] ms
● Footprint [7076, 17424] bytes (code + data)
● Each runnable loads its footprint at the beginning of execution and writes its footprint back to the off-chip memory at the end of execution
[Diagram: two clusters running the EMS, connected to the I/O subsystem; a Reorder Core (RC) on the I/O subsystem manages the memory requests]
-
Evaluation: Simulation-based Case Study (2)
[Two bar charts: latency (cycles, 0–10000) for runnables M1–M15, comparing Analysis, Max, and Isolation; top: writing to memory (WCTT_CC→IO); bottom: reading from memory (WCTT_CNoC + WCTT_IO→CC)]
-
Conclusions and Future Work
● The shared NoC is one of the main sources of interference
● Difficult to analyze due to its many architectural features
● Novel NoC partitioning scheme to reduce interference and ease analysis
● Tailored analysis for the partitioned NoC
● Configuration of the traffic limiter to avoid buffer overflow and to guarantee minimal transmission times
● Future work: focus on the memory access within the I/O subsystem
● Handling of requests affects the overall latency of memory access
-
Thank you for your attention! Questions?