Page 1

COMPUTER ARCHITECTURE GROUP

Omer Khan, Assistant Professor
Electrical and Computer Engineering, University of Connecticut
Contact Email: [email protected]

Many-core Architecture Characterization of the Path Planning Workload
CogArch Workshop 2015

Page 2

Path Planning in Cognitive Computing

Sensors: Acquisition & Perception → Controllers: Planner → Actors: Scheduler

[Figure: desired vs. actual driving behavior on an example map from start (S) to destination (D); image from KIT Institute for Anthropomatics]

• Collision-free path?
• Most efficient?

Page 3

Path Planning: A Challenge Application

Path planning must simultaneously meet real-time performance, energy, and resiliency constraints.

Page 4

Path Planning: Efficiency via Parallelization

Shortest Path, Dijkstra's Algorithm:

    Initialize Nodes(), Edges()
    for each Node u:              -- outer loop visits each node once
        for each Edge of u:       -- inner loop
            1. Calculate the distance from the current node to each neighbor
            2. Check for the next best node (u) among the neighbors

    Complexity: O(N log N + E)

Inner-loop parallelization: each thread is assigned a set of the current node's edges to relax; BARRIERs are applied to synchronize threads during each iteration.

Pro: work efficient, with no redundant computation compared to sequential.
Con: hard to parallelize due to the lack of edge-level parallelism (bi-directional search improves concurrency).

Outer-loop parallelization:
• Convergence based: divide the nodes among threads; each thread relaxes its nodes iteratively until the distances converge (i.e., no change).
• Range based: dynamically distribute a "range of nodes" among threads; each thread relaxes its set of nodes one by one.

LOCK each node to avoid potential races among threads relaxing shared nodes; apply a BARRIER to synchronize threads after each iteration.

Con: convergence based is work inefficient due to many redundant relaxations of nodes that stabilize before convergence.
Pro: highly parallelizable due to node-level parallelism.
Pro: range based is work efficient and exploits node-level parallelism (bi-directional search improves concurrency).
Con: needs intelligent scheduling for dynamic work balancing.

[Figure: example graph with per-edge weights, omitted]
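The convergence-based outer-loop strategy described above can be sketched in a few lines. This is an illustrative Bellman-Ford-style implementation, not the authors' code: the function name, the static node partition, and the three-barrier choreography are our assumptions. Each thread relaxes its nodes' outgoing edges under per-node locks until a full iteration produces no distance change.

```python
import threading

def convergence_sssp(adj, source, num_threads=4):
    """Convergence-based outer-loop SSSP sketch: nodes are statically divided
    among threads; each thread relaxes its nodes' edges iteratively until
    the distances converge (no change in a full iteration)."""
    n = len(adj)
    INF = float("inf")
    dist = [INF] * n
    dist[source] = 0.0
    locks = [threading.Lock() for _ in range(n)]   # one LOCK per node
    barrier = threading.Barrier(num_threads)
    changed = [True]

    def worker(tid):
        my_nodes = range(tid, n, num_threads)      # static node partition
        while True:
            barrier.wait()                 # (a) all threads done reading `changed`
            if tid == 0:
                changed[0] = False
            barrier.wait()                 # (b) reset is visible to everyone
            local = False
            for u in my_nodes:
                du = dist[u]
                if du == INF:
                    continue
                for v, w in adj[u]:
                    if du + w < dist[v]:
                        with locks[v]:     # avoid races on shared nodes
                            if du + w < dist[v]:
                                dist[v] = du + w
                                local = True
            if local:
                changed[0] = True          # only ever set True here: benign race
            barrier.wait()                 # (c) iteration complete
            if not changed[0]:
                break

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dist
```

The redundant relaxations the slide calls out show up here directly: a node may be relaxed in every iteration even after its distance has stabilized.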


Page 5

Characterization Space

• Simulated a tiled 256-core NOC-based multicore
• Algorithms: Single- and Multiple-Objective Shortest Path (SOSP/MOSP)
  • Dijkstra: visits each node once, hence high work complexity
  • Heuristic algorithms: useful for pre-processed input graphs
    • A*, D*: the number of nodes visited is dramatically reduced
    • Δ-Stepping: visits each node once, but the work done per node is determined by "delta"
  • Martin's Algorithm: similar to Dijkstra, but considers an additional objective when evaluating the next best node
• Parallelization strategies
  • Inner loop
  • Outer loop: convergence and range based
• Inputs (methodology similar to GTgraph from Georgia Tech)
  • Adjacency-list representation with randomly distributed edge weights
  • Graph configurations
    • Number of nodes: 16K – 4M
    • Sparse graph: 4 – 32 edges/node
    • Dense graph: 8K edges/node
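The input methodology above can be approximated in a few lines. This is a loose sketch of a GTgraph-style random generator, not the tool itself; the function name and parameters are our assumptions. It builds an adjacency list with a fixed out-degree and randomly distributed edge weights.

```python
import random

def random_sparse_graph(num_nodes, edges_per_node, max_weight=10.0, seed=0):
    """Build an adjacency-list graph with randomly distributed edge weights,
    loosely following the GTgraph-style methodology on this slide
    (an illustrative sketch; parameter names are ours, not the tool's)."""
    rng = random.Random(seed)
    adj = [[] for _ in range(num_nodes)]
    for u in range(num_nodes):
        for _ in range(edges_per_node):
            v = rng.randrange(num_nodes)          # random neighbor
            w = rng.uniform(1.0, max_weight)      # random edge weight
            adj[u].append((v, w))
    return adj
```

With `edges_per_node=16` this matches the sparse configuration characterized later; `edges_per_node=8192` would approximate the dense one.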


Page 6

Characterization Objectives

• Path planning is challenging because it
  • operates on unstructured data, and
  • has complex dependence patterns between tasks that are known only during program execution
• The characterization reveals four areas where computing must adapt at runtime to exploit execution efficiency:
  1. Dynamic Workload Selection and Balancing
  2. Concurrency Controls
  3. Input Dependence
  4. Accuracy of Computation


Page 7

Dense Graph (16K nodes, 8K edges/node)

Shortest Path Algorithm              | Completion Time (ms) | Threads | Accuracy | Comments
Dijkstra Sequential (baseline)       | 14200                | 1       | 100%     | Convergence based is work inefficient; range based incurs extra communication; inner loop has good concurrency, giving ~40X speedup
Inner Loop Par                       | 549                  | 32      |          |
Bi-directional Inner Loop Par        | 356                  | 96      |          |
Convergence-based Outer Loop Par     | 6300                 | 160     |          |
Range-based Outer Loop Par           | 7691                 | 256     |          |
Bi-directional Range-based Outer Par | 7424                 | 256     |          |
D* Sequential                        | 74                   | 1       | 97%      | Very high work efficiency
D* Bi-directional Inner Loop Par     | 2.51                 | 32      |          |
Martin's Sequential (baseline)       | 14500                | 1       | 100%     | ~45X speedup
Bi-directional Inner Loop Par        | 321                  | 128     |          |
Bi-directional Range-based Outer Par | 2859                 | 256     |          |

Page 8

Sparse Graph (16K nodes, 16 edges/node)

Shortest Path Algorithm                   | Completion Time (ms) | Threads | Accuracy | Comments
Dijkstra Sequential (baseline)            | 30                   | 1       | 100%     | Inner loop has low concurrency; convergence-based outer loop is work inefficient; range-based bi-directional gives ~5X speedup
Inner Loop Par                            | 30                   | 1       |          |
Bi-directional Inner Loop Par             | 14                   | 2       |          |
Convergence-based Outer Loop Par          | 377                  | 256     |          |
Range-based Outer Loop Par                | 10.8                 | 128     |          |
Bi-directional Range-based Outer Loop Par | 6.4                  | 192     |          |
Δ-Stepping Inner Par (Δ=50%)              | 2.9                  | 16      | 80%      | Work efficient
D* Sequential                             | 4                    | 1       | 97%      |
D* Bi-directional Inner Loop Par          | 3.4                  | 2       |          |

➤ Motivates the need to adapt to "choices" along (1) input dependence, (2) dynamic workload balancing, (3) concurrency controls, and (4) the accuracy of computation

Page 9

“Situational Scheduler” for Adaptation

[Diagram: algorithmic choices and architectural choices feed a performance monitoring / decision engine that sits on top of the many-core substrate]

➤ The user/programmer selects an algorithm or heuristic based on static information such as input characteristics or solution accuracy requirements
➤ Runtime information is utilized to make decisions about concurrency control and workload-balancing methods

• Develop models that predict the choices that improve computation efficiency
  • May utilize heuristic, control-theoretic, or even machine-learning methods
  • The goal is a decision engine with low overhead and high accuracy in predicting the right decision
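A toy static-decision rule for such a scheduler can be derived from the dense/sparse characterization tables earlier in the talk. This is purely illustrative; the thresholds and strategy names are our assumptions, not the proposed decision engine: heuristic D* wins when approximate answers are acceptable, dense graphs favor inner-loop parallelism, and sparse graphs favor range-based outer-loop parallelism.

```python
def choose_strategy(avg_degree, accuracy_required):
    """Toy decision rule for the 'Situational Scheduler' idea, based on the
    dense (8K edges/node) vs. sparse (16 edges/node) results; thresholds
    are illustrative assumptions."""
    if accuracy_required <= 0.97:
        # D* reached ~97% accuracy with very high work efficiency.
        return "D* bi-directional inner-loop"
    if avg_degree >= 1000:
        # Dense input: inner loop had good concurrency (~40X speedup).
        return "bi-directional inner-loop Dijkstra"
    # Sparse input: range-based bi-directional outer loop gave ~5X speedup.
    return "bi-directional range-based outer-loop Dijkstra"
```

A real decision engine would replace these fixed thresholds with runtime monitoring, as the slide suggests.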

Page 10

Architecture Innovations for Extreme Efficiency: Range-based Outer Loop Parallel Dijkstra

    while all nodes are not visited:
        • Calculate the next range of nodes to relax, based on the degree of the graph
        • Distribute the nodes in the current range among worker threads
        while assigned nodes are not visited (update the Q array):
            • LOCK the node to be relaxed
            • for all neighbor nodes:
                • Update the D array with the new distance, if the new distance
                  is less than the old distance (RELAX function)
            • UNLOCK the node that was relaxed
        BARRIER to synchronize all threads

Master thread (also participates as a worker): 1. Create work → 2. "Here is work" → 5. All done
Worker thread (only one shown; the others do the same work): 3. Wait and do the assigned work → 4. Done

Acceleration opportunities: computation, communication, and data access.
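The master/worker pseudocode above can be sketched as follows. This is an illustrative interpretation, not the authors' implementation: the range policy here simply takes the unsettled nodes with the smallest tentative distances (the slide sizes the range by graph degree), and a node whose distance later improves is re-opened so the result stays correct. Function and variable names are our assumptions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def range_based_sssp(adj, source, num_workers=4, range_size=64):
    """Range-based outer-loop parallel Dijkstra sketch: a master picks the
    next range of unsettled nodes, workers relax them under per-node locks,
    and the pool join acts as the BARRIER."""
    n = len(adj)
    INF = float("inf")
    dist = [INF] * n              # the slide's D array
    dist[source] = 0.0
    visited = [False] * n         # the slide's Q bookkeeping
    locks = [threading.Lock() for _ in range(n)]

    def relax(u):
        du = dist[u]
        for v, w in adj[u]:
            with locks[v]:                    # LOCK the node being updated
                if du + w < dist[v]:
                    dist[v] = du + w          # RELAX
                    visited[v] = False        # re-open: its distance improved
        with locks[u]:
            if dist[u] == du:                 # settle u only if it did not
                visited[u] = True             # improve while we relaxed it

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while True:
            # Master: choose the next range of unsettled, reachable nodes.
            frontier = sorted(
                (u for u in range(n) if not visited[u] and dist[u] < INF),
                key=lambda u: dist[u])[:range_size]
            if not frontier:
                break
            list(pool.map(relax, frontier))   # join = BARRIER per iteration
    return dist
```

Because non-minimal nodes may enter a range and later be re-opened, this version trades some redundant work for concurrency, matching the dynamic-work-balancing concern on the earlier slide.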


Page 11

Data Access Efficiency

• Data locality in path planning is challenging due to unstructured, data-dependent accesses

• Locality-aware protocols [ISCA'13, HPCA'14]
  • Exploit the runtime variability in locality/reuse of data at the various layers of the on-chip cache hierarchy
  • Intelligent fine-grain data allocation/replication at private and shared caches
    • Locality-aware private-L1 allocation/replication
    • Locality-aware shared-L2 replication

• Locality-aware Private Caching [ISCA'13]
  • Privately cache high-locality lines in the L1 cache
  • Remotely access (at word granularity) low-locality lines at the L2 cache
  • Allocation is based on the locality of data, dynamically profiled using cache-line-level in-hardware locality classifiers
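The classifier's role can be illustrated with a toy software model. This is our simplification for exposition, not the ISCA'13 hardware design: a per-line reuse counter decides whether an access is served from a private L1 copy (high locality) or as a remote word access at the line's L2 home (low locality).

```python
class LocalityClassifier:
    """Toy model of a cache-line locality classifier: lines with enough
    observed reuse earn a private L1 copy; the rest are fetched
    word-by-word from the shared L2 (illustrative threshold policy)."""
    def __init__(self, threshold=4):
        self.threshold = threshold
        self.reuse = {}          # cache-line address -> observed reuse count

    def access(self, line_addr):
        count = self.reuse.get(line_addr, 0)
        self.reuse[line_addr] = count + 1
        # High-locality lines are worth a private L1 copy; low-locality
        # lines are cheaper to read remotely, avoiding coherence traffic.
        return "private-L1" if count >= self.threshold else "remote-L2-word"
```

In the real design the classification is done in hardware per cache line; this model only conveys the policy: low-reuse lines never pay invalidation and ping-pong costs because they are never privately cached.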


Page 12

Locality-Aware Data Access: Range-based Outer Loop Parallel Dijkstra

• Reducing sharing misses

[Chart: L1 cache miss breakdown (%) into cold, capacity, sharing, and word misses, for the baseline vs. locality-aware private caching]

• Sharing misses (expensive) are turned into word misses (cheap) as more low-locality cache lines are identified by the hardware classifiers


Page 13

Locality-Aware Data Access: Range-based Outer Loop Parallel Dijkstra

• Energy consumption tradeoffs
  • Reduce invalidations, asynchronous write-backs, and cache-line ping-ponging

[Chart: energy consumption for the baseline vs. locality-aware private caching, broken down into L1-I cache, L1-D cache, L2 cache, directory, network router, network link, and DRAM]


Page 14

Locality-Aware Data Access: Range-based Outer Loop Parallel Dijkstra

• Completion time tradeoffs
  • Less time is spent waiting for coherence traffic to be serviced
  • Critical-section time reduction → synchronization time reduction

[Chart: completion time (ns) for the baseline vs. locality-aware private caching, broken down into compute, L1Cache-L2Home, L2Home-Waiting, L2Home-Sharers, L2Home-OffChip, and synchronization time]


Page 15

How about Accelerating Computation?

• Accelerating computation alone, even under an idealistic data-access setup, is not sufficient!
➤ Must address the data dependencies that lead to fine-grain communication bottlenecks

[Chart: accelerator design space of area (um^2) vs. cycles for DIJKSTRA and FFT, from the Aladdin tool [Shao et al., ISCA'14], showing the latency gap]


Page 16

The Case for a Many-core Accelerator

Conventional system: cores communicate through shared memory, with coherence, communication, and data access all over shared memory.

Our proposal: accelerate computation + data access + communication. Core+ACC tiles combine locality-aware data access, explicit send()/receive() messages, and coherence over shared memory.

Page 17

Why Accelerate Communication? Example

•  The ordering core receives packets (potentially out of order) from many flow cores, and it reorders and commits the packets

• Several shared-memory versions were implemented (we show only the best: a lockless shared data structure)

[Diagram: many flow cores feed a single ordering core. In the shared-memory version, the flow cores write into a lockless shared data structure that the ordering core reads. In the explicit-messaging version, the flow cores send() packets into the ordering core's recv() queues.]
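The ordering core's reorder-and-commit logic can be sketched independently of the transport. This is our illustration of the behavior described above, not the authors' code: packets arriving out of order are buffered in a min-heap keyed by sequence number, and the contiguous prefix is committed as soon as it forms.

```python
import heapq

class OrderingCore:
    """Sketch of the ordering core: buffer out-of-order packets from the
    flow cores and commit them strictly in sequence-number order.
    Either the lockless shared structure or explicit recv() queues would
    feed `receive`; only the reorder/commit logic is modeled here."""
    def __init__(self):
        self.pending = []        # min-heap of (seq, packet)
        self.next_seq = 0        # next sequence number to commit
        self.committed = []

    def receive(self, seq, packet):
        heapq.heappush(self.pending, (seq, packet))
        # Commit every packet now contiguous with the committed prefix.
        while self.pending and self.pending[0][0] == self.next_seq:
            _, p = heapq.heappop(self.pending)
            self.committed.append(p)
            self.next_seq += 1
```

With explicit messaging, `receive` would be driven directly by point-to-point recv() queues, avoiding the coherence ping-pong that the shared-memory version suffers on the shared buffer.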


Page 18

Why Accelerate Communication? Example

• Shared memory: 20 cycles/packet at 256 cores is the best result
• Explicit messaging: 10 cycles/packet at 4 cores
  • ~2X latency advantage from point-to-point communication and from avoiding data ping-ponging

[Chart: latency per packet vs. core count (2 to 256) for shared memory and explicit messaging]

Methods
• In-house simulator with an ADL front-end: simple in-order RISC cores
• Compiler support for send() and receive() instructions
• BARC'15 paper


Page 19

What about Resilience?

• Redundancy alone can achieve resiliency, but it hurts efficiency

• Our approach: given correctness-guarantee constraints, selectively apply resilience to the code that is crucial for program correctness and output
  • Opens a new research direction that trades off program accuracy against efficient resiliency [CAL'15]

[Chart: resiliency methods (symptom-based, software, n-modular) plotted by overheads (performance/energy/power) vs. error-vulnerability coverage vs. program accuracy]


Page 20

Declarative Resilience: D* Heuristic Algorithm for Path Planning

• The heuristic algorithm D* (aka A*) is work efficient and popular in applications that use pre-processed input graphs

[Figure: D* shortest path from S to D, showing the correct shortest path vs. a slightly perturbed shortest path]

• Declarative resilience allows the heuristic calculation to be treated as non-crucial code, so minor perturbations can be tolerated


Page 21

Declarative Resilience: D* Heuristic Algorithm for Path Planning

Sequential pseudo-code for D*:

    while not at the destination node:
        RESILIENCE OFF
        • for all neighbor nodes:
            • Look up the edge weights of the neighboring nodes
            • Use the heuristic to calculate the next node with minimum distance
        RESILIENCE ON
        Go to the next best node

Considerations for program correctness of non-crucial code:
1. Unroll the for loop to remove all control-flow instructions
2. No stores to globally visible memory, i.e., all updates are local
3. Local-store address calculation is protected using redundancy
4. The next-node calculation is checked for "within bounds"
   • Based on the current node ID and the degree of the graph
   • If the bounds are violated, the next node is not updated (i.e., the current node is re-executed)
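The non-crucial region and its bounds check can be sketched as follows. This is our illustration of the stated correctness rules, not the authors' code; `heuristic` is a hypothetical stand-in for the D*/A* estimate. The heuristic scan runs unprotected with all updates kept in locals, and the chosen next node is bounds-checked before it becomes globally visible.

```python
def next_best_node(current, neighbors, dist, num_nodes):
    """Sketch of the declarative-resilience discipline for D*'s non-crucial
    heuristic step: locals only, no global stores, and a bounds check
    before the result is used (illustrative, names are assumptions)."""
    def heuristic(node):                 # hypothetical estimate-to-goal
        return 0.0

    # --- RESILIENCE OFF: heuristic scan; a fault here may perturb `best` ---
    best, best_cost = None, float("inf")
    for v, w in neighbors:
        cost = dist[current] + w + heuristic(v)
        if cost < best_cost:
            best, best_cost = v, cost    # all updates stay in locals

    # --- RESILIENCE ON: validate before making the result visible ---
    if best is None or not (0 <= best < num_nodes):
        return current                   # bounds violated: re-execute current node
    return best
```

A perturbed heuristic can at worst pick a slightly worse neighbor, producing the "perturbed shortest path" of the previous slide rather than a crash or a corrupted global state.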


Page 22

Declarative Resilience Results: D* Heuristic Algorithm for Path Planning

[Chart: completion time (normalized) for baseline, re-execution, and declarative resilience, broken down into compute time, memory stall time, synchronization stall time, network recv stall time, re-execution time, and resilience-on delay]

• Re-executing all instructions incurs a 30% performance overhead [COMPUTER'13]
• Declarative resilience's performance overhead is 8%
  • It protects all crucial code, and protects against the side effects of non-crucial code


Page 23

Summary

• Exploiting concurrency in path planning is non-trivial because (1) it operates on unstructured data, and (2) the complex dependence patterns between tasks are known only during program execution
➤ Develop a "Situational Scheduler" that adapts to (1) input dependence, (2) dynamic workload variations, (3) exploitable concurrency, and (4) accuracy requirements
➤ Many-core architectures must accelerate computation, communication, and data access for extreme efficiency
➤ Resiliency of computation must be treated as a first-order metric; a declarative resiliency method can reduce the efficiency overheads of resilience