THE SNIPER MULTI-CORE SIMULATOR
snipersim.org/documents/2013-09-22 Sniper IISWC...

HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013
IISWC 2013, PORTLAND

09:00 INTRODUCTION, SNIPER OVERVIEW & RATIONALE
09:45 SNIPER INTERNALS
10:00 – COFFEE BREAK –
10:30 VALIDATION RESULTS
10:45 RUNNING SIMULATIONS AND PROCESSING RESULTS
11:30 HANDS-ON DEMO
12:00 – END –


INTEL EXASCIENCE LAB

• Software and hardware for ExaFLOPS scale machines
• Collaboration between Intel, imec and 5 Flemish universities
• Study Space Weather as an HPC workload

[Diagram: Simulation Toolkit; Architectural Simulation; Hardware design; Space Weather Modeling; Visualization]

THE SNIPER MULTI-CORE SIMULATOR
INTRODUCTION

WIM HEIRMAN, TREVOR E. CARLSON, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT

HTTP://WWW.SNIPERSIM.ORG

SUNDAY, SEPTEMBER 22ND, 2013
IISWC 2013, PORTLAND

TRENDS IN PROCESSOR DESIGN: CACHE

Cache sizes are increasing

[Chart: LLC cache sizes in Mbytes (axis 0 to 60) over time (Jan-93 to Dec-14), series IPF and x86]

TRENDS IN PROCESSOR DESIGN: CORES

Number of cores per node is increasing
– 2001: Dual-core POWER4
– 2005: Dual-core AMD Opteron
– 2011: 10-core Intel Xeon Westmere-EX
– 2012: Intel MIC Knights Corner (60+ cores)

SIMULATION

• Design tomorrow’s processor using today’s hardware
• Simulation
  – Obtain performance characteristics for new architectures
  – Architectural exploration
  – Early software optimization

DEMANDS ON SIMULATION ARE INCREASING

Increasing core counts
– Linear increase in simulator workload
– Single-threaded simulator sees a rising gap
  • workload: increasing target cores
  • available processing power: near-constant single-thread performance of host machine
– Need to use all cores of the host machine
→ Parallel simulation

DEMANDS ON SIMULATION ARE INCREASING

Increasing cache size
– Need a large working set to exercise large caches
  • Scaled-down applications won’t exhibit the same behavior
– Application designers want to see full-program behavior
→ Long-running simulations are required

UPCOMING CHALLENGES

• Future systems will be diverse
  – Varying processor speeds
  – Varying failure rates for different components
  – Homogeneous applications become heterogeneous
• Software and hardware solutions are needed to solve these challenges
  – Handle heterogeneity (reactive load balancing)
  – Be fault tolerant
  – Improve power efficiency at the algorithmic level (extreme data locality)
• Hard to model accurately with analytical models

FAST AND ACCURATE SIMULATION IS NEEDED

• Simulation use cases
  – Architecture exploration
  – Pre-silicon software optimization
  – [Validation]
• Cycle-accurate simulation is too slow for exploring multi/many-core design space and software
• Key questions
  – Can we raise the level of abstraction?
  – What is the right level of abstraction?
  – When to use these abstraction models?

EXPERIMENT DESIGN IN ARCHITECTURE EXPLORATION/EVALUATION

• Optimizing the probability of success (i.e., finding the best architecture/parameters):
  • Coverage: how many architecture configurations can I run
  • Confidence: # benchmarks, re-runs for variable applications
  • Accuracy: simulation model detail vs. runtime
• How many scenarios can I run?
  • N = total number of simulation scenarios
  • d = days until paper deadline
  • t = average time per simulation
  • B = number of benchmarks
  • A = number of architectures

$$N = \frac{d}{t \times B \times A}$$

Minimize t to maximize N
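The relation N = d / (t × B × A) from this slide can be evaluated directly. The following is a minimal sketch, assuming d and t are expressed in the same unit (days); the function name and all numbers are illustrative and do not come from the tutorial.

```python
def scenarios(d_days, t_days, benchmarks, architectures):
    """N = d / (t * B * A), using the symbols defined on the slide."""
    return d_days / (t_days * benchmarks * architectures)

# Hypothetical numbers: 30 days to the deadline, 10 benchmarks, 5 architectures.
slow = scenarios(30, t_days=0.5, benchmarks=10, architectures=5)   # 12-hour runs
fast = scenarios(30, t_days=0.05, benchmarks=10, architectures=5)  # 10x faster runs
print(slow, fast)
```

With everything else fixed, a simulator that is ten times faster allows ten times as many scenarios before the deadline.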

FAST OR ACCURATE SIMULATION?

[Figure: two plots of performance versus architecture (A, B, C, D, E), one for a cycle-accurate simulator and one for a higher-abstraction level simulator; some points are marked "?"]

THE ARCHITECTURE DESIGN WATERFALL

[Figure: along the design process (time), analytical models, high-level simulation and cycle-accurate simulation are used in turn. # architectures considered: 10^10, 10^5, 1000, 10, 1. Benchmarks/applications: program characteristics; representative applications; traces / microbenchmarks. Pre-silicon software optimization, co-design]

NEEDED DETAIL DEPENDS ON FOCUS

[Figure: components (RTL, OOO execution, core memory ops, L1 cache access, LLC access, off-socket) placed by single-event time scale (single clock cycle to microseconds) and required sim time (millions of cycles to seconds), with regions marked "Too slow" and "Not accurate enough"]

ONE-IPC MODELING – TOO SIMPLE?

• Simple high-abstraction model, often used in uncore studies
• Alternative for memory access traces
  – Aims to provide more-realistic access patterns
  – Allows for timing feedback
• But: One-IPC core models do not exhibit ILP/MLP
  – Memory request rates are not as accurate as more detailed simulators
  – # outstanding requests incorrect: underestimate required queue sizes
  – No latency is hidden: overestimate runtime improvements

DETAILED MODEL VS. INTERVAL SIM

[Diagram: Functional Simulator; detailed core model with Fetch (BP, I$), Decode, Issue Queue, ROB, Execution Units, LSQ, Commit, Cache Hierarchy and DRAM; Interval Simulation]

Balanced architecture
Main factors in core performance:
• Parallelism exposed by ROB
• Miss events

INTERVAL SIMULATION

Out-of-order core performance model with in-order simulation speed

[Figure: effective dispatch rate over time, divided into interval 1, interval 2 and interval 3 by an I-cache miss, a branch misprediction and a long-latency load miss]

D. Genbrugge et al., HPCA’10
S. Eyerman et al., ACM TOCS, May 2009
T. Karkhanis and J. E. Smith, ISCA’04, ISCA’07

INTERVAL SIMULATION FROM 30,000 FEET

Interval simulation considers instructions (in-order) at dispatch
• dispatch not possible
  – Instruction cache / TLB miss
  – Branch misprediction (not dispatching useful instructions)
  – Front-end refill after misprediction
  – ROB full: long-latency miss at head of ROB
• dispatch possible: at rate governed by ROB
  – Little’s law: progress rate = #elements / time spent in queue
  – Computed using ROB fill and critical path through ROB
    • Computed using dynamic instruction dependencies and latencies

[Diagram: Fetch (BP, I$), Decode, Issue Queue, ROB, Execution Units, LSQ, Commit, Cache Hierarchy, DRAM]
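Little’s law (progress rate = #elements / time spent in queue) applied to the ROB can be sketched as follows. This is an illustrative sketch rather than Sniper’s implementation: the function names are hypothetical, and capping the rate at the dispatch width is an assumption added here.

```python
def critical_path(latencies, deps):
    """Longest dependence chain, in cycles, through in-order instructions.

    latencies[i] is the execution latency of instruction i and deps[i]
    lists the earlier instructions that produce its operands.
    """
    finish = []
    for i, latency in enumerate(latencies):
        start = max((finish[p] for p in deps[i]), default=0)
        finish.append(start + latency)
    return max(finish, default=0)


def effective_dispatch_rate(rob_fill, critical_path_cycles, dispatch_width):
    """Little's law on the ROB: rate = ROB fill / critical path length.

    The cap at dispatch_width is an assumption of this sketch.
    """
    if critical_path_cycles <= 0:
        return float(dispatch_width)
    return min(float(dispatch_width), rob_fill / critical_path_cycles)


chain = critical_path([1, 1, 1, 1], [[], [0], [1], [2]])      # fully dependent
independent = critical_path([1, 1, 1, 1], [[], [], [], []])   # no dependencies
print(effective_dispatch_rate(4, chain, dispatch_width=4))        # 1.0
print(effective_dispatch_rate(4, independent, dispatch_width=4))  # 4.0
```

A fully dependent chain limits dispatch to one instruction per cycle, while independent instructions reach the full width, which is how the model captures ILP.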

KEY BENEFITS OF THE INTERVAL MODEL

• Models superscalar OOO execution
• Models impact of ILP
• Models second-order effects: MLP
• Allows for constructing CPI stacks

CYCLE STACKS

• Where did my cycles go?
• CPI stack
  – Cycles per instruction
  – Broken up in components
• Normalize by either
  – Number of instructions (CPI stack)
  – Execution time (time stack)
• Different from miss rates: cycle stacks directly quantify the effect on performance

[Figure: CPI stack with components L2 cache, I-cache, Branch, Base]
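The two normalizations can be illustrated with a small sketch. The component names follow the figure on this slide; the cycle and instruction counts are made up for illustration, and `normalize` is a hypothetical helper.

```python
def normalize(cycles_by_component, instructions):
    """Return (cpi_stack, time_stack) for raw per-component cycle counts."""
    total_cycles = sum(cycles_by_component.values())
    cpi_stack = {k: v / instructions for k, v in cycles_by_component.items()}
    time_stack = {k: v / total_cycles for k, v in cycles_by_component.items()}
    return cpi_stack, time_stack


# Hypothetical counts for a run of 2000 instructions.
cycles = {"base": 500, "branch": 100, "icache": 50, "l2": 350}
cpi_stack, time_stack = normalize(cycles, instructions=2000)
print(cpi_stack)   # components add up to the overall CPI
print(time_stack)  # components add up to 1 (fraction of execution time)
```

The CPI stack is useful for comparing runs with the same instruction count, while the time stack shows where a given run spends its time.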

CONSTRUCTING CPI STACKS

• Interval simulation: track why time is advanced
  – No miss events
    • Dispatch instructions at base CPI
    • Increment base component
  – Miss event
    • Fast-forward time by X cycles
    • Increment component by X

[Figure: CPI stack with components L2 cache, I-cache, Branch, Base]
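The bookkeeping described on this slide can be sketched as a loop that attributes every advance of simulated time to exactly one component. The event format and the names below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_cycle_stack(events, base_cpi):
    """Accumulate a cycle stack while advancing time.

    events is a list of ("dispatch", n_instructions) entries, or
    (component_name, penalty_cycles) entries for miss events.
    """
    stack = defaultdict(float)
    time = 0.0
    instructions = 0
    for kind, value in events:
        if kind == "dispatch":
            cycles = value * base_cpi   # no miss: dispatch at base CPI
            stack["base"] += cycles
            instructions += value
        else:
            cycles = value              # miss: fast-forward by X cycles
            stack[kind] += cycles
        time += cycles
    return dict(stack), time, instructions


events = [("dispatch", 100), ("branch", 15), ("dispatch", 100), ("l2", 200)]
stack, time, instructions = build_cycle_stack(events, base_cpi=0.25)
print(stack, time, instructions)
```

Because every time advance is charged to a component, the components always add up to the total simulated time.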

CYCLE STACKS FOR PARALLEL APPLICATIONS

By thread: heterogeneous behavior in a homogeneous application?

[Diagram: two groups of four cores, each core with L1 and L2 caches and each group sharing an L3, attached to DRAM; label: data]

CYCLE STACKS AND SCALING BEHAVIOR

• Scaling to more cores, larger input set size
• How does execution time scale, and why?
• Scale input: application becomes DRAM bound
• Scale core count: sync losses increase to 20%

SNIPER: A FAST AND ACCURATE SIMULATOR

• Hybrid simulation approach
  – Analytical interval core model
  – Micro-architecture structure simulation
    • branch predictors, caches (incl. coherency), NoC, etc.
• Hardware-validated, Pin-based
• Models multi/many-cores running multi-threaded and multi-program workloads
• Parallel simulator scales with the number of simulated cores
• Available at http://snipersim.org

SIMULATION IN SNIPER

[Diagram: functional simulator (Pin), branch predictor simulator, memory hierarchy simulator, processor cores]

A single-process, multithreaded workload (v1.0)
Multiple, single-threaded workloads (v2.0)
Execution-driven simulation
Trace-driven simulation

SIMULATION IN SNIPER WITH SIFT

[Diagram: Pin and PinPlay front-ends, each attached through a bi-directional single-thread SIFT connection to the processor cores and memory hierarchy simulator]

Multiple, multi-threaded workloads (v3.03)
Functional-directed simulation + timing-feedback
PinPlay support (v4.2)

MANY ARCHITECTURE OPTIONS

[Diagram: example configurations, including cores with private L1I/L1D caches below shared L2 and L3 caches and DRAM, and tiles with L1I/L1D and L2 caches connected by a NoC]

THREAD SCHEDULING AND MIGRATION

[Diagram: a scheduler maps threads onto cores with L1, L2 and L3 caches]

Thread migration (v4.0)

ADVANCED VISUALIZATION

• Cycle stacks through time

TOP SNIPER FEATURES

• Interval Model
• CPI Stacks and Interactive Visualization
• Parallel Multithreaded Simulator
• x86-64 and SSE2 support
• Validated against Core2, Nehalem
• Thread scheduling and migration
• Full DVFS support
• Shared and private caches
• Modern branch predictor
• Supports pthreads and OpenMP, TBB, OpenCL, MPI, …
• SimAPI and Python interfaces to the simulator
• Many flavors of Linux supported (Redhat, Ubuntu, etc.)

SIMULATOR COMPARISON

Sniper   Graphite   Gem5   COTSon   MARSSx86

Integrated   X
Func-directed   X   X   X   X
User-level   X   X   X
Full-system   X   X   X
Archs Supported   x64 | x64 | x64, Alpha, SPARC | x64 | x64
Parallel (in-node)   X   X
Flexible cache hierarchy   X   X   X   X

SNIPER LIMITATIONS

• User-level
  – Perfect for HPC (hybrid MPI/OpenMP applications)
  – Not the best match for workloads with significant OS involvement
• Functional-directed
  – No simulation / cache accesses along false paths
• High-abstraction core model
  – Not suited to model all effects of core-level changes
  – Perfect for memory subsystem or NoC work
• x86 only

SNIPER HISTORY

• November 2011: SC’11 paper, first public release
• March 2012, version 2.0: Multi-program workloads
• May 2012, version 3.0: Heterogeneous architectures
• November 2012, version 4.0: Thread scheduling and migration
• April 2013, version 5.0: Multi-threaded application sampling
• June 2013, version 5.1: Suggestions for optimization visualization
• September 2013, version 5.2: MESI/F, 2-level TLBs, Python scheduling
• Today: 700+ downloads from 60 countries

NEW IN RELEASE 5.2: MESI/F COHERENCY

• Added MESI and MESIF coherency protocols
  – EXCLUSIVE state: allows for silent upgrades
  – FORWARDING state: special case of SHARED, allows for cache-to-cache transfers of shared data

[Figure: normalized run-time of npb/A under MSI, MESI and MESIF, for a Xeon Phi-like system (axis 0.96 to 1) and a Xeon/i7-like system (axis 0.985 to 1)]
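The effect of the two added states can be illustrated with a simplified sketch. It follows the textbook MSI/MESI/MESIF definitions, not Sniper’s source code; the function names and return values are hypothetical.

```python
STATES = {"MSI": "MSI", "MESI": "MESI", "MESIF": "MESIF"}


def write_hit(state, protocol):
    """Return (new_state, needs_invalidation) for a write that hits a cached line."""
    if state not in STATES[protocol] or state == "I":
        raise ValueError(f"{state!r} is not a valid hit state for {protocol}")
    if state in ("M", "E"):
        return "M", False   # sole copy: E -> M is the silent upgrade
    return "M", True        # S or F: other copies must be invalidated first


def shared_read_source(sharer_states):
    """Who supplies a read miss when only clean shared copies (S/F) exist elsewhere."""
    return "cache holding F" if "F" in sharer_states else "next level"


print(write_hit("E", "MESI"))          # ('M', False)
print(write_hit("S", "MSI"))           # ('M', True)
print(shared_read_source(["S", "F"]))  # cache holding F
print(shared_read_source(["S", "S"]))  # next level
```

Under MSI, a line that is read and then written by a single core still needs an upgrade message, which EXCLUSIVE avoids; FORWARDING designates one sharer to answer requests for shared data.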

USE CASE #1: ARCHITECTURE EXPLORATION

• Many architectural options for next-generation architecture (45nm to 22nm)
  – Compare performance, energy efficiency

[Diagram: core and cache layout of each option: baseline (2x quad-core); 8 cores; 16 cores, no L3, stacked DRAM; 16 slow cores; 16 thin cores]

Heirman et al., PACT’12

USE CASE #1: ARCHITECTURE EXPLORATION

Performance:
– 3D (no LLC) is good when streaming, bad when sharing
– Dual-issue good when compute bound, bad when memory bound (smaller ROB exposes less MLP!)

USE CASE #1: ARCHITECTURE EXPLORATION

Power:
– 3D reduces DRAM energy, lack of LLC may increase total energy
– Low-frequency can be significantly more energy-efficient

USE CASE #2: HW/SW CO-OPTIMIZATION

• Heat transfer: stencil on regular grid
  – Important kernel, part of Berkeley Dwarfs (structured grid)
• Improve memory locality: tiling over multiple time steps
  – Trade off locality with redundant computation
  – Optimum depends on relative cost (performance & energy) of computation, data transfer → requires integrated simulator

USE CASE #2: HW/SW CO-OPTIMIZATION

• Traditionally:
  – Architects design hardware based on fixed benchmarks
  – Programmers optimize software for today’s architecture
Software parameters depend on architecture!
• Co-optimization yields 66% more performance, or 25% more energy efficiency, than isolated optimization

[Figure panels: (a) Performance (simulated time steps per second), Steps/time (1/s) from 0 to 300, and (b) Energy efficiency (simulated time steps per Joule), Steps/Energy (1/J) from 0.0 to 2.5. Each is plotted against arithmetic intensity (FLOP/byte, 0 to 4) for the 8-core, 3D, low-frequency and dual-issue architectures, with tile sizes 32, 64, 128, 256 and 512 in the legend]

Figure 9: Hardware/software co-design for the heat benchmark when optimizing for performance and energy efficiency: four architecture design points are considered while varying software parameters such as tile size (see legend) and arithmetic intensity (horizontal axes).

for performance and power to explore these trade-offs and hardware/software interactions.

7.3 Co-design analysis

Figure 9 plots the simulation results for hardware/software co-design for the heat benchmark application. The grid domain is 4096×4096 elements. This domain is split up into square tiles measuring between 32 and 512 data points on each side. The complete domain is 128 MB in total and does not fit in the last-level cache for any of the architectures considered, so tiles always have to be loaded from main memory. The number of time steps performed on each tile before moving on to the next one varies between 1 and 65 steps. The graphs plot the achieved number of time steps per second, or per Joule of consumed energy, as a function of arithmetic intensity. The performance graphs in Figure 9(a) follow the basic roofline model from Figure 8: initially, increasing the number of time steps improves performance, which later falls back once the amount of redundant computation becomes too high. Additionally, each architecture has an optimal tile size which maximizes data reuse while still fitting in the cache. For example, the 8-core design point reaches a performance level of around 150 simulated time steps per second, using a tile size of 128×128. The working set of this application is two tiles worth of data, corresponding to the previous and current time steps. At one double-precision floating-point number or 8 bytes per element, the working set of 256 KB for the 128²-sized tile fits in a core’s private 512 KB L2 cache, whereas for the larger 256² tiles it does not, which makes performance significantly lower than that predicted by the roofline model. The three other architectures, which have an L2 cache of only 256 KB, reach their optimal performance using the smaller 64² tiles.

If we consider the effect of arithmetic intensity on energy rather than performance, we see that generally the optimum has shifted towards the left, meaning that less redundant work (and a slightly increased access rate to main memory) is preferred. This is because the ratio between access times of caches and DRAM is very high, making the performance aspect of more redundant work cheap when compared to extra DRAM accesses. On the other hand, when comparing the energy cost of DRAM accesses versus that of extra computation, the ratio is lower, which makes the relative cost of the redundant work higher. Note that this extra cost consists of not just extra floating-point operations, which are in themselves very cheap (in the order of 0.5 nJ per double-precision operation, vs. 100 nJ per DRAM access¹) but also the cost of extra instructions flowing through all stages of the out-of-order pipeline, extra cache accesses, etc. (In Figure 5, floating-point ALU energy represents only a small fraction of total core energy.)

When considering performance alone, the 3D architecture clearly wins for this benchmark. The additional bandwidth of the 3D stacked memory allows for a steeper performance slope in the leftmost part of the performance graph, while the availability of 16 full-sized cores results in the highest peak performance across all architecture design points considered. Yet, tiles have to be kept small enough such that they fit in L2 cache; this architecture does not have an L3 cache so the cost of L2 misses, which become significant with tile sizes larger than 64², is much higher here than in the other architectures.

On the other hand, the low-frequency architecture, while being less high-performance than both the 8-core and 3D design points, reaches the highest energy efficiency. Because this application scales fairly well with core count, and the power envelope of the low-frequency chip is still modest, one could even consider another doubling of the number of cores, which should almost double performance at very little additional cost in energy.

¹ According to the relevant component in the dynamic energy stacks computed by Sniper/McPAT, scaled by the number of FP operations or DRAM accesses throughout the benchmark, respectively.
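The working-set arithmetic in the text (two tiles of 8-byte elements) can be checked with a short sketch. The helper names are hypothetical, and the strict "less than" comparison is an assumption made here so that a 256 KB working set is not counted as fitting in a 256 KB cache, in line with the 64² optimum reported for those architectures.

```python
BYTES_PER_ELEMENT = 8      # one double-precision value per grid point
TILES_IN_WORKING_SET = 2   # previous and current time step
KB = 1024


def working_set_bytes(tile_side):
    return TILES_IN_WORKING_SET * tile_side * tile_side * BYTES_PER_ELEMENT


def largest_fitting_tile(l2_bytes, tile_sides=(32, 64, 128, 256, 512)):
    # Strict "<" is an assumption: it leaves room for data other than the tiles.
    fitting = [s for s in tile_sides if working_set_bytes(s) < l2_bytes]
    return max(fitting) if fitting else None


print(working_set_bytes(128) // KB)                  # 256 (KB)
print(working_set_bytes(256) // KB)                  # 1024 (KB)
print(largest_fitting_tile(512 * KB))                # 128
print(largest_fitting_tile(256 * KB))                # 64
print(4096 * 4096 * BYTES_PER_ELEMENT // (KB * KB))  # 128 (MB, whole domain)
```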

USE CASE #3: SCHEDULING HETERO. CMPS

• Heterogeneous CMPs provide flexible power/performance trade-off (e.g. ARM big.LITTLE)
• Throughput-optimized schedulers place thread with highest benefit on big cores
• Does not help parallel applications: fast threads need to wait for slow threads

[Figure: big (B, high-performance) and small (S, power-efficient) cores; normalized runtime]

USE CASE #3: SCHEDULING HETERO. CMPS

• Equal-progress scheduler estimates speedup/slowdown of thread if it were to run on the other core type
• Matches progress (= hypothetical speed if the thread always ran on the big core)
• Results in performance improvement, even for heterogeneous applications

[Diagram: CPI, MLP and ILP measured on one core type (big, B, or small, S) are combined with the ILP change and MLP change to estimate CPI on the other core type]

Van Craeynest et al., PACT’13
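The equal-progress idea can be sketched as a greedy policy that gives the big cores to the threads that are furthest behind. This is a simplified illustration, not the scheduler from the cited paper: the per-thread slowdown factors are taken as given inputs here (the slide’s scheduler estimates them from CPI, MLP and ILP), and the function name is hypothetical.

```python
def equal_progress_schedule(slowdowns, n_big, quanta):
    """Return per-thread progress in big-core-equivalent quanta.

    slowdowns[i] is the assumed slowdown (> 1) of thread i on a small core
    relative to a big core.
    """
    progress = [0.0] * len(slowdowns)
    for _ in range(quanta):
        # Threads with the least progress get the big cores this quantum.
        order = sorted(range(len(progress)), key=lambda i: progress[i])
        on_big = set(order[:n_big])
        for i, slowdown in enumerate(slowdowns):
            progress[i] += 1.0 if i in on_big else 1.0 / slowdown
    return progress


# Four threads that each run 2x slower on a small core, one big core.
print(equal_progress_schedule([2.0] * 4, n_big=1, quanta=400))
# [250.0, 250.0, 250.0, 250.0]
```

If one thread were instead pinned to the big core for all 400 quanta, the other three would only reach 200, and a barrier-synchronized application would be held back by them.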

THE SNIPER MULTI-CORE SIMULATOR
SIMULATOR INTERNALS

WIM HEIRMAN, TREVOR E. CARLSON, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT

HTTP://WWW.SNIPERSIM.ORG

SUNDAY, SEPTEMBER 22ND, 2013
IISWC 2013, PORTLAND

RELAXED SYNCHRONIZATION

• Graphite introduced relaxed synchronization with a number of different synchronization schemes
  – none: only synchronizes when the application does; for pthread calls, etc.
  – random-pairs: synchronizes random pairs of threads
  – barrier: synchronizes all threads at a given simulated time interval
• Sniper defaults to barrier synchronization with 100ns intervals
  – Multi-machine mode not supported, so tight synchronization is easier
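The barrier scheme can be sketched with host threads that each advance a local simulated clock and wait for each other at every 100ns boundary. This is an illustrative model only, not Sniper code; the per-thread step sizes and the function name are made up.

```python
import threading

QUANTUM_NS = 100  # barrier interval quoted on the slide


def run(per_thread_steps_ns, total_ns):
    """Simulate threads with local clocks that synchronize every quantum."""
    n = len(per_thread_steps_ns)
    barrier = threading.Barrier(n)
    observed = [[] for _ in range(n)]  # local time of each thread at each barrier

    def worker(tid):
        local = 0
        step = per_thread_steps_ns[tid]
        for limit in range(QUANTUM_NS, total_ns + 1, QUANTUM_NS):
            while local < limit:       # run ahead freely inside the quantum
                local += step
            observed[tid].append(local)
            barrier.wait()             # nobody enters the next quantum early

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed


for tid, times in enumerate(run([7, 30, 45], total_ns=500)):
    print(tid, times)
```

Within a quantum the threads are unsynchronized, so their clocks can drift apart, but the drift is bounded because no thread may start the next quantum before all others have finished the current one.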

BARRIER SYNCHRONIZATION IN ACTION

[Figure: simulated time versus real time for threads meeting at successive barriers, with a pthread_cond_wait and a pthread_cond_signal marked]

PARALLELISM INSIDE SNIPER

• Each simulated thread corresponds to a thread on the host
  – Functional simulation (Pin mode) / reading trace (SIFT mode)
  – Executes timing model for the core (and associated caches) on which the thread currently runs (when scheduled)
  – Each core model maintains its own local time
• Extra threads for network and DRAM models
  – Can process invalidation requests without interrupting the core model
• Each thread can independently make progress
  – Causality errors can occur, no rollback
  – Skew is limited to 100ns

Page 50: T S NIPERM ULTIC ORES IMULATOR - snipersim.orgsnipersim.org/documents/2013-09-22 Sniper IISWC Tutorial.pdf · 9/22/2013  · set of 256 KB for the 1282-sized tile fits in a core’s

THREADS IN SNIPER

[Figure: four cores, each with L1 I/D and L2; each pair of cores shares an L3 and a network interface (NI); DRAM behind them. Legend: application threads, network threads]

TIME IN SNIPER

• Each memory access instantly returns latency
• Application threads maintain time
• Network threads reset time for each request

[Figure: the same core / L1 I/D / L2 / L3 / NI / DRAM hierarchy, annotated with example timestamps between t=0 and t=30]
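The three bullets on this slide can be sketched as follows. Everything here is an illustrative assumption: the latency values, the class and the function names are hypothetical and do not come from Sniper. The point is only that a memory access returns its total latency immediately, the application thread adds that latency to its own clock, and a network thread starts from the timestamp carried by each request instead of keeping a running clock.

```python
# Hypothetical per-level latencies in cycles; not values taken from Sniper.
LEVELS = [("L1", 1), ("L2", 4), ("L3", 10), ("DRAM", 15)]


def access_latency(hit_level):
    """Total latency of an access that is served at hit_level."""
    total = 0
    for name, latency in LEVELS:
        total += latency
        if name == hit_level:
            return total
    raise ValueError("unknown level: %s" % hit_level)


class ApplicationThread:
    """Owns a local clock and advances it by each returned latency."""

    def __init__(self):
        self.local_time = 0

    def memory_access(self, hit_level):
        latency = access_latency(hit_level)  # known immediately, no waiting
        self.local_time += latency
        return latency


def network_thread_handle(request_time, service_latency):
    """A network thread keeps no clock of its own: it restarts from the
    timestamp carried by the request it is serving."""
    return request_time + service_latency
```

A short usage example: after `t = ApplicationThread()`, calling `t.memory_access("L1")` and then `t.memory_access("DRAM")` leaves `t.local_time` at the sum of both returned latencies.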


MODELING CONTENTION

• Events may happen out of order
• How to model bandwidth / contention?
  – History list
    • Resource in use at times 0…10, 12…17, 25…30
    • Access at 15: delay = 2
    • Access at 8, length 5: ?
• Causality errors are possible
  – Effect is limited, as long as average bandwidth is OK
  – Allows for faster simulation, easier implementation
  – Speed versus accuracy trade-off
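A history list can be sketched as a sorted list of busy intervals that is searched for the first free slot. The code below is a minimal illustration, not Sniper's implementation: the function name and the first-fit policy (an access is never split across gaps) are assumptions. Under that policy the slide's first example gives a delay of 2, and the open question (access at 8, length 5) gives 9, because the gap from 10 to 12 is too short; a different policy would give a different answer.

```python
def delay_for_access(busy, start, length):
    """Delay before an access of the given length can begin.

    busy is a list of (begin, end) intervals during which the resource is
    in use. First-fit policy (an assumption): the access starts in the
    first gap that can hold it entirely.
    """
    t = start
    for begin, end in sorted(busy):
        if t + length <= begin:
            break      # the access fits before this busy interval
        if t < end:
            t = end    # pushed back until this interval has finished
    return t - start


history = [(0, 10), (12, 17), (25, 30)]
print(delay_for_access(history, 15, 1))  # access at 15
print(delay_for_access(history, 8, 5))   # access at 8, length 5
```

Because requests may arrive out of simulated-time order, an access can land in a gap between intervals that were recorded earlier, which is why a list of intervals is kept instead of a single "busy until" timestamp.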


SIMULATOR ARCHITECTURE: PIN FRONT-END

[Figure: block diagram of the simulator components and their source directories]
• Application threads (single process): run-sniper -- ... / benchmarks/run-sniper -p
• Functional simulation: instrumentation (pin/)
• Timing model (common/)
  – Scheduler: thread scheduling (common/scheduler)
  – Core performance models (common/performance_model)
  – Cache performance models, L1 and L2 (common/core/memory_subsystem)
  – NoC performance models (common/network)

SIMULATOR ARCHITECTURE: SIFT MODE

[Figure: block diagram of the simulator components and their source directories]
• Application 1, App 2: trace recording (sift/recorder)
• SIFT pipe (sift/)
• TraceThread (common/system/trace_thread.cc)
• Timing model (common/)
  – Scheduler: thread scheduling (common/scheduler)
  – Core performance models (common/performance_model)
  – Cache performance models, L1 and L2 (common/core/memory_subsystem)
  – NoC performance models (common/network)
• Invocation: run-sniper --traces, run-sniper --mpi, benchmarks/run-sniper --benchmarks

PIN FRONT-END: EXAMPLE

Instruction stream:
    001: mov $123, eax
    002: mov $321, ebx
    003: mov $2, ecx
    004: mov (eax), edx
    005: add edx, 4
    006: mov edx, (ebx)
    007: add $4, eax
    008: add $4, ebx
    009: add $-1, ecx
    00a: jnz $004

Instrumentation callbacks:
    handleBasicBlock(0x001), handleBasicBlock(0x004): pm->queueBasicBlock()
    handleMemoryRead(value of eax), handleMemoryWrite(value of ebx), handleBranch(taken/not taken): pm->pushDynamicInstructionInfo()

PM, BasicBlockQueue:
    001: out(eax)
    002: out(ebx)
    003: out(ecx)
    –
    004: in(eax), out(edx), mem_rd(.)
    005: in(edx), out(edx)
    006: in(edx,ebx), mem_wr(.)
    007: in(eax), out(eax)
    008: in(ebx), out(ebx)
    009: in(ecx), out(ecx)
    00a: jump(.)
    –
    004: …

PM, DynamicInstructionInfoQueue:
    Addr(123) – Addr(321) – Branch(taken) – Addr(127) – Addr(325) – Branch(not taken)
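One way to read this example is that static information is decoded once per basic block, while addresses and branch outcomes arrive separately and are matched up in order. The sketch below models that pairing. It is an assumption-laden illustration: `BASIC_BLOCKS`, `replay` and the tuple layouts are hypothetical names and data structures, not Sniper's classes.

```python
from collections import deque

# Static information, decoded once per basic block: (address, mnemonic, kind),
# where kind names the dynamic information the instruction needs at run time.
BASIC_BLOCKS = {
    0x001: [(0x001, "mov", None), (0x002, "mov", None), (0x003, "mov", None)],
    0x004: [
        (0x004, "mov", "mem_rd"), (0x005, "add", None), (0x006, "mov", "mem_wr"),
        (0x007, "add", None), (0x008, "add", None), (0x009, "add", None),
        (0x00A, "jnz", "branch"),
    ],
}


def replay(basic_block_queue, dynamic_info_queue):
    """Pair each queued basic block with its dynamic information, in order."""
    executed = []
    while basic_block_queue:
        block = BASIC_BLOCKS[basic_block_queue.popleft()]
        for address, mnemonic, kind in block:
            # Only memory and branch instructions consume a dynamic entry.
            info = dynamic_info_queue.popleft() if kind else None
            executed.append((address, mnemonic, info))
    return executed


# The loop body (block 0x004) runs twice because ecx starts at 2.
blocks = deque([0x001, 0x004, 0x004])
dynamic = deque([
    ("addr", 123), ("addr", 321), ("branch", True),
    ("addr", 127), ("addr", 325), ("branch", False),
])
trace = replay(blocks, dynamic)
```

After the replay, both queues are empty and every memory or branch instruction carries exactly one dynamic entry, which is the invariant the two-queue design relies on.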


THE SNIPER MULTI-CORE SIMULATOR
SIMULATOR ACCURACY AND HARDWARE VALIDATION

TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT

HTTP://WWW.SNIPERSIM.ORG

SUNDAY, SEPTEMBER 22ND, 2013
IISWC 2013, PORTLAND

HARDWARE VALIDATION

• Why validation?
  – Debugging
  – Verifying modeling assumptions
  – Balance between accuracy and generality
    • e.g.: loop buffer in Nehalem/Westmere; uop-cache in Sandy Bridge
• Current status:
  – Validated against Core2 (internal, results @ SC'11)
  – Nehalem ongoing (public version)

EXPERIMENTAL SETUP

• Benchmarks
  – Complete SPLASH-2 suite
    • 1 to 16 threads
    • Linux pthreads API
  – Extensive use of microbenchmarks to tune parameters and track down problems
• Hardware
  – Four-socket Intel Xeon X7460 machine
  – Core2 (45 nm, Penryn) with 6 cores/socket

EXPERIMENTAL SETUP: ARCHITECTURE

[Figure: memory hierarchy diagram. Each core has its own L1I and L1D; two cores share an L2; two L2s share an L3; four L3s connect to DRAM]

HINTS FOR COMPARING TO HARDWARE

• Threads are pinned to their own core
    pthread_setaffinity_np()
• SpeedStep is disabled
    echo performance | tee /sys/devices/system/cpu/*/cpufreq/scaling_governor
• Turbo mode, Hyperthreading disabled
  – BIOS setting
• Use hardware performance counters
  – But can be difficult to interpret
  – Overlapping cache misses (HW) vs. hits (Sniper)
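For measurement scripts written in Python, the first two hints can be approximated as below. This is a Linux-only sketch under stated assumptions: `os.sched_setaffinity` is the Python standard-library counterpart to the pthread call named on the slide, and the helper names `pin_to_core` and `read_governors` are hypothetical. The second helper only reads the current governor of each CPU so a script can check the setup before measuring; changing it still requires root.

```python
import glob
import os


def pin_to_core(core_id):
    """Pin the calling thread to a single core (Linux only)."""
    os.sched_setaffinity(0, {core_id})
    return os.sched_getaffinity(0)


def read_governors():
    """Return {sysfs path: governor name} for every CPU that exposes cpufreq."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"
    governors = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                governors[path] = handle.read().strip()
        except OSError:
            # Some environments (containers, VMs) hide or protect these files.
            continue
    return governors


if __name__ == "__main__":
    first_allowed = sorted(os.sched_getaffinity(0))[0]
    print(pin_to_core(first_allowed))
    print(read_governors())
```

A measurement script could refuse to run when any reported governor differs from `performance`.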


INTERVAL PROVIDES NEEDED ACCURACY

The interval core model provides consistent accuracy of 25% avg. abs. error, with a minimal slowdown

INTERVAL: GOOD OVERALL ACCURACY

Good accuracy for the entire benchmark suite

INTERVAL: BETTER RELATIVE ACCURACY

• Application scalability is affected by memory bandwidth
• Interval model provides more realistic memory request streams, which results in a more accurate scaling prediction

APPLICATION OPTIMIZATION

• Splash2-Raytrace shows very bad scaling behavior
• CPI stack shows why: heavy lock contention
• Conversion to use locked increment instruction helps

SIMULATOR PERFORMANCE

Sniper currently scales to 2 MIPS

Typical simulators run at 10s-100s KIPS, without scaling

SYNCHRONIZATION VARIABILITY

Variability due to relaxed synchronization is application specific

SYNCHRONIZATION: SPEED VS. ACCURACY

Speed vs. simulation accuracy for barrier quanta of 10 ns, 100 ns, 1 μs, 10 μs, 100 μs and 1 ms

FLEXIBILITY TO CHOOSE NEEDED FIDELITY

MANY-CORE SIMULATIONS

High simulation speed up to 1024 simulated cores
  – Efficient simulation: L1-based benchmarks execute faster
  – Host system: dual-socket Xeon X5660 (6-core Westmere), 96 GB RAM

VALIDATING FOR NEHALEM

[Figure: two charts of HW measurement / Sniper result, vertical axis from 0 to 2]

NEHALEM VALIDATION

• Currently working to validate additional modern systems
  – The end goal is to improve both the interval model accuracy and the overall Sniper accuracy, for the core as well as the uncore
  – Ongoing work which unfortunately takes quite a bit of time
• Recent example: radix
  – CVT* instructions were not properly modeled
    • Defaults to 1 cycle, but they are actually 3 to 6 cycles
  – Adding CVT* instructions improved error: 33% to 4%

POWER/ENERGY VALIDATION

• Goal:
  – Validate correctness of McPAT's input
  – Gain confidence in relative accuracy
    • Most important for HW/SW design space exploration
  – Not necessarily: obtain high absolute accuracy
    • Illusory at this level of abstraction!
• Validation setup:
  – Dual-socket Intel Xeon (X5570 Nehalem) server
  – Measure wall power using Racktivity RC0816 PDU

POWER/ENERGY VALIDATION: METHODOLOGY

• One HW power measurement per second
  – Select benchmarks with long parallel region
  – Subtract server idle power
  – Apply temperature correction (static power)
  – Compute average dynamic power when thermal equilibrium is reached
• McPAT + DRAM power model
  – Compute CPU + DRAM dynamic power of workload's parallel region
  – Validate against HW measurements
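The arithmetic behind this methodology can be sketched as follows. The function names, the way the temperature correction is applied (as a constant static-power term) and all numbers in the usage note are hypothetical; the slides do not give the exact formulas.

```python
def average_dynamic_power(samples_w, idle_w, static_correction_w=0.0):
    """Average dynamic power from once-per-second wall-power samples.

    samples_w: measurements taken after thermal equilibrium is reached.
    idle_w: server idle power, subtracted from every sample.
    static_correction_w: assumed constant correction for the extra static
    power caused by the higher temperature under load.
    """
    dynamic = [s - idle_w - static_correction_w for s in samples_w]
    return sum(dynamic) / len(dynamic)


def average_absolute_error(measured_w, predicted_w):
    """Mean of |predicted - measured| / measured over all benchmarks."""
    errors = [abs(p - m) / m for m, p in zip(measured_w, predicted_w)]
    return sum(errors) / len(errors)
```

For example, made-up samples of 200 W, 210 W and 220 W with 100 W idle power and a 5 W static correction give an average dynamic power of 105 W. That value would then be compared against the McPAT + DRAM model prediction for the same parallel region.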


POWER/ENERGY VALIDATION: RESULTS

Good absolute and relative accuracy
  – 8.3% average absolute error
  – 16% max over SPEComp

[Figure: bar chart of measured vs. predicted dynamic power (W), axis range 60 to 180, for O-ammp_m, O-applu_m, O-equake_m, O-fma3d_m, O-mgrid_m, O-swim_m, heat-64-1, heat-64-17, heat-128-21, heat-256-37]

THE SNIPER MULTI-CORE SIMULATOR
RUNNING SIMULATIONS AND PROCESSING RESULTS

WIM HEIRMAN, TREVOR E. CARLSON, KENZO VAN CRAEYNEST, IBRAHIM HUR AND LIEVEN EECKHOUT

HTTP://WWW.SNIPERSIM.ORG

SUNDAY, SEPTEMBER 22ND, 2013
IISWC 2013, PORTLAND

OVERVIEW

• Obtain and compile Sniper
• Running
• Configuration
• Simulation results
• Interacting with the simulation
  – SimAPI: application
  – Python scripting

RUNNING SNIPER

• Download Sniper
  – http://snipersim.org/w/Download
    • Download tar.gz
    • Git clone
    ~/sniper$ export SNIPER_ROOT=$(pwd)  # optional
    ~/sniper$ make
• Running an application
    ~/sniper$ ./run-sniper -- /bin/true
    ~/sniper/test/fft$ make run

RUNNING SNIPER

• Integrated benchmarks distribution
  – http://snipersim.org/w/Download_Benchmarks
    ~/benchmarks$ export BENCHMARKS_ROOT=$(pwd)
    ~/benchmarks$ make
    ~/benchmarks$ ./run-sniper -p splash2-fft -i small -n 4
• Standardizes input sets and command lines
• Includes SPLASH-2, PARSEC, NAS Parallel Benchmarks, integration for SPEC CPU 2006

INTEGRATION WITH BENCHMARKS

• To add a new benchmark
  – Add source code
  – Add __init__.py file
    • Provides application invocation details
    • Define input sets (e.g.: test, small, large)
  – Mark the ROI region
  – Simple example: see local/pi

MULTI-PROGRAMMED WORKLOADS

• Recording traces (SIFT format)
    $ ./record-trace -o fft -- test/fft/fft -p1
• Limited trace, by instruction count: fast-forward (-f), detailed length (-d), block size (-b)
    $ ./record-trace -o fft -f 1e9 -d 1e9 -b 1e8 \
          -- test/fft/fft -p1 -m20
• Running traces
    $ ./run-sniper -c gainestown -n 4 \
          --traces=gcc.sift,swim.sift,swim.sift,equake.sift
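When many trace combinations have to be simulated, the command line can be assembled by a small helper. The sketch below only builds the argument list and uses the options shown on this slide (-c, -n, --traces). The helper name and the choice to set -n to the number of traces are assumptions made for illustration.

```python
def traces_command(traces, config="gainestown", run_sniper="./run-sniper"):
    """Build a run-sniper argument list for a multi-programmed workload.

    Assumption: one core per trace, so -n is set to len(traces), as in the
    four-trace example on the slide.
    """
    if not traces:
        raise ValueError("at least one trace is required")
    return [
        run_sniper,
        "-c", config,
        "-n", str(len(traces)),
        "--traces=" + ",".join(traces),
    ]


command = traces_command(["gcc.sift", "swim.sift", "swim.sift", "equake.sift"])
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the Sniper directory; passing a list avoids shell quoting and line-continuation problems.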


PINPLAY TRACES

• PinPlay is a record/replay framework
  – PinBall consists of checkpoint + external input
  – Replay starts from checkpoint, then functionally executes application
  – Smaller than SIFT trace (checkpoint + I/O vs. full dynamic instruction stream)
• Running PinBalls
    $ ./run-sniper -n 2 --pinballs=gcc.sift,swim.sift

PINPOINTS

• SimPoint methodology selects representative samples from a long application
  – See Perelman, SIGMETRICS '03
  – PinPoints = PinPlay + SimPoint
• Released the full-program and PinPoints versions of CPU2006 on the Sniper website
  – http://www.snipersim.org/w/Pinballs

MPI WORKLOADS

• Run single-node (shared-memory) MPI apps
  – Supports MPICH2 and derivatives (Intel MPI, etc.)
• Running an MPI app with one rank per core:
    $ ./run-sniper --mpi -n 4 -- ./pi
• See test/mpi and test/mpi-omp for pure MPI and hybrid MPI+OpenMP examples

REGION OF INTEREST

• Skip benchmark initialization and cleanup
• Mark code with ROI begin / end markers
  – SimRoiStart() / SimRoiEnd() in your own application
  – $ ./run-sniper --roi -- test/fft/fft
• Already done in benchmarks distribution
  – benchmarks/run-sniper implies --roi
  – Use --no-roi to override
• Cache warming during pre-ROI period
  – Use --no-cache-warming to override

CONFIGURATION

• Stackable configuration files (run-sniper -c) and explicit command-line options (-g)
  – Template configurations in sniper/config/*.cfg (-c name)
  – Your own local configuration files (-c filename.cfg)
  – Explicit option: -g --section/key=value
• Multiple configuration files, and -g options, can be combined
  – Config files specified later on the command line take precedence
  – config/base.cfg is always included
  – If no -c option is provided, config/gainestown.cfg is the default (quad-core Nehalem-based Xeon)
• Complete configuration is stored in sim.cfg after each run

CONFIGURATION

• Example configuration: largecache.cfg

    [perf_model/l3_cache]
    cache_size = 16384    # KB

    $ run-sniper -c gainestown -c largecache.cfg

• Equivalent to:

    $ run-sniper -c gainestown \
        -g --perf_model/l3_cache/cache_size=16384
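The stacking rule (later files and -g options take precedence) can be mimicked with Python's standard configparser. This is a sketch of the precedence behavior only, not how run-sniper parses its configuration: the helper name `load_stacked` is hypothetical, and the 8192 KB value standing in for the earlier file in the usage note is a placeholder, not a value read from the real gainestown.cfg.

```python
import configparser


def load_stacked(config_texts, overrides=()):
    """Merge configuration texts in order; later entries win.

    overrides mimic '-g --section/key=value' options and are applied last.
    """
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    for text in config_texts:
        parser.read_string(text)   # same section and key: the new value replaces the old
    for option in overrides:
        path, value = option.lstrip("-").split("=", 1)
        section, key = path.rsplit("/", 1)
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, key, value)
    return parser


base_cfg = "[perf_model/l3_cache]\ncache_size = 8192  # KB (placeholder)\n"
large_cfg = "[perf_model/l3_cache]\ncache_size = 16384  # KB\n"

from_files = load_stacked([base_cfg, large_cfg])
from_flag = load_stacked([base_cfg], ["--perf_model/l3_cache/cache_size=16384"])
```

Both `from_files` and `from_flag` end up with a `cache_size` of `16384` in section `perf_model/l3_cache`, which mirrors the equivalence shown on the slide.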


SIMULATION RESULTS

• Files created after each simulation:
  – sim.cfg: all configuration options used for this run (includes defaults, all -c and -g options)
  – sim.out: basic statistics (number of cycles, instructions per core, cache access and miss rates, …)
  – sim.stats[.sqlite3]: complete set of all recorded statistics at key points in the simulation (start, roi-begin, roi-end, stop)
• Processing and visualization scripts in tools/
• Use the sniper_lib Python package for parsing

SIMULATION RESULTS

• sim.out: Quick overview of basic performance results

                                   | Core 0     | Core 1
  Instructions                     |     506505 |     505562
  Cycles                           |     469101 |     468620
  Time (ns)                        |     176354 |     176173
Branch predictor stats             |            |
  num correct                      |      15338 |      15205
  num incorrect                    |       1280 |       1218
  misprediction rate               |      7.70% |      7.42%
  mpki                             |       2.53 |       2.41
Cache Summary                      |            |
  Cache L1-I                       |            |
    num cache accesses             |      46642 |      46555
    num cache misses               |        217 |        178
    miss rate                      |      0.47% |      0.38%
    mpki                           |       0.43 |       0.35
  Cache L1-D                       |            |
    num cache accesses             |     332771 |     332412
    num cache misses               |        517 |        720
    miss rate                      |      0.16% |      0.22%
    mpki                           |       1.02 |       1.42
  Cache L2                         |            |
    num cache accesses             |        984 |       1090
    num cache misses               |        459 |        853
    miss rate                      |     46.65% |     78.26%
    mpki                           |       0.91 |       1.69
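The derived rows in this table follow from the raw counts. As a worked check, the sketch below recomputes a few Core 0 values from the numbers shown above; the helper names are made up for this example, and the formulas are the usual definitions (rate = events / total, mpki = events per 1000 instructions).

```python
def rate_percent(events, total):
    """Events as a percentage of total."""
    return 100.0 * events / total


def mpki(events, instructions):
    """Events per thousand instructions."""
    return 1000.0 * events / instructions


# Core 0 values copied from the sim.out listing.
instructions, cycles, time_ns = 506505, 469101, 176354
bp_correct, bp_incorrect = 15338, 1280
l2_accesses, l2_misses = 984, 459

mispredict_rate = round(rate_percent(bp_incorrect, bp_correct + bp_incorrect), 2)
branch_mpki = round(mpki(bp_incorrect, instructions), 2)
l2_miss_rate = round(rate_percent(l2_misses, l2_accesses), 2)
ipc = instructions / cycles
frequency_ghz = round(cycles / time_ns, 2)   # cycles per nanosecond

print(mispredict_rate, branch_mpki, l2_miss_rate, round(ipc, 2), frequency_ghz)
```

The recomputed misprediction rate (7.7), branch mpki (2.53) and L2 miss rate (46.65) match the listing, and cycles divided by time gives the 2.66 GHz core frequency.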


SIMULATION RESULTS – TOOLS

• CPI stacks
    $ ./tools/cpistack.py [--time|--cpi|--abstime] [--aggregate]

                      CPI      CPI %     Time %
    Core 0
      depend-int     0.20     23.42%     23.42%
      depend-fp      0.16     18.94%     18.94%
      branch         0.12     14.04%     14.04%
      ifetch         0.04      4.16%      4.16%
      mem-l1d        0.21     24.41%     24.41%
      mem-l3         0.02      2.72%      2.72%
      mem-dram       0.05      5.73%      5.73%
      sync-mutex     0.02      2.59%      2.59%
      sync-cond      0.03      3.01%      3.01%
      other          0.01      0.97%      0.97%

      total          0.84    100.00%      0.00s
    Core 1
      depend-int     0.20     23.92%     23.92%
      depend-fp      0.16     18.79%     18.79%
      branch         0.12     13.72%     13.72%
      mem-l1d        0.20     24.06%     24.06%
      mem-l3         0.06      6.79%      6.79%
      sync-mutex     0.04      5.22%      5.22%
      sync-cond      0.05      5.60%      5.60%
      other          0.02      1.89%      1.89%

      total          0.85    100.00%      0.00s

SIMULATION RESULTS – TOOLS

• McPAT: power, energy and area
    $ ./tools/mcpat.py [-t static|dynamic|total|area]

                      Power     Energy   Energy %
      core-core     13.77 W     0.00 J     30.14%
      core-ifetch    4.44 W     0.00 J      9.71%
      core-int       1.22 W     0.00 J      2.67%
      core-fp        1.47 W     0.00 J      3.21%
      core-mem       6.30 W     0.00 J     13.78%
      icache         5.55 W     0.00 J     12.16%
      dcache         4.74 W     0.00 J     10.37%
      l3             3.38 W     0.00 J      7.39%
      dram           4.51 W     0.00 J      9.87%
      other          0.32 W     0.00 J      0.70%

      core          27.19 W     0.00 J     59.52%
      total         45.68 W     0.01 J    100.00%

SIMULATION RESULTS – TOOLS

• Topology
    $ ./tools/gen_topology.py

SIMULATION RESULTS – TOOLS

• LLC miss latency breakdown
    $ ./tools/llcstack.py

    Requests:             97485627
    Total time:              85.16
      dram-bus:               0.48
      dram-device:           19.45
      dram-queue:             0.07
      inv-imbalance:          3.95
      noc-base:              31.94
      noc-queue:              2.11
      remote-cache-inv:       0.80
      remote-cache-wb:       10.15
      td-access:             16.22

SIMULATION RESULTS – TOOLS

• Per-function statistics (gprof2dot, KCacheGrind output)
    $ run-sniper --profile

    calls    time  t.self  icount   ipc  l2.mpki  name
        1  50.06%   0.00%  50.05%  1.06     0.86  _start
        1  50.06%   0.00%  50.05%  1.06     0.86    _start
        1  50.06%   0.00%  50.05%  1.06     0.86      __libc_start_main
        1  50.06%   0.00%  50.05%  1.06     0.86        main
        1  49.11%   0.20%  49.95%  1.08     0.69          SlaveStart
        1  44.16%   0.18%  46.65%  1.12     0.40            FFT1D
       32  32.87%  17.91%  36.83%  1.19     0.05              FFT1DOnce
       32  14.96%   1.85%  10.62%  0.75     0.08                Reverse
     1024  13.12%  13.12%   8.57%  0.69     0.02                  BitReverse
        3   5.92%   5.92%   6.24%  1.12     1.31              Transpose
       16   2.17%   2.17%   3.25%  1.58     0.18              TwiddleOneCol
        1  49.94%   0.00%  49.95%  1.06     1.63  clone
        1  49.94%   0.02%  49.95%  1.06     1.63    start_thread
        1  49.91%   0.17%  49.94%  1.06     1.58      SlaveStart
        1  43.89%   0.15%  46.60%  1.12     0.66        FFT1D
       32  32.60%  17.79%  36.83%  1.20     0.06          FFT1DOnce
       32  14.81%   1.79%  10.62%  0.76     0.09            Reverse
     1024  13.03%  13.03%   8.57%  0.70     0.03              BitReverse
        3   7.02%   7.02%   6.24%  0.94     3.34          Transpose
       16   2.04%   2.04%   3.25%  1.68     0.18          TwiddleOneCol

VIZ: CYCLE STACKS IN TIME

    $ run-sniper --viz

[Screenshot: cycle stacks over time, with Cycles (%) plotted against Time; interface elements: legend, options, markers, IPC visualization]

VIZ: MCPAT OUTPUT OVER TIME

3D VISUALIZATION: IPC VS. TIME VS. CORE

AUTOMATIC  RESULTS  PROCESSING  

•  Avoid  manually  copy/pasXng  simulaXon  results  for  reliability,  scalability  (how  else  will  you  auto-­‐generate  graphs  based  on  100s  of  runs?)  

•  sniper_lib.get_results()  parses  sim.cfg,  sim.stats,  returns  configuraXon  and  staXsXcs  (roi-­‐end  –  roi-­‐begin)  for  all  cores  

 ~/sniper/tools$  python  >  import  sniper_lib  >  results  =  sniper_lib.get_results(resultsdir  =  ‘..’)  >  print  results      {‘config’:  {‘general/total_cores’:  ‘64’,                              ‘perf_model/core/frequency’:  ‘2.66’,  …},        ‘results’:  {‘performance_model.instruction_count’:[123],        ‘performance_model.elapsed_time’:  [23000000],  …}}  


SIMULATION  RESULTS  

• Let's compute the IPC for core 0
• Core frequency is variable (DVFS), so the cycle count has to be computed
    – Time is in femtoseconds, frequency in GHz

    > instrs = results['results']['performance_model.instruction_count'][0]
    > cycles = (results['results']['performance_model.elapsed_time'][0]
                * float(results['config']['perf_model/core/frequency'])
                * 1e-6)  # femtoseconds -> nanoseconds
    > ipc = instrs / cycles
    2.0
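The same arithmetic extends to every core. The sketch below is self-contained: `results` is a hand-written stand-in with the same shape as the dictionary shown for sniper_lib.get_results() (it is not produced by Sniper here), the second core's numbers are made up, and `ipc_per_core` is a hypothetical helper name. It assumes a single global frequency in GHz, as in the sample output.

```python
# Stand-in for the dictionary returned by sniper_lib.get_results().
# Core 0 uses the sample numbers from the slide; core 1 is made up.
results = {
    'config': {'general/total_cores': '2',
               'perf_model/core/frequency': '2.66'},                      # GHz
    'results': {'performance_model.instruction_count': [123, 246],
                'performance_model.elapsed_time': [23000000, 23000000]},  # fs
}

def ipc_per_core(results):
    """Return a list with one IPC value per core."""
    freq_ghz = float(results['config']['perf_model/core/frequency'])
    instrs = results['results']['performance_model.instruction_count']
    times_fs = results['results']['performance_model.elapsed_time']
    ipcs = []
    for n, t_fs in zip(instrs, times_fs):
        cycles = t_fs * 1e-6 * freq_ghz   # fs -> ns, then ns * GHz = cycles
        ipcs.append(n / cycles if cycles > 0 else 0.0)
    return ipcs

ipcs = ipc_per_core(results)
print(['%.2f' % v for v in ipcs])
```

Guarding against a zero cycle count matters in practice for cores that stayed idle during the region of interest.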


INTERACTING WITH SNIPER

[Diagram: the Sniper simulator and its interfaces: application binary, input/cmdline, SimAPI, configuration, Python scripts, statistics, visualization]


SIMAPI  IMPLEMENTATION  

• Magic instructions allow the application to talk to the simulator directly

    __asm__ __volatile__ (
        "xchg %%bx, %%bx\n"
        : "=a" (_res)        /* output    */
        : "a" (_cmd),
          "b" (_arg0),
          "c" (_arg1)        /* input     */
    );                       /* clobbered */

• Pin intercepts this instruction and passes control to the simulator

• Command and arguments passed through rax/rbx/rcx registers, result in rax


APPLICATION  SIMAPI  

• Calling simulator API functions from your C program: #include <sim_api.h>
    – SimInSimulator()
        • Return 1 when running inside Sniper, 0 when running natively
    – SimGetProcId()
        • Return processor number of caller
    – SimRoiStart() / SimRoiEnd()
        • Start/end detailed mode (when using ./run-sniper --roi)
    – SimSetFreqMHz(proc, mhz) / SimGetFreqMHz(proc)
        • Set / get processor frequency (integer, in MHz)
    – SimUser(cmd, arg)
        • User-defined function


PYTHON  SCRIPTING  

• Scripts are run on simulator startup
    – Register hooks: callbacks when certain events happen during the simulation
    – See common/system/hooks_manager.h for all available hooks

• Use an existing script from sniper/scripts/*.py:
    ./run-sniper -s scriptname

• Or your own script:
    ./run-sniper -s myscriptname.py

• Use sim package for convenience wrappers


PYTHON  SCRIPTING  

• Low-level script
• Execute "foo" at each barrier synchronization

    import sim_hooks
    def foo(t):
        print 'The time is now', t
    sim_hooks.register(sim_hooks.HOOK_PERIODIC, foo)


PYTHON  SCRIPTING  

• Higher-level script
• Execute "foo" at each barrier synchronization

    import sim
    class Class:
        def hook_periodic(self, t):
            print 'The time is now', t
    sim.util.register(Class())
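To see how the low-level and higher-level registration styles relate, here is a self-contained toy model. `HookRegistry`, its string hook names and its `fire` method are illustrative stand-ins, not Sniper's sim_hooks or sim.util code. The sketch assumes that the object-based registration simply looks for methods named `hook_<name>` and registers each one as a callback.

```python
class HookRegistry:
    """Toy stand-in for the simulator's hook table (not Sniper code)."""
    def __init__(self):
        self.callbacks = {}            # hook name -> list of callables

    def register(self, hook, func):    # low-level style: one hook, one function
        self.callbacks.setdefault(hook, []).append(func)

    def register_object(self, obj):    # higher-level style: scan for hook_* methods
        for name in dir(obj):
            if name.startswith('hook_'):
                self.register(name[len('hook_'):], getattr(obj, name))

    def fire(self, hook, *args):       # called by the "simulator" when an event occurs
        for func in self.callbacks.get(hook, []):
            func(*args)

log = []

def foo(t):
    log.append(('foo', t))

class Tracer:
    def hook_periodic(self, t):
        log.append(('Tracer', t))

registry = HookRegistry()
registry.register('periodic', foo)     # low-level registration
registry.register_object(Tracer())     # object-based registration
registry.fire('periodic', 1000)        # one simulated barrier event at t = 1000
print(log)
```

The object-based style keeps state (counters, previous samples) on `self`, which is why the later examples all use a class.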


PYTHON SCRIPTING

• High-level script: execute "foo" every X ms
• Pass in parameter using ./run-sniper -s myscript.py:X

    import sim
    class Class:
        def setup(self, args):
            sim.util.Every(long(args)*sim.util.Time.MS, self.periodic)
        def periodic(self, t, t_delta):
            print 'The time is now', t
            print 'Elapsed time since last call', t_delta
    sim.util.register(Class())
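The behaviour of an interval helper of this kind can be sketched without the simulator. The `Every` class below is a toy re-creation under stated assumptions, not the code behind sim.util.Every: it assumes time is counted in femtoseconds and that the helper is driven by a finer-grained periodic event through a hypothetical `tick` method.

```python
FS_PER_MS = 10**12   # assumed time unit: femtoseconds

class Every:
    """Toy stand-in: call func(t, t_delta) once at least `interval` fs have passed."""
    def __init__(self, interval, func):
        self.interval = interval
        self.func = func
        self.t_last = 0

    def tick(self, t):
        # Driven by a finer-grained periodic event (for example each barrier).
        if t - self.t_last >= self.interval:
            self.func(t, t - self.t_last)
            self.t_last = t

calls = []
every = Every(2 * FS_PER_MS, lambda t, t_delta: calls.append((t, t_delta)))
for step in range(1, 8):                 # one tick per simulated millisecond
    every.tick(step * FS_PER_MS)
print(calls)
```

Because the callback only fires on a tick, the actual spacing can exceed the requested interval, which is why the callback receives `t_delta` rather than assuming it.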


PYTHON  SCRIPTING  

• Access configuration, statistics, DVFS
• Live periodic IPC trace:
    – See scripts/ipctrace.py for a more complete example

    import sim
    class IPCTracer:
        def setup(self, args):
            sim.util.Every(1*sim.util.Time.US, self.periodic)
            self.instrs_prev = 0
        def periodic(self, t, t_delta):
            freq = sim.dvfs.get_frequency(0)
            cycles = t_delta * freq * 1e-9  # fs * MHz -> cycles
            instrs = long(sim.stats.get('performance_model', 0,
                                        'instruction_count'))
            print 'IPC =', (instrs - self.instrs_prev) / cycles
            self.instrs_prev = instrs
    sim.util.register(IPCTracer())
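The unit conversion in the tracer is easy to get wrong, so here it is as a standalone worked example. `interval_ipc` is a hypothetical helper name and the sample instruction counts are made up; the formula itself is the one used above (femtoseconds times MHz times 1e-9 gives cycles).

```python
def interval_ipc(instrs_now, instrs_prev, t_delta_fs, freq_mhz):
    """IPC over one sampling interval; time in fs, frequency in MHz."""
    cycles = t_delta_fs * freq_mhz * 1e-9   # 1e-15 s * 1e6 Hz = 1e-9 cycles
    return (instrs_now - instrs_prev) / cycles

# Made-up cumulative instruction counts, sampled 1 us (1e9 fs) apart at 2660 MHz
samples = [0, 5320, 7980, 7980]
trace = [interval_ipc(now, prev, 10**9, 2660)
         for prev, now in zip(samples, samples[1:])]
print(['%.2f' % ipc for ipc in trace])
```

Each 1 us interval at 2660 MHz contains 2660 cycles, so the instruction deltas 5320, 2660 and 0 give IPC values of about 2, 1 and 0.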


PYTHON  &  MAGIC  INSTRUCTIONS  

• Communicate information between application and Python script
    – E.g.: simulated hardware performance counters

• Application:

    uint64_t ninstrs = SimUser(0xdeadbeef, SimGetProcId());

• Python script:

    import sim
    class PerfCtr:
        def setup(self, args):
            sim.util.register_command(0xdeadbeef, self.compute)
        def compute(self, arg):
            return sim.stats.get('performance_model', arg,
                                 'instruction_count')
    sim.util.register(PerfCtr())
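The round trip from application to script and back can be modelled in a few lines. `CommandTable` and its `dispatch` method are hypothetical names for illustration only; this is not how Sniper implements command handling, and returning 0 for an unknown command is an assumption of this sketch. The per-core instruction counts are made up.

```python
class CommandTable:
    """Toy stand-in for the simulator side of a user-defined magic command."""
    def __init__(self):
        self.handlers = {}

    def register_command(self, cmd, func):
        self.handlers[cmd] = func

    def dispatch(self, cmd, arg):
        # Models what happens when the application executes the magic
        # instruction: look up the handler and return its result to the caller.
        if cmd not in self.handlers:
            return 0                      # assumption: unknown commands yield 0
        return int(self.handlers[cmd](arg))

# Made-up per-core instruction counts standing in for the statistics lookup
instruction_count = {0: 1500, 1: 900}

table = CommandTable()
table.register_command(0xdeadbeef, lambda core: instruction_count[core])
ninstrs = table.dispatch(0xdeadbeef, 1)   # the application asks about core 1
print(ninstrs)
```

Picking an unusual command number such as 0xdeadbeef reduces the chance that two scripts claim the same command.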


EXAMPLE: REGION-BASED STATISTICS

We want to know the L2 miss rate for a snippet of code

• Application: mark code snippet with Sim[Named]Marker

    void myfunc(int a) {
        SimNamedMarker(1, "myfunc");
        // function body
        SimNamedMarker(2, "myfunc");
    }

• Script: intercept markers and write uniquely named statistics snapshots

    import sim
    class Marker:
        def __init__(self): self.i = 0
        def hook_magic_marker(self, thread, core, a, b, s):
            sim.stats.write('marker-%s-%d-%d' % (s, a, self.i))
            if a == 2: self.i += 1
    sim.util.register(Marker())
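To check which snapshot names the script produces, the same marker logic can be run against a stand-in statistics writer. `FakeStats` is a hypothetical class that only records names, and the value 0 passed for the `b` argument is a placeholder, since the slide does not say what a named marker passes there.

```python
class FakeStats:
    """Records snapshot names instead of writing simulator statistics."""
    def __init__(self):
        self.snapshots = []

    def write(self, name):
        self.snapshots.append(name)

class Marker:
    def __init__(self, stats):
        self.stats = stats
        self.i = 0                                  # iteration counter

    def hook_magic_marker(self, thread, core, a, b, s):
        self.stats.write('marker-%s-%d-%d' % (s, a, self.i))
        if a == 2:                                  # end marker closes the iteration
            self.i += 1

stats = FakeStats()
marker = Marker(stats)
for _ in range(2):                                  # two calls to myfunc()
    marker.hook_magic_marker(0, 0, 1, 0, 'myfunc')  # SimNamedMarker(1, "myfunc")
    marker.hook_magic_marker(0, 0, 2, 0, 'myfunc')  # SimNamedMarker(2, "myfunc")
print(stats.snapshots)
```

The trailing counter makes every begin/end pair unique, so each call to the function can be analysed separately afterwards.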


EXAMPLE: REGION-BASED STATISTICS

• Run simulation

    $ ./run-sniper -c gainestown -s myscript.py -- ./myapp
    $ ./tools/dumpstats.py --list

• Process results

    import sniper_stats, sniper_lib
    stats = sniper_stats.SniperStats(resultsdir = '.')
    niters = max([ int(name.split('-')[-1])
                   for name in stats.get_snapshots()
                   if name.startswith('marker-myfunc-1-') ])
    for i in range(0, niters+1):
        snapshot = sniper_lib.get_results(resultsdir = '.', partial =
            ('marker-myfunc-1-%d' % i, 'marker-myfunc-2-%d' % i))['results']
        accesses = sum(snapshot['L2.loads']) + sum(snapshot['L2.stores'])
        misses = sum(snapshot['L2.load-misses']) \
                 + sum(snapshot['L2.store-misses'])
        print 'Iteration #%d: %.2f%%' % (i, 100. * misses / float(accesses))
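The per-region computation boils down to subtracting the begin snapshot from the end snapshot. The sketch below reproduces that with made-up cumulative counters held in a plain dictionary; it does not read Sniper output, and `region_miss_rates` and the `accesses`/`misses` keys are hypothetical simplifications of the L2 load and store counters.

```python
# Made-up cumulative L2 counters (already summed over cores) at each snapshot
snapshots = {
    'marker-myfunc-1-0': {'accesses': 1000, 'misses': 100},
    'marker-myfunc-2-0': {'accesses': 1400, 'misses': 180},
    'marker-myfunc-1-1': {'accesses': 2000, 'misses': 200},
    'marker-myfunc-2-1': {'accesses': 2500, 'misses': 225},
}

def region_miss_rates(snapshots, name='myfunc'):
    """Miss rate (%) between the begin and end marker of each iteration."""
    prefix = 'marker-%s-1-' % name
    niters = max(int(k.split('-')[-1]) for k in snapshots if k.startswith(prefix))
    rates = []
    for i in range(niters + 1):
        begin = snapshots['marker-%s-1-%d' % (name, i)]
        end = snapshots['marker-%s-2-%d' % (name, i)]
        accesses = end['accesses'] - begin['accesses']
        misses = end['misses'] - begin['misses']
        rates.append(100.0 * misses / accesses if accesses else 0.0)
    return rates

rates = region_miss_rates(snapshots)
for i, rate in enumerate(rates):
    print('Iteration #%d: %.2f%%' % (i, rate))
```

Working on differences means activity outside the marked region, such as the 600 accesses between the two iterations here, does not affect the result.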


THE SNIPER MULTI-CORE SIMULATOR
THREAD SCHEDULING

KENZO VAN CRAEYNEST, TREVOR E. CARLSON, WIM HEIRMAN AND LIEVEN EECKHOUT

HTTP://WWW.SNIPERSIM.ORG

SUNDAY, SEPTEMBER 22ND, 2013
IISWC 2013, PORTLAND


SCHEDULING THREADS ON HETEROGENEOUS MULTI-CORE PROCESSORS

• Multiple core types
    – Different power/performance trade-offs

[Diagram: big high-performance cores (B B B …), intermediate cores (M M M …), small power-efficient cores (S S S …)]


CONFIGURING  HETEROGENEOUS  CONFIGURATIONS  

• Traditional heterogeneous configuration options
    – Comma-separated configuration parameters supported for most settings
    – (Re)defining core parameters
        --perf_model/core/interval_timer/window_size=16,128,16,128
        --perf_model/core/interval_timer/dispatch_width=2,4,2,4
        …

• Gets complicated fast
• Error prone
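One way to see why hand-written comma-separated lists become error prone is to generate them. The helper below is a sketch, not a Sniper tool: `CORE_TYPES` and `heterogeneous_flags` are hypothetical names, and the parameter values simply echo the example above.

```python
# Hypothetical per-type parameter table; the values echo the example above
CORE_TYPES = {
    'small': {'window_size': 16,  'dispatch_width': 2},
    'big':   {'window_size': 128, 'dispatch_width': 4},
}

def heterogeneous_flags(layout, prefix='perf_model/core/interval_timer'):
    """Expand a per-core list of type names into comma-separated options."""
    params = sorted({p for t in layout for p in CORE_TYPES[t]})
    return ['--%s/%s=%s' % (prefix, p,
                            ','.join(str(CORE_TYPES[t][p]) for t in layout))
            for p in params]

flags = heterogeneous_flags(['small', 'big', 'small', 'big'])
print('\n'.join(flags))
```

Every parameter needs its own list, and every list must use the same core order, so one misplaced entry silently produces a core that matches neither type.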

 


A BETTER WAY: DEFINING HETEROGENEOUS CONFIGURATIONS

[Diagram: one configuration file per core type: big.cfg for big high-performance cores (B B B …), medium.cfg for intermediate cores (M M M …), small.cfg for small power-efficient cores (S S S …)]

    ./run-sniper -c big,small,big,small,medium -- /bin/ls


USING  HETEROGENEITY  INSIDE  SNIPER  

• Example: hardware scheduling
    – Need to know what core types there are
        • and which are which
    – Option 1: hardcoded
        • core_id 1 is a big core, core_id is a small core, …
        • hard to maintain, difficult to port for different experiments
    – Option 2: identify core type by examining core parameters
        • Sim()->getCfg()->getIntArray("perf_model/core/interval_timer/window_size", coreId)
        • Not flexible


TAGS: A CONVENIENT WAY TO USE HETEROGENEITY INSIDE SNIPER

• Defining tags
    – Option 1: manually assign tags to cores (or any other object)
        • --tags/core/big=0,1,0,1
        • --tags/core/small=1,0,1,0
    – Option 2: use heterogeneous .cfg files
        • Add "-c big.cfg,small,big,small" to run-sniper
        • Automatically creates tags and heterogeneous configuration

• Using tags
    • Sim()->getTagsManager()->hasTag("core", coreId, "small");
    • Sim()->getTagsManager()->hasTag("core", coreId, "big");
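The tag lookup can be pictured as a table indexed by object type, tag name and object index. The following Python model is a sketch for illustration only: `parse_tags` and `has_tag` are hypothetical helpers that mimic the command-line form and the hasTag query shown above, not the TagsManager implementation.

```python
def parse_tags(options):
    """Turn 'tags/<object>/<tag>=0,1,...' options into {(object, tag): [bool, ...]}."""
    tags = {}
    for opt in options:
        key, values = opt.lstrip('-').split('=')
        _, obj, tag = key.split('/')
        tags[(obj, tag)] = [v == '1' for v in values.split(',')]
    return tags

def has_tag(tags, obj, index, tag):
    """Mimics a hasTag(object, index, tag) query on the parsed table."""
    flags = tags.get((obj, tag), [])
    return index < len(flags) and flags[index]

tags = parse_tags(['--tags/core/big=0,1,0,1', '--tags/core/small=1,0,1,0'])
big_cores = [c for c in range(4) if has_tag(tags, 'core', c, 'big')]
small_cores = [c for c in range(4) if has_tag(tags, 'core', c, 'small')]
print(big_cores, small_cores)
```

A scheduler written against tags keeps working when the core layout changes, because it never refers to a core id or a microarchitectural parameter directly.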


THREAD SCHEDULING

• Thread scheduling support added in Sniper 4.0
    – Fully implemented inside the simulator, no application modification or re-linking needed

• Base scheduling infrastructure supports:
    – Affinity mask for each thread (allowed cores)
        • Initialized at thread start based on scheduling policy
        • Updated by sched_setaffinity system calls
        • Updated by scheduling policy, user code
    – Pre-emptive round-robin scheduling
        • Cores pick a new thread when time quantum expires, or when running thread goes to sleep
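How affinity masks and round-robin selection interact can be shown with a small queue model. This is a simplified sketch under stated assumptions, not Sniper's scheduler: `pick_next` is a hypothetical function, there is a single shared run queue, and the thread and core numbers are made up.

```python
from collections import deque

def pick_next(core, run_queue, affinity):
    """Return the first queued thread allowed on `core`, or None."""
    for _ in range(len(run_queue)):
        thread = run_queue.popleft()
        if core in affinity[thread]:
            return thread
        run_queue.append(thread)       # not allowed on this core: keep it queued
    return None

# Made-up example: three threads, thread 2 may only run on core 1
affinity = {0: {0, 1}, 1: {0, 1}, 2: {1}}
run_queue = deque([0, 1, 2])
schedule = []
for quantum in range(4):               # core 0 over four consecutive time quanta
    thread = pick_next(0, run_queue, affinity)
    schedule.append(thread)
    if thread is not None:
        run_queue.append(thread)       # quantum expired: back of the queue
print(schedule)
```

Thread 2 never runs on core 0 but stays queued, so it is picked up as soon as core 1 asks for work.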


AVAILABLE SCHEDULERS

• Pinned (scheduler/type=pinned) [default]
    – Threads are assigned to a single core, round-robin
    – No migration
    – Optional core mask for disabled cores:
        --scheduler/pinned/core_mask=1,0,1,1,0,0,0,1

• Roaming (scheduler/type=roaming)
    – Initial affinity: all cores (with optional mask)
    – Threads freely migrate to idle cores

• Static [deprecated]
    – Fixed assignment, no migration
    – No oversubscription (#threads <= #cores)
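A pinned assignment with a core mask can be sketched as follows. `pinned_assignment` is a hypothetical function, and the sketch assumes that a mask value of 1 marks a core as usable and 0 as disabled; the slide does not spell out the polarity.

```python
def pinned_assignment(num_threads, core_mask):
    """Assign threads round-robin to the cores enabled in core_mask."""
    # Assumption: 1 means the core may be used, 0 means it is disabled.
    enabled = [core for core, on in enumerate(core_mask) if on]
    if not enabled:
        raise ValueError('core mask disables every core')
    return {thread: enabled[thread % len(enabled)] for thread in range(num_threads)}

mask = [int(x) for x in '1,0,1,1,0,0,0,1'.split(',')]
assignment = pinned_assignment(6, mask)
print(assignment)
```

With more threads than enabled cores the assignment wraps around, so several threads share one core and never migrate away from it.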


CUSTOM  THREAD  SCHEDULING  

• When: periodically, in response to hooks (thread create, stall, resume, exit)

• How – hard way: moveThread()
    – Immediately moves a thread (but watch out for safe pre-emption points, deadlocks, manual oversubscription, …)

• How – easy way: manipulating affinity masks
    – Threads will be moved as soon as it is safe, automatic oversubscription (round-robin, quantum-based)


BIGSMALL  SCHEDULER  

• Provided as a demo/starting point
    – Uses TagsManager to identify core types
        • Recommended because of flexibility

• Randomly reschedule threads between (multiple) big and (multiple) small cores
    – By sorting cores based on random values
    – Sets affinity mask of threads to either "all small cores" or "all big cores"
    – Round-robin scheduling within each core pool
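The policy can be approximated in a few lines of Python. This is a loose sketch, not the BigSmall scheduler's code: `bigsmall_reschedule` is a hypothetical name, and attaching the random value to each thread (rather than to each core) is an assumption made to keep the example short.

```python
import random

def bigsmall_reschedule(threads, big_cores, small_cores, rng):
    """Give each thread an affinity mask of either all big or all small cores."""
    # Sort threads by a fresh random value; the first len(big_cores) go big.
    order = sorted(threads, key=lambda _: rng.random())
    affinity = {}
    for rank, thread in enumerate(order):
        pool = big_cores if rank < len(big_cores) else small_cores
        affinity[thread] = set(pool)
    return affinity

rng = random.Random(42)                # seeded only to make the sketch repeatable
affinity = bigsmall_reschedule([0, 1, 2, 3, 4, 5], [1, 3], [0, 2], rng)
on_big = sorted(t for t, mask in affinity.items() if mask == {1, 3})
print(on_big)
```

Only the masks are changed here; which core inside a pool actually runs a thread is left to the round-robin scheduling within that pool.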


MEMORY  INTENSITY  BASED  SCHEDULER  

• Threads that spend most of their time waiting for memory get scheduled on small core types [Kumar et al., 2010]
    – Common practice to achieve energy-efficiency

• Starting from BigSmall scheduler
    – Use memory cycles in the core timing models
    – Register as a per-thread statistic
    – Use instead of random value

• Other straightforward options:
    – IPC [Becchi and Crowley, 2008], predicted slowdown [Van Craeynest et al., 2012]
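Replacing the random value with a memory-intensity metric changes only the sort key. The sketch below illustrates that idea with made-up numbers; `memory_intensity_affinity` is a hypothetical function and the per-thread memory-stall fractions are invented, not measured.

```python
def memory_intensity_affinity(mem_fraction, big_cores, small_cores):
    """Threads with the lowest memory-stall fraction get the big-core pool."""
    order = sorted(mem_fraction, key=lambda t: mem_fraction[t])
    affinity = {}
    for rank, thread in enumerate(order):
        pool = big_cores if rank < len(big_cores) else small_cores
        affinity[thread] = set(pool)
    return affinity

# Made-up per-thread statistic: fraction of cycles spent waiting for memory
mem_fraction = {0: 0.70, 1: 0.10, 2: 0.45, 3: 0.05}
affinity = memory_intensity_affinity(mem_fraction, big_cores=[1, 3], small_cores=[0, 2])
print(affinity)
```

Swapping in IPC or a predicted slowdown would again only change the dictionary that is passed in and, possibly, the sort direction.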


SCHEDULING: PER-THREAD STATISTICS

• Most statistics are collected per core (instruction count, cache misses, etc.)

• Support for per-thread statistics
    – Linked to per-core statistic
    – Updated automatically when thread moves
    – Access per-thread statistic
        Sim()->getThreadStatsManager()->getThreadStatistic(thread_id, ThreadStatsManager::INSTRUCTIONS)
    – Or register your own statistic
        SchedulerDynamic::ThreadStats::ThreadStats
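The link between a per-core counter and a per-thread statistic can be modelled as "bank the accumulated amount on every move". The toy class below illustrates that bookkeeping; it is not ThreadStatsManager, and the class name, method names and counter values are all hypothetical.

```python
class ThreadStats:
    """Toy model: per-thread totals derived from per-core counters."""
    def __init__(self, core_counters):
        self.core_counters = core_counters   # core id -> running counter
        self.location = {}                   # thread id -> (core id, counter at arrival)
        self.total = {}                      # thread id -> count banked on earlier cores

    def on_thread_start(self, thread, core):
        self.total[thread] = 0
        self.location[thread] = (core, self.core_counters[core])

    def on_thread_move(self, thread, new_core):
        self.total[thread] = self.get(thread)          # bank what was accumulated
        self.location[thread] = (new_core, self.core_counters[new_core])

    def get(self, thread):
        core, start = self.location[thread]
        return self.total[thread] + self.core_counters[core] - start

cores = {0: 0, 1: 500}                # made-up per-core instruction counters
stats = ThreadStats(cores)
stats.on_thread_start(7, 0)
cores[0] += 300                       # thread 7 executes 300 instructions on core 0
stats.on_thread_move(7, 1)
cores[1] += 200                       # then 200 more on core 1
print(stats.get(7))
```

The 500 instructions already counted on core 1 before the thread arrived are excluded, which is the point of recording the counter value at arrival.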


THE SNIPER MULTI-CORE SIMULATOR
HANDS-ON DEMO

TREVOR E. CARLSON, WIM HEIRMAN, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT

HTTP://WWW.SNIPERSIM.ORG

SUNDAY, SEPTEMBER 22ND, 2013
IISWC 2013, PORTLAND


SNIPER  DEMO  

• Downloading
• Compiling
• Running a demo application
• Evaluating Performance
    – CPI Stacks

• Configuration and Run-time Modifications
    – Configuration files
    – Python scripting
    – ROI markers and Magic instructions


REFERENCES  

• Sniper website
    – http://snipersim.org/

• Download
    – http://snipersim.org/w/Download
    – http://snipersim.org/w/Download_Benchmarks

• Getting started
    – http://snipersim.org/w/Getting_Started

• Questions?
    – http://groups.google.com/group/snipersim
    – http://snipersim.org/w/Frequently_Asked_Questions


THE SNIPER MULTI-CORE SIMULATOR

WIM HEIRMAN, TREVOR E. CARLSON, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT

HTTP://WWW.SNIPERSIM.ORG

SUNDAY, SEPTEMBER 22ND, 2013
IISWC 2013, PORTLAND