TIDeFlow: A dataflow-inspired execution model for High ... A DATAFLOW-INSPIRED EXECUTION MODEL FOR...




  • TIDEFLOW: A DATAFLOW-INSPIRED EXECUTION MODEL FOR

    HIGH PERFORMANCE COMPUTING PROGRAMS

    by

    Daniel A. Orozco

    A dissertation submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering

    Spring 2012

    © 2012 Daniel A. Orozco. All Rights Reserved

  • TIDEFLOW: A DATAFLOW-INSPIRED EXECUTION MODEL FOR

    HIGH PERFORMANCE COMPUTING PROGRAMS

    by

    Daniel A. Orozco

    Approved: Kenneth E. Barner, Ph.D., Chair of the Department of Electrical and Computer Engineering

    Approved: Babatunde A. Ogunnaike, Ph.D., Interim Dean of the College of Engineering

    Approved: Charles G. Riordan, Ph.D., Vice Provost for Graduate and Professional Education

  • I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

    Signed: Guang R. Gao, Ph.D., Professor in charge of dissertation

    I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

    Signed: Xiaoming Li, Ph.D., Member of dissertation committee

    I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

    Signed: Chengmo Yang, Ph.D., Member of dissertation committee

    I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

    Signed: Michela Taufer, Ph.D., Member of dissertation committee

  • ACKNOWLEDGEMENTS

    To my parents, because they showed me that there was no limit to what I could achieve.


  • TABLE OF CONTENTS

    LIST OF TABLES
    LIST OF FIGURES
    ABSTRACT

    Chapter

    1 INTRODUCTION

    2 BACKGROUND

        2.1 Computer Architectures
            2.1.1 Serial Processor Systems
            2.1.2 Shared Memory Systems
            2.1.3 Distributed Memory Systems
            2.1.4 Multicore Systems
            2.1.5 Manycore Systems
                2.1.5.1 Tilera's Processors
                2.1.5.2 Sun UltraSPARC T2
                2.1.5.3 IBM Cyclops64

        2.2 Previous Models for Execution, Concurrency and Programming
            2.2.1 Dataflow
            2.2.2 Petri Nets
            2.2.3 EARTH
            2.2.4 ParalleX
            2.2.5 Swarm
            2.2.6 POSIX Threads
            2.2.7 OpenMP
            2.2.8 MPI
            2.2.9 X10

            2.2.10 Cilk
            2.2.11 Intel's Concurrent Collections
            2.2.12 StreamIt
            2.2.13 Intel's Thread Building Blocks
            2.2.14 Other Approaches

    3 THE TIDEFLOW PROGRAM EXECUTION MODEL

        3.1 An Overview of the TIDeFlow Model
        3.2 Actors
            3.2.1 Actors Represent Parallel Loops
            3.2.2 Execution of an Actor
            3.2.3 Signals Generated by Actors
            3.2.4 Actor States
            3.2.5 Actor Finite State Machine
            3.2.6 Examples
                3.2.6.1 A Hello World Example
                3.2.6.2 Iterations and Time Instances
                3.2.6.3 Termination Signals

        3.3 Arcs and Tokens
            3.3.1 Arcs Represent Dependencies Between Parallel Loops
            3.3.2 Representing Outer Loop Carried Dependencies
            3.3.3 Examples
                3.3.3.1 Overlapping Communication and Computation
                3.3.3.2 Using Outer Loop Carried Dependencies
                3.3.3.3 Expressing Pipelining Through Backedges
                3.3.3.4 A Matrix Multiplication Kernel
                3.3.3.5 A Program Where Actors Execute Only Once

        3.4 Composability
        3.5 Task Pipelining
        3.6 Memory Model

    4 A THROUGHPUT ANALYSIS TO SUPPORT PARALLEL EXECUTION IN MANYCORE PROCESSORS

        4.1 The Importance of Throughput in Parallel Programs
        4.2 Queueing Theory and its Relationship to Throughput

        4.3 Techniques to Increase the Throughput of Parallel Operations
        4.4 Examples
            4.4.1 Throughput of a Test and Set Lock
            4.4.2 Throughput of a Parallel Count Operation
                4.4.2.1 Using Locks
                4.4.2.2 Using Compare-and-Swap
                4.4.2.3 Using In-Memory Atomic Increments
            4.4.3 Throughput of Common Queues
                4.4.3.1 Single Lock Queue
                4.4.3.2 MS-Queue
                4.4.3.3 MC-Queue
                4.4.3.4 Experiments
            4.4.4 Simplifying the Representation of Tasks to Increase Queue Throughput

        4.5 Summary

    5 TIDEFLOW IMPLEMENTATION

        5.1 TIDeFlow C Interface
            5.1.1 Initializing the TIDeFlow Runtime System
            5.1.2 Creation of TIDeFlow Programs
                5.1.2.1 Creation of a Program Context
                5.1.2.2 Addition of Actors or Programs to a Context
                5.1.2.3 Addition of Dependencies Between Actors
                5.1.2.4 Providing Static Parameters to Actors
            5.1.3 Running TIDeFlow Programs

        5.2 Intermediate Representation
        5.3 Compilation and Execution of a TIDeFlow Program
        5.4 TIDeFlow Runtime System
        5.5 Parallel Program Traces

    6 TIDEFLOW PROGRAMMING PRACTICES

        6.1 Time Loop
        6.2 Bandwidth Allocation Between Loaders
        6.3 Restraining Execution Speed to Enforce Pipelining

    7 EXAMPLES

        7.1 Matrix Multiplication
        7.2 Reverse Time Migration

    8 CONCLUSIONS

    9 FUTURE WORK

        9.1 Open Questions
        9.2 Improvements to Current Techniques

    BIBLIOGRAPHY

    Appendix

    A COPYRIGHT INFORMATION

        A.1 Permission from IEEE
        A.2 Permission from ACM
        A.3 Permissions from Springer
        A.4 Papers I Own the Copyright to
        A.5 Copy of the Licensing Agreements

  • LIST OF TABLES

    2.1 Cy