Architecting Parallel Software with Patterns


Transcript of Architecting Parallel Software with Patterns

Page 1: Architecting Parallel Software with Patterns, parlab.eecs.berkeley.edu/sites/all/parlab/files/bootcamp... · 2012. 8. 16.

8/16/2012

© Kurt Keutzer 1

Architecting Parallel Software with Patterns

Kurt Keutzer and the PALLAS Group: Michael Anderson, Katya Gonina, Patrick Li, David Sheffield, Jike Chong (Simply Hired), Mark Murphy (Google), Bor-yiing Su (Google), Naryanan Sundaram (Intel-Dubey), Yubei Chen, Forrest Iandola, and Peter Jin. With special thanks to Tim Mattson.

1/66

Assumption #1: How not to develop parallel code

[Flow: Initial Code → Profiler → Performance profile → if not fast enough, re-code with more threads and repeat; if fast enough, ship it]

2/66

Result: lots of failures; N PE's often run slower than 1.

Page 2: Architecting Parallel Software with Patterns

Steiner Tree Construction Time By Routing Each Net in Parallel

Benchmark   Serial   2 Threads   3 Threads   4 Threads   5 Threads   6 Threads
adaptec1    1.68     1.68        1.70        1.69        1.69        1.69
newblue1    1.80     1.80        1.81        1.81        1.81        1.82
newblue2    2.60     2.60        2.62        2.62        2.62        2.61
adaptec2    1.87     1.86        1.87        1.88        1.88        1.88
adaptec3    3.32     3.33        3.34        3.34        3.34        3.34
adaptec4    3.20     3.20        3.21        3.21        3.21        3.21
adaptec5    4.91     4.90        4.92        4.92        4.92        4.92
newblue3    2.54     2.55        2.55        2.55        2.55        2.55
average     1.00     1.0011      1.0044      1.0049      1.0046      1.0046

3/66

Assumption #2: Nor this

[Flow: Initial Code → Super-compiler → Performance profile → if not fast enough, tune the compiler and repeat; if fast enough, ship it]

4/66

30 years of HPC research don't offer much hope.

Page 3: Architecting Parallel Software with Patterns

Automatic parallelization?

Techniques: basic speculative multithreading, software value prediction, enabling optimizations.

Aggressive techniques such as speculative multithreading help, but they are not enough.

[Bar chart: speedup (%) on SPECint benchmarks -- bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr, average -- y-axis 0 to 30%]

Average SPECint speedup of 8% ... will climb to an average of 15% once their system is fully enabled. There are no indications auto-parallelization will radically improve any time soon. Hence, I do not believe auto-parallelization will solve our problems.

5/66

A Cost-Driven Compilation Framework for Speculative Parallelization of Sequential Programs, Zhao-Hui Du, Chu-Cheow Lim, Xiao-Feng Li, Chen Yang, Qingyu Zhao, Tin-Fook Ngai (Intel Corporation), PLDI 2004.

Results are for a simulated dual-core platform configured as a main core and a core for speculative execution.

Assumption #3: This won't help either

[Flow: Code in new cool language → Profiler → Performance profile → if not fast enough, re-code with the cool language and repeat; if fast enough, ship it]

6/66

After 200 parallel languages, where's the light at the end of the tunnel?

Page 4: Architecting Parallel Software with Patterns

Parallel Programming environments in the 90’s

7/66

So What’s the Alternative?

8/66

Page 5: Architecting Parallel Software with Patterns

Principles of SW Design

After 15 years in industry, at one time overseeing the technology of 25 software products, my two best principles to facilitate good software design are:

Use of modularity

Definition  of invariants 

Modularity helps:

Architect: Makes overall design sound and comprehensible

Project manager:

• As a manager I am able to comfortably assign different modules to different developers

• I am also able to use module definitions to track development

Module implementors: As a module implementor I am able to focus on the implementation, optimization, and verification of my module with a minimum of concern about the rest of the design.

9/66

Identify invariants and key computations

Non‐Principles of SW Design

What’s life like without modularity?

Spaghetti  code

Wars over the interpretation of the specification

Waiting on other coders

Wondering why you didn’t touch anything and now your code broke

Hard to verify your code in isolation, and therefore hard to optimize

Hard to parallelize without identifying key computations

Modularity will help us obviate all of these.

10/66

• Parnas, "On the criteria to be used in decomposing systems into modules," CACM, December 1972.

But modularity alone is not enough, because …

Page 6: Architecting Parallel Software with Patterns

Modularity is important ... but ...

Pop quiz: Is software more like a) a building, or b) a factory?

11/66

What computations we do is as important as how we do them ...

[Table: the thirteen computational "dwarves" (Dense Matrix, Sparse Matrix, Spectral (FFT), Monte Carlo, N-Body, Structured Grid, Unstructured Grid, Dynamic Prog., Finite State Mach., Circuits, Graph Algorithms, Graphical Models, Backtrack/B&B) mapped against application areas (Embed, SPEC, DB, Games, ML, HPC, CAD, Health, Image, Speech, Music, Browser)]

12/66

Page 7: Architecting Parallel Software with Patterns

Architecting Parallel Software

We believe the solution to parallel programming is developing a good software architecture

A software architecture is a hierarchical composition of:

Computational patterns – the atoms, the machinery

Structural patterns – the molecular bonds, the layout

This software architecture naturally gives:

Modularity supplied through structural patterns

• Efficient management

• Efficient implementation

• Efficient verification

13/66


Computational efficiency supplied through computational patterns

Outline

Architecting Parallel Software
Structural Patterns
Computational Patterns
Examples
Summary

14/66

Page 8: Architecting Parallel Software with Patterns

Identify the SW Structure

Structural Patterns:
• Pipe-and-Filter
• Agent-and-Repository
• Event-based
• Layered Systems
• Model-view-controller
• Arbitrary Task Graphs
• Puppeteer
• Iterator/BSP
• MapReduce

15/66

These define the structure of our software, but they do not describe what is computed.

Analogy: Layout of Factory Plant

16/66

Page 9: Architecting Parallel Software with Patterns

Identify Key Computations

[Table: the thirteen computational "dwarves" (Dense Matrix, Sparse Matrix, Spectral (FFT), Monte Carlo, N-Body, Structured Grid, Unstructured Grid, Dynamic Prog., Finite State Mach., Circuits, Graph Algorithms, Graphical Models, Backtrack/B&B) mapped against application areas (Embed, SPEC, DB, Games, ML, HPC, CAD, Health, Image, Speech, Music, Browser)]

17/66

Computational patterns describe the key computations but not how they are implemented

Analogy: Machinery of the Factory

18/66

Page 10: Architecting Parallel Software with Patterns

Architecting the Whole Application

19/66

• SW Architecture of Large-Vocabulary Continuous Speech Recognition

• Raises appropriate issues like scheduling, latency, throughput, workflow, resource management, capacity, etc.

Analogous to the design of an entire manufacturing plant

Outline

Architecting Parallel Software
Structural Patterns
  Pipe-and-Filter
  Iterator
  MapReduce
Computational Patterns
Examples
Summary

20/66

Page 11: Architecting Parallel Software with Patterns

Elements of a structural pattern

Components are where the computation happens.
Connectors are where the communication happens.
A configuration is a graph of components (vertices) and connectors (edges). A structural pattern may be described as a family of graphs.

21/66

Inventory of Structural Patterns

• Pipe-and-Filter
• Agent-and-Repository
• Event-based
• Layered Systems
• Model-view-controller
• Arbitrary Task Graphs
• Puppeteer
• Iterator/BSP
• MapReduce

22/66

We build arbitrarily complex software structures out of these nine patterns.

Page 12: Architecting Parallel Software with Patterns

Pattern 1: Pipe-and-Filter

• Filters embody computation: they only see inputs and produce outputs, with no global or shared state.
• Pipes embody communication.
• May have feedback.

[Diagram: Filters 1-7 connected by pipes]

23/66

Examples?

Examples of pipe-and-filter

Almost every large software program has a pipe-and-filter structure at the highest level: a logic optimizer, an image retrieval system, a compiler.

24/66
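A minimal pipe-and-filter sketch, illustrating the point above (the filter names and data here are hypothetical, not from the slides): each filter only sees its input stream and produces an output stream, with no global or shared state, which is what makes the filters independently parallelizable.

```python
# Pipe-and-filter sketch: filters are generators; pipes are composition.

def tokenize(lines):          # filter 1: split raw lines into tokens
    for line in lines:
        for tok in line.split():
            yield tok

def lowercase(tokens):        # filter 2: normalize tokens
    for tok in tokens:
        yield tok.lower()

def drop_short(tokens, n=3):  # filter 3: keep tokens of length >= n
    for tok in tokens:
        if len(tok) >= n:
            yield tok

# The "pipes" are just generator composition; no filter shares state.
pipeline = drop_short(lowercase(tokenize(["Compile THE program", "OK go"])))
print(list(pipeline))  # ['compile', 'the', 'program']
```

Because each stage only reads from its pipe, the stages could be placed on separate threads or processes without changing the filter code.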

Page 13: Architecting Parallel Software with Patterns

Pattern 2: Iterator Pattern

[Flow: initialization condition → iterate (a variety of functions performed asynchronously) → synchronize results of the iteration → exit condition met? if no, iterate again; if yes, done]

25/66

Examples?

Example of Iterator Pattern: Training a Classifier (SVM Training)

[Flow (Iterator structural pattern): iterate -- update surface → identify outlier → all points within acceptable error? if no, iterate again; if yes, done]

26/66
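The iterator pattern above can be sketched in a few lines. This is a hypothetical stand-in for the SVM loop: instead of a QP solver it runs 1-D gradient descent on f(x) = (x - 3)², but the shape is the same -- iterate, synchronize the updated state, check the exit condition.

```python
# Iterator-pattern sketch: repeat a computation until an exit condition
# is met, synchronizing results at the end of each iteration.

def iterate_until_converged(x0, lr=0.1, tol=1e-6, max_iter=10_000):
    x = x0
    for _ in range(max_iter):          # the "iterate" box
        grad = 2.0 * (x - 3.0)         # computation inside the iteration
        step = lr * grad
        x -= step                      # synchronize: update shared state
        if abs(step) < tol:            # exit condition met?
            break
    return x

print(round(iterate_until_converged(0.0), 3))  # 3.0
```

In a parallel implementation the body of the loop (here a single gradient) would itself be a parallel computation, with the synchronization point at the end of each iteration.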

Page 14: Architecting Parallel Software with Patterns

Pattern 3: MapReduce

To us, it means:
• A map stage, where data is mapped onto independent computations
• A reduce stage, where the results of the map stage are summarized (i.e. reduced)

27/66

Examples?

Examples of MapReduce

General structure: map a computation across distributed data sets; reduce the results to find the best/(worst), maxima/(minima).

28/66

Speech recognition: map HMM computation to evaluate word matches; reduce to find the most-likely word sequences.
Support-vector machines (ML): map to evaluate distance from the frontier; reduce to find the greatest outlier from the frontier.
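The SVM example above can be sketched directly as map and reduce stages. The points and the frontier here are made up for illustration: map computes each point's distance from a hypothetical frontier x + y = 0, and reduce finds the greatest outlier.

```python
# MapReduce sketch: the map stage runs independent per-point computations
# (trivially parallel); the reduce stage summarizes them.
from functools import reduce

points = [(1.0, 2.0), (4.0, -1.0), (-2.0, 0.5), (0.0, 6.0)]

def distance_from_frontier(p):
    # distance proportional to |x + y| for the assumed frontier x + y = 0
    return abs(p[0] + p[1])

# Map stage: each point is evaluated independently.
distances = list(map(distance_from_frontier, points))

# Reduce stage: summarize -- here, the maximum (the greatest outlier).
greatest = reduce(max, distances)
print(greatest)  # 6.0
```

Because the map stage has no cross-point dependencies, it parallelizes across however many processing elements are available; only the reduction needs coordination.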

Page 15: Architecting Parallel Software with Patterns

Outline

Architecting Parallel Software
Structural Patterns
Computational Patterns
  Linear Algebra
  Spectral Methods
  Dynamic Programming
Examples
Summary

29/66

Inventory of Key Computations

[Table: the thirteen computational "dwarves" (Dense Matrix, Sparse Matrix, Spectral (FFT), Monte Carlo, N-Body, Structured Grid, Unstructured Grid, Dynamic Prog., Finite State Mach., Circuits, Graph Algorithms, Graphical Models, Backtrack/B&B) mapped against application areas (Embed, SPEC, DB, Games, ML, HPC, CAD, Health, Image, Speech, Music, Browser)]

30/66

We build arbitrarily complex computations out of these thirteen computational patterns

Page 16: Architecting Parallel Software with Patterns

CP1: Linear Algebra

Vector Space: a set closed under +, with identity and inverse elements and scalar multiplication
Linear Map: an operator T on vectors u, v and scalar α s.t. T(u + v) = Tu + Tv and T(αv) = αT(v)
Matrix: an m×n array of numbers representing a linear map from R^n to R^m
Linear Equations: Ax = b
Eigenvalues/vectors: Ax = λx

31/66

Basic Linear Algebra Subroutines (BLAS)

Three "levels", known as BLAS, characterized by the intrinsic ratio q of computation to memory movement:

Level   Example            # mem refs   # flops   q
1       xAXPY: y = y+αx    3n           2n        2/3
2       xGEMV: y = y+Ax    n²           2n²       2
3       xGEMM: C = C+AB    4n²          2n³       n/2

32/66
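The three BLAS levels in the table can be sketched in plain Python to make the operations concrete (illustrative only; real code should call an optimized BLAS, e.g. through numpy):

```python
# Naive reference versions of the three BLAS levels from the table.

def axpy(alpha, x, y):                 # Level 1: y = y + alpha*x (vector)
    return [yi + alpha * xi for xi, yi in zip(x, y)]

def gemv(A, x, y):                     # Level 2: y = y + A*x (matrix-vector)
    return [yi + sum(a * b for a, b in zip(row, x))
            for row, yi in zip(A, y)]

def gemm(A, B, C):                     # Level 3: C = C + A*B (matrix-matrix)
    n = len(A)
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

print(axpy(2.0, [1, 2], [10, 20]))  # [12.0, 24.0]
```

Note how the flop:memory ratio q grows from level 1 to level 3, which is why high-performance code is restructured to spend its time in Level 3 (GEMM) wherever possible.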

Page 17: Architecting Parallel Software with Patterns

Linear Algebra Resources

www.netlib.org/lapack
www.netlib.org/scalapack
www.cs.utk.edu/~dongarra/etemplates
www.netlib.org/templates
gams.nist.gov

33/66

CP2: Spectral Methods Pattern: MRI Reconstruction

"Spectral methods" are a broad class of numerical algorithms for solving PDEs, but notions of spectral analysis (i.e. convenient changes of basis) are important in every application area.

In Magnetic Resonance Imaging (MRI), images are collected in "k-space" -- i.e. an MRI scan produces a Fourier-domain image (MRI scan == DFT).

Fourier and wavelet representations (wavelet transform) are different spectral analyses that expose different properties of images, convenient for solving our problems.

34/66

Page 18: Architecting Parallel SoSo t a eftware with Patteatte srnsparlab.eecs.berkeley.edu/sites/all/parlab/files/bootcamp... · 2012. 8. 16. · A Cost-Driven Compilation Framework for Speculative

8/16/2012

© Kurt Keutzer 18

Spectral Methods Pattern: Fast Transforms

Spectral methods rely on representations of data in "convenient" bases that produce working, computationally feasible algorithms.

Changing a basis is, in general, an O(N²) matrix-vector multiplication. The matrices representing "convenient" bases factor into O(N log N) fast transforms!

35/66
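The O(N²) vs. O(N log N) contrast above can be seen directly in code. This is a textbook radix-2 Cooley-Tukey FFT sketch (illustrative, not production code; input length must be a power of two), checked against the direct DFT:

```python
# Direct O(N^2) change of basis vs. an O(N log N) fast transform.
import cmath

def dft(x):                       # direct matrix-vector product: O(N^2)
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def fft(x):                       # recursive factorization: O(N log N)
    N = len(x)
    if N == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / N) * odd[k] for k in range(N // 2)]
    return [even[k] + tw[k] for k in range(N // 2)] + \
           [even[k] - tw[k] for k in range(N // 2)]

x = [0.0, 1.0, 2.0, 3.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(dft(x), fft(x)))
```

The factorization into even/odd halves is exactly the "matrix factors into fast transforms" statement: the dense N×N DFT matrix decomposes into log N sparse stages.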

Spectral Methods Pattern: Libraries

Fast transform algorithms like the FFT are notoriously difficult to optimize. Luckily, implementations of the FFT exist for every platform, e.g.:

FFTW and SPIRAL: highly successful auto-tuners for the FFT (and other transforms) on PCs and workstations

CUFFT for CUDA on Nvidia GPUs

36/66

Page 19: Architecting Parallel Software with Patterns

CP3: Dynamic Programming

Class of problems for which the optimal solution can be built up from optimal solutions to sub-problems.

Principle of optimality: the optimal cover for a tree consists of a best match at the root of the tree plus the optimal cover for the sub-trees starting at each input of the match.

[Diagram: the best cover for one match uses the best covers for x, y, z; the best cover for another uses the best covers for p, z; choose the least-cost tree cover at the root]

37/66

Mapping a circuit onto a cell library (subject tree)

38/66

Page 20: Architecting Parallel Software with Patterns

Example of Optimal Tree Covering

[Diagram: candidate covers with accumulated costs -- INV: 11 + 2 = 13; AOI21: 4 + 3 = 7; NAND2: 2 + 6 + 3 = 11; NAND2: 3 + 3 = 6; INV: 2; NAND2: 3 -- the least-cost cover is chosen at each node]

39/66

The same tree-covering formulation is used for code generation in compilers.

40/66
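The principle of optimality behind tree covering can be sketched with a toy dynamic program (this is a hypothetical reduction, not the slides' cell mapper): cover a chain of n gates, where each "cell" covers 1 or 2 adjacent gates, and the optimal cost for the whole chain is built from optimal costs of its sub-chains.

```python
# DP sketch: optimal cover of a chain of n gates from optimal sub-covers.
from functools import lru_cache

COST_1 = 2   # assumed cost of a cell covering one gate
COST_2 = 3   # assumed cost of a cell covering two adjacent gates

@lru_cache(maxsize=None)
def best_cover(n):
    if n == 0:
        return 0
    if n == 1:
        return COST_1
    # either cover the last gate alone, or the last two gates together
    return min(best_cover(n - 1) + COST_1,
               best_cover(n - 2) + COST_2)

print(best_cover(5))  # 8  (two 2-gate cells plus one 1-gate cell)
```

Real cell mapping does the same recursion over a subject tree with a library of matches at each node, choosing the least-cost match plus optimal sub-tree covers.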

Page 21: Architecting Parallel Software with Patterns

Speech Recognition

[Figure: a speech-recognition lattice -- observations over time against speech model states (phones such as r, eh, k, ah, g, n, ay, s, p, iy, ch); the same audio admits the interpretations "Recognize speech" and "Wreck a nice beach"]

41/66

Outline

Architecting Parallel Software
Structural Patterns
Computational Patterns
Examples
  Classification using Support Vector Machines
  Speech Recognition

Summary

42/66

Page 22: Architecting Parallel Software with Patterns

Support Vector Machine Classifier

[Diagram: Train Classifier -- feature extraction and choosing examples from labeled training images; Exercise Classifier -- new images are classified, with results refined by user feedback]

43/66

Catanzaro, Sundaram, Keutzer, “Fast SVM Training and Classification on Graphics Processors”, ICML 2008

Feature Extraction

• An image is reduced to a set of low-dimensional feature vectors.

[Diagram: Build Scale-Space Representation → Select Interest Points and Support Regions → Build Descriptors; the stages use the Structured Grid, Dense Linear Algebra, and MapReduce patterns]

44/66

"Image Feature Extraction for Mobile Processors", Mark Murphy, Hong Wang, Kurt Keutzer, IISWC '09

Page 23: Architecting Parallel Software with Patterns

Train Classifier: SVM Training

[Diagram (Iterator pattern): iterate -- Update Optimality Conditions (MapReduce) → Select Working Set, Solve QP (MapReduce)]

45/66

Exercise Classifier: SVM Classification

[Diagram: Test Data and SVs → Compute dot products (Dense Linear Algebra) → Compute kernel values, sum & scale (MapReduce) → Output]

46/66

Page 24: Architecting Parallel Software with Patterns

Support-Vector Machine Mini-Framework

The Support-Vector Machine framework was used to achieve:
• 9-35x speedup for training
• 81-138x speedup for classification
• 2200 downloads since release

47/66

Fast support vector machine training and classification, Catanzaro, Sundaram, Keutzer, International Conference on Machine Learning 2008

Outline

Architecting Parallel Software
Structural Patterns
Computational Patterns
Examples
  CBIR
  Speech Recognition

Summary

48/66

Page 25: Architecting Parallel Software with Patterns

Speech Recognition System Structure

49/66

Inference-engine-based system: used in Sphinx (CMU, USA), HTK (Cambridge, UK), and Julius (CSRC, Japan).

Modular and flexible setup: shown to be effective for Arabic, English, Japanese, and Mandarin.

Architecting Parallel Software

Decompose Tasks: group tasks, order tasks. Decompose Data: data sharing, data access.

50/66

Identify the Software Structure; identify the Key Computations.

Page 26: Architecting Parallel Software with Patterns

Intro

Recognition is a process of graph traversal/dynamic programming: at each time step we need to identify the likely states in the recognition network given the observed acoustic signal.

Inference Engine Architecture

From a set of active states, we want to compute the next set of active states using the probabilities of acoustic symbols and state transitions. What structural pattern is this?

51/66

Iterative Refinement Structural Pattern

One iteration per time step (over a Hidden Markov Model):
Identify the set of probable states in the network given the acoustic signal and the current active state set.
Prune unlikely states.
Repeat.

52/66

Page 27: Architecting Parallel Software with Patterns

Key computation: HMM Inference Algorithm

Finds the most-likely sequence of states that produced the observation.

An instance of: Graphical Models. Implemented with: Dynamic Programming (the Viterbi algorithm).

[Diagram: a trellis of states 1-4 against observations 1-4; legend: x = an observation, s = a state, with P(xt|st), P(st|st-1), m[t-1][st-1], m[t][st], and the Markov condition]

53/66

J. Chong, Y. Yi, A. Faria, N.R. Satish and K. Keutzer, “Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors”, Emerging Applications and Manycore Arch. 2008, pp. 23-35, June 2008
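The Viterbi dynamic program over the trellis can be sketched as follows (a hypothetical toy model with two states and two observation symbols, not the LVCSR engine): m[t][s] holds the best score of any state sequence ending in state s at time t, built from m[t-1].

```python
# Viterbi sketch: dynamic programming over a trellis of states x time.

def viterbi(obs, states, start_p, trans_p, emit_p):
    m = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        m.append({}); back.append({})
        for s in states:
            # best predecessor under transition + emission probabilities
            prev = max(states, key=lambda r: m[t - 1][r] * trans_p[r][s])
            m[t][s] = m[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev
    # trace back the most-likely state sequence
    last = max(states, key=lambda s: m[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ("A", "B")
print(viterbi("xxy", states,
              {"A": 0.6, "B": 0.4},
              {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}},
              {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}))
# ['A', 'A', 'B']
```

The inner per-state maximization at each time step is exactly the part the slides parallelize as a MapReduce over active states.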

Inference Engine in LVCSR: structural pattern to support three phases of inference

0. Gather operands from an irregular data structure into a runtime buffer
1. Perform the observation probability computation: P(xt|st)
2. Perform the graph traversal computation: m[t][st]

54/66

Page 28: Architecting Parallel Software with Patterns

Each Filter is a MapReduce: 0. Gather operands

Gather and coalesce each of the operands for every state st. Coalesced data facilitates the opportunity for SIMD.

55/66

Each Filter is a MapReduce: 1. Observation probability computation

Gaussian Mixture Model probability: the probability that, given this feature frame (e.g. 10 ms), we are in this state/phone: P(xt|st).

56/66

Page 29: Architecting Parallel Software with Patterns

Phase 1: Observation Probability Computation Architecture

Observation probabilities are computed from Gaussian Mixture Models. Each Gaussian probability in each mixture is independent. The probability for one phone state is the sum of all Gaussians times the mixture probability for that state.

57/66 (Dan Klein's CS288, Lecture 9)

Map-Reduce Structural Pattern

Map each mixture probability computation; reduce the result -- accumulate the total probability for that state.

[Diagram: Gaussian Mixture Model for one phone state, with its mixture components]

58/66
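The map-reduce over mixture components described above can be sketched with toy 1-D Gaussians (the weights, means, and variances here are made up for illustration):

```python
# GMM observation probability as map (independent components) + reduce (sum).
import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# (weight, mean, variance) per mixture component -- assumed values
mixture = [(0.5, 0.0, 1.0), (0.3, 2.0, 0.5), (0.2, -1.0, 2.0)]
x = 0.7  # one feature value for this frame

# Map: each component's weighted probability is independent of the others.
weighted = [w * gaussian(x, m, v) for (w, m, v) in mixture]

# Reduce: accumulate the total probability for this phone state.
p_state = sum(weighted)
print(p_state)
```

In the real engine the map runs over many components and many active states at once, which is what makes this phase SIMD- and GPU-friendly.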

Page 30: Architecting Parallel Software with Patterns

Each Filter is a MapReduce: 2. Graph traversal computation

Map the probability computation across distributed data sets -- perform the multiplication below.
Reduce the results to find the most likely states: m[t][st] = max over predecessors.

59/66

LVCSR Software Architecture

[Diagram: Voice Input → Speech Feature Extractor (Pipe-and-Filter) → Inference Engine (Iterative Refinement: Beam Search Iterations, with Active State Computation Steps as MapReduce, built on Graphical Models and Dynamic Programming) → Word Sequence ("I think therefore I am"). The Recognition Network -- Acoustic Model, Pronunciation Model, Language Model -- is a Graphical Model composed with Pipe-and-Filter.]

60/66

Page 31: Architecting Parallel Software with Patterns

Speech Recognition Results

Input: speech audio waveform. Output: recognized word sequences.

Achieved 11x speedup over the sequential version, allowing recognition 3.5x faster than real time.

Our technique is being deployed in a hotline call-center data-analytics company, where it is used to search content, track service quality, and provide early detection of service issues.

61/66

Scalable HMM based Inference Engine in Large Vocabulary Continuous Speech Recognition,Kisun You, Jike Chong, Youngmin Yi, Ekaterina Gonina, Christopher Hughes, Wonyong Sung and Kurt Keutzer, IEEE Signal Processing Magazine, March 2010

Multi-media Speech Recognition

[Diagram: Read Files → Initialize data structures (CPU) → Phase 0 / Phase 1: Compute Observation Probability / Phase 2: Prepare ActiveSet (GPU, CHMM format, fixed beam width) → Iteration Control → Collect Backtrack Info → Save Backtrack Log → Backtrack (Graph Traversal) → Output Results (CHMM scoring format)]

Prof. Dorothea Kolossa (Speech Application Domain Expert, Technische Universität Berlin) extended the audio-only speech recognition framework to enable audio-visual speech recognition (lip reading), achieving a 20x speedup in application performance compared to a sequential version in C++.

62/66

The application framework enabled a Matlab/Java programmer to effectively utilize a highly parallel platform.

Dorothea Kolossa, Jike Chong, Steffen Zeiler, Kurt Keutzer, "Efficient Manycore CHMM Speech Recognition for Audiovisual and Multistream Data", accepted at Interspeech 2010.

Page 32: Architecting Parallel Software with Patterns

Outline

Architecting Parallel Software
Structural Patterns
Computational Patterns
Examples
Summary

63/66

Other Interesting Results

Patterns have helped the PALLAS research group publish papers in a diverse group of leading Computer Science conferences in the last few years:

Interspeech 2009, Interspeech 2010 (2)

IEEE Signal Processing Magazine 2009 

European Conference on Computer Vision 2010

International Conference on Computer Vision 2009

International Parallel and Distributed Processing Symposium 2012, 2011, 2009 (2) 

Workshop on High Performance Computing in Finance at Super Computing 2009

Joint Annual Meeting of the International Society for Magnetic Resonance in Medicine, ISMRM 2010

International Conference on Machine Learning  2008 

IEEE Transactions on Medical Imaging (2012)

64/66

What’s the point?

Computational patterns give a powerful new viewpoint to efficiency programmers:

They enable us to disentangle the big fuzzy ball of yarn of computation
• and add 20 IQ points to our problem solving (as per Alan Kay)

Our pattern language helps you to write good parallel code.

Page 33: Architecting Parallel Software with Patterns

Speedups

Application          Speedup
MRI                  100x
SVM-train            20x
SVM-classify         109x
Contour              130x
Object Recognition   80x
Poselet              20x
Optical Flow         32x
Speech               11x
Value-at-risk        60x
Option Pricing       25x

65/66

Summary

The key to productive and efficient parallel programming is creating a good software architecture -- a hierarchical composition of:

Structural patterns: enforce modularity and expose invariants
• I showed you three -- seven more will be all you need

Computational patterns: provide efficiency, identify the key computations to be parallelized
• I showed you three -- ten more will be all you need

Orchestration of computational and structural patterns creates architectures which greatly facilitate the development of parallel programs:
• I showed you three -- there are many more

66/66

Patterns: http://parlab.eecs.berkeley.edu/wiki/patterns/patterns

PALLAS: http://parlab.eecs.berkeley.edu/research/pallas

Page 34: Architecting Parallel Software with Patterns

Extras

67/66