High-throughput sequence alignment using Graphics Processing Units Michael C Schatz, Cole Trapnell,...

High-throughput sequence alignment using Graphics Processing Units Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney UMD Presented by Steve Rumble

High-throughput sequence alignment using Graphics Processing Units

Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney

UMD

Presented by Steve Rumble

Motivation

NGS technologies produce a ton of data
AB SOLiD: 22e6 25-mers
Others are even worse…

How does 200e6 50-mers sound?

Algorithms have been pushed hard, but typically assume the same workstation CPU

Wozniak and others showed Smith-Waterman (S-W) could be well parallelised on special H/W. What of other algorithms/hardware?

Motivation

GPUs have recently evolved general purpose programmability (GPGPU)

E.g.: nVidia 8800 GTX
16 multiprocessors
8 processors each => 128 stream processors
768MB onboard
1.35GHz clock
Almost a year old now…

Short GPU Overview

Highly parallel execution (hundreds of simultaneous operations)

Hundreds of gigaflops per chip!
Large on-board memories (up to 2GB)

Limitations:
No recursion (no stacks)
Each multiprocessor’s constituent processors execute same instruction
Thread divergence due to conditionals hurts…
No direct host memory access
Small caches (locality is key)
High memory latency
No dynamic memory allocation (why one would ever do that, I don’t know)
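Why thread divergence hurts can be shown with a toy cost model. This is only a sketch in plain Python, not GPU code: it assumes a group of threads that must execute the same instruction, so a branch on which the threads disagree runs both paths one after the other. The function name and the cycle counts are made up for illustration.

```python
def lockstep_cost(conditions, cost_if_true, cost_if_false):
    """Cycles a lockstep thread group spends on one if/else.

    conditions: one boolean per thread in the group.
    If every thread agrees, only that path runs. Otherwise both paths
    run back to back, with the non-participating threads idle.
    """
    cost = 0
    if any(conditions):
        cost += cost_if_true
    if not all(conditions):
        cost += cost_if_false
    return cost


# Eight threads agree on the branch: only the cheap path is paid for.
uniform = lockstep_cost([True] * 8, cost_if_true=10, cost_if_false=30)

# One thread out of eight disagrees: the whole group pays for both paths.
divergent = lockstep_cost([True] * 7 + [False], cost_if_true=10, cost_if_false=30)

print(uniform, divergent)
```

A tree walk is full of data-dependent branches, which is one reason the fit with the GPU is not obvious.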

Short GPU Overview

GPGPU environments

Previously had to reduce problems to graphics primitives… no more

Simplified C-like programming
Paper has very little detail, but they make it sound enticingly simple…

Each processor runs the same ‘kernel’

Muh-muh-muh… MUMmer!

Maximal Unique Match

Find longest match for each subsequence of a read (of reasonable length)

Employs Suffix Trees

MUMmerGPU

Plug-and-play replacement for MUMmer
MUMmer is not ‘arithmetic intensive’
Is the GPU a good fit?

Six-step process
1) Build Suffix Tree of reference genome (Ukkonen’s alg. – O(n)) on host CPU
2) Suffix Tree -> GPU Memory
3) Queries -> GPU Memory
4) Kick off the GPU…
5) Results -> Host Memory
6) Final processing on Host CPU

Suffix Trees

We want to find the longest substring of a query string that occurs in the reference, quickly

Suffix Trees permit O(m) string search, m = query length

Space complexity is O(n), n = reference length
But constants are apparently pretty big

Suffix Trees

Definition:
Edges have an edge label
A substring of S
Non-empty (but can be just the terminating character)

A path label is the sequence formed by traversing from root to leaf

1-1 correspondence of suffixes of S to path labels

Internal nodes have at least 2 children

n leaf nodes – one for each suffix of S
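The definition can be checked on a small string. The sketch below is a naive O(n^2) construction that inserts one suffix at a time. It is not Ukkonen's algorithm and not the MUMmer code, and the `Node` class and function names are invented for this example. It assumes the string ends in a unique terminator such as `$`, and it counts the terminator as part of the string, so TORONTO$ has n = 8 suffixes.

```python
class Node:
    def __init__(self, start=None):
        self.children = {}  # first char of edge label -> (edge label, child Node)
        self.start = start  # suffix start position for leaves, None otherwise


def build_suffix_tree(s):
    """Naive O(n^2) build; s must end with a character that occurs nowhere else."""
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                node.children[first] = (rest, Node(start=i))
                break
            label, child = node.children[first]
            # Length of the common prefix of the edge label and the remaining suffix.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                node, rest = child, rest[k:]  # edge fully matched, keep descending
            else:
                # Split the edge at the mismatch and attach the new leaf there.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                mid.children[rest[k]] = (rest[k:], Node(start=i))
                node.children[first] = (label[:k], mid)
                break
    return root


def walk(node):
    yield node
    for _, child in node.children.values():
        yield from walk(child)


tree = build_suffix_tree("TORONTO$")
nodes = list(walk(tree))
leaves = sorted(n.start for n in nodes if not n.children)
internal = [n for n in nodes if n.children and n is not tree]

print(leaves)                                      # one leaf per suffix
print(all(len(n.children) >= 2 for n in internal))  # every internal node branches
print(len(nodes), "nodes for n =", len(leaves))
```

For this string the tree has 8 leaves, 2 internal nodes and the root, 11 nodes in total, which is comfortably linear in n.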

Suffix Trees

O(n) space
n leaf nodes => at most n – 1 internal nodes
=> n + (n – 1) + 1 = 2n nodes (worst case)

n = 3
n – 1 = 2
3 + 2 + root = 6 nodes

Suffix Trees

Example: TORONTO$
‘$’ is terminating character

[Figure: suffix tree for TORONTO$; edges are labelled with substrings, leaves with suffix start positions 0 to 6]

Suffix Trees

Example: TORONTO$
Searching for ‘ONT’

[Figure: suffix tree for TORONTO$, start of the search animation]

Suffix Trees

Example: TORONTO$ Searching for ‘ONT’

[Figure: suffix tree for TORONTO$]

‘ONT’ at position 3 in S
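The search itself is a walk from the root that consumes one query character per step. To keep the sketch short it uses an uncompressed suffix trie (one node per character) instead of a real suffix tree, so it shows the O(m) walk but not the O(n) space. The dictionary layout and function names are assumptions made for this example.

```python
def build_suffix_trie(s):
    """Uncompressed suffix trie. Each node remembers the start of the
    leftmost suffix that passes through it."""
    root = {"next": {}, "pos": None}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            # setdefault keeps an existing node, so "pos" stays the leftmost start.
            node = node["next"].setdefault(ch, {"next": {}, "pos": i})
    return root


def find(trie, query):
    """Return the leftmost position of query in the indexed string, or None."""
    node = trie
    for ch in query:  # one step per query character: O(m)
        node = node["next"].get(ch)
        if node is None:
            return None
    return node["pos"]


trie = build_suffix_trie("TORONTO$")
print(find(trie, "ONT"))  # 3
print(find(trie, "OT"))   # None
```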

Suffix Trees

MUMmer wants to find all maximal unique matches for all suffixes:
E.g., for query ACCGTGCGTC, we want:

ACCGTGCGTC
CCGTGCGTC
CGTGCGTC
GTGCGTC
…
Up to some reasonable limit…

Don’t want to go back to root of tree each time…
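A rough sketch of the per-suffix matching, and of why restarting from the root is wasteful. Python's substring test stands in for the tree walk, so this shows the bookkeeping rather than the real data structure, and it ignores the uniqueness requirement. The key observation: if the query suffix at position i matches L characters, the suffix at i + 1 already matches at least L – 1, so matching can resume from there. The function name and the `min_len` cut-off are hypothetical.

```python
def longest_matches(reference, query):
    """ms[i] = length of the longest prefix of query[i:] found in reference."""
    ms = []
    length = 0
    for i in range(len(query)):
        # Dropping the first character of a match still leaves a match,
        # so resume from length - 1 instead of starting again at zero.
        length = max(length - 1, 0)
        while i + length < len(query) and query[i:i + length + 1] in reference:
            length += 1
        ms.append(length)
    return ms


ms = longest_matches("TORONTO", "ONTARIO")
min_len = 2  # hypothetical "reasonable limit" on match length
hits = [(i, n) for i, n in enumerate(ms) if n >= min_len]
print(ms)
print(hits)
```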

Suffix Trees

Suffix Links
All internal, non-root nodes have a suffix link to another node
If x is a single character and a is a (possibly empty) string (substring), then the node v whose path-label is xa has a suffix link to the node v’ whose path-label is a.

Got that?
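A small check may help. Reading the definition as "the node with path-label xa links to the node with path-label a" (drop the first character), the sketch below lists the internal-node path labels of BANANA$ and confirms that every link target exists. It avoids building the tree by using a known property: a substring labels an internal node exactly when it is followed by at least two different characters in the text. The function name is invented for this example.

```python
def internal_path_labels(s):
    """Path labels of the internal nodes of the suffix tree of s ('' is the root)."""
    followers = {}
    for i in range(len(s)):
        for j in range(i, len(s)):
            # s[i:j] occurs at position i and is followed by s[j].
            followers.setdefault(s[i:j], set()).add(s[j])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


labels = internal_path_labels("BANANA$")

# Suffix link: drop the first character of the path label.
links = {w: w[1:] for w in labels if w}

print(sorted(labels))
for source, target in sorted(links.items()):
    print(repr(source), "->", repr(target))
```

For BANANA$ the links are ANA -> NA, NA -> A and A -> root, so after matching ANA the search can hop to NA without returning to the root.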

Suffix Trees

Example: TORONTO$
Suffix Links… Don’t backtrack (bad ex.)

[Figure: suffix tree for TORONTO$ with suffix links]

Suffix Trees

Example: BANANA$
Better example of Suffix Links

[Figure: suffix tree for BANANA$ with suffix links; leaves are labelled with suffix start positions 0 to 5]

Suffix Trees

Example: BANANA$
Searching for suffixes of ‘ANANA’

[Figure: suffix tree for BANANA$; animation of the search following suffix links]

Memory Limitations

Suffix trees take up a fair bit of memory

GPUs have 100’s of MBs, but this is still small

Divide the target sequence into ‘k’ segments with overlaps
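A minimal sketch of the splitting step. The slide does not say how large the overlap is. The assumption here is that an overlap of (longest query length – 1) characters is enough for any match that straddles a boundary to fit inside one segment. Function and parameter names are illustrative.

```python
def split_with_overlap(reference, k, overlap):
    """Split reference into k segments. Each segment runs `overlap`
    characters past the start of the next one (clipped at the end).
    Returns (start offset, segment text) pairs."""
    n = len(reference)
    segments = []
    for j in range(k):
        start = j * n // k
        end = min((j + 1) * n // k + overlap, n)
        segments.append((start, reference[start:end]))
    return segments


reference = "ABCDEFGHIJ"
max_query_len = 3
segments = split_with_overlap(reference, k=3, overlap=max_query_len - 1)
print(segments)

# Every window of max_query_len characters lies wholly inside some segment.
covered = all(
    any(reference[i:i + max_query_len] in text for _, text in segments)
    for i in range(len(reference) - max_query_len + 1)
)
print(covered)
```

The start offsets are kept so that positions reported per segment can be translated back to coordinates in the full reference.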

Cache Optimisation

Memory latency high, cache performance crucial
We’re walking a tree here, not crunching numbers down an array

Can store read-only data in 2D textures; nVidia caching scheme optimises access

Re-order and squish tree nodes into ‘texel blocks’ such that:

Nodes near root are level-ordered (BFS)
Nodes further down are ordered with descendants
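One way to read this ordering, sketched on a toy tree. The paper's exact layout is not reproduced here. Assumptions: "level-ordered" means breadth-first down to a chosen depth, and "ordered with descendants" means each deeper subtree is stored contiguously in pre-order. The `bfs_depth` parameter and the dictionary tree representation are made up for the example, and the mapping of this 1D order onto 2D texel blocks is left out.

```python
from collections import deque


def layout_order(children, root, bfs_depth):
    """Return node ids in storage order: level order down to bfs_depth,
    then every deeper subtree as one contiguous pre-order run."""
    order, frontier = [], []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        order.append(node)
        for child in children.get(node, []):
            if depth + 1 <= bfs_depth:
                queue.append((child, depth + 1))
            else:
                frontier.append(child)  # root of a subtree stored contiguously
    for sub_root in frontier:
        stack = [sub_root]
        while stack:
            node = stack.pop()
            order.append(node)
            # Reverse so the first child is popped (and stored) first.
            stack.extend(reversed(children.get(node, [])))
    return order


# Complete binary tree with nodes 1..15.
children = {i: [2 * i, 2 * i + 1] for i in range(1, 8)}
print(layout_order(children, root=1, bfs_depth=1))
```

Near the root every query touches the same few nodes, so packing them level by level keeps them hot in cache. Further down, a walk stays inside one subtree, so keeping a subtree together is the better bet.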

Cache Optimisation

[Figure: tree with numbered nodes]
1
2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

[Figure: numbered grid of texture cells]
0 2 4 6 8 10 12 14
1 3 5 7 9 11 13 15
16 18 20 22 24 26 28 30
17 19 21 23 25 27 29 31

• Texture cache organized in 2x2 blocks.
• Try to place all children of a node in the same cache block

Shamelessly cribbed from: http://www.cbcb.umd.edu/software/cmatch/FastExactStringMatching.ppt

Cache Optimisation

Reference Sequence stored in 4 x 2^16 blocks of a 2D array
Sequence: A B C D E F G H …

……….
A E
B F
C G
D H
……….
α Φ
β Χ
Γ Ψ
Δ Ω

Why? It worked well.
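The index arithmetic implied by the A to H picture can be sketched as follows. Assumptions: each block is 4 rows tall and filled column by column, the block width defaults to 2^16 as read from the slide, and consecutive blocks are stacked below one another. None of these details are confirmed by the slide beyond the picture itself, and the function name is invented.

```python
def texel_coords(i, rows=4, width=2**16):
    """Map linear sequence index i to (row, column) in the 2D array."""
    block, offset = divmod(i, rows * width)  # which block, and where inside it
    col, row = divmod(offset, rows)          # fill each column top to bottom
    return block * rows + row, col


# A..H land in two adjacent columns of the first block.
for i, ch in enumerate("ABCDEFGH"):
    print(ch, texel_coords(i))
```

With this layout a run of four consecutive bases sits in one column of neighbouring texels, rather than being strung out along a single very long row.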

Cache Optimisation

Memory layouts heuristically determined
nVidia cache details not public

Cache optimisation improves execution speed ‘by several fold’.

Conclusions

GPGPU isn’t just good for ‘arithmetic intensive’ applications

5-11x speed-up for NGS data

Conclusions

Fine Print:
5-11x is for the Suffix Tree kernel on the GPU
Reality is different!
3.5x speed-up for real data in terms of total application runtime.
Pretty constant across read lengths (35-700+ bp)

Careful management of memory layout is crucial

Authors claim several-fold performance increase (could be difference between some improvement and none)

Conclusions

Runtime dominated by serial parts of MUMmer
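A back-of-the-envelope Amdahl's law reading of the 5-11x kernel figure against the 3.5x total. This is only an illustration: it assumes both figures describe the same runs and lumps transfer overhead in with the serial part, neither of which the slides confirm. The function name is invented.

```python
def accelerated_fraction(total_speedup, kernel_speedup):
    """Fraction p of the original runtime that was accelerated, from
    Amdahl's law: 1 / total = (1 - p) + p / kernel."""
    return (1 - 1 / total_speedup) / (1 - 1 / kernel_speedup)


for kernel in (5, 11):
    p = accelerated_fraction(3.5, kernel)
    ceiling = 1 / (1 - p)  # best total speed-up if the kernel took no time at all
    print(f"kernel {kernel}x: accelerated fraction {p:.2f}, "
          f"serial remainder {1 - p:.2f}, ceiling {ceiling:.1f}x")
```

Under these assumptions between roughly a tenth and a fifth of the original runtime is untouched by the GPU, which caps the total speed-up no matter how fast the kernel gets.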

Food for Thought

8800 GTX costs ~$400, uses 100-150 watts

Quad Core 2 chip runs ~$250, uses 100-130 watts

Each core approx. 2x faster than their test CPU

MUMmerGPU maximally 3.5x faster than test CPU

What have we won here?
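The arithmetic behind the question, using only the numbers on this slide. The big assumption is that reads can be split evenly across the four CPU cores with no loss, which the slide implies but does not demonstrate. Variable names are illustrative.

```python
# Throughput relative to the authors' test CPU (= 1.0).
gpu = {"speedup": 3.5, "usd": 400}
quad = {"speedup": 4 * 2.0, "usd": 250}  # 4 cores, each about 2x the test CPU


def per_dollar(part):
    return part["speedup"] / part["usd"]


print(f"GPU : {gpu['speedup']:.1f}x total, {per_dollar(gpu):.4f}x per dollar")
print(f"Quad: {quad['speedup']:.1f}x total, {per_dollar(quad):.4f}x per dollar")
print(f"Quad / GPU per dollar: {per_dollar(quad) / per_dollar(gpu):.1f}")
```

The power ranges on the slide overlap (100-150 W against 100-130 W), so watts do not change the picture much.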

Food for Thought

Confusing reports

“Fast Exact String Matching on the GPU” (Schatz, Trapnell) claims up to 35x improvement

Earlier course paper (early/mid-2007)

Why from 35x down to 5-11x with MUMmerGPU?

My Impressions…

(…whatever they’re worth)

GPU is not a clear win (in this case)

Suffix trees seem unsuited:
Cache locality trouble
O(n) footprint, but multiplicative constants are still substantial

Host CPUs seem to be as good or better (in $ and watts)

My Impressions…

GPGPUs aren’t a great fit here

At least for this algorithm…

MUMmerGPU isn’t the order-of-magnitude win it claims to be

But this is a first-generation, general-purpose chip geared toward number-crunching, not pointer-traversing

I don’t think we’ve seen the last (nor the best) of GPUs…