Pycon 2011 talk (may not be final, note)

Transcript of Pycon 2011 talk (may not be final, note)

Page 1: Pycon 2011 talk (may not be final, note)
Page 2: Pycon 2011 talk (may not be final, note)

Handling ridiculous amounts of data with probabilistic data structures

C. Titus Brown

Michigan State University

Computer Science / Microbiology

Page 3: Pycon 2011 talk (may not be final, note)

Resources

http://www.slideshare.net/c.titus.brown/

Webinar: http://oreillynet.com/pub/e/1784

Source: github.com/ctb/

N-grams (this talk): khmer-ngram

DNA (the real cheese): khmer

khmer is implemented in C++ with a Python wrapper, which has been awesome for scripting, testing, and general development. (But man, does C++ suck…)

Page 4: Pycon 2011 talk (may not be final, note)

Lincoln Stein

Sequencing capacity is outscaling Moore’s Law.

Page 5: Pycon 2011 talk (may not be final, note)

Hat tip to Narayan Desai / ANL

We don’t have enough resources or people to analyze data.

Page 6: Pycon 2011 talk (may not be final, note)

Data generation vs data analysis

It now costs about $10,000 to generate a 200 GB sequencing data set (DNA) in about a week.

(Think: resequencing human; sequencing expressed genes; sequencing metagenomes, etc.)

…x1000 sequencers

Many useful analyses do not scale linearly in RAM or CPU with the amount of data.

Page 7: Pycon 2011 talk (may not be final, note)

The challenge?

Massive (and increasing) data generation capacity, operating at a boutique level, with algorithms that are wholly incapable of scaling to the data volume.

Note: cloud computing isn’t a solution to a sustained scaling problem!!

(See: Moore’s Law slide)

Page 8: Pycon 2011 talk (may not be final, note)

Awesomeness

Easy stuff like Google Search

Life’s too short to tackle the easy problems – come to academia!

Page 9: Pycon 2011 talk (may not be final, note)

A brief intro to shotgun assembly

It was the best of times, it was the wor

, it was the worst of times, it was the

isdom, it was the age of foolishness

mes, it was the age of wisdom, it was th

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

…but for 2 bn fragments.

Not subdivisible; not easy to distribute; memory intensive.

Page 10: Pycon 2011 talk (may not be final, note)

Define a hash function (word => num)
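A minimal sketch of such a function. The name `hash_word` and the choice of SHA-256 are assumptions made for illustration, not necessarily what khmer-ngram does. A stable digest is used instead of Python's built-in `hash()`, whose value for strings can change between interpreter runs.

```python
import hashlib

def hash_word(word):
    """Map a word to a non-negative integer (hypothetical helper)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    # Keep the first 8 bytes: plenty of range for indexing into a table.
    return int.from_bytes(digest[:8], "big")

# Reduce the number to a slot in a table of a given size.
slot = hash_word("oogaboog") % 1001
print(slot)
```

The same word always maps to the same number, and the modulo step turns that number into a position in a fixed-size table.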

Page 11: Pycon 2011 talk (may not be final, note)
Page 12: Pycon 2011 talk (may not be final, note)
Page 13: Pycon 2011 talk (may not be final, note)
Page 14: Pycon 2011 talk (may not be final, note)

Storing words in a Bloom filter

>>> x = BloomFilter([1001, 1003, 1005])

>>> 'oogaboog' in x

False

>>> x.add('oogaboog')

>>> 'oogaboog' in x

True
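One way a class with this interface could be written, as a sketch only. It assumes the constructor argument is a list of table sizes and that a word occupies slot `hash % size` in each table; `hash_word` is a hypothetical helper, and the real khmer-ngram code may differ.

```python
import hashlib

def hash_word(word):
    """Hypothetical helper: map a word to a non-negative integer."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class BloomFilter:
    """Sketch of a Bloom filter built from several tables of different sizes."""

    def __init__(self, tablesizes):
        # Assumption: one boolean table per requested size.
        self.tables = [(size, [False] * size) for size in tablesizes]

    def add(self, word):
        h = hash_word(word)
        for size, table in self.tables:
            table[h % size] = True

    def __contains__(self, word):
        # Present only if the word's slot is set in every table.
        h = hash_word(word)
        return all(table[h % size] for size, table in self.tables)

x = BloomFilter([1001, 1003, 1005])
before = 'oogaboog' in x
x.add('oogaboog')
after = 'oogaboog' in x
print(before, after)
```

A word that was added is always found. A word that was never added can still be reported as present if other words happen to have set all of its slots, which is the false-positive case.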

Page 15: Pycon 2011 talk (may not be final, note)

Storing words in a Bloom filter

Page 16: Pycon 2011 talk (may not be final, note)

Storing text in a Bloom filter

Page 17: Pycon 2011 talk (may not be final, note)
Page 18: Pycon 2011 talk (may not be final, note)
Page 19: Pycon 2011 talk (may not be final, note)

Storing and retrieving text
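A minimal sketch of the idea, assuming text is stored as overlapping fixed-length words and retrieved by testing every possible next character. The function names, the alphabet, and the word length of 8 are illustrative assumptions. A plain `set` stands in for the store; any object supporting `add` and `in`, such as a Bloom filter, could be swapped in.

```python
import string

def store_text(text, k, store):
    """Add every overlapping k-character window of text to store."""
    for i in range(len(text) - k + 1):
        store.add(text[i:i + k])

def retrieve_text(start, k, store, alphabet):
    """Rebuild text from a starting word by following unambiguous overlaps.

    Assumes k >= 2 and len(start) == k.
    """
    text = start
    seen = {start}
    while True:
        suffix = text[-(k - 1):]
        candidates = [suffix + ch for ch in alphabet if suffix + ch in store]
        # Stop at a dead end, at a branch, or when the walk loops back.
        if len(candidates) != 1 or candidates[0] in seen:
            return text
        seen.add(candidates[0])
        text += candidates[0][-1]

ALPHABET = string.ascii_lowercase + " ,"
store = set()
store_text("it was the best of times", 8, store)
recovered = retrieve_text("it was t", 8, store, ALPHABET)
print(recovered)
```

The walk only needs membership queries, never a list of stored words, which is why a Bloom filter is enough. It stops as soon as more than one extension is possible, which is what happens when the text contains repeats at least k-1 characters long.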

Page 20: Pycon 2011 talk (may not be final, note)

Sequence assembly

Page 21: Pycon 2011 talk (may not be final, note)

Repetitive strings are the devil

Page 22: Pycon 2011 talk (may not be final, note)

Note, it’s a probabilistic data structure

Page 23: Pycon 2011 talk (may not be final, note)

Assembling DNA sequence

• Can’t directly assemble with Bloom filter approach (false connections, and also lacking many convenient graph properties)

• But we can use the data structure to grok graph properties and eliminate/break up data:

– Eliminate small graphs (no false negatives!)

– Disconnected partitions (parts -> map reduce)

– Local graph complexity reduction & error/artifact trimming

…and then feed into other programs.

This is a data reducing prefilter
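The "eliminate small graphs" step can be sketched as a bounded breadth-first walk over the implicit k-mer graph. This is an illustration under stated assumptions, not khmer's implementation: the function names are hypothetical, a `set` stands in for the Bloom filter, and reverse complements are ignored.

```python
from collections import deque

def kmers(seq, k):
    """All overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def neighbors(kmer, store, alphabet="ACGT"):
    """Yield stored k-mers that overlap kmer by k-1 characters on either side."""
    for ch in alphabet:
        for cand in (kmer[1:] + ch, ch + kmer[:-1]):
            if cand != kmer and cand in store:
                yield cand

def component_at_least(start, store, threshold):
    """Return True if start's connected component holds >= threshold k-mers."""
    seen = {start}
    queue = deque([start])
    while queue:
        if len(seen) >= threshold:
            return True  # stop early: work is bounded by the threshold
        for nxt in neighbors(queue.popleft(), store):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False  # ran out of graph before reaching the threshold

store = set(kmers("ACGTTGCATGGAC", 4)) | set(kmers("TTTTAAAA", 4))
keep_big = component_at_least("ACGT", store, 8)
keep_small = component_at_least("TTTT", store, 8)
print(keep_big, keep_small)
```

With a Bloom filter in place of the set, false positives can only add connections, never remove them. A genuinely large component is therefore never mistaken for a small one, although a small one may occasionally be kept.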

Page 24: Pycon 2011 talk (may not be final, note)

Right, but does it work??

• Can assemble ~200 GB of metagenome DNA on a single 4xlarge EC2 node (68 GB of RAM) in 1 week ($500).

…compare with not at all on a 512 GB RAM machine.

• Error/repeat trimming on a tricky worm genome: reduction from

– 170 GB resident / 60 hrs, to

– 54 GB resident / 13 hrs

Page 25: Pycon 2011 talk (may not be final, note)

How good is this graph representation?

• V. low false positive rates at ~2 bytes/k-mer;

– Nearly exact human genome graph in ~5 GB.

– Estimate we eventually need to store/traverse 50 billion k-mers (soil metagenome)

• Good failure mode: it’s all connected, Jim! (No loss of connections => good prefilter)

• Did I mention it’s constant memory? And independent of word size?

• …only works for de Bruijn graphs
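For a rough sense of the numbers above, the textbook false-positive estimate for a classic Bloom filter can be evaluated at 16 bits (2 bytes) per k-mer. This is a generic formula under the usual independence assumptions, not a measurement of khmer, whose table layout may behave differently; the function names are hypothetical.

```python
import math

def bloom_false_positive_rate(bits_per_item, num_hashes=None):
    """Textbook estimate: (1 - exp(-h / b)) ** h for h hashes and b bits per item."""
    if num_hashes is None:
        # Near-optimal number of hash functions for this density.
        num_hashes = max(1, round(bits_per_item * math.log(2)))
    return (1.0 - math.exp(-num_hashes / bits_per_item)) ** num_hashes

def memory_gb(num_items, bytes_per_item=2):
    """Memory in GB (10**9 bytes) at a fixed per-item cost."""
    return num_items * bytes_per_item / 1e9

rate = bloom_false_positive_rate(16)
soil_gb = memory_gb(50e9)  # 50 billion k-mers at 2 bytes each
print(rate, soil_gb)
```

Under these assumptions the estimated rate falls well below one in a thousand, and 50 billion k-mers at 2 bytes each come to 100 GB.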

Page 26: Pycon 2011 talk (may not be final, note)

Thoughts for the future

• Unless your algorithm scales sub-linearly as you distribute it across multiple nodes (hah!), or your problem size has an upper bound, cloud computing isn’t a long-term solution in bioinformatics

• Synopsis data structures & algorithms (which incl. probabilistic data structures) are a neat approach to parsing problem structure.

• Scalable in-memory local graph exploration enables many other tricks, including near-optimal multinode graph distribution.

Page 27: Pycon 2011 talk (may not be final, note)

Groxel view of knot-like region / Arend Hintze

Page 28: Pycon 2011 talk (may not be final, note)

Acknowledgements:

The k-mer gang:

• Adina Howe

• Jason Pell

• Rosangela Canino-Koning

• Qingpeng Zhang

• Arend Hintze

Collaborators:

• Jim Tiedje (Il padrino)

• Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI)

• Charles Ofria (MSU)

Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.