2014 pycon-talk
![Page 1: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/1.jpg)
Instrument ALL the things: Studying data-intensive workflows in the clowd.
C. Titus Brown
Michigan State University
(See blog post)
![Page 2: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/2.jpg)
A few upfront definitions
Big Data, n: whatever is still inconvenient to compute on.
Data scientist, n: a statistician who lives in San Francisco.
Professor, n: someone who writes grants to fund people who do the work (cf. Fernando Perez).
I am a professor (not a data scientist) who writes grants so that others can do data-intensive biology.
![Page 3: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/3.jpg)
This talk dedicated to Terry Peppers
Titus, I no longer understand what you actually do…
Daddy, what do you do at work!?
![Page 4: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/4.jpg)
I assemble puzzles for a living.
Well, ok, I strategize about solving multi-dimensional puzzles with billions of pieces and no box.
![Page 5: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/5.jpg)
Three bioinformatic strategies in use
• Greedy: “if the piece sorta fits…”
• N²: “Do these two pieces match? How about this next one?”
• The Dutch approach.
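The N² strategy in the list above can be sketched as an all-pairs comparison. This is a toy illustration only: the reads, the `min_len` threshold, and the function names are made up for the example, and real overlap assemblers are far more careful about errors and indexing.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-size:] == b[:size]:
            return size
    return 0


def all_pairs_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads: N * (N - 1) comparisons."""
    comparisons = 0
    found = {}
    for i, j in permutations(range(len(reads)), 2):
        comparisons += 1
        size = overlap(reads[i], reads[j], min_len)
        if size:
            found[(i, j)] = size
    return found, comparisons


# Hypothetical toy reads, not real sequence data.
reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
found, comparisons = all_pairs_overlaps(reads)
print(found)        # which read overlaps which, and by how many bases
print(comparisons)  # grows quadratically with the number of reads
```

With three reads this is six comparisons; with billions of pieces the quadratic growth is what makes the approach impractical.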
![Page 6: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/6.jpg)
The Dutch Solution (De Bruijn assembly)
Find similarities within puzzle pieces
![Page 7: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/7.jpg)
The Dutch Solution
Algorithmically:
• Is linear in time with number of pieces. (Way better than N²!)
• Is linear in memory with volume of data. (This is due to errors in the digitization process.)
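A minimal sketch of the two properties above, assuming the usual k-mer formulation of a De Bruijn graph (nodes are (k-1)-mers, each k-mer is an edge). The reads and the value of `k` are hypothetical; this is not the code of any particular assembler. It makes one pass over the reads, and it shows how a single erroneous base adds up to k new edges, which is why memory keeps growing as more (error-containing) data arrives.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """De Bruijn graph: each k-mer is an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:  # a single pass: time is linear in the number of pieces
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def count_edges(graph):
    """Number of distinct k-mers stored in the graph."""
    return sum(len(targets) for targets in graph.values())


k = 4
clean = ["ATGGCGTACC", "GGCGTACCTT"]   # two overlapping toy reads
with_error = clean + ["ATGGCTTACC"]    # first read again, with one wrong base

print(count_edges(build_graph(clean, k)))
print(count_edges(build_graph(with_error, k)))
```

The error-free duplicate of a read would add nothing to the graph; the read with one substituted base adds four new edges here (one per k-mer covering the bad position).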
![Page 8: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/8.jpg)
Practical memory measurements
[Figure: Velvet memory measurements (Adina Howe); axis label: GB RAM; annotation: about $500 of data]
![Page 9: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/9.jpg)
Our research challenges –
1. It costs only $10k & 1 week to generate enough sequence data that no commodity computer (and few supercomputers) can assemble it.
2. Hundreds -> thousands of such data sets are being generated each year.
![Page 10: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/10.jpg)
Our research challenges –
1. It costs only $10k & 1 week to generate enough sequence data that no commodity computer (and few supercomputers) can assemble it.
2. Hundreds -> thousands of such data sets are being generated each year.
(Solved)
![Page 11: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/11.jpg)
Our research (i) - CS
• Streaming lossy compression approach that discards pieces we’ve seen before.
• Low memory probabilistic data structures. (…see PyCon 2013 talk)
=> RAM now scales better: O(I) where I << N (I is sample dependent but typically I < N/20)
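The two ideas above can be sketched together: stream the reads once, keep approximate k-mer counts in a fixed-size table, and drop any read whose k-mers have mostly been seen already. Everything here is an assumption made for illustration: the Count-Min Sketch stands in for "low memory probabilistic data structure", and the names (`CountMinSketch`, `normalize`), the table sizes, `k`, and `cutoff` are invented. It is a toy, not khmer's implementation.

```python
import hashlib
from statistics import median


class CountMinSketch:
    """Fixed-size approximate counter: may overestimate, never underestimates."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One independent-looking hash per table, derived from the row number.
        for row, table in enumerate(self.tables):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield table, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for table, slot in self._slots(item):
            table[slot] += 1

    def count(self, item):
        return min(table[slot] for table, slot in self._slots(item))


def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=4, cutoff=2):
    """Yield a read only while its median k-mer count is below `cutoff`."""
    counts = CountMinSketch()
    for read in reads:
        if median(counts.count(km) for km in kmers(read, k)) < cutoff:
            for km in kmers(read, k):
                counts.add(km)
            yield read  # novel enough: keep it and remember its k-mers
        # otherwise the read is discarded (this is the lossy part)


# Hypothetical input: one read repeated five times, plus one unrelated read.
reads = ["ATGGCGTACC"] * 5 + ["TTTTCCCCAAAA"]
kept = list(normalize(reads))
print(len(reads), "reads in,", len(kept), "reads kept")
```

Memory is set by the table size rather than by the number of reads, and the kept reads are the ones carrying new information.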
![Page 12: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/12.jpg)
Our research (ii) - approach
• Open source, open data, open science, and reproducible computational research.
  – GitHub
  – Automated testing, CI, & literate reSTing
  – Blogging, Twitter
  – IPython Notebook for data analysis, figures.
• Protocols for assembling in the cloud.
![Page 13: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/13.jpg)
Molgula oculata
Molgula occulta
Molgula oculata
Real solutions, tackling squishy biology!
Elijah Lowe & Billie Swalla
![Page 14: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/14.jpg)
Doing things right => #awesomesauce
![Page 15: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/15.jpg)
Benchmarking strategy
• Rent a bunch of cloud VMs from Amazon and Rackspace.
• Extract commands from tutorials using literate-resting.
• Use ‘sar’ (sysstat pkg) to sample CPU, RAM, and disk I/O.
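The "extract commands from tutorials" step can be sketched roughly as follows. This is not the literate-resting script itself; it only assumes that the tutorials are reStructuredText and that shell commands live in indented literal blocks introduced by a paragraph ending in `::`. The function name `extract_commands` and the sample tutorial text are made up.

```python
def extract_commands(rst_text):
    """Collect the lines of indented literal blocks that follow a '::' paragraph."""
    commands = []
    in_block = False   # currently inside a literal block
    armed = False      # the previous paragraph ended with '::'
    for line in rst_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue   # blank lines neither start nor end a block
        indented = line[0] in " \t"
        if in_block or armed:
            if indented:
                in_block, armed = True, False
                commands.append(stripped)
                continue
            in_block = armed = False   # first unindented line ends the block
        armed = stripped.endswith("::")
    return commands


# Hypothetical tutorial text with placeholder commands.
tutorial = """\
First, do step one::

   echo step-one

Then do the next two steps::

   echo step-two
   echo done

That is all.
"""

for command in extract_commands(tutorial):
    print(command)
```

The extracted lines can then be written to a shell script and run under whatever monitoring is in place, so the tutorial doubles as the benchmark workload.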
![Page 16: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/16.jpg)
Benchmarking output
Data subset; AWS m1.xlarge
![Page 17: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/17.jpg)
Each protocol has many steps
Data subset; AWS m1.xlarge
![Page 18: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/18.jpg)
Most interested in RAM-intensive bit
Data subset; AWS m1.xlarge
![Page 19: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/19.jpg)
Most interested in RAM-intensive bit
Complete data; AWS m1.xlarge
![Page 20: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/20.jpg)
Observation #1: Rackspace is faster
| machine | data disk | working disk | hours | cost |
|---|---|---|---|---|
| rackspace-15gb | 200 GB | 100 GB | 34.9 | $23.70 |
| m2.xlarge | EBS | ephemeral | 44.7 | $18.34 |
| m1.xlarge | EBS | ephemeral | 45.5 | $21.82 |
| m1.xlarge | EBS, max IOPS | ephemeral | 49.1 | $23.56 |
| m1.xlarge | EBS, max IOPS | EBS, max IOPS | 52.5 | $25.20 |
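A quick way to read the table above is to derive the implied hourly rate and the slowdown relative to the fastest run. The numbers are copied from the table; the hourly rates are simply cost divided by hours, not quoted prices.

```python
runs = [
    # (machine, data disk, working disk, hours, cost in USD), as in the table
    ("rackspace-15gb", "200 GB", "100 GB", 34.9, 23.70),
    ("m2.xlarge", "EBS", "ephemeral", 44.7, 18.34),
    ("m1.xlarge", "EBS", "ephemeral", 45.5, 21.82),
    ("m1.xlarge", "EBS, max IOPS", "ephemeral", 49.1, 23.56),
    ("m1.xlarge", "EBS, max IOPS", "EBS, max IOPS", 52.5, 25.20),
]

fastest = min(runs, key=lambda run: run[3])
cheapest = min(runs, key=lambda run: run[4])

for machine, data_disk, work_disk, hours, cost in runs:
    slowdown = hours / fastest[3]
    print(f"{machine:15s} {data_disk:14s} {work_disk:14s} "
          f"{cost / hours:.2f} USD/h  {slowdown:.2f}x the fastest run")

print("fastest: ", fastest[0])
print("cheapest:", cheapest[0])
```

The fastest run and the cheapest run are on different machines, so "which is best" depends on whether wall-clock time or total cost matters more.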
![Page 21: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/21.jpg)
Surprise #1: AWS ephemeral storage is FASTER
| machine | data disk | working disk | hours | cost |
|---|---|---|---|---|
| rackspace-15gb | 200 GB | 100 GB | 34.9 | $23.70 |
| m2.xlarge | EBS | ephemeral | 44.7 | $18.34 |
| m1.xlarge | EBS | ephemeral | 45.5 | $21.82 |
| m1.xlarge | EBS, max IOPS | ephemeral | 49.1 | $23.56 |
| m1.xlarge | EBS, max IOPS | EBS, max IOPS | 52.5 | $25.20 |
![Page 22: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/22.jpg)
Observation #2: NUMA costs
Same task done with varying memory sizes.
![Page 23: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/23.jpg)
Observation #2: NUMA costs
Same task done with varying memory sizes.
![Page 24: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/24.jpg)
Can’t we just use a faster computer?
• Demo data on m1.xlarge: 2789 s
• Demo data on m3.xlarge: 1970 s – 30% faster!
(Why? m3.xlarge has 2x40 GB SSD drives & 40% faster cores.)
Great! Let’s try it out!
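For reference, the "30% faster" figure follows directly from the two demo timings; a small sketch of the arithmetic (variable names are made up):

```python
m1_seconds = 2789   # demo data on m1.xlarge
m3_seconds = 1970   # demo data on m3.xlarge

time_saved = (m1_seconds - m3_seconds) / m1_seconds   # fraction of wall time removed
speedup = m1_seconds / m3_seconds                     # how many times faster

print(f"time saved: {time_saved:.1%}")
print(f"speedup:    {speedup:.2f}x")
```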
![Page 25: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/25.jpg)
Observation #3: multifaceted problem!
• Full data on m1.xlarge: 45.5 h
• Full data on m3.xlarge: out of disk space.
We need about 200 GB to run the full pipeline.
You can have fast disk or lots of disk but not both, for the moment.
![Page 26: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/26.jpg)
Future directions
1. Invest in cache-local data structures and algorithms.
2. Invest in streaming/in-memory approaches.
3. Not clear (to me) that straight code optimization or infrastructure engineering is worthwhile investment.
![Page 27: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/27.jpg)
Frequently Offered Solutions
1. You should like, totally multithread that. (See: McDonald & Brown, POSA)
2. Hadoop will just crush that workload, dude. (Unlikely to be cost-effective.)
3. Have you tried <my proprietary Big Data technology stack>?
(Thatz Not Science)
![Page 28: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/28.jpg)
Optimization vs scaling
• Linear time/memory improvements would not have addressed our core problem. (2 years, 20x improvement, 100x increase in data.)
• Puzzle problem is a graph problem with big data, no locality, small compute. Not friendly.
• We need(ed) to scale our algorithms.
• Can now run on single-chassis, in ~15 GB RAM.
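One way to read the "20x improvement, 100x increase in data" point, assuming resource use grows linearly with data volume (an assumption made here for the arithmetic, in line with the linear-memory behaviour described earlier in the talk):

```python
data_growth = 100        # "100x increase in data" over two years
constant_speedup = 20    # "20x improvement" over the same period

# With linear scaling, a constant-factor gain is divided straight into the growth.
shortfall = data_growth / constant_speedup

print(f"resources needed relative to two years earlier: {shortfall:.0f}x")
```

Under that reading, a 20x constant-factor win still leaves the problem five times harder than it was, which is the argument for changing how the algorithm scales rather than tuning it.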
![Page 29: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/29.jpg)
Optimization vs scaling --
![Page 30: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/30.jpg)
Scaling can be more important!
![Page 31: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/31.jpg)
What are we losing by focusing our engineering on pleasantly parallel problems?
• Hadoop is fundamentally not that interesting.
• Research is about the 100x.
• Scaling new problems, evaluating/creating new data structures and algorithms, etc.
![Page 32: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/32.jpg)
(From my PyCon 2011 talk.)
Theme: Life’s too short to tackle the easy problems – come to academia!
![Page 33: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/33.jpg)
Thanks!
• Leigh Sheneman, for starting the benchmarking project.
• Labbies: Michael R. Crusoe, Luiz Irber, Likit Preeyanon, Camille Scott, and Qingpeng Zhang.
![Page 34: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/34.jpg)
Thanks!
• github.com/ged-lab/
  – khmer: core project
  – khmer-protocols: tutorials/acceptance tests
  – literate-resting: script to pull out code from reST tutorials
• Blog post at: http://ivory.idyll.org/blog/2014-pycon.html
• Michael R. Crusoe, Likit Preeyanon, Camille Scott, and Qingpeng Zhang are here at PyCon.
…note, you can probably afford to buy them off me :)
![Page 35: 2014 pycon-talk](https://reader033.fdocuments.net/reader033/viewer/2022061102/53fb58ca8d7f729c2e8b5742/html5/thumbnails/35.jpg)
Different computational strategies for k-mer counting, revealed!
Khmer-counting paper pipeline; Qingpeng Zhang