The Longest Common Substring Problem
![Page 1: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/1.jpg)
The Longest Common Substring Problem
a.k.a. Long Repeat
by Donnie Demuth
![Page 2: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/2.jpg)
Sections
1. MapReduce and Hadoop
2. Map and Reduce
3. Mappers and Reducers
4. Using Tools (Amazon)
5. Conclusions
![Page 3: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/3.jpg)
1. MapReduce and Hadoop
• What is it?
• And how do I get it?
![Page 4: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/4.jpg)
Google MapReduce
• Circa 2003
• Based on Map and Reduce (go figure)
  – and Functional Programming!
• Proprietary
![Page 5: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/5.jpg)
Apache Hadoop
• Circa 2006, released 2009
• Named after a toy elephant
• Seconds, maybe a minute, to install
![Page 6: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/6.jpg)
Installing Hadoop on OSX
• Single-node cluster setup is a piece of cake
• Download the archive (tar.gz)
• Modify conf/hadoop-env.sh:
  – Before: # export JAVA_HOME=/usr/lib/j2sdk1.6-sun
  – After: export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/
• Modify bin/hadoop:
  – Before: JAVA=$JAVA_HOME/bin/java
  – After: JAVA=$JAVA_HOME/Commands/java
• Just run bin/hadoop with arguments
![Page 7: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/7.jpg)
STOP!
• Actually, installing Hadoop wasn’t necessary
• We can write parallel code without it
![Page 8: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/8.jpg)
2. Map and Reduce
• What is it?
  – Quick primer on Functional Programming
    • Higher-Order Functions
    • Alonzo Church (Lambda Calculus)
    • Haskell Curry (Spicy Food)
• How do I use it?
(x ↦ (y ↦ x*x + y*y))(5)(2)
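The expression above is a curried function: a function of x that returns a function of y. A minimal Python sketch of the same idea follows; the names `sum_of_squares` and `add_to_25` are only illustrative and do not come from the slides.

```python
# A curried function: takes x, returns a function that takes y.
sum_of_squares = lambda x: lambda y: x * x + y * y

add_to_25 = sum_of_squares(5)   # partially applied: y -> 25 + y*y
result = add_to_25(2)           # 25 + 4
print(result)                   # 29
```

Applying the arguments one at a time, `sum_of_squares(5)(2)`, mirrors the `(5)(2)` in the slide's notation.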
![Page 9: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/9.jpg)
Code w/ Side-Effects
>>> thing = {'name': 'Donald'}
>>> def change_name(object):
...     object['name'] = 'Donnie'
...
>>> change_name(thing)
>>> thing
{'name': 'Donnie'}
![Page 10: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/10.jpg)
Pure Code, Side-effect Free
>>> thing = {'name': 'Donald'}
>>> def change_name(object):
...     new_obj = {'name': 'Donnie'}
...     # copy any other values
...     return new_obj
...
>>> thing = change_name(thing)
>>> thing
{'name': 'Donnie'}
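The "copy any other values" step can be made explicit. The sketch below is one way to write it in Python 3, assuming a shallow copy is enough; the `role` key and the `new_name` parameter are hypothetical additions for illustration.

```python
def change_name(person, new_name):
    """Return a new dict with the name replaced; the argument is not modified."""
    new_person = dict(person)        # shallow copy of every key/value
    new_person['name'] = new_name
    return new_person

thing = {'name': 'Donald', 'role': 'speaker'}   # 'role' is a made-up extra key
renamed = change_name(thing, 'Donnie')
print(thing)    # {'name': 'Donald', 'role': 'speaker'}
print(renamed)  # {'name': 'Donnie', 'role': 'speaker'}
```

The original dictionary is left untouched, so any other code still holding `thing` sees the same value as before.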
![Page 11: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/11.jpg)
Benefits of Pure Code / FP
• It’s easy to understand
  – Local vars = easy
  – Global vars + side-effects = hard
• It’s easy to parallelize
  – We only care about what we know RIGHT NOW
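A small sketch of why purity helps with parallelism: a pure function depends only on its argument, so the calls can run in any order, or at the same time, and still give the same answer. The thread pool here is only a stand-in for "many workers"; it is an assumption for illustration and not something the slides use.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Pure: the result depends only on the argument.
    return x * x

numbers = list(range(1, 11))

serial = [square(n) for n in numbers]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, numbers))  # results come back in input order

print(serial == parallel)  # True
```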
![Page 12: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/12.jpg)
Map
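[Diagram: f(x) is applied to each input element (1, 2, 3) independently, producing one output per input (1, 4, 6 on the slide)]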
![Page 13: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/13.jpg)
Map in Python
• Use the map(<function>, <list>) built-in
>>> map(lambda x: x*x, range(1,100))
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801]
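The session above is Python 2, where `map` returns a list. In Python 3 `map` returns a lazy iterator, so a sketch of the same call needs `list(...)` to show the values:

```python
squares = map(lambda x: x * x, range(1, 100))  # lazy iterator in Python 3
squares_list = list(squares)                   # materialize the results

print(squares_list[:5])    # [1, 4, 9, 16, 25]
print(squares_list[-1])    # 9801
print(len(squares_list))   # 99
```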
![Page 14: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/14.jpg)
Reduce
[Diagram: starting from the unit 0, f(x, y) combines the running result with each of 1, 2, 3 in turn; the final f(x, y) = 6]
![Page 15: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/15.jpg)
Reduce in Python
• Use the reduce(<function>, <list>, <unit>) built-in
>>> reduce(lambda x, y: x+y, [1,2,3], 0)
6
>>> reduce(lambda x, y: x+y, map(lambda x: x*x, range(1,100)), 0)
328350
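In Python 3 `reduce` is no longer a built-in; it lives in `functools`. A sketch of the same two calls under that assumption:

```python
from functools import reduce

total = reduce(lambda x, y: x + y, [1, 2, 3], 0)
sum_of_squares = reduce(lambda x, y: x + y,
                        map(lambda x: x * x, range(1, 100)),
                        0)
print(total)           # 6
print(sum_of_squares)  # 328350
```

The third argument is the unit (starting value), which is also what comes back when the list is empty.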
![Page 16: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/16.jpg)
3. Mappers and Reducers
• How do I write them?
  – Word Count (Hello World for distributed computing)
  – Longest Repeat
• Show me how to pipe them
![Page 17: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/17.jpg)
Mappers
• Pseudo-Code
  – Take some input
  – Process it
  – And emit a Key-Value pair
![Page 18: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/18.jpg)
Word Count Mapper
• For some input:
  – Donald Demuth Donald Draper
• The output should be:
  – Donald 1
  – Demuth 1
  – Donald 1
  – Draper 1
![Page 19: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/19.jpg)
Word Count Mapper Code
• wordcount/mapper.py
#!/usr/bin/env python
import sys, re

word_re = re.compile('[a-zA-Z]+')
for line in sys.stdin:
    line = line.strip().lower()
    for word in word_re.findall(line):
        print '%s\t%s' % (word, 1)
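The mapper above uses the Python 2 `print` statement. A Python 3 sketch of the same logic is shown below, restructured as a generator so it can be tried without piping anything in; the function name `map_words` is an assumption, not part of the slides.

```python
import re

WORD_RE = re.compile('[a-zA-Z]+')

def map_words(lines):
    """Yield a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in WORD_RE.findall(line.strip().lower()):
            yield word, 1

# In a streaming job, `lines` would be sys.stdin and each pair would be
# printed as 'word<TAB>1'. A list stands in for the input here.
pairs = list(map_words(['Donald Demuth Donald Draper']))
for word, count in pairs:
    print('%s\t%s' % (word, count))
```

Because the line is lowercased first, the emitted keys are `donald`, `demuth`, `donald`, `draper`.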
![Page 20: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/20.jpg)
Reducers
• Dependent on the Mapper’s emissions
• Pseudo-Code for word count
  – Read an emission from the mapper
  – Find the key and the value
  – Store the key in a dictionary with its value
    • But if the key already exists, add the value to the pre-existing value!
  – Emit the dictionary
![Page 21: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/21.jpg)
Word Count Reducer Code
• wordcount/reducer.py
#!/usr/bin/env python
import sys

counts = {}
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    counts[word] = counts.get(word, 0) + count

for word, count in counts.items():
    print '%s\t%s' % (word, count)
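A Python 3 sketch of the same reducer, written as a function over an iterable of tab-separated lines so it can be exercised directly; `reduce_counts` and the sample emissions are illustrative names and data, not from the slides.

```python
def reduce_counts(lines):
    """Sum the counts for each word from 'word<TAB>count' lines."""
    counts = {}
    for line in lines:
        word, count = line.strip().split('\t', 1)
        counts[word] = counts.get(word, 0) + int(count)
    return counts

# Stand-in for the mapper's emissions; a real job would pass sys.stdin.
emissions = ['donald\t1', 'demuth\t1', 'donald\t1', 'draper\t1']
counts = reduce_counts(emissions)
for word, count in sorted(counts.items()):
    print('%s\t%s' % (word, count))
```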
![Page 22: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/22.jpg)
Unix Pipes
• Does this really work??
$ cat books/*.txt | wordcount/mapper.py | wordcount/reducer.py | sort | head
a 10526
ab 3
aback 1
abaft 2
abaht 1
abandon 2
abandoned 10
abandonment 1
abasement 1
abash 1
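The pipe above can be imitated inside one Python 3 process. The sketch below also sorts between the two steps, which is what Hadoop Streaming does with mapper output before it reaches a reducer; that lets the reducer handle one key at a time instead of keeping a dictionary. The function names and sample lines are assumptions for illustration.

```python
import re
from itertools import groupby
from operator import itemgetter

WORD_RE = re.compile('[a-zA-Z]+')

def mapper(lines):
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            yield word, 1

def reducer(sorted_pairs):
    # Assumes pairs arrive grouped by key, as they do after a sort.
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ['Donald Demuth', 'Donald Draper']   # stand-in for the book files
result = list(reducer(sorted(mapper(lines))))
print(result)  # [('demuth', 1), ('donald', 2), ('draper', 1)]
```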
![Page 23: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/23.jpg)
Longest Repeat (LCS)
• Many problems can be solved with a series of Maps and Reduces
• However, a Hadoop Streaming job is a single Map step followed by a single Reduce step
• After much trial and error, my solution involves a pre-processing step
![Page 24: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/24.jpg)
Pre-processing
• fasta_to_line.py: ecoli.fasta → ecoli.fasta.line
• gen_suffixes.py: ecoli.fasta.line → ecoli.fasta.line.0, ecoli.fasta.line.100000, ecoli.fasta.line.200000, ...
[Diagram: the files are labeled 4.6 megs, 4.5 megs, 4.4 megs, 4.3 megs]
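The slides do not show the two pre-processing scripts, so the following Python 3 sketch is a guess based only on the file names: one step flattens a FASTA file into a single line, and the other writes one suffix of that line per fixed offset (0, 100000, 200000, ... in the slides). The function bodies, the header handling, and the toy `step=4` are all assumptions.

```python
def fasta_to_line(fasta_text):
    """Drop FASTA header lines (those starting with '>') and join the rest."""
    return ''.join(line.strip() for line in fasta_text.splitlines()
                   if not line.startswith('>'))

def gen_suffixes(sequence, step):
    """Yield (offset, suffix) pairs, one every `step` characters."""
    for offset in range(0, len(sequence), step):
        yield offset, sequence[offset:]

fasta = '>toy header\nACGTAC\nGTTT\n'        # made-up input
sequence = fasta_to_line(fasta)
suffixes = list(gen_suffixes(sequence, step=4))
print(sequence)   # ACGTACGTTT
print(suffixes)   # [(0, 'ACGTACGTTT'), (4, 'ACGTTT'), (8, 'TT')]
```

Each suffix is shorter than the previous one by `step` characters, which matches the steadily shrinking file sizes on the slide.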
![Page 25: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/25.jpg)
LCS Mapper
• Pseudo-code
  – Read a line from a suffix file
  – Determine the index (first chars)
  – Cycle through the first 100,000 positions
    • Cycle through possible lengths (10 → 3000)
  – Emit the Length (Key) and the Position (Val)
• Emit (-1) and (-1) to STAY ALIVE
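The pseudo-code above leaves out how a repeat is detected. One plausible reading, sketched here in Python 3, is that a candidate substring counts as a repeat when it occurs again later in the same suffix. This is an interpretation and not the author's mapper: the function name, its parameters, emitting only the longest length per position, and leaving out the (-1, -1) keep-alive are all assumptions.

```python
def map_repeats(offset, suffix, window=100000, min_len=10, max_len=3000):
    """Yield (length, position) for the longest repeat starting at each
    of the first `window` positions of `suffix`."""
    for pos in range(min(window, len(suffix))):
        best = 0
        for length in range(min_len, max_len + 1):
            seq = suffix[pos:pos + length]
            # Stop at the end of the text, or once the candidate no longer
            # occurs again later on: longer candidates cannot recur either.
            if len(seq) < length or suffix.find(seq, pos + 1) == -1:
                break
            best = length
        if best:
            yield best, offset + pos   # position in the full sequence

# Toy input with small limits so the result is easy to check by hand.
pairs = list(map_repeats(0, 'abcXabcYabc', window=4, min_len=2, max_len=10))
print(pairs)  # [(3, 0), (2, 1)]
```

Adding `offset` to `pos` turns a position inside the suffix back into a position in the full sequence.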
![Page 26: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/26.jpg)
LCS Reducer
• Pseudo-Code
  – Simple
  – Find the largest KEY emitted by any mapper
  – Display it
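A Python 3 sketch of this reducer follows. It assumes the mapper emits `length<TAB>position` lines, borrowing the tab-separated convention from the word count example; the function name and the sample emissions are made up.

```python
def reduce_longest(lines):
    """Return the (length, position) pair with the largest length."""
    best = (-1, -1)
    for line in lines:
        length, position = (int(field) for field in line.split('\t', 1))
        if length > best[0]:
            best = (length, position)
    return best

emissions = ['12\t40', '-1\t-1', '63\t128', '30\t7']   # made-up mapper output
length, position = reduce_longest(emissions)
print('length(%d) pos(%d)' % (length, position))   # length(63) pos(128)
```

Starting from (-1, -1) means keep-alive emissions never win over a real repeat.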
![Page 27: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/27.jpg)
LCS w/ Murmur.txt
$ cat murmur.txt.line.0 | lcs/mapper.py | lcs/reducer.py
length(63) pos(128)
$ python
>>> text = open('murmur.txt.line').read()
>>> text[128:128+63]
'Dance the cha chaOr the can canShake your pom pomTo Duran Duran'
>>> seq = text[128:128+63]
>>> text.index(seq)
128
>>> text[129:].index(seq) + 129
1777
>>> text[128:128+63] == text[1777:1777+63]
True
>>> text[1777:1777+63]
'Dance the cha chaOr the can canShake your pom pomTo Duran Duran'
![Page 28: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/28.jpg)
4. Using Tools, Amazon
• Harness the power of many machines at once
  – Easy to use 20
• Need to sign up for:
  – Amazon Elastic MapReduce (EMR)
  – Amazon Elastic Compute Cloud (EC2)
  – Amazon Simple Storage Service (S3)
  – Amazon SimpleDB
![Page 29: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/29.jpg)
Deploying Data/Code
• First you’ll need to upload it to S3
  – Create a new bucket (or global folder) named ecoli-lcs
  – Create a new path named input, ecoli-lcs/input
  – Upload all of the generated suffixes to the input folder
  – Upload mapper.py and reducer.py to ecoli-lcs
![Page 30: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/30.jpg)
Creating a Job (Flow)
![Page 31: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/31.jpg)
Creating a Job Flow (…)
![Page 32: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/32.jpg)
RESULTS!
• Need to download the output
$ cd output
$ cat * | sort
(...)
length(2815) pos(4166641)
$ python
>>> text = open('ecoli.fasta.line').read()
>>> seq = text[4166641:4166641+2815]
>>> text.index(seq)
4166641
>>> text[4166642:].index(seq) + 4166642
4208043
>>> text[4166641:4166641+2815] == text[4208043:4208043+2815]
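The check in the session above can be wrapped in a small helper. This Python 3 sketch uses a toy string in place of the genome; `second_occurrence` is a hypothetical name.

```python
def second_occurrence(text, pos, length):
    """Return the start of the next occurrence of text[pos:pos+length]
    after `pos`, or -1 if there is none."""
    return text.find(text[pos:pos + length], pos + 1)

text = 'xxGATTACAyyGATTACAzz'   # toy stand-in for the genome
other = second_occurrence(text, 2, 7)
print(other)                                   # 11
print(text[2:2 + 7] == text[other:other + 7])  # True
```

Using `str.find` with a start index avoids building the `text[pos+1:]` slice and adding the offset back by hand.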
![Page 33: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/33.jpg)
5. Conclusions
• Costs
  – It’s about 3 cents an hour for a “medium” VM
  – One run took 840 instance hours (20+ actual)
    • Approx. $25
  – Used about 2000 instance hours in total
• Hadoop Streaming is EASY
  – Though it requires many (easy) tools
  – But costly if you have “bugs”
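The cost figures can be checked with a little arithmetic. The sketch below takes the slide's "about 3 cents an hour" as an exact rate, which is an assumption; the total for 2000 instance hours is derived here and is not stated on the slide.

```python
CENTS_PER_INSTANCE_HOUR = 3   # "about 3 cents an hour" from the slide

def cost_in_dollars(instance_hours):
    return instance_hours * CENTS_PER_INSTANCE_HOUR / 100

one_run = cost_in_dollars(840)    # the single run quoted on the slide
total = cost_in_dollars(2000)     # all runs, derived from the same rate
print(one_run)  # 25.2
print(total)    # 60.0
```

The first figure agrees with the slide's "Approx. $25".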
![Page 34: The Longest Common Substring Problem](https://reader036.fdocuments.net/reader036/viewer/2022062301/5681385e550346895da00ddd/html5/thumbnails/34.jpg)
A Better Solution?
• Jeff Parker’s program used the following approach:
  – Cycle through the sequence and find all repeats of a given size
  – Emit the location
  – Increase the size and use the previously known locations to find larger matches
• Looks good for MapReduce (Core)
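One way to read the approach above, sketched in Python 3: group positions by a short seed substring, keep only the seeds that repeat, then grow the match one character at a time while looking only at positions already known to repeat. This is an interpretation of the bullet points, not Jeff Parker's actual program; `longest_repeat` and `seed_len` are hypothetical names.

```python
from collections import defaultdict

def longest_repeat(text, seed_len):
    """Return (length, positions) for a longest substring occurring at
    least twice, or (0, []) if nothing of length `seed_len` repeats."""
    # Step 1: find all repeats of the seed size and record their locations.
    groups = defaultdict(list)
    for pos in range(len(text) - seed_len + 1):
        groups[text[pos:pos + seed_len]].append(pos)
    repeats = [ps for ps in groups.values() if len(ps) > 1]
    if not repeats:
        return 0, []

    length = seed_len
    while True:
        # Step 2: extend by one character, using only the known locations.
        longer = []
        for positions in repeats:
            by_next_char = defaultdict(list)
            for pos in positions:
                if pos + length < len(text):
                    by_next_char[text[pos + length]].append(pos)
            longer.extend(ps for ps in by_next_char.values() if len(ps) > 1)
        if not longer:
            return length, repeats[0]
        repeats, length = longer, length + 1

print(longest_repeat('abcXabcYabc', seed_len=2))  # (3, [0, 4, 8])
```

Each round only touches positions that survived the previous round, so the work shrinks quickly as the length grows.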