EC2, MapReduce, and Distributed Processing
Jonathan Dahl
(and Rail Spikes, Slantwise, Zencoder, etc.)
distributed processing /dis·trib·ut·ed pro·cess·ing/ noun Refers to any of a variety of computer systems that use more than one computer, or processor, to run an application. This includes parallel processing, in which a single computer uses more than one CPU to execute programs. More often, however, distributed processing refers to local-area networks (LANs) designed so that a single program can run simultaneously at various sites.

asynchronous processing /a·syn·chro·nous pro·cess·ing/ noun Computations that run independently of each other, without requiring constant synchronization.

parallel processing /par·al·lel pro·cess·ing/ noun Simultaneous computation of a single problem or system running across separate CPU cores.

distributed processing /dis·trib·ut·ed pro·cess·ing/ noun Just like parallel processing, but utilizing separate full systems, not just separate CPU cores.
[Map: You ... Me]

Rails DB
Message Queue
Transcoder 1 / Transcoder 2 / Transcoder 3
1. Poll queue
2. Get job
3. Result
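The poll-queue/get-job/post-result loop in the diagram can be simulated in-process with Ruby's thread-safe Queue. This is a sketch only; the job names and the "transcoded" result are stand-ins, not anything from the talk.

```ruby
require "thread" # thread-safe Queue (built in on modern Rubies)

jobs = Queue.new
3.times { |i| jobs << "video-#{i}" } # pretend these arrived on the message queue
results = []

workers = 2.times.map do
  Thread.new do
    loop do
      job = (jobs.pop(true) rescue break) # 1-2. poll the queue and take a job
      results << "#{job}: transcoded"     # 3. post the result (the Rails DB in the diagram)
    end
  end
end
workers.each(&:join)
# results now holds one entry per job
```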
Roadmap:
I. Functional Programming
II. MapReduce
III. EC2
IV. Distributed Processing
I. Functional Programming
ƒ(x) vs. i++;
ƒ(x) = 2x + 1
ƒ(person) = first name + last name
lambda {|x| x*2 + 1 }
lambda do |user|
  "#{user.firstname} #{user.lastname}"
end
ƒ(users) = ∑ of logins for each user
users.sum { |user| user.number_of_logins }
var total_logins = 0;
for (var i = 0; i < users.length; i++) {
  total_logins += number_of_logins(users[i]);
}
users.sum(&:number_of_logins)
users.each {}
result = Array.new
users.each {|user| result << user.email }
result
reduce
reduce == inject == fold
reduce(list, function, init)
(1..10)
["a", "b", "c", "d"]
[#<User id: 19>, #<User id: 43>]
reduce(list, function, init)
ƒ(x, y) = x + y
ƒ(x, y) = x << y if y > 0
ƒ(x, y) = x << y.upcase
reduce(list, function, init)
lambda {|result, i| result + i}
lambda do |result, i|
  result << i if i > 0
end
lambda {|r, i| r << i.upcase }
reduce(list, function, init)
0
[]
Hash.new("")
list.reduce(init) {}
(1..10).reduce(0) do |r, x|
  r + x
end
# 55
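Under the hood, reduce is just a loop that threads an accumulator through the function. A minimal hand-rolled version (my_reduce is a name made up here, not from the talk):

```ruby
# A hand-rolled reduce: feed each element and the running result to the
# block, starting from init, and return the final accumulator.
def my_reduce(list, init)
  result = init
  list.each { |x| result = yield(result, x) }
  result
end

my_reduce((1..10), 0) { |r, x| r + x } # same answer as the built-in: 55
```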
reduce / inject / fold
list -> value
|result, x|
map(list, function)
(1..10)
["a", "b", "c", "d"]
[#<User id: 19>, #<User id: 43>]
map(list, function)
lambda { |x| x + 1 }
lambda { |x| x.upcase }
lambda { |x| x.nil? }
list.map {}
(1..10).map { |x| x > 5 }
# [false, false, false, false, false, true, true, true, true, true]
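map and reduce are close cousins: map can itself be written as a reduce whose accumulator is a new array. A small sketch (map_via_reduce is a made-up name):

```ruby
# map expressed as a reduce: accumulate a new array of transformed elements.
def map_via_reduce(list)
  list.reduce([]) { |acc, x| acc << yield(x) }
end

map_via_reduce(1..10) { |x| x > 5 } # same as (1..10).map { |x| x > 5 }
```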
["a", "b", "c"] => ["A", "B", "C"]
User.all => ["david", "stanley", "anna"]
(1..5).map {|x| x * x}
1 * 1, 2 * 2, 3 * 3, 4 * 4, 5 * 5
parallelizable!
(1..5).reduce(1) { |i, x| i * x }
map: parallelizable
reduce: not (?)
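Why map parallelizes: each element's computation is independent, so even a naive thread-per-element version gives the same answer in the same order. A sketch, not production code (parallel_map is a made-up name):

```ruby
# One thread per element; Thread#value joins the thread and returns the
# block's result, so the output order matches the input order.
def parallel_map(list)
  list.map { |x| Thread.new { yield(x) } }.map(&:value)
end

parallel_map(1..5) { |x| x * x } # => [1, 4, 9, 16, 25]
```

The reduce on the previous slide has no such luxury: each step consumes the previous step's accumulator, so the calls form a chain.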
II. MapReduce
MapReduce != map + reduce
MAP a problem across several servers
REDUCE the results of each server to a single result set
list.map {|i| i.function }
(group)
results.reduce {|final, i| final[i.key] = i.function }
key -> value
1. Initial data

(1..10).map { |x| }
(1..10).map.with_index { |x, i| }

• GFS chunk identifier
• Book page number
• Web URL
• Arbitrary group ID
2. Intermediate data

Map server 1:
'key1' -> 6.8
'key2' -> 6.9
'key3' -> 8.1

Map server 2:
'key1' -> 6.2
'key4' -> 5.5

3. Final data

Reduce results:
'key1' -> 6.5
'key2' -> 6.9
'key3' -> 8.1
'key4' -> 5.5
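The reduce step above averages each key's intermediate values (key1's 6.8 and 6.2 become 6.5). A sketch in Ruby, with the intermediate data hard-coded from the slide:

```ruby
# Intermediate data, already grouped by key across the two map servers.
intermediate = {
  :key1 => [6.8, 6.2],
  :key2 => [6.9],
  :key3 => [8.1],
  :key4 => [5.5],
}

# Reduce: average each key's values; each key is handled independently.
averages = intermediate.map do |key, values|
  [key, values.inject(0.0) { |r, x| r + x } / values.size]
end
# key1 averages to 6.5; single-value keys pass through unchanged
```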
another view
• Stage in between ‘map’ and ‘reduce’
• All mappers must finish before reduce
• Prepare intermediate results
• (Group results by key)
Parallel reduce?
ƒ(key1), ƒ(key3), ƒ(key4)
ƒ(key2), ƒ(key5)
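Because each key's reduce is independent once the grouping is done, the keys themselves can be dealt out across several reducers, as in the split above. A sketch of hash partitioning (partition_keys is a made-up name; real systems typically hash-partition the same way):

```ruby
# Assign each key to one of N reducers by hashing the key.
def partition_keys(keys, reducer_count)
  keys.group_by { |key| key.to_s.hash.abs % reducer_count }
end

partition_keys([:key1, :key2, :key3, :key4, :key5], 2)
# e.g. { 0 => [:key1, :key4], 1 => [:key2, :key3, :key5] } -- exact split varies
```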
Example
chunky: 12
bacon: 15
book = File.open("wrnpc12.txt", "r").to_a
words = book.join(" ").split(" ")
word_counts = words.inject(Hash.new(0)) do |i, word|
  i[word.downcase.to_sym] += 1
  i
end
words = word_counts.sort { |a, b| b[1] <=> a[1] }

puts words[1]
puts words[100]
puts words[1000]

puts word_counts[:ruby]
puts word_counts[:rails]
+1 second
word_chunks = input_words.each_slice(200).to_a
mapped_words = word_chunks.map do |words|
  distributed_count(words)
end
def distributed_count(words)
  c = words.inject(Hash.new(0)) do |i, word|
    i[word.downcase.to_sym] += 1
    i
  end
  c.sort { |a, b| b[1] <=> a[1] }
end
grouped_words = group(mapped_words)
# :the => [1829, 887, 1523] ...
# :cat => [19, 7, 36, 132] ...
final_results = grouped_words.inject({}) do |result, words|
  result[words.first] = words.last.inject(0) { |r, i| r + i }
  result
end
words = final_results.sort { |a, b| b[1] <=> a[1] }
puts words[1]
puts words[100]
puts words[1000]

puts final_results[:ruby]
puts final_results[:rails]
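The whole pipeline above can be simulated in-process. The sketch below swaps the book for an inline string and uses tiny chunks so the stages are visible; everything else follows the shape on the slides:

```ruby
# 1. Initial data: words, split into chunks (stand-ins for map servers).
text  = "the cat sat on the mat the cat"
words = text.split(" ")
word_chunks = words.each_slice(3).to_a

# 2. Map: each chunk counts its own words independently.
mapped = word_chunks.map do |chunk|
  chunk.inject(Hash.new(0)) { |c, w| c[w.to_sym] += 1; c }
end

# Group: gather each key's partial counts across chunks.
grouped = Hash.new { |h, k| h[k] = [] }
mapped.each { |counts| counts.each { |key, n| grouped[key] << n } }

# 3. Reduce: sum the partial counts per key.
final_counts = grouped.inject({}) do |result, (key, partials)|
  result[key] = partials.inject(0) { |r, i| r + i }
  result
end
# final_counts[:the] == 3, final_counts[:cat] == 2
```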
requirements
1. Fixed problem
2. Mappable problem
3. Distributed reduce
example uses
III. EC2
Why?
Example
1851-1922
4TB
Hadoop + EC2
Hadoop
100 instances
24 hours
$240 (€164)
(100 instances × 24 hours at ~$0.10 per instance-hour)
IV. Three Thoughts
Rails DB
Message Queue
Transcoder 1 / Transcoder 2 / Transcoder 3
1. Poll queue
2. Get job
3. Result
Hadoop
Thanks!
Jonathan Dahl
Slides at Rail Spikes http://railspikes.com
Photo Credits
•Rofi: http://flickr.com/photos/rofi/
•Digital:Slurp http://flickr.com/photos/digitalslurp/
•Others stolen from Google Image search