[537] Google Infrastructure
GFS
Goal: present a global FS that stores data across many machines. Needs to handle 100s of TBs.
Contrast: NFS only exports a local FS on one machine to other machines.
Google published details in 2003.
Open source implementation: Hadoop FS (HDFS)
Failure: NFS Comparison
NFS only recovers from temporary failure.
- not permanent disk/server failure
- recovery means making the reboot invisible
- technique: retry (stateless and idempotent protocol helps)
GFS needs to handle permanent failure.
- techniques: replication and failover (like RAID)
Measure Then Build
Google workload characteristics:
- huge files (GBs)
- almost all writes are appends
- concurrent appends common
- high throughput is valuable
- low latency is not
Example Workloads
MapReduce
- read entire dataset, do computation over it
Producer/consumer
- many producers append work to a file concurrently
- one consumer reads and does the work
- append is not idempotent; is the work idempotent?
Codesign
Opportunity to build FS and application together.
Make sure applications can deal with FS quirks.
Avoid difficult FS features:
- rename dir
- links
Replication
(diagram: five servers; chunks A, B, and C each replicated on three of them)
Less orderly than RAID:
- machines come and go, capacity may vary
- different data may have different replication
- how to map logical to physical?
Recovery
(diagram: Server 4 is gone; its copies of chunks A and B must be re-created on the remaining servers)
Machine may be dead forever, or it may come back.
Observation
(diagram: Server 4 later returns with its copy of chunk A, after replicas were already re-created elsewhere)
Maintaining replication and finding data will be difficult unless we have a global view of the data.
Architecture
(diagram: one Master holding metadata; many Workers storing data in local FS's; a Client talks to both over RPC)
- metadata consistency easy
- large capacity
Chunk Layer
Break GFS files into large chunks (e.g., 64MB).
Workers store physical chunks in Linux files.
Master maps logical chunks to physical chunk locations.
(diagram)
Master chunk map:
logical  phys
924      w2,w5,w7
521      w2,w9,w11
...      ...
Worker w2, local FS:
/chunks/924 => data1
/chunks/521 => data2
...
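The chunk map above can be sketched as a plain dictionary (a minimal illustration, not the real GFS data structure; IDs and worker names are the slide's examples):

```python
# Master's chunk map: logical chunk ID -> workers holding a physical copy.
chunk_map = {
    924: ["w2", "w5", "w7"],
    521: ["w2", "w9", "w11"],
}

def lookup(logical_id):
    """Return the workers storing replicas of the given logical chunk."""
    return chunk_map[logical_id]
```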
Client Reads a Chunk
(diagram sequence; master chunk map: 924 => w2,w5,w7; 521 => w2,w9,w11; worker w2 stores /chunks/924 and /chunks/521 in its local FS)
1. client => master: lookup 924
2. master => client: w2,w5,w7
3. client => w2: read 924: offset=0 size=1MB; w2 => client: data
4. client => w2: read 924: offset=1MB size=1MB; w2 => client: data
5. client => w2: read 924: offset=2MB size=1MB; w2 => client: data
Master is not a bottleneck because it is not involved in most reads.
How does the client know what chunk to read?
File Namespace
Map path names to logical chunk lists.
1. Client sends path name to master.
2. Master sends chunk locations to client.
3. Client reads/writes to workers directly.
(diagram sequence; master also stores the file namespace: /foo/bar => 924,813; /var/log => 123,999)
1. client => master: lookup /foo/bar
2. master => client: 924: [w2,w5,w7], 813: [...]
3. client => w2: read 924: offset=0MB size=1MB
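The two-level lookup above can be sketched end to end (an illustrative model under the slide's example names, not the real GFS RPC API):

```python
# Master state: path -> logical chunk IDs, and chunk ID -> replica workers.
namespace = {"/foo/bar": [924, 813]}
chunk_map = {924: ["w2", "w5", "w7"], 813: ["w1", "w8", "w9"]}

def master_lookup(path):
    """Master: resolve a path to (chunk ID, replica locations) pairs."""
    return [(c, chunk_map[c]) for c in namespace[path]]

def client_read(path):
    """Client: one master RPC, then all chunk reads go to workers directly."""
    plan = master_lookup(path)
    # Read each chunk from its first replica (real clients pick a nearby one).
    return [(chunk, replicas[0]) for chunk, replicas in plan]
```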
What if Chunk Size Doubles?
(diagram: master state before: file namespace /foo/bar => 924,813; /var/log => 123,999; chunk map 924 => w2,w5,w7; 813 => w1,w8,w9)
After doubling: the namespace lists are half as long (/foo/bar => 924; /var/log => 123), and the chunk map has half as many entries.
Chunk Size
GFS uses very large chunks, e.g., 64MB.
How does chunk size affect the size of the data structs?
A: logical-chunk lists halved, chunk map halved.
Any disadvantages to making chunks huge?
- sometimes slow: cannot parallelize I/O as much
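Back-of-envelope arithmetic makes the metadata savings concrete (the 1 PB figure is an illustrative assumption, not from the paper):

```python
# Metadata entries shrink linearly as chunk size grows.
PB = 2**50
MB = 2**20

def num_chunks(total_bytes, chunk_bytes):
    return total_bytes // chunk_bytes

small = num_chunks(PB, 64 * MB)    # 64MB chunks: ~16.8M map entries per PB
large = num_chunks(PB, 128 * MB)   # doubling the chunk size halves that
```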
Master: Crashes + Consistency
File namespace and chunk map are 100% in RAM.
- allows master to work with 1000's of workers
- what if master crashes?

File Namespace
Write namespace updates to two types of logs:
- local disk (disk is never read except after a crash)
- disk on a backup master (in case of permanent failure)
Occasionally dump entire state to a checkpoint.
- use a format that can be directly mapped for fast recovery (i.e., no parsing)
- why can't we use pointers?
Chunk Map
Don't persist on master. Just ask workers which chunks they have.
What if a worker dies too? Doesn't matter; then that worker can't serve the chunks in the map anyway.
(diagram: a worker holding {A,B,C,D} reports "I have {A,B,C,D}" to the master; after losing two chunks, it reports only "I have {A,B}")
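Rebuilding the chunk map from worker reports can be sketched as a simple inversion (illustrative names; the real protocol uses periodic heartbeat messages):

```python
# Master never persists the chunk map; after a restart it rebuilds the map
# from each worker's "I have {...}" report.
def rebuild_chunk_map(reports):
    """reports: {worker: set of chunk IDs it holds} -> {chunk: [workers]}"""
    chunk_map = {}
    for worker, chunks in reports.items():
        for c in chunks:
            chunk_map.setdefault(c, []).append(worker)
    return chunk_map
```

A worker that died stays absent from `reports`, so its chunks simply have fewer listed replicas, which is exactly the truth.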
GFS Summary
Fight failure with replication.
Distribute data for performance, reliability.
Centralize metadata for simplicity.
Do problem 1!
Problem
Datasets are too big to process single-threaded (or even single-machined!).
Good concurrent programmers are rare.
Want a concurrent programming framework that is:
- easy to use (no locks, CVs, race conditions)
- general (works for many problems)
MapReduce
Strategy: break data into buckets, do computation over each bucket.
Google published details in 2004.
Open source implementation: Hadoop
Example: Revenue per State
State  Sale  ClientID
WI     100   9292
CA     10    9523
WI     15    9331
CA     45    9523
TX     9     8810
WI     20    9292
How to quickly sum sales in every state without any one machine iterating over all results?
Strategy
One set of processes groups data into logical buckets. (mappers)
Each bucket has a single process that computes over it. (reducers)
Claim: if no bucket has too much data, no single process can do too much work.
Example: Revenue per State
(diagram sequence over the table above)
mapper 1 takes {WI 100, CA 10, WI 15}; mapper 2 takes {CA 45, TX 9, WI 20}.
Mappers group records by state and route each state's records to one reducer.
reducer 1: Reduce WI => WI 135
reducer 2: Reduce CA, Reduce TX => CA 55, TX 9
Revenue per State
(same table as before: State, Sale, ClientID)
Mappers could have grouped by any field desired (e.g., by ClientID).
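The whole job can be sketched as three plain functions (a single-process simulation of the distributed flow; function names are illustrative):

```python
from collections import defaultdict

# Mapper: emit a (state, sale) pair per record.
def map_sale(record):
    state, sale, _client_id = record
    return (state, sale)

# Shuffle: group mapper output by key, one bucket per state.
def shuffle(pairs):
    buckets = defaultdict(list)
    for key, val in pairs:
        buckets[key].append(val)
    return buckets

# Reducer: sum one bucket.
def reduce_sum(key, vals):
    return (key, sum(vals))

sales = [("WI", 100, 9292), ("CA", 10, 9523), ("WI", 15, 9331),
         ("CA", 45, 9523), ("TX", 9, 8810), ("WI", 20, 9292)]
totals = dict(reduce_sum(k, v) for k, v in shuffle(map(map_sale, sales)).items())
```

Grouping by ClientID instead would only change which field `map_sale` returns as the key.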
Mapper Output
Sometimes mappers simply classify records (state revenue example).
Sometimes mappers produce multiple intermediate records per input (e.g., friend counts).
Input (friend1, friend2 pairs):
133 155
133 99
133 300
300 99
300 21
99 155
mapper 1 takes the first three pairs; mapper 2 takes the rest. Each pair (a,b) emits a record for both a and b.
mapper 1 output: 133 => 155,99,300; 155 => 133; 99 => 133; 300 => 133
mapper 2 output: 300 => 99,21; 99 => 300,155; 21 => 300; 155 => 99
Many Other Workloads
Distributed grep (over text files)
URL access frequency (over web request logs)
Distributed sort (over strings)
PageRank (over all web pages)
...
Hadoop API
public void map(LongWritable key, Text value) {
    // WRITE CODE HERE
}

public void reduce(Text key, Iterator<IntWritable> values) {
    // WRITE CODE HERE
}

Example:
public void map(LongWritable key, Text value) {
    String line = value.toString();
    StringTokenizer st = new StringTokenizer(line);
    while (st.hasMoreTokens())
        output.collect(st.nextToken(), 1);
}

public void reduce(Text key, Iterator<IntWritable> values) {
    int sum = 0;
    while (values.hasNext())
        sum += values.next().get();
    output.collect(key, sum);
}

what does this do?
MapReduce over GFS
MapReduce writes/reads data to/from GFS.
MapReduce workers run on the same machines as GFS workers.
Pipeline: GFS files =1=> mappers =2=> intermediate local files =3=> reducers =4=> GFS files
Why not store intermediate files in GFS?
Which edges involve network I/O? Edges 3+4. Maybe 1.
How to avoid I/O for 1?
Exposing Location
GFS exposes which servers store which files (not transparent, but very useful!).
Hadoop example:
BlockLocation[] getFileBlockLocations(Path p, long start, long len);
Spec: return an array containing hostnames, offsets, and sizes of portions of the given file.
MapReduce Policy
MapReduce needs to decide which machines to use for map and reduce tasks. Potential factors:
- try to put mappers near one of the three replicas
- for reducers, store one output replica locally
- try to use underloaded machines
- consider network topology
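The first factor in the list can be sketched as a tiny placement rule (an illustrative policy, not the actual scheduler):

```python
# Prefer scheduling a map task on an idle machine that already stores a
# replica of its input chunk, so the input read is local.
def place_mapper(replicas, idle_machines):
    """replicas: machines holding the chunk; idle_machines: candidates."""
    for m in idle_machines:
        if m in replicas:
            return m          # local read: no network I/O for the input
    return idle_machines[0]   # otherwise fall back to a remote read
```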
Failed Tasks
A MapReduce master server tracks the status of all map and reduce tasks.
If any don't respond to pings, they are simply restarted on different machines.
This is possible because tasks are deterministic, and we still have the inputs.
Slow Tasks
Sometimes a machine gets overloaded or a network link is slow.
With 1000's of tasks, this will always happen.
Spawning duplicate tasks when only a few stragglers are left reduces some job times by 30%.
MapReduce Summary
MapReduce makes concurrency easy!
Limited programming environment, but it works for a fairly wide variety of applications.
Machine failures are easily handled.
Do problem 2!
Search Engine Goal
Users should be able to enter search phrases.
Want to return results that are:
- high quality (how to judge?)
- relevant
It's OK to do a lot of processing offline, but searches must be fast!
(diagram: Internet webpages => web servers => Crawler => Snapshot of Pages => Indexing via MapReduce jobs, computing relevance and quality => Search Engine => Searchers)
Outline
Web Crawling
Indexing
- PageRank
- Inverted Indexes
Searching
robots.txt
A robots.txt file can tell crawlers not to crawl. Example:

User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory

Some web developers set up intentional spider traps to punish crawlers that ignore these.
example source: http://en.wikipedia.org/wiki/Robots_exclusion_standard
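A polite crawler can evaluate these rules with Python's standard library; the sketch below encodes two of the groups from the slide's example:

```python
from urllib.robotparser import RobotFileParser

# Rules mirroring part of the example above.
rules = """\
User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /something/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
```

A crawler calls `rp.can_fetch(user_agent, url)` before each request and skips any URL for which it returns False.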
Spider Traps
Server returns data so that page example.com/N has a link to example.com/(N+1).
From the crawler's perspective, the web is infinite!
Prioritize via heuristics (avoid dynamic content) and quality rankings (later).
“Almost daily, we receive an email something like, ‘Wow, you looked at a lot of pages from my web site. How did you like it?’”
Sergey Brin + Lawrence Page
Source: The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Quality Problem
Web pages "proliferate free of quality control".
Contrast with peer-reviewed academic papers.
Need to infer quality from the web graph.
Give every page a single PageRank score representing quality.
Strategy: Count Links
(diagram: pages A-F with links among them; B receives the most links, D receives none)
Importance by raw link count: A = 1, B = 4, C = 1, D = 0, E = 1, F = 1.
Should A get 2 "votes"? Instead, split each page's vote evenly across its outgoing links (each of A's two links carries 0.5):
A = 1, B = 3.5, C = 0.5 (from B's vote), D = 0, E = 0.5 (from A's vote), F = 0.5.
Why do A and B get the same votes to give? B is more important.
Circular Votes
Want: the number of votes you get determines the number of votes you give.
Problem: changing A's votes changes B's votes changes A's votes...
Fortunately, if you just keep updating every PageRank, it eventually converges.
Convergence Goal (Simplified)
Rank(x) = "sum of all votes for x"
"x" is a page, Rank(x) is its PageRank.

Rank(x) = Σ_{y∈LinksTo(x)} "y's vote for x"
LinksTo(x) is the set of all pages linking to x.

Rank(x) = Σ_{y∈LinksTo(x)} Rank(y) / N_y
N_y is the number of links from y to other pages.

Rank(x) = c · Σ_{y∈LinksTo(x)} Rank(y) / N_y
Normalize with "c" to get the desired amount of "rank" in the system.

Keep updating the rank for every page until ranks stop changing much.
Intuition: Random Surfer
Imagine!
1. a bunch of web surfers start on various pages
2. they randomly click links, forever
3. you measure webpage visit frequency
Visit frequency will be proportional to PageRank.
Graph 1
(diagram: A => B, C => B, B => A, B => C)
Rank(B) = (0.25 / 1) + (0.25 / 1) = 0.5
Rank(A) = (0.5 / 2) = 0.25
Rank(C) = (0.5 / 2) = 0.25
(using Rank(x) = c Σ_{y∈LinksTo(x)} Rank(y) / N_y)
Graph 3
(diagram: A and B link into C and D; C and D link only to each other)
Problem: surfers get stuck in C and D. C+D is called a rank "sink". A and B get 0 rank.
Problems
Problem A: dangling links
Problem B: rank sinks
Solution? Surfers should jump to a new random page with some probability.
Computation
ranks = INIT_RANKS; // rank for each page
do {
    new_ranks = compute_ranks(ranks, edges);
    change = compute_diff(new_ranks, ranks);
    ranks = new_ranks;
} while (change > threshold);

Many MapReduce jobs can be used.
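The loop above can be fleshed out in Python, folding in the random-jump fix for sinks (the damping value 0.85 and fixed iteration count are common illustrative choices, not from the slides):

```python
# PageRank by repeated updating. With probability d a surfer follows a link;
# otherwise they jump to a random page, which drains rank out of sinks.
def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it links to]}; every page must appear as a key.
    Returns a rank per page; ranks sum to ~1."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}   # random-jump share
        for y, outs in links.items():
            for x in outs:
                new[x] += d * ranks[y] / len(outs)  # y's split vote for x
        ranks = new
    return ranks
```

On the symmetric two-page graph A <=> B, both pages converge to rank 0.5, matching the random-surfer intuition that both are visited equally often.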
"To test the utility of PageRank for search, we built a web search engine called Google"
Larry Page et al.
The PageRank Citation Ranking: Bringing Order to the Web (http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf)
Relevance Problem
A website may be important, but is it relevant to the user's current query?
Infer relevance from page contents, such as:
- html body
- title
- meta tags
- headers
- etc.
Indexing
Strategy: indexing.
Generate files organized by topic, keyword, or some other criteria that organize documents.
For a given word, we want to be able to find all related documents.
Representation
For fast processing, assign:
- a docID to each unique page
- a wordID to each unique word on the web
(diagram: the page at http://www.example.com/... containing "Lorem ipsum dolor sit amet, ..." becomes docID=1442, and its text becomes the wordID sequence 5 922 2 66 42 5 ...)

Forward Index
docID  wordID
1442   5
1442   922
1442   2
1442   66
1442   42
1442   5
...    ...
forward index
(a second page, docID=9977, similarly becomes the wordID sequence 522 141 553 999 ...)
Inverted Index
Start from the forward index (docID => wordID pairs), then:
1. swap columns to get wordID => docID pairs:
wordID  docID
5       1442
922     1442
2       1442
66      1442
42      1442
5       1442
...     ...
2. sort by wordID:
wordID  docID
1       244
2       1442
5       1442
5       1442
5       999
6       133
...     ...
3. group docIDs per wordID, giving the inverted index:
wordID  docIDs
2       1442
5       1442,1442,999
6       133,411
7       1442,133,999
9       411,875
...     ...
Computing Inverted Index with MapReduce
Mapper: read words from files
- out key: word
- out val: file name
Reducer: make list of file names
- out key: word
- out val: list of file names
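The mapper/reducer pair above can be simulated in one function (a single-process sketch; the dict update plays the role of the shuffle):

```python
from collections import defaultdict

# Build an inverted index: map emits (word, filename) pairs, the shuffle
# groups pairs by word, and reduce collects the file list per word.
def invert(files):
    """files: {filename: text} -> {word: sorted list of filenames}"""
    index = defaultdict(set)
    for name, text in files.items():       # "mapper": emit (word, name)
        for word in text.split():
            index[word].add(name)          # "shuffle": group by word
    return {w: sorted(names) for w, names in index.items()}  # "reducer"
```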
One-Word Queries
The inverted index may be split into "posting files" across many machines. The wordID => machine mapping is known.
A front-end server takes the query and converts it to a wordID.
The front-end fetches docIDs from the server with the posting file.
docIDs are sorted based on PageRank and relevance and returned to the user.
Multi-Word Queries
The query is converted into a list of wordIDs.
docIDs from the posting files for each wordID are retrieved.
The lists of docIDs can be unioned (OR) or intersected (AND).
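The AND/OR step is just set intersection and union over posting lists (the posting data below is illustrative, reusing the example wordIDs and docIDs from the inverted-index slides):

```python
# Toy posting files: wordID -> set of docIDs containing that word.
postings = {5: {1442, 999}, 922: {1442}, 66: {1442, 133}}

def query_and(word_ids):
    """Docs containing ALL the query words (AND)."""
    return sorted(set.intersection(*(postings[w] for w in word_ids)))

def query_or(word_ids):
    """Docs containing ANY of the query words (OR)."""
    return sorted(set.union(*(postings[w] for w in word_ids)))
```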
Search is Resource Intense
Indexes greatly reduce the data that must be considered relative to the grep approach.
However! Most of the data read from the posting lists won't be relevant, so a lot of data must still be scanned.