[537] Google Infrastructure
GFS
Goal: present a global FS that stores data across many machines. Needs to handle 100s of TBs.
Contrast: NFS only exports a local FS on one machine to other machines.
Google published details in 2003.
Open source implementation: Hadoop FS (HDFS)
Failure: NFS Comparison
NFS only recovers from temporary failure.
- not permanent disk/server failure
- recovery means making the reboot invisible
- technique: retry (stateless and idempotent protocol helps)
GFS needs to handle permanent failure.
- techniques: replication and failover (like RAID)
Measure Then Build
Google workload characteristics:
- huge files (GBs)
- almost all writes are appends
- concurrent appends common
- high throughput is valuable
- low latency is not
Example Workloads
MapReduce
- read entire dataset, do computation over it
Producer/consumer
- many producers append work to a file concurrently
- one consumer reads and does the work
- append is not idempotent; is the work idempotent?
Codesign
Opportunity to build FS and application together.
Make sure applications can deal with FS quirks.
Avoid difficult FS features:
- rename dir
- links
Replication
(diagram: five servers; chunks A, B, and C each replicated on three of them)
Less orderly than RAID:
- machines come and go, capacity may vary
- different data may have different replication
- how to map logical to physical?
Recovery
(diagram: Server 4 is gone; its copies of chunks A and B must be re-created on the remaining servers)
Machine may be dead forever, or it may come back.
Observation
(diagram: Server 4 later returns with its copy of chunk A, after replicas were already re-created elsewhere)
Maintaining replication and finding data will be difficult unless we have a global view of the data.
Architecture
(diagram: one Master holding metadata; many Workers storing data in local FS's; a Client talks to both over RPC)
- metadata consistency easy
- large capacity
Chunk Layer
Break GFS files into large chunks (e.g., 64MB).
Workers store physical chunks in Linux files.
Master maps logical chunks to physical chunk locations.
(diagram)
Master chunk map:
logical  phys
924      w2,w5,w7
521      w2,w9,w11
...      ...
Worker w2, local FS:
/chunks/924 => data1
/chunks/521 => data2
...
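The chunk map above can be sketched as a plain dictionary (a minimal illustration, not the real GFS data structure; IDs and worker names are the slide's examples):

```python
# Master's chunk map: logical chunk ID -> workers holding a physical copy.
chunk_map = {
    924: ["w2", "w5", "w7"],
    521: ["w2", "w9", "w11"],
}

def lookup(logical_id):
    """Return the workers storing replicas of the given logical chunk."""
    return chunk_map[logical_id]
```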
Client Reads a Chunk
(diagram sequence; master chunk map: 924 => w2,w5,w7; 521 => w2,w9,w11; worker w2 stores /chunks/924 and /chunks/521 in its local FS)
1. client => master: lookup 924
2. master => client: w2,w5,w7
3. client => w2: read 924: offset=0 size=1MB; w2 => client: data
4. client => w2: read 924: offset=1MB size=1MB; w2 => client: data
5. client => w2: read 924: offset=2MB size=1MB; w2 => client: data
Master is not a bottleneck because it is not involved in most reads.
How does the client know what chunk to read?
File Namespace
Map path names to logical chunk lists.
1. Client sends path name to master.
2. Master sends chunk locations to client.
3. Client reads/writes to workers directly.
(diagram sequence; master also stores the file namespace: /foo/bar => 924,813; /var/log => 123,999)
1. client => master: lookup /foo/bar
2. master => client: 924: [w2,w5,w7], 813: [...]
3. client => w2: read 924: offset=0MB size=1MB
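The two-level lookup above can be sketched end to end (an illustrative model under the slide's example names, not the real GFS RPC API):

```python
# Master state: path -> logical chunk IDs, and chunk ID -> replica workers.
namespace = {"/foo/bar": [924, 813]}
chunk_map = {924: ["w2", "w5", "w7"], 813: ["w1", "w8", "w9"]}

def master_lookup(path):
    """Master: resolve a path to (chunk ID, replica locations) pairs."""
    return [(c, chunk_map[c]) for c in namespace[path]]

def client_read(path):
    """Client: one master RPC, then all chunk reads go to workers directly."""
    plan = master_lookup(path)
    # Read each chunk from its first replica (real clients pick a nearby one).
    return [(chunk, replicas[0]) for chunk, replicas in plan]
```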
What if Chunk Size Doubles?
(diagram: master state before: file namespace /foo/bar => 924,813; /var/log => 123,999; chunk map 924 => w2,w5,w7; 813 => w1,w8,w9)
After doubling: the namespace lists are half as long (/foo/bar => 924; /var/log => 123), and the chunk map has half as many entries.
Chunk Size
GFS uses very large chunks, e.g., 64MB.
How does chunk size affect the size of the data structs?
A: logical-chunk lists halved, chunk map halved.
Any disadvantages to making chunks huge?
- sometimes slow: cannot parallelize I/O as much
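Back-of-envelope arithmetic makes the metadata savings concrete (the 1 PB figure is an illustrative assumption, not from the paper):

```python
# Metadata entries shrink linearly as chunk size grows.
PB = 2**50
MB = 2**20

def num_chunks(total_bytes, chunk_bytes):
    return total_bytes // chunk_bytes

small = num_chunks(PB, 64 * MB)    # 64MB chunks: ~16.8M map entries per PB
large = num_chunks(PB, 128 * MB)   # doubling the chunk size halves that
```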
Master: Crashes + Consistency
File namespace and chunk map are 100% in RAM.
- allows master to work with 1000's of workers
- what if master crashes?

File Namespace
Write namespace updates to two types of logs:
- local disk (disk is never read except after a crash)
- disk on a backup master (in case of permanent failure)
Occasionally dump entire state to a checkpoint.
- use a format that can be directly mapped for fast recovery (i.e., no parsing)
- why can't we use pointers?
Chunk Map
Don't persist on master. Just ask workers which chunks they have.
What if a worker dies too? Doesn't matter; then that worker can't serve the chunks in the map anyway.
(diagram: a worker holding {A,B,C,D} reports "I have {A,B,C,D}" to the master; after losing two chunks, it reports only "I have {A,B}")
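Rebuilding the chunk map from worker reports can be sketched as a simple inversion (illustrative names; the real protocol uses periodic heartbeat messages):

```python
# Master never persists the chunk map; after a restart it rebuilds the map
# from each worker's "I have {...}" report.
def rebuild_chunk_map(reports):
    """reports: {worker: set of chunk IDs it holds} -> {chunk: [workers]}"""
    chunk_map = {}
    for worker, chunks in reports.items():
        for c in chunks:
            chunk_map.setdefault(c, []).append(worker)
    return chunk_map
```

A worker that died stays absent from `reports`, so its chunks simply have fewer listed replicas, which is exactly the truth.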
GFS Summary
Fight failure with replication.
Distribute data for performance, reliability.
Centralize metadata for simplicity.
Do problem 1!
Problem
Datasets are too big to process single-threaded (or even single-machined!).
Good concurrent programmers are rare.
Want a concurrent programming framework that is:
- easy to use (no locks, CVs, race conditions)
- general (works for many problems)
MapReduce
Strategy: break data into buckets, do computation over each bucket.
Google published details in 2004.
Open source implementation: Hadoop
Example: Revenue per State
State  Sale  ClientID
WI     100   9292
CA     10    9523
WI     15    9331
CA     45    9523
TX     9     8810
WI     20    9292
How to quickly sum sales in every state without any one machine iterating over all results?
Strategy
One set of processes groups data into logical buckets. (mappers)
Each bucket has a single process that computes over it. (reducers)
Claim: if no bucket has too much data, no single process can do too much work.
Example: Revenue per State
(diagram sequence over the table above)
mapper 1 takes {WI 100, CA 10, WI 15}; mapper 2 takes {CA 45, TX 9, WI 20}.
Mappers group records by state and route each state's records to one reducer.
reducer 1: Reduce WI => WI 135
reducer 2: Reduce CA, Reduce TX => CA 55, TX 9
Revenue per State
(same table as before: State, Sale, ClientID)
Mappers could have grouped by any field desired (e.g., by ClientID).
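The whole job can be sketched as three plain functions (a single-process simulation of the distributed flow; function names are illustrative):

```python
from collections import defaultdict

# Mapper: emit a (state, sale) pair per record.
def map_sale(record):
    state, sale, _client_id = record
    return (state, sale)

# Shuffle: group mapper output by key, one bucket per state.
def shuffle(pairs):
    buckets = defaultdict(list)
    for key, val in pairs:
        buckets[key].append(val)
    return buckets

# Reducer: sum one bucket.
def reduce_sum(key, vals):
    return (key, sum(vals))

sales = [("WI", 100, 9292), ("CA", 10, 9523), ("WI", 15, 9331),
         ("CA", 45, 9523), ("TX", 9, 8810), ("WI", 20, 9292)]
totals = dict(reduce_sum(k, v) for k, v in shuffle(map(map_sale, sales)).items())
```

Grouping by ClientID instead would only change which field `map_sale` returns as the key.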
Mapper Output
Sometimes mappers simply classify records (state revenue example).
Sometimes mappers produce multiple intermediate records per input (e.g., friend counts).
Input (friend1, friend2 pairs):
133 155
133 99
133 300
300 99
300 21
99 155
mapper 1 takes the first three pairs; mapper 2 takes the rest. Each pair (a,b) emits a record for both a and b.
mapper 1 output: 133 => 155,99,300; 155 => 133; 99 => 133; 300 => 133
mapper 2 output: 300 => 99,21; 99 => 300,155; 21 => 300; 155 => 99
Many Other Workloads
Distributed grep (over text files)
URL access frequency (over web request logs)
Distributed sort (over strings)
PageRank (over all web pages)
...
Hadoop API
public void map(LongWritable key, Text value) {
    // WRITE CODE HERE
}

public void reduce(Text key, Iterator<IntWritable> values) {
    // WRITE CODE HERE
}

Example:
public void map(LongWritable key, Text value) {
    String line = value.toString();
    StringTokenizer st = new StringTokenizer(line);
    while (st.hasMoreTokens())
        output.collect(st.nextToken(), 1);
}

public void reduce(Text key, Iterator<IntWritable> values) {
    int sum = 0;
    while (values.hasNext())
        sum += values.next().get();
    output.collect(key, sum);
}

what does this do?
MapReduce over GFS
MapReduce writes/reads data to/from GFS.
MapReduce workers run on the same machines as GFS workers.
Pipeline: GFS files =1=> mappers =2=> intermediate local files =3=> reducers =4=> GFS files
Why not store intermediate files in GFS?
Which edges involve network I/O? Edges 3+4. Maybe 1.
How to avoid I/O for 1?
Exposing Location
GFS exposes which servers store which files (not transparent, but very useful!).
Hadoop example:
BlockLocation[] getFileBlockLocations(Path p, long start, long len);
Spec: return an array containing hostnames, offsets, and sizes of portions of the given file.
MapReduce Policy
MapReduce needs to decide which machines to use for map and reduce tasks. Potential factors:
- try to put mappers near one of the three replicas
- for reducers, store one output replica locally
- try to use underloaded machines
- consider network topology
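The first factor in the list can be sketched as a tiny placement rule (an illustrative policy, not the actual scheduler):

```python
# Prefer scheduling a map task on an idle machine that already stores a
# replica of its input chunk, so the input read is local.
def place_mapper(replicas, idle_machines):
    """replicas: machines holding the chunk; idle_machines: candidates."""
    for m in idle_machines:
        if m in replicas:
            return m          # local read: no network I/O for the input
    return idle_machines[0]   # otherwise fall back to a remote read
```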
Failed Tasks
A MapReduce master server tracks the status of all map and reduce tasks.
If any don't respond to pings, they are simply restarted on different machines.
This is possible because tasks are deterministic, and we still have the inputs.
Slow Tasks
Sometimes a machine gets overloaded or a network link is slow.
With 1000's of tasks, this will always happen.
Spawning duplicate tasks when only a few stragglers are left reduces some job times by 30%.
MapReduce Summary
MapReduce makes concurrency easy!
Limited programming environment, but it works for a fairly wide variety of applications.
Machine failures are easily handled.
Do problem 2!
Search Engine Goal
Users should be able to enter search phrases.
Want to return results that are:
- high quality (how to judge?)
- relevant
It's OK to do a lot of processing offline, but searches must be fast!
(diagram: Internet webpages => web servers => Crawler => Snapshot of Pages => Indexing via MapReduce jobs, computing relevance and quality => Search Engine => Searchers)
Outline
Web Crawling
Indexing
- PageRank
- Inverted Indexes
Searching
robots.txt
A robots.txt file can tell crawlers not to crawl. Example:

User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory

Some web developers set up intentional spider traps to punish crawlers that ignore these.
example source: http://en.wikipedia.org/wiki/Robots_exclusion_standard
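A polite crawler can evaluate these rules with Python's standard library; the sketch below encodes two of the groups from the slide's example:

```python
from urllib.robotparser import RobotFileParser

# Rules mirroring part of the example above.
rules = """\
User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /something/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
```

A crawler calls `rp.can_fetch(user_agent, url)` before each request and skips any URL for which it returns False.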
Spider Traps
Server returns data so that page example.com/N has a link to example.com/(N+1).
From the crawler's perspective, the web is infinite!
Prioritize via heuristics (avoid dynamic content) and quality rankings (later).
“Almost daily, we receive an email something like, ‘Wow, you looked at a lot of pages from my web site. How did you like it?’”
Sergey Brin + Lawrence Page
Source: The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Quality Problem
Web pages "proliferate free of quality control".
Contrast with peer-reviewed academic papers.
Need to infer quality from the web graph.
Give every page a single PageRank score representing quality.
Strategy: Count Links
(diagram: pages A-F with links among them; B receives the most links, D receives none)
Importance by raw link count: A = 1, B = 4, C = 1, D = 0, E = 1, F = 1.
Should A get 2 "votes"? Instead, split each page's vote evenly across its outgoing links (each of A's two links carries 0.5):
A = 1, B = 3.5, C = 0.5 (from B's vote), D = 0, E = 0.5 (from A's vote), F = 0.5.
Why do A and B get the same votes to give? B is more important.
Circular Votes
Want: the number of votes you get determines the number of votes you give.
Problem: changing A's votes changes B's votes changes A's votes...
Fortunately, if you just keep updating every PageRank, it eventually converges.
Convergence Goal (Simplified)
Rank(x) = "sum of all votes for x"
"x" is a page, Rank(x) is its PageRank.

Rank(x) = Σ_{y∈LinksTo(x)} "y's vote for x"
LinksTo(x) is the set of all pages linking to x.

Rank(x) = Σ_{y∈LinksTo(x)} Rank(y) / N_y
N_y is the number of links from y to other pages.

Rank(x) = c · Σ_{y∈LinksTo(x)} Rank(y) / N_y
Normalize with "c" to get the desired amount of "rank" in the system.

Keep updating the rank for every page until ranks stop changing much.
Intuition: Random Surfer
Imagine!
1. a bunch of web surfers start on various pages
2. they randomly click links, forever
3. you measure webpage visit frequency
Visit frequency will be proportional to PageRank.
Graph 1
(diagram: A => B, C => B, B => A, B => C)
Rank(B) = (0.25 / 1) + (0.25 / 1) = 0.5
Rank(A) = (0.5 / 2) = 0.25
Rank(C) = (0.5 / 2) = 0.25
(using Rank(x) = c Σ_{y∈LinksTo(x)} Rank(y) / N_y)
Graph 3
(diagram: A and B link into C and D; C and D link only to each other)
Problem: surfers get stuck in C and D. C+D is called a rank "sink". A and B get 0 rank.
Problems
Problem A: dangling links
Problem B: rank sinks
Solution? Surfers should jump to a new random page with some probability.
Computation
ranks = INIT_RANKS; // rank for each page
do {
    new_ranks = compute_ranks(ranks, edges);
    change = compute_diff(new_ranks, ranks);
    ranks = new_ranks;
} while (change > threshold);

Many MapReduce jobs can be used.
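The loop above can be fleshed out in Python, folding in the random-jump fix for sinks (the damping value 0.85 and fixed iteration count are common illustrative choices, not from the slides):

```python
# PageRank by repeated updating. With probability d a surfer follows a link;
# otherwise they jump to a random page, which drains rank out of sinks.
def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it links to]}; every page must appear as a key.
    Returns a rank per page; ranks sum to ~1."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}   # random-jump share
        for y, outs in links.items():
            for x in outs:
                new[x] += d * ranks[y] / len(outs)  # y's split vote for x
        ranks = new
    return ranks
```

On the symmetric two-page graph A <=> B, both pages converge to rank 0.5, matching the random-surfer intuition that both are visited equally often.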
"To test the utility of PageRank for search, we built a web search engine called Google"
Larry Page et al.
The PageRank Citation Ranking: Bringing Order to the Web (http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf)
Relevance Problem
A website may be important, but is it relevant to the user's current query?
Infer relevance from page contents, such as:
- html body
- title
- meta tags
- headers
- etc.
Indexing
Strategy: indexing.
Generate files organized by topic, keyword, or some other criteria that organize documents.
For a given word, we want to be able to find all related documents.
Representation
For fast processing, assign:
- a docID to each unique page
- a wordID to each unique word on the web
(diagram: the page at http://www.example.com/... containing "Lorem ipsum dolor sit amet, ..." becomes docID=1442, and its text becomes the wordID sequence 5 922 2 66 42 5 ...)

Forward Index
docID  wordID
1442   5
1442   922
1442   2
1442   66
1442   42
1442   5
...    ...
forward index
(a second page, docID=9977, similarly becomes the wordID sequence 522 141 553 999 ...)
Inverted Index
Start from the forward index (docID => wordID pairs), then:
1. swap columns to get wordID => docID pairs:
wordID  docID
5       1442
922     1442
2       1442
66      1442
42      1442
5       1442
...     ...
2. sort by wordID:
wordID  docID
1       244
2       1442
5       1442
5       1442
5       999
6       133
...     ...
3. group docIDs per wordID, giving the inverted index:
wordID  docIDs
2       1442
5       1442,1442,999
6       133,411
7       1442,133,999
9       411,875
...     ...
Computing Inverted Index with MapReduce
Mapper: read words from files
- out key: word
- out val: file name
Reducer: make list of file names
- out key: word
- out val: list of file names
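The mapper/reducer pair above can be simulated in one function (a single-process sketch; the dict update plays the role of the shuffle):

```python
from collections import defaultdict

# Build an inverted index: map emits (word, filename) pairs, the shuffle
# groups pairs by word, and reduce collects the file list per word.
def invert(files):
    """files: {filename: text} -> {word: sorted list of filenames}"""
    index = defaultdict(set)
    for name, text in files.items():       # "mapper": emit (word, name)
        for word in text.split():
            index[word].add(name)          # "shuffle": group by word
    return {w: sorted(names) for w, names in index.items()}  # "reducer"
```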
One-Word Queries
The inverted index may be split into "posting files" across many machines. The wordID => machine mapping is known.
A front-end server takes the query and converts it to a wordID.
The front-end fetches docIDs from the server with the posting file.
docIDs are sorted based on PageRank and relevance and returned to the user.
Multi-Word Queries
The query is converted into a list of wordIDs.
docIDs from the posting files for each wordID are retrieved.
The lists of docIDs can be unioned (OR) or intersected (AND).
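The AND/OR step is just set intersection and union over posting lists (the posting data below is illustrative, reusing the example wordIDs and docIDs from the inverted-index slides):

```python
# Toy posting files: wordID -> set of docIDs containing that word.
postings = {5: {1442, 999}, 922: {1442}, 66: {1442, 133}}

def query_and(word_ids):
    """Docs containing ALL the query words (AND)."""
    return sorted(set.intersection(*(postings[w] for w in word_ids)))

def query_or(word_ids):
    """Docs containing ANY of the query words (OR)."""
    return sorted(set.union(*(postings[w] for w in word_ids)))
```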
Search is Resource Intense
Indexes greatly reduce the data that must be considered relative to the grep approach.
However! Most of the data read from the posting lists won't be relevant, so a lot of data must still be scanned.