Scalable Web Programming · Ning, SocialGo offer customizable communities. Server & Data Scaling...

Scalable Web Programming CS193S - Jan Jannink - 2/25/10

Weekly Syllabus

1. Scalability: (Jan.)

2. Agile Practices

3. Ecology/Mashups

4. Browser/Client

5. Data/Server: (Feb.)

6. Security/Privacy

7. Analytics

8. Cloud/Map-Reduce

9. Published APIs: (Mar.)*

10. Future

* PROJECT DUE DATE

Cloud Recap

Progressive commoditization of IT services

Choose based on value creation in project

Build it, Scale it, Code it, Customize it

Google has most efficient data centers

Ning, SocialGo offer customizable communities

Server & Data Scaling

Traditionally depended on next hardware release

AltaVista search engine

limited to the most expensive DEC Alpha box

Original eBay build-out

massive SUN/Teradata clusters

Map/Reduce

Background

Google founders’ disdain for traditional RDBMS

Original paper published about 5 years ago (Dean & Ghemawat, OSDI 2004)

Main Features

limitless scalability on cheap hardware

real time fault tolerance

Google Example

Key ‘contrarian’ insight

scaling on cheap hardware is best

need generic API to divide and conquer

If McDonald’s needed a top chef in each store

how much more would it cost?

how many restaurants would it now have?

Hadoop

First Open Source implementation

by Doug Cutting, also of Lucene fame

Storage, Execution, Management components

HDFS, HBase, Hive

MapReduce

Pig, ZooKeeper

1-2-3

Configure server clusters

Code Map & Reduce classes

input list of (key, value) pairs

output set of (key, value) pairs

Start HDFS, MapReduce servers
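
The shape of the Map and Reduce code in step 2 can be sketched in a few lines of Python. This is a single-process stand-in, not Hadoop's API: `run_mapreduce`, `bytes_mapper`, `bytes_reducer`, and the sample log lines are all hypothetical names chosen for illustration. It only shows the contract from the slide: a list of (key, value) pairs goes in, and a set of (key, value) pairs comes out.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    # Map phase: every input (key, value) pair may emit any number of pairs.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Shuffle: group the emitted values by their key.
            groups[out_key].append(out_value)

    # Reduce phase: each key is reduced together with all of its values.
    result = {}
    for out_key in sorted(groups):
        for final_key, final_value in reducer(out_key, groups[out_key]):
            result[final_key] = final_value
    return result


def bytes_mapper(line_no, line):
    """Hypothetical mapper: parse 'url size' and emit (url, size)."""
    url, size = line.split()
    yield url, int(size)


def bytes_reducer(url, sizes):
    """Hypothetical reducer: total bytes served per URL."""
    yield url, sum(sizes)


# Input: a list of (key, value) pairs, here (line number, log line).
LOG = [(0, "/index.html 512"), (1, "/about.html 128"), (2, "/index.html 256")]
TOTALS = run_mapreduce(LOG, bytes_mapper, bytes_reducer)
print(TOTALS)
```

In a real cluster the mapper and reducer calls would run on different machines; only the two small functions are job-specific, which is the point of a generic divide-and-conquer API.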

Execution Schema

[Diagram built up across seven slides. Labels: browser, apache, master, map, reduce, data (x3)]

Example: Word Count

cat data.txt | sort | uniq -c

When data.txt is huge

split it into even chunks

sort chunks (or better yet, prefix sort)

count on a per item/prefix basis

return result
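
The chunked approach above can be sketched in Python as follows. This is a minimal single-machine illustration, assuming one item per line as in the shell pipeline; the helper names (`chunks`, `count_chunk`, `merge_counts`, `word_count`) and the sample data are hypothetical. Note one swap: it counts each chunk with a hash table (`Counter`) instead of sorting it, which gives the same per-item counts.

```python
from collections import Counter
from itertools import islice


def chunks(lines, size):
    """Split an iterable of lines into lists of at most `size` lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def count_chunk(chunk):
    """Map step: count the items inside one chunk, skipping blank lines."""
    return Counter(line.strip() for line in chunk if line.strip())


def merge_counts(partials):
    """Reduce step: add the per-chunk counts together, item by item."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, chunk_size=2):
    """Count items by splitting into chunks, counting each, and merging."""
    return merge_counts(count_chunk(c) for c in chunks(lines, chunk_size))


# Stand-in for the contents of data.txt, one item per line.
DATA = ["apple\n", "pear\n", "apple\n", "fig\n", "pear\n", "apple\n"]
COUNTS = word_count(DATA)
for item, count in sorted(COUNTS.items()):
    print(count, item)
```

Each call to `count_chunk` is independent of the others, so the chunks could be counted on separate machines and merged afterwards.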

Insights

Relates to network switching, routing concepts

DB queries can generate initial (key, value) pairs

Master restarts failed map, reduce tasks

ping nodes after first results start coming in

Rule of thumb

data / 64MB ≅ # mappers > # reducers
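
As a worked instance of the rule of thumb, the short Python sketch below divides the input size by the 64 MB block size. The function name `estimate_mappers` and the 1 TB input are assumptions for illustration; the number of reducers is left out because the slide only says it should be smaller.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size from the rule of thumb


def estimate_mappers(data_bytes, block_size=BLOCK_SIZE):
    """Rough mapper count: one mapper per block of input, at least one."""
    return max(1, math.ceil(data_bytes / block_size))


ONE_TB = 1024 ** 4
MAPPERS_FOR_1TB = estimate_mappers(ONE_TB)
print(MAPPERS_FOR_1TB)
```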

Additional Details

Storing intermediate results

RAM/File systems on mappers, reducers

Can combine, e.g., duplicate removal, on mappers

Partition mapper results by key for reducers

Backup masters possible (not frequently used)

Caching becomes a critical system component
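
Combining on the mappers and partitioning by key can be sketched as below. This is an illustrative single-process version, not Hadoop's combiner or partitioner interface; `combine`, `partition`, `shuffle`, and the sample mapper outputs are hypothetical. It uses `zlib.crc32` so that a key maps to the same reducer in every process, since Python's built-in `hash()` for strings varies between runs.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: pre-aggregate one mapper's (key, count) pairs locally."""
    combined = Counter()
    for key, count in pairs:
        combined[key] += count
    return dict(combined)


def partition(key, num_reducers):
    """Pick the reducer for a key using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Combine each mapper's output, then route every key to one reducer."""
    buckets = [dict() for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, count in combine(output).items():
            bucket = buckets[partition(key, num_reducers)]
            bucket.setdefault(key, []).append(count)
    return buckets


# Two mappers' raw outputs; the first emits the key "a" twice.
MAPPER_OUTPUTS = [[("a", 1), ("b", 1), ("a", 1)], [("b", 1), ("c", 1)]]
BUCKETS = shuffle(MAPPER_OUTPUTS, num_reducers=2)
print(BUCKETS)
```

The combiner cuts what each mapper sends over the network (two pairs for "a" become one), and the partitioner guarantees that all values for a given key arrive at the same reducer.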

Caching is a Huge Win

[Diagram: the execution schema (browser, apache, master, map, reduce, data x3) with a cache added]
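
Why the cache helps can be sketched in Python: a repeated request is answered from memory and never reaches the cluster. Everything here is assumed for illustration (the `JobCache` class, the `slow_job` stand-in, and the query string), and the sketch has no eviction or invalidation, which a real deployment would need.

```python
class JobCache:
    """Remember job results so repeated queries skip the expensive job."""

    def __init__(self, run_job):
        self._run_job = run_job  # the expensive call, e.g. a Map/Reduce job
        self._results = {}
        self.hits = 0
        self.misses = 0

    def get(self, query):
        """Return the cached result, running the job only on a miss."""
        if query in self._results:
            self.hits += 1
            return self._results[query]
        self.misses += 1
        result = self._run_job(query)
        self._results[query] = result
        return result


def slow_job(query):
    """Hypothetical stand-in for submitting a job to the master."""
    return query.upper()


CACHE = JobCache(slow_job)
FIRST = CACHE.get("top pages")   # miss: runs the job
SECOND = CACHE.get("top pages")  # hit: served from the cache
print(FIRST, CACHE.hits, CACHE.misses)
```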

Current Status

Current production development at Google

entirely Map/Reduce

Yahoo runs ~4000 node Hadoop cluster

100TB data sorting record < 3 hours

Facebook has ~1000 node Hadoop install

Amazon offers Elastic MapReduce

Controversy?

DB community on both sides of argument

DeWitt, Stonebraker: a giant step backwards

Abadi: HadoopDB

In web apps, however

DBs are used for persistence, not transactions

MapReduce provides scale and fault tolerance

Future Directions

AsterData, Google

hybrid Map/Reduce DBs, Data Warehouses

HadoopDB

open source PostgreSQL & Hadoop hybrid

NoSQL movement

Data Glut

Over 3 million English Wikipedia articles

1000+ new articles a day

unthinkable to use without search

Facebook Data Warehouse adds 15 TB a day

in 2007 it was 15 TB total

Google has several multi Petabyte data stores

Here to Stay

Easy to learn, easy to maintain

very incremental learning curve

Scales as fast as data is growing

only option for large data mining tasks

Accommodates multiple persistence backends

Worth Checking Out

Hadoop

http://hadoop.apache.org/

Wikimedia Report Card

http://stats.wikimedia.org/reportcard/

Data blog

http://www.dbms2.com/

Q & A Topics

Merging DB optimizers and Map/Reduce

Managing multi stage Map/Reduce pipelines

Map/Reduce and virtual machines