20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion...

49
dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer can read ~50MB/sec from disk so 3 months to read from web Solution?

Transcript of 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion...

Page 1: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

dealing w/ lots of data

20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.)

One computer can read ~50MB/sec from disk

so 3 months to read from web

Solution?

Page 2: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

Solution

get 1000’s of computers in cluster

take 2.5 hrs to process the web

Page 3: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

cluster of 1000 computers4,500 Terabytes (4.5 Petabytes) 8 racks of 64 quad core servers

Page 4: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

The only thing needed are network connections, a chilled water supply,

and electricity

Page 5: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer
Page 6: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

Nice Sun Server Box

Page 7: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer
Page 8: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

You: a graduate of UMW

!

Page 9: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

You: a graduate of UMW

I hire you to write a program to handle lotsand lots of data.

!

Page 10: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

You: a graduate of UMW

I hire you to write a program to handle lotsand lots of data.

3 months run time unacceptable.

Page 11: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

You: a graduate of UMW

I hire you to write a program to handle lotsand lots of data.

3 months run time unacceptable.

I buy you 2 nice Sun 20ft server containers

Page 12: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

You: a graduate of UMW

I hire you to write a program to handle lotsand lots of data.

3 months run time unacceptable.

I buy you 2 nice Sun 20ft server containers

How do you program it?

Page 13: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

Solutions

Map Reduce(easy way to write distributed programs)

Sharding

node.js

Page 14: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer
Page 15: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

mongo

document oriented database

replaces concept of ‘row’ with ‘document’

Page 16: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer
Page 17: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

mongoDBQuickstart

Page 18: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

Installation

On Mac using Homebrew package manager:

brew install mongo

On Linux, see mongodb.org.

should install the mongodb-10gen package

On Windows, download from mongodb.org/downloads

Page 19: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

nitrous.io

parts install mongodb

parts start mongodb

Page 20: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

mongod - the server

Page 21: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

mongod - the server

Page 22: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer
Page 23: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

mongo command line interpreter

Page 24: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

JavaScript Shell

Page 25: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

commands similar to MySQL

Page 26: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

show databases -> show dbs

Page 27: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

use xxx

Page 28: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

show tables -> show

collections

Page 29: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

Queries

db.movies.findOne({})

What to match -- like where ...

name of collection (i.e., table)

Page 30: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer
Page 31: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer
Page 32: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer
Page 33: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

Comparison Operators $lt $lte $gt $gte $ne

db.users.find({“age”: {“$gte”: 18, “$lte”: 30}})

db.users.find(“name”: {“$ne”: “Ann”}})

Page 34: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

ordb.movies.find({"actors.name": {"$in": ["Angelina Jolie", "Natalie Portman"]}})

Page 35: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

Regular ExpressionsJoe, joe, Joey, joey

!db.users.find({“name”: /joey?/i})

Page 36: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

query demo

Page 37: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

save & insert: the insert equivalents

!

db.contacts.save({"name": "Cheryl Clark", "cell": "575-541-1360"})

Collection

What to save

Page 38: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

Databases and collections are

created automagically

Page 39: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

Databases and collections are

created automagically

Page 40: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

Databases and collections are created automagically

Page 41: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

How to save and restore mongoDB

databasesmongodump / mongorestore

!!

for ex., !

mongodump -d whichflix !

saves to a folder called dump

Page 42: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

mongoDB & PHPthe equivalent of mysqli_connect

Page 43: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

need to install mongoDB librarypip install pymongo

Page 44: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

then …

>>> from pymongo import MongoClient !>>> client = MongoClient() >>> db = client.abductions !>>> db.implants.find_one() {u'implant': u'metal cylinder', u'_id': ObjectId('53409e18fed08dc8b0001034'), u'name': u'Dana Scully'} >>>

Page 45: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

more later

Page 46: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

replication(historically master/slave)

Page 47: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

writes are only to the primary

Page 48: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

secondaries

Page 49: 20+ billion web pages x 20KB = 400+ terabytes (to just ... · dealing w/ lots of data 20+ billion web pages x 20KB = 400+ terabytes (to just store it -- not to do sth.) One computer

master slave

master performs reads and writes

mongod --master (default port 27017)

slaves perform reads not writes

mongod --slave --dbpath /data/db/slave --port 10001 --source localhost:27017