MongoDB Hadoop and Humongous Data


Transcript of MongoDB Hadoop and Humongous Data

Page 1: MongoDB Hadoop and Humongous Data

MongoDB Hadoop & humongous data

Page 2: MongoDB Hadoop and Humongous Data

Talking about:

What is Humongous Data

Humongous Data & You

MongoDB & Data Processing

Future of Humongous Data

Page 3: MongoDB Hadoop and Humongous Data

What is humongous data?

Page 4: MongoDB Hadoop and Humongous Data

2000: Google Inc. today announced it has released the largest search engine on the Internet. Google’s new index, comprising more than 1 billion URLs.

Page 5: MongoDB Hadoop and Humongous Data

2008: Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day).

Page 6: MongoDB Hadoop and Humongous Data

An unprecedented amount of data is being created and is accessible


Page 7: MongoDB Hadoop and Humongous Data

Data Growth (Millions of URLs)

2000: 1, 2001: 4, 2002: 10, 2003: 24, 2004: 55, 2005: 120, 2006: 250, 2007: 500, 2008: 1,000

Page 8: MongoDB Hadoop and Humongous Data

Truly Exponential Growth

Is hard for people to grasp.

A BBC reporter recently said: "Your current PC is more powerful than the computer they had on board the first flight to the moon."

Page 9: MongoDB Hadoop and Humongous Data

Moore’s Law

Applies to more than just CPUs.

Boiled down, it says that things double at regular intervals.

It’s exponential growth... and it applies to big data.

Page 10: MongoDB Hadoop and Humongous Data

How BIG is it?


Page 11: MongoDB Hadoop and Humongous Data

How BIG is it?

2008


Page 12: MongoDB Hadoop and Humongous Data

How BIG is it?

2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001

Page 13: MongoDB Hadoop and Humongous Data

Why all this talk about BIG Data now?

Page 14: MongoDB Hadoop and Humongous Data

In the past few years, open source software has emerged that enables ‘us’ to handle BIG Data.

Page 15: MongoDB Hadoop and Humongous Data

The Big Data Story

Page 16: MongoDB Hadoop and Humongous Data

Is actually two stories


Page 17: MongoDB Hadoop and Humongous Data

Doers & Tellers are talking about different things.

http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september

Page 18: MongoDB Hadoop and Humongous Data

Tellers

Page 19: MongoDB Hadoop and Humongous Data

Doers

Page 20: MongoDB Hadoop and Humongous Data

Doers talk a lot more about actual solutions


Page 21: MongoDB Hadoop and Humongous Data

They know it’s a two-sided story:

Processing

Storage

Page 22: MongoDB Hadoop and Humongous Data

Takeaways

MongoDB and Hadoop

MongoDB for storage & operations

Hadoop for processing & analytics

Page 23: MongoDB Hadoop and Humongous Data

MongoDB & Data Processing

Page 24: MongoDB Hadoop and Humongous Data

Applications have complex needs.

MongoDB is an ideal operational database.

MongoDB is ideal for BIG data.

It is not a data processing engine, but it provides processing functionality.

Page 25: MongoDB Hadoop and Humongous Data

Many options for Processing Data

• Process in MongoDB using Map Reduce

• Process in MongoDB using the Aggregation Framework

• Process outside MongoDB (using Hadoop)

Page 26: MongoDB Hadoop and Humongous Data

MongoDB Map Reduce

Data → Map() → emit(k,v) → Sort(k) → Group(k) → Reduce(k,values) → Finalize(k,v) → MongoDB

map iterates on documents; the current document is $this

1 at a time per shard

Input matches output

Can run multiple times
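As a rough illustration of this flow, here is a minimal sketch of a map/reduce job in the mongo shell. It counts hashtags in a collection of tweets named live, the same collection and document shape used in the demo later in this deck; the output collection name hashtag_counts is made up for the example.

var map = function () {
    // 'this' is the current document; emit one (hashtag, 1) pair per hashtag
    this.entities.hashtags.forEach(function (tag) {
        emit(tag.text, 1);
    });
};

var reduce = function (key, values) {
    // runs per key on the grouped values; the return value must have the
    // same shape as the emitted values (here: a number), since reduce can re-run
    return Array.sum(values);
};

db.live.mapReduce(map, reduce, { out: "hashtag_counts" });

Because reduce can run multiple times over partial results (as noted above), it has to be written so that reducing its own output again gives the same answer.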

Page 27: MongoDB Hadoop and Humongous Data

MongoDB Map Reduce

MongoDB map reduce is quite capable... but it has limits:

- JavaScript is not the best language for map reduce processing

- JavaScript has limited access to external data processing libraries

- It adds load to the data store

Page 28: MongoDB Hadoop and Humongous Data

MongoDB Aggregation

Most uses of MongoDB Map Reduce were for aggregation.

The Aggregation Framework is optimized for aggregate queries.

Real-time aggregation, similar to SQL GROUP BY.
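To make the GROUP BY analogy concrete, here is a minimal sketch in the mongo shell, using the 2.1-era stage-per-argument calling style shown later in this deck. It counts tweets per user in the live collection from the demo; the user.screen_name field is an assumption about the imported tweet documents.

// roughly: SELECT user, COUNT(*) AS count FROM live GROUP BY user ORDER BY count DESC LIMIT 10
db.live.aggregate(
    { $group : { _id : "$user.screen_name", count : { $sum : 1 } } },
    { $sort : { count : -1 } },
    { $limit : 10 }
);

The hashtag version of this pipeline appears on the "Aggregation in Mongo 2.1" slide below.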

Page 29: MongoDB Hadoop and Humongous Data

MongoDB & Hadoop

MongoDB (single server or sharded cluster)

→ InputFormat: creates a list of input splits, the same as Mongo's shard chunks (64 MB)

→ RecordReader: one per split, runs on the same thread as map

→ Map(k1, v1, ctx) → ctx.write(k2, v2): many map operations, 1 at a time per input split

→ Combiner(k2, values2) → k2, v3

→ Partitioner(k2) → Sort(k2)

→ Reduce(k2, values3) → kf, vf: runs once per key, on reducer threads

→ OutputFormat → MongoDB

Page 30: MongoDB Hadoop and Humongous Data

DEMO TIME

Page 31: MongoDB Hadoop and Humongous Data

DEMO

Install Hadoop MongoDB plugin

Import tweets from Twitter

Write mapper in Python using Hadoop streaming

Write reducer in Python using Hadoop streaming

Call myself a data scientist

Page 32: MongoDB Hadoop and Humongous Data

Installing Mongo-hadoop

hadoop_version='0.23'
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"

git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh

https://gist.github.com/1887726

Page 33: MongoDB Hadoop and Humongous Data

Groking Twitter

curl \
  https://stream.twitter.com/1/statuses/sample.json \
  -u <login>:<password> \
  | mongoimport -d test -c live

... let it run for about 2 hours
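Before running the job, a quick sanity check in the mongo shell (a sketch, not from the original deck) confirms that the import produced documents with the fields the mapper expects:

// how many tweets were captured
db.live.count()

// one tweet that has at least one entry in entities.hashtags
db.live.findOne({ "entities.hashtags.0" : { $exists : true } })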

Page 34: MongoDB Hadoop and Humongous Data

DEMO 1

Page 35: MongoDB Hadoop and Humongous Data

Map Hashtags in Python

#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    for doc in documents:
        for hashtag in doc['entities']['hashtags']:
            yield {'_id': hashtag['text'], 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."

Page 36: MongoDB Hadoop and Humongous Data

Reduce Hashtags in Python

#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key.encode('utf8'), 'count': _count}

BSONReducer(reducer)

Page 37: MongoDB Hadoop and Humongous Data

All together

hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
  -mapper examples/twitter/twit_hashtag_map.py \
  -reducer examples/twitter/twit_hashtag_reduce.py \
  -inputURI mongodb://127.0.0.1/test.live \
  -outputURI mongodb://127.0.0.1/test.twit_reduction \
  -file examples/twitter/twit_hashtag_map.py \
  -file examples/twitter/twit_hashtag_reduce.py

Page 38: MongoDB Hadoop and Humongous Data

Popular Hash Tags

db.twit_hashtags.find().sort({ 'count' : -1 })

{ "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
{ "_id" : "teamfollowback", "count" : 200 }
{ "_id" : "RT", "count" : 150 }
{ "_id" : "Arsenal", "count" : 148 }
{ "_id" : "milars", "count" : 145 }
{ "_id" : "sanremo", "count" : 145 }
{ "_id" : "LoseMyNumberIf", "count" : 139 }
{ "_id" : "RelationshipsShould", "count" : 137 }
{ "_id" : "oomf", "count" : 117 }
{ "_id" : "TeamFollowBack", "count" : 105 }
{ "_id" : "WhyDoPeopleThink", "count" : 102 }
{ "_id" : "np", "count" : 100 }

Page 39: MongoDB Hadoop and Humongous Data

DEMO 2

Page 40: MongoDB Hadoop and Humongous Data

Aggregation in Mongo 2.1

db.live.aggregate(
    { $unwind : "$entities.hashtags" },
    { $match : { "entities.hashtags.text" : { $exists : true } } },
    { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
    { $sort : { count : -1 } },
    { $limit : 10 }
)

Page 41: MongoDB Hadoop and Humongous Data

Popular Hash Tags

db.twit_hashtags.aggregate(a)

{
    "result" : [
        { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
        { "_id" : "teamfollowback", "count" : 200 },
        { "_id" : "RT", "count" : 150 },
        { "_id" : "Arsenal", "count" : 148 },
        { "_id" : "milars", "count" : 145 },
        { "_id" : "sanremo", "count" : 145 },
        { "_id" : "LoseMyNumberIf", "count" : 139 },
        { "_id" : "RelationshipsShould", "count" : 137 }
    ],
    "ok" : 1
}

Page 42: MongoDB Hadoop and Humongous Data

The Future of Humongous Data

Page 43: MongoDB Hadoop and Humongous Data

What is BIG?

BIG today is normal tomorrow


Page 44: MongoDB Hadoop and Humongous Data

Data Growth (Millions of URLs)

2000: 1, 2001: 4, 2002: 10, 2003: 24, 2004: 55, 2005: 120, 2006: 250, 2007: 500, 2008: 1,000, 2009: 2,150, 2010: 4,400, 2011: 9,000


Page 46: MongoDB Hadoop and Humongous Data

2012: Generating over 250 million tweets per day

Page 47: MongoDB Hadoop and Humongous Data

MongoDB enables us to scale with the redefinition of BIG.

New processing tools like Hadoop & Storm are enabling us to process the new BIG.


Page 48: MongoDB Hadoop and Humongous Data

Hadoop is our first step


Page 49: MongoDB Hadoop and Humongous Data

MongoDB is committed to working with the best data tools, including Hadoop, Storm, Disco, Spark & more.

Page 50: MongoDB Hadoop and Humongous Data

Questions?

download at github.com/mongodb/mongo-hadoop

http://spf13.com
http://github.com/spf13
@spf13
