MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop & humongous data

Talking about:
- What is Humongous Data
- Humongous Data & You
- MongoDB & Data Processing
- Future of Humongous Data
Steve Francia (AKA @spf13)
15+ years building the internet
Chief Solutions Architect, responsible for drivers, integrations, web & docs
Father, husband, skateboarder
What is humongous data?
2000: "Google Inc. today announced it has released the largest search engine on the Internet. Google's new index, comprising more than 1 billion URLs."

2008: "Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day)."
An unprecedented amount of data is being created and is accessible
Data Growth (millions of URLs indexed):

2000: 1, 2001: 4, 2002: 10, 2003: 24, 2004: 55, 2005: 120, 2006: 250, 2007: 500, 2008: 1,000
Truly exponential growth is hard for people to grasp. As a BBC reporter recently put it: "Your current PC is more powerful than the computer they had on board the first flight to the moon."
Moore's Law applies to more than just CPUs. Boiled down, it says that things double at regular intervals. It's exponential growth, and it applies to big data.
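That "doubling at regular intervals" claim can be checked against the URL counts in the growth chart above (1 million URLs in 2000 to roughly 1,000 million in 2008); a quick back-of-the-envelope calculation:

```python
import math

# URL index size from the talk's growth chart (millions of URLs).
start, end = 1, 1000
years = 2008 - 2000

doublings = math.log2(end / start)
doubling_time_months = years * 12 / doublings

print(round(doublings, 2))             # -> 9.97 doublings in 8 years
print(round(doubling_time_months, 1))  # -> 9.6 months per doubling
```

So the index doubled roughly every ten months over that span, comfortably exponential.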
How BIG is it?

[chart: each year's "BIG" from 2001 through 2008, drawn to scale; 2008 dwarfs every preceding year]
Why all this talk about BIG Data now? In the past few years, open source software has emerged that enables "us" to handle BIG Data.
The Big Data Story

It is actually two stories: Doers & Tellers talking about different things.
http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september

Doers talk a lot more about actual solutions. They know it's a two-sided story:
- Processing
- Storage
Takeaways: MongoDB and Hadoop
- MongoDB for storage & operations
- Hadoop for processing & analytics
MongoDB & Data Processing

- Applications have complex needs
- MongoDB is an ideal operational database
- MongoDB is ideal for BIG data
- It is not a data processing engine, but it provides processing functionality
Many options for Processing Data
•Process in MongoDB using Map Reduce
•Process in MongoDB using Aggregation Framework
•Process outside MongoDB (using Hadoop)
MongoDB Map Reduce

Data flow: Data -> Map() emits (k,v) -> Sort(k) -> Group(k) -> Reduce(k,values) -> (k,v) -> Finalize(k,v) -> (k,v) -> MongoDB

- map iterates on documents; the current document is $this, 1 at a time per shard
- Reduce's input matches its output, so it can run multiple times
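The flow above can be sketched in plain Python. This is only a toy simulation of the map/sort/group/reduce/finalize stages, not the server implementation (which runs these functions as JavaScript inside mongod):

```python
from itertools import groupby

def map_reduce(docs, map_fn, reduce_fn, finalize_fn=None):
    # Map: each document may emit zero or more (k, v) pairs.
    emitted = [kv for doc in docs for kv in map_fn(doc)]
    # Sort and group by key, as the server does before reducing.
    emitted.sort(key=lambda kv: kv[0])
    out = {}
    for k, group in groupby(emitted, key=lambda kv: kv[0]):
        values = [v for _, v in group]
        # Reduce output must have the same shape as its input,
        # which is what lets the server re-run reduce incrementally.
        v = reduce_fn(k, values) if len(values) > 1 else values[0]
        out[k] = finalize_fn(k, v) if finalize_fn else v
    return out

# Hypothetical documents, counting tag occurrences.
docs = [{'tags': ['a', 'b']}, {'tags': ['a']}]
result = map_reduce(
    docs,
    map_fn=lambda d: [(t, 1) for t in d['tags']],
    reduce_fn=lambda k, values: sum(values),
)
print(result)  # -> {'a': 2, 'b': 1}
```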
MongoDB Map Reduce

MongoDB map reduce is quite capable... but with limits:
- JavaScript is not the best language for processing map reduce
- JavaScript is limited in external data processing libraries
- It adds load to the data store
MongoDB Aggregation

Most uses of MongoDB Map Reduce were for aggregation. The Aggregation Framework is optimized for aggregate queries: realtime aggregation, similar to SQL's GROUP BY.
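The framework expresses a query as a pipeline of stage documents, much like chained SQL clauses. A minimal sketch of the shape (the 'tags' collection and its 'tag' field are hypothetical illustration names, not from the talk):

```python
# SQL analogy: SELECT tag, COUNT(*) AS count FROM tags
#              GROUP BY tag ORDER BY count DESC
pipeline = [
    {'$group': {'_id': '$tag', 'count': {'$sum': 1}}},  # like GROUP BY + COUNT
    {'$sort': {'count': -1}},                           # like ORDER BY ... DESC
]
# With a live connection this would run as: db.tags.aggregate(pipeline)
print(pipeline[0]['$group']['_id'])  # -> $tag
```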
MongoDB & Hadoop

Data flows from MongoDB (single server or sharded cluster) through Hadoop and back:

- InputFormat creates a list of input splits, the same as Mongo's shard chunks (64mb)
- A RecordReader runs on the same thread as each map
- Map(k1, v1, ctx) calls ctx.write(k2, v2); many map operations, 1 at a time per input split
- Combiner(k2, values2) emits (k2, v3)
- Partitioner(k2) routes keys to reducer threads
- Sort(k2) orders the keys for each reducer
- Reduce(k2, values3) runs once per key, emitting (kf, vf)
- OutputFormat writes the results back to MongoDB
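One detail worth pulling out of that flow: the combiner pre-reduces each map task's output before the shuffle, so the network moves one pair per key instead of one per emit. A toy sketch of the idea (not the Hadoop API; a real combiner implements the Reducer interface in Java):

```python
from collections import defaultdict

def combine(pairs):
    # Pre-reduce (k2, v2) pairs emitted by one map task into (k2, v3).
    acc = defaultdict(int)
    for k, v in pairs:
        acc[k] += v
    return sorted(acc.items())

# Hypothetical output of one map task counting words.
map_output = [('mongodb', 1), ('hadoop', 1), ('mongodb', 1)]
print(combine(map_output))  # -> [('hadoop', 1), ('mongodb', 2)]
```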
DEMO TIME

- Install the Hadoop MongoDB plugin
- Import tweets from Twitter
- Write a mapper in Python using Hadoop streaming
- Write a reducer in Python using Hadoop streaming
- Call myself a data scientist
Installing Mongo-hadoop

hadoop_version='0.23'
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"

git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh

https://gist.github.com/1887726
Grokking Twitter

curl \
  https://stream.twitter.com/1/statuses/sample.json \
  -u <login>:<password> \
  | mongoimport -d test -c live

... let it run for about 2 hours
DEMO 1
Map Hashtags in Python

#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    for doc in documents:
        for hashtag in doc['entities']['hashtags']:
            yield {'_id': hashtag['text'], 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
Reduce hashtags in Python

#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key.encode('utf8'), 'count': _count}

BSONReducer(reducer)
All together
hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
  -mapper examples/twitter/twit_hashtag_map.py \
  -reducer examples/twitter/twit_hashtag_reduce.py \
  -inputURI mongodb://127.0.0.1/test.live \
  -outputURI mongodb://127.0.0.1/test.twit_reduction \
  -file examples/twitter/twit_hashtag_map.py \
  -file examples/twitter/twit_hashtag_reduce.py
Popular Hash Tags

db.twit_hashtags.find().sort({'count' : -1})

{ "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
{ "_id" : "teamfollowback", "count" : 200 }
{ "_id" : "RT", "count" : 150 }
{ "_id" : "Arsenal", "count" : 148 }
{ "_id" : "milars", "count" : 145 }
{ "_id" : "sanremo", "count" : 145 }
{ "_id" : "LoseMyNumberIf", "count" : 139 }
{ "_id" : "RelationshipsShould", "count" : 137 }
{ "_id" : "Bahrain", "count" : 129 }
{ "_id" : "bahrain", "count" : 125 }
{ "_id" : "oomf", "count" : 117 }
{ "_id" : "BabyKillerOcalan", "count" : 106 }
{ "_id" : "TeamFollowBack", "count" : 105 }
{ "_id" : "WhyDoPeopleThink", "count" : 102 }
{ "_id" : "np", "count" : 100 }
DEMO 2
Aggregation in Mongo 2.1

db.live.aggregate(
  { $unwind : "$entities.hashtags" },
  { $match : { "entities.hashtags.text" : { $exists : true } } },
  { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
  { $sort : { count : -1 } },
  { $limit : 10 }
)
Popular Hash Tags

db.twit_hashtags.aggregate(a)

{
  "result" : [
    { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
    { "_id" : "teamfollowback", "count" : 200 },
    { "_id" : "RT", "count" : 150 },
    { "_id" : "Arsenal", "count" : 148 },
    { "_id" : "milars", "count" : 145 },
    { "_id" : "sanremo", "count" : 145 },
    { "_id" : "LoseMyNumberIf", "count" : 139 },
    { "_id" : "RelationshipsShould", "count" : 137 },
    { "_id" : "Bahrain", "count" : 129 },
    { "_id" : "bahrain", "count" : 125 }
  ],
  "ok" : 1
}
Future of humongous data
What is BIG?
BIG today is normal tomorrow
Data Growth (millions of URLs indexed):

2000: 1, 2001: 4, 2002: 10, 2003: 24, 2004: 55, 2005: 120, 2006: 250, 2007: 500, 2008: 1,000, 2009: 2,150, 2010: 4,400, 2011: 9,000
2012: Over 250 million tweets are generated per day.
MongoDB enables us to scale with the redefinition of BIG.
New processing tools like Hadoop & Storm are enabling us to process the new BIG.
Hadoop is our first step. MongoDB is committed to working with the best data tools, including Hadoop, Storm, Disco, Spark & more.
download at github.com/mongodb/mongo-hadoop

Questions?

http://spf13.com
http://github.com/spf13
@spf13