MongoDB, Hadoop and humongous data - MongoSV 2012


Description

Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig, and Streaming, you will learn how to do analytics and ETL on large datasets, with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can rejoice as well in a new way to write native Mongo MapReduce using the Hadoop Streaming interfaces.

Transcript of MongoDB, Hadoop and humongous data - MongoSV 2012

MongoDB, Hadoop & humongous data

Talking about:

• What is Humongous Data

• Humongous Data & You

• MongoDB & Data Processing

• Future of Humongous Data

What is humongous data?

2000: Google Inc. today announced it has released the largest search engine on the Internet.

Google’s new index, comprising more than 1 billion URLs

2008: "Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day)."

An unprecedented amount of data is being created and is accessible

Data Growth, 2000-2008 (millions of URLs):

2000: 1 · 2001: 4 · 2002: 10 · 2003: 24 · 2004: 55 · 2005: 120 · 2006: 250 · 2007: 500 · 2008: 1,000

Truly exponential growth is hard for people to grasp.

A BBC reporter said recently: "Your current PC is more powerful than the computer they had on board the first flight to the moon."

Moore’s Law applies to more than just CPUs.

Boiled down, it says that things double at regular intervals.

It’s exponential growth... and it applies to big data.

How BIG is it?

2001 2002 2003 2004 2005 2006 2007 2008

Why all this talk about BIG Data now?

In the past few years, open source software emerged that enables ‘us’ to handle BIG Data.

The Big Data Story is actually two stories.

Doers & Tellers are talking about different things.

http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september

Tellers

Doers

Doers talk a lot more about actual solutions

They know it’s a two-sided story

Processing

Storage

Takeaways

MongoDB and Hadoop

MongoDB for storage & operations

Hadoop for processing & analytics

MongoDB & Data Processing

Applications have complex needs

MongoDB ideal operational database

MongoDB ideal for BIG data

Not a data processing engine, but provides processing functionality

Many options for Processing Data

•Process in MongoDB using Map Reduce

•Process in MongoDB using Aggregation Framework

•Process outside MongoDB (using Hadoop)

MongoDB Map Reduce

Data → Map() → emit(k,v) → Sort(k) → Group(k) → Reduce(k,values) → (k,v) → Finalize(k,v) → (k,v) → MongoDB

Map iterates on documents; inside map, the document is `this` (1 at a time per shard).

Reduce's input matches its output, so it can run multiple times for the same key.
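As a rough illustration (not part of the original deck), counting hashtags with MongoDB's built-in map reduce could look like this minimal sketch, assuming pymongo and the test.live tweet collection used in the demo later; the map and reduce bodies are JavaScript strings executed by the server:

from pymongo import MongoClient
from bson.code import Code

db = MongoClient()["test"]

# Map: runs once per document; the document is `this`
map_js = Code("""
function () {
    (this.entities.hashtags || []).forEach(function (tag) {
        emit(tag.text, 1);            // emit(k, v)
    });
}
""")

# Reduce: output type matches input type, so it can run multiple times per key
reduce_js = Code("""
function (key, values) {
    return Array.sum(values);
}
""")

# mapReduce is a database command; results land in the "twit_hashtags" collection
db.command("mapReduce", "live", map=map_js, reduce=reduce_js, out="twit_hashtags")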

MongoDB Map Reduce

MongoDB map reduce is quite capable... but it has limits:

- JavaScript is not the best language for processing map reduce

- JavaScript is limited in external data-processing libraries

- It adds load to the data store

MongoDB Aggregation

Most uses of MongoDB Map Reduce were for aggregation.

The Aggregation Framework is optimized for aggregate queries.

Realtime aggregation, similar to SQL GROUP BY.
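For a taste of the GROUP BY analogy (a sketch added here, not from the talk), the same kind of counting can be issued from Python with a reasonably recent pymongo, where aggregate returns a cursor; user.lang is just an assumed field from the tweet documents:

from pymongo import MongoClient

db = MongoClient()["test"]

pipeline = [
    {"$group": {"_id": "$user.lang", "count": {"$sum": 1}}},   # like SQL GROUP BY user.lang
    {"$sort": {"count": -1}},
    {"$limit": 10},
]

for row in db.live.aggregate(pipeline):
    print(row)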

MongoDB & Hadoop

MongoDB (single server or sharded cluster) → InputFormat → Map → Combiner / Sort → Partitioner → Reduce → OutputFormat → MongoDB

• The InputFormat creates a list of input splits, the same as Mongo's shard chunks (64mb).

• A RecordReader runs on the same thread as map; each split drives many Map(k1, v1, ctx) operations, 1 at a time per input split, each emitting with ctx.write(k2, v2).

• Per split, Combiner(k2, values2) and Sort(k2) produce (k2, v3) pairs.

• Partitioner(k2) routes keys to the reducer threads.

• Reduce(k2, values3) runs once per key, producing the final (kf, vf).

• The OutputFormat writes the results back to MongoDB.
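To make the flow concrete, here is a toy in-process sketch (my own illustration, not the mongo-hadoop connector) of map → sort/group by key → reduce over a couple of fake tweet documents:

from itertools import groupby
from operator import itemgetter

def map_fn(doc):
    # Map(k1, v1, ctx): emit one (hashtag, 1) pair per hashtag, like ctx.write(k2, v2)
    for tag in doc.get("entities", {}).get("hashtags", []):
        yield tag["text"], 1

def reduce_fn(key, values):
    # Reduce(k2, values3): runs once per key
    return {"_id": key, "count": sum(values)}

docs = [
    {"entities": {"hashtags": [{"text": "mongodb"}, {"text": "hadoop"}]}},
    {"entities": {"hashtags": [{"text": "mongodb"}]}},
]

# Sort(k2) / group by key: Hadoop does this between the map and reduce phases
pairs = sorted((kv for doc in docs for kv in map_fn(doc)), key=itemgetter(0))
results = [reduce_fn(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]
print(results)   # [{'_id': 'hadoop', 'count': 1}, {'_id': 'mongodb', 'count': 2}]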

DEMO TIME

DEMO

• Install Hadoop MongoDB Plugin

• Import tweets from twitter

• Write mapper in Python using Hadoop streaming

• Write reducer in Python using Hadoop streaming

• Call myself a data scientist

Installing Mongo-hadoop

hadoop_version='0.23'
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"

git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh

https://gist.github.com/1887726

Grokking Twitter

curl https://stream.twitter.com/1/statuses/sample.json \
  -u <login>:<password> \
  | mongoimport -d test -c live

... let it run for about 2 hours
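A quick sanity check (a sketch added here, assuming a recent pymongo is installed) to confirm how many tweets landed in test.live before running the job:

from pymongo import MongoClient

# count the documents captured by the mongoimport above
print(MongoClient()["test"]["live"].count_documents({}))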

DEMO 1

Map Hashtags in Python

#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    for doc in documents:
        for hashtag in doc['entities']['hashtags']:
            yield {'_id': hashtag['text'], 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."

Reduce hashtags in Python

#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key.encode('utf8'), 'count': _count}

BSONReducer(reducer)

All together

hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
  -mapper examples/twitter/twit_hashtag_map.py \
  -reducer examples/twitter/twit_hashtag_reduce.py \
  -inputURI mongodb://127.0.0.1/test.live \
  -outputURI mongodb://127.0.0.1/test.twit_reduction \
  -file examples/twitter/twit_hashtag_map.py \
  -file examples/twitter/twit_hashtag_reduce.py

Popular Hash Tags

db.twit_hashtags.find().sort({'count' : -1})

{ "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
{ "_id" : "teamfollowback", "count" : 200 }
{ "_id" : "RT", "count" : 150 }
{ "_id" : "Arsenal", "count" : 148 }
{ "_id" : "milars", "count" : 145 }
{ "_id" : "sanremo", "count" : 145 }
{ "_id" : "LoseMyNumberIf", "count" : 139 }
{ "_id" : "RelationshipsShould", "count" : 137 }
{ "_id" : "Bahrain", "count" : 129 }
{ "_id" : "bahrain", "count" : 125 }
{ "_id" : "oomf", "count" : 117 }
{ "_id" : "BabyKillerOcalan", "count" : 106 }
{ "_id" : "TeamFollowBack", "count" : 105 }
{ "_id" : "WhyDoPeopleThink", "count" : 102 }
{ "_id" : "np", "count" : 100 }

DEMO 2

Aggregation in Mongo 2.1

db.live.aggregate(
  { $unwind : "$entities.hashtags" },
  { $match : { "entities.hashtags.text" : { $exists : true } } },
  { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
  { $sort : { count : -1 } },
  { $limit : 10 }
)

Popular Hash Tags

db.twit_hashtags.aggregate(a)

{
  "result" : [
    { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
    { "_id" : "teamfollowback", "count" : 200 },
    { "_id" : "RT", "count" : 150 },
    { "_id" : "Arsenal", "count" : 148 },
    { "_id" : "milars", "count" : 145 },
    { "_id" : "sanremo", "count" : 145 },
    { "_id" : "LoseMyNumberIf", "count" : 139 },
    { "_id" : "RelationshipsShould", "count" : 137 },
    { "_id" : "Bahrain", "count" : 129 },
    { "_id" : "bahrain", "count" : 125 }
  ],
  "ok" : 1
}

Future of Humongous Data

What is BIG?

BIG today is normal tomorrow

Data Growth, 2000-2011 (millions of URLs):

2000: 1 · 2001: 4 · 2002: 10 · 2003: 24 · 2004: 55 · 2005: 120 · 2006: 250 · 2007: 500 · 2008: 1,000 · 2009: 2,150 · 2010: 4,400 · 2011: 9,000


2012: Generating over 250 million tweets per day.

MongoDB enables us to scale with the redefinition of BIG.

New processing tools like Hadoop & Storm are enabling us to process the new BIG.

Hadoop is our first step

MongoDB is committed to working with the best data tools, including Hadoop, Storm, Disco, Spark & more.

Questions?

http://spf13.com
http://github.com/spf13
@spf13

download at github.com/mongodb/mongo-hadoop