MongoDB + Spark
@blimpyacht
Level Setting
TROUGH OF DISILLUSIONMENT
Interactive Shell / Easy(-er) / Caching
[Diagram: the Spark/Hadoop ecosystem, built up layer by layer across the slides]
Distributed Data: HDFS
Distributed Resources: YARN, Mesos, Spark Stand Alone
Distributed Processing: MapReduce, Spark
Domain Specific Languages: Hive, Pig
Spark components: SparkSQL, Spark Shell, SparkStreaming
[Diagram: cluster topology — a Driver Application, using the Java Driver and the Hadoop Connector, connects to the Master node, which schedules executors on the Worker Nodes]
Parallelization
parallelize = x
Transformations
parallelize = x;  t(x) = x';  t(x') = x''
Transformations: map(func), filter(func), union(otherRDD), intersection(otherRDD), distinct()
Action
parallelize = x;  t(x) = x';  t(x') = x'';  f(x'') = y
Actions: collect(), count(), first(), take(n), reduce(func)
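Transformations are lazy: nothing is computed until an action forces evaluation of the whole chain. Since the deck shows no runnable Spark context, here is the same transform-then-action shape sketched with Java's Stream API as a stand-in (a deliberate analogy, not Spark itself):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LazyPipeline {
    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);

        // "Transformations": lazy, nothing runs yet
        // (analogous to rdd.filter(...).map(...) in Spark)
        List<Integer> result = data.stream()
            .filter(x -> x % 2 == 1)       // keep odd values
            .map(x -> x * x)               // square them
            .collect(Collectors.toList()); // "Action": forces evaluation

        System.out.println(result); // [1, 9, 25]
    }
}
```

The same principle lets Spark fuse the filter and map into one pass over the data instead of materializing an intermediate collection.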
Lineage
Parallelize → Transform → Transform → Action
http://www.blimpyacht.com/2016/02/03/a-visual-guide-to-the-spark-hadoop-ecosystem/
https://github.com/mongodb/mongo-hadoop
Spark Configuration

```java
Configuration conf = new Configuration();
conf.set(
    "mongo.job.input.format",
    "com.mongodb.hadoop.MongoInputFormat");
conf.set(
    "mongo.input.uri",
    "mongodb://localhost:27017/db.collection");
```
Spark Context

```java
JavaPairRDD<Object, BSONObject> documents = context.newAPIHadoopRDD(
    conf,
    MongoInputFormat.class,
    Object.class,
    BSONObject.class
);
```
Spark Submit

```shell
/usr/local/spark-1.5.1/bin/spark-submit \
  --class com.mongodb.spark.examples.DataframeExample \
  --master local \
  Examples-1.0-SNAPSHOT.jar
```
```java
JavaRDD<Message> messages = documents.map(
    new Function<Tuple2<Object, BSONObject>, Message>() {
        public Message call(Tuple2<Object, BSONObject> tuple) {
            BSONObject header = (BSONObject) tuple._2.get("headers");
            Message m = new Message();
            m.setTo((String) header.get("To"));
            m.setX_From((String) header.get("From"));
            m.setMessage_ID((String) header.get("Message-ID"));
            m.setBody((String) tuple._2.get("body"));
            return m;
        }
    });
```
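The Message class the map function calls into is never shown in the deck; a minimal POJO consistent with the setters used above might look like this (field names are guesses matching those setters; Serializable because Spark ships such objects to executors):

```java
import java.io.Serializable;

// Hypothetical sketch of the Message POJO assumed by the map function;
// only the fields implied by the setters on the slide are included.
public class Message implements Serializable {
    private String to;
    private String x_From;
    private String message_ID;
    private String body;

    public void setTo(String to) { this.to = to; }
    public String getTo() { return to; }

    public void setX_From(String xFrom) { this.x_From = xFrom; }
    public String getX_From() { return x_From; }

    public void setMessage_ID(String messageId) { this.message_ID = messageId; }
    public String getMessage_ID() { return message_ID; }

    public void setBody(String body) { this.body = body; }
    public String getBody() { return body; }
}
```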
THE FUTURE
AND BEYOND THE INFINITE
Spark Connector
Aggregation Filters: $match | $project | $group
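As an illustration of what those three stages do, here is a pipeline in mongo-shell syntax (the "messages" collection and its field names are made up for this sketch, not taken from the deck):

```javascript
// Hypothetical pipeline; collection and field names are illustrative only.
db.messages.aggregate([
  { $match:   { "headers.To": "someone@example.com" } },      // filter documents server-side
  { $project: { body: 1, "headers.From": 1 } },               // keep only the needed fields
  { $group:   { _id: "$headers.From", count: { $sum: 1 } } }  // count messages per sender
])
```

Pushing these stages down to MongoDB means only the matched, trimmed documents cross the wire into Spark.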
Data Locality: mongos
THANKS!
@blimpyacht