MongoDB + Spark
@blimpyacht
Level Setting
TROUGH OF DISILLUSIONMENT
Interactive Shell / Easy(-er) / Caching
[Diagram: the Spark/Hadoop ecosystem, built up layer by layer across the slides]
Distributed Data: HDFS
Distributed Resources: YARN, Mesos, Spark Stand Alone
Distributed Processing: MapReduce, Spark
Domain Specific Languages: Hive, Pig
Spark components: SparkSQL, Spark Shell, SparkStreaming
[Diagram: cluster topology — a Driver Application, using the Java Driver and the Hadoop Connector, connects to the Master node, which schedules executors on the Worker Nodes]
Parallelization
parallelize = x
Transformations
parallelize = x;  t(x) = x';  t(x') = x''
Transformations: map(func), filter(func), union(otherRDD), intersection(otherRDD), distinct()
Action
parallelize = x;  t(x) = x';  t(x') = x'';  f(x'') = y
Actions: collect(), count(), first(), take(n), reduce(func)
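Transformations are lazy: nothing is computed until an action forces evaluation of the whole chain. Since the deck shows no runnable Spark context, here is the same transform-then-action shape sketched with Java's Stream API as a stand-in (a deliberate analogy, not Spark itself):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LazyPipeline {
    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);

        // "Transformations": lazy, nothing runs yet
        // (analogous to rdd.filter(...).map(...) in Spark)
        List<Integer> result = data.stream()
            .filter(x -> x % 2 == 1)       // keep odd values
            .map(x -> x * x)               // square them
            .collect(Collectors.toList()); // "Action": forces evaluation

        System.out.println(result); // [1, 9, 25]
    }
}
```

The same principle lets Spark fuse the filter and map into one pass over the data instead of materializing an intermediate collection.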
Lineage
Parallelize → Transform → Transform → Action
http://www.blimpyacht.com/2016/02/03/a-visual-guide-to-the-spark-hadoop-ecosystem/
https://github.com/mongodb/mongo-hadoop
Spark Configuration

```java
Configuration conf = new Configuration();
conf.set(
    "mongo.job.input.format",
    "com.mongodb.hadoop.MongoInputFormat");
conf.set(
    "mongo.input.uri",
    "mongodb://localhost:27017/db.collection");
```
Spark Context

```java
JavaPairRDD<Object, BSONObject> documents = context.newAPIHadoopRDD(
    conf,
    MongoInputFormat.class,
    Object.class,
    BSONObject.class
);
```
Spark Submit

```shell
/usr/local/spark-1.5.1/bin/spark-submit \
  --class com.mongodb.spark.examples.DataframeExample \
  --master local \
  Examples-1.0-SNAPSHOT.jar
```
```java
JavaRDD<Message> messages = documents.map(
    new Function<Tuple2<Object, BSONObject>, Message>() {
        public Message call(Tuple2<Object, BSONObject> tuple) {
            BSONObject header = (BSONObject) tuple._2.get("headers");
            Message m = new Message();
            m.setTo((String) header.get("To"));
            m.setX_From((String) header.get("From"));
            m.setMessage_ID((String) header.get("Message-ID"));
            m.setBody((String) tuple._2.get("body"));
            return m;
        }
    });
```
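The Message class the map function calls into is never shown in the deck; a minimal POJO consistent with the setters used above might look like this (field names are guesses matching those setters; Serializable because Spark ships such objects to executors):

```java
import java.io.Serializable;

// Hypothetical sketch of the Message POJO assumed by the map function;
// only the fields implied by the setters on the slide are included.
public class Message implements Serializable {
    private String to;
    private String x_From;
    private String message_ID;
    private String body;

    public void setTo(String to) { this.to = to; }
    public String getTo() { return to; }

    public void setX_From(String xFrom) { this.x_From = xFrom; }
    public String getX_From() { return x_From; }

    public void setMessage_ID(String messageId) { this.message_ID = messageId; }
    public String getMessage_ID() { return message_ID; }

    public void setBody(String body) { this.body = body; }
    public String getBody() { return body; }
}
```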
THE FUTURE
AND BEYOND THE INFINITE
Spark Connector
Aggregation Filters: $match | $project | $group
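As an illustration of what those three stages do, here is a pipeline in mongo-shell syntax (the "messages" collection and its field names are made up for this sketch, not taken from the deck):

```javascript
// Hypothetical pipeline; collection and field names are illustrative only.
db.messages.aggregate([
  { $match:   { "headers.To": "someone@example.com" } },      // filter documents server-side
  { $project: { body: 1, "headers.From": 1 } },               // keep only the needed fields
  { $group:   { _id: "$headers.From", count: { $sum: 1 } } }  // count messages per sender
])
```

Pushing these stages down to MongoDB means only the matched, trimmed documents cross the wire into Spark.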
Data Locality: mongos
THANKS!
@blimpyacht