Blazing Fast Analytics with MongoDB & Spark
Agenda
• The data challenge
• Spark
• Use Cases
• Connectors
• Demo
“Every two days now we create as much information as we did from the dawn of civilization up until 2003.”
Eric Schmidt, 2010
“Apache Spark is the Taylor Swift of big data software.”
Derrick Harris, Fortune
What is Spark?
Fast and general computing engine for clusters
• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, graph processing
• It’s fundamentally different to what’s come before
Why not just use Hadoop?
• Spark is FAST
– Faster to write.
– Faster to run.
• Up to 100x faster than Hadoop MapReduce in memory
• Up to 10x faster on disk
A visual comparison: Hadoop vs. Spark
RDD Operations
Transformations: map, filter, flatMap, mapPartitions, sample, union, join, groupByKey, reduceByKey
Actions: reduce, collect, count, save, lookupKey, take, foreach
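The split between transformations and actions can be illustrated with a plain-Python analogy. This sketch does not use the Spark API: Python's lazy iterators stand in for transformations, which only describe work, and `reduce` stands in for an action, which forces that work to run. The sample lines and the `run_pipeline` helper are made up for illustration.

```python
from functools import reduce
from itertools import chain


def run_pipeline(lines):
    """Chain lazy steps (flatMap, filter, map), then trigger them with reduce."""
    evaluated = []  # records which words were actually processed

    def record_length(word):
        evaluated.append(word)
        return len(word)

    # "Transformations": each step only wraps the previous one, nothing runs yet.
    words = chain.from_iterable(line.split() for line in lines)  # like flatMap
    long_words = filter(lambda w: len(w) > 5, words)             # like filter
    lengths = map(record_length, long_words)                     # like map

    evaluated_before_action = len(evaluated)

    # "Action": reduce pulls every element through the whole chain.
    total = reduce(lambda a, b: a + b, lengths, 0)
    return evaluated_before_action, total, evaluated


before, total, seen = run_pipeline(["spark mongodb", "spark streaming", "mongodb"])
print(before, total, seen)
```

Nothing is processed until the action runs, which is the property that lets an engine plan the whole chain before executing it.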
Spark higher level libraries
Spark: Spark SQL, Spark Streaming, MLlib, GraphX
Spark + MongoDB
Data Management
OLTP, Applications, Fine grained operations, Low Latency
Offline Processing, Analytics, Data Warehousing, High Throughput
Spark + MongoDB top use cases:
– Business Intelligence
– Data Warehousing
– Recommendation
– Log processing
– User Facing Services
– Fraud detection
MongoDB and Spark
Spark reading directly from MongoDB
Aggregation pipeline to Pre-filter
Aggregation pipeline filter: $match
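A minimal sketch of what a `$match` pre-filter does, assuming a made-up collection of chat messages with a hypothetical `flagged` field. The `pipeline` list has the shape of a MongoDB aggregation pipeline. The local `match` helper is only a stand-in that emulates equality matching in plain Python, so the effect is visible without a database. It is not the connector's API.

```python
import json


def match(documents, criteria):
    """Emulate the equality-only subset of $match on a list of dicts."""
    return [
        doc for doc in documents
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Hypothetical documents, invented for this illustration.
messages = [
    {"user": "a", "text": "I'm so in love!", "flagged": False},
    {"user": "b", "text": "Now send me your CC number", "flagged": True},
    {"user": "a", "text": "Ok", "flagged": False},
]

# The pipeline MongoDB would evaluate before any data reaches Spark.
pipeline = [{"$match": {"flagged": True}}]
print(json.dumps(pipeline))

# Only the matching documents would be shipped to Spark.
prefiltered = match(messages, pipeline[0]["$match"])
print(len(prefiltered), "of", len(messages), "documents pass the filter")
```

Filtering inside the database means fewer documents cross the network, which is the reason for pushing the filter down.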
Spark writing directly to MongoDB
Fraud Detection
I'm so in love!
Me, too<3
Now send me your CC number
?
Ok, XXXX-123-zzz
$$$
Sharing Workloads
Chat App
HDFS: Archiving, Data Crunching
Login, User Profile, Contacts, Messages, …
Fraud Detection, Segmentation, Recommendations
Spark
MongoDB + Spark Connector
MongoDB Spark Connector
https://spark-packages.org/?q=official+mongodb
[Diagram: MongoDB Shard, MongoDB Spark Connector, Spark]
https://github.com/mongodb/mongo-spark
Spark Streaming
Twitter Feed -> Spark
{"statuses": [{
  "coordinates": null,
  "favorited": false,
  "truncated": false,
  "created_at": "Mon Sep 24 03:35:21 +0000 2012",
  "id_str": "250075927172759552",
  "entities": {
    "urls": [],
    "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
    "user_mentions": []
  }
}]}
Spark aggregates the hashtags in the feed into per-minute counts.
One tweet tagged "freebandnames" at 03:35 produces:
{"time": "Mon Sep 24 03:35", "freebandnames": 1}
Four such tweets in the same minute produce:
{"time": "Mon Sep 24 03:35", "freebandnames": 4}
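The per-minute hashtag counts shown above could be computed along these lines. This is a plain-Python sketch of the aggregation only, not Spark Streaming code. The function name `hashtag_counts_per_minute` is made up, and the tweet is reduced to the fields the slides show.

```python
from collections import Counter
from datetime import datetime

# The sample tweet from the slides, reduced to the fields used here.
TWEET = {
    "created_at": "Mon Sep 24 03:35:21 +0000 2012",
    "id_str": "250075927172759552",
    "entities": {
        "urls": [],
        "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
        "user_mentions": [],
    },
}


def hashtag_counts_per_minute(statuses):
    """Group tweets by minute and count each hashtag within that minute."""
    counts = {}
    for status in statuses:
        created = datetime.strptime(status["created_at"], "%a %b %d %H:%M:%S %z %Y")
        minute = created.strftime("%a %b %d %H:%M")  # drop seconds, zone, year
        bucket = counts.setdefault(minute, Counter())
        for tag in status["entities"]["hashtags"]:
            bucket[tag["text"]] += 1
    # One output document per minute, shaped like the slides' examples.
    return [{"time": minute, **dict(tags)} for minute, tags in counts.items()]


print(hashtag_counts_per_minute([TWEET]))
print(hashtag_counts_per_minute([TWEET] * 4))
```

In a streaming job the same grouping would run over each incoming batch rather than over a fixed list.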
Capped Collection
MongoDB and Spark Streaming feature
{"time": "Mon Sep 24 03:35", "freebandnames": 4}
{"time": "Mon Nov 5 09:40", "mongoDBLondon": 400}
{"time": "Mon Nov 5 11:50", "spark": 7556}
{"time": "Mon Nov 24 12:50", "itshappening": 100}
Tailable Cursor
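As a rough plain-Python analogy (not MongoDB's or the connector's API), a capped collection behaves like a fixed-size log that overwrites its oldest entries, and a tailable cursor behaves like a reader that remembers its position and picks up only what was added since. The `CappedLog` class and its methods are hypothetical names used for illustration.

```python
from collections import deque


class CappedLog:
    """Toy stand-in for a capped collection: fixed capacity, insertion order kept."""

    def __init__(self, max_docs):
        self._docs = deque(maxlen=max_docs)  # oldest entries fall off the front
        self._inserted = 0                   # total inserts since creation

    def insert(self, doc):
        self._docs.append(doc)
        self._inserted += 1

    def tail(self, position=0):
        """Return (documents added since `position`, new position)."""
        oldest = self._inserted - len(self._docs)  # index of the oldest kept doc
        start = max(position, oldest)              # skip anything already dropped
        return list(self._docs)[start - oldest:], self._inserted


log = CappedLog(max_docs=3)
for doc in [
    {"time": "Mon Sep 24 03:35", "freebandnames": 4},
    {"time": "Mon Nov 5 09:40", "mongoDBLondon": 400},
    {"time": "Mon Nov 5 11:50", "spark": 7556},
    {"time": "Mon Nov 24 12:50", "itshappening": 100},
]:
    log.insert(doc)

docs, position = log.tail()              # first read: whatever is still in the log
log.insert({"time": "later", "spark": 1})
new_docs, position = log.tail(position)  # second read: only the new document
print(len(docs), len(new_docs))
```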
MongoDB + Spark MLlib Demo
Collaborative Filtering
• Two parts
• Collaborative: uses rating preferences from several users
• Filtering: recommends items based on those preferences
UserId / MovieId Star Wars Toy Story Frozen
Buzz 4 4 5
Woody 5 4
Jessie 5 ?
Movie Ratings as a matrix
MLlib ALS
• Approximate the ratings matrix by User and Movie latent factor matrices
User latent factors f(i):
Buzz x y
Woody x y
Jessie x y
Movie latent factors f(j):
Star Wars x y
Toy Story x y
Frozen x y
Each rating rij is approximated from the user factors f(i) and the movie factors f(j): rij ≈ f(i) · f(j)
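The factorization above can be sketched in plain Python with two latent factors per user and per movie. This is a simplified alternating least squares loop for illustration, not MLlib's implementation. The ratings come from the slides' matrix, but the column placement of Woody's and Jessie's ratings is an assumption, since the slide layout did not survive. The regularization value, iteration count, and all function names are hypothetical.

```python
import random

# Assumed placement of the slide's ratings (see the note above).
RATINGS = {
    ("Buzz", "Star Wars"): 4, ("Buzz", "Toy Story"): 4, ("Buzz", "Frozen"): 5,
    ("Woody", "Star Wars"): 5, ("Woody", "Toy Story"): 4,
    ("Jessie", "Star Wars"): 5,
}


def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def update(fixed, observed_by_row, reg):
    """Refit one side's factors by ridge regression against the fixed side."""
    updated = {}
    for row, observed in observed_by_row.items():
        a = [[reg, 0.0], [0.0, reg]]
        b = [0.0, 0.0]
        for col, rating in observed.items():
            f = fixed[col]
            for p in range(2):
                b[p] += rating * f[p]
                for q in range(2):
                    a[p][q] += f[p] * f[q]
        updated[row] = solve_2x2(a, b)
    return updated


def predict(user_f, movie_f, user, movie):
    """rij is approximated by the dot product f(i) . f(j)."""
    return sum(x * y for x, y in zip(user_f[user], movie_f[movie]))


def loss(ratings, user_f, movie_f, reg):
    """Squared error on known ratings plus the regularization penalty."""
    err = sum((r - predict(user_f, movie_f, u, m)) ** 2 for (u, m), r in ratings.items())
    norm = sum(x * x for f in list(user_f.values()) + list(movie_f.values()) for x in f)
    return err + reg * norm


def als(ratings, reg=0.1, iterations=20, seed=0):
    """Alternate between solving for user factors and movie factors."""
    rng = random.Random(seed)
    by_user, by_movie = {}, {}
    for (user, movie), r in ratings.items():
        by_user.setdefault(user, {})[movie] = r
        by_movie.setdefault(movie, {})[user] = r
    movie_f = {m: [rng.random(), rng.random()] for m in by_movie}
    user_f = {}
    for _ in range(iterations):
        user_f = update(movie_f, by_user, reg)
        movie_f = update(user_f, by_movie, reg)
    return user_f, movie_f


user_f, movie_f = als(RATINGS)
print("Predicted Jessie / Toy Story:", predict(user_f, movie_f, "Jessie", "Toy Story"))
```

Each half-step is an ordinary least squares problem with the other side held fixed, which is what makes the alternating scheme easy to parallelize across users and movies.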
Prediction Process
• Load movie ratings data from MongoDB
• Reflect and infer the input formats for the ALS algorithm
• Split the data
– 80% for training and 20% for validating the model
• Calculate the best model using ALS algorithm
– Build/train a User Movie matrix model
• Combine the data with user preferences and retrain the model
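The 80/20 split step can be sketched as a seeded shuffle followed by a cut. This plain-Python version uses made-up `(user, movie, rating)` tuples and a hypothetical `split_ratings` helper. In Spark itself this step is usually expressed with `randomSplit`, which splits by approximate proportions.

```python
import random


def split_ratings(ratings, train_fraction=0.8, seed=42):
    """Shuffle reproducibly, then cut into training and validation sets."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical ratings, invented for this illustration.
ratings = [(user, movie, (user + movie) % 5 + 1) for user in range(5) for movie in range(2)]

training, validation = split_ratings(ratings)
print(len(training), "training rows,", len(validation), "validation rows")
```

Fixing the seed keeps the split reproducible, so models trained with different parameters are compared on the same validation rows.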
Explore as a Databricks Notebook
http://cdn2.hubspot.net/hubfs/438089/notebooks/MongoDB_guest_blog/Using_MongoDB_Connector_for_Spark.html
MongoDB + Spark Case Study
China Eastern Airlines – Fare Engine
130K seats, 180 million fares & 1.6 billion daily searches
Spark and MongoDB
• An extremely powerful combination
• Many possible use cases
• Some operations are actually faster if performed using the Aggregation Framework
• Evolving all the time
Questions?
Muthu [email protected] @muthumongo