Blazing Fast Analytics with MongoDB & Spark

Muthu Chinnasamy

Senior Solutions [email protected]
@MuthuMongo

Agenda

• The data challenge
• Spark
• Use Cases
• Connectors
• Demo

“Every two days now we create as much information as we did from the dawn of civilization up until 2003.”

Eric Schmidt, 2010

“Apache Spark is the Taylor Swift of big data software.”

Derrick Harris, Fortune

What is Spark?

Fast and general computing engine for clusters

• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, graph processing
• It’s fundamentally different to what’s come before

Why not just use Hadoop?

• Spark is FAST
  – Faster to write.
  – Faster to run.
• Up to 100x faster than Hadoop MapReduce in memory
• Up to 10x faster on disk

A visual comparison

[Figure: Hadoop vs. Spark]

RDD Operations

Transformations    Actions
map                reduce
filter             collect
flatMap            count
mapPartitions      save
sample             lookupKey
union              take
join               foreach
groupByKey
reduceByKey
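The split between transformations and actions can be illustrated with a small pure-Python stand-in. This is only a sketch: `ToyRDD` is a hypothetical class written for this example, not Spark's API, and it runs on a single machine. It assumes the usual Spark behaviour that transformations (`map`, `filter`, `flatMap`) only record work, while actions (`collect`, `count`, `reduce`, `take`) trigger it.

```python
import functools
from itertools import chain, islice


class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # recorded transformations, not yet executed

    # --- Transformations: return a new ToyRDD, run nothing ---
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def flatMap(self, fn):
        return ToyRDD(self._source, self._steps + (("flatMap", fn),))

    def _run(self):
        # Build the iterator pipeline from the recorded steps.
        items = iter(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                items = map(fn, items)
            elif kind == "filter":
                items = filter(fn, items)
            else:  # flatMap
                items = chain.from_iterable(map(fn, items))
        return items

    # --- Actions: execute the recorded steps and return a result ---
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def reduce(self, fn):
        return functools.reduce(fn, self._run())

    def take(self, n):
        return list(islice(self._run(), n))


calls = []


def tracked_square(x):
    calls.append(x)  # record every invocation to observe laziness
    return x * x


odd_squares = ToyRDD(range(1, 6)).map(tracked_square).filter(lambda x: x % 2 == 1)
calls_before_action = list(calls)  # still empty: no action has run yet
result = odd_squares.collect()     # the action triggers the work
```

Nothing is computed until `collect()` is called, which is why `calls_before_action` stays empty. In Spark the same idea lets the engine plan a whole chain of transformations before touching the data.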

Spark higher level libraries

Spark
• Spark SQL
• Spark Streaming
• MLlib
• GraphX

Spark + MongoDB

Data Management

OLTP
• Applications
• Fine-grained operations
• Low latency

Offline Processing
• Analytics
• Data warehousing
• High throughput

Spark + MongoDB top use cases:
– Business Intelligence
– Data Warehousing
– Recommendation
– Log processing
– User Facing Services
– Fraud detection

MongoDB and Spark

Spark reading directly from MongoDB

Aggregation pipeline to Pre-filter

Aggregation pipeline filter: $match
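To make the pre-filtering idea concrete, here is a minimal sketch of a `$match` stage written as a Python structure and serialized to JSON, the form in which a pipeline is typically handed to the connector's read configuration (the option name varies by connector version, so it is not shown). The field names `status` and `rating` and the sample documents are hypothetical. The `matches` helper is a simplified local stand-in that handles only scalar equality and `$gte`; it is not MongoDB's matching logic and serves only to show that fewer documents would reach Spark.

```python
import json

# Hypothetical filter: the slide does not show the actual query.
pipeline = [
    {"$match": {"status": "active", "rating": {"$gte": 4}}},
]
pipeline_json = json.dumps(pipeline)  # JSON text form of the pipeline


def matches(doc, criteria):
    """Simplified stand-in for $match: scalar equality and $gte only."""
    for field, condition in criteria.items():
        value = doc.get(field)
        if isinstance(condition, dict):
            if "$gte" in condition:
                if value is None or value < condition["$gte"]:
                    return False
        elif value != condition:
            return False
    return True


# Hypothetical documents standing in for a MongoDB collection.
docs = [
    {"_id": 1, "status": "active", "rating": 5},
    {"_id": 2, "status": "active", "rating": 3},
    {"_id": 3, "status": "closed", "rating": 5},
]

# Only documents passing the filter would be shipped to Spark.
prefiltered = [d for d in docs if matches(d, pipeline[0]["$match"])]
```

Running the filter inside MongoDB means the database can use its indexes and only the matching documents cross the network to the Spark workers.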

Spark writing directly to MongoDB

Fraud Detection

I'm so in love!

Me, too <3

Now send me your CC number

?

Ok, XXXX-123-zzz

$$$

Fraud Detection

Sharing Workloads

Chat App

HDFS
• Archiving
• Data Crunching

• Login
• User Profile
• Contacts
• Messages…

• Fraud Detection
• Segmentation
• Recommendations

Spark

MongoDB + Spark Connector

MongoDB Spark Connector
https://spark-packages.org/?q=official+mongodb

[Diagram: MongoDB Shard, Connector, Spark]

MongoDB Spark Connector

https://github.com/mongodb/mongo-spark

Spark Streaming

Spark Streaming

[Diagram: Twitter Feed → Spark]

Spark Streaming

Twitter Feed

{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [],
        "hashtags": [
          {"text": "freebandnames", "indices": [20, 34]}
        ],
        "user_mentions": []
      }
    }
  ]
}

Spark Streaming

[Diagram: a stream of tweet documents flows into Spark, which emits a per-minute count for each hashtag]

{"time": "Mon Sep 24 03:35", "freebandnames": 1}
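The step from a tweet document to a per-minute hashtag count can be sketched in plain Python. This is an illustration of the per-record logic only, not Spark Streaming code: the function name `hashtag_counts_per_minute` is hypothetical, and the minute is taken by slicing the timestamp, which assumes Twitter's fixed-width `created_at` format shown on the slide.

```python
from collections import Counter


def hashtag_counts_per_minute(payloads):
    """Count hashtag occurrences per minute across tweet payloads."""
    counts = {}  # minute string -> Counter of hashtag text
    for payload in payloads:
        for status in payload.get("statuses", []):
            # Assumption: timestamps look like "Mon Sep 24 03:35:21 +0000 2012",
            # so the first 16 characters give weekday, date, and hour:minute.
            minute = status["created_at"][:16]
            tags = status.get("entities", {}).get("hashtags", [])
            counts.setdefault(minute, Counter()).update(tag["text"] for tag in tags)
    # One output document per minute, with one field per hashtag.
    return [{"time": minute, **tag_counts} for minute, tag_counts in counts.items()]


# Sample payload taken from the fields shown on the slide.
sample_payload = {
    "statuses": [
        {
            "coordinates": None,
            "favorited": False,
            "truncated": False,
            "created_at": "Mon Sep 24 03:35:21 +0000 2012",
            "id_str": "250075927172759552",
            "entities": {
                "urls": [],
                "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
                "user_mentions": [],
            },
        }
    ]
}

result = hashtag_counts_per_minute([sample_payload])
```

In a streaming job the same logic would typically be applied to each micro-batch, with the resulting documents written back to MongoDB.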

Capped Collection

MongoDB and Spark Streaming feature


{"time": "Mon Nov 5 09:40", "mongoDBLondon": 400}

{"time": "Mon Nov 5 11:50", "spark": 7556}

{"time": "Mon Nov 24 12:50", "itshappening": 100}

Tailable Cursor

MongoDB + Spark MLlib Demo

Collaborative Filtering

• Two parts
• Collaborative: uses rating preferences from several users
• Filtering: recommends items based on those preferences

UserId / MovieId Star Wars Toy Story Frozen

Buzz 4 4 5

Woody 5 4

Jessie 5 ?

Movie Ratings as a matrix

MLlib ALS

• Approximate into User & Movie latent factor matrices

UserId / MovieId Frozen Toy Story Star Wars

Buzz 4 4 5

Woody 5 4

Jessie 5

User latent factors f(i)

Buzz    x  y
Woody   x  y
Jessie  x  y

Movie latent factors f(j)

Star Wars  x  y
Toy Story  x  y
Frozen     x  y

r_ij ≈ f(i) · f(j)
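The factorization on this slide is usually stated as a regularized least-squares problem. The form below is the standard textbook objective, written with the slide's symbols f(i), f(j), and r_ij. The set Ω of observed ratings and the regularization weight λ are notation introduced here, and MLlib's exact regularization scaling may differ.

```latex
\min_{f}\;
\sum_{(i,j)\in\Omega} \bigl(r_{ij} - f(i)^{\top} f(j)\bigr)^{2}
\;+\; \lambda \Bigl(\sum_{i} \lVert f(i)\rVert^{2} + \sum_{j} \lVert f(j)\rVert^{2}\Bigr)
```

"Alternating" refers to how this is solved: with the movie factors f(j) held fixed, the problem is an ordinary least-squares fit for each user factor f(i), and vice versa, so the two sets of factors are updated in turn until the fit stops improving.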

Prediction Process

• Load movie ratings data from MongoDB
• Reflect and infer the input formats for the ALS algorithm
• Split the data
  – 80% for training and 20% for validating the model
• Calculate the best model using the ALS algorithm
  – Build/train a User Movie matrix model
• Combine the data with user preferences and retrain the model
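The split, train, and validate steps above can be sketched in pure Python. This is a deliberately reduced illustration, not MLlib's implementation: it uses a single latent factor per user and movie (rank 1) so that each alternating update has a closed form, and the data is synthetic. All names (`split_ratings`, `train_rank1_als`, `predict`, `rmse`, the `u1`/`m1` identifiers) are hypothetical.

```python
import math
import random


def split_ratings(ratings, train_fraction=0.8, seed=42):
    """Shuffle (user, movie, rating) triples and split into train/validation."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_rank1_als(ratings, iterations=100, lam=1e-6):
    """Rank-1 alternating least squares: one latent factor per user and movie."""
    user_f = {u: 1.0 for u, _, _ in ratings}
    movie_f = {m: 1.0 for _, m, _ in ratings}
    for _ in range(iterations):
        # Hold movie factors fixed; each user factor is a 1-D least-squares fit.
        for u in user_f:
            num = sum(r * movie_f[m] for uu, m, r in ratings if uu == u)
            den = sum(movie_f[m] ** 2 for uu, m, _ in ratings if uu == u) + lam
            user_f[u] = num / den
        # Hold user factors fixed; update each movie factor the same way.
        for m in movie_f:
            num = sum(r * user_f[u] for u, mm, r in ratings if mm == m)
            den = sum(user_f[u] ** 2 for u, mm, _ in ratings if mm == m) + lam
            movie_f[m] = num / den
    return user_f, movie_f


def predict(model, user, movie):
    user_f, movie_f = model
    return user_f[user] * movie_f[movie]


def rmse(model, ratings):
    """Root-mean-square error of the model on a list of rating triples."""
    errors = [(predict(model, u, m) - r) ** 2 for u, m, r in ratings]
    return math.sqrt(sum(errors) / len(errors))


# Synthetic scores generated from user factors (1, 2) and movie factors (1, 2, 3);
# the (u2, m3) entry is left out and should be predicted as about 2 * 3 = 6.
observed = [
    ("u1", "m1", 1.0), ("u1", "m2", 2.0), ("u1", "m3", 3.0),
    ("u2", "m1", 2.0), ("u2", "m2", 4.0),
]

train, validation = split_ratings(observed)  # 80/20 split of the triples
model = train_rank1_als(observed)            # trained on all observed scores here
missing_score = predict(model, "u2", "m3")
```

With realistic data the model would be trained on the training split only and candidate settings (rank, λ, iteration count) compared by their RMSE on the validation split. The toy dataset here is too small for that, so the example trains on all observed scores.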

Explore as a Databricks Notebook
http://cdn2.hubspot.net/hubfs/438089/notebooks/MongoDB_guest_blog/Using_MongoDB_Connector_for_Spark.html

MongoDB + Spark Case Study

China Eastern Airlines – Fare Engine

130K seats, 180 million fares & 1.6 billion daily searches

Spark and MongoDB

• An extremely powerful combination

• Many possible use cases

• Some operations are actually faster if performed using the Aggregation Framework

• Evolving all the time

Questions?

Muthu Chinnasamy
[email protected]
@muthumongo
