Blazing Fast Analytics with MongoDB & Spark


Page 1: Blazing Fast Analytics with MongoDB & Spark
Page 2: Blazing Fast Analytics with MongoDB & Spark

Blazing Fast Analytics with MongoDB & Spark

Page 3: Blazing Fast Analytics with MongoDB & Spark

Muthu Chinnasamy

Senior Solutions [email protected]
Twitter: @MuthuMongo

Page 4: Blazing Fast Analytics with MongoDB & Spark

Agenda

• The data challenge
• Spark
• Use Cases
• Connectors
• Demo

Page 5: Blazing Fast Analytics with MongoDB & Spark

Every two days now we create as much information as we did from the dawn of civilization up until 2003.

Eric Schmidt, 2010

Page 6: Blazing Fast Analytics with MongoDB & Spark
Page 7: Blazing Fast Analytics with MongoDB & Spark

Apache Spark is the Taylor Swift of big data software.

Derrick Harris, Fortune

Page 8: Blazing Fast Analytics with MongoDB & Spark

What is Spark?

Fast and general computing engine for clusters

• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, graph
• It’s fundamentally different to what’s come before

Page 9: Blazing Fast Analytics with MongoDB & Spark

Why not just use Hadoop?

• Spark is FAST
– Faster to write.
– Faster to run.
• Up to 100x faster than Hadoop in memory
• 10x faster on disk.

Page 10: Blazing Fast Analytics with MongoDB & Spark

A visual comparison

Hadoop Spark

Page 11: Blazing Fast Analytics with MongoDB & Spark

RDD Operations

Transformations    Actions
map                reduce
filter             collect
flatMap            count
mapPartitions      save
sample             lookupKey
union              take
join               foreach
groupByKey
reduceByKey
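The split between transformations and actions can be illustrated without a cluster. The sketch below is a single-machine toy model written for this transcript, not Spark's implementation: `ToyRDD` and its `parallelize` class method are hypothetical names, and only a handful of the operations listed above are mimicked. Transformations build up a chain lazily; nothing runs until an action is called.

```python
from functools import reduce as fold
from itertools import chain, islice


class ToyRDD:
    """Single-machine toy model of an RDD; NOT Spark's implementation."""

    def __init__(self, make_iter):
        # make_iter: zero-argument callable that returns a fresh iterator.
        self._make_iter = make_iter

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._make_iter()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._make_iter() if pred(x)))

    def flatMap(self, f):
        return ToyRDD(lambda: chain.from_iterable(f(x) for x in self._make_iter()))

    def reduceByKey(self, f):
        def run():
            acc = {}
            for key, value in self._make_iter():
                acc[key] = f(acc[key], value) if key in acc else value
            return iter(acc.items())
        return ToyRDD(run)

    # Actions: run the whole chain and return a plain Python value.
    def collect(self):
        return list(self._make_iter())

    def count(self):
        return sum(1 for _ in self._make_iter())

    def reduce(self, f):
        return fold(f, self._make_iter())

    def take(self, n):
        return list(islice(self._make_iter(), n))


# Word count: three transformations, then one action.
lines = ToyRDD.parallelize(["spark mongodb", "spark streaming"])
word_counts = (
    lines.flatMap(str.split)
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
print(dict(word_counts.collect()))
```

Only `collect()` on the last line triggers any processing; the three calls before it merely describe the computation.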

Page 12: Blazing Fast Analytics with MongoDB & Spark

Spark higher level libraries

Spark

Spark SQL, Spark Streaming, MLlib, GraphX

Page 13: Blazing Fast Analytics with MongoDB & Spark

Spark + MongoDB

Page 14: Blazing Fast Analytics with MongoDB & Spark

Data Management

OLTP
• Applications
• Fine grained operations
• Low latency

Offline Processing
• Analytics
• Data Warehousing
• High throughput

Page 15: Blazing Fast Analytics with MongoDB & Spark

Spark + MongoDB top use cases:
– Business Intelligence
– Data Warehousing
– Recommendation
– Log processing
– User Facing Services
– Fraud detection

Page 16: Blazing Fast Analytics with MongoDB & Spark

MongoDB and Spark

Page 17: Blazing Fast Analytics with MongoDB & Spark

Spark reading directly from MongoDB

Page 18: Blazing Fast Analytics with MongoDB & Spark

Aggregation pipeline to Pre-filter

Aggregation pipeline filter: $match
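A pre-filter of this kind is an aggregation pipeline whose first stage is `$match`, so that only matching documents leave MongoDB. The sketch below shows what such a pipeline looks like as a Python structure and emulates it locally to make the effect visible. The documents, the field names (`user`, `movie`, `stars`) and the helper `run_match_stages` are hypothetical, and the emulation only handles equality on top-level scalar fields, which is a small subset of what `$match` supports.

```python
# Hypothetical documents standing in for a MongoDB collection.
RATINGS = [
    {"user": "Buzz", "movie": "Star Wars", "stars": 4},
    {"user": "Buzz", "movie": "Frozen", "stars": 5},
    {"user": "Woody", "movie": "Star Wars", "stars": 5},
    {"user": "Jessie", "movie": "Star Wars", "stars": 5},
]

# The pipeline itself: a list of stages, here a single $match stage.
PIPELINE = [{"$match": {"movie": "Star Wars", "stars": 5}}]


def run_match_stages(docs, pipeline):
    """Local emulation of equality-only $match stages (assumption: scalar fields)."""
    for stage in pipeline:
        if list(stage) != ["$match"]:
            raise ValueError("this sketch only understands $match stages")
        condition = stage["$match"]
        docs = [
            doc for doc in docs
            if all(key in doc and doc[key] == value for key, value in condition.items())
        ]
    return docs


prefiltered = run_match_stages(RATINGS, PIPELINE)
print(len(RATINGS), "documents in the collection,", len(prefiltered), "handed to Spark")
```

The point of the pre-filter is the shrinking document count: the filtering happens where the data lives, so less data has to travel to the Spark workers.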

Page 19: Blazing Fast Analytics with MongoDB & Spark

Spark writing directly to MongoDB

Page 20: Blazing Fast Analytics with MongoDB & Spark

Fraud Detection

I'm so in love!

Me, too<3

Now send me your CC number

?

Ok, XXXX-123-zzz

$$$

Page 21: Blazing Fast Analytics with MongoDB & Spark

Fraud Detection

Page 22: Blazing Fast Analytics with MongoDB & Spark

Sharing Workloads

Chat App

HDFS HDFS HDFS: Archiving, Data Crunching

Login, User Profile, Contacts, Messages…

Fraud Detection, Segmentation, Recommendations

Spark

Page 23: Blazing Fast Analytics with MongoDB & Spark

MongoDB + Spark Connector

Page 24: Blazing Fast Analytics with MongoDB & Spark

MongoDB Spark Connector

https://spark-packages.org/?q=official+mongodb

Page 25: Blazing Fast Analytics with MongoDB & Spark

MongoDB Spark Connector

https://github.com/mongodb/mongo-spark

[Diagram labels: MongoDB Shard, MongoDB Spark Connector, Spark]

Page 26: Blazing Fast Analytics with MongoDB & Spark

Spark Streaming

Page 27: Blazing Fast Analytics with MongoDB & Spark

Spark Streaming

Twitter Feed Spark

Page 28: Blazing Fast Analytics with MongoDB & Spark

Spark Streaming

Twitter Feed

{"statuses": [{"coordinates": null, "favorited": false, "truncated": false,
  "created_at": "Mon Sep 24 03:35:21 +0000 2012", "id_str": "250075927172759552",
  "entities": {"urls": [], "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
  "user_mentions": []}}]}

Page 29: Blazing Fast Analytics with MongoDB & Spark

Spark Streaming

{"statuses": [{"coordinates": null, "favorited": false, "truncated": false,
  "created_at": "Mon Sep 24 03:35:21 +0000 2012", "id_str": "250075927172759552",
  "entities": {"urls": [], "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
  "user_mentions": []}}]}

{"time": "Mon Sep 24 03:35", "freebandnames": 1}

{"statuses": [{"coordinates": null, "favorited": false, "truncated": false,
  "created_at": "Mon Sep 24 03:35:21 +0000 2012", "id_str": "250075927172759552",
  "entities": {"urls": [], "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
  "user_mentions": []}}]}

{"statuses": [{"coordinates": null, "favorited": false, "truncated": false,
  "created_at": "Mon Sep 24 03:35:21 +0000 2012", "id_str": "250075927172759552",
  "entities": {"urls": [], "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
  "user_mentions": []}}]}

{"statuses": [{"coordinates": null, "favorited": false, "truncated": false,
  "created_at": "Mon Sep 24 03:35:21 +0000 2012", "id_str": "250075927172759552",
  "entities": {"urls": [], "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
  "user_mentions": []}}]}

{"statuses": [{"coordinates": null, "favorited": false, "truncated": false,
  "created_at": "Mon Sep 24 03:35:21 +0000 2012", "id_str": "250075927172759552",
  "entities": {"urls": [], "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
  "user_mentions": []}}]}

{"time": "Mon Sep 24 03:35", "freebandnames": 4}

Spark
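The aggregation on this slide, from raw tweets to one counter document per minute, can be sketched in plain Python. This stands in for the work done on a single batch of a stream and is not Spark Streaming code; the function names `minute_of` and `count_hashtags` are hypothetical, and the tweet is the one shown on the slide.

```python
from collections import Counter, defaultdict

# The tweet payload from the slide, as a Python dict.
TWEET = {
    "statuses": [{
        "coordinates": None,
        "favorited": False,
        "truncated": False,
        "created_at": "Mon Sep 24 03:35:21 +0000 2012",
        "id_str": "250075927172759552",
        "entities": {
            "urls": [],
            "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
            "user_mentions": [],
        },
    }]
}


def minute_of(created_at):
    """'Mon Sep 24 03:35:21 +0000 2012' -> 'Mon Sep 24 03:35'."""
    weekday, month, day, clock = created_at.split()[:4]
    return f"{weekday} {month} {day} {clock[:5]}"


def count_hashtags(batch):
    """Turn one batch of tweet payloads into per-minute hashtag counters."""
    per_minute = defaultdict(Counter)
    for payload in batch:
        for status in payload["statuses"]:
            minute = minute_of(status["created_at"])
            for tag in status["entities"]["hashtags"]:
                per_minute[minute][tag["text"]] += 1
    return [{"time": minute, **counts} for minute, counts in per_minute.items()]


print(count_hashtags([TWEET]))      # one tweet in the batch
print(count_hashtags([TWEET] * 4))  # four tweets in the batch
```

Each resulting document has the shape shown on the slide: a `time` key for the minute plus one key per hashtag seen in that minute.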

Page 30: Blazing Fast Analytics with MongoDB & Spark

Capped Collection

MongoDB and Spark Streaming feature

{"time": "Mon Sep 24 03:35", "freebandnames": 4}

{"time": "Mon Nov 5 09:40", "mongoDBLondon": 400}

{"time": "Mon Nov 5 11:50", "spark": 7556}

{"time": "Mon Nov 24 12:50", "itshappening": 100}

Tailable Cursor
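A capped collection keeps documents in insertion order within a fixed bound and discards the oldest ones once it is full; a tailable cursor lets a reader keep picking up documents as they arrive. The sketch below is an in-memory toy model of that behaviour, not MongoDB or driver code: `ToyCappedCollection`, its `insert` and `tail` methods, and the cap of three documents are all assumptions made for illustration. The documents are the ones shown on the slide.

```python
from collections import deque


class ToyCappedCollection:
    """In-memory stand-in for a capped collection (hypothetical, not MongoDB)."""

    def __init__(self, max_docs):
        self._docs = deque(maxlen=max_docs)  # oldest entries fall off when full
        self._inserted = 0                   # total number of inserts so far

    def insert(self, doc):
        self._docs.append(doc)
        self._inserted += 1

    def tail(self, position=0):
        """Return (docs inserted since `position` that still exist, new position)."""
        oldest_kept = self._inserted - len(self._docs)
        start = max(position, oldest_kept)
        return list(self._docs)[start - oldest_kept:], self._inserted


counters = ToyCappedCollection(max_docs=3)
for doc in [
    {"time": "Mon Sep 24 03:35", "freebandnames": 4},
    {"time": "Mon Nov 5 09:40", "mongoDBLondon": 400},
    {"time": "Mon Nov 5 11:50", "spark": 7556},
    {"time": "Mon Nov 24 12:50", "itshappening": 100},
]:
    counters.insert(doc)

first_read, position = counters.tail()        # the oldest document has been evicted
counters.insert({"time": "Mon Nov 24 12:51", "itshappening": 101})  # made-up follow-up
second_read, position = counters.tail(position)  # only the new document is returned
print(len(first_read), len(second_read))
```

With a real deployment the same pattern would be expressed through the driver, for example PyMongo's `create_collection(..., capped=True, size=...)` and `find(cursor_type=CursorType.TAILABLE_AWAIT)`; the toy model only mirrors the eviction and resume-from-position behaviour.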

Page 31: Blazing Fast Analytics with MongoDB & Spark

MongoDB + Spark MLlib Demo

Page 32: Blazing Fast Analytics with MongoDB & Spark

Collaborative Filtering

• Two parts
– Collaborative: using rating preferences from several users
– Filtering: recommend preferences

UserId / MovieId Star Wars Toy Story Frozen

Buzz 4 4 5

Woody 5 4

Jessie 5 ?

Movie Ratings as a matrix

Page 33: Blazing Fast Analytics with MongoDB & Spark

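MLlib ALS

• Approximate into User & Movie latent factor matrices

[Figure: the UserId / MovieId ratings matrix (users Buzz, Woody, Jessie; movies Frozen, Toy Story, Star Wars) is approximated by the product of a user factor matrix (one x, y pair per user) and a movie factor matrix (one x, y pair per movie); each rating rij is approximated from the user factors f(i) and the movie factors f(j).]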
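The factorisation idea can be tried on the toy ratings with a few lines of plain Python. This is a minimal sketch of alternating least squares with two latent factors per user and per movie, not Spark MLlib's distributed implementation. Several things are assumptions: which cells hold Woody's and Jessie's ratings (the table layout is not recoverable from the transcript), the regularisation constant `REG`, the random starting point, and all function names.

```python
import random

USERS = ["Buzz", "Woody", "Jessie"]
MOVIES = ["Star Wars", "Toy Story", "Frozen"]
# ASSUMPTION: the placement of Woody's and Jessie's ratings is a guess.
RATINGS = {
    ("Buzz", "Star Wars"): 4, ("Buzz", "Toy Story"): 4, ("Buzz", "Frozen"): 5,
    ("Woody", "Star Wars"): 5, ("Woody", "Toy Story"): 4,
    ("Jessie", "Star Wars"): 5,
}
REG = 0.1  # assumed regularisation strength


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def solve_one(observed):
    """Regularised least squares for one 2-d factor vector.

    observed: list of (fixed_factor_vector, rating) pairs.
    """
    a11 = a22 = REG
    a12 = b1 = b2 = 0.0
    for f, r in observed:
        a11 += f[0] * f[0]
        a12 += f[0] * f[1]
        a22 += f[1] * f[1]
        b1 += r * f[0]
        b2 += r * f[1]
    det = a11 * a22 - a12 * a12  # positive because REG > 0
    return [(b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


def loss(user_f, movie_f):
    err = sum((r - dot(user_f[u], movie_f[m])) ** 2 for (u, m), r in RATINGS.items())
    penalty = sum(dot(f, f) for f in list(user_f.values()) + list(movie_f.values()))
    return err + REG * penalty


def init_factors(seed=0):
    rng = random.Random(seed)
    user_f = {u: [rng.random(), rng.random()] for u in USERS}
    movie_f = {m: [rng.random(), rng.random()] for m in MOVIES}
    return user_f, movie_f


def als(user_f, movie_f, sweeps=20):
    for _ in range(sweeps):
        for user in USERS:  # fix the movie factors, solve for each user
            user_f[user] = solve_one(
                [(movie_f[m], r) for (u, m), r in RATINGS.items() if u == user])
        for movie in MOVIES:  # fix the user factors, solve for each movie
            movie_f[movie] = solve_one(
                [(user_f[u], r) for (u, m), r in RATINGS.items() if m == movie])
    return user_f, movie_f


user_f, movie_f = init_factors()
before = loss(user_f, movie_f)
user_f, movie_f = als(user_f, movie_f)
after = loss(user_f, movie_f)
prediction = dot(user_f["Jessie"], movie_f["Frozen"])
print(f"loss {before:.2f} -> {after:.2f}; predicted Jessie/Frozen: {prediction:.2f}")
```

Each half-sweep solves an ordinary regularised least-squares problem while the other side is held fixed, which is why the regularised loss never increases from one sweep to the next.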

Page 34: Blazing Fast Analytics with MongoDB & Spark

Prediction Process

• Load movie ratings data from MongoDB
• Reflect and infer the input formats for the ALS algorithm
• Split the data
– 80% for training and 20% for validating the model
• Calculate the best model using the ALS algorithm
– Build/train a User Movie matrix model
• Combine the data with user preferences and retrain the model
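The split and model-selection steps above can be sketched in plain Python. This is a sketch under stated assumptions rather than the demo's code: the ratings are made-up placeholders, the candidate "models" are trivial constant predictors standing in for trained ALS models, and `split_ratings`, `rmse` and `pick_best` are hypothetical helper names. In Spark itself the split would typically be done with `randomSplit`.

```python
import math
import random


def split_ratings(ratings, train_fraction=0.8, seed=42):
    """Shuffle and cut the ratings into a training and a validation part."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rmse(predict, ratings):
    """Root mean squared error of predict(user, movie) against known ratings."""
    squared = sum((predict(user, movie) - stars) ** 2 for user, movie, stars in ratings)
    return math.sqrt(squared / len(ratings))


def pick_best(candidates, validation):
    """Return the name of the candidate with the lowest validation RMSE."""
    return min(candidates, key=lambda name: rmse(candidates[name], validation))


# Made-up placeholder ratings: (user, movie, stars).
ratings = [(f"user{i}", f"movie{i % 3}", 3 + i % 3) for i in range(10)]
training, validation = split_ratings(ratings)

# Stand-ins for models trained with different ALS settings.
candidates = {
    "always_3": lambda user, movie: 3.0,
    "always_4": lambda user, movie: 4.0,
}
print(len(training), "training /", len(validation), "validation ratings")
print("best candidate:", pick_best(candidates, validation))
```

Keeping the validation ratings out of training is what makes the RMSE comparison meaningful: the chosen model is the one that predicts unseen ratings best, and it is then retrained once the user's own preferences are added.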

Page 35: Blazing Fast Analytics with MongoDB & Spark

Explore as a Databricks Notebook

http://cdn2.hubspot.net/hubfs/438089/notebooks/MongoDB_guest_blog/Using_MongoDB_Connector_for_Spark.html

Page 36: Blazing Fast Analytics with MongoDB & Spark

MongoDB + Spark Case Study

Page 37: Blazing Fast Analytics with MongoDB & Spark

China Eastern Airlines – Fare Engine

130K seats, 180 million fares & 1.6 billion daily searches

Page 38: Blazing Fast Analytics with MongoDB & Spark

Spark and MongoDB

• An extremely powerful combination

• Many possible use cases

• Some operations are actually faster if performed using the Aggregation Framework

• Evolving all the time

Page 39: Blazing Fast Analytics with MongoDB & Spark

Questions?

Muthu Chinnasamy
[email protected]
@muthumongo

Page 40: Blazing Fast Analytics with MongoDB & Spark