As simple as Apache Spark
Data Science Warsaw, 2015.10.13
About me
● At ICM for 5 years
● Knowledge Discovery in Documents
○ Object disambiguation, document classification, document similarity, etc.
● Big enough to use Big Data ecosystems
○ Hadoop since 2012
○ Spark since 2013 (2014 for real)
We still have about 19 minutes...
Obligatory word count example
■ Task: count the number of occurrences of each word in a text
■ Frequently used when introducing the MapReduce paradigm
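As a warm-up, the two MapReduce phases of word count can be sketched in plain Python, with no Hadoop or Spark needed (function names are ours, for illustration only):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["to be or not to be"]
print(reduce_phase(map_phase(lines)))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same pipeline is essentially one line: sc.textFile(path).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).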
Tell me you already know it!
All rights reserved, © 2015 ICM UW
Hadoop has a rich set of libraries
■ Map-Reduce — good for batch
■ Pig — scripts
■ Oozie — workflows
■ Mahout — machine learning
■ Hive — SQL queries
■ Impala — ad-hoc queries
■ Storm — real-time streaming
■ Giraph — graphs
Hadoop ecosystem is (too) large
■ Using multiple libraries results in:
● long deployment, costly support, the burden of administering a number of configuration files
● lots of glue code between libraries
Let’s walk into Big Data like a Boss with Spark.
Spark ecosystem is versatile yet seamless
Ecosystem of high-level tools for various use-cases
Spark Core
● Spark SQL
● Spark Streaming — near real-time
● MLlib — machine learning
● GraphX — graph processing
● SparkR — R on Spark
Spark ecosystem is versatile yet seamless
"One to rule them all"
Example: versatile yet seamless
1. Select positions from historic tweets.
2. Train a model of 10 clusters of neighbouring points.
3. Classify real-time tweets from the last 20 sec. every 3 sec. and count them for each cluster.
points = sc.runSql[Double, Double]("SELECT latitude, longitude FROM historic_tweets")
model = KMeans.train(points, 10)
sc.twitterStream(...)
  .map(lambda t: (model.closestCenter(t.location), 1))
  .reduceByKeyAndWindow(lambda x, y: x + y, Seconds(20), Seconds(3))
Source: The State of Spark, and Where We’re Going Next, presentation by M. Zaharia, 2013
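The reduceByKeyAndWindow step keeps, for each key, the sum of its values over a sliding 20-second window, re-evaluated every 3 seconds. A minimal plain-Python sketch of what one window evaluation computes (timestamps and cluster ids below are made up):

```python
from collections import Counter

def window_counts(events, now, window):
    # events: list of (timestamp, key) pairs; keep only those inside the
    # window (now - window, now] and count occurrences per key, like one
    # evaluation of reduceByKeyAndWindow(lambda x, y: x + y, Seconds(20), Seconds(3)).
    recent = [key for ts, key in events if now - window < ts <= now]
    return Counter(recent)

# Hypothetical stream of (second, cluster_id) pairs:
events = [(1, 0), (5, 1), (12, 0), (19, 0), (25, 1)]
print(window_counts(events, now=25, window=20))  # Counter({0: 2, 1: 1})
```

Only the events with timestamps in (5, 25] survive the window; Spark re-runs this count every 3 seconds as the window slides.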
How to start?
First: download
Second: ./bin/pyspark
Third: Code much and often
from pyspark.mllib.recommendation import ALS, Rating

# Transform strings "user_id,movie_id,rating"
# into Rating(user_id: Int, movie_id: Int, rating: Double)
data = sc.textFile("path/to/data.csv")
ratings = data.map(lambda s: s.split(',')) \
              .map(lambda arr: Rating(int(arr[0]), int(arr[1]), float(arr[2])))

# Build the recommendation model using ALS:
# factor the rating matrix A=[n,m] into B=[n,f] and C=[f,m], where A ~= B x C
numFeatures = 10
numIterations = 20
model = ALS.train(ratings, numFeatures, numIterations, 0.01)
Collaborative Filtering: Problem Statement
● We know which products are preferred by a particular user
● Having information about preferences, recommend to a particular user a product which she or he is likely to purchase
● Iterative method
Collaborative Filtering: Problem Statement
Diagram: the USER × MOVIE rating matrix is approximately the product of a USER × TASTE matrix (taste demand) and a TASTE × MOVIE matrix (taste supply).
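The taste "demand" and "supply" matrices can be sketched numerically in plain Python: each user carries taste weights, each movie carries taste scores, and a predicted rating is one cell of the product B x C. All names and numbers below are invented for illustration:

```python
# B: users x tastes ("demand") -- how much each user cares about each taste.
user_taste = {
    "alice": [0.75, 0.25],
    "bob":   [0.25, 0.75],
}
# C: tastes x movies ("supply"), stored per movie -- how much of each taste
# a movie delivers.
movie_taste = {
    "matrix":  [5.0, 1.0],
    "titanic": [1.0, 5.0],
}

def predict(user, movie):
    # One cell of B x C: the dot product of a user row and a movie column.
    return sum(u * m for u, m in zip(user_taste[user], movie_taste[movie]))

print(predict("alice", "matrix"))   # 0.75*5.0 + 0.25*1.0 = 4.0
print(predict("bob", "matrix"))     # 0.25*5.0 + 0.75*1.0 = 2.0
```

ALS learns exactly these two small matrices from the known ratings, alternating between fixing one side and solving for the other; missing cells of the product are the recommendations.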
Demo time!
Tweets exploration
Spark ecosystem is versatile yet seamless
Ecosystem of high-level tools for various use-cases
Spark Core
● Spark SQL
● Spark Streaming — near real-time
● MLlib — machine learning
● GraphX — graph processing
● SparkR — R on Spark
What next? Trainings!