As simple as Apache Spark
Data Science Warsaw, 2015.10.13
About me
● At ICM for 5 years
● Knowledge Discovery in Documents
○ Object disambiguation, document classification, document similarity, etc.
● Big enough to use Big Data ecosystems
○ Hadoop since 2012
○ Spark since 2013 (2014 for real)
We still have about 19 minutes...
Obligatory word count example
■ Task: count the number of occurrences of each word in a text
■ Frequently used when introducing the MapReduce paradigm
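As a warm-up, the two MapReduce phases of word count can be sketched in plain Python, with no Hadoop or Spark needed (function names are ours, for illustration only):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["to be or not to be"]
print(reduce_phase(map_phase(lines)))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same pipeline is essentially one line: sc.textFile(path).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).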
Tell me you already know it!
All rights reserved, © 2015 ICM UW
Hadoop has a rich set of libraries
■ Map-Reduce — good for batch
■ Pig — scripts
■ Oozie — workflows
■ Mahout — machine learning
■ Hive — SQL queries
■ Impala — ad-hoc queries
■ Storm — real-time streaming
■ Giraph — graphs
Hadoop ecosystem is (too) large
■ Using multiple libraries results in:
● long deployment, costly support, the burden of administering a number of configuration files
● lots of glue code between libraries
Let’s walk into Big Data like a Boss with Spark.
Spark ecosystem is versatile yet seamless
Ecosystem of high-level tools for various use-cases
Spark Core
● Spark SQL
● Spark Streaming — near real-time
● MLlib — machine learning
● GraphX — graph processing
● SparkR — R on Spark
Spark ecosystem is versatile yet seamless
"One to rule them all"
Example: versatile yet seamless
1. Select positions from historic tweets.
2. Train a model of 10 clusters of neighbouring points.
3. Classify real-time tweets from the last 20 sec. every 3 sec. and count them for each cluster.
points = sc.runSql[Double, Double]("SELECT latitude, longitude FROM historic_tweets")
model = KMeans.train(points, 10)
sc.twitterStream(...)
  .map(lambda t: (model.closestCenter(t.location), 1))
  .reduceByKeyAndWindow(lambda x, y: x + y, Seconds(20), Seconds(3))
Source: The State of Spark, and Where We’re Going Next, presentation by M. Zaharia, 2013
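The reduceByKeyAndWindow step keeps, for each key, the sum of its values over a sliding 20-second window, re-evaluated every 3 seconds. A minimal plain-Python sketch of what one window evaluation computes (timestamps and cluster ids below are made up):

```python
from collections import Counter

def window_counts(events, now, window):
    # events: list of (timestamp, key) pairs; keep only those inside the
    # window (now - window, now] and count occurrences per key, like one
    # evaluation of reduceByKeyAndWindow(lambda x, y: x + y, Seconds(20), Seconds(3)).
    recent = [key for ts, key in events if now - window < ts <= now]
    return Counter(recent)

# Hypothetical stream of (second, cluster_id) pairs:
events = [(1, 0), (5, 1), (12, 0), (19, 0), (25, 1)]
print(window_counts(events, now=25, window=20))  # Counter({0: 2, 1: 1})
```

Only the events with timestamps in (5, 25] survive the window; Spark re-runs this count every 3 seconds as the window slides.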
How to start?
First: download
Second: ./bin/pyspark
Third: Code much and often
from pyspark.mllib.recommendation import ALS, Rating

# Transform strings "user_id,movie_id,rating"
# into Rating(user_id: Int, movie_id: Int, rating: Double)
data = sc.textFile("path/to/data.csv")
ratings = data.map(lambda s: s.split(',')) \
              .map(lambda arr: Rating(int(arr[0]), int(arr[1]), float(arr[2])))

# Build the recommendation model using ALS:
# factor the rating matrix A=[n,m] into B=[n,f] and C=[f,m], where A ~= B x C
numFeatures = 10
numIterations = 20
model = ALS.train(ratings, numFeatures, numIterations, 0.01)
Collaborative Filtering: Problem Statement
● We know which products are preferred by a particular user
● Having information about preferences, recommend to a particular user a product which she or he is likely to purchase
● Iterative method
Collaborative Filtering: Problem Statement
Diagram: the USER × MOVIE rating matrix is approximately the product of a USER × TASTE matrix (taste demand) and a TASTE × MOVIE matrix (taste supply).
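The taste "demand" and "supply" matrices can be sketched numerically in plain Python: each user carries taste weights, each movie carries taste scores, and a predicted rating is one cell of the product B x C. All names and numbers below are invented for illustration:

```python
# B: users x tastes ("demand") -- how much each user cares about each taste.
user_taste = {
    "alice": [0.75, 0.25],
    "bob":   [0.25, 0.75],
}
# C: tastes x movies ("supply"), stored per movie -- how much of each taste
# a movie delivers.
movie_taste = {
    "matrix":  [5.0, 1.0],
    "titanic": [1.0, 5.0],
}

def predict(user, movie):
    # One cell of B x C: the dot product of a user row and a movie column.
    return sum(u * m for u, m in zip(user_taste[user], movie_taste[movie]))

print(predict("alice", "matrix"))   # 0.75*5.0 + 0.25*1.0 = 4.0
print(predict("bob", "matrix"))     # 0.25*5.0 + 0.75*1.0 = 2.0
```

ALS learns exactly these two small matrices from the known ratings, alternating between fixing one side and solving for the other; missing cells of the product are the recommendations.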
Demo time!
Tweets exploration
Spark ecosystem is versatile yet seamless
Ecosystem of high-level tools for various use-cases
Spark Core
● Spark SQL
● Spark Streaming — near real-time
● MLlib — machine learning
● GraphX — graph processing
● SparkR — R on Spark
What next? Trainings!