As simple as Apache Spark

Transcript of As simple as Apache Spark

Page 1: As simple as Apache Spark

Data Science Warsaw, 2015.10.13

Page 2: As simple as Apache Spark


About me

● At ICM for 5 years
● Knowledge Discovery in Documents
  ○ Object disambiguation, document classification, document similarity, etc.
● Big enough to use Big Data ecosystems
  ○ Hadoop since 2012
  ○ Spark since 2013 (2014 for real)


Page 3: As simple as Apache Spark


We still have about 19 minutes...

Page 4: As simple as Apache Spark


Obligatory word count example

■ Task: count the number of occurrences of each word in a text
■ Frequently used when introducing the MapReduce paradigm


Tell me you already know it!
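The word count can be sketched in plain Python that mimics Spark's map and reduce-by-key steps (a sketch of the paradigm, not the Spark API itself):

```python
from collections import Counter

def word_count(lines):
    """Count occurrences of each word, MapReduce style."""
    # Map: turn every line into (word, 1) pairs
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Reduce by key: sum the counts for each word
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

print(word_count(["to be or not", "to be"]))
# → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In PySpark the equivalent pipeline is a one-liner: `sc.textFile(path).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`.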

Page 5: As simple as Apache Spark

All rights reserved, © 2015 ICM UW

Hadoop has a rich set of libraries

Map-Reduce — good for batch

Pig — Scripts

Oozie — Workflows

Mahout — Machine Learning

Hive — SQL Queries

Impala — Ad-hoc Queries

Storm — Real Time Streaming

Giraph — Graphs

Page 6: As simple as Apache Spark


Hadoop ecosystem is (too) large

Map-Reduce — good for batch

■ Using multiple libraries results in
  ● long deployment, costly support, the burden of administering a number of configuration files
  ● lots of glue code between libraries

Pig — Scripts

Oozie — Workflows

Mahout — Machine Learning

Hive — SQL Queries

Impala — Ad-hoc Queries

Storm — Real Time Streaming

Giraph — Graphs

Page 7: As simple as Apache Spark


Let’s walk into Big Data like a Boss with Spark.

Page 8: As simple as Apache Spark


Spark ecosystem is versatile yet seamless

Ecosystem of high-level tools for various use-cases

Spark Core

● Spark SQL
● Spark Streaming — near real-time
● MLlib — machine learning
● GraphX — graph processing
● SparkR — R on Spark

Page 9: As simple as Apache Spark


Spark ecosystem is versatile yet seamless


"One to rule them all"

Page 10: As simple as Apache Spark


Example: versatile yet seamless

1. Select positions from historic tweets.

2. Train a model of 10 clusters of neighbouring nodes.

3. Classify real–time tweets from last 20 sec. every 3 sec. and count them for each cluster.


points = sc.runSql[Double, Double]("SELECT latitude, longitude FROM historic_tweets")

model = KMeans.train(points, 10)

sc.twitterStream(...)
  .map(lambda t: (model.closestCenter(t.location), 1))
  .reduceByKeyAndWindow(lambda x, y: x + y, Seconds(20), Seconds(3))

Source: The State of Spark, and Where We’re Going Next, presentation by M. Zaharia, 2013
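The reduceByKeyAndWindow semantics (sum per-key counts over the last 20 seconds, re-evaluated as new micro-batches arrive) can be imitated in plain Python. This is a sketch of the windowing logic only, not Spark Streaming; the timestamps and the micro-batch structure are assumptions made for the illustration:

```python
from collections import Counter, deque

class WindowedCounter:
    """Sum per-key counts over the last `window` seconds of micro-batches."""
    def __init__(self, window):
        self.window = window      # window length in seconds
        self.batches = deque()    # (timestamp, per-key Counter) per micro-batch

    def add_batch(self, timestamp, keys):
        # Record one micro-batch of (key, 1) events
        self.batches.append((timestamp, Counter(keys)))
        # Drop batches that have slid out of the window
        while self.batches and self.batches[0][0] <= timestamp - self.window:
            self.batches.popleft()

    def totals(self):
        # Like reduceByKeyAndWindow: sum counts across retained batches
        total = Counter()
        for _, counts in self.batches:
            total += counts
        return dict(total)

w = WindowedCounter(window=20)
w.add_batch(0, ["cluster_a", "cluster_b"])
w.add_batch(3, ["cluster_a"])
w.add_batch(21, ["cluster_b"])   # the t=0 batch is now outside the 20 s window
print(w.totals())
# → {'cluster_a': 1, 'cluster_b': 1}
```

Spark Streaming does this incrementally and distributed; the point here is only the sliding-window bookkeeping.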

Page 11: As simple as Apache Spark


How to start?

Page 12: As simple as Apache Spark


First: download


Page 13: As simple as Apache Spark


Second: ./bin/pyspark


Page 14: As simple as Apache Spark


Third: Code much and often


# Transform strings "user_id,movie_id,rating" into
# Rating(user_id: int, movie_id: int, rating: float)
from pyspark.mllib.recommendation import ALS, Rating

data = sc.textFile("path/to/data.csv")
ratings = data.map(lambda s: s.split(',')) \
              .map(lambda arr: Rating(int(arr[0]), int(arr[1]), float(arr[2])))

# Build the recommendation model using ALS:
# factor the rating matrix A = [n, m] into B = [n, f] and C = [f, m], where A ~= B x C
numFeatures = 10
numIterations = 20
model = ALS.train(ratings, numFeatures, numIterations, 0.01)

Page 15: As simple as Apache Spark


● We know which products are preferred by a particular user
● Having information about preferences, recommend to a particular user a product which she or he is likely to purchase
● Iterative method
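MLlib's ALS alternates exact least-squares solves for the two factors. As a minimal plain-Python illustration of the iterative factorization idea only (using gradient descent instead of ALS; the toy matrix, factor count, learning rate, and step count are arbitrary choices for the sketch):

```python
import random

def factorize(A, f, steps=5000, lr=0.01):
    """Iteratively factor A (n x m) into B (n x f) and C (f x m) so A ~= B x C."""
    random.seed(0)  # deterministic toy run
    n, m = len(A), len(A[0])
    B = [[random.random() for _ in range(f)] for _ in range(n)]
    C = [[random.random() for _ in range(m)] for _ in range(f)]
    for _ in range(steps):
        for i in range(n):
            for j in range(m):
                # Error of the current approximation on one rating
                err = A[i][j] - sum(B[i][k] * C[k][j] for k in range(f))
                # Nudge both factors to shrink that error
                for k in range(f):
                    B[i][k] += lr * err * C[k][j]
                    C[k][j] += lr * err * B[i][k]
    return B, C

# Tiny user x movie rating matrix: two users with opposite tastes
A = [[5.0, 1.0], [1.0, 5.0]]
B, C = factorize(A, f=2)
approx = [[sum(B[i][k] * C[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]  # close to A after the iterations
```

The same "repeat until the factors reproduce the known ratings" loop is what makes the method iterative; ALS just takes much larger, exact steps per iteration.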


Collaborative Filtering: Problem Statement

Page 16: As simple as Apache Spark


Collaborative Filtering: Problem Statement


[Diagram: the USER x MOVIE rating matrix factors as (USER x ~TASTE, "demand") X (~TASTE x MOVIE, "supply")]

Demo time!

Page 17: As simple as Apache Spark


Tweets exploration


Demo time!

Page 18: As simple as Apache Spark


Spark ecosystem is versatile yet seamless

Ecosystem of high-level tools for various use-cases

Spark Core

● Spark SQL
● Spark Streaming — near real-time
● MLlib — machine learning
● GraphX — graph processing
● SparkR — R on Spark

Page 19: As simple as Apache Spark


What next? Trainings!

Page 20: As simple as Apache Spark


Thank you!


Piotr [email protected]

@pjden