Apache Spark Intro
Author: tudor-lapusan
Category: Data & Analytics

Apache Spark Intro workshop
BigData Romania

Apache Spark Intro
★ Apache Spark history
★ RDD
★ Transformations
★ Actions
★ Hands-on session

Apache Spark History
https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/

Where to learn Spark?
http://spark.apache.org/
http://shop.oreilly.com/product/0636920028512.do

Spark architecture

Easy ways to run Spark
★ Your IDE (e.g. Eclipse or IDEA)
★ Standalone Deploy Mode: the simplest way to deploy Spark on a single machine
★ Docker & Zeppelin
★ EMR
★ Hadoop vendors (Cloudera, Hortonworks)

Supported languages

Spark basics
★ RDD
★ Operations: Transformations and Actions

RDD
An RDD is simply an immutable distributed collection of objects!
[Figure: the objects a to q of one collection, distributed in groups across the cluster]

Creating RDD (I)
Python
    lines = sc.parallelize(["workshop", "spark"])
Scala
    val lines = sc.parallelize(List("workshop", "spark"))
Java
    JavaRDD<String> lines = sc.parallelize(Arrays.asList("workshop", "spark"));

Creating RDD (II)
Python
    lines = sc.textFile("/path/to/file.txt")
Scala
    val lines = sc.textFile("/path/to/file.txt")
Java
    JavaRDD<String> lines = sc.textFile("/path/to/file.txt");

RDD persistence
★ MEMORY_ONLY
★ MEMORY_AND_DISK
★ MEMORY_ONLY_SER
★ MEMORY_AND_DISK_SER
★ DISK_ONLY
★ MEMORY_ONLY_2
★ MEMORY_AND_DISK_2
★ OFF_HEAP

Other data structures in Spark
★ Paired RDD
★ DataFrame
★ Dataset

Paired RDD
Paired RDD = an RDD of key/value pairs
RDD: user1, user2, user3, user4, user5
Paired RDD: id1/user1, id2/user2, id3/user3, id4/user4, id5/user5

Spark operations
[Figure: RDD 1 to RDD 6 linked by operations, each labeled Transformation or Action]

Transformations
Transformations describe how to transform an RDD into another RDD.
[Figure: RDD 1 -> RDD 2]

Transformations
RDD {1,2,3,4,5,6}
map x => x + 1 gives MapRDD {2,3,4,5,6,7}
filter x => x != 4 gives FilterRDD {1,2,3,5,6}

Popular transformations
★ map
★ filter
★ sample
★ union
★ distinct
★ groupByKey
★ reduceByKey
★ sortByKey
★ join

Actions
Actions compute a result from an RDD!
[Figure: an action applied to RDD 1]

Actions
InputRDD {1,2,3,4,5,6}
map x => x + 1 gives MapRDD {2,3,4,5,6,7}
filter x => x != 4 gives FilterRDD {1,2,3,5,6}
Actions on InputRDD: count() = 6, take(2) = {1,2}, saveAsTextFile()

Popular actions
★ collect
★ count
★ first
★ take
★ takeSample
★ countByKey
★ saveAsTextFile

Transformations and Actions
users -> filter() -> administrators -> take(3)

Transformations and Actions
users -> filter() -> administrators -> take(3), saveAsTextFile()

Transformations and Actions
users -> filter() -> administrators -> persist() -> take(3), saveAsTextFile()

Lazy evaluation
users -> filter() -> administrators -> take(3)
Nothing is computed until an action such as take(3) is called.

How Spark Executes Your Program

Hands-on session

MovieLens
MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)
Download link: http://grouplens.org/datasets/movielens/

MovieLens dataset
user: user_id, age, gender, occupation, zipcode
user_rating: user_id, movie_id, rating, timestamp
movie: movie_id, title, release_date, video_release, imdb_url, genres...

Exercises already solved!
★ Return only the users with occupation 'administrator'
★ Increase the age of each user by one
★ Join user and rating datasets by user id
Exercises to solve
★ How many men and women are registered on MovieLens?
★ What is the age distribution of the male and female users registered on MovieLens?
★ Which movie titles have rating x?
★ What is the average rating per movie?
★ Sort users by their occupation

Congrats if you reached this slide!