Apache Spark Intro


Apache Spark Intro workshop, BigData Romania

As a BigData/DataScience community, we have run a series of workshops (e.g. How to Think in MapReduce, Hive, Machine Learning) and meetups, with the goals of meeting and getting to know each other and of growing our knowledge in these fields. I have seen more and more companies from Cluj starting to play with these technologies, so if you like this, I believe that very soon you will work on some very cool and challenging projects.

The first goal is to see how easily you can start working with Spark; the second goal is to see and try Spark's main functionality. OR: in the next two hours we are going to show you how easy it is to start working with Spark, first how you can use just your IDE (Eclipse or IDEA) to work with Spark without any cluster deployment, and second, of course, we are going to describe Spark's main functionality and try some practical examples.

Apache Spark Intro

Apache Spark history
RDD
Transformations
Actions
Hands-on session

How Spark really works:
https://www.quora.com/What-exactly-is-Apache-Spark-and-how-does-it-work
https://spark.apache.org/research.html

Apache Spark History


Where to learn Spark from?



Spark architecture

Easy ways to run Spark?

your IDE (e.g. Eclipse or IDEA)
Standalone Deploy Mode: the simplest way to deploy Spark on a single machine
Docker & Zeppelin
EMR
Hadoop vendors (Cloudera, Hortonworks)

You can swap between them: EMR, Hadoop vendors (Cloudera, Hortonworks), standalone mode, or just your IDE.

Supported languages

Spark basics

RDD
Operations: Transformations and Actions

RDD: an RDD is simply an immutable, distributed collection of objects! [diagram: objects partitioned across the cluster]

RDD is the core concept in Spark. A Spark program is based on creating an RDD, transforming an RDD, or performing an action on an RDD to get results. Another RDD definition: an RDD is a fault-tolerant collection of elements distributed across many servers, on which we can perform parallel operations.

TODO: why is an RDD immutable? (Immutability makes concurrent use safe and lets Spark recompute lost partitions from lineage.)

Creating RDD (I)

Python
lines = sc.parallelize(["workshop", "spark"])

Scala
val lines = sc.parallelize(List("workshop", "spark"))

Java
JavaRDD<String> lines = sc.parallelize(Arrays.asList("workshop", "spark"));

Spark provides two ways to create RDDs: loading an external dataset and parallelizing a collection.

Creating RDD (II)

Python
lines = sc.textFile("/path/to/file.txt")

Scala
val lines = sc.textFile("/path/to/file.txt")

Java
JavaRDD<String> lines = sc.textFile("/path/to/file.txt");

Spark provides two ways to create RDDs: loading an external dataset and parallelizing a collection.


NOTES: the user can specify a persistence priority for each RDD, controlling which RDD is spilled to disk first when memory starts to fill.

Other data structures in Spark

Paired RDD
DataFrame
DataSet

Paired RDD: an RDD made of tuple objects. DataFrame: an RDD made of Rows, with an associated schema (similar to tables in SQL). DataSet: combines the best of RDDs and DataFrames.

Paired RDD: a paired RDD is an RDD of key/value pairs. [diagram: user1–user5 keyed as id1/user1 … id5/user5]
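A paired RDD behaves like a distributed collection of (key, value) tuples. As a minimal plain-Python sketch (no Spark required; `reduce_by_key` is our own illustrative helper, not a Spark API), the slide's id/user pairs and a reduceByKey-style aggregation look like this:

```python
# Paired "RDD" as a plain list of (key, value) tuples, as on the slide
pairs = [("id1", "user1"), ("id2", "user2"), ("id3", "user3"),
         ("id4", "user4"), ("id5", "user5")]

# reduceByKey-style aggregation: merge values that share a key
def reduce_by_key(kv, fn):
    acc = {}
    for k, v in kv:
        acc[k] = fn(acc[k], v) if k in acc else v
    return list(acc.items())

# With addition as the merge function, duplicate keys collapse
print(reduce_by_key([("a", 1), ("b", 2), ("a", 3)], lambda x, y: x + y))
# [('a', 4), ('b', 2)]
```

In Spark the same idea is `rdd.reduceByKey(fn)`, except the merging happens in parallel per partition.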

Spark operations [diagram: DAG of RDD 1 through RDD 6]


- explain this DAG using an example (filtering log files)
- explain lazy initialization
- explain fault tolerance

Notes: an RDD does not need to be materialized at all times; it has enough information (its lineage) to be recomputed from data in stable storage. Coarse-grained transformations restrict RDDs to applications that do bulk write operations, but give an easy fault-tolerance strategy.
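The lineage idea in this note can be sketched in plain Python (a toy illustration, not Spark's actual implementation): each dataset remembers only its parent and the transformation that produced it, so its contents can always be recomputed from stable storage on demand:

```python
# Toy lineage: a "dataset" stores its parent and transformation, not its data.
class ToyRDD:
    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # stable storage (a plain list here)
        self.parent = parent  # parent ToyRDD in the lineage chain
        self.fn = fn          # transformation applied to the parent's data

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda data: [x for x in data if pred(x)])

    def compute(self):
        # Recompute from stable storage by replaying the lineage chain.
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.compute())

base = ToyRDD(source=[1, 2, 3, 4, 5, 6])
result = base.map(lambda x: x + 1).filter(lambda x: x != 4)
print(result.compute())  # [2, 3, 5, 6, 7]
```

Nothing is stored in the intermediate objects; losing one costs only a recomputation, which is exactly the fault-tolerance strategy the note describes.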

Transformations

Transformations describe how to transform an RDD into another RDD. [diagram: RDD 1 → RDD 2]

Transformations: starting from RDD{1,2,3,4,5,6}: map(x => x + 1) produces RDD{2,3,4,5,6,7}; filter(x => x != 4) produces RDD{1,2,3,5,6}.
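The slide's two transformations can be checked with plain Python comprehensions (just to verify the values; in Spark these would be `rdd.map` / `rdd.filter` and would stay lazy):

```python
data = [1, 2, 3, 4, 5, 6]

mapped = [x + 1 for x in data]          # map x => x + 1
filtered = [x for x in data if x != 4]  # filter x => x != 4

print(mapped)    # [2, 3, 4, 5, 6, 7]
print(filtered)  # [1, 2, 3, 5, 6]
```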

Popular transformations: map, filter, sample, union, distinct, groupByKey, reduceByKey, sortByKey, join


Actions compute a result from an RDD!


Actions: starting from InputRDD{1,2,3,4,5,6}: map(x => x + 1) produces RDD{2,3,4,5,6,7}; filter(x => x != 4) produces RDD{1,2,3,5,6}.


Popular actions: collect, count, first, take, takeSample, countByKey, saveAsTextFile
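As a rough plain-Python analogy (not the Spark API itself), most of these actions correspond to ordinary operations on a materialized list:

```python
from collections import Counter

rdd_like = [2, 3, 4, 5, 6, 7]

print(list(rdd_like))  # collect  -> [2, 3, 4, 5, 6, 7]
print(len(rdd_like))   # count    -> 6
print(rdd_like[0])     # first    -> 2
print(rdd_like[:3])    # take(3)  -> [2, 3, 4]

kv = [("a", 1), ("b", 2), ("a", 3)]
print(Counter(k for k, _ in kv))  # countByKey -> Counter({'a': 2, 'b': 1})
```

The key difference: in Spark these are the operations that actually trigger computation of the (lazy) transformation chain and return a value to the driver.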

Transformations and Actions: users → filter() → administrators → take(3)


Lazy initialization: users → filter() → administrators → take(3)
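Lazy evaluation can be sketched with Python generators (an analogy, nothing Spark-specific is used): the filter does no work until take(3) pulls exactly three results, and evaluation stops there:

```python
from itertools import islice

evaluated = []

def is_admin(user):
    evaluated.append(user)  # record which users were actually examined
    return user.startswith("admin")

users = ["admin1", "user1", "admin2", "admin3", "user2", "admin4"]

# Generator = lazy "transformation": nothing runs yet
admins = (u for u in users if is_admin(u))

# islice = "take(3)" action: pulls until 3 admins are found, then stops
first_three = list(islice(admins, 3))
print(first_three)  # ['admin1', 'admin2', 'admin3']
print(evaluated)    # ['admin1', 'user1', 'admin2', 'admin3'] - user2/admin4 never examined
```

Spark works the same way at the DAG level: take(3) only forces the upstream filter to run far enough to produce three elements.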

How Spark Executes Your Program

Hands-on session

MovieLens: the MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

Download link: http://grouplens.org/datasets/movielens/

For this workshop we chose the MovieLens dataset to play with. Using this dataset we will learn how to read files with Spark, create RDDs, and apply common transformations and actions on them. The dataset contains three files: users, ratings and movies.

MovieLens dataset
user: user_id | age | gender | occupation | zipcode
user_rating: user_id | movie_id | rating | timestamp
movie: movie_id | title | release_date | video_release | imdb_url | genres ...

Exercises already solved!

Return only the users with occupation administrator
Increase the age of each user by one
Join user and rating datasets by user id

Exercises to solve
How many men/women registered to MovieLens?
Distribution of age for males/females registered to MovieLens
Which are the movie names with rating x?
Average rating by movie
Sort users by their occupation
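As a warm-up for the first exercise, here is a plain-Python sketch over a few sample rows in the u.user layout (user_id|age|gender|occupation|zipcode; the sample values below are made up for illustration, not taken from the real dataset):

```python
from collections import Counter

# Hypothetical sample lines in the MovieLens u.user pipe-separated format
lines = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
    "4|24|M|technician|43537",
]

# Split each line into fields; field index 2 is the gender column
users = [line.split("|") for line in lines]
gender_counts = Counter(u[2] for u in users)
print(gender_counts)  # Counter({'M': 3, 'F': 1})
```

In Spark the same computation would be roughly `sc.textFile("u.user").map(lambda l: l.split("|")).map(lambda u: (u[2], 1)).countByKey()`.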

Congrats if you reached this slide!