Transcript of DS 3001: Foundations of Data Science (Spark)
(kmlee/ds3001/spark_intro.pdf, 2018)

Page 1: DS 3001: Foundations of Data Science (Spark)

Adapted from Cloudera and Stanford Univ.

Page 2: The Plan

• Getting started with Spark
• RDDs
• Commonly useful operations
• Using Python
• Using Java
• Using Scala
• Help session

Page 3:

Go download Spark now (version 2.2.1 for Hadoop 2.7):
https://spark.apache.org/downloads.html
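
The shell examples on the later slides assume a SparkContext named sc. The pyspark shell creates one automatically; in a standalone script you build it yourself. A minimal sketch (the app name is illustrative):

from pyspark import SparkConf, SparkContext

# Run locally using all available cores; 'ds3001-intro' is just a label
# that appears in the Spark UI.
conf = SparkConf().setAppName('ds3001-intro').setMaster('local[*]')
sc = SparkContext(conf=conf)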

Page 4: Three Deployment Options

[Diagram: three deployment options, each showing where the driver, cluster manager, executors, and tasks run. Local: the driver and its tasks run in a single process. Stand-alone cluster: the driver submits work to Spark's own cluster manager, which runs tasks in executors. Managed cluster: the driver submits work to an external cluster manager (YARN), which runs tasks in executors.]

Page 6: Resilient Distributed Datasets

• Spark is RDD-centric
• RDDs are immutable
• RDDs are computed lazily
• RDDs can be cached
• RDDs know who their parents are
• RDDs that contain only tuples of two elements are "pair RDDs"
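
A minimal shell sketch of the laziness and caching points above, using a small in-memory RDD (values are illustrative):

>>> nums = sc.parallelize([1, 2, 3, 4])  # no computation happens yet
>>> squares = nums.map(lambda x: x * x)  # map() is lazy: still no computation
>>> squares = squares.cache()            # mark the result for caching once computed
>>> squares.count()                      # count() is an action: work happens now
4
>>> squares.collect()                    # later actions reuse the cached result
[1, 4, 9, 16]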

Page 7: Useful RDD Actions

• take(n) – return the first n elements in the RDD as an array.
• collect() – return all elements of the RDD as an array. Use with caution.
• count() – return the number of elements in the RDD as an int.
• saveAsTextFile('path/to/dir') – save the RDD to files in a directory. Will create the directory if it doesn't exist and will fail if it does.
• foreach(func) – execute the function against every element in the RDD, but don't keep any results.
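
A short sketch of a few of these actions against a small in-memory RDD (contents are illustrative):

>>> data = sc.parallelize([u'Apple,Amy', u'Butter,Bob', u'Cheese,Chucky'])
>>> data.count()
3
>>> data.take(2)
[u'Apple,Amy', u'Butter,Bob']
>>> data.saveAsTextFile('path/to/output')  # fails if path/to/output already exists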

Page 8: Useful RDD Operations

Page 9: map()

Apply an operation to every element of an RDD and return a new RDD that contains the results.

>>> data = sc.textFile('path/to/file')
>>> data.take(3)
[u'Apple,Amy', u'Butter,Bob', u'Cheese,Chucky']
>>> data.map(lambda line: line.split(',')).take(3)
[[u'Apple', u'Amy'], [u'Butter', u'Bob'], [u'Cheese', u'Chucky']]

Page 10: flatMap()

Apply an operation to every element of an RDD and return a new RDD that contains the results after dropping the outermost container.

>>> data = sc.textFile('path/to/file')
>>> data.take(3)
[u'Apple,Amy', u'Butter,Bob', u'Cheese,Chucky']
>>> data.flatMap(lambda line: line.split(',')).take(6)
[u'Apple', u'Amy', u'Butter', u'Bob', u'Cheese', u'Chucky']

Page 11: mapValues()

Apply an operation to the value of every element of an RDD and return a new RDD that contains the results. Only works with pair RDDs.

>>> data = sc.textFile('path/to/file')
>>> data = data.map(lambda line: line.split(','))
>>> data = data.map(lambda pair: (pair[0], pair[1]))
>>> data.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data.mapValues(lambda name: name.lower()).take(3)
[(u'Apple', u'amy'), (u'Butter', u'bob'), (u'Cheese', u'chucky')]

Page 12: flatMapValues()

Apply an operation to the value of every element of an RDD and return a new RDD that contains the results after removing the outermost container. Only works with pair RDDs.

>>> data = sc.textFile('path/to/file')
>>> data = data.map(lambda line: line.split(','))
>>> data = data.map(lambda pair: (pair[0], pair[1]))
>>> data.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data.flatMapValues(lambda name: name.lower()).take(3)
[(u'Apple', u'a'), (u'Apple', u'm'), (u'Apple', u'y')]

Page 13: filter()

Return a new RDD that contains only the elements that pass a filter operation.

>>> import re
>>> data = sc.textFile('path/to/file')
>>> data.take(3)
[u'Apple,Amy', u'Butter,Bob', u'Cheese,Chucky']
>>> data.filter(lambda line: re.match(r'^[AEIOU]', line)).take(3)
[u'Apple,Amy', u'Egg,Edward', u'Oxtail,Oscar']

Page 14: groupByKey()

Group the values of an RDD by key and return a new RDD in which each distinct key is paired with an iterable over all of its values. Only works with pair RDDs.

>>> data = sc.textFile('path/to/file')
>>> data = data.map(lambda line: line.split(','))
>>> data = data.map(lambda pair: (pair[0], pair[1]))
>>> data.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data.groupByKey().take(1)
[(u'Apple', <pyspark.resultiterable.ResultIterable object at 0x102ed1290>)]
>>> for pair in data.groupByKey().take(1):
...     print "%s:%s" % (pair[0], ",".join([n for n in pair[1]]))
Apple:Amy,Adam,Alex

Page 15: reduceByKey()

Combine the elements of an RDD by key, applying a reduce operation to pairs of values until only a single value remains for each key. Return the results in a new RDD.

>>> data = sc.textFile('path/to/file')
>>> data = data.map(lambda line: line.split(','))
>>> data = data.map(lambda pair: (pair[0], pair[1]))
>>> data.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data.reduceByKey(lambda v1, v2: v1 + ":" + v2).take(1)
[(u'Apple', u'Amy:Alex:Adam')]
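
The most common use of reduceByKey() is counting occurrences per key. A sketch against the same file; the order of take()'s output after a shuffle may differ on your machine:

>>> pairs = sc.textFile('path/to/file').map(lambda line: (line.split(',')[0], 1))
>>> pairs.reduceByKey(lambda a, b: a + b).take(2)
[(u'Apple', 3), (u'Butter', 1)]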

Page 16: sortBy()

Sort an RDD according to a sorting function and return the results in a new RDD.

>>> data = sc.textFile('path/to/file')
>>> data = data.map(lambda line: line.split(','))
>>> data = data.map(lambda pair: (pair[0], pair[1]))
>>> data.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data.sortBy(lambda pair: pair[1]).take(3)
[(u'Avocado', u'Adam'), (u'Anchovie', u'Alex'), (u'Apple', u'Amy')]

Page 17: sortByKey()

Sort an RDD according to the natural ordering of the keys and return the results in a new RDD.

>>> data = sc.textFile('path/to/file')
>>> data = data.map(lambda line: line.split(','))
>>> data = data.map(lambda pair: (pair[0], pair[1]))
>>> data.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data.sortByKey().take(3)
[(u'Anchovie', u'Alex'), (u'Apple', u'Amy'), (u'Avocado', u'Adam')]

Page 18: subtract()

Return a new RDD that contains all the elements from the original RDD that do not appear in a target RDD.

>>> data1 = sc.textFile('path/to/file1')
>>> data1.take(3)
[u'Apple,Amy', u'Butter,Bob', u'Cheese,Chucky']
>>> data2 = sc.textFile('path/to/file2')
>>> data2.take(3)
[u'Wendy', u'McDonald,Ronald', u'Cheese,Chucky']
>>> data1.subtract(data2).take(3)
[u'Apple,Amy', u'Butter,Bob', u'Dinkel,Dieter']

Page 19: join()

Return a new RDD that contains all the elements from the original RDD joined (inner join) with elements from the target RDD.

>>> data1 = sc.textFile('path/to/file1').map(lambda line: line.split(',')).map(lambda pair: (pair[0], pair[1]))
>>> data1.take(3)
[(u'Apple', u'Amy'), (u'Butter', u'Bob'), (u'Cheese', u'Chucky')]
>>> data2 = sc.textFile('path/to/file2').map(lambda line: line.split(',')).map(lambda pair: (pair[0], pair[1]))
>>> data2.take(3)
[(u'Doughboy', u'Pilsbury'), (u'McDonald', u'Ronald'), (u'Cheese', u'Chucky')]
>>> data1.join(data2).collect()
[(u'Cheese', (u'Chucky', u'Chucky'))]
>>> data1.fullOuterJoin(data2).take(2)
[(u'Apple', (u'Amy', None)), (u'Cheese', (u'Chucky', u'Chucky'))]
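
Pair RDDs also support leftOuterJoin() and rightOuterJoin(), which keep all keys from the left or right side respectively. A sketch with the same data (output order may vary):

>>> data1.leftOuterJoin(data2).take(2)
[(u'Apple', (u'Amy', None)), (u'Cheese', (u'Chucky', u'Chucky'))]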

Page 20: Thank you