CS60021: Scalable Data Mining, Spark. Sourangshu Bhattacharya


Page 1: CS60021: Scalable Data Mining, Spark · cse.iitkgp.ac.in/~sourangshu/coursefiles/SDM17A/mapreduce2.pdf · Scala • Scala is both functional and object-oriented – every value is an object

CS60021: Scalable Data Mining
Spark
Sourangshu Bhattacharya

SCALA

Scala
•  Scala is both functional and object-oriented
   –  every value is an object
   –  every function is a value, including methods
•  Scala is interoperable with Java
•  Scala is statically typed
   –  includes a local type inference system:
   –  in Java 1.5:
        Pair<Integer, String> p = new Pair<Integer, String>(1, "Scala");
   –  in Scala:
        val p = new MyPair(1, "scala");

Var and Val
•  Use var to declare variables:
     var x = 3;
     x += 4;
•  Use val to declare values (final vars):
     val y = 3;
     y += 4; // error
•  Notice no types, but it is statically typed:
     var x = 3;
     x = "hello world"; // error
•  Type annotations:
     var x : Int = 3;

Class definition

class Point(val xc: Int, val yc: Int) {
  var x: Int = xc
  var y: Int = yc
  def move(dx: Int, dy: Int) {
    x = x + dx
    y = y + dy
    println("Point x location : " + x);
    println("Point y location : " + y);
  }
}

Scala
•  Class instances
     val c = new IntCounter[String];
•  Accessing members
     println(c.size); // same as c.size()
•  Defining functions:
     def foo(x : Int) { println(x == 42); }
     def bar(y : Int): Int = y + 42; // no braces needed!
     def return42 = 42; // No parameters either!

Scala
•  Defining lambdas – nameless functions (types sometimes needed)
     val f = (x: Int) => x + 42;
•  Closures (context-sensitive functions)
     var y = 3;
     val g = {x : Int => y += 1; x+y; }
•  Maps (and a cool way to do some functions)
     List(1,2,3).map(_+10).foreach(println)
•  Filtering (and ranges!)
     1 to 100 filter (_ % 7 == 3) foreach (println)
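The lambda, closure, map and filter forms listed above can be tried together in a short sketch. The object name `LambdaDemo` and its members are illustrative names chosen here, not part of the slides.

```scala
object LambdaDemo {
  // A lambda with an explicit parameter type, bound to a name.
  val addFortyTwo: Int => Int = (x: Int) => x + 42

  // A closure factory: each returned function captures and updates its own counter.
  def makeCounter(): () => Int = {
    var count = 0
    () => { count += 1; count }
  }

  // map with placeholder syntax, then a range filtered by a predicate.
  val shifted: List[Int] = List(1, 2, 3).map(_ + 10)
  val threeMod7: List[Int] = (1 to 30).filter(_ % 7 == 3).toList
}
```

Calling `LambdaDemo.makeCounter()` twice yields two independent counters, which is the sense in which a closure is context sensitive: the function carries the variable it captured.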

“Statements”
•  Scala’s “statements” should really be called “expressions,” because every statement has a value
•  The value of many statements, for example the while loop, is ()
   –  () is a value of type Unit
   –  () is the only value of type Unit
   –  () basically means “Nothing to see here. Move along.”
•  The value of an if or match statement is the last value computed
•  The value of a block, {…}, is the last value computed in the block
•  A statement is ended by the end of the line (not with a semicolon) unless it is obviously incomplete, or if the next line cannot begin a valid statement
   –  For example, x = 3 * (2 * y + is obviously incomplete
   –  Because Scala lets you leave out a lot of unnecessary punctuation, sometimes a line that you think is complete really isn’t complete (or vice versa)
•  You can end statements with semicolons, but that’s not good Scala practice

Familiar statement types
•  These are the same as in Java, but have a value of ():
   –  variable = expression // also +=, *=, etc.
   –  while (condition) { statements }
   –  do { statements } while (condition)
•  These are the same as in Java, but may have a useful value:
   –  { statements }
      •  The value of the block is the last value computed in it
   –  if (condition) { statements } else { statements }
      •  The value is the value of whichever block is chosen
      •  If the value is to be used, both blocks should have the same type, otherwise the type of the result is the “least upper bound” of the two types
   –  if (condition) { statements }
      •  The value is the value of the last statement executed, but its type is the least upper bound of that type and Unit (typically AnyVal or Any) – if you want a value, you really should use an else
•  As in Java, braces around a single statement may be omitted
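To make the "every statement has a value" rule concrete, here is a minimal sketch. `ExpressionDemo` and its method names are hypothetical, introduced only for illustration.

```scala
object ExpressionDemo {
  // if/else is an expression: all branches are Int, so the result is Int.
  def sign(n: Int): Int = if (n < 0) -1 else if (n == 0) 0 else 1

  // A block's value is the last expression computed in it.
  def hypotenuseSquared(a: Int, b: Int): Int = {
    val a2 = a * a
    val b2 = b * b
    a2 + b2
  }

  // Branches of different types: the result type is their least upper bound (Any here).
  def describe(n: Int): Any = if (n % 2 == 0) n / 2 else "odd"

  // The last statement is an assignment, whose value is (), so the method returns Unit.
  def assignmentValue(): Unit = {
    var x = 0
    x = 5
  }
}
```

The `describe` method shows why mismatched branch types are best avoided: the caller gets an `Any` and has to pattern match to recover anything useful.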

Arrays
•  Arrays in Scala are parameterized types
   –  Array[String] is an Array of Strings, where String is a type parameter
   –  In Java we would call this a “generic type”
•  When no initial values are given, new is required, along with an explicit type:
   –  val ary = new Array[Int](5)
•  When initial values are given, new is not allowed:
   –  val ary2 = Array(3, 1, 4, 1, 6)
•  Array syntax in Scala is just object syntax
•  Scala’s Lists are more useful, and used more often, than Arrays
   –  val list1 = List(3, 1, 4, 1, 6)
   –  val list2 = List[Int]() // An empty list must have an explicit type

Simple List operations
•  By default, Lists, like Strings, are immutable
   –  Operations on an immutable List return a new List
•  Basic operations:
   –  list.head (or list head) returns the first element in the list
   –  list.tail (or list tail) returns a list with the first element removed
   –  list(i) returns the ith element (starting from 0) of the list
   –  list(i) = value is illegal (immutable, remember?)
   –  value :: list returns a list with value prepended to the front
   –  list1 ::: list2 appends (“concatenates”) the two lists
   –  list.contains(value) (or list contains value) tests whether value is in list
•  Many operations on Lists also work on other kinds of Sequences
•  An operation on a Sequence may return a Sequence of a different type
   –  scala> "abc" :: List(1, 2, 3)
      res22: List[Any] = List(abc, 1, 2, 3)
   –  This happens because Any is the only type that can hold both integers and strings
•  There are over 150 built-in operations on Lists—use the API!
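The basic operations can be checked in a few lines. This is a small sketch with an illustrative object name (`ListDemo`); it uses only standard `List` methods.

```scala
object ListDemo {
  val primes: List[Int] = List(2, 3, 5, 7)

  // Each operation returns a new List; `primes` itself never changes.
  val first: Int = primes.head                      // first element
  val rest: List[Int] = primes.tail                 // everything after the first element
  val third: Int = primes(2)                        // zero-based indexing
  val withOne: List[Int] = 1 :: primes              // prepend a single element
  val joined: List[Int] = primes ::: List(11, 13)   // concatenate two lists
  val mixed: List[Any] = "abc" :: List(1, 2, 3)     // element type widens to Any
}
```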

Tuples
•  Scala has tuples, up to size 22 (why 22? I have no idea.)
   –  scala> val t = Tuple3(3, "abc", 5.5)
      t: (Int, java.lang.String, Double) = (3,abc,5.5)
   –  scala> val tt = (3, "abc", 5.5)
      tt: (Int, java.lang.String, Double) = (3,abc,5.5)
•  Tuples are referenced starting from 1, using _1, _2, ...
   –  scala> val t = ('a', "one", 1)
      t: (Char, String, Int) = (a,one,1)
   –  scala> t._2
      res3: String = one
•  Tuples, like Lists, are immutable
•  Tuples are a great way to return more than one value from a method
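As an illustration of returning more than one value from a method, consider the following sketch. The names `TupleDemo`, `minMax` and `divMod` are made up for this example.

```scala
object TupleDemo {
  // Returns the smallest and largest element as a pair; assumes `xs` is non-empty.
  def minMax(xs: List[Int]): (Int, Int) = (xs.min, xs.max)

  // Returns quotient and remainder together; assumes `b` is non-zero.
  def divMod(a: Int, b: Int): (Int, Int) = (a / b, a % b)
}
```

A caller can unpack the result with a pattern, for example `val (lo, hi) = TupleDemo.minMax(List(3, 1, 4, 1, 6))`, or access the parts as `._1` and `._2`.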

Simple method definitions
•  def isEven(n: Int) = {
     val m = n % 2
     m == 0
   }
   –  The result is the last value (in this case, a Boolean)
   –  This is really kind of poor style; the extra variable isn’t needed
•  def isEven(n: Int) = n % 2 == 0
   –  This is much better style
   –  The result is just a single expression, so no braces are needed
•  def countTo(n: Int) {
     for (i <- 1 to n) { println(i) }
   }
   –  It’s good style to omit the = when the result is ()
   –  If you omit the =, the result will be ()
   –  You need either braces or an =; you can’t leave out both
   –  Note: this “procedure syntax” is deprecated since Scala 2.13 and is not supported in Scala 3, where the method is written def countTo(n: Int): Unit = { … }

Methods and functions
•  A method is a function that belongs to an object
   –  Methods usually use, and manipulate, the fields of an object
•  A function does not belong to any object
   –  The inputs to a function are, or should be, just its parameters
   –  The result of calling a function is, or should be, just its return value
•  Scala can sometimes accept a method where a function is expected
   –  A method can sometimes be “converted” to a function by following it with an underscore (for example, isEven _)

Functions are first-class objects
•  Functions are values (like integers, etc.) and can be assigned to variables, passed to and returned from functions, and so on
•  Wherever you see the => symbol, it’s a literal function
•  Example (assigning a literal function to the variable foo):
   –  scala> val foo = (x: Int) => if (x % 2 == 0) x / 2 else 3 * x + 1
      foo: (Int) => Int = <function1>
      scala> foo(7)
      res28: Int = 22
•  The basic syntax of a function literal is parameter_list => function_body
•  In this example, foreach is a function that takes a function as a parameter:
   –  myList.foreach(i => println(2 * i))

Functions as parameters
•  To define a function, you must specify the types of each of its parameters
•  Therefore, to have a function parameter, you must know how to write its type:
   –  (type1, type2, ..., typeN) => return_type
   –  type => return_type // if only one parameter
   –  () => return_type // if no parameters
•  Example:
   –  scala> def doTwice(f: Int => Int, n: Int) = f(f(n))
      doTwice: (f: (Int) => Int, n: Int)Int
      scala> def collatz(n: Int) = if (n % 2 == 0) n / 2 else 3 * n + 1
      collatz: (n: Int)Int
      scala> doTwice(collatz, 7)
      res2: Int = 11
      scala> doTwice(a => 101 * a, 3)
      res4: Int = 30603

Higher-order methods on Lists
•  map applies a one-parameter function to every element of a List, returning a new List
   –  scala> def double(n: Int) = 2 * n
      double: (n: Int)Int
   –  scala> val ll = List(2, 3, 5, 7, 11)
      ll: List[Int] = List(2, 3, 5, 7, 11)
   –  scala> ll map double
      res5: List[Int] = List(4, 6, 10, 14, 22)
   –  scala> ll map (n => 3 * n)
      res6: List[Int] = List(6, 9, 15, 21, 33)
   –  scala> ll map (n => n > 5)
      res8: List[Boolean] = List(false, false, false, true, true)
•  filter applies a one-parameter test to every element of a List, returning a List of those elements that pass the test
   –  scala> ll filter (n => n < 5)
      res10: List[Int] = List(2, 3)
   –  scala> ll filter (_ < 5) // abbreviated function where parameter is used once
      res11: List[Int] = List(2, 3)

More higher-order methods
•  def filterNot(p: (A) => Boolean): List[A]
   –  Selects all elements of this list which do not satisfy a predicate
•  def count(p: (A) => Boolean): Int
   –  Counts the number of elements in the list which satisfy a predicate
•  def forall(p: (A) => Boolean): Boolean
   –  Tests whether a predicate holds for every element of this list
•  def exists(p: (A) => Boolean): Boolean
   –  Tests whether a predicate holds for at least one of the elements of this list
•  def find(p: (A) => Boolean): Option[A]
   –  Finds the first element of the list satisfying a predicate, if any
•  def sortWith(lt: (A, A) => Boolean): List[A]
   –  Sorts this list according to a comparison function
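The six methods above can be exercised on the list `ll` from the previous slide. The object name `HigherOrderDemo` and the value names are illustrative only.

```scala
object HigherOrderDemo {
  val ll: List[Int] = List(2, 3, 5, 7, 11)

  val notSmall: List[Int] = ll.filterNot(_ < 5)       // drop elements below 5
  val oddCount: Int = ll.count(_ % 2 == 1)            // how many are odd
  val allPositive: Boolean = ll.forall(_ > 0)         // holds for every element?
  val hasEven: Boolean = ll.exists(_ % 2 == 0)        // holds for at least one?
  val firstAboveFive: Option[Int] = ll.find(_ > 5)    // Some(first match) or None
  val noneAboveTwenty: Option[Int] = ll.find(_ > 20)  // no element matches
  val descending: List[Int] = ll.sortWith(_ > _)      // sort by a comparison function
}
```

Note that `find` returns an `Option`, so the "no match" case is an ordinary value (`None`) rather than an exception or a null.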

SPARK

Spark
Spark is an In-Memory Cluster Computing platform for Iterative and Interactive Applications.
http://spark.apache.org

Spark
•  Started in AMPLab at UC Berkeley.
•  Resilient Distributed Datasets.
•  Data and/or Computation Intensive.
•  Scalable – fault tolerant.
•  Integrated with SCALA.
•  Straggler handling.
•  Data locality.
•  Easy to use.

Background
•  Commodity clusters have become an important computing platform for a variety of applications
   –  In industry: search, machine translation, ad targeting, …
   –  In research: bioinformatics, NLP, climate simulation, …
•  High-level cluster programming models like MapReduce power many of these apps
•  Theme of this work: provide similarly powerful abstractions for a broader class of applications

Motivation
•  Current popular programming models for clusters transform data flowing from stable storage to stable storage
•  E.g., MapReduce:
   [Figure: Input → Map (×3) → Reduce (×2) → Output]
Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures

Motivation
•  Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:
   –  Iterative algorithms (many in machine learning)
   –  Interactive data mining tools (R, Excel, Python)
•  Spark makes working sets a first-class concept to efficiently support these apps

Spark Goal
•  Provide distributed memory abstractions for clusters to support apps with working sets
•  Retain the attractive properties of MapReduce:
   –  Fault tolerance (for crashes & stragglers)
   –  Data locality
   –  Scalability
Solution: augment data flow model with “resilient distributed datasets” (RDDs)

Resilient Distributed Datasets
•  Immutable distributed SCALA collections.
   –  Array, List, Map, Set, etc.
•  Transformations on RDDs create new RDDs.
   –  Map, ReduceByKey, Filter, Join, etc.
•  Actions on RDD return values.
   –  Reduce, collect, count, take, etc.
•  Seamlessly integrated into a SCALA program.
•  RDDs are materialized when needed.
•  RDDs are cached to disk – graceful degradation.
•  Spark framework re-computes lost splits of RDDs.

RDDs in More Detail
•  An RDD is an immutable, partitioned, logical collection of records
   –  Need not be materialized, but rather contains information to rebuild a dataset from stable storage
•  Partitioning can be based on a key in each record (using hash or range partitioning)
•  Built using bulk transformations on other RDDs
•  Can be cached for future reuse

RDD Operations
•  Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
•  Actions (return a result to driver): reduce, collect, count, save, lookupKey, …

RDD Fault Tolerance
•  RDDs maintain lineage information that can be used to reconstruct lost partitions
•  Ex:
     cachedMsgs = textFile(...).filter(_.contains("error"))
                               .map(_.split('\t')(2))
                               .cache()
   [Figure: lineage chain HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD]

Benefits of RDD Model
•  Consistency is easy due to immutability
•  Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data)
•  Locality-aware scheduling of tasks on partitions
•  Despite being restricted, model seems applicable to a broad variety of applications

Spark Architecture

Example: MapReduce
•  MapReduce data flow can be expressed using RDD transformations

   res = data.flatMap(rec => myMapFunc(rec))
             .groupByKey()
             .map { case (key, vals) => myReduceFunc(key, vals) }

Or with combiners:

   res = data.flatMap(rec => myMapFunc(rec))
             .reduceByKey(myCombiner)
             .map { case (key, value) => myReduceFunc(key, value) }

Word Count in Spark

val lines = spark.textFile("hdfs://...")
val counts = lines.flatMap(_.split("\\s"))
                  .map(word => (word, 1))   // pair each word with a count of 1
                  .reduceByKey(_ + _)       // sum the counts per word
counts.saveAsTextFile("hdfs://...")
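The same data flow can be tried without a cluster. The sketch below runs on ordinary Scala collections rather than RDDs, with `groupBy` standing in for the shuffle that `reduceByKey` performs; `LocalWordCount` is a hypothetical name, and the filter for empty tokens is an addition of this sketch.

```scala
object LocalWordCount {
  // Counts words in an in-memory collection of lines,
  // mirroring the flatMap -> map -> reduceByKey pipeline.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))   // split each line into words
      .filter(_.nonEmpty)         // drop empty tokens (e.g. from blank lines)
      .map(word => (word, 1))     // emit (word, 1) pairs
      .groupBy(_._1)              // group the pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum counts per word
}
```

For example, `LocalWordCount.wordCount(Seq("a b a", "b c"))` counts `a` and `b` twice and `c` once.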

Example: Matrix Multiplication

Matrix Multiplication
•  Representation of Matrix:
   –  List <Row index, Col index, Value>
   –  Size of matrices: First matrix (A): m*k, Second matrix (B): k*n
•  Scheme:
   –  For each input record (from either matrix):
      •  Mapper key: <row_index_matrix_1, Column_index_matrix_2>
      •  Mapper value: <column_index_1 / row_index_2, value>
   –  GroupByKey: List(Mapper Values)
   –  Collect all (two) records with the same first field, multiply them and add to the sum.
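One way to read the scheme above is sketched below on plain Scala collections (no Spark). It assumes zero-based indices and that each A entry is replicated to all n output columns and each B entry to all m output rows; the slide does not spell this out, so treat it as an interpretation. `MatrixMultiplySketch` and `Entry` are names invented for this example.

```scala
object MatrixMultiplySketch {
  // A sparse matrix entry: (row index, column index, value).
  type Entry = (Int, Int, Double)

  // Multiplies A (m x k) by B (k x n), both given as entry lists.
  def multiply(a: Seq[Entry], b: Seq[Entry], m: Int, n: Int): Map[(Int, Int), Double] = {
    // Map phase: send each entry to every output cell it contributes to.
    // The mapper value carries the shared index j (column of A / row of B).
    val fromA = for ((i, j, v) <- a; c <- 0 until n) yield ((i, c), (j, v))
    val fromB = for ((j, c, v) <- b; i <- 0 until m) yield ((i, c), (j, v))

    // Group phase: collect all mapper values for each output cell (i, c).
    (fromA ++ fromB).groupBy(_._1).map { case (cell, records) =>
      // Reduce phase: pair the two records sharing the same j, multiply, and sum.
      val byJ = records.map(_._2).groupBy(_._1)
      val sum = byJ.values.collect { case Seq((_, x), (_, y)) => x * y }.sum
      (cell, sum)
    }
  }
}
```

Replicating every entry is simple but sends each value across the cluster many times, which is why practical implementations usually work on blocks of the matrices instead.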

Example: Logistic Regression

Logistic Regression
•  Binary Classification. y ∈ {+1, −1}
•  Probability of classes given by linear model:
   \[ p(y \mid x, w) = \frac{1}{1 + e^{-y w^T x}} \]
•  Regularized ML estimate of w given dataset (x_i, y_i) is obtained by minimizing:
   \[ l(w) = \sum_i \log\left(1 + \exp(-y_i w^T x_i)\right) + \frac{\lambda}{2} w^T w \]

Logistic Regression
•  Gradient of the objective is given by:
   \[ \nabla l(w) = -\sum_i \left(1 - \sigma(y_i w^T x_i)\right) y_i x_i + \lambda w \]
   where σ(z) = 1 / (1 + e^{−z}) is the logistic sigmoid
•  Gradient Descent updates are:
   \[ w_{t+1} = w_t - s \nabla l(w_t) \]
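As a check on the signs in the gradient, the objective l(w) can be differentiated term by term. The shorthand z_i = y_i w^T x_i and the sigmoid σ(z) = 1/(1+e^{-z}) are notation assumed for this sketch.

```latex
\begin{align*}
\frac{\partial}{\partial w}\log\left(1+e^{-z_i}\right)
  &= \frac{-e^{-z_i}}{1+e^{-z_i}}\, y_i x_i
   = -\left(1-\sigma(z_i)\right) y_i x_i, \\
\frac{\partial}{\partial w}\,\frac{\lambda}{2}\, w^T w &= \lambda w, \\
\nabla l(w) &= -\sum_i \left(1-\sigma(y_i w^T x_i)\right) y_i x_i + \lambda w .
\end{align*}
```

The second equality in the first line uses e^{-z}/(1+e^{-z}) = 1 − σ(z). Because the update subtracts s∇l(w), each step moves w toward examples with a small or negative margin (where 1 − σ is large) and shrinks w through the regularizer.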

Spark Implementation

// loadData, grad, linesearch, norm and the tolerance e are assumed to be defined elsewhere
val x = loadData(file)   // creates RDD
var w = 0                // initial weights (stands for the zero vector)
var g = w                // gradient; declared outside the loop so the loop condition can see it
do {
  g = x.map(a => grad(w, a)).reduce(_ + _)   // map creates an RDD; reduce sums the per-example gradients
  val s = linesearch(x, w, g)                // step size
  w = w - s * g
} while (norm(g) > e)
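The loop can be made concrete on a small in-memory dataset. The sketch below is not the course implementation: it uses a plain `Seq` in place of an RDD, a fixed step size in place of the line search, and a fixed number of iterations in place of the gradient-norm test. All names (`LogisticGD`, `Vec`, `train`, and so on) are introduced here for illustration.

```scala
object LogisticGD {
  type Vec = Vector[Double]

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (p, q) => p * q }.sum
  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (p, q) => p + q }
  def scale(c: Double, a: Vec): Vec = a.map(_ * c)
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Regularized logistic loss l(w) for labels y in {+1, -1}.
  def loss(data: Seq[(Vec, Double)], w: Vec, lambda: Double): Double =
    data.map { case (x, y) => math.log(1.0 + math.exp(-y * dot(w, x))) }.sum +
      0.5 * lambda * dot(w, w)

  // Gradient of l(w): a per-example map followed by a reduce, plus the regularizer.
  def gradient(data: Seq[(Vec, Double)], w: Vec, lambda: Double): Vec = {
    val summed = data
      .map { case (x, y) => scale(-(1.0 - sigmoid(y * dot(w, x))) * y, x) }
      .reduce(add)
    add(summed, scale(lambda, w))
  }

  // Fixed-step gradient descent starting from the zero vector.
  def train(data: Seq[(Vec, Double)], dim: Int, lambda: Double, step: Double, iters: Int): Vec =
    (1 to iters).foldLeft(Vector.fill(dim)(0.0)) { (w, _) =>
      add(w, scale(-step, gradient(data, w, lambda)))
    }
}
```

The `map` followed by `reduce(add)` in `gradient` is the part that Spark distributes: each worker computes per-example gradients for its partitions, and only the summed vector travels back to the driver.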

Scaleup with Cores
[Figure: time in seconds (0–3500) versus number of cores (0–7) on the Epsilon (Pascal Challenge) dataset, comparing Spark-GD, Spark-CG and Liblinear-C++.]

Scaleup with Nodes
[Figure: time in seconds (0–300) versus number of nodes (0–35) on the Epsilon (Pascal Challenge) dataset for Spark-GD and Spark-CG; Liblinear-C++ is annotated at 996.24 s.]

Example: PageRank

Basic Idea
•  Give pages ranks (scores) based on links to them
   –  Links from many pages ⇒ high rank
   –  Link from a high-rank page ⇒ high rank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

Algorithm
1.  Start each page at a rank of 1
2.  On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3.  Set each page’s rank to 0.15 + 0.85 × contribs

[Figure: a four-page example graph stepped through the algorithm. All ranks start at 1.0; after the first iteration they are 0.58, 1.0, 1.85 and 0.58; after the second, 0.39, 1.72, 1.31 and 0.58; the final state is 0.46, 1.37, 1.44 and 0.73.]
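The three steps can be traced on a small graph held in memory. This is a sketch on ordinary Scala `Map`s, not Spark code; `LocalPageRank` is a hypothetical name, and the function assumes every page has at least one outgoing and one incoming link so that every page receives a contribution in each iteration.

```scala
object LocalPageRank {
  // links: page -> pages it links to.
  def pageRank(links: Map[String, Seq[String]], iterations: Int): Map[String, Double] = {
    // Step 1: start each page at a rank of 1.
    var ranks: Map[String, Double] = links.map { case (page, _) => (page, 1.0) }
    for (_ <- 1 to iterations) {
      // Step 2: each page sends rank / |neighbors| to each of its neighbors.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (page, nbrs) =>
        nbrs.map(dest => (dest, ranks(page) / nbrs.size))
      }
      // Step 3: sum the contributions per page and apply 0.15 + 0.85 * contribs.
      ranks = contribs.groupBy(_._1).map { case (page, cs) =>
        (page, 0.15 + 0.85 * cs.map(_._2).sum)
      }
    }
    ranks
  }
}
```

With the example graph `Map("a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a", "d"), "d" -> Seq("c"))` (made up for this sketch), page `c` collects contributions from three pages and ends up with the highest rank, while the ranks keep summing to the number of pages.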

Spark Implementation

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (nhb, rank)) =>
      nhb.map(dest => (dest, rank / nhb.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

Example: Alternating Least Squares

Collaborative filtering

Matrix Factorization

Alternating Least Squares

Naïve Spark ALS

Efficient Spark ALS

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")            // base RDD
errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count      // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the driver sends tasks to three workers and collects results; each worker reads one block (Block 1–3) and holds its cached partition (Cache 1–3).]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Spark Scheduler
•  Dryad-like DAGs
•  Pipelines functions within a stage
•  Cache-aware work reuse & locality
•  Partitioning-aware to avoid shuffles
[Figure: example job DAG over RDDs A–G built with groupBy, map, union and join, divided into Stage 1, Stage 2 and Stage 3; shaded boxes mark cached data partitions.]

User Log Mining

val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...").persist()

def processNewLogs(logFileName: String) {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events) // RDD of (UserID, (UserInfo, LinkInfo)) pairs
  val offTopicVisits = joined.filter {
    case (userId, (userInfo, linkInfo)) => // Expand the tuple into its components
      !userInfo.topics.contains(linkInfo.topic) // keep visits to topics the user is not subscribed to
  }.count()
  println("Number of visits to non-subscribed topics: " + offTopicVisits)
}

User Log Mining

val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
                 .partitionBy(new HashPartitioner(100)) // Create 100 partitions
                 .persist()

def processNewLogs(logFileName: String) {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events) // RDD of (UserID, (UserInfo, LinkInfo)) pairs
  val offTopicVisits = joined.filter {
    case (userId, (userInfo, linkInfo)) => // Expand the tuple into its components
      !userInfo.topics.contains(linkInfo.topic) // keep visits to topics the user is not subscribed to
  }.count()
  println("Number of visits to non-subscribed topics: " + offTopicVisits)
}

Partitioning
•  Operations benefiting from partitioning:
   cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), and lookup().
•  Operations affecting partitioning:
   cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), partitionBy(), sort(), mapValues() (if the parent RDD has a partitioner), flatMapValues() (if parent has a partitioner), and filter() (if parent has a partitioner).
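To see why key-based operations benefit from partitioning, it helps to look at what a hash partitioner computes. The sketch below is a simplified stand-in written on plain Scala collections; `HashPartitionSketch`, `partitionOf` and `partitionBy` are names made up here, and Spark's own `HashPartitioner` may differ in details.

```scala
object HashPartitionSketch {
  // Maps a key to a partition index in [0, numPartitions); the adjustment keeps
  // the index non-negative when hashCode is negative.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Splits key-value records into buckets by the partition index of their key.
  def partitionBy[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => partitionOf(k, numPartitions) }
}
```

Because the partition index depends only on the key and the number of partitions, two datasets partitioned the same way place equal keys in the same bucket. A join between them can then match records bucket by bucket instead of reshuffling both sides.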

PageRank (Revisited)

val links = sc.objectFile[(String, Seq[String])]("links")
              .partitionBy(new HashPartitioner(100))
              .persist()
var ranks = links.mapValues(v => 1.0)
for (i <- 0 until 10) {
  val contributions = links.join(ranks).flatMap {
    case (pageId, (nbh, rank)) =>
      nbh.map(dest => (dest, rank / nbh.size))
  }
  ranks = contributions.reduceByKey((x, y) => x + y).mapValues(v => 0.15 + 0.85 * v)
}
ranks.saveAsTextFile("ranks")

Accumulators

val sc = new SparkContext(...)
val file = sc.textFile("file.txt")
val blankLines = sc.accumulator(0) // Create an Accumulator[Int] initialized to 0
val callSigns = file.flatMap(line => {
  if (line == "") {
    blankLines += 1 // Add to the accumulator
  }
  line.split(" ")
})
callSigns.saveAsTextFile("output.txt")
println("Blank lines: " + blankLines.value)

Physical Execution Plan
•  User code defines a DAG (directed acyclic graph) of RDDs
   –  Operations on RDDs create new RDDs that refer back to their parents, thereby creating a graph.
•  Actions force translation of the DAG to an execution plan
   –  When you call an action on an RDD it must be computed. This requires computing its parent RDDs as well. Spark’s scheduler submits a job to compute all needed RDDs. That job will have one or more stages, which are parallel waves of computation composed of tasks. Each stage will correspond to one or more RDDs in the DAG. A single stage can correspond to multiple RDDs due to pipelining.
•  Tasks are scheduled and executed on a cluster
   –  Stages are processed in order, with individual tasks launching to compute segments of the RDD. Once the final stage is finished in a job, the action is complete.

Tasks
•  Each task internally performs the following steps:
   –  Fetching its input, either from data storage (if the RDD is an input RDD), an existing RDD (if the stage is based on already cached data), or shuffle outputs.
   –  Performing the operation necessary to compute RDD(s) that it represents. For instance, executing filter() or map() functions on the input data, or performing grouping or reduction.
   –  Writing output to a shuffle, to external storage, or back to the driver (if it is the final RDD of an action such as count()).