Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil

Custom applications with Spark's RDD
Tejas Patil, Facebook

Transcript of Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil

Page 1

Custom applications with Spark's RDD
Tejas Patil, Facebook

Page 2

Agenda

• Use case
• Real world applications
• Previous solution
• Spark version
• Data skew
• Performance evaluation

Page 3

N-gram language model training

Page 4

Example: "Can you please come here ?"

[Diagram: the preceding words form the history and the last word is the word being predicted; together they make a 5-gram.]

Page 5

Real world applications

Page 6

Auto-subtitling for Page videos

Page 7

Detecting low quality places

• Non-public places
  • My home
  • Home sweet home
• Non-real places
  • Apt #00, Fake lane, Foo City, CA
  • Mordor, Westeros !!
• Not suitable for watching
  • Anything containing nudity, intense sexuality, profanity or disturbing content

Page 8

Previous solution

Page 9

[Pipeline diagram: a Hive query reads the source Hive table and feeds sub-model training jobs 1 through `n`; their outputs are the intermediate sub-models LM1, LM2, …, LM`n`, which an interpolation algorithm combines into the final language model.]

Page 10

[Same pipeline diagram, highlighting the sub-model 1 training job, which was implemented as a Hive query:]

INSERT OVERWRITE TABLE sub_model_1
SELECT .... FROM (
  REDUCE m.ngram, m.group_key, m.count
  USING "./train_model --config=myconfig.json ...."
  AS `ngram`, `count`, ...
  FROM (
    SELECT ... FROM data_source WHERE ...
    DISTRIBUTE BY group_key
  )
) GROUP BY `ngram`

Page 11

Lessons learned

• SQL is not a good choice for building such applications
  • Duplication
  • Poor readability
  • Brittle, no testing
• Alternatives
  • Map-reduce
  • Query templating
• Latency while training with large data

Page 12

Spark solution

Page 13

Spark solution

• Same high level architecture
• Hive tables as final inputs and outputs
• Same binaries used in Hive TRANSFORM
• RDDs, not Datasets
• `pipe()` operator (see the sketch below)
• Modular, readable, maintainable
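A minimal sketch of how pipe() can wrap the existing training binary. This is an illustration under assumed paths, partition count, and record format, not the speaker's actual code; the command string is the one shown in the Hive query on page 10.

import org.apache.spark.{SparkConf, SparkContext}

object PipeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipe-sketch"))

    // One n-gram count per line; the path and "ngram<TAB>count" format are assumptions.
    val ngramCounts = sc.textFile("hdfs:///path/to/ngram_counts")

    // Stream every partition through the same binary that the Hive TRANSFORM used;
    // each external process reads records on stdin and writes trained rows to stdout.
    val trained = ngramCounts
      .repartition(256)
      .pipe(Seq("./train_model", "--config=myconfig.json"))

    trained.saveAsTextFile("hdfs:///path/to/sub_model_1")
    sc.stop()
  }
}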

Page 14

Configuration

Pipeline Configuration
- where is the input data?
- where to store the final output?
- spark specific configs:
    "spark.dynamicAllocation.maxExecutors"
    "spark.executor.memory"
    "spark.memory.storageFraction"
    …
- list of Component Configuration
    …

(a rough case-class sketch of this configuration follows)
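A rough sketch of how the configuration above could be modeled in Scala. The class and field names are assumptions; only the Spark properties and the config.largeShardThreshold / component.order references on page 35 come from the slides.

// Hypothetical shape of the configuration described above; names are illustrative.
case class ComponentConfiguration(
    order: Int,                       // n-gram order of this component (component.order on page 35)
    trainerArgs: Seq[String])         // flags for the external training binary

case class PipelineConfiguration(
    inputTable: String,               // where is the input data?
    outputTable: String,              // where to store the final output?
    largeShardThreshold: Long,        // used by progressive sharding (page 35)
    sparkConfs: Map[String, String],  // e.g. "spark.dynamicAllocation.maxExecutors",
                                      // "spark.executor.memory", "spark.memory.storageFraction"
    components: Seq[ComponentConfiguration])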

Page 15

Scalability challenges

• Executors lost as unable to heartbeat
• Shuffle service OOM
• Frequent executor GC
• Executor OOM
• 2GB limit in Spark for blocks
• Exceptions while reading output stream of pipe process

(an illustrative tuning sketch follows)
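Most of these symptoms map to memory and timeout settings. The values below are purely illustrative and are not the configuration used in the talk; each property is a standard Spark setting.

import org.apache.spark.SparkConf

// Illustrative values only; the right numbers depend on the cluster and data size.
val conf = new SparkConf()
  .set("spark.network.timeout", "600s")           // executors dropped for missed heartbeats
  .set("spark.executor.heartbeatInterval", "60s")
  .set("spark.executor.memory", "8g")             // executor OOM
  .set("spark.executor.memoryOverhead", "4g")     // headroom for the piped external process
                                                  // (spark.yarn.executor.memoryOverhead on older releases)
  .set("spark.memory.storageFraction", "0.3")     // frequent executor GC
  .set("spark.shuffle.service.enabled", "true")   // external shuffle service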

Page 16

Scalability challenges

(Same list as the previous slide.)

Page 17

Data skew

Page 18

[Same pipeline diagram as on page 9.]

Page 19

[Same pipeline diagram, expanded to show the stages inside each sub-model training job: n-gram extraction and counting (producing n-gram counts), estimation and pruning, and normalization.]

Page 20

Training dataset:
  How are you
  How are they
  Its raining
  How are we going
  When are we going
  You are awesome
  They are working
  …
  …

Word count:
  <How are we going>:1 … <How are you>:1 <How are they>:1 … <How are>:4 <You are>:1 <Its raining>:1 … <are>:6 <you>:1 <How>:4 …
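The word-count step above could be sketched with plain RDD operations as follows; the whitespace tokenization, the maximum order of 5, and the paths are assumptions for illustration.

import org.apache.spark.{SparkConf, SparkContext}

object NgramCountSketch {
  // Emit all n-grams of order 1..maxOrder from one sentence.
  def ngrams(sentence: String, maxOrder: Int): Seq[String] = {
    val words = sentence.trim.split("\\s+").toSeq
    for {
      n     <- 1 to math.min(maxOrder, words.length)
      start <- 0 to words.length - n
    } yield words.slice(start, start + n).mkString(" ")
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ngram-count-sketch"))
    val counts = sc.textFile("hdfs:///path/to/training_text")   // placeholder path
      .flatMap(line => ngrams(line, maxOrder = 5))
      .map(ngram => (ngram, 1L))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///path/to/ngram_counts")        // placeholder path
    sc.stop()
  }
}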

Page 21

Word count:
  <How are we going>:1 <are we going>:2 <we going>:2 <going>:1 <When are we going>:1 <Its raining>:1 <You are awesome>:1 …
  …

Partition based on 2-word suffix
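A simplified stand-in for the PartitionerForNgram referenced on page 35, assuming n-grams are keyed as space-separated strings; the class name and hashing scheme are illustrative.

import org.apache.spark.Partitioner

// Routes every n-gram to a shard based on its last `suffixWords` words.
// N-grams no longer than the suffix go to shard 0, which holds the low-order
// frequencies and is shipped to all the nodes (pages 24 and 26).
class NgramSuffixPartitioner(numShards: Int, suffixWords: Int) extends Partitioner {
  require(numShards >= 2, "need at least the 0-shard plus one regular shard")

  override def numPartitions: Int = numShards

  override def getPartition(key: Any): Int = {
    val words = key.toString.split("\\s+")
    if (words.length <= suffixWords) {
      0                                                            // 0-shard
    } else {
      val suffix = words.takeRight(suffixWords).mkString(" ")
      1 + java.lang.Math.floorMod(suffix.hashCode, numShards - 1)  // shards 1..(n-1)
    }
  }
}

With suffixWords = 1 this gives the 1-word sharding of page 24, which skews when frequent phrases such as "how to .." pile into one shard; suffixWords = 2 spreads them out, at the cost of a larger 0-shard (page 26).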

Page 22

Word count:
  <How are we going>:1 <are we going>:2 <we going>:2 <going>:1 <When are we going>:1 <Its raining>:1 <You are awesome>:1 …
  …

Split by the 2-word-suffix partitioning, e.g. one shard gets:
  <How are we going>:1 <are we going>:2 <we going>:2 <When are we going>:1 …
and another gets:
  <Its raining>:1 <You are awesome>:1 …

Page 23

<are>:6 <How>:4 <you>:1 <doing>:1 <going>:1 <awesome>:1 <working>:1 …
…

Frequency of every word: 0'th shard

Page 24

Distribution of shards (1-word sharding)

• 0-shard (has frequency of every word) and is shipped to all the nodes
• shards 1 to (n-1): n-grams with same 2-word suffix will fall in the same shard

Page 25

Distribution of shards (1-word sharding)

• shards 1 to (n-1)
• Skewed shards due to data from frequent phrases, e.g. "how to ..", "do you .."

Page 26

Distribution of shards (2-word sharding)

• shards 1 to (n-1)
• 0-shard has single word frequencies and 2-word frequencies as well

Page 27

Solution: Progressive sharding

Page 28

First iteration

Ignore skewed shards

Page 29

import org.apache.spark.SparkContext

def findLargeShardIds(
    sc: SparkContext,
    threshold: Long,
    shardCountsFile: String /* , … */): Set[Int] = {
  // Each line of the shard-size file is "<shard index>\t<count>".
  val shardSizesRDD = sc.textFile(shardCountsFile).map { line =>
    val Array(indexStr, countStr) = line.split('\t')
    (indexStr.toInt, countStr.toLong)
  }
  // Keep only the shards whose size exceeds the threshold.
  val largeShardIds = shardSizesRDD
    .filter { case (index, count) => count > threshold }
    .map(_._1)
    .collect()
    .toSet
  largeShardIds
}

Page 30

First iteration

Process all the non-skewed shards

Page 31

Second iteration

Effective 0-shard is small

Re-shard the leftover with 2-word history

Page 32

Second iteration

Discard bigger shards

Page 33

Second iteration

Process all the non-skewed shards

Page 34

Continue with further iterations …

Page 35

// Progressive sharding driver loop: keep re-sharding and re-training until
// no shard exceeds the size threshold.
var iterationId = 0
var largeShardIds: Set[Int] = Set.empty
do {
  // counts left to train after the previous pass
  val currentCounts: RDD[(String, Long)] = allCounts(iterationId - 1)
  val partitioner = new PartitionerForNgram(numShards, iterationId)

  // compute the size of every shard for this iteration
  val shardCountsFile = s"${shard_sizes}_$iterationId"
  currentCounts
    .map(ngram => (partitioner.getPartition(ngram._1), 1L))
    .reduceByKey(_ + _)
    .saveAsTextFile(shardCountsFile)

  // train the non-skewed shards; skewed ones are deferred to the next iteration
  largeShardIds = findLargeShardIds(sc, config.largeShardThreshold, shardCountsFile)
  trainer.trainedModel(currentCounts, component, largeShardIds)
    .saveAsObjectFile(s"${component.order}_$iterationId")

  iterationId += 1
} while (largeShardIds.nonEmpty)
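As the preceding slides describe, each pass trains only the shards below the size threshold; the leftover skewed data is re-sharded with a longer history on the next pass, and the loop ends once findLargeShardIds returns an empty set.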

Page 36

Performance evaluation

[Bar charts comparing Hive and Spark: reserved CPU time (days) and latency (hours).]

Page 37

Performance evaluation

[Same bar charts, annotated: reserved CPU time (days): Spark is 15x more efficient; latency (hours): Spark is 2.6x faster.]

Page 38

Upstream contributions to pipe()

• [SPARK-13793] PipedRDD doesn't propagate exceptions while reading parent RDD
• [SPARK-15826] PipedRDD to allow configurable char encoding
• [SPARK-14542] PipeRDD should allow configurable buffer size for the stdin writer
• [SPARK-14110] PipedRDD to print the command ran on non zero exit

Page 39

Questions?