Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil
Custom applications with Spark's RDD
Tejas Patil, Facebook
Agenda
• Use case
• Real world applications
• Previous solution
• Spark version
• Data skew
• Performance evaluation
N-gram language model training

Example: "Can you please come here ?" — a 5-gram model predicts each word from the preceding words of the 5-gram (the history).
[Figure] History (5-gram) → word being predicted
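As a minimal illustration (not from the talk), here is how the sliding 5-gram windows of that sentence break down into a 4-word history and a predicted word, in plain Scala:

  // Each 5-gram window: the first four words are the history,
  // the fifth is the word being predicted.
  val tokens = "Can you please come here ?".split(" ")
  tokens.sliding(5).foreach { window =>
    val history = window.init.mkString(" ")  // e.g. "Can you please come"
    val predicted = window.last              // e.g. "here"
    println(s"P($predicted | $history)")
  }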
Real world applications

Auto-subtitling for Page videos

Detecting low quality places
• Non-public places: "My home", "Home sweet home"
• Non-real places: "Apt #00, Fake lane, Foo City, CA", "Mordor, Westeros!!"
• Non-suitable for watch: anything containing nudity, intense sexuality, profanity or disturbing content
Previous solution

[Diagram] Hive table → Hive query → sub-model training jobs (sub-model 1, sub-model 2, …, sub-model `n`) → intermediate sub-models (LM1, LM2, …, LM`n`) → interpolation algorithm → language model
INSERT OVERWRITE TABLE sub_model_1
SELECT ....
FROM (
  REDUCE m.ngram, m.group_key, m.count
  USING "./train_model --config=myconfig.json ...."
  AS `ngram`, `count`, ...
  FROM (
    SELECT ... FROM data_source
    WHERE ...
    DISTRIBUTE BY group_key
  )
)
GROUP BY `ngram`
Lessons learned

• SQL not a good choice for building such applications
  • Duplication
  • Poor readability
  • Brittle, no testing
• Alternatives
  • Map-reduce
  • Query templating
• Latency while training with large data
Spark solution

• Same high level architecture
• Hive tables as final inputs and outputs
• Same binaries used in Hive TRANSFORM
• RDD not Datasets
• `pipe()` operator (sketched below)
• Modular, readable, maintainable
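The slides do not show the pipe() call itself, so here is a minimal sketch of the pattern, assuming a hypothetical record format and paths; the real setup pipes the same training binaries previously invoked via Hive TRANSFORM:

  import org.apache.spark.SparkContext

  def trainSubModel(sc: SparkContext): Unit = {
    // Hypothetical input: one serialized record per line, as the binary expects.
    val records = sc.textFile("hdfs:///path/to/ngram_records")

    // pipe() forks one child process per partition, writes the partition's
    // elements to the process's stdin (one line each), and returns whatever
    // the process prints to stdout as an RDD[String].
    val trained = records.pipe("./train_model --config=myconfig.json")

    trained.saveAsTextFile("hdfs:///path/to/sub_model_1")  // hypothetical output path
  }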
Configuration

Pipeline Configuration
- where is the input data?
- where to store final output?
- spark specific configs:
  "spark.dynamicAllocation.maxExecutors"
  "spark.executor.memory"
  "spark.memory.storageFraction"
  …………
- list of Component Configuration ……
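One possible shape for this two-level configuration, as a sketch; all field names here are assumptions, not the talk's actual schema:

  // Illustrative only: a pipeline-wide config holding a list of
  // per-component (per-sub-model) configs.
  case class ComponentConfiguration(
      name: String,
      order: Int,                       // n-gram order for this sub-model
      trainerArgs: Seq[String])         // flags passed to the pipe()'d binary

  case class PipelineConfiguration(
      inputTable: String,               // where is the input data?
      outputTable: String,              // where to store the final output?
      sparkConfs: Map[String, String],  // e.g. "spark.executor.memory" -> "8g"
      components: Seq[ComponentConfiguration])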
Scalability challenges

• Executors lost as unable to heartbeat
• Shuffle service OOM
• Frequent executor GC
• Executor OOM
• 2GB limit in Spark for blocks
• Exceptions while reading output stream of pipe process
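The talk does not disclose the exact mitigations; as a sketch, these are the kinds of knobs one would reach for against the failure modes above (the keys are real Spark settings, the values are placeholders):

  import org.apache.spark.SparkConf

  // Placeholder values, not the talk's actual settings.
  val conf = new SparkConf()
    .set("spark.executor.memory", "8g")                 // executor OOM
    .set("spark.executor.heartbeatInterval", "30s")     // executors lost as unable to heartbeat
    .set("spark.network.timeout", "600s")               // heartbeat / shuffle fetch timeouts
    .set("spark.memory.storageFraction", "0.3")         // frequent executor GC
    .set("spark.dynamicAllocation.maxExecutors", "500") // bound load on the shuffle service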
Data skew

[Diagram] Same pipeline as before (Hive table → Hive query → sub-model training jobs 1..`n` → intermediate sub-models LM1 … LM`n` → interpolation algorithm → language model), zoomed into a single sub-model training job:

ngram extraction and counting → normalize ngram counts → estimation and pruning
Training dataset:
How are you
How are they
Its raining
How are we going
When are we going
You are awesome
They are working
…..

Word count:
<How are we going> : 1
….
<How are you> : 1
<How are they> : 1
….
<How are> : 4
<You are> : 1
<Its raining> : 1
….
<are> : 6
<you> : 1
<How> : 4
…..
Word count:
<How are we going> : 1
<are we going> : 2
<we going> : 2
<going> : 1
<When are we going> : 1
<Its raining> : 1
<You are awesome> : 1
…..
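A sketch of this counting step on RDDs; the real extraction logic lives inside the piped binary, and the suffix-based extraction shown here is an assumption inferred from the slide's example counts:

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Every sentence contributes each of its word suffixes;
  // reduceByKey aggregates counts across the training dataset.
  def suffixCounts(sc: SparkContext): RDD[(String, Long)] = {
    val sentences = sc.parallelize(Seq(
      "How are we going", "When are we going", "Its raining", "You are awesome"))
    sentences
      .flatMap { s =>
        val words = s.split(" ")
        words.indices.map(i => words.drop(i).mkString(" "))  // all word suffixes
      }
      .map(suffix => (suffix, 1L))
      .reduceByKey(_ + _)                                    // e.g. ("are we going", 2)
  }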
Partition based on 2-word suffix

N-grams with the same 2-word suffix will fall in the same shard (shards 1 to (n-1)):

<How are we going> : 1
<are we going> : 2
<we going> : 2
<When are we going> : 1
…..

<Its raining> : 1
<You are awesome> : 1
…..

The 0-shard holds the frequency of every word and is shipped to all the nodes:

<are> : 6
<How> : 4
<you> : 1
<doing> : 1
<going> : 1
<awesome> : 1
<working> : 1
…..
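The body of the partitioner is not shown in the slides (the driver loop later constructs a PartitionerForNgram(numShards, iterationId)). Here is a minimal sketch of 2-word-suffix sharding under that description; the class name and internals are guesses, and it ignores the iteration-aware re-sharding the real partitioner performs:

  import org.apache.spark.Partitioner

  // Unigrams go to the reserved 0-shard; every other n-gram is routed by
  // the hash of its last two words, so n-grams sharing a 2-word suffix
  // land in the same shard.
  class TwoWordSuffixPartitioner(val numPartitions: Int) extends Partitioner {
    require(numPartitions >= 2, "need the 0-shard plus at least one suffix shard")

    override def getPartition(key: Any): Int = {
      val words = key.asInstanceOf[String].split(" ")
      if (words.length == 1) {
        0  // reserved shard holding the frequency of every word
      } else {
        val suffix = words.takeRight(2).mkString(" ")
        1 + (math.abs(suffix.hashCode) % (numPartitions - 1))  // shards 1 to (n-1)
      }
    }
  }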
[Chart] Distribution of shards (1-word sharding): shards 1 to (n-1); skewed shards due to data from frequent phrases, e.g. "how to ..", "do you ..".

[Chart] Distribution of shards (2-word sharding): shards 1 to (n-1); the 0-shard has single word frequencies and 2-word frequencies as well.
Solution: Progressive sharding

First iteration: ignore skewed shards.
def findLargeShardIds(
    sc: SparkContext,
    threshold: Long,
    shardCountsFile: String): Set[Int] = {
  // Each line of the shard-counts file is "<shardIndex>\t<count>".
  val shardSizesRDD = sc.textFile(shardCountsFile).map { line =>
    val Array(indexStr, countStr) = line.split('\t')
    (indexStr.toInt, countStr.toLong)
  }
  val largeShardIds = shardSizesRDD
    .filter { case (index, count) => count > threshold }
    .map(_._1)
    .collect()
    .toSet
  largeShardIds
}
First iteration: process all the non-skewed shards.
Second iteration: the effective 0-shard is small; re-shard the leftover with 2-word history. Discard bigger shards; process all the non-skewed shards. Continue with further iterations….
var iterationId = 0
var largeShardIds = Set.empty[Int]
do {
  // allCounts(-1) is assumed to return the initial counts on the first pass.
  val currentCounts: RDD[(String, Long)] = allCounts(iterationId - 1)
  val partitioner = new PartitionerForNgram(numShards, iterationId)

  // Save per-shard record counts so skewed shards can be identified.
  val shardCountsFile = s"${shard_sizes}_$iterationId"
  currentCounts
    .map(ngram => (partitioner.getPartition(ngram._1), 1L))
    .reduceByKey(_ + _)
    .saveAsTextFile(shardCountsFile)

  largeShardIds = findLargeShardIds(sc, config.largeShardThreshold, shardCountsFile)

  // Train on the non-skewed shards; skewed ones roll over to the next iteration.
  trainer.trainedModel(currentCounts, component, largeShardIds)
    .saveAsObjectFile(s"${component.order}_$iterationId")

  iterationId += 1
} while (largeShardIds.nonEmpty)
Performance evaluation

[Chart] Reserved CPU time (days), Hive vs Spark: 15x more efficient.
[Chart] Latency (hours), Hive vs Spark: 2.6x faster.
Upstream contributions to pipe()

• [SPARK-13793] PipedRDD doesn't propagate exceptions while reading parent RDD
• [SPARK-15826] PipedRDD to allow configurable char encoding
• [SPARK-14542] PipeRDD should allow configurable buffer size for the stdin writer
• [SPARK-14110] PipedRDD to print the command ran on non zero exit
Questions?