Handling data skew adaptively in Spark using Dynamic Repartitioning
Zoltán [email protected]
Introduction
• Hungarian Academy of Sciences, Institute for Computer Science and Control (MTA SZTAKI)
• Research institute with strong industry ties
• Big Data projects using Spark, Flink, Cassandra, Hadoop, etc.
• Multiple telco use cases lately, with challenging data volume and distribution
Agenda
• Our data-skew story
• Problem definitions & aims
• Dynamic Repartitioning
  • Architecture
  • Component breakdown
  • Repartitioning mechanism
  • Benchmark results
• Visualization
• Conclusion
Motivation
• We developed an application aggregating telco data that tested well on toy data
• When deployed against the real data set, the application seemed healthy
• However, it could become surprisingly slow or even crash
• What went wrong?
Our data-skew story
• We have use cases where map-side combine is not an option: groupBy, join
• 80% of the traffic is generated by 20% of the communication towers
• Most of our data follows the 80–20 rule
The problem
• Using the default hashing is not going to distribute the data uniformly
• Some unknown partition(s) will contain a lot of records on the reducer side
• Slow tasks will appear
• The data distribution is not known in advance
• "Concept drifts" are common
[Figure: stage boundary & shuffle; even partitioning of skewed data produces slow tasks on the reducer side]
Aim
Generally, make Spark a data-aware distributed data-processing framework
• Collect information about the data characteristics on-the-fly
• Partition the data as uniformly as possible on-the-fly
• Handle arbitrary data distributions
• Mitigate the problem of slow tasks
• Be lightweight and efficient
• Require no user guidance
spark.repartitioning = true
Architecture
[Figure: architecture with one Master and several Slaves. (1) Each slave approximates its local key distribution & statistics; (2) these are collected to the master; (3) the master builds the global key distribution & statistics; (4) the data is redistributed with a new hash.]
Driver perspective
• RepartitioningTrackerMaster is part of SparkEnv
• Listens to job & stage submissions
• Holds a variety of repartitioning strategies for each job & stage
• Decides when & how to (re)partition
Executor perspective
• RepartitioningTrackerWorker is part of SparkEnv
• Duties:
  • Stores ScannerStrategies (Scanner included) received from the RTM
  • Instantiates and binds Scanners to TaskMetrics (where data characteristics are collected)
  • Defines an interface for Scanners to send DataCharacteristics back to the RTM
Scalable sampling
• Key distributions are approximated with a strategy that is:
  • not sensitive to early or late concept drifts,
  • lightweight and efficient,
  • scalable, by using a back-off strategy
[Figure: key-frequency counters. Records are sampled at rate s_i; when the counter map outgrows the threshold T_B, it is truncated and the sampling rate backs off from s_i to s_i/b.]
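The back-off sampling described above can be sketched as follows. This is an illustrative reconstruction, not the actual Spark implementation; the parameter names (`s0`, `b`, `max_size`) and the choice to truncate to the heaviest half are assumptions.

```python
import random
from collections import Counter

def approximate_key_distribution(records, s0=1.0, b=2.0, max_size=100):
    """Approximate a stream's key distribution with back-off sampling:
    when the counter map outgrows `max_size`, truncate it to the
    heaviest keys and divide the sampling rate by `b` (sketch only)."""
    counters = Counter()
    rate = s0
    for key in records:
        if random.random() < rate:
            counters[key] += 1
        if len(counters) > max_size:
            # keep the heaviest half of the keys, back off the sampling rate
            counters = Counter(dict(counters.most_common(max_size // 2)))
            rate /= b
    return counters, rate
```

Heavy keys accumulate large counts and survive every truncation, while the shrinking sampling rate keeps the per-record cost low, which matches the "lightweight and scalable" aim on the slide.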
Complexity-sensitive sampling
• Reducer run-time can highly correlate with the computational complexity of the values for a given key
• Calculating object size is costly in the JVM and in some cases shows little correlation with computational complexity (the function of the next stage)
• Solution:
  • If the user object is Weightable, use complexity-sensitive sampling
  • When increasing the counter for a specific value, consider its complexity
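The idea of complexity-sensitive sampling can be illustrated with a minimal sketch; the `Weightable` interface here mirrors the name on the slide, but its shape and the `record_sample` helper are assumptions.

```python
class Weightable:
    """Hypothetical marker interface for user objects that can report
    their own computational complexity."""
    def complexity(self) -> int:
        raise NotImplementedError

def record_sample(counters, key, value):
    # For Weightable values, grow the counter by the value's complexity
    # instead of by 1, so keys that are expensive to process in the next
    # stage also look "heavy" in the sampled histogram.
    weight = value.complexity() if isinstance(value, Weightable) else 1
    counters[key] = counters.get(key, 0) + weight
```

This avoids costly JVM object-size measurement entirely: the user supplies a cheap complexity estimate instead.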
Scalable sampling in numbers
• In Spark, it has been implemented with Accumulators
• Not the most efficient option, but we wanted minimal impact on the existing code
• Optimized with micro-benchmarking
• Main factors:
  • Aggressiveness of the sampling strategy (initial sampling ratio, back-off factor, etc.)
  • Complexity of the current stage (mapper)
• Impact on the current stage's runtime:
  • When used throughout the execution of the whole stage, it adds 5-15% to the runtime
  • After repartitioning, we cut out the sampler's code path; in practice it then adds only 0.2-1.5% to the runtime
Main new task-level metrics
• TaskMetrics
  • RepartitioningInfo – new hash function, repartitioning strategy
  • ShuffleReadMetrics
    • (Weightable)DataCharacteristics – used for testing the correctness of the repartitioning & for execution-visualization tools
  • ShuffleWriteMetrics
    • (Weightable)DataCharacteristics – scanned periodically by Scanners based on the ScannerStrategy
    • insertionTime
    • repartitioningTime
    • inMemory
    • # of spills
Scanner
• Instantiated for each task before the executor starts them
• Different implementations: Throughput, Timed, Histogram
• A ScannerStrategy defines:
  • when to send to the RTM,
  • the histogram-compaction level.
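A Timed scanner strategy like the one listed above might look roughly like this sketch; the class shape, `maybe_scan`, and the callback standing in for the send-to-RTM interface are all illustrative, not Spark's actual internals.

```python
import time

class TimedScanner:
    """Sketch of a Timed ScannerStrategy: every `period` seconds, read
    the task's key histogram, compact it to the top-k keys, and hand it
    to a callback that stands in for sending it to the RTM."""
    def __init__(self, period, top_k, send_to_rtm):
        self.period = period
        self.top_k = top_k          # histogram-compaction level
        self.send_to_rtm = send_to_rtm
        self._last = 0.0

    def maybe_scan(self, task_histogram, now=None):
        now = time.monotonic() if now is None else now
        if now - self._last >= self.period:
            self._last = now
            # compact: keep only the heaviest top_k keys
            compacted = dict(sorted(task_histogram.items(),
                                    key=lambda kv: -kv[1])[:self.top_k])
            self.send_to_rtm(compacted)
```

Compacting before sending keeps the master's per-task traffic bounded regardless of how many distinct keys a task sees.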
Decision to repartition
• The RepartitioningTrackerMaster can use different decision strategies, based on:
  • the number of local histograms needed,
  • the global histogram's distance from the uniform distribution,
  • preferences in the construction of the new hash function.
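The first two criteria above can be sketched as a simple decision function. The slide does not specify the distance metric, so the L-infinity distance from uniform used here, and the function and parameter names, are assumptions.

```python
def distance_from_uniform(histogram):
    """Max deviation of a normalized key histogram from the uniform
    distribution (one possible metric; others would work too)."""
    total = sum(histogram.values())
    n = len(histogram)
    return max(abs(v / total - 1.0 / n) for v in histogram.values())

def should_repartition(local_histograms, needed, threshold):
    # Decide once enough local histograms have arrived AND the merged
    # (global) histogram is far enough from uniform.
    if len(local_histograms) < needed:
        return False
    merged = {}
    for h in local_histograms:
        for k, v in h.items():
            merged[k] = merged.get(k, 0) + v
    return distance_from_uniform(merged) > threshold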
Construction of the new hash function
[Figure: construction of the new hash function with partition-size level l. cut_1 – single-key cut: the top prominent keys are hashed to dedicated partitions; cut_2 – uniform distribution; cut_3 – probabilistic cut: a key is hashed to a partition with probability proportional to the remaining space up to l.]
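The single-key-cut part of the construction can be sketched as a partitioner that isolates the heaviest keys. This simplification hashes all remaining keys evenly instead of weighting partitions by their remaining space up to l, and every name here (class, methods, parameters) is an assumption rather than the actual KeyIsolatorPartitioner code.

```python
import hashlib

class KeyIsolatorPartitioner:
    """Sketch: the heaviest keys each get a dedicated partition
    (the "single-key cut"); all other keys are hashed deterministically
    into the remaining partitions."""
    def __init__(self, num_partitions, heavy_keys):
        self.num_partitions = num_partitions
        # dedicate the first partitions to the top prominent keys
        self.isolated = {k: i for i, k in
                         enumerate(heavy_keys[:num_partitions - 1])}

    def get_partition(self, key):
        if key in self.isolated:
            return self.isolated[key]
        # deterministic hash for everything else, into the non-isolated range
        h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
        rest = self.num_partitions - len(self.isolated)
        return len(self.isolated) + h % rest
```

Determinism matters: both the map side and any retried task must send a given key to the same partition, which is why the hash is derived from the key alone.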
New hash function in numbers
• More complex than a hashCode
• We need to evaluate it for every record
• Micro-benchmark (for example, on Strings):
  • Number of partitions: 512
  • HashPartitioner: AVG time to hash a record is 90.033 ns
  • KeyIsolatorPartitioner: AVG time to hash a record is 121.933 ns
• In practice it adds negligible overhead, under 1%
Repartitioning
• Reorganizes the previous naive hashing
• Usually happens in-memory
• In practice, adds an additional 1-8% overhead (usually at the lower end), based on:
  • the complexity of the mapper,
  • the length of the scanning process.
More numbers (groupBy)
MusicTimeseries – groupBy on tags from a listenings stream
[Chart: size of the biggest partition vs. number of partitions (200, 400, 1710), comparing naive hashing with Dynamic Repartitioning; biggest-partition sizes shown are 134M, 92M, and 83M records. Dynamic Repartitioning achieves a 38% reduction over naive hashing, and a 64% reduction compared to over-partitioning (which also suffers overhead & blind scheduling), with an observed 3% map-side overhead.]
More numbers (join)
• Joining tracks with tags
• The tracks dataset is skewed
• The number of partitions is set to 33
• Naive:
  • size of the biggest partition = 4.14M
  • reducer stage runtime = 251 seconds
• Dynamic Repartitioning:
  • size of the biggest partition = 2.01M
  • reducer stage runtime = 124 seconds
  • heavy map, only 0.9% overhead
Spark REST API
• The new metrics are available through the REST API
• Added new queries to the REST API, for example: "what happened in the last 3 seconds?"
• BlockFetches are collected into ShuffleReadMetrics
Execution visualization of Spark jobs
Repartitioning in the visualization
[Visualization: before repartitioning, heavy keys create heavy partitions (slow tasks); after repartitioning, the heavy key is alone in its partition and the size of the biggest partition is minimized.]
Conclusion
• Our Dynamic Repartitioning can handle data skew dynamically, on-the-fly, on any workload and arbitrary key distributions
• With very little overhead, data skew can be handled in a natural & general way
• Visualizations can help developers better understand the issues and bottlenecks of certain workloads
• Making Spark data-aware pays off
Thank you for your attention
Zoltán [email protected]