Scaling Spark
Transcript of Scaling Spark
SCALING SPARK ON AWS
THE JOURNEY
ABOUT US
Alex Rovner, Director of Data Engineering
Media Platform
Processing Terabytes Daily
PRIOR STATE
TWO CLUSTERS
CORE & ANALYTICS
BOTH IN COLO
CHALLENGES
SCALABILITY
ELASTICITY
AGILITY
SPARK
SCALABLE
FRIENDLY API
PYTHON SUPPORT
AWS
ON-DEMAND COMPUTE
FLEXIBLE TERMS
INSTANCES
D2.8XLARGE
48 TB OF EPHEMERAL STORAGE
244 GB RAM
36 V-CPU
WHY EPHEMERAL?
RESERVED VS ON-DEMAND?
SPOT?
[DIAGRAM: SPOT, HDFS, D2 ×4]
WAIT, WHAT ABOUT DATA LOCALITY?
HADOOP
WHAT ABOUT EMR?
CDH 5.3, SPARK 1.2 ON YARN
CDH 5.4, SPARK 1.3 ON YARN
CDH 5.4, SPARK 1.5 ON YARN
AUTO SCALE
CALCULATE CLUSTER UTILIZATION
QUERY CM API
V-CORES AVAILABLE, USED & PENDING
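The utilization step can be sketched roughly as below. This is an illustration, not the deck's actual code: the function name is hypothetical, the three inputs stand in for whatever the CM API returns, and it assumes "available" means v-cores that are currently free, so that total capacity is available plus used. Pending v-cores are counted as demand, which lets the ratio exceed 1.0 when the cluster is oversubscribed.

```python
def cluster_utilization(available_vcores, used_vcores, pending_vcores):
    """Return demand (used + pending) as a fraction of total v-cores.

    Assumption: 'available' is the number of free v-cores, so the
    cluster total is available + used. The inputs are plain numbers;
    fetching them from the CM API is left out of this sketch.
    """
    total_vcores = available_vcores + used_vcores
    if total_vcores <= 0:
        raise ValueError("cluster reports no v-cores")
    return (used_vcores + pending_vcores) / total_vcores


if __name__ == "__main__":
    # Hypothetical reading: 40 free, 160 in use, 100 waiting in the queue.
    print(cluster_utilization(40, 160, 100))
```

A value above the target suggests growing the cluster, and a value well below it suggests shrinking.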
CALCULATE TARGET CAPACITY
TARGET 80% UTILIZATION
LIMIT DOWNSIZING
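One way to turn the 80% target and the downsizing limit into a node count is sketched below. Only the 80% figure comes from the slides. The function name, the 25% per-cycle shrink cap, the minimum of one node, and the example figure of 36 v-cores per node are all assumptions made for illustration. Integer arithmetic is used so that the rounding up is exact.

```python
def target_node_count(used_vcores, pending_vcores, vcores_per_node,
                      current_nodes, target_percent=80,
                      max_downsize_percent=25, min_nodes=1):
    """Return the node count that would put demand at the target utilization.

    target_percent=80 follows the slides; max_downsize_percent and
    min_nodes are hypothetical knobs for the 'limit downsizing' step.
    """
    demand = used_vcores + pending_vcores
    # Nodes needed so that demand / capacity <= target_percent / 100,
    # rounded up with integer ceiling division.
    numerator = demand * 100
    denominator = target_percent * vcores_per_node
    desired = max(min_nodes, -(-numerator // denominator))

    if desired < current_nodes:
        # Limit downsizing: remove at most a fixed share of nodes per cycle.
        max_step = max(1, current_nodes * max_downsize_percent // 100)
        desired = max(desired, current_nodes - max_step)
    return desired


if __name__ == "__main__":
    # Busy cluster: 160 used + 100 pending v-cores on 8 nodes.
    print(target_node_count(160, 100, 36, current_nodes=8))
    # Idle cluster: shrinks gradually instead of dropping straight to 2.
    print(target_node_count(36, 0, 36, current_nodes=20))
```

Capping the shrink step keeps a brief lull from tearing down capacity that the next burst of jobs would need again.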
ADJUST CAPACITY
SPEED BUMPS
APPLICATION MASTER ON SPOT
YARN LABELS
USERS ARE IMPATIENT
IT'S NEVER ENOUGH
I AM LOST!
YARN LOGS
SET YARN OVERHEAD
CHECK GC TIME
INCREASE EXECUTOR MEMORY
TRY AGAIN
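The relationship between executor memory and YARN overhead can be sketched as below. It assumes the Spark 1.x setting `spark.yarn.executor.memoryOverhead` and a default of the larger of 384 MB or 10% of executor memory; that default varied between early releases, so check the documentation for the version in use. The function name is hypothetical.

```python
def yarn_container_mb(executor_memory_mb, overhead_mb=None,
                      overhead_fraction=0.10, min_overhead_mb=384):
    """Estimate the memory YARN must grant for one Spark executor.

    YARN kills a container that exceeds executor memory plus overhead,
    so raising either value enlarges the request. The default overhead
    rule here (max of 384 MB and 10%) is an assumption.
    """
    if overhead_mb is None:
        overhead_mb = max(min_overhead_mb,
                          int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead_mb


if __name__ == "__main__":
    # Default overhead for an 8 GB executor.
    print(yarn_container_mb(8192))
    # Explicit overhead, as when setting spark.yarn.executor.memoryOverhead.
    print(yarn_container_mb(8192, overhead_mb=2048))
```

If containers are still killed after raising the overhead, the slides' sequence suggests checking GC time next and then increasing executor memory itself.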
BROADCASTING IS EVIL
BROADCASTING “LARGE” DATASETS IS EVIL
CURRENT STATE
THREE CLUSTERS
ANALYTICS & STREAMING (AWS)
CORE (COLO - MOVING SOON!)
BIG SUCCESS!
QUESTIONS?