Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman
Transcript of Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman
![Page 1: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/1.jpg)
Morri Feldman
The Road Less Traveled
Highlights and Challenges from Running Spark on Mesos in Production
![Page 2: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/2.jpg)
The Plan
1. Attribution & Overall Architecture
2. Retention
3. Data Infrastructure - Spark on Mesos
![Page 3: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/3.jpg)
The Flow
(Diagram labels: Media sources, User Device, AppsFlyer Servers, Store, Redirected, -OR-)
Enables:
• Cost Per Install (CPI)
• Cost Per In-app Action (CPA)
• Revenue Share
• Network Optimization
• Retargeting
![Page 4: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/4.jpg)
![Page 5: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/5.jpg)
Retention
(Chart axis: install day, then days 1 to 12)
![Page 6: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/6.jpg)
Retention Scale
• > 30 Million Installs / Day
• > 5 Billion Sessions / Day
![Page 7: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/7.jpg)
Retention Dimensions
![Page 8: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/8.jpg)
Two Dimensions (App-Id and Media-Source)
Cascalog
DataLog / Logic programming over Cascading / Hadoop
Retention V1 (MVP)
![Page 9: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/9.jpg)
![Page 10: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/10.jpg)
![Page 11: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/11.jpg)
S3 Data v1 – Hadoop Sequence files:
• Key, Value <Kafka Offset, JSON Message>
• Gzip Compressed
• ~ 1.8 TB / Day
S3 Data v2 – Parquet Files (Schema on Write)
• Retain fields required for retention, apply some business logic while converting.
• Generates “tables” for installs and sessions.
Retention v2 – “SELECT … JOIN ON ...”
18 Dimensions vs 2 in original report
Retention – Spark SQL / Parquet
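The "SELECT … JOIN ON ..." idea can be sketched as follows. This is only an illustration, not the query from the talk: it uses Python's built-in sqlite3 as a stand-in for Spark SQL, and the table layout, column names, and sample rows are all assumptions.

```python
import sqlite3

# In-memory database standing in for the Parquet "tables" (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE installs (device_id TEXT, app_id TEXT, media_source TEXT, install_day INTEGER);
CREATE TABLE sessions (device_id TEXT, app_id TEXT, activity_day INTEGER);
""")
conn.executemany("INSERT INTO installs VALUES (?, ?, ?, ?)", [
    ("d1", "app_a", "net_x", 0),
    ("d2", "app_a", "net_x", 0),
    ("d3", "app_a", "net_y", 0),
])
conn.executemany("INSERT INTO sessions VALUES (?, ?, ?)", [
    ("d1", "app_a", 1), ("d1", "app_a", 1),  # two sessions on the same day count once
    ("d2", "app_a", 1), ("d1", "app_a", 2), ("d3", "app_a", 2),
])

# Join sessions to the install that started the cohort, then count distinct devices.
RETENTION_SQL = """
SELECT i.install_day AS cohort_day, s.activity_day, i.app_id, i.media_source,
       COUNT(DISTINCT s.device_id) AS retained_count
FROM installs i
JOIN sessions s ON s.device_id = i.device_id AND s.app_id = i.app_id
WHERE s.activity_day >= i.install_day
GROUP BY i.install_day, s.activity_day, i.app_id, i.media_source
ORDER BY cohort_day, s.activity_day, i.media_source
"""
rows = conn.execute(RETENTION_SQL).fetchall()
for row in rows:
    print(row)
```

Adding a dimension in this style means adding a column to the SELECT and GROUP BY lists, which is one plausible reason the SQL version could grow from 2 to 18 dimensions.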
![Page 12: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/12.jpg)
Retention Calculation Phases
1. Daily aggregation: Cohort_day, Activity_day, <Dimensions>, Retained Count
2. Pivot: Cohort_day, <Dimensions>, Day0, Day1, Day2 …
After Aggregation and Pivot ~ 1 billion rows
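A minimal sketch of the pivot phase, in plain Python rather than Spark. The row layout follows the two phases listed above; the function name, the sample numbers, and the choice to fill missing days with 0 are assumptions.

```python
from collections import defaultdict

def pivot_retention(daily_rows, max_day):
    """Turn (cohort_day, activity_day, dims, retained) rows into one row per cohort.

    The result maps (cohort_day, dims) to [Day0, Day1, ..., Day<max_day>].
    """
    pivoted = defaultdict(lambda: [0] * (max_day + 1))
    for cohort_day, activity_day, dims, retained in daily_rows:
        offset = activity_day - cohort_day  # days since install
        if 0 <= offset <= max_day:
            pivoted[(cohort_day, dims)][offset] += retained
    return dict(pivoted)

# Hypothetical output of the daily aggregation phase.
daily = [
    (10, 10, ("app_a", "net_x"), 100),
    (10, 11, ("app_a", "net_x"), 40),
    (10, 12, ("app_a", "net_x"), 25),
    (11, 11, ("app_a", "net_x"), 80),
]
table = pivot_retention(daily, max_day=2)
for key, days in sorted(table.items()):
    print(key, days)
```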
![Page 13: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/13.jpg)
Data Warehouse v3
Parquet Files – Schema on Read
• Retain almost all fields from original JSON
• Do not apply any business logic
• Business logic applied when reading through use of a shared library
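One way to picture schema on read: records are stored as they arrived, and every reader goes through the same shared function. The sketch below is an assumption throughout. The slides do not show the shared library, and the rule used here (normalizing `media_source` and defaulting it to "organic") is a made-up example.

```python
import json

def apply_business_logic(raw):
    """Hypothetical read-time rule; the real shared library's rules are not in the slides."""
    event = dict(raw)  # copy, so the stored record is left untouched
    source = (event.get("media_source") or "").strip().lower()
    event["media_source"] = source or "organic"
    return event

def read_events(lines):
    """Parse stored records and apply the shared logic on the way out."""
    for line in lines:
        yield apply_business_logic(json.loads(line))

# Stored records keep the original fields, including messy or missing values.
stored = [
    '{"app_id": "app_a", "media_source": " Net_X "}',
    '{"app_id": "app_a", "media_source": null}',
]
events = list(read_events(stored))
print(events)
```

Because the rule lives in the reader, changing it does not require rewriting the stored files.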
![Page 14: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/14.jpg)
Spark and Spark Streaming: ETL for Druid
SQL
![Page 15: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/15.jpg)
Why?
• All data on S3: no need for HDFS
• Spark & Mesos have a long history
• Some interest in moving our attribution services to Mesos
• Began using Spark with EC2 “standalone” cluster scripts (no VPC)
• Easy to set up
• Culture of trying out promising technologies
![Page 16: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/16.jpg)
Mesos Creature Comforts
Nice UI – Job outputs / sandbox easy to find
Driver and Slave logs are accessible
![Page 17: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/17.jpg)
Mesos Creature Comforts
Fault tolerant – Masters store data in ZooKeeper and can fail over smoothly
Nodes join and leave the cluster automatically at bootup / shutdown
![Page 18: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/18.jpg)
Job Scheduling – Chronos
?
https://aphyr.com/posts/326-jepsen-chronos
![Page 19: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/19.jpg)
Specific Lessons / Challenges using Spark, Mesos & S3
-or-
What Went Wrong with Spark / Mesos & S3 and How We Fixed It.
Spark / Mesos in production for nearly 1 year
![Page 20: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/20.jpg)
S3 is not HDFS
S3n gives tons of timeouts and DNS Errors @ 5pm Daily
Can compensate for timeouts with spark.task.maxFailures set to 20
Use S3a from Hadoop 2.7 (S3a in 2.6 generates millions of partitions – HADOOP-11584)
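The two settings above might be passed to a job roughly like this. Only `spark.task.maxFailures` set to 20 comes from the slide. The `fs.s3a.impl` entry, the helper names, and the bucket name are assumptions added for illustration.

```python
# Spark properties for reading from S3; only maxFailures=20 is from the slide.
S3_SETTINGS = {
    "spark.task.maxFailures": "20",
    # Assumption: register the S3A filesystem through Spark's spark.hadoop.* passthrough.
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}

def to_submit_args(conf):
    """Render a property dict as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

def use_s3a(path):
    """Rewrite an s3n:// URL to the s3a:// scheme; other paths are returned unchanged."""
    prefix = "s3n://"
    return "s3a://" + path[len(prefix):] if path.startswith(prefix) else path

print(to_submit_args(S3_SETTINGS))
print(use_s3a("s3n://example-bucket/sessions/"))  # hypothetical bucket
```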
https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
![Page 21: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/21.jpg)
S3 is not HDFS part 2
Use a Direct Output Committer
Spark writes files to staging area and renames them at end of job
Rename on S3 is an expensive operation (tens of minutes for thousands of files)
Direct Output Committers write to the final output location (safe because an S3 object write is atomic: the object appears either complete or not at all)
Disadvantages:
• Incompatible with speculative execution
• Poor recovery from failures during write operations
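A toy model of the difference between the two commit styles. It is not Spark's committer code: `ToyStore` is a hypothetical in-memory object store, and the path names are invented. It only counts how many per-file copies each style needs.

```python
class ToyStore:
    """In-memory stand-in for an object store; counts copy operations."""

    def __init__(self):
        self.objects = {}
        self.copies = 0

    def put(self, key, data):
        self.objects[key] = data

    def rename(self, src, dst):
        # Object stores have no cheap rename: it is a copy followed by a delete.
        self.objects[dst] = self.objects.pop(src)
        self.copies += 1

def staged_commit(store, parts):
    """Write to a staging prefix, then rename every file at the end of the job."""
    for name, data in parts.items():
        store.put(f"_temporary/{name}", data)
    for name in parts:
        store.rename(f"_temporary/{name}", f"out/{name}")

def direct_commit(store, parts):
    """Write each file straight to its final location."""
    for name, data in parts.items():
        store.put(f"out/{name}", data)

parts = {f"part-{i:05d}": b"rows" for i in range(1000)}
staged, direct = ToyStore(), ToyStore()
staged_commit(staged, parts)
direct_commit(direct, parts)
print(staged.copies, direct.copies)
```

Both stores end up with the same objects, but the direct style has no staging area to discard. A failed or duplicate task attempt therefore writes to the same final key, which matches the disadvantages listed above.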
![Page 22: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/22.jpg)
Avoid .0 releases if possible
Worst example
Spark 1.4.0 randomly loses data especially on jobs with many output partitions
Fixed by SPARK-8406
![Page 23: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/23.jpg)
Coarse-Grained or Fine-Grained?
TL;DR – Use coarse-grained
Not perfect, but stable
![Page 24: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/24.jpg)
Coarse-Grained – Disadvantages
spark.cores.max (not dynamic)
![Page 25: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/25.jpg)
Coarse-Grained with Dynamic Allocation
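For orientation, the Spark properties usually involved in these two modes can be sketched as below. The slides give no values, so the helper name and the numbers are assumptions. The property names themselves are standard Spark settings, and dynamic allocation also expects an external shuffle service running on the nodes.

```python
def coarse_grained_conf(cores_max, dynamic=False):
    """Build Spark properties for coarse-grained Mesos mode (illustrative values only)."""
    conf = {
        "spark.mesos.coarse": "true",
        # Fixed upper bound on cores for the whole application.
        "spark.cores.max": str(cores_max),
    }
    if dynamic:
        # Executors are added and released as the workload changes.
        conf["spark.dynamicAllocation.enabled"] = "true"
        conf["spark.shuffle.service.enabled"] = "true"
    return conf

static_conf = coarse_grained_conf(cores_max=64)
dynamic_conf = coarse_grained_conf(cores_max=64, dynamic=True)
print(static_conf)
print(dynamic_conf)
```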
![Page 26: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/26.jpg)
Tuning Jobs in Coarse-Grained
![Page 27: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/27.jpg)
Tuning Jobs in Coarse-Grained
Set executor memory to roughly the entire memory of a machine (200 GB for r3.8xlarge)
spark.task.cpus then effectively controls the Spark memory available per task
(Diagram: one executor with 200 GB and 32 cpus; OOM!!)
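The arithmetic behind that tuning knob, as a rough sketch. It assumes the executor's memory is shared evenly among concurrently running tasks and ignores Spark's internal memory fractions and overheads, so the results are upper bounds. The function name is hypothetical.

```python
def memory_per_task_gb(executor_memory_gb, executor_cores, task_cpus):
    """Approximate memory per concurrent task on one executor."""
    concurrent_tasks = executor_cores // task_cpus  # tasks that can run at once
    return executor_memory_gb / concurrent_tasks

# 200 GB and 32 cpus, as on the slide.
print(memory_per_task_gb(200, 32, task_cpus=1))  # 32 concurrent tasks
print(memory_per_task_gb(200, 32, task_cpus=4))  # 8 concurrent tasks
```

Raising `spark.task.cpus` from 1 to 4 in this model cuts concurrency from 32 to 8 tasks and quadruples the memory each task can use, at the cost of idle cores.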
![Page 28: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/28.jpg)
Tuning Jobs in Coarse-Grained
More Shuffle Partitions
OOM!!
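More shuffle partitions means less data per task, which is the other lever against OOM. A back-of-the-envelope sketch follows; the shuffle size and per-partition target are made-up numbers. In Spark SQL the partition count is normally set through `spark.sql.shuffle.partitions`, which the slide does not name.

```python
import math

def bytes_per_partition(shuffle_bytes, partitions):
    """Average shuffle data handled by one task."""
    return shuffle_bytes / partitions

def partitions_for_target(shuffle_bytes, target_bytes_per_partition):
    """Smallest partition count that keeps the average at or below the target."""
    return math.ceil(shuffle_bytes / target_bytes_per_partition)

TIB = 2 ** 40
MIB = 2 ** 20

# Hypothetical 1 TiB shuffle: 200 partitions leaves over 5 GiB per task.
print(bytes_per_partition(TIB, 200) / 2 ** 30)
# Aiming for 256 MiB per task needs far more partitions.
print(partitions_for_target(TIB, 256 * MIB))
```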
![Page 29: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/29.jpg)
Spark on Mesos Future Improvements
Increased stability – Dynamic allocation
Tungsten
Mesos Maintenance Primitives, experimental in 0.25.0
Gracefully reduce size of cluster by marking nodes that will soon be killed
Inverse Offers – preemption, more dynamic scheduling
![Page 30: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/30.jpg)
How We Generated Duplicate Data
OR
S3 is Still Not HDFS
![Page 31: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/31.jpg)
S3 is Still Not HDFS
S3 is Eventually Consistent
![Page 32: Highlights and Challenges from Running Spark on Mesos in Production by Morri Feldman](https://reader031.fdocuments.net/reader031/viewer/2022030317/587155641a28ab8e5b8b5093/html5/thumbnails/32.jpg)
We are Hiring! https://www.appsflyer.com/jobs/