Apache Spark Memory Management
/ @laclefyoshi / ysaeki@r.recruit.co.jp
About the speaker
• 2011/04
• 2015/09
Publications:
• Druid (KDP, 2015)
• RDB / NoSQL (2016; HBase chapter)
• ESP8266 Wi-Fi IoT (KDP, 2016)
Talks:
• WebDB Forum 2014
• Spark Streaming (Spark Meetup December 2015)
• Kafka and AWS Kinesis (Apache Kafka Meetup Japan #1, 2016)
• FutureOfData, 2016
• Queryable State for Kafka Streams (Apache Kafka Meetup Japan #2, 2016)
Why Spark?
In-memory Computing
From disk-based computing to in-memory computing
[Chart: historical memory prices — http://www.jcmit.com/memoryprice.htm]
In-memory Computing
[Timeline of in-memory technologies (2003~, 2008~, 2009~, 2010~, 2011~): Memcached, Hazelcast, HANA, Exadata, Apache Spark, Apache Ignite]
Apache Spark : Common Errors
Lost executor X on xxxx: remote Akka client disassociated
Container marked as failed: container_xxxx on host: xxxx. Exit status: 1
Container killed by YARN for exceeding memory limits
shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[Remote]
How come?
Apache Spark
[Architecture: one Driver coordinating multiple Executors]
Apache Spark
[Executors use both Memory and Disk]
$ spark-submit \
    --MEMORY_OPTIONS1 \
    --MEMORY_OPTIONS2 \
    --MEMORY_OPTIONS3 \
    --conf ADDITIONAL_OPTIONS1 \
    --conf ADDITIONAL_OPTIONS2 \
    --class jp.co.recruit.app.Main \
    spark-project-1.0-SNAPSHOT.jar
Apache Spark : Memory Management
Apache Spark : Heap
[Memory regions: Disk | Off-heap | On-heap]
On-heap: --executor-memory XXG or --conf spark.executor.memory=XXG
Off-heap: --conf spark.memory.offHeap.size=XXX
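Note that spark.memory.offHeap.size only takes effect when spark.memory.offHeap.enabled is also set to true. A minimal sketch of setting these from code (the 4g / 1 GB values are arbitrary examples, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// Same effect as --executor-memory 4g on the command line
val conf = new SparkConf()
  .setAppName("memory-settings-sketch")
  .set("spark.executor.memory", "4g")
  // Off-heap memory must be enabled explicitly before the size is honored
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1073741824")   // 1 GB, in bytes

val sc = new SparkContext(conf)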
Apache Spark : Executor
[Each Executor has its own On-heap and Off-heap memory; all Executors on a node share the node's Disk with the OS and other applications]
Apache Spark : Container
[On Mesos / YARN, each Executor runs inside a Container; the Container must also accommodate a memory Overhead on top of the Executor's on-heap and off-heap memory]
Apache Spark : Overhead
On-heap: --executor-memory XXG or --conf spark.executor.memory=XXG
Overhead: --conf spark.mesos.executor.memoryOverhead / --conf spark.yarn.executor.memoryOverhead
  (default = max(executor memory * 0.1, 384MB))
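For example, with --executor-memory 8G the default overhead is max(8192MB * 0.1, 384MB) = 819MB, so the Mesos / YARN container has to provide roughly 8GB + 819MB ≈ 8.8GB (plus any off-heap memory).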
Apache Spark : Overhead
• The overhead accounts for memory used outside the heap: Java VM internals (thread stacks, interned strings) and other native allocations.
Apache Spark : Garbage Collection
[GC applies to the On-heap region only — not to Off-heap memory or Disk]
Apache Spark : Tachyon
[Tachyon (now Alluxio) provides an external Block Store: RDD blocks can be kept off-heap, outside the Executor JVM]
Apache Spark : Project Tungsten
[Project Tungsten manages memory explicitly, including Off-heap, instead of relying on JVM objects — less object overhead, less GC]
Apache Spark : Reserved Memory
[300MB of the On-heap is reserved by Spark itself]
Don't touch!
Apache Spark : User Memory
--conf spark.memory.fraction=0.6
(On-heap - 300MB) is split into the Memory Fraction (60% by default, managed by Spark) and User Memory (the remaining 40%).
• User Memory holds the objects and data structures created in user code, outside the Memory Fraction.
Apache Spark : Execution / Storage
--conf spark.memory.storageFraction=0.5
The Memory Fraction is further divided between a Storage Fraction and an Execution Fraction (half each by default).
Apache Spark : Execution / Storage
• Storage Fraction: cached RDD blocks, Broadcast variables, Accumulators
• Execution Fraction: intermediate data for Shuffle, Join, Sort and Aggregate operations (see the sketch below)
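A small sketch of what ends up where (sc and an existing RDD of words, rdd, are assumed):

// Storage Fraction: the broadcast value is cached on every Executor
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Execution Fraction: the shuffle behind reduceByKey builds its buffers here
val counts = rdd.map(word => (word, lookup.value.getOrElse(word, 0)))
                .reduceByKey(_ + _)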
Apache Spark : Unified Memory
Storage and Execution form one unified region: either side can borrow unused memory from the other, and Execution can evict cached blocks until Storage shrinks back to the Storage Fraction.
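As a back-of-the-envelope illustration of the fractions (the real accounting is done by Spark's memory manager and differs slightly between versions):

// Rough split of an 8 GB executor heap with the default fractions
val executorMemory  = 8L * 1024 * 1024 * 1024        // --executor-memory 8G
val reserved        = 300L * 1024 * 1024             // reserved by Spark
val memoryFraction  = 0.6                            // spark.memory.fraction
val storageFraction = 0.5                            // spark.memory.storageFraction

val usable      = executorMemory - reserved
val sparkMemory = (usable * memoryFraction).toLong        // unified Storage + Execution
val userMemory  = usable - sparkMemory                    // user data structures
val storage     = (sparkMemory * storageFraction).toLong  // eviction-protected Storage
val execution   = sparkMemory - storage                   // Shuffle / Join / Sort / Aggregate

println(f"spark: ${sparkMemory / 1e9}%.2f GB, user: ${userMemory / 1e9}%.2f GB")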
Examples
[Example slides: the same memory-layout diagram — Disk | Off-heap | 300MB reserved | User Memory | Storage Fraction | Execution Fraction — with the User Memory, Storage Fraction and Execution Fraction regions filling up in turn, ending in an OutOfMemoryError]
How Spark can help us not to stop our applications
Apache Spark
• Spill: Execution data that does not fit in memory is spilled to Disk.
• Project Tungsten: uses Off-heap memory, easing pressure on the On-heap region.
Apache Spark : Garbage Collection
[GC runs over the On-heap region; the options below help to keep it under control]
JVM : Garbage Collection
-XX:+UseConcMarkSweepGC // use the CMS (Concurrent Mark Sweep) collector for the old generation
-XX:+UseParNewGC // use the parallel collector for the young generation (with CMS)
-XX:+CMSParallelRemarkEnabled // run the CMS remark phase in parallel
-XX:+DisableExplicitGC // ignore explicit GC requests (System.gc())
JVM : Garbage Collection
-XX:+HeapDumpOnOutOfMemoryError // dump the heap when an OutOfMemoryError occurs
-XX:+PrintGCDetails // log detailed GC information
-XX:+PrintGCDateStamps // add date stamps to the GC log
-XX:+UseGCLogFileRotation // rotate the GC log files
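These flags are passed to the executors through spark.executor.extraJavaOptions (and to the driver through spark.driver.extraJavaOptions). Note that -XX:+UseGCLogFileRotation also needs a log file and rotation limits; a sketch, with the log path and sizes as placeholders of my choosing:

--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC \
  -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/tmp/executor-gc.log -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=16M"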
JVM options in spark-submit
$ spark-submit \
    --executor-memory 8GB \
    --num-executors 20 \
    --executor-cores 2 \
    --conf "spark.executor.extraJavaOptions=..." \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=1073741824 \
    --class jp.co.recruit.app.Main \
    spark-project-1.0-SNAPSHOT.jar
How we can help ourselves not to stop our applications
RDD persistence
Persisted RDD blocks live in the Storage Fraction.
rdd.cache() is the same as rdd.persist(), which is the same as rdd.persist(StorageLevel.MEMORY_ONLY)
RDD persistence: StorageLevel (example below)
MEMORY_ONLY / MEMORY_ONLY_2 / MEMORY_ONLY_SER
MEMORY_AND_DISK / MEMORY_AND_DISK_2 / MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP
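For example (rdd is any existing RDD; MEMORY_ONLY_SER is one reasonable choice here, not a general recommendation):

import org.apache.spark.storage.StorageLevel

// Serialized caching: partitions are kept as serialized byte arrays.
// Slower to read back than MEMORY_ONLY, but a much smaller footprint in the
// Storage Fraction and far fewer objects for the GC to track.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count()   // run an action so the blocks are actually cached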
Estimating the size of an RDD (1)
• Use SizeEstimator
$ spark-shell
> import org.apache.spark.util.SizeEstimator
> SizeEstimator.estimate("1234")
res0: Long = 48
> val rdd = sc.makeRDD((1 to 100000).map(e => e.toString).toSeq)
> SizeEstimator.estimate(rdd)
res2: Long = 7246792
Estimating the size of an RDD (2)
• Persist the RDD, then check the Storage panel of the Web UI
> SizeEstimator.estimate(rdd)
res2: Long = 7246792
> rdd.persist(StorageLevel.MEMORY_ONLY)
> rdd.count()  // materialize the cache, then compare with the Storage panel
Persisting RDDs
> val orders = sc.textFile("lineorder.csv")
orders: org.apache.spark.rdd.RDD[String] = ...
> val result = orders.map(...)
result: org.apache.spark.rdd.RDD[String] = ...
> orders.persist(StorageLevel.MEMORY_ONLY)
> result.persist(StorageLevel.MEMORY_AND_DISK)
With MEMORY_AND_DISK, blocks that do not fit in the Storage Fraction are written to disk:
> result.persist(StorageLevel.MEMORY_AND_DISK)
With MEMORY_ONLY, blocks that do not fit are simply not cached:
> orders.persist(StorageLevel.MEMORY_ONLY)
16/12/09 14:34:06 WARN MemoryStore: Not enough space to cache rdd_1_39 in memory! (computed 44.4 MB so far)
16/12/09 14:34:06 WARN BlockManager: Block rdd_1_39 could not be removed as it was not found on disk or in memory
16/12/09 14:34:06 WARN BlockManager: Putting block rdd_1_39 failed
• Options when a block does not fit: give the Storage Fraction more room (bigger executor memory or larger spark.memory.fraction / spark.memory.storageFraction), or choose a storage level that falls back to disk or stores blocks serialized.
• Split the RDD into more (smaller) partitions so that each block fits:
> orders.partitions.size
res3: Int = 40
> val orders2 = orders.repartition(80)
> orders2.persist(StorageLevel.MEMORY_ONLY)
OutOfMemoryError
Unpersisting RDDs frees up the Storage Fraction:
> rdd.unpersist(true) // blocking: wait until all blocks have been removed
> rdd.unpersist(false) // non-blocking: remove the blocks asynchronously
Execution Fraction
• The Execution Fraction is filled by Spark itself (shuffle, join, sort, aggregate), so it is controlled indirectly:
• Garbage Collection: tune the GC options and watch GC time
• Shuffle: reduce the amount of data shuffled between stages
Apache Spark : memory settings summary
• On-heap (total): --executor-memory / --conf spark.executor.memory (300MB is always reserved)
• Memory Fraction vs User Memory: --conf spark.memory.fraction
• Storage Fraction vs Execution Fraction: --conf spark.memory.storageFraction
• Off-heap: --conf spark.memory.offHeap.size
• Overhead: --conf spark.mesos.executor.memoryOverhead / --conf spark.yarn.executor.memoryOverhead
Memory sizing : Executor
• [A] Storage Fraction = total size of the RDDs to be persisted in memory
• [B] Execution Fraction = about the same amount as A
• [C] On-heap = (A + B) / 0.6 + 300MB // 0.6 = spark.memory.fraction; the rest is User Memory
• [D] Off-heap = size of RDDs persisted off-heap
• [E] Overhead = max(C * 0.1, 384MB) // the default
• [F] = number of Containers (Executors) per node
• [G] = memory for the OS and other applications
• [H] = physical memory of the node
(C + D + E) * F + G < H
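A worked example with made-up numbers: A = 4GB of persisted RDDs and B ≈ 4GB give C = (4 + 4) / 0.6 + 0.3 ≈ 13.6GB; with D = 1GB off-heap, E = max(13.6 * 0.1, 0.384) ≈ 1.4GB; with F = 2 Executors per node and G = 4GB for the OS, (13.6 + 1 + 1.4) * 2 + 4 = 36GB, so the node needs more than 36GB of physical memory (H).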
Memory sizing : what about the Driver?
• Driver memory: --driver-memory / --conf spark.driver.memory
• Driver memory overhead: --conf spark.mesos.driver.memoryOverhead / --conf spark.yarn.driver.memoryOverhead
• --conf spark.driver.maxResultSize=1G caps the total size of results an action may return to the Driver
• Actions (collect, reduce, take, ...) bring data back to the Driver — watch out!
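A small sketch of keeping bulk data away from the Driver (rdd and the output path are arbitrary examples):

// Only a bounded sample is brought back to the Driver
val preview = rdd.take(100)

// Bulk output stays on the cluster instead of flowing through the Driver
rdd.saveAsTextFile("hdfs:///tmp/output")

// rdd.collect() would pull every element into Driver memory and is limited
// by spark.driver.maxResultSize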
Yes, It’s all about Spark Memory.
Enjoy In-memory Computing!