
Transcript of Apache Spark Internals - Part 2

Page 1: Apache Spark Internals - Part 2

Lightning-fast cluster computing

Page 2: Apache Spark Internals - Part 2

Resilience

[Diagram: a Driver with its Jobs, an active Master, and three Workers, each running an Executor that executes Tasks.]

Page 3: Apache Spark Internals - Part 2

Resilience: Driver

[Diagram: the same cluster: Driver, Jobs, active Master, and three Workers with Executors and Tasks.]

./spark-submit --deploy-mode "cluster" --supervise

Page 4: Apache Spark Internals - Part 2

Resilience: Driver

[Diagram: the same cluster, with the Driver now placed inside one of the Workers.]

Driver runs in the worker

Page 5: Apache Spark Internals - Part 2

Resilience: Driver

[Diagram: the same cluster; the Driver now appears on a different Worker.]

The driver is started on a new worker (thanks to --supervise).

Page 6: Apache Spark Internals - Part 2

Resilience: Master

[Diagram: an active Master and a standby Master coordinated through Zookeeper, the Driver with its Jobs, and three Workers each running an Executor with Tasks.]

Page 7: Apache Spark Internals - Part 2

Resilience: Master

[Diagram: when the active Master fails, Zookeeper elects the standby Master, which takes over the Driver and its Jobs while the Workers keep running their Executors and Tasks.]

Page 8: Apache Spark Internals - Part 2

Resilience: Worker

[Diagram: one Worker is killed; the Driver and Executor running on it are also killed.]

Page 9: Apache Spark Internals - Part 2

Resilience: Worker

[Diagram: the Worker is relaunched; the Driver and Executor are also relaunched.]

Page 10: Apache Spark Internals - Part 2

Resilience: RDD

● An RDD is an immutable, deterministically re-computable, distributed dataset.

● Each RDD remembers the lineage of deterministic operations that were used on a fault-tolerant input dataset to create it.

● If any partition of an RDD is lost due to a worker node failure, then that partition can be re-computed from the original fault-tolerant dataset using the lineage of operations.

● Assuming that all of the RDD transformations are deterministic, the data in the final transformed RDD will always be the same irrespective of failures in the Spark cluster.
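
A minimal spark-shell sketch of what lineage means in practice (the HDFS path and the filter predicate are placeholders, not from the original slides): each transformation records its parent, and toDebugString prints the resulting lineage graph.

    // Run in spark-shell, where `sc` (SparkContext) is already provided.
    val logLines = sc.textFile("hdfs:///logs/app.log")      // fault-tolerant input dataset
    val errors   = logLines.filter(_.startsWith("Error"))   // deterministic transformation
    val cleaned  = errors.coalesce(2)

    // Each RDD remembers the chain of operations that produced it:
    println(cleaned.toDebugString)

    // If a partition of `cleaned` is lost, Spark re-runs filter + coalesce on the
    // corresponding input partitions; determinism guarantees the same result.
    cleaned.count()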

Page 11: Apache Spark Internals - Part 2

Resilience: RDD

[Diagram: RDD lineage for a log-processing job: logLinesRDD → filter(fx) → errorsRDD → coalesce(2) → cleanedRDD → filter(fx) → errorMsg1RDD, with actions such as count(), collect(), and saveToCassandra(); one of the intermediate RDDs is cached.]

If a partition is damaged, it can be recomputed from its parent; if the parents are no longer in memory, it will be reprocessed from disk.
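
A hedged, spark-shell-style sketch of the pipeline in the diagram (the path and the filter predicates are assumptions): caching cleanedRDD keeps its partitions in memory, so reuse and recovery do not have to go all the way back to disk unless the cached parents have been evicted.

    // Run in spark-shell; `sc` is provided. Path and predicates are placeholders.
    val logLinesRDD  = sc.textFile("hdfs:///logs/app.log")
    val errorsRDD    = logLinesRDD.filter(_.contains("Error"))
    val cleanedRDD   = errorsRDD.coalesce(2).cache()         // keep these partitions in memory
    val errorMsg1RDD = cleanedRDD.filter(_.contains("msg1"))

    // Actions trigger the whole lineage the first time...
    errorMsg1RDD.count()
    // ...but reuse the cached cleanedRDD afterwards. If a cached partition is lost or
    // evicted, it is recomputed from its parents, falling back to the file on disk.
    errorMsg1RDD.collect()

    // saveToCassandra() on the slide comes from the DataStax spark-cassandra-connector
    // (import com.datastax.spark.connector._), which is not bundled with Spark itself.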

Page 12: Apache Spark Internals - Part 2

Shard allocation: RDD - Resilient Distributed Dataset

[Diagram: a file (HDFS, S3, etc.) of log lines split into partitions; default algorithm: hash partition.]

RDD = data abstraction: it hides the complexity of data partitioning and distribution.

Page 13: Apache Spark Internals - Part 2

Shard allocation: RDD - Resilient Distributed Dataset

[Diagram: the same file-to-partitions split, with each partition processed as a Task by an Executor on a Worker; default algorithm: hash partition.]
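
A small spark-shell sketch (the file path is a placeholder) showing that each partition becomes one task: glom() exposes the partition boundaries that the RDD abstraction normally hides.

    // Run in spark-shell; `sc` is provided.
    val lines = sc.textFile("hdfs:///logs/app.log")

    lines.getNumPartitions                 // number of partitions = number of tasks per stage

    val perPartition = lines.glom().zipWithIndex().map {
      case (part, idx) => s"partition $idx: ${part.length} lines"
    }
    perPartition.collect().foreach(println)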

Page 14: Apache Spark Internals - Part 2

Shard allocation: Partition configuration - number of partitions

Specifying the number of partitions: by default, Spark creates one partition for each processor core.
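
A brief sketch of the knobs involved (the path and the numbers are illustrative): parallelize and textFile both accept an explicit partition count, and repartition/coalesce change it afterwards.

    // Run in spark-shell; `sc` is provided.
    val byDefault = sc.parallelize(1 to 1000000)        // spark.default.parallelism:
                                                        // one partition per core on local[*]
    val explicit  = sc.parallelize(1 to 1000000, 16)    // ask for 16 partitions explicitly
    val fromFile  = sc.textFile("hdfs:///logs/app.log", 30)  // at least 30 input partitions

    byDefault.getNumPartitions
    explicit.getNumPartitions                           // 16
    fromFile.repartition(8).getNumPartitions            // 8 (full shuffle)
    fromFile.coalesce(4).getNumPartitions               // 4 (no shuffle by default)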

Page 15: Apache Spark Internals - Part 2

Shard allocation: Partition configuration - defining partition size

Default settings:
● mapreduce.input.fileinputformat.split.minsize = 1 byte (minSize)
● dfs.block.size = 128 MB (cluster) / fs.local.block.size = 32 MB (local) (blockSize)

Calculating the goal size, e.g.:
● Total size of input files = T = 599 MB
● Desired number of partitions = P = 30 (parameterized)
● Partition goal size = PGS = T / P = 599 / 30 ≈ 19 MB

Result: splitSize = max(minSize, min(PGS, blockSize)) = Math.max(1, Math.min(19, 32)) = 19 MB
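
The same calculation written out as a sketch of Hadoop's FileInputFormat split sizing, using the local 32 MB block size and the 599 MB / 30 partitions example from the slide:

    // splitSize = max(minSize, min(goalSize, blockSize)), simplified
    val minSize   = 1L                        // mapreduce.input.fileinputformat.split.minsize
    val blockSize = 32L * 1024 * 1024         // fs.local.block.size (32 MB, local mode)
    val totalSize = 599L * 1024 * 1024        // T = 599 MB of input
    val numSplits = 30                        // P = desired number of partitions

    val goalSize  = totalSize / numSplits                             // ~19 MB
    val splitSize = math.max(minSize, math.min(goalSize, blockSize))  // ~19 MB per partition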

Page 16: Apache Spark Internals - Part 2

Shard allocation: Trade-offs

Fewer partitions:

● more data in each partition
● less network and disk I/O
● faster access to data
● increased memory pressure
● may not make full use of parallelism

More partitions:

● more parallel processing
● less data in each partition
● more network and disk I/O

Page 17: Apache Spark Internals - Part 2

Shard allocation: Example - Cases - auxiliary function
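
The original slide shows the auxiliary function only as a screenshot that is missing from this transcript; the helper below is a plausible stand-in (a hypothetical name and implementation) that reports how many elements each partition holds, which is what the following cases inspect.

    import org.apache.spark.rdd.RDD

    // Hypothetical helper: count the elements that ended up in each partition.
    def partitionSizes[T](rdd: RDD[T]): Array[(Int, Int)] =
      rdd.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size))).collect()

    // Usage: partitionSizes(someRdd).foreach { case (i, n) => println(s"partition $i: $n") }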

Page 18: Apache Spark Internals - Part 2

Shard allocation: Example - Case 1

Correctly distributed across the 8 partitions
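
The Case 1 code is also shown only as a screenshot; a plausible reconstruction, based on the explanation two slides ahead, is:

    // Hypothetical reconstruction of Case 1 (run in spark-shell; `sc` is provided).
    val n = 2000000
    val numbers = sc.parallelize(2 to n, 8)   // 8 range partitions of ~250,000 numbers each

    numbers.map(_ * 2L).count()               // uniform per-element work: all 8 cores stay busy
    // partitionSizes(numbers)                // helper from the previous slide: ~250,000 each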

Page 19: Apache Spark Internals - Part 2

Shard allocation: Example - Case 2

Inefficient use of resources: 8 cores, 4 of them idle
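
Again only a screenshot in the original; a plausible reconstruction based on the explanation two slides ahead (and on the Cloudera post listed in the references) is:

    // Hypothetical reconstruction of Case 2 (run in spark-shell; `sc` is provided).
    val n = 2000000
    val pairs = sc.parallelize(2 to n, 8).map(x => (x, 2 to (n / x)))

    // The downstream work is heavily skewed: small keys expand into huge lists of
    // multiples, while every key above 1,000,000 gets an empty Range. The 4 partitions
    // that hold only large keys have almost nothing to do: 8 cores, 4 of them idle.
    val multiples = pairs.flatMap { case (x, ms) => ms.map(_ * x) }
    multiples.count()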

Page 20: Apache Spark Internals - Part 2

Shard allocation: Example - Case 1 - explanation

val = 2,000,000 / 8 = 250,000

Range partitioning (each partition gets ~250,000 consecutive numbers, so the per-element work is spread evenly across the 8 cores):

[0] -> 2 - 250,000
[1] -> 250,001 - 500,000
[2] -> 500,001 - 750,000
[3] -> 750,001 - 1,000,000
[4] -> 1,000,001 - 1,250,000
[5] -> 1,250,001 - 1,500,000
[6] -> 1,500,001 - 1,750,000
[7] -> 1,750,001 - 2,000,000

Page 21: Apache Spark Internals - Part 2

Shard allocation: Example - Case 2 - explanation

val = 2,000,000

map() turned each number into a (key, value) pair, where the value is the list of all integers the key must be multiplied by to produce its multiples up to 2 million. For half of the keys (all keys greater than 1 million) that list is empty, so the partitions holding them have almost no work to do.

E.g.: (2, Range(2, 3, 4, 5, 6, 7, 8, 9, 10, ..., 1000000)) ... (200013, Range(2, 3, 4, 5, 6, 7, 8, 9))

Page 22: Apache Spark Internals - Part 2

Shard allocation: Example - Case 3 - fixing it using repartition

Correctly distributed across the 8 partitions.

Shuffle partitions: repartition introduces a shuffle that redistributes the data evenly.
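
A plausible reconstruction of the fix (the exact placement of repartition on the original slide is an assumption): shuffling the (key, multipliers) pairs before the expensive flatMap mixes heavy and empty keys into every partition, so the work is spread across all 8 cores.

    // Hypothetical reconstruction of Case 3 (run in spark-shell; `sc` is provided).
    val n = 2000000
    val balanced = sc.parallelize(2 to n, 8)
      .map(x => (x, 2 to (n / x)))
      .repartition(8)                                  // shuffle: mixes heavy and empty keys
      .flatMap { case (x, ms) => ms.map(_ * x) }

    balanced.count()                                   // work is now spread over all 8 cores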

Page 23: Apache Spark Internals - Part 2

References

http://spark.apache.org

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd.html

http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/

http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

Page 24: Apache Spark Internals - Part 2

Thanks! Questions?

[email protected]

@jefersonm

jefersonm

jefersonm

jefmachado