![Page 1: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/1.jpg)
From MapReduce to Spark with Apache Crunch
Micah Whitacre
@mkwhit
![Page 2: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/2.jpg)
![Page 3: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/3.jpg)
Invested in learning
![Page 4: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/4.jpg)
Invested in learning
Set up production clusters
![Page 5: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/5.jpg)
Invested in learning
Set up production clusters
Tuned everything
![Page 6: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/6.jpg)
Current Strategy
1. Build MR Jobs as needed
2. ????
3. Profit
![Page 7: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/7.jpg)
We should switch to
![Page 8: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/8.jpg)
Umm what would it take to switch?
![Page 9: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/9.jpg)
Learn Spark’s API and processing patterns, ...
![Page 10: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/10.jpg)
Refactor all our code, ...
![Page 11: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/11.jpg)
Experiment with how to tune it all again, ...
![Page 12: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/12.jpg)
![Page 13: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/13.jpg)
We don’t use plain MR, so it won’t be that bad...
![Page 14: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/14.jpg)
● Crunch (http://crunch.apache.org/user-guide.html#sparkpipeline)
● Cascading/Scalding (https://github.com/tresata/spark-scalding)
● Summingbird (will https://github.com/twitter/summingbird/issues/387)
![Page 15: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/15.jpg)
![Page 16: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/16.jpg)
How Spark is Known..
![Page 17: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/17.jpg)
In Memory
How Spark is Known..
![Page 18: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/18.jpg)
In Memory
100x Faster than MapReduce
SQL, streaming, and complex analytics
How Spark is Known..
![Page 19: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/19.jpg)
A fast and general engine for large-scale
data processing.
![Page 20: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/20.jpg)
Spark has an advanced Directed Acyclic Graph execution engine that
supports cyclic data flow and in-memory computing.
![Page 21: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/21.jpg)
Spark has an advanced Directed Acyclic Graph execution engine that
supports cyclic data flow and in-memory computing.
![Page 22: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/22.jpg)
![Page 23: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/23.jpg)
RDD
![Page 24: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/24.jpg)
RDD
Resilient Distributed Dataset
![Page 25: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/25.jpg)
Locality Aware Scheduling
![Page 26: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/26.jpg)
Locality Aware Scheduling
Scalability
![Page 27: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/27.jpg)
Fault Tolerant
Locality Aware Scheduling
Scalability
![Page 28: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/28.jpg)
Fault Tolerant
Locality Aware Scheduling
Scalability
Applications with working sets (parallel ops on intermediate results)
![Page 29: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/29.jpg)
Fault Tolerant
Locality Aware Scheduling
Scalability
Applications with working sets (parallel ops on intermediate results)
![Page 30: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/30.jpg)
Log Updates
Options?
Distributed Shared Memory + Checkpointing
![Page 31: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/31.jpg)
Log Updates
Options?
Distributed Shared Memory + Checkpointing
![Page 32: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/32.jpg)
Log (coarse-grained)
Updates
![Page 33: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/33.jpg)
Immutable/Read Only
Partitioned
Bad for async updates to shared state
![Page 34: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/34.jpg)
An RDD’s in-memory lifecycle is tied to the Spark application
![Page 35: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/35.jpg)
Transformations
Actions
![Page 36: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/36.jpg)
Transformations: map, filter, flatMap, union, groupByKey, sample
Actions: reduce, collect, count, take
![Page 37: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/37.jpg)
Transformations: lazily executed
Actions: return values to driver
![Page 38: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/38.jpg)
val sc = new SparkContext(new SparkConf())
val charCounts = sc.textFile(args(0))
  .flatMap(_.split(" "))
  .flatMap(_.toCharArray)
  .map((_, 1))
charCounts.collect()
// ('a', 1)('a', 1)('b', 1)('c', 1)('e', 1)
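The lazy-transformation idea in the Scala snippet above can be tried without a cluster. The following is a minimal sketch that uses plain `java.util.stream` as an analogy, not Spark itself: intermediate stream operations play the role of transformations and `collect` plays the role of an action. The class name `LazyCharPairs` and the `wordsSeen` counter are hypothetical and exist only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: Java streams are single-machine, but they share the
// "build the plan lazily, run it on a terminal call" behaviour.
final class LazyCharPairs {
    // Counts how many words the pipeline has actually processed so far.
    static final AtomicInteger wordsSeen = new AtomicInteger();

    // Builds the pipeline only; like a chain of transformations, nothing runs yet.
    static Stream<Map.Entry<Character, Integer>> charPairs(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Stream.of(line.split(" ")))      // line -> words
            .peek(word -> wordsSeen.incrementAndGet())         // observe real work
            .flatMap(word -> word.chars().mapToObj(c -> Character.valueOf((char) c)))
            .map(c -> Map.entry(c, 1));                        // char -> (char, 1)
    }

    public static void main(String[] args) {
        Stream<Map.Entry<Character, Integer>> pairs = charPairs(List.of("ab a", "c"));
        System.out.println(wordsSeen.get()); // prints 0: no terminal call yet

        // The terminal call plays the role of an action such as collect().
        List<Map.Entry<Character, Integer>> result = pairs.collect(Collectors.toList());
        System.out.println(wordsSeen.get()); // prints 3: the work happened here
        System.out.println(result);
    }
}
```

As in the Scala version, the pairs are not aggregated, so a repeated character shows up as several `(char, 1)` entries.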
![Page 39: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/39.jpg)
Apache Crunch Review
![Page 40: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/40.jpg)
Diagram (example workflow): Process Reference Data; Process Raw Person Data; Process Raw Data using Reference; Filter Out Invalid Data; Group Data By Person; Create Person Record. Data formats shown: Avro, CSV, CSV.
![Page 41: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/41.jpg)
Diagram (example workflow): Process Reference Data; Process Raw Person Data; Process Raw Data using Reference; Filter Out Invalid Data; Group Data By Person; Create Person Record. Data formats shown: Avro, CSV, CSV. Labeled: Pipeline.
![Page 42: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/42.jpg)
Pipeline p = ...
![Page 43: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/43.jpg)
Diagram (example workflow): Process Reference Data; Process Raw Person Data; Process Raw Data using Reference; Filter Out Invalid Data; Group Data By Person; Create Person Record. Labeled: Pipeline, Targets, Sources.
![Page 44: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/44.jpg)
PCollection<String> values = p.read(source);
...
values.write(target);
![Page 45: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/45.jpg)
Diagram (example workflow): Process Reference Data; Process Raw Person Data; Process Raw Data using Reference; Filter Out Invalid Data; Group Data By Person; Create Person Record. Data formats shown: Avro, CSV, CSV. Labeled: Pipeline, PCollection.
![Page 46: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/46.jpg)
PCollection<String> values = …
PTable<String, Integer> counts = values.parallelDo(fn, ptype);
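To make the role of `fn` in the `parallelDo` call above more concrete, here is a minimal plain-Java sketch of the "one record in, any number of records out" contract. It does not use the Crunch library: `SketchDoFn`, `SketchEmitter`, and `applyToAll` are hypothetical stand-ins written for this illustration, and the `ptype` argument (which tells Crunch how to serialize the output) is not modeled at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-in types only; these are NOT the Crunch classes.
final class DoFnSketch {
    /** Stand-in for an emitter: receives zero or more outputs per input. */
    interface SketchEmitter<T> {
        void emit(T value);
    }

    /** Stand-in for a DoFn: processes one input and emits any number of outputs. */
    interface SketchDoFn<S, T> {
        void process(S input, SketchEmitter<T> emitter);
    }

    /** Applies fn to every input in order, collecting everything it emits. */
    static <S, T> List<T> applyToAll(List<S> inputs, SketchDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S input : inputs) {
            fn.process(input, out::add);
        }
        return out;
    }

    /** Emits a (word, 1) pair per non-empty word; empty strings emit nothing. */
    static final SketchDoFn<String, Map.Entry<String, Integer>> WORD_TO_PAIR =
        (word, emitter) -> {
            if (!word.isEmpty()) {
                emitter.emit(Map.entry(word, 1));
            }
        };

    public static void main(String[] args) {
        // The empty string is dropped, so two pairs come out of three inputs.
        System.out.println(applyToAll(List.of("spark", "", "crunch"), WORD_TO_PAIR));
    }
}
```

Because a function may emit zero, one, or many records per input, the same shape covers the filter-style and map-style functions named on the following slides.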
![Page 47: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/47.jpg)
DoFn
DoFn
Join FilterFn Group By Key MapFn
Pipeline
![Page 48: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/48.jpg)
DoFn
DoFn
Join FilterFn Group By Key MapFn
MRPipeline
![Page 49: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/49.jpg)
Pipeline p = new MRPipeline(Driver.class, hadoopConfig);
![Page 50: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/50.jpg)
PCollection<String> values = p.read(...);
// ... do processing ...
p.write(...);
p.done();
![Page 51: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/51.jpg)
DoFn
DoFn
Join FilterFn Group By Key MapFn
MRPipeline
Map Reduce Reduce
![Page 52: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/52.jpg)
Here’s what we need to do to switch...
![Page 53: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/53.jpg)
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>${sparkVersion}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-spark</artifactId>
  <version>${crunchVersion}</version>
  <scope>compile</scope>
</dependency>
![Page 54: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/54.jpg)
Pipeline p = new MRPipeline(Driver.class, hadoopConfig);
![Page 55: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/55.jpg)
Pipeline p = new SparkPipeline("spark://localhost:7077", "Spark App Name");
![Page 56: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/56.jpg)
hadoop jar myjar.jar com.example.Driver ...
![Page 57: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/57.jpg)
spark-submit \
  --class com.example.Driver \
  --master spark://localhost:7077 \
  ...
![Page 58: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/58.jpg)
DoFn
DoFn
Join FilterFn Group By Key MapFn
SparkPipeline
Job 1: Stage 1, Stage 2, Stage 3
![Page 59: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/59.jpg)
That’s not too bad...
![Page 60: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/60.jpg)
Well there are some differences to account for...
![Page 61: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/61.jpg)
![Page 62: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/62.jpg)
Crunch with MRPipeline minimizes I/O
Crunch with SparkPipeline defers planning to Spark
![Page 63: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/63.jpg)
MRPipeline SparkPipeline
![Page 64: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/64.jpg)
MRPipeline SparkPipeline
Supports multiple writes
![Page 65: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/65.jpg)
MRPipeline SparkPipeline
Supports multiple writes
Performs multiple writes in same task/stage
![Page 66: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/66.jpg)
MRPipeline vs. SparkPipeline
Supports multiple writes
MRPipeline: performs multiple writes in same task/stage
SparkPipeline: serial writes in separate tasks/stages
![Page 67: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/67.jpg)
DoFn
DoFn
FilterFn Group By Key MapFn
Map Reduce
![Page 68: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/68.jpg)
DoFn
DoFn
FilterFn Group By Key MapFn
Job 1
Job 2
SparkPipeline
![Page 69: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/69.jpg)
DoFn
DoFn
FilterFn Group By Key MapFn
Map Reduce
![Page 70: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/70.jpg)
DoFn
DoFn
FilterFn Group By Key MapFn
Stage 1
Job 2
Job 3
SparkPipeline
![Page 71: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/71.jpg)
DoFn
Compute Expensive DoFn
DoFn
FilterFn
![Page 72: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/72.jpg)
Spark is lazy
Action needed for something to happen
![Page 73: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/73.jpg)
DoFn
Compute Expensive DoFn
DoFn
FilterFn
Job 1
![Page 74: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/74.jpg)
DoFn
Compute Expensive DoFn
DoFn
FilterFn
Job 2
![Page 75: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/75.jpg)
DoFn
Compute Expensive DoFn
DoFn
FilterFn
![Page 76: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/76.jpg)
Limit expensive computations
Keep RDDs around for reuse
![Page 77: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/77.jpg)
Spark supports persisting RDDs in memory
rdd.persist()
![Page 78: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/78.jpg)
DISK_ONLY, DISK_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2,
MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, MEMORY_ONLY,
MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, NONE, OFF_HEAP
rdd.persist(StorageLevel.MEMORY_ONLY)
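The motivation for persisting can be seen in a small single-machine sketch. The code below is plain Java, not Spark or Crunch: a `Supplier` stands in for the expensive upstream step, two downstream computations stand in for the two jobs, and a hypothetical `cached` helper plays the role that persisting plays for an RDD. All names here are assumptions made for illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Analogy only: shows recomputation versus reuse of an upstream result.
final class ReuseSketch {
    /** Wraps a supplier so its value is computed on first use and reused afterwards. */
    static <T> Supplier<T> cached(Supplier<T> source) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = source.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    /** Runs two downstream "jobs" over one upstream and returns how often upstream ran. */
    static int upstreamRuns(boolean useCache) {
        AtomicInteger runs = new AtomicInteger();
        Supplier<List<Integer>> expensive = () -> {
            runs.incrementAndGet(); // stands in for the expensive step
            return List.of(1, 2, 3, 4);
        };
        Supplier<List<Integer>> upstream = useCache ? cached(expensive) : expensive;

        // "Job 1": a map-like step over the upstream result.
        List<Integer> doubled =
            upstream.get().stream().map(x -> x * 2).collect(Collectors.toList());
        // "Job 2": a filter-like step over the same upstream result.
        List<Integer> evens =
            upstream.get().stream().filter(x -> x % 2 == 0).collect(Collectors.toList());

        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println(upstreamRuns(false)); // prints 2: each job recomputes upstream
        System.out.println(upstreamRuns(true));  // prints 1: the second job reuses the result
    }
}
```

The sketch keeps the value on the heap only; the storage levels listed above additionally cover disk, serialized, and replicated variants, which this analogy does not attempt to model.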
![Page 79: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/79.jpg)
DoFn
Compute Expensive DoFn
DoFn
FilterFn
Job 2
Job 1
![Page 80: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/80.jpg)
PCollection<String> values = ...; // expensive computation
values.cache();
![Page 81: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/81.jpg)
PCollection<String> values = ...; // expensive computation
CachingOptions opts = new CachingOptions.Builder()
    .useDisk(true)
    .useMemory(true)
    .build();
values.cache(opts);
![Page 82: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/82.jpg)
Spark needs to be able to serialize data
Send data between workers
Persist data in memory or disk
![Page 83: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/83.jpg)
Spark supported serialization
Java Serializable (and Externalizable)
Kryo Serialization
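As a reminder of what the Java `Serializable` option involves, here is a minimal round-trip sketch using only the JDK. It covers the first option listed above and does not use Spark or Kryo. The `Person` class and the helper names are hypothetical, chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// JDK-only sketch of turning a record into bytes and back again.
final class SerializationSketch {
    /** Hypothetical record type; implementing Serializable enables the round trip. */
    static final class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int visits;

        Person(String name, int visits) {
            this.name = name;
            this.visits = visits;
        }
    }

    /** Serializes a value to bytes, as would be needed to ship it to another process. */
    static byte[] toBytes(Serializable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    /** Rebuilds an object from bytes produced by toBytes. */
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    /** Convenience wrapper that returns a deserialized copy and rethrows failures unchecked. */
    static Person roundTrip(Person person) {
        try {
            return (Person) fromBytes(toBytes(person));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Person copy = roundTrip(new Person("ada", 3));
        System.out.println(copy.name + " " + copy.visits); // prints "ada 3"
    }
}
```

If `Person` did not implement `Serializable`, `writeObject` would throw `java.io.NotSerializableException`, which is the kind of failure that surfaces when a record type cannot be shipped between processes.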
![Page 84: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/84.jpg)
Spark recommends Kryo
Extra config on the SparkConf
Custom serializer registration
![Page 85: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/85.jpg)
Crunch on Spark
Hides serialization behind PTypes
Handles complex records like Avro
![Page 86: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/86.jpg)
Crunch on Spark
Hides serialization behind PTypes
Handles complex records like Avro
![Page 87: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/87.jpg)
Additional Topics to Explore
Aggregation sort behaviors
Reusing Crunch Functions in Spark
![Page 88: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/88.jpg)
With Crunch, we’ll be able to ...
![Page 89: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/89.jpg)
minimize significant code refactoring,
![Page 90: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/90.jpg)
shorten learning curve by reusing concepts and API already used to...
![Page 91: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/91.jpg)
incrementally switch to Spark,
![Page 92: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/92.jpg)
overall experiment to find where Spark fits best.
![Page 93: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/93.jpg)
Links:
● http://crunch.apache.org/
● http://spark.apache.org/docs/latest/
● Examples: https://github.com/mkwhitacre/simplesparkapp
![Page 94: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning](https://reader033.fdocuments.net/reader033/viewer/2022042223/5ec97c190b155a264d70eeba/html5/thumbnails/94.jpg)
Special Thanks...
● Josh Wills - helped come up with content
● Sean Owen & Sandy Ryza whose repo I forked to build examples and experiment