“Transform Real Time Data into Real Time Decisions”
Asit Parija (@asitparija)
2017. 12. 14.
CUSTOMERS PARTNERSHIPS
OPEN SOURCE
FROM (RDBMS, ETL, EDW)
• Only structured data
• $50K – 100K per TB
• Limited analytics
TO (RDBMS, Hadoop/Spark for ETL + Long Term Storage, EDW for Query + Present)
✓ Both structured and unstructured data
✓ 50x-100x cost savings: $1K per TB
✓ Expanded analytics with MapReduce/NoSQL etc.
Data sources: Sensor Data, Web Logs
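The 50x-100x figure follows directly from the per-terabyte prices on the slide. A minimal sketch of that arithmetic; the 10 TB data set size is a hypothetical value chosen only for illustration and does not come from the talk.

```python
# Per-terabyte storage costs quoted on the slide (US dollars).
RDBMS_COST_PER_TB = (50_000, 100_000)  # low and high end of the quoted range
HADOOP_COST_PER_TB = 1_000

# Savings factor at each end of the RDBMS price range.
savings_low = RDBMS_COST_PER_TB[0] // HADOOP_COST_PER_TB
savings_high = RDBMS_COST_PER_TB[1] // HADOOP_COST_PER_TB

# Hypothetical data set size, not taken from the talk.
dataset_tb = 10
rdbms_total = tuple(cost * dataset_tb for cost in RDBMS_COST_PER_TB)
hadoop_total = HADOOP_COST_PER_TB * dataset_tb

print(f"Savings factor: {savings_low}x to {savings_high}x")
print(f"{dataset_tb} TB on RDBMS: ${rdbms_total[0]:,} to ${rdbms_total[1]:,}")
print(f"{dataset_tb} TB on Hadoop: ${hadoop_total:,}")
```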
ETL Goals
• Make data processing more powerful
• Make data processing simpler
• Make data processing 100x faster than before
• What are the options?
What steered us into Spark
• Powerful in-memory processing
• Simple operators on data
• Debuggable API
• Efficient execution
• Universally distributed
What steered us into Pig
• DSL for ETL
• Rich Operator Library
• Extendable
• Pluggable
• Powerful ETL
Operator Mapping

| Pig | Spark |
| --- | --- |
| Load | HadoopRDD |
| Store | saveAsObjectFile |
| Filter | MappedRDD + filter func |
| GroupBY (Local rearrange, global rearrange & package) | Sort + Group by |
| … | … |
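The GroupBY row names three Pig phases: local rearrange, global rearrange, and package. The following is a minimal pure-Python sketch of how those phases could fit together, with a sort standing in for the shuffle. It is not the Spork code and does not use Spark; the function names and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def local_rearrange(records, key_index):
    """Tag each record with its grouping key: record -> (key, record)."""
    return [(rec[key_index], rec) for rec in records]

def global_rearrange(keyed):
    """Bring equal keys next to each other; a sort stands in for the shuffle."""
    return sorted(keyed, key=itemgetter(0))

def package(sorted_keyed):
    """Collect the records of each key into one (key, bag) pair."""
    return [
        (key, [rec for _, rec in group])
        for key, group in groupby(sorted_keyed, key=itemgetter(0))
    ]

# Hypothetical sample rows: (project, page, hits).
rows = [
    ("en", "Main_Page", 42),
    ("de", "Hauptseite", 7),
    ("en", "Spark", 5),
]

grouped = package(global_rearrange(local_rearrange(rows, key_index=0)))
for key, bag in grouped:
    print(key, bag)
```

Because Python's sort is stable, records keep their input order inside each bag.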
Current Flow
Issues
• Scaling
• Performance
• Spark-specific operators (Cache)
• Pig on Spark unit tests
• Some specific joins & rank operations
Filter Code implementation
• https://bitbucket.org/SigmoidDev/spork/src/80a3e4626e4504c1829568942e0690abc79d239a/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/FilterConverter.java?at=spork-1.0
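The linked FilterConverter.java is not reproduced here. As a rough illustration of the "filter func" idea from the operator mapping, the sketch below wraps a predicate and applies it lazily to each row, much as an RDD filter would. The class name `FilterConverterSketch`, its `convert` method, and the sample rows are hypothetical and are not taken from the Spork source.

```python
from typing import Callable, Iterable, Iterator, Tuple

Row = Tuple  # a Pig tuple, modelled here as a plain Python tuple

class FilterConverterSketch:
    """Hypothetical stand-in for a converter that turns a Pig FILTER into a row filter."""

    def __init__(self, predicate: Callable[[Row], bool]) -> None:
        self.predicate = predicate

    def convert(self, rows: Iterable[Row]) -> Iterator[Row]:
        # Lazily keep only the rows for which the predicate holds.
        return (row for row in rows if self.predicate(row))

# Hypothetical sample rows: (project, page, hits).
rows = [
    ("en", "Main_Page", 42),
    ("de", "Hauptseite", 7),
    ("en", "Spark", 15),
]

# Rough equivalent of a Pig statement such as: FILTER rows BY hits >= 10
converter = FilterConverterSketch(lambda row: row[2] >= 10)
kept = list(converter.convert(rows))
print(kept)
```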
Contribute
• Pig on Spark Umbrella Jira
• https://issues.apache.org/jira/browse/PIG-4059
• https://github.com/sigmoidanalytics/spork
• Issues
Benchmark
A Distinct operation on a 25-day wikistats dump (270 GB) took 4.25 minutes with Pig on Spark, compared to 30 minutes on MapReduce, roughly a 7x speedup.
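The benchmark numbers can be turned into a speedup and an approximate throughput. A small sketch of that arithmetic, assuming "270G" means 270 gigabytes; no cluster size is given on the slide, so no per-node figure is derived.

```python
# Figures quoted on the benchmark slide.
SPARK_MINUTES = 4.25
MAPREDUCE_MINUTES = 30.0
DATASET_GB = 270.0  # assumption: "270G" is read as gigabytes

speedup = MAPREDUCE_MINUTES / SPARK_MINUTES
spark_gb_per_min = DATASET_GB / SPARK_MINUTES
mapreduce_gb_per_min = DATASET_GB / MAPREDUCE_MINUTES

print(f"Speedup: {speedup:.2f}x")
print(f"Pig on Spark: {spark_gb_per_min:.1f} GB/min")
print(f"MapReduce: {mapreduce_gb_per_min:.1f} GB/min")
```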
Mixing Streaming & Batch Processing
• Current state: different code for batch and stream
• Lambda Architecture
• One unified language to perform both
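The idea of one language for both modes can be illustrated by writing the transformation once and reusing it for a full data set and for a sequence of micro-batches. This is a pure-Python sketch under that assumption; it is not the CloudFlux or SigmaStream API, and the function names and sample lines are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, List

def count_by_project(records: Iterable[str]) -> Counter:
    """Shared logic: count records per project code (the first field)."""
    return Counter(line.split()[0] for line in records if line.strip())

def run_batch(dataset: List[str]) -> Counter:
    """Batch mode: apply the shared logic to the whole data set at once."""
    return count_by_project(dataset)

def run_stream(micro_batches: Iterable[List[str]]) -> Iterator[Counter]:
    """Stream mode: apply the same logic per micro-batch and keep running totals."""
    running = Counter()
    for batch in micro_batches:
        running.update(count_by_project(batch))
        yield Counter(running)  # snapshot of the totals so far

# Hypothetical sample lines: "project page hits".
lines = ["en Main_Page 42", "de Hauptseite 7", "en Spark 5"]

batch_result = run_batch(lines)
stream_results = list(run_stream([lines[:2], lines[2:]]))
print(batch_result)
print(stream_results[-1])
```

Once the stream has consumed the same records, its final running total matches the batch result, which is the property a unified language is meant to preserve.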
What else is cool

| CloudFlux | SigmaStream |
| --- | --- |
| Cloud Deployment | PIG/SQL Like DSL |
| Fault Tolerance | Rich Stream operators |
| AutoScaling | Multiple Data source/Sink |
| Programmatic interface | Add custom Operators |
| Cloud Agnostic | Apache Spark Based |
| Apache License | Apache License |
Thank You
India Office
Gulmohar Enclave Road,
Silver Spring Layout, Munnekollal
Bengaluru, Karnataka 560037
+1 (760) 203 3257
US Office
1343 Kingfisher Way
Sunnyvale, CA, 94087