Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka
![Page 1: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/1.jpg)
Open Source Big Data in OPC
Edelweiss Kammermann, Frank Munz
Java One 2017
![Page 2: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/2.jpg)
munz & more #2
![Page 3: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/3.jpg)
© IT Convergence 2016. All rights reserved.
![Page 4: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/4.jpg)
About Me
• Computer Engineer, BI and Data Integration Specialist
• Over 20 years of consulting and project management experience in Oracle technology
• Co-founder and Vice President of the Uruguayan Oracle User Group (UYOUG)
• Director of Community of LAOUC
• Head of BI Team CMS at IT Convergence
• Writer and frequent speaker at international conferences:
– Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, Rittman Mead BI Forum
• Oracle ACE Director
![Page 5: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/5.jpg)
Uruguay
![Page 6: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/6.jpg)
Dr. Frank Munz
• Founded munz & more in 2007
• 17 years Oracle Middleware, Cloud, and Distributed Computing
• Consulting and High-End Training
• Wrote two Oracle WLS books and one Cloud book
![Page 7: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/7.jpg)
![Page 8: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/8.jpg)
#1
Hadoop
![Page 9: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/9.jpg)
What is Big Data?
• Volume: the large amount of data
• Variety: the wide range of data formats and schemas, including unstructured and semi-structured data
• Velocity: the speed at which data is created or consumed
• Oracle added another V to this definition:
– Value: data has intrinsic value, but it must be discovered
![Page 10: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/10.jpg)
What is Oracle Big Data Cloud Compute Edition?
• Big Data platform that integrates the Oracle Big Data solution with open source tools
• Fully elastic
• Integrated with other PaaS services such as Database Cloud Service, MySQL Cloud Service, and Event Hub Cloud Service
• Access, data, and network security
• REST access to all the functionality
![Page 11: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/11.jpg)
Big Data Cloud Service – Compute Edition (BDCS-CE)
![Page 12: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/12.jpg)
BDCS-CE Notebook: Interactive Analysis
• Apache Zeppelin Notebook (version 0.7) to interactively work with data
![Page 13: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/13.jpg)
What is Hadoop?
• An open source software platform for distributed storage and processing
• Manages huge volumes of unstructured data
• Parallel processing of large data sets
• Highly scalable
• Fault-tolerant
• Two main components:
– HDFS: Hadoop Distributed File System, for storing information
– MapReduce: programming framework that processes information
![Page 14: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/14.jpg)
Hadoop Components: HDFS
• Stores the data on the cluster
• NameNode: block registry
• DataNode: the block containers themselves
• HDFS cartoon by Mvarshney
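The split between a block registry and block containers can be illustrated with a toy model in plain Python. This is not the HDFS API: the class names, the tiny block size, the replication factor, and the round-robin placement are all assumptions made for illustration.

```python
from collections import defaultdict

BLOCK_SIZE = 8      # bytes per block; toy value (real HDFS blocks are far larger)
REPLICATION = 2     # copies kept of each block; assumed for this sketch


class DataNode:
    """Holds the block contents themselves."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Block registry: knows which blocks form a file and where they live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.files = defaultdict(list)  # path -> ordered list of block_ids
        self.locations = {}             # block_id -> list of DataNode names

    def write(self, path, data):
        for i in range(0, len(data), BLOCK_SIZE):
            index = i // BLOCK_SIZE
            block_id = f"{path}#{index}"
            # Round-robin placement; real HDFS placement is rack-aware.
            targets = [
                self.datanodes[(index + r) % len(self.datanodes)]
                for r in range(REPLICATION)
            ]
            for node in targets:
                node.blocks[block_id] = data[i:i + BLOCK_SIZE]
            self.files[path].append(block_id)
            self.locations[block_id] = [node.name for node in targets]

    def read(self, path):
        by_name = {node.name: node for node in self.datanodes}
        chunks = []
        for block_id in self.files[path]:
            first_replica = by_name[self.locations[block_id][0]]
            chunks.append(first_replica.blocks[block_id])
        return b"".join(chunks)


nodes = [DataNode(f"dn{i}") for i in range(3)]
namenode = NameNode(nodes)
namenode.write("/logs/app.txt", b"hello distributed file system")
restored = namenode.read("/logs/app.txt")
```

The registry never stores file contents, only block ids and locations, so a reader asks the NameNode where the blocks are and then fetches them from the DataNodes.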
![Page 15: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/15.jpg)
Hadoop Components: MapReduce
• Retrieves data from HDFS
• A MapReduce program is composed of:
– Map() method: performs filtering and sorting of the <key, value> inputs
– Reduce() method: summarizes the <key, value> pairs provided by the Mappers
• Code can be written in many languages (Perl, Python, Java, etc.)
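The Map and Reduce steps can be sketched in plain Python without a cluster. The function names and the sample lines are hypothetical, and the `shuffle` helper stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a <word, 1> pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))  # hand keys to the reducer in sorted order


def reduce_phase(key, values):
    """Reduce(): summarize all values seen for one key."""
    return key, sum(values)


lines = ["big data in the cloud", "open source big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

On a real cluster each mapper would see only its own split of the input, and each reducer only its own subset of the keys.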
![Page 16: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/16.jpg)
MapReduce Example
![Page 17: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/17.jpg)
Code Example
![Page 18: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/18.jpg)
Code Example
![Page 19: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/19.jpg)
#2
Hive
![Page 20: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/20.jpg)
What is Hive?
• Open source data warehouse software on top of Apache Hadoop
• Analyzes and queries data stored in HDFS
• Structures the data into tables
• Tools for simple ETL
• SQL-like queries (HiveQL)
• Procedural language with HPL/SQL
• Metadata storage in an RDBMS
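The idea of tables defined over files, with the table metadata kept in a relational database, can be mimicked in a few lines of Python. This is a toy stand-in and not Hive: the `FILES` dictionary plays the role of HDFS, an in-memory SQLite database plays the role of the metastore, and the table, path, and data are invented for the example.

```python
import csv
import io
import sqlite3

# Stand-in for files in HDFS: path -> raw CSV text (hypothetical data).
FILES = {"/warehouse/sales.csv": "uy,120\nde,80\nuy,30\n"}

# Stand-in for the metastore: table definitions live in a relational database.
metastore = sqlite3.connect(":memory:")
metastore.execute("CREATE TABLE tables (name TEXT, location TEXT, columns TEXT)")
metastore.execute(
    "INSERT INTO tables VALUES (?, ?, ?)",
    ("sales", "/warehouse/sales.csv", "country,amount"),
)


def scan(table_name):
    """Apply the stored schema to the raw file when it is read."""
    location, columns = metastore.execute(
        "SELECT location, columns FROM tables WHERE name = ?", (table_name,)
    ).fetchone()
    reader = csv.reader(io.StringIO(FILES[location]))
    return [dict(zip(columns.split(","), row)) for row in reader]


# Roughly what "SELECT country, SUM(amount) FROM sales GROUP BY country" computes.
totals = {}
for row in scan("sales"):
    totals[row["country"]] = totals.get(row["country"], 0) + int(row["amount"])
```

The raw file is never modified; the structure comes entirely from the metadata applied at read time.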
![Page 21: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/21.jpg)
Hadoop & Hive Demo
![Page 22: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/22.jpg)
#3
Spark
![Page 23: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/23.jpg)
Revisited: Map Reduce I/O
Source: Hadoop Application Architecture Book
![Page 24: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/24.jpg)
Spark
• Orders of magnitude faster than M/R
• Higher-level Scala, Java, or Python API
• Standalone, in Hadoop, or on Mesos
• Principle: run an operation on all data
-> "Spark is the new MapReduce"
• See also: Apache Storm, etc.
• Uses RDDs, DataFrames, or Datasets
https://stackoverflow.com/questions/31508083/difference-between-dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
![Page 25: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/25.jpg)
RDDs
Resilient Distributed Datasets
Where do they come from?
Collection of data grouped into named columns. Supports text, JSON, Apache Parquet, sequence.
• Read in: HDFS, local FS, S3, HBase
• Parallelize an existing collection
• Transform another RDD -> RDDs are immutable
![Page 26: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/26.jpg)
Lazy Evaluation
Transformations (nothing is executed): map(), flatMap(), reduceByKey(), groupByKey()
Actions (execution): collect(), count(), first(), takeOrdered(), saveAsTextFile(), …
http://spark.apache.org/docs/2.1.1/programming-guide.html
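The difference between transformations and actions can be shown with a small pure-Python class. `LazyDataset` is a hypothetical toy, not the Spark API: transformations only record a step and return a new object, and nothing runs until an action such as `collect()` is called.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # tuple of functions: iterable -> iterable

    # Transformations: return a new dataset, execute nothing.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (lambda it: map(func, it),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (lambda it: filter(pred, it),))

    # Actions: run the recorded pipeline and return a result.
    def collect(self):
        data = iter(self._source)
        for step in self._steps:
            data = step(data)
        return list(data)

    def count(self):
        return len(self.collect())


calls = []


def square(x):
    calls.append(x)  # side effect to show when the function really runs
    return x * x


pipeline = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda v: v > 4)
calls_before_action = len(calls)   # still 0: nothing has executed yet
result = pipeline.collect()        # the action triggers execution
```

Because every transformation returns a new object and leaves the original untouched, the sketch also mirrors the immutability of RDDs.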
![Page 27: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/27.jpg)
Transformations
• map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
• flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
• reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
• groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
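What the four transformations on this slide compute can be reproduced on ordinary Python lists. The helper names below are hypothetical local equivalents written for illustration. They run eagerly on one machine, whereas Spark applies the same logic lazily and in parallel across partitions.

```python
from collections import defaultdict
from functools import reduce


def flat_map(func, items):
    """Like map, but func returns a sequence and the results are flattened."""
    return [out for item in items for out in func(item)]


def group_by_key(pairs):
    """(K, V) pairs -> (K, list of V) pairs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())


def reduce_by_key(func, pairs):
    """(K, V) pairs -> (K, V) pairs, folding each key's values with func."""
    return [(key, reduce(func, values)) for key, values in group_by_key(pairs)]


def to_pair(reading):
    """Turn 'name=number' into a (name, number) pair."""
    key, value = reading.split("=")
    return key, int(value)


readings = flat_map(lambda line: line.split(","), ["a=1,b=2", "a=3"])
pairs = list(map(to_pair, readings))
grouped = group_by_key(pairs)
totals = reduce_by_key(lambda x, y: x + y, pairs)
```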
![Page 28: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/28.jpg)
![Page 29: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/29.jpg)
![Page 30: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/30.jpg)
Spark Demo
![Page 31: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/31.jpg)
Apache Zeppelin Notebook
![Page 32: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/32.jpg)
Word Count and Histogram
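# t is an RDD of text lines; its creation is not shown on the slide
res = t.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
# five most frequent words: order by descending count
res.takeOrdered(5, key=lambda x: -x[1])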
![Page 33: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/33.jpg)
Zeppelin Notebooks
![Page 34: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/34.jpg)
Big Data Compute Service CE
![Page 35: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/35.jpg)
#4
Kafka
![Page 36: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/36.jpg)
Kafka
Partitioned, replicated commit log
Immutable log: messages with offset
0 1 2 3 4 … n
Producer
Consumer A
Consumer B
https://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
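The commit-log idea, where a producer appends and each consumer reads from its own offset, fits in a short Python sketch. `CommitLog` and `Consumer` are hypothetical toy classes for a single partition, not the Kafka client API; replication and persistence are left out.

```python
class CommitLog:
    """Append-only log; each message gets the next offset and is never changed."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Tracks its own position, so consumers read independently."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = CommitLog()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)            # the producer only ever appends

consumer_a = Consumer(log)
consumer_b = Consumer(log)
batch_a = consumer_a.poll()      # A reads everything available
batch_b = consumer_b.poll(1)     # B reads at its own pace
```

Reading does not remove anything from the log, which is why Consumer B can still read messages that Consumer A has already processed.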
![Page 37: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/37.jpg)
Brokers: Broker 1, Broker 2, Broker 3
Partition / Leader: Topic A (1), Topic A (2), Topic A (3)
Replication / Follower: Repl A (1), Repl A (2), Repl A (3)
Producer
State / HA: ZooKeeper (3 nodes)
![Page 38: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/38.jpg)
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
- 1 topic
- 1 partition
- Contains every article published since 1851
- Multiple producers / consumers
Example for Stream/Table Duality
![Page 39: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/39.jpg)
Kafka Clients

SDKs
High/low-level Kafka API
- OOTB: Java, Scala
- Confluent: Python, C, C++

Connect
Configuration only; integrate external systems
Confluent:
- HDFS sink
- JDBC source
- S3 sink
- Elasticsearch sink
- Plugin .jar file
- JDBC: change data capture (CDC)

Streams
Data in motion; stream/table duality
- Real-time data ingestion
- Microservices
- KSQL: SQL streaming engine for streaming ETL, anomaly detection, monitoring
- .jar file runs anywhere

REST
Lightweight
- Language agnostic
- Easy for mobile apps
- Easy to tunnel through FW etc.
![Page 40: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/40.jpg)
Oracle Event Hub Cloud Service
• PaaS: Managed Kafka 0.10.2
• Two deployment modes
– Basic (Broker and ZK on 1 node)
– Recommended (distributed)
• REST Proxy
– Separate server(s) running REST Proxy
![Page 41: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/41.jpg)
Event Hub
![Page 42: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/42.jpg)
Event Hub Service
![Page 43: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/43.jpg)
Ports
You must open ports to allow access for external clients
• Kafka Broker (from OPC connect string)
• ZooKeeper on port 2181
![Page 44: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/44.jpg)
Scaling
horizontal
vertical (up)
![Page 45: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/45.jpg)
Event Hub REST Interface
https://129.151.91.31:1080/restproxy/topics/a12345orderTopic
Service = Topic
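Publishing to a topic through the REST interface could look roughly like the sketch below. The URL is the one from the slide. The payload shape and the content type are assumptions based on the Confluent REST Proxy v2 conventions, and the order fields are invented. The request is only built, not sent.

```python
import json
import urllib.request

# URL taken from the slide; the payload format and content type are assumptions
# based on the Confluent REST Proxy v2 conventions.
TOPIC_URL = "https://129.151.91.31:1080/restproxy/topics/a12345orderTopic"


def build_produce_request(url, values):
    """Build (but do not send) a POST that publishes JSON values to a topic."""
    body = json.dumps({"records": [{"value": v} for v in values]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    )


request = build_produce_request(TOPIC_URL, [{"orderId": 1, "amount": 9.99}])
payload = json.loads(request.data)
```

Sending it would be a call to `urllib.request.urlopen(request)`. A real service would most likely also require credentials and a trusted certificate, neither of which the slide shows.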
![Page 46: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/46.jpg)
Interesting to Know
• Event Hub topics are prefixed with the ID domain
• With the Kafka CLI, topics with ID domain can be created
• Topics without ID domain are not shown in the OPC console
![Page 47: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/47.jpg)
#5
Conclusion
![Page 48: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/48.jpg)
TL;DR #bigData #openSource #OPC
• Open Source: entry point to Oracle Big Data world
• Low(er) setup times
• Check for resource usage & limits in Big Data OPC
• BDCS-CE: managed Hadoop, Hive, Spark + Event Hub: Kafka
• Attend a hands-on workshop!
• Next level: Oracle Big Data tools
@EdelweissK @FrankMunz
![Page 49: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/49.jpg)
www.linkedin.com/in/frankmunz/
www.munzandmore.com/blog
facebook.com/cloudcomputingbook
facebook.com/weblogicbook
@frankmunz
youtube.com/weblogicbook
-> more than 50 webcasts
Don’t be shy :)
![Page 51: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/51.jpg)
3 Membership Tiers
• Oracle ACE Director
• Oracle ACE
• Oracle ACE Associate
bit.ly/OracleACEProgram
500+ Technical Experts Helping Peers Globally
Connect:
Nominate yourself or someone you know: acenomination.oracle.com
@oracleace
Facebook.com/oracleaces
![Page 52: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka](https://reader031.fdocuments.net/reader031/viewer/2022022415/5a6773897f8b9a656a8b5325/html5/thumbnails/52.jpg)
Sign up for Free Trial
http://cloud.oracle.com