Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Open Source Big Data in OPC

Edelweiss Kammermann, Frank Munz, Java One 2017

munz & more #2


About Me

→ Computer Engineer, BI and Data Integration Specialist

→ Over 20 years of Consulting and Project Management experience in Oracle technology.

→ Co-founder and Vice President of Uruguayan Oracle User Group (UYOUG)

→ Director of Community of LAOUC

→ Head of BI Team CMS at ITConvergence

→ Writer and frequent speaker at international conferences:

→ Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, Rittman Mead BI Forum

→ Oracle ACE Director


Uruguay

Dr. Frank Munz

• Founded munz & more in 2007

• 17 years Oracle Middleware, Cloud, and Distributed Computing

• Consulting and High-End Training

• Wrote two Oracle WLS and one Cloud book

#1

Hadoop


What is Big Data?

→ Volume: The high amount of data

→ Variety: The wide range of different data formats and schemas. Unstructured and semi-structured data

→ Velocity: The speed at which data is created or consumed

→ Oracle added another V to this definition

→ Value: Data has intrinsic value, but it must be discovered.


What is Oracle Big Data Cloud Compute Edition?

→ Big Data Platform that integrates Oracle Big Data solution with Open Source tools

→ Fully Elastic

→ Integrated with other PaaS services such as Database Cloud Service, MySQL Cloud Service, Event Hub Cloud Service

→ Access, Data and Network Security

→ REST access to all the functionality


Big Data Cloud Service – Compute Edition (BDCS-CE)


BDCS-CE Notebook: Interactive Analysis

→ Apache Zeppelin Notebook (version 0.7) to interactively work with data


What is Hadoop?

→ An open source software platform for distributed storage and processing

→ Manages huge volumes of unstructured data

→ Parallel processing of large data sets

→ Highly scalable

→ Fault-tolerant

→ Two main components:

→ HDFS: Hadoop Distributed File System for storing information

→ MapReduce: programming framework that processes information


Hadoop Components: HDFS

→ Stores the data on the cluster

→ NameNode: block registry

→ DataNode: the block containers themselves

→ HDFS cartoon by Mvarshney
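
The NameNode/DataNode split above can be pictured with a small in-memory toy model. This is only a sketch: the class name, the tiny block size, the replication factor, and the node names are all hypothetical and chosen for illustration; real HDFS works with far larger blocks and much more machinery.

```python
from itertools import cycle


class ToyNameNode:
    """Block registry: records which DataNodes hold which block of which file."""

    def __init__(self, datanodes, block_size=8, replication=2):
        # Each "DataNode" is just a dict of block_id -> bytes in this toy model.
        self.datanodes = {name: {} for name in datanodes}
        self.block_size = block_size      # toy value, far smaller than real HDFS
        self.replication = replication    # number of copies kept per block
        self.registry = {}                # filename -> [(block_id, [node, ...])]
        self._next_node = cycle(datanodes)

    def put(self, filename, data):
        """Split data into blocks and store each block on several DataNodes."""
        blocks = []
        for start in range(0, len(data), self.block_size):
            block_id = f"{filename}#{start // self.block_size}"
            targets = [next(self._next_node) for _ in range(self.replication)]
            for node in targets:
                self.datanodes[node][block_id] = data[start:start + self.block_size]
            blocks.append((block_id, targets))
        self.registry[filename] = blocks

    def get(self, filename, failed=()):
        """Reassemble a file, skipping DataNodes that are marked as failed."""
        out = b""
        for block_id, nodes in self.registry[filename]:
            alive = [node for node in nodes if node not in failed]
            out += self.datanodes[alive[0]][block_id]
        return out


namenode = ToyNameNode(["dn1", "dn2", "dn3"])
namenode.put("notes.txt", b"hello distributed file system")
# The file can still be read when one DataNode is unavailable.
print(namenode.get("notes.txt", failed={"dn1"}))
```

Because every block is stored on two different nodes, losing a single node does not lose data, which is the fault-tolerance property mentioned earlier.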


Hadoop Components: MapReduce

→ Retrieves data from HDFS

→ A MapReduce program is composed of

→ Map() method: performs filtering and sorting of the <key, value> inputs

→ Reduce() method: summarizes the <key, value> pairs provided by the Mappers

→ Code can be written in many languages (Perl, Python, Java etc.)
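
To make the Map()/Reduce() split concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code: the function names are made up for illustration, and the shuffle step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce(): summarize all values that share one key."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data in the cloud", "open source big data"])
print(counts["big"], counts["cloud"])  # 2 1
```

On a cluster, many mappers and reducers would run these functions in parallel on different blocks of the input; the sketch only shows the data flow.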


MapReduce Example


Code Example


#2

Hive


What is Hive?

→ Open source data warehouse software on top of Apache Hadoop

→ Analyze and query data stored in HDFS

→ Structure the data into tables

→ Tools for simple ETL

→ SQL-like queries (HiveQL)

→ Procedural language with HPL-SQL

→ Metadata storage in an RDBMS


Hadoop & Hive Demo

#3

Spark

Revisited: Map Reduce I/O

Source: Hadoop Application Architecture book

Spark

• Orders of magnitude faster than M/R

• Higher level Scala, Java or Python API

• Standalone, in Hadoop, or Mesos

• Principle: Run an operation on all data

-> "Spark is the new MapReduce"

• See also: Apache Storm, etc.

• Uses RDDs, or Dataframes, or Datasets

https://stackoverflow.com/questions/31508083/difference-between-dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark

https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

RDDs

Resilient Distributed Datasets

Where do they come from?

Collection of data grouped into named columns. Supports text, JSON, Apache Parquet, sequence.

Read in: HDFS, local FS, S3, HBase

Parallelize existing collection

Transform other RDD -> RDDs are immutable

Lazy Evaluation

Transformations (nothing is executed): map(), flatMap(), reduceByKey(), groupByKey()

Actions (execution): collect(), count(), first(), takeOrdered(), saveAsTextFile(), …
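
The difference between transformations and actions can be mimicked in a few lines of plain Python. The LazyDataset class below is a hypothetical toy, not the Spark API: it only records transformations and runs them when an action such as collect() or count() is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps          # recorded transformations, not yet executed

    def map(self, func):
        # Transformation: returns a new dataset and executes nothing.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, func):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("filter", func),))

    def _run(self):
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return items

    def collect(self):
        # Action: only now is the recorded pipeline executed.
        return list(self._run())

    def count(self):
        # Action: executes the pipeline and counts the results.
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)      # side effect makes it visible when work happens
    return x * x


ds = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
print(len(calls))        # 0: nothing has been executed yet
print(ds.collect())      # [0, 4, 16]
```

Each transformation returns a new object instead of changing the old one, which mirrors the immutability of RDDs.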

http://spark.apache.org/docs/2.1.1/programming-guide.html

Transformations

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.

groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
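
What these transformations compute can be shown on ordinary Python lists. The helper functions below are illustrative stand-ins with made-up names; they run eagerly on a single machine and only mirror the results, not the distributed execution, of the Spark operations.

```python
from collections import defaultdict
from functools import reduce


def flat_map(func, items):
    """Like map, but each input item may yield 0 or more output items."""
    return [out for item in items for out in func(item)]


def group_by_key(pairs):
    """(K, V) pairs -> {K: [V, ...]}"""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_by_key(func, pairs):
    """(K, V) pairs -> {K: V}, folding the values of each key with func(V, V) -> V."""
    return {key: reduce(func, values) for key, values in group_by_key(pairs).items()}


lines = ["to be or", "not to be"]
words = flat_map(lambda line: line.split(" "), lines)
pairs = [(word, 1) for word in words]          # equivalent of a plain map()
print(group_by_key(pairs)["to"])               # [1, 1]
print(reduce_by_key(lambda a, b: a + b, pairs))
```

The last call reproduces the word-count pattern: flatMap to words, map to (word, 1) pairs, then reduceByKey to add up the ones.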

Spark Demo

Apache Zeppelin Notebook

Word Count and Histogram

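# t is assumed to be an RDD of text lines loaded earlier in the notebook
res = t.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Five most frequent words: order by descending count
res.takeOrdered(5, key = lambda x: -x[1])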

Zeppelin Notebooks

Big Data Compute Service CE

#4

Kafka

Kafka

Partitioned, replicated commit log

Immutable log: messages with offset 0 1 2 3 4 … n

Producer, Consumer A, Consumer B

https://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
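
The idea of an immutable log read by several consumers at their own offsets can be sketched with a toy model. ToyLog and ToyConsumer are hypothetical names for this illustration; no Kafka client library is involved, and partitioning and replication are left out.

```python
class ToyLog:
    """Append-only log for one partition; each message gets the next offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1       # offset of the new message

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class ToyConsumer:
    """Each consumer tracks its own position; reading never removes messages."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = ToyLog()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

consumer_a, consumer_b = ToyConsumer(log), ToyConsumer(log)
print(consumer_a.poll(2))   # ['order-1', 'order-2']
print(consumer_b.poll())    # ['order-1', 'order-2', 'order-3']
print(consumer_a.poll())    # ['order-3']
```

Since consuming only advances a per-consumer offset, Consumer B sees every message even though Consumer A has already read some of them.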

Broker 1, Broker 2, Broker 3

Partition/Leader: Topic A(1), Topic A(2), Topic A(3)

Replication/Follower: Repl A(1), Repl A(2), Repl A(3)

Producer

State/HA: ZooKeeper (3 nodes)

Example for Stream/Table Duality

https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/

- 1 topic
- 1 partition
- Contains every article published since 1851
- Multiple producers / consumers

Kafka Clients

SDKs
- OOTB: Java, Scala
- Confluent: Python, C, C++
- High/low level Kafka API

Connect
- Confluent: HDFS sink, JDBC source, S3 sink, Elasticsearch sink
- Plugin .jar file
- JDBC: change data capture (CDC)
- Configuration only
- Integrate external systems

Streams
- Real-time data ingestion
- Microservices
- KSQL: SQL streaming engine for streaming ETL, anomaly detection, monitoring
- .jar file runs anywhere
- Data in motion
- Stream/Table duality

REST
- Language agnostic
- Easy for mobile apps
- Easy to tunnel through FW etc.
- Lightweight

Oracle Event Hub Cloud Service

• PaaS: Managed Kafka 0.10.2

• Two deployment modes

– Basic (Broker and ZK on 1 node)

– Recommended (distributed)

• REST Proxy

– Separate server(s) running REST Proxy

Event Hub

Event Hub Service

Ports

You must open ports to allow access for external clients

• Kafka Broker (from OPC connect string)

• Zookeeper with port 2181

Scaling

horizontal, vertical (up)

Event Hub REST Interface

https://129.151.91.31:1080/restproxy/topics/a12345orderTopic

Service = Topic
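
A message could be published to such a topic URL with a plain HTTP POST. The sketch below only builds the request and does not send it. It assumes that the Event Hub REST proxy accepts the message format of the Confluent REST Proxy (a "records" list and the v1 JSON media type); that assumption, the payload, and the credentials are not taken from the slides and are purely illustrative.

```python
import base64
import json
import urllib.request


def build_produce_request(base_url, topic, values, user, password):
    """Build (but do not send) a POST that publishes JSON messages to a topic."""
    body = json.dumps({"records": [{"value": v} for v in values]}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/topics/{topic}",
        data=body,
        method="POST",
        headers={
            # Assumed media type, following the Confluent REST Proxy v1 convention.
            "Content-Type": "application/vnd.kafka.json.v1+json",
            "Authorization": f"Basic {token}",
        },
    )


request = build_produce_request(
    "https://129.151.91.31:1080/restproxy",   # host and path as shown on the slide
    "a12345orderTopic",
    [{"orderId": 1, "item": "book"}],          # hypothetical payload
    "user", "secret",                          # hypothetical credentials
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Because the interface is plain HTTPS, the same request can be issued from any language or from a mobile app, which is the main appeal of the REST client option.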

Interesting to Know

• Event Hub topics are prefixed with ID domain

• Topics with the ID domain prefix can also be created with the Kafka CLI

• Topics without ID domain are not shown in OPC console

#5

Conclusion

TL;DR #bigData #openSource #OPC

- Open Source: entry point to Oracle Big Data world
- Low(er) setup times
- Check for resource usage & limits in Big Data OPC
- BDCS-CE: managed Hadoop, Hive, Spark + Event Hub: Kafka
- Attend a hands-on workshop!
- Next level: Oracle Big Data tools

@EdelweissK

@FrankMunz

www.linkedin.com/in/frankmunz/

www.munzandmore.com/blog

facebook.com/cloudcomputingbook

facebook.com/weblogicbook

@frankmunz

youtube.com/weblogicbook

-> more than 50 web casts

Don’t be shy :)

email:[email protected]

Twitter:@EdelweissK

3 Membership Tiers

• Oracle ACE Director

• Oracle ACE

• Oracle ACE Associate

bit.ly/OracleACEProgram

500+ Technical Experts Helping Peers Globally

Connect:

Nominate yourself or someone you know: acenomination.oracle.com

@oracleace

Facebook.com/oracleaces

[email protected]

Sign up for Free Trial

http://cloud.oracle.com
