Turning Relational Database Tables into Spark Data Sources
Kuassi Mensah, Director, Product Management · Jean de Lavarene, Director, Development · Server Technologies · October 4, 2017
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Speaker Bio - Kuassi Mensah
• Director of Product Management at Oracle
(i) Java integration with the Oracle database (JDBC, UCP, Java in the database)
(ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink and so on
(iii) JavaScript/Nashorn integration with the Oracle database (DB access, JS stored procedures, fluent JS)
• MS CS from the Programming Institute of University of Paris
• Frequent speaker at JavaOne, Oracle OpenWorld, Data Summit, Node Summit, and Oracle user groups (UKOUG, DOAG, OUGN, BGOUG, OUGF, GUOB, ArOUG, ORAMEX, Sangam, OTNYathra, China, Thailand, etc.)
• Author: Oracle Database Programming using Java and Web Services
• @kmensah, http://db360.blogspot.com/, https://www.linkedin.com/in/kmensah
Speaker Bio – Jean de Lavarene
• Director of Product Development at Oracle
(i) Java integration with the Oracle database (JDBC, UCP)
(ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink and so on
• MS CS from Ecole des Mines, Paris
• https://www.linkedin.com/in/jean-de-lavarene-707b8
Program Agenda
1. Requirements
2. Apache Spark
3. RDBMS Table as Spark Datasource
4. Performance, Scalability and Security Optimizations
5. Demo and Wrap-up
Requirements
Big Data Analytics and Requirements
• Goal: furnish actionable information to support business decision making.
• Example: "Which of our products got a rating of four stars or higher on social media in the last quarter?"
(Diagram: Master Data in the RDBMS combined with Big Data in HDFS/NoSQL)
Apache Spark
Apache Spark - Core Architecture
(Architecture diagram: Spark SQL, Spark Streaming, MLlib and GraphX run on Spark Core; the DataFrame and Data Source APIs connect Spark SQL to external sources such as an RDBMS; the cluster manager/scheduler is Mesos, YARN, or Standalone.)
Apache Spark Concepts
• Processes data in memory
• RDD: a fault-tolerance abstraction for in-memory data sharing
- Immutable, partitioned datasets; user-controlled partitioning and persistence
- Coarse-grained transformations of one RDD into another
- Stores the graph of transformations ("lineage"), so an RDD can be reconstructed in case of failure
- Ensures exactly-once processing
• DataFrame: conceptually equivalent to a table in a relational database; allows running Spark SQL queries over its data (see the sketch below).
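A minimal sketch in the spark-shell, assuming the standard Spark core API (the data and variable names are illustrative), showing how lazy, coarse-grained transformations build a lineage that only an action executes:

val nums = sc.parallelize(1 to 1000, 4)   // immutable RDD with 4 partitions
val squares = nums.map(n => n * n)        // transformation: recorded in the lineage, not run yet
val even = squares.filter(_ % 2 == 0)     // another transformation
even.persist()                            // user-controlled persistence for reuse
println(even.count())                     // action: triggers execution of the DAG
// If a partition is lost, Spark recomputes it from the recorded lineage.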
Apache Spark Summary
• Can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.
• DataSource API: built-in support for Hive, Avro, JSON, Parquet and JDBC (sketch below).
• Spark SQL: operates on a variety of data sources through the dataframe interface.
• Can run in Hadoop clusters through YARN or Spark's standalone mode.
• Spark Streaming: for real-time streaming data processing; based on micro-batching.
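To illustrate the unified DataSource API, a minimal sketch (the HDFS paths and the join column are hypothetical):

val jsonDF = sqlContext.read.json("hdfs:///data/events.json")          // built-in JSON source
val parquetDF = sqlContext.read.parquet("hdfs:///data/events.parquet") // built-in Parquet source
jsonDF.join(parquetDF, "id").show()  // DataFrames from different sources compose freely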
Apache Spark Data Points
• Spark apps on Hadoop clusters can run up to 100 times faster in memory and 10 times faster on disk than Hadoop MapReduce.
• Spark sorted 100 TB of data 3x faster than Hadoop MapReduce on one-tenth of the machines.
• The largest known Spark cluster has 8,000 nodes.
• More than 1,000 organizations are using Spark in production.
How Apache Spark Works
• Create a dataset from external data, then apply parallel operations to it; all work is expressed as
- transformations: creating new RDDs or transforming existing RDDs
- actions: calling operations on RDDs
• Execution plan as a Directed Acyclic Graph (DAG) of operations
• Every Spark program and shell session works as follows:
0. Spark Context: the main entry point into the Spark APIs
1. Create input RDDs from external data.
2. Transform them to define new RDDs using transformations like filter().
3. Ask Spark to persist any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by the Spark executors.
Basic Spark Example (Python)
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda line: line.startswith("ERROR"))
messages = errors.map(lambda line: line.split("\t")[2])
messages.persist()
(Lineage diagram: HDFS → HadoopRDD → FilteredRDD → MappedRDD)
Spark Workflow
(Workflow diagram: the Spark Driver executes the user application; the Spark Context turns RDD objects into a DAG of operators, and the DAG Scheduler decides what to run by splitting the graph into TaskSets. The Cluster Manager/Task Scheduler (Mesos, YARN, or Standalone) dispatches tasks to Executors on the Worker Nodes, each with its own cache, co-located with the Data Nodes.)
Spark Streaming
Streaming Data Processing: What, Where, When, How
• What results are calculated? -> transformations within the pipeline: sums, histograms, ML models
• Where in the event time are results calculated? -> event-time windowing within the pipeline:
• When in processing time are results materialized? -> the use of watermarks and triggers
• How do refinements of results relate? -> type of accumulation used: discarding, accumulating & retracting
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Streaming Data Processing Concepts
Stream processing: analyze a fragment/window of a data stream (see the sketch after this list)
• Low latency: sub-second
• Windowing: fixed/tumbling, sliding, session
• Timestamp: event time, ingestion time, processing time (cf. Star Wars)
• Watermark: when a window is considered done (input completeness with respect to event times)
• In-order processing, out-of-order processing
• Punctuation: segments a stream into tuples; a control signal for operators
• Triggers: when to materialize the output of the computation (watermark, event time, processing time, or punctuations)
• Accumulation: disjoint or overlapping results observed for the same window
• Delivery guarantee: at least once, exactly once, end-to-end exactly once
• Event prioritization
• Backpressure support
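These concepts map onto Spark's Structured Streaming API (Spark 2.x); a minimal sketch (the socket source, host/port, and window durations are illustrative assumptions) showing the what/where/when/how knobs:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("WindowedCounts").getOrCreate()
import spark.implicits._

// Illustrative source: lines arriving on a socket, stamped with an arrival time.
// (A real pipeline would carry the event time inside the records.)
val events = spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", "9999")
  .load()
  .select(current_timestamp().as("eventTime"), $"value".as("word"))

val counts = events
  .withWatermark("eventTime", "10 minutes")            // When: window considered done
  .groupBy(window($"eventTime", "5 minutes"), $"word") // Where: 5-minute tumbling windows
  .count()                                             // What: per-window counts

counts.writeStream
  .outputMode("update")   // How: refinements update previously emitted results
  .format("console")
  .start()
  .awaitTermination()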
RDBMS Table as Spark Datasource
Spark SQL
A library including the following APIs and services:
• The Data Source API: a universal API for loading and saving structured data
• The DataFrame API: produces a distributed collection of data organized into named columns
• The SQL Interpreter and Optimizer
• The SQL Service: a Hive Thrift server
(Diagram: the DataFrame DSL and Spark SQL/HQL sit on top of the DataFrame API, which uses the Data Source API to reach an RDBMS over JDBC.)
Plain JDBC or RDBMS Connector
Java:
Map<String, String> options = new HashMap<String, String>();
options.put("url", "jdbc:<rdbms>");
options.put("dbtable", "schema.tablename");
DataFrame df = sqlContext.read().format("jdbc")
  .options(options).load();

Scala:
val jdbcDF = sqlContext.read.format("jdbc")
  .options(Map("url" -> "jdbc:<rdbms>",
    "dbtable" -> "schema.tablename")).load()

• Plain JDBC supports predicate pushdown and a basic partitioner (see the sketch below)
• DBMS/RDBMS connectors furnish more optimizations

(Diagram: Table → JDBC → JDBCRDD + schema = DataFrame)
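A sketch of that basic partitioner using the standard Spark JDBC options (the column name and bounds are illustrative): Spark issues one range query per partition and pushes filters down into the generated SQL.

val partitionedDF = sqlContext.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:<rdbms>",
    "dbtable" -> "schema.tablename",
    "partitionColumn" -> "EMP_ID",  // numeric column to range-partition on (assumed)
    "lowerBound" -> "1",
    "upperBound" -> "100000",
    "numPartitions" -> "4"))
  .load()

// Predicate pushdown: this filter is compiled into the per-partition SQL's WHERE clause.
partitionedDF.filter("SALARY < 56000").show()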
Performance, Scalability and Security Optimizations
Optimizations in DBMS Connectors
Many DBMS/RDBMS vendors furnish their own connectors, implementing optimizations not available in Spark's plain JDBC data source.
Potential optimizations in the Oracle implementation
• Custom Partitioners or Splitters
• Partition pruning
• Fast JDBC types conversion
• Connection properties e.g., fetch size
• Connection caching
• Strong authentication, encryption and integrity
Partitioners
• Control the number of parallel tasks that run against the RDBMS table
• Efficient logical table partitioning
• Generate RDBMS SQL queries for each partition
(Diagram: the Oracle implementation splits the table logically so that tasks in each Worker Node's Executor read their own partition over JDBC in parallel.)
Probable Partitioners for the Oracle Database
See the similar split definitions in http://www.oracle.com/technetwork/database/database-technologies/bdc/oracle-datasource-4-hadoop-3158076.pdf
• SINGLE_SPLITTER: no parallelism; the whole table as a single unit
• ROW_SPLITTER: create several splits based on row count
• BLOCK_SPLITTER: create several splits based on block count
• PARTITION_SPLITTER: align splits on the table partitions, i.e., one split per table partition
Dataframe Creation (https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html)
SQLContext: the entry point into all functionality in Spark SQL

$ spark-shell --jars file:///home/spark/od4s/jlib/ojdbc7.jar,file:///home/spark/od4s/jlib/ucp.jar,file:///home/spark/od4s/jlib/...

scala> val df = sqlContext.read.format("the oracle spark datasource")
  .option("url", "jdbc:oracle:thin:@localhost:1521/pdb1.localdomain")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "EmployeeData")
  .option("user", "hr").option("password", "hr")
  .option("oracle.jdbc.spark.partitionerType", "BLOCK_SPLITTER")
  .option("oracle.jdbc.spark.maxPartitions", "4").load()
Dataframe Operations
scala> df.load() -> only creates the dataframe
scala> df.show -> rows are fetched
scala> df.count()
scala> df.printSchema
scala> df.filter("EMP_ID = 79272").show
scala> df.first()
scala> df.select("EMP_ID", "JOB_TITLE").show
scala> df.filter("SALARY < 56000").show
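The same DataFrame can also be queried with Spark SQL; a brief sketch using the Spark 1.x temp-table API (createOrReplaceTempView in Spark 2.x; the view name is illustrative):

scala> df.registerTempTable("employees")
scala> sqlContext.sql("SELECT EMP_ID, JOB_TITLE FROM employees WHERE SALARY < 56000").show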
Other Optimizations
• Fast JDBC types conversion
• Connection properties, e.g., fetch size (see the sketch below)
• Connection caching
• Strong authentication, encryption and integrity
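For the fetch-size point, a sketch using the standard Spark JDBC fetchsize option (the value shown is illustrative); a larger fetch size reduces network round trips when pulling large result sets:

scala> val tunedDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@localhost:1521/pdb1.localdomain")
  .option("dbtable", "EmployeeData")
  .option("user", "hr").option("password", "hr")
  .option("fetchsize", "5000")  // rows fetched per round trip
  .load()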
Demo and Wrap-up