
Turning Relational Database Tables into Spark Data Sources

Kuassi Mensah (Director, Product Management) and Jean de Lavarene (Director, Development), Server Technologies, October 04, 2017


Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Speaker Bio - Kuassi Mensah

• Director of Product Management at Oracle

(i) Java integration with the Oracle database (JDBC, UCP, Java in the database)

(ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink and so on

(iii) JavaScript/Nashorn integration with the Oracle database (DB access, JS stored procedures, fluent JS)

• MS CS from the Programming Institute of University of Paris

• Frequent speaker at JavaOne, Oracle OpenWorld, Data Summit, Node Summit, and Oracle user groups (UKOUG, DOAG, OUGN, BGOUG, OUGF, GUOB, ArOUG, ORAMEX, Sangam, OTNYathra, China, Thailand, etc.)

• Author: Oracle Database Programming using Java and Web Services

• @kmensah, http://db360.blogspot.com/, https://www.linkedin.com/in/kmensah


Speaker Bio – Jean de Lavarene

• Director of Product Development at Oracle

(i) Java integration with the Oracle database (JDBC, UCP)

(ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink and so on

• MS CS from Ecole des Mines, Paris

• https://www.linkedin.com/in/jean-de-lavarene-707b8


Program Agenda

1. Requirements
2. Apache Spark
3. RDBMS Table as Spark Datasource
4. Performance, Scalability and Security Optimizations
5. Demo and Wrap-up


Requirements


Big Data Analytics and Requirements


• Goal: furnish actionable information to support business decision making.

• Example

“Which of our products got a rating of four stars or higher on social media in the last quarter?”

[Diagram: Master Data (RDBMS) + Big Data (HDFS, NoSQL)]


Apache Spark


Apache Spark - Core Architecture


[Architecture diagram: Spark Core runs on a cluster manager/scheduler (Mesos, YARN, or Standalone); Spark SQL, Spark Streaming, MLlib, and GraphX sit on top of Spark Core; the DataFrame API and Data Source API connect Spark SQL to external sources such as an RDBMS.]


Apache Spark Concepts


• Processes data in memory

• RDD: fault-tolerance abstraction for in-memory data sharing
  - Immutable, partitioned datasets; user-controlled partitioning and persistence
  - Coarse-grained transformations of one RDD into another
  - Stores the graph of transformations ("lineage") so an RDD can be reconstructed in case of failure
  - Ensures exactly-once processing

• DataFrame: conceptually equivalent to a table in a relational database; allows running Spark SQL queries over its data.
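To make that DataFrame bullet concrete, here is a minimal spark-shell sketch; the people.json file and its columns are hypothetical, and sqlContext is the Spark SQL entry point in the Spark 1.x shell used elsewhere in this deck:

scala> val people = sqlContext.read.json("people.json")            // schema is inferred from the JSON
scala> people.filter("age > 30").select("name", "age").show        // table-like DataFrame operations
scala> people.registerTempTable("people")                          // expose the DataFrame to Spark SQL
scala> sqlContext.sql("SELECT name FROM people WHERE age > 30").show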


Apache Spark Summary


• Can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.

• Data Source API: built-in support for Hive, Avro, JSON, Parquet, and JDBC (see the sketch after this list).

• Spark SQL: operates on a variety of data sources through the dataframe interface.

• Can run in Hadoop clusters through YARN or Spark's standalone mode

• Spark Streaming: for real-time streaming data processing; based on micro batching.
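A minimal sketch of the Data Source API bullet above: one read interface across formats. The file paths, JDBC URL, and table name are placeholders:

scala> val jsonDF    = sqlContext.read.json("hdfs:///data/events.json")        // built-in JSON source
scala> val parquetDF = sqlContext.read.parquet("hdfs:///data/events.parquet")  // built-in Parquet source
scala> val jdbcDF = sqlContext.read.format("jdbc")
         .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")             // placeholder URL
         .option("dbtable", "HR.EMPLOYEES")                                    // placeholder table
         .load()                                                               // built-in JDBC source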


Apache Spark Data Points


• Spark applications on Hadoop clusters can run up to 100 times faster than MapReduce in memory, and up to 10 times faster on disk.

• Sorted 100 TB of data 3x faster than Hadoop MapReduce, using one-tenth of the machines.

• The largest known Spark cluster has 8,000 nodes.

• More than 1,000 organizations are using Spark in production.


How Apache Spark Works


• Create a dataset from external data, then apply parallel operations to it; all work is expressed as
  - transformations: creating new RDDs or transforming existing RDDs
  - actions: calling operations on RDDs to compute a result

• Execution plan as a Directed Acyclic Graph (DAG) of operations

• Every Spark program and shell session works as follows:
  0. SparkContext: the main entry point of the Spark APIs
  1. Create some input RDDs from external data.
  2. Transform them to define new RDDs using transformations like filter().
  3. Ask Spark to persist any intermediate RDDs that will need to be reused.
  4. Launch actions such as count() and first() to kick off a parallel computation, which Spark optimizes and executes.


Basic Spark Example (Python)


lines = sc.textFile("hdfs://...")                        # read a text file from HDFS
errors = lines.filter(lambda l: l.startswith("ERROR"))   # keep only ERROR lines
messages = errors.map(lambda l: l.split('\t')[2])        # extract the third tab-separated field
messages.persist()                                       # keep the result in memory for reuse
messages.count()                                         # action: triggers the parallel computation

RDD lineage (from the slide's diagram): HDFS -> HadoopRDD -> FilteredRDD -> MappedRDD


Spark Workflow


[Workflow diagram: the Spark driver executes the user application. Its SparkContext builds RDD objects and a DAG of operators; the DAG Scheduler decides what to run and splits the graph into TaskSets. The cluster manager / task scheduler (Mesos, YARN, or Standalone) assigns tasks to worker nodes, where each executor caches data and runs its tasks against the local data node.]


Spark Streaming


Streaming Data Processing: What, Where, When, How


• What results are calculated? -> transformations within the pipeline: sums, histograms, ML models

• Where in event time are results calculated? -> event-time windowing within the pipeline

• When in processing time are results materialized? -> the use of watermarks and triggers

• How do refinements of results relate? -> type of accumulation used: discarding, accumulating & retracting

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102


Streaming Data Processing Concepts


Stream processing: analyze a fragment/window of a data stream
• Low latency: sub-second
• Windowing: fixed/tumbling, sliding, session
• Timestamp: event time, ingestion time, processing time (cf. Star Wars)
• Watermark: when a window is considered done (input completeness with respect to event times)
• In-order processing, out-of-order processing
• Punctuation: segments a stream into tuples; a control signal for operators
• Triggers: when to materialize the output of the computation (watermark, event time, processing time, punctuations); see the sketch after this list
• Accumulation: disjoint or overlapping results observed for the same window
• Delivery guarantee: at least once, exactly once, end-to-end exactly once
• Event prioritization
• Backpressure support
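One way to see the what/where/when/how knobs in code is Spark Structured Streaming (Spark 2.x), which is a different API from the DStream-based Spark Streaming mentioned earlier. The following is only an illustrative sketch, using the built-in "rate" test source; the application name and the chosen window, watermark, and trigger durations are arbitrary:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("streaming-concepts-sketch").getOrCreate()
import spark.implicits._

// "rate" is a built-in test source emitting rows with "timestamp" and "value" columns.
val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

val counts = events
  .withWatermark("timestamp", "30 seconds")        // watermark: how long to wait for late events
  .groupBy(window($"timestamp", "1 minute"))       // where: 1-minute event-time windows
  .count()                                         // what: a count per window

val query = counts.writeStream
  .outputMode("update")                            // how: emit only windows updated since the last trigger
  .trigger(Trigger.ProcessingTime("10 seconds"))   // when: materialize results every 10 seconds
  .format("console")
  .start()

query.awaitTermination()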


RDBMS Table as Spark Datasource


Spark SQL


A library including the following APIs and services

• The Data Source API: a universal API for loading and saving structured data

• The DataFrame API: a distributed collection of data organized into named columns

• The SQL Interpreter and Optimizer

• The SQL Service: a Hive Thrift server

[Diagram: the DataFrame DSL and Spark SQL/HQL sit on the DataFrame API, which in turn uses the Data Source API; the JDBC data source connects to an RDBMS.]


Plain JDBC or RDBMS Connector


Scala

  val jdbcDF = sqlContext.read.format("jdbc")
    .options(Map("url" -> "jdbc:<rdbms>",
                 "dbtable" -> "schema.tablename"))
    .load()

Java

  Map<String, String> options = new HashMap<String, String>();
  options.put("url", "jdbc:<rdbms>");
  options.put("dbtable", "schema.tablename");
  DataFrame df = sqlContext.read().format("jdbc")
    .options(options).load();

• Plain JDBC supports predicate pushdown and a basic partitioner (see the sketch below)
• DBMS/RDBMS connectors furnish more optimizations

JDBCRDD + schema = DataFrame
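As a point of comparison for the bullet above, a hedged sketch of the partitioning and pushdown support in Spark's plain JDBC data source; the connection URL, table, and column names are placeholders:

val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")   // placeholder connection URL
  .option("dbtable", "HR.EMPLOYEES")                          // placeholder table
  .option("partitionColumn", "EMP_ID")                        // numeric column used to split the read
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "4")                               // four parallel per-partition queries
  .load()

// Simple filters are pushed down to the database as WHERE clauses where possible.
df.filter("SALARY < 56000").count()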


Performance, Scalability and Security Optimizations


Optimizations in DBMS Connectors


Many DBMS/RDBMS vendors furnish their own connectors, in which they implement optimizations that are not available in Spark's plain JDBC data source.

Potential optimizations in the Oracle implementation

• Custom Partitioners or Splitters

• Partition pruning

• Fast JDBC type conversion

• Connection properties, e.g., fetch size

• Connection caching

• Strong authentication, encryption and integrity


Partitioners


• Control the number of parallel tasks to run against the RDBMS table

• Perform efficient logical table partitioning

• Generate an RDBMS SQL query for each partition

[Diagram: the Oracle implementation of the JDBC data source splits the table into logical partitions; each partition is read by a task running inside an executor (with its cache) on a worker node.]


Probable Partitioners for the Oracle Database


See the similar split definitions in http://www.oracle.com/technetwork/database/database-technologies/bdc/oracle-datasource-4-hadoop-3158076.pdf

• SINGLE_SPLITTER: no parallelism; the whole table is read as a single unit

• ROW_SPLITTER: creates several splits based on row count

• BLOCK_SPLITTER: creates several splits based on block count

• PARTITION_SPLITTER: aligns splits on the table partitions, i.e., one split per table partition


Dataframe Creation
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html


SQLContext: entry point into all functionality in Spark SQL

$ spark-shell --jars file:///home/spark/od4s/jlib/ojdbc7.jar,file:///home/spark/od4s/jlib/ucp.jar,file:///home/spark/od4s/jlib/...

scala> val df = sqlContext.read.format("the oracle spark datasource")
         .option("url", "jdbc:oracle:thin:@localhost:1521/pdb1.localdomain")
         .option("driver", "oracle.jdbc.OracleDriver")
         .option("dbtable", "EmployeeData")
         .option("user", "hr").option("password", "hr")
         .option("oracle.jdbc.spark.partitionerType", "BLOCK_SPLITTER")
         .option("oracle.jdbc.spark.maxPartitions", "4")
         .load()


Dataframe Operations


scala> // the load() call above only creates the DataFrame; rows are fetched when an action runs
scala> df.show                                    // rows are fetched and displayed
scala> df.count()
scala> df.printSchema
scala> df.filter("EMP_ID = 79272").show
scala> df.first()
scala> df.select("EMP_ID", "JOB_TITLE").show
scala> df.filter("SALARY < 56000").show
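A hedged aside on the predicate pushdown mentioned earlier: explain() on a filtered, JDBC-backed DataFrame should show the filter being handled by the data source scan rather than inside Spark (exact plan output varies by Spark version, so none is reproduced here):

scala> // Predicate pushdown check: the physical plan should list the filter with the JDBC scan
scala> df.filter("SALARY < 56000").explain()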


Other Optimizations


• Fast JDBC type conversion

• Connection properties, e.g., fetch size (see the sketch after this list)

• Connection caching

• Strong authentication, encryption and integrity
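To illustrate the connection-properties bullet above with the plain Spark JDBC data source: fetchsize is a standard option in recent Spark releases, while the Oracle connector may expose its own property names; the URL and table below are placeholders. This is a minimal sketch, not the connector's definitive API:

val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")   // placeholder connection URL
  .option("dbtable", "HR.EMPLOYEES")                          // placeholder table
  .option("fetchsize", "5000")                                // rows fetched per JDBC round trip
  .load()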


Demo and Wrap-up
