Turning Relational Database Tables into Spark Data Sources

Kuassi Mensah, Director Product Management · Jean de Lavarene, Director Development
Server Technologies
October 04, 2017


Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Speaker Bio - Kuassi Mensah

• Director of Product Management at Oracle

(i) Java integration with the Oracle database (JDBC, UCP, Java in the database)

(ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink and so on

(iii) JavaScript/Nashorn integration with the Oracle database (DB access, JS stored procedures, fluent JS)

• MS CS from the Programming Institute of University of Paris

• Frequent speaker at JavaOne, Oracle OpenWorld, Data Summit, Node Summit, and Oracle user groups (UKOUG, DOAG, OUGN, BGOUG, OUGF, GUOB, ArOUG, ORAMEX, Sangam, OTNYathra, China, Thailand, etc.)

• Author: Oracle Database Programming using Java and Web Services

• @kmensah, http://db360.blogspot.com/, https://www.linkedin.com/in/kmensah


Speaker Bio – Jean de Lavarene

• Director of Product Development at Oracle

(i) Java integration with the Oracle database (JDBC, UCP)

(ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink and so on

• MS CS from Ecole des Mines, Paris

• https://www.linkedin.com/in/jean-de-lavarene-707b8


Program Agenda

1. Requirements
2. Apache Spark
3. RDBMS Table as Spark Datasource
4. Performance, Scalability and Security Optimizations
5. Demo and Wrap-up



Requirements


Big Data Analytics and Requirements


• Goal: furnish actionable information to support business decision making.

• Example: "Which of our products got a rating of four stars or higher on social media in the last quarter?"

(Diagram: answering such questions combines Master Data in the RDBMS with Big Data in HDFS/NoSQL stores.)




Apache Spark


Apache Spark - Core Architecture


(Diagram: Spark Core runs on a cluster manager/scheduler (Mesos, YARN, or Standalone); Spark SQL, Spark Streaming, MLlib, and GraphX sit on top, with the DataFrame API and Data Source API connecting Spark SQL to external sources such as an RDBMS.)


Apache Spark Concepts


• Processes data in memory

• RDD: a fault-tolerance abstraction for in-memory data sharing
  - Immutable, partitioned datasets; user-controlled partitioning & persistence
  - Coarse-grained transformations of one RDD into another
  - Stores the graph of transformations ("lineage"), so an RDD can be reconstructed after a failure
  - Ensures exactly-once processing

• DataFrame: conceptually equivalent to a table in a relational database; allows running Spark SQL queries over its data.
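A minimal Scala sketch of these concepts (illustrative only; assumes a spark-shell session where sc is the SparkContext):

// Immutable, partitioned RDD with user-controlled partitioning (4 partitions).
val nums = sc.parallelize(1 to 1000, 4)
// Coarse-grained transformation: yields a new RDD and extends the lineage graph.
val squares = nums.map(n => n * n)
// User-controlled persistence of the intermediate result.
squares.persist()
// Action: triggers the computation; a lost partition is rebuilt from the lineage.
println(squares.reduce(_ + _))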


Apache Spark Summary


• Can process data in HDFS, HBase, Cassandra, Hive, Hadoop InputFormat.

• DataSource API: built-in support for Hive, Avro, JSON, Parquet and JDBC.

• Spark SQL: operates on a variety of data sources through the dataframe interface.

• Can run in Hadoop clusters through YARN or in Spark's standalone mode.

• Spark Streaming: for real-time streaming data processing; based on micro batching.
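For instance, the built-in formats are all reachable through the same reader interface; a hedged Scala sketch (the paths and join column are hypothetical):

// Unified Data Source API: one reader, different formats (paths are placeholders).
val people = sqlContext.read.json("hdfs:///data/people.json")
val events = sqlContext.read.parquet("hdfs:///data/events.parquet")
// DataFrames from different sources compose in a single Spark SQL query.
people.join(events, "id").show()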


Apache Spark Data Points


• Spark apps on Hadoop clusters can run up to 100 times faster in memory and 10 times faster on disk.

• Sorted 100 TB of data 3x faster than Hadoop MapReduce, on 1/10th of the machines.

• The largest known Spark cluster has 8,000 nodes.

• More than 1,000 organizations are using Spark in production.


How Apache Spark Works


• Create a dataset from external data, then apply parallel operations to it; all work is expressed as
  - transformations: creating new RDDs or transforming existing RDDs
  - actions: calling operations on RDDs

• The execution plan is a Directed Acyclic Graph (DAG) of operations

• Every Spark program and shell session works as follows:
  0. SparkContext: the main entry point to the Spark APIs
  1. Create some input RDDs from external data.
  2. Transform them to define new RDDs using transformations like filter().
  3. Ask Spark to persist any intermediate RDDs that will need to be reused.
  4. Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by the Spark executors.


Basic Spark Example (Python)


lines = sc.textFile("hdfs://...")                        # base RDD (HadoopRDD)
errors = lines.filter(lambda l: l.startswith("ERROR"))   # FilteredRDD
messages = errors.map(lambda l: l.split("\t")[2])        # MappedRDD
messages.persist()

(Diagram: lineage HDFS -> HadoopRDD -> FilteredRDD -> MappedRDD, replicated across partitions.)


Spark Workflow


(Diagram: the Spark Driver executes the user application and holds the Spark Context; RDD objects form a DAG of operators, which the DAG Scheduler splits into a TaskSet, deciding what to run; the Cluster Manager/Task Scheduler (Mesos, YARN, or Standalone) assigns tasks to Worker Nodes, where Executors run the tasks against cached data colocated with the Data Nodes.)


Spark Streaming



Streaming Data Processing: What, Where, When, How


• What results are calculated? -> transformations within the pipeline: sums, histograms, ML models

• Where in event time are results calculated? -> event-time windowing within the pipeline

• When in processing time are results materialized? -> the use of watermarks and triggers

• How do refinements of results relate? -> the type of accumulation used: discarding, accumulating & retracting

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102


Streaming Data Processing Concepts


Stream processing: analyze a fragment/window of a data stream

• Low latency: sub-second
• Windowing: fixed/tumbling, sliding, session
• Timestamp: event time, ingestion time, processing time (cf. Star Wars)
• Watermark: when a window is considered done (input completeness with respect to event times)
• In-order processing, out-of-order processing
• Punctuation: segments a stream into tuples; a control signal for operators
• Triggers: when to materialize the output of the computation = watermark | event time | processing time | punctuations
• Accumulation: disjoint or overlapping results observed for the same window
• Delivery guarantee: at least once, exactly once, end-to-end exactly once
• Event prioritization
• Backpressure support
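To make windows, watermarks, and triggers concrete, here is a hedged Structured Streaming sketch in Scala (Spark 2.x API; the built-in rate source is only a stand-in for a real event stream):

import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.streaming.Trigger

// Synthetic stream standing in for rating events (rate source emits timestamp, value).
val ratings = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
  .withColumnRenamed("timestamp", "eventTime")
  .withColumn("productId", col("value") % 5)

// Event-time (tumbling) window, with a watermark marking when a window is considered done.
val counts = ratings
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("productId"))
  .count()

// The trigger decides when results materialize; "update" mode emits refinements.
counts.writeStream
  .outputMode("update")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .format("console")
  .start()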




RDBMS Table as Spark Datasource


Spark SQL


A library including the following APIs and services:

• The Data Source API: a universal API for loading and saving structured data

• The DataFrame API: produces a distributed collection of data organized into named columns

• The SQL Interpreter and Optimizer

• The SQL Service: a Hive Thrift server

(Diagram: the DataFrame DSL, DataFrame API, and Data Source API sit between Spark SQL/HQL and external sources such as an RDBMS accessed over JDBC.)
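A short hedged sketch of the DataFrame API and the SQL interpreter working together (Spark 1.x-style sqlContext, as used elsewhere in this deck; df and the column names are assumed):

// Expose a DataFrame to the SQL interpreter as a temporary table.
df.registerTempTable("employees")   // Spark 1.x; createOrReplaceTempView in Spark 2.x
// The same data queried through SQL instead of the DataFrame DSL.
val highEarners = sqlContext.sql(
  "SELECT EMP_ID, JOB_TITLE FROM employees WHERE SALARY > 100000")
highEarners.show()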


Plain JDBC or RDBMS Connector


Scala:

val jdbcDF = sqlContext.read.format("jdbc")
  .options(Map("url" -> "jdbc:<rdbms>",
               "dbtable" -> "schema.tablename")).load()

Java:

Map<String, String> options = new HashMap<String, String>();
options.put("url", "jdbc:<rdbms>");
options.put("dbtable", "schema.tablename");
DataFrame df = sqlContext.read().format("jdbc")
  .options(options).load();

• Plain JDBC supports predicate pushdown and a basic partitioner
• DBMS/RDBMS connectors furnish more optimizations

(Diagram: JDBCRDD + schema = DataFrame, built from the table over JDBC.)
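As an illustration of predicate pushdown, a filter on a JDBC-backed DataFrame is typically translated into a WHERE clause that the database executes (a hedged sketch; the column name comes from the demo table used later):

// The filter is pushed into the generated SQL rather than evaluated in Spark.
val cheap = jdbcDF.filter("SALARY < 56000")
cheap.explain(true)   // the physical plan typically lists PushedFilters for the JDBC relation
cheap.show()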




Performance, Scalability and Security Optimizations


Optimizations in DBMS Connectors


Many DBMS/RDBMS vendors furnish their own connectors, implementing optimizations that are not available in Spark's plain JDBC data source.

Potential optimizations in the Oracle implementation

• Custom Partitioners or Splitters

• Partition pruning

• Fast JDBC types conversion

• Connection properties e.g., fetch size

• Connection caching

• Strong authentication, encryption and integrity
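Some of these can already be approximated with Spark's stock JDBC options; a hedged sketch using the standard fetchsize property (Spark 2.x; the Oracle-specific partitioner options shown later are product direction):

// A larger fetch size cuts JDBC round trips when scanning table partitions.
val tuned = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@localhost:1521/pdb1.localdomain")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "EmployeeData")
  .option("user", "hr").option("password", "hr")
  .option("fetchsize", "5000")   // standard Spark JDBC connection property
  .load()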


Partitioners


• Control the number of parallel tasks to run against the RDBMS table

• Provide efficient logical table partitioning

• Generate an RDBMS SQL query for each partition

(Diagram: the Oracle implementation splits the table into logical partitions over JDBC; each partition's query runs as a task on the worker-node executors, in parallel against their caches.)


Probable Partitioners for the Oracle Database


See the similar split definitions in http://www.oracle.com/technetwork/database/database-technologies/bdc/oracle-datasource-4-hadoop-3158076.pdf

• SINGLE_SPLITTER: no parallelism; the whole table as a single unit

• ROW_SPLITTER: create several splits based on row count

• BLOCK_SPLITTER: create several splits based on block count

• PARTITION_SPLITTER: align splits on the table partitions, i.e., one split per table partition


Dataframe Creation https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html


SQLContext: the entry point to all functionality in Spark SQL

$ spark-shell --jars file:///home/spark/od4s/jlib/ojdbc7.jar,file:///home/spark/od4s/jlib/ucp.jar,file:///home/spark/od4s/jlib/ ...

scala> val df = sqlContext.read.format("the oracle spark datasource")
         .option("url", "jdbc:oracle:thin:@localhost:1521/pdb1.localdomain")
         .option("driver", "oracle.jdbc.OracleDriver")
         .option("dbtable", "EmployeeData")
         .option("user", "hr").option("password", "hr")
         .option("oracle.jdbc.spark.partitionerType", "BLOCK_SPLITTER")
         .option("oracle.jdbc.spark.maxPartitions", "4").load()


Dataframe Operations


scala> df.load()       -> only creates the dataframe

scala> df.show         -> rows are fetched

scala> df.count()

scala> df.printSchema

scala> df.filter("EMP_ID = 79272").show

scala> df.first()

scala> df.select("EMP_ID", "JOB_TITLE").show

scala> df.filter("SALARY < 56000").show
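These calls reflect lazy evaluation: filter() and select() are transformations that only extend the query plan, while show(), count(), and first() are actions that actually fetch rows. A hedged sketch chaining both:

// Transformations compose lazily; the single action runs one (pushed-down) query.
val titles = df.filter("SALARY < 56000").select("EMP_ID", "JOB_TITLE")
titles.show()   // action: executes the filter and projection against the table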


Other Optimizations


• Fast JDBC types conversion

• Connection properties e.g., fetch size

• Connection caching

• Strong authentication, encryption and integrity




Demo and Wrap-up

