
Turning Relational Database Tables into Spark Data Sources

Kuassi Mensah (Director, Product Management) and Jean de Lavarene (Director, Development), Server Technologies, October 04, 2017


Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Speaker Bio - Kuassi Mensah

• Director of Product Management at Oracle

(i) Java integration with the Oracle database (JDBC, UCP, Java in the database)

(ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink and so on

(iii) JavaScript/Nashorn integration with the Oracle database (DB access, JS stored procedures, fluent JS)

• MS CS from the Programming Institute of University of Paris

• Frequent speaker at JavaOne, Oracle OpenWorld, Data Summit, Node Summit, and Oracle user groups (UKOUG, DOAG, OUGN, BGOUG, OUGF, GUOB, ArOUG, ORAMEX, Sangam, OTNYathra, China, Thailand, etc.)

• Author: Oracle Database Programming using Java and Web Services

• @kmensah, http://db360.blogspot.com/, https://www.linkedin.com/in/kmensah


Speaker Bio – Jean de Lavarene

• Director of Product Development at Oracle

(i) Java integration with the Oracle database (JDBC, UCP)

(ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink and so on

• MS CS from Ecole des Mines, Paris

• https://www.linkedin.com/in/jean-de-lavarene-707b8


Program Agenda

1. Requirements
2. Apache Spark
3. RDBMS Table as Spark Datasource
4. Performance, Scalability and Security Optimizations
5. Demo and Wrap-up


Requirements


Big Data Analytics and Requirements


• Goal: furnish actionable information to support business decision making.

• Example

“Which of our products got a rating of four stars or higher on social media in the last quarter?”

[Diagram: Master Data (RDBMS) + Big Data (HDFS, NoSQL)]


Apache Spark


Apache Spark - Core Architecture


[Architecture diagram: Spark Core runs on a cluster manager/scheduler (Mesos, YARN, or Standalone); Spark SQL, Spark Streaming, MLlib, and GraphX sit on top of Spark Core; the DataFrame API and Data Source API connect Spark SQL to external sources such as an RDBMS.]


Apache Spark Concepts


• Processes data in memory

• RDD: fault-tolerance abstraction for in-memory data sharing
  - Immutable, partitioned datasets; user-controlled partitioning and persistence
  - Coarse-grained transformations of one RDD into another
  - Stores the graph of transformations ("lineage") so an RDD can be reconstructed in case of failure
  - Ensures exactly-once processing

• DataFrame: conceptually equivalent to a table in a relational database; allows running Spark SQL queries over its data.
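To make that DataFrame bullet concrete, here is a minimal spark-shell sketch; the people.json file and its columns are hypothetical, and sqlContext is the Spark SQL entry point in the Spark 1.x shell used elsewhere in this deck:

scala> val people = sqlContext.read.json("people.json")            // schema is inferred from the JSON
scala> people.filter("age > 30").select("name", "age").show        // table-like DataFrame operations
scala> people.registerTempTable("people")                          // expose the DataFrame to Spark SQL
scala> sqlContext.sql("SELECT name FROM people WHERE age > 30").show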


Apache Spark Summary


• Can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.

• Data Source API: built-in support for Hive, Avro, JSON, Parquet, and JDBC (see the sketch after this list).

• Spark SQL: operates on a variety of data sources through the dataframe interface.

• Can run in Hadoop clusters through YARN or Spark's standalone mode

• Spark Streaming: for real-time streaming data processing; based on micro batching.
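A minimal sketch of the Data Source API bullet above: one read interface across formats. The file paths, JDBC URL, and table name are placeholders:

scala> val jsonDF    = sqlContext.read.json("hdfs:///data/events.json")        // built-in JSON source
scala> val parquetDF = sqlContext.read.parquet("hdfs:///data/events.parquet")  // built-in Parquet source
scala> val jdbcDF = sqlContext.read.format("jdbc")
         .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")             // placeholder URL
         .option("dbtable", "HR.EMPLOYEES")                                    // placeholder table
         .load()                                                               // built-in JDBC source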


Apache Spark Data Points


• Spark applications on Hadoop clusters can run up to 100 times faster than MapReduce in memory, and up to 10 times faster on disk.

• Sorted 100 TB of data 3x faster than Hadoop MapReduce, using one-tenth of the machines.

• The largest known Spark cluster has 8,000 nodes.

• More than 1,000 organizations are using Spark in production.


How Apache Spark Works


• Create a dataset from external data, then apply parallel operations to it; all work is expressed as
  - transformations: creating new RDDs or transforming existing RDDs
  - actions: calling operations on RDDs to compute a result

• Execution plan as a Directed Acyclic Graph (DAG) of operations

• Every Spark program and shell session works as follows:
  0. SparkContext: the main entry point of the Spark APIs
  1. Create some input RDDs from external data.
  2. Transform them to define new RDDs using transformations like filter().
  3. Ask Spark to persist any intermediate RDDs that will need to be reused.
  4. Launch actions such as count() and first() to kick off a parallel computation, which Spark optimizes and executes.


Basic Spark Example (Python)


lines = sc.textFile("hdfs://...")                        # read a text file from HDFS
errors = lines.filter(lambda l: l.startswith("ERROR"))   # keep only ERROR lines
messages = errors.map(lambda l: l.split('\t')[2])        # extract the third tab-separated field
messages.persist()                                       # keep the result in memory for reuse
messages.count()                                         # action: triggers the parallel computation

RDD lineage (from the slide's diagram): HDFS -> HadoopRDD -> FilteredRDD -> MappedRDD


Spark Workflow


[Workflow diagram: the Spark driver executes the user application. Its SparkContext builds RDD objects and a DAG of operators; the DAG Scheduler decides what to run and splits the graph into TaskSets. The cluster manager / task scheduler (Mesos, YARN, or Standalone) assigns tasks to worker nodes, where each executor caches data and runs its tasks against the local data node.]


Spark Streaming


Streaming Data Processing: What, Where, When, How


• What results are calculated? -> transformations within the pipeline: sums, histograms, ML models

• Where in event time are results calculated? -> event-time windowing within the pipeline

• When in processing time are results materialized? -> the use of watermarks and triggers

• How do refinements of results relate? -> type of accumulation used: discarding, accumulating & retracting

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102


Streaming Data Processing Concepts


Stream processing: analyze a fragment/window of a data stream
• Low latency: sub-second
• Windowing: fixed/tumbling, sliding, session
• Timestamp: event time, ingestion time, processing time (cf. Star Wars)
• Watermark: when a window is considered done (input completeness with respect to event times)
• In-order processing, out-of-order processing
• Punctuation: segments a stream into tuples; a control signal for operators
• Triggers: when to materialize the output of the computation (watermark, event time, processing time, punctuations); see the sketch after this list
• Accumulation: disjoint or overlapping results observed for the same window
• Delivery guarantee: at least once, exactly once, end-to-end exactly once
• Event prioritization
• Backpressure support
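One way to see the what/where/when/how knobs in code is Spark Structured Streaming (Spark 2.x), which is a different API from the DStream-based Spark Streaming mentioned earlier. The following is only an illustrative sketch, using the built-in "rate" test source; the application name and the chosen window, watermark, and trigger durations are arbitrary:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("streaming-concepts-sketch").getOrCreate()
import spark.implicits._

// "rate" is a built-in test source emitting rows with "timestamp" and "value" columns.
val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

val counts = events
  .withWatermark("timestamp", "30 seconds")        // watermark: how long to wait for late events
  .groupBy(window($"timestamp", "1 minute"))       // where: 1-minute event-time windows
  .count()                                         // what: a count per window

val query = counts.writeStream
  .outputMode("update")                            // how: emit only windows updated since the last trigger
  .trigger(Trigger.ProcessingTime("10 seconds"))   // when: materialize results every 10 seconds
  .format("console")
  .start()

query.awaitTermination()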


RDBMS Table as Spark Datasource


Spark SQL


A library including the following APIs and services

• The Data Source API: a universal API for loading and saving structured data

• The DataFrame API: a distributed collection of data organized into named columns

• The SQL Interpreter and Optimizer

• The SQL Service: a Hive Thrift server

[Diagram: the DataFrame DSL and Spark SQL/HQL sit on the DataFrame API, which in turn uses the Data Source API; the JDBC data source connects to an RDBMS.]


Plain JDBC or RDBMS Connector


Scala

  val jdbcDF = sqlContext.read.format("jdbc")
    .options(Map("url" -> "jdbc:<rdbms>",
                 "dbtable" -> "schema.tablename"))
    .load()

Java

  Map<String, String> options = new HashMap<String, String>();
  options.put("url", "jdbc:<rdbms>");
  options.put("dbtable", "schema.tablename");
  DataFrame df = sqlContext.read().format("jdbc")
    .options(options).load();

• Plain JDBC supports predicate pushdown and a basic partitioner (see the sketch below)
• DBMS/RDBMS connectors furnish more optimizations

JDBCRDD + schema = DataFrame
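As a point of comparison for the bullet above, a hedged sketch of the partitioning and pushdown support in Spark's plain JDBC data source; the connection URL, table, and column names are placeholders:

val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")   // placeholder connection URL
  .option("dbtable", "HR.EMPLOYEES")                          // placeholder table
  .option("partitionColumn", "EMP_ID")                        // numeric column used to split the read
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "4")                               // four parallel per-partition queries
  .load()

// Simple filters are pushed down to the database as WHERE clauses where possible.
df.filter("SALARY < 56000").count()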


Performance, Scalability and Security Optimizations


Optimizations in DBMS Connectors


Many DBMS/RDBMS vendors furnish their own connectors, in which they implement optimizations that are not available in Spark's plain JDBC data source.

Potential optimizations in the Oracle implementation

• Custom Partitioners or Splitters

• Partition pruning

• Fast JDBC type conversion

• Connection properties, e.g., fetch size

• Connection caching

• Strong authentication, encryption and integrity


Partitioners


• Control the number of parallel tasks to run against the RDBMS table

• Perform efficient logical table partitioning

• Generate an RDBMS SQL query for each partition

[Diagram: the Oracle implementation of the JDBC data source splits the table into logical partitions; each partition is read by a task running inside an executor (with its cache) on a worker node.]


Probable Partitioners for the Oracle Database


See the similar split definitions in http://www.oracle.com/technetwork/database/database-technologies/bdc/oracle-datasource-4-hadoop-3158076.pdf

• SINGLE_SPLITTER: no parallelism; the whole table is read as a single unit

• ROW_SPLITTER: creates several splits based on row count

• BLOCK_SPLITTER: creates several splits based on block count

• PARTITION_SPLITTER: aligns splits on the table partitions, i.e., one split per table partition


Dataframe Creation
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html


SQLContext: entry point into all functionality in Spark SQL

$ spark-shell --jars file:///home/spark/od4s/jlib/ojdbc7.jar,file:///home/spark/od4s/jlib/ucp.jar,file:///home/spark/od4s/jlib/...

scala> val df = sqlContext.read.format("the oracle spark datasource")
         .option("url", "jdbc:oracle:thin:@localhost:1521/pdb1.localdomain")
         .option("driver", "oracle.jdbc.OracleDriver")
         .option("dbtable", "EmployeeData")
         .option("user", "hr").option("password", "hr")
         .option("oracle.jdbc.spark.partitionerType", "BLOCK_SPLITTER")
         .option("oracle.jdbc.spark.maxPartitions", "4")
         .load()


Dataframe Operations


scala> // the load() call above only creates the DataFrame; rows are fetched when an action runs
scala> df.show                                    // rows are fetched and displayed
scala> df.count()
scala> df.printSchema
scala> df.filter("EMP_ID = 79272").show
scala> df.first()
scala> df.select("EMP_ID", "JOB_TITLE").show
scala> df.filter("SALARY < 56000").show
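A hedged aside on the predicate pushdown mentioned earlier: explain() on a filtered, JDBC-backed DataFrame should show the filter being handled by the data source scan rather than inside Spark (exact plan output varies by Spark version, so none is reproduced here):

scala> // Predicate pushdown check: the physical plan should list the filter with the JDBC scan
scala> df.filter("SALARY < 56000").explain()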


Other Optimizations


• Fast JDBC type conversion

• Connection properties, e.g., fetch size (see the sketch after this list)

• Connection caching

• Strong authentication, encryption and integrity
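To illustrate the connection-properties bullet above with the plain Spark JDBC data source: fetchsize is a standard option in recent Spark releases, while the Oracle connector may expose its own property names; the URL and table below are placeholders. This is a minimal sketch, not the connector's definitive API:

val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")   // placeholder connection URL
  .option("dbtable", "HR.EMPLOYEES")                          // placeholder table
  .option("fetchsize", "5000")                                // rows fetched per JDBC round trip
  .load()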


Demo and Wrap-up
