Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C* Summit 2016


Transcript of Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C* Summit 2016

Page 1

Jim Hatcher

Using Spark to Load Oracle Data into Cassandra

Page 2

1 Introduction

2 Problem Description

3 Methods of loading external data into Cassandra

4 What is Spark?

5 Lessons Learned

6 Resources

Page 3

Introduction

Page 4

At IHS Markit, we take raw data and turn it into information and insights for our customers.

• Automotive Systems (CarFax)
• Defense Systems (Jane’s)
• Oil & Gas Systems (Petra, Kingdom)
• Maritime Systems
• Technology Systems (Electronic Parts Database, Root Metrics)
• Chemicals
• Financial Systems (Wall Street on Demand)
• Lots of others

Page 5

Problem Description

Page 6

[Architecture diagram: back-end "Factory" applications write to an Oracle cluster; customer-facing applications are served by a Cassandra + Solr cluster. Load files and data updates flow from Oracle into the customer-facing Cassandra cluster, where Spark runs alongside Cassandra to load them.]

Page 7

Methods of loading external data into Cassandra

Page 8

Methods of Loading External Data into C*

1. CQL COPY command
2. Sqoop
3. Write a custom program that uses the CQL driver (a minimal sketch appears below)
4. Write a Spark program
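For comparison, here is a minimal sketch of option 3: a hand-rolled loader written in Scala against the DataStax Java driver. The host, keyspace, table, and column names are hypothetical, and the Oracle read side is left as a stub:

import com.datastax.driver.core.Cluster

object ManualCqlLoader {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("cassandra-host").build()
    val session = cluster.connect("keyspace_name")

    // Prepare the insert once, then bind it for every row.
    val insert = session.prepare(
      "INSERT INTO table_name (field_one, field_two) VALUES (?, ?)")

    // Stand-in for whatever JDBC iteration pulls rows out of Oracle.
    val rowsFromOracle: Iterator[(String, java.lang.Integer)] = Iterator.empty

    rowsFromOracle.foreach { case (fieldOne, fieldTwo) =>
      session.execute(insert.bind(fieldOne, fieldTwo))
    }

    cluster.close()
  }
}

The rest of the talk focuses on option 4, where Spark handles the parallelism for you.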

Page 9

What is Spark?

Page 10

What is Spark?

Spark is a processing framework designed to work with distributed data.

“up to 100X faster than MapReduce” according to spark.apache.org

Used in any ecosystem where you want to work with distributed data (Hadoop, Cassandra, etc.)

Includes other specialized libraries:
• Spark SQL
• Spark Streaming
• MLlib
• GraphX

Spark Facts
• Conceptually similar to: MapReduce
• Written in: Scala
• Supported by: Databricks
• Supported languages: Scala, Java, Python, R
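To make the MapReduce comparison concrete, here is a minimal word-count-style sketch in Scala. It assumes a SparkContext named sc (as you would have in spark-shell or dse spark) and a hypothetical input path:

// Classic map/reduce in Spark: map each word to a count of 1, then reduce by key.
val counts = sc.textFile("/data/sample.txt")
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)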

Page 11

Spark Architecture

[Diagram: the Spark Client hosts the Driver and its SparkContext. (1) The Driver requests resources from the Spark Master, (2) the Master allocates resources on the Spark Workers, (3) the Workers start Executors, and (4) the Executors perform the computation.]

Credit: https://academy.datastax.com/courses/ds320-analytics-apache-spark/introduction-spark-architecture

Page 12

Spark with Cassandra

Credit: https://academy.datastax.com/courses/ds320-analytics-apache-spark/introduction-spark-architecture

[Diagram: a Spark Worker runs alongside each node (A, B, C) of the Cassandra cluster, coordinated by the Spark Master and driven by the Spark Client.]

Spark Cassandra Connector – open source, supported by DataStax
https://github.com/datastax/spark-cassandra-connector
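As a quick illustration of what the connector gives you, here is a minimal sketch of reading and writing a Cassandra table from Spark. It assumes a SparkContext named sc; the keyspace, table, and column names are hypothetical:

import com.datastax.spark.connector._

// Read a Cassandra table as an RDD of CassandraRow objects.
val rdd = sc.cassandraTable("keyspace_name", "table_name")
println(rdd.count())

// Reshape and write back to another (hypothetical) table.
rdd.map(row => (row.getString("field_one"), row.getInt("field_two")))
   .saveToCassandra("keyspace_name", "other_table_name", SomeColumns("field_one", "field_two"))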

Page 13

ETL (Extract, Transform, Load)

• Extract: Spark creates an RDD or DataFrame from the data source(s) (text file, JDBC data source, Cassandra, Hadoop)
• Transform: Spark map functions reshape the data
• Load: Spark saves the results to Cassandra

Page 14

Typical Code - Example

// Extract
val extracted = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> "table_name"
    )
  )
  .load()

// Transform
val transformed = extracted.map { dbRow =>
  (dbRow.getAs[String]("field_one"), dbRow.getAs[Integer]("field_two"))
}

// Load (saveToCassandra comes from import com.datastax.spark.connector._)
transformed.saveToCassandra("keyspace_name", "table_name", SomeColumns("field_one", "field_two"))

Page 15

Lessons Learned

Page 16

Lesson #1 - Spark SQL handles Oracle NUMBER fields with no precision incorrectly
https://issues.apache.org/jira/browse/SPARK-10909

All of our Oracle tables have ID fields defined as NUMBER(15,0).

When you use Spark SQL to access an Oracle table, Spark's JDBC support reads the table metadata and creates a DataFrame with the proper schema. If your schema has a NUMBER(*, 0) field defined in it, you get an "Overflowed precision" error.

This is fixed in Spark 1.5, but we don’t have the option of adopting a newer version of Spark since we’re using the Spark bundled with DSE 4.8.6 (which uses Spark 1.4.2). We were able to fix this by stealing the fix from the Spark 1.5 code and applying it to our code (yay, open source!).

At some point, we’ll update to DSE 5.* which uses Spark 1.6, and we can remove this code.

Page 17

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

private case object OracleDialect extends JdbcDialect {

  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    // Handle NUMBER fields that have no precision/scale in special way
    // because JDBC ResultSetMetaData converts this to 0 precision and -127 scale
    // For more details, please see
    // https://github.com/apache/spark/pull/8780#issuecomment-145598968
    // and
    // https://github.com/apache/spark/pull/8780#issuecomment-144541760
    if (sqlType == Types.NUMERIC && size == 0) {
      // This is sub-optimal as we have to pick a precision/scale in advance whereas the data
      // in Oracle is allowed to have different precision/scale for each value.
      Option(DecimalType(38, 10))
    } else {
      None
    }
  }

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR2(255)", java.sql.Types.VARCHAR))
    case _ => None
  }
}

org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(OracleDialect)

Lesson #1 - Spark SQL handles Oracle NUMBER fields with no precision incorrectly

Page 18

Lesson #2 - Spark SQL doesn’t handle timeuuid fields correctly
https://issues.apache.org/jira/browse/SPARK-10501

Spark SQL doesn’t know what to do with a timeuuid field when reading a table from Cassandra. This is an issue since we commonly use timeuuid columns in our Cassandra key structures.

We got this error: scala.MatchError: UUIDType (of class org.apache.spark.sql.cassandra.types.UUIDType$)

We are able to work around this issue by casting the timeuuid values to strings, like this:


val dataFrameRaw = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "table_name", "keyspace" -> "keyspace_name"))
  .load()

import org.apache.spark.sql.types.StringType  // needed for the cast below

val dataFrameFixed = dataFrameRaw
  .withColumn("timeuuid_column", dataFrameRaw("timeuuid_column").cast(StringType))

Page 19

Lesson #3 – Careful when generating ID fields

We created an RDD:

val baseRdd = rddInsertsAndUpdates.map { dbRow =>
  val keyColumn = {
    if (!dbRow.isNullAt(dbRow.fieldIndex("timeuuid_key_column"))) {
      dbRow.getAs[String]("timeuuid_key_column")
    } else {
      UUIDs.timeBased().toString
    }
  }

  // do some further processing

  (keyColumn, …other values)
}

Then, we took that RDD and transformed it into another RDD:

val invertedIndexTable = baseRdd.map { entry =>
  (entry.getString("timeuuid_key_column"), entry.getString("fld_1"))
}

Then we wrote them both to C*, like this:

baseRdd.saveToCassandra("keyspace_name", "table_name", SomeColumns("key_column", "fld_1", "fld_2"))

invertedIndexTable.saveToCassandra("keyspace_name", "inverted_index_table_name", SomeColumns("key_column", "fld_1"))

Page 20

Lesson #3 – Careful when generating ID fields

We kept finding that the IDs in the inverted index table were slightly different from the IDs in the base table.

Without cache(), each saveToCassandra action recomputed baseRdd from scratch, so the UUIDs.timeBased() call generated a fresh ID on each pass. We fixed this by adding a cache() to our first RDD:


val baseRdd = rddInsertsAndUpdates.map { dbRow =>
  val keyColumn = {
    if (!dbRow.isNullAt(dbRow.fieldIndex("timeuuid_key_column"))) {
      dbRow.getAs[String]("timeuuid_key_column")
    } else {
      UUIDs.timeBased().toString
    }
  }

  // do some further processing

  (keyColumn, …other values)
}.cache()

Page 21

Lesson #4 – You can only return an RDD of tuples if the tuple has 22 or fewer fields.


It’s pretty common in Spark to return an RDD of tuples:

val myNewRdd = myOldRdd.map { dbRow =>
  val firstName = dbRow.getAs[String]("FirstName")
  val lastName = dbRow.getAs[String]("LastName")
  val calcField1 = dbRow.getAs[Integer]("SomeColumn") * 3.14

  (firstName, lastName, calcField1)
}

This works great until you go past 22 fields in your tuple, and then Scala throws an error. (Newer versions of Scala relax the 22-field limit in some places, such as case classes, but it’s still a problem for tuples in our version of Scala.)

Page 22

Lesson #4 – You can only return an RDD of tuples if the tuple has 22 or fewer fields.


You can fix this by returning an RDD of CassandraRows instead (especially if your goal is to save them to C*):

val myNewRdd = myOldRdd.map { dbRow =>
  val firstName = dbRow.getAs[String]("FirstName")
  val lastName = dbRow.getAs[String]("LastName")
  val calcField1 = dbRow.getAs[Integer]("SomeColumn") * 3.14

  val allValues = IndexedSeq[AnyRef](firstName, lastName, calcField1)
  val allColumnNames = Array[String]("first_name", "last_name", "calc_field_1")

  new CassandraRow(allColumnNames, allValues)
}

Page 23

Lesson #5 – Getting a JDBC dataframe based on a SQL statement is not very intuitive.

To get a dataframe from a JDBC source, you do this:

val exampleDataFrame = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> "table_name"
    )
  )
  .load()

You would think there would be a version of this call that lets you pass in a SQL statement but there is not.

However, when Spark builds the query from the above syntax, all it does is prepend your dbtable value with “SELECT * FROM”.

Page 24

Lesson #5 – Getting a JDBC dataframe based on a SQL statement is not very intuitive.

So, the workaround is to do this:

val sql =
  "( " +
  "  SELECT S.* " +
  "  FROM Sample S " +
  "  WHERE ID = 11111 " +
  "  ORDER BY S.SomeField " +
  ")"

val exampleDataFrame = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> sql
    )
  )
  .load()

You’re effectively doing this in Oracle:

SELECT * FROM (
  SELECT S.*
  FROM Sample S
  WHERE ID = 11111
  ORDER BY S.SomeField
)

Page 25

Lesson #6 – Creating a partitioned JDBC dataframe is not very intuitive.

The code to get a JDBC dataframe looks like this:

val basePartitionedOracleData = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> "ExampleTable",
      "lowerBound" -> "1",
      "upperBound" -> "10000",
      "numPartitions" -> "10",
      "partitionColumn" -> "KeyColumn"
    )
  )
  .load()

The last four arguments in that map are there for the purpose of getting a partitioned dataset.  If you pass any of them, you have to pass all of them.

Page 26

Lesson #6 – Creating a partitioned JDBC dataframe is not very intuitive.

When you pass these additional arguments in, here’s what it does:

It builds a SQL statement template in the format “SELECT * FROM {tableName} WHERE {partitionColumn} >= ? AND {partitionColumn} < ?”

It sends {numPartitions} statements to the DB engine. If you supplied these values: {dbTable=ExampleTable, lowerBound=1, upperBound=10,000, numPartitions=10, partitionColumn=KeyColumn}, it would create these ten statements:

SELECT * FROM ExampleTable WHERE KeyColumn >= 1 AND KeyColumn < 1001
SELECT * FROM ExampleTable WHERE KeyColumn >= 1001 AND KeyColumn < 2001
SELECT * FROM ExampleTable WHERE KeyColumn >= 2001 AND KeyColumn < 3001
SELECT * FROM ExampleTable WHERE KeyColumn >= 3001 AND KeyColumn < 4001
SELECT * FROM ExampleTable WHERE KeyColumn >= 4001 AND KeyColumn < 5001
SELECT * FROM ExampleTable WHERE KeyColumn >= 5001 AND KeyColumn < 6001
SELECT * FROM ExampleTable WHERE KeyColumn >= 6001 AND KeyColumn < 7001
SELECT * FROM ExampleTable WHERE KeyColumn >= 7001 AND KeyColumn < 8001
SELECT * FROM ExampleTable WHERE KeyColumn >= 8001 AND KeyColumn < 9001
SELECT * FROM ExampleTable WHERE KeyColumn >= 9001 AND KeyColumn < 10001

And then it would put the results of each of those queries in its own partition in Spark.
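A rough sketch in Scala of how those WHERE clauses are derived (simplified from Spark's JDBC partitioning behavior; in practice the first and last partitions are left open-ended, so rows below lowerBound or above upperBound are not dropped):

// Illustrative only: reproduces the stride arithmetic, not Spark's exact code.
def partitionWhereClauses(column: String,
                          lowerBound: Long,
                          upperBound: Long,
                          numPartitions: Int): Seq[String] = {
  val stride = upperBound / numPartitions - lowerBound / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lowerBound + i * stride
    val hi = lo + stride
    val lowerClause = if (i == 0) None else Some(s"$column >= $lo")                // first partition: open below
    val upperClause = if (i == numPartitions - 1) None else Some(s"$column < $hi") // last partition: open above
    Seq(lowerClause, upperClause).flatten.mkString(" AND ")
  }
}

// partitionWhereClauses("KeyColumn", 1, 10000, 10) yields clauses like
// "KeyColumn < 1001", "KeyColumn >= 1001 AND KeyColumn < 2001", ..., "KeyColumn >= 9001".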

Page 27

Lesson #7 – JDBC *really* wants you to get your partitioned dataframe using a sequential ID column.

In our Oracle database, we don’t have sequential integer ID columns.

We tried to get around that by doing a query like this and passing “ROW_NUMBER” as the partitioning column:

SELECT ST.*, ROW_NUMBER() OVER (ORDER BY ID_FIELD ASC) AS ROW_NUMBER
FROM SourceTable ST
WHERE …my criteria
ORDER BY ID_FIELD

But this didn’t perform well.

We ended up creating a processing table:

CREATE TABLE SPARK_ETL_BATCH_SEQUENCE (
  SEQ_ID   NUMBER(15,0) NOT NULL,  -- populated by an auto-incrementing sequence
  BATCH_ID NUMBER(15,0) NOT NULL,
  ID_FIELD NUMBER(15,0) NOT NULL
)

Page 28

Lesson #7 – JDBC *really* wants you to get your partitioned dataframe using a sequential ID column.

We insert into this table first:

INSERT INTO SPARK_ETL_BATCH_SEQUENCE ( BATCH_ID, ID_FIELD )  -- SEQ_ID gets auto-populated
SELECT {NextBatchID}, ID_FIELD
FROM SourceTable ST
WHERE …my criteria
ORDER BY ID_FIELD

Then, we join to it in the query where we get our data, which provides us with a sequential ID:

SELECT ST.*, SEQ.SEQ_ID
FROM SourceTable ST
INNER JOIN SPARK_ETL_BATCH_SEQUENCE SEQ ON ST.ID_FIELD = SEQ.ID_FIELD
WHERE …my criteria
ORDER BY ID_FIELD

And we use SEQ_ID as our partitioning column. Despite needing to talk to Oracle twice, this approach has proven much faster than having uneven partitions.
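Putting lessons 5 through 7 together, here is a hedged sketch of what the final partitioned read might look like. The batch filter, bounds, and partition count are hypothetical placeholders:

// Wrap the join in a subquery (lesson 5) and partition on the dense SEQ_ID column (lessons 6 and 7).
val sql =
  "( SELECT ST.*, SEQ.SEQ_ID " +
  "  FROM SourceTable ST " +
  "  INNER JOIN SPARK_ETL_BATCH_SEQUENCE SEQ ON ST.ID_FIELD = SEQ.ID_FIELD " +
  "  WHERE SEQ.BATCH_ID = 42 " +  // hypothetical batch filter
  ")"

val partitionedOracleData = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> sql,
      "partitionColumn" -> "SEQ_ID",
      "lowerBound" -> "1",
      "upperBound" -> "1000000",  // assumed: roughly the number of rows in the batch
      "numPartitions" -> "10"
    )
  )
  .load()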

Page 29

Resources

Page 30

Resources


Spark
• Books
  • Learning Spark: http://shop.oreilly.com/product/0636920028512.do

Scala (knowing Scala will really help you progress in Spark)
• Functional Programming Principles in Scala (videos): https://www.youtube.com/user/afigfigueira/playlists?shelf_id=9&view=50&sort=dd
• Books: http://www.scala-lang.org/documentation/books.html

Spark and Cassandra
• DataStax Academy: http://academy.datastax.com/
  • Self-paced course: DS320: DataStax Enterprise Analytics with Apache Spark – really good!
  • Tutorials
• Spark Cassandra Connector website (lots of good examples): https://github.com/datastax/spark-cassandra-connector