High concurrency, Low latency analytics using Spark/Kudu

High concurrency, Low latency analytics using Spark/Kudu Chris George

Transcript of High concurrency, Low latency analytics using Spark/Kudu

Page 1

High concurrency, Low latency analytics using Spark/Kudu
Chris George

Page 2

Who is this guy?

Page 3

Tech we will talk about:
Kudu
Spark
Spark Job Server
Spark Thrift Server

Page 4

What was the problem?

Page 5

Apache Kudu

Page 6

History of Kudu

Page 7

Columnar vs other types of storage

Page 8

What if you could update Parquet/ORC easily?

Page 9

HDFS vs Kudu vs HBase/Cassandra/xyz

Page 10

Kudu is purely a storage engine, accessible through an API

Page 11

To add SQL queries / more advanced SQL-like operations

Page 12

Impala vs Spark

Page 13

Kudu Slack Channel

Page 14

Master and Tablets in Kudu

Page 15

Range and Hash Partitioning

Page 16

Number of cores = number of partitions

Page 17

Partitioning can be on 1+ columns
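
As a rough illustration of partitioning on more than one column: a minimal sketch of creating a hash-partitioned table through the Kudu/Spark integration. The table, column names, bucket count, and replica count are hypothetical, and the exact client API may vary by version:

import scala.collection.JavaConverters._
import org.apache.kudu.client.CreateTableOptions
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.types._

val kuduContext = new KuduContext("kudu.master:7051")

// Hypothetical schema; the first three columns form the composite primary key
val schema = StructType(Seq(
  StructField("host", StringType, nullable = false),
  StructField("metric", StringType, nullable = false),
  StructField("time", LongType, nullable = false),
  StructField("value", DoubleType, nullable = true)))

// Hash-partition rows across 8 tablets using two of the key columns
val options = new CreateTableOptions()
  .addHashPartitions(List("host", "metric").asJava, 8)
  .setNumReplicas(3)

kuduContext.createTable("metrics", schema, Seq("host", "metric", "time"), options)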

Page 18

Composite primary keys: it is important to filter on the key columns in order (A, B, C); i.e. don't scan on just B if possible, as it will be expensive.
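
For instance, with key columns (A, B, C), a rough sketch of the difference (column names hypothetical):

// Fast: the filter covers a prefix of the composite key, so Kudu can bound the scan
df.filter("A = 1 and B >= 10").show()

// Slow: skipping A means Kudu cannot use the key ordering and scans far more data
df.filter("B >= 10").show()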

Page 19

Scans on a tablet are single-threaded, but you can run 200+ scans on a tablet concurrently.

Page 20

To find your scale... load up a single tablet with inserts, updates, and deletes concurrently, until it no longer meets your performance requirements.

Page 21

Partitioning is extremely important

Page 22

Kudu client is Java
Python connectors coming
C++ client

Page 23

The Java client loops through tablets, but not concurrently.

Page 24

But you can code the multithreading yourself, or contribute it.

Page 25

Predicates on any column

Page 26

Summary of why Kudu?

Page 27

Predicates/projections on any column, very quickly at scale

Page 28

Spark

Page 29

Spark DataSource API:

Page 30

Reads CSV:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("cars.csv")

Writes CSV:
val selectedData = df.select("year", "model")
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")

Page 31

But these are often simplified as:
val parquetDataframe = sqlContext.read.parquet("people.parquet")
parquetDataframe.write.parquet("people.parquet")

Page 32

I wrote the current version of the Kudu DataSource/Spark integration.

Page 33

There are limitations with the DataSource API.

Page 34

Save modes for the DataSource API:
append, overwrite, ignore, error

Page 35

append = insert

Page 36

overwrite = truncate + insert

Page 37

ignore = create if not exists

Page 38

error = throw exception if the data exists
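
In code, the save mode is just set on the writer; a minimal sketch reusing the CSV writer from the earlier slide, with the mode string being the only new piece:

// Same write, different semantics depending on the chosen save mode
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .mode("overwrite") // or "append", "ignore", "error"
  .save("newcars.csv")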

Page 39

What if I want to update? Nope

Page 40

What about deletes? Not individually

Page 41

So how do you support updates/deletes?

Page 42

By not using the DataSource API... but I'll talk more about that in a minute.

Page 43

Immutability of dataframes

Page 44

So why use datasource api?

Page 45

Because it's smarter than it appears for reads.

Page 46

Pushdown predicates and projections

Page 47

Pushdown predicates:
import org.apache.kudu.spark.kudu._

val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .kudu
df.filter("id >= 5").show()
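
A quick way to see what actually got pushed down is the physical plan; a hedged sketch (the plan's output format varies across Spark versions):

// Look for the pushed filters in the scan node of the printed plan
df.filter("id >= 5").explain()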

Page 48

The DataSource has knowledge of what can be pushed down to the underlying store and what cannot.

Page 49

Why am I telling you this?

Page 50

Because if you want things to be fast, you need to know what is not pushed down!

Page 52

Did you notice what's missing?
EqualTo
GreaterThan
GreaterThanOrEqual
LessThan
LessThanOrEqual
And

Page 53

"OR"So spark will use it's optimizer to run two separate kudu scans for the OR"IN" is coming very soon, nuanced

performance details

Page 54

By the way, if you register the dataframe as a temp table in Spark, "select * from someDF where id >= 5" will also do pushdowns.


Page 56

Things like select * from someDF where lower(name) = "joe" will pull the entire table into memory; probably a bad thing.
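
One workaround, if you control the data, is to store the column already lowercased so a plain equality can push down; this is an assumption, not something from the talk:

// Pushed down: simple equality on a stored, already-lowercased column
df.filter("name = 'joe'").show()

// Not pushed down: the function call forces Spark to pull rows back and evaluate
df.filter("lower(name) = 'joe'").show()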

Page 57

Projections will also be pushed down to Kudu, so you're not retrieving the entire row:
df.select("id", "name")
select id, name from someDF

Page 58

I looked at lots of existing DataSources to design Kudu's.

Page 59

How does Kudu do updates/deletes in Spark?

Page 60

// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051")

Page 61

// Insert data
kuduContext.insertRows(df, "test_table")

// Delete data
kuduContext.deleteRows(filteredDF, "test_table")

// Upsert data
kuduContext.upsertRows(df, "test_table")

// Update data
import sqlContext.implicits._
val alteredDF = df.select($"id", $"count" + 1)
kuduContext.updateRows(alteredDF, "test_table")

http://kudu.apache.org/docs/developing.html

Page 62

Upserts are handled server side for performance

Upserts can also be handled through the DataSource API:

df.write
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "test_table"))
  .mode("append")
  .kudu

Page 63

You can also create, check the existence of, and delete tables through the API.
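
A hedged sketch of those table-management calls on KuduContext (method names per the Kudu/Spark integration docs; check your version):

import org.apache.kudu.spark.kudu.KuduContext

val kuduContext = new KuduContext("kudu.master:7051")

// Existence check and drop both go through the Kudu master;
// table creation was sketched earlier with CreateTableOptions
if (kuduContext.tableExists("test_table")) {
  kuduContext.deleteTable("test_table")
}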

Page 64

Additional notes:
The Kudu DataSource currently works with Spark 1.x
The next release will support both 1.x and 2.x
It's being improved on a regular basis

Page 65

The number of partitions on the dataframe is related to how many tablets/partitions match the filter.

Partition scans are parallel and have locality awareness in Spark.

Page 66

Be sure to set Spark's locality wait to something small for low latency (3 seconds is the Spark default).
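
For example (the property name is standard Spark config; the 0 value is an assumption for a latency-sensitive deployment, not a recommendation from the talk):

import org.apache.spark.SparkConf

// Don't hold tasks waiting for a data-local slot; schedule immediately
val conf = new SparkConf().set("spark.locality.wait", "0")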

Page 67

Spark Job Server (SJS)

Page 68

Created for low latency jobs on Spark.

Page 69

Persistent contexts reduce the runtime of a hello-world type of job from 1 second to 10 ms.

Page 70

REST-based API to:
Run jobs
Create contexts
Check the status of a job, both async and sync

Page 71

Creating a context calls spark-submit (in separate-JVM mode).

It uses Akka to communicate between the REST layer and the Spark driver.

Page 72

To create a persistent context you need:
CPU cores + memory footprint
A name to reference it by
The factory to use for the context, i.e. HiveContextFactory vs SqlContextFactory

Page 73

Our average job time is 30 ms when coming through the API for simpler retrievals.

Page 74

Jobs need to implement an interface; the context will be passed in. DON'T CREATE YOUR OWN SQLCONTEXT!!
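
A minimal sketch of such a job against the spark-jobserver API of that era (trait and type names from the SJS docs; the SQL variants hand you a SQLContext instead):

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

object RowCountJob extends SparkJob {
  // SJS calls validate before running; return SparkJobValid to proceed
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  // The persistent context is handed to you; never construct your own
  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(1 to 100).count()
}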

Page 75

Currently only supports Spark 1.x; 2.x is coming soonish.

Page 76

Keeps track of job runtimes in a nice UI, along with additional metrics.

Page 77

You can cache data and it will be available to later jobs.

Page 78

You can also load objects and they are available to later jobs via the NamedObjects interface.
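
A hedged sketch using the NamedRddSupport mixin from the same project (method shapes may differ across SJS versions; file path and names are hypothetical):

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

object PublishUsersJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // Publish an RDD under a name; later jobs in the same context can fetch it
    this.namedRdds.update("users", sc.textFile("users.csv"))
    this.namedRdds.get[String]("users").map(_.count()).getOrElse(0L)
  }
}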

Page 79

A persistent context can be run in a separate JVM or within SJS.

Page 80

It does have some sharp edges though...

Page 81

Due to the JVM classloader, contexts need to be restarted on deploy to pick up new code.

Page 82

Some settings:
spark.files.overwrite = true
context-per-jvm = true
spray-can: parsing.max-content-length = 256m
spray-can: idle-timeout = 600 s
spray-can: request-timeout = 540 s
spark.serializer = "org.apache.spark.serializer.KryoSerializer"
filedao vs sqldao backend
Have to build from source; no binary for SJS

Page 83

hive-site.xml:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:myDB;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>

Page 84

Spark Thrift Server: an extended/reused Hive Thrift Server.

Page 85

I run the following on a persistent context:
sc.getConf.set("spark.sql.hive.thriftServer.singleSession", "true")
sqlContext.setConf("hive.server2.thrift.port", port) // port to run the thrift server on
HiveThriftServer2.startWithContext(sqlContext)

Page 86

Now I can connect using hive-jdbc or ODBC (Microsoft or Simba).
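
For example, from any JVM client over hive-jdbc (host, port, and table name are placeholders):

import java.sql.DriverManager

// Older hive-jdbc drivers may need explicit registration
Class.forName("org.apache.hive.jdbc.HiveDriver")

// The Spark Thrift Server speaks the HiveServer2 protocol, so a standard hive-jdbc URL works
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("select count(*) from kudu_table")
while (rs.next()) println(rs.getLong(1))
conn.close()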

Page 87

Run a job with joins, or even just a basic dataframe through the DataSource API, and registerTempTable.

Page 88

val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .kudu
df.registerTempTable("kudu_table")

Page 89

You could also potentially cache/persist via Spark and register that way, assuming joins are expensive.
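
A rough sketch of that pattern (dataframes and names hypothetical):

// Do the expensive join once, keep the result in memory, and expose it to Thrift clients
val joined = usersDF.join(eventsDF, Seq("user_id"))
joined.cache()
joined.registerTempTable("user_events")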

Page 90

Now you can run queries as if it was a traditional database

Page 91

Hey, that's great, but how fast?
500 ms average response time
200 concurrent complex queries
1+ billion rows with 200+ columns
SQL queries with 5 predicates; min, max, count on some values; group by on 5 columns
No Spark caching

Page 92

We take this a step further and build complex dataframes, which are made available as registered temp tables.

Page 93

Questions… If we run out of time, send me questions on Slack.