High concurrency, Low latency analytics using Spark/Kudu
Chris George
Who is this guy?
Tech we will talk about: Kudu, Spark, Spark Job Server, Spark Thrift Server
What was the problem?
Apache Kudu
History of Kudu
Columnar vs other types of storage
What if you could update Parquet/ORC easily?
HDFS vs Kudu vs HBase/Cassandra/xyz
Kudu is purely a storage engine, accessible through an API.
To add SQL queries / more advanced SQL-like operations:
Impala vs Spark
Kudu Slack Channel
Master and Tablets in Kudu
Range and Hash Partitioning
Number of cores = number of partitions
Partitioning can be on 1+ columns
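For example, a minimal sketch of hash plus range partitioning using the Kudu Java client's CreateTableOptions (the column names "id" and "ts" are hypothetical); this object is handed to the client when the table is created:

import org.apache.kudu.client.CreateTableOptions
import scala.collection.JavaConverters._

// Hash-partition on "id" into 8 buckets and range-partition on "ts"
val partitioning = new CreateTableOptions()
  .addHashPartitions(List("id").asJava, 8)
  .setRangePartitionColumns(List("ts").asJava)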
With composite primary keys it's important to filter on the key columns in order (A, B, C); i.e. don't scan for just B if possible, it will be expensive.
Scans on a tablet are single-threaded, but you can run 200+ scans on a tablet concurrently.
To find your scale... load up a single tablet with inserts, updates, and deletes concurrently until it no longer meets your performance requirements.
Partitioning is extremely important.
Kudu clients: Java and C++; Python connectors coming.
The Java client loops through tablets, but not concurrently. You can code the multithreading yourself or contribute.
Predicates on any column
Summary of why Kudu?
Predicates/projections on any column, very quickly at scale
Spark
Spark Datasource API:

Reads CSV:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("cars.csv")

Writes CSV:
val selectedData = df.select("year", "model")
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")
But these are often simplified as:
val parquetDataframe = sqlContext.read.parquet("people.parquet")
parquetDataframe.write.parquet("people.parquet")
I wrote the current version of the Kudu Datasource/Spark integration.
There are limitations with the datasource API.
Save modes for the datasource API: append, overwrite, ignore, error
append = insert
overwrite = truncate + insert
ignore = create if not exists
error = throw exception if the data exists
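For example (Parquet shown purely as an illustration; the same modes apply to any datasource):

df.write.mode("append").parquet("people.parquet")    // insert
df.write.mode("overwrite").parquet("people.parquet") // truncate + insert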
What if I want to update? Nope.
What about deletes? Not individually.
So how do you support updates/deletes?
By not using the datasource API... but I'll talk more about that in a minute.
Immutability of dataframes
So why use the datasource API?
Because it's smarter than it appears for reads.
Pushdown predicates and projections.
Pushdown predicates:
// requires: import org.apache.kudu.spark.kudu._ (for the .kudu reader)
val df = sqlContext.read.options(Map(
    "kudu.master" -> "kudu.master:7051",
    "kudu.table" -> "kudu_table")).kudu
df.filter("id >= 5").show()
The datasource has knowledge of what can be pushed down to the underlying store and what can not.
Why am I telling you this?
Because if you want things to be fast you need to know what is not pushed down!
Filters that are pushed down: EqualTo, GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual, And
https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala#L159
Did you notice what's missing?
"OR": Spark will use its optimizer to run two separate Kudu scans for the OR.
"IN" is coming very soon, with nuanced performance details.
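A quick sketch of what that looks like in practice (column name borrowed from the earlier example); per the above, this ends up as two separate Kudu scans whose results Spark combines:

// OR is not pushed down as a single Kudu predicate
df.filter("id < 5 or id > 100").show()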
By the way, if you register the DataFrame as a temp table in Spark, "select * from someDF where id >= 5" will also do pushdowns.
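A minimal sketch of that, using the DataFrame from the pushdown example above:

df.registerTempTable("someDF")
sqlContext.sql("select * from someDF where id >= 5").show() // id >= 5 is still pushed down to Kudu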
Things like select * from someDF where lower(name) = "joe" will pull the entire table into memory; probably a bad thing.
Projections will also be pushed down to Kudu so you're not retrieving the entire row:
df.select("id", "name")
select id, name from someDF
Looked at lots of existing datasources to design Kudu's.
How does Kudu do updates/deletes in Spark?
// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051")
import sqlContext.implicits._ // for the $"column" syntax below

// Insert data
kuduContext.insertRows(df, "test_table")
// Delete data
kuduContext.deleteRows(filteredDF, "test_table")
// Upsert data
kuduContext.upsertRows(df, "test_table")
// Update data
val alteredDF = df.select($"id", ($"count" + 1).as("count"))
kuduContext.updateRows(alteredDF, "test_table")

http://kudu.apache.org/docs/developing.html
Upserts are handled server side for performance
Upserts can also be handled through the datasource API:
df.write.options(Map(
    "kudu.master" -> "kudu.master:7051",
    "kudu.table" -> "test_table"))
  .mode("append").kudu
You can also create, check existence of, and delete tables through the API.
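A sketch of what that can look like with KuduContext (the schema, table name, bucket count, and replica count here are hypothetical):

import org.apache.kudu.client.CreateTableOptions
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType)))

if (!kuduContext.tableExists("test_table")) {
  kuduContext.createTable("test_table", schema, Seq("id"),
    new CreateTableOptions()
      .addHashPartitions(List("id").asJava, 4) // partitioning is required when creating a Kudu table
      .setNumReplicas(3))
}
// kuduContext.deleteTable("test_table")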
Additional notes:
The Kudu datasource currently works with Spark 1.x.
The next release will support both 1.x and 2.x.
It's being improved on a regular basis.
The number of partitions on the DataFrame corresponds to how many tablets/partitions match the filter.
Partition scans are parallel and have locality awareness in Spark.
Be sure to set spark.locality.wait to something small for low latency (3 seconds is the Spark default).
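For example (the value is illustrative; tune it for your workload):

// Lower the locality wait so tasks aren't held back waiting for a node-local slot
val conf = new org.apache.spark.SparkConf()
  .set("spark.locality.wait", "50ms")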
Spark Job Server (SJS)
Created for low latency jobs on Spark.
Persistent contexts: reduces the runtime of a hello-world type of job from 1 second to 10 ms.
REST-based API to: run jobs, create contexts, check the status of a job, both async/sync.
Creating a context calls spark-submit (in separate-JVM mode).
Uses Akka to communicate between the REST layer and the Spark driver.
To create a persistent context you need:
CPU cores + memory footprint
a name to reference it by
a factory to use for the context, i.e. HiveContextFactory vs SqlContextFactory
Our average job time is 30 ms when coming through the API for simpler retrievals.
Jobs need to implement an interface; the context will be passed in. DON'T CREATE YOUR OWN SQLCONTEXT!!
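A minimal sketch against the spark-jobserver 0.6.x job API (trait and package names can differ between versions); note the SQLContext is handed to the job rather than created inside it:

import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkJobValid, SparkJobValidation, SparkSqlJob}

object KuduCountJob extends SparkSqlJob {
  // Validate inputs before the job runs; SparkJobValid means "good to go"
  override def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid

  // The persistent context's SQLContext is passed in; never build your own
  override def runJob(sql: SQLContext, config: Config): Any =
    sql.sql("select count(*) from someDF").collect()
}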
Currently only supports Spark 1.x; 2.x is coming soonish.
Keeps track of job runtimes in a nice UI along with additional metrics.
You can cache data and it will be available to later jobs.
You can also load objects and they are available to later jobs via the NamedObject interface.
A persistent context can be run in a separate JVM or within SJS.
It does have some sharp edges though...
Due to the JVM classloader, contexts need to be restarted on deploy to pick up new code.
Some settings:
spark.files.overwrite = true
context-per-jvm = true
spray-can: parsing.max-content-length = 256m
spray-can: idle-timeout = 600 s
spray-can: request-timeout = 540 s
spark.serializer = "org.apache.spark.serializer.KryoSerializer"
filedao vs sqldao backend
Have to build from source / no binary for SJS.
hive-site.xml:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:myDB;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>
Spark Thrift Server: extended/reused the Hive Thrift Server.
I run the following on a persistent context:
sc.getConf.set("spark.sql.hive.thriftServer.singleSession", "true")
sqlContext.setConf("hive.server2.thrift.port", port) // port to run the thrift server on
HiveThriftServer2.startWithContext(sqlContext)
Now I can connect using hive-jdbc or ODBC (Microsoft or Simba).
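For example, a sketch over hive-jdbc from Scala (host, port, credentials, and table name are hypothetical):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("select count(*) from kudu_table")
while (rs.next()) println(rs.getLong(1))
conn.close()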
Run a job with joins, or even just a basic DataFrame through the datasource API, and registerTempTable:
val df = sqlContext.read.options(Map(
    "kudu.master" -> "kudu.master:7051",
    "kudu.table" -> "kudu_table")).kudu
df.registerTempTable("kudu_table")
You could also potentially cache/persist via Spark and register that way, assuming joins are expensive.
Now you can run queries as if it were a traditional database.
Hey, that's great, but how fast?
500 ms average response time
200 concurrent complex queries
1+ billion rows with 200+ columns
SQL queries with 5 predicates, min/max/count of some values, and group by on 5 columns
No Spark caching
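An illustrative query of that shape (the column names are made up):

sqlContext.sql("""
  select c1, c2, c3, c4, c5, min(v1), max(v1), count(*)
  from someDF
  where c1 = 'a' and c2 > 10 and c3 < 100 and c4 = 'x' and c5 >= 5
  group by c1, c2, c3, c4, c5
""").show()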
We take this a step further: we build complex DataFrames and make them available as registered temp tables.
Questions… If we run out of time, send me questions on Slack.