Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications

© Cloudera, Inc. All rights reserved.

13 April 2016
Ted Malaska | Principal Solutions Architect @ Cloudera
Jonathan Hsieh | HBase Tech Lead @ Cloudera, Apache HBase PMC



About Ted and Jon

Ted Malaska
• Principal Solutions Architect @ Cloudera
• Apache HBase SparkOnHBase Contributor
• Contact
  • [email protected]

Jon Hsieh
• Tech Lead/Eng Manager HBase Team @ Cloudera
• Apache HBase PMC
• Apache Flume founder
• Contact
  • [email protected]
  • @jmhsieh

Hsieh and Malaska, Hadoop Summit EU Dublin 2016


Outline

• Introduction

• Architecture and integration patterns

• Typing and API usage examples

• Future work and Conclusion


Apache HBase + Apache Spark

• Apache HBase is a distributed non-relational datastore that specializes in strongly consistent, low-latency, random access reads, writes, and short scans. As a storage system, it is an obvious source for reading RDDs and a destination for writing RDDs.

• Apache Spark is a distributed in-memory processing system that can be used for batch and continuous, near-real-time streaming jobs. Spark’s programming model is built upon the RDD (resilient distributed dataset) abstraction.


Example Use cases

• Streaming analytics into HBase to replace Lambda architectures (with Kafka)
  • Weblogs

• ETL in Spark to bulk load into HBase
  • 25-50B records per weekly batch

• Using SQL as the extraction layer to query HBase entity-centric time-series data


Architecture and Integration Patterns


How does data get in and out of HBase?

[Figure: paths in and out of HBase, arranged from low latency to high throughput. Labels: HBase Client (Put, Incr, Append); HBase Client (Get, Scan); Bulk Import; HBase Replication; Gets, Short scan; Full Scan, Snapshot, MapReduce; HBase Scanner.]


HBase + MapReduce: Batch processing patterns

• Read dataset from HBase Table
  • Use HBase’s MR InputFormats
    • TableInputFormat
    • MultiTableInputFormat
    • TableSnapshotInputFormat

• Write dataset to HBase Table
  • Use HBase’s MR OutputFormats
    • TableOutputFormat
    • MultiTableOutputFormat
    • HFileOutputFormat

Read from HBase Table

Write to HBase Table


HBase + Spark: Batch processing patterns

• Read dataset (RDD) from HBase Table
  • Use HBase’s MR InputFormats
    • TableInputFormat
    • MultiTableInputFormat
    • TableSnapshotInputFormat

• Write dataset (RDD) to HBase Table
  • Use HBase’s MR OutputFormats
    • TableOutputFormat
    • MultiTableOutputFormat
    • HFileOutputFormat

Read HBase Table as RDD

Write RDD as HBase Table
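
As an illustration of the read path on this slide, the sketch below loads a table as an RDD through TableInputFormat using Spark's generic `newAPIHadoopRDD` entry point. It is a minimal sketch, not the presenters' code: it assumes an hbase-site.xml on the classpath, a running cluster, and a table named "t1" (the name used in the later slides). The object name `ReadTableAsRdd` is hypothetical.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program; "t1" is an assumed table name.
object ReadTableAsRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReadTableAsRdd"))

    // Tell the MapReduce InputFormat which table to scan.
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "t1")

    // Each record is a (row key, Result) pair; input splits follow the table's regions.
    val rdd = sc.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Copy the key bytes out of the reused writable before converting to a String.
    val rowKeys = rdd.map { case (key, _) => Bytes.toString(key.copyBytes()) }
    rowKeys.take(5).foreach(println)

    sc.stop()
  }
}
```

The write direction follows the same idea with one of the listed OutputFormats and a save call on a pair RDD; it is left out here to keep the sketch short.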


Spark Streaming

• Take a data source
• Partition into mini-batch RDDs
• Compute using the Spark engine
• Output mini-batch RDDs

[Figure: Data source → Mini batch input RDD → Mini batch output RDD]
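
The mini-batch flow on this slide can be sketched without Spark. In the example below, plain Scala collections stand in for RDDs, and the names `Event`, `MiniBatchSketch`, `toMiniBatches`, and `compute` are hypothetical; the point is only to show a source being cut into time-based batches that are each processed independently.

```scala
// Plain Scala collections stand in for RDDs here; there is no Spark dependency.
final case class Event(timestampMs: Long, payload: String)

object MiniBatchSketch {
  /** Cuts the events into consecutive mini batches of `batchMs` milliseconds, ordered by time. */
  def toMiniBatches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events
      .groupBy(e => e.timestampMs / batchMs) // batch index for each event
      .toSeq
      .sortBy(_._1)
      .map(_._2)

  /** The per-batch computation: here, a count of each payload within the batch. */
  def compute(batch: Seq[Event]): Map[String, Int] =
    batch.groupBy(_.payload).map { case (payload, hits) => payload -> hits.size }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(0, "a"), Event(400, "b"), Event(1200, "a"), Event(1900, "a"))
    // One output per mini batch, mirroring "output mini-batch RDDs".
    toMiniBatches(events, 1000).map(compute).foreach(println)
  }
}
```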


HBase + Spark Streaming – Enriching With HBase Data

• “Join” a dataset with HBase data
• Enrich streaming data source with HBase data
• Extract information from mini batch
• Read/write/update HBase data in processing
• Output HBase-data enriched stream of output RDDs

[Figure: Data source → Mini batch input RDD → HBase-enriched mini batch output RDD]


How does Spark get data in and out of HBase?

[Figure: the HBase data-path diagram, annotated for Spark. Labels: HBase Client (Put, Incr, Append); HBase Client (Get, Scan); Bulk Import; Gets, Short scan; HBase Scanner; low latency; high throughput. Annotations: "Batch RDD via HBase’s MR Input/Output Formats" and "Streaming using HBase to enrich stream data".]


Typing and API Usage


Under the covers


[Figure: a Driver holding Configs, and two Worker Nodes. Each Worker Node runs an Executor whose Static Space holds the Configs and an HConnection, alongside its Tasks.]


Key Addition: HBaseContext

• Create an HBaseContext

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.spark.HBaseContext
    import org.apache.hadoop.hbase.util.Bytes

    // a Hadoop/HBase Configuration object
    val conf = HBaseConfiguration.create()
    conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
    conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
    // sc is the SparkContext; the HBaseContext corresponds to an HBase Connection
    val hbaseContext = new HBaseContext(sc, conf)

    // A sample RDD of row keys
    val rdd = sc.parallelize(Array(
      Bytes.toBytes("1"),
      Bytes.toBytes("2"),
      Bytes.toBytes("3"),
      Bytes.toBytes("4"),
      Bytes.toBytes("5"),
      Bytes.toBytes("6"),
      Bytes.toBytes("7")))


Operations on the HBaseContext

• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete



Foreach

• Read HBase data in parallel for each partition and compute

    rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
      // do something
      val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
      it.foreach(r => {
        ... // HBase API put/incr/append/cas calls
      })
      // push any buffered mutations before releasing the mutator
      bufferedMutator.flush()
      bufferedMutator.close()
    })


Map

• Take an HBase dataset and map it in parallel for each partition to produce a new RDD

    val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
      val table = conn.getTable(TableName.valueOf("t1"))
      var res = mutable.MutableList[String]()
      it.map(r => {
        ... // HBase API Scan Results
      })
    })


BulkLoad

• Bulk load a data set into HBase (for all cases, generally wide tables)

    rdd.hbaseBulkLoad(hbaseContext,
      tableName,
      t => {
        // one (row key, family, qualifier) -> value cell per record
        Seq((new KeyFamilyQualifier(t.rowKey, t.family, t.qualifier), t.value)).iterator
      },
      stagingFolder)


BulkLoadThinRows

• Bulk load a data set into HBase (for skinny tables, <10k cols)

    hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](rdd,
      TableName.valueOf(tableName),
      t => {
        val rowKey = Bytes.toBytes(t._1)
        val familyQualifiersValues = new FamiliesQualifiersValues
        t._2.foreach(f => {
          val family: Array[Byte] = f._1
          val qualifier = f._2
          val value: Array[Byte] = f._3
          familyQualifiersValues += (family, qualifier, value)
        })
        (new ByteArrayWrapper(rowKey), familyQualifiersValues)
      },
      stagingFolder.getPath)


Scan vs Bulk Get (Parallel HBase Multigets)

[Figure: two panels, "Scan HBase Table" and "Bulk Get HBase Table"]
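
The deck lists BulkGet among the HBaseContext operations but does not show a call. The sketch below is one plausible shape for it, written from memory of the hbase-spark module: the parameter order of `bulkGet` (table name, batch size, RDD, a function that builds each Get, a function that converts each Result) should be treated as an assumption and checked against the version in use. The object and method names `BulkGetSketch` and `rowKeysFound` are hypothetical, as are the table name "t1" and the batch size of 2.

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Hypothetical helper; the bulkGet signature is assumed, not confirmed by the slides.
object BulkGetSketch {
  /** Issues batched multigets for the given row keys and returns the keys that were found. */
  def rowKeysFound(hbaseContext: HBaseContext, rowKeys: RDD[Array[Byte]]): RDD[String] =
    hbaseContext.bulkGet[Array[Byte], String](
      TableName.valueOf("t1"),                  // assumed table name
      2,                                        // gets per multiget batch (illustrative)
      rowKeys,
      (rowKey: Array[Byte]) => new Get(rowKey), // one Get per RDD element
      (result: Result) =>
        // an empty Result means the row was not present
        if (result.isEmpty) "" else Bytes.toString(result.getRow))
      .filter(_.nonEmpty)
}
```

Compared with a scan, this shape suits a sparse set of known row keys: each partition sends its own batches of gets instead of reading a contiguous key range.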


BulkPut

• Parallelized HBase Multiput

    hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](rdd,
      tableName,
      (putRecord) => {
        // putRecord is (row key, Array of (family, qualifier, value))
        val put = new Put(putRecord._1)
        putRecord._2.foreach((putValue) =>
          put.addColumn(putValue._1, putValue._2, putValue._3))
        put
      })


BulkDelete

• Parallelized HBase Multi-deletes

hbaseContext.bulkDelete[Array[Byte]](rdd, tableName, putRecord => new Delete(putRecord), 4) // batch size

rdd.hbaseBulkDelete(hbaseContext, tableName, putRecord => new Delete(putRecord), 4) // batch size


SparkSQL

• Using SparkSQL to query HBase Data

    // Set up schema mapping
    val dataframe = sqlContext.load("org.apache.hadoop.hbase.spark",
      Map("hbase.columns.mapping" ->
            "KEY_FIELD STRING :key, A_FIELD STRING c:a, B_FIELD STRING c:b,",
          "hbase.table" -> "t1"))
    dataframe.registerTempTable("hbaseTmp")

    // Query
    sqlContext.sql("SELECT KEY_FIELD FROM hbaseTmp " +
        "WHERE " +
        "(KEY_FIELD = 'get1' and B_FIELD < '3') or " +
        "(KEY_FIELD <= 'get3' and B_FIELD = '8')")
      .foreach(r => println(" - " + r))


SparkSQL + MLlib

• Process data extracted from SparkSQL

    val resultDf = sqlContext.sql("SELECT gamer_id, oks, games_won, games_played FROM gamer")
    // Parse data to apply typing information (columns 1-3; gamer_id is skipped)
    val parsedData = resultDf.map(r => {
      val array = Array(r.getInt(1).toDouble, r.getInt(2).toDouble, r.getInt(3).toDouble)
      Vectors.dense(array)
    })
    val dataCount = parsedData.count()
    if (dataCount > 0) {
      // 3 clusters, 5 iterations
      val clusters = KMeans.train(parsedData, 3, 5)
      clusters.clusterCenters.foreach(v => println(" Vector Center:" + v))
    }


Future work and Conclusion


Development and Distribution Status

• Today
  • Batch analysis patterns with existing MR Input/Output Formats
  • Streaming analysis patterns

• Committed to HBase trunk branch (2.0) as part of the HBase project
• Available in CDH 5.7.0 with commercial support
• Used in production and pre-production today at ~10 Cloudera customers

• Recent additions
  • Kerberos and secure HBase access

• To come
  • Kerberos ticket renewals for Spark Streaming
  • New JSON-based HBase table schema specification



[Figure: the HBase data-path diagram again. Labels: HBase Client (Put, Incr, Append); HBase Client (Get, Scan); Bulk Import; Gets, Short scan; Full Scan, MapReduce; HBase Scanner; low latency; high throughput. Annotation: "Batch RDD via HBase’s MR Input/Output Formats".]



HBase Data as Spark Streaming data source


Future: HBase Data as a Source

• HBase edits as a Spark Streaming data source (with Kafka?)
• Gather other data
• Do some computation
• Write the data out

[Figure labels: HBase Replication; Data source; Mini batch input RDD; HBase-enriched mini batch output RDD]


Thank you!


Use Case – Streaming Counting


• Puts vs Increments
• Bulk Puts/Gets are good
• You can get perfect counting
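
One way to read the "Puts vs Increments" point is in terms of replay. The sketch below assumes that a mini batch may be executed more than once (for example after a failure), and uses an in-memory map as a stand-in for an HBase counter table. All names (`CounterTable`, `CountingSketch`, `writeIncrements`, `writePuts`) are hypothetical. Under that assumption, increments add the batch's delta again on replay, while puts of absolute totals derived from carried-over state simply overwrite the same value.

```scala
import scala.collection.mutable

// In-memory stand-in for an HBase table of counters (hypothetical, for illustration only).
final class CounterTable {
  private val cells = mutable.Map.empty[String, Long]
  def increment(key: String, delta: Long): Unit = cells(key) = get(key) + delta
  def put(key: String, value: Long): Unit = cells(key) = value
  def get(key: String): Long = cells.getOrElse(key, 0L)
}

object CountingSketch {
  /** Per-batch counts, as a filter-and-count pass over one mini batch would produce. */
  def countBatch(batch: Seq[String]): Map[String, Long] =
    batch.groupBy(identity).map { case (k, v) => k -> v.size.toLong }

  /** Increment style: add this batch's deltas to whatever the table already holds. */
  def writeIncrements(table: CounterTable, batch: Seq[String]): Unit =
    countBatch(batch).foreach { case (k, n) => table.increment(k, n) }

  /** Put style: fold the batch into the previous state, then write absolute totals. */
  def writePuts(table: CounterTable,
                previousState: Map[String, Long],
                batch: Seq[String]): Map[String, Long] = {
    val updated = countBatch(batch).foldLeft(previousState) { case (acc, (k, n)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + n)
    }
    updated.foreach { case (k, total) => table.put(k, total) }
    updated // becomes the state carried into the next batch
  }
}
```

Running `writeIncrements` twice for the same batch doubles the stored counts, whereas running `writePuts` twice with the same previous state leaves the table unchanged after the first run. The cost of the put style is that running totals must be carried from batch to batch.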



[Figure: Spark Streaming counting with HBase Increments. For each DStream batch (First Batch, Second Batch): Source → Receiver → RDD → Single Pass (Filter, Count) → HBase Increments.]

[Figure: Spark Streaming counting with HBase Puts. For each DStream batch (Pre-first Batch, First Batch, Second Batch): Source → Receiver → RDD partitions → Single Pass (Filter, Count) → HBase Puts, with Stateful RDD 1 and Stateful RDD 2 carried between batches.]
