Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications

13 April 2016
Ted Malaska | Principal Solutions Architect @ Cloudera
Jonathan Hsieh | HBase Tech Lead @ Cloudera, Apache HBase PMC
© Cloudera, Inc. All rights reserved.

Transcript of Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications

Page 1: Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications


Page 2: About Ted and Jon

Ted Malaska
• Principal Solutions Architect @ Cloudera
• Apache HBase SparkOnHBase Contributor
• Contact: [email protected]

Jon Hsieh
• Tech Lead/Eng Manager HBase Team @ Cloudera
• Apache HBase PMC
• Apache Flume founder
• Contact: [email protected], @jmhsieh

Hsieh and Malaska, Hadoop Summit EU Dublin 2016

Page 3: Outline

• Introduction
• Architecture and integration patterns
• Typing and API usage examples
• Future work and Conclusion

Page 4: Apache HBase + Apache Spark

• Apache HBase is a distributed non-relational datastore that specializes in strongly consistent, low-latency, random access reads, writes, and short scans. As a storage system, it is an obvious source for reading RDDs and a destination for writing RDDs.

• Apache Spark is a distributed in-memory processing system that can be used for batch and continuous, near-real-time streaming jobs. Spark’s programming model is built upon the RDD (resilient distributed dataset) abstraction.

Page 5: Example Use cases

• Streaming Analytics into HBase to replace Lambda Architectures (with Kafka)
  • Weblogs
• ETL in Spark to bulkload into HBase
  • 25-50B records per weekly batch
• Using SQL for extraction layer to query HBase entity-centric timeseries data

Page 6: Architecture and Integration Patterns

Page 7: How does data get in and out of HBase?

[Diagram: data paths in and out of HBase, placed on an axis from low latency to high throughput. Labels: HBase Client (Put, Incr, Append); HBase Client (Get, Scan); Bulk Import; HBase Replication; HBase Scanner; Gets, Short scan; Full Scan, Snapshot, MapReduce.]

Page 8: HBase + MapReduce: Batch processing patterns

• Read dataset from HBase Table
  • Use HBase’s MR InputFormats
    • TableInputFormat
    • MultiTableInputFormat
    • TableSnapshotInputFormat

• Write dataset to HBase Table
  • Use HBase’s MR OutputFormat
    • TableOutputFormat
    • MultiTableOutputFormat
    • HFileOutputFormat

[Diagrams: Read from HBase Table; Write to HBase Table]

Page 9: HBase + Spark: Batch processing patterns

• Read dataset (RDD) from HBase Table
  • Use HBase’s MR InputFormats
    • TableInputFormat
    • MultiTableInputFormat
    • TableSnapshotInputFormat

• Write dataset (RDD) to HBase Table
  • Use HBase’s MR OutputFormat
    • TableOutputFormat
    • MultiTableOutputFormat
    • HFileOutputFormat

[Diagrams: Read HBase Table as RDD; Write RDD as HBase Table]

Page 10: Spark Streaming

• Take a data source
• Partition it into mini-batch RDDs
• Compute using the Spark engine
• Output mini-batch RDDs

[Diagram labels: Data source; Mini batch input RDD; Mini batch output RDD]

Page 11: HBase + Spark Streaming – Enriching With HBase Data

• “Join” a dataset with HBase data
• Enrich streaming data source with HBase data
• Extract information from mini batch
• Read/write/update HBase data in processing
• Output HBase-data enriched stream of output RDDs

[Diagram labels: Data source; Mini batch input RDD; HBase-enriched mini batch output RDD]
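
The enrichment pattern above can be sketched without any Spark or HBase dependency. The block below is only a model under stated assumptions: an in-memory `Map` stands in for the HBase table, a plain `Seq` stands in for the stream, and every name in it (`EnrichmentSketch`, `Event`, `profileTable`, `enrichStream`) is hypothetical and not part of either project's API.

```scala
// Dependency-free model of "enrich each mini batch with HBase data".
// Assumption: profileTable is an in-memory stand-in for an HBase table
// keyed by user id; a real job would look rows up through an HBase connection.
object EnrichmentSketch {
  final case class Event(userId: String, url: String)
  final case class EnrichedEvent(userId: String, url: String, country: Option[String])

  // Stand-in for HBase: row key -> one column value.
  val profileTable: Map[String, String] = Map("u1" -> "IE", "u2" -> "US")

  // Enrich one mini batch: one lookup per event, comparable to a Get per record.
  // Events with no matching row keep country = None.
  def enrichBatch(batch: Seq[Event], table: Map[String, String]): Seq[EnrichedEvent] =
    batch.map(e => EnrichedEvent(e.userId, e.url, table.get(e.userId)))

  // Split the source into fixed-size mini batches, then enrich each batch in turn.
  def enrichStream(source: Seq[Event], batchSize: Int): List[Seq[EnrichedEvent]] =
    source.grouped(batchSize).map(batch => enrichBatch(batch, profileTable)).toList
}
```

In an actual job the per-record lookup would run inside a per-partition function that receives an HBase connection, which is the shape of the Foreach and Map operations shown later in this deck.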

Page 12: How does Spark get data in and out of HBase?

[Diagram: the same data paths in and out of HBase, on an axis from low latency to high throughput. Labels: HBase Client (Put, Incr, Append); HBase Client (Get, Scan); Bulk Import; HBase Replication; HBase Scanner; Gets, Short scan; Full Scan, Snapshot, MapReduce.]

Page 13: How does Spark get data in and out of HBase?

[Diagram: the same data paths in and out of HBase (HBase Client, Bulk Import, HBase Replication, HBase Scanner; low latency to high throughput), annotated with two Spark integration points: "Batch RDD via HBase’s MR Input/Output Formats" and "Streaming using HBase to Enrich stream data".]

Page 14: Typing and API Usage

Page 15: Under the covers

[Diagram: a Driver with Configs, and two Worker Nodes, each running an Executor. Each Executor has a Static Space holding Configs and an HConnection, and runs multiple Tasks.]

Page 16: Key Addition: HBaseContext

• Create an HBaseContext

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.spark.HBaseContext
    import org.apache.hadoop.hbase.util.Bytes

    // a Hadoop/HBase Configuration object
    val conf = HBaseConfiguration.create()
    conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
    conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
    // sc is the Spark Context; hbase context corresponds to an HBase Connection
    val hbaseContext = new HBaseContext(sc, conf)

    // A sample RDD
    val rdd = sc.parallelize(Array(
      Bytes.toBytes("1"),
      Bytes.toBytes("2"),
      Bytes.toBytes("3"),
      Bytes.toBytes("4"),
      Bytes.toBytes("5"),
      Bytes.toBytes("6"),
      Bytes.toBytes("7")))

Page 17: Operations on the HBaseContext

• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete

Page 18: Foreach

• Read HBase data in parallel for each partition and compute

    rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
      // do something
      val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
      it.foreach(r => {
        ... // HBase API put/incr/append/cas calls
      })
      // push any buffered mutations to HBase, then release the mutator
      bufferedMutator.flush()
      bufferedMutator.close()
    })

Page 19: Map

• Take an HBase dataset and map it in parallel for each partition to produce a new RDD

    val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
      val table = conn.getTable(TableName.valueOf("t1"))
      var res = mutable.MutableList[String]()
      it.map(r => {
        ... // HBase API Scan Results
      })
    })

Page 20: BulkLoad

• Bulk load a data set into HBase (for all cases, generally wide tables)

    rdd.hbaseBulkLoad(tableName,
      t => {
        // emit one (row key, family, qualifier) -> value cell per record
        Seq((new KeyFamilyQualifier(t.rowKey, t.family, t.qualifier), t.value)).iterator
      },
      stagingFolder)

Page 21: BulkLoadThinRows

• Bulk load a data set into HBase (for skinny tables, <10k cols)

    hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](rdd,
      TableName.valueOf(tableName),
      t => {
        val rowKey = Bytes.toBytes(t._1)
        val familyQualifiersValues = new FamiliesQualifiersValues
        t._2.foreach(f => {
          val family: Array[Byte] = f._1
          val qualifier = f._2
          val value: Array[Byte] = f._3
          familyQualifiersValues += (family, qualifier, value)
        })
        (new ByteArrayWrapper(rowKey), familyQualifiersValues)
      },
      stagingFolder.getPath)

Page 22: Scan vs Bulk Get (Parallel HBase Multigets)

[Diagrams: Scan HBase Table; Bulk Get HBase Table]

Page 23: BulkPut

• Parallelized HBase Multiput

    hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](rdd,
      tableName,
      (putRecord) => {
        val put = new Put(putRecord._1)
        // addColumn replaces the deprecated Put.add(family, qualifier, value)
        putRecord._2.foreach((putValue) =>
          put.addColumn(putValue._1, putValue._2, putValue._3))
        put
      })

Page 24: BulkDelete

• Parallelized HBase Multi-deletes

    hbaseContext.bulkDelete[Array[Byte]](rdd,
      tableName,
      putRecord => new Delete(putRecord),
      4) // batch size

    rdd.hbaseBulkDelete(hbaseContext,
      tableName,
      putRecord => new Delete(putRecord),
      4) // batch size

Page 25: SparkSQL

• Using SparkSQL to query HBase Data

    // Setup Schema Mapping
    val dataframe = sqlContext.load("org.apache.hadoop.hbase.spark",
      Map("hbase.columns.mapping" ->
            "KEY_FIELD STRING :key, A_FIELD STRING c:a, B_FIELD STRING c:b,",
          "hbase.table" -> "t1"))
    dataframe.registerTempTable("hbaseTmp")

    // Query
    sqlContext.sql("SELECT KEY_FIELD FROM hbaseTmp " +
      "WHERE " +
      "(KEY_FIELD = 'get1' and B_FIELD < '3') or " +
      "(KEY_FIELD <= 'get3' and B_FIELD = '8')")
      .foreach(r => println(" - " + r))
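
To make the `hbase.columns.mapping` string easier to read, here is a small dependency-free sketch that splits it into its parts. It assumes each entry has the form "SQL name, SQL type, family:qualifier", with `:key` denoting the row key, as the example string suggests. `ColumnMappingSketch`, `MappedColumn`, and `parse` are hypothetical names for illustration; this is not the connector's own parser.

```scala
// Illustrative reader for a mapping string such as
// "KEY_FIELD STRING :key, A_FIELD STRING c:a, B_FIELD STRING c:b,".
// Assumption: entries are comma-separated and each has exactly three
// whitespace-separated parts; malformed entries throw a MatchError.
object ColumnMappingSketch {
  // One SQL column mapped either to the row key or to a family:qualifier cell.
  final case class MappedColumn(
      sqlName: String,
      sqlType: String,
      family: Option[String],
      qualifier: String) {
    // ":key" has an empty family part and the qualifier "key".
    def isRowKey: Boolean = family.isEmpty && qualifier == "key"
  }

  def parse(mapping: String): List[MappedColumn] =
    mapping.split(",").toList.map(_.trim).filter(_.nonEmpty).map { entry =>
      val Array(name, sqlType, location) = entry.split("\\s+")
      // Limit 2 keeps the empty family in ":key" as a leading empty string.
      val Array(family, qualifier) = location.split(":", 2)
      MappedColumn(name, sqlType, if (family.isEmpty) None else Some(family), qualifier)
    }
}
```

Applied to the mapping on this page, `parse` yields three columns: `KEY_FIELD` bound to the row key, and `A_FIELD` and `B_FIELD` bound to qualifiers `a` and `b` in column family `c`.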

Page 26: SparkSQL + MLLib

• Process data extracted from SparkSQL

    val resultDf = sqlContext.sql("SELECT gamer_id, oks, games_won, games_played FROM gamer")
    // Parse data to apply typing information
    // (columns 1 to 3 are the numeric features; column 0, gamer_id, is skipped)
    val parsedData = resultDf.map(r => {
      val array = Array(r.getInt(1).toDouble, r.getInt(2).toDouble, r.getInt(3).toDouble)
      Vectors.dense(array)
    })
    val dataCount = parsedData.count()
    if (dataCount > 0) {
      // k = 3 clusters, at most 5 iterations
      val clusters = KMeans.train(parsedData, 3, 5)
      clusters.clusterCenters.foreach(v => println(" Vector Center:" + v))
    }

Page 27: Future work and Conclusion

Page 28: Development and Distribution Status

• Today
  • Batch Analysis patterns with existing MR Input/Output Formats
  • Streaming Analysis Patterns

• Committed to HBase trunk branch (2.0) as part of HBase project
• Available in CDH5.7.0 with commercial support
• Used in production and pre-production today at ~10 Cloudera customers

• Recent Additions
  • Kerberos and Secure HBase access
    • To come: Kerberos ticket renewals for Spark Streaming
  • New JSON based HBase table schema specification

Page 29: How does Spark get data in and out of HBase?

[Diagram: the data paths in and out of HBase (HBase Client with Put, Incr, Append and Get, Scan; Bulk Import; HBase Replication; HBase Scanner with Full Scan, MapReduce; low latency to high throughput), annotated with three Spark integration points: "Batch RDD via HBase’s MR Input/Output Formats", "Streaming using HBase to Enrich stream data", and "HBase Data as Spark Streaming data source".]

Page 30: Future: HBase Data as a Source

• HBase edits as a Spark Streaming data source (with Kafka?)
• Gather other data
• Do some computation
• Write the data out

[Diagram labels: HBase Replication; Data source; Mini batch input RDD; HBase-enriched mini batch output RDD]

Page 31: Thank you!

Page 32: Use Case – Streaming Counting

• Puts vs Increments
• Bulk Puts/Gets are good
• You can get perfect counting

Page 33: Spark Streaming counting with HBase Increments

[Diagram: Spark Streaming. For each batch (First Batch, Second Batch), a Source feeds a Receiver that produces the RDDs of a DStream; each batch is handled in a Single Pass of Filter, Count, then HBase Increments.]

Page 34: Spark Streaming counting with HBase Puts

[Diagram: Spark Streaming. For each batch (Pre-first Batch, First Batch, Second Batch), a Source feeds a Receiver that produces the RDD partitions of a DStream; each batch is handled in a Single Pass of Filter and Count, combined with a Stateful RDD (Stateful RDD 1, then Stateful RDD 2) and written out as HBase Puts.]
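
One reading of the "Puts vs Increments" bullet and the two counting diagrams is that increments are not idempotent, while puts of a running total are. The slides do not state this outright, so treat the sketch below as an interpretation. It uses an in-memory `Map` as a stand-in for an HBase counter table, and all names (`CountingSketch`, `applyIncrements`, `applyPuts`) are hypothetical.

```scala
// Dependency-free model of counting with increments versus puts.
// Assumption: Table is an in-memory stand-in for an HBase table of counters.
object CountingSketch {
  type Table = Map[String, Long]

  // Increment style: add this batch's counts to whatever the table already holds.
  def applyIncrements(table: Table, batchCounts: Map[String, Long]): Table =
    batchCounts.foldLeft(table) { case (t, (key, n)) =>
      t.updated(key, t.getOrElse(key, 0L) + n)
    }

  // Put style: overwrite each row with the running total held in streaming state.
  def applyPuts(table: Table, runningTotals: Map[String, Long]): Table =
    table ++ runningTotals

  val batch: Map[String, Long] = Map("home" -> 3L, "cart" -> 1L)

  // The same batch applied twice, as could happen after a retried task.
  val afterIncrements: Table = applyIncrements(applyIncrements(Map.empty, batch), batch)
  val afterPuts: Table = applyPuts(applyPuts(Map.empty, batch), batch)
}
```

Here `afterIncrements` ends up with doubled counts, while `afterPuts` still holds the correct totals, because writing the same total twice changes nothing. Keeping the running total in a stateful RDD and writing it with puts is one way the "perfect counting" bullet could be realized.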