Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications



13 April 2016. Ted Malaska | Principal Solutions Architect @ Cloudera; Jonathan Hsieh | HBase Tech Lead @ Cloudera, Apache HBase PMC



About Ted and Jon

• Ted Malaska
  • Principal Solutions Architect @ Cloudera
  • Apache HBase SparkOnHBase contributor
  • Contact: ted.malaska@cloudera.com

• Jon Hsieh
  • Tech Lead/Eng Manager, HBase Team @ Cloudera
  • Apache HBase PMC
  • Apache Flume founder
  • Contact: jon@cloudera.com, @jmhsieh


Outline

• Introduction

• Architecture and integration patterns

• Typing and API usage examples

• Future work and conclusion


Apache HBase + Apache Spark

• Apache HBase is a distributed, non-relational datastore that specializes in strongly consistent, low-latency, random-access reads, writes, and short scans. As a storage system, it is a natural source for reading RDDs and a natural destination for writing RDDs.

• Apache Spark is a distributed, in-memory processing system that can be used for batch and continuous, near-real-time streaming jobs. Spark's programming model is built on the RDD (resilient distributed dataset) abstraction.


Example Use Cases

• Streaming analytics into HBase to replace Lambda architectures (with Kafka), e.g. for weblogs
• ETL in Spark to bulk load into HBase, at 25-50B records per weekly batch
• Using SQL as an extraction layer to query HBase entity-centric time-series data


Architecture and Integration Patterns


How does data get in and out of HBase?

[Diagram: paths into and out of HBase, ordered from low latency to high throughput. In: HBase client Put/Incr/Append; Bulk Import; HBase Replication. Out: HBase client Gets and short Scans; HBase Replication; HBase Scanner for full scans, snapshots, and MapReduce.]


HBase + MapReduce: Batch processing patterns

• Read a dataset from an HBase table using HBase's MR InputFormats:
  • TableInputFormat
  • MultiTableInputFormat
  • TableSnapshotInputFormat

• Write a dataset to an HBase table using HBase's MR OutputFormats:
  • TableOutputFormat
  • MultiTableOutputFormat
  • HFileOutputFormat


HBase + Spark: Batch processing patterns

• Read a dataset (RDD) from an HBase table using HBase's MR InputFormats: TableInputFormat, MultiTableInputFormat, TableSnapshotInputFormat (a read sketch follows this list)

• Write a dataset (RDD) to an HBase table using HBase's MR OutputFormats: TableOutputFormat, MultiTableOutputFormat, HFileOutputFormat
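A minimal sketch of the read path using Spark's newAPIHadoopRDD with TableInputFormat (assuming a SparkContext named sc and an existing table "t1"):

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat

  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set(TableInputFormat.INPUT_TABLE, "t1") // table to read

  // Each element is a (row key, Result) pair for one HBase row
  val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])

  println(hbaseRDD.count())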


Spark Streaming

• Take a data source
• Partition it into mini-batch RDDs
• Compute on each mini batch using the Spark engine
• Output the mini-batch RDDs (a minimal sketch follows)
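A minimal sketch of the mini-batch model (the socket source, host, port, and 5-second interval are illustrative):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(5))     // mini batches every 5 seconds
  val lines = ssc.socketTextStream("somehost", 9999) // the data source
  lines.map(_.toUpperCase).print()                   // compute and output each mini-batch RDD
  ssc.start()
  ssc.awaitTermination()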


HBase + Spark Streaming – Enriching With HBase Data

• "Join" a dataset with HBase data: enrich the streaming data source with HBase data
• Extract information from each mini batch
• Read/write/update HBase data during processing
• Output an HBase-data-enriched stream of output RDDs

[Diagram: a data source feeding mini-batch input RDDs, which are processed into HBase-enriched mini-batch output RDDs; a sketch of the pattern follows.]
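A minimal sketch of the enrichment pattern, assuming a DStream[String] of row keys named stream, the hbaseContext and hbaseMapPartitions API shown later in this deck, and a table "t1" (pairing transform with hbaseMapPartitions is one way to do this join, not the only one):

  import org.apache.hadoop.hbase.TableName
  import org.apache.hadoop.hbase.client.Get
  import org.apache.hadoop.hbase.util.Bytes

  val enriched = stream.transform { rdd =>
    rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
      val table = conn.getTable(TableName.valueOf("t1"))
      it.map { key =>
        // pair each streaming record with its current HBase row
        (key, table.get(new Get(Bytes.toBytes(key))))
      }
    })
  }
  enriched.print()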



How does Spark get data in and out of HBase?

[Diagram: the same in/out paths as above, annotated for Spark: batch RDDs via HBase's MR Input/Output Formats on the high-throughput side, and streaming jobs using the HBase client (puts, gets, scans) to enrich stream data on the low-latency side.]


Typing and API Usage


Under the covers

[Diagram: the driver holds the configs; each worker node runs an executor whose static space holds the configs and a shared HConnection, reused by all tasks on that executor.]


Key Addition: HBaseContext

• Create an HBaseContext:

  // a Hadoop/HBase Configuration object
  val conf = HBaseConfiguration.create()
  conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
  conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))

  // sc is the SparkContext; the HBaseContext corresponds to an HBase Connection
  val hbaseContext = new HBaseContext(sc, conf)

  // A sample RDD of row keys
  val rdd = sc.parallelize(Array(
    Bytes.toBytes("1"), Bytes.toBytes("2"), Bytes.toBytes("3"),
    Bytes.toBytes("4"), Bytes.toBytes("5"), Bytes.toBytes("6"),
    Bytes.toBytes("7")))


Operations on the HBaseContext

• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete


Foreach

• Read HBase data in parallel for each partition and compute:

  rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
    // a buffered mutator batches the writes for this partition
    val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
    it.foreach(r => {
      ... // HBase API put/incr/append/cas calls
    })
    bufferedMutator.flush()
    bufferedMutator.close()
  })


Map

• Take an HBase dataset and map over it in parallel, per partition, to produce a new RDD:

  val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
    val table = conn.getTable(TableName.valueOf("t1"))
    it.map(r => {
      ... // HBase API get/scan results fetched via `table`
    })
  })


BulkLoad

• Bulk load a dataset into HBase (works in all cases, and generally for wide tables):

  rdd.hbaseBulkLoad(tableName, t => {
    // emit one (row key, family, qualifier) -> value cell per record
    Seq((new KeyFamilyQualifier(t.rowKey, t.family, t.qualifier), t.value)).iterator
  }, stagingFolder)
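The staged HFiles still have to be handed to HBase after the Spark job finishes; a minimal sketch, assuming conf is the HBase Configuration, connection an open HBase Connection, and tableName a TableName (the exact doBulkLoad overload varies across HBase versions):

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

  val loader = new LoadIncrementalHFiles(conf)
  loader.doBulkLoad(new Path(stagingFolder),
    connection.getAdmin,
    connection.getTable(tableName),
    connection.getRegionLocator(tableName))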


BulkLoadThinRows

• Bulk load a dataset into HBase (optimized for skinny tables, <10k columns):

  hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](
    rdd, TableName.valueOf(tableName), t => {
      val rowKey = Bytes.toBytes(t._1)
      val familyQualifiersValues = new FamiliesQualifiersValues
      t._2.foreach(f => {
        val family: Array[Byte] = f._1
        val qualifier = f._2
        val value: Array[Byte] = f._3
        familyQualifiersValues += (family, qualifier, value)
      })
      (new ByteArrayWrapper(rowKey), familyQualifiersValues)
    }, stagingFolder.getPath)


Scan vs Bulk Get (Parallel HBase Multigets)

[Diagram: a single Scan over an HBase table versus a Bulk Get issuing parallel multigets; a BulkGet sketch follows.]
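A minimal BulkGet sketch against the rdd of row keys from the HBaseContext slide (treat the exact parameter order as an assumption about the hbase-spark API):

  import org.apache.hadoop.hbase.TableName
  import org.apache.hadoop.hbase.client.{Get, Result}
  import org.apache.hadoop.hbase.util.Bytes

  val getRdd = hbaseContext.bulkGet[Array[Byte], String](
    TableName.valueOf("t1"),
    2,                                  // records per multiget batch
    rdd,                                // the RDD of row keys from earlier
    record => new Get(record),          // build one Get per record
    (result: Result) => Bytes.toString(result.getRow)) // convert each Result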


BulkPut

• Parallelized HBase Multiput

  hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
    rdd, tableName, putRecord => {
      val put = new Put(putRecord._1)
      putRecord._2.foreach(putValue =>
        put.add(putValue._1, putValue._2, putValue._3))
      put
    })


BulkDelete

• Parallelized HBase multi-deletes, available on the context or as an RDD function:

  hbaseContext.bulkDelete[Array[Byte]](rdd, tableName,
    record => new Delete(record), 4) // batch size

  rdd.hbaseBulkDelete(hbaseContext, tableName,
    record => new Delete(record), 4) // batch size


SparkSQL

• Using SparkSQL to query HBase data:

  // Set up the schema mapping
  val dataframe = sqlContext.load("org.apache.hadoop.hbase.spark",
    Map("hbase.columns.mapping" ->
          "KEY_FIELD STRING :key, A_FIELD STRING c:a, B_FIELD STRING c:b,",
        "hbase.table" -> "t1"))
  dataframe.registerTempTable("hbaseTmp")

  // Query
  sqlContext.sql("SELECT KEY_FIELD FROM hbaseTmp " +
    "WHERE " +
    "(KEY_FIELD = 'get1' and B_FIELD < '3') or " +
    "(KEY_FIELD <= 'get3' and B_FIELD = '8')")
    .foreach(r => println(" - " + r))


SparkSQL + MLlib

• Process data extracted via SparkSQL:

  val resultDf = sqlContext.sql(
    "SELECT gamer_id, oks, games_won, games_played FROM gamer")

  // Parse the rows to apply typing information
  val parsedData = resultDf.map(r => {
    val array = Array(r.getInt(1).toDouble, r.getInt(2).toDouble, r.getInt(3).toDouble)
    Vectors.dense(array)
  })

  val dataCount = parsedData.count()
  if (dataCount > 0) {
    val clusters = KMeans.train(parsedData, 3, 5) // 3 clusters, 5 iterations
    clusters.clusterCenters.foreach(v => println(" Vector Center:" + v))
  }


Future work and Conclusion


Development and Distribution Status

• Today:
  • Batch analysis patterns with existing MR Input/Output Formats
  • Streaming analysis patterns
  • Committed to the HBase trunk branch (2.0) as part of the HBase project
  • Available in CDH 5.7.0 with commercial support
  • Used in production and pre-production today at ~10 Cloudera customers

• Recent additions:
  • Kerberos and secure HBase access

• To come:
  • Kerberos ticket renewal for Spark Streaming
  • New JSON-based HBase table schema specification


How does Spark get data in and out of HBase?

[Diagram: the same in/out paths as before, now adding HBase data as a Spark Streaming data source, fed by HBase replication.]


Future: HBase Data as a Source

• HBase edits as a Spark Streaming data source (with Kafka?)
• Gather other data
• Do some computation
• Write the data out

[Diagram: HBase replication acting as the data source for mini-batch input RDDs, producing HBase-enriched mini-batch output RDDs.]


Thank you!


Use Case – Streaming Counting


• Puts vs. Increments
• Bulk puts/gets perform well
• With puts you can get perfect counting: replaying a Put rewrites the same value, whereas replaying an Increment double-counts


[Diagram: Spark Streaming DStreams; in each batch (first, second), a single pass runs Source -> Receiver -> RDD, then Filter -> Count -> HBase Increments.]


[Diagram: the same pipeline, but each batch's Filter -> Count is merged with a stateful RDD carried over from the previous batch (pre-first, first, second), and the updated counts are written as HBase Puts; a sketch of this pattern follows.]
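A minimal sketch of the puts-based pattern from the second diagram: carry running counts in a stateful DStream and write them as idempotent Puts, so a replayed batch rewrites the same total instead of double-counting as a replayed Increment would. The events stream, checkpoint path, table name, and column are illustrative:

  import org.apache.hadoop.hbase.TableName
  import org.apache.hadoop.hbase.client.Put
  import org.apache.hadoop.hbase.util.Bytes

  // `events` is an assumed DStream[String] of keys; `ssc` is the StreamingContext
  ssc.checkpoint("/tmp/counting-checkpoint") // updateStateByKey requires a checkpoint dir

  val counts = events
    .filter(_.nonEmpty)                       // Filter
    .map(key => (key, 1L))                    // Count per key...
    .updateStateByKey[Long]((batch: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + batch.sum))  // ...merged with the stateful RDD

  counts.foreachRDD { rdd =>
    // rewriting the same running total is idempotent, unlike an Increment
    rdd.hbaseBulkPut(hbaseContext, TableName.valueOf("counts"),
      kv => new Put(Bytes.toBytes(kv._1))
        .addColumn(Bytes.toBytes("c"), Bytes.toBytes("count"), Bytes.toBytes(kv._2)))
  }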