Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications



13 April 2016. Ted Malaska | Principal Solutions Architect @ Cloudera; Jonathan Hsieh | HBase Tech Lead @ Cloudera, Apache HBase PMC



About Ted and Jon

• Ted Malaska
  • Principal Solutions Architect @ Cloudera
  • Apache HBase SparkOnHBase contributor
  • Contact: ted.malaska@cloudera.com

• Jon Hsieh
  • Tech Lead/Eng Manager, HBase Team @ Cloudera
  • Apache HBase PMC
  • Apache Flume founder
  • Contact: jon@cloudera.com, @jmhsieh


Outline

• Introduction

• Architecture and integration patterns

• Typing and API usage examples

• Future work and conclusion


Apache HBase + Apache Spark

• Apache HBase is a distributed, non-relational datastore that specializes in strongly consistent, low-latency, random-access reads, writes, and short scans. As a storage system, it is a natural source for reading RDDs and a natural destination for writing RDDs.

• Apache Spark is a distributed, in-memory processing system that can be used for batch and continuous, near-real-time streaming jobs. Spark's programming model is built on the RDD (resilient distributed dataset) abstraction.


Example Use Cases

• Streaming analytics into HBase to replace Lambda architectures (with Kafka), e.g. for weblogs
• ETL in Spark to bulk load into HBase, at 25-50B records per weekly batch
• Using SQL as an extraction layer to query HBase entity-centric time-series data


Architecture and Integration Patterns


How does data get in and out of HBase?

[Diagram: paths into and out of HBase, ordered from low latency to high throughput. In: HBase client Put/Incr/Append; Bulk Import; HBase Replication. Out: HBase client Gets and short Scans; HBase Replication; HBase Scanner for full scans, snapshots, and MapReduce.]


HBase + MapReduce: Batch processing patterns

• Read a dataset from an HBase table using HBase's MR InputFormats:
  • TableInputFormat
  • MultiTableInputFormat
  • TableSnapshotInputFormat

• Write a dataset to an HBase table using HBase's MR OutputFormats:
  • TableOutputFormat
  • MultiTableOutputFormat
  • HFileOutputFormat


HBase + Spark: Batch processing patterns

• Read a dataset (RDD) from an HBase table using HBase's MR InputFormats: TableInputFormat, MultiTableInputFormat, TableSnapshotInputFormat (a read sketch follows this list)

• Write a dataset (RDD) to an HBase table using HBase's MR OutputFormats: TableOutputFormat, MultiTableOutputFormat, HFileOutputFormat
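A minimal sketch of the read path using Spark's newAPIHadoopRDD with TableInputFormat (assuming a SparkContext named sc and an existing table "t1"):

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat

  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set(TableInputFormat.INPUT_TABLE, "t1") // table to read

  // Each element is a (row key, Result) pair for one HBase row
  val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])

  println(hbaseRDD.count())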


Spark Streaming

• Take a data source
• Partition it into mini-batch RDDs
• Compute on each mini batch using the Spark engine
• Output the mini-batch RDDs (a minimal sketch follows)
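A minimal sketch of the mini-batch model (the socket source, host, port, and 5-second interval are illustrative):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(5))     // mini batches every 5 seconds
  val lines = ssc.socketTextStream("somehost", 9999) // the data source
  lines.map(_.toUpperCase).print()                   // compute and output each mini-batch RDD
  ssc.start()
  ssc.awaitTermination()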


HBase + Spark Streaming – Enriching With HBase Data

• "Join" a dataset with HBase data: enrich the streaming data source with HBase data
• Extract information from each mini batch
• Read/write/update HBase data during processing
• Output an HBase-data-enriched stream of output RDDs

[Diagram: a data source feeding mini-batch input RDDs, which are processed into HBase-enriched mini-batch output RDDs; a sketch of the pattern follows.]
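A minimal sketch of the enrichment pattern, assuming a DStream[String] of row keys named stream, the hbaseContext and hbaseMapPartitions API shown later in this deck, and a table "t1" (pairing transform with hbaseMapPartitions is one way to do this join, not the only one):

  import org.apache.hadoop.hbase.TableName
  import org.apache.hadoop.hbase.client.Get
  import org.apache.hadoop.hbase.util.Bytes

  val enriched = stream.transform { rdd =>
    rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
      val table = conn.getTable(TableName.valueOf("t1"))
      it.map { key =>
        // pair each streaming record with its current HBase row
        (key, table.get(new Get(Bytes.toBytes(key))))
      }
    })
  }
  enriched.print()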



How does Spark get data in and out of HBase?

[Diagram: the same in/out paths as above, annotated for Spark: batch RDDs via HBase's MR Input/Output Formats on the high-throughput side, and streaming jobs using the HBase client (puts, gets, scans) to enrich stream data on the low-latency side.]


Typing and API Usage


Under the covers

[Diagram: the driver holds the configs; each worker node runs an executor whose static space holds the configs and a shared HConnection, reused by all tasks on that executor.]


Key Addition: HBaseContext

• Create an HBaseContext:

  // a Hadoop/HBase Configuration object
  val conf = HBaseConfiguration.create()
  conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
  conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))

  // sc is the SparkContext; the HBaseContext corresponds to an HBase Connection
  val hbaseContext = new HBaseContext(sc, conf)

  // A sample RDD of row keys
  val rdd = sc.parallelize(Array(
    Bytes.toBytes("1"), Bytes.toBytes("2"), Bytes.toBytes("3"),
    Bytes.toBytes("4"), Bytes.toBytes("5"), Bytes.toBytes("6"),
    Bytes.toBytes("7")))


Operations on the HBaseContext

• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete


Foreach

• Read HBase data in parallel for each partition and compute:

  rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
    // a buffered mutator batches the writes for this partition
    val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
    it.foreach(r => {
      ... // HBase API put/incr/append/cas calls
    })
    bufferedMutator.flush()
    bufferedMutator.close()
  })


Map

• Take an HBase dataset and map over it in parallel, per partition, to produce a new RDD:

  val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
    val table = conn.getTable(TableName.valueOf("t1"))
    it.map(r => {
      ... // HBase API get/scan results fetched via `table`
    })
  })


BulkLoad

• Bulk load a dataset into HBase (works in all cases, and generally for wide tables):

  rdd.hbaseBulkLoad(tableName, t => {
    // emit one (row key, family, qualifier) -> value cell per record
    Seq((new KeyFamilyQualifier(t.rowKey, t.family, t.qualifier), t.value)).iterator
  }, stagingFolder)
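The staged HFiles still have to be handed to HBase after the Spark job finishes; a minimal sketch, assuming conf is the HBase Configuration, connection an open HBase Connection, and tableName a TableName (the exact doBulkLoad overload varies across HBase versions):

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

  val loader = new LoadIncrementalHFiles(conf)
  loader.doBulkLoad(new Path(stagingFolder),
    connection.getAdmin,
    connection.getTable(tableName),
    connection.getRegionLocator(tableName))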


BulkLoadThinRows

• Bulk load a dataset into HBase (optimized for skinny tables, <10k columns):

  hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](
    rdd, TableName.valueOf(tableName), t => {
      val rowKey = Bytes.toBytes(t._1)
      val familyQualifiersValues = new FamiliesQualifiersValues
      t._2.foreach(f => {
        val family: Array[Byte] = f._1
        val qualifier = f._2
        val value: Array[Byte] = f._3
        familyQualifiersValues += (family, qualifier, value)
      })
      (new ByteArrayWrapper(rowKey), familyQualifiersValues)
    }, stagingFolder.getPath)


Scan vs Bulk Get (Parallel HBase Multigets)

[Diagram: a single Scan over an HBase table versus a Bulk Get issuing parallel multigets; a BulkGet sketch follows.]
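A minimal BulkGet sketch against the rdd of row keys from the HBaseContext slide (treat the exact parameter order as an assumption about the hbase-spark API):

  import org.apache.hadoop.hbase.TableName
  import org.apache.hadoop.hbase.client.{Get, Result}
  import org.apache.hadoop.hbase.util.Bytes

  val getRdd = hbaseContext.bulkGet[Array[Byte], String](
    TableName.valueOf("t1"),
    2,                                  // records per multiget batch
    rdd,                                // the RDD of row keys from earlier
    record => new Get(record),          // build one Get per record
    (result: Result) => Bytes.toString(result.getRow)) // convert each Result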


BulkPut

• Parallelized HBase Multiput

  hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
    rdd, tableName, putRecord => {
      val put = new Put(putRecord._1)
      putRecord._2.foreach(putValue =>
        put.add(putValue._1, putValue._2, putValue._3))
      put
    })


BulkDelete

• Parallelized HBase multi-deletes, available on the context or as an RDD function:

  hbaseContext.bulkDelete[Array[Byte]](rdd, tableName,
    record => new Delete(record), 4) // batch size

  rdd.hbaseBulkDelete(hbaseContext, tableName,
    record => new Delete(record), 4) // batch size


SparkSQL

• Using SparkSQL to query HBase data:

  // Set up the schema mapping
  val dataframe = sqlContext.load("org.apache.hadoop.hbase.spark",
    Map("hbase.columns.mapping" ->
          "KEY_FIELD STRING :key, A_FIELD STRING c:a, B_FIELD STRING c:b,",
        "hbase.table" -> "t1"))
  dataframe.registerTempTable("hbaseTmp")

  // Query
  sqlContext.sql("SELECT KEY_FIELD FROM hbaseTmp " +
    "WHERE " +
    "(KEY_FIELD = 'get1' and B_FIELD < '3') or " +
    "(KEY_FIELD <= 'get3' and B_FIELD = '8')")
    .foreach(r => println(" - " + r))


SparkSQL + MLlib

• Process data extracted via SparkSQL:

  val resultDf = sqlContext.sql(
    "SELECT gamer_id, oks, games_won, games_played FROM gamer")

  // Parse the rows to apply typing information
  val parsedData = resultDf.map(r => {
    val array = Array(r.getInt(1).toDouble, r.getInt(2).toDouble, r.getInt(3).toDouble)
    Vectors.dense(array)
  })

  val dataCount = parsedData.count()
  if (dataCount > 0) {
    val clusters = KMeans.train(parsedData, 3, 5) // 3 clusters, 5 iterations
    clusters.clusterCenters.foreach(v => println(" Vector Center:" + v))
  }


Future work and Conclusion


Development and Distribution Status

• Today:
  • Batch analysis patterns with existing MR Input/Output Formats
  • Streaming analysis patterns
  • Committed to the HBase trunk branch (2.0) as part of the HBase project
  • Available in CDH 5.7.0 with commercial support
  • Used in production and pre-production today at ~10 Cloudera customers

• Recent additions:
  • Kerberos and secure HBase access

• To come:
  • Kerberos ticket renewal for Spark Streaming
  • New JSON-based HBase table schema specification


How does Spark get data in and out of HBase?

[Diagram: the same in/out paths as before, now adding HBase data as a Spark Streaming data source, fed by HBase replication.]


Future: HBase Data as a Source

• HBase edits as a Spark Streaming data source (with Kafka?)
• Gather other data
• Do some computation
• Write the data out

[Diagram: HBase replication acting as the data source for mini-batch input RDDs, producing HBase-enriched mini-batch output RDDs.]


Thank you!


Use Case – Streaming Counting


• Puts vs. Increments
• Bulk puts/gets perform well
• With puts you can get perfect counting: replaying a Put rewrites the same value, whereas replaying an Increment double-counts


[Diagram: Spark Streaming DStreams; in each batch (first, second), a single pass runs Source -> Receiver -> RDD, then Filter -> Count -> HBase Increments.]


[Diagram: the same pipeline, but each batch's Filter -> Count is merged with a stateful RDD carried over from the previous batch (pre-first, first, second), and the updated counts are written as HBase Puts; a sketch of this pattern follows.]
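A minimal sketch of the puts-based pattern from the second diagram: carry running counts in a stateful DStream and write them as idempotent Puts, so a replayed batch rewrites the same total instead of double-counting as a replayed Increment would. The events stream, checkpoint path, table name, and column are illustrative:

  import org.apache.hadoop.hbase.TableName
  import org.apache.hadoop.hbase.client.Put
  import org.apache.hadoop.hbase.util.Bytes

  // `events` is an assumed DStream[String] of keys; `ssc` is the StreamingContext
  ssc.checkpoint("/tmp/counting-checkpoint") // updateStateByKey requires a checkpoint dir

  val counts = events
    .filter(_.nonEmpty)                       // Filter
    .map(key => (key, 1L))                    // Count per key...
    .updateStateByKey[Long]((batch: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + batch.sum))  // ...merged with the stateful RDD

  counts.foreachRDD { rdd =>
    // rewriting the same running total is idempotent, unlike an Increment
    rdd.hbaseBulkPut(hbaseContext, TableName.valueOf("counts"),
      kv => new Put(Bytes.toBytes(kv._1))
        .addColumn(Bytes.toBytes("c"), Bytes.toBytes("count"), Bytes.toBytes(kv._2)))
  }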