Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL Session Number 1687 Piotr Pruski @ppruski


Page 1: Adding Value to HBase with IBM InfoSphere BigInsights and ...public.dhe.ibm.com/.../bd-bigsqlhbase1/IBD-1687A.pdf · HBase tables created using Hive HBase storage handler cannot be

Adding Value to HBase with IBM InfoSphere BigInsights and Big SQL

Session Number 1687

Piotr Pruski

@ppruski


Please note

IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.

Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.


Agenda

• Introduction to HBase
• Big SQL HBase Storage Handler
  – Column mapping
  – Data encoding
  – Data load
• Secondary Indexes
• Querying
• Recommendations and limitations
• Logs and Troubleshooting
• Highlights and HBase use cases


HBase Basics

• Client/server database
  – Master and a set of region servers
• Key-value store
  – Key and value are byte arrays
  – Efficient access using row key
• Different from relational databases
  – No types: all data is stored as bytes
  – No schema: rows can have different sets of columns


HBase Data Model

• Table
  – Contains column families
• Column family
  – Logical and physical grouping of columns
• Column
  – Exists only when inserted
  – Can have multiple versions
  – Each row can have a different set of columns
  – Each column is identified by its key
• Row key
  – Implicit primary key
  – Used for storing ordered rows
  – Efficient queries using row key

HBTABLE (logical view)

Row key   Value
11111     cf_data: {'cq_name': 'name1', 'cq_val': 1111}
          cf_info: {'cq_desc': 'desc11111'}
22222     cf_data: {'cq_name': 'name2', 'cq_val': 2013 @ ts = 2013, 'cq_val': 2012 @ ts = 2012}

HFile
11111 cf_data cq_name name1 @ ts1
11111 cf_data cq_val 1111 @ ts1
22222 cf_data cq_name name2 @ ts1
22222 cf_data cq_val 2013 @ ts1
22222 cf_data cq_val 2012 @ ts2

HFile
11111 cf_info cq_desc desc11111 @ ts1
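The relationship between the logical table and the per-family HFile listings can be sketched in a few lines of Python. This is only an illustration of the data model, not HBase code: the nested dictionary layout, the function name `cells_for_family`, and the placeholder timestamp 1 are assumptions made for the example.

```python
# Logical view of HBTABLE: row key -> column family -> qualifier -> {timestamp: value}.
# Timestamp 1 is a placeholder; the slide only labels those versions "ts1".
hbtable = {
    "11111": {
        "cf_data": {"cq_name": {1: "name1"}, "cq_val": {1: 1111}},
        "cf_info": {"cq_desc": {1: "desc11111"}},
    },
    "22222": {
        "cf_data": {"cq_name": {1: "name2"}, "cq_val": {2013: 2013, 2012: 2012}},
    },
}


def cells_for_family(table, family):
    """Flatten one column family into sorted (row, family, qualifier, ts, value) cells."""
    cells = []
    for row_key, families in table.items():
        for qualifier, versions in families.get(family, {}).items():
            for ts, value in versions.items():
                cells.append((row_key, family, qualifier, ts, value))
    # Sort by row key, then qualifier, then newest version first.
    return sorted(cells, key=lambda c: (c[0], c[2], -c[3]))


for family in ("cf_data", "cf_info"):
    for cell in cells_for_family(hbtable, family):
        print(cell)
```

Each column family is flattened separately, which mirrors the slide's point that a column family is a physical as well as a logical grouping.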


More on the HBase Data Model

• There is no schema for an HBase table in the RDBMS sense
  – Except that one has to declare the column families
    • Since they determine the physical on-disk organization
  – Thus every row can have a different set of columns
• HBase is described as a key-value store
• Each key-value pair is versioned
  – The version can be a timestamp or an integer
  – Updating a column just adds a new version
• All data are byte arrays, including the table name, column family names, and column names (also called column qualifiers)

Key/Value: [ Row | Column Family | Column Qualifier | Timestamp ] [ Value ]
           (the first four fields form the key)


HBase Cluster Architecture

[Figure: HBase cluster architecture. Labels: Client, ZooKeeper Quorum (ZooKeeper Peers), Master, Region Servers, Regions, Coprocessors, HFiles, HDFS / GPFS.]

• HBase master assigns regions and handles load balancing
• Client finds region server addresses in ZooKeeper
• Client reads and writes rows by accessing the region server
• ZooKeeper is used for coordination / monitoring


BigInsights - Big SQL

• Big SQL brings robust SQL support to the Hadoop ecosystem
• Driving design goals
  – Existing queries should run with no or few modifications
  – Existing JDBC and ODBC compliant tools should continue to function
    • Data warehouse augmentation is a very common use case for Hadoop
• While highly scalable, MapReduce is notoriously difficult to use
• SQL support opens the data to a much wider audience
• Making data in BigInsights accessible to SQL-capable tools
  – Cognos BI
  – MicroStrategy
  – Tableau
  – …


Big Data for a Query-able Archive

• Cognos BI can issue SQL queries against data managed by Apache Hive in BigInsights
• The IBM Big Data platform supports bi-directional queries between BigInsights and the EDW
• Key benefits:
  – Existing SQL-based applications can leverage the Big Data platform
  – EDW optimized from a size and performance perspective
  – Provides cost-effective and flexible big data storage and analysis

[Figure labels: BigInsights (Hadoop); InfoSphere Warehouse / Netezza **; Cognos Insight (Explore & Analyze); Cognos BI Server (Report & Act); Bi-Directional Query Support; SQL; InfoSphere Optim]


Big SQL HBase Storage Handler

• Mapping of SQL to HBase data: column mapping
• Handles serialization/deserialization of data (SerDe)
• Efficiently handles SQL queries by pushing down predicates

[Figure: Big SQL with the HBase Storage Handler and SerDe. Labels: JDBC application, SQL query, query results, input data, delimited files, warehouse, DFS, HBase.]

Query Optimizer (compile time)
  – Process hints
Query Analyzer (runtime)
  – HBase scan limits
  – Filters
  – Index usage


Column Mapping

• Mapping HBase row key/columns to SQL columns
  – Supports one-to-one and one-to-many mappings
• One-to-one mapping
  – Single HBase entity mapped to a single SQL column

SQL column   HBase entity        Sample value
id           key                 11111
name         cf_data:cq_name     name1
value        cf_data:cq_val      1111
desc         cf_info:cq_desc     desc11111


Create Table: One to One Mapping

CREATE HBASE TABLE HBTABLE
( id INT,
  name VARCHAR(10),
  value INT,
  desc VARCHAR(20)
)
COLUMN MAPPING
(
  key mapped by (id),
  cf_data:cq_name mapped by (name),
  cf_data:cq_val mapped by (value),
  cf_info:cq_desc mapped by (desc)
);

Callouts:
  – SQL and HBase: the column list defines the SQL columns; COLUMN MAPPING ties each one to an HBase entity
  – The key mapping is required
  – An HBase column is identified by family:qualifier


One to Many Column Mapping

• Single HBase entity mapped to multiple SQL columns
• Composite key
  – HBase row key mapped to multiple SQL columns
• Dense column
  – One HBase column mapped to multiple SQL columns

HBase entity        SQL columns                    Sample value
key                 userid, acc_no                 11111_ac11
cf_data:cq_names    first_name, last_name          fname1_lname1
cf_data:cq_acct     balance, min_bal, interest     11111#11#0.25


Create Table: One to Many Mapping

CREATE HBASE TABLE DENSE_TABLE
( userid INT,
  acc_no VARCHAR(10),
  first_name VARCHAR(10),
  last_name VARCHAR(10),
  balance double,
  min_bal double,
  interest double
)
COLUMN MAPPING
(
  key mapped by (userid, acc_no),
  cf_data:cq_names mapped by (first_name, last_name),
  cf_data:cq_acct mapped by (balance, min_bal, interest)
);

Callouts:
  – Composite key: key mapped by (userid, acc_no)
  – Dense columns: cf_data:cq_names and cf_data:cq_acct
  – Each mapped by clause takes a list of SQL columns


Why use One to Many mapping ?

• HBase is very verbose
  – Stores a lot of information for each value
  – Primarily intended for sparse data

  <row> <columnfamily> <columnqualifier> <timestamp> <value>

• Save storage space
  – Sample table with 9 columns and 1.5 million rows
  – One-to-one mapping: 522 MB
  – One-to-many mapping: 276 MB
• Improve query response time
  – Query results also return the entire key for each value
  – select * query on sample table
    • One-to-one mapping: 1m 31s
    • One-to-many mapping: 1m 2s
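A rough way to see where the saving comes from is to count the bytes repeated for every stored value. The sketch below uses a deliberately simplified size model (row key + family + qualifier + an 8-byte timestamp + value); it ignores length prefixes, the cell type byte, and compression, and the qualifier names `c0`..`c4` are hypothetical. It does not attempt to reproduce the 522 MB and 276 MB figures.

```python
TIMESTAMP_BYTES = 8  # assumed fixed-width timestamp per cell


def cell_size(row_key, family, qualifier, value):
    """Approximate bytes for one stored cell: the full key is repeated for every value."""
    return len(row_key) + len(family) + len(qualifier) + TIMESTAMP_BYTES + len(value)


row_key = b"11111"
values = [b"fname1", b"lname1", b"10000", b"10", b"0.25"]

# One-to-one: one cell per SQL column (hypothetical qualifiers c0..c4).
one_to_one = sum(
    cell_size(row_key, b"cf_data", b"c%d" % i, v) for i, v in enumerate(values)
)

# One-to-many: all values packed into a single dense cell with a 1-byte separator.
dense_value = b"\x00".join(values)
one_to_many = cell_size(row_key, b"cf_data", b"c0", dense_value)

print(one_to_one, one_to_many)
```

Under this model the dense layout pays the key overhead once per row instead of once per column, which is the effect the slide describes.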


Data encoding

• HBase stores all data as an array of bytes
  – Application decides how to encode/decode the bytes
• Big SQL uses the Hive SerDe interface for serialization/deserialization
• Supports two types of data encoding: string and binary
• Encoding can be specified at the HBase row key/column level

HBase entity        Encoding   Sample value     SQL columns
key                 String     11111_ac11       userid, acc_no
cf_data:cq_names    String     fname1_lname1    first_name, last_name
cf_data:cq_acct     Binary     0x000001 …       balance, min_bal, interest


String encoding

• Default encoding
• Value is converted to string and stored as UTF-8 bytes
• Separator to identify parts in one-to-many mapping
  – Default separator: \u0000

CREATE HBASE TABLE DENSE_TABLE_STR
( userid INT,
  acc_no VARCHAR(10),
  first_name VARCHAR(10),
  last_name VARCHAR(10),
  balance double,
  min_bal double,
  interest double
)
COLUMN MAPPING
(
  key mapped by (userid, acc_no) separator '_',
  cf_data:cq_names mapped by (first_name, last_name) separator '_',
  cf_data:cq_acct mapped by (balance, min_bal, interest) separator '#'
);

Note: A different separator can be specified for each column and for the row key. The default separator for string encoding is the null byte (\u0000).
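As a minimal sketch of what string encoding with separators amounts to, the following Python joins the string form of each part and stores the result as UTF-8 bytes. The helper names `encode_string` and `decode_string` are invented for this illustration and are not part of Big SQL; the real SerDe may differ in details such as escaping.

```python
def encode_string(values, separator="\u0000"):
    """Join the string form of each value with the separator and store as UTF-8 bytes."""
    return separator.join(str(v) for v in values).encode("utf-8")


def decode_string(raw, types, separator="\u0000"):
    """Split on the separator and convert each part back with the given type callables."""
    parts = raw.decode("utf-8").split(separator)
    if len(parts) != len(types):
        raise ValueError("separator appears inside the data or a part is missing")
    return [convert(part) for convert, part in zip(types, parts)]


key = encode_string([11111, "ac11"], separator="_")
acct = encode_string([10000, 10, 0.25], separator="#")

print(key, acct)
print(decode_string(acct, [float, float, float], separator="#"))
```

The length check in `decode_string` hints at why a separator that also occurs inside the data is a problem for this encoding.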


String Encoding: Pros and Cons

• Readable format and easier to port across applications
• Useful to map existing data
• Numeric data not collated correctly
  – HBase stores data as bytes
  – Lexicographic ordering: the values 1, 10, 2, 9 sort in that order, so 2 > 10 and 9 > 10
• Slow
  – Parsing strings is expensive

Existing HBase table, mapped as an external Big SQL table:

HBase entity        SQL columns                    Sample value
key                 userid, acc_no                 11111_ac11
cf_data:cq_names    first_name, last_name          fname1_lname1
cf_data:cq_acct     balance, min_bal, interest     10000#10#0.25
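The collation problem is easy to reproduce outside HBase, because it comes purely from comparing bytes. The sketch below sorts the same four numbers as UTF-8 strings; the zero-padded variant is shown only as a common general workaround and is an assumption of this example, not something the slides recommend.

```python
values = [1, 2, 9, 10]

# String encoding: keys are compared byte by byte, so "10" sorts before "2" and "9".
as_strings = sorted(str(v).encode("utf-8") for v in values)

# Zero-padding to a fixed width restores numeric order for non-negative values.
padded = sorted(str(v).zfill(5).encode("utf-8") for v in values)

print(as_strings)
print(padded)
```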


External Tables

• Useful to map tables that already exist in HBase
  – Data in external tables is not pre-validated
• Can create multiple views of the same table

create external hbase table externalhbase_table (user INT, acc string,
  balance double, min_bal double, interest double)
column mapping (key mapped by (user,acc), cf_data:cq_acct mapped by (balance,
  min_bal, interest) separator '#')
hbase table name 'dense_table';

Note: This external table uses a subset of the data in dense_table.

• HBase tables created using the Hive HBase storage handler cannot be read by Big SQL directly
  – Create external tables over them instead
• Things to note:
  – Dropping an external table only drops the metadata
  – Cannot create a secondary index on external tables


Binary Encoding

• Data encoded using a sortable binary representation
• Separators handled internally
  – Escaped to avoid the issue of the separator existing within data

CREATE HBASE TABLE MIXED_ENCODING
(
  C1 INT, C2 INT, C3 INT,
  C4 VARCHAR(10), C5 DECIMAL(5,2),
  C6 SMALLINT
)
COLUMN MAPPING
(
  KEY MAPPED BY (C1, C2, C3) ENCODING BINARY,
  CF1:COL1 MAPPED BY (C4, C5) SEPARATOR '|',
  CF2:COL1 MAPPED BY (C6) ENCODING BINARY
);

HBase entity   Encoding   Stored value
key            Binary     0x000000000000000100000000000000020000000000000003
cf1:col1       String     foo|97.31
cf2:col1       Binary     0x0000DEAF

Note: If the encoding is not specified, string is used as the default.


Binary Encoding: Pros and Cons

• Faster
• Numeric types collated correctly, including negative numbers

CREATE HBASE TABLE WEATHER (temp INT, date TIMESTAMP, humidity DOUBLE)
COLUMN MAPPING (key mapped by (temp, date), cf:cq mapped by (humidity))
default encoding binary;

• Limited portability

Input rows (temp, date, humidity):
100,2012-06-10 17:00:00:000,40.25
-17,2012-12-12 17:00:00:000,30.25
95,2012-06-05 17:00:00:000,50.25

Stored rows, in row key order:
temp   key                                                    cf:cq
-17    \x01\x7F\xFF\xFF\xEF\x012012-12-12 17:00:00:000\x00    \x01\xC0>@\x00\x00\x00\x00\x00
95     \x01\x80\x00\x00_\x012012-06-05 17:00:00:000\x00       \x01\xC0I \x00\x00\x00\x00\x00
100    \x01\x80\x00\x00d\x012012-06-10 17:00:00:000\x00       \x01\xC0D \x00\x00\x00\x00\x00
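The integer bytes shown in the stored keys are consistent with a well-known trick: write the value big-endian and flip the sign bit, so that unsigned byte comparison agrees with signed numeric order. The sketch below illustrates that technique only; the function names are invented, and the leading \x01 byte that appears on the slide (whose meaning the slide does not state) is left out.

```python
import struct


def encode_int32_sortable(value):
    """Big-endian two's complement with the sign bit flipped, so byte order matches numeric order."""
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def decode_int32_sortable(raw):
    """Inverse of encode_int32_sortable."""
    (unsigned,) = struct.unpack(">I", raw)
    return struct.unpack(">i", struct.pack(">I", unsigned ^ 0x80000000))[0]


temps = [100, -17, 95]
encoded = sorted(encode_int32_sortable(t) for t in temps)

print(encoded)
print([decode_int32_sortable(e) for e in encoded])
```

Sorting the encoded bytes yields the temperatures in numeric order, including the negative value, which is the property the slide attributes to binary encoding.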


Load Data

• Load HBase
  – Loads data from delimited files
  – Column list can be specified

load hbase data inpath 'file:///input.dat'
delimited fields terminated by '|'
into table hbtable
(name, value, desc, id);

  – The file can be on DFS or local to the Big SQL server
  – The column list is optional; if not specified, the column ordering in the table definition is used

• Load FROM
  – Loads data from a (JDBC) source outside of a BigInsights cluster
• Insert command available

insert into hbtable
(name, value, desc, id)
values ('name5', 5555, 'desc55555', 55555);


Load Data: Upsert

• HBase ensures uniqueness of the row key
• Upsert can be confusing: no errors, but fewer rows!
  – Delimited file: 10 rows
  – Load: 10 rows affected
  – select count(*) from hbtable: 7 rows
• Combine multiple columns to make the row key unique
  – key mapped by (id, name)

Input rows (excerpt):
11111, name1, 1111, desc11111
11111, name9, 9999, desc99999
22222, name2, 2222, desc22222

Loaded with key mapped by (id): both rows with id 11111 land on the same row key, so the later one (@ts1) replaces the earlier one (@ts0) in query results
11111, name1, 1111, desc11111 @ts0
11111, name9, 9999, desc99999 @ts1
22222, name2, 2222, desc22222 @ts1

Loaded with key mapped by (id, name): every row has its own row key
11111\x00name1, 1111, desc11111 @ts0
11111\x00name9, 9999, desc99999 @ts1
22222\x00name2, 2222, desc22222 @ts1
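The "no errors but fewer rows" effect can be mimicked with an ordinary dictionary, where writing to an existing key silently replaces the previous value. This is a toy model under stated assumptions: the `load` function is hypothetical, it keeps only the latest version per key (HBase may retain older versions, but a query sees the newest), and it joins key parts with the default null-byte separator.

```python
rows = [
    (11111, "name1", 1111, "desc11111"),
    (11111, "name9", 9999, "desc99999"),
    (22222, "name2", 2222, "desc22222"),
]


def load(rows, key_columns):
    """Simulate a load: a later row with the same row key replaces the earlier one."""
    table = {}
    for row in rows:
        key = "\x00".join(str(row[i]) for i in key_columns)
        table[key] = row  # put = upsert; no error is raised on a duplicate key
    return table


by_id = load(rows, key_columns=[0])
by_id_and_name = load(rows, key_columns=[0, 1])

print(len(rows), len(by_id), len(by_id_and_name))
```

Comparing the number of input rows with the number of keys after the load is the same check the slides recommend doing with select count(*).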


Force Key Unique

• Use the force key unique option when creating a table

CREATE HBASE TABLE HBTABLE_FORCE_KEY_UNIQUE
( id INT, name VARCHAR(10), value INT, desc VARCHAR(20) )
COLUMN MAPPING
(
  key mapped by (id) force key unique,
  cf_data:cq_name mapped by (name),
  cf_data:cq_val mapped by (value),
  cf_info:cq_desc mapped by (desc)
);

• Load adds a UUID to the row key
• Prevents data loss
• Inefficient
  – Stores more data
  – Slower queries

Input rows:
11111, name1, 1111, desc11111
11111, name9, 9999, desc99999
22222, name2, 2222, desc22222

Stored rows:
11111\x00b71c95d8-ffdd-4d49-9015-2fdd6f7dcdf4, name1, 1111, desc11111
11111\x00ea780078-9893-4bf7-95d8-cb9ca4b2427f, name9, 9999, desc99999
22222\x00a90885b0-418b-49ac-a6f6-aa73273b57ca, name2, 2222, desc22222
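The stored keys above have the shape "key value, null byte, UUID". A minimal sketch of that idea follows; `unique_row_key` is a hypothetical helper, and the use of a random (version 4) UUID is an assumption, since the slide does not say how the UUID is generated.

```python
import uuid


def unique_row_key(key_value):
    """Append a null byte and a random UUID so repeated key values no longer collide."""
    return f"{key_value}\x00{uuid.uuid4()}"


ids = [11111, 11111, 22222]
table = {unique_row_key(i): i for i in ids}

print(len(table))
for key in sorted(table):
    print(repr(key))
```

Every key grows by 37 characters in this sketch, which illustrates the "stores more data" cost listed on the slide.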


Load Data: Error Handling

• Option to continue and log error rows
  – LOG ERROR ROWS IN FILE 'filename'
• Common errors
  – Separator exists within data for string encoding
  – Invalid numeric types
• Always count the number of rows after loading
  – Load always reports the total number of rows that it handled

Example: key mapped by (id, name) separator '-', with id defined as integer

Input file (4 rows):
11111, name1, 1111, desc11111
11111, name9, 9999, desc99999
22222, name-2, 2222, desc22222
3333a, name3, 3333, desc33333

Load: 4 rows affected

HBase table (2 rows):
11111-name1, 1111, desc11111
11111-name9, 9999, desc99999

Error file (2 rows):
22222, name-2, 2222, desc22222
3333a, name3, 3333, desc33333
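The two common errors can be expressed as simple row checks. The sketch below is a stand-in for the behaviour described on the slide, not the actual loader: `load_with_error_log` is a hypothetical function, and it collects rejected rows in a list rather than writing an error file.

```python
def load_with_error_log(lines, separator="-"):
    """Return (loaded, rejected) for rows whose key is (id INT, name) joined by a separator."""
    loaded, rejected = {}, []
    for line in lines:
        id_text, name, value, desc = [field.strip() for field in line.split(",")]
        if not id_text.isdigit():       # invalid numeric value for the INT column
            rejected.append(line)
        elif separator in name:         # separator inside the data breaks string encoding
            rejected.append(line)
        else:
            loaded[f"{id_text}{separator}{name}"] = (value, desc)
    return loaded, rejected


lines = [
    "11111, name1, 1111, desc11111",
    "11111, name9, 9999, desc99999",
    "22222, name-2, 2222, desc22222",
    "3333a, name3, 3333, desc33333",
]
loaded, rejected = load_with_error_log(lines)

print(len(lines), len(loaded), len(rejected))
```

The number of rows handled differs from the number stored, which is why counting rows after a load is worthwhile.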


Options to Speed up Load

• Disable WAL
  – Data loss can happen if a region server crashes

LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT' DELIMITED FIELDS
TERMINATED BY '|' INTO TABLE ORDERS DISABLE WAL;

• Increase write buffer
  – set hbase.client.write.buffer=8388608;


Secondary Index Support

• Self-maintaining secondary indexes
  – Stored in an HBase table
  – Populated using a MapReduce index builder
  – Kept up to date using a synchronous coprocessor

[Figure: secondary index architecture. Labels: Client; Big SQL; HBase Storage Handler with SerDe; Query Optimizer (compile time: process hints); Query Analyzer (runtime: use index?); Data Table with Data Regions; Index Table with Index Regions; MR Index Builder (create index); Index Coprocessor; input data, query, query results. Flows: index building, index maintenance, batched get requests.]


Index Creation and Usage

create hbase table dt (id int, c1 string, c2 string, c3 string, c4 string, c5 string)
column mapping (key mapped by (id), f:a mapped by (c1,c2,c3), f:b mapped by (c4,c5));

create index ixc3 on table dt (c3) as 'hbase';

• Automatic index usage
  – Range scan on index table to get matching row key(s) in base table
  – Batched get requests to base table with the matched row key(s)

Data table (dt)
key   f:a (c1, c2, c3)   f:b (c4, c5)
bt1   c11_c21_c31        c41_c51
bt2   c12_c22_c32        c42_c52
bt3   c13_c23_c33        c43_c53

Index table (dt_ixc3), built by create index ixc3 (c3)
key
c31_bt1
c32_bt2
c33_bt3

Query with c3=c32: use index?
  – Yes: index table range scan (start row = c32, stop row = c32++), then data table get (row = bt2)
  – No: full table scan
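The two-step lookup (range scan on the index table, then a batched get on the base table) can be sketched with a sorted list and a dictionary. The table contents follow the example above; the function `lookup_by_c3` and the use of `bisect` to emulate a range scan are assumptions of this illustration, not the storage handler's implementation.

```python
from bisect import bisect_left

data_table = {
    "bt1": ("c11", "c21", "c31", "c41", "c51"),
    "bt2": ("c12", "c22", "c32", "c42", "c52"),
    "bt3": ("c13", "c23", "c33", "c43", "c53"),
}

# Index table keys are "<c3 value>_<base row key>", kept in sorted order.
index_table = sorted(f"{cols[2]}_{key}" for key, cols in data_table.items())


def lookup_by_c3(value):
    """Range-scan the index for the value prefix, then fetch matching base rows as one batch."""
    prefix = value + "_"
    start = bisect_left(index_table, prefix)
    row_keys = []
    for index_key in index_table[start:]:
        if not index_key.startswith(prefix):
            break  # past the stop row
        row_keys.append(index_key[len(prefix):])
    return {key: data_table[key] for key in row_keys}  # batched get


print(lookup_by_c3("c32"))
```

Only the index entries that share the prefix are visited, so the cost grows with the number of matches rather than with the size of the base table.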


Index Pros and Cons

• Fast key-based lookups for queries that return limited data
• Not beneficial if there are too many matches
  – No statistics to make the decision in the compiler
  – useindex hint to make explicit choices
• Index adds latency to data load
  – When loading a big data set, drop the index and recreate it
• LOAD FROM option bypasses index maintenance
  – Uses HBase bulk load, which writes to HFiles directly


Column Family Options

• Compression
  – compression(gz)
• Bloom filters
  – NONE, ROW, ROWCOL
• In-memory columns
  – in memory, no in memory

create hbase table colopt_table (key string, c1 string)
column mapping (key mapped by (key), cf1:c1 mapped by (c1))
column family options (cf1 compression(gz) bloom filter(row) in memory);


Query Handling

• Projection pushdown
• Predicate pushdown
  – Point scan
  – Range scan
  – Automatic index usage
  – Filters
• Query hints


Sample Data

• TPCH orders table with 1.5 million rows

drop table if exists orders;

CREATE HBASE TABLE ORDERS ( O_ORDERKEY BIGINT, O_CUSTKEY INTEGER,
  O_ORDERSTATUS VARCHAR(1), O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP,
  O_ORDERPRIORITY VARCHAR(15), O_CLERK VARCHAR(15),
  O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79) )
column mapping (
  key mapped by (O_ORDERKEY,O_CUSTKEY),
  cf:d mapped by (O_ORDERSTATUS,O_TOTALPRICE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT),
  cf:od mapped by (O_ORDERDATE)
)
default encoding binary;

LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT' DELIMITED FIELDS
TERMINATED BY '|' INTO TABLE ORDERS;


Projection Pushdown

• Get only the columns required by the query
• Limit data retrieved to the client

select * from orders
go -m discard
1500000 rows in results(first row: 0.21s; total: 1m1.77s)
Log: HBase scan details:{ … , families={cf=[d, od]}, …}

select o_totalprice from orders
go -m discard
1500000 rows in results(first row: 0.19s; total: 21.27s)
Log: HBase scan details:{ … , families={cf=[d]}, …}

select o_orderdate from orders
go -m discard
1500000 rows in results(first row: 0.36s; total: 36.24s)
Log: HBase scan details:{ … , families={cf=[od]}, …}

Note: The response time for the o_orderdate query is higher even though it retrieves less data than the query for o_totalprice, because the timestamp type is more expensive.

• Projection happens at the HBase column level
  – For composite keys and dense columns, the entire value is retrieved to the client
  – It is efficient to pack columns that are queried together


Predicate Pushdown: Point Scan

• With full row key
• Big SQL can combine predicates on row key parts

set force local on;
select o_orderkey,o_totalprice from orders where o_custkey=1 and o_orderkey=454791;
+--------------+
| o_totalprice |
+--------------+
| 208660.75000 |
+--------------+
1 row in results(first row: 0.14s; total: 0.14s)

Log: Found a row scan by combining all composite key parts.

Row keys (o_custkey#o_orderkey): 1#454791, 1#579908, 1#3868359, 1#4273923, 1#4808192, 1#5133509
Query: o_custkey=1 and o_orderkey=454791
Scan: start row=1#454791, stop row=1#454791


Predicate Pushdown: Partial row Scan

• Predicate(s) on leading part(s) of the row key

select o_orderkey,o_totalprice from orders where o_custkey=1;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
| 454791     | 74602.81250  |
| 579908     | 54048.26172  |
| 3868359    | 123076.84375 |
| 4273923    | 95911.00781  |
| 4808192    | 65478.05078  |
| 5133509    | 174645.93750 |
+------------+--------------+
6 rows in results(first row: 0.13s; total: 0.13s)

Log: Found a row scan that uses the first 1 part(s) of composite key.

Row keys (o_custkey#o_orderkey): 1#454791, 1#579908, 1#3868359, 1#4273923, 1#4808192, 1#5133509, 2#430243
Query: o_custkey=1
Scan: start row=1, stop row=1++


Predicate Pushdown: Range Scan

• With range predicates

select o_orderkey,o_totalprice from orders where o_custkey < 3;

Log: Found a row scan that uses the first 1 part(s) of composite key.
Log: HBase scan details:{ .. , stopRow=\x01\x80\x00\x00\x03, startRow=, … }

Row keys (o_custkey#o_orderkey): 1#454791 … 1#5133509, 2#430243 … 4#164711
Query: o_custkey<3
start row=, stop row=3#
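The stop row in the log output is consistent with a common order-preserving encoding for signed integers: write the value big-endian and flip the sign bit, so that an unsigned byte-wise comparison (which is how HBase orders row keys) agrees with numeric order. The Python sketch below shows that scheme under stated assumptions. The slides do not document Big SQL's binary encoding, and the meaning of the leading `\x01` byte is not given, so `MARKER` is simply copied from the log and `encode_int` is a hypothetical helper.

```python
import struct

# Assumption: a fixed leading byte, copied from the log output above.
MARKER = b"\x01"


def encode_int(value):
    """Encode a signed 32-bit int so byte-wise order equals numeric order."""
    raw = struct.pack(">i", value)
    # Flipping the sign bit moves negatives below positives in unsigned order.
    return MARKER + bytes([raw[0] ^ 0x80]) + raw[1:]


values = [-5, -1, 0, 1, 2, 3, 100001]
encoded = [encode_int(v) for v in values]

# Sorting the encoded bytes gives the same order as sorting the numbers.
assert encoded == sorted(encoded)

print(encode_int(3))  # b'\x01\x80\x00\x00\x03'
```

Without the sign-bit flip, a plain two's-complement encoding would sort negative values after positive ones, and a range predicate such as `o_custkey < 3` could not be turned into a single start/stop row pair.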


Predicate Pushdown: Full table Scan

• This is an example of a case where predicates are not pushed down.
• This happens when the predicates are only on non-leading parts of the row key.

set force local on;
select o_orderkey,o_totalprice from orders where o_orderkey=454791;

+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
| 454791     | 74602.81250  |
+------------+--------------+
1 row in results(first row: 32.13s; total: 32.13s)

Log: HBase scan details:{ .. , stopRow=, startRow=, … }


Automatic Index Usage

select * from orders where o_clerk='Clerk#000000999'
go -m discard
1472 rows in results(first row: 1.63s; total: 30.32s)

create index ix_clerk on table orders (o_clerk) as 'hbase';
0 rows affected (total: 3m57.82s)

select * from orders where o_clerk='Clerk#000000999'
go -m discard
1472 rows in results(first row: 3.60s; total: 3.65s)

Log: Index query successful

• Index used automatically
• For composite index, rules similar to composite row key apply
  – Parts will be combined where possible
  – With partial value for composite index, range scan done on index table
• Multiple indexes on a table
  – Index to be used is randomly chosen
  – Specify useIndex hint to make use of specific index
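The drop from about 30 seconds to under 4 seconds comes from replacing a scan over every row with a lookup in a separate index table. The Python sketch below models that idea with in-memory dictionaries. It is only an illustration: `build_index`, `select_full_scan`, and `select_with_index` are hypothetical helpers, the sample rows are invented, and the slides do not describe the actual layout of a Big SQL index table.

```python
from collections import defaultdict


def build_index(table, column):
    """Map each value of `column` to the row keys that contain it."""
    index = defaultdict(list)
    for row_key, row in table.items():
        index[row[column]].append(row_key)
    return index


def select_full_scan(table, column, value):
    """Without an index, every row must be examined."""
    matches = [key for key, row in table.items() if row[column] == value]
    return matches, len(table)


def select_with_index(index, value):
    """With an index, only the matching row keys are touched."""
    matches = list(index.get(value, []))
    return matches, len(matches)


# Invented sample data: 1000 orders handled by 10 clerks.
orders = {
    order_key: {"o_clerk": "Clerk#%09d" % (order_key % 10)}
    for order_key in range(1000)
}
ix_clerk = build_index(orders, "o_clerk")

scan_rows, scan_examined = select_full_scan(orders, "o_clerk", "Clerk#000000007")
index_rows, index_examined = select_with_index(ix_clerk, "Clerk#000000007")

print(scan_examined, index_examined)  # 1000 100
```

Both paths return the same rows; only the number of rows examined differs. Building the index is a one-time cost, comparable in spirit to the `create index` statement above taking almost four minutes.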


Pushing down Filters into HBase

• Filters do not avoid a full table scan
  – Some filters can skip certain sections, e.g., PrefixFilter
• Filters limit the rows returned to the client
• Filters limit the data returned to the client
  – Key only filters

select o_orderkey from orders where o_custkey>100000 and o_orderstatus='P'
go -m discard
12819 rows in results(first row: 1.12s; total: 6.80s)

Log: Found a row scan that uses the first 1 part(s) of composite key.
Log: HBase filter list created using AND.
Log: HBase scan details:{… , filter=FilterList AND (1/1): [SingleColumnValueFilter (cf, d, EQUAL, \x01P\x00)], stopRow=, startRow=\x01\x80\x01\x86\xA1, …}

Callouts on the log output:
  – Row scan
  – Column filter, as there is a predicate on the leading part of a dense column
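The difference between a row scan and a filter can be sketched in a few lines of Python. The start row decides how many rows the server has to read, while the filter only decides which of those rows are sent back to the client. This is a simplified model with invented data; `scan_with_filter` is a hypothetical helper and does not use the HBase client API.

```python
def scan_with_filter(rows, start_row, column_filter):
    """Read rows from start_row onward; return only those passing the filter.

    Returns (returned_keys, rows_examined).
    """
    examined = 0
    returned = []
    for key in sorted(rows):
        if key < start_row:
            continue  # skipped entirely thanks to the start row
        examined += 1
        if column_filter(rows[key]):
            returned.append(key)
    return returned, examined


# Invented data: keys are (o_custkey, o_orderkey), one order per customer.
rows = {
    (custkey, 1000 + custkey): {"o_orderstatus": "P" if custkey % 4 == 0 else "O"}
    for custkey in range(1, 201)
}


def is_pending(row):
    return row["o_orderstatus"] == "P"


# Predicate on the leading key part becomes a start row; the status predicate
# becomes a column filter evaluated on every row the scan reads.
with_start, examined_with_start = scan_with_filter(rows, (101, 0), is_pending)
no_start, examined_no_start = scan_with_filter(rows, (0, 0), is_pending)

print(len(with_start), examined_with_start)  # 25 100
print(len(no_start), examined_no_start)      # 50 200
```

With a filter alone, every row is still read on the server side; only the amount of data shipped to the client shrinks.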


Key Only Tables

• Big SQL allows creation of tables without specifying any HBase column

create hbase table KEY_ONLY_TABLE (k1 string, k2 string, k3 string)
column mapping (key mapped by (k1, k2, k3));

select * from KEY_ONLY_TABLE;

Log: Only row key or parts of row key requested. Applying filters.
Log: HBase scan details:{… families={}, filter=FilterList AND (2/2): [FirstKeyOnlyFilter, KeyOnlyFilter], …}


Predicate Precedence

• When a query contains multiple predicates, the following precedence applies:
  – Row Scan
  – Index
  – Filters
    • Row filters
    • Column filters
• Filters will be applied along with row scans
• Filters cannot be combined with index lookups
• Multiple predicates: Use of row and column filter

select o_orderkey, o_custkey, o_orderdate from orders where
o_orderdate=cast('1996-12-09' as timestamp) or o_custkey=2;

Log: HBase filter list created using OR.
Log: HBase scan details:{… , filter=FilterList OR (2/2): [SingleColumnValueFilter (cf, od, EQUAL, \x011996-12-09 00:00:00.000\x00), PrefixFilter \x01\x80\x00\x00\x02], cacheBlocks=false, stopRow=, startRow=, … }

The OR condition prevents usage of a row scan. A row filter (PrefixFilter) is used along with a column filter.
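The precedence rules can be summarized as a small decision function. The Python sketch below is a loose model of the behaviour described on these slides, not the actual planner: `choose_access_path` is a hypothetical name, only equality-style predicates joined by a single AND or OR are considered, an OR is assumed to rule out index lookups as well as row scans, and every predicate on a non-key column is assumed to be eligible for a column filter.

```python
def choose_access_path(predicate_columns, key_parts, indexed_columns=(),
                       connective="AND"):
    """Pick row scan, then index lookup, then filters, in that order."""
    plan = {"access": "full table scan", "row_filters": [],
            "column_filters": [], "not_pushed": []}
    columns = list(predicate_columns)
    leading = key_parts[0]
    remaining = columns

    if connective == "AND" and leading in columns:
        # Row scan: combine as many leading key parts as have predicates.
        plan["access"] = "row scan"
        used = []
        for part in key_parts:
            if part not in columns:
                break
            used.append(part)
        remaining = [c for c in columns if c not in used]
    elif connective == "AND" and any(c in indexed_columns for c in columns):
        # Index lookup: filters cannot be combined with it.
        plan["access"] = "index lookup"
        chosen = next(c for c in columns if c in indexed_columns)
        plan["not_pushed"] = [c for c in columns if c != chosen]
        return plan

    for column in remaining:
        if column == leading:
            plan["row_filters"].append(column)      # e.g. a PrefixFilter
        elif column in key_parts:
            plan["not_pushed"].append(column)       # non-leading key part
        else:
            plan["column_filters"].append(column)   # e.g. SingleColumnValueFilter
    return plan


KEY = ["o_custkey", "o_orderkey"]
print(choose_access_path(["o_custkey", "o_orderstatus"], KEY)["access"])
print(choose_access_path(["o_orderdate", "o_custkey"], KEY, connective="OR"))
```

The second call mirrors the query above: the OR leaves a full table scan with a row filter on `o_custkey` and a column filter on `o_orderdate`.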


Accessmode Hint

• Will run the query locally in Big SQL server
  – Useful to avoid map reduce overhead
• Very important for HBase point queries
  – This is not detected currently by compiler
  – Specify accessmode='local' hint when getting a limited set of data from HBase
• Specify at query level

select o_orderkey from orders /*+ accessmode='local' +*/ where o_custkey=1
and o_orderkey=454791;

• Specify at session level
  – set force local on
  – set force commands override query level hints


HBase Hints

• rowcachesize (default=2000)
  – Used as scan cache setting
  – Also used to determine number of get requests to batch in index lookups
• colbatchsize (default=100)
• useindex ('false' to avoid index usage)

select o_orderkey from orders /*+ rowcachesize=10000 +*/ where o_custkey>5000
go -m discard
1450136 rows in results(first row: 22.67s; total: 27.46s)

Log: HBase scan details:{... , caching=10000, ...}

• rowcachesize can also be set using the set command:
  – set hbase.client.scanner.caching=10000;
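Scanner caching controls how many rows the HBase client fetches per round trip to a region server, so a larger value means fewer round trips for the same result set. A rough estimate for the query above, as a Python sketch: `scanner_round_trips` is a hypothetical helper, and the estimate ignores region boundaries, result size limits, and the final empty fetch that ends a scan.

```python
def scanner_round_trips(total_rows, caching):
    """Approximate number of fetches needed to stream total_rows to the client."""
    if caching <= 0:
        raise ValueError("caching must be positive")
    # Ceiling division using integers only.
    return -(-total_rows // caching)


ROWS = 1450136  # row count reported for the query above

print(scanner_round_trips(ROWS, 2000))   # 726 with the default rowcachesize
print(scanner_round_trips(ROWS, 10000))  # 146 with rowcachesize=10000
```

The trade-off is memory: each fetch holds up to `caching` rows on both the client and the region server, so wide rows call for a smaller value.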


Recommendations

• Row key design is the most important factor
  – Try to make the columns most commonly used in predicates part of the row key
  – Do not make the row key too long
• Use short names for HBase column families and column qualifiers
  – f:q instead of mycolumnfamily:mycolumnqualifier
• Check if key only tables can be used
• Pack columns that are queried together into dense columns
  – Use the column that is used as query predicate as prefix
  – Create indexes for columns that do not have repeating values and are queried often
• Separate columns that are rarely or never queried into a different column family
• Set hbase.client.scanner.caching to an optimum value
• Ensure even data distribution
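The advice on short names and short row keys follows from how HBase stores data: every cell carries its own copy of the row key, column family name, and column qualifier. The Python sketch below estimates that per-cell overhead. `coordinate_bytes` is a hypothetical helper, the 8-byte row key and the cell count are invented for illustration, and the estimate ignores the fixed-size fields of each cell (timestamp, type, length prefixes) as well as any compression or block encoding.

```python
def coordinate_bytes(row_key, family, qualifier):
    """Bytes spent on row key + family + qualifier, repeated in every cell."""
    return len(row_key) + len(family.encode()) + len(qualifier.encode())


ROW_KEY = b"\x00" * 8  # assumed 8-byte composite key
CELLS = 1_000_000      # assumed number of stored cells

long_names = coordinate_bytes(ROW_KEY, "mycolumnfamily", "mycolumnqualifier")
short_names = coordinate_bytes(ROW_KEY, "f", "q")

print(long_names, short_names)             # 39 10
print((long_names - short_names) * CELLS)  # 29000000 bytes saved before compression
```

The same arithmetic explains why dense columns help: packing several SQL columns into one HBase cell pays the coordinate overhead once instead of once per column.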


Limitations

• No diagnostic info about HBase pushdown
  – How HBase storage handler pushes down a query is decided only at runtime
  – Predicate handling details are logged at INFO level
  – Many examples of log messages covered in previous slides
• No auto detection of local vs MR mode
  – Currently depends on user specified hints
• Statistics not available
  – Big SQL does not have a framework to collect statistics
  – Query optimizations can be improved with availability of useful statistics
• Map type not supported
  – Big SQL does not support map data type
  – Hive HBase handler supports map data type and many to one mapping
    • Mapping an entire HBase column family to a map data type


Logs and Troubleshooting

• Big SQL logs
  – Look for rewritten query
  – More information in Big SQL logs if query is run in local mode
• Map Reduce logs
  – Predicate handling information in map task log when run in MR mode
• HBase web GUI
  – http://<<hostname>>:60010/master-status


Big SQL HBase Handler Highlights

• Support for composite key/dense columns
• Pushdown for efficient execution of queries
• Support for secondary indexes
• Binary encoding (collated correctly)
• Key only tables
• Support for hints to make query optimization decisions


Scenarios that can leverage HBase features

• Point queries
  – Queries that return a single row of result
  – Row can be determined using row key or secondary index
    • Not all queries using a secondary index are point queries
• Queries with projections
  – If a query requires only a few columns
  – Projection happens at HBase column level
• Data maintenance using upserts
  – Loading different values for columns using the same row key


Acknowledgements and Disclaimers

Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2013. All rights reserved.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.


Piotr Pruski

@ppruski

Thank You

Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL

Acknowledgements

• Full credit to Deepa Remesh