Hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

Daniel Dai, Vaibhav Gumashta
Hortonworks
Hadoop Summit San Jose, June 2016


© Hortonworks Inc. 2011-2016. All Rights Reserved

Agenda

• Motivation
• System Design
• Caching Strategy
• Transaction Management
• Deployment
• Experimental Results
• Future Work

What is Hive MetaStore

Stores metadata about the data
- Database
- Table
- Partition
- Privilege
- Role
- Permanent UDF
- Statistics
- Locks
- Transaction
- etc.

Two modes
- Thrift Server
- Embedded

Backend
- RDBMS: Derby, MSSQL, MySQL, Oracle, Postgres

Low latency in Hive

Hadoop is only for large jobs?
- Most jobs are small jobs
- Users want to run both small and large jobs in one system

What's trending in Hive: low latency
- Stinger (Tez + ORC + Vectorization)
  • Brings queries down to 5-10s
- LLAP
  • Sub-second queries: TPC-DS query 27

New Bottleneck - Metastore

Planning time is non-negligible.
Within planning, a significant amount of time is spent on metadata fetching.

Besides Latency

Significantly more scale
- More metadata: millions of partitions
- New large-scale metadata: split information, ORC row group statistics
- More calls: handle orders of magnitude more calls, including calls from tasks

Reduce complexity
- Object-relational modeling is an impedance mismatch
- DataNucleus
- DBCP, BoneCP, or HikariCP

[Figure: ER diagram for the ObjectStore database]

How About Improving ObjectStore

Already happening
- Using direct SQL instead of O-R mapping

But
- Maintenance nightmare
- Must handle syntax differences between databases

Re-engineering effort may not pay off
Ultimate barrier: scalability

    String queryText =
        "select PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID, "
      + "SERDES.SERDE_ID, PARTITIONS.CREATE_TIME, "
      + "PARTITIONS.LAST_ACCESS_TIME, SDS.INPUT_FORMAT, SDS.IS_COMPRESSED, "
      + "SDS.IS_STOREDASSUBDIRECTORIES, SDS.LOCATION, SDS.NUM_BUCKETS, "
      + "SDS.OUTPUT_FORMAT, SERDES.NAME, SERDES.SLIB "
      + "from PARTITIONS "
      + "left outer join SDS on PARTITIONS.SD_ID = SDS.SD_ID "
      + "left outer join SERDES on SDS.SERDE_ID = SERDES.SERDE_ID "
      + "where PART_ID in (" + partIds + ") order by PART_NAME asc";


System Architecture

[Diagram components: HiveMetaStore Thrift Client; HiveMetaStore Thrift Server; ObjectStore and HBaseStore; RDBMS and HBase; Omid]

• Two implementations of the RawStore interface
  • HBaseStore
  • ObjectStore
• Both backends will live together for a while
• HBaseStore
  • Most traffic will go through the transaction layer (Omid)
  • Some traffic will bypass the transaction layer
    • Volatile data
    • High possibility of conflict

RDBMS schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server extracts values from Thrift objects and creates corresponding ORM model objects
• ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS using appropriate foreign key references
• RDBMS fastpath is enabled by not using ORM and writing direct SQL. However, this complicates the testing matrix, as there may be slight variations in SQL semantics between RDBMS databases

Example: adding a new partition, "add_partition(Partition new_part)"

RDBMS schema (continued)

Example: adding a new partition, "add_partition(Partition new_part)"

    struct Partition {
      1: list<string> values,
      2: string dbName,
      3: string tableName,
      4: i32 createTime,
      5: i32 lastAccessTime,
      6: StorageDescriptor sd,
      7: map<string, string> parameters,
      8: optional PrincipalPrivilegeSet privileges
    }

RDBMS tables shown in the diagram:
TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS

HBase schema

HBMS_DBS
  Key: bytes(dbName)
  Column families and columns: cf_catalog: "c"
  Description: "c": Database proto

HBMS_SDS
  Key: bytes(md5(SD proto))
  Column families and columns: cf_catalog: "c", "ref"
  Description: "c": StorageDescriptor proto; "ref": reference count

HBMS_TBLS
  Key: bytes(dbName, tblName)
  Column families and columns: cf_catalog: "c"; cf_stats: "s" -> c1 ... cn
  Description: "c": Table proto; "s": Stats per column in the Table

HBMS_PARTITIONS
  Key: bytes(dbName, tblName, partVal1, ..., partValn)
  Column families and columns: cf_catalog: "c"; cf_stats: "s" -> c1 ... cn
  Description: "c": Partition proto; "s": Stats per column in the Partition

HBMS_AGGR_STATS
  Key: bytes(md5(dbName, tblName, partVal1, ..., partValn, colName))
  Column families and columns: cf_catalog: "s", "b"
  Description: "b": AggrStatsBloomFilter proto; "s": AggrStats proto

HBMS_FUNCS
  Key: bytes(dbName, funcName)
  Column families and columns: cf_catalog: "c"
  Description: "c": Function proto

HBMS_FILE_METADATA
  Key: bytes(fileId)
  Column families and columns: cf_catalog: "c"; cf_stats: "s"
  Description: "c": Metadata footer proto; "s": PPD Stats

HBase schema (continued)

HBMS_GLOBAL_PRIVS
  Key: bytes("gp")
  Column families and columns: cf_catalog: "c"
  Description: "c": store/retrieve serialized PrincipalPrivilegeSet proto

HBMS_ROLES
  Key: bytes(roleName)
  Column families and columns: cf_catalog: "roles"
  Description: "roles": store/retrieve serialized Role proto

HBMS_USER_TO_ROLE
  Key: bytes(userName)
  Column families and columns: cf_catalog: "c"
  Description: "c": store/retrieve serialized RoleList proto

HBMS_SECURITY
  Key: bytes(delTokenId)
  Column families and columns: cf_catalog: "dt", "mk"
  Description: "dt": store/retrieve delegation token; "mk": master keys

HBMS_SEQUENCES
  Key: bytes(sequence)
  Column families and columns: cf_catalog: "c"
  Description: "c": store/retrieve sequences

De-normalization

• Goal
  • Optimized for querying
  • May be slower in DDL
• Example: drop_role(String roleName)

HBMS_USER_TO_ROLE

Key               Value
bytes("User 1")   Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2")   Proto(Role 1, Role 2)
bytes("User 3")   Proto(Role 4, Role 5)
bytes("User 4")   Proto(Role 2, Role 3)

• Need to scan and de-serialize everything in order to drop a role
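The cost of drop_role under this layout can be sketched in plain Java. This is only an illustration: `UserToRoleTable` is a hypothetical in-memory stand-in for HBMS_USER_TO_ROLE, not the HBaseStore code, and a `List<String>` stands in for the serialized RoleList proto.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for HBMS_USER_TO_ROLE: user name -> role list.
final class UserToRoleTable {
    private final Map<String, List<String>> rows = new TreeMap<>();

    void grant(String user, String... roles) {
        rows.computeIfAbsent(user, u -> new ArrayList<>()).addAll(List.of(roles));
    }

    // Point lookup by row key: the access pattern the layout is optimized for.
    List<String> rolesOf(String user) {
        return rows.getOrDefault(user, List.of());
    }

    // There is no role -> users index, so dropping a role visits every row,
    // inspects its role list, and rewrites the rows that contained the role.
    // Returns the number of rows that had to be rewritten.
    int dropRole(String role) {
        int rewritten = 0;
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            if (row.getValue().remove(role)) {
                rewritten++;
            }
        }
        return rewritten;
    }
}
```

Reads such as `rolesOf("User 1")` touch a single row, while `dropRole` is linear in the number of users, which is the trade-off the slide describes.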

Partition Keys

Range scan for most queries
- Where date = '201601' and state = 'CA'
- Where date >= '201602' and date < '201604'

Server side filter for the rest
- Where state = 'CA' (not prefix key)
- Where date like '2016' (regex)
- Where date > '201601' and state > 'OR' (cannot be range scan)
- Scans all keys but does not deserialize values

date     state
201601   CA
201601   WA
201602   CA
201603   CA
201605   CA
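The difference between the two access paths can be sketched with a sorted map standing in for an HBase table. Everything here is an assumption made for illustration: `PartitionKeyScan`, the `\u0000` separator between key parts, and the empty value that stands in for a serialized Partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model: rows sorted by a composite key "date<SEP>state".
final class PartitionKeyScan {
    private static final char SEP = '\u0000';
    private final NavigableMap<String, byte[]> rows = new TreeMap<>();

    void put(String date, String state) {
        rows.put(date + SEP + state, new byte[0]); // value stands in for a Partition proto
    }

    // Predicate on the key prefix (date >= lo and date < hi):
    // only the rows inside the range are visited.
    List<String> rangeScanByDate(String lo, String hi) {
        return new ArrayList<>(rows.subMap(lo, true, hi, false).keySet());
    }

    // Predicate on a non-prefix part (state = ?): every key is visited,
    // but only the key is examined; the value is never deserialized.
    List<String> filterByState(String state) {
        List<String> hits = new ArrayList<>();
        for (String key : rows.keySet()) {
            if (key.substring(key.indexOf(SEP) + 1).equals(state)) {
                hits.add(key);
            }
        }
        return hits;
    }
}
```

With the five rows from the table above, a date range of '201602' to '201604' visits two rows, while a filter on state has to walk all five keys.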

Typed Partition Keys

Binary sorted
- HBase range scan: Scan(byte[] startRow, byte[] stopRow)
- Where key1 >= 'A5' and key2 >= 8
  • startRow: 41 35 00 00 00 00 08

Using BinarySortableSerDe
- Supports all Hive data types
- Handles null

(String, Integer)   Bytes
'A10', 3            41 31 30 00 00 00 00 03
'A10', 10           41 31 30 00 00 00 00 0A
'A5', 4             41 35 00 00 00 00 04
'A5', 15            41 35 00 00 00 00 0F
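The byte layout in the table can be reproduced with a small sketch. It assumes the simplified encoding the slide shows (string bytes, a 00 terminator, then a 4-byte big-endian integer) and handles non-negative integers only; the real BinarySortableSerDe covers more types and cases. `TypedKeyEncoding` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical encoder for a (String, Integer) partition key.
final class TypedKeyEncoding {
    static byte[] encode(String s, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("sketch handles non-negative ints only");
        }
        byte[] str = s.getBytes(StandardCharsets.UTF_8);
        // ByteBuffer is big-endian by default, so larger ints sort later byte-wise.
        ByteBuffer buf = ByteBuffer.allocate(str.length + 1 + 4);
        buf.put(str).put((byte) 0).putInt(n);
        return buf.array();
    }

    // Space-separated upper-case hex, matching the notation used in the table.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Unsigned lexicographic comparison, the order in which HBase sorts row keys.
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }
}
```

For example, `hex(encode("A5", 8))` yields the startRow shown above, and `compare(encode("A10", 10), encode("A5", 4))` is negative because 'A10' sorts before 'A5' byte-wise.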


Storage Descriptor de-duplication

Storage descriptors are stored in HBMS_SDS, keyed by bytes(md5(SD proto)), with a "ref" reference count.

Thrift definitions:

    struct Partition {
      1: list<string> values,
      2: string dbName,
      3: string tableName,
      4: i32 createTime,
      5: i32 lastAccessTime,
      6: StorageDescriptor sd,
      7: map<string, string> parameters,
      8: optional PrincipalPrivilegeSet privileges
    }

    struct StorageDescriptor {
      1: list<FieldSchema> cols,
      2: string location,
      3: string inputFormat,
      4: string outputFormat,
      5: bool compressed,
      6: i32 numBuckets,
      7: SerDeInfo serdeInfo,
      8: list<string> bucketCols,
      9: list<Order> sortCols,
      10: map<string, string> parameters,
      11: optional SkewedInfo skewedInfo,
      12: optional bool storedAsSubDirectories
    }

Storage Descriptor de-duplication (continued)

Protobuf definitions:

    message Partition {
      optional int64 create_time = 1;
      optional int64 last_access_time = 2;
      optional string location = 3;
      optional Parameters sd_parameters = 4;
      required bytes sd_hash = 5;
      optional Parameters parameters = 6;
    }

    message StorageDescriptor {
      message Order { ... }
      message SerDeInfo { ... }
      message SkewedInfo { ... }
      repeated FieldSchema cols = 1;
      optional string input_format = 2;
      optional string output_format = 3;
      optional bool is_compressed = 4;
      optional sint32 num_buckets = 5;
      optional SerDeInfo serde_info = 6;
      repeated string bucket_cols = 7;
      repeated Order sort_cols = 8;
      optional SkewedInfo skewed_info = 9;
      optional bool stored_as_sub_directories = 10;
    }
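The idea of keying storage descriptors by a hash and keeping a reference count can be sketched as follows. This is a minimal in-memory illustration under assumptions: `StorageDescriptorStore` is a hypothetical name, a `HashMap` stands in for HBMS_SDS, and the Base64 string form of the MD5 digest stands in for the row key bytes.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for HBMS_SDS: md5(serialized SD) -> (SD bytes, ref count).
final class StorageDescriptorStore {
    private static final class Entry {
        final byte[] serializedSd;
        int refCount;

        Entry(byte[] serializedSd) {
            this.serializedSd = serializedSd;
        }
    }

    private final Map<String, Entry> rows = new HashMap<>();

    static String hash(byte[] serializedSd) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(serializedSd);
            return Base64.getEncoder().encodeToString(md5);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 ships with every Java platform
        }
    }

    // Stores the descriptor only if it is new; otherwise bumps the ref count.
    // The returned key is what a partition would keep in its sd_hash field.
    String put(byte[] serializedSd) {
        String key = hash(serializedSd);
        rows.computeIfAbsent(key, k -> new Entry(serializedSd)).refCount++;
        return key;
    }

    // Drops one reference; the row is deleted when nothing refers to it.
    void release(String key) {
        Entry e = rows.get(key);
        if (e != null && --e.refCount == 0) {
            rows.remove(key);
        }
    }

    int refCount(String key) {
        Entry e = rows.get(key);
        return e == null ? 0 : e.refCount;
    }

    int size() {
        return rows.size();
    }
}
```

Partitions that share an identical storage descriptor end up sharing one stored row, which is presumably why the protobuf Partition keeps location and sd_parameters outside the hashed descriptor.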

HBase schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server passes Thrift objects to the HBase client opened in the Thrift server
• HBase client extracts fields from Thrift objects and converts them to corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables

Example: adding a new partition, "add_partition(Partition new_part)"

HBase schema (continued)

Example: adding a new partition, "add_partition(Partition new_part)"

Thrift object:

    struct Partition {
      1: list<string> values,
      2: string dbName,
      3: string tableName,
      4: i32 createTime,
      5: i32 lastAccessTime,
      6: StorageDescriptor sd,
      7: map<string, string> parameters,
      8: optional PrincipalPrivilegeSet privileges
    }

Protobuf object:

    message Partition {
      optional int64 create_time = 1;
      optional int64 last_access_time = 2;
      optional string location = 3;
      optional Parameters sd_parameters = 4;
      required bytes sd_hash = 5;
      optional Parameters parameters = 6;
    }

HBase tables shown in the diagram: HBMS_PARTITIONS, HBMS_SDS


Caching

Aggregate Stats
• Location: on HBase
• Compile time

File Footers
• Location: on HBase
• Runtime: accessed from tasks

Tables, Partitions, Storage Descriptors
• Location: on Metastore server(s)
• Compile time

Caching Aggregate Stats

"get_aggr_stats_for(dbName, tblName, partNames, colNames)"

• Gets aggregated stats for columns in each partition: an expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache. AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  • An AggrStatsBloomFilter is created on the partitions contained in AggrStats
• Invalidation
  • TTL expiry: nodes are evicted from the cache
  • Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
  • An invalidator thread picks up an invalidation request and executes a filter on HBase to remove expired entries
  • Uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
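The bloom-filter-based invalidation step can be sketched in memory. This is an illustration under stated assumptions, not the Hive implementation: `AggrStatsCache`, the 1024-bit filter, and the two hash positions derived from `String.hashCode()` are all hypothetical choices.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical cache: each entry carries a Bloom filter over its partition names.
final class AggrStatsCache {
    static final class Bloom {
        private static final int BITS = 1024;
        private final BitSet bits = new BitSet(BITS);

        // Two bit positions per element, derived from the string hash (an assumption).
        private static int[] positions(String s) {
            int h1 = s.hashCode();
            int h2 = Integer.reverse(h1) ^ 0x5bd1e995;
            return new int[] { Math.floorMod(h1, BITS), Math.floorMod(h2, BITS) };
        }

        void add(String s) {
            for (int p : positions(s)) {
                bits.set(p);
            }
        }

        // May return true for an element never added (false positive),
        // but never returns false for one that was added.
        boolean mightContain(String s) {
            for (int p : positions(s)) {
                if (!bits.get(p)) {
                    return false;
                }
            }
            return true;
        }
    }

    private static final class Entry {
        final byte[] aggrStats;
        final Bloom partitions = new Bloom();

        Entry(byte[] aggrStats) {
            this.aggrStats = aggrStats;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    void put(String key, byte[] aggrStats, List<String> partNames) {
        Entry e = new Entry(aggrStats);
        partNames.forEach(e.partitions::add);
        entries.put(key, e);
    }

    boolean contains(String key) {
        return entries.containsKey(key);
    }

    // Removes every entry whose filter may contain the changed partition.
    // Returns the number of entries removed.
    int invalidate(String partName) {
        int before = entries.size();
        entries.values().removeIf(e -> e.partitions.mightContain(partName));
        return before - entries.size();
    }
}
```

A false positive only evicts an entry that could have stayed, so the cache remains correct at the cost of an occasional recomputation.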

Caching File Footers

• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic


HBaseMetaStore Needs Transaction

Atomicity is required
- Create table/partition also creates a storage descriptor
- Alter table also alters partitions
- Drop table also drops table column privileges

HBase doesn't support transactions
- No cross-row transactions

HBaseConnection
- Supports different transaction managers in theory
- VanillaHBaseConnection: no transaction

Omid

Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday

Snapshot isolation
- Natural, as HBase is a versioned database
- No locking, no deadlock, no blocking for both reads and writes
- When two concurrent transactions write to the same data, the later one aborts

Low overhead

Omid Components

TSO Server (Timestamp Oracle)
- Generates transid
- Tracks the status of transactions

TSO Client
- Talks to TSO
- Caches transaction metadata
- Most reads don't need to talk to TSO

Compactor
- Runs as an HBase Coprocessor
- Removes stale cell versions

[Diagram components: Client, TSO, HBase, Compactor]

Omid Operations

Open transaction
- Get transid from TSO

Read a cell
- Read all versions of the cell from HBase
- Return the latest version committed before the transaction started

Write a cell
- Write the value, versioned with transid, to HBase

Commit
- Get commitid from TSO
- TSO figures out whether there is a conflict using transaction metadata

Omid Data Structure

Memory management in TSO
- Never runs OOM: aborts old transactions

TSO state (from the slide diagram)
  lastCommit: row1 T20, row2 T25, row5 T22
  committed:  T10 T20, T4 T25, T11 T30
  aborted:    T2 ... ...

• Detect transaction conflict at commit time
• Largest chunk of memory
• Construct snapshot at read time
• Partially replicated to client
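Commit-time conflict detection with a lastCommit table can be sketched as below. This is a simplified model of the idea, not Omid's code: `TimestampOracleSketch`, the single counter used for both start and commit timestamps, and the `ABORTED` return value are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical single-process model of a timestamp oracle.
final class TimestampOracleSketch {
    static final long ABORTED = -1L;

    private long clock = 0;
    // Row key -> commit timestamp of the last transaction that wrote the row.
    private final Map<String, Long> lastCommit = new HashMap<>();

    // Open transaction: hand out a start timestamp (the transid).
    synchronized long begin() {
        return ++clock;
    }

    // Commit: abort if any row in the write set was committed by another
    // transaction after this one started; otherwise assign a commit timestamp
    // and record it for every written row.
    synchronized long tryCommit(long startTs, Set<String> writeSet) {
        for (String row : writeSet) {
            Long committed = lastCommit.get(row);
            if (committed != null && committed > startTs) {
                return ABORTED;
            }
        }
        long commitTs = ++clock;
        for (String row : writeSet) {
            lastCommit.put(row, commitTs);
        }
        return commitTs;
    }
}
```

If two transactions start concurrently and both write row1, the first `tryCommit` succeeds and the second returns `ABORTED`, matching the "later one aborts" rule of snapshot isolation described earlier.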

Transaction Conflict

Two concurrent DDLs write to the same data
- Proper retry logic

Task node writes: ORC footer cache
- High chance of write conflict
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypass the transaction layer

    public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
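The slide does not say how the retry logic is implemented; one possible shape is a bounded retry loop around the transactional operation. `ConflictRetry` and `ConflictException` are hypothetical names introduced for this sketch.

```java
import java.util.function.Supplier;

// Hypothetical helper: re-run a transactional operation when it aborts on conflict.
final class ConflictRetry {
    static final class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    // Runs txn up to maxAttempts times; rethrows the last conflict if all attempts fail.
    static <T> T withRetry(int maxAttempts, Supplier<T> txn) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return txn.get(); // each attempt is expected to open a fresh transaction
            } catch (ConflictException e) {
                last = e;
            }
        }
        throw last;
    }
}
```

A production version would likely add backoff between attempts; that is left out here to keep the sketch short.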


Deployment

Server side components in HBase
- Server side filter
- Omid compactor
- Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
- hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram components: Hive MetaStore, TSO, HBase, Server Side Filter, Omid Compactor]

Deploy Omid

Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table

Start Omid TSO
- omid.sh tso

Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT

Instantiate HBase Metastore

Instantiate HBase tables from scratch
- hive --service hbaseschematool --install

HBaseImport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport


TPCDS queries

[Figure: Query Plan Time for TPCDS queries. Queries 7, 15, 27, 29, 39, 46, 56, 68, 70, and 76; series: HBaseStore, HBaseStore+Omid, ObjectStore; axis ticks from 0 to 6000]

1824 partitions: sweet spot for ObjectStore
Average speed up for all TPCDS queries
- 219 (without Omid)
- 212 (with Omid)


Current Status

hbase-metastore branch merged to master last September
Turned off by default
Feature parity: almost
- Minor holes: event notification, version, constraints
- Deprecate listTableNamesByFilter / listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported

Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097, pending review
- Not production quality yet

Future Work - ACID

Transaction metadata is stored in Metastore
- Locks
- Txns
- Compactions

Data structure is harder to de-normalize
New work: transaction server
- Keep lock and transaction tree in memory

Future Work - HA via HBase Coprocessor

Two new server components
- Omid TSO Server
- Transaction Server

All servers need HA
- Management headache

Automatic HA through HBase Coprocessor

[Diagram: two Region Servers, each with a TSO Server via CoProcessor]

Future Work - Other

Stats aggregation
- Coprocessor

Improving ObjectCache
- Rudimentary implementation currently
- LRU

Omid consuming high CPU
- 300% CPU always, by design
- High throughput, avoids context switches
- Might be an issue for small systems

Thank You

Page 2: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

2 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

3 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

What is Hive MetaStore

Store Metadata about the datandash Databasendash Tablendash Partitionndash Privilegendash Rolendash Permanent UDFndash Statisticsndash Locksndash Transactionndash etc

Two modesndash Thrift Serverndash Embedded

Backendndash RDBMS Derby MSSQL MySQL Oracle PostGres

4 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Low latency in Hive

Hadoop is only for large jobndash Most jobs are small jobsndash User want to run both small and large

jobs in one system

Whatrsquos trending in Hive ndash Low latencyndash Stinger (Tez + ORC + Vectorization)

bull Bring query to 5-10sndash LLAP

bull Sub-second query TPC-DS query 27

5 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

New BottleNet - Metastore

Planning time is non-negligible Among planning significant amount of time spent on metadata fetching

6 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Besides Latency

Significantly more scalendash More metadata ndash millions of partitionsndash New large scale metadata ndash Split information ORC row group statisticsndash More calls ndash Handle orders of magnitude higher no of calls ndash From tasks

Reduce Complexityndash Object Relational Modeling is an impedance mismatchndash DataNucleusndash DBCP BoneCP or Hikaricp

7 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

ER Diagram for ObjectStore Database

8 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

How About Improving ObjectStore

Already happeningndash Using direct SQL instead of O-R

But

ndash Maintenance nightmarendash Handle syntax difference for databases

Re-engineering effort may not pay off Ultimate barrier Scalability

String queryText = select PARTITIONSPART_ID SDSSD_ID SDSCD_ID + SERDESSERDE_ID PARTITIONSCREATE_TIME + PARTITIONSLAST_ACCESS_TIME SDSINPUT_FORMAT SDSIS_COMPRESSED + SDSIS_STOREDASSUBDIRECTORIES SDSLOCATION SDSNUM_BUCKETS + SDSOUTPUT_FORMAT SERDESNAME SERDESSLIB + from PARTITIONS + left outer join SDS on PARTITIONSSD_ID = SDSSD_ID + left outer join SERDES on SDSSERDE_ID = SERDESSERDE_ID + where PART_ID in ( + partIds + ) order by PART_NAME asc

9 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

10 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

System Architecture

HiveMetaStore Thrift Server

ObjectStoreHBaseStore

RDBMSHBase

Omid

bull Two implementation of the RawStore interfacebull HBaseStorebull ObjectStore

bull Both backend will live together for a while

bull HBaseStorebull Most traffic will go through transaction

layer (Omid)bull Some traffic will bypass transaction layer

bull Volatile databull High possibility of conflict

HiveMetaStore Thrift Client

11 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

RDBMS schema

ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in

metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model

objects bull ORM opens transaction on RDBMS and writesreads values tofrom various tables in

RDBMS using appropriate foreign key references bull RDBMS fastpath enabled by not using ORM and writing direct SQL However

complicates testing matrix as there may be slight variations in SQL semantics for different RDBMS databases

Example adding a new partition ldquoadd_partition(Partition new_part)rdquo

12 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

RDBMS schema

ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in

metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model

objects bull ORM opens transaction on RDBMS and writes reads values to from various tables in

RDBMS using appropriate foreign key references

Example adding a new partition ldquoadd_partition(Partition new_part)rdquo

struct Partition

1 listltstringgt values

2 string dbName

3 string tableName

4 i32 createTime

5 i32 lastAccessTime

6 StorageDescriptor sd

7 mapltstring stringgt parameters

8 optional PrincipalPrivilegeSet privileges

TBLS

TBL_PRIVS

TBL_COL_PRIVS

PART_PRIVS

SDS

CDS

SORT_ORDER

SERDES

TYPE_FIELDS

PARTITIONS

PARTITION_KEY_VALS

PARTITION_PARAMS

BUCKETING_COLS

SORT_COLS

SD_PARAMS

SKEWED_COL_NAMES

SKEWED_VALUES

TABLE_PARAMS

13 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog ldquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

HBase schema

14 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Table Name Key Column Families and Columns

Description

HBMS_GLOBAL_PRIVS bytes(ldquogprdquo) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized PrincipalPrivilegeSet proto

HBMS_ROLES bytes(roleName) cf_catalog ldquorolesrdquo ldquorolesrdquo storeretrieve serialized Role proto

HBMS_USER_TO_ROLE bytes(userName) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized RoleList proto

HBMS_SECURITY bytes(delTokenId) cf_catalog ldquodtrdquo ldquomkrdquo ldquodtrdquo storeretrieve delegation token ldquomkrdquo master keys

HBMS_SEQUENCES bytes(sequence) cf_catalog ldquocrdquo ldquocrdquo storeretrieve sequences

HBase schema

15 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

De-normalization

bull Goalbull Optimized for queryingbull May slower in DDL bull Example drop_role(String roleName)

Key Value

bytes(ldquoUser 1rdquo) Proto(Role 1 Role 2 Role 3 Role 5)

bytes(ldquoUser 2rdquo) Proto(Role 1 Role 2)

bytes(ldquoUser 3rdquo) Proto(Role 4 Role 5)

bytes(ldquoUser 4rdquo) Proto (Role 2 Role 3)

HBMS_USER_TO_ROLE

bull Need to scan amp de-serialize everything in order to drop a role

16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Partition Keys

Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo

Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value

date state

201601 CA

201601 WA

201602 CA

201603 CA

201605 CA

Typed Partition Keys

Binary sorted
– HBase range scan: Scan(byte[] startRow, byte[] stopRow)
– Where key1 >= 'A5' and key2 >= 8
  • startRow: 41 35 00 00 00 00 08

Using BinarySortableSerDe
– Supports all Hive data types
– Handles null

(String, Integer) | Bytes
('A10', 3) | 41 31 30 00 00 00 00 03
('A10', 10) | 41 31 30 00 00 00 00 0A
('A5', 4) | 41 35 00 00 00 00 04
('A5', 15) | 41 35 00 00 00 00 0F
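A minimal sketch of why a binary-sortable key lets a typed predicate become a range scan. The layout below (string bytes, a 00 terminator, then a 4-byte big-endian integer) only mirrors the bytes shown on the slide for non-negative integers; it is an assumption for illustration and not the actual BinarySortableSerDe format. encode_key and scan_from are hypothetical helpers.

```python
import struct
from bisect import bisect_left


def encode_key(s: str, n: int) -> bytes:
    """Encode (string, non-negative int) so that byte order follows value order."""
    if n < 0:
        raise ValueError("this simplified layout only handles non-negative integers")
    return s.encode("ascii") + b"\x00" + struct.pack(">I", n)


# Rows sorted by raw bytes, the way HBase orders row keys.
rows = sorted(encode_key(s, n) for s, n in [("A5", 15), ("A10", 10), ("A5", 4), ("A10", 3)])


def scan_from(sorted_rows, start_row: bytes):
    """Return every row whose key is >= start_row, found by binary search."""
    return sorted_rows[bisect_left(sorted_rows, start_row):]


start_row = encode_key("A5", 8)
matches = scan_from(rows, start_row)
print(start_row.hex())
print([r.hex() for r in matches])
```

Note that 'A10' sorts before 'A5' because the comparison is byte-wise on the string, which is the expected order for a string-typed key.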

Storage Descriptor de-duplication

Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": metadata footer proto; "s": PPD stats


struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}


message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
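The de-duplication idea can be sketched as a content-addressed table with a reference count. This is a minimal in-memory illustration: the class name, the JSON stand-in for the serialized StorageDescriptor proto, and the sample values are all assumptions, not the HBaseStore implementation. It follows the proto layout above, where location lives in the Partition message so that partitions differing only in location can point at the same storage descriptor.

```python
import hashlib
import json


class StorageDescriptorTable:
    """In-memory stand-in for an HBMS_SDS-like table keyed by md5 of the serialized SD."""

    def __init__(self):
        self.rows = {}  # sd_hash -> {"c": serialized SD, "ref": reference count}

    def put(self, sd: dict) -> bytes:
        # JSON with sorted keys stands in for the serialized StorageDescriptor proto.
        payload = json.dumps(sd, sort_keys=True).encode("utf-8")
        sd_hash = hashlib.md5(payload).digest()
        row = self.rows.setdefault(sd_hash, {"c": payload, "ref": 0})
        row["ref"] += 1
        return sd_hash

    def release(self, sd_hash: bytes) -> None:
        # Drop one reference; delete the row when nothing points at it any more.
        row = self.rows[sd_hash]
        row["ref"] -= 1
        if row["ref"] == 0:
            del self.rows[sd_hash]


sds = StorageDescriptorTable()
shared_sd = {"cols": ["id:int", "name:string"], "input_format": "orc", "num_buckets": 4}
partitions = {}
for date in ("201601", "201602", "201603"):
    partitions[("db", "tbl", date)] = {
        "location": f"/warehouse/tbl/date={date}",  # per-partition, kept out of the shared SD
        "sd_hash": sds.put(dict(shared_sd)),
    }
print(len(partitions), "partitions share", len(sds.rows), "storage descriptor row(s)")
```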

HBase schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server passes the Thrift objects to the HBase client opened in the Thrift server
• HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables

Example: adding a new partition, "add_partition(Partition new_part)"


struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

HBase tables: HBMS_PARTITIONS, HBMS_SDS


Caching

Aggregate Stats
• Location: on HBase
• Compile time

File Footers
• Location: on HBase
• Runtime, accessed from tasks

Tables, Partitions, Storage Descriptors
• Location: on Metastore server(s)
• Compile time

Caching Aggregate Stats

"get_aggr_stats_for(dbName, tblName, partNames, colNames)"

• Gets aggregated stats for columns in each partition: an expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache: AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  • An AggrStatsBloomFilter is created on the partitions contained in the AggrStats
• Invalidation
  • TTL expiry: nodes are evicted from the cache
  • Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
  • An invalidator thread picks up the invalidation request and executes a filter on HBase to remove expired entries
  • Uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
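The bloom-filter based invalidation described above can be sketched as follows. This is a simplified in-memory model: BloomFilter, AggrStatsCache, the filter sizing, and the key separator are assumptions for illustration, and the real cache lives in HBase with an invalidator thread that is not modeled here.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: membership tests can give false positives, never false negatives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class AggrStatsCache:
    def __init__(self):
        self.entries = {}  # row key -> (aggregated stats, bloom filter over partition names)

    def put(self, db, tbl, part_names, col, stats) -> str:
        raw = "\x01".join([db, tbl, *part_names, col]).encode("utf-8")
        key = hashlib.md5(raw).hexdigest()
        bloom = BloomFilter()
        for name in part_names:
            bloom.add(name)
        self.entries[key] = (stats, bloom)
        return key

    def invalidate_partition(self, part_name: str) -> int:
        """Drop every cached aggregate that may include part_name; return how many were dropped."""
        stale = [k for k, (_, bloom) in self.entries.items() if bloom.might_contain(part_name)]
        for k in stale:
            del self.entries[k]
        return len(stale)


cache = AggrStatsCache()
k1 = cache.put("db", "tbl", ["date=201601", "date=201602"], "price", {"max": 99})
k2 = cache.put("db", "tbl", ["date=201605"], "price", {"max": 42})
dropped = cache.invalidate_partition("date=201601")
print(dropped, k1 in cache.entries)
```

A false positive only evicts an entry that could have stayed, so the filter trades a little extra recomputation for not storing the full partition list per entry.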

Caching File Footers

• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic


HBaseMetaStore Needs Transaction

Atomicity is required
– Create table/partition also creates the storage descriptor
– Alter table also alters partitions
– Drop table also drops table column privileges

HBase doesn't support transactions
– Doesn't support cross-row transactions

HBaseConnection
– Supports different transaction managers in theory
– VanillaHBaseConnection: no transaction

Omid

Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
– First release this Monday

Snapshot isolation
– Natural, as HBase is a versioned database
– No locking, no deadlock, no blocking for both reads and writes
– When two concurrent transactions write to the same data, the later one aborts

Low overhead

Omid Components

TSO Server (Timestamp Oracle)
– Generates transid
– Status of transaction

TSO Client
– Talks to TSO
– Caches transaction metadata
– Most reads don't need to talk to TSO

Compactor
– Runs as an HBase Coprocessor
– Removes stale cell versions

[Diagram components: Client, TSO, HBase, Compactor]

Omid Operations

Open transaction
– Get transid from TSO

Read a cell
– Read all versions of the cell from HBase
– Read the latest version committed before the transaction started

Write a cell
– Write the value, versioned with transid, to HBase

Commit
– Generate commitid from TSO
– TSO figures out whether there is a conflict using transaction metadata
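The "read a cell" step can be sketched as picking, among all stored versions, the one with the latest commit that precedes the reader's start. This is a simplified model: snapshot_read, the dict-based cell, and the timestamps are hypothetical, and Omid's actual client logic is more involved.

```python
def snapshot_read(cell_versions: dict, committed: dict, start_ts: int):
    """Return the value visible to a transaction that started at start_ts.

    cell_versions maps the writer's transid to the value it wrote;
    committed maps transid to commitid for transactions that committed.
    """
    visible = [
        (committed[transid], value)
        for transid, value in cell_versions.items()
        if transid in committed and committed[transid] < start_ts
    ]
    if not visible:
        return None
    # Latest commit wins, regardless of which writer started first.
    return max(visible, key=lambda pair: pair[0])[1]


# Versions written by transactions 2, 4, 10 and 11; transaction 2 never committed.
cell = {2: "x", 4: "b", 10: "a", 11: "c"}
committed = {10: 20, 4: 25, 11: 30}
print(snapshot_read(cell, committed, start_ts=26))
```

Readers never block writers here: an uncommitted or later-committed version is simply skipped.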

Omid Data Structure

Memory management in TSO
– Never runs OOM: aborts old transactions

TSO data structures

lastCommit: row1 → T20, row2 → T25, row5 → T22
• Detects transaction conflicts at commit time
• Largest chunk of memory

committed: T10 → T20, T4 → T25, T11 → T30, …
• Constructs the snapshot at read time
• Partially replicated to the client

aborted: T2, …
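Commit-time conflict detection with a per-row lastCommit map can be sketched as below. The class and method names are hypothetical and the model is single-threaded; it only illustrates the rule that a transaction aborts if any row it wrote was committed by someone else after it started.

```python
class TimestampOracle:
    """Toy TSO: hands out timestamps and decides commit or abort."""

    def __init__(self):
        self.clock = 0
        self.last_commit = {}  # row -> commit timestamp of the last committed write
        self.committed = {}    # start timestamp -> commit timestamp
        self.aborted = set()   # start timestamps of aborted transactions

    def begin(self) -> int:
        self.clock += 1
        return self.clock

    def commit(self, start_ts: int, write_set):
        # Conflict: some row in the write set was committed after this transaction started.
        if any(self.last_commit.get(row, 0) > start_ts for row in write_set):
            self.aborted.add(start_ts)
            return None
        self.clock += 1
        commit_ts = self.clock
        for row in write_set:
            self.last_commit[row] = commit_ts
        self.committed[start_ts] = commit_ts
        return commit_ts


tso = TimestampOracle()
t1 = tso.begin()
t2 = tso.begin()
first = tso.commit(t1, {"row1"})           # commits
second = tso.commit(t2, {"row1", "row2"})  # overlaps with t1 on row1, so it aborts
print(first, second, sorted(tso.aborted))
```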

Transaction Conflict

Two concurrent DDL statements write to the same data
– Proper retry logic

Task node writes: ORC footer cache
– High chance of write conflict
– Row mutation is atomic in HBase
– Cross-row atomicity is not required
– Bypass the transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
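The "proper retry logic" for conflicting DDL could look like the sketch below. The slide does not say how retries are implemented; TransactionConflict, run_with_retry, and the attempt limit are assumptions for illustration.

```python
class TransactionConflict(Exception):
    """Raised when the transaction layer aborts a transaction at commit time."""


def run_with_retry(operation, max_attempts: int = 3):
    """Run operation(attempt) in a fresh transaction each time, retrying on conflict."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(attempt)
        except TransactionConflict:
            if attempt == max_attempts:
                raise  # give up and surface the conflict to the caller


attempts = []


def flaky_ddl(attempt: int) -> str:
    # Hypothetical DDL that loses the race twice before committing.
    attempts.append(attempt)
    if attempt < 3:
        raise TransactionConflict("concurrent DDL wrote the same rows")
    return "committed"


result = run_with_retry(flaky_ddl)
print(result, attempts)
```

Footer-cache writes skip this path entirely, since each write touches a single row and HBase already applies a row mutation atomically.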


Deployment

Server-side components in HBase
– Server-side filter
– Omid compactor
– Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
– hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram components: Hive MetaStore, TSO, HBase (Server Side Filter, Omid Compactor)]

Deploy Omid

Create Omid tables in HBase
– omid.sh create-hbase-commit-table
– omid.sh create-hbase-timestamp-table

Start Omid TSO
– omid.sh tso

Related config in hive-site.xml
– hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
– tso.host=localhost
– tso.port=54758
– omid.client.connectionType=DIRECT

Instantiate HBase Metastore

Instantiate HBase tables from scratch
– hive --service hbaseschematool --install

hbaseimport: import an existing Hive Metastore
– One-way import from ObjectStore to HBaseStore
– hive --service hbaseimport


TPCDS queries

[Figure: Query Plan Time for TPCDS queries. Bar chart comparing HBaseStore, HBaseStore+Omid, and ObjectStore for Queries 7, 15, 27, 29, 39, 46, 56, 68, 70, and 76; vertical axis from 0 to 6000.]

1824 partitions: sweet spot for ObjectStore
Average speed-up for all TPCDS queries
– 2.19 (without Omid)
– 2.12 (with Omid)


Current Status

hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
– Minor holes: event notification/version/constraints
– Deprecate listTableNamesByFilter/listPartitionNamesByFilter
– Tools enhancement
– ACID is not supported

Runs most e2e queries
Fixing unit tests
– TestMiniTezCliDriver: all pass
– TestCliDriver: HIVE-14097 pending review
– Not production quality yet

Future Work - ACID

Transaction metadata is stored in Metastore
– Locks
– Txns
– Compactions

Data structure is harder to de-normalize
New work: transaction server
– Keep lock and transaction tree in memory

Future Work - HA via HBase Coprocessor

Two new server components
– Omid TSO Server
– Transaction Server

All servers need HA
– Management headache

Automatic HA through HBase Coprocessor

[Diagram: two Region Servers, each labeled "TSO Server via CoProcessor"]

Future Work - Other

Stats aggregation
– Coprocessor

Improving ObjectCache
– Rudimentary implementation currently
– LRU

Omid consuming high CPU
– 300% CPU always, by design
– High throughput, avoids context switches
– Might be an issue for small systems


Thank You


ER Diagram for ObjectStore Database

How About Improving ObjectStore

Already happening
– Using direct SQL instead of O-R

But
– Maintenance nightmare
– Handle syntax differences between databases

Re-engineering effort may not pay off
Ultimate barrier: scalability

String queryText = "select PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID,"
    + " SERDES.SERDE_ID, PARTITIONS.CREATE_TIME,"
    + " PARTITIONS.LAST_ACCESS_TIME, SDS.INPUT_FORMAT, SDS.IS_COMPRESSED,"
    + " SDS.IS_STOREDASSUBDIRECTORIES, SDS.LOCATION, SDS.NUM_BUCKETS,"
    + " SDS.OUTPUT_FORMAT, SERDES.NAME, SERDES.SLIB"
    + " from PARTITIONS"
    + " left outer join SDS on PARTITIONS.SD_ID = SDS.SD_ID"
    + " left outer join SERDES on SDS.SERDE_ID = SERDES.SERDE_ID"
    + " where PART_ID in (" + partIds + ") order by PART_NAME asc";


System Architecture

[Diagram components: HiveMetaStore Thrift Client; HiveMetaStore Thrift Server; ObjectStore and HBaseStore; RDBMS and HBase; Omid]

• Two implementations of the RawStore interface
  • HBaseStore
  • ObjectStore
• Both backends will live together for a while
• HBaseStore
  • Most traffic will go through the transaction layer (Omid)
  • Some traffic will bypass the transaction layer
    • Volatile data
    • High possibility of conflict

RDBMS schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server extracts values from the Thrift objects and creates the corresponding ORM model objects
• ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS using the appropriate foreign key references
• The RDBMS fastpath is enabled by not using ORM and writing direct SQL. However, this complicates the testing matrix, as there may be slight variations in SQL semantics between RDBMS databases

Example: adding a new partition, "add_partition(Partition new_part)"


struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

RDBMS tables: TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS

13 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog ldquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

HBase schema

14 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Table Name Key Column Families and Columns

Description

HBMS_GLOBAL_PRIVS bytes(ldquogprdquo) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized PrincipalPrivilegeSet proto

HBMS_ROLES bytes(roleName) cf_catalog ldquorolesrdquo ldquorolesrdquo storeretrieve serialized Role proto

HBMS_USER_TO_ROLE bytes(userName) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized RoleList proto

HBMS_SECURITY bytes(delTokenId) cf_catalog ldquodtrdquo ldquomkrdquo ldquodtrdquo storeretrieve delegation token ldquomkrdquo master keys

HBMS_SEQUENCES bytes(sequence) cf_catalog ldquocrdquo ldquocrdquo storeretrieve sequences

HBase schema

15 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

De-normalization

bull Goalbull Optimized for queryingbull May slower in DDL bull Example drop_role(String roleName)

Key Value

bytes(ldquoUser 1rdquo) Proto(Role 1 Role 2 Role 3 Role 5)

bytes(ldquoUser 2rdquo) Proto(Role 1 Role 2)

bytes(ldquoUser 3rdquo) Proto(Role 4 Role 5)

bytes(ldquoUser 4rdquo) Proto (Role 2 Role 3)

HBMS_USER_TO_ROLE

bull Need to scan amp de-serialize everything in order to drop a role

16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Partition Keys

Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo

Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value

date state

201601 CA

201601 WA

201602 CA

201603 CA

201605 CA

17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Typed Partition Keys

Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)

ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08

Using BinarySortableSerDendash Support all Hive data typesndash Handles null

(String Integer) Bytes

lsquoA10rsquo 3 41 31 30 00 00 00 00 03

lsquoA10rsquo 10 41 31 30 00 00 00 00 0A

lsquoA5rsquo 4 41 35 00 00 00 00 04

lsquoA5rsquo 15 41 35 00 00 00 00 0D

18 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Storage Descriptor de-duplication

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog lsquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

19 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Storage Descriptor de-duplication

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog lsquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

struct Partition

1 listltstringgt values

2 string dbName

3 string tableName

4 i32 createTime

5 i32 lastAccessTime

6 StorageDescriptor sd

7 mapltstring stringgt parameters

8 optional PrincipalPrivilegeSet privileges

struct StorageDescriptor

1 listltFieldSchemagt cols

2 string location

3 string inputFormat

4 string outputFormat

5 bool compressed

6 i32 numBuckets

7 SerDeInfo serdeInfo

8 listltstringgt bucketCols

9 listltOrdergt sortCols

10 mapltstring stringgt parameters

11 optional SkewedInfo skewedInfo

12 optional bool storedAsSubDirectories

20 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Storage Descriptor de-duplication

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog lsquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

message Partition

optional int64 create_time = 1

optional int64 last_access_time = 2

optional string location = 3

optional Parameters sd_parameters = 4

required bytes sd_hash = 5

optional Parameters parameters = 6

message StorageDescriptor

message Order hellip

message SerDeInfo hellip

message SkewedInfo hellip

repeated FieldSchema cols = 1

optional string input_format = 2

optional string output_format = 3

optional bool is_compressed = 4

optional sint32 num_buckets = 5

optional SerDeInfo serde_info = 6

repeated string bucket_cols = 7

repeated Order sort_cols = 8

optional SkewedInfo skewed_info = 9

optional bool stored_as_sub_directories = 10

21 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

HBase schema

ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in

metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding

protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)

bull Writesreads the protobuf payloads tofrom HBase tables

Example adding a new partition ldquoadd_partition(Partition new_part)rdquo

22 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

HBase schema

ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in

metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding

protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)

bull Writesreads the protobuf payloads tofrom HBase tables

Example adding a new partition ldquoadd_partition(Partition new_part)rdquo

struct Partition

1 listltstringgt values

2 string dbName

3 string tableName

4 i32 createTime

5 i32 lastAccessTime

6 StorageDescriptor sd

7 mapltstring stringgt parameters

8 optional PrincipalPrivilegeSet privileges

message Partition

optional int64 create_time = 1

optional int64 last_access_time = 2

optional string location = 3

optional Parameters sd_parameters = 4

required bytes sd_hash = 5

optional Parameters parameters = 6

HBMS_

PARTITIONS

HBMS_

SDS

23 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

24 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Caching

Aggregate Statsbull Location - on HBasebull Compile time

File Footers bull Location - on HBasebull Runtime - accessed from tasks

Tables Partitions Storage Descriptors bull Location - on Metastore server(s)bull Compile time

25 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Caching Aggregate Stats

ldquoget_aggr_stats_for(dbName tblName partNames colNames)rdquo

bull Gets aggregated stats for columns in each partition ndash expensive callbull Used in CBO Stats Annotation Stats Optimizerbull HBMS_AGGR_STATS

bull RowKey md5(dbName tblName partVal1 partValn colName) bull Columns AggrStats proto and AggrStatsBloomFilter proto

bull Lookup bull New entry added for each key not found in cache AggrStats calculated on client

side amp cached entry saved as serialized AggrStats proto bull AggrStatsBloomFilter created on partitions contained in AggrStats

bull Invalidation bull TTL expiry nodes evicted from cachebull Alter partition Drop partition Analyze etc add invalidation request to a queuebull Invalidator thread picks invalidation request amp executes a filter on HBase to

removes expired entriesbull Uses the bloom filter to find all AggrStats proto contains the candidate partition amp

removes them from the cache

26 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Caching File Footers

bull ORC footer cachebull Task write file footers to a cache table on HBase (HBMS_FILE_METADATA RowKey fileId)bull Read from AM for split generation (avoids reading lots of HDFS files for split generation)bull Since fileId is unique overwrite not a problem Stale entries removed by a cleaner

thread

bull Skip transactionbull High overheadbull Transaction conflictbull Row mutation is already atomic

27 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

28 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

HBaseMetaStore Needs Transaction

Atomicity is required
  • Create table/partition: also create the storage descriptor
  • Alter table: also alter partitions
  • Drop table: also drop table column privileges

HBase doesn't support transactions
  • No support for cross-row transactions

HBaseConnection
  • Supports different transaction managers, in theory
  • VanillaHBaseConnection: no transaction

Omid

Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
  • First release this Monday

Snapshot isolation
  • Natural, as HBase is a versioned database
  • No locking, no deadlock, no blocking for both reads and writes
  • If two concurrent transactions write to the same data, the later one aborts

Low overhead

Omid Components

TSO Server (Timestamp Oracle)
  • Generates transid
  • Tracks the status of transactions

TSO Client
  • Talks to the TSO
  • Caches transaction metadata
  • Most reads don't need to talk to the TSO

Compactor
  • Runs as an HBase Coprocessor
  • Removes stale cell versions

(Diagram: Client, TSO, and HBase with the Compactor)

Omid Operations

Open transaction
  • Get transid from the TSO

Read a cell
  • Read all versions of the cell from HBase
  • Pick the latest version committed before the transaction started

Write a cell
  • Write the value, versioned with transid, to HBase

Commit
  • Get a commitid from the TSO
  • The TSO figures out whether there is a conflict using transaction metadata
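
The read rule above (pick the latest version committed before the transaction started) can be illustrated with a small model. This is a sketch of the idea only, not Omid's client API: the class SnapshotReadSketch and its two maps are hypothetical, and timestamps are plain longs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a snapshot read over one cell; not Omid's client API. */
final class SnapshotReadSketch {
  /** transid of the writer -> commitid; transactions that are absent never committed. */
  final Map<Long, Long> committed = new HashMap<>();
  /** Versions of one cell, keyed by the transid that wrote them. */
  final TreeMap<Long, String> cellVersions = new TreeMap<>();

  /** Returns the value with the largest commitid below startTs, or null if none is visible. */
  String read(long startTs) {
    String visible = null;
    long bestCommit = -1;
    for (Map.Entry<Long, String> version : cellVersions.entrySet()) {
      Long commitId = committed.get(version.getKey());
      // Skip uncommitted or aborted writers and anything committed after we started.
      if (commitId != null && commitId < startTs && commitId > bestCommit) {
        bestCommit = commitId;
        visible = version.getValue();
      }
    }
    return visible;
  }
}
```

Because readers only consult commit metadata, they never block writers, which matches the "no locking" property claimed for snapshot isolation.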

Omid Data Structure

Memory management in TSO
  • Never runs out of memory: aborts old transactions

Data structures held by the TSO:

lastCommit (row -> commitid of the last write)
    row1   T20
    row2   T25
    row5   T22
  • Detects transaction conflicts at commit time
  • Largest chunk of memory

committed (transid -> commitid)
    T10   T20
    T4    T25
    T11   T30
  • Used to construct the snapshot at read time
  • Partially replicated to the client

aborted
    T2   ...
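
How the lastCommit table detects a write-write conflict at commit time can be sketched as follows. This is a simplified model under stated assumptions, not Omid's TSO: the class TsoSketch, the single counter used for both transids and commitids, and the unbounded in-memory maps are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy timestamp oracle; a sketch of the idea, not Omid's TSO implementation. */
final class TsoSketch {
  private long clock = 0;
  final Map<String, Long> lastCommit = new HashMap<>(); // row -> commitid of last writer
  final Map<Long, Long> committed = new HashMap<>();    // transid -> commitid
  final Set<Long> aborted = new HashSet<>();

  /** Open transaction: hand out the next timestamp as the transid. */
  long begin() {
    return ++clock;
  }

  /**
   * Commit: abort if any written row was committed by another transaction
   * after this one started; otherwise record the commit and return true.
   */
  boolean commit(long transId, Set<String> writtenRows) {
    for (String row : writtenRows) {
      Long last = lastCommit.get(row);
      if (last != null && last > transId) {
        aborted.add(transId);
        return false;
      }
    }
    long commitId = ++clock;
    for (String row : writtenRows) {
      lastCommit.put(row, commitId);
    }
    committed.put(transId, commitId);
    return true;
  }
}
```

With two transactions that both write row1, whichever commits second sees a lastCommit entry newer than its own transid and aborts, which is the "later one aborts" rule of snapshot isolation.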

Transaction Conflict

Two concurrent DDL statements write to the same data
  • Proper retry logic

Task node writes: ORC footer cache
  • High chance of write conflict
  • Row mutation is atomic in HBase
  • Cross-row atomicity is not required
  • Bypass the transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
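
One possible shape for the "proper retry logic" is a bounded retry loop around the whole DDL operation. This is a generic sketch, not code from Hive: RetrySketch and runWithRetry are hypothetical names, and the attempt callback is assumed to open a fresh transaction each time so that it works from a new snapshot.

```java
import java.util.function.BooleanSupplier;

/** Generic bounded retry; the attempt returns true when its transaction committed. */
final class RetrySketch {
  /** Returns the number of attempts used, or -1 if every attempt hit a conflict. */
  static int runWithRetry(BooleanSupplier attempt, int maxAttempts) {
    for (int i = 1; i <= maxAttempts; i++) {
      if (attempt.getAsBoolean()) {
        return i;
      }
    }
    return -1;
  }
}
```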


Deployment

Server-side components in HBase
  • Server-side filter
  • Omid compactor
  • Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
  • hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore

(Diagram: Hive MetaStore, TSO, and HBase hosting the Server Side Filter and the Omid Compactor)

Deploy Omid

Create Omid tables in HBase
  • omid.sh create-hbase-commit-table
  • omid.sh create-hbase-timestamp-table

Start Omid TSO
  • omid.sh tso

Related config in hive-site.xml
  • hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
  • tso.host=localhost
  • tso.port=54758
  • omid.client.connectionType=DIRECT

Instantiate HBase Metastore

Instantiate HBase tables from scratch
  • hive --service hbaseschematool --install

hbaseimport: import an existing Hive Metastore
  • One-way import from ObjectStore to HBaseStore
  • hive --service hbaseimport


TPCDS queries

(Chart: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. Queries: 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Vertical axis: 0 to 6000.)

1824 partitions
Sweet spot for ObjectStore
Average speedup for all TPCDS queries
  • 2.19 (without Omid)
  • 2.12 (with Omid)


Current Status

hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
  • Minor holes: event notification, version, constraints
  • Deprecate listTableNamesByFilter / listPartitionNamesByFilter
  • Tools enhancement
  • ACID is not supported

Runs most e2e queries
Fixing unit tests
  • TestMiniTezCliDriver: all pass
  • TestCliDriver: HIVE-14097 pending review
  • Not production quality yet

Future Work - ACID

Transaction metadata is stored in the Metastore
  • Locks
  • Txns
  • Compactions

Data structure is harder to de-normalize
New work: transaction server
  • Keep the lock and transaction tree in memory

Future Work - HA via HBase Coprocessor

Two new server components
  • Omid TSO Server
  • Transaction Server

All servers need HA
  • Management headache

Automatic HA through HBase Coprocessor

(Diagram: two Region Servers, each hosting a TSO Server via CoProcessor)

Future Work - Other

Stats aggregation
  • Coprocessor

Improving ObjectCache
  • Rudimentary implementation currently
  • LRU

Omid consumes high CPU
  • 300% CPU always, by design
  • High throughput: avoids context switches
  • Might be an issue for small systems
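
As an illustration of the LRU policy mentioned for ObjectCache, the JDK's LinkedHashMap can evict the least recently used entry once a capacity is exceeded. This is only a sketch of the policy: LruCacheSketch is a hypothetical class and says nothing about how Hive's ObjectCache is or will be implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache built on LinkedHashMap's access-order mode. */
final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;
  private final int capacity;

  LruCacheSketch(int capacity) {
    super(16, 0.75f, true); // accessOrder = true: iteration order is least recently used first
    this.capacity = capacity;
  }

  /** Called by LinkedHashMap after each insertion; true evicts the eldest entry. */
  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > capacity;
  }
}
```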

Thank You

  • Hive Hbase Metastore - Improving Hive with a Big Data Metadata
  • Agenda
  • What is Hive MetaStore
  • Low latency in Hive
  • New Bottleneck - Metastore
  • Besides Latency
  • ER Diagram for ObjectStore Database
  • How About Improving ObjectStore
  • Agenda (2)
  • System Architecture
  • RDBMS schema
  • RDBMS schema (2)
  • HBase schema
  • HBase schema (2)
  • De-normalization
  • Partition Keys
  • Typed Partition Keys
  • HBase schema (3)
  • HBase schema (4)
  • HBase schema (5)
  • HBase schema (6)
  • HBase schema (7)
  • Agenda (3)
  • Caching
  • Caching Aggregate Stats
  • Caching File Footers
  • Agenda (4)
  • HBaseMetaStore Needs Transaction
  • Omid
  • Omid Components
  • Omid Operations
  • Omid Data Structure
  • Transaction Conflict
  • Agenda (5)
  • Deployment
  • Deploy Omid
  • Instantiate HBase Metastore
  • Agenda (6)
  • TPCDS queries
  • Agenda (7)
  • Current Status
  • Future Work - ACID
  • Future Work - HA via HBase Coprocessor
  • Future Work - Other
  • Thank You

ER Diagram for ObjectStore Database

How About Improving ObjectStore

Already happening
  • Using direct SQL instead of O-R

But
  • Maintenance nightmare
  • Must handle syntax differences between databases

Re-engineering effort may not pay off
Ultimate barrier: scalability

    String queryText =
        "select PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID,"
      + " SERDES.SERDE_ID, PARTITIONS.CREATE_TIME,"
      + " PARTITIONS.LAST_ACCESS_TIME, SDS.INPUT_FORMAT, SDS.IS_COMPRESSED,"
      + " SDS.IS_STOREDASSUBDIRECTORIES, SDS.LOCATION, SDS.NUM_BUCKETS,"
      + " SDS.OUTPUT_FORMAT, SERDES.NAME, SERDES.SLIB"
      + " from PARTITIONS"
      + " left outer join SDS on PARTITIONS.SD_ID = SDS.SD_ID"
      + " left outer join SERDES on SDS.SERDE_ID = SERDES.SERDE_ID"
      + " where PART_ID in (" + partIds + ") order by PART_NAME asc";


System Architecture

(Diagram: HiveMetaStore Thrift Client; HiveMetaStore Thrift Server; ObjectStore and HBaseStore; RDBMS and HBase; Omid)

  • Two implementations of the RawStore interface
      - HBaseStore
      - ObjectStore
  • Both backends will live together for a while
  • HBaseStore
      - Most traffic will go through the transaction layer (Omid)
      - Some traffic will bypass the transaction layer
          · Volatile data
          · High possibility of conflict

RDBMS schema

Read/Write path
  • Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
  • Thrift Server extracts values from the Thrift objects and creates the corresponding ORM model objects
  • ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS using appropriate foreign key references
  • The RDBMS fastpath is enabled by bypassing the ORM and writing direct SQL. However, this complicates the testing matrix, as there may be slight variations in SQL semantics between different RDBMS databases

Example: adding a new partition, "add_partition(Partition new_part)"

  struct Partition {
    1: list<string> values
    2: string dbName
    3: string tableName
    4: i32 createTime
    5: i32 lastAccessTime
    6: StorageDescriptor sd
    7: map<string, string> parameters
    8: optional PrincipalPrivilegeSet privileges
  }

RDBMS tables (from the slide diagram): TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS

HBase schema

Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": metadata footer proto; "s": PPD stats

HBase schema

Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences

De-normalization

  • Goal
      - Optimized for querying
      - May be slower in DDL
  • Example: drop_role(String roleName)

HBMS_USER_TO_ROLE

Key | Value
bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") | Proto(Role 1, Role 2)
bytes("User 3") | Proto(Role 4, Role 5)
bytes("User 4") | Proto(Role 2, Role 3)

  • Need to scan and de-serialize everything in order to drop a role
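
Why drop_role is expensive on the de-normalized table can be seen in a small model. This is a sketch, not HBaseStore's code: DropRoleSketch is a hypothetical class, a Java map stands in for HBMS_USER_TO_ROLE, and copying the list stands in for de-serializing and re-serializing the RoleList proto.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Models drop_role over a user -> roles table that has no role -> users index. */
final class DropRoleSketch {
  /** Removes roleName from every user's role list; returns how many rows were rewritten. */
  static int dropRole(Map<String, List<String>> userToRoles, String roleName) {
    int rewritten = 0;
    for (Map.Entry<String, List<String>> row : userToRoles.entrySet()) { // full table scan
      List<String> roles = new ArrayList<>(row.getValue());             // "de-serialize"
      if (roles.remove(roleName)) {
        row.setValue(roles);                                              // write the row back
        rewritten++;
      }
    }
    return rewritten;
  }
}
```

With the four users from the table above, dropping "Role 2" touches every row and rewrites three of them, while a lookup of one user's roles stays a single get.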

Partition Keys

Range scan for most queries
  • Where date = '201601' and state = 'CA'
  • Where date >= '201602' and date < '201604'

Server-side filter for the rest
  • Where state = 'CA' (not a prefix key)
  • Where date like '2016%' (regex)
  • Where date > '201601' and state > 'OR' (cannot be a range scan)
  • Scans all keys but does not deserialize values

date | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
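
The difference between a range scan and a server-side filter can be modeled with a sorted map. This is a sketch only: PartitionScanSketch is a hypothetical class, a TreeMap stands in for the sorted HBMS_PARTITIONS row keys, and the "date/state" string key is a simplification of the real binary key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Models prefix range scans versus full-scan filtering on sorted partition keys. */
final class PartitionScanSketch {
  /** Builds the five partitions from the slide, keyed as "date/state" in sorted order. */
  static NavigableMap<String, String> sampleTable() {
    NavigableMap<String, String> table = new TreeMap<>();
    String[] keys = {"201601/CA", "201601/WA", "201602/CA", "201603/CA", "201605/CA"};
    for (String key : keys) {
      table.put(key, "serialized partition " + key);
    }
    return table;
  }

  /** Range scan: only keys in [startRow, stopRow) are visited. */
  static List<String> rangeScan(NavigableMap<String, String> table, String startRow, String stopRow) {
    return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
  }

  /** Filter: every key is visited, but values are never touched. */
  static List<String> filterByState(NavigableMap<String, String> table, String state) {
    List<String> hits = new ArrayList<>();
    for (String key : table.keySet()) {
      if (key.endsWith("/" + state)) {
        hits.add(key);
      }
    }
    return hits;
  }
}
```

A predicate on date alone, or on date plus state, maps to a contiguous key range; a predicate on state alone does not, because state is not a prefix of the key.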

Typed Partition Keys

Binary sorted
  • HBase range scan: Scan(byte[] startRow, byte[] stopRow)
  • Where key1 >= 'A5' and key2 >= 8
      - startRow: 41 35 00 00 00 00 08

Using BinarySortableSerDe
  • Supports all Hive data types
  • Handles null

(String, Integer) | Bytes
'A10', 3 | 41 31 30 00 00 00 00 03
'A10', 10 | 41 31 30 00 00 00 00 0A
'A5', 4 | 41 35 00 00 00 00 04
'A5', 15 | 41 35 00 00 00 00 0F
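
The byte layout shown in the table can be reproduced with a few lines of Java. This follows the slide's simplified layout (string bytes, a 0x00 terminator, then a 4-byte big-endian integer) and assumes non-negative integers; SortableKeySketch is a hypothetical class, and the actual BinarySortableSerDe encoding may differ in its details.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Encodes a (String, int) partition key so that byte order matches value order. */
final class SortableKeySketch {
  /** String bytes, 0x00 terminator, then the int as 4 big-endian bytes. */
  static byte[] encode(String s, int n) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] str = s.getBytes(StandardCharsets.UTF_8);
    out.write(str, 0, str.length);
    out.write(0);
    out.write(n >>> 24); // write(int) keeps only the low 8 bits
    out.write(n >>> 16);
    out.write(n >>> 8);
    out.write(n);
    return out.toByteArray();
  }

  /** Unsigned lexicographic comparison, the order HBase uses for row keys. */
  static int compare(byte[] a, byte[] b) {
    int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  /** Formats bytes as space-separated upper-case hex, as in the table above. */
  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xff));
    }
    return sb.toString();
  }
}
```

Because integers are fixed-width and big-endian, 4 sorts before 15 under byte comparison, which would not hold if the number were stored as the strings "4" and "15".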

Storage Descriptor de-duplication

Relevant rows of the HBase schema:

Table Name | Key | Column Families and Columns | Description
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Partition proto; "s": stats per column in the partition

Thrift definitions:

  struct Partition {
    1: list<string> values
    2: string dbName
    3: string tableName
    4: i32 createTime
    5: i32 lastAccessTime
    6: StorageDescriptor sd
    7: map<string, string> parameters
    8: optional PrincipalPrivilegeSet privileges
  }

  struct StorageDescriptor {
    1: list<FieldSchema> cols
    2: string location
    3: string inputFormat
    4: string outputFormat
    5: bool compressed
    6: i32 numBuckets
    7: SerDeInfo serdeInfo
    8: list<string> bucketCols
    9: list<Order> sortCols
    10: map<string, string> parameters
    11: optional SkewedInfo skewedInfo
    12: optional bool storedAsSubDirectories
  }

Protobuf definitions:

  message Partition {
    optional int64 create_time = 1;
    optional int64 last_access_time = 2;
    optional string location = 3;
    optional Parameters sd_parameters = 4;
    required bytes sd_hash = 5;
    optional Parameters parameters = 6;
  }

  message StorageDescriptor {
    message Order { ... }
    message SerDeInfo { ... }
    message SkewedInfo { ... }
    repeated FieldSchema cols = 1;
    optional string input_format = 2;
    optional string output_format = 3;
    optional bool is_compressed = 4;
    optional sint32 num_buckets = 5;
    optional SerDeInfo serde_info = 6;
    repeated string bucket_cols = 7;
    repeated Order sort_cols = 8;
    optional SkewedInfo skewed_info = 9;
    optional bool stored_as_sub_directories = 10;
  }
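
The de-duplication scheme implied by sd_hash and the "ref" column can be sketched as a content-addressed store with reference counts. This is an illustrative model, not HBaseStore's code: SdDedupSketch is a hypothetical class, strings stand in for serialized StorageDescriptor protos, and two maps stand in for the "c" and "ref" columns of HBMS_SDS.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Stores each distinct storage descriptor once, keyed by its md5, with a reference count. */
final class SdDedupSketch {
  final Map<String, String> sds = new HashMap<>();        // md5 -> serialized SD ("c" column)
  final Map<String, Integer> refCounts = new HashMap<>(); // md5 -> reference count ("ref" column)

  static String md5Hex(String s) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // MD5 is required on every Java platform
    }
  }

  /** Adds a partition that uses this SD and returns the sd_hash the partition row would store. */
  String addPartition(String serializedSd) {
    String hash = md5Hex(serializedSd);
    sds.putIfAbsent(hash, serializedSd);
    refCounts.merge(hash, 1, Integer::sum);
    return hash;
  }

  /** Drops one reference; the SD row is deleted when nothing points at it any more. */
  void dropPartition(String sdHash) {
    Integer count = refCounts.get(sdHash);
    if (count == null) {
      return;
    }
    if (count == 1) {
      refCounts.remove(sdHash);
      sds.remove(sdHash);
    } else {
      refCounts.put(sdHash, count - 1);
    }
  }
}
```

This only pays off if fields that differ per partition are kept out of the hashed descriptor, which is consistent with location and sd_parameters appearing in the Partition message rather than in StorageDescriptor.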

HBase schema

Read/Write path
  • Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
  • Thrift Server passes the Thrift objects to the HBase client opened in the Thrift server
  • HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
  • Writes/reads the protobuf payloads to/from HBase tables

Example: adding a new partition, "add_partition(Partition new_part)"
  • Input: Thrift struct Partition (values, dbName, tableName, createTime, lastAccessTime, sd, parameters, privileges)
  • Stored as: protobuf message Partition (create_time, last_access_time, location, sd_parameters, sd_hash, parameters)
  • Written to: HBMS_PARTITIONS and HBMS_SDS

Page 5: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

5 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

New BottleNet - Metastore

Planning time is non-negligible Among planning significant amount of time spent on metadata fetching

6 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Besides Latency

Significantly more scalendash More metadata ndash millions of partitionsndash New large scale metadata ndash Split information ORC row group statisticsndash More calls ndash Handle orders of magnitude higher no of calls ndash From tasks

Reduce Complexityndash Object Relational Modeling is an impedance mismatchndash DataNucleusndash DBCP BoneCP or Hikaricp

7 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

ER Diagram for ObjectStore Database

8 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

How About Improving ObjectStore

Already happeningndash Using direct SQL instead of O-R

But

ndash Maintenance nightmarendash Handle syntax difference for databases

Re-engineering effort may not pay off Ultimate barrier Scalability

String queryText = select PARTITIONSPART_ID SDSSD_ID SDSCD_ID + SERDESSERDE_ID PARTITIONSCREATE_TIME + PARTITIONSLAST_ACCESS_TIME SDSINPUT_FORMAT SDSIS_COMPRESSED + SDSIS_STOREDASSUBDIRECTORIES SDSLOCATION SDSNUM_BUCKETS + SDSOUTPUT_FORMAT SERDESNAME SERDESSLIB + from PARTITIONS + left outer join SDS on PARTITIONSSD_ID = SDSSD_ID + left outer join SERDES on SDSSERDE_ID = SERDESSERDE_ID + where PART_ID in ( + partIds + ) order by PART_NAME asc

9 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

System Architecture

[Diagram components: HiveMetaStore Thrift Client, HiveMetaStore Thrift Server, ObjectStore and HBaseStore, RDBMS and HBase, Omid]

• Two implementations of the RawStore interface
  • HBaseStore
  • ObjectStore
• Both backends will live together for a while
• HBaseStore
  • Most traffic will go through the transaction layer (Omid)
  • Some traffic will bypass the transaction layer
    • Volatile data
    • High possibility of conflict

RDBMS schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server extracts values from the Thrift objects and creates corresponding ORM model objects
• ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS using appropriate foreign key references
• RDBMS fast path is enabled by bypassing the ORM and writing direct SQL. However, this complicates the testing matrix, as there may be slight variations in SQL semantics across RDBMS databases

Example: adding a new partition "add_partition(Partition new_part)"

    struct Partition {
      1: list<string> values,
      2: string dbName,
      3: string tableName,
      4: i32 createTime,
      5: i32 lastAccessTime,
      6: StorageDescriptor sd,
      7: map<string,string> parameters,
      8: optional PrincipalPrivilegeSet privileges
    }

RDBMS tables shown on the slide: TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS

HBase schema

Table Name         | Key                                                          | Column Families and Columns               | Description
HBMS_DBS           | bytes(dbName)                                                | cf_catalog: "c"                           | "c": Database proto
HBMS_SDS           | bytes(md5(SD proto))                                         | cf_catalog: "c", "ref"                    | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS          | bytes(dbName, tblName)                                       | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS    | bytes(dbName, tblName, partVal1, ..., partValn)              | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS    | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b"                      | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS         | bytes(dbName, funcName)                                      | cf_catalog: "c"                           | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId)                                                | cf_catalog: "c"; cf_stats: "s"            | "c": metadata footer proto; "s": PPD stats

HBase schema (continued)

Table Name        | Key               | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp")       | cf_catalog: "c"             | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES        | bytes(roleName)   | cf_catalog: "roles"         | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName)   | cf_catalog: "c"             | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY     | bytes(delTokenId) | cf_catalog: "dt", "mk"      | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES    | bytes(sequence)   | cf_catalog: "c"             | "c": store/retrieve sequences

De-normalization

• Goal
  • Optimized for querying
  • May be slower in DDL
• Example: drop_role(String roleName)

HBMS_USER_TO_ROLE

Key             | Value
bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") | Proto(Role 1, Role 2)
bytes("User 3") | Proto(Role 4, Role 5)
bytes("User 4") | Proto(Role 2, Role 3)

• Need to scan and de-serialize everything in order to drop a role
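
The trade-off above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `UserToRoleSketch` and its methods are hypothetical names, and an in-memory map stands in for the HBMS_USER_TO_ROLE table (in HBase the value would be a serialized RoleList proto). Looking up a user's roles is a single keyed read, while dropping a role has to visit every row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for HBMS_USER_TO_ROLE: row key = user name, value = that user's role list. */
class UserToRoleSketch {
    // Hypothetical in-memory table; not the real HBaseStore data structure.
    private final Map<String, List<String>> rows = new TreeMap<>();

    void grant(String user, String... roles) {
        List<String> list = rows.computeIfAbsent(user, u -> new ArrayList<>());
        for (String r : roles) {
            if (!list.contains(r)) list.add(r);
        }
    }

    /** Query path: one point lookup by row key. */
    List<String> rolesOf(String user) {
        return rows.getOrDefault(user, new ArrayList<>());
    }

    /** DDL path: every row is visited. Returns the number of rows that had to be rewritten. */
    int dropRole(String role) {
        int rewritten = 0;
        for (List<String> roles : rows.values()) {
            if (roles.remove(role)) rewritten++;
        }
        return rewritten;
    }

    public static void main(String[] args) {
        UserToRoleSketch table = new UserToRoleSketch();
        table.grant("User 1", "Role 1", "Role 2", "Role 3", "Role 5");
        table.grant("User 2", "Role 1", "Role 2");
        table.grant("User 3", "Role 4", "Role 5");
        table.grant("User 4", "Role 2", "Role 3");
        System.out.println("rows rewritten: " + table.dropRole("Role 2"));
        System.out.println("User 1 now has: " + table.rolesOf("User 1"));
    }
}
```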

Partition Keys

• Range scan for most queries
  - Where date = '201601' and state = 'CA'
  - Where date >= '201602' and date < '201604'
• Server-side filter for the rest
  - Where state = 'CA' (not a prefix of the key)
  - Where date like '2016' (regex)
  - Where date > '201601' and state > 'OR' (cannot be a range scan)
  - Scans all keys but does not deserialize values

date   | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
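
The difference between the two access paths can be sketched with a sorted map standing in for an HBase table. Everything here is an assumption made for illustration: the class name, the "date|state" key layout, and the use of `TreeMap` in place of HBase. A predicate on a key prefix becomes a bounded range scan, while a predicate on a trailing key component has to walk every key (looking only at keys, never at values).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy sorted table keyed by "date|state"; a stand-in for the sorted row keys of HBMS_PARTITIONS. */
class PartitionKeyScanSketch {
    private final NavigableMap<String, String> rows = new TreeMap<>();

    void add(String date, String state) {
        rows.put(date + "|" + state, "partition proto for " + date + "/" + state);
    }

    /** Range scan: only keys in [startKey, stopKey) are touched. */
    List<String> rangeScan(String startKey, String stopKey) {
        return new ArrayList<>(rows.subMap(startKey, true, stopKey, false).keySet());
    }

    /** Filter: every key is visited, but values are never read or deserialized. */
    List<String> filterByState(String state) {
        List<String> out = new ArrayList<>();
        for (String key : rows.keySet()) {
            if (key.endsWith("|" + state)) out.add(key);
        }
        return out;
    }

    public static void main(String[] args) {
        PartitionKeyScanSketch t = new PartitionKeyScanSketch();
        t.add("201601", "CA");
        t.add("201601", "WA");
        t.add("201602", "CA");
        t.add("201603", "CA");
        t.add("201605", "CA");
        // date >= '201602' and date < '201604' maps directly onto key bounds
        System.out.println(t.rangeScan("201602", "201604"));
        // state = 'CA' is not a key prefix, so all keys are examined
        System.out.println(t.filterByState("CA"));
    }
}
```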

Typed Partition Keys

• Binary sorted
  - HBase range scan: Scan(byte[] startRow, byte[] stopRow)
  - Where key1 >= 'A5' and key2 >= 8
    • startRow: 41 35 00 00 00 00 08
• Using BinarySortableSerDe
  - Supports all Hive data types
  - Handles null

(String, Integer) | Bytes
('A10', 3)        | 41 31 30 00 00 00 00 03
('A10', 10)       | 41 31 30 00 00 00 00 0A
('A5', 4)         | 41 35 00 00 00 00 04
('A5', 15)        | 41 35 00 00 00 00 0F
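
The byte layout in the table can be reproduced with a small toy encoder. This is a sketch of the layout as drawn on the slide, not of BinarySortableSerDe itself: the class and method names are hypothetical, strings are assumed to be ASCII with no embedded zero byte, and integers are assumed non-negative. The real SerDe covers nulls, negative numbers, and other types that this sketch ignores.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Toy (String, int) key encoder that follows the byte layout shown on the slide. */
class SortableKeySketch {
    /** Layout: ASCII bytes of s, a 0x00 terminator, then n as a 4-byte big-endian integer. */
    static byte[] encode(String s, int n) {
        if (n < 0) throw new IllegalArgumentException("sketch handles non-negative ints only");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] str = s.getBytes(StandardCharsets.US_ASCII);
        out.write(str, 0, str.length);
        out.write(0x00);
        out.write((n >>> 24) & 0xFF);
        out.write((n >>> 16) & 0xFF);
        out.write((n >>> 8) & 0xFF);
        out.write(n & 0xFF);
        return out.toByteArray();
    }

    /** Unsigned lexicographic byte comparison, the ordering a byte-sorted key space uses. */
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) return x - y;
        }
        return a.length - b.length;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode("A10", 3)));
        System.out.println(hex(encode("A5", 15)));
        // Start row for: key1 >= 'A5' and key2 >= 8
        byte[] startRow = encode("A5", 8);
        System.out.println("('A5', 4) sorts before startRow: " + (compare(encode("A5", 4), startRow) < 0));
    }
}
```

Because the integer is written big-endian after the string, byte order agrees with (string, integer) order for the inputs this sketch accepts, which is what lets a typed predicate turn into a plain start row.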

Storage Descriptor de-duplication

HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count

Thrift definitions:

    struct Partition {
      1: list<string> values,
      2: string dbName,
      3: string tableName,
      4: i32 createTime,
      5: i32 lastAccessTime,
      6: StorageDescriptor sd,
      7: map<string,string> parameters,
      8: optional PrincipalPrivilegeSet privileges
    }

    struct StorageDescriptor {
      1: list<FieldSchema> cols,
      2: string location,
      3: string inputFormat,
      4: string outputFormat,
      5: bool compressed,
      6: i32 numBuckets,
      7: SerDeInfo serdeInfo,
      8: list<string> bucketCols,
      9: list<Order> sortCols,
      10: map<string,string> parameters,
      11: optional SkewedInfo skewedInfo,
      12: optional bool storedAsSubDirectories
    }

Protobuf definitions:

    message Partition {
      optional int64 create_time = 1;
      optional int64 last_access_time = 2;
      optional string location = 3;
      optional Parameters sd_parameters = 4;
      required bytes sd_hash = 5;
      optional Parameters parameters = 6;
    }

    message StorageDescriptor {
      message Order { … }
      message SerDeInfo { … }
      message SkewedInfo { … }
      repeated FieldSchema cols = 1;
      optional string input_format = 2;
      optional string output_format = 3;
      optional bool is_compressed = 4;
      optional sint32 num_buckets = 5;
      optional SerDeInfo serde_info = 6;
      repeated string bucket_cols = 7;
      repeated Order sort_cols = 8;
      optional SkewedInfo skewed_info = 9;
      optional bool stored_as_sub_directories = 10;
    }
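
The idea behind `sd_hash` and the "ref" column can be sketched as a content-addressed store with reference counts. This is a minimal illustration under assumptions: `SdDedupSketch` and its methods are hypothetical names, plain byte arrays stand in for serialized StorageDescriptor protos, and two in-memory maps stand in for the "c" and "ref" columns of HBMS_SDS.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Toy content-addressed store: identical storage descriptors are kept once and reference-counted. */
class SdDedupSketch {
    private final Map<String, byte[]> sdByHash = new HashMap<>();   // plays the role of column "c"
    private final Map<String, Integer> refCount = new HashMap<>();  // plays the role of column "ref"

    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) sb.append(String.format("%02x", b & 0xFF));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Stores the descriptor if it is new, bumps its reference count, and returns the hash key. */
    String putStorageDescriptor(byte[] serializedSd) {
        String key = md5Hex(serializedSd);
        sdByHash.putIfAbsent(key, serializedSd);
        refCount.merge(key, 1, Integer::sum);
        return key;
    }

    /** Drops one reference; the descriptor is removed once nothing points at it. */
    void releaseStorageDescriptor(String key) {
        Integer left = refCount.computeIfPresent(key, (k, n) -> n > 1 ? n - 1 : null);
        if (left == null) sdByHash.remove(key);
    }

    int storedDescriptors() { return sdByHash.size(); }

    int references(String key) { return refCount.getOrDefault(key, 0); }

    public static void main(String[] args) {
        SdDedupSketch sds = new SdDedupSketch();
        byte[] sd = "orc, 16 buckets, cols=(a int, b string)".getBytes(StandardCharsets.UTF_8);
        String h1 = sds.putStorageDescriptor(sd);          // partition 1
        String h2 = sds.putStorageDescriptor(sd.clone());  // partition 2, same descriptor
        System.out.println("same key: " + h1.equals(h2));
        System.out.println("descriptors stored: " + sds.storedDescriptors());
        System.out.println("references: " + sds.references(h1));
    }
}
```

Note that `location` sits in the Partition message rather than in the StorageDescriptor message, which is what allows partitions that differ only in location to share one stored descriptor.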

HBase schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server passes the Thrift objects to the HBase client open in the Thrift server
• HBase client extracts fields from the Thrift objects and converts them to corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables

Example: adding a new partition "add_partition(Partition new_part)"

[Diagram: Thrift struct Partition -> protobuf message Partition -> HBMS_PARTITIONS and HBMS_SDS tables]

Caching

• Aggregate Stats
  • Location: on HBase
  • Compile time
• File Footers
  • Location: on HBase
  • Runtime: accessed from tasks
• Tables, Partitions, Storage Descriptors
  • Location: on Metastore server(s)
  • Compile time

Caching Aggregate Stats

"get_aggr_stats_for(dbName, tblName, partNames, colNames)"

• Gets aggregated stats for columns in each partition: an expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache: AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  • An AggrStatsBloomFilter is created on the partitions contained in the AggrStats
• Invalidation
  • TTL expiry: nodes are evicted from the cache
  • Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
  • An invalidator thread picks up an invalidation request and executes a filter on HBase to remove expired entries
  • It uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
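
The invalidation step can be sketched as follows. This is an illustration only, with hypothetical names throughout (`AggrStatsCacheSketch`, `Bloom`, `Entry`), a string standing in for the AggrStats proto, and a tiny hand-rolled Bloom filter rather than whatever implementation Hive uses. The point it shows: each cached aggregate carries a Bloom filter of the partitions it covers, so an invalidation for one partition can find affected entries without storing the full partition list. False positives only cause extra evictions, never stale results.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy aggregate-stats cache whose entries remember their partitions in a Bloom filter. */
class AggrStatsCacheSketch {
    /** Minimal Bloom filter: two probes into a fixed-size bit set. */
    static final class Bloom {
        private static final int BITS = 1024;
        private final BitSet bits = new BitSet(BITS);

        private static int probe(String s, int seed) {
            return Math.floorMod(s.hashCode() * 31 + seed * 0x9E3779B9, BITS);
        }

        void add(String s) {
            bits.set(probe(s, 1));
            bits.set(probe(s, 2));
        }

        boolean mightContain(String s) {
            return bits.get(probe(s, 1)) && bits.get(probe(s, 2));
        }
    }

    static final class Entry {
        final String aggrStats;               // stand-in for a serialized AggrStats proto
        final Bloom partitions = new Bloom(); // stand-in for AggrStatsBloomFilter

        Entry(String aggrStats, List<String> partNames) {
            this.aggrStats = aggrStats;
            for (String p : partNames) partitions.add(p);
        }
    }

    private final Map<String, Entry> cache = new LinkedHashMap<>();

    void put(String key, String aggrStats, List<String> partNames) {
        cache.put(key, new Entry(aggrStats, partNames));
    }

    String get(String key) {
        Entry e = cache.get(key);
        return e == null ? null : e.aggrStats;
    }

    /** Evicts every entry whose filter may contain the partition; returns how many were evicted. */
    int invalidate(String partName) {
        int removed = 0;
        for (Iterator<Entry> it = cache.values().iterator(); it.hasNext();) {
            if (it.next().partitions.mightContain(partName)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        AggrStatsCacheSketch cache = new AggrStatsCacheSketch();
        cache.put("k1", "stats over p1,p2", Arrays.asList("p1", "p2"));
        cache.put("k2", "stats over p3", Arrays.asList("p3"));
        System.out.println("evicted: " + cache.invalidate("p1"));
        System.out.println("k1 still cached: " + (cache.get("k1") != null));
    }
}
```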

Caching File Footers

• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwrites are not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic

HBaseMetaStore Needs Transaction

• Atomicity is required
  - Create table/partition also creates a storage descriptor
  - Alter table also alters partitions
  - Drop table also drops table column privileges
• HBase doesn't support transactions
  - No cross-row transactions
• HBaseConnection
  - Supports different transaction managers, in theory
  - VanillaHBaseConnection: no transaction

Omid

• Transaction layer on top of HBase
• Initially developed by Yahoo
• Apache Incubator project
  - First release this Monday
• Snapshot isolation
  - Natural, as HBase is a versioned database
  - No locking, no deadlock, no blocking for both reads and writes
  - If two concurrent transactions write to the same data, the later one aborts
• Low overhead

Omid Components

• TSO Server (Timestamp Oracle)
  - Generates transid
  - Tracks the status of transactions
• TSO Client
  - Talks to the TSO
  - Caches transaction metadata
  - Most reads don't need to talk to the TSO
• Compactor
  - Runs as an HBase coprocessor
  - Removes stale cell versions

[Diagram components: Client, TSO, HBase, Compactor]

Omid Operations

• Open transaction
  - Get transid from TSO
• Read a cell
  - Read all versions of the cell from HBase
  - Read the latest version committed before the transaction started
• Write a cell
  - Write the value, versioned with transid, to HBase
• Commit
  - Generate commitid from TSO
  - TSO figures out whether there is a conflict using transaction metadata

Omid Data Structure

• Memory management in TSO
  - Never runs out of memory: old transactions are aborted

TSO state (from the diagram):

  lastCommit: (row1, T20), (row2, T25), (row5, T22)
  committed:  (T10, T20), (T4, T25), (T11, T30)
  aborted:    T2, …

• Detect transaction conflicts at commit time
• Largest chunk of memory
• Construct snapshot at read time
• Partially replicated to client
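
The open/read/write/commit cycle and the role of a `lastCommit` table can be sketched with a single-process toy. This is not Omid's implementation: all names are hypothetical, writes are buffered in the transaction and versioned by commit timestamp (rather than written to HBase under the start timestamp), and the sketch is not thread-safe. It only illustrates snapshot reads and commit-time write-write conflict detection.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy snapshot isolation: a counter as timestamp oracle plus a lastCommit table for conflicts. */
class SnapshotIsolationSketch {
    /** A transaction: its start timestamp and its buffered writes. */
    static final class Txn {
        final long startTs;
        final Map<String, String> writes = new HashMap<>();
        Txn(long startTs) { this.startTs = startTs; }
    }

    private long clock = 0;                                                    // timestamp oracle
    private final Map<String, Long> lastCommit = new HashMap<>();              // row -> last commit timestamp
    private final Map<String, TreeMap<Long, String>> store = new HashMap<>();  // row -> commitTs -> value

    /** Open transaction: obtain a start timestamp. */
    Txn begin() {
        return new Txn(++clock);
    }

    /** Read the latest version committed before the transaction started (or its own write). */
    String read(Txn t, String row) {
        if (t.writes.containsKey(row)) return t.writes.get(row);
        TreeMap<Long, String> versions = store.get(row);
        if (versions == null) return null;
        Map.Entry<Long, String> e = versions.lowerEntry(t.startTs);
        return e == null ? null : e.getValue();
    }

    void write(Txn t, String row, String value) {
        t.writes.put(row, value);
    }

    /** Commit unless a written row was committed by someone else after this transaction started. */
    boolean commit(Txn t) {
        for (String row : t.writes.keySet()) {
            Long last = lastCommit.get(row);
            if (last != null && last > t.startTs) return false; // conflict: the later committer aborts
        }
        long commitTs = ++clock;
        for (Map.Entry<String, String> w : t.writes.entrySet()) {
            store.computeIfAbsent(w.getKey(), r -> new TreeMap<>()).put(commitTs, w.getValue());
            lastCommit.put(w.getKey(), commitTs);
        }
        return true;
    }

    public static void main(String[] args) {
        SnapshotIsolationSketch db = new SnapshotIsolationSketch();
        Txn t1 = db.begin();
        Txn t2 = db.begin();
        db.write(t1, "row1", "a");
        db.write(t2, "row1", "b");
        System.out.println("t1 commits: " + db.commit(t1));
        System.out.println("t2 commits: " + db.commit(t2));
        System.out.println("new reader sees: " + db.read(db.begin(), "row1"));
    }
}
```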

Transaction Conflict

• Two concurrent DDL operations write to the same data
  - Proper retry logic
• Task node writes: ORC footer cache
  - High chance of write conflict
  - Row mutation is atomic in HBase
  - Cross-row atomicity is not required
  - Bypass the transaction layer

    public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
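
"Proper retry logic" for conflicting DDL could look roughly like the sketch below. The names are hypothetical (`TxnConflictException`, `DdlRetrySketch.withRetry`) and are not taken from Hive or Omid; the sketch assumes the whole operation, including its reads, is re-run in a fresh transaction on each attempt so that it sees the winner's committed state.

```java
import java.util.function.Supplier;

/** Hypothetical exception a transaction layer might raise when a commit is rejected. */
class TxnConflictException extends RuntimeException {
    TxnConflictException(String msg) {
        super(msg);
    }
}

class DdlRetrySketch {
    /** Runs the operation, re-running it from scratch each time it hits a conflict. */
    static <T> T withRetry(int maxAttempts, Supplier<T> op) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        TxnConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get(); // op is expected to open, use, and commit its own transaction
            } catch (TxnConflictException e) {
                last = e;        // another writer won; try again against the new state
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = withRetry(3, () -> {
            if (++calls[0] < 3) throw new TxnConflictException("simulated conflict");
            return "committed";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The footer-cache path needs none of this: it writes single rows keyed by fileId, so it can skip the transaction layer entirely, as the slide notes.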

Deployment

• Server-side components in HBase
  - Server-side filter
  - Omid compactor
  - Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
• New config in hive-site.xml
  - hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram components: Hive MetaStore, TSO, HBase (with Server Side Filter and Omid Compactor)]

Deploy Omid

• Create Omid tables in HBase
  - omid.sh create-hbase-commit-table
  - omid.sh create-hbase-timestamp-table
• Start Omid TSO
  - omid.sh tso
• Related config in hive-site.xml
  - hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
  - tso.host=localhost
  - tso.port=54758
  - omid.client.connectionType=DIRECT

Instantiate HBase Metastore

• Instantiate HBase tables from scratch
  - hive --service hbaseschematool --install
• hbaseimport: import an existing Hive Metastore
  - One-way import from ObjectStore to HBaseStore
  - hive --service hbaseimport

TPCDS queries

[Chart: Query Plan Time for TPCDS queries. X-axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y-axis: 0 to 6000. Series: HBaseStore, HBaseStore+Omid, ObjectStore]

• 1824 partitions
• Sweet spot for ObjectStore
• Average speedup for all TPCDS queries
  - 219 (without Omid)
  - 212 (with Omid)

Current Status

• hbase-metastore branch merged to master last September
• Turned off by default
• Feature parity? Almost
  - Minor holes: event notification/version/constraints
  - Deprecate listTableNamesByFilter/listPartitionNamesByFilter
  - Tools enhancement
  - ACID is not supported
• Runs most e2e queries
• Fixing unit tests
  - TestMiniTezCliDriver: all pass
  - TestCliDriver: HIVE-14097, pending review
  - Not production quality yet

Future Work - ACID

• Transaction metadata is stored in the Metastore
  - Locks
  - Txns
  - Compactions
• Data structure is harder to de-normalize
• New work: transaction server
  - Keep lock and transaction tree in memory

Future Work - HA via HBase Coprocessor

• Two new server components
  - Omid TSO Server
  - Transaction Server
• All servers need HA
  - Management headache
• Automatic HA through HBase Coprocessor

[Diagram: two Region Servers, each with a TSO Server via Coprocessor]

Future Work - Other

• Stats aggregation
  - Coprocessor
• Improving ObjectCache
  - Rudimentary implementation currently
  - LRU
• Omid consumes high CPU
  - 300% CPU always, by design
  - High throughput: avoids context switches
  - Might be an issue for small systems

Thank You

Page 6: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

6 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Besides Latency

Significantly more scalendash More metadata ndash millions of partitionsndash New large scale metadata ndash Split information ORC row group statisticsndash More calls ndash Handle orders of magnitude higher no of calls ndash From tasks

Reduce Complexityndash Object Relational Modeling is an impedance mismatchndash DataNucleusndash DBCP BoneCP or Hikaricp

7 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

ER Diagram for ObjectStore Database

8 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

How About Improving ObjectStore

Already happeningndash Using direct SQL instead of O-R

But

ndash Maintenance nightmarendash Handle syntax difference for databases

Re-engineering effort may not pay off Ultimate barrier Scalability

String queryText = select PARTITIONSPART_ID SDSSD_ID SDSCD_ID + SERDESSERDE_ID PARTITIONSCREATE_TIME + PARTITIONSLAST_ACCESS_TIME SDSINPUT_FORMAT SDSIS_COMPRESSED + SDSIS_STOREDASSUBDIRECTORIES SDSLOCATION SDSNUM_BUCKETS + SDSOUTPUT_FORMAT SERDESNAME SERDESSLIB + from PARTITIONS + left outer join SDS on PARTITIONSSD_ID = SDSSD_ID + left outer join SERDES on SDSSERDE_ID = SERDESSERDE_ID + where PART_ID in ( + partIds + ) order by PART_NAME asc

9 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

10 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

System Architecture

HiveMetaStore Thrift Server

ObjectStoreHBaseStore

RDBMSHBase

Omid

bull Two implementation of the RawStore interfacebull HBaseStorebull ObjectStore

bull Both backend will live together for a while

bull HBaseStorebull Most traffic will go through transaction

layer (Omid)bull Some traffic will bypass transaction layer

bull Volatile databull High possibility of conflict

HiveMetaStore Thrift Client

11 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

RDBMS schema

ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in

metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model

objects bull ORM opens transaction on RDBMS and writesreads values tofrom various tables in

RDBMS using appropriate foreign key references bull RDBMS fastpath enabled by not using ORM and writing direct SQL However

complicates testing matrix as there may be slight variations in SQL semantics for different RDBMS databases

Example adding a new partition ldquoadd_partition(Partition new_part)rdquo

12 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

RDBMS schema

ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in

metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model

objects bull ORM opens transaction on RDBMS and writes reads values to from various tables in

RDBMS using appropriate foreign key references

Example adding a new partition ldquoadd_partition(Partition new_part)rdquo

struct Partition

1 listltstringgt values

2 string dbName

3 string tableName

4 i32 createTime

5 i32 lastAccessTime

6 StorageDescriptor sd

7 mapltstring stringgt parameters

8 optional PrincipalPrivilegeSet privileges

TBLS

TBL_PRIVS

TBL_COL_PRIVS

PART_PRIVS

SDS

CDS

SORT_ORDER

SERDES

TYPE_FIELDS

PARTITIONS

PARTITION_KEY_VALS

PARTITION_PARAMS

BUCKETING_COLS

SORT_COLS

SD_PARAMS

SKEWED_COL_NAMES

SKEWED_VALUES

TABLE_PARAMS

13 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog ldquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

HBase schema

14 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Table Name Key Column Families and Columns

Description

HBMS_GLOBAL_PRIVS bytes(ldquogprdquo) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized PrincipalPrivilegeSet proto

HBMS_ROLES bytes(roleName) cf_catalog ldquorolesrdquo ldquorolesrdquo storeretrieve serialized Role proto

HBMS_USER_TO_ROLE bytes(userName) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized RoleList proto

HBMS_SECURITY bytes(delTokenId) cf_catalog ldquodtrdquo ldquomkrdquo ldquodtrdquo storeretrieve delegation token ldquomkrdquo master keys

HBMS_SEQUENCES bytes(sequence) cf_catalog ldquocrdquo ldquocrdquo storeretrieve sequences

HBase schema

15 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

De-normalization

bull Goalbull Optimized for queryingbull May slower in DDL bull Example drop_role(String roleName)

Key Value

bytes(ldquoUser 1rdquo) Proto(Role 1 Role 2 Role 3 Role 5)

bytes(ldquoUser 2rdquo) Proto(Role 1 Role 2)

bytes(ldquoUser 3rdquo) Proto(Role 4 Role 5)

bytes(ldquoUser 4rdquo) Proto (Role 2 Role 3)

HBMS_USER_TO_ROLE

bull Need to scan amp de-serialize everything in order to drop a role

16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Partition Keys

Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo

Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value

date state

201601 CA

201601 WA

201602 CA

201603 CA

201605 CA

17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Typed Partition Keys

Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)

ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08

Using BinarySortableSerDendash Support all Hive data typesndash Handles null

(String Integer) Bytes

lsquoA10rsquo 3 41 31 30 00 00 00 00 03

lsquoA10rsquo 10 41 31 30 00 00 00 00 0A

lsquoA5rsquo 4 41 35 00 00 00 00 04

lsquoA5rsquo 15 41 35 00 00 00 00 0D

18 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Storage Descriptor de-duplication

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog lsquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

19 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Storage Descriptor de-duplication

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog lsquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

struct Partition

1 listltstringgt values

2 string dbName

3 string tableName

4 i32 createTime

5 i32 lastAccessTime

6 StorageDescriptor sd

7 mapltstring stringgt parameters

8 optional PrincipalPrivilegeSet privileges

struct StorageDescriptor

1 listltFieldSchemagt cols

2 string location

3 string inputFormat

4 string outputFormat

5 bool compressed

6 i32 numBuckets

7 SerDeInfo serdeInfo

8 listltstringgt bucketCols

9 listltOrdergt sortCols

10 mapltstring stringgt parameters

11 optional SkewedInfo skewedInfo

12 optional bool storedAsSubDirectories

Storage Descriptor de-duplication

Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, …, partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, …, partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": metadata footer proto; "s": PPD stats

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
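The de-duplication idea (storage descriptors keyed by the md5 of their serialized form, with a reference count) can be sketched in a few lines of Python. This is an illustration only, not the HBaseStore code: the class StorageDescriptorStore, its method names, and the sample descriptor bytes are hypothetical, and a dict stands in for the HBMS_SDS table.

```python
import hashlib


class StorageDescriptorStore:
    """In-memory stand-in for HBMS_SDS: md5(SD bytes) -> [SD bytes, ref count]."""

    def __init__(self):
        self.rows = {}

    def put(self, sd_bytes: bytes) -> bytes:
        # The row key is the hash of the serialized descriptor, so identical
        # descriptors from different partitions land on the same row.
        key = hashlib.md5(sd_bytes).digest()
        row = self.rows.get(key)
        if row is None:
            self.rows[key] = [sd_bytes, 1]
        else:
            row[1] += 1  # one more partition or table refers to this descriptor
        return key

    def release(self, key: bytes) -> None:
        # Called when a referencing partition or table is dropped.
        row = self.rows[key]
        row[1] -= 1
        if row[1] == 0:
            del self.rows[key]


store = StorageDescriptorStore()
shared_sd = b"cols=id:int,name:string;input_format=orc;output_format=orc"
hash_a = store.put(shared_sd)  # first partition
hash_b = store.put(shared_sd)  # second partition with the same descriptor
```

Both partitions would then carry the same sd_hash, and only one copy of the descriptor is stored.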

HBase schema

Read/Write path
• Thrift client creates Thrift objects for RPC (based on the specs in metastore/if/hive_metastore.thrift)
• Thrift server passes the Thrift objects to the HBase client opened in the Thrift server
• HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables

Example: adding a new partition, "add_partition(Partition new_part)"

struct Partition {
  1: list<string> values
  2: string dbName
  3: string tableName
  4: i32 createTime
  5: i32 lastAccessTime
  6: StorageDescriptor sd
  7: map<string, string> parameters
  8: optional PrincipalPrivilegeSet privileges
}

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

Tables shown in the diagram: HBMS_PARTITIONS, HBMS_SDS


Caching

Aggregate Stats
• Location: on HBase
• Used at compile time

File Footers
• Location: on HBase
• Used at runtime (accessed from tasks)

Tables, Partitions, Storage Descriptors
• Location: on Metastore server(s)
• Used at compile time

Caching Aggregate Stats

"get_aggr_stats_for(dbName, tblName, partNames, colNames)"

• Gets aggregated stats for columns in each partition; an expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1, …, partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache: AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  • An AggrStatsBloomFilter is created on the partitions contained in the AggrStats
• Invalidation
  • TTL expiry: nodes are evicted from the cache
  • Alter partition, drop partition, analyze, etc. add an invalidation request to a queue
  • An invalidator thread picks up the invalidation request and executes a filter on HBase to remove expired entries
  • Uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
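The lookup and invalidation flow can be sketched as follows. This is a simplified, in-memory illustration under stated assumptions, not the Hive implementation: AggrStatsCache, cache_key, and the compute callback are hypothetical names, a dict stands in for HBMS_AGGR_STATS, and an exact frozenset of partition names stands in for the bloom filter (a real bloom filter may report false positives, which would only cause extra evictions).

```python
import hashlib
from collections import deque


def cache_key(db, table, part_names, col):
    # md5 over the identifying fields, mirroring the row key layout
    joined = "\x00".join([db, table, *part_names, col])
    return hashlib.md5(joined.encode("utf-8")).digest()


class AggrStatsCache:
    def __init__(self):
        self.entries = {}  # key -> (stats, partitions covered by the stats)
        self.invalidation_queue = deque()

    def get(self, db, table, part_names, col, compute):
        key = cache_key(db, table, part_names, col)
        if key not in self.entries:
            # Miss: aggregate on the client side, then cache the result
            self.entries[key] = (compute(part_names, col), frozenset(part_names))
        return self.entries[key][0]

    def request_invalidation(self, part_name):
        # Called by alter partition, drop partition, analyze, ...
        self.invalidation_queue.append(part_name)

    def run_invalidator(self):
        # Drop every cached aggregate that covers an invalidated partition
        while self.invalidation_queue:
            part = self.invalidation_queue.popleft()
            stale = [k for k, (_, parts) in self.entries.items() if part in parts]
            for k in stale:
                del self.entries[k]


calls = []


def compute(part_names, col):
    calls.append(col)
    return {"partitions": len(part_names)}


cache = AggrStatsCache()
parts = ["ds=1", "ds=2"]
cache.get("db", "t", parts, "c1", compute)          # miss, computed
cache.get("db", "t", parts, "c1", compute)          # hit
cache.request_invalidation("ds=2")
cache.run_invalidator()
stats = cache.get("db", "t", parts, "c1", compute)  # miss again, recomputed
```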

Caching File Footers

• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skip transactions
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic
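A minimal sketch of why the footer cache can skip transactions: each write touches exactly one row keyed by a unique fileId, so a repeated write simply overwrites the same value. The class FileFooterCache and its methods are hypothetical, a dict stands in for HBMS_FILE_METADATA, and the cleaner here assumes it is handed the set of file IDs that still exist.

```python
class FileFooterCache:
    """In-memory stand-in for the footer cache table: fileId -> footer bytes."""

    def __init__(self):
        self.rows = {}

    def put_file_metadata(self, file_ids, footers):
        # One independent single-row write per file; no cross-row atomicity needed
        for file_id, footer in zip(file_ids, footers):
            self.rows[file_id] = footer

    def get_file_metadata(self, file_ids):
        # None marks a cache miss; the caller would read the file itself
        return [self.rows.get(file_id) for file_id in file_ids]

    def clean(self, live_file_ids):
        # Cleaner pass: drop entries for files that no longer exist
        live = set(live_file_ids)
        for file_id in [f for f in self.rows if f not in live]:
            del self.rows[file_id]


footers = FileFooterCache()
footers.put_file_metadata([101, 102], [b"footer-101", b"footer-102"])
footers.put_file_metadata([101], [b"footer-101"])  # duplicate write from another task
found = footers.get_file_metadata([101, 102, 103])
footers.clean(live_file_ids=[102])
```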


HBaseMetaStore Needs Transactions

Atomicity is required
- Create table/partition also creates a storage descriptor
- Alter table also alters partitions
- Drop table also drops table column privileges

HBase doesn't support transactions
- No support for cross-row transactions

HBaseConnection
- Can support different transaction managers, in theory
- VanillaHBaseConnection: no transactions

Omid

Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday

Snapshot isolation
- A natural fit, as HBase is a versioned database
- No locking, no deadlock, no blocking for both reads and writes
- When two concurrent transactions write to the same data, the later one aborts

Low overhead

Omid Components

TSO Server (Timestamp Oracle)
- Generates the transid
- Tracks the status of transactions

TSO Client
- Talks to the TSO
- Caches transaction metadata
- Most reads don't need to talk to the TSO

Compactor
- Runs as an HBase coprocessor
- Removes stale cell versions

[Diagram: Client, TSO, and HBase with the Compactor]

Omid Operations

Open transaction
- Get a transid from the TSO

Read a cell
- Read all versions of the cell from HBase
- Use the latest version committed before the transaction started

Write a cell
- Write the value to HBase, versioned with the transid

Commit
- Get a commitid from the TSO
- The TSO figures out whether there is a conflict using transaction metadata
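The four operations can be illustrated with a toy, single-process model of snapshot isolation. This is a sketch of the general technique under simplifying assumptions, not Omid's code or API: TimestampOracle and Transaction are hypothetical classes, a nested dict stands in for versioned HBase cells, and cleanup of aborted writes is omitted.

```python
class TimestampOracle:
    def __init__(self):
        self.clock = 0
        self.last_commit = {}  # row -> commit timestamp of its last committed write
        self.committed = {}    # start timestamp -> commit timestamp

    def next_ts(self):
        self.clock += 1
        return self.clock

    def try_commit(self, start_ts, written_rows):
        # Conflict: some written row was committed by another transaction
        # after this transaction started.
        if any(self.last_commit.get(row, 0) > start_ts for row in written_rows):
            return None
        commit_ts = self.next_ts()
        for row in written_rows:
            self.last_commit[row] = commit_ts
        self.committed[start_ts] = commit_ts
        return commit_ts


class Transaction:
    def __init__(self, tso, cells):
        self.tso = tso
        self.cells = cells              # row -> {writer start timestamp: value}
        self.start_ts = tso.next_ts()   # open transaction
        self.writes = set()

    def read(self, row):
        best = None
        for writer_ts, value in self.cells.get(row, {}).items():
            if writer_ts == self.start_ts:
                return value            # a transaction sees its own writes
            commit_ts = self.tso.committed.get(writer_ts)
            if commit_ts is not None and commit_ts < self.start_ts:
                if best is None or commit_ts > best[0]:
                    best = (commit_ts, value)
        return best[1] if best else None

    def write(self, row, value):
        # The cell version is tagged with this transaction's start timestamp
        self.cells.setdefault(row, {})[self.start_ts] = value
        self.writes.add(row)

    def commit(self):
        return self.tso.try_commit(self.start_ts, self.writes) is not None


tso, cells = TimestampOracle(), {}
t1 = Transaction(tso, cells)
t2 = Transaction(tso, cells)
t1.write("row1", "a")
t2.write("row1", "b")
first = t1.commit()    # succeeds
second = t2.commit()   # conflicts with t1 on row1, so it aborts
t3 = Transaction(tso, cells)
value = t3.read("row1")
```

Here the later of the two concurrent writers aborts, and a transaction opened afterwards reads only the committed value.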

Omid Data Structure

Memory management in the TSO
- Never runs out of memory: old transactions are aborted

TSO data structures

lastCommit (row -> timestamp): row1 -> T20, row2 -> T25, row5 -> T22
• Detects transaction conflicts at commit time
• Largest chunk of memory

committed: T10 -> T20, T4 -> T25, T11 -> T30
• Used to construct the snapshot at read time
• Partially replicated to the client

aborted: T2, …

Transaction Conflict

Two concurrent DDL statements write to the same data
- Needs proper retry logic

Task node writes (ORC footer cache)
- High chance of write conflicts
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypasses the transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
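The "proper retry logic" for conflicting DDL could look roughly like the sketch below. The exception type TransactionConflict, the helper run_with_retry, and the retry limit are assumptions made for illustration; the slides do not specify how Hive retries.

```python
class TransactionConflict(Exception):
    """Raised when the transaction layer aborts a commit because of a conflict."""


def run_with_retry(operation, max_attempts=3):
    # Re-run the whole operation so it reads a fresh snapshot on each attempt
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransactionConflict:
            if attempt == max_attempts:
                raise


attempts = []


def flaky_ddl():
    # Simulates a DDL operation that loses the conflict check twice
    attempts.append(1)
    if len(attempts) < 3:
        raise TransactionConflict("concurrent write to the same row")
    return "committed"


result = run_with_retry(flaky_ddl)
```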


Deployment

Server-side components in HBase
- Server-side filter
- Omid compactor
- Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
- hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram: Hive MetaStore, TSO, and HBase with the server-side filter and Omid compactor]

Deploy Omid

Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table

Start the Omid TSO
- omid.sh tso

Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT

Instantiate HBase Metastore

Instantiate HBase tables from scratch
- hive --service hbaseschematool --install

hbaseimport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport


TPCDS queries

[Bar chart: Query Plan Time for TPCDS queries. Queries 7, 15, 27, 29, 39, 46, 56, 68, 70, 76; series: HBaseStore, HBaseStore+Omid, ObjectStore; vertical axis from 0 to 6000]

1824 partitions
Sweet spot for ObjectStore
Average speed-up for all TPCDS queries
- 219 (without Omid)
- 212 (with Omid)


Current Status

hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
- Minor holes: event notification, version, constraints
- Deprecate listTableNamesByFilter and listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported

Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097, pending review
- Not production quality yet

Future Work - ACID

Transaction metadata is stored in the Metastore
- Locks
- Txns
- Compactions

Data structure is harder to de-normalize
New work: transaction server
- Keep the lock and transaction tree in memory

Future Work - HA via HBase Coprocessor

Two new server components
- Omid TSO Server
- Transaction Server

All servers need HA
- Management headache

Automatic HA through HBase coprocessor

[Diagram: two Region Servers, each hosting a TSO Server via coprocessor]

Future Work - Other

Stats aggregation
- Coprocessor

Improving ObjectCache
- Rudimentary implementation currently
- LRU

Omid consuming high CPU
- 300% CPU always, by design
- High throughput, avoids context switches
- Might be an issue for small systems

Thank You

  • Hive Hbase Metastore - Improving Hive with a Big Data Metadata
  • Agenda
  • What is Hive MetaStore
  • Low latency in Hive
  • New Bottleneck - Metastore
  • Besides Latency
  • ER Diagram for ObjectStore Database
  • How About Improving ObjectStore
  • Agenda (2)
  • System Architecture
  • RDBMS schema
  • RDBMS schema (2)
  • HBase schema
  • HBase schema (2)
  • De-normalization
  • Partition Keys
  • Typed Partition Keys
  • HBase schema (3)
  • HBase schema (4)
  • HBase schema (5)
  • HBase schema (6)
  • HBase schema (7)
  • Agenda (3)
  • Caching
  • Caching Aggregate Stats
  • Caching File Footers
  • Agenda (4)
  • HBaseMetaStore Needs Transaction
  • Omid
  • Omid Components
  • Omid Operations
  • Omid Data Structure
  • Transaction Conflict
  • Agenda (5)
  • Deployment
  • Deploy Omid
  • Instantiate HBase Metastore
  • Agenda (6)
  • TPCDS queries
  • Agenda (7)
  • Current Status
  • Future Work - ACID
  • Future Work - HA via HBase Coprocessor
  • Future Work - Other
  • Thank You
Page 7: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

ER Diagram for ObjectStore Database

How About Improving ObjectStore

Already happening
- Using direct SQL instead of O-R mapping

But
- Maintenance nightmare
- Need to handle syntax differences between databases

Re-engineering effort may not pay off
Ultimate barrier: scalability

String queryText =
    "select PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID,"
  + " SERDES.SERDE_ID, PARTITIONS.CREATE_TIME,"
  + " PARTITIONS.LAST_ACCESS_TIME, SDS.INPUT_FORMAT, SDS.IS_COMPRESSED,"
  + " SDS.IS_STOREDASSUBDIRECTORIES, SDS.LOCATION, SDS.NUM_BUCKETS,"
  + " SDS.OUTPUT_FORMAT, SERDES.NAME, SERDES.SLIB"
  + " from PARTITIONS"
  + " left outer join SDS on PARTITIONS.SD_ID = SDS.SD_ID"
  + " left outer join SERDES on SDS.SERDE_ID = SERDES.SERDE_ID"
  + " where PART_ID in (" + partIds + ") order by PART_NAME asc";


System Architecture

[Diagram: HiveMetaStore Thrift Client, HiveMetaStore Thrift Server, ObjectStore and HBaseStore, RDBMS and HBase, Omid]

• Two implementations of the RawStore interface
  • HBaseStore
  • ObjectStore
• Both backends will live together for a while
• HBaseStore
  • Most traffic will go through the transaction layer (Omid)
  • Some traffic will bypass the transaction layer
    • Volatile data
    • High possibility of conflict

RDBMS schema

Read/Write path
• Thrift client creates Thrift objects for RPC (based on the specs in metastore/if/hive_metastore.thrift)
• Thrift server extracts values from the Thrift objects and creates the corresponding ORM model objects
• ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS using the appropriate foreign key references
• The RDBMS fastpath is enabled by not using the ORM and writing direct SQL. However, this complicates the testing matrix, as there may be slight variations in SQL semantics across different RDBMS databases

Example: adding a new partition, "add_partition(Partition new_part)"

struct Partition {
  1: list<string> values
  2: string dbName
  3: string tableName
  4: i32 createTime
  5: i32 lastAccessTime
  6: StorageDescriptor sd
  7: map<string, string> parameters
  8: optional PrincipalPrivilegeSet privileges
}

Tables shown in the diagram: TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS

HBase schema

Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, …, partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, …, partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": metadata footer proto; "s": PPD stats

HBase schema

Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences

De-normalization

• Goal
  • Optimized for querying
  • May be slower for DDL
• Example: drop_role(String roleName)

HBMS_USER_TO_ROLE

Key | Value
bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") | Proto(Role 1, Role 2)
bytes("User 3") | Proto(Role 4, Role 5)
bytes("User 4") | Proto(Role 2, Role 3)

• Need to scan and de-serialize everything in order to drop a role
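The cost of drop_role on the de-normalized table can be sketched as below. The data mirrors the example above; the function drop_role and its return value are hypothetical, a dict stands in for HBMS_USER_TO_ROLE, and JSON is used only as a stand-in for the serialized RoleList proto.

```python
import json

# Stand-in for HBMS_USER_TO_ROLE: user key -> serialized role list
user_to_role = {
    b"User 1": json.dumps(["Role 1", "Role 2", "Role 3", "Role 5"]).encode(),
    b"User 2": json.dumps(["Role 1", "Role 2"]).encode(),
    b"User 3": json.dumps(["Role 4", "Role 5"]).encode(),
    b"User 4": json.dumps(["Role 2", "Role 3"]).encode(),
}


def drop_role(table, role_name):
    """Remove role_name from every user; returns how many rows were rewritten."""
    rewritten = 0
    for key in list(table):              # full scan: every row is visited
        roles = json.loads(table[key])   # every value is de-serialized
        if role_name in roles:
            roles.remove(role_name)
            table[key] = json.dumps(roles).encode()
            rewritten += 1
    return rewritten


rewritten_rows = drop_role(user_to_role, "Role 2")
```

Reading the roles of one user stays a single-row lookup, which is the trade-off the slide describes: fast queries, slower DDL.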

Partition Keys

Range scan for most queries
- Where date = '201601' and state = 'CA'
- Where date >= '201602' and date < '201604'

Server-side filter for the rest
- Where state = 'CA' (not a prefix of the key)
- Where date like '2016' (regex)
- Where date > '201601' and state > 'OR' (cannot be a range scan)
- Scans all keys but does not deserialize the values

date | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
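The difference between the two access paths can be sketched with a sorted list of keys standing in for the HBase table. The helper names range_scan and filtered_scan are hypothetical; this only illustrates why a predicate on a key prefix can use a range scan while a predicate on a later key column has to visit every key.

```python
from bisect import bisect_left

# Sorted (date, state) keys, as in the example above
keys = sorted([
    ("201601", "CA"),
    ("201601", "WA"),
    ("201602", "CA"),
    ("201603", "CA"),
    ("201605", "CA"),
])


def range_scan(sorted_keys, start, stop):
    # Touches only the keys in [start, stop), located by binary search
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def filtered_scan(sorted_keys, predicate):
    # Visits every key and evaluates the predicate on the key alone
    return [key for key in sorted_keys if predicate(key)]


# date >= '201602' and date < '201604': the predicate is on the key prefix
by_date_range = range_scan(keys, ("201602",), ("201604",))

# state = 'CA': not a prefix of the key, so every key must be checked
by_state = filtered_scan(keys, lambda key: key[1] == "CA")
```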

Typed Partition Keys

Binary sorted
- HBase range scan: Scan(byte[] startRow, byte[] stopRow)
- Where key1 >= 'A5' and key2 >= 8
  • startRow: 41 35 00 00 00 00 08

Using BinarySortableSerDe
- Supports all Hive data types
- Handles null

(String, Integer) | Bytes
'A10', 3 | 41 31 30 00 00 00 00 03
'A10', 10 | 41 31 30 00 00 00 00 0A
'A5', 4 | 41 35 00 00 00 00 04
'A5', 15 | 41 35 00 00 00 00 0F
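The byte layout in the table can be reproduced with a small sketch: ASCII text, a 0x00 terminator, then a 4-byte big-endian integer. The function encode_key is hypothetical and covers only this simplified layout for non-negative integers; the real BinarySortableSerDe handles more cases, such as nulls and negative numbers.

```python
import struct


def encode_key(text, number):
    # ASCII text, a 0x00 terminator, then a 4-byte big-endian unsigned integer
    return text.encode("ascii") + b"\x00" + struct.pack(">I", number)


rows = [("A5", 15), ("A10", 10), ("A5", 4), ("A10", 3)]

# Sorting the encoded bytes gives the same order as sorting the typed values
encoded = sorted(encode_key(text, number) for text, number in rows)
hex_rows = [" ".join(f"{byte:02x}" for byte in key) for key in encoded]
```

Because the byte order matches the value order, a typed predicate can be translated into startRow and stopRow bounds for a range scan.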

Page 8: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

How About Improving ObjectStore

Already happening
– Using direct SQL instead of O-R mapping

But
– Maintenance nightmare
– Must handle syntax differences between databases

Re-engineering effort may not pay off
Ultimate barrier: scalability

String queryText =
    "select PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID, "
  + "SERDES.SERDE_ID, PARTITIONS.CREATE_TIME, "
  + "PARTITIONS.LAST_ACCESS_TIME, SDS.INPUT_FORMAT, SDS.IS_COMPRESSED, "
  + "SDS.IS_STOREDASSUBDIRECTORIES, SDS.LOCATION, SDS.NUM_BUCKETS, "
  + "SDS.OUTPUT_FORMAT, SERDES.NAME, SERDES.SLIB "
  + "from PARTITIONS "
  + "left outer join SDS on PARTITIONS.SD_ID = SDS.SD_ID "
  + "left outer join SERDES on SDS.SERDE_ID = SERDES.SERDE_ID "
  + "where PART_ID in (" + partIds + ") order by PART_NAME asc";


System Architecture

[Diagram: HiveMetaStore Thrift Client; HiveMetaStore Thrift Server; ObjectStore over RDBMS; HBaseStore over Omid and HBase]

• Two implementations of the RawStore interface
  • HBaseStore
  • ObjectStore
• Both backends will live together for a while
• HBaseStore
  • Most traffic will go through the transaction layer (Omid)
  • Some traffic will bypass the transaction layer
    • Volatile data
    • High possibility of conflict

RDBMS schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server extracts values from Thrift objects and creates corresponding ORM model objects
• ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS using appropriate foreign key references
• RDBMS fastpath is enabled by not using ORM and writing direct SQL. However, this complicates the testing matrix, as there may be slight variations in SQL semantics across RDBMS databases

Example: adding a new partition, "add_partition(Partition new_part)"

RDBMS schema

Example: adding a new partition, "add_partition(Partition new_part)"

struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

RDBMS tables shown alongside the struct:
TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS

HBase schema

Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": Stats per column in the Table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, …, partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": Stats per column in the Partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, …, partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": Metadata footer proto; "s": PPD Stats

HBase schema

Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences

De-normalization

• Goal
  • Optimized for querying
  • May be slower for DDL
• Example: drop_role(String roleName)

HBMS_USER_TO_ROLE

Key | Value
bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") | Proto(Role 1, Role 2)
bytes("User 3") | Proto(Role 4, Role 5)
bytes("User 4") | Proto(Role 2, Role 3)

• Need to scan & de-serialize everything in order to drop a role
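The trade-off on this slide can be sketched in a few lines of Java. This is only a toy model: the class `UserToRoleSketch` and its methods are hypothetical, an in-memory map stands in for the HBMS_USER_TO_ROLE table, and plain string lists stand in for the serialized RoleList proto. It shows why a read is a single-row lookup while dropping a role has to visit every row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of HBMS_USER_TO_ROLE: one row per user, value = that user's full role list. */
final class UserToRoleSketch {
    // Stand-in for the HBase table; the real value would be a serialized RoleList proto.
    private final Map<String, List<String>> rows = new TreeMap<>();

    void grant(String user, String role) {
        rows.computeIfAbsent(user, u -> new ArrayList<>()).add(role);
    }

    /** Cheap read path: a single-row lookup returns every role of the user. */
    List<String> rolesOf(String user) {
        return rows.getOrDefault(user, List.of());
    }

    /** Expensive DDL path: every row is visited, and rewritten if it holds the role. */
    int dropRole(String role) {
        int rowsRewritten = 0;
        for (List<String> roles : rows.values()) {
            if (roles.remove(role)) {
                rowsRewritten++;
            }
        }
        return rowsRewritten;
    }

    /** The four users from the slide. */
    static UserToRoleSketch slideExample() {
        UserToRoleSketch t = new UserToRoleSketch();
        for (String r : List.of("Role 1", "Role 2", "Role 3", "Role 5")) t.grant("User 1", r);
        for (String r : List.of("Role 1", "Role 2")) t.grant("User 2", r);
        for (String r : List.of("Role 4", "Role 5")) t.grant("User 3", r);
        for (String r : List.of("Role 2", "Role 3")) t.grant("User 4", r);
        return t;
    }

    public static void main(String[] args) {
        UserToRoleSketch t = slideExample();
        System.out.println("rows rewritten: " + t.dropRole("Role 2"));
        System.out.println("User 2 now has: " + t.rolesOf("User 2"));
    }
}
```

In a normalized RDBMS schema the same drop would be a delete on a join table; here the cost moves from the query path to the (rarer) DDL path.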

Partition Keys

Range scan for most queries
– Where date = '201601' and state = 'CA'
– Where date >= '201602' and date < '201604'

Server-side filter for the rest
– Where state = 'CA' (not a prefix key)
– Where date like '2016' (regex)
– Where date > '201601' and state > 'OR' (cannot be a range scan)
– Scans all keys but does not deserialize values

date | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
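The difference between the two access paths can be illustrated with a sorted map standing in for the HBase table. Everything here is an assumption made for illustration: the class `PartitionKeyScanSketch`, the "/" separator between key parts, and the string values are hypothetical and not the actual Hive row-key layout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Toy model of a partition table whose row key is "date/state", kept in sorted order. */
final class PartitionKeyScanSketch {
    private final NavigableMap<String, String> rows = new TreeMap<>();

    void addPartition(String date, String state) {
        rows.put(date + "/" + state, "partition proto for " + date + "," + state);
    }

    /** Range scan on the key prefix: only keys in [startDate, stopDate) are touched. */
    List<String> scanDateRange(String startDate, String stopDate) {
        return new ArrayList<>(rows.subMap(startDate, true, stopDate, false).keySet());
    }

    /** Filter on a non-prefix column: every key is examined, but no value is read. */
    List<String> filterByState(String state) {
        return rows.keySet().stream()
                .filter(key -> key.endsWith("/" + state))
                .collect(Collectors.toList());
    }

    /** The five partitions from the slide. */
    static PartitionKeyScanSketch slideExample() {
        PartitionKeyScanSketch t = new PartitionKeyScanSketch();
        t.addPartition("201601", "CA");
        t.addPartition("201601", "WA");
        t.addPartition("201602", "CA");
        t.addPartition("201603", "CA");
        t.addPartition("201605", "CA");
        return t;
    }

    public static void main(String[] args) {
        PartitionKeyScanSketch t = slideExample();
        System.out.println(t.scanDateRange("201602", "201604"));
        System.out.println(t.filterByState("CA"));
    }
}
```

The range scan never looks at the 201601 or 201605 rows, while the state filter has to walk all five keys; that is the cost the slide attributes to predicates that are not on a key prefix.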

Typed Partition Keys

Binary sorted
– HBase range scan: Scan(byte[] startRow, byte[] stopRow)
– Where key1 >= 'A5' and key2 >= 8
  • startRow: 41 35 00 00 00 00 08

Using BinarySortableSerDe
– Supports all Hive data types
– Handles null

(String, Integer) | Bytes
'A10', 3 | 41 31 30 00 00 00 00 03
'A10', 10 | 41 31 30 00 00 00 00 0A
'A5', 4 | 41 35 00 00 00 00 04
'A5', 15 | 41 35 00 00 00 00 0F
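The byte layout in the table above can be reproduced with a short sketch. It assumes the simplified encoding the slide appears to use (string bytes, a 0x00 terminator, then a 4-byte big-endian integer); the class `TypedKeySketch` is hypothetical and is not the BinarySortableSerDe implementation, and it ignores negative numbers and nulls, which a real serializer has to handle. Note that 15 encodes as 0F under this layout.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Simplified, order-preserving encoding of a (String, int) partition key. */
final class TypedKeySketch {
    /** Layout: UTF-8 bytes of s, a 0x00 terminator, then n as a 4-byte big-endian int. */
    static byte[] encode(String s, int n) {
        byte[] text = s.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(text.length + 1 + 4)
                .put(text)
                .put((byte) 0x00)
                .putInt(n) // ByteBuffer writes big-endian by default
                .array();
    }

    /** Space-separated upper-case hex, matching the notation on the slide. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    /** Unsigned lexicographic comparison, i.e. a binary sort order over row keys. */
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        System.out.println(hex(encode("A10", 3)));
        System.out.println(hex(encode("A5", 15)));
        // 'A10' sorts before 'A5' because 0x31 < 0x35 in the second byte.
        System.out.println(compare(encode("A10", 10), encode("A5", 4)) < 0);
    }
}
```

Because the encoded keys sort the same way as the typed values (for non-negative integers in this sketch), a predicate such as key1 >= 'A5' and key2 >= 8 can be turned into a startRow for a range scan instead of a full-table filter.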


Storage Descriptor de-duplication


struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}


message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
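A minimal sketch of the de-duplication idea follows, assuming the behavior suggested by the schema: storage descriptors are stored once under the MD5 of their serialized form together with a reference count, and each partition keeps only that hash (the sd_hash field). The class `SdDedupSketch`, its in-memory maps, and the use of plain strings in place of serialized protos are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Toy model of HBMS_SDS: descriptors keyed by MD5, shared between partitions. */
final class SdDedupSketch {
    /** One HBMS_SDS-like row: the serialized descriptor plus a reference count. */
    static final class SdRow {
        final byte[] serialized;
        int refCount;
        SdRow(byte[] serialized) { this.serialized = serialized; }
    }

    private final Map<String, SdRow> sds = new HashMap<>();         // md5 hex -> descriptor row
    private final Map<String, String> partitions = new HashMap<>(); // partition key -> sd hash

    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required algorithm on the Java platform
        }
    }

    void addPartition(String partitionKey, String serializedSd) {
        dropPartition(partitionKey); // re-adding a partition must not leak a reference
        byte[] bytes = serializedSd.getBytes(StandardCharsets.UTF_8);
        String hash = md5Hex(bytes);
        sds.computeIfAbsent(hash, h -> new SdRow(bytes)).refCount++;
        partitions.put(partitionKey, hash);
    }

    void dropPartition(String partitionKey) {
        String hash = partitions.remove(partitionKey);
        if (hash == null) return;
        SdRow row = sds.get(hash);
        if (--row.refCount == 0) sds.remove(hash); // last reference gone: delete the descriptor
    }

    int storedDescriptors() { return sds.size(); }

    int refCount(String serializedSd) {
        SdRow row = sds.get(md5Hex(serializedSd.getBytes(StandardCharsets.UTF_8)));
        return row == null ? 0 : row.refCount;
    }

    public static void main(String[] args) {
        SdDedupSketch store = new SdDedupSketch();
        store.addPartition("db.tbl/201601", "orc,cols=a:int");
        store.addPartition("db.tbl/201602", "orc,cols=a:int");
        System.out.println("descriptors stored: " + store.storedDescriptors());
    }
}
```

In the protobuf Partition message, location and sd_parameters sit outside the StorageDescriptor; a plausible reading is that this lets partitions that differ only in those fields share one descriptor row.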

HBase schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server passes Thrift objects to the HBase client open in the Thrift server
• HBase client extracts fields from Thrift objects and converts them to corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables

Example: adding a new partition, "add_partition(Partition new_part)"

HBase schema

Example: adding a new partition, "add_partition(Partition new_part)"

[Diagram: the Thrift struct Partition is converted to the protobuf message Partition; labels HBMS_PARTITIONS and HBMS_SDS]

struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}


Caching

Aggregate Stats
• Location: on HBase
• Compile time

File Footers
• Location: on HBase
• Runtime: accessed from tasks

Tables, Partitions, Storage Descriptors
• Location: on Metastore server(s)
• Compile time

Caching Aggregate Stats

"get_aggr_stats_for(dbName, tblName, partNames, colNames)"

• Gets aggregated stats for columns in each partition – expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1, …, partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache: AggrStats is calculated on the client side & cached, and the entry is saved as a serialized AggrStats proto
  • AggrStatsBloomFilter is created on the partitions contained in AggrStats
• Invalidation
  • TTL expiry: nodes evicted from cache
  • Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
  • Invalidator thread picks up an invalidation request & executes a filter on HBase to remove expired entries
  • Uses the bloom filter to find all AggrStats protos that contain the candidate partition & removes them from the cache
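The bloom-filter invalidation step can be sketched as follows. This is an illustration under assumptions, not the Hive implementation: the class `AggrStatsCacheSketch`, the 128-bit filter with two hash positions, and the string stand-ins for AggrStats protos are all hypothetical, and TTL expiry and the invalidation queue are left out.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy aggregate-stats cache whose entries remember, approximately, which partitions they cover. */
final class AggrStatsCacheSketch {
    /** Tiny Bloom filter over partition names: 128 bits, two bit positions per name. */
    static final class PartitionFilter {
        private static final int BITS = 128;
        private final BitSet bits = new BitSet(BITS);

        private static int h1(String s) { return Math.floorMod(s.hashCode(), BITS); }
        private static int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 17, BITS); }

        void add(String partName) {
            bits.set(h1(partName));
            bits.set(h2(partName));
        }

        /** May report a false positive, never a false negative. */
        boolean mightContain(String partName) {
            return bits.get(h1(partName)) && bits.get(h2(partName));
        }
    }

    static final class Entry {
        final String aggrStats;
        final PartitionFilter filter = new PartitionFilter();

        Entry(String aggrStats, List<String> partNames) {
            this.aggrStats = aggrStats;
            partNames.forEach(filter::add);
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    void put(String key, String aggrStats, List<String> partNames) {
        cache.put(key, new Entry(aggrStats, partNames));
    }

    String get(String key) {
        Entry e = cache.get(key);
        return e == null ? null : e.aggrStats;
    }

    /** Drops every cached aggregate whose filter says it may cover the changed partition. */
    int invalidate(String changedPartition) {
        int before = cache.size();
        cache.values().removeIf(e -> e.filter.mightContain(changedPartition));
        return before - cache.size();
    }

    public static void main(String[] args) {
        AggrStatsCacheSketch c = new AggrStatsCacheSketch();
        c.put("k1", "stats over p1,p2", List.of("p1", "p2"));
        c.put("k2", "stats over p3", List.of("p3"));
        System.out.println("entries dropped: " + c.invalidate("p1"));
    }
}
```

A false positive only costs an unnecessary recomputation of one aggregate, so a compact filter is a reasonable trade against storing the full partition list with every cached entry.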

Caching File Footers

• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic


HBaseMetaStore Needs Transaction

Atomicity is required
– Create table/partition also creates a storage descriptor
– Alter table also alters partitions
– Drop table also drops table column privileges

HBase doesn't support transactions
– No cross-row transactions

HBaseConnection
– Supports different transaction managers in theory
– VanillaHBaseConnection: no transactions

Omid

Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
– First release this Monday

Snapshot isolation
– Natural, as HBase is a versioned database
– No locking, no deadlock, no blocking for both reads and writes
– When two concurrent transactions write to the same data, the later one aborts

Low overhead

Omid Components

TSO Server (Timestamp Oracle)
– Generates transid
– Tracks the status of transactions

TSO Client
– Talks to TSO
– Caches transaction metadata
– Most reads don't need to talk to TSO

Compactor
– Runs as an HBase Coprocessor
– Removes stale cell versions

[Diagram: Client, TSO, and HBase with the Compactor]

Omid Operations

Open transaction
– Get transid from TSO

Read a cell
– Read all versions of the cell from HBase
– Use the latest version committed before the transaction started

Write a cell
– Write the value, versioned with transid, to HBase

Commit
– Get commitid from TSO
– TSO figures out whether there is a conflict using transaction metadata

Omid Data Structure

Memory management in TSO
– Never runs out of memory: aborts old transactions

[Diagram: TSO in-memory tables]

lastCommit
  row1 | T20
  row2 | T25
  row5 | T22

committed
  T10 | T20
  T4 | T25
  T11 | T30
  T2 … …

aborted

• Detects transaction conflicts at commit time
• Largest chunk of memory
• Constructs snapshot at read time
• Partially replicated to client
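How these tables could support commit-time conflict detection and snapshot reads is sketched below. This is a simplified model and not Omid's code: the class `TsoSketch` is hypothetical, a single counter stands in for the timestamp oracle, the committed table is read as start timestamp to commit timestamp (an assumption about the slide), and the aborted table and memory eviction are omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Toy timestamp oracle with first-committer-wins conflict detection. */
final class TsoSketch {
    private long clock = 0;
    private final Map<String, Long> lastCommit = new HashMap<>(); // row -> commit timestamp of last writer
    private final Map<Long, Long> committed = new HashMap<>();    // start timestamp -> commit timestamp

    /** Opens a transaction and returns its start timestamp (the "transid"). */
    long begin() {
        return ++clock;
    }

    /** Returns the commit timestamp, or -1 if the transaction has to abort. */
    long commit(long startTs, Set<String> writeSet) {
        for (String row : writeSet) {
            Long last = lastCommit.get(row);
            if (last != null && last > startTs) {
                return -1; // another transaction committed this row after our snapshot was taken
            }
        }
        long commitTs = ++clock;
        for (String row : writeSet) {
            lastCommit.put(row, commitTs);
        }
        committed.put(startTs, commitTs);
        return commitTs;
    }

    /** Snapshot rule: a version is visible only if its writer committed before the reader started. */
    boolean isVisible(long writerStartTs, long readerStartTs) {
        Long commitTs = committed.get(writerStartTs);
        return commitTs != null && commitTs < readerStartTs;
    }

    public static void main(String[] args) {
        TsoSketch tso = new TsoSketch();
        long t1 = tso.begin();
        long t2 = tso.begin();
        System.out.println("t1 commit: " + tso.commit(t1, Set.of("row1")));
        System.out.println("t2 commit: " + tso.commit(t2, Set.of("row1"))); // conflicts with t1
    }
}
```

No locks are taken at any point: writers proceed optimistically and the loser of a write-write conflict is only discovered at commit, which matches the "later one aborts" behavior described for snapshot isolation.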

Transaction Conflict

Two concurrent DDLs write to the same data
– Proper retry logic

Task node writes: ORC footer cache
– High chance of write conflict
– Row mutation is atomic in HBase
– Cross-row atomicity is not required
– Bypass the transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)


Deployment

Server-side components in HBase
– Server-side filter
– Omid compactor
– Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
– hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram: Hive MetaStore, TSO, and HBase hosting the Server Side Filter and the Omid Compactor]

Deploy Omid

Create Omid tables in HBase
– omid.sh create-hbase-commit-table
– omid.sh create-hbase-timestamp-table

Start Omid TSO
– omid.sh tso

Related config in hive-site.xml
– hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
– tso.host=localhost
– tso.port=54758
– omid.client.connectionType=DIRECT

Instantiate HBase Metastore

Instantiate HBase tables from scratch
– hive --service hbaseschematool --install

hbaseimport: import an existing Hive Metastore
– One-way import from ObjectStore to HBaseStore
– hive --service hbaseimport


TPCDS queries

[Chart: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. X-axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y-axis: 0 to 6000]

1824 partitions
Sweet spot for ObjectStore
Average speedup for all TPCDS queries
– 219 (without Omid)
– 212 (with Omid)


Current Status

hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
– Minor holes: event notification/version/constraints
– Deprecate listTableNamesByFilter/listPartitionNamesByFilter
– Tools enhancement
– ACID is not supported

Runs most e2e queries
Fixing unit tests
– TestMiniTezCliDriver: all pass
– TestCliDriver: HIVE-14097 pending review
– Not production quality yet

Future Work - ACID

Transaction metadata is stored in Metastore
– Locks
– Txns
– Compactions

Data structure is harder to de-normalize
New work: transaction server
– Keep lock and transaction tree in memory

Future Work – HA via HBase Coprocessor

Two new server components
– Omid TSO Server
– Transaction Server

All servers need HA
– Management headache

Automatic HA through HBase Coprocessor

[Diagram: two Region Servers, each hosting a TSO Server via CoProcessor]

Future Work – Other

Stats aggregation
– Coprocessor

Improving ObjectCache
– Rudimentary implementation currently
– LRU

Omid consuming high CPU
– 300% CPU always, by design
– High throughput, avoids context switches
– Might be an issue for small systems


Thank You

  • Hive Hbase Metastore - Improving Hive with a Big Data Metadata
  • Agenda
  • What is Hive MetaStore
  • Low latency in Hive
  • New BottleNet - Metastore
  • Besides Latency
  • ER Diagram for ObjectStore Database
  • How About Improving ObjectStore
  • Agenda (2)
  • System Architecture
  • RDBMS schema
  • RDBMS schema (2)
  • HBase schema
  • HBase schema (2)
  • De-normalization
  • Partition Keys
  • Typed Partition Keys
  • HBase schema (3)
  • HBase schema (4)
  • HBase schema (5)
  • HBase schema (6)
  • HBase schema (7)
  • Agenda (3)
  • Caching
  • Caching Aggregate Stats
  • Caching File Footers
  • Agenda (4)
  • HBaseMetaStore Needs Transaction
  • Omid
  • Omid Components
  • Omid Operations
  • Omid Data Structure
  • Transaction Conflict
  • Agenda (5)
  • Deployment
  • Deploy Omid
  • Instantiate HBase Metastore
  • Agenda (6)
  • TPCDS queries
  • Agenda (7)
  • Current Status
  • Future Work - ACID
  • Future Work – HA via HBase Coprocessor
  • Future Work – Other
  • Thank You

22 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

HBase schema

ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in

metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding

protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)

bull Writesreads the protobuf payloads tofrom HBase tables

Example adding a new partition ldquoadd_partition(Partition new_part)rdquo

struct Partition

1 listltstringgt values

2 string dbName

3 string tableName

4 i32 createTime

5 i32 lastAccessTime

6 StorageDescriptor sd

7 mapltstring stringgt parameters

8 optional PrincipalPrivilegeSet privileges

message Partition

optional int64 create_time = 1

optional int64 last_access_time = 2

optional string location = 3

optional Parameters sd_parameters = 4

required bytes sd_hash = 5

optional Parameters parameters = 6

HBMS_

PARTITIONS

HBMS_

SDS

23 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

24 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Caching

Aggregate Statsbull Location - on HBasebull Compile time

File Footers bull Location - on HBasebull Runtime - accessed from tasks

Tables Partitions Storage Descriptors bull Location - on Metastore server(s)bull Compile time

25 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Caching Aggregate Stats

ldquoget_aggr_stats_for(dbName tblName partNames colNames)rdquo

bull Gets aggregated stats for columns in each partition ndash expensive callbull Used in CBO Stats Annotation Stats Optimizerbull HBMS_AGGR_STATS

bull RowKey md5(dbName tblName partVal1 partValn colName) bull Columns AggrStats proto and AggrStatsBloomFilter proto

bull Lookup bull New entry added for each key not found in cache AggrStats calculated on client

side amp cached entry saved as serialized AggrStats proto bull AggrStatsBloomFilter created on partitions contained in AggrStats

bull Invalidation bull TTL expiry nodes evicted from cachebull Alter partition Drop partition Analyze etc add invalidation request to a queuebull Invalidator thread picks invalidation request amp executes a filter on HBase to

removes expired entriesbull Uses the bloom filter to find all AggrStats proto contains the candidate partition amp

removes them from the cache

26 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Caching File Footers

bull ORC footer cachebull Task write file footers to a cache table on HBase (HBMS_FILE_METADATA RowKey fileId)bull Read from AM for split generation (avoids reading lots of HDFS files for split generation)bull Since fileId is unique overwrite not a problem Stale entries removed by a cleaner

thread

bull Skip transactionbull High overheadbull Transaction conflictbull Row mutation is already atomic

27 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

28 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

HBaseMetaStore Needs Transaction

Atomic is requiredndash Create table partition also create storage descriptorndash Alter table also alter partitionsndash Drop table also drop table column privilege

HBase donrsquot support transactionndash Donrsquot support cross-row transactions

HBaseConnectionndash Support different transaction manager in theoryndash VanillaHBaseConnection no transaction

29 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid

Transaction layer on top of Hbase Initially developed by Yahoo Apache incubator project

ndash First release this Monday

Snapshot isolationndash Natural as HBase is a versioned databasendash No locking no dead lock no blocking for both read and writendash Two concurrent transaction write to the same data the later one aborts

Low overhead

30 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid Components

TSO Server (Timestamp Oracle)ndash Generate transidndash Status of transaction

TSO Clientndash Talk to TSOndash Cache transaction metadatandash Most read donrsquot need to talk to TSO

Compactorndash Run as HBase Coprocessorndash Remove stale cell versions

HBaseCompactor

Client

TSO

31 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid Operations

Open transactionndash Get transid from TSO

Read a cellndash Read all versions of the cell from HBasendash Read latest committed version before transaction start

Write a cellndash Write value versioned with transid to HBase

Commitndash Generate commitid from TSOndash TSO figure out if there is conflict using transaction metadata

32 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid Data Structure

Memory management in TSOndash Never run OOM abort old transactions

TSO

row1 T20

row2 T25

row5 T22

lastCommit committedT10 T20

T4 T25

T11 T30

T2 hellip hellip

aborted

bull Detect transaction conflict at commit time

bull Largest trunk of memory

bull Construct snapshot at read time

bull Partially replicated to client

33 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Transaction Conflict

Two concurrent DDL write to the same datandash Proper retry logic

Task node writes - ORC footer cache

ndash High chance for write conflictndash Row mutation is atomic in Hbasendash Cross row atomic is not requiredndash Bypass transaction layer

public void putFileMetadata(ListltLonggt fileIds ListltByteBuffergt metadata FileMetadataExprType type)

34 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

35 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Deployment

Server side components in HBasendash Server side filterndash Omid compactorndash Copy related hive jars into hbase hive-commonjar hive-metastorejar hive-serde-jar

New config in hive-sitexmlndash hivemetastorerawstoreimpl orgapachehadoophivemetastorehbaseHBaseStore

Server Side Filter

Omid Compactor

HBase

TSO

Hive MetaStore

36 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Deploy Omid

Create Omid Tables in HBasendash omidsh create-hbase-commit-tablendash omidsh create-hbase-timestamp-table

Start Omid TSOndash omidsh tso

Related config in hive-sitexmlndash hivemetastorehbaseconnectionclass=orgapachehadoophivemetastorehbaseOmid

HBaseConnectionndash tsohost=localhostndash tsoport=54758ndash omidclientconnectionType=DIRECT

37 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Instantiate HBase Metastore

Instantiate Hbase Tables from scratchndash hive --service hbaseschematool --install

Hbaseimport import existing Hive Metastorendash One way import from ObjectStore to HBaseStorendash hive --service hbaseimport

38 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

39 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

TPCDS queries

Query 7 Query 15 Query 27 Query 29 Query 39 Query 46 Query 56 Query 68 Query 70 Query 760

1000

2000

3000

4000

5000

6000

Query Plan Time for TPCDS queries

HBaseStore HBaseStore+Omid ObjectStore

1824 partitions Sweetspot for ObjectStore Average Speed up for all TPCDS queries

ndash 219 (without Omid)ndash 212 (With Omid)

40 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Current Status

hbase-metastore branch merged to master last September Turn off by default Feature parity Almost

ndash Minor holes event notificationversionconstraintsndash Deprecate listTableNamesByFilterlistPartitionNamesByFilterndash Tools enhancementndash ACID is not supported

Run most e2e queries Fixing unit tests

ndash TestMiniTezCliDriver all passndash TestCliDriver HIVE-14097 pending reviewndash Not production quality yet

42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Future Work - ACID

Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions

Data structure is harder to de-normalize New work transaction server

ndash Keep lock and transaction tree in memory

43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Future Work ndash HA via HBase Coprocessor

Two new server componentsndash Omid TSO Serverndash Transaction Server

All servers need HAndash Management headache

Automatic HA through HBase Coprocessor

TSO Server via CoProcessor

TSO Server via CoProcessor

Region Server Region Server

44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Future Work ndash Other

Stats Aggregationndash Coprocessor

Improving ObjectCachendash Rudimentary implementation currentlyndash LRU

Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system

45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Thank You

  • Hive Hbase Metastore - Improving Hive with a Big Data Metadata
  • Agenda
  • What is Hive MetaStore
  • Low latency in Hive
  • New BottleNet - Metastore
  • Besides Latency
  • ER Diagram for ObjectStore Database
  • How About Improving ObjectStore
  • Agenda (2)
  • System Architecture
  • RDBMS schema
  • RDBMS schema (2)
  • HBase schema
  • HBase schema (2)
  • De-normalization
  • Partition Keys
  • Typed Partition Keys
  • HBase schema (3)
  • HBase schema (4)
  • HBase schema (5)
  • HBase schema (6)
  • HBase schema (7)
  • Agenda (3)
  • Caching
  • Caching Aggregate Stats
  • Caching File Footers
  • Agenda (4)
  • HBaseMetaStore Needs Transaction
  • Omid
  • Omid Components
  • Omid Operations
  • Omid Data Structure
  • Transaction Conflict
  • Agenda (5)
  • Deployment
  • Deploy Omid
  • Instantiate HBase Metastore
  • Agenda (6)
  • TPCDS queries
  • Agenda (7)
  • Current Status
  • Future Work - ACID
  • Future Work ndash HA via HBase Coprocessor
  • Future Work ndash Other
  • Thank You
System Architecture

[Diagram: HiveMetaStore Thrift Client, HiveMetaStore Thrift Server, ObjectStore and HBaseStore, RDBMS and HBase, Omid]

- Two implementations of the RawStore interface
  - HBaseStore
  - ObjectStore
- Both backends will live together for a while
- HBaseStore
  - Most traffic will go through the transaction layer (Omid)
  - Some traffic will bypass the transaction layer
    - Volatile data
    - High possibility of conflict

RDBMS schema

Read/write path:
- The Thrift client creates Thrift objects for RPC (based on the specs in metastore/if/hive_metastore.thrift).
- The Thrift server extracts values from the Thrift objects and creates the corresponding ORM model objects.
- The ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS, using the appropriate foreign key references.
- An RDBMS fast path is enabled by not using the ORM and writing direct SQL. However, this complicates the testing matrix, as there may be slight variations in SQL semantics across RDBMS databases.

Example: adding a new partition with add_partition(Partition new_part)

RDBMS schema

Example: adding a new partition with add_partition(Partition new_part)

struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

RDBMS tables shown in the diagram: TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS

HBase schema

Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": metadata footer proto; "s": PPD stats

HBase schema

Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences

De-normalization

- Goal
  - Optimized for querying
  - May be slower for DDL
- Example: drop_role(String roleName)

HBMS_USER_TO_ROLE

Key | Value
bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") | Proto(Role 1, Role 2)
bytes("User 3") | Proto(Role 4, Role 5)
bytes("User 4") | Proto(Role 2, Role 3)

- Dropping a role requires scanning and deserializing every row
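The cost of drop_role on a de-normalized table can be sketched in a few lines. This is a minimal in-memory illustration, not the HBaseStore code: the class UserToRoleSketch and its methods are hypothetical names, a LinkedHashMap stands in for HBMS_USER_TO_ROLE, and a List<String> stands in for the serialized RoleList proto.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for HBMS_USER_TO_ROLE: user name -> list of roles.
final class UserToRoleSketch {
    private final Map<String, List<String>> rows = new LinkedHashMap<>();

    void put(String user, String... roles) {
        rows.put(user, new ArrayList<>(Arrays.asList(roles)));
    }

    // Query path: a single keyed lookup, which is what the layout is optimized for.
    List<String> rolesOf(String user) {
        List<String> roles = rows.get(user);
        return roles == null ? Collections.<String>emptyList() : roles;
    }

    // DDL path: there is no role -> users index, so every row is visited.
    // Returns the number of rows that had to be inspected.
    int dropRole(String role) {
        int scanned = 0;
        for (List<String> roles : rows.values()) {
            scanned++;
            roles.remove(role);
        }
        return scanned;
    }

    public static void main(String[] args) {
        UserToRoleSketch table = new UserToRoleSketch();
        table.put("User 1", "Role 1", "Role 2", "Role 3", "Role 5");
        table.put("User 2", "Role 1", "Role 2");
        System.out.println("rows scanned: " + table.dropRole("Role 2"));
    }
}
```

Reads by user stay a single lookup, while the rare DDL operation pays for a full pass over the table.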

Partition Keys

Range scan for most queries
- WHERE date = '201601' AND state = 'CA'
- WHERE date >= '201602' AND date < '201604'

Server-side filter for the rest
- WHERE state = 'CA' (not a key prefix)
- WHERE date LIKE '2016%' (regex)
- WHERE date > '201601' AND state > 'OR' (cannot be a range scan)
- Scans all keys but does not deserialize the values

date | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
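The difference between the two access paths can be sketched with a sorted set standing in for the row-key order of HBMS_PARTITIONS. This is an illustration under assumptions: PartitionKeyScanSketch, its "date/state" key format, and the keysVisited counter are made up for the example and are not part of Hive or HBase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model of sorted partition keys of the form "date/state".
final class PartitionKeyScanSketch {
    private final TreeSet<String> keys = new TreeSet<>();
    int keysVisited; // keys touched by the most recent call

    void add(String date, String state) {
        keys.add(date + "/" + state);
    }

    // date >= lo AND date < hi: the predicate is on the key prefix,
    // so only the keys inside [lo, hi) are visited.
    List<String> rangeScanByDate(String lo, String hi) {
        List<String> out = new ArrayList<>(keys.subSet(lo, true, hi, false));
        keysVisited = out.size();
        return out;
    }

    // state = X: not a key prefix, so every key is visited and tested.
    List<String> filterByState(String state) {
        List<String> out = new ArrayList<>();
        keysVisited = 0;
        for (String key : keys) {
            keysVisited++;
            if (key.endsWith("/" + state)) {
                out.add(key);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        PartitionKeyScanSketch p = new PartitionKeyScanSketch();
        p.add("201601", "CA");
        p.add("201601", "WA");
        p.add("201602", "CA");
        p.add("201603", "CA");
        p.add("201605", "CA");
        System.out.println(p.rangeScanByDate("201602", "201604") + " visited=" + p.keysVisited);
        System.out.println(p.filterByState("CA") + " visited=" + p.keysVisited);
    }
}
```

With the five sample partitions above, the date range touches two keys, while the state filter has to look at all five.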

Typed Partition Keys

Binary sorted
- HBase range scan: Scan(byte[] startRow, byte[] stopRow)
- WHERE key1 >= 'A5' AND key2 >= 8
  - startRow: 41 35 00 00 00 00 08

Using BinarySortableSerDe
- Supports all Hive data types
- Handles null

(String, Integer) | Bytes
('A10', 3) | 41 31 30 00 00 00 00 03
('A10', 10) | 41 31 30 00 00 00 00 0A
('A5', 4) | 41 35 00 00 00 00 04
('A5', 15) | 41 35 00 00 00 00 0F
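The byte layout in the table can be reproduced with a short encoder. This sketch follows the simplified layout drawn on the slide (UTF-8 string bytes, a 0x00 terminator, then a 4-byte big-endian integer); it is not the BinarySortableSerDe implementation, and it assumes a non-negative integer and no 0x00 byte inside the string. SortableKeySketch and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical encoder for a (String, int) partition key, slide layout only.
final class SortableKeySketch {
    static byte[] encode(String s, int n) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] str = s.getBytes(StandardCharsets.UTF_8);
        out.write(str, 0, str.length);
        out.write(0x00);      // string terminator
        out.write(n >>> 24);  // write(int) keeps only the low 8 bits
        out.write(n >>> 16);
        out.write(n >>> 8);
        out.write(n);
        return out.toByteArray();
    }

    // Unsigned lexicographic comparison, the order in which row keys sort.
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode("A5", 8))); // prints 41 35 00 00 00 00 08
    }
}
```

Because the encoded bytes sort the same way as the typed values (under the stated assumptions), a predicate on the leading key column can be turned into a startRow for a range scan.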

Storage Descriptor de-duplication

HBMS_SDS: key bytes(md5(SD proto)); cf_catalog columns "c" (StorageDescriptor proto) and "ref" (reference count)

Thrift definitions:

struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}

Protobuf definitions. The Partition message keeps location and sd_parameters inline and refers to the shared StorageDescriptor through sd_hash:

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

message StorageDescriptor {
  message Order { ... }
  message SerDeInfo { ... }
  message SkewedInfo { ... }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
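The idea of keying storage descriptors by a hash of their serialized form, with a reference count per entry, can be sketched as follows. This is a simplified in-memory model, not the HBaseStore implementation: SdDedupSketch and its methods are hypothetical, two maps stand in for the "c" and "ref" columns of HBMS_SDS, and removing an entry when its count reaches zero is an assumption of the sketch.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical content-addressed store for serialized storage descriptors.
final class SdDedupSketch {
    private final Map<String, byte[]> sdByHash = new HashMap<>();  // column "c"
    private final Map<String, Integer> refCount = new HashMap<>(); // column "ref"

    // Stores the descriptor once per distinct content and returns its hash,
    // which a partition row would keep in sd_hash.
    String put(byte[] serializedSd) {
        String hash = md5Hex(serializedSd);
        sdByHash.putIfAbsent(hash, serializedSd.clone());
        refCount.merge(hash, 1, Integer::sum);
        return hash;
    }

    // Drops one reference; the descriptor is removed when none are left.
    void release(String hash) {
        Integer left = refCount.computeIfPresent(hash, (k, v) -> v > 1 ? v - 1 : null);
        if (left == null) {
            sdByHash.remove(hash);
        }
    }

    int distinctDescriptors() {
        return sdByHash.size();
    }

    int references(String hash) {
        return refCount.getOrDefault(hash, 0);
    }

    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SdDedupSketch store = new SdDedupSketch();
        byte[] sd = {1, 2, 3};
        String h = store.put(sd);
        store.put(sd);
        System.out.println(store.distinctDescriptors() + " descriptor, " + store.references(h) + " references");
    }
}
```

Partitions of one table usually share the same columns, formats, and SerDe, so many partition rows can point at a single descriptor entry.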

HBase schema

Read/write path:
- The Thrift client creates Thrift objects for RPC (based on the specs in metastore/if/hive_metastore.thrift).
- The Thrift server passes the Thrift objects to the HBase client opened in the Thrift server.
- The HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto).
- It writes/reads the protobuf payloads to/from HBase tables.

Example: adding a new partition with add_partition(Partition new_part)

[Diagram: Thrift struct Partition, protobuf message Partition, tables HBMS_PARTITIONS and HBMS_SDS]

Caching

Aggregate stats
- Location: on HBase
- Compile time

File footers
- Location: on HBase
- Runtime: accessed from tasks

Tables, partitions, storage descriptors
- Location: on Metastore server(s)
- Compile time

Caching Aggregate Stats

get_aggr_stats_for(dbName, tblName, partNames, colNames)

- Gets aggregated stats for columns in each partition; an expensive call
- Used in CBO, Stats Annotation, Stats Optimizer
- HBMS_AGGR_STATS
  - RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  - Columns: AggrStats proto and AggrStatsBloomFilter proto
- Lookup
  - A new entry is added for each key not found in the cache; AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  - An AggrStatsBloomFilter is created on the partitions contained in the AggrStats
- Invalidation
  - TTL expiry: nodes are evicted from the cache
  - Alter partition, drop partition, analyze, etc. add an invalidation request to a queue
  - An invalidator thread picks up the invalidation request and executes a filter on HBase to remove expired entries
  - It uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
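The bloom-filter invalidation step can be sketched with a small in-memory cache. Everything here is illustrative: AggrStatsCacheSketch, its nested classes, the 1024-bit filter, and the two hash probes are assumptions made for the example, and a String stands in for the serialized AggrStats proto.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical cache of aggregate stats, each entry tagged with a Bloom
// filter over the partition names it was computed from.
final class AggrStatsCacheSketch {
    static final class Bloom {
        private final BitSet bits = new BitSet(1024);

        private int h1(String s) { return Math.floorMod(s.hashCode(), 1024); }
        private int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 17, 1024); }

        void add(String s) {
            bits.set(h1(s));
            bits.set(h2(s));
        }

        // May return true for a name never added, never false for one that was.
        boolean mightContain(String s) {
            return bits.get(h1(s)) && bits.get(h2(s));
        }
    }

    static final class CachedStats {
        final String aggrStats; // stands in for the serialized AggrStats proto
        final Bloom partitions = new Bloom();

        CachedStats(String aggrStats, List<String> partNames) {
            this.aggrStats = aggrStats;
            for (String p : partNames) {
                partitions.add(p);
            }
        }
    }

    private final Map<String, CachedStats> cache = new HashMap<>();

    void put(String key, String aggrStats, List<String> partNames) {
        cache.put(key, new CachedStats(aggrStats, partNames));
    }

    boolean contains(String key) {
        return cache.containsKey(key);
    }

    // Evicts every entry whose filter may cover the changed partition and
    // returns the number of evictions. A false positive only costs a
    // recomputation; it never leaves a stale entry behind.
    int invalidate(String changedPartition) {
        int removed = 0;
        for (Iterator<CachedStats> it = cache.values().iterator(); it.hasNext();) {
            if (it.next().partitions.mightContain(changedPartition)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

The filter keeps each cache entry small: it records which partitions contributed to an aggregate without storing the full partition list.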

Caching File Footers

- ORC footer cache
  - Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  - Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  - Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
- Skip transactions
  - High overhead
  - Transaction conflicts
  - Row mutation is already atomic

HBaseMetaStore Needs Transaction

Atomicity is required
- Creating a table or partition also creates a storage descriptor
- Altering a table also alters partitions
- Dropping a table also drops table and column privileges

HBase doesn't support transactions
- No support for cross-row transactions

HBaseConnection
- Supports different transaction managers in theory
- VanillaHBaseConnection: no transactions

Omid

Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday

Snapshot isolation
- Natural, as HBase is a versioned database
- No locking, no deadlock, no blocking for both reads and writes
- When two concurrent transactions write to the same data, the later one aborts

Low overhead

Omid Components

TSO Server (Timestamp Oracle)
- Generates transid
- Tracks the status of transactions

TSO Client
- Talks to the TSO
- Caches transaction metadata
- Most reads don't need to talk to the TSO

Compactor
- Runs as an HBase coprocessor
- Removes stale cell versions

[Diagram: Client, TSO, HBase with Compactor]

Omid Operations

Open transaction
- Get transid from the TSO

Read a cell
- Read all versions of the cell from HBase
- Return the latest version committed before the transaction started

Write a cell
- Write the value, versioned with transid, to HBase

Commit
- Get a commitid from the TSO
- The TSO figures out whether there is a conflict using transaction metadata

Omid Data Structure

Memory management in TSO
- Never runs out of memory: old transactions are aborted

TSO data structures (from the diagram):

lastCommit: row1 -> T20, row2 -> T25, row5 -> T22
- Detects transaction conflicts at commit time
- Largest chunk of memory

committed: T10 -> T20, T4 -> T25, T11 -> T30
- Used to construct the snapshot at read time
- Partially replicated to the client

aborted: T2, ...
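The open, read, write, and commit steps can be tied to the lastCommit and committed tables with a toy single-threaded model. This is a sketch of snapshot isolation under simplifying assumptions, not Omid's code: SnapshotIsolationSketch and its methods are hypothetical, one counter issues both start and commit timestamps, and everything lives in memory in one process.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical single-process model of a timestamp oracle plus versioned cells.
final class SnapshotIsolationSketch {
    private long clock = 0;                                       // timestamp source
    private final Map<String, Long> lastCommit = new HashMap<>(); // row -> latest commit timestamp
    private final Map<Long, Long> committed = new HashMap<>();    // start timestamp -> commit timestamp
    private final Map<String, Map<Long, String>> cells = new HashMap<>(); // row -> (writer start ts -> value)

    // Open transaction: the start timestamp doubles as the transaction id.
    long begin() {
        return ++clock;
    }

    // Write a cell: the value is versioned with the writer's transaction id.
    void write(long txn, String row, String value) {
        cells.computeIfAbsent(row, r -> new HashMap<>()).put(txn, value);
    }

    // Read a cell: look at all versions and return the one with the latest
    // commit timestamp that precedes the reader's start.
    String read(long txn, String row) {
        Map<Long, String> versions = cells.get(row);
        if (versions == null) {
            return null;
        }
        String best = null;
        long bestCommit = -1;
        for (Map.Entry<Long, String> v : versions.entrySet()) {
            Long commitTs = committed.get(v.getKey());
            if (commitTs != null && commitTs < txn && commitTs > bestCommit) {
                bestCommit = commitTs;
                best = v.getValue();
            }
        }
        return best;
    }

    // Commit: abort if any written row was committed after this transaction
    // started; otherwise record the commit. Returns true on commit.
    boolean commit(long txn, Set<String> writeSet) {
        for (String row : writeSet) {
            Long last = lastCommit.get(row);
            if (last != null && last > txn) {
                return false;
            }
        }
        long commitTs = ++clock;
        for (String row : writeSet) {
            lastCommit.put(row, commitTs);
        }
        committed.put(txn, commitTs);
        return true;
    }
}
```

Writes from an aborted transaction stay in the cell map but are never visible, because the reader only accepts versions whose writer appears in the committed table.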

Transaction Conflict

Two concurrent DDL operations write to the same data
- Proper retry logic

Task node writes: ORC footer cache
- High chance of write conflict
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypasses the transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)

Deployment

Server-side components in HBase
- Server-side filter
- Omid compactor
- Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
- hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram: Hive MetaStore, TSO, HBase with Server Side Filter and Omid Compactor]

Deploy Omid

Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table

Start the Omid TSO
- omid.sh tso

Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT

Instantiate HBase Metastore

Instantiate HBase tables from scratch
- hive --service hbaseschematool --install

hbaseimport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport

TPC-DS queries

[Chart: Query Plan Time for TPC-DS queries. X axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y axis: 0 to 6000. Series: HBaseStore, HBaseStore+Omid, ObjectStore]

1824 partitions
Sweet spot for ObjectStore
Average speedup for all TPC-DS queries
- 219 (without Omid)
- 212 (with Omid)

Current Status

hbase-metastore branch merged to master last September
Turned off by default
Feature parity: almost
- Minor holes: event notification/version/constraints
- Deprecate listTableNamesByFilter/listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported

Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097, pending review
- Not production quality yet

Future Work - ACID

Transaction metadata is stored in the Metastore
- Locks
- Txns
- Compactions

Data structure is harder to de-normalize
New work: transaction server
- Keep the lock and transaction tree in memory

Future Work - HA via HBase Coprocessor

Two new server components
- Omid TSO Server
- Transaction Server

All servers need HA
- Management headache

Automatic HA through HBase coprocessors

[Diagram: two Region Servers, each running a TSO Server via coprocessor]

Future Work – Other

Stats Aggregation
– Coprocessor
Improving ObjectCache
– Rudimentary implementation currently
– LRU
Omid consuming high CPU
– 300% CPU always, by design
– High throughput, avoids context switches
– Might be an issue for small systems

Thank You

  • Hive Hbase Metastore - Improving Hive with a Big Data Metadata
  • Agenda
  • What is Hive MetaStore
  • Low latency in Hive
  • New Bottleneck - Metastore
  • Besides Latency
  • ER Diagram for ObjectStore Database
  • How About Improving ObjectStore
  • Agenda (2)
  • System Architecture
  • RDBMS schema
  • RDBMS schema (2)
  • HBase schema
  • HBase schema (2)
  • De-normalization
  • Partition Keys
  • Typed Partition Keys
  • HBase schema (3)
  • HBase schema (4)
  • HBase schema (5)
  • HBase schema (6)
  • HBase schema (7)
  • Agenda (3)
  • Caching
  • Caching Aggregate Stats
  • Caching File Footers
  • Agenda (4)
  • HBaseMetaStore Needs Transaction
  • Omid
  • Omid Components
  • Omid Operations
  • Omid Data Structure
  • Transaction Conflict
  • Agenda (5)
  • Deployment
  • Deploy Omid
  • Instantiate HBase Metastore
  • Agenda (6)
  • TPCDS queries
  • Agenda (7)
  • Current Status
  • Future Work - ACID
  • Future Work – HA via HBase Coprocessor
  • Future Work – Other
  • Thank You
Page 11: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

RDBMS schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server extracts values from Thrift objects and creates corresponding ORM model objects
• ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS using appropriate foreign key references
• RDBMS fastpath is enabled by not using ORM and writing direct SQL. However, this complicates the testing matrix, as there may be slight variations in SQL semantics across different RDBMS databases

Example: adding a new partition “add_partition(Partition new_part)”

RDBMS schema

Example: adding a new partition “add_partition(Partition new_part)”

struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

[Diagram: RDBMS tables shown alongside the Partition struct: TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS]

HBase schema

Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: “c” | “c”: Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: “c”, “ref” | “c”: StorageDescriptor proto; “ref”: reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: “c”; cf_stats: “s” -> c1 … cn | “c”: Table proto; “s”: Stats per column in the Table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: “c”; cf_stats: “s” -> c1 … cn | “c”: Partition proto; “s”: Stats per column in the Partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: “s”, “b” | “b”: AggrStatsBloomFilter proto; “s”: AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: “c” | “c”: Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: “c”; cf_stats: “s” | “c”: Metadata footer proto; “s”: PPD Stats

HBase schema

Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes(“gp”) | cf_catalog: “c” | “c”: store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: “roles” | “roles”: store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: “c” | “c”: store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: “dt”, “mk” | “dt”: store/retrieve delegation token; “mk”: master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: “c” | “c”: store/retrieve sequences

De-normalization

• Goal
  • Optimized for querying
  • May be slower in DDL
• Example: drop_role(String roleName)

HBMS_USER_TO_ROLE

Key | Value
bytes(“User 1”) | Proto(Role 1, Role 2, Role 3, Role 5)
bytes(“User 2”) | Proto(Role 1, Role 2)
bytes(“User 3”) | Proto(Role 4, Role 5)
bytes(“User 4”) | Proto(Role 2, Role 3)

• Need to scan & de-serialize everything in order to drop a role
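
The cost of drop_role in this layout can be sketched with a small in-memory model. This is an illustration only, not Hive's HBaseStore code: a TreeMap stands in for HBMS_USER_TO_ROLE, plain lists stand in for the serialized RoleList proto, and the class and method names are hypothetical (Java 9+ for List.of).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the HBMS_USER_TO_ROLE table.
class UserToRoleSketch {
    // Row key (user name) -> role list (stands in for the serialized RoleList proto).
    final Map<String, List<String>> rows = new TreeMap<>();

    void put(String user, String... roles) {
        rows.put(user, new ArrayList<>(List.of(roles)));
    }

    // There is no role -> users index, so every row is visited and every
    // value inspected. Returns the number of rows that had to be rewritten.
    int dropRole(String role) {
        int rewritten = 0;
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            if (row.getValue().remove(role)) {
                rewritten++;
            }
        }
        return rewritten;
    }

    // The four rows shown on the slide.
    static UserToRoleSketch sample() {
        UserToRoleSketch t = new UserToRoleSketch();
        t.put("User 1", "Role 1", "Role 2", "Role 3", "Role 5");
        t.put("User 2", "Role 1", "Role 2");
        t.put("User 3", "Role 4", "Role 5");
        t.put("User 4", "Role 2", "Role 3");
        return t;
    }
}
```

Looking up the roles of one user is a single keyed read, while dropRole touches all rows; that is the read-optimized trade-off the slide describes.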

Partition Keys

Range scan for most queries
– Where date = ‘201601’ and state = ‘CA’
– Where date >= ‘201602’ and date < ‘201604’
Server side filter for the rest
– Where state = ‘CA’ (not prefix key)
– Where date like ‘2016’ (regex)
– Where date > ‘201601’ and state > ‘OR’ (cannot be range scan)
– Scan all keys, but do not deserialize values

date | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
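
The difference between the two access paths can be sketched as follows. This is a toy model under stated assumptions: a sorted map stands in for the HBase table, keys are the hypothetical string form "date|state", and the class is not part of Hive (Java 9+).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.Predicate;

// Hypothetical model of a sorted partition table keyed by (date, state).
class PartitionScanSketch {
    // Keys are "date|state"; the sorted map mimics HBase's row-key ordering.
    final NavigableMap<String, String> rows = new TreeMap<>();
    int keysVisited = 0; // number of keys touched by the most recent scan

    void add(String date, String state) {
        rows.put(date + "|" + state, "partition proto bytes");
    }

    // A predicate on a key prefix becomes a range scan over [start, stop).
    List<String> rangeScan(String start, String stop) {
        List<String> hits = new ArrayList<>(rows.subMap(start, true, stop, false).keySet());
        keysVisited = hits.size();
        return hits;
    }

    // Any other predicate is evaluated as a filter over every key;
    // the values are never deserialized.
    List<String> filterScan(Predicate<String> keyFilter) {
        List<String> hits = new ArrayList<>();
        keysVisited = 0;
        for (String key : rows.keySet()) {
            keysVisited++;
            if (keyFilter.test(key)) {
                hits.add(key);
            }
        }
        return hits;
    }

    // The five partitions shown on the slide.
    static PartitionScanSketch sample() {
        PartitionScanSketch t = new PartitionScanSketch();
        t.add("201601", "CA");
        t.add("201601", "WA");
        t.add("201602", "CA");
        t.add("201603", "CA");
        t.add("201605", "CA");
        return t;
    }
}
```

With the sample rows, rangeScan("201602", "201604") reads two keys, while a filter on state = ‘CA’ has to look at all five.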

Typed Partition Keys

Binary sorted
– HBase range scan: Scan(byte[] startRow, byte[] stopRow)
– Where key1 >= ‘A5’ and key2 >= 8
  • startRow: 41 35 00 00 00 00 08
Using BinarySortableSerDe
– Supports all Hive data types
– Handles null

(String, Integer) | Bytes
‘A10’, 3 | 41 31 30 00 00 00 00 03
‘A10’, 10 | 41 31 30 00 00 00 00 0A
‘A5’, 4 | 41 35 00 00 00 00 04
‘A5’, 15 | 41 35 00 00 00 00 0F
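
The byte layout in the table can be reproduced with a short sketch. It follows the simplified layout shown on the slide (string bytes, a 0x00 terminator, a 4-byte big-endian integer) and assumes ASCII strings without embedded zero bytes and non-negative integers; it is not the actual BinarySortableSerDe format, and the class name is hypothetical (Java 9+ for Arrays.compareUnsigned).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified order-preserving encoding of a (String, int) partition key.
class SortableKeySketch {
    // String bytes, 0x00 terminator, then a 4-byte big-endian integer.
    // Assumes no 0x00 inside the string and n >= 0.
    static byte[] encode(String s, int n) {
        byte[] str = s.getBytes(StandardCharsets.US_ASCII);
        return ByteBuffer.allocate(str.length + 1 + 4)
                .put(str)
                .put((byte) 0)
                .putInt(n)
                .array();
    }

    // Row keys are compared as unsigned bytes, left to right.
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    // Space-separated upper-case hex, matching the notation in the table.
    static String hex(byte[] key) {
        StringBuilder sb = new StringBuilder();
        for (byte b : key) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }
}
```

Because the string part is compared bytewise, ‘A10’ sorts before ‘A5’; because the integer is fixed-width and big-endian, 4 sorts before 15 within the same string prefix.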

Storage Descriptor de-duplication

Table Name | Key | Column Families and Columns | Description
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: “c”, “ref” | “c”: StorageDescriptor proto; “ref”: reference count

struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}

Storage Descriptor de-duplication

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
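
The reference-counting idea behind HBMS_SDS and sd_hash can be sketched as below. This is a minimal in-memory model under assumptions: a string stands in for the serialized StorageDescriptor proto, a HashMap stands in for the table, and the class and method names are hypothetical rather than Hive's.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of HBMS_SDS: one row per distinct storage descriptor.
class SdDedupSketch {
    // Row key (hex md5 of the serialized descriptor) -> reference count ("ref" column).
    final Map<String, Integer> refCounts = new HashMap<>();

    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // On add_partition: returns the hash to store as sd_hash in the partition
    // row, creating the shared row or incrementing its reference count.
    String addReference(String serializedSd) {
        String hash = md5Hex(serializedSd.getBytes(StandardCharsets.UTF_8));
        refCounts.merge(hash, 1, Integer::sum);
        return hash;
    }

    // On drop_partition: the shared row disappears only with its last reference.
    void dropReference(String hash) {
        refCounts.computeIfPresent(hash, (k, n) -> n > 1 ? n - 1 : null);
    }
}
```

In the protobuf Partition above, location and sd_parameters sit next to sd_hash rather than inside the StorageDescriptor message; presumably this keeps per-partition values out of the hashed bytes so that many partitions can share one descriptor row.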

HBase schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server passes Thrift objects to the HBase client open in the Thrift server
• HBase client extracts fields from Thrift objects and converts them to corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables

Example: adding a new partition “add_partition(Partition new_part)”

[Diagram: the Thrift struct Partition is converted to the protobuf message Partition (create_time, last_access_time, location, sd_parameters, sd_hash, parameters) and stored in HBMS_PARTITIONS and HBMS_SDS]


Caching

Aggregate Stats
• Location - on HBase
• Compile time
File Footers
• Location - on HBase
• Runtime - accessed from tasks
Tables, Partitions, Storage Descriptors
• Location - on Metastore server(s)
• Compile time

Caching Aggregate Stats

“get_aggr_stats_for(dbName, tblName, partNames, colNames)”

• Gets aggregated stats for columns in each partition – expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • New entry added for each key not found in cache. AggrStats calculated on client side & cached entry saved as serialized AggrStats proto
  • AggrStatsBloomFilter created on partitions contained in AggrStats
• Invalidation
  • TTL expiry: nodes evicted from cache
  • Alter partition, Drop partition, Analyze, etc. add invalidation request to a queue
  • Invalidator thread picks invalidation request & executes a filter on HBase to remove expired entries
  • Uses the bloom filter to find all AggrStats protos that contain the candidate partition & removes them from the cache
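
The bloom-filter-based invalidation step can be illustrated with a small model. It is a sketch under assumptions, not the HBaseStore implementation: a HashMap stands in for HBMS_AGGR_STATS, a string stands in for the AggrStats proto, the Bloom filter is a deliberately tiny hand-rolled one, and TTL expiry and the invalidation queue are left out. All names are hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical model of the aggregate-stats cache and its invalidation path.
class AggrStatsCacheSketch {
    // Tiny Bloom filter over partition names: false positives are possible,
    // false negatives are not.
    static class PartitionBloom {
        private static final int BITS = 1024;
        private final BitSet bits = new BitSet(BITS);

        private int slot(String name, int seed) {
            return Math.floorMod(31 * seed + name.hashCode() * (seed + 1), BITS);
        }

        void add(String name) {
            bits.set(slot(name, 1));
            bits.set(slot(name, 2));
        }

        boolean mightContain(String name) {
            return bits.get(slot(name, 1)) && bits.get(slot(name, 2));
        }
    }

    static class Entry {
        final String aggrStats; // stands in for the serialized AggrStats proto
        final PartitionBloom partitions = new PartitionBloom();

        Entry(String aggrStats, List<String> partNames) {
            this.aggrStats = aggrStats;
            partNames.forEach(partitions::add);
        }
    }

    final Map<String, Entry> cache = new HashMap<>(); // row key -> cached entry

    void put(String rowKey, String aggrStats, List<String> partNames) {
        cache.put(rowKey, new Entry(aggrStats, partNames));
    }

    // After a partition changes, remove every entry whose filter reports
    // that it may cover that partition. Returns the number removed.
    int invalidate(String changedPartition) {
        int removed = 0;
        for (Iterator<Entry> it = cache.values().iterator(); it.hasNext();) {
            if (it.next().partitions.mightContain(changedPartition)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

A false positive only evicts an entry that could have stayed, so the cache may recompute more than strictly necessary but never serves stats for a changed partition.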

Caching File Footers

• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwrite is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic


HBaseMetaStore Needs Transaction

Atomicity is required
– Create table/partition also creates storage descriptor
– Alter table also alters partitions
– Drop table also drops table column privileges
HBase doesn’t support transactions
– Doesn’t support cross-row transactions
HBaseConnection
– Supports different transaction managers in theory
– VanillaHBaseConnection: no transaction

Omid

Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
– First release this Monday
Snapshot isolation
– Natural, as HBase is a versioned database
– No locking, no deadlock, no blocking for both read and write
– When two concurrent transactions write to the same data, the later one aborts
Low overhead

Omid Components

TSO Server (Timestamp Oracle)
– Generates transid
– Tracks status of transactions
TSO Client
– Talks to TSO
– Caches transaction metadata
– Most reads don’t need to talk to TSO
Compactor
– Runs as HBase Coprocessor
– Removes stale cell versions

[Diagram: Client, TSO, HBase, Compactor]

Omid Operations

Open transaction
– Get transid from TSO
Read a cell
– Read all versions of the cell from HBase
– Read latest committed version before transaction start
Write a cell
– Write value versioned with transid to HBase
Commit
– Generate commitid from TSO
– TSO figures out if there is a conflict using transaction metadata

Omid Data Structure

Memory management in TSO
– Never runs OOM; aborts old transactions

[Diagram: in-memory tables kept by the TSO]
lastCommit: row1 -> T20, row2 -> T25, row5 -> T22
committed: T10 -> T20, T4 -> T25, T11 -> T30
aborted: T2, …

• Detect transaction conflict at commit time
• Largest chunk of memory
• Construct snapshot at read time
• Partially replicated to client
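
How a lastCommit table supports conflict detection at commit time can be sketched with a toy timestamp oracle. This is a simplified model of the idea, not Omid's API or implementation: timestamps come from a single in-process counter, aborted-transaction tracking and memory management are omitted, and the class name is hypothetical (Java 9+ in the usage shown by Set.of).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy timestamp oracle illustrating snapshot-isolation conflict checks.
class TsoSketch {
    private long clock = 0;
    // row -> commit timestamp of the last transaction that wrote it ("lastCommit")
    final Map<String, Long> lastCommit = new HashMap<>();
    // start timestamp -> commit timestamp ("committed")
    final Map<Long, Long> committed = new HashMap<>();

    // Open transaction: the start timestamp doubles as the transaction id.
    long begin() {
        return ++clock;
    }

    // Commit: abort if any row in the write set was committed by another
    // transaction after this one started; otherwise record the commit.
    // Returns the commit timestamp, or -1 if the transaction aborts.
    long commit(long startTs, Set<String> writeSet) {
        for (String row : writeSet) {
            Long last = lastCommit.get(row);
            if (last != null && last > startTs) {
                return -1;
            }
        }
        long commitTs = ++clock;
        for (String row : writeSet) {
            lastCommit.put(row, commitTs);
        }
        committed.put(startTs, commitTs);
        return commitTs;
    }
}
```

If two transactions start concurrently and both write row1, the first to commit succeeds and the second gets -1, which matches the rule that the later one aborts; a transaction that starts after the first commit can write row1 without conflict.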

Transaction Conflict

Two concurrent DDLs write to the same data
– Proper retry logic
Task node writes - ORC footer cache
– High chance for write conflict
– Row mutation is atomic in HBase
– Cross-row atomicity is not required
– Bypass transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)


Deployment

Server side components in HBase
– Server side filter
– Omid compactor
– Copy related hive jars into hbase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
– hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram: Hive MetaStore, TSO, and HBase (hosting the Server Side Filter and Omid Compactor)]

Deploy Omid

Create Omid Tables in HBase
– omid.sh create-hbase-commit-table
– omid.sh create-hbase-timestamp-table
Start Omid TSO
– omid.sh tso
Related config in hive-site.xml
– hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
– tso.host=localhost
– tso.port=54758
– omid.client.connectionType=DIRECT

Instantiate HBase Metastore

Instantiate HBase tables from scratch
– hive --service hbaseschematool --install
hbaseimport: import an existing Hive Metastore
– One-way import from ObjectStore to HBaseStore
– hive --service hbaseimport

Page 12: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

12 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

RDBMS schema

ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in

metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model

objects bull ORM opens transaction on RDBMS and writes reads values to from various tables in

RDBMS using appropriate foreign key references

Example adding a new partition ldquoadd_partition(Partition new_part)rdquo

struct Partition

1 listltstringgt values

2 string dbName

3 string tableName

4 i32 createTime

5 i32 lastAccessTime

6 StorageDescriptor sd

7 mapltstring stringgt parameters

8 optional PrincipalPrivilegeSet privileges

TBLS

TBL_PRIVS

TBL_COL_PRIVS

PART_PRIVS

SDS

CDS

SORT_ORDER

SERDES

TYPE_FIELDS

PARTITIONS

PARTITION_KEY_VALS

PARTITION_PARAMS

BUCKETING_COLS

SORT_COLS

SD_PARAMS

SKEWED_COL_NAMES

SKEWED_VALUES

TABLE_PARAMS

13 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog ldquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

HBase schema

14 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Table Name Key Column Families and Columns

Description

HBMS_GLOBAL_PRIVS bytes(ldquogprdquo) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized PrincipalPrivilegeSet proto

HBMS_ROLES bytes(roleName) cf_catalog ldquorolesrdquo ldquorolesrdquo storeretrieve serialized Role proto

HBMS_USER_TO_ROLE bytes(userName) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized RoleList proto

HBMS_SECURITY bytes(delTokenId) cf_catalog ldquodtrdquo ldquomkrdquo ldquodtrdquo storeretrieve delegation token ldquomkrdquo master keys

HBMS_SEQUENCES bytes(sequence) cf_catalog ldquocrdquo ldquocrdquo storeretrieve sequences

HBase schema

15 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

De-normalization

bull Goalbull Optimized for queryingbull May slower in DDL bull Example drop_role(String roleName)

Key Value

bytes(ldquoUser 1rdquo) Proto(Role 1 Role 2 Role 3 Role 5)

bytes(ldquoUser 2rdquo) Proto(Role 1 Role 2)

bytes(ldquoUser 3rdquo) Proto(Role 4 Role 5)

bytes(ldquoUser 4rdquo) Proto (Role 2 Role 3)

HBMS_USER_TO_ROLE

bull Need to scan amp de-serialize everything in order to drop a role

16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Partition Keys

Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo

Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value

date state

201601 CA

201601 WA

201602 CA

201603 CA

201605 CA

17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Typed Partition Keys

Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)

ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08

Using BinarySortableSerDendash Support all Hive data typesndash Handles null

(String Integer) Bytes

lsquoA10rsquo 3 41 31 30 00 00 00 00 03

lsquoA10rsquo 10 41 31 30 00 00 00 00 0A

lsquoA5rsquo 4 41 35 00 00 00 00 04

lsquoA5rsquo 15 41 35 00 00 00 00 0D

18 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Storage Descriptor de-duplication

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog lsquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

19 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Storage Descriptor de-duplication

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog lsquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

struct Partition

1 listltstringgt values

2 string dbName

3 string tableName

4 i32 createTime

5 i32 lastAccessTime

6 StorageDescriptor sd

7 mapltstring stringgt parameters

8 optional PrincipalPrivilegeSet privileges

struct StorageDescriptor

1 listltFieldSchemagt cols

2 string location

3 string inputFormat

4 string outputFormat

5 bool compressed

6 i32 numBuckets

7 SerDeInfo serdeInfo

8 listltstringgt bucketCols

9 listltOrdergt sortCols

10 mapltstring stringgt parameters

11 optional SkewedInfo skewedInfo

12 optional bool storedAsSubDirectories

20 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Storage Descriptor de-duplication

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog lsquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

message Partition

optional int64 create_time = 1

optional int64 last_access_time = 2

optional string location = 3

optional Parameters sd_parameters = 4

required bytes sd_hash = 5

optional Parameters parameters = 6

message StorageDescriptor

message Order hellip

message SerDeInfo hellip

message SkewedInfo hellip

repeated FieldSchema cols = 1

optional string input_format = 2

optional string output_format = 3

optional bool is_compressed = 4

optional sint32 num_buckets = 5

optional SerDeInfo serde_info = 6

repeated string bucket_cols = 7

repeated Order sort_cols = 8

optional SkewedInfo skewed_info = 9

optional bool stored_as_sub_directories = 10

21 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

HBase schema

ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in

metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding

protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)

bull Writesreads the protobuf payloads tofrom HBase tables

Example adding a new partition ldquoadd_partition(Partition new_part)rdquo


struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

[Diagram labels: HBMS_PARTITIONS, HBMS_SDS]
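The split between HBMS_PARTITIONS and HBMS_SDS can be sketched roughly as follows. This is a toy illustration, not Hive's code: `SdStore`, `add_partition`, and the dictionary-backed "tables" are hypothetical names, and the storage descriptor is passed in as already-serialized bytes. It assumes, as the schema suggests, that the partition row keeps its own location and only references the shared storage descriptor through `sd_hash`.

```python
import hashlib


class SdStore:
    """Toy stand-in for HBMS_SDS: key = md5(serialized SD), value = (bytes, ref count)."""

    def __init__(self):
        self.rows = {}

    def put(self, sd_bytes: bytes) -> bytes:
        # Identical storage descriptors hash to the same key, so they share one row.
        key = hashlib.md5(sd_bytes).digest()
        payload, refs = self.rows.get(key, (sd_bytes, 0))
        self.rows[key] = (payload, refs + 1)
        return key

    def release(self, key: bytes) -> None:
        # Drop one reference; delete the row when nobody points at it any more.
        payload, refs = self.rows[key]
        if refs <= 1:
            del self.rows[key]
        else:
            self.rows[key] = (payload, refs - 1)


def add_partition(partitions: dict, sds: SdStore, row_key, location: str, sd_bytes: bytes):
    """Store the partition row with a reference (sd_hash) instead of the full SD."""
    sd_hash = sds.put(sd_bytes)
    partitions[row_key] = {"location": location, "sd_hash": sd_hash}
```

With thousands of partitions that differ only in location, the toy store ends up with a single storage descriptor row whose reference count equals the number of partitions.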


Caching

• Aggregate stats
  - Location: on HBase
  - Used at compile time
• File footers
  - Location: on HBase
  - Used at runtime, accessed from tasks
• Tables, partitions, storage descriptors
  - Location: on the Metastore server(s)
  - Used at compile time

Caching Aggregate Stats

"get_aggr_stats_for(dbName, tblName, partNames, colNames)"

• Gets aggregated stats for columns in each partition; an expensive call
• Used in CBO, Stats Annotation, and the Stats Optimizer
• HBMS_AGGR_STATS
  - RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  - Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  - A new entry is added for each key not found in the cache. AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto.
  - An AggrStatsBloomFilter is created on the partitions contained in the AggrStats.
• Invalidation
  - TTL expiry: nodes are evicted from the cache.
  - Alter partition, drop partition, analyze, etc. add an invalidation request to a queue.
  - An invalidator thread picks up the invalidation request and executes a filter on HBase to remove expired entries.
  - The filter uses the bloom filter to find every AggrStats proto that contains the candidate partition and removes it from the cache.
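The lookup key and the bloom-filter-based invalidation described above can be sketched as below. This is a simplified in-memory illustration under stated assumptions: `TinyBloom`, `cache_key`, `put_entry`, and `invalidate` are hypothetical names, the cache is a plain dictionary rather than an HBase table, and the way the key fields are joined before hashing is made up for the example.

```python
import hashlib


class TinyBloom:
    """Minimal bloom filter over partition names (illustrative sizes)."""

    def __init__(self, bits: int = 256, hashes: int = 3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.field |= 1 << p

    def might_contain(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all((self.field >> p) & 1 for p in self._positions(item))


def cache_key(db: str, tbl: str, part_names, col: str) -> bytes:
    """md5 over db, table, partition names and column (separator is an assumption)."""
    joined = "\x00".join([db, tbl, *part_names, col])
    return hashlib.md5(joined.encode()).digest()


def put_entry(cache: dict, db, tbl, part_names, col, stats) -> None:
    """Save the aggregate together with a bloom filter of the partitions it covers."""
    bloom = TinyBloom()
    for name in part_names:
        bloom.add(name)
    cache[cache_key(db, tbl, part_names, col)] = (stats, bloom)


def invalidate(cache: dict, part_name: str) -> int:
    """Remove every cached aggregate whose bloom filter may contain the partition."""
    stale = [k for k, (_, bloom) in cache.items() if bloom.might_contain(part_name)]
    for k in stale:
        del cache[k]
    return len(stale)
```

A false positive only evicts an entry that could have stayed, so the cache remains correct at the cost of an occasional recomputation.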

Caching File Footers

• ORC footer cache
  - Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId).
  - Footers are read from the AM for split generation, which avoids reading lots of HDFS files.
  - Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread.
• Skips the transaction layer
  - High overhead
  - Transaction conflicts
  - Row mutation is already atomic


HBaseMetaStore Needs Transactions

• Atomicity is required
  - Create table/partition also creates a storage descriptor
  - Alter table also alters partitions
  - Drop table also drops the table's column privileges
• HBase doesn't support transactions
  - No cross-row transactions
• HBaseConnection
  - Can support different transaction managers, in theory
  - VanillaHBaseConnection: no transactions

Omid

• Transaction layer on top of HBase
• Initially developed by Yahoo
• Apache incubator project
  - First release this Monday
• Snapshot isolation
  - Natural, as HBase is a versioned database
  - No locking, no deadlock, no blocking for either reads or writes
  - When two concurrent transactions write to the same data, the later one aborts
• Low overhead

Omid Components

• TSO Server (Timestamp Oracle)
  - Generates transaction IDs (transid)
  - Tracks the status of transactions
• TSO Client
  - Talks to the TSO
  - Caches transaction metadata
  - Most reads don't need to talk to the TSO
• Compactor
  - Runs as an HBase coprocessor
  - Removes stale cell versions

[Diagram: Client, TSO, and HBase with the Compactor]

Omid Operations

• Open transaction
  - Get a transid from the TSO
• Read a cell
  - Read all versions of the cell from HBase
  - Use the latest version committed before the transaction started
• Write a cell
  - Write the value to HBase, versioned with the transid
• Commit
  - Get a commitid from the TSO
  - The TSO figures out whether there is a conflict using transaction metadata
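The four operations above can be sketched with a toy, single-process model of snapshot isolation. This is not Omid's API: `ToyTso`, `ToyTxn`, and `Conflict` are hypothetical names, cells live in a dictionary instead of HBase, and aborted writes are simply left behind as invisible versions rather than cleaned up.

```python
class Conflict(Exception):
    """Raised when another transaction committed the same row after we started."""


class ToyTso:
    def __init__(self):
        self.clock = 0
        self.last_commit = {}  # row -> commitid of the last committed write
        self.committed = {}    # transid -> commitid

    def next_ts(self) -> int:
        self.clock += 1
        return self.clock

    def commit(self, transid: int, written_rows) -> int:
        # Conflict check: did anyone commit one of our rows after we started?
        for row in written_rows:
            if self.last_commit.get(row, 0) > transid:
                raise Conflict(row)
        commitid = self.next_ts()
        for row in written_rows:
            self.last_commit[row] = commitid
        self.committed[transid] = commitid
        return commitid


class ToyTxn:
    def __init__(self, tso: ToyTso, cells: dict):
        self.tso, self.cells = tso, cells  # cells: row -> {transid: value}
        self.transid = tso.next_ts()       # open transaction
        self.writes = set()

    def read(self, row):
        best = None
        for writer, value in self.cells.get(row, {}).items():
            if writer == self.transid:
                return value               # a transaction sees its own writes
            commitid = self.tso.committed.get(writer)
            # Visible only if committed before this transaction started.
            if commitid is not None and commitid < self.transid:
                if best is None or commitid > best[0]:
                    best = (commitid, value)
        return best[1] if best else None

    def write(self, row, value) -> None:
        # The new version is tagged with the writer's transid.
        self.cells.setdefault(row, {})[self.transid] = value
        self.writes.add(row)

    def commit(self) -> int:
        return self.tso.commit(self.transid, self.writes)
```

When two open transactions write the same row, whichever commits second raises `Conflict`, matching the "later one aborts" rule; readers never block because they only pick among existing versions.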

Omid Data Structure

• Memory management in the TSO
  - Never runs out of memory: old transactions are aborted

Data kept in the TSO:

lastCommit
  row1 | T20
  row2 | T25
  row5 | T22
• Detects transaction conflicts at commit time
• Largest chunk of memory

committed
  T10 | T20
  T4 | T25
  T11 | T30

aborted
  T2 | … | …
• Used to construct the snapshot at read time
• Partially replicated to the client

Transaction Conflict

• Two concurrent DDL statements write to the same data
  - Needs proper retry logic
• Task node writes: ORC footer cache
  - High chance of write conflicts
  - Row mutation is atomic in HBase
  - Cross-row atomicity is not required
  - Bypasses the transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
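The "proper retry logic" for conflicting DDL could look roughly like the sketch below. The names `TxnAborted` and `run_with_retry` are hypothetical, and the retry count is an arbitrary choice; the slides do not say how Hive implements this.

```python
class TxnAborted(Exception):
    """Signals that the transaction layer aborted this attempt because of a conflict."""


def run_with_retry(ddl_op, max_attempts: int = 3):
    """Run ddl_op(attempt) in a fresh transaction each time, retrying on abort."""
    for attempt in range(1, max_attempts + 1):
        try:
            return ddl_op(attempt)
        except TxnAborted:
            if attempt == max_attempts:
                raise  # give up and surface the conflict to the caller
```

Each retry has to re-read the metadata inside a new transaction, since the snapshot the aborted attempt worked from is stale by definition.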


Deployment

• Server-side components in HBase
  - Server-side filter
  - Omid compactor
  - Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
• New config in hive-site.xml
  - hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram: Hive MetaStore, TSO, and HBase hosting the Server Side Filter and the Omid Compactor]

Deploy Omid

• Create the Omid tables in HBase
  - omid.sh create-hbase-commit-table
  - omid.sh create-hbase-timestamp-table
• Start the Omid TSO
  - omid.sh tso
• Related config in hive-site.xml
  - hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
  - tso.host=localhost
  - tso.port=54758
  - omid.client.connectionType=DIRECT

Instantiate HBase Metastore

• Instantiate the HBase tables from scratch
  - hive --service hbaseschematool --install
• hbaseimport: import an existing Hive Metastore
  - One-way import from ObjectStore to HBaseStore
  - hive --service hbaseimport


TPC-DS queries

[Chart: Query Plan Time for TPC-DS queries. X-axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y-axis: 0 to 6000. Series: HBaseStore, HBaseStore+Omid, ObjectStore.]

• 1824 partitions
• Sweet spot for ObjectStore
• Average speed-up across all TPC-DS queries
  - 2.19 (without Omid)
  - 2.12 (with Omid)


Current Status

• hbase-metastore branch merged to master last September
• Turned off by default
• Feature parity? Almost
  - Minor holes: event notification, version, constraints
  - Deprecate listTableNamesByFilter and listPartitionNamesByFilter
  - Tools enhancement
  - ACID is not supported
• Runs most e2e queries
• Fixing unit tests
  - TestMiniTezCliDriver: all pass
  - TestCliDriver: HIVE-14097, pending review
  - Not production quality yet

Future Work - ACID

• Transaction metadata is stored in the Metastore
  - Locks
  - Txns
  - Compactions
• The data structure is harder to de-normalize
• New work: transaction server
  - Keeps the lock and transaction tree in memory

Future Work - HA via HBase Coprocessor

• Two new server components
  - Omid TSO Server
  - Transaction Server
• All servers need HA
  - Management headache
• Automatic HA through HBase coprocessors

[Diagram: two Region Servers, each running the TSO Server via a coprocessor]

Future Work - Other

• Stats aggregation
  - Coprocessor
• Improving ObjectCache
  - Rudimentary implementation currently
  - LRU
• Omid consumes high CPU
  - Always 300% CPU, by design
  - High throughput, avoids context switches
  - Might be an issue for small systems


Thank You

  • Hive Hbase Metastore - Improving Hive with a Big Data Metadata
  • Agenda
  • What is Hive MetaStore
  • Low latency in Hive
  • New Bottleneck - Metastore
  • Besides Latency
  • ER Diagram for ObjectStore Database
  • How About Improving ObjectStore
  • Agenda (2)
  • System Architecture
  • RDBMS schema
  • RDBMS schema (2)
  • HBase schema
  • HBase schema (2)
  • De-normalization
  • Partition Keys
  • Typed Partition Keys
  • HBase schema (3)
  • HBase schema (4)
  • HBase schema (5)
  • HBase schema (6)
  • HBase schema (7)
  • Agenda (3)
  • Caching
  • Caching Aggregate Stats
  • Caching File Footers
  • Agenda (4)
  • HBaseMetaStore Needs Transaction
  • Omid
  • Omid Components
  • Omid Operations
  • Omid Data Structure
  • Transaction Conflict
  • Agenda (5)
  • Deployment
  • Deploy Omid
  • Instantiate HBase Metastore
  • Agenda (6)
  • TPCDS queries
  • Agenda (7)
  • Current Status
  • Future Work - ACID
  • Future Work - HA via HBase Coprocessor
  • Future Work - Other
  • Thank You

HBase schema

Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": metadata footer proto; "s": PPD stats

HBase schema

Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences

De-normalization

• Goal
  - Optimized for querying
  - May be slower for DDL
• Example: drop_role(String roleName)

HBMS_USER_TO_ROLE

Key | Value
bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") | Proto(Role 1, Role 2)
bytes("User 3") | Proto(Role 4, Role 5)
bytes("User 4") | Proto(Role 2, Role 3)

• Need to scan and de-serialize everything in order to drop a role
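The cost of drop_role on the de-normalized table can be sketched as a full scan. This is an illustration only: the function below is a hypothetical stand-in, the table is a dictionary, and JSON replaces the RoleList proto as the serialization format.

```python
import json


def drop_role(user_to_role: dict, role: str) -> int:
    """Scan every row, de-serialize its role list, and rewrite rows that held the role."""
    touched = 0
    for key, payload in list(user_to_role.items()):
        roles = json.loads(payload)  # every row is de-serialized, matching or not
        if role in roles:
            roles.remove(role)
            user_to_role[key] = json.dumps(roles).encode()
            touched += 1
    return touched
```

Reads such as "which roles does User 1 have" stay a single-row get, which is the trade the slide describes: fast queries, slower DDL.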

Partition Keys

• Range scan for most queries
  - where date = '201601' and state = 'CA'
  - where date >= '201602' and date < '201604'
• Server-side filter for the rest
  - where state = 'CA' (not a key prefix)
  - where date like '2016' (regex)
  - where date > '201601' and state > 'OR' (cannot be a range scan)
  - Scans all keys but does not deserialize the values

date | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
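The difference between the two access paths can be sketched over the five example rows. This is a toy model: a sorted Python list stands in for the HBase table, tuples stand in for composite row keys, and `range_scan` and `filter_scan` are hypothetical helper names.

```python
from bisect import bisect_left

# Row keys sorted the way a key-ordered store would keep them.
ROWS = sorted([
    ("201601", "CA"), ("201601", "WA"), ("201602", "CA"),
    ("201603", "CA"), ("201605", "CA"),
])


def range_scan(rows, start, stop):
    """Return keys with start <= key < stop, touching only that slice."""
    lo = bisect_left(rows, start)
    hi = bisect_left(rows, stop)
    return rows[lo:hi]


def filter_scan(rows, predicate):
    """Visit every key and keep the ones that match (server-side filter path)."""
    return [r for r in rows if predicate(r)]


# date >= '201602' and date < '201604': the predicate is a prefix range.
by_range = range_scan(ROWS, ("201602",), ("201604",))

# state = 'CA': the predicate is not on a key prefix, so every key is visited.
by_filter = filter_scan(ROWS, lambda key: key[1] == "CA")
```

Here `by_range` holds the two rows for 201602 and 201603, while `by_filter` has to look at all five keys to return the four CA rows.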

Typed Partition Keys

• Binary sorted
  - HBase range scan: Scan(byte[] startRow, byte[] stopRow)
  - where key1 >= 'A5' and key2 >= 8
    startRow: 41 35 00 00 00 00 08
• Using BinarySortableSerDe
  - Supports all Hive data types
  - Handles null

(String, Integer) | Bytes
('A10', 3) | 41 31 30 00 00 00 00 03
('A10', 10) | 41 31 30 00 00 00 00 0A
('A5', 4) | 41 35 00 00 00 00 04
('A5', 15) | 41 35 00 00 00 00 0F
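The byte layout in the table can be reproduced with a small sketch. It assumes the simplified layout shown on the slide (ASCII string, a 00 terminator, then a 4-byte big-endian integer) and non-negative integers only; the real BinarySortableSerDe encoding has more to it, and `encode_key` is a hypothetical helper.

```python
import struct


def encode_key(s: str, n: int) -> bytes:
    """Encode (string, non-negative int) so that byte order matches key order."""
    return s.encode("ascii") + b"\x00" + struct.pack(">I", n)


keys = [("A5", 4), ("A10", 10), ("A5", 15), ("A10", 3)]

# Sorting by the encoded bytes gives the order a range scan would see.
scan_order = sorted(keys, key=lambda k: encode_key(*k))

# Start row for: key1 >= 'A5' and key2 >= 8
start_row = encode_key("A5", 8)
```

Note that 'A10' sorts before 'A5' because the comparison is byte-wise, so the integer part only breaks ties between equal strings.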


Storage Descriptor de-duplication


struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}

Page 14: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

14 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Table Name Key Column Families and Columns

Description

HBMS_GLOBAL_PRIVS bytes(ldquogprdquo) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized PrincipalPrivilegeSet proto

HBMS_ROLES bytes(roleName) cf_catalog ldquorolesrdquo ldquorolesrdquo storeretrieve serialized Role proto

HBMS_USER_TO_ROLE bytes(userName) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized RoleList proto

HBMS_SECURITY bytes(delTokenId) cf_catalog ldquodtrdquo ldquomkrdquo ldquodtrdquo storeretrieve delegation token ldquomkrdquo master keys

HBMS_SEQUENCES bytes(sequence) cf_catalog ldquocrdquo ldquocrdquo storeretrieve sequences

HBase schema

15 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

De-normalization

bull Goalbull Optimized for queryingbull May slower in DDL bull Example drop_role(String roleName)

Key Value

bytes(ldquoUser 1rdquo) Proto(Role 1 Role 2 Role 3 Role 5)

bytes(ldquoUser 2rdquo) Proto(Role 1 Role 2)

bytes(ldquoUser 3rdquo) Proto(Role 4 Role 5)

bytes(ldquoUser 4rdquo) Proto (Role 2 Role 3)

HBMS_USER_TO_ROLE

bull Need to scan amp de-serialize everything in order to drop a role

16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Partition Keys

Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo

Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value

date state

201601 CA

201601 WA

201602 CA

201603 CA

201605 CA

17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Typed Partition Keys

Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)

ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08

Using BinarySortableSerDendash Support all Hive data typesndash Handles null

(String Integer) Bytes

lsquoA10rsquo 3 41 31 30 00 00 00 00 03

lsquoA10rsquo 10 41 31 30 00 00 00 00 0A

lsquoA5rsquo 4 41 35 00 00 00 00 04

lsquoA5rsquo 15 41 35 00 00 00 00 0D

Storage Descriptor de-duplication

HBMS_DBS
  Key: bytes(dbName)
  Column families and columns: cf_catalog: "c"
  Description: "c": Database proto

HBMS_SDS
  Key: bytes(md5(SD proto))
  Column families and columns: cf_catalog: "c", "ref"
  Description: "c": StorageDescriptor proto; "ref": reference count

HBMS_TBLS
  Key: bytes(dbName, tblName)
  Column families and columns: cf_catalog: "c"; cf_stats: "s" -> c1 … cn
  Description: "c": Table proto; "s": Stats per column in the Table

HBMS_PARTITIONS
  Key: bytes(dbName, tblName, partVal1, ..., partValn)
  Column families and columns: cf_catalog: "c"; cf_stats: "s" -> c1 … cn
  Description: "c": Partition proto; "s": Stats per column in the Partition

HBMS_AGGR_STATS
  Key: bytes(md5(dbName, tblName, partVal1, ..., partValn, colName))
  Column families and columns: cf_catalog: "s", "b"
  Description: "b": AggrStatsBloomFilter proto; "s": AggrStats proto

HBMS_FUNCS
  Key: bytes(dbName, funcName)
  Column families and columns: cf_catalog: "c"
  Description: "c": Function proto

HBMS_FILE_METADATA
  Key: bytes(fileId)
  Column families and columns: cf_catalog: "c"; cf_stats: "s"
  Description: "c": Metadata footer proto; "s": PPD Stats

Thrift definitions:

struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}

Protobuf definitions:

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
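The protobuf Partition message keeps location and sd_parameters itself and refers to the rest of the storage descriptor through sd_hash, so many partitions can point at one HBMS_SDS row. The sketch below models that idea with in-memory maps. It is an assumption-laden illustration, not the HBaseStore code: the class and method names are hypothetical, and deleting the descriptor once its reference count reaches zero is an assumed policy.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Illustrative in-memory model of HBMS_SDS; not the HBaseStore implementation.
final class StorageDescriptorDedup {
    // hex(md5(serialized SD)) -> serialized SD (the "c" column)
    private final Map<String, byte[]> sdByHash = new HashMap<>();
    // hex(md5(serialized SD)) -> reference count (the "ref" column)
    private final Map<String, Integer> refCount = new HashMap<>();

    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Stores the descriptor once and returns the hash a partition would keep as sd_hash.
    String addReference(byte[] serializedSd) {
        String hash = md5Hex(serializedSd);
        sdByHash.putIfAbsent(hash, serializedSd);
        refCount.merge(hash, 1, Integer::sum);
        return hash;
    }

    // Drops one reference; the descriptor is removed when nothing points at it (assumed policy).
    void removeReference(String hash) {
        Integer count = refCount.get(hash);
        if (count == null) {
            return;
        }
        if (count <= 1) {
            refCount.remove(hash);
            sdByHash.remove(hash);
        } else {
            refCount.put(hash, count - 1);
        }
    }

    int storedDescriptors() {
        return sdByHash.size();
    }

    int references(String hash) {
        return refCount.getOrDefault(hash, 0);
    }
}
```

Adding two partitions with byte-identical descriptors leaves one stored descriptor with a reference count of 2, which is the saving the de-duplication is after for tables with many partitions.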

HBase schema

Read/Write path
  • Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
  • Thrift Server passes Thrift objects to the HBase client open in the Thrift server
  • HBase client extracts fields from Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
  • Writes/reads the protobuf payloads to/from HBase tables

Example: adding a new partition, "add_partition(Partition new_part)"

[Diagram: the Thrift struct Partition and the protobuf message Partition, together with the tables HBMS_PARTITIONS and HBMS_SDS]

Caching

Aggregate Stats
  • Location - on HBase
  • Compile time

File Footers
  • Location - on HBase
  • Runtime - accessed from tasks

Tables, Partitions, Storage Descriptors
  • Location - on Metastore server(s)
  • Compile time

Caching Aggregate Stats

"get_aggr_stats_for(dbName, tblName, partNames, colNames)"

• Gets aggregated stats for columns in each partition - expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
    • RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
    • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
    • A new entry is added for each key not found in the cache; AggrStats is calculated on the client side & the cached entry is saved as a serialized AggrStats proto
    • AggrStatsBloomFilter is created on the partitions contained in AggrStats
• Invalidation
    • TTL expiry: nodes are evicted from the cache
    • Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
    • Invalidator thread picks an invalidation request & executes a filter on HBase to remove expired entries
    • Uses the bloom filter to find all AggrStats protos that contain the candidate partition & removes them from the cache
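The invalidation step relies on a property of Bloom filters: they can report a partition as present when it is not, but never the reverse, so every affected entry is found at the cost of occasionally evicting an unaffected one. The following is a minimal in-memory sketch of that idea. The class names, the filter size, and the hashing scheme are all assumptions made for illustration and do not reflect the AggrStatsBloomFilter proto or the HBase filter used by the metastore.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative in-memory model of the aggregate-stats cache; names are hypothetical.
final class AggrStatsCacheSketch {
    // Tiny Bloom filter: a few hash probes into a fixed-size bit set.
    static final class Bloom {
        private static final int BITS = 1024;
        private static final int PROBES = 3;
        private final BitSet bits = new BitSet(BITS);

        private int index(String item, int probe) {
            int h = item.hashCode() * 31 + probe * 0x9E3779B9;
            return Math.floorMod(h, BITS);
        }

        void add(String item) {
            for (int i = 0; i < PROBES; i++) {
                bits.set(index(item, i));
            }
        }

        // May return true for an item never added, never false for one that was.
        boolean mightContain(String item) {
            for (int i = 0; i < PROBES; i++) {
                if (!bits.get(index(item, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    static final class Entry {
        final String aggrStats; // stands in for the serialized AggrStats proto
        final Bloom partitions = new Bloom();

        Entry(String aggrStats, List<String> partNames) {
            this.aggrStats = aggrStats;
            for (String p : partNames) {
                partitions.add(p);
            }
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    void put(String key, String aggrStats, List<String> partNames) {
        cache.put(key, new Entry(aggrStats, partNames));
    }

    boolean contains(String key) {
        return cache.containsKey(key);
    }

    // Removes every entry whose Bloom filter says it may cover the changed partition.
    int invalidate(String changedPartition) {
        int removed = 0;
        for (Iterator<Entry> it = cache.values().iterator(); it.hasNext();) {
            if (it.next().partitions.mightContain(changedPartition)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

A false positive only costs a recomputation of the aggregate on the next lookup, which is why an approximate membership test is acceptable here.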

Caching File Footers

• ORC footer cache
    • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
    • Read from AM for split generation (avoids reading lots of HDFS files for split generation)
    • Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
    • High overhead
    • Transaction conflict
    • Row mutation is already atomic

HBaseMetaStore Needs Transaction

Atomicity is required
  - Create table/partition also creates a storage descriptor
  - Alter table also alters partitions
  - Drop table also drops table column privileges

HBase doesn't support transactions
  - Doesn't support cross-row transactions

HBaseConnection
  - Supports different transaction managers in theory
  - VanillaHBaseConnection: no transaction

Omid

Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
  - First release this Monday

Snapshot isolation
  - Natural, as HBase is a versioned database
  - No locking, no deadlock, no blocking for both read and write
  - When two concurrent transactions write to the same data, the later one aborts

Low overhead

Omid Components

TSO Server (Timestamp Oracle)
  - Generates transid
  - Status of transaction

TSO Client
  - Talks to TSO
  - Caches transaction metadata
  - Most reads don't need to talk to TSO

Compactor
  - Runs as HBase Coprocessor
  - Removes stale cell versions

[Diagram: HBase with Compactor, Client, TSO]

Omid Operations

Open transaction
  - Get transid from TSO

Read a cell
  - Read all versions of the cell from HBase
  - Read the latest version committed before the transaction started

Write a cell
  - Write the value, versioned with transid, to HBase

Commit
  - Generate commitid from TSO
  - TSO figures out if there is a conflict using transaction metadata

Omid Data Structure

Memory management in TSO
  - Never runs OOM: aborts old transactions

[Diagram: TSO state]
  lastCommit:  row1 -> T20, row2 -> T25, row5 -> T22
  committed:   T10 -> T20, T4 -> T25, T11 -> T30
  aborted:     T2, …

• Detect transaction conflict at commit time
• Largest chunk of memory
• Construct snapshot at read time
• Partially replicated to client
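The open, read, and commit steps can be illustrated with a toy timestamp oracle. This is a sketch of snapshot-isolation conflict checking in general, written under simplifying assumptions (a single in-process counter, no memory bounds, no aborted list); it is not Omid's implementation, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy timestamp oracle illustrating snapshot-isolation conflict checks; not Omid's code.
final class ToyTso {
    private long clock = 0;
    // row -> commit timestamp of the last transaction that wrote it
    private final Map<String, Long> lastCommit = new HashMap<>();
    // start timestamp -> commit timestamp
    private final Map<Long, Long> committed = new HashMap<>();

    // Open transaction: hand out a start timestamp (the "transid").
    synchronized long begin() {
        return ++clock;
    }

    // Commit: returns -1 (abort) if any written row was committed after this transaction started.
    synchronized long commit(long startTs, Set<String> writtenRows) {
        for (String row : writtenRows) {
            Long last = lastCommit.get(row);
            if (last != null && last > startTs) {
                return -1;
            }
        }
        long commitTs = ++clock;
        for (String row : writtenRows) {
            lastCommit.put(row, commitTs);
        }
        committed.put(startTs, commitTs);
        return commitTs;
    }

    // A reader sees a version only if its writer committed before the reader started.
    synchronized boolean isVisible(long writerStartTs, long readerStartTs) {
        Long commitTs = committed.get(writerStartTs);
        return commitTs != null && commitTs < readerStartTs;
    }
}
```

With two transactions that both start before either commits and both write row1, the first commit succeeds and the second returns -1, matching the rule that the later of two conflicting writers aborts.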

Transaction Conflict

Two concurrent DDLs write to the same data
  - Proper retry logic

Task node writes - ORC footer cache
  - High chance of write conflict
  - Row mutation is atomic in HBase
  - Cross-row atomicity is not required
  - Bypass transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
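The slide does not say what "proper retry logic" looks like. One common shape, shown here purely as an assumption, is a bounded loop that reruns the whole DDL transaction when its commit is rejected; the class and method names below are hypothetical.

```java
import java.util.function.BooleanSupplier;

// Hypothetical bounded-retry helper for a transaction that may abort on conflict.
final class RetrySketch {
    // Runs the attempt until it reports success or maxAttempts is exhausted.
    // Returns the number of attempts used, or -1 if every attempt failed.
    static int runWithRetry(BooleanSupplier attempt, int maxAttempts) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return i;
            }
        }
        return -1;
    }
}
```

Each attempt would open a fresh transaction, redo its reads and writes, and return whether the commit succeeded, so a retried DDL always works from a current snapshot.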

Deployment

Server side components in HBase
  - Server side filter
  - Omid compactor
  - Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
  - hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram: Hive MetaStore, TSO, and HBase with the Server Side Filter and Omid Compactor]

Deploy Omid

Create Omid tables in HBase
  - omid.sh create-hbase-commit-table
  - omid.sh create-hbase-timestamp-table

Start Omid TSO
  - omid.sh tso

Related config in hive-site.xml
  - hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
  - tso.host=localhost
  - tso.port=54758
  - omid.client.connectionType=DIRECT

Instantiate HBase Metastore

Instantiate HBase tables from scratch
  - hive --service hbaseschematool --install

HBaseImport: import an existing Hive Metastore
  - One-way import from ObjectStore to HBaseStore
  - hive --service hbaseimport

TPCDS queries

[Chart: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. Queries: 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Vertical axis: 0 to 6000]

1824 partitions
Sweet spot for ObjectStore
Average speedup for all TPCDS queries
  - 219 (without Omid)
  - 212 (with Omid)

Current Status

hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
  - Minor holes: event notification, version, constraints
  - Deprecate listTableNamesByFilter/listPartitionNamesByFilter
  - Tools enhancement
  - ACID is not supported

Runs most e2e queries
Fixing unit tests
  - TestMiniTezCliDriver: all pass
  - TestCliDriver: HIVE-14097 pending review
  - Not production quality yet

Future Work - ACID

Transaction metadata is stored in Metastore
  - Locks
  - Txns
  - Compactions

Data structure is harder to de-normalize
New work: transaction server
  - Keep lock and transaction tree in memory

Future Work - HA via HBase Coprocessor

Two new server components
  - Omid TSO Server
  - Transaction Server

All servers need HA
  - Management headache

Automatic HA through HBase Coprocessor

[Diagram: two Region Servers, each labeled "TSO Server via CoProcessor"]

Future Work - Other

Stats Aggregation
  - Coprocessor

Improving ObjectCache
  - Rudimentary implementation currently
  - LRU

Omid consuming high CPU
  - 300% CPU always, by design
  - High throughput, avoids context switches
  - Might be an issue for a small system

Thank You

Partition Keys

Range scan for most queries
- Where date = '201601' and state = 'CA'
- Where date >= '201602' and date < '201604'

Server-side filter for the rest
- Where state = 'CA' (not a prefix of the key)
- Where date like '2016' (regex)
- Where date > '201601' and state > 'OR' (cannot be a range scan)
- Scans all keys but does not deserialize the values

date    state
201601  CA
201601  WA
201602  CA
201603  CA
201605  CA
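
The difference between the two access paths can be sketched with a sorted in-memory map standing in for the partition table. This is only an illustration, not the HBase API: the class `PartitionKeyScanSketch`, its methods, and the `date|state` key layout are hypothetical. A prefix or range predicate visits only the keys inside the range, while a non-prefix predicate has to walk every key (but never needs the values).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class PartitionKeyScanSketch {
    // Sorted map standing in for the partition table; keys are "date|state" (assumed layout).
    static TreeMap<String, String> sampleKeys() {
        TreeMap<String, String> rows = new TreeMap<>();
        String[] keys = {"201601|CA", "201601|WA", "201602|CA", "201603|CA", "201605|CA"};
        for (String k : keys) {
            rows.put(k, "partition-payload-for-" + k);
        }
        return rows;
    }

    // Range predicate on the leading key column: only keys in [start, stop) are visited.
    static List<String> rangeScan(TreeMap<String, String> rows, String start, String stop) {
        return new ArrayList<>(rows.subMap(start, true, stop, false).keySet());
    }

    // Predicate on a non-leading column: every key is inspected, values are never read.
    static List<String> filterByState(TreeMap<String, String> rows, String state) {
        List<String> matches = new ArrayList<>();
        for (String key : rows.keySet()) {
            if (key.endsWith("|" + state)) {
                matches.add(key);
            }
        }
        return matches;
    }
}
```

For example, `rangeScan(rows, "201602", "201604")` touches only the two keys for 201602 and 201603, whereas `filterByState(rows, "WA")` iterates over all five keys to find one match.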

Typed Partition Keys

Binary sorted
- HBase range scan: Scan(byte[] startRow, byte[] stopRow)
- Where key1 >= 'A5' and key2 >= 8
  • startRow: 41 35 00 00 00 00 08

Using BinarySortableSerDe
- Supports all Hive data types
- Handles null

(String, Integer)   Bytes
'A10', 3            41 31 30 00 00 00 00 03
'A10', 10           41 31 30 00 00 00 00 0A
'A5', 4             41 35 00 00 00 00 04
'A5', 15            41 35 00 00 00 00 0F
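
The byte layout in the table can be reproduced with a small sketch. It assumes the simplified encoding shown on the slide (string bytes, a 0x00 terminator, then a 4-byte big-endian integer) and handles only non-negative integers; the real BinarySortableSerDe layout may differ in details such as null markers and sign handling. The class `SortableKeySketch` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class SortableKeySketch {
    // Encodes (String, int) as: UTF-8 bytes, 0x00 terminator, 4-byte big-endian int.
    // Assumption: n is non-negative, so no sign-bit adjustment is needed for ordering.
    static byte[] encode(String s, int n) {
        byte[] str = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(str.length + 1 + 4);
        buf.put(str).put((byte) 0).putInt(n);
        return buf.array();
    }

    // Renders bytes as space-separated upper-case hex, matching the table's notation.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Unsigned lexicographic comparison, the order in which a byte-sorted store keeps row keys.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

With this encoding, `('A10', 10)` sorts before `('A5', 4)` because the comparison is byte-wise on the string part, and `('A5', 4)` sorts before `('A5', 15)` because the integer is stored big-endian.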

Storage Descriptor de-duplication

HBase tables (table name, key, column families and columns, description)

HBMS_DBS
- Key: bytes(dbName)
- Columns: cf_catalog: "c"
- Description: "c": Database proto

HBMS_SDS
- Key: bytes(md5(SD proto))
- Columns: cf_catalog: "c", "ref"
- Description: "c": StorageDescriptor proto; "ref": reference count

HBMS_TBLS
- Key: bytes(dbName, tblName)
- Columns: cf_catalog: "c"; cf_stats: "s" -> c1 … cn
- Description: "c": Table proto; "s": stats per column in the table

HBMS_PARTITIONS
- Key: bytes(dbName, tblName, partVal1 … partValn)
- Columns: cf_catalog: "c"; cf_stats: "s" -> c1 … cn
- Description: "c": Partition proto; "s": stats per column in the partition

HBMS_AGGR_STATS
- Key: bytes(md5(dbName, tblName, partVal1 … partValn, colName))
- Columns: cf_catalog: "s", "b"
- Description: "b": AggrStatsBloomFilter proto; "s": AggrStats proto

HBMS_FUNCS
- Key: bytes(dbName, funcName)
- Columns: cf_catalog: "c"
- Description: "c": Function proto

HBMS_FILE_METADATA
- Key: bytes(fileId)
- Columns: cf_catalog: "c"; cf_stats: "s"
- Description: "c": metadata footer proto; "s": PPD stats

Storage Descriptor de-duplication

struct Partition {
  1: list<string> values
  2: string dbName
  3: string tableName
  4: i32 createTime
  5: i32 lastAccessTime
  6: StorageDescriptor sd
  7: map<string, string> parameters
  8: optional PrincipalPrivilegeSet privileges
}

struct StorageDescriptor {
  1: list<FieldSchema> cols
  2: string location
  3: string inputFormat
  4: string outputFormat
  5: bool compressed
  6: i32 numBuckets
  7: SerDeInfo serdeInfo
  8: list<string> bucketCols
  9: list<Order> sortCols
  10: map<string, string> parameters
  11: optional SkewedInfo skewedInfo
  12: optional bool storedAsSubDirectories
}

Storage Descriptor de-duplication

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
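
The idea behind `sd_hash` and the "ref" column of HBMS_SDS can be sketched as a content-addressed store with reference counts. This is a minimal in-memory illustration under stated assumptions, not the HBaseStore implementation: the class `StorageDescriptorDedupSketch` and its methods are hypothetical, and an opaque byte array stands in for the serialized StorageDescriptor proto.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

final class StorageDescriptorDedupSketch {
    private static final class Entry {
        final byte[] serializedSd;
        int refCount;

        Entry(byte[] serializedSd) {
            this.serializedSd = serializedSd;
        }
    }

    // Stand-in for the HBMS_SDS table: md5(serialized SD) -> (SD bytes, reference count).
    private final Map<String, Entry> sdTable = new HashMap<>();

    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xFF));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Stores the descriptor once per distinct content and returns the hash a partition would keep.
    String addReference(byte[] serializedSd) {
        String hash = md5Hex(serializedSd);
        Entry entry = sdTable.get(hash);
        if (entry == null) {
            entry = new Entry(serializedSd.clone());
            sdTable.put(hash, entry);
        }
        entry.refCount++;
        return hash;
    }

    // Drops one reference and removes the descriptor when nothing points at it any more.
    void dropReference(String hash) {
        Entry entry = sdTable.get(hash);
        if (entry != null && --entry.refCount == 0) {
            sdTable.remove(hash);
        }
    }

    int distinctDescriptors() {
        return sdTable.size();
    }

    int refCount(String hash) {
        Entry entry = sdTable.get(hash);
        return entry == null ? 0 : entry.refCount;
    }
}
```

Many partitions of one table typically share the same columns, formats, and SerDe, so with this scheme they would all point at a single stored descriptor instead of each carrying its own copy.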

HBase schema

Read/Write path
• Thrift client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift server passes the Thrift objects to the HBase client opened in the Thrift server
• HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables

Example: adding a new partition, "add_partition(Partition new_part)"

[Diagram: the Thrift struct Partition shown next to the protobuf message Partition, with the targets labeled HBMS_PARTITIONS and HBMS_SDS]

Caching

Aggregate stats
• Location: on HBase
• Compile time

File footers
• Location: on HBase
• Runtime: accessed from tasks

Tables, partitions, storage descriptors
• Location: on Metastore server(s)
• Compile time

Caching Aggregate Stats

"get_aggr_stats_for(dbName, tblName, partNames, colNames)"

• Gets aggregated stats for columns in each partition; an expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1 … partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache. AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  • An AggrStatsBloomFilter is created on the partitions contained in the AggrStats
• Invalidation
  • TTL expiry: nodes are evicted from the cache
  • Alter partition, drop partition, analyze, etc. add an invalidation request to a queue
  • An invalidator thread picks up the invalidation request and executes a filter on HBase to remove expired entries
  • It uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
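
The queue-driven invalidation described above can be sketched in memory. This is a simplified illustration, not the Hive implementation: the class `AggrStatsCacheSketch` and its methods are hypothetical, and an exact `HashSet` of partition names stands in for the AggrStatsBloomFilter. A real bloom filter may report false positives, which would only cause some extra entries to be invalidated.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

final class AggrStatsCacheSketch {
    // Cache key -> partitions whose stats were aggregated into that entry
    // (an exact set here; the deck describes a bloom filter in this role).
    private final Map<String, Set<String>> entries = new HashMap<>();
    private final Deque<String> invalidationQueue = new ArrayDeque<>();

    void put(String cacheKey, Collection<String> partNames) {
        entries.put(cacheKey, new HashSet<>(partNames));
    }

    // Called by operations such as alter partition, drop partition, or analyze.
    void requestInvalidation(String partName) {
        invalidationQueue.add(partName);
    }

    // One pass of the invalidator: drains the queue and removes every entry
    // whose membership set contains a changed partition. Returns entries removed.
    int runInvalidator() {
        int removed = 0;
        String partName;
        while ((partName = invalidationQueue.poll()) != null) {
            Iterator<Set<String>> it = entries.values().iterator();
            while (it.hasNext()) {
                if (it.next().contains(partName)) {
                    it.remove();
                    removed++;
                }
            }
        }
        return removed;
    }

    boolean contains(String cacheKey) {
        return entries.containsKey(cacheKey);
    }
}
```

Decoupling the request (cheap, on the DDL path) from the scan that removes entries (expensive, on a background thread) keeps metadata-changing operations from paying the invalidation cost inline.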

Caching File Footers

• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic

HBaseMetaStore Needs Transaction

Atomicity is required
- Create table/partition also creates a storage descriptor
- Alter table also alters partitions
- Drop table also drops table/column privileges

HBase doesn't support transactions
- No support for cross-row transactions

HBaseConnection
- Supports different transaction managers in theory
- VanillaHBaseConnection: no transaction

Omid

Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday

Snapshot isolation
- Natural, as HBase is a versioned database
- No locking, no deadlock, no blocking for both reads and writes
- When two concurrent transactions write to the same data, the later one aborts

Low overhead

Omid Components

TSO Server (Timestamp Oracle)
- Generates transid
- Status of transactions

TSO Client
- Talks to TSO
- Caches transaction metadata
- Most reads don't need to talk to TSO

Compactor
- Runs as an HBase coprocessor
- Removes stale cell versions

[Diagram: Client, TSO, and HBase with the Compactor]

Omid Operations

Open transaction
- Get transid from TSO

Read a cell
- Read all versions of the cell from HBase
- Read the latest committed version before transaction start

Write a cell
- Write the value, versioned with transid, to HBase

Commit
- Generate commitid from TSO
- TSO figures out if there is a conflict using transaction metadata
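
The four operations can be sketched as a single-process model of snapshot isolation. This is a toy illustration under stated assumptions, not Omid's API or implementation: the class `SnapshotIsolationSketch` and all its members are hypothetical, one counter plays the role of the timestamp oracle, and in-memory maps stand in for HBase cell versions and the TSO's transaction metadata.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class SnapshotIsolationSketch {
    private long clock = 0;                                                 // toy timestamp oracle
    private final Map<String, Long> lastCommit = new HashMap<>();           // row -> commit ts of last writer
    private final Map<Long, Long> committed = new HashMap<>();              // transid -> commitid
    private final Map<String, TreeMap<Long, String>> cells = new HashMap<>(); // row -> (transid -> value)
    private final Map<Long, Set<String>> writeSets = new HashMap<>();       // transid -> rows written

    // Open transaction: the start timestamp doubles as the transaction id.
    long begin() {
        return ++clock;
    }

    // Write a cell: the value is stored as a version tagged with the writer's transid.
    void write(long txn, String row, String value) {
        cells.computeIfAbsent(row, r -> new TreeMap<>()).put(txn, value);
        writeSets.computeIfAbsent(txn, t -> new HashSet<>()).add(row);
    }

    // Read a cell: own write if present, else the latest version committed before txn started.
    String read(long txn, String row) {
        TreeMap<Long, String> versions = cells.get(row);
        if (versions == null) {
            return null;
        }
        if (versions.containsKey(txn)) {
            return versions.get(txn);
        }
        String best = null;
        long bestCommit = -1;
        for (Map.Entry<Long, String> v : versions.entrySet()) {
            Long commitTs = committed.get(v.getKey());
            if (commitTs != null && commitTs < txn && commitTs > bestCommit) {
                bestCommit = commitTs;
                best = v.getValue();
            }
        }
        return best;
    }

    // Commit: abort if any written row was committed by someone else after txn started.
    boolean commit(long txn) {
        Set<String> rows = writeSets.remove(txn);
        if (rows == null) {
            rows = new HashSet<>();
        }
        for (String row : rows) {
            Long last = lastCommit.get(row);
            if (last != null && last > txn) {
                for (String r : rows) {
                    cells.get(r).remove(txn); // discard the aborted transaction's versions
                }
                return false;
            }
        }
        long commitTs = ++clock;
        for (String row : rows) {
            lastCommit.put(row, commitTs);
        }
        committed.put(txn, commitTs);
        return true;
    }
}
```

If two transactions start concurrently and both write `row1`, the first to commit succeeds and the second is aborted at commit time, which matches the "later one aborts" rule; neither ever blocks on a lock.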

Omid Data Structure

Memory management in TSO
- Never runs OOM; aborts old transactions

[Diagram: TSO in-memory tables]
- lastCommit: row1 T20, row2 T25, row5 T22
- committed: T10 T20, T4 T25, T11 T30, …
- aborted: T2, …

Diagram annotations
• Detect transaction conflict at commit time
• Largest chunk of memory
• Construct snapshot at read time
• Partially replicated to client

Transaction Conflict

Two concurrent DDLs write to the same data
- Proper retry logic

Task node writes: ORC footer cache
- High chance of write conflict
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypass the transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
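
The deck does not spell out what "proper retry logic" looks like. One minimal shape, assuming a transactional attempt that reports whether it committed, is a bounded retry loop; the class `DdlRetrySketch` and its method are hypothetical and not part of Hive or Omid.

```java
import java.util.function.BooleanSupplier;

final class DdlRetrySketch {
    // Re-runs the attempt until it reports a successful commit or the budget is used up.
    // Returns the number of attempts used, or -1 if every attempt hit a conflict.
    static int runWithRetry(BooleanSupplier attempt, int maxAttempts) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return i;
            }
        }
        return -1;
    }
}
```

Each attempt would open a fresh transaction, redo its reads against the new snapshot, and try to commit again. A production version would likely add backoff between attempts, which is omitted here.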

Deployment

Server-side components in HBase
- Server-side filter
- Omid compactor
- Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
- hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram: Hive MetaStore, TSO, and HBase; HBase hosts the Server Side Filter and the Omid Compactor]

Deploy Omid

Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table

Start Omid TSO
- omid.sh tso

Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT

Instantiate HBase Metastore

Instantiate HBase tables from scratch
- hive --service hbaseschematool --install

hbaseimport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport

TPCDS queries

[Chart: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. Queries 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Vertical axis from 0 to 6000.]

1824 partitions: sweet spot for ObjectStore
Average speedup for all TPCDS queries
- 219 (without Omid)
- 212 (with Omid)

Current Status

hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
- Minor holes: event notification/version/constraints
- Deprecate listTableNamesByFilter/listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported

Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097 pending review
- Not production quality yet

Future Work - ACID

Transaction metadata is stored in Metastore
- Locks
- Txns
- Compactions

Data structure is harder to de-normalize
New work: transaction server
- Keep lock and transaction tree in memory

43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Future Work ndash HA via HBase Coprocessor

Two new server componentsndash Omid TSO Serverndash Transaction Server

All servers need HAndash Management headache

Automatic HA through HBase Coprocessor

TSO Server via CoProcessor

TSO Server via CoProcessor

Region Server Region Server

44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Future Work ndash Other

Stats Aggregationndash Coprocessor

Improving ObjectCachendash Rudimentary implementation currentlyndash LRU

Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system

45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Thank You

  • Hive Hbase Metastore - Improving Hive with a Big Data Metadata
  • Agenda
  • What is Hive MetaStore
  • Low latency in Hive
  • New BottleNet - Metastore
  • Besides Latency
  • ER Diagram for ObjectStore Database
  • How About Improving ObjectStore
  • Agenda (2)
  • System Architecture
  • RDBMS schema
  • RDBMS schema (2)
  • HBase schema
  • HBase schema (2)
  • De-normalization
  • Partition Keys
  • Typed Partition Keys
  • HBase schema (3)
  • HBase schema (4)
  • HBase schema (5)
  • HBase schema (6)
  • HBase schema (7)
  • Agenda (3)
  • Caching
  • Caching Aggregate Stats
  • Caching File Footers
  • Agenda (4)
  • HBaseMetaStore Needs Transaction
  • Omid
  • Omid Components
  • Omid Operations
  • Omid Data Structure
  • Transaction Conflict
  • Agenda (5)
  • Deployment
  • Deploy Omid
  • Instantiate HBase Metastore
  • Agenda (6)
  • TPCDS queries
  • Agenda (7)
  • Current Status
  • Future Work - ACID
  • Future Work ndash HA via HBase Coprocessor
  • Future Work ndash Other
  • Thank You
Page 17: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Typed Partition Keys

Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)

ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08

Using BinarySortableSerDendash Support all Hive data typesndash Handles null

(String Integer) Bytes

lsquoA10rsquo 3 41 31 30 00 00 00 00 03

lsquoA10rsquo 10 41 31 30 00 00 00 00 0A

lsquoA5rsquo 4 41 35 00 00 00 00 04

lsquoA5rsquo 15 41 35 00 00 00 00 0D

18 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Storage Descriptor de-duplication

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog lsquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

19 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Storage Descriptor de-duplication

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog lsquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

struct Partition

1 listltstringgt values

2 string dbName

3 string tableName

4 i32 createTime

5 i32 lastAccessTime

6 StorageDescriptor sd

7 mapltstring stringgt parameters

8 optional PrincipalPrivilegeSet privileges

struct StorageDescriptor

1 listltFieldSchemagt cols

2 string location

3 string inputFormat

4 string outputFormat

5 bool compressed

6 i32 numBuckets

7 SerDeInfo serdeInfo

8 listltstringgt bucketCols

9 listltOrdergt sortCols

10 mapltstring stringgt parameters

11 optional SkewedInfo skewedInfo

12 optional bool storedAsSubDirectories

20 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Storage Descriptor de-duplication

Table Name Key Column Families and Columns

Description

HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto

HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count

HBMS_TBLS bytes(dbName tblName)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Table protoldquosrdquo Stats per column in the Table

HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)

cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn

ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition

HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )

cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto

HBMS_FUNCS bytes(dbName funcName)

cf_catalog lsquocrdquo ldquocrdquo Function proto

HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo

ldquocrdquo Metadata footer protoldquosrdquo PPD Stats

message Partition

optional int64 create_time = 1

optional int64 last_access_time = 2

optional string location = 3

optional Parameters sd_parameters = 4

required bytes sd_hash = 5

optional Parameters parameters = 6

message StorageDescriptor

message Order hellip

message SerDeInfo hellip

message SkewedInfo hellip

repeated FieldSchema cols = 1

optional string input_format = 2

optional string output_format = 3

optional bool is_compressed = 4

optional sint32 num_buckets = 5

optional SerDeInfo serde_info = 6

repeated string bucket_cols = 7

repeated Order sort_cols = 8

optional SkewedInfo skewed_info = 9

optional bool stored_as_sub_directories = 10

21 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

HBase schema

ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in

metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding

protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)

bull Writesreads the protobuf payloads tofrom HBase tables

Example adding a new partition ldquoadd_partition(Partition new_part)rdquo

22 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

HBase schema

ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in

metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding

protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)

bull Writesreads the protobuf payloads tofrom HBase tables

Example adding a new partition ldquoadd_partition(Partition new_part)rdquo

struct Partition

1 listltstringgt values

2 string dbName

3 string tableName

4 i32 createTime

5 i32 lastAccessTime

6 StorageDescriptor sd

7 mapltstring stringgt parameters

8 optional PrincipalPrivilegeSet privileges

message Partition

optional int64 create_time = 1

optional int64 last_access_time = 2

optional string location = 3

optional Parameters sd_parameters = 4

required bytes sd_hash = 5

optional Parameters parameters = 6

HBMS_

PARTITIONS

HBMS_

SDS

23 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

24 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Caching

Aggregate Statsbull Location - on HBasebull Compile time

File Footers bull Location - on HBasebull Runtime - accessed from tasks

Tables Partitions Storage Descriptors bull Location - on Metastore server(s)bull Compile time

25 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Caching Aggregate Stats

ldquoget_aggr_stats_for(dbName tblName partNames colNames)rdquo

bull Gets aggregated stats for columns in each partition ndash expensive callbull Used in CBO Stats Annotation Stats Optimizerbull HBMS_AGGR_STATS

bull RowKey md5(dbName tblName partVal1 partValn colName) bull Columns AggrStats proto and AggrStatsBloomFilter proto

bull Lookup bull New entry added for each key not found in cache AggrStats calculated on client

side amp cached entry saved as serialized AggrStats proto bull AggrStatsBloomFilter created on partitions contained in AggrStats

bull Invalidation bull TTL expiry nodes evicted from cachebull Alter partition Drop partition Analyze etc add invalidation request to a queuebull Invalidator thread picks invalidation request amp executes a filter on HBase to

removes expired entriesbull Uses the bloom filter to find all AggrStats proto contains the candidate partition amp

removes them from the cache

26 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Caching File Footers

bull ORC footer cachebull Task write file footers to a cache table on HBase (HBMS_FILE_METADATA RowKey fileId)bull Read from AM for split generation (avoids reading lots of HDFS files for split generation)bull Since fileId is unique overwrite not a problem Stale entries removed by a cleaner

thread

bull Skip transactionbull High overheadbull Transaction conflictbull Row mutation is already atomic

27 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

28 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

HBaseMetaStore Needs Transaction

Atomic is requiredndash Create table partition also create storage descriptorndash Alter table also alter partitionsndash Drop table also drop table column privilege

HBase donrsquot support transactionndash Donrsquot support cross-row transactions

HBaseConnectionndash Support different transaction manager in theoryndash VanillaHBaseConnection no transaction

29 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid

Transaction layer on top of Hbase Initially developed by Yahoo Apache incubator project

ndash First release this Monday

Snapshot isolationndash Natural as HBase is a versioned databasendash No locking no dead lock no blocking for both read and writendash Two concurrent transaction write to the same data the later one aborts

Low overhead

30 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid Components

TSO Server (Timestamp Oracle)ndash Generate transidndash Status of transaction

TSO Clientndash Talk to TSOndash Cache transaction metadatandash Most read donrsquot need to talk to TSO

Compactorndash Run as HBase Coprocessorndash Remove stale cell versions

HBaseCompactor

Client

TSO

31 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid Operations

Open transactionndash Get transid from TSO

Read a cellndash Read all versions of the cell from HBasendash Read latest committed version before transaction start

Write a cellndash Write value versioned with transid to HBase

Commitndash Generate commitid from TSOndash TSO figure out if there is conflict using transaction metadata

32 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid Data Structure

Memory management in TSOndash Never run OOM abort old transactions

TSO

row1 T20

row2 T25

row5 T22

lastCommit committedT10 T20

T4 T25

T11 T30

T2 hellip hellip

aborted

bull Detect transaction conflict at commit time

bull Largest trunk of memory

bull Construct snapshot at read time

bull Partially replicated to client

33 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Transaction Conflict

Two concurrent DDL write to the same datandash Proper retry logic

Task node writes - ORC footer cache

ndash High chance for write conflictndash Row mutation is atomic in Hbasendash Cross row atomic is not requiredndash Bypass transaction layer

public void putFileMetadata(ListltLonggt fileIds ListltByteBuffergt metadata FileMetadataExprType type)

34 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

35 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Deployment

Server side components in HBasendash Server side filterndash Omid compactorndash Copy related hive jars into hbase hive-commonjar hive-metastorejar hive-serde-jar

New config in hive-sitexmlndash hivemetastorerawstoreimpl orgapachehadoophivemetastorehbaseHBaseStore

Server Side Filter

Omid Compactor

HBase

TSO

Hive MetaStore

36 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Deploy Omid

Create Omid Tables in HBasendash omidsh create-hbase-commit-tablendash omidsh create-hbase-timestamp-table

Start Omid TSOndash omidsh tso

Related config in hive-sitexmlndash hivemetastorehbaseconnectionclass=orgapachehadoophivemetastorehbaseOmid

HBaseConnectionndash tsohost=localhostndash tsoport=54758ndash omidclientconnectionType=DIRECT

37 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Instantiate HBase Metastore

Instantiate Hbase Tables from scratchndash hive --service hbaseschematool --install

Hbaseimport import existing Hive Metastorendash One way import from ObjectStore to HBaseStorendash hive --service hbaseimport

38 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

39 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

TPCDS queries

Query 7 Query 15 Query 27 Query 29 Query 39 Query 46 Query 56 Query 68 Query 70 Query 760

1000

2000

3000

4000

5000

6000

Query Plan Time for TPCDS queries

HBaseStore HBaseStore+Omid ObjectStore

1824 partitions Sweetspot for ObjectStore Average Speed up for all TPCDS queries

ndash 219 (without Omid)ndash 212 (With Omid)

40 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Current Status

hbase-metastore branch merged to master last September Turn off by default Feature parity Almost

ndash Minor holes event notificationversionconstraintsndash Deprecate listTableNamesByFilterlistPartitionNamesByFilterndash Tools enhancementndash ACID is not supported

Run most e2e queries Fixing unit tests

ndash TestMiniTezCliDriver all passndash TestCliDriver HIVE-14097 pending reviewndash Not production quality yet

42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Future Work - ACID

Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions

Data structure is harder to de-normalize New work transaction server

ndash Keep lock and transaction tree in memory

43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Future Work ndash HA via HBase Coprocessor

Two new server componentsndash Omid TSO Serverndash Transaction Server

All servers need HAndash Management headache

Automatic HA through HBase Coprocessor

TSO Server via CoProcessor

TSO Server via CoProcessor

Region Server Region Server

44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Future Work ndash Other

Stats Aggregationndash Coprocessor

Improving ObjectCachendash Rudimentary implementation currentlyndash LRU

Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system

45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Thank You

  • Hive Hbase Metastore - Improving Hive with a Big Data Metadata
  • Agenda
  • What is Hive MetaStore
  • Low latency in Hive
  • New BottleNet - Metastore
  • Besides Latency
  • ER Diagram for ObjectStore Database
  • How About Improving ObjectStore
  • Agenda (2)
  • System Architecture
  • RDBMS schema
  • RDBMS schema (2)
  • HBase schema
  • HBase schema (2)
  • De-normalization
  • Partition Keys
  • Typed Partition Keys
  • HBase schema (3)
  • HBase schema (4)
  • HBase schema (5)
  • HBase schema (6)
  • HBase schema (7)
  • Agenda (3)
  • Caching
  • Caching Aggregate Stats
  • Caching File Footers
  • Agenda (4)
  • HBaseMetaStore Needs Transaction
  • Omid
  • Omid Components
  • Omid Operations
  • Omid Data Structure
  • Transaction Conflict
  • Agenda (5)
  • Deployment
  • Deploy Omid
  • Instantiate HBase Metastore
  • Agenda (6)
  • TPCDS queries
  • Agenda (7)
  • Current Status
  • Future Work - ACID
  • Future Work - HA via HBase Coprocessor
  • Future Work - Other
  • Thank You

Storage Descriptor de-duplication

HBMS_DBS
  Key: bytes(dbName)
  Column families and columns: cf_catalog: "c"
  Description: "c": Database proto

HBMS_SDS
  Key: bytes(md5(SD proto))
  Column families and columns: cf_catalog: "c", "ref"
  Description: "c": StorageDescriptor proto; "ref": reference count

HBMS_TBLS
  Key: bytes(dbName, tblName)
  Column families and columns: cf_catalog: "c"; cf_stats: "s" -> c1 … cn
  Description: "c": Table proto; "s": Stats per column in the Table

HBMS_PARTITIONS
  Key: bytes(dbName, tblName, partVal1, ..., partValn)
  Column families and columns: cf_catalog: "c"; cf_stats: "s" -> c1 … cn
  Description: "c": Partition proto; "s": Stats per column in the Partition

HBMS_AGGR_STATS
  Key: bytes(md5(dbName, tblName, partVal1, ..., partValn, colName))
  Column families and columns: cf_catalog: "s", "b"
  Description: "b": AggrStatsBloomFilter proto; "s": AggrStats proto

HBMS_FUNCS
  Key: bytes(dbName, funcName)
  Column families and columns: cf_catalog: "c"
  Description: "c": Function proto

HBMS_FILE_METADATA
  Key: bytes(fileId)
  Column families and columns: cf_catalog: "c"; cf_stats: "s"
  Description: "c": Metadata footer proto; "s": PPD Stats

struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
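
The de-duplication idea (HBMS_SDS rows keyed by the md5 of the serialized StorageDescriptor, a "ref" reference count, and an sd_hash stored in each Partition) can be sketched roughly as below. This is a minimal in-memory illustration in Python, not Hive's implementation: the SdStore class, its method names, and the dict standing in for the HBase table are hypothetical.

```python
import hashlib


class SdStore:
    """In-memory stand-in for HBMS_SDS: md5(serialized SD) -> [sd bytes, ref count]."""

    def __init__(self):
        self.rows = {}

    def put(self, sd_bytes: bytes) -> bytes:
        """Store a serialized storage descriptor once and return its hash key."""
        key = hashlib.md5(sd_bytes).digest()
        row = self.rows.get(key)
        if row is None:
            self.rows[key] = [sd_bytes, 1]   # first reference creates the row
        else:
            row[1] += 1                      # later references only bump "ref"
        return key

    def release(self, key: bytes) -> None:
        """Drop one reference and delete the row when nothing points at it."""
        row = self.rows[key]
        row[1] -= 1
        if row[1] == 0:
            del self.rows[key]


store = SdStore()
sd = b"cols=[id:int];input_format=orc"   # placeholder for a serialized SD proto
h1 = store.put(sd)                       # partition 1
h2 = store.put(sd)                       # partition 2 shares the same descriptor
```

Both partitions end up holding the same hash, so a table with many partitions that share one descriptor stores that descriptor only once.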

HBase schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server passes Thrift objects to the HBase client opened in the Thrift server
• HBase client extracts fields from Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables

Example: adding a new partition, "add_partition(Partition new_part)"

[Diagram: Thrift struct Partition and protobuf message Partition, with the HBMS_PARTITIONS and HBMS_SDS tables]
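
To illustrate the write path for add_partition, the sketch below splits a Thrift-style Partition into the two records suggested by the protobuf definitions: a partition record carrying location, sd_parameters and sd_hash, and a shared storage-descriptor payload. It rests on several assumptions: plain dicts replace the Thrift and protobuf objects, JSON replaces protobuf serialization, and the helper names and the 0x00 separator in the row key are hypothetical.

```python
import hashlib
import json


def split_partition(part: dict):
    """Return (row_key, partition_record, sd_hash, sd_bytes) for a Thrift-like Partition."""
    sd = dict(part["sd"])
    # location and parameters vary per partition, so they move out of the shared SD.
    location = sd.pop("location", None)
    sd_parameters = sd.pop("parameters", {})
    # Deterministic serialization so identical descriptors hash identically
    # (JSON is only a stand-in for the protobuf encoding).
    sd_bytes = json.dumps(sd, sort_keys=True).encode("utf-8")
    sd_hash = hashlib.md5(sd_bytes).digest()
    partition_record = {
        "create_time": part["createTime"],
        "last_access_time": part["lastAccessTime"],
        "location": location,
        "sd_parameters": sd_parameters,
        "sd_hash": sd_hash,
        "parameters": part.get("parameters", {}),
    }
    # Row key from dbName, tblName and the partition values; the separator
    # byte is a hypothetical choice for this sketch.
    row_key = b"\x00".join(
        s.encode("utf-8") for s in [part["dbName"], part["tableName"], *part["values"]]
    )
    return row_key, partition_record, sd_hash, sd_bytes


def make_part(day: str) -> dict:
    """Build a sample partition of a hypothetical table default.sales."""
    return {
        "values": [day], "dbName": "default", "tableName": "sales",
        "createTime": 0, "lastAccessTime": 0, "parameters": {},
        "sd": {"cols": ["id:int"], "location": f"/warehouse/sales/ds={day}",
               "inputFormat": "orc", "parameters": {}},
    }


key_a, rec_a, hash_a, _ = split_partition(make_part("2016-06-01"))
key_b, rec_b, hash_b, _ = split_partition(make_part("2016-06-02"))
```

The two sample partitions get different row keys and locations but the same sd_hash, which is what lets them share a single HBMS_SDS row.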


Caching

Aggregate Stats
• Location - on HBase
• Compile time

File Footers
• Location - on HBase
• Runtime - accessed from tasks

Tables, Partitions, Storage Descriptors
• Location - on Metastore server(s)
• Compile time

Caching Aggregate Stats

"get_aggr_stats_for(dbName, tblName, partNames, colNames)"

• Gets aggregated stats for columns in each partition - expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache; AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  • An AggrStatsBloomFilter is created on the partitions contained in AggrStats
• Invalidation
  • TTL expiry: nodes evicted from cache
  • Alter partition, Drop partition, Analyze etc. add an invalidation request to a queue
  • Invalidator thread picks an invalidation request and executes a filter on HBase to remove expired entries
  • Uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
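
The lookup and invalidation steps can be sketched as follows. This is a simplified single-process Python model under stated assumptions, not the HBase-backed code: BloomFilter and AggrStatsCache are hypothetical classes, the key is built from partition names rather than typed partition values, and invalidation runs inline instead of on a queue with a separate invalidator thread.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class AggrStatsCache:
    def __init__(self):
        self.entries = {}  # md5 key -> (aggregated stats, BloomFilter of partitions)

    @staticmethod
    def key(db, tbl, part_names, col):
        raw = "\x00".join([db, tbl, *part_names, col]).encode("utf-8")
        return hashlib.md5(raw).digest()

    def get(self, db, tbl, part_names, col, compute):
        k = self.key(db, tbl, part_names, col)
        if k not in self.entries:
            bloom = BloomFilter()
            for p in part_names:
                bloom.add(p)
            # Cache miss: compute on the client side and remember the result.
            self.entries[k] = (compute(part_names, col), bloom)
        return self.entries[k][0]

    def invalidate(self, part_name: str) -> int:
        """Drop every entry whose Bloom filter may contain the changed partition."""
        stale = [k for k, (_, bloom) in self.entries.items()
                 if bloom.might_contain(part_name)]
        for k in stale:
            del self.entries[k]
        return len(stale)


calls = []

def fake_compute(parts, col):
    calls.append((tuple(parts), col))
    return {"numNulls": 0, "numParts": len(parts)}

cache = AggrStatsCache()
first = cache.get("default", "sales", ["ds=1", "ds=2"], "price", fake_compute)
second = cache.get("default", "sales", ["ds=1", "ds=2"], "price", fake_compute)
removed = cache.invalidate("ds=2")
```

A Bloom filter false positive only causes an unnecessary eviction, so correctness is preserved at the cost of an occasional recomputation.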

Caching File Footers

• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic
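
A rough Python model of the footer cache is shown below. It only illustrates why plain single-row puts are enough here; FooterCache, its methods, and the way the cleaner learns which files are still live are assumptions made for this sketch.

```python
class FooterCache:
    """Stand-in for HBMS_FILE_METADATA: fileId -> serialized footer bytes."""

    def __init__(self):
        self.rows = {}

    def put_file_metadata(self, file_ids, footers):
        # One independent row write per file, no cross-row transaction.
        # Two tasks writing the same fileId store the same footer, so an
        # overwrite is harmless.
        for file_id, footer in zip(file_ids, footers):
            self.rows[file_id] = footer

    def get_file_metadata(self, file_ids):
        # Split generation reads footers here instead of opening each file.
        return [self.rows.get(file_id) for file_id in file_ids]

    def clean(self, live_file_ids):
        """Cleaner pass: drop entries for files that no longer exist."""
        stale = [f for f in self.rows if f not in live_file_ids]
        for f in stale:
            del self.rows[f]
        return stale


footers = FooterCache()
footers.put_file_metadata([101, 102], [b"footer-101", b"footer-102"])
footers.put_file_metadata([101], [b"footer-101"])      # duplicate write from another task
found = footers.get_file_metadata([101, 102, 103])     # 103 was never cached
dropped = footers.clean(live_file_ids={101})
```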


HBaseMetaStore Needs Transaction

Atomicity is required
- Create table/partition also creates storage descriptor
- Alter table also alters partitions
- Drop table also drops table/column privileges

HBase doesn't support transactions
- Doesn't support cross-row transactions

HBaseConnection
- Supports different transaction managers in theory
- VanillaHBaseConnection: no transaction

Omid

Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday

Snapshot isolation
- Natural, as HBase is a versioned database
- No locking, no deadlock, no blocking for both read and write
- When two concurrent transactions write to the same data, the later one aborts

Low overhead

Omid Components

TSO Server (Timestamp Oracle)
- Generates transid
- Status of transaction

TSO Client
- Talks to TSO
- Caches transaction metadata
- Most reads don't need to talk to TSO

Compactor
- Runs as HBase Coprocessor
- Removes stale cell versions

[Diagram: Client, TSO, and HBase with Compactor]

Omid Operations

Open transaction
- Get transid from TSO

Read a cell
- Read all versions of the cell from HBase
- Read latest committed version before transaction start

Write a cell
- Write value versioned with transid to HBase

Commit
- Generate commitid from TSO
- TSO figures out if there is a conflict using transaction metadata

Omid Data Structure

Memory management in TSO
- Never runs OOM: aborts old transactions

[Diagram: data structures held in the TSO]

lastCommit
  row1  T20
  row2  T25
  row5  T22

committed
  T10  T20
  T4   T25
  T11  T30

aborted
  T2 … …

• Detect transaction conflict at commit time
• Largest chunk of memory
• Construct snapshot at read time
• Partially replicated to client
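
The operations and TSO data structures described above can be modelled in a few lines of Python. This is a toy, single-process sketch of snapshot isolation with commit-time conflict detection, written from the slide descriptions only. It is not Omid's code or API: the Tso and Store classes and their methods are hypothetical, and memory management, client-side caching, and compaction are left out.

```python
class Tso:
    """Toy timestamp oracle: hands out timestamps and detects write-write conflicts."""

    def __init__(self):
        self.clock = 0
        self.last_commit = {}   # row -> commit timestamp of its latest committed write
        self.committed = {}     # start timestamp (transid) -> commit timestamp
        self.aborted = set()

    def next_ts(self) -> int:
        self.clock += 1
        return self.clock

    def commit(self, start_ts: int, write_set) -> bool:
        # Conflict if any written row was committed after this transaction started.
        if any(self.last_commit.get(row, 0) > start_ts for row in write_set):
            self.aborted.add(start_ts)
            return False
        commit_ts = self.next_ts()
        for row in write_set:
            self.last_commit[row] = commit_ts
        self.committed[start_ts] = commit_ts
        return True


class Store:
    """Versioned cells: row -> list of (writer transid, value)."""

    def __init__(self, tso: Tso):
        self.tso = tso
        self.cells = {}

    def write(self, start_ts: int, row: str, value) -> None:
        self.cells.setdefault(row, []).append((start_ts, value))

    def read(self, start_ts: int, row: str):
        # Pick the version whose commit timestamp is the latest one before our start.
        best = None
        for writer_ts, value in self.cells.get(row, []):
            commit_ts = self.tso.committed.get(writer_ts)
            if commit_ts is not None and commit_ts < start_ts:
                if best is None or commit_ts > best[0]:
                    best = (commit_ts, value)
        return None if best is None else best[1]


tso = Tso()
store = Store(tso)

t1 = tso.next_ts()
store.write(t1, "row1", "v1")
ok1 = tso.commit(t1, {"row1"})

t2 = tso.next_ts()               # t2 and t3 both start after t1 committed
t3 = tso.next_ts()
store.write(t2, "row1", "v2")
store.write(t3, "row1", "v3")
ok2 = tso.commit(t2, {"row1"})
ok3 = tso.commit(t3, {"row1"})   # overlaps with t2 on row1, so it aborts
seen_by_t3 = store.read(t3, "row1")
t4 = tso.next_ts()
seen_by_t4 = store.read(t4, "row1")
```

In this run t3 still sees "v1" because t2 committed after t3 started, while the later transaction t4 sees "v2"; the uncommitted "v3" is never visible.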

Transaction Conflict

Two concurrent DDLs write to the same data
- Proper retry logic

Task node writes - ORC footer cache
- High chance of write conflict
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypass transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
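
One possible shape for the "proper retry logic" on conflicting DDLs is sketched below in Python. The slides do not describe the actual retry policy, so the exception type, the attempt limit, and the backoff with jitter are all assumptions chosen for illustration.

```python
import random
import time


class TransactionConflict(Exception):
    """Raised by the (hypothetical) transaction layer when a commit is rejected."""


def run_with_retry(txn_body, max_attempts: int = 5, base_delay: float = 0.01):
    """Run txn_body(); on conflict, back off with jitter and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_body()
        except TransactionConflict:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter so competing writers do not
            # retry in lockstep.
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.random())


attempts = []

def flaky_ddl():
    """Simulated DDL that loses the conflict check twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TransactionConflict("concurrent DDL touched the same rows")
    return "committed"

result = run_with_retry(flaky_ddl)
```

Because an aborted transaction under snapshot isolation leaves no visible writes, re-running the whole transaction body from the start is safe.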


Deployment

Server side components in HBase
- Server side filter
- Omid compactor
- Copy related hive jars into hbase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
- hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram: Hive MetaStore, TSO, and HBase hosting the Server Side Filter and Omid Compactor]

Deploy Omid

Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table

Start Omid TSO
- omid.sh tso

Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT

Instantiate HBase Metastore

Instantiate HBase tables from scratch
- hive --service hbaseschematool --install

hbaseimport: import existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport


TPCDS queries

[Chart: Query Plan Time for TPCDS queries. X axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y axis: 0 to 6000. Series: HBaseStore, HBaseStore+Omid, ObjectStore]

1824 partitions
Sweet spot for ObjectStore
Average speed-up for all TPCDS queries
- 219 (without Omid)
- 212 (with Omid)

Page 20: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

20 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Storage Descriptor de-duplication

HBase tables (table name; key; column families and columns; description)

HBMS_DBS
  Key: bytes(dbName)
  cf_catalog: "c"
  "c": Database proto

HBMS_SDS
  Key: bytes(md5(SD proto))
  cf_catalog: "c", "ref"
  "c": StorageDescriptor proto; "ref": reference count

HBMS_TBLS
  Key: bytes(dbName, tblName)
  cf_catalog: "c"; cf_stats: "s" -> c1 … cn
  "c": Table proto; "s": Stats per column in the Table

HBMS_PARTITIONS
  Key: bytes(dbName, tblName, partVal1, ..., partValn)
  cf_catalog: "c"; cf_stats: "s" -> c1 … cn
  "c": Partition proto; "s": Stats per column in the Partition

HBMS_AGGR_STATS
  Key: bytes(md5(dbName, tblName, partVal1, ..., partValn, colName))
  cf_catalog: "s", "b"
  "b": AggrStatsBloomFilter proto; "s": AggrStats proto

HBMS_FUNCS
  Key: bytes(dbName, funcName)
  cf_catalog: "c"
  "c": Function proto

HBMS_FILE_METADATA
  Key: bytes(fileId)
  cf_catalog: "c"; cf_stats: "s"
  "c": Metadata footer proto; "s": PPD Stats

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
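
The de-duplication idea (one storage descriptor row keyed by the md5 of its serialized form, shared by many partitions through sd_hash and tracked with a reference count) can be sketched in a few lines of Python. This is a minimal in-memory sketch under stated assumptions, not Hive's implementation: SdStore, put, and release are hypothetical names, a dict stands in for the HBMS_SDS table, and raw bytes stand in for the serialized StorageDescriptor proto.

```python
import hashlib


class SdStore:
    """In-memory stand-in for HBMS_SDS: md5(SD bytes) -> [SD bytes, ref count]."""

    def __init__(self):
        self.rows = {}

    def put(self, sd_bytes: bytes) -> bytes:
        """Store a serialized descriptor once; return the hash a partition keeps."""
        key = hashlib.md5(sd_bytes).digest()
        row = self.rows.get(key)
        if row is None:
            self.rows[key] = [sd_bytes, 1]   # first user of this descriptor
        else:
            row[1] += 1                      # identical descriptor: bump the count
        return key

    def release(self, key: bytes) -> None:
        """Drop one reference; delete the row when nobody uses it any more."""
        row = self.rows[key]
        row[1] -= 1
        if row[1] == 0:
            del self.rows[key]
```

Two partitions with byte-identical descriptors end up sharing one row, and the row disappears only when the last partition that references it is dropped.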

HBase schema

Read/Write path
• The Thrift client creates Thrift objects for RPC (based on the specs in metastore/if/hive_metastore.thrift)
• The Thrift server passes the Thrift objects to the HBase client opened in the Thrift server
• The HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables

Example: adding a new partition, "add_partition(Partition new_part)"

struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

[Diagram labels: HBMS_PARTITIONS, HBMS_SDS]

Caching

Aggregate Stats
• Location: on HBase
• Compile time

File Footers
• Location: on HBase
• Runtime: accessed from tasks

Tables, Partitions, Storage Descriptors
• Location: on Metastore server(s)
• Compile time

Caching Aggregate Stats

"get_aggr_stats_for(dbName, tblName, partNames, colNames)"

• Gets aggregated stats for columns in each partition: an expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache. AggrStats is calculated on the client side, and the cached entry is saved as a serialized AggrStats proto
  • An AggrStatsBloomFilter is created on the partitions contained in AggrStats
• Invalidation
  • TTL expiry: nodes are evicted from the cache
  • Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
  • An invalidator thread picks up the invalidation request and executes a filter on HBase to remove expired entries
  • Uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
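
The lookup and invalidation steps can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Hive's code: Bloom and AggrStatsCache are hypothetical classes, a dict stands in for HBMS_AGGR_STATS, the bloom filter size and hash count are arbitrary, and a plain dict stands in for the AggrStats proto.

```python
import hashlib


class Bloom:
    """Tiny bloom filter: no false negatives, occasional false positives."""

    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes, self.mask = bits, hashes, 0

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.mask |= 1 << p

    def might_contain(self, item: str) -> bool:
        return all((self.mask >> p) & 1 for p in self._positions(item))


class AggrStatsCache:
    """Row key: md5(db, table, partitions, column) -> (stats, bloom of partitions)."""

    def __init__(self):
        self.rows = {}

    @staticmethod
    def key(db, tbl, parts, col) -> bytes:
        return hashlib.md5("\x00".join([db, tbl, *parts, col]).encode()).digest()

    def put(self, db, tbl, parts, col, stats) -> None:
        bloom = Bloom()
        for part in parts:
            bloom.add(f"{db}.{tbl}.{part}")
        self.rows[self.key(db, tbl, parts, col)] = (stats, bloom)

    def get(self, db, tbl, parts, col):
        row = self.rows.get(self.key(db, tbl, parts, col))
        return row[0] if row else None

    def invalidate(self, db, tbl, part) -> int:
        """Drop every entry whose bloom filter may contain the changed partition."""
        target = f"{db}.{tbl}.{part}"
        stale = [k for k, (_, b) in self.rows.items() if b.might_contain(target)]
        for k in stale:
            del self.rows[k]
        return len(stale)
```

Because a bloom filter never gives a false negative, every entry that really covers the changed partition is removed; a false positive only costs an unnecessary recomputation of one cached aggregate.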

Caching File Footers

• ORC footer cache
• Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
• Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
• Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic
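
A minimal Python sketch of why the footer cache can skip the transaction layer: every write touches exactly one row keyed by a unique fileId, so concurrent writers can only overwrite a row with the same footer. FooterCache and its methods are hypothetical names, a dict stands in for HBMS_FILE_METADATA, and the cleaner's notion of "stale" (the file is no longer in a caller-supplied set of live files) is an assumption made for illustration.

```python
class FooterCache:
    """In-memory stand-in for HBMS_FILE_METADATA: fileId -> footer bytes."""

    def __init__(self):
        self.rows = {}

    def put(self, file_ids, footers) -> None:
        """Write one row per file; each row is an independent single-row mutation."""
        if len(file_ids) != len(footers):
            raise ValueError("file_ids and footers must have the same length")
        for file_id, footer in zip(file_ids, footers):
            self.rows[file_id] = footer      # overwrite is harmless: same file, same footer

    def get(self, file_ids):
        """Return cached footers in request order, None for a cache miss."""
        return [self.rows.get(file_id) for file_id in file_ids]

    def clean(self, live_file_ids):
        """Cleaner pass: remove rows for files that are no longer live."""
        live = set(live_file_ids)
        stale = [file_id for file_id in self.rows if file_id not in live]
        for file_id in stale:
            del self.rows[file_id]
        return stale
```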

HBaseMetaStore Needs Transaction

Atomicity is required
- Create table/partition also creates the storage descriptor
- Alter table also alters partitions
- Drop table also drops table and column privileges

HBase doesn't support transactions
- Doesn't support cross-row transactions

HBaseConnection
- Supports different transaction managers in theory
- VanillaHBaseConnection: no transaction

Omid

Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday

Snapshot isolation
- Natural, as HBase is a versioned database
- No locking, no deadlock, no blocking for both reads and writes
- When two concurrent transactions write to the same data, the later one aborts

Low overhead

Omid Components

TSO Server (Timestamp Oracle)
- Generates transid
- Tracks the status of transactions

TSO Client
- Talks to TSO
- Caches transaction metadata
- Most reads don't need to talk to TSO

Compactor
- Runs as an HBase Coprocessor
- Removes stale cell versions

[Diagram: Client, TSO, HBase, Compactor]

Omid Operations

Open transaction
- Get transid from TSO

Read a cell
- Read all versions of the cell from HBase
- Read the latest committed version before transaction start

Write a cell
- Write the value, versioned with transid, to HBase

Commit
- Generate commitid from TSO
- TSO figures out whether there is a conflict using transaction metadata
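
The four operations can be traced with a toy Python model of snapshot isolation. It is a sketch under stated assumptions and not Omid's implementation: Tso and Store are hypothetical classes, everything lives in memory, a transaction sees its own uncommitted writes, and writes of aborted transactions are simply left behind (no rollback or compaction).

```python
class Tso:
    """Toy timestamp oracle: issues timestamps and detects write-write conflicts."""

    def __init__(self):
        self.clock = 0
        self.last_commit = {}   # row -> commit timestamp of its last committed writer
        self.committed = {}     # start timestamp -> commit timestamp

    def begin(self) -> int:
        self.clock += 1
        return self.clock

    def commit(self, start_ts, written_rows):
        """Return a commit timestamp, or None if the transaction must abort."""
        # Conflict: a row we wrote was committed by someone else after we started.
        if any(self.last_commit.get(row, 0) > start_ts for row in written_rows):
            return None
        self.clock += 1
        for row in written_rows:
            self.last_commit[row] = self.clock
        self.committed[start_ts] = self.clock
        return self.clock


class Store:
    """Toy versioned table: row -> {writer start timestamp: value}."""

    def __init__(self, tso):
        self.tso = tso
        self.cells = {}

    def write(self, start_ts, row, value) -> None:
        self.cells.setdefault(row, {})[start_ts] = value

    def read(self, start_ts, row):
        versions = self.cells.get(row, {})
        if start_ts in versions:             # a transaction sees its own writes
            return versions[start_ts]
        best = None
        for writer_ts, value in versions.items():
            commit_ts = self.tso.committed.get(writer_ts)
            # Visible only if the writer committed before this snapshot began.
            if commit_ts is not None and commit_ts < start_ts:
                if best is None or commit_ts > best[0]:
                    best = (commit_ts, value)
        return best[1] if best else None
```

With two transactions that both write row1, the one that commits first wins and the later commit returns None, which matches the "later one aborts" rule of snapshot isolation.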

Omid Data Structure

Memory management in TSO
- Never runs out of memory: old transactions are aborted

[Diagram: TSO in-memory tables]
  lastCommit: row1 T20, row2 T25, row5 T22
  committed: T10 T20, T4 T25, T11 T30
  aborted: T2, …

• Detect transaction conflict at commit time
• Largest chunk of memory
• Construct snapshot at read time
• Partially replicated to client

Transaction Conflict

Two concurrent DDLs write to the same data
- Proper retry logic

Task node writes: ORC footer cache
- High chance of write conflicts
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypass the transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
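
A minimal sketch of what retry-on-conflict could look like for concurrent DDL, assuming the transaction layer signals a lost write-write conflict with an exception. TxnConflict and run_with_retry are hypothetical names used for illustration only; they are not Hive or Omid APIs, and the attempt limit is arbitrary.

```python
class TxnConflict(Exception):
    """Hypothetical signal that a commit lost a write-write conflict."""


def run_with_retry(body, max_attempts=3):
    """Run body(attempt) until it commits or max_attempts is exhausted.

    body is expected to open a fresh transaction on every call, so each
    retry works against a new snapshot.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return body(attempt)
        except TxnConflict:
            if attempt == max_attempts:
                raise  # give up and surface the conflict to the caller
```

Re-running the whole body, rather than only the commit, matters under snapshot isolation: the aborted attempt read a snapshot that is now out of date.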

Deployment

Server-side components in HBase
- Server-side filter
- Omid compactor
- Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
- hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram: Hive MetaStore, TSO, and HBase with the Server Side Filter and Omid Compactor]

Deploy Omid

Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table

Start Omid TSO
- omid.sh tso

Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT

Instantiate HBase Metastore

Instantiate HBase tables from scratch
- hive --service hbaseschematool --install

hbaseimport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport

TPCDS queries

[Chart: Query Plan Time for TPCDS queries. X axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y axis: 0 to 6000. Series: HBaseStore, HBaseStore+Omid, ObjectStore]

1824 partitions
Sweet spot for ObjectStore
Average speedup for all TPCDS queries
- 219 (without Omid)
- 212 (with Omid)

Current Status

hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
- Minor holes: event notification, version, constraints
- Deprecate listTableNamesByFilter/listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported

Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097 pending review
- Not production quality yet

Future Work - ACID

Transaction metadata is stored in Metastore
- Locks
- Txns
- Compactions

Data structure is harder to de-normalize
New work: transaction server
- Keep lock and transaction tree in memory

Future Work - HA via HBase Coprocessor

Two new server components
- Omid TSO Server
- Transaction Server

All servers need HA
- Management headache

Automatic HA through HBase Coprocessor

[Diagram: two Region Servers, each labeled "TSO Server via CoProcessor"]

Future Work - Other

Stats Aggregation
- Coprocessor

Improving ObjectCache
- Rudimentary implementation currently
- LRU

Omid consuming high CPU
- 300% CPU always, by design
- High throughput, avoids context switches
- Might be an issue for small systems

Thank You

Page 23: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

Agenda

- Motivation
- System Design
- Caching Strategy
- Transaction Management
- Deployment
- Experimental Results
- Future Work

Caching

- Aggregate Stats
  - Location: on HBase
  - Compile time
- File Footers
  - Location: on HBase
  - Runtime, accessed from tasks
- Tables, Partitions, Storage Descriptors
  - Location: on Metastore server(s)
  - Compile time

Caching Aggregate Stats

"get_aggr_stats_for(dbName, tblName, partNames, colNames)"

- Gets aggregated stats for columns in each partition; an expensive call
- Used in CBO, Stats Annotation, Stats Optimizer
- HBMS_AGGR_STATS
  - RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  - Columns: AggrStats proto and AggrStatsBloomFilter proto
- Lookup
  - A new entry is added for each key not found in the cache; AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  - An AggrStatsBloomFilter is created on the partitions contained in AggrStats
- Invalidation
  - TTL expiry: nodes are evicted from the cache
  - Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
  - An invalidator thread picks up the invalidation request and executes a filter on HBase to remove expired entries
  - It uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
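The key and invalidation scheme above can be sketched in memory. This is an illustration only, not Hive's code: the class `AggrStatsCacheSketch`, the zero-byte separator inside the hash, and the use of a plain `Set` in place of the AggrStatsBloomFilter are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the HBMS_AGGR_STATS cache table.
class AggrStatsCacheSketch {
    // One cached row: serialized stats plus the partitions they were aggregated over.
    // A Set stands in for the bloom filter; a real bloom filter may report false
    // positives, which would only cause some extra invalidation.
    static final class Entry {
        final byte[] serializedStats;
        final Set<String> partitions;

        Entry(byte[] serializedStats, Set<String> partitions) {
            this.serializedStats = serializedStats;
            this.partitions = partitions;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    // Row key: MD5 over database, table, partition values and column name.
    static String rowKey(String db, String table, List<String> partVals, String col) {
        List<String> parts = new ArrayList<>();
        parts.add(db);
        parts.add(table);
        parts.addAll(partVals);
        parts.add(col);
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String p : parts) {
                md5.update(p.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator byte, an assumption of this sketch
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    void put(String key, byte[] stats, Set<String> partitions) {
        cache.put(key, new Entry(stats, partitions));
    }

    boolean contains(String key) {
        return cache.containsKey(key);
    }

    // Invalidation: drop every entry whose partition set contains the changed partition.
    int invalidate(String partition) {
        int before = cache.size();
        cache.values().removeIf(e -> e.partitions.contains(partition));
        return before - cache.size();
    }
}
```

Hashing the key keeps row keys at a fixed length regardless of how many partition values are involved, at the cost of making the key opaque, which is why the partition membership has to be stored alongside the stats for invalidation.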

Caching File Footers

- ORC footer cache
  - Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  - Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  - Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
- Skip transaction
  - High overhead
  - Transaction conflict
  - Row mutation is already atomic

Agenda

- Motivation
- System Design
- Caching Strategy
- Transaction Management
- Deployment
- Experimental Results
- Future Work

HBaseMetaStore Needs Transaction

- Atomicity is required
  - Create table/partition also creates the storage descriptor
  - Alter table also alters partitions
  - Drop table also drops table column privilege
- HBase does not support transactions
  - It does not support cross-row transactions
- HBaseConnection
  - Supports different transaction managers in theory
  - VanillaHBaseConnection: no transaction

Omid

- Transaction layer on top of HBase
- Initially developed by Yahoo
- Apache incubator project
  - First release this Monday
- Snapshot isolation
  - Natural, as HBase is a versioned database
  - No locking, no deadlock, no blocking for either reads or writes
  - When two concurrent transactions write to the same data, the later one aborts
- Low overhead

Omid Components

- TSO Server (Timestamp Oracle)
  - Generates transid
  - Tracks the status of transactions
- TSO Client
  - Talks to the TSO
  - Caches transaction metadata
  - Most reads don't need to talk to the TSO
- Compactor
  - Runs as an HBase Coprocessor
  - Removes stale cell versions

[Diagram: Client, TSO, and HBase with the Compactor]

Omid Operations

- Open transaction
  - Get transid from the TSO
- Read a cell
  - Read all versions of the cell from HBase
  - Read the latest version committed before the transaction started
- Write a cell
  - Write the value, versioned with transid, to HBase
- Commit
  - Generate commitid from the TSO
  - The TSO figures out whether there is a conflict using transaction metadata
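The read step above can be sketched as a small visibility rule. This is a simplified model, not Omid's client code: the class `SnapshotReadSketch`, the use of plain maps for cell versions and for the commit table, and string values are assumptions made for the example.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical model of a snapshot read over one cell.
class SnapshotReadSketch {
    /**
     * @param versions      cell versions, keyed by the writer's transid (start timestamp)
     * @param commitTable   transid -> commitid; a missing key means uncommitted or aborted
     * @param readerStartTs transid of the reading transaction
     * @return the value with the latest commitid below readerStartTs, if any
     */
    static Optional<String> read(Map<Long, String> versions,
                                 Map<Long, Long> commitTable,
                                 long readerStartTs) {
        String visible = null;
        long bestCommitTs = Long.MIN_VALUE;
        for (Map.Entry<Long, String> v : versions.entrySet()) {
            Long commitTs = commitTable.get(v.getKey());
            if (commitTs == null) {
                continue; // writer never committed, so the version is invisible
            }
            // Visible only if committed before the reader started; keep the newest such version.
            if (commitTs < readerStartTs && commitTs > bestCommitTs) {
                bestCommitTs = commitTs;
                visible = v.getValue();
            }
        }
        return Optional.ofNullable(visible);
    }
}
```

For example, with versions written by transactions 10, 20 and 30, where 10 committed at 12 and 20 committed at 25, a reader that started at 22 sees the value from transaction 10, and a reader that started at 26 sees the value from transaction 20.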

Omid Data Structure

- Memory management in the TSO
  - Never runs OOM; aborts old transactions

[Diagram: in-memory tables held by the TSO]

- lastCommit: row1 -> T20, row2 -> T25, row5 -> T22
  - Detects transaction conflicts at commit time
  - Largest chunk of memory
- committed: T10 -> T20, T4 -> T25, T11 -> T30, ...
  - Constructs the snapshot at read time
  - Partially replicated to the client
- aborted: T2
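Commit-time conflict detection with a lastCommit table can be sketched as follows. This is a toy model under stated assumptions, not the Omid TSO: the class `TsoConflictSketch`, the single counter used for both transid and commitid, and the `-1` return value for an abort are all invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, single-threaded model of conflict detection in a timestamp oracle.
class TsoConflictSketch {
    private long clock = 0;
    // row -> commitid of the last transaction that committed a write to that row
    private final Map<String, Long> lastCommit = new HashMap<>();

    // Open a transaction: hand out the next timestamp as its transid.
    long begin() {
        return ++clock;
    }

    // Try to commit; returns the commitid, or -1 if the transaction must abort.
    long tryCommit(long startTs, Set<String> writeSet) {
        for (String row : writeSet) {
            Long last = lastCommit.get(row);
            if (last != null && last > startTs) {
                // Another transaction committed this row after we started: write-write conflict.
                return -1;
            }
        }
        long commitTs = ++clock;
        for (String row : writeSet) {
            lastCommit.put(row, commitTs);
        }
        return commitTs;
    }
}
```

If two transactions start concurrently and both write row1, the first to commit succeeds and the second gets `-1`, which matches the rule that the later one aborts. Only the write set is checked, so reads never block and never cause an abort in this model.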

Transaction Conflict

- Two concurrent DDL statements write to the same data
  - Proper retry logic
- Task node writes: ORC footer cache
  - High chance of write conflict
  - Row mutation is atomic in HBase
  - Cross-row atomicity is not required
  - Bypass the transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
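The retry logic mentioned for concurrent DDL can be sketched generically. The deck does not show how the retry is implemented, so everything here is an assumption: the class `RetrySketch`, the `ConflictException` type, and the fixed attempt limit are hypothetical.

```java
import java.util.function.Supplier;

// Hypothetical retry wrapper for a transactional operation that may hit a write conflict.
class RetrySketch {
    // Stand-in for whatever error the transaction layer raises when a commit is aborted.
    static final class ConflictException extends RuntimeException {
    }

    // Runs txnBody until it succeeds or maxAttempts is reached, then rethrows the last conflict.
    static <T> T withRetry(int maxAttempts, Supplier<T> txnBody) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Each attempt is expected to open a fresh transaction, redo its reads, and commit.
                return txnBody.get();
            } catch (ConflictException e) {
                last = e;
            }
        }
        throw last;
    }
}
```

Under snapshot isolation the body has to be re-executed from the start on each attempt, since the aborted transaction's reads may be stale. A production version would probably also add a backoff between attempts.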

Agenda

- Motivation
- System Design
- Caching Strategy
- Transaction Management
- Deployment
- Experimental Results
- Future Work

Deployment

- Server-side components in HBase
  - Server-side filter
  - Omid compactor
  - Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
- New config in hive-site.xml
  - hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram: Hive MetaStore and TSO alongside HBase, which hosts the Server Side Filter and the Omid Compactor]

Deploy Omid

- Create Omid tables in HBase
  - omid.sh create-hbase-commit-table
  - omid.sh create-hbase-timestamp-table
- Start the Omid TSO
  - omid.sh tso
- Related config in hive-site.xml
  - hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
  - tso.host=localhost
  - tso.port=54758
  - omid.client.connectionType=DIRECT

Instantiate HBase Metastore

- Instantiate HBase tables from scratch
  - hive --service hbaseschematool --install
- hbaseimport: import an existing Hive Metastore
  - One-way import from ObjectStore to HBaseStore
  - hive --service hbaseimport

Agenda

- Motivation
- System Design
- Caching Strategy
- Transaction Management
- Deployment
- Experimental Results
- Future Work

TPCDS queries

[Chart: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. X axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y axis: 0 to 6000.]

- 1824 partitions
- Sweet spot for ObjectStore
- Average speedup for all TPCDS queries
  - 219 (without Omid)
  - 212 (with Omid)

Page 26: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

26 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Caching File Footers

bull ORC footer cachebull Task write file footers to a cache table on HBase (HBMS_FILE_METADATA RowKey fileId)bull Read from AM for split generation (avoids reading lots of HDFS files for split generation)bull Since fileId is unique overwrite not a problem Stale entries removed by a cleaner

thread

bull Skip transactionbull High overheadbull Transaction conflictbull Row mutation is already atomic

27 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

28 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

HBaseMetaStore Needs Transaction

Atomic is requiredndash Create table partition also create storage descriptorndash Alter table also alter partitionsndash Drop table also drop table column privilege

HBase donrsquot support transactionndash Donrsquot support cross-row transactions

HBaseConnectionndash Support different transaction manager in theoryndash VanillaHBaseConnection no transaction

29 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid

Transaction layer on top of Hbase Initially developed by Yahoo Apache incubator project

ndash First release this Monday

Snapshot isolationndash Natural as HBase is a versioned databasendash No locking no dead lock no blocking for both read and writendash Two concurrent transaction write to the same data the later one aborts

Low overhead

30 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid Components

TSO Server (Timestamp Oracle)ndash Generate transidndash Status of transaction

TSO Clientndash Talk to TSOndash Cache transaction metadatandash Most read donrsquot need to talk to TSO

Compactorndash Run as HBase Coprocessorndash Remove stale cell versions

HBaseCompactor

Client

TSO

31 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid Operations

Open transactionndash Get transid from TSO

Read a cellndash Read all versions of the cell from HBasendash Read latest committed version before transaction start

Write a cellndash Write value versioned with transid to HBase

Commitndash Generate commitid from TSOndash TSO figure out if there is conflict using transaction metadata

32 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid Data Structure

Memory management in TSOndash Never run OOM abort old transactions

TSO

row1 T20

row2 T25

row5 T22

lastCommit committedT10 T20

T4 T25

T11 T30

T2 hellip hellip

aborted

bull Detect transaction conflict at commit time

bull Largest trunk of memory

bull Construct snapshot at read time

bull Partially replicated to client

33 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Transaction Conflict

Two concurrent DDL write to the same datandash Proper retry logic

Task node writes - ORC footer cache

ndash High chance for write conflictndash Row mutation is atomic in Hbasendash Cross row atomic is not requiredndash Bypass transaction layer

public void putFileMetadata(ListltLonggt fileIds ListltByteBuffergt metadata FileMetadataExprType type)

34 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

35 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Deployment

Server side components in HBase
- Server side filter
- Omid compactor
- Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
- hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram: HBase (with Server Side Filter and Omid Compactor), TSO, Hive MetaStore]


Deploy Omid

Create Omid Tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table

Start Omid TSO
- omid.sh tso

Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT


Instantiate HBase Metastore

Instantiate HBase tables from scratch
- hive --service hbaseschematool --install

hbaseimport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport


TPCDS queries

[Figure: "Query Plan Time for TPCDS queries". Bar chart comparing HBaseStore, HBaseStore+Omid, and ObjectStore for Query 7, 15, 27, 29, 39, 46, 56, 68, 70, and 76; y-axis from 0 to 6000.]

1824 partitions: sweet spot for ObjectStore

Average speedup for all TPCDS queries
- 2.19 (without Omid)
- 2.12 (with Omid)


Current Status

hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
- Minor holes: event notification/version/constraints
- Deprecate listTableNamesByFilter/listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported

Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097 pending review
- Not production quality yet


Future Work - ACID

Transaction metadata is stored in Metastore
- Locks
- Txns
- Compactions

Data structure is harder to de-normalize
New work: transaction server
- Keep lock and transaction tree in memory


Future Work - HA via HBase Coprocessor

Two new server components
- Omid TSO Server
- Transaction Server

All servers need HA
- Management headache

Automatic HA through HBase Coprocessor

[Diagram: two Region Servers, each hosting a TSO Server via Coprocessor]


Future Work - Other

Stats Aggregation
- Coprocessor

Improving ObjectCache
- Rudimentary implementation currently
- LRU

Omid consuming high CPU
- 300% CPU always, by design
- High throughput, avoid context switches
- Might be an issue for small systems
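
For reference, an LRU policy of the kind listed under ObjectCache can be sketched in a few lines with the JDK's LinkedHashMap. This is a generic illustration, not the HBase metastore's ObjectCache; the class name and the capacity parameter are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Generic LRU cache sketch; not the metastore's actual ObjectCache.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCacheSketch(int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts
    // the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("db1.t1", "table metadata 1");
        cache.put("db1.t2", "table metadata 2");
        cache.get("db1.t1");                      // touch t1 so t2 becomes the eldest
        cache.put("db1.t3", "table metadata 3");  // evicts db1.t2
        System.out.println(cache.keySet());
    }
}
```

This version is not thread-safe; a cache shared across metastore threads would need external synchronization or a concurrent design.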


Thank You

Page 31: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

31 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Omid Operations

Open transaction
– Get transid from TSO

Read a cell
– Read all versions of the cell from HBase
– Read latest committed version before transaction start

Write a cell
– Write value versioned with transid to HBase

Commit
– Generate commitid from TSO
– TSO figures out if there is a conflict using transaction metadata
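The read path above can be sketched as a toy model: each cell version is tagged with the transid of its writer, and a reader picks the version whose commitid is the largest one below its own start. This is a sketch under assumptions, not Omid's client API. The class `SnapshotRead`, its `read` method, and the two plain maps standing in for HBase and for the transaction metadata are all hypothetical.

```java
import java.util.Map;

/** Toy model of a snapshot read; names are illustrative, not Omid's API. */
class SnapshotRead {
    /**
     * @param versions    cell versions keyed by the writer's transid
     * @param commitIds   transid -> commitid, for committed transactions only
     * @param readerStart transid of the reading transaction
     * @return the value with the largest commitid below readerStart,
     *         or null if no committed version is visible
     */
    static String read(Map<Long, String> versions,
                       Map<Long, Long> commitIds,
                       long readerStart) {
        String best = null;
        long bestCommit = Long.MIN_VALUE;
        for (Map.Entry<Long, String> v : versions.entrySet()) {
            Long commitId = commitIds.get(v.getKey());
            // Skip uncommitted versions and versions committed after the reader started.
            if (commitId == null || commitId >= readerStart) {
                continue;
            }
            if (commitId > bestCommit) {
                bestCommit = commitId;
                best = v.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<Long, String> versions = Map.of(4L, "written-by-T4", 10L, "written-by-T10");
        Map<Long, Long> commitIds = Map.of(10L, 20L, 4L, 25L);
        // A reader that started at 26 sees the version committed at 25.
        System.out.println(read(versions, commitIds, 26L));
    }
}
```

Note that visibility is decided by commitid, not by the transid stored with the cell, so a transaction that started early but committed late can still supply the newest visible version.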

Omid Data Structure

Memory management in TSO
– Never run OOM; abort old transactions

[Diagram: in-memory tables held by the TSO]
– lastCommit: (row1, T20), (row2, T25), (row5, T22)
– committed: (T10, T20), (T4, T25), (T11, T30), (T2, …), …
– aborted

• Detect transaction conflict at commit time

• Largest chunk of memory

• Construct snapshot at read time

• Partially replicated to client
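A minimal sketch of how tables like these can support commit-time conflict detection, assuming lastCommit maps a row to the commitid of its latest write and committed maps a transid to its commitid. That reading of the diagram is an assumption, and `ToyTso` with its methods is hypothetical; it is not Omid's TSO and leaves out memory management and client replication.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy timestamp oracle; a sketch of the idea, not Omid's TSO. */
class ToyTso {
    private long clock = 0;
    private final Map<String, Long> lastCommit = new HashMap<>(); // row -> latest commitid
    private final Map<Long, Long> committed = new HashMap<>();    // transid -> commitid
    private final Set<Long> aborted = new HashSet<>();

    /** Open a transaction: the next timestamp becomes its transid. */
    synchronized long begin() {
        return ++clock;
    }

    /**
     * Try to commit. Returns the commitid, or -1 if some written row was
     * committed by another transaction after this one started.
     */
    synchronized long commit(long transId, Set<String> writtenRows) {
        for (String row : writtenRows) {
            Long last = lastCommit.get(row);
            if (last != null && last > transId) {
                aborted.add(transId);
                return -1;
            }
        }
        long commitId = ++clock;
        for (String row : writtenRows) {
            lastCommit.put(row, commitId);
        }
        committed.put(transId, commitId);
        return commitId;
    }

    /** Commitid of a committed transaction, or null if it has not committed. */
    synchronized Long commitIdOf(long transId) {
        return committed.get(transId);
    }

    synchronized boolean isAborted(long transId) {
        return aborted.contains(transId);
    }
}
```

In this sketch lastCommit grows with the number of distinct rows written, which is consistent with the slide calling one table the largest chunk of memory.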

Transaction Conflict

Two concurrent DDL writes to the same data
– Proper retry logic
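The slide does not say what the retry logic looks like. One common shape, shown here only as a sketch, is to re-run the whole transaction body when the commit reports a conflict. `RetryOnConflict`, `ConflictException`, and the back-off values are assumptions for illustration and do not come from Hive or Omid.

```java
import java.util.function.Supplier;

/** Hypothetical retry helper; ConflictException stands in for a commit-time conflict. */
class RetryOnConflict {
    static class ConflictException extends RuntimeException {
        ConflictException(String msg) {
            super(msg);
        }
    }

    /** Re-run the transaction body until it commits or the attempts run out. */
    static <T> T run(Supplier<T> txnBody, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // The body opens a transaction, reads, writes, and commits.
                return txnBody.get();
            } catch (ConflictException e) {
                last = e;
                if (attempt < maxAttempts) {
                    pause(10L * attempt); // simple linear back-off before retrying
                }
            }
        }
        throw last;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", ie);
        }
    }
}
```

The body must start a fresh transaction on every attempt so that it reads a new snapshot; reusing the old transid would hit the same conflict again.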

Task node writes - ORC footer cache
– High chance for write conflict
– Row mutation is atomic in HBase
– Cross-row atomicity is not required
– Bypass transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
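To illustrate why the footer cache can skip the transaction layer, the sketch below writes each file's footer as an independent single-row put. It is an in-memory stand-in, assuming one row per file id: `FooterCacheSketch` and `getFileMetadata` are hypothetical, a `ConcurrentHashMap` plays the role of the HBase table, and the `FileMetadataExprType` parameter from the signature above is left out.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for the footer cache table; not the HBaseStore code. */
class FooterCacheSketch {
    // One map entry models one HBase row; each put replaces a row atomically.
    private final Map<Long, ByteBuffer> rows = new ConcurrentHashMap<>();

    /** Writes each footer as an independent single-row put, with no transaction. */
    public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata) {
        if (fileIds.size() != metadata.size()) {
            throw new IllegalArgumentException("fileIds and metadata must have the same length");
        }
        for (int i = 0; i < fileIds.size(); i++) {
            // No cross-row guarantee is needed: a later put for the same file
            // simply overwrites that row.
            rows.put(fileIds.get(i), metadata.get(i).asReadOnlyBuffer());
        }
    }

    /** Returns an independent view of the cached footer, or null if absent. */
    public ByteBuffer getFileMetadata(long fileId) {
        ByteBuffer cached = rows.get(fileId);
        return cached == null ? null : cached.duplicate();
    }
}
```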


Deployment

Server side components in HBase
– Server side filter
– Omid compactor
– Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
– hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore

[Diagram components: Hive MetaStore, TSO, HBase (Server Side Filter, Omid Compactor)]

Deploy Omid

Create Omid tables in HBase
– omid.sh create-hbase-commit-table
– omid.sh create-hbase-timestamp-table

Start Omid TSO
– omid.sh tso

Related config in hive-site.xml
– hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
– tso.host=localhost
– tso.port=54758
– omid.client.connectionType=DIRECT
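For readers unsure how these settings look inside hive-site.xml, the sketch below prints them as `<property>` entries. It assumes the keys on the slide read as dotted names (hive.metastore.hbase.connection.class, tso.host, tso.port, omid.client.connectionType) and that the class value is OmidHBaseConnection, joined across the slide's line break. `HiveSiteSnippet` is a hypothetical helper, not a Hive tool.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that renders settings in hive-site.xml form. */
class HiveSiteSnippet {
    /** Renders each entry as a property element; values are not XML-escaped. */
    static String render(Map<String, String> props) {
        StringBuilder sb = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append("  <property>\n")
              .append("    <name>").append(e.getKey()).append("</name>\n")
              .append("    <value>").append(e.getValue()).append("</value>\n")
              .append("  </property>\n");
        }
        return sb.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        // Keys and values as read from the slide, with the dots restored (assumption).
        Map<String, String> props = new LinkedHashMap<>();
        props.put("hive.metastore.hbase.connection.class",
                  "org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection");
        props.put("tso.host", "localhost");
        props.put("tso.port", "54758");
        props.put("omid.client.connectionType", "DIRECT");
        System.out.print(render(props));
    }
}
```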

Instantiate HBase Metastore

Instantiate HBase tables from scratch
– hive --service hbaseschematool --install

hbaseimport: import existing Hive Metastore
– One-way import from ObjectStore to HBaseStore
– hive --service hbaseimport


TPCDS queries

[Figure: "Query Plan Time for TPCDS queries". Bar chart comparing HBaseStore, HBaseStore+Omid, and ObjectStore on Query 7, 15, 27, 29, 39, 46, 56, 68, 70, and 76; vertical axis from 0 to 6000.]

1824 partitions

Sweet spot for ObjectStore

Average speedup for all TPCDS queries
– 2.19 (without Omid)
– 2.12 (with Omid)


Current Status

hbase-metastore branch merged to master last September

Turned off by default

Feature parity? Almost
– Minor holes: event notification/version/constraints
– Deprecate listTableNamesByFilter/listPartitionNamesByFilter
– Tools enhancement
– ACID is not supported

Run most e2e queries

Fixing unit tests
– TestMiniTezCliDriver: all pass
– TestCliDriver: HIVE-14097, pending review
– Not production quality yet

Future Work - ACID

Transaction metadata is stored in Metastore
– Locks
– Txns
– Compactions

Data structure is harder to de-normalize

New work: transaction server
– Keep lock and transaction tree in memory

Page 37: hive HBase Metastore - Improving Hive with a Big Data Metadata Storage

37 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Instantiate HBase Metastore

Instantiate Hbase Tables from scratchndash hive --service hbaseschematool --install

Hbaseimport import existing Hive Metastorendash One way import from ObjectStore to HBaseStorendash hive --service hbaseimport

38 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

39 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

TPCDS queries

Query 7 Query 15 Query 27 Query 29 Query 39 Query 46 Query 56 Query 68 Query 70 Query 760

1000

2000

3000

4000

5000

6000

Query Plan Time for TPCDS queries

HBaseStore HBaseStore+Omid ObjectStore

1824 partitions Sweetspot for ObjectStore Average Speed up for all TPCDS queries

ndash 219 (without Omid)ndash 212 (With Omid)

40 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

AgendaMotivation

System Design

Caching Strategy

Transaction Management

Deployment

Experimental Results

Future Work

41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Current Status

hbase-metastore branch merged to master last September Turn off by default Feature parity Almost

ndash Minor holes event notificationversionconstraintsndash Deprecate listTableNamesByFilterlistPartitionNamesByFilterndash Tools enhancementndash ACID is not supported

Run most e2e queries Fixing unit tests

ndash TestMiniTezCliDriver all passndash TestCliDriver HIVE-14097 pending reviewndash Not production quality yet

42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Future Work - ACID

Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions

Data structure is harder to de-normalize New work transaction server

ndash Keep lock and transaction tree in memory

43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved

Future Work - HA via HBase Coprocessor

Two new server components
- Omid TSO Server
- Transaction Server

All servers need HA
- Management headache

Automatic HA through HBase Coprocessor

[Diagram: two Region Servers, each hosting a TSO Server via coprocessor]

Future Work - Other

Stats aggregation
- Coprocessor

Improving ObjectCache
- Rudimentary implementation currently
- LRU

Omid consuming high CPU
- 300% CPU always, by design
- High throughput, avoid context switch
- Might be an issue for small system
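The LRU item for ObjectCache can be illustrated with a minimal least-recently-used cache. This sketch is not the HBase metastore's ObjectCache; the class name `LRUCache` and its `get`/`put` methods are assumptions chosen for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read counts as a use
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry

    def __contains__(self, key):
        return key in self._items

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    cache = LRUCache(2)
    cache.put("db1.t1", "table metadata 1")
    cache.put("db1.t2", "table metadata 2")
    cache.get("db1.t1")                       # t1 is now most recently used
    cache.put("db1.t3", "table metadata 3")   # evicts db1.t2
    print("db1.t2" in cache)  # False
```

Keying by a database-qualified table name is only an example; any hashable key works.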

Thank You
