HBase Read High Availability Using Timeline Consistent Region Replicas


Description

HBaseCon 2014 presentation.

Transcript of HBase Read High Availability Using Timeline Consistent Region Replicas

Page 1: HBase  Read High Availability Using Timeline Consistent Region Replicas

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

HBase Read High Availability Using Timeline-Consistent Region Replicas

Enis Soztutar ([email protected])

Devaraj Das ([email protected])

Page 2: HBase  Read High Availability Using Timeline Consistent Region Replicas


About Us

Enis Soztutar

Committer and PMC member in Apache HBase and Hadoop since 2007

HBase team @Hortonworks

Twitter @enissoz

Devaraj Das

Committer and PMC member in Hadoop since 2006

Committer at HBase

Co-founder @Hortonworks

Page 3: HBase  Read High Availability Using Timeline Consistent Region Replicas


Outline of the talk

PART I: Use case and semantics

CAP recap

Use case and motivation

Region replicas

Timeline consistency

Semantics

PART II: Implementation and next steps

Server side

Client side

Data replication

Next steps & Summary

Page 4: HBase  Read High Availability Using Timeline Consistent Region Replicas


Part I: Use case and semantics

Page 5: HBase  Read High Availability Using Timeline Consistent Region Replicas


CAP reCAP

[Diagram: the CAP triangle with Consistency, Availability, and Partition tolerance; "pick two". HBase is labeled CP.]

Page 6: HBase  Read High Availability Using Timeline Consistent Region Replicas



CAP reCAP

• In a distributed system you cannot NOT have P

• C vs A is about what happens if there is a network partition!

• A and C are NEVER binary values, always a range

• Different operations in the system can have different A / C choices

• HBase cannot be simplified as CP

[Diagram: the CAP triangle from the previous slide, again labeled "HBase is CP".]

Page 7: HBase  Read High Availability Using Timeline Consistent Region Replicas


HBase consistency model

For a single row, HBase is strongly consistent within a data center

Across rows HBase is not strongly consistent (but available!).

When a RS goes down, only the regions on that server become unavailable. Other regions are unaffected.

HBase multi-DC replication is “eventually consistent”

HBase applications should carefully design the schema for correct semantics / performance tradeoff

Page 8: HBase  Read High Availability Using Timeline Consistent Region Replicas


Use cases and motivation

More and more applications are looking for a “zero downtime” platform

Even 30 seconds of downtime (an aggressive MTTR) is too much

Certain classes of apps are willing to tolerate decreased consistency guarantees in favor of availability

Especially for READs

Some build wrappers around the native API to be able to handle failures of destination servers

Multi-DC: when one server is down in one DC, the client switches to a different one

Can we do something in HBase natively? Within the same cluster?

Page 9: HBase  Read High Availability Using Timeline Consistent Region Replicas


Use cases and motivation

Designing the application requires careful tradeoff consideration

In schema design, since single-row operations are strongly consistent but there are no multi-row transactions

Multi-datacenter replication (active-passive, active-active, backups etc)

It is good to give the application the flexibility to pick and choose

Higher availability vs stronger consistency

Read vs write: different consistency models for reads vs writes

Read-repair, latest-timestamp-wins vs linearizable updates

Page 10: HBase  Read High Availability Using Timeline Consistent Region Replicas


Initial goals

Support applications talking to a single cluster really well

No perceived downtime

Only for READs

If apps want to tolerate cluster failures

Use HBase replication

Combine that with wrappers in the application

Page 11: HBase  Read High Availability Using Timeline Consistent Region Replicas


Introducing….

Region Replicas in HBase

Timeline Consistency in HBase

Page 12: HBase  Read High Availability Using Timeline Consistent Region Replicas


Region replicas

For every region of the table, there can be more than one replica

Every region replica has an associated “replica_id”, starting from 0

Each region replica is hosted by a different region server

Tables can be configured with a REGION_REPLICATION parameter

Default is 1

No change in the current behavior

One replica per region is the “default” or “primary”

Only this replica can accept WRITEs

All reads from this region replica return the most recent data

Other replicas, also called “secondaries”, follow the primary

They see only committed updates
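
Region replication is a per-table setting. As an illustrative sketch (not from the slides), assuming an HBase 1.x-style client API and made-up table and column family names, a table with three replicas per region could be created like this:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateReplicatedTable {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      // "usertable" and "cf" are placeholder names for this example
      HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("usertable"));
      htd.addFamily(new HColumnDescriptor("cf"));
      // REGION_REPLICATION = 3: one primary (replica_id=0) plus two secondaries per region
      htd.setRegionReplication(3);
      admin.createTable(htd);
    }
  }
}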

Page 13: HBase  Read High Availability Using Timeline Consistent Region Replicas


Region replicas

Secondary region replicas are read-only

No writes are routed to secondary replicas

Data is replicated to secondary regions (more on this later)

They serve data from the same data files as the primary

They may not have received the most recent data

Reads and scans can be performed, returning possibly stale data

Region replica placement is done to maximize the availability of any particular region

Region replicas are not co-located on the same region servers

Or on the same racks (if possible)

Page 14: HBase  Read High Availability Using Timeline Consistent Region Replicas


[Diagram: a client reads and writes to a Region (rows of rowkey and column:value cells, buffered in a memstore) hosted on a RegionServer; the region's data files are stored as blocks (b1, b2, b9) replicated across DataNodes. A second RegionServer is shown alongside.]

Page 15: HBase  Read High Availability Using Timeline Consistent Region Replicas


[Diagram: the same layout, now with a read-only region replica (with its own memstore) hosted on the second RegionServer, serving reads from the same underlying data blocks.]

Page 16: HBase  Read High Availability Using Timeline Consistent Region Replicas


TIMELINE Consistency

Introduced a Consistency enum: STRONG and TIMELINE

Consistency.STRONG is the default

Consistency can be set per read operation (per-get or per-scan)

Timeline-consistent read RPCs are sent to more than one replica

The semantics are a bit different from the eventual consistency model

Page 17: HBase  Read High Availability Using Timeline Consistent Region Replicas


TIMELINE Consistency

public enum Consistency {

STRONG,

TIMELINE

}

Get get = new Get(row);

get.setConsistency(Consistency.TIMELINE);

...

Result result = table.get(get);

if (result.isStale()) {

...

}
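
The same flag applies to scans. Continuing the snippet above (same table instance; a sketch rather than slide content), each Result can be checked for staleness:

Scan scan = new Scan();
scan.setConsistency(Consistency.TIMELINE);
try (ResultScanner scanner = table.getScanner(scan)) {
  for (Result r : scanner) {
    if (r.isStale()) {
      // this row may have been served by a lagging secondary replica
    }
  }
}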

Page 18: HBase  Read High Availability Using Timeline Consistent Region Replicas


TIMELINE Consistency Semantics

Can be thought of as in-cluster active-passive replication

Single-homed and ordered updates

All writes are handled and ordered by the primary region

All writes have STRONG consistency

Secondaries apply the mutations in order

Only get/scan requests to secondaries

Get/Scan Result can be inspected to see whether the result was from possibly stale data

The client CAN observe edits out-of-order

Each RPC can be handled by a different replica

No stickiness to region replicas

Page 19: HBase  Read High Availability Using Timeline Consistent Region Replicas


TIMELINE Consistency Example

[Diagram: one region with three replicas, each with its own WAL and data view; replica_id=0 is the primary, replica_id=1 and replica_id=2 are secondaries fed by replication. Client1 writes X=1 to the primary.]

Page 20: HBase  Read High Availability Using Timeline Consistent Region Replicas


TIMELINE Consistency Example

[Diagram: the write X=1 has been replicated to both secondaries; reads from Client2 return X=1 from any of the three replicas.]

Page 21: HBase  Read High Availability Using Timeline Consistent Region Replicas


TIMELINE Consistency Example

[Diagram: Client1 writes X=2 to the primary. The update has reached replica_id=1 but not yet replica_id=2, which still holds X=1.]

Page 22: HBase  Read High Availability Using Timeline Consistent Region Replicas


TIMELINE Consistency Example

[Diagram: reads return X=2 from the primary and from replica_id=1, but a read served by replica_id=2 still returns the stale value X=1.]

Page 23: HBase  Read High Availability Using Timeline Consistent Region Replicas


TIMELINE Consistency Example

[Diagram: Client1 writes X=3 to the primary; replica_id=1 has caught up only to X=2 and replica_id=2 still holds X=1.]

Page 24: HBase  Read High Availability Using Timeline Consistent Region Replicas


TIMELINE Consistency Example

[Diagram: reads return X=3 from the primary, X=2 from replica_id=1, and X=1 from replica_id=2; with TIMELINE consistency, successive reads may observe different points on the timeline depending on which replica answers.]

Page 25: HBase  Read High Availability Using Timeline Consistent Region Replicas


Part II: Implementation and next steps

Page 26: HBase  Read High Availability Using Timeline Consistent Region Replicas


Region replicas – recap

Every region replica has an associated “replica_id”, starting from 0

Each region replica is hosted by a different region server

All replicas can serve READs

One replica per region is the “default” or “primary”

Only this replica can accept WRITEs

All reads from this region replica return the most recent data

Page 27: HBase  Read High Availability Using Timeline Consistent Region Replicas


Updates in the Master

Replica creation

Created during table creation

No distinction between primary & secondary replicas

The meta table contains all the information in one row

Load balancer improvements

The LB is made aware of replicas

It makes a best effort to place replicas on machines/racks so as to maximize availability

Alter table support

For adjusting the number of replicas
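
As an illustration (a sketch, not the slides' code; assumes the 1.x-style Admin API, an admin instance as in the earlier sketch, and a placeholder table name), the replica count can be adjusted by modifying the table descriptor:

TableName tn = TableName.valueOf("usertable");
HTableDescriptor htd = admin.getTableDescriptor(tn);
// change REGION_REPLICATION, e.g. from 3 down to 2
htd.setRegionReplication(2);
admin.disableTable(tn);
admin.modifyTable(tn, htd);
admin.enableTable(tn);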

Page 28: HBase  Read High Availability Using Timeline Consistent Region Replicas


Updates in the RegionServer

Treats non-default replicas as read-only

Store file management

Keeps itself up to date with store file creations/deletions

Page 29: HBase  Read High Availability Using Timeline Consistent Region Replicas


IPC layer high level flow

The client sends the READ to the primary and waits for the response. If the response arrives within the timeout (10 millis), it is returned. If not, the client sends the READ to all secondaries and polls for responses, takes the first successful response, and cancels the others.

Similar flow for GET/Batch-GET/Scan, except that Scan is sticky to the server it sees success from.
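
This is not the actual HBase client code, but the backup-request pattern above can be sketched with plain java.util.concurrent primitives (the class, method, and parameter names here are made up):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ReplicaReadSketch {
  // Send the read to the primary; if it does not answer within primaryTimeoutMs,
  // fan the read out to all secondaries and return whichever response completes first.
  public static <T> T timelineRead(Callable<T> primary, List<Callable<T>> secondaries,
                                   long primaryTimeoutMs, ExecutorService pool) throws Exception {
    CompletionService<T> completion = new ExecutorCompletionService<>(pool);
    List<Future<T>> inFlight = new ArrayList<>();
    inFlight.add(completion.submit(primary));                  // 1. send READ to the primary
    Future<T> winner = completion.poll(primaryTimeoutMs, TimeUnit.MILLISECONDS);
    if (winner == null) {                                      // 2. no response within timeout:
      for (Callable<T> secondary : secondaries) {              //    send READ to all secondaries
        inFlight.add(completion.submit(secondary));
      }
      winner = completion.take();                              // 3. take the first completed response
    }
    for (Future<T> other : inFlight) {                         // 4. cancel the others
      if (other != winner) {
        other.cancel(true);
      }
    }
    return winner.get();  // the real client also handles retries when a response is a failure
  }
}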

Page 30: HBase  Read High Availability Using Timeline Consistent Region Replicas


Performance and Testing

No significant performance issues discovered

Added interrupt handling in the RPCs to cancel unneeded replica RPCs

Deeper level of performance testing work is still in progress

Tested via IT tests that fail if a response is not received within a certain time

Page 31: HBase  Read High Availability Using Timeline Consistent Region Replicas


Next steps

What has been described so far is in “Phase-1” of the project

Phase-2:

WAL replication

Handling of Merges and Splits

Latency guarantees

– Cancellation of RPCs server side

– Promotion of one Secondary to Primary, and recruiting a new Secondary

Use the infrastructure to implement consensus protocols for read/write within a single datacenter

Page 32: HBase  Read High Availability Using Timeline Consistent Region Replicas


Data Replication

Data should be replicated from primary regions to secondary regions

A region's data = data files on HDFS + in-memory data in memstores

Data files MUST be shared. We do not want to store multiple copies

Do not cause more writes than necessary

Two solutions:

Region snapshots: share only data files

Async WAL replication: share data files; every region replica has its own in-memory data

Page 33: HBase  Read High Availability Using Timeline Consistent Region Replicas


Data Replication – Region Snapshots

Primary region works as usual

Buffer up mutations in memstore

Flush to disk when full

Compact files when needed

Deleted files are kept in archive directory for some time

Secondary regions periodically look for new files in the primary region

When a new flushed file is seen, just open it and start serving data from there

When a compaction is seen, open the new file and close the files that are gone

Good for read-only, bulk-loaded, or less frequently updated data

Implemented in phase 1

Page 34: HBase  Read High Availability Using Timeline Consistent Region Replicas


Data Replication - Async WAL Replication

Being implemented in Phase 2

Uses a replication source to tail the WAL files from the RS

Plugs in a custom replication sink to replay the edits on the secondaries

Flush and compaction events are written to the WAL. Secondaries pick up new files when they see the entry

On open, a secondary region will:

Open the region files of the primary region

Set up a replication queue based on the last seen seqId

Accumulate edits in memstore (memory management issues in the next slide)

Mimic flushes and compactions from primary region

Page 35: HBase  Read High Availability Using Timeline Consistent Region Replicas


Memory management & flushes

Memory snapshot-based approach

The secondaries look for the Start-Flush and Commit-Flush WAL-edit entries

They mimic what the primary does in terms of taking snapshots

– When a flush is successful, the snapshot is let go

If the RegionServer hosting secondary is under memory pressure

– Make some other primary region flush

Flush-based approach

Treat the secondary regions as regular regions

Allow them to flush as usual

Flush to the local disk, and clean them up periodically or on certain events

– Treat them as a normal store file for serving reads

Page 36: HBase  Read High Availability Using Timeline Consistent Region Replicas


Summary

Pros:

High availability for read-only tables

High availability for stale reads

Very low latency for the above

Cons:

Increased memory from the memstores of the secondaries

Increased blockcache usage

Extra network traffic for the replica calls

Increased number of regions to manage in the cluster

Page 37: HBase  Read High Availability Using Timeline Consistent Region Replicas


References

Apache branch hbase-10070 (https://github.com/apache/hbase/tree/hbase-10070)

HDP-2.1 comes with experimental support for Phase-1

More on the use cases for this work is in Sudarshan's (Bloomberg) talk in the “Case Studies” track, titled “HBase at Bloomberg: High Availability Needs for the Financial Industry”

Page 38: HBase  Read High Availability Using Timeline Consistent Region Replicas


Thanks

Q & A