HBase Read High Availability Using Timeline-Consistent Region Replicas
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
HBase Read High Availability Using Timeline-Consistent Region Replicas
Enis Soztutar ([email protected])
Devaraj Das ([email protected])
Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
About Us
Enis Soztutar
Committer and PMC member in Apache HBase and Hadoop since 2007
HBase team @Hortonworks
Twitter @enissoz
Devaraj Das
Committer and PMC member in Hadoop since 2006
Committer at HBase
Co-founder @Hortonworks
Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Outline of the talk
PART I: Use case and semantics
CAP recap
Use case and motivation
Region replicas
Timeline consistency
Semantics
PART II: Implementation and next steps
Server side
Client side
Data replication
Next steps & Summary
Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Part I: Use case and semantics
Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
CAP reCAP
[Diagram: the CAP triangle (Consistency, Availability, Partition tolerance), "pick two", with HBase placed in the CP corner]
Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
CAP reCAP
• In a distributed system you cannot NOT have P
• C vs A is about what happens if there is a network partition!
• A and C are NEVER binary values; they are always a range
• Different operations in the system can have different A / C choices
• HBase cannot simply be reduced to CP
[Diagram: the CAP triangle again, with HBase labeled CP]
Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
HBase consistency model
For a single row, HBase is strongly consistent within a data center
Across rows HBase is not strongly consistent (but available!).
When a RS goes down, only the regions on that server become unavailable. Other regions are unaffected.
HBase multi-DC replication is "eventually consistent"
HBase applications should design their schema carefully to get the right semantics / performance tradeoff
Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Use cases and motivation
More and more applications are looking for a "zero downtime" platform
Even 30 seconds of downtime (an aggressive MTTR target) is too much
Certain classes of apps are willing to trade weaker consistency guarantees for availability
Especially for READs
Some build wrappers around the native API to handle failures of destination servers
Multi-DC: when one server is down in one DC, the client switches to a different DC
Can we do something in HBase natively? Within the same cluster?
Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Use cases and motivation
Designing the application requires careful tradeoff consideration
In schema design, since single rows are strongly consistent but there are no multi-row transactions
Multi-datacenter replication (active-passive, active-active, backups etc)
It is good to give the application the flexibility to pick and choose
Higher availability vs stronger consistency
Read vs Write
Different consistency models for reads vs writes
Read-repair, latest ts-wins vs linearizable updates
Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Initial goals
Support applications talking to a single cluster really well
No perceived downtime
Only for READs
If apps want to tolerate cluster failures
Use HBase replication
Combine that with wrappers in the application
Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Introducing….
Region Replicas in HBase
Timeline Consistency in HBase
Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Region replicas
For every region of the table, there can be more than one replica
Every region replica has an associated "replica_id", starting from 0
Each region replica is hosted by a different region server
Tables can be configured with a REGION_REPLICATION parameter
Default is 1
No change in the current behavior
One replica per region is the "default" or "primary"
Only this replica can accept WRITEs
All reads from this region replica return the most recent data
Other replicas, also called "secondaries", follow the primary
They see only committed updates
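As a sketch of how this could look with the Java client of that era (assuming HBaseAdmin and HTableDescriptor.setRegionReplication; the table and column family names here are made up):
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("t1"));
htd.addFamily(new HColumnDescriptor("cf1"));
htd.setRegionReplication(3); // one primary + two secondaries per region
admin.createTable(htd);
admin.close();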
Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Region replicas
Secondary region replicas are read-only
No writes are routed to secondary replicas
Data is replicated to secondary regions (more on this later)
They serve data from the same data files as the primary
They may not have received the most recent data
Reads and Scans can be performed, returning possibly stale data
Region replica placement is done to maximize the availability of any particular region
Region replicas are not co-located on the same region servers
And, if possible, not on the same racks
Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
[Diagram: a client reads and writes to a Region (memstore plus store files) hosted on a RegionServer; the underlying HDFS blocks (b1, b2, b9) are replicated across DataNodes]
Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
[Diagram: the same setup, plus a read-only region replica with its own memstore on a second RegionServer; the client still writes through the primary, while the replica serves reads from the same store files]
Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
TIMELINE Consistency
Introduced a Consistency enum with two values: STRONG and TIMELINE
Consistency.STRONG is default
Consistency can be set per read operation (per-get or per-scan)
Timeline-consistent read RPCs sent to more than one replica
The semantics are a bit different from the Eventual Consistency model
Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
TIMELINE Consistency
public enum Consistency {
STRONG,
TIMELINE
}
Get get = new Get(row);
get.setConsistency(Consistency.TIMELINE);
...
Result result = table.get(get);
…
if (result.isStale()) {
...
}
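Scans accept the same flag; a minimal companion sketch (assuming a Table handle named table as above, with staleness checked per row):
Scan scan = new Scan();
scan.setConsistency(Consistency.TIMELINE);
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
  if (r.isStale()) {
    // this row may have been served by a secondary replica
  }
}
scanner.close();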
Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
TIMELINE Consistency Semantics
Can be thought of as in-cluster active-passive replication
Single-homed and ordered updates
All writes are handled and ordered by the primary region
All writes are done with STRONG consistency
Secondaries apply the mutations in order
Only get/scan requests to secondaries
Get/Scan Result can be inspected to see whether the result was from possibly stale data
The client CAN observe edits out of order
Each RPC can be handled by a different replica
No stickiness to region replicas
Page 19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
TIMELINE Consistency Example
[Diagram: Client1 writes X=1 to the primary (replica_id=0); the edit goes to the primary's WAL and data, and is replicated to replica_id=1 and replica_id=2]
Page 20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
TIMELINE Consistency Example
[Diagram: X=1 has reached all three replicas; reads from Client1 and Client2 return X=1 regardless of which replica answers]
Page 21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
TIMELINE Consistency Example
[Diagram: Client1 writes X=2 to the primary; replica_id=1 has applied X=2 via replication, while replica_id=2 still holds X=1]
Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
TIMELINE Consistency Example
[Diagram: reads now return X=2 from the primary and from replica_id=1, but a read served by replica_id=2 still returns the stale value X=1]
Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
TIMELINE Consistency Example
[Diagram: Client1 writes X=3 to the primary; replica_id=1 is still at X=2 and replica_id=2 at X=1]
Page 24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
TIMELINE Consistency Example
[Diagram: depending on which replica serves each RPC, reads can return X=3, X=2 or X=1; with no stickiness to a replica, a client may observe these values out of order]
Page 25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Part II: Implementation and next steps
Page 26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Region replicas – recap
Every region replica has an associated “replica_id”, starting from 0
Each region replica is hosted by a different region server
All replicas can serve READs
One replica per region is the "default" or "primary"
Only this replica can accept WRITEs
All reads from this region replica return the most recent data
Page 27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Updates in the Master
Replica creation
Replicas are created during table creation
No distinction between primary & secondary replicas
The meta table contains all the information in one row
Load balancer improvements
The LB is made aware of replicas
Makes a best effort to place replicas on machines/racks to maximize availability
Alter table support
For adjusting the number of replicas
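A rough sketch of what adjusting the replica count could look like through the Java admin API of that time (hedged: whether the table must be disabled first depends on the version; admin reuses the handle from the earlier sketch):
TableName tn = TableName.valueOf("t1");
HTableDescriptor htd = admin.getTableDescriptor(tn);
admin.disableTable(tn);
htd.setRegionReplication(3); // e.g. go from 1 to 3 replicas per region
admin.modifyTable(tn, htd);
admin.enableTable(tn);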
Page 28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Updates in the RegionServer
Treats non-default replicas as read-only
Store file management
Keeps itself up to date with store file creations and deletions
Page 29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
IPC layer high level flow
The client sends the READ to the primary
If a response arrives within the timeout (10 millis), it is returned
Otherwise the READ is also sent to all secondaries
The client waits for responses, takes the first successful one, and cancels the others
Similar flow for GET/Batch-GET/Scan, except that Scan is sticky to the server it sees success from.
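As an illustration only, this fan-out could be sketched with plain java.util.concurrent primitives; fetchFromReplica, get and numReplicas are hypothetical placeholders, and the real client uses its own RPC machinery and interrupt handling:
ExecutorService pool = Executors.newCachedThreadPool();
ExecutorCompletionService<Result> ecs = new ExecutorCompletionService<>(pool);
ecs.submit(() -> fetchFromReplica(0, get));             // try the primary first
Future<Result> done = ecs.poll(10, TimeUnit.MILLISECONDS);
if (done == null) {                                      // primary was slow: fan out
  for (int id = 1; id < numReplicas; id++) {
    final int replicaId = id;
    ecs.submit(() -> fetchFromReplica(replicaId, get));  // read from secondaries too
  }
  done = ecs.take();                                     // first completed response wins
}
Result result = done.get();
pool.shutdownNow();                                      // cancel the remaining replica calls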
Page 30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Performance and Testing
No significant performance issues discovered
Added interrupt handling in the RPCs to cancel unneeded replica RPCs
Deeper performance testing work is still in progress
Tested via IT tests, which fail if a response is not received within a certain time
Page 31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Next steps
What has been described so far is in “Phase-1” of the project
Phase 2
WAL replication
Handling of Merges and Splits
Latency guarantees
– Cancellation of RPCs server side
– Promotion of one Secondary to Primary, and recruiting a new Secondary
Use the infrastructure to implement consensus protocols for read/write within a single datacenter
Page 32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Replication
Data should be replicated from primary regions to secondary regions
A region's data = data files on HDFS + in-memory data in memstores
Data files MUST be shared. We do not want to store multiple copies
Do not cause more writes than necessary
Two solutions:
Region snapshots: share only the data files
Async WAL replication: share the data files; every region replica has its own in-memory data
Page 33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Replication – Region Snapshots
Primary region works as usual
Buffer up mutations in memstore
Flush to disk when full
Compact files when needed
Deleted files are kept in archive directory for some time
Secondary regions periodically look for new files in the primary region
When a new flushed file is seen, just open it and start serving data from it
When a compaction is seen, open the new file and close the files that are gone
Good for read-only, bulk-loaded, or less frequently updated data
Implemented in phase 1
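For phase 1, the refresh interval is controlled by a region server property; a hedged sketch, assuming the property name documented for region replicas (hbase.regionserver.storefile.refresh.period, in milliseconds; verify against your HBase version). It normally lives in hbase-site.xml on the region servers, shown here programmatically for brevity:
Configuration conf = HBaseConfiguration.create();
// how often secondaries check for new flushed/compacted files from the primary
conf.setInt("hbase.regionserver.storefile.refresh.period", 30000); // 30 seconds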
Page 34 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Replication - Async WAL Replication
Being implemented in Phase 2
Uses the replication source to tail the WAL files from the RS
Plugs in a custom replication sink to replay the edits on the secondaries
Flush and compaction events are written to the WAL; secondaries pick up new files when they see the entry
On open, a secondary region will:
Open the region files of the primary region
Setup a replication queue based on last seen seqId
Accumulate edits in memstore (memory management issues in the next slide)
Mimic flushes and compactions from primary region
Page 35 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Memory management & flushes
Memory-snapshot-based approach
The secondaries look for the WAL-edit entries Start-Flush and Commit-Flush
They mimic what the primary does in terms of taking snapshots
– When a flush is successful, the snapshot is let go
If the RegionServer hosting the secondary is under memory pressure
– Make some other primary region flush
Flush-based approach
Treat the secondary regions as regular regions
Allow them to flush as usual
Flush to the local disk, and clean them up periodically or on certain events
– Treat them as a normal store file for serving reads
Page 36 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Summary
Pros
High availability for read-only tables
High availability for stale reads
Very low latency for the above
Cons
Increased memory usage from the secondaries' memstores
Increased blockcache usage
Extra network traffic for the replica calls
Increased number of regions to manage in the cluster
Page 37 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
References
Apache branch hbase-10070 (https://github.com/apache/hbase/tree/hbase-10070)
HDP-2.1 comes with experimental support for Phase-1
More on the use cases for this work is in Sudarshan's (Bloomberg) talk in the "Case Studies" track, titled "HBase at Bloomberg: High Availability Needs for the Financial Industry"
Page 38 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Thanks
Q & A