GC settings for HBase Obscure HBase features HBase ...files.meetup.com/1350427/HUG presentation -...
● HBase performance improvements
● GC settings for HBase
● Obscure HBase features
About Me
● Lars Hofhansl
● Architect at Salesforce
● HBase Committer, HBase PMC member
● HBase blog: http://hadoop-hbase.blogspot.com
● Generalist:
○ have automated oil refineries using Fortran
○ real-time communications in C/C++
○ hacked databases and app servers
○ implemented Javascript UI frameworks
○ mostly a backend plumber, recently mostly Java
○ Aikido black belt
Part I - Performance
(some of the following slides are lifted/adapted from a talk my coworker Jamie Martin gave some time back)
Modern HW architecture
Jeff Dean's "Numbers Everyone Should Know"+ some of my own
● CPU ~3GHz (.33ns cycle time)
● L1 cache reference: .5ns
● L2 cache reference: 7ns
● Main memory reference: 100ns
● SSD seek: 100,000ns
● Read 1MB sequentially from memory: 250,000ns
● Round trip within same datacenter: 500,000ns
● Disk seek: 10,000,000ns
● Read 1MB sequentially from network: 10,000,000ns
● Read 1MB sequentially from disk: 20,000,000ns
● Read 1TB sequentially from disk: hours
● Read 1TB randomly from disk: 10s of days
CPU is idle for ~300 cycles during a memory fetch!
In the time of 1 disk seek, 1MB can be read sequentially over the net.
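Both claims fall directly out of the numbers on the previous slide. A quick arithmetic check (a sketch; the constants are the slide's quoted latencies, not measurements):

```java
public class LatencyMath {
    public static void main(String[] args) {
        double cycleNs = 1.0 / 3.0;      // ~3GHz CPU -> ~0.33ns per cycle
        double memRefNs = 100;           // main memory reference
        long diskSeekNs = 10_000_000;    // one disk seek
        long netReadMbNs = 10_000_000;   // read 1MB sequentially from network

        // Cycles the CPU could have executed while waiting on one memory fetch
        long idleCycles = Math.round(memRefNs / cycleNs);
        System.out.println("idle cycles per memory fetch: " + idleCycles); // ~300

        // In the time of one disk seek, 1MB can be read over the network
        System.out.println("1MB network reads per disk seek: "
            + (diskSeekNs / netReadMbNs));
    }
}
```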
RAM is the new disk, disk is the new tape
(tape is the new horse-carriage delivery)
SSDs change the equation
● IO-bound tasks are now CPU-bound
● Random IO almost as good as sequential IO
● Wear-out is a real concern
Trends:
● CPU speed improves ~60% per year, DRAM speed only ~10%
● The gap grows ~50% year over year
● Too much heat beyond 4GHz -> multiple cores, h/w threads
Performance Gap: Memory and disks are getting further and further away from the CPU...
Each CPU core has its own L1/L2 cache.
● cores can see changes out of order
● reordering is limited via memory barriers
Jim Gray: a 100x faster cache at a 90% hit rate yields an effective performance improvement of only 10x
Must design with cache coherency/locality in mind:
● avoid memory barriers
● avoid unnecessary copies in RAM
(remember RAM is a slow, shared resource now, that does not scale with the cores)
A memory barrier (or fence) ensures that all prior reads or writes are seen in the same order by all cores. (details are tricky, and I am not a hardware guy)
Memory barriers in Java:
● synchronized - sets read and write barriers as needed (details depend on JVM, version, and settings)
● volatile - sets a read barrier before a read of a volatile, and a write barrier after a write
● final - sets a write barrier after the assignment
● AtomicInteger, AtomicLong, etc. - use volatiles and hardware CAS instructions
● Explicit Locks - same as synchronized in terms of barriers
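A small illustration of two of these (a sketch, not HBase code): the volatile store barrier publishes everything written before it, and AtomicLong increments via a hardware CAS loop.

```java
import java.util.concurrent.atomic.AtomicLong;

public class BarrierDemo {
    // The volatile write to 'ready' inserts a store barrier: a reader that
    // sees ready == true also sees every plain write made before it.
    static volatile boolean ready = false;
    static int payload = 0;

    // incrementAndGet() uses a hardware CAS instruction under the hood
    static final AtomicLong counter = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            payload = 42;   // plain write...
            ready = true;   // ...published by the volatile store barrier
        });
        writer.start();
        writer.join();
        if (ready) {
            System.out.println("payload = " + payload); // guaranteed 42
        }
        counter.incrementAndGet();
        System.out.println("counter = " + counter.get());
    }
}
```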
TL;DR: Avoid these in hot code paths
But what about HBase?
HBASE-6603 - RegionMetricsStorage.incrNumericMetric is called too often
HBASE-6621 - Reduce calls to Bytes.toInt
HBASE-6711 - Avoid local results copy in StoreScanner
HBASE-7180 - RegionScannerImpl.next() is inefficient
HBASE-7279 - Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
HBASE-7336 - HFileBlock.readAtOffset does not work well with multiple threads
HBASE-9915 - Performance: isSeeked() in EncodedScannerV2 always returns false
HBASE-9807 - block encoder unnecessarily copies the key for each reseek
HBASE-10015 - Replace intrinsic locking with explicit locks in StoreScanner
HBASE-10117 - Avoid synchronization in HRegionScannerImpl.isFilterDone
HBASE-6852 - SchemaMetrics.updateOnCacheHit costs too much while full scanning a table with all of its fields (Cheng Hao)
2x improvement for cached scans
+ 2-3x improvement for block encoders
Garbage Collection
“Weak generational hypothesis”
HotSpot manages four generations:
1. Eden for all new objects
2. Survivor I and II, where surviving objects are promoted when eden is collected
3. Tenured space: objects surviving a few rounds (16 by default) of eden/survivor collection are promoted into the tenured space
4. Perm gen for classes, interned strings, and other more or less permanent objects
The principal costs to the garbage collector:
1. tracing all objects from the "root" objects, and
2. collecting all unreachable objects.
#1 is expensive when many objects need to be traced; #2 is expensive when objects have to be moved (for example to reduce memory fragmentation).
Block cache and Memstore live long.
Allocated in chunks: 64k blocks and 2MB slabs.
Other objects don’t outlive a single RPC
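The chunked allocation works roughly like HBase's MSLAB (MemStore-Local Allocation Buffers): cells are carved out of large fixed-size slabs so that the cells of one slab die together and the tenured space fragments less. A simplified sketch (not the actual HBase implementation; the class and method names here are made up):

```java
// Simplified MSLAB-style allocator sketch (illustrative, not HBase's code).
// Handing out slices of a big chunk means the old generation sees a few
// large, similarly-sized objects instead of millions of tiny ones.
public class SlabAllocator {
    static final int CHUNK_SIZE = 2 * 1024 * 1024; // 2MB slabs, as on the slide

    private byte[] currentChunk = new byte[CHUNK_SIZE];
    private int offset = 0;
    private int chunksAllocated = 1;

    /** Reserves 'size' bytes and returns their offset in the current chunk. */
    public int allocate(int size) {
        if (size > CHUNK_SIZE) throw new IllegalArgumentException("too large");
        if (offset + size > CHUNK_SIZE) {   // chunk full: start a new slab
            currentChunk = new byte[CHUNK_SIZE];
            offset = 0;
            chunksAllocated++;
        }
        int start = offset;
        offset += size;
        return start;
    }

    public int chunksAllocated() { return chunksAllocated; }
}
```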
-Xmn256m - small eden (maybe -Xmn1024m, but not more than that)
-XX:+UseParNewGC - collect eden in parallel
-XX:+UseConcMarkSweepGC - use the non-moving CMS collector
-XX:CMSInitiatingOccupancyFraction=70 - start collecting when the tenured gen is 70% full, to avoid collection under pressure
-XX:+UseCMSInitiatingOccupancyOnly - do not try to adjust the CMS settings
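Put together, this might look as follows in hbase-env.sh (a sketch; the -Xmx heap size is site-specific and only an assumption here):

```
# hbase-env.sh -- illustrative only; tune -Xmx/-Xmn for your hardware
export HBASE_REGIONSERVER_OPTS="-Xmx32g -Xmn256m \
  -XX:+UseParNewGC \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly"
```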
TL;DR:
Observations with typical workloads:
Average GC times ~50ms
99'ile GC times ~150ms
worst case <1500ms (very rare)
all on contemporary hardware with heaps of 32GB
What about G1?
Dunno
Nasty single-threaded full-GC fallback in 1.7
Might be OK in 1.8
Haven’t tried
Part II - Lesser known HBase features
(Or is it just a crayon up the nose?)
1. KEEP_DELETED_CELLS
2. MIN_VERSIONS
3. Atomic Row Mutations
4. MultiRowMutationEndpoint
5. BulkDeleteEndpoint
6. Filters and essential column families
KEEP_DELETED_CELLS
● HBase allows flashback queries
○ ability to get the state of your data as of a time
○ Scan.setTimeRange, Scan.setTimeStamp
● This does not work with deletes
○ Put(X) at TS
○ Put(X') at TS+1
○ Scan at TS: X
○ Delete(X) at TS+2
○ Scan at TS or TS+1: nil
● KEEP_DELETED_CELLS keeps deleted cells until they naturally expire (TTL or number of versions)
○ Scan at TS: X
○ Scan at TS+1: X'
● Delete markers are collected after two major compactions
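KEEP_DELETED_CELLS is a per-column-family schema attribute. A sketch in the HBase shell (table, family, and row names are made up):

```
hbase> create 't1', {NAME => 'f1', KEEP_DELETED_CELLS => true, VERSIONS => 5}
# flashback query: the state of the row as of a given timestamp,
# even if the cell was deleted later
hbase> get 't1', 'row1', {COLUMN => 'f1', TIMESTAMP => 1234567890000}
```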
MIN_VERSIONS
● HBase supports TTL (compliments of Andy Purtell)
○ data in a column family is assigned a maximum lifetime, after which it is unconditionally deleted
● What if you want to expire your data after a while, but keep at least the last (N) version(s)?
○ for example: keep 24h of data, but retain at least one version if no change was made in 24h
● Enter MIN_VERSIONS
○ works together with TTL
○ defines the number of versions to retain even when their TTL has expired
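The 24h example above as a shell sketch (table and family names are made up; TTL is in seconds):

```
# expire data after 24h (86400s), but always retain at least one version
hbase> create 't1', {NAME => 'f1', TTL => 86400, MIN_VERSIONS => 1}
```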
Atomic Row Mutations
● Atomically apply multiple operations to a row:
HTable.mutateRow(RowMutations)
● RowMutations implements the Row interface:
List<Row> rows = ...;
rows.add(new RowMutations(row1));
rows.add(new RowMutations(row2));
t.batch(rows);
Multi Row Transactions
● HBase supports multi-row transactions if the rows are colocated in the same region
● Colocation is controlled by a SplitPolicy
● Transactions are implemented in MultiRowMutationEndpoint
Filters and essential column families
● Filters in HBase are powerful
○ skip scans via hinting
● But the entire row needs to be loaded to evaluate the filter
● New feature introduced in 0.94.5: FilterBase.isFamilyEssential(byte[] cfname)
○ the filter is evaluated against only the essential column families first
○ in a 2nd pass the remaining column families are loaded as needed (i.e. only if the filter included the row)
● Currently implemented by SingleColumnValueFilter
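A sketch of how this looks from the client (illustrative only; it needs a running cluster and the HBase client library, and the table with a small "meta" family and a large "data" family is an assumption):

```java
// Sketch only -- assumes a table with a small 'meta' family and a large
// 'data' family. SingleColumnValueFilter marks the family it filters on
// as essential, so only 'meta' is read to evaluate the filter.
Scan scan = new Scan();
SingleColumnValueFilter f = new SingleColumnValueFilter(
    Bytes.toBytes("meta"),               // small, "essential" family
    Bytes.toBytes("status"),
    CompareFilter.CompareOp.EQUAL,
    Bytes.toBytes("ACTIVE"));
f.setFilterIfMissing(true);              // skip rows without the column
scan.setFilter(f);
// The large 'data' family is loaded in a second pass, only for rows
// the filter actually included.
```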
Bulk Deletes
● Problem: Delete all column families, or columns, or column versions that are matched by a Scan.
● The BulkDeleteEndpoint coprocessor shipped with HBase addresses this:
BulkDeleteResponse delete(Scan scan, byte deleteType, Long timestamp, int rowBatchSize)