HBase User Group #9: HBase and HDFS

Transcript of HBase User Group #9: HBase and HDFS

Page 1: HBase User Group #9: HBase and HDFS

HBase and HDFS

Todd Lipcon
[email protected]
Twitter: @tlipcon

#hbase IRC: tlipcon

March 10, 2010

Page 2: HBase User Group #9: HBase and HDFS

Outline

HDFS Overview

HDFS meets HBase

Solving the HDFS-HBase problems:
- Small Random Reads
- Single-Client Fault Tolerance
- Durable Record Appends

Summary

Page 3: HBase User Group #9: HBase and HDFS

HDFS Overview: What is HDFS?

- Hadoop’s Distributed File System
- Modeled after Google’s GFS
- Scalable, reliable data storage
- All persistent HBase storage is on HDFS
- HDFS reliability and performance are key to HBase reliability and performance

Page 4: HBase User Group #9: HBase and HDFS

HDFS Architecture

Page 5: HBase User Group #9: HBase and HDFS

HDFS Design Goals

- Store large amounts of data
- Data should be reliable
- Storage and performance should scale with the number of nodes
- Primary use: bulk processing with MapReduce

Page 6: HBase User Group #9: HBase and HDFS

Requirements for MapReduce

- MR Task Outputs: large streaming writes of entire files
- MR Task Inputs: medium-size partial reads
- Each task usually has 1 reader, 1 writer; 8-16 tasks per node
- DataNodes usually service few concurrent clients
- MapReduce can restart tasks with ease (they are idempotent)

Page 7: HBase User Group #9: HBase and HDFS

Requirements for HBase

All of the requirements of MapReduce, plus:

- Constantly append small records to an edit log (WAL)
- Small-size random reads
- Many concurrent readers
- Clients cannot restart → single-client fault tolerance is necessary

Page 8: HBase User Group #9: HBase and HDFS

HDFS Requirements Matrix

Requirement                      MR    HBase
Scalable storage                 X     X
System fault tolerance           X     X
Large streaming writes           X     X
Large streaming reads            X     X
Small random reads               -     X
Single client fault tolerance    -     X
Durable record appends           -     X

Page 9: HBase User Group #9: HBase and HDFS

HDFS Requirements Matrix

Requirement                      MR    HBase
Scalable storage                 X     X ☺
System fault tolerance           X     X ☺
Large streaming writes           X     X ☺
Large streaming reads            X     X ☺
Small random reads               -     X ☹
Single client fault tolerance    -     X ☹
Durable record appends           -     X ☹

(☺ = HDFS already handles this well; ☹ = problem area covered in this talk)

Page 10: HBase User Group #9: HBase and HDFS

Solutions

...turn that frown upside-down

From easy to hard:
- Configuration Tuning
- HBase-side workarounds
- HDFS Development/Patching

Page 11: HBase User Group #9: HBase and HDFS

Small Random Reads: Configuration Tuning

- HBase often has more concurrent clients than MapReduce.
- Typical problems:

      xceiverCount 257 exceeds the limit of concurrent xcievers 256

  Increase dfs.datanode.max.xcievers → 1024 (or greater)

      Too many open files

  Edit /etc/security/limits.conf to increase nofile → 32768

  (both fixes are sketched below)
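A minimal sketch of those two changes for a 0.20-era deployment. The values follow the slide; the hdfs and hbase user names in limits.conf are assumptions about which accounts run the DataNode and RegionServer. Note the property name really is spelled "xcievers" in this version of Hadoop:

    <!-- hdfs-site.xml on each DataNode: raise the transceiver cap -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>1024</value>
    </property>

    # /etc/security/limits.conf: raise the open-file limit for the
    # accounts running the DataNode and RegionServer (names assumed)
    hdfs   -   nofile   32768
    hbase  -   nofile   32768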

Page 12: HBase User Group #9: HBase and HDFS

Small Random Reads: HBase Features

- HBase block cache: avoids the need to hit HDFS for many reads
- Finer-grained synchronization in HFile reads (HBASE-2180): allows parallel clients to read data in parallel for higher throughput
- Seek-and-read vs pread API (HBASE-1505): in current HDFS, these have different performance characteristics (see the sketch below)
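A minimal Java sketch contrasting the two read styles named above (the path, offset, and buffer size are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadStyles {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path("/hbase/demo/hfile"));
        byte[] buf = new byte[64 * 1024];

        // Seek-and-read: stateful; moves the stream's file pointer,
        // so concurrent readers sharing one stream must synchronize.
        in.seek(1024L);
        in.readFully(buf, 0, buf.length);

        // pread: positional; the offset is passed as an argument and
        // the file pointer is untouched, so parallel readers of the
        // same stream don't contend on seek position.
        in.readFully(1024L, buf, 0, buf.length);

        in.close();
      }
    }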

Page 13: HBase User Group #9: HBase and HDFS

Small Random Reads: HDFS Development in Progress

- Client↔DN connection reuse (HDFS-941, HDFS-380): eliminates TCP handshake latency and avoids restarting the TCP slow-start algorithm for each read
- Multiplexed BlockSender (HDFS-918): reduces the number of threads and open files in the DN
- Netty DataNode (hack in progress): non-blocking IO may be more efficient for high concurrency

Page 14: HBase User Group #9: HBase and HDFS

Single-Client Fault Tolerance: What exactly do I mean?

- If a MapReduce task fails to write, the MR framework will restart the task.
- MR relies on idempotence → task failures are not a big deal.
- Thus, fault tolerance of a single client is not as important to MR.
- If an HBase region fails to write, it cannot recreate the data easily.
- HBase may access a single file for a day at a time → it must ride over transient errors.

Page 15: HBase User Group #9: HBase and HDFS

Single-Client Fault Tolerance: HDFS Patches

- HDFS-127 / HDFS-927: clients used to give up after N read failures on a file, with no regard for time. This patch resets the failure count after successful reads.
- HDFS-630: fixes block allocation so the client can exclude nodes it knows to be bad. Important for small clusters! Backported to 0.20 in CDH2.
- Various other write-pipeline recovery fixes in 0.20.2 (HDFS-101, HDFS-793)

Page 16: HBase User Group #9: HBase and HDFS

Durable Record Appends: What exactly is the infamous sync()/append()?

- Well, it’s really hflush() (see the sketch below)
- HBase accepts writes into memory (the MemStore)
- It also logs them to disk (the HLog / WAL)
- Each write needs to be on disk before claiming durability.
- hflush() provides this guarantee (almost)
- Unfortunately, it doesn’t work in Apache Hadoop 0.20.x
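A minimal Java sketch of the WAL durability pattern described above (the file name and record bytes are illustrative; on 0.20.x the API is FSDataOutputStream.sync(), which trunk renames to hflush()):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WalSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream wal = fs.create(new Path("/hbase/demo-wal"));

        byte[] edit = "row1/cf:col=value".getBytes();  // illustrative edit
        wal.write(edit);  // append the record to the log file

        // Push the record through the DataNode pipeline before the
        // write is acknowledged as durable; this is the call whose
        // guarantee is broken in stock Apache Hadoop 0.20.x.
        wal.sync();       // hflush() on trunk / 0.21+

        wal.close();      // a closed file is durable even on 0.20.x
      }
    }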

Page 17: HBase User Group #9: HBase and HDFS

Durable Record Appends: HBase Workarounds

- HDFS files are durable once closed
- Currently, HBase rolls the edit log periodically (the interval is configurable; see the sketch below)
- After a roll, previous edits are safe
- Not much of a workaround ☹
  - A crash will lose any edits since the last roll.
  - Rolling constantly results in small files: bad for NN metadata efficiency, and it triggers frequent flushes → bad for region server efficiency
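For reference, a hedged sketch of tuning the roll interval (the property name hbase.regionserver.logroll.period is an assumption based on contemporary HBase releases; verify it against your version):

    <!-- hbase-site.xml: how often the region server rolls its WAL.
         Property name assumed; value is milliseconds (1 hour here). -->
    <property>
      <name>hbase.regionserver.logroll.period</name>
      <value>3600000</value>
    </property>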

Page 19: HBase User Group #9: HBase and HDFS

Durable Record Appends: HDFS Development

- On Apache trunk: HDFS-265
  - New append re-implementation for 0.21/0.22
  - Will work great, but is essentially a very large set of patches
  - Not released yet; running unreleased Hadoop is “daring”
- In 0.20.x distributions: the HDFS-200 patch
  - Fixes bugs in the old hflush() implementation
  - Not quite as efficient as HDFS-265, but good enough and simpler
  - Dhruba Borthakur from Facebook is testing and improving it
  - Cloudera will test and merge this into CDH3

Page 20: HBase User Group #9: HBase and HDFS

Summary

- HDFS’s original target workload was MapReduce, and HBase has different (harder) requirements.
- Engineers from the HBase team plus Facebook, Cloudera, and Yahoo are working together to improve things.
- Cloudera will integrate all necessary HDFS patches in CDH3, available for testing soon.
- Contact me if you’d like to help test in April.

Page 21: HBase User Group #9: HBase and HDFS

[email protected]
Twitter: @tlipcon

#hbase IRC: tlipcon

P.S. we’re hiring!