Large Scale Web Apps @Pinterest (Powered by Apache HBase)
May 5, 2014
What is Pinterest?
Pinterest is a visual discovery tool for collecting the things you love and discovering related content along the way.
Challenges @scale
• 100s of millions of pins/repins per month
• Billions of requests per week
• Millions of daily active users
• Billions of pins
• One of the largest discovery tools on the internet
Storage stack @Pinterest
• MySQL
• Redis (for persistence and cache)
• Memcache (consistent hashing)
(Diagram: the app tier implements manual sharding, with the sharding logic living in application code.)
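Before HBase, routing a read or write to the right MySQL shard looked roughly like the minimal sketch below. This is an illustration of hash-modulus routing only; the shard names, shard count, and the exact sharding function are assumptions, not Pinterest's actual scheme.

// Minimal sketch of app-tier manual sharding (illustrative only).
import java.util.Arrays;
import java.util.List;

public class ShardRouter {
    private final List<String> shards;  // hypothetical shard connection names

    public ShardRouter(List<String> shards) {
        this.shards = shards;
    }

    // Map an object id to a shard; every caller must agree on this function,
    // and resharding means rewriting data - one reason the deck moves to HBase.
    public String shardFor(long objectId) {
        int idx = (int) Math.floorMod(objectId, (long) shards.size());
        return shards.get(idx);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(
            Arrays.asList("db001", "db002", "db003", "db004"));
        System.out.println(router.shardFor(123456789L));  // -> db002
    }
}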
Why HBase?
• High write throughput - unlike MySQL/B-tree, writes never seek on disk
• Seamless integration with Hadoop
• Distributed operation
- Fault tolerance
- Load balancing
- Easily add/remove nodes
Non-Technical Reasons
• Large active community
• Large-scale online use cases
Outline
• Features powered by HBase
• SaaS (Storage as a Service)
- MetaStore
- HFile Service (Terrapin)
• Our HBase setup - optimizing for high availability and low latency
Applications/Features
• Offline
- Analytics
- Search indexing
- ETL/Hadoop workflows
• Online
- Personalized feeds
- Rich pins
- Recommendations
Personalized Feeds
Why HBase? Write-heavy load due to pin fanout.
(Diagram: pins from users I follow, plus recommended pins, fan out into each follower's personalized feed.)
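As an illustration of that fanout, one pin produces one feed-row write per follower. Below is a minimal sketch using the classic (pre-1.0) HBase client API; the table name, column family, and row-key layout are assumptions for illustration, not Pinterest's actual schema.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FeedFanout {
    // Fan a single pin out to every follower's feed row.
    public static void fanOut(Configuration conf, long pinId, long timestamp,
                              List<Long> followerIds) throws IOException {
        HTable feeds = new HTable(conf, "user_feeds");   // hypothetical table
        try {
            List<Put> puts = new ArrayList<Put>();
            for (long followerId : followerIds) {
                // One row per (follower, reverse timestamp) so newest pins sort first.
                byte[] rowKey = Bytes.add(Bytes.toBytes(followerId),
                                          Bytes.toBytes(Long.MAX_VALUE - timestamp));
                Put put = new Put(rowKey);
                put.add(Bytes.toBytes("f"), Bytes.toBytes("pin"), Bytes.toBytes(pinId));
                puts.add(put);
            }
            feeds.put(puts);  // a popular user means thousands of puts for one pin
        } finally {
            feeds.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        fanOut(conf, 42L, System.currentTimeMillis(), Arrays.asList(1L, 2L, 3L));
    }
}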
Rich Pins
Why HBase? Negative hits handled efficiently with Bloom filters.
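Row-level Bloom filters let a region server skip store files that cannot contain the requested key, so a miss rarely touches disk. A minimal sketch of enabling them at table-creation time; the table and family names are made up, and the exact class locations vary across HBase versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class CreateRichPinsTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HColumnDescriptor family = new HColumnDescriptor("d");
        family.setBloomFilterType(BloomType.ROW);  // per-store-file row Bloom filter

        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("rich_pins"));
        table.addFamily(family);
        admin.createTable(table);
        admin.close();
    }
}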
Recommendations
Why HBase? Seamless data transfer from Hadoop.
(Diagram: recommendations are generated on a Hadoop 1.0 cluster, copied over by DistCP jobs, and loaded into the HBase + Hadoop 2.0 serving cluster.)
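Once the DistCP'd HFiles land on the serving cluster, HBase can adopt them without going through the write path. A minimal sketch using the standard bulk-load tool; the paths and table name are illustrative, since the deck does not show the exact job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadRecommendations {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "recommendations");  // hypothetical table

        // Moves the HFiles produced offline (and copied over by DistCP)
        // directly into the table's regions - no write path, no WAL.
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new Path("/pinterest/recommendations/hfiles"), table);

        table.close();
    }
}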
SaaS
• Large number of feature requests
• 1 cluster per feature
• Scaling with organizational growth
• Need for “defensive” multi-tenant storage
• Previous solutions reaching their limits
MetaStore I
• Key-value store on top of HBase
• 1 HBase table per feature, with salted keys
• Pre-split tables
• Table-level rate limiting (online/offline reads/writes)
• No scan support
• Simple client API:
string getValue(string feature, string key, boolean online);
void setValue(string feature, string key, string value, boolean online);
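A rough illustration of what the key salting behind that API might look like: prefixing each key with a bucket derived from its hash spreads writes evenly across the pre-split regions while still allowing exact-key gets (MetaStore has no scans). The bucket count, key layout, and helper names below are assumptions, not the actual implementation.

import java.nio.charset.StandardCharsets;

public class SaltedKey {
    private static final int BUCKETS = 64;  // assumed to match the table's pre-split points

    // Prepend a two-digit salt so keys distribute across regions
    // while remaining retrievable by exact key.
    public static byte[] salted(String feature, String key) {
        int bucket = Math.floorMod((feature + key).hashCode(), BUCKETS);
        String row = String.format("%02d:%s:%s", bucket, feature, key);
        return row.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(new String(salted("rich_pins", "pin:12345"),
                                      StandardCharsets.UTF_8));
    }
}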
MetaStore II
(Diagram: clients issue gets/sets over Thrift to the MetaStore Thrift server, which applies salting and rate limiting; data lives in primary and secondary HBase clusters with master/master replication; ZooKeeper pushes notifications of MetaStore config changes such as rate limits and the primary cluster.)
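Table-level rate limiting in the Thrift server can be as simple as one limiter per (feature, operation class), applied before the HBase call. A sketch using Guava's RateLimiter; the limits and key scheme are illustrative, and the deck does not say which library Pinterest actually uses.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import com.google.common.util.concurrent.RateLimiter;

public class FeatureRateLimiter {
    private final ConcurrentMap<String, RateLimiter> limiters =
        new ConcurrentHashMap<String, RateLimiter>();

    // Limits would be loaded from the MetaStore config in ZooKeeper; hardcoded here.
    private RateLimiter limiterFor(String feature, boolean online, boolean read) {
        String key = feature + (online ? ":online" : ":offline") + (read ? ":read" : ":write");
        return limiters.computeIfAbsent(key, k -> RateLimiter.create(online ? 5000.0 : 500.0));
    }

    // Block the caller until the request fits under the feature's rate limit.
    public void throttle(String feature, boolean online, boolean read) {
        limiterFor(feature, online, read).acquire();
    }

    public static void main(String[] args) {
        FeatureRateLimiter rl = new FeatureRateLimiter();
        rl.throttle("rich_pins", true, true);   // online read
        System.out.println("request admitted");
    }
}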
HFile Service (Terrapin)
• Solve the bulk upload problem
• HBase-backed solution:
- Bulk upload + major compact
- Major compact to delete old data
• Design a solution from scratch using a mashup of:
- HFile
- HBase BlockCache
- Avoid compactions
- Low-latency key/value lookups
High Level Architecture I
(Diagram: ETL/batch jobs load/reload HFiles onto Amazon S3; HFile servers, each serving multiple HFiles, pull them down and answer key/value lookups from the client library/service.)
High Level Architecture II
• Each HFile server runs 2 processes:
- Copier: pulls HFiles from S3 to local disk
- Supershard: serves multiple HFile shards to clients
• ZooKeeper:
- Detecting alive servers
- Coordinating loading/swapping of new data
- Enabling clients to detect availability of new data (see the sketch after this list)
• Loader module (replaces DistCP):
- Triggers new data copy
- Triggers swap through ZooKeeper
- Updates ZooKeeper and notifies clients
• Client library understands sharding
• Old data deleted by a background process
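A rough sketch of how a client might watch ZooKeeper to notice that the loader has swapped in a new fileset version. The znode path and data format are assumptions; Terrapin's actual layout is not shown in the deck.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class FilesetWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String znode;
    private volatile String currentVersion;

    public FilesetWatcher(String zkQuorum, String znode) throws Exception {
        this.znode = znode;
        this.zk = new ZooKeeper(zkQuorum, 30000, this);
        refresh();
    }

    // Re-read the znode and re-arm the watch; called on every change notification.
    private void refresh() throws Exception {
        byte[] data = zk.getData(znode, true, null);
        currentVersion = new String(data, StandardCharsets.UTF_8);
        System.out.println("serving fileset version: " + currentVersion);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
            try {
                refresh();  // the loader updated the znode after a swap
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new FilesetWatcher("zk1:2181,zk2:2181",
                           "/terrapin/filesets/recommendations/current");
        Thread.sleep(Long.MAX_VALUE);  // keep watching
    }
}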
Salient Features
• Multi-tenancy through namespacing
• Pluggable sharding functions - modulus, range, and more
• HBase BlockCache
• Multiple clusters for redundancy
• Speculative execution across clusters for low latency (sketched below)
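A minimal sketch of cross-cluster speculative execution: issue the read to the primary cluster, and if it has not answered within a small budget, hedge the request to the backup cluster and take whichever returns first. The cluster names, timeout, and fetch function are illustrative.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SpeculativeRead {
    private static final ExecutorService pool = Executors.newCachedThreadPool();

    // Placeholder for a real key/value lookup against one cluster.
    static byte[] fetch(String cluster, byte[] key) {
        // ... issue the lookup to `cluster` ...
        return new byte[0];
    }

    static byte[] speculativeGet(final byte[] key) throws Exception {
        ExecutorCompletionService<byte[]> ecs = new ExecutorCompletionService<byte[]>(pool);
        ecs.submit(new Callable<byte[]>() {
            public byte[] call() { return fetch("terrapin-primary", key); }
        });
        // Wait briefly for the primary; on timeout, hedge with the backup cluster.
        Future<byte[]> first = ecs.poll(20, TimeUnit.MILLISECONDS);
        if (first != null) {
            return first.get();
        }
        ecs.submit(new Callable<byte[]>() {
            public byte[] call() { return fetch("terrapin-backup", key); }
        });
        return ecs.take().get();  // whichever cluster answers first wins
    }

    public static void main(String[] args) throws Exception {
        System.out.println(speculativeGet("some-key".getBytes()).length);
        pool.shutdown();
    }
}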
Setting up for Success
• Many online use cases/applications
• Optimize for:
- Low MTTR (high availability)
- Low latency (performance)
MTTR - I
(Diagram: DataNode states - LIVE → STALE after 20 sec without a heartbeat → DEAD after 9 min 40 sec.)
• Stale nodes are avoided:
- As candidates for reads
- As candidate replicas for writes
- During lease recovery
• Copying of under-replicated blocks starts once a node is marked “Dead”
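The stale-node behavior comes from NameNode settings like the ones below, shown here through the Hadoop Configuration API although they would normally live in hdfs-site.xml. The 20-second stale interval matches the slide; the rest of Pinterest's values are not shown.

import org.apache.hadoop.conf.Configuration;

public class StaleNodeSettings {
    public static void main(String[] args) {
        Configuration hdfs = new Configuration();
        // Skip stale DataNodes when choosing replicas to read from (HDFS-3703).
        hdfs.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
        // Skip stale DataNodes when placing new replicas for writes (HDFS-3912).
        hdfs.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
        // Mark a DataNode stale after 20 seconds without a heartbeat.
        hdfs.setLong("dfs.namenode.stale.datanode.interval", 20 * 1000L);
        System.out.println(hdfs.get("dfs.namenode.stale.datanode.interval"));
    }
}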
MTTR - II
(Diagram: recovery pipeline - Failure Detection (30 sec ZooKeeper session timeout) → Lease Recovery (HDFS-4721) → Log Split (HDFS-3703 + HDFS-3912) → Recover Regions; total recovery time < 2 min.)
• Avoid stale nodes at each point of the recovery process
• Multi-minute timeouts become multi-second timeouts
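Failure detection itself is bounded by the region server's ZooKeeper session timeout. A small sketch of lowering it via the HBase configuration (normally set in hbase-site.xml); the 30-second value matches the slide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FastFailureDetection {
    public static void main(String[] args) {
        Configuration hbase = HBaseConfiguration.create();
        // A dead region server is only noticed once its ZooKeeper session expires,
        // so a 30 second timeout caps the failure-detection part of MTTR.
        hbase.setInt("zookeeper.session.timeout", 30 * 1000);
        System.out.println(hbase.get("zookeeper.session.timeout"));
    }
}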
Simulate, Simulate, Simulate
Simulate “pull the plug” failures and “tail -f” the logs:
• kill -9 both datanode and region server
- Causes connection-refused errors
• kill -STOP both datanode and region server
- Causes socket timeouts
• Blackhole hosts using iptables
- Connect timeouts + “No route to host”
- Most representative of AWS failures
Performance
Configuration tweaks:
• Small block size, 4K-16K
• Prefix compression to cache more - when data is in the key, close to 4x reduction for some data sets
• Separation of RPC handler threads for reads vs. writes
• Short-circuit local reads
• HBase-level checksums (HBASE-5074)
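A sketch of what the column-family side of those tweaks looks like through the admin API. The table and family names are illustrative, and the 8K block size is just one value inside the slide's 4K-16K range.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class TunedTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        HColumnDescriptor family = new HColumnDescriptor("d");
        family.setBlocksize(8 * 1024);                          // small blocks (4K-16K range)
        family.setDataBlockEncoding(DataBlockEncoding.PREFIX);  // prefix compression of keys

        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("feeds_tuned"));
        table.addFamily(family);

        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.createTable(table);
        admin.close();
    }
}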
Hardware
• SATA (m1.xl/c1.xl) and SSD (hi1.4xl)
• Choose based on the limiting factor:
- Disk space - pick SATA for max GB/$$
- IOPS - pick SSD for max IOPS/$$, for clusters with heavy reads or heavy compaction activity
Performance (SSDs)
HFile Read Performance
• Turn off block cache for data blocks - reduces GC and heap fragmentation
• Keep block cache on for index blocks
• Increase “dfs.client.read.shortcircuit.streams.cache.size” from 100 to 10,000 (with short-circuit reads)
• Approx. 3x improvement in read throughput
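A sketch of those two tweaks with the standard APIs: disabling the block cache on a data-heavy column family (per the slide, index blocks stay cached), and raising the short-circuit stream cache in the client configuration. The family name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;

public class HFileReadTuning {
    public static void main(String[] args) {
        // Family-level setting: stop caching data blocks to cut GC pressure
        // and heap fragmentation; index blocks remain cached.
        HColumnDescriptor family = new HColumnDescriptor("d");
        family.setBlockCacheEnabled(false);

        // Client-side setting used together with short-circuit local reads.
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("dfs.client.read.shortcircuit.streams.cache.size", 10000);

        System.out.println(family.isBlockCacheEnabled());
        System.out.println(conf.getInt("dfs.client.read.shortcircuit.streams.cache.size", 100));
    }
}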
Write Performance
• WAL contention when the client sets AutoFlush=true
• HBASE-8755
In the Pipeline...
• Building a graph database on HBase
• Disaster recovery - snapshot + incremental backup + restore
• Off-heap cache - reduce GC overhead and make better use of hardware
• Read path optimizations
And we are Hiring !!