HBase Design Patterns @ Yahoo!
HBase Design Patterns @ Y!
PRESENTED BY Francis Liu | May 5, 2014
Y! Grid
▪ Off-Stage Processing
▪ Hosted Service
▪ Multi-tenant
Batch Processing (with HDFS)
▪ Append-only
▪ Efficient full table scans
▪ Process entire data set (or partitions)
HBase
▪ Mutable
▪ Point Access
▪ Range scans
▪ Record-level processing
▪ 7 clusters, 1500 nodes, 6PB
Entity Store: Motivation
▪ Integrate data from multiple data sources
▪ Store historical data
▪ Share data
  › Analytics
  › Machine Learning
  › Consume a data source
Entity Store
▪ Records as Entities
  › Web pages
  › Celebrities
  › etc.
▪ Denormalized as a single table
Entity Store: Content Store
Entity Store: Considerations
▪ Row vs multiple rows as an entity?
  › Row in most cases
▪ Blob vs Primitives as cell values?
  › Blobs are more compact
  › Primitives work better for granular updates
  › Out-of-the-box filters work better with primitives
  › Use a compact binary format
▪ Prepare for Schema Changes
  › Provide a DAO library
▪ Incremental Scan
  › Batch id (via version)
  › Size cache for batch
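The "Batch id (via version)" idea can be illustrated with a toy model. This is only a sketch, assuming each load writes its batch id as the cell version so a later scan can pick up just the cells written by a given range of batches. `VersionedTable` is a hypothetical in-memory stand-in, not the HBase client API. Its half-open version range mirrors the `[min, max)` time range that an HBase `Scan` accepts.

```python
from collections import defaultdict


class VersionedTable:
    """Toy in-memory stand-in for a table whose cells carry a version."""

    def __init__(self):
        # row key -> column -> {version: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, batch_id):
        # Assumption: the batch id is stored as the cell version.
        self._rows[row][column][batch_id] = value

    def scan(self, min_batch, max_batch):
        """Yield cells whose version v satisfies min_batch <= v < max_batch."""
        for row in sorted(self._rows):
            for column in sorted(self._rows[row]):
                for version, value in sorted(self._rows[row][column].items()):
                    if min_batch <= version < max_batch:
                        yield row, column, version, value


table = VersionedTable()
table.put("page1", "d:title", "old", batch_id=1)
table.put("page2", "d:title", "hello", batch_id=1)
table.put("page1", "d:title", "new", batch_id=2)

# Incremental scan: only cells written by batch 2.
changed = list(table.scan(min_batch=2, max_batch=3))
```

Here `changed` contains only the `page1` cell rewritten by batch 2, so downstream processing skips the untouched `page2` row.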
Event Processing: Motivation
▪ Process a stream of events
  › Ad Targeting
  › Personalization
  › etc.
▪ Low average age of a record/model/etc.
Event Processing
▪ Entity Store
▪ Incremental computation
  › Persist incremental state
▪ Stream processing framework
  › e.g. Storm
▪ Fit working set in Block Cache
Event Processing: Ad Targeting
Ad Targeting
Event Processing: Considerations
▪ Limit large compactions
▪ Deferred log flush
▪ Avoid compaction storms
▪ Async Access
  › HBase work queue
  › AsyncHBase
▪ Blobs when possible
▪ Cache optimizations
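The "Blobs when possible" point, together with the earlier "Use a compact binary format" consideration, can be sketched as packing several fields into one fixed-width cell value. The record layout below (user id, click count, score) is purely hypothetical and uses only Python's standard `struct` module. The deck does not say which binary format is used in production.

```python
import struct

# Hypothetical layout: user id (uint64), click count (uint32), score (float32),
# big-endian with no padding, so every record is exactly 16 bytes.
RECORD = struct.Struct(">QIf")


def encode(user_id, clicks, score):
    """Pack the three fields into a single blob suitable for one cell."""
    return RECORD.pack(user_id, clicks, score)


def decode(blob):
    """Unpack a blob back into (user_id, clicks, score)."""
    return RECORD.unpack(blob)


def add_click(blob):
    """Updating one field means rewriting the whole blob (read-modify-write)."""
    user_id, clicks, score = decode(blob)
    return encode(user_id, clicks + 1, score)


blob = encode(42, 7, 0.5)
updated = add_click(blob)
```

The trade-off named in the Entity Store considerations shows up in `add_click`: the blob is compact, but a granular update has to decode and re-encode the entire value.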
Phased Event Processing: Motivation
▪ Large/Complex event pipeline
▪ Modularization
▪ Dependency between pipelines
Phased Event Processing
▪ Notifications
  › Separate Table
  › Separate Column Family
Phased Event Processing: Personalization
Phased Event Processing: Considerations
▪ Notifications
  › Ordered
  › At least once
▪ Write to multiple regions
▪ Transactions
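Ordered, at-least-once notifications imply that a downstream phase may see the same notification more than once. One common way to cope, shown here only as a sketch, is to make the consumer idempotent by remembering the highest sequence number applied per entity. The `IdempotentConsumer` class and the per-entity sequence numbers are assumptions for illustration, not something described in the talk.

```python
class IdempotentConsumer:
    """Applies each (key, seq) notification at most once, assuming ordered delivery."""

    def __init__(self):
        self.last_seq = {}  # entity key -> highest sequence number applied
        self.applied = []   # record of the notifications actually processed

    def handle(self, key, seq, payload):
        # A redelivered notification has a sequence number we have already passed.
        if seq <= self.last_seq.get(key, -1):
            return False
        self.applied.append((key, seq, payload))
        self.last_seq[key] = seq
        return True


consumer = IdempotentConsumer()
deliveries = [
    ("user1", 0, "click"),
    ("user1", 1, "view"),
    ("user1", 1, "view"),   # duplicate caused by at-least-once delivery
    ("user2", 0, "click"),
]
results = [consumer.handle(*d) for d in deliveries]
```

The duplicate delivery for `user1` is ignored, while `user2` keeps its own independent sequence.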
Time Series DB: Motivation
▪ Track/Monitor changes over time
  › Application Metrics
  › User Analytics
  › System Metrics
  › etc.
▪ Alerts/Alarms
  › Thresholds
  › Changes over time
Time Series DB: Personalization Data Quality
Time Series DB: Considerations
▪ Hot metrics
  › Namespace
  › Indexed tags
▪ Pre-compute aggregates if they are accessed often
▪ Consider using a block encoding scheme (PREFIX, FAST_DIFF, etc.)
▪ Consider pre-computed aggregates in a separate table
▪ Consider OpenTSDB
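The pre-computed aggregates mentioned above can be sketched as a rollup of raw data points into fixed time buckets, with each bucket being what would be written to the separate aggregate table. The hourly bucket size, the `(metric, bucket_start)` key, and the chosen statistics are assumptions for illustration only.

```python
BUCKET_SECONDS = 3600  # assumed rollup granularity: one hour


def rollup(points):
    """Aggregate (metric, unix_ts, value) points into per-bucket statistics.

    Returns a dict keyed by (metric, bucket_start), which could serve as the
    row key of a separate aggregate table.
    """
    aggregates = {}
    for metric, ts, value in points:
        bucket_start = ts - ts % BUCKET_SECONDS
        key = (metric, bucket_start)
        stats = aggregates.get(key)
        if stats is None:
            aggregates[key] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            stats["count"] += 1
            stats["sum"] += value
            stats["min"] = min(stats["min"], value)
            stats["max"] = max(stats["max"], value)
    return aggregates


points = [
    ("cpu.load", 3600, 1.0),
    ("cpu.load", 3700, 3.0),
    ("cpu.load", 7200, 5.0),
]
hourly = rollup(points)
```

A dashboard or alert that reads `hourly` touches one row per metric per hour instead of scanning every raw point, which is the motivation for pre-computing frequently accessed aggregates.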
HBaseCon 2014
Thank You! (We’re hiring)