Dumbo in the Cloud Report Series, Part 2
Cloud Group
Hadoop in SIGMOD 2011
Outline
Introduction
Nova: Continuous Pig/Hadoop Workflows
Apache Hadoop Goes Realtime at Facebook
Emerging Trends in the Enterprise Data Analytics
A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses
Industrial Sessions at SIGMOD 2011
• Data Management for Feeds and Streams (2)
• Dynamic Optimization and Unstructured Content (4)
• Business Analytics (2)
• Support for Business Analytics and Warehousing (4)
• Applying Hadoop (4)
Nova: Continuous Pig/Hadoop Workflows
By Yahoo!
Nova Overview
• Scenarios
– Ingesting and analyzing user behavior logs
– Building and updating a search index from a stream of crawled web pages
– Processing semi-structured data
• Two-layer programming model (Nova over Pig)
– Continuous processing
– Independent scheduling
– Cross-module optimization
– Manageability features
Workflow Model
• Workflow
– Two kinds of vertices: tasks (processing steps) and channels (data containers)
– Edges connect tasks to channels and channels to tasks
• Four common patterns of processing
– Non-incremental (template detection)
– Stateless incremental (shingling)
– Stateless incremental with lookup table (template tagging)
– Stateful incremental (de-duping)
Workflow Model (Cont.)
• Data and Update Model
– Blocks: a channel's data is divided into blocks
• Base block
– Contains a complete snapshot of data on a channel as of some point in time
– Base blocks are assigned increasing sequence numbers ($B_0, B_1, B_2, \ldots, B_n$)
• Delta block
– Used in conjunction with incremental processing
– Contains instructions for transforming a base block into a new base block ($\Delta_{i \to j}$)
– $\Delta_{i \to j}(B_i) = B_j$, where $i < j$
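The relationship between base and delta blocks can be illustrated with a minimal Python sketch. It is not Nova's implementation: the `BaseBlock` and `DeltaBlock` classes, the key/value record layout, and the `upserts`/`deletes` fields are all assumptions made for illustration. It only shows a delta from sequence number i to j being applied to the base block with sequence number i to yield the base block with sequence number j.

```python
from dataclasses import dataclass, field


@dataclass
class BaseBlock:
    seq: int        # sequence number i of the base block
    records: dict   # complete snapshot: key -> value (hypothetical layout)


@dataclass
class DeltaBlock:
    src: int        # i: sequence number of the base block it applies to
    dst: int        # j: sequence number of the base block it produces
    upserts: dict = field(default_factory=dict)  # records to add or overwrite
    deletes: set = field(default_factory=set)    # keys to remove


def apply_delta(base: BaseBlock, delta: DeltaBlock) -> BaseBlock:
    """Return the new base block obtained by applying the delta to the base."""
    if delta.src != base.seq or delta.dst <= delta.src:
        raise ValueError("delta does not apply to this base block")
    records = dict(base.records)       # leave the old snapshot untouched
    for key in delta.deletes:
        records.pop(key, None)
    records.update(delta.upserts)
    return BaseBlock(seq=delta.dst, records=records)


b0 = BaseBlock(0, {"a": 1, "b": 2})
d01 = DeltaBlock(0, 1, upserts={"b": 3, "c": 4}, deletes={"a"})
b1 = apply_delta(b0, d01)
```

The old snapshot is copied rather than mutated, so both the old and the new base block remain readable until the old one is explicitly discarded.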
Workflow Model (Cont.)
• Task/Data Interface
– Consumption mode: all or new
– Production mode: B or Δ
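The two consumption modes can be sketched as follows. This is a simplified Python illustration, not Nova's API: it assumes an append-only channel modeled as a list of batches, and a per-consumer cursor that counts how many batches have already been seen. The function name `consume` and its return shape are hypothetical.

```python
def consume(batches, cursor, mode):
    """Return (records, new_cursor) for one task invocation.

    batches: list of lists, one inner list per batch appended to the channel
    cursor:  number of batches this consumer has already processed
    mode:    "all" reads everything on the channel, "new" reads only
             what arrived since the previous invocation
    """
    if mode == "all":
        selected = batches
    elif mode == "new":
        selected = batches[cursor:]
    else:
        raise ValueError("mode must be 'all' or 'new'")
    records = [rec for batch in selected for rec in batch]
    return records, len(batches)


channel = [["a", "b"], ["c"]]
first, cursor = consume(channel, 0, "all")
channel.append(["d"])
incremental, cursor = consume(channel, cursor, "new")
```

A non-incremental task would always use "all", while an incremental task would use "new" and advance its cursor after each run.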
Workflow Model (Cont.)
• Workflow Programming and Scheduling
– Data-based trigger
– Time-based trigger
– Cascade trigger
• Data Compaction and Garbage Collection
– If a channel has blocks $B_0$, $\Delta_{0 \to 1}$, $\Delta_{1 \to 2}$, $\Delta_{2 \to 3}$, the compaction operation computes $B_3$ and adds it to the channel
– After compaction has added $B_3$ to the channel, if the current cursor is at sequence number 2, then $B_0$, $\Delta_{0 \to 1}$, $\Delta_{1 \to 2}$ can be garbage-collected
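The compaction and garbage-collection example can be traced with a small Python sketch. The data layout (a dict of base blocks keyed by sequence number, a dict of deltas keyed by a `(i, j)` pair holding `(upserts, deletes)`) and the garbage-collection rule are assumptions chosen to reproduce the example on this slide, not Nova's actual policy.

```python
def compact(bases, deltas):
    """Fold the chain of deltas onto the newest base block.

    bases:  {seq: records_dict}
    deltas: {(i, j): (upserts_dict, deletes_set)}
    Adds the resulting base block to `bases` and returns its sequence number.
    """
    seq = max(bases)
    records = dict(bases[seq])
    while True:
        chain = [key for key in deltas if key[0] == seq]
        if not chain:
            break
        i, j = chain[0]
        upserts, deletes = deltas[(i, j)]
        for key in deletes:
            records.pop(key, None)
        records.update(upserts)
        seq = j
    bases[seq] = records
    return seq


def garbage_collect(bases, deltas, min_cursor):
    """Drop blocks that no consumer at or beyond min_cursor still needs.

    Assumed rule: keep only the newest base block, and drop every delta
    that ends at or before both the slowest cursor and the newest base.
    """
    newest = max(bases)
    limit = min(min_cursor, newest)
    for seq in [s for s in bases if s < newest]:
        del bases[seq]
    for key in [k for k in deltas if k[1] <= limit]:
        del deltas[key]


bases = {0: {"a": 1}}
deltas = {
    (0, 1): ({"b": 2}, set()),
    (1, 2): ({"a": 5}, set()),
    (2, 3): ({}, {"b"}),
}
new_seq = compact(bases, deltas)                 # adds base block 3
garbage_collect(bases, deltas, min_cursor=2)
```

The delta from 2 to 3 survives because a consumer whose cursor is at sequence number 2 still needs it to catch up incrementally.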
Nova System Architecture
Apache Hadoop Goes Realtime at Facebook
By Facebook
Workload Types
• Facebook Messaging
– High Write Throughput
– Large Tables
– Data Migration
• Facebook Insights
– Realtime Analytics
– High Throughput Increments
• Facebook Metrics System (ODS)
– Automatic Sharding
– Fast Reads of Recent Data and Table Scans
Why Hadoop & HBase
• Requirements
– Elasticity
– High write throughput
– Efficient and low-latency strong consistency semantics within a data center
– Efficient random reads from disk
– High Availability and Disaster Recovery
– Fault Isolation
– Atomic read-modify-write primitives
– Range Scans
• Listed in the paper as non-requirements
– Tolerance of network partitions within a single data center
– Zero downtime in case of individual data center failure
– Active-active serving capability across different data centers
Realtime HDFS
High Availability - AvatarNode
Realtime HDFS (Cont.)
• Hadoop RPC compatibility
• Block Availability: Placement Policy
– A pluggable block placement policy
Realtime HDFS (Cont.)
• Performance Improvements for a Realtime Workload
– RPC Timeout
– Reads from Local Replicas
• New Features
– HDFS sync
– Concurrent Readers
Production HBase
• ACID Compliance (RWCC: Read Write Consistency Control)
– Atomicity (WALEdit)
– Consistency
• Availability Improvements
– HBase Master Rewrite: region assignment state moved from memory to ZooKeeper
– Online Upgrades
– Distributed Log Splitting
• Performance Improvements
– Compaction (minor and major)
– Read Optimizations
Emerging Trends in the Enterprise Data Analytics: Connecting Hadoop and DB2 Warehouse
By IBM
Motivation
1. Increasing volumes of data
2. Hadoop-based solutions in conjunction with data warehouses
A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses
By Teradata
Motivation
• ETL (Extraction, Transformation, Loading) is a critical part of a data warehouse
• While data is partitioned and replicated across all nodes in a parallel data warehouse, load utilities reside on a single node (a bottleneck)
Why Hadoop for Teradata EDW (Enterprise Data Warehouse)?
• More disk space can be easily added
• Use as intermediate storage
• MapReduce for transformation
• Load data in parallel
Block Assignment Problem
– HDFS file F on a cluster of P nodes (each node is uniquely identified with an integer i where 1 ≤ i ≤ P)
– The problem is defined by: assignment(X, Y, n, m, k, r)
– X is the set of n blocks (X = {1, . . . , n}) of F
– Y is the set of m nodes running PDBMS (called PDBMS nodes) (Y ⊆ {1, . . . , P})
– k is the number of copies of each block and m is the number of PDBMS nodes
– r is the mapping recording the replicated block locations of each block; r(i) returns the set of nodes which have a copy of block i
Block Assignment Problem (Cont.)
• An assignment g from the blocks in X to the nodes in Y is denoted by a mapping from X = {1, . . . , n} to Y where g(i) = j (i ∈ X, j ∈ Y) means that block i is assigned to node j.
• An even assignment g is an assignment such that ∀ i ∈ Y ∀ j ∈ Y: | |{x | 1 ≤ x ≤ n ∧ g(x) = i}| − |{y | 1 ≤ y ≤ n ∧ g(y) = j}| | ≤ 1, that is, the numbers of blocks assigned to any two nodes differ by at most one.
• The cost of an assignment g is defined to be cost(g) = |{i | g(i) ∉ r(i), 1 ≤ i ≤ n}|, which is the number of blocks assigned to remote nodes.
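The two definitions above can be checked mechanically. The following Python sketch evaluates whether a given assignment is even and computes its cost. It does not search for an optimal assignment and is not the paper's algorithm. The function names and the dict-based encoding of g and r are assumptions for illustration.

```python
from collections import Counter


def is_even(g, nodes):
    """True if the block counts of any two nodes in `nodes` differ by at most 1.

    g:     dict mapping block id -> assigned node id
    nodes: iterable of PDBMS node ids (the set Y)
    """
    counts = Counter(g.values())
    loads = [counts.get(node, 0) for node in nodes]
    return max(loads) - min(loads) <= 1


def cost(g, r):
    """Number of blocks assigned to a node that holds no replica of them.

    r: dict mapping block id -> set of node ids holding a copy of that block
    """
    return sum(1 for block, node in g.items() if node not in r[block])


# Four blocks with k = 2 copies each on a cluster of P = 4 nodes;
# nodes 1 and 2 run the PDBMS.
nodes = [1, 2]
r = {1: {1, 3}, 2: {1, 2}, 3: {3, 4}, 4: {2, 4}}
g = {1: 1, 2: 1, 3: 2, 4: 2}
even = is_even(g, nodes)
remote_blocks = cost(g, r)
```

In this example block 3 has replicas only on nodes 3 and 4, so any assignment to the PDBMS nodes 1 and 2 must read it remotely.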
Thank You!