From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB
paper into Real World Deployments
Daniel Abadi
Yale University
August 28th, 2013
Twitter: @daniel_abadi
Overview of Talk
Motivation for HadoopDB
Overview of HadoopDB
Overview of the commercialization process
Technical features missing from HadoopDB that Hadapt needed to implement
What does this mean for tenure?
Situation in 2008
Hadoop starting to take off as a “Big Data” processing platform
Parallel database startups such as Netezza, Vertica, and Greenplum gaining traction for “Big Data” analysis
2 schools of thought:
– School 1: They are on a collision course
– School 2: They are complementary technologies
From 10,000 feet, Hadoop and Parallel Database Systems are Quite Similar
Both are suitable for large-scale data processing:
– i.e., analytical processing workloads
– Bulk loads
– Not optimized for transactional workloads
– Queries over large amounts of data
– Both can handle both relational and non-relational queries (DBMS via UDFs)
SIGMOD 2009 Paper
Benchmarked Hadoop vs. 2 parallel database systems
– Mostly focused on performance differences
– Measured differences in load and query time for some common data processing tasks
– Used a Web analytics benchmark whose goal was to be representative of tasks that:
– Both should excel at
– Hadoop should excel at
– Databases should excel at
Hardware Setup
100-node cluster
Each node:
– 2.4 GHz Core 2 Duo processors
– 4 GB RAM
– 2 × 250 GB SATA HDs (74 MB/sec sequential I/O)
Dual GigE switches, each with 50 nodes
– 128 Gbit/sec fabric
– Connected by a 64 Gbit/sec ring
Join Task
[Figure: Join task completion time (seconds) for Vertica, DBMS-X, and Hadoop on 10, 25, 50, and 100 nodes]
UDF Task
[Figure: UDF task completion time (seconds) for DBMS and Hadoop on 10, 25, 50, and 100 nodes]
Calculate PageRank over a set of HTML documents, performed via a UDF
The DBMS clearly doesn't scale
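To make the UDF task concrete, here is a minimal, self-contained sketch of one PageRank iteration written in map/reduce style; the toy graph and damping factor are illustrative, not the benchmark's actual data:

```python
# One PageRank iteration in map/reduce style: the kind of computation that is
# awkward in pure SQL but natural as a UDF or MapReduce job.
# The toy link graph and damping factor below are illustrative only.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {page: 1.0 / len(links) for page in links}
DAMPING = 0.85

def map_phase(rank, links):
    """Each page emits a share of its rank to every page it links to."""
    for page, outlinks in links.items():
        for target in outlinks:
            yield target, rank[page] / len(outlinks)

def reduce_phase(contributions, n):
    """Sum the incoming contributions per page and apply the damping factor."""
    new_rank = {page: (1 - DAMPING) / n for page in links}
    for target, share in contributions:
        new_rank[target] += DAMPING * share
    return new_rank

for _ in range(20):  # iterate toward convergence
    rank = reduce_phase(map_phase(rank, links), len(links))
print(rank)
```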
Scalability
Except for UDFs, all systems scale near-linearly
BUT: only ran on 100 nodes
As nodes approach 1,000, other effects come into play:
– Faults go from being rare to not so rare
– It is nearly impossible to maintain homogeneity at scale
Fault Tolerance and Cluster Heterogeneity Results
[Figure: Percentage slowdown under fault-tolerance and slowdown-tolerance tests for DBMS and Hadoop]
Database systems restart entire query upon a single node failure, and do not adapt if a node is running slowly
Benchmark Conclusions
Hadoop had scalability advantages:
– Checkpointing allows for better fault tolerance
– Runtime scheduling allows for better tolerance of unexpectedly slow nodes
– Better parallelization of UDFs
Hadoop was consistently less efficient for structured, relational data:
– Reasons mostly non-fundamental
– Needed better support for compression and direct operation on compressed data
– Needed better support for indexing
– Needed better support for co-partitioning of datasets
Best of Both Worlds Possible?
Connector
Problems With the Connector Approach
Network delays and bandwidth limitations
Data silos
Multiple vendors
Fundamentally wasteful – very similar architectures:
– Both partition data across a cluster
– Both parallelize processing across the cluster
– Both optimize for local data processing (to minimize network costs)
Unified System
Two options:
– Bring Hadoop technology to a parallel database system
Problem: Hadoop is more than just technology
– Bring parallel database system technology to Hadoop
Far more likely to have impact
Adding DBMS Technology to Hadoop
Option 1: Keep Hadoop's storage and build a parallel executor on top of it
– Cloudera Impala (which is sort of a combination of the Hadoop++ and NoDB research projects)
– Need better storage formats (Trevni and Parquet are promising)
– Updates and deletes are hard (Impala doesn't support them)
Option 2: Use relational storage on each node
– Accelerates "time to complete system"
– We chose this option for HadoopDB (a minimal sketch of the idea follows)
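To make Option 2 concrete, here is a minimal sketch with Python's built-in sqlite3 standing in for the node-local PostgreSQL instances HadoopDB actually used; the table and data are made up for illustration. Each worker holds a partition in a local relational store and answers SQL fragments locally, so most processing happens next to the data:

```python
import sqlite3

# Each "node" holds a horizontal partition of a table in a local relational
# store (sqlite3 stands in here for HadoopDB's per-node PostgreSQL).
NODES = 4
visits = [("a.com", 5), ("b.com", 3), ("a.com", 2), ("c.com", 9),
          ("b.com", 1), ("c.com", 4), ("a.com", 7), ("d.com", 6)]

# Partition rows by hash, as a parallel DBMS or HadoopDB's loader would.
shards = [sqlite3.connect(":memory:") for _ in range(NODES)]
for db in shards:
    db.execute("CREATE TABLE visits (url TEXT, revenue INT)")
for url, rev in visits:
    shards[hash(url) % NODES].execute(
        "INSERT INTO visits VALUES (?, ?)", (url, rev))

# Each node evaluates the SQL fragment locally (the "map" side) ...
partials = []
for db in shards:
    partials.extend(db.execute(
        "SELECT url, SUM(revenue) FROM visits GROUP BY url"))

# ... and a final merge combines per-node partial aggregates (the "reduce" side).
totals = {}
for url, rev in partials:
    totals[url] = totals.get(url, 0) + rev
print(totals)  # {'a.com': 14, 'b.com': 4, 'c.com': 13, 'd.com': 6}
```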
HadoopDB Architecture
SMS Planner
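The SMS ("SQL to MapReduce to SQL") planner extends Hive: it compiles a SQL query into MapReduce jobs, pushing as much of the work as possible into the node-local databases. A toy illustration of that split, with hypothetical names (sms_plan, reduce_merge), might look like:

```python
# A toy version of the SMS planner idea: compile one logical aggregate query
# into (1) a SQL fragment each node runs against its local database (the map
# phase) and (2) a merge step over the partial results (the reduce phase).
# Function and column names are illustrative, not the real planner's API.

def sms_plan(table, group_key, agg_col):
    map_sql = (f"SELECT {group_key}, SUM({agg_col}) AS partial "
               f"FROM {table} GROUP BY {group_key}")

    def reduce_merge(partials):
        out = {}
        for key, val in partials:
            out[key] = out.get(key, 0) + val
        return out

    return map_sql, reduce_merge

map_sql, reduce_merge = sms_plan("visits", "url", "revenue")
print(map_sql)  # the fragment pushed down to every node's local DBMS
print(reduce_merge([("a.com", 7), ("a.com", 7), ("b.com", 4)]))
```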
TPC-H Benchmark Results
UDF Task
[Figure: UDF task completion time (seconds) for DBMS, Hadoop, and HadoopDB on 10, 25, and 50 nodes]
Fault Tolerance and Cluster Heterogeneity Results
[Figure: Percentage slowdown under fault-tolerance and slowdown-tolerance tests for DBMS, Hadoop, and HadoopDB]
HadoopDB Commercialization
Wanted to build a real system
Released initial prototype open source
Blog post about HadoopDB got slashdotted, led to VC interest
– Initially reluctant to take VC money
Posted a job for an engineer to help build out open source codebase
– Low quality of applicants
– Not enough government funding for more than 1 engineer
HadoopDB Commercialization
VC money was the only route to building a complete system
– Launched with $1.5 million in seed money in 2010
– Raised an additional $8 million in 2011
– Raised an additional $6.75 million in 2012
Commercializing HadoopDB: Where does development time go?
Work we expected to transition from research prototype to commercial product:
– SQL coverage
– Failover for high availability
– Authorization / authentication
– Error codes / messages for every situation
– Installer
– Documentation
But what about unexpected work?
Infrastructure Tools
Distributed systems are unwieldy
– For a cluster of size n, many things need to be done n times
Automated tools are critical
Just to try some new code, the following needs to happen (see the sketch below):
– Build product
– Provision a cluster
– Deploy build to cluster
– Install dependencies (Hadoop distro, libraries, etc.)
– Install Hadapt with correct configuration parameters for that cluster
– Generate data or copy data files to cluster for load
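A minimal sketch of the kind of automation this implies; the host names, package names, and install paths are hypothetical, and this is not Hadapt's actual tooling:

```python
#!/usr/bin/env python3
"""Sketch of per-node cluster automation: everything below runs once per node,
which is exactly the "do it n times" problem automated tools solve."""
import subprocess

NODES = [f"node{i:02d}.cluster.local" for i in range(1, 5)]  # hypothetical hosts

def run_on(host, command):
    """Run one shell command on one node over ssh; fail loudly on error."""
    subprocess.run(["ssh", host, command], check=True)

def deploy(build_tarball):
    for host in NODES:
        # Copy the build to every node.
        subprocess.run(["scp", build_tarball, f"{host}:/tmp/"], check=True)
        # Install dependencies (hypothetical package), unpack, and write
        # per-cluster configuration.
        run_on(host, "sudo yum -y install java-1.8.0-openjdk")
        run_on(host, f"tar xzf /tmp/{build_tarball} -C /opt")
        run_on(host, f"echo 'cluster.size={len(NODES)}' "
                     "| sudo tee /opt/hadapt/conf/site.conf")

if __name__ == "__main__":
    deploy("hadapt-build.tar.gz")  # hypothetical build artifact
```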
Upgrader
Start-ups need to move fast
Hadapt delivers a new release every couple of months
Upgrade process must be easy
Downgrade (!) process must be easy
Changes in storage layout or APIs add complexity to the process
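One common pattern for keeping both directions cheap despite storage-layout changes is a pair of forward and reverse migrations per release, keyed off a recorded schema version. A minimal sketch, with sqlite3 standing in for the product's storage and a made-up migration:

```python
import sqlite3

# Hypothetical forward/reverse migration pair, one per release.
MIGRATIONS = {
    2: ("ALTER TABLE visits RENAME COLUMN revenue TO ad_revenue",
        "ALTER TABLE visits RENAME COLUMN ad_revenue TO revenue"),
}

def current_version(db):
    db.execute("CREATE TABLE IF NOT EXISTS schema_version (v INT)")
    row = db.execute("SELECT v FROM schema_version").fetchone()
    return row[0] if row else 1

def set_version(db, v):
    db.execute("DELETE FROM schema_version")
    db.execute("INSERT INTO schema_version VALUES (?)", (v,))

def upgrade(db, target):
    for v in range(current_version(db) + 1, target + 1):
        db.execute(MIGRATIONS[v][0])   # apply forward step
        set_version(db, v)

def downgrade(db, target):
    for v in range(current_version(db), target, -1):
        db.execute(MIGRATIONS[v][1])   # apply reverse step
        set_version(db, v - 1)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE visits (url TEXT, revenue INT)")
upgrade(db, 2)    # release N -> N+1
downgrade(db, 1)  # and back, because downgrades must be easy too
```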
UDF Support
HadoopDB supported both MapReduce and SQL as interfaces
MapReduce was not a sufficient replacement for database UDFs
Hadapt provides an "HDK" that enables analysts to create functions that are invokable from SQL
– Integrates with 3rd party tools
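To illustrate the concept of a SQL-invokable function (not the HDK's actual API), sqlite3's create_function can stand in: register a function once, then call it from plain SQL:

```python
import sqlite3

def domain_of(url):
    """Toy UDF: strip a URL down to its domain."""
    return url.split("/")[0]

db = sqlite3.connect(":memory:")
# Real sqlite3 API, standing in here for the HDK's registration mechanism.
db.create_function("domain_of", 1, domain_of)
db.execute("CREATE TABLE visits (url TEXT, revenue INT)")
db.executemany("INSERT INTO visits VALUES (?, ?)",
               [("a.com/x", 5), ("a.com/y", 2), ("b.com/z", 3)])
# The custom function is now invokable from ordinary SQL:
for row in db.execute(
        "SELECT domain_of(url), SUM(revenue) FROM visits GROUP BY 1"):
    print(row)  # ('a.com', 7), ('b.com', 3)
```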
Search
Hadoop is increasingly used as a data landfill:
– Granular data
– Messy data
– Unprocessed data
A database for Hadoop cannot assume all data fits in rows and columns
Search support was the first thing we built after our A round of financing
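To illustrate mixing a search predicate with relational SQL in one engine (Hadapt's actual search integration is not shown), here is a sketch using sqlite's FTS5 module, assuming your Python build includes it:

```python
import sqlite3

# Full-text search and relational filtering in one query; sqlite's FTS5
# stands in for a search-enabled analytical database.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE logs USING fts5(host, body)")
db.executemany("INSERT INTO logs VALUES (?, ?)",
               [("node01", "disk error on /dev/sda"),
                ("node02", "job finished cleanly"),
                ("node01", "retrying after network error")])
# A search predicate (MATCH) combined with an ordinary SQL condition:
for row in db.execute(
        "SELECT host, body FROM logs "
        "WHERE logs MATCH 'error' AND host = 'node01'"):
    print(row)
```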
Is doing a start-up pre-tenure a good idea?
Spinning off a company takes a ton of time
– At first, you are the ONLY person who can give a complete description of the technical vision, so:
– You're talking to all the VCs to fundraise
– You're talking to all the prospective customers
– You're talking to all the prospective employees
– Lots of travel
– Eventually, others can help with the above, but a good CEO will not let you escape
Ups and downs can be mentally draining
If you do a start-up you will:
Publish less
Advise fewer students
Pursue fewer grants
Avoid university committees as much as possible
Skip faculty meetings (usually because of travel)
Attend fewer academic conferences
At the end of the day
Unless there are changes (see SIGMOD panel from June):
– Publishing a lot is the best way to get tenure
– Spinning off a company necessarily detracts from university measurable objectives
Doing a start-up is putting all your eggs in one basket
– If successful, you have a lot of impact you can point to
– If not successful, you have nothing
– A lot of market forces that you have no control over determine success