From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB
paper into Real World Deployments
Daniel Abadi
Yale University
August 28th, 2013
Twitter: @daniel_abadi
Overview of Talk
Motivation for HadoopDB
Overview of HadoopDB
Overview of the commercialization process
Technical features missing from HadoopDB that Hadapt needed to implement
What does this mean for tenure?
Situation in 2008
Hadoop starting to take off as a “Big Data” processing platform
Parallel database startups such as Netezza, Vertica, and Greenplum gaining traction for “Big Data” analysis
2 schools of thought:
– School 1: They are on a collision course
– School 2: They are complementary technologies
From 10,000 feet, Hadoop and Parallel Database Systems are Quite Similar
Both are suitable for large-scale data processing:
– i.e., analytical processing workloads
– Bulk loads
– Not optimized for transactional workloads
– Queries over large amounts of data
– Both can handle both relational and non-relational queries (DBMS via UDFs)
SIGMOD 2009 Paper
Benchmarked Hadoop vs. 2 parallel database systems
– Mostly focused on performance differences
– Measured differences in load and query time for some common data processing tasks
– Used a Web analytics benchmark whose goal was to be representative of tasks that:
– Both should excel at
– Hadoop should excel at
– Databases should excel at
Hardware Setup
100-node cluster
Each node:
– 2.4 GHz Core 2 Duo processors
– 4 GB RAM
– 2 × 250 GB SATA HDs (74 MB/sec sequential I/O)
Dual GigE switches, each with 50 nodes
– 128 Gbit/sec fabric
– Connected by a 64 Gbit/sec ring
Join Task
[Figure: Join task completion time (seconds) for Vertica, DBMS-X, and Hadoop on 10, 25, 50, and 100 nodes]
UDF Task
[Figure: UDF task completion time (seconds) for DBMS and Hadoop on 10, 25, 50, and 100 nodes]
Calculate PageRank over a set of HTML documents, performed via a UDF
The DBMS clearly doesn't scale
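To make the UDF task concrete, here is a minimal, self-contained sketch of one PageRank iteration written in map/reduce style; the toy graph and damping factor are illustrative, not the benchmark's actual data:

```python
# One PageRank iteration in map/reduce style: the kind of computation that is
# awkward in pure SQL but natural as a UDF or MapReduce job.
# The toy link graph and damping factor below are illustrative only.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {page: 1.0 / len(links) for page in links}
DAMPING = 0.85

def map_phase(rank, links):
    """Each page emits a share of its rank to every page it links to."""
    for page, outlinks in links.items():
        for target in outlinks:
            yield target, rank[page] / len(outlinks)

def reduce_phase(contributions, n):
    """Sum the incoming contributions per page and apply the damping factor."""
    new_rank = {page: (1 - DAMPING) / n for page in links}
    for target, share in contributions:
        new_rank[target] += DAMPING * share
    return new_rank

for _ in range(20):  # iterate toward convergence
    rank = reduce_phase(map_phase(rank, links), len(links))
print(rank)
```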
Scalability
Except for UDFs, all systems scale near-linearly
BUT: only ran on 100 nodes
As nodes approach 1,000, other effects come into play:
– Faults go from being rare to not so rare
– It is nearly impossible to maintain homogeneity at scale
Fault Tolerance and Cluster Heterogeneity Results
[Figure: Percentage slowdown under fault-tolerance and slowdown-tolerance tests for DBMS and Hadoop]
Database systems restart entire query upon a single node failure, and do not adapt if a node is running slowly
Benchmark Conclusions
Hadoop had scalability advantages:
– Checkpointing allows for better fault tolerance
– Runtime scheduling allows for better tolerance of unexpectedly slow nodes
– Better parallelization of UDFs
Hadoop was consistently less efficient for structured, relational data:
– Reasons mostly non-fundamental
– Needed better support for compression and direct operation on compressed data
– Needed better support for indexing
– Needed better support for co-partitioning of datasets
Best of Both Worlds Possible?
Connector
Problems With the Connector Approach
Network delays and bandwidth limitations
Data silos
Multiple vendors
Fundamentally wasteful – very similar architectures:
– Both partition data across a cluster
– Both parallelize processing across the cluster
– Both optimize for local data processing (to minimize network costs)
Unified System
Two options:
– Bring Hadoop technology to a parallel database system
Problem: Hadoop is more than just technology
– Bring parallel database system technology to Hadoop
Far more likely to have impact
Adding DBMS Technology to Hadoop
Option 1: Keep Hadoop's storage and build a parallel executor on top of it
– Cloudera Impala (which is sort of a combination of the Hadoop++ and NoDB research projects)
– Need better storage formats (Trevni and Parquet are promising)
– Updates and deletes are hard (Impala doesn't support them)
Option 2: Use relational storage on each node
– Accelerates "time to complete system"
– We chose this option for HadoopDB (a minimal sketch of the idea follows)
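To make Option 2 concrete, here is a minimal sketch with Python's built-in sqlite3 standing in for the node-local PostgreSQL instances HadoopDB actually used; the table and data are made up for illustration. Each worker holds a partition in a local relational store and answers SQL fragments locally, so most processing happens next to the data:

```python
import sqlite3

# Each "node" holds a horizontal partition of a table in a local relational
# store (sqlite3 stands in here for HadoopDB's per-node PostgreSQL).
NODES = 4
visits = [("a.com", 5), ("b.com", 3), ("a.com", 2), ("c.com", 9),
          ("b.com", 1), ("c.com", 4), ("a.com", 7), ("d.com", 6)]

# Partition rows by hash, as a parallel DBMS or HadoopDB's loader would.
shards = [sqlite3.connect(":memory:") for _ in range(NODES)]
for db in shards:
    db.execute("CREATE TABLE visits (url TEXT, revenue INT)")
for url, rev in visits:
    shards[hash(url) % NODES].execute(
        "INSERT INTO visits VALUES (?, ?)", (url, rev))

# Each node evaluates the SQL fragment locally (the "map" side) ...
partials = []
for db in shards:
    partials.extend(db.execute(
        "SELECT url, SUM(revenue) FROM visits GROUP BY url"))

# ... and a final merge combines per-node partial aggregates (the "reduce" side).
totals = {}
for url, rev in partials:
    totals[url] = totals.get(url, 0) + rev
print(totals)  # {'a.com': 14, 'b.com': 4, 'c.com': 13, 'd.com': 6}
```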
HadoopDB Architecture
SMS Planner
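The SMS ("SQL to MapReduce to SQL") planner extends Hive: it compiles a SQL query into MapReduce jobs, pushing as much of the work as possible into the node-local databases. A toy illustration of that split, with hypothetical names (sms_plan, reduce_merge), might look like:

```python
# A toy version of the SMS planner idea: compile one logical aggregate query
# into (1) a SQL fragment each node runs against its local database (the map
# phase) and (2) a merge step over the partial results (the reduce phase).
# Function and column names are illustrative, not the real planner's API.

def sms_plan(table, group_key, agg_col):
    map_sql = (f"SELECT {group_key}, SUM({agg_col}) AS partial "
               f"FROM {table} GROUP BY {group_key}")

    def reduce_merge(partials):
        out = {}
        for key, val in partials:
            out[key] = out.get(key, 0) + val
        return out

    return map_sql, reduce_merge

map_sql, reduce_merge = sms_plan("visits", "url", "revenue")
print(map_sql)  # the fragment pushed down to every node's local DBMS
print(reduce_merge([("a.com", 7), ("a.com", 7), ("b.com", 4)]))
```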
TPC-H Benchmark Results
UDF Task
[Figure: UDF task completion time (seconds) for DBMS, Hadoop, and HadoopDB on 10, 25, and 50 nodes]
Fault Tolerance and Cluster Heterogeneity Results
[Figure: Percentage slowdown under fault-tolerance and slowdown-tolerance tests for DBMS, Hadoop, and HadoopDB]
HadoopDB Commercialization
Wanted to build a real system
Released initial prototype open source
Blog post about HadoopDB got slashdotted, led to VC interest
– Initially reluctant to take VC money
Posted a job for an engineer to help build out open source codebase
– Low quality of applicants
– Not enough government funding for more than 1 engineer
HadoopDB Commercialization
VC money was the only route to building a complete system
– Launched with $1.5 million in seed money in 2010
– Raised an additional $8 million in 2011
– Raised an additional $6.75 million in 2012
Commercializing HadoopDB: Where does development time go?
Work we expected to transition from research prototype to commercial product:
– SQL coverage
– Failover for high availability
– Authorization / authentication
– Error codes / messages for every situation
– Installer
– Documentation
But what about unexpected work?
Infrastructure Tools
Distributed systems are unwieldy
– For a cluster of size n, many things need to be done n times
Automated tools are critical
Just to try some new code, the following needs to happen (see the sketch below):
– Build product
– Provision a cluster
– Deploy build to cluster
– Install dependencies (Hadoop distro, libraries, etc.)
– Install Hadapt with correct configuration parameters for that cluster
– Generate data or copy data files to cluster for load
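A minimal sketch of the kind of automation this implies; the host names, package names, and install paths are hypothetical, and this is not Hadapt's actual tooling:

```python
#!/usr/bin/env python3
"""Sketch of per-node cluster automation: everything below runs once per node,
which is exactly the "do it n times" problem automated tools solve."""
import subprocess

NODES = [f"node{i:02d}.cluster.local" for i in range(1, 5)]  # hypothetical hosts

def run_on(host, command):
    """Run one shell command on one node over ssh; fail loudly on error."""
    subprocess.run(["ssh", host, command], check=True)

def deploy(build_tarball):
    for host in NODES:
        # Copy the build to every node.
        subprocess.run(["scp", build_tarball, f"{host}:/tmp/"], check=True)
        # Install dependencies (hypothetical package), unpack, and write
        # per-cluster configuration.
        run_on(host, "sudo yum -y install java-1.8.0-openjdk")
        run_on(host, f"tar xzf /tmp/{build_tarball} -C /opt")
        run_on(host, f"echo 'cluster.size={len(NODES)}' "
                     "| sudo tee /opt/hadapt/conf/site.conf")

if __name__ == "__main__":
    deploy("hadapt-build.tar.gz")  # hypothetical build artifact
```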
Upgrader
Start-ups need to move fast
Hadapt delivers a new release every couple of months
Upgrade process must be easy
Downgrade (!) process must be easy
Changes in storage layout or APIs add complexity to the process
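One common pattern for keeping both directions cheap despite storage-layout changes is a pair of forward and reverse migrations per release, keyed off a recorded schema version. A minimal sketch, with sqlite3 standing in for the product's storage and a made-up migration:

```python
import sqlite3

# Hypothetical forward/reverse migration pair, one per release.
MIGRATIONS = {
    2: ("ALTER TABLE visits RENAME COLUMN revenue TO ad_revenue",
        "ALTER TABLE visits RENAME COLUMN ad_revenue TO revenue"),
}

def current_version(db):
    db.execute("CREATE TABLE IF NOT EXISTS schema_version (v INT)")
    row = db.execute("SELECT v FROM schema_version").fetchone()
    return row[0] if row else 1

def set_version(db, v):
    db.execute("DELETE FROM schema_version")
    db.execute("INSERT INTO schema_version VALUES (?)", (v,))

def upgrade(db, target):
    for v in range(current_version(db) + 1, target + 1):
        db.execute(MIGRATIONS[v][0])   # apply forward step
        set_version(db, v)

def downgrade(db, target):
    for v in range(current_version(db), target, -1):
        db.execute(MIGRATIONS[v][1])   # apply reverse step
        set_version(db, v - 1)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE visits (url TEXT, revenue INT)")
upgrade(db, 2)    # release N -> N+1
downgrade(db, 1)  # and back, because downgrades must be easy too
```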
UDF Support
HadoopDB supported both MapReduce and SQL as interfaces
MapReduce was not a sufficient replacement for database UDFs
Hadapt provides an "HDK" that enables analysts to create functions that are invokable from SQL
– Integrates with 3rd party tools
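To illustrate the concept of a SQL-invokable function (not the HDK's actual API), sqlite3's create_function can stand in: register a function once, then call it from plain SQL:

```python
import sqlite3

def domain_of(url):
    """Toy UDF: strip a URL down to its domain."""
    return url.split("/")[0]

db = sqlite3.connect(":memory:")
# Real sqlite3 API, standing in here for the HDK's registration mechanism.
db.create_function("domain_of", 1, domain_of)
db.execute("CREATE TABLE visits (url TEXT, revenue INT)")
db.executemany("INSERT INTO visits VALUES (?, ?)",
               [("a.com/x", 5), ("a.com/y", 2), ("b.com/z", 3)])
# The custom function is now invokable from ordinary SQL:
for row in db.execute(
        "SELECT domain_of(url), SUM(revenue) FROM visits GROUP BY 1"):
    print(row)  # ('a.com', 7), ('b.com', 3)
```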
Search
Hadoop is increasingly used as a data landfill:
– Granular data
– Messy data
– Unprocessed data
A database for Hadoop cannot assume all data fits in rows and columns
Search support was the first thing we built after our A round of financing
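To illustrate mixing a search predicate with relational SQL in one engine (Hadapt's actual search integration is not shown), here is a sketch using sqlite's FTS5 module, assuming your Python build includes it:

```python
import sqlite3

# Full-text search and relational filtering in one query; sqlite's FTS5
# stands in for a search-enabled analytical database.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE logs USING fts5(host, body)")
db.executemany("INSERT INTO logs VALUES (?, ?)",
               [("node01", "disk error on /dev/sda"),
                ("node02", "job finished cleanly"),
                ("node01", "retrying after network error")])
# A search predicate (MATCH) combined with an ordinary SQL condition:
for row in db.execute(
        "SELECT host, body FROM logs "
        "WHERE logs MATCH 'error' AND host = 'node01'"):
    print(row)
```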
Is doing a start-up pre-tenure a good idea?
Spinning off a company takes a ton of time
– At first, you are the ONLY person who can give a complete description of the technical vision, so:
– You're talking to all the VCs to fundraise
– You're talking to all the prospective customers
– You're talking to all the prospective employees
– Lots of travel
– Eventually, others can help with the above, but a good CEO will not let you escape
Ups and downs can be mentally draining
If you do a start-up you will:
Publish less
Advise fewer students
Pursue fewer grants
Avoid university committees as much as possible
Skip faculty meetings (usually because of travel)
Attend fewer academic conferences
At the end of the day
Unless there are changes (see SIGMOD panel from June):
– Publishing a lot is the best way to get tenure
– Spinning off a company necessarily detracts from university measurable objectives
Doing a start-up is putting all your eggs in one basket
– If successful, you have a lot of impact you can point to
– If not successful, you have nothing
– A lot of market forces that you have no control over determine success