From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Daniel Abadi, Yale University
August 28th, 2013
Twitter: @daniel_abadi


VLDB 2013 Early Career Research Contribution Award Presentation

Abstract: Four years ago at VLDB 2009, a paper was published about a research prototype, called HadoopDB, that attempted to transform Hadoop --- a batch-oriented scalable system designed for processing unstructured data --- into a full-fledged parallel database system that can achieve real-time (interactive) query responses across both structured and unstructured data. In 2010 it was commercialized by Hadapt, a start-up that was formed to accelerate the engineering of the HadoopDB ideas, and to harden the codebase for deployment in real-world, mission-critical applications. In this talk I will give an overview of HadoopDB, and how it combines ideas from the Hadoop and database system communities. I will then describe how the project transitioned from a research prototype written by PhD students at Yale University into enterprise-ready software written by a team of experienced engineers. We will examine particular technical features that are required in enterprise Hadoop deployments, and technical challenges that we ran into while making HadoopDB robust enough to be deployed in the real world. The talk will conclude with an analysis of how starting a company impacts the tenure process, and some thoughts for graduate students and junior faculty considering a similar path.

Transcript of From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Page 1: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Daniel Abadi

Yale University

August 28th, 2013

Twitter: @daniel_abadi

Page 2: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Overview of Talk

Motivation for HadoopDB

Overview of HadoopDB

Overview of the commercialization process

Technical features missing from HadoopDB that Hadapt needed to implement

What does this mean for tenure?

Page 3: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Situation in 2008

Hadoop starting to take off as a “Big Data” processing platform

Parallel database startups such as Netezza, Vertica, and Greenplum gaining traction for “Big Data” analysis

2 schools of thought
– School 1: They are on a collision course
– School 2: They are complementary technologies

Page 4: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

From 10,000 feet Hadoop and Parallel Database Systems are Quite Similar

Both are suitable for large-scale data processing
– I.e., analytical processing workloads
– Bulk loads
– Not optimized for transactional workloads
– Queries over large amounts of data

Both can handle both relational and non-relational queries (DBMS via UDFs)

Page 5: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

SIGMOD 2009 Paper

Benchmarked Hadoop vs. 2 parallel database systems
– Mostly focused on performance differences
– Measured differences in load and query time for some common data processing tasks
– Used a Web analytics benchmark whose goal was to be representative of tasks that:
  - Both should excel at
  - Hadoop should excel at
  - Databases should excel at

Page 6: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Hardware Setup

100 node cluster

Each node:
– 2.4 GHz Core 2 Duo processors
– 4 GB RAM
– Two 250 GB SATA HDs (74 MB/sec sequential I/O)

Dual GigE switches, each with 50 nodes
– 128 Gbit/sec fabric
– Connected by a 64 Gbit/sec ring

Page 7: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Join Task

[Chart: Join task time in seconds (y-axis 0–1600) for Vertica, DBMS-X, and Hadoop at 10, 25, 50, and 100 nodes]

Page 8: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

UDF Task

[Chart: UDF task time in seconds (y-axis 0–1200) for DBMS and Hadoop at 10, 25, 50, and 100 nodes; the DBMS clearly doesn't scale]

Task: calculate PageRank over a set of HTML documents
– Performed via a UDF
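To give a flavor of the task, here is a minimal, single-process sketch of one PageRank iteration written in MapReduce style (map, shuffle, reduce), which is the structure Hadoop parallelizes across nodes. The toy link graph, damping factor, and function names are illustrative assumptions, not the benchmark's actual code or data.

```python
# Minimal sketch (not the benchmark's actual UDF): one PageRank iteration
# expressed in MapReduce style, the way Hadoop would parallelize it.
# The toy link graph and damping factor below are illustrative assumptions.
from collections import defaultdict

DAMPING = 0.85

# page -> (current rank, outgoing links); a stand-in for parsed HTML documents
graph = {
    "a.html": (1.0, ["b.html", "c.html"]),
    "b.html": (1.0, ["c.html"]),
    "c.html": (1.0, ["a.html"]),
}

def map_phase(page, rank, links):
    """Emit a share of this page's rank to each page it links to,
    plus the link structure itself so the reducer can rebuild the graph."""
    for target in links:
        yield target, ("rank", rank / len(links))
    yield page, ("links", links)

def reduce_phase(page, values):
    """Sum the incoming rank contributions and apply the damping factor."""
    incoming = sum(v for kind, v in values if kind == "rank")
    links = next((v for kind, v in values if kind == "links"), [])
    new_rank = (1 - DAMPING) + DAMPING * incoming
    return page, (new_rank, links)

# Shuffle: group map output by key, exactly what Hadoop does between phases
grouped = defaultdict(list)
for page, (rank, links) in graph.items():
    for key, value in map_phase(page, rank, links):
        grouped[key].append(value)

graph = dict(reduce_phase(page, values) for page, values in grouped.items())
print({page: round(rank, 3) for page, (rank, _) in graph.items()})
```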

Page 9: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Scalability

Except for UDFs, all systems scale near-linearly

BUT: only ran on 100 nodes

As nodes approach 1000, other effects come into play
– Faults go from being rare to not so rare
– It is nearly impossible to maintain homogeneity at scale

Page 10: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Fault Tolerance and Cluster Heterogeneity Results

[Chart: percentage slowdown (y-axis 0–200) under the fault tolerance and slowdown tolerance tests, for DBMS and Hadoop]

Database systems restart the entire query upon a single node failure, and do not adapt if a node is running slowly

Page 11: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Benchmark Conclusions

Hadoop had scalability advantages
– Checkpointing allows for better fault tolerance
– Runtime scheduling allows for better tolerance of unexpectedly slow nodes
– Better parallelization of UDFs

Hadoop was consistently less efficient for structured, relational data
– Reasons mostly non-fundamental
– Needed better support for compression and direct operation on compressed data
– Needed better support for indexing
– Needed better support for co-partitioning of datasets

Page 12: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Best of Both Worlds Possible?

[Diagram: Hadoop and a parallel database system linked by a connector]

Page 13: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Problems With the Connector Approach

Network delays and bandwidth limitations

Data silos

Multiple vendors

Fundamentally wasteful
– Very similar architectures
  - Both partition data across a cluster
  - Both parallelize processing across the cluster
  - Both optimize for local data processing (to minimize network costs)

Page 14: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Unified System

Two options:
– Bring Hadoop technology to a parallel database system
  - Problem: Hadoop is more than just technology
– Bring parallel database system technology to Hadoop
  - Far more likely to have impact

Page 15: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Adding DBMS Technology to Hadoop

Option 1: Keep Hadoop's storage and build a parallel executor on top of it
– Cloudera Impala (which is sort of a combination of the Hadoop++ and NoDB research projects)
– Need better storage formats (Trevni and Parquet are promising)
– Updates and deletes are hard (Impala doesn't support them)

Option 2: Use relational storage on each node
– Accelerates "time to complete system"
– We chose this option for HadoopDB
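The essence of option 2, elaborated in the architecture and SMS Planner slides that follow, is to push as much of each SQL query as possible into the node-local databases and use Hadoop only to combine the partial results. Below is a minimal sketch of that split-then-merge idea, using SQLite files as stand-ins for the per-node stores; the file names, table schema, and query are illustrative assumptions, not HadoopDB code.

```python
# Minimal sketch of query pushdown, not actual HadoopDB/Hadapt code.
# Each SQLite file stands in for the relational store on one Hadoop node;
# the same SQL fragment runs locally on every "node" and a final step
# (played by MapReduce in HadoopDB) merges the partial aggregates.
import sqlite3
from collections import defaultdict

NODE_DBS = ["node0.db", "node1.db", "node2.db"]  # hypothetical per-node stores
PUSHED_DOWN_SQL = "SELECT category, SUM(revenue) FROM sales GROUP BY category"

def load_toy_data():
    """Populate each node with a horizontal partition of an illustrative table."""
    rows = [("books", 10.0), ("music", 5.0), ("books", 7.5),
            ("music", 2.5), ("games", 4.0), ("books", 1.0)]
    for i, path in enumerate(NODE_DBS):
        conn = sqlite3.connect(path)
        conn.execute("DROP TABLE IF EXISTS sales")
        conn.execute("CREATE TABLE sales (category TEXT, revenue REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows[i::len(NODE_DBS)])
        conn.commit()
        conn.close()

def map_side(path):
    """'Map' task: run the pushed-down SQL inside the node-local database."""
    conn = sqlite3.connect(path)
    partial = conn.execute(PUSHED_DOWN_SQL).fetchall()
    conn.close()
    return partial

def reduce_side(partials):
    """'Reduce' task: merge per-node partial aggregates into the final answer."""
    totals = defaultdict(float)
    for partial in partials:
        for category, subtotal in partial:
            totals[category] += subtotal
    return dict(totals)

load_toy_data()
print(reduce_side(map_side(path) for path in NODE_DBS))
```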

Page 16: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

HadoopDB Architecture

Page 17: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

SMS Planner

Page 18: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

TPC-H Benchmark Results

Page 19: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

UDF Task

[Chart: UDF task time in seconds (y-axis 0–800) for DBMS, Hadoop, and HadoopDB at 10, 25, and 50 nodes]

Page 20: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Fault Tolerance and Cluster Heterogeneity Results

[Chart: percentage slowdown (y-axis 0–200) under the fault tolerance and slowdown tolerance tests, for DBMS, Hadoop, and HadoopDB]

Page 21: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

HadoopDB Commercialization

Wanted to build a real system

Released initial prototype open source

Blog post about HadoopDB got slashdotted, which led to VC interest
– Initially reluctant to take VC money

Posted a job for an engineer to help build out the open source codebase
– Low quality of applicants
– Not enough government funding for more than 1 engineer

Page 22: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

HadoopDB Commercialization

VC money was the only route to building a complete system
– Launched with $1.5 million in seed money in 2010
– Raised an additional $8 million in 2011
– Raised an additional $6.75 million in 2012

Page 23: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Commercializing HadoopDB: Where does development time go?

Work we expected in the transition from research prototype to commercial product:
– SQL coverage
– Failover for high availability
– Authorization / authentication
– Error codes / messages for every situation
– Installer
– Documentation

But what about unexpected work?

Page 24: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Infrastructure Tools

Distributed systems are unwieldy
– For a cluster of size n, many things need to be done n times

Automated tools are critical

Just to try some new code, the following needs to happen (sketched in code after this list):
– Build the product
– Provision a cluster
– Deploy the build to the cluster
– Install dependencies (Hadoop distro, libraries, etc.)
– Install Hadapt with the correct configuration parameters for that cluster
– Generate data or copy data files to the cluster for load
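As a hedged illustration of why automated tools are critical, the sketch below strings the steps above together as a script. The host names, build command, installer scripts, and paths are invented placeholders, not Hadapt's actual tooling; it runs in dry-run mode unless explicitly told otherwise.

```python
# Hypothetical sketch of automating the per-cluster chores listed above;
# the host names, commands, and file paths are invented placeholders,
# not Hadapt's actual tooling. Dry-run by default so it is safe to execute.
import subprocess
import sys

NODES = [f"node{i:02d}.example.internal" for i in range(4)]  # placeholder hosts
BUILD_ARTIFACT = "hadapt-build.tar.gz"                       # placeholder artifact
DRY_RUN = "--execute" not in sys.argv

def run(cmd):
    """Print every command; only execute when --execute is passed."""
    print("  $", " ".join(cmd))
    if not DRY_RUN:
        subprocess.run(cmd, check=True)

def on_every_node(remote_cmd):
    """The 'n times for n nodes' problem: the same step repeated per host."""
    for host in NODES:
        run(["ssh", host, remote_cmd])

print("1. Build product")
run(["./gradlew", "assemble"])                    # placeholder build command

print("2-3. Provision cluster and deploy build")
for host in NODES:
    run(["scp", BUILD_ARTIFACT, f"{host}:/opt/"])

print("4. Install dependencies (Hadoop distro, libraries, ...)")
on_every_node("sudo install-hadoop-distro.sh")    # placeholder installer

print("5. Install Hadapt with per-cluster configuration")
on_every_node(f"sudo install-hadapt.sh --nodes {len(NODES)}")

print("6. Copy data files to the cluster for load")
run(["hadoop", "fs", "-put", "sample-data/", "/benchmarks/input"])
```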

Page 25: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Upgrader

Start-ups need to move fast

Hadapt delivers a new release every couple of months

Upgrade process must be easy

Downgrade (!) process must be easy

Changes in storage layout or APIs add complexity to the process

Page 26: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

UDF Support

HadoopDB supported both MapReduce and SQL as interfaces

MapReduce was not a sufficient replacement for database UDFs

Hadapt provides an "HDK" that enables analysts to create functions that are invokable from SQL
– Integrates with 3rd-party tools
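To make "functions invokable from SQL" concrete, here is a minimal sketch of the general pattern using SQLite's real create_function hook. It is not the HDK API (which the talk does not show); the domain_of function and the clicks table are purely illustrative.

```python
# The general idea of a SQL-invokable user-defined function, illustrated with
# SQLite's create_function API; this is NOT the Hadapt HDK, just a minimal
# stand-in showing what "write a function, call it from SQL" means.
import re
import sqlite3

def domain_of(url):
    """Toy UDF: pull the host name out of a URL column."""
    match = re.search(r"https?://([^/]+)", url or "")
    return match.group(1) if match else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (url TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?)",
                 [("http://example.com/a",), ("http://example.org/b",),
                  ("http://example.com/c",)])

# Register the Python function so SQL queries can call it by name
conn.create_function("domain_of", 1, domain_of)

for row in conn.execute(
        "SELECT domain_of(url), COUNT(*) FROM clicks GROUP BY 1"):
    print(row)
```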

Page 27: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Search

Hadoop is increasingly used as a data landfill
– Granular data
– Messy data
– Unprocessed data

A database for Hadoop cannot assume all data fits in rows and columns

Search support was the first thing we built after our A round of financing

Page 28: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

Is doing a start-up pre-tenure a good idea?

Spinning off a company takes a ton of time
– At first, you are the ONLY person who can give a complete description of the technical vision, so:
  - You're talking to all the VCs to fundraise
  - You're talking to all the prospective customers
  - You're talking to all the prospective employees
– Lots of travel
– Eventually, others can help with the above, but a good CEO will not let you escape

Ups and downs can be mentally draining

Page 29: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

If you do a start-up you will:

Publish less

Advise fewer students

Pursue fewer grants

Avoid university committees as much as possible

Skip faculty meetings (usually because of travel)

Attend fewer academic conferences

Page 30: From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

At the end of the day

Unless there are changes (see the SIGMOD panel from June):
– Publishing a lot is the best way to get tenure
– Spinning off a company necessarily detracts from the university's measurable objectives

Doing a start-up is putting all your eggs in one basket
– If successful, you have a lot of impact you can point to
– If not successful, you have nothing
– A lot of market forces that you have no control over determine success