Maintainable Cloud Architecture of Hadoop

Post on 15-Jan-2017

Maintainable Cloud Architecture of Hadoop

Kai Sasaki Treasure Data Inc.

Who am I?

• Kai Sasaki (佐々木 海)

• @Lewuathe at Twitter, GitHub

• Treasure Data Inc. Software Engineer

• Contributor to Hadoop and Spark

Hadoop in Treasure Data

Cloud-based Data warehousing service

Hadoop is the core of Treasure Data

Hadoop on Cloud

1. Features provided by AWS, IDCF, Heroku, etc.

2. Fast-growing reliability and integrity

Maintainability of Middleware

Agenda

• Maintainability of Distributed System

• Our Challenges
  • Stateless Hive Metastore
  • Cloud Storage for Hadoop
  • Multiple Hadoop Version Management
  • Regression Test for Hive Queries
  • REST API for Hadoop
  • Workflow Integration

• What we should keep in mind

Maintainability

We think high maintainability is achieved by…

• Stateless

• Mobility

• Queueing

Stateless

• Stateless Hive metastore

• Cloud Storage for Hadoop

Stateless Hive MS

Stateful Hive MS

[Diagram: Driver, Metastore, MySQL]

Requires maintaining an RDBMS only for the Metastore

Stateless Hive MS

[Diagram: Driver, Metastore, Derby; Worker; label: Submit DDL request]

Aggregate stateful points

Treasure Data API
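
One way to read the stateless design above: if the metastore sits on a disposable Derby database, table definitions have to be replayed as DDL before a query runs. The following is a minimal sketch under that assumption, not Treasure Data's implementation. `TableSchema`, `build_ddl`, `bootstrap_statements`, and the sample schema are hypothetical names invented for illustration, and the schema source (an external API that holds the state) is assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TableSchema:
    """Hypothetical table description, as an external API might return it."""
    database: str
    name: str
    columns: List[Tuple[str, str]]  # (column name, Hive type)


def build_ddl(schema: TableSchema) -> str:
    """Render a CREATE TABLE statement to replay into an empty metastore."""
    cols = ", ".join(f"`{col}` {hive_type}" for col, hive_type in schema.columns)
    return f"CREATE TABLE IF NOT EXISTS `{schema.database}`.`{schema.name}` ({cols})"


def bootstrap_statements(schemas: List[TableSchema]) -> List[str]:
    """DDL to submit before a query runs: databases first, then tables."""
    databases = sorted({s.database for s in schemas})
    ddl = [f"CREATE DATABASE IF NOT EXISTS `{db}`" for db in databases]
    ddl.extend(build_ddl(s) for s in schemas)
    return ddl


if __name__ == "__main__":
    # Sample schema is made up; a real worker would fetch it from the API.
    schemas = [TableSchema("sample_db", "access_log",
                           [("time", "BIGINT"), ("path", "STRING")])]
    for statement in bootstrap_statements(schemas):
        print(statement)
```

Because every statement uses IF NOT EXISTS, replaying the same list against a fresh or partially populated metastore is harmless, which suits a worker that can be replaced at any time.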

Cloud Storage for Hadoop

PlazmaDB

• Data Connector: S3, Redshift, MySQL, PostgreSQL, Salesforce and more

• SDK: iOS, Android, JavaScript, Unity

• Bulk Import td client

• ...

msgpack

Hadoop

Stateful

PlazmaDB

[Diagram: PostgreSQL; Amazon RDS; msgpack; four "S3 or Riak" stores; labels: Transaction, Immutable]

Mobility

• Multiple Hadoop Version Management

• Regression Test for Hive Queries

Multiple Hadoop Version Management

Multiple Version Management

[Diagram: CDH, HDP, Apache; three clients]

Tough Operation

[Diagram: CDH, HDP, Apache; a single Worker; label: switching]

[Diagram: CDH package, HDP package, Apache package; S3 with /test, /stable, ...; Worker; label: download]
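
The package layout above can be sketched as a small lookup that a worker might perform before downloading. This is only an illustration: the slides show three packages and the S3 prefixes /test and /stable, while the bucket name, the file naming scheme, and the names `PackageRef` and `resolve_package` are assumptions made up for this example.

```python
from dataclasses import dataclass

DISTRIBUTIONS = ("cdh", "hdp", "apache")  # distributions named in the slides
CHANNELS = ("test", "stable")             # S3 prefixes named in the slides


@dataclass(frozen=True)
class PackageRef:
    """Location of one Hadoop package in object storage."""
    bucket: str
    key: str

    @property
    def url(self) -> str:
        return f"s3://{self.bucket}/{self.key}"


def resolve_package(distribution: str, channel: str = "stable",
                    bucket: str = "example-hadoop-packages") -> PackageRef:
    """Map (distribution, channel) to an object location.

    The bucket name and the '<channel>/<distribution>-package.tar.gz'
    layout are hypothetical, chosen only to make the idea concrete.
    """
    if distribution not in DISTRIBUTIONS:
        raise ValueError(f"unknown distribution: {distribution!r}")
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel!r}")
    return PackageRef(bucket, f"{channel}/{distribution}-package.tar.gz")


if __name__ == "__main__":
    print(resolve_package("cdh").url)
    print(resolve_package("apache", "test").url)
```

Keeping the choice of distribution and channel in one pure function means switching a worker to another Hadoop version is a change of arguments rather than a reinstall.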

Regression Test for Hive

• New features, version upgrades, and migrations must be introduced without regressions

• We run integration system tests and regression tests for Hive queries

[Diagram: CDH, HDP, Apache; Worker; System Test Repository; Hadoop Repository; S3 (Apache package); CircleCI logo, image source: http://blog.circleci.com/meet-our-new-logo/]
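
A regression check for Hive queries can be sketched as running the same query against a baseline and against each candidate version, then comparing the results. The sketch below assumes each runner is a callable that takes a query string and returns rows. `QueryRunner`, `normalize`, and `find_regressions` are hypothetical names, and the slides do not describe the actual test harness.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple
QueryRunner = Callable[[str], Iterable[Row]]


def normalize(rows: Iterable[Row]) -> Counter:
    """Treat a result as a multiset, since row order is unspecified without ORDER BY."""
    return Counter(tuple(row) for row in rows)


def find_regressions(queries: List[str], baseline: QueryRunner,
                     candidates: Dict[str, QueryRunner]) -> List[Tuple[str, str]]:
    """Return (candidate name, query) pairs whose results differ from the baseline."""
    failures = []
    for query in queries:
        expected = normalize(baseline(query))
        for name, run in candidates.items():
            if normalize(run(query)) != expected:
                failures.append((name, query))
    return failures


if __name__ == "__main__":
    # Stub runners stand in for real clusters in this illustration.
    stable = lambda q: [(1, "a"), (2, "b")]
    upgraded = lambda q: [(2, "b"), (1, "a")]
    print(find_regressions(["SELECT id, name FROM t"], stable, {"candidate": upgraded}))
```

Comparing multisets rather than lists avoids false alarms when two versions return the same rows in a different order.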

Queueing

• REST API for Hadoop

• RDBMS-based Queue Management System

REST API for Hadoop

[Diagram: CDH, HDP, Apache; Worker; PerfectQueue; Hadoop Job Server; REST API; Presto; API]

RDBMS-based Queue Management System

[Diagram: three Clients; PerfectQueue; Hadoop Job Server; Worker; CDH, HDP, Apache]

PerfectQueue

• Highly available distributed queue built on RDBMS

• Amazon SQS-like API

• Resource scheduling for multi-tenancy

• Graceful and live restarting

https://github.com/treasure-data/perfectqueue
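
To make the idea of a queue built on an RDBMS concrete, here is a minimal sketch with an SQS-style visibility timeout. It is not PerfectQueue's schema or API. SQLite stands in for the RDBMS, and the table layout and the functions `create_queue`, `submit`, `acquire`, and `finish` are assumptions made up for this example.

```python
import sqlite3
import time
from typing import Optional, Tuple


def create_queue(conn: sqlite3.Connection) -> None:
    """Create the task table; a task is invisible until `available_at`."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " available_at REAL NOT NULL)"
    )
    conn.commit()


def submit(conn: sqlite3.Connection, payload: str,
           now: Optional[float] = None) -> int:
    """Persist a request so it survives worker restarts."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "INSERT INTO tasks (payload, available_at) VALUES (?, ?)", (payload, now))
    conn.commit()
    return cur.lastrowid


def acquire(conn: sqlite3.Connection, timeout: float = 60.0,
            now: Optional[float] = None) -> Optional[Tuple[int, str]]:
    """Claim the oldest visible task and hide it for `timeout` seconds."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, payload FROM tasks WHERE available_at <= ?"
            " ORDER BY id LIMIT 1", (now,)).fetchone()
        if row is None:
            return None
        # The conditional update fails if another worker claimed the row first.
        claimed = conn.execute(
            "UPDATE tasks SET available_at = ? WHERE id = ? AND available_at <= ?",
            (now + timeout, row[0], now))
        if claimed.rowcount != 1:
            return None
    return row[0], row[1]


def finish(conn: sqlite3.Connection, task_id: int) -> None:
    """Delete a completed task so it is never retried."""
    with conn:
        conn.execute("DELETE FROM tasks WHERE id = ?", (task_id,))
```

If a worker dies before calling `finish`, the task becomes visible again once the timeout passes and another worker can pick it up, which is what allows workers to be restarted without losing requests.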

What we should keep in mind

• Stateless: Delegate responsibility to cloud systems

• Mobility: Look ahead to version upgrades and migrations

• Queueing: Make each request persistent

Recap

• Maintainability of Distributed System

• Our Challenges
  • Stateless Hive Metastore
  • Cloud Storage for Hadoop
  • Multiple Hadoop Version Management
  • Regression Test for Hive Queries
  • REST API for Hadoop
  • Workflow Integration

• What we should keep in mind

https://www.treasuredata.com/