Maintainable cloud architecture_of_hadoop

Maintainable Cloud Architecture of Hadoop Kai Sasaki Treasure Data Inc.

Transcript of Maintainable cloud architecture_of_hadoop

Page 1: Maintainable cloud architecture_of_hadoop

Maintainable Cloud Architecture of Hadoop

Kai Sasaki Treasure Data Inc.

Page 2: Maintainable cloud architecture_of_hadoop

Who am I?

• Kai Sasaki (佐々木 海)

• @Lewuathe on Twitter and GitHub

• Software Engineer at Treasure Data Inc.

• Contributor to Hadoop and Spark

Page 3: Maintainable cloud architecture_of_hadoop

Hadoop in Treasure Data

Page 4: Maintainable cloud architecture_of_hadoop

Cloud-based Data warehousing service

Page 8: Maintainable cloud architecture_of_hadoop

Hadoop is the core of Treasure Data

Page 9: Maintainable cloud architecture_of_hadoop

Hadoop on Cloud

1. Features provided by AWS, IDCF, Heroku, etc.

2. Fast-growing reliability and integrity

Page 10: Maintainable cloud architecture_of_hadoop

Hadoop on Cloud

1. Features provided by AWS, IDCF, Heroku, etc.

2. Fast-growing reliability and integrity

Maintainability of Middleware

Page 11: Maintainable cloud architecture_of_hadoop

Agenda

• Maintainability of Distributed System

• Our Challenges
  • Stateless Hive Metastore
  • Cloud Storage for Hadoop
  • Multiple Hadoop Version Management
  • Regression Test for Hive Queries
  • REST API for Hadoop
  • Workflow Integration

• What we should keep in mind

Page 12: Maintainable cloud architecture_of_hadoop

Maintainability

We think high maintainability is achieved by…

• Stateless

Page 13: Maintainable cloud architecture_of_hadoop

Maintainability

We think high maintainability is achieved by…

• Stateless

• Mobility

Page 14: Maintainable cloud architecture_of_hadoop

Maintainability

We think high maintainability is achieved by…

• Stateless

• Mobility

• Queueing

Page 15: Maintainable cloud architecture_of_hadoop

Stateless

• Stateless Hive metastore

• Cloud Storage for Hadoop

Page 16: Maintainable cloud architecture_of_hadoop

Stateless Hive MS

Page 17: Maintainable cloud architecture_of_hadoop

Stateful Hive MS

MySQL

Page 18: Maintainable cloud architecture_of_hadoop

Stateful Hive MS

Driver Metastore MySQL

Page 19: Maintainable cloud architecture_of_hadoop

Stateful Hive MS

Driver Metastore MySQL

Requires maintaining an RDBMS only for the metastore

Page 20: Maintainable cloud architecture_of_hadoop

Stateless Hive MS

Driver Metastore

Page 21: Maintainable cloud architecture_of_hadoop

Stateless Hive MS

Driver Metastore Derby

Page 22: Maintainable cloud architecture_of_hadoop

Stateless Hive MS

Driver Metastore Derby

Worker

Submit DDL request

Page 23: Maintainable cloud architecture_of_hadoop

Stateless Hive MS

Driver Metastore Derby

Worker

Submit DDL request

Aggregate Stateful points

Treasure Data API
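
The stateless setup on this slide can be sketched roughly as follows. This is a minimal sketch, not Treasure Data's implementation: it assumes a worker receives table schemas from an API (represented here by a plain dict) and replays them as DDL into a throwaway, Derby-backed metastore before each job. The function name `build_ddl`, the schema layout, and the storage location are all hypothetical.

```python
from typing import Dict, List


def build_ddl(database: str,
              tables: Dict[str, Dict[str, str]],
              location_prefix: str) -> List[str]:
    """Turn schema metadata into Hive DDL statements.

    `tables` maps a table name to an ordered {column: hive_type} mapping.
    The statements can be replayed into an empty metastore, so the
    metastore itself never has to keep state between jobs.
    """
    statements = [f"CREATE DATABASE IF NOT EXISTS `{database}`"]
    for table, columns in tables.items():
        cols = ", ".join(f"`{name}` {hive_type}"
                         for name, hive_type in columns.items())
        statements.append(
            f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` "
            f"({cols}) LOCATION '{location_prefix}/{database}/{table}'"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical schema, as it might be returned by a metadata API.
    schema = {"access_logs": {"time": "BIGINT", "path": "STRING"}}
    for stmt in build_ddl("sample_db", schema, "s3://example-bucket/warehouse"):
        print(stmt)
```

Because every statement uses `IF NOT EXISTS`, replaying the same DDL twice is harmless, which is what lets the metastore be recreated at any time.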

Page 24: Maintainable cloud architecture_of_hadoop

Cloud Storage for Hadoop

Page 25: Maintainable cloud architecture_of_hadoop

PlazmaDB

Data Connector

S3, Redshift, MySQL, PostgreSQL, Salesforce and more

SDK: iOS, Android, JavaScript, Unity

Bulk Import td client

...

Page 26: Maintainable cloud architecture_of_hadoop

PlazmaDB

Data Connector

S3, Redshift, MySQL, PostgreSQL, Salesforce and more

SDK: iOS, Android, JavaScript, Unity

Bulk Import td client

...

msgpack

Page 27: Maintainable cloud architecture_of_hadoop

PlazmaDB

Data Connector

S3, Redshift, MySQL, PostgreSQL, Salesforce and more

SDK: iOS, Android, JavaScript, Unity

Bulk Import td client

...

msgpack

Hadoop

Page 28: Maintainable cloud architecture_of_hadoop

PlazmaDB

Data Connector

S3, Redshift, MySQL, PostgreSQL, Salesforce and more

SDK: iOS, Android, JavaScript, Unity

Bulk Import td client

...

msgpack

Hadoop

Stateful

Page 29: Maintainable cloud architecture_of_hadoop

PlazmaDB

PostgreSQL

S3 or Riak

msgpack

Amazon RDS

Page 30: Maintainable cloud architecture_of_hadoop

PlazmaDB

PostgreSQL

S3 or Riak

msgpack

Amazon RDS

Transaction Immutable
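
The split on this slide, a transactional metadata store in front of immutable data objects, can be illustrated with a small sketch. It is not PlazmaDB's code or schema: `sqlite3` stands in for PostgreSQL, a dict stands in for S3 or Riak, JSON stands in for msgpack to keep the example stdlib-only, and the `TinyStore` class and its table layout are hypothetical.

```python
import json
import sqlite3
import uuid
from typing import Dict, List


class TinyStore:
    """Immutable data objects plus a transactional metadata index."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}      # stand-in for S3 or Riak
        self.meta = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
        self.meta.execute("CREATE TABLE partitions (tbl TEXT, object_key TEXT)")

    def _put_object(self, rows: List[dict]) -> str:
        # A fresh key per write: existing objects are never overwritten.
        key = uuid.uuid4().hex
        self.objects[key] = json.dumps(rows).encode()
        return key

    def append(self, tbl: str, rows: List[dict]) -> None:
        key = self._put_object(rows)
        with self.meta:  # commits on success, rolls back on error
            self.meta.execute("INSERT INTO partitions VALUES (?, ?)", (tbl, key))

    def replace_all(self, tbl: str, rows: List[dict]) -> None:
        key = self._put_object(rows)
        with self.meta:  # the delete and insert are committed together
            self.meta.execute("DELETE FROM partitions WHERE tbl = ?", (tbl,))
            self.meta.execute("INSERT INTO partitions VALUES (?, ?)", (tbl, key))

    def scan(self, tbl: str) -> List[dict]:
        cur = self.meta.execute(
            "SELECT object_key FROM partitions WHERE tbl = ? ORDER BY rowid", (tbl,)
        )
        rows: List[dict] = []
        for (key,) in cur:
            rows.extend(json.loads(self.objects[key]))
        return rows
```

In this sketch the only mutable state is the metadata table, so compute nodes such as Hadoop workers hold nothing that must survive a restart.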

Page 31: Maintainable cloud architecture_of_hadoop

Mobility

• Multiple Hadoop Version Management

• Regression Test for Hive Queries

Page 32: Maintainable cloud architecture_of_hadoop

Multiple Hadoop Version Management

Page 33: Maintainable cloud architecture_of_hadoop

Multiple Version Management

CDH HDP Apache

Page 34: Maintainable cloud architecture_of_hadoop

Multiple Version Management

CDH HDP Apache

client client client

Page 35: Maintainable cloud architecture_of_hadoop

Multiple Version Management

CDH HDP Apache

client client client

Tough Operation

Page 36: Maintainable cloud architecture_of_hadoop

Multiple Version Management

CDH HDP Apache

Worker

Page 37: Maintainable cloud architecture_of_hadoop

Multiple Version Management

CDH HDP Apache

Worker

switching

Page 38: Maintainable cloud architecture_of_hadoop

Multiple Version Management

CDH HDP Apache

Worker

switching

Page 39: Maintainable cloud architecture_of_hadoop

Multiple Version Management

CDH HDP Apache

Worker

CDH package

HDP package

Apache package

switching

Page 40: Maintainable cloud architecture_of_hadoop

Multiple Version Management

CDH HDP Apache

Worker

CDH package

HDP package

Apache package

S3

switching

Page 41: Maintainable cloud architecture_of_hadoop

Multiple Version Management

S3

/test

/stable

...

Page 42: Maintainable cloud architecture_of_hadoop

Multiple Version Management

CDH package

HDP package

Apache package

S3

/test

/stable

...

Page 43: Maintainable cloud architecture_of_hadoop

Multiple Version Management

CDH package

HDP package

Apache package

S3

/test

/stable

...

CDH

HDP

Apache

Worker

download
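
A worker that switches between distributions could resolve which package to download along these lines. This is a minimal sketch under stated assumptions: the key layout `<channel>/<distribution>/hadoop.tar.gz` and the names `PackageRef` and `resolve_package` are hypothetical, and only the `/test` and `/stable` prefixes shown on the slide are modelled.

```python
from dataclasses import dataclass

SUPPORTED_DISTRIBUTIONS = ("cdh", "hdp", "apache")
CHANNELS = ("test", "stable")  # only the two prefixes named on the slide


@dataclass(frozen=True)
class PackageRef:
    distribution: str
    channel: str

    def object_key(self) -> str:
        # Hypothetical layout inside the bucket.
        return f"{self.channel}/{self.distribution}/hadoop.tar.gz"


def resolve_package(distribution: str, channel: str = "stable") -> PackageRef:
    """Validate the request and return a reference to the package to fetch."""
    dist = distribution.lower()
    if dist not in SUPPORTED_DISTRIBUTIONS:
        raise ValueError(f"unknown distribution: {distribution!r}")
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel!r}")
    return PackageRef(dist, channel)


if __name__ == "__main__":
    print(resolve_package("CDH").object_key())
    print(resolve_package("apache", "test").object_key())
```

Keeping the choice of distribution in data rather than in the worker image is what makes switching a matter of downloading a different package.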

Page 44: Maintainable cloud architecture_of_hadoop

Regression Test for Hive

• Introducing new features, upgrading versions, and migrating must not cause regressions

• We run integration system tests and regression tests for Hive queries
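
One way to check a Hive query for regressions is to run it on the current and the candidate Hadoop package and compare the result sets. The sketch below shows only the comparison step and assumes the rows have already been fetched; the function name `diff_results` is hypothetical and this is not the test suite described in the talk.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

Row = Tuple[object, ...]


def diff_results(baseline: Iterable[Sequence[object]],
                 candidate: Iterable[Sequence[object]]) -> Tuple[List[Row], List[Row]]:
    """Return (missing, unexpected) rows.

    Row order is ignored because Hive does not guarantee it without
    ORDER BY, but duplicate rows are counted.
    """
    base = Counter(tuple(row) for row in baseline)
    cand = Counter(tuple(row) for row in candidate)
    missing = sorted((base - cand).elements(), key=repr)
    unexpected = sorted((cand - base).elements(), key=repr)
    return missing, unexpected


if __name__ == "__main__":
    old = [(1, "a"), (2, "b"), (2, "b")]
    new = [(2, "b"), (1, "a"), (3, "c")]
    missing, unexpected = diff_results(old, new)
    print("missing:", missing)
    print("unexpected:", unexpected)
```

A test passes when both lists are empty; anything else points at a behavioural difference between the two versions.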

Page 45: Maintainable cloud architecture_of_hadoop

CDH

HDP

Apache

Worker

http://blog.circleci.com/meet-our-new-logo/

System Test Repository

Page 46: Maintainable cloud architecture_of_hadoop

CDH

HDP

Apache

Worker

http://blog.circleci.com/meet-our-new-logo/

System Test Repository

Page 47: Maintainable cloud architecture_of_hadoop

CDH

HDP

Apache

Worker

http://blog.circleci.com/meet-our-new-logo/

System Test Repository

S3

Hadoop Repository

Page 48: Maintainable cloud architecture_of_hadoop

CDH

HDP

Apache

Worker

http://blog.circleci.com/meet-our-new-logo/

System Test Repository

S3

Apache package

Hadoop Repository

Page 49: Maintainable cloud architecture_of_hadoop

CDH

HDP

Apache

Worker

http://blog.circleci.com/meet-our-new-logo/

System Test Repository

S3

Apache package

Hadoop Repository

Page 50: Maintainable cloud architecture_of_hadoop

Queueing

• REST API for Hadoop

• RDS-based queue management system

Page 51: Maintainable cloud architecture_of_hadoop

REST API for Hadoop

Page 52: Maintainable cloud architecture_of_hadoop

REST API for Hadoop

CDH HDP Apache

Worker

Page 53: Maintainable cloud architecture_of_hadoop

REST API for Hadoop

CDH HDP Apache

Worker PerfectQueue

Hadoop Job Server

REST API

Page 54: Maintainable cloud architecture_of_hadoop

REST API for Hadoop

CDH HDP Apache

Worker PerfectQueue

Hadoop Job Server

REST API

Presto API

Page 55: Maintainable cloud architecture_of_hadoop

RDBMS-based Queue Management System

Page 56: Maintainable cloud architecture_of_hadoop

RDBMS-based queue management

CDH HDP Apache

Worker

Client Client Client

PerfectQueue

Hadoop Job Server

Page 57: Maintainable cloud architecture_of_hadoop

PerfectQueue

• Highly available distributed queue built on an RDBMS

• Amazon SQS-like API

• Resource scheduling for multi-tenancy

• Graceful and live restarting

https://github.com/treasure-data/perfectqueue
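
The general idea of a queue kept in a relational table, with an SQS-like acquire/finish cycle, can be sketched as below. This is not PerfectQueue's API or schema: the `SqlQueue` class, its table layout, and the use of `sqlite3` in place of a server RDBMS are assumptions made for illustration.

```python
import sqlite3
import time
from typing import Optional, Tuple


class SqlQueue:
    """A task queue stored in one table, with a visibility timeout."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE tasks ("
            "id INTEGER PRIMARY KEY, body TEXT NOT NULL, visible_at REAL NOT NULL)"
        )

    def submit(self, body: str, now: Optional[float] = None) -> int:
        now = time.time() if now is None else now
        with self.db:  # the request is persistent once this commits
            cur = self.db.execute(
                "INSERT INTO tasks (body, visible_at) VALUES (?, ?)", (body, now)
            )
        return cur.lastrowid

    def acquire(self, timeout: float = 30.0,
                now: Optional[float] = None) -> Optional[Tuple[int, str]]:
        """Take the oldest visible task and hide it for `timeout` seconds."""
        now = time.time() if now is None else now
        with self.db:
            row = self.db.execute(
                "SELECT id, body FROM tasks WHERE visible_at <= ? "
                "ORDER BY id LIMIT 1", (now,)
            ).fetchone()
            if row is None:
                return None
            self.db.execute(
                "UPDATE tasks SET visible_at = ? WHERE id = ?",
                (now + timeout, row[0]),
            )
        return row[0], row[1]

    def finish(self, task_id: int) -> None:
        with self.db:
            self.db.execute("DELETE FROM tasks WHERE id = ?", (task_id,))
```

If a worker dies before calling `finish`, the task becomes visible again after the timeout and another worker can pick it up. With several workers on a server RDBMS, the select-then-update step would additionally need row locking, which this single-connection sketch leaves out.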

Page 58: Maintainable cloud architecture_of_hadoop

What we should keep in mind

• Stateless: delegate responsibility to cloud systems

• Mobility: plan ahead for version upgrades and migration

• Queueing: make each request persistent

Page 59: Maintainable cloud architecture_of_hadoop

Recap

• Maintainability of Distributed System

• Our Challenges
  • Stateless Hive Metastore
  • Cloud Storage for Hadoop
  • Multiple Hadoop Version Management
  • Regression Test for Hive Queries
  • REST API for Hadoop
  • Workflow Integration

• What we should keep in mind

Page 60: Maintainable cloud architecture_of_hadoop

https://www.treasuredata.com/