Maintainable Cloud Architecture of Hadoop
Kai Sasaki Treasure Data Inc.
Who am I?
• Kai Sasaki (佐々木 海)
• @Lewuathe at Twitter, GitHub
• Treasure Data Inc. Software Engineer
• Contributing to Hadoop, Spark.
Hadoop in Treasure Data
Cloud-based Data warehousing service
Hadoop is the core of Treasure Data
Hadoop on Cloud
1. Features provided by AWS, IDCF, Heroku etc
2. Fast-growing reliability and integrity
Maintainability of Middleware
Agenda
• Maintainability of Distributed System
• Our Challenges
  • Stateless Hive Metastore
  • Cloud Storage for Hadoop
  • Multiple Hadoop Version Management
  • Regression Test for Hive Queries
  • REST API for Hadoop
  • Workflow Integration
• What we should keep in mind
Maintainability
We think high maintainability is achieved by…
• Stateless
• Mobility
• Queueing
Stateless
• Stateless Hive metastore
• Cloud Storage for Hadoop
Stateless Hive MS
Stateful Hive MS
Driver Metastore MySQL
Requires maintaining an RDBMS only for the metastore
Stateless Hive MS
Driver Metastore Derby
Worker
Submit DDL request
Aggregate Stateful points
Treasure Data API
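A minimal sketch of the stateless idea, under one reading of the diagram: each worker builds a throwaway local catalog from the table schemas shipped with the job, so no shared metastore database has to be operated. This is not Treasure Data's implementation. Python's in-memory `sqlite3` stands in for the embedded Derby database, and `build_ephemeral_metastore`, `describe`, and the schema layout are hypothetical names chosen for illustration.

```python
import sqlite3


def build_ephemeral_metastore(table_schemas):
    """Create a throwaway in-memory catalog from schemas supplied with a job.

    table_schemas maps a table name to an ordered list of (column, type) pairs.
    sqlite3 is only a stand-in here for an embedded Derby database.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE columns (tbl TEXT, name TEXT, type TEXT, pos INTEGER)"
    )
    for tbl, cols in table_schemas.items():
        for pos, (name, typ) in enumerate(cols):
            conn.execute(
                "INSERT INTO columns VALUES (?, ?, ?, ?)", (tbl, name, typ, pos)
            )
    conn.commit()
    return conn


def describe(conn, tbl):
    """Return the (column, type) pairs of a table in declaration order."""
    rows = conn.execute(
        "SELECT name, type FROM columns WHERE tbl = ? ORDER BY pos", (tbl,)
    )
    return rows.fetchall()


if __name__ == "__main__":
    # Hypothetical schema that a central API would hand to the worker.
    schemas = {"access_logs": [("time", "bigint"), ("path", "string")]}
    metastore = build_ephemeral_metastore(schemas)
    print(describe(metastore, "access_logs"))
    metastore.close()  # nothing to back up: the catalog dies with the job
```

The catalog is rebuilt per job and discarded afterwards, so the only state that needs care lives behind the central API.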
Cloud Storage for Hadoop
PlazmaDB
Data Connector
S3, Redshift, MySQL, PostgreSQL, Salesforce and more
SDK: iOS, Android, JavaScript, Unity
Bulk Import td client
...
msgpack
Hadoop
Stateful
PlazmaDB
PostgreSQL (Amazon RDS): Transaction
S3 or Riak (msgpack): Immutable
Mobility
• Multiple Hadoop Version Management
• Regression Test for Hive Queries
Multiple Hadoop Version Management
Multiple Version Management
CDH HDP Apache
client client client
Tough Operation
Multiple Version Management
CDH HDP Apache
Worker
CDH package
HDP package
Apache package
S3
switching
Multiple Version Management
CDH package
HDP package
Apache package
S3
/test
/stable
...
CDH
HDP
Apache
Worker
download
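The package switching above can be sketched as a small lookup that a worker might run before downloading. The slides only show that CDH, HDP, and Apache packages sit on S3 under `/test` and `/stable`; the file naming, the `PackageRef` type, and `select_package` below are assumptions made for illustration, and the actual download step is left out.

```python
from dataclasses import dataclass

DISTRIBUTIONS = ("cdh", "hdp", "apache")
CHANNELS = ("test", "stable")


@dataclass(frozen=True)
class PackageRef:
    """Identifies one Hadoop package in the package store."""

    distribution: str
    channel: str

    def key(self) -> str:
        # Hypothetical layout: <channel>/<distribution>-package.tar.gz
        return f"{self.channel}/{self.distribution}-package.tar.gz"


def select_package(distribution: str, channel: str = "stable") -> PackageRef:
    """Validate the request and return the package a worker should fetch."""
    if distribution not in DISTRIBUTIONS:
        raise ValueError(f"unknown distribution: {distribution!r}")
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel!r}")
    return PackageRef(distribution, channel)


if __name__ == "__main__":
    print(select_package("hdp").key())          # stable/hdp-package.tar.gz
    print(select_package("cdh", "test").key())  # test/cdh-package.tar.gz
```

Because the worker resolves the package at run time, moving a job to another distribution or promoting a build from test to stable becomes a configuration change rather than a reinstall.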
Regression Test for Hive
• New features, version upgrades, and migrations must be introduced without regressions
• We run integration system tests and regression tests for Hive queries
CDH
HDP
Apache
Worker
CircleCI (logo source: http://blog.circleci.com/meet-our-new-logo/)
System Test Repository
S3
Apache package
Hadoop Repository
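One way to picture a regression check for Hive queries is to run the same query set on the current and the candidate Hadoop package and compare the results. The sketch below assumes the results are already collected as lists of rows; `same_result` and `find_regressions` are hypothetical helpers, not part of the test repository shown in the slides. Rows are compared as multisets because a query without ORDER BY may return rows in any order.

```python
from collections import Counter


def same_result(baseline_rows, candidate_rows):
    """True when both results hold the same rows with the same multiplicity."""
    return Counter(map(tuple, baseline_rows)) == Counter(map(tuple, candidate_rows))


def find_regressions(baseline, candidate):
    """Return the names of queries whose candidate result differs or is missing.

    Both arguments map a query name to the list of rows it returned.
    """
    return sorted(
        name
        for name, rows in baseline.items()
        if name not in candidate or not same_result(rows, candidate[name])
    )


if __name__ == "__main__":
    baseline = {"count_by_path": [("/a", 2), ("/b", 1)], "total": [(3,)]}
    candidate = {"count_by_path": [("/b", 1), ("/a", 2)], "total": [(4,)]}
    print(find_regressions(baseline, candidate))  # ['total']
```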
Queueing
• REST API for Hadoop
• RDBMS-based queue management system
REST API for Hadoop
REST API for Hadoop
CDH HDP Apache
Worker PerfectQueue
Hadoop Job Server
REST API
Presto API
RDBMS-based Queue Management System
RDBMS based queue management
CDH HDP Apache
Worker
Client Client Client
PerfectQueue
Hadoop Job Server
PerfectQueue
• Highly available distributed queue built on an RDBMS
• Amazon SQS-like API
• Resource scheduling for multi-tenancy
• Graceful and live restarting
https://github.com/treasure-data/perfectqueue
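To illustrate what a queue built on an RDBMS with an SQS-like API can look like, here is a minimal single-process sketch. It is not PerfectQueue's code or API: the `RdbmsQueue` class, its method names, and the table layout are assumptions, and `sqlite3` stands in for the production database. Each task is a persistent row, and acquiring a task hides it for a timeout so that a crashed worker's task becomes visible again.

```python
import sqlite3
import time


class RdbmsQueue:
    """Toy task queue stored in a relational table (illustrative only)."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tasks ("
            "id TEXT PRIMARY KEY, payload TEXT NOT NULL, visible_at REAL NOT NULL)"
        )
        conn.commit()

    def submit(self, task_id, payload, now=None):
        """Persist a task; a duplicate id raises sqlite3.IntegrityError."""
        now = time.time() if now is None else now
        with self.conn:
            self.conn.execute(
                "INSERT INTO tasks VALUES (?, ?, ?)", (task_id, payload, now)
            )

    def acquire(self, timeout=300.0, now=None):
        """Return (id, payload) of one visible task and hide it for `timeout` seconds."""
        now = time.time() if now is None else now
        with self.conn:
            row = self.conn.execute(
                "SELECT id, payload FROM tasks WHERE visible_at <= ? "
                "ORDER BY visible_at, id LIMIT 1",
                (now,),
            ).fetchone()
            if row is None:
                return None
            self.conn.execute(
                "UPDATE tasks SET visible_at = ? WHERE id = ?",
                (now + timeout, row[0]),
            )
            return row

    def finish(self, task_id):
        """Delete a completed task so it is never delivered again."""
        with self.conn:
            self.conn.execute("DELETE FROM tasks WHERE id = ?", (task_id,))


if __name__ == "__main__":
    queue = RdbmsQueue(sqlite3.connect(":memory:"))
    queue.submit("job-1", "SELECT COUNT(1) FROM access_logs")
    task = queue.acquire(timeout=60)
    print(task)
    queue.finish(task[0])
```

With several workers against a shared database, the select and update in `acquire` would need row locking or an atomic conditional update; the sketch leaves that out.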
What we should keep in mind
• Stateless: delegate responsibility to cloud systems
• Mobility: look ahead to version upgrades and migrations
• Queueing: make each request persistent
Recap
• Maintainability of Distributed System
• Our Challenges
  • Stateless Hive Metastore
  • Cloud Storage for Hadoop
  • Multiple Hadoop Version Management
  • Regression Test for Hive Queries
  • REST API for Hadoop
  • Workflow Integration
• What we should keep in mind
https://www.treasuredata.com/