Post on 27-Aug-2014
Copyright ©2014 Treasure Data. All Rights Reserved.
Treasure Data on The YARN
Ryu Kobayashi
Hadoop Conference Japan 2014 8 July 2014
Who am I?
• Ryu Kobayashi
• @ryu_kobayashi
• https://github.com/ryukobayashi
• Treasure Data, Inc.
• Software Engineer
• Background
• Hadoop, Cassandra, Machine Learning, ...
• I developed the Huahin (Hadoop) Framework: http://huahinframework.org/
What is Treasure Data?
Our Service
Data Collection / Data Warehouse / Data Analysis
Data Collection
• Sources: Web Log, App Log, Sensor, RDBMS, CRM, ERP
• Open-Source Log Collector (Streaming Upload)
• Bulk Loader: CSV / TSV, MySQL, Postgres, Oracle, etc. (Bulk Upload, Parallel Upload)
Data Warehouse
• Columnar Storage + Hadoop MapReduce
• schema-less
Data Analysis
• SQL (HiveQL) or Pig
• TD command / Web Console
• REST API, JDBC / ODBC
• BI Tools: Tableau, QlikView, Pentaho, Excel, etc.
• Result push to External Service/Storage: Custom App, RDBMS, FTP, etc.
Our Query Language
• SQL (HiveQL) or Pig
Our System
• Hadoop Cluster, PlazmaDB
• HDFS is not used
• Customized Hadoop
• Customized Hive
• Customized Pig
• Customized Impala
• Customized Presto
We have 4 production Hadoop clusters
user1, user4, user5, …
user2, user9, user34, …
user10, user40, user102, …
user50, user88, user1023, …
Our Scheduler and Queue
(Diagram: Queue and Scheduler, Hadoop Cluster, Hadoop Cluster)
We have 4 production Hadoop clusters and a Hadoop cluster (YARN)
(Diagram: YARN Cluster)
MRv1 and YARN Queue
(Diagram: Queue, Hadoop Cluster, Hadoop Cluster)
Our Service
• About 4,700 users
• About 6 trillion records
• About 12 million jobs
• About 40,000 jobs per day
What is YARN?
YARN (Yet Another Resource Negotiator) Architecture
• MRv1
  • JobTracker
  • TaskTracker
• YARN
  • ResourceManager
  • NodeManager
  • ApplicationMaster
  • Job History Server (if it is not installed, the job log history cannot be viewed)
Note!!!
Use Hadoop 2.4.0 or later!!!
• Versions that must not be used
  • Apache Hadoop 2.2.0
  • Apache Hadoop 2.3.0
  • HDP 2.0 (2.2.0-based)
• Current versions
  • Apache Hadoop 2.4.1
  • CDH 5.0.2 (2.3.0-based, with patches)
  • HDP 2.1 (2.4.0-based)
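The version guidance above can be turned into a quick check. This is a minimal sketch, assuming plain Apache Hadoop version strings such as "2.4.1"; the function names are hypothetical. Vendor distributions (for example CDH 5.0.2, which is 2.3.0-based with patches) would need their own check, because their base version alone does not tell you which fixes are included.

```python
# Apache Hadoop releases the slides say must not be used.
AFFECTED_VERSIONS = {"2.2.0", "2.3.0"}
# Minimum release the slides recommend.
MINIMUM_RECOMMENDED = (2, 4, 0)


def parse_version(version):
    """Turn '2.4.1' into (2, 4, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_recommended(version):
    """True if the version is 2.4.0 or later and not on the affected list."""
    return (version not in AFFECTED_VERSIONS
            and parse_version(version) >= MINIMUM_RECOMMENDED)


for candidate in ["2.2.0", "2.3.0", "2.4.0", "2.4.1"]:
    print(candidate, "ok" if is_recommended(candidate) else "avoid")
```

Comparing tuples of integers rather than raw strings keeps a version like "2.10.0" ordered after "2.4.0".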
• Why should they not be used?
  • Capacity Scheduler
    • There is a bug
  • Fair Scheduler
    • There is a bug
• What kind of bugs?
  • Each scheduler can cause a deadlock
Distribution
• CDH 5.0.2
  • Red Hat/CentOS/Oracle 5
  • Red Hat/CentOS/Oracle 6
  • Ubuntu/Debian
• HDP 2.1
  • Red Hat/CentOS/SLES (64-bit)
  • (Ubuntu 12 is already in the repository)
  • Windows Server 2008 & 2012
Several configuration files have changed (MRv1 to YARN)
Reference: http://goo.gl/vBIYQP
Deprecated Properties
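As an illustration of what handling deprecated properties can look like, here is a minimal sketch that renames old keys in a configuration dictionary. The mapping is only a small subset that I believe matches Hadoop's published deprecated-properties list; verify it against the list for your version. The function name and sample values are hypothetical.

```python
# A small, illustrative subset of deprecated -> current property names.
# Assumption: check these against the official list for your Hadoop version.
DEPRECATED_TO_NEW = {
    "fs.default.name": "fs.defaultFS",
    "mapred.job.tracker": "mapreduce.jobtracker.address",
    "mapred.job.name": "mapreduce.job.name",
    "mapred.map.tasks": "mapreduce.job.maps",
    "mapred.reduce.tasks": "mapreduce.job.reduces",
}


def rename_deprecated(config):
    """Return (updated config, list of (old, new) renames that were applied)."""
    updated, renamed = {}, []
    for key, value in config.items():
        new_key = DEPRECATED_TO_NEW.get(key, key)
        if new_key != key:
            renamed.append((key, new_key))
        updated[new_key] = value
    return updated, renamed


# Hypothetical sample configuration.
sample = {
    "fs.default.name": "hdfs://localhost:8020",
    "mapred.reduce.tasks": "4",
    "io.file.buffer.size": "4096",
}
updated, renamed = rename_deprecated(sample)
for old, new in renamed:
    print(old, "->", new)
```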
Other notes on configuration files
• hadoop-conf-pseudo does not work
  • It contains some mistakes, e.g. yarn.nodemanager.aux-services:
    mapreduce.shuffle -> mapreduce_shuffle
• 2.2.0 and 2.4.0
  • There are some differences
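The aux-services mistake noted above can be detected mechanically. This is a minimal sketch, assuming a yarn-site.xml in the usual property/name/value layout; the function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def uses_legacy_shuffle_name(yarn_site_xml):
    """True if yarn.nodemanager.aux-services still lists mapreduce.shuffle."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name == "yarn.nodemanager.aux-services":
            services = [item.strip() for item in value.split(",")]
            return "mapreduce.shuffle" in services
    return False


# Hypothetical sample using the old service name.
old_style = """<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
</configuration>"""

new_style = old_style.replace("mapreduce.shuffle", "mapreduce_shuffle")

print(uses_legacy_shuffle_name(old_style))
print(uses_legacy_shuffle_name(new_style))
```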
What should we do?
• Copy the configuration files from the CDH VM or HDP VM
• Use Ambari or Cloudera Manager
• Work it out on your own!
Slots have changed (MRv1 to YARN)
• MRv1
  • map slot, reduce slot
• YARN (MRv2)
  • resource (container)
MRv1
mapred-site.xml
• mapred.tasktracker.map.tasks.maximum
• mapred.tasktracker.reduce.tasks.maximum
scheduler.xml
• maxMaps, minMaps
• maxReduces, minReduces
YARN (MRv2)
yarn-site.xml
• yarn.nodemanager.resource.memory-mb
• (yarn.nodemanager.vmem-pmem-ratio)
• (yarn.scheduler.minimum-allocation-mb)
mapred-site.xml
• yarn.app.mapreduce.am.resource.mb
• mapreduce.map.memory.mb
• mapreduce.reduce.memory.mb
fair-scheduler.xml
• maxResources, minResources
YARN Resource Management
• yarn.nodemanager.resource.memory-mb => Memory that the NodeManager can allocate to containers
• yarn.app.mapreduce.am.resource.mb => Memory that the ApplicationMaster uses
• mapreduce.map.memory.mb => Memory that each map task uses
• mapreduce.reduce.memory.mb => Memory that each reduce task uses
YARN Resource Example
yarn.nodemanager.resource.memory-mb = 4096
yarn.app.mapreduce.am.resource.mb = 1024
mapreduce.map.memory.mb = 1024
mapreduce.reduce.memory.mb = 2048
MRv2 Application
• ApplicationMaster => 1
• Mapper => 3
• Reducer => 1
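The arithmetic behind this example can be spelled out in a few lines. This is a minimal sketch, assuming a single NodeManager and memory as the only constraint (it ignores vcores and rounding to yarn.scheduler.minimum-allocation-mb); the helper name is hypothetical.

```python
# Values taken from the example above (all in MB).
NODE_MEMORY_MB = 4096  # yarn.nodemanager.resource.memory-mb
AM_MB = 1024           # yarn.app.mapreduce.am.resource.mb
MAP_MB = 1024          # mapreduce.map.memory.mb
REDUCE_MB = 2048       # mapreduce.reduce.memory.mb


def containers_that_fit(available_mb, container_mb):
    """Number of whole containers of the given size that fit in the memory."""
    return available_mb // container_mb


# One ApplicationMaster container is allocated first.
remaining = NODE_MEMORY_MB - AM_MB                    # 3072
mappers = containers_that_fit(remaining, MAP_MB)      # 3
reducers = containers_that_fit(remaining, REDUCE_MB)  # 1

print("Memory left after the ApplicationMaster:", remaining)
print("Mappers that fit:", mappers)
print("Reducers that fit:", reducers)
```

The mapper and reducer counts are alternatives for the same 3072 MB, not a combined total: the node can run three mappers or one reducer alongside the ApplicationMaster.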
YARN Resource Example
In addition to this (e.g. Fair Scheduler):
• minResources
• maxResources
• maxRunningApps
• schedulingPolicy
Fair Scheduler Changes
In addition to this (e.g. Fair Scheduler):
• pool -> queue
• user.maxRunningJobs -> user.maxRunningApps
• userMaxJobsDefault -> userMaxAppsDefault
• etc.
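The renames listed for the Fair Scheduler can be applied to an allocation file mechanically. This is a minimal sketch, assuming an MRv1-style allocation file; it only renames the three element names from the slide and, as a simplification, renames maxRunningJobs wherever it appears. Slot-based settings such as maxMaps or minMaps would still have to be converted to resource-based settings by hand. The function name and sample file are hypothetical.

```python
import xml.etree.ElementTree as ET

# Element renames listed on the slide (MRv1 -> YARN).
RENAMES = {
    "pool": "queue",
    "maxRunningJobs": "maxRunningApps",
    "userMaxJobsDefault": "userMaxAppsDefault",
}


def rename_allocation_tags(xml_text):
    """Return the allocation XML with MRv1 element names replaced."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = RENAMES.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


# Hypothetical MRv1 allocation file.
mrv1_allocations = """<allocations>
  <pool name="sample">
    <maxRunningJobs>5</maxRunningJobs>
  </pool>
  <user name="alice">
    <maxRunningJobs>2</maxRunningJobs>
  </user>
  <userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>"""

converted = rename_allocation_tags(mrv1_allocations)
print(converted)
```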
yarn.nodemanager.resource.memory-mb
YARN Scheduler Management
YARN Resource Management
e.g.
• Use the hdp-configuration-utils.py script: http://goo.gl/L2hxyq
• Use Ambari: http://ambari.apache.org/ (Ubuntu 12 is not supported yet; support is coming soon)
YARN Container Executor
DefaultContainerExecutor
• Process-based container launch
• Same as conventional MRv1
LinuxContainerExecutor
• Linux only
• Some restrictions
• cgroups, etc.
YARN Directory Structure
MRv1
• Initial directory setup is required
YARN
• Initial directory setup is required
• There are changes from MRv1 (e.g. /tmp/hadoop-yarn/)
What should we do?
• Refer to the HDFS directory layout of the CDH VM or HDP VM
• Use Ambari or Cloudera Manager
• Work it out on your own!
Enjoy the YARN!!!
We are hiring!!!
Thanks!!!