Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

How to monitor the $H!T out of Hadoop: Developing a comprehensive open approach to monitoring Hadoop clusters

Transcript of Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Page 1: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

How to monitor the $H!T out of Hadoop

Developing a comprehensive open approach to monitoring Hadoop clusters

Page 2: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Relevant Hadoop Information

From 3 – 3000 Nodes

Hardware/Software failures “common”

Redundant Components: DataNode, TaskTracker

Non-redundant Components: NameNode, JobTracker, SecondaryNameNode

Fast Evolving Technology (Best Practices?)

Page 3: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Monitoring Software

Nagios
– Red Yellow Green Alerts, Escalations
– De facto Standard – Widely deployed
– Text-based configuration
– Web Interface
– Pluggable with shell scripts/external apps

Return 0 - OK
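The "Return 0 - OK" convention extends to the other standard Nagios plugin exit codes (1 warning, 2 critical, 3 unknown). A minimal sketch of a plugin core in Python follows; the function names `classify` and `report` and the sample thresholds are illustrative assumptions, not part of the talk.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measurement to a Nagios state; higher values are worse."""
    if value is None:
        return UNKNOWN  # the measurement could not be taken
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def report(label, value, warn, crit):
    """Return (exit_code, status_line) for a single check (hypothetical helper)."""
    code = classify(value, warn, crit)
    return code, f"{label} {STATUS_NAMES[code]} - value={value}"
```

A real plugin script would print the status line and then call `sys.exit(code)`, since Nagios reads the process exit status to colour the alert.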

Page 4: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Cacti

Performance Graphing System

RRD/RRA Front End

Slick Web Interface

Template System for Graph Types

Pluggable
– SNMP input
– Shell script / external program

Page 5: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com
Page 6: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

hadoop-cacti-jtg

JMX Fetching Code w/ (kick off) scripts

Cacti templates For Hadoop

Premade Nagios Check Scripts

Helper/Batch/automation scripts

Apache License

Page 7: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Hadoop JMX

Page 8: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Sample Cluster P1

NameNode & SecNameNode
– Hardware RAID
– 8 GB RAM
– 1x Quad Core
– DerbyDB (hive) on SecNameNode

JobTracker
– 8 GB RAM
– 1x Quad Core

Page 9: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

A Sample Cluster P2

Slave (hadoopdata1-XXXX)
– JBOD 8x 1TB SATA Disk
– RAM 16 GB
– 2x Quad Core

Page 10: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Prerequisites

Nagios (install) DAG RPMs

Cacti (install) Several RPMs

Liberal network access to the cluster

Page 11: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Alerts & Escalations

X nodes * Y Services = < Sleep

Define a policy
– Wake Me Up’s (SMS)
– Don’t Wake Me Up’s (EMAIL)
– Review (Daily, Weekly, Monthly)

Page 12: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Wake Me Up’s

NameNode
– Disk Full (Big Big Headache)
– RAID Array Issues (failed disk)

JobTracker

SecNameNode
– Easy not to realize it is not working until too late

Page 13: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Don’t Wake Me Up’s

Or ‘Wake someone else up’

DataNode
– Warning: currently a failed disk will down the DataNode (see Jira)

TaskTracker

Hardware
– Bad Disk (Start RMA)

Slaves are expendable (up to a point)

Page 14: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Monitoring Battle Plan

Start With the Basics
– Ping, Disk

Add Hadoop Specific Alarms
– check_data_node

Add JMX Graphing
– NameNodeOperations

Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%

Page 15: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

The Basics Nagios

Nagios (All Nodes)
– Host up (Ping check)
– Disk % Full
– SWAP > 85 %

* Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
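The disk and swap checks above can be sketched in a few lines of Python. This is an illustrative sketch, not the checks used in the talk: the function names are made up here, the swap figure is parsed from Linux `/proc/meminfo` text, and only the 85% swap threshold comes from the slide.

```python
import shutil


def disk_percent_full(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def swap_percent_used(meminfo_text):
    """Compute swap usage from the text of /proc/meminfo (Linux only)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured, so nothing to alarm on
    return 100.0 * (total - fields.get("SwapFree", 0)) / total


def swap_alarm(meminfo_text, threshold=85.0):
    """True when swap usage exceeds the slide's 85% threshold."""
    return swap_percent_used(meminfo_text) > threshold
```

On a node, `swap_alarm(open("/proc/meminfo").read())` would give the alarm state; the disk threshold is left to the caller because the slide does not state one.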

Page 16: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

The Basics Cacti

Cacti (All Nodes)
– CPU (full CPU)
– RAM/SWAP
– Network
– Disk Usage

Page 17: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Disk Utilization

Page 18: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

RAID Tools

Hpacucli – not a Street Fighter move
– Alerts on RAID events (NameNode)
  – Disk failed
  – Rebuilding
– JBOD (DataNode)
  – Failed Drive
  – Drive Errors

Dell, SUN, Vendor Specific Tools

Page 19: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Before you jump in

X Nodes * Y Checks = Lots of work

About 3 Nodes into the process …
– Wait!!! I need some interns!!!

Solution: S.I.C.C.T. (Semi-Intelligent-Configuration-cloning-tools)
– (I made that up)
– (for this presentation)
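One way to picture a configuration-cloning tool is a small generator that stamps out Nagios host and service definitions for every slave. The sketch below is an assumption about what such a tool could look like, not the scripts shipped with hadoop-cacti-jtg: the check names `check_remote_datanode` and `check_remote_tasktracker`, the `example.com` domain, and the `generic-host` template are placeholders, while the host prefix and ports follow the slides.

```python
HOST_TEMPLATE = """define host {{
    use        generic-host
    host_name  {name}
    address    {address}
}}
"""

SERVICE_TEMPLATE = """define service {{
    use                  generic-service
    host_name            {name}
    service_description  {check}
    check_command        {check}!{port}
}}
"""


def clone_slave_config(count, prefix="hadoopdata", domain="example.com", checks=None):
    """Generate Nagios definitions for slaves prefix1 .. prefix<count>."""
    if checks is None:
        # Hypothetical check commands, keyed to the web ports of each daemon.
        checks = {"check_remote_datanode": 50075, "check_remote_tasktracker": 50060}
    blocks = []
    for i in range(1, count + 1):
        name = f"{prefix}{i}"
        blocks.append(HOST_TEMPLATE.format(name=name, address=f"{name}.{domain}"))
        for check, port in checks.items():
            blocks.append(SERVICE_TEMPLATE.format(name=name, check=check, port=port))
    return "\n".join(blocks)
```

Writing `clone_slave_config(50)` to a file under the Nagios configuration directory replaces fifty rounds of copy and paste, and adding a node becomes a one-number change.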

Page 20: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Nagios

Answers “IS IT RUNNING?”

Text-based Configuration

Page 21: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Cacti

Answers “HOW WELL IS IT RUNNING?”

Web-based configuration
– php-cli tools

Page 22: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Monitoring Battle Plan Thus Far

Start With the Basics
– Ping, Disk !!!!!!Done!!!!!!

Add Hadoop Specific Alarms
– check_data_node

Add JMX Graphing
– NameNodeOperations

Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%

Page 23: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Add Hadoop Specific Alarms

Hadoop Components with a Web Interface
– NameNode 50070
– JobTracker 50030
– TaskTracker 50060
– DataNode 50075

check_http + regex = simple + effective
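The check_http-plus-regex idea amounts to fetching a status page and searching it for a string that only appears when the daemon is healthy. A rough Python equivalent is sketched below, assuming the component serves a plain HTTP page; `page_matches` and `check_component` are illustrative names, and the real check in the talk is the stock Nagios `check_http` plugin.

```python
import re
import urllib.request


def page_matches(body, pattern):
    """True when the regex is found anywhere in the page body."""
    return re.search(pattern, body) is not None


def check_component(host, port, path, pattern, timeout=10):
    """Fetch http://host:port/path and return a Nagios-style (code, message)."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:  # covers URLError, HTTPError, timeouts, refusals
        return 2, f"CRITICAL - {url} unreachable: {exc}"
    if page_matches(body, pattern):
        return 0, f"OK - {url} matched /{pattern}/"
    return 2, f"CRITICAL - {url} did not match /{pattern}/"
```

For example, `check_component("hadoopname1", 50070, "/dfshealth.jsp", "NameNode")` mirrors a NameNode check on the port listed above; a hung or dead daemon fails either the fetch or the match.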

Page 24: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

nagios_check_commands.cfg

Component Failure

(Future) Newer Hadoop will have XML status

define command {
    command_name  check_remote_namenode
    command_line  $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}

define service {
    service_description  check_remote_namenode
    use                  generic-service
    host_name            hadoopname1
    check_command        check_remote_namenode!50070
}

Page 25: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Monitoring Battle Plan

Start With the Basics
– Ping, Disk (Done)

Add Hadoop Specific Alarms
– check_data_node (Done)

Add JMX Graphing
– NameNodeOperations

Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%

Page 26: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

JMX Graphing

Enable JMX

Import Templates

Page 27: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

JMX Graphing

Page 28: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

JMX Graphing

Page 29: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

JMX Graphing

Page 30: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com
Page 31: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Standard Java JMX

Page 32: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Monitoring Battle Plan Thus Far

Start With the Basics !!!!!!Done!!!!!
– Ping, Disk

Add Hadoop Specific Alarms !Done!
– check_data_node

Add JMX Graphing !Done!
– NameNodeOperations

Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%

Page 33: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Add JMX based Alarms

hadoop-cacti-jtg is flexible
– extend fetch classes
– Don’t call output()
– Write your own check logic

Page 34: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Quick JMX Base Walkthrough

url, user, pass, object specified from CLI

wantedVariables, wantedOperations by inheritance

fetch() output() provided
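The shape of that base class can be illustrated with a short Python analogue. This is only a sketch of the design described on the slide, not the hadoop-cacti-jtg code itself (which is Java): the class names, the injected `read_attribute` callable standing in for a real JMX connection, and the space-separated `name:value` output are assumptions made for illustration.

```python
class JMXBase:
    """Holds connection details; subclasses declare which attributes they want."""

    wanted_variables = []  # filled in by subclasses ("by inheritance")

    def __init__(self, url, user, password, object_name, read_attribute):
        self.url = url
        self.user = user
        self.password = password
        self.object_name = object_name
        # Hypothetical stand-in for a JMX client: callable(object_name, attribute).
        self.read_attribute = read_attribute
        self.values = {}

    def fetch(self):
        """Read every wanted attribute from the target object."""
        for name in self.wanted_variables:
            self.values[name] = self.read_attribute(self.object_name, name)
        return self.values

    def output(self):
        """Format fetched values as 'name:value' pairs for a graphing tool."""
        return " ".join(f"{name}:{value}" for name, value in self.values.items())


class NameNodeStatus(JMXBase):
    """Example extension: only the attribute list changes."""

    wanted_variables = ["FilesTotal", "LiveNodes"]
```

The appeal of the design is that a new data source is a subclass with a different attribute list, while connection handling and formatting stay in one place.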

Page 35: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Extend for NameNode

Page 36: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Extend for Nagios
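The check logic for a JMX-based alarm can be as small as comparing fetched values against the thresholds from the battle plan (FilesTotal > 1,000,000 or LiveNodes < 50%). The following Python sketch assumes the values have already been fetched; the function name, the parameter names, and the decision to treat both conditions as critical are illustrative choices, not the project's code.

```python
def namenode_alarm(files_total, live_nodes, total_nodes,
                   max_files=1_000_000, min_live_fraction=0.5):
    """Return a Nagios-style (exit_code, message) for NameNode health."""
    if total_nodes <= 0:
        return 3, "UNKNOWN - total node count not available"
    problems = []
    if files_total > max_files:
        problems.append(f"FilesTotal {files_total} > {max_files}")
    live_fraction = live_nodes / total_nodes
    if live_fraction < min_live_fraction:
        problems.append(f"LiveNodes {live_nodes}/{total_nodes} below {min_live_fraction:.0%}")
    if problems:
        return 2, "CRITICAL - " + "; ".join(problems)
    return 0, f"OK - FilesTotal {files_total}, LiveNodes {live_nodes}/{total_nodes}"
```

Exactly half the nodes alive still passes, because the slide's condition is strictly "less than 50%".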

Page 37: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Monitoring Battle Plan

Start With the Basics !DONE!
– Ping, Disk

Add Hadoop Specific Alarms !DONE!
– check_data_node

Add JMX Graphing !DONE!
– NameNodeOperations

Add JMX Based alarms !DONE!
– FilesTotal > 1,000,000 or LiveNodes < 50%

Page 38: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

Review

File System Growth
– Size
– Number of Files
– Number of Blocks
– Ratios

Utilization
– CPU/Memory
– Disk

Email (nightly)
– FSCK
– DFSADMIN
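A nightly review email can be assembled by running the two commands and mailing their output. The sketch below is one possible arrangement under stated assumptions: it uses `hadoop fsck /` and `hadoop dfsadmin -report`, and the subject line, function names, and section layout are made up for illustration.

```python
import subprocess
from email.message import EmailMessage

# Commands whose output goes into the nightly email.
REPORT_COMMANDS = {
    "FSCK": ["hadoop", "fsck", "/"],
    "DFSADMIN": ["hadoop", "dfsadmin", "-report"],
}


def run_command(argv):
    """Run a command and return its combined output, or a note if it failed to run."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=3600)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"could not run {' '.join(argv)}: {exc}"
    return result.stdout + result.stderr


def build_report(sections, sender, recipient):
    """Build one email with a titled section per command output."""
    msg = EmailMessage()
    msg["Subject"] = "Nightly Hadoop review"
    msg["From"] = sender
    msg["To"] = recipient
    body = "\n\n".join(f"== {title} ==\n{text.strip()}" for title, text in sections.items())
    msg.set_content(body)
    return msg
```

From cron, something like `smtplib.SMTP("localhost").send_message(build_report({k: run_command(v) for k, v in REPORT_COMMANDS.items()}, sender, recipient))` would send it, assuming a local mail relay is available.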

Page 39: Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

The Future

JMX Coming to JobTracker and TaskTracker (0.21)
– Collect and Graph Jobs Running
– Collect and Graph Map / Reduce per node
– Profile Specific Jobs in Cacti?