Hadoop World: Monitoring Best Practices, Ed Capriolo, About.com

How to Monitor the $H!T out of Hadoop
Developing a comprehensive open approach to monitoring Hadoop clusters

Relevant Hadoop Information
- From 3 to 3,000 nodes
- Hardware/software failures are common
- Redundant components: DataNode, TaskTracker
- Non-redundant components: NameNode, JobTracker, SecondaryNameNode
- Fast-evolving technology (best practices?)

Monitoring Software: Nagios
- Red/Yellow/Green alerts, escalations
- De facto standard, widely deployed
- Text-based configuration
- Web interface
- Pluggable with shell scripts/external apps (return 0 = OK)
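A Nagios plugin is just a program whose exit status drives the alert state (0 = OK, 1 = WARNING, 2 = CRITICAL). A minimal sketch of such a shell plugin for disk usage; the 90%/95% thresholds and the function name are illustrative assumptions, not from the talk:

```shell
#!/bin/sh
# Minimal Nagios-style plugin sketch: the exit status drives the alert
# state (0 = OK, 1 = WARNING, 2 = CRITICAL). The 90/95 thresholds and
# the filesystem argument are illustrative assumptions.
check_disk() {
  fs="${1:-/}"
  # df -P gives a stable one-line-per-filesystem format; column 5 is "NN%".
  used=$(df -P "$fs" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
  if [ "$used" -ge 95 ]; then
    echo "DISK CRITICAL - ${used}% used on $fs"; return 2
  elif [ "$used" -ge 90 ]; then
    echo "DISK WARNING - ${used}% used on $fs"; return 1
  else
    echo "DISK OK - ${used}% used on $fs"; return 0
  fi
}
# In a real plugin: check_disk /; exit $?
```

Nagios treats the first line of output as the status text shown in the web interface, so the plugin prints a one-line summary before setting its exit status.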

Cacti
- Performance graphing system (RRD/RRA front end)
- Slick web interface
- Template system for graph types
- Pluggable input: SNMP, shell scripts/external programs

hadoop-cacti-jtg
- JMX fetching code with (kick-off) scripts
- Cacti templates for Hadoop
- Premade Nagios check scripts
- Helper/batch/automation scripts
- Apache License

Hadoop JMX

A Sample Cluster, Part 1
NameNode & SecondaryNameNode:
- Hardware RAID
- 8 GB RAM
- 1x quad core
- DerbyDB (Hive) on the SecondaryNameNode

JobTracker:
- 8 GB RAM
- 1x quad core

A Sample Cluster, Part 2
Slaves (hadoopdata1-XXXX):
- JBOD: 8x 1 TB SATA disks
- 16 GB RAM
- 2x quad core

Prerequisites
- Nagios (install from DAG RPMs)
- Cacti (install; several RPMs)
- Liberal network access to the cluster

Alerts & Escalations
X nodes * Y services = less sleep
Define a policy:
- Wake Me Ups (SMS)
- Don't Wake Me Ups (email)
- Review (daily, weekly, monthly)

Wake Me Ups
NameNode:
- Disk full (big, big headache)
- RAID array issues (failed disk)

JobTracker & SecondaryNameNode:
- Don't realize too late that they have stopped working

Don't Wake Me Ups (or wake someone else up)
DataNode:
- Warning: currently a failed disk will down the whole DataNode (see JIRA)

TaskTracker

Hardware:
- Bad disk (start the RMA process)

Slaves are expendable (up to a point)

Monitoring Battle Plan
- Start with the basics: ping, disk
- Add Hadoop-specific alarms: check_data_node
- Add JMX graphing: NameNodeOperations
- Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50%

The Basics: Nagios
Nagios (all nodes):
- Host up (ping check)
- Disk % full
- Swap > 85%

* Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville

The Basics: Cacti
Cacti (all nodes):
- CPU (full CPU)
- RAM/swap
- Network
- Disk usage

Disk Utilization

RAID Tools
- hpacucli (not a Street Fighter move)
- Alerts on RAID events (NameNode): disk failed, rebuilding
- JBOD (DataNode): failed drive, drive errors
- Dell, Sun, and other vendor-specific tools

Before You Jump In
X nodes * Y checks = lots of work
About 3 nodes into the process: Wait!!! I need some interns!!!

Solution: S.I.C.C.T.
Semi-Intelligent Configuration-Cloning Tools (I made that up for this presentation)

Nagios answers: IS IT RUNNING?
- Text-based configuration

Cacti answers: HOW WELL IS IT RUNNING?
- Web-based configuration
- php-cli tools

Monitoring Battle Plan Thus Far
- Start with the basics: ping, disk (Done!)
- Add Hadoop-specific alarms: check_data_node
- Add JMX graphing: NameNodeOperations
- Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50%

Add Hadoop-Specific Alarms
Hadoop components with a web interface:
- NameNode: 50070
- JobTracker: 50030
- TaskTracker: 50060
- DataNode: 50075

check_http + regex = simple + effective

nagios_check_commands.cfg:

define command {
    command_name check_remote_namenode
    command_line $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}

define service {
    service_description check_remote_namenode
    use generic-service
    host_name hadoopname1
    check_command check_remote_namenode!50070
}
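The same probe is easy to reproduce outside Nagios with curl and grep. A hedged sketch mirroring the check_http command above; hadoopname1 and port 50070 are the example values from the slide, and the function name is made up:

```shell
#!/bin/sh
# Hedged sketch mirroring "check_http -r NameNode": fetch dfshealth.jsp
# and grep for the string "NameNode". Exit codes follow Nagios conventions
# (0 = OK, 2 = CRITICAL).
check_namenode() {
  host="$1"; port="$2"
  if curl -sf --max-time 5 "http://${host}:${port}/dfshealth.jsp" \
      | grep -q "NameNode"; then
    echo "NAMENODE OK - dfshealth.jsp matched 'NameNode' on ${host}:${port}"
    return 0
  else
    echo "NAMENODE CRITICAL - no response or pattern not found on ${host}:${port}"
    return 2
  fi
}
# In a real plugin: check_namenode hadoopname1 50070; exit $?
```

Matching on page content rather than just an open TCP port catches the case where the servlet container is up but the daemon behind it is not serving its status page.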

Component Failure (Future)
Newer Hadoop releases will expose an XML status page

Monitoring Battle Plan
- Start with the basics: ping, disk (Done)
- Add Hadoop-specific alarms: check_data_node (Done)
- Add JMX graphing: NameNodeOperations
- Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50%

JMX Graphing
- Enable JMX
- Import the Cacti templates
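Enabling JMX means passing the standard com.sun.management.jmxremote flags to the daemon JVMs in conf/hadoop-env.sh. A hedged sketch; the port numbers are arbitrary choices, and authentication/SSL are disabled here only for brevity:

```shell
# conf/hadoop-env.sh -- expose JMX on the NameNode and DataNode JVMs.
# Ports 8004/8005 are illustrative; pick anything free and firewalled.
# Disabling auth/ssl keeps the example short -- restrict access in production.
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8004 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8005 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false $HADOOP_DATANODE_OPTS"
```

After restarting a daemon you can point jconsole (or any JMX client) at the port to verify the beans are visible before wiring the fetch into Cacti.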

JMX Graphing


Standard Java JMX

Monitoring Battle Plan Thus Far
- Start with the basics: ping, disk (Done!)
- Add Hadoop-specific alarms: check_data_node (Done!)
- Add JMX graphing: NameNodeOperations (Done!)
- Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50%

Add JMX-Based Alarms
hadoop-cacti-jtg is flexible:
- Extend the fetch classes
- Don't call output()
- Write your own check logic
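Once the values come back from a fetch, the check logic itself is trivial. A sketch of the two alarm rules named in the battle plan; the function name and interface are made up for illustration, with the fetched JMX values passed in as plain arguments:

```shell
#!/bin/sh
# Hedged sketch of the two JMX-based alarm rules from the battle plan:
# FilesTotal > 1,000,000, or LiveNodes below 50% of all nodes. In practice
# the values would come from a JMX fetch (e.g. via hadoop-cacti-jtg);
# here they arrive as plain arguments.
check_jmx_thresholds() {
  files_total=$1; live=$2; dead=$3
  total=$((live + dead))
  if [ "$files_total" -gt 1000000 ]; then
    echo "CRITICAL - FilesTotal ${files_total} > 1,000,000"; return 2
  elif [ $((live * 2)) -lt "$total" ]; then
    echo "CRITICAL - LiveNodes below 50% (${live}/${total})"; return 2
  else
    echo "OK - FilesTotal ${files_total}, LiveNodes ${live}/${total}"; return 0
  fi
}
```

Comparing live*2 against the total node count avoids floating-point math in the shell while still expressing the "less than 50% alive" rule.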

Quick JMX Base Walkthrough

- url, user, pass, object specified from the CLI
- wantedVariables, wantedOperations set by inheritance
- fetch() and output() provided

Extend for NameNode

Extend for Nagios

Monitoring Battle Plan
- Start with the basics: ping, disk (Done!)
- Add Hadoop-specific alarms: check_data_node (Done!)
- Add JMX graphing: NameNodeOperations (Done!)
- Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50% (Done!)

Review
File system growth:
- Size
- Number of files
- Number of blocks
- Ratios

Utilization:
- CPU/memory
- Disk

FSCK, DFSADMIN

Email (nightly)

The Future
- JMX coming to JobTracker and TaskTracker (0.21)
- Collect and graph running jobs
- Collect and graph map/reduce tasks per node
- Profile specific jobs in Cacti?