Sensu @ Yelp!: A Guided Tour
Sensu @ Yelp - A Guided Tour
Kyle Anderson
https://github.com/solarkennedy
Disclaimer
I’m just a dude.
I know that when I watch a presentation by a company that I recognize, I think to myself, “Hmm, $company, I’ve heard of them. They probably have their stuff together. Let’s see what they do…”
I’m here to describe, not persuade. I may not have everything together. Just because I have things with “Unit Tests”, doesn’t mean I’m “Right”.
Especially with a “framework” like Sensu, there can be more than one way to do things. The trick is figuring out what works for you. I hope by giving a real concrete example, you might be inspired to step up your monitoring game?
Outline
1. Overall Architecture
2. Sensu Server Setup
a. Custom Base Handler
3. Client Configuration
a. Sensu Check Puppet Wrapper
4. Yelp SOA Checks
5. AWS/Cloudwatch Checks
6. Dealing with Ephemeral EC2 Servers
7. Cron Job Monitoring
8. Future Work
Overall Architecture
● profile::sensu_client
○ Sensu clients connect to RabbitMQ on one of the servers (DNS Round Robin)
● profile::sensu_server
○ Base HAProxy install
○ RabbitMQ in Mirror Mode, load balanced via HAProxy
○ Redis in Master/slave mode, load balanced via HAProxy (only master passes healthcheck)
○ Sensu Server installed, subscribes on RabbitMQ
○ API load balanced via HAProxy
○ Dashboard load balanced by HAProxy
Logical Diagram
Puppet Modules in Use
puppetlabs/rabbitmq
puppetlabs/haproxy
kyleanderson/redis_sentinel
arioch/redis
sensu/sensu
Addressing Complexity
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
Laurie Denness
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
Addressing Complexity
“I will be honest; I haven’t used Sensu, because I’m in a happy place right now, but just the architectural diagram of how it works scares the shit out of me.
When you need 7 arrow colours to describe where data is going in a monitoring system, I’m starting to fear it slightly. But hey, if it works, good on you guys. It just looks a lot like this. Nothing wrong with that, if you can make it stable and reliable.”
Laurie Denness
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
First Principle: Single Point of Truth
Pop Quiz: Determine what Servers are Puppetmasters?
• A: Puppet manifests (include puppetmaster)
• B: DNS (puppet.local A 10.5.x.x)
• C: update-live script (for Server in ….)
• D: The servers that have had the puppetmaster bootstrap script run on them
• E: What MCollective says (mco find -C puppetmaster)
Answer: All / None of the above!
Sensu Server Detection
# Use DNS to detect if this server is a sensu server
$local_sensu_server_array = gethostbyname2array("sensu.local-${::habitat}.yelpcorp.com")
$ip_address_array = split($::all_ipaddresses, ',')
validate_array($local_sensu_server_array)
validate_array($ip_address_array)
$array_intersection = intersection($ip_address_array, $local_sensu_server_array)
# If our ipaddresses are in the dns entries, we must be a sensu server!
if size($array_intersection) > 0 {
  $is_sensu_server = true
} else {
  $is_sensu_server = false
}
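The same intersection test can be sketched in plain Ruby, outside Puppet. This is only an illustration of the idea: the helper names are made up for this example, and the hostname in the usage comment is a placeholder.

```ruby
require 'resolv'
require 'socket'

# True when at least one local address also appears in the DNS answer.
def sensu_server?(local_ips, dns_ips)
  !(local_ips & dns_ips).empty?
end

# All addresses behind a round-robin name (hypothetical helper).
# Resolv.getaddresses returns an empty array when nothing resolves.
def resolve_all(hostname)
  Resolv.getaddresses(hostname)
end

# IPv4 addresses bound on this machine, loopback excluded (hypothetical helper).
def local_ipv4_addresses
  Socket.ip_address_list
        .select { |addr| addr.ipv4? && !addr.ipv4_loopback? }
        .map(&:ip_address)
end

# Usage (placeholder hostname):
#   sensu_server?(local_ipv4_addresses, resolve_all('sensu.example.invalid'))
```

The point is the same as in the Puppet version: DNS stays the single point of truth, and a host never needs a hand-maintained flag saying it is a server.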
HAProxy
• Every server in the sensu cluster runs its own HAProxy
• HAProxy listens on the “standard” ports, individual instances listen on standard + 1
• Having an array of sensu servers from DNS allows us to grow the backends
• If HAProxy dies, clients will re-resolve, and reconnect.
RabbitMQ
• Every server in the sensu cluster runs a rabbitmq server in mirror mode (with autoheal for AP)
• Lots of individual clusters, not doing shoveling.
• Client authentication via SSL client certs (controlled by puppet)
• Load balanced by haproxy
• Sensu-clients automatically reconnect on failure
Redis
• Redis is the persistent store used by Sensu to keep track of heartbeats, what alerts are silenced, how many times a check has failed, etc
• Redis is set up in a cluster mode, with redis-sentinel doing automatic master/slave promotion. (Kinda CP)
• We use the redis-role haproxy master pattern suggestion from http://failshell.io/sensu/high-availability-sensu/
Sensu API + Dashboard
• sensu-api provides a rest api with json output for integration.
• sensu-cli is provided for easy command line interactive use
• Both the API and Dashboard use basic auth internally (shared secret), and then LDAP+SSL auth externally.
• sensu-dashboard uses this api, and is behind our external facing apache for authentication.
Sensu Servers:
• Automatically does master election, good. Build for 3.
• Connects to RabbitMQ, pulls events off and acts on them
• Runs “handlers” on the event data
• That’s kinda it
• Which leads to handlers….
Sensu Timing Tunables Before/After
Custom check definition key-values
Custom key-values can be added to a check definition, which will be included in event data, enabling handler creativity.
Common custom check definitions:
• interval: How frequently (in seconds) the check will be executed
• occurrences: Number of event occurrences before the handler should take action
• refresh: Number of seconds handlers should wait before taking second action. Relies on sensu-plugin.
Yelp Monitoring Check Definition Key Values
The custom base handler interprets these values:
• check_every = '5m',
• alert_after = '0s',
• realert_every = '1',
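Values like '5m' and '0s' have to become seconds somewhere before a handler can do arithmetic on them. A minimal sketch of that conversion, assuming only the suffixes s, m, h and d are allowed (the function name and the accepted units are assumptions for this example, not the actual wrapper code):

```ruby
# Seconds per unit suffix. The set of suffixes is an assumption.
UNIT_SECONDS = { 's' => 1, 'm' => 60, 'h' => 3600, 'd' => 86_400 }.freeze

# Convert '5m', '4h', '0s' (or a bare number of seconds) to an Integer.
def human_time_to_seconds(value)
  match = /\A(\d+)([smhd])?\z/.match(value.to_s.strip)
  raise ArgumentError, "unparseable duration: #{value.inspect}" unless match

  match[1].to_i * UNIT_SECONDS.fetch(match[2] || 's')
end
```

Failing loudly on anything unparseable is deliberate here: a typo in a tunable should break the catalog run, not silently become zero.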
Custom Base Handler
def filter_repeated
  # Tunables arrive in the event data; fall back to defaults when they are absent
  interval      = (@event['check']['interval'] || 0).to_i
  alert_after   = (@event['check']['alert_after'] || 0).to_i
  realert_every = (@event['check']['realert_every'] || 1).to_i
  # How long the check has been failing, in seconds
  failing_for = @event['occurrences'].to_i * interval
  if failing_for < alert_after
    bail "Only failing for #{failing_for}, less than #{alert_after}. Not performing any action yet."
  elsif interval > 0 and @event['action'] == 'create'
    # Occurrences that happened before alert_after was reached do not count
    initial_failing_occurrences = alert_after.fdiv(interval).to_i
    number_of_failed_attempts = @event['occurrences'].to_i - initial_failing_occurrences
    unless number_of_failed_attempts == 0 || number_of_failed_attempts % realert_every == 0
      bail 'only handling every ' + realert_every.to_s + ' occurrences'
    end
  end
end
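To see what the arithmetic does, here is the same decision pulled out into a standalone function. It is a sketch for playing with numbers, not the handler itself: the function name is hypothetical, all values are assumed to already be integers in seconds, and the 'create' vs 'resolve' distinction is left out.

```ruby
# True when a handler should act on this occurrence of a failing check.
def act_on_event?(occurrences:, interval:, alert_after:, realert_every:)
  failing_for = occurrences * interval
  return false if failing_for < alert_after # not failing long enough yet
  return true unless interval > 0           # no interval, nothing to throttle

  # Occurrences "used up" while waiting for alert_after do not count.
  initial_failing_occurrences = alert_after.fdiv(interval).to_i
  attempts = occurrences - initial_failing_occurrences
  attempts.zero? || (attempts % realert_every).zero?
end
```

With a check every 60 seconds, alert_after of 600 and realert_every of 5, occurrence 9 is still quiet, occurrence 10 alerts, occurrences 11 to 14 are suppressed, and occurrence 15 alerts again.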
Other Handlers In Use
● IRC (Triaged by who is “on-point”)
● Email (not a thing)
● Pagerduty (Handled by “on-call”)
● OpsGenie (trialing)
● aws_prune (only on ec2 nodes)
● motd (sensu-report, not really a handler. Used for situation awareness)
Future Handlers
● JIRA (auto create/close a ticket after a while?)
● Flapjack?
Sensu Clients
• Almost every server @yelp runs the sensu client (thank you omnibus packages!)
• They connect to the Round-Robin dns entry local to their zone.
• All checks are standalone, configured by puppet
Monitoring Check Puppet Wrapper
define monitoring_check (
  $command,
  $runbook,
  $check_every   = '5m',
  $alert_after   = '0s',
  $realert_every = '1',
  $irc_channels  = undef,
  $tip           = false,
  $page          = false,
  $wake          = true,
  $needs_sudo    = false,
  $sudo_user     = 'root',
  $team          = 'operations',
  $ensure        = 'present',
  $dependencies  = [],
  $sensu_custom  = {},
) {
……
Lots of validation. Lots of tests.
mandatory runbook!
Human readable time units!
Easy to add sudo rules!
TIP: The one line runbook for lazy humans!
Team defaults to ops for convenience.
Usually set to $::profile::server::team
Monitoring Check Puppet Wrapper Example
# Make sure apt-mirroring is working by checking the age of the NEW file left over.
monitoring_check { 'apt-mirror':
  check_every => '4h',
  team        => 'operations',
  page        => false,
  runbook     => 'y/rb-package-mirroring',
  tip         => 'Talk to kwa. Check /var/spool/apt-mirror/var/cron.log, then /nail/apt-mirror/var/apt-mirror.lock.',
  command     => '/usr/lib/nagios/plugins/check_file_age /nail/apt-mirror/var/NEW -w 86400 -c 172800',
}
Why Not Use The Native Puppet Type?
● The wrapper reduces the boilerplate and gives good defaults
● Enforces site-specific policies and validation (team names, mandatory runbooks)
● Allows us to modify all puppet-controlled sensu checks in the future from a single spot.
● Custom tests
● Allows us to be backend agnostic (maybe)
Yelp SOA Checks
• How do we (Yelp) empower our developers to monitor their services?
• How can we safely and conveniently allow devs to define checks within our SOA framework?
• How can Devs not be blocked by Ops for service deployment?
Define the Meta Check
# Defined on all hosts that run yelp SOA infrastructure
monitoring_check { 'check-yelp_soa':
  check_every => '1m',
  alert_after => '10m',
  page        => true,
  runbook     => 'http://y/rb-check-yelpsoa',
  tip         => 'Run /etc/sensu/plugins/check-yelp_soa.rb --debug to see what is wrong?',
  command     => '/etc/sensu/plugins/check-yelp_soa.rb',
  require     => Class['::yelp_soa']
}
check-yelp_soa.rb redux
def run
  # TODO: Parallelize?
  configs.each do | service, config |
    next unless services_that_run_here.include?(service)
    $log.debug "Processing #{service} as apparently it runs here"
    srv_configs = read_srv_configs(service)
    next unless srv_configs.include?('monitoring_check')
    monitoring_check = srv_configs['monitoring_check']
    if numeric?(config['port'])
      ...
      if command == 'check_http'
        url = monitoring_check['check_url'] || '/status'
        $log.debug "Making a http check for #{service}, team: #{team}, warn_timeout: #{warn_timeout}, crit_timeout: #{crit_timeout}"
        output, status = check_http(port,url,http_expect,warn_timeout,crit_timeout)
      elsif monitoring_check['command'] == 'check_tcp'
        $log.debug "Making a tcp check for #{service}, team: #{team}, warn_timeout: #{warn_timeout}, crit_timeout: #{crit_timeout}"
        output, status = check_tcp(port,warn_timeout,crit_timeout)
      else
        $log.debug "Not spawning a check for #{service} because I don't know how to run #{command}"
        next
      end
      send_result_to_sensu(service, status, output, team, runbook, tip, page, alert_after, realert_every, irc_channels)
      services_checked << service
    end # End port check
  end # End for loop
  ok "Finished run. Ran checks on #{services_checked}"
end
What was that?
• Iterate through the SOA services that are configured to run on a server.
• Determine if that service has monitoring metadata defined by the authors.
• Operate on that metadata to check it (usually check_http).
• Send the results of the check to the localhost:3030 socket as a *Different* check (“soa_$servicename”).
See https://gist.github.com/joemiller/5806570 for another example
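The last step, handing a result to the local sensu-client socket, might look roughly like this. The socket takes a JSON document with name, output and status; the helper names and the extra keys passed through are assumptions for this example, not the real send_result_to_sensu.

```ruby
require 'json'
require 'socket'

# Build the result for one service under a different check name (soa_<service>).
# Extra keys such as team or page ride along as custom check attributes.
def build_check_result(service, status, output, extra = {})
  { 'name' => "soa_#{service}", 'status' => status, 'output' => output }.merge(extra)
end

# Write the JSON to the client socket (hypothetical helper).
def send_to_client_socket(result, host: '127.0.0.1', port: 3030)
  TCPSocket.open(host, port) { |socket| socket.write(JSON.generate(result)) }
end

# Usage, on a host with sensu-client running:
#   result = build_check_result('request_blocking', 2, 'CRITICAL: timed out', 'team' => 'infra')
#   send_to_client_socket(result)
```

Because the result carries its own name, one meta check can fan out into one Sensu event per service, each routed to the team that owns it.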
An example service (request_blocking)
# from request_blocking.yaml
monitoring_check:
  team: 'infra'
  alert_after: 2m
  realert_every: 2
  irc_channels: 'infra'
  url: '/status'
  tip: "no tips yet"
  warn_timeout: 2.0
  crit_timeout: 5.0
AWS/Cloudwatch Checks
• Pretty much the same thing, except:
• Checks are executed on special monitoring hosts in the AZ (not on the ephemeral node)
• Runs graphite/check_data.rb against the provided metric name
• Written in python this time! (https://pypi.python.org/pypi/sensu)
Dealing with Ephemeral EC2 Nodes
• Yelp lives in a hybrid world, we have lots of “ephemeral” EC2 nodes that are baked and do NOT run puppet. Can Sensu still work on them?
• How do we prevent ourselves from being spammed when hosts go away “normally”?
• How do we know what a host is without logging into it? (EC2 metadata)
• Baking………..
EC2 Considerations
• We use puppet to bake AMIs for ELBs, so we can control (via puppet) how Sensu is configured at bake time.
• We can query the AWS API to know if a host has gone away, and prune it from the Queue to squelch alerts.
• Using custom client metadata, we can add things like puppet cert name, AMI_ID, etc at runtime with a special init script.
For Non-Ephemeral Instances
if str2bool($::is_ec2) == true {
  $client_custom = {
    'instance_id' => $::ec2_instanceid,
    'keepalive'   => {
      'handlers' => [ 'aws_prune', 'default' ],
      'team'     => $team,
      'page'     => true
    }
  }
} else {
  $client_custom = {
    'team' => $team,
    'page' => true
  }
}
Only EC2 Servers need the special aws_prune handler
A Fact! Embed it for easy troubleshooting
For Ephemeral (baked) Instances
description "Fix Sensu clientinfo on startup for baked ec2 instances"
author "Kyle Anderson <[email protected]>"
start on starting sensu-client
task
script
  ADDRESS=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
  AMI_ID=$(curl -s http://169.254.169.254/latest/meta-data/ami-id)
  INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
  /usr/bin/jq ".client.name = \"$(/usr/local/sbin/puppet-certname)\" | .client.address = \"$ADDRESS\" | .client.instance_id = \"$INSTANCE_ID\" | .client.ami_id = \"$AMI_ID\" " /etc/sensu/conf.d/client.json > /etc/sensu/conf.d/newclient.json
  mv /etc/sensu/conf.d/client.json /etc/sensu/conf.d/client.json.old
  mv /etc/sensu/conf.d/newclient.json /etc/sensu/conf.d/client.json
end script
Only run once, right before sensu-client
Real data. Can’t lie.
Overwrite what we were baked with. It is wrong.
jq FTW
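For anyone who would rather not shell out to jq, the same rewrite of client.json can be sketched in Ruby. The function name is hypothetical and the sample values in the usage comment are made up; in practice the four values would come from the EC2 metadata service and puppet-certname, as in the upstart job.

```ruby
require 'json'

# Return a copy of the parsed client.json with the baked-in identity replaced.
# Everything else under "client" (subscriptions and so on) is kept as it was.
def fix_client_info(config, name:, address:, instance_id:, ami_id:)
  client = config.fetch('client', {}).merge(
    'name'        => name,
    'address'     => address,
    'instance_id' => instance_id,
    'ami_id'      => ami_id
  )
  config.merge('client' => client)
end

# Usage (made-up values):
#   baked = JSON.parse(File.read('/etc/sensu/conf.d/client.json'))
#   fixed = fix_client_info(baked, name: 'web1', address: '10.0.0.7',
#                           instance_id: 'i-0123', ami_id: 'ami-4567')
#   File.write('/etc/sensu/conf.d/newclient.json', JSON.pretty_generate(fixed))
```

Writing to a new file and moving it into place afterwards, as the upstart job does, keeps a half-written client.json from ever being read by sensu-client.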
Pruning Terminated EC2 Nodes
● Modification of https://github.com/sensu/sensu-community-plugins/blob/master/handlers/other/ec2_node.rb
● Instead we use a cron job to cache the results of the api call into json so we can be nice to AWS
● Then we can have *every* check use this handler, as it is easy to just check on disk if the instance_id is active.
● Use the instance_id from the client data to figure out who you are. (which should be correct from the above)
What Does It Look Like?
file { '/etc/sensu/plugins/cache_instance_list.rb':
  owner  => 'root',
  group  => 'root',
  mode   => '0500',
  source => 'puppet:///modules/profile/sensu/handlers/cache_instance_list.rb',
} ->
cron::d { 'cache_instance_list':
  minute  => '*',
  user    => 'root',
  command => "/etc/sensu/plugins/cache_instance_list.rb -a ${access_key} -r ${region} -k ${secret_key}",
} ->
monitoring_check { 'cache_instance_list-staleness':
  check_every => '10m',
  alert_after => '1h',
  team        => 'test',
  runbook     => 'y/rb-aws-prune',
  command     => "/usr/lib/nagios/plugins/check_file_age /var/cache/instance_list.json -w 1800 -c 3600",
  page        => false,
}
The Handler (puppet)
$access_key = hiera('sensu::aws_key')
$secret_key = hiera('sensu::aws_secret')
$aws_config_hash = {
  access_key           => $access_key,
  secret_key           => $secret_key,
  region               => $region,
  blacklist_name_array => [ 'bake_soa_ami', 'Packer Builder' ]
}
sensu::handler { 'aws_prune':
  type    => 'pipe',
  source  => 'puppet:///modules/profile/sensu/handlers/aws_prune.rb',
  config  => $aws_config_hash,
  require => [ Package['rubygem-fog'], Package['rubygem-sensu-plugin'], Package['rubygem-unf'] ],
}
}
The Handler (Ruby)
def ec2_node_exists?
  running_instances = load_instances_cache
  instance_ids = running_instances.collect { |s| Hash[ 'id', s['id'], 'tags', s['tags'] ]}
  my_instance_id = @event['client']['instance_id']
  instance_ids.each do |instance|
    # Only the entry for this client matters, so a blacklisted neighbour cannot affect the result
    next unless my_instance_id == instance['id']
    # YELP SPECIFIC CODE
    instance_name = instance['tags']['Name'].to_s
    # Yelp specific: pretend that the node does not exist if we are in our blacklist
    return false if blacklist_name_array.include?(instance_name)
    return true
  end
  return false # no match found, node doesn't exist
end
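The slide does not show how the cache gets loaded, so here is one way it could work. This is a guess, not the real load_instances_cache: the function names are hypothetical, the path is the one the staleness check watches, and the file format (a JSON array of objects with id and tags) is inferred from the handler code.

```ruby
require 'json'

CACHE_PATH = '/var/cache/instance_list.json'.freeze

# Parsed instance list, or nil when the cache is missing, stale or corrupt.
# nil means "unknown", which is different from "no instances".
def read_instance_cache(path = CACHE_PATH, max_age: 3600, now: Time.now)
  return nil unless File.exist?(path)
  return nil if now - File.mtime(path) > max_age

  JSON.parse(File.read(path))
rescue JSON::ParserError
  nil
end

# true or false when the cache can answer, nil when it cannot.
def instance_listed?(cache, instance_id)
  return nil if cache.nil? # the caller should not prune on missing data

  cache.any? { |instance| instance['id'] == instance_id }
end
```

Treating a broken cache as "unknown" rather than "gone" matters for a prune handler: if the cron job stops updating the file, the safe failure mode is extra alerts, not deleted clients.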
Cron Job Monitoring
• I believe cron sending emails is an anti-pattern and not *web-scale*
• Let’s use Sensu to monitor our cron jobs!
• Use a combination of a cron puppet type wrapper and my Sensu-Shell-Helper
• Modified sensu-shell-helper includes fields for team and page for yelp-specific things: https://github.com/solarkennedy/sensu-shell-helper
What does it look like?
$command = 'chgrp -R admin /nail/packages/'
cron::d { 'fix-packages-permissions':
  mailto  => '',
  minute  => '10',
  user    => 'root',
  comment => 'Make permissions group writable for collaboration purposes',
  command => "sensu-shell-helper -n fix-packages-permissions -p false -t operations ${command}",
  ensure  => 'present'
}
See https://github.com/torrancew/puppet-cron#cronjob for related work.
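The general idea behind wrapping a cron command can be sketched in a few lines of Ruby. This is not how sensu-shell-helper is implemented; the function name is hypothetical, and mapping every non-zero exit code to critical (2) is an assumption made for this example.

```ruby
require 'open3'

# Run a command and summarise it as a Sensu-style check result hash.
# stdout and stderr are captured together so the event shows what cron
# would otherwise have emailed.
def run_as_check(name, command, team:, page:)
  output, status = Open3.capture2e(command)
  {
    'name'   => name,
    'output' => output,
    'status' => status.success? ? 0 : 2, # assumption: any failure is critical
    'team'   => team,
    'page'   => page
  }
end

# Usage:
#   run_as_check('fix-packages-permissions', 'chgrp -R admin /nail/packages/',
#                team: 'operations', page: false)
```

The resulting hash is the kind of document that could then be written to the local client socket, so a failed cron job shows up as an event for the owning team instead of as mail nobody reads.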
Future Work
● battle-test more of the pagerduty stuff (blocked on bogus aws nodes still)
● sort out AWS pruning, harder (#61626)
● make tools that work on nagios *and* sensu?
● really monitor the sensu instances in nagios with alerts (#60164)
● enable self-serve sensu alerts for services (#62201)
● make a library for sending passive checks (#62440)
● set up infrastructure for “aggregate” checks (cluster checks)
● better test the alerting tunables we have (#61628)
● enable sensu alerts for Asgardy services (#57450)
● set up easy to use metric based alerting (like horsefly, blocked on #67000)
● write my sensu-downtime tool
● write a super-dashboard (hackathon)
● write the sensu archive service (sensu-db?)
Thanks!