Cookie Notice

As far as I know, and as far as I remember, nothing in this page does anything with Cookies.

2016/07/28

Log your machines and Check your logs

"Logs" by Aapo Haapanen is licensed under CC BY 2.0

Our VMs were having problems last fall. Their connections to the file system would falter, causing a large number of processes sitting around waiting to write. The symptom we found was that the load averages would then rise incredibly high.

Like four-digits high.

So, I wrote something that would log load average once an hour. It was a convergence of lab need and an excuse to learn Log::Log4Perl. I also used Pushover to tell me when load average was greater than 20, as if I could do anything about it.

Below, Mail is a wrapper around Email::Sender::Simple and Pushover around LWP::UserAgent that handle the formatting and authentication. Neither are necessary for the logging.

#!/usr/bin/env perl

# checks for high load average on a host using uptime and
# reports using Pushover
# logs uptime, high or low, via Log4perl

# Also sends result of ps to email to start to indicate what's
# actually doing something

# https://metacpan.org/pod/Log::Log4perl

use feature qw{ say } ;
use strict ;
use warnings ;
use utf8 ;

use Data::Dumper ;
use DateTime ;
use IO::Interactive qw{ interactive } ;
use Log::Log4perl ;

use lib '/home/djacoby/lib' ;
use Mail ;
use Pushover ;

# my $host = $ENV{HOSTNAME} ;
my $host = `/bin/hostname -s ` ;
chomp $host ;

Log::Log4perl::init( '/home/djacoby/.log4perl.conf') ;
my $logger = Log::Log4perl::get_logger( 'varlogrant.uptime' );
my @uptime = check_uptime() ;
$logger->trace( qq{$host : $uptime[0] $uptime[1] $uptime[2]});

if ( $uptime[0] > 20 ) {
    my $ps = process_table() ;
    my $message ;
    $message->{ message } = "High Load Average on $host: " . join ' ' , @uptime ;
    my $out = pushover( $message ) ;
    #send_table( join "\n\n" , ( join ' ' , @uptime ) , $ps ) ;
    }

exit ;

sub check_uptime {
    my $program = '/usr/bin/uptime' ;
    my $uptime = qx{$program} ;
    my @num = map {s/,//;$_ } ( split /\s+/ , $uptime )[-3,-2,-1] ;
    return @num ;
    }

sub process_table {
    my $out = qx{/bin/ps -U gcore -u gcore u } ;
    return $out ;
    }

sub send_table {
    my $body = shift ;
    my $date = DateTime->now()->iso8601() ;
    my $msg;
    $msg->{ identity } = 'example' ;
    $msg->{ subject } = qq{High Load on $host: $date} ;
    $msg->{ to } = 'varlogrant@example.com' ;
    $msg->{ body } = $body ;
    $msg->{ attachments } = [] ;
    send_mail($msg) ;
    }

Eventually, those issues worked out. The evidence of file system hinkiness is that, on occasion, we try to save or open a file, it takes a few minutes — I have learned from experience that mkdir does not display atomicity — but we never see the high load averages and catastrophically long file access times of a few months ago.

But the logging never left my crontab.

I started looking at and playing with new things, and I wrote an API that allowed me to curl from several machines once an hour, and I would get Pushover notifications when machines were down.

(You can really thank Phil Sturgeon and his excellent book, Build APIs You Won't Hate, for that. I'm not quite there with my API, though. It'd probably make an interesting blog post, but it's built on pre-MVC technology.)

(And yes, I really like Pushover. In general, I turn off notifications for most apps and only pay attention to what I have Pushover tell me.)

Anyway, I'd get notifications telling me my web server is down, then pull out my phone and find the web server up and responsive. I'm putting that into MySQL, so a query told me that, on some hours, I'd get five heartbeats, some four, and some 3, so I was sure that the issue wasn't with the API.

I log and get Pushover notifications set at the @reboot section of my crontab, and that hadn't warned me recently, so I knew the machines were up, but not responding.

Then I remembered that I never stopped monitoring load averages, and started looking at those logs.


#!/usr/bin/env perl

# reads and parses the uptime log

use feature qw{ say } ;
use strict ; use warnings ; use utf8 ; use DateTime ; my $file = q{/home/jacoby/mnt/rcac/.uptime.log} ; if ( -f $file && open my $fh, '<', $file ) { my $data ; my $last ; my $obj ; while (<$fh>) { chomp ; my ( $date, $time, $server ) = split m{\s+} ; next unless $server ; # say $server ; my ( $year, $month, $day ) = split m{/}, $date ; my ( $hour, $minute, $second ) = split m{:}, $time ; my $latest = DateTime->new( year => $year, month => $month, day => $day, hour => $hour, minute => $minute, second => 0, time_zone => 'UTC', ) ; my $diff = 0 ; # next if $year != 2016 ; # next if $month < 7 ; my $ymd = $latest->ymd ; my $hms = $latest->hms ; next if $ymd !~ /^2016-07/ ; push @{ $obj->{$ymd}{$hms} }, $server ; } my @hosts = sort qw{ genomics genomics-test genomics-apps genomics-db } ; for my $y ( sort keys %$obj ) { my $day = $obj->{$y} ; for my $h ( sort keys %$day ) { my @list = @{ $obj->{$y}{$h} } ; my %list = map { $_ => 1 } @list ; my @down = grep { !$list{$_} } @hosts ; next if !scalar @down ; say join ' ', $y, $h, @down ; } } } __DATA__ two days results: 2016-07-23 01:00:00 genomics-test 2016-07-23 02:00:00 genomics-test 2016-07-23 08:00:00 genomics genomics-db 2016-07-23 13:00:00 genomics-apps 2016-07-23 16:00:00 genomics-apps 2016-07-23 18:00:00 genomics-db 2016-07-23 19:00:00 genomics-test 2016-07-23 21:00:00 genomics-apps 2016-07-24 05:00:00 genomics genomics-apps 2016-07-24 07:00:00 genomics 2016-07-24 10:00:00 genomics-apps 2016-07-24 10:01:00 genomics genomics-db genomics-test 2016-07-24 11:00:00 genomics-db 2016-07-24 13:00:00 genomics-apps 2016-07-24 18:00:00 genomics genomics-apps 2016-07-24 23:00:00 genomics genomics-apps genomics-db

We see above that of the four VMs I monitor, all four fail to log multiple times, and many times, three of four VMs fail to run their crontabs. Since I had something more solid than "Hey, that's funny", I went to my admins about this. Looks like VMs are failing to authenticate with the LDAP server. My admins are taking it up the chain.

So, beyond how I make and parse logs, which might not be the best examples you can find, the message here is that it's hard to identify a problem unless you're tracking it, and even tracking something else might help you identify a problem.