Advanced Server Monitoring with Riemann and Graphite

My current server monitoring setup is documented in my CentOS 5 server tutorials. It consists of Nagios for service monitoring and Cacti for graphing of metrics including system load, network and disk space.

Both tools are very commonly used and lots of resources are available on their setup & configuration, but I never kicked the feeling that they were plain clunky. Over the past several months, I have performed several research and evaluated a variety of tools and thankfully came across the monitoring sucks effort which aims to document a bunch of blog posts on monitoring tools and their different merits and weaknesses. The collection of all documentation the is now kept in the monitoring sucks GitHub repo.

Long story short, each tool seems to only do part of the job. I hate redundancy, and I believe that a good monitoring system would:

  1. provide an overview of the current service status;
  2. notify you appropriately and timely when things go wrong; and
  3. provide a historical overview of data to establish some sort of baseline / normal level for collected metrics (i.e graphs and 99-percentiles)
  4. ideally, be able to react proactively when things go wrong

You'll find that most tools will do two of four above well, which is just enough to be annoyingly useful. You'll need to implement 2-3 overlapping tools that do one thing well and the other just okay. Well, I don't like to live with workarounds.

Choosing the right tool for the job

I did a bit of research and solicited some advice on r/sysadmin, but sadly it did not get enough upvotes to be very noticed. Collectd looked like a wonderful utility. It is simple, high-performance and focused on doing one thing well. It was trivial to get it writing tons of system metrics to RRD files, at which point Visage provided a smooth user interface. Although it was a step in the right direction as far as what I was looking for, it still only did two of the four items above.

Introducing Riemann

Then, I stumbled across Riemann through his Monitorama 2013 presentation. Although not the easiest to configure and its notification support is a bit lacking, it has several features that immediately piqued my interest:

  • Its architecture forgoes the traditional polling and instead processes arbitrary event streams.
    • Events can contain data (the metric) as well as other information (hostname, service, state, timestamp, tags, ttl)
    • Events can be filtered by their attributes and transformed (percentiles, rolling averages, etc)
    • Monitoring up new machines is as easy as pushing to your Riemann server from the new host
    • Embed a Riemann client into your application or web service and easily add application level metrics
    • Let collectd do what it does best and have it shove the machine's health metrics to Riemann as an event stream
  • It is built for scale, and can handle thousands of events per second
  • Bindings (clients) are available in multitudes of languages
  • Has (somewhat primitive) support for notifications and reacting to service failures, but Riemann is extensible so you can add what you need
  • An awesome, configurable dashboard

All of this is described more adequately and in greater detail on its homepage. So how do you get it?

Installing Riemann

This assumes you are running CentOS 6 or more better (e.g. recent version of Fedora). In the case of CentOS, it also assumes that you have installed the EPEL repository.

yum install ruby rubygems jre-1.6.0
gem install riemann-tools daemonize
rpm -Uhv http://aphyr.com/riemann/riemann-0.2.4-1.noarch.rpm
chkconfig riemann on
service riemann start

Be sure to open ports 5555 (both TCP and UDP), 5556 (TCP) and in your firewall. Riemann will uses 5555 for event submission, 5556 for a WebSockets connection to the server.

Riemann is now ready to go and accept events. You can modify your configuration at /etc/riemann/riemann.config as required - here is a sample from my test installation:

; -*- mode: clojure; -*-
; vim: filetype=clojure

(logging/init :file "/var/log/riemann/riemann.log")

; Listen on the local interface over TCP (5555), UDP (5555), and websockets (5556)
(let [host "my.hostname.tld"]
  (tcp-server :host host)
  (udp-server :host host)
  (ws-server  :host host))

; Expire old events from the index.
(periodically-expire 5)

; Custom stuffs

; Graphite server - connection pool
(def graph (graphite {:host "localhost"}))
; Email handler
(def email (mailer {:from "riemann@my.hostname.tld"}))

; Keep events in the index for 5 minutes by default.
(let [index (default :ttl 300 (update-index (index)))]

  ; Inbound events will be passed to these streams:
  (streams

    (where (tagged "rollingavg")
      (rate 5
        (percentiles 15 [0.5 0.95 0.99] index)
        index graph
      )
      (else
        index graph
      )
    )

    ; Calculate an overall rate of events.
    (with {:metric 1 :host nil :state "ok" :service "events/sec" :ttl 5}
      (rate 5 index))

    ; Log expired events.
    (expired
      (fn [event] (info "expired" event)))
))

The default configuration was modified here to do a few things differently:
  • Expire old events after only 5 seconds
  • Automatically calculate percentiles for events tagged with rollingavg
  • Send all event data to Graphite for graphing and archival
  • Set an email handler that, with some minor changes, could be used to send service state change notifications

Installing Graphite

Graphite can take data processed by Riemann and store it long-term, while also giving you tons of neat graphs.

yum --enablerepo=epel-testing install python-carbon python-whisper graphite-web httpd

We now need to edit /etc/carbon/storage-schemas.conf to tweak the time density of retained metrics. Since Riemann supports processing events quickly, I like to retain events at a higher precision than the default settings:
# Schema definitions for Whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds.
#
#  [name]
#  pattern = regex
#  retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...

# Carbon's internal metrics. This entry should match what is specified in
# CARBON_METRIC_PREFIX and CARBON_METRIC_INTERVAL settings
[carbon]
pattern = ^carbon\.
retentions = 60:90d

#[default_1min_for_1day]
#pattern = .*
#retentions = 60s:1d

[primary]
pattern = .*
retentions = 10s:1h, 1m:7d, 15m:30d, 1h:2y

After making your changes, start the carbon-cache service:
service carbon-cache start
chkconfig carbon-cache on
touch /etc/carbon/storage-aggregation.conf

Now that Graphite's storage backend, Carbon, is running, we need to start Graphite:

python /usr/lib/python2.6/site-packages/graphite/manage.py syncdb
chown apache:apache /var/lib/graphite-web/graphite.db
service httpd graceful

Graphite should now be available on http://localhost - if this is undesirable, edit /etc/httpd/conf.d/graphite-web.conf and map it to a different hostname / URL according to your needs.

Note: as of writing, there's a bug in the version of python-carbon shipped with EL6 that complains incessantly to your logs if the storage-aggregation.conf configuration file doesn't exist. Let's create it to avoid a hundred-megabyte log file:

touch /etc/carbon/storage-aggregation.conf

But what about EL5

I am not going to detail how to install the full Riemann server on EL5, as the dependencies are far behind and it would require quite a bit of work. However, it is possible to install riemann-tools on RHEL/CentOS 5 for monitoring the machine with minimal work.

The rieman-health initscript requires the 'daemonize' command, install it via yum (EL6) or obtain it for EL5 here: http://pkgs.repoforge.org/daemonize/

The riemann-tools ruby gem and its dependencies will require a few development packages in order to build, as well as Karan's repo providing an updated ruby-1.8.7:

cat << EOF >> /etc/yum.repos.d/karan-ruby.repo
[kbs-el5-rb187]
name=kbs-el5-rb187
enabled=1
baseurl=http://centos.karan.org/el\$releasever/ruby187/\$basearch/
gpgcheck=1
gpgkey=http://centos.karan.org/RPM-GPG-KEY-karan.org.txt
EOF
yum update ruby\*
yum install ruby-devel libxml2-devel libxslt-devel libgcrypt-devel libgpg-error-devel
gem install riemann-tools --no-ri --no-rdoc

Rating: 

Building a home media server with ZFS and a gaming virtual machine

Work has kept me busy lately so it's been a while since my last post... I have been doing lots of research and collecting lots of information over the holiday break and I'm happy to say that in the coming days I will be posting a new server setup guide, this time for a server that is capable of running redundant storage (ZFS RAIDZ2), sharing home media (Plex Media Server, SMB, AFP) as well as a full Windows 7 gaming rig simultaneously!

Windows runs in a virtual machine and is assigned it's own real graphics card from the host's hardware using the using the brand-new VFIO PCI passthrough technique with the VGA quirks enabled. This does require a motherboard and CPU with support for IOMMU, more commonly known as VT-d or AMD-Vi.

Rating: 

Setting the OS X kernel asynchronous I/O limits to avoid VirtualBox crashing OS X

I had the need to setup a new VM for software testing today, and I kept running into intermittent problems where VirtualBox would freeze and then an OS X kernel panic, freezing/crashing the entire machine.

Luckily, I had made a snapshot in the OS moments earlier to the crash so I had a safe place to revert to, but the crashes kept happing at seemingly random times.

I setup a looped execution of 'dmesg' to see what was going on just before the crash and saw this at the next freeze:

VBoxDrv: host_vmxon  -> vmx_use_count=1 rc=0
VBoxDrv: host_vmxoff -> vmx_use_count=0
VBoxDrv: host_vmxon  -> vmx_use_count=1 rc=0
aio_queue_async_request(): too many in flight for proc: 16.
aio_queue_async_request(): too many in flight for proc: 16.
aio_queue_async_request(): too many in flight for proc: 16.
aio_queue_async_request(): too many in flight for proc: 16.
aio_queue_async_request(): too many in flight for proc: 16.
aio_queue_async_request(): too many in flight for proc: 16.

The first VBoxDrv messages didn't pull anything interesting in Google, but the other messages did: Virtual Box ticket #11219 and this blog post.

It would appear that the default limits for the OS X kernel's asynchronous I/O are very, very low. VirtualBox likely exceeds them when your VM(s) are performing heavy disk I/O, hence the 'too many in flight' message in the logs.

Luckily for us, there's a quick and easy solution:

sudo sysctl -w  kern.aiomax=512 kern.aioprocmax=128 kern.aiothreads=16

then restart VirtualBox. These settings will apply until you reboot. To make the changes permanent, add/update the following lines in /etc/sysctl.conf:
kern.aiomax=512
kern.aioprocmax=128
kern.aiothreads=16

Note: you can probably set those limits even higher, as documentation for Sybase (by SAP) recommends values 2048 / 1024 / 16 when using its software.

Rating: