Advanced Server Monitoring with Riemann and Graphite

My current server monitoring setup is documented in my CentOS 5 server tutorials. It consists of Nagios for service monitoring and Cacti for graphing metrics such as system load, network traffic and disk space.

Both tools are very commonly used, and lots of resources are available on their setup and configuration, but I never shook the feeling that they were plain clunky. Over the past several months, I did quite a bit of research and evaluated a variety of tools, and thankfully came across the "monitoring sucks" effort, which aims to collect blog posts on monitoring tools and their respective merits and weaknesses. All of that documentation is now kept in the monitoring sucks GitHub repo.

Long story short, each tool seems to only do part of the job. I hate redundancy, and I believe that a good monitoring system would:

  1. provide an overview of the current service status;
  2. notify you appropriately and in a timely manner when things go wrong;
  3. provide a historical overview of data to establish a baseline / normal level for collected metrics (i.e. graphs and 99th percentiles); and
  4. ideally, react proactively when things go wrong.

You'll find that most tools do only two of the four above well, which is just enough to be annoyingly useful. You end up implementing two or three overlapping tools, each doing one thing well and the rest just okay. Well, I don't like living with workarounds.

Choosing the right tool for the job

I did a bit of research and solicited some advice on r/sysadmin, but sadly the post did not get enough upvotes to be noticed. collectd looked like a wonderful utility: simple, high-performance and focused on doing one thing well. It was trivial to get it writing tons of system metrics to RRD files, at which point Visage provided a smooth user interface. Although it was a step in the right direction, it still covered only two of the four items above.

Introducing Riemann

Then I stumbled across Riemann through its author's Monitorama 2013 presentation. Although it is not the easiest to configure and its notification support is a bit lacking, it has several features that immediately piqued my interest:

  • Its architecture forgoes the traditional polling and instead processes arbitrary event streams.
    • Events can contain data (the metric) as well as other information (hostname, service, state, timestamp, tags, ttl)
    • Events can be filtered by their attributes and transformed (percentiles, rolling averages, etc)
    • Monitoring new machines is as easy as pushing to your Riemann server from the new host
    • Embed a Riemann client into your application or web service and easily add application level metrics
    • Let collectd do what it does best and have it shove the machine's health metrics to Riemann as an event stream
  • It is built for scale, and can handle thousands of events per second
  • Bindings (clients) are available in multitudes of languages
  • Has (somewhat primitive) support for notifications and reacting to service failures, but Riemann is extensible so you can add what you need
  • An awesome, configurable dashboard
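To make the event model concrete, here is a minimal sketch of what a Riemann event looks like from the Ruby client's point of view. The hostname and service name below are made-up examples, and actually sending the event requires the riemann-client gem and a reachable server:

```ruby
# A Riemann event is just a set of attributes. Tagging it "rollingavg"
# is what a server-side stream rule can match on to compute percentiles.
event = {
  host:    'web01',              # hypothetical hostname
  service: 'api latency',        # hypothetical service name
  metric:  42.3,
  state:   'ok',
  tags:    ['rollingavg'],
  ttl:     10                    # seconds until the event expires from the index
}

# Sending it is one line once a client exists (requires the riemann-client gem):
#   require 'riemann/client'
#   Riemann::Client.new(host: 'my.hostname.tld', port: 5555) << event

puts event[:tags].first
```

Because the event carries its own metadata, the server needs no per-host configuration; any process that can build such a hash can become a metric source.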

All of this is described more adequately and in greater detail on its homepage. So how do you get it?

Installing Riemann

This assumes you are running CentOS 6 or newer (e.g. a recent version of Fedora). On CentOS, it also assumes that you have the EPEL repository installed.

yum install ruby rubygems jre-1.6.0
gem install riemann-tools daemonize
rpm -Uhv
chkconfig riemann on
service riemann start

Be sure to open ports 5555 (both TCP and UDP) and 5556 (TCP) in your firewall. Riemann uses 5555 for event submission and 5556 for WebSocket connections to the server.

Riemann is now ready to go and accept events. You can modify your configuration at /etc/riemann/riemann.config as required - here is a sample from my test installation:

; -*- mode: clojure; -*-
; vim: filetype=clojure

(logging/init :file "/var/log/riemann/riemann.log")

; Listen on the local interface over TCP (5555), UDP (5555), and websockets (5556)
(let [host "my.hostname.tld"]
  (tcp-server :host host)
  (udp-server :host host)
  (ws-server  :host host))

; Expire old events from the index.
(periodically-expire 5)

; Custom stuffs

; Graphite server - connection pool
(def graph (graphite {:host "localhost"}))
; Email handler
(def email (mailer {:from "riemann@my.hostname.tld"}))

; Keep events in the index for 5 minutes by default.
; Keep events in the index for 5 minutes by default.
(let [index (default :ttl 300 (update-index (index)))]

  ; Inbound events will be passed to these streams:
  (streams

    ; Calculate rolling percentiles for tagged events; index and graph everything.
    (where (tagged "rollingavg")
      (rate 5
        (percentiles 15 [0.5 0.95 0.99] index))
      index graph
      (else
        index graph))

    ; Calculate an overall rate of events.
    (with {:metric 1 :host nil :state "ok" :service "events/sec" :ttl 5}
      (rate 5 index))

    ; Log expired events.
    (expired
      (fn [event] (info "expired" event)))))

The default configuration was modified here to do a few things differently:
  • Scan for and expire old events from the index every 5 seconds
  • Automatically calculate percentiles for events tagged with rollingavg
  • Send all event data to Graphite for graphing and archival
  • Set an email handler that, with some minor changes, could be used to send service state change notifications
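As a sketch of that last point, the email handler defined above could be wired into the streams so that state transitions trigger a notification. The recipient address is a placeholder; `changed-state` fires whenever a service moves between states (e.g. ok to critical):

```clojure
; Inside the (streams ...) block: email the admin on any state change.
(changed-state {:init "ok"}
  (email "admin@my.hostname.tld"))
```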

Installing Graphite

Graphite can take data processed by Riemann and store it long-term, while also giving you tons of neat graphs.

yum --enablerepo=epel-testing install python-carbon python-whisper graphite-web httpd

We now need to edit /etc/carbon/storage-schemas.conf to tweak the time density of retained metrics. Since Riemann can process events at high rates, I like to retain metrics at a higher precision than the default settings:
# Schema definitions for Whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds.
#  [name]
#  pattern = regex
#  retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...

# Carbon's internal metrics. This entry should match what is specified in
# CARBON_METRIC_PREFIX and CARBON_METRIC_INTERVAL settings.
[carbon]
pattern = ^carbon\.
retentions = 60:90d

#[default_1min_for_1day]
#pattern = .*
#retentions = 60s:1d

[default]
pattern = .*
retentions = 10s:1h, 1m:7d, 15m:30d, 1h:2y
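As a sanity check on disk usage, the retention schedule above works out to a fixed number of datapoints per metric. Whisper preallocates roughly 12 bytes per datapoint (an approximation, plus a small header), so the per-metric cost can be estimated up front:

```ruby
# Datapoints stored per metric for "10s:1h, 1m:7d, 15m:30d, 1h:2y":
points = (3600 / 10) +          # 10s precision for 1 hour   ->   360 points
         (7 * 24 * 60) +        # 1m  precision for 7 days   -> 10080 points
         (30 * 24 * 4) +        # 15m precision for 30 days  ->  2880 points
         (2 * 365 * 24)         # 1h  precision for 2 years  -> 17520 points

puts points                     # 30840
puts points * 12 / 1024         # ~361 KiB per metric, assuming ~12 B/point
```

A few hundred KiB per metric adds up quickly across many hosts and services, so it is worth doing this arithmetic before committing to a schedule.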

After making your changes, start the carbon-cache service:
service carbon-cache start
chkconfig carbon-cache on

Now that Graphite's storage backend, Carbon, is running, we need to start Graphite:

python /usr/lib/python2.6/site-packages/graphite/manage.py syncdb
chown apache:apache /var/lib/graphite-web/graphite.db
service httpd graceful

Graphite should now be available on http://localhost - if this is undesirable, edit /etc/httpd/conf.d/graphite-web.conf and map it to a different hostname / URL according to your needs.

Note: as of writing, there's a bug in the version of python-carbon shipped with EL6 that complains incessantly to your logs if the storage-aggregation.conf configuration file doesn't exist. Let's create it to avoid a hundred-megabyte log file:

touch /etc/carbon/storage-aggregation.conf

But what about EL5?

I am not going to detail how to install the full Riemann server on EL5, as its dependencies are far behind and it would require quite a bit of work. However, it is possible to install riemann-tools on RHEL/CentOS 5 with minimal work to monitor the machine itself.

The riemann-health initscript requires the 'daemonize' command; install it via yum (EL6) or obtain it for EL5 here:

The riemann-tools ruby gem and its dependencies will require a few development packages in order to build, as well as Karan's repo providing an updated ruby-1.8.7:

cat << EOF >> /etc/yum.repos.d/karan-ruby.repo
(paste Karan's ruby repo definition here)
EOF
yum update ruby\*
yum install ruby-devel libxml2-devel libxslt-devel libgcrypt-devel libgpg-error-devel
gem install riemann-tools --no-ri --no-rdoc



Do you use riemann-tools to send metrics to riemann? Or do you use collectd or something else to send general server metrics?

At the moment, riemann-health (part of riemann-tools) is serving my needs well, but collectd can easily give you tons of additional health stats if you need them. While I have not put them into production, I have also experimented with the client bindings, having applications send their own metrics to Riemann (and Graphite) for later analysis.

Thanks for the response. If you're using this on EL6 boxes, are you using RVM or something to install ruby 1.9.x? It looks like the latest version of the riemann-tools gem requires ruby 1.9.2 :(

Building native extensions.  This could take a while...
ERROR:  Error installing riemann-tools:
        mime-types requires Ruby version >= 1.9.2.

Ah, yeah I ran into that issue recently too. I remember it being a pain to troubleshoot... Unfortunately, I can't seem to find the notes I wrote on the subject. This is what was in the tail of my command history though:

gem install sinatra -v 1.3.2
gem install riemann-tools -v 0.1.1

If it helps, these are also the exact versions of the dependencies I have installed, from gem list:

*** LOCAL GEMS ***

beefcake (0.4.0)
builder (3.2.2)
daemons (1.1.9)
erubis (2.7.0)
eventmachine (1.0.3)
excon (0.28.0)
formatador (0.2.4)
mtrc (0.0.4)
multi_json (1.3.6)
munin-ruby (0.2.5)
rack (1.5.2)
rack-protection (1.5.1)
redis (3.0.6)
riemann-client (0.2.2)
riemann-dash (0.2.6)
riemann-tools (0.1.1)
sass (3.2.12)
sinatra (1.3.2)
thin (1.6.1)
tilt (1.4.1)
trollop (2.0)
yajl-ruby (1.1.0)

This is using ruby-

You've already answered many of my questions. Now I almost know what to do and what my first steps should look like.
Thank you for the assistance!
