EOS is a Linux distribution (based on Fedora), which means, among other things, that it can be monitored like any Linux server running Fedora. In this post we show how to package a popular open-source monitoring framework, tcollector, as an EOS extension.
A bit of history
OpenTSDB is a distributed time series database used for infrastructure monitoring in many medium- to large-scale environments. It uses a push model: OpenTSDB does not pull data from a fixed list of targets. Instead, the targets themselves, be they bare-metal machines, VMs, containers, cron jobs, or anything else, are responsible for pushing their monitoring data to OpenTSDB. This is one of the key design aspects that make OpenTSDB easy to scale and operate: adding monitoring capacity is as simple as spinning up more instances of OpenTSDB, and in case of failure, the targets are responsible for finding an OpenTSDB instance they can connect to.
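To make the push model concrete, here is a minimal sketch of what a target might do, assuming OpenTSDB's telnet-style `put` interface on its default port 4242. The function names, metric name, and host names are hypothetical and chosen only for illustration. This is not how tcollector itself is implemented, which handles buffering and reconnection with more care.

```python
import socket


def format_put(metric, timestamp, value, tags):
    """Build one line of OpenTSDB's telnet-style protocol:
    put <metric> <unix-timestamp> <value> <tagk=tagv> ...
    """
    tag_str = " ".join("%s=%s" % (key, tags[key]) for key in sorted(tags))
    return "put %s %d %s %s\n" % (metric, int(timestamp), value, tag_str)


def push(line, instances, timeout=2.0):
    """Try each (host, port) pair in order until one accepts the line.

    Returns the pair that accepted it, or None if every instance failed.
    The target, not OpenTSDB, is responsible for this failover.
    """
    for host, port in instances:
        try:
            with socket.create_connection((host, port), timeout=timeout) as conn:
                conn.sendall(line.encode("ascii"))
            return (host, port)
        except OSError:
            # Instance unreachable or refusing connections: try the next one.
            continue
    return None
```

A target would call something like `push(format_put("sys.cpu.user", time.time(), 42.5, {"host": "switch01"}), [("tsd1.example.com", 4242), ("tsd2.example.com", 4242)])`, where the two host names stand in for whichever OpenTSDB instances are deployed.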
One of the most common ways of pushing monitoring data to OpenTSDB is to use tcollector, a utility written in Python that usually runs on all servers and VMs. tcollector comes with dozens of built-in collectors, which gather hundreds of metrics from Linux as well as from MySQL, Postgres, Elasticsearch, Hadoop, HBase, HAProxy, Varnish, ZooKeeper, and more. Creating new collectors is easy too: most are simple shell or Python scripts, but they can be written in any programming language.
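As an illustration of how small a collector can be, the sketch below reports the 1-minute load average. It relies on the tcollector convention that a collector prints one data point per line to standard output, in the form `metric timestamp value tag=value`, and leaves the delivery to tcollector. The metric name, the tag, and the function names are made up for this example, and the script emits a single sample and exits, whereas a long-running collector would loop and sleep between samples.

```python
#!/usr/bin/env python
"""Hypothetical tcollector-style collector: 1-minute load average."""
import os
import sys
import time


def format_sample(metric, timestamp, value, tags=None):
    """Return one 'metric timestamp value tag=value ...' line."""
    fields = [metric, str(int(timestamp)), str(value)]
    for key in sorted(tags or {}):
        fields.append("%s=%s" % (key, tags[key]))
    return " ".join(fields)


def collect(now=None):
    """Gather the current samples as a list of output lines."""
    now = time.time() if now is None else now
    load1, _, _ = os.getloadavg()  # Unix only
    # "example.loadavg.1min" and the "source" tag are illustrative names.
    return [format_sample("example.loadavg.1min", now, load1,
                          {"source": "sketch"})]


if __name__ == "__main__":
    for line in collect():
        print(line)
    # Flush so the parent process sees the data point immediately.
    sys.stdout.flush()
```

Keeping the formatting in its own function makes the output easy to check without running the collector under tcollector.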
In March 2013, during one of the biannual hackathons hosted at Arista, called “hack-a-switch”, I decided to integrate tcollector into EOS. This involved writing a custom CLI plugin and a tiny bit of C++ code. The extension then found an avid user base in Arista’s proof-of-concept (POC) team, which started to use it regularly to track real-time CPU and memory usage during POCs, especially those with demanding cloud customers.
Fast forward to November of that year, when we started building the EOS SDK in partnership with one of our biggest cloud customers. By the following month, we had Python bindings available thanks to SWIG. In September 2014, we rewrote the extension to use the EOS SDK, and the rewrite was subsequently open-sourced on GitHub. What follows are the instructions to build and deploy your own tcollector extension, so that you can monitor EOS just like any server while also getting visibility into the data plane.