Career advice and server monitoring

I came across a tweet the other day that had some career advice that struck a chord with me. Unfortunately, I can't remember who wrote it so I can't find it. It listed a few bullet points to help in advancing your career. One of the bullet points was something along the lines of:

Look for places to work that give you enough agency to actually make changes.

This alone has given me the most satisfaction at work. It allows me to solve new problems, make a tangible impact, and often explore new technologies. Not only is this immensely fulfilling to me, but I think it's also what has made me a valuable part of my team.

I've always considered myself to be entrepreneurial, and although I'm still waiting on my big break with my side hustles, this has often helped scratch that itch. Lots of businesses are born out of solving pain points for people, and this is very much like that.

In a series of posts, I'd like to share some of what I've done to solve problems where I work. This will be an outlet to list some accomplishments I can look back on, and hopefully also a source of inspiration and ideas for you to make an impact in your own work environment.

Case #1: Server Monitoring

This is the story of how and why I wrote a little Go application.

At work, we've been slowly transitioning more of our application servers from Windows to Linux boxes. As part of this migration, we've needed to find new tools to help manage and monitor these servers.

I helped set up, and now maintain, some of our system monitoring tools for this initiative. We use Grafana in conjunction with Prometheus and Prometheus's node exporter to monitor server metrics:

  • CPU usage
  • RAM usage
  • Disk space
  • etc

We use Alertmanager to send us Slack messages whenever problems arise.

Wait, what happened?

This has worked really well...while we're at our desks. If an alert occurred, I could ssh into the offending server and run a series of commands based on the alert type to figure out what was going on. However, sometimes we'd get alerts in the middle of the night that would resolve themselves before we were able to check them out manually. This was both worrisome and frustrating. We couldn't catch the problem.

After encountering various alerts over time, I settled into a bit of a routine for discovering what was going on. If the server was encountering a CPU or RAM spike, I would log in and use ps to view the currently running processes and find the culprit(s). If it was a disk space alert, I knew which directories on certain servers were likely to be the cause, so I would cd to those and use the du command to see their sizes. This was all a manual process that I had to perform as soon as we got the alert. I wanted to automate this.

Choosing the tech

Prometheus's node exporter is an app that runs on every server you're monitoring. It exposes metrics, via an HTTP endpoint, for your Prometheus instance to scrape. This seemed like a good model to use for my purposes. I wanted to be able to hit a port over HTTP and have it respond with the output of the commands I would normally run for that alert type. Now I just needed to settle on a language. We have several Linux servers that take on a wide variety of roles: nginx proxy, API server, static asset server, Docker host, etc. I couldn't count on these various servers having any one language installed, which meant I either had to:

  • pick a language and install that on all of our servers
  • pick a language that could be compiled to a single executable

Because I mostly do frontend React work these days, I was excited to finally find a use case for Deno, a JavaScript runtime that can compile a program to a single executable file. I ran into two issues with Deno though:

Deno wasn't going to work for me, so I settled on Go, which I had messed around with a couple of years back and which seemed perfect for my use case. It compiles to a single executable, works on CentOS, and the executables are relatively small. What I ended up developing compiled to a 7 MB executable.

Architecture

I wanted to write an HTTP application that accepts an alert name via a query parameter, executes a predetermined set of commands based on that alert name, and outputs the result of those commands.

http://myserver.acme.com:5050/report?alertName=RAM Used

I wanted the commands to be configurable per server. Depending on the server, I would want to look in different places for a Disk Space alert. For a database server, I would want to check where its data is being stored. For a box serving as an nginx proxy, I would want to check the logs directory. These directories aren't all applicable to every server. For this, I made the Go application accept a CLI argument that is the path to a JSON configuration file. The file contains the list of commands to run for each alert name. I made sure the code reads this file on every HTTP request, rather than reading it once at startup and keeping it in memory, so I could update the config file without having to bring the app down or redeploy it. The app would be deployed and running on all the same servers that Prometheus is monitoring.

Once I had this little "resource reporter" app with a single endpoint working, I moved on to the next step, which was the trigger. Luckily, Alertmanager can be configured to use a generic webhook when alerts occur. This would obviously be my trigger. Rather than develop a whole separate Go application for this, I decided to just add it to my existing reporter app.

I wrote a new endpoint in the same app that would

  • accept the incoming alerts from Alertmanager
  • determine which server triggered the alert (based on the Alertmanager payload)
  • make an http request to the resource reporter app running on the alerting server
    • which would report back the outputs of the predetermined commands
  • send those command outputs to our company's devops Slack channel

So now my little resource app has two endpoints:

  • /report?alertName=<alertName> - run commands and respond with the outputs of each
  • /webhook - Alertmanager webhook, which calls the /report endpoint for other servers

I compiled this, deployed it as a Linux service on all of our servers, and configured Alertmanager to use it as a webhook. It's been working beautifully.

Remember when I said things would spike the CPU in the middle of the night? Turns out it was our anti-virus conducting a scan. Luckily we can configure when those scans occur and can ignore certain alerts during specific times in Alertmanager.

Take-aways

Sometimes I'll look at a problem involving a new piece of tech and, for some reason, think I'm bound by what other people have already built. "I can't find anything someone else has built to do what I want, so I guess it can't be done." This attitude is absurd. We are the builders.

I also find that I try to be too much of a purist and get hung up on "good" architecture. "This thing that I'm building needs to be infinitely flexible to accommodate every scenario"...even though I know I won't run into that scenario. I considered introducing an event dispatcher in my little application so I could add other ways to be notified: email, text, Discord, etc. That was way too complicated for what I needed now. It reminds me of some advice I've heard regarding purchasing power tools:

Buy the cheap version first. If you find yourself using it a ton, then upgrade.

The advice I have to repeat to myself:

Build the version I need now first, because more often than not it will suffice and I can move on to another problem to solve.

My advice for developers looking to advance their careers: Look for pain points within your company to solve. Take the initiative to attempt to solve those...and then blog about it :)

Bonus section

Docker has been a game changer for me. Not only does it allow me to package up applications, but it's also a stress-free way to explore new technologies. There are a ton of applications that now have Getting Started instructions that use Docker. This removes the need to install and configure every little thing to get something up and running. That work has already been baked into a Docker image for you. I don't need to muddy up my system with something I might not end up using. When using Docker, if I find that a piece of tech isn't quite right for whatever reason, I simply stop and delete the container and image, and I'm back to where I was with no remnants left on my machine.

Before installing Grafana and Prometheus on a company server, I spun up local Docker containers running these apps to get familiar with them.

If you're at all interested in dev-ops, I would strongly encourage you to get familiar with Docker. It has made exploring cheap and easy, which has empowered me to discover new ways to improve our systems.

Categories: Career