This post is about how we monitor Nomad List, Remote OK and our other sites. Over the years, Pieter and I have built up a set of tools that keep the sites running and alert us when one of them may be experiencing an issue.
We use cron jobs for almost every task, from updating data sources to processing images. Right now we have 81 jobs, with some running multiple times per hour. The cron dashboard I made is powered by a simple bash wrapper script that records metrics and sends the data to a PHP script for processing. This has saved our backs so many times. It also allows us to see which scripts are taking a long time, so I can then go and optimize them.
".@daniellockyer made a dashboard for my robots and if they complete ✅OK or 🔴ERROR, so I can monitor and rerun from my iPhone wherever I am 🙃" pic.twitter.com/XZok44fnU5 (Pieter Levels, @levelsio, May 10, 2017)
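To make the wrapper idea concrete, here is a minimal sketch of what such a script could look like. It is not the actual script behind the dashboard: the `CRON_REPORT_URL` endpoint, the form field names and the function names are all assumptions for illustration. It runs the job, records the exit code and duration, and posts them with curl (or prints them if no endpoint is configured).

```shell
#!/usr/bin/env bash
# Hypothetical cron wrapper: run a job, time it, report the result.
# CRON_REPORT_URL is an assumed endpoint (e.g. a PHP script that stores metrics).
#
# Example crontab entry (paths are placeholders):
#   */15 * * * * /path/to/cron-wrapper.sh update-data php /path/to/update.php

report_metrics() {
    # Usage: report_metrics <job-name> <exit-code> <duration-seconds>
    local name="$1" code="$2" duration="$3"
    if [ -n "${CRON_REPORT_URL:-}" ]; then
        # Field names are assumptions; a reporting failure must not break the job.
        curl -fsS -m 10 \
            --data-urlencode "job=$name" \
            --data-urlencode "exit_code=$code" \
            --data-urlencode "duration=$duration" \
            "$CRON_REPORT_URL" >/dev/null || true
    else
        printf '%s exit=%s duration=%ss\n' "$name" "$code" "$duration"
    fi
}

run_job() {
    # Usage: run_job <job-name> <command> [args...]
    local name="$1"
    shift
    local start end code=0
    start=$(date +%s)
    "$@" || code=$?          # keep the job's exit code even when it fails
    end=$(date +%s)
    report_metrics "$name" "$code" "$((end - start))"
    return "$code"
}

# When called with arguments (as from crontab), run the job.
if [ "$#" -ge 2 ]; then
    run_job "$@"
fi
```

Because the wrapper returns the job's own exit code, cron's normal failure handling keeps working alongside the reporting.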
ngxtop allows us to monitor Nginx logs and perform queries against the data. Using this tool, we've easily been able to find people scraping our site, or links that are regularly hit but tend to 404 or 500.
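As a rough illustration of the kind of queries this makes possible, the sketch below wraps two ngxtop invocations in shell functions. The flags follow ngxtop's documented command-line options; the function names and the log path in the usage note are assumptions, and the exact filters we run are not shown here.

```shell
# Top client IPs by request count, using the existing content of the log
# (--no-follow) rather than only newly written lines.
top_clients() {
    # Usage: top_clients <access-log>
    ngxtop --no-follow -l "$1" top remote_addr
}

# Request paths that most often end in a 404 or a 5xx response.
failing_paths() {
    # Usage: failing_paths <access-log>
    ngxtop --no-follow -l "$1" -i 'status == 404 or status >= 500' top request_path
}
```

For example, `top_clients /var/log/nginx/access.log` would be a quick way to spot a single address hammering the site, assuming that is where your Nginx access log lives.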
PHP Access Logs
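One of my favourite tricks is to use the PHP-FPM access log to monitor the script time of server-side page loads. The following bash command takes the last 50 entries, sorts them numerically by the first column (script time) so the slowest requests end up at the bottom, and trims each line to the width of the terminal.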
sudo tail -n 50 /var/log/php7.1-fpm.access.log | sort -n | cut -c1-$(tput cols)
A typical entry may be something like:
50.886 2048 78.61% remoteok.io /srv/http/remoteok.io/app/feed.php?type=rss
The first column is the script time in milliseconds, the second is memory usage and the third is CPU usage. These are followed by the host and the script path, including any query string.
This is useful because it shows you which requests you may want to turn into static HTML pages, so PHP doesn't have to run on every load. During the launch of Hoodmaps we saw all the page embeds turning up in the output, so we switched to static pages and reduced our server load.
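If you want to look beyond the last 50 entries, the same log can be aggregated per script. The sketch below is one way to do it, assuming the five-column layout of the entry shown above; the function name `slowest_scripts` is made up for this example. It strips the query string, then prints the average time, maximum time, request count and script path, with the slowest average first.

```shell
slowest_scripts() {
    # Reads access-log lines on stdin.
    # Prints: <avg ms> <max ms> <count> <script>, slowest average first.
    LC_ALL=C awk '
        NF >= 5 {
            script = $5
            sub(/\?.*/, "", script)          # drop the query string
            ms = $1 + 0                      # force a numeric value
            total[script] += ms
            count[script]++
            if (ms > max[script]) max[script] = ms
        }
        END {
            for (s in total)
                printf "%.3f %.3f %d %s\n", total[s] / count[s], max[s], count[s], s
        }
    ' | LC_ALL=C sort -rn
}
```

It can be fed from the same file as before, for example `sudo cat /var/log/php7.1-fpm.access.log | slowest_scripts | head`.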
PHP-FPM actually comes with a status page built in. Once enabled, it allows you to see basic information about the process. If you add ?full to the URL, it lists all the child processes and statistics about them. If you append &json to the URL, it outputs the data in JSON format. I'm currently working on a dashboard to graph this data.
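A small helper makes it easy to pull the status page from a script. This is only a sketch: the `FPM_STATUS_URL` default and the function name are assumptions, and the real URL depends on how the status path is exposed through your web server. It builds the query string from the `full` and `json` options described above and fetches the result with curl.

```shell
# Assumed base URL; adjust to wherever your status page is exposed.
FPM_STATUS_URL=${FPM_STATUS_URL:-http://127.0.0.1/status}

fpm_status() {
    # Usage: fpm_status [full] [json]
    local query="" arg
    for arg in "$@"; do
        case "$arg" in
            full|json) query="${query:+$query&}$arg" ;;
            *) echo "unknown option: $arg" >&2; return 2 ;;
        esac
    done
    # -f makes curl fail on HTTP errors, e.g. when the page is not enabled.
    curl -fsS "${FPM_STATUS_URL}${query:+?$query}"
}
```

Calling `fpm_status full json` requests the per-process listing as JSON, which is a convenient format to feed into a graphing tool.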
UptimeRobot is a service we use to check for availability and keywords on site pages. It can then inform us via Twitter DM, email or SMS if there is an issue.
"✨I've built 37 test cases that check if my sites are working properly w/ @uptimerobot and SMS me if they're not, here's how it works (pt 1)" pic.twitter.com/nB90v7ws5c (Pieter Levels, @levelsio, May 23, 2017)
Even things like htop and tcpdump can be useful in monitoring and diagnosing issues.
In the end, it's important to use what works for you. Where we can, we like to stick to tools and services we control, so we don't have to rely on third parties.
If you're looking to set up monitoring for your server, email me at firstname.lastname@example.org!