Quickly debugging server issues

Part of my job as a server consultant is maintaining servers and quickly responding to issues as they arise.

Yesterday, I unlocked my phone to find a message from a client. He said Apache was getting hammered and the server load was over 80. I quickly opened my laptop and SSH'd into the server. Sure enough, Apache was maxing out the CPU. My client reported that Google Analytics was showing only low traffic, so it couldn't have been a sudden surge from publicity.
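
The first-pass triage here is nothing exotic. Roughly what I run (a sketch; process names and output will obviously vary by setup):

```sh
# First-pass triage on a box reporting high load.
uptime                          # confirm the load averages the client reported
htop                            # interactive view; sort by CPU% to find the culprit
ps aux --sort=-%cpu | head -15  # top CPU consumers, no interactive tool needed
```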

My first thought is to see where these requests are headed. I tail the access logs and find requests to /rss/order/new/. (The site is powered by Magento.) That turns out to be a common brute-force attack path against Magento, and the standard fix is to block access to this API. I add rules to Nginx (which is acting as a reverse proxy in front of Apache) to cut the requests off, and the CPU load returns to a nominal value.
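
For reference, the rule is roughly this shape (a sketch, not the exact config; adapt it to your own server block):

```nginx
# Sketch: reject the brute-forced Magento RSS endpoint at the proxy layer,
# so the requests never reach Apache/PHP.
location ^~ /rss/order/ {
    return 403;
}
```

Doing this in Nginx rather than in Magento itself means the requests get rejected before they cost any PHP time at all.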

During my time working on Nomad List and Remote OK, we had to deal with tons of problems as they arose. In a perfect world, you could set up servers and forget about them, but in reality they always find a way to go wrong.

I have very mixed feelings about Docker. It sounds like a great idea, but every time I go to use it, it breaks. Sometimes it freezes the entire server and you have to force a reboot. That failure mode is quite hard to spot unless you notice the tiny Z (indicating a zombie process) in htop.
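
If you suspect that's happening, you don't have to eyeball htop. A one-liner sketch (assumes Linux procps):

```sh
# List zombie (defunct) processes: the STAT column contains 'Z' for zombies.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/'
```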

> ($*@#$*(&@#$ I hate @Docker so much, it literally just killed @nomadlist, @#)(@*#$#, getting sites back up with @DanielLockyer now
>
> — levelsio (@levelsio) September 3, 2017

A couple of months ago, a client came to me to see if I could fix an issue they were experiencing on their site. Some requests took a long time to respond, and some didn't respond at all.

I SSH into the server and open htop to see what's going on with the CPU. It looks pretty ordinary, but I notice the php-fpm processes have a very short TIME, frequently resetting to 0:00:00. Huh, you'd expect that to be much longer. I check dmesg for the system messages, and up comes a ton of segfault errors. Oh, that's why: php-fpm was segfaulting and restarting every couple of seconds, which meant Nginx couldn't get a response for a large proportion of the requests.
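
The dmesg check is worth committing to muscle memory. Something like this (a sketch; -T needs a reasonably recent util-linux):

```sh
# Kernel ring buffer with human-readable timestamps, filtered for segfaults.
dmesg -T | grep -i segfault | tail -20
```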

But why? Something I always look at when I come across a PHP application is which version it's running. This server was on PHP 7.0, a few patch releases out of date. That looks promising: a lot of those older releases have bugs that result in segfaults. A quick Ctrl+F for segfaults in the PHP changelog shows 139 instances.
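
Checking the version takes seconds. A sketch (binary names vary by distro, and the CLI and FPM builds can differ, so check both):

```sh
php -v        # CLI version
php-fpm -v    # FPM version; may be named php-fpm7.0 or similar on Debian/Ubuntu
```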

So, I just update PHP. FPM restarts and the issue is resolved: no more segfaulting, and the website is nice and fast. God knows how long they'd been dealing with that issue, but it took me just 3 minutes to resolve. (!)
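
The fix itself is just package-manager work. A minimal sketch, assuming a Debian/Ubuntu box with the distro's PHP 7.0 packages (package and service names differ elsewhere):

```sh
apt-get update
apt-get install --only-upgrade php7.0-fpm php7.0-cli
systemctl restart php7.0-fpm
# then keep an eye on dmesg for any fresh segfault lines
```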

Debugging server issues quickly is a skill you learn over time. Each time you come across these kinds of issues, you fine-tune your ability to resolve the next one faster.

If you're looking for someone to maintain your servers, email me at hi@daniellockyer.com!