Apologies for the downtime, or, how our VPS provider ruined my weekend.

Monday, December 10th, 2007

To the valued users of our websites, especially those at PapayaPolls.com and Print-Bingo.com: we’re sorry.

Due to circumstances that were generally outside of our control, Perceptus’ virtual private server was down for 3-4 days (there were a couple of multi-hour stints of uptime here and there).

The long story, short:

Our VPS provider (a VPS is a super-fancy variant of web hosting), with whom we’ve hosted for the last three years or so, had a border router go down badly on Friday.

Now, when we were choosing a VPS provider, we specifically looked for one that had fully redundant power, redundant network connections, reasonably intelligent-sounding support, etc. This one did and still does advertise as such; however, as we’ve now discovered, our VPS provider’s supposedly fully redundant network turned out to be mostly redundant, with at least one single point of failure. When this border router went down, there was no backup link, nor was there a convenient replacement unit for a quick swap. So their network went dead to the world for hours.

Eventually, our VPS provider fixed their network. However, the Perceptus VPS remained down. Several email tickets and live chats with support later, they figured out what was wrong, and as of Monday the Perceptus VPS is finally up again. It’s been several hours now, so our fingers are crossed that this might actually be over.

I won’t name our provider, but if you poke around enough, you’ll figure out who it is. Suffice it to say that for the time being, don’t ask me for a VPS recommendation… I don’t have anyone to recommend. I don’t really blame them for the first few hours of downtime; overall, I’ve been quite happy with them. But when a few hours stretched to a few days, they lost a lot of goodwill with me.

Lessons learned:

  1. When things go down and you’ve got no control over the fix, start implementing a fallback plan right away. Even if someone who claims to be in a position of knowledge says it will be fixed in a few hours, start the work on the fallback plan anyway. Nine times out of ten, things will get fixed before you have to go live with Plan B, but your time isn’t wasted. Consider it a test run of your backup plans for the one time that you will be very happy that you started on Plan B ASAP.
  2. When possible, avoid single points of failure; this includes your web host. Ironically, our VPS provider did a survey about a month ago asking about our interest in “high availability VPSes”… guess what would have happened to one of these last week? Yep, it would have been down anyway, because the problem was at a choke point higher up the chain than the server.
  3. When choosing a web host, ask if they actually have staff in the same city as their data centre. If they are just relaying tickets to the data centre staff, they don’t really have control over when anything is done either. I don’t know if our VPS provider had such a setup when we first signed up, but I have every indication that they weren’t physically in the data centre this weekend.
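
To put rough numbers on lesson 2, here is a back-of-the-envelope sketch in Python. The availability figures are made up for illustration (they are not measurements from our provider), the function names are my own, and the math assumes failures are independent, which real outages often are not. It only shows why a redundant pair of servers behind one router can never be more available than that router, while two hosts on separate networks can be.

```python
def series(*availabilities):
    """Availability of a chain where EVERY component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability of redundant components where ANY one being up is enough."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


# Hypothetical figures, chosen only to illustrate the point.
ROUTER = 0.99
SERVER = 0.99

# One server behind one border router.
single_server = series(ROUTER, SERVER)

# A "high availability" pair of servers that still share the same router.
ha_pair_same_router = series(ROUTER, parallel(SERVER, SERVER))

# Two complete setups on unrelated networks, each with its own router.
two_hosts = parallel(series(ROUTER, SERVER), series(ROUTER, SERVER))

if __name__ == "__main__":
    print(f"single server:          {single_server:.6f}")
    print(f"HA pair, shared router: {ha_pair_same_router:.6f}")
    print(f"two separate hosts:     {two_hosts:.6f}")
```

With these assumed numbers, the shared-router pair stays capped below the router's own availability, and only the two-host setup gets past it.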

The Future for Perceptus’ Web Server:

We’ll be looking at setting up a fail-over server on a totally separate network, completely unrelated to our current setup. The only question is how we will do it relatively cheaply and with relatively low maintenance. We’ll post something when we figure it out.