24/7: Keeping Your Store Open for Business

When you do business on the web, downtime means lost revenue. Hanging a sign on your shop door that says “Gone to lunch, back in one hour” may work on Main Street, but on the web shoppers expect an always-on, 24/7 experience.

For this reason, uptime is one of the primary goals of web site engineers. How is this achieved? What makes it so difficult?

A web site is a system consisting of many pieces. The base layer consists of servers and network hardware, with software applications forming the next layer.

If everything always worked as expected, there would be no reason to worry about outages. But both hardware and software do fail. Designing a system for uptime means handling failure without the user even noticing. That takes experience in anticipating what could go wrong and in developing strategies for recovery.

Systems must handle common failure modes with automated responses so that a small failure does not trigger a large failure. “Redundancy” is one of many techniques employed to enable systems to continue operating normally during component failures. For example, the workload of a system is shared among multiple identical servers running at less than 100% capacity, so that survivors can take over for one of their fallen comrades. Another strategy is to have a backup server standing by ready to take over, like a bench player on a basketball team. These strategies enable a system to heal itself most of the time.
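The two redundancy strategies above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Server` class, the function names, and the server names are all hypothetical, and load is measured in "servers' worth" of work rather than in any real unit.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str            # hypothetical label, e.g. "web-1"
    healthy: bool = True


def load_per_survivor(total_load: float, servers: list[Server]) -> float:
    """Share the total load evenly among the servers that are still healthy."""
    survivors = [s for s in servers if s.healthy]
    if not survivors:
        raise RuntimeError("no healthy servers left")
    return total_load / len(survivors)


def pick_active(primary: Server, standby: Server) -> Server:
    """Active/standby: use the primary unless it has failed."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("primary and standby are both down")


# Strategy 1: five identical servers share three servers' worth of work,
# so each one runs at 60% of its capacity.
pool = [Server(f"web-{i}") for i in range(1, 6)]
total_load = 3.0
print(load_per_survivor(total_load, pool))  # 0.6

# One server fails; the four survivors absorb its share and run at 75%,
# still below 100%, so shoppers notice nothing.
pool[0].healthy = False
print(load_per_survivor(total_load, pool))  # 0.75

# Strategy 2: a standby sits on the bench until the primary goes down.
primary, standby = Server("db-primary"), Server("db-standby")
primary.healthy = False
print(pick_active(primary, standby).name)  # db-standby
```

The spare capacity is what makes the first strategy work: if the five servers were already running at 100%, the survivors would have no room to take over for a failed one.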

However, there are conditions that a system cannot respond to sensibly on its own: a failure that nobody anticipated or planned for, a complex failure that the application cannot diagnose by itself, the failure of a critical component for which there is no replacement, or even a sudden, unexpected burst of traffic (due to a promotion or a publicity appearance, such as on “Shark Tank”). Those conditions require people, commonly referred to as “Ops”, to analyze the situation and take corrective action.

The Ops Team is the last line of defense against downtime.

Their duty is to apply emergency first aid when needed: bringing new servers online to take over for failed ones, increasing bandwidth, or adding other resources to keep the system operational. Once the system is stable, Ops diagnoses the cause of the failure so that the problem can be solved and recurrences prevented.

Through a combination of cautious, conservative design, redundancy, and, as a last resort, the human operator, the inevitable failure of pieces of a system can be contained, compensated for, and eventually fixed, giving online shoppers the always-on experience they expect.