Failure is the Only Option

I ran a computer department for a small private school back in 1991. I remember rolling out our first Macintosh computers, the ones with one megabyte of memory and no hard drive (the hard drives were too expensive for us). I managed to get them up and running with no disk in the floppy drive, so that students could load and save their work to their own floppy.

This diskless approach was not particularly stable, however, and the computers would crash on a regular basis. As a result, my mantra was SAVE YOUR WORK! Expect your computer to crash, and plan accordingly!

Schoolkids being schoolkids, however, they frequently ignored my admonition. Sure enough, periodically one would come up to me with a tear in her eye. “Mr. Bloom! I just spent all period writing my English paper and the computer crashed! Please help!” But of course, there was nothing I could do at that point.

Two decades later, the computers are bigger, faster, and cheaper, we have the Internet and all it has done for us, and today we even have the Cloud. But in some ways nothing has changed. Failure is still unpredictable, yet it is around every corner. The core best practice, expect and plan for failure, is still as important as it was in the floppy days.

The Cloud: Built to Fail

It was actually my research into the causes of last month’s spectacular Amazon Cloud flameout that reminded me of my time in the school computer lab. One of the core architectural principles of Amazon’s Cloud—or any Cloud for that matter, public or private—is our old friend, expect and plan for failure. After all, each individual node in the Cloud, whether it be a server, hard drive, or piece of network equipment, consists of cheap commodity hardware. Each Cloud provider architects their Cloud environment so that any such piece of equipment—or even several pieces at once—can fail, and the environment should recover automatically.

In fact, fault tolerance and elasticity go hand in hand. Elasticity, which is the dynamic, automatic provisioning and deprovisioning of Cloud resources on demand, requires the same bootstrapping that fault tolerance calls for. Sometimes, the reason to bootstrap a box is to meet additional demand (elasticity) or to replace some other box that is having issues (fault tolerance). Essentially, when you need a new box, it boots and asks what it’s supposed to do, and then it finds and installs the appropriate resources to become the piece of equipment it needs to be.

Architecting for Failure

However, just because your Cloud provider architected their internal infrastructure to be elastic and fault tolerant doesn’t mean that your app will automatically inherit these traits once you move it to the Cloud. When an organization wants to run an application in the Cloud, it is important to architect the application to take advantage of the elasticity and fault tolerance the Cloud provides. Moving a big chunk of legacy spaghetti code into the Cloud is asking for trouble, as is trying to migrate a tightly coupled set of objects. Where you might have been able to get away with such design antipatterns for an in-house app, the Cloud forces you to clean up your act and deploy modular, loosely coupled apps that can take advantage of the inherently elastic, fault tolerant nature of the Cloud.

There is an important story here: the internal architecture of the Cloud is forcing organizations to rearchitect their apps, enabling them to take advantage of the Cloud, but as a welcome side effect, gives them better architected apps. But that’s not the only story.

The broader story is especially ironic. The irony, of course, is that a core Cloud best practice is to expect for and plan for failure—not only within the Cloud, but of the Cloud itself. The Cloud brokering we discussed in the last ZapFlash is no longer a “nice to have,” but rather a core architectural best practice. If you tell yourself that Cloud architectures are inherently fault tolerant, and therefore it’s sufficient to count on a single Cloud provider, you’re fooling yourself.

Architecting for the Cloud doesn’t mean sticking your app into the Cloud as though it were a black box. There’s far more to the story. On the one hand, it means rearchitecting your app to take advantage of the Cloud, and on the other hand, it means considering each Cloud provider instance as one element in your broader enterprise architecture. And if you want that enterprise architecture to be fault tolerant, avoid any single point of failure—even if that point if failure is the Cloud itself. Bottom line: SAVE YOUR WORK. I don’t want you coming up to me at the end of class because you lost your data. I won’t be able to do anything about your data, but I will be able to tell you I told you so!

Image source.