Rethinking Cloud Service Level Agreements

It’s so easy these days to purchase Cloud-based services. Go online, click a few times, enter your credit card info, and presto! You’re in the Clouds. There’s no question the novelty really hasn’t worn off yet. Have you ever wondered, however, what you’re really paying for? Sure, you have some expectation that the service provider will, well, provide you with some services. But what are they promising to give you, specifically?

Enter the Service Level Agreement (SLA). The SLA is part of the contract between you and your service provider. It spells out the specifics of what they’re providing you as well as penalties the provider must pay in the event that they don’t live up to the SLA.

Or, maybe not.

The reality is, what’s actually in a Cloud SLA, or what should be in such an agreement, is all over the map. Ask the public IaaS providers, and they’ll give you one answer. Ask SaaS or PaaS providers, and they’ll tell you something different. And what about private Clouds? SLAs take on an entirely new meaning there as well. Let’s see if we can make sense of it all.

The Three Contexts for a Cloud SLA

This confusion over what belongs in Cloud SLAs centers on the fact that there are very different contexts for SLAs depending on the heritage of the organization writing them:

  • The managed hosting provider context: for service providers who had traditionally been in the hosted data center business, SLAs center on availability, measured in the familiar multiple-nines of uptime. Want 99.9% uptime? Pay one price. Want four nines, or five nines? Pay increasingly higher prices. If the provider drops the ball, they pay you a penalty, usually in the form of service credits. In other words, the crappier the service is, the more of it you get.
  • The software vendor context: when you buy a piece of software, you don’t get an SLA at all. Instead, you get an end-user license agreement (EULA). Instead of spelling out what the vendor will do for you, EULAs tell you what you are allowed to do with the software, and more importantly (to the vendor, anyway), what you’re not allowed to do with it. And then there’s all the boilerplate about no warranties or fitness for a particular purpose. When the vendor moves their software to a Cloud delivery model, thus becoming a SaaS or PaaS vendor, they typically retain the EULA context for their offering. From their perspective, SaaS is more about software than about services.
  • The enterprise operations context: it is the responsibility of the operations (ops) team to provide and support the IT capabilities the enterprise requires and pays for. If a business unit requires, say, a Web site with a response time of three seconds or less, then infrastructure and solution architects specify the necessary hardware, software, and network capabilities to meet that requirement, the business cuts the check, and the ops team keeps all that gear running as per the requirements. A different business unit may have different non-functional requirements, which might cost more or less, but in any case would lead to a different SLA. In this case, if the ops team drops the ball and violates an SLA, predetermined mitigation activities that are part of the governance framework kick in, but service credits are unlikely to be on the list.

For consumers of Cloud services, therefore, simply having a conversation about SLAs with your Cloud provider can lead to confusion, especially when there is a collision among these contexts.
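
To make the "multiple-nines" of the managed hosting context concrete, here is a minimal sketch that converts an uptime percentage into the downtime it permits per year. The function name and the 365-day year are assumptions made for illustration; a real SLA defines its own measurement window and exclusions.

```python
# Downtime permitted per year for a given uptime target.
# Assumption: a 365-day year; real SLAs define their own measurement window.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the minutes of downtime per year that an uptime target allows."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Three nines leaves room for nearly nine hours of downtime a year, while five nines leaves only about five minutes, which goes some way toward explaining why each additional nine costs more.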

When Contexts Collide

Salesforce provides an eye-opening case study that shows how confusing the aftermath can be when these three contexts collide. On the one hand, Salesforce is a SaaS and PaaS provider, built from the ground up to deliver software capabilities via a Cloud provider model. On the other hand, a substantial part of their business is with large enterprises who have come to expect uptime-based SLAs from their service providers.

For many years, Salesforce refused to publish SLAs of any kind, instead favoring EULA-type agreements. That is, until some well-publicized downtime around 2006. Large customers finally realized that their businesses depended on Salesforce, and sought to strong-arm the vendor into publishing, and sticking to, negotiated SLAs.

The word in the blogosphere was that Salesforce fought the publication of such SLAs tooth and nail, relenting only in the case of their largest customers, and then required those SLAs to be confidential, presumably so that different customers might get different promises. And what about all those Salesforce customers who didn’t have the clout to wrest an SLA from the vendor? Salesforce rolled out a PR effort meant to convince their customers that they could be trusted to provide good service. In other words, “trust us, we’re Salesforce. You don’t need an SLA.”

Salesforce’s apparent anti-customer stance might seem quixotic, but makes sense from the perspective of a software vendor. Why offer to provide free service credits or other bonuses when most customers will buy your stuff regardless? But from the customer perspective, people are left scratching their heads, wondering if some other customer has extracted a better SLA. If everybody is sharing the same underlying infrastructure, then why would Salesforce promise different service levels to different customers?

Private Cloud providers must also navigate their own context collisions. On the one hand, these organizations’ Cloud teams are simply a part of the ops team, responsible for keeping the lights on like they always have. But on the other hand, their internal customers are likely to be comparing private and public Cloud options, or at the least, comparing their internal private Cloud with virtual private Cloud offerings from public Cloud providers. Remember, from the Cloud consumer’s perspective, a Cloud is a Cloud. Why would you expect service credits from one provider and internal service level guarantees from another?

On Beyond Uptime

Of the three contexts discussed above, the managed hosting provider’s focus on uptime is perhaps the most familiar context for SLAs. If you’re contracting with a third party for IT capabilities, then making sure those capabilities are up and running is certainly the most important non-functional requirement, correct?

Not so fast. Clouds are fundamentally different from managed hosting providers in one significant respect: elasticity is even more important than reliability. Remember, when working with the Cloud you must plan for and expect failure; it is the Cloud’s ability to automatically recover from such failures that compensates for the Cloud’s underlying shortcomings. How fast your Cloud can scale up, its ability to do so regardless of the demand, its ability to deprovision instances even more rapidly, and in particular its ability to recover automatically from failure, are the characteristics you’re really paying for.

The surprising conclusion from this focus on elasticity over reliability is that none of the three SLA contexts above is actually well-suited for the Cloud. Instead, you want your SLA to focus more on how well the Cloud deals with unexpected events, including failures, spikes in demand, and other situations that fall outside the norm. After all, these are the characteristics of the Cloud that make it a Cloud. You could say that Cloud SLAs should measure just how Cloudy that Cloud is: in other words, how well it lives up to the core value propositions that differentiate the Cloud from traditional hosted computing environments.
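
What might a "Cloudiness" measurement look like in practice? The sketch below is one hypothetical way to score it: record how long each scale-up, scale-down, and failure recovery took, then report the fraction that finished within a target time. The event labels, the target values, and the sample data are all assumptions made up for illustration, not terms from any real provider's SLA.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str          # "scale_up", "scale_down", or "recovery" (hypothetical labels)
    started_s: float   # when the request or failure occurred, in seconds
    finished_s: float  # when capacity was ready or service was restored


# Hypothetical SLA targets: maximum seconds allowed for each kind of event.
TARGETS_S = {"scale_up": 120.0, "scale_down": 60.0, "recovery": 300.0}


def compliance(events, targets=TARGETS_S):
    """Return, per event kind, the fraction of events that met their target."""
    met, total = {}, {}
    for e in events:
        total[e.kind] = total.get(e.kind, 0) + 1
        if e.finished_s - e.started_s <= targets[e.kind]:
            met[e.kind] = met.get(e.kind, 0) + 1
    return {kind: met.get(kind, 0) / count for kind, count in total.items()}


# Made-up sample data for one measurement period.
events = [
    Event("scale_up", 0, 90),
    Event("scale_up", 500, 700),    # took 200 s, missing the 120 s target
    Event("recovery", 1000, 1200),
    Event("scale_down", 2000, 2030),
]
print(compliance(events))
```

An agreement built along these lines would promise, say, that some percentage of scale-up requests complete within the target, rather than promising that any individual instance stays up.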

The ZapThink Take

However you look at Cloud SLAs—measuring reliability, Cloudiness, or something else—never forget where the rubber hits the road: the business value the Cloud provides. Why not base Cloud SLAs on how well the Cloud meets business needs? Such a mission-focused SLA would have to focus on specific, measurable goals for the Cloud. For example, if you move your payroll app into the Cloud, your key metric might be whether you made your payroll on time.

Such mission-focused SLAs might be workable when dealing with a SaaS provider, but promise to be quite problematic with PaaS or IaaS offerings, since the mission success with those service models depends upon the software running on the respective platform or infrastructure. In these situations, if something goes wrong, is it the Cloud that’s violating its SLA, or is it something wrong with the software you put in the Cloud?

For system integrators and software developers who are building bespoke Cloud-based apps for their customers, this question is paramount. After all, the customer simply wants their requirements to be met. If something goes wrong, and the consultant points their finger at the Cloud provider and vice versa, the customer will only become more upset. The problem is, poorly architected apps aren’t able to take advantage of the elasticity benefit of the Cloud, through no fault of the PaaS or IaaS provider.

There is an important warning here. It seems that every enterprise and government agency is looking to move many of their apps to the Cloud, and they’re hiring consultants to do the heavy lifting. However, both customer and consultant are still thinking of the Cloud as a glorified managed hosting provider, responsible for maintaining uptime-based SLAs. The reality is quite different. As Cloud-based deployments mature, the line between development and operations blurs, as Cloud behavior merges with application behavior. It will take several years before anybody will have a clue how to write—let alone comply with—an SLA that addresses this new reality.
