Resilience: The Missing Word in the SOA Conversation

In our conversations about the value of Service-Oriented Architecture (SOA), we frequently discuss the need for agility. The constant problem plaguing IT is its inability to deal with continuous and often unpredictable change. Therefore, it makes sense that any Enterprise Architecture (EA) initiative should focus on resolving that problem by designing for change — agility. However, we also discussed in a prior ZapFlash that it’s difficult to design for agility by focusing on individual Services. Rather, agility is an emergent property of the complex system that is IT.

So, if developers, integration architects, and infrastructure implementers can’t guarantee agility at their individual, atomic level of operation, what can they guarantee? One of the concepts that contributes to the emergence of agility in complex systems, but is often missing from our SOA conversations, is the notion of resilience.

What is resilience? Resilience is the ability of an entity to absorb energy when some change impacts it, and then to rebound from that change back to its original condition. Resilience is a kind of “self-righting tendency” that allows the system to retain its overall structure without lasting damage even when hit by significant change. And if we want to enable the sort of loosely coupled change that SOA purports to support, then certainly the Services we build, infrastructure we implement, processes we model, and systems we enable should have some measure of resilience.

How Does Resilience Relate to Agility?
In many ways, the concept of resilience is similar to that of agility. Both agility and resilience deal with change in its various forms, but there are distinct differences that inform the way in which we architect, engineer, and design our complex systems. One way to understand the difference is to compare the concept of resilience with that of flexibility. Flexibility is another word frequently used to describe one of the desired benefits of agility. If systems can stretch and bend to meet new needs, then we don’t need to continuously re-engineer them as things change.

However, resilience is not the same concept as flexibility. The best way to understand the difference is to look at the antonym of each word. Rigidity, often couched in terms of “robustness”, is the antonym of flexibility, and it implies an object’s inability or resistance to change. Fragility is the antonym of resilience, and it implies that the given entity will break when sufficient force is applied. There’s clearly a relationship between flexibility and resilience, because things that are flexible have a higher tolerance for force, but flexible systems can still be fragile. Things can be flexible and not resilient: many systems can be changed but never regain their original shape. If that happens often enough, you are left with a system contorted beyond its original intention. You want resilience and agility, not just flexibility and robustness. What’s more, it is much easier to build systems for robustness than it is to build them for flexibility. The general thinking goes that if you build systems big, strong, and thick, they can “withstand” change. But who wants to withstand the inevitable force of change, or even can? Wouldn’t you rather avoid colossal failure when that force does arrive? Wouldn’t you rather capitalize on change?

One insight is that systems are fragile when you change them beyond their “elastic limit”. From this perspective, things that are rigid have a very low elastic limit and are very fragile. Things that are flexible have a high elastic limit and are resilient up to a point. Elasticity is measured by variability, and we can plan ahead for this variability by thinking about how much we expect things to change and how much force there is when they change. As you might guess, in a system that’s continuously undergoing rapid and often unpredictable change, resilience provided through robustness provides neither flexibility nor the emergent property of agility. The only form of resilience that works is that which is based on flexibility. In this way, we can think of the resilience that we plan into our systems as variability, and the resilience that emerges unplanned in our systems as agility.

The idea of measuring flexibility by planning for variability should sound familiar. We discussed this idea when we introduced the concept of the Agility Model. The Agility Model provides architects with three key capabilities: a method for planning Services and processes with regard to their expected variability, a means for business users to express their desires with respect to variability, and a means to measure developed systems and Services for their actual variability. Having variability provides flexibility, which in turn provides a measure of resilience and contributes to agility as an emergent property. Specifically, planning for variability requires you to think beyond how a particular aspect of a Service is designed today. What could change in the future? What is the cost/benefit trade-off of designing that variability in now, rather than just acknowledging the Service’s inflexibility in that aspect?

But there’s more to the resilience picture. In reality, architects can provide for resilience in one of two ways: by building systems rigid enough to resist change, or by building them flexible enough to absorb change without being permanently altered. We often handle these resilience issues through a few key mechanisms: redundancy, distribution, fail-over, load-balancing, clustering, and an enforced rule of no single point of failure. After all, it doesn’t matter how flexible a particular Service might be if it can unexpectedly become unavailable at a moment’s notice. And we shouldn’t come to depend on infrastructure products to provide this sort of resilience either. Systems management software, ESBs, and other infrastructure can introduce more brittleness by becoming a single point of failure themselves. What if the SOA management system stops functioning, even if the Services themselves are operating fine? No, we can’t depend on infrastructure to solve architectural resiliency issues. We have to design resilience into the architecture, regardless of the current technology in use.

The Role of Resilience in SOA
Just as we can plan for flexibility at a variety of levels using measures of variability in the Agility Model, so too can we plan for resilience at those levels. Services that are resilient can handle not only a wide range of request types, but also significant numbers of Service requests without tipping over into failure. While it is possible for Service infrastructure (including the now ubiquitous ESB products) to provide this kind of availability, the best practice is for architects to consider Service availability as part of resilient Service design. For example, architects should consider fail-over Services, clusters of Service implementations, or load-balancing by having multiple Service interfaces and Service end-points defined in Service contracts. In this way, the architect doesn’t have to depend on specific infrastructure to handle variable Service loads.
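
As a rough illustration of fail-over across multiple end-points listed in a Service contract, consider the minimal Python sketch below. It is not a prescribed implementation: the end-point URLs, the `fake_transport` stand-in, and the names `invoke_with_failover` and `AllEndpointsFailed` are all hypothetical, and a real consumer would read the end-point list from its own contract or registry.

```python
from typing import Callable, Sequence


class AllEndpointsFailed(Exception):
    """Raised when every end-point listed in the contract has failed."""


def invoke_with_failover(endpoints: Sequence[str],
                         call: Callable[[str], str]) -> str:
    """Try each end-point in contract order; return the first successful reply."""
    failures = []
    for endpoint in endpoints:
        try:
            return call(endpoint)
        except ConnectionError as exc:
            # Remember why this end-point failed, then fall through to the next.
            failures.append(f"{endpoint}: {exc}")
    raise AllEndpointsFailed("; ".join(failures))


# Hypothetical contract entry: two end-points for the same Service.
ORDER_SERVICE_ENDPOINTS = [
    "https://orders-a.example.com/service",
    "https://orders-b.example.com/service",
]


def fake_transport(endpoint: str) -> str:
    """Stand-in for a real invocation; the first end-point is 'down'."""
    if "orders-a" in endpoint:
        raise ConnectionError("unreachable")
    return f"order accepted via {endpoint}"


result = invoke_with_failover(ORDER_SERVICE_ENDPOINTS, fake_transport)
```

Because the consumer walks the end-point list itself, the fail-over behavior does not depend on any particular intermediary being available.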

Yet, resilience at the Service level is not enough to guarantee overall resilience of the enterprise architecture. Just as we need fail-over, redundancy, load-balancing, and just-in-time provisioning for Services, so too we need them for the business processes implemented as compositions of those Services. Consider fail-over processes that provide an alternate execution path for business logic, redundant processes that channel interactions across alternate invocation mechanisms, and methods to create ad-hoc processes when other processes are on the verge of tipping over.
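
A fail-over process with an alternate execution path might look something like the following Python sketch. The step names (`check_credit_online`, `queue_for_manual_review`, `reserve_stock`) and the order data are invented for illustration, and the use of `RuntimeError` to signal a failed step is an assumption of this sketch.

```python
from typing import Callable, Dict, List

# A process step takes the current state and returns a new state.
Step = Callable[[Dict], Dict]


def run_process(steps: List[Step], request: Dict) -> Dict:
    """Run a composition of Service calls in order, starting from a copy."""
    state = dict(request)
    for step in steps:
        state = step(state)
    return state


def run_with_alternate_path(primary: List[Step], alternate: List[Step],
                            request: Dict) -> Dict:
    """Run the primary process; if a step fails, rerun via the alternate path."""
    try:
        return run_process(primary, request)
    except RuntimeError:
        # The alternate path restarts from the original, unmodified request.
        return run_process(alternate, request)


def check_credit_online(state: Dict) -> Dict:
    raise RuntimeError("credit Service unavailable")  # simulated outage


def queue_for_manual_review(state: Dict) -> Dict:
    return {**state, "credit": "pending manual review"}


def reserve_stock(state: Dict) -> Dict:
    return {**state, "stock": "reserved"}


outcome = run_with_alternate_path(
    primary=[check_credit_online, reserve_stock],
    alternate=[queue_for_manual_review, reserve_stock],
    request={"order_id": 17},
)
```

Here the business outcome degrades (the credit check is deferred) rather than the whole process failing, which is the point of an alternate execution path.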

Perhaps the easiest form of resilience to achieve is resilience at the infrastructure level. SOA infrastructure should certainly be able to handle a wide range of usage loads and invocation methods, but to depend on a single vendor or single implementation to provide that guarantee is foolhardy. Rather, good enterprise architects build resilience into infrastructure by having redundant, load-balanced, and alternate runtime engines, and by using distributed, heterogeneous network intermediaries instead of single-vendor, proprietary ESBs that form a single point of failure. Organizations should also implement distributed caching, offloaded XML parsing, federated registries with late binding, and network gateways that handle security and policy enforcement away from the Service end-points. Resilience at the infrastructure level is much more achievable when you can expect high levels of reliability and throughput without counting on one vendor’s implementation to pull all the weight.

But why stop there? Organizations seeking SOA resilience also need resilient Service policies. This requires not just redundant policy enforcement mechanisms, but also fail-over policy definition points and even redundant, fail-over, and load-balanced Service policies. When you’re using policies at runtime to determine binding to Services, an unexpected outage of Service policy definitions can cause just as much havoc as if the Service itself were not available.
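
One way to picture redundant policy definition points is the Python sketch below. It queries several definition points in turn and, as an extra assumption not taken from the discussion above, falls back to the last policy it successfully fetched when every point is down. The `PolicyResolver` class, the fetch functions, and the policy contents are all hypothetical.

```python
from typing import Callable, Dict, Sequence

# A definition point is modeled as a function: Service name -> policy dict.
FetchPolicy = Callable[[str], Dict]


class PolicyResolver:
    """Resolve a Service policy from redundant policy definition points."""

    def __init__(self, definition_points: Sequence[FetchPolicy]):
        self._points = list(definition_points)
        self._last_known_good: Dict[str, Dict] = {}

    def resolve(self, service: str) -> Dict:
        for fetch in self._points:
            try:
                policy = fetch(service)
            except ConnectionError:
                continue  # this definition point is down; try the next one
            self._last_known_good[service] = policy
            return policy
        # Every definition point failed: reuse the last policy we saw, if any.
        if service in self._last_known_good:
            return self._last_known_good[service]
        raise LookupError(f"no policy available for {service}")


available = {"secondary": True}


def primary_point(service: str) -> Dict:
    raise ConnectionError("primary policy store offline")  # simulated outage


def secondary_point(service: str) -> Dict:
    if not available["secondary"]:
        raise ConnectionError("secondary policy store offline")
    return {"service": service, "require_tls": True}


resolver = PolicyResolver([primary_point, secondary_point])
first = resolver.resolve("OrderService")   # served by the secondary point
available["secondary"] = False
second = resolver.resolve("OrderService")  # served from last known good
```

Whether serving a possibly stale policy is acceptable depends on the policy in question; for security policies an organization may prefer to fail closed instead.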

Similarly, companies need to have resilience at the Service contract and schema level. Having redundant Service implementations makes no sense if they all share a single Service contract file that is in danger of disappearing, especially if it is sitting on an unprotected file server. Protect your metadata by locking it behind a policy-enforced registry, but also give that registry redundancy, fail-over, and load-balancing so that you don’t merely shift the single point of failure to the registry. This applies to all Service metadata, process metadata, data schemas, and semantic mappings that the system needs in order to function properly.

The ZapThink Take
Yet, all this doesn’t matter if the most important part of enterprise architecture, namely the architect, is not resilient. Are you the only EA in your organization who gets SOA? Even worse, are you the only EA in your organization? What happens if your job changes, or you get laid off, or the organization otherwise changes its feelings on EA and/or SOA? Will that kill the whole SOA project? What about budgets and funding? Are you operating your SOA projects on the edge, just awaiting a single nudge to push them into project oblivion? If so, you need architectural and organizational resilience. Make sure you have a broad base of support (redundancy). Distribute the workload and responsibility for architectural activities and make sure that there is a team of architects, not a lone crusader (fail-over and clustering). Give the rest of the organization visibility into the benefits of your activities, and make sure you provide closed-loop feedback on how specific EA tasks result in specific business benefits, preferably iteratively, frequently, and on a short time schedule.

Agility and flexibility are not enough to guarantee SOA success. In fact, the real thrust of what ZapThink has been discussing on SOA for the past eight-plus years has been agile, resilient enterprise architecture. If some of the so-called benefits of SOA (namely, standards-based integration) were to disappear, but we were left with agile, resilient EA, we would still have achieved the main objective of SOA. Enabling the business to operate in a continuously changing, heterogeneous environment without breaking, incurring significant cost, or suffering high latency requires enterprise architects to think, act, and plan for resilience as well as agility.
