Built to Fail: System Resiliency is not Relegated to the Cloud

Abstract

Advances in software architecture over the last 8 to 10 years have led to more complex and dynamic software architectures in the form of distributed systems. Despite this progress, today’s software applications, web-based or not, still too often respond slowly or become completely non-responsive. This article speaks to the necessity of building systems that, from the outset, can correctly detect system component failures and route requests around them. Software like Chaos Monkey, which challenges an application’s distributed systems by taking random instances offline, allows developers and system architects to test their designs well before the unexpected, yet inevitable, system failure occurs. We discuss how to design applications around an Agile-friendly framework that gracefully handles component failure in a distributed system and minimizes downtime.

It is hard to imagine today that software development in the not too distant past was focused almost entirely on coding technique and systems architecture. The evolution of software development has been driven by Moore’s law, open source software and the wave of technology companies that built highly successful software systems focused on optimizing the user experience. Today we find ourselves with more complex and dynamic software architectures in the form of distributed systems that are held to a higher quality standard and produced at a more rapid pace.

Agile development teams today utilize test-driven development, automated build scripts, iterations, user stories, retrospectives, velocity, daily stand-ups and burndown charts. These Agile practices provide a framework for feature prioritization, a way to focus on assumptions, and a means of developing constraints and key performance indicators that keep projects on schedule. The process can be used whether we are creating a Minimum Viable Product (MVP) or a multiple-sprint product release.

However, software systems developed using Agile techniques and modern software architectures can still present problems. One problem we wanted to address was why distributed systems, web-based or not, all too often suffer from sub-optimal response times or, worse yet, become completely non-responsive as the number of users increases.

Chaos Monkey

Virtually every Agile team that has created a set of Principles of Experience Design (a list of design rules that define the user’s interaction with the application) will include a maxim that speaks to establishing trust between the user and the system. If users cannot depend on the system, they simply will not trust it and will find alternatives. It was this system-user trust relationship that drove us to rethink our approach and architect distributed systems with the inherent ability to correctly detect system service failures and gracefully route requests around them.

To test the resiliency and recoverability of their Amazon Web Services (AWS) cloud-based infrastructure, Netflix engineers developed an open source software tool named Chaos Monkey. According to the developers, it was named for the way it “wreaks havoc like a wild and armed monkey set loose in a data center.” The tool allows developers and system architects to test the design and overall architecture of a system by simulating failures of service instances running within Auto Scaling Groups (ASGs): it randomly shuts down one or more of the virtual machines so that the team can observe the result and examine the log files. The idea is that when the ASG detects the instance termination, it automatically brings up a new, identically configured instance. Although users would more than likely lose their session state if they were connected to an instance that was suddenly terminated, the application would remain available to them.
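The terminate-and-replace cycle described above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: it makes no AWS calls and does not use Chaos Monkey itself, and the names `AutoScalingGroup`, `reconcile` and `chaos_round` are hypothetical, chosen for illustration.

```python
import itertools
import random


class AutoScalingGroup:
    """Toy stand-in for an ASG (hypothetical): keeps `desired` instances alive."""

    def __init__(self, desired):
        self.desired = desired
        self._ids = itertools.count(1)
        self.instances = set()
        self.reconcile()

    def reconcile(self):
        """Launch identically configured replacements until capacity is met."""
        launched = []
        while len(self.instances) < self.desired:
            instance_id = f"i-{next(self._ids):04d}"
            self.instances.add(instance_id)
            launched.append(instance_id)
        return launched

    def terminate(self, instance_id):
        """Simulate an instance disappearing without warning."""
        self.instances.discard(instance_id)


def chaos_round(group, rng, kills=1):
    """Terminate `kills` randomly chosen instances, then let the group recover."""
    candidates = sorted(group.instances)
    victims = rng.sample(candidates, k=min(kills, len(candidates)))
    for victim in victims:
        group.terminate(victim)
    replacements = group.reconcile()
    return victims, replacements


if __name__ == "__main__":
    group = AutoScalingGroup(desired=3)
    victims, replacements = chaos_round(group, random.Random(), kills=2)
    print("terminated:", victims)
    print("replaced by:", replacements)
    print("capacity restored:", len(group.instances) == group.desired)
```

In this toy model the group always returns to its desired capacity, but the terminated instance IDs never come back, which mirrors the point that sessions tied to a terminated instance are lost even though the application stays available.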

Message Queue

The concept of running many identical instances and terminating some of them to understand the effect on the system architecture is ideal for large systems, but it does not readily lend itself to Agile development of systems built from microservices. After working with Chaos Monkey, we wanted to see if we could achieve the same functionality in smaller distributed systems that provide a multitude of microservices.

Figure 1 – Message Queue Architecture

The means to achieve this was to abstract out all interface and database calls and route them through multiple persistent, redundant message queues on a Message Bus identified at run-time. Figure 1 shows that by augmenting the queues themselves with the ability to restart very granular, preconfigured consumer services, a message queue can effectively re-route “around” a non-responsive service: it spawns another instance of the service and then delivers the message to the newly created instance. In addition, a few other changes were necessary to make this work:

  • All Consumer services had to provide logging indicating whether a message was fully processed.
  • The Message Bus had to have the ability to re-launch message queue(s) if any of them unexpectedly terminated.
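The behavior described above (a queue that respawns a non-responsive consumer and redelivers, consumers that log full processing, and a bus that relaunches a terminated queue) might look roughly like the following. This is a minimal single-process sketch, not the authors' implementation: every class and method name here (`Consumer`, `RestartingQueue`, `MessageBus`, `ConsumerDown`) is hypothetical, failure is simulated with a flag, and it assumes pending messages survive a queue restart because the queues are persistent.

```python
import logging
from collections import deque

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("bus")


class ConsumerDown(Exception):
    """Raised when a consumer instance is non-responsive (simulated)."""


class Consumer:
    """Hypothetical consumer service that logs each fully processed message."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.processed = []

    def handle(self, message):
        if not self.healthy:
            raise ConsumerDown(self.name)
        self.processed.append(message)
        log.info("%s fully processed %r", self.name, message)


class RestartingQueue:
    """Queue that spawns a fresh consumer when the current one fails."""

    def __init__(self, consumer_factory, max_restarts=3):
        self.consumer_factory = consumer_factory
        self.max_restarts = max_restarts
        self.pending = deque()
        self.consumer = consumer_factory()
        self.restarts = 0
        self.alive = True

    def publish(self, message):
        self.pending.append(message)

    def drain(self):
        while self.pending:
            message = self.pending[0]  # stays queued until fully processed
            attempts = 0
            while True:
                try:
                    self.consumer.handle(message)
                    break
                except ConsumerDown:
                    if attempts == self.max_restarts:
                        raise
                    attempts += 1
                    log.warning("%s is down; spawning a replacement", self.consumer.name)
                    self.consumer = self.consumer_factory()
                    self.restarts += 1
            self.pending.popleft()


class MessageBus:
    """Hands out queues by name and relaunches any that have terminated."""

    def __init__(self):
        self.factories = {}
        self.queues = {}

    def register(self, name, consumer_factory):
        self.factories[name] = consumer_factory
        self.queues[name] = RestartingQueue(consumer_factory)

    def queue(self, name):
        current = self.queues[name]
        if not current.alive:
            log.warning("queue %s terminated; relaunching", name)
            replacement = RestartingQueue(self.factories[name])
            # Assumption: persisted messages are still available after a restart.
            replacement.pending.extend(current.pending)
            self.queues[name] = current = replacement
        return current
```

A caller would register a factory with `bus.register("orders", make_consumer)`, publish through `bus.queue("orders")`, and call `drain()`. A message is removed from the queue only after `handle` returns, so a consumer failure in mid-delivery leads to redelivery rather than loss.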

ZapThink Take

Message queues are a great benefit for developers working in an Agile team environment. Different teams can work together easily, even when they use dissimilar development tools or technology stacks, because message queues enable lightweight communication between any number of services. The ability to separate an application into many small subsystems fits in well with the Agile Scrum software methodology, as these subsystems can easily fit within development sprints. Better yet, using message queues can make a system easier to maintain, test and debug, and can help an application scale easily through the deployment of additional message queues and service providers.

Despite the positives above, message queues are not a cure-all and, like all software, have downsides. These include setup effort and maintenance complexity, and a queue that is not set up correctly can even degrade application performance. However, explicitly building a mechanism to restart every component of a system, and testing it regularly throughout development, will indeed result in a system that is more resilient when (not if) the unexpected, yet inevitable, failure of a key subsystem occurs.