Resilience: Planning for When the Clouds Go Away

Resilience: Planning for When the Clouds Go Away On November 25, AWS suffered a major service disruption in its Northern Virginia (US-East-1) region. It left the region out of commission for over 17 hours, and took thousands of sites and services, including Adobe, Twilio, Autodesk, the New York City Metropolitan Transit Authority, The Washington Post offline. When Planned Upgrades Go Wrong In a post-mortem, Amazon described how a planned capacity increase to their front-end fleet of Kinesis servers led to the servers exceeding the maximum number of threads allowed by the current configuration. As the servers reached their thread limits, fleet information, including membership and shard-map ownership, became corrupted. The front-end servers in turn were generating useless information, which they were propagating to neighboring Kinesis servers, causing a cascade of failures. Additionally, restarting the fleet appears to be a slow and rather painful process, in part because AWS relies on this neighboring servers model to propagate bootstrap information (rather than using an authoritative metadata store). Among other things, this meant that servers had to be brought up in small groups over a period of hours. Kinesis was only fully operational at 10:23pm, some 17 hours after Amazon received the initial alerts. Finally, the failures with Kinesis also took out a number of periphery services, including CloudWatch, Cognito, and the Service Health and Personal Health Dashboards, used to communicate outages to clients. For reasons that aren’t totally clear, these dashboards are dependent on Cognito, and may not be sharded across regions. Essentially, the post-mortem seems to imply that if Cognito goes down for a region, affected customers in that region will have no way of knowing. Resiliency, or How to Survive the Next Outage While Amazon posted a number of lessons learned in their post-mortem, which are all worth reading, today we wanted to discuss what customers can do to limit their own risks in the Cloud:

  • Build Outages into your BCP and IRP: While providing continuous service and support is the main goal, we should all be mindful of worst-case-scenarios. That means identifying critical workloads and applications, considering and having a plan to execute fail-over options, and ensuring that customers can be alerted when a failure can’t be avoided. Build response and recovery plans around these considerations.
  • Housing Critical Workloads in Multiple Availability Zones (AZ): Since the AWS outage appears to have been isolated to a single region, organizations that relied on hosting their critical systems across AZs were less impacted when US-East-1 went down. Consider services like AWS’ Multi-Region Application Architecture to fall over to a backup region. These features are not available by default, however, and must be built into an organizations’ overall architecture plan.
  • Use Amazon Route 53 Region Routing: Another best practice for ensuring geographic distribution is to use Route 53 to route users to infrastructure outside of AWS, whether it’s another cloud provider, or a more minimalist on-prem backup service.
  • Test for Your Worst Case: Adopt Netflix’s ‘Chaos Engineering’ model to test what happens when networks, applications, or infrastructures go down, and develop a road-tested plan for how to work around those failures.