PODCAST: Amazon AWS SLA & Design Around Failures – The Serverless Smarts Podcast
This month, Hillel welcomed Erica Windisch, CTO and Co-Founder, IOpipe, and AWS Serverless Community Hero. Erica introduced herself. “I was one of the first employees at Docker. Before that, I was working on OpenStack, and before that I founded a startup in 2002 building its own cloud and services. We were running into all the challenges of how to evolve from a shared webhosting environment, and how to let people run their own architectures inside of our cloud. My early involvement in cloud automation working to solve these problems has given me a lot of perspective. I jumped on serverless really quickly. I viewed it as the next step, which brings us closer to the vision that I want. I tend to be ahead of the curve, and I think we’re perfectly early this time.
“At IOpipe, we’re providing observability, debugging, and tracing tools for developers of serverless applications so they can understand how the actions and functionality of the app in production correspond to what users are actually doing. IOpipe provides insight into errors, with context such as the specific user, time, and database, as well as which other database interactions occurred simultaneously. This helps our users correlate events and build a holistic picture of any outage, even one as minimal as a single failed insert into a database.”
Kudos to AWS + IOpipe & Others – Enabling Serverless
Hillel replied, “IOpipe was really early to this space, and kudos to AWS for recognizing how much IOpipe and some of the other early adopters and tool providers have been instrumental to enabling a lot of this stuff and making this thing happen. People say that, ‘Without some of these pieces of the puzzle, we couldn’t build production applications in this environment. We don’t have enough visibility to what’s going on.’ That’s really great.”
Various Cloud Provider Serverless SLAs, It’s Complicated
Hillel raised the first topic. “An article in TechTarget compared various cloud provider serverless SLAs, and I think most interesting was the fact that the number of nines is not nearly as high as you might think. Most of the cloud providers are guaranteeing something like 99.5% uptime. When you translate it, that’s a bunch of hours a month when they could be down and you’re not running. How do people monitor all that?”
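To put a number on the "bunch of hours a month" Hillel mentions, the arithmetic can be sketched as below. This is a minimal illustration that assumes a 30-day month; the function name is made up for this example, and the tiers other than 99.5% are included only for comparison and are not attributed to any particular provider.

```python
def allowed_downtime_hours(sla_percent: float, days: int = 30) -> float:
    """Hours of downtime an uptime SLA still permits over `days` days."""
    total_hours = days * 24
    return total_hours * (1 - sla_percent / 100)


# 99.5% is the figure quoted in the discussion; the rest are illustrative tiers.
for sla in (99.5, 99.9, 99.95, 99.99):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% uptime allows about {hours:.2f} hours of downtime per 30-day month")
```

Under that assumption, 99.5% works out to about 3.6 hours per month, while each additional nine cuts the allowance by a factor of ten.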
Erica replied, “It’s something you can definitely do with IOpipe: understand what an outage looks like and how that affects your app.”
Errors Lead to Increased Invocations
Erica continued, “It’s interesting because we’ve experienced some partial outages. Amazon will refer to this as, ‘increased error rates in region,’ and there have been a few instances where that’s occurred. What will happen can vary. Sometimes the irony is that you’ll see a lot more invocations happening rather than fewer, because instead of not invoking your functions, they may accidentally invoke them three or four times each.
“In the background, they’re recycling the containers that run your function, because each function is deployed to a container, which then runs a number of invocations of that function to process requests. When each of those invocations occurs, it basically gets scheduled to a process, and this process persists for anywhere from five minutes to five hours.”
Container Creation Leak
“Three years ago, there was a bug we could see from the backend through our observability stack. Processes were lasting for eight to 12 hours. The problem wasn’t that Lambda was down. The problem was that if your application had a memory leak, the container running it wasn’t getting recycled. As a result, Amazon started running low on capacity in that region because they would create new containers, but they weren’t deleting any of the old ones. At some point, they basically had their own container leak, like a resource allocation leak in the container creation cycle for Lambda.”
The Cost of Cloud Outages & The Need for Robust Architecture
“SLAs are rather complicated. Something to remember is that nines of service reflect a company’s understanding of the cost structure of that service, the cost of an outage to them and their willingness to reimburse on that service. Five nines for S3 doesn’t mean it’s going to actually have five nines of service. What it means is that Amazon is willing to take a risk of reimbursing you should they fail to deliver on that promise.”
Design Around Failures
Erica continued, “Lambda availability, I would say, is more than two nines or three nines, but on the other side, I think that for any of these services, even things like S3, time has shown us over and over again that the most important thing to do as a developer is to have a robust application architecture that designs around these failures, within reason, because we can’t design around all the problems.
“For Lambda, one of the things that we do quite a lot is buffer things through Kinesis. Now, granted, Kinesis could go down, and that would be a really big problem. But if we shovel things into Kinesis, and we trust that Kinesis is going to work, or will work eventually, then even if there is an outage, as long as that outage is resolved within 72 hours or whatever my retention period is, the lambdas will continue processing when I come out of it. Nothing will be lost except for time.
“But every application is different, and some are time-sensitive. That’s something where you have to say, ‘What if Lambda did go out for eight hours and I need my things running within 30 seconds or I have a really big problem? I can’t put things through our queue.’ You have to make those decisions.”
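The buffering idea Erica describes can be sketched with a toy in-memory model. This is not the Kinesis API: `RetentionBuffer`, `put`, and `drain` are hypothetical names invented for this illustration, time is measured in hours on a simulated clock, and the 72-hour default simply mirrors the retention period mentioned above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Record:
    payload: str
    written_at: float  # simulated clock, in hours


class RetentionBuffer:
    """Toy stand-in for a stream with a retention period (not the Kinesis API)."""

    def __init__(self, retention_hours: float = 72.0):
        self.retention_hours = retention_hours
        self._records = deque()

    def put(self, payload: str, now: float) -> None:
        """Producers keep writing even while the consumer is down."""
        self._records.append(Record(payload, now))

    def drain(self, now: float) -> list:
        """Return payloads still within retention at `now`, oldest first.

        Records older than the retention period are dropped, which models
        an outage that outlasted the buffer.
        """
        survivors = [
            r.payload
            for r in self._records
            if now - r.written_at <= self.retention_hours
        ]
        self._records.clear()
        return survivors


# Consumer outage of 48 hours: everything is delayed, nothing is lost.
short_outage = RetentionBuffer(retention_hours=72)
short_outage.put("event-1", now=0)
short_outage.put("event-2", now=10)
print(short_outage.drain(now=48))  # ['event-1', 'event-2']

# Consumer outage of 100 hours: the oldest record has aged out.
long_outage = RetentionBuffer(retention_hours=72)
long_outage.put("event-a", now=0)
long_outage.put("event-b", now=50)
print(long_outage.drain(now=100))  # ['event-b']
```

The second case is the trade-off raised in the discussion: a buffer only converts an outage into delay if the outage ends before the retention window does, and it does nothing for work that must complete within seconds.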
The Key Metrics for Serverless Observability, Monitoring the Health of Your Apps
Hillel asked, “If I’m using IOpipe and I want to monitor the health of my application, what are the key metrics that you’d say people should be looking at in IOpipe or just in CloudWatch?”
Erica stated, “One of the most obvious ones is error rates, which are pretty critical. If you have an actual exception or an error, it’s going to bubble up and you’ll capture it. There’s an argument about whether you should allow exceptions to be raised at all, and for some applications, like those behind API Gateway, you should never ever raise an exception. You should always bubble things up through API responses, in which case you need to capture and observe those responses. Then you would maybe do alerting based on those responses.”
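One way to read the "never raise, report through the response" advice is sketched below. It assumes a Lambda function behind an API Gateway proxy integration, where the handler returns a dictionary with `statusCode` and `body`. The `process` function, the `name` field, and the `error_rate` helper are hypothetical, introduced only for illustration; this does not show how IOpipe or CloudWatch compute their metrics.

```python
import json


def process(event: dict) -> dict:
    """Hypothetical business logic; raises ValueError on bad input."""
    body = json.loads(event.get("body") or "{}")
    if "name" not in body:
        raise ValueError("missing 'name'")
    return {"greeting": f"hello {body['name']}"}


def handler(event: dict, context) -> dict:
    """Handler that converts every failure into an API response."""
    try:
        return {"statusCode": 200, "body": json.dumps(process(event))}
    except ValueError as exc:
        # Client-side problem (includes malformed JSON): report it, don't raise.
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    except Exception:
        # Anything unexpected still becomes a response rather than an exception.
        return {"statusCode": 500, "body": json.dumps({"error": "internal error"})}


def error_rate(responses: list, min_error_status: int = 500) -> float:
    """Fraction of captured responses whose status is at or above the threshold."""
    if not responses:
        return 0.0
    errors = sum(1 for r in responses if r["statusCode"] >= min_error_status)
    return errors / len(responses)


captured = [
    handler({"body": '{"name": "Erica"}'}, None),  # 200
    handler({"body": "{}"}, None),                 # 400, missing field
    handler({"body": "5"}, None),                  # 500, unexpected input shape
]
print([r["statusCode"] for r in captured])  # [200, 400, 500]
print(round(error_rate(captured), 3))       # 0.333
```

Because nothing escapes as an exception, the platform's own error count stays at zero for this function, so the captured responses become the only place an error rate can be measured.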