PODCAST: The View from Different Angles – AWS Serverless Monitoring
For this episode, Hillel and Tal from Protego were joined by Alex Glikson, Cloud Guru at Carnegie Mellon University. Alex stated, “I have been working on virtualization and cloud infrastructure topics for the last 15 years or so, even before it was called ‘cloud.’ We had the prototype of a bare-metal cloud roughly 15 to 20 years ago on x86, so it is really exciting to see those topics evolve so rapidly in the industry.
“Most of that time I’ve spent at IBM Research in Israel, and since last year I’ve been at Carnegie Mellon University in Pittsburgh while technically on a leave of absence from IBM. Carnegie Mellon was looking for some cloud expertise and I’m now working on three fronts here:
- I work with students who want to learn about cloud computing and serverless.
- I’m happy to participate in some of the excellent research organizations here at Carnegie Mellon, some of which deal with cloud computing topics.
- The third area, which is very different, is working with the faculty that is actually developing applications. Apparently, in academia, there are lots of applications being developed for teaching purposes, and I’m trying to help them to do it in a cloud-native and modern fashion.
“My passion in the last couple of years is serverless. Both on the public cloud side with things like Amazon Lambda and IBM Cloud Functions, and in the private cloud space, things like Apache OpenWhisk and more recently technologies like Kubernetes and Knative. I really like how this ecosystem is evolving. I like to test the limits of serverless and to explore what use cases might be a good or not so good fit for serverless, and essentially to help this space evolve and to include more variety of workloads.”
IBM Acquisition of Red Hat
Hillel asked for Alex’s view on the impact of IBM’s new acquisition of Red Hat. Alex replied, “At least to me, this acquisition was not a complete surprise because IBM and Red Hat have been business and technology partners for many years. Also, in the serverless space, they have been working closely together in the last year or two around Apache OpenWhisk. As you might know, Red Hat decided to use OpenWhisk as the technology for their serverless offering, and a large part of the collaboration was around better integrating OpenWhisk with Kubernetes and with Red Hat’s offering around Kubernetes – OpenShift. I think now it will continue and maybe accelerate a bit.
“Probably around the serverless technologies this merger will not have too much impact, so that work will continue. On the business side, I think it will create a larger mass of activities on the hybrid cloud side, specifically with serverless running both in the private cloud and in the public clouds, and some use cases that may potentially involve coordination between those environments in a more streamlined and straightforward manner.
“Of course, there are lots of different other aspects to that acquisition related to open source in general and the Kubernetes ecosystem and clouds and so on, but I think specifically for the serverless space things are probably going to continue, business as usual with some improvements here and there.”
More Hybrid, On-Prem
Hillel commented on a TechTarget article about the big public cloud providers pushing further into hybrid, and asked for Alex’s views on their chances. Alex replied, “I think the public cloud vendors are certainly trying to get into on-prem, and there are also success stories with, for example, Amazon having some private regions for the US government and so on. On the other side of the market, there are all those so-called legacy private cloud solutions based on OpenStack, which I guess primarily target all the workloads that are now virtualized and are being managed in sort of a cloud fashion.
“But I think the real shift will happen when there are many more cloud-native applications and there are some natural reasons for companies, especially larger companies, to want those apps to run off the public cloud. It could be for cost reasons, for example: for companies that are large enough, it might be economically viable to bring their workloads back to their own data center because they have enough economy of scale of their own. It could be for regulatory reasons, like privacy and compliance, where for example they have to run certain workloads or keep certain data in a certain geography. Or it could be because of location constraints, for connectivity, bandwidth, latency and so on. I think there are very good reasons for new applications to continue running, not in the public cloud but in some other data centers, potentially owned by other companies. But the economics there are very different, so I think that’s still an open question, whether the public cloud providers have a better chance to adopt, to build an offering that would work well for the private cloud.
“Maybe companies that have lots of experience building on-prem solutions will grow those solutions to be efficient enough and scalable enough and so on, so that they can compete with the public cloud. There are not so many vendors that have credentials in both markets. I think IBM is struggling a bit with the public cloud offering, and Google and Amazon don’t have a very strong presence on-prem, if any. I think Microsoft in that sense is in quite a unique position. They seem to be doing well in both, so we’ll see whether they are able to translate that into a successful hybrid cloud story.”
Availability of AWS Serverless Monitoring. Anything Else Missing for Serverless?
Hillel next discussed our friends at serverless application monitoring tool Epsagon, another Israeli startup in the serverless space. “They just announced exiting stealth mode and the availability of their platform. Epsagon has an observability and learning platform for serverless applications and it’s got a very deep analysis of your application structure, what’s going on, what are you paying for, what’s slowing you down and what’s happening. First of all, kudos to them. I think it’s a nice product we’ve seen and we really think it’s a great direction. I was curious, first, for your take on how important these kinds of tools and technologies are for the success of the space and is anything still missing?”
Alex replied, “I think this is a very nice tool, Epsagon. Actually, I tried it in one of my experiments with the deep learning solution on Lambda and it really helped me to understand what was going on. When you design your application on Lambda or other serverless technologies, it inherently becomes more distributed and more event-driven, so it is really difficult to understand what is going on. In my case, I implemented all of the individual pieces and then I did something that was supposed to trigger a workflow, and I didn’t have any place to go and see whether it was actually working.
“What Epsagon and similar tools provide is this observability, understanding what is going on, the ability to dig into particular aspects of the application. In particular, they can take into account the distributed and the event-driven nature of the environment. I think this is really critical, especially for larger apps and for apps that are running for a long time, that need to be monitored and managed. I think that’s definitely an area that is growing and will continue to grow.
“If we compare this to microservices platforms or, in general, to container-based platforms like Kubernetes, I think one of the reasons that Kubernetes is doing pretty well is that pretty much from day one they paid attention to manageability and monitoring and tooling around the actual runtime capabilities.”
Critical Needs in the Serverless Space
Alex continued, “It sounds like similar needs are critical also in the serverless space. Regarding areas that I think would continue evolving in the future, observability is certainly one of them. I think other areas include debugging. It’s really non-trivial to debug individual pieces of an event-driven architecture in general.
“I think another area that is likely to evolve going forward (and I guess it’s not only about tooling but also about the runtime itself) is additional runtime models, maybe somewhere in between long-running and event-driven. For example, things like Knative. There are some behaviors which are triggered by events, individual events, individual requests, but there are also some easy ways to maintain their state over several invocations of the same function. I think it’s a combination of better tooling to enable those more complex applications that involve event-driven and long-running things, and also some additional runtime capabilities.”
How Does Epsagon Differ from Protego?
Tal explained, “I met Epsagon a few months ago at a conference where we both gave a talk. They have a nice tool. It looks very promising and it solves a big problem there. Many of the other talks that were given there talked about how to know what you’re doing in the environment. Naturally, when you go to serverless, you get everything given by the provider and you don’t really see, or can’t really know, what’s going on inside their environment. So, it basically comes to solve the same issue from a different angle or from a different point of view of what is going on in the environment.
The View of Serverless from Different Angles
“I think it’s about transparency, the ability to give the user observability into what’s in their account and what’s happening, whether that’s for debugging, for understanding how the flow is going, or, from our point of view, for security. Basically, it’s the same thing that everyone tries to solve. It’s giving visibility to the user, but everyone looks at it differently. We believe security is an important issue. Obviously debugging is too, and they also try to solve how you monitor (that’s maybe not the right term), how you see the cost of your account.”
Hillel replied, “Yes. As a consumer of cloud services in serverless, I can say that billing optimization can be very important. There are some months we definitely could use that.”
Hillel then asked Alex for his take on another recent announcement, Numpywren, a system for linear algebra built on a serverless architecture.
Alex stated, “I think this is a very interesting experiment and also the baseline of it, the Pywren project itself, is very cool. It essentially allows you to transparently run distributed computations on Lambda without even being aware that this is running on Lambda. Now it is being expanded to additional algorithmic capabilities. We took a conceptually similar approach and we tried to run deep learning on Lambda, and I think it was an interesting experience as well. In model training, you’re essentially processing lots of incoming data and you’re calculating new model parameters and you’re updating the model based on those calculations.
“We built an architecture that parallelizes that process in a way that it can be efficiently deployed on a solution like Lambda. It is somewhat similar to the map-reduce approach that Pywren is implementing.
“But with Pywren, you don’t really have a good way to do the ‘reduce’ because in the general case you need to collect all the outputs from the map phase to aggregate them together, and this is not something that can be done on Lambda because of the resource constraints that Lambda has. While in the deep learning case we found a way to do this aggregation phase in a parallel manner as well, so it did map well to this lightweight elastic set of workers that we could get with Lambda.
“The part that didn’t work so well was related to state management. All those workers and aggregators have lots of state that they need to exchange. For example, in our case we worked with a model that is roughly 200 MB, and we had hundreds of workers and reducers that had to exchange some updates to that model. We ended up implementing that using S3, which I think is also the way Pywren is doing this. S3 is extremely scalable and efficient but I think it was kind of a waste because we ended up transferring terabytes of transient data to and from S3. Because of that, our solution was also not as efficient as it could have been if we were running on a fixed cluster. Obviously, it also starts to be rather expensive when you go beyond a certain scale.
“I think for this sort of workload something like Lambda is very attractive in terms of fine-grained elasticity, and in some cases it is very important to be able to quickly increase your capacity or decrease it and just pay for the resources that you are using. But for apps that are more stateful in nature, or clustered apps that need to communicate among themselves, I think there is lots of potential in introducing additional middleware that can be used to coordinate between those Lambdas and to exchange state, exchange data and so on.
“AWS Step Functions is certainly interesting, also for other purposes. This is definitely a nice way to coordinate multiple Lambda functions when you have some application state that spans multiple Lambda functions and certainly multiple Lambda function invocations.
“I think that’s one of the areas that it would be interesting to see what offerings evolve over time to make those computations still very rapidly elastic but more stateful, more services that make it easy for those computations to maintain some sort of state.”
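To give a feel for how a 200 MB model can turn into terabytes of transient S3 traffic, here is a back-of-the-envelope sketch. It is not the architecture Alex's team built: it assumes, purely for illustration, that every worker downloads the full model and uploads a full-size update once per round, and the worker and round counts are made-up values chosen only to be in the "hundreds of workers" range mentioned above.

```python
def s3_traffic_bytes(model_bytes: int, workers: int, rounds: int) -> int:
    """Bytes moved through S3 if each worker downloads the whole model
    and uploads a same-sized update once per round (an assumed pattern)."""
    per_worker_per_round = 2 * model_bytes  # one download + one upload
    return workers * rounds * per_worker_per_round


MODEL_BYTES = 200 * 10**6  # ~200 MB, the model size quoted in the interview
WORKERS = 200              # assumption: "hundreds of workers"
ROUNDS = 25                # assumption: number of update rounds

total = s3_traffic_bytes(MODEL_BYTES, WORKERS, ROUNDS)
print(f"Transient S3 traffic: {total / 10**12:.1f} TB")
```

Even with these modest assumed numbers the traffic reaches the terabyte range, which is consistent with the observation that the state exchange, not the compute, became the inefficient and expensive part.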
Lambda Increases Maximum Timeout… Enabling Bad Habits?
Lambda has increased their maximum timeout to 15 minutes. Hillel stated, “I’m wondering, is this potentially going to hurt software design for serverless in the sense that some of these constraints, like five minutes, forced us into architectures that were more serverless, that scaled better, that worked better. As we start unravelling some of these requirements and dialing them back, we’re going to slip back into some of our bad habits in terms of how we build software. Is that a fear or am I just overreacting?”
Alex replied, “I think the reality is that there are lots of different applications, different workloads, different use cases. Naturally they have different requirements, so I think it’s a natural process of trying to address more of that variety, by Lambda in this case.
“As I mentioned before, at least in my opinion, there will be additional developments in this space to try and address additional workloads, both in the event-driven world and the more long-running microservices kind of architectures and the combination of the two.
“For example, today on Lambda, the resource allocation for Lambda functions is proportional, so you just pick a memory size and the CPU and network are allocated proportionally to that. In many cases, it doesn’t really make much sense. For example, in some of our experiments, we needed lots of CPU cycles so we ended up essentially paying for lots of memory that we didn’t use, and vice versa: in some cases, the workload is very memory-intensive and the CPU is idle.
“I think the same process that happened with EC2 ten years ago, I think it is inevitable also in this space – there is going to be a need for more fine-grained resource allocation in serverless. Ideally the providers should be able to take care of that automatically, because it is not really serverless if you need to specify how much resources you need for each of the workers.
“But at least, at the moment, I think the incremental path would be to enable more customization for resource allocation. It could also include things like GPUs, for example. Say your function is doing some deep learning inference, which is quite a common use case these days. If the functions can run for a few minutes, why not allocate a GPU and accelerate the computation with the GPU, which is something that happens already in other runtime models, so why not here?
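As a rough illustration of the proportional-allocation point, the sketch below computes what a CPU-bound function is billed for when its memory setting is raised only to get more CPU. The function names and the example numbers (2048 MB configured, 256 MB actually used, 60 seconds) are hypothetical, and the price per GB-second is an assumed placeholder rather than a quoted AWS rate.

```python
ASSUMED_PRICE_PER_GB_SECOND = 0.0000166667  # assumption, not an official rate


def billed_gb_seconds(configured_mb: int, duration_s: float) -> float:
    """Billing follows the configured memory size, not the memory actually used."""
    return configured_mb / 1024 * duration_s


def unused_memory_fraction(configured_mb: int, peak_used_mb: int) -> float:
    """Share of the paid-for memory that the workload never touches."""
    return 1 - peak_used_mb / configured_mb


# Hypothetical CPU-bound function: large memory setting chosen only for CPU.
configured_mb, peak_used_mb, duration_s = 2048, 256, 60

gb_s = billed_gb_seconds(configured_mb, duration_s)
waste = unused_memory_fraction(configured_mb, peak_used_mb)
print(f"Billed: {gb_s:.0f} GB-s (~${gb_s * ASSUMED_PRICE_PER_GB_SECOND:.4f})")
print(f"Unused memory: {waste:.1%}")
```

Under these assumptions most of the memory being paid for sits idle, which is the mismatch that finer-grained resource allocation would remove.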
Why Tailor the Runtime?
“I think there are other examples of customization of the runtime that will be required to optimize workloads. Take workloads that have maybe transitioned from a non-serverless to a serverless architecture. They probably benefited a lot from that transition, but now that they are already serverless there are obvious ways to make them more efficient, and more efficient for the particular use case.
“It could be, for example, some latency requirements. There are a bunch of open source serverless projects these days that are focusing on providing low latency, which is probably difficult with the general-purpose container technologies that are being used, like in the major serverless offerings today.
“It could be about some specialized hardware, it could be many different things. I think that as more and more workloads are able to benefit from the elasticity and the ease of use of serverless, there will also be demand for custom runtimes and more customization, probably both in the public cloud space but also in the private cloud space.
“Maybe certain solutions would not make enough sense as a general public cloud offering, so there will be some tailored solutions for specific industries or specific applications even that would expand the serverless philosophy, the user experience, and so on to additional workloads.”
Hillel quipped, “It makes a lot of sense, and I look forward to being able to properly mine Bitcoin on Lambda. Tal, what are the security implications of some of these changes?”
The Value of Fine-Grained Function Permissions
Tal replied, “Actually I don’t like it, but probably I’m having the security point of view on every aspect. I think that it obviously gives the developer more time to do their stuff, which is what they like, but it also gives the attackers possibly more time on the environment to run their bad behaviors on the function or on the vulnerable resource.
“I think that before that, and I’m afraid that we’re going to lose that eventually, but maybe not, with the five-minute functions we could really fine-grain every service and what it needs to do. This is part of what we do here at Protego. We could actually build a profile of the specific behavior of that function. If the function does the same thing over and over again, or connects to a specific host, or runs a specific process, we could profile it into a behavior analysis of the function and protect the function from doing anything other than what it needs. I’m afraid that if we go to 15 minutes, or if that increases further, we’re going to be back in that container state where a function does a lot of things and it’s hard to really understand what’s needed at the time.
“Of course, this makes it even more important to start automating the process of understanding what the function needs to do and how long it needs to run, because sometimes functions are set to run for five or ten minutes when they only need a few seconds, so automating that could really help. I think that this is also part of what we do here.”
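The idea of automatically right-sizing a timeout can be sketched in a few lines. This is only an illustration of the general principle, not Protego's method: the `recommend_timeout` helper, the safety factor, and the sample durations are all hypothetical.

```python
import math


def recommend_timeout(durations_s, headroom=2.0, floor_s=1):
    """Suggest a timeout in whole seconds: the slowest observed run
    multiplied by a safety factor, never lower than floor_s."""
    if not durations_s:
        raise ValueError("need at least one observed duration")
    return max(floor_s, math.ceil(max(durations_s) * headroom))


# Hypothetical observations for a function configured with a 300 s timeout.
observed = [1.2, 0.8, 2.5, 1.9]
configured_timeout_s = 300

suggested = recommend_timeout(observed)
print(f"Configured: {configured_timeout_s} s, suggested: {suggested} s")
```

A tighter timeout shortens the window an attacker has inside a compromised invocation, which is the concern Tal raises about longer-running functions.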
The Latest in Serverless from Protego
- Hillel commented on Protego’s recent research showing a serverless botnet attack.
- Tal gave an update on The OWASP Serverless Top Ten project.
- Hillel and others from Protego will be at re:Invent.
- Tal is speaking on serverless security at Infosecurity North America.
Coming up on the 4 year anniversary of the preview announcement of #AWSLambda here at @awscloud and put together this small snapshot of what the team has delivered over the past 4 years. tl;dr: its a lot. #serverless pic.twitter.com/bkfNUWSSDt
— chrismunns (@chrismunns) October 23, 2018
Hillel began, “It’s a nice release history of the last four years. It’s really impressive how far the platform has come, how much they do, and how frequently they release new features. Then, obviously, there are other vendors as well. I just think the ecosystem has done a really good job of taking this from a concept which I don’t think anyone understood four years ago to a really full-blown architecture, and this is a nice snapshot of that.”
EC2 is the new on-prem
— Ben Kehoe (@ben11kehoe) October 12, 2018
Alex said, “My favorite tweet is by Ben Kehoe, who I think was your guest previously on this show, and he’s saying that EC2 is the new on-prem. It’s a bit funny. I think what this really means, at least to me, is that EC2 is kind of the new legacy that you move away from to new technologies like serverless. I think this is really what is happening these days in cloud-native applications.
“Just a quick example, in a cloud computing course here we give students a task. They can use any service as long as they don’t own their own VMs. They need to choose which hosted services they want to use, and they can use anything they want but not just plain EC2 instances. I think this is a really interesting trend that will continue to evolve and it will be exciting to see.”
Hillel commented, “It’s a great tweet and it’s always interesting to see how quickly things that were brand new just recently can become the legacy that we have to get away from.”
I’ve been moving a lot of @haveibeenpwned stuff to @AzureFunctions lately which has massively cut my app services costs (hundreds per month). Just checked my Function usage for the last month: 77,500,000 executions and 1,580,000,000,000 execution units which costs… $33.59
— Troy Hunt (@troyhunt) October 26, 2018
Tal said, “I chose a tweet by Troy Hunt. Usually I’m more security-focused or at least funny-focused, and I don’t go for things that talk about the cost change when moving to serverless, but in this case he’s speaking about how he moved Have I Been Pwned. It’s a security site. You go there, and you see if your records show up in recent breaches. He said he moved everything to Azure Functions and his costs… he has more than a trillion execution units and 77.5 million executions a month, and it costs him only $33.59.”
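The figure in the tweet can be roughly reproduced with a few lines of arithmetic. The sketch below rests on assumptions that appear neither in the tweet nor in the episode: that "execution units" are measured in MB-milliseconds, and that the consumption plan charged $0.20 per million executions and $0.000016 per GB-second after a monthly free grant of 1 million executions and 400,000 GB-seconds.

```python
# Usage figures from the tweet.
EXECUTIONS = 77_500_000
EXECUTION_UNITS = 1_580_000_000_000  # assumed to be MB-milliseconds

# Assumed consumption-plan rates and free grant (not stated in the source).
PRICE_PER_MILLION_EXECUTIONS = 0.20
PRICE_PER_GB_SECOND = 0.000016
FREE_EXECUTIONS = 1_000_000
FREE_GB_SECONDS = 400_000


def monthly_cost(executions: int, execution_units: int) -> float:
    """Estimate the monthly bill from execution count and execution units."""
    gb_seconds = execution_units / 1_024_000  # MB-ms -> GB-s
    billable_gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)
    billable_executions = max(0, executions - FREE_EXECUTIONS)
    compute = billable_gb_seconds * PRICE_PER_GB_SECOND
    requests = billable_executions / 1_000_000 * PRICE_PER_MILLION_EXECUTIONS
    return compute + requests


print(f"Estimated monthly cost: ${monthly_cost(EXECUTIONS, EXECUTION_UNITS):.2f}")
```

Under these assumed rates the estimate comes to about $33.59, split between roughly $18 of compute and $15 of per-execution charges, which lines up with the amount quoted in the tweet.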
Hillel replied, “That’s definitely one of those use cases that maps really well to both the cost savings and the flexibility of serverless.
“Alex, really great to have you on. I really appreciate your insights. We don’t have a cloud guru on every single day and having one from Carnegie Mellon is really exciting, so really, thank you so much for being on.”