When Lambda deployments aren’t zero-downtime (and how to fix it)

MOIA Engineering
5 min read · Sep 27, 2022


by: Johannes Gehrs

When using AWS Lambda you quickly get used to the luxury of seamless zero-downtime deployments. At MOIA, we usually accomplish this via CloudFormation deployments, often intermediated by CDK. Just update the stack and your new software is running in the cloud, no downtime whatsoever.

When the abstraction breaks

Under the hood, CloudFormation will make two different API calls to bring your Lambda into the desired state of affairs: UpdateFunctionCode, to make the Lambda run the new code, and UpdateFunctionConfiguration, to adapt the configuration of the Lambda as needed.

Those two calls are not atomic; they are made with some time in between.
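
For illustration, here is roughly what those two calls look like when made directly with the AWS SDK for Go v2. This is a sketch of the API shape rather than our actual deployment tooling; the function name and artifact path are placeholders.

package main

import (
    "context"
    "log"
    "os"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/lambda"
    "github.com/aws/aws-sdk-go-v2/service/lambda/types"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := lambda.NewFromConfig(cfg)

    zipFile, err := os.ReadFile("lambda.zip") // placeholder artifact path
    if err != nil {
        log.Fatal(err)
    }

    // First call: upload the new code.
    _, err = client.UpdateFunctionCode(ctx, &lambda.UpdateFunctionCodeInput{
        FunctionName: aws.String("your-function"), // placeholder function name
        ZipFile:      zipFile,
    })
    if err != nil {
        log.Fatal(err)
    }

    // Second call, some time later: change the configuration. Invocations
    // that happen between the two calls see a mismatched combination of
    // code and configuration.
    _, err = client.UpdateFunctionConfiguration(ctx, &lambda.UpdateFunctionConfigurationInput{
        FunctionName: aws.String("your-function"),
        Runtime:      types.RuntimeProvidedal2,
        Handler:      aws.String("bootstrap"),
    })
    if err != nil {
        log.Fatal(err)
    }
}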

The problem occurs when changes to function code and configuration depend on one another. In this case, for some period of time, the function will be called with an incompatible combination of code and configuration, and the Lambda will fail.

This may not be a problem in many use cases, but it is for Lambdas with a lot of traffic and high availability requirements.

In this blog post, I will outline two cases in which this problem has occurred at MOIA. Both of them involve Lambdas written in Go. However, both the problem and the solution outlined here should apply, in principle, to all Lambdas, especially those running on the custom provided runtimes.

Switching from the Go 1.x runtime to the Custom AL2 runtime

This is a case where AWS's communication is confusing, so some introductory comments are needed to make the problem comprehensible.

When AWS initially introduced Go-based Lambdas, they called the runtime go1.x. This runtime, however, worked quite differently from preexisting Lambda runtimes. For example, the Python runtime starts a Python interpreter, loads a Python module whose name you provide and then, on invocation, calls a function in that module whose name you also provide.

The Go runtime works quite differently. You compile a binary which you upload, packaged in a ZIP file. In the main function of your program you call a function from the language-specific Lambda SDK, which starts a loop that essentially long-polls AWS for new events and processes them. This approach was a success and decoupled Lambda from language-specific implementations; AWS generalized it into what is now called a Custom Runtime.
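
For reference, the entry point of such a Go Lambda looks roughly like this (a minimal sketch using the aws-lambda-go SDK; the event type and handler signature are just examples):

package main

import (
    "context"

    "github.com/aws/aws-lambda-go/lambda"
)

// handler is called once per incoming event.
func handler(ctx context.Context, event map[string]interface{}) (string, error) {
    return "hello from Lambda", nil
}

func main() {
    // lambda.Start blocks and long-polls for new events,
    // dispatching each one to the handler function.
    lambda.Start(handler)
}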

However, because of this change, AWS never published a newer version of the existing Go runtime. The original Go runtime go1.x is based on the decade-old Amazon Linux 1 and does not support arm64 based Graviton CPUs.

When Go programmers switch from go1.x to provided.al2, they are faced with a slight change in behavior. They can deploy the same binaries as before (unless the Go binary has libc-version-specific dependencies), but the naming scheme for handlers has changed.

The go1.x runtime expects that the handler field is set to the binary name. The provided.al2 runtime, however, expects that the binary is always called bootstrap.

When the developer reacts by both renaming their binary and changing the Lambda configuration to provided.al2, the foundation for downtime is laid. For some amount of time, either the go1.x runtime will look for a binary that does not exist (your handler name), or the provided.al2 runtime will do the same (looking for bootstrap while the ZIP file has not yet been replaced with the new artifact).

Switching from Intel to AWS Graviton

A very similar pattern emerges when switching from Intel CPUs (i.e. amd64) to Graviton CPUs (arm64). AWS's custom CPUs offer benefits in cost, performance and, importantly, energy efficiency over Intel CPUs. MOIA is on a mission to reduce the carbon footprint of its operations, which makes switching to Graviton-based Lambdas important to us.

When the developer naively switches to Graviton, they will find themselves in a situation similar to the one described above. For a period of time, either the Lambda runtime on Graviton will try to execute an incompatible amd64 binary, or the Lambda runtime on Intel will unsuccessfully try to execute the arm64 binary.

Why doesn’t this problem occur with language-specific runtimes?

It may be surprising or confusing that this problem is confined to the Custom runtimes, so let me give some additional context. The first case discussed above is not applicable by definition, because it’s about switching to a Custom provided runtime.

Regarding the latter case discussed above, in “conventional” Lambda runtimes such as the Java runtime, the Python runtime, or the JavaScript runtime, the deployable artifacts are independent of the underlying CPU architecture, so the problem cannot occur.

The challenge of configuration and code changes being non-atomic exists with all Lambda runtimes, though. If you're aware of a case where a similar problem occurs with language-specific runtimes, feel free to let us know.

The solution: Building forward-compatible Lambda artifacts

AWS recommends working around this problem with Lambda Versions, but we find that the approach outlined here is simpler and interferes much less with our normal deployment workflow.

Our way to solve the problem is to first provide a forward-compatible Lambda artifact and only then change the Lambda configuration. The key ingredient is a small shell script that we package as bootstrap alongside our binaries:

#!/bin/sh
# Note that the script must be placed at the root level of the archive
# and that the file must be named 'bootstrap'.
set -eu

arch_suffix=""
machine_type=$(uname -m)

# The Lambda runtime environment returns "aarch64" here if the Graviton architecture is used
if test "${machine_type}" = "aarch64"; then
    arch_suffix="_arm64"
fi

cd "${LAMBDA_TASK_ROOT}"
./"${_HANDLER}${arch_suffix}"

The script posted above does two things. First, it takes the value of the ${_HANDLER} variable and executes the binary of that name, i.e. the binary named after whatever you have set as the handler. Just package this script alongside your binary in the ZIP archive and you get the same behavior as the go1.x runtime: you can now safely switch to provided.al2 without renaming your binaries. Just be sure to make the script executable before deploying.

Second, we would like to be able to switch to the Graviton CPU architecture. To do this in a forward-compatible way, we need to package our artifact so that it contains both the amd64 and the arm64 binary.

The script assumes that your amd64 binaries carry no suffix, while the arm64 binaries carry a _arm64 suffix.

You therefore need to adapt your build script so that it builds for both architectures. Details may vary by application, but here is a simple example of how to achieve this:

GOARCH=amd64 GOOS=linux go build -o bin/your-handler ./your-handler
GOARCH=arm64 GOOS=linux go build -o bin/your-handler_arm64 ./your-handler

Package those into a ZIP file, e.g. by using CDK's lambda.Code.fromAsset, and you're done. You can now freely move back and forth between the two Lambda runtimes and CPU architectures. Just be sure to deploy changes to your artifact first, before you make any changes to the runtime configuration, so as not to defeat the intended effect.
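
For completeness, here is a sketch of what such a function definition could look like in CDK for Go (lambda.Code.fromAsset becomes Code_FromAsset in the Go bindings); the stack, function, and directory names are placeholders. The asset directory is assumed to contain the bootstrap script plus both binaries, so changing the Runtime or Architecture properties later only touches configuration, never the artifact.

package main

import (
    "github.com/aws/aws-cdk-go/awscdk/v2"
    "github.com/aws/aws-cdk-go/awscdk/v2/awslambda"
    "github.com/aws/jsii-runtime-go"
)

func main() {
    app := awscdk.NewApp(nil)
    stack := awscdk.NewStack(app, jsii.String("YourStack"), nil)

    // The "bin" asset directory is assumed to contain bootstrap,
    // your-handler and your-handler_arm64, as built above.
    awslambda.NewFunction(stack, jsii.String("YourHandler"), &awslambda.FunctionProps{
        Code:         awslambda.Code_FromAsset(jsii.String("bin"), nil),
        Handler:      jsii.String("your-handler"),
        Runtime:      awslambda.Runtime_PROVIDED_AL2(),
        Architecture: awslambda.Architecture_ARM_64(),
    })

    app.Synth(nil)
}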
