Graceful startup and shutdown for Phoenix applications

A quick guide for running a Graceful Shutdown process in your Phoenix application, ensuring clean process termination and minimizing downtime during deployments.

Introduction

If you’ve developed a Phoenix application, then you’ve heard how to keep a Phoenix application up and running—the BEAM is made for long-lived processes after all. But even long-lived processes need to end sometime. When a process ends it is important that you clean up what remains of that process or things will get messy.

Graceful Shutdown is the practice of cleaning up after all of your processes when the system shuts down. Luckily for us, many Phoenix applications do not need to worry about Graceful Shutdown because Phoenix already handles most of the hard bits. If you are running additional processes, though, you may need to handle their shutdown concerns yourself. Not to worry! In this post I will help you land the plane safely. We'll walk through Graceful Shutdown in general and one specific Graceful Shutdown behavior that we've implemented here at Felt.

What is Graceful Shutdown?

During a graceful shutdown the system should:

stop accepting new work
finish in-progress work
bring the application/system down

In most setups your load balancer should be configured to stop sending your Phoenix application new work, so today I will focus on the other two components of Graceful Shutdown.

Why Graceful Shutdown is important

This is important because at Felt we ship new code many times a day so we want the shutdown of the old server and startup of the new server to not be noticeable to users. We also want to ensure that we don't fill up our bug tracker with irrelevant errors that occur during application startup or shutdown, distracting us from errors that directly affect users.

Zero-downtime deploy overview

To simplify things, let's assume that you're using a typical blue-green deployment to facilitate your zero-downtime deploys. In a blue-green deployment, a new instance of the code is started. Once that instance is fully up and ready to receive work, the load balancer starts sending traffic to the new instance and stops sending traffic to the old one. The load balancer (or another part of the system) then initiates a Graceful Shutdown of the old instance. If everything is coded right, users won't notice the switchover at all.

Luckily, Phoenix's default configuration will handle nearly all of this for us. The biggest part of Phoenix's default Graceful Shutdown handling is `Plug.Cowboy.Drainer`. When `Plug.Cowboy.Drainer` is triggered as part of the Graceful Shutdown process, it stops listening for new HTTP connections and waits for existing connections to complete. For Phoenix channels, this is handled transparently because the client of a Phoenix channel will automatically reconnect if it is disconnected. When the channel reconnects, it will connect to the new server because the load balancer is no longer sending traffic to the old server.
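If the default drain window is too short for your longest requests, the drainer can be tuned through the endpoint configuration. The fragment below is a sketch that assumes the Cowboy adapter and an OTP application named `:my_app`; the 30-second value is arbitrary, and the option name should be checked against the `Plug.Cowboy.Drainer` documentation for the version you run.

```elixir
# config/runtime.exs (assumed location; any config file for the endpoint works)
import Config

config :my_app, MyAppWeb.Endpoint,
  # These options are handed to Plug.Cowboy.Drainer.
  # :shutdown is how long (in milliseconds) to wait for open
  # connections to finish before giving up.
  drainer: [shutdown: 30_000]
```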

Designing your system for Graceful Shutdown

Figuring out the application order can be difficult

If you're not careful you will end up with cyclic dependencies between different processes in your supervision tree.

Generally, your Phoenix endpoint should be one of the last processes in your supervision tree. This ensures that when your Phoenix endpoint starts up, the rest of your application is ready to serve traffic. Because a supervisor stops its children in the reverse of their start order, the endpoint is then also one of the first processes to stop, so in-flight requests can still rely on everything that was started before it. This is in line with the production practice of using a healthcheck endpoint (such as `/healthz`) to allow the deployment system to determine if this application instance is ready to start accepting traffic.
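To see why the position in the child list matters, here is a small self-contained sketch of how a supervisor orders startup and shutdown. `OrderDemo` and its worker names are made up for illustration; the workers only report when they start and stop.

```elixir
defmodule OrderDemo.Worker do
  use GenServer

  # Each worker gets a name and the pid that should receive its events.
  def start_link({name, report_to}) do
    GenServer.start_link(__MODULE__, {name, report_to})
  end

  # Use the worker's name as the child id so two workers can share a supervisor.
  def child_spec({name, _report_to} = arg) do
    %{id: name, start: {__MODULE__, :start_link, [arg]}}
  end

  @impl true
  def init({name, report_to}) do
    # Trap exits so terminate/2 runs when the supervisor shuts us down.
    Process.flag(:trap_exit, true)
    send(report_to, {:event, {:started, name}})
    {:ok, {name, report_to}}
  end

  @impl true
  def terminate(_reason, {name, report_to}) do
    send(report_to, {:event, {:stopped, name}})
    :ok
  end
end

defmodule OrderDemo do
  # Starts two workers under a supervisor, stops the supervisor,
  # and returns the events in the order they were reported.
  def run do
    children = [
      {OrderDemo.Worker, {:database, self()}},
      {OrderDemo.Worker, {:endpoint, self()}}
    ]

    {:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)
    :ok = Supervisor.stop(sup)
    collect([])
  end

  defp collect(events) do
    receive do
      {:event, event} -> collect([event | events])
    after
      0 -> Enum.reverse(events)
    end
  end
end
```

Calling `OrderDemo.run()` should report `:database` and then `:endpoint` starting, followed by `:endpoint` and then `:database` stopping. A process placed last in the list is the first one asked to shut down.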

Detecting and Handling a Graceful Shutdown

Here's a basic template for a Graceful Shutdown handler that you can start as part of your supervision tree:
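A minimal sketch of such a handler is shown here. The module name `MyApp.ShutdownHandler` is a placeholder chosen for illustration; put your own cleanup where the log call is.

```elixir
defmodule MyApp.ShutdownHandler do
  use GenServer
  require Logger

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    # Without trapping exits, terminate/2 is not called when the
    # supervisor shuts this process down.
    Process.flag(:trap_exit, true)
    {:ok, %{}}
  end

  @impl true
  def terminate(_reason, _state) do
    # Replace this with the real shutdown work.
    Logger.info("Graceful Shutdown occurring")
    :ok
  end
end
```

Because `use GenServer` defines a default `child_spec/1`, the module can be added to a supervisor's child list by name.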

The majority of this code is basic GenServer boilerplate, with `Process.flag(:trap_exit, true)` and a `terminate/2` callback added. Trapping exits matters: without it, the supervisor's shutdown signal stops the process before `terminate/2` can run. Now, when this process is shut down as part of the normal shutdown process, the `terminate/2` callback will run and log `Graceful Shutdown occurring`. In actual code, you'll want to put your shutdown code in `terminate/2`.

Triggering a Graceful Shutdown

If there's one thing I hope you take away from this article, it is this: closing the application with `ctrl-c ctrl-c` kills the system. It does NOT trigger a Graceful Shutdown.

There are two main ways that I'd recommend to test Graceful Shutdown in development:

If you started your server with `iex -S mix phx.server` (which is what I use 95% of the time), then you can run `System.stop()`.
Use `ps aux | grep 'mix phx.server'` to find the OS PID for the BEAM, then run `kill -s TERM <os_pid>`.

Once Graceful Shutdown starts after a `kill`, you'll likely see a log notice: `[notice] SIGTERM received - shutting down`, followed by the handler's own `Graceful Shutdown occurring` message. That second line means that your code to handle a Graceful Shutdown ran! Now let's make it do something interesting.

Handling a Graceful Shutdown: PresenceDrainer

If a node in a clustered Phoenix application dies unceremoniously (that is, without executing a Graceful Shutdown), the other nodes in the cluster will keep showing outdated presence for many seconds. We can solve this as part of our Graceful Shutdown by calling `Phoenix.Tracker.graceful_permdown(MyAppWeb.Presence)`.

Here's an example `MyApp.PresenceDrainer` module:
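The sketch below follows the handler template described earlier and adds the `Phoenix.Tracker.graceful_permdown/1` call. It assumes your presence module is named `MyAppWeb.Presence`; the log message is illustrative only.

```elixir
defmodule MyApp.PresenceDrainer do
  use GenServer
  require Logger

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    # Trap exits so terminate/2 runs when the supervisor shuts us down.
    Process.flag(:trap_exit, true)
    {:ok, %{}}
  end

  @impl true
  def terminate(_reason, _state) do
    Logger.info("Draining presence before shutdown")
    # Tell the other nodes that this node's presence replica is going
    # away for good, so they drop its entries right away.
    Phoenix.Tracker.graceful_permdown(MyAppWeb.Presence)
    :ok
  end
end
```

The drainer holds no state of its own. It exists only so that `terminate/2` runs at the right moment during shutdown, which means it has to be stopped while `MyAppWeb.Presence` is still running.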

Why this is important:

Without this, your Phoenix Presence entry hangs around for many seconds because Node B doesn't know that Node A is down.
By calling `Phoenix.Tracker.graceful_permdown/1` when the server is going away, our Phoenix Presence is more accurate.

With this, the final supervision tree for a typical Phoenix application will look like:
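A sketch of that tree, assuming the module names a freshly generated Phoenix project uses (`MyApp.Repo`, `MyAppWeb.Telemetry`, `MyApp.PubSub`, `MyAppWeb.Endpoint`). The exact position of the drainer is an assumption: it must come after `MyAppWeb.Presence` so that presence is still running when the drainer's `terminate/2` is called, and here it sits before the endpoint so the endpoint drains first.

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyApp.Repo,
      MyAppWeb.Telemetry,
      {Phoenix.PubSub, name: MyApp.PubSub},
      MyAppWeb.Presence,
      # Started after Presence, so it is stopped before Presence.
      MyApp.PresenceDrainer,
      # Started last, so it is the first child to shut down.
      MyAppWeb.Endpoint
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
```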

Now, when our application shuts down, `MyApp.PresenceDrainer` will notify other nodes in the cluster so no outdated presence information is shown. We can now land our application successfully, and our users can continue to make maps without interruption.
