How to Ship Next.js Updates With Zero Downtime Using PM2
Build to a fresh release, health-check it, then reload PM2 worker by worker so users never see a 502 on deploy.
Watch a basic deploy closely and you will see the seam. The script pulls new code, runs the build, and restarts the app. For the few seconds between the old process dying and the new one accepting connections, every request that arrives gets a 502. If you deploy during the day, real users hit that gap. If you deploy often, your users hit it often. The error is brief, but "brief" is cold comfort to the person whose checkout failed at exactly the wrong moment. Downtime has a real cost per hour, and across enough deploys those few seconds quietly add up to a meaningful share of it.
Zero-downtime deploys close that seam. The mechanics are not complicated, and PM2 gives you most of what you need out of the box, but there are two distinct problems to solve and people usually only solve one of them. The first is the build: never restart onto a half-built or broken release. The second is the handoff: never have a moment where no process is listening. Get both right and your users stop seeing deploys at all.
Problem one: never restart onto a broken build
Next.js builds the production bundle into a .next directory. The dangerous pattern, and a common one, is to delete .next, run the build in place, and restart, all against the directory the live app is serving from. If the build fails partway, you have just destroyed the working build and replaced it with nothing, and the restart brings the app up against a broken or missing bundle. Now you are not deploying, you are firefighting.
The discipline is to treat the build as something that must fully succeed before it is allowed to become live. Build first, confirm success, and only then restart. A clean way to make "did the build succeed" unambiguous is to have the build write a success marker as its final step and check for that marker before proceeding:
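# Run this in the new release directory, not the one the live app is serving from.
rm -rf .next .build_success
# Assumes the "build" script in package.json ends with `&& touch .build_success`.
npm run build > /tmp/build.log 2>&1
test -f .build_success || { echo "build failed, not restarting"; exit 1; }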
If .build_success is not there, the build did not finish, and you stop before touching the running process. The live app keeps serving the old, working bundle. Nobody saw anything. This is the difference between a failed deploy that is a non-event and a failed deploy that is an outage: in the first, the build failure is caught before the restart; in the second, you restarted onto the wreckage.
The same logic argues for building a fresh release rather than mutating the live one in place where you can. Pull, install dependencies, build, verify the build, and only then swap the running process onto the new code. Each step has to pass before the next one runs, so a problem surfaces while the old version is still happily serving traffic.
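The "each step has to pass before the next one runs" rule can be sketched as a small Node.js runner. Treat it as an illustration under assumptions rather than a finished deploy tool: the `sh` and `runPipeline` helpers, the step names, and the git and npm commands are all hypothetical, and it presumes the build script writes `.build_success` as its last action.

```javascript
const { spawnSync } = require('node:child_process');

// Run one shell command and report success only on exit code 0.
function sh(command) {
  const result = spawnSync(command, { shell: true, stdio: 'inherit' });
  return result.status === 0;
}

// Run the steps in order and stop at the first failure.
// Returns the name of the failed step, or null if every step passed.
function runPipeline(steps, run = sh) {
  for (const step of steps) {
    if (!run(step.command)) return step.name;
  }
  return null;
}

// Hypothetical step list; adjust the commands to your own repository.
const steps = [
  { name: 'pull', command: 'git pull --ff-only' },
  { name: 'install', command: 'npm ci' },
  { name: 'build', command: 'npm run build' },
  { name: 'verify', command: 'test -f .build_success' },
  { name: 'reload', command: 'pm2 reload web --update-env' },
];

// Usage (not executed here):
//   const failed = runPipeline(steps);
//   if (failed) {
//     console.error(`deploy stopped at step: ${failed}`);
//     process.exit(1);
//   }
```

Because the reload is the last step, any earlier failure leaves the running workers untouched.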
Problem two: the handoff with no gap
Now the build is good and you need to bring the new code live without that 502 window. This is where PM2's difference between restart and reload matters, and it is not a cosmetic distinction.
pm2 restart kills the process and starts it again. There is a real gap between dead and listening, and that gap is your downtime. pm2 reload in cluster mode does something else entirely: it replaces processes one at a time. While one worker is being swapped to the new code, the other workers keep handling requests. The new worker comes up, starts accepting connections, and only then does PM2 move on to the next one. At no point is there zero capacity listening, so there is no window for a 502.
To get this, run your app in cluster mode. PM2's cluster mode uses the Node.js cluster module to run several instances of your app, and the cluster primary distributes incoming connections across them on the shared port (round-robin by default on every platform except Windows). With multiple workers, reload has somewhere to send traffic while it cycles each one:
// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'web',
    script: 'node_modules/next/dist/bin/next',
    args: 'start',
    instances: 'max', // one per CPU core
    exec_mode: 'cluster',
  }]
};
Now pm2 reload web --update-env rolls the new code out worker by worker with no gap in coverage. The --update-env flag matters when your deploy changes environment variables, because without it the reloaded workers keep the old environment and you get a confusing half-deployed state where the code is new but the config is stale. One trap worth knowing before you scale to multiple apps on one box is the PM2 multi-daemon trap that breaks your next deploy, where a process started under the wrong user silently slips out of your deploy script's reach.
Graceful shutdown: the part that quietly breaks zero-downtime
Here is the subtlety that catches teams who set up cluster mode and assume they are done. Reload only gives you zero downtime if each worker shuts down cooperatively. If PM2 tells a worker to stop and that worker is in the middle of handling requests, and it just dies, those in-flight requests are dropped. You avoided the 502 for new requests but killed the ones already in progress. That is still downtime, just a sneakier kind.
A graceful shutdown means the worker stops accepting new connections, finishes the requests it already has, and then exits, all within a timeout. The worker listens for the shutdown signal PM2 sends and drains its work before going:
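// Assumes `server` is the http.Server created by your own entry point.
process.on('SIGINT', () => {
  // stop accepting new connections, finish in-flight requests, then exit
  server.close(() => process.exit(0));
});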
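A slightly fuller sketch adds the timeout mentioned above, so a worker that cannot drain still exits on its own terms. It assumes a custom entry point that owns the `http.Server`; with a stock `next start` you do not hold that handle directly. The helper name, its options, and the 5000 ms default are illustrative assumptions, not a PM2 or Next.js API.

```javascript
// Hypothetical helper: wires a drain-then-exit handler to the stop signal.
function installGracefulShutdown(server, options = {}) {
  const {
    timeoutMs = 5000,                      // assumed drain budget
    exit = (code) => process.exit(code),   // injectable for testing
    signals = process,                     // injectable for testing
  } = options;
  let shuttingDown = false;

  const shutdown = () => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;

    // Hard deadline: exit non-zero if draining takes longer than timeoutMs.
    const deadline = setTimeout(() => exit(1), timeoutMs);
    deadline.unref(); // the pending timer alone should not keep the process alive

    // Stop accepting new connections; the callback fires once open ones have ended.
    server.close(() => {
      clearTimeout(deadline);
      exit(0);
    });

    // Idle keep-alive sockets can delay close(); drop them where supported (Node 18.2+).
    if (typeof server.closeIdleConnections === 'function') {
      server.closeIdleConnections();
    }
  };

  signals.on('SIGINT', shutdown); // SIGINT is PM2's default stop signal
  return shutdown;
}

// Usage (not executed here):
//   const server = http.createServer(handler).listen(3000);
//   installGracefulShutdown(server, { timeoutMs: 4000 });
```

Keeping the drain budget shorter than PM2's own kill timeout gives the worker a chance to exit cleanly before PM2 forces it.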
PM2 has a safety net here: if a worker does not exit within the kill timeout (kill_timeout, 1600 ms by default), PM2 sends SIGKILL to force it down, so a stuck process does not hang the whole deploy. But you want the graceful path to be the normal path, with the hard fallback as the rare exception, not the other way around. A worker that drains cleanly on every reload is what makes the zero-downtime claim actually true under real traffic.
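How long PM2 waits before forcing a worker down is configurable. As a sketch, the cluster config from earlier could be extended with PM2's `kill_timeout` and `listen_timeout` options; the numbers below are illustrative assumptions, so tune them to how long your requests and startup actually take.

```javascript
// ecosystem.config.js (sketch; the timeout values are assumptions)
module.exports = {
  apps: [{
    name: 'web',
    script: 'node_modules/next/dist/bin/next',
    args: 'start',
    instances: 'max',
    exec_mode: 'cluster',
    kill_timeout: 5000,    // ms PM2 waits after the stop signal before sending SIGKILL
    listen_timeout: 10000, // ms PM2 waits for a new worker to start listening
  }],
};
```

A `kill_timeout` shorter than your slowest normal request makes the hard fallback routine, which is the opposite of what you want.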
Verify the new release before trusting it
A deploy that brings up new code without checking that the new code works is half a deploy. The last piece is a health check: after the reload, hit a health endpoint on the app and confirm it responds correctly before you call the deploy done.
The pattern is to curl a known endpoint, retry a few times to allow for startup, and treat a sustained failure as a signal to roll back. If the new release comes up unhealthy, you want to fall back to the previous working version automatically rather than leave a broken app live while you figure out what happened. That automatic-rollback machinery is worth building once and reusing, and a deploy script that rolls itself back when health checks fail walks through exactly that. The health check turns "the process started" into "the process started and is actually serving," which are not the same thing, and the automatic rollback turns a bad release into a brief blip instead of an outage that lasts until someone notices.
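The retry-then-decide pattern can be sketched in Node.js as well as with curl. This is a minimal illustration under assumptions: the `waitForHealthy` and `httpProbe` helpers, their defaults, and the `/api/health` path are hypothetical, and the global `fetch` requires Node 18 or newer.

```javascript
// Call probe() up to `attempts` times; resolve true on the first healthy result.
async function waitForHealthy(probe, { attempts = 5, delayMs = 2000 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      if (await probe()) return true;
    } catch (err) {
      // Connection refused or timed out while the app is starting: not healthy yet.
    }
    if (attempt < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Build a probe that treats any 2xx response as healthy.
function httpProbe(url, timeoutMs = 3000) {
  return async () => {
    const response = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return response.ok;
  };
}

// Usage (not executed here): exit non-zero so the calling deploy script can roll back.
//   waitForHealthy(httpProbe('http://127.0.0.1:3000/api/health'))
//     .then((healthy) => process.exit(healthy ? 0 : 1));
```

Passing the probe in as a function keeps the retry logic independent of HTTP, so the same loop can wrap a deeper check such as a database ping.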
This full sequence, build to a verified release, reload worker by worker with graceful drain, health-check the result, and roll back on failure, is the spine of how we run deploys across the apps we operate. Wiring it into a single repeatable command rather than a pile of manual steps is part of what disciplined server administration buys you, and the principle that the script, not a human at a terminal, owns production state is the heart of making your scripts the source of truth. Once the command is reliable enough to trust, handing the deploy pipeline to an agent is the next step that lets you sleep through a release. A deploy that is one command nobody has to think about is a deploy that does not go wrong because someone skipped a step at 2am.
What a zero-downtime deploy feels like
When all of this is in place, deploying stops being an event you schedule for off-peak hours and hold your breath through. You ship in the middle of the day. The build runs against a fresh release and has to fully succeed before anything live is touched. PM2 rolls the new code out one worker at a time, each worker draining its in-flight requests before it exits, so there is never a moment with no capacity listening. The health check confirms the new version is actually serving before the deploy is declared done, and rolls back on its own if it is not.
From the outside, nothing happened. No 502s, no dropped requests, no maintenance window. The new version is just quietly live, and the next user gets it without ever knowing a deploy occurred. That invisibility is the goal. The best deploy is the one your users have no way to detect, and getting there is mostly about refusing to ever have a moment where the old version is gone and the new one is not yet ready.






