Build a Deploy That Rolls Back When Health Checks Fail
Capture the good commit, curl the health endpoint with retries, and restore the last working build automatically when a deploy goes bad.
A deploy that breaks production is bad. A deploy that breaks production and leaves you SSHing in at midnight trying to remember what the previous version was is worse. The fix is not to deploy more carefully by hand. It is to make the deploy script itself capable of noticing that the new build is unhealthy and putting the old one back without you in the loop.
This is a few dozen lines of shell, and it changes the emotional weight of every deploy. When the script captures the current good commit before it touches anything, checks the app's health after the restart, and rolls back automatically if that check fails, a bad deploy becomes a non-event. The site dipped for a few seconds, the script noticed, the previous build came back, and you got a clear message instead of a panic. That is the difference between a deploy you trust and one you brace for.
The single most important line
Before any rollback logic, the foundation is a script that fails loudly instead of plowing ahead after an error. In bash that is one line at the top:
set -euo pipefail
-e exits on any command that fails, -u errors on undefined variables, and pipefail makes a pipeline fail if any stage fails rather than only the last. Without this, a failed npm install or a git pull that hit a conflict keeps going, restarts the app on a half-updated tree, and you have shipped a broken build that "succeeded." A deploy script that does not fail fast cannot roll back reliably, because it does not know it failed.
Step one: capture the good state before you change anything
The rollback only works if you recorded what to roll back to before the deploy modified the working tree. The first real action of the script, before the pull, before the build, is to record the current commit.
PREVIOUS_COMMIT=$(git rev-parse HEAD)
echo "Current good commit: $PREVIOUS_COMMIT"
This is the single fact the entire safety net hangs on. If the new build fails its health check, you git reset --hard "$PREVIOUS_COMMIT", rebuild, and restart. You cannot do that if you did not capture the commit while it was still the live one. Capture it first, every time, no exceptions.
The same idea applies if you deploy from build artifacts rather than source: keep the previous release directory intact and switch a symlink, so rolling back is pointing the symlink at the old directory rather than rebuilding. Either way, the principle is that the last known-good version stays recoverable until the new one has proven itself.
Step two: build, then restart, then prove it is alive
The middle of the script is the ordinary deploy: stash local changes, pull, install dependencies, run any migrations, build, and restart the process. The detail that matters for safety is that the build must produce a clear success signal you can check. A build that fails should stop the deploy before the restart, not after.
A reliable pattern is to remove the build output and a success marker, run the build, and confirm the marker exists before proceeding. If the build did not complete cleanly, you never restart the app, so production keeps running the old version untouched. The worst outcome of a failed build then is "we did not deploy," which is exactly the right outcome.
Step three: the health check with retries
After the restart comes the part that makes the script self-healing. You curl the application's health endpoint and confirm it answers correctly. The retry loop matters because a freshly restarted app needs a moment to come up. A single immediate curl will fail against an app that is two seconds from being ready, so you retry with a short wait between attempts and only declare failure after several misses.
HEALTH_URL="https://yourapp.com/api/health"
check_health() {
for i in $(seq 1 5); do
if curl -fsS --max-time 5 "$HEALTH_URL" > /dev/null; then
echo "Health check passed on attempt $i"
return 0
fi
echo "Health check attempt $i failed, retrying..."
sleep 3
done
return 1
}
The -f flag makes curl return a non-zero exit on HTTP errors, so a 500 from a half-broken app counts as a failure, not a success. The retries absorb normal startup latency. After five failed attempts spaced a few seconds apart, the app is genuinely not healthy and the script acts.
A health endpoint that returns 200 only when the app can actually serve requests, not just when the process is running, is what makes this honest. A process that booted but cannot reach its database should fail the health check. If your /health returns 200 the instant the server binds a port, it will happily green-light a deploy that cannot serve a single real request, so make the endpoint check the things that have to work. The same waiting-for-readiness discipline shows up at boot time with docker compose up --wait and healthchecks killing startup races.
Step four: roll back automatically when the check fails
When check_health returns failure, the script restores the previous commit, rebuilds, restarts, and ideally checks health again on the restored version. The rollback path is the deploy path run in reverse against the known-good commit you captured at the start.
if check_health; then
echo "Deploy succeeded."
else
echo "Health check failed. Rolling back to $PREVIOUS_COMMIT"
git reset --hard "$PREVIOUS_COMMIT"
npm install --prefer-offline
npm run build
pm2 restart "$APP_NAME" --update-env
echo "Rolled back. Investigate the failed build offline."
fi
The user-visible outcome is a brief blip while the bad build was up and the restore happened, instead of an outage that lasts until a human wakes up and intervenes. The script turned a failed deploy into a self-correcting event and left a clear log of exactly what happened. That brief blip is not free either, which is why what every hour of downtime actually costs your business is the number that justifies building this in the first place.
The operational details that separate a toy from a harness
The skeleton above is the idea. A few additions make it something you can actually run against production:
- Use a graceful restart where your process manager supports it. PM2's reload swaps workers without dropping connections, which keeps the brief window during a restart from being a hard outage. The same process manager has its own deploy trap worth knowing, the PM2 multi-daemon trap that breaks your next deploy, and pairs with shipping Next.js updates with zero downtime using PM2.
- Run migrations carefully. A migration that is not backward-compatible can make rollback impossible, because the old code cannot read the new schema. Prefer expand-then-contract migrations so the previous version still works against the migrated database during the rollback window. Risky changes are easier to retract when they sit behind feature flags you can kill in one click.
- Log every step and notify on rollback. A rollback that happens silently is a rollback you find out about a week later. Pipe the deploy log somewhere and send a message when a rollback fires, so you know to look at the failed build, the kind of signal covered in turning noisy server logs into alerts you actually trust.
- Keep the deploy logic in one place per app. A single harness that knows each app's path, user, and health endpoint is far less error-prone than a pile of ad-hoc commands, because the safety steps are baked in for every app instead of remembered for some. This is the heart of making your scripts the source of truth and never fixing production by hand.
This is precisely the model our deploy infrastructure runs on: capture the current commit, pull, install, build with a success marker, restart the process as the right user, then curl the health endpoint with retries and roll back to the captured commit automatically if it fails. The reason it is worth building once is that it makes every subsequent deploy boring, and boring is what you want from a deploy. This is core server administration work, and it is the kind of reliability foundation we put under everything we run, including LadenX, our AI site-reliability engineer, which leans on exactly this discipline of verify-then-act so an automated fix can never make an outage worse. The same verify-then-act boundary is what lets you safely hand your deploy pipeline to an agent and still sleep at night, with guardrails around autonomous fixes before they touch production.
A deploy script that can heal itself is not a luxury for big teams. It is the cheapest insurance you can buy against the specific 2am that every engineer who has shipped to production remembers. If your deploys are currently a sequence of remembered commands and a hope, we can help you turn them into a harness that catches its own mistakes.






