Never Fix Production by Hand: Scripts Are the Truth

Every 11pm manual fix is a time bomb of config drift. Encode fixes in deploy scripts so any server rebuilds from zero, identically.

Derrick S. K. SiaworOctober 4, 20257 min read

Row of clean dark server racks in a data center — Photo · Brett Sayles / Pexels

Something breaks on production at 11pm. You SSH in, find the problem, and make a quick manual fix. Edit a config, restart a service, add a line to a file. It works, the alert clears, you go to bed. The fix took four minutes and it saved the night. It also planted a time bomb.

Six months later, someone rebuilds that server, or a deploy overwrites your edit, or a teammate is debugging a problem and cannot understand why the live box behaves differently from the code. The manual fix you made at 11pm is invisible. It exists nowhere except in the running state of one machine, and nobody remembers it was made. This is configuration drift, and it is how a server slowly becomes a unique, fragile, undocumented artifact that nobody can reproduce and everybody is afraid to touch.

The discipline that prevents it is simple to state and hard to hold to: your scripts are the source of truth for server state, not the server. Every fix goes into the code that builds and deploys the box, never into the box by hand. Here is why that rule is worth the friction, and how to actually live by it.

What a manual fix actually costs

The four-minute manual fix feels free because the cost is deferred. The fix works immediately and the bill comes later, when you least expect it. The classic sequence: someone logs into a server and makes a temporary fix, six months later nobody remembers what changed, and the server now behaves in a way no document or script explains. The "temporary" fix has become permanent and undocumented, which is the worst combination.

Configuration drift is the accumulation of these. Each manual change moves the live server a little further from what your code says it should be. Eventually the running state and the defined state diverge so far that the defined state is fiction. You cannot rebuild the server from your scripts, because the scripts no longer describe the server. You cannot trust a staging environment to match production, because production has drifted somewhere staging never went. And when the box finally dies, or you need to scale to a second one, you discover that the only copy of "how this server actually works" was the running disk that just failed.

The rule: fix the script, not the server

Manual server fix causes config drift vs fixing the script then redeploying stays reproducible from zero

The fix for drift is to make a single, non-negotiable rule and hold it: never fix production by hand. When something is broken on a server, you do not patch the server. You patch the setup script or the deploy script that produces the server, and then you let that script apply the fix. The scripts become the single source of truth for what the server is, and the server is just the current output of running them.

This inverts the instinct. The instinct under pressure is to fix the symptom on the live box as fast as possible. The discipline is to spend a little more time encoding the fix into the script, so that the fix is permanent, documented, and reproducible, and so that the next server you build, the staging environment, and the disaster-recovery rebuild all get it automatically. A fix that lives only on one server is a fix you will lose. A fix that lives in the script is a fix that survives.

Concretely:

A service needs a config change? Change it in the script that writes the config, then redeploy. This is also how you ship Next.js updates with zero downtime using PM2, where the deploy script owns the restart.
A package is missing? Add it to the provisioning script, do not just apt install it on the box. The same applies to dependencies: making pnpm dev bring up your whole stack in one command is this rule applied to the local environment.
A permission is wrong? Fix the chown in the deploy flow, do not just chmod it live.
A cron job, a firewall rule, an nginx directive, all of it goes through the code that defines the server. Even process-manager state belongs here: a server registered under the wrong daemon by an ad-hoc command is exactly the PM2 multi-daemon trap that breaks your next deploy.

If a script fails, you fix the script, not the server, because the script failing is the script telling you the truth about a gap in your automation. Reaching past the failed script to fix the server by hand papers over the gap and guarantees you hit it again.

Reproducible from zero is the test

The cleanest test of whether you are living by the rule is this question: if this server died right now, could you rebuild an identical one by running your scripts against a blank machine? If the answer is yes, your scripts are the source of truth and drift is under control. If the answer is "well, mostly, but there's some stuff I'd have to remember to do by hand," every one of those by-hand things is drift, and every one is a thing you will eventually forget.

This is why the industry moved decisively toward defining infrastructure as code rather than configuring it interactively. The shift is dramatic: over 80 percent of enterprises now use infrastructure-as-code in some form, up from under 35 percent five years earlier, precisely because reproducibility-from-code eliminates the class of failure that manual configuration creates. When one central definition keeps every environment in sync, the drift that comes from manual changes during incidents simply cannot accumulate, because the manual changes are not allowed to happen.

You do not need a heavyweight tool to get the benefit. A well-written setup script and a deploy script that together encode the full state of the box deliver the same property: the server is reproducible from the code, and the code is the truth. The principle matters more than the tooling. The principle is that no important fact about the server lives only in the server.

The deploy gate that enforces it

The practical way to keep the rule honest is to make the deploy script itself the gate through which all change flows. When the only sanctioned path to changing a server is "edit the script, run the deploy," manual changes stop being the path of least resistance. The deploy pipeline pulls the code, builds, applies configuration, restarts services, and health-checks the result so it can roll itself back when checks fail, and it does all of that the same way every time. A change that is not in the code is a change that does not survive the next deploy, which trains everyone to put their changes in the code.

This is exactly how we run every box under server administration: the setup and deploy scripts are the single source of truth, fixes go into the scripts rather than onto the live machine, and any server can be rebuilt from zero by running them. It is the same instinct behind LadenX, our AI site-reliability engineer, which is built to fix the root cause through code rather than band-aid a symptom on a running box, and which classifies and refuses destructive manual commands precisely because reaching for an ad-hoc live fix is the dangerous path the discipline exists to avoid. Even when an agent runs the deploy pipeline, the scripts staying the source of truth is what lets you sleep at night.

The night the rule pays off

The payoff comes on a night you cannot predict. A server fails hard, or you need a second one fast, or a new hire needs to stand up an identical environment to debug a problem. If you have held the line, that night is a non-event: run the scripts, and an identical, correct server appears, with every fix that was ever made already baked in, because every fix went into the code. If you have not, that night is a frantic reconstruction from memory and grep, hoping you remember all the 11pm fixes that nobody wrote down.

Reproducibility like this is one of the quiet ways reliability becomes a competitive moat nobody can copy: a competitor with hand-patched servers cannot move as fast or as safely as one that rebuilds from code. The four minutes you save with a manual fix tonight are borrowed from that future night, with interest. Spend the extra few minutes now to encode the fix where it belongs, and the server stays reproducible, the team stays able to rebuild it, and the time bomb never gets planted. The scripts are the truth. The server is just what happens when you run them.

devops infrastructure automation reliability

All of the Journal