How To Set Reliability Targets Your Whole Company Can Agree On

Use error budgets to settle the eternal fight between shipping fast and staying up, with one shared, data-driven number.

Derrick S. K. Siawor · March 7, 2025 · 6 min read


There is a fight that happens inside every growing software company, and it never fully resolves. The product team wants to ship. The reliability-minded engineers want to slow down and make things stable. One side says "we are not moving fast enough," the other says "we are breaking things." The argument gets emotional, it gets political, and it gets decided by whoever has the most influence in the room that week rather than by anything resembling evidence. It is exhausting, and it is the wrong way to run the most important trade-off in your engineering org.

There is a better way, and it comes from the people who run some of the largest systems on earth. It is a single shared number that turns the speed-versus-stability fight from an opinion war into a data-driven decision. It is called an error budget, and once your company adopts one, the argument largely stops, because the number settles it.

The idea: you are allowed to be a little unreliable

Start with a counterintuitive premise: 100 percent reliability is the wrong target. Chasing perfect uptime means you never ship anything, because every change carries risk and the only way to never break production is to never change it. A product that never changes loses. So the real question is not "how do we never fail," it is "how much failure can we afford while still moving fast," and an error budget is how you answer it with a number instead of a vibe.

It works in three layers. At the bottom is a service-level indicator, an SLI, which is the thing you actually measure, such as the share of requests that succeed. On top of that you pick a service-level objective, an SLO, which is your reliability target for that indicator stated as a number, say 99.9 percent of requests succeed over a month. That target is a business decision about how reliable your product needs to be for your customers, not an engineering one, and it is best made with a clear sense of what an hour of downtime actually costs your business. Finally, the gap between perfect and your target becomes your error budget: if your SLO is 99.9 percent, your error budget is the 0.1 percent of unreliability you are explicitly allowed to spend. That budget is not a failure you tolerate grudgingly. It is a resource you get to use.
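To make the arithmetic concrete, here is a minimal sketch of turning an SLO into a budget counted in requests. The function names are hypothetical, not from any monitoring library, and the sketch assumes a request-based SLO measured over a fixed window.

```python
from fractions import Fraction
from math import floor


def error_budget(slo_percent: float) -> Fraction:
    """Fraction of requests allowed to fail, e.g. 99.9 -> 1/1000.

    Fraction(str(...)) keeps the arithmetic exact instead of relying on
    binary floating point (1 - 0.999 is not exactly 0.001 as a float).
    """
    target = Fraction(str(slo_percent))
    if not 0 < target < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    return (100 - target) / 100


def allowed_failures(slo_percent: float, total_requests: int) -> int:
    """Whole number of failed requests the budget permits in the window."""
    return floor(error_budget(slo_percent) * total_requests)


def budget_remaining(slo_percent: float, total_requests: int,
                     failed_requests: int) -> float:
    """Share of the budget still unspent; negative once it is overspent."""
    allowed = allowed_failures(slo_percent, total_requests)
    if allowed == 0:
        return 0.0  # assumption: a window too small to allow any failure has no budget
    return 1 - failed_requests / allowed
```

With a 99.9 percent SLO and ten million requests in the month, this allows 10,000 failed requests. If 2,500 have failed so far, three quarters of the budget is still available to spend.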

How the budget settles the fight

Here is where it becomes powerful. The error budget gives both sides a shared incentive and an objective rule, so the decision about whether to ship fast or stabilize is made by the data, not by whoever argues hardest.

The control loop is simple. As long as you are within your error budget, the system is reliable enough, so the product team ships freely. Green light, full speed, no permission needed, because the data says reliability is fine and innovation is what is needed. But if outages and bugs burn through the error budget faster than it refills, releases pause and the team's attention shifts to reliability work until the budget recovers. Red light, stabilize, because the data now says reliability matters more than the next feature. That reliability work is concrete, and it pays for itself, since reliability is an advantage competitors cannot easily copy: self-healing deploys that roll themselves back on a failed health check, for example, spend the budget far more slowly than manual rollbacks do.
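The loop above can be sketched in a few lines. This is an illustration under stated assumptions rather than a prescribed policy: the function names are hypothetical, and the freeze threshold (pause once 100 percent of the budget is spent) is one possible rule that a team would agree on in advance.

```python
def burn_rate(budget_spent: float, window_elapsed: float) -> float:
    """How fast the budget is being consumed relative to time.

    Both arguments are fractions: 0.5 means half the budget spent or half
    the measurement window elapsed. A result above 1.0 means the budget
    will run out before the window ends if nothing changes.
    """
    if not 0 < window_elapsed <= 1:
        raise ValueError("window_elapsed must be in (0, 1]")
    return budget_spent / window_elapsed


def release_decision(budget_spent: float, freeze_at: float = 1.0) -> str:
    """Return 'ship' while the budget is healthy, 'stabilize' once it is burned.

    freeze_at is an assumed, team-chosen threshold; 1.0 pauses releases
    only when the whole budget is gone.
    """
    return "stabilize" if budget_spent >= freeze_at else "ship"
```

For example, a team that has spent half its budget a quarter of the way through the month has a burn rate of 2.0. It is still allowed to ship under this rule, but it is on course to hit the freeze before the month ends.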

Figure: Error budget control loop. Ship freely while the budget is healthy; pause to stabilize when it is burned.

Notice what this does. It is not punitive. Halting releases is not a punishment for the product team; it is what the shared agreement says to do when the data indicates reliability has become more important than shipping right now. And the reverse is just as binding: when the budget is healthy, the reliability-minded engineers cannot block a release on a hunch, because the number says shipping is fine. Both sides agreed to the rule in advance, so when it triggers, nobody is overruling anybody. The budget is.

That is the game-changer. Decisions about release velocity stop depending on gut feelings and inter-departmental friction and start depending on objective data. The eternal fight becomes a thermostat.

Setting the number without over-engineering it

You do not need to be Google to use this, and you should not copy a giant company's targets blindly. The right SLO is the one that matches what your customers actually need, and setting it too high is its own failure mode: aim for 99.999 percent on a product that does not require it and you will spend enormous effort and never ship features, which is a way to lose just as surely as being unreliable.
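To see how steeply each extra nine shrinks the room to ship, a short sketch can convert an availability target into allowed downtime. It assumes a 30-day window and a time-based view of availability; the function name is hypothetical.

```python
from fractions import Fraction

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(slo_percent: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Minutes of downtime an availability target permits in the window."""
    # Exact arithmetic via Fraction avoids float rounding in 100 - target.
    budget = (100 - Fraction(str(slo_percent))) / 100
    return float(budget * window_minutes)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes per 30 days")
```

Under these assumptions, 99.999 percent leaves well under a minute of downtime per 30 days (about 26 seconds), which is why it is so costly to promise on a product that does not need it.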

So start where your customers feel it. Pick the one or two service-level indicators that actually reflect their experience, usually availability (did requests succeed) and latency (were they fast enough). Measure them in field data from real traffic rather than lab scores, and instrument them so you can find the root cause in minutes when an indicator slips. Then set a target that is honestly good enough for your product and your customers, not a vanity number. A 99.9 percent availability target gives you roughly 43 minutes of allowable downtime in a 30-day month, which for most products is a sane, achievable starting point that leaves real room to ship. You can tighten it later as the product matures and customers demand more. The point is to pick a number everyone agrees represents "reliable enough," because that agreement is what makes the budget binding.
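As an illustration of what those two indicators look like once measured, here is a minimal sketch that computes them from request records. The record shape, the choice to count only 5xx responses as failures, and the 300 ms latency threshold are all assumptions made for the example; your own definitions should follow what your customers actually notice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the customer
    latency_ms: float  # latency observed for this request, in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Share of requests that did not fail with a server error (5xx)."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Share of requests answered within the (assumed) latency threshold."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


sample = [
    Request(status=200, latency_ms=120.0),
    Request(status=200, latency_ms=450.0),
    Request(status=500, latency_ms=80.0),
    Request(status=200, latency_ms=200.0),
]
```

Each function returns a fraction between 0 and 1 that can be compared directly against an SLO such as 0.999. In the four-request sample, one request failed and one was slow, so both indicators come out at 0.75.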

What it gives a founder

For a founder, the error budget is less an engineering tool than an organizational one. It does three things that matter at the company level.

First, it ends the recurring argument, which frees your best people from re-litigating speed versus stability every planning cycle and lets them spend that energy building. Second, it makes the trade-off visible to you, because the budget is a single number you can look at: a healthy budget means the team can push, and a burning budget means you have a reliability problem that demands investment before it becomes a customer problem, which is when that investment pays back most. Third, it aligns two teams that are usually in tension around one shared goal, because both product and reliability now win or lose by the same number instead of pulling against each other.

It also forces a healthy conversation you might otherwise avoid: how reliable does our product actually need to be? Answering that honestly, tied to what customers will tolerate and what your competitors offer, is a strategic decision worth making deliberately rather than discovering by accident after an outage.

One number, one truce

The speed-versus-stability fight does not have to be permanent, and it does not have to be decided by politics. An error budget replaces the argument with a number that everyone agreed to in advance: ship freely while the budget is healthy, stabilize when it runs low, and let the data make the call. It is the closest thing engineering has to a peace treaty between the people who want to move fast and the people who want to stay up, and it works because both sides signed it before they knew which way it would point.

Setting that target sensibly, instrumenting the indicators that actually reflect your customers' experience, and building the monitoring and automated recovery that keep you inside the budget: that is the kind of reliability foundation we help teams put in place. It is the discipline behind how we run infrastructure and the consultation we do with founders deciding how reliable their product needs to be. Pick the number, agree on the rule, and let it settle the fight you are tired of having.

