Set DNS TTLs So Failover Is Instant Without Hammering Resolvers

Short TTLs reroute traffic in seconds but multiply query load. Set TTL per record by volatility, and lower it before any planned cutover.

Derrick S. K. Siawor · September 27, 2025 · 7 min read


A server goes down at 2pm. You have a healthy standby ready to take over, and all you need to do is point your domain at it. You change the DNS record, and then you wait. And wait. Some users hit the new server within seconds. Others keep landing on the dead one for an hour, because their resolver cached the old answer and will not ask again until that cache expires. The failover you thought was instant is actually a slow, uneven rollout governed by a number you set days or years ago and probably never thought about: the TTL.

DNS time-to-live is the single setting that decides how fast you can move traffic during an incident. Set it right and a failover propagates in under a minute. Set it wrong and your standby sits idle while a third of your users keep hitting a server that is on fire. The catch is that the same short TTL that makes failover fast also multiplies your DNS query load, so the answer is not "set everything low," it is "set each record to match how often it actually changes."

What TTL actually controls

Every DNS record carries a TTL, a number of seconds that tells resolvers how long they may cache the answer before they have to ask again. When a user's resolver looks up your domain, it caches the result for the TTL duration. During that window, every user behind that resolver gets the cached answer without a fresh query reaching your authoritative servers.

That caching is what makes DNS fast and cheap at internet scale, and it is also what makes changes slow to take effect. When you update a record, resolvers that already cached the old value keep serving it until their cached copy expires. The TTL is the maximum time a change takes to fully propagate, because it is the longest any resolver will hold a stale answer. A record with a TTL of one hour can take up to an hour to fully roll over. A record with a TTL of 60 seconds rolls over in about a minute.
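The caching behaviour described above can be sketched with a toy resolver. This is a minimal illustration, not how any real resolver is implemented: the `CachingResolver` class, the `authoritative` callable, and the `example.com` names and addresses are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class CachedAnswer:
    value: str         # the cached record data, e.g. an IP address
    expires_at: float  # time in seconds after which the answer must be re-fetched


class CachingResolver:
    """Toy recursive resolver that caches each answer for its TTL."""

    def __init__(self, authoritative):
        # authoritative: hypothetical callable, name -> (value, ttl_seconds)
        self.authoritative = authoritative
        self.cache = {}
        self.upstream_queries = 0

    def resolve(self, name, now):
        entry = self.cache.get(name)
        if entry is not None and now < entry.expires_at:
            return entry.value  # served from cache, no upstream query
        value, ttl = self.authoritative(name)
        self.upstream_queries += 1
        self.cache[name] = CachedAnswer(value, now + ttl)
        return value


# The authoritative zone: one record with a 60-second TTL.
zone = {"app.example.com": ("192.0.2.10", 60)}
resolver = CachingResolver(lambda name: zone[name])

first = resolver.resolve("app.example.com", now=0)    # cache miss, upstream query
zone["app.example.com"] = ("192.0.2.20", 60)          # operator changes the record
stale = resolver.resolve("app.example.com", now=30)   # still inside the TTL window
fresh = resolver.resolve("app.example.com", now=60)   # window expired, re-fetched

print(first, stale, fresh, resolver.upstream_queries)
```

The second lookup returns the old address even though the zone has already changed, which is exactly the stale window the TTL controls.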

The failover math

For failover, lower is faster, and the relationship is direct. If your failover target has a TTL of 3600 seconds, a switch can take up to an hour to reach everyone. If it has a TTL of 60 seconds, it reaches everyone in about a minute. For a health-checked endpoint that needs to fail over the instant a server dies, 30 to 60 seconds is the range that gives you near-instant rerouting. A short TTL is also what lets an automated rollback, such as a deploy script that rolls itself back when health checks fail, actually reach users quickly rather than trickling out. The downtime a slow failover causes is not abstract: every hour of downtime carries a real, countable cost.

So why not set every record to 30 seconds and never worry about failover speed again? Because the cache that a short TTL defeats is the cache that protects your authoritative DNS from a flood of queries.

The cost of a short TTL

Every resolver that serves your domain must re-fetch each record once per TTL window. That is where the query load comes from, and the numbers are stark. A single resolver serving 10,000 users makes one authoritative query per TTL window per record. At a TTL of 3600 seconds, that is 24 queries a day for that record. At a TTL of 60 seconds, it is 1,440 queries a day, for the same record, from the same resolver.

Dropping the TTL from 3600 to 60 multiplies your authoritative query volume by a factor of 60. Across thousands of resolvers and millions of users, that is the difference between a quiet DNS zone and one fielding a constant storm of lookups. Short TTLs give you quick failover but increase query load; long TTLs reduce load but slow propagation. There is no free lunch, only a trade-off you choose deliberately per record.
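The arithmetic above is easy to reproduce. The sketch below assumes the simplified model used in this section: each resolver is busy enough to re-fetch the record once per TTL window, all day. The function names are illustrative, not part of any DNS library.

```python
SECONDS_PER_DAY = 86_400


def queries_per_day(ttl_seconds, resolvers=1):
    """Authoritative queries per day for one record.

    Assumes every resolver re-fetches once per TTL window and counts
    only whole windows, so this is an estimate rather than a measurement.
    """
    return resolvers * (SECONDS_PER_DAY // ttl_seconds)


def load_multiplier(old_ttl_seconds, new_ttl_seconds):
    """How much the query volume grows when the TTL is shortened."""
    return old_ttl_seconds / new_ttl_seconds


print(queries_per_day(3600))      # 24
print(queries_per_day(60))        # 1440
print(load_multiplier(3600, 60))  # 60.0

# Hypothetical scale-up: the same record seen by 5,000 resolvers.
print(queries_per_day(60, resolvers=5000))
```

Real traffic will come in lower than this for quiet resolvers, since a resolver only re-fetches when a user actually asks for the record after the cache has expired.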

Set TTL per record, by how often it changes

The resolution is to stop treating TTL as one global setting and start setting it per record according to that record's volatility. Records that almost never change get long TTLs; records that need to move fast get short ones.

A sensible default split:

Stable records get long TTLs (3600 to 86400 seconds). Your apex A record, MX records, and TXT records (SPF, DKIM, DMARC, verification strings) rarely change. Give them long TTLs so they cache hard and keep your query load low. There is no failover benefit to a short TTL on a record you change twice a year.
Volatile and failover records get short TTLs (60 to 300 seconds, or as low as 30 for health-checked endpoints that must fail over immediately). Health-checked load balancer endpoints, autoscaled IPs, and any target you might need to reroute in an incident get short TTLs so you can move them fast. The extra query load is the price of agility, and for these specific records it is worth paying.

This way you pay the short-TTL query cost only on the handful of records where fast change actually matters, and you keep the long-TTL efficiency on everything else.
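One way to keep this split honest is to audit the zone against it. The sketch below is an assumption-laden illustration: the `Record` type, the `failover_critical` flag, and the sample zone are hypothetical, and the ranges are simply the ones from the split above. In practice the records would come from your DNS provider's export or API.

```python
from dataclasses import dataclass

# TTL ranges from the split above, in seconds.
STABLE_TTL_RANGE = (3600, 86400)
FAILOVER_TTL_RANGE = (60, 300)


@dataclass(frozen=True)
class Record:
    name: str
    rtype: str
    ttl: int
    failover_critical: bool = False  # hypothetical flag set by the operator


def expected_range(record):
    """Pick the TTL range that matches the record's volatility class."""
    return FAILOVER_TTL_RANGE if record.failover_critical else STABLE_TTL_RANGE


def audit(records):
    """Return the records whose TTL falls outside the range for their class."""
    flagged = []
    for record in records:
        low, high = expected_range(record)
        if not low <= record.ttl <= high:
            flagged.append(record)
    return flagged


zone = [
    Record("example.com", "A", 3600),
    Record("example.com", "MX", 86400),
    Record("example.com", "TXT", 300),  # stable but short: wasted queries
    Record("lb.example.com", "A", 60, failover_critical=True),
    Record("api.example.com", "A", 3600, failover_critical=True),  # slow failover
]

for record in audit(zone):
    low, high = expected_range(record)
    print(f"{record.name} {record.rtype}: TTL {record.ttl}s, expected {low}-{high}s")
```

Here the audit flags the TXT record, which pays the short-TTL query cost for no benefit, and the `api.example.com` record, which would take up to an hour to fail over.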

The pre-change TTL reduction trick

Figure: Pre-change DNS TTL reduction. Lower the TTL early, wait out the old window, and the change then propagates in minutes.

There is one more technique that experienced operators use, and it gets you the best of both worlds for planned changes. Before a migration or a cutover, lower the TTL on the records you are about to change, 48 to 72 hours in advance. Then make the change.

Here is why it works. TTL changes themselves are subject to caching, so when you drop a record's TTL from 3600 to 300, resolvers still holding the old value will keep the old, longer TTL until it expires. You have to lower the TTL first and wait out the old TTL window, so that by the time you make the real change, every resolver is already operating on the short TTL and will pick up your change in minutes.
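The timing can be worked out explicitly. This sketch assumes resolvers honor the published TTL exactly, which makes the result a lower bound rather than a recommendation; the lead time of 48 to 72 hours suggested above is deliberately far more conservative. The function names and dates are illustrative.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at, old_ttl_seconds, margin_seconds=0):
    """Earliest time at which every cache holding the old, long TTL has expired.

    Assumes resolvers honor the TTL exactly; margin_seconds adds slack.
    """
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds + margin_seconds)


def worst_case_propagation(cutover_at, new_ttl_seconds):
    """Latest time a resolver on the new, short TTL can still serve the old value."""
    return cutover_at + timedelta(seconds=new_ttl_seconds)


# TTL dropped from 3600 to 300 at 09:00; the real change is made afterwards.
lowered_at = datetime(2025, 9, 24, 9, 0)
cutover = earliest_safe_cutover(lowered_at, old_ttl_seconds=3600)
done = worst_case_propagation(cutover, new_ttl_seconds=300)

print(cutover)  # 2025-09-24 10:00:00
print(done)     # 2025-09-24 10:05:00
```

Making the change before the computed cutover time means some resolvers are still caching the record under the old one-hour TTL and will serve the stale value for up to an hour, not five minutes.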

Teams that build this into their routine never experience the "why is it still not propagated after 30 hours" incident, because their TTL is already 300 seconds by the time they make the cutover. Calendar the TTL reduction two to three days ahead of any planned migration, and the migration itself propagates fast. After the change has settled, you can raise the TTL back up to restore the efficiency.

For unplanned incidents, the lesson is to have already set short TTLs on your failover-critical records so you do not need the 48-hour lead time when a server is actively down. The pre-reduction trick is for planned work; the standing short TTL on failover records is for emergencies.

Get this right before you need it

The cruel thing about TTL is that you discover you set it wrong at the worst possible moment, mid-incident, watching a third of your traffic stick to a dead server because the record's TTL was an hour and nobody had thought about it. By then it is too late to fix for this incident, because lowering the TTL now is itself subject to the old TTL.

That is why TTL strategy is something you settle in advance, as part of how you run your infrastructure, not something you reach for during an outage. It pairs naturally with anycast DNS, which cuts resolver latency, and with latency-based routing, which sends every user to their nearest region. We set this up as standard in our networking and server administration work: stable records on long TTLs for efficiency, failover-critical records on short TTLs so a switch propagates in under a minute, and the pre-change reduction baked into any planned migration. The result is a domain you can reroute in seconds when it matters, without paying for that speed on every record that never needed it.

A domain you can reroute in seconds is one small piece of how reliability becomes a competitive moat nobody can copy. The number is small and the setting is easy to ignore, right up until the day a server dies and the TTL is the only thing standing between you and an instant recovery. Set it per record, set it before you need it, and the failover you planned for actually arrives in seconds instead of trickling out over an hour.
