DERKONLINE

Rotate Production Secrets Without Taking the App Down

A dual-key rollover keyed on kid lets you swap JWT, CSRF, and webhook secrets while every active session keeps working.

Derrick S. K. Siawor · 8 min read

There is a category of operational task that everyone agrees is important and almost nobody does, because the obvious way to do it logs every user out. Rotating your secrets (the JWT signing key, the CSRF secret, the webhook secret) is exactly that task. The naive version is "change the environment variable and restart," and the instant you do that, every session token signed with the old key fails verification, every logged-in user gets bounced to the login screen, and your support inbox fills up.

To rotate secrets without logging everyone out, use a dual-key pattern: the signer always uses the single current key, but the verifier accepts both the new key and the previous one during a grace period. Add the new key, switch signing to it, let old tokens expire naturally over their lifetime, then retire the old key once nothing in flight still depends on it.

So secrets do not get rotated. They sit in production for years, the same key that may have appeared in a leaked CI log, on an old laptop, or within a former employee's access, because rotating them is perceived as an outage. The day one does leak, the answer to "what did the holder touch" comes from audit logs that actually help after a breach. The fix is a dual-key pattern in which the new key and the old key are both valid for a grace period, so you can swap signers without invalidating anything that is still in flight. It is not complicated once you see it, and it is the difference between secrets you can actually rotate and secrets you are stuck with.

Why a naive rotation breaks everything

A JWT session token is signed with your secret. When a request comes in, your server verifies the token's signature against that secret. If the signature is valid, the user is authenticated. The whole thing hinges on the verifier using the same key the signer used, and on the token being built so it cannot be forged in the first place, which is the subject of issuing JWTs attackers cannot forge or replay.

When you change the secret and restart, the signer immediately starts using the new key. But every token already in a user's browser was signed with the old key. The verifier now only knows the new key, so it rejects every existing token, and every active user is suddenly unauthenticated. They did nothing wrong; you changed the lock while they were holding the old key.

The insight that fixes this: the signer and the verifier do not have to know the same single key. The signer uses exactly one key, the current one. The verifier can be willing to accept several. That asymmetry is the entire trick.
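To make the asymmetry concrete, here is a minimal sketch using Node's built-in `crypto` module. It signs a plain string with HMAC-SHA256 rather than a full JWT, and the secrets and the `sign` and `verifyAny` helpers are hypothetical names chosen for illustration.

```javascript
const crypto = require("node:crypto");

// Hypothetical secrets for illustration only; real ones come from a secret store.
const OLD_SECRET = "old-secret-for-demo";
const NEW_SECRET = "new-secret-for-demo";

// HMAC-SHA256 over the payload, hex-encoded.
function sign(payload, secret) {
  return crypto.createHmac("sha256", secret).update(payload).digest("hex");
}

// True if the signature matches under at least one accepted secret.
function verifyAny(payload, signature, acceptedSecrets) {
  const given = Buffer.from(signature, "hex");
  return acceptedSecrets.some((secret) => {
    const expected = Buffer.from(sign(payload, secret), "hex");
    return expected.length === given.length && crypto.timingSafeEqual(expected, given);
  });
}

const payload = "user=42";
const oldSignature = sign(payload, OLD_SECRET); // issued before the rotation

// Naive rotation: the verifier only knows the new secret, so the old token fails.
const naive = verifyAny(payload, oldSignature, [NEW_SECRET]);

// Dual-key: the signer uses only NEW_SECRET, but the verifier accepts both.
const dualKey = verifyAny(payload, oldSignature, [NEW_SECRET, OLD_SECRET]);

console.log({ naive, dualKey }); // { naive: false, dualKey: true }
```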

The dual-key pattern

[Figure: zero-downtime dual-key secret rotation. Add new key, switch signer, wait out the grace period, retire old key.]

Graceful rotation runs the new and old key side by side during a transition window:

  1. Add the new key to the verifier's accepted set, without changing the signer. Now the verifier accepts tokens signed by either the old key or the new key. Nothing has actually changed for users yet, because the signer is still issuing old-key tokens, but the system is now ready.
  2. Switch the signer to the new key. From this point, every newly issued token is signed with the new key. Existing tokens are still old-key, and they still verify, because the verifier accepts both. No user is logged out.
  3. Wait out the grace period. Old-key tokens naturally expire over time as users get reissued new-key tokens on their next login or token refresh. The grace period needs to be at least as long as your longest token lifetime, so that every old-key token in existence has had time to expire.
  4. Retire the old key. Once the grace period has passed and no valid old-key tokens can remain, remove the old key from the verifier's accepted set. The rotation is complete, and not a single user noticed.

At every moment in that sequence, every token a user is holding verifies against some key the verifier accepts. That is what makes it zero-downtime. You never have a moment where a valid token cannot be checked.
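The four steps can be sketched as state changes on a small keyring. This is an illustrative in-memory model, not a production key store: `createKeyring`, `addKey`, `switchSigner`, and `retireKey` are hypothetical helpers, and the guards simply encode the ordering the steps require.

```javascript
// Hypothetical in-memory keyring: one signing kid, a set of accepted kids.
function createKeyring(initialKid) {
  return { signingKid: initialKid, accepted: new Set([initialKid]) };
}

// Step 1: the verifier starts accepting the new key; the signer is unchanged.
function addKey(ring, kid) {
  ring.accepted.add(kid);
}

// Step 2: the signer switches, but only to a key the verifier already accepts.
function switchSigner(ring, kid) {
  if (!ring.accepted.has(kid)) throw new Error("Add the key to the verifier first");
  ring.signingKid = kid;
}

// Step 4: retire an old key, never the one currently used for signing.
function retireKey(ring, kid) {
  if (kid === ring.signingKid) throw new Error("Cannot retire the signing key");
  ring.accepted.delete(kid);
}

const ring = createKeyring("k-2024-01");
addKey(ring, "k-2024-07");       // accepted: both, signing: k-2024-01
switchSigner(ring, "k-2024-07"); // accepted: both, signing: k-2024-07
// Step 3 happens here: wait at least the longest token lifetime.
retireKey(ring, "k-2024-01");    // accepted: k-2024-07 only
```

The guards make the two dangerous mistakes impossible in this model: switching the signer to a key the verifier does not yet accept, and retiring the key that is still signing.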

Key IDs make this clean

Once a verifier accepts multiple keys, it needs to know which key to try for a given token. You can brute-force it, trying each key until one verifies, but that is wasteful and gets slow as the key set grows. The clean way is a key ID, the kid, an identifier in the token header that names which key signed it.

When the signer issues a token, it stamps the header with the kid of the key it used. When the verifier receives a token, it reads the kid, looks up exactly that key, and verifies against it directly, no guessing. During rotation, the verifier holds both keys by their IDs, and each token tells the verifier which one it needs. This also makes retirement clean: when you remove a key, any token still carrying its kid is now explicitly invalid, which is correct, because by then the grace period has ensured those tokens are expired anyway.

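// Verifier holds a map of kid -> secret. A Map is used instead of a plain
// object so that an attacker-supplied kid such as "constructor" cannot match
// an inherited property.
const keys = new Map([
  ["k-2024-01", OLD_SECRET],
  ["k-2024-07", NEW_SECRET],
]);

// decodeHeader and verify stand in for your JWT library's functions.
function verifyToken(token) {
  // The header is unverified at this point; use kid only to look up a key.
  const { kid } = decodeHeader(token);
  const key = keys.get(kid);
  if (!key) throw new Error("Unknown key id");
  return verify(token, key); // throws if signature invalid
}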
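For completeness, here is a self-contained sketch of both sides, the signer stamping the kid and the verifier looking it up, written against Node's built-in `crypto` module instead of a JWT library. It assumes HS256 tokens with a numeric `exp` claim; the secrets, `CURRENT_KID`, and `signToken` are hypothetical, and a real service would more likely lean on a maintained JWT library.

```javascript
const crypto = require("node:crypto");

// Hypothetical key set; in practice the secrets come from a secret store.
const keys = new Map([
  ["k-2024-01", "old-secret-for-demo"],
  ["k-2024-07", "new-secret-for-demo"],
]);
const CURRENT_KID = "k-2024-07"; // the single key the signer uses

const b64url = (value) => Buffer.from(value).toString("base64url");

function hmac(data, secret) {
  return crypto.createHmac("sha256", secret).update(data).digest();
}

// Sign with the current key and stamp its kid into the header.
function signToken(payload, kid = CURRENT_KID) {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT", kid }));
  const body = b64url(JSON.stringify(payload));
  const signature = hmac(`${header}.${body}`, keys.get(kid)).toString("base64url");
  return `${header}.${body}.${signature}`;
}

// Look up the key named by kid, then check the signature and the expiry.
function verifyToken(token, nowSeconds = Math.floor(Date.now() / 1000)) {
  const [header, body, signature] = token.split(".");
  if (!header || !body || !signature) throw new Error("Malformed token");
  const { alg, kid } = JSON.parse(Buffer.from(header, "base64url").toString("utf8"));
  if (alg !== "HS256") throw new Error("Unexpected algorithm");
  const secret = keys.get(kid);
  if (!secret) throw new Error("Unknown key id");
  const expected = hmac(`${header}.${body}`, secret);
  const given = Buffer.from(signature, "base64url");
  if (given.length !== expected.length || !crypto.timingSafeEqual(given, expected)) {
    throw new Error("Invalid signature");
  }
  const payload = JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
  if (typeof payload.exp !== "number" || payload.exp <= nowSeconds) {
    throw new Error("Token expired");
  }
  return payload;
}

// During the grace period, tokens from either key verify.
const exp = Math.floor(Date.now() / 1000) + 8 * 60 * 60; // 8-hour session
const oldToken = signToken({ sub: "user-42", exp }, "k-2024-01");
const newToken = signToken({ sub: "user-42", exp });
console.log(verifyToken(oldToken).sub, verifyToken(newToken).sub);
```

Retiring the old key in this sketch is `keys.delete("k-2024-01")`, after which any token still carrying that kid fails with "Unknown key id".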

The same pattern fits CSRF and webhook secrets

This is not JWT-specific. Any secret that signs something with a lifetime can be rotated the same way.

CSRF tokens are signed with a CSRF secret and have a short expiry, often an hour. To rotate, accept both the old and new CSRF secret on verification, sign new tokens with the new secret, and retire the old secret after the token lifetime has passed. Because CSRF tokens are short-lived, the grace period is short, often just an hour, which makes CSRF the easiest secret to rotate. The verification logic those tokens lean on is covered in building CSRF protection that survives OAuth callbacks.
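A rough sketch of the CSRF case, assuming a hypothetical token format of `expiresAt.hmac` bound to a session ID. The format, the secrets, and the helper names are illustrative assumptions, not the scheme of any particular framework.

```javascript
const crypto = require("node:crypto");

// Hypothetical secrets: index 0 signs, every entry verifies.
const csrfSecrets = ["new-csrf-secret-for-demo", "old-csrf-secret-for-demo"];
const CSRF_TTL_MS = 60 * 60 * 1000; // one hour, matching the example lifetime

function csrfMac(sessionId, expiresAt, secret) {
  return crypto.createHmac("sha256", secret).update(`${sessionId}:${expiresAt}`).digest();
}

// Issue a token bound to the session, signed with the current secret only.
function issueCsrfToken(sessionId, now = Date.now(), secret = csrfSecrets[0]) {
  const expiresAt = now + CSRF_TTL_MS;
  return `${expiresAt}.${csrfMac(sessionId, expiresAt, secret).toString("hex")}`;
}

// Accept a token if it is unexpired and matches under any accepted secret.
function verifyCsrfToken(sessionId, token, now = Date.now(), secrets = csrfSecrets) {
  const [expiresAtText, macHex] = String(token).split(".");
  const expiresAt = Number(expiresAtText);
  if (!Number.isFinite(expiresAt) || !macHex || expiresAt <= now) return false;
  const given = Buffer.from(macHex, "hex");
  return secrets.some((secret) => {
    const expected = csrfMac(sessionId, expiresAt, secret);
    return given.length === expected.length && crypto.timingSafeEqual(given, expected);
  });
}
```

Once `CSRF_TTL_MS` has elapsed since the switch, removing the second entry from `csrfSecrets` completes the rotation.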

Webhook secrets are slightly different because the signer is the external provider, not you. Stripe and some other providers support this directly: when you roll a webhook secret, they let you keep the old one valid for an overlap window, and you verify incoming webhooks against both secrets during that window. You add the new secret to your verifier, let the provider start signing with it, wait out the overlap, then drop the old one. Same shape, same zero-downtime result.
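On the receiving side, one way to handle the overlap is to report which secret matched, so you can see when the old one stops being used. This sketch assumes a generic scheme where the provider sends a hex HMAC-SHA256 of the raw request body; that is an assumption for illustration, not the actual signature format of Stripe or any specific provider, and `matchWebhookSecret` is a hypothetical helper.

```javascript
const crypto = require("node:crypto");

// Returns the label of the secret that matched, or null if none did.
function matchWebhookSecret(rawBody, signatureHex, secretsByLabel) {
  const given = Buffer.from(signatureHex, "hex");
  for (const [label, secret] of Object.entries(secretsByLabel)) {
    const expected = crypto.createHmac("sha256", secret).update(rawBody).digest();
    if (given.length === expected.length && crypto.timingSafeEqual(given, expected)) {
      return label;
    }
  }
  return null;
}

// Hypothetical secrets held during the overlap window.
const webhookSecrets = {
  current: "new-webhook-secret-for-demo",
  previous: "old-webhook-secret-for-demo",
};
```

Logging the returned label per request gives a simple signal: once nothing matches `previous` any more and the provider's overlap window has closed, the old secret can be dropped.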

Get the timing right

The one number that determines whether this rotation goes unnoticed is the grace period, and it has to be at least your longest token lifetime. If your session tokens last 8 hours, an old-key token issued one second before you switched the signer is valid for nearly 8 more hours, so the old key must stay accepted for at least that long. Retire it sooner and you log out exactly the users whose tokens were freshest at the moment of the switch.
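The arithmetic is simple enough to encode, which helps when several token types share a key. In this sketch, `earliestRetirement` is a hypothetical helper, and the switch time, TTLs, and 15-minute safety margin are example values.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Earliest time the old key can be retired: the signer switch time plus the
// longest lifetime of anything the old key signed, plus an optional margin.
function earliestRetirement(switchedAt, tokenTtlsMs, marginMs = 0) {
  if (tokenTtlsMs.length === 0) throw new Error("At least one token TTL is required");
  const longestTtl = Math.max(...tokenTtlsMs);
  return new Date(switchedAt.getTime() + longestTtl + marginMs);
}

const switchedAt = new Date("2024-07-01T09:00:00Z");
const retireAfter = earliestRetirement(switchedAt, [8 * HOUR_MS, 1 * HOUR_MS], 15 * 60 * 1000);
console.log(retireAfter.toISOString()); // 2024-07-01T17:15:00.000Z
```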

Two practical habits make rotations boring:

  • Monitor authentication errors during the cutover, not after. A spike in verification failures the moment you switch signers means a key was not actually in the verifier's accepted set, and you want to catch that immediately, not from a support ticket.
  • Track your longest token TTL and any caching layers explicitly, because a cached key set that has not refreshed is a verifier that does not yet know the new key. The slowest path to propagate the new key sets your real minimum grace period.
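For the monitoring habit, a small wrapper that counts verification failures by kid is often enough to spot a bad cutover. This is a sketch with hypothetical in-process counters; `recordFailure` and `withFailureMetrics` are illustrative names, and a real service would export these counts to whatever metrics system it already uses.

```javascript
// Hypothetical in-process counters keyed by "kid:reason".
const failureCounts = new Map();

function recordFailure(kid, reason) {
  const label = `${kid ?? "missing-kid"}:${reason}`;
  failureCounts.set(label, (failureCounts.get(label) ?? 0) + 1);
}

// Wraps any verifier so failures are counted by kid before being rethrown.
// readKid extracts the kid from the token and may itself throw on bad input.
function withFailureMetrics(verify, readKid) {
  return (token) => {
    try {
      return verify(token);
    } catch (error) {
      let kid;
      try {
        kid = readKid(token);
      } catch {
        kid = undefined;
      }
      recordFailure(kid, error.message);
      throw error;
    }
  };
}
```

A sudden rise in "Unknown key id" failures for the new kid right after the signer switch suggests that some verifier instance, or a cached key set in front of it, has not yet picked up the new key.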

Why this is worth building before you need it

The reason to wire up kid-based dual-key verification now, even though you are not rotating today, is that the day you need to rotate is usually the day a secret leaked, and that is the worst possible time to discover that rotating it is an outage. This is exactly the kind of thing a small team needs in place before its first security incident. A team that built for rotation can swap a compromised key in an afternoon with nobody logged out. A team that did not is choosing between leaving a leaked key live and logging out every user, neither of which is acceptable under pressure.

This sits at the core of building auth that holds up over a product's real lifetime, and it is part of what we look at in a security audit: not just whether the secrets are strong, but whether they can be rotated at all without an outage. Separating your secrets per concern (one key for JWT, a different one for CSRF, a different one for webhooks, so one leak does not compromise the others) and making each one rotatable is the kind of foundation that turns a leaked-secret incident from a crisis into a routine. It also depends on those secrets never escaping in the first place, which is why stopping API keys from shipping in your frontend bundle is the other half of the discipline. If your app's secrets are effectively un-rotatable today, that is a fixable architecture problem, and it is much better to fix it on a calm afternoon than during an incident.