Scale Realtime WebSockets Without Drowning in Backpressure

Connection limits, batched fan-out, backpressure, and jittered reconnection keep live updates smooth from ten to ten thousand users.

Derrick S. K. Siawor · February 6, 2025 · 7 min read

Realtime features are easy to build and hard to scale. Whether it is a live dashboard, a collaborative cursor, a chat thread, or a price ticker, you wire up a WebSocket, push updates to connected clients, and it works beautifully in the demo with ten people watching. Then a launch brings ten thousand, and the same code that felt magical starts dropping connections, ballooning memory, and falling over. The gap between "works for the team" and "works for everyone" is where most realtime projects get into trouble, and it comes down almost entirely to a few patterns you either built in early or now have to retrofit under load.

The encouraging part is that the patterns are well understood. Connection limits, fan-out batching, backpressure handling, and reconnection strategy are the four things that keep a realtime system smooth from ten users to ten thousand, and each one addresses a specific way that naive code breaks at scale.

Know where a single server stops

The first reality to internalize is that one server has a ceiling, and you should know roughly where it is before you hit it. A single well-tuned server can hold a large number of persistent WebSocket connections, but past roughly 100,000 active connections or 50,000 messages per second on one box, horizontal scaling stops being optional. At that point you need a pool of servers behind a load balancer, and the moment you have more than one server, you have a new problem: a message that needs to reach a user connected to server B cannot be sent from server A directly, because server A does not hold that connection.

The answer is a pub/sub backplane. The servers do not talk to clients on other servers; they publish messages to a shared channel using something like Redis, NATS, or Kafka, and every server subscribes and delivers to its own connected clients. Plan for this from the start even if you launch on one server, because adding a backplane later means rewriting your message-delivery path, and the shape of that path is hard to change once features depend on it.
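
As a minimal sketch of that delivery path, the example below uses an in-process class as a stand-in for the shared channel. `InMemoryBackplane`, `RealtimeServer`, the `"messages"` channel name, and the message shape are all hypothetical names chosen for illustration; a real deployment would replace the stand-in with a client for Redis, NATS, or Kafka.

```javascript
// Stand-in for Redis/NATS/Kafka: an in-process pub/sub used only for illustration.
class InMemoryBackplane {
  constructor() {
    this.handlers = new Map(); // channel -> Set of subscriber callbacks
  }
  subscribe(channel, handler) {
    if (!this.handlers.has(channel)) this.handlers.set(channel, new Set());
    this.handlers.get(channel).add(handler);
  }
  publish(channel, message) {
    for (const handler of this.handlers.get(channel) ?? []) handler(message);
  }
}

// One instance per server process. Each holds only its own connections.
class RealtimeServer {
  constructor(name, backplane) {
    this.name = name;
    this.backplane = backplane;
    this.clients = new Map(); // userId -> socket-like object with send()
    // Every server hears every message and delivers only to its local clients.
    backplane.subscribe("messages", ({ userId, payload }) => {
      const socket = this.clients.get(userId);
      if (socket) socket.send(payload);
    });
  }
  attach(userId, socket) {
    this.clients.set(userId, socket);
  }
  // Never write to a remote socket directly; always go through the backplane.
  sendToUser(userId, payload) {
    this.backplane.publish("messages", { userId, payload });
  }
}

// Usage: a message originating on server A reaches a client held by server B.
const backplane = new InMemoryBackplane();
const serverA = new RealtimeServer("A", backplane);
const serverB = new RealtimeServer("B", backplane);
const received = [];
serverB.attach("user-42", { send: (payload) => received.push(payload) });
serverA.sendToUser("user-42", "hello from A");
```

Because `sendToUser` always publishes, the same code path works with one server or fifty, which is the reason to adopt this shape before you strictly need it.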

[Figure: a pub/sub backplane fanning WebSocket messages across servers so any server reaches any client]

Manage connection state so any server can take any client

Closely related is where you keep per-connection state. The simple path is sticky sessions: the load balancer pins each client to the same server, so that server holds the session in memory and the client keeps reconnecting to it. That works at moderate scale and keeps things simple. The more durable path is to store connection and session state in a shared store, so any server can serve any reconnecting client and restore its state. The shared-state approach costs a little more up front and pays off the moment a server restarts or you need to rebalance, because no client is stranded by losing the one box that knew about it.
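
The shared-state path might look like the sketch below. The `InMemorySessionStore` class is a hypothetical stand-in for a real shared store, and the session shape (`subscriptions`, `lastEventId`) is an assumption made for the example, not something the pattern prescribes. The methods are async on purpose, since a real store is a network call.

```javascript
// Stand-in for a shared store such as Redis (illustration only).
class InMemorySessionStore {
  constructor() {
    this.data = new Map(); // sessionId -> serialized session
  }
  async save(sessionId, state) {
    this.data.set(sessionId, JSON.stringify(state));
  }
  async load(sessionId) {
    const raw = this.data.get(sessionId);
    return raw === undefined ? null : JSON.parse(raw);
  }
}

// Called by whichever server the load balancer picks for this connection.
// Returns the saved session, or a fresh one for a first-time client.
async function resumeSession(store, sessionId) {
  const saved = await store.load(sessionId);
  return saved ?? { subscriptions: [], lastEventId: 0 };
}

// Called whenever the session changes, so no single server holds the only copy.
async function recordProgress(store, sessionId, session, eventId) {
  const updated = { ...session, lastEventId: eventId };
  await store.save(sessionId, updated);
  return updated;
}
```

With this in place, a client that reconnects to a different server after a restart can be handed its subscriptions and last-seen event ID instead of starting from nothing.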

Backpressure: the slow client that takes down the server

This is the failure mode that surprises people, because it is not about how many clients you have; it is about your slowest one. Slow clients are among the biggest threats to WebSocket stability. When you send messages faster than a client can receive them, the unsent data has to go somewhere, and that somewhere is a server-side buffer. If you keep pushing without checking whether the client is keeping up, that buffer grows without bound. One slow consumer, on a bad mobile connection or in a throttled tab, can consume server memory until latency spikes for everyone or the process crashes. It is the server-side cousin of the memory leaks that make a long-running client slower by the hour.

The fix is to respect backpressure: check whether the socket's send buffer is draining before you queue more, and have a policy for when it is not. Set a buffer threshold, and when a client exceeds it, make a deliberate choice rather than buffering forever: drop non-essential updates for that client, send only the latest state instead of every intermediate message, or, in the extreme, disconnect a client that has fallen hopelessly behind so it reconnects fresh. The principle is that no single slow consumer is allowed to threaten the server. This is the same flow-control discipline that keeps a streamed LLM response, or any high-throughput pipe, stable: stop producing when the consumer cannot keep up, rather than letting an unbounded buffer absorb the difference.
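
A threshold policy of this kind can be sketched as follows. It assumes a socket object that exposes `bufferedAmount`, `send()`, and `close()`, as the browser `WebSocket` and the Node `ws` library both do. The two thresholds, the `essential` flag, and the close code 4000 (taken from the private-use range) are illustrative assumptions to tune for your own workload.

```javascript
// Assumed thresholds, in bytes of data queued but not yet sent.
const HIGH_WATER_MARK = 1024 * 1024; // 1 MiB: start shedding non-essential updates
const DISCONNECT_MARK = 8 * 1024 * 1024; // 8 MiB: the client is hopelessly behind

// Returns what happened, so callers can count drops and disconnects in metrics.
function sendWithBackpressure(socket, message, { essential = false } = {}) {
  if (socket.bufferedAmount >= DISCONNECT_MARK) {
    // 4000-4999 is the private-use close code range; 4000 is an arbitrary choice.
    socket.close(4000, "client too slow");
    return "disconnected";
  }
  if (socket.bufferedAmount >= HIGH_WATER_MARK && !essential) {
    // Shed load for this one client instead of growing its buffer.
    return "dropped";
  }
  socket.send(message);
  return "sent";
}
```

The "latest state only" policy is a variation on the same check: instead of dropping, keep a single pending snapshot per client and overwrite it until the buffer drains.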

Fan-out batching: do not wake everyone at once

The second scale problem is the thundering herd you create yourself. When an event needs to go to every connected client, the naive approach loops over all of them and sends immediately. At ten thousand connections that is ten thousand sends in a tight loop, and worse, ten thousand clients potentially reacting at the same instant, each firing a follow-up request that lands on your backend simultaneously. You have manufactured a traffic spike out of a single event.

The mitigation is to fan out gradually. Batch updates rather than sending each one individually, so you ship a coalesced message every short interval instead of a flood of tiny ones. Spread the delivery so some clients receive an update slightly later than others, trading strict simultaneity for stability. Most realtime features do not actually need every client updated in the same millisecond; a price ticker or a presence indicator is perfectly good with updates that arrive within a small window. Coalescing and staggering turn a spike into a smooth flow, which is what keeps both your servers and any downstream systems from buckling.
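
One way to sketch coalescing and staggering is shown below. The function names, the `key` field on each update, and the default group count and window are assumptions for the example. The idea is that updates accumulate in a `pending` array, and a periodic tick calls `flush` to send one coalesced batch to each group of clients at a slightly different time.

```javascript
// Keep only the newest update per key (for example, per ticker symbol), so a
// burst of intermediate values collapses into one entry.
function coalesce(updates) {
  const latest = new Map();
  for (const update of updates) latest.set(update.key, update);
  return [...latest.values()];
}

// Split sockets into groups and spread the groups' delays evenly over the window.
function staggerPlan(sockets, groupCount, windowMs) {
  const groups = Array.from({ length: groupCount }, (_, i) => ({
    delayMs: Math.floor((i * windowMs) / groupCount),
    sockets: [],
  }));
  sockets.forEach((socket, i) => groups[i % groupCount].sockets.push(socket));
  return groups;
}

// Send everything accumulated since the last tick as one batch per client.
function flush(pending, sockets, { groupCount = 4, windowMs = 200 } = {}) {
  if (pending.length === 0) return;
  // splice(0) empties the pending array and returns what it held.
  const batch = JSON.stringify(coalesce(pending.splice(0)));
  for (const group of staggerPlan(sockets, groupCount, windowMs)) {
    setTimeout(() => group.sockets.forEach((s) => s.send(batch)), group.delayMs);
  }
}

// Typical wiring (not run here): setInterval(() => flush(pending, sockets), 250);
```

The window should stay well below the freshness your feature actually needs; a presence indicator tolerates far more spread than a trading screen.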

Reconnection: survive the moment everyone comes back

The last pattern protects you from your own recovery. When a server restarts or a network blip drops a batch of connections, every affected client tries to reconnect. If they all retry immediately, they hit your servers at the same instant, a self-inflicted denial of service right when you are already in a fragile state. The harder you got knocked down, the harder the reconnection storm hits.

The defense has two halves. On the client, use exponential backoff with random jitter, so reconnection attempts spread out over a window of tens of seconds instead of all landing at once. This is the same retry discipline that keeps agent tool calls idempotent so a retry never double-charges a customer: a retry has to be safe and spread out, not blind and synchronized. The jitter is the important part; without it, synchronized clients still retry in lockstep even with backoff. On the server, rate-limit incoming connections so a flood gets metered rather than accepted all at once. Together, jittered client retries and server-side connection rate limiting turn a reconnection stampede into an orderly trickle that your system can absorb while it recovers.

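// Client: exponential backoff with jitter ("equal jitter" variant).
// Assumes connect() opens the socket and calls reconnect() again on failure.
let attempt = 0;
function reconnect() {
  // Double the ceiling on each attempt: 1s, 2s, 4s, ... capped at 30s.
  const base = Math.min(1000 * 2 ** attempt, 30000);
  // Wait a random time in [base / 2, base) so clients do not retry in lockstep.
  const delay = base / 2 + (Math.random() * base) / 2;
  setTimeout(connect, delay);
  attempt++;
}
// Hypothetical open handler: reset so the next outage starts from a short delay.
function onOpen() {
  attempt = 0;
}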
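
The server half can be sketched as a token bucket, which is one common way to meter incoming connections (the article does not prescribe a specific algorithm). The class name, the option names, and the injectable `now` clock are assumptions for illustration.

```javascript
// Token bucket: allows short bursts up to `burst`, then a steady `ratePerSecond`.
class ConnectionRateLimiter {
  constructor({ ratePerSecond, burst, now = Date.now }) {
    this.ratePerSecond = ratePerSecond;
    this.burst = burst;
    this.now = now; // injectable clock, which makes the limiter easy to test
    this.tokens = burst; // start full so normal traffic is never delayed
    this.lastRefill = now();
  }

  // Returns true if a new connection may be accepted right now.
  tryAccept() {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    // Refill in proportion to elapsed time, never beyond the burst size.
    this.tokens = Math.min(this.burst, this.tokens + elapsedSeconds * this.ratePerSecond);
    this.lastRefill = current;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

In a connection handler, a `false` result would typically mean refusing the upgrade so the client falls back to its jittered retry, rather than queuing the connection on a server that is still recovering.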

Build for the launch, not the demo

The thread running through all four patterns is the same: design for the worst case, not the happy one. That means the slowest client, the busiest event, and the moment a server restarts and everyone reconnects at once. A realtime feature that only accounts for the demo will work right up until it matters, then fail in front of the audience you built it for. One that accounts for the slow client, the fan-out spike, and the reconnection storm holds steady as it grows.

This is the scalability lens we apply to everything we build: the quiet question of "and at a thousand times this load?" asked of every list, every query (including the N+1 patterns that quietly slow an API), and every connection. Realtime systems make that question urgent because the cost of getting it wrong is a public outage during your best moment. Getting it right (the backplane, the bounded buffers, the batched fan-out, the jittered reconnect) is part of building web applications that survive contact with real traffic, and it is the same reliability thinking we carry into how we run the infrastructure underneath them. Build the four patterns in early, and the feature that delighted ten people delights ten thousand without a rewrite.

Tags: websockets, realtime, scaling, backpressure, architecture
