Stream LLM Responses That Never Cut Off Mid-Sentence

Token-budget headroom, thinking-config, finish-reason checks, and backpressure keep streamed AI answers complete and fast.

Derrick S. K. Siawor · May 6, 2025 · 7 min read


A user asks your assistant a question, the answer starts streaming in word by word, and then it stops. Mid-sentence. The reply just ends, sometimes on a dangling "the" or a half-finished bullet point. To the person reading it, the product looks broken, because a confident answer that cuts off reads worse than no answer at all. It signals the machine ran out of room while it was still talking.

A streamed LLM answer that cuts off mid-sentence has usually hit its token limit. On thinking models, the internal thinking tokens are spent from the same maxOutputTokens budget as the visible reply, so reasoning eats the room and generation stops at the limit. Fix it by giving maxOutputTokens generous headroom and, for simple tasks on models like Gemini 2.5 Flash, setting thinkingConfig.thinkingBudget to 0 so the whole allowance goes to the answer.

Almost always this is not a model quality problem. It is a token budget problem, and on the current generation of reasoning models it has a specific, fixable cause that catches a lot of teams off guard. Once you understand how the budget is actually spent, keeping streamed answers complete and fast comes down to a handful of configuration choices.

Figure: token budget flow, from deciding the thinking budget to checking the finish reason before rendering markdown.

The trap: thinking tokens eat your output budget

The newer Flash and reasoning models, including Gemini 2.5 Flash, generate internal "thinking" tokens before they produce the answer the user sees. The catch that surprises people is that those thinking tokens count against the same maxOutputTokens budget as the visible response. They are not free, and they are not separate.

This produces a particularly nasty failure mode. The model reasons through a problem, spends a chunk of its budget on thinking, and then has little or nothing left for the actual answer. If thinking tokens plus output tokens exceed the limit, the generation stops with a MAX_TOKENS finish reason, and the visible response can come back truncated or even completely empty. That empty case is the worst, because there is no error to catch and no text to show, just a blank where the answer should be.
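To make the arithmetic concrete, here is a minimal sketch. The helper name visibleTokensLeft and the token counts are illustrative assumptions, not measurements from any model. The only thing the code encodes is that thinking and visible output draw from one shared maxOutputTokens ceiling.

```typescript
// Hypothetical helper: tokens left for the visible answer after the
// thinking pass has drawn from the shared maxOutputTokens budget.
function visibleTokensLeft(
  maxOutputTokens: number,
  thinkingTokens: number,
): number {
  // The budget cannot go negative: once it is spent, generation stops.
  return Math.max(0, maxOutputTokens - thinkingTokens);
}

// Illustrative numbers only.
console.log(visibleTokensLeft(512, 400)); // 112: a short, likely truncated answer
console.log(visibleTokensLeft(512, 600)); // 0: the empty-response case
```

With a 512-token ceiling, a reasoning pass of a few hundred tokens leaves room for only a sentence or two, and a longer one leaves nothing at all.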

It is also a behavioral change worth knowing about. Earlier model generations worked fine with low maxOutputTokens values because they did not think before answering. The same setting that was safe a generation ago now silently starves the response.

For simple tasks, turn thinking off

If you are building an FAQ assistant, a support bot, a classification step, or anything where you do not need the model to reason through a multi-step problem, the cleanest fix is to spend zero of the budget on thinking. For a narrow classification step specifically, you might not need a frontier model at all, since a small local classifier trained on your own data can outperform one. Set the thinking budget to zero and the entire maxOutputTokens allocation goes to the answer the user actually reads.

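import { GoogleGenAI } from "@google/genai";

// Assumes the API key is in the GEMINI_API_KEY environment variable.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const stream = await ai.models.generateContentStream({
  model: "gemini-2.5-flash",
  contents: userPrompt, // the user's question, defined elsewhere
  config: {
    maxOutputTokens: 2048,
    // Zero thinking budget: the full 2048 tokens go to the visible answer.
    thinkingConfig: { thinkingBudget: 0 },
  },
});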

There is a bonus here beyond completeness. Skipping the thinking pass also makes the response start faster, because the model goes straight to generating the answer instead of reasoning first. For a chat surface where time-to-first-token is what the user feels, that is a real latency win. The thinking budget on this model can go as high as 24,576 tokens, so the difference between zero and a generous budget is not small. It is also the same lever you pull when you are cutting an LLM bill without touching answer quality.

For real reasoning tasks, give headroom to both

Some tasks genuinely need the model to think. A complex extraction, a multi-step plan, a tricky judgment call. For those, do not disable thinking; instead give the budget enough room for both the reasoning and a full answer. The mistake is setting maxOutputTokens to a tight value like 512 and assuming the visible answer gets all of it. With thinking on, that 512 has to cover the reasoning too, and the answer gets whatever is left.

Set the ceiling high enough that even a long reasoning pass leaves room for a complete reply. Headroom is cheap; truncated answers are expensive. The token ceiling only guarantees the answer can finish. Keeping it concise is a job for the system prompt, not the budget, so instruct the model to be brief in the prompt and give it enough budget that brevity is a choice rather than a forced cutoff.
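One way to keep the two budgets honest is to derive the ceiling from its parts instead of picking a single number. The sketch below is an assumption about how you might structure that: budgetFor, answerTokens, and the 25% slack are hypothetical choices, while maxOutputTokens and thinkingConfig.thinkingBudget are the same config fields used earlier.

```typescript
interface GenerationBudget {
  maxOutputTokens: number;
  thinkingConfig: { thinkingBudget: number };
}

// Hypothetical helper: size the ceiling as
// (thinking budget + answer allowance) plus some slack.
function budgetFor(
  thinkingBudget: number,
  answerTokens: number,
  slack = 0.25, // assumed safety margin, tune for your workload
): GenerationBudget {
  const ceiling = Math.ceil((thinkingBudget + answerTokens) * (1 + slack));
  return { maxOutputTokens: ceiling, thinkingConfig: { thinkingBudget } };
}

// A reasoning task: up to 4096 thinking tokens and a roughly 1024-token answer.
console.log(budgetFor(4096, 1024).maxOutputTokens); // 6400
```

The returned object can be spread into the config of a generateContentStream call. Brevity still belongs in the system prompt; this only guarantees that a brief answer has room to finish.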

Never show a half-finished reply

Configuration fixes the common case, but a resilient product also handles the moment a response does get cut off, because it eventually will. Streaming makes this trickier, because by the time you discover the MAX_TOKENS finish reason, you may have already shown the user most of a sentence that has no ending.

Two defenses matter. First, check the finish reason on the final chunk. If it is MAX_TOKENS, you know the answer is incomplete and can append a gentle continuation affordance or quietly request a continuation, rather than leaving the user staring at a dangling clause. Like error messages that recover trust instead of losing customers, that recovery turns a rough edge into something graceful. Second, sanitize the rendered text. Model output is markdown, and a reply that got cut off mid-format leaves orphan markers: a lone ** with no closing pair, an unterminated list item. Strip those before rendering so no raw, broken markdown ever reaches the screen. The same care applies to anything the model emits that your code consumes downstream, which is why you force LLM output into schemas your code can trust before acting on it. A reply that ends a sentence early is unfortunate; one that also shows literal asterisks looks unfinished in a second, more visible way.
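Both defenses can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a complete implementation: the StreamChunk shape is a minimal stand-in for the SDK's response chunks (it assumes the finish reason arrives at candidates[0].finishReason), collectReply and stripOrphanMarkdown are hypothetical names, and the sanitizer only handles unpaired **, unpaired backticks, and a trailing empty list item.

```typescript
// Minimal stand-in for a streamed response chunk (assumed shape).
interface StreamChunk {
  text?: string;
  candidates?: { finishReason?: string }[];
}

interface CollectedReply {
  text: string;
  truncated: boolean;
}

// Removes the last occurrence of a marker when it appears an odd number of times.
function dropUnpaired(text: string, marker: string): string {
  const count = text.split(marker).length - 1;
  if (count % 2 === 0) return text;
  const i = text.lastIndexOf(marker);
  return text.slice(0, i) + text.slice(i + marker.length);
}

// Strips orphan markers that a mid-format cutoff leaves behind.
function stripOrphanMarkdown(text: string): string {
  let out = text.replace(/\s+$/, "");
  // A final line holding only a bullet or number marker is an empty list item.
  out = out.replace(/\n[ \t]*(?:[-*+]|\d+\.)[ \t]*$/, "");
  out = dropUnpaired(out, "**");
  out = dropUnpaired(out, "`");
  return out.replace(/\s+$/, "");
}

// Accumulates the stream and reports whether it ended on MAX_TOKENS.
async function collectReply(
  stream: AsyncIterable<StreamChunk>,
): Promise<CollectedReply> {
  let text = "";
  let finishReason: string | undefined;
  for await (const chunk of stream) {
    text += chunk.text ?? "";
    finishReason = chunk.candidates?.[0]?.finishReason ?? finishReason;
  }
  const truncated = finishReason === "MAX_TOKENS";
  return { text: truncated ? stripOrphanMarkdown(text) : text, truncated };
}
```

In a live UI you would render chunks as they arrive instead of collecting them first; the same check and sanitizer then run on the accumulated text once the final chunk lands, and the truncated flag decides whether to show a continuation affordance.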

Render the markdown, always

This is adjacent but it is the other half of "the answer looks broken." Model output is markdown: bold with **, bullets, numbered lists, links, inline code. If you render it raw, the user sees literal **word** and * item and the product reads as half-built even when the answer itself is perfect. Every surface where model text lands needs to render that markdown: chat UIs, transcripts, emails, notifications, PDFs.

On a React surface, that means a markdown renderer with components styled to your design system. On a server-rendered or email surface where there is no React to do it, convert the markdown to inline-styled HTML with a small safe renderer that escapes the text first and then applies formatting, so the source can never inject markup. The principle is the same everywhere: model output is markdown, so treat it as markdown on the way to the screen.
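For the server-rendered or email case, the escape-first order is the part that matters. The sketch below is a deliberately small illustration of that order, assuming you only need paragraphs, bullets, bold, and inline code. The function names and inline styles are hypothetical, and a production surface would more likely use a maintained markdown library with sanitization.

```typescript
// Escape first, so nothing in the model output can become markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Inline formatting runs on already-escaped text.
function renderInline(markdown: string): string {
  return escapeHtml(markdown)
    .replace(/`([^`]+)`/g, '<code style="font-family:monospace">$1</code>')
    .replace(/\*\*([^*]+)\*\*/g, '<strong style="font-weight:600">$1</strong>');
}

// Block level: bullet lines become a list, other non-empty lines become paragraphs.
function renderMarkdownToHtml(markdown: string): string {
  const html: string[] = [];
  let inList = false;
  for (const line of markdown.split("\n")) {
    const bullet = /^\s*[-*]\s+(.*)$/.exec(line);
    if (bullet) {
      if (!inList) {
        html.push('<ul style="margin:0 0 12px 20px">');
        inList = true;
      }
      html.push(`<li>${renderInline(bullet[1] ?? "")}</li>`);
      continue;
    }
    if (inList) {
      html.push("</ul>");
      inList = false;
    }
    if (line.trim() !== "") {
      html.push(`<p style="margin:0 0 12px">${renderInline(line)}</p>`);
    }
  }
  if (inList) html.push("</ul>");
  return html.join("\n");
}
```

Because every tag in the output is written by the renderer after escaping, a reply containing something like a script tag comes out as visible text rather than executable markup.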

Backpressure: keep the stream smooth under load

The last piece is what happens when many users stream at once. A naive implementation forwards every token to the client the instant it arrives, which is fine for one user and fragile for a thousand. If a client reads slowly, tokens pile up in an unbounded server-side buffer, and across enough slow clients that growth turns into memory pressure on the whole process.

The fix is to respect backpressure. Use the streaming primitives that pause production when the consumer is not keeping up, rather than buffering everything in memory and hoping the client drains it. When you pipe the model stream to the response, let the transport's flow control do its job: stop pulling from the model when the downstream write buffer is full, resume when it drains. This keeps memory bounded as concurrent streams grow, and it is the same discipline that keeps any high-fan-out realtime system stable under backpressure as it scales from ten connections to ten thousand.
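On Node.js, one way to express this is to lean on the writable stream's own signals: write() returns false when its buffer is full, and the "drain" event fires when it has emptied. The sketch below assumes a Node.js server and a token source exposed as an async iterable of strings; pipeWithBackpressure is a hypothetical name.

```typescript
import { once } from "node:events";
import { Writable } from "node:stream";

// Forwards tokens to the client, pausing whenever the client falls behind.
async function pipeWithBackpressure(
  source: AsyncIterable<string>,
  sink: Writable,
): Promise<void> {
  for await (const piece of source) {
    // write() returns false once the sink's internal buffer is full.
    if (!sink.write(piece)) {
      // Wait here; the loop does not pull the next token until "drain" fires.
      await once(sink, "drain");
    }
  }
  sink.end();
}
```

Because the for await loop only asks the source for the next token after the previous write has been accepted, a slow client slows the pull from the model instead of growing a buffer. An HTTP response object is a Writable, so it can be passed as the sink. This sketch does not handle a client that disconnects mid-stream; a fuller version would also listen for "close" and stop the upstream request.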

Complete, fast, and never broken

Put together, the recipe is short. Decide whether the task needs thinking; turn it off for simple work, give it room for hard work. Set maxOutputTokens with enough headroom that the visible answer always finishes. Check the finish reason and sanitize truncated markdown so a cutoff never reaches the user as broken text. Render markdown on every surface. Respect backpressure so the stream stays smooth under load.

None of these are exotic. They are the difference between an assistant that feels finished and one that quietly embarrasses you mid-sentence in front of a customer. The same observability that helps you see exactly what your agent did when it goes off the rails is what surfaces a creeping MAX_TOKENS rate before users complain. Getting the AI surface to feel trustworthy is a meaningful part of the AI products we build and the broader AI and automation work we ship, where a streamed answer that stops halfway is not a rough edge, it is the whole experience falling apart. Build the budget right and the answer simply finishes, every time.
