When Fine-Tuning Beats Prompting and When It Burns Money

Prompt first, retrieve for knowledge, fine-tune for behavior: a practical framework that avoids burning money on the wrong tool.

Derrick S. K. Siawor · April 24, 2025 · 7 min read


A team decides their AI feature is not good enough, and the conversation immediately jumps to fine-tuning. They start budgeting GPU time, planning a labeling effort, and talking about training runs, all before anyone has checked whether a better prompt would have solved the problem in an afternoon. This is one of the most common and expensive mistakes in applied AI: reaching for the heaviest tool first. Fine-tuning is powerful, but it is the last option you should consider, not the first, and most of the time the cheaper options win.

There are three ways to improve an AI system's output: prompting, retrieval, and fine-tuning. Each solves a genuinely different problem, and choosing the wrong one wastes money and time while not actually fixing what is broken. The good news is that the decision is not subtle once you frame it correctly. The framework that the strongest teams use comes down to one sentence: prompt first, retrieve for knowledge, fine-tune for behavior.

Prompting: start here, always

Most "the model is not good enough" problems are prompting problems wearing a costume. The model is capable; it just was not told clearly enough what you want. Before anything else, you write a better prompt: clearer instructions, a few examples of the input and the ideal output, an explicit format, and the constraints that matter. This is free, it takes minutes, and it solves a surprising share of problems outright.
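As a rough illustration of those four ingredients, here is a minimal sketch of a prompt builder. The function name `build_prompt`, the section labels, and the support-ticket task are all hypothetical, chosen only for the example; the layout is one reasonable convention, not a required one.

```python
def build_prompt(instructions, examples, output_format, constraints, user_input):
    """Assemble a prompt from instructions, few-shot examples, a format, and constraints."""
    parts = ["Instructions:", instructions.strip(), "", "Examples:"]
    # Each example pairs an input with the ideal output for that input.
    for example_input, ideal_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {ideal_output}")
        parts.append("")
    parts.extend(["Output format:", output_format.strip(), "", "Constraints:"])
    parts.extend(f"- {constraint}" for constraint in constraints)
    # End with the real input and an open "Output:" slot for the model to fill.
    parts.extend(["", f"Input: {user_input}", "Output:"])
    return "\n".join(parts)


# Hypothetical task: routing support tickets.
prompt = build_prompt(
    instructions="Classify the support ticket as billing, bug, or other.",
    examples=[
        ("I was charged twice this month.", "billing"),
        ("The export button crashes the app.", "bug"),
    ],
    output_format="Reply with exactly one lowercase word.",
    constraints=["Do not explain your answer.", "If unsure, reply other."],
    user_input="My invoice shows the wrong amount.",
)
print(prompt)
```

Keeping the pieces as separate arguments makes it cheap to change one of them at a time and see which change actually moved the output.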

Modern long-context models make prompting stronger than people assume. For a knowledge base under roughly 200,000 tokens, you can often put the whole thing in the context window, combine it with prompt caching, and get something faster and cheaper than building a retrieval pipeline. If your "we need the model to know our docs" problem fits in the context window, you may not need any infrastructure at all, just a well-constructed prompt and the documents pasted in.
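A quick way to test whether your documents fit is to estimate their token count against the budget. The sketch below is a back-of-the-envelope check only: the four-characters-per-token ratio is a rough heuristic for English text, not a real tokenizer, and the `reserve_tokens` allowance for instructions and the answer is an assumed figure. The 200,000 default mirrors the rough threshold mentioned above.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate. Assumption: about 4 characters per token."""
    return int(len(text) / chars_per_token)


def fits_in_context(documents, budget_tokens=200_000, reserve_tokens=8_000):
    """Return (fits, estimated_total) for a list of document strings.

    reserve_tokens is a hypothetical allowance for the instructions,
    the question, and the model's answer.
    """
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserve_tokens <= budget_tokens, total


# Ten documents of 4,000 characters each: small enough to paste in whole.
small_docs = ["a" * 4_000] * 10
small_fits, small_total = fits_in_context(small_docs)

# One document of a million characters: over budget, so retrieval is worth a look.
large_docs = ["a" * 1_000_000]
large_fits, large_total = fits_in_context(large_docs)
```

For a real decision, swap the heuristic for the tokenizer of the model you actually use, since ratios vary by language and content.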

The rule: never move past prompting until you have genuinely exhausted it. A reasoned prompt with good examples is the baseline every other option has to beat.

Retrieval: when the model does not know your stuff

The clearest signal that you need retrieval rather than fine-tuning is this: the model does not know X. X might be your product catalog, your internal documentation, a customer's account history, or anything else that is specific to your domain or changes over time. The model was not trained on it and cannot be, because it is your data and it keeps updating.

The instinct to fine-tune here is wrong, and it is worth being precise about why. Fine-tuning changes how a model behaves, not what facts it has at its fingertips. Trying to cram knowledge into a model by training is unreliable, and the result goes stale the moment the data changes. Retrieval, usually in the form of retrieval-augmented generation (RAG), solves the actual problem: you fetch the relevant facts at query time and put them in the prompt, so the model answers from current, specific information. That is also why the quality of your retrieval layer quietly decides the quality of your answers. Once the facts are in the prompt, you still want to force the model's output into a schema your code can trust before acting on it. When the data changes, retrieval just fetches the new version; nothing needs retraining. This is why retrieval dominates in practice. Enterprise surveys consistently show retrieval used in production far more than fine-tuning, because "the model does not know X" is the most common real problem and retrieval is the right answer to it.
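To make the mechanics concrete, here is a deliberately tiny sketch of both halves: fetching facts at query time and checking the reply against a schema. Everything in it is an assumption for illustration. The keyword-overlap scoring stands in for a real search or embedding index, the catalog entries are invented, and the `answer` and `sources` fields are a hypothetical schema.

```python
import json
import re


def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=2):
    """Rank documents by word overlap with the query and keep the best top_k."""
    query_words = tokenize(query)
    scored = []
    for doc in documents:
        overlap = len(query_words & tokenize(doc))
        if overlap:  # skip documents that share no words with the query
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_grounded_prompt(query, documents):
    """Put the retrieved facts in the prompt so the model answers from them."""
    facts = "\n".join(f"- {fact}" for fact in retrieve(query, documents))
    return f"Answer using only these facts:\n{facts}\n\nQuestion: {query}\nAnswer:"


def parse_answer(raw_reply):
    """Accept a model reply only if it matches the (hypothetical) expected shape."""
    data = json.loads(raw_reply)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if not isinstance(data.get("answer"), str):
        raise ValueError("missing string field 'answer'")
    if not isinstance(data.get("sources"), list):
        raise ValueError("missing list field 'sources'")
    return data


# Invented catalog data. Updating a line here changes the next answer
# without any retraining.
CATALOG = [
    "The Model A kettle costs 39 dollars and holds 1.7 litres.",
    "The Model B toaster costs 25 dollars and has four slots.",
    "Returns are accepted within 30 days of delivery.",
]
question = "How much does the Model A kettle cost?"
hits = retrieve(question, CATALOG)
grounded_prompt = build_grounded_prompt(question, CATALOG)
```

In production the scoring function is where most of the effort goes, but the shape stays the same: retrieve, place the facts in the prompt, validate what comes back.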

So if your problem is a knowledge gap, the decision is made: retrieve, do not fine-tune.

Fine-tuning: when you need to change behavior

There is a real place for fine-tuning, and it is narrower than the hype suggests. Fine-tuning is the right tool when the model's intrinsic behavior has to change, reliably and persistently, in a way that prompting cannot produce. The clearest case is a classification or structured task that consistently fails with prompt engineering no matter how you phrase it. If you have tried strong prompts with examples and the model still cannot hit the accuracy or consistency your task demands, that is the signal that the behavior itself needs to be trained in, not instructed.

Other genuine fine-tuning cases: a very specific output format or style the model will not hold reliably under prompting, a domain-specific reasoning pattern, or a high-volume narrow task where you want a small, fast, cheap model to match a large one. That last case is real and underused: a small fine-tuned model can match or beat a frontier model on a narrow task while costing a fraction to run, which is its own article. But notice that all of these are behavior problems, not knowledge problems. That is the line.

The cost is lower than people fear, but the discipline is higher

Part of why this decision gets distorted is a wrong assumption about cost. People imagine fine-tuning as a massive, expensive endeavor. In practice, parameter-efficient fine-tuning with techniques like LoRA or QLoRA on a 7B to 13B parameter open model typically costs somewhere in the low hundreds of dollars in compute for a training run, finishing in a handful of hours on a single GPU. The compute is not the expensive part. Where it does pay off, a tuned small model also tends to be cheaper to run, one of the surest ways to cut your LLM bill without touching answer quality.
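One reason the compute stays small is that LoRA trains only a pair of low-rank matrices per adapted weight instead of the weight itself. For a weight of shape d_out by d_in and rank r, that is r × (d_in + d_out) trainable parameters. The sketch below works through the arithmetic; the dimensions (hidden size 4096, 32 layers, two adapted matrices per layer, rank 8, 7 billion base parameters) are illustrative assumptions roughly shaped like a 7B model, not figures from any specific model card.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative assumptions, not measurements.
HIDDEN = 4096                # width of each adapted square projection matrix
LAYERS = 32                  # number of transformer layers
ADAPTED_PER_LAYER = 2        # e.g. two attention projections per layer
RANK = 8                     # LoRA rank
BASE_PARAMS = 7_000_000_000  # size of the frozen base model

per_matrix = lora_trainable_params(HIDDEN, HIDDEN, RANK)
total_trainable = per_matrix * ADAPTED_PER_LAYER * LAYERS
fraction = total_trainable / BASE_PARAMS

print(f"trainable parameters: {total_trainable:,}")
print(f"share of base model:  {fraction:.4%}")
```

Under these assumptions only a few million parameters are trained, well under a tenth of a percent of the base model, which is why a single GPU is often enough.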

The expensive part is the data and the evaluation. Fine-tuning is only as good as its training set, and curating a clean, production-representative dataset is real work that takes real time. Here is the discipline that separates teams who succeed from teams who waste a quarter: build your evaluation harness before the first training run, not after. Before you start training, you need a way to measure whether fine-tuning actually improves the task. Without it you cannot tell whether the run worked, and you will burn cycles tuning blind. The sequence the best teams follow is: confirm fine-tuning is genuinely the right tool, build the eval, curate the data, then train. Skipping straight to training is how the budget disappears.
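An evaluation harness does not need to be elaborate to be useful. The sketch below is a minimal version under stated assumptions: the task is single-label classification, the metric is exact-match accuracy, and `baseline_predict` is a hypothetical stand-in for your current prompted model. The labeled examples are invented.

```python
def evaluate(predict, labeled_examples):
    """Score a predict(text) -> label function against (text, expected) pairs."""
    if not labeled_examples:
        raise ValueError("the evaluation set is empty")
    failures = []
    for text, expected in labeled_examples:
        got = predict(text)
        if got != expected:
            failures.append((text, expected, got))
    accuracy = 1 - len(failures) / len(labeled_examples)
    return accuracy, failures


def baseline_predict(text):
    """Hypothetical stand-in for the prompted model you have today."""
    lowered = text.lower()
    if "charge" in lowered or "invoice" in lowered:
        return "billing"
    return "other"


# Invented examples. A real set should be sampled from production traffic.
EVAL_SET = [
    ("I was charged twice.", "billing"),
    ("My invoice is wrong.", "billing"),
    ("The app crashes on export.", "bug"),
    ("How do I change my avatar?", "other"),
]

accuracy, failures = evaluate(baseline_predict, EVAL_SET)
print(f"baseline accuracy: {accuracy:.2f}")
for text, expected, got in failures:
    print(f"MISS: {text!r} expected {expected}, got {got}")
```

Run the same `evaluate` call on the prompted baseline and on every tuned candidate. If the tuned model does not beat the baseline on this set, the training run did not pay for itself.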

The decision in one pass

Run any AI quality problem through this filter, in order:

[Figure: decision tree. Prompt first, retrieve for knowledge gaps, fine-tune for behavior changes.]

1. Is the output wrong because the instructions were unclear or under-specified? Fix the prompt. Stop here if it works.
2. Does the model lack specific or current information it needs to answer? Use retrieval. The data lives outside the model and gets fetched at query time.
3. Does the model's behavior itself need to change even under strong prompting: a classification it keeps getting wrong, a format it will not hold, a style it cannot maintain? Now consider fine-tuning, and build your evaluation before you train.

Most problems resolve at the first or second step. Fine-tuning is reserved for the cases where you genuinely need to alter what the model does, and you have confirmed the cheaper tools cannot get there.
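The filter above can be written down as a short function, which makes the ordering explicit. This is a sketch only: the `Tool` enum, the three boolean inputs, and the fallback to prompting when none of the questions applies are assumptions made for the example, and answering the three questions honestly is the part that takes judgment.

```python
from enum import Enum


class Tool(Enum):
    PROMPT = "fix the prompt"
    RETRIEVAL = "add retrieval"
    FINE_TUNE = "consider fine-tuning, and build the eval first"


def choose_tool(instructions_unclear, lacks_information, behavior_fails_under_strong_prompts):
    """Apply the three questions in order and return the first tool that matches."""
    if instructions_unclear:
        return Tool.PROMPT
    if lacks_information:
        return Tool.RETRIEVAL
    if behavior_fails_under_strong_prompts:
        return Tool.FINE_TUNE
    # Assumption: with no clear diagnosis, stay with the cheapest tool.
    return Tool.PROMPT


# A knowledge gap routes to retrieval even if the behavior also looks off,
# because the earlier question wins.
decision = choose_tool(
    instructions_unclear=False,
    lacks_information=True,
    behavior_fails_under_strong_prompts=True,
)
print(decision.value)
```

The early returns encode the point of the framework: a later, more expensive tool is only reached when every cheaper question has been answered with a no.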

Choosing right is the whole game

The reason this framework matters is that the wrong choice is invisibly expensive. A team that fine-tunes to fix a knowledge gap ends up with a model that is wrong in a new way and stale within a week, after spending weeks they could have spent shipping retrieval. A team that keeps prompting at a problem that genuinely needs trained behavior plateaus and blames the model. Matching the tool to the actual problem is most of what separates AI features that work from AI features that drain money.

This is also the heart of the build-versus-buy call on AI. Knowing which lever actually moves your specific task is what we bring to the AI products we build, where the difference between a thoughtful prompt and an unnecessary training pipeline is the difference between shipping in a week and burning a quarter. Prompt first, retrieve for knowledge, fine-tune for behavior, and build the eval before you train. Get the order right and the expensive option is the one you rarely need.

