How to Hunt Down N Plus One Queries Before They Melt Production

Find and fix the silent query fan-out that runs fine at ten rows and melts at five hundred, with eager loading and keyset pagination.

Derrick S. K. SiaworNovember 19, 20249 min read

MacBook Pro showing a code editor lit by ambient blue light — Photo · Nemuel Sereti / Pexels

The query that takes your app down was passing every test. It ran fine in development, fine in code review, fine in the demo. Then it shipped, real data filled the tables, and a page that loaded in 80 milliseconds with ten rows started taking eight seconds with five hundred. Nobody changed the code. The data changed, and the code was hiding a fan-out that only hurts at scale.

An N plus one query is when you fetch a list with one query, then fire one more query per row to load a related record, turning a single page load into hundreds of round trips that only hurt once real data fills the tables. You kill it by eager-loading the relation in a single query (a join or batched lookup) instead of lazy-loading it inside the loop, and by catching it early with query logging or tracing.

This is the N plus one query problem, and it is the most common reason a feature that looked fast in development melts under production load. It is silent because it does not throw an error, it does not fail a test on a small dataset, and it does not look wrong in the code. It just quietly fires one query, then one more query per row of the result, so what reads like a single operation is actually hundreds of round trips to the database. Here is how to find it before it finds you, and how to fix it so it stays fixed.

What the fan-out actually is

Comparison of N plus one query fan-out versus eager loading the relation in one query

The pattern hides behind an ORM's convenience. You fetch a list of records, then loop over them and access a related record on each one. The ORM, being helpful, lazy-loads that relation, which means it runs a separate query the moment you touch it. One query to get the list (that is the "one"), then one query per item in the list to load its relation (that is the "N"). Fetch 500 orders and access each order's customer, and you have fired 501 queries to render one page.

const orders = await db.orders.findMany();        // 1 query
for (const order of orders) {
  const customer = await db.customer(order.id);   // N queries, one per order
}

On ten orders, eleven queries, fast enough that nobody notices. On five hundred orders, five hundred and one queries, and the page falls over. The code is identical. Only the row count changed, which is exactly why this passes every check that runs against a small dataset and only surfaces when real data arrives. It does its worst damage hidden inside a server component fetch where it quietly undoes the parallelism you fought for. The cost is linear in the number of rows, so it degrades smoothly until it does not.

Detect it before production, then keep watching in production

You cannot fix what you cannot see, and the reason N plus one survives so long is that nobody is counting queries. The fix starts with making the query count visible.

In development

Turn on query logging and look at how many queries a page actually fires. Most ORMs and frameworks have a tool for this: a query log, a request profiler, a development panel that shows queries per request. The instant you can see "this page ran 501 queries," the problem is obvious, and the moment you see "this page runs one query per row," you know exactly where the fan-out is. The detection is trivial once you are looking. The trap is that nobody looks until something is slow.

Better still, make it impossible to miss. Add a query-counter assertion to your tests that fails when a request fires more queries than it should. A test that says "rendering the orders page must run at most three queries" will fail the moment someone introduces an N plus one, in code review, on a small dataset, long before production. This turns a problem that only shows up at scale into one your test suite catches at any scale, which is the only reliable way to keep it from coming back.

In production

Some N plus one problems only emerge under real traffic and real data distributions that your tests do not replicate. Application performance monitoring with query tracing is your safety net here. APM tools automatically flag N plus one patterns in production, showing you the request that fired hundreds of similar queries and the line of code that caused it. This is the same instrumentation-first instinct behind instrumenting your app so you find the root cause in minutes not hours: you cannot reason about a performance problem you cannot observe, which is why query tracing is part of how we build and run web applications rather than something added after the first slow incident.

The combination is what keeps you safe: query logs and a counter assertion to catch it before it ships, APM to catch the cases that only appear under production load.

The fix is eager loading

Once you have found the fan-out, the fix is usually a one-line change to how you fetch. Instead of loading each relation on demand inside the loop, tell the ORM to fetch all the related data up front, in one query or a small fixed number of queries, before you iterate.

const orders = await db.orders.findMany({ include: { customer: true } }); // 1 or 2 queries total
for (const order of orders) {
  const customer = order.customer; // already loaded, no query
}

This is eager loading, and every modern ORM has it: include in some, includes in Rails, Include in Entity Framework, joins or batch loads under the hood. It collapses the 501 queries back into one or two, regardless of how many rows you fetch, because the related data comes back with the original query rather than in a separate trip per row. Of course, fewer queries only help if the database can actually serve them, so keep the connection pool healthy too, which in a hot-reloading dev runtime means stopping Next.js reloads from exhausting your MySQL connections. The page that took eight seconds at five hundred rows now takes the same time it took at ten, because the query count no longer grows with the data.

The discipline is to default to eager loading any relation you know you are going to access in a loop. Lazy loading is fine when you touch a relation occasionally on a single record. It is a trap the moment you touch it inside an iteration over a list, because that is precisely when it fans out.

Bound the result set too, or you trade one problem for another

Eager loading fixes the query count, but it does not fix an unbounded result. If you eager-load the customers for every order in a table that grows without limit, you have one efficient query that still tries to pull a million rows into memory. The fan-out is gone, the scale problem remains.

This is where pagination matters, and specifically the kind you choose. The familiar approach, OFFSET and LIMIT, works fine for the first few pages and gets progressively slower as the offset grows, because the database has to scan and discard every row before the offset to find your page. Page five hundred has to skip half a million rows just to return twenty. On a large table that scan becomes its own performance problem.

Keyset pagination, also called cursor pagination, avoids this. Instead of "skip 10,000 rows and give me the next 20," you say "give me 20 rows after this specific value," using an indexed column as the cursor. The database jumps straight to the cursor position via the index and reads forward, so page five hundred costs the same as page one. For large datasets and infinite-scroll feeds, keyset pagination is the approach that holds up, because its cost does not grow with how deep into the data you go. The same query you are eager-loading should also be bounded the way you would hunt down N plus one queries quietly slowing your API at the latency level.

Combine the two and you have a query path that scales: eager load so the query count is constant regardless of rows, and keyset paginate so the result size and scan cost are bounded regardless of table size. Each fix addresses a different axis of the same scaling problem, and you need both.

Test against realistic data, because that is where it lives

The throughline of this whole problem is that it is invisible on small data. So the most important habit is to stop testing exclusively on small data. Seed your development and staging environments with realistic volumes, hundreds or thousands of rows where production will have them, and run your pages against that. The N plus one that hides at ten rows is obvious at five hundred, and a query-count assertion makes it obvious at any size.

This is the same principle that runs through serious verification work: the happy path on a tiny dataset proves almost nothing, and the bugs that matter live in the conditions that resemble production. We design and review database access paths with the scale question first, asking what happens at 10 times, 100 times, 1,000 times the rows, because a query that is fine at ten and falls over at five hundred is a bug, not a future optimization. This scale-first discipline is part of the security audits and build reviews we run for clients before a feature ships.

The short version

The N plus one query is the silent killer because it passes every test on small data and only melts production when real data arrives. Find it by counting queries: turn on query logging in development, add a query-count assertion to your tests so it fails the moment someone introduces it, and run APM with query tracing in production as the safety net. Fix it with eager loading, which collapses one-query-per-row back into a single fetch. Bound the result with keyset pagination so deep pages cost the same as shallow ones. And test against realistic data volumes, because the whole reason this bug survives is that nobody fed it enough rows to bite.

database performance orm scalability

All of the Journal