How Do You Test Code That's Different Every Time? I Added RAG to My Game and Had to Find Out

My game's in-chat help used to be a vending machine. The assistant had a gameHelp tool with seven hand-curated topics — gaze, breeding, gadgets, a few others — and if a player asked about something I hadn't thought to write up, it had nothing. Every new mechanic meant another hand-authored answer, and the help was only ever as complete as my patience.

I wanted to replace that fixed list with retrieval over the project's actual design docs: about two dozen markdown files covering brain mechanics, biochemistry, breeding, droughts, dialects, the lot. Ask anything the docs cover, get an answer grounded in them, and have it stay current as the docs change. That's retrieval-augmented generation (RAG), and building it was the easy part. The part that made me stop and think was a question I hadn't had to ask in seventeen years of writing software: how do you test something whose output is different every time and still correct?

The thing nobody warns you about

I come from deterministic web development. You write a function, you assert the output equals the expected value, the test passes or it doesn't. Correctness is equality. That model is so ingrained it's invisible — until you put semantic search and a language model in the middle of your system, and it quietly stops working.

When a player asks "why won't my creatures have babies?", the retrieval step doesn't return the answer. It returns a ranked set of document chunks — the docs sliced into sections of a few paragraphs each — that are probably relevant, scored by how close their meaning sits to the question. The model then phrases a reply from them, differently each time. There's no single correct output to assert against. Run the same query twice and you might get different wording, a different chunk order, a different but equally valid answer. assert result === expected has nothing to bite on.

So what does "this works" even mean? And the more practical version, the one that actually keeps you up: how do I know that a change I make next week — a different chunk size, a new embedding model, an extra doc — hasn't quietly made retrieval worse? With deterministic code, a regression fails a test. Here, a regression just means the answers get a bit less relevant, silently, and you'd never notice until a player did.

This is the problem evals solve, and understanding why they exist mattered more to me than any line of the implementation.

Evals: scoring instead of asserting

The shift is small to state and large in consequence. You stop asking "does the output equal the expected value?" and start asking "how good is the output, measured against something?" You score it, and the score gates everything: whether you can ship, whether you've regressed, whether a change helped or hurt.

Concretely, for retrieval, I built a labelled set of queries, phrased the way a player might ask rather than the way the docs are written, and for each one recorded which document chunks should be among the results. Then the metric: Recall@3, which I define as the fraction of queries where at least one genuinely correct chunk lands in the top three results (strictly speaking a hit rate, since one correct chunk is enough). Run the set, get a score between 0 and 1, compare it to a threshold. Below the threshold, you don't ship. The number going down between two runs means you've regressed, and now you can see it.
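A minimal sketch of that scoring loop, to make the metric concrete. It is not the game's actual harness: the EvalCase shape, the Retriever signature, and the function names are all assumptions made for illustration, and the retriever is synchronous here only to keep the example short.

```typescript
// One labelled eval case: a player-style question plus every chunk id
// that would count as a correct retrieval for it (hypothetical shape).
interface EvalCase {
  query: string;
  acceptableChunkIds: string[];
}

// Hypothetical retriever signature: returns chunk ids, best match first.
type Retriever = (query: string) => string[];

// A query is a hit if any acceptable chunk appears in the top k results.
function isHit(retrieved: string[], acceptable: string[], k: number): boolean {
  const topK = retrieved.slice(0, k);
  return acceptable.some((id) => topK.indexOf(id) !== -1);
}

// Fraction of cases that hit, between 0 and 1.
function recallAtK(cases: EvalCase[], retrieve: Retriever, k = 3): number {
  if (cases.length === 0) {
    throw new Error("eval set is empty");
  }
  const hits = cases.filter((c) =>
    isHit(retrieve(c.query), c.acceptableChunkIds, k)
  ).length;
  return hits / cases.length;
}

// Gate: below the threshold, the change does not ship.
function passesGate(score: number, threshold: number): boolean {
  return score >= threshold;
}
```

Letting each case carry a list of acceptable chunks, rather than a single id, is what allows the answer key to say "either of these sections answers the question."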

That's the whole idea, and it's worth sitting with because it generalises far beyond RAG. Any system with a language model in it — anything non-deterministic, anything where "correct" is a judgment rather than an equality — needs this. The unit test gives way to the eval. The assertion gives way to the score and the threshold. For a company shipping AI features, the eval harness is the thing standing between "we improved the prompt" and "we improved the prompt and can prove we didn't break anything." It's not optional infrastructure; it's how you have any idea whether the thing works at all.

What the eval actually caught (mostly me)

The first run scored 0.50, and my instinct was that retrieval was broken. Reading the failures one by one told a different story: most weren't retrieval errors, they were my errors. My answer key — the list of which document sections each question should surface — was wrong in places, and in others too strict: I'd marked a single section as the only correct answer when a broader one answered the question just as well. Fixing the answer key took the score to 0.95. That's the part worth knowing about evals: they don't just measure the system, they force you to define what "correct" even means, and defining it is where most of your sloppiness surfaces. The number was never the point; the discipline of pinning it down was.

I kept one genuine miss in the set on purpose, scored as a failure: "what's inside a creature's genome?" retrieves a breeding chunk rather than the genetics chunk I'd want. It's a real ranking weakness, and leaving it in as a deliberate red mark means if a future change accidentally fixes it — or makes it worse — I'll see the score move. An eval with no failures left in it has stopped being able to teach you anything.
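One way to watch a deliberate red mark like that is to compare per-query results between runs as well as the aggregate score, since one query getting fixed and another breaking would leave the total unchanged. The sketch below assumes each run is stored as a list of query and hit pairs; the CaseResult shape and the diffRuns name are hypothetical.

```typescript
// Outcome of one eval query in one run (hypothetical shape).
interface CaseResult {
  query: string;
  hit: boolean;
}

// List every query whose hit/miss status changed between two runs.
function diffRuns(
  previous: CaseResult[],
  current: CaseResult[]
): { fixed: string[]; broken: string[] } {
  const before = new Map<string, boolean>();
  for (const r of previous) {
    before.set(r.query, r.hit);
  }

  const fixed: string[] = [];
  const broken: string[] = [];
  for (const r of current) {
    const was = before.get(r.query);
    if (was === undefined) continue; // new query, nothing to compare against
    if (!was && r.hit) fixed.push(r.query);
    if (was && !r.hit) broken.push(r.query);
  }
  return { fixed, broken };
}
```

A known miss showing up under fixed is worth a look too: it may be a real improvement, or a sign that the ranking shifted in a way that hurt something else.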

The query set: a tradeoff, not a virtue

If there's a decision in this I'd flag, it's how the query set was built — because it's a tradeoff, not a clean win. I didn't hand-write the questions. I judged that queries I invented on the spot would be narrower and less varied than the range of things a real player might actually ask — so I had an AI generate a baseline set instead, aiming for breadth I wouldn't reach by hand.

That comes with an obvious risk, and it's worth being honest about it: AI-generated eval queries can end up sounding like the documentation, because the model writing them has the same source material the retrieval is searching. If your test questions echo your docs, you're measuring vocabulary overlap, not whether a confused human can find an answer, and your score is a comfortable lie. The mitigation is real usage: the set grows over time, and the additions come from actual questions surfaced during play that retrieved badly. Those are the queries no model would think to generate, because they come from someone who doesn't know the docs exist. The synthetic set gets it off the ground; the real questions are what keep it honest.

It's also a small set, a couple of dozen queries. That is enough to catch gross regressions and obvious failures, but not enough to claim statistical confidence; growing it with real usage is how it earns that over time.

The stack, and the decisions that mattered

A few infrastructure choices were more interesting than the RAG itself, mostly because the system runs on Cloudflare Workers and the edge has sharp edges.

The most fiddly one: where to put the vectors. The obvious answer for a Workers app is Cloudflare's own Vectorize: same vendor, one binding, done. I went with Neon Postgres and the pgvector extension instead, for two reasons. The first is portability: pgvector is one of the most widely deployed vector stores there is, and the schema moves to RDS, Aurora, or a plain Postgres box without a rewrite, whereas Vectorize's API ties you to Cloudflare and is eventually consistent on upserts in a way I didn't want to reason about. The second is that it's still ordinary PostgreSQL: the vectors live next to everything else SQL can do, queried with an HNSW index for fast approximate nearest-neighbour search. No lock-in, standard tooling, portable knowledge.
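For a sense of what "ordinary PostgreSQL" looks like here, a sketch of a pgvector schema and nearest-neighbour query, held as SQL strings in TypeScript. The table name, column names, and the 384 dimension are assumptions for illustration (the real dimension depends on the embedding model), and cosine distance is one of several operators pgvector offers, not necessarily the one the game uses.

```typescript
// Hypothetical schema: table and column names are illustrative, and the
// vector dimension (384) depends on whichever embedding model is used.
const SCHEMA_SQL = `
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS doc_chunks (
  id        text PRIMARY KEY,
  source    text NOT NULL,
  content   text NOT NULL,
  embedding vector(384) NOT NULL
);

CREATE INDEX IF NOT EXISTS doc_chunks_embedding_idx
  ON doc_chunks USING hnsw (embedding vector_cosine_ops);
`;

// <=> is pgvector's cosine-distance operator; ordering by it ascending
// with a LIMIT is the query shape the HNSW index can serve.
const NEAREST_SQL = `
SELECT id, content, embedding <=> $1::vector AS distance
FROM doc_chunks
ORDER BY embedding <=> $1::vector
LIMIT $2;
`;

// pgvector accepts a vector as a text literal such as '[0.1,0.2,0.3]'.
function toVectorLiteral(embedding: number[]): string {
  if (embedding.some((x) => !Number.isFinite(x))) {
    throw new Error("embedding contains a non-finite value");
  }
  return `[${embedding.join(",")}]`;
}

// Build a parameterised query; the driver that executes it is left open.
function nearestQuery(
  embedding: number[],
  limit = 10
): { text: string; params: [string, number] } {
  return { text: NEAREST_SQL, params: [toVectorLiteral(embedding), limit] };
}
```

Because the query is plain parameterised SQL, the text and params pair can be handed to whichever Postgres driver is in use, which is the portability argument in practice.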

The non-obvious edge: Workers don't hold traditional database connections well. They're short-lived, massively distributed, and don't pool long-lived TCP connections the way a normal server does, so pointing a standard Postgres driver at, say, a self-hosted instance means bolting on a connection pooler or a proxy shim to stop exhausting connections. Neon sidesteps this entirely: its serverless driver speaks HTTP (and WebSockets), which is exactly what a Worker is built to do. A query becomes an HTTP request, with no socket to hold and no pool to manage. It's the same Postgres underneath and only the transport changes, but that transport choice is the difference between "works cleanly on the edge" and "fights the runtime." Knowing where the runtime's limits are saved me a whole category of problems that have nothing to do with RAG.

The retrieval itself is two-stage. First, an approximate nearest-neighbour search against pgvector pulls the top ten candidate chunks by vector similarity: fast, but blunt. Then a cross-encoder reranker re-scores those ten against the query and keeps the best three. The reranker is the single biggest quality lever in the whole pipeline: nearest-neighbour search embeds the query and each document separately and only compares the resulting vectors, whereas a cross-encoder reads the query and a candidate together, which catches relevance the first pass misses. It's the cheapest upgrade from "toy" to "actually good." Embeddings come from a small open model rather than a hosted API, for the same reasons as the database: no lock-in, good enough at this scale, and swappable later if it isn't.
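The two stages can be sketched as a small orchestration function. Both the vector search and the reranker are passed in as functions with hypothetical signatures, so nothing here assumes a particular model or driver; only the ten-then-three shape comes from the pipeline described above.

```typescript
interface Chunk {
  id: string;
  content: string;
}

// Both stages are injected so the sketch stays independent of any
// particular vector store or reranking model (hypothetical signatures).
type AnnSearch = (query: string, limit: number) => Promise<Chunk[]>;
type Rerank = (query: string, passages: string[]) => Promise<number[]>;

// Keep the k candidates with the highest reranker scores, best first.
function keepBest(candidates: Chunk[], scores: number[], k: number): Chunk[] {
  if (candidates.length !== scores.length) {
    throw new Error("one score per candidate is required");
  }
  return candidates
    .map((chunk, i) => ({ chunk, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}

// Stage 1: cheap vector search for a wide candidate pool.
// Stage 2: the reranker scores each (query, passage) pair; keep the best.
async function retrieve(
  query: string,
  annSearch: AnnSearch,
  rerank: Rerank,
  candidateCount = 10,
  keep = 3
): Promise<Chunk[]> {
  const candidates = await annSearch(query, candidateCount);
  if (candidates.length === 0) return [];
  const scores = await rerank(
    query,
    candidates.map((c) => c.content)
  );
  return keepBest(candidates, scores, keep);
}
```

Injecting the stages also makes the pipeline easy to evaluate with the reranker swapped out or disabled, which is one way to measure how much it contributes.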

What I didn't build

The corpus here is static and trusted: design docs, committed to git, the same for every player, only changing when a pull request lands. That made the problem tractable — no untrusted input, no per-player partitioning, a fixed answer key I could label by hand.

The interesting harder version is next, and I scoped it out deliberately rather than build it now. I want the assistant to answer questions about a player's own world ("what happened to my first creature?", "tell me Ada's story") by retrieving from a continuously growing memory the simulation writes as creatures live and die. That's a different and harder problem: dynamic content instead of a static corpus, partially untrusted input (player-named creatures are an injection surface), no committable answer key to evaluate against, and a need for hybrid keyword-plus-vector search to handle exact-name lookups that pure semantic search fumbles. It also only earns its keep once the simulation runs persistently on a server instead of stopping when the browser tab closes, which is a larger migration in its own right.

Building the static version first means the hard version inherits all the plumbing: the schema, the embedder, the security boundary, the eval scaffolding. Same reads, different writes. But the eval problem alone — how do you label "correct" for questions about a world that's different in every save? — is enough for its own post, and probably its own month.

For now, the help is no longer a vending machine. You can ask it almost anything about how the world works, and it answers from the actual docs — and I can prove, with a number, that it mostly gets it right.