What is offline evaluation?
Offline evaluation is the step in the loop between running an experiment and shipping a change. You have a dataset, you have run your application against it, and now you need to judge whether the outputs are good.
TODO: loop diagram
Three kinds of evaluation
There are three main ways to evaluate: manually, with code, or with an LLM. Each is suited to different kinds of quality checks.
Manual evaluation
Manual evaluation means reading outputs yourself, scoring them, and writing down your thoughts on their quality.
This is an important step: reading outputs builds an understanding of what your application actually does, where it struggles, and what "good" looks like for your specific use case. That understanding is what tells you which automated evaluators to build and how to define their criteria later on. Teams that skip this step and jump straight to automated evaluation often end up measuring things that don't matter.
Manual evaluation also produces human labels that serve as ground truth for validating automated evaluators later.
Code-based evaluation
Code-based evaluators check properties that can be verified with deterministic logic. They are fast, cheap, and produce the same result every time.
Some example checks where code-based evaluators are a natural fit:
- The output is valid JSON or follows a required schema
- The output contains (or does not contain) specific keywords or patterns
- The output stays within a length limit
- The generated SQL executes without errors
Their limitation is that they cannot assess meaning. A code-based evaluator can check that an output contains the word "refund," but it cannot check whether the output correctly explains the refund policy.
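As a concrete illustration, here is a minimal code-based evaluator in Python that combines two of the checks listed above: valid JSON and a length limit. The function name, the length budget, and the pass/fail return shape are just for this sketch and not tied to any particular framework.

```python
import json

MAX_LENGTH = 2000  # hypothetical character budget for this example

def evaluate_output(output: str) -> dict:
    """Deterministic checks: the same input always produces the same result."""
    try:
        json.loads(output)
        is_valid_json = True
    except json.JSONDecodeError:
        is_valid_json = False

    within_length = len(output) <= MAX_LENGTH

    return {
        "valid_json": is_valid_json,
        "within_length": within_length,
        "pass": is_valid_json and within_length,
    }
```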
LLM-as-a-judge
An LLM-as-a-judge evaluator uses a language model to score outputs.
This is the right method for qualities that require understanding language: whether a response is relevant to the question, whether the tone matches the intended audience, whether a summary captures the key points of the source material, etc.
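A minimal sketch of an LLM judge for a relevance criterion, assuming the OpenAI Python SDK; the prompt wording, model choice, and pass/fail parsing are illustrative rather than a prescribed setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a customer-support response.
Question: {question}
Response: {response}

Does the response directly address the question? Answer PASS or FAIL,
then give a one-sentence justification."""

def judge_relevance(question: str, response: str) -> bool:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # ideally not the same model that generated the outputs
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response),
        }],
        temperature=0,  # reduce run-to-run variance in the verdict
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")
```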
LLM judges are imperfect. This means:
- They sometimes pass outputs that a human would fail, and vice versa
- They need calibration against human labels to verify they are measuring what you think they are measuring
- They can share blind spots with your application's LLM, especially when the same model family is used for both
These limitations aren't reasons to avoid LLM judges. An LLM judge that has been calibrated against human labels and is backed by code-based checks is a reliable evaluator.
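Calibration itself can start as simply as measuring agreement between the judge's verdicts and the human labels collected during manual evaluation. A rough sketch, with made-up field names:

```python
def judge_agreement(labeled_items: list[dict]) -> float:
    """Fraction of items where the LLM judge agrees with the human label.

    Each item is expected to look like:
    {"human_pass": True, "judge_pass": False}
    """
    matches = sum(item["human_pass"] == item["judge_pass"] for item in labeled_items)
    return matches / len(labeled_items)

# 0.75 agreement here: worth inspecting the disagreements before trusting the judge.
print(judge_agreement([
    {"human_pass": True, "judge_pass": True},
    {"human_pass": False, "judge_pass": False},
    {"human_pass": True, "judge_pass": False},
    {"human_pass": False, "judge_pass": False},
]))
```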
Reference-based vs reference-free evaluators
Both code-based and LLM-as-a-judge evaluators can be either reference-based or reference-free.
TODO: link to section in datasets page about this
| | Code-based | LLM-as-a-judge / manual |
|---|---|---|
| Reference-based | Exact string match check | "Does the response correctly explain this specific policy?" |
| Reference-free | Validate JSON structure | Tone-of-voice evaluation |
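The distinction mostly shows up in the evaluator's signature: a reference-based evaluator needs the expected output from the dataset, while a reference-free one only needs the actual output. A schematic example mirroring the code-based column of the table above:

```python
import json

# Reference-based: compares the output against an expected value from the dataset.
def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

# Reference-free: judges the output on its own, no expected value required.
def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```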
What should you evaluate?
The best way to decide what to evaluate is to identify your application's failure modes. You'll discover these by manually going through your traces and by looking at negative user feedback.
TODO: add very specific examples per use case
Generic qualities like "helpfulness" or "quality" are tempting starting points, but they rarely produce useful signal. An evaluator that checks a vague criterion will give vague results. The more precisely you can define what "good" or "bad" looks like for your application, the more useful your evaluators will be.
Combining evaluation methods
Each quality you care about gets its own evaluator. For each evaluator, pick the method that fits:
| When | Method |
|---|---|
| The check can be expressed as a deterministic rule | Code-based evaluator |
| The check requires understanding language | LLM-as-a-judge |
| You can't clearly define what "good" looks like yet | Manual evaluation |
Most mature evaluation setups use all three.
One practical recommendation: prefer binary scores (pass/fail) over graded scales (1-5) when designing evaluators. Binary scores force a clear definition of what separates acceptable from unacceptable. Graded scales introduce ambiguity about what a 3 means versus a 4, which makes scores harder to interpret and less consistent across evaluators and over time.
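In practice this often looks like attaching one named evaluator per quality to each dataset item and recording a boolean per evaluator. A sketch, reusing the helpers from the earlier examples (`is_valid_json`, `exact_match`, `judge_relevance`); the item field names are assumptions for illustration:

```python
def run_evaluators(item: dict, output: str) -> dict[str, bool]:
    """One evaluator per quality, each returning a binary pass/fail."""
    return {
        "valid_json": is_valid_json(output),                        # code-based, reference-free
        "matches_expected": exact_match(output, item["expected"]),  # code-based, reference-based
        "relevant": judge_relevance(item["question"], output),      # LLM-as-a-judge
    }
```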
Where to start
If you are just setting things up, start by evaluating manually. It will give you a good sense of what you actually want and which criteria matter most.
TODO: Later, start defining evaluation criteria and set up corresponding automated evaluators. Be careful about doing this too early: link to specification vs generalization problem