What is offline evaluation?
Offline evaluation is the step in the loop between running an experiment and shipping a change. You have a dataset, you have run your application against it, and now you need to judge whether the outputs are good.
TODO: loop diagram
Three kinds of evaluation
There are three main ways to evaluate: manually, with code, or with an LLM. Each is suited to different kinds of quality checks.
Manual evaluation
Manual evaluation means reading outputs yourself, scoring them, and writing down your thoughts on their quality.
This is an important step: reading outputs builds an understanding of what your application actually does, where it struggles, and what "good" looks like for your specific use case. That understanding is what tells you which automated evaluators to build and how to define their criteria later on. Teams that skip this step and jump straight to automated evaluation often end up measuring things that don't matter.
Manual evaluation also produces human labels that serve as ground truth for validating automated evaluators later.
Code-based evaluation
Code-based evaluators check properties that can be verified with deterministic logic. They are fast, cheap, and produce the same result every time.
Some example checks where code-based evaluators are a natural fit:
- The output is valid JSON or follows a required schema
- The output contains (or does not contain) specific keywords or patterns
- The output stays within a length limit
- The generated SQL executes without errors
Their limitation is that they cannot assess meaning. A code-based evaluator can check that an output contains the word "refund," but it cannot check whether the output correctly explains the refund policy.
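As a concrete illustration, here is a minimal code-based evaluator in Python that combines two of the checks listed above: valid JSON and a length limit. The function name, the length budget, and the pass/fail return shape are just for this sketch and not tied to any particular framework.

```python
import json

MAX_LENGTH = 2000  # hypothetical character budget for this example

def evaluate_output(output: str) -> dict:
    """Deterministic checks: the same input always produces the same result."""
    try:
        json.loads(output)
        is_valid_json = True
    except json.JSONDecodeError:
        is_valid_json = False

    within_length = len(output) <= MAX_LENGTH

    return {
        "valid_json": is_valid_json,
        "within_length": within_length,
        "pass": is_valid_json and within_length,
    }
```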
LLM-as-a-judge
An LLM-as-a-judge evaluator uses a language model to score outputs.
This is the right method for qualities that require understanding language: whether a response is relevant to the question, whether the tone matches the intended audience, whether a summary captures the key points of the source material, etc.
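A minimal sketch of an LLM judge for a relevance criterion, assuming the OpenAI Python SDK; the prompt wording, model choice, and pass/fail parsing are illustrative rather than a prescribed setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a customer-support response.
Question: {question}
Response: {response}

Does the response directly address the question? Answer PASS or FAIL,
then give a one-sentence justification."""

def judge_relevance(question: str, response: str) -> bool:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # ideally not the same model that generated the outputs
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response),
        }],
        temperature=0,  # reduce run-to-run variance in the verdict
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")
```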
LLM judges are imperfect. This means:
- They sometimes pass outputs that a human would fail, and vice versa
- They need calibration against human labels to verify they are measuring what you think they are measuring
- They can share blind spots with your application's LLM, especially when the same model family is used for both
These limitations aren't reasons to avoid LLM judges. An LLM judge that has been calibrated against human labels and is backed by code-based checks is a reliable evaluator.
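Calibration itself can start as simply as measuring agreement between the judge's verdicts and the human labels collected during manual evaluation. A rough sketch, with made-up field names:

```python
def judge_agreement(labeled_items: list[dict]) -> float:
    """Fraction of items where the LLM judge agrees with the human label.

    Each item is expected to look like:
    {"human_pass": True, "judge_pass": False}
    """
    matches = sum(item["human_pass"] == item["judge_pass"] for item in labeled_items)
    return matches / len(labeled_items)

# 0.75 agreement here: worth inspecting the disagreements before trusting the judge.
print(judge_agreement([
    {"human_pass": True, "judge_pass": True},
    {"human_pass": False, "judge_pass": False},
    {"human_pass": True, "judge_pass": False},
    {"human_pass": False, "judge_pass": False},
]))
```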
Reference-based vs reference-free evaluators
Both code-based and LLM-as-a-judge evaluators can be either reference-based or reference-free.
TODO: link to section in datasets page about this
| | Code-based | LLM-as-a-judge / manual |
|---|---|---|
| Reference-based | Exact string match check | "Does the response correctly explain this specific policy?" |
| Reference-free | Validate JSON structure | Tone-of-voice evaluation |
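The distinction mostly shows up in the evaluator's signature: a reference-based evaluator needs the expected output from the dataset, while a reference-free one only needs the actual output. A schematic example mirroring the code-based column of the table above:

```python
import json

# Reference-based: compares the output against an expected value from the dataset.
def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

# Reference-free: judges the output on its own, no expected value required.
def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```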
What should you evaluate?
The best way to decide what to evaluate is to identify your application's failure modes. You'll discover these by manually going through your traces and by looking at negative user feedback.
TODO: add very specific examples per use case
Generic qualities like "helpfulness" or "quality" are tempting starting points, but they rarely produce useful signal. An evaluator that checks a vague criterion will give vague results. The more precisely you can define what "good" or "bad" looks like for your application, the more useful your evaluators will be.
Combining evaluation methods
Each quality you care about gets its own evaluator. For each evaluator, pick the method that fits:
| When | Method |
|---|---|
| The check can be expressed as a deterministic rule | Code-based evaluator |
| The check requires understanding language | LLM-as-a-judge |
| You can't clearly define what "good" looks like yet | Manual evaluation |
Most mature evaluation setups use all three.
One practical recommendation: prefer binary scores (pass/fail) over graded scales (1-5) when designing evaluators. Binary scores force a clear definition of what separates acceptable from unacceptable. Graded scales introduce ambiguity about what a 3 means versus a 4, which makes scores harder to interpret and less consistent across evaluators and over time.
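In practice this often looks like attaching one named evaluator per quality to each dataset item and recording a boolean per evaluator. A sketch, reusing the helpers from the earlier examples (`is_valid_json`, `exact_match`, `judge_relevance`); the item field names are assumptions for illustration:

```python
def run_evaluators(item: dict, output: str) -> dict[str, bool]:
    """One evaluator per quality, each returning a binary pass/fail."""
    return {
        "valid_json": is_valid_json(output),                        # code-based, reference-free
        "matches_expected": exact_match(output, item["expected"]),  # code-based, reference-based
        "relevant": judge_relevance(item["question"], output),      # LLM-as-a-judge
    }
```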
Where to start
If you are just setting things up, start by evaluating manually. It will give you a good sense of what you actually want and which criteria matter most.
TODO: Later, start defining evaluation criteria and set up corresponding automated evaluators. Be careful about doing this too early: link to specification vs generalization problem