AI Output Evaluation (AOE) — Free Sample

A 5-scenario sample of the AI Output Evaluation family — applying rubrics to specific model outputs, distinguishing surface-correct from substance-correct answers, identifying hallucination, recognizing grounded versus fabricated content, and judging when output quality is shippable. All items are originally authored by AIEH editorial; none are drawn from a copyrighted bank. For a verified Skills Passport credential, take the full AOE assessment.

1. An LLM produces a fluent, well-organized 200-word answer to a customer question about your product's refund policy. The answer cites "the refund window outlined in your terms of service." Your terms of service do not contain a specific refund window. What's the most diagnostic call?
2. You're evaluating two LLM outputs against a rubric that includes "factually accurate" and "appropriately hedged." Output A is fluent but states one minor fact incorrectly. Output B is correct but hedges everything excessively ("it might be possible that..."). The rubric weights accuracy at 0.6 and hedging at 0.4. Which is the better output?
3. A customer reports the model produced a confidently wrong answer. You investigate and find the model output is grammatically perfect, internally consistent, and superficially sounds like the kind of answer a domain expert would give. The substance is wrong. What's the highest-value pattern to apply going forward?
4. You're evaluating a code-generation model's output. The code compiles, passes the unit tests you wrote, and produces correct results on the inputs you tested. A senior engineer on your team reviews it and points out a subtle race condition under concurrent load. Which of these takes on the evaluation best aligns with good AOE practice?
5. Your team has authored a 50-item eval on which the latest model achieves a 92% pass rate. The team is debating whether to ship. The 8% of failures are concentrated in one specific input pattern. What's the most defensible call?
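Scenario 2's weighted rubric can be worked through numerically. The weights (0.6 accuracy, 0.4 hedging) come from the item itself; the per-criterion scores below are illustrative assumptions, not canonical grades:

```python
# Hypothetical rubric scores in [0, 1]. The weights are from the scenario;
# the individual scores for Output A and Output B are assumed for illustration.
WEIGHTS = {"accuracy": 0.6, "hedging": 0.4}

def weighted_score(scores: dict) -> float:
    """Weighted sum of per-criterion scores."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

output_a = {"accuracy": 0.7, "hedging": 0.9}  # fluent, one minor factual error
output_b = {"accuracy": 1.0, "hedging": 0.4}  # correct, excessively hedged

score_a = weighted_score(output_a)  # 0.6*0.7 + 0.4*0.9 = 0.78
score_b = weighted_score(output_b)  # 0.6*1.0 + 0.4*0.4 = 0.76
```

Note how close the totals land under these assumed scores: the "better output" verdict is sensitive to how severely you grade a single factual error versus pervasive over-hedging, which is exactly the judgment the item probes.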
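Scenario 5 hinges on whether an aggregate pass rate hides a clustered failure mode. A minimal sketch of a per-pattern breakdown, using wholly hypothetical results (the pattern names and counts are invented to match the scenario's 92% figure):

```python
from collections import Counter

# Hypothetical eval results: (input_pattern, passed). 50 items, 4 failures,
# all failures concentrated in the assumed "multi_step" pattern.
results = (
    [("simple", True)] * 30
    + [("multi_step", True)] * 6
    + [("multi_step", False)] * 4
    + [("edge_case", True)] * 10
)

overall = sum(passed for _, passed in results) / len(results)  # 46/50 = 0.92

fails = Counter(pat for pat, passed in results if not passed)
totals = Counter(pat for pat, _ in results)
pattern_fail_rate = {pat: fails[pat] / totals[pat] for pat in fails}
# "multi_step" fails 4 out of 10 (40%) even though the aggregate looks healthy
```

The point of the breakdown: a 92% aggregate can coexist with a 40% failure rate on one input class, and the defensible ship decision depends on the per-pattern view, not the headline number.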