AI Output Evaluation (AOE) — Free Sample

A 5-scenario sample of the AI Output Evaluation family — applying rubrics to specific model outputs, distinguishing surface-correct from substance-correct answers, identifying hallucination, recognizing grounded versus fabricated content, and judging when output quality is shippable. All items are originally authored by AIEH editorial; none are drawn from a copyrighted bank. For a verified Skills Passport credential, take the full AOE assessment.

1. An LLM produces a fluent, well-organized 200-word answer to a customer question about your product's refund policy. The answer cites "the refund window outlined in your terms of service." Your terms of service do not contain a specific refund window. What's the most diagnostic call?
2. You're evaluating two LLM outputs against a rubric that includes "factually accurate" and "appropriately hedged." Output A is fluent but states one minor fact incorrectly. Output B is correct but hedges everything excessively ("it might be possible that..."). The rubric weights accuracy at 0.6 and hedging at 0.4. Which is the better output?
3. A customer reports the model produced a confidently wrong answer. You investigate and find the model output is grammatically perfect, internally consistent, and superficially sounds like the kind of answer a domain expert would give. The substance is wrong. What's the highest-value pattern to apply going forward?
4. You're evaluating a code-generation model's output. The code compiles, passes the unit tests you wrote, and produces correct results on the inputs you tested. A senior engineer on your team reviews it and points out a subtle race condition under concurrent load. Which of these takes on the evaluation best aligns with good AOE practice?
5. Your team has authored a 50-item eval on which the latest model achieves a 92% pass rate. The team is debating whether to ship. The 8% of failures are concentrated in one specific input pattern. What's the most defensible call?
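Scenario 2's weighted rubric can be worked through numerically. The weights (0.6 accuracy, 0.4 hedging) come from the item itself; the per-criterion scores below are illustrative assumptions, not canonical grades:

```python
# Hypothetical rubric scores in [0, 1]. The weights are from the scenario;
# the individual scores for Output A and Output B are assumed for illustration.
WEIGHTS = {"accuracy": 0.6, "hedging": 0.4}

def weighted_score(scores: dict) -> float:
    """Weighted sum of per-criterion scores."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

output_a = {"accuracy": 0.7, "hedging": 0.9}  # fluent, one minor factual error
output_b = {"accuracy": 1.0, "hedging": 0.4}  # correct, excessively hedged

score_a = weighted_score(output_a)  # 0.6*0.7 + 0.4*0.9 = 0.78
score_b = weighted_score(output_b)  # 0.6*1.0 + 0.4*0.4 = 0.76
```

Note how close the totals land under these assumed scores: the "better output" verdict is sensitive to how severely you grade a single factual error versus pervasive over-hedging, which is exactly the judgment the item probes.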
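Scenario 5 hinges on whether an aggregate pass rate hides a clustered failure mode. A minimal sketch of a per-pattern breakdown, using wholly hypothetical results (the pattern names and counts are invented to match the scenario's 92% figure):

```python
from collections import Counter

# Hypothetical eval results: (input_pattern, passed). 50 items, 4 failures,
# all failures concentrated in the assumed "multi_step" pattern.
results = (
    [("simple", True)] * 30
    + [("multi_step", True)] * 6
    + [("multi_step", False)] * 4
    + [("edge_case", True)] * 10
)

overall = sum(passed for _, passed in results) / len(results)  # 46/50 = 0.92

fails = Counter(pat for pat, passed in results if not passed)
totals = Counter(pat for pat, _ in results)
pattern_fail_rate = {pat: fails[pat] / totals[pat] for pat in fails}
# "multi_step" fails 4 out of 10 (40%) even though the aggregate looks healthy
```

The point of the breakdown: a 92% aggregate can coexist with a 40% failure rate on one input class, and the defensible ship decision depends on the per-pattern view, not the headline number.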