From the AI Output Evaluation sample test
What does the "hallucinated citation" AOE scenario measure?
Note on framing: This is the first item-level explainer for the AOE (AI Output Evaluation) sample-test family. Construct-level coverage is in the aoe-evaluating-llm-output explainer; this article covers the specific aoe_sample_1 item, which uses the scenario-ladder pattern documented in the acl-eval-design-from-fuzzy-goal explainer.
What this scenario measures
This scenario — an LLM produces a fluent answer that cites “the refund window outlined in your terms of service” when your terms of service contain no specific refund window — measures hallucination detection in fluent AI output. Specifically, the item probes whether the respondent recognizes that:
- The model fabricated a citation to support a confidently-stated claim. Hallucination is the canonical AI failure mode; recognizing it reflexively distinguishes mature from immature AI evaluators.
- Fluency is not evidence of correctness. The output sounds reasonable; the claim it makes about the document is wrong.
- The right response is to reject the output and surface the failure mode for retrieval-grounding work — not to accept it because surrounding content is high-quality.
Hallucination is one of the most extensively documented production AI failure modes; the Liang et al. (2022) HELM evaluation framework documents hallucination prevalence across multiple LLM families.
Why this scenario captures AOE skill well
Three properties make the scenario structure diagnostic:
- The hallucination is realistic. Real LLM outputs routinely fabricate citations, statistics, and other confident-sounding details that don’t exist. The scenario’s pattern matches actual production failure modes.
- The graded option ladder catches direction-of-failure. A respondent who picks the “minor accuracy issue” option (value 3) demonstrates partial competence — recognizes that something is wrong but under-weights the hallucination-as-systemic-failure pattern. A respondent who picks the “acceptable” option (value 2) signals a ship-anyway pattern that produces customer-trust loss over time (a minimal sketch of the ladder follows this list).
- The best response models a teachable pattern. Reject + surface for retrieval-grounding work is the generalizable response to hallucination patterns; the pattern applies to most production AI evaluation contexts.
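To make the ladder concrete, here is a minimal sketch of how the graded options might map to values. Only the 5, 3, and 2 tiers described above come from this article; the option labels themselves are assumed for illustration.

```python
# Illustrative option ladder for aoe_sample_1. Only the 5, 3, and 2 tiers are
# described in this article; the option labels are assumptions for this sketch.
OPTION_LADDER = {
    "reject_and_surface_for_retrieval_grounding": 5,  # best response: names the failure mode
    "flag_as_minor_accuracy_issue": 3,                # partial competence: under-weights the pattern
    "accept_because_output_is_fluent": 2,             # ship-anyway pattern: fluency read as correctness
}
```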
What the best response shows (and doesn’t)
Three misconceptions worth flagging:
- Picking the right option ≠ being a strong AI evaluator generally. A respondent can pattern-match to one well-known template (hallucination-rejection) without internalizing the underlying eval-discipline principle. Stronger predictors come from the full 40-scenario assessment.
- Picking a lower-tier option ≠ being weak. Some contexts have specific failure-mode tolerance; the scenario’s value-5 framing assumes a typical customer-facing context where trust is core.
- The best response isn’t context-universal. Internal tools with explicit human-review checkpoints have different failure-tolerance than customer-facing outputs.
How the sample test scores you
In the AIEH 5-scenario AOE sample, this scenario contributes one of five datapoints aggregated into your single aoe_quality score via the W3.2 normalize-by-count threshold.
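As an illustration of how a normalize-by-count aggregation can work, the sketch below averages the selected option values across the five scenarios and rescales the result to 0-100. The scenario ids, example values, and the 0-100 scale are assumptions for illustration; the authoritative W3.2 rule is defined in the scoring methodology.

```python
# Minimal sketch of a normalize-by-count aggregation for the 5-scenario AOE sample.
# Scenario ids, selected-option values, and the 0-100 rescaling are illustrative
# assumptions, not the actual W3.2 specification.

MAX_OPTION_VALUE = 5  # assumed top of the graded option ladder

def aoe_quality(scenario_scores: dict[str, int]) -> float:
    """Average the selected option values across answered scenarios,
    then rescale to a 0-100 aoe_quality score."""
    if not scenario_scores:
        return 0.0
    normalized = sum(scenario_scores.values()) / (len(scenario_scores) * MAX_OPTION_VALUE)
    return round(normalized * 100, 1)

# Example: the hallucination item (aoe_sample_1) answered at the top of its ladder.
sample = {
    "aoe_sample_1": 5,  # reject + surface for retrieval-grounding work
    "aoe_sample_2": 3,
    "aoe_sample_3": 5,
    "aoe_sample_4": 2,
    "aoe_sample_5": 4,
}
print(aoe_quality(sample))  # -> 76.0
```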
Data Notice: Sample-test results are directional indicators only. For a verified Skills Passport credential, take the full 40-scenario assessment.
Why hallucination is the canonical AI failure mode
Hallucination — confidently generating false content that sounds correct — is structurally different from other AI failure modes because it’s hardest to detect downstream. When a model generates obviously-wrong content, downstream review catches it; when a model generates fluent-and-confident content that’s subtly wrong, downstream review often misses it. The asymmetry produces compounding trust loss: customers who encounter even one hallucination start distrusting otherwise-correct outputs from the same system.
Strong AOE practitioners treat hallucination as a primary concern, design evaluation specifically to surface it, and design products to fail visibly rather than silently when hallucination occurs. The fail-visibly framing includes confidence scoring, citation grounding, and human-review checkpoints for high-stakes outputs.
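As a hedged sketch of what failing visibly can look like in practice, the example below routes an output to human review whenever confidence is low or a cited claim cannot be grounded in the source document. The function names, fields, and threshold are hypothetical; they are not drawn from the sample test or any specific product.

```python
# Sketch of a fail-visibly gate: low confidence or an ungrounded citation blocks
# silent delivery and routes the output to human review instead.
# All names, fields, and the 0.8 threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ModelOutput:
    text: str
    confidence: float          # model- or verifier-assigned confidence, 0.0-1.0
    cited_claims: list[str]    # claims the output attributes to a source document

def claim_is_grounded(claim: str, source_text: str) -> bool:
    # Naive grounding check for illustration only: does the source actually mention
    # the cited concept? Production systems use retrieval plus entailment checks.
    return claim.lower() in source_text.lower()

def route_output(output: ModelOutput, source_text: str, threshold: float = 0.8) -> str:
    ungrounded = [c for c in output.cited_claims if not claim_is_grounded(c, source_text)]
    if ungrounded or output.confidence < threshold:
        # Fail visibly: surface the suspect claims rather than shipping silently.
        return f"needs_human_review (ungrounded claims: {ungrounded})"
    return "deliver_to_customer"

terms_of_service = "Returns are accepted for store credit under our standard return policy."
answer = ModelOutput(
    text="Per the refund window outlined in your terms of service, you qualify for a refund.",
    confidence=0.93,
    cited_claims=["refund window"],
)
print(route_output(answer, terms_of_service))
# -> needs_human_review (ungrounded claims: ['refund window'])
```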
Related concepts
- Hallucination as model-architecture artifact. Auto-regressive language models generate plausible-sounding tokens based on training-data statistical patterns, not on grounded fact-checking. Without retrieval-grounding or explicit fact-verification, all LLMs hallucinate at non-zero rates.
- Retrieval-augmented generation (RAG). The dominant production architecture for reducing hallucination — the model retrieves relevant documents and grounds outputs in retrieved content rather than parametric knowledge alone.
- Fluent vs grounded outputs. Fluency is a stylistic property; grounding is an epistemic property. Strong AI evaluators distinguish them reflexively.
- Citation fabrication as adversarial test case. Strong eval sets include citation-fabrication test cases specifically; weak eval sets miss this systematic failure pattern.
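To make the citation-fabrication test case concrete, here is a minimal sketch of what such an eval-set entry might look like, paired with a deliberately crude stub evaluator. The record structure, field names, and verdict strings are illustrative assumptions, not a specific framework's schema.

```python
# Illustrative citation-fabrication test case for an eval set. The record structure,
# field names, and the stub evaluator are assumptions for this sketch.

fabrication_case = {
    "id": "citation_fabrication_refund_window",
    "source_document": (
        "Terms of service: returns are accepted for store credit under our "
        "standard return policy. No time-limited guarantee is defined."
    ),
    "model_output": (
        "Per the refund window outlined in your terms of service, "
        "you are eligible for a full refund within 30 days."
    ),
    # The output cites a refund window the source never defines, so a correct
    # evaluator should reject it rather than flag a minor accuracy issue.
    "expected_verdict": "reject_and_surface_for_retrieval_grounding",
}

def check_case(case: dict, evaluator) -> bool:
    """Run one eval-set entry through an evaluator (human or automated)
    and compare its verdict with the expected one."""
    return evaluator(case["model_output"], case["source_document"]) == case["expected_verdict"]

# Trivially strict stub evaluator: reject any output that cites a refund window
# the source never defines. Real evaluators would be humans or grounded checkers.
def stub_evaluator(output: str, source: str) -> str:
    if "refund window" in output.lower() and "refund window" not in source.lower():
        return "reject_and_surface_for_retrieval_grounding"
    return "accept"

print(check_case(fabrication_case, stub_evaluator))  # -> True
```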
For broader treatment of how AOE fits into role-readiness scoring, see the AI fluency in hiring overview and the scoring methodology.
Sources
- Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
- Liang, P., Bommasani, R., Lee, T., et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
- Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology. Psychological Bulletin, 124(2), 262–274.