From the AI Output Evaluation sample test
What does the "hallucinated citation" AOE scenario measure?
Note on framing: This is the first item-level explainer for the AOE (AI Output Evaluation) sample-test family. Construct-level coverage is in the aoe-evaluating-llm-output explainer; this article covers the specific aoe_sample_1 item, which uses the scenario-ladder pattern documented in the acl-eval-design-from-fuzzy-goal explainer.
What this scenario measures
This scenario — an LLM produces a fluent answer that cites “the refund window outlined in your terms of service” when your terms of service contain no specific refund window — measures hallucination detection in fluent AI output. Specifically, the item probes whether the respondent recognizes that:
- The model fabricated a citation to support a confidently-stated claim. Hallucination is the canonical AI failure mode; recognizing it reflexively distinguishes mature from immature AI evaluators.
- Fluency is not evidence of correctness. The output sounds reasonable; the claim it makes about the document is wrong.
- The right response is to reject the output and surface the failure mode for retrieval-grounding work — not to accept it because surrounding content is high-quality.
Hallucination is one of the most extensively documented production AI failure modes; the Liang et al. (2022) HELM evaluation framework documents hallucination prevalence across multiple LLM families.
Why this scenario captures AOE skill well
Three properties make the scenario structure diagnostic:
- The hallucination is realistic. Real LLM outputs routinely fabricate citations, statistics, and other confident-sounding details that don’t exist. The scenario’s pattern matches actual production failure modes.
- The graded option ladder catches direction-of-failure. A respondent who picks the “minor accuracy issue” option (value 3) demonstrates partial competence — recognizes that something is wrong but under-weights the hallucination-as-systemic-failure pattern. A respondent who picks the “acceptable” option (value 2) signals a ship-anyway pattern that produces customer-trust loss over time (a minimal sketch of the ladder follows this list).
- The best response models a teachable pattern. Reject + surface for retrieval-grounding work is the generalizable response to hallucination patterns; the pattern applies to most production AI evaluation contexts.
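To make the ladder concrete, here is a minimal sketch of how the graded options might map to values. Only the 5, 3, and 2 tiers described above come from this article; the option labels themselves are assumed for illustration.

```python
# Illustrative option ladder for aoe_sample_1. Only the 5, 3, and 2 tiers are
# described in this article; the option labels are assumptions for this sketch.
OPTION_LADDER = {
    "reject_and_surface_for_retrieval_grounding": 5,  # best response: names the failure mode
    "flag_as_minor_accuracy_issue": 3,                # partial competence: under-weights the pattern
    "accept_because_output_is_fluent": 2,             # ship-anyway pattern: fluency read as correctness
}
```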
What the best response shows (and doesn’t)
Three misconceptions worth flagging:
- Picking the right option ≠ being a strong AI evaluator generally. A respondent can pattern-match to one well-known template (hallucination-rejection) without internalizing the underlying eval-discipline principle. Stronger predictors come from the full 40-scenario assessment.
- Picking a lower-tier option ≠ being weak. Some contexts have specific failure-mode tolerance; the scenario’s value-5 framing assumes a typical customer-facing context where trust is core.
- The best response isn’t context-universal. Internal tools with explicit human-review checkpoints have different failure-tolerance than customer-facing outputs.
How the sample test scores you
In the AIEH 5-scenario AOE sample, this scenario contributes one of five datapoints aggregated into your single aoe_quality score via the W3.2 normalize-by-count threshold.
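As an illustration of how a normalize-by-count aggregation can work, the sketch below averages the selected option values across the five scenarios and rescales the result to 0-100. The scenario ids, example values, and the 0-100 scale are assumptions for illustration; the authoritative W3.2 rule is defined in the scoring methodology.

```python
# Minimal sketch of a normalize-by-count aggregation for the 5-scenario AOE sample.
# Scenario ids, selected-option values, and the 0-100 rescaling are illustrative
# assumptions, not the actual W3.2 specification.

MAX_OPTION_VALUE = 5  # assumed top of the graded option ladder

def aoe_quality(scenario_scores: dict[str, int]) -> float:
    """Average the selected option values across answered scenarios,
    then rescale to a 0-100 aoe_quality score."""
    if not scenario_scores:
        return 0.0
    normalized = sum(scenario_scores.values()) / (len(scenario_scores) * MAX_OPTION_VALUE)
    return round(normalized * 100, 1)

# Example: the hallucination item (aoe_sample_1) answered at the top of its ladder.
sample = {
    "aoe_sample_1": 5,  # reject + surface for retrieval-grounding work
    "aoe_sample_2": 3,
    "aoe_sample_3": 5,
    "aoe_sample_4": 2,
    "aoe_sample_5": 4,
}
print(aoe_quality(sample))  # -> 76.0
```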
Data Notice: Sample-test results are directional indicators only. For a verified Skills Passport credential, take the full 40-scenario assessment.
Why hallucination is the canonical AI failure mode
Hallucination — confidently generating false content that sounds correct — is structurally different from other AI failure modes because it’s hardest to detect downstream. When a model generates obviously-wrong content, downstream review catches it; when a model generates fluent-and-confident content that’s subtly wrong, downstream review often misses it. The asymmetry produces compounding trust loss: customers who encounter even one hallucination start distrusting otherwise-correct outputs from the same system.
Strong AOE practitioners treat hallucination as a primary concern, design evaluation specifically to surface it, and design products to fail visibly rather than silently when hallucination occurs. The fail-visibly framing includes confidence scoring, citation grounding, and human-review checkpoints for high-stakes outputs.
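As a hedged sketch of what failing visibly can look like in practice, the example below routes an output to human review whenever confidence is low or a cited claim cannot be grounded in the source document. The function names, fields, and threshold are hypothetical; they are not drawn from the sample test or any specific product.

```python
# Sketch of a fail-visibly gate: low confidence or an ungrounded citation blocks
# silent delivery and routes the output to human review instead.
# All names, fields, and the 0.8 threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ModelOutput:
    text: str
    confidence: float          # model- or verifier-assigned confidence, 0.0-1.0
    cited_claims: list[str]    # claims the output attributes to a source document

def claim_is_grounded(claim: str, source_text: str) -> bool:
    # Naive grounding check for illustration only: does the source actually mention
    # the cited concept? Production systems use retrieval plus entailment checks.
    return claim.lower() in source_text.lower()

def route_output(output: ModelOutput, source_text: str, threshold: float = 0.8) -> str:
    ungrounded = [c for c in output.cited_claims if not claim_is_grounded(c, source_text)]
    if ungrounded or output.confidence < threshold:
        # Fail visibly: surface the suspect claims rather than shipping silently.
        return f"needs_human_review (ungrounded claims: {ungrounded})"
    return "deliver_to_customer"

terms_of_service = "Returns are accepted for store credit under our standard return policy."
answer = ModelOutput(
    text="Per the refund window outlined in your terms of service, you qualify for a refund.",
    confidence=0.93,
    cited_claims=["refund window"],
)
print(route_output(answer, terms_of_service))
# -> needs_human_review (ungrounded claims: ['refund window'])
```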
Related concepts
- Hallucination as model-architecture artifact. Auto-regressive language models generate plausible-sounding tokens based on training-data statistical patterns, not on grounded fact-checking. Without retrieval-grounding or explicit fact-verification, all LLMs hallucinate at non-zero rates.
- Retrieval-augmented generation (RAG). The dominant production architecture for reducing hallucination — the model retrieves relevant documents and grounds outputs in retrieved content rather than parametric knowledge alone.
- Fluent vs grounded outputs. Fluency is a stylistic property; grounding is an epistemic property. Strong AI evaluators distinguish them reflexively.
- Citation fabrication as adversarial test case. Strong eval sets include citation-fabrication test cases specifically; weak eval sets miss this systematic failure pattern.
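To make the citation-fabrication test case concrete, here is a minimal sketch of what such an eval-set entry might look like, paired with a deliberately crude stub evaluator. The record structure, field names, and verdict strings are illustrative assumptions, not a specific framework's schema.

```python
# Illustrative citation-fabrication test case for an eval set. The record structure,
# field names, and the stub evaluator are assumptions for this sketch.

fabrication_case = {
    "id": "citation_fabrication_refund_window",
    "source_document": (
        "Terms of service: returns are accepted for store credit under our "
        "standard return policy. No time-limited guarantee is defined."
    ),
    "model_output": (
        "Per the refund window outlined in your terms of service, "
        "you are eligible for a full refund within 30 days."
    ),
    # The output cites a refund window the source never defines, so a correct
    # evaluator should reject it rather than flag a minor accuracy issue.
    "expected_verdict": "reject_and_surface_for_retrieval_grounding",
}

def check_case(case: dict, evaluator) -> bool:
    """Run one eval-set entry through an evaluator (human or automated)
    and compare its verdict with the expected one."""
    return evaluator(case["model_output"], case["source_document"]) == case["expected_verdict"]

# Trivially strict stub evaluator: reject any output that cites a refund window
# the source never defines. Real evaluators would be humans or grounded checkers.
def stub_evaluator(output: str, source: str) -> str:
    if "refund window" in output.lower() and "refund window" not in source.lower():
        return "reject_and_surface_for_retrieval_grounding"
    return "accept"

print(check_case(fabrication_case, stub_evaluator))  # -> True
```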
For broader treatment of how AOE fits into role-readiness scoring, see the AI fluency in hiring overview and the scoring methodology.
Sources
- Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
- Liang, P., Bommasani, R., Lee, T., et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
- Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology. Psychological Bulletin, 124(2), 262–274.