What does the "hallucinated citation" AOE scenario measure?

Note on framing: This is the first item-level explainer for the AOE (AI Output Evaluation) sample-test family. Construct-level coverage is in the aoe-evaluating-llm-output explainer; this article covers the specific aoe_sample_1 item, which uses the scenario-ladder pattern documented in the acl-eval-design-from-fuzzy-goal explainer.

What this scenario measures

This scenario — an LLM produces a fluent answer that cites “the refund window outlined in your terms of service” when your terms of service contain no specific refund window — measures hallucination detection in fluent AI output. Specifically, the item probes whether the respondent recognizes that:

  1. The model fabricated a citation to support a confidently stated claim. Hallucination is the canonical AI failure mode; recognizing it reflexively distinguishes mature from immature AI evaluators.
  2. Fluency is not evidence of correctness. The output sounds reasonable; the claim it makes about the document is wrong.
  3. The right response is to reject the output and surface the failure mode for retrieval-grounding work — not to accept it because surrounding content is high-quality.

Hallucination is among the best-documented production AI failure modes; the Liang et al. (2022) HELM evaluation framework documents hallucination prevalence across multiple LLM families.

Why this scenario captures AOE skill well

Three properties make this scenario diagnostic:

  • The hallucination is realistic. Real LLM outputs routinely fabricate citations, statistics, and other confident-sounding details that don’t exist. The scenario’s pattern matches actual production failure modes.
  • The graded option ladder captures the direction of failure. A respondent who picks the “minor accuracy issue” option (value 3) demonstrates partial competence: they recognize something is wrong but under-weight hallucination as a systemic failure. A respondent who picks the “acceptable” option (value 2) signals a ship-anyway habit that erodes customer trust over time; a sketch of the ladder follows this list.
  • The best response models a teachable pattern. Rejecting the output and surfacing it for retrieval-grounding work is the generalizable response to hallucination, and it transfers to most production AI evaluation contexts.
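
Here is a minimal sketch of the graded option ladder, in Python. Only the value-5, value-3, and value-2 tiers are described in this explainer, so the option wordings and the partial_credit helper below are illustrative assumptions, not the production item text:

    # Illustrative ladder: wordings are placeholders, not the actual item options.
    OPTION_LADDER = {
        5: "Reject the output and surface the fabricated citation for retrieval-grounding work.",
        3: "Flag a minor accuracy issue and lightly edit the answer before shipping.",
        2: "Accept the output; the surrounding content is high-quality.",
    }

    def partial_credit(selected_value: int, max_value: int = 5) -> float:
        """Score a response as a fraction of the best option's value."""
        return selected_value / max_value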

What the best response shows (and doesn’t)

Three misconceptions worth flagging:

  • Picking the right option ≠ being a strong AI evaluator generally. A respondent can pattern-match to one well-known template (hallucination-rejection) without internalizing the underlying eval-discipline principle. Stronger predictors come from the full 40-scenario assessment.
  • Picking a lower-tier option ≠ being weak. Some contexts have specific failure-mode tolerance; the scenario’s value-5 framing assumes a typical customer-facing context where trust is core.
  • The best response isn’t context-universal. Internal tools with explicit human-review checkpoints have different failure-tolerance than customer-facing outputs.

How the sample test scores you

In the AIEH 5-scenario AOE sample, this scenario contributes one of five datapoints aggregated into your single aoe_quality score via the W3.2 normalize-by-count threshold.
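
For intuition, here is a minimal sketch of normalize-by-count aggregation in Python. It assumes each scenario contributes an option value between 0 and 5 and that the score is the summed values divided by the maximum attainable sum; the actual W3.2 rule, including its threshold behavior, is not reproduced here, so the numbers are illustrative:

    def aoe_quality(scenario_values: list[int], max_value: int = 5) -> float:
        """Illustrative normalize-by-count: average per-scenario values, scaled to 0-1."""
        if not scenario_values:
            return 0.0
        return sum(scenario_values) / (len(scenario_values) * max_value)

    # Example: five sample scenarios, with this item answered at value 5.
    print(aoe_quality([5, 3, 5, 2, 5]))  # 0.8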

Data Notice: Sample-test results are directional indicators only. For a verified Skills Passport credential, take the full 40-scenario assessment.

Why hallucination is the canonical AI failure mode

Hallucination — confidently generating false content that sounds correct — is structurally different from other AI failure modes because it’s hardest to detect downstream. When a model generates obviously wrong content, downstream review catches it; when a model generates fluent, confident content that’s subtly wrong, downstream review often misses it. The asymmetry produces compounding trust loss: customers who encounter even one hallucination start distrusting otherwise-correct outputs from the same system.

Strong AOE practitioners treat hallucination as a primary concern, design evaluation specifically to surface it, and design products to fail visibly rather than silently when hallucination occurs. The fail-visibly framing includes confidence scoring, citation grounding, and human-review checkpoints for high-stakes outputs.
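
A minimal sketch of a fail-visibly gate, in Python. The helper names, the substring grounding check, and the confidence threshold are all illustrative assumptions; production systems ground citations with retrieval and entailment checks rather than string matching:

    from dataclasses import dataclass

    @dataclass
    class ModelOutput:
        text: str
        citations: list[str]   # claims the model attributes to source documents
        confidence: float      # 0.0-1.0, produced by a separate scoring step

    def is_grounded(citation: str, source_docs: list[str]) -> bool:
        """Naive grounding check: the cited text must appear in some source document."""
        return any(citation in doc for doc in source_docs)

    def respond(output: ModelOutput, source_docs: list[str], min_confidence: float = 0.7) -> dict:
        ungrounded = [c for c in output.citations if not is_grounded(c, source_docs)]
        if ungrounded or output.confidence < min_confidence:
            # Fail visibly: route to human review instead of shipping an ungrounded claim.
            return {"status": "needs_review", "ungrounded_citations": ungrounded}
        return {"status": "ok", "text": output.text}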

  • Hallucination as a model-architecture artifact. Autoregressive language models generate plausible-sounding tokens based on statistical patterns in their training data, not on grounded fact-checking. Without retrieval grounding or explicit fact verification, all LLMs hallucinate at non-zero rates.
  • Retrieval-augmented generation (RAG). The dominant production architecture for reducing hallucination — the model retrieves relevant documents and grounds outputs in retrieved content rather than parametric knowledge alone.
  • Fluent vs grounded outputs. Fluency is a stylistic property; grounding is an epistemic property. Strong AI evaluators distinguish them reflexively.
  • Citation fabrication as adversarial test case. Strong eval sets include citation-fabrication test cases specifically; weak eval sets miss this systematic failure pattern.
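
A minimal example of a citation-fabrication test case for an eval set, in Python. The document text, the answer text, and the grader are all hypothetical, and the substring grader is deliberately naive; real graders use retrieval and entailment models:

    TERMS_OF_SERVICE = (
        "Returns are reviewed case by case; refunds may be offered at our discretion."
    )

    FABRICATED_ANSWER = (
        "Per the refund window outlined in your terms of service, "
        "you are eligible for a full refund within 30 days."
    )

    def asserts_refund_window(answer: str, source: str) -> bool:
        """Flag answers that assert a refund window the source never states."""
        return "refund window" in answer.lower() and "refund window" not in source.lower()

    # The eval set requires the grader to catch this fabrication:
    assert asserts_refund_window(FABRICATED_ANSWER, TERMS_OF_SERVICE)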

For broader treatment of how AOE fits into role-readiness scoring, see the AI fluency in hiring overview and the scoring methodology.


Sources

  • Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
  • Liang, P., Bommasani, R., Lee, T., et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
  • Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology. Psychological Bulletin, 124(2), 262–274.

Try the question yourself

This explainer covers what the item measures. To see how you score on the full AI Output Evaluation family, take the free 5-question sample.

Take the AI Output Evaluation sample