What does the "overconfident numeric claim" AOE scenario measure?

Note on framing: This is the aoe_sample_3 item-level explainer for the AOE (AI Output Evaluation) sample-test family. Construct-level coverage is in the aoe-evaluating-llm-output explainer; the canonical hallucination-detection item is documented in the aoe-hallucinated-citation explainer.

This scenario presents an LLM output that asserts a specific numeric claim — for example, “studies show 73% of remote workers report higher productivity” — without an attached source, footnote, or hedge. The output is fluent, the figure is plausible, and the surrounding text reads as authoritative. The candidate is asked to grade the response on the AOE graded-rubric ladder. The scenario probes a failure mode distinct from full hallucination: an overconfident numeric claim with no source attribution, a pattern that is both more common and more insidious than a fabricated citation because the absence of a citation is easier to miss when the number itself sounds reasonable.

What this question tests

The item targets a specific applied-AI evaluation skill: the ability to flag confidently stated numerical claims that lack verifiable provenance, even when the number is plausible and the surrounding text is well written. Unlike a fabricated citation — where the model invents a source — an overconfident numeric claim presents a figure as if it were common knowledge, giving the reader no opportunity to verify it. Strong AOE evaluators recognize this pattern reflexively because it is the form of LLM error most often shipped to production: the model internalizes a numeric “fact” from statistical patterns in its training data, restates it confidently, and the evaluator must catch the missing-provenance signal that a factually grounded claim would not have.

The skill is closer to investigative-journalism source-vetting than to traditional fact-checking. The evaluator does not need to know the right number; they need to recognize that the output has produced a number without giving any path to verification, and that this pattern — independent of whether the number is right — is itself the failure.
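To make that check concrete, here is a minimal Python sketch of the reviewer's mental move: find numeric claims, then look for any verification path nearby. The regexes and marker list are illustrative assumptions, not the AOE test's detector; real provenance checking needs far more than pattern matching.

    import re

    # Markers that give a reader some path to verification. Illustrative
    # assumptions, not an exhaustive list.
    CITATION_MARKERS = re.compile(
        r"\[\d+\]|\([A-Z][\w\s&.,-]* \d{4}\)|according to [A-Z]|https?://"
    )
    NUMERIC_CLAIM = re.compile(r"\b\d{1,3}(?:\.\d+)?%")

    def flag_unsourced_numbers(text: str) -> list[str]:
        """Return sentences that state a percentage with no verification path."""
        flagged = []
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if NUMERIC_CLAIM.search(sentence) and not CITATION_MARKERS.search(sentence):
                flagged.append(sentence)
        return flagged

    print(flag_unsourced_numbers(
        "Studies show 73% of remote workers report higher productivity."
    ))
    # -> the sentence is flagged: a confident number, no path to verify it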

Why this is the right answer (concrete worked example)

The correct grade on the AOE rubric is value 4 (significant miscalibration; would not ship). The reasoning has three strands:

First, the numeric claim is presented at high confidence (“studies show 73%”) with no attached citation, footnote, or qualifier. A respondent who accepts this output is implicitly treating LLM-generated numbers the same way they would treat numbers from a vetted research report — but the LLM has no research-report-grade provenance chain.

Second, the specific number is suspicious in a way that generalizes. Near-round percentages (70%, 75%, 80%) and suspiciously precise percentages (73%, 67%, 84%) are both common LLM hallucination patterns. The model’s training data contains many statistics phrased as “X% of [population] do [behavior]”; the autoregressive generator stitches plausible numbers into plausible sentence frames without grounding the specific value in any specific study. Bender et al.’s 2021 “stochastic parrots” framing applies directly: the model restates statistical patterns from its training data rather than producing newly grounded factual claims.
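A rough illustration of the two shapes; the modulo-5 cutoff is an assumption made for the sketch, not part of the AOE rubric:

    def number_suspicion(pct: float) -> str:
        """Classify a percentage by the two common hallucination shapes.

        Near-round values read like rounded folk statistics; oddly precise
        values borrow false authority from specificity. The cutoff is
        illustrative only.
        """
        if pct % 5 == 0:  # 70, 75, 80, ...
            return "near-round: reads like a rounded folk statistic"
        return "suspiciously precise: implies a study that is never cited"

    for value in (75.0, 73.0, 84.0):
        print(f"{value:.0f}% -> {number_suspicion(value)}")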

Third, the right downstream action is to reject the output and surface the pattern for retrieval-grounding or explicit-citation work. A response that adds a hedge (“studies suggest” instead of “studies show”) is insufficient because the underlying provenance gap remains; the output still gives the reader no path to verify. The fix is structural: ground the claim to a specific source the system can cite, or remove the number entirely.

A worked illustration: an output reads, “Companies that adopt remote work see a 23% reduction in turnover within the first year.” The number is plausible, the framing is clean, and a cursory reviewer would let it through. A strong AOE evaluator flags it because (a) no source is attached, (b) the number falls in the suspiciously precise range that LLMs hallucinate, and (c) “first year” is the kind of specific qualifier that real research would tie to a specific study design — its presence without a citation is itself a tell.
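Pulling the three tells together, a hypothetical evaluator checklist might look like the sketch below. The helper names, regexes, and qualifier list are assumptions layered on the reasoning above, not the actual AOE grading code.

    import re

    # Qualifiers that real research would tie to a study design; a small
    # assumed list for illustration.
    QUALIFIERS = ("within the first year", "per quarter", "on average")

    def aoe_tells(claim: str) -> dict[str, bool]:
        """The three tells from the worked example, as boolean checks."""
        pct = re.search(r"\b(\d{1,3}(?:\.\d+)?)%", claim)
        has_source = bool(re.search(r"\[\d+\]|\(.+\d{4}\)|https?://", claim))
        return {
            "unsourced_number": bool(pct) and not has_source,
            "suspicious_precision": bool(pct) and float(pct.group(1)) % 5 != 0,
            "qualifier_without_citation": any(q in claim for q in QUALIFIERS)
                                          and not has_source,
        }

    claim = ("Companies that adopt remote work see a 23% reduction "
             "in turnover within the first year.")
    print(aoe_tells(claim))
    # All three tells fire, matching the grade-4 call described above.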

What the wrong answers reveal

The other options on the graded rubric each map to a common evaluator failure mode:

  • Value 1 (major failure; reject immediately). Picking this option treats the overconfident numeric claim the same as a fabricated citation, which is too punitive. The scenario’s failure is real but recoverable through retrieval grounding; treating it as a major failure inflates the false-positive rate and wastes downstream prompt-engineering bandwidth on outputs that are fixable rather than fundamentally broken.
  • Value 2 (acceptable; ship with light editing). Picking this signals a ship-anyway evaluator pattern. In a customer-facing context, shipping fluent numeric claims without provenance is the highest-risk form of trust loss: the reader treats the output as authoritative, encounters the eventual contradiction, and generalizes distrust across the entire system.
  • Value 3 (minor accuracy concern). This option signals partial competence: the respondent recognizes that something is off but under-weights the systemic nature of the no-provenance pattern. Numeric claims without provenance are not minor — they propagate through downstream pipelines and decision-making, and customers rely on them as if they were grounded.

How the sample test scores you

In the AIEH 5-scenario AOE sample test, this item contributes one of five datapoints aggregated into the single aoe_quality score via the W3.2 normalize-by-count threshold. Scoring is graded per item: value 4 is the correct grade for this scenario, and adjacent grades earn partial credit because a near-miss still reflects part of the diagnostic pattern.
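A hedged reading of that rule in code; the 0.5 adjacent-grade weight and the mean-credit interpretation of normalize-by-count are assumptions for illustration, since the exact W3.2 parameters are not published in this explainer.

    def item_credit(chosen: int, correct: int = 4) -> float:
        """Graded credit: full for the correct grade, partial for adjacent.

        The 0.5 adjacent weight is an assumed value, not the published one.
        """
        return {0: 1.0, 1: 0.5}.get(abs(chosen - correct), 0.0)

    def aoe_quality(chosen_grades: list[int]) -> float:
        """Normalize-by-count read as: mean credit across attempted items."""
        return sum(item_credit(g) for g in chosen_grades) / len(chosen_grades)

    # A candidate who grades this item 3 (adjacent) and the rest correctly:
    print(aoe_quality([4, 4, 3, 4, 4]))  # -> 0.9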

Data Notice: Sample-test results are directional indicators only. A 5-scenario sample cannot reliably distinguish “internalized AOE rubric” from “pattern-matched on this specific scenario”; for a verified Skills Passport credential, take the full 40-scenario assessment when it ships.

The full assessment probes the four major output-failure categories — factual, calibration, fitness, stylistic — across 40 scenarios with combination cases that test whether the evaluator can disentangle multiple simultaneous failure modes. See the scoring methodology for how AOE scores map onto the AIEH 300–850 Skills Passport scale, and the tests catalog for current AOE sample availability.

Key concepts

  • Stochastic parrots framing. Bender, Gebru, McMillan-Major & Shmitchell’s 2021 paper articulates why language models produce confident-sounding statistical claims without underlying grounding — the autoregressive generator optimizes for plausibility, not for verifiable provenance. The framing is foundational for understanding why overconfident numeric claims are a structural rather than incidental LLM failure.
  • Provenance vs accuracy. Provenance is the property of having a verifiable source chain; accuracy is the property of being true. Strong AOE evaluators flag provenance gaps independent of accuracy because a no-provenance claim that happens to be true today is structurally unsafe — there is no mechanism to update it when the underlying fact changes.
  • Retrieval-augmented generation (RAG) for numeric grounding. The dominant production architecture for reducing no-provenance numeric claims: the model retrieves relevant documents and grounds numeric outputs in retrieved content rather than parametric knowledge. RAG converts overconfident numeric claims into citation-bearing claims that downstream reviewers can verify; a minimal sketch follows this list.
  • Calibration penalty in scoring. Some AOE rubrics weight miscalibrated-confidence errors more heavily than factual-accuracy errors because miscalibration compounds trust loss. An LLM that says “studies show 73%” with no citation but the number happens to be right has produced the same epistemic disservice as one whose number is wrong: the reader has no way to know which condition they are in.
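Below is a minimal sketch of the RAG conversion described in the third bullet; the Doc type, the naive keyword retriever, and the toy corpus are hypothetical stand-ins for a real retrieval backend.

    from dataclasses import dataclass

    @dataclass
    class Doc:
        source_id: str
        text: str

    def retrieve(query: str, corpus: list[Doc]) -> Doc | None:
        """Hypothetical retriever: naive keyword overlap stands in for a
        real dense or sparse retrieval backend."""
        terms = set(query.lower().split())
        scored = [(len(terms & set(d.text.lower().split())), d) for d in corpus]
        score, best = max(scored, key=lambda pair: pair[0])
        return best if score > 0 else None

    def ground_numeric_claim(claim: str, corpus: list[Doc]) -> str:
        """Attach a citation when a supporting document exists; otherwise
        drop the number, per the structural fix described earlier."""
        doc = retrieve(claim, corpus)
        if doc is not None:
            return f"{claim} [{doc.source_id}]"
        return "No grounded figure available; numeric claim removed."

    # A toy corpus for the example; the source id is fabricated.
    corpus = [Doc("acme-hr-2023", "Remote work and turnover: surveyed firms "
                                  "saw reduced turnover after adoption.")]
    print(ground_numeric_claim(
        "Remote work adoption reduced turnover in the surveyed cohort.", corpus))
    # -> "... in the surveyed cohort. [acme-hr-2023]"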

For broader treatment of how AOE fits into role-readiness scoring, see the AI fluency in hiring overview, the hire workflow page for evaluators, and the learn library for AOE training material.

Sources

  • Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21), 610–623.
  • Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1–38.
  • Liang, P., Bommasani, R., Lee, T., et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
  • Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262–274.

Try the question yourself

This explainer covers what the item measures. To see how you score on the full AI Output Evaluation family, take the free 5-question sample.

Take the AI Output Evaluation sample