From the AI Output Evaluation sample test
What does the "mixed-source conflation" AOE scenario measure?
Note on framing: This is the aoe_sample_5 item-level explainer for the AOE (AI Output Evaluation) sample-test family. Construct-level coverage is in the aoe-evaluating-llm-output explainer; the canonical hallucination-detection item is documented in the aoe-hallucinated-citation explainer.
This scenario presents an LLM output that synthesizes claims across two or more retrieved source documents and produces a unified, plausible-sounding answer that is not actually supported by any single source. The output reads coherently — the claims look like they came from somewhere — but on close inspection the synthesis has stitched together a fact from Source A, a qualifier from Source B, and an inference that neither source actually made. The candidate is asked to grade this output on the AOE rubric. The scenario probes mixed-source conflation, a failure mode that has become more prominent as retrieval-augmented generation (RAG) has become the dominant production AI architecture.
What this question tests
The item targets a specific second-order AOE skill: the ability to detect when a multi-document synthesis has produced a plausible-sounding-but-false claim by combining elements from sources that, individually, do not support the combined claim. This is structurally distinct from hallucination (where the model fabricates content with no source) and from faithfulness errors in single-source summarization (where the model distorts a single source). Mixed-source conflation is a synthesis-time failure that is specific to multi-document workflows and is harder to catch because each individual claim in the output traces to a real source.
The skill is sometimes described as provenance-chain auditing in the RAG-evaluation literature: the evaluator must ask, for each claim in the output, “which specific source supports this specific claim, and does the source actually support it in the way the output asserts?” Strong AOE evaluators audit claim-by-claim rather than synthesis-by-synthesis, which is the only way to catch conflation reliably. Ji et al.’s 2023 hallucination survey documents conflation as one of the more difficult failure modes to detect because its surface signal is identical to correct synthesis: confident, well-structured output with implicit citation to retrieved documents.
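As a concrete illustration of claim-by-claim auditing, the sketch below shows the shape of the loop. Everything here is hypothetical scaffolding: `Claim`, `audit_provenance`, and especially `supports`, which in a real audit would be an NLI entailment model or a human judgment rather than the placeholder shown.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    cited_source_id: str  # the source the output implies supports this claim

def supports(source_text: str, claim_text: str) -> bool:
    """Placeholder entailment check. A real audit would use an NLI model
    or a human judgment here, not a naive substring test."""
    return claim_text.lower() in source_text.lower()

def audit_provenance(claims: list[Claim], sources: dict[str, str]) -> list[Claim]:
    """Return every claim whose cited source does not actually entail it.

    The key discipline: audit claim-by-claim, never the synthesis as a
    whole. A conflated output passes a whole-output read because each
    fragment traces to a real source; it fails here because at least one
    combined claim has no single source that supports it as asserted.
    """
    return [
        c for c in claims
        if not supports(sources.get(c.cited_source_id, ""), c.text)
    ]
```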
Why this is the right answer (concrete worked example)
The correct grade is value 4 (significant miscalibration; would not ship until claim-by-claim provenance is verified). Mixed-source conflation produces outputs whose combined claims are ungrounded even when the upstream retrieval was correct and each individual element traces to a real source, and the right downstream action is to audit each claim against its purported source rather than accepting the synthesis on its surface coherence.
A worked illustration: suppose the user asks a RAG system, “What is the recommended dosage of ibuprofen for adults with chronic kidney disease?” The retrieval surfaces two documents. Source A is a general adult ibuprofen dosing guideline that recommends 200–400 mg every 4–6 hours. Source B is a kidney disease pharmacology review that notes ibuprofen should be avoided or used cautiously in chronic kidney disease patients because of nephrotoxicity risk, without giving a specific modified dose.
A conflated output reads: “Adults with chronic kidney disease can take 200–400 mg of ibuprofen every 4–6 hours, with caution due to kidney effects.” This output is fluent, both sources are real and were retrieved, and on quick reading it sounds authoritative. It is also wrong: Source A’s dose is for general adults, not chronic kidney disease patients; Source B did not endorse that dose for the chronic kidney disease population; and the synthesis has manufactured a clinical recommendation that no source actually made. The conflation is the failure.
A strong AOE evaluator catches this through claim-by-claim provenance auditing: the dosage claim cannot trace cleanly to either source, because Source A’s population was different and Source B did not give a dosage. The synthesis has combined elements that appear to fit but produce a clinical claim no source supports. The right grade is value 4 — the output’s individual building blocks were grounded, but the combined claim is not.
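Reusing the `Claim` dataclass from the sketch above, the ibuprofen output decomposes as follows. The decomposition is illustrative, and the pass/fail annotations reflect what a human or NLI entailment check would conclude, not the placeholder `supports` function.

```python
sources = {
    "A": "General adult ibuprofen dosing: 200-400 mg every 4-6 hours.",
    "B": "Ibuprofen should be avoided or used cautiously in chronic "
         "kidney disease patients because of nephrotoxicity risk.",
}

# The conflated output, decomposed into the claims it implicitly makes:
claims = [
    # Fails a real entailment check: Source A's population is general
    # adults, not CKD patients, and Source B never endorses any dose.
    Claim("Adults with chronic kidney disease can take 200-400 mg of "
          "ibuprofen every 4-6 hours",
          cited_source_id="A"),
    # Passes: faithfully drawn from Source B.
    Claim("Caution is warranted due to kidney effects",
          cited_source_id="B"),
]
```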
The fix is structural rather than stylistic. Mixed-source conflation is reduced by retrieval architectures that preserve claim-level attribution (so the output is forced to cite which source supports which sentence), by prompt design that constrains the model to refuse cross-source synthesis without explicit attribution, and by evaluation rubrics that audit claim-by-claim rather than output-by-output.
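One way to realize the prompt-design constraint is a system instruction that makes per-sentence attribution mandatory and forbids uncited cross-source synthesis. The wording below is an illustrative template, not a tested production prompt:

```python
ATTRIBUTED_SYNTHESIS_PROMPT = """\
Answer using ONLY the numbered sources below. After every sentence,
cite the source number(s) that directly support that exact sentence,
e.g. [2]. If answering would require combining sources into a claim
that no single source states, do not make the claim; say instead that
the sources do not answer the question.

Sources:
{numbered_sources}

Question: {question}
"""
```

Under a template like this, the conflated ibuprofen recommendation cannot be emitted with a valid citation, which is the point: attribution becomes a first-class property of the output rather than an after-the-fact audit.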
What the wrong answers reveal
The graded rubric has characteristic failure modes for this scenario:
- Value 5 (acceptable; ship with editing). Picking this signals that the respondent has accepted the surface coherence of the synthesis without auditing the claim-by-source mapping. This is the dominant evaluator failure for conflation scenarios because the output looks identical to correct multi-document synthesis.
- Value 3 (minor accuracy concern). This grade under-weights the structural nature of conflation. In a clinical-recommendation context, a conflated synthesis is not a minor error; it is a fabricated recommendation presented as evidence-backed. The error arises from the synthesis process itself, not from a surface slip.
- Value 1 (major failure; reject and retrain). Picking this option treats conflation the same as hallucination, which is too punitive. Conflation is recoverable through retrieval-architecture changes and claim-level attribution prompts; rejection-and-retrain is the response to systemic parametric-knowledge failures, not synthesis failures.
The correct grade (value 4) preserves the diagnostic information that the failure is structural and recoverable, distinct from both hallucination and minor accuracy error.
How the sample test scores you
In the AIEH 5-scenario AOE sample test, this item contributes one of five datapoints aggregated into the single aoe_quality score via the W3.2 normalize-by-count threshold. Scoring is graded per item, with partial credit for adjacent grades, reflecting that conflation detection is one of the harder AOE skills to develop.
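The exact W3.2 weights are not reproduced here; a plausible adjacent-grade partial-credit scheme, with `adjacent_credit=0.5` assumed purely for illustration, looks like:

```python
def item_score(candidate_grade: int, key_grade: int,
               adjacent_credit: float = 0.5) -> float:
    """Illustrative per-item scoring with partial credit for adjacent
    grades. The 0.5 weight is an assumption, not the published value."""
    distance = abs(candidate_grade - key_grade)
    if distance == 0:
        return 1.0
    if distance == 1:
        return adjacent_credit
    return 0.0

# Aggregating the five sample items into aoe_quality, assuming equal
# weighting (the normalize-by-count step divides by the item count):
def aoe_quality(item_scores: list[float]) -> float:
    return sum(item_scores) / len(item_scores)

# For this item, key_grade is 4: grading it 5 or 3 earns partial
# credit; grading it 1 earns none.
```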
Data Notice: Sample-test results are directional indicators only. Mixed-source conflation is one of the AOE failure modes most correlated with on-the-job RAG evaluation work; respondents who grade this scenario correctly tend to perform well on full-assessment RAG-specific scenarios. For a verified Skills Passport credential, take the full 40-scenario assessment.
See the scoring methodology for how AOE scores map onto the AIEH 300–850 Skills Passport scale.
Related concepts
- Provenance-chain auditing. The discipline of tracing each claim in an output to a specific source statement that supports it. Strong AOE evaluators audit at claim-granularity rather than output-granularity.
- RAG attribution architectures. Retrieval architectures that force the model to cite which source supports which sentence, sometimes called “grounded generation” or “attributed QA” in the literature. Attribution architectures reduce conflation by making provenance a first-class property of the output; a minimal data-structure sketch follows this list.
- Cross-document inference. The legitimate cousin of conflation: outputs that combine claims from multiple sources to support a synthesis the sources collectively imply but no single source states. Strong AOE evaluators distinguish legitimate cross-document inference (where each premise is faithfully drawn from its source) from conflation (where premises are stitched together in ways the sources do not support).
- Faithfulness in multi-document summarization. The multi-document analog of single-source faithfulness; metrics like FactCC and FRANK extend to multi-document cases but are noisier because the legitimate synthesis space is larger.
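As referenced in the attribution-architectures entry above, a minimal data-structure sketch of attributed output might look like the following; the type names are hypothetical, not from any particular library:

```python
from dataclasses import dataclass, field

@dataclass
class AttributedSentence:
    text: str
    source_ids: list[str] = field(default_factory=list)

@dataclass
class AttributedAnswer:
    sentences: list[AttributedSentence]

    def conflation_candidates(self) -> list[AttributedSentence]:
        # Any sentence without a supporting source is, by construction,
        # either unattributed synthesis or conflation; attribution
        # architectures make this check mechanical.
        return [s for s in self.sentences if not s.source_ids]
```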
For broader treatment of how AOE fits into role-readiness scoring, see the AI fluency in hiring overview, the hire workflow page for AI-evaluator hiring, and the interview question design guide for evaluator-role interviewing.
Sources
- Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of FAccT ’21, 610–623.
- Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1–38.
- Liang, P., Bommasani, R., Lee, T., et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
- Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology. Psychological Bulletin, 124(2), 262–274.