What does the "fuzzy product ask" ACL scenario measure?

Note on framing: This is the first item-level explainer for the ACL (AI Collaboration Literacy) family — distinct from the construct-level AOE explainer which covers the AOE construct without referencing a specific item. Item-level explainers for the AI-native sample-test families (ACL and eventually AOE) follow the scenario-ladder pattern documented in the Communication scenario explainer, adapted to AI-collaboration scenarios. Two structural variants are stacked here: first item-level explainer for the family, AND scenario-ladder pattern in the AI-native domain. This explainer documents the framing for future ACL-family item explainers to inherit.

What this scenario measures

This scenario — a fuzzy product ask (“the assistant should refuse to give specific medical advice but should be helpful for general health questions”) with a multiple-choice deliverable response — measures eval design as the highest-leverage first deliverable for ambiguous AI product specs. Specifically, the item probes whether the respondent recognizes that:

The product ask is too fuzzy to ship against directly without intermediate work.
The intermediate work that has the most downstream leverage is authoring an evaluation rubric (a graded example set with pass/fail criteria), not iterating on the prompt or escalating to legal review.
The eval-first pattern is what distinguishes production AI work from prompt-tinkering work.

Unlike Big Five personality items, which measure trait-level dispositions across contexts, ACL scenario items measure situation-specific judgment — the recognition that a particular response to a particular AI-collaboration situation captures the underlying skill more reliably than alternative responses. Strong AI collaborators get the eval-first recognition reflexively; weaker ones tend to fail in two predictable directions: prompt-iteration-first (under- specifying the success criteria before iterating) or spec-document-first (over-specifying without verifiable pass/fail criteria the team can run).

Why this scenario captures ACL skill well

The scenario is doing real work as an item because it forces a choice between four genuinely-on-the-table responses, only one of which captures the production-AI eval-first pattern. Three specific properties make the dual-constraint structure diagnostic:

The fuzziness is realistic. Real product asks for AI features routinely arrive with the kind of “should refuse X but be helpful for Y” framing the scenario uses. The candidate’s first-deliverable choice reflects how they actually approach this on the job.
The graded option ladder catches direction-of-failure. The scoring uses calibrated quality values (5/3/2/1) rather than binary right/wrong. A respondent who picks the prompt- iteration option (value 3) demonstrates partial competence — they recognize that prompts matter, but under-weight the eval-as-spec discipline. A respondent who picks the legal-escalation option (value 2) defers the engineering judgment that’s actually their work. The ladder distinguishes these failures in a way binary scoring cannot.
The best response models a teachable pattern. “100-row graded eval set with edge cases + pass-rate threshold” is not just the right answer for this scenario — it’s a generalizable template (graded examples + threshold) that applies to most production-AI scoping work. Strong respondents recognize the pattern; weaker respondents pattern-match to surface features (writing-prompts-first, escalating-to-legal) without internalizing the eval-as-deliverable principle.

The published HELM evaluation framework (Liang et al., 2022) and Anthropic’s Constitutional AI work (Bai et al., 2022) document the eval-driven-iteration pattern across multiple model families. Scenarios like this one reward the candidate’s ability to translate that pattern into product-specific deliverables.

What the best response shows (and doesn’t)

Picking the value-5 option demonstrates situation-specific eval-first judgment — but it does not demonstrate broader ACL skill in the trait sense. Three specific misconceptions worth flagging:

Picking the right option ≠ being a strong AI collaborator generally. A respondent can pattern-match to one well-known template (eval-first scoping) without internalizing the underlying generative principle. Stronger predictors of general ACL skill come from the full 40-scenario assessment, which probes the same eval-design judgment across diverse contexts (multi-step agent workflows, retrieval-augmented systems, fine-tuning trade-offs, etc.).
Picking a lower-tier option ≠ being a weak AI collaborator. Real production-AI work includes dimensions the scenario doesn’t measure (prompt-engineering craft, model-behavior intuition, output evaluation discipline). A respondent strong on those dimensions but weaker on the eval-first pattern can still be a competent AI collaborator in roles where the eval-design work is delegated or upstream.
The best response isn’t context-universal. In some contexts (regulated industries with strict pre-launch AI-feature compliance review), legal-escalation as a parallel-track first deliverable IS appropriate; the scenario’s value-5 framing assumes a typical SaaS-product context where engineering owns the eval-design work primarily. Other contexts have different correct patterns.

How the sample test scores you

In the AIEH 5-scenario ACL sample, this scenario contributes one of the five datapoints that aggregate into your single acl_quality score. The W3.2 scoring fix normalizes by item count, so your score is the average of your five scenario values mapped onto a 1–5 scale, then bucketed into low (≤2), mid (≤4), or high (>4) for the directional result.

Data Notice: Sample-test results are directional indicators only. Five-scenario ACL samples are too few to be psychometrically valid; for a verified Skills Passport credential, take the full 40-scenario assessment.

The full 40-scenario assessment expands coverage across more diverse AI-collaboration contexts and produces a calibrated score on the AIEH 300–850 scale via the scoring methodology. For the broader AI-fluency construct treatment of how ACL fits with related skills, see the AI fluency in hiring overview.

Eval-driven development. The broader software-engineering practice of authoring tests before or alongside implementation; in AI work the “tests” are graded eval rubrics. The HELM and Constitutional AI papers operationalize this for language-model evaluation specifically.
Spec-as-eval pattern. The recognition that a spec for an AI feature is most useful when it includes the eval that verifies whether the model meets it. Distinguishing spec from eval is one of the highest-leverage senior-AI-PM and prompt-engineer skills.
Eval design vs eval application. ACL targets the upstream skill (designing the rubric); AOE targets the downstream skill (applying the rubric to specific outputs). Both matter; the role-bundle weights for Prompt Engineer and AI PM weight ACL slightly higher than AOE because eval-rubric authoring is the higher-leverage skill.
Adversarial eval coverage. The discipline of including adversarial test cases (cases designed to surface failure modes) alongside happy-path cases. Strong eval rubrics intentionally include adversarial coverage; weak ones default to happy-path-only and miss subtle failure modes.

For role-specific bundles where ACL is highly weighted, see the Prompt Engineer role page and the AI Product Manager role page.

Sources

Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
Liang, P., Bommasani, R., Lee, T., et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv preprint arXiv:2211.09110.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology. Psychological Bulletin, 124(2), 262–274.

What this scenario measures

Why this scenario captures ACL skill well

What the best response shows (and doesn’t)

How the sample test scores you

Related concepts

Sources

Try the question yourself