From the acl prompt to spec sample test
What does the "fuzzy product ask" ACL scenario measure?
Note on framing: This is the first item-level explainer for the ACL (AI Collaboration Literacy) family — distinct from the construct-level AOE explainer which covers the AOE construct without referencing a specific item. Item-level explainers for the AI-native sample-test families (ACL and eventually AOE) follow the scenario-ladder pattern documented in the Communication scenario explainer, adapted to AI-collaboration scenarios. Two structural variants are stacked here: first item-level explainer for the family, AND scenario-ladder pattern in the AI-native domain. This explainer documents the framing for future ACL-family item explainers to inherit.
What this scenario measures
This scenario — a fuzzy product ask (“the assistant should refuse to give specific medical advice but should be helpful for general health questions”) with a multiple-choice deliverable response — measures eval design as the highest-leverage first deliverable for ambiguous AI product specs. Specifically, the item probes whether the respondent recognizes that:
- The product ask is too fuzzy to ship against directly without intermediate work.
- The intermediate work that has the most downstream leverage is authoring an evaluation rubric (a graded example set with pass/fail criteria), not iterating on the prompt or escalating to legal review.
- The eval-first pattern is what distinguishes production AI work from prompt-tinkering work.
Unlike Big Five personality items, which measure trait-level dispositions across contexts, ACL scenario items measure situation-specific judgment — the recognition that a particular response to a particular AI-collaboration situation captures the underlying skill more reliably than alternative responses. Strong AI collaborators get the eval-first recognition reflexively; weaker ones tend to fail in two predictable directions: prompt-iteration-first (under- specifying the success criteria before iterating) or spec-document-first (over-specifying without verifiable pass/fail criteria the team can run).
Why this scenario captures ACL skill well
The scenario is doing real work as an item because it forces a choice between four genuinely-on-the-table responses, only one of which captures the production-AI eval-first pattern. Three specific properties make the dual-constraint structure diagnostic:
- The fuzziness is realistic. Real product asks for AI features routinely arrive with the kind of “should refuse X but be helpful for Y” framing the scenario uses. The candidate’s first-deliverable choice reflects how they actually approach this on the job.
- The graded option ladder catches direction-of-failure. The scoring uses calibrated quality values (5/3/2/1) rather than binary right/wrong. A respondent who picks the prompt- iteration option (value 3) demonstrates partial competence — they recognize that prompts matter, but under-weight the eval-as-spec discipline. A respondent who picks the legal-escalation option (value 2) defers the engineering judgment that’s actually their work. The ladder distinguishes these failures in a way binary scoring cannot.
- The best response models a teachable pattern. “100-row graded eval set with edge cases + pass-rate threshold” is not just the right answer for this scenario — it’s a generalizable template (graded examples + threshold) that applies to most production-AI scoping work. Strong respondents recognize the pattern; weaker respondents pattern-match to surface features (writing-prompts-first, escalating-to-legal) without internalizing the eval-as-deliverable principle.
The published HELM evaluation framework (Liang et al., 2022) and Anthropic’s Constitutional AI work (Bai et al., 2022) document the eval-driven-iteration pattern across multiple model families. Scenarios like this one reward the candidate’s ability to translate that pattern into product-specific deliverables.
What the best response shows (and doesn’t)
Picking the value-5 option demonstrates situation-specific eval-first judgment — but it does not demonstrate broader ACL skill in the trait sense. Three specific misconceptions worth flagging:
- Picking the right option ≠ being a strong AI collaborator generally. A respondent can pattern-match to one well-known template (eval-first scoping) without internalizing the underlying generative principle. Stronger predictors of general ACL skill come from the full 40-scenario assessment, which probes the same eval-design judgment across diverse contexts (multi-step agent workflows, retrieval-augmented systems, fine-tuning trade-offs, etc.).
- Picking a lower-tier option ≠ being a weak AI collaborator. Real production-AI work includes dimensions the scenario doesn’t measure (prompt-engineering craft, model-behavior intuition, output evaluation discipline). A respondent strong on those dimensions but weaker on the eval-first pattern can still be a competent AI collaborator in roles where the eval-design work is delegated or upstream.
- The best response isn’t context-universal. In some contexts (regulated industries with strict pre-launch AI-feature compliance review), legal-escalation as a parallel-track first deliverable IS appropriate; the scenario’s value-5 framing assumes a typical SaaS-product context where engineering owns the eval-design work primarily. Other contexts have different correct patterns.
How the sample test scores you
In the AIEH 5-scenario ACL sample, this scenario contributes one of the five datapoints that aggregate into your single acl_quality score. The W3.2 scoring fix normalizes by item count, so your score is the average of your five scenario values mapped onto a 1–5 scale, then bucketed into low (≤2), mid (≤4), or high (>4) for the directional result.
Data Notice: Sample-test results are directional indicators only. Five-scenario ACL samples are too few to be psychometrically valid; for a verified Skills Passport credential, take the full 40-scenario assessment.
The full 40-scenario assessment expands coverage across more diverse AI-collaboration contexts and produces a calibrated score on the AIEH 300–850 scale via the scoring methodology. For the broader AI-fluency construct treatment of how ACL fits with related skills, see the AI fluency in hiring overview.
Related concepts
- Eval-driven development. The broader software-engineering practice of authoring tests before or alongside implementation; in AI work the “tests” are graded eval rubrics. The HELM and Constitutional AI papers operationalize this for language-model evaluation specifically.
- Spec-as-eval pattern. The recognition that a spec for an AI feature is most useful when it includes the eval that verifies whether the model meets it. Distinguishing spec from eval is one of the highest-leverage senior-AI-PM and prompt-engineer skills.
- Eval design vs eval application. ACL targets the upstream skill (designing the rubric); AOE targets the downstream skill (applying the rubric to specific outputs). Both matter; the role-bundle weights for Prompt Engineer and AI PM weight ACL slightly higher than AOE because eval-rubric authoring is the higher-leverage skill.
- Adversarial eval coverage. The discipline of including adversarial test cases (cases designed to surface failure modes) alongside happy-path cases. Strong eval rubrics intentionally include adversarial coverage; weak ones default to happy-path-only and miss subtle failure modes.
For role-specific bundles where ACL is highly weighted, see the Prompt Engineer role page and the AI Product Manager role page.
Sources
- Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
- Liang, P., Bommasani, R., Lee, T., et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv preprint arXiv:2211.09110.
- Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology. Psychological Bulletin, 124(2), 262–274.