From the ACL prompt-to-spec sample test
What does the "vague feature ask to runnable spec" ACL scenario measure?
This is the second item-level explainer for the ACL (AI Collaboration Literacy) prompt-to-spec family. Where the first family item probed eval-design-as-first-deliverable, this one probes the upstream skill that has to land before any eval can be written: translating a vague product ask — the canonical “make checkout faster” framing — into a runnable spec the implementation team can verify against. The scenario measures whether the respondent can convert ambiguity into operational constraints, acceptance criteria, and unambiguous outputs without losing what the original ask was actually pointing at.
What this question tests
This scenario — a product manager arrives with the request “make checkout faster” and asks the AI collaborator to draft the implementation spec — measures the prompt-to-spec discipline of converting a vague feature ask into runnable constraints with verifiable acceptance criteria. Specifically, the item probes whether the respondent recognizes that:
- The ask as stated is not a spec; “faster” without a metric, a baseline, a target threshold, and a population is unimplementable.
- The right first move is structured elicitation — converting “faster” into a measurable target (e.g., median checkout completion time, p95 latency, time-to-first-payable-action) anchored to a baseline and a delta.
- The output of the prompt-to-spec step is a document the implementation team can run a test against: defined constraints, defined acceptance criteria, defined non-goals, and unambiguous output schemas where the feature interacts with downstream systems.
Unlike Big Five personality items, which measure trait-level dispositions across contexts, ACL scenario items measure situation-specific judgment: whether the respondent can recognize the response that best exercises the underlying skill in a particular AI-collaboration situation. Strong AI collaborators reflexively widen the ask into measurable structure before drafting; weaker ones tend to fail in three predictable directions: jumping straight to implementation guesses (“I’ll add a database index on orders”), over-elaborating with unverifiable language (“checkout will feel responsive and modern”), or punting the elicitation back to the PM without doing the structuring work.
Why this is the right answer
The value-5 response converts “make checkout faster” into a runnable spec by doing five things in sequence in a single artifact. A worked example clarifies what this looks like in practice.
A senior engineer responding to “make checkout faster” drafts a spec with: (1) a measured baseline — “current p50 checkout completion time is 4.8s, p95 is 11.2s, measured on desktop+mobile across the last 30 days from order-button-click to payment-confirmation”; (2) a target delta — “reduce p50 to ≤2.5s and p95 to ≤6.0s, measured the same way, sustained over a 14-day post-launch window”; (3) a constraint envelope — “no regression in checkout conversion rate (currently 71.4%); no regression in payment fraud rate; no new third-party network dependencies in the critical path”; (4) acceptance criteria as a runnable test — “ship behind a 10% holdout; measure the four metrics above; promote to 100% only if p50/p95 targets hit AND conversion + fraud guardrails hold”; (5) explicit non-goals — “this scope does not cover cart abandonment recovery, address-validation UX, or payment-method expansion; those are tracked separately.”
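The same artifact is easier to see rendered as structured data. The sketch below is illustrative only: the schema and field names are hypothetical, not a format the assessment or the worked example prescribes; the thresholds are the ones from the paragraph above.

```python
# Hypothetical schema for the worked spec above; field names are
# illustrative, the numbers are the worked example's.
CHECKOUT_SPEC = {
    "baseline": {
        "p50_s": 4.8, "p95_s": 11.2,
        "window": "last 30 days, desktop + mobile",
        "measured_from": "order-button-click",
        "measured_to": "payment-confirmation",
    },
    "targets": {
        "p50_s": 2.5,        # reduce p50 to <= 2.5s
        "p95_s": 6.0,        # reduce p95 to <= 6.0s
        "sustain_days": 14,  # sustained over a 14-day post-launch window
    },
    "guardrails": {
        # Nothing may regress. The worked example gives no number for the
        # fraud rate, so it is expressed here as a no-regression flag.
        "conversion_rate_min": 0.714,
        "fraud_rate_must_not_regress": True,
        "new_third_party_deps_in_critical_path": 0,
    },
    "rollout": {"holdout_pct": 10, "promote_to_pct": 100},
    "non_goals": [
        "cart abandonment recovery",
        "address-validation UX",
        "payment-method expansion",
    ],
}
```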
That spec is now verifiable. The implementation team can write code against it. The QA team can write tests against it. A future product review can ask “did we hit the targets?” and get a binary answer. Most importantly, an AI collaborator asked to assist with the implementation has unambiguous inputs to work from — the spec tells the model what success looks like, what the guardrails are, and what’s out of scope, all of which are necessary inputs to high-quality AI-assisted code or analysis.
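A minimal sketch of that binary answer, assuming the thresholds from the worked spec; the function and its signature are invented for illustration, not part of the scenario:

```python
def promote_to_full_rollout(p50_s: float, p95_s: float,
                            conversion_rate: float,
                            fraud_regressed: bool) -> bool:
    """Launch gate from the worked spec: promote the 10% holdout to 100%
    only if both latency targets hit AND both guardrails hold."""
    targets_hit = p50_s <= 2.5 and p95_s <= 6.0
    guardrails_hold = conversion_rate >= 0.714 and not fraud_regressed
    return targets_hit and guardrails_hold

# A future product review gets its binary answer:
promote_to_full_rollout(p50_s=2.3, p95_s=5.8,
                        conversion_rate=0.721, fraud_regressed=False)  # True
```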
The published software-engineering literature on requirements quality (Wiegers & Beatty, 2013) and the human-AI collaboration research on intent-specification clarity (Vaithilingam et al., 2022) both document that ambiguity in the upstream spec compounds downstream — a vague spec produces vague AI output produces vague review feedback produces a vague shipped feature. The prompt-to-spec discipline is upstream of all the AI collaboration work that follows; getting it wrong early invalidates the rest.
What the wrong answers reveal
The graded option ladder catches the three failure directions the scenario is designed to discriminate between:
- Implementation-jump (value 3). “I’ll profile the checkout flow and add database indexes where the p95 query latency is highest.” This response demonstrates partial competence — the respondent recognizes that performance work needs measurement, but skips the spec layer entirely and starts implementing against an inferred goal. The failure mode is real: the team ships index changes, p50 improves, p95 doesn’t, and the PM says “but checkout still feels slow on mobile” because the actual user-perceived bottleneck was time-to-interactive on the address form, not query latency. The prompt-to-spec layer would have caught this.
- Vague-elaboration (value 2). “Checkout should feel responsive, modern, and frictionless. Users should experience a seamless flow with minimal cognitive load.” This response signals an aversion to specificity that produces unverifiable specs. None of these terms are measurable; none of them gate a launch decision. A spec written in this register is rhetorical scaffolding around the original ambiguity, not a resolution of it.
- Punt-to-PM (value 1). “Can you tell me what ‘faster’ means? What metric, what target, on what population?” This response defers the elicitation work that’s actually the AI collaborator’s job. PMs are usually capable of answering these questions when prompted, but the prompt-to-spec skill being tested is the ability to propose a structured answer the PM can confirm or adjust — not to bounce the work back. Strong collaborators draft the spec and ask “does this match what you meant?”; weak collaborators ask “what did you mean?” and wait.
How the sample test scores you
In the AIEH 5-scenario ACL prompt-to-spec sample, this scenario contributes one of the five datapoints that aggregate into your single acl_quality score. The W3.2 scoring fix normalizes by item count, so your score is the average of your five scenario values on the 1–5 scale, bucketed into low (≤2), mid (≤4), or high (>4) for the directional result. Each item in the sample is scored on the same 1–5 graded ladder before being aggregated; the normalize-by-count step ensures a respondent who skips items isn’t artificially advantaged.
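In code, the arithmetic looks roughly like this. A sketch under stated assumptions: the bucket thresholds are the ones above, but the skip-handling behavior is inferred from the normalize-by-count description, and the function name is invented.

```python
def acl_quality(scenario_values: list, items_in_sample: int = 5):
    """Normalize-by-count scoring sketch: average the answered scenario
    values (each 1-5), dividing by the full item count so that skipping
    items cannot raise the score. Bucket thresholds follow the text above."""
    score = sum(scenario_values) / items_in_sample
    if score <= 2:
        bucket = "low"
    elif score <= 4:
        bucket = "mid"
    else:
        bucket = "high"
    return score, bucket

acl_quality([5, 3, 4, 4, 5])  # (4.2, 'high')
acl_quality([5, 5, 5])        # (3.0, 'mid'): two skips drag the average down
```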
Data Notice: Sample-test results are directional indicators only. A five-scenario ACL sample contains too few items to be psychometrically valid; for a verified Skills Passport credential, take the full 40-scenario assessment.
The full 40-scenario assessment expands coverage across more diverse AI-collaboration contexts (multi-agent workflows, RAG-system specs, fine-tuning trade-off documents, production-eval rubrics) and produces a calibrated score on the AIEH 300–850 scale via the scoring methodology. For the broader construct treatment, see the AI fluency in hiring overview.
Related concepts
- Spec-as-test pattern. A spec that can be turned into a passing/failing test is a strong spec; a spec that can’t is rhetoric. The acceptance-criteria-as-runnable-test framing is the discipline that produces the former.
- Operational definition. Borrowed from measurement theory — the practice of defining a fuzzy concept (“faster”) in terms of a specific procedure for measuring it. Strong prompt-to-spec work produces operational definitions for every fuzzy term the original ask contained; a measurement sketch follows this list.
- Constraint envelope. The set of guardrails (what cannot regress) that bound the optimization. Performance asks routinely have implicit guardrails (don’t break conversion, don’t break fraud detection) that have to become explicit in the spec or they get sacrificed under pressure.
- Non-goals as scope discipline. Naming what the spec does not cover is as load-bearing as naming what it does. Most scope creep originates in the absence of explicit non-goals.
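The measurement sketch promised in the operational-definition bullet: what “faster” looks like once it is a procedure rather than an adjective. The event names and the tuple layout are assumptions borrowed from the worked example, not a real analytics schema.

```python
import statistics

def checkout_completion_seconds(events):
    """Operational definition of "faster" from the worked spec: per session,
    elapsed time from order-button-click to payment-confirmation.
    `events` is assumed to be a list of (session_id, event_name, timestamp)
    tuples; the event names are the worked example's, not a real schema."""
    clicks, confirms = {}, {}
    for session_id, name, ts in events:
        if name == "order-button-click":
            clicks[session_id] = ts
        elif name == "payment-confirmation":
            confirms[session_id] = ts
    # Only sessions that completed checkout contribute a duration.
    durations = [confirms[s] - clicks[s] for s in confirms if s in clicks]
    p50 = statistics.median(durations)
    p95 = statistics.quantiles(durations, n=20)[-1]  # 95th percentile cut
    return p50, p95
```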
For role-specific bundles where ACL is highly weighted, see the structured interview design overview and the broader interview question design page, both of which treat the prompt-to-spec skill as a foundational component of AI-augmented role evaluation.
Sources
- Vaithilingam, P., Zhang, T., & Glassman, E. L. (2022). Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, Article 332.
- Wiegers, K., & Beatty, J. (2013). Software Requirements (3rd ed.). Microsoft Press.
- Liang, P., Bommasani, R., Lee, T., et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv preprint arXiv:2211.09110.
- Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology. Psychological Bulletin, 124(2), 262–274.