Work-Sample Tests: Validity Evidence and Practical Use in Selection
Work-sample tests ask candidates to perform a representative slice of the actual job — write a short function, draft a customer email, debug a real codebase, or design a small component — under controlled conditions. Among the dozens of selection methods studied in the personnel-psychology literature, work-sample tests consistently sit near the top of the validity rankings, with Schmidt and Hunter’s 1998 meta-analytic synthesis reporting a corrected operational validity coefficient of roughly 0.54 against supervisor performance ratings.
This article walks through what a work-sample test is, why the validity evidence is so strong, how to design one well, the common implementation pitfalls, and how AIEH’s portable-credential model treats work-sample evidence as a high-weight source in the domain pillar of the Skills Passport.
Data Notice: Validity coefficients cited here reflect peer-reviewed meta-analytic evidence at time of writing. Specific weights AIEH applies to work-sample evidence in the domain pillar are documented in the scoring methodology and may evolve as calibration data accrues.
What counts as a work-sample test
A work-sample test is a standardized, scoreable performance of a job-relevant task. The defining property is that the candidate is asked to do — not describe, not list, not discuss — a representative piece of the job. Examples that meet the definition:
- A backend engineer is given a small repository and asked to fix a documented bug within a two-hour window.
- A customer-success representative is given an email from a frustrated customer and asked to draft a response.
- A copy editor is given a 600-word draft and asked to return it edited within thirty minutes.
- A data analyst is given a dataset and a question and asked to produce a chart and short interpretation.
The contrast is with methods that probe knowledge or disposition rather than performance. A multiple-choice test of Python syntax is a job-knowledge test, not a work-sample test. An interview question that asks “tell me about a time you debugged a hard problem” is a behavioral interview, not a work sample. The distinction matters because the validity evidence is method-specific.
Why work samples predict so well
Schmidt and Hunter’s 1998 meta-analytic synthesis reported a corrected operational validity of roughly 0.54 for work-sample tests against supervisor performance ratings — among the highest single-method coefficients in the literature, comparable to general mental ability. Roth, Bobko, and McFarland’s 2005 meta-analysis, which examined the work-sample literature specifically, arrived at a more conservative corrected estimate of roughly 0.33 after addressing sampling and correction issues in the earlier studies. The 2016 update by Schmidt, Oh, and Shaffer incorporated that downward revision but still placed work-sample tests among the stronger single methods in the rankings.
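One way to calibrate intuition for coefficients at this level: squaring a validity coefficient gives the share of criterion variance the predictor accounts for.

$$ r = 0.54 \Rightarrow r^2 \approx 0.29 \qquad\qquad r = 0.33 \Rightarrow r^2 \approx 0.11 $$

Even the conservative estimate leaves a work sample accounting for more criterion variance than most single methods manage.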
The mechanism behind the validity is intuitive. The predictor and the criterion are samples drawn from the same domain — the candidate is performing a slice of the job, and the job-performance criterion is performing the job. Sackett and Lievens (2008) frame this as point-to-point correspondence: the closer the predictor content matches the criterion content, the higher the expected validity ceiling. Work-sample tests sit at the upper end of point-to-point correspondence among selection methods.
A second mechanism is reduced contamination from non-job-relevant sources. A behavioral interview is filtered through the candidate’s narrative skill, the interviewer’s note-taking, and recall biases. A work-sample test is filtered through far less — the candidate produces an artifact, the artifact is scored against a rubric, and the rubric ties to job criteria.
Designing a defensible work sample
A work sample produces high validity only when the design is disciplined. The common failure modes:
- The task isn’t representative. A coding test consisting of inverting a binary tree under time pressure isn’t representative of senior backend work. A representative task involves the kinds of decisions the role actually makes daily — schema design, error handling, integrating with a library, debugging an unfamiliar codebase.
- The rubric isn’t anchored. “Did the candidate write good code?” produces wide rater variance. A defensible rubric specifies the dimensions (correctness, readability, error handling, performance) and provides anchored examples for each score level; a minimal sketch of this structure follows the list.
- The time window isn’t realistic. A task that needs two hours but is allotted forty minutes punishes thoroughness. A task nominally scoped for two hours that routinely stretches to four punishes candidates with caregiving constraints. Calibrate the window to the median completion time of strong candidates in pilot testing.
- The instructions aren’t standardized. If different candidates receive different briefings, the validity evidence collapses. Use a written brief, identical across candidates, with a fixed clock.
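To make the anchoring concrete, here is a minimal sketch of a rubric encoded as plain data. The dimension names and anchor text are hypothetical examples, not an AIEH schema — the structure, one behavioral anchor per score level, is the point:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dimension:
    """One scoring dimension with a behavioral anchor per score level."""
    name: str
    anchors: dict[int, str]  # score level -> anchored example

# Hypothetical rubric for a backend debugging work sample.
RUBRIC = [
    Dimension("correctness", {
        1: "Bug not fixed, or the fix introduces a regression.",
        3: "Bug fixed; edge cases partially handled.",
        5: "Bug fixed with tests covering the failure mode and edge cases.",
    }),
    Dimension("readability", {
        1: "Change is hard to follow; no naming or structure discipline.",
        3: "Change is followable with effort.",
        5: "Intent is obvious from names and structure alone.",
    }),
    Dimension("error_handling", {
        1: "Failures are swallowed or crash uncontrolled.",
        3: "Common failure paths handled; unusual ones are not.",
        5: "Failure paths handled and surfaced consistently.",
    }),
]

def total_score(ratings: dict[str, int]) -> float:
    """Unweighted mean across dimensions; per-dimension weights are a design choice."""
    return sum(ratings[d.name] for d in RUBRIC) / len(RUBRIC)
```

Writing the anchors down as data also makes the rubric diffable and versionable, which matters once two raters start reconciling scores against it.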
For a deeper treatment of how work-sample design ties into the broader hiring-loop architecture, see the hiring loop design article.
Practical workflow
A defensible work-sample workflow has six stages:
- Job analysis. Identify the two-to-four most diagnostic tasks the role actually does. The work sample should be a slice of one of those, not a trivia exercise loosely related to the job.
- Task drafting. Write a brief that takes a strong incumbent forty-five to ninety minutes to complete. Pilot it with current employees in the role to calibrate.
- Rubric authoring. Build the rubric before reviewing any candidate submissions. Anchor each dimension with examples at the high, mid, and low bands.
- Standardized administration. Same brief, same time window, same instructions. Asynchronous is acceptable; the standardization is what matters.
- Blind scoring. Two raters score independently against the rubric, then reconcile. Calculate inter-rater reliability across a pilot batch and refine the rubric until reliability is acceptable; a minimal reliability check is sketched after this list.
- Feedback to candidate. Even rejected candidates benefit from rubric-based feedback. This supports the candidate-experience side of the loop and reduces reputational risk.
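Here is a minimal sketch of that reliability check, assuming two raters have scored the same pilot batch on a common rubric scale. It uses a plain Pearson correlation as the rough signal; a production loop would more likely use an intraclass correlation or weighted kappa, and all the scores below are made up:

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two raters' scores over the same batch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical pilot batch: total rubric scores from two independent raters.
rater_a = [4.0, 3.5, 2.0, 4.5, 3.0, 1.5, 5.0, 3.5]
rater_b = [4.5, 3.0, 2.5, 4.0, 3.5, 2.0, 4.5, 3.0]

print(f"inter-rater r = {pearson(rater_a, rater_b):.2f}")
# Low agreement usually means vague anchors: tighten the rubric, not the raters.
```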
Common pitfalls
Beyond the design failures above, two implementation pitfalls recur in industry:
- Treating the work sample as the only signal. A single 90-minute work sample is one data point. The Skills Passport composite weights work-sample evidence as one input alongside cognitive, AI fluency, personality, and communication evidence (a toy illustration follows this list). Selection decisions based on a single signal — work sample or otherwise — produce more variance than composite-based decisions.
- Letting the work sample creep into unpaid labor. A “work sample” that produces output the employer uses commercially is no longer a selection exercise. The brief should be hypothetical or a deliberate exercise, not a real ticket.
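As a toy illustration of the composite idea, the sketch below takes a weighted average over five pillar scores. The pillar scores and every weight except the ~0.35 domain figure cited in the next section are invented for illustration; the actual weights are documented in the AIEH scoring methodology:

```python
# Hypothetical pillar scores on a 0-1 scale, paired with illustrative weights.
pillars = {
    "domain":        (0.82, 0.35),  # work-sample evidence feeds this pillar
    "cognitive":     (0.74, 0.25),
    "ai_fluency":    (0.61, 0.15),
    "personality":   (0.70, 0.15),
    "communication": (0.77, 0.10),
}

# Weights sum to 1.0, so this is a weighted average across the five pillars.
composite = sum(score * weight for score, weight in pillars.values())
print(f"composite = {composite:.2f}")
```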
For practical guidance on integrating work-sample evidence with structured-interview signal in a defensible loop, see the structured interview design article.
AIEH integration
The Skills Passport composite weights work-sample evidence as a high-relevance source in the domain pillar (~0.35 default weight in the modal AIEH role bundle, see scoring methodology). When a candidate completes a work sample through an AIEH partner platform — HackerRank, CodeSignal, iMocha, or an AIEH-native scenario — the rubric-anchored score flows into the composite with appropriate recency decay (~12-18 month half-life for domain-specific evidence; cognitive ability decays slowly but specific technical skills shift with framework and tooling ecosystems).
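A minimal sketch of how half-life recency decay might be applied before evidence enters the pillar. The exponential form and the 15-month parameter are assumptions for illustration — 15 months is just the midpoint of the ~12-18 month range above, and the exact curve AIEH applies lives in the scoring methodology, not here:

```python
def decayed_weight(age_months: float, half_life_months: float = 15.0) -> float:
    """Exponential half-life decay: evidence contributes half as much every
    half_life_months. The 15-month default is an assumed midpoint of the
    ~12-18 month range cited for domain-specific evidence."""
    return 0.5 ** (age_months / half_life_months)

# A two-year-old work-sample score contributes at roughly a third of full weight.
print(decayed_weight(24))  # ~0.33
```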
The candidate-owned framing matters here. A work-sample score in a vendor’s database can’t follow the candidate to a different employer. A work-sample score aggregated into the Skills Passport follows the candidate by default — the URL aieh.com/passport/{handle} carries the evidence forward. This is the core skills-based hiring evidence pattern AIEH was built around.
For recruiters evaluating candidates, the hire workspace shows the work-sample evidence broken out from the composite — recruiters see both the calibrated 300-850 number and the underlying work-sample provenance, including which platform administered the test and when it was completed.
Takeaway
Work-sample tests sit near the top of the selection-validity rankings because the predictor and the criterion are drawn from the same content domain. Schmidt and Hunter’s 1998 meta-analytic estimate of ~0.54 corrected validity remains the most-cited figure, though Roth, Bobko, and McFarland’s 2005 meta-analysis puts the corrected estimate closer to 0.33; under either figure, work samples rank among the strongest single predictors. Strong work-sample design requires representative tasks, anchored rubrics, realistic time windows, standardized administration, and blind scoring with reconciled inter-rater reliability. AIEH treats work-sample evidence as a high-weight input to the domain pillar of the Skills Passport composite, with recency decay tuned to domain-skill timelines.
For related coverage, see cognitive ability in hiring for the highest-validity general-ability predictor, pre-employment screening evidence for the broader screening-method landscape, and the score page for the calibration math behind how work-sample evidence composes into the Skills Passport.
Sources
- Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262-274.
- Sackett, P. R., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 419-450.
- Roth, P. L., Bobko, P., & McFarland, L. A. (2005). A meta-analysis of work sample test validity: Updating and integrating some classic literature. Personnel Psychology, 58(4), 1009-1037.
- Schmidt, F. L., Oh, I.-S., & Shaffer, J. A. (2016). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 100 years of research findings. Working paper.
- Callinan, M., & Robertson, I. T. (2000). Work sample testing. International Journal of Selection and Assessment, 8(4), 248-260.
About This Article
Researched and written by the AIEH editorial team using official sources. This article is for informational purposes only and does not constitute professional advice.