Inter-Rater Reliability in Interviews and Assessments: Measurement and Improvement


Inter-rater reliability is the quiet engine that determines whether a selection process actually measures anything generalizable. A structured interview rubric with behaviorally anchored rating scales can carry strong validity on paper, but if two interviewers watching the same candidate respond to the same questions produce ratings that disagree substantively, the validity claim collapses. The rated score becomes a function of which interviewer happened to be in the room rather than a function of the candidate’s underlying performance.

This article walks through what inter-rater reliability (IRR) is, how it’s measured, what the meta-analytic evidence shows about typical IRR levels in interviews and assessments, and what hiring teams can do to improve it. The selection-research literature has accumulated a clear set of design and training moves that reliably push IRR upward, and ignoring them is one of the most common sources of measurement noise in real-world hiring loops.

Data Notice: Reliability coefficients cited here reflect peer-reviewed meta-analytic findings at time of writing. Specific IRR estimates are approximations drawn from published meta-analyses and may shift as new studies are aggregated. See the scoring methodology for how AIEH treats IRR inside the Skills Passport composite.

What inter-rater reliability is

Inter-rater reliability is the degree to which two or more independent raters agree when scoring the same target. Several technical measures operationalize the concept:

  • Cohen’s kappa for two raters scoring categorical outcomes, correcting for chance agreement.
  • Fleiss’s kappa for three or more raters with categorical outcomes.
  • Intraclass correlation coefficient (ICC) for continuous or ordinal ratings, with several variants (ICC(2,1), ICC(2,k), etc.) chosen based on the rating design and the inferential question.
  • Percent agreement as a simpler descriptive statistic, though without chance correction.

In interview and assessment contexts, ICC is typically the preferred measure because rating dimensions are usually ordinal or continuous (a 1-to-5 anchored rating, a work-sample score). The choice of ICC variant matters: a rating design where every interviewer sees every candidate calls for a different ICC variant than one where each candidate is rated by a different randomly assigned pair of interviewers.
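
To make the chance-correction idea concrete, here is a minimal sketch of Cohen's kappa for two raters making hire/no-hire calls. The interviewer labels and ratings are hypothetical, and the same observed-versus-expected logic underlies the other chance-corrected statistics listed above.

    import numpy as np

    def cohens_kappa(rater1, rater2):
        """Chance-corrected agreement between two raters on categorical labels."""
        r1, r2 = np.asarray(rater1), np.asarray(rater2)
        p_o = np.mean(r1 == r2)  # observed agreement
        # Agreement expected if each rater kept their marginal rates but rated at random
        p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in np.union1d(r1, r2))
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical hire/no-hire calls from two interviewers on ten shared candidates
    a = ["hire", "hire", "no", "hire", "no", "hire", "no", "no", "hire", "hire"]
    b = ["hire", "no", "no", "hire", "no", "hire", "hire", "no", "hire", "hire"]
    print(round(cohens_kappa(a, b), 2))

Percent agreement in this toy data is 0.80, but kappa corrects it down to roughly 0.58 because both raters say "hire" often enough that substantial agreement is expected by chance alone.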

A rough interpretive convention treats ICC values in the ~0.40 to ~0.60 range as moderate reliability, ~0.60 to ~0.75 as good, and above ~0.75 as excellent. The convention is context-sensitive — selection contexts should target the upper end of these bands.

What the meta-analytic evidence shows

Conway, Jako, and Goodman (1995) published one of the most-cited meta-analyses of interview reliability, covering both structured and unstructured interview formats:

  • Unstructured interviews show low IRR. Average ICC values fall in the modest-to-poor range, with substantial variability across studies. Two interviewers conducting independent unstructured conversations with the same candidate often produce ratings that disagree substantively.
  • Structured interviews show meaningfully higher IRR. When questions are standardized, anchored, and asked of every candidate in a consistent order, average ICC values rise into the good range. The structure does the heavy lifting; the rubric makes the rating defensible.
  • Behaviorally anchored rating scales improve IRR further. Adding behavioral anchors that describe what a 3 vs a 4 vs a 5 looks like reduces the rater-interpretation variance that floats around unanchored numeric scales.
  • Calibration training has modest-to-moderate effects. Salgado and Moscoso (1995) and subsequent training studies show that explicit rater training on the rubric, with calibration on shared exemplar candidates, raises IRR meaningfully — though the effect size varies by training intensity and ongoing reinforcement.

The Schmidt and Hunter (1998) hierarchy and the Sackett and Lievens (2008) selection-research review both treat structured-interview IRR as a load-bearing assumption behind the validity claims for the method. When IRR is low, the validity collapses regardless of how good the questions look on paper.

Where IRR breaks down in practice

Common failure patterns in real-world hiring loops:

  • Ad-hoc question selection. When interviewers improvise questions per candidate, the rating dimension shifts from question to question, and the reliability question becomes unanswerable in principle. The standardization gate must precede the reliability measurement.
  • Numeric-only scales without anchors. A 1-to-5 scale without behavioral descriptors invites rater-interpretation variance: one interviewer’s “4” reflects different underlying behavior than another’s “4” because the scale has no shared reference points.
  • Halo effects across dimensions. When the rubric has multiple dimensions but the rater forms a global impression early and then rates every dimension to match, the cross-dimension correlations inflate and the per-dimension reliability claims weaken. Independent per-dimension rating reduces but does not eliminate the effect; the simulation sketch after this list illustrates the inflation.
  • Rater drift over time. Even calibrated raters drift from the calibrated standard over months without reinforcement. Without ongoing measurement and recalibration, reliability decays.
  • Selection cohort effects. When raters see only strong candidates for a stretch, then suddenly interview a weaker candidate, the implicit reference distribution shifts and ratings become non-comparable across cohorts. Anchored rubrics with explicit reference exemplars mitigate this.
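
The halo pattern is easy to see in a quick simulation. The sketch below is illustrative only: it assumes three independent true dimension scores per candidate, mixes a shared global-impression term into every dimension, and compares the resulting cross-dimension correlations.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500                                    # hypothetical candidate count
    true_scores = rng.normal(size=(n, 3))      # three independent rubric dimensions
    halo = rng.normal(size=(n, 1))             # rater's early global impression
    noise = rng.normal(scale=0.5, size=(n, 3))

    independent = true_scores + noise                 # each dimension rated on its own merits
    haloed = 0.4 * true_scores + 0.6 * halo + noise   # impression leaks into every dimension

    print(np.corrcoef(independent, rowvar=False).round(2))  # off-diagonals near zero
    print(np.corrcoef(haloed, rowvar=False).round(2))       # off-diagonals clearly inflated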

How to measure IRR in your own loop

A practical IRR measurement protocol for an in-flight hiring process:

  1. Capture independent ratings before discussion. When two interviewers see the same candidate, both submit ratings independently before any post-interview conversation. Discussion before rating contaminates the measurement.
  2. Sample across candidates and raters. A reliable IRR estimate requires multiple candidates per rater-pair, not a single shared candidate. Aim for enough overlap that the ICC has a reasonable confidence interval.
  3. Choose the appropriate ICC variant. Match the variant to the rating design (every-rater-sees-every-candidate vs randomly-assigned-pair) and to the inferential question (single-rater reliability vs averaged-rating reliability).
  4. Report by dimension and overall. Per-dimension IRR reveals which rubric dimensions are well-anchored and which are not. Overall IRR alone hides the diagnostic detail; the sketch after this list shows what a per-dimension computation looks like.
  5. Re-measure quarterly. Drift is real. Quarterly re-measurement lets the loop catch reliability decay before it contaminates a hiring cohort.
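
As a minimal sketch of steps 3 and 4, the snippet below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) per rubric dimension from a candidates-by-raters matrix of independent ratings. The dimension names and ratings are hypothetical; a real loop would also report confidence intervals and choose the variant that matches its own rating design.

    import numpy as np

    def icc_2_1(x):
        # ICC(2,1): two-way random effects, absolute agreement, single-rater reliability.
        # x is an (n candidates x k raters) matrix of ratings for one dimension.
        x = np.asarray(x, dtype=float)
        n, k = x.shape
        grand = x.mean()
        row_means = x.mean(axis=1)                              # per-candidate means
        col_means = x.mean(axis=0)                              # per-rater means
        msr = k * np.sum((row_means - grand) ** 2) / (n - 1)    # between-candidate mean square
        msc = n * np.sum((col_means - grand) ** 2) / (k - 1)    # between-rater mean square
        resid = x - row_means[:, None] - col_means[None, :] + grand
        mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))          # residual mean square
        return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

    # Hypothetical independent ratings: 8 candidates x 2 interviewers, per dimension
    ratings = {
        "problem_solving": [[4, 4], [3, 2], [5, 5], [2, 3], [4, 3], [3, 3], [5, 4], [2, 2]],
        "communication":   [[3, 5], [4, 2], [2, 4], [5, 3], [3, 3], [4, 5], [2, 4], [5, 2]],
    }
    for dim, x in ratings.items():
        print(dim, round(icc_2_1(x), 2))

In this toy data the first dimension lands around 0.80 while the second is effectively zero, which is exactly the diagnostic detail an overall number would hide.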

How to improve IRR

The selection-research literature converges on a small set of moves that reliably raise IRR:

  1. Use behaviorally anchored rating scales. BARS rubrics describe what each scale point looks like in observable candidate behavior. The anchors give raters a shared reference distribution, reducing the rater-interpretation variance that drives low IRR. See interview rubric design for construction principles.
  2. Standardize the question set. Every candidate sees the same questions in the same order. The interview question design coverage walks through the question-construction side.
  3. Train raters on the rubric with shared exemplars. Calibration training presents raters with reference candidates whose “correct” rating is established in advance, then practices rating until the cohort converges. See interviewer calibration process for the operational design; a small sketch after this list shows one way to check convergence against reference ratings.
  4. Force independent rating before discussion. Ratings submitted independently before any post-interview conversation produce a defensible IRR measurement. Ratings formed only after group discussion cannot support a reliability estimate at all.
  5. Audit and recalibrate quarterly. Ongoing measurement and refresher calibration prevent drift. See hiring manager training evidence for the broader training program.
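
One lightweight way to operationalize the calibration check is to score each trainee against reference ratings fixed for the exemplar candidates in advance. The sketch below is illustrative: the rater names, ratings, and tolerance threshold are all assumed, and a real program would set the threshold relative to the scale width and rubric.

    import numpy as np

    # Hypothetical calibration exercise: every trainee rates the same three
    # exemplar candidates whose reference ratings were agreed in advance.
    gold = np.array([2.0, 4.0, 5.0])
    trainee_ratings = {
        "rater_a": np.array([2, 4, 5]),
        "rater_b": np.array([3, 5, 5]),
        "rater_c": np.array([1, 3, 3]),
    }
    tolerance = 0.5  # assumed cutoff for this illustration

    for name, r in trainee_ratings.items():
        mad = np.mean(np.abs(r - gold))  # mean absolute deviation from reference
        verdict = "calibrated" if mad <= tolerance else "schedule another calibration round"
        print(f"{name}: MAD = {mad:.2f} -> {verdict}")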

Pitfalls to avoid

The most common mistakes in operationalizing IRR:

  • Reporting percent agreement without chance correction. Two raters agreeing on “hire” 80% of the time when 90% of candidates are hired anyway is not a strong reliability signal. Use chance-corrected statistics; the worked numbers after this list make the point concrete.
  • Conflating reliability with validity. A loop can have high IRR and still measure the wrong construct. High IRR is necessary but not sufficient for validity.
  • One-shot training without reinforcement. Calibration training that happens once at onboarding decays. Ongoing measurement and refresher calibration is what keeps IRR durable.
  • Ignoring per-dimension breakdown. Overall IRR can mask one badly-anchored dimension. Reporting by dimension reveals where the rubric needs work.
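
The first pitfall is worth working through with numbers. Suppose both raters call "hire" for 90% of candidates and agree on 80% of them; the implied joint distribution is 80% both-hire, 10% disagreement in each direction, and 0% both-no-hire. Chance agreement is then 0.9 × 0.9 + 0.1 × 0.1 = 0.82, so kappa = (0.80 − 0.82) / (1 − 0.82), roughly −0.11: the pair agrees slightly less often than chance alone would predict, despite the impressive-looking 80% raw agreement.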

IRR inside the AIEH Skills Passport

AIEH’s Skills Passport composite treats inter-rater reliability as a load-bearing assumption behind the structured-interview and behavioral-rating components of the composite. When a structured-interview score contributes to a candidate’s pillar weight, the assumption is that the loop produces calibrated ratings; the scoring methodology documents how the multi-vendor aggregation logic handles assessment-source provenance.

The candidate-owned credential pattern works only when the underlying ratings are reliable enough to compare across loops. The Skills Passport preserves provenance so recruiters can see whether a structured-interview score came from a calibrated rubric or from an ad-hoc rating, and weight accordingly.

Takeaway

Inter-rater reliability is the load-bearing assumption behind structured-interview and behavioral-rating validity. Unstructured interviews show low IRR; structured interviews with behaviorally anchored rating scales and trained, calibrated raters show meaningfully higher IRR. The selection-research literature has converged on a small set of moves — standardize the questions, anchor the scale, train and recalibrate the raters, force independent rating, audit quarterly — that reliably raise IRR. Hiring teams that skip these moves are running a measurement process whose validity claim cannot be defended.

For deeper coverage of related topics, see the structured interview design treatment, the interviewer calibration process, and the hire workspace for the recruiter-side workflow.

Sources

  • Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262–274.
  • Sackett, P. R., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 419–450.
  • Conway, J. M., Jako, R. A., & Goodman, D. F. (1995). A meta-analysis of interrater and internal consistency reliability of selection interviews. Journal of Applied Psychology, 80(5), 565–579.
  • Salgado, J. F., & Moscoso, S. (1995). Validity of the structured behavioral interview. Revista de Psicología del Trabajo y de las Organizaciones, 11(31), 9–24.
  • Huffcutt, A. I., & Arthur, W. (1994). Hunter and Hunter (1984) revisited: Interview validity for entry-level jobs. Journal of Applied Psychology, 79(2), 184–190.

About This Article

Researched and written by the AIEH editorial team using official sources. This article is for informational purposes only and does not constitute professional advice.
