Designing Structured Interview Rubrics: Calibration and Behaviorally Anchored Rating Scales
A structured-interview rubric is the document that turns a hiring conversation into a measurable rating. Done well, it gives every interviewer a shared reference distribution for what each rating point means, makes inter-rater reliability defensible, and produces ratings that can be aggregated across raters and across candidates with calibrated meaning. Done poorly — or not done at all — it leaves the rating space at the mercy of each interviewer’s personal scale, which is the dominant pattern in unstructured interviewing and the dominant cause of the low inter-rater reliability documented in the selection-research literature.
This article walks through what a defensible rubric contains, how behaviorally anchored rating scales (BARS) work, the construction process from job analysis to anchored descriptors, and the calibration steps that keep ratings comparable over time. The goal is operational: a hiring team that follows the workflow described here can produce a rubric that meaningfully improves measurement quality without requiring an industrial-organizational psychologist to draft every dimension by hand.
Data Notice: Reliability and validity claims for structured rubric design reflect peer-reviewed meta-analytic findings at time of writing. Specific reliability-lift estimates are projections from published studies and may shift as new studies are aggregated. See the scoring methodology for how AIEH treats rubric-driven ratings inside the Skills Passport composite.
What a defensible rubric contains
A structured-interview rubric that meets selection-research quality standards typically includes several elements:
- Job-relevant rating dimensions. Each dimension maps to a documented job requirement, ideally derived from a formal job analysis or a structured competency model. Two to five dimensions per interview is a typical range; beyond that, raters struggle to maintain independent attention to each dimension.
- An ordinal rating scale, typically 5 points. The 5-point range is wide enough to discriminate meaningful differences without inviting false-precision distinctions.
- Behavioral anchors for each scale point. Each point on the scale is described in observable candidate behavior, so raters can map what they actually observed to a scale position with shared reference points.
- Probes and follow-up guidance. The rubric specifies how to probe when an initial answer is incomplete, which reduces variance from interviewers who probe well versus those who don’t.
- Independent-rating instructions. The rubric specifies that ratings are submitted independently before any post-interview discussion, preserving the inter-rater-reliability measurement.
- Decision-aggregation rules. When multiple raters score the same candidate, the rubric documents how individual ratings combine into a final composite (averaging, consensus, or weighting by interview round); one such scheme is sketched below.
The combination of these elements is what distinguishes a selection-research-grade rubric from a “scoring sheet” that teams sometimes produce ad-hoc.
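To make the aggregation-rules element concrete, here is a minimal sketch of one way independent per-dimension ratings from multiple raters might be combined into a candidate composite. The data shape, round weights, and function names are illustrative assumptions, not a prescribed implementation.

```python
from statistics import mean

# Illustrative only: each rating is (rater_id, interview_round, dimension, score 1-5).
ratings = [
    ("rater_a", "technical", "problem_decomposition", 4),
    ("rater_b", "technical", "problem_decomposition", 3),
    ("rater_a", "technical", "communication", 5),
    ("rater_b", "technical", "communication", 4),
    ("rater_c", "system_design", "problem_decomposition", 4),
]

# Hypothetical round weights, documented in the rubric's aggregation rules.
ROUND_WEIGHTS = {"technical": 1.0, "system_design": 1.5}

def dimension_averages(ratings):
    """Average independent rater scores within each (round, dimension) cell."""
    cells = {}
    for _, rnd, dim, score in ratings:
        cells.setdefault((rnd, dim), []).append(score)
    return {cell: mean(scores) for cell, scores in cells.items()}

def candidate_composite(ratings, weights=ROUND_WEIGHTS):
    """Weight the per-dimension averages by interview round, then combine."""
    cells = dimension_averages(ratings)
    weighted_sum = sum(avg * weights[rnd] for (rnd, _), avg in cells.items())
    total_weight = sum(weights[rnd] for (rnd, _) in cells)
    return weighted_sum / total_weight

print(candidate_composite(ratings))  # a single 1-5 composite for this candidate
```

Whatever scheme a team chooses, the point is that it is written into the rubric in advance rather than improvised in the debrief.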
What BARS actually means
Behaviorally anchored rating scales — BARS — are the canonical structured-rating method that Smith and Kendall (1963) introduced and that Borman (1986) and subsequent selection-research work elaborated. The defining feature is that each scale point is anchored in concrete, observable behavior rather than in adjectival descriptors:
- Adjectival scale (weaker): “1 = Poor, 2 = Below Average, 3 = Average, 4 = Above Average, 5 = Excellent.” Every rater fills in the meaning of “average” from personal experience, producing rater-interpretation variance.
- Behaviorally anchored (stronger): Each point describes what a candidate at that level actually does — for a problem-decomposition dimension, for instance, a “3” might describe “candidate identifies the main problem components but does not specify trade-offs between approaches,” while a “5” describes “candidate identifies components, names trade-offs, and proposes a defensible prioritization.”
The behavioral anchors give raters a shared reference distribution to map their observations onto. The reliability gains are well documented across the meta-analytic literature, with Conway, Jako, and Goodman (1995) and Sackett and Lievens (2008) treating behaviorally anchored rating as a default expectation for structured-interview validity.
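As a further illustration of the contrast, a behaviorally anchored dimension can be represented as a small data structure in which every scale point carries an observable-behavior description. The dimension name, anchor wording, and probe below are examples only, echoing the problem-decomposition illustration above.

```python
from dataclasses import dataclass, field

@dataclass
class BarsDimension:
    """One rating dimension with a behavioral anchor for every scale point."""
    name: str
    anchors: dict[int, str]  # scale point -> observable-behavior description
    probes: list[str] = field(default_factory=list)

    def anchor_for(self, score: int) -> str:
        if score not in self.anchors:
            raise ValueError(f"No anchor defined for scale point {score}")
        return self.anchors[score]

# Example anchors for the problem-decomposition dimension discussed above.
problem_decomposition = BarsDimension(
    name="problem_decomposition",
    anchors={
        1: "Restates the problem without identifying its components.",
        2: "Identifies some components but misses the central one.",
        3: "Identifies the main components but does not specify trade-offs.",
        4: "Identifies components and names trade-offs between approaches.",
        5: "Identifies components, names trade-offs, and proposes a defensible prioritization.",
    },
    probes=["What would you tackle first, and why?"],
)
```

The adjectival scale has no equivalent representation, because there is nothing behind each number to write down.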
Construction workflow
The practical workflow for building a BARS rubric:
- Run a job analysis. Identify the role’s critical competencies — the behavioral patterns that distinguish strong from weak performers in the actual job. Sources include incumbent interviews, manager interviews, and review of prior performance data. The job analysis is what keeps the rubric grounded in the role rather than in interviewer preference.
- Select 2-5 dimensions per interview. Map the identified competencies to interview rounds, with each round assigned 2-5 dimensions to assess. The cap exists because raters cannot maintain independent attention to more than a handful of dimensions in a single interview without halo bleed-through.
- Draft behavioral anchors at the extremes first. For each dimension, write a strong (5) and weak (1) anchor first. The extremes are easier to articulate than the midpoints because the contrast is sharper.
- Fill in the midpoints by gradient. Once 1 and 5 are anchored, write 2, 3, and 4 by interpolating the gradient. The 3-anchor is typically the most useful one for raters because it’s where most candidates land.
- Validate against past candidates. Pull a sample of prior interview notes for the role and rate them with the draft rubric. If the rubric distinguishes strong-performers from weak-performers in retrospect, it likely will going forward. If it doesn’t, the anchors need refinement.
- Pilot with a small rater cohort. Run the rubric through 5-10 candidates with multiple raters per candidate, measure inter-rater reliability per dimension (see the sketch after this list), and identify which dimensions need anchor refinement.
- Document the final rubric. Store the rubric with anchors, probes, and aggregation rules in a place every interviewer can find. Version-control the rubric so changes are auditable.
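For the pilot step, the sketch below shows one quick screening statistic: the mean absolute gap between rater pairs on each dimension across pilot candidates. The data layout is an assumption for illustration, and a real pilot would report a proper reliability coefficient such as an ICC, as covered in the inter-rater reliability material.

```python
from itertools import combinations
from statistics import mean

# Illustrative pilot data: pilot_ratings[candidate][dimension] = {rater: score}
pilot_ratings = {
    "cand_1": {"problem_decomposition": {"a": 4, "b": 3}, "communication": {"a": 5, "b": 5}},
    "cand_2": {"problem_decomposition": {"a": 2, "b": 4}, "communication": {"a": 3, "b": 3}},
    "cand_3": {"problem_decomposition": {"a": 3, "b": 3}, "communication": {"a": 4, "b": 5}},
}

def mean_pairwise_disagreement(pilot_ratings, dimension):
    """Average absolute gap between rater pairs on one dimension, across candidates.
    A quick screening check only; a full pilot would compute an ICC or similar."""
    gaps = []
    for dims in pilot_ratings.values():
        scores = list(dims[dimension].values())
        gaps.extend(abs(x - y) for x, y in combinations(scores, 2))
    return mean(gaps)

for dim in ("problem_decomposition", "communication"):
    print(dim, round(mean_pairwise_disagreement(pilot_ratings, dim), 2))
```

Dimensions where raters routinely land more than a point apart are the ones whose anchors need rework before going live.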
The interview question design coverage walks through the question-construction side that pairs with the rubric. Together they form the structured-interview backbone.
Calibration: making ratings comparable
A rubric on paper is necessary but not sufficient. Calibration training is the process that turns rubric-on-paper into rubric-in-practice:
- Initial calibration on shared exemplars. Present the rater cohort with reference candidates whose “correct” rating per dimension is established in advance, then practice rating until the cohort converges. The calibration session reveals where anchors are misread and lets the rubric be refined before live use.
- Buddy-rating early. Pair new interviewers with experienced calibrated interviewers for their first several interviews, with both rating independently and then comparing post-rating to identify divergence patterns.
- Quarterly recalibration. Rater drift is real. Quarterly recalibration sessions on a small cohort of recent candidates reset the calibration and catch dimensions where the rater cohort has drifted apart (a drift-check sketch follows this list).
- Audit ongoing IRR. Measure inter-rater reliability across an ongoing sample of candidates rated by multiple raters. The inter-rater reliability evidence coverage walks through the measurement protocol.
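As a hedged illustration of what a recalibration session might measure, the sketch below compares each rater's scores on shared exemplar candidates against the pre-established reference ratings and flags raters whose average deviation exceeds a cutoff. The threshold value and data layout are assumptions for illustration, not a fixed protocol.

```python
from statistics import mean

# Reference ("correct") ratings for shared exemplars: exemplar -> dimension -> score.
reference = {
    "exemplar_1": {"problem_decomposition": 3, "communication": 4},
    "exemplar_2": {"problem_decomposition": 5, "communication": 2},
}

# Each rater's calibration-session ratings of the same exemplars.
session = {
    "rater_a": {"exemplar_1": {"problem_decomposition": 3, "communication": 4},
                "exemplar_2": {"problem_decomposition": 4, "communication": 2}},
    "rater_b": {"exemplar_1": {"problem_decomposition": 5, "communication": 5},
                "exemplar_2": {"problem_decomposition": 5, "communication": 4}},
}

DRIFT_THRESHOLD = 0.75  # illustrative cutoff, in scale points

def rater_drift(rater_ratings, reference):
    """Mean absolute deviation of one rater's scores from the reference ratings."""
    deviations = [
        abs(rater_ratings[ex][dim] - ref_score)
        for ex, dims in reference.items()
        for dim, ref_score in dims.items()
    ]
    return mean(deviations)

for rater, ratings in session.items():
    drift = rater_drift(ratings, reference)
    flag = "recalibrate" if drift > DRIFT_THRESHOLD else "ok"
    print(f"{rater}: drift={drift:.2f} ({flag})")
```

The output gives the session facilitator a per-rater starting point for discussing which anchors are being misread.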
The interviewer calibration process coverage describes the full operational design.
Pitfalls to avoid
The most common rubric-design failures:
- Too many dimensions. Rubrics with 8-12 dimensions per interview produce halo-laden ratings because raters cannot maintain independent attention. Cap at 5 dimensions per interview round.
- Adjectival anchors that pretend to be behavioral. “Excellent communication” is not a behavioral anchor. “Names the audience and adapts vocabulary accordingly” is. The behavioral specificity is what makes the anchor work.
- Anchors that describe candidate traits rather than observable behavior. “Confident” is a trait inference, not an observable behavior. “Maintains eye contact and speaks audibly” is an observable behavior. Anchors should describe what the candidate does, not what the candidate is.
- Skipping the pilot. Going live with an unpiloted rubric is the most common path to discovering anchors are ambiguous. The pilot catches the issues before they contaminate hiring data.
- No version control. Rubrics evolve. Without version control, ratings from different periods become non-comparable. Every rubric change should be dated and stored alongside the candidate ratings collected under each version; a minimal pattern is sketched after this list.
- Using the rubric without calibration. A rubric without rater calibration produces low inter-rater reliability regardless of how well-anchored the scale is. The calibration is part of the design, not an optional add-on.
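One lightweight way to address the version-control pitfall is to stamp every stored rating with the rubric version it was collected under, so downstream comparisons can be restricted to like-for-like versions. The field names below are illustrative assumptions.

```python
from datetime import date

# Illustrative stored-rating records, each stamped with its rubric version.
stored_ratings = [
    {"candidate": "cand_1", "dimension": "communication", "score": 4,
     "rubric_version": "2024-03-01", "rated_on": date(2024, 4, 2)},
    {"candidate": "cand_2", "dimension": "communication", "score": 3,
     "rubric_version": "2024-09-15", "rated_on": date(2024, 10, 1)},
]

def comparable(ratings, version):
    """Only ratings collected under the same rubric version are directly comparable."""
    return [r for r in ratings if r["rubric_version"] == version]

print(len(comparable(stored_ratings, "2024-09-15")))  # -> 1
```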
Rubric design inside the AIEH Skills Passport
AIEH’s Skills Passport composite treats structured-interview ratings as one source of evidence inside the four-pillar architecture, with rating provenance preserved so recruiters can see whether the score reflects a calibrated rubric or an ad-hoc rating. The scoring methodology documents the aggregation math, and the skills-based hiring evidence coverage situates structured-interview rating inside the broader portable-credential argument.
The candidate-owned credential pattern is what lets a calibrated structured-interview rating travel with the candidate to subsequent applications, rather than locking inside the original employer’s recruiter platform. The Skills Passport preserves the rating with provenance and recency, so recruiters in later applications can weight it appropriately.
Takeaway
A defensible structured-interview rubric pairs job-relevant rating dimensions with behaviorally anchored scale points, standardized probes, independent-rating instructions, and documented aggregation rules. Behaviorally anchored rating scales — the canonical Smith-and-Kendall and Borman construction — give raters a shared reference distribution that lifts inter-rater reliability into the defensible range. The rubric needs job analysis to ground it, behavioral specificity to anchor it, piloting to validate it, calibration to operationalize it, and versioning to keep it comparable across time. Skipping any of these steps produces ratings that look like measurement but do not behave like it.
For deeper coverage of related topics, see the structured interview design treatment, the interviewer calibration process, the interview question design coverage, and the hire workspace for the recruiter-side workflow.
Sources
- Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262–274.
- Sackett, P. R., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 419–450.
- Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47(2), 149–155.
- Borman, W. C. (1986). Behavior-based rating scales. In R. A. Berk (Ed.), Performance Assessment: Methods and Applications (pp. 100–120). Johns Hopkins University Press.
- Campion, M. A., Palmer, D. K., & Campion, J. E. (1997). A review of structure in the selection interview. Personnel Psychology, 50(3), 655–702.
About This Article
Researched and written by the AIEH editorial team using official sources. This article is for informational purposes only and does not constitute professional advice.