
Interview Question Design: What Predicts Performance and What Just Sounds Smart

By Editorial Team — reviewed for accuracy

The interview question is the elemental unit of structured-interview design. Different question types have different validity profiles, different fairness-risk profiles, and different implementation costs. Loops that author interview questions without internalizing the question-type distinction tend to produce mixes that degrade toward the unstructured-interview validity floor (~0.20), regardless of how much rubric infrastructure surrounds them.

This article walks through the question types established in the selection-research literature, the validity evidence on each, the implementation patterns that distinguish good rubrics from nominal ones, the common bad-question patterns to avoid, and how interview-question design integrates with the broader multi-method hiring loop.

Data Notice: Validity coefficients and structural-feature findings cited here reflect peer-reviewed meta-analytic evidence at time of writing. Effect sizes vary across job families, interview types, and industries; consult primary sources before deploying in high-stakes selection contexts.

The four question types established in the literature

The selection-research literature distinguishes four substantially different interview-question types, each with its own validity profile and implementation considerations:

  • Behavioral questions (“Tell me about a time you had to ship a feature with incomplete requirements”). The candidate describes specific past behavior in a specific context. The underlying validity logic: past behavior predicts future behavior in similar contexts. Behavioral questions surface the most diagnostic information when paired with rubrics that probe specifics (the STAR pattern: situation, task, action, result) rather than general impressions.
  • Situational questions (“If you were given a fuzzy product spec, what would you do first?”). The candidate describes hypothetical future behavior in a specific context. The validity logic: stated intentions about future behavior under specific conditions correlate meaningfully with actual behavior in those conditions. Situational questions probe judgment more directly than behavioral questions but rely more heavily on candidate-reported intent rather than candidate-reported history.
  • Knowledge questions (“What’s the time complexity of a binary search?”). The candidate produces a specific technical or domain-knowledge answer. The validity logic: domain-knowledge mastery is necessary (though not sufficient) for performance in roles where the knowledge is load-bearing. Knowledge questions show higher validity for roles with discrete-knowledge cores and lower validity for judgment-heavy roles.
  • Trait-attribution questions (“Are you a self-starter?”). The candidate self-reports a trait or characteristic. The validity logic for these is the weakest of the four types. Self-presentation skill confounds the response substantially; candidates with strong impression-management skill produce high-trait self-attributions regardless of underlying trait level. Trait-attribution questions are widely used despite the weak validity evidence, primarily because they’re easy to author without rubric infrastructure.

Validity evidence on each type

The canonical Schmidt & Hunter (1998) meta-analysis grouped interviews into structured (standardized questions scored against anchored rubrics, regardless of behavioral, situational, or knowledge type) and unstructured. Subsequent research has documented type-specific validity differences:

  • Behavioral questions with STAR-pattern rubrics achieve validity in the 0.45–0.55 range across studies (Janz et al., 1986; Campion et al., 1997). The wide range reflects the importance of rubric specificity: behavioral questions with well-specified rubrics reach the upper end of the range, while vague rubrics drop into the unstructured-interview range.
  • Situational questions with anchored rubrics achieve validity in the 0.40–0.50 range, slightly below behavioral questions in most meta-analyses but with smaller variance across studies. Situational questions also show somewhat smaller adverse-impact exposure in some studies, making them attractive in contexts where bias mitigation is a primary concern (McDaniel et al., 1994).
  • Knowledge-question validity varies dramatically by role type: high for roles with discrete-knowledge cores (medical, legal, regulated technical), low for judgment-heavy roles. Treating knowledge questions as substitutable with behavioral or situational questions is one of the most common interview-design mistakes.
  • Trait-attribution questions show validity in the 0.10–0.20 range across most studies — comparable to unstructured interviews and below years of education in the Schmidt & Hunter framework. The continued use of trait-attribution questions is one of the more evidence-resistant practices in interview design.

The implication: behavioral and situational questions, with anchored rubrics, are the highest-validity interview-question types for most knowledge work. Knowledge questions are appropriate for roles with knowledge-heavy cores. Trait-attribution questions should be removed from rubric-driven loops or repurposed as conversation-openers without scoring weight.

Implementation patterns that distinguish good rubrics from nominal ones

Three operational patterns distinguish rubrics that drive the documented validity gain from rubrics that exist only on paper:

  • Anchor points are evaluator-facing, not candidate-facing. The rubric describes what each rating level looks like in evaluator terms (“rating 5: candidate identifies at least three downstream consequences without prompting; rating 3: candidate identifies one consequence with prompting”). These anchors should NOT be read aloud to the candidate — that signals the expected response and degrades the diagnostic information.
  • Probe questions are pre-authored. Behavioral questions often need follow-up probing to elicit specific evidence (“can you describe what you actually did, not what the team did?”, “what was the specific outcome you measured?”). Strong rubrics include the probe library so interviewers don’t have to author probes in real time, which is when consistency drifts.
  • Scoring happens before discussion, not after. The multi-rater independence benefit (see structured interview design) requires that interviewers commit to scores before discussing the candidate. Loops that allow real-time discussion before scoring lose most of the multi-rater validity gain.

A fourth practice compounds value over time: calibration sessions across cohorts. Quarterly group review of recorded candidate responses, comparison of scores across raters, and recalibration of anchor points where evaluator drift has occurred are what keep rubric-driven interviews from drifting back toward the unstructured-interview floor.
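
To make these implementation patterns concrete, below is a minimal sketch of how a question-plus-rubric record might be represented in a question library, with evaluator-facing anchors and a pre-authored probe list. The field names, rating scale, and example content are illustrative assumptions, not a prescribed schema or an AIEH format.

    # Illustrative sketch only: field names, rating scale, and example content
    # are assumptions, not a prescribed or AIEH-specific schema.
    from dataclasses import dataclass, field

    @dataclass
    class RubricAnchor:
        rating: int       # e.g. a 1-5 scale
        description: str  # evaluator-facing behavior description; never read aloud to the candidate

    @dataclass
    class InterviewQuestion:
        question_id: str
        prompt: str                  # what the interviewer reads aloud
        question_type: str           # "behavioral" | "situational" | "knowledge"
        anchors: list[RubricAnchor]  # evaluator-facing rating anchors
        probes: list[str] = field(default_factory=list)  # pre-authored follow-up probes

    # Hypothetical library entry
    question = InterviewQuestion(
        question_id="eng-behavioral-012",
        prompt="Tell me about a time you shipped a feature with incomplete requirements.",
        question_type="behavioral",
        anchors=[
            RubricAnchor(5, "Identifies at least three downstream consequences without prompting; "
                            "describes own actions and a measured outcome."),
            RubricAnchor(3, "Identifies one consequence with prompting; outcome described only qualitatively."),
            RubricAnchor(1, "Answers in team-level generalities; no specific action or outcome."),
        ],
        probes=[
            "Can you describe what you actually did, not what the team did?",
            "What was the specific outcome you measured?",
        ],
    )

Keeping the anchors and probes in the record itself is what lets each interviewer score against the same evidence standard before any discussion takes place.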

Common bad-question patterns to avoid

Five patterns that produce low-validity interview questions even when surrounded by structured-interview infrastructure:

  • Leading questions. “Tell me about a time you successfully led a team through a difficult migration.” The “successfully” signals the expected response and biases the candidate’s recall toward confirmatory examples. The diagnostic version: “Tell me about a time you led a team through a difficult migration. What was the outcome?”
  • Compound questions. “Tell me about a time you faced a technical disagreement with your manager AND how you handled it AND what the outcome was.” Compound questions produce rambling answers that are hard to score against rubric anchors. Split into separate questions or focus on one dimension.
  • Trait-attribution questions disguised as behavioral questions. “Tell me about a time you showed leadership.” This is a trait-attribution prompt with behavioral framing; the candidate selects which past event to describe based on what they think shows leadership, which produces self-presentation rather than behavioral evidence. The better version probes behavior in a specific context: “Tell me about a time the team was stuck on a decision and you influenced the outcome.”
  • Brain-teaser questions. “How many golf balls fit in a 747?” The validity literature on brain-teasers is weak; they correlate poorly with job performance for most knowledge work and produce disparate-impact exposure without predictive value. They persist culturally at some employers but the evidence supports retiring them.
  • Hypothetical ethical-dilemma questions. “What would you do if your manager asked you to falsify data?” The candidate produces the socially-desirable answer regardless of how they’d actually behave; integrity is hard to assess via hypothetical-question formats. Behavioral integrity questions (“Tell me about a time you faced an ethical pressure at work”) produce more diagnostic information than hypothetical variants.

The patterns persist because they sound interesting, demonstrate interviewer cleverness, or signal cultural fit within the interviewer cohort. None of those are validity arguments; loops should evaluate questions against the validity literature rather than against evaluator preferences.

Where interview-question design fits in multi-method selection

Interview-question design is one component of a defensible multi-method hiring loop. Other components typically include:

  • A cognitive-ability or work-sample component (high-validity per Schmidt & Hunter, 1998 — see the cognitive-ability in hiring treatment). The AIEH Skills Passport’s cognitive and work-sample-style assessments cover this slot.
  • A personality component, primarily conscientiousness — see Big Five in hiring for the validity evidence and the AIEH Big Five family for the corresponding assessment.
  • A reference-check or culture-conversation final stage. Lower validity than the upstream methods but useful for surfacing red flags that wouldn’t otherwise surface.

Interview-question design’s role is to extract synchronous-judgment signal that other methods don’t capture — particularly for senior roles where behavioral patterns under realistic ambiguity are central to the hiring decision. For the broader treatment of multi-method loop design, see the hiring-loop design overview and the structured interview design implementation treatment.

Common pitfalls in interview-question authoring

Three patterns that employers repeatedly fall into, beyond the specific bad-question patterns above:

  • Authoring per-loop instead of maintaining a library. Role definitions change slowly in most organizations, and hiring for the same role recurs many times per year. Re-authoring questions per loop discards the calibration-and-rubric infrastructure that makes structured interviews work. Strong loops maintain question libraries with rubrics that are reviewed quarterly, not authored per loop.
  • Treating rubric maintenance as one-time work. Rubric anchors drift over time as norms shift, technology evolves, and team composition changes. Loops that author once and never update tend to drift toward measuring outdated competencies. Quarterly calibration sessions are the maintenance discipline that keeps rubrics current; a simple per-rater drift check is sketched after this list.
  • Using AI-generated questions without validity calibration. AI-generated interview questions are increasingly common; the questions often look reasonable but haven’t been calibrated against actual hire-quality outcomes. Treat AI-generated questions as drafts that need rubric-anchor authoring and validity validation before deployment, not finished products.
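
One way to support that quarterly calibration review is to pool each quarter’s ratings, compute per-rater averages on each question, and flag raters whose averages diverge from the all-rater average by more than a chosen margin. The sketch below is illustrative only: the data shape, threshold, and function name are assumptions rather than a validated calibration procedure, and flagged raters are candidates for review, not automatic correction.

    # Illustrative sketch only: data shape, threshold, and function name are
    # assumptions, not a validated calibration procedure.
    from collections import defaultdict
    from statistics import mean

    def flag_rater_drift(scores, threshold=0.75):
        """scores: (question_id, rater_id, rating) tuples pooled over a quarter.
        Returns raters whose mean rating on a question diverges from the
        all-rater mean on that question by more than `threshold` points."""
        by_question = defaultdict(lambda: defaultdict(list))
        for question_id, rater_id, rating in scores:
            by_question[question_id][rater_id].append(rating)

        flags = []
        for question_id, by_rater in by_question.items():
            panel_mean = mean(r for ratings in by_rater.values() for r in ratings)
            for rater_id, ratings in by_rater.items():
                drift = mean(ratings) - panel_mean
                if abs(drift) > threshold:
                    flags.append((question_id, rater_id, round(drift, 2)))
        return flags

    # Example: rater "r3" scores the same question systematically higher than the panel
    sample = [
        ("eng-behavioral-012", "r1", 3), ("eng-behavioral-012", "r1", 4),
        ("eng-behavioral-012", "r2", 4), ("eng-behavioral-012", "r2", 3),
        ("eng-behavioral-012", "r3", 5), ("eng-behavioral-012", "r3", 5),
    ]
    print(flag_rater_drift(sample))  # flags r3 (one point above the panel mean)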

Takeaway

Interview question design is a discipline with substantial empirical support: behavioral and situational questions with anchored rubrics achieve validity comparable to cognitive testing and work samples, while trait-attribution and brain-teaser questions add little diagnostic information. The implementation patterns that distinguish good rubrics from nominal ones (evaluator-facing anchors, pre-authored probes, score-before-discussion, sustained calibration) are well-documented but require ongoing operational discipline.

Most published “structured interview validity” claims apply to interviews using behavioral or situational questions with proper rubric infrastructure. Loops using mixes of trait-attribution and brain-teaser questions get the unstructured-interview validity number (~0.20), not the structured number (~0.51), regardless of how the questions are labeled. The discipline of question-type choice and rubric authoring matters more than the structured-interview label.

For broader treatments, see structured interview design, hiring-loop design, skills-based hiring evidence, and the scoring methodology for how interview signal integrates with the AIEH Skills Passport.


Sources

  • Campion, M. A., Palmer, D. K., & Campion, J. E. (1997). A review of structure in the selection interview. Personnel Psychology, 50(3), 655–702.
  • Janz, T., Hellervik, L., & Gilmore, D. C. (1986). Behavior Description Interviewing: New, Accurate, Cost-Effective. Allyn & Bacon.
  • McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79(4), 599–616.
  • Sackett, P. R., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 419–450.
  • Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262–274.
  • Truxillo, D. M., & Bauer, T. N. (2011). Applicant reactions to organizations and selection systems. In S. Zedeck (Ed.), APA Handbook of Industrial and Organizational Psychology, Vol. 2: Selecting and Developing Members for the Organization (pp. 379–397). American Psychological Association.

About This Article

Researched and written by the AIEH editorial team using official sources. This article is for informational purposes only and does not constitute professional advice.
