Validity is whether an assessment measures what it claims to measure; reliability is whether it measures consistently. An assessment can be reliable but not valid - a bathroom scale that reads five kilos heavy gives you the same wrong number every time: perfectly consistent, but not a true measure of your weight. It can be valid in principle but unreliable - a good question marked by ten assessors who all disagree. You need both: a result is only trustworthy if the assessment measures the right thing and would give the same judgement again. These are the two foundations that every other quality term in assessment rests on.
Validity: are you measuring the right thing?
Validity asks whether the assessment actually captures the competence or knowledge you intend. A coding test that mostly rewards fast typing is not a valid measure of programming skill, however precisely it scores. Validity has several practical facets worth knowing:
- Content validity - does the assessment cover the actual scope of what was taught or required, not a narrow slice of it?
- Construct validity - is it measuring the underlying ability you care about, rather than a proxy like reading speed, test-taking skill, or familiarity with the format?
- Face validity - does it look credible and relevant to the people taking and using it? Low face validity erodes trust even when the assessment is sound.
The most common validity failure is construct-irrelevant variance: the score moves for reasons that have nothing to do with the skill. A maths problem buried in dense English partly tests reading, not maths. A judgement task with a vague rubric partly tests how well the candidate guessed what the marker wanted.
Reliability: are you measuring consistently?
Reliability asks whether you would get the same result again - across markers, across occasions, across equivalent versions. The kinds that matter most in practice:
- Inter-rater reliability - would two qualified assessors reach the same judgement on the same submission? This is where open-ended marking most often falls down.
- Consistency over time - does the same assessor mark submission one and submission four hundred to the same standard, or do they drift as they tire?
- Internal consistency - do the items that are meant to measure the same thing actually agree with each other?
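Two of these kinds of reliability are commonly put into numbers. The sketch below is one minimal way to do that, not a prescribed method: it assumes Cohen's kappa for agreement between two markers and Cronbach's alpha for internal consistency, both standard statistics that this article does not itself name. The function names and the sample grades are made up for illustration.

```python
from collections import Counter
from statistics import variance


def cohens_kappa(rater_a, rater_b):
    """Agreement between two markers on the same submissions, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both markers must grade the same non-empty set of submissions")
    n = len(rater_a)
    # Observed agreement: share of submissions given the same grade by both markers.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: how often they would match if each marker handed out
    # grades independently at their own observed rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[g] * counts_b[g] for g in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both markers used one identical grade throughout
    return (p_observed - p_chance) / (1 - p_chance)


def cronbach_alpha(rows):
    """Internal consistency; rows holds one list of per-item scores per learner."""
    if len(rows) < 2 or len(rows[0]) < 2:
        raise ValueError("need at least two learners and two items")
    k = len(rows[0])
    # Variance of each item across learners, and of each learner's total score.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("total scores do not vary, alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical grades from two markers on the same eight submissions.
marker_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
marker_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(marker_1, marker_2), 3))  # 0.467
```

The two markers agree on six of eight submissions, but kappa is well below 0.75 because some of that agreement would happen by chance. A value near 1 means strong agreement; a value near 0 means the markers agree no more often than chance would predict.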
Low reliability is corrosive because it makes results unfair in a way that is hard to see: two learners of identical ability get different outcomes based on who marked them or when. And reliability caps validity - a score that is mostly noise cannot track the intended skill closely, however well the task was designed, so unreliability quietly erodes validity too.
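The cap can be stated more precisely. Assuming the framing of classical test theory, which this article does not itself invoke, the correlation between an assessment score X and any criterion Y it is meant to predict is bounded by the reliabilities of the two measures:

$$
|\rho_{XY}| \;\le\; \sqrt{\rho_{XX'}\,\rho_{YY'}} \;\le\; \sqrt{\rho_{XX'}}
$$

Here $\rho_{XY}$ is the validity coefficient, and $\rho_{XX'}$ and $\rho_{YY'}$ are the reliabilities of the assessment and the criterion. As an illustration, an assessment with reliability 0.64 cannot correlate with any criterion above $\sqrt{0.64} = 0.8$, no matter how well the task targets the skill.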
Why you need both, and how they trade off
Both are necessary and neither is sufficient. A highly reliable assessment of the wrong construct gives you precise, consistent, useless results. A valid-in-principle task marked inconsistently gives you the right idea measured so noisily you cannot act on it. The goal is an assessment that targets the real competence (valid) and judges it the same way every time (reliable).
There is a known tension. The most valid assessments of complex skills tend to be open-ended and authentic - real tasks, real judgement - and those are the hardest to mark reliably, because human judgement varies. Closed, tick-box formats are easy to mark reliably but often sacrifice validity for higher-order skills. Historically you were forced to trade one against the other.
How to improve each - and where AI marking helps
To improve validity: align tasks tightly to the actual outcomes, use authentic tasks that resemble real performance, strip out construct-irrelevant difficulty, and sample the full scope rather than a corner of it. To improve reliability: use a clear, observable rubric so judgements anchor to defined levels rather than gut feel, calibrate markers against each other, and require evidence for each judgement.
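As a rough illustration of the reliability advice, a rubric with defined levels and mandatory evidence can be expressed as a small data structure. This is a sketch under assumptions, not a description of any particular marking tool: the class names, fields, and the sample criterion are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric row: a name plus an observable descriptor for each level."""
    name: str
    levels: dict[int, str]  # level score -> what a marker can actually observe


@dataclass
class Judgement:
    """A marker's decision on one criterion, tied to evidence from the submission."""
    criterion: Criterion
    level: int
    evidence: str  # quote from, or pointer into, the submission

    def __post_init__(self):
        # Anchor the judgement to a defined level rather than an ad hoc score.
        if self.level not in self.criterion.levels:
            raise ValueError(f"level {self.level} is not defined for {self.criterion.name!r}")
        # Require evidence so another marker can check the call.
        if not self.evidence.strip():
            raise ValueError("a judgement must cite evidence from the submission")


# Hypothetical criterion for a code-review task.
error_handling = Criterion(
    name="Error handling",
    levels={
        0: "No failure cases are handled",
        1: "Some failure cases are handled, others crash",
        2: "All listed failure cases are handled with clear messages",
    },
)

judgement = Judgement(error_handling, 1, "Handles missing file; crashes on empty input")
print(judgement.criterion.levels[judgement.level])
```

Rejecting a judgement with no evidence or an undefined level is a design choice here: it makes the two habits described above hard to skip rather than leaving them to the marker's discipline.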
AI rubric marking targets the reliability side of the historic trade-off directly. The model applies the same rubric to every response and does not tire, so it is far less prone to drift between the first submission and the last, and it removes the marker-to-marker disagreement that two human assessors bring - which lifts consistency on exactly the open-ended, valid-but-hard-to-mark tasks that used to force a compromise. Because it cites the evidence for each judgement, the marking is also checkable, and a qualified reviewer signs off, which keeps a human accountable for the final call. You get to keep the valid, authentic task and recover much of the reliability it used to cost you. The rubric does the heavy lifting on both fronts - start with how to write an assessment rubric and see how AI rubric marking works. Try it on your own task.