To grade open-ended responses with AI against a rubric, you do four things: write a rubric the model can actually use, pass it the response and the criteria together, require cited evidence for every judgement, and have a qualified person review the result before it counts. That sequence is what turns a vague "this looks good" into a defensible grade.
Here is each step in practice.
1. Write a rubric the model can use
Garbage rubric, garbage grade. The model can only mark against what you give it. Use specific criteria and observable performance levels, not fuzzy adjectives. "Demonstrates good understanding" tells the model nothing. "Identifies at least two risks and explains the mitigation for each" tells it exactly what to look for. Define the levels so the gap between a pass and a fail is concrete.
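One way to keep levels concrete is to store the rubric as data rather than free prose. The sketch below is only illustrative: the `Level` and `Criterion` classes, the criterion name, and the wording of levels 0 and 1 are assumptions made for this example, and only the level 2 descriptor comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    score: int
    descriptor: str  # an observable behaviour, not an adjective


@dataclass(frozen=True)
class Criterion:
    name: str
    levels: tuple[Level, ...]

    def descriptor_for(self, score: int) -> str:
        """Return the observable descriptor attached to a given score."""
        for level in self.levels:
            if level.score == score:
                return level.descriptor
        raise ValueError(f"No level with score {score} in {self.name!r}")


# Hypothetical criterion: levels 0 and 1 are invented to show a concrete
# gap between a fail and a pass.
risk_analysis = Criterion(
    name="Risk analysis",
    levels=(
        Level(0, "Identifies no risks."),
        Level(1, "Identifies at least one risk but explains no mitigation."),
        Level(2, "Identifies at least two risks and explains the mitigation for each."),
    ),
)

print(risk_analysis.descriptor_for(2))
```

Writing each level as something a reader could tick off makes it easier to see whether two adjacent levels are actually distinguishable.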
2. Give the model the response and the rubric together
The model marks one response at a time, against the criteria and levels you defined. It reads the actual answer - written text, a transcript, or an uploaded document - rather than matching it against an answer key. Because the same rubric is applied to the first response and the four hundredth, the standard does not drift the way a tired human marker's does late in a session.
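As a rough sketch of "response and rubric together", the function below assembles one prompt per response from the same rubric every time. The function name, the section markers, and the instruction wording are assumptions for illustration. No model API is called here, because the article does not specify one.

```python
def build_grading_prompt(rubric: dict[str, dict[int, str]], response_text: str) -> str:
    """Combine the full rubric and ONE response into a single prompt string."""
    lines = [
        "Mark the response against every criterion in the rubric.",
        "For each criterion, give a score and quote the words that justify it.",
        "",
        "=== RUBRIC ===",
    ]
    for criterion, levels in rubric.items():
        lines.append(f"Criterion: {criterion}")
        for score in sorted(levels):
            lines.append(f"  Level {score}: {levels[score]}")
    lines += ["", "=== RESPONSE ===", response_text.strip()]
    return "\n".join(lines)


# Hypothetical rubric and response, for illustration only.
example_rubric = {
    "Risk analysis": {
        0: "Identifies no risks.",
        2: "Identifies at least two risks and explains the mitigation for each.",
    }
}
example_prompt = build_grading_prompt(
    example_rubric, "Supplier delay is one risk; we hold buffer stock to cover it."
)
print(example_prompt)
```

Because the rubric section is generated from the same data for every call, the criteria text in the prompt is identical for the first response and the last.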
3. Require cited evidence for every judgement
A score with no reasoning is unauditable. Every judgement should point back to the part of the response that justified it, ideally quoting the candidate's own words. This does two things: it lets a reviewer check the call quickly, and it stops the model rewarding answers that merely sound confident. If the model cannot cite evidence for a strength, treat the strength as not demonstrated. We cover this failure mode in grounded AI feedback.
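A simple guard for the evidence requirement is to check that each quoted passage really appears in the response. The sketch below assumes the model returns one record per criterion with a `criterion` and an `evidence` field. That shape is hypothetical, as are the function names. Matching here is a plain substring check after collapsing whitespace and case, so paraphrased evidence will be flagged too.

```python
import re


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_judgements(response_text: str, judgements: list[dict]) -> list[str]:
    """Return the criteria whose quoted evidence is missing from the response."""
    haystack = normalise(response_text)
    flagged = []
    for judgement in judgements:
        quote = normalise(judgement.get("evidence", ""))
        # An empty quote, or one not found in the response, cannot be audited.
        if not quote or quote not in haystack:
            flagged.append(judgement["criterion"])
    return flagged


# Hypothetical response and model output, for illustration only.
response = "The main risk is   supplier delay.\nWe mitigate it with buffer stock."
model_output = [
    {"criterion": "First risk", "evidence": "main risk is supplier delay"},
    {"criterion": "Second risk", "evidence": "currency fluctuation"},
    {"criterion": "Clarity", "evidence": ""},
]
print(unsupported_judgements(response, model_output))
```

Anything this check flags is a candidate for the reviewer's attention, not an automatic mark change.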
4. Keep a human in the loop
For anything consequential - a qualification, a job, a certification - a solely automated decision is the wrong design and, under frameworks like the EU AI Act, often a compliance problem. The right pattern: the model does the heavy reading and proposes a grade with cited evidence, then a qualified person reviews, adjusts, and signs off. That is faster than pure manual marking and more reliable than pure automation.
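In code, "human in the loop" can be enforced by making a grade unreleasable until a named reviewer has signed it off. The following is a minimal sketch under that assumption. The `GradeRecord` class and its fields are hypothetical and do not describe any particular product's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GradeRecord:
    proposed_score: int                 # what the model suggested
    evidence: str                       # the quote the model cited
    reviewer: Optional[str] = None      # set only at sign-off
    final_score: Optional[int] = None   # set only at sign-off

    def sign_off(self, reviewer: str, adjusted_score: Optional[int] = None) -> None:
        """Record the reviewer and either accept or override the proposal."""
        if not reviewer:
            raise ValueError("A named reviewer is required")
        self.reviewer = reviewer
        self.final_score = (
            self.proposed_score if adjusted_score is None else adjusted_score
        )

    @property
    def is_final(self) -> bool:
        return self.reviewer is not None

    def released_score(self) -> int:
        """Refuse to release a score that no person has reviewed."""
        if not self.is_final or self.final_score is None:
            raise PermissionError("Grade has not been signed off by a reviewer")
        return self.final_score


# Hypothetical usage: the reviewer lowers the proposed score from 2 to 1.
record = GradeRecord(proposed_score=2, evidence="we hold buffer stock")
record.sign_off("A. Reviewer", adjusted_score=1)
print(record.released_score())
```

Keeping the proposed score alongside the final one also preserves a record of where the reviewer disagreed with the model.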
Common failure modes to design around
- Confident wrong grades. A model can produce a clean, plausible score that is simply wrong on an edge case. The human review pass catches these.
- Keyword gaming. Candidates stuff answers with rubric terms. Evidence-based marking helps, but a human reviewer is still quicker to spot an answer that is genuinely empty.
- Missing context. The model only sees what is in front of it. If the answer references a class discussion or a prior submission, the human supplies that context.
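For the keyword-gaming case, one crude option is to flag answers that are unusually dense in rubric terms so a reviewer looks at them first. This is a heuristic sketch only: the function names, the term list, and the 0.3 threshold are assumptions that would need tuning on real responses. It prioritises review and does not decide a grade.

```python
import re


def rubric_term_density(answer: str, rubric_terms: set[str]) -> float:
    """Fraction of words in the answer that are rubric terms."""
    words = re.findall(r"[a-z']+", answer.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in rubric_terms)
    return hits / len(words)


def needs_priority_review(
    answer: str, rubric_terms: set[str], max_density: float = 0.3
) -> bool:
    """Flag answers whose rubric-term density exceeds an assumed threshold."""
    return rubric_term_density(answer, rubric_terms) > max_density


# Hypothetical terms and answers, for illustration only.
terms = {"risk", "mitigation"}
stuffed = "Risk mitigation risk mitigation risk."
genuine = (
    "The main risk is supplier delay, so we hold two weeks "
    "of buffer stock as mitigation."
)
print(needs_priority_review(stuffed, terms), needs_priority_review(genuine, terms))
```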
What this looks like in Scorafy
Scorafy is built around this exact method. You define the rubric, it reads each open-ended response, marks against your criteria with cited evidence, and routes the result to a reviewer who signs off. The point is not to remove the assessor. It is to give them a strong first pass and a clear audit trail. Try it on a real submission or read more on what AI grading can and cannot do.