Introduction
Traditional RLHF relies on human preference comparisons, which can be noisy and inconsistent. Rubric-based rewards offer a more structured alternative for aligning language models.
The Problem with Preference-Based RLHF
Inconsistency
- Different annotators prefer different things
- Same annotator varies over time
- Hard to define what "better" means
Opacity
- Why was response A preferred over B?
- What aspects were compared?
- How to improve systematically?
Rubrics as Rewards
What is a Rubric?
A rubric is a structured set of evaluation criteria, typically organized as weighted dimensions:
rubric:
  helpfulness:
    weight: 0.4
    criteria:
      - Directly addresses the question
      - Provides actionable information
      - Appropriate level of detail
  accuracy:
    weight: 0.3
    criteria:
      - Factually correct
      - No hallucinations
      - Acknowledges uncertainty
  safety:
    weight: 0.3
    criteria:
      - No harmful content
      - Appropriate refusals
      - Privacy-preserving
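Such a rubric maps directly onto plain data structures. The sketch below mirrors the YAML above as a Python dict; `validate_rubric` is an illustrative helper (not part of any fixed API) that checks the weights sum to 1.0 so the final score stays on the judge's scale:

```python
# In-code mirror of the YAML rubric above; dimension names, weights,
# and criteria are taken from that example.
RUBRIC = {
    "helpfulness": {
        "weight": 0.4,
        "criteria": [
            "Directly addresses the question",
            "Provides actionable information",
            "Appropriate level of detail",
        ],
    },
    "accuracy": {
        "weight": 0.3,
        "criteria": [
            "Factually correct",
            "No hallucinations",
            "Acknowledges uncertainty",
        ],
    },
    "safety": {
        "weight": 0.3,
        "criteria": [
            "No harmful content",
            "Appropriate refusals",
            "Privacy-preserving",
        ],
    },
}

def validate_rubric(rubric):
    """Ensure dimension weights sum to 1.0 (within float tolerance)."""
    total = sum(cfg["weight"] for cfg in rubric.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return True
```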
Advantages
- Explicit criteria: Clear what matters
- Consistent scoring: Same standards applied
- Interpretable feedback: Know what to improve
- Flexible weighting: Adjust importance
Implementation
Rubric Evaluation
def evaluate_with_rubric(response, rubric):
    scores = {}
    for dimension, config in rubric.items():
        # Use LLM as judge for each criterion, averaged per dimension
        dimension_score = 0
        for criterion in config['criteria']:
            score = llm_judge(response, criterion)
            dimension_score += score
        scores[dimension] = dimension_score / len(config['criteria'])
    # Weighted combination across dimensions
    final_score = sum(
        scores[d] * rubric[d]['weight']
        for d in rubric
    )
    return final_score
Training Loop
Generate Response -> Evaluate with Rubric -> Compute Reward -> PPO Update
        |                    |                      |               |
    (sampling)          (LLM judge)            (aggregate)      (optimize)
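The pipeline above can be sketched end to end. In this toy version, `generate_response` and `llm_judge` are stubs standing in for the policy model and the judge model, and the PPO update itself is omitted; only the sampling -> judging -> reward-aggregation path is shown:

```python
# Toy one-iteration pass: sample -> judge -> reward. A real loop would
# sample from the language model, call a judge model, and hand the
# reward to a PPO implementation.

TOY_RUBRIC = {
    "helpfulness": {"weight": 0.5, "criteria": ["Directly addresses the question"]},
    "safety": {"weight": 0.5, "criteria": ["No harmful content"]},
}

def generate_response(prompt):
    # Stub policy: stands in for sampling from the language model.
    return f"Answer to: {prompt}"

def llm_judge(response, criterion):
    # Stub judge: a real implementation prompts an LLM with the
    # criterion and parses a 1-5 score from its reply.
    return 5 if response.strip() else 1

def compute_reward(response, rubric):
    # Average criterion scores within each dimension, then weight and sum.
    reward = 0.0
    for cfg in rubric.values():
        dim = sum(llm_judge(response, c) for c in cfg["criteria"])
        reward += cfg["weight"] * dim / len(cfg["criteria"])
    return reward

response = generate_response("What is a rubric?")
reward = compute_reward(response, TOY_RUBRIC)  # this reward would drive the PPO update
```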
LLM-as-Judge
Prompt Design
You are evaluating a response on the following criterion:
{criterion}
Response to evaluate:
{response}
Score from 1-5 with explanation:
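A judge call then amounts to filling this template and parsing a score out of the reply. A minimal sketch; the helper names (`build_judge_prompt`, `parse_judge_score`) are illustrative:

```python
import re

# Template mirroring the prompt shown above.
JUDGE_PROMPT = """You are evaluating a response on the following criterion:
{criterion}

Response to evaluate:
{response}

Score from 1-5 with explanation:"""

def build_judge_prompt(response, criterion):
    return JUDGE_PROMPT.format(criterion=criterion, response=response)

def parse_judge_score(judge_output):
    """Extract the score from the judge's reply.

    Assumes the first standalone digit in 1-5 is the score; returns
    None when no such digit is found, so callers can retry or discard.
    """
    match = re.search(r"\b([1-5])\b", judge_output)
    return int(match.group(1)) if match else None
```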
Calibration
- Use few-shot examples
- Include edge cases
- Validate against human judgment
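Validation against human judgment can start with a simple correlation check between judge scores and human scores on a shared set of responses. A dependency-free sketch; the scores below are hypothetical:

```python
# Hypothetical calibration check: high correlation between judge and
# human scores suggests the judge prompt is well calibrated; low
# correlation means the few-shot examples need revisiting.

def pearson(xs, ys):
    """Pearson correlation, written out to keep the sketch dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

judge_scores = [4, 2, 5, 3, 1]   # hypothetical LLM-judge scores
human_scores = [5, 2, 4, 3, 1]   # hypothetical human scores, same responses
agreement = pearson(judge_scores, human_scores)
```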
Multi-Judge
- Use multiple prompts/models
- Aggregate scores
- Flag disagreements
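A minimal aggregation scheme, assuming each judge returns a 1-5 score: average the scores and flag cases where the spread exceeds a threshold (the threshold value here is a hypothetical choice, not a standard):

```python
def aggregate_judges(scores, disagreement_threshold=2):
    """Return (mean score, flagged), where flagged marks large spread."""
    mean = sum(scores) / len(scores)
    flagged = (max(scores) - min(scores)) >= disagreement_threshold
    return mean, flagged
```

Flagged items can be routed to human review instead of being used directly as training reward.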
Practical Considerations
Rubric Design
- Start broad: Major quality dimensions
- Refine iteratively: Add specificity where needed
- Validate empirically: Check rubric predicts user satisfaction
Computational Cost
- LLM judges are expensive
- Cache common evaluations
- Sample-based training
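Caching fits naturally here because judge calls are keyed by (response, criterion). A sketch using `functools.lru_cache`, with a stubbed judge standing in for the model call:

```python
import functools

CALL_COUNT = {"judge": 0}

@functools.lru_cache(maxsize=None)
def cached_llm_judge(response, criterion):
    # A real judge would call an LLM here; identical (response, criterion)
    # pairs are served from the cache instead of paying for another call.
    CALL_COUNT["judge"] += 1
    return 4  # stubbed score for illustration

cached_llm_judge("Paris is the capital of France.", "Factually correct")
cached_llm_judge("Paris is the capital of France.", "Factually correct")
```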
Reward Hacking
- Models may game specific criteria
- Diversify rubrics
- Human oversight for edge cases
Results
Compared to traditional RLHF:
- More consistent rewards across evaluators
- Faster iteration on reward criteria
- Better interpretability of model changes
- Similar or better final model quality
Learn advanced alignment techniques in our LLM courses.