Introduction
Traditional RLHF relies on human preference comparisons, which can be noisy and inconsistent. Rubric-based rewards offer a more structured alternative for aligning language models.
The Problem with Preference-Based RLHF
Inconsistency
- Different annotators prefer different things
- Same annotator varies over time
- Hard to define what "better" means
Opacity
- Why was response A preferred over B?
- What aspects were compared?
- How to improve systematically?
Rubrics as Rewards
What is a Rubric?
A rubric is a structured set of evaluation criteria, typically organized as weighted dimensions:
rubric:
  helpfulness:
    weight: 0.4
    criteria:
      - Directly addresses the question
      - Provides actionable information
      - Appropriate level of detail
  accuracy:
    weight: 0.3
    criteria:
      - Factually correct
      - No hallucinations
      - Acknowledges uncertainty
  safety:
    weight: 0.3
    criteria:
      - No harmful content
      - Appropriate refusals
      - Privacy-preserving
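Such a rubric maps directly onto plain data structures. The sketch below mirrors the YAML above as a Python dict; `validate_rubric` is an illustrative helper (not part of any fixed API) that checks the weights sum to 1.0 so the final score stays on the judge's scale:

```python
# In-code mirror of the YAML rubric above; dimension names, weights,
# and criteria are taken from that example.
RUBRIC = {
    "helpfulness": {
        "weight": 0.4,
        "criteria": [
            "Directly addresses the question",
            "Provides actionable information",
            "Appropriate level of detail",
        ],
    },
    "accuracy": {
        "weight": 0.3,
        "criteria": [
            "Factually correct",
            "No hallucinations",
            "Acknowledges uncertainty",
        ],
    },
    "safety": {
        "weight": 0.3,
        "criteria": [
            "No harmful content",
            "Appropriate refusals",
            "Privacy-preserving",
        ],
    },
}

def validate_rubric(rubric):
    """Ensure dimension weights sum to 1.0 (within float tolerance)."""
    total = sum(cfg["weight"] for cfg in rubric.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return True
```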
Advantages
- Explicit criteria: Clear what matters
- Consistent scoring: Same standards applied
- Interpretable feedback: Know what to improve
- Flexible weighting: Adjust importance
Implementation
Rubric Evaluation
def evaluate_with_rubric(response, rubric):
    scores = {}
    for dimension, config in rubric.items():
        # Use LLM as judge for each criterion, averaged per dimension
        dimension_score = 0
        for criterion in config['criteria']:
            score = llm_judge(response, criterion)
            dimension_score += score
        scores[dimension] = dimension_score / len(config['criteria'])
    # Weighted combination across dimensions
    final_score = sum(
        scores[d] * rubric[d]['weight']
        for d in rubric
    )
    return final_score
Training Loop
Generate Response -> Evaluate with Rubric -> Compute Reward -> PPO Update
        |                    |                      |               |
    (sampling)          (LLM judge)            (aggregate)      (optimize)
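The pipeline above can be sketched end to end. In this toy version, `generate_response` and `llm_judge` are stubs standing in for the policy model and the judge model, and the PPO update itself is omitted; only the sampling -> judging -> reward-aggregation path is shown:

```python
# Toy one-iteration pass: sample -> judge -> reward. A real loop would
# sample from the language model, call a judge model, and hand the
# reward to a PPO implementation.

TOY_RUBRIC = {
    "helpfulness": {"weight": 0.5, "criteria": ["Directly addresses the question"]},
    "safety": {"weight": 0.5, "criteria": ["No harmful content"]},
}

def generate_response(prompt):
    # Stub policy: stands in for sampling from the language model.
    return f"Answer to: {prompt}"

def llm_judge(response, criterion):
    # Stub judge: a real implementation prompts an LLM with the
    # criterion and parses a 1-5 score from its reply.
    return 5 if response.strip() else 1

def compute_reward(response, rubric):
    # Average criterion scores within each dimension, then weight and sum.
    reward = 0.0
    for cfg in rubric.values():
        dim = sum(llm_judge(response, c) for c in cfg["criteria"])
        reward += cfg["weight"] * dim / len(cfg["criteria"])
    return reward

response = generate_response("What is a rubric?")
reward = compute_reward(response, TOY_RUBRIC)  # this reward would drive the PPO update
```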
LLM-as-Judge
Prompt Design
You are evaluating a response on the following criterion:
{criterion}
Response to evaluate:
{response}
Score from 1-5 with explanation:
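A judge call then amounts to filling this template and parsing a score out of the reply. A minimal sketch; the helper names (`build_judge_prompt`, `parse_judge_score`) are illustrative:

```python
import re

# Template mirroring the prompt shown above.
JUDGE_PROMPT = """You are evaluating a response on the following criterion:
{criterion}

Response to evaluate:
{response}

Score from 1-5 with explanation:"""

def build_judge_prompt(response, criterion):
    return JUDGE_PROMPT.format(criterion=criterion, response=response)

def parse_judge_score(judge_output):
    """Extract the score from the judge's reply.

    Assumes the first standalone digit in 1-5 is the score; returns
    None when no such digit is found, so callers can retry or discard.
    """
    match = re.search(r"\b([1-5])\b", judge_output)
    return int(match.group(1)) if match else None
```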
Calibration
- Use few-shot examples
- Include edge cases
- Validate against human judgment
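Validation against human judgment can start with a simple correlation check between judge scores and human scores on a shared set of responses. A dependency-free sketch; the scores below are hypothetical:

```python
# Hypothetical calibration check: high correlation between judge and
# human scores suggests the judge prompt is well calibrated; low
# correlation means the few-shot examples need revisiting.

def pearson(xs, ys):
    """Pearson correlation, written out to keep the sketch dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

judge_scores = [4, 2, 5, 3, 1]   # hypothetical LLM-judge scores
human_scores = [5, 2, 4, 3, 1]   # hypothetical human scores, same responses
agreement = pearson(judge_scores, human_scores)
```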
Multi-Judge
- Use multiple prompts/models
- Aggregate scores
- Flag disagreements
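A minimal aggregation scheme, assuming each judge returns a 1-5 score: average the scores and flag cases where the spread exceeds a threshold (the threshold value here is a hypothetical choice, not a standard):

```python
def aggregate_judges(scores, disagreement_threshold=2):
    """Return (mean score, flagged), where flagged marks large spread."""
    mean = sum(scores) / len(scores)
    flagged = (max(scores) - min(scores)) >= disagreement_threshold
    return mean, flagged
```

Flagged items can be routed to human review instead of being used directly as training reward.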
Practical Considerations
Rubric Design
- Start broad: Major quality dimensions
- Refine iteratively: Add specificity where needed
- Validate empirically: Check rubric predicts user satisfaction
Computational Cost
- LLM judges are expensive
- Cache common evaluations
- Sample-based training
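Caching fits naturally here because judge calls are keyed by (response, criterion). A sketch using `functools.lru_cache`, with a stubbed judge standing in for the model call:

```python
import functools

CALL_COUNT = {"judge": 0}

@functools.lru_cache(maxsize=None)
def cached_llm_judge(response, criterion):
    # A real judge would call an LLM here; identical (response, criterion)
    # pairs are served from the cache instead of paying for another call.
    CALL_COUNT["judge"] += 1
    return 4  # stubbed score for illustration

cached_llm_judge("Paris is the capital of France.", "Factually correct")
cached_llm_judge("Paris is the capital of France.", "Factually correct")
```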
Reward Hacking
- Models may game specific criteria
- Diversify rubrics
- Human oversight for edge cases
Results
Compared to traditional RLHF:
- More consistent rewards across evaluators
- Faster iteration on reward criteria
- Better interpretability of model changes
- Similar or better final model quality
Learn advanced alignment techniques in our LLM courses.