Tutorial · 2025-03-29 · 15 min read

Fine-Tuning LLMs: A Practical Guide for ML Engineers

Learn how to fine-tune language models with Hugging Face Transformers and LoRA. Covers full fine-tuning, parameter-efficient fine-tuning, and production deployment considerations.

Tags: fine-tuning, LLM, LoRA, Hugging Face, PEFT, transformers

When to Fine-Tune vs. Prompt Engineer

Before writing any training code, answer this question: can you solve the problem with prompting alone?

Fine-tuning is appropriate when:

  • The task requires a specific output format that's hard to specify in a prompt
  • You need the model to have domain knowledge not in its pretraining data
  • Latency/cost constraints rule out large models or long prompts
  • You have >1000 high-quality labeled examples

Stick with prompting when:

  • You have fewer than a few hundred labeled examples
  • The task is general enough that a base model handles it well
  • You need rapid iteration

Full Fine-Tuning with Hugging Face

Full fine-tuning updates all model parameters. This gives the best results but requires the most memory.

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
)
from datasets import Dataset
import torch

# Prepare data
def format_instruction(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    }

raw_data = [
    {"instruction": "Summarize this code", "output": "This function..."},
    # ... thousands of examples
]

dataset = Dataset.from_list(raw_data)
dataset = dataset.map(format_instruction)

# Load model and tokenizer
model_name = "meta-llama/Llama-3.2-3B"  # 3B — feasible to fine-tune on 1 GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # bf16 saves memory, minimal quality loss
    device_map="auto",
)

# Tokenize
def tokenize(examples):
    tokens = tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )
    # Causal LM: labels mirror input_ids (Trainer needs a "labels" column)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text", "instruction", "output"])
tokenized = tokenized.train_test_split(test_size=0.1)  # Trainer below expects "train"/"test" splits

# Training config
training_args = TrainingArguments(
    output_dir="./llama3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch size = 4 * 4 = 16
    learning_rate=2e-5,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)

trainer.train()

LoRA: Fine-Tune 1% of Parameters

Fully fine-tuning a 7B model can require several 80GB A100s. LoRA (Low-Rank Adaptation) achieves comparable results by training only small adapter matrices:

Original weight matrix W (frozen):  [d × d]
LoRA: W' = W + BA where B: [d × r], A: [r × d], r << d

For a 4096×4096 attention matrix with r=16:
Original: 16.7M parameters
LoRA: 4096*16 + 16*4096 = 131K parameters  (~0.8% of original)
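The arithmetic above generalizes to any layer shape and rank. A quick sanity check in plain Python (no dependencies; `lora_param_count` is just a helper for this post):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by a LoRA adapter on a [d_out x d_in] weight:
    B is [d_out, r] and A is [r, d_in]."""
    return d_out * r + r * d_in

full = 4096 * 4096                       # original attention matrix: 16,777,216 params
lora = lora_param_count(4096, 4096, 16)  # adapter at rank 16: 131,072 params

print(lora)                  # 131072
print(f"{lora / full:.2%}")  # 0.78%
```

Doubling the rank doubles the adapter size, so capacity/memory trade-offs are easy to reason about.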

from peft import LoraConfig, get_peft_model, TaskType

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Apply LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # Rank — higher = more capacity, more memory
    lora_alpha=32,      # Scaling factor (usually 2*r)
    lora_dropout=0.1,
    target_modules=[    # Which layers to add LoRA to
        "q_proj", "v_proj",  # Attention projections
        "k_proj", "o_proj",  # All attention projections for better results
    ],
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output; exact counts depend on the base model — with r=16 on
# four attention projections, well under 1% of parameters are trainable

Now train exactly like full fine-tuning — the API is identical, but you're only updating a few million parameters instead of billions.

QLoRA: 4-bit Quantization + LoRA

QLoRA (Quantized LoRA) lets you fine-tune an 8B model on a single 16GB GPU:

from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # Quantize the quantization constants
)

# Load in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for LoRA training
model = prepare_model_for_kbit_training(model)

# Apply LoRA (same as before)
model = get_peft_model(model, lora_config)

# GPU memory: ~8GB for an 8B model with QLoRA
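That memory figure can be sanity-checked with back-of-the-envelope arithmetic. The helper below is a rough sketch — the "overhead roughly doubles it" assumption is a rule of thumb, not a measurement:

```python
def quantized_weight_memory_gb(n_params: float, bits: int = 4) -> float:
    """Memory for the base model's weights alone at `bits` bits per parameter."""
    return n_params * bits / 8 / 1e9

weights = quantized_weight_memory_gb(8e9)  # 8B params in 4-bit -> 4.0 GB
print(f"{weights:.1f} GB")
# LoRA adapters, activations, optimizer state, and CUDA overhead roughly
# double this in practice, landing near the ~8GB noted above.
```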

Supervised Fine-Tuning for Instruction Following

For instruction-following tasks, structure your data carefully:

# Good instruction format (alpaca-style)
def format_sample(instruction: str, input: str, output: str) -> str:
    if input:
        return f"""Below is an instruction that describes a task, paired with an input. Write a response.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""
    else:
        return f"""Below is an instruction. Write a response.

### Instruction:
{instruction}

### Response:
{output}"""

# Critical: only compute loss on the response, not the instruction
def tokenize_with_labels(example, tokenizer):
    full_text = format_sample(example["instruction"], example["input"], example["output"])
    tokens = tokenizer(full_text, return_tensors="pt")

    # Find where the response starts (caveat: tokenization is not always
    # prefix-stable, so the boundary may shift by a token or two)
    instruction_text = format_sample(example["instruction"], example["input"], "")
    instruction_len = len(tokenizer(instruction_text)["input_ids"])

    # Mask instruction tokens in labels (-100 = ignore in loss)
    labels = tokens["input_ids"].clone()
    labels[0, :instruction_len] = -100

    return {"input_ids": tokens["input_ids"][0], "labels": labels[0]}

Not masking the instruction means your model wastes capacity learning to reproduce the prompt. Always mask.
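The -100 convention works because PyTorch's cross-entropy loss ignores label positions equal to -100 (its default `ignore_index`), so masked positions contribute nothing to the gradient. The masking step itself can be illustrated without a tokenizer:

```python
def mask_prompt_tokens(input_ids: list, prompt_len: int) -> list:
    """Labels for causal-LM SFT: prompt positions become -100 so loss
    (and gradient) flows only through response tokens."""
    return [-100] * prompt_len + input_ids[prompt_len:]

ids = [101, 7, 8, 9, 42, 43, 44]  # toy token ids; first 4 are the prompt
labels = mask_prompt_tokens(ids, 4)
print(labels)  # [-100, -100, -100, -100, 42, 43, 44]
```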

Merging LoRA Adapters

After training, merge the LoRA weights back into the base model for clean serving:

from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Load and merge LoRA weights
model = PeftModel.from_pretrained(base_model, "./lora-checkpoint")
merged_model = model.merge_and_unload()  # Fuses LoRA into base weights

# Save merged model
merged_model.save_pretrained("./llama3-finetuned-merged")
tokenizer.save_pretrained("./llama3-finetuned-merged")

# Now deploy as a regular model — no PEFT dependency

Evaluating Fine-Tuned Models

from transformers import pipeline

pipe = pipeline("text-generation", model="./llama3-finetuned-merged", device=0)

# Automated evaluation
test_cases = [
    {"instruction": "Classify this email as spam or not spam.", "input": "Win a free iPhone...", "expected": "spam"},
    # ...
]

correct = 0
for case in test_cases:
    prompt = format_sample(case["instruction"], case["input"], "")
    output = pipe(prompt, max_new_tokens=50)[0]["generated_text"]
    response = output[len(prompt):].strip().lower()

    # Match the start of the response — a bare substring check would
    # count "not spam" as a hit for "spam"
    if response.startswith(case["expected"]):
        correct += 1

print(f"Accuracy: {correct/len(test_cases):.2%}")

For open-ended generation tasks, use LLM-as-judge evaluation or human evaluation.
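At its core, LLM-as-judge is careful prompt construction plus a stronger model scoring each output. The sketch below is a hypothetical starting point — the rubric wording and the `judge_pipe` name are assumptions, not a standard API:

```python
def build_judge_prompt(instruction: str, response: str) -> str:
    """Hypothetical rubric prompt for a judge model; tune the wording per task."""
    return (
        "Rate the response to the instruction on a 1-5 scale for "
        "correctness and helpfulness. Answer with a single digit.\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Response:\n{response}\n\nScore:"
    )

prompt = build_judge_prompt("Summarize this code", "This function sorts a list.")
# score_text = judge_pipe(prompt, max_new_tokens=2)[0]["generated_text"]  # judge_pipe: a stronger model
```

Parse the digit out of the judge's reply, and spot-check a sample of its scores against human judgment before trusting the aggregate.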

Hardware Reference

GPU             VRAM    Max model (full FT)   Max model (QLoRA)
RTX 4090        24GB    ~1B                   ~13B
A100 40GB       40GB    ~3B                   ~33B
A100 80GB       80GB    ~7B*                  70B
2× A100 80GB    160GB   ~13B*                 70B+

*assumes gradient checkpointing and an 8-bit optimizer; plain bf16 AdamW needs roughly 16 bytes per parameter before activations.

For production fine-tuning, use cloud GPUs (Lambda Labs, RunPod, AWS p4d instances). For experimentation, QLoRA on an RTX 4090 covers most use cases.
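To size GPUs yourself, use the rule of thumb that full fine-tuning in bf16 with AdamW costs on the order of 16 bytes per parameter (2 for weights, 2 for gradients, ~12 for fp32 optimizer state), before activations. The constant here is a rough assumption, not a measurement:

```python
def full_ft_memory_gb(n_params: float, bytes_per_param: float = 16) -> float:
    """Rule-of-thumb memory for full fine-tuning with AdamW, excluding activations."""
    return n_params * bytes_per_param / 1e9

print(f"7B full FT: ~{full_ft_memory_gb(7e9):.0f} GB")  # ~112 GB — multiple A100s
print(f"3B full FT: ~{full_ft_memory_gb(3e9):.0f} GB")  # ~48 GB — one 80GB A100
```

Gradient checkpointing and 8-bit optimizers shrink these numbers considerably, which is what the asterisked table entries assume.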


Next: learn how to serve your fine-tuned model efficiently with vLLM and LLM inference optimization.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.