tutorial 2025-04-08 16 min read

LLM Fine-Tuning in Production: LoRA, QLoRA, and Full Fine-Tuning Compared

Learn when and how to fine-tune large language models for production use cases. Compare LoRA, QLoRA, and full fine-tuning with practical code examples and cost analysis.

fine-tuning LoRA QLoRA LLM PEFT Hugging Face training

The Fine-Tuning Decision

The first question isn't how to fine-tune; it's whether to fine-tune.

Prompt engineering and RAG are cheaper, faster to iterate, and easier to maintain. Fine-tuning wins when:

  1. Format / style matters and is hard to prompt for (e.g., specific JSON schema, brand voice)
  2. Latency is constrained and you can't afford long few-shot prompts
  3. Cost at scale: at 10M+ calls/day, a smaller fine-tuned model can be 10–50x cheaper than GPT-4o
  4. Data privacy: the training data can't leave your infrastructure

If none of these apply, don't fine-tune yet.
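To make the cost-at-scale criterion concrete, here is a back-of-the-envelope sketch. Every price and token count in it is a hypothetical placeholder chosen for illustration, not a quote from any provider; substitute your own numbers before drawing conclusions.

```python
def daily_cost_usd(calls_per_day: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Daily spend for a given call volume and blended per-token price."""
    return calls_per_day * tokens_per_call * usd_per_million_tokens / 1_000_000


# All inputs below are hypothetical placeholders for illustration only.
CALLS_PER_DAY = 10_000_000
TOKENS_PER_CALL = 500            # prompt + completion, assumed
HOSTED_USD_PER_M = 5.00          # assumed blended price of a large hosted model
SELF_HOSTED_USD_PER_M = 0.25     # assumed amortized price of a small fine-tuned model

hosted = daily_cost_usd(CALLS_PER_DAY, TOKENS_PER_CALL, HOSTED_USD_PER_M)
self_hosted = daily_cost_usd(CALLS_PER_DAY, TOKENS_PER_CALL, SELF_HOSTED_USD_PER_M)
print(f"hosted: ${hosted:,.0f}/day, self-hosted: ${self_hosted:,.0f}/day, "
      f"ratio: {hosted / self_hosted:.0f}x")
```

A fine-tuned model usually also needs a shorter prompt, so in practice TOKENS_PER_CALL would differ between the two cases; the sketch keeps it equal for simplicity.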


The Three Approaches

Full Fine-Tuning

Train all model weights. Maximum expressiveness, maximum cost.

When it makes sense:

  • Dramatic domain shift (e.g., fine-tuning on a specialized scientific corpus)
  • You have > 100K high-quality examples
  • You have the GPU budget (typically 8× A100 80GB for a 7B model)

Cost estimate (7B model, 100K examples, 3 epochs):

  • ~48 GPU-hours on A100 80GB
  • ~$150–300 on cloud (Lambda Labs, RunPod)
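The multi-GPU requirement follows from simple arithmetic. The sketch below assumes mixed-precision training with Adam (bf16 weights and gradients, fp32 master weights, two fp32 optimizer moments) and ignores activations entirely. The byte counts are a common rule of thumb, not a measurement of any particular setup.

```python
# Bytes of training state kept per parameter (assumed mixed-precision Adam setup).
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def full_finetune_state_gb(num_params: float) -> float:
    """Model plus optimizer state in GB (10^9 bytes), excluding activations."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


state_gb = full_finetune_state_gb(7e9)
print(f"7B model: ~{state_gb:.0f} GB of training state before activations")
```

Under these assumptions the state alone exceeds a single 80 GB card, which is why it is normally sharded across several GPUs.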

LoRA (Low-Rank Adaptation)

LoRA freezes the base model and injects trainable low-rank matrices into attention layers. Instead of training 7 billion parameters, you train ~4–40 million.

The math: For a weight matrix W ∈ ℝ^(d×k), LoRA adds ΔW = BA where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with rank r ≪ min(d, k).
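A toy forward pass makes the decomposition tangible. This is a pure-Python sketch with tiny, arbitrary dimensions. It assumes the usual alpha/r scaling on the update and the original LoRA convention of initialising B to zero, and it is not how PEFT implements the layer internally.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B would train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # passes through the rank-r bottleneck
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, k, r, alpha = 8, 6, 2, 4  # toy sizes chosen for illustration
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, starts at zero
x = [random.gauss(0, 1) for _ in range(k)]

# With B at zero the adapted layer reproduces the base layer exactly,
# so training starts from the pretrained model's behaviour.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)
print("adapter is a no-op at initialisation")
```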

import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # rank: higher = more capacity, more params
    lora_alpha=32,                 # scaling factor (often 2×r)
    target_modules=[               # which weight matrices to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.52%
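The trainable-parameter figure can be reproduced by hand, which is a useful sanity check when you change the rank or the target modules. The layer shapes below are assumptions taken from the published Llama 3.1 8B configuration (hidden size 4096, grouped-query attention with a 1024-wide key/value projection, MLP width 14336, 32 layers).

```python
# Llama 3.1 8B shapes (assumed from the published model config).
HIDDEN = 4096          # hidden size
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
INTERMEDIATE = 14336   # MLP width
NUM_LAYERS = 32
RANK = 16

# (in_features, out_features) for each adapted projection in one decoder layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A is r x in_features and B is out_features x r."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```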

Training with SFTTrainer:

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("json", data_files="train.jsonl", split="train")

training_args = SFTConfig(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch size = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    max_seq_length=2048,              # newer TRL releases rename this to max_length
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
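It helps to know how many optimizer steps these settings imply before launching a run, since the warmup length and logging cadence are expressed relative to them. A rough sketch follows. The dataset size is a hypothetical placeholder, and the rounding is an approximation that may differ by a step or so from what a given trainer version computes.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, num_gpus, epochs, warmup_ratio):
    """Approximate optimizer-step counts for a run (exact rounding varies by trainer)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


schedule = training_schedule(
    num_examples=10_000,  # hypothetical dataset size
    per_device_batch=4,
    grad_accum=4,
    num_gpus=1,
    epochs=3,
    warmup_ratio=0.03,
)
print(schedule)
```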

LoRA hyperparameter guide:

Hyperparameter  | Conservative    | Recommended   | Aggressive
----------------|-----------------|---------------|-----------
Rank (r)        | 4               | 16            | 64
Alpha           | 8               | 32            | 128
Target modules  | q_proj, v_proj  | all attention | all linear
LR              | 1e-4            | 2e-4          | 3e-4
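One pattern in the table is worth noticing: alpha is always twice the rank, so the alpha/r factor applied to the adapter update stays constant while capacity changes. The sketch below encodes the three columns as plain dictionaries. The preset names are just labels for this example, and the scaling rule assumes standard (not rank-stabilized) LoRA.

```python
# Table columns as plain dictionaries; the keys mirror LoraConfig / SFTConfig argument names.
LORA_PRESETS = {
    "conservative": {"r": 4, "lora_alpha": 8, "learning_rate": 1e-4},
    "recommended": {"r": 16, "lora_alpha": 32, "learning_rate": 2e-4},
    "aggressive": {"r": 64, "lora_alpha": 128, "learning_rate": 3e-4},
}

for name, preset in LORA_PRESETS.items():
    scale = preset["lora_alpha"] / preset["r"]  # factor applied to the BA update
    print(f"{name:>12}: r={preset['r']:<3} alpha={preset['lora_alpha']:<4} scale={scale}")
```

If you change the rank without touching alpha, the effective update magnitude changes with it, which can look like a learning-rate change.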

QLoRA (Quantized LoRA)

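QLoRA quantizes the frozen base model to 4-bit NF4 (NormalFloat4), then trains LoRA adapters in 16-bit. The base weights take roughly 4× less memory than in bf16, which cuts total training memory by about 3× compared to LoRA (see the comparison below) with minimal quality loss.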
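The roughly 4× figure comes from the weights alone, as the arithmetic below shows. The parameter count is derived from the print_trainable_parameters output earlier (all params minus trainable params). The estimate deliberately ignores quantization constants and any layers that stay in 16-bit, so real footprints will be somewhat higher.

```python
NUM_PARAMS = 8_072_204_288 - 41_943_040  # base model parameters, from the printout above


def weight_gb(num_params: int, bits_per_param: float) -> float:
    """Size of the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_gb(NUM_PARAMS, 16)
nf4_gb = weight_gb(NUM_PARAMS, 4)
print(f"bf16 weights: {bf16_gb:.1f} GB, 4-bit weights: {nf4_gb:.1f} GB")
```

The rest of the training footprint (adapter gradients, optimizer state, activations) is not quantized, which is why the end-to-end saving is smaller than 4×.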

Memory comparison for Llama 3.1 8B:

Method                   | GPU Memory Required
-------------------------|----------------------
Full fine-tuning (bf16)  | 160 GB (8× A100)
LoRA (bf16 base)         | 40 GB (1× A100 80GB)
QLoRA (4-bit base)       | 12 GB (1× RTX 3090)

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,      # 2nd quantization for extra savings
    bnb_4bit_quant_type="nf4",           # NormalFloat4: better than int4
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

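# Prepare the quantized model for training (casts norm layers to fp32 and
# enables gradient checkpointing), then apply the same LoRA config as above
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)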

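QLoRA gotcha: gradient checkpointing is usually what makes the memory figure above reachable, and with a frozen, quantized base model the input embeddings must be set to require gradients or the checkpointed backward pass has nothing to propagate. PEFT's prepare_model_for_kbit_training helper handles both by default; if you skip it, enable them explicitly: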

model.enable_input_require_grads()
model.gradient_checkpointing_enable()

Dataset Preparation

Quality beats quantity. 1,000 high-quality examples will often outperform 100,000 noisy ones.

Chat format (instruction tuning)

# train.jsonl format (pretty-printed here; in the file each record sits on a single line)
{
  "messages": [
    {"role": "system", "content": "You are a helpful SQL assistant."},
    {"role": "user", "content": "Write a query to find the top 10 customers by revenue."},
    {"role": "assistant", "content": "SELECT customer_id, SUM(amount) as revenue FROM orders GROUP BY customer_id ORDER BY revenue DESC LIMIT 10;"}
  ]
}
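Malformed records tend to surface as confusing errors deep inside training, so it can pay to validate the file up front. The function below is a minimal sketch of such a check for the chat format shown above. The specific rules (allowed roles, assistant turn last) are assumptions about what a typical instruction-tuning setup expects, not requirements stated by any library.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty content")
    if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
        problems.append("last message should be the assistant turn to train on")
    return problems


good = json.dumps({"messages": [
    {"role": "user", "content": "Write a query to count orders."},
    {"role": "assistant", "content": "SELECT COUNT(*) FROM orders;"},
]})
print(validate_record(good))
print(validate_record('{"messages": []}'))
```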

Data quality checklist

  • Remove duplicates (exact and near-duplicate via embedding similarity)
  • Filter outputs shorter than 20 tokens (likely truncated or garbage)
  • Validate JSON structure if training for structured output
  • Hold out 5–10% as eval set before any cleaning decisions
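The checklist above can be sketched in a few lines of standard-library Python. Several simplifications are assumed here: the field names "input" and "output" are hypothetical, whitespace-separated words stand in for real tokenizer tokens, and only exact duplicates (after case and whitespace normalisation) are caught. Near-duplicate detection via embeddings is left out.

```python
import hashlib
import random


def example_key(ex: dict) -> str:
    """Hash of the normalised input/output pair, used for exact-duplicate detection."""
    text = " ".join((ex["input"] + "\n" + ex["output"]).lower().split())
    return hashlib.sha256(text.encode()).hexdigest()


def clean_and_split(examples, min_output_tokens=20, eval_fraction=0.1, seed=42):
    """Hold out the eval set first, then dedupe and length-filter the training pool."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    eval_set, train_pool = shuffled[:n_eval], shuffled[n_eval:]

    # Seed the seen-set with eval keys so training never contains an eval example.
    seen = {example_key(ex) for ex in eval_set}
    train_set = []
    for ex in train_pool:
        key = example_key(ex)
        if key in seen:
            continue  # duplicate of an eval example or an earlier training example
        seen.add(key)
        if len(ex["output"].split()) < min_output_tokens:
            continue  # likely truncated or garbage
        train_set.append(ex)
    return train_set, eval_set


long_answer = " ".join(["token"] * 25)
examples = [{"input": f"question {i}", "output": long_answer} for i in range(18)]
examples.append({"input": "question 0", "output": long_answer})  # exact duplicate
examples.append({"input": "short", "output": "too short"})       # fails the length filter
train_set, eval_set = clean_and_split(examples)
print(len(train_set), "train /", len(eval_set), "eval")
```

Seeding the duplicate check with the eval keys is a design choice that goes slightly beyond the checklist: it keeps held-out examples from leaking into training.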

Merging LoRA Adapters

After training, merge adapters into the base model for faster inference:

from peft import PeftModel

# Load base model at full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)

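# Point this at your actual adapter directory; the Trainer names checkpoints
# checkpoint-<step>, so "checkpoint-final" here is a placeholder
model = PeftModel.from_pretrained(base_model, "./checkpoints/checkpoint-final")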
merged_model = model.merge_and_unload()

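merged_model.save_pretrained("./merged-model")

# Save the tokenizer alongside the weights so the directory is self-contained
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("./merged-model")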
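Conceptually, merging folds the scaled low-rank update into the frozen weight once, W' = W + (alpha/r)·BA, so inference no longer pays for the two extra adapter matmuls. The toy below checks that identity numerically in pure Python. It illustrates the idea under the standard alpha/r scaling assumption and is not a description of how merge_and_unload is implemented.

```python
import math
import random


def matmul(X, Y):
    """Multiply matrices given as lists of rows."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def matvec(M, x):
    """Multiply matrix M by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d, k, r, alpha = 6, 5, 2, 4  # toy sizes chosen for illustration
W, B, A = random_matrix(d, k), random_matrix(d, r), random_matrix(r, k)
x = [random.gauss(0, 1) for _ in range(k)]
scale = alpha / r

# Adapter path: base output plus the scaled low-rank correction.
adapter_out = [w + scale * u for w, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merged path: fold the update into W once, then run a single matmul.
BA = matmul(B, A)
W_merged = [[W[i][j] + scale * BA[i][j] for j in range(k)] for i in range(d)]
merged_out = matvec(W_merged, x)

assert all(math.isclose(a, m, rel_tol=1e-9, abs_tol=1e-9)
           for a, m in zip(adapter_out, merged_out))
print("merged weights reproduce the adapter outputs")
```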

Then quantize to GGUF or AWQ for serving:

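# Convert to GGUF for llama.cpp / Ollama: export at f16 first, then quantize
# with the llama-quantize tool built as part of llama.cpp
python convert_hf_to_gguf.py ./merged-model --outfile model-f16.gguf --outtype f16
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M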

# Or quantize to AWQ and serve with vLLM (flags differ between AWQ tools and versions; check the one you install)
python -m awq.entry --model_path ./merged-model --quant_path ./awq-model --w_bit 4 --q_group_size 128

Evaluation

Never rely on training loss alone. Build a task-specific eval set:

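import torch
from datasets import load_dataset
from transformers import pipeline

pipe = pipeline("text-generation", model="./merged-model", torch_dtype=torch.bfloat16)

eval_dataset = load_dataset("json", data_files="eval.jsonl", split="train")

# format_prompt and extract_answer are task-specific helpers you supply:
# one builds the prompt from a record, the other parses the model's answer.
correct = 0
for example in eval_dataset:
    prompt = format_prompt(example["input"])
    # Greedy decoding keeps the eval reproducible; return_full_text=False
    # returns only the completion instead of prompt + completion.
    output = pipe(
        prompt, max_new_tokens=256, do_sample=False, return_full_text=False
    )[0]["generated_text"]
    predicted = extract_answer(output)
    if predicted == example["expected"]:
        correct += 1

print(f"Accuracy: {correct / len(eval_dataset):.2%}")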
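The loop above relies on two helpers, format_prompt and extract_answer, that are not defined in this article. One possible shape for them is sketched below. The prompt template and the normalisation rules are hypothetical, and for a chat-tuned model you would more likely build the prompt with the tokenizer's chat template so that it matches the training format.

```python
import re


def format_prompt(question: str) -> str:
    """Hypothetical prompt template; match whatever format you trained on."""
    return f"Question: {question}\nAnswer:"


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, drop a trailing semicolon or period."""
    return re.sub(r"\s+", " ", text).strip().rstrip(";.").strip().lower()


def extract_answer(generated: str) -> str:
    """Take the first non-empty line of the completion and normalise it."""
    for line in generated.splitlines():
        if line.strip():
            return normalize(line)
    return ""


print(repr(format_prompt("How many orders are there?")))
print(repr(extract_answer("\n  SELECT COUNT(*)   FROM orders;\nExplanation: ...")))
```

Exact match after normalisation is strict. If you normalise predictions this way, apply the same normalisation to the expected values, and consider a task-specific check (for SQL, executing both queries and comparing results) when surface forms can legitimately differ.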

Decision Framework

Do you have < 1000 examples?
  → Start with prompt engineering or few-shot. Come back when you have more data.

Is latency the primary constraint?
  → QLoRA on a 7B model, serve with vLLM

Is cost at scale the primary constraint?
  → QLoRA on 7B or 13B, self-host on A10G or H100

Do you need maximum quality and have budget?
  → Full fine-tuning on 70B+ model

Is operational simplicity paramount?
  → OpenAI fine-tuning API (gpt-4o-mini is cost-effective)
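For teams that want this framework in a planning script or an internal tool, it reduces to a small lookup. The function below is a sketch. Its name and the constraint labels are invented for this example, and the recommendations are copied from the list above rather than derived from any benchmark.

```python
def recommend_approach(num_examples: int, primary_constraint: str) -> str:
    """Encode the decision framework above as a lookup.

    primary_constraint is one of: 'latency', 'cost', 'quality', 'simplicity'.
    """
    if num_examples < 1000:
        return "prompt engineering or few-shot"
    recommendations = {
        "latency": "QLoRA on a 7B model, served with vLLM",
        "cost": "QLoRA on a 7B or 13B model, self-hosted on A10G or H100",
        "quality": "full fine-tuning on a 70B+ model",
        "simplicity": "OpenAI fine-tuning API (gpt-4o-mini)",
    }
    if primary_constraint not in recommendations:
        raise ValueError(f"unknown constraint: {primary_constraint!r}")
    return recommendations[primary_constraint]


print(recommend_approach(500, "latency"))
print(recommend_approach(20_000, "cost"))
```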

Deploying your fine-tuned model at scale? See our guide on LLM Inference at Scale.
