LLM Fine-Tuning in Production: LoRA, QLoRA, and Full Fine-Tuning Compared

The Fine-Tuning Decision

The first question isn't how to fine-tune—it's whether to fine-tune.

Prompt engineering and RAG are cheaper, faster to iterate, and easier to maintain. Fine-tuning wins when:

Format / style matters and is hard to prompt for (e.g., specific JSON schema, brand voice)
Latency is constrained and you can't afford long few-shot prompts
Cost at scale: at 10M+ calls/day, a smaller fine-tuned model can be 10–50× cheaper than GPT-4o
Data privacy: the training data can't leave your infrastructure

If none of these apply, don't fine-tune yet.

The Three Approaches

Full Fine-Tuning

Train all model weights. Maximum expressiveness, maximum cost.

When it makes sense:

Dramatic domain shift (e.g., fine-tuning on a specialized scientific corpus)
You have > 100K high-quality examples
You have the GPU budget (typically 8× A100 80GB for a 7B model)

Cost estimate (7B model, 100K examples, 3 epochs):

~48 GPU-hours on A100 80GB
~$150–300 on cloud (Lambda Labs, RunPod)

LoRA (Low-Rank Adaptation)

LoRA freezes the base model and injects trainable low-rank matrices into selected linear layers, typically the attention projections and optionally the MLP projections. Instead of training all 7–8 billion parameters, you train roughly 4–40 million.

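The math: For a weight matrix W ∈ ℝ^(d×k), LoRA adds ΔW = BA where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with rank r ≪ min(d, k). W stays frozen and only A and B are trained, so each adapted matrix contributes r(d + k) trainable parameters instead of d·k.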
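
To make the low-rank update concrete, here is a minimal NumPy sketch of a single LoRA-adapted layer. The toy dimensions are arbitrary, and the alpha / r scaling and the zero initialization of B follow the standard LoRA formulation rather than anything specific to this article's training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 48, 4        # toy sizes; real layers use d, k in the thousands
alpha = 8                  # scaling factor; the update is scaled by alpha / r

W = rng.normal(size=(d, k))            # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01     # trainable, small random init
B = np.zeros((d, r))                   # trainable, zero init so that BA = 0 at step 0

def lora_forward(x):
    """Compute (W + (alpha / r) * B @ A) @ x without materializing the d x k update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))

# At initialization the adapter is a no-op: the output equals the base model's.
base_out = W @ x
lora_out = lora_forward(x)

full_params = d * k            # parameters a full update of W would train
lora_params = r * (d + k)      # parameters in B and A combined
print(full_params, lora_params)  # 3072 448
```

Because B starts at zero, training begins from exactly the base model's behavior, and the adapter only drifts away from it as gradients update A and B.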

import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # rank — higher = more capacity, more params
    lora_alpha=32,                 # scaling factor (often 2×r)
    target_modules=[               # which weight matrices to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.52%
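
The trainable-parameter count above can be checked by hand. The sketch below assumes the published Llama 3.1 8B shapes (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these values are not stated in this article, so verify them against the model's config if you adapt the calculation to another model.

```python
# Assumed Llama 3.1 8B shapes (taken from the published model config)
hidden = 4096          # model dimension
intermediate = 14336   # MLP inner dimension
kv_dim = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
num_layers = 32
r = 16                 # LoRA rank used in the config above

# (in_features, out_features) for each adapted projection in one decoder layer
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# Each adapter adds A (r x in) and B (out x r): r * (in + out) parameters.
per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
trainable = per_layer * num_layers
print(f"{trainable:,}")  # 41,943,040
```

Doubling the rank doubles this number, which is a quick way to budget adapter size before launching a run.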

Training with SFTTrainer:

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("json", data_files="train.jsonl", split="train")

training_args = SFTConfig(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch size = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    max_seq_length=2048,              # newer TRL releases rename this to max_length
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

LoRA hyperparameter guide:

Hyperparameter	Conservative	Recommended	Aggressive
Rank (r)	4	16	64
Alpha	8	32	128
Target modules	q_proj, v_proj	all attention	all linear
LR	1e-4	2e-4	3e-4

QLoRA (Quantized LoRA)

QLoRA quantizes the frozen base model to 4-bit NF4 (NormalFloat4), then trains LoRA adapters in 16-bit. This cuts the memory needed for the base weights by ~4× compared to bf16 LoRA, usually with minimal quality loss.

Memory comparison for Llama 3.1 8B:

Method	GPU Memory Required
Full fine-tuning (bf16)	~160 GB (multi-GPU, e.g., 8× A100)
LoRA (bf16 base)	40 GB (1× A100 80GB)
QLoRA (4-bit base)	12 GB (1× RTX 3090)
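
A rough back-of-envelope estimate shows where these figures come from. The sketch below is an approximation under stated assumptions, not a measurement: it assumes mixed-precision AdamW (16 bytes per trainable parameter), about 0.5 bytes per parameter for a 4-bit base, and it ignores activations, quantization constants, and framework overhead. The function name and constants are illustrative.

```python
GB = 1e9  # decimal gigabytes

# Assumed per-parameter cost for mixed-precision AdamW training:
# bf16 weight (2) + bf16 gradient (2) + fp32 master copy (4) + two fp32 Adam moments (8)
TRAINABLE_BYTES = 2 + 2 + 4 + 8

def static_memory_gb(n_frozen, frozen_bytes_per_param, n_trainable):
    """Weights, gradients, and optimizer state only; activations are not included."""
    return (n_frozen * frozen_bytes_per_param + n_trainable * TRAINABLE_BYTES) / GB

n_base = 8.03e9        # Llama 3.1 8B parameter count (approximate)
n_adapter = 41.9e6     # LoRA parameters at r=16 on all linear projections

full = static_memory_gb(0, 0, n_base)               # every weight is trainable
lora = static_memory_gb(n_base, 2.0, n_adapter)     # frozen bf16 base
qlora = static_memory_gb(n_base, 0.5, n_adapter)    # frozen 4-bit base

for name, gb in [("full", full), ("lora", lora), ("qlora", qlora)]:
    print(f"{name:>5}: {gb:6.1f} GB before activations")
```

Under these assumptions the static footprint comes to roughly 128 GB, 17 GB, and 5 GB. The gap to the table's figures is mostly activations, which grow with batch size and sequence length, plus allocator and framework overhead.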

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants for extra savings
    bnb_4bit_quant_type="nf4",           # NormalFloat4 — better than int4
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

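# Prepare the quantized model for training, then apply the LoRA config exactly as above.
# By default this also enables gradient checkpointing and input gradients.
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

QLoRA gotcha: gradient checkpointing is not strictly required with 4-bit quantization, but it is usually what lets training fit on a 24 GB card. When it is enabled on a frozen base model, the embedding outputs must require gradients, otherwise no gradient reaches the adapters. prepare_model_for_kbit_training handles both by default; if you skip it, make the two calls yourself before get_peft_model:

model.gradient_checkpointing_enable()
model.enable_input_require_grads()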

Dataset Preparation

Quality beats quantity. 1,000 high-quality examples will often outperform 100,000 noisy ones.

Chat format (instruction tuning)

# train.jsonl format: one JSON object per line (pretty-printed here for readability)
{
  "messages": [
    {"role": "system", "content": "You are a helpful SQL assistant."},
    {"role": "user", "content": "Write a query to find the top 10 customers by revenue."},
    {"role": "assistant", "content": "SELECT customer_id, SUM(amount) as revenue FROM orders GROUP BY customer_id ORDER BY revenue DESC LIMIT 10;"}
  ]
}
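
Malformed records are easier to catch before training than after. The following is a minimal validation sketch for the chat format above, using only the standard library. The function name, the set of accepted roles, and the rule that the final message must come from the assistant are assumptions for illustration; adjust them to your own schema.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_record(line):
    """Return a list of problems found in one JSONL line (empty list means valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems

good = json.dumps({"messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]})
bad = json.dumps({"messages": [{"role": "user", "content": "Hi"}]})

print(validate_record(good))  # []
print(validate_record(bad))   # ['last message is not from the assistant']
```

Running this over every line of train.jsonl and logging the line numbers of failures gives a quick pre-flight check before starting a training job.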

Data quality checklist

Remove duplicates (exact and near-duplicate via embedding similarity)
Filter outputs shorter than 20 tokens (likely truncated or garbage)
Validate JSON structure if training for structured output
Hold out 5–10% as an eval set before any cleaning decisions, and make sure no eval example (or near-duplicate) also appears in the training split
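
The checklist can be sketched as a small pipeline. This is one possible arrangement, not a prescribed one: it removes exact duplicates first so the same example cannot land in both splits, holds out the eval set, and only then applies the length filter to the training split. The field names "input" and "output" are hypothetical, whitespace-separated words stand in for tokenizer tokens, and near-duplicate detection via embeddings is left out.

```python
import hashlib
import random

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return " ".join(text.lower().split())

def dedupe(examples):
    """Drop exact duplicates (after normalization), keeping the first occurrence."""
    seen, unique = set(), []
    for ex in examples:
        raw = normalize(ex["input"]) + "\x00" + normalize(ex["output"])
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

def split_holdout(examples, eval_fraction=0.1, seed=0):
    """Shuffle deterministically and hold out a fraction as the eval set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

def drop_short_outputs(examples, min_tokens=20):
    """Filter short outputs; whitespace words approximate real tokenizer tokens."""
    return [ex for ex in examples if len(ex["output"].split()) >= min_tokens]

# Toy data: 20 distinct examples, one case-only duplicate, one short output
data = [{"input": f"q{i}", "output": "word " * 30} for i in range(20)]
data.append({"input": "Q0", "output": "word " * 30})
data.append({"input": "short", "output": "too short"})

unique = dedupe(data)
train, eval_set = split_holdout(unique, eval_fraction=0.1, seed=0)
train = drop_short_outputs(train)
print(len(unique), len(eval_set))  # 21 2
```

For tasks with legitimately short answers, such as classification labels, the 20-token threshold would discard valid examples, so treat min_tokens as something to tune per task.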

Merging LoRA Adapters

After training, merge adapters into the base model for faster inference:

from peft import PeftModel

# Load the unquantized (bf16) base model for merging, even if you trained with QLoRA
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)

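# "checkpoint-final" is a placeholder; point this at the adapter directory the trainer wrote
model = PeftModel.from_pretrained(base_model, "./checkpoints/checkpoint-final")
merged_model = model.merge_and_unload()

merged_model.save_pretrained("./merged-model")

# Save the tokenizer alongside the weights so the directory is self-contained
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("./merged-model")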

Then quantize to GGUF or AWQ for serving:

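# Convert to GGUF for llama.cpp / Ollama. The converter writes f16/bf16/q8_0;
# K-quants such as Q4_K_M need a second pass with llama-quantize (path depends on your build).
python convert_hf_to_gguf.py ./merged-model --outfile model-f16.gguf --outtype f16
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M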

# Or quantize to AWQ and serve the result with vLLM (flags vary by AWQ tool and version; check its docs)
python -m awq.entry --model_path ./merged-model --quant_path ./awq-model --w_bit 4 --q_group_size 128

Evaluation

Never rely on training loss alone. Build a task-specific eval set:

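import torch
from datasets import load_dataset
from transformers import pipeline

pipe = pipeline("text-generation", model="./merged-model", torch_dtype=torch.bfloat16)

eval_dataset = load_dataset("json", data_files="eval.jsonl", split="train")

correct = 0
for example in eval_dataset:
    # format_prompt and extract_answer are task-specific helpers you define yourself
    prompt = format_prompt(example["input"])
    # return_full_text=False drops the echoed prompt; do_sample=False keeps runs repeatable
    output = pipe(prompt, max_new_tokens=256, do_sample=False, return_full_text=False)[0]["generated_text"]
    predicted = extract_answer(output)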
    if predicted == example["expected"]:
        correct += 1

print(f"Accuracy: {correct / len(eval_dataset):.2%}")
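
The loop above calls two helpers, format_prompt and extract_answer, that are not defined in this article. One possible shape for them is sketched below. Both are hypothetical: the plain-text template and the first-line normalization are assumptions for a SQL-style task, and a model trained on the chat format should be prompted with the same chat template it was trained on.

```python
def format_prompt(user_input, system="You are a helpful SQL assistant."):
    """Hypothetical plain-text prompt template; match whatever template you trained with."""
    return f"{system}\n\nUser: {user_input}\nAssistant:"

def extract_answer(generated_text):
    """Hypothetical normalizer: first non-empty line, whitespace collapsed, trailing ';' dropped."""
    for line in generated_text.strip().splitlines():
        line = " ".join(line.split())
        if line:
            return line.rstrip(";").strip()
    return ""

print(extract_answer("\n  SELECT  1;\nExplanation: returns one row"))  # SELECT 1
```

Exact string match is a strict metric. For SQL in particular, executing the predicted and expected queries against a test database and comparing results is usually a fairer measure of correctness.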

Decision Framework

Do you have < 1000 examples?
  → Start with prompt engineering or few-shot. Come back when you have more data.

Is latency the primary constraint?
  → QLoRA on a 7B model, serve with vLLM

Is cost at scale the primary constraint?
  → QLoRA on 7B or 13B, self-host on A10G or H100

Do you need maximum quality and have budget?
  → Full fine-tuning on a 70B+ model

Is operational simplicity paramount?
  → OpenAI fine-tuning API (gpt-4o-mini is cost-effective)

Deploying your fine-tuned model at scale? See our guide on LLM Inference at Scale.