The Fine-Tuning Decision
The first question isn't how to fine-tune; it's whether to fine-tune.
Prompt engineering and RAG are cheaper, faster to iterate, and easier to maintain. Fine-tuning wins when:
- Format / style matters and is hard to prompt for (e.g., specific JSON schema, brand voice)
- Latency is constrained and you can't afford long few-shot prompts
- Cost at scale: at 10M+ calls/day, a smaller fine-tuned model can be 10–50x cheaper than GPT-4o
- Data privacy: the training data can't leave your infrastructure
If none of these apply, don't fine-tune yet.
The Three Approaches
Full Fine-Tuning
Train all model weights. Maximum expressiveness, maximum cost.
When it makes sense:
- Dramatic domain shift (e.g., fine-tuning on a specialized scientific corpus)
- You have > 100K high-quality examples
- You have the GPU budget (typically 8× A100 80GB for a 7B model)
Cost estimate (7B model, 100K examples, 3 epochs):
- ~48 GPU-hours on A100 80GB
- ~$150–300 on cloud (Lambda Labs, RunPod)
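The arithmetic behind this estimate is GPU-hours multiplied by an hourly rate. The sketch below is only illustrative: the function name is hypothetical, and the hourly rates are back-calculated from the $150–300 range above rather than quoted from any provider.

```python
def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Cloud cost of a training run: GPU-hours times the hourly rate."""
    return gpu_hours * usd_per_gpu_hour


GPU_HOURS = 48   # from the estimate above
NUM_GPUS = 8     # the 8x A100 node mentioned above

# Hourly rates implied by the $150-300 range (not provider quotes).
low_rate = 150 / GPU_HOURS
high_rate = 300 / GPU_HOURS

# Wall-clock time if the GPU-hours are spread evenly across the node.
wall_clock_hours = GPU_HOURS / NUM_GPUS

for rate in (low_rate, high_rate):
    cost = training_cost_usd(GPU_HOURS, rate)
    print(f"{rate:.3f} USD/GPU-hour -> {cost:.0f} USD")
print(f"about {wall_clock_hours:.0f} hours of wall-clock time on {NUM_GPUS} GPUs")
```

Plugging in your own provider's rate is usually more informative than the range itself, since GPU prices move quickly.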
LoRA (Low-Rank Adaptation)
LoRA freezes the base model and injects trainable low-rank matrices into attention layers. Instead of training 7 billion parameters, you train ~4–40 million.
The math: For a weight matrix W ∈ ℝ^(d×k), LoRA adds ΔW = BA where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with rank r << min(d,k).
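To make the formula concrete, here is a dependency-free toy sketch of the update for a single matrix. The sizes, the helper names, and the alpha / r scaling are assumptions for illustration: the scaling is what PEFT applies through lora_alpha, and starting B at zero mirrors PEFT's default initialization, so the adapted weight equals W before any training.

```python
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


d, k, r = 64, 64, 4   # toy sizes; real layers have d and k in the thousands
alpha = 8             # plays the role of lora_alpha

random.seed(0)
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # frozen, d x k
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                               # trainable, d x r, zero init


def adapted_weight(W, B, A, alpha, r):
    """Return W + (alpha / r) * BA, the weight the adapted layer effectively uses."""
    delta = matmul(B, A)  # d x k matrix of rank at most r
    scale = alpha / r
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


full_params = d * k        # parameters in W
lora_params = r * (d + k)  # parameters in B and A together

print(f"full: {full_params}, LoRA: {lora_params}")
print("unchanged at init:", adapted_weight(W, B, A, alpha, r) == W)
```

The savings grow with matrix size: the full count scales with d·k, the LoRA count only with r·(d + k).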
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # rank: higher = more capacity, more params
    lora_alpha=32,     # scaling factor (often 2×r)
    target_modules=[   # which weight matrices to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.52%
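The printed trainable-parameter count can be reproduced by hand, since each adapted matrix with d inputs and k outputs adds r(d + k) parameters. The sketch below assumes the Llama 3.1 8B layer shapes listed in its comments (taken from the model's published config, not from this article); the helper name is hypothetical.

```python
# Assumed Llama 3.1 8B shapes: hidden size 4096, 8 KV heads x 128 head dim = 1024,
# MLP intermediate size 14336, 32 transformer layers.
HIDDEN, KV_DIM, INTERMEDIATE, NUM_LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) of each projection that LoRA can adapt
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, target_modules):
    """Total LoRA parameters: rank * (in + out) per adapted matrix, per layer."""
    per_layer = sum(
        rank * (n_in + n_out)
        for n_in, n_out in (MODULE_SHAPES[name] for name in target_modules)
    )
    return per_layer * NUM_LAYERS


all_linear = lora_param_count(16, list(MODULE_SHAPES))
qv_only = lora_param_count(16, ["q_proj", "v_proj"])
print(f"r=16, all linear layers: {all_linear:,}")
print(f"r=16, q_proj and v_proj: {qv_only:,}")
```

With rank 16 on all seven projections this gives the 41,943,040 figure shown in the printout above.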
Training with SFTTrainer:
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("json", data_files="train.jsonl", split="train")

training_args = SFTConfig(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    max_seq_length=2048,            # newer TRL releases rename this to max_length
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
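It helps to know roughly how many optimizer steps this configuration produces, since logging_steps and the warmup are both expressed in steps. The sketch below is an approximation under stated assumptions: a single GPU, no sequence packing, and a hypothetical dataset of 10,000 examples. The function name is also hypothetical, and the trainer's own rounding may differ slightly.

```python
import math


def schedule_summary(num_examples, per_device_batch, grad_accum, epochs,
                     warmup_ratio, num_gpus=1):
    """Approximate optimizer-step counts for a training configuration."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


summary = schedule_summary(
    num_examples=10_000,   # hypothetical dataset size
    per_device_batch=4,
    grad_accum=4,
    epochs=3,
    warmup_ratio=0.03,
)
for name, value in summary.items():
    print(f"{name}: {value}")
```

If the warmup comes out to only a handful of steps on a small dataset, raising warmup_ratio or lowering the learning rate is a common adjustment.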
LoRA hyperparameter guide:
| Hyperparameter | Conservative | Recommended | Aggressive |
|---|---|---|---|
| Rank (r) | 4 | 16 | 64 |
| Alpha | 8 | 32 | 128 |
| Target modules | q_proj, v_proj | all attention | all linear |
| LR | 1e-4 | 2e-4 | 3e-4 |
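The table can be written down as plain presets, which makes the shared pattern visible: alpha is twice the rank in every column. This is a sketch with assumptions: the module names are the Llama-style ones used earlier, "all attention" is read as the four attention projections, and the dictionary layout is hypothetical. Note that learning_rate belongs to the training config, not to LoraConfig.

```python
# Llama-style module names; other architectures name their projections differently.
ATTENTION = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP = ["gate_proj", "up_proj", "down_proj"]

LORA_PRESETS = {
    "conservative": {
        "r": 4, "lora_alpha": 8,
        "target_modules": ["q_proj", "v_proj"],
        "learning_rate": 1e-4,
    },
    "recommended": {
        "r": 16, "lora_alpha": 32,
        "target_modules": ATTENTION,
        "learning_rate": 2e-4,
    },
    "aggressive": {
        "r": 64, "lora_alpha": 128,
        "target_modules": ATTENTION + MLP,
        "learning_rate": 3e-4,
    },
}

for name, preset in LORA_PRESETS.items():
    scale = preset["lora_alpha"] / preset["r"]
    print(f"{name}: r={preset['r']}, alpha/r={scale:g}, "
          f"{len(preset['target_modules'])} target modules")
```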
QLoRA (Quantized LoRA)
QLoRA quantizes the frozen base model to 4-bit NF4 (NormalFloat4), then trains LoRA adapters in 16-bit. This cuts the memory needed for the base model's weights by ~4× compared to bf16 LoRA, with minimal quality loss.
Memory comparison for Llama 3.1 8B:
| Method | GPU Memory Required |
|---|---|
| Full fine-tuning (bf16) | 160 GB (8× A100) |
| LoRA (bf16 base) | 40 GB (1× A100 80GB) |
| QLoRA (4-bit base) | 12 GB (1× RTX 3090) |
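A back-of-the-envelope calculation shows where most of these numbers come from. The sketch below counts only weights and optimizer state, so it lands below the table's figures, which also have to cover activations, adapter state, and framework overhead. The parameter count and the 16 bytes per parameter for full fine-tuning (bf16 weights and gradients, fp32 master weights, two fp32 Adam moments) are assumptions about a typical mixed-precision setup.

```python
PARAMS = 8.03e9  # approximate parameter count of Llama 3.1 8B


def weights_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_weights = weights_gb(PARAMS, 16)  # frozen base for plain LoRA
nf4_weights = weights_gb(PARAMS, 4)    # frozen base for QLoRA

# Assumed mixed-precision Adam layout for full fine-tuning, in bytes per parameter:
# bf16 weights (2) + bf16 gradients (2) + fp32 master weights (4) + two fp32 moments (8)
FULL_FT_BYTES_PER_PARAM = 2 + 2 + 4 + 8
full_ft_state = PARAMS * FULL_FT_BYTES_PER_PARAM / 1e9

print(f"bf16 base weights:       {bf16_weights:.1f} GB")
print(f"4-bit base weights:      {nf4_weights:.1f} GB")
print(f"full fine-tuning state:  {full_ft_state:.1f} GB")
print(f"bf16 / 4-bit ratio:      {bf16_weights / nf4_weights:.0f}x")
```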
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,   # 2nd quantization for extra savings
    bnb_4bit_quant_type="nf4",        # NormalFloat4: better than int4
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
# Then apply LoRA config exactly as above
model = get_peft_model(model, lora_config)
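QLoRA gotcha: 4-bit training usually relies on gradient checkpointing to fit in memory, and with a frozen base model checkpointing only works if the embedding outputs require gradients. Enable both before training (PEFT's prepare_model_for_kbit_training covers the same steps):
model.enable_input_require_grads()
model.gradient_checkpointing_enable()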
Dataset Preparation
Quality beats quantity. A thousand carefully curated examples will often outperform 100,000 noisy ones.
Chat format (instruction tuning)
# train.jsonl format: one JSON object per line (pretty-printed here for readability)
{
  "messages": [
    {"role": "system", "content": "You are a helpful SQL assistant."},
    {"role": "user", "content": "Write a query to find the top 10 customers by revenue."},
    {"role": "assistant", "content": "SELECT customer_id, SUM(amount) as revenue FROM orders GROUP BY customer_id ORDER BY revenue DESC LIMIT 10;"}
  ]
}
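A malformed line tends to surface only partway through training, so it is worth validating the file up front. The sketch below checks each line against the chat format shown above. The function name and the specific rules (allowed roles, final message from the assistant) are assumptions about what a reasonable check looks like, not requirements of any particular trainer.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if it looks valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")

    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems


def validate_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems for every invalid line in a JSONL file."""
    report = {}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if line.strip():
                problems = validate_record(line)
                if problems:
                    report[lineno] = problems
    return report
```

Calling validate_file("train.jsonl") before training returns an empty dict when every line passes.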
Data quality checklist
- Remove duplicates (exact and near-duplicate via embedding similarity)
- Filter outputs shorter than 20 tokens (likely truncated or garbage)
- Validate JSON structure if training for structured output
- Hold out 5–10% as eval set before any cleaning decisions
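Part of this checklist can be sketched in a few lines of plain Python. The example below covers the hold-out split, exact-duplicate removal, and the length filter; near-duplicate detection with embeddings is left out because it needs an embedding model. The function names are hypothetical, and counting whitespace-separated words is only a rough stand-in for tokenizer tokens.

```python
import random


def split_holdout(examples, eval_fraction=0.1, seed=0):
    """Shuffle and split into (train, eval) before any cleaning decisions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def assistant_text(example):
    """Concatenate all assistant turns of a chat-format example."""
    return " ".join(m["content"] for m in example["messages"] if m["role"] == "assistant")


def clean(examples, min_tokens=20):
    """Drop exact duplicates and examples whose assistant output is too short."""
    seen, kept = set(), []
    for ex in examples:
        # Normalize case and whitespace so trivial variations count as duplicates.
        key = tuple((m["role"], " ".join(m["content"].lower().split()))
                    for m in ex["messages"])
        if key in seen:
            continue
        seen.add(key)
        # Whitespace word count is a rough proxy for tokenizer tokens.
        if len(assistant_text(ex).split()) < min_tokens:
            continue
        kept.append(ex)
    return kept
```

A typical order is train, held_out = split_holdout(examples) followed by train = clean(train), so that cleaning choices are never tuned against the held-out set.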
Merging LoRA Adapters
After training, merge adapters into the base model for faster inference:
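import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model at full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
model = PeftModel.from_pretrained(base_model, "./checkpoints/checkpoint-final")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")

# Save the tokenizer alongside the weights so the directory is self-contained
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("./merged-model")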
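Then quantize to GGUF or AWQ for serving:
# Convert to GGUF for llama.cpp / Ollama. The converter only writes formats such as
# f16, bf16, or q8_0; k-quants like Q4_K_M come from llama.cpp's llama-quantize tool.
python convert_hf_to_gguf.py ./merged-model --outfile model-f16.gguf --outtype f16
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# Or quantize to AWQ for serving with vLLM
python -m awq.entry --model_path ./merged-model --quant_path ./awq-model --w_bit 4 --q_group_size 128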
Evaluation
Never rely on training loss alone. Build a task-specific eval set:
import torch
from datasets import load_dataset
from transformers import pipeline

pipe = pipeline("text-generation", model="./merged-model", torch_dtype=torch.bfloat16)
eval_dataset = load_dataset("json", data_files="eval.jsonl", split="train")

# format_prompt and extract_answer are task-specific helpers you supply:
# one builds the prompt from an input, the other parses the model's answer.
correct = 0
for example in eval_dataset:
    prompt = format_prompt(example["input"])
    # return_full_text=False keeps the prompt out of the generated text
    output = pipe(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
    predicted = extract_answer(output)
    if predicted == example["expected"]:
        correct += 1
print(f"Accuracy: {correct / len(eval_dataset):.2%}")
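The loop above depends on two helpers, format_prompt and extract_answer, that the article does not define. The sketch below shows one possible shape for them, using the SQL task from the dataset example. Both bodies are assumptions: the prompt wording is hypothetical, and a chat-tuned model would normally be prompted through its chat template instead.

```python
import re

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:sql)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def format_prompt(user_input: str) -> str:
    """Build a plain-text prompt for one evaluation input (hypothetical wording)."""
    return f"Write a SQL query for the following request.\n\nRequest: {user_input}\nSQL:"


def extract_answer(generated: str) -> str:
    """Pull the first SQL statement out of the model output and normalize it."""
    match = FENCED_BLOCK.search(generated)
    text = match.group(1) if match else generated
    statement = text.split(";")[0]             # keep only the first statement
    return " ".join(statement.split()).lower()  # collapse whitespace, ignore case


print(extract_answer("SELECT  a\nFROM t; Hope this helps!"))
```

Exact match is a strict metric, so the expected answers need the same normalization; for SQL, executing both queries and comparing results is often a fairer test.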
Decision Framework
Do you have < 1000 examples?
→ Start with prompt engineering or few-shot. Come back when you have more data.
Is latency the primary constraint?
→ QLoRA on a 7B model, serve with vLLM
Is cost at scale the primary constraint?
→ QLoRA on 7B or 13B, self-host on A10G or H100
Do you need maximum quality and have budget?
→ Full fine-tuning on 70B+ model
Is operational simplicity paramount?
→ OpenAI fine-tuning API (gpt-4o-mini is cost-effective)
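The framework reads naturally as a small lookup, which can be handy in a planning script or internal tool. This is only a sketch: the function name and the constraint keywords are hypothetical, and the recommendations are copied from the list above.

```python
def recommend(num_examples: int, primary_constraint: str) -> str:
    """Map dataset size and the dominant constraint to a starting approach."""
    if num_examples < 1000:
        return "prompt engineering or few-shot"

    recommendations = {
        "latency": "QLoRA on a 7B model, served with vLLM",
        "cost": "QLoRA on 7B or 13B, self-hosted on A10G or H100",
        "quality": "full fine-tuning on a 70B+ model",
        "simplicity": "OpenAI fine-tuning API (gpt-4o-mini)",
    }
    if primary_constraint not in recommendations:
        raise ValueError(f"unknown constraint: {primary_constraint!r}")
    return recommendations[primary_constraint]


print(recommend(500, "latency"))
print(recommend(20_000, "latency"))
```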
Deploying your fine-tuned model at scale? See our guide on LLM Inference at Scale.