Introduction
What if you could combine multiple fine-tuned models — one specialized for coding, one for reasoning, one for instruction following — into a single model that's better at all three, without any training? This is the premise of model merging, and it works surprisingly well.
Model merging has emerged as one of the most cost-effective techniques in the LLM practitioner's toolkit. It's free (no GPU training required), reversible (if the merge fails, you still have the originals), and can unlock capabilities that fine-tuning on individual tasks can't achieve.
Why Model Merging Works: The Loss Landscape Intuition
Neural network training converges to a basin in loss space. Multiple fine-tunes of the same base model converge to different points within (or near) the same broad basin, because they start from the same initialization and share the base model's learned structure.
Weight-space interpolation between these solutions often stays within the low-loss region:
Base model weights: θ_base
Fine-tune A weights: θ_A (θ_base + δ_A)
Fine-tune B weights: θ_B (θ_base + δ_B)
Naive average: θ_merged = (θ_A + θ_B) / 2
= θ_base + (δ_A + δ_B) / 2
If δ_A and δ_B point in compatible directions (low interference), the average stays in the low-loss region.
If they conflict (pushing the same parameters in opposite directions), quality degrades.
The key insight: fine-tuned models that share a base differ primarily in the task-specific delta added to base model weights. Merging is about intelligently combining these deltas.
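The identity above is easy to sanity-check numerically. In this sketch the tensors are arbitrary stand-ins for a layer's weights, not real model parameters:

```python
import torch

torch.manual_seed(0)

# Stand-ins for one layer's weights and two fine-tuning deltas
theta_base = torch.randn(8)
delta_a = 0.1 * torch.randn(8)
delta_b = 0.1 * torch.randn(8)

theta_a = theta_base + delta_a   # fine-tune A
theta_b = theta_base + delta_b   # fine-tune B

# Averaging the fine-tuned weights...
avg_of_weights = (theta_a + theta_b) / 2
# ...is the same as adding the averaged deltas to the base.
base_plus_avg_delta = theta_base + (delta_a + delta_b) / 2

assert torch.allclose(avg_of_weights, base_plus_avg_delta)
```

So every merging method in this post can be read as a different rule for combining deltas before adding them back to the base.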
Method 1: Model Soup (Weight Averaging)
The simplest approach, introduced for vision models in "Model Soups" (Wortsman et al., 2022), is to average the weights of multiple fine-tuned models:
import torch

def model_soup(models):
    """Uniform weight average of models fine-tuned from the same base."""
    merged_params = {}
    for name, _ in models[0].named_parameters():
        stacked = torch.stack([m.state_dict()[name].float() for m in models])
        merged_params[name] = stacked.mean(dim=0)
    return merged_params
This works well when models are fine-tuned from the same base on similar tasks. It effectively averages out per-model noise and preserves shared capabilities.
Limitation: Averaging doesn't work well when models have conflicting weights (e.g., a model fine-tuned for polite responses + a model fine-tuned for direct responses may average to incoherent behavior).
Method 2: Task Arithmetic
Task Arithmetic (Ilharco et al., 2023) introduces the concept of task vectors — the difference between fine-tuned and base model weights:
# Compute task vectors
task_vector_A = fine_tuned_A - base_model # learned delta for task A
task_vector_B = fine_tuned_B - base_model # learned delta for task B
# Merge by adding task vectors to the base model
merged = base_model + scaling_coef * task_vector_A + scaling_coef * task_vector_B
Task vectors can be added, subtracted, and scaled. This enables:
Add a capability: merged = base + 0.7 * coding_vector
Remove a behavior: safe_model = base - 0.5 * toxic_vector
Combine capabilities: merged = base + 0.7 * coding_vector + 0.5 * math_vector
Key hyperparameter: The scaling coefficient controls how strongly each task's changes are applied. Too high → one task dominates; too low → capabilities barely transfer.
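Task arithmetic is a few lines of tensor algebra over state dicts. The sketch below uses toy single-tensor "models" and illustrative function names; real use would iterate over full checkpoint state dicts the same way:

```python
import torch

def task_vector(base_sd, tuned_sd):
    """Per-parameter delta between a fine-tune and its base."""
    return {k: tuned_sd[k] - base_sd[k] for k in base_sd}

def apply_task_vectors(base_sd, vectors, coefs):
    """theta_merged = theta_base + sum_i coef_i * delta_i."""
    merged = {k: v.clone() for k, v in base_sd.items()}
    for vec, coef in zip(vectors, coefs):
        for k in merged:
            merged[k] += coef * vec[k]
    return merged

# Toy state dicts standing in for real checkpoints
base = {"w": torch.zeros(4)}
tuned_a = {"w": torch.tensor([1.0, 0.0, 0.0, 0.0])}
tuned_b = {"w": torch.tensor([0.0, 1.0, 0.0, 0.0])}

merged = apply_task_vectors(
    base,
    [task_vector(base, tuned_a), task_vector(base, tuned_b)],
    coefs=[0.7, 0.5],
)
# merged["w"] is tensor([0.7, 0.5, 0.0, 0.0])
```

Negating a coefficient gives the "remove a behavior" case: passing `coefs=[-0.5]` subtracts half of that task's delta from the base.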
Method 3: TIES (Trim, Elect, Merge)
Task arithmetic suffers from interference: task vectors carry many redundant near-zero parameters that add noise without contributing capability, and sign conflicts between tasks partially cancel each other out. TIES (Yadav et al., 2023) handles both with a three-step procedure:
TIES algorithm for merging N task vectors δ_1, ..., δ_N:
Step 1 — TRIM: For each task vector, zero out all but the top-k% of entries by magnitude
δ̂_i = δ_i ⊙ 𝟙[|δ_i| > τ] ← τ is the magnitude threshold that keeps exactly the top-k%
Step 2 — ELECT: For each parameter position, decide the merged sign
γ = sign(Σ δ̂_i) ← majority vote on sign across tasks
Step 3 — MERGE: Average only parameters that agree with the elected sign
δ_merged[p] = mean(δ̂_i[p] for i where sign(δ̂_i[p]) == γ[p])
Final model: θ_merged = θ_base + λ * δ_merged
TIES is particularly effective when merging many models (3+) with potentially conflicting specializations. By electing a sign and filtering disagreements, it avoids the cancellation that plagues naive averaging.
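The three steps translate almost line-for-line into code. This is a minimal sketch over flat task vectors (real parameters would be flattened per layer); `density` is the fraction kept in the TRIM step, and sign election uses sign(Σ δ̂_i), matching the formula above:

```python
import torch

def ties_merge(deltas, density=0.2, lam=1.0):
    """TIES on flat task vectors: trim, elect sign, merge agreeing entries."""
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.numel()))
        # TRIM: zero everything below the k-th largest magnitude
        thresh = d.abs().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= thresh, d, torch.zeros_like(d)))
    stacked = torch.stack(trimmed)
    # ELECT: per-parameter sign with the larger total mass
    elected = torch.sign(stacked.sum(dim=0))
    # MERGE: average only entries whose sign matches the elected one
    agree = (torch.sign(stacked) == elected) & (stacked != 0)
    counts = agree.sum(dim=0).clamp(min=1)
    return lam * (stacked * agree).sum(dim=0) / counts

merged = ties_merge(
    [torch.tensor([1.0, -0.1, 0.5, 0.0]),
     torch.tensor([0.9, 0.2, -0.6, 0.0])],
    density=0.5,
)
```

In the example, position 0 agrees in sign across both vectors and gets averaged, position 2 conflicts and only the sign-winning entry survives, and the small-magnitude position 1 is trimmed away entirely.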
Method 4: DARE (Drop and Rescale)
DARE (Yu et al., 2023) takes a different approach: randomly drop most delta parameters before merging, then rescale the survivors:
import torch

def dare_merge(base_model, fine_tuned_models, drop_rate=0.9, lambda_coef=1.0):
    merged_delta = {}
    for name, param in base_model.named_parameters():
        deltas = []
        for ft_model in fine_tuned_models:
            delta = ft_model.state_dict()[name] - param
            # Drop a random fraction of the delta's entries
            mask = (torch.rand_like(delta) > drop_rate).float()
            # Rescale survivors so the delta's expected value is preserved
            sparse_delta = delta * mask / (1 - drop_rate)
            deltas.append(sparse_delta)
        merged_delta[name] = torch.stack(deltas).mean(dim=0)
    # Add the scaled merged delta back onto the base weights
    return {name: param + lambda_coef * merged_delta[name]
            for name, param in base_model.named_parameters()}
The intuition: most fine-tuned parameters are redundant (they encode similar knowledge). Randomly dropping 90% and rescaling reduces interference between task vectors while preserving the essential capabilities.
DARE often works better than TIES for LoRA-based fine-tunes (where task vectors are already sparse).
TIES + DARE Combined
The current best practice for merging multiple LoRA-tuned models is to combine both techniques:
1. DARE: randomly drop 90% of each task vector's parameters, rescale
2. TIES: trim remaining near-zeros, elect sign, merge sign-consistent parameters
3. Add scaled merged delta to base model
In many published comparisons, this combination yields the best merge quality of the methods covered here.
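The pipeline can be sketched end-to-end on flat task vectors. This is an illustrative simplification (the explicit TRIM step is folded into DARE's random dropping, which already sparsifies each vector):

```python
import torch

def dare(delta, drop_rate=0.9, generator=None):
    """Step 1 - DARE: randomly drop entries, rescale survivors by 1/(1-p)."""
    mask = (torch.rand(delta.shape, generator=generator) >= drop_rate).float()
    return delta * mask / (1.0 - drop_rate)

def sign_elect_merge(deltas):
    """Step 2 - TIES-style: elect a per-parameter sign, average agreeing entries."""
    stacked = torch.stack(deltas)
    elected = torch.sign(stacked.sum(dim=0))
    agree = (torch.sign(stacked) == elected) & (stacked != 0)
    counts = agree.sum(dim=0).clamp(min=1)
    return (stacked * agree).sum(dim=0) / counts

g = torch.Generator().manual_seed(0)
task_vectors = [0.01 * torch.randn(1000, generator=g) for _ in range(3)]

sparse = [dare(v, drop_rate=0.9, generator=g) for v in task_vectors]
merged_delta = sign_elect_merge(sparse)
# Step 3: theta_merged = theta_base + lam * merged_delta
```

With `drop_rate=0.9`, only about 10% of each vector's entries survive, so most parameter positions receive a contribution from at most one task, which is exactly how DARE reduces interference before the sign election.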
Practical Application: Merging LoRA Adapters
In practice, model merging is most commonly applied to LoRA-fine-tuned models, since full fine-tuned models are expensive to train and store:
mergekit describes a merge declaratively in a YAML config file. A TIES merge of two fine-tunes might look like this (model names are placeholders):

merge_method: ties          # or dare_ties, linear, slerp, task_arithmetic
base_model: meta-llama/Llama-3.1-8B-Instruct
models:
  - model: my-coding-lora
    parameters:
      weight: 0.5
      density: 0.5
  - model: my-math-lora
    parameters:
      weight: 0.5
      density: 0.5
dtype: bfloat16

Run the merge with the CLI: mergekit-yaml config.yaml ./merged-model
mergekit is the standard open-source library for model merging and implements all major algorithms.
When Merging Works and When It Fails
Merging works well when:
- Models share the same base model and tokenizer
- Fine-tunes target complementary capabilities (code + math + instruction-following)
- Models are fine-tuned with LoRA (sparse task vectors merge cleanly)
- Individual models are already high quality
Merging fails when:
- Models have conflicting behavioral fine-tunes (helpful vs. unhelpful at the same prompts)
- Models use different base models or architectures
- Task vectors are highly correlated (merging two copies of essentially the same fine-tune)
- One model is significantly higher quality than others (averaging dilutes it)
Evolutionary Model Merging
A recent extension: evolutionary model merging treats the merge coefficients as a search problem. Use evolutionary algorithms or Bayesian optimization to find the weights λ_1, ..., λ_n that maximize performance on a target benchmark:
Search over λ values:
Candidate: merged = base + λ_1 * δ_1 + λ_2 * δ_2 + ... + λ_n * δ_n
Fitness: score on validation benchmark
Algorithm: CMA-ES or similar evolutionary strategy
This can find merges significantly better than uniform-weight averaging, but requires a validation benchmark and search compute (though much cheaper than fine-tuning).
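The search loop itself is simple. The sketch below is a minimal (1+λ)-style evolution strategy with a mock fitness function whose optimum is known; in real use, fitness would be the benchmark score of the model merged with those coefficients:

```python
import torch

torch.manual_seed(0)

# Toy stand-ins: 3 task vectors, and a mock "benchmark" peaking at known coefficients
deltas = torch.randn(3, 16)
true_lams = torch.tensor([0.8, 0.3, 0.5])

def fitness(lams):
    """Higher is better; real use: score the merged model on a validation set."""
    return -torch.sum((lams - true_lams) ** 2).item()

best = torch.full((3,), 0.5)   # start from uniform coefficients
sigma = 0.2                    # mutation scale, annealed over generations
for gen in range(200):
    candidates = best + sigma * torch.randn(8, 3)
    scores = torch.tensor([fitness(c) for c in candidates])
    if scores.max() > fitness(best):
        best = candidates[scores.argmax()]
    sigma *= 0.99

merged_delta = best @ deltas   # sum_i lam_i * delta_i, to be added to theta_base
```

A production setup would replace this loop with CMA-ES (e.g. via the `cma` package) and would evaluate far fewer candidates, since each fitness call requires a full benchmark run.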
Conclusion
Model merging is one of the most underutilized techniques in practical LLM development. It's free, reversible, and frequently produces models that outperform any single fine-tune on multi-task evaluations. The conceptual shift it requires is thinking of fine-tuning not as producing a new model, but as producing a task vector that can be composed with others. DARE and TIES are the current best-practice algorithms; mergekit makes them accessible in a few lines of code.
Interested in other efficient model adaptation techniques? Read our guide on LoRA and QLoRA fine-tuning.