LoRA: Low-Rank Adaptation — Hannes Gustafsson

The core idea behind LoRA is deceptively simple. Instead of updating all the parameters in a weight matrix during fine-tuning, we freeze the original weights and inject two small matrices — A and B — whose product approximates the full update.

For a weight matrix $W$ of size $d \times d$, the update becomes:

$W' = W + BA$

where $B$ has shape $d \times r$ and $A$ has shape $r \times d$, with rank $r$ much smaller than $d$.
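To make the shapes concrete, here is a minimal NumPy sketch of a forward pass through $W' = W + BA$. The function name `lora_forward`, the row-vector batch convention, and the zero initialization of `B` (a common convention so that $W' = W$ before any training) are assumptions for illustration, not something this post prescribes.

```python
import numpy as np

def lora_forward(x, W, A, B):
    """Compute x @ (W + B @ A).T without materializing the d x d product B @ A."""
    # x: (batch, d), W: (d, d) frozen, A: (r, d) trainable, B: (d, r) trainable
    return x @ W.T + (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d, r = 768, 4
W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # small random init (assumed)
B = np.zeros((d, r))                     # zero init (assumed), so BA = 0 at the start
x = rng.standard_normal((2, d))          # a batch of two input vectors

y = lora_forward(x, W, A, B)
print(y.shape)  # (2, 768)
```

Multiplying by `A.T` first keeps the intermediate at shape `(batch, r)`, so the extra cost per input is two thin matrix products rather than one full $d \times d$ one.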

This means instead of training $d^2$ parameters, we only train $2 \cdot d \cdot r$. For a typical transformer layer where $d = 768$ and $r = 4$, that is 6,144 parameters instead of 589,824, roughly 1% of the original.
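The arithmetic above is easy to check for other values of $d$ and $r$. The helper name `lora_param_counts` is hypothetical, and the sketch assumes a single square $d \times d$ matrix as in the text.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int, float]:
    """Return (full fine-tune params, LoRA params, LoRA / full) for one d x d matrix."""
    full = d * d        # every entry of W is trainable
    lora = 2 * d * r    # B is d x r, A is r x d
    return full, lora, lora / full

full, lora, ratio = lora_param_counts(768, 4)
print(f"{full:,} vs {lora:,} ({ratio:.2%})")  # prints "589,824 vs 6,144 (1.04%)"
```

The ratio simplifies to $2r/d$, so doubling the rank doubles the adapter size while the full-matrix count stays fixed.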

Interactive (d = 768, rank r = 4): full fine-tune 589,824 parameters; LoRA (A + B) 6,144 parameters; 1.04% of original parameters.

The practical implications are significant for edge deployment. A LoRA adapter for a 7B model might be just 4-16 MB, depending on the rank and on which weight matrices are adapted, compared with roughly 14 GB for the full model at 16-bit precision. You can swap task-specific adapters at runtime without reloading the base model.
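A rough back-of-the-envelope estimate shows where a figure in that range can come from. The configuration below is an assumption for illustration only: 32 layers, hidden size 4096, two adapted $d \times d$ matrices per layer, and 2 bytes per parameter. The function name `adapter_size_mib` is hypothetical.

```python
def adapter_size_mib(hidden, layers, matrices_per_layer, r, bytes_per_param=2):
    """Estimate adapter size in MiB when each adapted matrix is hidden x hidden."""
    # Each adapted matrix adds 2 * hidden * r parameters (one B and one A).
    params = layers * matrices_per_layer * 2 * hidden * r
    return params * bytes_per_param / 2**20

# Assumed 7B-class shape: 32 layers, hidden size 4096, 2 adapted matrices per layer.
for r in (4, 8, 16):
    print(r, adapter_size_mib(4096, 32, 2, r))
# prints:
# 4 4.0
# 8 8.0
# 16 16.0

# Full model at 16-bit precision: 7e9 parameters * 2 bytes = 14e9 bytes.
full_model_gb = 7e9 * 2 / 1e9
print(full_model_gb)  # 14.0
```

Adapting more matrices per layer or using a higher rank scales the estimate linearly, so real adapters can land outside this range.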

This is exactly what makes fine-tuning feasible on resource-constrained hardware — you’re not shipping a new model for each task, you’re shipping a tiny delta.
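The runtime adapter swap described above can be sketched in a few lines of NumPy. The class `AdaptedLinear`, its method names, and the task names are all hypothetical; this is a toy illustration of the idea under those assumptions, not the API of any particular library.

```python
import numpy as np

class AdaptedLinear:
    """One frozen base weight shared by several named low-rank adapters."""

    def __init__(self, W):
        self.W = W            # base weight, loaded once and never modified
        self.adapters = {}    # task name -> (A, B)
        self.active = None    # None means "use the base model only"

    def add_adapter(self, name, A, B):
        self.adapters[name] = (A, B)

    def set_active(self, name):
        if name is not None and name not in self.adapters:
            raise KeyError(name)
        self.active = name

    def __call__(self, x):
        y = x @ self.W.T
        if self.active is not None:
            A, B = self.adapters[self.active]
            y = y + (x @ A.T) @ B.T   # add the low-rank delta on the fly
        return y

rng = np.random.default_rng(0)
d, r = 16, 2
layer = AdaptedLinear(rng.standard_normal((d, d)))
for task in ("summarize", "translate"):   # hypothetical task names
    layer.add_adapter(task, rng.standard_normal((r, d)), rng.standard_normal((d, r)))

x = rng.standard_normal((1, d))
layer.set_active("summarize")
y_summarize = layer(x)
layer.set_active("translate")   # swap: only the small (A, B) pair changes
y_translate = layer(x)
```

Switching tasks only changes which `(A, B)` pair is looked up; the base weight `W` stays in memory untouched.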