Hannes Gustafsson

LoRA: Low-Rank Adaptation

arxiv.org/abs/2106.09685

The core idea behind LoRA is deceptively simple. Instead of updating all the parameters in a weight matrix during fine-tuning, we freeze the original weights and inject two small matrices — A and B — whose product approximates the full update.

For a weight matrix W of size d x d, the update becomes:

$$W' = W + BA$$

where B has shape d x r and A has shape r x d, with rank r much smaller than d.
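To make the shapes concrete, here is a minimal NumPy sketch of the update above. The random `W` is only a stand-in for a pretrained weight, and `lora_forward` is a name made up for this illustration. The initialisation (small Gaussian `A`, zero `B`) follows the convention described in the LoRA paper; the paper's extra scaling factor is left out so the code matches the formula as written here.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 768, 4  # sizes from the running example

W = rng.standard_normal((d, d))         # frozen pretrained weight (random stand-in)
A = rng.standard_normal((r, d)) * 0.01  # trainable, shape r x d
B = np.zeros((d, r))                    # trainable, shape d x r; zero init so BA = 0 at the start


def lora_forward(x, W, A, B):
    """Compute W'x = Wx + B(Ax) without forming the d x d product BA."""
    return W @ x + B @ (A @ x)


x = rng.standard_normal(d)

# With B = 0 the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x, W, A, B), W @ x)

# Pretend training has moved B away from zero; the result now equals
# a forward pass through the merged matrix W' = W + BA.
B = rng.standard_normal((d, r)) * 0.01
W_merged = W + B @ A
assert np.allclose(lora_forward(x, W, A, B), W_merged @ x)
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` keeps the extra work proportional to d times r, which is the point of choosing a small rank.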

This means that instead of training $d^2$ parameters, we train only $2 \cdot d \cdot r$. For a typical transformer layer where $d = 768$ and $r = 4$, that’s 6,144 parameters instead of 589,824, roughly 1% of the original.

Interactive (d = 768, r = 4)

Full fine-tune: 589,824 parameters
LoRA (A + B): 6,144 parameters (1.04% of original parameters)
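The arithmetic above is easy to reproduce for other sizes. The helper below is a small sketch; `lora_param_counts` is a name invented for this example, and it assumes a single square d x d matrix as in the text.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int, float]:
    """Return (full fine-tune params, LoRA params, LoRA / full) for one d x d matrix."""
    full = d * d      # every entry of W is trainable
    lora = 2 * d * r  # B is d x r, A is r x d
    return full, lora, lora / full


full, lora, ratio = lora_param_counts(768, 4)
print(f"{full:,} vs {lora:,} ({ratio:.2%})")  # 589,824 vs 6,144 (1.04%)
```

The ratio simplifies to 2r / d, so the share of trainable parameters grows linearly with the rank and shrinks as the layer gets wider.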

The practical implications for edge deployment are significant. A LoRA adapter for a 7B model might be just 4-16 MB, compared to about 14 GB for the full model in 16-bit precision. You can swap task-specific adapters at runtime without reloading the base model.
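As a rough sanity check on those sizes, here is a back-of-the-envelope estimate. Everything about the configuration is an assumption made for illustration and is not taken from a specific model: 32 layers, hidden size 4096, two adapted square matrices per layer, and 2 bytes per parameter (16-bit). `adapter_size_mb` is a hypothetical helper.

```python
def adapter_size_mb(n_layers, d_model, r, matrices_per_layer=2, bytes_per_param=2):
    """Estimate adapter size in MB (10^6 bytes), assuming square d_model x d_model targets."""
    params = n_layers * matrices_per_layer * 2 * d_model * r
    return params * bytes_per_param / 1e6


# Assumed 7B-class configuration: 32 layers, hidden size 4096.
for r in (4, 8, 16):
    print(f"r={r}: {adapter_size_mb(32, 4096, r):.1f} MB")
# r=4: 4.2 MB
# r=8: 8.4 MB
# r=16: 16.8 MB

# Full model for comparison: 7 billion parameters at 2 bytes each.
base_gb = 7e9 * 2 / 1e9
print(f"base model: {base_gb:.0f} GB")  # base model: 14 GB
```

Under these assumptions, ranks between 4 and 16 land in roughly the range quoted above. Adapting more matrices per layer or storing the adapter in 32-bit precision would scale the numbers up proportionally.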

This is exactly what makes per-task fine-tuning feasible on resource-constrained hardware: instead of shipping a new model for each task, you ship a tiny delta.
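Swapping adapters at runtime can be sketched in a few lines. This is a toy illustration, not how any particular library does it: the task names are hypothetical, the weights are random stand-ins, and a single small matrix stands in for the whole base model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # small sizes so the example runs instantly

W = rng.standard_normal((d, d))  # frozen base weight, loaded once and never modified

# Each adapter is just a (B, A) pair; the task names are made up for this sketch.
adapters = {
    "summarize": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
    "translate": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
}


def forward(x, task):
    """Apply the shared base weight plus the low-rank delta for the chosen task."""
    B, A = adapters[task]
    return W @ x + B @ (A @ x)


x = rng.standard_normal(d)
y_sum = forward(x, "summarize")  # same W, first adapter
y_tr = forward(x, "translate")   # same W, second adapter; nothing is reloaded

# Each adapter stores 2 * d * r values, versus d * d for the base matrix.
print(2 * d * r, d * d)  # 512 4096
```

Switching tasks here is only a dictionary lookup, which is why keeping many adapters next to one base model is cheap.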