The core idea behind LoRA is deceptively simple. Instead of updating all the parameters in a weight matrix during fine-tuning, we freeze the original weights and inject two small matrices — A and B — whose product approximates the full update.
For a weight matrix W of size d x d, the update becomes:
where B has shape d x r and A has shape r x d, with rank r much smaller than d.
This means instead of training parameters, we only train . For a typical transformer layer where and , that’s 6,144 parameters instead of 589,824 — roughly 1% of the original.
Interactive
d = 768
1.04%
of original parameters
The practical implications are massive for edge deployment. A LoRA adapter for a 7B model might be just 4-16 MB, compared to the 14 GB full model. You can swap task-specific adapters at runtime without reloading the base model.
This is exactly what makes fine-tuning feasible on resource-constrained hardware — you’re not shipping a new model for each task, you’re shipping a tiny delta.