Large language models (like ChatGPT) have billions of trainable parameters. LoRA fine-tunes large language models by freezing the original weights and introducing a small set of trainable parameters, enabling efficient adaptation to specific tasks.1
Reducing the complexity of transfer learning lowers the computation and memory costs of fine-tuning. Other model adaptation techniques cut those costs at the expense of added inference latency, slowing model responses. LoRA instead applies low-rank updates to the weights of selected layers without modifying the full weight matrices, and those updates can be merged back into the weights so inference speed is unchanged.
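Concretely, for a pretrained weight W, LoRA learns two factors B and A so the adapted layer computes h = Wx + (alpha/r)·BAx, where B is d_out×r, A is r×d_in, and the rank r is much smaller than either dimension. Here is a minimal sketch, assuming PyTorch; the class and parameter names are illustrative, not the paper's reference code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a frozen weight W and a trainable low-rank update B @ A."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        # Stands in for the pretrained weight; in practice loaded from a checkpoint.
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.weight.requires_grad = False  # frozen: receives no gradient updates
        # Low-rank factors: A is Gaussian-initialized, B starts at zero,
        # so the update B @ A is zero at the start of training (per Hu et al.).
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r  # the paper scales the update by alpha/r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = x W^T + (alpha/r) * x (B A)^T; only A and B are trained.
        return x @ self.weight.T + self.scaling * (x @ self.A.T @ self.B.T)
```

Zero-initializing B means the model starts training as an exact copy of the pretrained model, which is why the base weights can stay frozen.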
Multiple LoRA adapters can be trained for specific purposes and swapped into a model to change its specialization. Beyond natural language processing, LoRA's approach to efficient fine-tuning holds potential for fields like computer vision and graph neural networks. Because LoRA's updates are low rank, the technique may have more difficulty than full fine-tuning capturing very specific tasks and patterns.
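Swapping works because the update B @ A has the same shape as W: it can be folded into the frozen weight after training (so inference runs at the base model's full speed) and subtracted back out before loading a different adapter. Hypothetical helpers, continuing the LoRALinear sketch above:

```python
import torch

def merge(layer: LoRALinear) -> None:
    # Fold the low-rank update into the frozen weight: W <- W + (alpha/r) * B @ A.
    with torch.no_grad():
        layer.weight += layer.scaling * (layer.B @ layer.A)

def unmerge(layer: LoRALinear) -> None:
    # Subtract the update to restore the base weights, e.g. before swapping adapters.
    with torch.no_grad():
        layer.weight -= layer.scaling * (layer.B @ layer.A)
```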
The major downside of full fine-tuning is that each new model contains as many parameters as the original model.
- LoRA is a parameter-efficient fine-tuning (PEFT) method.
- LoRA hypothesis: the change in weights during model adaptation has a low "intrinsic rank."
- GPT-3 has 175B parameters; LoRA cut trainable parameters about 10,000x and the GPU memory requirement 3x (see the back-of-envelope check after this list).
- QLoRA (Quantized Low-Rank Adaptation of Large Language Models) reduces the numeric precision of the frozen base weights (e.g., quantizing to 4-bit) to shrink the memory required to fine-tune the model.
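A quick back-of-envelope check of the 10,000x figure, using GPT-3 175B's published shape (96 layers, hidden size 12288) and an illustrative rank r = 4 applied to each layer's query and value projections, as in the paper's GPT-3 setup:

```python
d = 12288                            # GPT-3 175B hidden size
layers = 96                          # transformer layers
r = 4                                # illustrative LoRA rank
per_matrix = 2 * d * r               # factors A (r x d) and B (d x r)
trainable = layers * 2 * per_matrix  # adapt Wq and Wv in every layer
total = 175e9
print(f"trainable: {trainable / 1e6:.1f}M, reduction: {total / trainable:,.0f}x")
# -> trainable: 18.9M, reduction: 9,272x — on the order of the paper's 10,000x claim
```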
Related Methods
Pre-LoRA (before 2021): adapter layers ("Adapters") and prompt-based fine-tuning (like prefix tuning), which is hard to optimize.
Post-LoRA -> IA³ (2022), OFT/BOFT (2023)
1 Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv. https://arxiv.org/abs/2106.09685