image

paper

TL;DR

  • I read this because.. : #113 Continuing the efficient fine-tuning series
  • task : LLM finetuning
  • problem : full finetuning is inefficient, and adapters add inference latency because extra layers are inserted into the middle of the model.
  • idea : approximate the weight update with a low-rank decomposition and add it to the original parameters!
  • architecture : RoBERTa, DeBERTa, GPT-2, GPT-3
  • objective : ce loss
  • baseline : finetuning / adapters / prefix-layer tuning
  • data : GLUE, WikiSQL, MultiNLI
  • result : better performance with far fewer trainable parameters
  • contribution : Efficient finetuning without adding latency
  • limitation / things I cannot understand :

Details

  • preliminaries : Parameter-Efficient Transfer Learning for NLP, the paper that proposed adapters. Full finetuning is inefficient because all parameters must be trained and stored per task, while feature extraction has performance limitations. Adapters learn downstream tasks with far fewer parameters; in that paper, two adapter layers are inserted into each transformer layer. image
image
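To make the adapter baseline concrete, here is a minimal numpy sketch of a bottleneck adapter (down-projection, nonlinearity, up-projection, residual connection). The shapes, names, and near-zero initialization are my own illustrative choices, not the paper's exact configuration:

```python
import numpy as np

def adapter(h, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project, plus residual."""
    z = np.maximum(h @ W_down, 0.0)   # down-projection + nonlinearity
    return h + z @ W_up               # up-projection + residual connection

d, r = 8, 2                           # hidden size, bottleneck size (hypothetical)
rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.01, size=(d, r))
W_up = np.zeros((r, d))               # zero-init -> adapter is the identity at start
h = rng.normal(size=(4, d))
out = adapter(h, W_down, W_up)        # equals h at initialization
```

Because this module sits inside the layer stack, its matmuls run sequentially with the rest of the forward pass, which is the source of the added latency LoRA avoids.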
  • architecture image

The basic idea is that the weight updates of dense layers can be decomposed into low rank. For a weight $W_0 \in \mathbb{R}^{d \times k}$, approximate the update $\Delta W$ by $BA$, with $B\in\mathbb{R}^{d \times r}$, $A\in\mathbb{R}^{r \times k}$, giving the forward pass:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$

where $A$ is initialized with a random Gaussian and $B$ with zeros, so $BA = 0$ at initialization. $\Delta W x$ is scaled by $\alpha / r$, where $\alpha$ is a hyperparameter that acts somewhat like a learning rate. LoRA is applied only to the attention weights $W_q$, $W_k$, $W_v$, $W_o$, and not to the MLP.

image
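The forward pass and initialization above can be sketched in a few lines of numpy (dimensions and seeds are illustrative assumptions, not the paper's settings):

```python
import numpy as np

d, k, r, alpha = 16, 16, 4, 8            # hypothetical sizes; r << min(d, k)
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))  # trainable, random Gaussian init
B = np.zeros((d, r))                     # trainable, zero init -> BA = 0

def lora_forward(x):
    # h = W0 x + (alpha / r) * B A x ; only A and B receive gradients
    return x @ W0.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(2, k))
y = lora_forward(x)                      # equals the base model output at init
```

The zero-initialized $B$ guarantees training starts exactly from the pretrained model, and the trainable parameter count drops from $d \times k$ to $r(d + k)$.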

Within a fixed trainable-parameter budget, it was better to adapt both $W_q$ and $W_v$ than $W_q$ alone, even at rank 4, and best to adapt all four attention weights. image

It worked well even at very low ranks, which suggests that the update matrix $\Delta W$ has a very low intrinsic rank.

  • inference latency image
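The zero-latency claim follows because the low-rank update can be merged into the frozen weight before deployment; a sketch (numpy, hypothetical sizes):

```python
import numpy as np

d, k, r, alpha = 16, 16, 4, 8
rng = np.random.default_rng(1)
W0 = rng.normal(size=(d, k))             # frozen pretrained weight
A = rng.normal(size=(r, k))              # trained LoRA factors
B = rng.normal(size=(d, r))

# Merge: W = W0 + (alpha / r) * B A. Inference is then one plain matmul,
# unlike adapters, which add sequential layers that cannot be folded away.
W_merged = W0 + (alpha / r) * (B @ A)

x = rng.normal(size=(3, k))
unmerged = x @ W0.T + (alpha / r) * (x @ A.T @ B.T)
merged = x @ W_merged.T                  # same outputs, zero added latency
```

Subtracting $(\alpha / r) BA$ recovers $W_0$, so one can also swap between tasks by unmerging and merging different $B, A$ pairs.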

  • results image

image image