
TL;DR
- I read this because : #113, continuing the efficient finetuning series
- task : LLM finetuning
- problem : full finetuning is inefficient, and adapters add inference latency because extra layers are inserted into the middle of the network.
- idea : approximate the weight update $\Delta W$ with a low-rank decomposition and add it to the original parameter!
- architecture : RoBERTa, DeBERTa, GPT-2, GPT-3
- objective : ce loss
- baseline : finetuning / adapters / prefix-layer tuning
- data : GLUE, WikiSQL, MultiNLI
- result : better performance with a much smaller trainable parameter
- contribution : Efficient finetuning without adding latency
- limitation / things I cannot understand :
Details
- preliminaries : Parameter-Efficient Transfer Learning for NLP
Proposes adapters. Full finetuning is inefficient because all parameters must be trained and stored, while feature extraction has performance limitations.
Adapters learn downstream tasks with far fewer trainable parameters.
That paper inserts two adapter layers into each transformer layer.
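As a sketch of the adapter idea (not the paper's code, and with made-up sizes), a bottleneck adapter down-projects, applies a nonlinearity, up-projects, and adds a residual connection; initializing the up-projection to zero makes the adapter start as a near-identity:

```python
import numpy as np

def adapter(x, W_down, W_up):
    # Bottleneck adapter: down-project, ReLU, up-project, plus residual.
    h = np.maximum(x @ W_down, 0.0)
    return x + h @ W_up

rng = np.random.default_rng(0)
d, r = 16, 4                       # hidden size, bottleneck size (illustrative)
x = rng.normal(size=(2, d))
W_down = rng.normal(size=(d, r)) * 0.02
W_up = np.zeros((r, d))            # zero init -> adapter starts as a no-op
y = adapter(x, W_down, W_up)
print(np.allclose(y, x))           # True: output equals input at init
```

The latency problem in the TL;DR comes from exactly this structure: the adapter sits in the forward path, so its matmuls cannot be folded away at inference time.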


- architecture

The basic idea is that dense layers can be decomposed into low-rank matrices. For any weight $W_0 \in \mathbb{R}^{d \times k}$, approximate the update $\Delta W$ by $BA$ with $B\in\mathbb{R}^{d \times r}$, $A\in\mathbb{R}^{r \times k}$, giving the forward pass

$$h = W_0 x + \Delta W x = W_0 x + BAx$$

where $A$ is initialized from a random Gaussian and $B$ is initialized to zero, i.e., $BA$ is zero at the start of training. $\Delta W x$ is scaled by $\alpha / r$, where $\alpha$ is a hyperparameter that acts somewhat like a learning rate. LoRA is applied only to the attention weights $W_q$, $W_k$, $W_v$, $W_o$, and not to the MLP.
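The forward pass and initialization above can be sketched in a minimal NumPy LoRA linear layer (my own illustration, not the authors' implementation; shapes and seeds are arbitrary):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W0 plus trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, W0, r, alpha, rng):
        d, k = W0.shape
        self.W0 = W0                      # frozen pretrained weight
        self.A = rng.normal(size=(r, k))  # random Gaussian init
        self.B = np.zeros((d, r))         # zero init -> BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        # h = W0 x + (alpha/r) * B A x
        return x @ self.W0.T + self.scale * (x @ self.A.T) @ self.B.T

rng = np.random.default_rng(0)
W0 = rng.normal(size=(8, 8))
layer = LoRALinear(W0, r=2, alpha=4, rng=rng)
x = rng.normal(size=(3, 8))
print(np.allclose(layer.forward(x), x @ W0.T))  # True at init, since BA = 0
```

Only `A` and `B` would be trained, which is where the huge reduction in trainable parameters comes from: $r(d + k)$ instead of $dk$ per adapted weight.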

Under a fixed trainable-parameter budget, it was better to apply LoRA to both $W_q$ and $W_v$ than to $W_q$ alone, even at rank 4, and best to apply it to all of the attention weights.

It worked well even at very low ranks, which suggests that the update matrix $\Delta W$ has a very low intrinsic rank.
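The intrinsic-rank intuition can be illustrated with an SVD (a toy example with a synthetic matrix, not the paper's measurement): a matrix whose effective rank is low is reconstructed almost exactly from its top few singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, true_rank = 64, 64, 4
# Synthetic "update matrix" built to have low intrinsic rank
delta_W = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, k))

U, S, Vt = np.linalg.svd(delta_W)
# Rank-4 truncation reconstructs the matrix up to float error
approx = (U[:, :4] * S[:4]) @ Vt[:4]
print(np.allclose(approx, delta_W))  # True
print(S[4] < 1e-8)                   # singular values beyond rank 4 are ~0
```

In this picture, training $B$ and $A$ directly learns such a truncated factorization of $\Delta W$ instead of recovering it after the fact.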
- inference latency : since $BA$ can be merged into $W_0$ at deployment, LoRA adds no inference latency, unlike adapters.
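The zero-latency claim follows from the merge trick: at deployment, fold the trained update into the base weight once, so inference is a single matmul. A NumPy sketch with a pretend-trained (nonzero) $B$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 8, 2, 4
W0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))
B = rng.normal(size=(d, r))          # pretend B was trained (nonzero)

# Fold the low-rank update into the base weight once, before serving:
W_merged = W0 + (alpha / r) * B @ A

x = rng.normal(size=(5, k))
unmerged = x @ W0.T + (alpha / r) * (x @ A.T) @ B.T
merged = x @ W_merged.T
print(np.allclose(unmerged, merged))  # True: same output, one matmul at inference
```

Subtracting $(\alpha/r) BA$ recovers $W_0$, so one can also swap between tasks by unmerging and merging different low-rank pairs.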

- results

