image

paper

TL;DR

  • I read this because.. : #113 Continuing the efficient fine-tuning series
  • task : LLM finetuning
  • problem : full finetuning is inefficient, and adapters add inference latency because extra layers are inserted into the middle of the model.
  • idea : approximate the weight update with a low-rank decomposition and add it to the original parameters!
  • architecture : RoBERTa, DeBERTa, GPT-2, GPT-3
  • objective : ce loss
  • baseline : finetuning / adapters / prefix-layer tuning
  • data : GLUE, WikiSQL, MultiNLI
  • result : better performance with far fewer trainable parameters
  • contribution : Efficient finetuning without adding latency
  • limitation / things I cannot understand :

Details

  • preliminaries : Parameter-Efficient Transfer Learning for NLP, the paper that proposed adapters. Full finetuning is inefficient because all parameters must be trained and stored per task, while feature extraction has performance limitations. Adapters learn downstream tasks with far fewer parameters; in that paper, two adapter layers are inserted into each transformer layer. image
image
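To make the adapter baseline concrete, here is a minimal numpy sketch of a bottleneck adapter (down-projection, nonlinearity, up-projection, residual connection). The shapes, names, and near-zero initialization are my own illustrative choices, not the paper's exact configuration:

```python
import numpy as np

def adapter(h, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project, plus residual."""
    z = np.maximum(h @ W_down, 0.0)   # down-projection + nonlinearity
    return h + z @ W_up               # up-projection + residual connection

d, r = 8, 2                           # hidden size, bottleneck size (hypothetical)
rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.01, size=(d, r))
W_up = np.zeros((r, d))               # zero-init -> adapter is the identity at start
h = rng.normal(size=(4, d))
out = adapter(h, W_down, W_up)        # equals h at initialization
```

Because this module sits inside the layer stack, its matmuls run sequentially with the rest of the forward pass, which is the source of the added latency LoRA avoids.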
  • architecture image

The basic idea is that the weight updates of dense layers can be decomposed into low rank. For a weight $W_0 \in \mathbb{R}^{d \times k}$, approximate the update $\Delta W$ by $BA$, with $B\in\mathbb{R}^{d \times r}$, $A\in\mathbb{R}^{r \times k}$, giving the forward pass:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$

where $A$ is initialized with a random Gaussian and $B$ with zeros, so $BA = 0$ at initialization. $\Delta W x$ is scaled by $\alpha / r$, where $\alpha$ is a hyperparameter that acts somewhat like a learning rate. LoRA is applied only to the attention weights $W_q$, $W_k$, $W_v$, $W_o$, and not to the MLP.

image
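The forward pass and initialization above can be sketched in a few lines of numpy (dimensions and seeds are illustrative assumptions, not the paper's settings):

```python
import numpy as np

d, k, r, alpha = 16, 16, 4, 8            # hypothetical sizes; r << min(d, k)
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))  # trainable, random Gaussian init
B = np.zeros((d, r))                     # trainable, zero init -> BA = 0

def lora_forward(x):
    # h = W0 x + (alpha / r) * B A x ; only A and B receive gradients
    return x @ W0.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(2, k))
y = lora_forward(x)                      # equals the base model output at init
```

The zero-initialized $B$ guarantees training starts exactly from the pretrained model, and the trainable parameter count drops from $d \times k$ to $r(d + k)$.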

Within a fixed trainable-parameter budget, it was better to adapt both $W_q$ and $W_v$ than $W_q$ alone, even at rank 4, and best to adapt all four attention weights. image

It worked well even at very low ranks, which suggests that the update matrix $\Delta W$ has a very low intrinsic rank.

  • inference latency image
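The zero-latency claim follows because the low-rank update can be merged into the frozen weight before deployment; a sketch (numpy, hypothetical sizes):

```python
import numpy as np

d, k, r, alpha = 16, 16, 4, 8
rng = np.random.default_rng(1)
W0 = rng.normal(size=(d, k))             # frozen pretrained weight
A = rng.normal(size=(r, k))              # trained LoRA factors
B = rng.normal(size=(d, r))

# Merge: W = W0 + (alpha / r) * B A. Inference is then one plain matmul,
# unlike adapters, which add sequential layers that cannot be folded away.
W_merged = W0 + (alpha / r) * (B @ A)

x = rng.normal(size=(3, k))
unmerged = x @ W0.T + (alpha / r) * (x @ A.T @ B.T)
merged = x @ W_merged.T                  # same outputs, zero added latency
```

Subtracting $(\alpha / r) BA$ recovers $W_0$, so one can also swap between tasks by unmerging and merging different $B, A$ pairs.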

  • results image

image image