
paper

TL;DR

  • Why I read this : I came across this via the article in the parameter-efficient fine-tuning repo on Hugging Face. I’ve heard of p-tuning a lot, but had never read the paper.
  • task : language model fine-tuning (knowledge probing, …)
  • Problem : When fine-tuning an LLM, the number of parameters is too large, and few-shot, many-shot, and transfer ability are poor. GPT-3 can be used with a good prompt, but finding a good prompt is too laborious, and performance swings wildly depending on the prompt.
  • idea : don’t search for prompts in discrete token space, but in continuous embedding space
  • architecture : Feed the template {pseudo-prompts $P_{0:i}$, $\mathbf{x}$, $P_{i+1:m}$, $\mathbf{e(y)}$} into an LM such as BERT / GPT and learn the embedding of each pseudo-prompt. Since we want the prompt embeddings to be learned interdependently, a bi-LSTM layer is added to strengthen them.
  • objective : MLM loss
  • baseline : manual prompt, fine-tuning, discrete prompt search, manual prompt + fine-tuning
  • data : LAMA, SuperGLUE
  • evaluation : accuracy, F1, …
  • result : better performance on most SuperGLUE tasks with both GPT- and BERT-based models (it even beats fine-tuning)!
  • contribution : extends prompt search from the discrete manual space to a continuous one
  • limitation / things I cannot understand : this reminds me a bit of prompt-based CIL (class-incremental learning), and I’d like to try p-tuning in an MTL environment.
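The core idea in the architecture bullet above can be sketched as follows: the template {$P_{0:i}$, $\mathbf{x}$, $P_{i+1:m}$, $\mathbf{e(y)}$} is assembled in embedding space rather than token space, so the pseudo-prompts are trainable vectors, not words. A minimal PyTorch sketch with made-up dimensions (shapes and names are illustrative, not the paper’s code):

```python
import torch

hidden = 8
# stand-ins: random tensors in place of real embeddings
P_left  = torch.randn(3, hidden)  # trainable pseudo-prompts P_0..P_i
x_emb   = torch.randn(5, hidden)  # frozen LM embeddings e(x) of the context
P_right = torch.randn(2, hidden)  # trainable pseudo-prompts P_{i+1}..P_m
y_emb   = torch.randn(1, hidden)  # target slot e(y), e.g. the [MASK] position

# concatenate along the sequence dimension; this tensor would be passed
# to the frozen LM as input embeddings (bypassing the token embedding lookup)
inputs_embeds = torch.cat([P_left, x_emb, P_right, y_emb], dim=0)
assert inputs_embeds.shape == (11, hidden)
```

Only the pseudo-prompt vectors receive gradients; the LM and its embedding table stay frozen.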

Details

  • $\mathcal{M}$ : pretrained LM

There are two problems with training this way: 1) the embedding space $\mathbf{e}$ of the pretrained LM $\mathcal{M}$ is discrete, so if $h$ is randomly initialized, only the parameters of a small neighborhood are updated and it is easy to fall into local minima, and 2) we want the prompt tokens to depend on each other rather than being independent. To solve both, one lightweight network (a prompt encoder) is added.

Although an LSTM is added, it has very few parameters compared to the LM, and at inference time the LSTM can simply be discarded, keeping only the learned embeddings $h$.
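The prompt encoder described above can be sketched roughly like this (a minimal PyTorch sketch; class and parameter names are my own, and the released code may differ). Each pseudo-prompt token gets a trainable embedding, and a bi-LSTM followed by a small MLP re-encodes them so that neighboring prompt embeddings depend on each other:

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    def __init__(self, num_prompt_tokens: int, hidden_size: int):
        super().__init__()
        # one trainable vector per pseudo-prompt token P_0 .. P_m
        self.embedding = nn.Embedding(num_prompt_tokens, hidden_size)
        # bidirectional LSTM makes each prompt embedding depend on its neighbors
        self.lstm = nn.LSTM(hidden_size, hidden_size // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        # small MLP head on top of the LSTM outputs
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size),
                                 nn.ReLU(),
                                 nn.Linear(hidden_size, hidden_size))

    def forward(self) -> torch.Tensor:
        ids = torch.arange(self.embedding.num_embeddings)
        h = self.embedding(ids).unsqueeze(0)   # (1, m, hidden)
        h, _ = self.lstm(h)                    # (1, m, hidden); 2*(hidden//2)
        return self.mlp(h).squeeze(0)          # (m, hidden)

encoder = PromptEncoder(num_prompt_tokens=6, hidden_size=128)
prompts = encoder()  # (6, 128) continuous prompt embeddings for the template
```

After training, one could cache `encoder()` once and drop the LSTM/MLP entirely, which matches the inference-time simplification mentioned above.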


Result


P-tuning keeps the language model’s parameters frozen, so I’m surprised it beats fine-tuning.


Follow-up

P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks. Putting prompt tokens into every layer is shown to work well on the hard sequence labeling tasks that the original p-tuning struggled with, and it works even on small models. https://arxiv.org/pdf/2110.07602.pdf