TL;DR
- I read this because.. : CS330์์ ๋์ด. LLM’s in-context learning is sometimes interpreted as meta learning, but I asked him what he thinks about it, and he said he would have listened to it.
- problem : When does in-context learning work? When does the “emergent” ability of the LLM show up?
- Idea :** Natural data, unlike supervised data, is not the same as
- input/output : {image, label} sequence + query image -> novel label
- architecture : encoder(ResNet) + causal Transformer
- objective : ce loss
- baseline : RNN, LSTM
- data : Omniglot
- evaluation : Given 8 contexts and 1 test query, classify them well. To evaluate the “holdout image” (an image that has never been seen before), we randomly assign the class of 2 images in a 4-shot 2-way evaluation (e.g., the alphabet “a” was originally labeled 0 in the original evaluation, but 1 in the test).
- result : 1) RNN should not be used, but Transformer model 2) when there is busrtiness in the data 3) when there is a large set of rare classes
- contribution :
- etc. :
Details
in-context learning vs in-weight learning
- in-context learning does well when given only a few samples of a new concept without weight updates
- in-weight learning is a gradient update to do a few shots well with supervised learning In terms of meta-learning, MANN or MAML can be seen as in-context learning. However, in recent LLMs, this in-context learning is not directly taught, but “emergent”, so why is this?
Experimental design
Like the black box meta learning methodology, in-context learning is given an image, label sequence as context and sees how well it does when given a query image.
- BURSTY means that a certain class comes in bunches (AA A comes in bunches in a short period of time)
In this paper, we look at 1) burstiness 2) a large number of rarely occurring classes 3) multiplicity of labels 4) within-class variation
Burstiness
As shown in the example above, we evaluated with data that intentionally increased the busrtiness, and found that for in-context learning, increasing the burstiness increases the
In contrast, in-weight learning performs poorly as busrtiness increases
a large number of rarely occuring classes
I experimented with increasing the num of classes from 100 to 12800 (original class 1600) while giving omniglot a roatation (each class becomes less frequent and therefore long-tailed).
Once again, the number of classes was reversed: more in context learning was better, but more in weight learning was worse.
Multiplicity of labels
When I tried it with multiple labels for a single class, the performance improved again
within-class variation
I tried a lot of variation within classes, and again, for in-context learning, the higher the variation, the better the performance
Architecture
I ran rnn / lstm with all the right number of parameters / depth, etc. but never got in-context learning ability… For some reason, even the authors don’t know why! we were completely unable to elicit in-context learning in recurrent models, even with the training procedure, number of parameters, and model architecture otherwise matched to the transformer experiments. Emphasized that using a transformer alone does not result in in-context learning, and that the data distribution must have the three characteristics above.