problem : Can a language model perform a new task from just a few examples given in the prompt, without fine-tuning?
solution : Make a really big LM.
result : Few-shot performance at or near SOTA on various NLP tasks.
details :
Comparing zero-, one-, and few-shot performance across model sizes: larger models are more effective at in-context learning.
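The three settings differ only in how many solved examples appear in the prompt; no weights are updated in any of them. A minimal sketch of prompt construction (the task, demonstrations, and `=>` separator are illustrative, not the paper's exact templates):

```python
def build_prompt(instruction, examples, query):
    """Build a zero-/one-/few-shot prompt; k = len(examples) demonstrations."""
    parts = [instruction]
    for src, tgt in examples:          # k in-context demonstrations
        parts.append(f"{src} => {tgt}")
    parts.append(f"{query} =>")        # the model completes this line
    return "\n".join(parts)

demos = [("cheese", "fromage"), ("dog", "chien")]
few_shot = build_prompt("Translate English to French:", demos, "cat")
zero_shot = build_prompt("Translate English to French:", [], "cat")
```

Zero-shot is the same prompt with an empty demonstration list, which is why the comparison across k is clean: only the context changes.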

Glossary of terms in GPT-3

The model architecture is very similar to GPT-2, but with alternating dense and locally banded sparse attention patterns, as in the Sparse Transformer.
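A locally banded mask lets each token attend only to a fixed window of recent tokens instead of the full causal history. A toy mask builder in pure Python (the window size is illustrative; the actual sparse patterns follow the Sparse Transformer):

```python
def banded_causal_mask(n, window):
    """mask[i][j] is True iff position i may attend to position j:
    causal (j <= i) and within the local band (i - j < window)."""
    return [[(j <= i) and (i - j < window) for j in range(n)]
            for i in range(n)]

mask = banded_causal_mask(5, 2)
# Each row attends only to itself and the previous window-1 tokens,
# so attention cost per token is O(window) instead of O(n).
```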
Model scale: the model commonly referred to as "GPT-3" has 175 billion parameters. The training data is 300 billion tokens.
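The 175B figure can be sanity-checked from the paper's reported hyperparameters (96 layers, d_model = 12288) with the standard ~12·n_layer·d_model² estimate for Transformer parameters (embeddings ignored):

```python
n_layer, d_model = 96, 12288   # GPT-3 175B hyperparameters from the paper
# Per layer: ~4*d^2 (attention Q, K, V, output) + ~8*d^2 (4x-wide MLP) = 12*d^2
approx_params = 12 * n_layer * d_model ** 2
print(f"{approx_params / 1e9:.0f}B")  # ~174B, close to the quoted 175B
```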

The training data was obtained from Common Crawl, filtered to improve quality, and mixed with known high-quality corpora.
For large models, a large batch size and a small learning rate are recommended.
The gradient noise scale was measured and used to guide the choice of batch size. (ref)
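The gradient noise scale (McCandlish et al.'s "simple" estimator) predicts a critical batch size from how noisy the gradient is. A minimal sketch under the relation E[|G_B|²] = |G|² + tr(Σ)/B: measuring the squared gradient norm at two batch sizes pins down both terms (function and variable names are mine):

```python
def simple_noise_scale(gsq_small, gsq_big, b_small, b_big):
    """Estimate B_simple = tr(Sigma) / |G|^2 from squared gradient norms
    measured at two batch sizes, using E[|G_B|^2] = |G|^2 + tr(Sigma)/B."""
    true_gsq = (b_big * gsq_big - b_small * gsq_small) / (b_big - b_small)
    trace_sigma = (gsq_small - gsq_big) / (1 / b_small - 1 / b_big)
    return trace_sigma / true_gsq
```

Training near this batch size trades compute for wall-clock time efficiently; GPT-3 also warms the batch size up gradually over training.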
Downstream Tasks :
Penn Tree Bank : a parsing corpus, also used to evaluate LM performance (perplexity).
LAMBADA : given a context, predict its final (blanked) word; long-range dependencies need to be modeled well.
SuperGLUE : A collection of difficult NLP tasks

Arithmetic: 2-5 digit addition/subtraction, 2-digit multiplication, and composite 1-digit operations (like 6+(4*8))
- word scrambling and manipulation tasks (anagrams, letter cycling, reversed words)

news article generation: human annotators try to distinguish human-written news from model-generated news; their accuracy is compared via t-test against a deliberately bad control model.
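That comparison is a two-sample t-test on per-annotator detection accuracy for the real model versus the control. A minimal Welch's t-statistic in pure Python (the accuracy numbers below are illustrative, and the significance lookup is omitted):

```python
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances allowed)."""
    num = statistics.mean(a) - statistics.mean(b)
    den = (statistics.variance(a) / len(a)
           + statistics.variance(b) / len(b)) ** 0.5
    return num / den

# Hypothetical per-annotator accuracy spotting the control model vs GPT-3:
control_acc = [0.90, 0.85, 0.88, 0.92]
gpt3_acc = [0.55, 0.50, 0.52, 0.48]
t = welch_t(control_acc, gpt3_acc)  # large |t| => accuracies differ reliably
```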
learning and using novel words: the model is shown a made-up word defined only once and asked to create a sentence with it.

correcting English grammar :
Input is given in the format "Poor English Input: <sentence>\nGood English Output: <sentence>".
Limitations of the GPT-3 model
Weak text generation: repeats itself and loses coherence over long passages.
Lack of common-sense physics: for example, it cannot reliably answer questions like "Will cheese melt if I put it in the refrigerator?".
Trained with a unidirectional LM objective, not a bidirectional one, so it lacks a signal about which words are important and which are not.
Lack of grounding in the real world: it never learns from other modalities, such as video or photos.
It has seen roughly all the words a human will ever see in a lifetime, yet learns less efficiently from them than humans do.