problem : few-shot learning with a language model. solution : train a really big LM. result : SOTA few-shot performance on various NLP tasks. details :

  • Comparing zero-, one-, and few-shot performance across model sizes: larger models make more effective use of in-context learning.
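A minimal sketch of how the three settings differ at the prompt level, using the English-to-French example from the paper's figures; `make_prompt` is a hypothetical helper and no model is actually called:

```python
# Build zero-, one-, and few-shot prompts: same task description,
# differing only in the number of in-context demonstrations.
def make_prompt(task, examples, query):
    lines = [task]
    for src, tgt in examples:      # k demonstrations (k = 0 -> zero-shot)
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")    # the model completes after "=>"
    return "\n".join(lines)

task = "Translate English to French:"
zero_shot = make_prompt(task, [], "cheese")
one_shot = make_prompt(task, [("sea otter", "loutre de mer")], "cheese")
few_shot = make_prompt(
    task,
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(few_shot)
```

No gradient updates happen in any setting; the demonstrations are consumed purely as context at inference time.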

  • Glossary of the terms used in GPT-3.

  • The model architecture is very similar to GPT-2, but alternates dense and locally banded sparse attention patterns, as in the Sparse Transformer.

  • Model size: the model commonly referred to as “GPT-3” has 175 billion parameters and was trained on 300 billion tokens.
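As a sanity check on that number, a back-of-envelope parameter count from the published GPT-3 config (96 layers, d_model = 12288, GPT-2's 50257-token vocabulary, 2048-token context); the 12·d_model² per-layer term is a standard approximation covering the attention and feed-forward weight matrices, ignoring biases and layer norms:

```python
# Published GPT-3 175B hyperparameters
n_layer, d_model, n_vocab, n_ctx = 96, 12288, 50257, 2048

# Per layer: attention projections (4 * d^2) + 4x-wide MLP (8 * d^2)
per_layer = 12 * d_model ** 2
embeddings = (n_vocab + n_ctx) * d_model   # token + position embeddings

total = n_layer * per_layer + embeddings
print(total)   # roughly 175 billion
```

The estimate lands within a few billion of the quoted 175B, which is expected given the ignored bias and layer-norm terms.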

  • The data was obtained from Common Crawl, filtered to improve its quality, and mixed with known high-quality corpora.

  • Larger models can use a larger batch size but need a smaller learning rate.

  • The gradient noise scale was measured during training and used to guide the choice of batch size (ref ).
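A toy sketch of the "simple" gradient noise scale from the cited work (tr(Σ)/|G|², estimated from gradient norms measured at two batch sizes); the per-example gradients here are simulated, and all constants are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, sigma = 1000, 3.0
g_true = rng.normal(size=dim)          # "true" gradient direction

def batch_grad(batch_size):
    # Simulated minibatch gradient: true gradient + per-example noise
    noise = rng.normal(scale=sigma, size=(batch_size, dim))
    return (g_true + noise).mean(axis=0)

B_small, B_big = 32, 1024
gs, gb = batch_grad(B_small), batch_grad(B_big)

# Unbiased estimates of |G|^2 and tr(Sigma) from the two measurements
G2 = (B_big * (gb @ gb) - B_small * (gs @ gs)) / (B_big - B_small)
trS = ((gs @ gs) - (gb @ gb)) / (1 / B_small - 1 / B_big)

noise_scale = trS / G2
print(noise_scale)
```

The intuition: a large noise scale means minibatch gradients are dominated by noise, so averaging over a bigger batch keeps helping; a small one means large batches waste compute.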

  • Downstream tasks:

  • Penn Treebank: a parsing corpus, also used to evaluate LM perplexity.

  • LAMBADA: given a context passage, predict the final (blanked) word; requires handling long-range dependencies well.

  • SuperGLUE: a collection of difficult NLP tasks.

  • Arithmetic: 2–5 digit addition/subtraction, 2-digit multiplication, and 1-digit composite operations (like 6+(4*8)).
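The arithmetic problems are posed as natural-language questions; a sketch of assembling such a few-shot prompt in the Q:/A: style the paper uses, with randomly generated illustrative demonstrations:

```python
import random

random.seed(0)

def addition_example():
    # One worked 2-digit addition demonstration
    a, b = random.randint(10, 99), random.randint(10, 99)
    return f"Q: What is {a} plus {b}?\nA: {a + b}"

demos = "\n\n".join(addition_example() for _ in range(3))
prompt = demos + "\n\nQ: What is 48 plus 76?\nA:"
print(prompt)
```

The model is scored on whether its completion after the final "A:" is the exact correct number.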

  • Word scrambling and character manipulation tasks.
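A sketch of how such corrupted inputs might be generated — the paper describes variants like cycling letters, shuffling interior characters, inserting random punctuation, and reversing words; the exact corruption procedures below are assumptions for illustration:

```python
import random

random.seed(0)

def cycle_letters(w):
    # Rotate the word: move the last letter to the front
    return w[-1] + w[:-1]

def anagram(w, keep):
    # Shuffle the interior, keeping `keep` characters fixed at each end
    inner = list(w[keep:-keep])
    random.shuffle(inner)
    return w[:keep] + "".join(inner) + w[-keep:]

def random_insertion(w):
    # Insert a random space/punctuation character between letters
    return "".join(c + random.choice(" .,!") for c in w[:-1]) + w[-1]

def reversed_word(w):
    return w[::-1]

word = "inevitably"
print(cycle_letters(word), anagram(word, 1), reversed_word(word))
```

In each variant the model's task is to recover the original word from the corrupted form.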
  • News article generation: human annotators judge whether articles are human-written or model-generated; results are compared with a t-test against a deliberately bad control model.

  • Learning and using novel words: the model sees a made-up word defined once and is asked to use it in a sentence.

  • Correcting English grammar: the input is given in the form “Poor English Input: <sentence>\n Good English Output: <sentence>”.
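A sketch of assembling a few-shot prompt in that format; the demonstration pairs and the final query sentence are illustrative, not taken from the paper:

```python
# Illustrative (bad, corrected) demonstration pairs
examples = [
    ("I eated the purple berries.", "I ate the purple berries."),
    ("She no went to the market.", "She did not go to the market."),
]

demos = "\n".join(
    f"Poor English Input: {bad}\nGood English Output: {good}"
    for bad, good in examples
)
# Leave the final output blank for the model to complete
prompt = demos + "\nPoor English Input: He go to school yesterday.\nGood English Output:"
print(prompt)
```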

  • Limitations of the GPT-3 model:

  • Weak text generation: it sometimes repeats words and loses coherence over long passages.

  • Lack of common-sense physics: for example, it struggles with questions like “Will cheese melt if I put it in the refrigerator?”.

  • The objective is a standard left-to-right LM, not a bidirectional LM, so it carries no signal about which words are important and which are not.

  • Lack of grounding in the real world, since the model never learns from other modalities such as video or images.

  • It has seen more text than a human reads in a lifetime, yet it learns far less efficiently than humans do.