[194] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters DeepMind 2024Q3 reasoning
[116] Data Distributional Properties Drive Emergent In-Context Learning in Transformers DeepMind NeurIPS 2022Q2
[111] Perceiver IO: A General Architecture for Structured Inputs & Outputs multimodal 2021Q2 ICLR DeepMind MTL