[58] MetaFormer Is Actually What You Need for Vision

August 31, 2022 · 1 min · long8v

[6] Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

January 18, 2022 · 1 min · long8v

[5] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

January 13, 2022 · 1 min · long8v

[3] Twins: Revisiting the Design of Spatial Attention in Vision Transformers

January 10, 2022 · 1 min · long8v

[1] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

January 5, 2022 · 1 min · long8v