[6] Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling multimodal 2021Q4 backbone multitask
[5] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale ViT backbone 2021Q1 re-read