[145] CLIPScore: A Reference-free Evaluation Metric for Image Captioning | 2021Q2, CLIP, EMNLP, evaluation, AI2
[111] Perceiver IO: A General Architecture for Structured Inputs & Outputs | multimodal, 2021Q2, ICLR, DeepMind, MTL
[85] Dynamic Head: Unifying Object Detection Heads with Attentions | 2021Q2, CVPR, Microsoft, object detection
[38] Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries | ICCV, 2021Q2, SGG, one-stage
[32] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | multimodal, 2021Q2, Naver
[8] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | multimodal, SSL, 2021Q2, zero-shot