[136] Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models multimodal naver 2021Q3 document emnlp
[131] Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels 2021Q1 CVPR naver
[32] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision multimodal 2021Q2 naver