[136] Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models multimodal naver 2021Q3 document emnlp
[128] Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding ICML google 2022Q3 document