TL;DR
- I read this because... : a colleague’s paper (aka Cream)
- task : DocVQA
- problem : VQA without OCR has performance limitations, and using OCR as input eats up too many tokens.
- Idea : use OVD and OCR, extract features with an auxiliary encoder, and feed them to the decoder via cross-attention (CA).
- input/output : image, ocr result (box and text), ovd result (box and class text), question -> answer
- architecture : Vision Encoder (CLIP ViT-L / LAION-2B), Auxiliary encoder (mBART), decoder (mBART, standalone mode), LLM (Vicuna).
- objective : (pre-training) text reading, masked text prediction, captioning, QA, QG with CL loss + LM loss -> (fine-tuning) QA with LM loss
- baseline : Pushing the results of OCR into LLM, BLIP, UDOP, Pix2Struct, MatCha, Donut, T5
- data : (text reading and masked text prediction) IIT-CDIP, Webvicob, (captioning) CC3M, (QA + QG) WKVVQA, SquadVQA, TydiVQA (proposed in this paper)
- evaluation : Accuracy (for ChartQA), ANLS, nED, BERTScore, PPL
- result : much better than simply feeding OCR results into an LLM, and the best among multi-task models except on InfoVQA. Against document-specific models, though, the SOTA (e.g. UDOP) is still better performance-wise.
- contribution : suggests how to better utilize OCR tokens in the document domain, and proposes a CL method so that performance doesn’t falter when OCR is unstable.
- etc. : the appendix is very thorough
Details
Architecture
The overall structure is similar to BLIP-2
In addition to the vision encoder output, an auxiliary encoder is also used! The vision encoder output and the aux encoder output are concatenated and fed to the decoder via cross-attention.
The motivation for using CA was that text-rich images produce too many OCR results, which eat up too many input tokens!
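A minimal sketch of that concatenate-then-cross-attend step (the single-head, projection-free attention and all shapes here are my own simplifications for illustration, not the paper’s exact configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, memory):
    # single-head scaled dot-product attention, no learned projections
    scores = queries @ memory.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ memory

d = 64
rng = np.random.default_rng(0)
vision_feats = rng.standard_normal((196, d))  # vision-encoder patch features
aux_feats = rng.standard_normal((50, d))      # aux-encoder features for OCR/OVD tokens
memory = np.concatenate([vision_feats, aux_feats], axis=0)  # (246, d)

decoder_states = rng.standard_normal((8, d))  # 8 decoder positions as queries
out = cross_attention(decoder_states, memory)  # (8, d)
```

The point of the design: the OCR/OVD tokens live in the cross-attention memory, so they never occupy the decoder’s (or the LLM’s) input-token budget.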
The figure is a bit confusing (it looks cropped), but the positive pair targeted by the contrastive loss seems to be the aux output from the figure above contrasted with the output of the corresponding (coordinate-overlapping) vision patch.
The explanation for why we did this is that it is advantageous when the OCR output is noisy or the results are limited.
The idea seems to be that pulling a vision-encoder patch closer to the corresponding OCR-token encoder output lets the model perform well even when some OCR results are missed. On the other hand, OVD uses OWL-ViT (with the COCO 80 classes), and on DocVQA the performance is almost the same without OVD (81.2 -> 80.9, A.2.); I wonder if that is just because of DocVQA.
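A toy sketch of that contrastive pairing (the centre-point matching rule, the 14x14 grid, the embedding size, and the temperature are all my assumptions for illustration, not the paper’s exact recipe):

```python
import numpy as np

def patch_index(box, grid=14, img=224):
    # hypothetical pairing rule: the patch containing the OCR box centre
    cx = (box[0] + box[2]) / 2 / img * grid
    cy = (box[1] + box[3]) / 2 / img * grid
    return int(cy) * grid + int(cx)

def info_nce(ocr_emb, patch_emb, pos_idx, temp=0.07):
    # each OCR-token embedding is contrasted against every patch embedding;
    # the positive is the spatially overlapping patch
    ocr = ocr_emb / np.linalg.norm(ocr_emb, axis=-1, keepdims=True)
    pat = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    logits = ocr @ pat.T / temp
    log_p = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_p[np.arange(len(pos_idx)), pos_idx].mean()

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 32))
pos = np.array([patch_index(b) for b in [(0, 0, 30, 30), (100, 100, 130, 130)]])
ocr = patches[pos] + 0.01 * rng.standard_normal((2, 32))  # already near their patches
loss = info_nce(ocr, patches, pos)  # small, since the positives align
```

Pulling each patch embedding toward its OCR-token embedding is what should let the vision side compensate when OCR drops a word.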
Dataset
Training
details
- LM : CL = 1 : 0.5
- The number of learnable queries is 224
- When putting an image into the vision encoder, the variable resolution from Pix2Struct is used (https://github.com/long8v/PTIR/issues/140)
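My understanding of that Pix2Struct-style variable resolution, as a sketch (patch size 16 and the 2048-patch budget are the usual Pix2Struct defaults, not numbers confirmed for this paper):

```python
import math

def p2s_grid(h, w, patch=16, max_patches=2048):
    # rescale so that as many patch x patch cells as possible fit within the
    # patch budget while the aspect ratio is preserved
    scale = math.sqrt(max_patches * (patch / h) * (patch / w))
    rows = max(min(math.floor(scale * h / patch), max_patches), 1)
    cols = max(min(math.floor(scale * w / patch), max_patches), 1)
    # the image is then resized to (rows * patch, cols * patch) and cut into patches
    return rows, cols

rows, cols = p2s_grid(900, 600)  # a portrait document page
```

A tall page thus keeps its aspect ratio (here a 55x36 grid, 1980 patches) instead of being squashed into a fixed square.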
Result
Arithmetic improvements
The LLM makes the model better at arithmetic, but it also produces more bad text.