Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling arxiv image

image problem : pix2seq is a framework for object detection only, let’s extend it to a unified framework for various vision-language problems. solution : Very similar to pix2seq, cut the object box into bins and use the generation method to generate a task. The image is put into an image encoder (CNN), task prefix, task input is put into a text encoder, concatenated, and put into a transformer encoder, and then one transformer decoder generates the output. Define output sequence for tasks other than Object Detection result : We were able to solve various tasks such as grounded captioning, VQA, image captioning, and object detection with a single unified model, and several of these tasks benefited from multi-task learning.