image

paper

TL;DR

  • I read this because : background for reading LLaVA 1.5
  • task : chatting VLM
  • problem : bring ChatGPT-style instruction following to the multi-modal setting
  • idea : feed bboxes and captions to a language-only GPT-4 and have it generate QA data
  • input/output : image + Q -> A
  • architecture : LLaMA 13B + CLIP + projection
  • objective : cross-entropy loss (autoregressive LM loss)
  • baseline : GPT-4, BLIP-2, OpenFlamingo
  • data : (feature alignment) 595K image-text pairs filtered from CC3M; (end-to-end learning) instruction data or ScienceQA, created with GPT-4 / ChatGPT from the captions and bboxes of COCO images
  • evaluation : sample COCO images to create questions, then have GPT-4 score the model's answers against the answers GPT-4 itself gives when shown the question together with the captions and bboxes
  • result : good performance on ScienceQA, and strong at higher-level reasoning (such as humor interpretation) that BLIP-2 / OpenFlamingo / GPT-4 struggle with
  • contribution : probably the first work to create visual instruction-tuning data this way; well open-sourced and widely used
  • etc. :
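A minimal sketch of the architecture row above: LLaVA v1 bridges the frozen CLIP vision encoder and the LLM with a single linear projection. Dimensions and weights here are illustrative (CLIP ViT-L/14 patch features are 1024-d, LLaMA-13B embeddings are 5120-d), not the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
vision_dim, llm_dim = 1024, 5120

W = rng.standard_normal((vision_dim, llm_dim)) * 0.01  # projection weight (trainable)
b = np.zeros(llm_dim)                                  # projection bias (trainable)

def project(patch_features):
    """Map CLIP patch features into the LLM token-embedding space."""
    return patch_features @ W + b

patches = rng.standard_normal((256, vision_dim))  # e.g. a 16x16 patch grid from CLIP
visual_tokens = project(patches)                  # (256, 5120): one "token" per patch
# These visual tokens are concatenated with the text token embeddings and fed
# to the LLM; in the feature-alignment stage only W and b are trained.
```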

Details

Instruction following data

  • Use the captions and bboxes of COCO images
  • conversation (58K) / detailed description (23K) / complex reasoning (77K) image
image

Ablation for this: adding the detailed descriptions improves performance on the chatbot side. It seems to help with reasoning.

Training

input sequence

image image

For the first turn, either the image or the question can come first; the order is randomized.

image image
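The randomization above can be sketched as follows (a sketch, not the released preprocessing code; `<image>` stands in for where the visual tokens are inserted):

```python
import random

def format_first_turn(question, image_token="<image>", rng=random):
    """For the first turn, randomly place the image before or after the
    question text; later turns in a conversation are text-only."""
    if rng.random() < 0.5:
        return f"{image_token}\n{question}"
    return f"{question}\n{image_token}"

random.seed(0)
example = format_first_turn("What is unusual about this image?")
```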
  • pre-training for feature alignment: filter CC3M down to 595K image-text pairs and train only the linear projection. The captions are used as-is, but formatted as simple instruction following (single turn, asking to briefly describe the image). The filtering method is as follows (coverage balanced by noun-phrase frequency)
image image
  • fine-tuning end-to-end: freeze only the vision encoder and train the rest (projection + LM)
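The noun-phrase frequency filtering can be sketched like this. This is my reading of the paper's CC3M step, not the released script: noun phrases that are too rare are skipped, and captions for very common noun phrases are downsampled. The paper extracts noun phrases with spaCy; the extractor below is a naive stand-in to keep the sketch self-contained.

```python
import random

def noun_phrases(caption):
    # Stand-in noun-phrase extractor (the paper uses spaCy); here we just
    # lowercase and split on whitespace for illustration.
    return {w.strip(".,").lower() for w in caption.split()}

def filter_captions(captions, min_freq=3, cap=100, seed=0):
    """Frequency-balanced filtering: skip noun phrases seen fewer than
    min_freq times, and for noun phrases seen more than cap times keep
    only a random subset of cap captions."""
    rng = random.Random(seed)
    by_phrase = {}
    for c in captions:
        for p in noun_phrases(c):
            by_phrase.setdefault(p, []).append(c)
    kept = set()
    for p, cs in by_phrase.items():
        if len(cs) < min_freq:
            continue                  # too rare: skip this concept
        if len(cs) > cap:
            cs = rng.sample(cs, cap)  # too common: keep a subset
        kept.update(cs)
    return sorted(kept)

# 10 "dog" captions get capped at 5; "zebra" (freq 2 < 3) is dropped entirely.
subset = filter_captions(
    [f"dog {i}" for i in range(10)] + ["zebra one", "zebra two"],
    min_freq=3, cap=5,
)
```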

Ability

image

I was wondering how the complex-reasoning data is made; the paper uses a system prompt like this:

You are an AI visual assistant that can analyze a single image. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y.

The task is to use the provided caption and bounding box information, create a plausible question about the image, and provide the answer in detail.

Create complex questions beyond describing the scene.
To answer such questions, one should require first understanding the visual content, then based on the background knowledge or reasoning, either explain why the things are happening that way, or provide guides and help to user's request.  Make the question challenging by not including the visual content details in the question so that the user needs to reason about that first.

Instead of directly mentioning the bounding box coordinates, utilize this data to explain the scene using natural language. Include details like object counts, position of the objects, relative position between the objects.  

When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box.  Always answer as if you are directly looking at the image.
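Putting the prompt to work, generation amounts to sending it, plus one image's captions and boxes, to a text-only GPT-4. A hedged sketch (the message packing and model name are my assumptions, not the paper's released scripts):

```python
# The full system prompt is the one quoted above; truncated here.
SYSTEM_PROMPT = "You are an AI visual assistant that can analyze a single image. ..."

def build_request(captions, boxes):
    """Pack one COCO image's captions and (x1, y1, x2, y2) boxes into a
    chat request for a text-only LLM -- the model never sees the pixels."""
    context = "\n".join(captions) + "\n" + "\n".join(
        f"{label}: ({x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f})"
        for label, (x1, y1, x2, y2) in boxes
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": context},
    ]

messages = build_request(
    captions=["A man ironing clothes on the back of a taxi."],
    boxes=[("person", (0.31, 0.12, 0.72, 0.95)),
           ("taxi", (0.05, 0.40, 0.98, 1.00))],
)
# The actual call would look something like (hypothetical model name):
# from openai import OpenAI
# reply = OpenAI().chat.completions.create(model="gpt-4", messages=messages)
```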

Ablations

image
  • ViT last layer vs. the layer before it -> the earlier layer is better
  • Applying CoT, i.e. answer-then-reasoning vs. reasoning-then-answer -> reasoning-then-answer converged faster, but final performance was the same
  • Skipping the feature-alignment stage and fine-tuning directly -> worse performance
  • LLM from 13B down to 7B -> worse performance

Play with demo

https://llava.hliu.cc/ Playing with the demo image

Good at general explanations

I tried to generate a scene graph. image

image image

The predicate is no longer a verb… image

It starts lying…

If you give it a bad example, it just incorporates the example into its answer. Still, it produces triplets that make sense. image

image

Here we get hallucination… Visual Genome was probably in the training data, so let's try some other data. This is a photo I took in Taiwan…

image image

I don't know where the child is, but I'm guessing…

image image

It’s a pretty clear sample

image

It’s starting to get good lol

image

I changed the prompt and suddenly it started saying the right thing again…

It works well when you give it a good example, but where's the baby? image

I posted a picture of my kid taken in Busan, which is very relatable lol and wouldn't be out of place in a benchmark. image

image

Perfect. image

It trips on a homophone a bit, but it's not wrong.

image

On an emotional note…

image image

Good at taking a knock…