image

paper

TL;DR

  • I read this because : background for reading LLaVA 1.5
  • task : chatting VLM
  • problem : bring ChatGPT-style instruction following to the multi-modal setting
  • idea : feed bboxes and captions to a language-only GPT-4 and have it generate QA data
  • input/output : image + Q -> A
  • architecture : LLaMA 13B + CLIP + projection
  • objective : cross-entropy loss (autoregressive LM loss)
  • baseline : GPT-4, BLIP-2, OpenFlamingo
  • data : (feature alignment) 595K image-text pairs filtered from CC3M; (end-to-end learning) instruction data or ScienceQA, created with GPT-4 / ChatGPT from the captions and bboxes of COCO images
  • evaluation : sample COCO images to create questions, then have GPT-4 score the model's answers against the answers GPT-4 itself gives when shown the question together with the captions and bboxes
  • result : good performance on ScienceQA, and strong at higher-level reasoning (such as humor interpretation) that BLIP-2 / OpenFlamingo / GPT-4 struggle with
  • contribution : probably the first work to create visual instruction-tuning data this way; well open-sourced and widely used
  • etc. :
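A minimal sketch of the architecture row above: LLaVA v1 bridges the frozen CLIP vision encoder and the LLM with a single linear projection. Dimensions and weights here are illustrative (CLIP ViT-L/14 patch features are 1024-d, LLaMA-13B embeddings are 5120-d), not the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
vision_dim, llm_dim = 1024, 5120

W = rng.standard_normal((vision_dim, llm_dim)) * 0.01  # projection weight (trainable)
b = np.zeros(llm_dim)                                  # projection bias (trainable)

def project(patch_features):
    """Map CLIP patch features into the LLM token-embedding space."""
    return patch_features @ W + b

patches = rng.standard_normal((256, vision_dim))  # e.g. a 16x16 patch grid from CLIP
visual_tokens = project(patches)                  # (256, 5120): one "token" per patch
# These visual tokens are concatenated with the text token embeddings and fed
# to the LLM; in the feature-alignment stage only W and b are trained.
```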

Details

Instruction following data

  • Use the captions and bboxes of COCO images
  • conversation (58K) / detailed description (23K) / complex reasoning (77K) image
image

Ablation for this: adding the detailed descriptions improves performance on the chatbot side. It seems to help with reasoning.

Training

input sequence

image image

For the first turn, either the image or the question can come first; the order is randomized.

image image
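The randomization above can be sketched as follows (a sketch, not the released preprocessing code; `<image>` stands in for where the visual tokens are inserted):

```python
import random

def format_first_turn(question, image_token="<image>", rng=random):
    """For the first turn, randomly place the image before or after the
    question text; later turns in a conversation are text-only."""
    if rng.random() < 0.5:
        return f"{image_token}\n{question}"
    return f"{question}\n{image_token}"

random.seed(0)
example = format_first_turn("What is unusual about this image?")
```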
  • pre-training for feature alignment: filter CC3M down to 595K image-text pairs and train only the linear projection. The captions are used as-is, but formatted as simple instruction following (single turn, asking to briefly describe the image). The filtering method is as follows (coverage balanced by noun-phrase frequency)
image image
  • fine-tuning end-to-end: freeze only the vision encoder and train the rest (projection + LM)
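The noun-phrase frequency filtering can be sketched like this. This is my reading of the paper's CC3M step, not the released script: noun phrases that are too rare are skipped, and captions for very common noun phrases are downsampled. The paper extracts noun phrases with spaCy; the extractor below is a naive stand-in to keep the sketch self-contained.

```python
import random

def noun_phrases(caption):
    # Stand-in noun-phrase extractor (the paper uses spaCy); here we just
    # lowercase and split on whitespace for illustration.
    return {w.strip(".,").lower() for w in caption.split()}

def filter_captions(captions, min_freq=3, cap=100, seed=0):
    """Frequency-balanced filtering: skip noun phrases seen fewer than
    min_freq times, and for noun phrases seen more than cap times keep
    only a random subset of cap captions."""
    rng = random.Random(seed)
    by_phrase = {}
    for c in captions:
        for p in noun_phrases(c):
            by_phrase.setdefault(p, []).append(c)
    kept = set()
    for p, cs in by_phrase.items():
        if len(cs) < min_freq:
            continue                  # too rare: skip this concept
        if len(cs) > cap:
            cs = rng.sample(cs, cap)  # too common: keep a subset
        kept.update(cs)
    return sorted(kept)

# 10 "dog" captions get capped at 5; "zebra" (freq 2 < 3) is dropped entirely.
subset = filter_captions(
    [f"dog {i}" for i in range(10)] + ["zebra one", "zebra two"],
    min_freq=3, cap=5,
)
```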

Ability

image

I was wondering how the complex-reasoning data is made; the paper uses a system prompt like this:

You are an AI visual assistant that can analyze a single image. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y.

The task is to use the provided caption and bounding box information, create a plausible question about the image, and provide the answer in detail.

Create complex questions beyond describing the scene.
To answer such questions, one should require first understanding the visual content, then based on the background knowledge or reasoning, either explain why the things are happening that way, or provide guides and help to user's request.  Make the question challenging by not including the visual content details in the question so that the user needs to reason about that first.

Instead of directly mentioning the bounding box coordinates, utilize this data to explain the scene using natural language. Include details like object counts, position of the objects, relative position between the objects.  

When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box.  Always answer as if you are directly looking at the image.
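Putting the prompt to work, generation amounts to sending it, plus one image's captions and boxes, to a text-only GPT-4. A hedged sketch (the message packing and model name are my assumptions, not the paper's released scripts):

```python
# The full system prompt is the one quoted above; truncated here.
SYSTEM_PROMPT = "You are an AI visual assistant that can analyze a single image. ..."

def build_request(captions, boxes):
    """Pack one COCO image's captions and (x1, y1, x2, y2) boxes into a
    chat request for a text-only LLM -- the model never sees the pixels."""
    context = "\n".join(captions) + "\n" + "\n".join(
        f"{label}: ({x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f})"
        for label, (x1, y1, x2, y2) in boxes
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": context},
    ]

messages = build_request(
    captions=["A man ironing clothes on the back of a taxi."],
    boxes=[("person", (0.31, 0.12, 0.72, 0.95)),
           ("taxi", (0.05, 0.40, 0.98, 1.00))],
)
# The actual call would look something like (hypothetical model name):
# from openai import OpenAI
# reply = OpenAI().chat.completions.create(model="gpt-4", messages=messages)
```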

Ablations

image
  • ViT last layer vs. the layer before it -> the earlier layer is better
  • Applying CoT, i.e. answer-then-reasoning vs. reasoning-then-answer -> reasoning-then-answer converged faster, but final performance was the same
  • Skipping the feature-alignment stage and fine-tuning directly -> worse performance
  • LLM from 13B down to 7B -> worse performance

Play with demo

https://llava.hliu.cc/ Playing with the demo image

Good at general explanations

I tried to generate a scene graph. image

image image

The predicate is no longer a verb… image

It starts lying…

If you give it a bad example, it just incorporates the example into its answer. Still, it produces triplets that make sense. image

image

Here we get hallucination… Visual Genome was probably in the training data, so let's try some other data. This is a photo I took in Taiwan…

image image

I don't know where the child is, but I'm guessing…

image image

It’s a pretty clear sample

image

It’s starting to get good lol

image

I changed the prompt and suddenly it started saying the right thing again…

It works well when you give it a good example, but where's the baby? image

I posted a picture of my kid taken in Busan, which is very relatable lol and wouldn't be out of place in a benchmark. image

image

Perfect. image

It trips on a homophone a bit, but it's not wrong.

image

On an emotional note…

image image

Good at taking a knock…