TL;DR
- I read this because : background reading for LLaVA 1.5
- task : chatting VLM
- problem : make instruction following work for multimodal models, ChatGPT-style
- idea : feed bounding boxes and captions to a language-only GPT-4 and have it generate QA pairs
- input/output : image + Q -> A
- architecture : LLaMA 13B + CLIP + projection
- objective : cross-entropy (next-token prediction) loss
- baseline : GPT-4, BLIP-2, OpenFlamingo
- data : (feature alignment) 595K image-text pairs filtered from CC3M; (e2e learning) instruction data or ScienceQA, created with GPT-4 or ChatGPT from captions and bboxes of COCO images
- evaluation : sample COCO images to create questions; GPT-4, given the caption and bboxes, produces a reference answer and then scores the model's answer against it
- result : strong performance on ScienceQA; good at higher-level reasoning (such as humor interpretation) that BLIP-2 / OpenFlamingo / GPT-4 struggle with
- contribution : probably the first work to create visual instruction-tuning data; well open-sourced and widely used
- etc. :
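The architecture in the TL;DR (CLIP features, a projection, then LLaMA) can be sketched as follows. This is a minimal illustration, not the released code; the dimensions are the commonly cited ones (CLIP ViT-L/14 patch features of 1024-d, LLaMA-13B embeddings of 5120-d), and LLaVA v1 uses a single linear layer as the projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical but representative dims: CLIP ViT-L/14 patch features (1024-d),
# LLaMA-13B embedding size (5120-d). LLaVA v1's projection is one linear layer.
clip_dim, llm_dim, num_patches, num_text = 1024, 5120, 256, 32
W = rng.normal(size=(clip_dim, llm_dim)) * 0.01
b = np.zeros(llm_dim)

def project(clip_feats):
    """Linearly map frozen CLIP patch features into the LLM token space."""
    return clip_feats @ W + b

vis_tokens = project(rng.normal(size=(num_patches, clip_dim)))
text_tokens = rng.normal(size=(num_text, llm_dim))  # stand-in text embeddings
# The projected visual tokens are prepended to the text embeddings and the
# whole sequence goes into LLaMA, trained with the usual CE next-token loss.
llm_input = np.concatenate([vis_tokens, text_tokens], axis=0)
```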
Details
Instruction following data
- Create a caption and bbox for a COCO image
- conversation (58K) / detailed description (23K) / complex reasoning (77K)
There is an ablation for this: adding the detailed descriptions improves performance on the chatbot side; it seems to help with reasoning.
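As a rough sketch of the three data types above (the field names are my own guesses, not the exact released JSON schema):

```python
# Illustrative samples of the three instruction-data types
# (field names are hypothetical, not the released schema).
samples = [
    {"type": "conversation",          # 58K: multi-turn QA about the image
     "turns": [("What color is the bus?", "The bus is white and red.")]},
    {"type": "detailed_description",  # 23K: one long description request
     "turns": [("Describe the image in detail.", "The image shows ...")]},
    {"type": "complex_reasoning",     # 77K: questions requiring inference
     "turns": [("What challenges might these people face?",
                "Given the crowded street and the rain, ...")]},
]
```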
Training
input sequence
For the first turn, whether the image tokens or the question text comes first is randomized.
- pre-training (feature alignment) : filter CC3M down to 595K image-text pairs and train only the linear projection. Captions are used as-is, but formatted as simple instruction following (single turn, asking to briefly describe the image). The filtering balances the data by noun-phrase frequency.
- finetuning (end-to-end) : freeze only the vision encoder and train the rest (projection + LM)
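The two-stage recipe above can be summarized as parameter groups (a sketch of the freezing scheme, not the actual training code):

```python
def trainable_params(stage):
    """Which modules train at each stage of the LLaVA recipe (sketch).
    Stage 1 (feature alignment on 595K CC3M pairs): only the projection.
    Stage 2 (instruction tuning): projection + LLM.
    The vision encoder stays frozen in both stages."""
    assert stage in ("pretrain", "finetune")
    frozen = {"vision_encoder"}
    if stage == "pretrain":
        frozen |= {"llm"}
    return {m: m not in frozen
            for m in ("vision_encoder", "projection", "llm")}
```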
Ability
I was wondering how the complex-reasoning data is made; they use a system prompt like this:
You are an AI visual assistant that can analyze a single image. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y.
The task is to use the provided caption and bounding box information, create a plausible question about the image, and provide the answer in detail.
Create complex questions beyond describing the scene.
To answer such questions, one should require first understanding the visual content, then based on the background knowledge or reasoning, either explain why the things are happening that way, or provide guides and help to user's request. Make the question challenging by not including the visual content details in the question so that the user needs to reason about that first.
Instead of directly mentioning the bounding box coordinates, utilize this data to explain the scene using natural language. Include details like object counts, position of the objects, relative position between the objects.
When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box. Always answer as if you are directly looking at the image.
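Assembling the actual GPT-4 query from COCO annotations might look like this. The message structure and function name are illustrative (not the paper's code), and the system prompt is abbreviated from the one quoted above:

```python
def build_gpt4_messages(captions, boxes):
    """Pack COCO captions and normalized bboxes into a text-only prompt
    for GPT-4 QA generation (structure illustrative, not the exact code)."""
    system = ("You are an AI visual assistant that can analyze a single "
              "image. ...")  # abbreviated from the prompt quoted above
    ctx = "\n".join(captions)
    ctx += "\n" + "\n".join(
        f"{name}: ({x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f})"
        for name, (x1, y1, x2, y2) in boxes)
    return [{"role": "system", "content": system},
            {"role": "user", "content": ctx}]

msgs = build_gpt4_messages(
    ["A man rides a horse on the beach."],
    [("person", (0.31, 0.20, 0.55, 0.71)),
     ("horse", (0.25, 0.38, 0.70, 0.95))])
```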
Ablations
- ViT last layer vs previous layer -> previous layer is better
- Applying CoT, i.e. answer-then-reasoning vs. reasoning-then-answer -> reasoning-then-answer converged faster, but final performance was the same
- Skip the alignment pre-training step and go straight to fine-tuning -> worse performance
- LLM 13B to 7B -> worse performance
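The "previous layer" ablation means taking the patch features from the second-to-last ViT block instead of the final one; with a list of per-layer hidden states, the selection is roughly:

```python
def select_vit_features(hidden_states, use_penultimate=True):
    """hidden_states: per-layer outputs, embeddings first, final block last.
    LLaVA's ablation found the penultimate layer's patch features
    work better than the last layer's."""
    return hidden_states[-2] if use_penultimate else hidden_states[-1]

# Toy stand-in for the embedding output + 24 transformer blocks of ViT-L/14.
layers = [f"layer_{i}_out" for i in range(25)]
feat = select_vit_features(layers)
```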
Play with demo
https://llava.hliu.cc/
Good at general explanations
I tried to generate a scene graph.
The predicates are no longer verbs…
It starts lying…
If I give it a bad example, it just incorporates the example into its answer, though it still produces triplets that make sense.
Here we get hallucination… Visual Genome was probably in the training data, so let's try some other data. Here's a photo I took in Taiwan…
I don’t know where child is, but I’m guessing it’s
It’s a pretty clear sample
It’s starting to get good lol
I changed the prompt and suddenly it started saying the right thing again…
It works well when given a good example, but where's the baby?
I posted a picture of my kid taken in Busan, which is very relatable lol and wouldn't be out of place in a benchmark.
Perfect.
It’s a bit of a homophone, but you’re not wrong.
On an emotional note…
Section 3 / Section 4: good at taking a knock…