image

paper

TL;DR

  • I read this because : Flamingo's demos made me wonder what tricks are possible at the inference step in the multi-modal setting.
  • task : chain-of-thought
  • problem : CoT has only worked for models with 100B+ parameters — why do ~1B models fail? And when an LLM's chain-of-thought is applied to a multi-modal task, hallucination appears and performance gets worse.
  • idea : Learn two models: one that sees a QCM (question / context / multiple choice) and generates a rationale, and one that receives the rationale plus the QCM and generates an answer. They have the same architecture but are trained separately. Both models additionally receive the vision features.
  • input/output : image, context, question, options -> gold rationale / image, context, question, option, rationale -> answer
  • architecture : DETR encoder + T5 (initialized from UnifiedQA)
  • objective : cross entropy loss
  • baseline : No-CoT (one-stage model), CoT w/o visual features, CoT with captions
  • data : ScienceQA
  • evaluation : RougeL, accuracy
  • result : superior to GPT-3.5's language-only CoT.
  • contribution : seems to be the first CoT that looks at visual information.
  • etc. :
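Both stages are trained with plain token-level cross entropy on the decoder outputs (gold rationale tokens in stage one, answer tokens in stage two). A minimal numpy sketch of that loss — the shapes and `pad_id` are my assumptions, not the paper's code:

```python
import numpy as np

def token_cross_entropy(logits, targets, pad_id=0):
    """Mean negative log-likelihood over non-pad target tokens.

    logits: (seq_len, vocab) decoder scores under teacher forcing
    targets: (seq_len,) gold token ids (rationale R or answer A)
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    mask = targets != pad_id                               # ignore padding positions
    return nll[mask].mean()
```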

Details

Problem

image

When ScienceQA is trained with a text-only model in a one-stage setting (rationale and answer generated at once?), the performance was worse than without CoT. To see why this was the case, I split the performance into QCM -> R and QCMR -> A and got this result

image image

In other words, the model was generating wrong rationales. In the example below

image

In hindsight it’s obvious… if the model is asked to produce a rationale without looking at the image, it hallucinates — stating things contrary to the image — and performance drops.

Framework

image

(i) rationale generation and (ii) answer inference have the same architecture but are trained separately (is there a reason for this in the paper?). In stage (i), the input is $X=\{X^1_{language}, X_{vision}\}$ and the output is the rationale $R$. Stage (ii) concatenates the generated $R$ to form the input $X'=\{\mathrm{concat}(X^1_{language}, R), X_{vision}\}$ and generates the answer $A$.
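The two-stage flow is just plumbing. A sketch with stand-in callables — the prompt template is my guess, and `model` abstracts the T5 decoder plus vision encoder:

```python
def rationale_stage(model, question, context, options, vision_feat):
    """Stage (i): QCM + vision features -> rationale R."""
    x = f"Question: {question}\nContext: {context}\nOptions: {options}"
    return model(x, vision_feat)

def answer_stage(model, question, context, options, rationale, vision_feat):
    """Stage (ii): concat(QCM, R) + vision features -> answer A."""
    x = (f"Question: {question}\nContext: {context}\n"
         f"Options: {options}\nRationale: {rationale}")
    return model(x, vision_feat)

def multimodal_cot(gen_model, inf_model, question, context, options, vision_feat):
    # Two separately trained models with identical architecture.
    r = rationale_stage(gen_model, question, context, options, vision_feat)
    return answer_stage(inf_model, question, context, options, r, vision_feat)
```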

Architecture

  • Encoding image

The vision extractor is DETR. $H_{vision}\in\mathbb{R}^{m\times d}$, where $m$ is the number of patches and $d$ is the hidden dim of the language model output.
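Since DETR's output dimension generally differs from T5's hidden size, a learnable projection maps the patch features into the language dimension. A numpy sketch — the shapes (100 DETR queries of dim 256, T5-base hidden dim 768) are assumptions, and the projection here is random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)
m, detr_dim, d = 100, 256, 768               # assumed: # patches, DETR dim, T5-base dim

detr_out = rng.normal(size=(m, detr_dim))    # stand-in for DETR patch features
W_h = rng.normal(size=(detr_dim, d)) * 0.02  # learnable projection (random here)

H_vision = detr_out @ W_h                    # (m, d): same hidden dim as H_language
```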

  • Interaction image

$Q = H_{language}$, $K = V = H_{vision}$

Gated fusion lets the model learn how much vision information to look at.

image
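A numpy sketch of that attention-plus-gate step, as I read it: cross-attention with language queries over vision keys/values, then a sigmoid gate mixing the attended vision features back into $H_{language}$. The gate weights $W_l, W_v$ are learned in the paper; here they are random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(H_lang, H_vis, W_l, W_v):
    """Cross-attention (Q=language, K=V=vision) followed by a learned gate."""
    d = H_lang.shape[-1]
    attn = softmax(H_lang @ H_vis.T / np.sqrt(d))  # (n_tokens, m_patches)
    H_attn = attn @ H_vis                          # vision info aligned to each token
    lam = sigmoid(H_lang @ W_l + H_attn @ W_v)     # gate in (0, 1): how much vision?
    return (1 - lam) * H_lang + lam * H_attn
```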

Result

image

Superior to GPT-3.5 performance

image

For the two-stage baseline (two-stage training without looking at images), performance was good initially but did not improve as the epochs progressed. Why does the one-stage baseline keep getting better? Hmm…

  • vision features architecture
image
  • language model architecture
image

  • Multimodal-CoT error case study
image