TL;DR
- I read this because : Flamingo's demo impressed me, and I wondered what trick could be pulled at the inference step in a multi-modal setting.
- task : chain-of-thought
- problem : Only models with 100B+ parameters could do CoT; why did 1B models fail? And when an LLM's chain-of-thought is used for multi-modal tasks, hallucination appears and performance gets worse.
- Idea : Learn two models: one that sees a QCM (question / context / multiple choice) input and generates a rationale, and one that receives the rationale together with the QCM and generates an answer. They have the same architecture but are trained separately. Both models additionally receive the vision feature.
- input/output : image, context, question, options -> gold rationale / image, context, question, options, rationale -> answer
- architecture : DETR encoder + T5 (initialized from UnifiedQA)
- objective : cross entropy loss
- baseline : No-CoT(one-stage model), CoT w/o visual feature, CoT with caption
- data : ScienceQA
- evaluation : RougeL, accuracy
- result : superior to GPT-3.5 performance.
- contribution : appears to be the first CoT approach that incorporates visual information.
- etc. :
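Rationale quality is scored with ROUGE-L (answers with accuracy). A minimal toy sketch of the LCS-based ROUGE-L F1, my own implementation rather than the paper's exact scorer:

```python
def _lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str) -> float:
    """Token-level ROUGE-L F1 between a generated and a gold rationale."""
    cand, ref = candidate.split(), reference.split()
    lcs = _lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)
```

For example, `rouge_l("the cat sat", "the cat ran")` gives 2/3, since the two rationales share a length-2 subsequence.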
Details
Problem
When a text-only model was trained on ScienceQA in a one-stage setting (rationale and answer generated at once?), the performance was worse than without CoT. To see why, they split the performance into QCM -> R and QCMR -> A and got this result:
In other words, the model was generating the wrong rationale. In the example below:
Of course it's obvious… if the model is asked to produce a rationale without looking at the image, hallucination occurs: it writes as if it saw something contrary to the image, and performance drops.
Framework
(i) rationale generation and (ii) answer inference, which have the same architecture but are trained separately (is there a reason for this in the paper?). In step (i), the input is $X=\{X^1_{language}, X_{vision}\}$ and the output is the rationale $R$. Step (ii) concatenates the generated $R$ to form the input $X'=\{\text{concat}(X^1_{language}, R), X_{vision}\}$ and generates the answer $A$.
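The two-stage flow can be sketched with stub generators. Function names and the prompt layout below are my own illustration, not the paper's exact templates; in practice both stages are T5 seq2seq models that also consume the vision feature.

```python
def build_stage1_input(question, context, options):
    # Stage (i) language input: QCM -> R. Prompt layout is illustrative.
    opts = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"Question: {question} Context: {context} Options: {opts}"

def build_stage2_input(stage1_input, rationale):
    # Stage (ii): concat the generated rationale R onto the language input -> A.
    return f"{stage1_input} Solution: {rationale}"

def two_stage_infer(rationale_model, answer_model, question, context, options, image_feat):
    # Both models additionally receive the vision feature (image_feat).
    x1 = build_stage1_input(question, context, options)
    rationale = rationale_model(x1, image_feat)   # QCM (+vision) -> R
    x2 = build_stage2_input(x1, rationale)
    return answer_model(x2, image_feat)           # QCMR (+vision) -> A
```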
Architecture
- Encoding
The vision extractor is DETR. $H_{vision}\in\mathbb{R}^{m\times d}$ (m: # of patches, d: hidden dim), matching the dimension of the language encoder output.
- Interaction
Q : $H_{language}$, K = V : $H_{vision}$
Gated fusion is then applied, letting the model learn how much vision information to look at.
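The interaction can be sketched in NumPy: cross-attention with language queries over vision keys/values, then a sigmoid gate that mixes the attended vision feature back into the language feature. Weight names (`W_l`, `W_v`) are my own; this is a sketch of the described mechanism, not the paper's code.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(H_lang, H_vision, W_l, W_v):
    """Cross-attention (Q = language, K = V = vision) followed by a learned gate.

    H_lang: (n, d), H_vision: (m, d), W_l / W_v: (d, d)."""
    d = H_lang.shape[-1]
    scores = H_lang @ H_vision.T / np.sqrt(d)        # (n, m) attention scores
    H_attn = softmax(scores) @ H_vision              # attended vision feature, (n, d)
    lam = 1.0 / (1.0 + np.exp(-(H_lang @ W_l + H_attn @ W_v)))  # gate in (0, 1)
    return (1.0 - lam) * H_lang + lam * H_attn       # fused representation, (n, d)
```

With zero gate weights the gate sits at 0.5, so the fused output is an even blend of the language and attended vision features; training moves the gate per position.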
Result
Superior to GPT-3.5 performance
For the two-stage baseline (two-stage training without looking at images), performance was good initially but did not improve as epochs progressed. Why does the one-stage baseline keep getting better? Hmm…
- vision feature architecture
- language model architecture
- Multimodal-CoT error case study