TL;DR
- I read this because : Flamingo's demo impressed me, and I wondered what trick could be pulled at the inference step in a multi-modal setting.
- task : chain-of-thought
- problem : Only models with 100B+ parameters could do CoT; why did 1B models fail? And when an LLM's chain-of-thought is used for multi-modal tasks, hallucination appears and performance gets worse.
- Idea : Learn two models: one that sees a QCM (question / context / multiple choice) input and generates a rationale, and one that receives the rationale together with the QCM and generates an answer. They have the same architecture but are trained separately. Both models additionally receive the vision feature.
- input/output : image, context, question, options -> gold rationale / image, context, question, options, rationale -> answer
- architecture : DETR encoder + T5 (initialized from UnifiedQA)
- objective : cross entropy loss
- baseline : No-CoT(one-stage model), CoT w/o visual feature, CoT with caption
- data : ScienceQA
- evaluation : RougeL, accuracy
- result : superior to GPT-3.5 performance.
- contribution : appears to be the first CoT approach that incorporates visual information.
- etc. :
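Rationale quality is scored with ROUGE-L (answers with accuracy). A minimal toy sketch of the LCS-based ROUGE-L F1, my own implementation rather than the paper's exact scorer:

```python
def _lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str) -> float:
    """Token-level ROUGE-L F1 between a generated and a gold rationale."""
    cand, ref = candidate.split(), reference.split()
    lcs = _lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)
```

For example, `rouge_l("the cat sat", "the cat ran")` gives 2/3, since the two rationales share a length-2 subsequence.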
Details
Problem
When a text-only model was trained on ScienceQA in a one-stage setting (rationale and answer generated at once?), the performance was worse than without CoT. To see why, they split the performance into QCM -> R and QCMR -> A and got this result:
In other words, the model was generating the wrong rationale. In the example below:
Of course it's obvious… if the model is asked to produce a rationale without looking at the image, hallucination occurs: it writes as if it saw something contrary to the image, and performance drops.
Framework
(i) rationale generation and (ii) answer inference, which have the same architecture but are trained separately (is there a reason for this in the paper?). In step (i), the input is $X=\{X^1_{language}, X_{vision}\}$ and the output is the rationale $R$. Step (ii) concatenates the generated $R$ to form the input $X'=\{\text{concat}(X^1_{language}, R), X_{vision}\}$ and generates the answer $A$.
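The two-stage flow can be sketched with stub generators. Function names and the prompt layout below are my own illustration, not the paper's exact templates; in practice both stages are T5 seq2seq models that also consume the vision feature.

```python
def build_stage1_input(question, context, options):
    # Stage (i) language input: QCM -> R. Prompt layout is illustrative.
    opts = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"Question: {question} Context: {context} Options: {opts}"

def build_stage2_input(stage1_input, rationale):
    # Stage (ii): concat the generated rationale R onto the language input -> A.
    return f"{stage1_input} Solution: {rationale}"

def two_stage_infer(rationale_model, answer_model, question, context, options, image_feat):
    # Both models additionally receive the vision feature (image_feat).
    x1 = build_stage1_input(question, context, options)
    rationale = rationale_model(x1, image_feat)   # QCM (+vision) -> R
    x2 = build_stage2_input(x1, rationale)
    return answer_model(x2, image_feat)           # QCMR (+vision) -> A
```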
Architecture
- Encoding
The vision extractor is DETR. $H_{vision}\in\mathbb{R}^{m\times d}$ (m: # of patches, d: hidden dim), matching the dimension of the language encoder output.
- Interaction
Q : $H_{language}$, K = V : $H_{vision}$
Gated fusion is then applied, letting the model learn how much vision information to look at.
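The interaction can be sketched in NumPy: cross-attention with language queries over vision keys/values, then a sigmoid gate that mixes the attended vision feature back into the language feature. Weight names (`W_l`, `W_v`) are my own; this is a sketch of the described mechanism, not the paper's code.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(H_lang, H_vision, W_l, W_v):
    """Cross-attention (Q = language, K = V = vision) followed by a learned gate.

    H_lang: (n, d), H_vision: (m, d), W_l / W_v: (d, d)."""
    d = H_lang.shape[-1]
    scores = H_lang @ H_vision.T / np.sqrt(d)        # (n, m) attention scores
    H_attn = softmax(scores) @ H_vision              # attended vision feature, (n, d)
    lam = 1.0 / (1.0 + np.exp(-(H_lang @ W_l + H_attn @ W_v)))  # gate in (0, 1)
    return (1.0 - lam) * H_lang + lam * H_attn       # fused representation, (n, d)
```

With zero gate weights the gate sits at 0.5, so the fused output is an even blend of the language and attended vision features; training moves the gate per position.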
Result
Superior to GPT-3.5 performance
For the two-stage baseline (two-stage training without looking at images), performance was good initially but did not improve as epochs progressed. Why does the one-stage baseline keep getting better? Hmm…
- vision feature architecture
- language model architecture
- Multimodal-CoT error case study