image

paper

TL;DR

  • task : image captioning and SGG
  • problem : Using the scene graph of an image helps captioning, so prior work feeds an external SGG model's output through a GCN and uses it as captioning input. However, (1) training with the image-captioning loss (MLE) rather than a relation-related loss does not train the encoder well enough, and (2) the encoder is not structured so that relations can be extracted from it separately, so it is less general and less explainable.
  • idea : image captioning and scene graph generator in one transformer model!
  • architecture : Faster R-CNN provides bounding boxes; a Transformer encoder whose outputs are expanded into m(m-1) object pairs for relation prediction; and a decoder that predicts tokens from a weighted sum of the hidden vectors of the L encoder layers.
  • objective : cross-entropy loss for SGG / MLE for captioning.
  • baseline : IMP, MOTIFS, VCTree (for SGG)
  • data : COCO (image captioning), Visual Genome (SGG). A few images overlap between COCO and Visual Genome, but they are very few.
  • result : SOTA on image captioning; SGG is competitive with SOTA.
  • contribution : SGG + captioning in one model!
  • limitation : SGG is not clearly SOTA (e.g., on SGCls it trails BGT-Net and RTN).

Details

A Relational Encoding Learning Idea

image

A typical captioning objective maximizes the log-likelihood of the caption tokens:

$$\theta^{*} = \arg\max_{\theta} \sum_{t} \log p(y_t \mid y_{<t}, x; \theta)$$

Here $y$ is the caption token sequence and $x$ is the visual feature of the image. To add scene-graph information to captioning, prior work first feeds the image feature $x$ into a pretrained SGG model to obtain a graph, embeds the graph with a GCN, and concatenates that embedding with the image feature as the captioning input. In this setup the objective is captioning, not SGG, and there are studies showing that a strong decoder can perform reasonably well even when the encoder extracts little information, so it is questionable whether the encoder is actually trained to extract relation embeddings well.
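Concretely, the MLE objective above is a per-step cross-entropy over the vocabulary under teacher forcing. A minimal NumPy sketch (not the paper's code; shapes and names are illustrative):

```python
import numpy as np

def caption_mle_loss(token_logits, target_ids):
    """Negative log-likelihood of the ground-truth caption tokens.

    token_logits: (T, V) unnormalized decoder scores, one row per step.
    target_ids:   (T,) ground-truth token ids y_t.
    """
    # numerically stable log-softmax over the vocabulary at each step
    shifted = token_logits - token_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # sum_t log p(y_t | y_<t, x); training minimizes the negative
    return -log_probs[np.arange(len(target_ids)), target_ids].sum()

# toy example: 3 steps, vocabulary of 5; model is confident in the targets
logits = np.zeros((3, 5))
logits[np.arange(3), [1, 2, 3]] = 10.0
loss = caption_mle_loss(logits, np.array([1, 2, 3]))  # close to 0
```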

Architecture

image

Encoder Architecture

For the encoder, each detected box contributes its CNN visual feature plus the GloVe vector of its box label; these are fed through the transformer encoder, the outputs are concatenated into the m(m-1) ordered object pairs to form relation vectors, and a softmax classifies the relation of each pair.
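A minimal sketch of this pair construction and relation softmax, assuming a simple concatenation of subject/object encoder outputs and random placeholder classifier weights (the paper's exact relation head may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

m, d, num_rel = 4, 8, 6          # objects, hidden size, relation classes
H = rng.normal(size=(m, d))      # encoder outputs, one vector per detected object

# all m*(m-1) ordered (subject, object) pairs, subject != object
pairs = [(i, j) for i in range(m) for j in range(m) if i != j]
pair_feats = np.stack([np.concatenate([H[i], H[j]]) for i, j in pairs])  # (m(m-1), 2d)

# linear relation classifier + softmax (W is a hypothetical placeholder)
W = rng.normal(size=(2 * d, num_rel))
scores = pair_feats @ W
probs = np.exp(scores - scores.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)   # (m(m-1), num_rel) relation distribution
```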

Weighted Decoder for Image Captioning

When decoding, instead of using only the last layer, token prediction is conditioned on a weighted sum of the output vectors of all layers of the transformer encoder. image
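A sketch of the layer-mixing idea, assuming ELMo-style softmax-normalized scalar weights over the L encoder layers (the exact weighting scheme here is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
L, m, d = 6, 4, 8                            # layers, objects, hidden size
layer_outputs = rng.normal(size=(L, m, d))   # hidden states from every encoder layer

# learned scalar weights, normalized with a softmax so they sum to 1
w = rng.normal(size=L)
alpha = np.exp(w - w.max())
alpha /= alpha.sum()

# decoder input: per-object weighted sum over the L layers
mixed = np.tensordot(alpha, layer_outputs, axes=1)   # (m, d)
```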

Sequential Training with Inferred Labels

(i) Train Faster R-CNN on Visual Genome. (ii) Train the encoder on Visual Genome on top of the trained Faster R-CNN. (iii) With the trained encoder in place, train the encoder–caption decoder on the COCO dataset (whose scene-graph labels are inferred by the encoder, since COCO has no ground-truth scene graphs).

An ablation that trained with a weighted sum of the caption loss and the SGG loss performed worse than the caption loss alone. image
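The joint objective tried in this ablation can be written as a simple weighted combination; a trivial sketch (the mixing weight `lam` and its parameterization are illustrative, not from the paper):

```python
def combined_loss(caption_loss: float, sgg_loss: float, lam: float = 0.5) -> float:
    """Convex combination of the two objectives.

    lam = 1.0 recovers the caption loss alone, which the ablation
    found to work better than joint weighting.
    """
    return lam * caption_loss + (1.0 - lam) * sgg_loss
```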

Results

SGG

image

c.f. two-stage SGG evaluation settings:

  • Predicate classification (PredCls) : given GT boxes and object classes, predict the predicates
  • Scene graph classification (SGCls) : given GT boxes, predict the object classes and predicates
  • Scene graph detection (SGDet, the same setting as SGGen) : predict boxes, object classes, and predicates from the image alone
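For intuition, the triplet Recall@K used in the tables below can be sketched as follows. This simplified version matches triplets exactly and ignores the IoU-based box matching that real SGDet evaluation additionally requires:

```python
def recall_at_k(pred_triplets, gt_triplets, k):
    """Fraction of ground-truth triplets recovered among the top-k predictions.

    pred_triplets: list of (subject, predicate, object) tuples, sorted by confidence.
    gt_triplets:   set of ground-truth triplets.
    """
    hits = sum(1 for t in pred_triplets[:k] if t in gt_triplets)
    return hits / max(len(gt_triplets), 1)

# toy example with hypothetical triplets
gt = {("man", "riding", "horse"), ("man", "wearing", "hat")}
preds = [("man", "riding", "horse"), ("dog", "on", "grass"), ("man", "wearing", "hat")]
```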

| SGDet | R@20 | R@50 | R@100 |
| --- | --- | --- | --- |
| Reformer (here) | 25.4 | 33.0 | 37.2 |
| Seq2Seq (https://github.com/long8v/PTIR/issues/50) | 22.1 | 30.9 | 34.4 |
| BGT-Net (GRU) (https://github.com/long8v/PTIR/issues/51) | 25.5 | 32.8 | 37.3 |
| RTN (https://github.com/long8v/PTIR/issues/49) | 22.5 | 29.0 | 33.1 |
| SGCls | R@20 | R@50 | R@100 |
| --- | --- | --- | --- |
| Reformer (here) | 36.6 | 40.1 | 41.1 |
| Seq2Seq (https://github.com/long8v/PTIR/issues/50) | 34.5 | 38.3 | 39.0 |
| BGT-Net (GRU) (https://github.com/long8v/PTIR/issues/51) | 41.7 | 45.9 | 47.1 |
| RTN (https://github.com/long8v/PTIR/issues/49) | 43.8 | 44.0 | 44.0 |

Captioning

image

Ablation for captioning

image