TL;DR
- I read this because : personal research; very relevant to my own research
- task : object hallucination evaluation
- problem : The existing CHAIR metric for measuring hallucination relies on string matching and is limited to COCO objects.
- idea : parsing with an LLM, object extraction with DETR, bipartite matching by semantic similarity with S-BERT
- input/output : {image, text} -> score (higher is better)
- baseline : CHAIR, CLIPScore, RefCLIPScore
- data : FOIL, noCaps-FOIL(proposed), HAT(proposed)
- evaluation : AP for task 1, LA (localization accuracy) for task 2
- result : Average Precision is comparable to RefCLIPScore; Localization Accuracy is comparable to CHAIRs on HAT but superior on noCaps-FOIL.
- contribution : proposed a pipeline for evaluating object hallucination in captioning models.
- etc. : It’s good that the limitations are not hidden, and that the authors created data to demonstrate the advantages of the proposed method.
Details
motivation
overall pipeline
(1) Extracting objects from candidates, references, and images
- GT Candidates
- DETRs trained on COCO -> object candidates
- Object parsing from the reference captions using ChatGPT
- Attributes are also extracted at the same time
- Singularization (removing the plural “s”)
- predicted
- The candidate caption is parsed with an LLM, same as the reference captions
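The singularization step in (1) can be sketched with a few naive suffix rules (the paper likely uses a proper inflection library; `singularize` and its rules here are my own simplification):

```python
def singularize(word: str) -> str:
    """Naive singularization by stripping common plural suffixes.
    A rough sketch only; irregular plurals (e.g. "knives") are not handled."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"          # puppies -> puppy
    if word.endswith("es") and word[:-2].endswith(("sh", "ch", "x", "z")):
        return word[:-2]                # dishes -> dish
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # forks -> fork, but glass stays glass
    return word

print([singularize(w) for w in ["forks", "puppies", "dishes", "glass"]])
# ['fork', 'puppy', 'dish', 'glass']
```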
(2) Object Filtering
- Sometimes the captioning model is uncertain and the caption reads like “fork or knife”. In this case, remove it from the candidate caption’s object set (but not from the references).
- Use spaCy to keep only the nouns in a reference noun phrase.
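The two filters in (2) could look roughly like this. Both helpers are hypothetical: the “uncertain caption” rule (drop any object phrase containing “or”) is my guess at the paper’s behavior, and `head_nouns` takes pre-tagged `(token, POS)` pairs instead of running an actual spaCy pipeline:

```python
def drop_uncertain(candidate_objects):
    """Drop uncertain candidate objects such as "fork or knife".
    Hypothetical rule: any object phrase containing "or" is removed
    from the candidate set (references are left untouched)."""
    return [o for o in candidate_objects if " or " not in f" {o} "]

def head_nouns(tagged_phrase):
    """Keep only the nouns from a reference noun phrase, given
    (token, POS) pairs as a spaCy pipeline would produce them."""
    return [tok for tok, pos in tagged_phrase if pos == "NOUN"]

print(drop_uncertain(["table", "fork or knife"]))                        # ['table']
print(head_nouns([("a", "DET"), ("wooden", "ADJ"), ("table", "NOUN")]))  # ['table']
```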
(3) Object matching: bipartite matching using SBERT similarity
The final score is the “least matching similarity”: the minimum similarity among the matched pairs.
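A minimal sketch of the matching step in (3), assuming precomputed embeddings (standing in for SBERT) and brute-force assignment instead of a proper Hungarian solver; any confidence weighting the paper applies is ignored:

```python
from itertools import permutations
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def min_match_similarity(cand_embs, ref_embs):
    """Bipartite matching that maximizes total cosine similarity
    (brute force; assumes len(cand_embs) <= len(ref_embs)), then
    returns the minimum similarity among matched pairs, i.e. the
    "least matching similarity"."""
    best = None
    for perm in permutations(range(len(ref_embs)), len(cand_embs)):
        sims = [cosine(c, ref_embs[j]) for c, j in zip(cand_embs, perm)]
        if best is None or sum(sims) > sum(best):
            best = sims
    return min(best)

# Toy 2-D embeddings: the hallucinated object's vector is far from all references,
# so it drags the minimum matched similarity down.
refs = [(1.0, 0.0), (0.0, 1.0)]
cands_ok = [(0.9, 0.1), (0.1, 0.9)]
cands_halluc = [(0.9, 0.1), (-1.0, 0.0)]
print(min_match_similarity(cands_ok, refs) > min_match_similarity(cands_halluc, refs))
# True
```

A hallucinated object ends up matched to a semantically distant reference, so taking the minimum (rather than the mean) makes a single hallucination dominate the score.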
Result
HAT
HAT is annotated directly on COCO images (400 test samples). Here CHAIRs is an accuracy (can AP and accuracy be put in the same table?).
FOIL
Performs well on noCaps-FOIL. The baseline here is 50, so I’m not sure whether the two captions are compared against each other as in CLIPScore; if so, I’m not sure it should be called AP rather than accuracy.