image

paper, code

TL;DR

  • I read this because : personal research; very relevant to my own research
  • task : object hallucination evaluation
  • problem : The existing CHAIR metric for measuring hallucination relies on string matching and is limited to COCO objects.
  • idea : parse captions with an LLM, extract objects with DETR, and match them via bipartite matching with S-BERT semantic similarity
  • input/output : {image, text} -> score (higher is better)
  • baseline : CHAIR, CLIPScore, RefCLIPScore
  • data : FOIL, noCaps-FOIL(proposed), HAT(proposed)
  • evaluation : AP for task 1, LA (accuracy) for task 2
  • result : AP is comparable to RefCLIPScore; LA is comparable to CHAIRs but superior on noCaps-FOIL.
  • contribution : Proposed a pipeline for evaluating object hallucinations in captioning models.
  • etc. : It’s good that the limitations are not hidden, and that the authors created data to showcase the advantages of the proposed method.

Details

motivation

image

overall pipeline

image

(1) Extracting objects from candidates, references, and images

  • GT candidates
  • DETR trained on COCO -> object candidates
  • Object parsing from the reference caption using ChatGPT
  • Attributes are also extracted at the same time
  • Singularization (stripping the plural “s”)
  • Predicted
  • Parsing the candidate caption with the LLM, same as above
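The singularization step (“minus s”) can be sketched as a small helper. This is my own illustrative sketch, not the paper’s code; a real pipeline would use a proper inflection library to handle irregular plurals.

```python
def singularize(noun: str) -> str:
    """Naively singularize a plural noun with simple suffix rules.

    Illustrative sketch of the "minus s" step from the notes;
    irregular plurals ("people", "mice") are not handled.
    """
    if noun.endswith("ies") and len(noun) > 3:
        return noun[:-3] + "y"        # "puppies" -> "puppy"
    if noun.endswith("es") and noun[:-2].endswith(("sh", "ch", "s", "x", "z")):
        return noun[:-2]              # "boxes" -> "box"
    if noun.endswith("s") and not noun.endswith("ss"):
        return noun[:-1]              # "forks" -> "fork"
    return noun
```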

(2) Object Filtering

  • Sometimes the captioning model is uncertain and the caption reads like “a fork or knife”.
  • In this case, remove the object from the candidate caption’s class set (but not from the reference’s).
  • Use spaCy to keep only the noun in each reference noun phrase.
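The uncertainty filter can be sketched as follows. The function name `filter_uncertain_objects` and the regex-based phrase handling are my own illustration (the paper relies on LLM parsing), but it shows the intended behavior: objects hedged by an “or” disjunction are dropped from the candidate set only.

```python
import re

def filter_uncertain_objects(caption: str, objects: list[str]) -> list[str]:
    """Drop candidate objects that the caption hedges with an 'or' disjunction.

    Illustrative sketch: if the caption says "a fork or knife", neither
    "fork" nor "knife" is kept in the candidate object set. The reference
    object set is never filtered this way.
    """
    uncertain = set()
    # Find "X or Y" patterns and mark both sides as uncertain.
    for left, right in re.findall(r"(\w+) or (?:a |an |the )?(\w+)", caption):
        uncertain.add(left.lower())
        uncertain.add(right.lower())
    return [o for o in objects if o.lower() not in uncertain]
```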

(3) Object matching: bipartite matching using SBERT

The final metric is the minimum similarity over the matched pairs, as shown below.

image
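The matching step can be sketched with SciPy’s Hungarian-algorithm solver. The similarity matrix below is a hand-made toy stand-in for real S-BERT cosine similarities (a real run would encode each object with sentence-transformers first), and `min_matching_similarity` is my own illustrative name.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def min_matching_similarity(sim: np.ndarray) -> float:
    """Score a caption by its weakest matched object.

    sim[i, j] is the similarity between candidate object i and reference
    object j (e.g. S-BERT cosine similarity). The Hungarian algorithm
    finds the maximum-similarity bipartite matching; the final score is
    the smallest similarity among the matched pairs, so a single
    hallucinated object drags the whole score down.
    """
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return float(sim[rows, cols].min())

# Toy example: two candidate objects vs. two reference objects.
sim = np.array([
    [0.9, 0.2],   # candidate "dog": strong match to a reference object
    [0.1, 0.3],   # hallucinated candidate: matches nothing well
])
```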

Result

HAT

image

HAT was created directly from COCO images (400 test samples). Note that CHAIRs is reported as accuracy here (is it fair to put AP and accuracy in the same table?).

FOIL

image

Performs well on noCaps. The baseline here is 50, so I wonder whether the two captions are scored comparatively, as in CLIPScore; if so, I’m not sure it should be called AP rather than accuracy.

Qualitative

image

Ablation

image