TL;DR
- I read this because : personal research; very relevant to my own research
- task : object hallucination evaluation
- problem : The existing CHAIR metric for measuring hallucination relies on string matching and is limited to COCO objects.
- idea : parsing with an LLM, object extraction with DETR, bipartite matching by semantic similarity with S-BERT
- input/output : {image, text} -> score (higher is better)
- baseline : CHAIR, CLIPScore, RefCLIPScore
- data : FOIL, noCaps-FOIL(proposed), HAT(proposed)
- evaluation : AP for task 1, LA (localization accuracy) for task 2
- result : Average Precision is comparable to RefCLIPScore; Localization Accuracy is comparable to CHAIRs on HAT but superior on noCaps-FOIL.
- contribution : proposed a pipeline for evaluating object hallucination in captioning models.
- etc. : It’s good that the limitations are not hidden, and that the authors created data to demonstrate the advantages of the proposed method.
Details
motivation
overall pipeline
(1) Extracting objects from candidates, references, and images
- GT Candidates
- DETRs trained on COCO -> object candidates
- Object parsing from the reference captions using ChatGPT
- Attributes are also extracted at the same time
- Singularization (removing the plural “s”)
- predicted
- The candidate caption is parsed with an LLM, same as the reference captions
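The singularization step in (1) can be sketched with a few naive suffix rules (the paper likely uses a proper inflection library; `singularize` and its rules here are my own simplification):

```python
def singularize(word: str) -> str:
    """Naive singularization by stripping common plural suffixes.
    A rough sketch only; irregular plurals (e.g. "knives") are not handled."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"          # puppies -> puppy
    if word.endswith("es") and word[:-2].endswith(("sh", "ch", "x", "z")):
        return word[:-2]                # dishes -> dish
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # forks -> fork, but glass stays glass
    return word

print([singularize(w) for w in ["forks", "puppies", "dishes", "glass"]])
# ['fork', 'puppy', 'dish', 'glass']
```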
(2) Object Filtering
- Sometimes the captioning model is uncertain and the caption reads like “fork or knife”. In this case, remove it from the candidate caption’s object set (but not from the references).
- Use spaCy to keep only the nouns in a reference noun phrase.
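The two filters in (2) could look roughly like this. Both helpers are hypothetical: the “uncertain caption” rule (drop any object phrase containing “or”) is my guess at the paper’s behavior, and `head_nouns` takes pre-tagged `(token, POS)` pairs instead of running an actual spaCy pipeline:

```python
def drop_uncertain(candidate_objects):
    """Drop uncertain candidate objects such as "fork or knife".
    Hypothetical rule: any object phrase containing "or" is removed
    from the candidate set (references are left untouched)."""
    return [o for o in candidate_objects if " or " not in f" {o} "]

def head_nouns(tagged_phrase):
    """Keep only the nouns from a reference noun phrase, given
    (token, POS) pairs as a spaCy pipeline would produce them."""
    return [tok for tok, pos in tagged_phrase if pos == "NOUN"]

print(drop_uncertain(["table", "fork or knife"]))                        # ['table']
print(head_nouns([("a", "DET"), ("wooden", "ADJ"), ("table", "NOUN")]))  # ['table']
```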
(3) Object matching: bipartite matching using SBERT similarity
The final score is the “least matching similarity”: the minimum similarity among the matched pairs.
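A minimal sketch of the matching step in (3), assuming precomputed embeddings (standing in for SBERT) and brute-force assignment instead of a proper Hungarian solver; any confidence weighting the paper applies is ignored:

```python
from itertools import permutations
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def min_match_similarity(cand_embs, ref_embs):
    """Bipartite matching that maximizes total cosine similarity
    (brute force; assumes len(cand_embs) <= len(ref_embs)), then
    returns the minimum similarity among matched pairs, i.e. the
    "least matching similarity"."""
    best = None
    for perm in permutations(range(len(ref_embs)), len(cand_embs)):
        sims = [cosine(c, ref_embs[j]) for c, j in zip(cand_embs, perm)]
        if best is None or sum(sims) > sum(best):
            best = sims
    return min(best)

# Toy 2-D embeddings: the hallucinated object's vector is far from all references,
# so it drags the minimum matched similarity down.
refs = [(1.0, 0.0), (0.0, 1.0)]
cands_ok = [(0.9, 0.1), (0.1, 0.9)]
cands_halluc = [(0.9, 0.1), (-1.0, 0.0)]
print(min_match_similarity(cands_ok, refs) > min_match_similarity(cands_halluc, refs))
# True
```

A hallucinated object ends up matched to a semantically distant reference, so taking the minimum (rather than the mean) makes a single hallucination dominate the score.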
Result
HAT
HAT is annotated directly on COCO images (400 test samples). Here CHAIRs is an accuracy (can AP and accuracy be put in the same table?).
FOIL
Performs well on noCaps-FOIL. The baseline here is 50, so I’m not sure whether the two captions are compared against each other as in CLIPScore; if so, I’m not sure it should be called AP rather than accuracy.