CHAIR (== Object HalBench)
[18’EMNLP] Object Hallucination in Image Captioning https://arxiv.org/abs/1809.02156
Uses COCO captions & semantic segmentation labels. Measures hallucination in captioning models by matching the objects mentioned in a caption against the image's GT objects, using a synonym list to map mentions to COCO classes.
CHAIR_i = (hallucinated object mentions) / (all object mentions); CHAIR_s = (sentences with at least one hallucinated object) / (all sentences)
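The two CHAIR metrics above can be sketched as follows (a minimal illustration; function and variable names are hypothetical, and the real implementation also normalizes synonyms to canonical COCO classes before matching):

```python
def chair(captions_objects, gt_objects_per_image):
    """captions_objects: per-caption lists of mentioned objects (already
    synonym-normalized); gt_objects_per_image: matching list of GT object sets."""
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_sentences = 0
    for mentioned, gt in zip(captions_objects, gt_objects_per_image):
        total_mentions += len(mentioned)
        bad = [obj for obj in mentioned if obj not in gt]  # not in GT -> hallucinated
        hallucinated_mentions += len(bad)
        if bad:
            hallucinated_sentences += 1
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = hallucinated_sentences / max(len(captions_objects), 1)
    return chair_i, chair_s

ci, cs = chair(
    [["dog", "frisbee"], ["cat"]],   # "frisbee" is hallucinated
    [{"dog"}, {"cat"}],
)
# 1 hallucinated mention of 3 -> CHAIR_i = 1/3
# 1 caption of 2 hallucinates -> CHAIR_s = 1/2
```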
Evaluated on the COCO Karpathy split / robust test set
The paper's main point: even when captioning metrics such as CIDEr are high, hallucination is not proportionally low — caption quality and hallucination are decoupled.
Object HalBench (reported in RLHF-V): gives the LVLM 8 prompts to generate detailed descriptions, obtains the GT segmentation objects, computes CHAIR, and reports that as the benchmark score.
POPE
[23’EMNLP] Evaluating Object Hallucination in Large Vision-Language Models https://arxiv.org/pdf/2305.10355
A paper measuring object hallucination (as CHAIR does above) but in LVLMs
But CHAIR-style results vary heavily with the prompt, and they require complex hand-crafted parsing rules to extract objects from the caption and match them to GT objects.
So POPE is proposed instead:
Rather than generating a caption and searching it for hallucinated objects, ask yes/no questions about object presence and measure the answers directly (accuracy, F1).
The GT object pool is enriched beyond the human annotations by pulling in labels from automatic segmentation models such as SEEM.
Three negative sets are constructed:
- random: a randomly sampled absent object class
- popular: absent object classes that appear frequently in the dataset
- adversarial: absent object classes that frequently co-occur with the objects actually in the image
Evaluation set: 500 COCO images, each containing more than 3 GT objects.
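The three negative-sampling strategies above can be sketched like this (names and data layout are hypothetical; POPE then turns each sampled class into an "Is there a <object> in the image?" question):

```python
import random
from collections import Counter

def pope_negatives(image_objects, all_classes, class_freq, cooccur, k=3, seed=0):
    """image_objects: set of GT objects in the image.
    class_freq: Counter of how often each class appears in the dataset.
    cooccur: dict mapping a class to a Counter of its co-occurring classes."""
    rng = random.Random(seed)
    absent = [c for c in all_classes if c not in image_objects]
    # random: any absent class
    rand = rng.sample(absent, k)
    # popular: the most frequent absent classes overall
    popular = sorted(absent, key=lambda c: -class_freq[c])[:k]
    # adversarial: absent classes that co-occur most with the image's objects
    score = Counter()
    for obj in image_objects:
        score.update(cooccur.get(obj, {}))
    adversarial = sorted(absent, key=lambda c: -score[c])[:k]
    return rand, popular, adversarial

# Toy data for illustration only
classes = ["dog", "cat", "frisbee", "car", "person"]
freq = Counter({"person": 100, "car": 50, "dog": 30, "cat": 20, "frisbee": 5})
co = {"dog": Counter({"frisbee": 40, "person": 10})}
r, p, a = pope_negatives({"dog"}, classes, freq, co, k=2)
# p -> ["person", "car"] (most frequent absent classes)
# a -> ["frisbee", "person"] (co-occur most with "dog")
```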
Findings: hallucination is most severe for 1) objects that appear frequently in COCO (popular) and 2) objects that frequently co-occur with the image's actual objects (adversarial).
HallusionBench
[CVPR'24] HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models https://arxiv.org/abs/2310.14566
AMBER
[arxiv'24] AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation https://arxiv.org/abs/2311.07397
Two task types: 1) generative, 2) discriminative. Generative targets object existence only, while discriminative covers existence, attributes, and relations. Each image is annotated in advance with all its object, attribute, and relation labels, so the discriminative track reduces to yes/no questions. The generative track parses nouns from the generated caption and scores them CHAIR-style… hmm.
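Since both POPE and AMBER's discriminative track reduce to yes/no questions, the "LLM-free" scoring is just accuracy/F1 over parsed answers. A minimal sketch (hypothetical helper; real pipelines also normalize free-form answers to "yes"/"no" first):

```python
def yesno_scores(preds, gts):
    """preds/gts: parallel lists of normalized 'yes'/'no' answers."""
    tp = sum(p == "yes" and g == "yes" for p, g in zip(preds, gts))
    fp = sum(p == "yes" and g == "no" for p, g in zip(preds, gts))
    fn = sum(p == "no" and g == "yes" for p, g in zip(preds, gts))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    accuracy = sum(p == g for p, g in zip(preds, gts)) / len(preds)
    return accuracy, f1

acc, f1 = yesno_scores(["yes", "no", "yes", "no"], ["yes", "yes", "no", "no"])
# tp=1, fp=1, fn=1 -> precision = recall = 0.5, so acc = 0.5, f1 = 0.5
```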