
CHAIR (== Object HalBench)

[18’EMNLP] Object Hallucination in Image Captioning https://arxiv.org/abs/1809.02156

  • Uses COCO captions plus semantic segmentation labels; measures hallucination in captioning models by matching mentioned words to COCO object classes via a synonym list

  • The denominator of CHAIR_i is the count of all mentioned object instances; the denominator of CHAIR_s is the number of sentences (captions)
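As a concrete sketch of the two metrics (the `SYNONYMS` table here is a hypothetical toy stand-in; the real word-to-class mapping ships with the paper's released code):

```python
# Toy synonym table mapping caption words to canonical COCO object classes.
SYNONYMS = {"puppy": "dog", "dog": "dog", "kitten": "cat", "cat": "cat"}

def chair(captions, gt_objects_per_image):
    """captions: list of token lists; gt_objects_per_image: list of sets of GT classes."""
    hallucinated_mentions = 0  # object mentions not present in the image
    total_mentions = 0         # all object mentions (denominator of CHAIR_i)
    hallucinated_caps = 0      # captions containing >= 1 hallucinated object
    for tokens, gt in zip(captions, gt_objects_per_image):
        mentioned = [SYNONYMS[t] for t in tokens if t in SYNONYMS]
        bad = [obj for obj in mentioned if obj not in gt]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
        hallucinated_caps += bool(bad)
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = hallucinated_caps / max(len(captions), 1)  # denominator: # sentences
    return chair_i, chair_s
```

For example, a caption mentioning a puppy and a cat for an image containing only a dog yields CHAIR_i = 1/2 (one of two mentions hallucinated) and CHAIR_s = 1 (the one caption is hallucinated).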

  • COCO karpathy / robust test set

  • The paper's main point: even when captioning metrics such as CIDEr are high, hallucination does not decrease proportionally

  • Object HalBench (as used in RLHF-V): the LVLM is given 8 prompts to generate detailed descriptions; CHAIR is then computed against the GT segmentation labels and reported as Object HalBench

POPE

[23’EMNLP] Evaluating Object Hallucination in Large Vision-Language Models https://arxiv.org/pdf/2305.10355

  • Applies CHAIR-style object hallucination measurement to LVLMs

  • But results are unstable depending on the prompt, and the approach needs complex hand-crafted parsing rules to extract objects and match them to GT objects

  • Hence the proposal: POPE (Polling-based Object Probing Evaluation)

  • Instead of generating a caption and searching it for hallucinated objects, pose questions answerable with yes or no and measure accuracy
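The yes/no format reduces scoring to standard classification metrics. A minimal sketch (metric names and the "yes"-as-positive convention follow common practice; the exact reported metrics are the benchmark's choice):

```python
def pope_scores(preds, labels):
    """preds/labels: lists of 'yes'/'no' strings.
    Treats 'yes' as the positive class, so hallucination shows up as
    false 'yes' answers (lower precision)."""
    tp = sum(p == "yes" and l == "yes" for p, l in zip(preds, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(preds, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(preds, labels))
    acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
    prec = tp / max(tp + fp, 1)
    rec = tp / max(tp + fn, 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-9)
    return acc, prec, rec, f1
```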

  • GT labels: the object pool is enriched by pulling in semantic segmentation labels from tools like SEEM

  • Three negative sets are created:

    • random: randomly sampled object classes
    • popular: object classes that appear most frequently in the dataset
    • adversarial: object classes that most frequently co-occur with the objects actually present in the image
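The three sampling strategies can be sketched as follows (function name and data structures are hypothetical; the real benchmark derives frequency and co-occurrence statistics from its annotation pool):

```python
import random
from collections import Counter

def build_negatives(gt_objects, all_classes, freq, cooc, k=3, seed=0):
    """gt_objects: set of classes in the image; freq: Counter of class frequency
    over the dataset; cooc: dict class -> Counter of co-occurring classes.
    Returns the three negative sets, all drawn from classes absent from the image."""
    rng = random.Random(seed)
    absent = [c for c in all_classes if c not in gt_objects]
    # random: uniform sample over absent classes
    random_neg = rng.sample(absent, k)
    # popular: most frequent classes in the dataset that are absent here
    popular_neg = [c for c, _ in freq.most_common() if c not in gt_objects][:k]
    # adversarial: classes most often co-occurring with the GT objects
    scores = Counter()
    for obj in gt_objects:
        scores.update(cooc.get(obj, {}))
    adversarial_neg = [c for c, _ in scores.most_common() if c not in gt_objects][:k]
    return random_neg, popular_neg, adversarial_neg
```

Each negative class then becomes a "Is there a <class> in the image?" question whose ground-truth answer is "no".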

  • The evaluation set: 500 images from COCO, each containing at least 3 objects

  • The paper found that hallucination is most severe for 1) objects that appear frequently in COCO and 2) objects that frequently co-occur with the objects actually in the image


HallusionBench

[CVPR'24] HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models https://arxiv.org/abs/2310.14566

AMBER

[arxiv'24] AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation https://arxiv.org/abs/2311.07397


There are two task types: 1) generative and 2) discriminative. The generative task covers only object existence, while the discriminative task covers objects, attributes, and relations. Each image is annotated beforehand with all object, attribute, and relation labels that appear in it; the discriminative task then just asks yes/no questions against these annotations. The generative task parses nouns from the generated captions and scores them against the annotations, which is essentially CHAIR again. Hmm.
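The generative side can be sketched as below. This is a hypothetical simplification: `OBJECT_VOCAB` stands in for AMBER's annotated object vocabulary, and a plain word lookup replaces the POS-tagger-based noun extraction a real implementation would use.

```python
# Toy stand-in for AMBER's pre-annotated object vocabulary.
OBJECT_VOCAB = {"dog", "cat", "frisbee", "tree", "bench"}

def generative_eval(caption, gt_objects):
    """Extract object nouns from a generated caption and compare to GT annotations."""
    words = {w.strip(".,").lower() for w in caption.split()}
    mentioned = words & OBJECT_VOCAB            # object nouns the model mentioned
    hallucinated = mentioned - gt_objects       # mentions absent from the image
    chair = len(hallucinated) / max(len(mentioned), 1)       # CHAIR-style rate
    cover = len(mentioned & gt_objects) / max(len(gt_objects), 1)  # GT coverage
    return chair, cover
```

For "A dog chases a frisbee near a bench." on an image annotated {dog, frisbee, tree}, "bench" is hallucinated (CHAIR = 1/3) and two of three GT objects are covered (coverage = 2/3).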