HallE-Switch: Hallucination Existence Switch

Controlling Object Hallucination in Large Vision Language Models

UC Berkeley Stanford University University of Illinois at Urbana-Champaign
*Equal Contribution

Abstract

Current large vision-language models (LVLMs) have achieved remarkable progress, yet significant uncertainty remains about their ability to accurately apprehend visual details, that is, to perform detailed captioning.

  1. GPT-4 Assisted Evaluation. We introduce CCEval, a GPT-4 assisted evaluation method tailored for detailed captioning. Interestingly, while LVLMs demonstrate minimal object existence hallucination in existing VQA benchmarks, our proposed evaluation reveals continued susceptibility to such hallucinations.
  2. Detailed Hallucination Analysis. In this paper, we make the first attempt to investigate such hallucinations and attribute them to factors including image resolution, language decoder size, and the amount, quality, and granularity of instruction data. Our findings underscore that unwarranted inference arises when the language description includes details at a finer object granularity than what the vision module can ground or verify, thus inducing hallucination.
  3. Control Hallucination. To control such hallucinations, we further attribute the reliability of captioning to contextual knowledge (involving only contextually grounded objects) and parametric knowledge (containing inferred objects by the model). Thus, we introduce HallE-Switch, a controllable LVLM in terms of Hallucination in object Existence. HallE-Switch can condition the captioning to shift between (i) exclusively depicting contextual knowledge for grounded objects and (ii) blending it with parametric knowledge to imagine inferred objects.
  4. Open-source. We make the GPT-4 assisted evaluation set, our model, and our code base publicly available.

CCEval: CHAIR + Coverage

Existing benchmarks for VQA hallucination fall short in precisely evaluating hallucinations within detailed captions. To address this, we employ concept matching and coverage as tools for assessing hallucinations in detailed captions. Currently, CCEval is primarily geared towards evaluating object existence hallucination.

Please check out our [evaluation code].
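The two metrics can be illustrated with a minimal sketch. Note this is a simplified stand-in: the function name and the plain set matching are illustrative only, since the actual CCEval pipeline uses GPT-4 to extract and match object mentions from free-form captions.

```python
# Illustrative sketch of object-existence scoring (CHAIR + coverage).
# The real CCEval relies on GPT-4 assisted concept matching, not raw set ops.
def existence_scores(caption_objects, ground_truth_objects):
    """CHAIR: fraction of mentioned objects absent from the image.
    Coverage: fraction of ground-truth objects the caption mentions."""
    caption_objects = set(caption_objects)
    ground_truth_objects = set(ground_truth_objects)
    hallucinated = caption_objects - ground_truth_objects
    chair = len(hallucinated) / len(caption_objects) if caption_objects else 0.0
    coverage = len(caption_objects & ground_truth_objects) / len(ground_truth_objects)
    return chair, coverage

chair, cov = existence_scores(
    ["dog", "frisbee", "tree", "bench"],  # objects mentioned in the caption
    ["dog", "frisbee", "grass"],          # objects annotated in the image
)
# chair = 2/4 = 0.5 ("tree" and "bench" are hallucinated); coverage = 2/3
```

A caption can trivially avoid hallucination by mentioning almost nothing, which is why CCEval reports coverage alongside CHAIR.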

HallE-Switch: Control Imagination in LVLMs

Hallucination = Imagination!

HallE-Switch controls hallucination/imagination by one continuous parameter.

  • During training: We train a projector whose output is multiplied by a control parameter, set to 1 or -1 depending on whether the training caption contains parametric knowledge.
  • During inference: Compared to the original LLaVA, our model accepts one additional parameter, ranging from -1 to 1, to adjust the imagination level of captions.

    Please check out our [model zoo].
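The switch mechanism above can be sketched as a projector whose output is scaled by a control value. This is a hedged sketch, not the released implementation: the class and variable names are invented here, and which sign corresponds to grounded versus imaginative captioning is an assumption for illustration.

```python
import torch
import torch.nn as nn

class SwitchProjector(nn.Module):
    """Sketch of a controllable projector: a learned linear map whose output
    is multiplied by a control parameter epsilon in [-1, 1]. Names are
    illustrative, not taken from the HallE-Switch code base."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual_features, epsilon):
        # epsilon = -1: exclusively contextual (grounded) captioning;
        # epsilon = +1: blend in parametric (imagined) objects.
        # (Sign assignment is an assumption for this sketch.)
        return epsilon * self.proj(visual_features)

feats = torch.randn(1, 4, 16)          # dummy visual features
proj = SwitchProjector(16)
grounded = proj(feats, epsilon=-1.0)
imaginative = proj(feats, epsilon=1.0)
# scaling by -1 vs +1 gives outputs that are exact negatives of each other
```

Because epsilon multiplies the projector output, intermediate values between -1 and 1 interpolate the conditioning continuously at inference time, matching the single continuous control knob described above.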

BibTeX


        @misc{zhai2023halleswitch,
          title={HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption}, 
          author={Bohan Zhai and Shijia Yang and Xiangchen Zhao and Chenfeng Xu and Sheng Shen and Dongdi Zhao and Kurt Keutzer and Manling Li and Tan Yan and Xiangjun Fan},
          year={2023},
          eprint={2310.01779},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
        }
  

Acknowledgement

This website is adapted from LLaVA, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Related Links: [REACT] [GLIGEN] [Computer Vision in the Wild (CVinW)] [Instruction Tuning with GPT-4]