ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning
Summary
ClinHallu is a benchmark for stage-wise diagnosis of hallucinations in the reasoning of medical multimodal large language models (MLLMs). It contains 7,031 validated instances, each paired with a structured reasoning trace that decomposes the process into three stages: visual recognition, knowledge recall, and reasoning integration. Stage-replacement interventions measure how correcting a single reasoning stage changes the final answer. The paper also shows that trace-supervised fine-tuning can reduce stage-wise hallucinations. The benchmark is publicly available on GitHub.
Key points
The benchmark includes 7,031 validated medical instances, each with a structured reasoning trace.
Reasoning is decomposed into three stages: visual recognition, knowledge recall, and reasoning integration.
Stage-replacement interventions quantify how correcting a hallucination at a specific stage affects the final answer.
Fine-tuning supervised on reasoning traces can reduce stage-wise hallucinations, which offers one mitigation approach.
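The stage-replacement idea in the key points can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Trace` fields, the `replace_stage` and `intervention_gain` helpers, and the `answer_fn` callable (standing in for whatever procedure turns a trace into a final answer) are all hypothetical names introduced here. The sketch also assumes that the effect of an intervention is reported as the change in final-answer accuracy, which the summary does not specify.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence, Tuple

# Stage names are assumptions based on the three stages named in the summary.
STAGES = ("visual_recognition", "knowledge_recall", "reasoning_integration")


@dataclass(frozen=True)
class Trace:
    """Hypothetical container for one three-stage reasoning trace."""
    visual_recognition: str
    knowledge_recall: str
    reasoning_integration: str


def replace_stage(model_trace: Trace, reference_trace: Trace, stage: str) -> Trace:
    """Return a copy of the model trace with one stage taken from the reference."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return replace(model_trace, **{stage: getattr(reference_trace, stage)})


def intervention_gain(
    examples: Sequence[Tuple[Trace, Trace, str]],
    answer_fn: Callable[[Trace], str],
    stage: str,
) -> float:
    """Accuracy after replacing `stage` minus accuracy before replacing it.

    Each example is (model_trace, reference_trace, gold_answer).
    `answer_fn` is a hypothetical stand-in for re-querying the model.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    before = after = 0
    for model_trace, reference_trace, gold in examples:
        before += answer_fn(model_trace) == gold
        patched = replace_stage(model_trace, reference_trace, stage)
        after += answer_fn(patched) == gold
    return (after - before) / len(examples)
```

A large positive gain for one stage would suggest that errors at that stage are what drive wrong final answers, while a gain near zero would suggest the stage is not the bottleneck. In a real pipeline the stages downstream of the replaced one would presumably be regenerated by the model; the sketch leaves that detail inside `answer_fn`.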