Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study
Abstract: The robustness of Medical Vision-Language Models (VLMs) in non-English clinical environments remains largely unexamined, as these systems are predominantly assessed using English-based radiology visual question answering (VQA) benchmarks. To address this gap, we present IndoRad-VQA, an Indonesian variant of the VQA-RAD dataset, designed to test whether medical VLMs maintain their radiological reasoning capabilities when questions are posed in Bahasa Indonesia. To ensure the preservation of clinical meaning, terminology consistency, and answer equivalence, radiology question-answer pairs were translated into Indonesian, with self-evaluation employed as a quality control mechanism. Our study evaluates a range of models—including general-purpose, Southeast Asian multilingual, and medical-specific VLMs—under both English and Indonesian prompting conditions. In addition to measuring accuracy, we quantify the "language robustness gap" between the two languages and perform an error analysis to pinpoint specific failure modes, such as yes/no flips, laterality errors, and mismatches in output language. Our results indicate that high performance on English medical VQA benchmarks does not guarantee reliable behavior in Indonesian clinical settings. Depending on the evaluation metric, we observed a performance disparity ranging from 8 to 25 percent between English and Indonesian inputs. These findings underscore the necessity for more inclusive, multilingual evaluations of medical multimodal foundation models. The dataset is accessible at https://huggingface.co/datasets/Lab-IS/IndoRad-VQA.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
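
The abstract mentions a "language robustness gap" and yes/no flips but does not give formulas for either. The sketch below is one plausible reading, not the authors' implementation. It assumes the gap is English accuracy minus Indonesian accuracy under exact-match scoring, and that a flip is a question whose English and Indonesian answers are opposite yes/no values. The `Prediction` record, the `normalize` helper, and the "ya"/"tidak" mapping are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    question_id: str
    gold: str     # reference answer (assumed comparable after normalization)
    pred_en: str  # model answer under English prompting
    pred_id: str  # model answer under Indonesian prompting


# Assumed mapping so Indonesian yes/no answers compare against English ones.
YES_NO_ID_TO_EN = {"ya": "yes", "tidak": "no"}


def normalize(answer: str) -> str:
    """Lowercase, trim, drop a trailing period, and map ya/tidak to yes/no."""
    cleaned = answer.strip().lower().rstrip(".")
    return YES_NO_ID_TO_EN.get(cleaned, cleaned)


def accuracy(preds: list[Prediction], field: str) -> float:
    """Exact-match accuracy of the given prediction field against gold."""
    if not preds:
        return 0.0
    correct = sum(
        normalize(getattr(p, field)) == normalize(p.gold) for p in preds
    )
    return correct / len(preds)


def language_gap(preds: list[Prediction]) -> float:
    """Assumed definition: English accuracy minus Indonesian accuracy."""
    return accuracy(preds, "pred_en") - accuracy(preds, "pred_id")


def yes_no_flips(preds: list[Prediction]) -> list[str]:
    """IDs of questions answered 'yes' in one language and 'no' in the other."""
    flips = []
    for p in preds:
        if {normalize(p.pred_en), normalize(p.pred_id)} == {"yes", "no"}:
            flips.append(p.question_id)
    return flips


# Toy records, invented purely to exercise the functions above.
toy = [
    Prediction("q1", "yes", "Yes", "Tidak"),
    Prediction("q2", "no", "no", "tidak"),
    Prediction("q3", "left", "left", "kanan"),
    Prediction("q4", "yes", "no", "ya"),
]
gap = language_gap(toy)
flipped = yes_no_flips(toy)
```

Exact match is the simplest scoring choice here. Since the abstract notes that the measured disparity depends on the evaluation metric, a different scorer would likely yield a different gap on the same predictions.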





