https://github.com/WissingChen/VLCI.
Automatic radiology report generation is essential for computer-aided diagnosis and medication guidance. Importantly, automatic radiology report generation (RRG) can relieve the heavy burden of radiologists by generating medical reports automatically from visual-linguistic data relations. However, due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports that reliably describe abnormalities. Besides, the cross-modal confounder is usually unobservable and difficult to be eliminated explicitly. In this paper, we mitigate the cross-modal data bias for RRG from a new perspective, i.e., visual-linguistic causal intervention, and propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for RRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM), to implicitly deconfound the visual-linguistic confounder by causal front-door intervention. Specifically, the VDM explores and disentangles the visual confounder from the patch-based local and global features without object detection due to the absence of universal clinic semantic extraction. Simultaneously, the LDM eliminates the linguistic confounder caused by salient visual features and high-frequency context without constructing specific dictionaries. Extensive experiments on IU-Xray and MIMIC-CXR datasets show that our VLCI outperforms the state-of-the-art RRG methods significantly. Source code and models are available at