Radio Frequency (RF) signal-based multimodal image inpainting has recently emerged as a promising paradigm to enhance the capability of distortion-free image restoration by integrating wireless and visual information from the identical physical environment and has potential applications in fields like security and surveillance systems. In this paper, we aim to implement an RF-based image inpainting system that enables image restoration in a complex environment while maintaining high robustness and accuracy. This requires accurately converting RF signals into meaningful visual information and overcoming the challenges of RF signals in complex environments, such as multipath interference, signal attenuation, and noise. To tackle this problem, we propose Trans-Inpainter, a novel image inpainting method that utilizes the Channel State Information (CSI) of WiFi signals in combination with transformer networks to generate high-quality reconstructed images. This approach is the first to use CSI for image inpainting, which allows for extracting visual information from WiFi signals to fill in missing regions in images. To further improve Trans-Inpainter's performance, we investigate the impact of variations in CSI data on RF-based imaging ability, i.e., analyzing how the location of the CSI sensors, the combination of CSI from different sensors, and changes in temporal or frequency dimensions of CSI matrix affect the imaging quality. We compare the performance of Trans-Inpainter with RF-Inpainter, the state-of-the-art technology for RF-based multimodal image inpainting, under more realistic experimental scenarios, and with single-modality image inpainting models when only RF or image data is available, respectively. The results show that Trans-Inpainter outperforms other baseline methods in all cases.