Recompositing channel state information (CSI) from the beamforming feedback matrix (BFM), a compressed form of CSI that can be captured because it is transmitted unencrypted, is an alternative way to implement firmware-agnostic WiFi sensing. In this study, we propose using camera images to improve the accuracy of CSI recomposition from the BFM. The key motivation for this vision-aided CSI recomposition is the insight that the BFM does not contain all the spatial information needed to recomposite the CSI, and that camera images can compensate for the missing information. To leverage the camera images, we use multimodal deep learning, in which the two modalities, i.e., images and BFMs, are integrated to recomposite the CSI. We conducted experiments using IEEE 802.11ac devices. The experimental results confirmed that the proposed multimodal framework achieves higher recomposition accuracy than single-modal frameworks that use only images or only BFMs.
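
As an illustration of why a BFM alone may not determine the CSI, the following minimal sketch assumes that the feedback is derived from the right singular matrix V of the per-subcarrier channel matrix H = U S V^H, as in IEEE 802.11ac compressed beamforming feedback. The antenna counts, the random channel, and all variable names are hypothetical and are not taken from the experiments. The sketch builds two different channel matrices that share the same V, so a receiver observing only V could not tell them apart.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical antenna counts (assumption, not the experimental setup).
n_rx, n_tx = 2, 3

# Random complex channel matrix for one subcarrier (illustrative, not measured CSI).
H = rng.standard_normal((n_rx, n_tx)) + 1j * rng.standard_normal((n_rx, n_tx))

# Singular value decomposition: H = U @ diag(s) @ Vh.
U, s, Vh = np.linalg.svd(H, full_matrices=False)

# Build a second channel with the same right singular vectors (rows of Vh)
# but a different left unitary matrix and different singular values.
A = rng.standard_normal((n_rx, n_rx)) + 1j * rng.standard_normal((n_rx, n_rx))
Q, _ = np.linalg.qr(A)          # Q is a random unitary matrix
s2 = np.array([3.0, 0.5])       # arbitrary distinct singular values, descending
H2 = Q @ np.diag(s2) @ Vh

# Recompute the SVD of the second channel.
_, _, Vh2 = np.linalg.svd(H2, full_matrices=False)

# Singular vectors are unique only up to a phase, so compare the
# projectors onto the subspace they span instead of the raw vectors.
P1 = Vh.conj().T @ Vh
P2 = Vh2.conj().T @ Vh2

same_feedback_subspace = np.allclose(P1, P2)
same_channel = np.allclose(H, H2)

print("V spans the same subspace:", same_feedback_subspace)
print("Channels are identical:", same_channel)
```

In this sketch, the information carried by U and the singular values is exactly what a V-only observation lacks. This is one simplified view of the gap that an additional modality such as camera images is intended to help fill. It does not model the angle quantization applied to the feedback or the learning-based recomposition itself.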