Postoperative infection is a common and serious complication that is generally difficult to diagnose. This study focuses on periprosthetic joint infection (PJI), one type of postoperative infection. X-ray examination is a routine imaging examination for patients with suspected PJI; it can evaluate the joint prosthesis and adjacent tissues and help identify the cause of pain. Laboratory examination data have high sensitivity and specificity and show significant potential for PJI diagnosis. In this study, we propose a self-supervised masked autoencoder pre-training strategy and a multimodal fusion diagnostic network, MED-NVC, which uses a CrossAttention-based feature fusion network to enable effective interaction between the features of the two modalities. We tested the proposed method on our collected PJI dataset and evaluated its performance and feasibility through comparison and ablation experiments. Our method achieved an accuracy (ACC) of 94.71% and an area under the curve (AUC) of 98.22%, outperforming the latest method while using fewer parameters. The proposed method has the potential to provide clinicians with a powerful tool for improving diagnostic accuracy and efficiency.