Full-field ultra-high-speed (UHS) x-ray imaging experiments are well established for characterizing a variety of processes and phenomena. However, the potential of UHS experiments to jointly acquire x-ray videos with distinct configurations has not been fully exploited. In this paper, we investigate a deep learning-based spatio-temporal fusion (STF) framework that fuses two complementary sequences of x-ray images and reconstructs the target image sequence with high spatial resolution, high frame rate, and high fidelity. We applied a transfer learning strategy to train the model and, on two independent x-ray datasets, compared the peak signal-to-noise ratio (PSNR), average absolute difference (AAD), and structural similarity (SSIM) of the proposed framework against those of a baseline deep learning model, a Bayesian fusion framework, and bicubic interpolation. The proposed framework outperformed the other methods across various input frame separations and image noise levels. Using 3 consecutive images from the low-resolution (LR) sequence, acquired at 4× lower spatial resolution, together with 2 images from the high-resolution (HR) sequence, acquired at a 20× lower frame rate, the proposed approach achieved average PSNRs of 37.57 dB and 35.15 dB on the two datasets, respectively. When coupled with an appropriate combination of high-speed cameras, the proposed approach will enhance the performance, and therefore the scientific value, of UHS x-ray imaging experiments.
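As a concrete illustration of the evaluation metrics reported above, the following is a minimal Python sketch computing PSNR, AAD, and SSIM for a single reconstructed frame against its ground truth. The array names, image size, noise level, and the use of scikit-image's `structural_similarity` are illustrative assumptions, not the paper's actual evaluation code.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim


def psnr(ref, est, data_range=1.0):
    """Peak signal-to-noise ratio in dB between a reference and an estimate."""
    mse = np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)
    if mse == 0:
        return np.inf
    return 10.0 * np.log10(data_range**2 / mse)


def aad(ref, est):
    """Average absolute difference between a reference and an estimate."""
    return np.mean(np.abs(ref.astype(np.float64) - est.astype(np.float64)))


# Hypothetical example: score a noisy "reconstruction" of a random HR frame.
rng = np.random.default_rng(0)
gt = rng.random((256, 256))  # ground-truth HR frame, values in [0, 1]
recon = np.clip(gt + 0.01 * rng.standard_normal(gt.shape), 0.0, 1.0)

print(f"PSNR: {psnr(gt, recon):.2f} dB")
print(f"AAD:  {aad(gt, recon):.4f}")
print(f"SSIM: {ssim(gt, recon, data_range=1.0):.4f}")
```

In an actual evaluation, `gt` would be a held-out full-resolution frame and `recon` the STF output fused from the 3 LR frames and 2 bracketing HR frames described above; details such as bit depth, normalization, and any border cropping would follow the paper's protocol.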