Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xiaohao Chen. Zhichao Wei

PEMF-VVTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm

Dec 05, 2024

Tianyu Chang, Xiaohao Chen. Zhichao Wei, Xuanpu Zhang, Qing-Guo Chen, Weihua Luo, Xun Yang

Figure 1 for PEMF-VVTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm

Figure 2 for PEMF-VVTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm

Figure 3 for PEMF-VVTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm

Figure 4 for PEMF-VVTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm

Abstract:Video Virtual Try-on aims to fluently transfer the garment image to a semantically aligned try-on area in the source person video. Previous methods leveraged the inpainting mask to remove the original garment in the source video, thus achieving accurate garment transfer on simple model videos. However, when these methods are applied to realistic video data with more complex scene changes and posture movements, the overly large and incoherent agnostic masks will destroy the essential spatial-temporal information of the original video, thereby inhibiting the fidelity and coherence of the try-on video. To alleviate this problem, we propose a novel point-enhanced mask-free video virtual try-on framework (PEMF-VVTO). Specifically, we first leverage the pre-trained mask-based try-on model to construct large-scale paired training data (pseudo-person samples). Training on these mask-free data enables our model to perceive the original spatial-temporal information while realizing accurate garment transfer. Then, based on the pre-acquired sparse frame-cloth and frame-frame point alignments, we design the point-enhanced spatial attention (PSA) and point-enhanced temporal attention (PTA) to further improve the try-on accuracy and video coherence of the mask-free model. Concretely, PSA explicitly guides the garment transfer to desirable locations through the sparse semantic alignments of video frames and cloth. PTA exploits the temporal attention on sparse point correspondences to enhance the smoothness of generated videos. Extensive qualitative and quantitative experiments clearly illustrate that our PEMF-VVTO can generate more natural and coherent try-on videos than existing state-of-the-art methods.

Via

Access Paper or Ask Questions