Title: Linguistics-Vision Monotonic Consistent Network for Sign Language Production

Dec 22, 2024

Xu Wang, Shengeng Tang, Peipei Song, Shuo Wang, Dan Guo, Richang Hong

Abstract: Sign Language Production (SLP) aims to generate sign videos corresponding to spoken language sentences, where the conversion of sign Glosses to Poses (G2P) is the key step. Because of the cross-modal semantic gap and the lack of word-action correspondence labels for strongly supervised alignment, SLP faces major challenges in linguistics-vision consistency. In this work, we propose a Transformer-based Linguistics-Vision Monotonic Consistent Network (LVMCN) for SLP, which constrains fine-grained cross-modal monotonic alignment and coarse-grained multimodal semantic consistency in language-visual cues through a Cross-modal Semantic Aligner (CSA) and a Multimodal Semantic Comparator (MSC). In the CSA, we constrain the implicit alignment between corresponding gloss and pose sequences by computing the cosine similarity association matrix between cross-modal feature sequences (i.e., the order consistency of fine-grained sign glosses and actions). In the MSC, we construct multimodal triplets from paired and unpaired samples in batch data. By pulling corresponding text-visual pairs closer and pushing non-corresponding text-visual pairs apart, we constrain the semantic co-occurrence degree between corresponding gloss and pose sequences (i.e., the semantic consistency of coarse-grained textual sentences and sign videos). Extensive experiments on the popular PHOENIX14T benchmark show that LVMCN outperforms state-of-the-art methods.
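The two ingredients named in the abstract, a cosine similarity matrix between gloss and pose feature sequences and a triplet objective over paired and unpaired samples in a batch, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `margin` value, the hinge form of the triplet loss, and the averaging are all assumptions made for illustration, and the monotonic constraint that the CSA applies to the matrix is not reproduced because the abstract does not specify it.

```python
import numpy as np


def cosine_similarity_matrix(gloss_feats, pose_feats, eps=1e-8):
    """Return an (N_gloss, N_pose) matrix of cosine similarities.

    Hypothetical helper: rows are gloss features, columns are pose features.
    """
    g = gloss_feats / (np.linalg.norm(gloss_feats, axis=1, keepdims=True) + eps)
    p = pose_feats / (np.linalg.norm(pose_feats, axis=1, keepdims=True) + eps)
    return g @ p.T


def batch_triplet_loss(text_emb, video_emb, margin=0.2):
    """Hinge-style triplet loss over a batch (an assumed, generic form).

    Row i of text_emb is paired with row i of video_emb; every other row
    in the batch is treated as a non-corresponding (negative) sample.
    """
    sim = cosine_similarity_matrix(text_emb, video_emb)  # (B, B)
    pos = np.diag(sim)[:, None]                          # paired similarities
    # Penalize any unpaired video whose similarity is within `margin`
    # of the paired video's similarity.
    hinge = np.maximum(0.0, margin + sim - pos)
    np.fill_diagonal(hinge, 0.0)                         # skip the positives
    b = sim.shape[0]
    return hinge.sum() / max(b * (b - 1), 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gloss = rng.normal(size=(5, 16))   # 5 gloss tokens, 16-dim features
    pose = rng.normal(size=(40, 16))   # 40 pose frames, 16-dim features
    print(cosine_similarity_matrix(gloss, pose).shape)

    text = rng.normal(size=(4, 16))    # 4 sentence-level embeddings
    video = rng.normal(size=(4, 16))   # 4 video-level embeddings
    print(batch_triplet_loss(text, video))
```

In this sketch the loss is zero when each sentence embedding is more similar to its own video than to any other video by at least `margin`, which mirrors the "pull closer, push apart" description in the abstract.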

* Accepted by ICASSP 2025
