Feilong Bao

Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis

Jan 11, 2025

Rui Liu, Zhenqi Jia, Feilong Bao, Haizhou Li

Abstract: Conversational speech synthesis (CSS) aims to take the current dialogue (CD) history as a reference to synthesize expressive speech that aligns with the conversational style. Unlike CD, stored dialogue (SD) contains preserved dialogue fragments from earlier stages of user-agent interaction, which include style expression knowledge relevant to scenarios similar to those in CD. This knowledge plays a significant role in enabling the agent to synthesize expressive conversational speech that generates empathetic feedback. However, prior research has overlooked this aspect. To address this issue, we propose a novel Retrieval-Augmented Dialogue Knowledge Aggregation scheme for expressive CSS, termed RADKA-CSS, which includes three main components: 1) To effectively retrieve dialogues from SD that are similar to CD in both semantics and style, we first build a stored dialogue semantic-style database (SDSSD) that includes the text and audio samples. We then design a multi-attribute retrieval scheme that matches the dialogue semantic and style vectors of the CD against the stored dialogue semantic and style vectors in the SDSSD, retrieving the most similar dialogues. 2) To effectively utilize the style knowledge from CD and SD, we adopt a multi-granularity graph structure to encode the dialogue and introduce a multi-source style knowledge aggregation mechanism. 3) Finally, the aggregated style knowledge is fed into the speech synthesizer to help the agent synthesize expressive speech that aligns with the conversational style. We conducted comprehensive experiments on the DailyTalk dataset, a benchmark dataset for the CSS task. Both objective and subjective evaluations demonstrate that RADKA-CSS outperforms baseline models in expressiveness rendering. Code and audio samples can be found at: https://github.com/Coder-jzq/RADKA-CSS.

* Accepted by Information Fusion 2025
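
The multi-attribute retrieval step described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the cosine scoring, the mixing weight `alpha`, the `retrieve` function, and the toy database entries are all assumptions chosen for illustration. It ranks stored dialogues by a weighted sum of semantic and style similarity to the current dialogue.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_sem, query_sty, database, k=1, alpha=0.5):
    """Return the ids of the k stored dialogues most similar to the query.

    alpha is a hypothetical mixing weight between semantic and style
    similarity; it is not a value taken from the paper.
    """
    scored = []
    for entry in database:
        score = (alpha * cosine(query_sem, entry["sem"])
                 + (1 - alpha) * cosine(query_sty, entry["sty"]))
        scored.append((score, entry["id"]))
    scored.sort(reverse=True)  # highest combined similarity first
    return [dialogue_id for _, dialogue_id in scored[:k]]

# Toy stored-dialogue database with 2-dimensional semantic and style vectors.
database = [
    {"id": "sd_a", "sem": [1.0, 0.0], "sty": [0.0, 1.0]},
    {"id": "sd_b", "sem": [0.9, 0.1], "sty": [1.0, 0.0]},
    {"id": "sd_c", "sem": [0.0, 1.0], "sty": [0.5, 0.5]},
]
top = retrieve([1.0, 0.0], [1.0, 0.0], database, k=2)
```

Here `sd_b` ranks first because it is close to the query in both attributes, while `sd_a` matches only in semantics.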

Unifying Dual-Space Embedding for Entity Alignment via Contrastive Learning

Dec 06, 2024

Cunda Wang, Weihua Wang, Qiuyu Liang, Feilong Bao, Guanglai Gao

Abstract: Entity alignment aims to match identical entities across different knowledge graphs (KGs). Graph neural network-based entity alignment methods have achieved promising results in Euclidean space. However, KGs often contain complex structures, including both local and hierarchical ones, which make it challenging to efficiently represent them within a single space. In this paper, we propose a novel method, UniEA, which unifies dual-space embedding to preserve the intrinsic structure of KGs. Specifically, we learn graph structure embeddings in both Euclidean and hyperbolic spaces simultaneously to maximize the consistency between the embeddings in both spaces. Moreover, we employ contrastive learning to mitigate the misalignment issues caused by similar entities, where the embeddings of similar neighboring entities within the KG become too close in distance. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in structure-based EA. Our code is available at https://github.com/wonderCS1213/UniEA.

* Accepted by COLING 2025
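
As a rough illustration of how a contrastive objective can pull two views of the same entity together while pushing apart different entities, the sketch below computes an InfoNCE-style loss. It is an assumption-laden toy, not UniEA's loss: the function name `info_nce`, the dot-product similarity, and the temperature value are hypothetical, and both views are assumed to be already expressed as comparable vectors.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss where row i of view_a should match row i of view_b."""
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [dot(anchor, other) / temperature for other in view_b]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        # Negative log-probability of picking the matching row.
        total += log_denominator - logits[i]
    return total / len(view_a)

# Two entities, each embedded in two views (for example, two spaces).
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_consistent = info_nce(aligned, aligned)
loss_inconsistent = info_nce(aligned, swapped)
```

The loss is small when each entity's two views agree and large when they are swapped, which is the behavior a consistency objective relies on.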

Distance-Adaptive Quaternion Knowledge Graph Embedding with Bidirectional Rotation

Dec 05, 2024

Weihua Wang, Qiuyu Liang, Feilong Bao, Guanglai Gao

Abstract: A quaternion contains one real part and three imaginary parts, which provides a more expressive hypercomplex space for learning knowledge graphs. Existing quaternion embedding models measure the plausibility of a triplet through either semantic matching or geometric distance scoring functions. However, it appears that semantic matching diminishes the separability of entities, while the distance scoring function weakens the semantics of entities. To address this issue, we propose a novel quaternion knowledge graph embedding model. Our model combines semantic matching with the geometric distance between entities to better measure the plausibility of triplets. Specifically, in the quaternion space, we perform a right rotation on the head entity and a reverse rotation on the tail entity to learn rich semantic features. Then, we utilize distance-adaptive translations to learn the geometric distance between entities. Furthermore, we provide mathematical proofs to demonstrate that our model can handle complex logical relationships. Extensive experimental results and analyses show that our model significantly outperforms previous models on well-known knowledge graph completion benchmark datasets. Our code is available at https://github.com/llqy123/DaBR.

* Accepted by COLING 2025
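
Quaternion rotation rests on the Hamilton product. The sketch below shows that product and one plausible reading of "right rotation" and "reverse rotation": right-multiplying by a unit quaternion and by its conjugate. That reading, and the helper names, are assumptions for illustration; the paper's exact formulation may differ.

```python
import math

def hamilton(p, q):
    """Hamilton product of quaternions written as (a, b, c, d) = a + bi + cj + dk."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )

def normalize(q):
    """Scale a quaternion to unit norm so that multiplying by it is a pure rotation."""
    n = math.sqrt(sum(x * x for x in q))
    return tuple(x / n for x in q)

def conjugate(q):
    """Conjugate of a quaternion; for a unit quaternion this is its inverse."""
    a, b, c, d = q
    return (a, -b, -c, -d)

head = (1.0, 2.0, 3.0, 4.0)
relation = normalize((1.0, 1.0, 0.0, 0.0))

rotated = hamilton(head, relation)                  # right rotation
restored = hamilton(rotated, conjugate(relation))   # rotation in the reverse direction
```

Because the relation quaternion has unit norm, the rotation preserves the norm of `head`, and applying the conjugate undoes it.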

Fully Hyperbolic Rotation for Knowledge Graph Embedding

Nov 07, 2024

Qiuyu Liang, Weihua Wang, Feilong Bao, Guanglai Gao

Abstract: Hyperbolic rotation is commonly used to effectively model knowledge graphs and their inherent hierarchies. However, existing hyperbolic rotation models rely on logarithmic and exponential mappings for feature transformation. These models only project data features into hyperbolic space for rotation, limiting their ability to fully exploit the hyperbolic space. To address this problem, we propose a novel fully hyperbolic model designed for knowledge graph embedding. Instead of feature mappings, we define the model directly in hyperbolic space with the Lorentz model. Our model considers each relation in knowledge graphs as a Lorentz rotation from the head entity to the tail entity. We adopt the Lorentzian distance as the scoring function for measuring the plausibility of triplets. Extensive results on standard knowledge graph completion benchmarks demonstrate that our model achieves competitive results with fewer parameters. In addition, our model achieves state-of-the-art performance on the CoDEx-s and CoDEx-m datasets, which are more diverse and challenging than earlier benchmarks. Our code is available at https://github.com/llqy123/FHRE.

* Accepted by ECAI 2024
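
To make the idea of rotating directly on the hyperboloid concrete, here is a minimal sketch in the Lorentz model with curvature -1 and two spatial dimensions. The function names, the fixed curvature, the single rotation angle, and the use of the negative squared Lorentzian distance as a score are assumptions for illustration, not the paper's parameterization.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of the spatial products."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift(spatial):
    """Place a spatial vector on the unit hyperboloid, where <x, x>_L = -1."""
    return [math.sqrt(1.0 + sum(s * s for s in spatial))] + list(spatial)

def rotate(x, theta):
    """Rotate the two spatial coordinates by theta; the time coordinate is unchanged."""
    t, a, b = x
    return [
        t,
        a * math.cos(theta) - b * math.sin(theta),
        a * math.sin(theta) + b * math.cos(theta),
    ]

def squared_lorentz_distance(x, y):
    """Squared Lorentzian distance for curvature -1: -2 - 2 * <x, y>_L."""
    return -2.0 - 2.0 * lorentz_inner(x, y)

def score(head, theta, tail):
    """Higher is more plausible: negative distance between rotated head and tail."""
    return -squared_lorentz_distance(rotate(head, theta), tail)

head = lift([0.3, 0.4])
rotated = rotate(head, 0.7)
near_tail = rotated
far_tail = lift([2.0, -1.0])
```

A spatial rotation keeps the point on the hyperboloid, so no exponential or logarithmic map is needed before or after it.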

MCDubber: Multimodal Context-Aware Expressive Video Dubbing

Aug 21, 2024

Yuan Zhao, Zhenqi Jia, Rui Liu, De Hu, Feilong Bao, Guanglai Gao

Abstract: Automatic Video Dubbing (AVD) aims to take the given script and generate speech that aligns with lip motion and prosody expressiveness. Current AVD models mainly utilize visual information of the current sentence to enhance the prosody of synthesized speech. However, it is crucial to consider whether the prosody of the generated dubbing aligns with the multimodal context, as the dubbing will be combined with the original context in the final video. This aspect has been overlooked in previous studies. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed MCDubber, which shifts the modeling object from a single sentence to a longer sequence with context information to ensure the consistency of the global context prosody. MCDubber comprises three main components: (1) a context duration aligner, which learns the context-aware alignment between the text and lip frames; (2) a context prosody predictor, which reads the global context visual sequence and predicts the context-aware global energy and pitch; and (3) a context acoustic decoder, which predicts the global context mel-spectrogram with the assistance of adjacent ground-truth mel-spectrograms of the target sentence. Through this process, MCDubber fully considers the influence of multimodal context on the prosody expressiveness of the current sentence when dubbing. The mel-spectrogram belonging to the target sentence is then extracted from the output context mel-spectrograms and serves as the final dubbing audio. Extensive experiments on the Chem benchmark dataset demonstrate that MCDubber significantly improves dubbing expressiveness compared to all advanced baselines. The code and demos are available at https://github.com/XiaoYuanJun-zy/MCDubber.
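
The last step in the abstract, taking the target sentence's frames out of the context mel-spectrogram, amounts to a slice once per-sentence frame counts are known. The sketch below assumes the context mel-spectrogram is a list of frames and that frame counts per sentence are available; `extract_target` and its arguments are hypothetical names, not part of the released code.

```python
def extract_target(context_mel, sentence_lengths, target_index):
    """Return the frames of one sentence from a concatenated context mel-spectrogram.

    context_mel: list of frames for previous, target, and following sentences.
    sentence_lengths: number of frames per sentence, in order.
    target_index: position of the target sentence in sentence_lengths.
    """
    if sum(sentence_lengths) != len(context_mel):
        raise ValueError("sentence lengths must cover the whole context")
    start = sum(sentence_lengths[:target_index])
    end = start + sentence_lengths[target_index]
    return context_mel[start:end]

# Nine one-dimensional toy frames: 3 for the previous sentence,
# 4 for the target sentence, 2 for the following sentence.
context_mel = [[float(i)] for i in range(9)]
target_mel = extract_target(context_mel, [3, 4, 2], target_index=1)
```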

L$^2$GC: Lorentzian Linear Graph Convolutional Networks For Node Classification

Mar 10, 2024

Qiuyu Liang, Weihua Wang, Feilong Bao, Guanglai Gao

Abstract: Linear Graph Convolutional Networks (GCNs) are used to classify nodes in graph data. However, we note that most existing linear GCN models perform neural network operations in Euclidean space, which does not explicitly capture the tree-like hierarchical structure exhibited in real-world datasets that are modeled as graphs. In this paper, we attempt to introduce hyperbolic space into linear GCN and propose a novel framework for Lorentzian linear GCN. Specifically, we map the learned features of graph nodes into hyperbolic space, and then perform a Lorentzian linear feature transformation to capture the underlying tree-like structure of data. Experimental results on standard citation network datasets with semi-supervised learning show that our approach yields new state-of-the-art accuracy of 74.7% on Citeseer and 81.3% on PubMed. Furthermore, we observe that our approach can be trained up to two orders of magnitude faster than other nonlinear GCN models on the PubMed dataset. Our code is publicly available at https://github.com/llqy123/LLGC-master.

* Accepted by LREC-COLING 2024
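
Mapping Euclidean features into hyperbolic space is commonly done with the exponential map at the origin of the Lorentz model. The sketch below shows that standard map and its inverse for curvature -1; whether L²GC uses exactly this form is an assumption, and the function names are illustrative.

```python
import math

def exp_map_origin(v):
    """Map a tangent vector at the origin onto the unit hyperboloid."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v)  # the origin of the hyperboloid
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * c for c in v]

def log_map_origin(x):
    """Inverse of exp_map_origin: map a hyperboloid point back to the tangent space."""
    spatial = x[1:]
    spatial_norm = math.sqrt(sum(c * c for c in spatial))
    if spatial_norm == 0.0:
        return [0.0] * len(spatial)
    distance = math.acosh(max(x[0], 1.0))  # guard against rounding below 1
    return [distance * c / spatial_norm for c in spatial]

features = [0.5, -1.2, 0.3]           # toy Euclidean node features
point = exp_map_origin(features)      # point on the hyperboloid
recovered = log_map_origin(point)     # back in the tangent space
constraint = -point[0] ** 2 + sum(c * c for c in point[1:])
```

The value of `constraint` should be close to -1, which confirms the mapped point lies on the hyperboloid.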

MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline

Sep 22, 2022

Yifan Hu, Pengkai Yin, Rui Liu, Feilong Bao, Guanglai Gao

Abstract: This paper introduces a high-quality open-source text-to-speech (TTS) synthesis dataset for Mongolian, a low-resource language spoken by over 10 million people worldwide. The dataset, named MnTTS, consists of about 8 hours of transcribed audio recordings spoken by a 22-year-old professional female Mongolian announcer. It is the first publicly available dataset developed to promote Mongolian TTS applications in both academia and industry. In this paper, we share our experience by describing the dataset development procedures and the challenges we faced. To demonstrate the reliability of our dataset, we built a powerful non-autoregressive baseline system based on the FastSpeech2 model and the HiFi-GAN vocoder, and evaluated it using the subjective mean opinion score (MOS) and real time factor (RTF) metrics. Evaluation results show that the baseline system trained on our dataset achieves a MOS above 4 and an RTF of about $3.30\times10^{-1}$, which makes it applicable for practical use. The dataset, training recipe, and pretrained TTS models are freely available at https://github.com/walker-hyf/MnTTS.

* Accepted at the 2022 International Conference on Asian Language Processing (IALP2022)
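
For readers unfamiliar with the two metrics, the sketch below shows how they are usually computed: RTF is synthesis time divided by the duration of the generated audio, and MOS is the mean of listener ratings. The timings and ratings are made-up illustration values, not measurements from the paper.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Time spent synthesizing divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

def mean_opinion_score(ratings):
    """Average of listener ratings, typically on a 1 to 5 scale."""
    return sum(ratings) / len(ratings)

# Hypothetical example: 3.3 s of compute for 10 s of speech gives an RTF of
# about 0.33, the same order as the 3.30e-1 reported in the abstract.
rtf = real_time_factor(3.3, 10.0)
mos = mean_opinion_score([4, 5, 4, 4, 5])
```

An RTF below 1 means the system generates speech faster than it takes to play it back.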

Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

Aug 11, 2020

Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

Abstract: Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training, which optimizes the system to predict both Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.

* To appear in IEEE Signal Processing Letters (SPL)
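
A multi-task objective of this kind is typically a weighted sum of the two task losses. The sketch below assumes a mean squared error on Mel values, a binary cross-entropy on per-token break predictions, and a hypothetical `weight` hyperparameter; none of these choices is confirmed by the abstract.

```python
import math

def mel_loss(predicted, target):
    """Mean squared error over flattened Mel spectrum values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def phrase_break_loss(break_probs, break_labels):
    """Binary cross-entropy; break_probs[i] is the predicted probability
    of a phrase break after token i, break_labels[i] is 0 or 1."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for p, y in zip(break_probs, break_labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(break_labels)

def multitask_loss(pred_mel, true_mel, break_probs, break_labels, weight=1.0):
    """Mel loss plus a weighted phrase-break loss (weight is hypothetical)."""
    return (mel_loss(pred_mel, true_mel)
            + weight * phrase_break_loss(break_probs, break_labels))

confident = multitask_loss([0.5, 1.0], [0.5, 1.0], [0.99, 0.01], [1, 0])
uncertain = multitask_loss([0.5, 1.0], [0.5, 1.0], [0.6, 0.4], [1, 0])
```

With identical Mel predictions, the total loss is lower when the break predictions are more confident and correct, so the shared model is pushed to learn phrasing as well as spectra.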

WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss

Feb 02, 2020

Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

Abstract: Tacotron-based text-to-speech (TTS) systems directly synthesize speech from text input. Such frameworks typically consist of a feature prediction network that maps character sequences to frequency-domain acoustic features, followed by a waveform reconstruction algorithm or a neural vocoder that generates the time-domain waveform from acoustic features. As the loss function is usually calculated only for frequency-domain acoustic features, it does not directly control the quality of the generated time-domain waveform. To address this problem, we propose a new training scheme for Tacotron-based TTS, referred to as WaveTTS, that has two loss functions: 1) a time-domain loss, denoted as the waveform loss, that measures the distortion between the natural and generated waveforms; and 2) a frequency-domain loss that measures the Mel-scale acoustic feature loss between the natural and generated acoustic features. WaveTTS ensures both the quality of the acoustic features and the resulting speech waveform. To the best of our knowledge, this is the first implementation of Tacotron with joint time-frequency domain loss. Experimental results show that the proposed framework outperforms the baselines and achieves high-quality synthesized speech.

* Submitted to Odyssey 2020, Tokyo, Japan
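
The joint objective can be sketched as a frequency-domain term plus a weighted time-domain term. In the toy below, the mean absolute error on waveform samples, the mean squared error on Mel features, and the `time_weight` parameter are all assumptions; the abstract does not specify the distortion measures used.

```python
def mean_abs_error(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def mean_squared_error(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def joint_loss(natural_wave, generated_wave, natural_mel, generated_mel,
               time_weight=1.0):
    """Frequency-domain loss on Mel features plus weighted time-domain loss."""
    time_loss = mean_abs_error(generated_wave, natural_wave)
    frequency_loss = mean_squared_error(generated_mel, natural_mel)
    return frequency_loss + time_weight * time_loss

# The Mel features match exactly here, yet the waveform still differs,
# so only the time-domain term contributes to the loss.
loss = joint_loss(
    natural_wave=[0.0, 1.0], generated_wave=[0.0, 0.5],
    natural_mel=[1.0, 2.0], generated_mel=[1.0, 2.0],
)
```

The example shows why a frequency-only loss can miss waveform errors: it would report zero here, while the joint loss does not.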

Teacher-Student Training for Robust Tacotron-based TTS

Nov 07, 2019

Rui Liu, Berrak Sisman, Jingdong Li, Feilong Bao, Guanglai Gao, Haizhou Li

Abstract: While neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways, the exposure bias problem in autoregressive models remains an issue to be resolved. The exposure bias problem arises from the mismatch between the training and inference processes, which results in unpredictable performance for out-of-domain test data at run-time. To overcome this, we propose a teacher-student training scheme for Tacotron-based TTS by introducing a distillation loss function in addition to the feature loss function. We first train a Tacotron2-based TTS model by always providing natural speech frames to the decoder; this model serves as the teacher. We then train another Tacotron2-based model as the student, whose decoder takes the predicted speech frames as input, similar to how the decoder works during run-time inference. With the distillation loss, the student model learns the output probabilities from the teacher model, a process known as knowledge distillation. Experiments show that our proposed training scheme consistently improves the voice quality for out-of-domain test data in both Chinese and English systems.

* Submitted to ICASSP 2020, Barcelona, Spain
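
The difference between the teacher's and the student's decoder input, and the way the two loss terms combine, can be shown with a toy one-value-per-frame decoder. Everything here is an assumption for illustration: the scalar frames, the `step` function standing in for a decoder, and the use of mean squared error for both the feature loss and the distillation loss.

```python
def decode(step, natural_frames, teacher_forcing):
    """Run a frame-by-frame decoder over a sequence of scalar frames.

    With teacher_forcing=True the next input is the natural frame (teacher);
    otherwise it is the decoder's own previous prediction (student).
    """
    outputs = []
    previous = 0.0  # initial "go" frame
    for natural in natural_frames:
        predicted = step(previous)
        outputs.append(predicted)
        previous = natural if teacher_forcing else predicted
    return outputs

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def student_loss(student_out, teacher_out, natural_frames, weight=1.0):
    """Feature loss against natural frames plus weighted distillation loss
    against the teacher's outputs (weight is hypothetical)."""
    return mse(student_out, natural_frames) + weight * mse(student_out, teacher_out)

natural = [1.0, 2.0, 3.0]
toy_step = lambda prev: 0.5 * prev + 1.0  # stand-in for a trained decoder step

teacher_out = decode(toy_step, natural, teacher_forcing=True)
student_out = decode(toy_step, natural, teacher_forcing=False)
loss = student_loss(student_out, teacher_out, natural)
```

The two output sequences diverge from the third frame onward, because the student conditions on its own earlier prediction rather than on the natural frame, which is the mismatch the exposure bias discussion refers to.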
