Kaidi Wang

DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec

May 30, 2025

Peijie Chen, Wenhao Guan, Kaidi Wang, Weijie Wu, Hukai Huang, Qingyang Hong, Lin Li

Abstract: Neural speech codecs are essential for advancing text-to-speech (TTS) systems. With the recent success of large language models in text generation, developing high-quality speech tokenizers has become increasingly important. This paper introduces DS-Codec, a novel neural speech codec featuring a dual-stage training framework that switches between mirror and non-mirror architectures, designed to achieve superior speech reconstruction. We conduct extensive experiments and ablation studies to evaluate the effectiveness of our training strategy and compare the performance of the two architectures. Our results show that the mirrored structure significantly enhances the robustness of the learned codebooks, and the training strategy balances the advantages of the mirrored and non-mirrored structures, leading to improved high-fidelity speech reconstruction.

* Accepted to Interspeech 2025

Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion

May 30, 2025

Kaidi Wang, Wenhao Guan, Ziyue Jiang, Hukai Huang, Peijie Chen, Weijie Wu, Qingyang Hong, Lin Li

Abstract: Currently, zero-shot voice conversion systems are capable of synthesizing the voice of unseen speakers. However, most existing approaches struggle to accurately replicate the speaking style of the source speaker or mimic the distinctive speaking style of the target speaker, thereby limiting the controllability of voice conversion. In this work, we propose Discl-VC, a novel voice conversion framework that disentangles content and prosody information from self-supervised speech representations and synthesizes the target speaker's voice through in-context learning with a flow matching transformer. To enable precise control over the prosody of generated speech, we introduce a mask generative transformer that predicts discrete prosody tokens in a non-autoregressive manner based on prompts. Experimental results demonstrate the superior performance of Discl-VC in zero-shot voice conversion and its remarkable accuracy in prosody control for synthesized speech.

Antenna Activation and Resource Allocation in Multi-Waveguide Pinching-Antenna Systems

May 03, 2025

Kaidi Wang, Zhiguo Ding, George K. Karagiannidis

Abstract: Pinching antennas, as a novel flexible-antenna technology capable of establishing line-of-sight (LoS) connections and effectively mitigating large-scale path loss, have recently attracted considerable research interest. However, the implementation of ideal pinching-antenna systems involves determining and adjusting pinching antennas to an arbitrary position on waveguides, which presents challenges to both practical deployment and related optimization. This paper investigates a practical pinching-antenna system in multi-waveguide scenarios, where pinching antennas are installed at pre-configured discrete positions to serve downlink users with non-orthogonal multiple access (NOMA). To improve system throughput, an optimization problem is formulated by jointly considering waveguide assignment, antenna activation, successive interference cancellation (SIC) decoding order design, and power allocation. By treating waveguide assignment and antenna activation as two coalition-formation games, a novel game-theoretic algorithm is developed, in which the optimal decoding order is derived and incorporated. For power allocation, monotonic optimization and successive convex approximation (SCA) are employed to construct globally optimal and low-complexity solutions, respectively. Simulation results demonstrate that the NOMA-based pinching-antenna system exhibits superior performance compared to the considered benchmark systems, and the proposed solutions provide significant improvement in terms of sum rate and outage probability.

SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow

Apr 10, 2025

Kaidi Wang, Wenhao Guan, Shenghui Lu, Jianglong Yao, Lin Li, Qingyang Hong

Abstract: Recently, flow matching based speech synthesis has significantly enhanced the quality of synthesized speech while reducing the number of inference steps. In this paper, we introduce SlimSpeech, a lightweight and efficient speech synthesis system based on rectified flow. We build upon an existing speech synthesis method that uses the rectified flow model, modifying its structure to reduce parameters and serve as a teacher model. By refining the reflow operation, we directly derive a smaller model with a straighter sampling trajectory from the larger model, while utilizing distillation techniques to further enhance the model performance. Experimental results demonstrate that our proposed method, with significantly reduced model parameters, achieves comparable performance to larger models through one-step sampling.

Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework

Jan 16, 2025

Yushen Lin, Ruichen Zhang, Wenqi Huang, Kaidi Wang, Zhiguo Ding, Daniel K. C. So, Dusit Niyato

Abstract: In this work, we develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) specifically for wireless communication applications. The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard. By utilizing advanced language models for entity extraction and question generation, rigorous data curation processes are employed to maintain high quality and relevance. Additionally, we introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data, with performance boosts of 2.24% and 1.31% for different models compared to baselines. To demonstrate the effectiveness of the fine-tuned models with the proposed methodologies on practical tasks, we also consider different tasks, including summarizing optimization problems from technical papers and solving the mathematical problems related to non-orthogonal multiple access (NOMA), which are generated by using the proposed multi-agent framework. Simulation results show a significant performance gain of 20.9% in the ROUGE-L metric on summarization tasks. We also study the scaling laws of fine-tuning LLMs and the challenges LLMs face in the field of wireless communications, offering insights into their adaptation to wireless communication tasks. This dataset and fine-tuning methodology aim to enhance the training and evaluation of LLMs, contributing to advancements in LLMs for wireless communication research and applications.

* 13 pages, 13 figures, journal
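
The abstract above reports its summarization gain in ROUGE-L. As a reminder of what that metric measures, here is a minimal sketch of sentence-level ROUGE-L as an F-score over the longest common subsequence (LCS) of whitespace-separated tokens. The function names, the whitespace tokenization, and the default `beta=1.0` are assumptions made for illustration; the paper's exact evaluation setup is not given in the abstract.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """Sentence-level ROUGE-L F-score (assumes whitespace tokenization)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)        # share of reference tokens covered
    precision = lcs / len(cand)    # share of candidate tokens that match
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


score = rouge_l("the cat sat on the mat", "the cat is on the mat")
```

Published evaluations usually rely on an established ROUGE package with its own tokenization and stemming, so scores from this sketch are not directly comparable to reported numbers.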

Antenna Activation for NOMA Assisted Pinching-Antenna Systems

Dec 18, 2024

Kaidi Wang, Zhiguo Ding, Robert Schober

Abstract: In this letter, a non-orthogonal multiple access (NOMA) assisted downlink pinching-antenna system is investigated, where multiple pinching antennas can be activated at pre-configured positions along a dielectric waveguide to serve users via NOMA. In particular, the objective of this letter is to study at what locations and how many pinching antennas should be activated in order to maximize the system throughput. To this end, a sum rate maximization problem with antenna activation is formulated. With the help of matching theory, the formulated problem can be recast as a one-sided one-to-one matching, for which a low-complexity algorithm is developed. Simulation results indicate that the considered NOMA assisted pinching-antenna system can outperform conventional fixed-antenna systems in terms of sum rate, and the proposed matching based antenna activation algorithm yields a significant performance gain over the considered benchmarks.
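
For readers unfamiliar with the sum rate being maximized, the following is a minimal sketch of per-user rates in a generic downlink NOMA system with successive interference cancellation. It assumes users are decoded in ascending order of effective channel gain, so each user sees only the signals of stronger users as interference. The function name `noma_rates` and the scalar `gains` input are hypothetical; the letter's pinching-antenna channel model and its activation algorithm are not reproduced here.

```python
import math


def noma_rates(gains, powers, noise=1.0):
    """Per-user rates (bits/s/Hz) for downlink NOMA with SIC.

    Assumption: decoding order is ascending channel gain, so user k
    cancels all weaker users' signals and treats stronger users'
    signals as interference.
    """
    order = sorted(range(len(gains)), key=lambda k: gains[k])
    rates = [0.0] * len(gains)
    for pos, k in enumerate(order):
        # Residual interference comes from users decoded after user k.
        interference = gains[k] * sum(powers[j] for j in order[pos + 1:])
        sinr = powers[k] * gains[k] / (interference + noise)
        rates[k] = math.log2(1.0 + sinr)
    return rates


# Two users: the weak user (gain 1.0) gets more power than the strong one.
rates = noma_rates(gains=[1.0, 4.0], powers=[3.0, 1.0])
sum_rate = sum(rates)
```

In a setting like the one described, the effective gains would depend on which antennas are activated, which is what makes the activation choice part of the sum rate maximization.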

SentiXRL: An advanced large language Model Framework for Multilingual Fine-Grained Emotion Classification in Complex Text Environment

Nov 27, 2024

Jie Wang, Yichen Wang, Zhilin Zhang, Jianhao Zeng, Kaidi Wang, Zhiyang Chen

Abstract: With the strong expressive capabilities of Large Language Models (LLMs), generative models effectively capture sentiment structures and deep semantics; however, challenges remain in fine-grained sentiment classification across multilingual and complex contexts. To address this, we propose the Sentiment Cross-Lingual Recognition and Logic Framework (SentiXRL), which incorporates two modules: an emotion retrieval enhancement module to improve sentiment classification accuracy in complex contexts through historical dialogue and logical reasoning, and a self-circulating analysis negotiation mechanism (SANM) to facilitate autonomous decision-making within a single model for classification tasks. We have validated SentiXRL's superiority on multiple standard datasets, outperforming existing models on CPED and CH-SIMS, and achieving overall better performance on MELD, EmoryNLP, and IEMOCAP. Notably, we unified labels across several fine-grained sentiment annotation datasets and conducted category confusion experiments, revealing challenges and impacts of class imbalance in standard datasets.

LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation

Jun 12, 2024

Wenhao Guan, Kaidi Wang, Wangjin Zhou, Yang Wang, Feng Deng, Hui Wang, Lin Li, Qingyang Hong, Yong Qin

Abstract: Recently, the application of diffusion models has facilitated the significant development of speech and audio generation. Nevertheless, the quality of samples generated by diffusion models still needs improvement. Moreover, their effectiveness comes at the cost of a large number of sampling steps, which extends the synthesis time needed to generate high-quality audio. Previous Text-to-Audio (TTA) methods mostly used diffusion models in the latent space for audio generation. In this paper, we explore the integration of the Flow Matching (FM) model into the audio latent space for audio generation. FM is an alternative simulation-free method that trains continuous normalizing flows (CNF) by regressing vector fields. We demonstrate that our model significantly enhances the quality of generated audio samples, achieving better performance than prior models. Moreover, it reduces the number of inference steps to ten with almost no loss in performance.

* Accepted at Interspeech 2024
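
To make the "regressing vector fields" idea in the abstract above concrete, here is a minimal sketch of flow matching with a straight-line probability path, plus fixed-step Euler sampling with ten steps. The straight-line path, the function names, and the toy two-dimensional vectors are assumptions chosen for illustration; the abstract does not state which path or ODE solver LAFMA uses, and no neural network is trained here.

```python
def fm_training_pair(x0, x1, t):
    """Build one regression example for flow matching.

    Assumes the straight-line path x_t = (1 - t) * x0 + t * x1, whose
    velocity is the constant x1 - x0. A model v(x_t, t) would be trained
    to match `target` with a squared-error loss.
    """
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return xt, target


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.0, 0.0]    # stand-in for a noise sample
data = [2.0, -1.0]    # stand-in for a latent audio vector
xt, target = fm_training_pair(noise, data, t=0.25)

# With the exact (constant) velocity field, ten Euler steps reach the data.
sample = euler_sample(lambda x, t: target, noise, num_steps=10)
```

The straighter the learned trajectories, the fewer Euler steps are needed for a given accuracy, which is the intuition behind reducing the number of inference steps.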

Exploring Age-of-Information Weighting in Federated Learning under Data Heterogeneity

May 24, 2024

Kaidi Wang, Zhiguo Ding, Daniel K. C. So, Zhi Ding

Abstract: This paper investigates federated learning in a wireless communication system, where random device selection is employed with non-independent and identically distributed (non-IID) data. The analysis indicates that while training deep learning networks using federated stochastic gradient descent (FedSGD) on non-IID datasets, device selection can generate gradient errors that accumulate, leading to potential weight divergence. To mitigate training divergence, we design an age-weighted FedSGD to scale local gradients according to the previous state of devices. To further improve learning performance by increasing device participation under the maximum time consumption constraint, we formulate an energy consumption minimization problem by including resource allocation and sub-channel assignment. By transforming the resource allocation problem into a convex one and utilizing KKT conditions, we derive the optimal resource allocation solution. Moreover, this paper develops a matching based algorithm to generate the enhanced sub-channel assignment. Simulation results indicate that i) age-weighted FedSGD is able to outperform conventional FedSGD in terms of convergence rate and achievable accuracy, and ii) the proposed resource allocation and sub-channel assignment strategies can significantly reduce energy consumption and improve learning performance by increasing the number of selected devices.
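
The idea of scaling local gradients by device age can be illustrated with a small sketch. The weighting rule below (weights proportional to the number of rounds since a device last participated, normalized over the selected devices) is a hypothetical choice made for illustration only; the abstract does not specify the paper's actual age-based scaling, and the function name `age_weighted_step` is likewise an assumption.

```python
def age_weighted_step(weights, grads, ages, selected, lr=0.1):
    """One FedSGD-style update with age-proportional gradient weights.

    Hypothetical rule: a selected device's gradient is weighted by its
    age (rounds since it last participated, starting at 1), normalized
    over the selected devices. Not the paper's exact formula.
    """
    total_age = sum(ages[i] for i in selected)
    agg = [0.0] * len(weights)
    for i in selected:
        scale = ages[i] / total_age
        for d, g in enumerate(grads[i]):
            agg[d] += scale * g
    new_weights = [w - lr * a for w, a in zip(weights, agg)]
    # Selected devices reset to age 1; all other devices grow older.
    new_ages = [1 if i in selected else age + 1 for i, age in enumerate(ages)]
    return new_weights, new_ages


weights = [0.0, 0.0]
grads = {0: [1.0, 0.0], 1: [0.0, 1.0]}   # local gradients of selected devices
ages = [1, 3, 2]                          # device 1 has been idle longest
weights, ages = age_weighted_step(weights, grads, ages, selected=[0, 1])
```

Under this rule, a device that has been skipped for several rounds contributes more when it is finally selected, which is one way to compensate for the gradient error introduced by random selection on non-IID data.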

Rethinking Clustered Federated Learning in NOMA Enhanced Wireless Networks

Mar 05, 2024

Yushen Lin, Kaidi Wang, Zhiguo Ding

Abstract: This study explores the benefits of integrating the novel clustered federated learning (CFL) approach with non-orthogonal multiple access (NOMA) under non-independent and identically distributed (non-IID) datasets, where multiple devices participate in the aggregation with time limitations and a finite number of sub-channels. A detailed theoretical analysis of the generalization gap that measures the degree of non-IID in the data distribution is presented. Following that, solutions to address the challenges posed by non-IID conditions are proposed with the analysis of the properties. Specifically, users' data distributions are parameterized as concentration parameters and grouped using spectral clustering, with the Dirichlet distribution serving as the prior. The investigation into the generalization gap and convergence rate guides the design of sub-channel assignments through the matching-based algorithm, and the power allocation is achieved by Karush-Kuhn-Tucker (KKT) conditions with the derived closed-form solution. The extensive simulation results show that the proposed cluster-based FL framework can outperform FL baselines in terms of both test accuracy and convergence rate. Moreover, jointly optimizing sub-channel and power allocation in NOMA-enhanced networks can lead to a significant improvement.
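
As a rough illustration of how a Dirichlet concentration parameter controls the degree of non-IID data, the sketch below draws a per-user label distribution from a symmetric Dirichlet prior by normalizing independent Gamma draws. The function name, the symmetric prior, and the values of `alpha` are assumptions for illustration; the paper's parameterization, spectral clustering, and resource allocation steps are not shown.

```python
import random


def sample_label_distribution(num_classes, alpha, rng):
    """Draw class proportions from a symmetric Dirichlet(alpha) prior.

    Uses the standard construction: normalize independent
    Gamma(alpha, 1) draws so that they sum to one.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_classes)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
# Small alpha: each user is dominated by a few classes (strongly non-IID).
skewed = [sample_label_distribution(10, 0.5, rng) for _ in range(200)]
# Large alpha: each user is close to the uniform class mix (near IID).
balanced = [sample_label_distribution(10, 100.0, rng) for _ in range(200)]
```

Comparing the largest class share per user across the two groups shows the effect: it is much larger on average for the small concentration parameter than for the large one.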
