Southern University of Science and Technology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Abstract:Zero-shot Text-To-Speech (TTS) synthesis shows great promise for personalized voice customization through voice cloning. However, current methods for achieving zero-shot TTS heavily rely on large model scales and extensive training datasets to ensure satisfactory performance and generalizability across various speakers. This raises concerns regarding both deployment costs and data security. In this paper, we present a lightweight and stable zero-shot TTS system. We introduce a novel TTS architecture designed to effectively model linguistic content and various speaker attributes from source speech and prompt speech, respectively. Furthermore, we present a two-stage self-distillation framework that constructs parallel data pairs for effectively disentangling linguistic content and speakers from the perspective of training data. Extensive experiments show that our system exhibits excellent performance and superior stability on the zero-shot TTS tasks. Moreover, it shows markedly superior computational efficiency, with RTFs of 0.13 and 0.012 on the CPU and GPU, respectively.
Abstract:Automatic modulation classification (AMC) is an effective way to deal with physical layer threats of the internet of things (IoT). However, there is often label mislabeling in practice, which significantly impacts the performance and robustness of deep neural networks (DNNs). In this paper, we propose a meta-learning guided label noise distillation method for robust AMC. Specifically, a teacher-student heterogeneous network (TSHN) framework is proposed to distill and reuse label noise. Based on the idea that labels are representations, the teacher network with trusted meta-learning divides and conquers untrusted label samples and then guides the student network to learn better by reassessing and correcting labels. Furthermore, we propose a multi-view signal (MVS) method to further improve the performance of hard-to-classify categories with few-shot trusted label samples. Extensive experimental results show that our methods can significantly improve the performance and robustness of signal AMC in various and complex label noise scenarios, which is crucial for securing IoT applications.
Abstract:Existing methods for 3D human mesh recovery always directly estimate SMPL parameters, which involve both joint rotations and shape parameters. However, these methods present rotation semantic ambiguity, rotation error accumulation, and shape estimation overfitting, which also leads to errors in the estimated pose. Additionally, these methods have not efficiently leveraged the advancements in another hot topic, human pose estimation. To address these issues, we propose a novel approach, Decomposition of 3D Rotation and Lift from 2D Joint to 3D mesh (D3L). We disentangle 3D joint rotation into bone direction and bone twist direction so that the human mesh recovery task is broken down into estimation of pose, twist, and shape, which can be handled independently. Then we design a 2D-to-3D lifting network for estimating twist direction and 3D joint position from 2D joint position sequences and introduce a nonlinear optimization method for fitting shape parameters and bone directions. Our approach can leverage human pose estimation methods, and avoid pose errors introduced by shape estimation overfitting. We conduct experiments on the Human3.6M dataset and demonstrate improved performance compared to existing methods by a large margin.