Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, New York, USA, Ph.D. Program in Biology and Biochemistry, The Graduate Center, The City University of New York, New York, New York, USA, Department of Computer Science, Hunter College, The City University of New York, New York, New York, USA, Helen and Robert Appel Alzheimers Disease Research Institute, Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, New York, USA
Abstract:The rapid development of large language models has brought many new smart applications; in particular, the excellent multimodal human-computer interaction in GPT-4o has given users an impressive experience. Against this background, researchers have recently proposed many multimodal LLMs that can achieve speech-to-speech dialogue. In this paper, we propose a speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be connected to the LLM while keeping the LLM frozen throughout the training process. We design 3-stage training strategies for modeling both speech input and speech output, enabling Freeze-Omni to obtain speech-to-speech dialogue ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A data on 8 GPUs. Moreover, we can effectively ensure that the intelligence of Freeze-Omni in the speech modality is at the same level as that of its backbone LLM in the text modality, while the end-to-end latency of the spoken response remains low. In addition, we design a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural dialogue style with users. Freeze-Omni mainly offers researchers a way to build multimodal LLMs on top of a frozen LLM, avoiding the catastrophic forgetting that arises when an LLM is trained with limited data and training resources.
Abstract:The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge aims to benchmark and advance zero-shot spontaneous style voice cloning, particularly focusing on generating spontaneous behaviors in conversational speech. The challenge comprises two tracks: an unconstrained track without limitation on data and model usage, and a constrained track only allowing the use of constrained open-source datasets. A 100-hour high-quality conversational speech dataset is also made available with the challenge. This paper details the data, tracks, submitted systems, evaluation results, and findings.
Abstract:This paper presents the NPU-HWC system submitted to the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge 2024 (ICAGC). Our system consists of two modules: a speech generator for Track 1 and a background audio generator for Track 2. In Track 1, we employ Single-Codec to tokenize the speech into discrete tokens and use a language-model-based approach to achieve zero-shot speaking style cloning. The Single-Codec effectively decouples timbre and speaking style at the token level, reducing the acoustic modeling burden on the autoregressive language model. Additionally, we use DSPGAN to upsample 16 kHz mel-spectrograms to high-fidelity 48 kHz waveforms. In Track 2, we propose a background audio generator based on large language models (LLMs). This system produces scene-appropriate accompaniment descriptions, synthesizes background audio with Tango 2, and integrates it with the speech generated by our Track 1 system. Our submission achieves second place in Track 1 and first place in Track 2.
Abstract:Single-cell RNA-seq (scRNA-seq) technology is a powerful tool for unraveling the complexity of biological systems. One of the essential and fundamental tasks in scRNA-seq data analysis is Cell Type Annotation (CTA). In spite of tremendous efforts in developing machine learning methods for this problem, several challenges remain. They include identifying Out-of-Domain (OOD) cell types, quantifying the uncertainty of unseen cell type annotations, and determining interpretable cell type-specific gene drivers for an OOD case. OOD cell types are often associated with therapeutic responses and disease origins, making them critical for precision medicine and early disease diagnosis. Additionally, scRNA-seq data contains tens of thousands of gene expressions. Pinpointing gene drivers underlying CTA can provide deep insight into gene regulatory mechanisms and serve as disease biomarkers. In this study, we develop a new method, eDOC, to address the aforementioned challenges. eDOC leverages a transformer architecture with evidential learning to annotate In-Domain (IND) and OOD cell types as well as to highlight genes that contribute to both IND cells and OOD cells at single-cell resolution. Rigorous experiments demonstrate that eDOC significantly improves the efficiency and effectiveness of OOD cell type and gene driver identification compared to other state-of-the-art methods. Our findings suggest that eDOC may provide new insights into single-cell biology.
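To make the role of evidential learning in OOD detection concrete, the following is a minimal sketch of the widely used Dirichlet (subjective-logic) formulation, in which a classifier outputs non-negative evidence per class and the uncertainty is the number of classes divided by the Dirichlet strength. It is an illustration under stated assumptions and not eDOC's implementation: the function names, the cell-type labels, and the `threshold` value are hypothetical, and the abstract does not say which evidential formulation or decision rule eDOC uses.

```python
from typing import List, Sequence, Tuple


def dirichlet_uncertainty(evidence: Sequence[float]) -> Tuple[List[float], float]:
    """Map per-class evidence to expected probabilities and an uncertainty mass.

    Assumes the common evidential formulation: alpha_k = evidence_k + 1,
    S = sum(alpha), p_k = alpha_k / S, uncertainty = K / S.
    """
    if not evidence or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty sequence of non-negative values")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alpha)                    # total Dirichlet strength S
    probs = [a / strength for a in alpha]    # expected class probabilities
    uncertainty = num_classes / strength     # equals 1.0 when there is no evidence
    return probs, uncertainty


def annotate(evidence: Sequence[float], labels: Sequence[str],
             threshold: float = 0.5) -> Tuple[str, float]:
    """Return (label, uncertainty); label is "OOD" above the hypothetical threshold."""
    if len(evidence) != len(labels):
        raise ValueError("evidence and labels must have the same length")
    probs, uncertainty = dirichlet_uncertainty(evidence)
    if uncertainty > threshold:
        return "OOD", uncertainty
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], uncertainty
```

A cell with no evidence for any known type gets uncertainty 1.0 and is flagged as OOD, while strong evidence for one type drives the uncertainty toward zero and yields an in-domain label.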
Abstract:Generating overtaking trajectories in high-speed scenarios presents significant challenges and is typically addressed through hierarchical planning methods. However, these methods have two primary drawbacks. First, heuristic algorithms can only provide a single initial solution, which may lead to local optima and consequently diminish the quality of the solution. Second, the time efficiency of trajectory refinement based on numerical optimization is insufficient. To overcome these limitations, this paper proposes an overtaking trajectory planning framework based on spatio-temporal topology and reachable set analysis (SROP) to improve trajectory quality and time efficiency. Specifically, this paper introduces topological classes to describe trajectories representing different overtaking behaviors, which support the spatio-temporal topological search method employed by the upper-layer planner to identify diverse initial paths. This approach helps the planner avoid local optima and enhances overall solution quality by considering multiple initial solutions from distinct topologies. Moreover, the reachable set method is integrated into the lower-layer planner for parallel trajectory evaluation. This method enhances planning efficiency by decoupling vehicle model constraints from the optimization process, enabling parallel computation while ensuring control feasibility. Simulation results show that the proposed method improves the smoothness of generated trajectories by 66.8% compared to state-of-the-art methods, highlighting its effectiveness in enhancing trajectory quality. Additionally, this method reduces computation time by 62.9%, demonstrating its efficiency.
Abstract:The development of autonomous driving has boosted the research on autonomous racing. However, existing local trajectory planning methods have difficulty planning trajectories with optimal velocity profiles at racetracks with sharp corners, thus weakening the performance of autonomous racing. To address this problem, we propose a local trajectory planning method that integrates Velocity Prediction based on Model Predictive Contour Control (VPMPCC). The optimal parameters of VPMPCC are learned through Bayesian Optimization (BO) based on a proposed novel Objective Function adapted to Racing (OFR). Specifically, VPMPCC achieves velocity prediction by encoding the racetrack as a reference velocity profile and incorporating it into the optimization problem. This method optimizes the velocity profile of local trajectories, especially at corners with significant curvature. The proposed OFR balances racing performance with vehicle safety, ensuring safe and efficient BO training. In the simulation, the number of training iterations for OFR-based BO is reduced by 42.86% compared to the state-of-the-art method. The optimal simulation-trained parameters are then applied to a real-world F1TENTH vehicle without retraining. During prolonged racing on a custom-built racetrack featuring significant sharp corners, the mean velocity of VPMPCC reaches 93.18% of the vehicle's handling limits. The released code is available at https://github.com/zhouhengli/VPMPCC.
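As a rough illustration of what encoding a racetrack as a reference velocity profile can look like, the sketch below caps the speed at each track point by a lateral-acceleration limit, v = sqrt(a_lat_max / |kappa|). This is a textbook curvature-based rule chosen for illustration and is an assumption, not the encoding used in VPMPCC; the function name and the parameters `a_lat_max` and `v_max` are hypothetical.

```python
import math
from typing import List, Sequence


def reference_velocity_profile(curvatures: Sequence[float],
                               a_lat_max: float,
                               v_max: float) -> List[float]:
    """Curvature-limited reference speed for each sampled track point.

    curvatures: signed path curvature in 1/m at each sample.
    a_lat_max:  assumed lateral acceleration limit in m/s^2.
    v_max:      assumed top speed in m/s, used on straights.
    """
    if a_lat_max <= 0.0 or v_max <= 0.0:
        raise ValueError("a_lat_max and v_max must be positive")
    profile = []
    for kappa in curvatures:
        if kappa == 0.0:
            profile.append(v_max)  # straight segment: only the top speed applies
        else:
            # v^2 * |kappa| <= a_lat_max  =>  v <= sqrt(a_lat_max / |kappa|)
            profile.append(min(v_max, math.sqrt(a_lat_max / abs(kappa))))
    return profile
```

With `a_lat_max=8.0` and `v_max=6.0`, a straight keeps the 6.0 m/s cap, while a corner of curvature 0.5 1/m is limited to 4.0 m/s, which shows how sharp corners pull the reference speed down.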
Abstract:Extreme cornering in racing often induces large side-slip angles, presenting a formidable challenge in vehicle control. To tackle this issue, this paper introduces an Active Exploration with Double GPR (AEDGPR) system. The system begins by planning a minimum-time trajectory with a Gaussian Process Regression (GPR) compensated model. The planning results show that in the cornering section, the yaw angular velocity and side-slip angle have opposite signs, indicating that the vehicle is drifting. In response, we develop a drift controller based on Model Predictive Control (MPC) and incorporate GPR to correct discrepancies in the vehicle dynamics model. Moreover, the covariance from the GPR is employed to actively explore various cornering states, aiming to minimize trajectory tracking errors. The proposed algorithm is validated through simulations on the Simulink-Carsim platform and experiments using a 1/10 scale RC vehicle.
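The drift criterion mentioned above (yaw angular velocity and side-slip angle pointing in opposite directions) can be sketched as a simple sign test over a planned trajectory. The helper names and the `min_abs_slip` threshold are hypothetical additions for illustration; the abstract states only the opposite-direction observation, not a specific detection rule.

```python
from typing import List, Sequence


def is_drifting(yaw_rate: float, side_slip: float, min_abs_slip: float = 0.0) -> bool:
    """True when yaw rate and side-slip angle have opposite signs.

    min_abs_slip is a hypothetical threshold (rad) that ignores tiny slip angles.
    """
    return yaw_rate * side_slip < 0.0 and abs(side_slip) > min_abs_slip


def drifting_indices(yaw_rates: Sequence[float],
                     side_slips: Sequence[float],
                     min_abs_slip: float = 0.0) -> List[int]:
    """Indices of trajectory samples that satisfy the drift criterion."""
    if len(yaw_rates) != len(side_slips):
        raise ValueError("yaw_rates and side_slips must have the same length")
    return [
        i
        for i, (r, beta) in enumerate(zip(yaw_rates, side_slips))
        if is_drifting(r, beta, min_abs_slip)
    ]
```

Applied to a planned minimum-time trajectory, the returned indices mark the cornering samples where a drift controller, rather than a conventional tracking controller, would be the relevant one.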
Abstract:Head-to-head racing against opponents is a challenging and emerging topic in the domain of autonomous racing. We propose Predictive Spliner, a data-driven overtaking planner that learns the behavior of opponents through Gaussian Process (GP) regression, which is then leveraged to compute viable overtaking maneuvers in future sections of the racing track. Experimentally validated on a 1:10 scale autonomous racing platform using Light Detection and Ranging (LiDAR) information to perceive the opponent, Predictive Spliner outperforms State-of-the-Art (SotA) algorithms by overtaking opponents at up to 83.1% of its own speed, being on average 8.4% faster than the previous best-performing method. Additionally, it achieves an average success rate of 84.5%, which is 47.6% higher than the previous best-performing method. The method maintains computational efficiency with a Central Processing Unit (CPU) load of 22.79% and a computation time of 8.4 ms, evaluated on a Commercial off-the-Shelf (CotS) Intel i7-1165G7, making it suitable for real-time robotic applications. These results highlight the potential of Predictive Spliner to enhance the performance and safety of autonomous racing vehicles. The code for Predictive Spliner is available at: https://github.com/ForzaETH/predictive-spliner.
Abstract:In this work, we describe our submissions for the Voice Privacy Challenge 2024. Rather than proposing a novel speech anonymization system, we enhance the provided baselines to meet all required conditions and improve evaluated metrics. Specifically, we implement emotion embedding and experiment with WavLM and ECAPA2 speaker embedders for the B3 baseline. Additionally, we compare different speaker and prosody anonymization techniques. Furthermore, we introduce Mean Reversion F0 for B5, which helps to enhance privacy without a loss in utility. Finally, we explore disentanglement models, namely $\beta$-VAE and NaturalSpeech3 FACodec.
Abstract:Zero-shot voice conversion (VC) aims to transform the source speaker timbre into an arbitrary unseen one without altering the original speech content. While recent zero-shot VC methods have shown remarkable progress, there remains considerable room for improvement in speaker similarity and speech naturalness. In this paper, we propose Takin-VC, a novel zero-shot VC framework based on joint hybrid content and memory-augmented context-aware timbre modeling to tackle this challenge. Specifically, we first present an effective hybrid content encoder, guided by neural codec training, that leverages quantized features from pre-trained WavLM and HybridFormer to extract the linguistic content of the source speech. Subsequently, we introduce an advanced cross-attention-based context-aware timbre modeling approach that learns the fine-grained, semantically associated target timbre features. To further enhance both speaker similarity and real-time performance, we utilize a conditional flow matching model to reconstruct the Mel-spectrogram of the source speech. Additionally, we propose an efficient memory-augmented module designed to generate high-quality conditional target inputs for the flow matching process, thereby improving the overall performance of the proposed system. Experimental results demonstrate that the proposed Takin-VC method surpasses state-of-the-art zero-shot VC systems, delivering superior performance in terms of both speech naturalness and speaker similarity.