Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wei Deng

Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

May 21, 2025

Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, Dian Jiao(+239 more)

Abstract:As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep "thinking" modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.

Via

Access Paper or Ask Questions

Optimal Stochastic Trace Estimation in Generative Modeling

Feb 26, 2025

Xinyang Liu, Hengrong Du, Wei Deng, Ruqi Zhang

Abstract:Hutchinson estimators are widely employed in training divergence-based likelihoods for diffusion models to ensure optimal transport (OT) properties. However, this estimator often suffers from high variance and scalability concerns. To address these challenges, we investigate Hutch++, an optimal stochastic trace estimator for generative models, designed to minimize training variance while maintaining transport optimality. Hutch++ is particularly effective for handling ill-conditioned matrices with large condition numbers, which commonly arise when high-dimensional data exhibits a low-dimensional structure. To mitigate the need for frequent and costly QR decompositions, we propose practical schemes that balance frequency and accuracy, backed by theoretical guarantees. Our analysis demonstrates that Hutch++ leads to generations of higher quality. Furthermore, this method exhibits effective variance reduction in various applications, including simulations, conditional time series forecasts, and image generation.

* Accepted by AISTATS 2025

Via

Access Paper or Ask Questions

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

Feb 08, 2025

Wei Deng, Siyi Zhou, Jingchen Shu, Jinchao Wang, Lu Wang

Figure 1 for IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

Figure 2 for IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

Figure 3 for IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

Figure 4 for IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

Abstract:Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become the mainstream in the industry due to their high naturalness and powerful zero-shot voice cloning capabilities.Here, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise model. We add some novel improvements. Specifically, in Chinese scenarios, we adopt a hybrid modeling method that combines characters and pinyin, making the pronunciations of polyphonic characters and long-tail characters controllable. We also performed a comparative analysis of the Vector Quantization (VQ) with Finite-Scalar Quantization (FSQ) for codebook utilization of acoustic speech tokens. To further enhance the effect and stability of voice cloning, we introduce a conformer-based speech conditional encoder and replace the speechcode decoder with BigVGAN2. Compared with XTTS, it has achieved significant improvements in naturalness, content consistency, and zero-shot voice cloning. As for the popular TTS systems in the open-source, such as Fish-Speech, CosyVoice2, FireRedTTS and F5-TTS, IndexTTS has a relatively simple training process, more controllable usage, and faster inference speed. Moreover, its performance surpasses that of these systems. Our demos are available at https://index-tts.github.io.

Via

Access Paper or Ask Questions

Variational Schrödinger Momentum Diffusion

Jan 28, 2025

Kevin Rojas, Yixin Tan, Molei Tao, Yuriy Nevmyvaka, Wei Deng

Figure 1 for Variational Schrödinger Momentum Diffusion

Figure 2 for Variational Schrödinger Momentum Diffusion

Figure 3 for Variational Schrödinger Momentum Diffusion

Figure 4 for Variational Schrödinger Momentum Diffusion

Abstract:The momentum Schr\"odinger Bridge (mSB) has emerged as a leading method for accelerating generative diffusion processes and reducing transport costs. However, the lack of simulation-free properties inevitably results in high training costs and affects scalability. To obtain a trade-off between transport properties and scalability, we introduce variational Schr\"odinger momentum diffusion (VSMD), which employs linearized forward score functions (variational scores) to eliminate the dependence on simulated forward trajectories. Our approach leverages a multivariate diffusion process with adaptively transport-optimized variational scores. Additionally, we apply a critical-damping transform to stabilize training by removing the need for score estimations for both velocity and samples. Theoretically, we prove the convergence of samples generated with optimal variational scores and momentum diffusion. Empirical results demonstrate that VSMD efficiently generates anisotropic shapes while maintaining transport efficacy, outperforming overdamped alternatives, and avoiding complex denoising processes. Our approach also scales effectively to real-world data, achieving competitive results in time series and image generation.

* AISTATS 25

Via

Access Paper or Ask Questions

Denoising Fisher Training For Neural Implicit Samplers

Nov 03, 2024

Weijian Luo, Wei Deng

Figure 1 for Denoising Fisher Training For Neural Implicit Samplers

Figure 2 for Denoising Fisher Training For Neural Implicit Samplers

Figure 3 for Denoising Fisher Training For Neural Implicit Samplers

Figure 4 for Denoising Fisher Training For Neural Implicit Samplers

Abstract:Efficient sampling from un-normalized target distributions is pivotal in scientific computing and machine learning. While neural samplers have demonstrated potential with a special emphasis on sampling efficiency, existing neural implicit samplers still have issues such as poor mode covering behavior, unstable training dynamics, and sub-optimal performances. To tackle these issues, in this paper, we introduce Denoising Fisher Training (DFT), a novel training approach for neural implicit samplers with theoretical guarantees. We frame the training problem as an objective of minimizing the Fisher divergence by deriving a tractable yet equivalent loss function, which marks a unique theoretical contribution to assessing the intractable Fisher divergences. DFT is empirically validated across diverse sampling benchmarks, including two-dimensional synthetic distribution, Bayesian logistic regression, and high-dimensional energy-based models (EBMs). Notably, in experiments with high-dimensional EBMs, our best one-step DFT neural sampler achieves results on par with MCMC methods with up to 200 sampling steps, leading to a substantially greater efficiency over 100 times higher. This result not only demonstrates the superior performance of DFT in handling complex high-dimensional sampling but also sheds light on efficient sampling methodologies across broader applications.

Via

Access Paper or Ask Questions

Recurrent Interpolants for Probabilistic Time Series Prediction

Sep 18, 2024

Yu Chen, Marin Biloš, Sarthak Mittal, Wei Deng, Kashif Rasul, Anderson Schneider

Figure 1 for Recurrent Interpolants for Probabilistic Time Series Prediction

Figure 2 for Recurrent Interpolants for Probabilistic Time Series Prediction

Figure 3 for Recurrent Interpolants for Probabilistic Time Series Prediction

Figure 4 for Recurrent Interpolants for Probabilistic Time Series Prediction

Abstract:Sequential models such as recurrent neural networks or transformer-based models became \textit{de facto} tools for multivariate time series forecasting in a probabilistic fashion, with applications to a wide range of datasets, such as finance, biology, medicine, etc. Despite their adeptness in capturing dependencies, assessing prediction uncertainty, and efficiency in training, challenges emerge in modeling high-dimensional complex distributions and cross-feature dependencies. To tackle these issues, recent works delve into generative modeling by employing diffusion or flow-based models. Notably, the integration of stochastic differential equations or probability flow successfully extends these methods to probabilistic time series imputation and forecasting. However, scalability issues necessitate a computational-friendly framework for large-scale generative model-based predictions. This work proposes a novel approach by blending the computational efficiency of recurrent neural networks with the high-quality probabilistic modeling of the diffusion model, which addresses challenges and advances generative models' application in time series forecasting. Our method relies on the foundation of stochastic interpolants and the extension to a broader conditional generation framework with additional control features, offering insights for future developments in this dynamic field.

Via

Access Paper or Ask Questions

DIffSteISR: Harnessing Diffusion Prior for Superior Real-world Stereo Image Super-Resolution

Aug 15, 2024

Yuanbo Zhou, Xinlin Zhang, Wei Deng, Tao Wang, Tao Tan, Qinquan Gao, Tong Tong

Figure 1 for DIffSteISR: Harnessing Diffusion Prior for Superior Real-world Stereo Image Super-Resolution

Figure 2 for DIffSteISR: Harnessing Diffusion Prior for Superior Real-world Stereo Image Super-Resolution

Figure 3 for DIffSteISR: Harnessing Diffusion Prior for Superior Real-world Stereo Image Super-Resolution

Figure 4 for DIffSteISR: Harnessing Diffusion Prior for Superior Real-world Stereo Image Super-Resolution

Abstract:We introduce DiffSteISR, a pioneering framework for reconstructing real-world stereo images. DiffSteISR utilizes the powerful prior knowledge embedded in pre-trained text-to-image model to efficiently recover the lost texture details in low-resolution stereo images. Specifically, DiffSteISR implements a time-aware stereo cross attention with temperature adapter (TASCATA) to guide the diffusion process, ensuring that the generated left and right views exhibit high texture consistency thereby reducing disparity error between the super-resolved images and the ground truth (GT) images. Additionally, a stereo omni attention control network (SOA ControlNet) is proposed to enhance the consistency of super-resolved images with GT images in the pixel, perceptual, and distribution space. Finally, DiffSteISR incorporates a stereo semantic extractor (SSE) to capture unique viewpoint soft semantic information and shared hard tag semantic information, thereby effectively improving the semantic accuracy and consistency of the generated left and right images. Extensive experimental results demonstrate that DiffSteISR accurately reconstructs natural and precise textures from low-resolution stereo images while maintaining a high consistency of semantic and texture between the left and right views.

Via

Access Paper or Ask Questions

Bayesian Federated Learning with Hamiltonian Monte Carlo: Algorithm and Theory

Jul 09, 2024

Jiajun Liang, Qian Zhang, Wei Deng, Qifan Song, Guang Lin

Figure 1 for Bayesian Federated Learning with Hamiltonian Monte Carlo: Algorithm and Theory

Figure 2 for Bayesian Federated Learning with Hamiltonian Monte Carlo: Algorithm and Theory

Figure 3 for Bayesian Federated Learning with Hamiltonian Monte Carlo: Algorithm and Theory

Figure 4 for Bayesian Federated Learning with Hamiltonian Monte Carlo: Algorithm and Theory

Abstract:This work introduces a novel and efficient Bayesian federated learning algorithm, namely, the Federated Averaging stochastic Hamiltonian Monte Carlo (FA-HMC), for parameter estimation and uncertainty quantification. We establish rigorous convergence guarantees of FA-HMC on non-iid distributed data sets, under the strong convexity and Hessian smoothness assumptions. Our analysis investigates the effects of parameter space dimension, noise on gradients and momentum, and the frequency of communication (between the central node and local nodes) on the convergence and communication costs of FA-HMC. Beyond that, we establish the tightness of our analysis by showing that the convergence rate cannot be improved even for continuous FA-HMC process. Moreover, extensive empirical studies demonstrate that FA-HMC outperforms the existing Federated Averaging-Langevin Monte Carlo (FA-LD) algorithm.

Via

Access Paper or Ask Questions

ASteISR: Adapting Single Image Super-resolution Pre-trained Model for Efficient Stereo Image Super-resolution

Jul 04, 2024

Yuanbo Zhou, Yuyang Xue, Wei Deng, Xinlin Zhang, Qinquan Gao, Tong Tong

Abstract:Despite advances in the paradigm of pre-training then fine-tuning in low-level vision tasks, significant challenges persist particularly regarding the increased size of pre-trained models such as memory usage and training time. Another concern often encountered is the unsatisfying results yielded when directly applying pre-trained single-image models to multi-image domain. In this paper, we propose a efficient method for transferring a pre-trained single-image super-resolution (SISR) transformer network to the domain of stereo image super-resolution (SteISR) through a parameter-efficient fine-tuning (PEFT) method. Specifically, we introduce the concept of stereo adapters and spatial adapters which are incorporated into the pre-trained SISR transformer network. Subsequently, the pre-trained SISR model is frozen, enabling us to fine-tune the adapters using stereo datasets along. By adopting this training method, we enhance the ability of the SISR model to accurately infer stereo images by 0.79dB on the Flickr1024 dataset. This method allows us to train only 4.8% of the original model parameters, achieving state-of-the-art performance on four commonly used SteISR benchmarks. Compared to the more complicated full fine-tuning approach, our method reduces training time and memory consumption by 57% and 15%, respectively.

Via

Access Paper or Ask Questions

Constrained Exploration via Reflected Replica Exchange Stochastic Gradient Langevin Dynamics

May 13, 2024

Haoyang Zheng, Hengrong Du, Qi Feng, Wei Deng, Guang Lin

Abstract:Replica exchange stochastic gradient Langevin dynamics (reSGLD) is an effective sampler for non-convex learning in large-scale datasets. However, the simulation may encounter stagnation issues when the high-temperature chain delves too deeply into the distribution tails. To tackle this issue, we propose reflected reSGLD (r2SGLD): an algorithm tailored for constrained non-convex exploration by utilizing reflection steps within a bounded domain. Theoretically, we observe that reducing the diameter of the domain enhances mixing rates, exhibiting a \emph{quadratic} behavior. Empirically, we test its performance through extensive experiments, including identifying dynamical systems with physical constraints, simulations of constrained multi-modal distributions, and image classification tasks. The theoretical and empirical findings highlight the crucial role of constrained exploration in improving the simulation efficiency.

* 28 pages, 13 figures, to appear in ICML 2024

Via

Access Paper or Ask Questions