Mahesh Kumar Nandwana

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

Nov 15, 2024

Joseph Liu, Joshua Geddes, Ziyu Guo, Haomiao Jiang, Mahesh Kumar Nandwana

Abstract: Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource-intensive attention and feed-forward modules. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer-wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves an 8% to 71% speedup while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden the accessibility of powerful DiT models.
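
The caching idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the linked repository for that). The names `relative_l1`, `build_schedule`, and `run_layer`, the relative L1 error metric, and the single fixed threshold are all assumptions made for illustration. The sketch derives a per-timestep compute-or-reuse schedule from calibration outputs, then applies it at inference time.

```python
from typing import Callable, List, Sequence, Tuple


def relative_l1(a: Sequence[float], b: Sequence[float]) -> float:
    """Relative L1 difference between two layer outputs (assumed error metric)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(x) for x in a)
    return num / den if den else 0.0


def build_schedule(calib_outputs: List[Sequence[float]], threshold: float) -> List[bool]:
    """Return one flag per timestep: True = recompute the layer, False = reuse the cache.

    calib_outputs[t] stands for the layer's output at timestep t, averaged over
    a small calibration set (hypothetical input format).
    """
    schedule = [True]  # the first timestep always computes
    last_computed = 0
    for t in range(1, len(calib_outputs)):
        err = relative_l1(calib_outputs[last_computed], calib_outputs[t])
        if err < threshold:
            schedule.append(False)  # output barely changed: reuse cached value
        else:
            schedule.append(True)   # output drifted too far: recompute
            last_computed = t
    return schedule


def run_layer(
    layer: Callable[[float], float],
    inputs: Sequence[float],
    schedule: Sequence[bool],
) -> Tuple[List[float], int]:
    """Apply the schedule at inference time; returns outputs and number of layer calls."""
    cache = None
    outputs: List[float] = []
    calls = 0
    for x, recompute in zip(inputs, schedule):
        if recompute or cache is None:
            cache = layer(x)
            calls += 1
        outputs.append(cache)
    return outputs, calls


# Toy calibration data: steps 0-1 are similar, as are steps 2-3.
calib = [[1.0, 1.0], [1.01, 1.0], [1.5, 1.5], [1.52, 1.5]]
schedule = build_schedule(calib, threshold=0.05)
outputs, calls = run_layer(lambda x: x * 2, [1, 2, 3, 4], schedule)
```

In this toy run the layer is evaluated at two of the four timesteps. In a real DiT the threshold would trade speed against generation quality.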

* Code can be found at https://github.com/Roblox/SmoothCache

Enhancing Multilingual Voice Toxicity Detection with Speech-Text Alignment

Jun 14, 2024

Joseph Liu, Mahesh Kumar Nandwana, Janne Pylkkönen, Hannes Heikinheimo, Morgan McGuire

Abstract: Toxicity classification for voice heavily relies on the semantic content of speech. We propose a novel framework that utilizes cross-modal learning to integrate the semantic embedding of text into a multilabel speech toxicity classifier during training. This enables us to incorporate textual information during training while still requiring only audio during inference. We evaluate this classifier on large-scale datasets with real-world characteristics to validate the effectiveness of this framework. Through ablation studies, we demonstrate that general-purpose semantic text embeddings are rich and aligned with speech for toxicity classification purposes. Conducting experiments across multiple languages at scale, we show improvements in voice toxicity classification across five languages and different toxicity categories.
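
One way to picture "text during training, audio only at inference" is a training objective that adds an alignment term to the classification loss. The abstract does not state the loss actually used, so this Python sketch is an assumption throughout. It combines multilabel binary cross-entropy with a cosine-distance term between a speech embedding and a text embedding. The function names and the `weight` parameter are hypothetical.

```python
import math
from typing import Sequence


def multilabel_bce(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy over independent toxicity labels (stable form)."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 when the two embeddings point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def training_loss(
    logits: Sequence[float],
    labels: Sequence[int],
    speech_emb: Sequence[float],
    text_emb: Sequence[float],
    weight: float = 1.0,
) -> float:
    """Classification loss plus an alignment term that needs text (training only)."""
    return multilabel_bce(logits, labels) + weight * cosine_distance(speech_emb, text_emb)


def inference_scores(logits: Sequence[float]) -> list:
    """At inference only the audio branch runs, so no text embedding is required."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


loss = training_loss([0.0], [1], [1.0, 0.0], [0.0, 1.0], weight=0.5)
```

Because the text embedding appears only inside `training_loss`, dropping it at inference changes nothing in the audio path.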

* Accepted to INTERSPEECH 2024

Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Jun 14, 2024

Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu, Eloi DuBois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang (+2 more)

Abstract: We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find the diffusion-based synthesizer to improve MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5% while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5× faster than real-time.
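
For readers unfamiliar with the "faster than real-time" phrasing, the short Python sketch below shows the usual arithmetic: audio duration divided by processing time. The function name and the numbers are illustrative only and are not measurements from the paper.

```python
def realtime_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """How many times faster than real-time a system runs (values > 1 beat real-time)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Hypothetical example: 10 s of speech translated in 2 s of compute.
speedup = realtime_speedup(10.0, 2.0)
```

With these made-up numbers the system would run at exactly 5× real-time. The abstract reports a figure above that threshold.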

* Published in Interspeech 2024
