Avihu Dekel

Spoken question answering for visual queries

May 29, 2025

Nimrod Shabtay, Zvi Kons, Avihu Dekel, Hagai Aronowitz, Ron Hoory, Assaf Arbelle

Abstract: Question answering (QA) systems are designed to answer natural language questions. Visual QA (VQA) and Spoken QA (SQA) systems extend the textual QA system to accept visual and spoken input, respectively. This work aims to create a system that enables user interaction through both speech and images. This is achieved through the fusion of text, speech, and image modalities to tackle the task of spoken VQA (SVQA). The resulting multi-modal model has textual, visual, and spoken inputs and can answer spoken questions on images. Training and evaluating SVQA models requires a dataset for all three modalities, but no such dataset currently exists. We address this problem by synthesizing VQA datasets using two zero-shot TTS models. Our initial findings indicate that a model trained only with synthesized speech nearly reaches the performance of the upper-bounding model trained on textual QAs. In addition, we show that the choice of the TTS model has a minor impact on accuracy.

* Accepted for Interspeech 2025 (with additional results)

Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities

May 14, 2025

George Saon, Avihu Dekel, Alexander Brooks, Tohru Nagano, Abraham Daniels, Aharon Satt, Ashish Mittal, Brian Kingsbury, David Haws, Edmilson Morais(+14 more)

Abstract: Granite-speech LLMs are compact and efficient speech language models specifically designed for English ASR and automatic speech translation (AST). The models were trained by modality aligning the 2B and 8B parameter variants of granite-3.3-instruct to speech on publicly available open-source corpora containing audio inputs and text targets consisting of either human transcripts for ASR or automatically generated translations for AST. Comprehensive benchmarking shows that on English ASR, which was our primary focus, they outperform several competitors' models that were trained on orders of magnitude more proprietary data, and they keep pace on English-to-X AST for major European languages, Japanese, and Chinese. The speech-specific components are: a conformer acoustic encoder using block attention and self-conditioning trained with connectionist temporal classification, a windowed query-transformer speech modality adapter used to do temporal downsampling of the acoustic embeddings and map them to the LLM text embedding space, and LoRA adapters to further fine-tune the text LLM. Granite-speech-3.3 operates in two modes: in speech mode, it performs ASR and AST by activating the encoder, projector, and LoRA adapters; in text mode, it calls the underlying granite-3.3-instruct model directly (without LoRA), essentially preserving all the text LLM capabilities and safety. Both models are freely available on HuggingFace (https://huggingface.co/ibm-granite/granite-speech-3.3-2b and https://huggingface.co/ibm-granite/granite-speech-3.3-8b) and can be used for both research and commercial purposes under a permissive Apache 2.0 license.

* 7 pages, 9 figures

Continuous Speech Synthesis using per-token Latent Diffusion

Oct 21, 2024

Arnon Turetzky, Nimrod Shabtay, Slava Shechtman, Hagai Aronowitz, David Haws, Ron Hoory, Avihu Dekel

Abstract: The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation, and extends it to generate variable-length outputs. Our approach utilizes semantic tokens for providing contextual information and determining the stopping condition. We suggest three continuous variants for our method, extending popular discrete speech synthesis techniques. Additionally, we implement discrete baselines for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competent, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.

* Preprint, Under review

Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer

Oct 10, 2024

Slava Shechtman, Avihu Dekel

Abstract: Discrete audio codecs (or audio tokenizers) have recently regained interest due to the ability of Large Language Models (LLMs) to learn their compressed acoustic representations. Various publicly available trainable discrete tokenizers recently demonstrated impressive results for audio tokenization, yet they mostly require high token rates to gain high-quality reconstruction. In this study, we fine-tuned an open-source general audio RVQGAN model using diverse open-source speech data, considering various recording conditions and quality levels. The resulting wideband (24kHz) speech-only model achieves speech reconstruction that is nearly indistinguishable from PCM (pulse-code modulation) at a rate of 150-300 tokens per second (1500-3000 bps). The evaluation used comprehensive English speech data encompassing different recording conditions, including studio settings. Speech samples are made publicly available at http://ibm.biz/IS24SpeechRVQ. The model is officially released at https://huggingface.co/ibm/DAC.speech.v1.0

* Proc. Interspeech 2024, 4174-4178
* You can download the model from https://huggingface.co/ibm/DAC.speech.v1.0
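
The token and bit rates quoted in the abstract can be related by simple arithmetic. The sketch below assumes each token is coded with a fixed number of bits and no further entropy coding; the function names are illustrative, and the implied codebook size is an inference from the quoted numbers, not a figure stated in the abstract.

```python
def bits_per_token(bitrate_bps: float, tokens_per_second: float) -> float:
    """Bits carried by each token under fixed-length coding (an assumption)."""
    return bitrate_bps / tokens_per_second


def implied_codebook_size(bits: float) -> int:
    """Number of distinct entries addressable with the given number of bits."""
    return round(2 ** bits)


# Rates quoted in the abstract: 150-300 tokens/s at 1500-3000 bps.
for tps, bps in [(150, 1500), (300, 3000)]:
    bits = bits_per_token(bps, tps)
    print(f"{tps} tokens/s at {bps} bps: {bits:.0f} bits/token, "
          f"up to {implied_codebook_size(bits)} codebook entries")
```

Both ends of the quoted range work out to the same number of bits per token, which suggests the bitrate scales with the token rate rather than with the codebook size.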

Exploring the Benefits of Tokenization of Discrete Acoustic Units

Jun 08, 2024

Avihu Dekel, Raul Fernandez

Abstract: Tokenization algorithms that merge the units of a base vocabulary into larger, variable-rate units have become standard in natural language processing tasks. This idea, however, has been mostly overlooked when the vocabulary consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based representation that is playing an increasingly important role due to the success of discrete language-modeling techniques. In this paper, we showcase the advantages of tokenization of phonetic units and of DAUs on three prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised speech generation using DAU language modeling. We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed, across all three tasks. We also offer theoretical insights to provide some explanation for the superior performance observed.

* Interspeech 2024
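
To illustrate what merging base units into larger, variable-rate units can look like, here is a minimal sketch of byte-pair-encoding-style merging applied to integer unit sequences. The abstract does not name the tokenization algorithm the authors used, so BPE is an assumption here, and all function names and the toy unit IDs are hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most common adjacent pair of units, or None if there is none."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(sequences, num_merges, base_vocab_size):
    """Greedily learn merges; new token IDs start at `base_vocab_size`."""
    sequences = [list(s) for s in sequences]
    merges = []
    for step in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        new_token = base_vocab_size + step
        merges.append((pair, new_token))
        sequences = [merge_pair(s, pair, new_token) for s in sequences]
    return merges, sequences


# Toy unit sequences over a base vocabulary of 8 units (IDs 0-7).
units = [[3, 7, 7, 2, 3, 7], [3, 7, 2, 2]]
merges, encoded = learn_merges(units, num_merges=1, base_vocab_size=8)
print(merges)   # learned (pair, new_token) rules
print(encoded)  # shorter sequences over the enlarged vocabulary
```

Each merge shortens the sequences a downstream model has to predict, which is one intuitive reason tokenization could speed up training and inference.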

Speak While You Think: Streaming Speech Synthesis During Text Generation

Sep 20, 2023

Avihu Dekel, Slava Shechtman, Raul Fernandez, David Haws, Zvi Kons, Ron Hoory

Abstract: Large Language Models (LLMs) demonstrate impressive capabilities, yet interaction with these models is mostly facilitated through text. Using Text-To-Speech to synthesize LLM outputs typically results in notable latency, which is impractical for fluent voice conversations. We propose LLM2Speech, an architecture to synthesize speech while text is being generated by an LLM, which yields significant latency reduction. LLM2Speech mimics the predictions of a non-streaming teacher model while limiting the exposure to future context in order to enable streaming. It exploits the hidden embeddings of the LLM, a by-product of the text generation that contains informative semantic context. Experimental results show that LLM2Speech maintains the teacher's quality while reducing the latency to enable natural conversations.

* Under review for ICASSP 2024

Active Learning Through a Covering Lens

May 23, 2022

Ofer Yehuda, Avihu Dekel, Guy Hacohen, Daphna Weinshall

Abstract: Deep active learning aims to reduce the annotation cost for deep neural networks, which are notoriously data-hungry. Until recently, deep active learning methods struggled in the low-budget regime, where only a small number of samples are annotated. The situation has been alleviated by recent advances in self-supervised representation learning methods, which impart the geometry of the data representation with rich information about the points. Taking advantage of this progress, we study the problem of subset selection for annotation through a "covering" lens, proposing ProbCover -- a new active learning algorithm for the low-budget regime, which seeks to maximize Probability Coverage. We describe a dual way to view our formulation, from which one can derive strategies suitable for the high-budget regime of active learning, related to existing methods like Coreset. We conclude with extensive experiments, evaluating ProbCover in the low-budget regime. We show that our principled active learning strategy improves the state-of-the-art in the low-budget regime in several image recognition benchmarks. This method is especially beneficial in semi-supervised settings, allowing state-of-the-art semi-supervised methods to achieve high accuracy with only a few labels.
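
The "covering" view of subset selection can be sketched as a greedy maximum-coverage procedure: repeatedly pick the point whose fixed-radius ball covers the most still-uncovered points. This is a toy illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the optimization procedure, the points below are plain 2D coordinates rather than self-supervised embeddings, and `greedy_cover`, `radius`, and `budget` are hypothetical names.

```python
import math


def greedy_cover(points, radius, budget):
    """Greedily select up to `budget` points to maximize the covered fraction.

    A point j counts as covered once some selected point i lies within
    `radius` of it (Euclidean distance).
    """
    n = len(points)
    # balls[i] holds the indices of all points within `radius` of point i.
    balls = [
        {j for j in range(n) if math.dist(points[i], points[j]) <= radius}
        for i in range(n)
    ]
    covered, selected = set(), []
    for _ in range(min(budget, n)):
        candidates = [i for i in range(n) if i not in selected]
        # Pick the candidate that newly covers the most points.
        best = max(candidates, key=lambda i: len(balls[i] - covered))
        selected.append(best)
        covered |= balls[best]
    return selected, len(covered) / n


# Two tight groups and one isolated point; a budget of two queries.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (9, 9)]
selected, coverage = greedy_cover(points, radius=0.5, budget=2)
print(selected, coverage)
```

With a small budget the procedure spends its queries on dense regions first and leaves the isolated point uncovered, which matches the intuition of covering as much of the data distribution as possible with few labels.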

Active Learning on a Budget: Opposite Strategies Suit High and Low Budgets

Feb 08, 2022

Guy Hacohen, Avihu Dekel, Daphna Weinshall

Abstract: Investigating active learning, we focus on the relation between the number of labeled examples (budget size), and suitable corresponding querying strategies. Our theoretical analysis shows a behavior reminiscent of phase transition: typical points should best be queried in the low-budget regime, while atypical (or uncertain) points are best queried when the budget is large. Combined evidence from our theoretical and empirical studies shows that a similar phenomenon occurs in simple classification models. Accordingly, we propose TypiClust -- a deep active learning strategy suited for low budgets. In a comparative empirical investigation using a variety of architectures and image datasets, we report that in the low-budget regime, TypiClust outperforms all other active learning strategies. Using TypiClust in a semi-supervised framework, the performance of competitive semi-supervised methods gets a significant boost, surpassing the state of the art.
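
A rough sketch of querying typical points follows. It assumes a density-style typicality score (the inverse of the mean distance to the k nearest neighbours) and takes cluster labels as given; the abstract does not define typicality or the clustering step, so the score, the function names, and the toy data are all assumptions made for illustration.

```python
import math


def typicality(points, k):
    """Score each point by the inverse mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        mean_knn = sum(nearest) / len(nearest)
        scores.append(1.0 / (mean_knn + 1e-12))  # epsilon guards against division by zero
    return scores


def most_typical_per_cluster(points, labels, k):
    """Return the index of the highest-typicality point in each cluster."""
    scores = typicality(points, k)
    best = {}
    for i, label in enumerate(labels):
        if label not in best or scores[i] > scores[best[label]]:
            best[label] = i
    return [best[label] for label in sorted(best)]


# Two clusters with hand-assigned labels; one query is chosen per cluster.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (10, 1), (10, 5)]
labels = [0, 0, 0, 1, 1, 1]
queries = most_typical_per_cluster(points, labels, k=2)
print(queries)
```

Choosing one point per cluster keeps the queries diverse, while the typicality score steers each query toward the dense core of its cluster rather than toward outliers.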
