Thibault Gisselbrecht

An Empirical Comparison of Video Frame Sampling Methods for Multi-Modal RAG Retrieval

Jul 22, 2024

Mahesh Kandhare, Thibault Gisselbrecht

Abstract: The literature describes many video frame sampling methods, but without a side-by-side comparison it is difficult to determine which one best suits the Video RAG pattern. In this work, we investigate the trade-offs among frame sampling methods for video and frame retrieval using natural language questions. We explore the balance between the number of sampled frames and the retrieval recall score, aiming to identify efficient sampling strategies that maintain high retrieval efficacy while reducing storage and processing demands. Our study focuses on the storage and retrieval of image data (video frames) in the vector database required by the Video RAG pattern, and compares the effectiveness of various frame sampling techniques. Our investigation indicates that, for both text-to-video and text-to-frame retrieval, the recall@k achieved by the sampling methods covered in this work is comparable to or exceeds that of storing every frame of the video. Our findings are intended to inform the selection of frame sampling methods for practical Video RAG implementations and to serve as a springboard for innovative research in this domain.

* 19 pages, 24 figures (65 images)
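To make the frame-count versus recall trade-off concrete, here is a minimal sketch of uniform stride sampling together with a recall@k computation. The function names and the "one relevant item per query" setup are illustrative assumptions, not the paper's code, and uniform sampling is only one of many strategies such a comparison could cover.

```python
from typing import List, Sequence


def uniform_sample(num_frames: int, stride: int) -> List[int]:
    """Keep every `stride`-th frame index, starting at frame 0 (hypothetical helper)."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(range(0, num_frames, stride))


def recall_at_k(ranked: Sequence[Sequence[str]], relevant: Sequence[str], k: int) -> float:
    """Fraction of queries whose single relevant item appears in the top-k results.

    Assumes exactly one relevant item per query; `ranked[i]` is the ordered
    result list returned for query i.
    """
    if not ranked:
        return 0.0
    hits = sum(1 for results, target in zip(ranked, relevant) if target in results[:k])
    return hits / len(ranked)


# A 10-frame clip sampled with stride 3 stores 4 frames instead of 10.
kept = uniform_sample(10, 3)

# Two queries: the first finds its target at rank 2, the second misses.
score = recall_at_k([["a", "b"], ["c", "d"]], ["b", "x"], k=2)
```

Sweeping `stride` and recomputing `recall_at_k` on the resulting index is one simple way to chart how recall changes as fewer frames are stored.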

Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks

Feb 25, 2020

Théodore Bluche, Maël Primet, Thibault Gisselbrecht

Abstract: We explore a keyword-based spoken language understanding system, in which the intent of the user can be derived directly from the detection of a sequence of keywords in the query. In this paper, we focus on an open-vocabulary keyword spotting method that allows users to define their own keywords without having to retrain the whole model. We describe the design choices leading to a fast, small-footprint system that can run on tiny devices for any arbitrary set of user-defined keywords, without training data specific to those keywords. The model, based on a quantized long short-term memory (LSTM) neural network trained with connectionist temporal classification (CTC), weighs less than 500KB. Our approach takes advantage of some properties of the predictions of CTC-trained networks to calibrate the confidence scores and implement a fast detection algorithm. The proposed system outperforms a standard keyword-filler model approach.
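For readers unfamiliar with CTC outputs, the sketch below shows standard greedy (best-path) decoding: take the most likely label per frame, merge consecutive repeats, then drop blanks. This is the generic textbook procedure, not the paper's detection algorithm or its confidence calibration; the blank index of 0 is an assumption.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_posteriors: Sequence[Sequence[float]], blank: int = 0) -> List[int]:
    """Best-path CTC decoding: argmax per frame, collapse repeats, remove blanks."""
    # Most likely label index for each frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_posteriors]

    decoded: List[int] = []
    previous = None
    for label in best_path:
        # Emit a label only when it differs from the previous frame and is not blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Seven frames over three classes (0 = blank); the best path is 0,1,1,0,1,2,2.
posteriors = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
]
labels = ctc_greedy_decode(posteriors)
```

The blank frame between the two runs of label 1 is what lets CTC represent a genuinely repeated label, so the result keeps both occurrences.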

Predicting detection filters for small footprint open-vocabulary keyword spotting

Dec 16, 2019

Theodore Bluche, Thibault Gisselbrecht

Abstract: In many scenarios, detecting keywords from natural language queries is sufficient to understand the intent of the user. In this paper, we propose a fully neural approach to open-vocabulary keyword spotting, allowing users to add a voice interface to their device without having to retrain a model on task-specific data. We present a keyword detection neural network weighing less than 550KB, in which the topmost layer performing keyword detection is predicted by an auxiliary network, which may be run offline to generate a detector for any keyword.

Efficient keyword spotting using dilated convolutions and gating

Nov 19, 2018

Alice Coucke, Mohammed Chlieh, Thibault Gisselbrecht, David Leroy, Mathieu Poumeyrol, Thibaut Lavril

Abstract: We explore the application of end-to-end stateless temporal modeling to small-footprint keyword spotting, as opposed to recurrent networks that model long-term temporal dependencies using internal states. We propose a model inspired by the recent success of dilated convolutions in sequence modeling applications, which allows deeper architectures to be trained in resource-constrained configurations. Gated activations and residual connections are also added, following a configuration similar to WaveNet. In addition, we apply a custom target labeling that back-propagates loss from specific frames of interest, yielding higher accuracy and requiring only that the end of the keyword be detected. Our experimental results show that our model outperforms a recurrent neural network with LSTM cells trained with a max-pooling loss, with a significant decrease in false rejection rate. The underlying dataset ("Hey Snips" utterances recorded by over 2.2K different speakers) has been made publicly available to establish an open reference for wake-word detection.
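As an illustration of the building blocks named in the abstract, the following is a minimal single-channel sketch of a dilated convolution with a WaveNet-style gated activation and a residual connection. It assumes causal convolutions with zero padding on the left and omits the 1x1 projection that WaveNet applies before the residual sum; the layer sizes and weights are made up and are not the paper's configuration.

```python
import math
from typing import List, Sequence


def dilated_causal_conv(x: Sequence[float], weights: Sequence[float], dilation: int) -> List[float]:
    """y[t] = sum_i weights[i] * x[t - i * dilation], treating x as zero before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            j = t - i * dilation
            if j >= 0:
                acc += w * x[j]
        out.append(acc)
    return out


def gated_residual_layer(x: Sequence[float], w_filter: Sequence[float],
                         w_gate: Sequence[float], dilation: int) -> List[float]:
    """Residual output x + tanh(filter conv) * sigmoid(gate conv), single channel."""
    f = dilated_causal_conv(x, w_filter, dilation)
    g = dilated_causal_conv(x, w_gate, dilation)
    return [xi + math.tanh(fi) / (1.0 + math.exp(-gi)) for xi, fi, gi in zip(x, f, g)]


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input frames seen by one output frame of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


signal = [1.0, 2.0, 3.0, 4.0]
conv_out = dilated_causal_conv(signal, [1.0, 1.0], dilation=2)
layer_out = gated_residual_layer(signal, [0.5, -0.5], [1.0, 1.0], dilation=2)
frames_seen = receptive_field(2, [1, 2, 4, 8])
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly, which is why the technique suits small-footprint models.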

Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces

Nov 05, 2018

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril (+2 more)

Abstract: This paper presents the machine learning architecture of the Snips Voice Platform, a software solution to perform Spoken Language Understanding on microprocessors typical of IoT devices. The embedded inference is fast and accurate while enforcing privacy by design, as no personal user data is ever collected. Focusing on Automatic Speech Recognition and Natural Language Understanding, we detail our approach to training high-performance Machine Learning models that are small enough to run in real time on small devices. Additionally, we describe a data generation procedure that provides sufficient, high-quality training data without compromising user privacy.

Federated Learning for Keyword Spotting

Oct 31, 2018

David Leroy, Alice Coucke, Thibaut Lavril, Thibault Gisselbrecht, Joseph Dureau

Abstract: We propose a practical approach based on federated learning to solve out-of-domain issues with continuously running embedded speech-based models such as wake word detectors. We conduct an extensive empirical study of the federated averaging algorithm for the "Hey Snips" wake word based on a crowdsourced dataset that mimics a federation of wake word users. We empirically demonstrate that using an adaptive averaging strategy inspired by Adam in place of standard weighted model averaging greatly reduces the number of communication rounds required to reach our target performance. The associated upstream communication costs per user are estimated at 8 MB, which is reasonable in the context of smart home voice assistants. Additionally, the dataset used for these experiments is being open-sourced with the aim of fostering further transparent research in the application of federated learning to speech data.
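The contrast between standard weighted averaging and an Adam-style server update can be sketched as follows. This is a generic illustration under stated assumptions: clients send parameter deltas, the server treats the negated weighted average as a pseudo-gradient, and the hyperparameters are the usual Adam defaults. The abstract does not give the paper's exact update rule, so none of this should be read as the authors' method.

```python
import math
from typing import List, Sequence, Tuple


def weighted_average(updates: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Average client updates, weighting each client (e.g. by its number of samples)."""
    total = float(sum(weights))
    dim = len(updates[0])
    return [sum(w * u[i] for u, w in zip(updates, weights)) / total for i in range(dim)]


def adam_server_step(params: Sequence[float], avg_delta: Sequence[float],
                     m: Sequence[float], v: Sequence[float], step: int,
                     lr: float = 0.1, beta1: float = 0.9, beta2: float = 0.999,
                     eps: float = 1e-8) -> Tuple[List[float], List[float], List[float]]:
    """One Adam-style server update; `step` is the 1-based communication round."""
    new_params, new_m, new_v = [], [], []
    for p, d, mi, vi in zip(params, avg_delta, m, v):
        g = -d  # pseudo-gradient: moving along the averaged delta lowers the loss
        mi = beta1 * mi + (1.0 - beta1) * g
        vi = beta2 * vi + (1.0 - beta2) * g * g
        m_hat = mi / (1.0 - beta1 ** step)  # bias correction for early rounds
        v_hat = vi / (1.0 - beta2 ** step)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_params, new_m, new_v


# Two clients with 1 and 3 local samples report deltas for a 2-parameter model.
avg = weighted_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
params, m, v = adam_server_step([0.0, 0.0], avg, [0.0, 0.0], [0.0, 0.0], step=1)
```

Plain federated averaging would simply add `avg` to the global parameters; the adaptive variant normalizes each coordinate by a running estimate of its magnitude, which is one plausible reason such schemes need fewer communication rounds.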

Spoken Language Understanding on the Edge

Oct 30, 2018

Alaa Saade, Alice Coucke, Alexandre Caulier, Joseph Dureau, Adrien Ball, Théodore Bluche, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone (+2 more)

Abstract: We consider the problem of performing Spoken Language Understanding (SLU) on small devices typical of IoT applications. Our contributions are twofold. First, we outline the design of an embedded, private-by-design SLU system and show that it has performance on par with cloud-based commercial solutions. Second, we release the datasets used in our experiments in the interest of reproducibility and in the hope that they can prove useful to the SLU community.
