Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Brian Kulis

The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data

Mar 21, 2024

Alice Baird, Rachel Manzelli, Panagiotis Tzirakis, Chris Gagne, Haoqi Li, Sadie Allen, Sander Dieleman, Brian Kulis, Shrikanth S. Narayanan, Alan Cowen

Figure 1 for The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data

Figure 2 for The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data

Figure 3 for The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data

Figure 4 for The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data

Abstract:The NeurIPS 2023 Machine Learning for Audio Workshop brings together machine learning (ML) experts from various audio domains. There are several valuable audio-driven ML tasks, from speech emotion recognition to audio event detection, but the community is sparse compared to other ML areas, e.g., computer vision or natural language processing. A major limitation with audio is the available data; with audio being a time-dependent modality, high-quality data collection is time-consuming and costly, making it challenging for academic groups to apply their often state-of-the-art strategies to a larger, more generalizable dataset. In this short white paper, to encourage researchers with limited access to large-datasets, the organizers first outline several open-source datasets that are available to the community, and for the duration of the workshop are making several propriety datasets available. Namely, three vocal datasets, Hume-Prosody, Hume-VocalBurst, an acted emotional speech dataset Modulate-Sonata, and an in-game streamer dataset Modulate-Stream. We outline the current baselines on these datasets but encourage researchers from across audio to utilize them outside of the initial baseline tasks.

Via

Access Paper or Ask Questions

A Data Centric Approach for Unsupervised Domain Generalization via Retrieval from Web Scale Multimodal Data

Feb 06, 2024

Christopher Liao, Theodoros Tsiligkaridis, Brian Kulis

Abstract:Domain generalization (DG) is an important problem that learns a model that can generalize to unseen test domains leveraging one or more source domains, under the assumption of shared label spaces. However, most DG methods assume access to abundant source data in the target label space, a requirement that proves overly stringent for numerous real-world applications, where acquiring the same label space as the target task is prohibitively expensive. For this setting, we tackle the multimodal version of the unsupervised domain generalization (UDG) problem, which uses a large task-agnostic unlabeled source dataset, such as LAION-2B during finetuning. Our framework does not explicitly assume any relationship between the source dataset and target task. Instead, it relies only on the premise that the source dataset can be efficiently searched in a joint vision-language space. For this multimodal UDG setting, we propose a novel method to build a small ($<$100K) subset of the source data in three simple steps: (1) diversified retrieval using label names as queries, (2) rank pseudo-labeling, and (3) clustering to find representative samples. To demonstrate the value of studying the multimodal UDG problem, we compare our results against state-of-the-art source-free DG and zero-shot (ZS) methods on their respective benchmarks and show up to 10% improvement in accuracy on 20 diverse target datasets. Additionally, our multi-stage dataset construction method achieves 3% improvement on average over nearest neighbors retrieval. Code is available: https://github.com/Chris210634/mudg

Via

Access Paper or Ask Questions

Image-Caption Encoding for Improving Zero-Shot Generalization

Feb 05, 2024

Eric Yang Yu, Christopher Liao, Sathvik Ravi, Theodoros Tsiligkaridis, Brian Kulis

Abstract:Recent advances in vision-language models have combined contrastive approaches with generative methods to achieve state-of-the-art (SOTA) on downstream inference tasks like zero-shot image classification. However, a persistent issue of these models for image classification is their out-of-distribution (OOD) generalization capabilities. We first show that when an OOD data point is misclassified, the correct class can be typically found in the Top-K predicted classes. In order to steer the model prediction toward the correct class within the top predicted classes, we propose the Image-Caption Encoding (ICE) method, a straightforward approach that directly enforces consistency between the image-conditioned and caption-conditioned predictions at evaluation time only. Intuitively, we take advantage of unique properties of the generated captions to guide our local search for the correct class label within the Top-K predicted classes. We show that our method can be easily combined with other SOTA methods to enhance Top-1 OOD accuracies by 0.5% on average and up to 3% on challenging datasets. Our code: https://github.com/Chris210634/ice

Via

Access Paper or Ask Questions

Maximum-Entropy Adversarial Audio Augmentation for Keyword Spotting

Jan 12, 2024

Zuzhao Ye, Gregory Ciccarelli, Brian Kulis

Abstract:Data augmentation is a key tool for improving the performance of deep networks, particularly when there is limited labeled data. In some fields, such as computer vision, augmentation methods have been extensively studied; however, for speech and audio data, there are relatively fewer methods developed. Using adversarial learning as a starting point, we develop a simple and effective augmentation strategy based on taking the gradient of the entropy of the outputs with respect to the inputs and then creating new data points by moving in the direction of the gradient to maximize the entropy. We validate its efficacy on several keyword spotting tasks as well as standard audio benchmarks. Our method is straightforward to implement, offering greater computational efficiency than more complex adversarial schemes like GANs. Despite its simplicity, it proves robust and effective, especially when combined with the established SpecAugment technique, leading to enhanced performance.

* 5 pages, 2 figures

Via

Access Paper or Ask Questions

Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning

Nov 21, 2023

Christopher Liao, Theodoros Tsiligkaridis, Brian Kulis

Figure 1 for Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning

Figure 2 for Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning

Figure 3 for Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning

Figure 4 for Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning

Abstract:Over the past year, a large body of multimodal research has emerged around zero-shot evaluation using GPT descriptors. These studies boost the zero-shot accuracy of pretrained VL models with an ensemble of label-specific text generated by GPT. A recent study, WaffleCLIP, demonstrated that similar zero-shot accuracy can be achieved with an ensemble of random descriptors. However, both zero-shot methods are un-trainable and consequently sub-optimal when some few-shot out-of-distribution (OOD) training data is available. Inspired by these prior works, we present two more flexible methods called descriptor and word soups, which do not require an LLM at test time and can leverage training data to increase OOD target accuracy. Descriptor soup greedily selects a small set of textual descriptors using generic few-shot training data, then calculates robust class embeddings using the selected descriptors. Word soup greedily assembles a chain of words in a similar manner. Compared to existing few-shot soft prompt tuning methods, word soup requires fewer parameters by construction and less GPU memory, since it does not require backpropagation. Both soups outperform current published few-shot methods, even when combined with SoTA zero-shot methods, on cross-dataset and domain generalization benchmarks. Compared with SoTA prompt and descriptor ensembling methods, such as ProDA and WaffleCLIP, word soup achieves higher OOD accuracy with fewer ensemble members. Please checkout our code: github.com/Chris210634/word_soups

Via

Access Paper or Ask Questions

Supervised Metric Learning for Retrieval via Contextual Similarity Optimization

Oct 04, 2022

Christopher Liao, Theodoros Tsiligkaridis, Brian Kulis

Figure 1 for Supervised Metric Learning for Retrieval via Contextual Similarity Optimization

Figure 2 for Supervised Metric Learning for Retrieval via Contextual Similarity Optimization

Figure 3 for Supervised Metric Learning for Retrieval via Contextual Similarity Optimization

Figure 4 for Supervised Metric Learning for Retrieval via Contextual Similarity Optimization

Abstract:Existing deep metric learning approaches fall into three general categories: contrastive learning, average precision (AP) maximization, and classification. We propose a novel alternative approach, \emph{contextual similarity optimization}, inspired by work in unsupervised metric learning. Contextual similarity is a discrete similarity measure based on relationships between neighborhood sets, and is widely used in the unsupervised setting as pseudo-supervision. Inspired by this success, we propose a framework which optimizes \emph{a combination of contextual and cosine similarities}. Contextual similarity calculation involves several non-differentiable operations, including the heaviside function and intersection of sets. We show how to circumvent non-differentiability to explicitly optimize contextual similarity, and we further incorporate appropriate similarity regularization to yield our novel metric learning loss. The resulting loss function achieves state-of-the-art Recall @ 1 accuracy on standard supervised image retrieval benchmarks when combined with the standard contrastive loss. Code is released here: \url{https://github.com/Chris210634/metric-learning-using-contextual-similarity}

Via

Access Paper or Ask Questions

Latency Control for Keyword Spotting

Jun 15, 2022

Christin Jose, Joseph Wang, Grant P. Strimel, Mohammad Omar Khursheed, Yuriy Mishchenko, Brian Kulis

Figure 1 for Latency Control for Keyword Spotting

Figure 2 for Latency Control for Keyword Spotting

Figure 3 for Latency Control for Keyword Spotting

Figure 4 for Latency Control for Keyword Spotting

Abstract:Conversational agents commonly utilize keyword spotting (KWS) to initiate voice interaction with the user. For user experience and privacy considerations, existing approaches to KWS largely focus on accuracy, which can often come at the expense of introduced latency. To address this tradeoff, we propose a novel approach to control KWS model latency and which generalizes to any loss function without explicit knowledge of the keyword endpoint. Through a single, tunable hyperparameter, our approach enables one to balance detection latency and accuracy for the targeted application. Empirically, we show that our approach gives superior performance under latency constraints when compared to existing methods. Namely, we make a substantial 25\% relative false accepts improvement for a fixed latency target when compared to the baseline state-of-the-art. We also show that when our approach is used in conjunction with a max-pooling loss, we are able to improve relative false accepts by 25 % at a fixed latency when compared to cross entropy loss.

* Proceedings of INTERSPEECH

Via

Access Paper or Ask Questions

Pick up the PACE: Fast and Simple Domain Adaptation via Ensemble Pseudo-Labeling

May 26, 2022

Christopher Liao, Theodoros Tsiligkaridis, Brian Kulis

Figure 1 for Pick up the PACE: Fast and Simple Domain Adaptation via Ensemble Pseudo-Labeling

Figure 2 for Pick up the PACE: Fast and Simple Domain Adaptation via Ensemble Pseudo-Labeling

Figure 3 for Pick up the PACE: Fast and Simple Domain Adaptation via Ensemble Pseudo-Labeling

Figure 4 for Pick up the PACE: Fast and Simple Domain Adaptation via Ensemble Pseudo-Labeling

Abstract:Domain Adaptation (DA) has received widespread attention from deep learning researchers in recent years because of its potential to improve test accuracy with out-of-distribution labeled data. Most state-of-the-art DA algorithms require an extensive amount of hyperparameter tuning and are computationally intensive due to the large batch sizes required. In this work, we propose a fast and simple DA method consisting of three stages: (1) domain alignment by covariance matching, (2) pseudo-labeling, and (3) ensembling. We call this method $\textbf{PACE}$, for $\textbf{P}$seudo-labels, $\textbf{A}$lignment of $\textbf{C}$ovariances, and $\textbf{E}$nsembles. PACE is trained on top of fixed features extracted from an ensemble of modern pretrained backbones. PACE exceeds previous state-of-the-art by $\textbf{5 - 10 \%}$ on most benchmark adaptation tasks without training a neural network. PACE reduces training time and hyperparameter tuning time by $82\%$ and $97\%$, respectively, when compared to state-of-the-art DA methods. Code is released here: https://github.com/Chris210634/PACE-Domain-Adaptation

Via

Access Paper or Ask Questions

Faster Convex Lipschitz Regression via 2-block ADMM

Nov 29, 2021

Ali Siahkamari, Durmus Alp Emre Acar, Christopher Liao, Kelly Geyer, Venkatesh Saligrama, Brian Kulis

Figure 1 for Faster Convex Lipschitz Regression via 2-block ADMM

Figure 2 for Faster Convex Lipschitz Regression via 2-block ADMM

Figure 3 for Faster Convex Lipschitz Regression via 2-block ADMM

Figure 4 for Faster Convex Lipschitz Regression via 2-block ADMM

Abstract:The task of approximating an arbitrary convex function arises in several learning problems such as convex regression, learning with a difference of convex (DC) functions, and approximating Bregman divergences. In this paper, we show how a broad class of convex function learning problems can be solved via a 2-block ADMM approach, where updates for each block can be computed in closed form. For the task of convex Lipschitz regression, we establish that our proposed algorithm converges with iteration complexity of $ O(n\sqrt{d}/\epsilon)$ for a dataset $ X \in \mathbb R^{n\times d}$ and $\epsilon > 0$. Combined with per-iteration computation complexity, our method converges with the rate $O(n^3 d^{1.5}/\epsilon+n^2 d^{2.5}/\epsilon+n d^3/\epsilon)$. This new rate improves the state of the art rate of $O(n^5d^2/\epsilon)$ available by interior point methods if $d = o( n^4)$. Further we provide similar solvers for DC regression and Bregman divergence learning. Unlike previous approaches, our method is amenable to the use of GPUs. We demonstrate on regression and metric learning experiments that our approach is up to 30 times faster than the existing method, and produces results that are comparable to state-of-the-art.

* 21 pages, 3 figures. Paper under review

Via

Access Paper or Ask Questions

Tiny-CRNN: Streaming Wakeword Detection In A Low Footprint Setting

Sep 29, 2021

Mohammad Omar Khursheed, Christin Jose, Rajath Kumar, Gengshen Fu, Brian Kulis, Santosh Kumar Cheekatmalla

Figure 1 for Tiny-CRNN: Streaming Wakeword Detection In A Low Footprint Setting

Figure 2 for Tiny-CRNN: Streaming Wakeword Detection In A Low Footprint Setting

Figure 3 for Tiny-CRNN: Streaming Wakeword Detection In A Low Footprint Setting

Figure 4 for Tiny-CRNN: Streaming Wakeword Detection In A Low Footprint Setting

Abstract:In this work, we propose Tiny-CRNN (Tiny Convolutional Recurrent Neural Network) models applied to the problem of wakeword detection, and augment them with scaled dot product attention. We find that, compared to Convolutional Neural Network models, False Accepts in a 250k parameter budget can be reduced by 25% with a 10% reduction in parameter size by using models based on the Tiny-CRNN architecture, and we can get up to 32% reduction in False Accepts at a 50k parameter budget with 75% reduction in parameter size compared to word-level Dense Neural Network models. We discuss solutions to the challenging problem of performing inference on streaming audio with this architecture, as well as differences in start-end index errors and latency in comparison to CNN, DNN, and DNN-HMM models.

* arXiv admin note: substantial text overlap with arXiv:2011.12941

Via

Access Paper or Ask Questions