Shawn Hershey

Dataset balancing can hurt model performance

Jun 30, 2023

R. Channing Moore, Daniel P. W. Ellis, Eduardo Fonseca, Shawn Hershey, Aren Jansen, Manoj Plakal

Abstract: Machine learning from training data with a skewed distribution of examples per class can lead to models that favor performance on common classes at the expense of performance on rare ones. AudioSet has a very wide range of priors over its 527 sound event classes. Classification performance on AudioSet is usually evaluated by a simple average over per-class metrics, meaning that performance on rare classes is equal in importance to the performance on common ones. Several recent papers have used dataset balancing techniques to improve performance on AudioSet. We find, however, that while balancing improves performance on the public AudioSet evaluation data, it simultaneously hurts performance on an unpublished evaluation set collected under the same conditions. By varying the degree of balancing, we show that its benefits are fragile and depend on the evaluation set. We also do not find evidence indicating that balancing improves rare class performance relative to common classes. We therefore caution against blind application of balancing, as well as against paying too much attention to small improvements on a public evaluation set.
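
To make "varying the degree of balancing" concrete, here is a minimal sketch of one possible scheme. It is not taken from the paper: the function name, the exponent `beta`, and the choice to weight each clip by its rarest class are all assumptions made for illustration.

```python
from collections import Counter


def balanced_sampling_weights(labels_per_clip, beta=1.0):
    """Return per-clip sampling probabilities that partially offset class skew.

    labels_per_clip: list of non-empty label lists, one per clip (multi-label).
    beta: hypothetical balancing exponent. 0.0 gives uniform sampling;
          1.0 weights a clip by the inverse count of its rarest class.
    """
    # Count how many clips carry each class label.
    counts = Counter(label for labels in labels_per_clip for label in labels)
    weights = []
    for labels in labels_per_clip:
        rarest_count = min(counts[label] for label in labels)
        weights.append(rarest_count ** -beta)
    total = sum(weights)
    # Normalize so the weights form a probability distribution over clips.
    return [w / total for w in weights]


clips = [["speech"], ["speech"], ["speech"], ["gong"]]
print(balanced_sampling_weights(clips, beta=0.0))  # uniform over clips
print(balanced_sampling_weights(clips, beta=1.0))  # the rare class is upweighted
```

Intermediate values of `beta` interpolate between the natural class distribution and full balancing, which is the kind of knob the abstract describes sweeping.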

* ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5
* 5 pages, 3 figures, ICASSP 2023

The Benefit Of Temporally-Strong Labels In Audio Event Classification

May 14, 2021

Shawn Hershey, Daniel P W Ellis, Eduardo Fonseca, Aren Jansen, Caroline Liu, R Channing Moore, Manoj Plakal

Abstract: To reveal the importance of temporal precision in ground truth audio event labels, we collected precise (~0.1 sec resolution) "strong" labels for a portion of the AudioSet dataset. We devised a temporally strong evaluation set (including explicit negatives of varying difficulty) and a small strong-labeled training subset of 67k clips (compared to the original dataset's 1.8M clips labeled at 10 sec resolution). We show that fine-tuning with a mix of weak and strongly labeled data can substantially improve classifier performance, even when evaluated using only the original weak labels. For a ResNet50 architecture, d' on the strong evaluation data including explicit negatives improves from 1.13 to 1.41. The new labels are available as an update to AudioSet.
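
For readers unfamiliar with d', the sketch below converts between ROC AUC and d' using the relation d' = sqrt(2) * Φ⁻¹(AUC), where Φ is the standard normal CDF. This assumes the convention commonly used in AudioSet evaluations; the abstract itself does not state how d' was computed, and the function names are illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def d_prime_from_auc(auc):
    """Map a ROC AUC strictly between 0 and 1 to d' (assumed convention)."""
    return math.sqrt(2.0) * _STD_NORMAL.inv_cdf(auc)


def auc_from_d_prime(d_prime):
    """Inverse mapping: the AUC implied by a given d'."""
    return _STD_NORMAL.cdf(d_prime / math.sqrt(2.0))


# Chance-level AUC corresponds to d' = 0; a larger d' implies a larger AUC.
print(d_prime_from_auc(0.5))
print(auc_from_d_prime(1.13), auc_from_d_prime(1.41))
```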

* Accepted for publication at ICASSP 2021

Self-Supervised Learning from Automatically Separated Sound Scenes

May 05, 2021

Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra

Abstract: Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Nov 02, 2020

Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel P. W. Ellis, John R. Hershey

Abstract: Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Prior audio-visual separation work assumed artificial limitations on the domain of sound classes (e.g., to speech or music), constrained the number of sources, and required strong sound separation or visual segmentation labels. AudioScope overcomes these limitations, operating on an open domain of sounds, with variable numbers of sources, and without labels or prior visual segmentation. The training procedure for AudioScope uses mixture invariant training (MixIT) to separate synthetic mixtures of mixtures (MoMs) into individual sources, where noisy labels for mixtures are provided by an unsupervised audio-visual coincidence model. Using the noisy labels, along with attention between video and audio features, AudioScope learns to identify audio-visual similarity and to suppress off-screen sounds. We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data. This dataset contains a wide diversity of sound classes recorded in unconstrained conditions, making the application of previous methods unsuitable. For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips.
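
The mixture invariant training step mentioned above can be illustrated with a toy sketch. The idea: a model separates a mixture of two mixtures into several estimated sources, and the loss takes the best assignment of each estimated source to one of the two reference mixtures. The code below is a brute-force illustration on plain Python lists, with mean squared error standing in for whatever signal-level loss the authors use; the names `mixit_loss` and `mse` are hypothetical.

```python
from itertools import product


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mixit_loss(mix1, mix2, est_sources):
    """Best-assignment remix loss over two reference mixtures.

    Each estimated source is assigned to exactly one reference mixture;
    all 2**M assignments are tried and the lowest total loss is returned
    together with the assignment that achieved it.
    """
    n = len(mix1)
    best_loss, best_assign = float("inf"), None
    for assign in product((0, 1), repeat=len(est_sources)):
        remix = [[0.0] * n, [0.0] * n]
        for src, target in zip(est_sources, assign):
            for t, value in enumerate(src):
                remix[target][t] += value
        loss = mse(remix[0], mix1) + mse(remix[1], mix2)
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


mix1 = [1.0, 0.0, 1.0, 0.0]
mix2 = [0.0, 2.0, 0.0, 2.0]
estimates = [[0.0, 2.0, 0.0, 2.0], [1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(mixit_loss(mix1, mix2, estimates))
```

Exhaustive search is only practical for a handful of output sources, since the number of assignments doubles with each one.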

Addressing Missing Labels in Large-scale Sound Event Recognition using a Teacher-student Framework with Loss Masking

May 02, 2020

Eduardo Fonseca, Shawn Hershey, Manoj Plakal, Daniel P. W. Ellis, Aren Jansen, R. Channing Moore, Xavier Serra

Abstract: The study of label noise in sound event recognition has recently gained attention with the advent of larger and noisier datasets. This work addresses the problem of missing labels, one of the big weaknesses of large audio datasets, and one of the most conspicuous issues for AudioSet. We propose a simple and model-agnostic method based on a teacher-student framework with loss masking to first identify the most critical missing label candidates, and then ignore their contribution during the learning process. We find that a simple optimisation of the training label set improves recognition performance without additional compute. We discover that most of the improvement comes from ignoring a critical tiny portion of the missing labels. We also show that the damage done by missing labels is larger as the training set gets smaller, yet it can still be observed even when training with massive amounts of audio. We believe these insights can generalize to other large-scale datasets.
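
A rough sketch of the loss-masking idea follows. It assumes that a "missing label candidate" is a class labeled negative to which a teacher model nonetheless assigns a high score, and it uses a fixed score threshold to pick candidates. The threshold, the function name, and the per-example list interface are simplifications for illustration, not the paper's actual selection rule.

```python
import math


def masked_bce(student_probs, labels, teacher_probs, threshold=0.9):
    """Binary cross-entropy that skips suspected missing labels.

    A term is dropped when the label is 0 but the teacher's score is at or
    above `threshold` (a hypothetical cutoff), so the student is not
    penalized for predicting a class that may simply be unlabeled.
    """
    eps = 1e-7
    total, kept = 0.0, 0
    for p, y, t in zip(student_probs, labels, teacher_probs):
        if y == 0 and t >= threshold:
            continue  # masked: contributes nothing to the loss
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        kept += 1
    return total / kept if kept else 0.0


student = [0.9, 0.8, 0.1]
labels = [1, 0, 0]
teacher = [0.95, 0.97, 0.05]
print(masked_bce(student, labels, teacher))                 # second term masked
print(masked_bce(student, labels, teacher, threshold=1.1))  # nothing masked
```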

Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision

Nov 14, 2019

Aren Jansen, Daniel P. W. Ellis, Shawn Hershey, R. Channing Moore, Manoj Plakal, Ashok C. Popat, Rif A. Saurous

Abstract: Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate categories into relevant semantic classes. By training a combined sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to a 20-fold reduction in the number of labels required to reach a desired classification performance.

* This extended version of an ICASSP 2020 submission under the same title has an added figure and additional discussion for easier consumption

Unsupervised Learning of Semantic Audio Representations

Nov 06, 2017

Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous

Abstract: Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively. Moreover, in limited-supervision settings, our unsupervised embeddings double the state-of-the-art classification performance.
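
The "geometric inequalities" above can be read as triplet constraints: an anchor embedding should sit closer to a positive (for example, a noisy or time-shifted copy of the same clip) than to a negative. Below is a minimal sketch of a hinge-style triplet loss on plain vectors. The squared Euclidean distance and the margin value are assumptions chosen for illustration rather than details taken from the abstract.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_hinge_loss(anchor, positive, negative, margin=0.1):
    """Zero when the negative is farther from the anchor than the positive
    by at least `margin` (a hypothetical value); positive otherwise."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]
print(triplet_hinge_loss(anchor, positive, [3.0, 0.0]))  # constraint satisfied
print(triplet_hinge_loss(anchor, positive, [0.5, 0.0]))  # constraint violated
```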

* Submitted to ICASSP 2018

CNN Architectures for Large-Scale Audio Classification

Jan 10, 2017

Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold (+3 more)

Abstract: Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.

* Accepted for publication at ICASSP 2017. Changes: Added definitions of mAP, AUC, and d-prime. Updated mAP/AUC/d-prime numbers for Audio Set based on changes of latest Audio Set revision. Changed wording to fit 4 page limit with new additions.

Accelerating Inference: towards a full Language, Compiler and Hardware stack

Dec 12, 2012

Shawn Hershey, Jeff Bernstein, Bill Bradley, Andrew Schweitzer, Noah Stein, Theo Weber, Ben Vigoda

Abstract: We introduce Dimple, a fully open-source API for probabilistic modeling. Dimple allows the user to specify probabilistic models in the form of graphical models, Bayesian networks, or factor graphs, and performs inference (by automatically deriving an inference engine from a variety of algorithms) on the model. Dimple also serves as a compiler for GP5, a hardware accelerator for inference.
