Qiao
Abstract:This study explores the critical but underexamined impact of label noise on Sound Event Detection (SED), which requires both sound identification and precise temporal localization. We categorize label noise into deletion, insertion, substitution, and subjective types and systematically evaluate their effects on SED using synthetic and real-life datasets. Our analysis shows that deletion noise significantly degrades performance, while insertion noise is relatively benign. Moreover, loss functions effective against classification noise do not perform well for SED due to intra-class imbalance between foreground sound events and background sounds. We demonstrate that loss functions designed to address data imbalance in SED can effectively reduce the impact of noisy labels on system performance. For instance, halving the weight of background sounds in a synthetic dataset improved macro-F1 and micro-F1 scores by approximately $9\%$ with minimal Error Rate increase, with consistent results in real-life datasets. This research highlights the nuanced effects of noisy labels on SED systems and provides practical strategies to enhance model robustness, which are pivotal for both constructing new SED datasets and improving model performance, including efficient utilization of soft and crowdsourced labels.
Abstract:The use of Artificial Intelligence (AI) in healthcare, including in caring for cancer survivors, has gained significant interest. However, gaps remain in our understanding of how such AI systems can provide care, especially for ethnic and racial minority groups who continue to face care disparities. Through interviews with six cancer survivors, we identify critical gaps in current healthcare systems such as a lack of personalized care and insufficient cultural and linguistic accommodation. AI, when applied to care, was seen as a way to address these issues by enabling real-time, culturally aligned, and linguistically appropriate interactions. We also uncovered concerns about the implications of AI-driven personalization, such as data privacy, loss of human touch in caregiving, and the risk of echo chambers that limit exposure to diverse information. We conclude by discussing the trade-offs between AI-enhanced personalization and the need for structural changes in healthcare that go beyond technological solutions, leading us to argue that we should begin by asking, ``Why personalization?''
Abstract:We introduce a multi-sensor navigation system for autonomous surface vessels (ASV) intended for water-quality monitoring in freshwater lakes. Our mission planner uses satellite imagery as a prior map, formulating offline a mission-level policy for global navigation of the ASV and enabling autonomous online execution via local perception and local planning modules. A significant challenge is posed by the inconsistencies in traversability estimation between satellite images and real lakes, due to environmental effects such as wind, aquatic vegetation, shallow waters, and fluctuating water levels. Hence, we specifically modelled these traversability uncertainties as stochastic edges in a graph and optimized for a mission-level policy that minimizes the expected total travel distance. To execute the policy, we propose a modern local planner architecture that processes sensor inputs and plans paths to execute the high-level policy under uncertain traversability conditions. Our system was tested on three km-scale missions on a Northern Ontario lake, demonstrating that our GPS-, vision-, and sonar-enabled ASV system can effectively execute the mission-level policy and disambiguate the traversability of stochastic edges. Finally, we provide insights gained from practical field experience and offer several future directions to enhance the overall reliability of ASV navigation systems.
Abstract:This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code and models will be at https://github.com/facebookresearch/AudioMAE.
Abstract:As audio-visual systems are being deployed for safety-critical tasks such as surveillance and malicious content filtering, their robustness remains an under-studied area. Existing published work on robustness either does not scale to large-scale dataset, or does not deal with multiple modalities. This work aims to study several key questions related to multi-modal learning through the lens of robustness: 1) Are multi-modal models necessarily more robust than uni-modal models? 2) How to efficiently measure the robustness of multi-modal learning? 3) How to fuse different modalities to achieve a more robust multi-modal model? To understand the robustness of the multi-modal model in a large-scale setting, we propose a density-based metric, and a convexity metric to efficiently measure the distribution of each modality in high-dimensional latent space. Our work provides a theoretical intuition together with empirical evidence showing how multi-modal fusion affects adversarial robustness through these metrics. We further devise a mix-up strategy based on our metrics to improve the robustness of the trained model. Our experiments on AudioSet and Kinetics-Sounds verify our hypothesis that multi-modal models are not necessarily more robust than their uni-modal counterparts in the face of adversarial examples. We also observe our mix-up trained method could achieve as much protection as traditional adversarial training, offering a computationally cheap alternative. Implementation: https://github.com/lijuncheng16/AudioSetDoneRight
Abstract:Existing robotic lower-limb prostheses use autonomous control to address cyclic, locomotive tasks, but they are inadequate to operate the prosthesis for daily activities that are non-cyclic and unpredictable. To address this challenge, this study aims to design a novel electromyography (EMG)-driven musculoskeletal model for volitional control of a robotic ankle-foot prosthesis. This controller places the user in continuous control of the device, allowing them to freely manipulate the prosthesis behavior at will. The Hill-type muscle model was used to model a dorsiflexor and a plantarflexor, which functioned around a virtual ankle joint. The model parameters were determined by fitting the model prediction to the experimental data collected from an able-bodied subject. EMG signals recorded from ankle agonist and antagonist muscle pair were used to activate the virtual muscle models. This model was validated via offline simulations and real-time prosthesis control. Additionally, the feasibility of the proposed prosthesis control on assisting the user's functional tasks was demonstrated. The present control may further improve the function of robotic prosthesis for supporting versatile activities in individuals with lower-limb amputations.
Abstract:Research on both natural intelligence (NI) and artificial intelligence (AI) generally assumes that the future resembles the past: intelligent agents or systems (what we call 'intelligence') observe and act on the world, then use this experience to act on future experiences of the same kind. We call this 'retrospective learning'. For example, an intelligence may see a set of pictures of objects, along with their names, and learn to name them. A retrospective learning intelligence would merely be able to name more pictures of the same objects. We argue that this is not what true intelligence is about. In many real world problems, both NIs and AIs will have to learn for an uncertain future. Both must update their internal models to be useful for future tasks, such as naming fundamentally new objects and using these objects effectively in a new context or to achieve previously unencountered goals. This ability to learn for the future we call 'prospective learning'. We articulate four relevant factors that jointly define prospective learning. Continual learning enables intelligences to remember those aspects of the past which it believes will be most useful in the future. Prospective constraints (including biases and priors) facilitate the intelligence finding general solutions that will be applicable to future problems. Curiosity motivates taking actions that inform future decision making, including in previously unmet situations. Causal estimation enables learning the structure of relations that guide choosing actions for specific outcomes, even when the specific action-outcome contingencies have never been observed before. We argue that a paradigm shift from retrospective to prospective learning will enable the communities that study intelligence to unite and overcome existing bottlenecks to more effectively explain, augment, and engineer intelligences.
Abstract:Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves comparable performance as two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.
Abstract:In this paper, we introduce a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary. The system is implemented with an end-to-end trained automatic speech recognition (ASR) model and a text-independent speaker verification model. To address the challenge of detecting these keyphrases under various noisy conditions, a speaker separation model is added to the feature frontend of the speaker verification model, and an adaptive noise cancellation (ANC) algorithm is included to exploit cross-microphone noise coherence. Our experiments show that the text-independent speaker verification model largely reduces the false triggering rate of the keyphrase detection, while the speaker separation model and adaptive noise cancellation largely reduce false rejections.
Abstract:Sparse Bayesian learning (SBL) can be implemented with low complexity based on the approximate message passing (AMP) algorithm. However, it is vulnerable to 'difficult' measurement matrices, which may cause AMP to diverge. Damped AMP has been used for SBL to alleviate the problem at the cost of reducing convergence speed. In this work, we propose a new SBL algorithm based on structured variational inference, leveraging AMP with a unitary transformation (UAMP). Both single measurement vector and multiple measurement vector problems are investigated. It is shown that, compared to state-of-the-art AMP-based SBL algorithms, the proposed UAMPSBL is more robust and efficient, leading to remarkably better performance.