Mohan Li

Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks

Feb 18, 2025

Wenpeng Xing, Minghao Li, Mohan Li, Meng Han

Abstract:Embodied AI systems, including robots and autonomous vehicles, are increasingly integrated into real-world applications, where they encounter a range of vulnerabilities stemming from both environmental and system-level factors. These vulnerabilities manifest through sensor spoofing, adversarial attacks, and failures in task and motion planning, posing significant challenges to robustness and safety. Despite the growing body of research, existing reviews rarely focus specifically on the unique safety and security challenges of embodied AI systems. Most prior work either addresses general AI vulnerabilities or focuses on isolated aspects, lacking a dedicated and unified framework tailored to embodied AI. This survey fills this critical gap by: (1) categorizing vulnerabilities specific to embodied AI into exogenous (e.g., physical attacks, cybersecurity threats) and endogenous (e.g., sensor failures, software flaws) origins; (2) systematically analyzing adversarial attack paradigms unique to embodied AI, with a focus on their impact on perception, decision-making, and embodied interaction; (3) investigating attack vectors targeting large vision-language models (LVLMs) and large language models (LLMs) within embodied systems, such as jailbreak attacks and instruction misinterpretation; (4) evaluating robustness challenges in algorithms for embodied perception, decision-making, and task planning; and (5) proposing targeted strategies to enhance the safety and reliability of embodied AI systems. By integrating these dimensions, we provide a comprehensive framework for understanding the interplay between vulnerabilities and safety in embodied AI.

Neural Honeytrace: A Robust Plug-and-Play Watermarking Framework against Model Extraction Attacks

Jan 16, 2025

Yixiao Xu, Binxing Fang, Rui Wang, Yinghai Zhou, Shouling Ji, Yuan Liu, Mohan Li, Zhihong Tian

Abstract:Developing high-performance deep learning models is resource-intensive, leading model owners to utilize Machine Learning as a Service (MLaaS) platforms instead of publicly releasing their models. However, malicious users may exploit query interfaces to execute model extraction attacks, reconstructing the target model's functionality locally. While prior research has investigated triggerable watermarking techniques for asserting ownership, existing methods face significant challenges: (1) most approaches require additional training, resulting in high overhead and limited flexibility, and (2) they often fail to account for advanced attackers, leaving them vulnerable to adaptive attacks. In this paper, we propose Neural Honeytrace, a robust plug-and-play watermarking framework against model extraction attacks. We first formulate a watermark transmission model from an information-theoretic perspective, providing an interpretable account of the principles and limitations of existing triggerable watermarking. Guided by the model, we further introduce: (1) a similarity-based training-free watermarking method for plug-and-play and flexible watermarking, and (2) a distribution-based multi-step watermark information transmission strategy for robust watermarking. Comprehensive experiments on four datasets demonstrate that Neural Honeytrace outperforms previous methods in efficiency and in resisting adaptive attacks. Neural Honeytrace reduces the average number of samples required for a worst-case t-Test-based copyright claim from 12,000 to 200 with zero training cost.

A Survey on Federated Learning in Human Sensing

Jan 07, 2025

Mohan Li, Martin Gjoreski, Pietro Barbiero, Gašper Slapničar, Mitja Luštrek, Nicholas D. Lane, Marc Langheinrich

Abstract:Human Sensing, a field that leverages technology to monitor human activities, psycho-physiological states, and interactions with the environment, enhances our understanding of human behavior and drives the development of advanced services that improve overall quality of life. However, its reliance on detailed and often privacy-sensitive data as the basis for its machine learning (ML) models raises significant legal and ethical concerns. The recently proposed ML approach of Federated Learning (FL) promises to alleviate many of these concerns, as it is able to create accurate ML models without sending raw user data to a central server. While FL has demonstrated its usefulness across a variety of areas, such as text prediction and cyber security, its benefits in Human Sensing are under-explored, given the particular challenges in this domain. This survey conducts a comprehensive analysis of the current state-of-the-art studies on FL in Human Sensing, and proposes a taxonomy and an eight-dimensional assessment for FL approaches. Through the eight-dimensional assessment, we then evaluate whether the surveyed studies consider a specific FL-in-Human-Sensing challenge or not. Finally, based on the overall analysis, we discuss open challenges and highlight five research aspects related to FL in Human Sensing that require urgent research attention. Our work provides a comprehensive corpus of FL studies and aims to assist FL practitioners in developing and evaluating solutions that effectively address the real-world complexities of Human Sensing.
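
The abstract above notes that Federated Learning builds a model without sending raw user data to a central server. As a rough illustration of that idea only, here is a minimal sketch of federated averaging (FedAvg), a widely used FL aggregation rule. It is not the survey's taxonomy or any method it evaluates; the function name `federated_average`, the dictionary layout of the weights, and the client values are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical weight layout: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client parameters, weighting each client by its sample count.

    Only parameters and sample counts reach the server; the raw
    sensing data stays on each client device.
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("need at least one client sample")
    averaged: Weights = {}
    for name in updates[0][0]:
        dim = len(updates[0][0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in updates) / total
            for i in range(dim)
        ]
    return averaged


# Two illustrative clients with made-up parameters and sample counts.
clients = [
    ({"w": [1.0, 2.0]}, 10),
    ({"w": [3.0, 4.0]}, 30),
]
global_weights = federated_average(clients)
print(global_weights)  # {'w': [2.5, 3.5]}
```

The client holding 30 samples pulls the average toward its own parameters, which is the intended effect of weighting by local dataset size.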

WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding

Aug 29, 2024

Mohan Li, Cong-Thanh Do, Simon Keizer, Youmna Farag, Svetlana Stoyanchev, Rama Doddipatla

Abstract:Speech large language models (speech-LLMs) integrate speech and text-based foundation models to provide a unified framework for handling a wide range of downstream tasks. In this paper, we introduce WHISMA, a speech-LLM tailored for spoken language understanding (SLU) that demonstrates robust performance in various zero-shot settings. WHISMA combines the speech encoder from Whisper with the Llama-3 LLM, and is fine-tuned in a parameter-efficient manner on a comprehensive collection of SLU-related datasets. Our experiments show that WHISMA significantly improves the zero-shot slot filling performance on the SLURP benchmark, achieving a relative gain of 26.6% compared to the current state-of-the-art model. Furthermore, to evaluate WHISMA's generalisation capabilities to unseen domains, we develop a new task-agnostic benchmark named SLU-GLUE. The evaluation results indicate that WHISMA outperforms an existing speech-LLM (Qwen-Audio) with a relative gain of 33.0%.

* Accepted to SLT 2024

Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding

Jun 21, 2024

Mohan Li, Simon Keizer, Rama Doddipatla

Abstract:Zero-shot spoken language understanding (SLU) enables systems to comprehend user utterances in new domains without prior exposure to training data. Recent studies often rely on large language models (LLMs), leading to excessive footprints and complexity. This paper proposes the use of Whisper, a standalone speech processing model, for zero-shot end-to-end (E2E) SLU. To handle unseen semantic labels, SLU tasks are integrated into a question-answering (QA) framework, which prompts the Whisper decoder for semantics deduction. The system is efficiently trained with prefix-tuning, optimising a minimal set of parameters rather than the entire Whisper model. We show that the proposed system achieves a 40.7% absolute gain for slot filling (SLU-F1) on SLURP compared to a recently introduced zero-shot benchmark. Furthermore, it performs comparably to a Whisper-GPT-2 modular system under both in-corpus and cross-corpus evaluation settings, but with a relative 34.8% reduction in model parameters.

* Accepted to Interspeech 2024

DiaLoc: An Iterative Approach to Embodied Dialog Localization

Mar 11, 2024

Chao Zhang, Mohan Li, Ignas Budvytis, Stephan Liwicki

Abstract:Multimodal learning has advanced the performance of many vision-language tasks. However, most existing works in embodied dialog research focus on navigation and leave the localization task understudied. The few existing dialog-based localization approaches assume the availability of the entire dialog prior to localization, which is impractical for deployed dialog-based localization. In this paper, we propose DiaLoc, a new dialog-based localization framework that aligns with real human operator behavior. Specifically, we produce an iterative refinement of location predictions which can visualize current pose beliefs after each dialog turn. DiaLoc effectively utilizes the multimodal data for multi-shot localization, where a fusion encoder fuses vision and dialog information iteratively. We achieve state-of-the-art results on the embodied dialog-based localization task, in single-shot (+7.08% in Acc5@valUnseen) and multi-shot settings (+10.85% in Acc5@valUnseen). DiaLoc narrows the gap between simulation and real-world applications, opening doors for future research on collaborative localization and navigation.

* 12 pages, 10 figures, to appear in CVPR 2024

Self-regularised Minimum Latency Training for Streaming Transformer-based Speech Recognition

Apr 24, 2023

Mohan Li, Rama Doddipatla, Catalin Zorila

Abstract:This paper proposes a self-regularised minimum latency training (SR-MLT) method for streaming Transformer-based automatic speech recognition (ASR) systems. In previous works, latency was optimised by truncating the online attention weights based on the hard alignments obtained from conventional ASR models, without taking into account the potential loss of ASR accuracy. In contrast, here we present a strategy to obtain the alignments as a part of the model training without external supervision. The alignments produced by the proposed method are dynamically regularised on the training data, such that the latency reduction does not result in the loss of ASR accuracy. SR-MLT is applied as a fine-tuning step on pre-trained Transformer models that are based on either monotonic chunkwise attention (MoChA) or cumulative attention (CA) algorithms for online decoding. ASR experiments on the AIShell-1 and Librispeech datasets show that when applied to a decent pre-trained MoChA or CA baseline model, SR-MLT can effectively reduce the latency, with relative gains ranging from 11.8% to 39.5%. Furthermore, we also demonstrate that under certain accuracy levels, the models trained with SR-MLT can achieve lower latency when compared to those supervised using external hard alignments.

* 5 pages, 2 figures, accepted at Interspeech 2022

Non-autoregressive End-to-end Approaches for Joint Automatic Speech Recognition and Spoken Language Understanding

Apr 21, 2023

Mohan Li, Rama Doddipatla

Abstract:This paper presents the use of non-autoregressive (NAR) approaches for joint automatic speech recognition (ASR) and spoken language understanding (SLU) tasks. The proposed NAR systems employ a Conformer encoder that applies connectionist temporal classification (CTC) to transcribe the speech utterance into raw ASR hypotheses, which are further refined with a bidirectional encoder representations from Transformers (BERT)-like decoder. In the meantime, the intent and slot labels of the utterance are predicted simultaneously using the same decoder. Both Mask-CTC and self-conditioned CTC (SC-CTC) approaches are explored for this study. Experiments conducted on the SLURP dataset show that the proposed SC-Mask-CTC NAR system achieves 3.7% and 3.2% absolute gains in SLU metrics and a competitive level of ASR accuracy, when compared to a Conformer-Transformer based autoregressive (AR) model. Additionally, the NAR systems achieve 6x faster decoding speed than the AR baseline.

* 8 pages, 1 figure, accepted at IEEE SLT 2023
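
The abstract above mentions CTC producing raw ASR hypotheses that a decoder later refines. As a small illustration of the standard CTC output rule only (merge consecutive repeated labels, then drop blanks), here is a minimal greedy-collapse sketch. It is not the paper's Mask-CTC or SC-CTC system; the function name `ctc_greedy_collapse`, the blank index of 0, and the frame label ids are assumptions made for the example.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Collapse per-frame argmax labels into an output label sequence.

    A label is emitted only when it differs from the previous frame's
    label and is not the blank, so a blank between two identical
    labels keeps both of them.
    """
    output: List[int] = []
    previous = None
    for label in frame_ids:
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output


# Made-up per-frame argmax ids for an eight-frame utterance.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
hypothesis = ctc_greedy_collapse(frames)
print(hypothesis)  # [3, 3, 5]
```

Because every frame is labelled in one pass rather than token by token, this kind of output can be produced in parallel, which is the general reason non-autoregressive decoding tends to be fast.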

Multiple-hypothesis RNN-T Loss for Unsupervised Fine-tuning and Self-training of Neural Transducer

Jul 29, 2022

Cong-Thanh Do, Mohan Li, Rama Doddipatla

Abstract:This paper proposes a new approach to perform unsupervised fine-tuning and self-training using unlabeled speech data for recurrent neural network (RNN)-Transducer (RNN-T) end-to-end (E2E) automatic speech recognition (ASR) systems. Conventional systems perform fine-tuning/self-training using ASR hypotheses as the targets when using unlabeled audio data and are susceptible to the ASR performance of the base model. Here, in order to alleviate the influence of ASR errors while using unlabeled data, we propose a multiple-hypothesis RNN-T loss that incorporates multiple ASR 1-best hypotheses into the loss function. For the fine-tuning task, ASR experiments on Librispeech show that the multiple-hypothesis approach achieves a relative word error rate (WER) reduction of 14.2% when compared to the single-hypothesis approach, on the test_other set. For the self-training task, ASR models are trained using supervised data from Wall Street Journal (WSJ), Aurora-4 along with CHiME-4 real noisy data as unlabeled data. The multiple-hypothesis approach yields a relative WER reduction of 3.3% on CHiME-4's single-channel real noisy evaluation set when compared with the single-hypothesis approach.

* Accepted to Interspeech 2022

Transformer-based Streaming ASR with Cumulative Attention

Mar 11, 2022

Mohan Li, Shucong Zhang, Catalin Zorila, Rama Doddipatla

Abstract:In this paper, we propose an online attention mechanism, known as cumulative attention (CA), for streaming Transformer-based automatic speech recognition (ASR). Inspired by monotonic chunkwise attention (MoChA) and head-synchronous decoder-end adaptive computation steps (HS-DACS) algorithms, CA triggers the ASR outputs based on the acoustic information accumulated at each encoding timestep, where the decisions are made using a trainable device, referred to as a halting selector. In CA, all the attention heads of the same decoder layer are synchronised to have a unified halting position. This feature effectively alleviates the problem caused by the distinct behaviour of individual heads, which may otherwise give rise to severe latency issues as encountered by MoChA. The ASR experiments conducted on the AIShell-1 and Librispeech datasets demonstrate that the proposed CA-based Transformer system can achieve on-par or better performance with a significant reduction in latency during inference, when compared to other streaming Transformer systems in the literature.

* 5 pages, 1 figure, accepted at ICASSP 2022
