Abstract:Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring original instructions and executing attacker-designated actions. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on the instruction to detect prompt injection attacks without requiring additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.
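As a rough illustration of the detection idea described above, the sketch below scores how much attention a set of important heads keeps on the original instruction and flags an injection when that focus drops. The head set, aggregation, and threshold are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of an attention-based "focus score" in the spirit of
# Attention Tracker; head selection, aggregation, and threshold are assumptions.
import numpy as np

def focus_score(attn, instruction_span, important_heads):
    """attn: array of shape (num_layers, num_heads, seq_len, seq_len),
    attention weights from a single forward pass of the LLM.
    instruction_span: (start, end) token indices of the original instruction.
    important_heads: list of (layer, head) pairs previously identified as
    prone to the distraction effect."""
    start, end = instruction_span
    scores = []
    for layer, head in important_heads:
        # Attention mass that the final (generation) position puts on the
        # original instruction tokens.
        scores.append(attn[layer, head, -1, start:end].sum())
    return float(np.mean(scores))

def is_prompt_injection(attn, instruction_span, important_heads, threshold=0.2):
    # A low focus score suggests attention has been "distracted" away from
    # the original instruction, which we flag as a likely injection.
    return focus_score(attn, instruction_span, important_heads) < threshold
```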
Abstract:Large language models (LLMs) equipped with chain-of-thought (CoT) prompting have shown significant multi-step reasoning capabilities in factual content like mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study utilizes tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and uncovers their low performance. We introduce a trope-wise querying approach to address these challenges and boost the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi-step reasoning, this study shows that CoT can cause hallucinations in narrative content, reducing GPT-4's performance. We also introduce an Adversarial Injection method to embed trope-related text tokens into movie synopses without explicit tropes, revealing CoT's heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions.
Abstract:LiDAR-based 3D object detection is a critical technology for the development of autonomous driving and robotics. However, the high cost of data annotation limits its advancement. We propose a novel and effective active learning (AL) method called Distribution Discrepancy and Feature Heterogeneity (DDFH), which simultaneously considers geometric features and model embeddings, assessing information from both the instance-level and frame-level perspectives. Distribution Discrepancy evaluates the difference and novelty of instances within the unlabeled and labeled distributions, enabling the model to learn efficiently with limited data. Feature Heterogeneity ensures the heterogeneity of intra-frame instance features, maintaining feature diversity while avoiding redundant or similar instances, thus minimizing annotation costs. Finally, multiple indicators are efficiently aggregated using Quantile Transform, providing a unified measure of informativeness. Extensive experiments demonstrate that DDFH outperforms the current state-of-the-art (SOTA) methods on the KITTI and Waymo datasets, effectively reducing the bounding box annotation cost by 56.3% and showing robustness when working with both one-stage and two-stage models.
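The following sketch illustrates the kind of quantile-based aggregation the abstract mentions, using scikit-learn's QuantileTransformer; the specific indicators, equal weighting, and top-k selection rule are assumptions for exposition, not DDFH's exact recipe.

```python
# A minimal sketch of aggregating heterogeneous active-learning indicators
# into one informativeness score via a quantile transform.
import numpy as np
from sklearn.preprocessing import QuantileTransformer

def unified_informativeness(indicators: np.ndarray) -> np.ndarray:
    """indicators: shape (n_frames, n_indicators), e.g. columns for
    distribution discrepancy and feature heterogeneity on different scales."""
    qt = QuantileTransformer(
        n_quantiles=min(1000, len(indicators)),
        output_distribution="uniform",
    )
    # Map every indicator to a comparable [0, 1] rank scale, then average.
    ranks = qt.fit_transform(indicators)
    return ranks.mean(axis=1)

# Example: pick the 10 most informative unlabeled frames for annotation.
rng = np.random.default_rng(0)
scores = unified_informativeness(rng.random((500, 2)))
query_ids = np.argsort(scores)[-10:]
```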
Abstract:Pre-explored Semantic Maps, constructed through prior exploration using visual language models (VLMs), have proven effective as foundational elements for training-free robotic applications. However, existing approaches assume the map's accuracy and do not provide effective mechanisms for revising decisions based on incorrect maps. To address this, we introduce Context-Aware Replanning (CARe), which estimates map uncertainty through confidence scores and multi-view consistency, enabling the agent to revise erroneous decisions stemming from inaccurate maps without requiring additional labels. We demonstrate the effectiveness of our proposed method by integrating it with two modern mapping backbones, VLMaps and OpenMask3D, and observe significant performance improvements in object navigation tasks. More details can be found on the project page: https://carmaps.github.io/supplements/.
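A minimal sketch of how map uncertainty might be estimated from confidence scores and multi-view consistency, as described above; the exact combination rule is an assumption for illustration.

```python
# Hedged sketch: estimate per-cell map uncertainty from cross-view agreement
# and mean VLM confidence, in the spirit of CARe.
from collections import Counter

def map_uncertainty(view_labels, view_confidences):
    """view_labels: predicted semantic label per camera view for one map cell.
    view_confidences: the VLM confidence for each of those predictions."""
    majority_label, majority_count = Counter(view_labels).most_common(1)[0]
    consistency = majority_count / len(view_labels)             # cross-view agreement
    confidence = sum(view_confidences) / len(view_confidences)  # mean confidence
    return 1.0 - consistency * confidence

# Example: three views disagree on whether the cell is a "chair".
u = map_uncertainty(["chair", "chair", "table"], [0.9, 0.8, 0.6])
print(u)  # high enough that the agent might revise its plan or re-observe
```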
Abstract:While existing research often treats long-form videos as extended short videos, we propose a novel approach that more accurately reflects human cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for Long-Form Video Understanding, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels. Second, we propose a Semantics reTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. Extensive experiments demonstrate that BREASE achieves state-of-the-art performance across multiple long video understanding benchmarks in both zero-shot and fully-supervised settings. The project page and code are at: https://joslefaure.github.io/assets/html/hermes.html.
Abstract:Large Language Models (LLMs) have demonstrated effectiveness not only in language tasks but also in video reasoning. This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills: (1) Abstract Perception: understanding and tokenizing abstract concepts in videos, and (2) Long-range Compositional Reasoning: planning and integrating intermediate reasoning steps for understanding long-range videos with numerous frames. Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches. Our experiments show that current methods, including Captioner-Reasoner, Large Multimodal Model Instruction Fine-tuning, and Visual Programming, only marginally outperform a random baseline when tackling the challenges of Abstract Perception and Long-range Compositional Reasoning. To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR), which enhance Visual Programming by fostering role interaction awareness and progressively refining movie contexts and trope queries during reasoning processes, significantly improving performance by 15 F1 points. However, this performance still lags behind human levels (40 vs. 65 F1). Additionally, we introduce a new protocol to evaluate the necessity of Abstract Perception and Long-range Compositional Reasoning for task resolution. This is done by analyzing the code generated through Visual Programming using an Abstract Syntax Tree (AST), thereby confirming the increased complexity of TiM. The dataset and code are available at: https://ander1119.github.io/TiM
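As a small illustration of the AST-based protocol mentioned above, the sketch below parses a toy program of the kind Visual Programming might emit and reports simple structural statistics as a complexity proxy; the specific metrics and the example program are illustrative assumptions.

```python
# Sketch of AST-based complexity analysis of generated programs.
import ast

def code_complexity(program: str):
    tree = ast.parse(program)
    num_nodes = sum(1 for _ in ast.walk(tree))                  # total AST nodes
    num_calls = sum(isinstance(n, ast.Call) for n in ast.walk(tree))  # perception/reasoning calls
    return {"nodes": num_nodes, "calls": num_calls}

# Example: a toy generated program with two module calls.
print(code_complexity("faces = detect_faces(video)\nans = reason(faces)"))
```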
Abstract:We observe that current state-of-the-art (SOTA) methods suffer from a performance imbalance issue in multi-task reinforcement learning (MTRL): while they may achieve impressive performance on average, they perform extremely poorly on a few tasks. To address this, we propose a new and effective method called STARS, which consists of two novel strategies: a shared-unique feature extractor and task-aware prioritized sampling. First, the shared-unique feature extractor learns both shared and task-specific features to enable better synergy of knowledge between different tasks. Second, task-aware prioritized sampling combines a task-aware sampling strategy with prioritized experience replay for efficient learning on tasks with poor performance. We verify the effectiveness and stability of STARS through experiments on the mainstream Meta-World benchmark. The results show that STARS statistically significantly outperforms current SOTA methods and alleviates the performance imbalance issue. In addition, we visualize the learned features to support our claims and enhance the interpretability of STARS.
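A hedged sketch of what task-aware prioritized sampling can look like: tasks with lower recent success rates are sampled more often from the replay buffer. The temperature and exact weighting below are assumptions, not necessarily STARS's published rule.

```python
# Illustrative task-aware sampling: oversample tasks that are performing poorly.
import numpy as np

def task_sampling_probs(success_rates, temperature=1.0):
    """success_rates: recent success rate per task, each in [0, 1]."""
    # Poorly performing tasks get larger priorities.
    priorities = (1.0 - np.asarray(success_rates)) ** temperature
    priorities = np.maximum(priorities, 1e-6)  # keep every task reachable
    return priorities / priorities.sum()

rng = np.random.default_rng(0)
probs = task_sampling_probs([0.95, 0.40, 0.10])  # task 2 is struggling
task_batch = rng.choice(3, size=8, p=probs)      # oversamples the weak task
```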
Abstract:We study reward models for long-horizon manipulation tasks learned from action-free videos and language instructions, a setting we term the visual-instruction correlation (VIC) problem. Recent advances in cross-modality modeling have highlighted the potential of reward modeling through visual and language correlations. However, existing VIC methods struggle to learn rewards for long-horizon tasks due to their lack of sub-stage awareness, difficulty in modeling task complexity, and inadequate object state estimation. To address these challenges, we introduce VICtoR, a novel hierarchical VIC reward model that provides effective reward signals for long-horizon manipulation tasks. VICtoR assesses task progress at multiple levels through a novel stage detector and motion progress evaluator, offering effective guidance for agents learning the task. To validate VICtoR, we conducted extensive experiments in both simulated and real-world environments. The results show that VICtoR outperforms the best existing VIC methods, achieving a 43% improvement in success rates for long-horizon tasks.
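The sketch below illustrates one way a stage detector and a motion progress evaluator could be composed into a monotonically increasing reward, in the spirit of the hierarchical design described above; the composition rule is an illustrative assumption rather than VICtoR's exact formulation.

```python
# Hedged sketch: compose stage detection and within-stage progress into a reward.
def hierarchical_reward(stage_idx, stage_progress, num_stages):
    """stage_idx: index of the detected sub-stage (0-based).
    stage_progress: estimated progress within that stage, in [0, 1]."""
    # Completed stages contribute full credit; the current stage contributes
    # partial credit, so the reward grows monotonically with task progress.
    return (stage_idx + stage_progress) / num_stages

# Example: the agent is halfway through the second of four sub-stages.
print(hierarchical_reward(stage_idx=1, stage_progress=0.5, num_stages=4))  # 0.375
```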
Abstract:Traditional traffic prediction, limited by the scope of sensor data, falls short of comprehensive traffic management. Mobile networks offer a promising alternative through network activity counts, but these counts lack crucial directionality. We therefore present the TeltoMob dataset, featuring undirected telecom counts and corresponding directional flows, for predicting directional mobility flows on roadways. For this task, we propose a two-stage spatio-temporal graph neural network (STGNN) framework: the first stage uses a pre-trained STGNN to process telecom data, while the second stage integrates directional and geographic insights for accurate prediction. Our experiments demonstrate the framework's compatibility with various STGNN models and confirm its effectiveness. We also show how to incorporate the framework into real-world transportation systems, enhancing sustainable urban mobility.
Abstract:Low-light conditions present a significant challenge for machine cognition. In this paper, rather than optimizing models under the assumption that human and machine cognition are correlated, we use zero-reference low-light enhancement to improve the performance of downstream task models. We improve zero-reference low-light enhancement by leveraging the rich visual-linguistic CLIP prior without any paired or unpaired normal-light data, which is laborious and difficult to collect. We propose a simple but effective strategy for learning prompts that guide the enhancement method, and show experimentally that prompts learned without any normal-light data improve image contrast, reduce over-enhancement, and reduce noise over-amplification. We then reuse the CLIP model for semantic guidance via zero-shot open-vocabulary classification, optimizing low-light enhancement for task-based performance rather than human visual perception. Extensive experiments show that the proposed method yields consistent improvements in task-based performance across various datasets, and comparisons against state-of-the-art methods show favorable results across various low-light datasets.
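As a loose illustration of the CLIP-prior prompt guidance described above, the sketch below contrasts an enhanced image's CLIP embedding against learnable positive ("well-lit") and negative ("low-light") prompt embeddings; the encoders are stubbed with random features and the loss form is an assumption, not the paper's exact design.

```python
# Hedged sketch of a prompt-guidance loss in CLIP embedding space.
import torch
import torch.nn.functional as F

def prompt_guidance_loss(image_feat, pos_prompt_feat, neg_prompt_feat, tau=0.07):
    """image_feat: CLIP image embedding of the enhanced image, shape (B, D).
    pos/neg_prompt_feat: learnable prompt embeddings, each shape (D,)."""
    image_feat = F.normalize(image_feat, dim=-1)
    prompts = F.normalize(torch.stack([pos_prompt_feat, neg_prompt_feat]), dim=-1)
    logits = image_feat @ prompts.t() / tau                    # (B, 2) similarities
    target = torch.zeros(len(image_feat), dtype=torch.long)    # index 0 = "well-lit"
    # Pushing enhanced images toward the positive prompt discourages both
    # under- and over-enhancement relative to the learned prompts.
    return F.cross_entropy(logits, target)

# Example with random stand-ins for CLIP features.
loss = prompt_guidance_loss(torch.randn(4, 512),
                            torch.randn(512, requires_grad=True),
                            torch.randn(512, requires_grad=True))
```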