Abstract:Graphical User Interface (GUI) action grounding is a critical step in GUI automation that maps language instructions to actionable elements on GUI screens. Most recent works on GUI action grounding leverage large GUI datasets to fine-tune MLLMs. However, the fine-tuning data typically covers only a limited set of GUI environments, and we find that the resulting models deteriorate in novel environments. We argue that when inference is known to involve novel environments, i.e., environments not used during the earlier fine-tuning, GUI grounding models should be further aligned to those environments to reveal their full potential. To realize this, we first propose GUI-Bee, an MLLM-based autonomous agent, to collect high-quality, environment-specific data through exploration, and then continuously fine-tune GUI grounding models with the collected data. Our agent leverages a novel Q-value-Incentive In-Context Reinforcement Learning (Q-ICRL) method to optimize exploration efficiency and data quality. Additionally, we introduce NovelScreenSpot, a benchmark for testing how well such data aligns GUI action grounding models to novel environments, and we demonstrate the effectiveness of the data collected by GUI-Bee in our experiments. Furthermore, we conduct an ablation study to validate that Q-ICRL enhances the efficiency of GUI-Bee. Project page: https://gui-bee.github.io
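To make the exploration loop concrete, below is a minimal, hypothetical sketch of Q-value-guided action selection in the spirit of Q-ICRL. The state/action string encoding, the `propose_actions` stand-in for the MLLM, and the reward and update rule are all illustrative assumptions, not the paper's actual in-context formulation.

```python
import random
from collections import defaultdict

# Q-value table over (screen state, candidate action) pairs; states and
# actions are plain strings here purely for illustration.
q_values = defaultdict(float)

def propose_actions(screen_state):
    """Stand-in for the MLLM that proposes candidate actions on a screen."""
    return [f"click:{i}" for i in range(3)]  # hypothetical candidates

def select_action(screen_state, epsilon=0.1):
    """Q-value-guided selection: exploit high-Q actions, explore otherwise."""
    candidates = propose_actions(screen_state)
    if random.random() < epsilon:
        return random.choice(candidates)
    return max(candidates, key=lambda a: q_values[(screen_state, a)])

def update_q(screen_state, action, reward, alpha=0.5):
    """Incremental Q update from the observed exploration outcome."""
    key = (screen_state, action)
    q_values[key] += alpha * (reward - q_values[key])

# One exploration step: act, observe the outcome, and keep successful
# (instruction, element) pairs as environment-specific grounding data.
action = select_action("login_screen")
update_q("login_screen", action, reward=1.0)
```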
Abstract:As real-world knowledge is constantly evolving, ensuring the timeliness and accuracy of a model's knowledge is crucial. This has made knowledge editing in large language models increasingly important. However, existing knowledge editing methods face several challenges, including coupled parameter localization, imprecise localization, and a lack of dynamic interaction across layers. In this paper, we propose a novel knowledge editing method called Knowledge Neuronal Ensemble (KNE). A knowledge neuronal ensemble represents a group of neurons encoding specific knowledge; editing it as a unit mitigates the frequent parameter modification caused by coupling in parameter localization. The KNE method improves the precision and accuracy of parameter localization by computing gradient attribution scores for each parameter at each layer. During editing, only the gradients and losses associated with the knowledge neuronal ensemble are computed, and error backpropagation is performed accordingly, ensuring dynamic interaction and collaborative updates among parameters. Experimental results on three widely used knowledge editing datasets show that KNE significantly improves the accuracy of knowledge editing and matches, or even exceeds, the best baseline methods on portability and locality metrics.
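As a rough illustration of the two-step recipe the abstract describes (gradient-attribution scoring, then backpropagation restricted to the selected ensemble), here is a minimal PyTorch sketch. The |weight × gradient| attribution score, the top-k ensemble size, and the single-linear-layer stand-in are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in for one transformer layer
x, target = torch.randn(4, 16), torch.randn(4, 16)

# Step 1: gradient attribution -- score each parameter by |weight * grad|.
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
scores = (model.weight * model.weight.grad).abs()

# Step 2: keep the top-scoring parameters as the knowledge neuronal ensemble.
k = 32
mask = torch.zeros_like(scores, dtype=torch.bool)
mask.view(-1)[scores.view(-1).topk(k).indices] = True

# Step 3: edit -- backpropagate normally, but zero gradients outside the
# ensemble so only its parameters receive the update.
model.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
model.weight.grad *= mask
with torch.no_grad():
    model.weight -= 1e-2 * model.weight.grad
```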
Abstract:Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.
Abstract:Learning a self-supervised Monocular Depth Estimation (MDE) model that generalizes well remains significantly challenging. Despite the success of adversarial augmentation in improving generalization for supervised learning, naively incorporating it into self-supervised MDE models can cause over-regularization and severe performance degradation. In this paper, we conduct a qualitative analysis and illuminate the main causes: (i) inherent sensitivity in the UNet-like depth network and (ii) a dual optimization conflict caused by over-regularization. To tackle these issues, we propose a general adversarial training framework, named Stabilized Conflict-optimization Adversarial Training (SCAT), which integrates adversarial data augmentation into self-supervised MDE methods to achieve a balance between stability and generalization. Specifically, we devise an effective scaling depth network that tunes the coefficients of the long skip connections and effectively stabilizes the training process. We then propose a conflict gradient surgery strategy, which progressively integrates the adversarial gradient and optimizes the model toward a conflict-free direction. Extensive experiments on five benchmarks demonstrate that SCAT achieves state-of-the-art performance and significantly improves the generalization capability of existing self-supervised MDE methods.
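The conflict gradient surgery step can be illustrated with a short PyTorch sketch: when the adversarial gradient opposes the task gradient (negative dot product), its conflicting component is projected out before the two are combined. The projection below follows the common PCGrad-style construction, and the fixed mixing weight `lam` is an assumption; the paper's progressive integration schedule is not reproduced here.

```python
import torch

def surgery(g_task: torch.Tensor, g_adv: torch.Tensor, lam: float = 1.0):
    """Combine task and adversarial gradients along a conflict-free
    direction: if they conflict (negative dot product), remove the
    component of g_adv that opposes g_task before summing."""
    dot = torch.dot(g_adv, g_task)
    if dot < 0:  # conflict detected
        g_adv = g_adv - dot / g_task.norm().pow(2) * g_task
    return g_task + lam * g_adv

# Toy example with flattened gradients from the clean and adversarial losses.
g_clean = torch.tensor([1.0, 0.0])
g_adversarial = torch.tensor([-0.5, 1.0])  # partially conflicting
print(surgery(g_clean, g_adversarial))
```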
Abstract:Image restoration (IR) refers to the process of improving the visual quality of images by removing degradations such as noise, blur, and weather effects. Traditional IR methods typically target specific types of degradation, which limits their effectiveness in real-world scenarios with complex distortions. In response to this challenge, the all-in-one image restoration (AiOIR) paradigm has emerged, offering a unified framework that adeptly addresses multiple degradation types. These models enhance both convenience and versatility by adaptively learning degradation-specific features while leveraging shared knowledge across diverse corruptions. In this review, we delve into AiOIR methodologies, emphasizing their architectural innovations and learning paradigms. We systematically categorize prevalent approaches and critically assess the challenges these models encounter, proposing future research directions to advance this dynamic field. Our paper begins with an introduction to the foundational concepts of AiOIR models, followed by a categorization of cutting-edge designs based on factors such as prior knowledge and generalization capability. Next, we highlight key advancements in AiOIR, aiming to inspire further inquiry and innovation within the community. To facilitate a robust evaluation of existing methods, we collate and summarize commonly used datasets, implementation details, and evaluation metrics. Additionally, we present an objective comparison of open-sourced methods, providing valuable insights for researchers and practitioners alike. This paper stands as the first comprehensive and insightful review of AiOIR. A related repository is available at https://github.com/Harbinzzy/All-in-One-Image-Restoration-Survey.
Abstract:Current state-of-the-art optical flow methods are mostly based on dense all-pairs cost volumes. However, as image resolution increases, the computational and spatial complexity of constructing these cost volumes grows at a quartic rate, making such methods impractical for high-resolution images. In this paper, we propose a novel Hybrid Cost Volume for memory-efficient optical flow, named HCV. To construct HCV, we first propose a Top-k strategy that separates the 4D cost volume into two global 3D cost volumes. These volumes significantly reduce memory usage while retaining a substantial amount of matching information. We further introduce a local 4D cost volume with a local search space to supplement the local information in HCV. Based on HCV, we design a memory-efficient optical flow network, named HCVFlow. Compared to recurrent flow methods based on all-pairs cost volumes, our HCVFlow significantly reduces memory consumption while maintaining high accuracy. We validate the effectiveness and efficiency of our method on the Sintel and KITTI datasets and on real-world 4K (2160×3840) images. Extensive experiments show that HCVFlow has very low memory usage and outperforms other memory-efficient methods in accuracy. The code is publicly available at https://github.com/gangweiX/HCVFlow.
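A minimal sketch of the Top-k reduction idea, assuming cosine-similarity correlation features: the dense 4D all-pairs volume is collapsed by keeping only the k strongest matches per source pixel. How HCV then organizes these matches into its two global 3D volumes is specific to the paper and not shown here.

```python
import torch
import torch.nn.functional as F

B, C, H, W = 1, 64, 32, 48
feat1 = F.normalize(torch.randn(B, C, H, W), dim=1)
feat2 = F.normalize(torch.randn(B, C, H, W), dim=1)

# All-pairs correlation: for each pixel in feat1, similarity to every
# pixel in feat2 -- a 4D volume, flattened here to (B, H, W, H*W).
corr = torch.einsum("bchw,bcuv->bhwuv", feat1, feat2).reshape(B, H, W, H * W)

# Top-k strategy: keep only the k best matches per source pixel, turning
# the quartic-size volume into compact per-pixel value/index volumes.
k = 8
topk_vals, topk_idx = corr.topk(k, dim=-1)  # each of shape (B, H, W, k)
print(topk_vals.shape, topk_idx.shape)
```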
Abstract:To mitigate distance reduction attacks in Ultra-Wide Band (UWB) ranging, this paper proposes a secure ranging scheme based on a random time-hopping mechanism that introduces no redundant signaling overhead. Additionally, a secure ranging strategy is designed for backward compatibility with existing standards such as IEEE 802.15.4a/z, combined with an attack detection scheme. The effectiveness and feasibility of the proposed strategy are demonstrated through both simulation and experimental results against the Ghost Peak attack introduced by Patrick Leu et al. The random time-hopping mechanism is verified to reduce the success rate of distance reduction attacks to less than 0.01%, significantly enhancing the security of UWB ranging.
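Since the abstract does not detail how the hopping pattern is derived, the following is a purely hypothetical sketch of the idea: both ranging sides derive the same pseudo-random pulse-slot offsets from a shared secret, so an attacker who cannot predict the offsets cannot place early fake pulses in the correct slots. The key-derivation choice (SHA-256 seeding a PRNG) and all parameter names are illustrative assumptions.

```python
import hashlib
import random

def hop_sequence(shared_key: bytes, session_id: int,
                 n_slots: int, n_pulses: int):
    """Derive a pseudo-random slot offset for each ranging pulse from a
    secret shared by both ranging sides (hypothetical construction)."""
    seed = hashlib.sha256(shared_key + session_id.to_bytes(4, "big")).digest()
    rng = random.Random(seed)
    return [rng.randrange(n_slots) for _ in range(n_pulses)]

# Both sides derive the same hopping pattern; the verifier only accepts
# pulses that arrive in the expected slots, defeating early-pulse injection.
tx_hops = hop_sequence(b"shared-secret", session_id=7, n_slots=16, n_pulses=8)
rx_hops = hop_sequence(b"shared-secret", session_id=7, n_slots=16, n_pulses=8)
assert tx_hops == rx_hops
```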
Abstract:With the finalization of the IEEE 802.15.4z standard, a variety of ranging threats, represented by the Ghost Peak attack, have raised concerns regarding the security of Ultra-Wide Band (UWB) systems. Based on channel reciprocity, this paper proposes a low-complexity attack detection scheme that compares Channel Impulse Response (CIR) features of both ranging sides, utilizing an autoencoder for data compression and feature extraction. Taking the Ghost Peak attack as an example, this paper demonstrates the effectiveness, feasibility, and generalizability of the proposed scheme through simulation and experimental validation. The proposed scheme achieves an attack detection success rate of over 99% and can be implemented in current systems at low cost.
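A minimal sketch of the detection idea, assuming a simple fully connected autoencoder over CIR magnitude profiles: both sides' CIRs are encoded to compact features, and a large feature distance violates channel reciprocity and flags an attack. The architecture, input length, and threshold are illustrative assumptions; in practice the autoencoder would be trained on benign CIRs and the threshold calibrated on benign exchanges.

```python
import torch
import torch.nn as nn

class CIRAutoencoder(nn.Module):
    """Compresses a CIR magnitude profile to a low-dimensional feature."""
    def __init__(self, cir_len=128, latent=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(cir_len, 32), nn.ReLU(),
                                     nn.Linear(32, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(),
                                     nn.Linear(32, cir_len))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Detection: encode the CIRs observed by initiator and responder; under
# channel reciprocity their features should agree, so a large feature
# distance flags an injected (e.g. Ghost Peak) signal.
ae = CIRAutoencoder()
cir_initiator = torch.randn(1, 128)
cir_responder = cir_initiator + 0.01 * torch.randn(1, 128)  # reciprocal channel
dist = (ae.encoder(cir_initiator) - ae.encoder(cir_responder)).norm()
is_attack = dist > 1.0  # threshold calibrated on benign ranging exchanges
```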
Abstract:Large language models (LLMs) have recently achieved state-of-the-art performance across various tasks, yet due to their large computational requirements, they struggle to meet strict latency and power demands. Deep neural network (DNN) quantization has traditionally addressed these limitations by converting models to low-precision integer formats. Recently, however, alternative formats such as Normal Float (NF4) have been shown to consistently increase model accuracy, albeit at the cost of increased chip area. In this work, we first conduct a large-scale analysis of LLM weights and activations across 30 networks and conclude that most distributions follow a Student's t-distribution. We then derive Student Float (SF4), a new format that is theoretically optimal with respect to this distribution and improves over NF4 across modern LLMs, for example increasing the average accuracy on LLaMA2-7B by 0.76% across tasks. Using this format as a high-accuracy reference, we then propose augmenting E2M1 with two variants of supernormal support for higher model accuracy. Finally, we explore the quality and performance frontier across 11 datatypes, including non-traditional formats like Additive-Powers-of-Two (APoT), by evaluating their model accuracy and hardware complexity. We discover a Pareto curve composed of INT4, E2M1, and E2M1 with supernormal support, which offers a continuous tradeoff between model accuracy and chip area. For example, E2M1 with supernormal support increases the accuracy of Phi-2 by up to 2.19% with 1.22% area overhead, enabling more LLM-based applications to run at four bits.
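The NF4-style construction behind such distribution-matched formats places quantization levels at quantiles of an assumed weight distribution; a sketch of the analogous Student's t construction is below. The degrees of freedom, tail trimming, and normalization are illustrative assumptions, not the paper's derivation of the theoretically optimal SF4 levels.

```python
import numpy as np
from scipy.stats import t as student_t

def student_float_levels(n_levels=16, df=5.0):
    """Place quantization levels at evenly spaced quantiles of a Student's
    t-distribution (the NF4 recipe with the normal swapped for a
    t-distribution), then normalize to [-1, 1]."""
    # Trim the tails slightly so the extreme quantiles stay finite.
    probs = np.linspace(0.02, 0.98, n_levels)
    levels = student_t.ppf(probs, df)
    return levels / np.abs(levels).max()

print(student_float_levels())  # 16 levels for a 4-bit format
```

To quantize a weight, one would scale it by the per-block absolute maximum and snap it to the nearest of these levels; heavier tails (smaller `df`) spread more levels toward large-magnitude outliers.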
Abstract:Semantic communication is regarded as a technology that can surpass the Shannon limit of bit transmission and effectively enhance transmission efficiency. However, current approaches that directly map content to transmission symbols are difficult to deploy in practice, significantly limiting the development of semantic communication. To address this challenge, we propose a hybrid bit and semantic communication system, named HybridBSC, in which encoded semantic information is inserted into bit information and transmitted via conventional digital communication systems using the same spectrum resources. The system can be easily deployed on existing communication architectures to achieve joint bit and semantic information transmission. In particular, we design a semantic insertion and extraction scheme to implement this strategy. Furthermore, we conduct experimental validation on a Pluto-based software-defined radio (SDR) platform over a real wireless channel, demonstrating that the proposed strategy can simultaneously transmit semantic and bit information.
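The abstract does not specify the insertion scheme, so the following is a purely hypothetical byte-level illustration of inserting a semantic payload into a conventional bit stream at agreed positions and extracting it at the receiver; the real HybridBSC scheme operates on encoded physical-layer signals, not application bytes, and the interleaving period is an invented parameter.

```python
def insert_semantic(bit_payload: bytes, semantic_payload: bytes, period: int = 8):
    """Hypothetical insertion: interleave one semantic byte after every
    `period` bytes of the conventional bit stream."""
    out = bytearray()
    sem = iter(semantic_payload)
    for i, b in enumerate(bit_payload):
        out.append(b)
        if (i + 1) % period == 0:
            out.append(next(sem, 0))  # pad with 0 if semantic bytes run out
    return bytes(out)

def extract_semantic(stream: bytes, period: int = 8):
    """Receiver side: split the stream back into bit and semantic parts
    using the same agreed interleaving period."""
    bits, sem = bytearray(), bytearray()
    for i, b in enumerate(stream):
        (sem if (i + 1) % (period + 1) == 0 else bits).append(b)
    return bytes(bits), bytes(sem)

# Round trip: the semantic payload rides inside the ordinary bit stream.
stream = insert_semantic(bytes(range(16)), b"\xAA\xBB", period=8)
bits, sem = extract_semantic(stream, period=8)
assert sem == b"\xAA\xBB" and bits == bytes(range(16))
```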