Abstract: Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of the referred objects given a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment through panoptic segmentation pretraining or CLIP model adaptation. Given the recent progress of text-to-image Diffusion models, several works have shown their capability to achieve fine-grained image-text alignment through cross-attention maps and to improve general segmentation performance. However, directly using phrase features as static prompts for frozen Diffusion models on the PNG task still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. Therefore, we propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet that dynamically updates phrase prompts with image features and injects the multimodal cues back, thereby exploiting the fine-grained image-text alignment capability of Diffusion models more fully. In addition, we design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.
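The abstract builds on the observation that the cross-attention maps of Diffusion models align image regions with text tokens. As a rough illustration only, the NumPy sketch below computes a generic scaled dot-product cross-attention map with flattened image features as queries and phrase-token features as keys. The feature sizes and the random projection matrices `w_q` and `w_k` are hypothetical placeholders, and the sketch does not implement EIPA or MLMA.

```python
import numpy as np

def cross_attention_map(image_feats, phrase_feats, w_q, w_k):
    """Return a (num_pixels, num_tokens) attention map whose rows sum to 1.

    image_feats:  (num_pixels, d_img) flattened spatial features (queries)
    phrase_feats: (num_tokens, d_txt) text-token features (keys)
    w_q, w_k:     hypothetical projections into a shared d-dim space
    """
    q = image_feats @ w_q                          # (P, d)
    k = phrase_feats @ w_k                         # (T, d)
    logits = q @ k.T / np.sqrt(q.shape[-1])        # scaled dot product
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)  # softmax over tokens

rng = np.random.default_rng(0)
h, w, d_img, d_txt, d = 8, 8, 32, 16, 24           # illustrative sizes
img = rng.standard_normal((h * w, d_img))
txt = rng.standard_normal((5, d_txt))
attn = cross_attention_map(img, txt,
                           rng.standard_normal((d_img, d)),
                           rng.standard_normal((d_txt, d)))
# Column t of attn, reshaped to the feature grid, is token t's spatial heat map.
token_maps = attn.T.reshape(5, h, w)
```

Each per-token map in `token_maps` is the kind of spatial heat map that alignment arguments of this sort rely on.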
Abstract: In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio- and language-referenced video object segmentation, namely the AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object in a single frame and SAM 2 to segment the identified object throughout the video, which is less robust to spatiotemporal variations because it does not explore video context. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module that instructs GPT-4 to perform two-step temporal-spatial reasoning for sequentially selecting pivot frames and pivot boxes, thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT's temporal-spatial reasoning capacity by guiding GPT to make selections based on a comprehensive understanding of the video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module to convert audio signals into language-formatted references, thereby unifying the formats of the AVS and RVOS tasks in the same pipeline. Extensive experiments on both tasks show that our training-free AL-Ref-SAM 2 pipeline achieves performance comparable to or even better than fully-supervised fine-tuning methods. The code is available at: https://github.com/appletea233/AL-Ref-SAM2.
Abstract: Beyond diagonal reconfigurable intelligent surface (BD-RIS) generalizes and goes beyond conventional diagonal reconfigurable intelligent surface (D-RIS) by interconnecting elements to generate beyond diagonal scattering matrices, which significantly strengthens the wireless channels. In this work, we use BD-RIS for passive multiuser beamforming in multiuser multiple-input single-output (MU-MISO) systems. Specifically, we design the scattering matrix of BD-RIS either to maximize the sum received signal power at the users following maximum ratio transmission (MRT), or to nullify the interference at the users following zero forcing (ZF). Furthermore, we investigate uniform/optimized power allocation and ZF precoding at the base station (BS). Numerical results show that BD-RIS improves the interference nulling capability and sum rate with fewer reflecting elements (REs) than D-RIS. In addition, at moderate to high signal-to-noise ratios (SNRs), passive interference nulling reduces the complexity at the BS by relaxing the need for precoding or water-filling power allocation design. Moreover, in MU-MISO scenarios with many REs, passive MRT with ZF precoding achieves a sum rate close to that of the joint design while maintaining low computational complexity and simplifying channel estimation.
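For reference, the abstract compares passive interference nulling with zero-forcing (ZF) precoding at the BS. The NumPy sketch below shows conventional ZF precoding with uniform power allocation over a given effective channel `H`. It does not model the BD-RIS scattering matrix; the random channel, the dimensions, and the helper names `zf_precoder` and `sum_rate` are illustrative assumptions.

```python
import numpy as np

def zf_precoder(H, total_power):
    """Zero-forcing precoder for an effective K x N downlink channel H (K <= N).

    Each column is normalized and given an equal share total_power / K, which
    keeps the ZF property that H @ W is diagonal (no inter-user interference).
    """
    W = H.conj().T @ np.linalg.inv(H @ H.conj().T)     # right pseudo-inverse of H
    W = W / np.linalg.norm(W, axis=0, keepdims=True)   # unit-norm beam per user
    return W * np.sqrt(total_power / H.shape[0])       # uniform power allocation

def sum_rate(H, W, noise_power):
    """Sum of log2(1 + SINR_k) over users, treating interference as noise."""
    G = H @ W                                  # G[k, j]: gain of beam j at user k
    signal = np.abs(np.diag(G)) ** 2
    interference = np.sum(np.abs(G) ** 2, axis=1) - signal
    return np.sum(np.log2(1 + signal / (interference + noise_power)))

rng = np.random.default_rng(1)
K, N = 3, 4                                    # users, BS antennas (illustrative)
H = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)
W = zf_precoder(H, total_power=10.0)
G = H @ W
rate = sum_rate(H, W, noise_power=1.0)
```

Replacing the equal split with water-filling over the diagonal gains of `G` would give the "optimized power allocation" variant the abstract mentions.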
Abstract: To achieve dexterity comparable to that of humans, robots must intelligently process tactile sensor data. Taxel-based tactile signals often have low spatial resolution and non-standardized representations. In this paper, we propose a novel framework, HyperTaxel, for learning a geometrically-informed representation of taxel-based tactile signals to address the challenges associated with their spatial resolution. We use this representation and a contrastive learning objective to encode and map sparse low-resolution taxel signals to high-resolution contact surfaces. To address the uncertainty inherent in these signals, we leverage joint probability distributions across multiple simultaneous contacts to improve taxel hyper-resolution. We evaluate our representation against two baselines and present results suggesting that it outperforms them. Furthermore, we present qualitative results demonstrating that the learned representation captures geometric features of the contact surface, such as flatness, curvature, and edges, and generalizes across different objects and sensor configurations. Moreover, we present results suggesting that our representation improves the performance of various downstream tasks, such as surface classification, 6D in-hand pose estimation, and sim-to-real transfer.
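The abstract pairs the learned representation with a contrastive learning objective. A common choice for such objectives is the InfoNCE loss; the sketch below assumes it purely for illustration, with random vectors standing in for the outputs of hypothetical taxel and contact-surface encoders. It is not the HyperTaxel objective, which the abstract does not specify.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: row i of `positives` is the match for row i of `anchors`;
    every other row in the batch acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                        # cosine sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # cross-entropy on diagonal

rng = np.random.default_rng(0)
taxel_emb = rng.standard_normal((8, 16))                  # hypothetical taxel embeddings
surface_emb = taxel_emb + 0.05 * rng.standard_normal((8, 16))  # matching surfaces
loss_matched = info_nce(taxel_emb, surface_emb)
# Shifting by one row pairs every taxel with the wrong surface.
loss_mismatched = info_nce(taxel_emb, np.roll(surface_emb, 1, axis=0))
```

Minimizing this loss pulls each taxel embedding toward its own contact surface and pushes it away from the other surfaces in the batch.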
Abstract: This paper addresses the channel estimation problem for beyond diagonal reconfigurable intelligent surface (BD-RIS) from a tensor decomposition perspective. We first show that the received pilot signals can be arranged as a three-way tensor, allowing us to recast the cascaded channel estimation problem as a block Tucker decomposition problem that yields decoupled estimates for the involved channel matrices while offering a substantial performance gain over the conventional (matrix-based) least squares (LS) estimation method. More specifically, we develop two solutions to the problem. The first is a closed-form solution that extracts the channel estimates via a block Tucker Kronecker factorization (BTKF), which boils down to solving a set of parallel rank-one matrix approximation problems. Exploiting this low-rank property yields a noise rejection gain over the standard LS estimation scheme while allowing the two involved channels to be estimated separately. The second solution is based on a block Tucker alternating least squares (BTALS) algorithm that directly estimates the involved channel matrices through an iterative procedure. We discuss uniqueness and identifiability issues and their implications for training design. We also propose a tensor-based design of the BD-RIS training tensor for each algorithm that ensures unique decoupled channel estimates up to trivial scaling ambiguities. Our numerical results shed light on the tradeoffs offered by the BTKF and BTALS methods. Specifically, while the former enjoys fast and parallel extraction of the channel estimates in closed form, the latter has a more flexible training design, allowing for a significantly reduced training overhead compared to the state-of-the-art LS method.
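BTKF is described as reducing to a set of parallel rank-one matrix approximation problems. The NumPy sketch below illustrates that building block for a single Kronecker product, using the well-known rearrangement of Van Loan and Pitsianis that turns `kron(A, B)` into a rank-one matrix. It is an assumption-level illustration rather than the paper's BTKF algorithm: the block Tucker structure, the BD-RIS training design, and noise are omitted, and the matrix sizes are arbitrary.

```python
import numpy as np

def kron_factor(X, shape_a, shape_b):
    """Estimate A, B from X ~ kron(A, B) via a rank-one approximation.

    Rearranging X so that each block X_ij = A[i, j] * B becomes one row turns
    kron(A, B) into the rank-one matrix vec(A) vec(B)^T; the dominant singular
    triplet then yields the factors up to a scalar ambiguity (c * A, B / c).
    """
    (m1, n1), (m2, n2) = shape_a, shape_b
    R = X.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vh = np.linalg.svd(R)
    a = np.sqrt(s[0]) * U[:, 0]            # dominant left singular vector
    b = np.sqrt(s[0]) * Vh[0, :]           # dominant right singular vector (conj.)
    return a.reshape(m1, n1), b.reshape(m2, n2)

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))
B = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
X = np.kron(A, B)
A_hat, B_hat = kron_factor(X, A.shape, B.shape)
scale = A_hat[0, 0] / A[0, 0]              # the unresolvable scalar ambiguity
```

With noisy observations, keeping only the dominant singular triplet discards the noise components outside that rank-one subspace, which conveys the intuition behind a noise rejection gain; the `scale` check mirrors the trivial scaling ambiguity.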
Abstract: This paper investigates the capability of a passive Reconfigurable Intelligent Surface (RIS) to redistribute the singular values of point-to-point Multiple-Input Multiple-Output (MIMO) channels for achieving power and rate gains. We depart from the conventional Diagonal (D)-RIS with a diagonal phase shift matrix and adopt a Beyond Diagonal (BD) architecture that offers greater wave manipulation flexibility through element-wise connections. Specifically, we first provide shaping insights by characterizing the channel singular value regions attainable by D-RIS and BD-RIS via a novel geodesic optimization. Analytical singular value bounds are then derived to explore their shaping limits in typical deployment scenarios. As a by-product, we tackle the BD-RIS-aided MIMO rate maximization problem with a locally optimal Alternating Optimization (AO) approach and a shaping-inspired low-complexity approach. Results show that, compared to D-RIS, BD-RIS significantly improves the dynamic range of all channel singular values, the trade-off in manipulating them, and thus the channel power and achievable rate. These observations become more pronounced as the number of RIS elements and the MIMO dimensions increase. Of particular interest, BD-RIS is shown to activate multi-stream transmission at lower transmit power than D-RIS, hence achieving the asymptotic Degrees of Freedom (DoF) at low Signal-to-Noise Ratio (SNR) thanks to its higher flexibility in shaping the distribution of channel singular values.
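The claim that BD-RIS activates multi-stream transmission at lower transmit power can be related to the standard water-filling solution for a point-to-point MIMO channel: the stream carried by singular value sigma_i receives power only once the water level exceeds its floor noise_power / sigma_i^2, so raising the weaker singular values lowers that threshold. The NumPy sketch below is a generic illustration under that textbook model with made-up singular values; it is not the paper's AO or shaping-inspired algorithm.

```python
import numpy as np

def water_filling(gains, total_power, noise_power=1.0):
    """Water-filling over parallel streams with gains g_i = sigma_i^2.

    Returns per-stream powers (sorted by decreasing gain) and the rate
    sum_i log2(1 + p_i * g_i / noise_power).
    """
    g = np.sort(np.asarray(gains, dtype=float))[::-1]
    floors = noise_power / g                     # ascending "floor" heights
    for k in range(len(g), 0, -1):               # try the k strongest streams
        level = (total_power + floors[:k].sum()) / k
        if level > floors[k - 1]:                # all k streams get p_i > 0
            break
    powers = np.maximum(level - floors, 0.0)
    return powers, np.sum(np.log2(1 + powers * g / noise_power))

# Same power budget, two hypothetical channels: a larger weak singular value
# lowers the power at which the second stream switches on.
p_spread, r_spread = water_filling([4.0, 0.25], total_power=1.0)  # sigma = 2, 0.5
p_flat, r_flat = water_filling([4.0, 1.0], total_power=1.0)       # sigma = 2, 1
```

Here `p_spread` puts all power on one stream, while `p_flat` already splits it as 0.875 and 0.125 across two streams.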
Abstract: This paper proposes a cooperative integrated sensing and communication network (Co-ISACNet) adopting a hybrid beamforming (HBF) architecture, which improves both radar sensing and communication performance. The main contributions of this work are four-fold. First, we introduce a novel cooperative sensing method for the considered Co-ISACNet, followed by a comprehensive analysis of this method. This analysis mathematically verifies the benefits of Co-ISACNet and provides insightful design guidelines. Second, to show the benefits of Co-ISACNet, we propose to jointly design the HBF to maximize the network communication capacity while satisfying a beampattern similarity constraint for radar sensing, which results in a high-dimensional, non-convex problem. Third, to facilitate the joint design, we propose a novel distributed optimization framework based on the proximal gradient method and the alternating direction method of multipliers (ADMM), namely PANDA. Fourth, we adopt the proposed PANDA framework to solve the joint HBF design problem for the Co-ISACNet. With PANDA, all access points (APs) optimize the HBF in parallel, where each AP requires only local channel state information and limited message exchange with the other APs. This framework significantly reduces the computational complexity and thus has pronounced benefits in practical scenarios. Simulation results verify the effectiveness of the proposed algorithm compared with the conventional centralized algorithm and show remarkable improvements in radar sensing and communication performance from deploying Co-ISACNet.
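PANDA lets every AP optimize in parallel using only local channel state information plus limited message exchange. That communication pattern is the one used by textbook consensus ADMM, sketched below in NumPy on a hypothetical distributed least-squares problem in which each agent (standing in for an AP) holds private data `(A_i, b_i)`. This only illustrates the ADMM side of the idea; it omits the proximal gradient steps, the HBF and beampattern similarity constraints, and is not the PANDA algorithm.

```python
import numpy as np

def consensus_admm(local_data, rho=1.0, iters=1000):
    """Solve min_x sum_i 0.5 * ||A_i x - b_i||^2 by consensus ADMM.

    local_data: list of (A_i, b_i) pairs, one per agent. Each agent only ever
    uses its own pair; the only shared quantities are the local iterates,
    which are averaged into the consensus variable z.
    """
    m, n = len(local_data), local_data[0][0].shape[1]
    x = np.zeros((m, n))
    u = np.zeros((m, n))                         # scaled dual variables
    z = np.zeros(n)
    # Each agent forms its local system (A_i^T A_i + rho I) once.
    systems = [(A.T @ A + rho * np.eye(n), A.T @ b) for A, b in local_data]
    for _ in range(iters):
        for i, (M, Atb) in enumerate(systems):   # parallelizable local updates
            x[i] = np.linalg.solve(M, Atb + rho * (z - u[i]))
        z = np.mean(x + u, axis=0)               # message exchange / averaging
        u += x - z                               # local dual updates
    return z

rng = np.random.default_rng(3)
x_true = rng.standard_normal(3)
data = []
for _ in range(4):                               # four hypothetical agents
    A = rng.standard_normal((10, 3))
    data.append((A, A @ x_true + 0.01 * rng.standard_normal(10)))

z = consensus_admm(data)
A_all = np.vstack([A for A, _ in data])
b_all = np.concatenate([b for _, b in data])
x_central = np.linalg.lstsq(A_all, b_all, rcond=None)[0]
```

After enough iterations `z` should agree with the centralized least-squares solution `x_central`, even though no agent ever sees another agent's data.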
Abstract: The multi-sector intelligent surface (IS), benefiting from a smarter wave manipulation capability, has been shown to enhance channel gain and offer full-space coverage in communications. However, the benefits of multi-sector IS in wireless sensing remain unexplored. This paper introduces the application of multi-sector IS to wireless sensing/localization. Specifically, we propose a new self-sensing system, where an active source controller uses the multi-sector IS geometry to reflect/scatter the emitted signals towards the entire space, thereby achieving full-space coverage for wireless sensing. Additionally, dedicated sensors are installed aligned with the IS elements in each sector; they collect echo signals from the target and cooperate to sense the target angle. In this context, we develop a maximum likelihood estimator of the target angle for the proposed multi-sector IS self-sensing system, along with the corresponding theoretical limit given by the Cramér-Rao bound. The analysis reveals that the advantages of the multi-sector IS self-sensing system stem from two aspects: enhancing the probing power on targets (thereby improving power efficiency) and increasing the rate of change of the received signal with respect to the target angle (thereby enhancing the transceiver's sensitivity to target angles). Finally, our analysis and simulations confirm that the multi-sector IS self-sensing system, particularly the 4-sector architecture, achieves full-space sensing capability beyond that of the single-sector IS configuration. Furthermore, as in communications, employing directive antenna patterns on each sector's IS elements and sensors significantly enhances sensing capabilities. This enhancement originates from both improved power efficiency and improved target angle sensitivity, the former also being observed in communications and the latter being unique to sensing.
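For intuition about the estimator side of the abstract, the sketch below implements a textbook grid-search maximum likelihood estimator of a single target angle for a uniform linear array with half-wavelength spacing, white Gaussian noise, and an unknown complex amplitude in each snapshot. These are simplifying assumptions made here: the multi-sector IS geometry, the sensor placement, and directive element patterns from the paper are not modeled, and all parameters are illustrative.

```python
import numpy as np

def steering_vector(theta, num_elements, spacing=0.5):
    """Uniform linear array response; spacing in wavelengths, theta in radians."""
    n = np.arange(num_elements)
    return np.exp(2j * np.pi * spacing * n * np.sin(theta))

def ml_angle(Y, grid):
    """Grid-search ML angle estimate for one target whose complex amplitude is
    unknown in every snapshot (columns of Y), under white Gaussian noise:
    maximize sum_t |a(theta)^H y_t|^2 / ||a(theta)||^2."""
    N = Y.shape[0]
    scores = [np.sum(np.abs(steering_vector(t, N).conj() @ Y) ** 2) / N
              for t in grid]
    return grid[int(np.argmax(scores))]

rng = np.random.default_rng(4)
N, T = 8, 50                                       # sensors, snapshots (illustrative)
theta_true = np.deg2rad(20.0)
amps = rng.standard_normal(T) + 1j * rng.standard_normal(T)
noise = 0.1 * (rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))
Y = np.outer(steering_vector(theta_true, N), amps) + noise
grid = np.deg2rad(np.linspace(-90.0, 90.0, 1801))  # 0.1-degree search grid
theta_hat = ml_angle(Y, grid)
```

Maximizing over each snapshot's amplitude in closed form is what reduces the likelihood to this beamformer-style score over the angle alone.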
Abstract: Catastrophic Forgetting (CF) refers to a model forgetting previously acquired knowledge when learning new data. It compromises the effectiveness of large language models (LLMs) during fine-tuning, yet its underlying causes have not been thoroughly investigated. This paper takes the first step toward revealing the direct link between the flatness of the model loss landscape and the extent of CF in LLMs. Based on this, we introduce sharpness-aware minimization to mitigate CF by flattening the loss landscape. Experiments on three widely used fine-tuning datasets, spanning different model scales, demonstrate the effectiveness of our method in alleviating CF. Analyses show that our method complements existing anti-forgetting strategies, further enhancing the resistance of LLMs to CF.
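Sharpness-aware minimization (SAM, Foret et al.) replaces the plain gradient step with a two-stage update: it first perturbs the weights toward higher loss and then descends using the gradient at the perturbed point, which is intended to bias training toward flatter minima. The NumPy sketch below shows that generic update on a toy two-dimensional quadratic; the learning rate, the radius `rho`, and the loss are illustrative assumptions, and the sketch says nothing about the paper's LLM fine-tuning setup.

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.05, rho=0.05):
    """One sharpness-aware minimization (SAM) update.

    1. Move to the first-order worst-case point w + eps within an L2 ball of
       radius rho (eps points along the normalized gradient).
    2. Update w with the gradient evaluated at that perturbed point.
    """
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return w - lr * loss_grad(w + eps)

# Toy quadratic loss 0.5 * w^T H w with one sharp and one flat direction.
H = np.diag([10.0, 1.0])

def loss(w):
    return 0.5 * w @ H @ w

def grad(w):
    return H @ w

w = np.array([1.0, 1.0])
for _ in range(200):
    w = sam_step(w, grad)
final_loss = loss(w)
```

Because the perturbation has a fixed radius, SAM settles into a small neighborhood of the minimum here rather than converging exactly, which is expected for this update rule.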
Abstract: This paper investigates a hardware-efficient massive multiple-input multiple-output integrated sensing and communication (MIMO-ISAC) system with 1-bit analog-to-digital converters (ADCs)/digital-to-analog converters (DACs). The proposed system, referred to as 1BitISAC, employs 1-bit DACs at the ISAC transmitter and 1-bit ADCs at the sensing receiver, achieving significant reductions in power consumption and hardware cost. For such systems, two joint transceiver designs are considered, i.e., i) a quality-of-service-constrained 1BitISAC design and ii) a quality-of-detection-constrained design, and the corresponding problems are formulated. To address these problems, we thoroughly analyze the radar detection performance after 1-bit ADC quantization and the communication bit error rate. This analysis yields new design insights and leads to unique radar and communication metrics, which enable us to simplify the original problems and employ majorization-minimization and integer linear programming methods to solve them. Numerical results are provided to validate the performance analysis of the proposed 1BitISAC and to compare it with other ISAC configurations. The superiority of the proposed 1BitISAC system in balancing ISAC performance and energy efficiency is also demonstrated.
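In a 1-bit DAC or ADC only the signs of the in-phase and quadrature components survive. The NumPy sketch below shows such a quantizer and empirically checks the Bussgang gain sqrt(2/pi) for a CN(0, 1) input, a standard linearization often used when analyzing 1-bit systems. It is an assumption-level illustration of the quantization model only, not the paper's detection or bit error rate analysis, and the sample size is arbitrary.

```python
import numpy as np

def one_bit_quantize(x):
    """Complex 1-bit quantizer: keep only the signs of the in-phase and
    quadrature components and scale the output to unit magnitude."""
    re = np.where(x.real >= 0, 1.0, -1.0)
    im = np.where(x.imag >= 0, 1.0, -1.0)
    return (re + 1j * im) / np.sqrt(2)

rng = np.random.default_rng(5)
n = 200_000
x = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)  # CN(0, 1)
y = one_bit_quantize(x)

# Empirical Bussgang gain: y = B * x + d with distortion d uncorrelated with x.
B_hat = np.real(np.mean(y * x.conj()) / np.mean(np.abs(x) ** 2))
B_theory = np.sqrt(2 / np.pi)
```

The roughly 0.8 gain, together with the distortion term `d`, is what turns the nonlinear 1-bit front end into an equivalent linear model for analysis.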