Abstract:Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, a MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-align, we present a progressive training pipeline that aligns the visual and textual feature through sequential coarse-to-fine stages.Additionally, we introduce an ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG) , Event Localization and Captioning (ELC) and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaving multimodal understanding. Our code, data and benchmark will be released at Our code, data and benchmark will be released at https://github.com/appletea233/LLaVA-ST .
Abstract:Beyond diagonal reconfigurable intelligent surface (BD-RIS) is a new architecture for RIS where elements are interconnected to provide more wave manipulation flexibility than traditional single connected RIS, enhancing data rate and coverage. However, channel estimation for BD-RIS is challenging due to the more complex multiple-connection structure involving their scattering elements. To address this issue, this paper proposes a decoupled channel estimation method for BD-RIS that yields separate estimates of the involved channels to enhance the accuracy of the overall combined channel by capitalizing on its Kronecker structure. Starting from a least squares estimate of the combined channel and by properly reshaping the resulting filtered signal, the proposed algorithm resorts to a Khatri-Rao Factorization (KRF) method that teases out the individual channels based on simple rank-one matrix approximation steps. Numerical results show that the proposed decoupled channel estimation yields more accurate channel estimates than the classical least squares scheme.
Abstract:Reconfigurable intelligent surface (RIS) has been envisioned as a key technology in future wireless communication networks to enable smart radio environment. To further enhance the passive beamforming capability of RIS, beyond diagonal (BD)-RIS has been proposed considering reconfigurable interconnections among different RIS elements. BD-RIS has a unique feature that cannot be enabled by conventional diagonal RIS; it can be realized by non-reciprocal circuits and thus enables an asymmetric scattering matrix. This feature provides the capability to break the wireless channel reciprocity, and has the potential to benefit full-duplex (FD) systems. In this paper, we model the BD RIS-assisted FD systems, where the impact of BD-RIS non-reciprocity and that of structural scattering, which refers to the specular reflection generated by RIS when the RIS is turned OFF, are explicitly captured. To assess the benefits of non-reciprocal BD-RIS, we optimise the scattering matrix, precoder and combiner to maximize the DL and UL sum-rates in the FD system. To tackle this optimization problem, we propose an iterative algorithm based on block coordination descent (BCD) and penalty dual decomposition (PDD). Numerical results demonstrate surprising benefits of non-reciprocal BD-RIS that it can achieve much higher DL and UL sum-rates in the FD scenario than reciprocal BD-RIS and conventional diagonal RIS.
Abstract:Reconfigurable intelligent surface (RIS) has been envisioned as a key technology in future wireless communication networks to enable smart radio environment. To further enhance the passive beamforming capability of RIS, beyond diagonal (BD)-RIS has been proposed considering interconnections among different RIS elements. BD-RIS has a unique feature that cannot be enabled by conventional diagonal RIS; it can be realized by non-reciprocal circuits and thus has asymmetric scattering matrix. This feature provides probability to break the wireless channel reciprocity, and thus has potential to benefit the full-duplex (FD) system. In this paper, we model the BD RIS-assisted FD systems, where the impact of BD-RIS non-reciprocity and that of structural scattering, which refers to the virtual direct channel constructed by RIS when the RIS is turned OFF, are explicitly captured. To visualize the analysis, we propose to design the scattering matrix, precoder and combiner to maximize the DL and UL sum-rates in the FD system. To tackle this optimization problem, we propose an iterative algorithm based on block coordination descent (BCD) and penalty dual decomposition (PDD). Numerical results demonstrate surprising benefits of non-reciprocal BD-RIS that it can achieve higher DL and UL sum-rates in the FD scenario than reciprocal BD-RIS and conventional diagonal RIS.
Abstract:Reconfigurable Intelligent Surface (RIS) is a breakthrough technology enabling the dynamic control of the propagation environment in wireless communications through programmable surfaces. To improve the flexibility of conventional diagonal RIS (D-RIS), beyond diagonal RIS (BD-RIS) has emerged as a family of more general RIS architectures. However, D-RIS and BD-RIS have been commonly explored neglecting mutual coupling effects, while the global optimization of RIS with mutual coupling, its performance limits, and scaling laws remain unexplored. This study addresses these gaps by deriving global optimal closed-form solutions for BD-RIS with mutual coupling to maximize the channel gain, specifically fully- and tree-connected RISs. Besides, we provide the expression of the maximum channel gain achievable in the presence of mutual coupling and its scaling law in closed form. By using the derived scaling laws, we analytically prove that mutual coupling increases the channel gain on average under Rayleigh fading channels. Our theoretical analysis, confirmed by numerical simulations, shows that both fully- and tree-connected RISs with mutual coupling achieve the same channel gain upper bound when optimized with the proposed global optimal solutions. Furthermore, we observe that a mutual coupling-unaware optimization of RIS can cause a channel gain degradation of up to 5 dB.
Abstract:Beyond diagonal reconfigurable intelligent surfaces (BD-RIS) is a new advance in RIS techniques that introduces reconfigurable inter-element connections to generate scattering matrices not limited to being diagonal. BD-RIS has been recently proposed and proven to have benefits in enhancing channel gain and enlarging coverage in wireless communications. Uniquely, BD-RIS enables reciprocal and non-reciprocal architectures characterized by symmetric and non-symmetric scattering matrices. However, the performance benefits and new use cases enabled by non-reciprocal BD-RIS for wireless systems remain unexplored. This work takes a first step toward closing this knowledge gap and studies the non-reciprocal BD-RIS in full-duplex systems and its performance benefits over reciprocal counterparts. We start by deriving a general RIS aided full-duplex system model using a multiport circuit theory, followed by a simplified channel model based on physically consistent assumptions. With the considered channel model, we investigate the effect of BD-RIS non-reciprocity and identify the theoretical conditions for reciprocal and non-reciprocal BD-RISs to simultaneously achieve the maximum received power of the signal of interest in the uplink and the downlink. Simulation results validate the theories and highlight the significant benefits offered by non-reciprocal BD-RIS in full-duplex systems. The significant gains are achieved because of the non-reciprocity principle which implies that if a wave hits the non-reciprocal BD-RIS from one direction, the surface behaves differently than if it hits from the opposite direction. This enables an uplink user and a downlink user at different locations to optimally communicate with the same full-duplex base station via a non-reciprocal BD-RIS, which would not be possible with reciprocal surfaces.
Abstract:Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of referred objects given a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment by panoptic segmentation pretraining or CLIP model adaptation. Given the recent progress of text-to-image Diffusion models, several works have shown their capability to achieve fine-grained image-text alignment through cross-attention maps and improved general segmentation performance. However, the direct use of phrase features as static prompts to apply frozen Diffusion models to the PNG task still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. Therefore, we propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features and inject the multimodal cues back, which leverages the fine-grained image-text alignment capability of Diffusion models more sufficiently. In addition, we also design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.
Abstract:In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio and language-referenced video object segmentation, namely AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, which is less robust to spatiotemporal variations due to a lack of video context exploration. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module to instruct GPT-4 to perform two-step temporal-spatial reasoning for sequentially selecting pivot frames and pivot boxes, thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT's temporal-spatial reasoning capacity by guiding GPT to make selections based on a comprehensive understanding of video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module to convert audio signals into language-formatted references, thereby unifying the formats of AVS and RVOS tasks in the same pipeline. Extensive experiments on both tasks show that our training-free AL-Ref-SAM 2 pipeline achieves performances comparable to or even better than fully-supervised fine-tuning methods. The code is available at: https://github.com/appletea233/AL-Ref-SAM2.
Abstract:Beyond diagonal reconfigurable intelligent surfaces (BD-RIS) generalizes and goes beyond conventional diagonal reconfigurable intelligent surfaces (D-RIS) by interconnecting elements to generate beyond diagonal scattering matrices, which significantly strengthen the wireless channels. In this work, we use BD-RIS for passive multiuser beamforming in multiuser multiple-input-single-output (MU-MISO) systems. Specifically, we design the scattering matrix of BD-RIS to either maximize the sum received signal power at the users following maximum ratio transmission (MRT), or to nullify the interference at the users following zero forcing (ZF). Furthermore, we investigate uniform/optimized power allocation and ZF precoding at the base station (BS). Numerical results show that BD-RIS improves the interference nulling capability and sum rate with fewer reflecting elements (REs) compared to D-RIS. In addition, at moderate to high signal to noise ratios (SNRs), passive interference nulling reduces the complexity at the BS by relaxing the need for precoding or water-filling power allocation design. Furthermore, the passive MRT with ZF precoding achieves a tight sum rate performance to the joint design considering MU-MISO scenarios with many REs while maintaining low computational complexity and simplifying the channel estimation.
Abstract:To achieve dexterity comparable to that of humans, robots must intelligently process tactile sensor data. Taxel-based tactile signals often have low spatial-resolution, with non-standardized representations. In this paper, we propose a novel framework, HyperTaxel, for learning a geometrically-informed representation of taxel-based tactile signals to address challenges associated with their spatial resolution. We use this representation and a contrastive learning objective to encode and map sparse low-resolution taxel signals to high-resolution contact surfaces. To address the uncertainty inherent in these signals, we leverage joint probability distributions across multiple simultaneous contacts to improve taxel hyper-resolution. We evaluate our representation by comparing it with two baselines and present results that suggest our representation outperforms the baselines. Furthermore, we present qualitative results that demonstrate the learned representation captures the geometric features of the contact surface, such as flatness, curvature, and edges, and generalizes across different objects and sensor configurations. Moreover, we present results that suggest our representation improves the performance of various downstream tasks, such as surface classification, 6D in-hand pose estimation, and sim-to-real transfer.