Abstract: Most existing GUI agents depend on non-visual inputs such as HTML source code or accessibility trees, which limits their flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), which excel at using vision to ground real-world objects, offer a potential alternative. However, they often struggle to accurately localize GUI elements -- a critical requirement for effective GUI automation -- due to the semantic gap between real-world objects and GUI elements. In this work, we introduce Ponder & Press, a divide-and-conquer framework for general computer control using only visual input. Our approach combines a general-purpose MLLM as an 'interpreter', responsible for translating high-level user instructions into detailed action descriptions, with a GUI-specific MLLM as a 'locator' that precisely locates GUI elements for action placement. By relying on purely visual input, our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications. The Ponder & Press locator outperforms existing models by +22.5% on the ScreenSpot GUI grounding benchmark. Both offline and interactive agent benchmarks across various GUI environments -- including web pages, desktop software, and mobile UIs -- demonstrate that the Ponder & Press framework achieves state-of-the-art performance, highlighting the potential of visual GUI agents. See the project homepage: https://invinciblewyq.github.io/ponder-press-page/
Abstract: For decades, multi-antenna technology has been a key enabler in the evolution of wireless communication systems. In contrast to conventional multi-antenna systems, whose antennas sit at fixed positions, position-flexible antenna systems have been proposed to fully exploit the spatial variation of wireless channels. In this paper, movable antenna (MA) systems are analyzed from channel measurement and modeling to position optimization and performance evaluation. First, a broadband channel measurement system with physical MAs is developed, whose positioning resolution reaches 0.02 mm. A practical two-ray model is constructed based on the channel measurement for a two-dimensional MA system across 32$\times$32 planar port positions at 300 GHz. In light of the measurement results, spatially correlated channel models for the two-dimensional MA system are proposed, which are statistically parameterized by the covariance matrix of the measured channels. Finally, a signal-to-interference-and-noise ratio (SINR)-maximized position selection algorithm is proposed, which achieves 99% of the optimal performance. The spectral efficiency of different MA systems is evaluated and compared for both planar and linear configurations. Extensive results demonstrate the advantage of MAs over fixed-position antennas in coping with multi-path fading and improving the spectral efficiency by 10% in a 300 GHz measured channel.
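As a rough illustration of why port position matters under a two-ray channel, the sketch below evaluates a plane-wave two-ray model on a grid of ports and picks the port with the highest channel gain by exhaustive search. This is not the SINR-maximized algorithm proposed in the paper: it uses channel gain as a stand-in for the single-user, interference-free case, and the plane-wave assumption, the ray parameters, the half-wavelength port spacing, and all function names are hypothetical.

```python
import cmath
import math
from typing import Sequence, Tuple

SPEED_OF_LIGHT = 299_792_458.0  # m/s

# Hypothetical ray description: (amplitude, phase in rad, ux, uy), where
# (ux, uy) are the direction cosines of the ray in the antenna plane.
Ray = Tuple[float, float, float, float]


def two_ray_channel(x_m: float, y_m: float, wavelength_m: float,
                    rays: Sequence[Ray]) -> complex:
    """Narrowband channel at position (x, y) as a sum of plane waves (assumed model)."""
    k = 2.0 * math.pi / wavelength_m  # wavenumber
    return sum(
        a * cmath.exp(1j * (phi + k * (ux * x_m + uy * y_m)))
        for a, phi, ux, uy in rays
    )


def select_best_port(n_x: int, n_y: int, step_m: float, wavelength_m: float,
                     rays: Sequence[Ray]) -> Tuple[int, int, float]:
    """Exhaustively search an n_x-by-n_y port grid for the largest channel gain."""
    best = (0, 0, -1.0)
    for ix in range(n_x):
        for iy in range(n_y):
            h = two_ray_channel(ix * step_m, iy * step_m, wavelength_m, rays)
            gain = abs(h) ** 2
            if gain > best[2]:
                best = (ix, iy, gain)
    return best


if __name__ == "__main__":
    wavelength = SPEED_OF_LIGHT / 300e9  # carrier at 300 GHz
    # Illustrative rays only; these are not measured values.
    rays = [(1.0, 0.0, 0.3, 0.0), (0.5, 1.0, -0.4, 0.2)]
    ix, iy, gain = select_best_port(32, 32, wavelength / 2, wavelength, rays)
    print(f"best port: ({ix}, {iy}), gain: {gain:.3f}")
```

With two rays of amplitudes 1.0 and 0.5, the gain at any port lies between 0.25 and 2.25, so the choice of port can change the received power by nearly an order of magnitude in this toy setting.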
Abstract: Multimodal language models (MLLMs) are increasingly being deployed in real-world environments, which requires them to interpret 3D spaces and comprehend temporal dynamics. Despite their potential, current top models still fall short in understanding spatial and temporal dimensions. We introduce Coarse Correspondence, a simple, training-free, effective, and general-purpose visual prompting method to elicit 3D and temporal understanding in multimodal LLMs. Our method uses a lightweight tracking model to find object correspondences between frames in a video or between sets of image viewpoints. It selects the most frequent object instances and visualizes them in the image using markers with unique IDs. With this simple approach, we achieve state-of-the-art results on 3D understanding benchmarks including ScanQA (+20.5%) and a subset of OpenEQA (+9.7%), and on long-form video benchmarks such as EgoSchema (+6.0%). We also curate a small diagnostic dataset to evaluate whether MLLMs can reason about space from a described viewpoint other than the camera viewpoint. Again, Coarse Correspondence improves spatial perspective-taking abilities, but we highlight that MLLMs still struggle with this task. Together, these results demonstrate that our simple prompting method can significantly aid downstream tasks that require 3D or temporal reasoning.
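The instance-selection step described above can be sketched as follows, assuming a tracker has already produced a list of integer track IDs for each frame. The input format, the tie-breaking rule, and the function names are assumptions made for illustration. Drawing the markers onto the images is left out.

```python
from collections import Counter
from typing import List, Sequence


def select_frequent_instances(frame_tracks: Sequence[Sequence[int]],
                              top_k: int) -> List[int]:
    """Return the IDs of the top_k instances that appear in the most frames."""
    counts: Counter = Counter()
    for track_ids in frame_tracks:
        # Count each instance at most once per frame.
        counts.update(set(track_ids))
    # Sort by descending frame count, breaking ties by smaller ID (assumed rule).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [track_id for track_id, _ in ranked[:top_k]]


def markers_per_frame(frame_tracks: Sequence[Sequence[int]],
                      selected: Sequence[int]) -> List[List[int]]:
    """For each frame, list the selected instance IDs that should get a marker."""
    keep = set(selected)
    return [sorted(set(ids) & keep) for ids in frame_tracks]


if __name__ == "__main__":
    # Hypothetical tracker output: one list of track IDs per frame.
    frames = [[1, 2, 5], [2, 3], [2, 1], [4, 1, 2]]
    selected = select_frequent_instances(frames, top_k=2)
    print(selected)
    print(markers_per_frame(frames, selected))
```

Keeping only the few most frequent instances limits visual clutter: each retained object carries the same ID in every frame where it appears, which is the correspondence cue given to the model.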
Abstract: This paper describes our winning solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, making long video question-answering a challenging task. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency while preserving the information essential for accurate question-answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream, that is capable of processing long videos with limited GPU memory (VRAM). We further use the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge. Code is available at the project homepage: https://invinciblewyq.github.io/vstream-page
Abstract: Benefiting from advances in large language models and cross-modal alignment, existing multi-modal video understanding methods have achieved strong performance in offline scenarios. However, online video streams, one of the most common media forms in the real world, have seldom received attention. Compared to offline videos, the 'dynamic' nature of online video streams poses challenges for the direct application of existing models and introduces new problems, such as the storage of extremely long-term information and the interaction between continuous visual content and 'asynchronous' user questions. Therefore, in this paper we present Flash-VStream, a video-language model that simulates the human memory mechanism. Our model is able to process extremely long video streams in real time and respond to user queries simultaneously. Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption, both of which are closely tied to understanding online streaming video. In addition, given that existing video understanding benchmarks predominantly concentrate on offline scenarios, we propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding. Comparisons with popular existing methods on the proposed benchmark demonstrate the superiority of our method in this challenging setting. To verify the generalizability of our approach, we further evaluate it on existing video understanding benchmarks and achieve state-of-the-art performance in offline scenarios as well. All code, models, and datasets are available at https://invinciblewyq.github.io/vstream-page/
Abstract: Technologies like ultra-massive multiple-input-multiple-output (UM-MIMO) and reconfigurable intelligent surfaces (RISs) are of special interest for meeting the key performance indicators of future wireless systems, including ubiquitous connectivity and lightning-fast data rates. A common feature of these technologies, the extremely large-scale antenna array (ELAA) with hundreds or thousands of antennas, gives rise to near-field (NF) propagation and brings new challenges to channel modeling and characterization. In this paper, a cross-field channel model for ELAA systems is proposed, which improves the statistical model in 3GPP TR 38.901 by refining the propagation path with its first and last bounces and by differentiating the characterization of parameters such as path loss, delay, and angles in the near and far fields. A comprehensive analysis of cross-field boundaries and closed-form expressions of the corresponding NF or far-field (FF) parameters are provided. Furthermore, cross-field experiments carried out in a typical indoor scenario at 300 GHz verify the variation of multi-path component (MPC) parameters across the antenna array and demonstrate the distinction between the channels of different antenna elements. Finally, detailed generation procedures for the cross-field channel model are provided, based on which simulations and analyses of NF probabilities and channel coefficients are conducted for $4\times4$, $8\times8$, $16\times16$, and $9\times21$ uniform planar arrays at different frequency bands.
Abstract: Extremely large-scale antenna array (ELAA) technologies, such as ultra-massive multiple-input-multiple-output (UM-MIMO) and reconfigurable intelligent surfaces (RISs), are emerging to meet the demand of sixth-generation and beyond wireless systems for enhanced coverage and extreme data rates of up to terabits per second. For ELAAs operating at Terahertz (THz) frequencies, the Rayleigh distance expands, and users are likely to be located in both far-field (FF) and near-field (NF) regions. On the one hand, new features such as NF propagation and spatial non-stationarity need to be characterized. On the other hand, the transition of properties near the FF and NF boundary is worth exploring. In this paper, a complete experimental analysis of far- and near-field channel characteristics using a THz virtual antenna array is provided, based on measurements of the multi-input-single-output channel with a virtual uniform planar array (UPA) structure of up to 4096 elements. In particular, non-linear phase change is observed in the NF, and the Rayleigh criterion regarding the maximum phase error is verified. Then, a new cross-field path loss model is proposed, which is compatible with both FF and NF cases based on the UPA structure. In addition, multi-path fading is observed in both NF and FF regions.
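For reference, the Rayleigh distance mentioned above is conventionally defined as 2D^2/λ for an aperture of size D and wavelength λ, which is why it grows with both array size and carrier frequency. The short sketch below computes it, together with the maximum phase error between a spherical and a planar wavefront across the aperture, which is approximately π/8 at the Rayleigh distance. The function names and the example aperture of 0.1 m are illustrative assumptions, not values from the paper.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def rayleigh_distance(aperture_m: float, freq_hz: float) -> float:
    """Classical far-field boundary 2 * D^2 / wavelength, in meters."""
    wavelength = SPEED_OF_LIGHT / freq_hz
    return 2.0 * aperture_m ** 2 / wavelength


def max_phase_error(distance_m: float, aperture_m: float, freq_hz: float) -> float:
    """Phase difference (rad) between the edge and the center of the aperture
    for a broadside point source at distance_m, relative to a plane wave."""
    wavelength = SPEED_OF_LIGHT / freq_hz
    half = aperture_m / 2.0
    extra_path = math.hypot(distance_m, half) - distance_m
    return 2.0 * math.pi * extra_path / wavelength


def is_near_field(distance_m: float, aperture_m: float, freq_hz: float) -> bool:
    """True if the user is closer than the Rayleigh distance."""
    return distance_m < rayleigh_distance(aperture_m, freq_hz)


if __name__ == "__main__":
    d_r = rayleigh_distance(0.1, 300e9)  # hypothetical 0.1 m aperture at 300 GHz
    print(f"Rayleigh distance: {d_r:.2f} m")
    print(f"phase error there: {max_phase_error(d_r, 0.1, 300e9):.4f} rad")
    print(f"pi / 8:            {math.pi / 8:.4f} rad")
```

Under these assumptions a 0.1 m aperture at 300 GHz already has a Rayleigh distance of about 20 m, so typical indoor users would fall in the near field.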
Abstract: To extract channel characteristics and conduct channel modeling in millimeter-wave (mmWave) and Terahertz (THz) bands, accurate estimation of multi-path component (MPC) parameters from measured results is fundamental. However, due to the high frequency and narrow antenna beams in mmWave and THz direction-scan measurements, existing channel parameter estimation algorithms are no longer effective. In this paper, a novel narrow-beam near-field space-alternating generalized expectation-maximization (N2-SAGE) algorithm is proposed, which is derived by carefully considering the features of mmWave and THz direction-scan measurement campaigns, such as near-field propagation, narrow antenna beams, and asynchronous measurements in different scanning directions. The delays of MPCs are calculated using the spherical wave front (SWF) model, which depends on both the delays and angles of MPCs, resulting in a high-dimensional estimation problem. To overcome this, a novel two-phase estimation process is proposed, consisting of a rough estimation phase and an accurate estimation phase. Moreover, considering the narrow antenna beams used for mmWave and THz direction-scan measurements, the use of partial information alleviates the influence of background noise. Additionally, the phases of MPCs in different scanning directions are treated as random variables, which are estimated and reused during the estimation process, making the algorithm immune to possible phase errors. Furthermore, the performance of the proposed N2-SAGE algorithm is validated and compared with existing channel parameter estimation algorithms, based on simulations and measured data. Results show that the proposed N2-SAGE algorithm greatly outperforms existing channel parameter estimation algorithms in terms of estimation accuracy. By using the N2-SAGE algorithm, the channel is characterized more accurately.
Abstract: The Terahertz (0.1-10 THz) band has been envisioned as one of the promising spectrum bands to support ultra-broadband sixth-generation (6G) and beyond communications. In this paper, a wideband channel measurement campaign in an indoor lobby at 306-321 GHz is presented. The measurement system consists of a vector network analyzer (VNA)-based channel sounder and a directional antenna mounted at the receiver to resolve multi-path components (MPCs) in the angular domain. In particular, 21 positions and 3780 channel impulse responses (CIRs) are measured in the lobby, including the line-of-sight (LoS), non-line-of-sight (NLoS) and obstructed-line-of-sight (OLoS) cases. Multi-path propagation is elaborated in terms of clustering results, and the effect of typical scatterers in the indoor lobby scenario in the THz band is explored. Moreover, indoor THz channel characteristics are analyzed in depth. Specifically, best-direction and omni-directional path losses are analyzed by invoking close-in and alpha-beta path loss models. The largest number of clusters is observed in the OLoS case, followed by the NLoS and then the LoS cases. On average, the power dispersion of MPCs is smaller in the LoS case in both temporal and angular domains, compared with the NLoS and OLoS counterparts.
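The two path loss models named above have standard forms. The close-in (CI) model anchors the fit to the free-space path loss at a 1 m reference distance and estimates a single path loss exponent n. The alpha-beta (AB) model fits both a slope and a floating intercept. The sketch below shows least-squares fits for both under the usual textbook definitions. The function names and the synthetic data are assumptions, and no values from the measurement campaign are reproduced.

```python
import math
from typing import Sequence, Tuple

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def fspl_db(freq_hz: float, distance_m: float = 1.0) -> float:
    """Free-space path loss in dB at the given frequency and distance."""
    return 20.0 * math.log10(4.0 * math.pi * freq_hz * distance_m / SPEED_OF_LIGHT)


def fit_close_in(distances_m: Sequence[float], path_losses_db: Sequence[float],
                 freq_hz: float) -> float:
    """CI model: PL(d) = FSPL(f, 1 m) + 10 * n * log10(d). Returns n."""
    anchor = fspl_db(freq_hz, 1.0)
    x = [10.0 * math.log10(d) for d in distances_m]
    y = [pl - anchor for pl in path_losses_db]
    # Least squares through the origin: n = sum(x * y) / sum(x^2).
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def fit_alpha_beta(distances_m: Sequence[float],
                   path_losses_db: Sequence[float]) -> Tuple[float, float]:
    """AB model: PL(d) = 10 * alpha * log10(d) + beta. Returns (alpha, beta)."""
    x = [10.0 * math.log10(d) for d in distances_m]
    x_mean = sum(x) / len(x)
    y_mean = sum(path_losses_db) / len(path_losses_db)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, path_losses_db))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    alpha = sxy / sxx
    return alpha, y_mean - alpha * x_mean


if __name__ == "__main__":
    freq = 310e9  # a frequency inside the 306-321 GHz band
    dists = [2.0, 5.0, 10.0, 20.0]
    # Synthetic, noise-free data generated with n = 2.1 for illustration only.
    losses = [fspl_db(freq) + 10.0 * 2.1 * math.log10(d) for d in dists]
    print(fit_close_in(dists, losses, freq))
    print(fit_alpha_beta(dists, losses))
```

The CI model has one free parameter and a physical anchor, which makes exponents comparable across bands. The AB model usually fits a given data set more tightly, but its intercept has no direct physical meaning.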
Abstract: With abundant bandwidth resources, the Terahertz (0.1-10 THz) band is a promising spectrum to support sixth-generation (6G) and beyond communications. As the foundation of channel studies in this spectrum, channel measurements are ongoing to cover representative 6G communication scenarios and promising THz frequency bands. In this paper, a wideband channel measurement in an L-shaped university campus street is conducted at 306-321 GHz and 356-371 GHz. In particular, ten line-of-sight (LoS) and eight non-line-of-sight (NLoS) points are measured at the two frequency bands. In total, 6480 channel impulse responses (CIRs) are obtained from the measurement, based on which multi-path propagation in the L-shaped roadway in the THz band is elaborated to identify major scatterers such as walls and vehicles in the environment and their impact on multi-path components (MPCs). Furthermore, outdoor THz channel characteristics in the two frequency bands are analyzed, including path losses, shadow fading, cluster parameters, delay spread and angular spread. Compared with counterparts in similar outdoor scenarios at lower frequencies, the results verify the sparsity of MPCs at THz frequencies and indicate smaller power spreads in both temporal and spatial domains in the THz band.
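The delay spread listed among the analyzed characteristics is typically reported as the RMS delay spread, the power-weighted standard deviation of the MPC delays. A minimal sketch of that standard definition follows. The input format (per-MPC delays in seconds and powers in dB) and the function names are assumptions, and the example values are synthetic.

```python
import math
from typing import Sequence


def db_to_linear(power_db: float) -> float:
    """Convert a power value from dB to linear scale."""
    return 10.0 ** (power_db / 10.0)


def rms_delay_spread(delays_s: Sequence[float], powers_db: Sequence[float]) -> float:
    """Power-weighted standard deviation of the MPC delays, in seconds."""
    powers = [db_to_linear(p) for p in powers_db]
    total = sum(powers)
    mean_delay = sum(p * t for p, t in zip(powers, delays_s)) / total
    second_moment = sum(p * t * t for p, t in zip(powers, delays_s)) / total
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(second_moment - mean_delay ** 2, 0.0))


if __name__ == "__main__":
    # Synthetic example: a strong path plus two weaker, later arrivals.
    delays = [50e-9, 62e-9, 90e-9]
    powers = [-100.0, -112.0, -118.0]
    print(f"RMS delay spread: {rms_delay_spread(delays, powers) * 1e9:.2f} ns")
```

A sparse channel dominated by one strong path yields a small RMS delay spread, which matches the small temporal power spread noted in the abstract.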