Abstract: Recently, visual grounding and multi-sensor settings have been incorporated into perception systems for terrestrial autonomous driving and Unmanned Surface Vehicles (USVs), yet the high complexity of modern learning-based multi-sensor visual grounding models prevents them from being deployed on USVs in real-world settings. To this end, we design a low-power multi-task model named NanoMVG for waterway embodied perception, guiding both a camera and a 4D millimeter-wave radar to locate specific object(s) through natural language. NanoMVG can perform box-level and mask-level visual grounding tasks simultaneously. Compared to other visual grounding models, NanoMVG achieves highly competitive performance on the WaterVG dataset, particularly in harsh environments, and offers ultra-low power consumption for long endurance.
Abstract: Perceiving waterways according to human intent is important for the autonomous navigation and operation of Unmanned Surface Vehicles (USVs) in water environments. Inspired by visual grounding, we introduce WaterVG, the first visual grounding dataset designed for USV-based waterway perception guided by human prompts. WaterVG encompasses prompts describing multiple targets, with instance-level annotations including bounding boxes and masks. Notably, WaterVG includes 11,568 samples with 34,987 referred targets, whose prompts integrate both visual and radar characteristics. This text-guided two-sensor pattern gives text prompts a finer granularity, drawing on both the visual and radar features of referred targets. Moreover, we propose a low-power visual grounding model, Potamoi, a multi-task model with a well-designed Phased Heterogeneous Modality Fusion (PHMF) mode that includes Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA). Specifically, ARW extracts the required radar features to fuse with vision for prompt alignment. MHSCA is an efficient fusion module with a remarkably small parameter count and few FLOPs that fuses the scenario context captured by the two sensors with linguistic features and performs well on visual grounding tasks. Comprehensive experiments and evaluations have been conducted on WaterVG, where Potamoi achieves state-of-the-art performance compared with its counterparts.
Abstract: Choosing the appropriate scale is crucial for building an effective and informative representation of a complex system. Scientists carefully choose the scales of their experiments to extract the variables that describe the causal relationships in the system. They have found that the coarse (macro) scale is sometimes more causal and informative than the many-parameter (micro) observations. The phenomenon in which causality emerges through coarse-graining is called Causal Emergence (CE). Based on information theory, a number of recent works have shown quantitatively that CE does occur when coarse-graining a micro model into a macro model. However, existing works have not discussed why and when CE happens. We quantitatively analyze the redistribution of uncertainties under coarse-graining and suggest that this redistribution is the cause of causal emergence. We further analyze the thresholds that determine whether CE happens. From the regularity of the transition probability matrix (TPM) of discrete systems, we derive mathematical expressions for the model properties and compute the threshold values for different operations. The results provide critical and specific conditions for CE as helpful suggestions for choosing a proper coarse-graining operation. The results also provide a new way to better understand the nature of causality and causal emergence.
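To make the idea of causal emergence under coarse-graining concrete, the following is a minimal sketch, not the method of the abstract above. It assumes that causal strength is measured by effective information (EI), a measure commonly used in the causal emergence literature; the abstract itself does not name its measure. The function names, the 8-state toy TPM, and the grouping are all hypothetical choices made for illustration.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def effective_information(tpm):
    """EI in bits (assumed measure): entropy of the average row minus the average row entropy.

    This corresponds to intervening uniformly on all current states.
    """
    n = len(tpm)
    avg_row = [sum(row[j] for row in tpm) / n for j in range(len(tpm[0]))]
    return entropy(avg_row) - sum(entropy(row) for row in tpm) / n


def coarse_grain(tpm, groups):
    """Build a macro TPM: average over micro states in the source group,
    sum over micro states in the target group."""
    return [
        [sum(tpm[i][j] for i in src for j in dst) / len(src) for dst in groups]
        for src in groups
    ]


# Hypothetical micro system: states 0..6 move uniformly among themselves,
# state 7 always stays at 7.
micro = [[1 / 7] * 7 + [0.0] for _ in range(7)] + [[0.0] * 7 + [1.0]]

# Hypothetical coarse-graining: merge states 0..6 into one macro state.
groups = [list(range(7)), [7]]
macro = coarse_grain(micro, groups)

ei_micro = effective_information(micro)
ei_macro = effective_information(macro)
causal_emergence = ei_macro - ei_micro  # positive means the macro scale is more causal

print(f"EI(micro) = {ei_micro:.4f} bits")
print(f"EI(macro) = {ei_macro:.4f} bits")
print(f"CE        = {causal_emergence:.4f} bits")
```

In this toy system the merged micro states are noisy among themselves, and coarse-graining absorbs that uncertainty into a single deterministic macro state, so the macro EI exceeds the micro EI. This is one simple instance of the redistribution of uncertainties that the abstract discusses.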