Chalmers University of Technology
Abstract:We present TUMTraffic-VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TUMTraffic-VideoQA unifies three essential tasks-multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding-within a cohesive evaluation framework. We further introduce the TUMTraffic-Qwen baseline model, enhanced with visual token sampling strategies, providing valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset's complexity, highlight the limitations of existing models, and position TUMTraffic-VideoQA as a robust foundation for advancing research in intelligent transportation systems. The dataset and benchmark are publicly available to facilitate further exploration.
Abstract:Modeling spatial heterogeneity in the data generation process is essential for understanding and predicting geographical phenomena. Despite their prevalence in geospatial tasks, neural network models usually assume spatial stationarity, which could limit their performance in the presence of spatial process heterogeneity. By allowing model parameters to vary over space, several approaches have been proposed to incorporate spatial heterogeneity into neural networks. However, current geographically weighting approaches are ineffective on graph neural networks, yielding no significant improvement in prediction accuracy. We assume the crux lies in the over-fitting risk brought by a large number of local parameters. Accordingly, we propose to model spatial process heterogeneity at the regional level rather than at the individual level, which largely reduces the number of spatially varying parameters. We further develop a heuristic optimization procedure to learn the region partition adaptively in the process of model training. Our proposed spatial-heterogeneity-aware graph convolutional network, named RegionGCN, is applied to the spatial prediction of county-level vote share in the 2016 US presidential election based on socioeconomic attributes. Results show that RegionGCN achieves significant improvement over the basic and geographically weighted GCNs. We also offer an exploratory analysis tool for the spatial variation of non-linear relationships through ensemble learning of regional partitions from RegionGCN. Our work contributes to the practice of Geospatial Artificial Intelligence (GeoAI) in tackling spatial heterogeneity.
Abstract:Human mobility is a fundamental aspect of social behavior, with broad applications in transportation, urban planning, and epidemic modeling. However, for decades new mathematical formulas to model mobility phenomena have been scarce and usually discovered by analogy to physical processes, such as the gravity model and the radiation model. These sporadic discoveries are often thought to rely on intuition and luck in fitting empirical data. Here, we propose a systematic approach that leverages symbolic regression to automatically discover interpretable models from human mobility data. Our approach finds several well-known formulas, such as the distance decay effect and classical gravity models, as well as previously unknown ones, such as an exponential-power-law decay that can be explained by the maximum entropy principle. By relaxing the constraints on the complexity of model expressions, we further show how key variables of human mobility are progressively incorporated into the model, making this framework a powerful tool for revealing the underlying mathematical structures of complex social phenomena directly from observational data.
Abstract:Human trust plays a crucial role in the effectiveness of human-robot collaboration. Despite its significance, the development and maintenance of an optimal trust level are obstructed by the complex nature of influencing factors and their mechanisms. This study investigates the effects of cognitive load on human trust within the context of a hybrid human-robot collaboration task. An experiment is conducted where the humans and the robot, acting as team members, collaboratively construct pyramids with differentiated levels of task complexity. Our findings reveal that cognitive load exerts diverse impacts on human trust in the robot. Notably, there is an increase in human trust under conditions of high cognitive load. Furthermore, the rewards for performance are substantially higher in tasks with high cognitive load compared to those with low cognitive load, and a significant correlation exists between human trust and the failure risk of performance in tasks with low and medium cognitive load. By integrating interdependent task steps, this research emphasizes the unique dynamics of hybrid human-robot collaboration scenarios. The insights gained not only contribute to understanding how cognitive load influences trust but also assist developers in optimizing collaborative target selection and designing more effective human-robot interfaces in such environments.
Abstract:Social platforms, while facilitating access to information, have also become saturated with a plethora of fake news, resulting in negative consequences. Automatic multimodal fake news detection is a worthwhile pursuit. Existing multimodal fake news datasets only provide binary labels of real or fake. However, real news is alike, while each fake news is fake in its own way. These datasets fail to reflect the mixed nature of various types of multimodal fake news. To bridge the gap, we construct an attributing multi-granularity multimodal fake news detection dataset \amg, revealing the inherent fake pattern. Furthermore, we propose a multi-granularity clue alignment model \our to achieve multimodal fake news detection and attribution. Experimental results demonstrate that \amg is a challenging dataset, and its attribution setting opens up new avenues for future research.
Abstract:The upper mid-band (or FR3, spanning 6-24 GHz) is a crucial frequency range for next-generation mobile networks, offering a favorable balance between coverage and spectrum efficiency. From another perspective, the systems operating in the near-field in both indoor environment and outdoor environments can support line-of-sight multiple input multiple output (MIMO) communications and be beneficial from the FR3 bands. In this paper, a novel method is proposed to measure the near-field parameters leveraging a recently developed reflection model where the near-field paths can be described by their image points. We show that these image points can be accurately estimated via triangulation from multiple measurements with a small number of antennas in each measurement, thus affording a low-cost procedure for near-field multi-path parameter extraction. A preliminary experimental apparatus is presented comprising 2 transmit and 2 receive antennas mounted on a linear track to measure the 2x2 MIMO channel at various displacements. The system uses a recently-developed wideband radio frequency (RF) transceiver board with fast frequency switching, an FPGA for fast baseband processing, and a new parameter extraction method to recover paths and spherical characteristics from the multiple 2x2 measurements.
Abstract:The following paper provides a multi-band channel measurement analysis on the frequency range (FR)3. This study focuses on the FR3 low frequencies 6.5 GHz and 8.75 GHz with a setup tailored to the context of integrated sensing and communication (ISAC), where the data are collected with and without the presence of a target. A method based on multiple signal classification (MUSIC) is used to refine the delays of the channel impulse response estimates. The results reveal that the channel at the lower frequency 6.5 GHz has additional distinguishable multipath components in the presence of the target, while the one associated with the higher frequency 8.75 GHz has more blockage. The set of results reported in this paper serves as a benchmark for future multi-band studies in the FR3 spectrum.
Abstract:Embodied reference understanding is crucial for intelligent agents to predict referents based on human intention through gesture signals and language descriptions. This paper introduces the Attention-Dynamic DINO, a novel framework designed to mitigate misinterpretations of pointing gestures across various interaction contexts. Our approach integrates visual and textual features to simultaneously predict the target object's bounding box and the attention source in pointing gestures. Leveraging the distance-aware nature of nonverbal communication in visual perspective taking, we extend the virtual touch line mechanism and propose an attention-dynamic touch line to represent referring gesture based on interactive distances. The combination of this distance-aware approach and independent prediction of the attention source, enhances the alignment between objects and the gesture represented line. Extensive experiments on the YouRefIt dataset demonstrate the efficacy of our gesture information understanding method in significantly improving task performance. Our model achieves 76.4% accuracy at the 0.25 IoU threshold and, notably, surpasses human performance at the 0.75 IoU threshold, marking a first in this domain. Comparative experiments with distance-unaware understanding methods from previous research further validate the superiority of the Attention-Dynamic Touch Line across diverse contexts.
Abstract:Digital twinning is becoming increasingly vital in the design and real-time control of future wireless networks by providing precise cost-effective simulations, predictive insights, and real-time data integration. This paper explores the application of digital twinning in optimizing wireless communication systems within urban environments, where building arrangements can critically impact network performances. We develop a digital twin platform to simulate and analyze how factors such as building positioning, base station placement, and antenna design influence wireless propagation. The ray-tracing software package of Matlab is compared with Remcom Wireless InSite. Using a realistic radiation pattern of a base transceiver station (BTS) antenna, ray tracing simulations for signal propagation and interactions in urban landscapes are then extensively examined. By analyzing radio heat maps alongside antenna patterns, we gain valuable insights into optimizing wireless deployment strategies. This study highlights the potential of digital twinning as a critical tool for urban planners and network engineers.
Abstract:The integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection. However, this combination often struggles with capturing semantic information effectively. Moreover, relying solely on point features within regions of interest can lead to information loss and limitations in local feature representation. To tackle these challenges, we propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN). PVAFN leverages an attention mechanism to improve multi-modal feature fusion during the feature extraction phase. In the refinement stage, it utilizes a multi-pooling strategy to integrate both multi-scale and region-specific information effectively. The point-voxel attention mechanism adaptively combines point cloud and voxel-based Bird's-Eye-View (BEV) features, resulting in richer object representations that help to reduce false detections. Additionally, a multi-pooling enhancement module is introduced to boost the model's perception capabilities. This module employs cluster pooling and pyramid pooling techniques to efficiently capture key geometric details and fine-grained shape structures, thereby enhancing the integration of local and global features. Extensive experiments on the KITTI and Waymo datasets demonstrate that the proposed PVAFN achieves competitive performance. The code and models will be available.