Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Siru Zhong

Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting

Feb 06, 2025

Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, Yuxuan Liang

Abstract:Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments across diverse datasets demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting.

* 19 pages

Via

Access Paper or Ask Questions

Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting

Nov 14, 2024

Weilin Ruan, Wenzhuo Wang, Siru Zhong, Wei Chen, Li Liu, Yuxuan Liang

Figure 1 for Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting

Figure 2 for Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting

Figure 3 for Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting

Figure 4 for Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting

Abstract:Predicting spatio-temporal traffic flow presents significant challenges due to complex interactions between spatial and temporal factors. Existing approaches often address these dimensions in isolation, neglecting their critical interdependencies. In this paper, we introduce the Spatio-Temporal Unitized Model (STUM), a unified framework designed to capture both spatial and temporal dependencies while addressing spatio-temporal heterogeneity through techniques such as distribution alignment and feature fusion. It also ensures both predictive accuracy and computational efficiency. Central to STUM is the Adaptive Spatio-temporal Unitized Cell (ASTUC), which utilizes low-rank matrices to seamlessly store, update, and interact with space, time, as well as their correlations. Our framework is also modular, allowing it to integrate with various spatio-temporal graph neural networks through components such as backbone models, feature extractors, residual fusion blocks, and predictive modules to collectively enhance forecasting outcomes. Experimental results across multiple real-world datasets demonstrate that STUM consistently improves prediction performance with minimal computational cost. These findings are further supported by hyperparameter optimization, pre-training analysis, and result visualization. We provide our source code for reproducibility at https://anonymous.4open.science/r/STUM-E4F0.

Via

Access Paper or Ask Questions

Navigating Spatio-Temporal Heterogeneity: A Graph Transformer Approach for Traffic Forecasting

Aug 20, 2024

Jianxiang Zhou, Erdong Liu, Wei Chen, Siru Zhong, Yuxuan Liang

Figure 1 for Navigating Spatio-Temporal Heterogeneity: A Graph Transformer Approach for Traffic Forecasting

Figure 2 for Navigating Spatio-Temporal Heterogeneity: A Graph Transformer Approach for Traffic Forecasting

Figure 3 for Navigating Spatio-Temporal Heterogeneity: A Graph Transformer Approach for Traffic Forecasting

Figure 4 for Navigating Spatio-Temporal Heterogeneity: A Graph Transformer Approach for Traffic Forecasting

Abstract:Traffic forecasting has emerged as a crucial research area in the development of smart cities. Although various neural networks with intricate architectures have been developed to address this problem, they still face two key challenges: i) Recent advancements in network designs for modeling spatio-temporal correlations are starting to see diminishing returns in performance enhancements. ii) Additionally, most models do not account for the spatio-temporal heterogeneity inherent in traffic data, i.e., traffic distribution varies significantly across different regions and traffic flow patterns fluctuate across various time slots. To tackle these challenges, we introduce the Spatio-Temporal Graph Transformer (STGormer), which effectively integrates attribute and structure information inherent in traffic data for learning spatio-temporal correlations, and a mixture-of-experts module for capturing heterogeneity along spaital and temporal axes. Specifically, we design two straightforward yet effective spatial encoding methods based on the graph structure and integrate time position encoding into the vanilla transformer to capture spatio-temporal traffic patterns. Additionally, a mixture-of-experts enhanced feedforward neural network (FNN) module adaptively assigns suitable expert layers to distinct patterns via a spatio-temporal gating network, further improving overall prediction accuracy. Experiments on five real-world datasets demonstrate that STGormer achieves state-of-the-art performance.

Via

Access Paper or Ask Questions

Predicting Parking Availability in Singapore with Cross-Domain Data: A New Dataset and A Data-Driven Approach

May 29, 2024

Huaiwu Zhang, Yutong Xia, Siru Zhong, Kun Wang, Zekun Tong, Qingsong Wen, Roger Zimmermann, Yuxuan Liang

Figure 1 for Predicting Parking Availability in Singapore with Cross-Domain Data: A New Dataset and A Data-Driven Approach

Figure 2 for Predicting Parking Availability in Singapore with Cross-Domain Data: A New Dataset and A Data-Driven Approach

Figure 3 for Predicting Parking Availability in Singapore with Cross-Domain Data: A New Dataset and A Data-Driven Approach

Figure 4 for Predicting Parking Availability in Singapore with Cross-Domain Data: A New Dataset and A Data-Driven Approach

Abstract:The increasing number of vehicles highlights the need for efficient parking space management. Predicting real-time Parking Availability (PA) can help mitigate traffic congestion and the corresponding social problems, which is a pressing issue in densely populated cities like Singapore. In this study, we aim to collectively predict future PA across Singapore with complex factors from various domains. The contributions in this paper are listed as follows: (1) A New Dataset: We introduce the \texttt{SINPA} dataset, containing a year's worth of PA data from 1,687 parking lots in Singapore, enriched with various spatial and temporal factors. (2) A Data-Driven Approach: We present DeepPA, a novel deep-learning framework, to collectively and efficiently predict future PA across thousands of parking lots. (3) Extensive Experiments and Deployment: DeepPA demonstrates a 9.2% reduction in prediction error for up to 3-hour forecasts compared to existing advanced models. Furthermore, we implement DeepPA in a practical web-based platform to provide real-time PA predictions to aid drivers and inform urban planning for the governors in Singapore. We release the dataset and source code at https://github.com/yoshall/SINPA.

* Accepted by IJCAI 2024 (Multi-Year Track On AI And Social Good with ~20% acceptance rate)

Via

Access Paper or Ask Questions

UrbanCross: Enhancing Satellite Image-Text Retrieval with Cross-Domain Adaptation

Apr 22, 2024

Siru Zhong, Xixuan Hao, Yibo Yan, Ying Zhang, Yangqiu Song, Yuxuan Liang

Abstract:Urbanization challenges underscore the necessity for effective satellite image-text retrieval methods to swiftly access specific information enriched with geographic semantics for urban applications. However, existing methods often overlook significant domain gaps across diverse urban landscapes, primarily focusing on enhancing retrieval performance within single domains. To tackle this issue, we present UrbanCross, a new framework for cross-domain satellite image-text retrieval. UrbanCross leverages a high-quality, cross-domain dataset enriched with extensive geo-tags from three countries to highlight domain diversity. It employs the Large Multimodal Model (LMM) for textual refinement and the Segment Anything Model (SAM) for visual augmentation, achieving a fine-grained alignment of images, segments and texts, yielding a 10% improvement in retrieval performance. Additionally, UrbanCross incorporates an adaptive curriculum-based source sampler and a weighted adversarial cross-domain fine-tuning module, progressively enhancing adaptability across various domains. Extensive experiments confirm UrbanCross's superior efficiency in retrieval and adaptation to new urban environments, demonstrating an average performance increase of 15% over its version without domain adaptation mechanisms, effectively bridging the domain gap.

Via

Access Paper or Ask Questions

UrbanVLP: A Multi-Granularity Vision-Language Pre-Trained Foundation Model for Urban Indicator Prediction

Mar 25, 2024

Xixuan Hao, Wei Chen, Yibo Yan, Siru Zhong, Kun Wang, Qingsong Wen, Yuxuan Liang

Abstract:Urban indicator prediction aims to infer socio-economic metrics in diverse urban landscapes using data-driven methods. However, prevalent pre-trained models, particularly those reliant on satellite imagery, face dual challenges. Firstly, concentrating solely on macro-level patterns from satellite data may introduce bias, lacking nuanced details at micro levels, such as architectural details at a place. Secondly, the lack of interpretability in pre-trained models limits their utility in providing transparent evidence for urban planning. In response to these issues, we devise a novel Vision-Language Pre-Trained Model (UrbanVLP) in this paper. Our UrbanVLP seamlessly integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pre-trained models. Moreover, it introduces automatic text generation and calibration, elevating interpretability in downstream applications by producing high-quality text descriptions of urban imagery. Rigorous experiments conducted across six socio-economic tasks underscore UrbanVLP's superior performance. We also deploy a web platform to verify its practicality.

Via

Access Paper or Ask Questions

Spatio-Temporal Field Neural Networks for Air Quality Inference

Mar 02, 2024

Yutong Feng, Qiongyan Wang, Yutong Xia, Junlin Huang, Siru Zhong, Kun Wang, Shifen Cheng, Yuxuan Liang

Figure 1 for Spatio-Temporal Field Neural Networks for Air Quality Inference

Figure 2 for Spatio-Temporal Field Neural Networks for Air Quality Inference

Figure 3 for Spatio-Temporal Field Neural Networks for Air Quality Inference

Figure 4 for Spatio-Temporal Field Neural Networks for Air Quality Inference

Abstract:The air quality inference problem aims to utilize historical data from a limited number of observation sites to infer the air quality index at an unknown location. Considering the sparsity of data due to the high maintenance cost of the stations, good inference algorithms can effectively save the cost and refine the data granularity. While spatio-temporal graph neural networks have made excellent progress on this problem, their non-Euclidean and discrete data structure modeling of reality limits its potential. In this work, we make the first attempt to combine two different spatio-temporal perspectives, fields and graphs, by proposing a new model, Spatio-Temporal Field Neural Network, and its corresponding new framework, Pyramidal Inference. Extensive experiments validate that our model achieves state-of-the-art performance in nationwide air quality inference in the Chinese Mainland, demonstrating the superiority of our proposed model and framework.

Via

Access Paper or Ask Questions

When Urban Region Profiling Meets Large Language Models

Oct 22, 2023

Yibo Yan, Haomin Wen, Siru Zhong, Wei Chen, Haodong Chen, Qingsong Wen, Roger Zimmermann, Yuxuan Liang

Figure 1 for When Urban Region Profiling Meets Large Language Models

Figure 2 for When Urban Region Profiling Meets Large Language Models

Figure 3 for When Urban Region Profiling Meets Large Language Models

Figure 4 for When Urban Region Profiling Meets Large Language Models

Abstract:Urban region profiling from web-sourced data is of utmost importance for urban planning and sustainable development. We are witnessing a rising trend of LLMs for various fields, especially dealing with multi-modal data research such as vision-language learning, where the text modality serves as a supplement information for the image. Since textual modality has never been introduced into modality combinations in urban region profiling, we aim to answer two fundamental questions in this paper: i) Can textual modality enhance urban region profiling? ii) and if so, in what ways and with regard to which aspects? To answer the questions, we leverage the power of Large Language Models (LLMs) and introduce the first-ever LLM-enhanced framework that integrates the knowledge of textual modality into urban imagery profiling, named LLM-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining (UrbanCLIP). Specifically, it first generates a detailed textual description for each satellite image by an open-source Image-to-Text LLM. Then, the model is trained on the image-text pairs, seamlessly unifying natural language supervision for urban visual representation learning, jointly with contrastive loss and language modeling loss. Results on predicting three urban indicators in four major Chinese metropolises demonstrate its superior performance, with an average improvement of 6.1% on R^2 compared to the state-of-the-art methods. Our code and the image-language dataset will be released upon paper notification.

Via

Access Paper or Ask Questions