Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Li Mi

GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration

Jul 31, 2025

Li Mi, Manon Bechaz, Zeming Chen, Antoine Bosselut, Devis Tuia

Abstract:Active Geo-localization (AGL) is the task of localizing a goal, represented in various modalities (e.g., aerial images, ground-level images, or text), within a predefined search area. Current methods approach AGL as a goal-reaching reinforcement learning (RL) problem with a distance-based reward. They localize the goal by implicitly learning to minimize the relative distance from it. However, when distance estimation becomes challenging or when encountering unseen targets and environments, the agent exhibits reduced robustness and generalization ability due to the less reliable exploration strategy learned during training. In this paper, we propose GeoExplorer, an AGL agent that incorporates curiosity-driven exploration through intrinsic rewards. Unlike distance-based rewards, our curiosity-driven reward is goal-agnostic, enabling robust, diverse, and contextually relevant exploration based on effective environment modeling. These capabilities have been proven through extensive experiments across four AGL benchmarks, demonstrating the effectiveness and generalization ability of GeoExplorer in diverse settings, particularly in localizing unfamiliar targets and environments.

* ICCV 2025. Project page at https://limirs.github.io/GeoExplorer/

Via

Access Paper or Ask Questions

VinaBench: Benchmark for Faithful and Consistent Visual Narratives

Mar 26, 2025

Silin Gao, Sheryl Mathew, Li Mi, Sepideh Mamooler, Mengjie Zhao, Hiromi Wakaki, Yuki Mitsufuji, Syrielle Montariol, Antoine Bosselut

Figure 1 for VinaBench: Benchmark for Faithful and Consistent Visual Narratives

Figure 2 for VinaBench: Benchmark for Faithful and Consistent Visual Narratives

Figure 3 for VinaBench: Benchmark for Faithful and Consistent Visual Narratives

Figure 4 for VinaBench: Benchmark for Faithful and Consistent Visual Narratives

Abstract:Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.

* Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)

Via

Access Paper or Ask Questions

Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

Nov 22, 2024

Haoyuan Li, Chang Xu, Wen Yang, Li Mi, Huai Yu, Haijian Zhang

Figure 1 for Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

Figure 2 for Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

Figure 3 for Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

Figure 4 for Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

Abstract:Unmanned Aerial Vehicle (UAV) Cross-View Geo-Localization (CVGL) presents significant challenges due to the view discrepancy between oblique UAV images and overhead satellite images. Existing methods heavily rely on the supervision of labeled datasets to extract viewpoint-invariant features for cross-view retrieval. However, these methods have expensive training costs and tend to overfit the region-specific cues, showing limited generalizability to new regions. To overcome this issue, we propose an unsupervised solution that lifts the scene representation to 3d space from UAV observations for satellite image generation, providing robust representation against view distortion. By generating orthogonal images that closely resemble satellite views, our method reduces view discrepancies in feature representation and mitigates shortcuts in region-specific image pairing. To further align the rendered image's perspective with the real one, we design an iterative camera pose updating mechanism that progressively modulates the rendered query image with potential satellite targets, eliminating spatial offsets relative to the reference images. Additionally, this iterative refinement strategy enhances cross-view feature invariance through view-consistent fusion across iterations. As such, our unsupervised paradigm naturally avoids the problem of region-specific overfitting, enabling generic CVGL for UAV images without feature fine-tuning or data-driven training. Experiments on the University-1652 and SUES-200 datasets demonstrate that our approach significantly improves geo-localization accuracy while maintaining robustness across diverse regions. Notably, without model fine-tuning or paired training, our method achieves competitive performance with recent supervised methods.

* 13 pages

Via

Access Paper or Ask Questions

Knowledge-aware Text-Image Retrieval for Remote Sensing Images

May 06, 2024

Li Mi, Xianjie Dai, Javiera Castillo-Navarro, Devis Tuia

Figure 1 for Knowledge-aware Text-Image Retrieval for Remote Sensing Images

Figure 2 for Knowledge-aware Text-Image Retrieval for Remote Sensing Images

Figure 3 for Knowledge-aware Text-Image Retrieval for Remote Sensing Images

Figure 4 for Knowledge-aware Text-Image Retrieval for Remote Sensing Images

Abstract:Image-based retrieval in large Earth observation archives is challenging because one needs to navigate across thousands of candidate matches only with the query image as a guide. By using text as information supporting the visual query, the retrieval system gains in usability, but at the same time faces difficulties due to the diversity of visual signals that cannot be summarized by a short caption only. For this reason, as a matching-based task, cross-modal text-image retrieval often suffers from information asymmetry between texts and images. To address this challenge, we propose a Knowledge-aware Text-Image Retrieval (KTIR) method for remote sensing images. By mining relevant information from an external knowledge graph, KTIR enriches the text scope available in the search query and alleviates the information gaps between texts and images for better matching. Moreover, by integrating domain-specific knowledge, KTIR also enhances the adaptation of pre-trained vision-language models to remote sensing applications. Experimental results on three commonly used remote sensing text-image retrieval benchmarks show that the proposed knowledge-aware method leads to varied and consistent retrievals, outperforming state-of-the-art retrieval methods.

* Under review

Via

Access Paper or Ask Questions

ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Mar 20, 2024

Li Mi, Chang Xu, Javiera Castillo-Navarro, Syrielle Montariol, Wen Yang, Antoine Bosselut, Devis Tuia

Figure 1 for ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Figure 2 for ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Figure 3 for ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Figure 4 for ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Abstract:Cross-view geo-localization aims at localizing a ground-level query image by matching it to its corresponding geo-referenced aerial view. In real-world scenarios, the task requires accommodating diverse ground images captured by users with varying orientations and reduced field of views (FoVs). However, existing learning pipelines are orientation-specific or FoV-specific, demanding separate model training for different ground view variations. Such models heavily depend on the North-aligned spatial correspondence and predefined FoVs in the training data, compromising their robustness across different settings. To tackle this challenge, we propose ConGeo, a single- and cross-modal Contrastive method for Geo-localization: it enhances robustness and consistency in feature representations to improve a model's invariance to orientation and its resilience to FoV variations, by enforcing proximity between ground view variations of the same location. As a generic learning objective for cross-view geo-localization, when integrated into state-of-the-art pipelines, ConGeo significantly boosts the performance of three base models on four geo-localization benchmarks for diverse ground view variations and outperforms competing methods that train separate models for each ground view variation.

* Project page at https://chasel-tsui.github.io/ConGeo/

Via

Access Paper or Ask Questions

ConVQG: Contrastive Visual Question Generation with Multimodal Guidance

Feb 20, 2024

Li Mi, Syrielle Montariol, Javiera Castillo-Navarro, Xianjie Dai, Antoine Bosselut, Devis Tuia

Figure 1 for ConVQG: Contrastive Visual Question Generation with Multimodal Guidance

Figure 2 for ConVQG: Contrastive Visual Question Generation with Multimodal Guidance

Figure 3 for ConVQG: Contrastive Visual Question Generation with Multimodal Guidance

Figure 4 for ConVQG: Contrastive Visual Question Generation with Multimodal Guidance

Abstract:Asking questions about visual environments is a crucial way for intelligent agents to understand rich multi-faceted scenes, raising the importance of Visual Question Generation (VQG) systems. Apart from being grounded to the image, existing VQG systems can use textual constraints, such as expected answers or knowledge triplets, to generate focused questions. These constraints allow VQG systems to specify the question content or leverage external commonsense knowledge that can not be obtained from the image content only. However, generating focused questions using textual constraints while enforcing a high relevance to the image content remains a challenge, as VQG systems often ignore one or both forms of grounding. In this work, we propose Contrastive Visual Question Generation (ConVQG), a method using a dual contrastive objective to discriminate questions generated using both modalities from those based on a single one. Experiments on both knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms the state-of-the-art methods and generates image-grounded, text-guided, and knowledge-rich questions. Our human evaluation results also show preference for ConVQG questions compared to non-contrastive baselines.

* AAAI 2024. Project page at https://limirs.github.io/ConVQG

Via

Access Paper or Ask Questions

Predicate correlation learning for scene graph generation

Jul 06, 2021

Leitian Tao, Li Mi, Nannan Li, Xianhang Cheng, Yaosi Hu, Zhenzhong Chen

Figure 1 for Predicate correlation learning for scene graph generation

Figure 2 for Predicate correlation learning for scene graph generation

Figure 3 for Predicate correlation learning for scene graph generation

Figure 4 for Predicate correlation learning for scene graph generation

Abstract:For a typical Scene Graph Generation (SGG) method, there is often a large gap in the performance of the predicates' head classes and tail classes. This phenomenon is mainly caused by the semantic overlap between different predicates as well as the long-tailed data distribution. In this paper, a Predicate Correlation Learning (PCL) method for SGG is proposed to address the above two problems by taking the correlation between predicates into consideration. To describe the semantic overlap between strong-correlated predicate classes, a Predicate Correlation Matrix (PCM) is defined to quantify the relationship between predicate pairs, which is dynamically updated to remove the matrix's long-tailed bias. In addition, PCM is integrated into a Predicate Correlation Loss function ($L_{PC}$) to reduce discouraging gradients of unannotated classes. The proposed method is evaluated on Visual Genome benchmark, where the performance of the tail classes is significantly improved when built on the existing methods.

Via

Access Paper or Ask Questions

Visual Relationship Forecasting in Videos

Jul 02, 2021

Li Mi, Yangjun Ou, Zhenzhong Chen

Figure 1 for Visual Relationship Forecasting in Videos

Figure 2 for Visual Relationship Forecasting in Videos

Figure 3 for Visual Relationship Forecasting in Videos

Figure 4 for Visual Relationship Forecasting in Videos

Abstract:Real-world scenarios often require the anticipation of object interactions in unknown future, which would assist the decision-making process of both humans and agents. To meet this challenge, we present a new task named Visual Relationship Forecasting (VRF) in videos to explore the prediction of visual relationships in a reasoning manner. Specifically, given a subject-object pair with H existing frames, VRF aims to predict their future interactions for the next T frames without visual evidence. To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series of spatio-temporally localized visual relation annotations in a video. These two datasets densely annotate 13 and 35 visual relationships in 1923 and 13447 video clips, respectively. In addition, we present a novel Graph Convolutional Transformer (GCT) framework, which captures both object-level and frame-level dependencies by spatio-temporal Graph Convolution Network and Transformer. Experimental results on both VRF-AG and VRF-VidOR datasets demonstrate that GCT outperforms the state-of-the-art sequence modelling methods on visual relationship forecasting.

Via

Access Paper or Ask Questions