Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jan Hosang

Scaling Image Geo-Localization to Continent Level

Oct 30, 2025

Philipp Lindenberger, Paul-Edouard Sarlin, Jan Hosang, Matteo Balice, Marc Pollefeys, Simon Lynen, Eduard Trulls

Abstract:Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach can localize within 200m more than 68\% of queries of a dataset covering a large part of Europe. The code is publicly available at https://scaling-geoloc.github.io.

* NeurIPS 2025

Via

Access Paper or Ask Questions

SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding

Jun 08, 2023

Paul-Edouard Sarlin, Eduard Trulls, Marc Pollefeys, Jan Hosang, Simon Lynen

Figure 1 for SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding

Figure 2 for SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding

Figure 3 for SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding

Figure 4 for SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding

Abstract:Semantic 2D maps are commonly used by humans and machines for navigation purposes, whether it's walking or driving. However, these maps have limitations: they lack detail, often contain inaccuracies, and are difficult to create and maintain, especially in an automated fashion. Can we use raw imagery to automatically create better maps that can be easily interpreted by both humans and machines? We introduce SNAP, a deep network that learns rich neural 2D maps from ground-level and overhead images. We train our model to align neural maps estimated from different inputs, supervised only with camera poses over tens of millions of StreetView images. SNAP can resolve the location of challenging image queries beyond the reach of traditional methods, outperforming the state of the art in localization by a large margin. Moreover, our neural maps encode not only geometry and appearance but also high-level semantics, discovered without explicit supervision. This enables effective pre-training for data-efficient semantic scene understanding, with the potential to unlock cost-efficient creation of more detailed maps.

Via

Access Paper or Ask Questions

TUSK: Task-Agnostic Unsupervised Keypoints

Jun 16, 2022

Yuhe Jin, Weiwei Sun, Jan Hosang, Eduard Trulls, Kwang Moo Yi

Figure 1 for TUSK: Task-Agnostic Unsupervised Keypoints

Figure 2 for TUSK: Task-Agnostic Unsupervised Keypoints

Figure 3 for TUSK: Task-Agnostic Unsupervised Keypoints

Figure 4 for TUSK: Task-Agnostic Unsupervised Keypoints

Abstract:Existing unsupervised methods for keypoint learning rely heavily on the assumption that a specific keypoint type (e.g. elbow, digit, abstract geometric shape) appears only once in an image. This greatly limits their applicability, as each instance must be isolated before applying the method-an issue that is never discussed or evaluated. We thus propose a novel method to learn Task-agnostic, UnSupervised Keypoints (TUSK) which can deal with multiple instances. To achieve this, instead of the commonly-used strategy of detecting multiple heatmaps, each dedicated to a specific keypoint type, we use a single heatmap for detection, and enable unsupervised learning of keypoint types through clustering. Specifically, we encode semantics into the keypoints by teaching them to reconstruct images from a sparse set of keypoints and their descriptors, where the descriptors are forced to form distinct clusters in feature space around learned prototypes. This makes our approach amenable to a wider range of tasks than any previous unsupervised keypoint method: we show experiments on multiple-instance detection and classification, object discovery, and landmark detection-all unsupervised-with performance on par with the state of the art, while also being able to deal with multiple instances.

Via

Access Paper or Ask Questions

Efficient Large Scale Inlier Voting for Geometric Vision Problems

Jul 27, 2021

Dror Aiger, Simon Lynen, Jan Hosang, Bernhard Zeisl

Figure 1 for Efficient Large Scale Inlier Voting for Geometric Vision Problems

Figure 2 for Efficient Large Scale Inlier Voting for Geometric Vision Problems

Figure 3 for Efficient Large Scale Inlier Voting for Geometric Vision Problems

Figure 4 for Efficient Large Scale Inlier Voting for Geometric Vision Problems

Abstract:Outlier rejection and equivalently inlier set optimization is a key ingredient in numerous applications in computer vision such as filtering point-matches in camera pose estimation or plane and normal estimation in point clouds. Several approaches exist, yet at large scale we face a combinatorial explosion of possible solutions and state-of-the-art methods like RANSAC, Hough transform or Branch&Bound require a minimum inlier ratio or prior knowledge to remain practical. In fact, for problems such as camera posing in very large scenes these approaches become useless as they have exponential runtime growth if these conditions aren't met. To approach the problem we present a efficient and general algorithm for outlier rejection based on "intersecting" $k$-dimensional surfaces in $R^d$. We provide a recipe for casting a variety of geometric problems as finding a point in $R^d$ which maximizes the number of nearby surfaces (and thus inliers). The resulting algorithm has linear worst-case complexity with a better runtime dependency in the approximation factor than competing algorithms while not requiring domain specific bounds. This is achieved by introducing a space decomposition scheme that bounds the number of computations by successively rounding and grouping samples. Our recipe (and open-source code) enables anybody to derive such fast approaches to new problems across a wide range of domains. We demonstrate the versatility of the approach on several camera posing problems with a high number of matches at low inlier ratio achieving state-of-the-art results at significantly lower processing times.

Via

Access Paper or Ask Questions

COTR: Correspondence Transformer for Matching Across Images

Mar 25, 2021

Wei Jiang, Eduard Trulls, Jan Hosang, Andrea Tagliasacchi, Kwang Moo Yi

Figure 1 for COTR: Correspondence Transformer for Matching Across Images

Figure 2 for COTR: Correspondence Transformer for Matching Across Images

Figure 3 for COTR: Correspondence Transformer for Matching Across Images

Figure 4 for COTR: Correspondence Transformer for Matching Across Images

Abstract:We propose a novel framework for finding correspondences in images based on a deep neural network that, given two images and a query point in one of them, finds its correspondence in the other. By doing so, one has the option to query only the points of interest and retrieve sparse correspondences, or to query all points in an image and obtain dense mappings. Importantly, in order to capture both local and global priors, and to let our model relate between image regions using the most relevant among said priors, we realize our network using a transformer. At inference time, we apply our correspondence network by recursively zooming in around the estimates, yielding a multiscale pipeline able to provide highly-accurate correspondences. Our method significantly outperforms the state of the art on both sparse and dense correspondence problems on multiple datasets and tasks, ranging from wide-baseline stereo to optical flow, without any retraining for a specific dataset. We commit to releasing data, code, and all the tools necessary to train from scratch and ensure reproducibility.

Via

Access Paper or Ask Questions

Learning non-maximum suppression

May 09, 2017

Jan Hosang, Rodrigo Benenson, Bernt Schiele

Figure 1 for Learning non-maximum suppression

Figure 2 for Learning non-maximum suppression

Figure 3 for Learning non-maximum suppression

Figure 4 for Learning non-maximum suppression

Abstract:Object detectors have hugely profited from moving towards an end-to-end learning paradigm: proposals, features, and the classifier becoming one neural network improved results two-fold on general object detection. One indispensable component is non-maximum suppression (NMS), a post-processing algorithm responsible for merging all detections that belong to the same object. The de facto standard NMS algorithm is still fully hand-crafted, suspiciously simple, and -- being based on greedy clustering with a fixed distance threshold -- forces a trade-off between recall and precision. We propose a new network architecture designed to perform NMS, using only boxes and their score. We report experiments for person detection on PETS and for general object categories on the COCO dataset. Our approach shows promise providing improved localization and occlusion handling.

* Added "Supplementary material" title

Via

Access Paper or Ask Questions

Simple Does It: Weakly Supervised Instance and Semantic Segmentation

Nov 23, 2016

Anna Khoreva, Rodrigo Benenson, Jan Hosang, Matthias Hein, Bernt Schiele

Figure 1 for Simple Does It: Weakly Supervised Instance and Semantic Segmentation

Figure 2 for Simple Does It: Weakly Supervised Instance and Semantic Segmentation

Figure 3 for Simple Does It: Weakly Supervised Instance and Semantic Segmentation

Figure 4 for Simple Does It: Weakly Supervised Instance and Semantic Segmentation

Abstract:Semantic labelling and instance segmentation are two tasks that require particularly costly annotations. Starting from weak supervision in the form of bounding box detection annotations, we propose a new approach that does not require modification of the segmentation training procedure. We show that when carefully designing the input labels from given bounding boxes, even a single round of training is enough to improve over previously reported weakly supervised results. Overall, our weak supervision approach reaches ~95% of the quality of the fully supervised model, both for semantic labelling and instance segmentation.

Via

Access Paper or Ask Questions

How Far are We from Solving Pedestrian Detection?

Jun 21, 2016

Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, Bernt Schiele

Figure 1 for How Far are We from Solving Pedestrian Detection?

Figure 2 for How Far are We from Solving Pedestrian Detection?

Figure 3 for How Far are We from Solving Pedestrian Detection?

Figure 4 for How Far are We from Solving Pedestrian Detection?

Abstract:Encouraged by the recent progress in pedestrian detection, we investigate the gap between current state-of-the-art methods and the "perfect single frame detector". We enable our analysis by creating a human baseline for pedestrian detection (over the Caltech dataset), and by manually clustering the recurrent errors of a top detector. Our results characterize both localization and background-versus-foreground errors. To address localization errors we study the impact of training annotation noise on the detector performance, and show that we can improve even with a small portion of sanitized training data. To address background/foreground discrimination, we study convnets for pedestrian detection, and discuss which factors affect their performance. Other than our in-depth analysis, we report top performance on the Caltech dataset, and provide a new sanitized set of training and test annotations.

* CVPR16 camera ready

Via

Access Paper or Ask Questions

A convnet for non-maximum suppression

Jan 08, 2016

Jan Hosang, Rodrigo Benenson, Bernt Schiele

Figure 1 for A convnet for non-maximum suppression

Figure 2 for A convnet for non-maximum suppression

Figure 3 for A convnet for non-maximum suppression

Figure 4 for A convnet for non-maximum suppression

Abstract:Non-maximum suppression (NMS) is used in virtually all state-of-the-art object detection pipelines. While essential object detection ingredients such as features, classifiers, and proposal methods have been extensively researched surprisingly little work has aimed to systematically address NMS. The de-facto standard for NMS is based on greedy clustering with a fixed distance threshold, which forces to trade-off recall versus precision. We propose a convnet designed to perform NMS of a given set of detections. We report experiments on a synthetic setup, and results on crowded pedestrian detection scenes. Our approach overcomes the intrinsic limitations of greedy NMS, obtaining better recall and precision.

* Included comments from reviewers

Via

Access Paper or Ask Questions

What makes for effective detection proposals?

Aug 01, 2015

Jan Hosang, Rodrigo Benenson, Piotr Dollár, Bernt Schiele

Figure 1 for What makes for effective detection proposals?

Figure 2 for What makes for effective detection proposals?

Figure 3 for What makes for effective detection proposals?

Figure 4 for What makes for effective detection proposals?

Abstract:Current top performing object detectors employ detection proposals to guide the search for objects, thereby avoiding exhaustive sliding window search across images. Despite the popularity and widespread use of detection proposals, it is unclear which trade-offs are made when using them during object detection. We provide an in-depth analysis of twelve proposal methods along with four baselines regarding proposal repeatability, ground truth annotation recall on PASCAL, ImageNet, and MS COCO, and their impact on DPM, R-CNN, and Fast R-CNN detection performance. Our analysis shows that for object detection improving proposal localisation accuracy is as important as improving recall. We introduce a novel metric, the average recall (AR), which rewards both high recall and good localisation and correlates surprisingly well with detection performance. Our findings show common strengths and weaknesses of existing methods, and provide insights and metrics for selecting and tuning proposal methods.

* TPAMI final version, duplicate proposals removed in experiments

Via

Access Paper or Ask Questions