Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Silvano Galliani

R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization

Jan 02, 2025

Xudong Jiang, Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys

Figure 1 for R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization

Figure 2 for R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization

Figure 3 for R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization

Figure 4 for R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization

Abstract:Learning-based visual localization methods that use scene coordinate regression (SCR) offer the advantage of smaller map sizes. However, on datasets with complex illumination changes or image-level ambiguities, it remains a less robust alternative to feature matching methods. This work aims to close the gap. We introduce a covisibility graph-based global encoding learning and data augmentation strategy, along with a depth-adjusted reprojection loss to facilitate implicit triangulation. Additionally, we revisit the network architecture and local feature extraction module. Our method achieves state-of-the-art on challenging large-scale datasets without relying on network ensembles or 3D supervision. On Aachen Day-Night, we are 10$\times$ more accurate than previous SCR methods with similar map sizes and require at least 5$\times$ smaller map sizes than any other SCR method while still delivering superior accuracy. Code will be available at: https://github.com/cvg/scrstudio .

* Code: https://github.com/cvg/scrstudio

Via

Access Paper or Ask Questions

GLACE: Global Local Accelerated Coordinate Encoding

Jun 06, 2024

Fangjinhua Wang, Xudong Jiang, Silvano Galliani, Christoph Vogel, Marc Pollefeys

Abstract:Scene coordinate regression (SCR) methods are a family of visual localization methods that directly regress 2D-3D matches for camera pose estimation. They are effective in small-scale scenes but face significant challenges in large-scale scenes that are further amplified in the absence of ground truth 3D point clouds for supervision. Here, the model can only rely on reprojection constraints and needs to implicitly triangulate the points. The challenges stem from a fundamental dilemma: The network has to be invariant to observations of the same landmark at different viewpoints and lighting conditions, etc., but at the same time discriminate unrelated but similar observations. The latter becomes more relevant and severe in larger scenes. In this work, we tackle this problem by introducing the concept of co-visibility to the network. We propose GLACE, which integrates pre-trained global and local encodings and enables SCR to scale to large scenes with only a single small-sized network. Specifically, we propose a novel feature diffusion technique that implicitly groups the reprojection constraints with co-visibility and avoids overfitting to trivial solutions. Additionally, our position decoder parameterizes the output positions for large-scale scenes more effectively. Without using 3D models or depth maps for supervision, our method achieves state-of-the-art results on large-scale scenes with a low-map-size model. On Cambridge landmarks, with a single model, we achieve 17% lower median position error than Poker, the ensemble variant of the state-of-the-art SCR method ACE. Code is available at: https://github.com/cvg/glace.

* Large-scale visual localization with a single optimizable MLP. CVPR 2024. Code: https://github.com/cvg/glace. Project page: https://xjiangan.github.io/glace

Via

Access Paper or Ask Questions

IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo

Dec 09, 2021

Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys

Figure 1 for IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo

Figure 2 for IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo

Figure 3 for IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo

Figure 4 for IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo

Abstract:We present IterMVS, a new data-driven method for high-resolution multi-view stereo. We propose a novel GRU-based estimator that encodes pixel-wise probability distributions of depth in its hidden state. Ingesting multi-scale matching information, our model refines these distributions over multiple iterations and infers depth and confidence. To extract the depth maps, we combine traditional classification and regression in a novel manner. We verify the efficiency and effectiveness of our method on DTU, Tanks&Temples and ETH3D. While being the most efficient method in both memory and run-time, our model achieves competitive performance on DTU and better generalization ability on Tanks&Temples as well as ETH3D than most state-of-the-art methods. Code is available at https://github.com/FangjinhuaWang/IterMVS.

Via

Access Paper or Ask Questions

DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion

Dec 03, 2020

Arda Düzçeker, Silvano Galliani, Christoph Vogel, Pablo Speciale, Mihai Dusmanu, Marc Pollefeys

Figure 1 for DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion

Figure 2 for DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion

Figure 3 for DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion

Figure 4 for DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion

Abstract:We propose an online multi-view depth prediction approach on posed video streams, where the scene geometry information computed in the previous time steps is propagated to the current time step in an efficient and geometrically plausible way. The backbone of our approach is a real-time capable, lightweight encoder-decoder that relies on cost volumes computed from pairs of images. We extend it by placing a ConvLSTM cell at the bottleneck layer, which compresses an arbitrary amount of past information in its states. The novelty lies in propagating the hidden state of the cell by accounting for the viewpoint changes between time steps. At a given time step, we warp the previous hidden state into the current camera plane using the previous depth prediction. Our extension brings only a small overhead of computation time and memory consumption, while improving the depth predictions significantly. As a result, we outperform the existing state-of-the-art multi-view stereo methods on most of the evaluated metrics in hundreds of indoor scenes while maintaining a real-time performance. Code available: https://github.com/ardaduz/deep-video-mvs

Via

Access Paper or Ask Questions

PatchmatchNet: Learned Multi-View Patchmatch Stereo

Dec 02, 2020

Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Speciale, Marc Pollefeys

Figure 1 for PatchmatchNet: Learned Multi-View Patchmatch Stereo

Figure 2 for PatchmatchNet: Learned Multi-View Patchmatch Stereo

Figure 3 for PatchmatchNet: Learned Multi-View Patchmatch Stereo

Figure 4 for PatchmatchNet: Learned Multi-View Patchmatch Stereo

Abstract:We present PatchmatchNet, a novel and learnable cascade formulation of Patchmatch for high-resolution multi-view stereo. With high computation speed and low memory requirement, PatchmatchNet can process higher resolution imagery and is more suited to run on resource limited devices than competitors that employ 3D cost volume regularization. For the first time we introduce an iterative multi-scale Patchmatch in an end-to-end trainable architecture and improve the Patchmatch core algorithm with a novel and learned adaptive propagation and evaluation scheme for each iteration. Extensive experiments show a very competitive performance and generalization for our method on DTU, Tanks & Temples and ETH3D, but at a significantly higher efficiency than all existing top-performing models: at least two and a half times faster than state-of-the-art methods with twice less memory usage.

Via

Access Paper or Ask Questions

HoloLens 2 Research Mode as a Tool for Computer Vision Research

Aug 25, 2020

Dorin Ungureanu, Federica Bogo, Silvano Galliani, Pooja Sama, Xin Duan, Casey Meekhof, Jan Stühmer, Thomas J. Cashman, Bugra Tekin, Johannes L. Schönberger(+2 more)

Figure 1 for HoloLens 2 Research Mode as a Tool for Computer Vision Research

Figure 2 for HoloLens 2 Research Mode as a Tool for Computer Vision Research

Figure 3 for HoloLens 2 Research Mode as a Tool for Computer Vision Research

Figure 4 for HoloLens 2 Research Mode as a Tool for Computer Vision Research

Abstract:Mixed reality headsets, such as the Microsoft HoloLens 2, are powerful sensing devices with integrated compute capabilities, which makes it an ideal platform for computer vision research. In this technical report, we present HoloLens 2 Research Mode, an API and a set of tools enabling access to the raw sensor streams. We provide an overview of the API and explain how it can be used to build mixed reality applications based on processing sensor data. We also show how to combine the Research Mode sensor data with the built-in eye and hand tracking capabilities provided by HoloLens 2. By releasing the Research Mode API and a set of open-source tools, we aim to foster further research in the fields of computer vision as well as robotics and encourage contributions from the research community.

Via

Access Paper or Ask Questions

Super-resolution of Sentinel-2 images: Learning a globally applicable deep neural network

Oct 01, 2018

Charis Lanaras, José Bioucas-Dias, Silvano Galliani, Emmanuel Baltsavias, Konrad Schindler

Figure 1 for Super-resolution of Sentinel-2 images: Learning a globally applicable deep neural network

Figure 2 for Super-resolution of Sentinel-2 images: Learning a globally applicable deep neural network

Figure 3 for Super-resolution of Sentinel-2 images: Learning a globally applicable deep neural network

Figure 4 for Super-resolution of Sentinel-2 images: Learning a globally applicable deep neural network

Abstract:The Sentinel-2 satellite mission delivers multi-spectral imagery with 13 spectral bands, acquired at three different spatial resolutions. The aim of this research is to super-resolve the lower-resolution (20 m and 60 m Ground Sampling Distance - GSD) bands to 10 m GSD, so as to obtain a complete data cube at the maximal sensor resolution. We employ a state-of-the-art convolutional neural network (CNN) to perform end-to-end upsampling, which is trained with data at lower resolution, i.e., from 40->20 m, respectively 360->60 m GSD. In this way, one has access to a virtually infinite amount of training data, by downsampling real Sentinel-2 images. We use data sampled globally over a wide range of geographical locations, to obtain a network that generalises across different climate zones and land-cover types, and can super-resolve arbitrary Sentinel-2 images without the need of retraining. In quantitative evaluations (at lower scale, where ground truth is available), our network, which we call DSen2, outperforms the best competing approach by almost 50% in RMSE, while better preserving the spectral characteristics. It also delivers visually convincing results at the full 10 m GSD. The code is available at https://github.com/lanha/DSen2

* ISPRS Journal of Photogrammetry and Remote Sensing, 146 (2018), pp. 305-319
* 19 pages, 11 figures

Via

Access Paper or Ask Questions

Inference, Learning and Attention Mechanisms that Exploit and Preserve Sparsity in Convolutional Networks

Feb 09, 2018

Timo Hackel, Mikhail Usvyatsov, Silvano Galliani, Jan D. Wegner, Konrad Schindler

Figure 1 for Inference, Learning and Attention Mechanisms that Exploit and Preserve Sparsity in Convolutional Networks

Figure 2 for Inference, Learning and Attention Mechanisms that Exploit and Preserve Sparsity in Convolutional Networks

Figure 3 for Inference, Learning and Attention Mechanisms that Exploit and Preserve Sparsity in Convolutional Networks

Figure 4 for Inference, Learning and Attention Mechanisms that Exploit and Preserve Sparsity in Convolutional Networks

Abstract:While CNNs naturally lend themselves to densely sampled data, and sophisticated implementations are available, they lack the ability to efficiently process sparse data. In this work we introduce a suite of tools that exploit sparsity in both the feature maps and the filter weights, and thereby allow for significantly lower memory footprints and computation times than the conventional dense framework when processing data with a high degree of sparsity. Our scheme provides (i) an efficient GPU implementation of a convolution layer based on direct, sparse convolution; (ii) a filter step within the convolution layer, which we call attention, that prevents fill-in, i.e., the tendency of convolution to rapidly decrease sparsity, and guarantees an upper bound on the computational resources; and (iii) an adaptation of the back-propagation algorithm, which makes it possible to combine our approach with standard learning frameworks, while still exploiting sparsity in the data and the model.

Via

Access Paper or Ask Questions

Classification With an Edge: Improving Semantic Image Segmentation with Boundary Detection

Dec 21, 2017

Dimitrios Marmanis, Konrad Schindler, Jan Dirk Wegner, Silvano Galliani, Mihai Datcu, Uwe Stilla

Figure 1 for Classification With an Edge: Improving Semantic Image Segmentation with Boundary Detection

Figure 2 for Classification With an Edge: Improving Semantic Image Segmentation with Boundary Detection

Figure 3 for Classification With an Edge: Improving Semantic Image Segmentation with Boundary Detection

Figure 4 for Classification With an Edge: Improving Semantic Image Segmentation with Boundary Detection

Abstract:We present an end-to-end trainable deep convolutional neural network (DCNN) for semantic segmentation with built-in awareness of semantically meaningful boundaries. Semantic segmentation is a fundamental remote sensing task, and most state-of-the-art methods rely on DCNNs as their workhorse. A major reason for their success is that deep networks learn to accumulate contextual information over very large windows (receptive fields). However, this success comes at a cost, since the associated loss of effecive spatial resolution washes out high-frequency details and leads to blurry object boundaries. Here, we propose to counter this effect by combining semantic segmentation with semantically informed edge detection, thus making class-boundaries explicit in the model, First, we construct a comparatively simple, memory-efficient model by adding boundary detection to the Segnet encoder-decoder architecture. Second, we also include boundary detection in FCN-type models and set up a high-end classifier ensemble. We show that boundary detection significantly improves semantic segmentation with CNNs. Our high-end ensemble achieves > 90% overall accuracy on the ISPRS Vaihingen benchmark.

* ISPRS Journal of Photogrammetry and Remote Sensing, Volume 135, January 2018, Pages 158-172, ISSN 0924-2716, https://doi.org/10.1016/j.isprsjprs.2017.11.009

Via

Access Paper or Ask Questions

Learned Multi-Patch Similarity

Aug 21, 2017

Wilfried Hartmann, Silvano Galliani, Michal Havlena, Luc Van Gool, Konrad Schindler

Figure 1 for Learned Multi-Patch Similarity

Figure 2 for Learned Multi-Patch Similarity

Figure 3 for Learned Multi-Patch Similarity

Figure 4 for Learned Multi-Patch Similarity

Abstract:Estimating a depth map from multiple views of a scene is a fundamental task in computer vision. As soon as more than two viewpoints are available, one faces the very basic question how to measure similarity across >2 image patches. Surprisingly, no direct solution exists, instead it is common to fall back to more or less robust averaging of two-view similarities. Encouraged by the success of machine learning, and in particular convolutional neural networks, we propose to learn a matching function which directly maps multiple image patches to a scalar similarity score. Experiments on several multi-view datasets demonstrate that this approach has advantages over methods based on pairwise patch similarity.

* 10 pages, 7 figures, Accepted at ICCV 2017

Via

Access Paper or Ask Questions