Daniel DeTone

Sonata: Self-Supervised Learning of Reliable Point Representations

Mar 20, 2025

Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard Newcombe, Hengshuang Zhao, Julian Straub

Abstract: In this paper, we question whether we have a reliable self-supervised point cloud model that can be used for diverse 3D tasks via simple linear probing, even with limited data and minimal computation. We find that existing 3D self-supervised learning approaches fall short when evaluated on representation quality through linear probing. We hypothesize that this is due to what we term the "geometric shortcut", which causes representations to collapse to low-level spatial features. This challenge is unique to 3D and arises from the sparse nature of point cloud data. We address it through two key strategies: obscuring spatial information and enhancing the reliance on input features, ultimately composing a Sonata of 140k point clouds through self-distillation. Sonata is simple and intuitive, yet its learned representations are strong and reliable: zero-shot visualizations demonstrate semantic grouping, alongside strong spatial reasoning through nearest-neighbor relationships. Sonata demonstrates exceptional parameter and data efficiency, tripling linear probing accuracy (from 21.8% to 72.5%) on ScanNet and nearly doubling performance with only 1% of the data compared to previous approaches. Full fine-tuning further advances SOTA across both 3D indoor and outdoor perception tasks.

* CVPR 2025, produced by Pointcept x Meta, project page: https://xywu.me/sonata/
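
The Sonata abstract evaluates representations through linear probing. As a rough illustration of what that protocol means, the sketch below freezes a feature extractor and trains only a linear softmax layer on top of it. Everything here is an assumption made for illustration: `frozen_encoder` is a hypothetical stand-in for a pretrained backbone, the data is a toy 2D problem, and none of it reflects Sonata's actual model or code.

```python
import math
import random

def frozen_encoder(point):
    """Hypothetical stand-in for a pretrained backbone; it is never updated."""
    x, y = point
    return [x, y, x * y]

def logits_for(W, b, f):
    """Linear layer: one score per class."""
    return [sum(w * v for w, v in zip(row, f)) + bias for row, bias in zip(W, b)]

def train_linear_probe(features, labels, num_classes, epochs=200, lr=0.1):
    """Fit only a linear softmax classifier on fixed features with plain SGD."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for f, y in zip(features, labels):
            z = logits_for(W, b, f)
            m = max(z)
            exps = [math.exp(v - m) for v in z]
            total = sum(exps)
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. the logit of class c.
                grad = exps[c] / total - (1.0 if c == y else 0.0)
                for i in range(dim):
                    W[c][i] -= lr * grad * f[i]
                b[c] -= lr * grad
    return W, b

def predict(W, b, f):
    z = logits_for(W, b, f)
    return max(range(len(z)), key=z.__getitem__)

# Toy data (assumption): class 0 has x < 0, class 1 has x > 0.
rng = random.Random(0)
points, labels = [], []
for _ in range(20):
    points.append((rng.uniform(-1.0, -0.2), rng.uniform(-1.0, 1.0)))
    labels.append(0)
    points.append((rng.uniform(0.2, 1.0), rng.uniform(-1.0, 1.0)))
    labels.append(1)

features = [frozen_encoder(p) for p in points]  # the encoder stays frozen
W, b = train_linear_probe(features, labels, num_classes=2)
# Training accuracy on the toy data; a real probe would use a held-out split.
accuracy = sum(predict(W, b, f) == y for f, y in zip(features, labels)) / len(labels)
```

Because only `W` and `b` are trained, the resulting accuracy reflects how linearly separable the frozen features already are, which is the property linear probing is meant to measure.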

EFM3D: A Benchmark for Measuring Progress Towards 3D Egocentric Foundation Models

Jun 14, 2024

Julian Straub, Daniel DeTone, Tianwei Shen, Nan Yang, Chris Sweeney, Richard Newcombe

Abstract: The advent of wearable computers enables a new source of context for AI that is embedded in egocentric sensor data. This new egocentric data comes equipped with fine-grained 3D location information and thus presents the opportunity for a novel class of spatial foundation models that are rooted in 3D space. To measure progress on what we term Egocentric Foundation Models (EFMs) we establish EFM3D, a benchmark with two core 3D egocentric perception tasks. EFM3D is the first benchmark for 3D object detection and surface regression on high-quality annotated egocentric data of Project Aria. We propose Egocentric Voxel Lifting (EVL), a baseline for 3D EFMs. EVL leverages all available egocentric modalities and inherits foundational capabilities from 2D foundation models. This model, trained on a large simulated dataset, outperforms existing methods on the EFM3D benchmark.

OrienterNet: Visual Localization in 2D Public Maps with Neural Matching

Apr 04, 2023

Paul-Edouard Sarlin, Daniel DeTone, Tsun-Yi Yang, Armen Avetisyan, Julian Straub, Tomasz Malisiewicz, Samuel Rota Bulo, Richard Newcombe, Peter Kontschieder, Vasileios Balntas

Abstract: Humans can orient themselves in their 3D environments using simple 2D maps. In contrast, algorithms for visual localization mostly rely on complex 3D point clouds that are expensive to build, store, and maintain over time. We bridge this gap by introducing OrienterNet, the first deep neural network that can localize an image with sub-meter accuracy using the same 2D semantic maps that humans use. OrienterNet estimates the location and orientation of a query image by matching a neural Bird's-Eye View with open and globally available maps from OpenStreetMap, enabling anyone to localize anywhere such maps are available. OrienterNet is supervised only by camera poses but learns to perform semantic matching with a wide range of map elements in an end-to-end manner. To enable this, we introduce a large crowd-sourced dataset of images captured across 12 cities from the diverse viewpoints of cars, bikes, and pedestrians. OrienterNet generalizes to new datasets and pushes the state of the art in both robotics and AR scenarios. The code and trained model will be released publicly.

* CVPR 2023

Theseus: A Library for Differentiable Nonlinear Optimization

Jul 19, 2022

Luis Pineda, Taosha Fan, Maurizio Monge, Shobha Venkataraman, Paloma Sodhi, Ricky Chen, Joseph Ortiz, Daniel DeTone, Austin Wang, Stuart Anderson (+3 more)

Abstract: We present Theseus, an efficient application-agnostic open source library for differentiable nonlinear least squares (DNLS) optimization built on PyTorch, providing a common framework for end-to-end structured learning in robotics and vision. Existing DNLS implementations are application specific and do not always incorporate many ingredients important for efficiency. Theseus is application-agnostic, as we illustrate with several example applications that are built using the same underlying differentiable components, such as second-order optimizers, standard cost functions, and Lie groups. For efficiency, Theseus incorporates support for sparse solvers, automatic vectorization, batching, GPU acceleration, and gradient computation with implicit differentiation and direct loss minimization. We do extensive performance evaluation in a set of applications, demonstrating significant efficiency gains and better scalability when these features are incorporated. Project page: https://sites.google.com/view/theseus-ai
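
For readers unfamiliar with nonlinear least squares, the sketch below shows the kind of problem such a library solves, using a plain Gauss-Newton loop in pure Python. It does not use the Theseus API and is not differentiable end to end; the range-based localization problem, the landmark positions, and all function names are assumptions chosen for illustration.

```python
import math

def residuals_and_jacobian(p, landmarks, ranges):
    """Residual r_i = ||p - l_i|| - d_i and its 2D Jacobian row for each landmark."""
    r, J = [], []
    for (lx, ly), d in zip(landmarks, ranges):
        dx, dy = p[0] - lx, p[1] - ly
        dist = math.hypot(dx, dy)
        r.append(dist - d)
        J.append((dx / dist, dy / dist))
    return r, J

def gauss_newton(p0, landmarks, ranges, iters=10):
    """Minimize the sum of squared residuals by repeatedly solving J^T J delta = -J^T r."""
    p = list(p0)
    for _ in range(iters):
        r, J = residuals_and_jacobian(p, landmarks, ranges)
        # Entries of the symmetric 2x2 matrix J^T J and the vector J^T r.
        a = sum(jx * jx for jx, _ in J)
        b = sum(jx * jy for jx, jy in J)
        c = sum(jy * jy for _, jy in J)
        gx = sum(jx * ri for (jx, _), ri in zip(J, r))
        gy = sum(jy * ri for (_, jy), ri in zip(J, r))
        det = a * c - b * b
        # Closed-form 2x2 solve for the update step.
        p[0] += -(c * gx - b * gy) / det
        p[1] += -(-b * gx + a * gy) / det
    return p

# Hypothetical setup: four known landmarks and noise-free ranges to the point (1, 2).
landmarks = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
true_position = (1.0, 2.0)
ranges = [math.hypot(true_position[0] - lx, true_position[1] - ly) for lx, ly in landmarks]

estimate = gauss_newton((2.0, 2.0), landmarks, ranges)
```

A differentiable solver additionally lets gradients flow through the optimized `estimate` back to upstream quantities (for example, learned cost weights), which this sketch does not attempt.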

Nerfels: Renderable Neural Codes for Improved Camera Pose Estimation

Jun 04, 2022

Gil Avraham, Julian Straub, Tianwei Shen, Tsun-Yi Yang, Hugo Germain, Chris Sweeney, Vasileios Balntas, David Novotny, Daniel DeTone, Richard Newcombe

Abstract: This paper presents a framework that combines traditional keypoint-based camera pose optimization with an invertible neural rendering mechanism. Our proposed 3D scene representation, Nerfels, is locally dense yet globally sparse. As opposed to existing invertible neural rendering systems which overfit a model to the entire scene, we adopt a feature-driven approach for representing scene-agnostic, local 3D patches with renderable codes. By modelling a scene only where local features are detected, our framework effectively generalizes to unseen local regions in the scene via an optimizable code conditioning mechanism in the neural renderer, all while maintaining the low memory footprint of a sparse 3D map representation. Our model can be incorporated into existing state-of-the-art hand-crafted and learned local feature pose estimators, yielding improved performance when evaluating on ScanNet for wide camera baseline scenarios.

* Published at CVPRW with supplementary material

ODAM: Object Detection, Association, and Mapping using Posed RGB Video

Aug 23, 2021

Kejie Li, Daniel DeTone, Steven Chen, Minh Vo, Ian Reid, Hamid Rezatofighi, Chris Sweeney, Julian Straub, Richard Newcombe

Abstract: Localizing objects and estimating their extent in 3D is an important step towards high-level 3D scene understanding, which has many applications in Augmented Reality and Robotics. We present ODAM, a system for 3D Object Detection, Association, and Mapping using posed RGB videos. The proposed system relies on a deep learning front-end to detect 3D objects from a given RGB frame and associate them to a global object-based map using a graph neural network (GNN). Based on these frame-to-model associations, our back-end optimizes object bounding volumes, represented as super-quadrics, under multi-view geometry constraints and the object scale prior. We validate the proposed system on ScanNet where we show a significant improvement over existing RGB-only methods.

* Accepted in ICCV 2021 as oral

SuperGlue: Learning Feature Matching with Graph Neural Networks

Nov 26, 2019

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich

Abstract: This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems.

Deep ChArUco: Dark ChArUco Marker Pose Estimation

Dec 08, 2018

Danying Hu, Daniel DeTone, Vikram Chauhan, Igor Spivak, Tomasz Malisiewicz

Abstract: ChArUco boards are used for camera calibration, monocular pose estimation, and pose verification in both robotics and augmented reality. Such fiducials are detectable via traditional computer vision methods (as found in OpenCV) in well-lit environments, but classical methods fail when the lighting is poor or when the image undergoes extreme motion blur. We present Deep ChArUco, a real-time pose estimation system which combines two custom deep networks, ChArUcoNet and RefineNet, with the Perspective-n-Point (PnP) algorithm to estimate the marker's 6DoF pose. ChArUcoNet is a two-headed marker-specific convolutional neural network (CNN) which jointly outputs ID-specific classifiers and 2D point locations. The 2D point locations are further refined into subpixel coordinates using RefineNet. Our networks are trained using a combination of auto-labeled videos of the target marker, synthetic subpixel corner data, and extreme data augmentation. We evaluate Deep ChArUco in challenging low-light, high-motion, high-blur scenarios and demonstrate that our approach is superior to a traditional OpenCV-based method for ChArUco marker detection and pose estimation.

Self-Improving Visual Odometry

Dec 08, 2018

Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich

Abstract: We propose a self-supervised learning framework that uses unlabeled monocular video sequences to generate large-scale supervision for training a Visual Odometry (VO) frontend, a network which computes pointwise data associations across images. Our self-improving method enables a VO frontend to learn over time, unlike other VO and SLAM systems which require time-consuming hand-tuning or expensive data collection to adapt to new environments. Our proposed frontend operates on monocular images and consists of a single multi-task convolutional neural network which outputs 2D keypoint locations, keypoint descriptors, and a novel point stability score. We use the output of VO to create a self-supervised dataset of point correspondences to retrain the frontend. When trained using VO at scale on 2.5 million monocular images from ScanNet, the stability classifier automatically discovers a ranking for keypoints that are not likely to help in VO, such as t-junctions across depth discontinuities, features on shadows and highlights, and dynamic objects like people. The resulting frontend outperforms both traditional methods (SIFT, ORB, AKAZE) and deep learning methods (SuperPoint and LF-Net) in a 3D-to-2D pose estimation task on ScanNet.

SuperPoint: Self-Supervised Interest Point Detection and Description

Apr 19, 2018

Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich

Abstract: This paper presents a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems in computer vision. As opposed to patch-based neural networks, our fully-convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors in one forward pass. We introduce Homographic Adaptation, a multi-scale, multi-homography approach for boosting interest point detection repeatability and performing cross-domain adaptation (e.g., synthetic-to-real). Our model, when trained on the MS-COCO generic image dataset using Homographic Adaptation, is able to repeatedly detect a much richer set of interest points than the initial pre-adapted deep model and any other traditional corner detector. The final system gives rise to state-of-the-art homography estimation results on HPatches when compared to LIFT, SIFT and ORB.

* Camera-ready version for CVPR 2018 Deep Learning for Visual SLAM Workshop (DL4VSLAM2018)
