Henry Howard-Jenkins

Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling

Mar 14, 2025

Christopher Xie, Armen Avetisyan, Henry Howard-Jenkins, Yawar Siddiqui, Julian Straub, Richard Newcombe, Vasileios Balntas, Jakob Engel

Abstract: We present a novel human-in-the-loop approach to estimate 3D scene layout that uses human feedback from an egocentric standpoint. We study this approach through the introduction of a novel local correction task, where users identify local errors and prompt a model to automatically correct them. Building on SceneScript, a state-of-the-art framework for 3D scene layout estimation that leverages structured language, we propose a solution that structures this problem as "infilling", a task studied in natural language processing. We train a multi-task version of SceneScript that maintains performance on global predictions while significantly improving its local correction ability. We integrate this into a human-in-the-loop system, enabling a user to iteratively refine scene layout estimates via a low-friction "one-click fix" workflow. Our system enables the final refined layout to diverge from the training distribution, allowing for more accurate modelling of complex layouts.

* Project page: https://www.projectaria.com/scenescript/
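
As a rough illustration of the "infilling" framing described in the abstract above, the sketch below masks a span of a structured command sequence and splices a replacement back in. It is not the authors' implementation: the `<mask>` sentinel, the function names, and the command strings are all hypothetical, chosen only to show the shape of the task.

```python
# Hypothetical sentinel marking the span a user flagged as wrong (assumption).
MASK = "<mask>"


def make_infilling_example(commands, start, end):
    """Split a command sequence into (prompt, target) for an infilling task.

    The span commands[start:end] is replaced by a single MASK token in the
    prompt, and the removed span becomes the target to be predicted.
    """
    if not 0 <= start < end <= len(commands):
        raise ValueError("span must be non-empty and inside the sequence")
    prompt = commands[:start] + [MASK] + commands[end:]
    target = commands[start:end]
    return prompt, target


def apply_infill(prompt, infill):
    """Splice a predicted span back into the prompt at the MASK position."""
    i = prompt.index(MASK)
    return prompt[:i] + list(infill) + prompt[i + 1:]


# Illustrative command strings only; the real structured language may differ.
layout = [
    "make_wall id=0",
    "make_wall id=1",
    "make_door id=2 wall=0",
    "make_window id=3 wall=1",
]
prompt, target = make_infilling_example(layout, 2, 3)
corrected = apply_infill(prompt, ["make_door id=2 wall=1"])
```

Here the unmasked commands stay fixed while only the flagged span is regenerated, which is the sense in which the correction is local.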

SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

Mar 19, 2024

Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme (+4 more)

Abstract: We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our proposed scene representation is inspired by recent successes in transformers and LLMs, and departs from more traditional methods which commonly describe scenes as meshes, voxel grids, point clouds or radiance fields. Our method infers the set of structured language commands directly from encoded visual data using a scene language encoder-decoder architecture. To train SceneScript, we generate and release a large-scale synthetic dataset called Aria Synthetic Environments consisting of 100k high-quality indoor scenes, with photorealistic and ground-truth annotated renders of egocentric scene walkthroughs. Our method gives state-of-the-art results in architectural layout estimation, and competitive results in 3D object detection. Lastly, we explore an advantage for SceneScript, which is the ability to readily adapt to new commands via simple additions to the structured language, which we illustrate for tasks such as coarse 3D object part reconstruction.

* Project page: https://projectaria.com/scenescript

Cos R-CNN for Online Few-shot Object Detection

Jul 25, 2023

Gratianus Wesley Putra Data, Henry Howard-Jenkins, David Murray, Victor Prisacariu

Abstract: We propose Cos R-CNN, a simple exemplar-based R-CNN formulation that is designed for online few-shot object detection. That is, it is able to localise and classify novel object categories in images with few examples without fine-tuning. Cos R-CNN frames detection as a learning-to-compare task: unseen classes are represented as exemplar images, and objects are detected based on their similarity to these exemplars. The cosine-based classification head allows for dynamic adaptation of classification parameters to the exemplar embedding, and encourages the clustering of similar classes in embedding space without the need for manual tuning of distance-metric hyperparameters. This simple formulation achieves best results on the recently proposed 5-way ImageNet few-shot detection benchmark, beating the online 1/5/10-shot scenarios by more than 8/3/1%, as well as performing up to 20% better in online 20-way few-shot VOC across all shots on novel classes.

* Unpublished tech report from 2020
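
The learning-to-compare idea in the abstract above can be sketched as a nearest-exemplar classifier under cosine similarity. This is a minimal illustration, not the Cos R-CNN head itself: the function names, the toy two-dimensional embeddings, and the rejection threshold are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def classify(query, exemplars, threshold=0.5):
    """Label a query embedding by its most similar exemplar embedding.

    exemplars maps a class label to one embedding vector. The threshold is a
    hypothetical rejection cut-off: below it the query is left unlabelled.
    """
    scores = {label: cosine_similarity(query, emb)
              for label, emb in exemplars.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores
    return best, scores


# Toy embeddings; new classes are added by inserting exemplars, no retraining.
exemplars = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
label, scores = classify([0.9, 0.1], exemplars)
```

Because the class "weights" are just the exemplar embeddings, adding a novel category only requires adding an entry to `exemplars`, which mirrors the abstract's point about adapting without fine-tuning.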

LaLaLoc: Latent Layout Localisation in Dynamic, Unvisited Environments

Apr 19, 2021

Henry Howard-Jenkins, Jose-Raul Ruiz-Sarmiento, Victor Adrian Prisacariu

Abstract: We present LaLaLoc to localise in environments without the need for prior visitation, and in a manner that is robust to large changes in scene appearance, such as a full rearrangement of furniture. Specifically, LaLaLoc performs localisation through latent representations of room layout. LaLaLoc learns a rich embedding space shared between RGB panoramas and layouts inferred from a known floor plan that encodes the structural similarity between locations. Further, LaLaLoc introduces direct, cross-modal pose optimisation in its latent space. Thus, LaLaLoc enables fine-grained pose estimation in a scene without the need for prior visitation, as well as being robust to dynamics, such as a change in furniture configuration. We show that in a domestic environment LaLaLoc is able to accurately localise a single RGB panorama image to within 8.3cm, given only a floor plan as a prior.

Correspondence Networks with Adaptive Neighbourhood Consensus

Mar 26, 2020

Shuda Li, Kai Han, Theo W. Costain, Henry Howard-Jenkins, Victor Prisacariu

Abstract: In this paper, we tackle the task of establishing dense visual correspondences between images containing objects of the same category. This is a challenging task due to large intra-class variations and a lack of dense pixel level annotations. We propose a convolutional neural network architecture, called adaptive neighbourhood consensus network (ANC-Net), that can be trained end-to-end with sparse key-point annotations, to handle this challenge. At the core of ANC-Net is our proposed non-isotropic 4D convolution kernel, which forms the building block for the adaptive neighbourhood consensus module for robust matching. We also introduce a simple and efficient multi-scale self-similarity module in ANC-Net to make the learned feature robust to intra-class variations. Furthermore, we propose a novel orthogonal loss that can enforce the one-to-one matching constraint. We thoroughly evaluate the effectiveness of our method on various benchmarks, where it substantially outperforms state-of-the-art methods.

* CVPR 2020. Project page: https://ancnet.avlcode.org/

FlowNet3D++: Geometric Losses For Deep Scene Flow Estimation

Dec 10, 2019

Zirui Wang, Shuda Li, Henry Howard-Jenkins, Victor Adrian Prisacariu, Min Chen

Abstract: We present FlowNet3D++, a deep scene flow estimation network. Inspired by classical methods, FlowNet3D++ incorporates geometric constraints in the form of point-to-plane distance and angular alignment between individual vectors in the flow field, into FlowNet3D. We demonstrate that the addition of these geometric loss terms improves the previous state-of-the-art FlowNet3D accuracy from 57.85% to 63.43%. To further demonstrate the effectiveness of our geometric constraints, we propose a benchmark for flow estimation on the task of dynamic 3D reconstruction, thus providing a more holistic and practical measure of performance than the breakdown of individual metrics previously used to evaluate scene flow. This is made possible through the contribution of a novel pipeline to integrate point-based scene flow predictions into a global dense volume. FlowNet3D++ achieves up to a 15.0% reduction in reconstruction error over FlowNet3D, and up to a 35.2% improvement over KillingFusion alone. We will release our scene flow estimation code later.

* Accepted at WACV 2020
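
The two geometric terms named in the abstract above can be sketched in a generic form. The exact formulation and weighting used in FlowNet3D++ are not given here, so the squared point-to-plane residual and the `1 - cos` angular term below are textbook versions chosen as assumptions, and all function and parameter names are hypothetical.

```python
import math


def point_to_plane_loss(points, flows, targets, normals):
    """Mean squared distance from each warped point to its target's plane.

    Each point p is moved by its flow f; the residual is the component of
    (p + f - q) along the unit normal n at the target point q.
    """
    total = 0.0
    for p, f, q, n in zip(points, flows, targets, normals):
        warped = [pi + fi for pi, fi in zip(p, f)]
        residual = sum((w - qi) * ni for w, qi, ni in zip(warped, q, n))
        total += residual ** 2
    return total / len(points)


def angular_loss(pred_flows, gt_flows, eps=1e-8):
    """Mean of (1 - cosine similarity) between predicted and reference flows."""
    total = 0.0
    for f, g in zip(pred_flows, gt_flows):
        dot = sum(a * b for a, b in zip(f, g))
        norm_f = math.sqrt(sum(a * a for a in f))
        norm_g = math.sqrt(sum(b * b for b in g))
        total += 1.0 - dot / (norm_f * norm_g + eps)  # eps avoids divide-by-zero
    return total / len(pred_flows)


# One point moved onto the plane z = 1: sliding within the plane costs nothing.
on_plane = point_to_plane_loss(
    [[0.0, 0.0, 0.0]], [[0.0, 0.0, 1.0]], [[5.0, 5.0, 1.0]], [[0.0, 0.0, 1.0]]
)
```

The point-to-plane term penalises only motion along the surface normal, while the angular term penalises direction errors regardless of flow magnitude, so the two are complementary.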

GroSS: Group-Size Series Decomposition for Whole Search-Space Training

Dec 02, 2019

Henry Howard-Jenkins, Yiwen Li, Victor A. Prisacariu

Abstract: We present Group-size Series (GroSS) decomposition, a mathematical formulation of tensor factorisation into a series of approximations of increasing rank terms. GroSS allows for dynamic and differentiable selection of factorisation rank, which is analogous to a grouped convolution. Therefore, to the best of our knowledge, GroSS is the first method to simultaneously train differing numbers of groups within a single layer, as well as all possible combinations between layers. In doing so, GroSS trains an entire grouped convolution architecture search-space concurrently. We demonstrate this through proof-of-concept architecture searches with performance objectives. GroSS represents a significant step towards liberating network architecture search from the burden of training and fine-tuning.

Thinking Outside the Box: Generation of Unconstrained 3D Room Layouts

May 08, 2019

Henry Howard-Jenkins, Shuda Li, Victor Prisacariu

Abstract: We propose a method for room layout estimation that does not rely on the typical box approximation or Manhattan world assumption. Instead, we reformulate the geometry inference problem as an instance detection task, which we solve by directly regressing 3D planes using an R-CNN. We then use a variant of probabilistic clustering to combine the 3D planes regressed at each frame in a video sequence, with their respective camera poses, into a single global 3D room layout estimate. Finally, we showcase results that make no assumptions about perpendicular alignment, and so can deal effectively with walls in any alignment.

* Asian Conference on Computer Vision (ACCV), 2018
