Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Anand Bhattad

Generative Blocks World: Moving Things Around in Pictures

Jun 25, 2025

Vaibhav Vavilala, Seemandhar Jain, Rahul Vasanth, D. A. Forsyth, Anand Bhattad

Abstract:We describe Generative Blocks World to interact with the scene of a generated image by manipulating simple geometric abstractions. Our method represents scenes as assemblies of convex 3D primitives, and the same scene can be represented by different numbers of primitives, allowing an editor to move either whole structures or small details. Once the scene geometry has been edited, the image is generated by a flow-based method which is conditioned on depth and a texture hint. Our texture hint takes into account the modified 3D primitives, exceeding texture-consistency provided by existing key-value caching techniques. These texture hints (a) allow accurate object and camera moves and (b) largely preserve the identity of objects depicted. Quantitative and qualitative experiments demonstrate that our approach outperforms prior works in visual fidelity, editability, and compositional generalization.

* 23 pages, 16 figures, 2 tables

Via

Access Paper or Ask Questions

Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting

Mar 27, 2025

Anand Bhattad, Konpat Preechakul, Alexei A. Efros

Abstract:This paper proposes a novel scene understanding task called Visual Jenga. Drawing inspiration from the game Jenga, the proposed task involves progressively removing objects from a single image until only the background remains. Just as Jenga players must understand structural dependencies to maintain tower stability, our task reveals the intrinsic relationships between scene elements by systematically exploring which objects can be removed while preserving scene coherence in both physical and geometric sense. As a starting point for tackling the Visual Jenga task, we propose a simple, data-driven, training-free approach that is surprisingly effective on a range of real-world images. The principle behind our approach is to utilize the asymmetry in the pairwise relationships between objects within a scene and employ a large inpainting model to generate a set of counterfactuals to quantify the asymmetry.

* project page: https://visualjenga.github.io/

Via

Access Paper or Ask Questions

From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos

Dec 10, 2024

Matthew Wallingford, Anand Bhattad, Aditya Kusupati, Vivek Ramanujan, Matt Deitke, Sham Kakade, Aniruddha Kembhavi, Roozbeh Mottaghi, Wei-Chiu Ma, Ali Farhadi

Abstract:Three-dimensional (3D) understanding of objects and scenes play a key role in humans' ability to interact with the world and has been an active area of research in computer vision, graphics, and robotics. Large scale synthetic and object-centric 3D datasets have shown to be effective in training models that have 3D understanding of objects. However, applying a similar approach to real-world objects and scenes is difficult due to a lack of large-scale data. Videos are a potential source for real-world 3D data, but finding diverse yet corresponding views of the same content has shown to be difficult at scale. Furthermore, standard videos come with fixed viewpoints, determined at the time of capture. This restricts the ability to access scenes from a variety of more diverse and potentially useful perspectives. We argue that large scale 360 videos can address these limitations to provide: scalable corresponding frames from diverse views. In this paper, we introduce 360-1M, a 360 video dataset, and a process for efficiently finding corresponding frames from diverse viewpoints at scale. We train our diffusion-based model, Odin, on 360-1M. Empowered by the largest real-world, multi-view dataset to date, Odin is able to freely generate novel views of real-world scenes. Unlike previous methods, Odin can move the camera through the environment, enabling the model to infer the geometry and layout of the scene. Additionally, we show improved performance on standard novel view synthesis and 3D reconstruction benchmarks.

* NeurIPS 2024. For project page, see https://mattwallingford.github.io/ODIN

Via

Access Paper or Ask Questions

LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

Dec 03, 2024

Xiaoyan Xing, Konrad Groh, Sezer Karaoglu, Theo Gevers, Anand Bhattad

Figure 1 for LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

Figure 2 for LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

Figure 3 for LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

Figure 4 for LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

Abstract:We introduce LumiNet, a novel architecture that leverages generative models and latent intrinsic representations for effective lighting transfer. Given a source image and a target lighting image, LumiNet synthesizes a relit version of the source scene that captures the target's lighting. Our approach makes two key contributions: a data curation strategy from the StyleGAN-based relighting model for our training, and a modified diffusion-based ControlNet that processes both latent intrinsic properties from the source image and latent extrinsic properties from the target image. We further improve lighting transfer through a learned adaptor (MLP) that injects the target's latent extrinsic properties via cross-attention and fine-tuning. Unlike traditional ControlNet, which generates images with conditional maps from a single scene, LumiNet processes latent representations from two different images - preserving geometry and albedo from the source while transferring lighting characteristics from the target. Experiments demonstrate that our method successfully transfers complex lighting phenomena including specular highlights and indirect illumination across scenes with varying spatial layouts and materials, outperforming existing approaches on challenging indoor scenes using only images as input.

* Project page: https://luminet-relight.github.io

Via

Access Paper or Ask Questions

SpotLight: Shadow-Guided Object Relighting via Diffusion

Nov 27, 2024

Frédéric Fortier-Chouinard, Zitian Zhang, Louis-Etienne Messier, Mathieu Garon, Anand Bhattad, Jean-François Lalonde

Abstract:Recent work has shown that diffusion models can be used as powerful neural rendering engines that can be leveraged for inserting virtual objects into images. Unlike typical physics-based renderers, however, neural rendering engines are limited by the lack of manual control over the lighting setup, which is often essential for improving or personalizing the desired image outcome. In this paper, we show that precise lighting control can be achieved for object relighting simply by specifying the desired shadows of the object. Rather surprisingly, we show that injecting only the shadow of the object into a pre-trained diffusion-based neural renderer enables it to accurately shade the object according to the desired light position, while properly harmonizing the object (and its shadow) within the target background image. Our method, SpotLight, leverages existing neural rendering approaches and achieves controllable relighting results with no additional training. Specifically, we demonstrate its use with two neural renderers from the recent literature. We show that SpotLight achieves superior object compositing results, both quantitatively and perceptually, as confirmed by a user study, outperforming existing diffusion-based models specifically designed for relighting.

* Project page: https://lvsn.github.io/spotlight

Via

Access Paper or Ask Questions

ScribbleLight: Single Image Indoor Relighting with Scribbles

Nov 26, 2024

Jun Myeong Choi, Annie Wang, Pieter Peers, Anand Bhattad, Roni Sengupta

Abstract:Image-based relighting of indoor rooms creates an immersive virtual understanding of the space, which is useful for interior design, virtual staging, and real estate. Relighting indoor rooms from a single image is especially challenging due to complex illumination interactions between multiple lights and cluttered objects featuring a large variety in geometrical and material complexity. Recently, generative models have been successfully applied to image-based relighting conditioned on a target image or a latent code, albeit without detailed local lighting control. In this paper, we introduce ScribbleLight, a generative model that supports local fine-grained control of lighting effects through scribbles that describe changes in lighting. Our key technical novelty is an Albedo-conditioned Stable Image Diffusion model that preserves the intrinsic color and texture of the original image after relighting and an encoder-decoder-based ControlNet architecture that enables geometry-preserving lighting effects with normal map and scribble annotations. We demonstrate ScribbleLight's ability to create different lighting effects (e.g., turning lights on/off, adding highlights, cast shadows, or indirect lighting from unseen lights) from sparse scribble annotations.

Via

Access Paper or Ask Questions

ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion

Oct 10, 2024

Zitian Zhang, Frédéric Fortier-Chouinard, Mathieu Garon, Anand Bhattad, Jean-François Lalonde

Figure 1 for ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion

Figure 2 for ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion

Figure 3 for ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion

Figure 4 for ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion

Abstract:We present ZeroComp, an effective zero-shot 3D object compositing approach that does not require paired composite-scene images during training. Our method leverages ControlNet to condition from intrinsic images and combines it with a Stable Diffusion model to utilize its scene priors, together operating as an effective rendering engine. During training, ZeroComp uses intrinsic images based on geometry, albedo, and masked shading, all without the need for paired images of scenes with and without composite objects. Once trained, it seamlessly integrates virtual 3D objects into scenes, adjusting shading to create realistic composites. We developed a high-quality evaluation dataset and demonstrate that ZeroComp outperforms methods using explicit lighting estimations and generative techniques in quantitative and human perception benchmarks. Additionally, ZeroComp extends to real and outdoor image compositing, even when trained solely on synthetic indoor data, showcasing its effectiveness in image compositing.

Via

Access Paper or Ask Questions

Latent Intrinsics Emerge from Training to Relight

May 31, 2024

Xiao Zhang, William Gao, Seemandhar Jain, Michael Maire, David. A. Forsyth, Anand Bhattad

Figure 1 for Latent Intrinsics Emerge from Training to Relight

Figure 2 for Latent Intrinsics Emerge from Training to Relight

Figure 3 for Latent Intrinsics Emerge from Training to Relight

Figure 4 for Latent Intrinsics Emerge from Training to Relight

Abstract:Image relighting is the task of showing what a scene from a source image would look like if illuminated differently. Inverse graphics schemes recover an explicit representation of geometry and a set of chosen intrinsics, then relight with some form of renderer. However error control for inverse graphics is difficult, and inverse graphics methods can represent only the effects of the chosen intrinsics. This paper describes a relighting method that is entirely data-driven, where intrinsics and lighting are each represented as latent variables. Our approach produces SOTA relightings of real scenes, as measured by standard metrics. We show that albedo can be recovered from our latent intrinsics without using any example albedos, and that the albedos recovered are competitive with SOTA methods.

Via

Access Paper or Ask Questions

Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion

Mar 22, 2024

Xiang Fan, Anand Bhattad, Ranjay Krishna

Abstract:We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantic, spatial, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc. with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher quality edits against 6 baselines on 2 editing benchmarks using 10 evaluation metrics.

* Project page at https://videoshop-editing.github.io/

Via

Access Paper or Ask Questions

Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometryfor now

Nov 28, 2023

Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, D. A. Forsyth, Anand Bhattad

Figure 1 for Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometryfor now

Figure 2 for Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometryfor now

Figure 3 for Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometryfor now

Figure 4 for Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometryfor now

Abstract:Generative models can produce impressively realistic images. This paper demonstrates that generated images have geometric features different from those of real images. We build a set of collections of generated images, prequalified to fool simple, signal-based classifiers into believing they are real. We then show that prequalified generated images can be identified reliably by classifiers that only look at geometric properties. We use three such classifiers. All three classifiers are denied access to image pixels, and look only at derived geometric features. The first classifier looks at the perspective field of the image, the second looks at lines detected in the image, and the third looks at relations between detected objects and shadows. Our procedure detects generated images more reliably than SOTA local signal based detectors, for images from a number of distinct generators. Saliency maps suggest that the classifiers can identify geometric problems reliably. We conclude that current generators cannot reliably reproduce geometric properties of real images.

* Project Page: https://projective-geometry.github.io | First three authors contributed equally

Via

Access Paper or Ask Questions