Etai Sella

InstanceGen: Image Generation with Instance-level Instructions

May 08, 2025

Etai Sella, Yanir Kleiman, Hadar Averbuch-Elor

Abstract: Despite rapid advancements in the capabilities of generative models, pretrained text-to-image models still struggle to capture the semantics conveyed by complex prompts that compound multiple objects and instance-level attributes. Consequently, we are witnessing growing interest in integrating additional structural constraints, typically in the form of coarse bounding boxes, to better guide the generation process in such challenging cases. In this work, we take the idea of structural guidance a step further by making the observation that contemporary image generation models can directly provide a plausible fine-grained structural initialization. We propose a technique that couples this image-based structural guidance with LLM-based instance-level instructions, yielding output images that adhere to all parts of the text prompt, including object counts, instance-level attributes, and spatial relations between instances.

* Project page: https://tau-vailab.github.io/InstanceGen/

SPiC-E : Structural Priors in 3D Diffusion Models using Cross-Entity Attention

Nov 30, 2023

Etai Sella, Gal Fiebelman, Noam Atia, Hadar Averbuch-Elor

Abstract: We are witnessing rapid progress in automatically generating and manipulating 3D assets due to the availability of pretrained text-image diffusion models. However, time-consuming optimization procedures are required for synthesizing each sample, hindering their potential for democratizing 3D content creation. Conversely, 3D diffusion models now train on million-scale 3D datasets, yielding high-quality text-conditional 3D samples within seconds. In this work, we present SPiC-E, a neural network that adds structural guidance to 3D diffusion models, extending their usage beyond text-conditional generation. At its core, our framework introduces a cross-entity attention mechanism that allows for multiple entities (in particular, paired input and guidance 3D shapes) to interact via their internal representations within the denoising network. We utilize this mechanism for learning task-specific structural priors in 3D diffusion models from auxiliary guidance shapes. We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing and text-conditional abstraction-to-3D, which transforms primitive-based abstractions into highly expressive shapes. Extensive experiments demonstrate that SPiC-E achieves state-of-the-art (SOTA) performance on these tasks while often being considerably faster than alternative methods. Importantly, this is accomplished without tailoring our approach to any specific task.

* Project webpage: https://tau-vailab.github.io/spic-e

Vox-E: Text-guided Voxel Editing of 3D Objects

Mar 21, 2023

Etai Sella, Gal Fiebelman, Peter Hedman, Hadar Averbuch-Elor

Abstract: Large-scale text-guided diffusion models have garnered significant attention due to their ability to synthesize diverse images that convey complex visual concepts. This generative power has more recently been leveraged to perform text-to-3D synthesis. In this work, we present a technique that harnesses the power of latent diffusion models for editing existing 3D objects. Our method takes oriented 2D images of a 3D object as input and learns a grid-based volumetric representation of it. To guide the volumetric representation to conform to a target text prompt, we follow unconditional text-to-3D methods and optimize a Score Distillation Sampling (SDS) loss. However, we observe that combining this diffusion-guided loss with an image-based regularization loss that encourages the representation not to deviate too strongly from the input object is challenging, as it requires achieving two conflicting goals while viewing only structure-and-appearance coupled 2D projections. Thus, we introduce a novel volumetric regularization loss that operates directly in 3D space, utilizing the explicit nature of our 3D representation to enforce correlation between the global structure of the original and edited object. Furthermore, we present a technique that optimizes cross-attention volumetric grids to refine the spatial extent of the edits. Extensive experiments and comparisons demonstrate the effectiveness of our approach in creating a myriad of edits that cannot be achieved by prior works.

* Project webpage: https://tau-vailab.github.io/Vox-E/
