Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bedrich Benes

Tuning-Free Amodal Segmentation via the Occlusion-Free Bias of Inpainting Models

Mar 24, 2025

Jae Joong Lee, Bedrich Benes, Raymond A. Yeh

Abstract:Amodal segmentation aims to predict segmentation masks for both the visible and occluded regions of an object. Most existing works formulate this as a supervised learning problem, requiring manually annotated amodal masks or synthetic training data. Consequently, their performance depends on the quality of the datasets, which often lack diversity and scale. This work introduces a tuning-free approach that repurposes pretrained diffusion-based inpainting models for amodal segmentation. Our approach is motivated by the "occlusion-free bias" of inpainting models, i.e., the inpainted objects tend to be complete objects without occlusions. Specifically, we reconstruct the occluded regions of an object via inpainting and then apply segmentation, all without additional training or fine-tuning. Experiments on five datasets demonstrate the generalizability and robustness of our approach. On average, our approach achieves 5.3% more accurate masks over the state-of-the-art.

Via

Access Paper or Ask Questions

RGB2Point: 3D Point Cloud Generation from Single RGB Images

Jul 20, 2024

Jae Joong Lee, Bedrich Benes

Abstract:We introduce RGB2Point, an unposed single-view RGB image to a 3D point cloud generation based on Transformer. RGB2Point takes an input image of an object and generates a dense 3D point cloud. Contrary to prior works based on CNN layers and diffusion denoising approaches, we use pre-trained Transformer layers that are fast and generate high-quality point clouds with consistent quality over available categories. Our generated point clouds demonstrate high quality on a real-world dataset, as evidenced by improved Chamfer distance (51.15%) and Earth Mover's distance (45.96%) metrics compared to the current state-of-the-art. Additionally, our approach shows a better quality on a synthetic dataset, achieving better Chamfer distance (39.26%), Earth Mover's distance (26.95%), and F-score (47.16%). Moreover, our method produces 63.1% more consistent high-quality results across various object categories compared to prior works. Furthermore, RGB2Point is computationally efficient, requiring only 2.3GB of VRAM to reconstruct a 3D point cloud from a single RGB image, and our implementation generates the results 15,133x faster than a SOTA diffusion-based model.

Via

Access Paper or Ask Questions

Tree-D Fusion: Simulation-Ready Tree Dataset from Single Images with Diffusion Priors

Jul 14, 2024

Jae Joong Lee, Bosheng Li, Sara Beery, Jonathan Huang, Songlin Fei, Raymond A. Yeh, Bedrich Benes

Abstract:We introduce Tree D-fusion, featuring the first collection of 600,000 environmentally aware, 3D simulation-ready tree models generated through Diffusion priors. Each reconstructed 3D tree model corresponds to an image from Google's Auto Arborist Dataset, comprising street view images and associated genus labels of trees across North America. Our method distills the scores of two tree-adapted diffusion models by utilizing text prompts to specify a tree genus, thus facilitating shape reconstruction. This process involves reconstructing a 3D tree envelope filled with point markers, which are subsequently utilized to estimate the tree's branching structure using the space colonization algorithm conditioned on a specified genus.

* Accepted to ECCV24

Via

Access Paper or Ask Questions

LAESI: Leaf Area Estimation with Synthetic Imagery

Mar 31, 2024

Jacek Kałużny, Yannik Schreckenberg, Karol Cyganik, Peter Annighöfer, Sören Pirk, Dominik L. Michels, Mikolaj Cieslak, Farhah Assaad-Gerbert, Bedrich Benes, Wojciech Pałubicki

Abstract:We introduce LAESI, a Synthetic Leaf Dataset of 100,000 synthetic leaf images on millimeter paper, each with semantic masks and surface area labels. This dataset provides a resource for leaf morphology analysis primarily aimed at beech and oak leaves. We evaluate the applicability of the dataset by training machine learning models for leaf surface area prediction and semantic segmentation, using real images for validation. Our validation shows that these models can be trained to predict leaf surface area with a relative error not greater than an average human annotator. LAESI also provides an efficient framework based on 3D procedural models and generative AI for the large-scale, controllable generation of data with potential further applications in agriculture and biology. We evaluate the inclusion of generative AI in our procedural data generation pipeline and show how data filtering based on annotation consistency results in datasets which allow training the highest performing vision models.

* 10 pages, 12 figures, 1 table

Via

Access Paper or Ask Questions

Hands-Free VR

Feb 23, 2024

Jorge Askur Vazquez Fernandez, Jae Joong Lee, Santiago Andrés Serrano Vacca, Alejandra Magana, Bedrich Benes, Voicu Popescu

Abstract:The paper introduces Hands-Free VR, a voice-based natural-language interface for VR. The user gives a command using their voice, the speech audio data is converted to text using a speech-to-text deep learning model that is fine-tuned for robustness to word phonetic similarity and to spoken English accents, and the text is mapped to an executable VR command using a large language model that is robust to natural language diversity. Hands-Free VR was evaluated in a controlled within-subjects study (N = 22) that asked participants to find specific objects and to place them in various configurations. In the control condition participants used a conventional VR user interface to grab, carry, and position the objects using the handheld controllers. In the experimental condition participants used Hands-Free VR. The results confirm that: (1) Hands-Free VR is robust to spoken English accents, as for 20 of our participants English was not their first language, and to word phonetic similarity, correctly transcribing the voice command 96.71% of the time; (2) Hands-Free VR is robust to natural language diversity, correctly mapping the transcribed command to an executable command in 97.83% of the time; (3) Hands-Free VR had a significant efficiency advantage over the conventional VR interface in terms of task completion time, total viewpoint translation, total view direction rotation, and total left and right hand translations; (4) Hands-Free VR received high user preference ratings in terms of ease of use, intuitiveness, ergonomics, reliability, and desirability.

Via

Access Paper or Ask Questions

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

Dec 29, 2023

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu(+10 more)

Figure 1 for DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

Figure 2 for DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

Figure 3 for DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

Figure 4 for DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

Abstract:We have witnessed significant progress in deep learning-based 3D vision, ranging from neural radiance field (NeRF) based 3D representation learning to applications in novel view synthesis (NVS). However, existing scene-level datasets for deep learning-based 3D vision, limited to either synthetic environments or a narrow selection of real-world scenes, are quite insufficient. This insufficiency not only hinders a comprehensive benchmark of existing methods but also caps what could be explored in deep learning-based 3D analysis. To address this critical gap, we present DL3DV-10K, a large-scale scene dataset, featuring 51.2 million frames from 10,510 videos captured from 65 types of point-of-interest (POI) locations, covering both bounded and unbounded scenes, with different levels of reflection, transparency, and lighting. We conducted a comprehensive benchmark of recent NVS methods on DL3DV-10K, which revealed valuable insights for future research in NVS. In addition, we have obtained encouraging results in a pilot study to learn generalizable NeRF from DL3DV-10K, which manifests the necessity of a large-scale scene-level dataset to forge a path toward a foundation model for learning 3D representation. Our DL3DV-10K dataset, benchmark results, and models will be publicly accessible at https://dl3dv-10k.github.io/DL3DV-10K/.

Via

Access Paper or Ask Questions

DeepTree: Modeling Trees with Situated Latents

May 09, 2023

Xiaochen Zhou, Bosheng Li, Bedrich Benes, Songlin Fei, Sören Pirk

Abstract:In this paper, we propose DeepTree, a novel method for modeling trees based on learning developmental rules for branching structures instead of manually defining them. We call our deep neural model situated latent because its behavior is determined by the intrinsic state -- encoded as a latent space of a deep neural model -- and by the extrinsic (environmental) data that is situated as the location in the 3D space and on the tree structure. We use a neural network pipeline to train a situated latent space that allows us to locally predict branch growth only based on a single node in the branch graph of a tree model. We use this representation to progressively develop new branch nodes, thereby mimicking the growth process of trees. Starting from a root node, a tree is generated by iteratively querying the neural network on the newly added nodes resulting in the branching structure of the whole tree. Our method enables generating a wide variety of tree shapes without the need to define intricate parameters that control their growth and behavior. Furthermore, we show that the situated latents can also be used to encode the environmental response of tree models, e.g., when trees grow next to obstacles. We validate the effectiveness of our method by measuring the similarity of our tree models and by procedurally generated ones based on a number of established metrics for tree form.

Via

Access Paper or Ask Questions

StyleDEM: a Versatile Model for Authoring Terrains

Apr 19, 2023

Simon Perche, Adrien Peytavie, Bedrich Benes, Eric Galin, Eric Guérin

Figure 1 for StyleDEM: a Versatile Model for Authoring Terrains

Figure 2 for StyleDEM: a Versatile Model for Authoring Terrains

Figure 3 for StyleDEM: a Versatile Model for Authoring Terrains

Figure 4 for StyleDEM: a Versatile Model for Authoring Terrains

Abstract:Many terrain modelling methods have been proposed for the past decades, providing efficient and often interactive authoring tools. However, they generally do not include any notion of style, which is a critical aspect for designers in the entertainment industry. We introduce StyleDEM, a new generative adversarial network method for terrain synthesis and authoring, with a versatile toolbox of authoring methods with style. This method starts from an input sketch or an existing terrain. It outputs a terrain with features that can be authored using interactive brushes and enhanced with additional tools such as style manipulation or super-resolution. The strength of our approach resides in the versatility and interoperability of the toolbox.

Via

Access Paper or Ask Questions

SnakeVoxFormer: Transformer-based Single Image\\Voxel Reconstruction with Run Length Encoding

Mar 28, 2023

Jae Joong Lee, Bedrich Benes

$Figure 1 for SnakeVoxFormer: Transformer-based Single Image\\Voxel Reconstruction with Run Length Encoding$

$Figure 2 for SnakeVoxFormer: Transformer-based Single Image\\Voxel Reconstruction with Run Length Encoding$

$Figure 3 for SnakeVoxFormer: Transformer-based Single Image\\Voxel Reconstruction with Run Length Encoding$

$Figure 4 for SnakeVoxFormer: Transformer-based Single Image\\Voxel Reconstruction with Run Length Encoding$

Abstract:Deep learning-based 3D object reconstruction has achieved unprecedented results. Among those, the transformer deep neural model showed outstanding performance in many applications of computer vision. We introduce SnakeVoxFormer, a novel, 3D object reconstruction in voxel space from a single image using the transformer. The input to SnakeVoxFormer is a 2D image, and the result is a 3D voxel model. The key novelty of our approach is in using the run-length encoding that traverses (like a snake) the voxel space and encodes wide spatial differences into a 1D structure that is suitable for transformer encoding. We then use dictionary encoding to convert the discovered RLE blocks into tokens that are used for the transformer. The 1D representation is a lossless 3D shape data compression method that converts to 1D data that use only about 1% of the original data size. We show how different voxel traversing strategies affect the effect of encoding and reconstruction. We compare our method with the state-of-the-art for 3D voxel reconstruction from images and our method improves the state-of-the-art methods by at least 2.8% and up to 19.8%.

Via

Access Paper or Ask Questions

PixHt-Lab: Pixel Height Based Light Effect Generation for Image Compositing

Feb 28, 2023

Yichen Sheng, Jianming Zhang, Julien Philip, Yannick Hold-Geoffroy, Xin Sun, HE Zhang, Lu Ling, Bedrich Benes

Figure 1 for PixHt-Lab: Pixel Height Based Light Effect Generation for Image Compositing

Figure 2 for PixHt-Lab: Pixel Height Based Light Effect Generation for Image Compositing

Figure 3 for PixHt-Lab: Pixel Height Based Light Effect Generation for Image Compositing

Figure 4 for PixHt-Lab: Pixel Height Based Light Effect Generation for Image Compositing

Abstract:Lighting effects such as shadows or reflections are key in making synthetic images realistic and visually appealing. To generate such effects, traditional computer graphics uses a physically-based renderer along with 3D geometry. To compensate for the lack of geometry in 2D Image compositing, recent deep learning-based approaches introduced a pixel height representation to generate soft shadows and reflections. However, the lack of geometry limits the quality of the generated soft shadows and constrain reflections to pure specular ones. We introduce PixHt-Lab, a system leveraging an explicit mapping from pixel height representation to 3D space. Using this mapping, PixHt-Lab reconstructs both the cutout and background geometry and renders realistic, diverse, lighting effects for image compositing. Given a surface with physically-based materials, we can render reflections with varying glossiness. To generate more realistic soft shadows, we further propose to use 3D-aware buffer channels to guide a neural renderer. Both quantitative and qualitative evaluations demonstrate that PixHt-Lab significantly improves soft shadow generation.

* 11 pages, 10 figures

Via

Access Paper or Ask Questions