Topic:Scene Graph Generation
What is Scene Graph Generation? Scene graph generation is the process of creating structured representations of scenes that capture the relationships between objects.
Papers and Code
Oct 20, 2024
Abstract:Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP and follow a standard zero-shot pipeline -- computing similarity between the query image and the text embeddings for each category (i.e., text classifiers). In this work, we argue that the text classifiers adopted by existing OVSGG methods, i.e., category-/part-level prompts, are scene-agnostic as they remain unchanged across contexts. Using such fixed text classifiers not only struggles to model visual relations with high variance, but also falls short in adapting to distinct contexts. To plug these intrinsic shortcomings, we devise SDSGG, a scene-specific description based OVSGG framework where the weights of text classifiers are adaptively adjusted according to the visual content. In particular, to generate comprehensive and diverse descriptions oriented to the scene, an LLM is asked to play different roles (e.g., biologist and engineer) to analyze and discuss the descriptive features of a given scene from different views. Unlike previous efforts simply treating the generated descriptions as mutually equivalent text classifiers, SDSGG is equipped with an advanced renormalization mechanism to adjust the influence of each text classifier based on its relevance to the presented scene (this is what the term "specific" means). Furthermore, to capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter. It refines CLIP's ability to recognize relations by learning an interaction-aware semantic space. Extensive experiments on prevalent benchmarks show that SDSGG outperforms top-leading methods by a clear margin.
Via
Oct 21, 2024
Abstract:Traditional rule-based physical models are limited by their reliance on singular physical formulas and parameters, making it difficult to effectively tackle the intricate tasks associated with crowd simulation. Recent research has introduced deep learning methods to tackle these issues, but most current approaches focus primarily on generating pedestrian trajectories, often lacking interpretability and failing to provide real-time dynamic simulations.To address the aforementioned issues, we propose a novel data-driven crowd simulation framework that integrates Physics-informed Machine Learning (PIML) with navigation potential fields. Our approach leverages the strengths of both physical models and PIML. Specifically, we design an innovative Physics-informed Spatio-temporal Graph Convolutional Network (PI-STGCN) as a data-driven module to predict pedestrian movement trends based on crowd spatio-temporal data. Additionally, we construct a physical model of navigation potential fields based on flow field theory to guide pedestrian movements, thereby reinforcing physical constraints during the simulation. In our framework, navigation potential fields are dynamically computed and updated based on the movement trends predicted by the PI-STGCN, while the updated crowd dynamics, guided by these fields, subsequently feed back into the PI-STGCN. Comparative experiments on two publicly available large-scale real-world datasets across five scenes demonstrate that our proposed framework outperforms existing rule-based methods in accuracy and fidelity. The similarity between simulated and actual pedestrian trajectories increases by 10.8%, while the average error is reduced by 4%. Moreover, our framework exhibits greater adaptability and better interpretability compared to methods that rely solely on deep learning for trajectory generation.
Via
Oct 21, 2024
Abstract:This paper tackles the challenging task of 3D visual grounding-locating a specific object in a 3D point cloud scene based on text descriptions. Existing methods fall into two categories: top-down and bottom-up methods. Top-down methods rely on a pre-trained 3D detector to generate and select the best bounding box, resulting in time-consuming processes. Bottom-up methods directly regress object bounding boxes with coarse-grained features, producing worse results. To combine their strengths while addressing their limitations, we propose a joint top-down and bottom-up framework, aiming to enhance the performance while improving the efficiency. Specifically, in the first stage, we propose a bottom-up based proposal generation module, which utilizes lightweight neural layers to efficiently regress and cluster several coarse object proposals instead of using a complex 3D detector. Then, in the second stage, we introduce a top-down based proposal consolidation module, which utilizes graph design to effectively aggregate and propagate the query-related object contexts among the generated proposals for further refinement. By jointly training these two modules, we can avoid the inherent drawbacks of the complex proposals in the top-down framework and the coarse proposals in the bottom-up framework. Experimental results on the ScanRefer benchmark show that our framework is able to achieve the state-of-the-art performance.
* Accepted by ICPR2024
Via
Oct 20, 2024
Abstract:In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding, due to the difficulty of 3D-wise spatial feature modeling. In this work, we consider modeling the SI2T and ST2I together under a dual learning framework. During the dual framework, we then propose to represent the 3D spatial scene features with a novel 3D scene graph (3DSG) representation that can be shared and beneficial to both tasks. Further, inspired by the intuition that the easier 3D$\to$image and 3D$\to$text processes also exist symmetrically in the ST2I and SI2T, respectively, we propose the Spatial Dual Discrete Diffusion (SD$^3$) framework, which utilizes the intermediate features of the 3D$\to$X processes to guide the hard X$\to$3D processes, such that the overall ST2I and SI2T will benefit each other. On the visual spatial understanding dataset VSD, our system outperforms the mainstream T2I and I2T methods significantly. Further in-depth analysis reveals how our dual learning strategy advances.
Via
Oct 17, 2024
Abstract:Enabling mobile robots to perform long-term tasks in dynamic real-world environments is a formidable challenge, especially when the environment changes frequently due to human-robot interactions or the robot's own actions. Traditional methods typically assume static scenes, which limits their applicability in the continuously changing real world. To overcome these limitations, we present DovSG, a novel mobile manipulation framework that leverages dynamic open-vocabulary 3D scene graphs and a language-guided task planning module for long-term task execution. DovSG takes RGB-D sequences as input and utilizes vision-language models (VLMs) for object detection to obtain high-level object semantic features. Based on the segmented objects, a structured 3D scene graph is generated for low-level spatial relationships. Furthermore, an efficient mechanism for locally updating the scene graph, allows the robot to adjust parts of the graph dynamically during interactions without the need for full scene reconstruction. This mechanism is particularly valuable in dynamic environments, enabling the robot to continually adapt to scene changes and effectively support the execution of long-term tasks. We validated our system in real-world environments with varying degrees of manual modifications, demonstrating its effectiveness and superior performance in long-term tasks. Our project page is available at: https://BJHYZJ.github.io/DoviSG.
* 8 pages, 5 figures
Via
Oct 17, 2024
Abstract:This paper proposes a method for on-demand scenario generation in simulation, grounded on real-world data. Evaluating the behaviour of Autonomous Vehicles (AVs) in both safety-critical and regular scenarios is essential for assessing their robustness before real-world deployment. By integrating scenarios derived from real-world datasets into the simulation, we enhance the plausibility and validity of testing sets. This work introduces a novel approach that employs temporal scene graphs to capture evolving spatiotemporal relationships among scene entities from a real-world dataset, enabling the generation of dynamic scenarios in simulation through Graph Neural Networks (GNNs). User-defined action and criticality conditioning are used to ensure flexible, tailored scenario creation. Our model significantly outperforms the benchmarks in accurately predicting links corresponding to the requested scenarios. We further evaluate the validity and compatibility of our generated scenarios in an off-the-shelf simulator.
* 7 pages, 8 figures
Via
Oct 17, 2024
Abstract:Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging as it requires visually verifying each claim within the response. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. To construct PROVE, we provide a large language model (LLM) with a high-fidelity scene-graph representation constructed from a hyper-detailed image caption, and prompt it to generate diverse question-answer (QA) pairs, as well as programs that can be executed over the scene graph object to verify each QA pair. We thus construct a benchmark of 10.5k challenging but visually grounded QA pairs. Next, to evaluate free-form model responses to queries in PROVE, we propose a programmatic evaluation strategy that measures both the helpfulness and truthfulness of a response within a unified scene graph-based framework. We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in-fact able to achieve a good balance between the two. Project page: \url{https://prove-explorer.netlify.app/}.
Via
Oct 15, 2024
Abstract:Scene graphs offer a structured, hierarchical representation of images, with nodes and edges symbolizing objects and the relationships among them. It can serve as a natural interface for image editing, dramatically improving precision and flexibility. Leveraging this benefit, we introduce a new framework that integrates large language model (LLM) with Text2Image generative model for scene graph-based image editing. This integration enables precise modifications at the object level and creative recomposition of scenes without compromising overall image integrity. Our approach involves two primary stages: 1) Utilizing a LLM-driven scene parser, we construct an image's scene graph, capturing key objects and their interrelationships, as well as parsing fine-grained attributes such as object masks and descriptions. These annotations facilitate concept learning with a fine-tuned diffusion model, representing each object with an optimized token and detailed description prompt. 2) During the image editing phase, a LLM editing controller guides the edits towards specific areas. These edits are then implemented by an attention-modulated diffusion editor, utilizing the fine-tuned model to perform object additions, deletions, replacements, and adjustments. Through extensive experiments, we demonstrate that our framework significantly outperforms existing image editing methods in terms of editing precision and scene aesthetics.
* Accepted by ACM Transactions on Graphics and SIGGRAPH Asia 2024.
Project page: https://bestzzhang.github.io/SGEdit
Via
Oct 16, 2024
Abstract:Recent advancements in Radiance Fields have significantly improved novel-view synthesis. However, in many real-world applications, the more advanced challenge lies in inverse rendering, which seeks to derive the physical properties of a scene, including light, geometry, textures, and materials. Meshes, as a traditional representation adopted by many simulation pipeline, however, still show limited influence in radiance field for inverse rendering. This paper introduces a novel framework called Triangle Patchlet (abbr. Triplet), a mesh-based representation, to comprehensively approximate these scene parameters. We begin by assembling Triplets with either randomly generated points or sparse points obtained from camera calibration where all faces are treated as an independent element. Next, we simulate the physical interaction of light and optimize the scene parameters using traditional graphics rendering techniques like rasterization and ray tracing, accompanying with density control and propagation. An iterative mesh extracting process is also suggested, where we continue to optimize on geometry and materials with graph-based operation. We also introduce several regulation terms to enable better generalization of materials property. Our framework could precisely estimate the light, materials and geometry with mesh without prior of light, materials and geometry in a unified framework. Experiments demonstrate that our approach can achieve state-of-the-art visual quality while reconstructing high-quality geometry and accurate material properties.
Via
Oct 13, 2024
Abstract:For biodiversity and forestry applications, end-users desire maps of forests that are fully detailed, from the forest floor to the canopy. Terrestrial laser scanning and aerial laser scanning are accurate and increasingly mature methods for scanning the forest. However, individually they are not able to estimate attributes such as tree height, trunk diameter and canopy density due to the inherent differences in their field-of-view and mapping processes. In this work, we present a pipeline that can automatically generate a single joint terrestrial and aerial forest reconstruction. The novelty of the approach is a marker-free registration pipeline, which estimates a set of relative transformation constraints between the aerial cloud and terrestrial sub-clouds without requiring any co-registration reflective markers to be physically placed in the scene. Our method then uses these constraints in a pose graph formulation, which enables us to finely align the respective clouds while respecting spatial constraints introduced by the terrestrial SLAM scanning process. We demonstrate that our approach can produce a fine-grained and complete reconstruction of large-scale natural environments, enabling multi-platform data capture for forestry applications without requiring external infrastructure.
Via