Shanyan Guan

Neural Material Adaptor for Visual Grounding of Intrinsic Dynamics

Oct 10, 2024

Junyi Cao, Shanyan Guan, Yanhao Ge, Wei Li, Xiaokang Yang, Chao Ma

Abstract: While humans effortlessly discern intrinsic dynamics and adapt to new scenarios, modern AI systems often struggle to do so. Current methods for visual grounding of dynamics either use purely neural-network-based simulators (black box), which may violate physical laws, or traditional physical simulators (white box), which rely on expert-defined equations that may not fully capture the actual dynamics. We propose the Neural Material Adaptor (NeuMA), which integrates existing physical laws with learned corrections, enabling accurate learning of the actual dynamics while retaining the generalizability and interpretability of physical priors. Additionally, we propose Particle-GS, a particle-driven variant of 3D Gaussian Splatting that bridges simulation and observed images, allowing image gradients to be back-propagated to optimize the simulator. Comprehensive experiments on various dynamics, evaluating grounded particle accuracy, dynamic rendering quality, and generalization ability, demonstrate that NeuMA can accurately capture intrinsic dynamics.

* NeurIPS 2024, the project page: https://xjay18.github.io/projects/neuma.html
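
The idea of keeping an expert-defined law and learning only a correction on top of it can be illustrated with a deliberately tiny sketch. It assumes a 1D Hooke's-law prior with a fixed stiffness `k`, synthetic observations that contain an unmodeled cubic term, and a single learned coefficient `a` fitted by gradient descent. All names and values here are hypothetical; NeuMA's actual formulation is described in the paper and is not reproduced by this toy.

```python
# Toy sketch (hypothetical setup): a fixed physical prior plus a learned residual correction.

def physics_prior(x: float, k: float = 2.0) -> float:
    """Expert-defined law (Hooke's law): restoring force proportional to displacement."""
    return -k * x


def corrected_force(x: float, a: float) -> float:
    """Prior plus a learned correction, here a single cubic term a * x**3."""
    return physics_prior(x) + a * x ** 3


def fit_correction(xs, forces, lr=0.05, steps=2000):
    """Fit the correction coefficient `a` by gradient descent on the mean squared error."""
    a = 0.0  # start from the pure physical prior (zero correction)
    n = len(xs)
    for _ in range(steps):
        grad = 0.0
        for x, f in zip(xs, forces):
            residual = corrected_force(x, a) - f
            grad += 2.0 * residual * x ** 3 / n  # derivative of the squared error w.r.t. a
        a -= lr * grad
    return a


# Synthetic "observations": the prior plus an unmodeled cubic term the prior cannot express.
true_a = -0.5
xs = [i / 10.0 for i in range(-10, 11)]
forces = [physics_prior(x) + true_a * x ** 3 for x in xs]

a_hat = fit_correction(xs, forces)
print(f"learned correction coefficient: {a_hat:.3f}")
```

Because the correction starts at zero, the model behaves exactly like the physical prior until the data pushes it away, which is the white-box/black-box compromise the abstract motivates, in miniature.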

HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation

Oct 10, 2024

Shanyan Guan, Yanhao Ge, Ying Tai, Jian Yang, Wei Li, Mingyu You

Abstract: Recent text-to-image diffusion models have shown remarkable creative capabilities with textual prompts, but generating personalized instances of specific subjects, known as subject-driven generation, remains challenging. To tackle this issue, we present HybridBooth, a hybrid framework that merges the benefits of optimization-based and direct-regression methods. HybridBooth operates in two stages: the Word Embedding Probe, which generates a robust initial word embedding using a fine-tuned encoder, and the Word Embedding Refinement, which further adapts the encoder to the specific subject images by optimizing key parameters. This approach enables effective and fast inversion of visual concepts into textual embeddings, even from a single image, while maintaining the model's generalization capabilities.

* ECCV 2024, the project page: https://sites.google.com/view/hybridbooth

PostEdit: Posterior Sampling for Efficient Zero-Shot Image Editing

Oct 07, 2024

Feng Tian, Yixuan Li, Yichao Yan, Shanyan Guan, Yanhao Ge, Xiaokang Yang

Abstract: In image editing, three core challenges persist: controllability, background preservation, and efficiency. Inversion-based methods rely on time-consuming optimization to preserve the features of the initial image, which makes them inefficient because they require extensive network inference. Conversely, inversion-free methods achieve efficiency by sidestepping the preservation of initial features, and therefore lack theoretical support for background similarity. As a consequence, neither family of methods achieves both high efficiency and background consistency. To address these shortcomings, we introduce PostEdit, a method that incorporates a posterior scheme to govern the diffusion sampling process. Specifically, a measurement term related to both the initial features and Langevin dynamics is introduced to optimize the estimated image generated from the given target prompt. Extensive experimental results indicate that PostEdit achieves state-of-the-art editing performance while accurately preserving unedited regions. Furthermore, the method is both inversion-free and training-free, requiring approximately 1.5 seconds and 18 GB of GPU memory to generate high-quality results.
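
The abstract's measurement term involves Langevin dynamics. As a generic illustration only (not PostEdit's algorithm), the sketch below runs unadjusted Langevin dynamics on a 1D Gaussian posterior: a standard normal prior stands in for the generative model, and a Gaussian measurement term with noise level `sigma` pulls samples toward an observation `y`. The function name and all values are assumptions, chosen so the result can be checked analytically: the posterior mean is y / (1 + sigma^2).

```python
import math
import random


def langevin_posterior_samples(y, sigma=1.0, step=0.01, n_steps=400, n_chains=1000, seed=0):
    """Unadjusted Langevin dynamics for p(x | y) proportional to N(x; 0, 1) * N(y; x, sigma^2)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_chains):
        x = 0.0
        for _ in range(n_steps):
            prior_score = -x                          # gradient of log N(x; 0, 1)
            measurement_score = (y - x) / sigma ** 2  # gradient of log N(y; x, sigma^2)
            noise = rng.gauss(0.0, 1.0)
            # Drift along the posterior score plus injected Gaussian noise.
            x = x + step * (prior_score + measurement_score) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples


samples = langevin_posterior_samples(y=2.0, sigma=1.0)
mean = sum(samples) / len(samples)
# For this Gaussian case the analytic posterior mean is y / (1 + sigma^2) = 1.0.
print(f"empirical posterior mean: {mean:.2f}")
```

The prior score alone would draw samples toward 0; adding the measurement score is what anchors them to the observation, which is the role a measurement term plays in posterior sampling.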

PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection

Sep 06, 2022

Han Wang, Jun Tang, Xiaodong Liu, Shanyan Guan, Rong Xie, Li Song

Abstract: Recent years have seen a trend of using context frames to boost object detection performance in videos, a task known as video object detection. Existing methods usually aggregate features in a single step to enhance them. These methods, however, usually lack spatial information from neighboring frames and suffer from insufficient feature aggregation. To address these issues, we progressively introduce both temporal and spatial information for an integrated enhancement. Temporal information is introduced by the Temporal Feature Aggregation Model (TFAM), which applies an attention mechanism between the context frames and the target frame (i.e., the frame to be detected). Meanwhile, we employ a Spatial Transition Awareness Model (STAM) to convey the location transition information between each context frame and the target frame. Built upon the transformer-based detector DETR, our PTSEFormer also follows an end-to-end fashion, avoiding heavy post-processing procedures, and achieves 88.1% mAP on the ImageNet VID dataset. Code is available at https://github.com/Hon-Wong/PTSEFormer.
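
TFAM is described as an attention mechanism between the context frames and the target frame. A minimal single-head, scaled dot-product attention sketch shows how context features can be weighted by their similarity to the target frame. It assumes a target-frame query vector and per-context-frame key and value vectors as plain Python lists, with no learned projections and no multi-head split; it is not the PTSEFormer implementation, which lives in the linked repository.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(target_query, context_keys, context_values):
    """Single-head scaled dot-product attention: the target-frame query
    aggregates context-frame values weighted by query-key similarity."""
    d = len(target_query)
    scores = [sum(q * k for q, k in zip(target_query, key)) / math.sqrt(d) for key in context_keys]
    weights = softmax(scores)
    fused = [0.0] * len(context_values[0])
    for w, value in zip(weights, context_values):
        for i, v in enumerate(value):
            fused[i] += w * v
    return fused, weights


# Usage: the first context frame resembles the target, so it receives the larger weight.
fused, weights = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print(weights, fused)
```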

NeuroFluid: Fluid Dynamics Grounding with Particle-Driven Neural Radiance Fields

Mar 03, 2022

Shanyan Guan, Huayu Deng, Yunbo Wang, Xiaokang Yang

Abstract: Deep learning has shown great potential for modeling the physical dynamics of complex particle systems such as fluids (in Lagrangian descriptions). Existing approaches, however, require supervision on consecutive particle properties, including positions and velocities. In this paper, we consider a partially observable scenario known as fluid dynamics grounding, that is, inferring the state transitions and interactions within fluid particle systems from sequential visual observations of the fluid surface. We propose a differentiable two-stage network named NeuroFluid. Our approach consists of (i) a particle-driven neural renderer, which incorporates fluid physical properties into the volume rendering function, and (ii) a particle transition model optimized to reduce the differences between the rendered and the observed images. By training these two models jointly, NeuroFluid provides the first solution to unsupervised learning of particle-based fluid dynamics. It is shown to reasonably estimate the underlying physics of fluids with different initial shapes, viscosities, and densities. It offers a potential alternative approach to understanding complex fluid mechanics, such as turbulence, which is difficult to model using traditional methods of mathematical physics.

Out-of-Domain Human Mesh Reconstruction via Dynamic Bilevel Online Adaptation

Nov 07, 2021

Shanyan Guan, Jingwei Xu, Michelle Z. He, Yunbo Wang, Bingbing Ni, Xiaokang Yang

Abstract: We consider a new problem of adapting a human mesh reconstruction model to out-of-domain streaming videos, where the performance of existing SMPL-based models is significantly degraded by distribution shifts in camera parameters, bone lengths, backgrounds, and occlusions. We tackle this problem through online adaptation, gradually correcting the model bias during testing. There are two main challenges. First, the lack of 3D annotations increases the training difficulty and results in 3D ambiguities. Second, the non-stationary data distribution makes it difficult to strike a balance between fitting regular frames and hard samples with severe occlusions or dramatic changes. To this end, we propose the Dynamic Bilevel Online Adaptation algorithm (DynaBOA). It first introduces temporal constraints to compensate for the unavailable 3D annotations, and leverages a bilevel optimization procedure to resolve conflicts between multiple objectives. DynaBOA provides additional 3D guidance by co-training with similar source examples, which are retrieved efficiently despite the distribution shift. Furthermore, it can adaptively adjust the number of optimization steps on individual frames to fully fit hard samples and avoid overfitting regular frames. DynaBOA achieves state-of-the-art results on three out-of-domain human mesh reconstruction benchmarks.

* 14 pages, 13 figures; code repository: https://github.com/syguan96/DynaBOA
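
The abstract states that DynaBOA adaptively adjusts the number of optimization steps per frame. One simple way to realize that idea, assumed here purely for illustration, is to keep taking gradient steps on a frame until its loss drops below a tolerance `tol` or a cap `max_steps` is reached. The scalar parameter `w`, the function `adapt_on_stream`, and the quadratic per-frame loss are hypothetical stand-ins for the mesh model and its objectives; the paper's actual stopping criterion may differ.

```python
def adapt_on_stream(targets, w=0.0, lr=0.2, tol=1e-2, max_steps=50):
    """Online adaptation with a per-frame adaptive number of gradient steps.

    Each frame's loss is the hypothetical quadratic (w - target)^2. Easy frames
    (already below tol) get zero steps; hard frames get as many as needed, up to max_steps.
    """
    steps_used = []
    for target in targets:
        steps = 0
        while (w - target) ** 2 > tol and steps < max_steps:
            grad = 2.0 * (w - target)
            w -= lr * grad
            steps += 1
        steps_used.append(steps)
    return w, steps_used


# Usage: the abrupt change at the third frame (a "hard sample") triggers extra steps,
# while the nearly unchanged frames are left alone to avoid overfitting them.
w_final, steps_used = adapt_on_stream([0.0, 0.05, 3.0, 3.02])
print(w_final, steps_used)
```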

Bilevel Online Adaptation for Out-of-Domain Human Mesh Reconstruction

Mar 30, 2021

Shanyan Guan, Jingwei Xu, Yunbo Wang, Bingbing Ni, Xiaokang Yang

Abstract: This paper considers a new problem of adapting a pre-trained human mesh reconstruction model to out-of-domain streaming videos. Most previous methods based on the parametric SMPL model (Loper et al., 2015) underperform in new domains with unexpected, domain-specific attributes, such as camera parameters, bone lengths, backgrounds, and occlusions. Our general idea is to dynamically fine-tune the source model on test video streams with additional temporal constraints, so that it can mitigate the domain gaps without overfitting the 2D information of individual test frames. A subsequent challenge is how to avoid conflicts between the 2D and temporal constraints. We propose to tackle this problem with a new training algorithm named Bilevel Online Adaptation (BOA), which divides the optimization of the overall multi-objective into two steps, weight probe and weight update, within each training iteration. We demonstrate that BOA achieves state-of-the-art results on two human mesh reconstruction benchmarks.

* CVPR 2021, the project page: https://sites.google.com/view/humanmeshboa
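
The two-step iteration of weight probe and weight update can be sketched on a single scalar weight. In this toy, which is an assumption and not BOA's exact formulation, the probe takes a temporary step on a single-frame loss, the temporal loss is evaluated at the probed weight, and the original weight is then updated with both gradients; the temporal gradient flows back through the probe step via the chain rule. The function `bilevel_step` and the loss targets `a` and `b` are hypothetical.

```python
def bilevel_step(w, a, b, lr=0.1):
    """One bilevel iteration on a scalar weight with hypothetical quadratic losses:
    lower level (single frame): L_f(w) = (w - a)^2
    upper level (temporal):     L_t(w) = (w - b)^2, evaluated at the probed weight."""
    # Weight probe: a temporary step on the single-frame loss only.
    grad_f = 2.0 * (w - a)
    w_probe = w - lr * grad_f
    # Gradient of L_t(w_probe(w)) w.r.t. w by the chain rule; d(w_probe)/dw = 1 - 2 * lr.
    grad_t = 2.0 * (w_probe - b) * (1.0 - 2.0 * lr)
    # Weight update: apply both objectives to the original weight, not to the probe.
    return w - lr * (grad_f + grad_t)


# Usage: the single-frame target (a = 0) and temporal target (b = 1) conflict.
w = 0.0
for _ in range(100):
    w = bilevel_step(w, a=0.0, b=1.0)
print(f"converged weight: {w:.4f}")
```

The iteration settles at a compromise between the two objectives instead of overfitting the single-frame loss, which is the kind of conflict the bilevel split is meant to manage.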

Collaborative Learning for Faster StyleGAN Embedding

Jul 03, 2020

Shanyan Guan, Ying Tai, Bingbing Ni, Feida Zhu, Feiyue Huang, Xiaokang Yang

Abstract: The latent space of the recently popular StyleGAN model learns disentangled representations thanks to its multi-layer style-based generator. Embedding a given image back into the latent space of StyleGAN enables a wide range of semantic image editing applications. Although previous optimization-based works yield impressive inversion results, they suffer from low efficiency. In this work, we propose a novel collaborative learning framework that consists of an efficient embedding network and an optimization-based iterator. On the one hand, as training progresses, the embedding network provides a reasonable latent code initialization for the iterator. On the other hand, the latent code updated by the iterator in turn supervises the embedding network. In the end, a high-quality latent code can be obtained efficiently with a single forward pass through our embedding network. Extensive experiments demonstrate the effectiveness and efficiency of our work.

* 10 pages, 11 figures
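
The collaborative loop between an embedding network and an optimization-based iterator can be sketched with scalars. Assumed for illustration: a fixed linear "generator" G(z) = 2z in place of StyleGAN, a one-parameter encoder E(x) = v * x in place of the embedding network, and a few gradient steps on the reconstruction error as the iterator. All function names are hypothetical. The encoder initializes the iterator, and the refined code is then used as a regression target for the encoder.

```python
def generator(z):
    # Hypothetical stand-in for a pretrained generator: a fixed linear map.
    return 2.0 * z


def iterate(x, z0, lr=0.05, steps=5):
    """Optimization-based iterator: refine z to minimize (G(z) - x)^2, starting from z0."""
    z = z0
    for _ in range(steps):
        grad = 2.0 * (generator(z) - x) * 2.0  # chain rule through G(z) = 2z
        z -= lr * grad
    return z


def collaborative_training(data, v=0.0, enc_lr=0.1, epochs=50):
    """Encoder E(x) = v * x initializes the iterator; the refined code supervises the encoder."""
    for _ in range(epochs):
        for x in data:
            z_init = v * x              # embedding-network initialization
            z_ref = iterate(x, z_init)  # iterator refinement
            # Regress the encoder output toward the refined code (z_ref treated as a fixed target).
            grad_v = 2.0 * (v * x - z_ref) * x
            v -= enc_lr * grad_v
    return v


data = [0.5, 1.0, -0.8, 0.3]
v = collaborative_training(data)
# After training, a single "forward pass" v * x yields a code that G maps back to x.
print(v, generator(v * 0.7))
```

In this toy the encoder parameter approaches 0.5, the exact inverse of G, so the iterator is only needed during training, mirroring the abstract's claim that a single forward pass suffices at the end.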
