Miaomiao Liu

Dalian University of Technology

Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video

Apr 28, 2025

Hoang Chuong Nguyen, Wei Mao, Jose M. Alvarez, Miaomiao Liu

Abstract: Neural Radiance Fields (NeRF) have demonstrated superior capability to represent 3D geometry but require accurately precomputed camera poses during training. To mitigate this requirement, existing methods jointly optimize camera poses and NeRF, often relying on good pose initialisation or depth priors. However, these approaches struggle in challenging scenarios, such as large rotations, as they map each camera to a world coordinate system. We propose a novel method that eliminates prior dependencies by modeling continuous camera motions as time-dependent angular velocity and velocity. Relative motions between cameras are learned first via velocity integration, while camera poses can be obtained by aggregating such relative motions up to a world coordinate system defined at a single time step within the video. Specifically, accurate continuous camera movements are learned through a time-dependent NeRF, which captures local scene geometry and motion by training from neighboring frames for each time step. The learned motions enable fine-tuning the NeRF to represent the full scene geometry. Experiments on Co3D and ScanNet show our approach achieves superior camera pose and depth estimation and comparable novel-view synthesis performance compared to state-of-the-art methods. Our code is available at https://github.com/HoangChuongNguyen/cope-nerf.
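
The idea of integrating velocities into relative motions and then chaining them into camera poses can be illustrated with a small sketch. This is not the authors' implementation: it assumes piecewise-constant velocities over a fixed step `dt`, uses Rodrigues' formula for the rotation, and approximates the translation to first order as `v * dt`. All function names are hypothetical.

```python
import math

def mat_mul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def rotation_from_angular_velocity(omega, dt):
    """Rotation matrix exp(skew(omega) * dt) via Rodrigues' formula."""
    identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    speed = math.sqrt(sum(c * c for c in omega))
    angle = speed * dt
    if abs(angle) < 1e-12:
        return identity
    x, y, z = (c / speed for c in omega)          # unit rotation axis
    k = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]  # skew-symmetric matrix
    k2 = mat_mul(k, k)
    s, c = math.sin(angle), 1.0 - math.cos(angle)
    return [[identity[i][j] + s * k[i][j] + c * k2[i][j]
             for j in range(3)] for i in range(3)]

def relative_pose(omega, velocity, dt):
    """4x4 relative transform for one step (assumption: translation = v * dt)."""
    r = rotation_from_angular_velocity(omega, dt)
    t = [v * dt for v in velocity]
    return [r[0] + [t[0]], r[1] + [t[1]], r[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]

def accumulate_poses(omegas, velocities, dt):
    """Chain relative motions; the first frame defines the world frame."""
    poses = [[[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]]
    for omega, velocity in zip(omegas, velocities):
        poses.append(mat_mul(poses[-1], relative_pose(omega, velocity, dt)))
    return poses
```

Here the world coordinate system is simply the first frame; any other time step could serve as the reference by composing with the inverse of its pose.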

Magnetic Distortion Resistant Orientation Estimation

Oct 16, 2024

Sikai Yang, Miaomiao Liu, Wan Du

Abstract: Inertial Measurement Unit (IMU) sensors, including accelerometers, gyroscopes, and magnetometers, are used to estimate the orientation of mobile devices. However, indoor magnetic fields are often distorted, causing the magnetometer's readings to deviate from true north and resulting in inaccurate orientation estimates. Existing solutions either ignore magnetic distortion or avoid using the magnetometer when distortion is detected. In this paper, we develop MDR, a Magnetic Distortion Resistant orientation estimation system that fundamentally models and corrects magnetic distortion. MDR builds a database to record magnetic directions at different locations and uses it to correct orientation estimates affected by magnetic distortion. To avoid the overhead of database preparation, MDR adopts practical designs to automatically update the database in parallel with orientation estimation. Experiments on 27+ hours of arm motion data show that MDR outperforms the state-of-the-art method by 35.34%.

* 14 pages

SOAF: Scene Occlusion-aware Neural Acoustic Field

Jul 02, 2024

Huiyu Gao, Jiahao Ma, David Ahmedt-Aristizabal, Chuong Nguyen, Miaomiao Liu

Abstract: This paper tackles the problem of novel view audio-visual synthesis along an arbitrary trajectory in an indoor scene, given the audio-video recordings from other known trajectories of the scene. Existing methods often overlook the effect of room geometry, particularly the effect of wall occlusion on sound propagation, making them less accurate in multi-room environments. In this work, we propose a new approach called Scene Occlusion-aware Acoustic Field (SOAF) for accurate sound generation. Our approach derives a prior for the sound energy field using distance-aware parametric sound-propagation modelling and then transforms it based on scene transmittance learned from the input video. We extract features from the local acoustic field centred around the receiver using a Fibonacci Sphere to generate binaural audio for novel views with a direction-aware attention mechanism. Extensive experiments on the real dataset RWAVS and the synthetic dataset SoundSpaces demonstrate that our method outperforms previous state-of-the-art techniques in audio generation. Project page: https://github.com/huiyu-gao/SOAF/.
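
A Fibonacci sphere is a standard way to spread sample directions nearly uniformly around a point, such as the receiver mentioned above. The sketch below shows the common golden-angle construction; how SOAF uses the sampled directions is not specified here, and the function name is hypothetical.

```python
import math

def fibonacci_sphere(n):
    """Return n approximately evenly spaced unit vectors on a sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n        # evenly spaced heights in (-1, 1)
        radius = math.sqrt(1.0 - z * z)      # radius of the circle at height z
        phi = golden_angle * i               # rotate by the golden angle each step
        points.append((radius * math.cos(phi), radius * math.sin(phi), z))
    return points
```

Each returned vector can be scaled by a query radius and added to the receiver position to obtain sample locations around it.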

Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation

Apr 23, 2024

Hoang Chuong Nguyen, Tianyu Wang, Jose M. Alvarez, Miaomiao Liu

Abstract: This paper focuses on self-supervised monocular depth estimation in dynamic scenes trained on monocular videos. Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss. Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity in depth and motion estimation, resulting in inaccurate depth estimation. This paper proposes a self-supervised training framework exploiting pseudo depth labels for dynamic regions from training data. The key contribution of our framework is to decouple depth estimation for static and dynamic regions of images in the training data. We start with an unsupervised depth estimation approach, which provides reliable depth estimates for static regions and motion cues for dynamic regions and allows us to extract moving object information at the instance level. In the next stage, we use an object network to estimate the depth of those moving objects assuming rigid motions. Then, we propose a new scale alignment module to address the scale ambiguity between estimated depths for static and dynamic regions. We can then use the depth labels generated to train an end-to-end depth estimation network and improve its performance. Extensive experiments on the Cityscapes and KITTI datasets show that our self-training strategy consistently outperforms existing self/unsupervised depth estimation methods.

* Accepted to CVPR2024
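
The scale ambiguity mentioned in the abstract arises because two monocular depth estimates may agree only up to an unknown scale factor. The abstract does not describe the proposed alignment module, so the sketch below swaps in a common heuristic, median-ratio scaling, purely to illustrate the problem. The function and argument names are hypothetical.

```python
import statistics

def align_scale(object_depths, reference_depths):
    """Rescale object_depths so their median ratio to reference_depths is 1.

    Assumption: both lists hold depths for the same pixels, in the same order.
    """
    ratios = [ref / obj
              for obj, ref in zip(object_depths, reference_depths)
              if obj > 0 and ref > 0]        # ignore invalid (non-positive) depths
    if not ratios:
        raise ValueError("no valid depth pairs to align")
    scale = statistics.median(ratios)        # robust to a few outlier pixels
    return scale, [d * scale for d in object_depths]
```

The median is used instead of the mean so that a few badly estimated pixels do not dominate the recovered scale.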

HashPoint: Accelerated Point Searching and Sampling for Neural Rendering

Apr 22, 2024

Jiahao Ma, Miaomiao Liu, David Ahmedt-Aristizabal, Chuong Nguyen

Abstract: In this paper, we address the problem of efficient point searching and sampling for volume neural rendering. Within this realm, two typical approaches are employed: rasterization and ray tracing. The rasterization-based methods enable real-time rendering at the cost of increased memory and lower fidelity. In contrast, the ray-tracing-based methods yield superior quality but demand longer rendering time. We address this trade-off with our HashPoint method, which combines these two strategies, leveraging rasterization for efficient point searching and sampling, and ray marching for rendering. Our method optimizes point searching by rasterizing points within the camera's view, organizing them in a hash table, and facilitating rapid searches. Notably, we accelerate the rendering process by adaptive sampling on the primary surface encountered by the ray. Our approach yields substantial speed-up for a range of state-of-the-art ray-tracing-based methods, maintaining equivalent or superior accuracy across synthetic and real test datasets. The code will be available at https://jiahao-ma.github.io/hashpoint/.

* The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024
* CVPR2024 Highlight
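
The point-searching step described in the abstract (rasterize points in the camera's view, organize them in a hash table, then look them up per ray) can be sketched roughly as follows. This is a simplified illustration, not the HashPoint code: it assumes a pinhole camera with points already in camera coordinates, and the names `build_pixel_hash` and `points_along_ray` are hypothetical.

```python
import math
from collections import defaultdict

def build_pixel_hash(points, focal, cx, cy, width, height):
    """Project 3D points to pixels and bucket them by pixel coordinate."""
    table = defaultdict(list)
    for index, (x, y, z) in enumerate(points):
        if z <= 0:
            continue                          # behind the camera
        u = int(math.floor(focal * x / z + cx))
        v = int(math.floor(focal * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            table[(u, v)].append((z, index))
    for bucket in table.values():
        bucket.sort()                         # nearest point first along the ray
    return dict(table)

def points_along_ray(table, u, v, max_points=None):
    """Indices of points seen through pixel (u, v), ordered by depth."""
    indices = [index for _, index in table.get((u, v), [])]
    return indices if max_points is None else indices[:max_points]
```

Because each bucket is sorted by depth, the first entry approximates the primary surface hit by the ray through that pixel, which is where the abstract says sampling is concentrated.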

MIDGET: Music Conditioned 3D Dance Generation

Apr 18, 2024

Jinwu Wang, Wei Mao, Miaomiao Liu

Abstract: In this paper, we introduce a MusIc conditioned 3D Dance GEneraTion model, named MIDGET, based on a Dance motion Vector Quantised Variational AutoEncoder (VQ-VAE) model and a Motion Generative Pre-Training (GPT) model, to generate vibrant and high-quality dances that match the music rhythm. To tackle challenges in the field, we introduce three new components: 1) a pre-trained memory codebook based on the Motion VQ-VAE model to store different human pose codes, 2) a Motion GPT model that generates pose codes with music and motion encoders, and 3) a simple framework for music feature extraction. We compare with existing state-of-the-art models and perform ablation experiments on AIST++, the largest publicly available music-dance dataset. Experiments demonstrate that our proposed framework achieves state-of-the-art performance on motion quality and its alignment with the music.

* In Australasian Joint Conference on Artificial Intelligence (pp. 277-288). Singapore: Springer Nature Singapore 2023
* 12 pages, 6 figures Published in AI 2023: Advances in Artificial Intelligence
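
The codebook of pose codes mentioned in the abstract relies on the standard VQ-VAE quantization step: each encoded feature vector is replaced by the index of its nearest codebook entry. The sketch below shows only that generic lookup with toy data. It is not the MIDGET implementation, and the function names are hypothetical.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry closest to vector."""
    def squared_distance(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    index = min(range(len(codebook)), key=lambda k: squared_distance(codebook[k]))
    return index, codebook[index]

def encode_sequence(features, codebook):
    """Map a sequence of feature vectors to a sequence of discrete codes."""
    return [quantize(feature, codebook)[0] for feature in features]
```

A sequence model such as a GPT-style transformer can then be trained to predict these discrete codes, which is the general pattern the abstract describes.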

Scene-aware Human Motion Forecasting via Mutual Distance Prediction

Oct 01, 2023

Chaoyue Xing, Wei Mao, Miaomiao Liu

Abstract: In this paper, we tackle the problem of scene-aware 3D human motion forecasting. A key challenge of this task is to predict future human motions that are consistent with the scene, by modelling the human-scene interactions. While recent works have demonstrated that explicit constraints on human-scene interactions can prevent the occurrence of ghost motion, they only provide constraints on partial human motion, e.g., the global motion of the human or a few joints contacting the scene, leaving the rest of the motion unconstrained. To address this limitation, we propose to model the human-scene interaction with the mutual distance between the human body and the scene. Such mutual distances constrain both the local and global human motion, resulting in a whole-body motion constrained prediction. In particular, mutual distance constraints consist of two components: the signed distance of each vertex on the human mesh to the scene surface, and the distance of basis scene points to the human mesh. We develop a pipeline with two prediction steps that first predicts the future mutual distances from the past human motion sequence and the scene, and then forecasts the future human motion conditioned on the predicted mutual distances. During training, we explicitly encourage consistency between the predicted poses and the mutual distances. Our approach outperforms the state-of-the-art methods on both synthetic and real datasets.

Variational Inference for Scalable 3D Object-centric Learning

Sep 25, 2023

Tianyu Wang, Kee Siong Ng, Miaomiao Liu

Abstract: We tackle the task of scalable unsupervised object-centric representation learning on 3D scenes. Existing approaches to object-centric representation learning show limitations in generalizing to larger scenes as their learning processes rely on a fixed global coordinate system. In contrast, we propose to learn view-invariant 3D object representations in localized object coordinate systems. To this end, we estimate the object pose and appearance representation separately and explicitly map object representations across views while maintaining object identities. We adopt an amortized variational inference pipeline that can process sequential input and scalably update object latent distributions online. To handle large-scale scenes with a varying number of objects, we further introduce a Cognitive Map that allows the registration and query of objects on a per-scene global map to achieve scalable representation learning. We explore the object-centric neural radiance field (NeRF) as our 3D scene representation, which is jointly modeled within our unsupervised object-centric learning framework. Experimental results on synthetic and real datasets show that our proposed method can infer and maintain object-centric representations of 3D scenes and outperforms previous models.

LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network

Jul 21, 2023

Hao Yang, Liyuan Pan, Yan Yang, Miaomiao Liu

Abstract: Recovering sharp images from dual-pixel (DP) pairs with disparity-dependent blur is a challenging task. Existing blur map-based deblurring methods have demonstrated promising results. In this paper, we propose, to the best of our knowledge, the first framework to introduce the contrastive language-image pre-training framework (CLIP) to achieve accurate blur map estimation from DP pairs in an unsupervised manner. To this end, we first carefully design text prompts to enable CLIP to understand blur-related geometric prior knowledge from the DP pair. Then, we propose a format for inputting a stereo DP pair to CLIP, which is pre-trained on monocular images, without any fine-tuning. Given the estimated blur map, we introduce a blur-prior attention block, a blur-weighting loss and a blur-aware loss to recover the all-in-focus image. Our method achieves state-of-the-art performance in extensive experiments.

Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's Eye View

Jul 11, 2023

Jiayu Yang, Enze Xie, Miaomiao Liu, Jose M. Alvarez

Abstract: Recent vision-only perception models for autonomous driving have achieved promising results by encoding multi-view image features into Bird's-Eye-View (BEV) space. A critical step and the main bottleneck of these methods is transforming image features into the BEV coordinate frame. This paper focuses on leveraging geometry information, such as depth, to model such feature transformation. Existing works either rely on non-parametric depth distribution modeling, which leads to significant memory consumption, or ignore the geometry information altogether. In contrast, we propose to use parametric depth distribution modeling for feature transformation. We first lift the 2D image features to the 3D space defined for the ego vehicle via a predicted parametric depth distribution for each pixel in each view. Then, we aggregate the 3D feature volume based on the 3D space occupancy derived from depth to the BEV frame. Finally, we use the transformed features for downstream tasks such as object detection and semantic segmentation. Existing semantic segmentation methods also suffer from a hallucination problem as they do not take visibility information into account. This hallucination can be particularly problematic for subsequent modules such as control and planning. To mitigate the issue, our method provides depth uncertainty and reliable visibility-aware estimations. We further leverage our parametric depth modeling to present a novel visibility-aware evaluation metric that, when taken into account, can mitigate the hallucination problem. Extensive experiments on object detection and semantic segmentation on the nuScenes dataset demonstrate that our method outperforms existing methods on both tasks.
