Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wonjoon Choi

Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth

Dec 09, 2025

Kyumin Hwang, Wonhyeok Choi, Kiljoon Han, Wonjoon Choi, Minwoo Choi, Yongcheon Na, Minwoo Park, Sunghoon Im

Figure 1 for Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth

Figure 2 for Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth

Figure 3 for Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth

Figure 4 for Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth

Abstract:Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining the knowledge distillation scheme--traditionally used in classification--with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.

* Accepted at IEEE Robotics and Automation Letters (RA-L) 2026

Via

Access Paper or Ask Questions

Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation

May 29, 2025

Sanggyun Ma, Wonjoon Choi, Jihun Park, Jaeyeul Kim, Seunghun Lee, Jiwan Seo, Sunghoon Im

Abstract:We present Bridging Geometric and Semantic (BriGeS), an effective method that fuses geometric and semantic information within foundation models to enhance Monocular Depth Estimation (MDE). Central to BriGeS is the Bridging Gate, which integrates the complementary strengths of depth and segmentation foundation models. This integration is further refined by our Attention Temperature Scaling technique. It finely adjusts the focus of the attention mechanisms to prevent over-concentration on specific features, thus ensuring balanced performance across diverse inputs. BriGeS capitalizes on pre-trained foundation models and adopts a strategy that focuses on training only the Bridging Gate. This method significantly reduces resource demands and training time while maintaining the model's ability to generalize effectively. Extensive experiments across multiple challenging datasets demonstrate that BriGeS outperforms state-of-the-art methods in MDE for complex scenes, effectively handling intricate structures and overlapping objects.

Via

Access Paper or Ask Questions

Intrinsic Image Decomposition for Robust Self-supervised Monocular Depth Estimation on Reflective Surfaces

Mar 28, 2025

Wonhyeok Choi, Kyumin Hwang, Minwoo Choi, Kiljoon Han, Wonjoon Choi, Mingyu Shin, Sunghoon Im

Figure 1 for Intrinsic Image Decomposition for Robust Self-supervised Monocular Depth Estimation on Reflective Surfaces

Figure 2 for Intrinsic Image Decomposition for Robust Self-supervised Monocular Depth Estimation on Reflective Surfaces

Figure 3 for Intrinsic Image Decomposition for Robust Self-supervised Monocular Depth Estimation on Reflective Surfaces

Figure 4 for Intrinsic Image Decomposition for Robust Self-supervised Monocular Depth Estimation on Reflective Surfaces

Abstract:Self-supervised monocular depth estimation (SSMDE) has gained attention in the field of deep learning as it estimates depth without requiring ground truth depth maps. This approach typically uses a photometric consistency loss between a synthesized image, generated from the estimated depth, and the original image, thereby reducing the need for extensive dataset acquisition. However, the conventional photometric consistency loss relies on the Lambertian assumption, which often leads to significant errors when dealing with reflective surfaces that deviate from this model. To address this limitation, we propose a novel framework that incorporates intrinsic image decomposition into SSMDE. Our method synergistically trains for both monocular depth estimation and intrinsic image decomposition. The accurate depth estimation facilitates multi-image consistency for intrinsic image decomposition by aligning different view coordinate systems, while the decomposition process identifies reflective areas and excludes corrupted gradients from the depth training process. Furthermore, our framework introduces a pseudo-depth generation and knowledge distillation technique to further enhance the performance of the student model across both reflective and non-reflective surfaces. Comprehensive evaluations on multiple datasets show that our approach significantly outperforms existing SSMDE baselines in depth prediction, especially on reflective surfaces.

* Accepted at AAAI 2025

Via

Access Paper or Ask Questions