Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sanmin Kim

LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection

Jul 14, 2024

Sanmin Kim, Youngseok Kim, Sihwan Hwang, Hyeonjun Jeong, Dongsuk Kum

Figure 1 for LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection

Figure 2 for LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection

Figure 3 for LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection

Figure 4 for LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection

Abstract:Recent advancements in camera-based 3D object detection have introduced cross-modal knowledge distillation to bridge the performance gap with LiDAR 3D detectors, leveraging the precise geometric information in LiDAR point clouds. However, existing cross-modal knowledge distillation methods tend to overlook the inherent imperfections of LiDAR, such as the ambiguity of measurements on distant or occluded objects, which should not be transferred to the image detector. To mitigate these imperfections in LiDAR teacher, we propose a novel method that leverages aleatoric uncertainty-free features from ground truth labels. In contrast to conventional label guidance approaches, we approximate the inverse function of the teacher's head to effectively embed label inputs into feature space. This approach provides additional accurate guidance alongside LiDAR teacher, thereby boosting the performance of the image detector. Additionally, we introduce feature partitioning, which effectively transfers knowledge from the teacher modality while preserving the distinctive features of the student, thereby maximizing the potential of both modalities. Experimental results demonstrate that our approach improves mAP and NDS by 5.1 points and 4.9 points compared to the baseline model, proving the effectiveness of our approach. The code is available at https://github.com/sanmin0312/LabelDistill

* ECCV 2024

Via

Access Paper or Ask Questions

Predict to Detect: Prediction-guided 3D Object Detection using Sequential Images

Jun 14, 2023

Sanmin Kim, Youngseok Kim, In-Jae Lee, Dongsuk Kum

Abstract:Recent camera-based 3D object detection methods have introduced sequential frames to improve the detection performance hoping that multiple frames would mitigate the large depth estimation error. Despite improved detection performance, prior works rely on naive fusion methods (e.g., concatenation) or are limited to static scenes (e.g., temporal stereo), neglecting the importance of the motion cue of objects. These approaches do not fully exploit the potential of sequential images and show limited performance improvements. To address this limitation, we propose a novel 3D object detection model, P2D (Predict to Detect), that integrates a prediction scheme into a detection framework to explicitly extract and leverage motion features. P2D predicts object information in the current frame using solely past frames to learn temporal motion features. We then introduce a novel temporal feature aggregation method that attentively exploits Bird's-Eye-View (BEV) features based on predicted object information, resulting in accurate 3D object detection. Experimental results demonstrate that P2D improves mAP and NDS by 3.0% and 3.7% compared to the sequential image-based baseline, illustrating that incorporating a prediction scheme can significantly improve detection accuracy.

* 10 pages, 5 figures and 9 tables

Via

Access Paper or Ask Questions

CRN: Camera Radar Net for Accurate, Robust, Efficient 3D Perception

Apr 03, 2023

Youngseok Kim, Sanmin Kim, Juyeb Shin, Jun Won Choi, Dongsuk Kum

Abstract:Autonomous driving requires an accurate and fast 3D perception system that includes 3D object detection, tracking, and segmentation. Although recent low-cost camera-based approaches have shown promising results, they are susceptible to poor illumination or bad weather conditions and have a large localization error. Hence, fusing camera with low-cost radar, which provides precise long-range measurement and operates reliably in all environments, is promising but has not yet been thoroughly investigated. In this paper, we propose Camera Radar Net (CRN), a novel camera-radar fusion framework that generates a semantically rich and spatially accurate bird's-eye-view (BEV) feature map for various tasks. To overcome the lack of spatial information in an image, we transform perspective view image features to BEV with the help of sparse but accurate radar points. We further aggregate image and radar feature maps in BEV using multi-modal deformable attention designed to tackle the spatial misalignment between inputs. CRN with real-time setting operates at 20 FPS while achieving comparable performance to LiDAR detectors on nuScenes, and even outperforms at a far distance on 100m setting. Moreover, CRN with offline setting yields 62.4% NDS, 57.5% mAP on nuScenes test set and ranks first among all camera and camera-radar 3D object detectors.

* International Conference on Learning Representations 2023 Workshop on Scene Representations for Autonomous Driving (ICLR'23 SR4AD)

Via

Access Paper or Ask Questions

Boosting Monocular 3D Object Detection with Object-Centric Auxiliary Depth Supervision

Oct 29, 2022

Youngseok Kim, Sanmin Kim, Sangmin Sim, Jun Won Choi, Dongsuk Kum

Abstract:Recent advances in monocular 3D detection leverage a depth estimation network explicitly as an intermediate stage of the 3D detection network. Depth map approaches yield more accurate depth to objects than other methods thanks to the depth estimation network trained on a large-scale dataset. However, depth map approaches can be limited by the accuracy of the depth map, and sequentially using two separated networks for depth estimation and 3D detection significantly increases computation cost and inference time. In this work, we propose a method to boost the RGB image-based 3D detector by jointly training the detection network with a depth prediction loss analogous to the depth estimation task. In this way, our 3D detection network can be supervised by more depth supervision from raw LiDAR points, which does not require any human annotation cost, to estimate accurate depth without explicitly predicting the depth map. Our novel object-centric depth prediction loss focuses on depth around foreground objects, which is important for 3D object detection, to leverage pixel-wise depth supervision in an object-centric manner. Our depth regression model is further trained to predict the uncertainty of depth to represent the 3D confidence of objects. To effectively train the 3D detector with raw LiDAR points and to enable end-to-end training, we revisit the regression target of 3D objects and design a network architecture. Extensive experiments on KITTI and nuScenes benchmarks show that our method can significantly boost the monocular image-based 3D detector to outperform depth map approaches while maintaining the real-time inference speed.

* Accepted by IEEE Transaction on Intelligent Transportation System (T-ITS)

Via

Access Paper or Ask Questions

CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer

Sep 14, 2022

Youngseok Kim, Sanmin Kim, Jun Won Choi, Dongsuk Kum

Figure 1 for CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer

Figure 2 for CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer

Figure 3 for CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer

Figure 4 for CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer

Abstract:Camera and radar sensors have significant advantages in cost, reliability, and maintenance compared to LiDAR. Existing fusion methods often fuse the outputs of single modalities at the result-level, called the late fusion strategy. This can benefit from using off-the-shelf single sensor detection algorithms, but late fusion cannot fully exploit the complementary properties of sensors, thus having limited performance despite the huge potential of camera-radar fusion. Here we propose a novel proposal-level early fusion approach that effectively exploits both spatial and contextual properties of camera and radar for 3D object detection. Our fusion framework first associates image proposal with radar points in the polar coordinate system to efficiently handle the discrepancy between the coordinate system and spatial properties. Using this as a first stage, following consecutive cross-attention based feature fusion layers adaptively exchange spatio-contextual information between camera and radar, leading to a robust and attentive fusion. Our camera-radar fusion approach achieves the state-of-the-art 41.1% mAP and 52.3% NDS on the nuScenes test set, which is 8.7 and 10.8 points higher than the camera-only baseline, as well as yielding competitive performance on the LiDAR method.

Via

Access Paper or Ask Questions

Improving Diversity of Multiple Trajectory Prediction based on Map-adaptive Lane Loss

Jun 17, 2022

Sanmin Kim, Hyeongseok Jeon, Junwon Choi, Dongsuk Kum

Figure 1 for Improving Diversity of Multiple Trajectory Prediction based on Map-adaptive Lane Loss

Figure 2 for Improving Diversity of Multiple Trajectory Prediction based on Map-adaptive Lane Loss

Figure 3 for Improving Diversity of Multiple Trajectory Prediction based on Map-adaptive Lane Loss

Figure 4 for Improving Diversity of Multiple Trajectory Prediction based on Map-adaptive Lane Loss

Abstract:Prior arts in the field of motion predictions for autonomous driving tend to focus on finding a trajectory that is close to the ground truth trajectory. Such problem formulations and approaches, however, frequently lead to loss of diversity and biased trajectory predictions. Therefore, they are unsuitable for real-world autonomous driving where diverse and road-dependent multimodal trajectory predictions are critical for safety. To this end, this study proposes a novel loss function, \textit{Lane Loss}, that ensures map-adaptive diversity and accommodates geometric constraints. A two-stage trajectory prediction architecture with a novel trajectory candidate proposal module, \textit{Trajectory Prediction Attention (TPA)}, is trained with Lane Loss encourages multiple trajectories to be diversely distributed, covering feasible maneuvers in a map-aware manner. Furthermore, considering that the existing trajectory performance metrics are focusing on evaluating the accuracy based on the ground truth future trajectory, a quantitative evaluation metric is also suggested to evaluate the diversity of predicted multiple trajectories. The experiments performed on the Argoverse dataset show that the proposed method significantly improves the diversity of the predicted trajectories without sacrificing the prediction accuracy.

* 11pages 5figures

Via

Access Paper or Ask Questions