Il Yong Chun

MAMS: Model-Agnostic Module Selection Framework for Video Captioning

Jan 30, 2025

Sangho Lee, Il Yong Chun, Hogun Park

Abstract:Multi-modal transformers are rapidly gaining attention in video captioning tasks. Existing multi-modal video captioning methods typically extract a fixed number of frames, which raises critical challenges. When a limited number of frames are extracted, important frames with essential information for caption generation may be missed. Conversely, extracting an excessive number of frames includes consecutive frames, potentially causing redundancy in visual tokens extracted from consecutive video frames. To extract an appropriate number of frames for each video, this paper proposes the first model-agnostic module selection framework in video captioning that has two main functions: (1) selecting a caption generation module with an appropriate size based on visual tokens extracted from video frames, and (2) constructing subsets of visual tokens for the selected caption generation module. Furthermore, we propose a new adaptive attention masking scheme that enhances attention on important visual tokens. Our experiments on three different benchmark datasets demonstrate that the proposed framework significantly improves the performance of three recent video captioning models.

* Accepted to the AAAI 2025 Main Technical Track. This is an extended version of the original submission

LaB-CL: Localized and Balanced Contrastive Learning for improving parking slot detection

Oct 10, 2024

U Jin Jeong, Sumin Roh, Il Yong Chun

Abstract:Parking slot detection is an essential technology in autonomous parking systems. In general, the classification problem of parking slot detection consists of two tasks: one determines whether localized candidates are junctions of parking slots, and the other identifies the shape of detected junctions. Both classification tasks can easily face biased learning toward the majority class, degrading classification performance. Yet, the data imbalance issue has been overlooked in parking slot detection. We propose the first supervised contrastive learning framework for parking slot detection, Localized and Balanced Contrastive Learning for improving parking slot detection (LaB-CL). The proposed LaB-CL framework uses two main approaches. First, we propose to include class prototypes to consider representations from all classes in every mini-batch, from the local perspective. Second, we propose a new hard negative sampling scheme that selects local representations with high prediction error. Experiments with the benchmark dataset demonstrate that the proposed LaB-CL framework can outperform existing parking slot detection methods.

* 7 pages, 6 figures

DX2CT: Diffusion Model for 3D CT Reconstruction from Bi or Mono-planar 2D X-ray(s)

Sep 13, 2024

Yun Su Jeong, Hye Bin Yoo, Il Yong Chun

Abstract:Computed tomography (CT) provides high-resolution medical imaging, but it can expose patients to high radiation. X-ray scanners have low radiation exposure, but their resolutions are low. This paper proposes a new conditional diffusion model, DX2CT, that reconstructs three-dimensional (3D) CT volumes from bi- or mono-planar X-ray image(s). The proposed DX2CT consists of two key components: 1) modulating feature maps extracted from two-dimensional (2D) X-ray(s) with 3D positions of the CT volume using a new transformer and 2) effectively using the modulated 3D position-aware feature maps as conditions of DX2CT. In particular, the proposed transformer can provide conditions with rich information about a target CT slice to the conditional diffusion model, enabling high-quality CT reconstruction. Our experiments with the bi- or mono-planar X-ray(s) benchmark datasets show that the proposed DX2CT outperforms several state-of-the-art methods. Our codes and model will be available at: https://www.github.com/intyeger/DX2CT.

Improving Neural Radiance Field using Near-Surface Sampling with Point Cloud Generation

Oct 06, 2023

Hye Bin Yoo, Hyun Min Han, Sung Soo Hwang, Il Yong Chun

Abstract:Neural radiance field (NeRF) is an emerging view synthesis method that samples points in a three-dimensional (3D) space and estimates their existence and color probabilities. The disadvantage of NeRF is that it requires a long training time since it samples many 3D points. In addition, if one samples points from occluded regions or in the space where an object is unlikely to exist, the rendering quality of NeRF can be degraded. These issues can be solved by estimating the geometry of the 3D scene. This paper proposes a near-surface sampling framework to improve the rendering quality of NeRF. To this end, the proposed method estimates the surface of a 3D object using depth images of the training set and samples points only around that surface. To obtain depth information on a novel view, the paper proposes a 3D point cloud generation method and a simple refining method for projected depth from a point cloud. Experimental results show that the proposed near-surface sampling NeRF framework can significantly improve the rendering quality, compared to the original NeRF and a state-of-the-art depth-based NeRF method. In addition, one can significantly reduce the training time of a NeRF model with the proposed near-surface sampling framework.

* 13 figures, 2 tables
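The near-surface sampling idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `margin` parameter, and the use of evenly spaced bin centres are all assumptions made for illustration. Given an estimated surface depth along a ray, sample distances are restricted to a narrow band around that depth instead of the full near-to-far range.

```python
def stratified_samples(t_min, t_max, n):
    """Return n bin-centre distances evenly covering [t_min, t_max]."""
    step = (t_max - t_min) / n
    return [t_min + (i + 0.5) * step for i in range(n)]


def near_surface_samples(depth, margin, n, near, far):
    """Sample only within +/- margin of the estimated surface depth.

    `margin` is a hypothetical band half-width; the band is clipped to
    the usual [near, far] ray bounds.
    """
    lo = max(near, depth - margin)
    hi = min(far, depth + margin)
    return stratified_samples(lo, hi, n)


def ray_points(origin, direction, distances):
    """Convert distances along a ray into 3D points: origin + t * direction."""
    return [tuple(o + t * d for o, d in zip(origin, direction))
            for t in distances]


# Full-range sampling versus near-surface sampling for one ray.
full = stratified_samples(2.0, 6.0, 4)
band = near_surface_samples(depth=4.0, margin=0.5, n=4, near=2.0, far=6.0)
points = ray_points((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), band)
```

With the same number of samples, the band concentrates every sample within 0.5 units of the estimated surface, which is the intuition behind using fewer samples per ray.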

End-to-End Driving via Self-Supervised Imitation Learning Using Camera and LiDAR Data

Aug 28, 2023

Jin Bok Park, Jinkyu Lee, Muhyun Back, Hyunmin Han, David T. Ma, Sang Min Won, Sung Soo Hwang, Il Yong Chun

Abstract:In autonomous driving, the end-to-end (E2E) driving approach that predicts vehicle control signals directly from sensor data is rapidly gaining attention. To learn a safe E2E driving system, one needs an extensive amount of driving data and human intervention. Vehicle control data is constructed by many hours of human driving, and it is challenging to construct large vehicle control datasets. Often, publicly available driving datasets are collected with limited driving scenes, and only vehicle manufacturers can collect vehicle control data. To address these challenges, this paper proposes the first self-supervised learning framework, self-supervised imitation learning (SSIL), that can learn E2E driving networks without using driving command data. To construct pseudo steering angle data, the proposed SSIL predicts a pseudo target from the vehicle's poses at the current and previous time points that are estimated with light detection and ranging sensors. Our numerical experiments demonstrate that the proposed SSIL framework achieves E2E driving accuracy comparable to that of its supervised learning counterpart. In addition, our qualitative analyses using a conventional visual explanation tool show that NNs trained by the proposed SSIL and by its supervised counterpart attend to similar objects in making predictions.

* 20 pages, 8 figures
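As a rough illustration of deriving a pseudo steering target from two consecutive vehicle poses, the sketch below uses a kinematic bicycle model. This is an assumption chosen for illustration, not necessarily how SSIL computes its pseudo target; the pose layout `(x, y, yaw)`, the function names, and the `wheelbase` default are hypothetical.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def pseudo_steering(prev_pose, curr_pose, wheelbase=2.7):
    """Estimate a steering angle (radians) from two (x, y, yaw) poses.

    Assumes a kinematic bicycle model, where the heading change per unit
    distance travelled equals tan(steering) / wheelbase.
    """
    x0, y0, yaw0 = prev_pose
    x1, y1, yaw1 = curr_pose
    distance = math.hypot(x1 - x0, y1 - y0)
    if distance < 1e-6:
        # The vehicle barely moved, so no curvature can be inferred.
        return 0.0
    yaw_change = wrap_angle(yaw1 - yaw0)
    return math.atan(wheelbase * yaw_change / distance)


# Straight driving gives zero steering; a left turn gives a positive angle.
straight = pseudo_steering((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
left_turn = pseudo_steering((0.0, 0.0, 0.0), (1.0, 0.0, 0.1))
```

Wrapping the yaw difference keeps the estimate stable when the heading crosses the ±π boundary between the two time points.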

Self-supervised regression learning using domain knowledge: Applications to improving self-supervised denoising in imaging

May 10, 2022

Il Yong Chun, Dongwon Park, Xuehang Zheng, Se Young Chun, Yong Long

Abstract:Regression that predicts continuous quantities is a central part of applications using computational imaging and computer vision technologies. Yet, studying and understanding self-supervised learning for regression tasks - except for a particular regression task, image denoising - have lagged behind. This paper proposes a general self-supervised regression learning (SSRL) framework that enables learning regression neural networks with only input data (but without ground-truth target data), by using a designable pseudo-predictor that encapsulates domain knowledge of a specific application. The paper underlines the importance of using domain knowledge by showing that under different settings, a better pseudo-predictor can bring the properties of SSRL closer to those of ordinary supervised learning. Numerical experiments for low-dose computed tomography denoising and camera image denoising demonstrate that the proposed SSRL significantly improves the denoising quality over several existing self-supervised denoising methods.

* 17 pages, 16 figures, 2 tables, submitted to IEEE T-IP

Accelerated MRI With Deep Linear Convolutional Transform Learning

Apr 17, 2022

Hongyi Gu, Burhaneddin Yaman, Steen Moeller, Il Yong Chun, Mehmet Akçakaya

Abstract:Recent studies show that deep learning (DL) based MRI reconstruction outperforms conventional methods, such as parallel imaging and compressed sensing (CS), in multiple applications. Unlike CS that is typically implemented with pre-determined linear representations for regularization, DL inherently uses a non-linear representation learned from a large database. Another line of work uses transform learning (TL) to bridge the gap between these two approaches by learning linear representations from data. In this work, we combine ideas from CS, TL and DL reconstructions to learn deep linear convolutional transforms as part of an algorithm unrolling approach. Using end-to-end training, our results show that the proposed technique can reconstruct MR images to a level comparable to DL methods, while supporting uniform undersampling patterns unlike conventional CS methods. Our proposed method relies on convex sparse image reconstruction with linear representation at inference time, which may be beneficial for characterizing robustness, stability and generalizability.

* To be published in 2022 44th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)

Improved Real-Time Monocular SLAM Using Semantic Segmentation on Selective Frames

Apr 30, 2021

Jinkyu Lee, Muhyun Back, Sung Soo Hwang, Il Yong Chun

Abstract:Monocular simultaneous localization and mapping (SLAM) is emerging in advanced driver assistance systems and autonomous driving, because a single camera is cheap and easy to install. Conventional monocular SLAM has two major challenges leading to inaccurate localization and mapping. First, it is challenging to estimate scales in localization and mapping. Second, conventional monocular SLAM uses inappropriate mapping factors such as dynamic objects and low-parallax areas in mapping. This paper proposes an improved real-time monocular SLAM that resolves the aforementioned challenges by efficiently using deep learning-based semantic segmentation. To achieve real-time execution of the proposed method, we apply semantic segmentation only to downsampled keyframes in parallel with mapping processes. In addition, the proposed method corrects scales of camera poses and three-dimensional (3D) points, using the estimated ground plane from road-labeled 3D points and the real camera height. The proposed method also removes inappropriate corner features labeled as moving objects and low-parallax areas. Experiments with six video sequences demonstrate that the proposed monocular SLAM system achieves significantly higher trajectory tracking accuracy than state-of-the-art monocular SLAM and trajectory tracking accuracy comparable to that of state-of-the-art stereo SLAM.
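The scale correction step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: the ground plane is assumed to be already fitted in the camera frame as `normal · p + offset = 0` with the camera at the origin, and all function names are hypothetical. The scale factor is the ratio of the real camera height to the camera height implied by the estimated plane.

```python
import math


def camera_height_from_plane(normal, offset):
    """Distance from the camera origin to the plane normal . p + offset = 0."""
    norm = math.sqrt(sum(c * c for c in normal))
    if norm == 0.0:
        raise ValueError("plane normal must be non-zero")
    return abs(offset) / norm


def scale_factor(real_height, normal, offset):
    """Ratio of the known camera height to the height in SLAM units."""
    estimated = camera_height_from_plane(normal, offset)
    if estimated == 0.0:
        raise ValueError("camera lies on the estimated ground plane")
    return real_height / estimated


def rescale(points, scale):
    """Apply the scale factor to 3D points (or camera translations)."""
    return [tuple(scale * c for c in p) for p in points]


# Estimated ground plane y = 0.5 in SLAM units; real camera height 1.5 m.
s = scale_factor(1.5, normal=(0.0, 1.0, 0.0), offset=-0.5)
metric_points = rescale([(1.0, 0.5, 2.0)], s)
```

In this example the map is three times too small, so every 3D point and camera translation is multiplied by 3 to recover metric units.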

Improved and efficient inter-vehicle distance estimation using road gradients of both ego and target vehicles

Apr 01, 2021

Muhyun Back, Jinkyu Lee, Kyuho Bae, Sung Soo Hwang, Il Yong Chun

Abstract:In advanced driver assistance systems and autonomous driving, it is crucial to estimate distances between an ego vehicle and target vehicles. Existing inter-vehicle distance estimation methods assume that the ego and target vehicles drive on the same ground plane. In practical driving environments, however, they may drive on different ground planes. This paper proposes an inter-vehicle distance estimation framework that can consider slope changes of the road ahead, by estimating road gradients of both ego vehicle and target vehicles and using a 2D object detection deep net. Numerical experiments demonstrate that the proposed method significantly improves the distance estimation accuracy and time complexity, compared to deep learning-based depth estimation methods.

* 5 pages, 3 figures, 2 tables, submitted to IEEE ICAS 2021

An Improved Iterative Neural Network for High-Quality Image-Domain Material Decomposition in Dual-Energy CT

Dec 02, 2020

Zhipeng Li, Yong Long, Il Yong Chun

Abstract:Dual-energy computed tomography (DECT) has been widely used in many applications that need material decomposition. Image-domain methods directly decompose material images from high- and low-energy attenuation images, and thus, are susceptible to noise and artifacts on attenuation images. To obtain high-quality material images, various data-driven methods have been proposed. Iterative neural network (INN) methods combine regression NNs and model-based image reconstruction algorithms. INNs reduced the generalization error of (noniterative) deep regression NNs, and achieved high-quality reconstruction in diverse medical imaging applications. BCD-Net is a recent INN architecture that incorporates image refining NNs into the block coordinate descent (BCD) model-based image reconstruction algorithm. We propose a new INN architecture, distinct cross-material BCD-Net, for DECT material decomposition. The proposed INN architecture uses a distinct cross-material convolutional neural network (CNN) in image refining modules, and uses image decomposition physics in image reconstruction modules. The distinct cross-material CNN refiners incorporate distinct encoding-decoding filters and a cross-material model that captures correlations between different materials. We interpret the distinct cross-material CNN refiner from a patch perspective. Numerical experiments with extended cardiac-torso (XCAT) phantom and clinical data show that the proposed distinct cross-material BCD-Net significantly improves the image quality over several image-domain material decomposition methods, including a conventional model-based image decomposition (MBID) method using an edge-preserving regularizer, a state-of-the-art MBID method using pre-learned material-wise sparsifying transforms, and a noniterative deep CNN denoiser.
