Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Changjae Oh

HanDrawer: Leveraging Spatial Information to Render Realistic Hands Using a Conditional Diffusion Model in Single Stage

Mar 03, 2025

Qifan Fu, Xu Chen, Muhammad Asad, Shanxin Yuan, Changjae Oh, Gregory Slabaugh

Abstract:Although diffusion methods excel in text-to-image generation, generating accurate hand gestures remains a major challenge, resulting in severe artifacts, such as incorrect number of fingers or unnatural gestures. To enable the diffusion model to learn spatial information to improve the quality of the hands generated, we propose HanDrawer, a module to condition the hand generation process. Specifically, we apply graph convolutional layers to extract the endogenous spatial structure and physical constraints implicit in MANO hand mesh vertices. We then align and fuse these spatial features with other modalities via cross-attention. The spatially fused features are used to guide a single stage diffusion model denoising process for high quality generation of the hand region. To improve the accuracy of spatial feature fusion, we propose a Position-Preserving Zero Padding (PPZP) fusion strategy, which ensures that the features extracted by HanDrawer are fused into the region of interest in the relevant layers of the diffusion model. HanDrawer learns the entire image features while paying special attention to the hand region thanks to an additional hand reconstruction loss combined with the denoising loss. To accurately train and evaluate our approach, we perform careful cleansing and relabeling of the widely used HaGRID hand gesture dataset and obtain high quality multimodal data. Quantitative and qualitative analyses demonstrate the state-of-the-art performance of our method on the HaGRID dataset through multiple evaluation metrics. Source code and our enhanced dataset will be released publicly if the paper is accepted.

* 9 pages

Via

Access Paper or Ask Questions

Stereo Hand-Object Reconstruction for Human-to-Robot Handover

Dec 10, 2024

Yik Lung Pang, Alessio Xompero, Changjae Oh, Andrea Cavallaro

Figure 1 for Stereo Hand-Object Reconstruction for Human-to-Robot Handover

Figure 2 for Stereo Hand-Object Reconstruction for Human-to-Robot Handover

Figure 3 for Stereo Hand-Object Reconstruction for Human-to-Robot Handover

Figure 4 for Stereo Hand-Object Reconstruction for Human-to-Robot Handover

Abstract:Jointly estimating hand and object shape ensures the success of the robot grasp in human-to-robot handovers. However, relying on hand-crafted prior knowledge about the geometric structure of the object fails when generalising to unseen objects, and depth sensors fail to detect transparent objects such as drinking glasses. In this work, we propose a stereo-based method for hand-object reconstruction that combines single-view reconstructions probabilistically to form a coherent stereo reconstruction. We learn 3D shape priors from a large synthetic hand-object dataset to ensure that our method is generalisable, and use RGB inputs instead of depth as RGB can better capture transparent objects. We show that our method achieves a lower object Chamfer distance compared to existing RGB based hand-object reconstruction methods on single view and stereo settings. We process the reconstructed hand-object shape with a projection-based outlier removal step and use the output to guide a human-to-robot handover pipeline with wide-baseline stereo RGB cameras. Our hand-object reconstruction enables a robot to successfully receive a diverse range of household objects from the human.

* 8 pages, 9 figures, 1 table

Via

Access Paper or Ask Questions

Adaptive Multi-Modal Control of Digital Human Hand Synthesis Using a Region-Aware Cycle Loss

Sep 13, 2024

Qifan Fu, Xiaohang Yang, Muhammad Asad, Changjae Oh, Shanxin Yuan, Gregory Slabaugh

Figure 1 for Adaptive Multi-Modal Control of Digital Human Hand Synthesis Using a Region-Aware Cycle Loss

Figure 2 for Adaptive Multi-Modal Control of Digital Human Hand Synthesis Using a Region-Aware Cycle Loss

Figure 3 for Adaptive Multi-Modal Control of Digital Human Hand Synthesis Using a Region-Aware Cycle Loss

Figure 4 for Adaptive Multi-Modal Control of Digital Human Hand Synthesis Using a Region-Aware Cycle Loss

Abstract:Diffusion models have shown their remarkable ability to synthesize images, including the generation of humans in specific poses. However, current models face challenges in adequately expressing conditional control for detailed hand pose generation, leading to significant distortion in the hand regions. To tackle this problem, we first curate the How2Sign dataset to provide richer and more accurate hand pose annotations. In addition, we introduce adaptive, multi-modal fusion to integrate characters' physical features expressed in different modalities such as skeleton, depth, and surface normal. Furthermore, we propose a novel Region-Aware Cycle Loss (RACL) that enables the diffusion model training to focus on improving the hand region, resulting in improved quality of generated hand gestures. More specifically, the proposed RACL computes a weighted keypoint distance between the full-body pose keypoints from the generated image and the ground truth, to generate higher-quality hand poses while balancing overall pose accuracy. Moreover, we use two hand region metrics, named hand-PSNR and hand-Distance for hand pose generation evaluations. Our experimental evaluations demonstrate the effectiveness of our proposed approach in improving the quality of digital human pose generation using diffusion models, especially the quality of the hand region. The source code is available at https://github.com/fuqifan/Region-Aware-Cycle-Loss.

* This paper has been accepted by the ECCV 2024 HANDS workshop

Via

Access Paper or Ask Questions

Improving Image De-raining Using Reference-Guided Transformers

Aug 01, 2024

Zihao Ye, Jaehoon Cho, Changjae Oh

Figure 1 for Improving Image De-raining Using Reference-Guided Transformers

Figure 2 for Improving Image De-raining Using Reference-Guided Transformers

Figure 3 for Improving Image De-raining Using Reference-Guided Transformers

Figure 4 for Improving Image De-raining Using Reference-Guided Transformers

Abstract:Image de-raining is a critical task in computer vision to improve visibility and enhance the robustness of outdoor vision systems. While recent advances in de-raining methods have achieved remarkable performance, the challenge remains to produce high-quality and visually pleasing de-rained results. In this paper, we present a reference-guided de-raining filter, a transformer network that enhances de-raining results using a reference clean image as guidance. We leverage the capabilities of the proposed module to further refine the images de-rained by existing methods. We validate our method on three datasets and show that our module can improve the performance of existing prior-based, CNN-based, and transformer-based approaches.

Via

Access Paper or Ask Questions

High-resolution open-vocabulary object 6D pose estimation

Jun 24, 2024

Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Changjae Oh, Andrea Cavallaro, Fabio Poiesi

Figure 1 for High-resolution open-vocabulary object 6D pose estimation

Figure 2 for High-resolution open-vocabulary object 6D pose estimation

Figure 3 for High-resolution open-vocabulary object 6D pose estimation

Figure 4 for High-resolution open-vocabulary object 6D pose estimation

Abstract:The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming by 12.6 in Average Recall the previous best-performing approach.

* Technical report. Extension of CVPR paper "Open-vocabulary object 6D pose estimation". Project page: https://jcorsetti.github.io/oryon

Via

Access Paper or Ask Questions

Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation

May 07, 2024

Jihyun Kim, Changjae Oh, Hoseok Do, Soohyun Kim, Kwanghoon Sohn

Figure 1 for Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation

Figure 2 for Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation

Figure 3 for Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation

Figure 4 for Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation

Abstract:We present a new multi-modal face image generation method that converts a text prompt and a visual input, such as a semantic mask or scribble map, into a photo-realistic face image. To do this, we combine the strengths of Generative Adversarial networks (GANs) and diffusion models (DMs) by employing the multi-modal features in the DM into the latent space of the pre-trained GANs. We present a simple mapping and a style modulation network to link two models and convert meaningful representations in feature maps and attention maps into latent codes. With GAN inversion, the estimated latent codes can be used to generate 2D or 3D-aware facial images. We further present a multi-step training strategy that reflects textual and structural representations into the generated image. Our proposed network produces realistic 2D, multi-view, and stylized face images, which align well with inputs. We validate our method by using pre-trained 2D and 3D GANs, and our results outperform existing methods. Our project page is available at https://github.com/1211sh/Diffusion-driven_GAN-Inversion/.

* Accepted by CVPR 2024

Via

Access Paper or Ask Questions

Sparse multi-view hand-object reconstruction for unseen environments

May 02, 2024

Yik Lung Pang, Changjae Oh, Andrea Cavallaro

Figure 1 for Sparse multi-view hand-object reconstruction for unseen environments

Figure 2 for Sparse multi-view hand-object reconstruction for unseen environments

Figure 3 for Sparse multi-view hand-object reconstruction for unseen environments

Figure 4 for Sparse multi-view hand-object reconstruction for unseen environments

Abstract:Recent works in hand-object reconstruction mainly focus on the single-view and dense multi-view settings. On the one hand, single-view methods can leverage learned shape priors to generalise to unseen objects but are prone to inaccuracies due to occlusions. On the other hand, dense multi-view methods are very accurate but cannot easily adapt to unseen objects without further data collection. In contrast, sparse multi-view methods can take advantage of the additional views to tackle occlusion, while keeping the computational cost low compared to dense multi-view methods. In this paper, we consider the problem of hand-object reconstruction with unseen objects in the sparse multi-view setting. Given multiple RGB images of the hand and object captured at the same time, our model SVHO combines the predictions from each view into a unified reconstruction without optimisation across views. We train our model on a synthetic hand-object dataset and evaluate directly on a real world recorded hand-object dataset with unseen objects. We show that while reconstruction of unseen hands and objects from RGB is challenging, additional views can help improve the reconstruction quality.

* Camera-ready version. Paper accepted to CVPRW 2024. 8 pages, 7 figures, 1 table

Via

Access Paper or Ask Questions

Open-vocabulary object 6D pose estimation

Dec 07, 2023

Jaime Corsetti, Davide Boscaini, Changjae Oh, Andrea Cavallaro, Fabio Poiesi

Abstract:We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g. CAD or video sequence) is required at inference, (iii) the object is imaged from two different viewpoints of two different scenes, and (iv) the object was not observed during the training phase. To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from two distinct scenes and to estimate its relative 6D pose. The key of our approach is a carefully devised strategy to fuse object-level information provided by the prompt with local image features, resulting in a feature space that can generalize to novel concepts. We validate our approach on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, which collectively encompass 39 object instances appearing in four thousand image pairs. The results demonstrate that our approach outperforms both a well-established hand-crafted method and a recent deep learning-based baseline in estimating the relative 6D pose of objects in different scenes. Project page: https://jcorsetti.github.io/oryon/.

* Technical report. 21 pages, 15 figures, 6 tables. Updated website link

Via

Access Paper or Ask Questions

Boosting Video Object Segmentation based on Scale Inconsistency

May 02, 2022

Hengyi Wang, Changjae Oh

Figure 1 for Boosting Video Object Segmentation based on Scale Inconsistency

Figure 2 for Boosting Video Object Segmentation based on Scale Inconsistency

Figure 3 for Boosting Video Object Segmentation based on Scale Inconsistency

Figure 4 for Boosting Video Object Segmentation based on Scale Inconsistency

Abstract:We present a refinement framework to boost the performance of pre-trained semi-supervised video object segmentation (VOS) models. Our work is based on scale inconsistency, which is motivated by the observation that existing VOS models generate inconsistent predictions from input frames with different sizes. We use the scale inconsistency as a clue to devise a pixel-level attention module that aggregates the advantages of the predictions from different-size inputs. The scale inconsistency is also used to regularize the training based on a pixel-level variance measured by an uncertainty estimation. We further present a self-supervised online adaptation, tailored for test-time optimization, that bootstraps the predictions without ground-truth masks based on the scale inconsistency. Experiments on DAVIS 16 and DAVIS 17 datasets show that our framework can be generically applied to various VOS models and improve their performance.

Via

Access Paper or Ask Questions

Learning by Erasing: Conditional Entropy based Transferable Out-Of-Distribution Detection

Apr 23, 2022

Meng Xing, Zhiyong Feng, Yong Su, Changjae Oh

Figure 1 for Learning by Erasing: Conditional Entropy based Transferable Out-Of-Distribution Detection

Figure 2 for Learning by Erasing: Conditional Entropy based Transferable Out-Of-Distribution Detection

Figure 3 for Learning by Erasing: Conditional Entropy based Transferable Out-Of-Distribution Detection

Figure 4 for Learning by Erasing: Conditional Entropy based Transferable Out-Of-Distribution Detection

Abstract:Out-of-distribution (OOD) detection is essential to handle the distribution shifts between training and test scenarios. For a new in-distribution (ID) dataset, existing methods require retraining to capture the dataset-specific feature representation or data distribution. In this paper, we propose a deep generative models (DGM) based transferable OOD detection method, which is unnecessary to retrain on a new ID dataset. We design an image erasing strategy to equip exclusive conditional entropy distribution for each ID dataset, which determines the discrepancy of DGM's posteriori ucertainty distribution on different ID datasets. Owing to the powerful representation capacity of convolutional neural networks, the proposed model trained on complex dataset can capture the above discrepancy between ID datasets without retraining and thus achieve transferable OOD detection. We validate the proposed method on five datasets and verity that ours achieves comparable performance to the state-of-the-art group based OOD detection methods that need to be retrained to deploy on new ID datasets. Our code is available at https://github.com/oOHCIOo/CETOOD.

* 9 pages, 9 figures

Via

Access Paper or Ask Questions