Hang Dai

IGAF: Incremental Guided Attention Fusion for Depth Super-Resolution

Jan 03, 2025

Athanasios Tragakis, Chaitanya Kaul, Kevin J. Mitchell, Hang Dai, Roderick Murray-Smith, Daniele Faccio

Abstract:Accurate depth estimation is crucial for many fields, including robotics, navigation, and medical imaging. However, conventional depth sensors often produce low-resolution (LR) depth maps, making detailed scene perception challenging. To address this, enhancing LR depth maps to high-resolution (HR) ones has become essential, guided by HR-structured inputs like RGB or grayscale images. We propose a novel sensor fusion methodology for guided depth super-resolution (GDSR), a technique that combines LR depth maps with HR images to estimate detailed HR depth maps. Our key contribution is the Incremental guided attention fusion (IGAF) module, which effectively learns to fuse features from RGB images and LR depth maps, producing accurate HR depth maps. Using IGAF, we build a robust super-resolution model and evaluate it on multiple benchmark datasets. Our model achieves state-of-the-art results compared to all baseline models on the NYU v2 dataset for $\times 4$, $\times 8$, and $\times 16$ upsampling. It also outperforms all baselines in a zero-shot setting on the Middlebury, Lu, and RGB-D-D datasets. Code, environments, and models are available on GitHub.

* Sensors 2025, 25, 24
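
To make the guided setting in the abstract above concrete, here is a minimal sketch of a classical, non-learned form of guided upsampling: joint bilateral upsampling, shown in 1D. It is not the IGAF module and is not taken from the paper. The function name, the parameters `sigma_s`, `sigma_r`, and `radius`, and the toy data are all assumptions chosen for illustration. Each HR output is a weighted average of nearby LR depth samples, and the weights drop when the HR guide intensity differs, so depth edges follow guide edges.

```python
import math

def joint_bilateral_upsample_1d(depth_lr, guide_hr, scale,
                                sigma_s=1.0, sigma_r=0.1, radius=2):
    """Upsample a 1D LR depth signal using an HR guide signal (illustrative only)."""
    out = []
    for p in range(len(guide_hr)):
        center = p / scale                      # HR position in LR coordinates
        q0 = int(round(center))
        num = den = 0.0
        for q in range(max(0, q0 - radius), min(len(depth_lr), q0 + radius + 1)):
            # Guide value at the HR location that LR sample q was taken from.
            g_q = guide_hr[min(q * scale, len(guide_hr) - 1)]
            w_spatial = math.exp(-((center - q) ** 2) / (2 * sigma_s ** 2))
            w_range = math.exp(-((guide_hr[p] - g_q) ** 2) / (2 * sigma_r ** 2))
            num += w_spatial * w_range * depth_lr[q]
            den += w_spatial * w_range
        out.append(num / den)
    return out

# Toy data: a depth step that coincides with an intensity step in the guide.
depth_lr = [1.0, 1.0, 5.0, 5.0]
guide_hr = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
depth_hr = joint_bilateral_upsample_1d(depth_lr, guide_hr, scale=2)
```

In this toy case the upsampled depth stays sharp at the guide edge instead of being blurred across it, which is the behaviour that learned fusion methods for GDSR aim to improve on.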

Learning Semi-Supervised Medical Image Segmentation from Spatial Registration

Sep 16, 2024

Qianying Liu, Paul Henderson, Xiao Gu, Hang Dai, Fani Deligianni

Abstract:Semi-supervised medical image segmentation has shown promise in training models with limited labeled data and abundant unlabeled data. However, state-of-the-art methods ignore a potentially valuable source of unsupervised semantic information -- spatial registration transforms between image volumes. To address this, we propose CCT-R, a contrastive cross-teaching framework incorporating registration information. To leverage the semantic information available in registrations between volume pairs, CCT-R incorporates two proposed modules: Registration Supervision Loss (RSL) and Registration-Enhanced Positive Sampling (REPS). The RSL leverages segmentation knowledge derived from transforms between labeled and unlabeled volume pairs, providing an additional source of pseudo-labels. REPS enhances contrastive learning by identifying anatomically-corresponding positives across volumes using registration transforms. Experimental results on two challenging medical segmentation benchmarks demonstrate the effectiveness and superiority of CCT-R across various semi-supervised settings, with as few as one labeled case. Our code is available at https://github.com/kathyliu579/ContrastiveCross-teachingWithRegistration.
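
The idea of deriving pseudo-labels from a registration transform can be sketched as plain label propagation: once a transform maps each voxel of an unlabeled volume to a voxel of a labeled one, the labels can be carried across. The sketch below is a 1D toy under that assumption and is not the RSL formulation from the paper. `propagate_labels`, the `mapping` representation, and the `fill` value are hypothetical names introduced here.

```python
def propagate_labels(labels_src, mapping, fill=0):
    """Carry labels from a labeled source to a target via a voxel correspondence.

    mapping[i] is the source index that target voxel i corresponds to
    (a stand-in for a registration transform; illustrative only).
    Target voxels that map outside the source receive `fill`.
    """
    pseudo = []
    for src in mapping:
        if 0 <= src < len(labels_src):
            pseudo.append(labels_src[src])
        else:
            pseudo.append(fill)
    return pseudo

# Toy case: the target is the source shifted by one voxel.
labels_src = [0, 1, 1, 0]
mapping = [1, 2, 3, 4]          # last target voxel falls outside the source
pseudo_labels = propagate_labels(labels_src, mapping)
```

A real registration produces a dense, usually non-integer 3D transform, so practical label propagation needs interpolation (typically nearest-neighbour for label maps); the integer mapping here only keeps the example short.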

GLFNET: Global-Local (frequency) Filter Networks for efficient medical image segmentation

Mar 01, 2024

Athanasios Tragakis, Qianying Liu, Chaitanya Kaul, Swalpa Kumar Roy, Hang Dai, Fani Deligianni, Roderick Murray-Smith, Daniele Faccio

Abstract:We propose a novel transformer-style architecture called Global-Local Filter Network (GLFNet) for medical image segmentation and demonstrate its state-of-the-art performance. We replace the self-attention mechanism with a combination of global-local filter blocks to optimize model efficiency. The global filters extract features from the whole feature map, whereas the local filters are adaptively created as 4x4 patches of the same feature map and add restricted-scale information. In particular, the feature extraction takes place in the frequency domain rather than the commonly used spatial (image) domain to facilitate faster computations. The fusion of information from both spatial and frequency spaces creates an efficient model with regard to complexity, required data, and performance. We test GLFNet on three benchmark datasets, achieving state-of-the-art performance on all of them while being almost twice as efficient in terms of GFLOPs.
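
Filtering in the frequency domain, as mentioned in the abstract above, can be illustrated with a small 1D sketch: transform the signal, multiply elementwise by a set of frequency weights, and transform back. By the convolution theorem this equals a circular convolution spanning the whole signal, which is why such a filter is "global". The code is not GLFNet's implementation. The function names are hypothetical, a naive O(n²) DFT stands in for an FFT to keep the example dependency-free, and in a trained model the weights would be learned rather than derived from a fixed kernel.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (O(n^2)); an FFT would be used in practice."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft above."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def global_filter(signal, freq_weights):
    """Multiply the spectrum elementwise by freq_weights and return the real part."""
    filtered = [s * w for s, w in zip(dft(signal), freq_weights)]
    return [v.real for v in idft(filtered)]

def circular_conv(signal, kernel):
    """Reference: circular convolution computed directly in the signal domain."""
    n = len(signal)
    return [sum(signal[(t - s) % n] * kernel[s] for s in range(n)) for t in range(n)]

signal = [1.0, 2.0, 0.0, -1.0]
kernel = [0.5, 0.25, 0.0, 0.25]
via_frequency = global_filter(signal, dft(kernel))
via_convolution = circular_conv(signal, kernel)   # agrees up to rounding error
```

With an FFT the frequency-domain route costs O(n log n) instead of the O(n²) of a full-length convolution, which is the usual efficiency argument for this kind of block.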

UniVision: A Unified Framework for Vision-Centric 3D Perception

Jan 13, 2024

Yu Hong, Qian Liu, Huayuan Cheng, Danjiao Ma, Hang Dai, Yu Wang, Guangzhi Cao, Yong Ding

Abstract:The past few years have witnessed the rapid development of vision-centric 3D perception in autonomous driving. Although the 3D perception models share many structural and conceptual similarities, there still exist gaps in their feature representations, data formats, and objectives, posing challenges for unified and efficient 3D perception framework design. In this paper, we present UniVision, a simple and efficient framework that unifies two major tasks in vision-centric 3D perception, i.e., occupancy prediction and object detection. Specifically, we propose an explicit-implicit view transform module for complementary 2D-3D feature transformation. We propose a local-global feature extraction and fusion module for efficient and adaptive voxel and BEV feature extraction, enhancement, and interaction. Further, we propose a joint occupancy-detection data augmentation strategy and a progressive loss weight adjustment strategy, which enable efficient and stable training of the multi-task framework. We conduct extensive experiments for different perception tasks on four public benchmarks, including nuScenes LiDAR segmentation, nuScenes detection, OpenOccupancy, and Occ3D. UniVision achieves state-of-the-art results with +1.5 mIoU, +1.8 NDS, +1.5 mIoU, and +1.8 mIoU gains on each benchmark, respectively. We believe that the UniVision framework can serve as a high-performance baseline for the unified vision-centric 3D perception task. The code will be available at https://github.com/Cc-Hy/UniVision.

Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers

Mar 26, 2023

Zhou Huang, Hang Dai, Tian-Zhu Xiang, Shuo Wang, Huai-Xin Chen, Jie Qin, Huan Xiong

Abstract:Vision transformers have recently shown strong global context modeling capabilities in camouflaged object detection. However, they suffer from two major limitations: less effective locality modeling and insufficient feature aggregation in decoders, which are not conducive to camouflaged object detection that explores subtle cues from indistinguishable backgrounds. To address these issues, in this paper, we propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet), which aims to hierarchically decode locality-enhanced neighboring transformer features through progressive shrinking for camouflaged object detection. Specifically, we propose a non-local token enhancement module (NL-TEM) that employs the non-local mechanism to let neighboring tokens interact and explores graph-based high-order relations within tokens to enhance local representations of transformers. Moreover, we design a feature shrinkage decoder (FSD) with adjacent interaction modules (AIM), which progressively aggregates adjacent transformer features through a layer-by-layer shrinkage pyramid to accumulate imperceptible but effective cues as much as possible for object information decoding. Extensive quantitative and qualitative experiments demonstrate that the proposed model significantly outperforms 24 existing competitors on three challenging COD benchmark datasets under six widely-used evaluation metrics. Our code is publicly available at https://github.com/ZhouHuang23/FSPNet.

* CVPR 2023. Project webpage at: https://tzxiang.github.io/project/COD-FSPNet/index.html

MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving

Mar 15, 2023

Jiale Li, Hang Dai, Hao Han, Yong Ding

Abstract:LiDAR and camera are two modalities available for 3D semantic segmentation in autonomous driving. The popular LiDAR-only methods severely suffer from inferior segmentation on small and distant objects due to insufficient laser points, while the robust multi-modal solution is under-explored, where we investigate three crucial inherent difficulties: modality heterogeneity, limited sensor field of view intersection, and multi-modal data augmentation. We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint intra-modal feature extraction and inter-modal feature fusion to mitigate the modality heterogeneity. The multi-modal fusion in MSeg3D consists of geometry-based feature fusion GF-Phase, cross-modal feature completion, and semantic-based feature fusion SF-Phase on all visible points. The multi-modal data augmentation is reinvigorated by applying asymmetric transformations on LiDAR point cloud and multi-camera images individually, which benefits the model training with diversified augmentation transformations. MSeg3D achieves state-of-the-art results on nuScenes, Waymo, and SemanticKITTI datasets. Under the malfunctioning multi-camera input and the multi-frame point clouds input, MSeg3D still shows robustness and improves the LiDAR-only baseline. Our code is publicly available at https://github.com/jialeli1/lidarseg3d.

* Accepted to CVPR 2023 (preprint)

Laplacian ICP for Progressive Registration of 3D Human Head Meshes

Feb 04, 2023

Nick Pears, Hang Dai, Will Smith, Hao Sun

Abstract:We present a progressive 3D registration framework that is a highly-efficient variant of classical non-rigid Iterative Closest Points (N-ICP). Since it uses the Laplace-Beltrami operator for deformation regularisation, we view the overall process as Laplacian ICP (L-ICP). This exploits a 'small deformation per iteration' assumption and is progressively coarse-to-fine, employing an increasingly flexible deformation model, an increasing number of correspondence sets, and increasingly sophisticated correspondence estimation. Correspondence matching is only permitted within predefined vertex subsets derived from domain-specific feature extractors. Additionally, we present a new benchmark and a pair of evaluation metrics for 3D non-rigid registration, based on annotation transfer. We use this to evaluate our framework on a publicly-available dataset of 3D human head scans (Headspace). The method is robust and only requires a small fraction of the computation time compared to the most popular classical approach, yet has comparable registration performance.

* 17th IEEE International Conference on Automatic Face and Gesture Recognition, Jan 5th-8th 2023
* 7 pages, 6 figures

Cross-Modality Knowledge Distillation Network for Monocular 3D Object Detection

Nov 14, 2022

Yu Hong, Hang Dai, Yong Ding

Abstract:Leveraging LiDAR-based detectors or real LiDAR point data to guide monocular 3D detection has brought significant improvement, e.g., Pseudo-LiDAR methods. However, the existing methods usually apply non-end-to-end training strategies and insufficiently leverage the LiDAR information, where the rich potential of the LiDAR data has not been well exploited. In this paper, we propose the Cross-Modality Knowledge Distillation (CMKD) network for monocular 3D detection to efficiently and directly transfer the knowledge from LiDAR modality to image modality on both features and responses. Moreover, we further extend CMKD as a semi-supervised training framework by distilling knowledge from large-scale unlabeled data and significantly boost the performance. At the time of submission, CMKD ranks 1st among the monocular 3D detectors with publications on both the KITTI test set and the Waymo val set, with significant performance gains compared to previous state-of-the-art methods.

* Accepted by ECCV 2022 as Oral Presentation
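
Distillation "on both features and responses", as the abstract above puts it, can be sketched generically as a weighted sum of two terms: one pulling the student's intermediate features toward the teacher's, and one pulling the student's outputs toward the teacher's. The abstract does not state the loss functions used in CMKD, so mean squared error, the function names, and the weights below are assumptions made purely for illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_feat, teacher_feat,
                      student_resp, teacher_resp,
                      feat_weight=1.0, resp_weight=1.0):
    """Generic feature + response distillation loss (MSE is an assumed choice)."""
    feature_term = mse(student_feat, teacher_feat)     # match intermediate features
    response_term = mse(student_resp, teacher_resp)    # match final predictions
    return feat_weight * feature_term + resp_weight * response_term

# Toy values: image-branch student vs. LiDAR-branch teacher.
loss = distillation_loss(student_feat=[1.0, 2.0], teacher_feat=[1.0, 4.0],
                         student_resp=[0.5], teacher_resp=[1.0])
```

Because the teacher's features and responses act as targets, neither term needs ground-truth annotations, which is what makes this kind of objective usable on unlabeled data.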

Semi-Supervised Cross-Modal Salient Object Detection with U-Structure Networks

Aug 08, 2022

Yunqing Bao, Hang Dai, Abdulmotaleb Elsaddik

Abstract:Salient Object Detection (SOD) is a popular and important topic aimed at precise detection and segmentation of the interesting regions in the images. We integrate the linguistic information into the vision-based U-Structure networks designed for salient object detection tasks. The experiments are based on the newly created DUTS Cross Modal (DUTS-CM) dataset, which contains both visual and linguistic labels. We propose a new module called efficient Cross-Modal Self-Attention (eCMSA) to combine visual and linguistic features and improve the performance of the original U-structure networks. Meanwhile, to reduce the heavy burden of labeling, we employ a semi-supervised learning method by training an image caption model based on the DUTS-CM dataset, which can automatically label other datasets like DUT-OMRON and HKU-IS. The comprehensive experiments show that the performance of SOD can be improved with the natural language input and is competitive compared with other SOD methods.

High-resolution Iterative Feedback Network for Camouflaged Object Detection

Mar 22, 2022

Xiaobin Hu, Deng-Ping Fan, Xuebin Qin, Hang Dai, Wenqi Ren, Ying Tai, Chengjie Wang, Ling Shao

Abstract:Spotting camouflaged objects that are visually assimilated into the background is tricky for both object detection algorithms and humans, who are usually confused or deceived by the intrinsic similarities between the foreground objects and the background surroundings. To tackle this challenge, we aim to extract the high-resolution texture details to avoid the detail degradation that causes blurred vision in edges and boundaries. We introduce a novel HitNet to refine the low-resolution representations by high-resolution features in an iterative feedback manner, essentially a global loop-based connection among the multi-scale resolutions. In addition, an iterative feedback loss is proposed to impose more constraints on each feedback connection. Extensive experiments on four challenging datasets demonstrate that our HitNet breaks the performance bottleneck and achieves significant improvements compared with 29 state-of-the-art methods. To address the data scarcity in camouflaged scenarios, we provide an application example by employing cross-domain learning to extract the features that can reflect the camouflaged object properties and embed the features into salient objects, thereby generating more camouflaged training samples from the diverse salient object datasets. The code will be available at https://github.com/HUuxiaobin/HitNet.
