Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bowen Liu

Understanding What Is Not Said:Referring Remote Sensing Image Segmentation with Scarce Expressions

Oct 26, 2025

Kai Ye, Bowen Liu, Jianghang Lin, Jiayi Ji, Pingyang Dai, Liujuan Cao

Abstract:Referring Remote Sensing Image Segmentation (RRSIS) aims to segment instances in remote sensing images according to referring expressions. Unlike Referring Image Segmentation on general images, acquiring high-quality referring expressions in the remote sensing domain is particularly challenging due to the prevalence of small, densely distributed objects and complex backgrounds. This paper introduces a new learning paradigm, Weakly Referring Expression Learning (WREL) for RRSIS, which leverages abundant class names as weakly referring expressions together with a small set of accurate ones to enable efficient training under limited annotation conditions. Furthermore, we provide a theoretical analysis showing that mixed-referring training yields a provable upper bound on the performance gap relative to training with fully annotated referring expressions, thereby establishing the validity of this new setting. We also propose LRB-WREL, which integrates a Learnable Reference Bank (LRB) to refine weakly referring expressions through sample-specific prompt embeddings that enrich coarse class-name inputs. Combined with a teacher-student optimization framework using dynamically scheduled EMA updates, LRB-WREL stabilizes training and enhances cross-modal generalization under noisy weakly referring supervision. Extensive experiments on our newly constructed benchmark with varying weakly referring data ratios validate both the theoretical insights and the practical effectiveness of WREL and LRB-WREL, demonstrating that they can approach or even surpass models trained with fully annotated referring expressions.

Via

Access Paper or Ask Questions

Phase Noise Tolerance for Low-Pilot-Overhead OFDM Terahertz Links Beyond 64-QAM

Aug 07, 2025

Bowen Liu, Takasumi Tanabe

Figure 1 for Phase Noise Tolerance for Low-Pilot-Overhead OFDM Terahertz Links Beyond 64-QAM

Figure 2 for Phase Noise Tolerance for Low-Pilot-Overhead OFDM Terahertz Links Beyond 64-QAM

Figure 3 for Phase Noise Tolerance for Low-Pilot-Overhead OFDM Terahertz Links Beyond 64-QAM

Figure 4 for Phase Noise Tolerance for Low-Pilot-Overhead OFDM Terahertz Links Beyond 64-QAM

Abstract:THz wireless communications have garnered significant attention due to their unprecedented data rates enabled by the abundant untapped spectrum. However, advanced modulation formats beyond 64-QAM remain largely unexplored, as phase errors introduced during up/down-conversion severely limit system performance. Particularly, OFDM transmission is highly susceptible to aggravated ICI induced by phase noise, undermining the orthogonality of subcarriers. While PLLs and pilot-assisted compensation can mitigate phase errors, excessive pilot overhead compromises spectral efficiency and energy consumption, and white phase noise remains unrecoverable. Therefore, quantifying phase noise tolerance is essential for practical physical layer protocols. Here, we reveal the impact of phase noise in a 64-QAM, 2048-subcarrier OFDM THz transmission system. 3{\sigma}-error estimation is proposed to quantify phase noise tolerance, indicating an intuitive EVM threshold of approximately 5%. This threshold further delineates the trade-offs among phase noise levels, SNR requirements, and pilot overhead. Moreover, by benchmarking representative oscillators with distinct phase noise spectra, microring resonators (MRRs) are identified as indispensable enablers for low-pilot-overhead OFDM THz links operating beyond 64-QAM.

Via

Access Paper or Ask Questions

APTOS-2024 challenge report: Generation of synthetic 3D OCT images from fundus photographs

Jun 09, 2025

Bowen Liu, Weiyi Zhang, Peranut Chotcomwongse, Xiaolan Chen, Ruoyu Chen, Pawin Pakaymaskul, Niracha Arjkongharn, Nattaporn Vongsa, Xuelian Cheng, Zongyuan Ge(+16 more)

Abstract:Optical Coherence Tomography (OCT) provides high-resolution, 3D, and non-invasive visualization of retinal layers in vivo, serving as a critical tool for lesion localization and disease diagnosis. However, its widespread adoption is limited by equipment costs and the need for specialized operators. In comparison, 2D color fundus photography offers faster acquisition and greater accessibility with less dependence on expensive devices. Although generative artificial intelligence has demonstrated promising results in medical image synthesis, translating 2D fundus images into 3D OCT images presents unique challenges due to inherent differences in data dimensionality and biological information between modalities. To advance generative models in the fundus-to-3D-OCT setting, the Asia Pacific Tele-Ophthalmology Society (APTOS-2024) organized a challenge titled Artificial Intelligence-based OCT Generation from Fundus Images. This paper details the challenge framework (referred to as APTOS-2024 Challenge), including: the benchmark dataset, evaluation methodology featuring two fidelity metrics-image-based distance (pixel-level OCT B-scan similarity) and video-based distance (semantic-level volumetric consistency), and analysis of top-performing solutions. The challenge attracted 342 participating teams, with 42 preliminary submissions and 9 finalists. Leading methodologies incorporated innovations in hybrid data preprocessing or augmentation (cross-modality collaborative paradigms), pre-training on external ophthalmic imaging datasets, integration of vision foundation models, and model architecture improvement. The APTOS-2024 Challenge is the first benchmark demonstrating the feasibility of fundus-to-3D-OCT synthesis as a potential solution for improving ophthalmic care accessibility in under-resourced healthcare settings, while helping to expedite medical research and clinical applications.

Via

Access Paper or Ask Questions

More Clear, More Flexible, More Precise: A Comprehensive Oriented Object Detection benchmark for UAV

Apr 28, 2025

Kai Ye, Haidi Tang, Bowen Liu, Pingyang Dai, Liujuan Cao, Rongrong Ji

Abstract:Applications of unmanned aerial vehicle (UAV) in logistics, agricultural automation, urban management, and emergency response are highly dependent on oriented object detection (OOD) to enhance visual perception. Although existing datasets for OOD in UAV provide valuable resources, they are often designed for specific downstream tasks.Consequently, they exhibit limited generalization performance in real flight scenarios and fail to thoroughly demonstrate algorithm effectiveness in practical environments. To bridge this critical gap, we introduce CODrone, a comprehensive oriented object detection dataset for UAVs that accurately reflects real-world conditions. It also serves as a new benchmark designed to align with downstream task requirements, ensuring greater applicability and robustness in UAV-based OOD.Based on application requirements, we identify four key limitations in current UAV OOD datasets-low image resolution, limited object categories, single-view imaging, and restricted flight altitudes-and propose corresponding improvements to enhance their applicability and robustness.Furthermore, CODrone contains a broad spectrum of annotated images collected from multiple cities under various lighting conditions, enhancing the realism of the benchmark. To rigorously evaluate CODrone as a new benchmark and gain deeper insights into the novel challenges it presents, we conduct a series of experiments based on 22 classical or SOTA methods.Our evaluation not only assesses the effectiveness of CODrone in real-world scenarios but also highlights key bottlenecks and opportunities to advance OOD in UAV applications.Overall, CODrone fills the data gap in OOD from UAV perspective and provides a benchmark with enhanced generalization capability, better aligning with practical applications and future algorithm development.

Via

Access Paper or Ask Questions

ViG3D-UNet: Volumetric Vascular Connectivity-Aware Segmentation via 3D Vision Graph Representation

Apr 18, 2025

Bowen Liu, Chunlei Meng, Wei Lin, Hongda Zhang, Ziqing Zhou, Zhongxue Gan, Chun Ouyang

Abstract:Accurate vascular segmentation is essential for coronary visualization and the diagnosis of coronary heart disease. This task involves the extraction of sparse tree-like vascular branches from the volumetric space. However, existing methods have faced significant challenges due to discontinuous vascular segmentation and missing endpoints. To address this issue, a 3D vision graph neural network framework, named ViG3D-UNet, was introduced. This method integrates 3D graph representation and aggregation within a U-shaped architecture to facilitate continuous vascular segmentation. The ViG3D module captures volumetric vascular connectivity and topology, while the convolutional module extracts fine vascular details. These two branches are combined through channel attention to form the encoder feature. Subsequently, a paperclip-shaped offset decoder minimizes redundant computations in the sparse feature space and restores the feature map size to match the original input dimensions. To evaluate the effectiveness of the proposed approach for continuous vascular segmentation, evaluations were performed on two public datasets, ASOCA and ImageCAS. The segmentation results show that the ViG3D-UNet surpassed competing methods in maintaining vascular segmentation connectivity while achieving high segmentation accuracy. Our code will be available soon.

Via

Access Paper or Ask Questions

Fundus2Globe: Generative AI-Driven 3D Digital Twins for Personalized Myopia Management

Feb 18, 2025

Danli Shi, Bowen Liu, Zhen Tian, Yue Wu, Jiancheng Yang, Ruoyu Chen, Bo Yang, Ou Xiao, Mingguang He

Figure 1 for Fundus2Globe: Generative AI-Driven 3D Digital Twins for Personalized Myopia Management

Figure 2 for Fundus2Globe: Generative AI-Driven 3D Digital Twins for Personalized Myopia Management

Figure 3 for Fundus2Globe: Generative AI-Driven 3D Digital Twins for Personalized Myopia Management

Figure 4 for Fundus2Globe: Generative AI-Driven 3D Digital Twins for Personalized Myopia Management

Abstract:Myopia, projected to affect 50% population globally by 2050, is a leading cause of vision loss. Eyes with pathological myopia exhibit distinctive shape distributions, which are closely linked to the progression of vision-threatening complications. Recent understanding of eye-shape-based biomarkers requires magnetic resonance imaging (MRI), however, it is costly and unrealistic in routine ophthalmology clinics. We present Fundus2Globe, the first AI framework that synthesizes patient-specific 3D eye globes from ubiquitous 2D color fundus photographs (CFPs) and routine metadata (axial length, spherical equivalent), bypassing MRI dependency. By integrating a 3D morphable eye model (encoding biomechanical shape priors) with a latent diffusion model, our approach achieves submillimeter accuracy in reconstructing posterior ocular anatomy efficiently. Fundus2Globe uniquely quantifies how vision-threatening lesions (e.g., staphylomas) in CFPs correlate with MRI-validated 3D shape abnormalities, enabling clinicians to simulate posterior segment changes in response to refractive shifts. External validation demonstrates its robust generation performance, ensuring fairness across underrepresented groups. By transforming 2D fundus imaging into 3D digital replicas of ocular structures, Fundus2Globe is a gateway for precision ophthalmology, laying the foundation for AI-driven, personalized myopia management.

* 24 pages, 6 figures

Via

Access Paper or Ask Questions

FFA Sora, video generation as fundus fluorescein angiography simulator

Dec 23, 2024

Xinyuan Wu, Lili Wang, Ruoyu Chen, Bowen Liu, Weiyi Zhang, Xi Yang, Yifan Feng, Mingguang He, Danli Shi

Figure 1 for FFA Sora, video generation as fundus fluorescein angiography simulator

Figure 2 for FFA Sora, video generation as fundus fluorescein angiography simulator

Figure 3 for FFA Sora, video generation as fundus fluorescein angiography simulator

Figure 4 for FFA Sora, video generation as fundus fluorescein angiography simulator

Abstract:Fundus fluorescein angiography (FFA) is critical for diagnosing retinal vascular diseases, but beginners often struggle with image interpretation. This study develops FFA Sora, a text-to-video model that converts FFA reports into dynamic videos via a Wavelet-Flow Variational Autoencoder (WF-VAE) and a diffusion transformer (DiT). Trained on an anonymized dataset, FFA Sora accurately simulates disease features from the input text, as confirmed by objective metrics: Frechet Video Distance (FVD) = 329.78, Learned Perceptual Image Patch Similarity (LPIPS) = 0.48, and Visual-question-answering Score (VQAScore) = 0.61. Specific evaluations showed acceptable alignment between the generated videos and textual prompts, with BERTScore of 0.35. Additionally, the model demonstrated strong privacy-preserving performance in retrieval evaluations, achieving an average Recall@K of 0.073. Human assessments indicated satisfactory visual quality, with an average score of 1.570(scale: 1 = best, 5 = worst). This model addresses privacy concerns associated with sharing large-scale FFA data and enhances medical education.

* 24 pages, 3 figures

Via

Access Paper or Ask Questions

EyeDiff: text-to-image diffusion model improves rare eye disease diagnosis

Nov 15, 2024

Ruoyu Chen, Weiyi Zhang, Bowen Liu, Xiaolan Chen, Pusheng Xu, Shunming Liu, Mingguang He, Danli Shi

Abstract:The rising prevalence of vision-threatening retinal diseases poses a significant burden on the global healthcare systems. Deep learning (DL) offers a promising solution for automatic disease screening but demands substantial data. Collecting and labeling large volumes of ophthalmic images across various modalities encounters several real-world challenges, especially for rare diseases. Here, we introduce EyeDiff, a text-to-image model designed to generate multimodal ophthalmic images from natural language prompts and evaluate its applicability in diagnosing common and rare diseases. EyeDiff is trained on eight large-scale datasets using the advanced latent diffusion model, covering 14 ophthalmic image modalities and over 80 ocular diseases, and is adapted to ten multi-country external datasets. The generated images accurately capture essential lesional characteristics, achieving high alignment with text prompts as evaluated by objective metrics and human experts. Furthermore, integrating generated images significantly enhances the accuracy of detecting minority classes and rare eye diseases, surpassing traditional oversampling methods in addressing data imbalance. EyeDiff effectively tackles the issue of data imbalance and insufficiency typically encountered in rare diseases and addresses the challenges of collecting large-scale annotated images, offering a transformative solution to enhance the development of expert-level diseases diagnosis models in ophthalmic field.

* 28 pages, 2 figures

Via

Access Paper or Ask Questions

Subgraph Aggregation for Out-of-Distribution Generalization on Graphs

Oct 29, 2024

Bowen Liu, Haoyang Li, Shuning Wang, Shuo Nie, Shanghang Zhang

Figure 1 for Subgraph Aggregation for Out-of-Distribution Generalization on Graphs

Figure 2 for Subgraph Aggregation for Out-of-Distribution Generalization on Graphs

Figure 3 for Subgraph Aggregation for Out-of-Distribution Generalization on Graphs

Figure 4 for Subgraph Aggregation for Out-of-Distribution Generalization on Graphs

Abstract:Out-of-distribution (OOD) generalization in Graph Neural Networks (GNNs) has gained significant attention due to its critical importance in graph-based predictions in real-world scenarios. Existing methods primarily focus on extracting a single causal subgraph from the input graph to achieve generalizable predictions. However, relying on a single subgraph can lead to susceptibility to spurious correlations and is insufficient for learning invariant patterns behind graph data. Moreover, in many real-world applications, such as molecular property prediction, multiple critical subgraphs may influence the target label property. To address these challenges, we propose a novel framework, SubGraph Aggregation (SuGAr), designed to learn a diverse set of subgraphs that are crucial for OOD generalization on graphs. Specifically, SuGAr employs a tailored subgraph sampler and diversity regularizer to extract a diverse set of invariant subgraphs. These invariant subgraphs are then aggregated by averaging their representations, which enriches the subgraph signals and enhances coverage of the underlying causal structures, thereby improving OOD generalization. Extensive experiments on both synthetic and real-world datasets demonstrate that \ours outperforms state-of-the-art methods, achieving up to a 24% improvement in OOD generalization on graphs. To the best of our knowledge, this is the first work to study graph OOD generalization by learning multiple invariant subgraphs.

Via

Access Paper or Ask Questions

CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Oct 15, 2024

Chunlei Meng, Jiacheng Yang, Wei Lin, Bowen Liu, Hongda Zhang, chun ouyang, Zhongxue Gan

Figure 1 for CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Figure 2 for CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Figure 3 for CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Figure 4 for CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Abstract:Convolutional neural networks (CNNs) and vision transformers (ViTs) have become essential in computer vision for local and global feature extraction. However, aggregating these architectures in existing methods often results in inefficiencies. To address this, the CNN-Transformer Aggregation Network (CTA-Net) was developed. CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features. This integration enables efficient processing of detailed local and broader contextual information. CTA-Net introduces the Light Weight Multi-Scale Feature Fusion Multi-Head Self-Attention (LMF-MHSA) module for effective multi-scale feature integration with reduced parameters. Additionally, the Reverse Reconstruction CNN-Variants (RRCV) module enhances the embedding of CNNs within the transformer architecture. Extensive experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance (TOP-1 Acc 86.76\%), fewer parameters (20.32M), and greater efficiency (FLOPs 2.83B), making it a highly efficient and lightweight solution for visual tasks on small-scale datasets (fewer than 100,000).

* 9 pages, 3 figures

Via

Access Paper or Ask Questions