Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuke Li

CityGo: Lightweight Urban Modeling and Rendering with Proxy Buildings and Residual Gaussians

May 28, 2025

Weihang Liu, Yuhui Zhong, Yuke Li, Xi Chen, Jiadi Cui, Honglong Zhang, Lan Xu, Xin Lou, Yujiao Shi, Jingyi Yu(+1 more)

Abstract:Accurate and efficient modeling of large-scale urban scenes is critical for applications such as AR navigation, UAV based inspection, and smart city digital twins. While aerial imagery offers broad coverage and complements limitations of ground-based data, reconstructing city-scale environments from such views remains challenging due to occlusions, incomplete geometry, and high memory demands. Recent advances like 3D Gaussian Splatting (3DGS) improve scalability and visual quality but remain limited by dense primitive usage, long training times, and poor suit ability for edge devices. We propose CityGo, a hybrid framework that combines textured proxy geometry with residual and surrounding 3D Gaussians for lightweight, photorealistic rendering of urban scenes from aerial perspectives. Our approach first extracts compact building proxy meshes from MVS point clouds, then uses zero order SH Gaussians to generate occlusion-free textures via image-based rendering and back-projection. To capture high-frequency details, we introduce residual Gaussians placed based on proxy-photo discrepancies and guided by depth priors. Broader urban context is represented by surrounding Gaussians, with importance-aware downsampling applied to non-critical regions to reduce redundancy. A tailored optimization strategy jointly refines proxy textures and Gaussian parameters, enabling real-time rendering of complex urban scenes on mobile GPUs with significantly reduced training and memory requirements. Extensive experiments on real-world aerial datasets demonstrate that our hybrid representation significantly reduces training time, achieving on average 1.4x speedup, while delivering comparable visual fidelity to pure 3D Gaussian Splatting approaches. Furthermore, CityGo enables real-time rendering of large-scale urban scenes on mobile consumer GPUs, with substantially reduced memory usage and energy consumption.

Via

Access Paper or Ask Questions

Riemannian Complex Hermit Positive Definite Convolution Network for Polarimetric SAR Image Classification

Feb 12, 2025

Junfei Shi, Mengmeng Nie, Yuke Li, Haiyan Jin, Weisi Lin

Abstract:Deep learning can learn high-level semantic features in Euclidean space effectively for PolSAR images, while they need to covert the complex covariance matrix into a feature vector or complex-valued vector as the network input. However, the complex covariance matrices are essentially a complex Hermit positive definite (HPD) matrix endowed in Riemannian manifold rather than Euclidean space. The matrix's real and imagery parts are with the same significance, as the imagery part represents the phase information. The matrix vectorization will destroy the geometric structure and manifold characteristics of complex covariance matrices. To learn complex HPD matrices directly, we propose a Riemannian complex HPD convolution network(HPD\_CNN) for PolSAR images. This method consists of a complex HPD unfolding network(HPDnet) and a CV-3DCNN enhanced network. The proposed complex HPDnet defines the HPD mapping, rectifying and the logEig layers to learn geometric features of complex matrices. In addition, a fast eigenvalue decomposition method is designed to reduce computation burden. Finally, a Riemannian-to-Euclidean enhanced network is defined to enhance contextual information for classification. Experimental results on two real PolSSAR datasets demonstrate the proposed method can achieve superior performance than the state-of-the-art methods especially in heterogeneous regions.

* 9 pages, 4 figures

Via

Access Paper or Ask Questions

FleSpeech: Flexibly Controllable Speech Generation with Various Prompts

Jan 08, 2025

Hanzhao Li, Yuke Li, Xinsheng Wang, Jingbin Hu, Qicong Xie, Shan Yang, Lei Xie

Figure 1 for FleSpeech: Flexibly Controllable Speech Generation with Various Prompts

Figure 2 for FleSpeech: Flexibly Controllable Speech Generation with Various Prompts

Figure 3 for FleSpeech: Flexibly Controllable Speech Generation with Various Prompts

Figure 4 for FleSpeech: Flexibly Controllable Speech Generation with Various Prompts

Abstract:Controllable speech generation methods typically rely on single or fixed prompts, hindering creativity and flexibility. These limitations make it difficult to meet specific user needs in certain scenarios, such as adjusting the style while preserving a selected speaker's timbre, or choosing a style and generating a voice that matches a character's visual appearance. To overcome these challenges, we propose \textit{FleSpeech}, a novel multi-stage speech generation framework that allows for more flexible manipulation of speech attributes by integrating various forms of control. FleSpeech employs a multimodal prompt encoder that processes and unifies different text, audio, and visual prompts into a cohesive representation. This approach enhances the adaptability of speech synthesis and supports creative and precise control over the generated speech. Additionally, we develop a data collection pipeline for multimodal datasets to facilitate further research and applications in this field. Comprehensive subjective and objective experiments demonstrate the effectiveness of FleSpeech. Audio samples are available at https://kkksuper.github.io/FleSpeech/

* 14 pages, 3 figures

Via

Access Paper or Ask Questions

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Dec 03, 2024

Yuke Li, Xinfa Zhu, Hanzhao Li, JiXun Yao, WenJie Tian, XiPeng Yang, YunLin Chen, Zhifei Li, Lei Xie

Figure 1 for CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Figure 2 for CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Figure 3 for CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Figure 4 for CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Abstract:Zero-shot voice conversion (VC) aims to convert the original speaker's timbre to any target speaker while keeping the linguistic content. Current mainstream zero-shot voice conversion approaches depend on pre-trained recognition models to disentangle linguistic content and speaker representation. This results in a timbre residue within the decoupled linguistic content and inadequacies in speaker representation modeling. In this study, we propose CoDiff-VC, an end-to-end framework for zero-shot voice conversion that integrates a speech codec and a diffusion model to produce high-fidelity waveforms. Our approach involves employing a single-codebook codec to separate linguistic content from the source speech. To enhance content disentanglement, we introduce Mix-Style layer normalization (MSLN) to perturb the original timbre. Additionally, we incorporate a multi-scale speaker timbre modeling approach to ensure timbre consistency and improve voice detail similarity. To improve speech quality and speaker similarity, we introduce dual classifier-free guidance, providing both content and timbre guidance during the generation process. Objective and subjective experiments affirm that CoDiff-VC significantly improves speaker similarity, generating natural and higher-quality speech.

Via

Access Paper or Ask Questions

Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

Oct 14, 2024

Zhengwei Yang, Yuke Li, Qiang Sun, Basura Fernando, Heng Huang, Zheng Wang

Figure 1 for Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

Figure 2 for Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

Figure 3 for Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

Figure 4 for Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

Abstract:Most existing studies on few-shot learning focus on unimodal settings, where models are trained to generalize on unseen data using only a small number of labeled examples from the same modality. However, real-world data are inherently multi-modal, and unimodal approaches limit the practical applications of few-shot learning. To address this gap, this paper introduces the Cross-modal Few-Shot Learning (CFSL) task, which aims to recognize instances from multiple modalities when only a few labeled examples are available. This task presents additional challenges compared to classical few-shot learning due to the distinct visual characteristics and structural properties unique to each modality. To tackle these challenges, we propose a Generative Transfer Learning (GTL) framework consisting of two stages: the first stage involves training on abundant unimodal data, and the second stage focuses on transfer learning to adapt to novel data. Our GTL framework jointly estimates the latent shared concept across modalities and in-modality disturbance in both stages, while freezing the generative module during the transfer phase to maintain the stability of the learned representations and prevent overfitting to the limited multi-modal samples. Our finds demonstrate that GTL has superior performance compared to state-of-the-art methods across four distinct multi-modal datasets: Sketchy, TU-Berlin, Mask1K, and SKSF-A. Additionally, the results suggest that the model can estimate latent concepts from vast unimodal data and generalize these concepts to unseen modalities using only a limited number of available samples, much like human cognitive processes.

* 19 pages, 7 figures

Via

Access Paper or Ask Questions

Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

Mar 11, 2024

Xinyao Li, Yuke Li, Zhekai Du, Fengling Li, Ke Lu, Jingjing Li

Figure 1 for Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

Figure 2 for Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

Figure 3 for Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

Figure 4 for Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

Abstract:Large vision-language models (VLMs) like CLIP have demonstrated good zero-shot learning performance in the unsupervised domain adaptation task. Yet, most transfer approaches for VLMs focus on either the language or visual branches, overlooking the nuanced interplay between both modalities. In this work, we introduce a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation. Leveraging insights from modality gap studies, we craft a nimble modality separation network that distinctly disentangles CLIP's features into language-associated and vision-associated components. Our proposed Modality-Ensemble Training (MET) method fosters the exchange of modality-agnostic information while maintaining modality-specific nuances. We align features across domains using a modality discriminator. Comprehensive evaluations on three benchmarks reveal our approach sets a new state-of-the-art with minimal computational costs. Code: https://github.com/TL-UESTC/UniMoS

* CVPR 2024 camera ready

Via

Access Paper or Ask Questions

Learning Domain-Invariant Temporal Dynamics for Few-Shot Action Recognition

Feb 20, 2024

Yuke Li, Guangyi Chen, Ben Abramowitz, Stefano Anzellott, Donglai Wei

Figure 1 for Learning Domain-Invariant Temporal Dynamics for Few-Shot Action Recognition

Figure 2 for Learning Domain-Invariant Temporal Dynamics for Few-Shot Action Recognition

Figure 3 for Learning Domain-Invariant Temporal Dynamics for Few-Shot Action Recognition

Figure 4 for Learning Domain-Invariant Temporal Dynamics for Few-Shot Action Recognition

Abstract:Few-shot action recognition aims at quickly adapting a pre-trained model to the novel data with a distribution shift using only a limited number of samples. Key challenges include how to identify and leverage the transferable knowledge learned by the pre-trained model. Our central hypothesis is that temporal invariance in the dynamic system between latent variables lends itself to transferability (domain-invariance). We therefore propose DITeD, or Domain-Invariant Temporal Dynamics for knowledge transfer. To detect the temporal invariance part, we propose a generative framework with a two-stage training strategy during pre-training. Specifically, we explicitly model invariant dynamics including temporal dynamic generation and transitions, and the variant visual and domain encoders. Then we pre-train the model with the self-supervised signals to learn the representation. After that, we fix the whole representation model and tune the classifier. During adaptation, we fix the transferable temporal dynamics and update the image encoder. The efficacy of our approach is revealed by the superior accuracy of DITeD over leading alternatives across standard few-shot action recognition datasets. Moreover, we validate that the learned temporal dynamic transition and temporal dynamic generation modules possess transferable qualities.

Via

Access Paper or Ask Questions

HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition

Jan 10, 2024

Qian Wu, Ruoxuan Cui, Yuke Li, Haoqi Zhu

Figure 1 for HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition

Figure 2 for HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition

Figure 3 for HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition

Figure 4 for HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition

Abstract:Action recognition in videos poses a challenge due to its high computational cost, especially for Joint Space-Time video transformers (Joint VT). Despite their effectiveness, the excessive number of tokens in such architectures significantly limits their efficiency. In this paper, we propose HaltingVT, an efficient video transformer adaptively removing redundant video patch tokens, which is primarily composed of a Joint VT and a Glimpser module. Specifically, HaltingVT applies data-adaptive token reduction at each layer, resulting in a significant reduction in the overall computational cost. Besides, the Glimpser module quickly removes redundant tokens in shallow transformer layers, which may even be misleading for video recognition tasks based on our observations. To further encourage HaltingVT to focus on the key motion-related information in videos, we design an effective Motion Loss during training. HaltingVT acquires video analysis capabilities and token halting compression strategies simultaneously in a unified training process, without requiring additional training procedures or sub-networks. On the Mini-Kinetics dataset, we achieved 75.0% top-1 ACC with 24.2 GFLOPs, as well as 67.2% top-1 ACC with an extremely low 9.9 GFLOPs. The code is available at https://github.com/dun-research/HaltingVT.

Via

Access Paper or Ask Questions

Learning Socio-Temporal Graphs for Multi-Agent Trajectory Prediction

Dec 22, 2023

Yuke Li, Lixiong Chen, Guangyi Chen, Ching-Yao Chan, Kun Zhang, Stefano Anzellotti, Donglai Wei

Figure 1 for Learning Socio-Temporal Graphs for Multi-Agent Trajectory Prediction

Figure 2 for Learning Socio-Temporal Graphs for Multi-Agent Trajectory Prediction

Figure 3 for Learning Socio-Temporal Graphs for Multi-Agent Trajectory Prediction

Figure 4 for Learning Socio-Temporal Graphs for Multi-Agent Trajectory Prediction

Abstract:In order to predict a pedestrian's trajectory in a crowd accurately, one has to take into account her/his underlying socio-temporal interactions with other pedestrians consistently. Unlike existing work that represents the relevant information separately, partially, or implicitly, we propose a complete representation for it to be fully and explicitly captured and analyzed. In particular, we introduce a Directed Acyclic Graph-based structure, which we term Socio-Temporal Graph (STG), to explicitly capture pair-wise socio-temporal interactions among a group of people across both space and time. Our model is built on a time-varying generative process, whose latent variables determine the structure of the STGs. We design an attention-based model named STGformer that affords an end-to-end pipeline to learn the structure of the STGs for trajectory prediction. Our solution achieves overall state-of-the-art prediction accuracy in two large-scale benchmark datasets. Our analysis shows that a person's past trajectory is critical for predicting another person's future path. Our model learns this relationship with a strong notion of socio-temporal localities. Statistics show that utilizing this information explicitly for prediction yields a noticeable performance gain with respect to the trajectory-only approaches.

Via

Access Paper or Ask Questions

Enhancing Traffic Object Detection in Variable Illumination with RGB-Event Fusion

Nov 01, 2023

Zhanwen Liu, Nan Yang, Yang Wang, Yuke Li, Xiangmo Zhao, Fei-Yue Wang

Abstract:Traffic object detection under variable illumination is challenging due to the information loss caused by the limited dynamic range of conventional frame-based cameras. To address this issue, we introduce bio-inspired event cameras and propose a novel Structure-aware Fusion Network (SFNet) that extracts sharp and complete object structures from the event stream to compensate for the lost information in images through cross-modality fusion, enabling the network to obtain illumination-robust representations for traffic object detection. Specifically, to mitigate the sparsity or blurriness issues arising from diverse motion states of traffic objects in fixed-interval event sampling methods, we propose the Reliable Structure Generation Network (RSGNet) to generate Speed Invariant Frames (SIF), ensuring the integrity and sharpness of object structures. Next, we design a novel Adaptive Feature Complement Module (AFCM) which guides the adaptive fusion of two modality features to compensate for the information loss in the images by perceiving the global lightness distribution of the images, thereby generating illumination-robust representations. Finally, considering the lack of large-scale and high-quality annotations in the existing event-based object detection datasets, we build a DSEC-Det dataset, which consists of 53 sequences with 63,931 images and more than 208,000 labels for 8 classes. Extensive experimental results demonstrate that our proposed SFNet can overcome the perceptual boundaries of conventional cameras and outperform the frame-based method by 8.0% in mAP50 and 5.9% in mAP50:95. Our code and dataset will be available at https://github.com/YN-Yang/SFNet.

* 13 pages, 10 figures

Via

Access Paper or Ask Questions