Yanhao Wu

Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

Apr 03, 2025

Congpei Qiu, Yanhao Wu, Wei Ke, Xiuxiu Bai, Tong Zhang

Abstract: Contrastive Language-Image Pre-training (CLIP) excels in global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification tasks but underperformance in tasks requiring precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP's performance in dense multimodal tasks by aligning regional visual representations with corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA suffer from a notable loss in spatial awareness, which is crucial for dense prediction tasks. To address this, we propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the above degradation. To further enhance spatial correlations, we introduce a lightweight Refiner that extracts refined correlations directly from CLIP before feeding them into SCD, based on an intriguing finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to integrate both visual-language and visual-centric improvements, achieving state-of-the-art results across various open-vocabulary dense prediction benchmarks.
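
As a rough illustration of what distilling spatial correlations could look like, the sketch below compares patch-to-patch similarity distributions of a frozen teacher and a fine-tuned student. The abstract does not specify the loss, so the cosine similarity, the softmax temperature, the cross-entropy objective, and all function names here are assumptions made for illustration, not the paper's implementation.

```python
import math

def cosine_sim_matrix(feats):
    """Pairwise cosine similarity between patch feature vectors."""
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in feats]
    unit = [[x / n for x in f] for f, n in zip(feats, norms)]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def softmax(row, temperature):
    """Numerically stable softmax over one row of similarities."""
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def correlation_distillation_loss(student_feats, teacher_feats, temperature=0.1):
    """Mean cross-entropy between teacher and student correlation rows.

    Assumption: each patch's similarities to all patches are turned into a
    distribution, and the student is trained to match the teacher's.
    """
    s_corr = cosine_sim_matrix(student_feats)
    t_corr = cosine_sim_matrix(teacher_feats)
    loss = 0.0
    for s_row, t_row in zip(s_corr, t_corr):
        p_t = softmax(t_row, temperature)
        p_s = softmax(s_row, temperature)
        loss -= sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))
    return loss / len(s_corr)
```

Under these assumptions the loss is smallest when the student reproduces the teacher's correlation structure, and it grows as the student's patch features collapse toward one another.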

* ICLR 2025

Generating Multimodal Driving Scenes via Next-Scene Prediction

Mar 19, 2025

Yanhao Wu, Haoyang Zhang, Tianwei Lin, Lichao Huang, Shujie Luo, Rui Wu, Congpei Qiu, Wei Ke, Tong Zhang

Abstract: Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by only capturing a limited range of modalities, restricting the capability of generating controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including the novel addition of a map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation based on the ego-action. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.
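
To make the idea of an ego-action-based transformation concrete, here is a minimal geometric sketch. It assumes the map is a set of 2D points in the ego frame and that the ego-action is a planar translation plus a heading change; the abstract states neither, and the real module presumably operates on tokenized or feature-level maps. The function name and parameters are hypothetical.

```python
import math

def align_map_to_ego(map_points, dx, dy, dyaw):
    """Re-express 2D map points in the ego frame after the ego moves.

    map_points: list of (x, y) in the previous ego frame (x forward, y left).
    dx, dy: ego translation, expressed in the previous ego frame.
    dyaw: ego heading change in radians (counter-clockwise positive).
    """
    cos_y, sin_y = math.cos(dyaw), math.sin(dyaw)
    aligned = []
    for x, y in map_points:
        # Shift so the new ego position is the origin, then rotate by -dyaw.
        tx, ty = x - dx, y - dy
        aligned.append((cos_y * tx + sin_y * ty, -sin_y * tx + cos_y * ty))
    return aligned
```

With this convention, a point straight ahead of the ego ends up on its right-hand side after a 90-degree left turn, which is the kind of consistency between map and action that the abstract describes.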

Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange

Apr 11, 2024

Yanhao Wu, Tong Zhang, Wei Ke, Congpei Qiu, Sabine Süsstrunk, Mathieu Salzmann

Abstract: In the realm of point cloud scene understanding, particularly in indoor scenes, objects are arranged following human habits, resulting in objects of certain semantics being closely positioned and displaying notable inter-object correlations. This can create a tendency for neural networks to exploit these strong dependencies, bypassing the individual object patterns. To address this challenge, we introduce a novel self-supervised learning (SSL) strategy. Our approach leverages both object patterns and contextual cues to produce robust features. It begins with the formulation of an object-exchanging strategy, where pairs of objects with comparable sizes are exchanged across different scenes, effectively disentangling the strong contextual dependencies. Subsequently, we introduce a context-aware feature learning strategy, which encodes object patterns without relying on their specific context by aggregating object features across various scenes. Our extensive experiments demonstrate the superiority of our method over existing SSL techniques, further showing its better robustness to environmental changes. Moreover, we showcase the applicability of our approach by transferring pre-trained models to diverse point cloud datasets.
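
The object-exchanging step can be pictured with a small sketch. It assumes each scene is a dictionary from an object id to that object's 3D points, that ids are unique across the two scenes, and that "comparable sizes" means axis-aligned bounding boxes agreeing within a relative tolerance. These representations, the tolerance value, and the helper names are illustrative assumptions rather than the authors' procedure.

```python
def bbox_extent(points):
    """Axis-aligned bounding-box size (dx, dy, dz) of a list of 3D points."""
    return tuple(max(p[i] for p in points) - min(p[i] for p in points) for i in range(3))

def centroid(points):
    """Mean position of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def comparable(points_a, points_b, tolerance=0.2):
    """True if every box dimension differs by at most `tolerance` (relative)."""
    for a, b in zip(bbox_extent(points_a), bbox_extent(points_b)):
        if abs(a - b) > tolerance * max(a, b, 1e-9):
            return False
    return True

def move_to(points, target_center):
    """Translate points so their centroid lands on target_center."""
    c = centroid(points)
    return [tuple(p[i] - c[i] + target_center[i] for i in range(3)) for p in points]

def exchange_objects(scene_a, scene_b, tolerance=0.2):
    """Swap the first pair of comparably sized objects between two scenes.

    Returns new scene dicts; the inputs are not modified.
    """
    new_a, new_b = dict(scene_a), dict(scene_b)
    for id_a, pts_a in scene_a.items():
        for id_b, pts_b in scene_b.items():
            if comparable(pts_a, pts_b, tolerance):
                # Each object takes the other's place in the opposite scene.
                new_a.pop(id_a)
                new_b.pop(id_b)
                new_a[id_b] = move_to(pts_b, centroid(pts_a))
                new_b[id_a] = move_to(pts_a, centroid(pts_b))
                return new_a, new_b
    return new_a, new_b
```

Placing the swapped object at the centroid of the one it replaces keeps the scene layout intact while changing which object appears in that context.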

Spatiotemporal Self-supervised Learning for Point Clouds in the Wild

Mar 28, 2023

Yanhao Wu, Tong Zhang, Wei Ke, Sabine Süsstrunk, Mathieu Salzmann

Abstract: Self-supervised learning (SSL) has the potential to benefit many applications, particularly those where manually annotating data is cumbersome. One such situation is the semantic segmentation of point clouds. In this context, existing methods employ contrastive learning strategies and define positive pairs by performing various augmentations of point clusters in a single frame. As such, these methods do not exploit the temporal nature of LiDAR data. In this paper, we introduce an SSL strategy that leverages positive pairs in both the spatial and temporal domains. To this end, we design (i) a point-to-cluster learning strategy that aggregates spatial information to distinguish objects; and (ii) a cluster-to-cluster learning strategy based on unsupervised object tracking that exploits temporal correspondences. We demonstrate the benefits of our approach via extensive experiments performed by self-supervised training on two large-scale LiDAR datasets and transferring the resulting models to other point cloud segmentation benchmarks. Our results show that our method outperforms state-of-the-art point cloud SSL methods.
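
A point-to-cluster objective of the kind described in (i) might be sketched as a contrastive loss between one point feature and a set of cluster prototypes. The abstract does not give the formulation, so the mean-pooled prototypes, the cosine similarity, the InfoNCE form, and the temperature below are assumptions chosen for illustration.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity, treating zero vectors as having unit norm."""
    nu = math.sqrt(sum(x * x for x in u)) or 1.0
    nv = math.sqrt(sum(x * x for x in v)) or 1.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def point_to_cluster_loss(point_feat, cluster_feats, positive_idx, temperature=0.1):
    """InfoNCE-style loss pulling a point toward its own cluster prototype.

    cluster_feats: list of clusters, each a list of point feature vectors.
    positive_idx: index of the cluster the point belongs to.
    """
    prototypes = [mean_vector(c) for c in cluster_feats]
    logits = [cosine(point_feat, p) / temperature for p in prototypes]
    m = max(logits)
    # Log-sum-exp over all prototypes, computed stably.
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_idx]
```

The same function could serve as a cluster-to-cluster loss for (ii) by passing a cluster prototype from one frame as `point_feat` and its tracked counterpart's index as `positive_idx`, again under the assumptions stated above.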

* CVPR accepted
