Shiming Xiang

UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation

Feb 04, 2025

Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

Feb 03, 2025

Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation

Jan 29, 2025

Rethinking Comprehensive Benchmark for Chart Understanding: A Perspective from Scientific Literature

Dec 11, 2024

Continuous Speculative Decoding for Autoregressive Image Generation

Nov 18, 2024

A Survey of Low-shot Vision-Language Model Adaptation via Representer Theorem

Oct 15, 2024

Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation

Oct 11, 2024

Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Sep 10, 2024

AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

Aug 03, 2024

AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization

Jul 11, 2024