Yuexian Zou

VASparse: Towards Efficient Visual Hallucination Mitigation for Large Vision-Language Model via Visual-Aware Sparsification

Jan 11, 2025

CAR: Controllable Autoregressive Modeling for Visual Generation

Oct 07, 2024

DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval

Sep 16, 2024

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

Sep 14, 2024

Image Conductor: Precision Control for Interactive Video Synthesis

Jun 21, 2024

On the Worst Prompt Performance of Large Language Models

Jun 08, 2024

Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning

May 31, 2024

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

Mar 22, 2024

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Mar 14, 2024

WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs

Mar 10, 2024