Zhecan Wang

HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

Jul 22, 2024

Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

May 23, 2024

Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond

Oct 31, 2023

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Jul 03, 2023

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

May 24, 2023

CoBIT: A Contrastive Bi-directional Image-Text Generation Model

Mar 23, 2023

Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding

Dec 14, 2022

Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

Nov 10, 2022

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

Apr 28, 2022

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

Jan 15, 2022