Xiangtai Li

Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

Jan 08, 2025

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Jan 07, 2025

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

Dec 10, 2024

EMOv2: Pushing 5M Vision Model Frontier

Dec 09, 2024

SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

Dec 05, 2024

HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing

Dec 05, 2024

DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation

Dec 04, 2024

RelationBooth: Towards Relation-Aware Customized Object Generation

Oct 30, 2024

Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image

Oct 20, 2024

Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

Oct 14, 2024