Wentao Mo

From Knowing to Doing Precisely: A General Self-Correction and Termination Framework for VLA models

Feb 02, 2026
Via arXiv

Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs

Jan 02, 2025
Via arXiv

3D Vision and Language Pretraining with Large-Scale Synthetic Data

Jul 08, 2024
Via arXiv

Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA

Feb 24, 2024
Via arXiv