
Zixian Ma

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Jan 15, 2026

SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Dec 15, 2025

Completion ≠ Collaboration: Scaling Collaborative Effort with Agents

Oct 30, 2025

Explain Before You Answer: A Survey on Compositional Visual Reasoning

Aug 24, 2025

Synthetic Visual Genome

Jun 09, 2025

Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

Jun 05, 2025

Biological Sequence with Language Model Prompting: A Survey

Mar 06, 2025

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action

Dec 10, 2024

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

Dec 09, 2024

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Oct 18, 2024