Bin Wen

TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs

Mar 13, 2025
Via arXiv

Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding

Mar 12, 2025
Via arXiv

RecipeGen: A Benchmark for Real-World Recipe Image Generation

Mar 07, 2025
Via arXiv

What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Coverage of MLLMs

Feb 19, 2025
Via arXiv

Kwai-STaR: Transform LLMs into State-Transition Reasoners

Nov 07, 2024
Via arXiv

EVLM: An Efficient Vision-Language Model for Visual Understanding

Jul 19, 2024
Via arXiv

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

Jun 15, 2024
Via arXiv

Recognize Any Regions

Nov 02, 2023
Via arXiv

Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders

Oct 09, 2022
Via arXiv

MetaFormer: A Unified Meta Framework for Fine-Grained Recognition

Mar 05, 2022
Via arXiv