Shengbang Tong

Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

Oct 25, 2024
Via arXiv

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Sep 04, 2024
Via arXiv

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Jun 24, 2024
Via arXiv

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

May 17, 2024
Via arXiv

Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription

Mar 16, 2024
Via arXiv

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Jan 11, 2024
Via arXiv

White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

Nov 24, 2023
Via arXiv

Investigating the Catastrophic Forgetting in Multimodal Large Language Models

Sep 26, 2023
Via arXiv

Emergence of Segmentation with Minimalistic White-Box Transformers

Aug 30, 2023
Via arXiv

Mass-Producing Failures of Multimodal Systems with Language Models

Jun 21, 2023
Via arXiv