Saining Xie

Altogether: Image Captioning via Re-aligning Alt-text

Oct 22, 2024

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Oct 09, 2024

DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing

Oct 08, 2024

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Oct 04, 2024

Fast Encoding and Decoding for Implicit Video Representation

Sep 28, 2024

On Scaling Up 3D Gaussian Splatting Training

Jun 26, 2024

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Jun 24, 2024

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

May 17, 2024

MoDE: CLIP Data Experts via Clustering

Apr 24, 2024

V-IRL: Grounding Virtual Intelligence in Real Life

Feb 05, 2024