Picture for Yumao Lu

Yumao Lu

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Add code
Nov 10, 2023
Viaarxiv icon

MM-VID: Advancing Video Understanding with GPT-4V

Add code
Oct 30, 2023
Viaarxiv icon

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

Add code
Nov 25, 2021
Figure 1 for SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Figure 2 for SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Figure 3 for SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Figure 4 for SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Viaarxiv icon

Scaling Up Vision-Language Pre-training for Image Captioning

Add code
Nov 24, 2021
Figure 1 for Scaling Up Vision-Language Pre-training for Image Captioning
Figure 2 for Scaling Up Vision-Language Pre-training for Image Captioning
Figure 3 for Scaling Up Vision-Language Pre-training for Image Captioning
Figure 4 for Scaling Up Vision-Language Pre-training for Image Captioning
Viaarxiv icon

Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

Add code
Nov 23, 2021
Figure 1 for Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
Figure 2 for Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
Figure 3 for Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
Figure 4 for Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
Viaarxiv icon

Florence: A New Foundation Model for Computer Vision

Add code
Nov 22, 2021
Figure 1 for Florence: A New Foundation Model for Computer Vision
Figure 2 for Florence: A New Foundation Model for Computer Vision
Figure 3 for Florence: A New Foundation Model for Computer Vision
Figure 4 for Florence: A New Foundation Model for Computer Vision
Viaarxiv icon

UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

Add code
Nov 19, 2021
Figure 1 for UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Figure 2 for UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Figure 3 for UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Figure 4 for UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Viaarxiv icon

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Add code
Sep 10, 2021
Figure 1 for An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Figure 2 for An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Figure 3 for An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Figure 4 for An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Viaarxiv icon