Dahun Kim

Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

Nov 22, 2024

Learning Visual Grounding from Generative Vision and Language Model

Jul 18, 2024

OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All

May 25, 2024

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Nov 13, 2023

Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

Sep 29, 2023

Contrastive Feature Masking Open-Vocabulary Vision Transformer

Sep 02, 2023

Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation

Aug 03, 2023

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

May 11, 2023

RECLIP: Resource-efficient CLIP by Training with Small Images

Apr 12, 2023

Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation

Apr 10, 2023