Arushi Goel

Music Flamingo: Scaling Music Understanding in Audio Language Models

Nov 13, 2025

NVIDIA Nemotron Nano V2 VL

Nov 07, 2025

Visually Interpretable Subtask Reasoning for Visual Question Answering

May 12, 2025

ETTA: Elucidating the Design Space of Text-to-Audio Models

Dec 26, 2024

OMCAT: Omni Context Aware Transformer

Oct 15, 2024

Audio Dialogues: Dialogues dataset for audio and music understanding

Apr 11, 2024

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

Feb 02, 2024

Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter

Nov 09, 2023

Semi-supervised multimodal coreference resolution in image narrations

Oct 20, 2023

Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories

Jun 15, 2023