Yong Man Ro

Are Vision-Language Models Truly Understanding Multi-vision Sensor?

Dec 30, 2024

Long-Form Speech Generation with Spoken Language Models

Dec 24, 2024

Empathetic Response in Audio-Visual Conversations Using Emotion Preference Optimization and MambaCompressor

Dec 23, 2024

AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues

Dec 23, 2024

VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

Dec 02, 2024

Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing

Nov 29, 2024

Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion

Nov 27, 2024

SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis

Nov 25, 2024

Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language

Sep 02, 2024

SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models

Aug 23, 2024