Le Wang

Xi'an Jiaotong University

AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

Aug 01, 2025
Via arXiv

Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

Jun 24, 2025
Via arXiv

AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions

Jun 17, 2025
Via arXiv

Time-Unified Diffusion Policy with Action Discrimination for Robotic Manipulation

Jun 11, 2025
Via arXiv

FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation

Jun 10, 2025
Via arXiv

RSRNav: Reasoning Spatial Relationship for Image-Goal Navigation

Apr 25, 2025
Via arXiv

From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval

Apr 25, 2025
Via arXiv

Manipulating Multimodal Agents via Cross-Modal Prompt Injection

Apr 22, 2025
Via arXiv

Moment Quantization for Video Temporal Grounding

Apr 03, 2025
Via arXiv

CogMorph: Cognitive Morphing Attacks for Text-to-Image Models

Jan 21, 2025
Via arXiv