Xiaoming Wei

Meituan

LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding

Jan 14, 2025

CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder

Dec 23, 2024

High-Resolution Image Synthesis via Next-Token Prediction

Nov 22, 2024

Faster Multi-GPU Training with PPLL: A Pipeline Parallelism Framework Leveraging Local Learning

Nov 19, 2024

Denoising with a Joint-Embedding Predictive Architecture

Oct 02, 2024

FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

Sep 26, 2024

Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding

Sep 12, 2024

Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation

Aug 28, 2024

Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

Aug 28, 2024

Fine-gained Zero-shot Video Sampling

Jul 31, 2024