Xinyan Xiao

Query-Kontext: An Unified Multimodal Model for Image Generation and Editing

Sep 30, 2025
Via arXiv

Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

Sep 11, 2025
Via arXiv

A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

May 19, 2025
Via arXiv

UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning

Mar 27, 2025
Via arXiv

BiDeV: Bilateral Defusing Verification for Complex Claim Fact-Checking

Feb 22, 2025
Via arXiv

Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study

Feb 17, 2025
Via arXiv

Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training

Oct 06, 2024
Via arXiv

MonoFormer: One Transformer for Both Diffusion and Autoregression

Sep 24, 2024
Via arXiv

UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

Jan 25, 2024
Via arXiv

UniVG: Towards UNIfied-modal Video Generation

Jan 17, 2024
Via arXiv