Hongtao Xie

UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models

Oct 02, 2025

Test-Time Scaling with Reflective Generative Model

Jul 02, 2025

From Evaluation to Defense: Advancing Safety in Video Large Language Models

May 22, 2025

PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering

Apr 09, 2025

Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

Mar 25, 2025

Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

Mar 20, 2025

SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

Mar 18, 2025

What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Coverage of MLLMs

Feb 19, 2025

A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions

Dec 12, 2024

SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

Nov 24, 2024