Xubin Li

Differentiable Solver Search for Fast Diffusion Sampling

May 27, 2025

Shuai Wang, Zexian Li, Qipeng Zhang, Tianhui Song, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang

Abstract:Diffusion models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal for diffusion models and reveal a compact search space comprised of time steps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify a more optimal solver. Equipped with the searched solver, rectified-flow models, e.g., SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet256 with only 10 steps. Meanwhile, the DDPM model DiT-XL/2 reaches an FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates generality across various model architectures, resolutions, and model sizes.

* Accepted on ICML25
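The abstract above refers to Adams-like multistep solvers whose coefficients come from t-related Lagrange interpolation. As a rough illustration of that classical baseline only (not the searched solver the paper proposes), the sketch below integrates each Lagrange basis polynomial over the next step to obtain the multistep weights. The function names and the use of exact fractions are assumptions made for this example.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_integral(p, lo, hi):
    """Exact definite integral of polynomial p over [lo, hi]."""
    def antiderivative(x):
        return sum(c * x ** (k + 1) / (k + 1) for k, c in enumerate(p))
    return antiderivative(hi) - antiderivative(lo)


def lagrange_multistep_coeffs(nodes, t_start, t_end):
    """Weights of an Adams-Bashforth-style step from t_start to t_end.

    nodes: time points where past function evaluations are available.
    Each weight is the integral of the corresponding Lagrange basis
    polynomial over [t_start, t_end] (hypothetical helper, illustration only).
    """
    nodes = [Fraction(n) for n in nodes]
    coeffs = []
    for j, tj in enumerate(nodes):
        basis = [Fraction(1)]
        for m, tm in enumerate(nodes):
            if m != j:
                # Multiply by the linear factor (t - tm) / (tj - tm).
                basis = poly_mul(basis, [-tm / (tj - tm), 1 / (tj - tm)])
        coeffs.append(poly_integral(basis, Fraction(t_start), Fraction(t_end)))
    return coeffs


# With unit steps and evaluations at t = 0 and t = -1, this recovers the
# classical two-step Adams-Bashforth weights 3/2 and -1/2.
ab2 = lagrange_multistep_coeffs([0, -1], 0, 1)
```

Because these weights depend only on the time points, they are fixed once the schedule is chosen; the abstract's claim is that treating time steps and coefficients as a search space can do better.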

DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging

Apr 16, 2025

Tianhui Song, Weixin Feng, Shuai Wang, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang

Abstract:The success of text-to-image (T2I) generation models has spurred a proliferation of numerous model checkpoints fine-tuned from the same base model on various specialized datasets. This overwhelming specialized model production introduces new challenges for high parameter redundancy and huge storage cost, thereby necessitating the development of effective methods to consolidate and unify the capabilities of diverse powerful models into a single one. A common practice in model merging adopts static linear interpolation in the parameter space to achieve the goal of style mixing. However, it neglects the features of T2I generation task that numerous distinct models cover sundry styles which may lead to incompatibility and confusion in the merged model. To address this issue, we introduce a style-promptable image generation pipeline which can accurately generate arbitrary-style images under the control of style vectors. Based on this design, we propose the score distillation based model merging paradigm (DMM), compressing multiple models into a single versatile T2I model. Moreover, we rethink and reformulate the model merging task in the context of T2I generation, by presenting new merging goals and evaluation protocols. Our experiments demonstrate that DMM can compactly reorganize the knowledge from multiple teacher models and achieve controllable arbitrary-style generation.
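For context, the "static linear interpolation in the parameter space" that the abstract describes as common practice can be sketched as a weighted average of checkpoints. This illustrates only that baseline, not DMM itself; the dictionary-of-lists checkpoint layout and the function name are assumptions chosen to keep the example dependency-free.

```python
def merge_linear(state_dicts, weights):
    """Merge checkpoints by a fixed weighted average of their parameters.

    state_dicts: list of {parameter_name: list of floats}, all with the
                 same keys and shapes (hypothetical layout for illustration).
    weights:     one non-negative weight per checkpoint; normalized here.
    """
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    merged = {}
    for name, reference in state_dicts[0].items():
        merged[name] = [
            sum((w / total) * sd[name][i] for sd, w in zip(state_dicts, weights))
            for i in range(len(reference))
        ]
    return merged


# Two toy "checkpoints" fine-tuned from the same base, merged with equal weight.
style_a = {"w": [0.0, 2.0]}
style_b = {"w": [2.0, 4.0]}
merged = merge_linear([style_a, style_b], [1, 1])
```

The weights are fixed before inference, so the merged model blends styles in one static proportion, which is the limitation the abstract sets out to address.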

SceneBooth: Diffusion-based Framework for Subject-preserved Text-to-Image Generation

Jan 07, 2025

Shang Chai, Zihang Lin, Min Zhou, Xubin Li, Liansheng Zhuang, Houqiang Li

Abstract:Due to the demand for personalizing image generation, subject-driven text-to-image generation, which creates novel renditions of an input subject based on text prompts, has received growing research interest. Existing methods often learn subject representation and incorporate it into the prompt embedding to guide image generation, but they struggle with preserving subject fidelity. To solve this issue, this paper proposes a novel framework named SceneBooth for subject-preserved text-to-image generation, which consumes inputs of a subject image, object phrases and text prompts. Instead of learning the subject representation and generating a subject, our SceneBooth fixes the given subject image and generates its background image guided by the text prompts. To this end, our SceneBooth introduces two key components, i.e., a multimodal layout generation module and a background painting module. The former determines the position and scale of the subject by generating appropriate scene layouts that align with text captions, object phrases, and subject visual information. The latter integrates two adapters (ControlNet and Gated Self-Attention) into the latent diffusion model to generate a background that harmonizes with the subject guided by scene layouts and text descriptions. In this manner, our SceneBooth ensures accurate preservation of the subject's appearance in the output. Quantitative and qualitative experimental results demonstrate that SceneBooth significantly outperforms baseline methods in terms of subject preservation, image harmonization and overall quality.

FlowDCN: Exploring DCN-like Architectures for Fast Image Generation with Arbitrary Resolution

Oct 30, 2024

Shuai Wang, Zexian Li, Tianhui Song, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang

Abstract:Arbitrary-resolution image generation still remains a challenging task in AIGC, as it requires handling varying resolutions and aspect ratios while maintaining high visual quality. Existing transformer-based diffusion methods suffer from quadratic computation cost and limited resolution extrapolation capabilities, making them less effective for this task. In this paper, we propose FlowDCN, a purely convolution-based generative model with linear time and memory complexity, that can efficiently generate high-quality images at arbitrary resolutions. Equipped with a new design of learnable group-wise deformable convolution block, our FlowDCN yields higher flexibility and capability to handle different resolutions with a single model. FlowDCN achieves the state-of-the-art 4.30 sFID on $256\times256$ ImageNet Benchmark and comparable resolution extrapolation results, surpassing transformer-based counterparts in terms of convergence speed (only $\frac{1}{5}$ images), visual quality, parameters ($8\%$ reduction) and FLOPs ($20\%$ reduction). We believe FlowDCN offers a promising solution to scalable and flexible image synthesis.

* Accepted on NeurIPS24

Accelerating Image Generation with Sub-path Linear Approximation Model

Apr 23, 2024

Chen Xu, Tianhui Song, Weixin Feng, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang

Abstract:Diffusion models have significantly advanced the state of the art in image, audio, and video generation tasks. However, their applications in practical scenarios are hindered by slow inference speed. Drawing inspiration from the approximation strategies utilized in consistency models, we propose the Sub-path Linear Approximation Model (SLAM), which accelerates diffusion models while maintaining high-quality image generation. SLAM treats the PF-ODE trajectory as a series of PF-ODE sub-paths divided by sampled points, and harnesses sub-path linear (SL) ODEs to form a progressive and continuous error estimation along each individual PF-ODE sub-path. The optimization on such SL-ODEs allows SLAM to construct denoising mappings with smaller cumulative approximated errors. An efficient distillation method is also developed to facilitate the incorporation of more advanced diffusion models, such as latent diffusion models. Our extensive experimental results demonstrate that SLAM achieves an efficient training regimen, requiring only 6 A100 GPU days to produce a high-quality generative model capable of 2 to 4-step generation with high performance. Comprehensive evaluations on LAION, MS COCO 2014, and MS COCO 2017 datasets also illustrate that SLAM surpasses existing acceleration methods in few-step generation tasks, achieving state-of-the-art performance both on FID and the quality of the generated images.

Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion

Apr 23, 2024

Hongyu Chen, Yiqi Gao, Min Zhou, Peng Wang, Xubin Li, Tiezheng Ge, Bo Zheng

Abstract:Recently, integrating visual controls into text-to-image (T2I) models, such as the ControlNet method, has received significant attention for finer control capabilities. While various training-free methods make efforts to enhance prompt following in T2I models, the issue with visual control is still rarely studied, especially in the scenario that visual controls are misaligned with text prompts. In this paper, we address the challenge of "Prompt Following With Visual Control" and propose a training-free approach named Mask-guided Prompt Following (MGPF). Object masks are introduced to distinguish aligned and misaligned parts of visual controls and prompts. Meanwhile, a network, dubbed Masked ControlNet, is designed to utilize these object masks for object generation in the misaligned visual control region. Further, to improve attribute matching, a simple yet efficient loss is designed to align the attention maps of attributes with object regions constrained by ControlNet and object masks. The efficacy and superiority of MGPF are validated through comprehensive quantitative and qualitative experiments.

RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance

Apr 22, 2024

Chengrui Wang, Pengfei Liu, Min Zhou, Ming Zeng, Xubin Li, Tiezheng Ge, Bo Zheng

Abstract:Although diffusion models can generate high-quality human images, their applications are limited by the instability in generating hands with correct structures. Some previous works mitigate the problem by considering hand structure yet struggle to maintain style consistency between refined malformed hands and other image regions. In this paper, we aim to solve the problem of inconsistency regarding hand structure and style. We propose a conditional diffusion-based framework RHanDS to refine the hand region with the help of decoupled structure and style guidance. Specifically, the structure guidance is the hand mesh reconstructed from the malformed hand, serving to correct the hand structure. The style guidance is a hand image, e.g., the malformed hand itself, and is employed to furnish the style reference for hand refining. In order to suppress the structure leakage when referencing hand style and effectively utilize hand data to improve the capability of the model, we build a multi-style hand dataset and introduce a two-stage training strategy. In the first stage, we use paired hand images for training to generate hands with the same style as the reference. In the second stage, various hand images generated based on the human mesh are used for training to enable the model to gain control over the hand structure. We evaluate our method and counterparts on the test dataset of the proposed multi-style hand dataset. The experimental results show that RHanDS can effectively refine hands with correct structure and style compared with previous methods. The codes and datasets will be available soon.

Long Short-Term Planning for Conversational Recommendation Systems

Oct 23, 2023

Xian Li, Hongguang Shi, Yunfei Wang, Yeqin Zhang, Xubin Li, Cam-Tu Nguyen

Abstract:In Conversational Recommendation Systems (CRS), the central question is how the conversational agent can naturally ask for user preferences and provide suitable recommendations. Existing works mainly follow the hierarchical architecture, where a higher policy decides whether to invoke the conversation module (to ask questions) or the recommendation module (to make recommendations). This architecture prevents these two components from fully interacting with each other. In contrast, this paper proposes a novel architecture, the long short-term feedback architecture, to connect these two essential components in CRS. Specifically, the recommendation predicts the long-term recommendation target based on the conversational context and the user history. Driven by the targeted recommendation, the conversational model predicts the next topic or attribute to verify if the user preference matches the target. The balance feedback loop continues until the short-term planner output matches the long-term planner output, that is when the system should make the recommendation.

* 14 pages, 3 figures. Accepted by ICONIP 2023

Multi-Scenario Ranking with Adaptive Feature Learning

Jun 29, 2023

Yu Tian, Bofang Li, Si Chen, Xubin Li, Hongbo Deng, Jian Xu, Bo Zheng, Qian Wang, Chenliang Li

Abstract:Recently, Multi-Scenario Learning (MSL) has been widely used in recommendation and retrieval systems in the industry because it facilitates transfer learning from different scenarios, mitigating data sparsity and reducing maintenance cost. These efforts produce different MSL paradigms by searching for more optimal network structures, such as Auxiliary Network, Expert Network, and Multi-Tower Network. It is intuitive that different scenarios could hold their specific characteristics, activating the user's intents quite differently. In other words, different kinds of auxiliary features would bear varying importance under different scenarios. With more discriminative feature representations refined in a scenario-aware manner, better ranking performance could be easily obtained without expensive search for the optimal network structure. Unfortunately, this simple idea is mainly overlooked but much desired in real-world systems. Further analysis also validates the rationality of adaptive feature learning under a multi-scenario scheme. Moreover, our A/B test results on the Alibaba search advertising platform also demonstrate that Maria is superior in production environments.

* 10 pages

Visual Encoding and Debiasing for CTR Prediction

May 09, 2022

Si Chen, Chen Lin, Wanxian Guan, Jiayi Wei, Xingyuan Bu, He Guo, Hui Li, Xubin Li, Jian Xu, Bo Zheng

Abstract:Extracting expressive visual features is crucial for accurate Click-Through-Rate (CTR) prediction in visual search advertising systems. Current commercial systems use off-the-shelf visual encoders to facilitate fast online service. However, the extracted visual features are coarse-grained and/or biased. In this paper, we present a visual encoding framework for CTR prediction to overcome these problems. The framework is based on contrastive learning which pulls positive pairs closer and pushes negative pairs apart in the visual feature space. To obtain fine-grained visual features, we present contrastive learning supervised by click-through data to fine-tune the visual encoder. To reduce sample selection bias, firstly we train the visual encoder offline by leveraging both unbiased self-supervision and click supervision signals. Secondly, we incorporate a debiasing network in the online CTR predictor to adjust the visual features by contrasting high impression items with selected items with lower impressions. We deploy the framework in the visual sponsor search system at Alibaba. Offline experiments on billion-scale datasets and online experiments demonstrate that the proposed framework can make accurate and unbiased predictions.
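The idea of pulling positive pairs closer and pushing negative pairs apart is often expressed as an InfoNCE-style loss. The sketch below is a generic version of that loss under assumed choices (cosine similarity, a temperature of 0.1, plain Python lists as feature vectors); the abstract does not specify the exact loss used in the paper, so this is not presented as the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (generic sketch, assumed settings).

    The loss is small when the anchor is more similar to its positive
    than to any negative, and large otherwise.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    # Negative log-probability assigned to the positive candidate.
    return log_sum - logits[0]


# A well-separated pair yields a lower loss than a confused one.
good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In a click-supervised setting such as the one the abstract describes, the positive would presumably come from click-through data, but how pairs are built is not stated here.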
