Abstract: To break the bottlenecks of the mainstream cloud-based machine learning (ML) paradigm, we adopt device-cloud collaborative ML and build the first end-to-end, general-purpose system, called Walle, as its foundation. Walle consists of a deployment platform, which distributes ML tasks to billion-scale devices in a timely manner; a data pipeline, which efficiently prepares task input; and a compute container, which provides a cross-platform, high-performance execution environment while facilitating daily task iteration. Specifically, the compute container is based on Mobile Neural Network (MNN), a tensor compute engine with data processing and model execution libraries, which are exposed through a refined Python thread-level virtual machine (VM) to support diverse ML tasks and concurrent task execution. The core of MNN is its novel mechanisms of operator decomposition and semi-auto search, which sharply reduce the workload of manually optimizing hundreds of operators for tens of hardware backends and further quickly identify the best backend with runtime optimization for a computation graph. The data pipeline introduces an on-device stream processing framework that enables processing user behavior data at the source. The deployment platform releases ML tasks with an efficient push-then-pull method and supports multi-granularity deployment policies. We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability. Extensive micro-benchmarks also highlight the superior performance of MNN and the Python thread-level VM. Walle has been in large-scale production use at Alibaba, and MNN has been open-sourced with broad impact in the community.
Abstract: Deep learning and convolutional neural networks (CNNs) have made progress in polarimetric synthetic aperture radar (PolSAR) image classification over the past few years. However, a crucial issue has not been addressed: CNNs require abundant labeled samples, whereas human annotations of PolSAR images are scarce. It is well known that following the supervised learning paradigm may lead to overfitting of the training data, and the lack of supervision information in PolSAR images undoubtedly aggravates this problem, which greatly affects the generalization performance of CNN-based classifiers in large-scale applications. To handle this problem, this paper explores, for the first time, learning transferable representations from unlabeled PolSAR data through convolutional architectures. Specifically, a PolSAR-tailored contrastive learning network (PCLNet) is proposed for unsupervised deep PolSAR representation learning and few-shot classification. Unlike methods that directly apply optical processing techniques, a diversity stimulation mechanism is constructed to narrow the application gap between optics and PolSAR. Beyond conventional supervised methods, PCLNet develops an auxiliary pre-training phase based on the proxy objective of contrastive instance discrimination to learn useful representations from unlabeled PolSAR data. The acquired representations are transferred to the downstream task, i.e., few-shot PolSAR classification. Experiments on two widely used PolSAR benchmark datasets confirm the validity of PCLNet. Moreover, this work may shed light on how to efficiently utilize massive unlabeled PolSAR data to alleviate the heavy demand of CNN-based methods for human annotations.
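The proxy objective of contrastive instance discrimination mentioned above is commonly formulated as an InfoNCE loss: two augmented views of the same sample form a positive pair, and all other samples in the batch act as negatives. Below is a minimal NumPy sketch of this generic loss; it is an assumed standard formulation, not PCLNet's exact objective, whose diversity stimulation mechanism is PolSAR-specific.

```python
import numpy as np

def info_nce_loss(z_anchor, z_positive, temperature=0.1):
    """InfoNCE loss: each anchor's positive is the matching row in
    z_positive; all other rows of z_positive serve as negatives."""
    # L2-normalize embeddings so the dot product is cosine similarity
    za = z_anchor / np.linalg.norm(z_anchor, axis=1, keepdims=True)
    zp = z_positive / np.linalg.norm(z_positive, axis=1, keepdims=True)
    logits = za @ zp.T / temperature             # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal of the similarity matrix
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss pulls the two views of each instance together and pushes different instances apart, which is what allows representations to be learned without any labels.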
Abstract: Deploying deep learning models on mobile devices has drawn increasing attention recently. However, designing an efficient on-device inference engine faces great challenges in model compatibility, device diversity, and resource limitation. To deal with these challenges, we propose Mobile Neural Network (MNN), a universal and efficient inference engine tailored to mobile applications. In this paper, the contributions of MNN include: (1) presenting a mechanism called pre-inference that conducts runtime optimization; (2) delivering thorough kernel optimization on operators to achieve optimal computation performance; (3) introducing a backend abstraction module that enables hybrid scheduling and keeps the engine lightweight. Extensive benchmark experiments demonstrate that MNN performs favorably against other popular lightweight deep learning frameworks. MNN is available to the public at: https://github.com/alibaba/MNN.
Abstract: Convolutional neural networks (CNNs) have shown good performance in polarimetric synthetic aperture radar (PolSAR) image classification due to their automation of feature engineering. Excellent hand-crafted CNN architectures incorporate the wisdom of human experts, which is an important reason for CNNs' success. However, designing such architectures is a difficult problem that requires much professional knowledge as well as computational resources. Moreover, a hand-designed architecture might be suboptimal, because it is only one of thousands of unobserved but objectively existing candidates. Considering that the success of deep learning is largely due to its automation of the feature engineering process, how to design automatic architecture search methods to replace hand-crafted ones is an interesting topic. In this paper, we explore the application of neural architecture search (NAS) in the PolSAR area for the first time. Instead of directly applying existing NAS methods, we propose a differentiable architecture search (DAS) method customized for PolSAR classification. The proposed DAS is equipped with a PolSAR-tailored search space and an improved one-shot search strategy. With DAS, the weight parameters and architecture parameters (which correspond to the hyperparameters but not the topologies) can be optimized by stochastic gradient descent during training. The optimized architecture parameters are then transformed into the corresponding CNN architecture, which is retrained to achieve high-precision PolSAR classification. In addition, a complex-valued DAS is developed to take into account the characteristics of PolSAR images and further improve performance. Experiments on three PolSAR benchmark datasets show that the CNNs obtained by searching achieve better classification performance than the hand-crafted ones.
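The one-shot, differentiable search described above rests on a continuous relaxation: each edge of the network computes a softmax-weighted mixture of all candidate operations, so the architecture parameters receive gradients jointly with the weights. A minimal NumPy sketch of this generic mixed operation (DARTS-style, assumed here; not necessarily the exact relaxation used by the proposed DAS):

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_op(x, alphas, ops):
    """Continuous relaxation used in one-shot NAS: the edge output is a
    softmax-weighted sum of all candidate operations, so the architecture
    parameters `alphas` are differentiable alongside the weights."""
    w = softmax(alphas)
    return sum(wi * op(x) for wi, op in zip(w, ops))
```

After training, the operation with the largest architecture parameter on each edge is kept, and the resulting discrete network is retrained from scratch, which matches the two-stage procedure described in the abstract.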
Abstract: Redundancy and noise exist in the bands of hyperspectral images (HSIs). It is therefore desirable for HSI classification methods to select suitable subsets from hundreds of input bands. In this letter, a band attention module (BAM) is proposed to equip deep learning based HSI classification with the capacity of band selection or weighting. The proposed BAM can be seen as a plug-and-play complementary component of existing classification networks, which fully considers the adverse effects caused by band redundancy when using convolutional neural networks (CNNs) for HSI classification. Unlike most deep learning methods applied to HSIs, the band attention module, customized according to the characteristics of hyperspectral images, is embedded in ordinary CNNs for better performance. At the same time, unlike classical band selection or weighting methods, the proposed method achieves end-to-end training instead of separate stages. Experiments are carried out on two HSI benchmark datasets. Compared with some classical and advanced deep learning methods, numerical results under different evaluation criteria show that the proposed method has good performance. Last but not least, some advanced CNNs are combined with the proposed BAM for better performance.
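A band attention module of this kind is typically built as a squeeze-and-excitation style gate over the spectral dimension: each band is squeezed to a scalar, a small bottleneck MLP produces one sigmoid gate per band, and the input cube is reweighted band-wise. A minimal NumPy sketch under that assumption (the letter's exact BAM architecture may differ):

```python
import numpy as np

def band_attention(x, w1, w2):
    """SE-style band attention sketch for a hyperspectral cube.

    x  : (bands, H, W) input cube
    w1 : (hidden, bands) bottleneck weights
    w2 : (bands, hidden) expansion weights
    Returns the band-reweighted cube and the per-band gates.
    """
    squeeze = x.mean(axis=(1, 2))                  # global average pool -> (bands,)
    hidden = np.maximum(0.0, w1 @ squeeze)         # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gates in (0, 1)
    return gates[:, None, None] * x, gates
```

Because the gates are produced by differentiable operations, band weighting is learned end-to-end with the downstream classifier rather than as a separate selection stage, matching the property claimed in the abstract.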
Abstract: Accompanied by the successful progress of deep representation learning, convolutional neural networks (CNNs) have been widely applied to improve the accuracy of polarimetric synthetic aperture radar (PolSAR) image classification. However, most applications rarely consider the difference between PolSAR images and optical images. The structures of most existing networks are not tailored to the characteristics of PolSAR data, and the complex-valued data of PolSAR images are simply treated as real-valued data to fit the mainstream network pipeline and avoid complex-valued operations. These practices prevent CNNs from realizing their full potential in PolSAR image classification tasks. In this paper, we focus on finding a better input form for PolSAR image data and designing CNN structures that are more compatible with PolSAR images. Considering the relationship between a complex number and its amplitude and phase, we extract the amplitude and phase of the complex-valued PolSAR data as input, maintaining the integrity of the original information while avoiding the currently immature complex-valued operations, and a novel multi-task CNN framework is proposed to adapt to this form of input. Furthermore, to better exploit the unique phase information in PolSAR data, depthwise separable convolutions are applied to the proposed multi-task CNN model. Experiments on three benchmark datasets not only prove that using amplitude and phase information as input contributes to improved classification accuracy, but also verify the effectiveness of the proposed methods for amplitude and phase input.
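The amplitude-and-phase input described above is straightforward to construct: a complex scattering value z decomposes as z = |z|·exp(jφ), so the two real-valued planes together preserve all the information of the complex data. A small NumPy sketch of this decomposition (a generic illustration, not the paper's full preprocessing pipeline):

```python
import numpy as np

def amplitude_phase_channels(scattering):
    """Split a complex-valued PolSAR channel into amplitude and phase
    planes, preserving the information of the complex data while
    avoiding complex-valued network operations."""
    amplitude = np.abs(scattering)
    phase = np.angle(scattering)  # radians in (-pi, pi]
    return np.stack([amplitude, phase], axis=0)
```

The original complex values can be recovered exactly as `amplitude * np.exp(1j * phase)`, which is the sense in which this input form maintains the integrity of the data.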