Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jianbin Zheng

Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Dec 23, 2025

Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng(+187 more)

Abstract:Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.

* Seedance 1.5 pro Technical Report

Via

Access Paper or Ask Questions

Intelligent Anti-Money Laundering Solution Based upon Novel Community Detection in Massive Transaction Networks on Spark

Jan 08, 2025

Xurui Li, Xiang Cao, Xuetao Qiu, Jintao Zhao, Jianbin Zheng

Abstract:Criminals are using every means available to launder the profits from their illegal activities into ostensibly legitimate assets. Meanwhile, most commercial anti-money laundering systems are still rule-based, which cannot adapt to the ever-changing tricks. Although some machine learning methods have been proposed, they are mainly focused on the perspective of abnormal behavior for single accounts. Considering money laundering activities are often involved in gang criminals, these methods are still not intelligent enough to crack down on criminal gangs all-sidedly. In this paper, a systematic solution is presented to find suspicious money laundering gangs. A temporal-directed Louvain algorithm has been proposed to detect communities according to relevant anti-money laundering patterns. All processes are implemented and optimized on Spark platform. This solution can greatly improve the efficiency of anti-money laundering work for financial regulation agencies.

Via

Access Paper or Ask Questions

Trajectory Consistency Distillation

Feb 29, 2024

Jianbin Zheng, Minghui Hu, Zhongyi Fan, Chaoyue Wang, Changxing Ding, Dacheng Tao, Tat-Jen Cham

Figure 1 for Trajectory Consistency Distillation

Figure 2 for Trajectory Consistency Distillation

Figure 3 for Trajectory Consistency Distillation

Figure 4 for Trajectory Consistency Distillation

Abstract:Latent Consistency Model (LCM) extends the Consistency Model to the latent space and leverages the guided consistency distillation technique to achieve impressive performance in accelerating text-to-image synthesis. However, we observed that LCM struggles to generate images with both clarity and detailed intricacy. To address this limitation, we initially delve into and elucidate the underlying causes. Our investigation identifies that the primary issue stems from errors in three distinct areas. Consequently, we introduce Trajectory Consistency Distillation (TCD), which encompasses trajectory consistency function and strategic stochastic sampling. The trajectory consistency function diminishes the distillation errors by broadening the scope of the self-consistency boundary condition and endowing the TCD with the ability to accurately trace the entire trajectory of the Probability Flow ODE. Additionally, strategic stochastic sampling is specifically designed to circumvent the accumulated errors inherent in multi-step consistency sampling, which is meticulously tailored to complement the TCD model. Experiments demonstrate that TCD not only significantly enhances image quality at low NFEs but also yields more detailed results compared to the teacher model at high NFEs.

* Project Page: https://mhh0318.github.io/tcd

Via

Access Paper or Ask Questions

One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls

Nov 27, 2023

Minghui Hu, Jianbin Zheng, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, Tat-Jen Cham

Figure 1 for One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls

Figure 2 for One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls

Figure 3 for One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls

Figure 4 for One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls

Abstract:It is well known that many open-released foundational diffusion models have difficulty in generating images that substantially depart from average brightness, despite such images being present in the training data. This is due to an inconsistency: while denoising starts from pure Gaussian noise during inference, the training noise schedule retains residual data even in the final timestep distribution, due to difficulties in numerical conditioning in mainstream formulation, leading to unintended bias during inference. To mitigate this issue, certain $\epsilon$-prediction models are combined with an ad-hoc offset-noise methodology. In parallel, some contemporary models have adopted zero-terminal SNR noise schedules together with $\mathbf{v}$-prediction, which necessitate major alterations to pre-trained models. However, such changes risk destabilizing a large multitude of community-driven applications anchored on these pre-trained models. In light of this, our investigation revisits the fundamental causes, leading to our proposal of an innovative and principled remedy, called One More Step (OMS). By integrating a compact network and incorporating an additional simple yet effective step during inference, OMS elevates image fidelity and harmonizes the dichotomy between training and inference, while preserving original model parameters. Once trained, various pre-trained diffusion models with the same latent domain can share the same OMS module.

* Project Page: https://jabir-zheng.github.io/OneMoreStep/, Demo Page: https://huggingface.co/spaces/h1t/oms_sdxl_lcm

Via

Access Paper or Ask Questions

DRL4Route: A Deep Reinforcement Learning Framework for Pick-up and Delivery Route Prediction

Jul 30, 2023

Xiaowei Mao, Haomin Wen, Hengrui Zhang, Huaiyu Wan, Lixia Wu, Jianbin Zheng, Haoyuan Hu, Youfang Lin

Figure 1 for DRL4Route: A Deep Reinforcement Learning Framework for Pick-up and Delivery Route Prediction

Figure 2 for DRL4Route: A Deep Reinforcement Learning Framework for Pick-up and Delivery Route Prediction

Figure 3 for DRL4Route: A Deep Reinforcement Learning Framework for Pick-up and Delivery Route Prediction

Figure 4 for DRL4Route: A Deep Reinforcement Learning Framework for Pick-up and Delivery Route Prediction

Abstract:Pick-up and Delivery Route Prediction (PDRP), which aims to estimate the future service route of a worker given his current task pool, has received rising attention in recent years. Deep neural networks based on supervised learning have emerged as the dominant model for the task because of their powerful ability to capture workers' behavior patterns from massive historical data. Though promising, they fail to introduce the non-differentiable test criteria into the training process, leading to a mismatch in training and test criteria. Which considerably trims down their performance when applied in practical systems. To tackle the above issue, we present the first attempt to generalize Reinforcement Learning (RL) to the route prediction task, leading to a novel RL-based framework called DRL4Route. It combines the behavior-learning abilities of previous deep learning models with the non-differentiable objective optimization ability of reinforcement learning. DRL4Route can serve as a plug-and-play component to boost the existing deep learning models. Based on the framework, we further implement a model named DRL4Route-GAE for PDRP in logistic service. It follows the actor-critic architecture which is equipped with a Generalized Advantage Estimator that can balance the bias and variance of the policy gradient estimates, thus achieving a more optimal policy. Extensive offline experiments and the online deployment show that DRL4Route-GAE improves Location Square Deviation (LSD) by 0.9%-2.7%, and Accuracy@3 (ACC@3) by 2.4%-3.2% over existing methods on the real-world dataset.

* Accepted by KDD23

Via

Access Paper or Ask Questions

Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

Jun 01, 2023

Minghui Hu, Jianbin Zheng, Daqing Liu, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, Tat-Jen Cham

Figure 1 for Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

Figure 2 for Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

Figure 3 for Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

Figure 4 for Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

Abstract:Text-conditional diffusion models are able to generate high-fidelity images with diverse contents. However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery, requiring the incorporation of additional control signals to bolster the efficacy of text-guided diffusion models. In this work, we propose Cocktail, a pipeline to mix various modalities into one embedding, amalgamated with a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method, to actualize multi-modal and spatially-refined control for text-conditional diffusion models. Specifically, we introduce a hyper-network gControlNet, dedicated to the alignment and infusion of the control signals from disparate modalities into the pre-trained diffusion model. gControlNet is capable of accepting flexible modality signals, encompassing the simultaneous reception of any combination of modality signals, or the supplementary fusion of multiple modality signals. The control signals are then fused and injected into the backbone model according to our proposed ControlNorm. Furthermore, our advanced spatial guidance sampling methodology proficiently incorporates the control signal into the designated region, thereby circumventing the manifestation of undesired objects within the generated image. We demonstrate the results of our method in controlling various modalities, proving high-quality synthesis and fidelity to multiple external signals.

* Project Page: https://mhh0318.github.io/cocktail/

Via

Access Paper or Ask Questions

MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis

May 10, 2023

Jianbin Zheng, Daqing Liu, Chaoyue Wang, Minghui Hu, Zuopeng Yang, Changxing Ding, Dacheng Tao

Figure 1 for MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis

Figure 2 for MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis

Figure 3 for MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis

Figure 4 for MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis

Abstract:Existing multimodal conditional image synthesis (MCIS) methods generate images conditioned on any combinations of various modalities that require all of them must be exactly conformed, hindering the synthesis controllability and leaving the potential of cross-modality under-exploited. To this end, we propose to generate images conditioned on the compositions of multimodal control signals, where modalities are imperfectly complementary, i.e., composed multimodal conditional image synthesis (CMCIS). Specifically, we observe two challenging issues of the proposed CMCIS task, i.e., the modality coordination problem and the modality imbalance problem. To tackle these issues, we introduce a Mixture-of-Modality-Tokens Transformer (MMoT) that adaptively fuses fine-grained multimodal control signals, a multimodal balanced training loss to stabilize the optimization of each modality, and a multimodal sampling guidance to balance the strength of each modality control signal. Comprehensive experimental results demonstrate that MMoT achieves superior performance on both unimodal conditional image synthesis (UCIS) and MCIS tasks with high-quality and faithful image synthesis on complex multimodal conditions. The project website is available at https://jabir-zheng.github.io/MMoT.

Via

Access Paper or Ask Questions

G2PTL: A Pre-trained Model for Delivery Address and its Applications in Logistics System

Apr 04, 2023

Lixia Wu, Jianlin Liu, Junhong Lou, Haoyuan Hu, Jianbin Zheng, Haomin Wen, Chao Song, Shu He

Figure 1 for G2PTL: A Pre-trained Model for Delivery Address and its Applications in Logistics System

Figure 2 for G2PTL: A Pre-trained Model for Delivery Address and its Applications in Logistics System

Figure 3 for G2PTL: A Pre-trained Model for Delivery Address and its Applications in Logistics System

Figure 4 for G2PTL: A Pre-trained Model for Delivery Address and its Applications in Logistics System

Abstract:Text-based delivery addresses, as the data foundation for logistics systems, contain abundant and crucial location information. How to effectively encode the delivery address is a core task to boost the performance of downstream tasks in the logistics system. Pre-trained Models (PTMs) designed for Natural Language Process (NLP) have emerged as the dominant tools for encoding semantic information in text. Though promising, those NLP-based PTMs fall short of encoding geographic knowledge in the delivery address, which considerably trims down the performance of delivery-related tasks in logistic systems such as Cainiao. To tackle the above problem, we propose a domain-specific pre-trained model, named G2PTL, a Geography-Graph Pre-trained model for delivery address in Logistics field. G2PTL combines the semantic learning capabilities of text pre-training with the geographical-relationship encoding abilities of graph modeling. Specifically, we first utilize real-world logistics delivery data to construct a large-scale heterogeneous graph of delivery addresses, which contains abundant geographic knowledge and delivery information. Then, G2PTL is pre-trained with subgraphs sampled from the heterogeneous graph. Comprehensive experiments are conducted to demonstrate the effectiveness of G2PTL through four downstream tasks in logistics systems on real-world datasets. G2PTL has been deployed in production in Cainiao's logistics system, which significantly improves the performance of delivery-related tasks.

Via

Access Paper or Ask Questions

Transaction Fraud Detection Using GRU-centered Sandwich-structured Model

Mar 19, 2018

Xurui Li, Wei Yu, Tianyu Luwang, Jianbin Zheng, Xuetao Qiu, Jintao Zhao, Lei Xia, Yujiao Li

Figure 1 for Transaction Fraud Detection Using GRU-centered Sandwich-structured Model

Figure 2 for Transaction Fraud Detection Using GRU-centered Sandwich-structured Model

Figure 3 for Transaction Fraud Detection Using GRU-centered Sandwich-structured Model

Figure 4 for Transaction Fraud Detection Using GRU-centered Sandwich-structured Model

Abstract:Rapid growth of modern technologies such as internet and mobile computing are bringing dramatically increased e-commerce payments, as well as the explosion in transaction fraud. Meanwhile, fraudsters are continually refining their tricks, making rule-based fraud detection systems difficult to handle the ever-changing fraud patterns. Many data mining and artificial intelligence methods have been proposed for identifying small anomalies in large transaction data sets, increasing detecting efficiency to some extent. Nevertheless, there is always a contradiction that most methods are irrelevant to transaction sequence, yet sequence-related methods usually cannot learn information at single-transaction level well. In this paper, a new "within->between->within" sandwich-structured sequence learning architecture has been proposed by stacking an ensemble method, a deep sequential learning method and another top-layer ensemble classifier in proper order. Moreover, attention mechanism has also been introduced in to further improve performance. Models in this structure have been manifested to be very efficient in scenarios like fraud detection, where the information sequence is made up of vectors with complex interconnected features.

* accepted by cscwd2018

Via

Access Paper or Ask Questions