Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Junjie Zhai

Efficient Cross-Architecture Knowledge Transfer for Large-Scale Online User Response Prediction

Feb 02, 2026

Yucheng Wu, Yuekui Yang, Hongzheng Li, Anan Liu, Jian Xiao, Junjie Zhai, Huan Yu, Shaoping Ma, Leye Wang

Abstract:Deploying new architectures in large-scale user response prediction systems incurs high model switching costs due to expensive retraining on massive historical data and performance degradation under data retention constraints. Existing knowledge distillation methods struggle with architectural heterogeneity and the prohibitive cost of transferring large embedding tables. We propose CrossAdapt, a two-stage framework for efficient cross-architecture knowledge transfer. The offline stage enables rapid embedding transfer via dimension-adaptive projections without iterative training, combined with progressive network distillation and strategic sampling to reduce computational cost. The online stage introduces asymmetric co-distillation, where students update frequently while teachers update infrequently, together with a distribution-aware adaptation mechanism that dynamically balances historical knowledge preservation and fast adaptation to evolving data. Experiments on three public datasets show that CrossAdapt achieves 0.27-0.43% AUC improvements while reducing training time by 43-71%. Large-scale deployment on Tencent WeChat Channels (~10M daily samples) further demonstrates its effectiveness, significantly mitigating AUC degradation, LogLoss increase, and prediction bias compared to standard distillation baselines.

* 15 pages

Via

Access Paper or Ask Questions

Distributed Equivalent Substitution Training for Large-Scale Recommender Systems

Sep 10, 2019

Haidong Rong, Yangzihao Wang, Feihu Zhou, Junjie Zhai, Haiyang Wu, Rui Lan, Fan Li, Han Zhang, Yuekui Yang, Zhenyu Guo(+1 more)

Figure 1 for Distributed Equivalent Substitution Training for Large-Scale Recommender Systems

Figure 2 for Distributed Equivalent Substitution Training for Large-Scale Recommender Systems

Figure 3 for Distributed Equivalent Substitution Training for Large-Scale Recommender Systems

Figure 4 for Distributed Equivalent Substitution Training for Large-Scale Recommender Systems

Abstract:We present Distributed Equivalent Substitution (DES) training, a novel distributed training framework for recommender systems with large-scale dynamic sparse features. Our framework achieves faster convergence with less communication overhead and better computing resource utilization. DES strategy splits a weights-rich operator into sub-operators with co-located weights and aggregates partial results with much smaller communication cost to form a computationally equivalent substitution to the original operator. We show that for different types of models that recommender systems use, we can always find computational equivalent substitutions and splitting strategies for their weights-rich operators with theoretical communication load reduced ranging from 72.26% to 99.77%. We also present an implementation of DES that outperforms state-of-the-art recommender systems. Experiments show that our framework achieves up to 83% communication savings compared to other recommender systems, and can bring up to 4.5x improvement on throughput for deep models.

* 10 pages

Via

Access Paper or Ask Questions

Neural Machine Translation with Key-Value Memory-Augmented Attention

Jun 29, 2018

Fandong Meng, Zhaopeng Tu, Yong Cheng, Haiyang Wu, Junjie Zhai, Yuekui Yang, Di Wang

Figure 1 for Neural Machine Translation with Key-Value Memory-Augmented Attention

Figure 2 for Neural Machine Translation with Key-Value Memory-Augmented Attention

Figure 3 for Neural Machine Translation with Key-Value Memory-Augmented Attention

Figure 4 for Neural Machine Translation with Key-Value Memory-Augmented Attention

Abstract:Although attention-based Neural Machine Translation (NMT) has achieved remarkable progress in recent years, it still suffers from issues of repeating and dropping translations. To alleviate these issues, we propose a novel key-value memory-augmented attention model for NMT, called KVMEMATT. Specifically, we maintain a timely updated keymemory to keep track of attention history and a fixed value-memory to store the representation of source sentence throughout the whole translation process. Via nontrivial transformations and iterative interactions between the two memories, the decoder focuses on more appropriate source word(s) for predicting the next target word at each decoding step, therefore can improve the adequacy of translations. Experimental results on Chinese=>English and WMT17 German<=>English translation tasks demonstrate the superiority of the proposed model.

* Accepted at IJCAI 2018

Via

Access Paper or Ask Questions

Towards Robust Neural Machine Translation

May 16, 2018

Yong Cheng, Zhaopeng Tu, Fandong Meng, Junjie Zhai, Yang Liu

Figure 1 for Towards Robust Neural Machine Translation

Figure 2 for Towards Robust Neural Machine Translation

Abstract:Small perturbations in the input can severely distort intermediate representations and thus impact translation quality of neural machine translation (NMT) models. In this paper, we propose to improve the robustness of NMT models with adversarial stability training. The basic idea is to make both the encoder and decoder in NMT models robust against input perturbations by enabling them to behave similarly for the original input and its perturbed counterpart. Experimental results on Chinese-English, English-German and English-French translation tasks show that our approaches can not only achieve significant improvements over strong NMT systems but also improve the robustness of NMT models.

* Accepted by ACL 2018

Via

Access Paper or Ask Questions