Xiang-Rong Sheng

Enhancing Taobao Display Advertising with Multimodal Representations: Challenges, Approaches and Insights

Jul 28, 2024

Xiang-Rong Sheng, Feifan Yang, Litong Gong, Biao Wang, Zhangming Chan, Yujing Zhang, Yueyao Cheng, Yong-Nan Zhu, Tiezheng Ge, Han Zhu(+3 more)

Abstract: Despite the recognized potential of multimodal data to improve model accuracy, many large-scale industrial recommendation systems, including the Taobao display advertising system, predominantly depend on sparse ID features in their models. In this work, we explore approaches to leverage multimodal data to enhance recommendation accuracy. We start by identifying the key challenges in adopting multimodal data in a manner that is both effective and cost-efficient for industrial systems. To address these challenges, we introduce a two-phase framework: 1) pre-training multimodal representations to capture semantic similarity, and 2) integrating these representations with existing ID-based models. Furthermore, we detail the architecture of our production system, which is designed to facilitate the deployment of multimodal representations. Since the integration of multimodal representations in mid-2023, we have observed significant performance improvements in the Taobao display advertising system. We believe that the insights we have gathered will serve as a valuable resource for practitioners seeking to leverage multimodal data in their systems.

* Accepted at CIKM 2024

Calibration-compatible Listwise Distillation of Privileged Features for CTR Prediction

Dec 14, 2023

Xiaoqiang Gui, Yueyao Cheng, Xiang-Rong Sheng, Yunfeng Zhao, Guoxian Yu, Shuguang Han, Yuning Jiang, Jian Xu, Bo Zheng

Abstract: In machine learning systems, privileged features are features that are available during offline training but inaccessible for online serving. Previous studies have recognized the importance of privileged features and explored ways to tackle online-offline discrepancies. A typical practice is privileged features distillation (PFD): train a teacher model using all features (including privileged ones), then distill the teacher's knowledge into a student model (which excludes the privileged features), and employ the student for online serving. In practice, the pointwise cross-entropy loss is often adopted for PFD. However, this loss is insufficient to distill the ranking ability for CTR prediction. First, it does not consider the non-i.i.d. characteristic of the data distribution, i.e., other items on the same page significantly impact the click probability of the candidate item. Second, it fails to consider the relative item order ranked by the teacher model's predictions, which is essential to distill the ranking ability. To address these issues, we first extend the pointwise-based PFD to the listwise-based PFD. We then define the calibration-compatible property of a distillation loss and show that commonly used listwise losses do not satisfy this property when employed as distillation losses, thus compromising the model's calibration ability, which is another important measure for CTR prediction. To tackle this dilemma, we propose Calibration-compatible LIstwise Distillation (CLID), which employs a carefully designed listwise distillation loss to achieve better ranking ability than the pointwise-based PFD while preserving the model's calibration ability. We theoretically prove that it is calibration-compatible. Extensive experiments on public datasets and a production dataset collected from the display advertising system of Alibaba further demonstrate the effectiveness of CLID.
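
The contrast between pointwise and listwise distillation can be illustrated with a minimal sketch. The listwise loss below is a generic softmax cross-entropy over one page of items, chosen here for illustration; it is not the CLID loss, whose form the abstract does not give. All function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pointwise_distill_loss(student_logits, teacher_probs):
    """Mean cross-entropy between teacher soft labels and student
    probabilities, computed item by item (no page context)."""
    losses = []
    for z, t in zip(student_logits, teacher_probs):
        p = sigmoid(z)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)

def listwise_distill_loss(student_logits, teacher_logits):
    """Cross-entropy between the teacher's and student's softmax
    distributions over one page (list) of items. Generic form, not CLID."""
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 0.0, -2.0]
# Adding a constant to every student logit changes each predicted click
# probability, yet the softmax-based listwise loss cannot see the change.
shifted_logits = [z + 3.0 for z in student_logits]
print(listwise_distill_loss(student_logits, teacher_logits))
print(listwise_distill_loss(shifted_logits, teacher_logits))
```

The two printed values coincide because softmax is invariant to a common shift of the logits. This is one intuition for why a plain listwise loss, used alone, leaves the absolute scale of the predictions unconstrained.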

* This paper has been accepted by WSDM'24

Entire Space Cascade Delayed Feedback Modeling for Effective Conversion Rate Prediction

Aug 09, 2023

Yunfeng Zhao, Xu Yan, Xiaoqiang Gui, Shuguang Han, Xiang-Rong Sheng, Guoxian Yu, Jufeng Chen, Zhao Xu, Bo Zheng

Abstract: Conversion rate (CVR) prediction is an essential task for large-scale e-commerce platforms. However, refunds frequently occur after conversion in online shopping systems, which drives us to pay attention to effective conversion for building healthier shopping services. This paper defines the probability of purchasing an item without any subsequent refund as the effective conversion rate (ECVR). A simple paradigm for ECVR prediction is to decompose it into two sub-tasks: CVR prediction and post-conversion refund rate (RFR) prediction. However, RFR prediction suffers from data sparsity (DS) and sample selection bias (SSB) issues, as refund behaviors are only observable after a user purchase. Furthermore, both conversion and refund events have delayed feedback and are sequentially dependent, a setting we name cascade delayed feedback (CDF), which significantly harms data freshness for model training. Previous studies mainly focus on tackling DS and SSB, or delayed feedback for a single event. To jointly tackle these issues in ECVR prediction, we propose an Entire space CAscade Delayed feedback modeling (ECAD) method. Specifically, ECAD deals with DS and SSB by constructing two tasks, CVR prediction and conversion & refund rate (CVRFR) prediction, using the entire space modeling framework. In addition, it carefully schedules auxiliary tasks to leverage both conversion and refund time within the data to alleviate CDF. Experimental results on an offline industrial dataset and online A/B testing demonstrate the effectiveness of ECAD. In addition, ECAD has been deployed in one of the recommender systems in Alibaba, contributing to a significant improvement of ECVR.
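
The two ways of decomposing ECVR mentioned above are linked by a basic probability identity, sketched below. It assumes CVR and CVRFR are conditioned on the same event (for example, a click). How ECAD actually combines its task outputs is not stated in the abstract, and the function names are hypothetical.

```python
def ecvr_from_refund_rate(cvr, rfr):
    """P(convert and no refund) = P(convert) * (1 - P(refund | convert)).
    RFR is only observable on converted samples."""
    return cvr * (1.0 - rfr)

def ecvr_from_entire_space(cvr, cvrfr):
    """P(convert and no refund) = P(convert) - P(convert and refund).
    Both terms are defined over the same (entire) sample space."""
    return cvr - cvrfr

cvr = 0.10          # illustrative value: 10% of exposures convert
rfr = 0.20          # illustrative value: 20% of conversions are refunded
cvrfr = cvr * rfr   # joint probability of conversion followed by refund

print(ecvr_from_refund_rate(cvr, rfr))
print(ecvr_from_entire_space(cvr, cvrfr))
```

Both routes yield the same ECVR. The second avoids a model that can only be trained on the converted subset, which is where the sparsity and selection-bias concerns arise.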

* Accepted to CIKM'23

COPR: Consistency-Oriented Pre-Ranking for Online Advertising

Jun 06, 2023

Zhishan Zhao, Jingyue Gao, Yu Zhang, Shuguang Han, Siyuan Lou, Xiang-Rong Sheng, Zhe Wang, Han Zhu, Yuning Jiang, Jian Xu(+1 more)

Abstract: The cascading architecture has been widely adopted in large-scale advertising systems to balance efficiency and effectiveness. In this architecture, the pre-ranking model is expected to be a lightweight approximation of the ranking model, handling more candidates under strict latency requirements. Due to the gap in model capacity, the pre-ranking and ranking models usually generate inconsistent ranked results, thus hurting the overall system effectiveness. The paradigm of score alignment has been proposed to regularize their raw scores to be consistent. However, it suffers from inevitable alignment errors and error amplification by bids when applied in online advertising. To this end, we introduce a consistency-oriented pre-ranking framework for online advertising, which employs a chunk-based sampling module and a plug-and-play rank alignment module to explicitly optimize the consistency of ECPM-ranked results. A ΔNDCG-based weighting mechanism is adopted to better distinguish the importance of inter-chunk samples in optimization. Both online and offline experiments have validated the superiority of our framework. When deployed in the Taobao display advertising system, it achieves an improvement of up to +12.3% CTR and +5.6% RPM.
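
A small sketch of the two quantities named above may help. It assumes the common convention ECPM = pCTR × bid × 1000 and a LambdaRank-style ΔNDCG (the absolute NDCG change when two positions are swapped). The exact definitions and the chunk-level weighting used by COPR are not given in the abstract, so the names below are hypothetical.

```python
import math

def ecpm(pctr, bid):
    """Expected cost per mille under the assumed convention pCTR * bid * 1000."""
    return pctr * bid * 1000.0

def dcg(gains):
    """Discounted cumulative gain of a list ordered by rank (position 0 first)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def delta_ndcg(gains, i, j):
    """Absolute NDCG change if the items at positions i and j swap places.
    Assumes at least one gain is positive so the ideal DCG is non-zero."""
    ideal = dcg(sorted(gains, reverse=True))
    swapped = list(gains)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(gains)) / ideal

# A higher bid can reverse the order implied by the raw scores alone.
print(ecpm(0.050, 1.0), ecpm(0.048, 1.2))

# Swaps near the top of the list move NDCG more than swaps near the bottom.
gains = [3.0, 2.0, 1.0, 0.0]
print(delta_ndcg(gains, 0, 1), delta_ndcg(gains, 2, 3))
```

Weighting a pair by such a ΔNDCG value makes mis-orderings near the top of the ECPM ranking cost more than those near the bottom.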

Capturing Conversion Rate Fluctuation during Sales Promotions: A Novel Historical Data Reuse Approach

May 22, 2023

Zhangming Chan, Yu Zhang, Shuguang Han, Yong Bai, Xiang-Rong Sheng, Siyuan Lou, Jiacen Hu, Baolin Liu, Yuning Jiang, Jian Xu(+1 more)

Abstract: Conversion rate (CVR) prediction is one of the core components in online recommender systems, and various approaches have been proposed to obtain accurate and well-calibrated CVR estimation. However, we observe that a well-trained CVR prediction model often performs sub-optimally during sales promotions. This can be largely ascribed to data distribution shift, under which conventional methods no longer work. To this end, we seek to develop alternative modeling techniques for CVR prediction. Observing similar purchase patterns across different promotions, we propose reusing historical promotion data to capture the promotional conversion patterns. Specifically, we propose a novel Historical Data Reuse (HDR) approach that first retrieves historically similar promotion data and then fine-tunes the CVR prediction model with the acquired data for better adaptation to the promotion mode. HDR consists of three components: an automated data retrieval module that seeks similar data from historical promotions, a distribution shift correction module that re-weights the retrieved data to better align with the target promotion, and a TransBlock module that quickly fine-tunes the original model for better adaptation to the promotion mode. Experiments conducted with real-world data demonstrate the effectiveness of HDR, as it improves both ranking and calibration metrics to a large extent. HDR has also been deployed on the display advertising system in Alibaba, bringing a lift of 9% RPM and 16% CVR during Double 11 Sales in 2022.

* Accepted at KDD 2023 (camera-ready version coming soon). This work has already been deployed on the display advertising system in Alibaba, bringing substantial economic gains

Towards Understanding the Overfitting Phenomenon of Deep Click-Through Rate Prediction Models

Sep 04, 2022

Zhao-Yu Zhang, Xiang-Rong Sheng, Yujing Zhang, Biye Jiang, Shuguang Han, Hongbo Deng, Bo Zheng

Abstract: Deep learning techniques have been applied widely in industrial recommendation systems. However, far less attention has been paid to the overfitting problem of models in recommendation systems, even though overfitting is recognized as a critical issue for deep neural networks. In the context of Click-Through Rate (CTR) prediction, we observe an interesting one-epoch overfitting problem: the model performance exhibits a dramatic degradation at the beginning of the second epoch. Such a phenomenon has been witnessed widely in real-world applications of CTR models. As a result, the best performance is usually achieved by training for only one epoch. To understand the underlying factors behind the one-epoch phenomenon, we conduct extensive experiments on the production data set collected from the display advertising system of Alibaba. The results show that the model structure, the optimization algorithm with a fast convergence rate, and the feature sparsity are closely related to the one-epoch phenomenon. We also provide a likely hypothesis for explaining such a phenomenon and conduct a set of proof-of-concept experiments. We hope this work can shed light on future research on training for more epochs to achieve better performance.

* Accepted by CIKM2022

Joint Optimization of Ranking and Calibration with Contextualized Hybrid Model

Aug 12, 2022

Xiang-Rong Sheng, Jingyue Gao, Yueyao Cheng, Siran Yang, Shuguang Han, Hongbo Deng, Yuning Jiang, Jian Xu, Bo Zheng

Abstract: Despite the development of ranking optimization techniques, the pointwise model remains the dominant approach for click-through rate (CTR) prediction. This can be attributed to the calibration ability of the pointwise model, since its prediction can be viewed as the click probability. In practice, a CTR prediction model is also commonly assessed on its ranking ability, for which prediction models based on ranking losses (e.g., pairwise or listwise loss) usually achieve better performance than those based on the pointwise loss. Previous studies have experimented with a direct combination of the two losses to obtain the benefit of both and observed improved performance. However, these studies break the meaning of the output logit as the click-through rate, which may lead to sub-optimal solutions. To address this issue, we propose an approach that can Jointly optimize the Ranking and Calibration abilities (JRC for short). JRC improves the ranking ability by contrasting the logit values for the sample with different labels and constrains the predicted probability to be a function of the logit subtraction. We further show that JRC consolidates the interpretation of logits, where the logits model the joint distribution. With such an interpretation, we prove that JRC approximately optimizes the contextualized hybrid discriminative-generative objective. Experiments on public and industrial datasets and online A/B testing show that our approach improves both ranking and calibration abilities. Since May 2022, JRC has been deployed on the display advertising platform of Alibaba and has obtained significant performance improvements.
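
The idea of a two-logit output whose probability depends only on the logit difference can be sketched as follows. The `click_prob` function reflects the constraint stated in the abstract; `contrast_loss` is only one plausible reading of "contrasting the logit value", written here as a softmax over the samples of one context, and should not be taken as the exact JRC objective. All names are hypothetical.

```python
import math

def click_prob(z_noclick, z_click):
    """Predicted click probability as a function of the logit subtraction."""
    return 1.0 / (1.0 + math.exp(-(z_click - z_noclick)))

def two_way_softmax(z_noclick, z_click):
    """Softmax over the two label logits; equals click_prob analytically."""
    m = max(z_noclick, z_click)
    e0, e1 = math.exp(z_noclick - m), math.exp(z_click - m)
    return e1 / (e0 + e1)

def contrast_loss(logits, labels):
    """Assumed ranking term: for each sample i with label y, compare its
    y-logit against the y-logits of all samples in the same context.
    logits: list of (z_noclick, z_click) pairs; labels: list of 0/1."""
    total = 0.0
    for i, y in enumerate(labels):
        column = [pair[y] for pair in logits]
        m = max(column)
        log_norm = m + math.log(sum(math.exp(z - m) for z in column))
        total += -(logits[i][y] - log_norm)
    return total / len(labels)

context_logits = [(0.3, 1.7), (1.2, -0.4), (0.0, 0.5)]
context_labels = [1, 0, 0]
print(click_prob(*context_logits[0]), two_way_softmax(*context_logits[0]))
print(contrast_loss(context_logits, context_labels))
```

Because the probability depends only on the difference of the two logits, it keeps its meaning as a click probability while the individual logits remain free to be compared across samples.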

Real Negatives Matter: Continuous Training with Real Negatives for Delayed Feedback Modeling

Apr 29, 2021

Siyu Gu, Xiang-Rong Sheng, Ying Fan, Guorui Zhou, Xiaoqiang Zhu

Abstract: One of the difficulties of conversion rate (CVR) prediction is that conversions can be delayed and take place long after the clicks. The delayed feedback poses a challenge: fresh data are beneficial to continuous training but may not have complete label information at the time they are ingested into the training pipeline. To balance model freshness and label certainty, previous methods set a short waiting window or do not wait for the conversion signal at all. If a conversion happens outside the waiting window, the sample is duplicated and ingested into the training pipeline with a positive label. However, these methods have some issues. First, they assume the observed feature distribution remains the same as the actual distribution, but this assumption does not hold due to the ingestion of duplicated samples. Second, the certainty of the conversion action only comes from the positives, but positives are scarce because conversions are sparse in commercial systems. These issues induce bias during the modeling of delayed feedback. In this paper, we propose the DElayed FEedback modeling with Real negatives (DEFER) method to address these issues. The proposed method ingests real negative samples into the training pipeline. The ingestion of real negatives ensures the observed feature distribution is equivalent to the actual distribution, thus reducing the bias. The ingestion of real negatives also brings more certainty about the conversion. To correct the distribution shift, DEFER employs importance sampling to weight the loss function. Experimental results on industrial datasets validate the superiority of DEFER. DEFER has been deployed in the display advertising system of Alibaba, obtaining over 6.0% improvement on CVR in several scenarios. The code and data in this paper are now open-sourced at https://github.com/gusuperstar/defer.git.
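
Weighting a loss by importance weights can be sketched generically as below. The weights stand for the ratio between the actual and the observed data distribution; the specific weights DEFER derives are not given in the abstract, so the values and the function name here are purely illustrative.

```python
import math

def importance_weighted_log_loss(preds, labels, weights):
    """Binary cross-entropy where each sample's term is multiplied by its
    importance weight, then averaged over the number of samples."""
    eps = 1e-12  # clamp predictions to avoid log(0)
    total = 0.0
    for p, y, w in zip(preds, labels, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(preds)

preds = [0.9, 0.2, 0.6]
labels = [1, 0, 0]

# With unit weights this reduces to the ordinary log loss.
print(importance_weighted_log_loss(preds, labels, [1.0, 1.0, 1.0]))

# Hypothetical weights: the third sample is under-represented in the
# observed stream, so its loss term counts twice.
print(importance_weighted_log_loss(preds, labels, [1.0, 1.0, 2.0]))
```

With correct weights, the weighted loss computed on the observed stream matches in expectation the loss under the actual distribution, which is the purpose of the correction.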

One Model to Serve All: Star Topology Adaptive Recommender for Multi-Domain CTR Prediction

Jan 27, 2021

Xiang-Rong Sheng, Liqin Zhao, Guorui Zhou, Xinyao Ding, Binding Dai, Qiang Luo, Siran Yang, Jingshan Lv, Chi Zhang, Xiaoqiang Zhu

Abstract: Traditional industrial recommenders are usually trained on a single business domain and then serve that domain. In large commercial platforms, however, recommenders often need to make click-through rate (CTR) predictions for multiple business domains. Different domains have overlapping user groups and items, so commonalities exist among them. Since the specific user group may differ and user behavior may change within a specific domain, different domains also have distinctions. These distinctions result in different domain-specific data distributions, which makes it hard for a single shared model to work well on all domains. To address the problem, we present the Star Topology Adaptive Recommender (STAR), where one model is learned to serve all domains effectively. Concretely, STAR has a star topology, which consists of shared centered parameters and domain-specific parameters. The shared parameters are used to learn the commonalities of all domains, and the domain-specific parameters capture domain distinctions for more refined prediction. Given requests from different domains, STAR can adapt its parameters conditioned on the domain. Experimental results on production data validate the superiority of the proposed STAR model. Up to now, STAR has been deployed in the display advertising system of Alibaba, obtaining an average improvement of 8.0% on CTR and 6.0% on RPM (Revenue Per Mille).
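
One way to picture shared centered parameters adapted per domain is a single fully connected layer whose effective weights depend on the domain. The combination rule below (element-wise product of weights, sum of biases) is an assumption made for this sketch; the abstract does not state how the two parameter sets are combined. All names are hypothetical.

```python
def combine_weights(shared_w, domain_w):
    """Assumed rule: effective weight = shared weight * domain weight, element-wise."""
    return [[s * d for s, d in zip(s_row, d_row)]
            for s_row, d_row in zip(shared_w, domain_w)]

def star_layer(x, shared_w, shared_b, domain_w, domain_b):
    """Linear layer whose parameters are adapted to the requesting domain.
    x has length n_in; weight matrices are n_in x n_out; biases have length n_out."""
    w = combine_weights(shared_w, domain_w)
    b = [sb + db for sb, db in zip(shared_b, domain_b)]
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

x = [1.0, 2.0]
shared_w = [[0.5, -1.0], [0.25, 2.0]]
shared_b = [0.1, 0.0]

# Per-domain parameters, selected by a domain indicator at request time.
domain_params = {
    "domain_a": ([[1.0, 1.0], [1.0, 1.0]], [0.0, 0.0]),   # identity adaptation
    "domain_b": ([[2.0, 0.0], [1.0, 1.0]], [0.0, 0.5]),
}

for name, (dw, db) in domain_params.items():
    print(name, star_layer(x, shared_w, shared_b, dw, db))
```

Under this rule, a domain whose weights are all ones and whose biases are all zeros falls back to the shared model, so domain-specific parameters only need to encode deviations from the commonalities.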

CAN: Revisiting Feature Co-Action for Click-Through Rate Prediction

Nov 11, 2020

Guorui Zhou, Weijie Bian, Kailun Wu, Lejian Ren, Qi Pi, Yujing Zhang, Can Xiao, Xiang-Rong Sheng, Na Mou, Xinchen Luo(+6 more)

Abstract: Inspired by the success of deep learning, recent industrial Click-Through Rate (CTR) prediction models have made the transition from traditional shallow approaches to deep approaches. Deep Neural Networks (DNNs) are known for their ability to learn non-linear interactions from raw features automatically; however, the non-linear feature interaction is learned in an implicit manner. Such non-linear interaction may be hard to capture, and explicitly modeling the co-action of raw features is beneficial for CTR prediction. Co-action refers to the collective effects of features on the final prediction. In this paper, we argue that current CTR models do not fully explore the potential of feature co-action. We conduct experiments and show that the effect of feature co-action is seriously underestimated. Motivated by our observation, we propose the feature Co-Action Network (CAN) to explore the potential of feature co-action. The proposed model can efficiently and effectively capture feature co-action, which improves model performance while reducing storage and computation consumption. Experimental results on public and industrial datasets show that CAN outperforms state-of-the-art CTR models by a large margin. Up to now, CAN has been deployed in the Alibaba display advertisement system, obtaining an average improvement of 12% on CTR and 8% on RPM.
