Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhenzhe Zheng

Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance

Apr 16, 2025

Shangyu Liu, Zhenzhe Zheng, Xiaoyao Huang, Fan Wu, Guihai Chen, Jie Wu

Abstract:Small language models (SLMs) support efficient deployments on resource-constrained edge devices, but their limited capacity compromises inference performance. Retrieval-augmented generation (RAG) is a promising solution to enhance model performance by integrating external databases, without requiring intensive on-device model retraining. However, large-scale public databases and user-specific private contextual documents are typically located on the cloud and the device separately, while existing RAG implementations are primarily centralized. To bridge this gap, we propose DRAGON, a distributed RAG framework to enhance on-device SLMs through both general and personal knowledge without the risk of leaking document privacy. Specifically, DRAGON decomposes multi-document RAG into multiple parallel token generation processes performed independently and locally on the cloud and the device, and employs a newly designed Speculative Aggregation, a dual-side speculative algorithm to avoid frequent output synchronization between the cloud and device. A new scheduling algorithm is further introduced to identify the optimal aggregation side based on real-time network conditions. Evaluations on real-world hardware testbed demonstrate a significant performance improvement of DRAGON-up to 1.9x greater gains over standalone SLM compared to the centralized RAG, substantial reduction in per-token latency, and negligible Time to First Token (TTFT) overhead.

Via

Access Paper or Ask Questions

AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference

Jan 04, 2025

Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, Fan Wu

Figure 1 for AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference

Figure 2 for AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference

Figure 3 for AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference

Figure 4 for AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference

Abstract:Long-context large language models (LLMs) inference is increasingly critical, motivating a number of studies devoted to alleviating the substantial storage and computational costs in such scenarios. Layer-wise skipping methods are promising optimizations but rarely explored in long-context inference. We observe that existing layer-wise skipping strategies have several limitations when applied in long-context inference, including the inability to adapt to model and context variability, disregard for sublayer significance, and inapplicability for the prefilling phase. This paper proposes \sysname, an adaptive sublayer skipping method specifically designed for long-context inference. \sysname adaptively identifies less important layers by leveraging on-the-fly similarity information, enables sublayer-wise skipping, and accelerates both the prefilling and decoding phases. The effectiveness of \sysname is demonstrated through extensive experiments on various long-context benchmarks and models, showcasing its superior inference performance over existing baselines.

* 9 pages,10 figures, AAAI

Via

Access Paper or Ask Questions

Delta: A Cloud-assisted Data Enrichment Framework for On-Device Continual Learning

Oct 24, 2024

Chen Gong, Zhenzhe Zheng, Fan Wu, Xiaofeng Jia, Guihai Chen

Abstract:In modern mobile applications, users frequently encounter various new contexts, necessitating on-device continual learning (CL) to ensure consistent model performance. While existing research predominantly focused on developing lightweight CL frameworks, we identify that data scarcity is a critical bottleneck for on-device CL. In this work, we explore the potential of leveraging abundant cloud-side data to enrich scarce on-device data, and propose a private, efficient and effective data enrichment framework Delta. Specifically, Delta first introduces a directory dataset to decompose the data enrichment problem into device-side and cloud-side sub-problems without sharing sensitive data. Next, Delta proposes a soft data matching strategy to effectively solve the device-side sub-problem with sparse user data, and an optimal data sampling scheme for cloud server to retrieve the most suitable dataset for enrichment with low computational complexity. Further, Delta refines the data sampling scheme by jointly considering the impact of enriched data on both new and past contexts, mitigating the catastrophic forgetting issue from a new aspect. Comprehensive experiments across four typical mobile computing tasks with varied data modalities demonstrate that Delta could enhance the overall model accuracy by an average of 15.1%, 12.4%, 1.1% and 5.6% for visual, IMU, audio and textual tasks compared with few-shot CL, and consistently reduce the communication costs by over 90% compared to federated CL.

Via

Access Paper or Ask Questions

MEBS: Multi-task End-to-end Bid Shading for Multi-slot Display Advertising

Mar 05, 2024

Zhen Gong, Lvyin Niu, Yang Zhao, Miao Xu, Zhenzhe Zheng, Haoqi Zhang, Zhilin Zhang, Fan Wu, Rongquan Bai, Chuan Yu(+2 more)

Figure 1 for MEBS: Multi-task End-to-end Bid Shading for Multi-slot Display Advertising

Figure 2 for MEBS: Multi-task End-to-end Bid Shading for Multi-slot Display Advertising

Figure 3 for MEBS: Multi-task End-to-end Bid Shading for Multi-slot Display Advertising

Figure 4 for MEBS: Multi-task End-to-end Bid Shading for Multi-slot Display Advertising

Abstract:Online bidding and auction are crucial aspects of the online advertising industry. Conventionally, there is only one slot for ad display and most current studies focus on it. Nowadays, multi-slot display advertising is gradually becoming popular where many ads could be displayed in a list and shown as a whole to users. However, multi-slot display advertising leads to different cost-effectiveness. Advertisers have the incentive to adjust bid prices so as to win the most economical ad positions. In this study, we introduce bid shading into multi-slot display advertising for bid price adjustment with a Multi-task End-to-end Bid Shading(MEBS) method. We prove the optimality of our method theoretically and examine its performance experimentally. Through extensive offline and online experiments, we demonstrate the effectiveness and efficiency of our method, and we obtain a 7.01% lift in Gross Merchandise Volume, a 7.42% lift in Return on Investment, and a 3.26% lift in ad buy count.

Via

Access Paper or Ask Questions

Trajectory-wise Iterative Reinforcement Learning Framework for Auto-bidding

Feb 23, 2024

Haoming Li, Yusen Huo, Shuai Dou, Zhenzhe Zheng, Zhilin Zhang, Chuan Yu, Jian Xu, Fan Wu

Abstract:In online advertising, advertisers participate in ad auctions to acquire ad opportunities, often by utilizing auto-bidding tools provided by demand-side platforms (DSPs). The current auto-bidding algorithms typically employ reinforcement learning (RL). However, due to safety concerns, most RL-based auto-bidding policies are trained in simulation, leading to a performance degradation when deployed in online environments. To narrow this gap, we can deploy multiple auto-bidding agents in parallel to collect a large interaction dataset. Offline RL algorithms can then be utilized to train a new policy. The trained policy can subsequently be deployed for further data collection, resulting in an iterative training framework, which we refer to as iterative offline RL. In this work, we identify the performance bottleneck of this iterative offline RL framework, which originates from the ineffective exploration and exploitation caused by the inherent conservatism of offline RL algorithms. To overcome this bottleneck, we propose Trajectory-wise Exploration and Exploitation (TEE), which introduces a novel data collecting and data utilization method for iterative offline RL from a trajectory perspective. Furthermore, to ensure the safety of online exploration while preserving the dataset quality for TEE, we propose Safe Exploration by Adaptive Action Selection (SEAS). Both offline experiments and real-world experiments on Alibaba display advertising platform demonstrate the effectiveness of our proposed method.

* Accepted by The Web Conference 2024

Via

Access Paper or Ask Questions

ECLM: Efficient Edge-Cloud Collaborative Learning with Continuous Environment Adaptation

Nov 18, 2023

Yan Zhuang, Zhenzhe Zheng, Yunfeng Shao, Bingshuai Li, Fan Wu, Guihai Chen

Abstract:Pervasive mobile AI applications primarily employ one of the two learning paradigms: cloud-based learning (with powerful large models) or on-device learning (with lightweight small models). Despite their own advantages, neither paradigm can effectively handle dynamic edge environments with frequent data distribution shifts and on-device resource fluctuations, inevitably suffering from performance degradation. In this paper, we propose ECLM, an edge-cloud collaborative learning framework for rapid model adaptation for dynamic edge environments. We first propose a novel block-level model decomposition design to decompose the original large cloud model into multiple combinable modules. By flexibly combining a subset of the modules, this design enables the derivation of compact, task-specific sub-models for heterogeneous edge devices from the large cloud model, and the seamless integration of new knowledge learned on these devices into the cloud model periodically. As such, ECLM ensures that the cloud model always provides up-to-date sub-models for edge devices. We further propose an end-to-end learning framework that incorporates the modular model design into an efficient model adaptation pipeline including an offline on-cloud model prototyping and training stage, and an online edge-cloud collaborative adaptation stage. Extensive experiments over various datasets demonstrate that ECLM significantly improves model performance (e.g., 18.89% accuracy increase) and resource efficiency (e.g., 7.12x communication cost reduction) in adapting models to dynamic edge environments by efficiently collaborating the edge and the cloud models.

Via

Access Paper or Ask Questions

Hierarchically Constrained Adaptive Ad Exposure in Feeds

May 31, 2022

Dagui Chen, Qi Yan, Chunjie Chen, Zhenzhe Zheng, Yangsu Liu, Zhenjia Ma, Chuan Yu, Jian Xu, Bo Zheng

Figure 1 for Hierarchically Constrained Adaptive Ad Exposure in Feeds

Figure 2 for Hierarchically Constrained Adaptive Ad Exposure in Feeds

Figure 3 for Hierarchically Constrained Adaptive Ad Exposure in Feeds

Figure 4 for Hierarchically Constrained Adaptive Ad Exposure in Feeds

Abstract:A contemporary feed application usually provides blended results of organic items and sponsored items~(ads) to users. Conventionally, ads are exposed at fixed positions. Such a static exposure strategy is inefficient due to ignoring users' personalized preferences towards ads. To this end, adaptive ad exposure has become an appealing strategy to boost the overall performance of the feed. However, existing approaches to implementing the adaptive ad exposure still suffer from several limitations: 1) they usually fall into sub-optimal solutions because of only focusing on request-level optimization without consideration of the long-term application-level performance and constraints, 2) they neglect the necessity of keeping the game-theoretical properties of ad auctions, which may lead to anarchy in bidding, and 3) they can hardly be deployed in large-scale applications due to high computational complexity. In this paper, we focus on long-term performance optimization under hierarchical constraints in feeds and formulate the adaptive ad exposure as a Dynamic Knapsack Problem. We propose an effective approach: Hierarchically Constrained Adaptive Ad Exposure~(HCA2E). We present that HCA2E possesses desired game-theoretical properties, computational efficiency, and performance robustness. Comprehensive offline and online experiments on a leading e-commerce application demonstrate the significant performance superiority of HCA2E over representative baselines. HCA2E has also been deployed on this application to serve millions of daily users.

Via

Access Paper or Ask Questions

Data-Free Evaluation of User Contributions in Federated Learning

Aug 24, 2021

Hongtao Lv, Zhenzhe Zheng, Tie Luo, Fan Wu, Shaojie Tang, Lifeng Hua, Rongfei Jia, Chengfei Lv

Figure 1 for Data-Free Evaluation of User Contributions in Federated Learning

Figure 2 for Data-Free Evaluation of User Contributions in Federated Learning

Figure 3 for Data-Free Evaluation of User Contributions in Federated Learning

Figure 4 for Data-Free Evaluation of User Contributions in Federated Learning

Abstract:Federated learning (FL) trains a machine learning model on mobile devices in a distributed manner using each device's private data and computing resources. A critical issues is to evaluate individual users' contributions so that (1) users' effort in model training can be compensated with proper incentives and (2) malicious and low-quality users can be detected and removed. The state-of-the-art solutions require a representative test dataset for the evaluation purpose, but such a dataset is often unavailable and hard to synthesize. In this paper, we propose a method called Pairwise Correlated Agreement (PCA) based on the idea of peer prediction to evaluate user contribution in FL without a test dataset. PCA achieves this using the statistical correlation of the model parameters uploaded by users. We then apply PCA to designing (1) a new federated learning algorithm called Fed-PCA, and (2) a new incentive mechanism that guarantees truthfulness. We evaluate the performance of PCA and Fed-PCA using the MNIST dataset and a large industrial product recommendation dataset. The results demonstrate that our Fed-PCA outperforms the canonical FedAvg algorithm and other baseline methods in accuracy, and at the same time, PCA effectively incentivizes users to behave truthfully.

* accepted by WiOpt 2021

Via

Access Paper or Ask Questions

A Cooperative-Competitive Multi-Agent Framework for Auto-bidding in Online Advertising

Jun 11, 2021

Chao Wen, Miao Xu, Zhilin Zhang, Zhenzhe Zheng, Yuhui Wang, Xiangyu Liu, Yu Rong, Dong Xie, Xiaoyang Tan, Chuan Yu(+4 more)

Figure 1 for A Cooperative-Competitive Multi-Agent Framework for Auto-bidding in Online Advertising

Figure 2 for A Cooperative-Competitive Multi-Agent Framework for Auto-bidding in Online Advertising

Figure 3 for A Cooperative-Competitive Multi-Agent Framework for Auto-bidding in Online Advertising

Figure 4 for A Cooperative-Competitive Multi-Agent Framework for Auto-bidding in Online Advertising

Abstract:In online advertising, auto-bidding has become an essential tool for advertisers to optimize their preferred ad performance metrics by simply expressing the high-level campaign objectives and constraints. Previous works consider the design of auto-bidding agents from the single-agent view without modeling the mutual influence between agents. In this paper, we instead consider this problem from the perspective of a distributed multi-agent system, and propose a general Multi-Agent reinforcement learning framework for Auto-Bidding, namely MAAB, to learn the auto-bidding strategies. First, we investigate the competition and cooperation relation among auto-bidding agents, and propose temperature-regularized credit assignment for establishing a mixed cooperative-competitive paradigm. By carefully making a competition and cooperation trade-off among the agents, we can reach an equilibrium state that guarantees not only individual advertiser's utility but also the system performance (social welfare). Second, due to the observed collusion behaviors of bidding low prices underlying the cooperation, we further propose bar agents to set a personalized bidding bar for each agent, and then to alleviate the degradation of revenue. Third, to deploy MAAB to the large-scale advertising system with millions of advertisers, we propose a mean-field approach. By grouping advertisers with the same objective as a mean auto-bidding agent, the interactions among advertisers are greatly simplified, making it practical to train MAAB efficiently. Extensive experiments on the offline industrial dataset and Alibaba advertising platform demonstrate that our approach outperforms several baseline methods in terms of social welfare and guarantees the ad platform's revenue.

* 10 pages

Via

Access Paper or Ask Questions

We Know What You Want: An Advertising Strategy Recommender System for Online Advertising

Jun 08, 2021

Liyi Guo, Junqi Jin, Haoqi Zhang, Zhenzhe Zheng, Zhiye Yang, Zhizhuang Xing, Fei Pan, Lvyin Niu, Fan Wu, Haiyang Xu(+3 more)

Figure 1 for We Know What You Want: An Advertising Strategy Recommender System for Online Advertising

Figure 2 for We Know What You Want: An Advertising Strategy Recommender System for Online Advertising

Figure 3 for We Know What You Want: An Advertising Strategy Recommender System for Online Advertising

Figure 4 for We Know What You Want: An Advertising Strategy Recommender System for Online Advertising

Abstract:Advertising expenditures have become the major source of revenue for e-commerce platforms. Providing good advertising experiences for advertisers through reducing their costs of trial and error for discovering the optimal advertising strategies is crucial for the long-term prosperity of online advertising. To achieve this goal, the advertising platform needs to identify the advertisers' marketing objectives, and then recommend the corresponding strategies to fulfill this objective. In this work, we first deploy a prototype of strategy recommender system on Taobao display advertising platform, recommending bid prices and targeted users to advertisers. We further augment this prototype system by directly revealing the advertising performance, and then infer the advertisers' marketing objectives through their adoptions of different recommending advertising performance. We use the techniques from context bandit to jointly learn the advertisers' marketing objectives and the recommending strategies. Online evaluations show that the designed advertising strategy recommender system can optimize the advertisers' advertising performance and increase the platform's revenue. Simulation experiments based on Taobao online bidding data show that the designed contextual bandit algorithm can effectively optimize the strategy adoption rate of advertisers.

* KDD 2021, Virtual Event, Singapore
* Accepted by KDD 2021

Via

Access Paper or Ask Questions