Junyu Cao

Deconfounded Warm-Start Thompson Sampling with Applications to Precision Medicine

May 22, 2025

Prateek Jaiswal, Esmaeil Keyvanshokooh, Junyu Cao

Abstract: Randomized clinical trials often require large patient cohorts before drawing definitive conclusions, yet abundant observational data from parallel studies remains underutilized due to confounding and hidden biases. To bridge this gap, we propose Deconfounded Warm-Start Thompson Sampling (DWTS), a practical approach that leverages a Doubly Debiased LASSO (DDL) procedure to identify a sparse set of reliable measured covariates and combines them with key hidden covariates to form a reduced context. By initializing Thompson Sampling (LinTS) priors with DDL-estimated means and variances on these measured features -- while keeping uninformative priors on hidden features -- DWTS effectively harnesses confounded observational data to kick-start adaptive clinical trials. Evaluated on both a purely synthetic environment and a virtual environment created using a real cardiovascular risk dataset, DWTS consistently achieves lower cumulative regret than standard LinTS, showing how offline causal insights from observational data can improve trial efficiency and support more personalized treatment decisions.
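The warm-start idea described in this abstract can be illustrated with a minimal sketch. It assumes a diagonal Gaussian prior over a single shared coefficient vector. The function names, the `hidden_variance` default, and all numeric values are hypothetical and not taken from the paper. No DDL estimation or posterior update is shown.

```python
import random


def build_warm_start_prior(ddl_means, ddl_variances, n_hidden, hidden_variance=100.0):
    """Build a diagonal Gaussian prior (mean list, variance list).

    Measured features use externally supplied estimates (assumed here to come
    from an offline DDL-style procedure). Hidden features get a zero-mean,
    high-variance (uninformative) prior. The default variance is an assumption.
    """
    if len(ddl_means) != len(ddl_variances):
        raise ValueError("means and variances must have the same length")
    mean = list(ddl_means) + [0.0] * n_hidden
    variance = list(ddl_variances) + [hidden_variance] * n_hidden
    return mean, variance


def thompson_draw(mean, variance, rng):
    """Sample one coefficient vector from the diagonal Gaussian prior."""
    return [rng.gauss(m, v ** 0.5) for m, v in zip(mean, variance)]


def choose_arm(contexts, theta):
    """Pick the arm whose context has the largest inner product with theta."""
    scores = [sum(x * t for x, t in zip(ctx, theta)) for ctx in contexts]
    return max(range(len(contexts)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two measured features with hypothetical offline estimates, one hidden feature.
    mean, variance = build_warm_start_prior([0.5, -0.2], [0.01, 0.04], n_hidden=1)
    theta = thompson_draw(mean, variance, rng)
    arm = choose_arm([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], theta)
    print("sampled theta:", theta, "chosen arm:", arm)
```

A full linear Thompson Sampling loop would also update the posterior after each observed reward. That step is omitted here to keep the focus on how the prior is initialized.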

LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications

Mar 04, 2025

Danqing Zhang, Balaji Rama, Jingyi Ni, Shiying He, Fu Zhao, Kunyu Chen, Arnold Chen, Junyu Cao

Abstract: We introduce LiteWebAgent, an open-source suite for VLM-based web agent applications. Our framework addresses a critical gap in the web agent ecosystem with a production-ready solution that combines minimal serverless backend configuration, intuitive user and browser interfaces, and extensible research capabilities in agent planning, memory, and tree search. For the core LiteWebAgent agent framework, we implemented a simple yet effective baseline using recursive function calling, providing decoupled action generation and action grounding. In addition, we integrate advanced research components such as agent planning, agent workflow memory, and tree search in a modular and extensible manner. We then integrate the LiteWebAgent agent framework with a frontend and backend as deployed systems in two formats: (1) a production Vercel-based web application, which provides users with an agent-controlled remote browser, and (2) a Chrome extension leveraging LiteWebAgent's API to control an existing Chrome browser via CDP (Chrome DevTools Protocol). The LiteWebAgent framework is available at https://github.com/PathOnAI/LiteWebAgent, with deployed frontend at https://lite-web-agent.vercel.app/.

A Conformal Approach to Feature-based Newsvendor under Model Misspecification

Dec 17, 2024

Junyu Cao

Abstract: In many data-driven decision-making problems, performance guarantees often depend heavily on the correctness of model assumptions, which may frequently fail in practice. We address this issue in the context of a feature-based newsvendor problem, where demand is influenced by observed features such as demographics and seasonality. To mitigate the impact of model misspecification, we propose a model-free and distribution-free framework inspired by conformal prediction. Our approach consists of two phases: a training phase, which can utilize any type of prediction method, and a calibration phase that conformalizes the model bias. To enhance predictive performance, we explore the balance between data quality and quantity, recognizing the inherent trade-off: more selective training data improves quality but reduces quantity. Importantly, we provide statistical guarantees for the conformalized critical quantile, independent of the correctness of the underlying model. Moreover, we quantify the confidence interval of the critical quantile, with its width decreasing as data quality and quantity improve. We validate our framework using both simulated data and a real-world dataset from the Capital Bikeshare program in Washington, D.C. Across these experiments, our proposed method consistently outperforms benchmark algorithms, reducing newsvendor loss by up to 40% on the simulated data and 25% on the real-world dataset.
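The two-phase structure in this abstract resembles generic split conformal prediction, which can be sketched as follows. This is not the paper's procedure. It assumes the classical newsvendor critical quantile b / (b + h) for underage cost b and overage cost h, and a standard finite-sample rank correction. All function names and data are hypothetical.

```python
import math


def critical_quantile(underage_cost, overage_cost):
    """Classical newsvendor critical quantile b / (b + h); costs must be positive."""
    return underage_cost / (underage_cost + overage_cost)


def conformal_offset(residuals, level):
    """Empirical residual quantile with the usual (n + 1) rank correction.

    The rank is clamped to n, so the offset is always finite. A strict
    conformal procedure would return infinity when the rank exceeds n.
    """
    n = len(residuals)
    rank = min(n, math.ceil((n + 1) * level))
    return sorted(residuals)[rank - 1]


def order_quantity(predict, features, calibration, underage_cost, overage_cost):
    """Shift the point prediction by a calibrated residual quantile.

    `predict` is any fitted demand model (training phase). `calibration` is a
    held-out list of (features, demand) pairs (calibration phase).
    """
    residuals = [demand - predict(x) for x, demand in calibration]
    level = critical_quantile(underage_cost, overage_cost)
    return predict(features) + conformal_offset(residuals, level)


if __name__ == "__main__":
    # Hypothetical, deliberately misspecified demand model and calibration data.
    model = lambda x: 2 * x
    held_out = [(1, 3), (2, 3), (3, 7), (4, 8)]
    print(order_quantity(model, 5, held_out, underage_cost=3, overage_cost=1))
```

Because the offset is computed from held-out residuals, the adjustment does not rely on the demand model being correctly specified, which is the property the abstract emphasizes.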

HR-Bandit: Human-AI Collaborated Linear Recourse Bandit

Oct 18, 2024

Junyu Cao, Ruijiang Gao, Esmaeil Keyvanshokooh

Abstract: Human doctors frequently recommend actionable recourses that allow patients to modify their conditions to access more effective treatments. Inspired by such healthcare scenarios, we propose the Recourse Linear UCB ($\textsf{RLinUCB}$) algorithm, which optimizes both action selection and feature modifications by balancing exploration and exploitation. We further extend this to the Human-AI Linear Recourse Bandit ($\textsf{HR-Bandit}$), which integrates human expertise to enhance performance. $\textsf{HR-Bandit}$ offers three key guarantees: (i) a warm-start guarantee for improved initial performance, (ii) a human-effort guarantee to minimize required human interactions, and (iii) a robustness guarantee that ensures sublinear regret even when human decisions are suboptimal. Empirical results, including a healthcare case study, validate its superior performance against existing benchmarks.

* 18 pages

A Probabilistic Approach for Alignment with Human Comparisons

Mar 16, 2024

Junyu Cao, Mohsen Bayati

Abstract: A growing trend involves integrating human knowledge into learning frameworks, leveraging subtle human feedback to refine AI models. Despite these advances, no comprehensive theoretical framework describing the specific conditions under which human comparisons improve the traditional supervised fine-tuning process has been developed. To bridge this gap, this paper studies the effective use of human comparisons to address limitations arising from noisy data and high-dimensional models. We propose a two-stage "Supervised Fine Tuning+Human Comparison" (SFT+HC) framework connecting machine learning with human feedback through a probabilistic bisection approach. The two-stage framework first learns low-dimensional representations from noisy-labeled data via an SFT procedure, and then uses human comparisons to improve the model alignment. To examine the efficacy of the alignment phase, we introduce a novel concept termed the "label-noise-to-comparison-accuracy" (LNCA) ratio. This paper theoretically identifies the conditions under which the "SFT+HC" framework outperforms a pure SFT approach, leveraging this ratio to highlight the advantage of incorporating human evaluators in reducing sample complexity. We validate that the proposed conditions for the LNCA ratio are met in a case study conducted via an Amazon Mechanical Turk experiment.

Speed Up the Cold-Start Learning in Two-Sided Bandits with Many Arms

Oct 01, 2022

Mohsen Bayati, Junyu Cao, Wanning Chen

Abstract: Multi-armed bandit (MAB) algorithms are efficient approaches to reduce the opportunity cost of online experimentation and are used by companies to find the best product from periodically refreshed product catalogs. However, these algorithms face the so-called cold-start at the onset of the experiment due to a lack of knowledge of customer preferences for new products, requiring an initial data collection phase known as the burning period. During this period, MAB algorithms operate like randomized experiments, incurring large burning costs which scale with the large number of products. We attempt to reduce the burning by identifying that many products can be cast into two-sided products, and then naturally model the rewards of the products with a matrix, whose rows and columns represent the two sides respectively. Next, we design two-phase bandit algorithms that first use subsampling and low-rank matrix estimation to obtain a substantially smaller targeted set of products and then apply a UCB procedure on the target products to find the best one. We theoretically show that the proposed algorithms lower costs and expedite the experiment in cases when there is limited experimentation time along with a large product set. Our analysis also reveals three regimes of long, short, and ultra-short horizon experiments, depending on dimensions of the matrix. Empirical evidence from both synthetic data and a real-world dataset on music streaming services validates this superior performance.
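The second phase described in this abstract, running a UCB procedure on a reduced target set, can be sketched as below. The abstract does not say which UCB variant is used, so this sketch substitutes the standard UCB1 index. The target set is assumed to be given already, so the subsampling and low-rank estimation phase is not shown. The function name, arm labels, and reward values are hypothetical.

```python
import math


def ucb1_on_target_set(target_arms, pull, horizon):
    """Run UCB1 restricted to `target_arms` for `horizon` rounds.

    `pull(arm)` returns an observed reward in [0, 1]. Returns the arm with
    the highest empirical mean among pulled arms, plus the pull counts.
    """
    if horizon < 1 or not target_arms:
        raise ValueError("need at least one arm and one round")
    counts = {arm: 0 for arm in target_arms}
    totals = {arm: 0.0 for arm in target_arms}

    for t in range(1, horizon + 1):
        untried = [arm for arm in target_arms if counts[arm] == 0]
        if untried:
            # Pull every arm once before using the confidence index.
            arm = untried[0]
        else:
            arm = max(
                target_arms,
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        totals[arm] += pull(arm)
        counts[arm] += 1

    pulled = [arm for arm in target_arms if counts[arm] > 0]
    best = max(pulled, key=lambda a: totals[a] / counts[a])
    return best, counts


if __name__ == "__main__":
    # Hypothetical deterministic rewards for three products in the target set.
    rewards = {"a": 0.1, "b": 0.9, "c": 0.5}
    best, counts = ucb1_on_target_set(list(rewards), rewards.__getitem__, horizon=200)
    print(best, counts)
```

Restricting the index computation to a small target set is what keeps the initial round-robin phase short: its length equals the number of target arms rather than the size of the full catalog.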

Fatigue-aware Bandits for Dependent Click Models

Aug 22, 2020

Junyu Cao, Wei Sun, Zuo-Jun Shen, Markus Ettl

Abstract: As recommender systems send a massive amount of content to keep users engaged, users may experience fatigue caused by (1) overexposure to irrelevant content and (2) boredom from seeing too many similar recommendations. To address this problem, we consider an online learning setting where a platform learns a policy to recommend content that takes user fatigue into account. We propose an extension of the Dependent Click Model (DCM) to describe users' behavior. We stipulate that for each piece of content, its attractiveness to a user depends on its intrinsic relevance and a discount factor which measures how many similar pieces of content have been shown. Users view the recommended content sequentially and click on the ones that they find attractive. Users may leave the platform at any time, and the probability of exiting is higher when they do not like the content. Based on users' feedback, the platform learns the relevance of the underlying content as well as the discounting effect due to content fatigue. We refer to this learning task as the "fatigue-aware DCM Bandit" problem. We consider two learning scenarios depending on whether the discounting effect is known. For each scenario, we propose a learning algorithm which simultaneously explores and exploits, and characterize its regret bound.

* AAAI 2020

Dynamic Learning with Frequent New Product Launches: A Sequential Multinomial Logit Bandit Problem

Apr 29, 2019

Junyu Cao, Wei Sun

Abstract: Motivated by the phenomenon that companies introduce new products to keep abreast of customers' rapidly changing tastes, we consider a novel online learning setting where a profit-maximizing seller needs to learn customers' preferences through offering recommendations, which may contain existing products and new products that are launched in the middle of a selling period. We propose a sequential multinomial logit (SMNL) model to characterize customers' behavior when product recommendations are presented in tiers. For the offline version with known customers' preferences, we propose a polynomial-time algorithm and characterize the properties of the optimal tiered product recommendation. For the online problem, we propose a learning algorithm and quantify its regret bound. Moreover, we extend the setting to incorporate a constraint which ensures every new product is learned to a given accuracy. Our results demonstrate the tier structure can be used to mitigate the risks associated with learning new products.

Dynamic Learning of Sequential Choice Bandit Problem under Marketing Fatigue

Mar 19, 2019

Junyu Cao, Wei Sun

Abstract: Motivated by the observation that overexposure to unwanted marketing activities leads to customer dissatisfaction, we consider a setting where a platform offers a sequence of messages to its users and is penalized when users abandon the platform due to marketing fatigue. We propose a novel sequential choice model to capture multiple interactions taking place between the platform and its user: upon receiving a message, a user decides on one of three actions: accept the message, skip and receive the next message, or abandon the platform. Based on user feedback, the platform dynamically learns users' abandonment distribution and their valuations of messages to determine the length of the sequence and the order of the messages, while maximizing the cumulative payoff over a horizon of length T. We refer to this online learning task as the sequential choice bandit problem. For the offline combinatorial optimization problem, we show that an efficient polynomial-time algorithm exists. For the online problem, we propose an algorithm that balances exploration and exploitation, and characterize its regret bound. Lastly, we demonstrate how to extend the model with user contexts to incorporate personalization.
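The accept / skip / abandon dynamics in this abstract can be made concrete with a simplified expected-payoff calculation. The sketch assumes that each message is accepted with a fixed probability and yields a fixed reward, that acceptance ends the interaction, and that a user who does not accept abandons with a constant probability at a fixed penalty. These modelling choices, the function name, and the numbers are illustrative assumptions, not the paper's model.

```python
def expected_payoff(messages, abandon_prob, abandon_penalty):
    """Expected payoff of showing `messages` in order.

    `messages` is a list of (accept_probability, reward) pairs. After a
    non-accepted message, the user abandons with probability `abandon_prob`
    (costing `abandon_penalty`) and otherwise skips to the next message.
    """
    payoff = 0.0
    still_present = 1.0  # probability the user has reached the current message
    for accept_prob, reward in messages:
        payoff += still_present * accept_prob * reward
        rejected = still_present * (1.0 - accept_prob)
        payoff -= rejected * abandon_prob * abandon_penalty
        still_present = rejected * (1.0 - abandon_prob)
    return payoff


if __name__ == "__main__":
    # Hypothetical two-message sequence evaluated in both orders.
    first, second = (0.5, 10.0), (0.2, 20.0)
    print(expected_payoff([first, second], abandon_prob=0.1, abandon_penalty=4.0))
    print(expected_payoff([second, first], abandon_prob=0.1, abandon_penalty=4.0))
```

Evaluating both orderings shows why the sequence length and message order are decision variables: every extra message adds potential reward but also another chance of a costly abandonment.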
