Division of Computer Science & Engineering, University of Michigan
Abstract: Off-policy evaluation (OPE) provides safety guarantees by estimating the performance of a policy before deployment. Recent work introduced IS+, an importance sampling (IS) estimator that uses expert-annotated counterfactual samples to improve behavior dataset coverage. However, IS estimators are known to have high variance; furthermore, the performance of IS+ deteriorates when annotations are imperfect. In this work, we propose a family of OPE estimators inspired by the doubly robust (DR) principle. A DR estimator combines IS with a reward model estimate, known as the direct method (DM), and offers favorable statistical guarantees. We propose three strategies for incorporating counterfactual annotations into a DR-inspired estimator and analyze their properties under various realistic settings. We prove that using imperfect annotations in the DM part of the estimator best leverages the annotations, as opposed to using them in the IS part. To support our theoretical findings, we evaluate the proposed estimators in three contextual bandit environments. Our empirical results show that when the reward model is misspecified and the annotations are imperfect, it is most beneficial to use the annotations only in the DM portion of a DR estimator. Based on these theoretical and empirical insights, we provide a practical guide for using counterfactual annotations in different realistic settings.
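For readers who want a concrete handle on the estimator family referenced above, the following is a minimal NumPy sketch of the standard doubly robust form for a contextual bandit. The function and argument names are illustrative (not taken from the paper), and the annotation-augmented reward-model fit that the "annotations in the DM part" strategy refers to is not shown; under that reading, annotated (context, action, reward) tuples would enter only through the data used to fit q_hat, leaving the importance weights untouched.

```python
import numpy as np

def dr_estimate(rewards, actions, pi_e, pi_b, q_hat):
    """Standard doubly robust OPE for a contextual bandit with K discrete actions.

    rewards : (n,) rewards logged under the behavior policy
    actions : (n,) logged action indices
    pi_e    : (n, K) evaluation-policy action probabilities per logged context
    pi_b    : (n, K) behavior-policy action probabilities per logged context
    q_hat   : (n, K) reward-model (direct method) predictions per logged context
    """
    idx = np.arange(len(rewards))
    dm_term = (pi_e * q_hat).sum(axis=1)               # direct-method term
    w = pi_e[idx, actions] / pi_b[idx, actions]         # importance weights
    correction = w * (rewards - q_hat[idx, actions])    # IS correction term
    return float(np.mean(dm_term + correction))
```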
Abstract: A collection of the accepted Findings papers that were presented at the 3rd Machine Learning for Health symposium (ML4H 2023), which was held on December 10, 2023, in New Orleans, Louisiana, USA. ML4H 2023 invited high-quality submissions on relevant problems in a variety of health-related disciplines including healthcare, biomedicine, and public health. Two submission tracks were offered: the archival Proceedings track and the non-archival Findings track. The Proceedings track targeted mature work with strong technical sophistication and high impact on health. The Findings track looked for new ideas that could spark insightful discussion, serve as valuable resources for the community, or enable new collaborations. Submissions to the Proceedings track, if not accepted, were automatically considered for the Findings track. All manuscripts submitted to the ML4H Symposium underwent a double-blind peer-review process.
Abstract: In applying reinforcement learning (RL) to high-stakes domains, quantitative and qualitative evaluation using observational data can help practitioners understand the generalization performance of new policies. However, this type of off-policy evaluation (OPE) is inherently limited, since offline data may not reflect the distribution shifts resulting from the application of new policies. On the other hand, online evaluation by collecting rollouts according to the new policy is often infeasible, as deploying new policies in these domains can be unsafe. In this work, we propose a semi-offline evaluation framework as an intermediate step between offline and online evaluation, where human users provide annotations of unobserved counterfactual trajectories. While it is tempting to simply augment existing data with such annotations, we show that this naive approach can lead to biased results. Instead, we design a new family of OPE estimators, based on importance sampling (IS) and a novel weighting scheme, that incorporate counterfactual annotations without introducing additional bias. We analyze the theoretical properties of our approach, showing its potential to reduce both bias and variance compared to standard IS estimators. Our analyses reveal important practical considerations for handling biased, noisy, or missing annotations. In a series of proof-of-concept experiments involving bandits and a healthcare-inspired simulator, we demonstrate that our approach outperforms purely offline IS estimators and is robust to imperfect annotations. Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of RL in high-stakes domains.
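To illustrate the starting point and the pitfall the abstract warns against, here is a single-step (bandit) sketch of ordinary importance sampling together with the naive augmentation that simply pools annotated counterfactual samples with the logged data. The function and argument names are assumptions for this illustration; the paper's corrected weighting scheme, which avoids the bias of the naive pooling, is not reproduced here.

```python
import numpy as np

def is_estimate(rewards, actions, pi_e, pi_b):
    """Ordinary importance sampling for a contextual bandit.
    pi_e, pi_b : (n, K) action probabilities under the evaluation / behavior policy."""
    idx = np.arange(len(rewards))
    w = pi_e[idx, actions] / pi_b[idx, actions]
    return float(np.mean(w * rewards))

def naive_augmented_is(rewards, actions, pi_e, pi_b,
                       ann_rewards, ann_actions, ann_pi_e, ann_pi_b):
    """Pool logged samples with annotated counterfactual samples and reuse
    ordinary IS, pretending the annotations were generated by the behavior
    policy. The abstract notes this naive augmentation can be biased."""
    r = np.concatenate([rewards, ann_rewards])
    a = np.concatenate([actions, ann_actions])
    pe = np.vstack([pi_e, ann_pi_e])
    pb = np.vstack([pi_b, ann_pi_b])
    return is_estimate(r, a, pe, pb)
```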
Abstract: Off-policy evaluation (OPE) aims to estimate the benefit of following a counterfactual sequence of actions, given data collected from executed sequences. However, existing OPE estimators often exhibit high bias and high variance in problems involving large, combinatorial action spaces. We investigate how to mitigate this issue using factored action spaces, i.e., expressing each action as a combination of independent sub-actions from smaller action spaces. This approach facilitates a finer-grained analysis of how actions differ in their effects. In this work, we propose a new family of "decomposed" importance sampling (IS) estimators based on factored action spaces. Given certain assumptions on the underlying problem structure, we prove that the decomposed IS estimators have less variance than their original non-decomposed versions, while preserving the property of zero bias. Through simulations, we empirically verify our theoretical results, probing the validity of various assumptions. Provided with a technique that can derive the action space factorisation for a given problem, our work shows that OPE can be improved "for free" by utilising this inherent problem structure.
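As a rough single-step illustration of the decomposition idea, the sketch below contrasts the joint importance weight (a product of per-sub-action ratios when both policies factorise) with a per-dimension decomposed estimate. The assumptions used here, fully factorised policies and additively decomposed rewards with each per-dimension component observed, are simplifications for the illustration; the paper's decomposed estimators and their exact assumptions are stated there.

```python
import numpy as np

def joint_is(rewards, sub_actions, pi_e_subs, pi_b_subs):
    """Ordinary IS with the joint weight, written as a product of
    per-sub-action ratios (valid when both policies fully factorise)."""
    idx = np.arange(len(rewards))
    w = np.ones(len(rewards))
    for d, (pe, pb) in enumerate(zip(pi_e_subs, pi_b_subs)):
        w *= pe[idx, sub_actions[:, d]] / pb[idx, sub_actions[:, d]]
    return float(np.mean(w * rewards))

def decomposed_is(sub_rewards, sub_actions, pi_e_subs, pi_b_subs):
    """Decomposed IS sketch: one marginal weight per sub-action dimension,
    assuming the reward is additive across dimensions and each
    per-dimension component sub_rewards[:, d] is observed."""
    n, D = sub_actions.shape
    idx = np.arange(n)
    total = 0.0
    for d in range(D):
        w_d = pi_e_subs[d][idx, sub_actions[:, d]] / pi_b_subs[d][idx, sub_actions[:, d]]
        total += float(np.mean(w_d * sub_rewards[:, d]))
    return total
```

Under these assumptions each per-dimension term is unbiased on its own, and avoiding the product of all D ratios is what reduces variance relative to the joint weight.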
Abstract: Many reinforcement learning (RL) applications have combinatorial action spaces, where each action is a composition of sub-actions. A standard RL approach ignores this inherent factorization structure, resulting in a potential failure to make meaningful inferences about rarely observed sub-action combinations; this is particularly problematic for offline settings, where data may be limited. In this work, we propose a form of linear Q-function decomposition induced by factored action spaces. We study the theoretical properties of our approach, identifying scenarios where it is guaranteed to lead to zero bias when used to approximate the Q-function. Outside the regimes with theoretical guarantees, we show that our approach can still be useful because it leads to better sample efficiency without necessarily sacrificing policy optimality, allowing us to achieve a better bias-variance trade-off. Across several offline RL problems using simulators and real-world datasets motivated by healthcare, we demonstrate that incorporating factored action spaces into value-based RL can result in better-performing policies. Our approach can help an agent make more accurate inferences within underexplored regions of the state-action space when applying RL to observational datasets.
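The core object here is a Q-function that decomposes additively over sub-actions, Q(s, a) = Q_1(s, a_1) + ... + Q_D(s, a_D). Below is a minimal tabular Q-learning sketch of that form; the class, learning-rate, and update details are illustrative choices rather than the paper's implementation, which uses value-based offline RL with function approximation.

```python
import numpy as np

class FactoredQ:
    """Tabular sketch of a linearly decomposed Q-function:
    Q(s, a) = sum_d Q_d(s, a_d), with a = (a_1, ..., a_D)."""
    def __init__(self, n_states, sub_action_sizes, lr=0.1, gamma=0.99):
        self.tables = [np.zeros((n_states, k)) for k in sub_action_sizes]
        self.lr, self.gamma = lr, gamma

    def q(self, s, a):
        return sum(t[s, a_d] for t, a_d in zip(self.tables, a))

    def greedy(self, s):
        # The additive form lets each sub-action be maximized independently.
        return tuple(int(np.argmax(t[s])) for t in self.tables)

    def update(self, s, a, r, s_next, done):
        # One Q-learning step; the TD error's gradient w.r.t. each component
        # table entry is 1, so every component receives the same correction.
        target = r if done else r + self.gamma * self.q(s_next, self.greedy(s_next))
        delta = target - self.q(s, a)
        for t, a_d in zip(self.tables, a):
            t[s, a_d] += self.lr * delta
```

A practical consequence of the additive form, visible in greedy(), is that the maximization over the combinatorial action space reduces to D independent maximizations over the smaller sub-action spaces.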
Abstract: A collection of the extended abstracts that were presented at the 2nd Machine Learning for Health symposium (ML4H 2022), which was held both virtually and in person on November 28, 2022, in New Orleans, Louisiana, USA. Machine Learning for Health (ML4H) is a longstanding venue for research into machine learning for health, including both theoretical and applied work. ML4H 2022 featured two submission tracks: a proceedings track, which encompassed full-length submissions of technically mature and rigorous work, and an extended abstract track, which accepted less mature but innovative research for discussion. All manuscripts submitted to the ML4H Symposium underwent a double-blind peer-review process. Extended abstracts included in this collection describe innovative machine learning research focused on relevant problems in health and biomedicine.
Abstract: Modern decision-making systems, from robots to web recommendation engines, are expected to adapt: to user preferences, changing circumstances, or even new tasks. Yet, it is still uncommon to deploy a dynamically learning agent (rather than a fixed policy) to a production system, as doing so is perceived as unsafe. Using historical data to reason about learning algorithms, similar to offline policy evaluation (OPE) applied to fixed policies, could help practitioners evaluate and ultimately deploy such adaptive agents to production. In this work, we formalize offline learner simulation (OLS) for reinforcement learning (RL) and propose a novel evaluation protocol that measures both the fidelity and the efficiency of the simulation. For environments with complex high-dimensional observations, we propose a semi-parametric approach that leverages recent advances in latent state discovery in order to achieve accurate and efficient offline simulations. In preliminary experiments, we show the advantage of our approach compared to fully non-parametric baselines. The code to reproduce these experiments will be made available at https://github.com/microsoft/rl-offline-simulation.
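To make the notion of simulating a learner from logged data more concrete, here is a sketch of a fully non-parametric "rejection replay" baseline of the kind the abstract compares against: logged transitions are streamed to the learning agent only when the agent's own action matches the logged one. The learner interface (act, observe) is hypothetical, and the paper's semi-parametric approach based on latent state discovery is not shown.

```python
import random

def replay_simulation(logged, learner, seed=0):
    """Non-parametric replay sketch of offline learner simulation.
    logged : list of (state, action, reward, next_state) tuples
    learner: object with hypothetical methods act(state) -> action and
             observe(state, action, reward, next_state)."""
    rng = random.Random(seed)
    rng.shuffle(logged)
    accepted = 0
    for s, a, r, s_next in logged:
        if learner.act(s) == a:              # agent would have taken the logged action
            learner.observe(s, a, r, s_next) # so this transition is a valid experience
            accepted += 1
    return accepted  # number of simulated interactions (a crude efficiency measure)
```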
Abstract: Reinforcement learning (RL) can be used to learn treatment policies and aid decision making in healthcare. However, given the need for generalization over complex state/action spaces, the incorporation of function approximators (e.g., deep neural networks) requires model selection to reduce overfitting and improve policy performance at deployment. Yet a standard validation pipeline for model selection requires running a learned policy in the actual environment, which is often infeasible in a healthcare setting. In this work, we investigate a model selection pipeline for offline RL that relies on off-policy evaluation (OPE) as a proxy for validation performance. We present an in-depth analysis of popular OPE methods, highlighting the additional hyperparameters and computational requirements (fitting/inference of auxiliary models) when used to rank a set of candidate policies. We compare the utility of different OPE methods as part of the model selection pipeline in the context of learning to treat patients with sepsis. Among all the OPE methods we considered, fitted Q evaluation (FQE) consistently leads to the best validation ranking, but at a high computational cost. To balance this trade-off between accuracy of ranking and computational efficiency, we propose a simple two-stage approach to accelerate model selection by avoiding potentially unnecessary computation. Our work serves as a practical guide for offline RL model selection and can help RL practitioners select policies using real-world datasets. To facilitate reproducibility and future extensions, the code accompanying this paper is available online at https://github.com/MLD3/OfflineRL_ModelSelection.
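Since fitted Q evaluation (FQE) is the method the abstract highlights, here is a minimal sketch of FQE for a deterministic discrete-action policy: repeatedly regress Bellman targets onto state-action features, then read off the value of the policy's actions. The regressor choice, feature construction, and the use of all logged states as start states are assumptions of this sketch, not details from the paper or its code release.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # illustrative regressor choice

def fqe_value(transitions, policy, n_actions, gamma=0.99, n_iters=50):
    """Minimal FQE sketch.
    transitions: list of (state_vec, action, reward, next_state_vec, done)
    policy     : callable mapping a state vector to a discrete action index."""
    S = np.array([t[0] for t in transitions], dtype=float)
    A = np.array([t[1] for t in transitions], dtype=int)
    R = np.array([t[2] for t in transitions], dtype=float)
    S2 = np.array([t[3] for t in transitions], dtype=float)
    done = np.array([t[4] for t in transitions], dtype=float)

    def featurize(states, actions):
        return np.hstack([states, np.eye(n_actions)[actions]])  # state + one-hot action

    pi_next = np.array([policy(s) for s in S2])       # evaluated policy's next action
    targets = R.copy()
    for _ in range(n_iters):
        model = GradientBoostingRegressor().fit(featurize(S, A), targets)
        q_next = model.predict(featurize(S2, pi_next))
        targets = R + gamma * (1.0 - done) * q_next   # Bellman backup under the policy
    # Report the mean predicted value of the policy's action at the logged states,
    # treating them as start states for this sketch.
    pi_S = np.array([policy(s) for s in S])
    return float(model.predict(featurize(S, pi_S)).mean())
```

In a model-selection pipeline, an estimate like this would be computed per candidate policy and the candidates ranked by the resulting values, which is where the auxiliary-model fitting cost noted in the abstract comes from.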
Abstract: Standard reinforcement learning (RL) aims to find an optimal policy that identifies the best action for each state. However, in healthcare settings, many actions may be near-equivalent with respect to the reward (e.g., survival). We consider an alternative objective -- learning set-valued policies to capture near-equivalent actions that lead to similar cumulative rewards. We propose a model-free algorithm based on temporal difference learning and a near-greedy heuristic for action selection. We analyze the theoretical properties of the proposed algorithm, providing optimality guarantees, and demonstrate our approach on simulated environments and a real clinical task. Empirically, the proposed algorithm exhibits good convergence properties and discovers meaningful near-equivalent actions. Our work provides theoretical, as well as practical, foundations for clinician/human-in-the-loop decision making, in which humans (e.g., clinicians, patients) can incorporate additional knowledge (e.g., side effects, patient preference) when selecting among near-equivalent actions.
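To give a feel for what a set-valued, near-greedy policy returns, the helper below keeps every action whose estimated Q-value is within a slack of the best action's value. The particular thresholding rule and the parameter name zeta are illustrative assumptions; the paper defines its near-greedy heuristic and the associated optimality guarantees precisely.

```python
import numpy as np

def near_greedy_set(q_values, zeta=0.05):
    """Return the indices of near-equivalent actions for one state.
    q_values : (K,) estimated Q-values for each action in this state
    zeta     : slack; actions within zeta * |max Q| of the best are kept
               (an illustrative rule, not necessarily the paper's)."""
    q = np.asarray(q_values, dtype=float)
    q_best = q.max()
    threshold = q_best - zeta * abs(q_best)
    return set(np.flatnonzero(q >= threshold).tolist())
```

A clinician-in-the-loop workflow would then choose among the returned set using knowledge outside the reward signal, such as side effects or patient preference.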
Abstract: Recurrent neural networks (RNNs) are commonly applied to clinical time-series data with the goal of learning patient risk stratification models. Their effectiveness is due, in part, to their use of parameter sharing over time (i.e., cells are repeated, hence the name recurrent). We hypothesize, however, that this trait also contributes to the increased difficulty such models have with learning relationships that change over time. Conditional shift, i.e., changes in the relationship between the input X and the output y, arises if the risk factors for the event of interest change over the course of a patient admission. While, in theory, RNNs and gated RNNs (e.g., LSTMs) in particular should be capable of learning time-varying relationships, when training data are limited, such models often fail to accurately capture these dynamics. We illustrate the advantages and disadvantages of complete weight sharing (RNNs) by comparing an LSTM with shared parameters to a sequential architecture with time-varying parameters on three clinically relevant prediction tasks: acute respiratory failure (ARF), shock, and in-hospital mortality. In experiments using synthetic data, we demonstrate how weight sharing in LSTMs leads to worse performance in the presence of conditional shift. To move beyond the dichotomy of complete weight sharing versus no weight sharing, we propose a novel RNN formulation based on a mixture model in which we relax weight sharing over time. The proposed method outperforms standard LSTMs and other state-of-the-art baselines across all tasks. In settings with limited data, relaxed weight sharing can lead to improved patient risk stratification performance.
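One way to picture "relaxed weight sharing over time" is a small pool of LSTM cells whose updates are mixed with time-step-specific weights, so early and late time steps can lean on different parameters. The PyTorch sketch below is an illustrative reading of that idea, not necessarily the paper's exact mixture formulation; the class name, component count, and maximum sequence length are assumptions.

```python
import torch
import torch.nn as nn

class MixtureLSTM(nn.Module):
    """Sketch of relaxed weight sharing: K LSTM cells ("experts") whose state
    updates are combined with time-step-specific softmax mixture weights."""
    def __init__(self, input_size, hidden_size, max_len, n_components=3):
        super().__init__()
        self.cells = nn.ModuleList(
            nn.LSTMCell(input_size, hidden_size) for _ in range(n_components))
        # One unnormalized mixture weight per (time step, component);
        # sequences are assumed to be at most max_len steps long.
        self.mix_logits = nn.Parameter(torch.zeros(max_len, n_components))
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                      # x: (batch, time, input_size)
        b, T, _ = x.shape
        h = x.new_zeros(b, self.cells[0].hidden_size)
        c = x.new_zeros(b, self.cells[0].hidden_size)
        for t in range(T):
            w = torch.softmax(self.mix_logits[t], dim=-1)         # (K,)
            outs = [cell(x[:, t], (h, c)) for cell in self.cells]  # per-expert updates
            h = sum(w[k] * outs[k][0] for k in range(len(outs)))
            c = sum(w[k] * outs[k][1] for k in range(len(outs)))
        return self.head(h)                    # risk score from the final hidden state
```

Setting n_components=1 recovers complete weight sharing (a standard LSTM), while letting the per-time-step mixture weights vary gives the model room to capture conditional shift when data permit.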