Stanford University
Abstract:Optimal control problems with state distribution constraints have attracted interest for their expressivity, but solutions rely on linear approximations. We approach the problem of driving the state of a dynamical system in distribution from a sequential decision-making perspective. We formulate the optimal control problem as an appropriate Markov decision process (MDP), where the actions correspond to the state-feedback control policies. We then solve the MDP using Monte Carlo tree search (MCTS). This renders our method suitable for any dynamics model. A key component of our approach is a novel, easy to compute, distance metric in the distribution space that allows our algorithm to guide the distribution of the state. We experimentally test our algorithm under both linear and nonlinear dynamics.
Abstract:Public AI benchmark results are widely broadcast by model developers as indicators of model quality within a growing and competitive market. However, these advertised scores do not necessarily reflect the traits of interest to those who will ultimately apply AI models. In this paper, we seek to understand if and how AI benchmarks are used to inform decision-making. Based on the analyses of interviews with 19 individuals who have used, or decided against using, benchmarks in their day-to-day work, we find that across these settings, participants use benchmarks as a signal of relative performance difference between models. However, whether this signal was considered a definitive sign of model superiority, sufficient for downstream decisions, varied. In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. By contrast, in both product and policy, benchmarks -- even those developed internally for specific tasks -- were often found to be inadequate for informing substantive decisions. Of the benchmarks deemed unsatisfactory, respondents reported that their goals were neither well-defined nor reflective of real-world use. Based on the study results, we conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals. They must capture diverse, task-relevant capabilities, be challenging enough to avoid quick saturation, and account for trade-offs in model performance rather than relying on a single score. Additionally, proprietary data collection and contamination prevention are critical for producing reliable and actionable results. By adhering to these criteria, benchmarks can move beyond mere marketing tricks into robust evaluative frameworks.
Abstract:Estimating the probability of failure is a critical step in developing safety-critical autonomous systems. Direct estimation methods such as Monte Carlo sampling are often impractical due to the rarity of failures in these systems. Existing importance sampling approaches do not scale to sequential decision-making systems with large state spaces and long horizons. We propose an adaptive importance sampling algorithm to address these limitations. Our method minimizes the forward Kullback-Leibler divergence between a state-dependent proposal distribution and a relaxed form of the optimal importance sampling distribution. Our method uses Markov score ascent methods to estimate this objective. We evaluate our approach on four sequential systems and show that it provides more accurate failure probability estimates than baseline Monte Carlo and importance sampling techniques. This work is open sourced.
Abstract:AI models are increasingly prevalent in high-stakes environments, necessitating thorough assessment of their capabilities and risks. Benchmarks are popular for measuring these attributes and for comparing model performance, tracking progress, and identifying weaknesses in foundation and non-foundation models. They can inform model selection for downstream tasks and influence policy initiatives. However, not all benchmarks are the same: their quality depends on their design and usability. In this paper, we develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it. We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues. We further find that most benchmarks do not report statistical significance of their results nor allow for their results to be easily replicated. To support benchmark developers in aligning with best practices, we provide a checklist for minimum quality assurance based on our assessment. We also develop a living repository of benchmark assessments to support benchmark comparability, accessible at betterbench.stanford.edu.
Abstract:This paper addresses the problem of task planning for robots that must comply with operational manuals in real-world settings. Task planning under these constraints is essential for enabling autonomous robot operation in domains that require adherence to domain-specific knowledge. Current methods for generating robot goals and plans rely on common sense knowledge encoded in large language models. However, these models lack grounding of robot plans to domain-specific knowledge and are not easily transferable between multiple sites or customers with different compliance needs. In this work, we present SayComply, which enables grounding robotic task planning with operational compliance using retrieval-based language models. We design a hierarchical database of operational, environment, and robot embodiment manuals and procedures to enable efficient retrieval of the relevant context under the limited context length of the LLMs. We then design a task planner using a tree-based retrieval augmented generation (RAG) technique to generate robot tasks that follow user instructions while simultaneously complying with the domain knowledge in the database. We demonstrate the benefits of our approach through simulations and hardware experiments in real-world scenarios that require precise context retrieval across various types of context, outperforming the standard RAG method. Our approach bridges the gap in deploying robots that consistently adhere to operational protocols, offering a scalable and edge-deployable solution for ensuring compliance across varied and complex real-world environments. Project website: saycomply.github.io.
Abstract:While working within the spatial domain can pose problems associated with ill-conditioned scores caused by power-law decay, recent advances in diffusion-based generative models have shown that transitioning to the wavelet domain offers a promising alternative. However, within the wavelet domain, we encounter unique challenges, especially the sparse representation of high-frequency coefficients, which deviates significantly from the Gaussian assumptions in the diffusion process. To this end, we propose a multi-scale generative modeling in the wavelet domain that employs distinct strategies for handling low and high-frequency bands. In the wavelet domain, we apply score-based generative modeling with well-conditioned scores for low-frequency bands, while utilizing a multi-scale generative adversarial learning for high-frequency bands. As supported by the theoretical analysis and experimental results, our model significantly improve performance and reduce the number of trainable parameters, sampling steps, and time.
Abstract:Deciding on appropriate mechanical ventilator management strategies significantly impacts the health outcomes for patients with respiratory diseases. Acute Respiratory Distress Syndrome (ARDS) is one such disease that requires careful ventilator operation to be effectively treated. In this work, we frame the management of ventilators for patients with ARDS as a sequential decision making problem using the Markov decision process framework. We implement and compare controllers based on clinical guidelines contained in the ARDSnet protocol, optimal control theory, and learned latent dynamics represented as neural networks. The Pulse Physiology Engine's respiratory dynamics simulator is used to establish a repeatable benchmark, gather simulated data, and quantitatively compare these controllers. We score performance in terms of measured improvement in established ARDS health markers (pertaining to improved respiratory rate, oxygenation, and vital signs). Our results demonstrate that techniques leveraging neural networks and optimal control can automatically discover effective ventilation management strategies without access to explicit ventilator management procedures or guidelines (such as those defined in the ARDSnet protocol).
Abstract:To achieve autonomy in complex real-world exploration missions, we consider deployment strategies for a team of robots with heterogeneous autonomy capabilities. In this work, we formulate a multi-robot exploration mission and compute an operation policy to maintain robot team productivity and maximize mission rewards. The environment description, robot capability, and mission outcome are modeled as a Markov decision process (MDP). We also include constraints in real-world operation, such as sensor failures, limited communication coverage, and mobility-stressing elements. Then, we study the proposed operation model on a real-world scenario in the context of the DARPA Subterranean (SubT) Challenge. The computed deployment policy is also compared against the human-based operation strategy in the final competition of the SubT Challenge. Finally, using the proposed model, we discuss the design trade-off on building a multi-robot team with heterogeneous capabilities.
Abstract:The transfer of patients between two aircraft using an underway watercraft increases medical evacuation reach and flexibility in maritime environments. The selection of any one of multiple underway watercraft for patient exchange is complicated by participating aircraft utilization history and a participating watercraft position and velocity. The selection problem is modeled as a semi-Markov decision process with an action space including both fixed land and moving watercraft exchange points. Monte Carlo tree search with root parallelization is used to select optimal exchange points and determine aircraft dispatch times. Model parameters are varied in simulation to identify representative scenarios where watercraft exchange points reduce incident response times. We find that an optimal policy with watercraft exchange points outperforms an optimal policy without watercraft exchange points and a greedy policy by 35% and 40%, respectively. In partnership with the United States Army, we deploy for the first time the watercraft exchange point by executing a mock patient transfer with a manikin between two HH-60M medical evacuation helicopters and an underway Army Logistic Support Vessel south of the Hawaiian island of Oahu. Both helicopters were dispatched in accordance with our optimized decision strategy.
Abstract:Testing controllers in safety-critical systems is vital for ensuring their safety and preventing failures. In this paper, we address the falsification problem within learning-based closed-loop control systems through simulation. This problem involves the identification of counterexamples that violate system safety requirements and can be formulated as an optimization task based on these requirements. Using full-fidelity simulator data in this optimization problem can be computationally expensive. To improve efficiency, we propose a multi-fidelity Bayesian optimization falsification framework that harnesses simulators with varying levels of accuracy. Our proposed framework can transition between different simulators and establish meaningful relationships between them. Through multi-fidelity Bayesian optimization, we determine both the optimal system input likely to be a counterexample and the appropriate fidelity level for assessment. We evaluated our approach across various Gym environments, each featuring different levels of fidelity. Our experiments demonstrate that multi-fidelity Bayesian optimization is more computationally efficient than full-fidelity Bayesian optimization and other baseline methods in detecting counterexamples. A Python implementation of the algorithm is available at https://github.com/SAILRIT/MFBO_Falsification.