Abstract:Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: https://ibm.biz/wikicontradict.
Abstract:Spinal cord stimulation (SCS) is a therapeutic approach used for the management of chronic pain. It involves the delivery of electrical impulses to the spinal cord via an implanted device, which when given suitable stimulus parameters can mask or block pain signals. Selection of optimal stimulation parameters usually happens in the clinic under the care of a provider whereas at-home SCS optimization is managed by the patient. In this paper, we propose a recommender system for the management of pain in chronic pain patients undergoing SCS. In particular, we use a contextual multi-armed bandit (CMAB) approach to develop a system that recommends SCS settings to patients with the aim of improving their condition. These recommendations, sent directly to patients though a digital health ecosystem, combined with a patient monitoring system closes the therapeutic loop around a chronic pain patient over their entire patient journey. We evaluated the system in a cohort of SCS-implanted ENVISION study subjects (Clinicaltrials.gov ID: NCT03240588) using a combination of quality of life metrics and Patient States (PS), a novel measure of holistic outcomes. SCS recommendations provided statistically significant improvement in clinical outcomes (pain and/or QoL) in 85\% of all subjects (N=21). Among subjects in moderate PS (N=7) prior to receiving recommendations, 100\% showed statistically significant improvements and 5/7 had improved PS dwell time. This analysis suggests SCS patients may benefit from SCS recommendations, resulting in additional clinical improvement on top of benefits already received from SCS therapy.
Abstract:Kalman filter is a key tool for time-series forecasting and analysis. We show that the dependence of a prediction of Kalman filter on the past is decaying exponentially, whenever the process noise is non-degenerate. Therefore, Kalman filter may be approximated by regression on a few recent observations. Surprisingly, we also show that having some process noise is essential for the exponential decay. With no process noise, it may happen that the forecast depends on all of the past uniformly, which makes forecasting more difficult. Based on this insight, we devise an on-line algorithm for improper learning of a linear dynamical system (LDS), which considers only a few most recent observations. We use our decay results to provide the first regret bounds w.r.t. to Kalman filters within learning an LDS. That is, we compare the results of our algorithm to the best, in hindsight, Kalman filter for a given signal. Also, the algorithm is practical: its per-update run-time is linear in the regression depth.
Abstract:We consider a setting, where the output of a linear dynamical system (LDS) is, with an unknown but fixed probability, replaced by noise. There, we present a robust method for the prediction of the outputs of the LDS and identification of the samples of noise, and prove guarantees on its statistical performance. One application lies in anomaly detection: the samples of noise, unlikely to have been generated by the dynamics, can be flagged to operators of the system for further study.
Abstract:This paper describes a new algorithm for solar energy forecasting from a sequence of Cloud Optical Depth (COD) images. The algorithm is based on the following simple observation: the dynamics of clouds represented by COD images resembles the motion (transport) of a density in a fluid flow. This suggests that, to forecast the motion of COD images, it is sufficient to forecast the flow. The latter, in turn, can be accomplished by fitting a parametric model of the fluid flow to the COD images observed in the past. Namely, the learning phase of the algorithm is composed of the following steps: (i) given a sequence of COD images, the snapshots of the optical flow are estimated from two consecutive COD images; (ii) these snapshots are then assimilated into a Navier-Stokes Equation (NSE), i.e. an initial velocity field for NSE is selected so that the corresponding NSE' solution is as close as possible to the optical flow snapshots. The prediction phase consists of utilizing a linear transport equation, which describes the propagation of COD images in the fluid flow predicted by NSE, to estimate the future motion of the COD images. The algorithm has been tested on COD images provided by two geostationary operational environmental satellites from NOAA serving the west-hemisphere.