Abstract:Large Language Models (LLMs) have shown impressive capabilities in natural language processing, yet their use in sensitive domains like healthcare, particularly with Electronic Health Records (EHR), faces significant challenges due to privacy concerns and limited computational resources. This paper presents a compact LLM framework designed for local deployment in settings with strict privacy requirements and limited access to high-performance GPUs. We introduce a novel preprocessing technique that uses information extraction methods, e.g., regular expressions, to filter and emphasize critical information in clinical notes, enhancing the performance of smaller LLMs on EHR data. Our framework is evaluated using zero-shot and few-shot learning paradigms on both private and publicly available (MIMIC-IV) datasets, and we also compare its performance with fine-tuned LLMs on the MIMIC-IV dataset. The results demonstrate that our preprocessing approach significantly boosts the prediction accuracy of smaller LLMs, making them suitable for high-privacy, resource-constrained applications. This study offers valuable insights into optimizing LLM performance for sensitive, data-intensive tasks while addressing computational and privacy limitations.
Abstract:We conducted an experiment during the review process of the 2023 International Conference on Machine Learning (ICML) that requested authors with multiple submissions to rank their own papers based on perceived quality. We received 1,342 rankings, each from a distinct author, pertaining to 2,592 submissions. In this paper, we present an empirical analysis of how author-provided rankings could be leveraged to improve peer review processes at machine learning conferences. We focus on the Isotonic Mechanism, which calibrates raw review scores using author-provided rankings. Our analysis demonstrates that the ranking-calibrated scores outperform raw scores in estimating the ground truth ``expected review scores'' in both squared and absolute error metrics. Moreover, we propose several cautious, low-risk approaches to using the Isotonic Mechanism and author-provided rankings in peer review processes, including assisting senior area chairs' oversight of area chairs' recommendations, supporting the selection of paper awards, and guiding the recruitment of emergency reviewers. We conclude the paper by addressing the study's limitations and proposing future research directions.
Abstract:Recent advances in multi-modal algorithms have driven and been driven by the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology image. For example, an image might cover an extensive tissue area containing cancerous and healthy regions, but the accompanying text might only specify that this image is a cancer slide, lacking the nuanced details needed for in-depth analysis. In this study, we introduce STimage-1K4M, a novel dataset designed to bridge this gap by providing genomic features for sub-tile images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics data, which captures gene expression information at the level of individual spatial spots within a pathology image. Specifically, each image in the dataset is broken down into smaller sub-image tiles, with each tile paired with 15,000-30,000 dimensional gene expressions. With 4,293,195 pairs of sub-tile images and gene expressions, STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis an innovative applications in computational pathology, and beyond.
Abstract:In this paper, we critically examine the prevalent practice of using additive mixtures of Mat\'ern kernels in single-output Gaussian process (GP) models and explore the properties of multiplicative mixtures of Mat\'ern kernels for multi-output GP models. For the single-output case, we derive a series of theoretical results showing that the smoothness of a mixture of Mat\'ern kernels is determined by the least smooth component and that a GP with such a kernel is effectively equivalent to the least smooth kernel component. Furthermore, we demonstrate that none of the mixing weights or parameters within individual kernel components are identifiable. We then turn our attention to multi-output GP models and analyze the identifiability of the covariance matrix $A$ in the multiplicative kernel $K(x,y) = AK_0(x,y)$, where $K_0$ is a standard single output kernel such as Mat\'ern. We show that $A$ is identifiable up to a multiplicative constant, suggesting that multiplicative mixtures are well suited for multi-output tasks. Our findings are supported by extensive simulations and real applications for both single- and multi-output settings. This work provides insight into kernel selection and interpretation for GP models, emphasizing the importance of choosing appropriate kernel structures for different tasks.
Abstract:Supervised dimension reduction (SDR) has been a topic of growing interest in data science, as it enables the reduction of high-dimensional covariates while preserving the functional relation with certain response variables of interest. However, existing SDR methods are not suitable for analyzing datasets collected from case-control studies. In this setting, the goal is to learn and exploit the low-dimensional structure unique to or enriched by the case group, also known as the foreground group. While some unsupervised techniques such as the contrastive latent variable model and its variants have been developed for this purpose, they fail to preserve the functional relationship between the dimension-reduced covariates and the response variable. In this paper, we propose a supervised dimension reduction method called contrastive inverse regression (CIR) specifically designed for the contrastive setting. CIR introduces an optimization problem defined on the Stiefel manifold with a non-standard loss function. We prove the convergence of CIR to a local optimum using a gradient descent-based algorithm, and our numerical study empirically demonstrates the improved performance over competing methods for high-dimensional data.
Abstract:Inverse reinforcement learning~(IRL) is a powerful framework to infer an agent's reward function by observing its behavior, but IRL algorithms that learn point estimates of the reward function can be misleading because there may be several functions that describe an agent's behavior equally well. A Bayesian approach to IRL models a distribution over candidate reward functions, alleviating the shortcomings of learning a point estimate. However, several Bayesian IRL algorithms use a $Q$-value function in place of the likelihood function. The resulting posterior is computationally intensive to calculate, has few theoretical guarantees, and the $Q$-value function is often a poor approximation for the likelihood. We introduce kernel density Bayesian IRL (KD-BIRL), which uses conditional kernel density estimation to directly approximate the likelihood, providing an efficient framework that, with a modified reward function parameterization, is applicable to environments with complex and infinite state spaces. We demonstrate KD-BIRL's benefits through a series of experiments in Gridworld environments and a simulated sepsis treatment task.
Abstract:Modern datasets witness high-dimensionality and nontrivial geometries of spaces they live in. It would be helpful in data analysis to reduce the dimensionality while retaining the geometric structure of the dataset. Motivated by this observation, we propose a general dimension reduction method by incorporating geometric information. Our Spherical Rotation Component Analysis (SRCA) is a dimension reduction method that uses spheres or ellipsoids, to approximate a low-dimensional manifold. This method not only generalizes the Spherical Component Analysis (SPCA) method in terms of theories and algorithms and presents a comprehensive comparison of our method, as an optimization problem with theoretical guarantee and also as a structural preserving low-rank representation of data. Results relative to state-of-the-art competitors show considerable gains in ability to accurately approximate the subspace with fewer components and better structural preserving. In addition, we have pointed out that this method is a specific incarnation of a grander idea of using a geometrically induced loss function in dimension reduction tasks.
Abstract:Over a complete Riemannian manifold of finite dimension, Greene and Wu introduced a convolution, known as Greene-Wu (GW) convolution. In this paper, we introduce a reformulation of the GW convolution. Using our reformulation, many properties of the GW convolution can be easily derived, including a new formula for how the curvature of the space would affect the curvature of the function through the GW convolution. Also enabled by our new reformulation, an improved method for gradient estimation over Riemannian manifolds is introduced. Theoretically, our gradient estimation method improves the order of estimation error from $O \left( \left( n + 3 \right)^{3/2} \right)$ to $O \left( n^{3/2} \right)$, where $n$ is the dimension of the manifold. Empirically, our method outperforms the best existing method for gradient estimation over Riemannian manifolds, as evidenced by thorough experimental evaluations.
Abstract:Dimension reduction is useful for exploratory data analysis. In many applications, it is of interest to discover variation that is enriched in a "foreground" dataset relative to a "background" dataset. Recently, contrastive principal component analysis (CPCA) was proposed for this setting. However, the lack of a formal probabilistic model makes it difficult to reason about CPCA and to tune its hyperparameter. In this work, we propose probabilistic contrastive principal component analysis (PCPCA), a model-based alternative to CPCA. We discuss how to set the hyperparameter in theory and in practice, and we show several of PCPCA's advantages, including greater interpretability, uncertainty quantification, robustness to noise and missing data, and the ability to generate data from the model. We demonstrate PCPCA's performance through a series of simulations and experiments with datasets of gene expression, protein expression, and images.
Abstract:Even with the rise in popularity of over-parameterized models, simple dimensionality reduction and clustering methods, such as PCA and k-means, are still routinely used in an amazing variety of settings. A primary reason is the combination of simplicity, interpretability and computational efficiency. The focus of this article is on improving upon PCA and k-means, by allowing non-linear relations in the data and more flexible cluster shapes, without sacrificing the key advantages. The key contribution is a new framework for Principal Elliptical Analysis (PEA), defining a simple and computationally efficient alternative to PCA that fits the best elliptical approximation through the data. We provide theoretical guarantees on the proposed PEA algorithm using Vapnik-Chervonenkis (VC) theory to show strong consistency and uniform concentration bounds. Toy experiments illustrate the performance of PEA, and the ability to adapt to non-linear structure and complex cluster shapes. In a rich variety of real data clustering applications, PEA is shown to do as well as k-means for simple datasets, while dramatically improving performance in more complex settings.