Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Robert Tibshirani

BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

Jan 14, 2025

Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Austin Wolfgang Katzer, Collin Chiu(+6 more)

Abstract:The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally. On average, our models achieve state-of-the-art performance across 40 tasks - spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology - excelling in zero-shot classification with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.

Via

Access Paper or Ask Questions

Semiparametric conformal prediction

Nov 04, 2024

Ji Won Park, Robert Tibshirani, Kyunghyun Cho

Figure 1 for Semiparametric conformal prediction

Figure 2 for Semiparametric conformal prediction

Figure 3 for Semiparametric conformal prediction

Figure 4 for Semiparametric conformal prediction

Abstract:Many risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables, for which the prediction algorithm may report correlated non-conformity scores. In this work, we treat the scores as random vectors and aim to construct the prediction set accounting for their joint correlation structure. Drawing from the rich literature on multivariate quantiles and semiparametric statistics, we propose an algorithm to estimate the $1-\alpha$ quantile of the scores, where $\alpha$ is the user-specified miscoverage rate. In particular, we flexibly estimate the joint cumulative distribution function (CDF) of the scores using nonparametric vine copulas and improve the asymptotic efficiency of the quantile estimate using its influence function. The vine decomposition allows our method to scale well to a large number of targets. We report desired coverage and competitive efficiency on a range of real-world regression problems, including those with missing-at-random labels in the calibration set.

* 9 pages (+12 appendix), 11 figures, submitted to AISTATS 2025

Via

Access Paper or Ask Questions

MMIL: A novel algorithm for disease associated cell type discovery

Jun 12, 2024

Erin Craig, Timothy Keyes, Jolanda Sarno, Maxim Zaslavsky, Garry Nolan, Kara Davis, Trevor Hastie, Robert Tibshirani

Figure 1 for MMIL: A novel algorithm for disease associated cell type discovery

Figure 2 for MMIL: A novel algorithm for disease associated cell type discovery

Figure 3 for MMIL: A novel algorithm for disease associated cell type discovery

Figure 4 for MMIL: A novel algorithm for disease associated cell type discovery

Abstract:Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease. To address this, we introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation maximization method that enables the training and calibration of cell-level classifiers using patient-level labels. Our approach can be used to train e.g. lasso logistic regression models, gradient boosted trees, and neural networks. When applied to clinically-annotated, primary patient samples in Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL), our method accurately identifies cancer cells, generalizes across tissues and treatment timepoints, and selects biologically relevant features. In addition, MMIL is capable of incorporating cell labels into model training when they are known, providing a powerful framework for leveraging both labeled and unlabeled data simultaneously. Mixture Modeling for MIL offers a novel approach for cell classification, with significant potential to advance disease understanding and management, especially in scenarios with unknown gold-standard labels and high dimensionality.

* Erin Craig and Timothy Keyes contributed equally to this work

Via

Access Paper or Ask Questions

Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank

Apr 26, 2024

Thomas Le Menestrel, Erin Craig, Robert Tibshirani, Trevor Hastie, Manuel Rivas

Abstract:Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals, underscoring a critical gap in genetic research. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data. We evaluate the performance of Group-LASSO INTERaction-NET (glinternet) and pretrained lasso in disease prediction focusing on diverse ancestries in the UK Biobank. Models were trained on data from White British and other ancestries and validated across a cohort of over 96,000 individuals for 8 diseases. Out of 96 models trained, we report 16 with statistically significant incremental predictive performance in terms of ROC-AUC scores. These findings suggest that advanced statistical methods that borrow information across multiple ancestries may improve disease risk prediction, but with limited benefit.

Via

Access Paper or Ask Questions

FastCPH: Efficient Survival Analysis for Neural Networks

Aug 21, 2022

Xuelin Yang, Louis Abraham, Sejin Kim, Petr Smirnov, Feng Ruan, Benjamin Haibe-Kains, Robert Tibshirani

Figure 1 for FastCPH: Efficient Survival Analysis for Neural Networks

Figure 2 for FastCPH: Efficient Survival Analysis for Neural Networks

Figure 3 for FastCPH: Efficient Survival Analysis for Neural Networks

Figure 4 for FastCPH: Efficient Survival Analysis for Neural Networks

Abstract:The Cox proportional hazards model is a canonical method in survival analysis for prediction of the life expectancy of a patient given clinical or genetic covariates -- it is a linear model in its original form. In recent years, several methods have been proposed to generalize the Cox model to neural networks, but none of these are both numerically correct and computationally efficient. We propose FastCPH, a new method that runs in linear time and supports both the standard Breslow and Efron methods for tied events. We also demonstrate the performance of FastCPH combined with LassoNet, a neural network that provides interpretability through feature sparsity, on survival datasets. The final procedure is efficient, selects useful covariates and outperforms existing CoxPH approaches.

Via

Access Paper or Ask Questions

Confidence intervals for the Cox model test error from cross-validation

Jan 26, 2022

Min Woo Sun, Robert Tibshirani

Abstract:Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model, but its behavior is not yet fully understood. It has been shown that standard confidence intervals for test error using estimates from CV may have coverage below nominal levels. This phenomenon occurs because each sample is used in both the training and testing procedures during CV and as a result, the CV estimates of the errors become correlated. Without accounting for this correlation, the estimate of the variance is smaller than it should be. One way to mitigate this issue is by estimating the mean squared error of the prediction error instead using nested CV. This approach has been shown to achieve superior coverage compared to intervals derived from standard CV. In this work, we generalize the nested CV idea to the Cox proportional hazards model and explore various choices of test error for this setting.

Via

Access Paper or Ask Questions

Cooperative learning for multi-view analysis

Jan 06, 2022

Daisy Yi Ding, Balasubramanian Narasimhan, Robert Tibshirani

Figure 1 for Cooperative learning for multi-view analysis

Figure 2 for Cooperative learning for multi-view analysis

Figure 3 for Cooperative learning for multi-view analysis

Figure 4 for Cooperative learning for multi-view analysis

Abstract:We propose a new method for supervised learning with multiple sets of features ("views"). Cooperative learning combines the usual squared error loss of predictions with an "agreement" penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g. lasso, random forests, boosting, neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty. The method can be especially powerful when the different data views share some underlying relationship in their signals that we aim to strengthen, while each view has its idiosyncratic noise that we aim to reduce. We illustrate the effectiveness of our proposed method on simulated and real data examples.

Via

Access Paper or Ask Questions

Cross-validation: what does it estimate and how well does it do it?

Apr 14, 2021

Stephen Bates, Trevor Hastie, Robert Tibshirani

Figure 1 for Cross-validation: what does it estimate and how well does it do it?

Figure 2 for Cross-validation: what does it estimate and how well does it do it?

Figure 3 for Cross-validation: what does it estimate and how well does it do it?

Figure 4 for Cross-validation: what does it estimate and how well does it do it?

Abstract:Cross-validation is a widely-used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather it estimates the average prediction error of models fit on other unseen training sets drawn from the same population. We further show that this phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, and Mallow's Cp. Next, the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level. Because each data point is used for both training and testing, there are correlations among the measured accuracies for each fold, and so the usual estimate of variance is too small. We introduce a nested cross-validation scheme to estimate this variance more accurately, and show empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail. Lastly, our analysis also shows that when producing confidence intervals for prediction accuracy with simple data splitting, one should not re-fit the model on the combined data, since this invalidates the confidence intervals.

Via

Access Paper or Ask Questions

Feature-weighted elastic net: using "features of features" for better prediction

Jun 02, 2020

J. Kenneth Tay, Nima Aghaeepour, Trevor Hastie, Robert Tibshirani

Figure 1 for Feature-weighted elastic net: using "features of features" for better prediction

Figure 2 for Feature-weighted elastic net: using "features of features" for better prediction

Figure 3 for Feature-weighted elastic net: using "features of features" for better prediction

Figure 4 for Feature-weighted elastic net: using "features of features" for better prediction

Abstract:In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty. In our simulations, fwelnet outperforms the lasso in terms of test mean squared error and usually gives an improvement in true positive rate or false positive rate for feature selection. We also apply this method to early prediction of preeclampsia, where fwelnet outperforms the lasso in terms of 10-fold cross-validated area under the curve (0.86 vs. 0.80). We also provide a connection between fwelnet and the group lasso and suggest how fwelnet might be used for multi-task learning.

Via

Access Paper or Ask Questions

Reluctant generalized additive modeling

Jan 13, 2020

J. Kenneth Tay, Robert Tibshirani

Figure 1 for Reluctant generalized additive modeling

Figure 2 for Reluctant generalized additive modeling

Figure 3 for Reluctant generalized additive modeling

Figure 4 for Reluctant generalized additive modeling

Abstract:Sparse generalized additive models (GAMs) are an extension of sparse generalized linear models which allow a model's prediction to vary non-linearly with an input variable. This enables the data analyst build more accurate models, especially when the linearity assumption is known to be a poor approximation of reality. Motivated by reluctant interaction modeling (Yu et al. 2019), we propose a multi-stage algorithm, called $\textit{reluctant generalized additive modeling (RGAM)}$, that can fit sparse generalized additive models at scale. It is guided by the principle that, if all else is equal, one should prefer a linear feature over a non-linear feature. Unlike existing methods for sparse GAMs, RGAM can be extended easily to binary, count and survival data. We demonstrate the method's effectiveness on real and simulated examples.

* Change of method name, R package now available

Via

Access Paper or Ask Questions