Weighted empirical risk minimization is a common approach to prediction under distribution drift. This article studies its out-of-sample prediction error under nonstationarity. We provide a general decomposition of the excess risk into a learning term and an error term associated with distribution drift, and prove oracle inequalities for the learning error under mixing conditions. The learning bound holds uniformly over arbitrary weight classes and accounts for the effective sample size induced by the weight vector, the complexity of the weight and hypothesis classes, and potential data dependence. We illustrate the applicability and sharpness of our results in (auto-)regression problems with linear models, basis approximations, and neural networks, recovering minimax-optimal rates (up to logarithmic factors) when specialized to unweighted and stationary settings.
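A schematic rendering of the weighted objective and the risk decomposition described above; the notation is ours, not necessarily the paper's, and weights are normalized to sum to one:

```latex
% Weighted ERM and the excess-risk decomposition at a target time T.
% Illustrative notation (not necessarily the paper's): w_i >= 0, \sum_i w_i = 1.
\[
\hat f_w \in \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} w_i \,
  \ell\bigl(f(X_i), Y_i\bigr),
\qquad
R_T(\hat f_w) - \inf_{f \in \mathcal{F}} R_T(f)
  \;\lesssim\;
  \underbrace{\mathrm{err}_{\text{learn}}(w, \mathcal{F})}_{\text{shrinks with } 1/\lVert w \rVert_2^{2}}
  \;+\;
  \underbrace{\mathrm{err}_{\text{drift}}(w)}_{\text{weighted drift to time } T}
\]
```

Here $1/\lVert w \rVert_2^2$ is the standard effective sample size of a normalized weight vector, which the abstract's learning bound accounts for.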
Oblique decision trees combine the transparency of trees with the power of multivariate decision boundaries, but learning high-quality oblique splits is NP-hard, and practical methods still rely on slow search or theory-free heuristics. We present the Hinge Regression Tree (HRT), which reframes each split as a non-linear least-squares problem over two linear predictors whose max/min envelope induces ReLU-like expressive power. The resulting alternating fitting procedure is exactly equivalent to a damped Newton (Gauss-Newton) method within fixed partitions. We analyze this node-level optimization and, for a backtracking line-search variant, prove that the local objective decreases monotonically and converges; in practice, both fixed and adaptive damping yield fast, stable convergence and can be combined with optional ridge regularization. We further prove that HRT's model class is a universal approximator with an explicit $O(\delta^2)$ approximation rate, and show on synthetic and real-world benchmarks that it matches or outperforms single-tree baselines with more compact structures.
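As a concrete illustration of the split-fitting idea, here is a minimal alternating least-squares sketch for one node, assuming the max-envelope model $y \approx \max(Xa, Xb)$; the names are ours, and the paper's Gauss-Newton damping and line search are omitted:

```python
import numpy as np

def fit_hinge_node(X, y, n_iter=50, ridge=1e-6, seed=0):
    """Alternating least-squares sketch of y ~ max(X @ a, X @ b) at one node.
    Illustrative only: the paper's damping and line search are omitted."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    a = 0.01 * rng.normal(size=d)
    b = 0.01 * rng.normal(size=d)
    I = ridge * np.eye(d)
    for _ in range(n_iter):
        # Fix the partition induced by the current max-envelope ...
        on_a = X @ a >= X @ b
        # ... then refit each linear piece by ridge least squares on its side.
        if on_a.sum() >= d:
            a = np.linalg.solve(X[on_a].T @ X[on_a] + I, X[on_a].T @ y[on_a])
        if (~on_a).sum() >= d:
            b = np.linalg.solve(X[~on_a].T @ X[~on_a] + I, X[~on_a].T @ y[~on_a])
    return a, b  # node prediction: np.maximum(X @ a, X @ b)
```

Each iteration fixes the partition induced by the current envelope and refits the two linear pieces, mirroring the fixed-partition equivalence mentioned in the abstract.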
Clustered data, which arise when observations are nested within groups, are ubiquitous in clinical, education, and social science research. Traditionally, a linear mixed model, which includes random effects to account for within-group correlation, would be used to model the observed data and make new predictions on unseen data. Some work has been done to extend the mixed model approach beyond linear regression into more complex and non-parametric models, such as decision trees and random forests. However, existing methods are limited to using the global fixed effects for prediction on data from out-of-sample groups, effectively assuming that all clusters share a common outcome model. We propose a lightweight sum-of-trees model in which we learn a decision tree for each group in the sample. We combine the predictions from these trees using weights so that out-of-sample group predictions are more closely aligned with the most similar groups in the training data. This strategy also allows for inference on the similarity across groups in the outcome prediction model, as the unique tree structures and variable importances for each group can be directly compared. We show our model outperforms traditional decision trees and random forests in a variety of simulation settings. Finally, we showcase our method on real-world data from the sarcoma cohort of The Cancer Genome Atlas, where patient samples are grouped by sarcoma subtype.
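A minimal sketch of the per-group sum-of-trees idea under our own simplifying choices (sklearn trees, and a user-supplied weight dictionary standing in for the learned group similarities):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GroupTreeEnsemble:
    """One tree per training group; predictions blended by group weights.
    A simplified sketch, not the authors' implementation."""

    def fit(self, X, y, groups):
        self.trees_ = {
            g: DecisionTreeRegressor(max_depth=4).fit(X[groups == g], y[groups == g])
            for g in np.unique(groups)
        }
        return self

    def predict(self, X, weights=None):
        # weights: {group: nonnegative weight}; uniform if None, e.g. when no
        # similarity information is available for an out-of-sample group.
        if weights is None:
            weights = {g: 1.0 for g in self.trees_}
        z = sum(weights[g] for g in self.trees_)
        return sum((weights[g] / z) * t.predict(X) for g, t in self.trees_.items())
```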
Deep neural networks (DNNs) often produce overconfident out-of-distribution predictions, motivating Bayesian uncertainty quantification. The Linearized Laplace Approximation (LLA) achieves this by linearizing the DNN and applying Laplace inference to the resulting model. Importantly, the linear model is also used for prediction. We argue that this linearization in the posterior may degrade fidelity to the true Laplace approximation. To alleviate this problem without significantly increasing the computational cost, we propose the Quadratic Laplace Approximation (QLA). QLA approximates each second-order factor in the approximate Laplace log-posterior by a rank-one factor obtained via efficient power iterations. QLA is expected to yield a posterior precision closer to that of the full Laplace approximation without forming the full Hessian, which is typically intractable. For prediction, QLA also uses the linearized model. Empirically, QLA yields modest yet consistent improvements in uncertainty estimation over LLA on five regression datasets.
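The rank-one step can be pictured with plain power iteration on a factor accessed only through matrix-vector products, so the full matrix is never formed; this is an illustrative sketch with our own names, not the paper's implementation:

```python
import numpy as np

def rank_one_factor(matvec, dim, n_iter=20, seed=0):
    """Power iteration returning the leading eigenpair (lam, v) of a PSD
    factor accessed only via matrix-vector products; the rank-one
    approximation is lam * v v^T. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = matvec(v)
        v = w / np.linalg.norm(w)
    lam = v @ matvec(v)
    return lam, v

# e.g. approximating H = A @ A.T without ever materializing H:
A = np.random.default_rng(0).normal(size=(500, 50))
lam, v = rank_one_factor(lambda x: A @ (A.T @ x), dim=500)
```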
Large language models are increasingly trained in continual or open-ended settings, where the total training horizon is not known in advance. Despite this, most existing pretraining recipes are not anytime: they rely on horizon-dependent learning rate schedules and extensive tuning under a fixed compute budget. In this work, we provide a theoretical analysis demonstrating the existence of anytime learning-rate schedules for overparameterized linear regression, and we highlight the central role of weight averaging (also known as model merging) in achieving the minimax convergence rates of stochastic gradient descent. We show that these anytime schedules decay polynomially in time, with the decay rate determined by the source and capacity conditions of the problem. Empirically, we evaluate 150M and 300M parameter language models trained at 1-32x Chinchilla scale, comparing two anytime recipes, constant learning rates with weight averaging and $1/\sqrt{t}$ schedules with weight averaging, against a well-tuned cosine schedule. Across the full training range, the anytime schedules achieve comparable final loss to cosine decay. Taken together, our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning rate schedules for large language model pretraining.
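A toy rendering of the horizon-free recipe, assuming a user-supplied stochastic gradient oracle `grad(w, rng)`: a $1/\sqrt{t}$ step size plus a running iterate average that can be read out at any time:

```python
import numpy as np

def sgd_with_averaging(grad, w0, eta0=0.1, n_steps=10_000, seed=0):
    """1/sqrt(t) step sizes plus a uniform running average of the iterates,
    readable at any step without knowing the training horizon in advance.
    `grad(w, rng)` is an assumed user-supplied stochastic gradient oracle."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    w_avg = w.copy()
    for t in range(1, n_steps + 1):
        w = w - (eta0 / np.sqrt(t)) * grad(w, rng)
        w_avg += (w - w_avg) / t  # uniform running average ("model merging")
    return w_avg
```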
Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long context tasks but has expensive quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute costs and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, its memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic long context tasks and on long context language modeling. OVQ-attention shows significant improvements over linear attention baselines and over the original VQ-attention, which inspired it. It demonstrates competitive, and sometimes identical, performance to strong self-attention baselines up to 64k sequence length, despite using a small fraction of the memory of full self-attention.
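To make the sparse-memory idea concrete, here is a heavily simplified single-token sketch (our construction, not the paper's layer): the key is quantized to its nearest codebook entry, only that code's memory slot is updated, and the query attends over the fixed-size codebook rather than the full history:

```python
import numpy as np

def ovq_attention_step(q, k, v, codebook, mem_v, mem_cnt, tau=1.0):
    """Toy single-token step of a vector-quantized memory (our simplification).
    mem_v: (C, d_v) per-code value memory; mem_cnt: (C,) update counts."""
    c = np.argmin(((codebook - k) ** 2).sum(axis=1))  # nearest-code assignment
    mem_cnt[c] += 1.0
    mem_v[c] += (v - mem_v[c]) / mem_cnt[c]           # sparse running-mean update
    scores = codebook @ q / tau                       # attend over the codebook
    w = np.exp(scores - scores.max())
    w = w * (mem_cnt > 0)                             # only populated slots
    w /= w.sum() + 1e-12
    return w @ mem_v                                  # constant-memory readout
```

The memory cost is fixed by the codebook size, and each token touches a single slot, which is the sparsity the abstract contrasts with the dense state updates of linear attention and SSMs.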
We analyze a lightweight simulation-based inference method that infers simulator parameters using only a regression-based projection of the observed data. After fitting a surrogate linear regression once, the procedure simulates small batches at the proposed parameter values and assigns kernel weights based on the resulting batch-residual discrepancy, producing a self-normalized pseudo-posterior that is simple, parallelizable, and requires access only to the fitted regression coefficients rather than raw observations. We formalize the construction as an importance-sampling approximation to a population target that averages over simulator randomness, prove consistency as the number of parameter draws grows, and establish stability in estimating the surrogate regression from finite samples. We then characterize the asymptotic concentration as the batch size increases and the bandwidth shrinks, showing that the pseudo-posterior concentrates on an identified set determined by the chosen projection, thereby clarifying when the method yields point versus set identification. Experiments on a tractable nonlinear model and on a cosmological calibration task using the DREAMS simulation suite illustrate the computational advantages of regression-based projections and the identifiability limitations arising from low-information summaries.
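A compact sketch of the self-normalized weighting scheme, assuming a user-supplied `simulate(theta, batch, rng)` that returns design and response arrays; the Gaussian kernel and least-squares surrogate here are our illustrative choices:

```python
import numpy as np

def pseudo_posterior_weights(theta_draws, simulate, beta_hat,
                             bandwidth=0.1, batch=8, seed=0):
    """For each parameter draw: simulate a small batch, refit the surrogate
    regression on it, and weight by a kernel on the coefficient discrepancy.
    Returns self-normalized weights. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    w = np.empty(len(theta_draws))
    for i, theta in enumerate(theta_draws):
        Xs, ys = simulate(theta, batch, rng)           # assumed simulator API
        beta_sim, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        d2 = np.sum((beta_sim - beta_hat) ** 2)        # batch-residual discrepancy
        w[i] = np.exp(-0.5 * d2 / bandwidth ** 2)      # Gaussian kernel weight
    return w / w.sum()                                 # self-normalized
```

Note that only the fitted coefficients `beta_hat` enter the weights, matching the abstract's point that raw observations are never needed at inference time.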
Concept Bottleneck Models (CBMs) promote interpretability by grounding predictions in human-understandable concepts. However, existing CBMs typically fix their task predictor to a single linear or Boolean expression, limiting both predictive accuracy and adaptability to diverse user needs. We propose Mixture of Concept Bottleneck Experts (M-CBEs), a framework that generalizes existing CBMs along two dimensions: the number of experts and the functional form of each expert, exposing an underexplored region of the design space. We investigate this region by instantiating two novel models: Linear M-CBE, which learns a finite set of linear expressions, and Symbolic M-CBE, which leverages symbolic regression to discover expert functions from data under user-specified operator vocabularies. Empirical evaluation demonstrates that varying the mixture size and functional form provides a robust framework for navigating the accuracy-interpretability trade-off, adapting to different user and task needs.
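One way to picture the Linear M-CBE instantiation is a softmax-gated mixture of linear experts over a concept layer; this PyTorch sketch uses our own architecture choices and is not the authors' code:

```python
import torch
import torch.nn as nn

class LinearMCBE(nn.Module):
    """Concept encoder -> K linear experts blended by a softmax gate, so each
    expert remains an inspectable linear expression over concepts. Sketch only."""
    def __init__(self, d_in, n_concepts, n_experts, n_classes):
        super().__init__()
        self.concepts = nn.Sequential(nn.Linear(d_in, n_concepts), nn.Sigmoid())
        self.experts = nn.ModuleList(
            nn.Linear(n_concepts, n_classes) for _ in range(n_experts))
        self.gate = nn.Linear(n_concepts, n_experts)

    def forward(self, x):
        c = self.concepts(x)                               # human-readable concepts
        g = torch.softmax(self.gate(c), dim=-1)            # per-input expert weights
        y = torch.stack([e(c) for e in self.experts], -1)  # (B, classes, K)
        return (y * g.unsqueeze(1)).sum(-1)                # gated mixture
```

The Symbolic M-CBE variant would replace each `nn.Linear` expert with a symbolic-regression expression; the gating structure stays the same.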
Federated learning enables institutions to train predictive models collaboratively without sharing raw data, addressing privacy and regulatory constraints. In the standard horizontal setting, clients hold disjoint cohorts of individuals and collaborate to learn a shared predictor. Most existing methods, however, assume that all clients measure the same features. We study the more realistic setting of covariate mismatch, where each client observes a different subset of features, a setting that typically arises in multicenter collaborations with no prior agreement on data collection. We formalize learning a linear predictor under client-wise missing-completely-at-random (MCAR) feature patterns and develop two modular approaches tailored to the dimensional regime and communication budget. In the low-dimensional setting, we propose a plug-in estimator that approximates the oracle linear predictor by aggregating sufficient statistics to estimate the covariance and cross-moment terms. In higher dimensions, we study an impute-then-regress strategy: (i) impute missing covariates using any exchangeability-preserving imputation procedure, and (ii) fit a ridge-regularized linear model on the completed data. We provide asymptotic and finite-sample learning rates for our predictors, explicitly characterizing how they depend on the global dimension, the client-specific feature partition, and the distribution of samples across sites.
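A sketch of the low-dimensional plug-in estimator under our reading of the abstract: each client contributes second-moment blocks only for the features it observes, the server averages entrywise (valid under MCAR), and then solves for the linear predictor:

```python
import numpy as np

def plugin_federated_predictor(clients, d, ridge=1e-8):
    """clients: iterable of (X, y, feats), where feats indexes the columns of
    the global feature vector that this client observes and X is centered.
    Entrywise-averaged moments are our illustrative aggregation rule."""
    S = np.zeros((d, d)); S_cnt = np.zeros((d, d))
    c = np.zeros(d); c_cnt = np.zeros(d)
    for X, y, feats in clients:
        ix = np.ix_(feats, feats)
        S[ix] += X.T @ X                     # observed second-moment block
        S_cnt[ix] += len(y)
        c[feats] += X.T @ y                  # observed cross-moment entries
        c_cnt[feats] += len(y)
    Sigma = S / np.maximum(S_cnt, 1)         # entrywise averages across clients
    xy = c / np.maximum(c_cnt, 1)
    return np.linalg.solve(Sigma + ridge * np.eye(d), xy)
```

Only the aggregated statistics `S`, `c`, and the counts cross the network, which is what keeps the raw data on each client.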
Can modifying the training data distribution guide optimizers toward solutions with improved generalization when training large language models (LLMs)? In this work, we theoretically analyze an in-context linear regression model with multi-head linear self-attention, and compare the training dynamics of two gradient-based optimizers, namely gradient descent (GD) and sharpness-aware minimization (SAM), the latter of which exhibits superior generalization properties but is prohibitively expensive for training even medium-sized LLMs. We show, for the first time, that SAM induces a lower simplicity bias (SB), the tendency of an optimizer to preferentially learn simpler features earlier in training, and we identify this reduction as a key factor underlying its improved generalization performance. Motivated by this insight, we demonstrate that altering the training data distribution by upsampling or augmenting examples learned later in training similarly reduces SB and leads to improved generalization. Our extensive experiments show that our strategy improves the performance of multiple LLMs, including Phi2-2.7B, Llama3.2-1B, Gemma3-1B-PT, and Qwen3-0.6B-Base, achieving relative accuracy gains of up to 18% when fine-tuned with AdamW and Muon on mathematical reasoning tasks.
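A toy version of the data-distribution intervention, with our own proxy for "learned later in training" (the first step at which an example's loss halves); the paper's actual scoring may differ:

```python
import numpy as np

def upsample_late_learned(losses_over_time, example_ids, factor=2, frac=0.25,
                          seed=0):
    """Replicate the latest-learned fraction of examples for the next epoch.
    losses_over_time: (T, n) per-example losses across T training steps.
    'Learning time' is a simple proxy here; the paper may score differently."""
    rng = np.random.default_rng(seed)
    example_ids = np.asarray(example_ids)
    T, n = losses_over_time.shape
    learned = losses_over_time < 0.5 * losses_over_time[0]
    t_learn = np.where(learned.any(axis=0), learned.argmax(axis=0), T)
    k = max(1, int(frac * n))
    late = example_ids[np.argsort(t_learn)[-k:]]          # learned last
    boosted = np.concatenate([example_ids] + [late] * (factor - 1))
    rng.shuffle(boosted)
    return boosted  # index set defining the reweighted training distribution
```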