Abstract:Current and future large scale structure surveys aim to constrain the neutrino mass and the equation of state of dark energy. We aim to construct accurate and interpretable symbolic approximations to the linear and nonlinear matter power spectra as a function of cosmological parameters in extended $\Lambda$CDM models which contain massive neutrinos and non-constant equations of state for dark energy. This constitutes an extension of the syren-halofit emulators to incorporate these two effects, which we call syren-new (SYmbolic-Regression-ENhanced power spectrum emulator with NEutrinos and $W_0-w_a$). We also obtain a simple approximation to the derived parameter $\sigma_8$ as a function of the cosmological parameters for these models. Our results for the linear power spectrum are designed to emulate CLASS, whereas for the nonlinear case we aim to match the results of EuclidEmulator2. We compare our results to existing emulators and $N$-body simulations. Our analytic emulators for $\sigma_8$, the linear and nonlinear power spectra achieve root mean squared errors of 0.1%, 0.3% and 1.3%, respectively, across a wide range of cosmological parameters, redshifts and wavenumbers. We verify that emulator-related discrepancies are subdominant compared to observational errors and other modelling uncertainties when computing shear power spectra for LSST-like surveys. Our expressions have similar accuracy to existing (numerical) emulators, but are at least an order of magnitude faster, both on a CPU and GPU. Our work greatly improves the accuracy, speed and range of applicability of current symbolic approximations to the linear and nonlinear matter power spectra. We provide publicly available code for all symbolic approximations found.
Abstract:Physics, as a fundamental science, aims to understand the laws of Nature and describe them in mathematical equations. While the physical reality manifests itself in a wide range of phenomena with varying levels of complexity, the equations that describe them display certain statistical regularities and patterns, which we begin to explore here. By drawing inspiration from linguistics, where Zipf's law states that the frequency of any word in a large corpus of text is roughly inversely proportional to its rank in the frequency table, we investigate whether similar patterns for the distribution of operators emerge in the equations of physics. We analyse three corpora of formulae and find, using sophisticated implicit-likelihood methods, that the frequency of operators as a function of their rank in the frequency table is best described by an exponential law with a stable exponent, in contrast with Zipf's inverse power-law. Understanding the underlying reasons behind this statistical pattern may shed light on Nature's modus operandi or reveal recurrent patterns in physicists' attempts to formalise the laws of Nature. It may also provide crucial input for symbolic regression, potentially augmenting language models to generate symbolic models for physical phenomena. By pioneering the study of statistical regularities in the equations of physics, our results open the door for a meta-law of Nature, a (probabilistic) law that all physical laws obey.
Abstract:We analyse the search behaviour of genetic programming for symbolic regression in practically relevant but limited settings, allowing exhaustive enumeration of all solutions. This enables us to quantify the success probability of finding the best possible expressions, and to compare the search efficiency of genetic programming to random search in the space of semantically unique expressions. This analysis is made possible by improved algorithms for equality saturation, which we use to improve the Exhaustive Symbolic Regression algorithm; this produces the set of semantically unique expression structures, orders of magnitude smaller than the full symbolic regression search space. We compare the efficiency of random search in the set of unique expressions and genetic programming. For our experiments we use two real-world datasets where symbolic regression has been used to produce well-fitting univariate expressions: the Nikuradse dataset of flow in rough pipes and the Radial Acceleration Relation of galaxy dynamics. The results show that genetic programming in such limited settings explores only a small fraction of all unique expressions, and evaluates expressions repeatedly that are congruent to already visited expressions.
Abstract:Rapid and accurate evaluation of the nonlinear matter power spectrum, $P(k)$, as a function of cosmological parameters and redshift is of fundamental importance in cosmology. Analytic approximations provide an interpretable solution, yet current approximations are neither fast nor accurate relative to black-box numerical emulators. We use symbolic regression to obtain simple analytic approximations to the nonlinear scale, $k_\sigma$, the effective spectral index, $n_{\rm eff}$, and the curvature, $C$, which are required for the halofit model. We then re-optimise the coefficients of halofit to fit a wide range of cosmologies and redshifts. We then again exploit symbolic regression to explore the space of analytic expressions to fit the residuals between $P(k)$ and the optimised predictions of halofit. All methods are validated against $N$-body simulations. Our symbolic expressions for $k_\sigma$, $n_{\rm eff}$ and $C$ have root mean squared fractional errors of 0.8%, 0.2% and 0.3%, respectively, for redshifts below 3 and a wide range of cosmologies. The re-optimised halofit parameters reduce the root mean squared fractional error from 3% to below 2% for wavenumbers $k=9\times10^{-3}-9 \, h{\rm Mpc^{-1}}$. We introduce syren-halofit (symbolic-regression-enhanced halofit), an extension to halofit containing a short symbolic correction which improves this error to 1%. Our method is 2350 and 3170 times faster than current halofit and hmcode implementations, respectively, and 2680 and 64 times faster than EuclidEmulator2 (which requires running class) and the BACCO emulator. We obtain comparable accuracy to EuclidEmulator2 and the BACCO emulator when tested on $N$-body simulations. Our work greatly increases the speed and accuracy of symbolic approximations to $P(k)$, making them significantly faster than their numerical counterparts without loss of accuracy.
Abstract:Computing the matter power spectrum, $P(k)$, as a function of cosmological parameters can be prohibitively slow in cosmological analyses, hence emulating this calculation is desirable. Previous analytic approximations are insufficiently accurate for modern applications, so black-box, uninterpretable emulators are often used. We utilise an efficient genetic programming based symbolic regression framework to explore the space of potential mathematical expressions which can approximate the power spectrum and $\sigma_8$. We learn the ratio between an existing low-accuracy fitting function for $P(k)$ and that obtained by solving the Boltzmann equations and thus still incorporate the physics which motivated this earlier approximation. We obtain an analytic approximation to the linear power spectrum with a root mean squared fractional error of 0.2% between $k = 9\times10^{-3} - 9 \, h{\rm \, Mpc^{-1}}$ and across a wide range of cosmological parameters, and we provide physical interpretations for various terms in the expression. We also provide a simple analytic approximation for $\sigma_8$ with a similar accuracy, with a root mean squared fractional error of just 0.4% when evaluated across the same range of cosmologies. This function is easily invertible to obtain $A_{\rm s}$ as a function of $\sigma_8$ and the other cosmological parameters, if preferred. It is possible to obtain symbolic approximations to a seemingly complex function at a precision required for current and future cosmological analyses without resorting to deep-learning techniques, thus avoiding their black-box nature and large number of parameters. Our emulator will be usable long after the codes on which numerical approximations are built become outdated.
Abstract:Inflation is a highly favoured theory for the early Universe. It is compatible with current observations of the cosmic microwave background and large scale structure and is a driver in the quest to detect primordial gravitational waves. It is also, given the current quality of the data, highly under-determined with a large number of candidate implementations. We use a new method in symbolic regression to generate all possible simple scalar field potentials for one of two possible basis sets of operators. Treating these as single-field, slow-roll inflationary models we then score them with an information-theoretic metric ("minimum description length") that quantifies their efficiency in compressing the information in the Planck data. We explore two possible priors on the parameter space of potentials, one related to the functions' structural complexity and one that uses a Katz back-off language model to prefer functions that may be theoretically motivated. This enables us to identify the inflaton potentials that optimally balance simplicity with accuracy at explaining the Planck data, which may subsequently find theoretical motivation. Our exploratory study opens the door to extraction of fundamental physics directly from data, and may be augmented with more refined theoretical priors in the quest for a complete understanding of the early Universe.
Abstract:When choosing between competing symbolic models for a data set, a human will naturally prefer the "simpler" expression or the one which more closely resembles equations previously seen in a similar context. This suggests a non-uniform prior on functions, which is, however, rarely considered within a symbolic regression (SR) framework. In this paper we develop methods to incorporate detailed prior information on both functions and their parameters into SR. Our prior on the structure of a function is based on a $n$-gram language model, which is sensitive to the arrangement of operators relative to one another in addition to the frequency of occurrence of each operator. We also develop a formalism based on the Fractional Bayes Factor to treat numerical parameter priors in such a way that models may be fairly compared though the Bayesian evidence, and explicitly compare Bayesian, Minimum Description Length and heuristic methods for model selection. We demonstrate the performance of our priors relative to literature standards on benchmarks and a real-world dataset from the field of cosmology.
Abstract:We apply a new method for learning equations from data -- Exhaustive Symbolic Regression (ESR) -- to late-type galaxy dynamics as encapsulated in the radial acceleration relation (RAR). Relating the centripetal acceleration due to baryons, $g_\text{bar}$, to the total dynamical acceleration, $g_\text{obs}$, the RAR has been claimed to manifest a new law of nature due to its regularity and tightness, in agreement with Modified Newtonian Dynamics (MOND). Fits to this relation have been restricted by prior expectations to particular functional forms, while ESR affords an exhaustive and nearly prior-free search through functional parameter space to identify the equations optimally trading accuracy with simplicity. Working with the SPARC data, we find the best functions typically satisfy $g_\text{obs} \propto g_\text{bar}$ at high $g_\text{bar}$, although the coefficient of proportionality is not clearly unity and the deep-MOND limit $g_\text{obs} \propto \sqrt{g_\text{bar}}$ as $g_\text{bar} \to 0$ is little evident at all. By generating mock data according to MOND with or without the external field effect, we find that symbolic regression would not be expected to identify the generating function or reconstruct successfully the asymptotic slopes. We conclude that the limited dynamical range and significant uncertainties of the SPARC RAR preclude a definitive statement of its functional form, and hence that this data alone can neither demonstrate nor rule out law-like gravitational behaviour.
Abstract:Symbolic Regression (SR) algorithms learn analytic expressions which both accurately fit data and, unlike traditional machine-learning approaches, are highly interpretable. Conventional SR suffers from two fundamental issues which we address in this work. First, since the number of possible equations grows exponentially with complexity, typical SR methods search the space stochastically and hence do not necessarily find the best function. In many cases, the target problems of SR are sufficiently simple that a brute-force approach is not only feasible, but desirable. Second, the criteria used to select the equation which optimally balances accuracy with simplicity have been variable and poorly motivated. To address these issues we introduce a new method for SR -- Exhaustive Symbolic Regression (ESR) -- which systematically and efficiently considers all possible equations and is therefore guaranteed to find not only the true optimum but also a complete function ranking. Utilising the minimum description length principle, we introduce a principled method for combining these preferences into a single objective statistic. To illustrate the power of ESR we apply it to a catalogue of cosmic chronometers and the Pantheon+ sample of supernovae to learn the Hubble rate as a function of redshift, finding $\sim$40 functions (out of 5.2 million considered) that fit the data more economically than the Friedmann equation. These low-redshift data therefore do not necessarily prefer a $\Lambda$CDM expansion history, and traditional SR algorithms that return only the Pareto-front, even if they found this successfully, would not locate $\Lambda$CDM. We make our code and full equation sets publicly available.