Abstract:The input of almost every machine learning algorithm targeting the properties of matter at the atomic scale involves a transformation of the list of Cartesian atomic coordinates into a more symmetric representation. Many of these most popular representations can be seen as an expansion of the symmetrized correlations of the atom density, and differ mainly by the choice of basis. Here we discuss how to build an adaptive, optimal numerical basis that is chosen to represent most efficiently the structural diversity of the dataset at hand. For each training dataset, this optimal basis is unique, and can be computed at no additional cost with respect to the primitive basis by approximating it with splines. We demonstrate that this construction yields representations that are accurate and computationally efficient, presenting examples that involve both molecular and condensed-phase machine-learning models.