Teemu Roos

VIBE: Vector Index Benchmark for Embeddings

May 23, 2025

Elias Jääsaari, Ville Hyvönen, Matteo Ceccarello, Teemu Roos, Martin Aumüller

Abstract: Approximate nearest neighbor (ANN) search is a performance-critical component of many machine learning pipelines. Rigorous benchmarking is essential for evaluating the performance of vector indexes for ANN search. However, the datasets of the existing benchmarks are no longer representative of the current applications of ANN search. Hence, there is an urgent need for an up-to-date set of benchmarks. To this end, we introduce Vector Index Benchmark for Embeddings (VIBE), an open source project for benchmarking ANN algorithms. VIBE contains a pipeline for creating benchmark datasets using dense embedding models characteristic of modern applications, such as retrieval-augmented generation (RAG). To replicate real-world workloads, we also include out-of-distribution (OOD) datasets where the queries and the corpus are drawn from different distributions. We use VIBE to conduct a comprehensive evaluation of SOTA vector indexes, benchmarking 21 implementations on 12 in-distribution and 6 out-of-distribution datasets.

* 25 pages

LoRANN: Low-Rank Matrix Factorization for Approximate Nearest Neighbor Search

Oct 24, 2024

Elias Jääsaari, Ville Hyvönen, Teemu Roos

Abstract: Approximate nearest neighbor (ANN) search is a key component in many modern machine learning pipelines; recent use cases include retrieval-augmented generation (RAG) and vector databases. Clustering-based ANN algorithms that use score computation methods based on product quantization (PQ) are often used in industrial-scale applications due to their scalability and suitability for distributed and disk-based implementations. However, they have slower query times than the leading graph-based ANN algorithms. In this work, we propose a new supervised score computation method based on the observation that inner product approximation is a multivariate (multi-output) regression problem that can be solved efficiently by reduced-rank regression. Our experiments show that on modern high-dimensional data sets, the proposed reduced-rank regression (RRR) method is superior to PQ in both query latency and memory usage. We also introduce LoRANN, a clustering-based ANN library that leverages the proposed score computation method. LoRANN is competitive with the leading graph-based algorithms and outperforms the state-of-the-art GPU ANN methods on high-dimensional data sets.

* Accepted to NeurIPS 2024
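
The regression view of inner product approximation described in the abstract above can be illustrated with a minimal NumPy sketch. It is not the LoRANN implementation: the function names `fit_reduced_rank` and `approximate_scores`, the random Gaussian data, and the chosen sizes are all assumptions made for illustration. The sketch uses the textbook reduced-rank regression solution with identity weighting: fit ordinary least squares, then project the coefficients onto the leading right singular vectors of the fitted values.

```python
import numpy as np


def fit_reduced_rank(X, Y, rank):
    """Fit Y ~ (X @ A) @ B with A of shape (d, rank) and B of shape (rank, m)."""
    # Step 1: ordinary least squares coefficients, shape (d, m).
    W_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    # Step 2: leading right singular vectors of the fitted values X @ W_ols.
    _, _, Vt = np.linalg.svd(X @ W_ols, full_matrices=False)
    B = Vt[:rank]        # (rank, m)
    A = W_ols @ B.T      # (d, rank)
    return A, B


def approximate_scores(queries, A, B):
    """Approximate the inner products between queries and the m cluster points."""
    return (queries @ A) @ B


# Hypothetical data: one cluster of m points and n training queries in d dimensions.
rng = np.random.default_rng(0)
d, m, n = 32, 200, 1000
cluster_points = rng.normal(size=(m, d))
train_queries = rng.normal(size=(n, d))
exact = train_queries @ cluster_points.T  # regression targets: exact inner products

for rank in (4, 16, 32):
    A, B = fit_reduced_rank(train_queries, exact, rank)
    err = np.linalg.norm(exact - approximate_scores(train_queries, A, B))
    print(rank, err / np.linalg.norm(exact))
```

With rank r, scoring one query against the m points in this sketch costs roughly d·r + r·m multiplications instead of d·m, which is where a low-rank factorization can save time when r is much smaller than d.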

Regional Ocean Forecasting with Hierarchical Graph Neural Networks

Oct 15, 2024

Daniel Holmberg, Emanuela Clementi, Teemu Roos

Abstract: Accurate ocean forecasting systems are vital for understanding marine dynamics, which play a crucial role in environmental management and climate adaptation strategies. Traditional numerical solvers, while effective, are computationally expensive and time-consuming. Recent advancements in machine learning have revolutionized weather forecasting, offering fast and energy-efficient alternatives. Building on these advancements, we introduce SeaCast, a neural network designed for high-resolution, medium-range ocean forecasting. SeaCast employs a graph-based framework to effectively handle the complex geometry of ocean grids and integrates external forcing data tailored to the regional ocean context. Our approach is validated through experiments at a high spatial resolution using the operational numerical model of the Mediterranean Sea provided by the Copernicus Marine Service, along with both numerical and data-driven atmospheric forcings.

* 28 pages, 35 figures. Accepted to the Tackling Climate Change with Machine Learning workshop at NeurIPS 2024

Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures

Aug 27, 2024

Tomi Silander, Janne Leppä-aho, Elias Jääsaari, Teemu Roos

Abstract: We introduce an information-theoretic criterion for Bayesian network structure learning which we call quotient normalized maximum likelihood (qNML). In contrast to the closely related factorized normalized maximum likelihood criterion, qNML satisfies the property of score equivalence. It is also decomposable and completely free of adjustable hyperparameters. For practical computations, we identify a remarkably accurate approximation proposed earlier by Szpankowski and Weinberger. Experiments on both simulated and real data demonstrate that the new criterion leads to parsimonious models with good predictive accuracy.

* PMLR 84:948-957, 2018
* Accepted to AISTATS 2018

Modeling 3D Infant Kinetics Using Adaptive Graph Convolutional Networks

Feb 22, 2024

Daniel Holmberg, Manu Airaksinen, Viviana Marchi, Andrea Guzzetta, Anna Kivi, Leena Haataja, Sampsa Vanhatalo, Teemu Roos

Abstract: Reliable methods for the neurodevelopmental assessment of infants are essential for early detection of medical issues that may need prompt interventions. Spontaneous motor activity, or 'kinetics', is shown to provide a powerful surrogate measure of upcoming neurodevelopment. However, its assessment is by and large qualitative and subjective, focusing on visually identified, age-specific gestures. Here, we follow an alternative approach, predicting infants' neurodevelopmental maturation based on data-driven evaluation of individual motor patterns. We utilize 3D video recordings of infants processed with pose-estimation to extract spatio-temporal series of anatomical landmarks, and apply adaptive graph convolutional networks to predict the actual age. We show that our data-driven approach achieves improvement over traditional machine learning baselines based on manually engineered features.

* 10 pages, 3 figures. Code repository available via https://github.com/deinal/infant-aagcn

Graph Representation of the Magnetic Field Topology in High-Fidelity Plasma Simulations for Machine Learning Applications

Jul 26, 2023

Ioanna Bouri, Fanni Franssila, Markku Alho, Giulia Cozzani, Ivan Zaitsev, Minna Palmroth, Teemu Roos

Abstract: Topological analysis of the magnetic field in simulated plasmas allows the study of various physical phenomena in a wide range of settings. One such application is magnetic reconnection, a phenomenon related to the dynamics of the magnetic field topology, which is difficult to detect and characterize in three dimensions. We propose a scalable pipeline for topological data analysis and spatiotemporal graph representation of three-dimensional magnetic vector fields. We demonstrate our methods on simulations of the Earth's magnetosphere produced by Vlasiator, a supercomputer-scale Vlasov theory-based simulation for near-Earth space. The purpose of this work is to challenge the machine learning community to explore graph-based machine learning approaches to address a largely open scientific problem with wide-ranging potential impact.

* 6 pages, 3 figures, Accepted at the ICML 2023 Workshop on Machine Learning for Astrophysics

Transfer Learning with Ensembles of Deep Neural Networks for Skin Cancer Classification in Imbalanced Data Sets

Mar 30, 2021

Aqsa Saeed Qureshi, Teemu Roos

Abstract: Early diagnosis plays a key role in prevention and treatment of skin cancer. Several machine learning techniques for accurate classification of skin cancer from medical images have been reported. Many of these techniques are based on pre-trained convolutional neural networks (CNNs), which enable training the models based on limited amounts of training data. However, the classification accuracy of these models still tends to be severely limited by the scarcity of representative images from malignant tumours. We propose a novel ensemble-based CNN architecture where multiple CNN models, some of which are pre-trained and some are trained only on the data at hand, along with patient information (meta-data) are combined using a meta-learner. The proposed approach improves the model's ability to handle scarce, imbalanced data. We demonstrate the benefits of the proposed technique using a dataset with 33126 dermoscopic images from 2000 patients. We evaluate the performance of the proposed technique in terms of the F1-measure, area under the ROC curve (AUC-ROC), and area under the PR curve (AUC-PR), and compare it with that of seven different benchmark methods, including two recent CNN-based techniques. The proposed technique achieves superior performance in terms of all the evaluation metrics (F1-measure 0.53, AUC-PR 0.58, AUC-ROC 0.97).

* There was a coding mistake and the results need to be reevaluated

Gradient-Based Training and Pruning of Radial Basis Function Networks with an Application in Materials Physics

Apr 06, 2020

Jussi Määttä, Viacheslav Bazaliy, Jyri Kimari, Flyura Djurabekova, Kai Nordlund, Teemu Roos

Abstract: Many applications, especially in physics and other sciences, call for easily interpretable and robust machine learning techniques. We propose a fully gradient-based technique for training radial basis function networks with an efficient and scalable open-source implementation. We derive novel closed-form optimization criteria for pruning the models for continuous as well as binary data which arise in a challenging real-world material physics problem. The pruned models are optimized to provide compact and interpretable versions of larger models based on informed assumptions about the data distribution. Visualizations of the pruned models provide insight into the atomic configurations that determine atom-level migration processes in solid matter; these results may inform future research on designing more suitable descriptors for use with machine learning algorithms.

Supervised Learning Approach to Approximate Nearest Neighbor Search

Oct 18, 2019

Ville Hyvönen, Elias Jääsaari, Teemu Roos

Abstract: Approximate nearest neighbor search is a classic algorithmic problem where the goal is to design an efficient index structure for fast approximate nearest neighbor queries. We show that it can be framed as a classification problem and solved by training a suitable multi-label classifier and using it as an index. Compared to the existing algorithms, this supervised learning approach has several advantages: it enables adapting an index to the query distribution when the query distribution and the corpus distribution differ; it allows using training sets larger than the corpus; and in principle it enables using any multi-label classifier for approximate nearest neighbor search. We demonstrate these advantages on multiple synthetic and real-world data sets by using a random forest and an ensemble of random projection trees as the base classifiers.
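
The classification framing in the abstract above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' method: instead of a random forest or random projection trees, it swaps in a much simpler partition classifier that assigns each query to its nearest randomly chosen anchor point. The class name `VotingIndex`, the helper `exact_knn`, and all sizes are hypothetical. Each training query is labeled with the corpus indices of its true k nearest neighbors, each cell of the partition accumulates label votes, and a new query is answered by re-ranking the most-voted corpus points of its cell.

```python
import numpy as np


def exact_knn(corpus, queries, k):
    """Brute-force k nearest neighbors (squared Euclidean distance)."""
    d2 = ((queries[:, None, :] - corpus[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(d2, axis=1)[:, :k]


class VotingIndex:
    """Toy index: a partition classifier that predicts likely neighbor labels."""

    def __init__(self, corpus, train_queries, k, n_cells, rng):
        self.corpus = corpus
        # Assumption: cells are defined by randomly chosen training queries.
        picks = rng.choice(len(train_queries), size=n_cells, replace=False)
        self.anchors = train_queries[picks]
        # Multi-label targets: indices of the true k nearest corpus points.
        labels = exact_knn(corpus, train_queries, k)
        cells = self._cell(train_queries)
        # votes[c, j] counts how often corpus point j was a true neighbor
        # of a training query that fell into cell c.
        self.votes = np.zeros((n_cells, len(corpus)))
        for cell, nn in zip(cells, labels):
            self.votes[cell, nn] += 1

    def _cell(self, queries):
        return exact_knn(self.anchors, queries, 1)[:, 0]

    def query(self, q, k, n_candidates):
        cell = self._cell(q[None, :])[0]
        # Candidate set: corpus points with the most votes in this cell.
        cand = np.argsort(-self.votes[cell])[:n_candidates]
        # Re-rank the candidates by exact distance.
        d2 = ((self.corpus[cand] - q) ** 2).sum(axis=1)
        return cand[np.argsort(d2)[:k]]


rng = np.random.default_rng(0)
corpus = rng.normal(size=(500, 8))
train_queries = rng.normal(size=(1000, 8))
index = VotingIndex(corpus, train_queries, k=10, n_cells=20, rng=rng)

test_queries = rng.normal(size=(50, 8))
truth = exact_knn(corpus, test_queries, 10)
hits = sum(len(set(index.query(q, 10, 100)) & set(t)) for q, t in zip(test_queries, truth))
print("recall@10:", hits / truth.size)
```

Because the votes come from training queries rather than from the corpus itself, the candidate sets adapt to wherever the queries actually fall, which is the property the abstract highlights for the case where query and corpus distributions differ.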

Minimum Description Length Revisited

Aug 21, 2019

Peter Grünwald, Teemu Roos

Abstract: This is an up-to-date introduction to and overview of the Minimum Description Length (MDL) Principle, a theory of inductive inference that can be applied to general problems in statistics, machine learning and pattern recognition. While MDL was originally based on data compression ideas, this introduction can be read without any knowledge thereof. It takes into account all major developments since 2007, the last time an extensive overview was written. These include new methods for model selection and averaging and hypothesis testing, as well as the first completely general definition of MDL estimators. Incorporating these developments, MDL can be seen as a powerful extension of both penalized likelihood and Bayesian approaches, in which penalization functions and prior distributions are replaced by more general luckiness functions, average-case methodology is replaced by a more robust worst-case approach, and in which methods classically viewed as highly distinct, such as AIC vs BIC and cross-validation vs Bayes can, to a large extent, be viewed from a unified perspective.

* under submission
