Andreas Kipf

Lightweight Correlation-Aware Table Compression

Oct 24, 2024

Mihail Stoian, Alexander van Renen, Jan Kobiolka, Ping-Lin Kuo, Josif Grabocka, Andreas Kipf

Abstract: The growing adoption of data lakes for managing relational data necessitates efficient, open storage formats that provide high scan performance and competitive compression ratios. While existing formats achieve fast scans through lightweight encoding techniques, they have reached a plateau in terms of minimizing storage footprint. Recently, correlation-aware compression schemes have been shown to reduce file sizes further. Yet, current approaches either incur significant scan overheads or require manual specification of correlations, limiting their practicality. We present $\texttt{Virtual}$, a framework that integrates seamlessly with existing open formats to automatically leverage data correlations, achieving substantial compression gains while incurring minimal scan overhead. Experiments on data-gov datasets show that $\texttt{Virtual}$ reduces file sizes by up to 40% compared to Apache Parquet.

* Third Table Representation Learning Workshop (TRL @ NeurIPS 2024)
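
To make the idea of correlation-aware compression concrete, here is a minimal sketch, not the $\texttt{Virtual}$ implementation. It assumes two integer columns where one is roughly a linear function of the other; the function names (`fit_line`, `encode_correlated`, `decode_correlated`) and the least-squares fit are illustrative choices that do not come from the abstract. The target column is replaced by a model plus a sparse map of exceptions, and decoding is lossless.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ slope * x + intercept (illustrative choice)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def encode_correlated(reference, target):
    """Replace `target` by a linear model over `reference` plus sparse exceptions."""
    slope, intercept = fit_line(reference, target)
    exceptions = {}
    for row, (x, y) in enumerate(zip(reference, target)):
        predicted = round(slope * x + intercept)
        if predicted != y:
            # Only rows the model gets wrong need extra storage.
            exceptions[row] = y - predicted
    return slope, intercept, exceptions


def decode_correlated(reference, model):
    """Reconstruct the target column exactly from the reference column and the model."""
    slope, intercept, exceptions = model
    return [
        round(slope * x + intercept) + exceptions.get(row, 0)
        for row, x in enumerate(reference)
    ]


reference = [10, 20, 30, 40, 50]
target = [25, 45, 65, 85, 105]  # exactly 2 * reference + 5
model = encode_correlated(reference, target)
restored = decode_correlated(reference, model)
```

When the correlation holds for most rows, the exception map stays small and the target column costs little beyond the model parameters. When it does not hold, the exceptions grow and a format would presumably fall back to storing the column as usual.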

LSI: A Learned Secondary Index Structure

May 11, 2022

Andreas Kipf, Dominik Horn, Pascal Pfeil, Ryan Marcus, Tim Kraska

Abstract: Learned index structures have been shown to achieve favorable lookup performance and space consumption compared to their traditional counterparts such as B-trees. However, most learned index studies have focused on the primary indexing setting, where the base data is sorted. In this work, we investigate whether learned indexes sustain their advantage in the secondary indexing setting. We introduce Learned Secondary Index (LSI), a first attempt to use learned indexes for indexing unsorted data. LSI works by building a learned index over a permutation vector, which allows binary search to be performed on the unsorted base data using random access. We additionally augment LSI with a fingerprint vector to accelerate equality lookups. We show that LSI achieves comparable lookup performance to state-of-the-art secondary indexes while being up to 6x more space efficient.

* Fifth International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM 2022)
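
The permutation-vector idea from the LSI abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the learned model and the fingerprint vector are omitted, so the binary search runs over the whole permutation vector rather than a model-predicted range, and the function names are hypothetical.

```python
def build_permutation(base):
    """Return row ids ordered by key, leaving the base data itself unsorted."""
    return sorted(range(len(base)), key=lambda row: base[row])


def lower_bound(base, perm, key):
    """Binary search over the permutation vector via random access into `base`."""
    lo, hi = 0, len(perm)
    while lo < hi:
        mid = (lo + hi) // 2
        if base[perm[mid]] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo


def lookup(base, perm, key):
    """Return the row ids of all base entries equal to `key`."""
    pos = lower_bound(base, perm, key)
    rows = []
    while pos < len(perm) and base[perm[pos]] == key:
        rows.append(perm[pos])
        pos += 1
    return rows


base = [42, 7, 19, 7, 88]        # unsorted base data
perm = build_permutation(base)   # row ids in key order
rows_for_7 = lookup(base, perm, 7)
```

In a learned variant, a model trained on the keys in permutation order would predict an approximate position, and the binary search would then be restricted to a small window around that prediction.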

Bounding the Last Mile: Efficient Learned String Indexing

Nov 29, 2021

Benjamin Spector, Andreas Kipf, Kapil Vaidya, Chi Wang, Umar Farooq Minhas, Tim Kraska

Abstract: We introduce the RadixStringSpline (RSS) learned index structure for efficiently indexing strings. RSS is a tree of radix splines, each indexing a fixed number of bytes. RSS approaches or exceeds the performance of traditional string indexes while using 7-70$\times$ less memory. RSS achieves this by using the minimal string prefix needed to sufficiently distinguish the data, unlike most learned approaches, which index the entire string. Additionally, the bounded-error nature of RSS accelerates the last-mile search and also enables a memory-efficient hash-table lookup accelerator. We benchmark RSS on several real-world string datasets against ART and HOT. Our experiments suggest this line of research may be promising for future memory-intensive database applications.

* 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB'21), August 20, 2021, Copenhagen, Denmark

Via

Access Paper or Ask Questions

PLEX: Towards Practical Learned Indexing

Aug 11, 2021

Mihail Stoian, Andreas Kipf, Ryan Marcus, Tim Kraska

Abstract: Recent research proposes to replace existing index structures with learned models. However, current learned indexes tend to have many hyperparameters, often do not provide any error guarantees, and are expensive to build. We introduce Practical Learned Index (PLEX). PLEX has only a single hyperparameter $\epsilon$ (maximum prediction error) and offers a better trade-off between build and lookup time than state-of-the-art approaches. Similar to RadixSpline, PLEX consists of a spline and a (multi-level) radix layer. It first builds a spline satisfying the given $\epsilon$ and then performs an ad-hoc analysis of the distribution of spline points to quickly tune the radix layer.

* 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB'21), August 20, 2021, Copenhagen, Denmark
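
The role of a maximum prediction error $\epsilon$ can be illustrated with a small sketch. It is not PLEX or RadixSpline: a single straight line through the first and last key stands in for the spline, there is no radix layer, and $\epsilon$ is measured after the fit instead of being given as an input. Keys are assumed to be sorted and distinct, and all names are hypothetical.

```python
import bisect


def predict(model, key):
    """Approximate the position of `key` with a single linear segment."""
    first_key, slope, _ = model
    return int(slope * (key - first_key))


def build(keys):
    """Fit a line through the first and last key and record its maximum error."""
    first_key = keys[0]
    span = keys[-1] - first_key
    slope = (len(keys) - 1) / span if span > 0 else 0.0
    untuned = (first_key, slope, 0)
    # eps is the largest distance between a predicted and a true position.
    eps = max(abs(predict(untuned, k) - pos) for pos, k in enumerate(keys))
    return first_key, slope, eps


def lookup(keys, model, key):
    """Search only the window [guess - eps, guess + eps] around the prediction."""
    eps = model[2]
    guess = predict(model, key)
    lo = max(0, guess - eps)
    hi = min(len(keys), guess + eps + 1)
    pos = bisect.bisect_left(keys, key, lo, hi)
    if pos < len(keys) and keys[pos] == key:
        return pos
    return None


keys = [1, 4, 9, 16, 25, 36, 49, 64]
model = build(keys)
position_of_25 = lookup(keys, model, 25)
```

Because every stored key lies within `eps` positions of its prediction, the final search touches at most `2 * eps + 1` entries. This is why a smaller error bound gives faster lookups, at the cost of a larger model.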

The Case for Learned Spatial Indexes

Aug 24, 2020

Varun Pandey, Alexander van Renen, Andreas Kipf, Ibrahim Sabek, Jialin Ding, Alfons Kemper

Abstract: Spatial data is ubiquitous. Massive amounts of data are generated every day by billions of GPS-enabled devices such as cell phones, cars, and sensors, and by consumer-based applications such as Uber, Tinder, and location-tagged posts on Facebook, Twitter, and Instagram. This exponential growth in spatial data has led the research community to focus on building systems and applications that can process spatial data efficiently. In the meantime, recent research has introduced learned index structures. In this work, we use techniques proposed in a state-of-the-art learned multi-dimensional index structure (namely, Flood) and apply them to five classical multi-dimensional indexes to be able to answer spatial range queries. By tuning each partitioning technique for optimal performance, we show that (i) machine-learned search within a partition is 11.79% to 39.51% faster than binary search when filtering on one dimension, (ii) the bottleneck for tree structures is index lookup, which could potentially be improved by linearizing the indexed partitions, (iii) filtering on one dimension and refining using machine-learned indexes is 1.23x to 1.83x faster than the closest competitor, which filters on two dimensions, and (iv) learned indexes can have a significant impact on the performance of low-selectivity queries while being less effective under higher selectivities.

RadixSpline: A Single-Pass Learned Index

May 22, 2020

Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann

Abstract: Recent research has shown that learned models can outperform state-of-the-art index structures in size and lookup performance. While this is a very promising result, existing learned structures are often cumbersome to implement and slow to build. In fact, most approaches that we are aware of require multiple training passes over the data. We introduce RadixSpline (RS), a learned index that can be built in a single pass over the data and is competitive with state-of-the-art learned index models, like RMI, in size and lookup performance. We evaluate RS using the SOSD benchmark and show that it achieves competitive results on all datasets, despite the fact that it only has two parameters.

* Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM 2020)

SOSD: A Benchmark for Learned Indexes

Nov 29, 2019

Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann

Abstract: A groundswell of recent work has focused on improving data management systems with learned components. Specifically, work on learned index structures has proposed replacing traditional index structures, such as B-trees, with learned models. Given the decades of research committed to improving index structures, there is significant skepticism about whether learned indexes actually outperform state-of-the-art implementations of traditional structures on real-world data. To answer this question, we propose a new benchmarking framework that comes with a variety of real-world datasets and baseline implementations to compare against. We also show preliminary results for selected index structures, and find that learned models indeed often outperform state-of-the-art implementations, and are therefore a promising direction for future research.

* NeurIPS 2019 Workshop on Machine Learning for Systems
