Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kapil Vaidya

Bounding the Last Mile: Efficient Learned String Indexing

Nov 29, 2021

Benjamin Spector, Andreas Kipf, Kapil Vaidya, Chi Wang, Umar Farooq Minhas, Tim Kraska

Figure 1 for Bounding the Last Mile: Efficient Learned String Indexing

Figure 2 for Bounding the Last Mile: Efficient Learned String Indexing

Figure 3 for Bounding the Last Mile: Efficient Learned String Indexing

Abstract:We introduce the RadixStringSpline (RSS) learned index structure for efficiently indexing strings. RSS is a tree of radix splines each indexing a fixed number of bytes. RSS approaches or exceeds the performance of traditional string indexes while using 7-70$\times$ less memory. RSS achieves this by using the minimal string prefix to sufficiently distinguish the data unlike most learned approaches which index the entire string. Additionally, the bounded-error nature of RSS accelerates the last mile search and also enables a memory-efficient hash-table lookup accelerator. We benchmark RSS on several real-world string datasets against ART and HOT. Our experiments suggest this line of research may be promising for future memory-intensive database applications.

* 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB'21), August 20, 2021, Copenhagen, Denmark

Via

Access Paper or Ask Questions

Partitioned Learned Bloom Filter

Jun 05, 2020

Kapil Vaidya, Eric Knorr, Tim Kraska, Michael Mitzenmacher

Figure 1 for Partitioned Learned Bloom Filter

Figure 2 for Partitioned Learned Bloom Filter

Figure 3 for Partitioned Learned Bloom Filter

Abstract:Learned Bloom filters enhance standard Bloom filters by using a learned model for the represented data set. However, a learned Bloom filter may under-utilize the model by not taking full advantage of the output. The learned Bloom filter uses the output score by simply applying a threshold, with elements above the threshold being interpreted as positives, and elements below the threshold subject to further analysis independent of the output score (using a smaller backup Bloom filter to prevent false negatives). While recent work has suggested additional heuristic approaches to take better advantage of the score, the results are only heuristic. Here, we instead frame the problem of optimal model utilization as an optimization problem. We show that the optimization problem can be effectively solved efficiently, yielding an improved {partitioned learned Bloom filter}, which partitions the score space and utilizes separate backup Bloom filters for each region. Experimental results from both simulated and real-world datasets show significant performance improvements from our optimization approach over both the original learned Bloom filter constructions and previously proposed heuristic improvements.

* 13 pages, 3 figures

Via

Access Paper or Ask Questions