Charles Xie

Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment

Jan 16, 2024

Mengzhao Wang, Weizhi Xu, Xiaomeng Yi, Songlin Wu, Zhangyang Peng, Xiangyu Ke, Yunjun Gao, Xiaoliang Xu, Rentong Guo, Charles Xie

Abstract: High-dimensional vector similarity search (HVSS) is gaining prominence as a powerful tool for various data science and AI applications. As vector data scales up, in-memory indexes pose a significant challenge due to the substantial increase in main memory requirements. A potential solution is a disk-based implementation, which stores and searches vector data on high-performance devices like NVMe SSDs. However, implementing HVSS for data segments proves to be intricate in vector databases, where a single machine hosts multiple segments for system scalability. In this context, each segment operates with limited memory and disk space, necessitating a delicate balance between accuracy, efficiency, and space cost. Existing disk-based methods fall short because they do not address all of these requirements simultaneously. In this paper, we present Starling, an I/O-efficient disk-resident graph index framework that optimizes data layout and search strategy within the segment. It has two primary components: (1) a data layout incorporating an in-memory navigation graph and a reordered disk-based graph with enhanced locality, reducing the search path length and minimizing disk bandwidth wastage; and (2) a block search strategy designed to minimize costly disk I/O operations during vector query execution. Through extensive experiments, we validate the effectiveness, efficiency, and scalability of Starling. On a data segment with 2GB memory and 10GB disk capacity, Starling can accommodate up to 33 million vectors in 128 dimensions, offering HVSS with over 0.9 average precision and top-10 recall rate, and latency under 1 millisecond. The results showcase Starling's superior performance, exhibiting 43.9× higher throughput with 98% lower query latency compared to state-of-the-art methods while maintaining the same level of accuracy.

* This paper has been accepted by SIGMOD 2024
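
The abstract's two ideas, a disk layout with better locality and a search that works block by block, can be illustrated with a toy example. The sketch below is not Starling's algorithm. It assumes a graph whose vertices are grouped into fixed blocks, counts one I/O per block read, and scores every vertex in a block once that block is loaded. The names `block_search`, `clustered`, and `scattered` are hypothetical, and in-memory dictionaries stand in for the disk.

```python
import heapq
from typing import Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def block_search(
    query: Vector,
    vectors: List[Vector],
    neighbors: Dict[int, List[int]],
    blocks: Dict[int, List[int]],
    entry: int,
    k: int,
) -> Tuple[List[int], int]:
    """Best-first graph search that reads whole blocks (toy model).

    Returns the k closest vertices found and the number of block reads.
    """
    block_of = {v: b for b, members in blocks.items() for v in members}
    loaded: Set[int] = set()               # blocks already read
    seen: Dict[int, float] = {}            # vertex -> distance to the query
    frontier: List[Tuple[float, int]] = []  # min-heap of (distance, vertex)
    io_count = 0

    def load_block(b: int) -> None:
        nonlocal io_count
        if b in loaded:
            return
        loaded.add(b)
        io_count += 1                      # one simulated disk read
        for v in blocks[b]:                # use every vertex in the block
            seen[v] = sq_dist(query, vectors[v])
            heapq.heappush(frontier, (seen[v], v))

    load_block(block_of[entry])
    while frontier:
        dist, v = heapq.heappop(frontier)
        best = sorted(seen.values())[:k]   # simple, not optimized
        if len(best) == k and dist > best[-1]:
            break                          # nothing closer left to expand
        for u in neighbors[v]:
            load_block(block_of[u])
    return sorted(seen, key=seen.get)[:k], io_count


# Eight points on a line, each linked to its immediate neighbors.
vectors = [(float(i),) for i in range(8)]
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

clustered = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}  # graph neighbors share a block
scattered = {i: [i] for i in range(8)}          # one vertex per block

print(block_search((5.2,), vectors, neighbors, clustered, entry=0, k=2))
# ([5, 6], 2)
print(block_search((5.2,), vectors, neighbors, scattered, entry=0, k=2))
# ([5, 6], 8)
```

Both layouts return the same two nearest vertices, but the layout that stores graph neighbors in the same block needs 2 simulated reads instead of 8. That is the kind of saving a locality-aware layout aims for, here under heavily simplified assumptions.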

Using Machine Learning to Predict Engineering Technology Students' Success with Computer Aided Design

Aug 12, 2021

Jasmine Singh, Viranga Perera, Alejandra J. Magana, Brittany Newell, Jin Wei-Kocsis, Ying Ying Seah, Greg J. Strimel, Charles Xie

Abstract: Computer-aided design (CAD) programs are essential to engineering as they allow for better designs through low-cost iterations. While CAD programs are typically taught to undergraduate students as a job skill, such software can also help students learn engineering concepts. A current limitation of CAD programs (even those that are specifically designed for educational purposes) is that they are not capable of providing automated real-time help to students. To encourage CAD programs to build in assistance to students, we used data generated from students using a free, open-source CAD software called Aladdin to demonstrate how student data combined with machine learning techniques can predict how well a particular student will perform in a design task. We challenged students to design a house that consumed zero net energy as part of an introductory engineering technology undergraduate course. Using data from 128 students, along with the scikit-learn Python machine learning library, we tested our models using both total counts of design actions and sequences of design actions as inputs. We found that our models using early design sequence actions are particularly valuable for prediction. Our logistic regression model achieved a >60% chance of predicting whether a student would succeed in designing a zero net energy house. Our results suggest that it would be feasible for Aladdin to provide useful feedback to students when they are approximately halfway through their design. Further improvements to these models could lead to earlier predictions and thus provide students feedback sooner to enhance their learning.
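
The two input representations mentioned in the abstract, total counts of design actions and sequences of design actions, can be sketched as follows. This is an illustration only. The action names, the choice of consecutive-action pairs as the sequence encoding, and the helpers `count_features` and `early_sequence_features` are assumptions; they are not taken from the paper or from Aladdin.

```python
from collections import Counter
from typing import List, Sequence

# Hypothetical action vocabulary; real Aladdin logs may use different names.
ACTIONS = ["add_wall", "add_window", "add_solar_panel", "run_energy_analysis"]


def count_features(log: Sequence[str], vocabulary: Sequence[str] = ACTIONS) -> List[int]:
    """Total number of times each action appears in a student's log."""
    counts = Counter(log)
    return [counts[action] for action in vocabulary]


def early_sequence_features(
    log: Sequence[str],
    vocabulary: Sequence[str] = ACTIONS,
    fraction: float = 0.5,
) -> List[int]:
    """Counts of consecutive action pairs within the first part of the log.

    Pair counts are one simple way to encode order (an assumption here).
    """
    early = log[: int(len(log) * fraction)]
    index = {action: i for i, action in enumerate(vocabulary)}
    n = len(vocabulary)
    features = [0] * (n * n)
    for prev, nxt in zip(early, early[1:]):
        if prev in index and nxt in index:  # skip actions outside the vocabulary
            features[index[prev] * n + index[nxt]] += 1
    return features


log = [
    "add_wall", "add_wall", "add_window",
    "run_energy_analysis", "add_solar_panel", "run_energy_analysis",
]
print(count_features(log))                # [2, 1, 1, 2]
print(sum(early_sequence_features(log)))  # 2
```

Feature vectors like these, one per student, could then be passed with success labels to scikit-learn's `LogisticRegression` (`fit`, then `predict_proba`), the library the abstract names. The features and model settings the authors actually used are not given in the abstract.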
