Aline Bessa

Correlation Sketches for Approximate Join-Correlation Queries

Apr 07, 2021

Aécio Santos, Aline Bessa, Fernando Chirigati, Christopher Musco, Juliana Freire

Abstract: The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column $Q$ and a join column $K_Q$ from a query table $\mathcal{T}_Q$, retrieve tables $\mathcal{T}_X$ in a dataset collection such that $\mathcal{T}_X$ is joinable with $\mathcal{T}_Q$ on $K_Q$ and there is a column $C \in \mathcal{T}_X$ such that $Q$ is correlated with $C$. A naive approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between $Q$ and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings.

* Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21)
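
To make the cost of the naive evaluation described in the abstract concrete, here is a minimal Python sketch of that baseline: join each candidate table with the query on the key, then compute a correlation for every candidate column. It is an illustration only, not the paper's sketching method. The function names, the dictionary-based table layout, the use of Pearson correlation, and the assumption that join keys are unique within each table are all hypothetical choices made for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)


def naive_join_correlation(query_keys, query_values, tables):
    """Naive baseline: explicitly join every table and correlate every column.

    query_keys / query_values play the roles of K_Q and Q.
    tables maps a table name to {"key": [...], "columns": {name: [...]}}
    (a hypothetical layout; keys are assumed unique within each table).
    Returns (table, column, correlation, join_size) tuples, strongest first.
    """
    query = dict(zip(query_keys, query_values))
    results = []
    for table_name, table in tables.items():
        for column_name, column_values in table["columns"].items():
            candidate = dict(zip(table["key"], column_values))
            # Inner join on the key: keep only keys present on both sides.
            shared = [k for k in query if k in candidate]
            r = pearson([query[k] for k in shared],
                        [candidate[k] for k in shared])
            if r is not None:
                results.append((table_name, column_name, r, len(shared)))
    # Rank by absolute correlation, one simple scoring choice among many.
    results.sort(key=lambda item: abs(item[2]), reverse=True)
    return results
```

Every query touches every row of every candidate column, so the work grows with the total size of the collection. That is the expense the abstract's index of sketches is meant to avoid.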

Auctus: A Dataset Search Engine for Data Augmentation

Feb 10, 2021

Fernando Chirigati, Rémi Rampin, Aécio Santos, Aline Bessa, Juliana Freire

Abstract: Machine learning models are increasingly being adopted in many applications. The quality of these models critically depends on the input data on which they are trained, and by augmenting their input data with external data, we have the opportunity to create better models. However, the massive number of datasets available on the Web makes it challenging to find data suitable for augmentation. In this demo, we present our ongoing efforts to develop a dataset search engine tailored for data augmentation. Our prototype, named Auctus, automatically discovers datasets on the Web and, unlike existing dataset search engines, infers consistent metadata for indexing and supports join and union search queries. Auctus is already being used in a real deployment environment to improve the performance of ML models. The demonstration will include various real-world data augmentation examples, and visitors will be able to interact with the system.
