Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Arash Termehchy

Towards Scalable Schema Mapping using Large Language Models

May 30, 2025

Christopher Buss, Mahdis Safari, Arash Termehchy, Stefan Lee, David Maier

Abstract:The growing need to integrate information from a large number of diverse sources poses significant scalability challenges for data integration systems. These systems often rely on manually written schema mappings, which are complex, source-specific, and costly to maintain as sources evolve. While recent advances suggest that large language models (LLMs) can assist in automating schema matching by leveraging both structural and natural language cues, key challenges remain. In this paper, we identify three core issues with using LLMs for schema mapping: (1) inconsistent outputs due to sensitivity to input phrasing and structure, which we propose methods to address through sampling and aggregation techniques; (2) the need for more expressive mappings (e.g., GLaV), which strain the limited context windows of LLMs; and (3) the computational cost of repeated LLM calls, which we propose to mitigate through strategies like data type prefiltering.

Via

Access Paper or Ask Questions

Learning Accurate Models on Incomplete Data with Minimal Imputation

Mar 18, 2025

Cheng Zhen, Nischal Aryal, Arash Termehchy, Prayoga, Garrett Biwer, Sankalp Patil

Abstract:Missing data often exists in real-world datasets, requiring significant time and effort for imputation to learn accurate machine learning (ML) models. In this paper, we demonstrate that imputing all missing values is not always necessary to achieve an accurate ML model. We introduce the concept of minimal data imputation, which ensures accurate ML models trained over the imputed dataset. Implementing minimal imputation guarantees both minimal imputation effort and optimal ML models. We propose algorithms to find exact and approximate minimal imputation for various ML models. Our extensive experiments indicate that our proposed algorithms significantly reduce the time and effort required for data imputation.

Via

Access Paper or Ask Questions

Certain and Approximately Certain Models for Statistical Learning

Mar 01, 2024

Cheng Zhen, Nischal Aryal, Arash Termehchy, Alireza Aghasi, Amandeep Singh Chabada

Figure 1 for Certain and Approximately Certain Models for Statistical Learning

Figure 2 for Certain and Approximately Certain Models for Statistical Learning

Figure 3 for Certain and Approximately Certain Models for Statistical Learning

Figure 4 for Certain and Approximately Certain Models for Statistical Learning

Abstract:Real-world data is often incomplete and contains missing values. To train accurate models over real-world datasets, users need to spend a substantial amount of time and resources imputing and finding proper values for missing data items. In this paper, we demonstrate that it is possible to learn accurate models directly from data with missing values for certain training data and target models. We propose a unified approach for checking the necessity of data imputation to learn accurate models across various widely-used machine learning paradigms. We build efficient algorithms with theoretical guarantees to check this necessity and return accurate models in cases where imputation is unnecessary. Our extensive experiments indicate that our proposed algorithms significantly reduce the amount of time and effort needed for data imputation without imposing considerable computational overhead.

* A technical report for a paper to appear at SIGMOD 2024

Via

Access Paper or Ask Questions

Towards Consistent Language Models Using Declarative Constraints

Dec 24, 2023

Jasmin Mousavi, Arash Termehchy

Abstract:Large language models have shown unprecedented abilities in generating linguistically coherent and syntactically correct natural language output. However, they often return incorrect and inconsistent answers to input questions. Due to the complexity and uninterpretability of the internally learned representations, it is challenging to modify language models such that they provide correct and consistent results. The data management community has developed various methods and tools for providing consistent answers over inconsistent datasets. In these methods, users specify the desired properties of data in a domain in the form of high-level declarative constraints. This approach has provided usable and scalable methods to delivering consistent information from inconsistent datasets. We aim to build upon this success and leverage these methods to modify language models such that they deliver consistent and accurate results. We investigate the challenges of using these ideas to obtain consistent and relevant answers from language models and report some preliminary empirical studies.

Via

Access Paper or Ask Questions

Learning Over Dirty Data Without Cleaning

Apr 05, 2020

Jose Picado, John Davis, Arash Termehchy, Ga Young Lee

Figure 1 for Learning Over Dirty Data Without Cleaning

Figure 2 for Learning Over Dirty Data Without Cleaning

Figure 3 for Learning Over Dirty Data Without Cleaning

Figure 4 for Learning Over Dirty Data Without Cleaning

Abstract:Real-world datasets are dirty and contain many errors. Examples of these issues are violations of integrity constraints, duplicates, and inconsistencies in representing data values and entities. Learning over dirty databases may result in inaccurate models. Users have to spend a great deal of time and effort to repair data errors and create a clean database for learning. Moreover, as the information required to repair these errors is not often available, there may be numerous possible clean versions for a dirty database. We propose DLearn, a novel relational learning system that learns directly over dirty databases effectively and efficiently without any preprocessing. DLearn leverages database constraints to learn accurate relational models over inconsistent and heterogeneous data. Its learned models represent patterns over all possible clean instances of the data in a usable form. Our empirical study indicates that DLearn learns accurate models over large real-world databases efficiently.

* To be published in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD'20)

Via

Access Paper or Ask Questions

A Signaling Game Approach to Databases Querying and Interaction

May 04, 2018

Ben McCamish, Vahid Ghadakchi, Arash Termehchy, Behrouz Touri

Figure 1 for A Signaling Game Approach to Databases Querying and Interaction

Figure 2 for A Signaling Game Approach to Databases Querying and Interaction

Figure 3 for A Signaling Game Approach to Databases Querying and Interaction

Figure 4 for A Signaling Game Approach to Databases Querying and Interaction

Abstract:As most users do not precisely know the structure and/or the content of databases, their queries do not exactly reflect their information needs. The database management systems (DBMS) may interact with users and use their feedback on the returned results to learn the information needs behind their queries. Current query interfaces assume that users do not learn and modify the way way they express their information needs in form of queries during their interaction with the DBMS. Using a real-world interaction workload, we show that users learn and modify how to express their information needs during their interactions with the DBMS and their learning is accurately modeled by a well-known reinforcement learning mechanism. As current data interaction systems assume that users do not modify their strategies, they cannot discover the information needs behind users' queries effectively. We model the interaction between users and DBMS as a game with identical interest between two rational agents whose goal is to establish a common language for representing information needs in form of queries. We propose a reinforcement learning method that learns and answers the information needs behind queries and adapts to the changes in users' strategies and prove that it improves the effectiveness of answering queries stochastically speaking. We propose two efficient implementation of this method over large relational databases. Our extensive empirical studies over real-world query workloads indicate that our algorithms are efficient and effective.

* 21 pages

Via

Access Paper or Ask Questions

Schema Independent Relational Learning

Nov 06, 2017

Jose Picado, Arash Termehchy, Alan Fern, Parisa Ataei

Figure 1 for Schema Independent Relational Learning

Figure 2 for Schema Independent Relational Learning

Figure 3 for Schema Independent Relational Learning

Figure 4 for Schema Independent Relational Learning

Abstract:Learning novel concepts and relations from relational databases is an important problem with many applications in database systems and machine learning. Relational learning algorithms learn the definition of a new relation in terms of existing relations in the database. Nevertheless, the same data set may be represented under different schemas for various reasons, such as efficiency, data quality, and usability. Unfortunately, the output of current relational learning algorithms tends to vary quite substantially over the choice of schema, both in terms of learning accuracy and efficiency. This variation complicates their off-the-shelf application. In this paper, we introduce and formalize the property of schema independence of relational learning algorithms, and study both the theoretical and empirical dependence of existing algorithms on the common class of (de) composition schema transformations. We study both sample-based learning algorithms, which learn from sets of labeled examples, and query-based algorithms, which learn by asking queries to an oracle. We prove that current relational learning algorithms are generally not schema independent. For query-based learning algorithms we show that the (de) composition transformations influence their query complexity. We propose Castor, a sample-based relational learning algorithm that achieves schema independence by leveraging data dependencies. We support the theoretical results with an empirical study that demonstrates the schema dependence/independence of several algorithms on existing benchmark and real-world datasets under (de) compositions.

Via

Access Paper or Ask Questions

AutoMode: Relational Learning With Less Black Magic

Oct 03, 2017

Jose Picado, Sudhanshu Pathak, Arash Termehchy, Alan Fern

Figure 1 for AutoMode: Relational Learning With Less Black Magic

Figure 2 for AutoMode: Relational Learning With Less Black Magic

Figure 3 for AutoMode: Relational Learning With Less Black Magic

Figure 4 for AutoMode: Relational Learning With Less Black Magic

Abstract:Relational databases are valuable resources for learning novel and interesting relations and concepts. Relational learning algorithms learn the Datalog definition of new relations in terms of the existing relations in the database. In order to constraint the search through the large space of candidate definitions, users must tune the algorithm by specifying a language bias. Unfortunately, specifying the language bias is done via trial and error and is guided by the expert's intuitions. Hence, it normally takes a great deal of time and effort to effectively use these algorithms. In particular, it is hard to find a user that knows computer science concepts, such as database schema, and has a reasonable intuition about the target relation in special domains, such as biology. We propose AutoMode, a system that leverages information in the schema and content of the database to automatically induce the language bias used by popular relational learning systems. We show that AutoMode delivers the same accuracy as using manually-written language bias by imposing only a slight overhead on the running time of the learning algorithm.

Via

Access Paper or Ask Questions