Abstract: Automating end-to-end Exploratory Data Analysis (AutoEDA) is a challenging open problem, often tackled through Reinforcement Learning (RL) by learning to predict a sequence of analysis operations (FILTER, GROUP, etc.). Defining rewards for each operation is difficult, and existing methods rely on various \emph{interestingness measures} to craft reward functions that capture the importance of each operation. In this work, we argue that not all of the essential aspects of what makes an operation important can be captured accurately by mathematically defined rewards. We propose an AutoEDA model trained through imitation learning from expert EDA sessions, bypassing the need for manually defined interestingness measures. Our method, based on generative adversarial imitation learning (GAIL), generalizes well across datasets, even with limited expert data. We also introduce a novel approach for generating synthetic EDA demonstrations for training. Our method outperforms the existing state-of-the-art end-to-end EDA approach on benchmarks by up to 3x, showing strong performance and generalization, while naturally capturing diverse interestingness measures in generated EDA sessions.
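At its core, a GAIL formulation replaces hand-crafted interestingness rewards with a learned reward: a discriminator tries to tell expert (state, operation) pairs from policy-generated ones, and the policy is rewarded for fooling it. The sketch below is a minimal, illustrative PyTorch version of that loop; the state encoding, network sizes, and the REINFORCE-style policy update are assumptions standing in for the paper's actual architecture and optimizer.

```python
# Minimal GAIL-style training step for an AutoEDA policy (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS = 128, 12  # assumed encoded dataset-view size; FILTER, GROUP, ...

policy = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(), nn.Linear(256, N_ACTIONS))
discrim = nn.Sequential(nn.Linear(STATE_DIM + N_ACTIONS, 256), nn.ReLU(), nn.Linear(256, 1))
opt_pi = torch.optim.Adam(policy.parameters(), lr=3e-4)
opt_d = torch.optim.Adam(discrim.parameters(), lr=3e-4)

def one_hot(a):
    return F.one_hot(a, N_ACTIONS).float()

def gail_step(expert_s, expert_a, policy_s):
    # Sample EDA operations from the current policy.
    dist = torch.distributions.Categorical(logits=policy(policy_s))
    policy_a = dist.sample()

    # 1) Discriminator update: expert pairs -> label 1, policy pairs -> label 0.
    d_expert = discrim(torch.cat([expert_s, one_hot(expert_a)], dim=-1))
    d_policy = discrim(torch.cat([policy_s, one_hot(policy_a)], dim=-1))
    d_loss = (F.binary_cross_entropy_with_logits(d_expert, torch.ones_like(d_expert))
              + F.binary_cross_entropy_with_logits(d_policy, torch.zeros_like(d_policy)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Policy update: -log(1 - D) serves as the learned reward in place of a
    #    hand-crafted interestingness measure (a practical system would use PPO/TRPO).
    with torch.no_grad():
        reward = -F.logsigmoid(-discrim(torch.cat([policy_s, one_hot(policy_a)], dim=-1)))
    pi_loss = -(dist.log_prob(policy_a) * reward.squeeze(-1)).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
```

The expert (state, operation) pairs would come from recorded expert sessions or from synthetically generated EDA demonstrations, as the abstract describes.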
Abstract: Discovering meaningful insights from a large dataset, known as Exploratory Data Analysis (EDA), is a challenging task that requires thorough exploration and analysis of the data. Automated Data Exploration (ADE) systems use goal-oriented methods based on Large Language Models and Reinforcement Learning to move towards full automation. However, goal-oriented methods require humans to anticipate goals up front, which can limit the insights extracted, while fully automated systems demand significant computational resources and retraining for new datasets. We introduce QUIS, a fully automated EDA system that operates in two stages: insight generation (ISGen) driven by question generation (QUGen). The QUGen module generates questions iteratively, refining them across iterations to enhance coverage without human intervention or manually curated examples. The ISGen module analyzes the data to produce multiple relevant insights in response to each question, requiring no prior training and enabling QUIS to adapt to new datasets.
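The two-stage pipeline can be pictured as a question-generation loop feeding an insight-generation step. The sketch below is one assumed reading of that flow; the llm() helper, the prompts, and the fixed iteration count are placeholders rather than the paper's implementation.

```python
# Illustrative QUGen -> ISGen flow for QUIS (prompts and llm() are assumptions).
def llm(prompt: str) -> str:
    """Placeholder for any LLM call (API or local model)."""
    raise NotImplementedError

def qugen(schema_summary: str, n_iters: int = 3) -> list[str]:
    questions: list[str] = []
    for _ in range(n_iters):
        prompt = (
            f"Dataset schema:\n{schema_summary}\n"
            f"Questions asked so far:\n{questions}\n"
            "Propose new analytical questions covering aspects of the data not "
            "addressed above. Return one question per line."
        )
        new_qs = [q.strip() for q in llm(prompt).splitlines() if q.strip()]
        # Refine coverage iteratively: keep only questions not already proposed.
        questions.extend(q for q in new_qs if q not in questions)
    return questions

def isgen(question: str, data_sample: str) -> str:
    prompt = (
        f"Question: {question}\n"
        f"Relevant data:\n{data_sample}\n"
        "Analyze the data and state the insights that answer this question."
    )
    return llm(prompt)

def quis(schema_summary: str, data_sample: str) -> dict[str, str]:
    # Full pipeline: generate questions, then produce insights for each one.
    return {q: isgen(q, data_sample) for q in qugen(schema_summary)}
```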
Abstract: Efficient processing of tabular data is important in various industries, especially when working with datasets containing a large number of columns. Large language models (LLMs) have demonstrated their capabilities on several tasks when guided by carefully crafted prompts. However, creating effective prompts for tabular datasets is challenging due to the structured nature of the data and the need to manage numerous columns. This paper presents an innovative auto-prompt generation system suitable for multiple LLMs, with minimal training. It proposes two novel methods: 1) a Reinforcement Learning-based algorithm for identifying and sequencing task-relevant columns, and 2) a cell-level similarity-based approach for enhancing few-shot example selection. Our approach has been extensively tested across 66 datasets, demonstrating improved performance in three downstream tasks: data imputation, error detection, and entity matching, using two distinct LLMs: Google flan-t5-xxl and Mixtral 8x7B.
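For the second method, a cell-level similarity score between the query row and each labeled candidate row can drive few-shot example selection. The sketch below uses difflib.SequenceMatcher as an illustrative per-cell similarity; the actual metric and the set of task-relevant columns (e.g. those chosen by the RL step) are assumptions here, not the paper's specification.

```python
# Hedged sketch of cell-level similarity for few-shot example selection.
from difflib import SequenceMatcher

def cell_sim(a, b) -> float:
    # Per-cell string similarity in [0, 1]; an illustrative choice of metric.
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def row_score(query_row: dict, candidate_row: dict, columns: list[str]) -> float:
    # Only the task-relevant columns (e.g. those selected and ordered by the
    # RL-based step) contribute to the similarity score.
    return sum(cell_sim(query_row[c], candidate_row[c]) for c in columns) / len(columns)

def select_few_shot(query_row: dict, labeled_rows: list[dict],
                    columns: list[str], k: int = 3) -> list[dict]:
    # Rank labeled candidates by average per-cell similarity and keep the top-k
    # as in-context examples for the downstream prompt.
    ranked = sorted(labeled_rows,
                    key=lambda r: row_score(query_row, r, columns),
                    reverse=True)
    return ranked[:k]
```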