Title: AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark

Dec 09, 2024

Lan Li, Liri Fang, Vetle I. Torvik

Figures 1-4 for AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark (images not reproduced here)

Abstract: We investigate the reasoning capabilities of large language models (LLMs) for automatically generating data-cleaning workflows. To evaluate LLMs' ability to complete data-cleaning tasks, we implemented a pipeline for LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow), prompting LLMs on data cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats. Given a dirty table and a purpose (expressed as a query), the pipeline generates a minimal, clean table sufficient to address the purpose, together with the data cleaning workflow used to produce that table. The planning process involves three main LLM-driven components: (1) Select Target Columns, which identifies a set of target columns related to the purpose; (2) Inspect Column Quality, which assesses the data quality of each target column and generates a Data Quality Report that serves as the operation objectives; and (3) Generate Operation & Arguments, which predicts the next operation and its arguments based on the Data Quality Report. Additionally, we propose a data cleaning benchmark to evaluate the capability of LLM agents to automatically generate workflows that address data cleaning purposes of varying difficulty levels. The benchmark comprises annotated datasets, each a collection of a purpose, raw table, clean table, data cleaning workflow, and answer set. In our experiments, we evaluated three LLMs that auto-generate purpose-driven data cleaning workflows. The results indicate that LLMs perform well in planning and generating data-cleaning workflows without the need for fine-tuning.
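The three-component loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: each LLM-driven component is replaced by a simple rule-based stand-in, and every function name, operation name ("normalize_text", "fill_missing", "drop_duplicates"), and the "unknown" fill value is a hypothetical choice made for this example only.

```python
from typing import Callable, Dict, List, Optional, Tuple

Table = List[Dict[str, Optional[str]]]


def select_target_columns(table: Table, purpose: str) -> List[str]:
    """Stand-in for component (1): keep columns whose name occurs in the purpose."""
    columns = list(table[0].keys()) if table else []
    return [c for c in columns if c.lower() in purpose.lower()]


def inspect_column_quality(table: Table, column: str) -> Dict[str, bool]:
    """Stand-in for component (2): a tiny data quality report for one column."""
    values = [row[column] for row in table]
    return {
        "missing": any(v is None or v.strip() == "" for v in values),
        "inconsistent_format": any(
            v is not None and v != v.strip().lower() for v in values
        ),
    }


def next_operation(report: Dict[str, bool]) -> Optional[str]:
    """Stand-in for component (3): choose the next operation from the report."""
    if report["inconsistent_format"]:
        return "normalize_text"
    if report["missing"]:
        return "fill_missing"
    return None  # the column meets its quality objectives


def normalize_text(table: Table, column: str) -> Table:
    # Trim whitespace and lowercase every non-missing value.
    return [
        {**r, column: r[column].strip().lower() if r[column] is not None else None}
        for r in table
    ]


def fill_missing(table: Table, column: str) -> Table:
    # Replace None or empty strings with a hypothetical placeholder value.
    return [{**r, column: r[column] if r[column] else "unknown"} for r in table]


OPERATIONS: Dict[str, Callable[[Table, str], Table]] = {
    "normalize_text": normalize_text,
    "fill_missing": fill_missing,
}


def auto_clean(
    table: Table, purpose: str, max_steps: int = 10
) -> Tuple[Table, List[Tuple[str, str]]]:
    """Return a minimal clean table and the workflow (operation, column) used."""
    targets = select_target_columns(table, purpose)
    table = [{c: row[c] for c in targets} for row in table]  # keep target columns only
    workflow: List[Tuple[str, str]] = []
    for column in targets:
        for _ in range(max_steps):  # iterate until the report is clean
            op = next_operation(inspect_column_quality(table, column))
            if op is None:
                break
            table = OPERATIONS[op](table, column)
            workflow.append((op, column))
    # Remove duplicate rows once the target columns are consistent.
    seen, deduped = set(), []
    for row in table:
        key = tuple(row[c] for c in targets)
        if key not in seen:
            seen.add(key)
            deduped.append(row)
    if len(deduped) != len(table):
        workflow.append(("drop_duplicates", "*"))
    return deduped, workflow


dirty = [
    {"city": " Chicago", "year": "2020", "note": "a"},
    {"city": "chicago", "year": "2020", "note": "b"},
    {"city": None, "year": "2021", "note": "c"},
]
clean, workflow = auto_clean(dirty, "How many rows are there per city and year?")
print(clean)
print(workflow)
```

In this toy run the "note" column is dropped because the purpose never mentions it, which mirrors the idea of producing a minimal table. In the pipeline the abstract describes, each of the three stand-in functions would instead be an LLM prompt.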
