Title: DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing

Oct 16, 2024

Shreya Shankar, Aditya G. Parameswaran, Eugene Wu

Abstract: Analyzing unstructured data, such as complex documents, has been a persistent challenge in data processing. Large Language Models (LLMs) have shown promise in this regard, leading to recent proposals for declarative frameworks for LLM-powered unstructured data processing. However, these frameworks focus on reducing cost when executing user-specified operations using LLMs, rather than improving accuracy, executing most operations as-is. This is problematic for complex tasks and data, where LLM outputs for user-defined operations are often inaccurate, even with optimized prompts. We present DocETL, a system that optimizes complex document processing pipelines, while accounting for LLM shortcomings. DocETL offers a declarative interface for users to define such pipelines and uses an agent-based framework to automatically optimize them, leveraging novel agent-based rewrites (that we call rewrite directives) and an optimization and evaluation framework that we introduce. We introduce (i) logical rewriting of pipelines, tailored for LLM-based tasks, (ii) an agent-guided plan evaluation mechanism that synthesizes and orchestrates task-specific validation prompts, and (iii) an optimization algorithm that efficiently finds promising plans, considering the time constraints of LLM-based plan generation and evaluation. Our evaluation on three different unstructured document analysis tasks demonstrates that DocETL finds plans with outputs that are 1.34 to 4.6× higher quality (e.g., more accurate, comprehensive) than well-engineered baselines, addressing a critical gap in existing declarative frameworks for unstructured data analysis. DocETL is open-source at docetl.org, and as of October 2024, has amassed over 800 GitHub Stars, with users spanning a variety of domains.
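
To make the three ingredients in the abstract more concrete (rewriting a pipeline, scoring candidate plans, and searching under a limited evaluation budget), here is a minimal toy sketch. It is not DocETL's API or algorithm: the `RewriteRule` class, the operation names, the `toy_score` heuristic, and the breadth-first search are all assumptions made for illustration. In particular, `toy_score` is a fixed stand-in for what the abstract describes as agent-synthesized validation prompts, and no LLM is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy representation: a plan is an ordered tuple of operation names.
Plan = Tuple[str, ...]


@dataclass(frozen=True)
class RewriteRule:
    """Hypothetical stand-in for a rewrite directive: swap one op for a sequence."""
    name: str
    target: str
    replacement: Tuple[str, ...]

    def apply(self, plan: Plan) -> List[Plan]:
        # Produce one rewritten plan per occurrence of the target operation.
        return [
            plan[:i] + self.replacement + plan[i + 1:]
            for i, op in enumerate(plan)
            if op == self.target
        ]


def toy_score(plan: Plan) -> float:
    # Assumption: a fixed heuristic standing in for LLM-based plan validation.
    score = 1.0
    if "split" in plan:
        score += 1.0
    if "gather_context" in plan:
        score += 0.5
    return score


def optimize(
    plan: Plan,
    rules: Sequence[RewriteRule],
    score: Callable[[Plan], float],
    max_evals: int,
) -> Tuple[Plan, float]:
    """Breadth-first search over rewritten plans, capped at max_evals scorings."""
    best, best_score = plan, score(plan)
    evals = 1
    frontier = [plan]
    seen = {plan}
    while frontier and evals < max_evals:
        current = frontier.pop(0)
        for rule in rules:
            for candidate in rule.apply(current):
                if candidate in seen or evals >= max_evals:
                    continue
                seen.add(candidate)
                candidate_score = score(candidate)
                evals += 1
                frontier.append(candidate)
                if candidate_score > best_score:
                    best, best_score = candidate, candidate_score
    return best, best_score


# Hypothetical rules: chunk a single extraction step, then add context to chunks.
RULES = [
    RewriteRule("chunk_and_merge", "extract", ("split", "extract_chunk", "merge")),
    RewriteRule("add_context", "extract_chunk", ("gather_context", "extract_with_context")),
]

if __name__ == "__main__":
    print(optimize(("extract",), RULES, toy_score, max_evals=10))
```

The `max_evals` cap reflects the abstract's point that plan generation and evaluation are themselves expensive, so the search has to stop after a bounded number of scored candidates rather than enumerate every possible rewrite.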

* 21 pages, 7 figures, 3 tables
