Renan Souza

LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology

Sep 17, 2025

Renan Souza, Timothy Poteet, Brian Etz, Daniel Rosendo, Amal Gueroudji, Woong Shin, Prasanna Balaprakash, Rafael Ferreira da Silva

Abstract: Modern scientific discovery increasingly relies on workflows that process data across the Edge, Cloud, and High Performance Computing (HPC) continuum. Comprehensive and in-depth analyses of these data are critical for hypothesis validation, anomaly detection, reproducibility, and impactful findings. Although workflow provenance techniques support such analyses, at large scale, the provenance data become complex and difficult to analyze. Existing systems depend on custom scripts, structured queries, or static dashboards, limiting data interaction. In this work, we introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis. Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries. Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful LLM agent responses beyond recorded provenance.

* Paper accepted in the proceedings of the ACM/IEEE Supercomputing Conference (SC). Cite it as Renan Souza, Timothy Poteet, Brian Etz, Daniel Rosendo, Amal Gueroudji, Woong Shin, Prasanna Balaprakash, and Rafael Ferreira da Silva. 2025. LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology. In SC Workshops (WORKS)

Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability

Aug 17, 2023

Renan Souza, Tyler J. Skluzacek, Sean R. Wilkinson, Maxim Ziatdinov, Rafael Ferreira da Silva

Abstract: Modern large-scale scientific discovery requires multidisciplinary collaboration across diverse computing facilities, including High Performance Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data analysis plays a crucial role in scientific discovery, especially in the current AI era, by enabling Responsible AI development, FAIR, Reproducibility, and User Steering. However, the heterogeneous nature of science poses challenges such as dealing with multiple supporting tools, cross-facility environments, and efficient HPC execution. Building on data observability, adapter system design, and provenance, we propose MIDA: an approach for lightweight runtime Multi-workflow Integrated Data Analysis. MIDA defines data observability strategies and adaptability methods for various parallel systems and machine learning tools. With observability, it intercepts the dataflows in the background without requiring instrumentation while integrating domain, provenance, and telemetry data at runtime into a unified database ready for user steering queries. We conduct experiments showing end-to-end multi-workflow analysis integrating data from Dask and MLFlow in a real distributed deep learning use case for materials science that runs on multiple environments with up to 276 GPUs in parallel. We show near-zero overhead running up to 100,000 tasks on 1,680 CPU cores on the Summit supercomputer.

* 19th IEEE International Conference on e-Science (eScience) 2023 - Limassol, Cyprus
* 10 pages, 5 figures, 2 Listings, 42 references, Paper accepted at IEEE eScience'23

Context-aware Execution Migration Tool for Data Science Jupyter Notebooks on Hybrid Clouds

Jul 01, 2021

Renato L. F. Cunha, Lucas V. Real, Renan Souza, Bruno Silva, Marco A. S. Netto

Abstract: Interactive computing notebooks, such as Jupyter notebooks, have become a popular tool for developing and improving data-driven models. Such notebooks tend to be executed either on the user's own machine or in a cloud environment, and each approach has drawbacks and benefits. This paper presents a solution, developed as a Jupyter extension, that automatically selects which cells should be migrated to a more suitable platform for execution, and in which scenarios. We describe how we reduce the execution state of the notebook to decrease migration time, and we exploit knowledge of user interactivity patterns with the notebook to determine which blocks of cells should be migrated. Using notebooks from Earth science (remote sensing), image recognition, and handwritten digit identification (machine learning), our experiments show notebook state reductions of up to 55x and migration decisions leading to performance gains of up to 3.25x when user interactivity with the notebook is taken into consideration.

* 10 pages

Workflow Provenance in the Lifecycle of Scientific Machine Learning

Sep 30, 2020

Renan Souza, Leonardo G. Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, Rafael Brandão, Daniel Civitarese, Emilio Vital Brazil, Marcio Moreno, Patrick Valduriez (+3 more)

Abstract: Machine Learning (ML) has already fundamentally changed several businesses. More recently, it has also been profoundly impacting the computational science and engineering domains, like geoscience, climate science, and health science. In these domains, users need to perform comprehensive data analyses combining scientific data and ML models to meet critical requirements, such as reproducibility, model explainability, and experiment data understanding. However, scientific ML is multidisciplinary, heterogeneous, and affected by the physical constraints of the domain, making such analyses even more challenging. In this work, we leverage workflow provenance techniques to build a holistic view to support the lifecycle of scientific ML. We contribute (i) a characterization of the lifecycle and a taxonomy for data analyses; (ii) design principles to build this view, with a W3C PROV compliant data representation and a reference system architecture; and (iii) lessons learned after an evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946 GPUs. The experiments show that the principles enable queries that integrate domain semantics with ML models while keeping low overhead (<1%), high scalability, and an order of magnitude of query acceleration under certain workloads compared with not using our representation.

* 21 pages, 10 figures, Under review in a scientific journal since June 30th, 2020. arXiv admin note: text overlap with arXiv:1910.04223

Managing Data Lineage of O&G Machine Learning Models: The Sweet Spot for Shale Use Case

Mar 10, 2020

Raphael Thiago, Renan Souza, L. Azevedo, E. Soares, Rodrigo Santos, Wallas Santos, Max De Bayser, M. Cardoso, M. Moreno, Renato Cerqueira

Abstract: Machine Learning (ML) has increased its role, becoming essential in several industries. However, questions around training data lineage, such as "where has the dataset used to train this model come from?"; the introduction of several new pieces of data protection legislation; and the need for data governance requirements have hindered the adoption of ML models in the real world. In this paper, we discuss how data lineage can be leveraged to benefit the ML lifecycle to build ML models to discover sweet spots for shale oil and gas production, a major application in the Oil and Gas (O&G) industry.

* 2020 European Association of Geoscientists and Engineers (EAGE) Digitalization Conference and Exhibition
* Author preprint of paper accepted at the 2020 European Association of Geoscientists and Engineers (EAGE) Digitalization Conference and Exhibition

Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering

Oct 21, 2019

Renan Souza, Leonardo Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, Rafael Brandão, Daniel Civitarese, Emilio Vital Brazil, Marcio Moreno, Patrick Valduriez (+3 more)

Abstract: Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes infeasible to recreate an ML model from scratch or to explain to stakeholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle while keeping the provenance capture overhead low. To handle this problem, in this paper we contribute a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows to address the characteristics of ML and CSE, and to allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the Oil and Gas industry, along with its evaluation using 48 GPUs in parallel.

* 10 pages, 7 figures, Accepted at Workflows in Support of Large-scale Science (WORKS) co-located with the ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 2019, Denver, Colorado

A Hybrid Architecture for Multi-Party Conversational Systems

May 04, 2017

Maira Gatti de Bayser, Paulo Cavalin, Renan Souza, Alan Braz, Heloisa Candello, Claudio Pinhanez, Jean-Pierre Briot

Abstract: Multi-party Conversational Systems are systems with natural language interaction between one or more people or systems. From the moment an utterance is sent to a group to the moment a member replies to it in the group, the system must perform several activities: utterance understanding, information search, and reasoning, among others. In this paper we present the challenges of designing and building multi-party conversational systems, the state of the art, our proposed hybrid architecture using both rules and machine learning, and some insights after implementing and evaluating one in the finance domain.
