Konstantin Grotov

Themisto: Jupyter-Based Runtime Benchmark

Apr 16, 2025

Konstantin Grotov, Sergey Titov

Abstract: In this work, we present a benchmark that consists of Jupyter notebook development trajectories and allows measuring how well large language models (LLMs) can leverage runtime information for predicting code output and generating code. We demonstrate that the current generation of LLMs performs poorly on these tasks and argue that there exists a significantly understudied domain in the development of code-based models, which involves incorporating the runtime context.

* Accepted to the third Deep Learning for Code (DL4C) workshop @ ICLR 2025

Debug Smarter, Not Harder: AI Agents for Error Resolution in Computational Notebooks

Oct 18, 2024

Konstantin Grotov, Artem Borzilov, Maksim Krivobok, Timofey Bryksin, Yaroslav Zharov

Abstract: Computational notebooks have become indispensable tools for research-related development, offering unprecedented interactivity and flexibility in the development process. However, these benefits come at the cost of reproducibility and an increased potential for bugs. With the rise of code-fluent Large Language Models empowered with agentic techniques, smart bug-fixing tools with a high level of autonomy have emerged. However, those tools are tuned for classical script programming and still struggle with non-linear computational notebooks. In this paper, we present an AI agent designed specifically for error resolution in a computational notebook. We have developed an agentic system capable of exploring a notebook environment by interacting with it, similar to how a user would, and integrated the system into Datalore, the JetBrains service for collaborative data science. We evaluate our approach against the pre-existing single-action solution by comparing costs and conducting a user study. Users rate the error resolution capabilities of the agentic system higher but experience difficulties with the UI. We share the results of the study and consider them valuable for further improving user-agent collaboration.

* Accepted to EMNLP 2024 System Demonstrations
