IBM Research Israel
Abstract:The rise of agentic AI systems, where agents collaborate to perform diverse tasks, poses new challenges with observing, analyzing and optimizing their behavior. Traditional evaluation and benchmarking approaches struggle to handle the non-deterministic, context-sensitive, and dynamic nature of these systems. This paper explores key challenges and opportunities in analyzing and optimizing agentic systems across development, testing, and maintenance. We explore critical issues such as natural language variability and unpredictable execution flows, which hinder predictability and control, demanding adaptive strategies to manage input variability and evolving behaviors. Through our user study, we supported these hypotheses. In particular, we showed a 79% agreement that non deterministic flow of agentic systems acts as a major challenge. Finally, we validated our statements empirically advocating the need for moving beyond classical benchmarking. To bridge these gaps, we introduce taxonomies to present expected analytics outcomes and the ways to collect them by extending standard observability frameworks. Building on these foundations, we introduce and demonstrate novel approach for benchmarking of agent evaluation systems. Unlike traditional "black box" performance evaluation approaches, our benchmark is built from agent runtime logs as input, and analytics outcome including discovered flows and issues. By addressing key limitations in existing methodologies, we aim to set the stage for more advanced and holistic evaluation strategies, which could foster the development of adaptive, interpretable, and robust agentic AI systems.
Abstract:As artificial intelligence (AI) continues to rapidly advance, there is a growing demand to integrate AI capabilities into existing business applications. However, a significant gap exists between the rapid progress in AI and how slowly AI is being embedded into business environments. Deploying well-performing lab models into production settings, especially in on-premise environments, often entails specialized expertise and imposes a heavy burden of model management, creating significant barriers to implementing AI models in real-world applications. KModels leverages proven libraries and platforms (Kubeflow Pipelines, KServe) to streamline AI adoption by supporting both AI developers and consumers. It allows model developers to focus solely on model development and share models as transportable units (Templates), abstracting away complex production deployment concerns. KModels enables AI consumers to eliminate the need for a dedicated data scientist, as the templates encapsulate most data science considerations while providing business-oriented control. This paper presents the architecture of KModels and the key decisions that shape it. We outline KModels' main components as well as its interfaces. Furthermore, we explain how KModels is highly suited for on-premise deployment but can also be used in cloud environments. The efficacy of KModels is demonstrated through the successful deployment of three AI models within an existing Work Order Management system. These models operate in a client's data center and are trained on local data, without data scientist intervention. One model improved the accuracy of Failure Code specification for work orders from 46% to 83%, showcasing the substantial benefit of accessible and localized AI solutions.
Abstract:General web-based agents are increasingly essential for interacting with complex web environments, yet their performance in real-world web applications remains poor, yielding extremely low accuracy even with state-of-the-art frontier models. We observe that these agents can be decomposed into two primary components: Planning and Grounding. Yet, most existing research treats these agents as black boxes, focusing on end-to-end evaluations which hinder meaningful improvements. We sharpen the distinction between the planning and grounding components and conduct a novel analysis by refining experiments on the Mind2Web dataset. Our work proposes a new benchmark for each of the components separately, identifying the bottlenecks and pain points that limit agent performance. Contrary to prevalent assumptions, our findings suggest that grounding is not a significant bottleneck and can be effectively addressed with current techniques. Instead, the primary challenge lies in the planning component, which is the main source of performance degradation. Through this analysis, we offer new insights and demonstrate practical suggestions for improving the capabilities of web agents, paving the way for more reliable agents.