Johannes Dorn

A Methodology for Evaluating RAG Systems: A Case Study On Configuration Dependency Validation

Oct 11, 2024

Sebastian Simon, Alina Mailach, Johannes Dorn, Norbert Siegmund

Abstract: Retrieval-augmented generation (RAG) is an umbrella term for a range of components, design decisions, and domain-specific adaptations that enhance the capabilities of large language models and counter their limitations regarding hallucination and outdated or missing knowledge. Since it is unclear which design decisions lead to satisfactory performance, developing RAG systems is often experimental and needs to follow a systematic methodology to yield sound and reliable results. However, there is currently no generally accepted methodology for RAG evaluation, despite growing interest in this technology. In this paper, we propose a first blueprint of a methodology for a sound and reliable evaluation of RAG systems and demonstrate its applicability to a real-world software engineering research task: the validation of configuration dependencies across software technologies. In summary, we make two contributions: (i) a novel, reusable methodological design for evaluating RAG systems, including a demonstration that serves as a guideline, and (ii) a RAG system, developed following this methodology, that achieves the highest accuracy in the field of dependency validation. The key insights from the blueprint's demonstration are the crucial role of choosing appropriate baselines and metrics, the necessity of systematic RAG refinements derived from qualitative failure analysis, and the reporting of key design decisions to foster replication and evaluation.
