Madhav Krishan Garg

ReviewEval: An Evaluation Framework for AI-Generated Reviews

Feb 17, 2025

Chavvi Kirtani, Madhav Krishan Garg, Tejash Prasad, Tanmay Singhal, Murari Mandal, Dhruv Kumar

Abstract: The escalating volume of academic research, coupled with a shortage of qualified reviewers, necessitates innovative approaches to peer review. While large language models (LLMs) offer potential for automating this process, their current limitations include superficial critiques, hallucinations, and a lack of actionable insights. This research addresses these challenges by introducing a comprehensive evaluation framework for AI-generated reviews that measures alignment with human evaluations, verifies factual accuracy, assesses analytical depth, and identifies actionable insights. We also propose a novel alignment mechanism that tailors LLM-generated reviews to the unique evaluation priorities of individual conferences and journals. To enhance the quality of these reviews, we introduce a self-refinement loop that iteratively optimizes the LLM's review prompts. Our framework establishes standardized metrics for evaluating AI-based review systems, thereby bolstering the reliability of AI-generated reviews in academic research.

* Under review: 8 pages, 2 figures, 2 tables, 3 pages for appendix

"Which LLM should I use?": Evaluating LLMs for tasks performed by Undergraduate Computer Science Students in India

Jan 22, 2024

Vibhor Agarwal, Nakul Thureja, Madhav Krishan Garg, Sahiti Dharmavaram, Meghna, Dhruv Kumar

Abstract: This study evaluates the effectiveness of various large language models (LLMs) in performing tasks common among undergraduate computer science students. Although a number of research studies in the computing education community have explored the possibility of using LLMs for a variety of tasks, there is a lack of comprehensive research comparing different LLMs and evaluating which LLMs are most effective for different tasks. Our research systematically assesses some of the publicly available LLMs, such as Google Bard, ChatGPT, GitHub Copilot Chat, and Microsoft Copilot, across diverse tasks commonly encountered by undergraduate computer science students. These tasks include code generation, explanation, project ideation, content generation, class assignments, and email composition. The tasks were evaluated by junior and senior computer science students, and the evaluation provides insights into the models' strengths and limitations. This study aims to guide students in selecting suitable LLMs for any specific task and offers valuable insights on how LLMs can be used constructively by students and instructors.

* Under review
