Mohammad Mahdi Mohajeri

DEEPQUESTION: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance

May 30, 2025

Ali Khoramfar, Ali Ramezani, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi, Heshaam Faili

Abstract: LLMs often excel on standard benchmarks but falter on real-world tasks. We introduce DeepQuestion, a scalable automated framework that augments existing datasets based on Bloom's taxonomy and creates novel questions that trace original solution paths to probe evaluative and creative skills. Extensive experiments across ten open-source and proprietary models, covering both general-purpose and reasoning LLMs, reveal substantial performance drops (up to 70% accuracy loss) on higher-order tasks, underscoring persistent gaps in deep reasoning. Our work highlights the need for cognitively diverse benchmarks to advance LLM progress. DeepQuestion and related datasets will be released upon acceptance of the paper.

Optimizing Alignment with Less: Leveraging Data Augmentation for Personalized Evaluation

Dec 10, 2024

Javad Seraj, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi

Abstract: Automatic evaluation by large language models (LLMs) is a prominent topic today; however, judgment and evaluation tasks are often subjective and influenced by various factors, making adaptation challenging. While many studies demonstrate the capabilities of state-of-the-art proprietary LLMs in comparison to human evaluators, these models often struggle to adapt to reference evaluators over time, a requirement for achieving personalized judgment. Additionally, numerous works have attempted to apply open LLMs as judges or evaluators, but these efforts frequently overlook the limitations of working with scarce data. Personalized judgment is inherently associated with limited-data scenarios, which are common in many real-world problems. Our work presents a data augmentation technique to select more effective samples from limited data in order to align an open LLM with human preference. It achieves approximately 7% improvement in Pearson correlation with a reference judge over the baseline, and 30% improvement over the base model (Llama3.1-8B-Instruct) in the mathematical reasoning evaluation task, demonstrating that augmenting and selecting more effective preference data enables our approach to surpass baseline methods.
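The abstract reports agreement with a reference judge as a Pearson correlation. As a minimal sketch of how such a number could be computed, the Python below scores agreement between two lists of ratings. The function name `pearson` and the example scores are hypothetical and chosen only for illustration; they are not data or code from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings (illustrative only): a reference judge and an LLM judge
# scoring the same five answers on a 1-5 scale.
reference_scores = [1, 2, 3, 4, 5]
model_scores = [2, 2, 3, 5, 4]
agreement = pearson(reference_scores, model_scores)
```

A value near 1 means the model's ratings rise and fall with the reference judge's; a value near 0 means little linear relationship.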

CoCoP: Enhancing Text Classification with LLM through Code Completion Prompt

Nov 13, 2024

Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi

Abstract: Text classification is a fundamental task in natural language processing (NLP), and large language models (LLMs) have demonstrated their capability to perform this task across various domains. However, the performance of LLMs heavily depends on the quality of their input prompts. Recent studies have also shown that LLMs exhibit remarkable results in code-related tasks. To leverage the capabilities of LLMs in text classification, we propose the Code Completion Prompt (CoCoP) method, which transforms the text classification problem into a code completion task. CoCoP significantly improves text classification performance across diverse datasets by utilizing LLMs' code-completion capability. For instance, CoCoP enhances accuracy on the SST2 dataset by more than 20%. Moreover, when CoCoP is integrated with LLMs specifically designed for code-related tasks (code models), such as CodeLLaMA, this method demonstrates better or comparable performance to few-shot learning techniques while using only one-tenth of the model size. The source code of our proposed method will be available to the public upon the acceptance of the paper.
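The idea of recasting classification as code completion can be sketched as follows. This is an illustrative guess at what such a prompt might look like, not the authors' actual template: the function names `build_cocop_prompt` and `parse_completion`, the variable naming scheme, and the example sentences are all assumptions. No model is called; the sketch only builds the prompt and parses a completion string.

```python
def build_cocop_prompt(demos, query, task_comment="Sentiment classification"):
    """Render labeled demonstrations and one unlabeled query as Python source.

    The last line is an unfinished assignment, so a code-capable LLM would be
    expected to complete it with a label string. (Hypothetical prompt layout.)
    """
    lines = [f"# {task_comment}"]
    for i, (text, label) in enumerate(demos, start=1):
        # repr() (via !r) quotes and escapes the strings as valid Python literals.
        lines.append(f"text_{i} = {text!r}")
        lines.append(f"label_{i} = {label!r}")
        lines.append("")
    n = len(demos) + 1
    lines.append(f"text_{n} = {query!r}")
    lines.append(f"label_{n} = ")
    return "\n".join(lines)


def parse_completion(completion, labels):
    """Map a raw completion to one of the allowed labels, or None if no match."""
    cleaned = completion.strip().strip("\"'").lower()
    for label in labels:
        if cleaned.startswith(label.lower()):
            return label
    return None


# Hypothetical demonstrations and query, for illustration only.
demos = [("great movie", "positive"), ("dull plot", "negative")]
prompt = build_cocop_prompt(demos, "a joy to watch")
```

The resulting `prompt` would be sent to a completion-style model, and the text it returns would be passed to `parse_completion` together with the allowed label set.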
