Akira Kawabata

Rationale-Aware Answer Verification by Pairwise Self-Evaluation

Oct 07, 2024

Akira Kawabata, Saku Sugawara

Abstract: Answer verification identifies correct solutions among candidates generated by large language models (LLMs). Current approaches typically train verifier models by labeling solutions as correct or incorrect based solely on whether the final answer matches the gold answer. However, this approach overlooks flawed rationales in solutions that nonetheless reach the correct answer, undermining the verifier's ability to distinguish between sound and flawed rationales. We empirically show that in StrategyQA, only 19% of LLM-generated solutions with correct answers have valid rationales, which leads to an unreliable verifier. Furthermore, we demonstrate that training a verifier on valid rationales significantly improves its ability to distinguish between valid and flawed rationales. To build a better verifier without extra human supervision, we introduce REPS (Rationale Enhancement through Pairwise Selection), a method for selecting valid rationales from candidates by iteratively applying pairwise self-evaluation using the same LLM that generates the solutions. Verifiers trained on solutions selected by REPS outperform those trained using conventional training methods on three reasoning benchmarks (ARC-Challenge, DROP, and StrategyQA). Our results suggest that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers, which would be critical for models assisting humans in solving complex reasoning tasks.

* EMNLP 2024
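
The iterative pairwise selection described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the abstract does not say how pairs are formed or how many rounds are run, so the single-elimination scheme below is an assumption, and `pairwise_select`, `prefer`, and `toy_prefer` are hypothetical names. In REPS the comparison would be a self-evaluation call to the same LLM that generated the solutions; here a toy rule stands in for it.

```python
import random
from typing import Callable, List


def pairwise_select(
    candidates: List[str],
    prefer: Callable[[str, str], str],
    seed: int = 0,
) -> str:
    """Return the candidate that survives repeated pairwise comparisons.

    Assumption: single-elimination rounds with random pairing. The
    abstract only says pairwise self-evaluation is applied iteratively.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    rng = random.Random(seed)
    pool = list(candidates)
    while len(pool) > 1:
        rng.shuffle(pool)
        winners = []
        # Compare adjacent pairs and keep whichever one `prefer` returns.
        for i in range(0, len(pool) - 1, 2):
            winners.append(prefer(pool[i], pool[i + 1]))
        # With an odd pool size, the unpaired candidate advances unchanged.
        if len(pool) % 2 == 1:
            winners.append(pool[-1])
        pool = winners
    return pool[0]


def toy_prefer(a: str, b: str) -> str:
    # Hypothetical stand-in for the LLM judgment: prefer the longer text.
    return a if len(a) >= len(b) else b


rationales = ["short", "a much longer rationale", "medium one"]
selected = pairwise_select(rationales, toy_prefer)
```

Under this reading, the selected solutions would then serve as positive training examples for the verifier; swapping `toy_prefer` for a real model call is the only change needed to experiment with the idea.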

Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension

Nov 30, 2023

Akira Kawabata, Saku Sugawara

Abstract: To precisely evaluate a language model's capability for logical reading comprehension, we present a dataset for testing the understanding of the rationale behind critical reasoning. For questions taken from an existing multiple-choice logical reading comprehension dataset, we crowdsource rationale texts that explain why we should select or eliminate answer options, resulting in 3,003 multiple-choice subquestions that are associated with 943 main questions. Experiments on our dataset show that recent large language models (e.g., InstructGPT) struggle to answer the subquestions even if they are able to answer the main questions correctly. We find that the models perform particularly poorly in answering subquestions written for the incorrect options of the main questions, implying that the models have a limited capability for explaining why incorrect alternatives should be eliminated. These results suggest that our dataset encourages further investigation into the critical reasoning ability of language models while focusing on the elimination process of relevant alternatives.

* Accepted to EMNLP 2023
