Abstract:Evaluating text comprehension in educational settings is critical for understanding student performance and improving curricular effectiveness. This study investigates the capability of state-of-the-art language models-RoBERTa Base, Bangla-BERT, and BERT Base-in automatically assessing Bangla passage-based question-answering from the National Curriculum and Textbook Board (NCTB) textbooks for classes 6-10. A dataset of approximately 3,000 Bangla passage-based question-answering instances was compiled, and the models were evaluated using F1 Score and Exact Match (EM) metrics across various hyperparameter configurations. Our findings revealed that Bangla-BERT consistently outperformed the other models, achieving the highest F1 (0.75) and EM (0.53) scores, particularly with smaller batch sizes, the inclusion of stop words, and a moderate learning rate. In contrast, RoBERTa Base demonstrated the weakest performance, with the lowest F1 (0.19) and EM (0.27) scores under certain configurations. The results underscore the importance of fine-tuning hyperparameters for optimizing model performance and highlight the potential of machine learning models in evaluating text comprehension in educational contexts. However, limitations such as dataset size, spelling inconsistencies, and computational constraints emphasize the need for further research to enhance the robustness and applicability of these models. This study lays the groundwork for the future development of automated evaluation systems in educational institutions, providing critical insights into model performance in the context of Bangla text comprehension.