Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Conrad Borchers

Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study

Apr 07, 2025

Conrad Borchers, Tianze Shou

Abstract:Large Language Models (LLMs) hold promise as dynamic instructional aids. Yet, it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS)--where student knowledge and pedagogical strategies are explicitly modeled. We propose a prompt variation framework to assess LLM-generated instructional moves' adaptivity and pedagogical soundness across 75 real-world tutoring scenarios from an ITS. We systematically remove key context components (e.g., student errors and knowledge components) from prompts to create variations of each scenario. Three representative LLMs (Llama3-8B, Llama3-70B, and GPT-4o) generate 1,350 instructional moves. We use text embeddings and randomization tests to measure how the omission of each context feature impacts the LLMs' outputs (adaptivity) and a validated tutor-training classifier to evaluate response quality (pedagogical soundness). Surprisingly, even the best-performing model only marginally mimics the adaptivity of ITS. Specifically, Llama3-70B demonstrates statistically significant adaptivity to student errors. Although Llama3-8B's recommendations receive higher pedagogical soundness scores than the other models, it struggles with instruction-following behaviors, including output formatting. By contrast, GPT-4o reliably adheres to instructions but tends to provide overly direct feedback that diverges from effective tutoring, prompting learners with open-ended questions to gauge knowledge. Given these results, we discuss how current LLM-based tutoring is unlikely to produce learning benefits rivaling known-to-be-effective ITS tutoring. Through our open-source benchmarking code, we contribute a reproducible method for evaluating LLMs' instructional adaptivity and fidelity.

* Accepted as full paper to the 26th International Conference on Artificial Intelligence in Education (AIED 2025)

Via

Access Paper or Ask Questions

Augmenting Human-Annotated Training Data with Large Language Model Generation and Distillation in Open-Response Assessment

Jan 15, 2025

Conrad Borchers, Danielle R. Thomas, Jionghao Lin, Ralph Abboud, Kenneth R. Koedinger

Figure 1 for Augmenting Human-Annotated Training Data with Large Language Model Generation and Distillation in Open-Response Assessment

Figure 2 for Augmenting Human-Annotated Training Data with Large Language Model Generation and Distillation in Open-Response Assessment

Figure 3 for Augmenting Human-Annotated Training Data with Large Language Model Generation and Distillation in Open-Response Assessment

Abstract:Large Language Models (LLMs) like GPT-4o can help automate text classification tasks at low cost and scale. However, there are major concerns about the validity and reliability of LLM outputs. By contrast, human coding is generally more reliable but expensive to procure at scale. In this study, we propose a hybrid solution to leverage the strengths of both. We combine human-coded data and synthetic LLM-produced data to fine-tune a classical machine learning classifier, distilling both into a smaller BERT model. We evaluate our method on a human-coded test set as a validity measure for LLM output quality. In three experiments, we systematically vary LLM-generated samples' size, variety, and consistency, informed by best practices in LLM tuning. Our findings indicate that augmenting datasets with synthetic samples improves classifier performance, with optimal results achieved at an 80% synthetic to 20% human-coded data ratio. Lower temperature settings of 0.3, corresponding to less variability in LLM generations, produced more stable improvements but also limited model learning from augmented samples. In contrast, higher temperature settings (0.7 and above) introduced greater variability in performance estimates and, at times, lower performance. Hence, LLMs may produce more uniform output that classifiers overfit to earlier or produce more diverse output that runs the risk of deteriorating model performance through information irrelevant to the prediction task. Filtering out inconsistent synthetic samples did not enhance performance. We conclude that integrating human and LLM-generated data to improve text classification models in assessment offers a scalable solution that leverages both the accuracy of human coding and the variety of LLM outputs.

* Manuscript accepted to the Second Workshop on Generative AI for Learning Analytics (GenAI-LA) at LAK25

Via

Access Paper or Ask Questions

Toward Sufficient Statistical Power in Algorithmic Bias Assessment: A Test for ABROCA

Jan 08, 2025

Conrad Borchers

Figure 1 for Toward Sufficient Statistical Power in Algorithmic Bias Assessment: A Test for ABROCA

Figure 2 for Toward Sufficient Statistical Power in Algorithmic Bias Assessment: A Test for ABROCA

Figure 3 for Toward Sufficient Statistical Power in Algorithmic Bias Assessment: A Test for ABROCA

Figure 4 for Toward Sufficient Statistical Power in Algorithmic Bias Assessment: A Test for ABROCA

Abstract:Algorithmic bias is a pressing concern in educational data mining (EDM), as it risks amplifying inequities in learning outcomes. The Area Between ROC Curves (ABROCA) metric is frequently used to measure discrepancies in model performance across demographic groups to quantify overall model fairness. However, its skewed distribution--especially when class or group imbalances exist--makes significance testing challenging. This study investigates ABROCA's distributional properties and contributes robust methods for its significance testing. Specifically, we address (1) whether ABROCA follows any known distribution, (2) how to reliably test for algorithmic bias using ABROCA, and (3) the statistical power achievable with ABROCA-based bias assessments under typical EDM sample specifications. Simulation results confirm that ABROCA does not match standard distributions, including those suited to accommodate skewness. We propose nonparametric randomization tests for ABROCA and demonstrate that reliably detecting bias with ABROCA requires large sample sizes or substantial effect sizes, particularly in imbalanced settings. Findings suggest that ABROCA-based bias evaluation based on sample sizes common in EDM tends to be underpowered, undermining the reliability of conclusions about model fairness. By offering open-source code to simulate power and statistically test ABROCA, this paper aims to foster more reliable statistical testing in EDM research. It supports broader efforts toward replicability and equity in educational modeling.

Via

Access Paper or Ask Questions

Combining Large Language Models with Tutoring System Intelligence: A Case Study in Caregiver Homework Support

Dec 16, 2024

Devika Venugopalan, Ziwen Yan, Conrad Borchers, Jionghao Lin, Vincent Aleven

Figure 1 for Combining Large Language Models with Tutoring System Intelligence: A Case Study in Caregiver Homework Support

Figure 2 for Combining Large Language Models with Tutoring System Intelligence: A Case Study in Caregiver Homework Support

Figure 3 for Combining Large Language Models with Tutoring System Intelligence: A Case Study in Caregiver Homework Support

Figure 4 for Combining Large Language Models with Tutoring System Intelligence: A Case Study in Caregiver Homework Support

Abstract:Caregivers (i.e., parents and members of a child's caring community) are underappreciated stakeholders in learning analytics. Although caregiver involvement can enhance student academic outcomes, many obstacles hinder involvement, most notably knowledge gaps with respect to modern school curricula. An emerging topic of interest in learning analytics is hybrid tutoring, which includes instructional and motivational support. Caregivers assert similar roles in homework, yet it is unknown how learning analytics can support them. Our past work with caregivers suggested that conversational support is a promising method of providing caregivers with the guidance needed to effectively support student learning. We developed a system that provides instructional support to caregivers through conversational recommendations generated by a Large Language Model (LLM). Addressing known instructional limitations of LLMs, we use instructional intelligence from tutoring systems while conducting prompt engineering experiments with the open-source Llama 3 LLM. This LLM generated message recommendations for caregivers supporting their child's math practice via chat. Few-shot prompting and combining real-time problem-solving context from tutoring systems with examples of tutoring practices yielded desirable message recommendations. These recommendations were evaluated with ten middle school caregivers, who valued recommendations facilitating content-level support and student metacognition through self-explanation. We contribute insights into how tutoring systems can best be merged with LLMs to support hybrid tutoring settings through conversational assistance, facilitating effective caregiver involvement in tutoring systems.

* Full research paper accepted to Learning Analytics and Knowledge (LAK 2025)

Via

Access Paper or Ask Questions

Do Tutors Learn from Equity Training and Can Generative AI Assess It?

Dec 15, 2024

Danielle R. Thomas, Conrad Borchers, Sanjit Kakarla, Jionghao Lin, Shambhavi Bhushan, Boyuan Guo, Erin Gatz, Kenneth R. Koedinger

Figure 1 for Do Tutors Learn from Equity Training and Can Generative AI Assess It?

Figure 2 for Do Tutors Learn from Equity Training and Can Generative AI Assess It?

Figure 3 for Do Tutors Learn from Equity Training and Can Generative AI Assess It?

Figure 4 for Do Tutors Learn from Equity Training and Can Generative AI Assess It?

Abstract:Equity is a core concern of learning analytics. However, applications that teach and assess equity skills, particularly at scale are lacking, often due to barriers in evaluating language. Advances in generative AI via large language models (LLMs) are being used in a wide range of applications, with this present work assessing its use in the equity domain. We evaluate tutor performance within an online lesson on enhancing tutors' skills when responding to students in potentially inequitable situations. We apply a mixed-method approach to analyze the performance of 81 undergraduate remote tutors. We find marginally significant learning gains with increases in tutors' self-reported confidence in their knowledge in responding to middle school students experiencing possible inequities from pretest to posttest. Both GPT-4o and GPT-4-turbo demonstrate proficiency in assessing tutors ability to predict and explain the best approach. Balancing performance, efficiency, and cost, we determine that few-shot learning using GPT-4o is the preferred model. This work makes available a dataset of lesson log data, tutor responses, rubrics for human annotation, and generative AI prompts. Future work involves leveling the difficulty among scenarios and enhancing LLM prompts for large-scale grading and assessment.

* Full research paper accepted to Learning Analytics and Knowledge (LAK 2025)

Via

Access Paper or Ask Questions

Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT

Dec 13, 2024

Danielle R. Thomas, Conrad Borchers, Sanjit Kakarla, Jionghao Lin, Shambhavi Bhushan, Boyuan Guo, Erin Gatz, Kenneth R. Koedinger

Figure 1 for Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT

Figure 2 for Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT

Figure 3 for Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT

Figure 4 for Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT

Abstract:The role of multiple-choice questions (MCQs) as effective learning tools has been debated in past research. While MCQs are widely used due to their ease in grading, open response questions are increasingly used for instruction, given advances in large language models (LLMs) for automated grading. This study evaluates MCQs effectiveness relative to open-response questions, both individually and in combination, on learning. These activities are embedded within six tutor lessons on advocacy. Using a posttest-only randomized control design, we compare the performance of 234 tutors (790 lesson completions) across three conditions: MCQ only, open response only, and a combination of both. We find no significant learning differences across conditions at posttest, but tutors in the MCQ condition took significantly less time to complete instruction. These findings suggest that MCQs are as effective, and more efficient, than open response tasks for learning when practice time is limited. To further enhance efficiency, we autograded open responses using GPT-4o and GPT-4-turbo. GPT models demonstrate proficiency for purposes of low-stakes assessment, though further research is needed for broader use. This study contributes a dataset of lesson log data, human annotation rubrics, and LLM prompts to promote transparency and reproducibility.

* Full research paper accepted to Learning Analytics and Knowledge (LAK 2025)

Via

Access Paper or Ask Questions

Evaluating the Impact of Data Augmentation on Predictive Model Performance

Dec 03, 2024

Valdemar Švábenský, Conrad Borchers, Elizabeth B. Cloude, Atsushi Shimada

Abstract:In supervised machine learning (SML) research, large training datasets are essential for valid results. However, obtaining primary data in learning analytics (LA) is challenging. Data augmentation can address this by expanding and diversifying data, though its use in LA remains underexplored. This paper systematically compares data augmentation techniques and their impact on prediction performance in a typical LA task: prediction of academic outcomes. Augmentation is demonstrated on four SML models, which we successfully replicated from a previous LAK study based on AUC values. Among 21 augmentation techniques, SMOTE-ENN sampling performed the best, improving the average AUC by 0.01 and approximately halving the training time compared to the baseline models. In addition, we compared 99 combinations of chaining 21 techniques, and found minor, although statistically significant, improvements across models when adding noise to SMOTE-ENN (+0.014). Notably, some augmentation techniques significantly lowered predictive performance or increased performance fluctuation related to random chance. This paper's contribution is twofold. Primarily, our empirical findings show that sampling techniques provide the most statistically reliable performance improvements for LA applications of SML, and are computationally more efficient than deep generation methods with complex hyperparameter settings. Second, the LA community may benefit from validating a recent study through independent replication.

* Published in LAK 2025 conference proceedings in the ACM Digital Library, see https://doi.org/10.1145/3706468.3706485

Via

Access Paper or Ask Questions

ABROCA Distributions For Algorithmic Bias Assessment: Considerations Around Interpretation

Nov 28, 2024

Conrad Borchers, Ryan S. Baker

Abstract:Algorithmic bias continues to be a key concern of learning analytics. We study the statistical properties of the Absolute Between-ROC Area (ABROCA) metric. This fairness measure quantifies group-level differences in classifier performance through the absolute difference in ROC curves. ABROCA is particularly useful for detecting nuanced performance differences even when overall Area Under the ROC Curve (AUC) values are similar. We sample ABROCA under various conditions, including varying AUC differences and class distributions. We find that ABROCA distributions exhibit high skewness dependent on sample sizes, AUC differences, and class imbalance. When assessing whether a classifier is biased, this skewness inflates ABROCA values by chance, even when data is drawn (by simulation) from populations with equivalent ROC curves. These findings suggest that ABROCA requires careful interpretation given its distributional properties, especially when used to assess the degree of bias and when classes are imbalanced.

* Accepted to Learning Analytics and Knowledge (LAK 2025)

Via

Access Paper or Ask Questions

3DG: A Framework for Using Generative AI for Handling Sparse Learner Performance Data From Intelligent Tutoring Systems

Jan 29, 2024

Liang Zhang, Jionghao Lin, Conrad Borchers, Meng Cao, Xiangen Hu

Abstract:Learning performance data (e.g., quiz scores and attempts) is significant for understanding learner engagement and knowledge mastery level. However, the learning performance data collected from Intelligent Tutoring Systems (ITSs) often suffers from sparsity, impacting the accuracy of learner modeling and knowledge assessments. To address this, we introduce the 3DG framework (3-Dimensional tensor for Densification and Generation), a novel approach combining tensor factorization with advanced generative models, including Generative Adversarial Network (GAN) and Generative Pre-trained Transformer (GPT), for enhanced data imputation and augmentation. The framework operates by first representing the data as a three-dimensional tensor, capturing dimensions of learners, questions, and attempts. It then densifies the data through tensor factorization and augments it using Generative AI models, tailored to individual learning patterns identified via clustering. Applied to data from an AutoTutor lesson by the Center for the Study of Adult Literacy (CSAL), the 3DG framework effectively generated scalable, personalized simulations of learning performance. Comparative analysis revealed GAN's superior reliability over GPT-4 in this context, underscoring its potential in addressing data sparsity challenges in ITSs and contributing to the advancement of personalized educational technology.

* LAK 2024: International Workshop on Generative AI for Learning Analytics (GenAI-LA)

Via

Access Paper or Ask Questions

Revealing Networks: Understanding Effective Teacher Practices in AI-Supported Classrooms using Transmodal Ordered Network Analysis

Dec 17, 2023

Conrad Borchers, Yeyu Wang, Shamya Karumbaiah, Muhammad Ashiq, David Williamson Shaffer, Vincent Aleven

Figure 1 for Revealing Networks: Understanding Effective Teacher Practices in AI-Supported Classrooms using Transmodal Ordered Network Analysis

Figure 2 for Revealing Networks: Understanding Effective Teacher Practices in AI-Supported Classrooms using Transmodal Ordered Network Analysis

Figure 3 for Revealing Networks: Understanding Effective Teacher Practices in AI-Supported Classrooms using Transmodal Ordered Network Analysis

Abstract:Learning analytics research increasingly studies classroom learning with AI-based systems through rich contextual data from outside these systems, especially student-teacher interactions. One key challenge in leveraging such data is generating meaningful insights into effective teacher practices. Quantitative ethnography bears the potential to close this gap by combining multimodal data streams into networks of co-occurring behavior that drive insight into favorable learning conditions. The present study uses transmodal ordered network analysis to understand effective teacher practices in relationship to traditional metrics of in-system learning in a mathematics classroom working with AI tutors. Incorporating teacher practices captured by position tracking and human observation codes into modeling significantly improved the inference of how efficiently students improved in the AI tutor beyond a model with tutor log data features only. Comparing teacher practices by student learning rates, we find that students with low learning rates exhibited more hint use after monitoring. However, after an extended visit, students with low learning rates showed learning behavior similar to their high learning rate peers, achieving repeated correct attempts in the tutor. Observation notes suggest conceptual and procedural support differences can help explain visit effectiveness. Taken together, offering early conceptual support to students with low learning rates could make classroom practice with AI tutors more effective. This study advances the scientific understanding of effective teacher practice in classrooms learning with AI tutors and methodologies to make such practices visible.

* Full paper accepted to Learning Analytics and Knowledge (LAK 2024)

Via

Access Paper or Ask Questions