Abstract: Large Language Models (LLMs) like GPT-4o can help automate text classification tasks at low cost and scale. However, there are major concerns about the validity and reliability of LLM outputs. By contrast, human coding is generally more reliable but expensive to procure at scale. In this study, we propose a hybrid solution that leverages the strengths of both. We combine human-coded data and synthetic LLM-produced data to fine-tune a classical machine learning classifier, distilling both into a smaller BERT model. We evaluate our method on a human-coded test set as a validity measure for LLM output quality. In three experiments, we systematically vary the size, variety, and consistency of the LLM-generated samples, informed by best practices in LLM tuning. Our findings indicate that augmenting datasets with synthetic samples improves classifier performance, with optimal results achieved at an 80% synthetic to 20% human-coded data ratio. Lower temperature settings (0.3), corresponding to less variability in LLM generations, produced more stable improvements but also limited model learning from the augmented samples. In contrast, higher temperature settings (0.7 and above) introduced greater variability in performance estimates and, at times, lower performance. Hence, LLMs may produce either more uniform output, to which classifiers overfit earlier, or more diverse output that risks degrading model performance by introducing information irrelevant to the prediction task. Filtering out inconsistent synthetic samples did not enhance performance. We conclude that integrating human and LLM-generated data to improve text classification models in assessment offers a scalable solution that leverages both the accuracy of human coding and the variety of LLM outputs.
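As a rough illustration of the augmentation-and-distillation setup described above, the sketch below mixes human-coded and LLM-generated examples at roughly the reported 80/20 synthetic-to-human ratio and fine-tunes a BERT classifier with Hugging Face transformers. File names, column names, and hyperparameters are assumptions for illustration, not the study's actual pipeline.

```python
# Minimal sketch (assumed file names, column names, and hyperparameters):
# mix human-coded and LLM-generated examples at ~80% synthetic / 20% human,
# then fine-tune a BERT classifier on the combined set.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

human = pd.read_csv("human_coded.csv")        # columns: text, label (integer class id)
synthetic = pd.read_csv("llm_generated.csv")  # columns: text, label (integer class id)

# Sample synthetic rows so they make up about 80% of the training mix.
n_synth = min(len(synthetic), 4 * len(human))
mixed = (pd.concat([human, synthetic.sample(n=n_synth, random_state=0)])
           .sample(frac=1, random_state=0)    # shuffle
           .reset_index(drop=True))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=int(mixed["label"].nunique()))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_pandas(mixed).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilled-classifier",
                           num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()
```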
Abstract: Caregivers (i.e., parents and members of a child's caring community) are underappreciated stakeholders in learning analytics. Although caregiver involvement can enhance student academic outcomes, many obstacles hinder involvement, most notably knowledge gaps with respect to modern school curricula. An emerging topic of interest in learning analytics is hybrid tutoring, which includes instructional and motivational support. Caregivers assume similar roles during homework, yet it is unknown how learning analytics can support them. Our past work with caregivers suggested that conversational support is a promising way to provide caregivers with the guidance needed to effectively support student learning. We developed a system that provides instructional support to caregivers through conversational recommendations generated by a Large Language Model (LLM). To address known instructional limitations of LLMs, we draw on instructional intelligence from tutoring systems while conducting prompt engineering experiments with the open-source Llama 3 LLM. The LLM generated message recommendations for caregivers supporting their child's math practice via chat. Few-shot prompting that combined real-time problem-solving context from the tutoring system with examples of tutoring practices yielded desirable message recommendations. These recommendations were evaluated with ten middle school caregivers, who valued recommendations that facilitated content-level support and student metacognition through self-explanation. We contribute insights into how tutoring systems can best be merged with LLMs to support hybrid tutoring settings through conversational assistance, facilitating effective caregiver involvement in tutoring systems.
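To illustrate the prompting approach described above, here is a minimal sketch of how a few-shot prompt might combine real-time problem-solving context from a tutoring system with examples of tutoring practice. The system prompt, example messages, and context fields are invented placeholders, not the study's actual prompts or Llama 3 deployment.

```python
# Sketch: assemble a few-shot chat prompt that pairs live problem-solving context
# with exemplar tutoring-practice messages. All text below is illustrative.
FEW_SHOT_EXAMPLES = [
    {"context": "Student made two errors on a fraction-addition step.",
     "recommendation": "Ask your child to explain how they found the common denominator."},
    {"context": "Student solved the problem correctly on the first attempt.",
     "recommendation": "Praise the specific strategy they used, not just the right answer."},
]

def build_messages(live_context: str) -> list[dict]:
    messages = [{"role": "system",
                 "content": "You suggest short, supportive chat messages a caregiver "
                            "can send while their child practices math in a tutoring system."}]
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": f"Context: {ex['context']}"})
        messages.append({"role": "assistant", "content": ex["recommendation"]})
    messages.append({"role": "user", "content": f"Context: {live_context}"})
    return messages

# The resulting messages list would then be sent to a Llama 3 chat endpoint
# (e.g., a locally hosted open-source deployment) to generate the recommendation.
```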
Abstract: Equity is a core concern of learning analytics. However, applications that teach and assess equity skills, particularly at scale, are lacking, often due to barriers in evaluating language. Advances in generative AI via large language models (LLMs) are being used in a wide range of applications, and the present work assesses their use in the equity domain. We evaluate tutor performance within an online lesson on enhancing tutors' skills in responding to students in potentially inequitable situations. We apply a mixed-method approach to analyze the performance of 81 undergraduate remote tutors. We find marginally significant learning gains, with increases from pretest to posttest in tutors' self-reported confidence in their knowledge of responding to middle school students experiencing possible inequities. Both GPT-4o and GPT-4-turbo demonstrate proficiency in assessing tutors' ability to predict and explain the best approach. Balancing performance, efficiency, and cost, we determine that few-shot learning using GPT-4o is the preferred model. This work makes available a dataset of lesson log data, tutor responses, rubrics for human annotation, and generative AI prompts. Future work involves leveling the difficulty among scenarios and enhancing LLM prompts for large-scale grading and assessment.
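As a hedged illustration of the few-shot grading approach, the sketch below scores a tutor response against a rubric with GPT-4o via the OpenAI Python client. The rubric text, few-shot examples, and binary output format are placeholders, not the study's actual prompts.

```python
# Sketch: few-shot grading of a tutor's open response with GPT-4o.
# Rubric, examples, and labels are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

RUBRIC = ("Score 1 if the tutor's response acknowledges the potential inequity and "
          "proposes a supportive next step; otherwise score 0.")

FEW_SHOT = [
    ("I would tell the student to stop complaining.", "0"),
    ("I would ask the student how they are feeling and offer extra practice time.", "1"),
]

def grade(response_text: str) -> str:
    messages = [{"role": "system",
                 "content": f"You grade tutor responses. Rubric: {RUBRIC} Reply with 0 or 1."}]
    for resp, label in FEW_SHOT:
        messages.append({"role": "user", "content": resp})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": response_text})
    out = client.chat.completions.create(model="gpt-4o", messages=messages, temperature=0)
    return out.choices[0].message.content.strip()
```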
Abstract: The role of multiple-choice questions (MCQs) as effective learning tools has been debated in past research. While MCQs are widely used due to their ease of grading, open-response questions are increasingly used for instruction, given advances in large language models (LLMs) for automated grading. This study evaluates the effectiveness of MCQs relative to open-response questions, both individually and in combination, on learning. These activities are embedded within six tutor lessons on advocacy. Using a posttest-only randomized control design, we compare the performance of 234 tutors (790 lesson completions) across three conditions: MCQ only, open response only, and a combination of both. We find no significant learning differences across conditions at posttest, but tutors in the MCQ condition took significantly less time to complete instruction. These findings suggest that MCQs are as effective as, and more efficient than, open-response tasks for learning when practice time is limited. To further enhance efficiency, we autograded open responses using GPT-4o and GPT-4-turbo. GPT models demonstrate proficiency for purposes of low-stakes assessment, though further research is needed for broader use. This study contributes a dataset of lesson log data, human annotation rubrics, and LLM prompts to promote transparency and reproducibility.
Abstract: Large language models (LLMs) have the potential to enhance K-12 STEM education by improving both teaching and learning processes. While previous studies have shown promising results, there is still a lack of comprehensive understanding regarding how LLMs are effectively applied, specifically through prompt engineering, the process of designing prompts to generate desired outputs. To address this gap, our study investigates empirical research published between 2021 and 2024 that explores the use of LLMs combined with prompt engineering in K-12 STEM education. Following the PRISMA protocol, we screened 2,654 papers and selected 30 studies for analysis. Our review identifies the prompting strategies employed, the types of LLMs used, methods of evaluating effectiveness, and limitations in prior work. Results indicate that while simple and zero-shot prompting are commonly used, more advanced techniques such as few-shot and chain-of-thought prompting have demonstrated positive outcomes for various educational tasks. GPT-series models are predominantly used, but smaller, fine-tuned models (e.g., Blender 7B) paired with effective prompt engineering outperform prompting larger models (e.g., GPT-3) in specific contexts. Evaluation methods vary significantly, with limited empirical validation in real-world settings.
Abstract: Learning performance data, such as correct or incorrect responses to questions in Intelligent Tutoring Systems (ITSs), are crucial for tracking and assessing learners' progress and mastery of knowledge. However, the issue of data sparsity, characterized by unexplored questions and missing attempts, hampers accurate assessment and the provision of tailored, personalized instruction within ITSs. This paper proposes using the Generative Adversarial Imputation Networks (GAIN) framework to impute sparse learning performance data, reconstructed into a three-dimensional (3D) tensor representation across the dimensions of learners, questions, and attempts. Our customized GAIN-based method imputes sparse data in this 3D tensor space and is enhanced by convolutional neural networks in its input and output layers. The adaptation also uses a least-squares loss function for optimization and aligns the input and output shapes with the dimensions of the question-attempt matrices along the learner dimension. Through extensive experiments on six datasets from various ITSs, including AutoTutor, ASSISTments, and MATHia, we demonstrate that the GAIN approach generally outperforms existing methods such as tensor factorization and other generative adversarial network (GAN)-based approaches in terms of imputation accuracy. This finding supports more comprehensive learning data modeling and analytics in AI-based education.
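The sketch below illustrates, under simplifying assumptions, a GAIN-style imputation step over a (learners x questions x attempts) tensor with small convolutional generator and discriminator networks and least-squares adversarial losses, as the abstract describes. Network sizes, the hint mechanism, and hyperparameters are illustrative choices, not the paper's exact architecture.

```python
# Simplified GAIN-style imputation step for a sparse learner x question x attempt tensor.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, out_ch, 3, padding=1), nn.Sigmoid())

G = conv_block(2, 1)   # input channels: [noisy data, mask] -> imputed values
D = conv_block(2, 1)   # input channels: [imputed data, hint] -> probability of being observed

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train_step(x, m, hint_rate=0.9):
    """x: [N, 1, Q, A] performance tensor (0/1, arbitrary where missing);
       m: [N, 1, Q, A] mask (1 = observed, 0 = missing)."""
    noise = torch.rand_like(x)
    x_tilde = m * x + (1 - m) * noise                    # fill missing entries with noise
    g_out = G(torch.cat([x_tilde, m], dim=1))
    x_hat = m * x + (1 - m) * g_out                      # imputed tensor

    # Discriminator: least-squares loss toward the true observation mask.
    hint = m * (torch.rand_like(m) < hint_rate).float()
    d_prob = D(torch.cat([x_hat.detach(), hint], dim=1))
    loss_d = mse(d_prob, m)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool D on missing entries + reconstruct observed entries.
    d_prob = D(torch.cat([x_hat, hint], dim=1))
    loss_g = mse((1 - m) * d_prob, (1 - m)) + 10 * mse(m * g_out, m * x)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return x_hat.detach()
```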
Abstract: Massive Open Online Courses (MOOCs) have significantly enhanced educational accessibility by offering a wide variety of courses and breaking down traditional barriers related to geography, finance, and time. However, students often face difficulties navigating the vast selection of courses, especially when exploring new fields of study. Driven by this challenge, researchers have been exploring course recommender systems to offer tailored guidance that aligns with individual learning preferences and career aspirations. These systems face particular challenges in effectively addressing the "cold start" problem for new users. Recent advancements in recommender systems suggest integrating large language models (LLMs) into the recommendation process to enhance personalized recommendations and address the "cold start" problem. Motivated by these advancements, our study introduces RAMO (Retrieval-Augmented Generation for MOOCs), a system specifically designed to overcome the "cold start" challenges of traditional course recommender systems. The RAMO system leverages the capabilities of LLMs, along with Retrieval-Augmented Generation (RAG)-facilitated contextual understanding, to provide course recommendations through a conversational interface, aiming to enhance the e-learning experience.
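A minimal sketch of the retrieval-augmented flow behind such a system: retrieve course descriptions relevant to the learner's query and assemble them into a prompt for an LLM. TF-IDF retrieval, the course list, and the prompt wording are stand-ins; RAMO's actual retriever and prompts may differ.

```python
# Sketch of retrieval-augmented course recommendation: retrieve relevant course
# descriptions, then build a prompt for an LLM to recommend from that context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

COURSES = [
    "Introduction to Python: programming basics for beginners.",
    "Data Visualization: charts and dashboards with real datasets.",
    "Linear Algebra: vectors, matrices, and applications.",
]

vectorizer = TfidfVectorizer()
course_vecs = vectorizer.fit_transform(COURSES)

def recommend_prompt(query: str, k: int = 2) -> str:
    sims = cosine_similarity(vectorizer.transform([query]), course_vecs)[0]
    top = sims.argsort()[::-1][:k]
    context = "\n".join(COURSES[i] for i in top)
    return (f"A learner asks: {query}\n"
            f"Relevant courses:\n{context}\n"
            "Recommend the most suitable course and explain why in one sentence.")

# The returned prompt would be passed to an LLM via a chat-completion call
# to produce the conversational recommendation.
print(recommend_prompt("I want to learn programming but have no experience"))
```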
Abstract: Dialogue-based Intelligent Tutoring Systems (ITSs) have significantly advanced adaptive and personalized learning by automating sophisticated human tutoring strategies within interactive dialogues. However, replicating the nuanced patterns of expert human communication remains a challenge in Natural Language Processing (NLP). Recent advancements in NLP, particularly Large Language Models (LLMs) such as OpenAI's GPT-4, offer promising solutions by providing human-like and context-aware responses based on extensive pre-trained knowledge. Motivated by the effectiveness of LLMs in various educational tasks (e.g., content creation and summarization, problem-solving, and automated feedback provision), our study introduces the Socratic Playground for Learning (SPL), a dialogue-based ITS powered by the GPT-4 model, which employs the Socratic teaching method to foster critical thinking among learners. Through extensive prompt engineering, SPL can generate specific learning scenarios and facilitate efficient multi-turn tutoring dialogues. The SPL system aims to enhance personalized and adaptive learning experiences tailored to individual needs, specifically focusing on improving critical thinking skills. Our pilot experimental results from essay writing tasks demonstrate that SPL has the potential to improve tutoring interactions and further enhance dialogue-based ITS functionalities. Our study, exemplified by SPL, demonstrates how LLMs can enhance dialogue-based ITSs and expand the accessibility and efficacy of educational technologies.
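To make the multi-turn Socratic setup concrete, here is a minimal sketch of a dialogue loop around a chat LLM. The system prompt and the loop structure are illustrative assumptions, not SPL's actual engineered prompts.

```python
# Sketch of a multi-turn Socratic tutoring loop: keep the running dialogue in a
# messages list and ask the model for one probing question per learner turn.
from openai import OpenAI

client = OpenAI()

SOCRATIC_SYSTEM_PROMPT = (
    "You are a Socratic tutor for essay writing. Never give the answer directly; "
    "respond with one probing question that pushes the learner to examine their "
    "own reasoning, evidence, or assumptions."
)

messages = [{"role": "system", "content": SOCRATIC_SYSTEM_PROMPT}]

def socratic_turn(learner_utterance: str) -> str:
    messages.append({"role": "user", "content": learner_utterance})
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    content = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": content})
    return content
```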
Abstract: One-on-one tutoring is widely acknowledged as an effective instructional method, provided that tutors are qualified. However, the high demand for qualified tutors remains a challenge, often necessitating the training of novice tutors (i.e., trainees) to ensure effective tutoring. Research suggests that providing timely explanatory feedback can facilitate the training process for trainees. However, doing so is challenging because assessing trainee performance requires time-consuming review by human experts. Inspired by recent advancements in large language models (LLMs), our study employed the GPT-4 model to build an explanatory feedback system. The system classifies trainees' responses as correct or incorrect and automatically provides template-based feedback in which responses are appropriately rephrased by the GPT-4 model. We conducted our study on 410 responses from trainees across three training lessons: Giving Effective Praise, Reacting to Errors, and Determining What Students Know. Our findings indicate that: 1) using a few-shot approach, the GPT-4 model effectively classifies trainees' responses from the three training lessons as correct or incorrect, with an average F1 score of 0.84 and an AUC score of 0.85; and 2) using the few-shot approach, the GPT-4 model adeptly rephrases trainees' incorrect responses into the desired responses, achieving performance comparable to that of human experts.
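A rough sketch of the two-stage flow described above: classify a trainee's response with a few-shot prompt, then, for incorrect responses, insert a GPT-rephrased version of the desired response into a feedback template. The prompts, template wording, and the `call_gpt4` helper are hypothetical placeholders, not the study's actual system.

```python
# Sketch of two-stage explanatory feedback: classify, then rephrase into a template.
FEEDBACK_TEMPLATE = ("Thanks for your response. A more effective way to praise the "
                     "student might be: \"{rephrased}\"")

def call_gpt4(prompt: str) -> str:
    """Hypothetical wrapper around a GPT-4 chat-completion request."""
    raise NotImplementedError

def give_feedback(trainee_response: str, desired_response: str) -> str:
    label = call_gpt4(
        "Classify the trainee's praise as CORRECT or INCORRECT.\n"
        "Example: 'Great job figuring that out by checking your work!' -> CORRECT\n"
        "Example: 'Good.' -> INCORRECT\n"
        f"Trainee: {trainee_response} ->"
    ).strip()
    if label == "CORRECT":
        return "Well done: your praise is specific and effort-focused."
    rephrased = call_gpt4(
        "Rephrase the following desired praise so it keeps the trainee's own wording "
        f"where possible: {desired_response}"
    )
    return FEEDBACK_TEMPLATE.format(rephrased=rephrased)
```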
Abstract: Automated explanatory feedback systems play a crucial role in facilitating learning for a large cohort of learners by offering feedback that incorporates explanations, significantly enhancing the learning process. However, delivering such explanatory feedback in real time poses challenges, particularly when high classification accuracy for domain-specific, nuanced responses is essential. Our study leverages the capabilities of large language models, specifically Generative Pre-Trained Transformers (GPT), to explore a sequence labeling approach focused on identifying components of desired and less desired praise for providing explanatory feedback within a tutor training dataset. Our aim is to equip tutors with actionable, explanatory feedback during online training lessons. To investigate the potential of GPT models for providing explanatory feedback, we employed two commonly used approaches: prompting and fine-tuning. To quantify the quality of highlighted praise components identified by GPT models, we introduced a Modified Intersection over Union (M-IoU) score. Our findings demonstrate that: (1) the M-IoU score effectively correlates with human judgment in evaluating sequence quality; (2) two-shot prompting of GPT-3.5 yielded decent performance in recognizing effort-based (M-IoU of 0.46) and outcome-based praise (M-IoU of 0.68); and (3) our optimally fine-tuned GPT-3.5 model achieved M-IoU scores of 0.64 for effort-based praise and 0.84 for outcome-based praise, aligning with the satisfaction levels evaluated by human coders. Our results show promise for using GPT models to provide feedback that focuses on the specific elements of tutors' open-ended responses that are desirable or could use improvement.
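Since the abstract does not spell out the modification in M-IoU, the sketch below shows the underlying token-level intersection-over-union between a model-highlighted praise span and a human-annotated span, which the M-IoU score builds on.

```python
# Sketch of the intersection-over-union idea behind the M-IoU score: compare the
# set of token positions a model highlights as praise with the positions a human
# annotator marked. The paper's "modified" adjustments are not shown here.
def token_iou(predicted: set[int], annotated: set[int]) -> float:
    if not predicted and not annotated:
        return 1.0  # both empty: treat as perfect agreement
    return len(predicted & annotated) / len(predicted | annotated)

# Example: model highlights tokens 3-7, annotator marked tokens 4-8.
print(token_iou(set(range(3, 8)), set(range(4, 9))))  # 4 overlap / 6 union ≈ 0.67
```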