Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications, BEA 2025, Vienna, Austria, July 31 - August 1, 2025

Ekaterina Kochmar, Bashar Alhafni, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anaïs Tack, Victoria Yaneva, Zheng Yuan 0003, editors, Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications, BEA 2025, Vienna, Austria, July 31 - August 1, 2025. Association for Computational Linguistics, 2025. [doi]

Conference: bea2025

Large Language Models for Education: Understanding the Needs of Stakeholders, Current Capabilities and the Path Forward. Sankalan Pal Chowdhury, Nico Daheim, Ekaterina Kochmar, Jakub Macina, Donya Rooein, Mrinmaya Sachan, Shashank Sonkar. 1-10 [doi]

Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features. Hakyung Sung, Karla Csürös, Min-Chang Sung. 11-23 [doi]

MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks. Adrian Marius Dumitran, Mihnea Buca, Theodor Moroianu. 24-37 [doi]

Unsupervised Automatic Short Answer Grading and Essay Scoring: A Weakly Supervised Explainable Approach. Felipe Urrutia, Cristian Buc, Roberto Araya, Valentin Barrière. 38-54 [doi]

A Survey on Automated Distractor Evaluation in Multiple-Choice Tasks. Luca Benedetto, Shiva Taslimipoor, Paula Buttery. 55-69 [doi]

Alignment Drift in CEFR-prompted LLMs for Interactive Spanish Tutoring. Mina Almasi, Ross Deans Kristensen-McLachlan. 70-88 [doi]

Leveraging Generative AI for Enhancing Automated Assessment in Programming Education Contests. Stefan Dascalescu, Adrian Marius Dumitran, Mihai Alexandru Vasiluta. 89-99 [doi]

Can LLMs Effectively Simulate Human Learners? Teachers' Insights from Tutoring LLM Students. Daria Martynova, Jakub Macina, Nico Daheim, Nilay Yalcin, Xiaoyu Zhang 0014, Mrinmaya Sachan. 100-117 [doi]

Adapting LLMs for Minimal-edit Grammatical Error Correction. Ryszard Staruch, Filip Gralinski, Daniel Dzienisiewicz. 118-128 [doi]

COGENT: A Curriculum-oriented Framework for Generating Grade-appropriate Educational Content. Zhengyuan Liu, Stella Xin Yin, Dion Hoe-Lian Goh, Nancy F. Chen. 129-143 [doi]

Is Lunch Free Yet? Overcoming the Cold-Start Problem in Supervised Content Scoring using Zero-Shot LLM-Generated Training Data. Marie Bexte, Torsten Zesch. 144-159 [doi]

Transformer Architectures for Vocabulary Test Item Difficulty Prediction. Lucy Skidmore, Mariano Felice, Karen Dunn. 160-174 [doi]

Automatic concept extraction for learning domain modeling: A weakly supervised approach using contextualized word embeddings. Kordula De Kuthy, Leander Girrbach, Detmar Meurers. 175-185 [doi]

Towards a Real-time Swedish Speech Analyzer for Language Learning Games: A Hybrid AI Approach to Language Assessment. Tianyi Geng, David Alfter. 186-201 [doi]

Multilingual Grammatical Error Annotation: Combining Language-Agnostic Framework with Language-Specific Flexibility. Mengyang Qiu, Tran Minh Nguyen, Zihao Huang, Zelong Li, Yang Gu 0004, Qingyu Gao, Siliang Liu, Jungyeul Park. 202-212 [doi]

LLM-based post-editing as reference-free GEC evaluation. Robert Östling, Murathan Kurfali, Andrew Caines. 213-224 [doi]

Increasing the Generalizability of Similarity-Based Essay Scoring Through Cross-Prompt Training. Marie Bexte, Yuning Ding, Andrea Horbach. 225-236 [doi]

Automated Scoring of a German Written Elicited Imitation Test. Mihail Chifligarov, Jammila Laâguidi, Max Schellenberg, Alexander Dill, Anna Timukova, Anastasia Drackert, Ronja Laarmann-Quante. 237-247 [doi]

LLMs Protégés: Tutoring LLMs with Knowledge Gaps Improves Student Learning Outcome. Andrei Kucharavy, Cyril Vallez, Dimitri Percia David. 248-257 [doi]

LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages. Karthika NJ, Krishnakant Bhatt, Ganesh Ramakrishnan, Preethi Jyothi. 258-265 [doi]

Do LLMs Give Psychometrically Plausible Responses in Educational Assessments? Andreas Säuberli, Diego Frassinelli, Barbara Plank. 266-278 [doi]

Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison. Aymeric de Chillaz, Anna Sotnikova, Patrick Jermann, Antoine Bosselut. 279-293 [doi]

LookAlike: Consistent Distractor Generation in Math MCQs. Nisarg Parikh, Alexander Scarlatos, Nigel Fernandez, Simon Woodhead 0002, Andrew Lan. 294-311 [doi]

You Shall Know a Word's Difficulty by the Family It Keeps: Word Family Features in Personalised Word Difficulty Classifiers for L2 Spanish. Jasper Degraeuwe. 312-325 [doi]

The Need for Truly Graded Lexical Complexity Prediction. David Alfter. 326-333 [doi]

Towards Automatic Formal Feedback on Scientific Documents. Louise Bloch, Johannes Rückert, Christoph M. Friedrich. 334-344 [doi]

Don't Score too Early! Evaluating Argument Mining Models on Incomplete Essays. Nils-Jonathan Schaller, Yuning Ding, Thorben Jansen, Andrea Horbach. 345-355 [doi]

Educators' Perceptions of Large Language Models as Tutors: Comparing Human and AI Tutors in a Blind Text-only Setting. Sankalan Pal Chowdhury, Terry Jingchen Zhang, Donya Rooein, Dirk Hovy, Tanja Käser, Mrinmaya Sachan. 356-374 [doi]

Transformer-Based Real-Word Spelling Error Feedback with Configurable Confusion Sets. Torsten Zesch, Dominic Gardner, Marie Bexte. 375-383 [doi]

Automated L2 Proficiency Scoring: Weak Supervision, Large Language Models, and Statistical Guarantees. Aitor Arronte Alvarez, Naiyi Xie Fincham. 384-397 [doi]

Automatic Generation of Inference Making Questions for Reading Comprehension Assessments. Wanjing (Anya) Ma, Michael Flor, Zuowei Wang. 398-414 [doi]

Investigating Methods for Mapping Learning Objectives to Bloom's Revised Taxonomy in Course Descriptions for Higher Education. Zahra Kolagar, Frank Zalkow, Alessandra Zarcone. 415-445 [doi]

LangEye: Toward 'Anytime' Learner-Driven Vocabulary Learning From Real-World Objects. Mariana Shimabukuro, Deval Panchal, Christopher Collins 0001. 446-459 [doi]

Costs and Benefits of AI-Enabled Topic Modeling in P-20 Research: The Case of School Improvement Plans. Syeda Sabrina Akter, Seth Hunter, David Woo, Antonios Anastasopoulos. 460-476 [doi]

Advances in Auto-Grading with Large Language Models: A Cross-Disciplinary Survey. Tania Amanda Nkoyo Frederick Eneye, Chukwuebuka Fortunate Ijezue, Ahmad Imam Amjad, Maaz Amjad, Sabur Butt, Gerardo Castañeda Garza. 477-498 [doi]

Unsupervised Sentence Readability Estimation Based on Parallel Corpora for Text Simplification. Rina Miyata, Toru Urakawa, Hideaki Tamori, Tomoyuki Kajiwara. 499-504 [doi]

From End-Users to Co-Designers: Lessons from Teachers. Martina Galletti, Valeria Cesaroni. 505-516 [doi]

LLMs in alliance with Edit-based models: advancing In-Context Learning for Grammatical Error Correction by Specific Example Selection. Alexey Sorokin, Regina Nasyrova. 517-534 [doi]

Explaining Holistic Essay Scores in Comparative Judgment Assessments by Predicting Scores on Rubrics. Michiel De Vrindt, Renske Bouwer, Wim van den Noortgate, Marije Lesterhuis, Anaïs Tack. 535-548 [doi]

Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection. Chatrine Qwaider, Bashar Alhafni, Kirill Chirkunov, Nizar Habash, Ted Briscoe. 549-563 [doi]

Direct Repair Optimization: Training Small Language Models For Educational Program Repair Improves Feedback. Charles Koutcheme, Nicola Dainese, Arto Hellas. 564-581 [doi]

Analyzing Interview Questions via Bloom's Taxonomy to Enhance the Design Thinking Process. Fatemeh Kazemi Vanhari, Christopher Anand, Charles Welch. 582-593 [doi]

Estimation of Text Difficulty in the Context of Language Learning. Anisia Katinskaia, Anh-Duc Vu, Jue Hou, Ulla Vanhatalo, Yiheng Wu, Roman Yangarber. 594-611 [doi]

Are Large Language Models for Education Reliable Across Languages? Vansh Gupta, Sankalan Pal Chowdhury, Vilém Zouhar, Donya Rooein, Mrinmaya Sachan. 612-631 [doi]

Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs. Stefano Bannò, Kate M. Knill, Mark J. F. Gales. 632-646 [doi]

Advancing Question Generation with Joint Narrative and Difficulty Control. Bernardo Leite 0002, Henrique Lopes Cardoso. 647-659 [doi]

Down the Cascades of Omethi: Hierarchical Automatic Scoring in Large-Scale Assessments. Fabian Zehner, Hyo-Jeong Shin, Emily Kerzabi, Andrea Horbach, Sebastian Gombert, Frank Goldhammer, Torsten Zesch, Nico Andersen. 660-671 [doi]

Lessons Learned in Assessing Student Reflections with LLMs. Mohamed Elaraby, Diane J. Litman. 672-686 [doi]

Using NLI to Identify Potential Collocation Transfer in L2 English. Haiyin Yang, Zoey Liu, Stefanie Wulff. 687-696 [doi]

Name of Thrones: How Do LLMs Rank Student Names in Status Hierarchies Based on Race and Gender? Annabella Sakunkoo, Jonathan Sakunkoo. 697-707 [doi]

Exploring LLM-Based Assessment of Italian Middle School Writing: A Pilot Study. Adriana Mirabella, Dominique Brunato. 708-715 [doi]

Exploring task formulation strategies to evaluate the coherence of classroom discussions with GPT-4o. Yuya Asano, Beata Beigman Klebanov, Jamie N. Mikeska. 716-736 [doi]

A Bayesian Approach to Inferring Prerequisite Structures and Topic Difficulty in Language Learning. Anh-Duc Vu, Jue Hou, Anisia Katinskaia, Ching-Fan Sheu, Roman Yangarber. 737-751 [doi]

Improving In-context Learning Example Retrieval for Classroom Discussion Assessment with Re-ranking and Label Ratio Regulation. Nhat Tran, Diane J. Litman, Benjamin Pierce, Richard Correnti, Lindsay Clare Matsumura. 752-764 [doi]

Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues. Fareya Ikram, Alexander Scarlatos, Andrew Lan. 765-779 [doi]

Assessing Critical Thinking Components in Romanian Secondary School Textbooks: A Data Mining Approach to the ROTEX Corpus. Madalina Chitez, Liviu P. Dinu, Marius Micluta-Câmpeanu, Ana-Maria Bucur, Roxana Rogobete. 780-793 [doi]

Improving AI assistants embedded in short e-learning courses with limited textual content. Jacek Marciniak, Marek Kubis, Michal Gulczynski, Adam Szpilkowski, Adam Wieczarek, Marcin Szczepanski. 794-804 [doi]

Beyond Linear Digital Reading: An LLM-Powered Concept Mapping Approach for Reducing Cognitive Load. Junzhi Han, Jinho D. Choi. 805-817 [doi]

GermDetect: Verb Placement Error Detection Datasets for Learners of Germanic Languages. Noah-Manuel Michael, Andrea Horbach. 818-829 [doi]

Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems. Sahar Yarmohammadtoosky, Yiyun Zhou, Victoria Yaneva, Peter Baldwin, Saed Rezayi, Brian Clauser, Polina Harik. 830-840 [doi]

EyeLLM: Using Lookback Fixations to Enhance Human-LLM Alignment for Text Completion. Astha Singh, Mark Torrance, Evgeny Chukharev. 841-849 [doi]

Span Labeling with Large Language Models: Shell vs. Meat. Phoebe Mulcaire, Nitin Madnani. 850-859 [doi]

Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation. Kseniia Petukhova, Ekaterina Kochmar. 860-872 [doi]

Comparing Behavioral Patterns of LLM and Human Tutors: A Population-level Analysis with the CIMA Dataset. Aayush Kucheria, Nitin Sawhney, Arto Hellas. 873-881 [doi]

Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic. Zhenjiang Mao, Artem Bisliouk, Rohith Reddy Nama, Ivan Ruchkin. 882-890 [doi]

Automated Scoring of Communication Skills in Physician-Patient Interaction: Balancing Performance and Scalability. Saed Rezayi, Le An Ha, Yiyun Zhou, Andrew Houriet, Angelo D'Addario, Peter Baldwin, Polina Harik, Ann King, Victoria Yaneva. 891-897 [doi]

Decoding Actionability: A Computational Analysis of Teacher Observation Feedback. Mayank Sharma, Jason Zhang. 898-907 [doi]

EduCSW: Building a Mandarin-English Code-Switched Generation Pipeline for Computer Science Learning. Ruishi Chen, Yiling Zhao. 908-919 [doi]

STAIR-AIG: Optimizing the Automated Item Generation Process through Human-AI Collaboration for Critical Thinking Assessment. Euigyum Kim, Seewoo Li, Salah Khalil, Hyo-Jeong Shin. 920-930 [doi]

UPSC2M: Benchmarking Adaptive Learning from Two Million MCQ Attempts. Kevin Shi, Karttikeya Mangalam. 931-936 [doi]

Can GPTZero's AI Vocabulary Distinguish Between LLM-Generated and Student-Written Essays? Veronica Schmalz, Anaïs Tack. 937-952 [doi]

Paragraph-level Error Correction and Explanation Generation: Case Study for Estonian. Martin Vainikko, Taavi Kamarik, Karina Kert, Krista Liin, Silvia Maine, Kais Allkivi, Annekatrin Kaivapalu, Mark Fishel. 953-967 [doi]

End-to-End Automated Item Generation and Scoring for Adaptive English Writing Assessment with Large Language Models. Kamel Nebhi, Amrita Panesar, Hans Bantilan. 968-977 [doi]

A Framework for Proficiency-Aligned Grammar Practice in LLM-Based Dialogue Systems. Luisa Ribeiro-Flucht, Xiaobin Chen, Detmar Meurers. 978-987 [doi]

Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension? Kv Aditya Srivatsa, Kaushal Maurya, Ekaterina Kochmar. 988-1001 [doi]

LLM-Assisted, Iterative Curriculum Writing: A Human-Centered AI Approach in Finnish Higher Education. Leo Huovinen, Mika Hämäläinen. 1002-1010 [doi]

Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. Ekaterina Kochmar, Kaushal Maurya, Kseniia Petukhova, Kv Aditya Srivatsa, Anaïs Tack, Justin Vasselli. 1011-1033 [doi]

Jinan Smart Education at BEA 2025 Shared Task: Dual Encoder Architecture for Tutor Identification via Semantic Understanding of Pedagogical Conversations. Lei Chen. 1034-1039 [doi]

Wonderland_EDU@HKU at BEA 2025 Shared Task: Fine-tuning Large Language Models to Evaluate the Pedagogical Ability of AI-powered Tutors. Deliang Wang 0001, Chao Yang, Gaowei Chen. 1040-1048 [doi]

bea-jh at BEA 2025 Shared Task: Evaluating AI-powered Tutors through Pedagogically-Informed Reasoning. Jihyeon Roh, Jinhyun Bang. 1049-1059 [doi]

CU at BEA 2025 Shared Task: A BERT-Based Cross-Attention Approach for Evaluating Pedagogical Responses in Dialogue. Zhihao Lyu. 1060-1072 [doi]

BJTU at BEA 2025 Shared Task: Task-Aware Prompt Tuning and Data Augmentation for Evaluating AI Math Tutors. Yuming Fan, Chuangchuang Tan, Wenyu Song. 1073-1077 [doi]

SYSUpporter Team at BEA 2025 Shared Task: Class Compensation and Assignment Optimization for LLM-generated Tutor Identification. Longfeng Chen, Zeyu Huang, Zheng Xiao, Yawen Zeng, Jin Xu 0014. 1078-1083 [doi]

BLCU-ICALL at BEA 2025 Shared Task: Multi-Strategy Evaluation of AI Tutors. Jiyuan An, Xiang Fu, Bo Liu, Xuquan Zong, Cunliang Kong, Shuliang Liu, Shuo Wang 0013, Zhenghao Liu 0001, Liner Yang, Hanghang Fan, Erhong Yang. 1084-1097 [doi]

Phaedrus at BEA 2025 Shared Task: Assessment of Mathematical Tutoring Dialogues through Tutor Identity Classification and Actionability Evaluation. Rajneesh Tiwari, Pranshu Rastogi. 1098-1107 [doi]

Emergent Wisdom at BEA 2025 Shared Task: From Lexical Understanding to Reflective Reasoning for Pedagogical Ability Assessment. Raunak Jain, Srinivasan Rengarajan. 1108-1120 [doi]

Averroes at BEA 2025 Shared Task: Verifying Mistake Identification in Tutor, Student Dialogue. Mazen Yasser, Mariam Saeed, Hossam Elkordi, Ayman Khalafallah. 1121-1126 [doi]

SmolLab_SEU at BEA 2025 Shared Task: A Transformer-Based Framework for Multi-Track Pedagogical Evaluation of AI-Powered Tutors. Md. Abdur Rahman, Md Al-Amin, Sabik Aftahee, Muhammad Junayed, Md Ashiqur Rahman. 1127-1134 [doi]

RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation? Santiago Góngora, Ignacio Sastre, Santiago Robaina, Ignacio Remersaro, Luis Chiruzzo, Aiala Rosá. 1135-1144 [doi]

K-NLPers at BEA 2025 Shared Task: Evaluating the Quality of AI Tutor Responses with GPT-4.1. Geon Park, Jiwoo Song, Gihyeon Choi, Juoh Sun, Harksoo Kim. 1145-1163 [doi]

Henry at BEA 2025 Shared Task: Improving AI Tutor's Guidance Evaluation Through Context-Aware Distillation. Henry Pit. 1164-1172 [doi]

TBA at BEA 2025 Shared Task: Transfer-Learning from DARE-TIES Merged Models for the Pedagogical Ability Assessment of LLM-Powered Math Tutors. Sebastian Gombert, Fabian Zehner, Hendrik Drachsler. 1173-1179 [doi]

LexiLogic at BEA 2025 Shared Task: Fine-tuning Transformer Language Models for the Pedagogical Skill Evaluation of LLM-based tutors. Souvik Bhattacharyya, Billodal Roy, Niranjan M, Pranav Gupta. 1180-1186 [doi]

IALab UC at BEA 2025 Shared Task: LLM-Powered Expert Pedagogical Feature Extraction. Sofía Correa Busquets, Valentina Córdova Véliz, Jorge Baier. 1187-1193 [doi]

MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors. Baraa Hikal, Mohmaed Basem, Islam Oshallah, Ali Hamdi. 1194-1202 [doi]

TutorMind at BEA 2025 Shared Task: Leveraging Fine-Tuned LLMs and Data Augmentation for Mistake Identification. Fatima Dekmak, Christian Khairallah, Wissam Antoun. 1203-1211 [doi]

Two Outliers at BEA 2025 Shared Task: Tutor Identity Classification using DiReC, a Two-Stage Disentangled Contrastive Representation. Eduardus Tjitrahardja, Ikhlasul Akmal Hanif. 1212-1223 [doi]

Archaeology at BEA 2025 Shared Task: Are Simple Baselines Good Enough? Ana Rosu, Jany-Gabriel Ispas, Sergiu Nisioi. 1224-1241 [doi]

NLIP at BEA 2025 Shared Task: Evaluation of Pedagogical Ability of AI Tutors. Trishita Saha, Shrenik Ganguli, Maunendra Sankar Desarkar. 1242-1253 [doi]

NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors. Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal. 1254-1259 [doi]

DLSU at BEA 2025 Shared Task: Towards Establishing Baseline Models for Pedagogical Response Evaluation Tasks. Maria Monica Manlises, Mark Edward M. Gonzales, Lanz Lim. 1260-1265 [doi]

BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses. Shadman Rohan, Ishita Sur Apan, Muhtasim Ibteda Shochcho, Md Fahim, Mohammad Ashfaq Ur Rahman, A. K. M. Mahbubur Rahman, Amin Ali. 1266-1277 [doi]

Thapar Titan/s : Fine-Tuning Pretrained Language Models with Contextual Augmentation for Mistake Identification in Tutor-Student Dialogues. Harsh Dadwal, Sparsh Rastogi, Jatin Bedi. 1278-1282 [doi]
