arXiv

Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria

June 3, 2026 · Kelsey Rainey, Jesse Roberts

Title: Using BART to Evaluate CS1 C++ Assignments via Rubric-Based Standards

Abstract: This study explores rubric-aware, multitask fine-tuning of transformer models to automate the grading of introductory C++ programming coursework. The primary objective is to generate grade predictions that mimic instructor evaluation patterns more closely than general-purpose large language models do. Using multi-semester CS1 data, student submissions are linked with numerical scores, letter-grade categories, and assignment-specific rubrics, and these elements are combined into unified sequences suitable for transformer input. The proposed method employs a BART encoder-decoder architecture adapted with LoRA and trained to predict numerical grades and grade buckets simultaneously. It adds a distribution-matching component intended to align predicted outcomes with the empirical grade distribution, a factor frequently neglected in previous research.
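As an illustration of how a joint objective with a distribution-matching term might be assembled, the following Python sketch combines a mean-absolute-error term on numeric grades, a cross-entropy term on grade buckets, and a KL-divergence penalty between the empirical grade distribution and the batch-averaged predicted distribution. The abstract does not specify the loss, so the KL form, the `dist_weight` coefficient, the function names, and the assumption that scores are scaled to [0, 1] are all hypothetical choices made for this sketch, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multitask_loss(pred_scores, true_scores, bucket_logits, true_buckets,
                   empirical_dist, dist_weight=0.1, eps=1e-12):
    """Hypothetical joint loss: MAE + cross-entropy + weighted KL penalty.

    pred_scores / true_scores: numeric grades, assumed scaled to [0, 1].
    bucket_logits: one list of logits per submission (one logit per bucket).
    true_buckets: index of the correct grade bucket for each submission.
    empirical_dist: observed share of each bucket in the training data.
    """
    n = len(pred_scores)

    # Regression head: mean absolute error on the numeric grade.
    mae = sum(abs(p - t) for p, t in zip(pred_scores, true_scores)) / n

    # Classification head: cross-entropy on the grade bucket.
    probs = [softmax(logits) for logits in bucket_logits]
    ce = -sum(math.log(p[b] + eps) for p, b in zip(probs, true_buckets)) / n

    # Distribution matching (assumed form): KL(empirical || mean prediction).
    k = len(empirical_dist)
    mean_pred = [sum(p[j] for p in probs) / n for j in range(k)]
    kl = sum(q * math.log((q + eps) / (m + eps))
             for q, m in zip(empirical_dist, mean_pred) if q > 0)

    total = mae + ce + dist_weight * kl
    return total, {"mae": mae, "ce": ce, "kl": kl}


# Toy batch: two submissions, three buckets, uniform predictions.
total, parts = multitask_loss(
    pred_scores=[0.8, 0.6],
    true_scores=[0.8, 0.6],
    bucket_logits=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    true_buckets=[2, 1],
    empirical_dist=[1 / 3, 1 / 3, 1 / 3],
)
```

In this toy batch the KL term is zero because the averaged prediction already matches the uniform empirical distribution. A model that predicted the majority bucket for every submission would incur a positive penalty even if its per-example accuracy were high.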

The experiments evaluate several configurations: single-task versus multitask training, hard one-hot labels versus fuzzy and boundary-based soft labels, and training with versus without rubric context. Variants based on T5 models and pairwise-pretrained architectures are also tested. The results indicate that multitask BART with boundary-based soft labels and rubric context yields lower mean absolute error and better grade-distribution alignment than single-task models, hard-label approaches, or code-only baselines. Furthermore, fully fine-tuned T5 models improve distributional fidelity, whereas pairwise pretraining decreases numeric error but reduces sensitivity to minority classes. Overall, these findings suggest that rubric-guided training focused on calibration produces grading behavior closer to that of human instructors than training optimized solely for accuracy.
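To make the contrast between hard one-hot labels and boundary-based soft labels concrete, here is a minimal Python sketch. The abstract does not define the soft-labeling scheme, so everything specific here is an assumption: the letter-grade cutoffs (60/70/80/90), the `width` parameter, and the linear rule that moves up to half of the probability mass to the adjacent bucket when a score lies close to a cutoff.

```python
BUCKETS = ["F", "D", "C", "B", "A"]
# Assumed lower cutoffs for D, C, B, A on a 0-100 scale (hypothetical).
BOUNDARIES = [60.0, 70.0, 80.0, 90.0]


def hard_label(score):
    """One-hot label: all mass on the bucket the score falls into."""
    idx = sum(score >= b for b in BOUNDARIES)
    return [1.0 if i == idx else 0.0 for i in range(len(BUCKETS))]


def boundary_soft_label(score, width=2.0):
    """Soft label that shares mass with the neighbor near a cutoff.

    A score exactly on a cutoff splits 50/50 between the two adjacent
    buckets; the shared mass falls linearly to zero at `width` points away.
    `width` should stay below half the gap between cutoffs so that at most
    one cutoff is ever in range.
    """
    label = hard_label(score)
    idx = label.index(1.0)
    for j, cutoff in enumerate(BOUNDARIES):
        distance = score - cutoff
        if abs(distance) < width:
            share = 0.5 * (1.0 - abs(distance) / width)
            # Cutoff j separates bucket j (below) from bucket j + 1 (above).
            neighbor = j if distance >= 0 else j + 1
            label[idx] -= share
            label[neighbor] += share
            break
    return label


print(hard_label(89.0))           # [0.0, 0.0, 0.0, 1.0, 0.0]
print(boundary_soft_label(89.0))  # [0.0, 0.0, 0.0, 0.75, 0.25]
print(boundary_soft_label(85.0))  # [0.0, 0.0, 0.0, 1.0, 0.0]
```

Under this scheme a submission scored 89 is treated as mostly a B with some weight on A, which penalizes the model less for predicting A than a one-hot target would. Scores far from any cutoff keep their one-hot label.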

Source: arXiv Generated at: 2026-06-03 00:00:00 UTC