Name
Tackling Item Performance Prediction: Applications of AI
Brad Bolender
Description

Accurately predicting item difficulty and other key item performance metrics, such as discrimination and differential item functioning, has long been a goal of testing organizations seeking to reduce the resources devoted to pretesting newly developed items. Traditionally, these predictions have relied on statistical methods, such as linear regression models built on handcrafted features like item length, word frequency, and syntactic complexity. More recently, natural language processing techniques have expanded these approaches, allowing automated extraction of linguistic features and limited modeling of semantic content. Machine learning has advanced the field further by capturing nonlinear relationships and interaction effects among features, improving predictive accuracy over traditional methods. Innovations in AI are now opening possibilities for even richer, more nuanced prediction frameworks that incorporate both surface and latent characteristics of test items.
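
As a concrete point of reference for the traditional approach described above, the sketch below shows how handcrafted features such as item length and word frequency might feed a simple linear regression predicting item difficulty. The feature definitions, example items, frequency table, and difficulty values are hypothetical placeholders for illustration only; they are not data, features, or code from the study.

    # Minimal sketch of a feature-based difficulty regression (illustrative only).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def handcrafted_features(item_text, word_freq_table):
        words = item_text.lower().split()
        length = len(words)                                   # item length
        mean_freq = np.mean([word_freq_table.get(w, 1e-6) for w in words])  # word frequency
        mean_word_len = np.mean([len(w) for w in words])      # crude complexity proxy
        return [length, mean_freq, mean_word_len]

    # Hypothetical calibration data: item texts with known difficulty values.
    items = ["Select the word closest in meaning to 'arduous'.",
             "Choose the synonym of 'happy'."]
    difficulties = [0.8, -1.2]
    freq_table = {"happy": 0.01, "arduous": 0.0001}           # placeholder corpus frequencies

    X = np.array([handcrafted_features(t, freq_table) for t in items])
    model = LinearRegression().fit(X, difficulties)

    # Predict difficulty of a new, unadministered item.
    new_item = "Identify the antonym of 'obscure'."
    predicted_b = model.predict([handcrafted_features(new_item, freq_table)])[0]
    print(f"Predicted difficulty: {predicted_b:.2f}")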

This presentation will share results from a study applying advanced AI methods to predict item difficulty, discrimination, and differential item functioning, using a rich dataset from a statewide assessment scaling study administered by Iowa Testing Programs at the University of Iowa. The dataset includes vocabulary test items administered to a cohort of approximately 2,000 students across 9th, 10th, and 11th grades, with overlapping items repeated across years to allow robust longitudinal analysis. Several methods will be discussed that are being used to explore the potential for AI to replace or support traditional statistical and NLP/ML approaches in prediction models targeting the performance of new, unadministered test items. These methods have been developed specifically to help foresee problematic item characteristics, enhance item information, and reduce the number of administrations required to gain a sufficient understanding of item difficulty in preparation for forms assembly or computerized adaptive testing (CAT) selection.

The presentation will highlight comparative results across several modeling approaches, with a focus on three key evaluation criteria. First, models will be assessed on their ability to recover underlying two-parameter logistic (2PL) item parameters, providing insight into how well each method captures fundamental psychometric properties. Second, predictive accuracy for relative item difficulty will be evaluated, reflecting how effectively models can rank new, unadministered items in terms of expected challenge for examinees. Third, models will be examined on their ability to predict item performance across multiple overall achievement levels within the same sample of examinees. Together, these results will provide a comprehensive view of the potential for advanced AI techniques to strengthen item development processes and assessment design.
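
For context on the first criterion, the standard 2PL item response function, in which \theta_i denotes examinee ability, a_j item discrimination, and b_j item difficulty, is

    P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-a_j(\theta_i - b_j)]}

Recovery of the a_j and b_j parameters is what the first criterion evaluates; the second criterion concerns only the ordering of items by b_j-like difficulty estimates.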

The session will conclude with time for open discussion and questions, encouraging dialogue about the challenges and opportunities of applying advanced AI methods to item prediction in operational testing contexts. This research addresses a critical need for more efficient, accurate, and equitable approaches to item development, with the potential to reduce costs, accelerate test construction timelines, and improve the quality and fairness of large-scale assessments. By showcasing both methodological innovations and practical applications, the presentation will offer valuable insights for researchers, psychometricians, and assessment program leaders working to modernize their practices in a rapidly evolving field.

Session Type
Presentation
Session Area
Education, Certification/Licensure, Workforce Skills Credentialing
Primary Topic
Innovation in Assessment