Name
Evolving High Stakes English Language Tests: Engaging Stakeholders in Research and Development
Sarah Hughes
Description

As the landscape of language testing purposes continues to evolve across Europe and beyond, and the use of automated and AI systems becomes ever more commonplace, it is crucial to understand the perspectives of those who engage with these assessments. In particular, understanding test taker perspectives not only provides unique user experience insights, but is also essential to provide validity evidence for regulatory purposes and to increase public trust and confidence in high stakes assessments.

This session will concentrate on a research study exploring test taker perspectives and experiences of potential revisions to a global high stakes English language test. This work contributes to a wider program of research designed not only to capture a broad range of evidence to validate and support these potential revisions, but also to understand how the revisions would impact test takers and score users, and to develop resources to support them. In creating this program of research, we sought to assemble an extensive validity argument (Kane, 2006) from multiple strands of research evidence to support any possible revisions and developments to the test.

The English language test at the centre of this research was launched in response to demand from higher education, governments and professional bodies for a test to measure the English communication skills of international students and economic migrants securely and accurately. The purpose of the test is to measure test takers’ English language competency in listening, reading, speaking and writing for academic and skilled migration purposes. The foundation of the test was and continues to be the Common European Framework of Reference for Languages (CEFR, Council of Europe, 2001). Key changes to the CEFR promote more integration of mediation and interaction competencies into language learning and assessment. It has therefore been an ongoing research focus to develop new item types and scoring rubrics that complement and expand the linguistic competencies of the test in line with CEFR enhancements.

Of all the key stakeholders involved in the assessment process, test takers are the only ones to experience the test first-hand (Nevo, 1995) and can therefore be considered the most important stakeholders (Rea-Dickins, 1997). In line with contemporary validity theory, we seek to evaluate technical components of the test (e.g. how suitable the test is for its intended purposes) as well as to explore its social dimensions, evoking Messick’s (1989) argument that validity encompasses the value implications and social consequences of testing outcomes. The outcomes of this test have far-reaching implications for test takers, and it is important to understand how any potential revisions may affect the test-taking experience.

The primary aim of the study was to gather insights into test takers’ backgrounds, motivations, and their experiences of two potential new speaking items requiring extended spoken responses, designed to reflect real-world language use. By examining these perspectives, we aim to evaluate the face validity of the potential new items and ensure that they are fit for their intended purposes and meet the needs and expectations of test takers. Participants were asked about the authenticity of the tasks in relation to real-life language use, item demand and user experience. We also sought to explore secondary themes to elicit a more detailed and rounded understanding of test taker perspectives on test-level feedback, test preparation practices and, most notably, attitudes to the use of AI scoring in high-stakes language testing.

Utilizing a mixed-methods approach, this research combines quantitative data from surveys (n=633) with qualitative insights from semi-structured interviews (n=20). A survey instrument was built to capture high-level feedback on test taker experiences of the test, test preparation activities and attitudes to the use of AI in high-stakes testing. The survey was then followed up with in-depth test taker interviews to gather more detailed feedback, specifically on the potential new item types, and to elaborate further on the topics covered in the survey. Efforts were made to ensure that participants invited to interview represented a range of opinions (based on their survey responses) regarding the perceived relevance of task types and skills. Thematic analysis was performed on the data to identify emergent themes, first from the surveys and then from the interviews.

The findings highlighted the diverse backgrounds and motivations of test takers, emphasising the importance of understanding their experiences to enhance the validity and reliability of the test. Feedback on the potential new speaking item types indicated that these tasks were perceived as relevant and reflective of real-life scenarios, thereby supporting the test’s face validity. Moreover, the study revealed a general trust in AI scoring systems, with test takers appreciating the objectivity and fairness they offer compared with human-only scoring. However, there was a consensus on the need for a balanced approach, combining AI and human oversight to ensure accuracy and reliability, particularly for speaking tasks. Overall, the research underscores the significance of incorporating test taker feedback into the test development process to maintain a robust and trustworthy assessment tool. This study was also supplemented by other sources of validity evidence, including outcomes from field testing and findings from score concordance and standards research.

While evolution is necessary in language testing, it is imperative for testing organisations to thoughtfully and responsibly support test takers and score users in times of transition. This session will demonstrate how testing organisations can incorporate stakeholder perspectives in research and development activities to ensure their views are represented and their needs are supported when considering potential changes to highly consequential English language tests.

Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.

Nevo, B. (1995). Examinee feedback questionnaire: Reliability and validity measures. Educational and Psychological Measurement, 55(3), 499.

Rea-Dickins, P. (1997). So, why do we need relationships with stakeholders in language testing? A view from the UK. Language Testing, 14(3), 304–314.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education/Macmillan.
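
Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.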

Session Type
Presentation
Session Area
Education
Primary Topic
Candidate Experience