Building the validity foundation for interpreter certification performance testing
Thesis, posted on 29.03.2022, authored by Chao Han
Interpreter certification performance testing (ICPT) has developed rapidly over the past decade. Yet there has been very limited discussion and systematic research on enhancing the reliability and validity of high-stakes interpreter certification performance tests (ICPTs). This interdisciplinary mixed-methods research was therefore initiated to build theoretical and methodological foundations for rater-mediated ICPTs, with a special focus on test validation, construct definition, and rater/score reliability for English/Chinese ICPTs in China. Presented in a thesis-by-publication format, the research follows a multi-phase mixed-methods research (MMR) design, in which the results of each study inform and build toward the subsequent one. To begin with, given the lack of guidance on rigorous validation of ICPTs, the thesis draws upon an argument-based approach to build a validity argument for ICPTs. The validity argument can serve as a roadmap to help testers collect validity evidence. Based on the Interpreting Studies literature, two particular types of evidence are generally lacking: evidence supporting substantive score interpretations grounded in a strong construct theory, and evidence supporting test score generalizability, especially across raters. To help generate evidence justifying the substantive score interpretations intended by certification authorities in China, an interactionalist approach to construct definition is proposed and articulated for English/Chinese ICPTs. Essentially, the interactionalist construct model contends that performance consistency (i.e., interpreting performance) is a function of context (i.e., characteristics of test tasks), trait (i.e., interpreting ability), and interactions between the two. The theoretical construct model gives rise to two research questions (RQs). RQ 1: What are the characteristics of interpreting tasks in the real-life practice domain in China?
RQ 2: What is the possible interplay between the characteristics of interpreting tasks, interpreting ability, and interpreting performance quality? To address RQ 1, an exploratory qualitative diary study (n = 11) and a follow-up quantitative survey (n = 140) were conducted to generate empirical data describing the characteristics of interpreting practice in China. Main findings include that the interpreters performed a greater variety of simultaneous interpreting (SI) tasks than previously thought, and encountered a number of prominent factors contributing to SI difficulty, such as fast speech rate (FSR) and strong accent (StrA). To investigate RQ 2, a factorial repeated-measures experiment was conducted. Specifically, informed by the diary and survey findings, the experiment examined the interactions between SI tasks (characterized by FSR and StrA), strategy use (regarded as a crucial component of interpreting ability), and SI performance quality (measured by information completeness, fluency of delivery, and target language quality). In the experiment, 32 interpreters performed English-to-Chinese SI in four manipulated tasks. A crossed measurement design was then implemented in which nine trained raters assessed each performance by each interpreter on each rating dimension. Results show that 1) the speed factor had a mixed pattern of impacts on the information completeness, fluency of delivery, and target language quality of SI performance, whereas the accent factor had a consistently detrimental impact across all three dimensions; 2) the strategies of syntactic transformation and substitution were used most frequently, and while the speed factor greatly influenced the use of these two strategies, the accent factor did not; and 3) there appeared to be a general trend that the more strategies were used, the better the SI performance.
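The fully crossed design described above can be sketched in code. The condition and identifier labels below are hypothetical illustrations (the thesis does not specify them), but the counts follow the experiment: 32 interpreters, four tasks from a 2 × 2 manipulation of speech rate and accent, nine raters, and three rating dimensions.

```python
from itertools import product

# Hypothetical condition labels for the 2 x 2 repeated-measures manipulation
# of fast speech rate (FSR) and strong accent (StrA).
speeds = ["normal_rate", "fast_rate"]
accents = ["native_accent", "strong_accent"]
tasks = [f"{s}+{a}" for s, a in product(speeds, accents)]  # four SI tasks

interpreters = [f"I{i:02d}" for i in range(1, 33)]  # 32 interpreters
raters = [f"R{j}" for j in range(1, 10)]            # nine trained raters
dimensions = ["info_completeness", "fluency", "tl_quality"]

# Fully crossed: every rater scores every interpreter on every task
# and every rating dimension.
cells = list(product(interpreters, tasks, raters, dimensions))
print(len(cells))  # 32 * 4 * 9 * 3 = 3456 ratings
```

A fully crossed layout like this is what allows rater, task, and interaction effects to be separated later by multifaceted Rasch measurement or generalizability theory; a nested design would confound some of these facets.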
Finally, to help produce evidence supporting rater reliability and score generalizability (i.e., RQ 3), a methodological exploration was conducted to evaluate the utility of multifaceted Rasch measurement and generalizability theory in analyzing rater behavior, rater variability, and their effects on score dependability. The data for these analyses were the rater-generated scores from the experiment. Results indicate that although the rating design produced reliable results, one rater was problematic: s/he was not self-consistent and provided significantly biased scores to a large proportion of the interpreters. The findings also show that increasing the number of raters and/or tasks would generally improve score reliability for each rating dimension, although the relative efficiency of doing so differed across dimensions. Ultimately, the empirical and methodological findings contribute evidence to the ICPT validity argument, and their implications for ICPT design and validation are also discussed.
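The generalizability-theory reasoning behind the rater/task finding can be illustrated with a minimal decision-study (D-study) sketch. The variance components below are entirely hypothetical, not estimates from the thesis; the sketch only shows the standard projection of the generalizability coefficient for a crossed persons × raters × tasks design as the numbers of raters and tasks change.

```python
def g_coefficient(var_p, var_pr, var_pt, var_prt_e, n_raters, n_tasks):
    """Generalizability coefficient (E-rho^2) for a fully crossed
    persons x raters x tasks design, for relative (rank-order) decisions."""
    relative_error = (var_pr / n_raters
                      + var_pt / n_tasks
                      + var_prt_e / (n_raters * n_tasks))
    return var_p / (var_p + relative_error)

# Hypothetical variance components: persons, person-by-rater,
# person-by-task, and residual (person-by-rater-by-task plus error).
components = dict(var_p=0.50, var_pr=0.10, var_pt=0.15, var_prt_e=0.25)

for n_raters, n_tasks in [(3, 2), (9, 4), (15, 4)]:
    g = g_coefficient(**components, n_raters=n_raters, n_tasks=n_tasks)
    print(f"{n_raters} raters, {n_tasks} tasks -> E-rho^2 = {g:.3f}")
# 3 raters, 2 tasks -> E-rho^2 = 0.769
# 9 raters, 4 tasks -> E-rho^2 = 0.900
# 15 raters, 4 tasks -> E-rho^2 = 0.912
```

The diminishing returns visible here (nine raters to fifteen adds little) mirror the thesis's point that adding raters or tasks improves dependability at different rates depending on where the error variance sits for each rating dimension.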