Abstract. Impact evaluations of schooling reforms in developing countries typically focus on tests of student achievement that are designed and implemented by researchers. Are these tests any good? What practical and principled guidance should researchers in the field follow? We aim to answer these questions. First, we review the test designs used in 163 studies and show that current designs vary widely in scope, content, administration, and analysis. Consequently, magnitudes of treatment effects are not currently comparable across studies; this problem is not fixed by expressing scores in standard deviations. Second, researchers rarely engage with the appropriateness of their test design to their estimands in reported analyses. Yet the interpretation of any estimates is necessarily sensitive to the measurement of the core variables, even where treatments are randomly assigned. Third, we provide concrete examples from published studies to highlight the principles and practicalities of appropriate test design in developing countries and present a template of recommendations for researchers to consider.
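A minimal simulated sketch of the comparability point (hypothetical numbers, not drawn from any of the 163 reviewed studies): the same underlying learning gain translates into different "standard deviation" effect sizes once tests differ in reliability, which is one reason such effect sizes are not directly comparable across studies.

# Hypothetical illustration: the same true learning gain looks different in
# "standard deviation" units when the tests differ in reliability.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_gain = 0.30                               # treatment effect on the latent scale (SD = 1)

ability_c = rng.normal(0.0, 1.0, n)            # control group latent achievement
ability_t = rng.normal(true_gain, 1.0, n)      # treated group latent achievement

def observed_effect_size(reliability):
    """Treatment effect in SD units of an observed score with the given reliability."""
    noise_sd = np.sqrt(1.0 / reliability - 1.0)    # classical test theory error SD
    score_c = ability_c + rng.normal(0.0, noise_sd, n)
    score_t = ability_t + rng.normal(0.0, noise_sd, n)
    return (score_t.mean() - score_c.mean()) / score_c.std()

for rel in (0.9, 0.7, 0.5):
    print(f"reliability {rel:.1f}: effect = {observed_effect_size(rel):.2f} SD")
# A noisier test spreads out observed scores, so the same gain shrinks in SD units.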
with Kristen Mattern, Justin Radunzel, and Andrew D. Ho. 2018. In Educational Measurement: Issues and Practice, 37(3), pp.11-23.
Abstract. The percentage of students retaking college admissions tests is rising. Researchers and college admissions offices currently use a variety of methods for summarizing these multiple scores. Testing organizations such as ACT and the College Board, interested in validity evidence like correlations with first-year grade point average (FYGPA), often use the most recent test score available. In contrast, institutions report using a variety of composite scoring methods for applicants with multiple test records, including averaging and taking the maximum subtest score across test occasions ("superscoring"). We compare four scoring methods on two criteria. First, we compare correlations between scores and FYGPA by scoring method and find them to be similar (r = .40). Second, we compare the extent to which test scores differentially predict FYGPA by scoring method and number of retakes. We find that retakes account for additional variance beyond standardized achievement and positively predict FYGPA across all scoring methods. Superscoring minimizes this differential prediction; although it may seem that superscoring should inflate scores across retakes, this inflation is "true" in that it accounts for the positive effects of retaking when predicting FYGPA. Future research should identify factors related to retesting and consider how they should be used in college admissions.
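A toy sketch of how these scoring rules are formed from an applicant's multiple test records. The abstract names most-recent, average, and superscore explicitly; the highest single-sitting composite is included here only as a plausible fourth rule, and the subtest values are made up. The paper's analyses regress FYGPA on the resulting scores; this sketch covers only the score construction step.

# Composite scoring rules for one applicant with multiple test records (toy data).
# "Highest single composite" is an assumed fourth rule; the abstract names only
# most-recent, average, and superscore explicitly.
import numpy as np

# rows = test occasions, columns = subtests (e.g., English, Math, Reading, Science)
attempts = np.array([
    [24, 26, 25, 23],   # first sitting
    [26, 25, 27, 24],   # second sitting
    [25, 28, 26, 26],   # third sitting
])

most_recent = attempts[-1].mean()                 # composite from the latest sitting
average     = attempts.mean(axis=1).mean()        # mean of the per-sitting composites
highest     = attempts.mean(axis=1).max()         # best single-sitting composite (assumed rule)
superscore  = attempts.max(axis=0).mean()         # best score per subtest across sittings

print(most_recent, average, highest, superscore)  # the superscore is >= every other rule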
with Jonathan P. Weeks. 2018. In Psychological Assessment, 30(3), p.328.
Abstract. Achievement estimates are often based on either number-correct scores or IRT-based ability parameters. Van der Linden (2007) and other researchers (e.g., Fox, Klein Entink, & van der Linden, 2007; Ranger, 2013) have developed psychometric models that allow for joint estimation of speed and item parameters using both response times and response data. This paper presents an application of this type of approach to a battery of four types of fluid reasoning measures administered to a large sample of highly educated examinees. We investigate the extent to which incorporating response times in ability estimates can inform the potential development of shorter test forms. In addition to exploratory analyses and response time data visualizations, we specifically consider the increase in precision of ability estimates when response time data are added, relative to using item responses alone. Our findings indicate that there may be instances where test forms can be substantially shortened without any reduction in score reliability when response time information is incorporated into the item response model.
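A minimal simulation sketch, not the model or code from the paper, of the general idea: a 2PL model for scored responses, a lognormal model for response times (in the spirit of van der Linden, 2007), and a correlated normal prior linking ability and speed. A grid posterior compares the precision of the ability estimate with and without the response-time data; all parameter values are invented for illustration.

# Sketch: joint use of responses and response times for ability estimation.
import numpy as np

rng = np.random.default_rng(1)
J = 20                                                          # items
a, b = rng.uniform(0.8, 2.0, J), rng.normal(0, 1, J)            # 2PL discrimination, difficulty
alpha, beta = rng.uniform(1.5, 2.5, J), rng.normal(0, 0.3, J)   # RT precision, time intensity
rho = 0.5                                                       # assumed ability-speed correlation

theta_true, tau_true = 0.5, 0.3
p_true = 1 / (1 + np.exp(-a * (theta_true - b)))
x = rng.binomial(1, p_true)                                     # scored responses
log_t = beta - tau_true + rng.normal(0, 1 / alpha)              # log response times

theta_g, tau_g = np.meshgrid(np.linspace(-4, 4, 201), np.linspace(-4, 4, 201), indexing="ij")

def log_prior(th, ta):
    # bivariate standard normal prior on (ability, speed) with correlation rho
    return -0.5 * (th**2 - 2 * rho * th * ta + ta**2) / (1 - rho**2)

def loglik_resp(th):
    p = 1 / (1 + np.exp(-a * (th[..., None] - b)))
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p), axis=-1)

def loglik_rt(ta):
    z = alpha * (log_t - (beta - ta[..., None]))
    return np.sum(np.log(alpha) - 0.5 * z**2, axis=-1)

def posterior_sd_theta(include_rt):
    logp = log_prior(theta_g, tau_g) + loglik_resp(theta_g)
    if include_rt:
        logp = logp + loglik_rt(tau_g)
    w = np.exp(logp - logp.max())
    w /= w.sum()
    m = (w * theta_g).sum()
    return np.sqrt((w * (theta_g - m) ** 2).sum())

print("posterior SD of theta, responses only:", round(posterior_sd_theta(False), 3))
print("posterior SD of theta, responses + RTs:", round(posterior_sd_theta(True), 3))
# With rho > 0, the response times sharpen the speed estimate, which can in turn
# tighten the ability posterior; this is the mechanism that can support shorter forms.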
with Patrick Kyllonen et al. 2019. In Behavior Research Methods, 51(2), pp.507-522.
Abstract. The validity of studies investigating interventions to enhance fluid intelligence (Gf) depends on the adequacy of the Gf measures administered. Such studies have yielded mixed results, with a suggestion that Gf measurement issues may be partly responsible. The purpose of this study was to develop a Gf test battery comprising tests meeting the following criteria: (a) strong construct validity evidence, based on prior research; (b) reliable and sensitive to change; (c) varying in item types and content; (d) producing parallel tests, so that pretest–posttest comparisons could be made; (e) appropriate time limits; (f) unidimensional, to facilitate interpretation; and (g) appropriate in difficulty for a high-ability population, to detect change. A battery comprising letter, number, and figure series and figural matrix item types was developed and evaluated in three large-N studies (N = 3,067, 2,511, and 801, respectively). Items were generated algorithmically on the basis of proven item models from the literature, to achieve high reliability at the targeted difficulty levels. An item response theory approach was used to calibrate the items in the first two studies and to establish conditional reliability targets for the tests and the battery. On the basis of those calibrations, fixed parallel forms were assembled for the third study, using linear programming methods. Analyses showed that the tests and test battery achieved the proposed criteria. We suggest that the battery as constructed is a promising tool for measuring the effectiveness of cognitive enhancement interventions, and that its algorithmic item construction enables tailoring the battery to different difficulty targets, for even wider applications.
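A simplified sketch of the parallel-forms idea. The study assembled fixed forms with linear programming against conditional reliability targets; the greedy alternation below is only a stand-in for that step, using made-up 2PL item parameters, to show what matching two forms to a common information target looks like.

# Toy assembly of two "parallel" forms against a target information curve.
import numpy as np

rng = np.random.default_rng(2)
pool_a = rng.uniform(0.8, 2.2, 120)            # item discriminations
pool_b = rng.normal(0.8, 0.8, 120)             # item difficulties (high-ability target)
theta = np.array([0.0, 0.5, 1.0, 1.5])         # ability points where precision matters

def info(a, b):
    """2PL Fisher information of one item at each theta point."""
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

item_info = np.array([info(a, b) for a, b in zip(pool_a, pool_b)])

forms = {0: [], 1: []}
form_info = np.zeros((2, len(theta)))
target = item_info.mean(axis=0) * 20           # roughly a 20-item form's worth of information
available = set(range(len(pool_a)))

for _ in range(20):                            # 20 items per form
    for f in (0, 1):
        # pick the item that moves this form closest to the target curve
        best = min(available, key=lambda i: np.abs(form_info[f] + item_info[i] - target).sum())
        forms[f].append(best)
        form_info[f] += item_info[best]
        available.remove(best)

print("form 1 information:", np.round(form_info[0], 2))
print("form 2 information:", np.round(form_info[1], 2))
# The two curves should track each other closely, which is the operational
# meaning of "parallel forms" used for pretest-posttest comparison.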
with G. Tanner Jackson, Andreas Oranje, and Elizabeth Owen. 2015. In International Conference on Artificial Intelligence in Education (pp. 545-549). Springer, Cham.
with Kristen E. DiCerbo, Shonté Stephenson, Yue Jia, Robert J. Mislevy, Malcolm Bauer, and G. Tanner Jackson. 2015. In Serious Games Analytics (pp. 319-342). Springer, Cham.