In this paper, I apply a machine learning methodology for estimating variation in treatment effects beyond the average treatment effect (ATE). A focus on ATEs can give an incomplete or misleading impression of a program's impact, because causal effects may differ across individuals and subgroups (Athey & Imbens, 2016; Seibold et al., 2016). I therefore investigate heterogeneous treatment effects and take the analysis a step further by estimating individual treatment effects (ITEs). To my knowledge, this is the first paper to attempt such an analysis in the development economics of education. When ITEs are explored against students' characteristics, or when ITE thresholds are used to show the proportion of students who may benefit or lose from an intervention to a given degree, researchers and policymakers can take a more informed and action-oriented approach to the scalability and generalizability of results, especially when an intervention has a negative effect on some students. Such unintended consequences should be considered in any intervention, and policymakers should be informed about these undesired possibilities.
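For illustration, the sketch below estimates ITEs with a simple two-model ("T-learner") strategy, one common machine-learning approach to this problem; it is a minimal sketch, not the paper's exact estimator, and the dataset, covariates, and column names are hypothetical.

```python
# Hedged sketch: individual treatment effects (ITEs) via a T-learner.
# Data, covariates, and column names are hypothetical; the paper's own
# estimator may differ.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("rct_students.csv")          # hypothetical RCT data
X = df[["age", "grade", "baseline_score"]]    # hypothetical covariates
y = df["endline_score"]
t = df["treated"].astype(bool)

# Fit separate outcome models for treated and control students.
mu1 = RandomForestRegressor(n_estimators=500, random_state=0).fit(X[t], y[t])
mu0 = RandomForestRegressor(n_estimators=500, random_state=0).fit(X[~t], y[~t])

# ITE_i = predicted outcome under treatment minus predicted outcome under control.
ite = mu1.predict(X) - mu0.predict(X)

# Summaries of the kind discussed above: average effect, share of students
# with an estimated negative effect, and ITEs by subgroup.
print("Estimated ATE:", ite.mean())
print("Share with ITE < 0:", (ite < 0).mean())
print(df.assign(ite=ite).groupby("grade")["ite"].mean())
```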
with Ishita Ahmed et al. (EdWorkingPaper: 23-754). Annenberg Institute at Brown University.
Researchers use test outcomes to evaluate the effectiveness of education interventions across numerous randomized controlled trials (RCTs). Aggregate test data—for example, simple measures like the sum of correct responses—are compared across treatment and control groups to determine whether an intervention has had a positive impact on student achievement. We show that item-level data and psychometric analyses can provide information about treatment heterogeneity and improve the design of future experiments. We apply techniques typically used in the study of Differential Item Functioning (DIF) to examine variation in the degree to which items show treatment effects. That is, are observed treatment effects due to generalized gains on the aggregate achievement measures, or are they due to targeted gains on specific items? Based on our analysis of 7,244,566 item responses (265,732 students responding to 2,119 items) taken from 15 RCTs in low- and middle-income countries, we find clear evidence for variation in gains across items. DIF analyses identify items that are highly sensitive to the interventions—in one extreme case, a single item drives nearly 40% of the observed treatment effect—as well as items that are insensitive. We also show that the variation in item-level sensitivity can have implications for the precision of effect estimates. Of the RCTs that have significant effect estimates, 41% have patterns of item-level sensitivity to treatment that allow for the possibility of a null effect when this source of uncertainty is considered. Our findings demonstrate how researchers can gain more insight regarding the effects of interventions via additional analysis of item-level test data.
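As a rough illustration of a DIF-style screen for item-level treatment sensitivity (a sketch in the spirit of logistic-regression DIF, not the authors' exact psychometric model), one can regress each item's responses on a proficiency proxy and a treatment indicator; the data file and column names below are hypothetical, with treatment coded 0/1.

```python
# Hedged sketch: screening items for differential sensitivity to treatment.
# Data and column names are hypothetical; the paper's models may differ.
import pandas as pd
import statsmodels.formula.api as smf

resp = pd.read_csv("item_responses.csv")   # one row per student x item
resp["rest_score"] = (
    resp.groupby("student_id")["correct"].transform("sum") - resp["correct"]
)  # rest score as a crude proficiency proxy

results = {}
for item, d in resp.groupby("item_id"):
    # An item whose treatment coefficient is far from the test-wide average
    # shows unusually large (or small) gains given overall proficiency.
    fit = smf.logit("correct ~ rest_score + treated", data=d).fit(disp=0)
    results[item] = fit.params["treated"]

dif = pd.Series(results).sort_values()
print(dif.describe())   # spread of item-level sensitivity to treatment
print(dif.tail(5))      # items most sensitive to the intervention
```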
Impact evaluations of schooling reforms in developing countries typically focus on tests of student achievement that are designed and implemented by researchers. Are these tests any good? What practical and principled guidance should researchers in the field follow? We aim to answer these questions. First, we review the test design in 163 studies and show that current designs vary widely in scope, content, administration, and analysis. Consequently, magnitudes of treatment effects are not currently comparable across studies; this problem is not fixed by expressing scores in standard deviations. Second, researchers rarely engage with the appropriateness of their test design to their estimands in reported analyses. Yet, the interpretation of any estimates is necessarily sensitive to the measurement of the core variables, even where treatments are randomly assigned. Third, we provide concrete examples from published studies to highlight the principles and practicalities of appropriate test design in developing countries and present a template of recommendations for researchers to consider.
I conducted a formal meta-analysis and found that, overall, education interventions conducted in low- and middle-income countries between 2009 and 2020 have a positive and significant effect of 0.10 standard deviation units. However, I also found considerable heterogeneity across treatment effects depending on the intervention's focus and level, e.g., teacher-level vs. school-level interventions. The analysis showed household-level interventions to be the most effective for improving literacy outcomes, whereas school-level interventions were the most effective for improving mathematics test scores. Finally, as a field, we have overwhelming evidence about the effectiveness of some interventions, such as cash transfers or computer-assisted learning, whereas other types of interventions receive little research or policy attention. Specifically, I found no studies on "remedial education" or "community-based monitoring and accountability interventions." These appear to be promising areas for researchers to design future interventions.
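For reference, one standard way to formalize the pooling of study-level effects described above is a random-effects model; the expression below is a generic sketch, not a restatement of the exact estimator used in the analysis.

```latex
% Hedged sketch: generic random-effects pooling of study effects;
% the meta-analysis's exact estimator may differ.
\[
\hat{\theta}_{\mathrm{RE}}
  = \frac{\sum_{i=1}^{k} w_i \,\hat{\theta}_i}{\sum_{i=1}^{k} w_i},
\qquad
w_i = \frac{1}{\hat{\sigma}_i^{2} + \hat{\tau}^{2}},
\]
where $\hat{\theta}_i$ is the standardized effect from study $i$,
$\hat{\sigma}_i^{2}$ its sampling variance, and $\hat{\tau}^{2}$ the
estimated between-study variance capturing heterogeneity across interventions.
```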
with Kristen Mattern, Justin Radunzel, and Andrew D. Ho. 2018. In Educational Measurement: Issues and Practice, 37(3), pp.11-23.
Abstract. The percentage of students retaking college admissions tests is rising. Researchers and college admissions offices currently use a variety of methods for summarizing these multiple scores. Testing organizations such as ACT and the College Board, interested in validity evidence like correlations with first‐year grade point average (FYGPA), often use the most recent test score available. In contrast, institutions report using a variety of composite scoring methods for applicants with multiple test records, including averaging and taking the maximum subtest score across test occasions (“superscoring”). We compare four scoring methods on two criteria. First, we compare correlations between scores and FYGPA by scoring method. We find them similar (r=.40). Second, we compare the extent to which test scores differentially predict FYGPA by scoring method and number of retakes. We find that retakes account for additional variance beyond standardized achievement and positively predict FYGPA across all scoring methods. Superscoring minimizes this differential prediction—although it may seem that superscoring should inflate scores across retakes, this inflation is “true” in that it accounts for the positive effects of retaking for predicting FYGPA. Future research should identify factors related to retesting and consider how they should be used in college admissions.
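For concreteness, the composite scoring methods being compared can be computed from a long table of test attempts roughly as follows; this is a hypothetical sketch with made-up column names, not ACT's or the College Board's actual scoring pipeline.

```python
# Hedged sketch: most-recent, average, and superscore composites from
# multiple test records. Column names and data are hypothetical.
import pandas as pd

attempts = pd.read_csv("act_attempts.csv")  # student_id, attempt_no,
                                            # english, math, reading, science
subjects = ["english", "math", "reading", "science"]

# Most recent attempt: keep the last record per student.
recent = (attempts.sort_values("attempt_no")
                  .groupby("student_id")[subjects].last())

# Average of each subtest across attempts.
average = attempts.groupby("student_id")[subjects].mean()

# Superscore: maximum of each subtest across attempts.
superscore = attempts.groupby("student_id")[subjects].max()

composites = pd.DataFrame({
    "recent": recent.mean(axis=1),
    "average": average.mean(axis=1),
    "superscore": superscore.mean(axis=1),
    "n_retakes": attempts.groupby("student_id").size() - 1,
})
print(composites.head())
```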
with Jonathan P. Weeks. 2018. In Psychological Assessment, 30(3), p.328.
Abstract. Achievement estimates are often based on either number correct scores or IRT-based ability parameters. Van der Linden (2007) and other researchers (e.g., Fox, Klein Entink, & van der Linden, 2007; Ranger, 2013) have developed psychometric models that allow for joint estimation of speed and item parameters using both response times and response data. This paper presents an application of this type of approach to a battery of 4 types of fluid reasoning measures, administered to a large sample of highly educated examinees. We investigate the extent to which incorporation of response times in ability estimates can be used to inform the potential development of shorter test forms. In addition to exploratory analyses and response time data visualizations, we specifically consider the increase in precision of ability estimates given the addition of response time data relative to use of item responses alone. Our findings indicate that there may be instances where test forms can be substantially shortened without any reduction in score reliability, when response time information is incorporated into the item response model.
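One common formalization of this joint approach is the lognormal response-time component of van der Linden's hierarchical framework; it is shown here as a general reference, not necessarily the exact specification fit in the paper.

```latex
% Hedged sketch: lognormal response-time model commonly paired with an IRT
% model in van der Linden's framework; the paper's specification may differ.
\[
\ln T_{ij} \;\sim\; \mathcal{N}\!\left(\beta_i - \tau_j,\; \alpha_i^{-2}\right),
\]
where $T_{ij}$ is person $j$'s response time on item $i$, $\tau_j$ the person's
speed, $\beta_i$ the item's time intensity, and $\alpha_i$ a time-discrimination
parameter; speed $\tau_j$ and ability $\theta_j$ are then linked through a joint
person-level (hierarchical) distribution, which is what lets response times
sharpen the ability estimates.
```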
with Patrick Kyllonen et al. 2019. In Behavior Research Methods, 51(2), pp.507-522.
Abstract. The validity of studies investigating interventions to enhance fluid intelligence (Gf) depends on the adequacy of the Gf measures administered. Such studies have yielded mixed results, with a suggestion that Gf measurement issues may be partly responsible. The purpose of this study was to develop a Gf test battery comprising tests meeting the following criteria: (a) strong construct validity evidence, based on prior research; (b) reliable and sensitive to change; (c) varying in item types and content; (d) producing parallel tests, so that pretest–posttest comparisons could be made; (e) appropriate time limits; (f) unidimensional, to facilitate interpretation; and (g) appropriate in difficulty for a high-ability population, to detect change. A battery comprising letter, number, and figure series and figural matrix item types was developed and evaluated in three large-N studies (N = 3,067, 2,511, and 801, respectively). Items were generated algorithmically on the basis of proven item models from the literature, to achieve high reliability at the targeted difficulty levels. An item response theory approach was used to calibrate the items in the first two studies and to establish conditional reliability targets for the tests and the battery. On the basis of those calibrations, fixed parallel forms were assembled for the third study, using linear programming methods. Analyses showed that the tests and test battery achieved the proposed criteria. We suggest that the battery as constructed is a promising tool for measuring the effectiveness of cognitive enhancement interventions, and that its algorithmic item construction enables tailoring the battery to different difficulty targets, for even wider applications.
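The linear-programming assembly step mentioned above can be formalized generically as a 0-1 program that matches each form's information function to a target; this is a sketch of the standard formulation, not necessarily the exact program solved for these parallel forms.

```latex
% Hedged sketch: generic 0-1 linear program for assembling a fixed form whose
% information matches a target at selected ability points; the constraints
% used for the actual parallel forms may differ.
\[
\min_{x,\,y}\; y
\quad\text{subject to}\quad
\Bigl|\,\textstyle\sum_{i=1}^{I} I_i(\theta_k)\,x_i - T(\theta_k)\Bigr| \le y
\;\;\text{for all } k,
\qquad
\textstyle\sum_{i=1}^{I} x_i = n,
\qquad
x_i \in \{0,1\},
\]
where $x_i$ indicates whether item $i$ is selected, $I_i(\theta_k)$ is its
Fisher information at ability point $\theta_k$, $T(\theta_k)$ is the target
information, and $n$ is the form length; item-type and content requirements
enter as additional linear constraints.
```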
with G. Tanner Jackson, Andreas Oranje, and Elizabeth Owen. 2015. In International Conference on Artificial Intelligence in Education (pp. 545-549). Springer, Cham.
with Kristen E. DiCerbo, Shonté Stephenson, Yue Jia, Robert J. Mislevy, Malcolm Bauer, and G. Tanner Jackson. 2015. In Serious Games Analytics (pp. 319-342). Springer, Cham.