Does Slipping Matter? Applicability of the Item Response Theory Four-Parameter Logistic Model in Taiwan Student Achievement Tests

林奕宏

Title	Does Slipping Matter? Applicability of the Item Response Theory Four-Parameter Logistic Model in Taiwan Student Achievement Tests
Original title	粗心有差嗎？試題反應理論四參數模式於我國學生成就測驗之適用性
Author	林奕宏
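Chinese abstract (translated)	Multiple-choice items are currently the dominant test format. The three-parameter logistic model (3PLM) of item response theory, which is commonly used to analyze multiple-choice items, can estimate examinees' guessing behavior. The four-parameter logistic model (4PLM) builds on the 3PLM and additionally estimates examinees' careless (slipping) behavior, so it accounts for the effects of both guessing and carelessness on responses and is closer to testing practice. However, published studies applying the 4PLM are rare in Taiwan. Is the 4PLM unsuited to Taiwan's student achievement tests? This study therefore uses Taiwan test data to examine the applicability of the 4PLM and compares the estimation results of the 4PLM and the 3PLM to characterize the 4PLM, as a reference for researchers who wish to apply it. The study has two parts, an empirical analysis and a simulation analysis. (1) Empirical analysis: the model fit of the 4PLM was significantly better than that of the 3PLM; the 4PLM described the empirical data better and is applicable to the testing context in Taiwan. Compared with the 3PLM, the 4PLM yielded item fit indices closer to the expected value of 1.0, larger estimates and standard errors for discrimination and guessing, smaller estimates and standard errors for difficulty, and larger standard errors for ability, but it was less prone to underestimating the ability of high-ability examinees. For the slipping parameter, which is unique to the 4PLM, larger estimates were associated with smaller standard errors. (2) Simulation analysis: under the 4PLM, the mean bias and mean root mean square error (RMSE) of discrimination, difficulty, and guessing were all smaller; for ability, the mean RMSE was significantly smaller; and the slipping results were reasonable. These results demonstrate the function of the slipping parameter and the practicality of the 4PLM. The study offers three recommendations: (1) choose an appropriate data analysis model to obtain more precise results; (2) investigate the role of the slipping parameter; (3) compare the efficiency and characteristics of different estimation methods for the 4PLM.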
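The relationship between the two models described above can be made concrete with a minimal sketch of the item response function. It assumes the common textbook form of the 4PLM, in which the slipping parameter acts as an upper asymptote that is expected to lie close to 1.0; the symbol names `a`, `b`, `c`, `d` and the function name are illustrative and do not come from the abstract, and no scaling constant is applied.

```python
import math

def p_correct_4pl(theta, a, b, c, d=1.0):
    """Probability of a correct response under an assumed 4PL form.

    theta: examinee ability
    a: discrimination, b: difficulty
    c: lower asymptote (guessing)
    d: upper asymptote (slipping parameter); d = 1.0 gives the 3PL form
    """
    if not 0.0 <= c < d <= 1.0:
        raise ValueError("expected 0 <= c < d <= 1")
    # Logistic function written with tanh so extreme abilities do not overflow.
    logistic = 0.5 * (1.0 + math.tanh(0.5 * a * (theta - b)))
    return c + (d - c) * logistic

# Hypothetical item: a very able examinee tops out near d, not near 1.0.
p_3pl = p_correct_4pl(theta=4.0, a=1.5, b=0.0, c=0.2)          # 3PL special case
p_4pl = p_correct_4pl(theta=4.0, a=1.5, b=0.0, c=0.2, d=0.96)  # with slipping
```

With `d` below 1.0, an incorrect answer from a high-ability examinee is less surprising to the model, which is one way to read the reported tendency of the 3PLM to underestimate high-ability examinees.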
English abstract	The multiple-choice item format is widely used in student achievement tests at different educational levels. Three factors may determine whether an examinee answers a multiple-choice item correctly: true ability, guessing, and unexpected errors. Research on guessing behavior in multiple-choice items has long been an important issue in educational measurement, and “unexpected errors,” namely “carelessness,” have also received increasing attention. When examinees show carelessness, testing agencies may obtain biased estimates of examinee ability and item parameters, which may further influence test fairness and related educational decision making. In item response theory (IRT), the prevailing approach to carelessness is to add a slipping parameter to the three-parameter logistic model (3PLM), resulting in the four-parameter logistic model (4PLM), so as to estimate the probability that high-ability examinees unexpectedly answer easy items incorrectly. Because the 4PLM involves a complex parameter estimation procedure and appropriate estimation software has been limited, its application has been restricted. In recent years, due to advances in parameter estimation techniques and computer hardware and software, the 4PLM has gradually attracted scholarly attention. However, published studies applying the 4PLM in Taiwan remain uncommon. Can student achievement tests in Taiwan be fitted with the 4PLM? This study argues that carelessness is not rare in Taiwan student achievement tests and that applying the 4PLM is reasonable; thus, related research should not be absent. In order to provide suggestions for practitioners, this study fits empirical datasets from Taiwan with the 4PLM and compares the data analysis results of the 4PLM and 3PLM to demonstrate the properties of the 4PLM. This study includes an empirical analysis and a simulation evaluation.
For the empirical analysis, one high-stakes and one low-stakes dataset were analyzed to compare model fit between the 4PLM and 3PLM. The high-stakes dataset was a random sample of 5,000 examinees from the 2021 Comprehensive Assessment Program for junior high school students (CAP), including 41 multiple-choice items from the English reading test. The low-stakes dataset was from the 2014 Taiwan Assessment of Student Achievement (TASA), including 7,405 eleventh-grade examinees and 28 multiple-choice items from the Chinese test. Model fit indices included the likelihood ratio test (LRT), Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and bias-corrected AIC (AICc). All parameters of the 4PLM and 3PLM were estimated using the IRTBEMM package in R. For the simulation evaluation, this study generated multiple datasets using the 4PLM parameter estimates from the empirical analysis as true values, and then analyzed the simulated data with both the 4PLM and 3PLM. Bias and root mean square error (RMSE) were used to evaluate the differences between average parameter estimates and the true values. The results, which compare the 4PLM with the 3PLM, are as follows: (1) Reliability: The reliability coefficients of both empirical datasets were above .70; the CAP English test tended to have higher reliability than the TASA Chinese test, possibly due to the number of items; (2) Model fit: The LRT results indicated that the 4PLM, with more parameters, significantly reduced the chi-square values (p < .001 for both datasets); AIC, BIC, and AICc were also lower for the 4PLM, and the differences from the 3PLM were all larger than 10.
Overall, the 4PLM fits the empirical datasets better than the 3PLM, indicating that the 4PLM more appropriately described the empirical data in this study; (3) Empirical parameter comparison: The item fit indices (infit and outfit mean-square) were similar between the 3PLM and 4PLM and all fell within the reasonable range of 0.5 to 1.5; however, the average item fit indices of the 4PLM tended to be closer to the expected value 1.0. Regarding discrimination, the average estimates and standard errors (SE) under the 4PLM were significantly larger than those under the 3PLM (p < .05). Regarding difficulty, the estimates and SE of difficulty parameters from the two models were significantly correlated (p < .05), but the 4PLM tended to yield smaller estimates and SE than the 3PLM. Regarding guessing, the guessing estimates and SE were often significantly correlated across models, but the 4PLM tended to yield larger estimates and SE than the 3PLM. Regarding slipping, the average slipping estimate in the CAP English test (0.96) was closer to the expectation that the slipping parameter “should be close to 1.0” and the suggestion that it “should not be smaller than 0.9,” whereas the average slipping estimate in the TASA Chinese test (0.86) was closer to previously reported averages in related studies. Regarding ability, the 4PLM ability distribution tended to be closer to 0 than the 3PLM, and the 3PLM tended to underestimate high-ability examinees compared to the 4PLM; (4) Simulation evaluation: For discrimination, the average bias and RMSE of the 4PLM were significantly smaller than those of the 3PLM (p < .05). For difficulty, when the test data contained non-negligible slipping effects, the average bias and RMSE of difficulty estimates under the 4PLM tended to be smaller than those under the 3PLM. For guessing, the absolute mean bias and mean RMSE under the 4PLM tended to be smaller than those under the 3PLM. 
For slipping, the average RMSE of the slipping parameter was close to that of the guessing parameter and was comparable to the magnitude reported in previous research. For ability, the mean RMSE under the 4PLM was significantly smaller than that under the 3PLM (p < .05). These results indicate that the slipping parameter is functional and the 4PLM is an applicable data analysis model for Taiwan student achievement tests. The conclusions of this study include: (1) The 4PLM is applicable to Taiwan test data: The 4PLM fit the empirical datasets significantly better than the 3PLM because the slipping parameter can explain examinees’ carelessness that the 3PLM cannot explain, thereby improving model fit; (2) The empirical results of the 4PLM differ from those of the 3PLM: Compared to the 3PLM, the 4PLM produced item fit indices closer to the expected value 1.0; larger discrimination and guessing estimates and SE; smaller difficulty estimates and SE; larger SE of ability estimates but less tendency to underestimate high-ability examinees; and for the slipping parameter, larger estimates were associated with smaller SE; (3) The simulation results of the 4PLM showed smaller errors: Compared to the 3PLM, the 4PLM yielded smaller mean bias and RMSE for discrimination, difficulty, and guessing; ability estimates showed significantly smaller mean RMSE; and slipping estimates were within a reasonable range and consistent with prior studies.
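The bias and RMSE criteria used in the simulation evaluation can be illustrated with a short sketch. It assumes one parameter with a known true value and a list of estimates from repeated simulated datasets; the function name and the numbers are hypothetical and are not the study's data or code.

```python
import math

def bias_and_rmse(estimates, true_value):
    """Return (bias, RMSE) of replicated estimates around a known true value."""
    n = len(estimates)
    if n == 0:
        raise ValueError("need at least one estimate")
    errors = [est - true_value for est in estimates]
    bias = sum(errors) / n                                   # mean signed error
    rmse = math.sqrt(sum(err * err for err in errors) / n)   # root mean squared error
    return bias, rmse

# Hypothetical discrimination estimates from three replications, true value 1.0.
bias, rmse = bias_and_rmse([0.9, 1.1, 1.3], true_value=1.0)
```

Bias keeps the sign of the error, so over- and underestimates can cancel, whereas RMSE also reflects the spread of the estimates; this is why the abstract reports both.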
The suggestions of this study include: (1) Choose an appropriate data analysis model to obtain more precise results: The empirical datasets analyzed in this study implied non-negligible carelessness effects beyond the commonly used 3PLM; thus, the 4PLM was a better-fitting model, and in practice, the selection of a data analysis model should be based on examinees’ response processes to obtain more precise results; (2) Investigate the influence and effect of the slipping parameter: The 4PLM tended to be less likely to underestimate high-ability examinees, and larger slipping estimates were associated with smaller SE; further investigating the instructional meaning of the slipping parameter and its relations with other parameters is a feasible direction for future research; (3) Compare the efficiency and properties of different 4PLM parameter estimation algorithms: Future studies may hold other conditions constant and compare different estimation methods for the 4PLM to provide evidence for researchers when selecting estimation tools.
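For readers who want to reproduce the kind of model comparison summarized in the abstract, the following sketch computes the fit indices from a model's maximized log-likelihood. It uses the standard definitions of AIC, BIC, and AICc and the usual likelihood ratio statistic; the function names, the log-likelihood values, and the choice to count only item parameters (3 or 4 per item) are assumptions for illustration, not figures from the study.

```python
import math

def information_criteria(loglik, n_params, n_obs):
    """AIC, BIC, and small-sample corrected AIC (AICc) from a log-likelihood."""
    deviance = -2.0 * loglik
    aic = deviance + 2.0 * n_params
    bic = deviance + n_params * math.log(n_obs)
    aicc = aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    return {"AIC": aic, "BIC": bic, "AICc": aicc}

def likelihood_ratio(loglik_reduced, k_reduced, loglik_full, k_full):
    """Likelihood ratio statistic and its degrees of freedom for nested models."""
    return 2.0 * (loglik_full - loglik_reduced), k_full - k_reduced

# Hypothetical log-likelihoods for a 41-item test taken by 5,000 examinees.
n_items, n_examinees = 41, 5000
fit_3pl = information_criteria(-112500.0, 3 * n_items, n_examinees)
fit_4pl = information_criteria(-112300.0, 4 * n_items, n_examinees)
lr_stat, lr_df = likelihood_ratio(-112500.0, 3 * n_items, -112300.0, 4 * n_items)
```

Lower values of each criterion favor a model, and the abstract treats differences larger than 10 as meaningful. The statistic from `likelihood_ratio` would be referred to a chi-square distribution with `lr_df` degrees of freedom to obtain a p-value.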
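Pages	1-40
Keywords	four-parameter logistic model (4PLM), achievement test, slipping, item response theory
Journal	教育與心理研究 (Journal of Education & Psychology)
Issue	March 2026 (Vol. 49, No. 1)
Publisher	College of Education, National Chengchi University
DOI	10.53106/102498852026034901001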