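Title
Using the Open-Source Software R to Build Plausible Values of Student Performance and Conduct Secondary Data Analysis for Large-Scale Educational Assessments from the Builder's Perspective
Parallel title
Applying the R Packages to Integrate Estimation of Plausible Values and Implementation of Secondary Data Analysis with a National Large-Scale Assessment Data from the Perspective of Builders
Author 謝進昌
Chinese abstract
With the rise of international assessments, Taiwan has launched its own large-scale educational assessments and surveys. One aim of these programmes is to collect and analyse data at scale, both to evaluate the impact of current major education policies and to support scholars' secondary data analyses. Implementation has nonetheless run into several problems: domestic databases do not release plausible values of student performance, and guidance on secondary data analysis is lacking. Moreover, although the international assessments do release plausible values, analytic needs vary, and the officially released estimates alone cannot fully support in-depth analysis. Against this background, this article takes the database builder's perspective and treats "plausible value estimation and secondary data analysis" as a two-stage procedure within one continuous analysis. Using open-source R packages and data from a domestic large-scale educational assessment, it sets out the steps from plausible value estimation (preparing the analysis data, selecting the item response model and evaluating items, constructing conditioning variables, running the latent regression, and drawing plausible values) through secondary analysis of large-scale assessment data (descriptive statistics that infer population values from the sample, multiple regression, and structural equation modelling). The article closes with several recommendations and points for further development, offered for future researchers' reference.
English abstract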

Research Motivation and Objective

International large-scale assessments (ILSA) have become core evidence infrastructures for policy and scholarship. Inspired by NAEP (USA), the National Education Association of Korea (NEAK), and the National Educational Policy Survey (Germany), Taiwan has launched TASA and TASAL (including its i-Generation extension) to measure proficiency and context at scale. Two domestic bottlenecks persist: (1) National databases rarely release plausible values (PVs), the lingua franca of ILSA-based inference; and (2) Guidance and tooling for secondary analyses under complex sampling and PV uncertainty are fragmented. Although international programmes disseminate PVs and utilities (e.g., IDB Analyzer), their latent-regression choices and missing-data strategies may not align with local analytic needs.

Positioning ourselves as database architects, we propose a coherent, end-to-end framework that integrates PV estimation with downstream secondary analysis. Using a Taiwanese large-scale English assessment, we demonstrate how open-source R packages implement the complete workflow: item response modelling and latent regression (with principled handling of missing background data), PV extraction, and design-based routines for descriptive statistics, regression, and structural equation modelling (SEM). Our aims are: (1) To articulate a transparent, modular PV-and-analysis pipeline adaptable to local requirements; and (2) To supply concrete micro-procedures and code patterns that raise the quality and consistency of national secondary analyses.

Literature Review

PVs are generated by latent-regression item response models that couple an item response model with a population model for proficiency conditional on background covariates. Their rationale is strongest under matrix-sampled designs, where students see only a subset of items; multiple posterior draws propagate measurement and imputation uncertainty, avoiding the biases of single-point scores. Design-based work on weights, replication (e.g., jackknife; modified BRR), and design effects establishes that valid inference in complex samples requires both weighting and proper variance estimation. A current tension concerns missing background data at the PV stage: international practice often uses missing-indicator codings, yet downstream analysts frequently impute. Divergent strategies can induce incoherence between PV construction and subsequent modelling, motivating a unified pipeline that treats missingness consistently across stages.

Research Methodology

Data and design. The demonstration dataset comprises 6,815 students selected via stratified two-stage cluster sampling: systematic PPS selection of schools, followed by selection of one class per school. Released variables include student/school IDs, stratification and jackknife replication variables, total and normalized weights, background items, and English item scores. Each student took only two of five domains; English employed a partially balanced incomplete block design across listening and reading items.

Stage 1: PV construction. We compare five IRT specifications in TAM and mirt, ranging from unidimensional Rasch/PCM to multidimensional 2PL/GPCM and a bifactor model (general English plus listening/reading specifics). Parameters are estimated by marginal maximum likelihood with EM; higher-dimensional models use quasi–Monte Carlo integration. Item quality is reviewed with Infit/Outfit (productive range ≈ 0.5–1.5) and gender DIF via likelihood-ratio and Wald tests (with multiplicity control). Missing background data are multiply imputed using mice with predictive mean matching to align PV construction with likely downstream strategies. Background variables are contrast-coded; principal components explaining ~80%–90% of variance form the conditioning set for the latent regression. From the posterior, M = 10 PVs are drawn for English (EAP reliability ≈ .66) and transformed to a 500/100 scale.
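
To make the variance-threshold step concrete, the following minimal sketch picks the smallest number of principal components whose cumulative share of variance reaches a target. It is plain Python, not the R packages the article uses; the function name `n_components_for`, the example eigenvalues, and the 0.80 default are illustrative assumptions rather than the author's code.

```python
def n_components_for(eigenvalues, threshold=0.80):
    """Return the smallest k such that the k largest eigenvalues
    account for at least `threshold` of the total variance."""
    if not eigenvalues:
        raise ValueError("eigenvalues must not be empty")
    total = sum(eigenvalues)
    cumulative = 0.0
    # Walk through the components from largest to smallest variance.
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues of the contrast-coded background variables.
example_eigenvalues = [4.0, 3.0, 2.0, 1.0]
k_80 = n_components_for(example_eigenvalues, threshold=0.80)
k_95 = n_components_for(example_eigenvalues, threshold=0.95)
```

A higher threshold retains more conditioning covariates, which enlarges the latent regression and raises the cost of estimation.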

Stage 2: Secondary analysis. Two routes are provided. A convenience interface (itasa, adapted from intsvy) supports weighted descriptives, group comparisons, PV pooling via Rubin’s rules, and graphics. A lower-level route uses survey and mitools to define the replicate-weighted design (JKn with strata and jackknife zones) and to pool PV-specific estimates. For SEM, lavaan.survey fits a model in which SES predicts motivational constructs (interest, self-concept, social motivation) and English proficiency; the motivational constructs in turn predict proficiency.
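
The pooling step can be sketched as follows. This is a minimal plain-Python illustration of Rubin's rules, not the itasa or mitools code itself; the function name `pool_rubin` and the example numbers are assumptions. Each PV yields one estimate and one sampling variance (for instance from replicate weights), and the pooled variance adds the between-PV variance, inflated by 1 + 1/M, to the average sampling variance.

```python
from math import sqrt


def pool_rubin(estimates, variances):
    """Combine per-PV estimates and their sampling variances.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two PVs and matching variances")
    pooled = sum(estimates) / m
    within = sum(variances) / m  # average sampling variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)  # variance across PVs
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical per-PV means and sampling variances for three PVs.
estimate, std_error = pool_rubin([500.0, 502.0, 498.0], [4.0, 4.0, 4.0])
```

Averaging the PVs for each student before analysis would discard the between-PV term, which is why the pooled standard error is larger than any single-PV standard error in this example.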

Research Results

Model selection and diagnostics. Information criteria favoured 2PL/GPCM under uni- and two-dimensional structures; the bifactor improved fit marginally at substantial computational cost. A small set of items showed elevated Outfit, but Infit remained acceptable. Gender DIF was limited (significant for a few items by LRT, not by Wald), supporting approximate invariance.

PV construction. Ten PVs were extracted using the imputed, dimension-reduced background set (~129 principal-component covariates) and standardized to 500/100, facilitating interpretability and comparability with international conventions.
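
The abstract does not state how the 500/100 standardization was computed. The sketch below shows one common approach, under the assumption that a single weighted mean and standard deviation, pooled over all PV sets, define the linear transformation; the function names and the example values are hypothetical, and the code is plain Python, not the article's R workflow.

```python
from math import sqrt


def weighted_mean_sd(values, weights):
    """Weighted mean and (population-style) weighted standard deviation."""
    total_w = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total_w
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total_w
    return mean, sqrt(var)


def rescale_pvs(pv_sets, weights, target_mean=500.0, target_sd=100.0):
    """Linearly map logit-scale PVs to a reporting scale.

    One transformation is shared by every PV set so the sets stay comparable.
    """
    pooled_values = [v for pv in pv_sets for v in pv]
    pooled_weights = [w for _ in pv_sets for w in weights]
    mean, sd = weighted_mean_sd(pooled_values, pooled_weights)
    if sd == 0:
        raise ValueError("PVs have zero variance; cannot rescale")
    return [[target_mean + target_sd * (v - mean) / sd for v in pv]
            for pv in pv_sets]


# Two hypothetical PV sets for three students, with sampling weights.
raw_pvs = [[-1.0, 0.0, 1.0], [-0.5, 0.5, 0.0]]
student_weights = [1.0, 2.0, 1.0]
scaled_pvs = rescale_pvs(raw_pvs, student_weights)
```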

Design-based descriptives. Weighted means by gender and community development level exhibited a robust gradient: students in advantaged areas outperformed peers in disadvantaged areas, and females outperformed males. Highest means occurred for females in advantaged areas; lowest for males in disadvantaged areas. Confidence intervals did not overlap. Estimates from itasa matched survey/mitools, validating the convenience interface.

Regression analyses. Using normalized “household” weights, multiple regression of English PVs on SES and motivational constructs showed SES and academic self-concept as positive, statistically significant predictors. Interest and social motivation became non-significant once SES and self-concept were included, aligning with collinearity and mediational accounts in the achievement literature. Coefficients and standard errors were essentially identical across the two routes; Rubin-pooled standard errors reflected both replication and PV uncertainty.

SEM under complex sampling. Comparing traditional ML (Satorra–Bentler corrections) with design-based pseudo-ML (jackknife replication) revealed higher standard errors and adjusted fit when accounting for the survey design (average generalized design effect rising from ≈ 1.09 to ≈ 1.45). Parameter directions were stable across estimators; conditional relative efficiency averaged ≈ 1.13. The SEM indicated strong direct and indirect associations between self-concept and English proficiency, with self-concept showing the most robust proximal link.

Discussion and Recommendations

An integrated pipeline. Treat PV estimation and secondary analysis as one workflow to reduce misalignment between how proficiency is generated and later modelled. In practice: (1) Handle missing background data consistently (avoid mixing missing-indicator PVs with imputed downstream models); (2) Document conditioning sets and transformations (e.g., PCA thresholds); and (3) Expose modular code to permit alternative imputation models and latent regressions.

Modelling choices and scalability. Rich multidimensional or bifactor IRT can capture domain structure but may yield limited practical gains at high computational cost. For routine reporting and most secondary analyses, a well-diagnosed 2PL/GPCM with rigorous item screening balances fidelity and feasibility. When many domains/subdomains must be estimated, staged estimation or domain-specific PVs with explicit linking may be preferable to monolithic high-dimensional fits.

Design-based inference is essential. Two-stage stratified cluster designs and replicate weights materially affect standard errors and fit. Analysts should default to replicate-weighted designs (jackknife, BRR variants) and Rubin’s pooling for PVs; single-PV analyses or PV averaging should be avoided because they understate uncertainty.
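
As an illustration of replicate-based variance estimation, the sketch below implements a generic stratified delete-one-PSU jackknife (often called JKn) for a weighted mean. It is a simplified plain-Python stand-in and not the survey package's implementation; the record layout (`stratum`, `psu`, `w`, `y`), the function names, and the choice to centre replicates on the full-sample estimate are assumptions. The dataset's actual jackknife-zone scheme may differ.

```python
from collections import defaultdict


def weighted_mean(rows):
    """Weighted mean of y using weights w."""
    total_w = sum(r["w"] for r in rows)
    return sum(r["w"] * r["y"] for r in rows) / total_w


def jkn_variance(rows):
    """Return (full-sample weighted mean, jackknife variance).

    For each stratum with n_h PSUs, every PSU is dropped in turn and the
    remaining PSUs in that stratum are reweighted by n_h / (n_h - 1).
    """
    psus = defaultdict(set)
    for r in rows:
        psus[r["stratum"]].add(r["psu"])
    full = weighted_mean(rows)
    variance = 0.0
    for stratum, ids in psus.items():
        n_h = len(ids)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two PSUs")
        for dropped in ids:
            replicate = []
            for r in rows:
                if r["stratum"] != stratum:
                    replicate.append(r)  # other strata are left unchanged
                elif r["psu"] != dropped:
                    replicate.append({**r, "w": r["w"] * n_h / (n_h - 1)})
            variance += (n_h - 1) / n_h * (weighted_mean(replicate) - full) ** 2
    return full, variance


# Hypothetical data: one stratum, two schools (PSUs), one student each.
sample = [
    {"stratum": "A", "psu": 1, "w": 1.0, "y": 1.0},
    {"stratum": "A", "psu": 2, "w": 1.0, "y": 3.0},
]
mean_y, var_y = jkn_variance(sample)
```

With PVs, this variance would be computed once per PV and then combined across PVs with Rubin's rules.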

Substantive insights and limits. Demonstrated patterns—area and female advantages, primacy of SES and self-concept—are consonant with the literature, lending face validity to the pipeline. Because the dataset was intentionally modified and partial, results are methodological demonstrations rather than policy findings.

Prospects. Future work should: (1) Co-design conditioning models with anticipated downstream analyses (e.g., propensity-score weighting, subgroup definitions) to minimize incoherence between PV generation and causal/comparative modelling; and (2) Integrate multi-study synthesis (e.g., IPD meta-analysis) with PV-aware, design-based estimation to scale evidence across cohorts and years while preserving complex-sample guarantees. Practically, agencies should release PVs with documented conditioning and imputation, provide total/normalized weights plus replicate schemes, and supply lightweight, auditable R functions that mirror survey/mitools calls, enabling Taiwan’s national assessments to inform policy and support high-quality secondary research.

Pages 043-088
Keywords large-scale assessment; plausible value; international large-scale assessment in education; design-based secondary data analysis; open-source software R
Journal 教育與心理研究
Issue 202509 (Vol. 48, No. 3)
Publisher 國立政治大學教育學院
DOI 10.53106/102498852025094803002