polytomously scored items: Topics by Science.gov

Sample records for polytomously scored items

An NCME Instructional Module on Polytomous Item Response Theory Models

ERIC Educational Resources Information Center

Penfield, Randall David

2014-01-01

A polytomous item is one for which the responses are scored according to three or more categories. Given the increasing use of polytomous items in assessment practices, item response theory (IRT) models specialized for polytomous items are becoming increasingly common. The purpose of this ITEMS module is to provide an accessible overview of…
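As a concrete illustration of an item scored in three or more categories, the sketch below computes category probabilities under Samejima's graded response model, one of the polytomous IRT models named in later records of this listing. The function name, the parameter values, and the choice of this model are illustrative assumptions, not taken from the module itself.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model (illustrative sketch).

    theta: latent trait value; a: discrimination (> 0);
    thresholds: increasing values b_1 < ... < b_{K-1} for K score categories.
    """
    # Cumulative curves P(X >= k), padded with 1 (k = 0) and 0 (k = K).
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# Hypothetical four-category item evaluated at theta = 0.
probs = grm_category_probs(theta=0.0, a=1.2, thresholds=[-1.0, 0.0, 1.5])
```

With three thresholds the item has four score categories, and the four probabilities sum to one at any value of theta.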
Designing P-Optimal Item Pools in Computerized Adaptive Tests with Polytomous Items

ERIC Educational Resources Information Center

Zhou, Xuechun

2012-01-01

Current CAT applications consist of predominantly dichotomous items, and CATs with polytomously scored items are limited. To ascertain the best approach to polytomous CAT, a significant amount of research has been conducted on item selection, ability estimation, and impact of termination rules based on polytomous IRT models. Few studies…
Asymptotic Standard Errors of Observed-Score Equating with Polytomous IRT Models

ERIC Educational Resources Information Center

Andersson, Björn

2016-01-01

In observed-score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response…
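The percentile-matching idea described in this abstract can be sketched as follows. This is a simplified illustration that uses mid-percentile ranks and linear interpolation on raw frequency tables; it is not the IRT-based procedure or the standard-error derivation of the article, and the function names and frequencies are hypothetical.

```python
def percentile_ranks(freqs):
    """Mid-percentile rank (0-100) of each integer score 0..K, given score frequencies."""
    n = sum(freqs)
    ranks, below = [], 0
    for f in freqs:
        ranks.append(100.0 * (below + 0.5 * f) / n)
        below += f
    return ranks

def equate(x, freqs_x, freqs_y):
    """Map integer score x on form X to the form-Y scale by matching percentile ranks."""
    p = percentile_ranks(freqs_x)[x]
    ry = percentile_ranks(freqs_y)
    # Clamp to the ends of the form-Y score range.
    if p <= ry[0]:
        return 0.0
    if p >= ry[-1]:
        return float(len(ry) - 1)
    # Find the adjacent form-Y scores whose ranks bracket p and interpolate.
    for y in range(len(ry) - 1):
        if ry[y] <= p <= ry[y + 1]:
            width = ry[y + 1] - ry[y]
            return float(y) if width == 0 else y + (p - ry[y]) / width

# Hypothetical score frequencies for scores 0..4 on two forms.
freqs_x = [1, 2, 3, 2, 1]
freqs_y = [3, 3, 2, 1, 0]
equated = [equate(x, freqs_x, freqs_y) for x in range(5)]
```

Equating a form to itself returns every score unchanged, which is a quick sanity check on any implementation of this kind.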
Conditional Covariance Theory and DETECT for Polytomous Items. Research Report. ETS RR-04-50

ERIC Educational Resources Information Center

Zhang, Jinming

2004-01-01

This paper extends the theory of conditional covariances to polytomous items. It has been mathematically proven that under some mild conditions, commonly assumed in the analysis of response data, the conditional covariance of two items, dichotomously or polytomously scored, is positive if the two items are dimensionally homogeneous and negative…
Asymptotic Standard Errors for Item Response Theory True Score Equating of Polytomous Items

ERIC Educational Resources Information Center

Cher Wong, Cheow

2015-01-01

Building on previous works by Lord and Ogasawara for dichotomous items, this article proposes an approach to derive the asymptotic standard errors of item response theory true score equating involving polytomous items, for equivalent and nonequivalent groups of examinees. This analytical approach could be used in place of empirical methods like…
Examining the Impact of Drifted Polytomous Anchor Items on Test Characteristic Curve (TCC) Linking and IRT True Score Equating. Research Report. ETS RR-12-09

ERIC Educational Resources Information Center

Li, Yanmei

2012-01-01

In a common-item (anchor) equating design, the common items should be evaluated for item parameter drift. Drifted items are often removed. For a test that contains mostly dichotomous items and only a small number of polytomous items, removing some drifted polytomous anchor items may result in anchor sets that no longer resemble mini-versions of…
Conditional Covariance Theory and Detect for Polytomous Items

ERIC Educational Resources Information Center

Zhang, Jinming

2007-01-01

This paper extends the theory of conditional covariances to polytomous items. It has been proven that under some mild conditions, commonly assumed in the analysis of response data, the conditional covariance of two items, dichotomously or polytomously scored, given an appropriately chosen composite is positive if, and only if, the two items…
Assessment of Differential Item Functioning in Health-Related Outcomes: A Simulation and Empirical Analysis with Hierarchical Polytomous Data

PubMed Central

Sharafi, Zahra

2017-01-01

Background: The purpose of this study was to evaluate the effectiveness of two methods of detecting differential item functioning (DIF) in the presence of multilevel data and polytomously scored items. The assessment of DIF with multilevel data (e.g., patients nested within hospitals, hospitals nested within districts) from large-scale assessment programs has received considerable attention but very few studies evaluated the effect of hierarchical structure of data on DIF detection for polytomously scored items. Methods: The ordinal logistic regression (OLR) and hierarchical ordinal logistic regression (HOLR) were utilized to assess DIF in simulated and real multilevel polytomous data. Six factors (DIF magnitude, grouping variable, intraclass correlation coefficient, number of clusters, number of participants per cluster, and item discrimination parameter) with a fully crossed design were considered in the simulation study. Furthermore, data of Pediatric Quality of Life Inventory™ (PedsQL™) 4.0 collected from 576 healthy school children were analyzed. Results: Overall, results indicate that both methods performed equivalently in terms of controlling Type I error and detection power rates. Conclusions: The current study showed negligible difference between OLR and HOLR in detecting DIF with polytomously scored items in a hierarchical structure. Implications and considerations while analyzing real data were also discussed. PMID:29312463
Assessment of Differential Item Functioning in Health-Related Outcomes: A Simulation and Empirical Analysis with Hierarchical Polytomous Data.

PubMed

Sharafi, Zahra; Mousavi, Amin; Ayatollahi, Seyyed Mohammad Taghi; Jafari, Peyman

2017-01-01

The purpose of this study was to evaluate the effectiveness of two methods of detecting differential item functioning (DIF) in the presence of multilevel data and polytomously scored items. The assessment of DIF with multilevel data (e.g., patients nested within hospitals, hospitals nested within districts) from large-scale assessment programs has received considerable attention but very few studies evaluated the effect of hierarchical structure of data on DIF detection for polytomously scored items. The ordinal logistic regression (OLR) and hierarchical ordinal logistic regression (HOLR) were utilized to assess DIF in simulated and real multilevel polytomous data. Six factors (DIF magnitude, grouping variable, intraclass correlation coefficient, number of clusters, number of participants per cluster, and item discrimination parameter) with a fully crossed design were considered in the simulation study. Furthermore, data of Pediatric Quality of Life Inventory™ (PedsQL™) 4.0 collected from 576 healthy school children were analyzed. Overall, results indicate that both methods performed equivalently in terms of controlling Type I error and detection power rates. The current study showed negligible difference between OLR and HOLR in detecting DIF with polytomously scored items in a hierarchical structure. Implications and considerations while analyzing real data were also discussed.
A Note on Stochastic Ordering of the Latent Trait Using the Sum of Polytomous Item Scores

ERIC Educational Resources Information Center

van der Ark, L. Andries; Bergsma, Wicher P.

2010-01-01

In contrast to dichotomous item response theory (IRT) models, most well-known polytomous IRT models do not imply stochastic ordering of the latent trait by the total test score (SOL). This has been thought to make the ordering of respondents on the latent trait using the total test score questionable and throws doubt on the justifiability of using…
Polytomous Latent Scales for the Investigation of the Ordering of Items

ERIC Educational Resources Information Center

Ligtvoet, Rudy; van der Ark, L. Andries; Bergsma, Wicher P.; Sijtsma, Klaas

2011-01-01

We propose three latent scales within the framework of nonparametric item response theory for polytomously scored items. Latent scales are models that imply an invariant item ordering, meaning that the order of the items is the same for each measurement value on the latent scale. This ordering property may be important in, for example,…
Modeling Item-Level and Step-Level Invariance Effects in Polytomous Items Using the Partial Credit Model

ERIC Educational Resources Information Center

Gattamorta, Karina A.; Penfield, Randall D.; Myers, Nicholas D.

2012-01-01

Measurement invariance is a common consideration in the evaluation of the validity and fairness of test scores when the tested population contains distinct groups of examinees, such as examinees receiving different forms of a translated test. Measurement invariance in polytomous items has traditionally been evaluated at the item-level,…
Detecting DIF in Polytomous Items Using MACS, IRT and Ordinal Logistic Regression

ERIC Educational Resources Information Center

Elosua, Paula; Wells, Craig

2013-01-01

The purpose of the present study was to compare the Type I error rate and power of two model-based procedures, the mean and covariance structure model (MACS) and the item response theory (IRT), and an observed-score based procedure, ordinal logistic regression, for detecting differential item functioning (DIF) in polytomous items. A simulation…
Comparing Vertical Scales Derived from Dichotomous and Polytomous IRT Models for a Test Composed of Testlets.

ERIC Educational Resources Information Center

Bishop, N. Scott; Omar, Md Hafidz

Previous research has shown that testlet structures often violate important assumptions of dichotomous item response theory (D-IRT) models, applied to item-level scores, that can in turn affect the results of many measurement applications. In this situation, polytomous IRT (P-IRT) models, applied to testlet-level scores, have been used as an…
Using a Taxonomy of Differential Step Functioning to Improve the Interpretation of DIF in Polytomous Items: An Illustration

ERIC Educational Resources Information Center

Penfield, Randall D.; Alvarez, Karina; Lee, Okhee

2009-01-01

The assessment of differential item functioning (DIF) in polytomous items addresses between-group differences in measurement properties at the item level, but typically does not inform which score levels may be involved in the DIF effect. The framework of differential step functioning (DSF) addresses this issue by examining between-group…
The Effects of Small Sample Size on Identifying Polytomous DIF Using the Liu-Agresti Estimator of the Cumulative Common Odds Ratio

ERIC Educational Resources Information Center

Carvajal, Jorge; Skorupski, William P.

2010-01-01

This study is an evaluation of the behavior of the Liu-Agresti estimator of the cumulative common odds ratio when identifying differential item functioning (DIF) with polytomously scored test items using small samples. The Liu-Agresti estimator has been proposed by Penfield and Algina as a promising approach for the study of polytomous DIF but no…
Stability of Rasch Scales over Time

ERIC Educational Resources Information Center

Taylor, Catherine S.; Lee, Yoonsun

2010-01-01

Item response theory (IRT) methods are generally used to create score scales for large-scale tests. Research has shown that IRT scales are stable across groups and over time. Most studies have focused on items that are dichotomously scored. Now Rasch and other IRT models are used to create scales for tests that include polytomously scored items.…
Stochastic Ordering Using the Latent Trait and the Sum Score in Polytomous IRT Models.

ERIC Educational Resources Information Center

Hemker, Bas T.; Sijtsma, Klaas; Molenaar, Ivo W.; Junker, Brian W.

1997-01-01

Stochastic ordering properties are investigated for a broad class of item response theory (IRT) models for which the monotone likelihood ratio does not hold. A taxonomy is given for nonparametric and parametric models for polytomous models based on the hierarchical relationship between the models. (SLD)
Real and Artificial Differential Item Functioning in Polytomous Items

ERIC Educational Resources Information Center

Andrich, David; Hagquist, Curt

2015-01-01

Differential item functioning (DIF) for an item between two groups is present if, for the same person location on a variable, persons from different groups have different expected values for their responses. Applying only to dichotomously scored items in the popular Mantel-Haenszel (MH) method for detecting DIF in which persons are classified by…
Psychometric properties for the Balanced Inventory of Desirable Responding: dichotomous versus polytomous conventional and IRT scoring.

PubMed

Vispoel, Walter P; Kim, Han Yi

2014-09-01

[Correction Notice: An Erratum for this article was reported in Vol 26(3) of Psychological Assessment (see record 2014-16017-001). The mean, standard deviation and alpha coefficient originally reported in Table 1 should be 74.317, 10.214 and .802, respectively. The validity coefficients in the last column of Table 4 are affected as well. Correcting this error did not change the substantive interpretations of the results, but did increase the mean, standard deviation, alpha coefficient, and validity coefficients reported for the Honesty subscale in the text and in Tables 1 and 4. The corrected versions of Table 1 and Table 4 are shown in the erratum.] Item response theory (IRT) models were applied to dichotomous and polytomous scoring of the Self-Deceptive Enhancement and Impression Management subscales of the Balanced Inventory of Desirable Responding (Paulhus, 1991, 1999). Two dichotomous scoring methods reflecting exaggerated endorsement and exaggerated denial of socially desirable behaviors were examined. The 1- and 2-parameter logistic models (1PLM, 2PLM, respectively) were applied to dichotomous responses, and the partial credit model (PCM) and graded response model (GRM) were applied to polytomous responses. For both subscales, the 2PLM fit dichotomous responses better than did the 1PLM, and the GRM fit polytomous responses better than did the PCM. Polytomous GRM and raw scores for both subscales yielded higher test-retest and convergent validity coefficients than did PCM, 1PLM, 2PLM, and dichotomous raw scores. Information plots showed that the GRM provided consistently high measurement precision that was superior to that of all other IRT models over the full range of both construct continuums. Dichotomous scores reflecting exaggerated endorsement of socially desirable behaviors provided noticeably weak precision at low levels of the construct continuums, calling into question the use of such scores for detecting instances of "faking bad." Dichotomous models reflecting exaggerated denial of the same behaviors yielded much better precision at low levels of the constructs, but it was still less precision than that of the GRM. These results support polytomous over dichotomous scoring in general, alternative dichotomous scoring for detecting faking bad, and extension of GRM scoring to situations in which IRT offers additional practical advantages over classical test theory (adaptive testing, equating, linking, scaling, detecting differential item functioning, and so forth). PsycINFO Database Record (c) 2014 APA, all rights reserved.

DIFAS: Differential Item Functioning Analysis System. Computer Program Exchange

ERIC Educational Resources Information Center

Penfield, Randall D.

2005-01-01

Differential item functioning (DIF) is an important consideration in assessing the validity of test scores (Camilli & Shepard, 1994). A variety of statistical procedures have been developed to assess DIF in tests of dichotomous (Hills, 1989; Millsap & Everson, 1993) and polytomous (Penfield & Lam, 2000; Potenza & Dorans, 1995) items. Some of these…
Practical Guide to Conducting an Item Response Theory Analysis

ERIC Educational Resources Information Center

Toland, Michael D.

2014-01-01

Item response theory (IRT) is a psychometric technique used in the development, evaluation, improvement, and scoring of multi-item scales. This pedagogical article provides the necessary information needed to understand how to conduct, interpret, and report results from two commonly used ordered polytomous IRT models (Samejima's graded…
An Empirical Investigation of Methods for Assessing Item Fit for Mixed Format Tests

ERIC Educational Resources Information Center

Chon, Kyong Hee; Lee, Won-Chan; Ansley, Timothy N.

2013-01-01

Empirical information regarding performance of model-fit procedures has been a persistent need in measurement practice. Statistical procedures for evaluating item fit were applied to real test examples that consist of both dichotomously and polytomously scored items. The item fit statistics used in this study included the PARSCALE's G[squared],…
Fitting Item Response Theory Models to Two Personality Inventories: Issues and Insights.

PubMed

Chernyshenko, O S; Stark, S; Chan, K Y; Drasgow, F; Williams, B

2001-10-01

The present study compared the fit of several IRT models to two personality assessment instruments. Data from 13,059 individuals responding to the US-English version of the Fifth Edition of the Sixteen Personality Factor Questionnaire (16PF) and 1,770 individuals responding to Goldberg's 50 item Big Five Personality measure were analyzed. Various issues pertaining to the fit of the IRT models to personality data were considered. We examined two of the most popular parametric models designed for dichotomously scored items (i.e., the two- and three-parameter logistic models) and a parametric model for polytomous items (Samejima's graded response model). Also examined were Levine's nonparametric maximum likelihood formula scoring models for dichotomous and polytomous data, which were previously found to provide good fits to several cognitive ability tests (Drasgow, Levine, Tsien, Williams, & Mead, 1995). The two- and three-parameter logistic models fit some scales reasonably well but not others; the graded response model generally did not fit well. The nonparametric formula scoring models provided the best fit of the models considered. Several implications of these findings for personality measurement and personnel selection were described.
IRTPRO 2.1 for Windows (Item Response Theory for Patient-Reported Outcomes)

ERIC Educational Resources Information Center

Paek, Insu; Han, Kyung T.

2013-01-01

This article reviews a new item response theory (IRT) model estimation program, IRTPRO 2.1, for Windows that is capable of unidimensional and multidimensional IRT model estimation for existing and user-specified constrained IRT models for dichotomously and polytomously scored item response data. (Contains 1 figure and 2 notes.)
Weighted Maximum-a-Posteriori Estimation in Tests Composed of Dichotomous and Polytomous Items

ERIC Educational Resources Information Center

Sun, Shan-Shan; Tao, Jian; Chang, Hua-Hua; Shi, Ning-Zhong

2012-01-01

For mixed-type tests composed of dichotomous and polytomous items, polytomous items often yield more information than dichotomous items. To reflect the difference between the two types of items and to improve the precision of ability estimation, an adaptive weighted maximum-a-posteriori (WMAP) estimation is proposed. To evaluate the performance of…
Polytomous versus Dichotomous Scoring on Multiple-Choice Examinations: Development of a Rubric for Rating Partial Credit

ERIC Educational Resources Information Center

Grunert, Megan L.; Raker, Jeffrey R.; Murphy, Kristen L.; Holme, Thomas A.

2013-01-01

The concept of assigning partial credit on multiple-choice test items is considered for items from ACS Exams. Because the items on these exams, particularly the quantitative items, use common student errors to define incorrect answers, it is possible to assign partial credits to some of these incorrect responses. To do so, however, it becomes…
Comparing and Combining Dichotomous and Polytomous Items with SPRT Procedure in Computerized Classification Testing.

ERIC Educational Resources Information Center

Lau, C. Allen; Wang, Tianyou

The purposes of this study were to: (1) extend the sequential probability ratio testing (SPRT) procedure to polytomous item response theory (IRT) models in computerized classification testing (CCT); (2) compare polytomous items with dichotomous items using the SPRT procedure for their accuracy and efficiency; (3) study a direct approach in…
A Generalizability Analysis of Score Consistency for the Balanced Inventory of Desirable Responding

ERIC Educational Resources Information Center

Vispoel, Walter P.; Tao, Shuqin

2013-01-01

Our goal in this investigation was to evaluate the reliability of scores from the Balanced Inventory of Desirable Responding (BIDR) more comprehensively than in prior research using a generalizability-theory framework based on both dichotomous and polytomous scoring of items. Generalizability coefficients accounting for specific-factor, transient,…
Accounting for Local Dependence with the Rasch Model: The Paradox of Information Increase.

PubMed

Andrich, David

Test theories imply statistical, local independence. Where local independence is violated, models of modern test theory that account for it have been proposed. One violation of local independence occurs when the response to one item governs the response to a subsequent item. Expanding on a formulation of this kind of violation between two items in the dichotomous Rasch model, this paper derives three related implications. First, it formalises how the polytomous Rasch model for an item constituted by summing the scores of the dependent items absorbs the dependence in its threshold structure. Second, it shows that as a consequence the unit when the dependence is accounted for is not the same as if the items had no response dependence. Third, it explains the paradox, known, but not explained in the literature, that the greater the dependence of the constituent items the greater the apparent information in the constituted polytomous item when it should provide less information.
Comparison of IRT Likelihood Ratio Test and Logistic Regression DIF Detection Procedures

ERIC Educational Resources Information Center

Atar, Burcu; Kamata, Akihito

2011-01-01

The Type I error rates and the power of IRT likelihood ratio test and cumulative logit ordinal logistic regression procedures in detecting differential item functioning (DIF) for polytomously scored items were investigated in this Monte Carlo simulation study. For this purpose, 54 simulation conditions (combinations of 3 sample sizes, 2 sample…
Examining Power and Type 1 Error for Step and Item Level Tests of Invariance: Investigating the Effect of the Number of Item Score Levels

ERIC Educational Resources Information Center

Ayodele, Alicia Nicole

2017-01-01

Within polytomous items, differential item functioning (DIF) can take on various forms due to the number of response categories. The lack of invariance at this level is referred to as differential step functioning (DSF). The most common DSF methods in the literature are the adjacent category log odds ratio (AC-LOR) estimator and cumulative…
Some Considerations on the Partial Credit Model

ERIC Educational Resources Information Center

Verhelst, N. D.; Verstralen, H. H. F. M.

2008-01-01

The Partial Credit Model (PCM) is sometimes interpreted as a model for stepwise solution of polytomously scored items, where the item parameters are interpreted as difficulties of the steps. It is argued that this interpretation is not justified. A model for stepwise solution is discussed. It is shown that the PCM is suited to model sums of binary…
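For reference, the category probabilities of the PCM can be sketched as below. The code shows only the model formula and the property that adjacent category curves cross at the item parameters; it takes no position on the stepwise interpretation that the article disputes. The function name and parameter values are illustrative assumptions.

```python
import math

def pcm_category_probs(theta, deltas):
    """Category probabilities under the partial credit model (illustrative sketch).

    theta: person location; deltas: item parameters delta_1..delta_m
    for an item scored 0..m.
    """
    # Log-numerators: cumulative sums of (theta - delta_j), with 0 for score 0.
    logits = [0.0]
    for d in deltas:
        logits.append(logits[-1] + (theta - d))
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    weights = [math.exp(v - m) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical three-category item evaluated at theta = delta_1.
deltas = [-0.5, 0.7]
probs_at_delta1 = pcm_category_probs(theta=-0.5, deltas=deltas)
```

At theta equal to delta_1, scores 0 and 1 are equally likely, which is the sense in which each item parameter locates the crossing point of two adjacent category curves.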
Model Selection Indices for Polytomous Items

ERIC Educational Resources Information Center

Kang, Taehoon; Cohen, Allan S.; Sung, Hyun-Jung

2009-01-01

This study examines the utility of four indices for use in model selection with nested and nonnested polytomous item response theory (IRT) models: a cross-validation index and three information-based indices. Four commonly used polytomous IRT models are considered: the graded response model, the generalized partial credit model, the partial credit…
A note on monotonicity of item response functions for ordered polytomous item response theory models.

PubMed

Kang, Hyeon-Ah; Su, Ya-Hui; Chang, Hua-Hua

2018-03-08

A monotone relationship between a true score (τ) and a latent trait level (θ) has been a key assumption for many psychometric applications. The monotonicity property in dichotomous response models is evident as a result of a transformation via a test characteristic curve. Monotonicity in polytomous models, in contrast, is not immediately obvious because item response functions are determined by a set of response category curves, which are conceivably non-monotonic in θ. The purpose of the present note is to demonstrate strict monotonicity in ordered polytomous item response models. Five models that are widely used in operational assessments are considered for proof: the generalized partial credit model (Muraki, 1992, Applied Psychological Measurement, 16, 159), the nominal model (Bock, 1972, Psychometrika, 37, 29), the partial credit model (Masters, 1982, Psychometrika, 47, 147), the rating scale model (Andrich, 1978, Psychometrika, 43, 561), and the graded response model (Samejima, 1972, A general model for free-response data (Psychometric Monograph no. 18). Psychometric Society, Richmond). The study asserts that the item response functions in these models strictly increase in θ and thus there exists strict monotonicity between τ and θ under certain specified conditions. This conclusion validates the practice of customarily using τ in place of θ in applied settings and provides theoretical grounds for one-to-one transformations between the two scales. © 2018 The British Psychological Society.
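The monotonicity claim can be checked numerically for a single item. The sketch below evaluates the expected item score under the generalized partial credit model on a grid of θ values and confirms that it increases, even with deliberately disordered step parameters. It is a numerical illustration under assumed parameter values, not a reproduction of the proof in the note.

```python
import math

def gpcm_expected_score(theta, a, steps):
    """Expected item score under the generalized partial credit model (sketch).

    theta: latent trait; a: discrimination (> 0);
    steps: step parameters b_1..b_m for an item scored 0..m.
    """
    # Log-numerators: cumulative sums of a * (theta - b_j), with 0 for score 0.
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + a * (theta - b))
    m = max(logits)
    weights = [math.exp(v - m) for v in logits]
    # Probability-weighted mean of the scores 0..m.
    return sum(k * w for k, w in enumerate(weights)) / sum(weights)

# Hypothetical item with disordered steps, evaluated on theta in [-4, 4].
grid = [x / 10.0 for x in range(-40, 41)]
scores = [gpcm_expected_score(t, a=1.3, steps=[0.8, -0.4, 1.1]) for t in grid]
is_increasing = all(s1 < s2 for s1, s2 in zip(scores, scores[1:]))
```

The individual category curves for the middle scores rise and then fall, yet the expected score keeps rising; summing such expected scores over items gives a true score τ that increases in θ.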
Contextual Differential Item Functioning: Examining the Validity of Teaching Self-Efficacy Instruments Using Hierarchical Generalized Linear Modeling

ERIC Educational Resources Information Center

Zhao, Jing

2012-01-01

The purpose of the study is to further investigate the validity of instruments used for collecting preservice teachers' perceptions of self-efficacy adapting the three-level IRT model described in Cheong's study (2006). The focus of the present study is to investigate whether the polytomously-scored items on the preservice teachers' self-efficacy…
Computerized Adaptive Test (CAT) Applications and Item Response Theory Models for Polytomous Items

ERIC Educational Resources Information Center

Aybek, Eren Can; Demirtasli, R. Nukhet

2017-01-01

This article aims to provide a theoretical framework for computerized adaptive tests (CAT) and item response theory models for polytomous items. Besides that, it aims to introduce the simulation and live CAT software to the related researchers. Computerized adaptive test algorithm, assumptions of item response theory models, nominal response…
A Comparison of Three Polytomous Item Response Theory Models in the Context of Testlet Scoring.

ERIC Educational Resources Information Center

Cook, Karon F.; Dodd, Barbara G.; Fitzpatrick, Steven J.

1999-01-01

The partial-credit model, the generalized partial-credit model, and the graded-response model were compared in the context of testlet scoring using Scholastic Assessment Tests results (n=2,548) and a simulated data set. Results favor the partial-credit model in this context; considerations for model selection in other contexts are discussed. (SLD)
Polytomous Differential Item Functioning and Violations of Ordering of the Expected Latent Trait by the Raw Score

ERIC Educational Resources Information Center

DeMars, Christine E.

2008-01-01

The graded response (GR) and generalized partial credit (GPC) models do not imply that examinees ordered by raw observed score will necessarily be ordered on the expected value of the latent trait (OEL). Factors were manipulated to assess whether increased violations of OEL also produced increased Type I error rates in differential item…
Examination of Polytomous Items' Psychometric Properties According to Nonparametric Item Response Theory Models in Different Test Conditions

ERIC Educational Resources Information Center

Sengul Avsar, Asiye; Tavsancil, Ezel

2017-01-01

This study analysed polytomous items' psychometric properties according to nonparametric item response theory (NIRT) models. Thus, simulated datasets--three different test lengths (10, 20 and 30 items), three sample distributions (normal, right and left skewed) and three samples sizes (100, 250 and 500)--were generated by conducting 20…

Applying Bayesian Item Selection Approaches to Adaptive Tests Using Polytomous Items

ERIC Educational Resources Information Center

Penfield, Randall D.

2006-01-01

This study applied the maximum expected information (MEI) and the maximum posterior-weighted information (MPI) approaches of computer adaptive testing item selection to the case of a test using polytomous items following the partial credit model. The MEI and MPI approaches are described. A simulation study compared the efficiency of ability…
Mokken scale analysis of mental health and well-being questionnaire item responses: a non-parametric IRT method in empirical research for applied health researchers

PubMed Central

Stochl, Jan

2012-01-01

Background: Mokken scaling techniques are a useful tool for researchers who wish to construct unidimensional tests or use questionnaires that comprise multiple binary or polytomous items. The stochastic cumulative scaling model offered by this approach is ideally suited when the intention is to score an underlying latent trait by simple addition of the item response values. In our experience, the Mokken model appears to be less well-known than for example the (related) Rasch model, but is seeing increasing use in contemporary clinical research and public health. Mokken's method is a generalisation of Guttman scaling that can assist in the determination of the dimensionality of tests or scales, and enables consideration of reliability, without reliance on Cronbach's alpha. This paper provides a practical guide to the application and interpretation of this non-parametric item response theory method in empirical research with health and well-being questionnaires. Methods: Scalability of data from 1) a cross-sectional health survey (the Scottish Health Education Population Survey) and 2) a general population birth cohort study (the National Child Development Study) illustrate the method and modeling steps for dichotomous and polytomous items respectively. The questionnaire data analyzed comprise responses to the 12 item General Health Questionnaire, under the binary recoding recommended for screening applications, and the ordinal/polytomous responses to the Warwick-Edinburgh Mental Well-being Scale. Results and conclusions: After an initial analysis example in which we select items by phrasing (six positive versus six negatively worded items) we show that all items from the 12-item General Health Questionnaire (GHQ-12) – when binary scored – were scalable according to the double monotonicity model, in two short scales comprising six items each (Bech’s “well-being” and “distress” clinical scales). An illustration of ordinal item analysis confirmed that all 14 positively worded items of the Warwick-Edinburgh Mental Well-being Scale (WEMWBS) met criteria for the monotone homogeneity model but four items violated double monotonicity with respect to a single underlying dimension. Software availability and commands used to specify unidimensionality and reliability analysis and graphical displays for diagnosing monotone homogeneity and double monotonicity are discussed, with an emphasis on current implementations in freeware. PMID:22686586
Mokken scale analysis of mental health and well-being questionnaire item responses: a non-parametric IRT method in empirical research for applied health researchers.

PubMed

Stochl, Jan; Jones, Peter B; Croudace, Tim J

2012-06-11

Mokken scaling techniques are a useful tool for researchers who wish to construct unidimensional tests or use questionnaires that comprise multiple binary or polytomous items. The stochastic cumulative scaling model offered by this approach is ideally suited when the intention is to score an underlying latent trait by simple addition of the item response values. In our experience, the Mokken model appears to be less well-known than for example the (related) Rasch model, but is seeing increasing use in contemporary clinical research and public health. Mokken's method is a generalisation of Guttman scaling that can assist in the determination of the dimensionality of tests or scales, and enables consideration of reliability, without reliance on Cronbach's alpha. This paper provides a practical guide to the application and interpretation of this non-parametric item response theory method in empirical research with health and well-being questionnaires. Scalability of data from 1) a cross-sectional health survey (the Scottish Health Education Population Survey) and 2) a general population birth cohort study (the National Child Development Study) illustrate the method and modeling steps for dichotomous and polytomous items respectively. The questionnaire data analyzed comprise responses to the 12 item General Health Questionnaire, under the binary recoding recommended for screening applications, and the ordinal/polytomous responses to the Warwick-Edinburgh Mental Well-being Scale. After an initial analysis example in which we select items by phrasing (six positive versus six negatively worded items) we show that all items from the 12-item General Health Questionnaire (GHQ-12)--when binary scored--were scalable according to the double monotonicity model, in two short scales comprising six items each (Bech's "well-being" and "distress" clinical scales). 
An illustration of ordinal item analysis confirmed that all 14 positively worded items of the Warwick-Edinburgh Mental Well-being Scale (WEMWBS) met criteria for the monotone homogeneity model but four items violated double monotonicity with respect to a single underlying dimension. Software availability and commands used to specify unidimensionality and reliability analysis and graphical displays for diagnosing monotone homogeneity and double monotonicity are discussed, with an emphasis on current implementations in freeware.
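The scalability idea behind Mokken scaling can be illustrated for binary items with Loevinger's H coefficient: the sum of observed inter-item covariances divided by the sum of the largest covariances attainable given the item marginals. The sketch below is a minimal illustration only, not the software discussed in the abstract; the function name `scale_h` and the toy response matrices are assumptions made for the example.

```python
from itertools import combinations

def scale_h(responses):
    """Loevinger's H for a persons x items matrix of 0/1 scores (illustrative helper)."""
    n_persons = len(responses)
    n_items = len(responses[0])
    # Item popularities: proportion of persons scoring 1 on each item.
    p = [sum(row[i] for row in responses) / n_persons for i in range(n_items)]
    cov_sum = 0.0
    cov_max_sum = 0.0
    for i, j in combinations(range(n_items), 2):
        p_ij = sum(row[i] * row[j] for row in responses) / n_persons
        cov_sum += p_ij - p[i] * p[j]
        # Largest joint proportion given the marginals is min(p_i, p_j).
        cov_max_sum += min(p[i], p[j]) - p[i] * p[j]
    if cov_max_sum == 0:
        raise ValueError("H is undefined when every item has zero variance")
    return cov_sum / cov_max_sum

# A perfect Guttman pattern has no scaling errors; independent items carry no scale.
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
independent = [[0, 0], [0, 1], [1, 0], [1, 1]]
h_guttman = scale_h(guttman)
h_independent = scale_h(independent)
```

Here a perfect Guttman pattern yields H = 1 and statistically independent items yield H = 0; real data fall in between, and commonly quoted rules of thumb for interpreting H are conventions rather than results of this sketch.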
Location Indices for Ordinal Polytomous Items Based on Item Response Theory. Research Report. ETS RR-15-20

ERIC Educational Resources Information Center

Ali, Usama S.; Chang, Hua-Hua; Anderson, Carolyn J.

2015-01-01

Polytomous items are typically described by multiple category-related parameters; situations, however, arise in which a single index is needed to describe an item's location along a latent trait continuum. Situations in which a single index would be needed include item selection in computerized adaptive testing or test assembly. Therefore single…
A Monte Carlo Study Investigating the Influence of Item Discrimination, Category Intersection Parameters, and Differential Item Functioning Patterns on the Detection of Differential Item Functioning in Polytomous Items

ERIC Educational Resources Information Center

Thurman, Carol

2009-01-01

The increased use of polytomous item formats has led assessment developers to pay greater attention to the detection of differential item functioning (DIF) in these items. DIF occurs when an item performs differently for two contrasting groups of respondents (e.g., males versus females) after controlling for differences in the abilities of the…
The Reliability and Precision of Total Scores and IRT Estimates as a Function of Polytomous IRT Parameters and Latent Trait Distribution

ERIC Educational Resources Information Center

Culpepper, Steven Andrew

2013-01-01

A classic topic in the fields of psychometrics and measurement has been the impact of the number of scale categories on test score reliability. This study builds on previous research by further articulating the relationship between item response theory (IRT) and classical test theory (CTT). Equations are presented for comparing the reliability and…
Fitting and Testing Conditional Multinormal Partial Credit Models

ERIC Educational Resources Information Center

Hessen, David J.

2012-01-01

A multinormal partial credit model for factor analysis of polytomously scored items with ordered response categories is derived using an extension of the Dutch Identity (Holland in "Psychometrika" 55:5-18, 1990). In the model, latent variables are assumed to have a multivariate normal distribution conditional on unweighted sums of item…
Application of a Method of Estimating DIF for Polytomous Test Items.

ERIC Educational Resources Information Center

Camilli, Gregory; Congdon, Peter

1999-01-01

Demonstrates a method for studying differential item functioning (DIF) that can be used with dichotomous or polytomous items and that is valid for data that follow a partial credit Item Response Theory model. A simulation study shows that positively biased Type I error rates are in accord with results from previous studies. (SLD)
MIMIC Methods for Assessing Differential Item Functioning in Polytomous Items

ERIC Educational Resources Information Center

Wang, Wen-Chung; Shih, Ching-Lin

2010-01-01

Three multiple indicators-multiple causes (MIMIC) methods, namely, the standard MIMIC method (M-ST), the MIMIC method with scale purification (M-SP), and the MIMIC method with a pure anchor (M-PA), were developed to assess differential item functioning (DIF) in polytomous items. In a series of simulations, it appeared that all three methods…
Detecting Local Item Dependence in Polytomous Adaptive Data

ERIC Educational Resources Information Center

Mislevy, Jessica L.; Rupp, Andre A.; Harring, Jeffrey R.

2012-01-01

A rapidly expanding arena for item response theory (IRT) is in attitudinal and health-outcomes survey applications, often with polytomous items. In particular, there is interest in computer adaptive testing (CAT). Meeting model assumptions is necessary to realize the benefits of IRT in this setting, however. Although initial investigations of…
Calibrating the Medical Council of Canada's Qualifying Examination Part I using an integrated item response theory framework: a comparison of models and designs.

PubMed

De Champlain, Andre F; Boulais, Andre-Philippe; Dallas, Andrew

2016-01-01

The aim of this research was to compare different methods of calibrating multiple choice question (MCQ) and clinical decision making (CDM) components for the Medical Council of Canada's Qualifying Examination Part I (MCCQEI) based on item response theory. Our data consisted of test results from 8,213 first time applicants to MCCQEI in spring and fall 2010 and 2011 test administrations. The data set contained several thousand multiple choice items and several hundred CDM cases. Four dichotomous calibrations were run using BILOG-MG 3.0. All 3 mixed item format (dichotomous MCQ responses and polytomous CDM case scores) calibrations were conducted using PARSCALE 4. The 2-PL model had identical numbers of items with chi-square values at or below a Type I error rate of 0.01 (83/3,499 or 0.02). In all 3 polytomous models, whether the MCQs were either anchored or concurrently run with the CDM cases, results suggest very poor fit. All IRT abilities estimated from dichotomous calibration designs correlated very highly with each other. IRT-based pass-fail rates were extremely similar, not only across calibration designs and methods, but also with regard to the actual reported decision to candidates. The largest difference noted in pass rates was 4.78%, which occurred between the mixed format concurrent 2-PL graded response model (pass rate= 80.43%) and the dichotomous anchored 1-PL calibrations (pass rate= 85.21%). Simpler calibration designs with dichotomized items should be implemented. The dichotomous calibrations provided better fit of the item response matrix than more complex, polytomous calibrations.
Computerized Adaptive Testing for Polytomous Motivation Items: Administration Mode Effects and a Comparison with Short Forms

ERIC Educational Resources Information Center

Hol, A. Michiel; Vorst, Harrie C. M.; Mellenbergh, Gideon J.

2007-01-01

In a randomized experiment (n = 515), a computerized and a computerized adaptive test (CAT) are compared. The item pool consists of 24 polytomous motivation items. Although items are carefully selected, calibration data show that Samejima's graded response model did not fit the data optimally. A simulation study is done to assess possible…
Performance of the Generalized S-X[Superscript 2] Item Fit Index for Polytomous IRT Models

ERIC Educational Resources Information Center

Kang, Taehoon; Chen, Troy T.

2008-01-01

Orlando and Thissen's S-X[superscript 2] item fit index has performed better than traditional item fit statistics such as Yen's Q[subscript 1] and McKinley and Mills' G[superscript 2] for dichotomous item response theory (IRT) models. This study extends the utility of S-X[superscript 2] to polytomous IRT models, including the generalized partial…
A Person Fit Test for IRT Models for Polytomous Items

ERIC Educational Resources Information Center

Glas, C. A. W.; Dagohoy, Anna Villa T.

2007-01-01

A person fit test based on the Lagrange multiplier test is presented for three item response theory models for polytomous items: the generalized partial credit model, the sequential model, and the graded response model. The test can also be used in the framework of multidimensional ability parameters. It is shown that the Lagrange multiplier…
Rasch fit statistics and sample size considerations for polytomous data.

PubMed

Smith, Adam B; Rush, Robert; Fallowfield, Lesley J; Velikova, Galina; Sharpe, Michael

2008-05-29

Previous research on educational data has demonstrated that Rasch fit statistics (mean squares and t-statistics) are highly susceptible to sample size variation for dichotomously scored rating data, although little is known about this relationship for polytomous data. These statistics help inform researchers about how well items fit to a unidimensional latent trait, and are an important adjunct to modern psychometrics. Given the increasing use of Rasch models in health research the purpose of this study was therefore to explore the relationship between fit statistics and sample size for polytomous data. Data were collated from a heterogeneous sample of cancer patients (n = 4072) who had completed both the Patient Health Questionnaire - 9 and the Hospital Anxiety and Depression Scale. Ten samples were drawn with replacement for each of eight sample sizes (n = 25 to n = 3200). The Rating and Partial Credit Models were applied and the mean square and t-fit statistics (infit/outfit) derived for each model. The results demonstrated that t-statistics were highly sensitive to sample size, whereas mean square statistics remained relatively stable for polytomous data. It was concluded that mean square statistics were relatively independent of sample size for polytomous data and that misfit to the model could be identified using published recommended ranges.
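For readers unfamiliar with the mean square statistics named in this abstract, the sketch below shows the usual definitions for one item: outfit is the unweighted mean of squared standardized residuals, and infit is the information-weighted version (sum of squared residuals over sum of model variances). It assumes the model-expected scores and variances have already been obtained from a fitted Rasch-family model; the function name and inputs are illustrative and are not taken from the study.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean squares for one item across persons.

    observed: item scores; expected: model-expected scores;
    variance: model variances of the scores (all assumed to be > 0).
    """
    if not (len(observed) == len(expected) == len(variance)):
        raise ValueError("inputs must have the same length")
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: mean of squared standardized residuals (sensitive to outliers).
    outfit = sum(r / w for r, w in zip(sq_resid, variance)) / len(sq_resid)
    # Infit: squared residuals weighted by the model variance (information).
    infit = sum(sq_resid) / sum(variance)
    return infit, outfit

# Toy example: three persons answering a three-category item scored 0, 1, 2.
infit, outfit = fit_mean_squares([2, 0, 1], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
```

Both statistics have expectation 1 under the model; the t-statistics mentioned in the abstract are transformations of these mean squares and are not computed here.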
Rasch fit statistics and sample size considerations for polytomous data

PubMed Central

Smith, Adam B; Rush, Robert; Fallowfield, Lesley J; Velikova, Galina; Sharpe, Michael

2008-01-01

Background Previous research on educational data has demonstrated that Rasch fit statistics (mean squares and t-statistics) are highly susceptible to sample size variation for dichotomously scored rating data, although little is known about this relationship for polytomous data. These statistics help inform researchers about how well items fit to a unidimensional latent trait, and are an important adjunct to modern psychometrics. Given the increasing use of Rasch models in health research the purpose of this study was therefore to explore the relationship between fit statistics and sample size for polytomous data. Methods Data were collated from a heterogeneous sample of cancer patients (n = 4072) who had completed both the Patient Health Questionnaire – 9 and the Hospital Anxiety and Depression Scale. Ten samples were drawn with replacement for each of eight sample sizes (n = 25 to n = 3200). The Rating and Partial Credit Models were applied and the mean square and t-fit statistics (infit/outfit) derived for each model. Results The results demonstrated that t-statistics were highly sensitive to sample size, whereas mean square statistics remained relatively stable for polytomous data. Conclusion It was concluded that mean square statistics were relatively independent of sample size for polytomous data and that misfit to the model could be identified using published recommended ranges. PMID:18510722
Estimating the Nominal Response Model under Nonnormal Conditions

ERIC Educational Resources Information Center

Preston, Kathleen Suzanne Johnson; Reise, Steven Paul

2014-01-01

The nominal response model (NRM), a much understudied polytomous item response theory (IRT) model, provides researchers the unique opportunity to evaluate within-item category distinctions. Polytomous IRT models, such as the NRM, are frequently applied to psychological assessments representing constructs that are unlikely to be normally…
Handbook of Polytomous Item Response Theory Models

ERIC Educational Resources Information Center

Nering, Michael L., Ed.; Ostini, Remo, Ed.

2010-01-01

This comprehensive "Handbook" focuses on the most used polytomous item response theory (IRT) models. These models help us understand the interaction between examinees and test questions where the questions have various response categories. The book reviews all of the major models and includes discussions about how and where the models…
Comparison of Multidimensional Item Response Models: Multivariate Normal Ability Distributions versus Multivariate Polytomous Ability Distributions. Research Report. ETS RR-08-45

ERIC Educational Resources Information Center

Haberman, Shelby J.; von Davier, Matthias; Lee, Yi-Hsuan

2008-01-01

Multidimensional item response models can be based on multivariate normal ability distributions or on multivariate polytomous ability distributions. For the case of simple structure in which each item corresponds to a unique dimension of the ability vector, some applications of the two-parameter logistic model to empirical data are employed to…
Examining the Process of Responding to Circumplex Scales of Interpersonal Values Items: Should Ideal Point Scoring Methods Be Considered?

PubMed

Ling, Ying; Zhang, Minqiang; Locke, Kenneth D; Li, Guangming; Li, Zonglong

2016-01-01

The Circumplex Scales of Interpersonal Values (CSIV) is a 64-item self-report measure of goals from each octant of the interpersonal circumplex. We used item response theory methods to compare whether dominance models or ideal point models best described how people respond to CSIV items. Specifically, we fit a polytomous dominance model called the generalized partial credit model and an ideal point model of similar complexity called the generalized graded unfolding model to the responses of 1,893 college students. The results of both graphical comparisons of item characteristic curves and statistical comparisons of model fit suggested that an ideal point model best describes the process of responding to CSIV items. The different models produced different rank orderings of high-scoring respondents, but overall the models did not differ in their prediction of criterion variables (agentic and communal interpersonal traits and implicit motives).

A Generalized Partial Credit Model: Application of an EM Algorithm.

ERIC Educational Resources Information Center

Muraki, Eiji

1992-01-01

The partial credit model with a varying slope parameter is developed and called the generalized partial credit model (GPCM). Analysis results for simulated data by this and other polytomous item-response models demonstrate that the rating formulation of the GPCM is adaptable to the analysis of polytomous item responses. (SLD)
Detection of Person Misfit in Computerized Adaptive Tests with Polytomous Items.

ERIC Educational Resources Information Center

van Krimpen-Stoop, Edith M. L. A.; Meijer, Rob R.

2002-01-01

Compared the nominal and empirical null distributions of the standardized log-likelihood statistic for polytomous items for paper-and-pencil (P&P) and computerized adaptive tests (CATs). Results show that the empirical distribution of the statistic differed from the assumed standard normal distribution for both P&P tests and CATs. Also…
Adjacent-Categories Mokken Models for Rater-Mediated Assessments

PubMed Central

Wind, Stefanie A.

2016-01-01

Molenaar extended Mokken’s original probabilistic-nonparametric scaling models for use with polytomous data. These polytomous extensions of Mokken’s original scaling procedure have facilitated the use of Mokken scale analysis as an approach to exploring fundamental measurement properties across a variety of domains in which polytomous ratings are used, including rater-mediated educational assessments. Because their underlying item step response functions (i.e., category response functions) are defined using cumulative probabilities, polytomous Mokken models can be classified as cumulative models based on the classifications of polytomous item response theory models proposed by several scholars. In order to permit a closer conceptual alignment with educational performance assessments, this study presents an adjacent-categories variation on the polytomous monotone homogeneity and double monotonicity models. Data from a large-scale rater-mediated writing assessment are used to illustrate the adjacent-categories approach, and results are compared with the original formulations. Major findings suggest that the adjacent-categories models provide additional diagnostic information related to individual raters’ use of rating scale categories that is not observed under the original formulation. Implications are discussed in terms of methods for evaluating rating quality. PMID:29795916
Rasch analysis of the Edmonton Symptom Assessment System and research implications.

PubMed

Cheifetz, O; Packham, T L; Macdermid, J C

2014-04-01

Reliable and valid assessment of the disease burden across all forms of cancer is critical to the evaluation of treatment effectiveness and patient progress. The Edmonton Symptom Assessment System (esas) is used for routine evaluation of people attending for cancer care. In the present study, we used Rasch analysis to explore the measurement properties of the esas and to determine the effect of using Rasch-proposed interval-level esas scoring compared with traditional scoring when evaluating the effects of an exercise program for cancer survivors. Polytomous Rasch analysis (Andrich's rating-scale model) was applied to data from 26,645 esas questionnaires completed at the Juravinski Cancer Centre. The fit of the esas to the polytomous Rasch model was investigated, including evaluations of differential item functioning for sex, age, and disease group. The research implication was investigated by comparing the results of an observational research study previously analysed using a traditional approach with the results obtained by Rasch-proposed interval-level esas scoring. The Rasch reliability index was 0.73, falling short of the desired 0.80-0.90 level. However, the esas was found to fit the Rasch model, including the criteria for uni-dimensional data. The analysis suggests that the current esas scoring system of 0-10 could be collapsed to a 6-point scale. Use of the Rasch-proposed interval-level scoring yielded results that were different from those calculated using summarized ordinal-level esas scores. Differential item functioning was not found for sex, age, or diagnosis groups. The esas is a moderately reliable uni-dimensional measure of cancer disease burden and can provide interval-level scaling with Rasch-based scoring. Further, our study indicates that, compared with the traditional scoring metric, Rasch-based scoring could result in substantive changes to conclusions.
Online Calibration of Polytomous Items Under the Generalized Partial Credit Model

PubMed Central

Zheng, Yi

2016-01-01

Online calibration is a technology-enhanced architecture for item calibration in computerized adaptive tests (CATs). Many CATs are administered continuously over a long term and rely on large item banks. To ensure test validity, these item banks need to be frequently replenished with new items, and these new items need to be pretested before being used operationally. Online calibration dynamically embeds pretest items in operational tests and calibrates their parameters as response data are gradually obtained through the continuous test administration. This study extends existing formulas, procedures, and algorithms for dichotomous item response theory models to the generalized partial credit model, a popular model for items scored in more than two categories. A simulation study was conducted to investigate the developed algorithms and procedures under a variety of conditions, including two estimation algorithms, three pretest item selection methods, three seeding locations, two numbers of score categories, and three calibration sample sizes. Results demonstrated acceptable estimation accuracy of the two estimation algorithms in some of the simulated conditions. A variety of findings were also revealed for the interacted effects of included factors, and recommendations were made respectively. PMID:29881063
Rasch analysis on OSCE data: An illustrative example.

PubMed

Tor, E; Steketee, C

2011-01-01

The Objective Structured Clinical Examination (OSCE) is a widely used tool for the assessment of clinical competence in health professional education. The goal of the OSCE is to make reproducible decisions on pass/fail status as well as students' levels of clinical competence according to their demonstrated abilities based on the scores. This paper explores the use of the polytomous Rasch model in evaluating the psychometric properties of OSCE scores through a case study. The authors analysed an OSCE data set (comprised of 11 stations) for 80 fourth year medical students based on the polytomous Rasch model in an effort to answer two research questions: (1) Do the clinical tasks assessed in the 11 OSCE stations map on to a common underlying construct, namely clinical competence? (2) What other insights can Rasch analysis offer in terms of scaling, item analysis and instrument validation over and above the conventional analysis based on classical test theory? The OSCE data set has demonstrated a sufficient degree of fit to the Rasch model (χ² = 17.060, DF = 22, p = 0.76) indicating that the 11 OSCE station scores have sufficient psychometric properties to form a measure for a common underlying construct, i.e. clinical competence. Individual OSCE station scores with good fit to the Rasch model (p > 0.1 for all χ² statistics) further corroborated the characteristic of unidimensionality of the OSCE scale for clinical competence. A Person Separation Index (PSI) of 0.704 indicates sufficient level of reliability for the OSCE scores. Other useful findings from the Rasch analysis that provide insights, over and above the analysis based on classical test theory, are also exemplified using the data set. The polytomous Rasch model provides a useful and supplementary approach to the calibration and analysis of OSCE examination data.
Rasch analysis of the Edmonton Symptom Assessment System and research implications

PubMed Central

Cheifetz, O.; Packham, T.L.; MacDermid, J.C.

2014-01-01

Background Reliable and valid assessment of the disease burden across all forms of cancer is critical to the evaluation of treatment effectiveness and patient progress. The Edmonton Symptom Assessment System (esas) is used for routine evaluation of people attending for cancer care. In the present study, we used Rasch analysis to explore the measurement properties of the esas and to determine the effect of using Rasch-proposed interval-level esas scoring compared with traditional scoring when evaluating the effects of an exercise program for cancer survivors. Methods Polytomous Rasch analysis (Andrich’s rating-scale model) was applied to data from 26,645 esas questionnaires completed at the Juravinski Cancer Centre. The fit of the esas to the polytomous Rasch model was investigated, including evaluations of differential item functioning for sex, age, and disease group. The research implication was investigated by comparing the results of an observational research study previously analysed using a traditional approach with the results obtained by Rasch-proposed interval-level esas scoring. Results The Rasch reliability index was 0.73, falling short of the desired 0.80–0.90 level. However, the esas was found to fit the Rasch model, including the criteria for uni-dimensional data. The analysis suggests that the current esas scoring system of 0–10 could be collapsed to a 6-point scale. Use of the Rasch-proposed interval-level scoring yielded results that were different from those calculated using summarized ordinal-level esas scores. Differential item functioning was not found for sex, age, or diagnosis groups. Conclusions The esas is a moderately reliable uni-dimensional measure of cancer disease burden and can provide interval-level scaling with Rasch-based scoring. Further, our study indicates that, compared with the traditional scoring metric, Rasch-based scoring could result in substantive changes to conclusions. PMID:24764703
Three Classes of Nonparametric Differential Step Functioning Effect Estimators

ERIC Educational Resources Information Center

Penfield, Randall D.

2008-01-01

The examination of measurement invariance in polytomous items is complicated by the possibility that the magnitude and sign of lack of invariance may vary across the steps underlying the set of polytomous response options, a concept referred to as differential step functioning (DSF). This article describes three classes of nonparametric DSF effect…
Polytomous Adaptive Classification Testing: Effects of Item Pool Size, Test Termination Criterion, and Number of Cutscores

ERIC Educational Resources Information Center

Gnambs, Timo; Batinic, Bernad

2011-01-01

Computer-adaptive classification tests focus on classifying respondents in different proficiency groups (e.g., for pass/fail decisions). To date, adaptive classification testing has been dominated by research on dichotomous response formats and classifications in two groups. This article extends this line of research to polytomous classification…
Vegetable parenting practices scale: Item response modeling analyses

USDA-ARS's Scientific Manuscript database

Our objective was to evaluate the psychometric properties of a vegetable parenting practices scale using multidimensional polytomous item response modeling which enables assessing item fit to latent variables and the distributional characteristics of the items in comparison to the respondents. We al...
Simulation-based Bayesian inference for latent traits of item response models: Introduction to the ltbayes package for R.

PubMed

Johnson, Timothy R; Kuhn, Kristine M

2015-12-01

This paper introduces the ltbayes package for R. This package includes a suite of functions for investigating the posterior distribution of latent traits of item response models. These include functions for simulating realizations from the posterior distribution, profiling the posterior density or likelihood function, calculation of posterior modes or means, Fisher information functions and observed information, and profile likelihood confidence intervals. Inferences can be based on individual response patterns or sets of response patterns such as sum scores. Functions are included for several common binary and polytomous item response models, but the package can also be used with user-specified models. This paper introduces some background and motivation for the package, and includes several detailed examples of its use.
A Note on the Equivalence between Observed and Expected Information Functions with Polytomous IRT Models

ERIC Educational Resources Information Center

Magis, David

2015-01-01

The purpose of this note is to study the equivalence of observed and expected (Fisher) information functions with polytomous item response theory (IRT) models. It is established that observed and expected information functions are equivalent for the class of divide-by-total models (including partial credit, generalized partial credit, rating…
Application of an IRT Polytomous Model for Measuring Health Related Quality of Life

ERIC Educational Resources Information Center

Tejada, Antonio J. Rojas; Rojas, Oscar M. Lozano

2005-01-01

Background: The Item Response Theory (IRT) has advantages for measuring Health Related Quality of Life (HRQOL) as opposed to the Classical Tests Theory (CTT). Objectives: To present the results of the application of a polytomous model based on IRT, specifically, the Rating Scale Model (RSM), to measure HRQOL with the EORTC QLQ-C30. Methods: 103…
Fitting IRT Models to Dichotomous and Polytomous Data: Assessing the Relative Model-Data Fit of Ideal Point and Dominance Models

ERIC Educational Resources Information Center

Tay, Louis; Ali, Usama S.; Drasgow, Fritz; Williams, Bruce

2011-01-01

This study investigated the relative model-data fit of an ideal point item response theory (IRT) model (the generalized graded unfolding model [GGUM]) and dominance IRT models (e.g., the two-parameter logistic model [2PLM] and Samejima's graded response model [GRM]) to simulated dichotomous and polytomous data generated from each of these models.…
An item response theory analysis of the Executive Interview and development of the EXIT8: A Project FRONTIER Study.

PubMed

Jahn, Danielle R; Dressel, Jeffrey A; Gavett, Brandon E; O'Bryant, Sid E

2015-01-01

The Executive Interview (EXIT25) is an effective measure of executive dysfunction, but may be inefficient due to the time it takes to complete 25 interview-based items. The current study aimed to examine psychometric properties of the EXIT25, with a specific focus on determining whether a briefer version of the measure could comprehensively assess executive dysfunction. The current study applied a graded response model (a type of item response theory model for polytomous categorical data) to identify items that were most closely related to the underlying construct of executive functioning and best discriminated between varying levels of executive functioning. Participants were 660 adults ages 40 to 96 years living in West Texas, who were recruited through an ongoing epidemiological study of rural health and aging, called Project FRONTIER. The EXIT25 was the primary measure examined. Participants also completed the Trail Making Test and Controlled Oral Word Association Test, among other measures, to examine the convergent validity of a brief form of the EXIT25. Eight items were identified that provided the majority of the information about the underlying construct of executive functioning; total scores on these items were associated with total scores on other measures of executive functioning and were able to differentiate between cognitively healthy, mildly cognitively impaired, and demented participants. In addition, cutoff scores were recommended based on sensitivity and specificity of scores. A brief, eight-item version of the EXIT25 may be an effective and efficient screening for executive dysfunction among older adults.
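The graded response model mentioned in this abstract can be sketched as follows: each ordered threshold has a logistic curve for the probability of responding in that category or higher, and category probabilities are differences of adjacent cumulative curves. The function and parameter names below are illustrative assumptions, and the parameter values are invented for the example rather than taken from the EXIT25 analysis.

```python
import math

def grm_probs(theta, a, thresholds):
    """Probabilities of categories 0..m under Samejima's graded response model.

    theta: latent trait value; a: item discrimination;
    thresholds: ordered boundary locations b_1 <= ... <= b_m (illustrative names).
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in non-decreasing order")
    # Cumulative curves P(X >= k); by definition P(X >= 0) = 1 and P(X >= m + 1) = 0.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum.append(0.0)
    # Category probability is the drop between adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical three-category item evaluated at the centre of its thresholds.
probs = grm_probs(theta=0.0, a=1.0, thresholds=[-1.0, 1.0])
```

A larger `a` makes the cumulative curves steeper, which is the sense in which an item "discriminates" more sharply between neighbouring trait levels.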
Using the Nominal Response Model to Evaluate Response Category Discrimination in the PROMIS Emotional Distress Item Pools

ERIC Educational Resources Information Center

Preston, Kathleen; Reise, Steven; Cai, Li; Hays, Ron D.

2011-01-01

The authors used a nominal response item response theory model to estimate category boundary discrimination (CBD) parameters for items drawn from the Emotional Distress item pools (Depression, Anxiety, and Anger) developed in the Patient-Reported Outcomes Measurement Information Systems (PROMIS) project. For polytomous items with ordered response…
Aggregating Polytomous DIF Results over Multiple Test Administrations

ERIC Educational Resources Information Center

Zwick, Rebecca; Ye, Lei; Isham, Steven

2018-01-01

In typical differential item functioning (DIF) assessments, an item's DIF status is not influenced by its status in previous test administrations. An item that has shown DIF at multiple administrations may be treated the same way as an item that has shown DIF in only the most recent administration. Therefore, much useful information about the…
Higher-Order Item Response Models for Hierarchical Latent Traits

ERIC Educational Resources Information Center

Huang, Hung-Yu; Wang, Wen-Chung; Chen, Po-Hsi; Su, Chi-Ming

2013-01-01

Many latent traits in the human sciences have a hierarchical structure. This study aimed to develop a new class of higher order item response theory models for hierarchical latent traits that are flexible in accommodating both dichotomous and polytomous items, to estimate both item and person parameters jointly, to allow users to specify…
A knowledge-based theory of rising scores on "culture-free" tests.

PubMed

Fox, Mark C; Mitchum, Ainsley L

2013-08-01

Secular gains in intelligence test scores have perplexed researchers since they were documented by Flynn (1984, 1987). Gains are most pronounced on abstract, so-called culture-free tests, prompting Flynn (2007) to attribute them to problem-solving skills availed by scientifically advanced cultures. We propose that recent-born individuals have adopted an approach to analogy that enables them to infer higher level relations requiring roles that are not intrinsic to the objects that constitute initial representations of items. This proposal is translated into item-specific predictions about differences between cohorts in pass rates and item-response patterns on the Raven's Matrices (Flynn, 1987), a seemingly culture-free test that registers the largest Flynn effect. Consistent with predictions, archival data reveal that individuals born around 1940 are less able to map objects at higher levels of relational abstraction than individuals born around 1990. Polytomous Rasch models verify predicted violations of measurement invariance, as raw scores are found to underestimate the number of analogical rules inferred by members of the earlier cohort relative to members of the later cohort who achieve the same overall score. The work provides a plausible cognitive account of the Flynn effect, furthers understanding of the cognition of matrix reasoning, and underscores the need to consider how test-takers select item responses. PsycINFO Database Record (c) 2013 APA, all rights reserved.
Using G-Theory to Enhance Evidence of Reliability and Validity for Common Uses of the Paulhus Deception Scales.

PubMed

Vispoel, Walter P; Morris, Carrie A; Kilinc, Murat

2018-01-01

We applied a new approach to Generalizability theory (G-theory) involving parallel splits and repeated measures to evaluate common uses of the Paulhus Deception Scales based on polytomous and four types of dichotomous scoring. G-theory indices of reliability and validity accounting for specific-factor, transient, and random-response measurement error supported use of polytomous over dichotomous scores as contamination checks; as control, explanatory, and outcome variables; as aspects of construct validation; and as indexes of environmental effects on socially desirable responding. Polytomous scoring also provided results for flagging faking as dependable as those when using dichotomous scoring methods. These findings argue strongly against the nearly exclusive use of dichotomous scoring for the Paulhus Deception Scales in practice and underscore the value of G-theory in demonstrating this. We provide guidelines for applying our G-theory techniques to other objectively scored clinical assessments, for using G-theory to estimate how changes to a measure might improve reliability, and for obtaining software to conduct G-theory analyses free of charge.

Classification Consistency and Accuracy for Complex Assessments Using Item Response Theory

ERIC Educational Resources Information Center

Lee, Won-Chan

2010-01-01

In this article, procedures are described for estimating single-administration classification consistency and accuracy indices for complex assessments using item response theory (IRT). This IRT approach was applied to real test data comprising dichotomous and polytomous items. Several different IRT model combinations were considered. Comparisons…
Assessment of Differential Item Functioning in Testlet-Based Items Using the Rasch Testlet Model

ERIC Educational Resources Information Center

Wang, Wen-Chung; Wilson, Mark

2005-01-01

This study presents a procedure for detecting differential item functioning (DIF) for dichotomous and polytomous items in testlet-based tests, whereby DIF is taken into account by adding DIF parameters into the Rasch testlet model. Simulations were conducted to assess recovery of the DIF and other parameters. Two independent variables, test type…
Application of a General Polytomous Testlet Model to the Reading Section of a Large-Scale English Language Assessment. Research Report. ETS RR-10-21

ERIC Educational Resources Information Center

Li, Yanmei; Li, Shuhong; Wang, Lin

2010-01-01

Many standardized educational tests include groups of items based on a common stimulus, known as "testlets". Standard unidimensional item response theory (IRT) models are commonly used to model examinees' responses to testlet items. However, it is known that local dependence among testlet items can lead to biased item parameter estimates…
Explaining Crossing DIF in Polytomous Items Using Differential Step Functioning Effects

ERIC Educational Resources Information Center

Penfield, Randall D.

2010-01-01

Crossing, or intersecting, differential item functioning (DIF) is a form of nonuniform DIF that exists when the sign of the between-group difference in expected item performance changes across the latent trait continuum. The presence of crossing DIF presents a problem for many statistics developed for evaluating DIF because positive and negative…
Development and Validation of a Computer Adaptive EFL Test

ERIC Educational Resources Information Center

He, Lianzhen; Min, Shangchao

2017-01-01

The first aim of this study was to develop a computer adaptive EFL test (CALT) that assesses test takers' listening and reading proficiency in English with dichotomous items and polytomous testlets. We reported in detail on the development of the CALT, including item banking, determination of suitable item response theory (IRT) models for item…
Standard Errors and Confidence Intervals from Bootstrapping for Ramsay-Curve Item Response Theory Model Item Parameters

ERIC Educational Resources Information Center

Gu, Fei; Skorupski, William P.; Hoyle, Larry; Kingston, Neal M.

2011-01-01

Ramsay-curve item response theory (RC-IRT) is a nonparametric procedure that estimates the latent trait using splines, and no distributional assumption about the latent trait is required. For item parameters of the two-parameter logistic (2-PL), three-parameter logistic (3-PL), and polytomous IRT models, RC-IRT can provide more accurate estimates…
A Framework for Dimensionality Assessment for Multidimensional Item Response Models

ERIC Educational Resources Information Center

Svetina, Dubravka; Levy, Roy

2014-01-01

A framework is introduced for considering dimensionality assessment procedures for multidimensional item response models. The framework characterizes procedures in terms of their confirmatory or exploratory approach, parametric or nonparametric assumptions, and applicability to dichotomous, polytomous, and missing data. Popular and emerging…
DIF Detection Using Multiple-Group Categorical CFA with Minimum Free Baseline Approach

ERIC Educational Resources Information Center

Chang, Yu-Wei; Huang, Wei-Kang; Tsai, Rung-Ching

2015-01-01

The aim of this study is to assess the efficiency of using the multiple-group categorical confirmatory factor analysis (MCCFA) and the robust chi-square difference test in differential item functioning (DIF) detection for polytomous items under the minimum free baseline strategy. While testing for DIF items, despite the strong assumption that all…
A Study of Reverse-Worded Matched Item Pairs Using the Generalized Partial Credit and Nominal Response Models

ERIC Educational Resources Information Center

Matlock Cole, Ki Lynn; Turner, Ronna C.; Gitchel, W. Dent

2018-01-01

The generalized partial credit model (GPCM) is often used for polytomous data; however, the nominal response model (NRM) allows for the investigation of how adjacent categories may discriminate differently when items are positively or negatively worded. Ten items from three different self-reported scales were used (anxiety, depression, and…
Methods for Assessing Item, Step, and Threshold Invariance in Polytomous Items Following the Partial Credit Model

ERIC Educational Resources Information Center

Penfield, Randall D.; Myers, Nicholas D.; Wolfe, Edward W.

2008-01-01

Measurement invariance in the partial credit model (PCM) can be conceptualized in several different but compatible ways. In this article the authors distinguish between three forms of measurement invariance in the PCM: step invariance, item invariance, and threshold invariance. Approaches for modeling these three forms of invariance are proposed,…
A Polytomous Item Response Theory Analysis of Social Physique Anxiety Scale

ERIC Educational Resources Information Center

Fletcher, Richard B.; Crocker, Peter

2014-01-01

The present study investigated the social physique anxiety scale's factor structure and item properties using confirmatory factor analysis and item response theory. An additional aim was to identify differences in response patterns between groups (gender). A large sample of high school students aged 11-15 years (N = 1,529) consisting of n =…
Quantifying Local, Response Dependence between Two Polytomous Items Using the Rasch Model

ERIC Educational Resources Information Center

Andrich, David; Humphry, Stephen M.; Marais, Ida

2012-01-01

Models of modern test theory imply statistical independence among responses, generally referred to as "local independence." One violation of local independence occurs when the response to one item governs the response to a subsequent item. Expanding on a formulation of this kind of violation as a process in the dichotomous Rasch model,…
A Generalized DIF Effect Variance Estimator for Measuring Unsigned Differential Test Functioning in Mixed Format Tests

ERIC Educational Resources Information Center

Penfield, Randall D.; Algina, James

2006-01-01

One approach to measuring unsigned differential test functioning is to estimate the variance of the differential item functioning (DIF) effect across the items of the test. This article proposes two estimators of the DIF effect variance for tests containing dichotomous and polytomous items. The proposed estimators are direct extensions of the…
Modified Likelihood-Based Item Fit Statistics for the Generalized Graded Unfolding Model

ERIC Educational Resources Information Center

Roberts, James S.

2008-01-01

Orlando and Thissen (2000) developed an item fit statistic for binary item response theory (IRT) models known as S-X[superscript 2]. This article generalizes their statistic to polytomous unfolding models. Four alternative formulations of S-X[superscript 2] are developed for the generalized graded unfolding model (GGUM). The GGUM is a…
Item response theory - A first approach

NASA Astrophysics Data System (ADS)

Nunes, Sandra; Oliveira, Teresa; Oliveira, Amílcar

2017-07-01

Item Response Theory (IRT) has become one of the most popular scoring frameworks for measurement data, frequently used in computerized adaptive testing, cognitively diagnostic assessment and test equating. According to Andrade et al. (2000), IRT can be defined as a set of mathematical models (Item Response Models - IRM) constructed to represent the probability of an individual giving the right answer to an item of a particular test. The number of Item Response Models available for measurement analysis has increased considerably in the last fifteen years due to increasing computer power and a demand for accuracy and more meaningful inferences grounded in complex data. The developments in modeling with Item Response Theory were related to developments in estimation theory, most remarkably Bayesian estimation with Markov chain Monte Carlo algorithms (Patz & Junker, 1999). The popularity of Item Response Theory has also prompted numerous overviews in books and journals, and many connections between IRT and other statistical estimation procedures, such as factor analysis and structural equation modeling, have been made repeatedly (van der Linden & Hambleton, 1997). As stated before, Item Response Theory covers a variety of measurement models, ranging from basic one-dimensional models for dichotomously and polytomously scored items and their multidimensional analogues to models that incorporate information about cognitive sub-processes which influence the overall item response process. The aim of this work is to introduce the main concepts associated with one-dimensional models of Item Response Theory, to specify the logistic models with one, two and three parameters, to discuss some properties of these models and to present the main estimation procedures.
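As a minimal sketch of the one-, two- and three-parameter logistic models named in the abstract above, the following Python functions evaluate the standard 3PL item characteristic curve and obtain the 2PL and 1PL by fixing parameters. The function names and the parameter names a (discrimination), b (difficulty) and c (pseudo-guessing) are illustrative assumptions, not taken from the cited paper.

```python
import math


def irt_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the three-parameter logistic model.

    theta: person ability; a: discrimination; b: difficulty;
    c: lower asymptote (pseudo-guessing). All names are illustrative.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def irt_2pl(theta, a, b):
    """Two-parameter logistic model: the 3PL with no guessing (c = 0)."""
    return irt_3pl(theta, a=a, b=b, c=0.0)


def irt_1pl(theta, b):
    """One-parameter logistic model: the 2PL with a common discrimination of 1."""
    return irt_3pl(theta, a=1.0, b=b, c=0.0)
```

When ability equals difficulty, the 1PL and 2PL give a probability of one half, while the 3PL gives the midpoint between c and 1. Extremely negative values of a * (theta - b) can overflow math.exp, so a production implementation would likely guard that case.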
A Monte Carlo Study Investigating Missing Data, Differential Item Functioning, and Effect Size

ERIC Educational Resources Information Center

Garrett, Phyllis

2009-01-01

The use of polytomous items in assessments has increased over the years, and as a result, the validity of these assessments has been a concern. Differential item functioning (DIF) and missing data are two factors that may adversely affect assessment validity. Both factors have been studied separately, but DIF and missing data are likely to occur…
Scaling Users' Perceptions of Library Service Quality Using Item Response Theory: A LibQUAL+ [TM] Study

ERIC Educational Resources Information Center

Wei, Youhua; Thompson, Bruce; Cook, C. Colleen

2005-01-01

LibQUAL+[TM] data to date have not been subjected to the modern measurement theory called polytomous item response theory (IRT). The data interpreted here were collected from 42,090 participants who completed the "American English" version of the 22 core LibQUAL+[TM] items, and 12,552 participants from Australia and Europe who…
Bayesian Estimation of Multi-Unidimensional Graded Response IRT Models

ERIC Educational Resources Information Center

Kuo, Tzu-Chun

2015-01-01

Item response theory (IRT) has gained an increasing popularity in large-scale educational and psychological testing situations because of its theoretical advantages over classical test theory. Unidimensional graded response models (GRMs) are useful when polytomous response items are designed to measure a unified latent trait. They are limited in…
An Investigation of the Performance of the Generalized S-X[superscript 2] Item-Fit Index for Polytomous IRT Models. ACT Research Report Series, 2007-1

ERIC Educational Resources Information Center

Kang, Taehoon; Chen, Troy T.

2007-01-01

Orlando and Thissen (2000, 2003) proposed an item-fit index, S-X[superscript 2], for dichotomous item response theory (IRT) models, which has performed better than traditional item-fit statistics such as Yen's (1981) Q[subscript 1] and McKinley and Mill's (1985) G[superscript 2]. This study extends the utility of S-X[superscript 2] to polytomous…
Modeling Polytomous Item Responses Using Simultaneously Estimated Multinomial Logistic Regression Models

ERIC Educational Resources Information Center

Anderson, Carolyn J.; Verkuilen, Jay; Peyton, Buddy L.

2010-01-01

Survey items with multiple response categories and multiple-choice test questions are ubiquitous in psychological and educational research. We illustrate the use of log-multiplicative association (LMA) models that are extensions of the well-known multinomial logistic regression model for multiple dependent outcome variables to reanalyze a set of…

Using a Multivariate Multilevel Polytomous Item Response Theory Model to Study Parallel Processes of Change: The Dynamic Association between Adolescents' Social Isolation and Engagement with Delinquent Peers in the National Youth Survey

ERIC Educational Resources Information Center

Hsieh, Chueh-An; von Eye, Alexander A.; Maier, Kimberly S.

2010-01-01

The application of multidimensional item response theory models to repeated observations has demonstrated great promise in developmental research. It allows researchers to take into consideration both the characteristics of item response and measurement error in longitudinal trajectory analysis, which improves the reliability and validity of the…
An Investigation of Item Fit Statistics for Mixed IRT Models

ERIC Educational Resources Information Center

Chon, Kyong Hee

2009-01-01

The purpose of this study was to investigate procedures for assessing model fit of IRT models for mixed format data. In this study, various IRT model combinations were fitted to data containing both dichotomous and polytomous item responses, and the suitability of the chosen model mixtures was evaluated based on a number of model fit procedures.…
Vegetable parenting practices scale. Item response modeling analyses

PubMed Central

Chen, Tzu-An; O’Connor, Teresia; Hughes, Sheryl; Beltran, Alicia; Baranowski, Janice; Diep, Cassandra; Baranowski, Tom

2015-01-01

Objective: To evaluate the psychometric properties of a vegetable parenting practices scale using multidimensional polytomous item response modeling, which enables assessing item fit to latent variables and the distributional characteristics of the items in comparison to the respondents. We also tested for differences in the ways items function (called differential item functioning) across child’s gender, ethnicity, age, and household income groups. Method: Parents of 3–5 year old children completed a self-reported vegetable parenting practices scale online. Vegetable parenting practices consisted of 14 effective vegetable parenting practices and 12 ineffective vegetable parenting practices items, each with three subscales (responsiveness, structure, and control). Multidimensional polytomous item response modeling was conducted separately on effective vegetable parenting practices and ineffective vegetable parenting practices. Results: One effective vegetable parenting practice item did not fit the model well in the full sample or across demographic groups, and another was a misfit in differential item functioning analyses across child’s gender. Significant differential item functioning was detected across children’s age and ethnicity groups, and more among effective vegetable parenting practices than ineffective vegetable parenting practices items. Wright maps showed items only covered parts of the latent trait distribution. The harder- and easier-to-respond ends of the construct were not covered by items for effective vegetable parenting practices and ineffective vegetable parenting practices, respectively. Conclusions: Several effective vegetable parenting practices and ineffective vegetable parenting practices scale items functioned differently on the basis of child’s demographic characteristics; therefore, researchers should use these vegetable parenting practices scales with caution. Item response modeling should be incorporated in analyses of parenting practice questionnaires to better assess differences across demographic characteristics. PMID:25895694
GMHDIF: A Computer Program for Detecting DIF in Dichotomous and Polytomous Items Using Generalized Mantel-Haenszel Statistics

ERIC Educational Resources Information Center

Fidalgo, Angel M.

2011-01-01

Mantel-Haenszel (MH) methods constitute one of the most popular nonparametric differential item functioning (DIF) detection procedures. GMHDIF has been developed to provide an easy-to-use program for conducting DIF analyses. Some of the advantages of this program are that (a) it performs two-stage DIF analyses in multiple groups simultaneously;…
A General Program for Item-Response Analysis That Employs the Stabilized Newton-Raphson Algorithm. Research Report. ETS RR-13-32

ERIC Educational Resources Information Center

Haberman, Shelby J.

2013-01-01

A general program for item-response analysis is described that uses the stabilized Newton-Raphson algorithm. This program is written to be compliant with Fortran 2003 standards and is sufficiently general to handle independent variables, multidimensional ability parameters, and matrix sampling. The ability variables may be either polytomous or…
Multidimensional Scoring of Abilities: The Ordered Polytomous Response Case

ERIC Educational Resources Information Center

de la Torre, Jimmy

2008-01-01

Recent work has shown that multidimensionally scoring responses from different tests can provide better ability estimates. For educational assessment data, applications of this approach have been limited to binary scores. Of the different variants, the de la Torre and Patz model is considered more general because implementing the scoring procedure…
An Application of Unfolding and Cumulative Item Response Theory Models for Noncognitive Scaling: Examining the Assumptions and Applicability of the Generalized Graded Unfolding Model

ERIC Educational Resources Information Center

Sgammato, Adrienne N.

2009-01-01

This study examined the applicability of a relatively new unidimensional, unfolding item response theory (IRT) model called the generalized graded unfolding model (GGUM; Roberts, Donoghue, & Laughlin, 2000). A total of four scaling methods were applied. Two commonly used cumulative IRT models for polytomous data, the Partial Credit Model and…
Testing parent dyad interchangeability in the parent proxy-report of PedsQL™ 4.0: a differential item functioning analysis.

PubMed

Doostfatemeh, Marziyeh; Ayatollahi, Seyyed Mohammad Taghi; Jafari, Peyman

2015-08-01

In child-parent agreement studies in the field of paediatric health-related quality of life (HRQoL), little attention has been paid to the effect of gender in parental proxy rating of children's HRQoL. This study aims to test the potential interchangeability of parent dyads in reporting children's HRQoL on both item and scale levels of the PedsQL™ 4.0 instrument, using the approach of differential item functioning (DIF). The PedsQL™ 4.0 Generic Core Scales were completed by 576 father-and-mother dyads. A polytomous item response theory model, graded response model, was used to detect DIF across fathers and mothers. Assessment at item level showed that fathers and mothers perceived the meaning of items of the PedsQL™ 4.0 consistently. Regarding the scale level, a moderate to high level of agreement was observed between mothers' and fathers' reports on all similar subscales. Although the significant mean score differences in total, physical and emotional functioning indicated that fathers gave higher scores to their children, the small effect size implied that this difference may not be practically meaningful. Our findings revealed that discrepancy in parent dyads in rating children's HRQoL is a "real" difference and not an artefact due to measurement non-invariance. Fathers were seen to have slightly different insights into their children, especially for emotional functioning, but overall the results were not all that different. This suggests that paternal proxy-reports can be included in studies along with maternal proxy-reports, and the two may be combined when looking at parent-child agreement. Parent-child agreement studies in Iran are not affected by parents' gender, and therefore, researchers may rely on the assumption of the interchangeability of fathers and mothers in these studies.
CTTITEM: SAS macro and SPSS syntax for classical item analysis.

PubMed

Lei, Pui-Wa; Wu, Qiong

2007-08-01

This article describes the functions of a SAS macro and an SPSS syntax that produce common statistics for conventional item analysis including Cronbach's alpha, item difficulty index (p-value or item mean), and item discrimination indices (D-index, point biserial and biserial correlations for dichotomous items and item-total correlation for polytomous items). These programs represent an improvement over the existing SAS and SPSS item analysis routines in terms of completeness and user-friendliness. To promote routine evaluations of item qualities in instrument development of any scale, the programs are available at no charge for interested users. The program codes along with a brief user's manual that contains instructions and examples are downloadable from suen.ed.psu.edu/-pwlei/plei.htm.
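The statistics listed in the abstract above (Cronbach's alpha, item difficulty as the item mean, and an item-total correlation) can be sketched in a few lines of Python. This is not the CTTITEM macro or its SPSS syntax; the function names and the small data set are hypothetical, and the item-total correlation shown is the corrected version (each item against the sum of the remaining items), which is an assumption about which variant is wanted.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_analysis(scores):
    """scores: one row per respondent, one column per item (dichotomous or polytomous).

    Returns (alpha, item means, corrected item-total correlations).
    """
    n_items = len(scores[0])
    items = [[row[j] for row in scores] for j in range(n_items)]
    totals = [sum(row) for row in scores]
    # Cronbach's alpha from item variances and total-score variance.
    alpha = n_items / (n_items - 1) * (
        1 - sum(variance(it) for it in items) / variance(totals)
    )
    difficulty = [mean(it) for it in items]
    # Correlate each item with the total of the other items.
    item_rest = [pearson(it, [t - v for t, v in zip(totals, it)]) for it in items]
    return alpha, difficulty, item_rest


# Hypothetical responses: 5 respondents, 3 items scored 0-2.
example = [[2, 1, 2], [0, 0, 1], [1, 1, 1], [2, 2, 2], [0, 1, 0]]
alpha, difficulty, item_rest = item_analysis(example)
```

For dichotomous items the item mean is the proportion correct (the classical p-value); for polytomous items it is the mean item score.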
Polytomous Rasch Models in Counseling Assessment

ERIC Educational Resources Information Center

Willse, John T.

2017-01-01

This article provides a brief introduction to the Rasch model. Motivation for using Rasch analyses is provided. Important Rasch model concepts and key aspects of result interpretation are introduced, with major points reinforced using a simulation demonstration. Concrete guidelines are provided regarding sample size and the evaluation of items.
Tree-Based Global Model Tests for Polytomous Rasch Models

ERIC Educational Resources Information Center

Komboz, Basil; Strobl, Carolin; Zeileis, Achim

2018-01-01

Psychometric measurement models are only valid if measurement invariance holds between test takers of different groups. Global model tests, such as the well-established likelihood ratio (LR) test, are sensitive to violations of measurement invariance, such as differential item functioning and differential step functioning. However, these…
PROC IRT: A SAS Procedure for Item Response Theory

PubMed Central

Matlock Cole, Ki; Paek, Insu

2017-01-01

This article reviews the item response theory procedure (PROC IRT) in SAS/STAT 14.1 for conducting item response theory (IRT) analyses of dichotomous and polytomous datasets that are unidimensional or multidimensional. The review provides an overview of available features, including models, estimation procedures, interfacing, input, and output files. A small-scale simulation study evaluates the IRT model parameter recovery of PROC IRT. The IRT procedure in Statistical Analysis Software (SAS) may be useful for researchers who frequently use SAS for analyses, research, and teaching.
Rasch analysis of the hospital anxiety and depression scale among Chinese cataract patients.

PubMed

Lin, Xianchai; Chen, Ziyan; Jin, Ling; Gao, Wuyou; Qu, Bo; Zuo, Yajing; Liu, Rongjiao; Yu, Minbin

2017-01-01

To analyze the validity of the Hospital Anxiety and Depression Scale (HADS) among a Chinese cataract population. A total of 275 participants with unilateral or bilateral cataract were recruited to complete the Chinese version of HADS. The patients' demographic and ophthalmic characteristics were documented. Rasch analysis was conducted to examine the model fit statistics, the threshold ordering of the polytomous items, targeting, person separation index and reliability, local dependency, unidimensionality, differential item functioning (DIF) and construct validity of the HADS individual and summary measures. Rasch analysis was performed on the anxiety and depression subscales as well as the HADS-Total score. The items of the original HADS-Anxiety, HADS-Depression and HADS-Total demonstrated evidence of misfit to the Rasch model. Removing item A7 from the anxiety subscale and rescoring item D14 in the depression subscale significantly improved Rasch model fit. A 12-item higher order total scale with further removal of D12 was found to fit the Rasch model. The modified items had ordered response thresholds. No uniform DIF was detected, whereas notable non-uniform DIF in the high-ability group was found. Revised cut-off points were given for the modified anxiety and depression subscales. The modified version of HADS with HADS-A and HADS-D as subscales and HADS-T as a higher-order measure is a reliable and valid instrument that may be useful for assessing anxiety and depression states in a Chinese cataract population.
Conditional Standard Errors, Reliability and Decision Consistency of Performance Levels Using Polytomous IRT.

ERIC Educational Resources Information Center

Wang, Tianyou; And Others

M. J. Kolen, B. A. Hanson, and R. L. Brennan (1992) presented a procedure for assessing the conditional standard error of measurement (CSEM) of scale scores using a strong true-score model. They also investigated the ways of using nonlinear transformation from number-correct raw score to scale score to equalize the conditional standard error along…
Dimensionality Assessment of Ordered Polytomous Items with Parallel Analysis

ERIC Educational Resources Information Center

Timmerman, Marieke E.; Lorenzo-Seva, Urbano

2011-01-01

Parallel analysis (PA) is an often-recommended approach for assessment of the dimensionality of a variable set. PA is known in different variants, which may yield different dimensionality indications. In this article, the authors considered the most appropriate PA procedure to assess the number of common factors underlying ordered polytomously…
Heteroscedastic Latent Trait Models for Dichotomous Data.

PubMed

Molenaar, Dylan

2015-09-01

Effort has been devoted to account for heteroscedasticity with respect to observed or latent moderator variables in item or test scores. For instance, in the multi-group generalized linear latent trait model, it could be tested whether the observed (polychoric) covariance matrix differs across the levels of an observed moderator variable. In the case that heteroscedasticity arises across the latent trait itself, existing models commonly distinguish between heteroscedastic residuals and a skewed trait distribution. These models have valuable applications in intelligence, personality and psychopathology research. However, existing approaches are limited to continuous and polytomous data, while dichotomous data are common in intelligence and psychopathology research. Therefore, in the present paper, a heteroscedastic latent trait model is presented for dichotomous data. The model is studied in a simulation study and applied to data pertaining to alcohol use and cognitive ability.
ADHD Symptoms in Preschool Children: Examining Psychometric Properties using IRT

PubMed Central

Purpura, David J.; Wilson, Shauna B.; Lonigan, Christopher J.

2010-01-01

Clear and empirically supported diagnostic symptoms are important for proper diagnosis and treatment of psychological disorders. Unfortunately, symptoms of many disorders presented in the DSM-IV-TR lack sufficient psychometric evaluation. In this study, an Item Response Theory analysis was applied to ratings of the 18 Attention-Deficit/Hyperactivity Disorder (ADHD) symptoms in 268 preschool children. Children (55% boys) in this sample ranged in age from 37 to 74 months; 80.4% were identified as African American, 15.1% Caucasian, and 4.5% other ethnicity. Dichotomous and polytomous scoring methods for rating ADHD symptoms were compared and psychometric properties of these symptoms were calculated. Symptom-level analyses revealed that, in general, the current symptoms provided useful information in diagnosing ADHD in preschool children; however, several symptoms provided redundant information and should be examined further. PMID:20822267
Measuring Variability and Change with an Item Response Model for Polytomous Variables.

ERIC Educational Resources Information Center

Eid, Michael; Hoffman, Lore

1998-01-01

An extension of the graded-response model of F. Samejima (1969) is presented for the measurement of variability and change. The model is illustrated with a longitudinal study of student interest in radioactivity conducted with about 1,200 German students in elementary school when the study began. (SLD)
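For orientation, the base graded-response model of Samejima (1969) that this record extends can be sketched as follows. The sketch covers only the standard model, not the extension for variability and change described in the abstract; the function name and the parameter names a (discrimination) and thresholds (ordered category boundaries) are illustrative assumptions.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the standard graded response model.

    thresholds must be strictly increasing (b_1 < ... < b_{K-1}); the item
    then has K ordered categories scored 0..K-1.
    """
    # Cumulative probabilities P(X >= k): 1 for the lowest category,
    # a logistic curve at each threshold, and 0 beyond the highest category.
    cum = (
        [1.0]
        + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
        + [0.0]
    )
    # Each category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]
```

A call such as grm_category_probs(0.3, a=1.2, thresholds=[-1.0, 0.0, 1.5]) returns four probabilities, one per response category, that sum to one.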
The Internet Gaming Disorder Scale.

PubMed

Lemmens, Jeroen S; Valkenburg, Patti M; Gentile, Douglas A

2015-06-01

Recently, the American Psychiatric Association included Internet gaming disorder (IGD) in the appendix of the 5th edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5). The main aim of the current study was to test the reliability and validity of 4 survey instruments to measure IGD on the basis of the 9 criteria from the DSM-5: a long (27-item) and short (9-item) polytomous scale and a long (27-item) and short (9-item) dichotomous scale. The psychometric properties of these scales were tested among a representative sample of 2,444 Dutch adolescents and adults, ages 13-40 years. Confirmatory factor analyses demonstrated that the structural validity (i.e., the dimensional structure) of all scales was satisfactory. Both types of assessment (polytomous and dichotomous) were also reliable (i.e., internally consistent) and showed good criterion-related validity, as indicated by positive correlations with time spent playing games, loneliness, and aggression and negative correlations with self-esteem, prosocial behavior, and life satisfaction. The dichotomous 9-item IGD scale showed solid psychometric properties and was the most practical scale for diagnostic purposes. Latent class analysis of this dichotomous scale indicated that 3 groups could be discerned: normal gamers, risky gamers, and disordered gamers. On the basis of the number of people in this last group, the prevalence of IGD among 13- through 40-year-olds in the Netherlands is approximately 4%. If the DSM-5 threshold for diagnosis (experiencing 5 or more criteria) is applied, the prevalence of disordered gamers is more than 5%. (c) 2015 APA, all rights reserved.
Using SAS PROC MCMC for Item Response Theory Models

PubMed Central

Samonte, Kelli

2014-01-01

Interest in using Bayesian methods for estimating item response theory models has grown at a remarkable rate in recent years. This attentiveness to Bayesian estimation has also inspired a growth in available software such as WinBUGS, R packages, BMIRT, MPLUS, and SAS PROC MCMC. This article intends to provide an accessible overview of Bayesian methods in the context of item response theory to serve as a useful guide for practitioners in estimating and interpreting item response theory (IRT) models. Included is a description of the estimation procedure used by SAS PROC MCMC. Syntax is provided for estimation of both dichotomous and polytomous IRT models, as well as a discussion on how to extend the syntax to accommodate more complex IRT models. PMID:29795834

Assessment of Scientific Literacy: Development and Validation of the Quantitative Assessment of Socio-Scientific Reasoning (QuASSR)

ERIC Educational Resources Information Center

Romine, William L.; Sadler, Troy D.; Kinslow, Andrew T.

2017-01-01

We describe the development and validation of the Quantitative Assessment of Socio-scientific Reasoning (QuASSR) in a college context. The QuASSR contains 10 polytomous, two-tiered items crossed between two scenarios, and is based on theory suggesting a four-pronged structure for SSR (complexity, perspective taking, inquiry, and skepticism). In…
Using the Bootstrap Method to Evaluate the Critical Range of Misfit for Polytomous Rasch Fit Statistics.

PubMed

Seol, Hyunsoo

2016-06-01

The purpose of this study was to apply the bootstrap procedure to evaluate how the bootstrapped confidence intervals (CIs) for polytomous Rasch fit statistics might differ according to sample sizes and test lengths in comparison with the rule-of-thumb critical value of misfit. A total of 25 simulated data sets were generated to fit the Rasch measurement and then a total of 1,000 replications were conducted to compute the bootstrapped CIs under each of 25 testing conditions. The results showed that rule-of-thumb critical values for assessing the magnitude of misfit were not applicable because the infit and outfit mean square error statistics showed different magnitudes of variability over testing conditions and the standardized fit statistics did not exactly follow the standard normal distribution. Further, they also do not share the same critical range for the item and person misfit. Based on the results of the study, the bootstrapped CIs can be used to identify misfitting items or persons as they offer a reasonable alternative solution, especially when the distributions of the infit and outfit statistics are not well known and depend on sample size. © The Author(s) 2016.
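The resampling idea in this abstract can be illustrated with a generic percentile bootstrap. The sketch below is not the study's procedure (which simulated Rasch data under 25 conditions); it assumes that standardized residuals for one item are already available, uses the usual definition of the outfit mean square as the mean of squared standardized residuals, and draws hypothetical residuals purely for illustration.

```python
import random

def outfit_msq(z_residuals):
    """Outfit mean square: the mean of squared standardized residuals."""
    return sum(z * z for z in z_residuals) / len(z_residuals)

def bootstrap_ci(z_residuals, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI obtained by resampling persons with replacement."""
    rng = random.Random(seed)
    n = len(z_residuals)
    stats = sorted(
        outfit_msq([z_residuals[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical standardized residuals for one item (illustration only).
_rng = random.Random(1)
z = [_rng.gauss(0.0, 1.0) for _ in range(200)]
low, high = bootstrap_ci(z)
print(f"outfit MSQ = {outfit_msq(z):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

An item would then be flagged when its observed statistic falls outside the interval built for the matching sample size and test length, rather than outside a fixed rule-of-thumb range.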
Estimating Parameters in the Generalized Graded Unfolding Model: Sensitivity to the Prior Distribution Assumption and the Number of Quadrature Points Used.

ERIC Educational Resources Information Center

Roberts, James S.; Donoghue, John R.; Laughlin, James E.

The generalized graded unfolding model (J. Roberts, J. Donoghue, and J. Laughlin, 1998, 1999) is an item response theory model designed to unfold polytomous responses. The model is based on a proximity relation that postulates higher levels of expected agreement with a given statement to the extent that a respondent is located close to the…
The Usability of CAT System for Assessing the Depressive Level of Japanese-A Study on Psychometric Properties and Response Behavior.

PubMed

Iwata, Noboru; Kikuchi, Kenichi; Fujihara, Yuya

2016-08-01

An innovative measurement system using a computerized adaptive testing technique based on the item response theory (CAT) has been expanding to measure mental health status. However, little is known about the details of its measurement properties based on empirical data. Moreover, the response time (RT) data, which are not available from a paper-and-pencil measurement but are available from a computerized measurement, would be worth investigating for exploring the response behavior. We aimed at constructing the CAT to measure depressive symptomatology in a community population and exploring its measurement properties. Also, we examined the relationships between RTs, individual item responses, and depressive levels. For constructing the CAT system, responses of 2061 workers and university students to 24 depression scale items plus four negatively revised positive affect items were subjected to a polytomous IRT analysis. The stopping rule was set for standard error of estimation < 0.30 or the maximum 15 items displayed. The CAT and non-adaptive computer-based test (CBT) were administered to 209 undergraduates, and 168 of them were administered the tests again after 1 week. On average, the CAT converged after 10.4 items. The θ values estimated by CAT and CBT were highly correlated (r = 0.94 and 0.95 for the 1st and 2nd measurements) and with the traditional scoring procedures (r's > 0.90). The test-retest reliability was at a satisfactory level (r = 0.86). RTs to some items significantly correlated with the θ estimates. The mean RT varied by the item contents and wording, i.e., positive affect items required an additional 2 s or more compared with the other subscale items. The CAT would be a reliable and practical measurement tool for various purposes including stress checks in the workplace.
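The stopping rule quoted in this abstract (standard error below 0.30, or 15 items displayed) can be sketched as follows. The two thresholds come from the abstract; the relation SE = 1 / sqrt(test information) is the standard approximation for maximum-likelihood trait estimates and is an assumption here, as are the per-item information values.

```python
import math

SE_THRESHOLD = 0.30  # from the abstract
MAX_ITEMS = 15       # from the abstract

def should_stop(total_information, n_administered):
    """Stop when SE(theta) < 0.30 or when 15 items have been displayed."""
    if total_information > 0:
        se = 1.0 / math.sqrt(total_information)
    else:
        se = math.inf
    return se < SE_THRESHOLD or n_administered >= MAX_ITEMS

def items_needed(item_information):
    """Number of items administered before the rule fires.

    item_information holds hypothetical information values, in the
    order the items would be administered.
    """
    total = 0.0
    for n, info in enumerate(item_information, start=1):
        total += info
        if should_stop(total, n):
            return n
    return len(item_information)

print(items_needed([1.2] * 20))  # precision criterion is reached first
print(items_needed([0.5] * 20))  # item cap is reached first
```

With informative items the precision criterion ends the test early; with weak items the 15-item cap takes over.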
Strategies for controlling item exposure in computerized adaptive testing with the partial credit model.

PubMed

Davis, Laurie Laughlin; Dodd, Barbara G

2008-01-01

Exposure control research with polytomous item pools has determined that randomization procedures can be very effective for controlling test security in computerized adaptive testing (CAT). The current study investigated the performance of four procedures for controlling item exposure in a CAT under the partial credit model. In addition to a no exposure control baseline condition, the Kingsbury-Zara, modified-within-.10-logits, Sympson-Hetter, and conditional Sympson-Hetter procedures were implemented to control exposure rates. The Kingsbury-Zara and the modified-within-.10-logits procedures were implemented with 3 and 6 item candidate conditions. The results show that the Kingsbury-Zara and modified-within-.10-logits procedures with 6 item candidates performed as well as the conditional Sympson-Hetter in terms of exposure rates, overlap rates, and pool utilization. These two procedures are strongly recommended for use with partial credit CATs due to their simplicity and strength of their results.
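The "item candidate" conditions mentioned above refer to randomization among a small set of near-optimal items. A minimal sketch of that idea, as the randomesque approach is commonly described: rank the unused items by information at the current trait estimate, then draw one at random from the top k (3 or 6 in the study). The item names and information values are hypothetical, and the modified-within-.10-logits and Sympson-Hetter procedures are not shown.

```python
import random

def randomesque_pick(item_information, k, rng):
    """Pick one item at random from the k most informative unused items.

    item_information maps item id -> information at the current theta.
    """
    ranked = sorted(item_information, key=item_information.get, reverse=True)
    return rng.choice(ranked[:k])

# Hypothetical information values for four unused items.
info = {"item_a": 0.9, "item_b": 0.7, "item_c": 0.4, "item_d": 0.1}
rng = random.Random(0)
picked = randomesque_pick(info, k=3, rng=rng)
print(picked)
```

Setting k=1 recovers plain maximum-information selection, which corresponds to the no-exposure-control baseline.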
Item response theory analysis of the life orientation test-revised: age and gender differential item functioning analyses.

PubMed

Steca, Patrizia; Monzani, Dario; Greco, Andrea; Chiesi, Francesca; Primi, Caterina

2015-06-01

This study is aimed at testing the measurement properties of the Life Orientation Test-Revised (LOT-R) for the assessment of dispositional optimism by employing item response theory (IRT) analyses. The LOT-R was administered to a large sample of 2,862 Italian adults. First, confirmatory factor analyses demonstrated the theoretical conceptualization of the construct measured by the LOT-R as a single bipolar dimension. Subsequently, IRT analyses for polytomous, ordered response category data were applied to investigate the items' properties. The equivalence of the items across gender and age was assessed by analyzing differential item functioning. Discrimination and severity parameters indicated that all items were able to distinguish people with different levels of optimism and adequately covered the spectrum of the latent trait. Additionally, the LOT-R appears to be gender invariant and, with minor exceptions, age invariant. Results provided evidence that the LOT-R is a reliable and valid measure of dispositional optimism. © The Author(s) 2014.
Validation of self-directed learning instrument and establishment of normative data for nursing students in Taiwan: using polytomous item response theory.

PubMed

Cheng, Su-Fen; Lee-Hsieh, Jane; Turton, Michael A; Lin, Kuan-Chia

2014-06-01

Little research has investigated the establishment of norms for nursing students' self-directed learning (SDL) ability, recognized as an important capability for professional nurses. An item response theory (IRT) approach was used to establish norms for SDL abilities valid for the different nursing programs in Taiwan. The purposes of this study were (a) to use IRT with a graded response model to reexamine the SDL instrument, or the SDLI, originally developed by this research team using confirmatory factor analysis and (b) to establish SDL ability norms for the four different nursing education programs in Taiwan. Stratified random sampling with probability proportional to size was used. A minimum of 15% of students from the four different nursing education degree programs across Taiwan was selected. A total of 7,879 nursing students from 13 schools were recruited. The research instrument was the 20-item SDLI developed by Cheng, Kuo, Lin, and Lee-Hsieh (2010). IRT with the graded response model was used with a two-parameter logistic model (discrimination and difficulty) for the data analysis, calculated using MULTILOG. Norms were established using percentile rank. Analysis of item information and test information functions revealed that 18 items exhibited very high discrimination and two items had high discrimination. The test information function was higher in this range of scores, indicating greater precision in the estimate of nursing student SDL. Reliability fell between .80 and .94 for each domain and the SDLI as a whole. The total information function shows that the SDLI is appropriate for all nursing students, except for the top 2.5%. SDL ability norms were established for each nursing education program and for the nation as a whole. IRT is shown to be a potent and useful methodology for scale evaluation. The norms for SDL established in this research will provide practical standards for nursing educators and students in Taiwan.
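For readers unfamiliar with the graded response model named in this abstract, the following sketch computes category probabilities for a single item. It uses the standard logistic form of the model, with one discrimination parameter and ordered thresholds; the parameter values are hypothetical and are not taken from the SDLI calibration.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.

    a: item discrimination.
    thresholds: increasing b_1 < ... < b_{K-1} for K ordered categories.
    """
    def p_star(b):
        # Probability of responding in the category above threshold b.
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    cumulative = [1.0] + [p_star(b) for b in thresholds] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical four-category item.
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

Each category probability is the difference between two adjacent cumulative curves, which is why the thresholds must be ordered.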
Applying the Longitudinal Model from Item Response Theory to Assess Health-Related Quality of Life in the PRODIGE 4/ACCORD 11 Randomized Trial.

PubMed

Barbieri, Antoine; Anota, Amélie; Conroy, Thierry; Gourgou-Bourgade, Sophie; Juzyna, Beata; Bonnetain, Franck; Lavergne, Christian; Bascoul-Mollevi, Caroline

2016-07-01

A new longitudinal statistical approach was compared to the classical methods currently used to analyze health-related quality-of-life (HRQoL) data. The comparison was made using data in patients with metastatic pancreatic cancer. Three hundred forty-two patients from the PRODIGE4/ACCORD 11 study were randomly assigned to FOLFIRINOX versus gemcitabine regimens. HRQoL was evaluated using the European Organization for Research and Treatment of Cancer (EORTC) QLQ-C30. The classical analysis uses a linear mixed model (LMM), considering an HRQoL score as a good representation of the true value of the HRQoL, following EORTC recommendations. In contrast, built on the item response theory (IRT), our approach considered HRQoL as a latent variable directly estimated from the raw data. For polytomous items, we extended the partial credit model to a longitudinal analysis (longitudinal partial credit model [LPCM]), thereby modeling the latent trait as a function of time and other covariates. Both models gave the same conclusions on 11 of 15 HRQoL dimensions. HRQoL evolution was similar between the 2 treatment arms, except for the symptoms of pain. Indeed, regarding the LPCM, pain perception was significantly less important in the FOLFIRINOX arm than in the gemcitabine arm. For most of the scales, HRQoL changes over time, and no difference was found between treatments in terms of HRQoL. The use of LMM to study the HRQoL score does not seem appropriate. It is an easy-to-use model, but the basic statistical assumptions are not met. Our IRT model may be more complex but shows the same qualities and gives similar results. It has the additional advantage of being more precise and suitable because of its direct use of raw data. © The Author(s) 2015.
Adjusting for Year to Year Rater Variation in IRT Linking--An Empirical Evaluation

ERIC Educational Resources Information Center

Yen, Shu Jing; Ochieng, Charles; Michaels, Hillary; Friedman, Greg

2005-01-01

The main purpose of this study was to illustrate a polytomous IRT-based linking procedure that adjusts for rater variations. Test scores from two administrations of a statewide reading assessment were used. An anchor set of Year 1 students' constructed responses were rescored by Year 2 raters. To adjust for year-to-year rater variation in IRT…
A Rasch Analysis to Determine the Difficulty of the National Senior Certificate Mathematics Examination

ERIC Educational Resources Information Center

Sewry, Joyce; Mokilane, Paul

2014-01-01

The National Senior Certificate (NSC) examinations were written for the second time in 2009 amid much criticism. In this study, scripts of candidates who wrote the NSC Mathematics examinations (papers 1 and 2) in 2009 were used as data to analyse the marks scored and then polytomous Rasch analysis was conducted for all the sub-questions to…
Adjacent-Categories Mokken Models for Rater-Mediated Assessments

ERIC Educational Resources Information Center

Wind, Stefanie A.

2017-01-01

Molenaar extended Mokken's original probabilistic-nonparametric scaling models for use with polytomous data. These polytomous extensions of Mokken's original scaling procedure have facilitated the use of Mokken scale analysis as an approach to exploring fundamental measurement properties across a variety of domains in which polytomous ratings are…
Sequential Objective Structured Clinical Examination based on item response theory in Iran.

PubMed

Hejri, Sara Mortaz; Jalili, Mohammad

2017-01-01

In a sequential objective structured clinical examination (OSCE), all students initially take a short screening OSCE. Examinees who pass are excused from further testing, but an additional OSCE is administered to the remaining examinees. Previous investigations of sequential OSCE were based on classical test theory. We aimed to design and evaluate screening OSCEs based on item response theory (IRT). We carried out a retrospective observational study. At each station of a 10-station OSCE, the students' performance was graded on a Likert-type scale. Since the data were polytomous, the difficulty parameters, discrimination parameters, and students' ability were calculated using a graded response model. To design several screening OSCEs, we identified the 5 most difficult stations and the 5 most discriminative ones. For each test, 5, 4, or 3 stations were selected. Normal and stringent cut-scores were defined for each test. We compared the results of each of the 12 screening OSCEs to the main OSCE and calculated the positive and negative predictive values (PPV and NPV), as well as the exam cost. A total of 253 students (95.1%) passed the main OSCE, while 72.6% to 94.4% of examinees passed the screening tests. The PPV values ranged from 0.98 to 1.00, and the NPV values ranged from 0.18 to 0.59. Two tests effectively predicted the results of the main exam, resulting in financial savings of 34% to 40%. If stations with the highest IRT-based discrimination values and stringent cut-scores are utilized in the screening test, sequential OSCE can be an efficient and convenient way to conduct an OSCE.
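The predictive values reported above follow from a 2x2 cross-classification of screening and main-exam results. The sketch below assumes that a "positive" screening result means passing the screening OSCE and that the main OSCE is the reference outcome; the abstract does not spell this out, and the counts are hypothetical.

```python
def predictive_values(pass_both, pass_screen_only, pass_main_only, fail_both):
    """Return (PPV, NPV) for a screening test judged against the main exam.

    pass_both:        passed screening and passed the main OSCE
    pass_screen_only: passed screening but failed the main OSCE
    pass_main_only:   failed screening but passed the main OSCE
    fail_both:        failed screening and failed the main OSCE
    """
    ppv = pass_both / (pass_both + pass_screen_only)
    npv = fail_both / (fail_both + pass_main_only)
    return ppv, npv

# Hypothetical counts for one screening test (illustration only).
ppv, npv = predictive_values(pass_both=200, pass_screen_only=2,
                             pass_main_only=53, fail_both=11)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

Under this reading a low NPV is tolerable, because examinees who fail the screening are not failed outright but sit the additional OSCE.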
Maximum Marginal Likelihood Estimation of a Monotonic Polynomial Generalized Partial Credit Model with Applications to Multiple Group Analysis.

PubMed

Falk, Carl F; Cai, Li

2016-06-01

We present a semi-parametric approach to estimating item response functions (IRF) useful when the true IRF does not strictly follow commonly used functions. Our approach replaces the linear predictor of the generalized partial credit model with a monotonic polynomial. The model includes the regular generalized partial credit model at the lowest order polynomial. Our approach extends Liang's (A semi-parametric approach to estimate IRFs, Unpublished doctoral dissertation, 2007) method for dichotomous item responses to the case of polytomous data. Furthermore, item parameter estimation is implemented with maximum marginal likelihood using the Bock-Aitkin EM algorithm, thereby facilitating multiple group analyses useful in operational settings. Our approach is demonstrated on both educational and psychological data. We present simulation results comparing our approach to more standard IRF estimation approaches and other non-parametric and semi-parametric alternatives.
Using Rasch-models to compare the 30-, 20-, and 12-items version of the general health questionnaire taking four recoding schemes into account.

PubMed

Alexandrowicz, Rainer W; Friedrich, Fabian; Jahn, Rebecca; Soulier, Nathalie

2015-01-01

The present study compares the 30-, 20-, and 12-items versions of the General Health Questionnaire (GHQ) in the original coding and four different recoding schemes (Bimodal, Chronic, Modified Likert and a newly proposed Modified Chronic) with respect to their psychometric qualities. The dichotomized versions (i.e. Bimodal, Chronic and Modified Chronic) were evaluated with the Rasch-Model and the polytomous original version and the Modified Likert version were evaluated with the Partial Credit Model. In general, the versions under consideration showed agreement with the model assumption. However, the recoded versions exhibited some deficits with respect to the Outfit index. Because of the item deficits and for theoretical reasons we argue in favor of using any of the three length versions with the original four-categorical coding scheme. Nevertheless, any of the versions appears apt for clinical use from a psychometric perspective.
Validation of the knowledge, attitude and perceived practice of asthma instrument among community pharmacists using Rasch analysis.

PubMed

Akram, Waqas; Hussein, Maryam S E; Ahmad, Sohail; Mamat, Mohd N; Ismail, Nahlah E

2015-10-01

There is no instrument which collectively assesses the knowledge, attitude and perceived practice of asthma among community pharmacists. Therefore, this study aimed to validate an instrument measuring the knowledge, attitude and perceived practice of asthma among community pharmacists by producing empirical evidence of the validity and reliability of the items using the Rasch model (Bond & Fox software®) for dichotomous and polytomous data. This baseline study recruited 33 community pharmacists from Penang, Malaysia. The results showed that all PTMEA Corr values were positive, indicating that the items were able to distinguish between respondents of differing ability. Based on the MNSQ infit and outfit range (0.60-1.40), 2 of the 55 items were suggested for removal from the instrument. The findings indicated that the instrument fitted the Rasch measurement model and showed acceptable reliability values of 0.88, 0.83, and 0.79 for knowledge, attitude and perceived practice, respectively.
A General Cognitive Diagnosis Model for Expert-Defined Polytomous Attributes

ERIC Educational Resources Information Center

Chen, Jinsong; de la Torre, Jimmy

2013-01-01

Polytomous attributes, particularly those defined as part of the test development process, can provide additional diagnostic information. The present research proposes the polytomous generalized deterministic inputs, noisy, "and" gate (pG-DINA) model to accommodate such attributes. The pG-DINA model allows input from substantive experts…
Applying Computerized Adaptive Testing to the Negative Acts Questionnaire-Revised: Rasch Analysis of Workplace Bullying

PubMed Central

Ma, Shu-Ching; Li, Yu-Chi; Yui, Mei-Shu

2014-01-01

Background Workplace bullying is a prevalent problem in contemporary work places that has adverse effects on both the victims of bullying and organizations. With the rapid development of computer technology in recent years, there is an urgent need to prove whether item response theory–based computerized adaptive testing (CAT) can be applied to measure exposure to workplace bullying. Objective The purpose of this study was to evaluate the relative efficiency and measurement precision of a CAT-based test for hospital nurses compared to traditional nonadaptive testing (NAT). Under the preliminary conditions of a single domain derived from the scale, a CAT module bullying scale model with polytomously scored items is provided as an example for evaluation purposes. Methods A total of 300 nurses were recruited and responded to the 22-item Negative Acts Questionnaire-Revised (NAQ-R). All NAT (or CAT-selected) items were calibrated with the Rasch rating scale model and all respondents were randomly selected for a comparison of the advantages of CAT and NAT in efficiency and precision by paired t tests and the area under the receiver operating characteristic curve (AUROC). Results The NAQ-R is a unidimensional construct that can be applied to measure exposure to workplace bullying through CAT-based administration. Nursing measures derived from both tests (CAT and NAT) were highly correlated (r=.97) and their measurement precisions were not statistically different (P=.49) as expected. CAT required fewer items than NAT (an efficiency gain of 32%), suggesting a reduced burden for respondents. There were significant differences in work tenure between the 2 groups (bullied and nonbullied) at a cutoff point of 6 years at 1 worksite. An AUROC of 0.75 (95% CI 0.68-0.79) with logits greater than –4.2 (or >30 in summation) was defined as being highly likely bullied in a workplace. 
Conclusions With CAT-based administration of the NAQ-R for nurses, their burden was substantially reduced without compromising measurement precision. PMID:24534113
Computational Psychometrics for the Measurement of Collaborative Problem Solving Skills

PubMed Central

Polyak, Stephen T.; von Davier, Alina A.; Peterschmidt, Kurt

2017-01-01

This paper describes a psychometrically-based approach to the measurement of collaborative problem solving skills, by mining and classifying behavioral data both in real-time and in post-game analyses. The data were collected from a sample of middle school children who interacted with a game-like, online simulation of collaborative problem solving tasks. In this simulation, a user is required to collaborate with a virtual agent to solve a series of tasks within a first-person maze environment. The tasks were developed following the psychometric principles of Evidence Centered Design (ECD) and are aligned with the Holistic Framework developed by ACT. The analyses presented in this paper are an application of an emerging discipline called computational psychometrics which is growing out of traditional psychometrics and incorporates techniques from educational data mining, machine learning and other computer/cognitive science fields. In the real-time analysis, our aim was to start with limited knowledge of skill mastery, and then demonstrate a form of continuous Bayesian evidence tracing that updates sub-skill level probabilities as new conversation flow event evidence is presented. This is performed using Bayes' rule and conversation item conditional probability tables. The items are polytomous and each response option has been tagged with a skill at a performance level. In our post-game analysis, our goal was to discover unique gameplay profiles by performing a cluster analysis of user's sub-skill performance scores based on their patterns of selected dialog responses. PMID:29238314
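The real-time evidence tracing described above rests on repeated application of Bayes' rule with a conditional probability table (CPT) per conversation item. A minimal sketch of one such update loop follows; the skill levels, response options, and CPT values are hypothetical and do not come from the paper.

```python
def bayes_update(prior, likelihood):
    """One Bayes-rule update of skill-level probabilities.

    prior:      {level: P(level)}
    likelihood: {level: P(observed response | level)}
    """
    unnormalized = {level: prior[level] * likelihood[level] for level in prior}
    total = sum(unnormalized.values())
    return {level: value / total for level, value in unnormalized.items()}

# Hypothetical CPT for one conversation item: P(response option | skill level).
cpt = {
    "option_1": {"low": 0.6, "medium": 0.3, "high": 0.1},
    "option_2": {"low": 0.3, "medium": 0.4, "high": 0.3},
    "option_3": {"low": 0.1, "medium": 0.3, "high": 0.6},
}

# Start from limited knowledge (a uniform prior) and fold in each response.
belief = {"low": 1 / 3, "medium": 1 / 3, "high": 1 / 3}
for response in ["option_3", "option_3", "option_2"]:
    belief = bayes_update(belief, cpt[response])

print({level: round(p, 3) for level, p in belief.items()})
```

Because the posterior after one event becomes the prior for the next, the estimate can be refreshed as each new conversation event arrives.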
Power analysis on the time effect for the longitudinal Rasch model.

PubMed

Feddag, M L; Blanchin, M; Hardouin, J B; Sebille, V

2014-01-01

Statistics literature in the social, behavioral, and biomedical sciences typically stresses the importance of power analysis. Patient Reported Outcomes (PRO) such as quality of life and other perceived health measures (pain, fatigue, stress, ...) are increasingly used as important health outcomes in clinical trials or in epidemiological studies. They cannot be directly observed or measured like other clinical or biological data and they are often collected through questionnaires with binary or polytomous items. The Rasch model is a well-known item response theory (IRT) model for binary data. The article proposes an approach to evaluate the statistical power of the time effect for the longitudinal Rasch model with two time points. The performance of this method is compared to that obtained in a simulation study. Finally, the proposed approach is illustrated on one subscale of the SF-36 questionnaire.

Item usage in a multidimensional computerized adaptive test (MCAT) measuring health-related quality of life.

PubMed

Paap, Muirne C S; Kroeze, Karel A; Terwee, Caroline B; van der Palen, Job; Veldkamp, Bernard P

2017-11-01

Examining item usage is an important step in evaluating the performance of a computerized adaptive test (CAT). We study item usage for a newly developed multidimensional CAT which draws items from three PROMIS domains, as well as a disease-specific one. The multidimensional item bank used in the current study contained 194 items from four domains: the PROMIS domains fatigue, physical function, and ability to participate in social roles and activities, and a disease-specific domain (the COPD-SIB). The item bank was calibrated using the multidimensional graded response model and data of 795 patients with chronic obstructive pulmonary disease. To evaluate the item usage rates of all individual items in our item bank, CAT simulations were performed on responses generated based on a multivariate uniform distribution. The outcome variables included active bank size and item overuse (usage rate larger than the expected item usage rate). For average θ-values, the overall active bank size was 9-10%; this number quickly increased as θ-values became more extreme. For values of -2 and +2, the overall active bank size equaled 39-40%. There was 78% overlap between overused items and active bank size for average θ-values. For more extreme θ-values, the overused items made up a much smaller part of the active bank size: here the overlap was only 35%. Our results strengthen the claim that relatively short item banks may suffice when using polytomous items (and no content constraints/exposure control mechanisms), especially when using MCAT.
Cross-sectional analysis of deprivation and ideal cardiovascular health in the Paris Prospective Study 3.

PubMed

Empana, J P; Perier, M C; Singh-Manoux, A; Gaye, B; Thomas, F; Prugger, C; Plichart, M; Wiernik, E; Guibout, C; Lemogne, C; Pannier, B; Boutouyrie, P; Jouven, X

2016-12-01

We hypothesised that deprivation might represent a barrier to attain an ideal cardiovascular health (CVH) as defined by the American Heart Association (AHA). The baseline data of 8916 participants of the Paris Prospective Study 3, an observational cohort on novel markers for future cardiovascular disease, were used. The AHA 7-item tool includes four health behaviours (smoking, body weight, physical activity and optimal diet) and three biological measures (blood cholesterol, blood glucose and blood pressure). A validated 11-item score of individual material and psychosocial deprivation, the Evaluation de la Précarité et des Inégalités dans les Centres d'Examens de Santé-Evaluation of Deprivation and Inequalities in Health Examination centres (EPICES) score was used. The mean age was 59.5 years (standard deviation 6.2), 61.2% were men and 9.98% had an ideal CVH. In sex-specific multivariable polytomous logistic regression, the odds ratio (OR) for ideal behavioural CVH progressively decreased with quartile of increasing deprivation, from 0.54 (95% CI 0.41 to 0.72) to 0.49 (0.37 to 0.65) in women and from 0.61 (0.50 to 0.76) to 0.57 (0.46 to 0.71) in men. Associations with ideal biological CVH were confined to the most deprived women (OR=0.60; 95% CI 0.37 to 0.99), whereas in men, greater deprivation was related to higher OR of intermediate biological CVH (OR=1.28; 95% CI 1.05 to 1.57 for the third quartile vs the first quartile). Higher material and psychosocial deprivation may represent a barrier to reach an ideal CVH. NCT00741728. Published by the BMJ Publishing Group Limited.
MODELING SNAKE MICROHABITAT FROM RADIOTELEMETRY STUDIES USING POLYTOMOUS LOGISTIC REGRESSION

EPA Science Inventory

Multivariate analysis of snake microhabitat has historically used techniques that were derived under assumptions of normality and common covariance structure (e.g., discriminant function analysis, MANOVA). In this study, polytomous logistic regression (PLR which does not require ...
A general diagnostic model applied to language testing data.

PubMed

von Davier, Matthias

2008-11-01

Probabilistic models with one or more latent variables are designed to report on a corresponding number of skills or cognitive attributes. Multidimensional skill profiles offer additional information beyond what a single test score can provide, if the reported skills can be identified and distinguished reliably. Many recent approaches to skill profile models are limited to dichotomous data and have made use of computationally intensive estimation methods such as Markov chain Monte Carlo, since standard maximum likelihood (ML) estimation techniques were deemed infeasible. This paper presents a general diagnostic model (GDM) that can be estimated with standard ML techniques and applies to polytomous response variables as well as to skills with two or more proficiency levels. The paper uses one member of a larger class of diagnostic models, a compensatory diagnostic model for dichotomous and partial credit data. Many well-known models, such as univariate and multivariate versions of the Rasch model and the two-parameter logistic item response theory model, the generalized partial credit model, as well as a variety of skill profile models, are special cases of this GDM. In addition to an introduction to this model, the paper presents a parameter recovery study using simulated data and an application to real data from the field test for TOEFL Internet-based testing.
Introduction to bifactor polytomous item response theory analysis.

PubMed

Toland, Michael D; Sulis, Isabella; Giambona, Francesca; Porcu, Mariano; Campbell, Jonathan M

2017-02-01

A bifactor item response theory model can be used to aid in the interpretation of the dimensionality of a multifaceted questionnaire that assumes continuous latent variables underlying the propensity to respond to items. This model can be used to describe the locations of people on a general continuous latent variable as well as on continuous orthogonal specific traits that characterize responses to groups of items. The bifactor graded response (bifac-GR) model is presented in contrast to a correlated traits (or multidimensional GR model) and unidimensional GR model. Bifac-GR model specification, assumptions, estimation, and interpretation are demonstrated with a reanalysis of data (Campbell, 2008) on the Shared Activities Questionnaire. We also show the importance of marginalizing the slopes for interpretation purposes and we extend the concept to the interpretation of the information function. To go along with the illustrative example analyses, we have made available supplementary files that include command file (syntax) examples and outputs from flexMIRT, IRTPRO, R, Mplus, and STATA. Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.jsp.2016.11.001. Data needed to reproduce analyses in this article are available as supplemental materials (online only) in the Appendix of this article. Copyright © 2016 Society for the Study of School Psychology. Published by Elsevier Ltd. All rights reserved.
Risk of lymphoma subtypes and dietary habits in a Mediterranean area.

PubMed

Campagna, Marcello; Cocco, Pierluigi; Zucca, Mariagrazia; Angelucci, Emanuele; Gabbas, Attilio; Latte, Gian Carlo; Uras, Antonella; Rais, Marco; Sanna, Sonia; Ennas, Maria Grazia

2015-12-01

Previous studies have suggested that diet might affect risk of lymphoma subtypes. We investigated risk of lymphoma and its major subtypes associated with diet in the Mediterranean island of Sardinia, Italy. In 1998-2004, 322 incident lymphoma cases and 446 randomly selected population controls participated in a case-control study on lymphoma etiology in central-southern Sardinia. Questionnaire interviews included frequency of intake of 112 food items. Risk associated with individual dietary items and groups thereof was explored by unconditional and polytomous logistic regression analysis, adjusting for age, gender and education. We observed an upward trend in risk of lymphoma (all subtypes combined) and B-cell lymphoma with frequency of intake of well-done grilled/roasted chicken (p for trend=0.01) and pizza (p for trend=0.047). Neither adherence to Mediterranean diet nor a frequent intake of its individual components conveyed protection. We detected heterogeneity in risk associated with several food items and groups thereof by lymphoma subtypes although we could not rule out chance as responsible for the observed direct or inverse associations. Adherence to a Mediterranean diet does not seem to convey protection against the development of lymphoma. The association with specific food items might vary by lymphoma subtype. Copyright © 2015 Elsevier Ltd. All rights reserved.
A general equation to obtain multiple cut-off scores on a test from multinomial logistic regression.

PubMed

Bersabé, Rosa; Rivas, Teresa

2010-05-01

The authors derive a general equation to compute multiple cut-offs on a total test score in order to classify individuals into more than two ordinal categories. The equation is derived from the multinomial logistic regression (MLR) model, which is an extension of the binary logistic regression (BLR) model to accommodate polytomous outcome variables. From this analytical procedure, cut-off scores are established at the test score (the predictor variable) at which an individual is as likely to be in category j as in category j+1 of an ordinal outcome variable. The application of the complete procedure is illustrated by an example with data from an actual study on eating disorders. In this example, two cut-off scores on the Eating Attitudes Test (EAT-26) scores are obtained in order to classify individuals into three ordinal categories: asymptomatic, symptomatic and eating disorder. Diagnoses were made from the responses to a self-report (Q-EDD) that operationalises DSM-IV criteria for eating disorders. Alternatives to the MLR model to set multiple cut-off scores are discussed.
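The cut-off idea described above can be sketched from the standard MLR form, in which each category has log-odds a_j + b_j·x relative to a shared reference category. Two adjacent categories are then equally likely where their linear predictors are equal, which gives x = (a_{j+1} - a_j) / (b_j - b_{j+1}). This is a derivation under that standard parameterization and may differ in notation from the authors' equation; the coefficients below are invented and are not the EAT-26 estimates.

```python
import math


def mlr_cutoff(intercept_j, slope_j, intercept_next, slope_next):
    """Score at which categories j and j+1 are equally likely in a multinomial logit."""
    if slope_j == slope_next:
        raise ValueError("equal slopes: the two category curves never cross")
    return (intercept_next - intercept_j) / (slope_j - slope_next)


def mlr_probs(x, intercepts, slopes):
    """Category probabilities at score x (softmax over the linear predictors)."""
    logits = [a + b * x for a, b in zip(intercepts, slopes)]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical coefficients; category 0 is the reference (intercept 0, slope 0).
intercepts = [0.0, -5.0, -12.5]
slopes = [0.0, 0.25, 0.5]
cutoffs = [
    mlr_cutoff(intercepts[j], slopes[j], intercepts[j + 1], slopes[j + 1])
    for j in range(len(intercepts) - 1)
]
```

With these made-up values the two cut-offs fall at scores 20 and 30, and evaluating `mlr_probs` at each cut-off confirms that the two adjacent categories have equal probability there.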
Comparison of standard maximum likelihood classification and polytomous logistic regression used in remote sensing

Treesearch

John Hogland; Nedret Billor; Nathaniel Anderson

2013-01-01

Discriminant analysis, referred to as maximum likelihood classification within popular remote sensing software packages, is a common supervised technique used by analysts. Polytomous logistic regression (PLR), also referred to as multinomial logistic regression, is an alternative classification approach that is less restrictive, more flexible, and easy to interpret. To...
The Social Provisions Scale: psychometric properties of the SPS-10 among participants in nature-based services.

PubMed

Steigen, Anne Mari; Bergh, Daniel

2018-02-05

This article analyses the psychometric properties of the 10-item version of the Social Provisions Scale. The Social Provisions Scale was analysed by means of the polytomous Rasch model, applied to data on 93 young adults (16-30 years) out of school or work, participating in different nature-based services due to mental or drug-related problems. The psychometric analysis concludes that the original scale has difficulties related to targeting and construct validity. In order to improve the psychometric properties, the scale was modified to include eight items measuring functional support. The modification was based on theoretical and statistical considerations. After modification, the scale not only showed satisfactory psychometric properties but also clarified uncertainties regarding the construct validity of the measure. However, further analyses on larger samples are required. Implications for Rehabilitation: Social support is important for a variety of rehabilitation outcomes and for different patient groups in the rehabilitation context, including people with mental health or drug-related problems. The Social Provisions Scale may be used as a screening tool to assess social support of participants in rehabilitation, and the scale may also be an important instrument in rehabilitation research. There might be issues measuring structural support using a 10-item version of the Social Provisions Scale, but it seemed to work well as an 8-item scale measuring functional support.
New robust statistical procedures for the polytomous logistic regression models.

PubMed

Castilla, Elena; Ghosh, Abhik; Martin, Nirian; Pardo, Leandro

2018-05-17

This article derives a new family of estimators, namely the minimum density power divergence estimators, as a robust generalization of the maximum likelihood estimator for the polytomous logistic regression model. Based on these estimators, a family of Wald-type test statistics for linear hypotheses is introduced. Robustness properties of both the proposed estimators and the test statistics are theoretically studied through the classical influence function analysis. Appropriate real-life examples are presented to justify the requirement of suitable robust statistical procedures in place of the likelihood-based inference for the polytomous logistic regression model. The validity of the theoretical results established in the article is further confirmed empirically through suitable simulation studies. Finally, an approach for the data-driven selection of the robustness tuning parameter is proposed with empirical justifications. © 2018, The International Biometric Society.
Evaluation properties of the French version of the OUT-PATSAT35 satisfaction with care questionnaire according to classical and item response theory analyses.

PubMed

Panouillères, M; Anota, A; Nguyen, T V; Brédart, A; Bosset, J F; Monnier, A; Mercier, M; Hardouin, J B

2014-09-01

The present study investigates the properties of the French version of the OUT-PATSAT35 questionnaire, which evaluates the outpatients' satisfaction with care in oncology using classical analysis (CTT) and item response theory (IRT). This cross-sectional multicenter study includes 692 patients who completed the questionnaire at the end of their ambulatory treatment. CTT analyses tested the main psychometric properties (convergent and divergent validity, and internal consistency). IRT analyses were conducted separately for each OUT-PATSAT35 domain (the doctors, the nurses or the radiation therapists and the services/organization) by models from the Rasch family. We examined the fit of the data to the model expectations and tested whether the model assumptions of unidimensionality, monotonicity and local independence were respected. A total of 605 (87.4%) respondents were analyzed with a mean age of 64 years (range 29-88). Internal consistency for all scales separately and for the three main domains was good (Cronbach's α 0.74-0.98). IRT analyses were performed with the partial credit model. No disordered thresholds of polytomous items were found. Each domain showed high reliability but fitted poorly to the Rasch models. Three items in particular, the item about "promptness" in the doctors' domain and the items about "accessibility" and "environment" in the services/organization domain, presented the highest default of fit. A correct fit of the Rasch model can be obtained by dropping these items. Most of the local dependence concerned items about "information provided" in each domain. A major deviation of unidimensionality was found in the nurses' domain. CTT showed good psychometric properties of the OUT-PATSAT35. However, the Rasch analysis revealed some misfitting and redundant items. Taking the above problems into consideration, it could be interesting to refine the questionnaire in a future study.
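For readers unfamiliar with the partial credit model mentioned above, the following is a minimal sketch of its category probabilities and of a simple check for disordered thresholds. The function names and step difficulties are illustrative assumptions and do not reproduce the OUT-PATSAT35 analysis or any particular software package.

```python
import math


def pcm_category_probs(theta, step_difficulties):
    """Category probabilities for one item under the partial credit model."""
    # The exponent for category k is the sum of (theta - delta_j) over steps j <= k;
    # category 0 has an empty sum, i.e. an exponent of 0.
    exponents = [0.0]
    for delta in step_difficulties:
        exponents.append(exponents[-1] + (theta - delta))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def has_disordered_thresholds(step_difficulties):
    """True if the step difficulties are not strictly increasing."""
    pairs = zip(step_difficulties, step_difficulties[1:])
    return any(later <= earlier for earlier, later in pairs)


# Made-up item with three steps (four response categories).
steps = [-1.0, 0.2, 1.5]
probs = pcm_category_probs(theta=0.0, step_difficulties=steps)
```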
Item response theory analysis applied to the Spanish version of the Personal Outcomes Scale.

PubMed

Guàrdia-Olmos, J; Carbó-Carreté, M; Peró-Cebollero, M; Giné, C

2017-11-01

The study of measurements of quality of life (QoL) is one of the great challenges of modern psychology and psychometric approaches. This issue has greater importance when examining QoL in populations that were historically treated on the basis of their deficiency, and recently, the focus has shifted to what each person values and desires in their life, as in cases of people with intellectual disability (ID). Many studies of QoL scales applied in this area have attempted to improve the validity and reliability of their components by incorporating various sources of information to achieve consistency in the data obtained. The adaptation of the Personal Outcomes Scale (POS) in Spanish has shown excellent psychometric attributes, and its administration has three sources of information: self-assessment, practitioner and family. The study of possible congruence or incongruence of observed distributions of each item between sources is therefore essential to ensure a correct interpretation of the measure. The aim of this paper was to analyse the observed distribution of items and dimensions from the three Spanish POS information sources cited earlier, using the item response theory. We studied a sample of 529 people with ID and their respective practitioners and family member, and in each case, we analysed items and factors using Samejima's model of polytomic ordinal scales. The results indicated an important number of items with differential effects regarding sources, and in some cases, they indicated significant differences in the distribution of items, factors and sources of information. As a result of this analysis, we must affirm that the administration of the POS, considering three sources of information, was adequate overall, but a correct interpretation of the results requires that it obtain much more information to consider, as well as some specific items in specific dimensions. The overall ratings, if these comments are considered, could result in bias. 
© 2017 MENCAP and International Association of the Scientific Study of Intellectual and Developmental Disabilities and John Wiley & Sons Ltd.
Relationship between Item Responses of Negative Affect Items and the Distribution of the Sum of the Item Scores in the General Population

PubMed Central

Kawasaki, Yohei; Ide, Kazuki; Akutagawa, Maiko; Yamada, Hiroshi; Furukawa, Toshiaki A.; Ono, Yutaka

2016-01-01

Background Several studies have shown that total depressive symptom scores in the general population approximate an exponential pattern, except for the lower end of the distribution. The Center for Epidemiologic Studies Depression Scale (CES-D) consists of 20 items, each of which may take on four scores: “rarely,” “some,” “occasionally,” and “most of the time.” Recently, we reported that the item responses for 16 negative affect items commonly exhibit exponential patterns, except for the level of “rarely,” leading us to hypothesize that the item responses at the level of “rarely” may be related to the non-exponential pattern typical of the lower end of the distribution. To verify this hypothesis, we investigated how the item responses contribute to the distribution of the sum of the item scores. Methods Data collected from 21,040 subjects who had completed the CES-D questionnaire as part of a Japanese national survey were analyzed. To assess the item responses of negative affect items, we used a parameter r, which denotes the ratio of “rarely” to “some” in each item response. The distributions of the sum of negative affect items in various combinations were analyzed using log-normal scales and curve fitting. Results The sum of the item scores approximated an exponential pattern regardless of the combination of items, whereas, at the lower end of the distributions, there was a clear divergence between the actual data and the predicted exponential pattern. At the lower end of the distributions, the sum of the item scores with high values of r exhibited higher scores compared to those predicted from the exponential pattern, whereas the sum of the item scores with low values of r exhibited lower scores compared to those predicted. Conclusions The distributional pattern of the sum of the item scores could be predicted from the item responses of such items. PMID:27806132
Scoring best-worst data in unbalanced many-item designs, with applications to crowdsourcing semantic judgments.

PubMed

Hollis, Geoff

2018-04-01

Best-worst scaling is a judgment format in which participants are presented with a set of items and have to choose the superior and inferior items in the set. Best-worst scaling generates a large quantity of information per judgment because each judgment allows for inferences about the rank value of all unjudged items. This property of best-worst scaling makes it a promising judgment format for research in psychology and natural language processing concerned with estimating the semantic properties of tens of thousands of words. A variety of different scoring algorithms have been devised in the previous literature on best-worst scaling. However, due to problems of computational efficiency, these scoring algorithms cannot be applied efficiently to cases in which thousands of items need to be scored. New algorithms are presented here for converting responses from best-worst scaling into item scores for thousands of items (many-item scoring problems). These scoring algorithms are validated through simulation and empirical experiments, and considerations related to noise, the underlying distribution of true values, and trial design are identified that can affect the relative quality of the derived item scores. The newly introduced scoring algorithms consistently outperformed scoring algorithms used in the previous literature on scoring many-item best-worst data.
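To make the judgment format concrete, here is a minimal sketch of a simple counting score for best-worst data: the number of times an item is chosen as best minus the number of times it is chosen as worst, divided by the number of trials in which it appeared. This is a common baseline and is offered only as an illustration; it is not one of the new algorithms introduced in the article, and the function name and trial data are invented.

```python
from collections import Counter


def best_minus_worst_scores(trials):
    """Score items from best-worst trials given as (items, best, worst) tuples."""
    appearances, best, worst = Counter(), Counter(), Counter()
    for items, chosen_best, chosen_worst in trials:
        appearances.update(items)
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    # Normalise by appearances so unbalanced designs do not favour frequent items.
    return {item: (best[item] - worst[item]) / n for item, n in appearances.items()}


# Hypothetical valence judgments over sets of four words.
trials = [
    (("calm", "happy", "angry", "sad"), "happy", "angry"),
    (("calm", "happy", "sad", "tired"), "happy", "sad"),
    (("calm", "angry", "sad", "tired"), "calm", "angry"),
]
scores = best_minus_worst_scores(trials)
```

Scores fall between -1 (always chosen as worst) and 1 (always chosen as best); an item that is never selected either way scores 0.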
Item Response Modeling with Sum Scores

ERIC Educational Resources Information Center

Johnson, Timothy R.

2013-01-01

One of the distinctions between classical test theory and item response theory is that the former focuses on sum scores and their relationship to true scores, whereas the latter concerns item responses and their relationship to latent scores. Although item response theory is often viewed as the richer of the two theories, sum scores are still…
Boundary curves of individual items in the distribution of total depressive symptom scores approximate an exponential pattern in a general population.

PubMed

Tomitaka, Shinichiro; Kawasaki, Yohei; Ide, Kazuki; Akutagawa, Maiko; Yamada, Hiroshi; Furukawa, Toshiaki A; Ono, Yutaka

2016-01-01

Previously, we proposed a model for ordinal scale scoring in which individual thresholds for each item constitute a distribution by each item. This led us to hypothesize that the boundary curves of each depressive symptom score in the distribution of total depressive symptom scores follow a common mathematical model, which is expressed as the product of the frequency of the total depressive symptom scores and the probability of the cumulative distribution function of each item threshold. To verify this hypothesis, we investigated the boundary curves of the distribution of total depressive symptom scores in a general population. Data collected from 21,040 subjects who had completed the Center for Epidemiologic Studies Depression Scale (CES-D) questionnaire as part of a national Japanese survey were analyzed. The CES-D consists of 20 items (16 negative items and four positive items). The boundary curves of adjacent item scores in the distribution of total depressive symptom scores for the 16 negative items were analyzed using log-normal scales and curve fitting. The boundary curves of adjacent item scores for a given symptom approximated a common linear pattern on a log-normal scale. Curve fitting showed that an exponential fit had a markedly higher coefficient of determination than either linear or quadratic fits. With negative affect items, the gap between the total score curve and boundary curve continuously increased with increasing total depressive symptom scores on a log-normal scale, whereas the boundary curves of positive affect items, which are not considered manifest variables of the latent trait, did not exhibit such increases in this gap. The results of the present study support the hypothesis that the boundary curves of each depressive symptom score in the distribution of total depressive symptom scores commonly follow the predicted mathematical model, which was verified to approximate an exponential mathematical pattern.
Boundary curves of individual items in the distribution of total depressive symptom scores approximate an exponential pattern in a general population

PubMed Central

Kawasaki, Yohei; Akutagawa, Maiko; Yamada, Hiroshi; Furukawa, Toshiaki A.; Ono, Yutaka

2016-01-01

Background Previously, we proposed a model for ordinal scale scoring in which individual thresholds for each item constitute a distribution by each item. This led us to hypothesize that the boundary curves of each depressive symptom score in the distribution of total depressive symptom scores follow a common mathematical model, which is expressed as the product of the frequency of the total depressive symptom scores and the probability of the cumulative distribution function of each item threshold. To verify this hypothesis, we investigated the boundary curves of the distribution of total depressive symptom scores in a general population. Methods Data collected from 21,040 subjects who had completed the Center for Epidemiologic Studies Depression Scale (CES-D) questionnaire as part of a national Japanese survey were analyzed. The CES-D consists of 20 items (16 negative items and four positive items). The boundary curves of adjacent item scores in the distribution of total depressive symptom scores for the 16 negative items were analyzed using log-normal scales and curve fitting. Results The boundary curves of adjacent item scores for a given symptom approximated a common linear pattern on a log-normal scale. Curve fitting showed that an exponential fit had a markedly higher coefficient of determination than either linear or quadratic fits. With negative affect items, the gap between the total score curve and boundary curve continuously increased with increasing total depressive symptom scores on a log-normal scale, whereas the boundary curves of positive affect items, which are not considered manifest variables of the latent trait, did not exhibit such increases in this gap.
Discussion The results of the present study support the hypothesis that the boundary curves of each depressive symptom score in the distribution of total depressive symptom scores commonly follow the predicted mathematical model, which was verified to approximate an exponential mathematical pattern. PMID:27761346
Robust Scale Transformation Methods in IRT True Score Equating under Common-Item Nonequivalent Groups Design

ERIC Educational Resources Information Center

He, Yong

2013-01-01

Common test items play an important role in equating multiple test forms under the common-item nonequivalent groups design. Inconsistent item parameter estimates among common items can lead to large bias in equated scores for IRT true score equating. Current methods extensively focus on detection and elimination of outlying common items, which…
Development of a testlet generator in re-engineering the Indonesian physics national-exams

NASA Astrophysics Data System (ADS)

Mindyarto, Budi Naini; Mardapi, Djemari; Bastari

2017-08-01

The Indonesian physics national exams are end-of-course summative assessments that could be utilized to support assessment for learning in physics education. This paper discusses the development and evaluation of a testlet generator based on a re-engineering of Indonesian physics national exams. The exam problems were dissected and decomposed into testlets that reveal a deeper understanding of the underlying physical concepts by inserting a qualitative question and its scientific reasoning question. A template-based generator was built to help teachers generate testlet variants that conform better to students' scientific attitude development than the original simple multiple-choice formats. The testlet generator was built using open source software technologies and was evaluated through black-box testing by exploring the generator's execution, inputs and outputs. The results showed that the developed testlet generator correctly performed its functions of validating inputs, generating testlet variants, and accommodating polytomous item characteristics.
An Approach to Scoring and Equating Tests with Binary Items: Piloting With Large-Scale Assessments

ERIC Educational Resources Information Center

Dimitrov, Dimiter M.

2016-01-01

This article describes an approach to test scoring, referred to as "delta scoring" (D-scoring), for tests with dichotomously scored items. The D-scoring uses information from item response theory (IRT) calibration to facilitate computations and interpretations in the context of large-scale assessments. The D-score is computed from the…

The reverse of social anxiety is not always the opposite: the reverse-scored items of the social interaction anxiety scale do not belong.

PubMed

Rodebaugh, Thomas L; Woods, Carol M; Heimberg, Richard G

2007-06-01

Although well-used and empirically supported, the Social Interaction Anxiety Scale (SIAS) has a questionable factor structure and includes reverse-scored items with questionable utility. Here, using samples of undergraduates and a sample of clients with social anxiety disorder, we extend previous work that opened the question of whether the reverse-scored items belong on the scale. First, we successfully confirmed the factor structure obtained in previous samples. Second, we found the reverse-scored items to show consistently weaker relationships with a variety of comparison measures. Third, we demonstrated that removing the reverse-scored questions generally helps rather than hinders the psychometric performance of the SIAS total score. Fourth, we found that the reverse-scored items show a strong relationship with the normal personality characteristic of extraversion, suggesting that the reverse-scored items may primarily assess extraversion. Given the above results, we suggest investigators consider performing data analyses using only the straightforwardly worded items of the SIAS.
Using the Free-Response Scoring Tool To Automatically Score the Formulating-Hypotheses Item. GRE Board Professional Report No. 90-02bP.

ERIC Educational Resources Information Center

Kaplan, Randy M.; Bennett, Randy Elliot

This study explores the potential for using a computer-based scoring procedure for the formulating-hypotheses (F-H) item. This item type presents a situation and asks the examinee to generate explanations for it. Each explanation is judged right or wrong, and the number of creditable explanations is summed to produce an item score. Scores were…
Impact of IRT item misfit on score estimates and severity classifications: an examination of PROMIS depression and pain interference item banks.

PubMed

Zhao, Yue

2017-03-01

In patient-reported outcome research that utilizes item response theory (IRT), using statistical significance tests to detect misfit is usually the focus of IRT model-data fit evaluations. However, such evaluations rarely address the impact/consequence of using misfitting items on the intended clinical applications. This study was designed to evaluate the impact of IRT item misfit on score estimates and severity classifications and to demonstrate a recommended process of model-fit evaluation. Using secondary data sources collected from the Patient-Reported Outcome Measurement Information System (PROMIS) wave 1 testing phase, analyses were conducted based on PROMIS depression (28 items; 782 cases) and pain interference (41 items; 845 cases) item banks. The identification of misfitting items was assessed using Orlando and Thissen's summed-score item-fit statistics and graphical displays. The impact of misfit was evaluated according to the agreement of both IRT-derived T-scores and severity classifications between inclusion and exclusion of misfitting items. The examination of the presence and impact of misfit suggested that item misfit had a negligible impact on the T-score estimates and severity classifications with the general population sample in the PROMIS depression and pain interference item banks, implying that the impact of item misfit was insignificant. Findings support the T-score estimates in the two item banks as robust against item misfit at both the group and individual levels and add confidence to the use of T-scores for severity diagnosis in the studied sample. Recommendations on approaches for identifying item misfit (statistical significance) and assessing the misfit impact (practical significance) are given.
Upper-extremity and mobility subdomains from the Patient-Reported Outcomes Measurement Information System (PROMIS) adult physical functioning item bank.

PubMed

Hays, Ron D; Spritzer, Karen L; Amtmann, Dagmar; Lai, Jin-Shei; Dewitt, Esi Morgan; Rothrock, Nan; Dewalt, Darren A; Riley, William T; Fries, James F; Krishnan, Eswar

2013-11-01

To create upper-extremity and mobility subdomain scores from the Patient-Reported Outcomes Measurement Information System (PROMIS) physical functioning adult item bank. Expert reviews were used to identify upper-extremity and mobility items from the PROMIS item bank. Psychometric analyses were conducted to assess empirical support for scoring upper-extremity and mobility subdomains. Data were collected from the U.S. general population and multiple disease groups via self-administered surveys. The sample (N=21,773) included 21,133 English-speaking adults who participated in the PROMIS wave 1 data collection and 640 Spanish-speaking Latino adults recruited separately. Not applicable. We used English- and Spanish-language data and existing PROMIS item parameters for the physical functioning item bank to estimate upper-extremity and mobility scores. In addition, we fit graded response models to calibrate the upper-extremity items and mobility items separately, compare separate to combined calibrations, and produce subdomain scores. After eliminating items because of local dependency, 16 items remained to assess upper extremity and 17 items to assess mobility. The estimated correlation between upper extremity and mobility was .59 using existing PROMIS physical functioning item parameters (r=.60 using parameters calibrated separately for upper-extremity and mobility items). Upper-extremity and mobility subdomains shared about 35% of the variance in common, and produced comparable scores whether calibrated separately or together. The identification of the subset of items tapping these 2 aspects of physical functioning and scored using the existing PROMIS parameters provides the option of scoring these subdomains in addition to the overall physical functioning score. Copyright © 2013 American Congress of Rehabilitation Medicine. Published by Elsevier Inc. All rights reserved.
Development of a Microsoft Excel tool for applying a factor retention criterion of a dimension coefficient to a survey on patient safety culture.

PubMed

Chien, Tsair-Wei; Shao, Yang; Jen, Dong-Hui

2017-10-27

Many quality-of-life studies have been conducted in healthcare settings, but few have used Microsoft Excel to incorporate Cronbach's α with dimension coefficient (DC) for describing a scale's characteristics. To present a computer module that can report a scale's validity, we manipulated datasets to verify a DC that can be used as a factor retention criterion for demonstrating its usefulness in a patient safety culture survey (PSC). Microsoft Excel Visual Basic for Applications was used to design a computer module for simulating 2000 datasets fitting the Rasch rating scale model. The datasets consisted of (i) five dual correlation coefficients (correl. = 0.3, 0.5, 0.7, 0.9, and 1.0) on two latent traits (i.e., true scores) following a normal distribution and responses to their respective 1/3 and 2/3 items in length; (ii) 20 scenarios of item lengths from 5 to 100; and (iii) 20 sample sizes from 50 to 1000. Each item containing 5-point polytomous responses was uniformly distributed in difficulty across a ± 2 logit range. Three methods (i.e., dimension interrelation ≥0.7, Horn's parallel analysis (PA) 95% confidence interval, and individual random eigenvalues) were used for determining one factor to retain. DC refers to the binary classification (1 as one factor and 0 as many factors) used for examining accuracy with the indicators sensitivity, specificity, and area under receiver operating characteristic curve (AUC). The scale's reliability and DC were simultaneously calculated for each simulative dataset. PSC real data were demonstrated with DC to interpret reports of the unit-based construct validity using the author-made MS Excel module. The DC method presented accurate sensitivity (=0.96), specificity (=0.92) with a DC criterion (≥0.70), and AUC (=0.98) that were higher than those of the two PA methods. PA combined with DC yielded good sensitivity (=0.96), specificity (=1.0) with a DC criterion (≥0.70), and AUC (=0.99). 
Advances in computer technology may enable healthcare users familiar with MS Excel to apply DC as a factor retention criterion for determining a scale's unidimensionality and evaluating a scale's quality.
[Impact of passing items above the ceiling on the assessment results of Peabody developmental motor scales].

PubMed

Zhao, Gai; Bian, Yang; Li, Ming

2013-12-18

To analyze the impact of passing items above the ceiling level in the gross motor subtest of the Peabody Developmental Motor Scales (PDMS-2) on its assessment results. The subtests of PDMS-2 were administered to 124 children aged 1.2 to 71 months. In addition to the original scoring method, a new scoring method that includes passing items above the ceiling was developed. The standard scores and quotients of the two scoring methods were compared using the independent-samples t test. Only one child could pass the items above the ceiling in the stationary subtest, 19 children in the locomotion subtest, and 17 children in the visual-motor integration subtest. When the scores of these passing items were included in the raw scores, the total raw scores increased by 1-12 points, the standard scores by 0-1 points and the motor quotients by 0-3 points. The diagnostic classification changed in only two children. There was no significant difference between the two methods in motor quotients or in standard scores for the specific subtests (P>0.05). Passing items above the ceiling of PDMS-2 is not a rare situation. It usually takes place in the locomotion subtest and visual-motor integration subtest. Including these passing items in the scoring system does not make a significant difference in the standard scores of the subtests or the developmental motor quotients (DMQ), which supports the original setting of a ceiling established by not passing 3 items in a row. However, putting the passing items above the ceiling into the raw score will improve tracking of children's developmental trajectory and intervention effects.
Use of non-parametric item response theory to develop a shortened version of the Positive and Negative Syndrome Scale (PANSS).

PubMed

Khan, Anzalee; Lewis, Charles; Lindenmayer, Jean-Pierre

2011-11-16

Nonparametric item response theory (IRT) was used to examine (a) the performance of the 30 Positive and Negative Syndrome Scale (PANSS) items and their options (levels of severity), (b) the effectiveness of various subscales to discriminate among differences in symptom severity, and (c) the development of an abbreviated PANSS (Mini-PANSS) based on IRT and a method to link scores to the original PANSS. Baseline PANSS scores from 7,187 patients with Schizophrenia or Schizoaffective disorder who were enrolled between 1995 and 2005 in psychopharmacology trials were obtained. Option characteristic curves (OCCs) and Item Characteristic Curves (ICCs) were constructed to examine the probability of rating each of seven options within each of 30 PANSS items as a function of subscale severity, and summed-score linking was applied to items selected for the Mini-PANSS. The majority of items forming the Positive and Negative subscales (i.e. 19 items) performed very well and discriminated better along symptom severity compared to the General Psychopathology subscale. Six of the seven Positive Symptom items, six of the seven Negative Symptom items, and seven out of the 16 General Psychopathology items were retained for inclusion in the Mini-PANSS. Summed score linking and linear interpolation were able to produce a translation table for comparing total subscale scores of the Mini-PANSS to total subscale scores on the original PANSS. Results show scores on the subscales of the Mini-PANSS can be linked to scores on the original PANSS subscales, with very little bias. The study demonstrated the utility of non-parametric IRT in examining the item properties of the PANSS and in allowing selection of items for an abbreviated PANSS scale. The comparisons between the 30-item PANSS and the Mini-PANSS revealed that the shorter version is comparable to the 30-item PANSS, but when applying IRT, the Mini-PANSS is also a good indicator of illness severity.
Use of NON-PARAMETRIC Item Response Theory to develop a shortened version of the Positive and Negative Syndrome Scale (PANSS)

PubMed Central

2011-01-01

Background Nonparametric item response theory (IRT) was used to examine (a) the performance of the 30 Positive and Negative Syndrome Scale (PANSS) items and their options (levels of severity), (b) the effectiveness of various subscales to discriminate among differences in symptom severity, and (c) the development of an abbreviated PANSS (Mini-PANSS) based on IRT and a method to link scores to the original PANSS. Methods Baseline PANSS scores from 7,187 patients with Schizophrenia or Schizoaffective disorder who were enrolled between 1995 and 2005 in psychopharmacology trials were obtained. Option characteristic curves (OCCs) and Item Characteristic Curves (ICCs) were constructed to examine the probability of rating each of seven options within each of 30 PANSS items as a function of subscale severity, and summed-score linking was applied to items selected for the Mini-PANSS. Results The majority of items forming the Positive and Negative subscales (i.e. 19 items) performed very well and discriminated better along symptom severity compared to the General Psychopathology subscale. Six of the seven Positive Symptom items, six of the seven Negative Symptom items, and seven out of the 16 General Psychopathology items were retained for inclusion in the Mini-PANSS. Summed score linking and linear interpolation were able to produce a translation table for comparing total subscale scores of the Mini-PANSS to total subscale scores on the original PANSS. Results show scores on the subscales of the Mini-PANSS can be linked to scores on the original PANSS subscales, with very little bias. Conclusions The study demonstrated the utility of non-parametric IRT in examining the item properties of the PANSS and in allowing selection of items for an abbreviated PANSS scale. The comparisons between the 30-item PANSS and the Mini-PANSS revealed that the shorter version is comparable to the 30-item PANSS, but when applying IRT, the Mini-PANSS is also a good indicator of illness severity.
PMID:22087503
Examining the psychometric properties of a sport-related concussion survey: a Rasch measurement approach.

PubMed

Hecimovich, Mark; Marais, Ida

2017-06-26

Awareness of sport-related concussion (SRC) is an essential step in increasing the number of athletes or parents who report on SRC. This awareness is important, as there is no established data on medical care at youth-level sports and may be limited to individuals with only first aid training. In this circumstance, aside from the coach, it is the players and their parents who need to be aware of possible signs and symptoms. The aim of this study was to examine the psychometric properties of a parent and player concussion survey intended for use before and after an education campaign regarding SRC. 1441 questionnaires were received from parents and 284 questionnaires from players. The responses to the sixteen-item section of the questionnaire's 'recognition of signs and symptoms' were submitted to psychometric analysis using the dichotomous and polytomous Rasch model via the Rasch Unidimensional Measurement Model software RUMM2030. The Rasch model of Modern Test Theory can be considered a refinement of, or advance on, traditional analyses of an instrument's psychometric properties. The main finding is that these sixteen items measure two factors: items that are symptoms of concussion and items that are not symptoms of concussion. Parents and athletes were able to identify most or all of the symptoms, but were not as good at distinguishing symptoms that are not symptoms of concussion. Analyzing these responses revealed differential item functioning for parents and athletes on non-symptom items. When the DIF was resolved a significant difference was found between parents and athletes. The main finding is that the items measure two 'dimensions' in concussion symptom recognition. The first dimension consists of those items that are symptoms of concussion and the second dimension of those items that are not symptoms of concussion. 
Parents and players were able to identify most or all of the symptoms of concussion, so one would not expect to pick up any positive change on these items after an education campaign. Parents and players were not as good at distinguishing symptoms that are not symptoms of concussion. It is on these items that one may possibly expect improvement to manifest, so to evaluate the effectiveness of an education campaign it would pay to look for improvement in distinguishing symptoms that are not symptoms of concussion.
Estimating the Reliability of a Test Battery Composite or a Test Score Based on Weighted Item Scoring

ERIC Educational Resources Information Center

Feldt, Leonard S.

2004-01-01

In some settings, the validity of a battery composite or a test score is enhanced by weighting some parts or items more heavily than others in the total score. This article describes methods of estimating the total score reliability coefficient when differential weights are used with items or parts.
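As an illustration only (the abstract does not state which estimators the article uses), one standard textbook estimate of weighted-composite reliability assumes uncorrelated measurement errors across parts: reliability = 1 - sum(w_i^2 * var_i * (1 - rel_i)) / var_composite, where var_composite is the weighted sum over the part covariance matrix. The function and argument names below are hypothetical.

```python
def weighted_composite_reliability(weights, cov, reliabilities):
    """Reliability of sum(w_i * X_i), assuming uncorrelated part errors.

    weights        -- weight applied to each part score
    cov            -- covariance matrix of the part scores (list of lists)
    reliabilities  -- reliability coefficient of each part
    """
    k = len(weights)
    # Variance of the weighted composite: w' * Cov * w
    composite_var = sum(
        weights[i] * weights[j] * cov[i][j] for i in range(k) for j in range(k)
    )
    # Error variance contributed by each part, scaled by its squared weight
    error_var = sum(
        weights[i] ** 2 * cov[i][i] * (1.0 - reliabilities[i]) for i in range(k)
    )
    return 1.0 - error_var / composite_var


# Two parts with unit variance, covariance 0.5, and reliability 0.8 each
equal = weighted_composite_reliability([1, 1], [[1, 0.5], [0.5, 1]], [0.8, 0.8])
double_first = weighted_composite_reliability([2, 1], [[1, 0.5], [0.5, 1]], [0.8, 0.8])
```

In this made-up case, doubling the weight of the first part lowers the composite reliability slightly, which shows why the weights have to enter the reliability estimate explicitly.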
Estimating Total-test Scores from Partial Scores in a Matrix Sampling Design.

ERIC Educational Resources Information Center

Sachar, Jane; Suppes, Patrick

It is sometimes desirable to obtain an estimated total-test score for an individual who was administered only a subset of the items in a total test. The present study compared six methods, two of which utilize the content structure of items, to estimate total-test scores using 450 students in grades 3-5 and 60 items of the 110-item Stanford Mental…
Estimating Total-Test Scores from Partial Scores in a Matrix Sampling Design.

ERIC Educational Resources Information Center

Sachar, Jane; Suppes, Patrick

1980-01-01

The present study compared six methods, two of which utilize the content structure of items, to estimate total-test scores using 450 students and 60 items of the 110-item Stanford Mental Arithmetic Test. Three methods yielded fairly good estimates of the total-test score. (Author/RL)
PARADISE 24: A Measure to Assess the Impact of Brain Disorders on People’s Lives

PubMed Central

Cieza, Alarcos; Sabariego, Carla; Anczewska, Marta; Ballert, Carolina; Bickenbach, Jerome; Cabello, Maria; Giovannetti, Ambra; Kaskela, Teemu; Mellor, Blanca; Pitkänen, Tuuli; Quintas, Rui; Raggi, Alberto; Świtaj, Piotr; Chatterji, Somnath

2015-01-01

Objective To construct a metric of the impact of brain disorders on people’s lives, based on the psychosocial difficulties (PSDs) that are experienced in common across brain disorders. Study Design Psychometric study using data from a cross-sectional study with a convenience sample of 722 persons with 9 different brain disorders interviewed in four European countries: Italy, Poland, Spain and Finland. Questions addressing 64 PSDs were first reduced based on statistical considerations, patient’s perspective and clinical expertise. Rasch analyses for polytomous data were also applied. Setting In and outpatient settings. Results A valid and reliable metric with 24 items was created. The infit of all questions ranged between 0.7 and 1.3. There were no disordered thresholds. The targeting between item thresholds and persons’ abilities was good and the person-separation index was 0.92. Persons’ abilities were linearly transformed into a more intuitive scale ranging from zero (no PSDs) to 100 (extreme PSDs). Conclusion The metric, called PARADISE 24, is based on the hypothesis of horizontal epidemiology, which affirms that people with brain disorders commonly experience PSDs. This metric is a useful tool to carry out cardinal comparisons over time of the magnitude of the psychosocial impact of brain disorders and between persons and groups in clinical practice and research. PMID:26147343
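The linear transformation of person measures onto a 0 to 100 scale can be sketched as follows. This is a minimal example, assuming the anchors are the logit measures that correspond to the minimum and maximum possible raw scores; the abstract does not give the actual anchors, and the values used here are hypothetical.

```python
def rescale_to_0_100(theta, theta_min, theta_max):
    """Map a logit measure linearly so theta_min -> 0 and theta_max -> 100."""
    if theta_max <= theta_min:
        raise ValueError("theta_max must be greater than theta_min")
    return 100.0 * (theta - theta_min) / (theta_max - theta_min)


# Hypothetical anchors of -4 and +4 logits
low = rescale_to_0_100(-4.0, -4.0, 4.0)
mid = rescale_to_0_100(0.0, -4.0, 4.0)
high = rescale_to_0_100(4.0, -4.0, 4.0)
```

Because the mapping is linear, differences between persons keep their interval meaning after rescaling.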
Assessment of the Item Selection and Weighting in the Birmingham Vasculitis Activity Score for Wegener's Granulomatosis

PubMed Central

MAHR, ALFRED D.; NEOGI, TUHINA; LAVALLEY, MICHAEL P.; DAVIS, JOHN C.; HOFFMAN, GARY S.; MCCUNE, W. JOSEPH; SPECKS, ULRICH; SPIERA, ROBERT F.; ST.CLAIR, E. WILLIAM; STONE, JOHN H.; MERKEL, PETER A.

2013-01-01

Objective To assess the Birmingham Vasculitis Activity Score for Wegener's Granulomatosis (BVAS/WG) with respect to its selection and weighting of items. Methods This study used the BVAS/WG data from the Wegener's Granulomatosis Etanercept Trial. The scoring frequencies of the 34 predefined items and any “other” items added by clinicians were calculated. Using linear regression with generalized estimating equations in which the physician global assessment (PGA) of disease activity was the dependent variable, we computed weights for all predefined items. We also created variables for clinical manifestations frequently added as other items, and computed weights for these as well. We searched for the model that included the items and their generated weights yielding an activity score with the highest R2 to predict the PGA. Results We analyzed 2,044 BVAS/WG assessments from 180 patients; 734 assessments were scored during active disease. The highest R2 with the PGA was obtained by scoring WG activity based on the following items: the 25 predefined items rated on ≥5 visits, the 2 newly created fatigue and weight loss variables, the remaining minor other and major other items, and a variable that signified whether new or worse items were present at a specific visit. The weights assigned to the items ranged from 1 to 21. Compared with the original BVAS/WG, this modified score correlated significantly more strongly with the PGA. Conclusion This study suggests possibilities to enhance the item selection and weighting of the BVAS/WG. These changes may increase this instrument's ability to capture the continuum of disease activity in WG. PMID:18512722
Effect of clinically discriminating, evidence-based checklist items on the reliability of scores from an Internal Medicine residency OSCE.

PubMed

Daniels, Vijay J; Bordage, Georges; Gierl, Mark J; Yudkowsky, Rachel

2014-10-01

Objective structured clinical examinations (OSCEs) are used worldwide for summative examinations but often lack acceptable reliability. Research has shown that reliability of scores increases if OSCE checklists for medical students include only clinically relevant items. Also, checklists are often missing evidence-based items that high-achieving learners are more likely to use. The purpose of this study was to determine if limiting checklist items to clinically discriminating items and/or adding missing evidence-based items improved score reliability in an Internal Medicine residency OSCE. Six internists reviewed the traditional checklists of four OSCE stations classifying items as clinically discriminating or non-discriminating. Two independent reviewers augmented checklists with missing evidence-based items. We used generalizability theory to calculate overall reliability of faculty observer checklist scores from 45 first and second-year residents and predict how many 10-item stations would be required to reach a Phi coefficient of 0.8. Removing clinically non-discriminating items from the traditional checklist did not affect the number of stations (15) required to reach a Phi of 0.8 with 10 items. Focusing the checklist on only evidence-based clinically discriminating items increased test score reliability, needing 11 stations instead of 15 to reach 0.8; adding missing evidence-based clinically discriminating items to the traditional checklist modestly improved reliability (needing 14 instead of 15 stations). Checklists composed of evidence-based clinically discriminating items improved the reliability of checklist scores and reduced the number of stations needed for acceptable reliability. Educators should give preference to evidence-based items over non-evidence-based items when developing OSCE checklists.
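The "how many stations are needed" projection is a decision-study calculation. A minimal sketch, assuming a simple persons-by-stations design in which Phi(n) = var_person / (var_person + var_abs_error / n), with var_abs_error the absolute error variance for a single station; the variance components below are hypothetical and are not taken from the study.

```python
def phi_coefficient(var_person, var_abs_error_single, n_stations):
    """Dependability (Phi) for a test averaged over n_stations stations."""
    return var_person / (var_person + var_abs_error_single / n_stations)


def stations_needed(var_person, var_abs_error_single, target=0.8, max_stations=1000):
    """Smallest number of stations whose projected Phi reaches the target."""
    for n in range(1, max_stations + 1):
        if phi_coefficient(var_person, var_abs_error_single, n) >= target:
            return n
    raise ValueError("target not reachable within max_stations")


# Hypothetical variance components
n_required = stations_needed(var_person=1.0, var_abs_error_single=2.3)
```

Under this model, a smaller single-station error variance relative to person variance translates directly into fewer stations for the same Phi.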
Item Purification Does Not Always Improve DIF Detection: A Counterexample with Angoff's Delta Plot

ERIC Educational Resources Information Center

Magis, David; Facon, Bruno

2013-01-01

Item purification is an iterative process that is often advocated as improving the identification of items affected by differential item functioning (DIF). With test-score-based DIF detection methods, item purification iteratively removes the items currently flagged as DIF from the test scores to get purified sets of items, unaffected by DIF. The…
Item Response Theory Analysis of the Psychopathic Personality Inventory-Revised.

PubMed

Eichenbaum, Alexander E; Marcus, David K; French, Brian F

2017-06-01

This study examined item and scale functioning in the Psychopathic Personality Inventory-Revised (PPI-R) using an item response theory analysis. PPI-R protocols from 1,052 college student participants (348 male, 704 female) were analyzed. Analyses were conducted on the 131 self-report items comprising the PPI-R's eight content scales, using a graded response model. Scales collected a majority of their information about respondents possessing higher than average levels of the traits being measured. Each scale contained at least some items that evidenced limited ability to differentiate between respondents with differing levels of the trait being measured. Moreover, 80 items (61.1%) yielded significantly different responses between men and women presumably possessing similar levels of the trait being measured. Item performance was also influenced by the scoring format (directly scored vs. reverse-scored) of the items. Overall, the results suggest that the PPI-R, despite identifying psychopathic personality traits in individuals possessing high levels of those traits, may not identify these traits equally well for men and women, and scores are likely influenced by the scoring format of the individual item and scale.
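For readers unfamiliar with the graded response model, the category probabilities can be sketched as below. This uses Samejima's standard logistic form, in which the probability of responding in category k or higher is 1 / (1 + exp(-a * (theta - b_k))); the parameter values are hypothetical and are not estimates from the PPI-R.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the graded response model.

    theta       -- latent trait level
    a           -- item discrimination
    thresholds  -- increasing boundary locations b_1 < ... < b_m
    Returns m + 1 probabilities, one per ordered response category.
    """
    # Cumulative probabilities P(X >= k), padded with 1 (k = 0) and 0 (k = m + 1)
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulative curves
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


probs = grm_category_probs(theta=0.0, a=1.0, thresholds=[-1.0, 1.0])
```

An item with a small discrimination value gives nearly flat category curves, which corresponds to the "limited ability to differentiate" described above.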
The Comparative Effectiveness of Different Item Analysis Techniques in Increasing Change Score Reliability.

ERIC Educational Resources Information Center

Crocker, Linda M.; Mehrens, William A.

Four new methods of item analysis were used to select subsets of items which would yield measures of attitude change. The sample consisted of 263 students at Michigan State University who were tested on the Inventory of Beliefs as freshmen and retested on the same instrument as juniors. Item change scores and total change scores were computed for…
Determining an Imaging Literacy Curriculum for Radiation Oncologists: An International Delphi Study

DOE Office of Scientific and Technical Information (OSTI.GOV)

Giuliani, Meredith E.; Gillan, Caitlin

2014-03-15

Purpose: Rapid evolution of imaging technologies and their integration into radiation therapy practice demands that radiation oncology (RO) training curricula be updated. The purpose of this study was to develop an entry-to-practice image literacy competency profile. Methods and Materials: A list of 263 potential imaging competency items was assembled from international objectives of training. An expert panel eliminated redundant or irrelevant items to create a list of 97 unique potential competency items. An international 2-round Delphi process was conducted with experts in RO. In round 1, all experts scored, on a 9-point Likert scale, the degree to which they agreed an item should be included in the competency profile. Items with a mean score ≥7 were included, those 4 to 6 were reviewed in round 2, and items scored <4 were excluded. In round 2, items were discussed and subsequently ranked for inclusion or exclusion in the competency profile. Items with >75% voting for inclusion were included in the final competency profile. Results: Forty-nine radiation oncologists were invited to participate in round 1, and 32 (65%) did so. Participants represented 24 centers in 6 countries. Of the 97 items ranked in round 1, 80 had a mean score ≥7, 1 item had a score <4, and 16 items with a mean score of 4 to 6 were reviewed and rescored in round 2. In round 2, 4 items had >75% of participants voting for inclusion and were included; the remaining 12 were excluded. The final list of 84 items formed the final competency profile. The 84 enabling competency items were aggregated into the following 4 thematic groups of key competencies: (1) imaging fundamentals (42 items); (2) clinical application (27 items); (3) clinical management (5 items); and (4) professional practice (10 items). Conclusions: We present an imaging literacy competency profile which could constitute the minimum training standards in radiation oncology residency programs.
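The two decision rules in this Delphi process can be written out as a short sketch. The function names are hypothetical, and the handling of mean scores that fall between 6 and 7 is an assumption (they are sent to review), since the abstract does not say how such values were treated.

```python
def triage_round1(mean_score):
    """Round 1 rule on the 9-point scale: include, exclude, or carry to round 2."""
    if mean_score >= 7:
        return "include"
    if mean_score < 4:
        return "exclude"
    # Assumption: anything from 4 up to (but not including) 7 goes to review
    return "review"


def round2_included(votes_for, total_votes):
    """Round 2 rule: strictly more than 75% of voters must favor inclusion."""
    return votes_for / total_votes > 0.75


decisions = [triage_round1(m) for m in (8.2, 7.0, 5.5, 3.9)]
passes = round2_included(25, 32)      # 78.1% in favor
borderline = round2_included(24, 32)  # exactly 75% does not pass
```

With the counts reported above, 80 items included in round 1 plus 4 included in round 2 give the final profile of 84 items.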
Generalization of the Lord-Wingersky Algorithm to Computing the Distribution of Summed Test Scores Based on Real-Number Item Scores

ERIC Educational Resources Information Center

Kim, Seonghoon

2013-01-01

With known item response theory (IRT) item parameters, Lord and Wingersky provided a recursive algorithm for computing the conditional frequency distribution of number-correct test scores, given proficiency. This article presents a generalized algorithm for computing the conditional distribution of summed test scores involving real-number item…
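The recursion for integer item scores can be sketched as below. This is the familiar integer-score form of the algorithm, extended to polytomous items; it is not the real-number generalization the article presents. The category probabilities are assumed to be already evaluated at a fixed proficiency, and the function name is hypothetical.

```python
def summed_score_distribution(item_category_probs):
    """Conditional distribution of the summed score at a fixed proficiency.

    item_category_probs[i][k] is P(item i is scored k | theta), with
    integer scores k = 0, 1, ..., m_i. Returns a list whose entry s is
    P(summed score = s | theta).
    """
    dist = [1.0]  # before any item is added, the score is 0 with certainty
    for probs in item_category_probs:
        updated = [0.0] * (len(dist) + len(probs) - 1)
        for score, p_score in enumerate(dist):
            for k, p_k in enumerate(probs):
                # Items are locally independent given theta, so probabilities multiply
                updated[score + k] += p_score * p_k
        dist = updated
    return dist


# One dichotomous item and one three-category item (made-up probabilities)
dist = summed_score_distribution([[0.5, 0.5], [0.2, 0.3, 0.5]])
```

Each pass adds one item, so the cost grows with the number of items times the number of attainable scores rather than with the number of response patterns.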

Observed Score and True Score Equating Procedures for Multidimensional Item Response Theory

ERIC Educational Resources Information Center

Brossman, Bradley Grant

2010-01-01

The purpose of this research was to develop observed score and true score equating procedures to be used in conjunction with the Multidimensional Item Response Theory (MIRT) framework. Currently, MIRT scale linking procedures exist to place item parameter estimates and ability estimates on the same scale after separate calibrations are conducted.…
Development of a PROMIS item bank to measure pain interference.

PubMed

Amtmann, Dagmar; Cook, Karon F; Jensen, Mark P; Chen, Wen-Hung; Choi, Seung; Revicki, Dennis; Cella, David; Rothrock, Nan; Keefe, Francis; Callahan, Leigh; Lai, Jin-Shei

2010-07-01

This paper describes the psychometric properties of the PROMIS-pain interference (PROMIS-PI) bank. An initial candidate item pool (n=644) was developed and evaluated based on the review of existing instruments, interviews with patients, and consultation with pain experts. From this pool, a candidate item bank of 56 items was selected and responses to the items were collected from large community and clinical samples. A total of 14,848 participants responded to all or a subset of candidate items. The responses were calibrated using an item response theory (IRT) model. A final 41-item bank was evaluated with respect to IRT assumptions, model fit, differential item function (DIF), precision, and construct and concurrent validity. Items of the revised bank had good fit to the IRT model (CFI and NNFI/TLI ranged from 0.974 to 0.997), and the data were strongly unidimensional (e.g., ratio of first and second eigenvalue=35). Nine items exhibited statistically significant DIF. However, adjusting for DIF had little practical impact on score estimates and the items were retained without modifying scoring. Scores provided substantial information across levels of pain; for scores in the T-score range 50-80, the reliability was equivalent to 0.96-0.99. Patterns of correlations with other health outcomes supported the construct validity of the item bank. The scores discriminated among persons with different numbers of chronic conditions, disabling conditions, levels of self-reported health, and pain intensity (p<0.0001). The results indicated that the PROMIS-PI items constitute a psychometrically sound bank. Computerized adaptive testing and short forms are available. Copyright 2010 International Association for the Study of Pain. All rights reserved.
Comparability of scores on the MMPI-2-RF scales generated with the MMPI-2 and MMPI-2-RF booklets.

PubMed

Van der Heijden, P T; Egger, J I M; Derksen, J J L

2010-05-01

In most validity studies on the recently released 338-item MMPI-2 (Butcher, Dahlstrom, Graham, Tellegen, & Kaemmer, 1989) Restructured Form (MMPI-2-RF; Ben-Porath & Tellegen, 2008; Tellegen & Ben-Porath, 2008), scale scores were derived from the 567-item MMPI-2 booklet. In this study, we evaluated the comparability of the MMPI-2-RF scale scores derived from the original 567-item MMPI-2 booklet with MMPI-2-RF scale scores derived from the 338-item MMPI-2-RF booklet in a Dutch student sample (N = 107). We used a counterbalanced (ABBA) design. We compared results with those previously reported by Tellegen and Ben-Porath (2008). Our findings support the comparability of the scores of the 338-item version and the 567-item version of the 50 MMPI-2-RF scales. We discuss clinical implications and directions for further research.
Item response theory analyses of the Delis-Kaplan Executive Function System card sorting subtest.

PubMed

Spencer, Mercedes; Cho, Sun-Joo; Cutting, Laurie E

2018-02-02

In the current study, we examined the dimensionality of the 16-item Card Sorting subtest of the Delis-Kaplan Executive Functioning System assessment in a sample of 264 native English-speaking children between the ages of 9 and 15 years. We also tested for measurement invariance for these items across age and gender groups using item response theory (IRT). Results of the exploratory factor analysis indicated that a two-factor model that distinguished between verbal and perceptual items provided the best fit to the data. Although the items demonstrated measurement invariance across age groups, measurement invariance was violated for gender groups, with two items demonstrating differential item functioning for males and females. Multigroup analysis using all 16 items indicated that the items were more effective for individuals whose IRT scale scores were relatively high. A single-group explanatory IRT model using 14 non-differential item functioning items showed that for perceptual ability, females scored higher than males and that scores increased with age for both males and females; for verbal ability, the observed increase in scores across age differed for males and females. The implications of these findings are discussed.
Integrating patient reported outcome measures and computerized adaptive test estimates on the same common metrics: an example from the assessment of activities in rheumatoid arthritis.

PubMed

Doğanay Erdoğan, Beyza; Elhan, Atilla Halİl; Kaskatı, Osman Tolga; Öztuna, Derya; Küçükdeveci, Ayşe Adile; Kutlay, Şehim; Tennant, Alan

2017-10-01

This study aimed to explore the potential of an inclusive and fully integrated measurement system for the Activities component of the International Classification of Functioning, Disability and Health (ICF), incorporating four classical scales, including the Health Assessment Questionnaire (HAQ), and a computerized adaptive test (CAT). Three hundred patients with rheumatoid arthritis (RA) answered relevant questions from four questionnaires. Rasch analysis was performed to create an item bank using this item pool. A further 100 RA patients were recruited for a CAT application. Both real and simulated CATs were applied and the agreement between these CAT-based scores and 'paper-pencil' scores was evaluated with the intraclass correlation coefficient (ICC). Anchoring strategies were used to obtain a direct translation from the item bank common metric to the HAQ score. Mean age of the 300 patients was 52.3 ± 11.7 years; disease duration was 11.3 ± 8.0 years; 74.7% were women. After testing for the assumptions of Rasch analysis, a 28-item Activities item bank was created. The agreement between CAT-based scores and paper-pencil scores was high (ICC = 0.993). Using those HAQ items in the item bank as anchoring items, another Rasch analysis was performed with HAQ-8 scores as separate items together with anchoring items. Finally, a conversion table from the item bank common metric to the HAQ scores was created. A fully integrated and inclusive health assessment system, illustrating the Activities component of the ICF, was built to assess RA patients. Raw score to metric conversions and vice versa were available, giving access to the metric by a simple look-up table. © 2015 Asia Pacific League of Associations for Rheumatology and Wiley Publishing Asia Pty Ltd.
Accounting for independent nondifferential misclassification does not increase certainty that an observed association is in the correct direction.

PubMed

Greenland, Sander; Gustafson, Paul

2006-07-01

Researchers sometimes argue that their exposure-measurement errors are independent of other errors and are nondifferential with respect to disease, resulting in estimation bias toward the null. Among well-known problems with such arguments are that independence and nondifferentiality are harder to satisfy than ordinarily appreciated (e.g., because of correlation of errors in questionnaire items, and because of uncontrolled covariate effects on error rates); small violations of independence or nondifferentiality may lead to bias away from the null; and, if exposure is polytomous, the bias produced by independent nondifferential error is not always toward the null. The authors add to this list by showing that, in a 2 x 2 table (for which independent nondifferential error produces bias toward the null), accounting for independent nondifferential error does not reduce the p value even though it increases the point estimate. Thus, such accounting should not increase certainty that an association is present.
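The parenthetical claim about the 2 x 2 table, that independent nondifferential error biases the estimate toward the null, can be checked numerically. The sketch below is an illustration with made-up exposure prevalences, sensitivity, and specificity; it does not reproduce the authors' argument about p values.

```python
def odds_ratio(p_cases, p_controls):
    """Odds ratio from exposure prevalence in cases and in controls."""
    return (p_cases / (1 - p_cases)) / (p_controls / (1 - p_controls))


def observed_prevalence(true_prevalence, sensitivity, specificity):
    """Expected observed exposure prevalence under misclassification."""
    return sensitivity * true_prevalence + (1 - specificity) * (1 - true_prevalence)


# Hypothetical true exposure prevalences
p_cases, p_controls = 0.4, 0.2
# Nondifferential: the same sensitivity and specificity apply to cases and controls
sens, spec = 0.8, 0.9

true_or = odds_ratio(p_cases, p_controls)
observed_or = odds_ratio(
    observed_prevalence(p_cases, sens, spec),
    observed_prevalence(p_controls, sens, spec),
)
```

With these inputs the true odds ratio is about 2.67 and the expected observed odds ratio is about 1.94, so the observed association is attenuated toward 1.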
Assessing Psycho-social Barriers to Rehabilitation in Injured Workers with Chronic Musculoskeletal Pain: Development and Item Properties of the Yellow Flag Questionnaire (YFQ).

PubMed

Salathé, Cornelia Rolli; Trippolini, Maurizio Alen; Terribilini, Livio Claudio; Oliveri, Michael; Elfering, Achim

2018-06-01

Purpose To develop a multidimensional scale to assess psychosocial beliefs, the Yellow Flag Questionnaire (YFQ), aimed at guiding interventions for workers with chronic musculoskeletal (MSK) pain. Methods Phase 1 consisted of item selection based on literature search, item development and expert consensus rounds. In phase 2, items were reduced by calculating a quality score per item, using structural equation modeling and confirmatory factor analysis on data from 666 workers. In phase 3, Cronbach's α and Pearson correlation coefficients were computed to compare the YFQ score with disability, anxiety, depression and self-efficacy, based on data from 253 injured workers. Regressions of YFQ total score on disability, anxiety, depression and self-efficacy were calculated. Results After phase 1, the YFQ included 116 items and 15 domains. Further reductions of items in phase 2 by applying the item quality criteria reduced the total to 48 items. Factor analysis with structural equation modeling confirmed 32 items in seven domains: activity, work, emotions, harm & blame, diagnosis beliefs, co-morbidity and control. Cronbach's α was 0.91 for the total score and between 0.49 and 0.81 for the seven domain scores. Correlations of the YFQ total score with disability, anxiety, depression and self-efficacy were .58, .66, .73 and -.51, respectively. After controlling for age and gender, the YFQ total score explained between 27% and 53% (R2) of the variance of disability, anxiety, depression and self-efficacy. Conclusions The YFQ, a multidimensional screening scale, is recommended for use to assess psychosocial beliefs of workers with chronic MSK pain. Further evaluation of the measurement properties such as the test-retest reliability, responsiveness and prognostic validity is warranted.
Development of a Valid and Reliable Knee Articular Cartilage Condition-Specific Study Methodological Quality Score.

PubMed

Harris, Joshua D; Erickson, Brandon J; Cvetanovich, Gregory L; Abrams, Geoffrey D; McCormick, Frank M; Gupta, Anil K; Verma, Nikhil N; Bach, Bernard R; Cole, Brian J

2014-02-01

Condition-specific questionnaires are important components in evaluation of outcomes of surgical interventions. No condition-specific study methodological quality questionnaire exists for evaluation of outcomes of articular cartilage surgery in the knee. To develop a reliable and valid knee articular cartilage-specific study methodological quality questionnaire. Cross-sectional study. A stepwise, a priori-designed framework was created for development of a novel questionnaire. Relevant items to the topic were identified and extracted from a recent systematic review of 194 investigations of knee articular cartilage surgery. In addition, relevant items from existing generic study methodological quality questionnaires were identified. Items for a preliminary questionnaire were generated. Redundant and irrelevant items were eliminated, and acceptable items modified. The instrument was pretested and items weighed. The instrument, the MARK score (Methodological quality of ARticular cartilage studies of the Knee), was tested for validity (criterion validity) and reliability (inter- and intraobserver). A 19-item, 3-domain MARK score was developed. The 100-point scale score demonstrated face validity (focus group of 8 orthopaedic surgeons) and criterion validity (strong correlation to Cochrane Quality Assessment score and Modified Coleman Methodology Score). Interobserver reliability for the overall score was good (intraclass correlation coefficient [ICC], 0.842), and for all individual items of the MARK score, acceptable to perfect (ICC, 0.70-1.000). Intraobserver reliability ICC assessed over a 3-week interval was strong for 2 reviewers (≥0.90). The MARK score is a valid and reliable knee articular cartilage condition-specific study methodological quality instrument. This condition-specific questionnaire may be used to evaluate the quality of studies reporting outcomes of articular cartilage surgery in the knee.
Handling missing values in the MDS-UPDRS.

PubMed

Goetz, Christopher G; Luo, Sheng; Wang, Lu; Tilley, Barbara C; LaPelle, Nancy R; Stebbins, Glenn T

2015-10-01

This study was undertaken to define the number of missing values permissible to render valid total scores for each Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS) part. To handle missing values, imputation strategies serve as guidelines to reject an incomplete rating or create a surrogate score. We tested a rigorous, scale-specific, data-based approach to handling missing values for the MDS-UPDRS. From two large MDS-UPDRS datasets, we sequentially deleted item scores, either consistently (same items) or randomly (different items) across all subjects. Lin's Concordance Correlation Coefficient (CCC) compared scores calculated without missing values with prorated scores based on sequentially increasing missing values. The maximal number of missing values retaining a CCC greater than 0.95 determined the threshold for rendering a valid prorated score. A second confirmatory sample was selected from the MDS-UPDRS international translation program. To provide valid part scores applicable across all Hoehn and Yahr (H&Y) stages when the same items are consistently missing, one missing item from Part I, one from Part II, three from Part III, but none from Part IV can be allowed. To provide valid part scores applicable across all H&Y stages when random item entries are missing, one missing item from Part I, two from Part II, seven from Part III, but none from Part IV can be allowed. All cutoff values were confirmed in the validation sample. These analyses are useful for constructing valid surrogate part scores for MDS-UPDRS when missing items fall within the identified threshold and give scientific justification for rejecting partially completed ratings that fall below the threshold. © 2015 International Parkinson and Movement Disorder Society.
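The deletion-and-prorating procedure described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the data are invented, missing items are marked with `None` by assumption, and the names `prorated_total` and `lins_ccc` are hypothetical. Prorating here means scaling the mean of the observed items up to the full item count, which is one common convention; the abstract does not spell out the exact formula used.

```python
from statistics import mean

def prorated_total(item_scores):
    """Scale the mean of the observed items up to the full item count.

    Missing items are represented by None (an assumed convention).
    """
    observed = [s for s in item_scores if s is not None]
    if not observed:
        raise ValueError("no observed items to prorate from")
    return mean(observed) * len(item_scores)

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for paired scores."""
    mx, my = mean(x), mean(y)
    n = len(x)
    # Population (1/n) moments, as in Lin's definition.
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)

# Hypothetical complete ratings: four subjects on a five-item part.
complete = [[1, 2, 0, 3, 1], [0, 0, 1, 1, 0], [4, 3, 4, 2, 3], [2, 2, 2, 1, 2]]
# Delete the same item (index 3) for every subject, then prorate.
with_missing = [row[:3] + [None] + row[4:] for row in complete]

full_totals = [sum(row) for row in complete]
prorated = [prorated_total(row) for row in with_missing]
ccc = lins_ccc(full_totals, prorated)

# The abstract's rule: a prorated score is acceptable while CCC stays above 0.95.
acceptable = ccc > 0.95
```

Repeating the deletion step with one, two, three or more items removed, and recording the largest count for which `acceptable` stays true, would reproduce the shape of the threshold search described above.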
Automatic Scoring of Paper-and-Pencil Figural Responses. Research Report.

ERIC Educational Resources Information Center

Martinez, Michael E.; And Others

Large-scale testing is dominated by the multiple-choice question format. Widespread use of the format is due, in part, to the ease with which multiple-choice items can be scored automatically. This paper examines automatic scoring procedures for an alternative item type: figural response. Figural response items call for the completion or…
Automatically Scoring Short Essays for Content. CRESST Report 836

ERIC Educational Resources Information Center

Kerr, Deirdre; Mousavi, Hamid; Iseli, Markus R.

2013-01-01

The Common Core assessments emphasize short essay constructed response items over multiple choice items because they are more precise measures of understanding. However, such items are too costly and time consuming to be used in national assessments unless a way is found to score them automatically. Current automatic essay scoring techniques are…
Preequating with Empirical Item Characteristic Curves: An Observed-Score Preequating Method

ERIC Educational Resources Information Center

Zu, Jiyun; Puhan, Gautam

2014-01-01

Preequating is in demand because it reduces score reporting time. In this article, we evaluated an observed-score preequating method: the empirical item characteristic curve (EICC) method, which makes preequating without item response theory (IRT) possible. EICC preequating results were compared with a criterion equating and with IRT true-score…
A Comparison of Item-Level and Scale-Level Multiple Imputation for Questionnaire Batteries

ERIC Educational Resources Information Center

Gottschall, Amanda C.; West, Stephen G.; Enders, Craig K.

2012-01-01

Behavioral science researchers routinely use scale scores that sum or average a set of questionnaire items to address their substantive questions. A researcher applying multiple imputation to incomplete questionnaire data can either impute the incomplete items prior to computing scale scores or impute the scale scores directly from other scale…
Variation in the Readability of Items Within Surveys

PubMed Central

Calderón, José L.; Morales, Leo S.; Liu, Honghu; Hays, Ron D.

2006-01-01

The objective of this study was to estimate the variation in the readability of survey items within 2 widely used health-related quality-of-life surveys: the National Eye Institute Visual Functioning Questionnaire–25 (VFQ-25) and the Short Form Health Survey, version 2 (SF-36v2). Flesch-Kincaid and Flesch Reading Ease formulas were used to estimate readability. Individual survey item scores and descriptive statistics for each survey were calculated. Variation of individual item scores from the mean survey score was graphically depicted for each survey. The mean reading grade level and reading ease estimates for the VFQ-25 and SF-36v2 were 7.8 (fairly easy) and 6.4 (easy), respectively. Both surveys had notable variation in item readability; individual item readability scores ranged from 3.7 to 12.0 (very easy to difficult) for the VFQ-25 and 2.2 to 12.0 (very easy to difficult) for the SF-36v2. Because survey respondents may not comprehend items with readability scores that exceed their reading ability, estimating the readability of each survey item is an important component of evaluating survey readability. Standards for measuring the readability of surveys are needed. PMID:16401705
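For readers unfamiliar with the two formulas named in this abstract, a short sketch of the standard Flesch Reading Ease and Flesch-Kincaid grade-level equations follows. The word, sentence and syllable counts are invented for a single hypothetical survey item; counting syllables automatically is a separate problem that is not attempted here.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher values indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical counts for one survey item: 12 words, 1 sentence, 16 syllables.
ease = flesch_reading_ease(words=12, sentences=1, syllables=16)
grade = flesch_kincaid_grade(words=12, sentences=1, syllables=16)
```

Applying both functions to every item separately, rather than to the survey as one block of text, is what exposes the item-to-item variation the study reports.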
Measuring anxiety after spinal cord injury: Development and psychometric characteristics of the SCI-QOL Anxiety item bank and linkage with GAD-7.

PubMed

Kisala, Pamela A; Tulsky, David S; Kalpakjian, Claire Z; Heinemann, Allen W; Pohlig, Ryan T; Carle, Adam; Choi, Seung W

2015-05-01

To develop a calibrated item bank and computer adaptive test to assess anxiety symptoms in individuals with spinal cord injury (SCI), transform scores to the Patient Reported Outcomes Measurement Information System (PROMIS) metric, and create a statistical linkage with the Generalized Anxiety Disorder (GAD)-7, a widely used anxiety measure. Grounded-theory based qualitative item development methods; large-scale item calibration field testing; confirmatory factor analysis; graded response model item response theory analyses; statistical linking techniques to transform scores to a PROMIS metric; and linkage with the GAD-7. Setting: Five SCI Model System centers and one Department of Veterans Affairs medical center in the United States. Participants: Adults with traumatic SCI. Spinal Cord Injury-Quality of Life (SCI-QOL) Anxiety Item Bank. Seven hundred sixteen individuals with traumatic SCI completed 38 items assessing anxiety, 17 of which were PROMIS items. After 13 items (including 2 PROMIS items) were removed, factor analyses confirmed unidimensionality. Item response theory analyses were used to estimate slopes and thresholds for the final 25 items (15 from PROMIS). The observed Pearson correlation between the SCI-QOL Anxiety and GAD-7 scores was 0.67. The SCI-QOL Anxiety item bank demonstrates excellent psychometric properties and is available as a computer adaptive test or short form for research and clinical applications. SCI-QOL Anxiety scores have been transformed to the PROMIS metric and we provide a method to link SCI-QOL Anxiety scores with those of the GAD-7.
Development, scoring, and reliability of the Microscale Audit of Pedestrian Streetscapes (MAPS)

PubMed Central

2013-01-01

Background: Streetscape (microscale) features of the built environment can influence people’s perceptions of their neighborhoods’ suitability for physical activity. Many microscale audit tools have been developed, but few have published systematic scoring methods. We present the development, scoring, and reliability of the Microscale Audit of Pedestrian Streetscapes (MAPS) tool and its theoretically-based subscales. Methods: MAPS was based on prior instruments and was developed to assess details of streetscapes considered relevant for physical activity. MAPS sections (route, segments, crossings, and cul-de-sacs) were scored by two independent raters for reliability analyses. There were 290 route pairs, 516 segment pairs, 319 crossing pairs, and 53 cul-de-sac pairs in the reliability sample. Individual inter-rater item reliability analyses were computed using Kappa, intra-class correlation coefficient (ICC), and percent agreement. A conceptual framework for subscale creation was developed using theory, expert consensus, and policy relevance. Items were grouped into subscales, and subscales were analyzed for inter-rater reliability at tiered levels of aggregation. Results: There were 160 items included in the subscales (out of 201 items total). Of those included in the subscales, 80 items (50.0%) had good/excellent reliability, 41 items (25.6%) had moderate reliability, and 18 items (11.3%) had low reliability, with limited variability in the remaining 21 items (13.1%). Seventeen of the 20 route section subscales, valence (positive/negative) scores, and overall scores (85.0%) demonstrated good/excellent reliability and 3 demonstrated moderate reliability. Of the 16 segment subscales, valence scores, and overall scores, 12 (75.0%) demonstrated good/excellent reliability, three demonstrated moderate reliability, and one demonstrated poor reliability. Of the 8 crossing subscales, valence scores, and overall scores, 6 (75.0%) demonstrated good/excellent reliability, and 2 demonstrated moderate reliability. The cul-de-sac subscale demonstrated good/excellent reliability. Conclusions: MAPS items and subscales predominantly demonstrated moderate to excellent reliability. The subscales and scoring system represent a theoretically based framework for using these complex microscale data and may be applicable to other similar instruments. PMID:23621947
Development of a Survey to Explore Factors Influencing the Adoption of Best Practices for Diabetic Foot Ulcer Offloading.

PubMed

Bleau Lavigne, Maude; Reeves, Isabelle; Sasseville, Marie-Josée; Loignon, Christine

The primary purpose of this study was to develop 2 survey tools to explore factors influencing adoption of best practices for diabetic foot ulcer offloading treatment in primary health care settings. One survey was intended for the patients receiving care for a diabetic foot ulcer in primary health care settings and the other was intended for the health professionals providing treatment. The second purpose of this study was to evaluate the psychometric properties of the 2 surveys. Development and validation of survey instruments. Two surveys were developed using a published guide. Following review of pertinent literature and identification of variables to be measured, a bank of items was developed and pretested to determine clarity of the items and responses. Psychometric testing comprised measurement of content validity index (CVI) and intraclass correlation coefficient (ICC). Only items obtaining satisfactory CVI and ICC scores were included in the final version of the surveys. The final version of the patient survey contained 41 items and the final version of the survey for health care professionals contained 21 items. The patient-intended survey's items demonstrated high content validity scores and satisfactory test-retest reliability scores. The overall CVI score was 0.98. Forty of the 49 items eligible for testing obtained satisfactory ICC scores. One item's test-retest reliability could not be tested but it was retained based on its high CVI. The health professional-intended survey had an overall CVI score of 0.91, but its items had lower ICC scores: 63% (31 of the 49 items) did not achieve a satisfactory ICC score for inclusion in the final instrument. This project led to development of 2 instruments designed to identify and explore factors influencing adoption of best practices for diabetic foot ulcer offloading treatment in the primary health care setting. Future research and testing are required to translate these French surveys into English and additional languages, in order to reach a broader population.
The Dutch-Flemish PROMIS Physical Function item bank exhibited strong psychometric properties in patients with chronic pain.

PubMed

Crins, Martine H P; Terwee, Caroline B; Klausch, Thomas; Smits, Niels; de Vet, Henrica C W; Westhovens, Rene; Cella, David; Cook, Karon F; Revicki, Dennis A; van Leeuwen, Jaap; Boers, Maarten; Dekker, Joost; Roorda, Leo D

2017-07-01

The objective of this study was to assess the psychometric properties of the Dutch-Flemish Patient-Reported Outcomes Measurement Information System (PROMIS) Physical Function item bank in Dutch patients with chronic pain. A bank of 121 items was administered to 1,247 Dutch patients with chronic pain. Unidimensionality was assessed by fitting a one-factor confirmatory factor analysis and evaluating resulting fit statistics. Items were calibrated with the graded response model and its fit was evaluated. Cross-cultural validity was assessed by testing items for differential item functioning (DIF) based on language (Dutch vs. English). Construct validity was evaluated by calculating correlations between scores on the Dutch-Flemish PROMIS Physical Function measure and scores on generic and disease-specific measures. Results supported the Dutch-Flemish PROMIS Physical Function item bank's unidimensionality (Comparative Fit Index = 0.976, Tucker Lewis Index = 0.976) and model fit. Item thresholds targeted a wide range of the physical function construct (threshold-parameters range: -4.2 to 5.6). Cross-cultural validity was good, as only four items showed DIF for language and their impact on item scores was minimal. Physical Function scores were strongly associated with scores on all other measures (all correlations ≤ -0.60 as expected). The Dutch-Flemish PROMIS Physical Function item bank exhibited good psychometric properties. Development of a computer adaptive test based on the large bank is warranted. Copyright © 2017 Elsevier Inc. All rights reserved.
Differential item functioning magnitude and impact measures from item response theory models.

PubMed

Kleinman, Marjorie; Teresi, Jeanne A

2016-01-01

Measures of magnitude and impact of differential item functioning (DIF) at the item and scale level, respectively, are presented and reviewed in this paper. Most measures are based on item response theory models. Magnitude refers to item level effect sizes, whereas impact refers to differences between groups at the scale score level. Reviewed are magnitude measures based on group differences in the expected item scores and impact measures based on differences in the expected scale scores. The similarities among these indices are demonstrated. Various software packages are described that provide magnitude and impact measures, and new software is presented that computes all of the available statistics conveniently in one program with explanations of their relationships to one another.
Multidimensional CAT Item Selection Methods for Domain Scores and Composite Scores with Item Exposure Control and Content Constraints

ERIC Educational Resources Information Center

Yao, Lihua

2014-01-01

The intent of this research was to find an item selection procedure in the multidimensional computer adaptive testing (CAT) framework that yielded higher precision for both the domain and composite abilities, had a higher usage of the item pool, and controlled the exposure rate. Five multidimensional CAT item selection procedures (minimum angle;…

Structural Zeros and Their Implications with Log-Linear Bivariate Presmoothing under the Internal-Anchor Design

ERIC Educational Resources Information Center

Kim, Hyung Jin; Brennan, Robert L.; Lee, Won-Chan

2017-01-01

In equating, when common items are internal and scoring is conducted in terms of the number of correct items, some pairs of total scores ("X") and common-item scores ("V") can never be observed in a bivariate distribution of "X" and "V"; these pairs are called "structural zeros." This simulation…
Effects of Differential Item Functioning on Examinees' Test Performance and Reliability of Test

ERIC Educational Resources Information Center

Lee, Yi-Hsuan; Zhang, Jinming

2017-01-01

Simulations were conducted to examine the effect of differential item functioning (DIF) on measurement consequences such as total scores, item response theory (IRT) ability estimates, and test reliability in terms of the ratio of true-score variance to observed-score variance and the standard error of estimation for the IRT ability parameter. The…
Reliability of Summed Item Scores Using Structural Equation Modeling: An Alternative to Coefficient Alpha

ERIC Educational Resources Information Center

Green, Samuel B.; Yang, Yanyun

2009-01-01

A method is presented for estimating reliability using structural equation modeling (SEM) that allows for nonlinearity between factors and item scores. Assuming the focus is on consistency of summed item scores, this method for estimating reliability is preferred to those based on linear SEM models and to the most commonly reported estimate of…
Item hierarchy-based analysis of the Rivermead Mobility Index resulted in improved interpretation and enabled faster scoring in patients undergoing rehabilitation after stroke.

PubMed

Roorda, Leo D; Green, John R; Houwink, Annemieke; Bagley, Pam J; Smith, Jane; Molenaar, Ivo W; Geurts, Alexander C

2012-06-01

To enable improved interpretation of the total score and faster scoring of the Rivermead Mobility Index (RMI) by studying item ordering or hierarchy and formulating start-and-stop rules in patients after stroke. Cohort study. Rehabilitation center in the Netherlands; stroke rehabilitation units and the community in the United Kingdom. Item hierarchy of the RMI was studied in an initial group of patients (n=620; mean age ± SD, 69.2±12.5y; 297 [48%] men; 304 [49%] left hemisphere lesion, and 269 [43%] right hemisphere lesion), and the adequacy of the item hierarchy-based start-and-stop rules was checked in a second group of patients (n=237; mean age ± SD, 60.0±11.3y; 139 [59%] men; 103 [44%] left hemisphere lesion, and 93 [39%] right hemisphere lesion) undergoing rehabilitation after stroke. Not applicable. Mokken scale analysis was used to investigate the fit of the double monotonicity model, indicating hierarchical item ordering. The percentages of patients with a difference between the RMI total score and the scores based on the start-and-stop rules were calculated to check the adequacy of these rules. The RMI had good fit of the double monotonicity model (coefficient H(T)=.87). The interpretation of the total score improved. Item hierarchy-based start-and-stop rules were formulated. The percentages of patients with a difference between the RMI total score and the score based on the recommended start-and-stop rules were 3% and 5%, respectively. Ten of the original 15 items had to be scored after applying the start-and-stop rules. Item hierarchy was established, enabling improved interpretation and faster scoring of the RMI. Copyright © 2012 American Congress of Rehabilitation Medicine. Published by Elsevier Inc. All rights reserved.
Testing item response theory invariance of the standardized Quality-of-life Disease Impact Scale (QDIS(®)) in acute coronary syndrome patients: differential functioning of items and test.

PubMed

Deng, Nina; Anatchkova, Milena D; Waring, Molly E; Han, Kyung T; Ware, John E

2015-08-01

The Quality-of-life (QOL) Disease Impact Scale (QDIS(®)) standardizes the content and scoring of QOL impact attributed to different diseases using item response theory (IRT). This study examined the IRT invariance of the QDIS-standardized IRT parameters in an independent sample. The differential functioning of items and test (DFIT) of a static short-form (QDIS-7) was examined across two independent sources: patients hospitalized for acute coronary syndrome (ACS) in the TRACE-CORE study (N = 1,544) and chronically ill US adults in the QDIS standardization sample. "ACS-specific" IRT item parameters were calibrated and linearly transformed to compare to "standardized" IRT item parameters. Differences in IRT model-expected item, scale and theta scores were examined. The DFIT results were also compared in a standard logistic regression differential item functioning analysis. Item parameters estimated in the ACS sample showed lower discrimination parameters than the standardized discrimination parameters, but only small differences were found for thresholds parameters. In DFIT, results on the non-compensatory differential item functioning index (range 0.005-0.074) were all below the threshold of 0.096. Item differences were further canceled out at the scale level. IRT-based theta scores for ACS patients using standardized and ACS-specific item parameters were highly correlated (r = 0.995, root-mean-square difference = 0.09). Using standardized item parameters, ACS patients scored one-half standard deviation higher (indicating greater QOL impact) compared to chronically ill adults in the standardization sample. The study showed sufficient IRT invariance to warrant the use of standardized IRT scoring of QDIS-7 for studies comparing the QOL impact attributed to acute coronary disease and other chronic conditions.
Using Linear Equating to Map PROMIS(®) Global Health Items and the PROMIS-29 V2.0 Profile Measure to the Health Utilities Index Mark 3.

PubMed

Hays, Ron D; Revicki, Dennis A; Feeny, David; Fayers, Peter; Spritzer, Karen L; Cella, David

2016-10-01

Preference-based health-related quality of life (HR-QOL) scores are useful as outcome measures in clinical studies, for monitoring the health of populations, and for estimating quality-adjusted life-years. This was a secondary analysis of data collected in an internet survey as part of the Patient-Reported Outcomes Measurement Information System (PROMIS(®)) project. To estimate Health Utilities Index Mark 3 (HUI-3) preference scores, we used the ten PROMIS(®) global health items, the PROMIS-29 V2.0 single pain intensity item and seven multi-item scales (physical functioning, fatigue, pain interference, depressive symptoms, anxiety, ability to participate in social roles and activities, sleep disturbance), and the PROMIS-29 V2.0 items. Linear regression analyses were used to identify significant predictors, followed by simple linear equating to avoid regression to the mean. The regression models explained 48 % (global health items), 61 % (PROMIS-29 V2.0 scales), and 64 % (PROMIS-29 V2.0 items) of the variance in the HUI-3 preference score. Linear equated scores were similar to observed scores, although differences tended to be larger for older study participants. HUI-3 preference scores can be estimated from the PROMIS(®) global health items or PROMIS-29 V2.0. The estimated HUI-3 scores from the PROMIS(®) health measures can be used for economic applications and as a measure of overall HR-QOL in research.
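The reason linear equating avoids regression to the mean can be shown with a small sketch. A regression prediction has slope r·(SD_y/SD_x), which shrinks scores toward the mean whenever r < 1; linear equating uses the slope SD_y/SD_x, so the equated scores keep the mean and standard deviation of the target scale. The data and the function name `linear_equate` are hypothetical, and the study's actual models are not reproduced here.

```python
from statistics import mean, pstdev

def linear_equate(x_scores, y_scores):
    """Return a function mapping the x scale onto the y scale.

    Matches the mean and standard deviation of y (no shrinkage by r).
    """
    mx, my = mean(x_scores), mean(y_scores)
    slope = pstdev(y_scores) / pstdev(x_scores)
    return lambda x: my + slope * (x - mx)

# Hypothetical data: a PROMIS-based predictor and observed preference scores.
predictor = [40, 50, 60, 45, 55]
observed_utility = [0.5, 0.7, 0.9, 0.6, 0.8]

to_utility = linear_equate(predictor, observed_utility)
equated = [to_utility(x) for x in predictor]
```

By construction, `equated` has the same mean and spread as `observed_utility`, which is the property that motivates equating after the regression step.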
The Relationship of Expert-System Scored Constrained Free-Response Items to Multiple-Choice and Open-Ended Items.

ERIC Educational Resources Information Center

Bennett, Randy Elliot; And Others

1990-01-01

The relationship of an expert-system-scored constrained free-response item type to multiple-choice and free-response items was studied using data for 614 students on the College Board's Advanced Placement Computer Science (APCS) Examination. Implications for testing and the APCS test are discussed. (SLD)
Introducing the Postsecondary Instructional Practices Survey (PIPS): A Concise, Interdisciplinary, and Easy-to-Score Survey

PubMed Central

Walter, Emily M.; Henderson, Charles R.; Beach, Andrea L.; Williams, Cody T.

2016-01-01

Researchers, administrators, and policy makers need valid and reliable information about teaching practices. The Postsecondary Instructional Practices Survey (PIPS) is designed to measure the instructional practices of postsecondary instructors from any discipline. The PIPS has 24 instructional practice statements and nine demographic questions. Users calculate PIPS scores by an intuitive proportion-based scoring convention. Factor analyses from 72 departments at four institutions (N = 891) support a 2- or 5-factor solution for the PIPS; both models include all 24 instructional practice items and have good model fit statistics. Factors in the 2-factor model include (a) instructor-centered practices, nine items; and (b) student-centered practices, 13 items. Factors in the 5-factor model include (a) student–student interactions, six items; (b) content delivery, four items; (c) formative assessment, five items; (d) student-content engagement, five items; and (e) summative assessment, four items. In this article, we describe our development and validation processes, provide scoring conventions and outputs for results, and describe wider applications of the instrument. PMID:27810868
Rasch analysis of the UK Functional Assessment Measure in patients with complex disability after stroke.

PubMed

Medvedev, Oleg N; Turner-Stokes, Lynne; Ashford, Stephen; Siegert, Richard J

2018-02-28

To determine whether the UK Functional Assessment Measure (UK FIM+FAM) fits the Rasch model in stroke patients with complex disability and, if so, to derive a conversion table of Rasch-transformed interval level scores. The sample included a UK multicentre cohort of 1,318 patients admitted for specialist rehabilitation following a stroke. Rasch analysis was conducted for the 30-item scale including 3 domains of items measuring physical, communication and psychosocial functions. The fit of items to the Rasch model was examined using 3 different analytical approaches referred to as "pathways". The best fit was achieved in the pathway where responses from motor, communication and psychosocial domains were summarized into 3 super-items and where some items were split because of differential item functioning (DIF) relative to left and right hemisphere location (χ2 (10) = 14.48, p = 0.15). Re-scoring of items showing disordered thresholds did not significantly improve the overall model fit. The UK FIM+FAM with domain super-items satisfies expectations of the unidimensional Rasch model without the need for re-scoring. A conversion table was produced to convert the total scale scores into interval-level data based on person estimates of the Rasch model. The clinical benefits of interval-transformed scores require further evaluation.
Psychometric Properties of Reverse-Scored Items on the CES-D in a Sample of Ethnically Diverse Older Adults

ERIC Educational Resources Information Center

Carlson, Mike; Wilcox, Rand; Chou, Chih-Ping; Chang, Megan; Yang, Frances; Blanchard, Jeanine; Marterella, Abbey; Kuo, Ann; Clark, Florence

2011-01-01

Reverse-scored items on assessment scales increase cognitive processing demands and may therefore lead to measurement problems for older adult respondents. In this study, the objective was to examine possible psychometric inadequacies of reverse-scored items on the Center for Epidemiologic Studies Depression Scale (CES-D) when used to assess…
Gender-, age-, and race/ethnicity-based differential item functioning analysis of the movement disorder society-sponsored revision of the Unified Parkinson's disease rating scale.

PubMed

Goetz, Christopher G; Liu, Yuanyuan; Stebbins, Glenn T; Wang, Lu; Tilley, Barbara C; Teresi, Jeanne A; Merkitch, Douglas; Luo, Sheng

2016-12-01

Assess MDS-UPDRS items for gender-, age-, and race/ethnicity-based differential item functioning. Assessing differential item functioning is a core rating scale validation step. For the MDS-UPDRS, differential item functioning occurs if item-score probability among people with similar levels of parkinsonism differs according to selected covariates (gender, age, race/ethnicity). If the magnitude of differential item functioning is clinically relevant, item-score interpretation must consider influences by these covariates. Differential item functioning can be nonuniform (covariate variably influences an item-score across different levels of parkinsonism) or uniform (covariate influences an item-score consistently over all levels of parkinsonism). Using the MDS-UPDRS translation database of more than 5,000 PD patients from 14 languages, we tested gender-, age-, and race/ethnicity-based differential item functioning. To designate an item as having clinically relevant differential item functioning, we required statistical confirmation by 2 independent methods, along with a McFadden pseudo-R² magnitude statistic greater than "negligible." Most items showed no gender-, age- or race/ethnicity-based differential item functioning. When differential item functioning was identified, the magnitude statistic was always in the "negligible" range, and the scale-level impact was minimal. The absence of clinically relevant differential item functioning across all items and all parts of the MDS-UPDRS is strong evidence that the scale can be used confidently. As studies of Parkinson's disease increasingly involve multinational efforts and the MDS-UPDRS has several validated non-English translations, the findings support the scale's broad applicability in populations with varying gender, age, and race/ethnicity distributions. © 2016 International Parkinson and Movement Disorder Society.
The Intuitive Eating Scale-2: item refinement and psychometric evaluation with college women and men.

PubMed

Tylka, Tracy L; Kroon Van Diest, Ashley M

2013-01-01

The 21-item Intuitive Eating Scale (IES; Tylka, 2006) measures individuals' tendency to follow their physical hunger and satiety cues when determining when, what, and how much to eat. While its scores have demonstrated reliability and validity with college women, the IES-2 was developed to improve upon the original version. Specifically, we added 17 positively scored items to the original IES items (which were predominantly negatively scored), integrated an additional component of intuitive eating (Body-Food Choice Congruence), and evaluated its psychometric properties with 1,405 women and 1,195 men across three studies. After we deleted 15 items (due to low item-factor loadings, high cross-loadings, and redundant content), the results supported the psychometric properties of the IES-2 with women and men. The final 23-item IES-2 contained 11 original items and 12 added items. Exploratory and second-order confirmatory factor analyses upheld its hypothesized 4-factor structure (its original 3 factors, plus Body-Food Choice Congruence) and a higher order factor. The IES-2 was largely invariant across sex, although negligible differences on 1 factor loading and 2 item intercepts were detected. Demonstrating validity, the IES-2 total scores and most IES-2 subscale scores were (a) positively related to body appreciation, self-esteem, and satisfaction with life; (b) inversely related to eating disorder symptomatology, poor interoceptive awareness, body surveillance, body shame, body mass index, and internalization of media appearance ideals; and (c) negligibly related to social desirability. IES-2 scores also garnered incremental validity by predicting psychological well-being above and beyond eating disorder symptomatology. The IES-2's applications for empirical research and clinical work are discussed. PsycINFO Database Record (c) 2013 APA, all rights reserved.
A quality analysis of clinical anaesthesia study protocols from the Chinese clinical trials registry according to the SPIRIT statement.

PubMed

Yang, Lei; Chen, Shouming; Yang, Di; Li, Jiajin; Wu, Taixiang; Zuo, Yunxia

2018-05-15

To learn about the overall quality of clinical anaesthesia study protocols from the Chinese Clinical Trials Registry and to discuss the way to improve study protocol quality. We defined completeness of each sub-item in SPIRIT as N/A (not applicable) or with a score of 0, 1, or 2. For each protocol, we calculated the proportion of adequately reported items (score = 2 and N/A) and unreported items (score = 0). Protocol quality was determined according to the proportion of reported items, with values >50% indicating high quality. For each sub-item in SPIRIT, we calculated the adequately reported rate (percentage of all protocols with score 2 and N/A on one sub-item) as well as the unreported rate (percentage of all protocols with score 0 on one sub-item). A total of 126 study protocols were available for assessment. Among these, 88.1% were assessed as being of low quality. By comparison, the percentage of low-quality protocols was 88.9% after the publication of the SPIRIT statement. Among the 51 SPIRIT sub-items, 18 sub-items had an unreported rate above 90% while 16 had a higher adequately reported rate than an unreported rate. The overall quality of clinical anaesthesia study protocols registered in the ChiCTR was poor. A mandatory protocol upload and self-check based on the SPIRIT statement during the trial registration process may improve protocol quality in the future.
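The per-protocol scoring rule in this abstract can be sketched as below. The scores are invented, `None` stands in for N/A by assumption, and the function name is hypothetical. The abstract does not say precisely which proportion is compared with the 50% cutoff; this sketch assumes it is the adequately reported proportion.

```python
def protocol_summary(scores):
    """Summarise one protocol's SPIRIT sub-item scores.

    scores: list of 0, 1, 2 or None (None represents N/A, an assumption).
    Returns (adequately reported proportion, unreported proportion).
    """
    n = len(scores)
    adequate = sum(1 for s in scores if s == 2 or s is None) / n
    unreported = sum(1 for s in scores if s == 0) / n
    return adequate, unreported

# Hypothetical protocol scored on eight sub-items.
scores = [2, 2, None, 1, 0, 0, 2, 1]
adequate, unreported = protocol_summary(scores)

# Assumed reading of the cutoff: strictly more than 50% adequately reported.
high_quality = adequate > 0.5
```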
Single- versus Double-Scoring of Trend Responses in Trend Score Equating with Constructed-Response Tests. Research Report. ETS RR-10-12

ERIC Educational Resources Information Center

Tan, Xuan; Ricker, Kathryn L.; Puhan, Gautam

2010-01-01

This study examines the differences in equating outcomes between two trend score equating designs resulting from two different scoring strategies for trend scoring when operational constructed-response (CR) items are double-scored--the single group (SG) design, where each trend CR item is double-scored, and the nonequivalent groups with anchor…
Using PROMIS Pain Interference Items to Improve Quality Measurement in Inpatient Rehabilitation Facilities.

PubMed

Schalet, Benjamin D; Kallen, Michael A; Heinemann, Allen W; Deutsch, Anne; Cook, Karon F; Foster, Linda; Cella, David

2018-05-24

To evaluate the Patient-Reported Outcomes Measurement Information System (PROMIS) pain interference items for use in a quality measure and to compare the resulting quality score, along with internal reliability and validity, to a similar item set in the Minimum Data Set Version 3.0 (MDS). Cross-sectional, observational study. One freestanding inpatient rehabilitation facility (IRF) and one large hospital-based IRF. Patients with neurologic disorders. Of 1055 consecutive admissions, 26% were excluded based on clinician-determined cognitive impairment or emotional distress. Of the remainder, 50% consented and completed the survey near the end of their IRF stay (N = 391). Of these, more than half (57%) reported pain over the last day (n = 224). Psychometric statistics and quality scores were computed from a 55-question survey, including the MDS and PROMIS pain interference items. Estimates for internal reliability were higher for the PROMIS 2-item scale compared to the MDS: Cronbach α (0.86 vs 0.48) and interitem correlations (0.75 vs 0.31). The PROMIS-2 items were better able to detect differences in patients with mild and severe pain intensity (Cohen d = 1.57) relative to the corresponding MDS items (Cohen d = 0.81). Two quality scores based on the PROMIS-2 items, reflecting low and high levels of pain interference, showed 46% or 12% of patients meeting these thresholds. This compared to a 30% rate when patients were classified by the MDS as experiencing pain interference. PROMIS pain interference items appear to be more internally consistent than similar MDS items. The graded PROMIS items permit the creation of multiple quality scores, showing predictable overlap with corresponding MDS quality scores. Because PROMIS items provide finer distinctions, they allow greater latitude in reporting quality scores. We recommend further study of pain interference scores across IRFs to improve their reliability and validity. 
Copyright © 2018 AMDA – The Society for Post-Acute and Long-Term Care Medicine. Published by Elsevier Inc. All rights reserved.
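The reported reliabilities and interitem correlations can be related through the standardized form of Cronbach's α, α = k·r / (1 + (k − 1)·r), where k is the number of items and r the mean interitem correlation. The sketch below is a rough consistency check only. It assumes that both item sets contain two items (the abstract states this only for the PROMIS scale) and that standardized α approximates the raw α the authors report.

```python
def standardized_alpha(k: int, mean_r: float) -> float:
    """Standardized Cronbach's alpha from item count k and mean interitem correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Interitem correlations taken from the abstract; k = 2 is an assumption for the MDS set.
promis_alpha = standardized_alpha(2, 0.75)  # about 0.857
mds_alpha = standardized_alpha(2, 0.31)     # about 0.473
```

Under these assumptions the PROMIS value rounds to the reported 0.86, while the MDS value (about 0.47) is close to, but not identical with, the reported 0.48. A small gap of this kind would be expected from rounding of r or from unequal item variances.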
Item response analysis of the Positive and Negative Syndrome Scale

PubMed Central

Santor, Darcy A; Ascher-Svanum, Haya; Lindenmayer, Jean-Pierre; Obenchain, Robert L

2007-01-01

Background Statistical models based on item response theory were used to examine (a) the performance of individual Positive and Negative Syndrome Scale (PANSS) items and their options, (b) the effectiveness of various subscales to discriminate among individual differences in symptom severity, and (c) the appropriateness of cutoff scores recently recommended by Andreasen and her colleagues (2005) to establish symptom remission. Methods Option characteristic curves were estimated using a nonparametric item response model to examine the probability of endorsing each of 7 options within each of 30 PANSS items as a function of standardized, overall symptom severity. Our data were baseline PANSS scores from 9205 patients with schizophrenia or schizoaffective disorder who were enrolled between 1995 and 2003 in either a large, naturalistic, observational study or else in 1 of 12 randomized, double-blind, clinical trials comparing olanzapine to other antipsychotic drugs. Results Our analyses show that the majority of items forming the Positive and Negative subscales of the PANSS perform very well. We also identified key areas for improvement or revision in items and options within the General Psychopathology subscale. The Positive and Negative subscale scores are not only more discriminating of individual differences in symptom severity than the General Psychopathology subscale score, but are also more efficient on average than the 30-item total score. Of the 8 items recently recommended to establish symptom remission, 1 performed markedly differently from the 7 others and should either be deleted or rescored requiring that patients achieve a lower score of 2 (rather than 3) to signal remission. Conclusion This first item response analysis of the PANSS supports its sound psychometric properties; most PANSS items were either very good or good at assessing overall severity of illness.
These analyses did identify some items which might be further improved for measuring individual severity differences or for defining remission thresholds. Findings also suggest that the Positive and Negative subscales are more sensitive to change than the PANSS total score and, thus, may constitute a "mini PANSS" that may be more reliable, require shorter administration and training time, and possibly reduce sample sizes needed for future research. PMID:18005449
Disparity between General Symptom Relief and Remission Criteria in the Positive and Negative Syndrome Scale (PANSS): A Post-treatment Bifactor Item Response Theory Model.

PubMed

Anderson, Ariana E; Reise, Steven P; Marder, Stephen R; Mansolf, Maxwell; Han, Carol; Bilder, Robert M

2017-12-01

Objective: Total scale scores derived by summing ratings from the 30-item PANSS are commonly used in clinical trial research to measure overall symptom severity, and percentage reductions in the total scores are sometimes used to document the efficacy of treatment. Acknowledging that some patients may have substantial changes in PANSS total scores but still be sufficiently symptomatic to warrant diagnosis, ratings on a subset of 8 items, referred to here as the "Remission set," are sometimes used to determine if patients' symptoms no longer satisfy diagnostic criteria. An unanswered question remains: is the goal of treatment better conceptualized as reduction in overall symptom severity, or reduction in symptoms below the threshold for diagnosis? We evaluated the psychometric properties of PANSS total scores, to assess whether having low symptom severity post-treatment is equivalent to attaining Remission. Design: We applied a bifactor item response theory (IRT) model to post-treatment PANSS ratings of 3,647 subjects diagnosed with schizophrenia assessed at the termination of 11 clinical trials. The bifactor model specified one general dimension to reflect overall symptom severity, and five domain-specific dimensions. We assessed how PANSS item discrimination and information parameters varied across the range of overall symptom severity (θ), with a special focus on low levels of symptoms (i.e., θ<-1), which we refer to as "Relief" from symptoms. A score of θ=-1 corresponds to an expected PANSS item score of 1.83, a rating between "Absent" and "Minimal" for a PANSS symptom. Results: The application of the bifactor IRT model revealed: (1) 88% of total score variation was attributable to variation in general symptom severity, and only 8% reflected secondary domain factors. 
This implies that a general factor may provide a good indicator of symptom severity, and that interpretation is not overly complicated by multidimensionality; (2) Post-treatment, 534 individuals (about 15% of the whole sample) scored in the "Relief" range of general symptom severity, but more than twice that number (n = 1351) satisfied Remission criteria (37%). 2 in 3 Remitted patients had scores that were not in a low symptom range (corresponding to Absent or Minimal item scores); (3) PANSS items vary greatly in their ability to measure the general symptom severity dimension; while many items are highly discriminating and relatively "pure" indicators of general symptom severity (delusions, conceptual disorganization), others are better indicators of specific dimensions (blunted affect, depression). The utility of a given PANSS item for assessing a patient depended on the illness level of the patient. Conclusion: Satisfying conventional Remission criteria was not strongly associated with low levels of symptoms. The items providing the most information for patients in the symptom Relief range were Delusions, Preoccupation, Suspiciousness Persecution, Unusual Thought Content, Conceptual Disorganization, Stereotyped Thinking, Active Social Avoidance, and Lack of Judgment and Insight. Lower scores on these items (item scores ≤2) were strongly associated with having a low latent trait θ or experiencing overall symptom relief. The inter-rater agreement between Remission and Relief subjects suggested that these criteria identified different subsets of patients. Alternative subsets of items may offer better indicators of general symptom severity and provide better discrimination (and lower standard errors) for scaling individuals and judging symptom relief, where the "best" subset of items ultimately depends on the illness range and treatment phase being evaluated.
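The proportions quoted in the Results can be checked directly from the counts given in the abstract (3,647 subjects, 534 in the "Relief" range, 1,351 meeting Remission criteria). The variable names below are illustrative.

```python
total_subjects = 3647   # post-treatment sample size from the abstract
relief_count = 534      # subjects with theta < -1 ("Relief")
remission_count = 1351  # subjects satisfying Remission criteria

relief_share = relief_count / total_subjects
remission_share = remission_count / total_subjects
remission_to_relief = remission_count / relief_count

print(f"Relief: {relief_share:.1%}")
print(f"Remission: {remission_share:.1%}")
print(f"Remission / Relief ratio: {remission_to_relief:.2f}")
```

The shares come out at roughly 14.6% and 37.0%, and the ratio at about 2.5, which agrees with "about 15%", "37%", and "more than twice that number" in the text.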
Measurement Equivalence of the Patient Reported Outcomes Measurement Information System® (PROMIS®) Applied Cognition – General Concerns, Short Forms in Ethnically Diverse Groups

PubMed Central

Fieo, Robert; Ocepek-Welikson, Katja; Kleinman, Marjorie; Eimicke, Joseph P.; Crane, Paul K.; Cella, David; Teresi, Jeanne A.

2017-01-01

Aims The goals of these analyses were to examine the psychometric properties and measurement equivalence of a self-reported cognition measure, the Patient Reported Outcome Measurement Information System® (PROMIS®) Applied Cognition – General Concerns short form. These items are also found in the PROMIS Cognitive Function (version 2) item bank. This scale consists of eight items related to subjective cognitive concerns. Differential item functioning (DIF) analyses of gender, education, race, age, and (Spanish) language were performed using an ethnically diverse sample (n = 5,477) of individuals with cancer. This is the first analysis examining DIF in this item set across ethnic and racial groups. Methods DIF hypotheses were derived by asking content experts to indicate whether they posited DIF for each item and to specify the direction. The principal DIF analytic model was item response theory (IRT) using the graded response model for polytomous data, with accompanying Wald tests and measures of magnitude. Sensitivity analyses were conducted using ordinal logistic regression (OLR) with a latent conditioning variable. IRT-based reliability, precision and information indices were estimated. Results DIF was identified consistently only for the item, brain not working as well as usual. After correction for multiple comparisons, this item showed significant DIF for both the primary and sensitivity analyses. Black respondents and Hispanics in comparison to White non-Hispanic respondents evidenced a lower conditional probability of endorsing the item, brain not working as well as usual. The same pattern was observed for the education grouping variable: as compared to those with a graduate degree, conditioning on overall level of subjective cognitive concerns, those with less than high school education also had a lower probability of endorsing this item. 
DIF was also observed for age for two items after correction for multiple comparisons for both the IRT and OLR-based models: “I have had to work really hard to pay attention or I would make a mistake” and “I have had trouble shifting back and forth between different activities that require thinking”. For both items, conditional on cognitive complaints, older respondents had a higher likelihood than younger respondents of endorsing the item in the cognitive complaints direction. The magnitude and impact of DIF was minimal. The scale showed high precision along much of the subjective cognitive concerns continuum; the overall IRT-based reliability estimate for the total sample was 0.88 and the estimates for subgroups ranged from 0.87 to 0.92. Conclusion Little DIF of high magnitude or impact was observed in the PROMIS Applied Cognition – General Concerns short form item set. One item, “It has seemed like my brain was not working as well as usual” might be singled out for further study. However, in general the short form item set was highly reliable, informative, and invariant across differing race/ethnic, educational, age, gender, and language groups. PMID:28523238
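For readers unfamiliar with the graded response model named in the Methods, the sketch below computes category probabilities for one polytomous item. It is a generic textbook form of the model (Samejima's logistic version), not the authors' implementation. The discrimination `a` and the threshold values are hypothetical, and the two threshold sets only illustrate what uniform DIF looks like.

```python
import math
from typing import List, Sequence


def grm_category_probs(theta: float, a: float, thresholds: Sequence[float]) -> List[float]:
    """Category probabilities under the graded response model.

    thresholds are the ordered boundary locations b_1 < ... < b_{K-1};
    the result has K entries that sum to 1.
    """
    def logistic(x: float) -> float:
        return 1.0 / (1.0 + math.exp(-x))

    # Cumulative probabilities P(X >= k), padded with 1 (lowest) and 0 (beyond highest).
    cumulative = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Hypothetical parameters: same trait level, thresholds shifted for a focal group.
reference = grm_category_probs(theta=0.0, a=1.5, thresholds=[-1.0, 0.0, 1.0])
focal = grm_category_probs(theta=0.0, a=1.5, thresholds=[-0.5, 0.5, 1.5])
```

With thresholds shifted upward for the focal group, respondents at the same trait level have a lower probability of the higher categories, which is the pattern the abstract describes as a lower conditional probability of endorsement.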
Measurement Equivalence of the Patient Reported Outcomes Measurement Information System® (PROMIS®) Applied Cognition - General Concerns, Short Forms in Ethnically Diverse Groups.

PubMed

Fieo, Robert; Ocepek-Welikson, Katja; Kleinman, Marjorie; Eimicke, Joseph P; Crane, Paul K; Cella, David; Teresi, Jeanne A

2016-01-01

The goals of these analyses were to examine the psychometric properties and measurement equivalence of a self-reported cognition measure, the Patient Reported Outcome Measurement Information System® (PROMIS®) Applied Cognition - General Concerns short form. These items are also found in the PROMIS Cognitive Function (version 2) item bank. This scale consists of eight items related to subjective cognitive concerns. Differential item functioning (DIF) analyses of gender, education, race, age, and (Spanish) language were performed using an ethnically diverse sample (n = 5,477) of individuals with cancer. This is the first analysis examining DIF in this item set across ethnic and racial groups. DIF hypotheses were derived by asking content experts to indicate whether they posited DIF for each item and to specify the direction. The principal DIF analytic model was item response theory (IRT) using the graded response model for polytomous data, with accompanying Wald tests and measures of magnitude. Sensitivity analyses were conducted using ordinal logistic regression (OLR) with a latent conditioning variable. IRT-based reliability, precision and information indices were estimated. DIF was identified consistently only for the item, brain not working as well as usual. After correction for multiple comparisons, this item showed significant DIF for both the primary and sensitivity analyses. Black respondents and Hispanics in comparison to White non-Hispanic respondents evidenced a lower conditional probability of endorsing the item, brain not working as well as usual. The same pattern was observed for the education grouping variable: as compared to those with a graduate degree, conditioning on overall level of subjective cognitive concerns, those with less than high school education also had a lower probability of endorsing this item.
DIF was also observed for age for two items after correction for multiple comparisons for both the IRT and OLR-based models: "I have had to work really hard to pay attention or I would make a mistake" and "I have had trouble shifting back and forth between different activities that require thinking". For both items, conditional on cognitive complaints, older respondents had a higher likelihood than younger respondents of endorsing the item in the cognitive complaints direction. The magnitude and impact of DIF was minimal. The scale showed high precision along much of the subjective cognitive concerns continuum; the overall IRT-based reliability estimate for the total sample was 0.88 and the estimates for subgroups ranged from 0.87 to 0.92. Little DIF of high magnitude or impact was observed in the PROMIS Applied Cognition - General Concerns short form item set. One item, "It has seemed like my brain was not working as well as usual" might be singled out for further study. However, in general the short form item set was highly reliable, informative, and invariant across differing race/ethnic, educational, age, gender, and language groups.
Test Score Equating Using Discrete Anchor Items versus Passage-Based Anchor Items: A Case Study Using "SAT"® Data. Research Report. ETS RR-14-14

ERIC Educational Resources Information Center

Liu, Jinghua; Zu, Jiyun; Curley, Edward; Carey, Jill

2014-01-01

The purpose of this study is to investigate the impact of discrete anchor items versus passage-based anchor items on observed score equating using empirical data. This study compares an "SAT"® critical reading anchor that contains more discrete items proportionally, compared to the total tests to be equated, to another anchor that…

Development of a new Rasch-based scoring algorithm for the National Eye Institute Visual Functioning Questionnaire to improve its interpretability.

PubMed

Petrillo, Jennifer; Bressler, Neil M; Lamoureux, Ecosse; Ferreira, Alberto; Cano, Stefan

2017-08-14

The NEI VFQ-25 has undergone psychometric evaluation in patients with varying ocular conditions and the general population. However, important limitations which may affect the interpretation of clinical trial results have been previously identified, such as concerns with reliability and validity. The purpose of this study was to evaluate the National Eye Institute Visual Functioning Questionnaire (NEI VFQ-25) and make recommendations for a revised scoring structure, with a view to improving its psychometric performance and interpretability. Rasch Measurement Theory analyses were conducted in two stages using pooled baseline NEI VFQ-25 data for 2487 participants with retinal diseases enrolled in six clinical trials. In stage 1, we examined: scale-to-sample targeting; thresholds for item response options; item fit statistics; stability; local dependence; and reliability. In stage 2, a post-hoc revision of the scoring structure (NEI VFQ-28-R) was created and psychometrically re-evaluated. In stage 1, we found that the NEI VFQ-25 was mis-targeted to the sample, and had disordered response thresholds (15/25 items) and mis-fitting items (8/25 items). However, items appeared to be stable (differential item functioning for three items), have minimal item dependency (one pair of items) and good reliability (person-separation index, 0.93). In stage 2, the modified Rasch-scored NEI VFQ-28-R was assessed. It comprised two broad domains: Activity Limitation (19 items) and Socio-Emotional Functioning (nine items). The NEI VFQ-28-R demonstrated improved performance with fewer disordered response thresholds (no items), less item misfit (three items) and improved population targeting (reduced ceiling effect) compared with the NEI VFQ-25.
Compared with the original version, the proposed NEI VFQ-28-R, with Rasch-based scoring and a two-domain structure, appears to offer improved psychometric performance and interpretability of the vision-related quality of life scale for the population analysed.
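"Disordered response thresholds" has a simple operational meaning in a Rasch-family polytomous model: the estimated thresholds between adjacent response categories do not increase with the category order. The sketch below shows a generic partial credit model and an ordering check. It is illustrative only, the threshold values are hypothetical, and the abstract does not state which polytomous Rasch parameterization the authors used.

```python
import math
from typing import List, Sequence


def pcm_category_probs(theta: float, thresholds: Sequence[float]) -> List[float]:
    """Category probabilities under a partial credit model.

    thresholds[j] is the location where categories j and j+1 are equally likely.
    """
    # Cumulative sums of (theta - threshold); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def has_disordered_thresholds(thresholds: Sequence[float]) -> bool:
    """True when any threshold fails to exceed the one before it."""
    return any(later <= earlier for earlier, later in zip(thresholds, thresholds[1:]))


# Hypothetical threshold estimates for two items.
ordered_item = [-1.0, 0.2, 1.5]
disordered_item = [-1.0, 0.8, 0.3]
```

For the hypothetical `disordered_item`, the third threshold lies below the second, so the check flags it. In practice this is often taken as a sign that respondents are not using the middle categories as intended, which is one motivation for collapsing response options in a rescoring.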
Improving Measurement Efficiency of the Inner EAR Scale with Item Response Theory.

PubMed

Jessen, Annika; Ho, Andrew D; Corrales, C Eduardo; Yueh, Bevan; Shin, Jennifer J

2018-02-01

Objectives (1) To assess the 11-item Inner Effectiveness of Auditory Rehabilitation (Inner EAR) instrument with item response theory (IRT). (2) To determine whether the underlying latent ability could also be accurately represented by a subset of the items for use in high-volume clinical scenarios. (3) To determine whether the Inner EAR instrument correlates with pure tone thresholds and word recognition scores. Design IRT evaluation of prospective cohort data. Setting Tertiary care academic ambulatory otolaryngology clinic. Subjects and Methods Modern psychometric methods, including factor analysis and IRT, were used to assess unidimensionality and item properties. Regression methods were used to assess prediction of word recognition and pure tone audiometry scores. Results The Inner EAR scale is unidimensional, and items varied in their location and information. Information parameter estimates ranged from 1.63 to 4.52, with higher values indicating more useful items. The IRT model provided a basis for identifying 2 sets of items with relatively lower information parameters. Item information functions demonstrated which items added insubstantial value over and above other items and were removed in stages, creating an 8- and 3-item Inner EAR scale for more efficient assessment. The 8-item version accurately reflected the underlying construct. All versions correlated moderately with word recognition scores and pure tone averages. Conclusion The 11-, 8-, and 3-item versions of the Inner EAR scale have strong psychometric properties, and there is correlational validity evidence for the observed scores. Modern psychometric methods can help streamline care delivery by maximizing relevant information per item administered.
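To illustrate why a larger "information parameter" marks a more useful item, the sketch below evaluates the item information function of a two-parameter logistic (2PL) item, I(θ) = a²·P(θ)·(1 − P(θ)). This is a simplification and an assumption: the abstract does not name the IRT model used, the Inner EAR items may be polytomous, and the reported range 1.63 to 4.52 is treated here as if it were a range of discrimination values a. The location b = 0 is hypothetical.

```python
import math


def item_information_2pl(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


# Information peaks at theta == b, where it equals a**2 / 4.
low_peak = item_information_2pl(0.0, 1.63, 0.0)   # about 0.66
high_peak = item_information_2pl(0.0, 4.52, 0.0)  # about 5.11
```

Under this simplified model, the item at the upper end of the reported range carries roughly eight times the peak information of the item at the lower end, which suggests why low-information items can be dropped with little loss of precision.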
A Comparison of the Approaches of Generalizability Theory and Item Response Theory in Estimating the Reliability of Test Scores for Testlet-Composed Tests

ERIC Educational Resources Information Center

Lee, Guemin; Park, In-Yong

2012-01-01

Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several…
Automatic Short Essay Scoring Using Natural Language Processing to Extract Semantic Information in the Form of Propositions. CRESST Report 831

ERIC Educational Resources Information Center

Kerr, Deirdre; Mousavi, Hamid; Iseli, Markus R.

2013-01-01

The Common Core assessments emphasize short essay constructed-response items over multiple-choice items because they are more precise measures of understanding. However, such items are too costly and time consuming to be used in national assessments unless a way to score them automatically can be found. Current automatic essay-scoring techniques…
Defining Malaysian Knowledge Society: Results from the Delphi Technique

NASA Astrophysics Data System (ADS)

Hamid, Norsiah Abdul; Zaman, Halimah Badioze

This paper outlines the findings of research where the central idea is to define the term Knowledge Society (KS) in the Malaysian context. The research focuses on three important dimensions, namely knowledge, ICT and human capital. This study adopts a modified Delphi technique to seek the important dimensions that can contribute to the development of Malaysia's KS. The Delphi technique involved ten experts in a five-round iterative and controlled feedback procedure to obtain consensus on the important dimensions and to verify the proposed definition of KS. The findings show that all three dimensions proposed initially scored high and moderate consensus. Round One (R1) proposed an initial definition of KS and required comments and inputs from the panel. These inputs were then used to develop items for an R2 questionnaire. In R2, 56 out of 73 items scored high consensus and in R3, 63 out of 90 items scored high. R4 was conducted to re-rate the new items, in which 8 out of 17 items scored high. Other items scored moderate consensus and no item scored low or no consensus in all rounds. The final round (R5) was employed to verify the final definition of KS. Findings and discovery of this study are significant to the definition of KS and the development of a framework in the Malaysian context.
Pattern analysis of total item score and item response of the Kessler Screening Scale for Psychological Distress (K6) in a nationally representative sample of US adults

PubMed Central

Kawasaki, Yohei; Ide, Kazuki; Akutagawa, Maiko; Yamada, Hiroshi; Yutaka, Ono; Furukawa, Toshiaki A.

2017-01-01

Background Several recent studies have shown that total scores on depressive symptom measures in a general population approximate an exponential pattern except for the lower end of the distribution. Furthermore, we confirmed that the exponential pattern is present for the individual item responses on the Center for Epidemiologic Studies Depression Scale (CES-D). To confirm the reproducibility of such findings, we investigated the total score distribution and item responses of the Kessler Screening Scale for Psychological Distress (K6) in a nationally representative study. Methods Data were drawn from the National Survey of Midlife Development in the United States (MIDUS), which comprises four subsamples: (1) a national random digit dialing (RDD) sample, (2) oversamples from five metropolitan areas, (3) siblings of individuals from the RDD sample, and (4) a national RDD sample of twin pairs. K6 items are scored using a 5-point scale: “none of the time,” “a little of the time,” “some of the time,” “most of the time,” and “all of the time.” The pattern of total score distribution and item responses were analyzed using graphical analysis and an exponential regression model. Results The total score distributions of the four subsamples exhibited an exponential pattern with similar rate parameters. The item responses of the K6 approximated a linear pattern from “a little of the time” to “all of the time” on log-normal scales, while the “none of the time” response was not related to this exponential pattern. Discussion The total score distribution and item responses of the K6 showed exponential patterns, consistent with other depressive symptom scales. PMID:28289560
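An exponential pattern of the kind described here appears as a straight line when frequencies are plotted on a logarithmic axis, so the rate parameter can be estimated by ordinary least squares on the log of the frequencies. The sketch below shows that idea on synthetic data generated inside the example. It does not use MIDUS data, and the abstract does not specify the authors' exact fitting procedure.

```python
import math
from typing import Sequence, Tuple


def fit_exponential(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = c * exp(-rate * x) by least squares on log(y); return (c, rate)."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_x
    # A decaying exponential has a negative slope on the log scale.
    return math.exp(intercept), -slope


# Synthetic frequencies for total scores 1..10 (illustrative, not survey data).
scores = list(range(1, 11))
frequencies = [1000.0 * math.exp(-0.3 * s) for s in scores]
scale, rate = fit_exponential(scores, frequencies)
```

On this noise-free synthetic input the fit recovers the generating values (scale near 1000, rate near 0.3). Comparing fitted rates across subsamples corresponds to the "similar rate parameters" statement in the Results.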
Validation of Automated Scoring of Science Assessments

ERIC Educational Resources Information Center

Liu, Ou Lydia; Rios, Joseph A.; Heilman, Michael; Gerard, Libby; Linn, Marcia C.

2016-01-01

Constructed response items can both measure the coherence of student ideas and serve as reflective experiences to strengthen instruction. We report on new automated scoring technologies that can reduce the cost and complexity of scoring constructed-response items. This study explored the accuracy of c-rater-ML, an automated scoring engine…
Secondary Psychometric Examination of the Dimensional Obsessive-Compulsive Scale: Classical Testing, Item Response Theory, and Differential Item Functioning.

PubMed

Thibodeau, Michel A; Leonard, Rachel C; Abramowitz, Jonathan S; Riemann, Bradley C

2015-12-01

The Dimensional Obsessive-Compulsive Scale (DOCS) is a promising measure of obsessive-compulsive disorder (OCD) symptoms but has received minimal psychometric attention. We evaluated the utility and reliability of DOCS scores. The study included 832 students and 300 patients with OCD. Confirmatory factor analysis supported the originally proposed four-factor structure. DOCS total and subscale scores exhibited good to excellent internal consistency in both samples (α = .82 to α = .96). Patient DOCS total scores reduced substantially during treatment (t = 16.01, d = 1.02). DOCS total scores discriminated between students and patients (sensitivity = 0.76, 1 - specificity = 0.23). The measure did not exhibit gender-based differential item functioning as tested by Mantel-Haenszel chi-square tests. Expected response options for each item were plotted as a function of item response theory and demonstrated that DOCS scores incrementally discriminate OCD symptoms ranging from low to extremely high severity. Incremental differences in DOCS scores appear to represent unbiased and reliable differences in true OCD symptom severity. © The Author(s) 2014.
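The discrimination figures in this abstract (sensitivity and 1 − specificity) are simple ratios from a two-by-two classification table. The sketch below shows the computation. The counts are hypothetical, chosen only to illustrate the formulas, and are not taken from the study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of actual cases (e.g. patients) scoring above the cutoff."""
    return true_pos / (true_pos + false_neg)


def false_positive_rate(false_pos: int, true_neg: int) -> float:
    """Share of non-cases (e.g. students) scoring above the cutoff; equals 1 - specificity."""
    return false_pos / (false_pos + true_neg)


# Hypothetical counts per 100 cases and 100 non-cases (illustrative only).
sens = sensitivity(true_pos=76, false_neg=24)
fpr = false_positive_rate(false_pos=23, true_neg=77)
specificity = 1.0 - fpr
```

A cutoff with sensitivity 0.76 and a false positive rate of 0.23 therefore identifies about three in four cases while misclassifying a little under one in four non-cases.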
An Analysis of Cross Racial Identity Scale Scores Using Classical Test Theory and Rasch Item Response Models

ERIC Educational Resources Information Center

Sussman, Joshua; Beaujean, A. Alexander; Worrell, Frank C.; Watson, Stevie

2013-01-01

Item response models (IRMs) were used to analyze Cross Racial Identity Scale (CRIS) scores. Rasch analysis scores were compared with classical test theory (CTT) scores. The partial credit model demonstrated a high goodness of fit and correlations between Rasch and CTT scores ranged from 0.91 to 0.99. CRIS scores are supported by both methods.…
Disparity between General Symptom Relief and Remission Criteria in the Positive and Negative Syndrome Scale (PANSS)

PubMed Central

Reise, Steven P.; Marder, Stephen R.; Mansolf, Maxwell; Han, Carol; Bilder, Robert M.

2017-01-01

Objective: Total scale scores derived by summing ratings from the 30-item PANSS are commonly used in clinical trial research to measure overall symptom severity, and percentage reductions in the total scores are sometimes used to document the efficacy of treatment. Acknowledging that some patients may have substantial changes in PANSS total scores but still be sufficiently symptomatic to warrant diagnosis, ratings on a subset of 8 items, referred to here as the “Remission set,” are sometimes used to determine if patients’ symptoms no longer satisfy diagnostic criteria. An unanswered question remains: is the goal of treatment better conceptualized as reduction in overall symptom severity, or reduction in symptoms below the threshold for diagnosis? We evaluated the psychometric properties of PANSS total scores, to assess whether having low symptom severity post-treatment is equivalent to attaining Remission. Design: We applied a bifactor item response theory (IRT) model to post-treatment PANSS ratings of 3,647 subjects diagnosed with schizophrenia assessed at the termination of 11 clinical trials. The bifactor model specified one general dimension to reflect overall symptom severity, and five domain-specific dimensions. We assessed how PANSS item discrimination and information parameters varied across the range of overall symptom severity (θ), with a special focus on low levels of symptoms (i.e., θ<-1), which we refer to as “Relief” from symptoms. A score of θ=-1 corresponds to an expected PANSS item score of 1.83, a rating between “Absent” and “Minimal” for a PANSS symptom. Results: The application of the bifactor IRT model revealed: (1) 88% of total score variation was attributable to variation in general symptom severity, and only 8% reflected secondary domain factors. 
This implies that a general factor may provide a good indicator of symptom severity, and that interpretation is not overly complicated by multidimensionality; (2) Post-treatment, 534 individuals (about 15% of the whole sample) scored in the “Relief” range of general symptom severity, but more than twice that number (n = 1351) satisfied Remission criteria (37%). 2 in 3 Remitted patients had scores that were not in a low symptom range (corresponding to Absent or Minimal item scores); (3) PANSS items vary greatly in their ability to measure the general symptom severity dimension; while many items are highly discriminating and relatively “pure” indicators of general symptom severity (delusions, conceptual disorganization), others are better indicators of specific dimensions (blunted affect, depression). The utility of a given PANSS item for assessing a patient depended on the illness level of the patient. Conclusion: Satisfying conventional Remission criteria was not strongly associated with low levels of symptoms. The items providing the most information for patients in the symptom Relief range were Delusions, Preoccupation, Suspiciousness Persecution, Unusual Thought Content, Conceptual Disorganization, Stereotyped Thinking, Active Social Avoidance, and Lack of Judgment and Insight. Lower scores on these items (item scores ≤2) were strongly associated with having a low latent trait θ or experiencing overall symptom relief. The inter-rater agreement between Remission and Relief subjects suggested that these criteria identified different subsets of patients. Alternative subsets of items may offer better indicators of general symptom severity and provide better discrimination (and lower standard errors) for scaling individuals and judging symptom relief, where the “best” subset of items ultimately depends on the illness range and treatment phase being evaluated. PMID:29410936
Tracking functional status across the spinal cord injury lifespan: linking pediatric and adult patient-reported outcome scores.

PubMed

Tian, Feng; Ni, Pengsheng; Mulcahey, M J; Hambleton, Ronald K; Tulsky, David; Haley, Stephen M; Jette, Alan M

2014-11-01

To use item response theory (IRT) methods to link scores from 2 recently developed contemporary functional outcome measures, the adult Spinal Cord Injury-Functional Index (SCI-FI) and the Pedi SCI (both the parent version and the child version). Secondary data analysis of the physical functioning items of the adult SCI-FI and the Pedi SCI instruments. We used a nonequivalent group design with items common to both instruments and the Stocking-Lord method for the linking. Linking was conducted so that the adult SCI-FI and Pedi SCI scaled scores could be compared. Community. This study included a total sample of 1558 participants. Pedi SCI items were administered to a sample of children (n=381) with SCI aged 8 to 21 years, and of parents/caregivers (n=322) of children with SCI aged 4 to 21 years. Adult SCI-FI items were administered to a sample of adults (n=855) with SCI aged 18 to 92 years. Not applicable. Five scales common to both instruments were included in the analysis: Wheelchair, Daily Routine/Self-care, Daily Routine/Fine Motor, Ambulation, and General Mobility functioning. Confirmatory factor analysis and exploratory factor analysis results indicated that the 5 scales are unidimensional. A graded response model was used to calibrate the items. Misfitting items were identified and removed from the item banks. Items that function differently between the adult and child samples (ie, exhibit differential item functioning) were identified and removed from the common items used for linking. Domain scores from the Pedi SCI instruments were transformed onto the adult SCI-FI metric. This IRT linking allowed estimation of adult SCI-FI scale scores based on Pedi SCI scale scores and vice versa; therefore, it provides clinicians with a means of tracking long-term functional data for children with an SCI across their entire lifespan. Copyright © 2014 American Congress of Rehabilitation Medicine. Published by Elsevier Inc. All rights reserved.
Evaluation and performance of a newly developed patient-reported outcome instrument for diarrhea-predominant irritable bowel syndrome in a clinical study population

PubMed Central

Delgado-Herrera, Leticia; Lasch, Kathryn; Zeiher, Bernhardt; Lembo, Anthony J.; Drossman, Douglas A.; Banderas, Benjamin; Rosa, Kathleen; Lademacher, Christopher; Arbuckle, Rob

2017-01-01

Background: To evaluate the psychometric properties of the newly developed seven-item Irritable Bowel Syndrome – Diarrhea predominant (IBS-D) Daily Symptom Diary and four-item Event Log using phase II clinical trial safety and efficacy data in patients with IBS-D. This instrument measures diarrhea (stool frequency and stool consistency), abdominal pain related to IBS-D (stomach pain, abdominal pain, abdominal cramps), immediate need to have a bowel movement (immediate need and accident occurrence), bloating, pressure, gas, and incomplete evacuation. Methods: Psychometric properties and responsiveness of the instrument were evaluated in a clinical trial population [ClinicalTrials.gov identifier: NCT01494233]. Results: A total of 434 patients were included in the analyses. Significant differences were found among severity groups (p < 0.01) defined by IBS Patient Global Impression of Severity (PGI-S) and IBS Patient Global Impression of Change (PGI-C). Severity scores for each Diary and Event Log item score and five-item, four-item, and three-item summary scores were calculated. Between-group differences in changes over time were significant for all summary scores in groups stratified by changes in PGI-S (p < 0.05), two of six Diary items, and three of four Event Log items; a one-grade change in PGI-S was considered a meaningful difference with mean change scores on all Diary items −0.13 to −0.86 [standard deviation (SD) 0.79–1.39]. Similarly, for patients who reported being ‘slightly improved’ (considered a clinically meaningful difference) on the PGI-C, mean change scores on Diary items ranged from −0.45 to −1.55 (SD 0.69–1.39). All estimates of clinically important change for each item and all summary scores were small and should be considered preliminary. These results are aligned with the previous standalone psychometric study regarding reliability and validity tests. 
Conclusions: These analyses provide evidence of the psychometric properties of the IBS-D Daily Symptom Diary and Event Log in a clinical trial population. PMID:28932269
Psychometric properties of a revised version of the Assisting Hand Assessment (Kids-AHA 5.0).

PubMed

Holmefur, Marie M; Krumlinde-Sundholm, Lena

2016-06-01

The aim of this study was to scrutinize the Assisting Hand Assessment (AHA) version 4.4 for possible improvements and to evaluate the psychometric properties regarding internal scale validity and aspects of reliability of a revised version of the AHA. In collaboration with experts, scoring criteria were changed for four items, and one fully new item was constructed. Twenty-two original, one new, and four revised items were scored for 164 assessments of children with unilateral cerebral palsy aged 18 months to 12 years. Rasch measurement analysis was used to evaluate internal scale validity by exploring rating-scale functioning, item and person goodness-of-fit, and principal component analysis. Targeting and scale reliability were also evaluated. After removal of misfitting items, a 20-item scale showed satisfactory goodness-of-fit. Unidimensionality was confirmed by principal component analysis. The rating scale functioned well for the 20 items, and the item difficulty was well suited to the ability level of the sample. The person reliability coefficient was 0.98, indicating high separation ability of the scale. A conversion table of AHA scores between the previous version (4.4) and the new version (5.0) was constructed. The new, 20-item version of the Kids-AHA (version 5.0), demonstrated excellent internal scale validity, suggesting improved responsiveness to changes and shortened scoring time. For comparison of scores from version 4.4 to 5.0, a transformation table is presented. © 2015 Mac Keith Press.
Cross-Cultural Adaptation and Validation of the MPAM-R to Brazilian Portuguese and Proposal of a New Method to Calculate Factor Scores

PubMed Central

Albuquerque, Maicon R.; Lopes, Mariana C.; de Paula, Jonas J.; Faria, Larissa O.; Pereira, Eveline T.; da Costa, Varley T.

2017-01-01

In order to understand the reasons that lead individuals to practice physical activity, researchers developed the Motives for Physical Activity Measure-Revised (MPAM-R) scale. In 2010, the MPAM-R was translated into Portuguese and validated; however, the psychometric measures were not acceptable. In addition, factor scores in some sports psychology scales are calculated as the mean of the scores of the items in the factor. Nevertheless, it seems appropriate that items with higher factor loadings, extracted by Factor Analysis, have greater weight in the factor score, and that items with lower factor loadings have less weight. The aims of the present study were to translate and validate the MPAM-R for Brazilian Portuguese and to investigate agreement between two methods used to calculate factor scores. Three hundred volunteers who had been involved in physical activity programs for at least 6 months were recruited. Confirmatory Factor Analysis of the 30 items indicated that the version did not fit the model. After excluding four items, the final model with 26 items showed acceptable model fit measures by Exploratory Factor Analysis, and it conceptually supports the five factors of the original proposal. When the two methods of calculating factor scores were compared, only the “Enjoyment” and “Appearance” factors showed agreement between methods. Thus, the Portuguese version of the MPAM-R can be used in a Brazilian context, and the new proposal for calculating factor scores seems promising. PMID:28293203
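The contrast between the two scoring approaches described in this abstract can be sketched as follows. This is only an illustration: the abstract does not give the authors' formula, so the loading-weighted mean below is an assumed form, and the function names, responses, and loadings are hypothetical.

```python
def mean_score(responses):
    """Conventional factor score: unweighted mean of the item responses."""
    return sum(responses) / len(responses)


def loading_weighted_score(responses, loadings):
    """Assumed weighted variant: each item counts in proportion to its loading."""
    if len(responses) != len(loadings):
        raise ValueError("responses and loadings must have the same length")
    total_weight = sum(loadings)
    return sum(r * w for r, w in zip(responses, loadings)) / total_weight


# Hypothetical respondent: three items belonging to one factor.
responses = [7, 5, 2]
loadings = [0.9, 0.6, 0.3]  # hypothetical factor loadings

print(mean_score(responses))
print(loading_weighted_score(responses, loadings))
```

With these made-up values the weighted score is higher than the plain mean because the highest response falls on the item with the largest loading.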
Can health care providers recognise a fibromyalgia personality?

PubMed

Da Silva, José A P; Jacobs, Johannes W G; Branco, Jaime C; Canaipa, Rita; Gaspar, M Filomena; Griep, Ed N; van Helmond, Toon; Oliveira, Paula J; Zijlstra, Theo J; Geenen, Rinie

2017-01-01

To determine if experienced health care providers (HCPs) can recognise patients with fibromyalgia (FM) based on a limited set of personality items, exploring the existence of a FM personality. From the 240-item NEO-PI-R personality questionnaire, 8 HCPs from two different countries each selected 20 items they considered most discriminative of FM personality. Then, evaluating the scores on these items of 129 female patients with FM and 127 female controls, each HCP rated the probability of FM for each individual on a 0-10 scale. Personality characteristics (domains and facets) of selected items were determined. Scores of patients with FM and controls on the eight 20-item sets, and HCPs' estimates of each individual's probability of FM were analysed for their discriminative value. The eight 20-item sets discriminated for FM, with areas under the receiver operating characteristic curve ranging from 0.71-0.81. The estimated probabilities for FM showed, in general, percentages of correct classifications above 50%, with rising correct percentages for higher estimated probabilities. The most often chosen and discriminatory items were predominantly of the domain neuroticism (all with higher scores in FM), followed by some items of the facet trust (lower scores in FM). HCPs can, based on a limited set of items from a personality questionnaire, distinguish patients with FM from controls with a statistically significant probability. The HCPs' expectation that personality in FM patients is associated with higher levels for aspects of neuroticism (proneness to psychological distress) and lower scores for aspects of trust, proved to be correct.
A quality analysis of clinical anaesthesia study protocols from the Chinese clinical trials registry according to the SPIRIT statement

PubMed Central

Yang, Lei; Chen, Shouming; Yang, Di; Li, Jiajin; Wu, Taixiang; Zuo, Yunxia

2018-01-01

Objective: To learn about the overall quality of clinical anaesthesia study protocols from the Chinese Clinical Trials Registry and to discuss ways to improve study protocol quality. Methods: We defined completeness of each sub-item in SPIRIT as N/A (not applicable) or with a score of 0, 1, or 2. For each protocol, we calculated the proportion of adequately reported items (score = 2 and N/A) and unreported items (score = 0). Protocol quality was determined according to the proportion of reported items, with values >50% indicating high quality. For each sub-item in SPIRIT, we calculated the adequately reported rate (percentage of all protocols with score 2 and N/A on one sub-item) as well as the unreported rate (percentage of all protocols with score 0 on one sub-item). Results: A total of 126 study protocols were available for assessment. Among these, 88.1% were assessed as being of low quality. By comparison, the percentage of low-quality protocols was 88.9% after the publication of the SPIRIT statement. Among the 51 SPIRIT sub-items, 18 sub-items had an unreported rate above 90% while 16 had a higher adequately reported rate than an unreported rate. Conclusions: The overall quality of clinical anaesthesia study protocols registered in the ChiCTR was poor. A mandatory protocol upload and self-check based on the SPIRIT statement during the trial registration process may improve protocol quality in the future. PMID:29872509
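The per-protocol scoring rule in the Methods can be sketched as below. Two points are assumptions rather than statements from the abstract: N/A is encoded as `None`, and the ">50%" quality threshold is applied to the adequately reported proportion. The function names are hypothetical.

```python
def protocol_rates(scores):
    """Return (adequately reported proportion, unreported proportion).

    scores: one entry per SPIRIT sub-item, each 0, 1, 2, or None for N/A.
    """
    n = len(scores)
    adequate = sum(1 for s in scores if s == 2 or s is None)
    unreported = sum(1 for s in scores if s == 0)
    return adequate / n, unreported / n


def is_high_quality(scores, threshold=0.5):
    """Assumed rule: high quality when the adequate proportion exceeds 50%."""
    adequate, _ = protocol_rates(scores)
    return adequate > threshold


# Hypothetical protocol with six sub-items.
example = [2, 2, None, 1, 0, 0]
print(protocol_rates(example))
print(is_high_quality(example))
```

In this made-up case exactly half of the sub-items are adequately reported, so the protocol does not pass a strict ">50%" rule.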
Item Selection and Pre-equating with Empirical Item Characteristic Curves.

ERIC Educational Resources Information Center

Livingston, Samuel A.

An empirical item characteristic curve shows the probability of a correct response as a function of the student's total test score. These curves can be estimated from large-scale pretest data. They enable test developers to select items that discriminate well in the score region where decisions are made. A similar set of curves can be used to…
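An empirical item characteristic curve of the kind described here can be estimated by grouping examinees on total score and taking the proportion answering the item correctly in each group. The sketch below uses hypothetical data and a hypothetical function name, and it omits the smoothing that operational programs may apply.

```python
from collections import defaultdict


def empirical_icc(total_scores, item_correct):
    """Map each total test score to the proportion correct on one item."""
    counts = defaultdict(lambda: [0, 0])  # score -> [n_correct, n_examinees]
    for score, correct in zip(total_scores, item_correct):
        counts[score][0] += int(correct)
        counts[score][1] += 1
    return {score: c / n for score, (c, n) in sorted(counts.items())}


# Hypothetical pretest data: total scores and 0/1 responses to one item.
totals = [1, 1, 2, 2, 2, 3]
correct = [0, 1, 1, 1, 0, 1]
print(empirical_icc(totals, correct))
```

A steep rise in the curve around a given total score indicates that the item discriminates well in that score region.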
Improving the Reliability of Student Scores from Speeded Assessments: An Illustration of Conditional Item Response Theory Using a Computer-Administered Measure of Vocabulary.

PubMed

Petscher, Yaacov; Mitchell, Alison M; Foorman, Barbara R

2015-01-01

A growing body of literature suggests that response latency, the amount of time it takes an individual to respond to an item, may be an important factor to consider when using assessment data to estimate the ability of an individual. Considering that tests of passage and list fluency are being adapted to a computer administration format, it is possible that accounting for individual differences in response times may be an increasingly feasible option to strengthen the precision of individual scores. The present research evaluated the differential reliability of scores when using classical test theory and item response theory as compared to a conditional item response model which includes response time as an item parameter. Results indicated that the precision of student ability scores increased by an average of 5% when using the conditional item response model, with greater improvements for those who were average or high ability. Implications for measurement models of speeded assessments are discussed.
Improving the Reliability of Student Scores from Speeded Assessments: An Illustration of Conditional Item Response Theory Using a Computer-Administered Measure of Vocabulary

PubMed Central

Petscher, Yaacov; Mitchell, Alison M.; Foorman, Barbara R.

2016-01-01

A growing body of literature suggests that response latency, the amount of time it takes an individual to respond to an item, may be an important factor to consider when using assessment data to estimate the ability of an individual. Considering that tests of passage and list fluency are being adapted to a computer administration format, it is possible that accounting for individual differences in response times may be an increasingly feasible option to strengthen the precision of individual scores. The present research evaluated the differential reliability of scores when using classical test theory and item response theory as compared to a conditional item response model which includes response time as an item parameter. Results indicated that the precision of student ability scores increased by an average of 5% when using the conditional item response model, with greater improvements for those who were average or high ability. Implications for measurement models of speeded assessments are discussed. PMID:27721568
Tailoring Multimedia Instruction to Soldier Needs

DTIC Science & Technology

2014-12-01

Pretest Score (Mean % Items Correct): 39%, 34%, 48%, 51%, 51%, 45%. Posttest (Mean % Items Correct): 47%, 44%, 66%, 60%, 63%, 56... Stepwise regression was used to examine the relationship between Soldiers’ posttest scores (criterion) and their pretest scores, training time, type of... differences among IMI types had no effect.) Pretest scores predicted posttest scores for both Adjust Indirect Fire (βstandardized = .66, t = 6.36

Influence of dominant- as compared with nondominant-side symptoms on Disabilities of the Arm, Shoulder and Hand and Western Ontario Rotator Cuff scores in patients with rotator cuff tendinopathy.

PubMed

Christiansen, David Høyrup; Michener, Lori; Roy, Jean-Sébastien

2018-02-13

The Disabilities of the Arm, Shoulder and Hand (DASH) questionnaire and the Western Ontario Rotator Cuff (WORC) index are 2 widely used patient-reported questionnaires in individuals with rotator cuff (RC) tendinopathy. In contrast to the WORC index, for which the items are specific to the affected shoulder, the items of the DASH questionnaire assess the ability to perform activities regardless of the arm used. The objective of this study is to determine whether scores on the DASH questionnaire and WORC index are affected if the symptoms are on the dominant or nondominant side in individuals with RC tendinopathy. Given the number of items that can be influenced by dominance, the hypothesis is that DASH scores will be impacted by the side of the symptoms. Individuals with RC tendinopathy (N = 149) completed questions on symptomatology and hand dominance, the DASH questionnaire, and the WORC index. Differences in total scores (independent t test) and single items (Wilcoxon rank sum test) were compared between groups of participants with dominant-side symptoms and those without dominant-side symptoms. No significant differences were observed for WORC or DASH total scores when comparing participants with and without symptoms on their dominant side. Single-item comparison revealed more items being affected by symptom side on the DASH questionnaire (6 of 30 items) than on the WORC index (2 of 21 items). The side of the symptoms does not influence the DASH and WORC total scores, as there are no systematic differences between individuals with and without symptoms in their dominant shoulder. However, the presence of dominant symptoms does influence item scores more on the DASH questionnaire than on the WORC index. Copyright © 2018 Journal of Shoulder and Elbow Surgery Board of Trustees. Published by Elsevier Inc. All rights reserved.
Distribution of Total Depressive Symptoms Scores and Each Depressive Symptom Item in a Sample of Japanese Employees.

PubMed

Tomitaka, Shinichiro; Kawasaki, Yohei; Ide, Kazuki; Yamada, Hiroshi; Miyake, Hirotsugu; Furukawa, Toshiaki A

2016-01-01

In a previous study, we reported that the distribution of total depressive symptoms scores according to the Center for Epidemiologic Studies Depression Scale (CES-D) in a general population is stable throughout middle adulthood and follows an exponential pattern except for at the lowest end of the symptom score. Furthermore, the individual distributions of 16 negative symptom items of the CES-D exhibit a common mathematical pattern. To confirm the reproducibility of these findings, we investigated the distribution of total depressive symptoms scores and 16 negative symptom items in a sample of Japanese employees. We analyzed 7624 employees aged 20-59 years who had participated in the Northern Japan Occupational Health Promotion Centers Collaboration Study for Mental Health. Depressive symptoms were assessed using the CES-D. The CES-D contains 20 items, each of which is scored in four grades: "rarely," "some," "much," and "most of the time." The descriptive statistics and frequency curves of the distributions were then compared according to age group. The distribution of total depressive symptoms scores appeared to be stable from 30-59 years. The right tail of the distribution for ages 30-59 years exhibited a linear pattern with a log-normal scale. The distributions of the 16 individual negative symptom items of the CES-D exhibited a common mathematical pattern which displayed different distributions with a boundary at "some." The distributions of the 16 negative symptom items from "some" to "most" followed a linear pattern with a log-normal scale. The distributions of the total depressive symptoms scores and individual negative symptom items in a Japanese occupational setting show the same patterns as those observed in a general population. These results show that the specific mathematical patterns of the distributions of total depressive symptoms scores and individual negative symptom items can be reproduced in an occupational population.
Translation and validation of the new version of the Knee Society Score - The 2011 KS Score - into Brazilian Portuguese.

PubMed

Silva, Adriana Lucia Pastore E; Croci, Alberto Tesconi; Gobbi, Riccardo Gomes; Hinckel, Betina Bremer; Pecora, José Ricardo; Demange, Marco Kawamura

2017-01-01

Translation, cultural adaptation, and validation of the new version of the Knee Society Score - The 2011 KS Score - into Brazilian Portuguese and verification of its measurement properties, reproducibility, and validity. In 2012, the new version of the Knee Society Score was developed and validated. This scale comprises four separate subscales: (a) objective knee score (seven items: 100 points); (b) patient satisfaction score (five items: 40 points); (c) patient expectations score (three items: 15 points); and (d) functional activity score (19 items: 100 points). A total of 90 patients aged 55-85 years were evaluated in a clinical cross-sectional study. The pre-operative translated version was applied to patients with TKA referral, and the post-operative translated version was applied to patients who underwent TKA. Each patient answered the same questionnaire twice and was evaluated by two experts in orthopedic knee surgery. Evaluations were performed pre-operatively and three, six, or 12 months post-operatively. The reliability of the questionnaire was evaluated using the intraclass correlation coefficient (ICC) between the two applications. Internal consistency was evaluated using Cronbach's alpha. The ICC found no difference between the means of the pre-operative, three-month, and six-month post-operative evaluations between sub-scale items. The Brazilian Portuguese version of The 2011 KS Score is a valid and reliable instrument for objective and subjective evaluation of the functionality of Brazilian patients who undergo TKA and revision TKA.
Validation of the Adolescent Concerns Measure (ACM): evidence from exploratory and confirmatory factor analysis.

PubMed

Ang, Rebecca P; Chong, Wan Har; Huan, Vivien S; Yeo, Lay See

2007-01-01

This article reports the development and initial validation of scores obtained from the Adolescent Concerns Measure (ACM), a scale which assesses concerns of Asian adolescent students. In Study 1, findings from exploratory factor analysis using 619 adolescents suggested a 24-item scale with four correlated factors--Family Concerns (9 items), Peer Concerns (5 items), Personal Concerns (6 items), and School Concerns (4 items). Initial estimates of convergent validity for ACM scores were also reported. The four-factor structure of ACM scores derived from Study 1 was confirmed via confirmatory factor analysis in Study 2 using a two-fold cross-validation procedure with a separate sample of 811 adolescents. Support was found for both the multidimensional and hierarchical models of adolescent concerns using the ACM. Internal consistency and test-retest reliability estimates were adequate for research purposes. ACM scores show promise as a reliable and potentially valid measure of Asian adolescents' concerns.
More Reasons to be Straightforward: Findings and Norms for Two Scales Relevant to Social Anxiety

PubMed Central

Rodebaugh, Thomas L.; Heimberg, Richard G.; Brown, Patrick J.; Fernandez, Katya C.; Blanco, Carlos; Schneier, Franklin R.; Liebowitz, Michael R.

2011-01-01

The validity of both the Social Interaction Anxiety Scale and Brief Fear of Negative Evaluation scale has been well-supported, yet the scales have a small number of reverse-scored items that may detract from the validity of their total scores. The current study investigates two characteristics of participants that may be associated with compromised validity of these items: higher age and lower levels of education. In community and clinical samples, the validity of each scale's reverse-scored items was moderated by age, years of education, or both. The straightforward items did not show this pattern. To encourage the use of the straightforward items of these scales, we provide normative data from the same samples as well as two large student samples. We contend that although response bias can be a substantial problem, the reverse-scored questions of these scales do not solve that problem and instead decrease overall validity. PMID:21388781
Item Parameter Changes and Equating: An Examination of the Effects of Lack of Item Parameter Invariance on Equating and Score Accuracy for Different Proficiency Levels

ERIC Educational Resources Information Center

Store, Davie

2013-01-01

The impact of particular types of context effects on actual scores is less understood although there has been some research carried out regarding certain types of context effects under the nonequivalent anchor test (NEAT) design. In addition, the issue of the impact of item context effects on scores has not been investigated extensively when item…
Alternative Matching Scores to Control Type I Error of the Mantel-Haenszel Procedure for DIF in Dichotomously Scored Items Conforming to 3PL IRT and Nonparametric 4PBCB Models

ERIC Educational Resources Information Center

Monahan, Patrick O.; Ankenmann, Robert D.

2010-01-01

When the matching score is either less than perfectly reliable or not a sufficient statistic for determining latent proficiency in data conforming to item response theory (IRT) models, Type I error (TIE) inflation may occur for the Mantel-Haenszel (MH) procedure or any differential item functioning (DIF) procedure that matches on summed-item…
The Impact of Test Dimensionality, Common-Item Set Format, and Scale Linking Methods on Mixed-Format Test Equating

ERIC Educational Resources Information Center

Öztürk-Gübes, Nese; Kelecioglu, Hülya

2016-01-01

The purpose of this study was to examine the impact of dimensionality, common-item set format, and different scale linking methods on preserving equity property with mixed-format test equating. Item response theory (IRT) true-score equating (TSE) and IRT observed-score equating (OSE) methods were used under common-item nonequivalent groups design.…
Five Methods to Score the Teacher Observation of Classroom Adaptation Checklist and to Examine Group Differences

ERIC Educational Resources Information Center

Wang, Ze; Rohrer, David; Chuang, Chi-ching; Fujiki, Mayo; Herman, Keith; Reinke, Wendy

2015-01-01

This study compared 5 scoring methods in terms of their statistical assumptions. They were then used to score the Teacher Observation of Classroom Adaptation Checklist, a measure consisting of 3 subscales and 21 Likert-type items. The 5 methods used were (a) sum/average scores of items, (b) latent factor scores with continuous indicators, (c)…
Do large-scale assessments measure students' ability to integrate scientific knowledge?

NASA Astrophysics Data System (ADS)

Lee, Hee-Sun

2010-03-01

Large-scale assessments are used as means to diagnose the current status of student achievement in science and compare students across schools, states, and countries. For efficiency, multiple-choice items and dichotomously-scored open-ended items are pervasively used in large-scale assessments such as Trends in International Math and Science Study (TIMSS). This study investigated how well these items measure secondary school students' ability to integrate scientific knowledge. This study collected responses of 8400 students to 116 multiple-choice and 84 open-ended items and applied an Item Response Theory analysis based on the Rasch Partial Credit Model. Results indicate that most multiple-choice items and dichotomously-scored open-ended items can be used to determine whether students have normative ideas about science topics, but cannot measure whether students integrate multiple pieces of relevant science ideas. Only when the scoring rubric is redesigned to capture subtle nuances of student open-ended responses, open-ended items become a valid and reliable tool to assess students' knowledge integration ability.
Perceptions of Culture of Safety in Hemodialysis Centers.

PubMed

Davis, Kristina K; Harris, Kathleen G; Mahishi, Vrinda; Bartholomew, Edward G; Kenward, Kevin

2016-01-01

Staff members, physicians, nurse practitioners, and physician assistants from a sample of hemodialysis facilities in Network 6 (North Carolina, South Carolina, and Georgia) and Network 11 (Michigan, Minnesota, North Dakota, South Dakota, and Wisconsin) completed a 10-item assessment with modified questions from the Hospital Survey on Patient Safety Culture, with an emphasis on safety culture related to vascular access infections. A composite score was constructed, which was the average of the percent-positive scores of the items. Overall, scores were high, indicating a positive patient safety culture. Composite scores varied by role type, with nurses, patient care technicians, and other technicians reporting the lowest composite scores. Network 6 participants reported higher scores on two of the survey items. Fewer staff within a facility were associated with higher composite scores.
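The composite described above, the average of the items' percent-positive scores, can be sketched as follows. The abstract does not say which responses count as positive; the sketch assumes a 5-point agreement scale where 4 and 5 are positive, and the data and function names are hypothetical.

```python
def percent_positive(responses, positive=frozenset({4, 5})):
    """Percentage of responses to one item that fall in the positive set."""
    return 100.0 * sum(r in positive for r in responses) / len(responses)


def composite_score(item_responses):
    """Average of the per-item percent-positive scores."""
    per_item = [percent_positive(r) for r in item_responses]
    return sum(per_item) / len(per_item)


# Hypothetical survey: two items, four respondents each.
items = [
    [5, 4, 3, 2],  # 2 of 4 positive
    [5, 5, 4, 1],  # 3 of 4 positive
]
print(composite_score(items))
```

Averaging per-item percentages gives every item equal weight, regardless of how many respondents answered it.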
Development and psychometric testing of an instrument designed to measure chronic pain in dogs with osteoarthritis

PubMed Central

Boston, Raymond C.; Coyne, James C.; Farrar, John T.

2010-01-01

Objective: To develop and psychometrically test an owner self-administered questionnaire designed to assess severity and impact of chronic pain in dogs with osteoarthritis. Sample Population: 70 owners of dogs with osteoarthritis and 50 owners of clinically normal dogs. Procedures: Standard methods for the stepwise development and testing of instruments designed to assess subjective states were used. Items were generated through focus groups and an expert panel. Items were tested for readability and ambiguity, and poorly performing items were removed. The reduced set of items was subjected to factor analysis, reliability testing, and validity testing. Results: Severity of pain and interference with function were 2 factors identified and named on the basis of the items contained in them. Cronbach’s α was 0.93 and 0.89, respectively, suggesting that the items in each factor could be assessed as a group to compute factor scores (ie, severity score and interference score). The test-retest analysis revealed κ values of 0.75 for the severity score and 0.81 for the interference score. Scores correlated moderately well (r = 0.51 and 0.50, respectively) with the overall quality-of-life (QOL) question, such that as severity and interference scores increased, QOL decreased. Clinically normal dogs had significantly lower severity and interference scores than dogs with osteoarthritis. Conclusions and Clinical Relevance: A psychometrically sound instrument was developed. Responsiveness testing must be conducted to determine whether the questionnaire will be useful in reliably obtaining quantifiable assessments from owners regarding the severity and impact of chronic pain and its treatment on dogs with osteoarthritis. PMID:17542696
Distinctions between Item Format and Objectivity in Scoring.

ERIC Educational Resources Information Center

Terwilliger, James S.

This paper clarifies important distinctions in item writing and item scoring and considers the implications of these distinctions for developing guidelines related to test construction for training teachers. The terminology used to describe and classify paper and pencil test questions frequently confuses two distinct features of questions:…
Improving Factor Score Estimation Through the Use of Observed Background Characteristics

PubMed Central

Curran, Patrick J.; Cole, Veronica; Bauer, Daniel J.; Hussong, Andrea M.; Gottfredson, Nisha

2016-01-01

A challenge facing nearly all studies in the psychological sciences is how to best combine multiple items into a valid and reliable score to be used in subsequent modelling. The most ubiquitous method is to compute a mean of items, but more contemporary approaches use various forms of latent score estimation. Regardless of approach, outside of large-scale testing applications, scoring models rarely include background characteristics to improve score quality. The current paper used a Monte Carlo simulation design to study score quality for different psychometric models that did and did not include covariates across levels of sample size, number of items, and degree of measurement invariance. The inclusion of covariates improved score quality for nearly all design factors, and in no case did the covariates degrade score quality relative to not considering the influences at all. Results suggest that the inclusion of observed covariates can improve factor score estimation. PMID:28757790
Determinants of ante-partum depression: a multicenter study.

PubMed

Balestrieri, Matteo; Isola, Miriam; Bisoffi, Giulia; Calò, Salvatore; Conforti, Anita; Driul, Lorenza; Marchesoni, Diego; Petrosemolo, Paola; Rossi, Michela; Zito, Adriana; Zorzenone, Stefania; Di Sciascio, Guido; Leone, Roberto; Bellantuono, Cesario

2012-12-01

Ante-partum depression (APD) is usually defined as a non-psychotic depressive episode of mild to moderate severity, beginning in or extending into pregnancy. APD has received less attention than postpartum depression. This is a cross-sectional study carried out in the Obstetrics and Gynaecology (OG) departments of four different general hospitals in Italy. Women consecutively attending the OG departments for their first ultrasound examination were asked to fill in the Edinburgh Postnatal Depression Scale (EPDS) in its Italian validated version. We used the total scores of the EPDS as a continuous variable for univariate and linear regression analyses; in accordance with the literature, the item analysis of EPDS was carried out by classifying the sample as women with "no depression" (scores 0-9), "possible depression" (scores 10-12), "probable depression" (scores 13+) and "probable APD" (scores 15+). The number of women recruited was 1,608. The EPDS assessment classified 10.9% of the women as possibly depressed, 8.3% as probably depressed and 4.7% as probably affected by APD. EPDS score distribution was associated with nationality (higher scores for foreigners), cohabitation (higher scores for women living with friends or in a community), occupation (higher scores for housewives), past episodes of depression and use of herbal drugs. Non-depressed women had significantly lower values on all ten items as compared with depressed women; however, the pattern of item distribution on the EPDS scale remained similar across depression severity groups. In all four groups item 4 (anxious depression) attained the highest scores, while item 10 (suicidality) attained the lowest scores.
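The EPDS cut-offs quoted in the abstract can be written as a small classification sketch. The function name is hypothetical, and treating "probable APD" (15+) as nested inside "probable depression" (13+) is an assumption: the abstract does not say whether the two groups overlap. The 0 to 30 range follows from ten items each scored 0 to 3.

```python
def epds_bands(total_score):
    """Return the abstract's severity labels that apply to an EPDS total score."""
    if not 0 <= total_score <= 30:
        raise ValueError("EPDS totals range from 0 to 30")
    if total_score <= 9:
        return ["no depression"]
    if total_score <= 12:
        return ["possible depression"]
    # Scores of 13 or more: probable depression; 15 or more also meets the
    # stricter "probable APD" cut-off (assumed here to be a nested subgroup).
    labels = ["probable depression"]
    if total_score >= 15:
        labels.append("probable APD")
    return labels


for score in (4, 11, 13, 17):
    print(score, epds_bands(score))
```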
Effects of levomilnacipran ER on fatigue symptoms associated with major depressive disorder

PubMed Central

Fava, Maurizio; Gommoll, Carl; Chen, Changzheng; Greenberg, William M.; Ruth, Adam

2016-01-01

The aim of this study was to evaluate the effects of levomilnacipran extended-release (ER) on depression-related fatigue in adults with major depressive disorder. Post-hoc analyses of five phase III trials were carried out, with evaluation of fatigue symptoms based on score changes in four items: Montgomery–Åsberg Depression Rating Scale (MADRS) item 7 (lassitude), and 17-item Hamilton Depression Rating Scale (HAMD17) items 7 (work/activities), 8 (retardation), and 13 (somatic symptoms). Symptom remission was analyzed on the basis of score shifts from baseline to end of treatment: MADRS item 7 and HAMD17 item 7 (from ≥2 to ≤1); HAMD17 items 8 and 13 (from ≥1 to 0). The mean change in MADRS total score was analyzed in patients with low and high fatigue (MADRS item 7 baseline score <4 and ≥4, respectively). Patients receiving levomilnacipran ER had significantly greater mean improvements and symptom remission (no/minimal residual fatigue) on all fatigue-related items: lassitude (35 vs. 28%), work/activities (43 vs. 35%), retardation (46 vs. 39%), somatic symptoms (26 vs. 18%; all Ps<0.01 versus placebo). The mean change in MADRS total score was significantly greater with levomilnacipran ER versus placebo in both low (least squares mean difference=−2.8, P=0.0018) and high (least squares mean difference=−3.1, P<0.0001) fatigue subgroups. Levomilnacipran ER treatment was effective in reducing depression-related fatigue in adult patients with major depressive disorder and was associated with remission of fatigue symptoms. PMID:26584326
Multidimensional CAT Item Selection Methods for Domain Scores and Composite Scores: Theory and Applications

ERIC Educational Resources Information Center

Yao, Lihua

2012-01-01

Multidimensional computer adaptive testing (MCAT) can provide higher precision and reliability or reduce test length when compared with unidimensional CAT or with the paper-and-pencil test. This study compared five item selection procedures in the MCAT framework for both domain scores and overall scores through simulation by varying the structure…
A protocol for the Hamilton Rating Scale for Depression: Item scoring rules, Rater training, and outcome accuracy with data on its application in a clinical trial.

PubMed

Rohan, Kelly J; Rough, Jennifer N; Evans, Maggie; Ho, Sheau-Yan; Meyerhoff, Jonah; Roberts, Lorinda M; Vacek, Pamela M

2016-08-01

We present a fully articulated protocol for the Hamilton Rating Scale for Depression (HAM-D), including item scoring rules, rater training procedures, and a data management algorithm to increase accuracy of scores prior to outcome analyses. The latter involves identifying potentially inaccurate scores as interviews with discrepancies between two independent raters on the basis of either scores (>=5-point difference) or meeting threshold for depression recurrence status, a long-term treatment outcome with public health significance. Discrepancies are resolved by assigning two new raters, identifying items with disagreement per an algorithm, and reaching consensus on the most accurate scores for those items. These methods were applied in a clinical trial where the primary outcome was the Structured Interview Guide for the Hamilton Rating Scale for Depression-Seasonal Affective Disorder version (SIGH-SAD), which includes the 21-item HAM-D and 8 items assessing atypical symptoms. 177 seasonally depressed adult patients were enrolled and interviewed at 10 time points across treatment and the 2-year followup interval for a total of 1589 completed interviews with 1535 (96.6%) archived. Inter-rater reliability ranged from ICCs of .923-.967. Only 86 (5.6%) interviews met criteria for a between-rater discrepancy. HAM-D items "Depressed Mood", "Work and Activities", "Middle Insomnia", and "Hypochondriasis" and Atypical items "Fatigability" and "Hypersomnia" contributed most to discrepancies. Generalizability beyond well-trained, experienced raters in a clinical trial is unknown. Researchers might want to consider adopting this protocol in part or full. Clinicians might want to tailor it to their needs. Copyright © 2016 Elsevier B.V. All rights reserved.
Measuring benefits and patients' satisfaction when glasses are not needed after cataract and presbyopia surgery: scoring and psychometric validation of the Freedom from Glasses Value Scale (FGVS).

PubMed

Berdeaux, Gilles; Meunier, Juliette; Arnould, Benoit; Viala-Danten, Muriel

2010-05-24

The purpose of this study was to reduce the number of items, create a scoring method and assess the psychometric properties of the Freedom from Glasses Value Scale (FGVS), which measures benefits of freedom from glasses perceived by cataract and presbyopic patients after multifocal intraocular lens (IOL) surgery. The 21-item FGVS, developed simultaneously in French and Spanish, was administered by phone during an observational study to 152 French and 152 Spanish patients who had undergone cataract or presbyopia surgery at least 1 year before the study. Reduction of items and creation of the scoring method employed statistical methods (principal component analysis, multitrait analysis) and content analysis. Psychometric properties (validation of the structure, internal consistency reliability, and known-group validity) of the resulting version were assessed in the pooled population and per country. One item was deleted and 3 were kept but not aggregated in a dimension. The other 17 items were grouped into 2 dimensions ('global evaluation', 9 items; 'advantages', 8 items) and divided into 5 sub-dimensions, with higher scores indicating higher benefit of surgery. The structure was validated (good item convergent and discriminant validity). Internal consistency reliability was good for all dimensions and sub-dimensions (Cronbach's alphas above 0.70). The FGVS was able to discriminate between patients wearing glasses or not after surgery (higher scores for patients not wearing glasses). FGVS scores were significantly higher in Spain than France; however, the measure had similar psychometric performances in both countries. The FGVS is a valid and reliable instrument measuring benefits of freedom from glasses perceived by cataract and presbyopic patients after multifocal IOL surgery.
Measuring benefits and patients' satisfaction when glasses are not needed after cataract and presbyopia surgery: scoring and psychometric validation of the Freedom from Glasses Value Scale (FGVS©)

PubMed Central

2010-01-01

Background The purpose of this study was to reduce the number of items, create a scoring method and assess the psychometric properties of the Freedom from Glasses Value Scale (FGVS), which measures benefits of freedom from glasses perceived by cataract and presbyopic patients after multifocal intraocular lens (IOL) surgery. Methods The 21-item FGVS, developed simultaneously in French and Spanish, was administered by phone during an observational study to 152 French and 152 Spanish patients who had undergone cataract or presbyopia surgery at least 1 year before the study. Reduction of items and creation of the scoring method employed statistical methods (principal component analysis, multitrait analysis) and content analysis. Psychometric properties (validation of the structure, internal consistency reliability, and known-group validity) of the resulting version were assessed in the pooled population and per country. Results One item was deleted and 3 were kept but not aggregated in a dimension. The other 17 items were grouped into 2 dimensions ('global evaluation', 9 items; 'advantages', 8 items) and divided into 5 sub-dimensions, with higher scores indicating higher benefit of surgery. The structure was validated (good item convergent and discriminant validity). Internal consistency reliability was good for all dimensions and sub-dimensions (Cronbach's alphas above 0.70). The FGVS was able to discriminate between patients wearing glasses or not after surgery (higher scores for patients not wearing glasses). FGVS scores were significantly higher in Spain than France; however, the measure had similar psychometric performances in both countries. Conclusions The FGVS is a valid and reliable instrument measuring benefits of freedom from glasses perceived by cataract and presbyopic patients after multifocal IOL surgery. PMID:20497555

Checking Equity: Why Differential Item Functioning Analysis Should Be a Routine Part of Developing Conceptual Assessments

PubMed Central

Martinková, Patrícia; Drabinová, Adéla; Liaw, Yuan-Ling; Sanders, Elizabeth A.; McFarland, Jenny L.; Price, Rebecca M.

2017-01-01

We provide a tutorial on differential item functioning (DIF) analysis, an analytic method useful for identifying potentially biased items in assessments. After explaining a number of methodological approaches, we test for gender bias in two scenarios that demonstrate why DIF analysis is crucial for developing assessments, particularly because simply comparing two groups’ total scores can lead to incorrect conclusions about test fairness. First, a significant difference between groups on total scores can exist even when items are not biased, as we illustrate with data collected during the validation of the Homeostasis Concept Inventory. Second, item bias can exist even when the two groups have exactly the same distribution of total scores, as we illustrate with a simulated data set. We also present a brief overview of how DIF analysis has been used in the biology education literature to illustrate the way DIF items need to be reevaluated by content experts to determine whether they should be revised or removed from the assessment. Finally, we conclude by arguing that DIF analysis should be used routinely to evaluate items in developing conceptual assessments. These steps will ensure more equitable—and therefore more valid—scores from conceptual assessments. PMID:28572182
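The point that DIF is judged after matching examinees on total score, rather than by comparing raw group means, can be sketched with the Mantel-Haenszel common odds ratio. This is one widely used DIF statistic, offered here as an assumption: the abstract does not say which methods the tutorial applies. The function name, group labels, and record layout are hypothetical.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item, stratified by total test score.

    `records` holds (group, total_score, correct) tuples, where group is
    "ref" or "focal" and correct is 1 or 0 for the studied item.
    """
    # Per score stratum: [ref correct, ref incorrect, focal correct, focal incorrect]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, total_score, correct in records:
        if group not in ("ref", "focal"):
            raise ValueError(f"unknown group: {group!r}")
        index = (0 if group == "ref" else 2) + (0 if correct else 1)
        strata[total_score][index] += 1

    numerator = denominator = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these data")
    return numerator / denominator


# Hypothetical data: within each score level both groups succeed equally often,
# so the item shows no DIF even though the groups differ in total score.
records = (
    [("ref", 8, 1)] * 6 + [("ref", 8, 0)] * 2
    + [("focal", 8, 1)] * 3 + [("focal", 8, 0)] * 1
    + [("ref", 4, 1)] * 1 + [("ref", 4, 0)] * 1
    + [("focal", 4, 1)] * 4 + [("focal", 4, 0)] * 4
)
print(mantel_haenszel_odds_ratio(records))
```

A ratio near 1 suggests the item behaves alike for matched examinees; values far from 1 flag the item for review by content experts, as the abstract recommends.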
Using a MaxEnt Classifier for the Automatic Content Scoring of Free-Text Responses

NASA Astrophysics Data System (ADS)

Sukkarieh, Jana Z.

2011-03-01

Criticisms against multiple-choice item assessments in the USA have prompted researchers and organizations to move towards constructed-response (free-text) items. Constructed-response (CR) items pose many challenges to the education community, one of which is that they are expensive for humans to score. At the same time, there has been widespread movement towards computer-based assessment and hence, assessment organizations are competing to develop automatic content scoring engines for such item types, a task which we view as textual entailment. This paper describes how MaxEnt Modeling is used to help solve the task. MaxEnt has been used in many natural language tasks but this is the first application of the MaxEnt approach to textual entailment and automatic content scoring.
Item Response Theory Modeling of the Philadelphia Naming Test

ERIC Educational Resources Information Center

Fergadiotis, Gerasimos; Kellough, Stacey; Hula, William D.

2015-01-01

Purpose: In this study, we investigated the fit of the Philadelphia Naming Test (PNT; Roach, Schwartz, Martin, Grewal, & Brecher, 1996) to an item-response-theory measurement model, estimated the precision of the resulting scores and item parameters, and provided a theoretical rationale for the interpretation of PNT overall scores by relating…
An Evaluation of Three Approximate Item Response Theory Models for Equating Test Scores.

ERIC Educational Resources Information Center

Marco, Gary L.; And Others

Three item response models were evaluated for estimating item parameters and equating test scores. The models, which approximated the traditional three-parameter model, included: (1) the Rasch one-parameter model, operationalized in the BICAL computer program; (2) an approximate three-parameter logistic model based on coarse group data divided…
Use of Automated Scoring Features to Generate Hypotheses Regarding Language-Based DIF

ERIC Educational Resources Information Center

Shermis, Mark D.; Mao, Liyang; Mulholland, Matthew; Kieftenbeld, Vincent

2017-01-01

This study uses the feature sets employed by two automated scoring engines to determine if a "linguistic profile" could be formulated that would help identify items that are likely to exhibit differential item functioning (DIF) based on linguistic features. Sixteen items were administered to 1200 students where demographic information…
Nonparametric Item Response Curve Estimation with Correction for Measurement Error

ERIC Educational Resources Information Center

Guo, Hongwen; Sinharay, Sandip

2011-01-01

Nonparametric or kernel regression estimation of item response curves (IRCs) is often used in item analysis in testing programs. These estimates are biased when the observed scores are used as the regressor because the observed scores are contaminated by measurement error. Accuracy of this estimation is a concern theoretically and operationally.…
Item response theory scoring and the detection of curvilinear relationships.

PubMed

Carter, Nathan T; Dalal, Dev K; Guan, Li; LoPilato, Alexander C; Withrow, Scott A

2017-03-01

Psychologists are increasingly positing theories of behavior that suggest psychological constructs are curvilinearly related to outcomes. However, results from empirical tests for such curvilinear relations have been mixed. We propose that correctly identifying the response process underlying responses to measures is important for the accuracy of these tests. Indeed, past research has indicated that item responses to many self-report measures follow an ideal point response process (wherein respondents agree only to items that reflect their own standing on the measured variable) as opposed to a dominance process, wherein stronger agreement, regardless of item content, is always indicative of higher standing on the construct. We test whether item response theory (IRT) scoring appropriate for the underlying response process to self-report measures results in more accurate tests for curvilinearity. In 2 simulation studies, we show that, regardless of the underlying response process used to generate the data, using the traditional sum-score generally results in high Type 1 error rates or low power for detecting curvilinearity, depending on the distribution of item locations. With few exceptions, appropriate power and Type 1 error rates are achieved when dominance-based and ideal point-based IRT scoring are correctly used to score dominance and ideal point response data, respectively. We conclude that (a) researchers should be theory-guided when hypothesizing and testing for curvilinear relations; (b) correctly identifying whether responses follow an ideal point versus dominance process, particularly when items are not extreme, is critical; and (c) IRT model-based scoring is crucial for accurate tests of curvilinearity. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Developing a Clinician Friendly Tool to Identify Useful Clinical Practice Guidelines: G-TRUST.

PubMed

Shaughnessy, Allen F; Vaswani, Akansha; Andrews, Bonnie K; Erlich, Deborah R; D'Amico, Frank; Lexchin, Joel; Cosgrove, Lisa

2017-09-01

Clinicians are faced with a plethora of guidelines. To rate guidelines, they can select from a number of evaluation tools, most of which are long and difficult to apply. The goal of this project was to develop a simple, easy-to-use checklist for clinicians to use to identify trustworthy, relevant, and useful practice guidelines, the Guideline Trustworthiness, Relevance, and Utility Scoring Tool (G-TRUST). A modified Delphi process was used to obtain consensus of experts and guideline developers regarding a checklist of items and their relative impact on guideline quality. We conducted 4 rounds of sampling to refine wording, add and subtract items, and develop a scoring system. Multiple attribute utility analysis was used to develop a weighted utility score for each item to determine scoring. Twenty-two experts in evidence-based medicine, 17 developers of high-quality guidelines, and 1 consumer representative participated. In rounds 1 and 2, items were rewritten or dropped, and 2 items were added. In round 3, weighted scores were calculated from rankings and relative weights assigned by the expert panel. In the last round, more than 75% of experts indicated 3 of the 8 checklist items to be major indicators of guideline usefulness and, using the AGREE tool as a reference standard, a scoring system was developed to identify guidelines as useful, may not be useful, and not useful. The 8-item G-TRUST is potentially helpful as a tool for clinicians to identify useful guidelines. Further research will focus on its reliability when used by clinicians. © 2017 Annals of Family Medicine, Inc.
Implications of Changing Answers on Objective Test Items

ERIC Educational Resources Information Center

Mueller, Daniel J.; Wasser, Virginia

1977-01-01

Eighteen studies of the effects of changing initial answers to objective test items are reviewed. While students throughout the total test score range tended to gain more points than they lost, higher scoring students gained more than did lower scoring students. Suggestions for further research are made. (Author/JKS)
Psychometric properties of the brief version of the Fear of Negative Evaluation Scale in a Turkish sample.

PubMed

Koydemir, Selda; Demir, Ayhan

2007-06-01

The purpose of the study was to report initial data on the psychometric properties of the Brief Fear of Negative Evaluation Scale. The scale was applied to a nonclinical sample of 250 (137 women, 113 men) Turkish undergraduate students selected randomly from Middle East Technical University. Their mean age was 20.4 yr. (SD= 1.9). The factor structure of the Turkish version, its criterion validity, and internal reliability coefficients were assessed. Although maximum likelihood factor analysis initially indicated that the scale had only one factor, a forced two-factor solution accounted for more variance (61%) in scale scores than a single factor. The straightforward items loaded on the first factor, and the reverse-coded items loaded on the second factor. The total score was significantly positively correlated with scores on the Revised Cheek and Buss Shyness Scale and significantly negatively correlated with scores on the Rosenberg Self-Esteem Scale. Factor 1 (straightforward items) correlated more highly with both Shyness and Self-esteem than Factor 2 (reverse-coded items). Internal consistency estimate was .94 for the Total scores, .91 for the Factor 1 (straightforward items), and .87 for the Factor 2 (reverse-coded items). No sex differences were evident for Fear of Negative Evaluation.
Validation of a single summary score for the Prolapse/Incontinence Sexual Questionnaire-IUGA revised (PISQ-IR).

PubMed

Constantine, Melissa L; Pauls, Rachel N; Rogers, Rebecca R; Rockwood, Todd H

2017-12-01

The Prolapse/Incontinence Sexual Questionnaire-International Urogynecology Association (IUGA) Revised (PISQ-IR) measures sexual function in women with pelvic floor disorders (PFDs) yet is unwieldy, with six individual subscale scores for sexually active women and four for women who are not. We hypothesized that a valid and responsive summary score could be created for the PISQ-IR. Item response data from participating women who completed a revised version of the PISQ-IR at three clinical sites were used to generate item weights using magnitude estimation (ME) and Q-sort (Q) approaches. Item weights were applied to data from the original PISQ-IR validation to generate summary scores. Correlation and factor analysis methods were used to evaluate validity and responsiveness of summary scores. Weighted and nonweighted summary scores for the sexually active PISQ-IR demonstrated good criterion validity with condition-specific measures: Incontinence Severity Index = 0.12, 0.11, 0.11; Pelvic Floor Distress Inventory-20 = 0.39, 0.39, 0.12; Epidemiology of Prolapse and Incontinence Questionnaire-Q35 = 0.26, 0.25, 0.40; Female Sexual Functioning Index subscale total score = 0.72, 0.75, 0.72 for nonweighted, ME, and Q summary scores, respectively. Responsiveness evaluation showed weighted and nonweighted summary scores detected moderate effect sizes (Cohen's d > 0.5). Weighted items for those not sexually active (NSA) demonstrated significant floor effects and did not meet criterion validity. A PISQ-IR summary score for use with sexually active women, nonweighted or calculated with ME or Q item weights, is a valid and reliable measure for clinical use. The summary scores provide value for assessing clinical treatment of pelvic floor disorders.
An Item Response Analysis of the Motor and Behavioral Subscales of the Unified Huntington's Disease Rating Scale in Huntington Disease Gene Expansion Carriers

PubMed Central

Vaccarino, Anthony L.; Anderson, Karen; Borowsky, Beth; Duff, Kevin; Giuliano, Joseph; Guttman, Mark; Ho, Aileen K.; Orth, Michael; Paulsen, Jane S.; Sills, Terrence; van Kammen, Daniel P.; Evans, Kenneth R.

2011-01-01

Although the Unified Huntington's Disease Rating Scale (UHDRS) is widely used in the assessment of Huntington disease (HD), the ability of individual items to discriminate individual differences in motor or behavioral manifestations has not been extensively studied in HD gene expansion carriers without a motor-defined clinical diagnosis (i.e., prodromal-HD or prHD). To elucidate the relationship between scores on individual motor and behavioral UHDRS items and total score for each subscale, a non-parametric item response analysis was performed on retrospective data from two multicentre, longitudinal studies. Motor and Behavioral assessments were supplied for 737 prHD individuals with data from 2114 visits (PREDICT-HD) and 686 HD individuals with data from 1482 visits (REGISTRY). Option characteristic curves were generated for UHDRS subscale items in relation to their subscale score. In prHD, overall severity of motor signs was low and participants had scores of 2 or above on very few items. In HD, motor items that assessed ocular pursuit, saccade initiation, finger tapping, tandem walking, and to a lesser extent saccade velocity, dysarthria, tongue protrusion, pronation/supination, Luria, bradykinesia, choreas, gait and balance on the retropulsion test were found to discriminate individual differences across a broad range of motor severity. In prHD, depressed mood, anxiety, and irritable behavior demonstrated good discriminative properties. In HD, depressed mood demonstrated a good relationship with the overall behavioral score. These data suggest that at least some UHDRS items appear to have utility across a broad range of severity, although many items demonstrate problematic features. PMID:21370269
An item response analysis of the motor and behavioral subscales of the unified Huntington's disease rating scale in huntington disease gene expansion carriers.

PubMed

Vaccarino, Anthony L; Anderson, Karen; Borowsky, Beth; Duff, Kevin; Giuliano, Joseph; Guttman, Mark; Ho, Aileen K; Orth, Michael; Paulsen, Jane S; Sills, Terrence; van Kammen, Daniel P; Evans, Kenneth R

2011-04-01

Although the Unified Huntington's Disease Rating Scale (UHDRS) is widely used in the assessment of Huntington disease (HD), the ability of individual items to discriminate individual differences in motor or behavioral manifestations has not been extensively studied in HD gene expansion carriers without a motor-defined clinical diagnosis (ie, prodromal-HD or prHD). To elucidate the relationship between scores on individual motor and behavioral UHDRS items and total score for each subscale, a nonparametric item response analysis was performed on retrospective data from 2 multicenter longitudinal studies. Motor and behavioral assessments were supplied for 737 prHD individuals with data from 2114 visits (PREDICT-HD) and 686 HD individuals with data from 1482 visits (REGISTRY). Option characteristic curves were generated for UHDRS subscale items in relation to their subscale score. In prHD, overall severity of motor signs was low, and participants had scores of 2 or above on very few items. In HD, motor items that assessed ocular pursuit, saccade initiation, finger tapping, tandem walking, and to a lesser extent, saccade velocity, dysarthria, tongue protrusion, pronation/supination, Luria, bradykinesia, choreas, gait, and balance on the retropulsion test were found to discriminate individual differences across a broad range of motor severity. In prHD, depressed mood, anxiety, and irritable behavior demonstrated good discriminative properties. In HD, depressed mood demonstrated a good relationship with the overall behavioral score. These data suggest that at least some UHDRS items appear to have utility across a broad range of severity, although many items demonstrate problematic features. Copyright © 2011 Movement Disorder Society.
Psychometric properties of the Global Operative Assessment of Laparoscopic Skills (GOALS) using item response theory.

PubMed

Watanabe, Yusuke; Madani, Amin; Ito, Yoichi M; Bilgic, Elif; McKendy, Katherine M; Feldman, Liane S; Fried, Gerald M; Vassiliou, Melina C

2017-02-01

The extent to which each item assessed using the Global Operative Assessment of Laparoscopic Skills (GOALS) contributes to the total score remains unknown. The purpose of this study was to evaluate the level of difficulty and discriminative ability of each of the 5 GOALS items using item response theory (IRT). A total of 396 GOALS assessments for a variety of laparoscopic procedures over a 12-year time period were included. Threshold parameters of item difficulty and discrimination power were estimated for each item using IRT. The higher slope parameters seen with "bimanual dexterity" and "efficiency" are indicative of greater discriminative ability than "depth perception", "tissue handling", and "autonomy". IRT psychometric analysis indicates that the 5 GOALS items do not demonstrate uniform difficulty and discriminative power, suggesting that they should not be scored equally. "Bimanual dexterity" and "efficiency" seem to have stronger discrimination. Weighted scores based on these findings could improve the accuracy of assessing individual laparoscopic skills. Copyright © 2016 Elsevier Inc. All rights reserved.
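How slope (discrimination) and threshold parameters shape the scoring of a polytomous item can be sketched with Samejima's graded response model. This is an assumption: the abstract does not name the IRT model used. The function name, parameter values, and the five-category layout below are hypothetical and are not estimates from the study.

```python
import math


def grm_category_probs(theta, slope, thresholds):
    """Category probabilities for one item under the graded response model.

    `thresholds` are increasing boundary locations b_1 < ... < b_(m-1);
    the result has m probabilities, lowest category first.
    """
    if sorted(thresholds) != list(thresholds):
        raise ValueError("thresholds must be in increasing order")
    # P(score >= k) for each boundary, padded with 1 (lowest) and 0 (beyond highest).
    cumulative = (
        [1.0]
        + [1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds]
        + [0.0]
    )
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Hypothetical five-category item evaluated with a low and a high slope.
thresholds = [-1.5, -0.5, 0.5, 1.5]
for slope in (0.8, 2.5):
    probs = grm_category_probs(theta=1.0, slope=slope, thresholds=thresholds)
    print(slope, [round(p, 3) for p in probs])
```

With a higher slope, the probabilities concentrate on the categories nearest the examinee's ability, which is what "greater discriminative ability" means for items such as "bimanual dexterity" and "efficiency" in the abstract.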
Health Effects of Exposure to Water-Damaged New Orleans Homes Six Months After Hurricanes Katrina and Rita

PubMed Central

Cummings, Kristin J.; Cox-Ganser, Jean; Riggs, Margaret A.; Edwards, Nicole; Hobbs, Gerald R.; Kreiss, Kathleen

2008-01-01

Objectives. We investigated the relation between respiratory symptoms and exposure to water-damaged homes and the effect of respirator use in posthurricane New Orleans, Louisiana. Methods. We randomly selected 600 residential sites and then interviewed 1 adult per site. We created an exposure variable, calculated upper respiratory symptom (URS) and lower respiratory symptom (LRS) scores, and defined exacerbation categories by the effect on symptoms of being inside water-damaged homes. We used multiple linear regression to model symptom scores (for all participants) and polytomous logistic regression to model exacerbation of symptoms when inside (for those participating in clean-up). Results. Of 553 participants (response rate=92%), 372 (68%) had participated in clean-up; 233 (63%) of these used a respirator. Respiratory symptom scores increased linearly with exposure (P<.05 for trend). Disposable-respirator use was associated with lower odds of exacerbation of moderate or severe symptoms inside water-damaged homes for URS (odds ratio (OR)=0.51; 95% confidence interval (CI)=0.24, 1.09) and LRS (OR=0.33; 95% CI=0.13, 0.83). Conclusions. Respiratory symptoms were positively associated with exposure to water-damaged homes, including exposure limited to being inside without participating in clean-up. Respirator use had a protective effect and should be considered when inside water-damaged homes regardless of activities undertaken. PMID:18381997
A PROMIS Measure of Neuropathic Pain Quality

PubMed Central

Askew, Robert L.; Cook, Karon F.; Keefe, Francis J.; Nowinski, Cindy J; Cella, David; Revicki, Dennis A.; DeWitt, Esi M. Morgan; Michaud, Kaleb; Trence, Dace L.; Amtmann, Dagmar

2016-01-01

Objectives Neuropathic pain is a consequence of many chronic conditions. This study aimed to develop a unidimensional neuropathic pain scale whose scores represent levels of neuropathic pain and distinguish between individuals with neuropathic and non-neuropathic pain conditions. Methods A candidate item pool of 42 pain quality descriptors was administered to participants with osteoarthritis, rheumatoid arthritis, diabetic neuropathy, and cancer chemotherapy-induced peripheral neuropathy. A subset of pain quality descriptors (items) that best distinguished between participants with and those without neuropathic pain conditions were identified. Dimensionality of pain descriptors was evaluated in a development sample and cross-validated in a hold-out sample. Item responses were calibrated using an item response theory model, and scores were generated on a T-score metric. Neuropathic pain scale scores were evaluated in terms of reliability, validity, and the ability to distinguish between participants with and without conditions typically associated with neuropathic pain. Results Of the 42 initial items, 5 were identified for the Patient Reported Outcome Measurement Information System (PROMIS) Neuropathic Pain Quality scale (PROMIS-PQ-Neuro). The IRT-generated T-scores exhibited good discriminatory ability based on receiver operator characteristic analysis. Score thresholds were identified that optimize sensitivity and specificity. Construct, criterion, and discriminant validity, and reliability of scale scores were supported. Conclusions The 5-item PROMIS PQ-Neuro is a short and practical measure that can be used to identify patients more likely to have neuropathic pain and to distinguish levels of neuropathic pain. The data collected will support future research that targets other unidimensional pain quality domains (e.g., nociceptive pain). PMID:27565279
Development of a Computer Adaptive Test for Depression Based on the Dutch-Flemish Version of the PROMIS Item Bank.

PubMed

Flens, Gerard; Smits, Niels; Terwee, Caroline B; Dekker, Joost; Huijbrechts, Irma; de Beurs, Edwin

2017-03-01

We developed a Dutch-Flemish version of the patient-reported outcomes measurement information system (PROMIS) adult V1.0 item bank for depression as input for computerized adaptive testing (CAT). As the item bank, we used the Dutch-Flemish translation of the original PROMIS item bank (28 items) and additionally translated 28 U.S. depression items that failed to make the final U.S. item bank. Through psychometric analysis of a combined clinical and general population sample (N = 2,010), 8 added items were removed. With the final item bank, we performed several CAT simulations to assess the efficiency of the extended (48 items) and the original item bank (28 items), using various stopping rules. Both item banks resulted in highly efficient and precise measurement of depression and showed high similarity between the CAT simulation scores and the full item bank scores. We discuss the implications of using each item bank and stopping rule for further CAT development.
More reasons to be straightforward: findings and norms for two scales relevant to social anxiety.

PubMed

Rodebaugh, Thomas L; Heimberg, Richard G; Brown, Patrick J; Fernandez, Katya C; Blanco, Carlos; Schneier, Franklin R; Liebowitz, Michael R

2011-06-01

The validity of both the Social Interaction Anxiety Scale and Brief Fear of Negative Evaluation scale has been well-supported, yet the scales have a small number of reverse-scored items that may detract from the validity of their total scores. The current study investigates two characteristics of participants that may be associated with compromised validity of these items: higher age and lower levels of education. In community and clinical samples, the validity of each scale's reverse-scored items was moderated by age, years of education, or both. The straightforward items did not show this pattern. To encourage the use of the straightforward items of these scales, we provide normative data from the same samples as well as two large student samples. We contend that although response bias can be a substantial problem, the reverse-scored questions of these scales do not solve that problem and instead decrease overall validity. Copyright © 2011 Elsevier Ltd. All rights reserved.
[The keys to success in French Medical National Ranking Examination: Integrated training activities in teaching hospital and medical school].

PubMed

Gillois, Pierre; Fourcot, Marie; Genty, Céline; Morand, Patrice; Bosson, Jean-Luc

2015-12-01

The National Ranking Examination (NRE) is the key to the choice of career and specialty for future physicians; it lets them choose their specialty and the hospital for their internship. It is of interest to model the factors of success in this exam for medical students from Grenoble University. For each medical student at Grenoble University who sat the NRE in 2012, data on academic background and personal details were collected from the University administration. A simple logistic regression, with success defined as being ranked in the first 2000 students, and then a polytomous logistic regression were performed. The 191 students in the models were 59% female and 25 years old on average (SD 1.8). The factors associated with a ranking in the first 2000 were: not repeating the PCEM1 class (odds ratio [OR] 2.63, CI95: [1.26; 5.56]), performing nurse practice during internships (OR=1.27 [1.00; 1.62]), being ranked in the first half of the class for the S3 pole (OR=6.04 [1.21; 30.20] for the first quarter, OR=5.65 [1.15; 27.74] for the second quarter) and being in the first quarter at the T5 pole (OR=3.42 [1.08; 10.82]). Our study finds four factors independently contributing to success at the NRE: not repeating PCEM1, performing nurse practice, and being ranked at the top of the class in certain academic fields. The AUC is 0.76 and student accuracy is more than 80%. However, some items, for example repeating DCEM4 or participating in NRE mock exams, have no influence on success. A difference in motivation may be part of the explanation… As the analysed data are mainly institutional, they are accurate and reliable. The polytomous logistic model, which shares 3 factors with the simple logistic model, replaces the nurse practice factor with a grant recipient factor. Copyright © 2015 Elsevier Masson SAS. All rights reserved.
Team-based learning on a third-year pediatric clerkship improves NBME subject exam blood disorder scores.

PubMed

Saudek, Kris; Treat, Robert

2015-01-01

Purpose At our institution, speculation amongst medical students and faculty exists as to whether team-based learning (TBL) can improve scores on high-stakes examinations over traditional didactic lectures. Faculty with experience using TBL developed and piloted a required TBL blood disorders (BD) module for third-year medical students on their pediatric clerkship. The purpose of this study is to analyze the BD scores from the NBME subject exams before and after the introduction of the module. Methods We analyzed institutional and national item difficulties for BD items from the NBME pediatrics content area item analysis reports from 2011 to 2014 before (pre) and after (post) the pilot (October 2012). Total scores of 590 NBME subject examination students from examinee performance profiles were analyzed pre/post. t-Tests and Cohen's d effect sizes were used to analyze item difficulties for institutional versus national scores and pre/post comparisons of item difficulties and total scores. Results BD scores for our institution were 0.65 (±0.19) compared to 0.62 (±0.15) nationally (P=0.346; Cohen's d=0.15). The average of post-consecutive BD scores for our students was 0.70 (±0.21) compared to examinees nationally [0.64 (±0.15)] with a significant mean difference (P=0.031; Cohen's d=0.43). The difference in our institution's pre [0.65 (±0.19)] and post [0.70 (±0.21)] BD scores trended higher (P=0.391; Cohen's d=0.27). Conclusions Institutional BD scores were higher than national BD scores for both pre and post, with an effect size that tripled from pre to post scores. Institutional BD scores increased after the use of the TBL module, while overall exam scores remained steadily above national norms.

Characterizing Sources of Uncertainty in Item Response Theory Scale Scores

ERIC Educational Resources Information Center

Yang, Ji Seung; Hansen, Mark; Cai, Li

2012-01-01

Traditional estimators of item response theory scale scores ignore uncertainty carried over from the item calibration process, which can lead to incorrect estimates of the standard errors of measurement (SEMs). Here, the authors review a variety of approaches that have been applied to this problem and compare them on the basis of their statistical…
Global, Local, and Graphical Person-Fit Analysis Using Person-Response Functions

ERIC Educational Resources Information Center

Emons, Wilco H. M.; Sijtsma, Klaas; Meijer, Rob R.

2005-01-01

Person-fit statistics test whether the likelihood of a respondent's complete vector of item scores on a test is low given the hypothesized item response theory model. This binary information may be insufficient for diagnosing the cause of a misfitting item-score vector. The authors propose a comprehensive methodology for person-fit analysis in the…
Sex Differences in the Tendency to Omit Items on Multiple-Choice Tests: 1980-2000

ERIC Educational Resources Information Center

von Schrader, Sarah; Ansley, Timothy

2006-01-01

Much has been written concerning the potential group differences in responding to multiple-choice achievement test items. This discussion has included references to possible disparities in tendency to omit such test items. When test scores are used for high-stakes decision making, even small differences in scores and rankings that arise from male…
The Effect of Guessing on Item Reliability under Answer-Until-Correct Scoring

ERIC Educational Resources Information Center

Kane, Michael; Moloney, James

1978-01-01

The answer-until-correct (AUC) procedure requires that examinees respond to a multi-choice item until they answer it correctly. Using a modified version of Horst's model for examinee behavior, this paper compares the effect of guessing on item reliability for the AUC procedure and the zero-one scoring procedure. (Author/CTM)
Measurement Error in Nonparametric Item Response Curve Estimation. Research Report. ETS RR-11-28

ERIC Educational Resources Information Center

Guo, Hongwen; Sinharay, Sandip

2011-01-01

Nonparametric, or kernel, estimation of item response curve (IRC) is a concern theoretically and operationally. Accuracy of this estimation, often used in item analysis in testing programs, is biased when the observed scores are used as the regressor because the observed scores are contaminated by measurement error. In this study, we investigate…
Effect of Clinically Discriminating, Evidence-Based Checklist Items on the Reliability of Scores from an Internal Medicine Residency OSCE

ERIC Educational Resources Information Center

Daniels, Vijay J.; Bordage, Georges; Gierl, Mark J.; Yudkowsky, Rachel

2014-01-01

Objective structured clinical examinations (OSCEs) are used worldwide for summative examinations but often lack acceptable reliability. Research has shown that reliability of scores increases if OSCE checklists for medical students include only clinically relevant items. Also, checklists are often missing evidence-based items that high-achieving…
Sensitivity of Equated Aggregate Scores to the Treatment of Misbehaving Common Items

ERIC Educational Resources Information Center

Michaelides, Michalis P.

2010-01-01

The delta-plot method (Angoff, 1972) is a graphical technique used in the context of test equating for identifying common items with aberrant changes in their item difficulties across administrations or alternate forms. This brief research report explores the effects on equated aggregate scores when delta-plot outliers are either retained in or…
Linking Physical and Mental Health Summary Scores from the Veterans RAND 12-Item Health Survey (VR-12) to the PROMIS(®) Global Health Scale.

PubMed

Schalet, Benjamin D; Rothrock, Nan E; Hays, Ron D; Kazis, Lewis E; Cook, Karon F; Rutsohn, Joshua P; Cella, David

2015-10-01

Global health measures represent an attractive option for researchers and clinicians seeking a brief snapshot of a patient's overall perspective on his or her health. Because scores on different global health measures are not comparable, comparative effectiveness research (CER) is challenging. The objective was to establish a common reporting metric so that the physical and mental health scores on the Veterans RAND 12-Item Health Survey (VR-12(©)) can be converted into the corresponding Patient Reported Outcomes Measurement Information System (PROMIS(®)) Global Health scores. Following a single-sample linking design, participants from an Internet panel completed items from the PROMIS Global Health and VR-12 Health Survey. A common metric was created using analyses based on item response theory (IRT), producing score cross-walk tables for the mental and physical health components of each measure. The linking relationships were evaluated by calculating the standard deviation of differences between the observed and linked PROMIS scores and estimating confidence intervals by sample size. Participants (N = 2025) were 49% male and 73% white; mean age was 46 years. The measures were the mental and physical health subscales of the PROMIS Global Health and the VR-12. The mean VR-12 physical component and mental component scores were 45.2 and 46.6, respectively; the mean PROMIS physical and mental health scores were 48.3 and 48.5, respectively. We found evidence that the combined set of VR-12 and PROMIS items was relatively unidimensional and that we could proceed with linking. Linking worked better between the physical health than mental health scores using VR-12 item responses (vs. linking based on algorithmic scores). For each of the cross-walks, users can minimize the impact of linking error with modest increases in sample sizes. VR-12 scores can be expressed on the PROMIS Global Health metric to facilitate the evaluation of treatment, including CER. Extending these results to other common measures of global health is encouraged.
The prosocial and aggressive driving inventory (PADI): a self-report measure of safe and unsafe driving behaviors.

PubMed

Harris, Paul B; Houston, John M; Vazquez, Jose A; Smither, Janan A; Harms, Amanda; Dahlke, Jeffrey A; Sachau, Daniel A

2014-11-01

Surveys of 1217 undergraduate students supported the reliability (inter-item and test-retest) and validity of the Prosocial and Aggressive Driving Inventory (PADI). Principal component analyses on the PADI items yielded two scales: Prosocial Driving (17 items) and Aggressive Driving (12 items). Prosocial Driving was associated with fewer reported traffic accidents and violations, with participants who were older and female, and with lower Boredom Susceptibility and Hostility scores, and higher scores on Agreeableness, Conscientiousness, Openness, and Neuroticism. Aggressive Driving was associated with more frequent traffic violations, with female participants, and with higher scores on Competitiveness, Sensation Seeking, Hostility, and Extraversion, and lower scores on Conscientiousness, Agreeableness, and Openness. The theoretical and practical implications of the PADI's dual focus on safe and unsafe driving are discussed. Copyright © 2014 Elsevier Ltd. All rights reserved.
Development of a brachytherapy audit checklist tool.

PubMed

Prisciandaro, Joann; Hadley, Scott; Jolly, Shruti; Lee, Choonik; Roberson, Peter; Roberts, Donald; Ritter, Timothy

2015-01-01

To develop a brachytherapy audit checklist that could be used to prepare for Nuclear Regulatory Commission or agreement state inspections, to aid in readiness for a practice accreditation visit, or to be used as an annual internal audit tool. Six board-certified medical physicists and one radiation oncologist conducted a thorough review of brachytherapy-related literature and practice guidelines published by professional organizations and federal regulations. The team members worked at two facilities that are part of a large, academic health care center. Checklist items were given a score based on their judged importance. Four clinical sites performed an audit of their program using the checklist. The sites were asked to score each item based on a defined severity scale for their noncompliance, and final audit scores were tallied by summing the products of importance score and severity score for each item. The final audit checklist, which is available online, contains 83 items. The audit scores from the beta sites ranged from 17 to 71 (out of 690) and identified a total of 7-16 noncompliance items. The total time to conduct the audit ranged from 1.5 to 5 hours. A comprehensive audit checklist was developed which can be implemented by any facility that wishes to perform a program audit in support of their own brachytherapy program. The checklist is designed to allow users to identify areas of noncompliance and to prioritize how these items are addressed to minimize deviations from nationally-recognized standards. Copyright © 2015 American Brachytherapy Society. All rights reserved.
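The tallying rule described in this abstract (summing the products of importance score and severity score over checklist items) can be sketched as follows. This is a minimal illustration only: the function name `audit_score` and the example values are hypothetical, and the actual importance and severity scales used by the checklist are not given in the abstract.

```python
from typing import Iterable, Tuple


def audit_score(items: Iterable[Tuple[int, int]]) -> int:
    """Return the sum of importance * severity over all checklist items.

    Each item is an (importance, severity) pair. The numeric scales are
    assumptions for illustration; the abstract does not specify them.
    """
    return sum(importance * severity for importance, severity in items)


# Hypothetical example: three checklist items as (importance, severity).
example_items = [(3, 0), (2, 1), (5, 2)]
total = audit_score(example_items)  # 3*0 + 2*1 + 5*2
print(total)
```

A fully compliant item (severity 0 in this sketch) contributes nothing to the total, so under this assumption a lower audit score indicates fewer or less severe noncompliance findings.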
Introducing the Postsecondary Instructional Practices Survey (PIPS): A Concise, Interdisciplinary, and Easy-to-Score Survey.

PubMed

Walter, Emily M; Henderson, Charles R; Beach, Andrea L; Williams, Cody T

Researchers, administrators, and policy makers need valid and reliable information about teaching practices. The Postsecondary Instructional Practices Survey (PIPS) is designed to measure the instructional practices of postsecondary instructors from any discipline. The PIPS has 24 instructional practice statements and nine demographic questions. Users calculate PIPS scores by an intuitive proportion-based scoring convention. Factor analyses from 72 departments at four institutions (N = 891) support a 2- or 5-factor solution for the PIPS; both models include all 24 instructional practice items and have good model fit statistics. Factors in the 2-factor model include (a) instructor-centered practices, nine items; and (b) student-centered practices, 13 items. Factors in the 5-factor model include (a) student-student interactions, six items; (b) content delivery, four items; (c) formative assessment, five items; (d) student-content engagement, five items; and (e) summative assessment, four items. In this article, we describe our development and validation processes, provide scoring conventions and outputs for results, and describe wider applications of the instrument. © 2016 E. M. Walter et al. CBE—Life Sciences Education © 2016 The American Society for Cell Biology. This article is distributed by The American Society for Cell Biology under license from the author(s). It is available to the public under an Attribution–Noncommercial–Share Alike 3.0 Unported Creative Commons License (http://creativecommons.org/licenses/by-nc-sa/3.0).
Using Patient Health Questionnaire-9 item parameters of a common metric resulted in similar depression scores compared to independent item response theory model reestimation.

PubMed

Liegl, Gregor; Wahl, Inka; Berghöfer, Anne; Nolte, Sandra; Pieh, Christoph; Rose, Matthias; Fischer, Felix

2016-03-01

To investigate the validity of a common depression metric in independent samples. We applied a common metrics approach based on item-response theory for measuring depression to four German-speaking samples that completed the Patient Health Questionnaire (PHQ-9). We compared the PHQ item parameters reported for this common metric to reestimated item parameters that derived from fitting a generalized partial credit model solely to the PHQ-9 items. We calibrated the new model on the same scale as the common metric using two approaches (estimation with shifted prior and Stocking-Lord linking). By fitting a mixed-effects model and using Bland-Altman plots, we investigated the agreement between latent depression scores resulting from the different estimation models. We found different item parameters across samples and estimation methods. Although differences in latent depression scores between different estimation methods were statistically significant, these were clinically irrelevant. Our findings provide evidence that it is possible to estimate latent depression scores by using the item parameters from a common metric instead of reestimating and linking a model. The use of common metric parameters is simple, for example, using a Web application (http://www.common-metrics.org) and offers a long-term perspective to improve the comparability of patient-reported outcome measures. Copyright © 2016 Elsevier Inc. All rights reserved.
Wiggins Content Scales and the MMPI-2.

PubMed

Kohutek, K J

1992-03-01

The omission of the Wiggins Content Scales occurred because of the number of items deleted as well as the addition of items to the MMPI-2. The purpose of this study is to compare scorings of the items on the Wiggins Scales of the MMPI and the items that remain on these scales on the MMPI-2. The scales of Religious Fundamentalism and Authority Conflict appear to be those most seriously affected by the item change on the MMPI-2. The scales Depression and Family Conflict maintained all of their items, and the remaining nine were not found to be statistically different when the two scorings were compared.
Rating the methodological quality in systematic reviews of studies on measurement properties: a scoring system for the COSMIN checklist.

PubMed

Terwee, Caroline B; Mokkink, Lidwine B; Knol, Dirk L; Ostelo, Raymond W J G; Bouter, Lex M; de Vet, Henrica C W

2012-05-01

The COSMIN checklist is a standardized tool for assessing the methodological quality of studies on measurement properties. It contains 9 boxes, each dealing with one measurement property, with 5-18 items per box about design aspects and statistical methods. Our aim was to develop a scoring system for the COSMIN checklist to calculate quality scores per measurement property when using the checklist in systematic reviews of measurement properties. The scoring system was developed based on discussions among experts and testing of the scoring system on 46 articles from a systematic review. Four response options were defined for each COSMIN item (excellent, good, fair, and poor). A quality score per measurement property is obtained by taking the lowest rating of any item in a box ("worst score counts"). Specific criteria for excellent, good, fair, and poor quality for each COSMIN item are described. In defining the criteria, the "worst score counts" algorithm was taken into consideration. This means that only fatal flaws were defined as poor quality. The scores of the 46 articles show how the scoring system can be used to provide an overview of the methodological quality of studies included in a systematic review of measurement properties. Based on experience in testing this scoring system on 46 articles, the COSMIN checklist with the proposed scoring system seems to be a useful tool for assessing the methodological quality of studies included in systematic reviews of measurement properties.
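The "worst score counts" rule described in this abstract can be sketched as taking the minimum rating within a box. The four rating labels come from the abstract; the function name `box_quality` and the example ratings are hypothetical.

```python
# Ratings ordered from worst to best, as named in the abstract.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def box_quality(item_ratings):
    """Return the lowest rating among the items of one box.

    Implements the "worst score counts" idea: the quality score for a
    measurement property is the worst rating given to any item in its box.
    """
    if not item_ratings:
        raise ValueError("a box needs at least one rated item")
    # min() with the position in RATING_ORDER as the key picks the worst rating.
    return min(item_ratings, key=RATING_ORDER.index)


# Hypothetical ratings for the items of a single box.
print(box_quality(["excellent", "good", "fair", "excellent"]))
```

One consequence of this design is that a single poorly rated item determines the score for the whole box, which is consistent with the abstract's note that only fatal flaws were defined as poor quality.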
Effect of Item Arrangement, Knowledge of Arrangement, and Test Anxiety on Two Scoring Methods.

ERIC Educational Resources Information Center

Plake, Barbara S.; And Others

1981-01-01

Number right and elimination scores were analyzed on a college level mathematics exam assembled from pretest data. Anxiety measures were administered along with the experimental forms to undergraduates. Results suggest that neither test scores nor attitudes are influenced by item order, knowledge thereof, or anxiety level. (Author/GK)
Analysis of Open-Ended Statistics Questions with Many Facet Rasch Model

ERIC Educational Resources Information Center

Güler, Nese

2014-01-01

Problem Statement: The most significant disadvantage of open-ended items that allow the valid measurement of upper level cognitive behaviours, such as synthesis and evaluation, is scoring. The difficulty associated with objectively scoring the answers to the items contributes to the reduction of the reliability of the scores. Moreover, other…
Recursive Partitioning to Identify Potential Causes of Differential Item Functioning in Cross-National Data

ERIC Educational Resources Information Center

Finch, W. Holmes; Hernández Finch, Maria E.; French, Brian F.

2016-01-01

Differential item functioning (DIF) assessment is key in score validation. When DIF is present scores may not accurately reflect the construct of interest for some groups of examinees, leading to incorrect conclusions from the scores. Given rising immigration, and the increased reliance of educational policymakers on cross-national assessments…
A Rasch measure of teachers' views of teacher-student relationships in the primary school.

PubMed

Leitao, Natalie; Waugh, Russell F

2012-01-01

This study investigated teacher-student relationships from the teachers' point of view at Perth metropolitan schools in Western Australia. The study identified three key social and emotional aspects that affect teacher-student relationships, namely, Connectedness, Availability and Communication. Data were collected by questionnaire (N = 139) with stem-items answered in three perspectives: (1) Idealistic: this is what I would like to happen; (2) Capability: this is what I am capable of; and (3) Behaviour: this is what actually happens, using four ordered response categories: not at all (score 1), some of the time (score 2), most of the time (score 3), and almost always (score 4). Data were analysed with a Rasch measurement model and a uni-dimensional, linear scale with 24 items, ordered from easy to hard, was created. The data were shown to be highly reliable, so that valid inferences could be made from the scale. The Person Separation Index (akin to a reliability index) was 0.93; there was good global teacher and item fit to the measurement model; there was good item fit; the targeting of the item difficulties against the teacher measures was good, and the response categories were answered consistently and logically. Teachers said that the ideal items were all easier than their corresponding capability items which were in turn easier than the behaviour items (where the items fitted the model), as conceptualized. The easiest ideal items were: I like this child and This child and I get along well together. The hardest ideal item (but still easy) was: I am available for this child. The easiest behaviour item (but still hard) was: This child and I get along well together. The hardest behaviour item (and very hard) was: I am interested to learn about this child's personal thoughts, feelings and experiences. The difficulties of the items supported the conceptual structure of the variable.
Quality and quantity of information in summary basis of decision documents issued by Health Canada.

PubMed

Habibi, Roojin; Lexchin, Joel

2014-01-01

Health Canada's Summary Basis of Decision (SBD) documents outline the clinical trial information that was considered in approving a new drug. We examined the ability of SBDs to inform clinician decision-making. We asked if SBDs answered three questions that clinicians might have prior to prescribing a new drug: 1) Do the characteristics of patients enrolled in trials match those of patients in their practice? 2) What are the details concerning the drug's risks and benefits? 3) What are the basic characteristics of trials? 14 items of clinical trial information were identified from all SBDs published on or before April 2012. Each item received a score of 2 (present), 1 (unclear) or 0 (absent). The unit of analysis was the individual SBD, and an overall SBD score was derived based on the sum of points for each item. Scores were expressed as a percentage of the maximum possible points, and then classified into five descriptive categories based on that score. Additionally, three overall 'component' scores were tallied for each SBD: "patient characteristics", "benefit/risk information" and "basic trial characteristics". 161 documents, spanning 456 trials, were analyzed. The majority (126/161) were rated as having information sometimes present (score of >33 to 66%). No SBDs had either no information on any item, or 100% of the information. Items in the patient characteristics component scored poorest (mean component score of 40.4%), while items corresponding to basic trial information were most frequently provided (mean component score of 71%). The significant omissions in the level of clinical trial information in SBDs provide little to aid clinicians in their decision-making. Clinicians' preferred source of information is scientific knowledge, but in Canada, access to such information is limited. Consequently, we believe that clinicians are being denied crucial tools for decision-making.
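The scoring scheme described in this abstract (each item scored 2, 1 or 0, summed and expressed as a percentage of the maximum possible points) can be sketched as below. The function names and the example scores are hypothetical, and only the one descriptive band quoted in the abstract (>33 to 66%) is encoded; the other four categories are not specified there.

```python
def sbd_percent(item_scores):
    """Return the overall score as a percentage of the maximum possible points.

    Each item is scored 2 (present), 1 (unclear) or 0 (absent), so the
    maximum is 2 points per item.
    """
    if not item_scores:
        raise ValueError("at least one item score is required")
    if any(score not in (0, 1, 2) for score in item_scores):
        raise ValueError("item scores must be 0, 1 or 2")
    return 100.0 * sum(item_scores) / (2 * len(item_scores))


def is_sometimes_present(percent):
    """True if the score falls in the '>33 to 66%' band quoted in the abstract."""
    return 33 < percent <= 66


# Hypothetical document with 14 items: 5 present, 4 unclear, 5 absent.
scores = [2] * 5 + [1] * 4 + [0] * 5
percent = sbd_percent(scores)
print(percent, is_sometimes_present(percent))
```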
Item validity vs. item discrimination index: a redundancy?

NASA Astrophysics Data System (ADS)

Panjaitan, R. L.; Irawati, R.; Sujana, A.; Hanifah, N.; Djuanda, D.

2018-03-01

In the literature on evaluation and test analysis, it is common to find item validity and the item discrimination index (D) calculated with a different formula for each. Other sources, however, state that the item discrimination index can be obtained by calculating the correlation between the testee's score on a particular item and the testee's score on the overall test, which is actually the same concept as item validity. Some research reports, especially undergraduate theses, tend to include both item validity and item discrimination index in the instrument analysis. These concepts may overlap, since both reflect the quality of the test in measuring the examinees' ability. In this paper, examples of data-processing results for item validity and item discrimination index are compared. We discuss whether the two can be represented by only one of them, or whether it is better to present both calculations for simple test analysis, especially in undergraduate theses that include test analyses.
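The two quantities contrasted in this abstract can be sketched side by side. This is an illustrative sketch, not the authors' procedure: the upper-lower group form of D and the 27% group fraction are assumed conventions, the item-total correlation is an ordinary Pearson correlation, and the function names and data are hypothetical.

```python
from math import sqrt


def item_total_correlation(item, totals):
    """Pearson correlation between item scores and total test scores."""
    n = len(item)
    mean_i = sum(item) / n
    mean_t = sum(totals) / n
    cov = sum((i - mean_i) * (t - mean_t) for i, t in zip(item, totals))
    var_i = sum((i - mean_i) ** 2 for i in item)
    var_t = sum((t - mean_t) ** 2 for t in totals)
    return cov / sqrt(var_i * var_t)


def discrimination_index(item, totals, fraction=0.27):
    """Upper-lower discrimination index D = p_upper - p_lower.

    Examinees are ranked by total score; the top and bottom `fraction`
    form the upper and lower groups (27% is an assumed convention).
    """
    n = len(item)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda j: totals[j])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item[j] for j in upper) / k
    p_lower = sum(item[j] for j in lower) / k
    return p_upper - p_lower


# Hypothetical data: ten examinees, one dichotomous item, total test scores.
item = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
totals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(discrimination_index(item, totals))
print(item_total_correlation(item, totals))
```

In this toy data set both statistics are positive, which illustrates the overlap the abstract raises: each rewards items that high scorers answer correctly more often than low scorers.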

The Impact of Settable Test Item Exposure Control Interface Format on Postsecondary Business Student Test Performance

ERIC Educational Resources Information Center

Truell, Allen D.; Zhao, Jensen J.; Alexander, Melody W.

2005-01-01

The purposes of this study were to determine if there is a significant difference in postsecondary business student scores and test completion time based on settable test item exposure control interface format, and to determine if there is a significant difference in student scores and test completion time based on settable test item exposure…
Real Time Cockpit Resource Management (CRM) Training

DTIC Science & Technology

2010-10-01

to post-test. Table 4 Learning Scores for the Five Spiral 1 Classes Spiral 1 Class Pilots Sensors Pretest Posttest Difference Pretest Posttest ...results from the five Spiral 1 classes. Table 6 Pretest / Posttest Gain Scores Associated with Each Learning Test Item Test Item Class Item...SMALL BUSINESS INNOVATION RESEARCH (SBIR) PHASE II REPORT. Distribution A: Approved for public release; distribution unlimited. (Approval given
Modified Scoring, Traditional Item Analysis, and Sato's Caution Index Used To Investigate the Reading Recall Protocol.

ERIC Educational Resources Information Center

Deville, Craig W.; Chalhoub-Deville, Micheline

A study demonstrated the utility of item analyses to investigate which items function well or poorly in a second language reading recall protocol instrument. Data were drawn from a larger study of 56 learners of German as a second language at various proficiency levels. Pausal units of scored recall protocols were analyzed using both classical…
Using existing questionnaires in latent class analysis: should we use summary scores or single items as input? A methodological study using a cohort of patients with low back pain.

PubMed

Nielsen, Anne Molgaard; Vach, Werner; Kent, Peter; Hestbaek, Lise; Kongsted, Alice

2016-01-01

Latent class analysis (LCA) is increasingly being used in health research, but optimal approaches to handling complex clinical data are unclear. One issue is that commonly used questionnaires are multidimensional, but expressed as summary scores. Using the example of low back pain (LBP), the aim of this study was to explore and descriptively compare the application of LCA when using questionnaire summary scores and when using single items for the subgrouping of patients based on multidimensional data. Baseline data from 928 LBP patients in an observational study were classified into four health domains (psychology, pain, activity, and participation) using the World Health Organization's International Classification of Functioning, Disability, and Health framework. LCA was performed within each health domain using the strategies of summary-score and single-item analyses. The resulting subgroups were descriptively compared using statistical measures and clinical interpretability. For each health domain, the preferred model solution ranged from five to seven subgroups for the summary-score strategy and seven to eight subgroups for the single-item strategy. There was considerable overlap between the results of the two strategies, indicating that they were reflecting the same underlying data structure. However, in three of the four health domains, the single-item strategy resulted in a more nuanced description, in terms of more subgroups and more distinct clinical characteristics. In these data, application of both the summary-score strategy and the single-item strategy in the LCA subgrouping resulted in clinically interpretable subgroups, but the single-item strategy generally revealed more distinguishing characteristics. These results 1) warrant further analyses in other data sets to determine the consistency of this finding, and 2) warrant investigation in longitudinal data to test whether the finer detail provided by the single-item strategy results in improved prediction of outcomes and treatment response.
Test-retest stability of the Task and Ego Orientation Questionnaire.

PubMed

Lane, Andrew M; Nevill, Alan M; Bowes, Neal; Fox, Kenneth R

2005-09-01

Establishing stability, defined as observing minimal measurement error in a test-retest assessment, is vital to validating psychometric tools. Correlational methods, such as Pearson product-moment, intraclass, and kappa are tests of association or consistency, whereas stability or reproducibility (regarded here as synonymous) assesses the agreement between test-retest scores. Indexes of reproducibility using the Task and Ego Orientation in Sport Questionnaire (TEOSQ; Duda & Nicholls, 1992) were investigated using correlational (Pearson product-moment, intraclass, and kappa) methods, repeated measures multivariate analysis of variance, and calculating the proportion of agreement within a referent value of +/-1 as suggested by Nevill, Lane, Kilgour, Bowes, and Whyte (2001). Two hundred thirteen soccer players completed the TEOSQ on two occasions, 1 week apart. Correlation analyses indicated a stronger test-retest correlation for the Ego subscale than the Task subscale. Multivariate analysis of variance indicated stability for ego items but with significant increases in four task items. The proportion of test-retest agreement scores indicated that all ego items reported relatively poor stability statistics with test-retest scores within a range of +/-1, ranging from 82.7-86.9%. By contrast, all task items showed test-retest difference scores ranging from 92.5-99%, although further analysis indicated that four task subscale items increased significantly. Findings illustrated that correlational methods (Pearson product-moment, intraclass, and kappa) are influenced by the range in scores, and calculating the proportion of agreement of test-retest differences with a referent value of +/-1 could provide additional insight into the stability of the questionnaire. It is suggested that the item-by-item proportion of agreement method proposed by Nevill et al. (2001) should be used to supplement existing methods and could be especially helpful in identifying rogue items in the initial stages of psychometric questionnaire validation.
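The proportion-of-agreement idea described in this abstract can be sketched in a few lines. The following is a minimal illustration only, not the authors' procedure: the function name `proportion_within` and the example scores are hypothetical, and it simply counts how many test-retest differences for one item fall within a referent value of +/-1.

```python
def proportion_within(test, retest, tolerance=1):
    """Share of respondents whose test-retest difference is within +/-tolerance.

    `test` and `retest` are equal-length sequences of item scores for the
    same respondents on the two occasions (hypothetical input layout).
    """
    if len(test) != len(retest) or len(test) == 0:
        raise ValueError("test and retest must be non-empty and equally long")
    hits = sum(1 for a, b in zip(test, retest) if abs(a - b) <= tolerance)
    return hits / len(test)


# Made-up scores for a single 5-point item, eight respondents.
test_scores = [4, 5, 3, 2, 4, 5, 1, 3]
retest_scores = [4, 4, 3, 4, 5, 5, 1, 2]

agreement = proportion_within(test_scores, retest_scores)
print(f"Proportion within +/-1: {agreement:.3f}")
```

Applied item by item, a value well below those of the other items would flag a candidate "rogue" item in the sense used above.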
Revised multicultural perspective index and measures of depression, life satisfaction, shyness, and self-esteem.

PubMed

Mowrer, Robert R; Parker, Keesha N

2004-12-01

In a 2002 publication, Mowrer and McCarver reported weak but significant correlations (r = .24) between scores on the Multicultural Perspective Index and scores on Neugarten, Havighurst, and Tobin's 1961 Life Satisfaction Index-A and the Life Satisfaction Scale developed in 1985 by Diener, Emmons, Larsen, and Griffin. Using 382 undergraduate students, the present study reduced the Index from 42 to 29 items based on each item's correlation with total items. An additional 104 undergraduate students then completed the modified 29-item version, Rosenberg's Self-esteem Scale, Cheek and Buss's Shyness Scale, the Self-rating Depression Scale by Zung, and the Neugarten et al. Life Satisfaction Index-A. Scores on the modified Index were negatively correlated with those on the Depression and Shyness scales and positively correlated with scores on the Self-esteem and Life Satisfaction scales (p < .05).
Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales

ERIC Educational Resources Information Center

Topczewski, Anna Marie

2013-01-01

Developmental score scales represent the performance of students along a continuum, where as students learn more they move higher along that continuum. Unidimensional item response theory (UIRT) vertical scaling has become a commonly used method to create developmental score scales. Research has shown that UIRT vertical scaling methods can be…
Using Empirical Data to Set Cutoff Scores.

ERIC Educational Resources Information Center

Hills, John R.

Six experimental approaches to the problems of setting cutoff scores and choosing proper test length are briefly mentioned. Most of these methods share the premise that a test is a random sample of items, from a domain associated with a carefully specified objective. Each item is independent and is scored zero or one, with no provision for…
Prediction of true test scores from observed item scores and ancillary data.

PubMed

Haberman, Shelby J; Yao, Lili; Sinharay, Sandip

2015-05-01

In many educational tests which involve constructed responses, a traditional test score is obtained by adding together item scores obtained through holistic scoring by trained human raters. For example, this practice was used until 2008 in the case of GRE® General Analytical Writing and until 2009 in the case of TOEFL® iBT Writing. With use of natural language processing, it is possible to obtain additional information concerning item responses from computer programs such as e-rater®. In addition, available information relevant to examinee performance may include scores on related tests. We suggest application of standard results from classical test theory to the available data to obtain best linear predictors of true traditional test scores. In performing such analysis, we require estimation of variances and covariances of measurement errors, a task which can be quite difficult in the case of tests with limited numbers of items and with multiple measurements per item. As a consequence, a new estimation method is suggested based on samples of examinees who have taken an assessment more than once. Such samples are typically not random samples of the general population of examinees, so that we apply statistical adjustment methods to obtain the needed estimated variances and covariances of measurement errors. To examine practical implications of the suggested methods of analysis, applications are made to GRE General Analytical Writing and TOEFL iBT Writing. Results obtained indicate that substantial improvements are possible both in terms of reliability of scoring and in terms of assessment reliability. © 2015 The British Psychological Society.
Reliability of the Adult Myopathy Assessment Tool in Individuals with Myositis

PubMed Central

Harris-Love, Michael O.; Joe, Galen; Davenport, Todd E.; Koziol, Deloris; Rose, Kristen Abbett; Shrader, Joseph A.; Vasconcelos, Olavo M.; McElroy, Beverly; Dalakas, Marinos C.

2015-01-01

Objective: The Adult Myopathy Assessment Tool (AMAT) is a 13-item performance-based battery developed to assess functional status and muscle endurance. The purpose of this study was to determine the intrarater and interrater reliability of the AMAT in adults with myositis. Methods: Nineteen raters (13 physical therapists and 6 physicians) scored videotaped recordings of patients with myositis performing the AMAT for a total of 114 tests and 1,482 item observations per session. Raters rescored the AMAT test and item observations during a follow up session (19 ±6 days between scoring sessions). All raters completed a single, self-directed, electronic training module prior to the initial scoring session. Results: Intrarater and interrater reliability correlation coefficients were .94 or greater for the AMAT Functional Subscale, Endurance Subscale, and Total score (all p < 0.02 for Ho:ρ ≤ 0.75). All AMAT items had satisfactory intrarater agreement (Kappa statistics with Fleiss-Cohen weights, Kw = .57-1.00). Interrater agreement was acceptable for each AMAT item (K = .56-.89) except the sit up (K = .16). The standard error of measurement and 95% confidence interval range for the AMAT Total scores did not exceed 2 points across all observations (AMAT Total score range = 0-45). Conclusions: The AMAT is a reliable, domain-specific assessment of functional status and muscle endurance for adult subjects with myositis. Results of this study suggest that physicians and physical therapists may reliably score the AMAT following a single training session. The AMAT Functional Subscale, Endurance Subscale, and Total score exhibit interrater and intrarater reliability suitable for clinical and research use. PMID:25201624
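For readers unfamiliar with the kappa statistic with Fleiss-Cohen weights mentioned in this abstract, the sketch below computes a quadratically weighted kappa for two sets of ordinal ratings. It is an illustration under stated assumptions, not the study's analysis code: the function name `weighted_kappa`, the number of categories, and the example ratings are all hypothetical, and categories are assumed to be coded 0 to k-1.

```python
def weighted_kappa(ratings_a, ratings_b, n_categories):
    """Cohen's kappa with Fleiss-Cohen (quadratic) weights.

    Disagreement between categories i and j is weighted by
    (i - j)**2 / (k - 1)**2, so near-misses are penalised less
    than distant disagreements.
    """
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b) or n_categories < 2:
        raise ValueError("need two equally long, non-empty rating lists")
    k = n_categories

    # Cross-tabulate the two sets of ratings.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1
    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return 1.0 - weighted_observed / weighted_expected


# Made-up ratings of one item on a 0-3 scale at two scoring sessions.
session_1 = [0, 1, 2, 3, 2, 1, 3, 0]
session_2 = [0, 1, 2, 2, 2, 1, 3, 1]
print(round(weighted_kappa(session_1, session_2, n_categories=4), 3))
```

The same function applies to intrarater agreement (one rater, two sessions) and to interrater agreement (two raters, one session); only the pairing of the inputs changes.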
The stroke impairment assessment set: its internal consistency and predictive validity.

PubMed

Tsuji, T; Liu, M; Sonoda, S; Domen, K; Chino, N

2000-07-01

To study the scale quality and predictive validity of the Stroke Impairment Assessment Set (SIAS) developed for stroke outcome research. Rasch analysis of the SIAS; stepwise multiple regression analysis to predict discharge functional independence measure (FIM) raw scores from demographic data, the SIAS scores, and the admission FIM scores; cross-validation of the prediction rule. Tertiary rehabilitation center in Japan. One hundred ninety stroke inpatients for the study of the scale quality and the predictive validity; a second sample of 116 stroke inpatients for the cross-validation study. Mean square fit statistics to study the degree of fit to the unidimensional model; logits to express item difficulties; discharge FIM scores for the study of predictive validity. The degree of misfit was acceptable except for the shoulder range of motion (ROM), pain, visuospatial function, and speech items; and the SIAS items could be arranged on a common unidimensional scale. The difficulty patterns were identical at admission and at discharge except for the deep tendon reflexes, ROM, and pain items. They were also similar for the right- and left-sided brain lesion groups except for the speech and visuospatial items. For the prediction of the discharge FIM scores, the independent variables selected were age, the SIAS total scores, and the admission FIM scores; and the adjusted R2 was .64 (p < .0001). Stability of the predictive equation was confirmed in the cross-validation sample (R2 = .68, p < .001). The unidimensionality of the SIAS was confirmed, and the SIAS total scores proved useful for stroke outcome prediction.
Comparison of scoring approaches for the NEI VFQ-25 in low vision.

PubMed

Dougherty, Bradley E; Bullimore, Mark A

2010-08-01

The aim of this study was to evaluate different approaches to scoring the National Eye Institute Visual Functioning Questionnaire-25 (NEI VFQ-25) in patients with low vision including scoring by the standard method, by Rasch analysis, and by use of an algorithm created by Massof to approximate Rasch person measure. Subscale validity and use of a 7-item short form instrument proposed by Ryan et al. were also investigated. NEI VFQ-25 data from 50 patients with low vision were analyzed using the standard method of summing Likert-type scores and calculating an overall average, Rasch analysis using Winsteps software, and the Massof algorithm in Excel. Correlations between scores were calculated. Rasch person separation reliability and other indicators were calculated to determine the validity of the subscales and of the 7-item instrument. Scores calculated using all three methods were highly correlated, but evidence of floor and ceiling effects was found with the standard scoring method. None of the subscales investigated proved valid. The 7-item instrument showed acceptable person separation reliability and good targeting and item performance. Although standard scores and Rasch scores are highly correlated, Rasch analysis has the advantages of eliminating floor and ceiling effects and producing interval-scaled data. The Massof algorithm for approximation of the Rasch person measure performed well in this group of low-vision patients. The validity of the VFQ-25 subscales should be reconsidered.
[Development of competency to stand trial rating scale in offenders with mental disorders].

PubMed

Chen, Xiao-Bing; Cai, Wei-Xiong

2013-04-01

To develop, in accordance with the Chinese legal system, a competency to stand trial rating scale for offenders with mental disorders. Starting from the legal elements, 15 items were extracted and formulated into a preliminary instrument named the competency to stand trial rating scale in offenders with mental disorders. The item analysis covered six aspects: critical ratio, item-total correlation, corrected item-total correlation, alpha value if item deleted, communalities of items, and factor loading. A logistic regression equation and the cut-off score from the ROC curve were used to explore diagnostic efficiency. The critical ratios of the extreme groups were 18.390-46.763; item-total correlations, 0.639-0.952; corrected item-total correlations, 0.582-0.944; communalities of items, 0.377-0.916; and factor loadings, 0.614-0.957. Seven items were included in the regression equation, and the accuracy of the back substitution test was 96.0%. A score of 33 was identified as the cut-off score from the fitted ROC curve, and the rate of agreement with expert assessment was 95.8%. The sensitivity and specificity were 0.938 and 0.966, respectively, and the positive and negative likelihood ratios were 27.67 and 0.06, respectively. With all items satisfying the requirements of the homogeneity test, the rating scale has a reasonable construct and excellent diagnostic efficiency.
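The likelihood ratios reported in this abstract follow from the sensitivity and specificity by the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. The short sketch below (the function name `likelihood_ratios` is hypothetical) recomputes them from the rounded values given above; the result, about 27.6 and 0.06, is close to the reported 27.67 and 0.06, and the small gap is presumably because the published figures were derived from unrounded inputs.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a diagnostic cut-off.

    LR+ = sensitivity / (1 - specificity)
    LR- = (1 - sensitivity) / specificity
    """
    if not (0 < sensitivity < 1 and 0 < specificity < 1):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


# Rounded values taken from the abstract above.
lr_pos, lr_neg = likelihood_ratios(0.938, 0.966)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```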
Differential item functioning of the patient-reported outcomes information system (PROMIS®) pain interference item bank by language (Spanish versus English).

PubMed

Paz, Sylvia H; Spritzer, Karen L; Reise, Steven P; Hays, Ron D

2017-06-01

About 70% of Latinos, 5 years old or older, in the United States speak Spanish at home. Measurement equivalence of the PROMIS® pain interference (PI) item bank by language of administration (English versus Spanish) has not been evaluated. A sample of 527 adult Spanish-speaking Latinos completed the Spanish version of the 41-item PROMIS® pain interference item bank. We evaluate dimensionality, monotonicity and local independence of the Spanish-language items. Then we evaluate differential item functioning (DIF) using ordinal logistic regression with item response theory scores estimated from DIF-free "anchor" items. One of the 41 items in the Spanish version of the PROMIS® PI item bank was identified as having significant uniform DIF. English- and Spanish-speaking subjects with the same level of pain interference responded differently to 1 of the 41 items in the PROMIS® PI item bank. This item was not retained due to proprietary issues. The original English language item parameters can be used when estimating PROMIS® PI scores.
Undesired variance due to examiner stringency/leniency effect in communication skill scores assessed in OSCEs.

PubMed

Harasym, Peter H; Woloschuk, Wayne; Cunning, Leslie

2008-12-01

Physician-patient communication is a clinical skill that can be learned and has a positive impact on patient satisfaction and health outcomes. A concerted effort at all medical schools is now directed at teaching and evaluating this core skill. Student communication skills are often assessed by an Objective Structured Clinical Examination (OSCE). However, it is unknown what sources of error variance are introduced into examinee communication scores by various OSCE components. This study primarily examined the effect different examiners had on the evaluation of students' communication skills assessed at the end of a family medicine clerkship rotation. The communication performance of clinical clerks from Classes 2005 and 2006 was assessed using six OSCE stations. Performance was rated at each station using the 28-item Calgary-Cambridge guide. Item Response Theory analysis using a Multifaceted Rasch model was used to partition the various sources of error variance and generate a "true" communication score where the effects of examiner, case, and items are removed. Variance and reliability of scores were as follows: communication scores (.20 and .87), examiner stringency/leniency (.86 and .91), case (.03 and .96), and item (.86 and .99), respectively. All facet scores were reliable (.87-.99). Examiner variance (.86) was more than four times the examinee variance (.20). About 11% of the clerks' outcome status shifted using "true" rather than observed/raw scores. There was large variability in examinee scores due to variation in examiner stringency/leniency behaviors that may impact pass-fail decisions. Exploring the benefits of examiner training and employing "true" scores generated using Item Response Theory analyses prior to making pass/fail decisions are recommended.
Post Hoc Analyses of Anxiety Measures in Adult Patients With Generalized Anxiety Disorder Treated With Vilazodone

PubMed Central

Khan, Arif; Durgam, Suresh; Tang, Xiongwen; Ruth, Adam; Mathews, Maju; Gommoll, Carl P.

2016-01-01

Objective: To investigate vilazodone, currently approved for major depressive disorder in adults, for generalized anxiety disorder (GAD). Method: Three randomized, double-blind, placebo-controlled studies showing positive results for vilazodone (20-40 mg/d) in adult patients with GAD (DSM-IV-TR) were pooled for analyses; data were collected from June 2012 to March 2014. Post hoc outcomes in the pooled intent-to-treat population (n = 1,462) included mean change from baseline to week 8 in Hamilton Anxiety Rating Scale (HARS) total score, psychic and somatic anxiety subscale scores, and individual item scores; HARS response (≥ 50% total score improvement) and remission (total score ≤ 7) at week 8; and category shifts, defined as HARS item score ≥ 2 at baseline (moderate to very severe symptoms) and score of 0 at week 8 (no symptoms). Results: The least squares mean difference was statistically significant for vilazodone versus placebo in change from baseline to week 8 in HARS total score (−1.83, P < .0001) and in psychic anxiety (−1.21, P < .0001) and somatic anxiety (−0.63, P < .01) subscale scores; differences from placebo were significant on 11 of 14 HARS items (P < .05). Response rates were higher with vilazodone than placebo (48% vs 39%, P < .001), as were remission rates (27% vs 21%, P < .01). The percentage of patients who shifted to no symptoms was significant for vilazodone on several items: anxious mood, tension, intellectual, depressed mood, somatic-muscular, somatic-sensory, cardiovascular, respiratory, and autonomic symptoms (P < .05). Conclusions: Treatment with vilazodone versus placebo was effective in adult GAD patients, with significant differences between treatment groups found on both psychic and somatic HARS items. Trial Registration: ClinicalTrials.gov identifiers: NCT01629966, NCT01766401, NCT01844115. PMID:27486544
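The three post hoc outcome definitions in this abstract (response, remission, and category shift) are simple threshold rules, and a small sketch may make them concrete. The code below is illustrative only and is not taken from the study: the function names `hars_outcomes` and `category_shift` and the example scores are hypothetical, while the thresholds are those stated in the abstract.

```python
def hars_outcomes(baseline_total, week8_total):
    """Classify one patient using the abstract's definitions.

    response:  at least 50% improvement in HARS total score from baseline
    remission: HARS total score of 7 or less at week 8
    """
    if baseline_total <= 0:
        raise ValueError("baseline total must be positive to compute % improvement")
    improvement = (baseline_total - week8_total) / baseline_total
    return {
        "response": improvement >= 0.5,
        "remission": week8_total <= 7,
    }


def category_shift(item_baseline, item_week8):
    """True when an item moves from >= 2 at baseline to 0 at week 8."""
    return item_baseline >= 2 and item_week8 == 0


# Made-up patients: (baseline total, week 8 total).
print(hars_outcomes(24, 11))  # improved by more than half, but total still above 7
print(hars_outcomes(20, 6))   # meets both definitions
print(category_shift(3, 0))   # item shifted to "no symptoms"
```

Note that remission does not imply response under these rules: a patient starting from a low baseline could end at 7 or below without a 50% improvement.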
Post Hoc Analyses of Anxiety Measures in Adult Patients With Generalized Anxiety Disorder Treated With Vilazodone.

PubMed

Khan, Arif; Durgam, Suresh; Tang, Xiongwen; Ruth, Adam; Mathews, Maju; Gommoll, Carl P

2016-01-01

To investigate vilazodone, currently approved for major depressive disorder in adults, for generalized anxiety disorder (GAD). Three randomized, double-blind, placebo-controlled studies showing positive results for vilazodone (20-40 mg/d) in adult patients with GAD (DSM-IV-TR) were pooled for analyses; data were collected from June 2012 to March 2014. Post hoc outcomes in the pooled intent-to-treat population (n = 1,462) included mean change from baseline to week 8 in Hamilton Anxiety Rating Scale (HARS) total score, psychic and somatic anxiety subscale scores, and individual item scores; HARS response (≥ 50% total score improvement) and remission (total score ≤ 7) at week 8; and category shifts, defined as HARS item score ≥ 2 at baseline (moderate to very severe symptoms) and score of 0 at week 8 (no symptoms). The least squares mean difference was statistically significant for vilazodone versus placebo in change from baseline to week 8 in HARS total score (-1.83, P < .0001) and in psychic anxiety (-1.21, P < .0001) and somatic anxiety (-0.63, P < .01) subscale scores; differences from placebo were significant on 11 of 14 HARS items (P < .05). Response rates were higher with vilazodone than placebo (48% vs 39%, P < .001), as were remission rates (27% vs 21%, P < .01). The percentage of patients who shifted to no symptoms was significant for vilazodone on several items: anxious mood, tension, intellectual, depressed mood, somatic-muscular, somatic-sensory, cardiovascular, respiratory, and autonomic symptoms (P < .05). Treatment with vilazodone versus placebo was effective in adult GAD patients, with significant differences between treatment groups found on both psychic and somatic HARS items. ClinicalTrials.gov identifiers: NCT01629966, NCT01766401, NCT01844115.
Checklist content on a standardized patient assessment: an ex post facto review.

PubMed

Boulet, John R; van Zanten, Marta; de Champlain, André; Hawkins, Richard E; Peitzman, Steven J

2008-03-01

While checklists are often used to score standardized patient based clinical assessments, little research has focused on issues related to their development or the level of agreement with respect to the importance of specific items. Five physicians independently reviewed checklists from 11 simulation scenarios that were part of the former Educational Commission for Foreign Medical Graduate's Clinical Skills Assessment and classified the clinical appropriateness of each of the checklist items. Approximately 78% of the original checklist items were judged to be needed, or indicated, given the presenting complaint and the purpose of the assessment. Rater agreement was relatively poor with pairwise associations (Kappa coefficient) ranging from 0.09 to 0.29. However, when only consensus indicated items were included, there was little change in examinee scores, including their reliability over encounters. Although most checklist items in this sample were judged to be appropriate, some could potentially be eliminated, thereby minimizing the scoring burden placed on the standardized patients. Periodic review of checklist items, concentrating on their clinical importance, is warranted.
Rasch Analysis of the General Self-Efficacy Scale in Workers with Traumatic Limb Injuries.

PubMed

Wu, Tzu-Yi; Yu, Wan-Hui; Huang, Chien-Yu; Hou, Wen-Hsuan; Hsieh, Ching-Lin

2016-09-01

Purpose: The purpose of this study was to apply Rasch analysis to examine the unidimensionality and reliability of the General Self-Efficacy Scale (GSE) in workers with traumatic limb injuries. Furthermore, if the items of the GSE fitted the Rasch model's assumptions, we transformed the raw sum ordinal scores of the GSE into Rasch interval scores. Methods: A total of 1076 participants completed the GSE at 1 month post injury. Rasch analysis was used to examine the unidimensionality and person reliability of the GSE. The unidimensionality of the GSE was verified by determining whether the items fit the Rasch model's assumptions: (1) item fit indices: infit and outfit mean square (MNSQ) ranged from 0.6 to 1.4; and (2) the eigenvalue of the first factor extracted from principal component analysis (PCA) for residuals was <2. Person reliability was calculated. Results: The unidimensionality of the 10-item GSE was supported in terms of good item fit statistics (infit and outfit MNSQ ranging from 0.92 to 1.32) and acceptable eigenvalues (1.6) of the first factor of the PCA, with person reliability = 0.89. Consequently, the raw sum scores of the GSE were transformed into Rasch scores. Conclusions: The results indicated that the items of GSE are unidimensional and have acceptable person reliability in workers with traumatic limb injuries. Additionally, the raw sum scores of the GSE can be transformed into Rasch interval scores for prospective users to quantify workers' levels of self-efficacy and to conduct further statistical analyses.
The PROMIS fatigue item bank has good measurement properties in patients with fibromyalgia and severe fatigue.

PubMed

Yost, Kathleen J; Waller, Niels G; Lee, Minji K; Vincent, Ann

2017-06-01

Efficient management of fibromyalgia (FM) requires precise measurement of FM-specific symptoms. Our objective was to assess the measurement properties of the Patient-Reported Outcome Measurement Information System (PROMIS) fatigue item bank (FIB) in people with FM. We applied classical psychometric and item response theory methods to cross-sectional PROMIS-FIB data from two samples. Data on the clinical FM sample were obtained at a tertiary medical center. Data for the U.S. general population sample were obtained from the PROMIS network. The full 95-item bank was administered to both samples. We investigated dimensionality of the item bank in both samples by separately fitting a bifactor model with two group factors; experience and impact. We assessed measurement invariance between samples, and we explored an alternate factor structure with the normative sample and subsequently confirmed that structure in the clinical sample. Finally, we assessed whether reporting FM subdomain scores added value over reporting a single total score. The item bank was dominated by a general fatigue factor. The fit of the initial bifactor model and evidence of measurement invariance indicated that the same constructs were measured across the samples. An alternative bifactor model with three group factors demonstrated slightly improved fit. Subdomain scores add value over a total score. We demonstrated that the PROMIS-FIB is appropriate for measuring fatigue in clinical samples of FM patients. The construct can be presented by a single score; however, subdomain scores for the three group factors identified in the alternative model may also be reported.

Perceived Perfectionism from God Scale: Development and Initial Evidence.

PubMed

Wang, Kenneth T; Allen, G E Kawika; Stokes, Hannah I; Suh, Han Na

2017-05-03

In this study, the Perceived Perfectionism from God Scale (PPGS) was developed with Latter-day Saints (Mormons) across two samples. Sample 1 (N = 421) was used for EFA to select items for the Perceived Standards from God (5 items) and the Perceived Discrepancy from God (5 items) subscales. Sample 2 (N = 420) was used for CFA and cross-validated the 2-factor oblique model as well as a bifactor model. Perceived Standards from God scores had Cronbach alphas ranging from .73 to .78, and Perceived Discrepancy from God scores had Cronbach alphas ranging from .82 to .84. Standards from God scores were positively correlated with positive affect, whereas Discrepancy from God scores were positively correlated with negative affect, shame and guilt. Moreover, these two PPGS subscale scores added significant incremental variances in predicting associated variables over and above corresponding personal perfectionism scores.
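Cronbach's alpha, the internal-consistency coefficient reported for both subscales in this abstract, can be computed from item and total-score variances as alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The sketch below is a minimal illustration with made-up responses; the function name `cronbach_alpha` and the data are hypothetical and are not drawn from the study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents' item-score rows.

    `responses` is a list of equal-length rows, one row per respondent
    and one column per item (hypothetical input layout).
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_variances = [
        variance([row[i] for row in responses]) for i in range(n_items)
    ]
    # Sample variance of each respondent's summed score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Made-up answers from six respondents to a 5-item subscale (1-5 scale).
subscale = [
    [4, 5, 4, 4, 5],
    [2, 3, 2, 3, 2],
    [5, 5, 4, 5, 5],
    [3, 3, 3, 2, 3],
    [1, 2, 1, 2, 2],
    [4, 4, 5, 4, 4],
]
print(round(cronbach_alpha(subscale), 3))
```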
Will a Short Training Session Improve Multiple-Choice Item-Writing Quality by Dental School Faculty? A Pilot Study.

PubMed

Dellinges, Mark A; Curtis, Donald A

2017-08-01

Faculty members are expected to write high-quality multiple-choice questions (MCQs) in order to accurately assess dental students' achievement. However, most dental school faculty members are not trained to write MCQs. Extensive faculty development programs have been used to help educators write better test items. The aim of this pilot study was to determine if a short workshop would result in improved MCQ item-writing by dental school faculty at one U.S. dental school. A total of 24 dental school faculty members who had previously written MCQs were randomized into a no-intervention group and an intervention group in 2015. Six previously written MCQs were randomly selected from each of the faculty members and given an item quality score. The intervention group participated in a training session of one-hour duration that focused on reviewing standard item-writing guidelines to improve in-house MCQs. The no-intervention group did not receive any training but did receive encouragement and an explanation of why good MCQ writing was important. The faculty members were then asked to revise their previously written questions, and these were given an item quality score. The item quality scores for each faculty member were averaged, and the difference from pre-training to post-training scores was evaluated. The results showed a significant difference between pre-training and post-training MCQ difference scores for the intervention group (p=0.04). This pilot study provides evidence that the training session of short duration was effective in improving the quality of in-house MCQs.
An Evaluation of the Single-Group Growth Model as an Alternative to Common-Item Equating. Research Report. ETS RR-16-01

ERIC Educational Resources Information Center

Wei, Youhua; Morgan, Rick

2016-01-01

As an alternative to common-item equating when common items do not function as expected, the single-group growth model (SGGM) scaling uses common examinees or repeaters to link test scores on different forms. The SGGM scaling assumes that, for repeaters taking adjacent administrations, the conditional distribution of scale scores in later…
An Empirical Investigation of the Potential Impact of Item Misfit on Test Scores. Research Report. ETS RR-17-60

ERIC Educational Resources Information Center

Kim, Sooyeon; Robin, Frederic

2017-01-01

In this study, we examined the potential impact of item misfit on the reported scores of an admission test from the subpopulation invariance perspective. The target population of the test consisted of 3 major subgroups with different geographic regions. We used the logistic regression function to estimate item parameters of the operational items…
On an Extension of the Rasch Model to the Case of Polychotomously Scored Items.

ERIC Educational Resources Information Center

Vogt, Dorothee K.

The Rasch model for the probability of a person's response to an item is extended to the case where this response depends on a set of scoring or category weights, in addition to person and item parameters. The maximum likelihood approach introduced by Wright for the dichotomous case is applicable here also, and it is shown to yield a unique…
A Comparison of Methods for Estimating Conditional Item Score Differences in Differential Item Functioning (DIF) Assessments. Research Report. ETS RR-10-15

ERIC Educational Resources Information Center

Moses, Tim; Miao, Jing; Dorans, Neil

2010-01-01

This study compared the accuracies of four differential item functioning (DIF) estimation methods, where each method makes use of only one of the following: raw data, logistic regression, loglinear models, or kernel smoothing. The major focus was on the estimation strategies' potential for estimating score-level, conditional DIF. A secondary focus…
Reduced-Item Food Audits Based on the Nutrition Environment Measures Surveys.

PubMed

Partington, Susan N; Menzies, Tim J; Colburn, Trina A; Saelens, Brian E; Glanz, Karen

2015-10-01

The community food environment may contribute to obesity by influencing food choice. Store and restaurant audits are increasingly common methods for assessing food environments, but are time consuming and costly. A valid, reliable brief measurement tool is needed. The purpose of this study was to develop and validate reduced-item food environment audit tools for stores and restaurants. Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed in 820 stores and 1,795 restaurants in West Virginia, San Diego, and Seattle. Data mining techniques (correlation-based feature selection and linear regression) were used to identify survey items highly correlated to total survey scores and produce reduced-item audit tools that were subsequently validated against full NEMS surveys. Regression coefficients were used as weights that were applied to reduced-item tool items to generate comparable scores to full NEMS surveys. Data were collected and analyzed in 2008-2013. The reduced-item tools included eight items for grocery, ten for convenience, seven for variety, and five for other stores; and 16 items for sit-down, 14 for fast casual, 19 for fast food, and 13 for specialty restaurants, amounting to 10% of the full NEMS-S and 25% of the full NEMS-R. There were no significant differences in median scores for varying types of retail food outlets when compared to the full survey scores. Median in-store audit time was reduced 25%-50%. Reduced-item audit tools can reduce the burden and complexity of large-scale or repeated assessments of the retail food environment without compromising measurement quality. Copyright © 2015 American Journal of Preventive Medicine. Published by Elsevier Inc. All rights reserved.
The international phase 4 validation study of the EORTC QLQ-SWB32: A stand-alone measure of spiritual well-being for people receiving palliative care for cancer.

PubMed

Vivat, B; Young, T E; Winstanley, J; Arraras, J I; Black, K; Boyle, F; Bredart, A; Costantini, A; Guo, J; Irarrazaval, M E; Kobayashi, K; Kruizinga, R; Navarro, M; Omidvari, S; Rohde, G E; Serpentini, S; Spry, N; Van Laarhoven, H W M; Yang, G M

2017-11-01

The EORTC Quality of Life Group has just completed the final phase (field-testing and validation) of an international project to develop a stand-alone measure of spiritual well-being (SWB) for palliative cancer patients. Participants (n = 451; from 14 countries on four continents; 54% female; 188 Christian; 50 Muslim; 156 with no religion) completed a provisional 36-item measure of SWB plus the EORTC QLQ-C15-PAL (PAL), then took part in a structured debriefing interview. All items showed good score distribution across response categories. We assessed scale structure using principal component analysis and Rasch analysis, and explored construct validity, and convergent/divergent validity with the PAL. Twenty-two items in four scoring scales (Relationship with Self, Relationships with Others, Relationship with Someone or Something Greater, and Existential) explained 53% of the variance. The measure also includes a global SWB item and nine other items. Scores on the PAL global quality-of-life item and Emotional Functioning scale weakly-moderately correlated with scores on the global SWB item and two of the four SWB scales. This new validated 32-item SWB measure addresses a distinct aspect of quality-of-life, and is now available for use in research and clinical practice, with a role as both a measurement and an intervention tool. © 2017 John Wiley & Sons Ltd.
Validation of the Expanded Versions of the Adult ADHD Self-Report Scale v1.1 Symptom Checklist and the Adult ADHD Investigator Symptom Rating Scale.

PubMed

Silverstein, Michael J; Faraone, Stephen V; Alperin, Samuel; Leon, Terry L; Biederman, Joseph; Spencer, Thomas J; Adler, Lenard A

2018-02-01

The aim of this study is to validate the Adult ADHD Self-Report Scale (ASRS) and Adult ADHD Investigator Symptom Rating Scale (AISRS) expanded versions, including executive function deficits (EFDs) and emotional dyscontrol (EC) items, and to present ASRS and AISRS pilot normative data. Two patient samples (referred and primary care physician [PCP] controls) were pooled together for these analyses. Final analysis included 297 respondents, 171 with adult ADHD. Cronbach's alphas were high for all sections of the scales. Examining histograms of ASRS 31-item and AISRS 18-item total scores for ADHD controls, 95% cutoff scores were 70 and 23, respectively; histograms for pilot normative sample suggest cutoffs of 82 and 26, respectively. (a) ASRS- and AISRS-expanded versions have high validity in assessment of core 18 adult ADHD Diagnostic and Statistical Manual of Mental Disorders (DSM) symptoms and EFD and EC symptoms. (b) ASRS (31-item) scores from 70 to 82 and AISRS (18-item) scores from 23 to 26 suggest a high likelihood of adult ADHD.
Analyzing force concept inventory with item response theory

NASA Astrophysics Data System (ADS)

Wang, Jing; Bao, Lei

2010-10-01

Item response theory is a popular assessment method used in education. It rests on the assumption of a probability framework that relates students' innate ability and their performance on test questions. Item response theory transforms students' raw test scores into a scaled proficiency score, which can be used to compare results obtained with different test questions. The scaled score also addresses the issues of ceiling effects and guessing, which commonly exist in quantitative assessment. We used item response theory to analyze the force concept inventory (FCI). Our results show that item response theory can be useful for analyzing physics concept surveys such as the FCI and produces results about the individual questions and student performance that are beyond the capability of classical statistics. The theory yields detailed measurement parameters regarding the difficulty, discrimination features, and probability of correct guess for each of the FCI questions.
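As a rough illustration of the three item parameters named in the abstract above (difficulty, discrimination, and probability of a correct guess), the sketch below evaluates a three-parameter logistic item response function. The abstract does not state the exact parameterization the authors used, so the function name `p_correct_3pl` and the parameter values are illustrative assumptions, not estimates from the FCI data.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """Three-parameter logistic item response function.

    theta: person proficiency; a: discrimination; b: difficulty;
    c: lower asymptote (probability of a correct guess).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item, not an estimate from the FCI data.
item = {"a": 1.2, "b": 0.5, "c": 0.2}

for theta in (-2.0, 0.5, 3.0):
    print(f"theta={theta:+.1f}  P(correct)={p_correct_3pl(theta, **item):.3f}")
```

When proficiency equals the item difficulty (theta = b), the function returns c + (1 - c) / 2, the midpoint between the guessing floor and 1.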
Detection of Differential Item Functioning Using the Lasso Approach

ERIC Educational Resources Information Center

Magis, David; Tuerlinckx, Francis; De Boeck, Paul

2015-01-01

This article proposes a novel approach to detect differential item functioning (DIF) among dichotomously scored items. Unlike standard DIF methods that perform an item-by-item analysis, we propose the "LR lasso DIF method": a logistic regression (LR) model is formulated for all item responses. The model contains item-specific intercepts,…
Comparing five depression measures in depressed Chinese patients using item response theory: an examination of item properties, measurement precision and score comparability.

PubMed

Zhao, Yue; Chan, Wai; Lo, Barbara Chuen Yee

2017-04-04

Item response theory (IRT) has been increasingly applied to patient-reported outcome (PRO) measures. The purpose of this study is to apply IRT to examine item properties (discrimination and severity of depressive symptoms), measurement precision and score comparability across five depression measures, which is the first study of its kind in the Chinese context. A clinical sample of 207 Hong Kong Chinese outpatients was recruited. Data analyses were performed including classical item analysis, IRT concurrent calibration and IRT true score equating. The IRT assumptions of unidimensionality and local independence were tested respectively using confirmatory factor analysis and chi-square statistics. The IRT linking assumptions of construct similarity, equity and subgroup invariance were also tested. The graded response model was applied to concurrently calibrate all five depression measures in a single IRT run, resulting in the item parameter estimates of these measures being placed onto a single common metric. IRT true score equating was implemented to perform the outcome score linking and construct score concordances so as to link scores from one measure to corresponding scores on another measure for direct comparability. Findings suggested that (a) symptoms on depressed mood, suicidality and feeling of worthlessness served as the strongest discriminating indicators, and symptoms concerning suicidality, changes in appetite, depressed mood, feeling of worthlessness and psychomotor agitation or retardation reflected high levels of severity in the clinical sample. (b) The five depression measures contributed to various degrees of measurement precision at varied levels of depression. (c) After outcome score linking was performed across the five measures, the cut-off scores led to either consistent or discrepant diagnoses for depression. 
The study provides additional evidence regarding the psychometric properties and clinical utility of the five depression measures, offers methodological contributions to the appropriate use of IRT in PRO measures, and helps elucidate cultural variation in depressive symptomatology. The approach of concurrently calibrating and linking multiple PRO measures can be applied to the assessment of PROs other than the depression context.
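The graded response model mentioned in the abstract above assigns each ordered response category a probability equal to the difference between adjacent cumulative logistic curves. The sketch below shows that computation for a single item; the function name `grm_category_probs`, the threshold values, and the discrimination value are hypothetical and are not taken from the study.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under a graded response model.

    thresholds must be strictly increasing; with m thresholds the item
    has m + 1 ordered score categories (0..m).
    """
    if any(lo >= hi for lo, hi in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    # Cumulative probabilities P(X >= k), padded with P(X >= 0) = 1
    # and P(X >= m + 1) = 0.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulatives.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical four-category item (for example, a 0-3 symptom rating).
probs = grm_category_probs(theta=0.0, a=1.0, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

Because the thresholds are ordered, the cumulative curves never cross, so every category probability is positive and the probabilities sum to 1.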
Correlates of cognitive function scores in elderly outpatients.

PubMed

Mangione, C M; Seddon, J M; Cook, E F; Krug, J H; Sahagian, C R; Campion, E W; Glynn, R J

1993-05-01

To determine medical, ophthalmologic, and demographic predictors of cognitive function scores as measured by the Telephone Interview for Cognitive Status (TICS), an adaptation of the Folstein Mini-Mental Status Exam. A secondary objective was to perform an item-by-item analysis of the TICS scores to determine which items correlated most highly with the overall scores. Cross-sectional cohort study. The Glaucoma Consultation Service of the Massachusetts Eye and Ear Infirmary. 472 of 565 consecutive patients age 65 and older who were seen at the Glaucoma Consultation Service between November 1, 1987 and October 31, 1988. Each subject had a standard visual examination and review of medical history at entry, followed by a telephone interview that collected information on demographic characteristics, cognitive status, health status, accidents, falls, symptoms of depression, and alcohol intake. A multivariate linear regression model of correlates of TICS score found the strongest correlates to be education, age, occupation, and the presence of depressive symptoms. The only significant ocular condition that correlated with lower TICS score was the presence of surgical aphakia (model R2 = .46). Forty-six percent (216/472) of patients fell below the established definition of normal on the mental status scale. In a logistic regression analysis, the strongest correlates of an abnormal cognitive function score were age, diabetes, educational status, and occupational status. An item analysis using step-wise linear regression showed that 85 percent of the variance in the TICS score was explained by the ability to perform serial sevens and to repeat 10 items immediately after hearing them. Educational status correlated most highly with both of these items (Kendall Tau R = .43 and Kendall Tau R = .30, respectively). Education, occupation, depression, and age were the strongest correlates of the score on this new screening test for assessing cognitive status. 
These factors were stronger correlates of the TICS score than chronic medical conditions, visual loss, or medications. The Telephone Interview for Cognitive Status is a useful instrument, but it may overestimate the prevalence of dementia in studies with a high prevalence of persons with less than a high school education.
Item Response Theory Modeling of the Philadelphia Naming Test.

PubMed

Fergadiotis, Gerasimos; Kellough, Stacey; Hula, William D

2015-06-01

In this study, we investigated the fit of the Philadelphia Naming Test (PNT; Roach, Schwartz, Martin, Grewal, & Brecher, 1996) to an item-response-theory measurement model, estimated the precision of the resulting scores and item parameters, and provided a theoretical rationale for the interpretation of PNT overall scores by relating explanatory variables to item difficulty. This article describes the statistical model underlying the computer adaptive PNT presented in a companion article (Hula, Kellough, & Fergadiotis, 2015). Using archival data, we evaluated the fit of the PNT to 1- and 2-parameter logistic models and examined the precision of the resulting parameter estimates. We regressed the item difficulty estimates on three predictor variables: word length, age of acquisition, and contextual diversity. The 2-parameter logistic model demonstrated marginally better fit, but the fit of the 1-parameter logistic model was adequate. Precision was excellent for both person ability and item difficulty estimates. Word length, age of acquisition, and contextual diversity all independently contributed to variance in item difficulty. Item-response-theory methods can be productively used to analyze and quantify anomia severity in aphasia. Regression of item difficulty on lexical variables supported the validity of the PNT and interpretation of anomia severity scores in the context of current word-finding models.
Effect of the framing of questionnaire items regarding satisfaction with training on residents' responses.

PubMed

Guyatt, G H; Cook, D J; King, D; Norman, G R; Kane, S L; van Ineveld, C

1999-02-01

To determine whether framing questions positively or negatively influences residents' apparent satisfaction with their training. In 1993-94, 276 residents at five Canadian internal medicine residency programs responded to 53 Likert-scale items designed to determine sources of the residents' satisfaction and stress. Two versions of the questionnaire were randomly distributed: one in which half the items were stated positively and the other half negatively, the other version in which the items were stated in the opposite way. The residents scored 43 of the 53 items higher when stated positively and scored ten higher when stated negatively (p < .0001). When analyzed using an analysis-of-variance model, the effect of positive versus negative framing was highly significant (F = 129.81, p < .0001). While the interaction between item and framing was also significant, the effect was much less strong (F = 5.56, p < .0001). On a scale where 1 represented the lowest possible level of satisfaction and 7 the highest, the mean score of the positively stated items was 4.1 and that of the negatively stated items, 3.8, an effect of 0.3. These results suggest a significant "response acquiescence bias." To minimize this bias, questionnaires assessing attitudes toward educational programs should include a mix of positively and negatively stated items.
Item Analyses of Memory Differences

PubMed Central

Salthouse, Timothy A.

2017-01-01

Objective: Although performance on memory and other cognitive tests is usually assessed with a score aggregated across multiple items, potentially valuable information is also available at the level of individual items. Method: The current study illustrates how analyses of variance with item as one of the factors, and memorability analyses in which item accuracy in one group is plotted as a function of item accuracy in another group, can provide a more detailed characterization of the nature of group differences in memory. Data are reported for two memory tasks, word recall and story memory, across age, ability, repetition, delay, and longitudinal contrasts. Results: The item-level analyses revealed evidence for largely uniform differences across items in the age, ability, and longitudinal contrasts, but differential patterns across items in the repetition contrast, and unsystematic item relations in the delay contrast. Conclusion: Analyses at the level of individual items have the potential to indicate the manner by which group differences in the aggregate test score are achieved. PMID:27618285
Evaluation of the psychometric properties of the Nighttime Symptoms of COPD Instrument.

PubMed

Mocarski, Michelle; Zaiser, Erica; Trundell, Dylan; Make, Barry J; Hareendran, Asha

2015-01-01

Nighttime symptoms can negatively impact the quality of life of patients with chronic obstructive pulmonary disease (COPD). The Nighttime Symptoms of COPD Instrument (NiSCI) was designed to measure the occurrence and severity of nighttime symptoms in patients with COPD, the impact of symptoms on nighttime awakenings, and rescue medication use. The objective of this study was to explore item reduction, inform scoring recommendations, and evaluate the psychometric properties of the NiSCI. COPD patients participating in a Phase III clinical trial completed the NiSCI daily. Item analyses were conducted using weekly mean and single day scores. Descriptive statistics (including percentage of respondents at floor/ceiling and inter-item correlations), factor analyses, and Rasch model analyses were conducted to examine item performance and scoring. Test-retest reliability was assessed for the final instrument using the intraclass correlation coefficient (ICC). Correlations with assessments conducted during study visits were used to evaluate convergent and known-groups validity. Data from 1,663 COPD patients aged 40-93 years were analyzed. Item analyses supported the generation of four scores. A one-factor structure was confirmed with factor analysis and Rasch analysis for the symptom severity score. Test-retest reliability was confirmed for the six-item symptom severity (ICC, 0.85), number of nighttime awakenings (ICC, 0.82), and rescue medication (ICC, 0.68) scores. Convergent validity was supported by significant correlations between the NiSCI, St George's Respiratory Questionnaire, and Exacerbations of Chronic Obstructive Pulmonary Disease Tool-Respiratory Symptoms scores. The results suggest that the NiSCI can be used to determine the severity of nighttime COPD symptoms, the number of nighttime awakenings due to COPD symptoms, and the nighttime use of rescue medication. 
The NiSCI is a reliable and valid instrument to evaluate these concepts in COPD patients in clinical trials and clinical practice. Scoring recommendations and steps for further research are discussed.
Attitudes to mesalamine questionnaire: a novel tool to predict mesalamine nonadherence in patients with IBD.

PubMed

Moss, Alan C; Lillis, Yvonne; Edwards George, Jessica B; Choudhry, Niteesh K; Berg, Anders H; Cheifetz, Adam S; Horowitz, Gary; Leffler, Dan A

2014-12-01

Poor adherence to mesalamine is common and driven by a combination of lifestyle and behavioral factors, as well as health beliefs. We sought to develop a valid tool to identify barriers to patient adherence and predict those at risk for future nonadherence. A 10-item survey was developed from patient-reported barriers to adherence. The survey was administered to 106 patients with ulcerative colitis who were prescribed mesalamine, and correlated with prospectively collected 12-month pharmacy refills (medication possession ratio (MPR)), urine levels of salicylates, and self-reported adherence (Morisky Medication Adherence Scale (MMAS)-8). From the initial 10-item survey, 8 items correlated highly with the MMAS-8 score at enrollment. Computer-generated randomization produced a derivation cohort of 60 subjects and a validation cohort of 46 subjects to assess the survey items in their ability to predict future adherence. Two items from the patient survey correlated with objective measures of long-term adherence: their belief in the importance of maintenance mesalamine even when in remission and their concerns about side effects. The additive score based on these two items correlated with 12-month MPR in both the derivation and validation cohorts (P<0.05). Scores on these two items were associated with a higher risk of being nonadherent over the subsequent 12 months (relative risk (RR) =2.2, 95% confidence interval=1.5-3.5, P=0.04). The area under the curve for the performance of this 2-item tool was greater than that of the 10-item MMAS-8 score for predicting MPR scores over 12 months (area under the curve 0.7 vs. 0.5). Patients' beliefs about the need for maintenance mesalamine and their concerns about side effects influence their adherence to mesalamine over time. These concerns could easily be raised in practice to identify patients at risk of nonadherence (Clinical Trial number NCT01349504).
Avoiding and Correcting Bias in Score-Based Latent Variable Regression with Discrete Manifest Items

ERIC Educational Resources Information Center

Lu, Irene R. R.; Thomas, D. Roland

2008-01-01

This article considers models involving a single structural equation with latent explanatory and/or latent dependent variables where discrete items are used to measure the latent variables. Our primary focus is the use of scores as proxies for the latent variables and carrying out ordinary least squares (OLS) regression on such scores to estimate…
Automated Scoring of Speaking Tasks in the Test of English-for-Teaching ("TEFT"™). Research Report. ETS RR-15-31

ERIC Educational Resources Information Center

Zechner, Klaus; Chen, Lei; Davis, Larry; Evanini, Keelan; Lee, Chong Min; Leong, Chee Wee; Wang, Xinhao; Yoon, Su-Youn

2015-01-01

This research report presents a summary of research and development efforts devoted to creating scoring models for automatically scoring spoken item responses of a pilot administration of the Test of English-for-Teaching ("TEFT"™) within the "ELTeach"™ framework. The test consists of items for all four language modalities:…

An efficacy analysis of olanzapine treatment data in schizophrenia patients with catatonic signs and symptoms.

PubMed

Martényi, F; Metcalfe, S; Schausberger, B; Dossenbach, M R

2001-01-01

Thirty-five patients suffering from schizophrenia, as diagnosed by the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, were preselected from 7 clinical trials according to a priori criteria of catatonic signs and symptoms based on 3 Positive and Negative Syndrome Scale (PANSS) items: scores for PANSS item 19 (mannerism and posturing) and either item 4 (excitement) or item 21 (motor retardation) had to exceed or equal 4 at baseline. This particular patient population represents a severely psychotic sample: mean +/- SD PANSS total scores at baseline were 129.26 +/- 19.76. After 1 week of olanzapine treatment, mean PANSS total score was decreased significantly (-13.14; p < .001), as was mean PANSS total score after 6 weeks of olanzapine treatment (-45.16; p < .001); additionally, the positive subscale, negative subscale, and mood scores improved significantly. A significant improvement in the catatonic signs and symptoms composite score was also observed at week 6 (-4.96; p < .001). The mean +/- SD daily dose of olanzapine was 18.00 +/- 2.89 mg after 6 weeks of treatment. The present data analysis suggests the efficacy of olanzapine in the treatment of severely ill schizophrenic patients with nonspecified catatonic signs and symptoms.
Evaluating the Effects of Differences in Group Abilities on the Tucker and the Levine Observed-Score Methods for Common-Item Nonequivalent Groups Equating. ACT Research Report Series 2010-1

ERIC Educational Resources Information Center

Chen, Hanwei; Cui, Zhongmin; Zhu, Rongchun; Gao, Xiaohong

2010-01-01

The most critical feature of a common-item nonequivalent groups equating design is that the average score difference between the new and old groups can be accurately decomposed into a group ability difference and a form difficulty difference. Two widely used observed-score linear equating methods, the Tucker and the Levine observed-score methods,…
Detection and validation of unscalable item score patterns using item response theory: an illustration with Harter's Self-Perception Profile for Children.

PubMed

Meijer, Rob R; Egberink, Iris J L; Emons, Wilco H M; Sijtsma, Klaas

2008-05-01

We illustrate the usefulness of person-fit methodology for personality assessment. For this purpose, we use person-fit methods from item response theory. First, we give a nontechnical introduction to existing person-fit statistics. Second, we analyze data from Harter's (1985) Self-Perception Profile for Children (Harter, 1985) in a sample of children ranging from 8 to 12 years of age (N = 611) and argue that for some children, the scale scores should be interpreted with care and caution. Combined information from person-fit indexes and from observation, interviews, and self-concept theory showed that similar score profiles may have a different interpretation. For some children in the sample, item scores did not adequately reflect their trait level. Based on teacher interviews, this was found to be due most likely to a less developed self-concept and/or problems understanding the meaning of the questions. We recommend investigating the scalability of score patterns when using self-report inventories to help the researcher interpret respondents' behavior correctly.
Development of Elderly Quality of Life Index – Eqoli: Item Reduction and Distribution into Dimensions

PubMed Central

Paschoal, Sérgio Márcio Pacheco; Filho, Wilson Jacob; Litvoc, Júlio

2008-01-01

OBJECTIVE: To describe item reduction and its distribution into dimensions in the construction process of a quality of life evaluation instrument for the elderly. METHODS: The sampling method was chosen by convenience through quotas, with selection of elderly subjects from four programs to achieve heterogeneity in the “health status”, “functional capacity”, “gender”, and “age” variables. The Clinical Impact Method was used, consisting of the spontaneous and elicited selection by the respondents of relevant items to the construct Quality of Life in Old Age from a previously elaborated item pool. The respondents rated each item’s importance using a 5-point Likert scale. The product of the proportion of elderly selecting the item as relevant (frequency) and the mean importance score they attributed to it (importance) represented the overall impact of that item in their quality of life (impact). The items were ordered according to their impact scores and the top 46 scoring items were grouped in dimensions by three experts. A review of the negative items was performed. RESULTS: One hundred and ninety-three people (122 women and 71 men) were interviewed. Experts distributed the 46 items into eight dimensions. Closely related items were grouped and dimensions not reaching the minimum expected number of items received additional items resulting in eight dimensions and 43 items. DISCUSSION: The sample was heterogeneous and similar to what was expected. The dimensions and items demonstrated the multidimensionality of the construct. The Clinical Impact Method was appropriate to construct the instrument, which was named Elderly Quality of Life Index - EQoLI. An accuracy process will be examined in the future. PMID:18438571
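The impact score described in the abstract above is the product of two quantities: the proportion of respondents who selected an item as relevant (frequency) and the mean importance rating those respondents gave it (importance). The sketch below ranks items by that product. The function name `impact_ranking`, the item names, and the ratings are hypothetical, and treating an item that nobody selected as having zero impact is an assumption rather than something the abstract states.

```python
def impact_ranking(ratings_by_item, n_respondents):
    """Rank items by impact = frequency * mean importance.

    ratings_by_item maps an item name to the list of 1-5 importance
    ratings given by the respondents who selected that item as relevant.
    """
    ranking = []
    for item, ratings in ratings_by_item.items():
        if not ratings:
            # Assumption: an item nobody selected has zero impact.
            ranking.append((item, 0.0))
            continue
        frequency = len(ratings) / n_respondents
        importance = sum(ratings) / len(ratings)
        ranking.append((item, frequency * importance))
    # Highest impact first.
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)


# Hypothetical ratings from 10 respondents.
example = {
    "item_a": [5, 5, 4, 4],  # selected by 4 of 10, rated highly
    "item_b": [3] * 8,       # selected by 8 of 10, rated moderately
    "item_c": [5],           # selected by 1 of 10
}
ranked = impact_ranking(example, n_respondents=10)
for name, impact in ranked:
    print(f"{name}: {impact:.2f}")
```

In this made-up data, a widely selected item with moderate ratings outranks a rarely selected item with top ratings, which is the trade-off the frequency-times-importance product is meant to capture.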
Psychometric assessment of the IBS-D Daily Symptom Diary and Symptom Event Log.

PubMed

Rosa, Kathleen; Delgado-Herrera, Leticia; Zeiher, Bernie; Banderas, Benjamin; Arbuckle, Rob; Spears, Glen; Hudgens, Stacie

2016-12-01

Diarrhea-predominant irritable bowel syndrome (IBS-D) can considerably impact patients' lives. Patient-reported symptoms are crucial in understanding the diagnosis and progression of IBS-D. This study psychometrically evaluates the newly developed IBS-D Daily Symptom Diary and Symptom Event Log (hereafter, "Event Log") according to US regulatory recommendations. A US-based observational field study was conducted to understand cross-sectional psychometric properties of the IBS-D Daily Symptom Diary and Event Log. Analyses included item descriptive statistics, item-to-item correlations, reliability, and construct validity. The IBS-D Daily Symptom Diary and Event Log had no items with excessive missing data. With the exception of two items ("frequency of gas" and "accidents"), moderate to high inter-item correlations were observed among all items of the IBS-D Daily Symptom Diary and Event Log (day 1 range 0.67-0.90). Item scores demonstrated reliability, with the exception of the "frequency of gas" and "accidents" items of the Diary and "incomplete evacuation" item of the Event Log. The pattern of correlations of the IBS-D Daily Symptom Diary and Event Log item scores with generic and disease-specific measures was as expected, moderate for similar constructs and low for dissimilar constructs, supporting construct validity. Known-groups methods showed statistically significant differences and monotonic trends in each of the IBS-D Daily Symptom Diary item scores among groups defined by patients' IBS-D severity ratings ("none"/"mild," "moderate," or "severe"/"very severe"), supporting construct validity. Initial psychometric results support the reliability and validity of the items of the IBS-D Daily Symptom Diary and Event Log.
Complex versus Simple Modeling for DIF Detection: When the Intraclass Correlation Coefficient (ρ) of the Studied Item Is Less Than the ρ of the Total Score

ERIC Educational Resources Information Center

Jin, Ying; Myers, Nicholas D.; Ahn, Soyeon

2014-01-01

Previous research has demonstrated that differential item functioning (DIF) methods that do not account for multilevel data structure could result in too frequent rejection of the null hypothesis (i.e., no DIF) when the intraclass correlation coefficient (ρ) of the studied item was the same as the ρ of the total score. The current study extended…
Measurement properties of painDETECT: Rasch analysis of responses from community-dwelling adults with neuropathic pain.

PubMed

Packham, Tara L; Cappelleri, Joseph C; Sadosky, Alesia; MacDermid, Joy C; Brunner, Florian

2017-03-04

painDETECT (PD-Q) is a self-reported assessment of pain qualities developed as a screening tool for pain of neuropathic origin. Rasch analysis is a strategy for examining the measurement characteristics of a scale using a form of item response theory. We conducted a Rasch analysis to consider if the scoring and measurement properties of PD-Q would support its use as an outcome measure. Rasch analysis was conducted on PD-Q scores drawn from a cross-sectional study of the burden and costs of NeP. The analysis followed an iterative process based on recommendations in the literature, including examination of sequential scoring categories, unidimensionality, reliability and differential item function. Data from 624 persons with a diagnosis of painful diabetic polyneuropathy, small fibre neuropathy, and neuropathic pain associated with chronic low back pain, spinal cord injury, HIV-related pain, or chronic post-surgical pain were used for this analysis. PD-Q demonstrated fit to the Rasch model after adjustments of scoring categories for four items, and omission of the time course and radiating questions. The resulting seven-item scale of pain qualities demonstrated good reliability with a person-separation index of 0.79. No scoring bias (differential item functioning) was found for this version. Rasch modelling suggests the seven pain-qualities items from PD-Q may be used as an outcome measure. Further research is required to confirm validity and responsiveness in a clinical setting.
A Practical Guide to Check the Consistency of Item Response Patterns in Clinical Research Through Person-Fit Statistics: Examples and a Computer Program.

PubMed

Meijer, Rob R; Niessen, A Susan M; Tendeiro, Jorge N

2016-02-01

Although there are many studies devoted to person-fit statistics to detect inconsistent item score patterns, most studies are difficult to understand for nonspecialists. The aim of this tutorial is to explain the principles of these statistics for researchers and clinicians who are interested in applying these statistics. In particular, we first explain how invalid test scores can be detected using person-fit statistics; second, we provide the reader practical examples of existing studies that used person-fit statistics to detect and to interpret inconsistent item score patterns; and third, we discuss a new R-package that can be used to identify and interpret inconsistent score patterns. © The Author(s) 2015.
Calibrating Item Families and Summarizing the Results Using Family Expected Response Functions

ERIC Educational Resources Information Center

Sinharay, Sandip; Johnson, Matthew S.; Williamson, David M.

2003-01-01

Item families, which are groups of related items, are becoming increasingly popular in complex educational assessments. For example, in automatic item generation (AIG) systems, a test may consist of multiple items generated from each of a number of item models. Item calibration or scoring for such an assessment requires fitting models that can…
Development of the outcome expectancy scale for self-care among periodontal disease patients.

PubMed

Kakudate, Naoki; Morita, Manabu; Fukuhara, Shunichi; Sugai, Makoto; Nagayama, Masato; Isogai, Emiko; Kawanami, Masamitsu; Chiba, Itsuo

2011-12-01

The theory of self-efficacy states that specific efficacy expectations affect behaviour. Two types of efficacy expectations are described within the theory. Self-efficacy expectations are the beliefs in the capacity to perform a specific behaviour. Outcome expectations are the beliefs that carrying out a specific behaviour will lead to a desired outcome. To develop and examine the reliability and validity of an outcome expectancy scale for self-care (OESS) among periodontal disease patients. A 34-item scale was tested on 101 patients at a dental clinic. Accuracy was improved by item analysis, and internal consistency and test-retest stability were investigated. Concurrent validity was tested by examining associations of the OESS score with the self-efficacy scale for self-care (SESS) score and plaque index score. Construct validity was examined by comparing OESS scores between periodontal patients at initial visit (group 1) and those continuing maintenance care (group 2). Item analysis identified 13 items for the OESS. Factor analysis extracted three factors: social-, oral- and self-evaluative outcome expectancy. Cronbach's alpha coefficient for the OESS was 0.90. A significant association was observed between test and retest scores, and between the OESS and SESS and plaque index scores. Further, group 2 had a significantly higher mean OESS score than group 1. We developed a 13-item OESS with high reliability and validity which may be used to assess outcome expectancy for self-care. A patient's psychological condition with regard to behaviour and affective status can be accurately evaluated using the OESS with SESS. © 2011 Blackwell Publishing Ltd.
Quality evaluation of JAMA Patient Pages on diabetes using the Ensuring Quality Information for Patient (EQIP) tool.

PubMed

Vaona, Alberto; Marcon, Alessandro; Rava, Marta; Buzzetti, Roberto; Sartori, Marco; Abbinante, Crescenza; Moser, Andrea; Seddaiu, Antonia; Prontera, Manuela; Quaglio, Alessandro; Pallazzoni, Piera; Sartori, Valentina; Rigon, Giulio

2011-12-01

Many medical journals provide patient information leaflets on the correct use of medicines and/or appropriate lifestyles. Only a few studies have assessed the quality of this patient-specific literature. The purpose of this study was to evaluate the quality of JAMA Patient Pages on diabetes using the Ensuring Quality Information for Patient (EQIP) tool. A multidisciplinary group of 10 medical doctors analyzed all diabetes-related Patient Pages published by JAMA from 1998 to 2010 using the EQIP tool. Inter-rater reliability was assessed using the percentage of observed total agreement (p(o)). A quality score between 0 and 1 (the higher score indicating higher quality) was calculated for each item on every page as a function of raters' answers to the EQIP checklist. A mean score per item and a mean score per page were then calculated. We found 8 Patient Pages on diabetes on the JAMA web site. The overall quality score of the documents ranged between 0.55 (Managing Diabetes and Diabetes) and 0.67 (weight and diabetes). p(o) was at least moderate (>50%) for 15 of the 20 EQIP items. Despite generally favorable quality scores, some items received low scores. The worst scores were for the item assessing provision of an empty space to customize information for individual patients (score=0.01, p(o)=95%) and patients' involvement in document drafting (score=0.11, p(o)=79%). The Patient Pages on diabetes published by JAMA were found to present weak points that limit their overall quality and may jeopardize their efficacy. We therefore recommend that authors and publishers of written patient information comply with published quality criteria. Further research is needed to evaluate the quality and efficacy of existing written health care information. Copyright © 2011 Primary Care Diabetes Europe. Published by Elsevier Ltd. All rights reserved.
Evaluation of the internal construct validity of the Personal Care Participation Assessment and Resource Tool (PC-PART) using Rasch analysis.

PubMed

Darzins, Susan; Imms, Christine; Di Stefano, Marilyn; Taylor, Nicholas F; Pallant, Julie F

2014-11-05

The Personal Care Participation Assessment and Resource Tool (PC-PART) is a 43-item, clinician-administered assessment, designed to identify patients' unmet needs (participation restrictions) in activities of daily living (ADL) required for community life. This information is important for identifying problems that need addressing to enable, for example, discharge from inpatient settings to community living. The objective of this study was to evaluate internal construct validity of the PC-PART using Rasch methods. Fit to the Rasch model was evaluated for 41 PC-PART items, assessing threshold ordering, overall model fit, individual item fit, person fit, internal consistency, Differential Item Functioning (DIF), targeting of items and dimensionality. Data used in this research were taken from admission data from a randomised controlled trial conducted at two publicly funded inpatient rehabilitation units in Melbourne, Australia, with 996 participants (63% women; mean age 74 years) and with various impairment types. PC-PART items assessed as one scale, and original PC-PART domains evaluated as separate scales, demonstrated poor fit to the Rasch model. Adequate fit to the Rasch model was achieved in two newly formed PC-PART scales: Self-Care (16 items) and Domestic Life (14 items). Both scales were unidimensional, had acceptable internal consistency (PSI = 0.85, 0.76, respectively) and well-targeted items. Rasch analysis did not support conventional summation of all PC-PART item scores to create a total score. However, internal construct validity of the newly formed PC-PART scales, Self-Care and Domestic Life, was supported. Their Rasch-derived scores provided interval-level measurement enabling summation of scores to form a total score on each scale. These scales may assist clinicians, managers and researchers in rehabilitation settings to assess and measure changes in ADL participation restrictions relevant to community living.
Data used in this research were gathered during a registered randomised controlled trial: Australian and New Zealand Clinical Trials Registry ACTRN12609000973213. Ethics committee approval was gained for secondary analysis of data for this study.
An Isotonic Partial Credit Model for Ordering Subjects on the Basis of Their Sum Scores

ERIC Educational Resources Information Center

Ligtvoet, Rudy

2012-01-01

In practice, the sum of the item scores is often used as a basis for comparing subjects. For items that have more than two ordered score categories, only the partial credit model (PCM) and special cases of this model imply that the subjects are stochastically ordered on the common latent variable. However, the PCM is very restrictive with respect…
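For readers unfamiliar with the partial credit model mentioned above, the following is a minimal sketch of its category probabilities in the standard textbook form, not the isotonic variant the article proposes. The function name, parameter names, and example values are hypothetical.

```python
import math

def pcm_probs(theta, deltas):
    """Category probabilities P(X = 0..m) under the partial credit model.

    theta  : latent trait value of the subject
    deltas : step parameters delta_1..delta_m of one item
    """
    # Cumulative sums of (theta - delta_j); category 0 has an empty sum (0)
    cum = [0.0]
    for d in deltas:
        cum.append(cum[-1] + (theta - d))
    # Subtract the maximum before exponentiating for numerical stability
    top = max(cum)
    weights = [math.exp(c - top) for c in cum]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical item with three ordered categories (two steps)
probs = pcm_probs(theta=0.0, deltas=[0.0, 0.0])
```

When theta equals every step parameter, as in the example call, all categories are equally likely; raising theta shifts probability mass toward the higher categories.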
Multivariate Generalizability Analysis of Automated Scoring for Short Answer Items of Social Studies in Large-Scale Assessment

ERIC Educational Resources Information Center

Sung, Kyung Hee; Noh, Eun Hee; Chon, Kyong Hee

2017-01-01

With increased use of constructed response items in large scale assessments, the cost of scoring has been a major consideration (Noh et al. in KICE Report RRE 2012-6, 2012; Wainer and Thissen in "Applied Measurement in Education" 6:103-118, 1993). In response to the scoring cost issues, various forms of automated system for scoring…
Performance of Automated Speech Scoring on Different Low- to Medium-Entropy Item Types for Low-Proficiency English Learners. Research Report. ETS RR-17-12

ERIC Educational Resources Information Center

Loukina, Anastassia; Zechner, Klaus; Yoon, Su-Youn; Zhang, Mo; Tao, Jidong; Wang, Xinhao; Lee, Chong Min; Mulholland, Matthew

2017-01-01

This report presents an overview of the "SpeechRater" automated scoring engine model building and evaluation process for several item types with a focus on a low-English-proficiency test-taker population. We discuss each stage of speech scoring, including automatic speech recognition, filtering models for nonscorable responses, and…
Veterinary students' perceptions of their learning environment as measured by the Dundee Ready Education Environment Measure.

PubMed

Pelzer, Jacquelyn M; Hodgson, Jennifer L; Werre, Stephen R

2014-03-24

The Dundee Ready Education Environment Measure (DREEM) has been widely used to evaluate the learning environment within health sciences education; however, this tool has not been applied in veterinary medical education. The aim of this study was to evaluate the reliability and validity of the DREEM tool in a veterinary medical program and to determine veterinary students' perceptions of their learning environment. The DREEM is a survey tool which quantitatively measures students' perceptions of their learning environment. The survey consists of 50 items, each scored 0-4 on a Likert Scale. The 50 items are subsequently analysed within five subscales related to students' perceptions of learning, faculty (teachers), academic atmosphere, and self-perceptions (academic and social). An overall score is obtained by summing the mean score for each subscale, with an overall possible score of 200. All students in the program were asked to complete the DREEM. Means and standard deviations were calculated for the 50 items, the five subscale scores and the overall score. Cronbach's alpha was determined for the five subscales and overall score to evaluate reliability. Confirmatory factor analysis was used to evaluate construct validity. 224 responses (53%) were received. The Cronbach's alpha for the overall score was 0.93 and for the five subscales were: perceptions of learning 0.85, perceptions of faculty 0.79, perceptions of atmosphere 0.81, academic self-perceptions 0.68, and social self-perceptions 0.72. Construct validity was determined to be acceptable (p < 0.001) and all items contributed to the overall validity of the DREEM. The overall DREEM score was 128.9/200, which is a positive result based on the developers' descriptors and comparable to other health science education programs. Four individual items of concern were identified by students.
In this setting the DREEM was a reliable and valid tool to measure veterinary students' perceptions of their learning environment. The four items identified as concerning originated from four of the five subscales, but all related to workload. Negative perceptions regarding workload are a common concern of students in health education programs. If not addressed, this perception may have an unfavourable impact on veterinary students' learning environment.
Adherence to dietary guidelines positively affects quality of life and functional status of older adults.

PubMed

Gopinath, Bamini; Russell, Joanna; Flood, Victoria M; Burlutsky, George; Mitchell, Paul

2014-02-01

Nutritional parameters could influence self-perceived health and functional status of older adults. We prospectively determined the association between diet quality and quality of life and activities of daily living. This was an observational cohort study in which total diet scores, reflecting adherence to dietary guidelines, were determined. Dietary intakes were assessed using a food frequency questionnaire at baseline. Total diet scores were allocated for intake of selected food groups and nutrients for each participant as described in the Australian Guide to Healthy Eating. Higher scores indicated closer adherence to dietary guidelines. In Sydney, Australia, 1,305 and 895 participants (aged ≥ 55 years) with complete data were examined over 5 and 10 years, respectively. The 36-Item Short-Form Survey assesses quality of life and has eight subscales representing dimensions of health and well-being; higher scores reflect better quality of life. Functional status was determined once at the 10-year follow-up by the Older Americans Resources and Services activities of daily living scale. This scale has 14 items: seven items assess basic activities of daily living (eg, eating and walking) and seven items assess instrumental activities of daily living (eg, shopping or housework). Normalized 36-Item Short-Form Survey component scores were used in analysis of covariance to calculate multivariable adjusted mean scores. Logistic regression analysis was used to calculate adjusted odds ratios and 95% CIs to demonstrate the association between total diet score with the 5-year incidence of impaired activities of daily living. Participants in the highest vs lowest quartile of baseline total diet scores had adjusted mean scores 5.6, 4.0, 5.3, and 2.6 units higher in these 36-Item Short-Form Survey domains 5 years later: physical function (P trend=0.003), general health (P trend=0.02), vitality (P trend=0.001), and physical composite score (P trend=0.003), respectively. 
Participants in the highest vs lowest quartile of baseline total diet scores had 50% reduced risk of impaired instrumental activities of daily living at follow-up (multivariable-adjusted P trend=0.03). Higher diet quality was prospectively associated with better quality of life and functional ability. Copyright © 2014 Academy of Nutrition and Dietetics. Published by Elsevier Inc. All rights reserved.
Prediction of IOI-HA scores using speech reception thresholds and speech discrimination scores in quiet.

PubMed

Brännström, K Jonas; Lantz, Johannes; Nielsen, Lars Holme; Olsen, Steen Østergaard

2014-02-01

Outcome measures can be used to improve the quality of the rehabilitation by identifying and understanding which variables influence the outcome. This information can be used to improve outcomes for clients. In clinical practice, pure-tone audiometry, speech reception thresholds (SRTs), and speech discrimination scores (SDSs) in quiet or in noise are common assessments made prior to hearing aid (HA) fittings. It is not known whether SRT and SDS in quiet relate to HA outcome measured with the International Outcome Inventory for Hearing Aids (IOI-HA). The aim of the present study was to investigate the relationship between pure-tone average (PTA), SRT, and SDS in quiet and IOI-HA in both first-time and experienced HA users. SRT and SDS were measured in a sample of HA users who also responded to the IOI-HA. Fifty-eight Danish-speaking adult HA users. The psychometric properties were evaluated and compared to previous studies using the IOI-HA. The associations and differences between the outcome scores and a number of descriptive variables (age, gender, fitted monaurally/binaurally with HA, first-time/experienced HA users, years of HA use, time since last HA fitting, best ear PTA, best ear SRT, or best ear SDS) were examined. A multiple forward stepwise regression analysis was conducted using scores on the separate IOI-HA items, the global score, and scores on the introspection and interaction subscales as dependent variables to examine whether the descriptive variables could predict these outcome measures. Scores on single IOI-HA items, the global score, and scores on the introspection (items 1, 2, 4, and 7) and interaction (items 3, 5, and 6) subscales closely resemble those previously reported. 
Multiple regression analysis showed that the best ear SDS predicts about 18-19% of the outcome on items 3 and 5 separately, and about 16% on the interaction subscale (sum of items 3, 5, and 6). The best ear SDS explains some of the variance displayed in the IOI-HA global score and the interaction subscale. The relation between SDS and IOI-HA suggests that a poor unaided SDS might in itself be a limiting factor for the HA rehabilitation efficacy and hence the IOI-HA outcome. The clinician could use this information to align the user's HA expectations to what is within possible reach. American Academy of Audiology.
Two-Step Screening for Depressive Symptoms and Prediction of Mortality in Patients With Heart Failure.

PubMed

Lee, Kyoung Suk; Moser, Debra K; Pelter, Michele; Biddle, Martha J; Dracup, Kathleen

2017-05-01

Comorbid depression in patients with heart failure is associated with increased risk for death. In order to effectively identify depressed patients with cardiac disease, the American Heart Association suggests a 2-step screening method: administering the 2-item Patient Health Questionnaire first and then the 9-item Patient Health Questionnaire. However, whether the 2-step method is better for predicting poor prognosis in heart failure than is either the 2-item or the 9-item tool alone is not known. To determine whether the 2-step method is better than either the 2-item or the 9-item questionnaire alone for predicting all-cause mortality in heart failure. During a 2-year period, 562 patients with heart failure were assessed for depression by using the 2-step method. With the 2-step method, results are considered positive if patients endorse either depressed mood or anhedonia on the 2-item screen and have scores of 10 or higher on the 9-item screen. Screening results with the 2-step method were not associated with all-cause mortality. Patients with scores positive for depression on either the 2-item or 9-item screen alone had 53% and 60% greater risk, respectively, for all-cause death than did patients with scores negative for depression after adjustments for covariates (hazard ratio, 1.530; 95% CI, 1.029-2.274 for the 2-item screen; hazard ratio, 1.603; 95% CI, 1.079-2.383 for the 9-item screen). The 2-step method has no clear advantages compared with the 2-item screen alone or the 9-item screen alone for predicting adverse prognostic effects of depressive symptoms in heart failure. ©2017 American Association of Critical-Care Nurses.
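The two-step decision rule described above can be sketched as follows. This is a minimal illustration of the rule as worded in the abstract; the function and parameter names are hypothetical, and how an item counts as "endorsed" on the 2-item screen is left to the caller because the abstract does not specify it.

```python
def two_step_positive(endorses_depressed_mood, endorses_anhedonia,
                      phq9_total, phq9_cutoff=10):
    """Return True when the two-step screen is positive.

    Step 1: either 2-item symptom (depressed mood or anhedonia) is endorsed.
    Step 2: the 9-item total score is at or above the cutoff (10 in the abstract).
    """
    step1 = endorses_depressed_mood or endorses_anhedonia
    step2 = phq9_total >= phq9_cutoff
    return bool(step1 and step2)

# Hypothetical patients
patient_a = two_step_positive(True, False, phq9_total=12)   # both steps met
patient_b = two_step_positive(True, True, phq9_total=7)     # fails step 2
patient_c = two_step_positive(False, False, phq9_total=15)  # fails step 1
```

Because both steps must be met, the two-step screen flags a subset of the patients flagged by either questionnaire alone.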
A Non-Parametric Item Response Theory Evaluation of the CAGE Instrument Among Older Adults.

PubMed

Abdin, Edimansyah; Sagayadevan, Vathsala; Vaingankar, Janhavi Ajit; Picco, Louisa; Chong, Siow Ann; Subramaniam, Mythily

2018-02-23

The validity of the CAGE using item response theory (IRT) has not yet been examined in the older adult population. This study aims to investigate the psychometric properties of the CAGE using both non-parametric and parametric IRT models, assess whether there is any differential item functioning (DIF) by age, gender and ethnicity and examine the measurement precision at the cut-off scores. We used data from the Well-being of the Singapore Elderly study to conduct Mokken scaling analysis (MSA), dichotomous Rasch and 2-parameter logistic IRT models. The measurement precision at the cut-off scores was evaluated using classification accuracy (CA) and classification consistency (CC). The MSA showed the overall scalability H index was 0.459, indicating a medium performing instrument. All items were found to be homogenous, measuring the same construct and able to discriminate well between respondents with high levels of the construct and the ones with lower levels. The item discrimination ranged from 1.07 to 6.73 while the item difficulty ranged from 0.33 to 2.80. Significant DIF was found for two items across ethnic groups. More than 90% (CC and CA ranged from 92.5% to 94.3%) of the respondents were consistently and accurately classified by the CAGE cut-off scores of 2 and 3. The current study provides new evidence on the validity of the CAGE from the IRT perspective. This study provides valuable information on each item in the assessment of the overall severity of alcohol problem and the precision of the cut-off scores in the older adult population.
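To make the discrimination and difficulty parameters above concrete, here is a minimal sketch of the standard 2-parameter logistic item response function. The function name is hypothetical, and the parameter pairing in the example is purely illustrative: the abstract reports only the ranges, not which discrimination belongs to which difficulty.

```python
import math

def p_2pl(theta, a, b):
    """Probability of endorsing a dichotomous item under the 2PL model.

    theta : latent trait level of the respondent
    a     : item discrimination (slope)
    b     : item difficulty (location)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Illustrative pairing of the reported range endpoints (an assumption)
p_flat = p_2pl(theta=1.0, a=1.07, b=0.33)
p_steep = p_2pl(theta=1.0, a=6.73, b=0.33)
```

At theta equal to b the endorsement probability is 0.5 regardless of a; a larger a makes the curve steeper around b, so the item separates respondents just below and just above that point more sharply.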

Dimensionality and summary measures of the SF-36 v1.6: comparison of scale- and item-based approach across ECRHS II adults population.

PubMed

Grassi, Mario; Nucera, Andrea

2010-01-01

The objective of this study was twofold: 1) to confirm the hypothetical eight scales and two-component summaries of the questionnaire Short Form 36 Health Survey (SF-36), and 2) to evaluate the performance of two alternative measures to the original physical component summary (PCS) and mental component summary (MCS). We performed principal component analysis (PCA) based on 35 items, after optimal scaling via multiple correspondence analysis (MCA), and subsequently on eight scales, after standard summative scoring. Item-based summary measures were planned. Data from the European Community Respiratory Health Survey II follow-up of 8854 subjects from 25 centers were analyzed to cross-validate the original and the novel PCS and MCS. Overall, the scale- and item-based comparison indicated that the SF-36 scales and summaries meet the supposed dimensionality. However, vitality, social functioning, and general health items did not fit data optimally. The novel measures, derived a posteriori by unit-rule from an oblique (correlated) MCA/PCA solution, are simple item sums or weighted scale sums where the weights are the raw scale ranges. These item-based scores yielded consistent scale-summary results for outliers profiles, with an expected known-group differences validity. We were able to confirm the hypothesized dimensionality of eight scales and two summaries of the SF-36. The alternative scoring reaches at least the same required standards of the original scoring. In addition, it can reduce the item-scale inconsistencies without loss of predictive validity.
Evaluation of psychometric properties and differential item functioning of 8-item Child Perceptions Questionnaires using item response theory.

PubMed

Yau, David T W; Wong, May C M; Lam, K F; McGrath, Colman

2015-08-19

Four-factor structure of the two 8-item short forms of Child Perceptions Questionnaire CPQ11-14 (RSF:8 and ISF:8) has been confirmed. However, the sum scores are typically reported in practice as a proxy of Oral health-related Quality of Life (OHRQoL), which implied a unidimensional structure. This study first assessed the unidimensionality of 8-item short forms of CPQ11-14. Item response theory (IRT) was employed to offer an alternative and complementary approach of validation and to overcome the limitations of classical test theory assumptions. A random sample of 649 12-year-old school children in Hong Kong was analyzed. Unidimensionality of the scale was tested by confirmatory factor analysis (CFA), principal component analysis (PCA) and local dependency (LD) statistic. Graded response model was fitted to the data. Contribution of each item to the scale was assessed by item information function (IIF). Reliability of the scale was assessed by test information function (TIF). Differential item functioning (DIF) across gender was identified by Wald test and expected score functions. Both CPQ11-14 RSF:8 and ISF:8 did not deviate much from the unidimensionality assumption. Results from CFA indicated acceptable fit of the one-factor model. PCA indicated that the first principal component explained >30 % of the total variation with high factor loadings for both RSF:8 and ISF:8. Almost all LD statistic <10 indicated the absence of local dependency. Flat and low IIFs were observed in the oral symptoms items suggesting little contribution of information to the scale and item removal caused little practical impact. Comparing the TIFs, RSF:8 showed slightly better information than ISF:8. In addition to oral symptoms items, the item "Concerned with what other people think" demonstrated a uniform DIF (p < 0.001). The expected score functions were not much different between boys and girls.
Items related to oral symptoms were not informative to OHRQoL and deletion of these items is suggested. The impact of DIF across gender on the overall score was minimal. CPQ11-14 RSF:8 performed slightly better than ISF:8 in measurement precision. The 6-item short forms suggested by IRT validation should be further investigated to ensure their robustness, responsiveness and discriminative performance.
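The graded response model and item information function referred to above can be sketched as follows, using the standard Samejima formulation. All function names, parameter names, and example values are hypothetical and are not estimates from the study.

```python
import math

def _logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def _cumulative(theta, a, thresholds):
    # P(X >= k) for k = 0..m+1, padded with 1 at the bottom and 0 at the top
    return [1.0] + [_logistic(a * (theta - b)) for b in thresholds] + [0.0]

def grm_category_probs(theta, a, thresholds):
    """Category probabilities; thresholds must be in increasing order."""
    cum = _cumulative(theta, a, thresholds)
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

def grm_item_information(theta, a, thresholds):
    """Item information: sum over categories of (dP_k/dtheta)^2 / P_k."""
    cum = _cumulative(theta, a, thresholds)
    info = 0.0
    for k in range(len(cum) - 1):
        p = cum[k] - cum[k + 1]
        dp = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p > 0:
            info += dp * dp / p
    return info

# Hypothetical items sharing thresholds but differing in discrimination
weak = grm_item_information(0.0, a=0.4, thresholds=[-1.0, 0.0, 1.0])
strong = grm_item_information(0.0, a=2.0, thresholds=[-1.0, 0.0, 1.0])
```

A low discrimination yields a flat, low information curve, which is the pattern the abstract describes for the oral symptoms items. A test information function is the sum of the item information functions across items.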
Combining agreement and frequency rating scales to optimize psychometrics in measuring behavioral health functioning.

PubMed

Marfeo, Elizabeth E; Ni, Pengsheng; Chan, Leighton; Rasch, Elizabeth K; Jette, Alan M

2014-07-01

The goal of this article was to investigate optimal functioning of using frequency vs. agreement rating scales in two subdomains of the newly developed Work Disability Functional Assessment Battery: the Mood & Emotions and Behavioral Control scales. A psychometric study comparing rating scale performance embedded in a cross-sectional survey used for developing a new instrument to measure behavioral health functioning among adults applying for disability benefits in the United States was performed. Within the sample of 1,017 respondents, the range of response category endorsement was similar for both frequency and agreement item types for both scales. There were fewer missing values in the frequency items than the agreement items. Both frequency and agreement items showed acceptable reliability. The frequency items demonstrated optimal effectiveness around the mean ± 1-2 standard deviation score range; the agreement items performed better at the extreme score ranges. Findings suggest an optimal response format requires a mix of both agreement-based and frequency-based items. Frequency items perform better in the normal range of responses, capturing specific behaviors, reactions, or situations that may elicit a specific response. Agreement items do better for those whose scores are more extreme and capture subjective content related to general attitudes, behaviors, or feelings of work-related behavioral health functioning. Copyright © 2014 Elsevier Inc. All rights reserved.
eHealth literacy in chronic disease patients: An item response theory analysis of the eHealth literacy scale (eHEALS).

PubMed

Paige, Samantha R; Krieger, Janice L; Stellefson, Michael; Alber, Julia M

2017-02-01

Chronic disease patients are affected by low computer and health literacy, which negatively affects their ability to benefit from access to online health information. To estimate reliability and confirm model specifications for eHealth Literacy Scale (eHEALS) scores among chronic disease patients using Classical Test Theory (CTT) and Item Response Theory techniques. A stratified sample of Black/African American (N=341) and Caucasian (N=343) adults with chronic disease completed an online survey including the eHEALS. Item discrimination was explored using bi-variate correlations and Cronbach's alpha for internal consistency. A categorical confirmatory factor analysis tested a one-factor structure of eHEALS scores. Item characteristic curves, in-fit/outfit statistics, omega coefficient, and item reliability and separation estimates were computed. A 1-factor structure of eHEALS was confirmed by statistically significant standardized item loadings, acceptable model fit indices (CFI/TLI>0.90), and 70% variance explained by the model. Item response categories increased with higher theta levels, and there was evidence of acceptable reliability (ω=0.94; item reliability=89; item separation=8.54). eHEALS scores are a valid and reliable measure of self-reported eHealth literacy among Internet-using chronic disease patients. Providers can use eHEALS to help identify patients' eHealth literacy skills. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Validation of Gujarati Version of ABILOCO-Kids Questionnaire.

PubMed

Diwan, Shraddha; Diwan, Jasmin; Patel, Pankaj; Bansal, Ankita B

2015-10-01

ABILOCO-Kids is a measure of locomotion ability for children with cerebral palsy (CP) aged 6 to 15 years & is available in English & French. To validate the Gujarati version of ABILOCO-Kids questionnaire to be used in clinical research on Gujarati population. ABILOCO-Kids questionnaire was translated into Gujarati from English using forward-backward-forward method. To ensure face & content validity of Gujarati version using group consensus method, each item was examined by group of experts having mean experience of 24.62 years in field of paediatric and paediatric physiotherapy. Each item was analysed for content, meaning, wording, format, ease of administration & scoring. Each item was scored by expert group as either accepted, rejected or accepted with modification. Procedure was continued until 80% of consensus for all items. Concurrent validity was examined on 55 children with Cerebral Palsy (6-15 years) of all Gross Motor Functional Classification System (GMFCS) level & all clinical types by correlating score of ABILOCO-Kids with Gross Motor Functional Measure & GMFCS. In phase 1 of validation, 16 items were accepted as it is; 22 items accepted with modification & 3 items went for phase 2 validation. For concurrent validity, highly significant positive correlation was found between score of ABILOCO-Kids & total GMFM (r=0.713, p<0.005) & highly significant negative correlation with GMFCS (r= -0.778, p<0.005). Gujarati translated version of ABILOCO-Kids questionnaire has good face & content validity as well as concurrent validity which can be used to measure caregiver reported locomotion ability in children with CP.
Application of the diligence inventory in dental education.

PubMed

Jasinevicius, T R; Bernard, H; Schuttenberg, E M

1998-04-01

The fifty-five-item Diligence Inventory for Higher Education (DI-HE) was applied to a new subject group--190 dental students. After item and factor analysis, a fifty-item (four subscale) inventory best reflected this group. The DI-HE's split half reliability was 0.81 (p < 0.001), the reliability coefficient for the pre- and post-test was 0.68 (p < 0.01), and the correlation coefficient alpha was 0.90. The DI-HE scores were high, with no statistical differences among the four classes. Overall, significant relationships were found between grade point averages (GPAs) and DI-HE total and subscale scores, with r values as high as 0.44. While female students' DI-HE scores were significantly higher (p = 0.023) than male students' scores, no correlations between DI-HE scores and GPAs for females were found. The results suggest that DI-HE may be useful for assessment purposes in professional education.
Effects of language dominance on item and order memory in free recall, serial recall and order reconstruction.

PubMed

Francis, Wendy S; Baca, Yuzeth

2014-01-01

Spanish-English bilinguals (N = 144) performed free recall, serial recall and order reconstruction tasks in both English and Spanish. Long-term memory for both item and order information was worse in the less fluent language (L2) than in the more fluent language (L1). Item scores exhibited a stronger disadvantage for the L2 in serial recall than in free recall. Relative order scores were lower in the L2 for all three tasks, but adjusted scores for free and serial recall were equivalent across languages. Performance of English-speaking monolinguals (N = 72) was comparable to bilingual performance in the L1, except that monolinguals had higher adjusted order scores in free recall. Bilingual performance patterns in the L2 were consistent with the established effects of concurrent task performance on these memory tests, suggesting that the cognitive resources required for processing words in the L2 encroach on resources needed to commit item and order information to memory. These findings are also consistent with a model in which item memory is connected to the language system, order information is processed by separate mechanisms and attention can be allocated differentially to these two systems.
An analysis of the masking of speech by competing speech using self-report data.

PubMed

Agus, Trevor R; Akeroyd, Michael A; Noble, William; Bhullar, Navjot

2009-01-01

Many of the items in the "Speech, Spatial, and Qualities of Hearing" scale questionnaire [S. Gatehouse and W. Noble, Int. J. Audiol. 43, 85-99 (2004)] are concerned with speech understanding in a variety of backgrounds, both speech and nonspeech. To study if this self-report data reflected informational masking, previously collected data on 414 people were analyzed. The lowest scores (greatest difficulties) were found for the two items in which there were two speech targets, with successively higher scores for competing speech (six items), energetic masking (one item), and no masking (three items). The results suggest significant masking by competing speech in everyday listening situations.
Urbanism and Life Satisfaction among the Aged.

ERIC Educational Resources Information Center

Liang, Jersey; Warfel, Becky L.

1983-01-01

Examined the impact of urbanism on the causal mechanisms by which life satisfaction is determined using a causal model that incorporates urbanism as a polytomous variable. Urbanism was found to have indirect main effects as well as interaction effects on life satisfaction. (Author/JAC)
[Development of a cell phone addiction scale for Korean adolescents].

PubMed

Koo, Hyun Young

2009-12-01

This study was done to develop a cell phone addiction scale for Korean adolescents. The process included construction of a conceptual framework, generation of initial items, verification of content validity, selection of secondary items, preliminary study, and extraction of final items. The participants were 577 adolescents in two middle schools and three high schools. Item analysis, factor analysis, criterion related validity, and internal consistency were used to analyze the data. Twenty items were selected for the final scale, and categorized into 3 factors explaining 55.45% of total variance. The factors were labeled as withdrawal/tolerance (7 items), life dysfunction (6 items), and compulsion/persistence (7 items). The scores for the scale were significantly correlated with self-control, impulsiveness, and cell phone use. Cronbach's alpha coefficient for the 20 items was .92. Scale scores identified students as cell phone addicted, heavy users, or average users. The above findings indicate that the cell phone addiction scale has good validity and reliability when used with Korean adolescents.
Validation of 5-item and 2-item questionnaires in Chinese version of Dizziness Handicap Inventory for screening objective benign paroxysmal positional vertigo.

PubMed

Chen, Wei; Shu, Liang; Wang, Qian; Pan, Hui; Wu, Jing; Fang, Jie; Sun, Xu-Hong; Zhai, Yu; Dong, You-Rong; Liu, Jian-Ren

2016-08-01

Studies validating the Dizziness Handicap Inventory (DHI) sub-scale (5-item and 2-item) and total scores as candidate screening instruments for benign paroxysmal positional vertigo (BPPV) are rare in China. From May 2014 to December 2014, 108 patients (55 with and 53 without BPPV) complaining of episodic vertigo in the past week were enrolled from a vertigo outpatient clinic for DHI evaluation, along with demographic and other clinical data. Objective BPPV was subsequently determined by positional evoking maneuvers recorded with optical Frenzel glasses. Cronbach's coefficient α was used to evaluate the reliability of the psychometric scales. The validity of the DHI total, 5-item and 2-item questionnaires for screening BPPV was assessed by receiver operating characteristic (ROC) curves. The DHI 5-item questionnaire had good internal consistency (Cronbach's coefficient α = 0.72). The area under the curve of the total DHI, 5-item and 2-item scores for discriminating patients with BPPV from those without was 0.678 (95% CI 0.578-0.778), 0.873 (95% CI 0.807-0.940) and 0.895 (95% CI 0.836-0.953), respectively. With a cutoff value of 12, the 5-item questionnaire showed 74.5% sensitivity and 88.7% specificity in separating patients with BPPV from those without. With a cutoff value of 6, the 2-item questionnaire showed 78.2% sensitivity and 88.7% specificity. The present study indicated that both the 5-item and 2-item questionnaires in the Chinese version of the DHI may be more valid than the DHI total score for screening objective BPPV and merit further application in clinical practice in China.
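The sensitivity and specificity figures in this record follow from a 2×2 screening table. The sketch below illustrates the arithmetic only. The cell counts are back-calculated assumptions, not values reported in the abstract; they were chosen to be consistent with 55 patients with BPPV, 53 without, and the reported rates. The function name is hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) in percent from 2x2 table counts."""
    sensitivity = 100.0 * tp / (tp + fn)  # true positives among all with BPPV
    specificity = 100.0 * tn / (tn + fp)  # true negatives among all without BPPV
    return sensitivity, specificity

# Assumed counts (not from the abstract): 55 with BPPV, 53 without.
sens5, spec5 = sensitivity_specificity(tp=41, fn=14, tn=47, fp=6)  # 5-item, cutoff 12
sens2, spec2 = sensitivity_specificity(tp=43, fn=12, tn=47, fp=6)  # 2-item, cutoff 6

print(round(sens5, 1), round(spec5, 1))
print(round(sens2, 1), round(spec2, 1))
```

Under these assumed counts the rounded percentages match the reported 74.5%/88.7% and 78.2%/88.7%.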
Measuring depression after spinal cord injury: Development and psychometric characteristics of the SCI-QOL Depression item bank and linkage with PHQ-9.

PubMed

Tulsky, David S; Kisala, Pamela A; Kalpakjian, Claire Z; Bombardier, Charles H; Pohlig, Ryan T; Heinemann, Allen W; Carle, Adam; Choi, Seung W

2015-05-01

To develop a calibrated spinal cord injury-quality of life (SCI-QOL) item bank, computer adaptive test (CAT), and short form to assess depressive symptoms experienced by individuals with SCI, transform scores to the Patient Reported Outcomes Measurement Information System (PROMIS) metric, and create a crosswalk to the Patient Health Questionnaire (PHQ)-9. We used grounded-theory based qualitative item development methods, large-scale item calibration field testing, confirmatory factor analysis, item response theory (IRT) analyses, and statistical linking techniques to transform scores to a PROMIS metric and to provide a crosswalk with the PHQ-9. Five SCI Model System centers and one Department of Veterans Affairs medical center in the United States. Adults with traumatic SCI. Spinal Cord Injury-Quality of Life (SCI-QOL) Depression Item Bank. Individuals with SCI were involved in all phases of SCI-QOL development. A sample of 716 individuals with traumatic SCI completed 35 items assessing depression, 18 of which were PROMIS items. After removing 7 non-PROMIS items, factor analyses confirmed a unidimensional pool of items. We used a graded response IRT model to estimate slopes and thresholds for the 28 retained items. The SCI-QOL Depression measure correlated 0.76 with the PHQ-9. The SCI-QOL Depression item bank provides a reliable and sensitive measure of depressive symptoms with scores reported in terms of general population norms. We provide a crosswalk to the PHQ-9 to facilitate comparisons between measures. The item bank may be administered as a CAT or as a short form and is suitable for research and clinical applications.
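For readers unfamiliar with the graded response model mentioned in this record, the following is a minimal sketch of how one slope and a set of ordered thresholds produce category probabilities for a single polytomous item. It uses the standard logistic form of Samejima's model; the function name and the parameter values are illustrative assumptions and are not estimates from the SCI-QOL calibration.

```python
import math

def grm_category_probs(theta, slope, thresholds):
    """Category probabilities for one item under the graded response model.

    theta      : latent trait level of the respondent
    slope      : item discrimination (a)
    thresholds : increasing list of category boundaries (b_1 < b_2 < ...)
    """
    # Cumulative probabilities of responding in category k or higher,
    # padded with 1.0 (lowest category or higher) and 0.0 (above the highest).
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# Hypothetical 4-category item (three thresholds).
probs = grm_category_probs(theta=0.5, slope=1.5, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

An item with m thresholds has m + 1 response categories, and the probabilities always sum to one.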
Systematic evaluation of clinical practice guidelines for pharmacogenomics.

PubMed

Beckett, Robert D; Kisor, David F; Smith, Thomas; Vonada, Brooke

2018-06-01

To systematically assess methodological quality of pharmacogenomics clinical practice guidelines. Guidelines published through 2017 were reviewed by at least three independent reviewers using the AGREE II instrument, which consists of 23 items grouped into 6 domains and 2 items representing an overall assessment. Items were assessed on a seven-point rating scale, and aggregate quality scores were calculated. 31 articles were included. All guidelines were published as peer-reviewed articles and 90% (n = 28) were endorsed by professional organizations. Mean AGREE II domain scores (maximum score 100%) ranged from 46.6 ± 11.5% ('applicability') to 78.9 ± 11.4% ('clarity of presentation'). Median overall quality score was 72.2% (IQR: 61.1-77.8%). Quality of pharmacogenomics guidelines was generally high, but variable, for most AGREE II domains.
Using a Linear Regression Method to Detect Outliers in IRT Common Item Equating

ERIC Educational Resources Information Center

He, Yong; Cui, Zhongmin; Fang, Yu; Chen, Hanwei

2013-01-01

Common test items play an important role in equating alternate test forms under the common item nonequivalent groups design. When the item response theory (IRT) method is applied in equating, inconsistent item parameter estimates among common items can lead to large bias in equated scores. It is prudent to evaluate inconsistency in parameter…
The Consequences of Ignoring Item Parameter Drift in Longitudinal Item Response Models

ERIC Educational Resources Information Center

Lee, Wooyeol; Cho, Sun-Joo

2017-01-01

Utilizing a longitudinal item response model, this study investigated the effect of item parameter drift (IPD) on item parameters and person scores via a Monte Carlo study. Item parameter recovery was investigated for various IPD patterns in terms of bias and root mean-square error (RMSE), and percentage of time the 95% confidence interval covered…
A Rasch analysis of the Manchester foot pain and disability index

PubMed Central

Muller, Sara; Roddy, Edward

2009-01-01

Background: There is currently no interval-level measure of foot-related disability and this has hampered research in this area. The Manchester Foot Pain and Disability Index (FPDI) could potentially fill this gap. Objective: To assess the fit of the three subscales (function, pain, appearance) of the FPDI to the Rasch unidimensional measurement model in order to form interval-level scores. Methods: A two-stage postal survey at a general practice in the UK collected data from 149 adults aged 50 years and over with foot pain. The 17 FPDI items, in three subscales, were assessed for their fit to the Rasch model. Checks were carried out for differential item functioning by age and gender. Results: The function and pain items fit the Rasch model and interval-level scores can be constructed. There were too few people without extreme scores on the appearance subscale to allow fit to the Rasch model to be tested. Conclusion: The items from the FPDI function and pain subscales can be used to obtain interval-level scores for these factors for use in future research studies in older adults. Further work is needed to establish the interval nature of these subscale scores in more diverse populations and to establish the measurement properties of these interval-level scores. PMID:19878536
Medical students' clinical performance in general practice - Triangulating assessments from patients, teachers and students.

PubMed

Braend, Anja Maria; Gran, Sarah Frandsen; Frich, Jan C; Lindbaek, Morten

2010-01-01

Formative assessment of medical students' clinical performance during general practice clerkship is necessary to learn consultation skills. Our aim was to triangulate feedback using patient questionnaires, written self-assessment and teachers' observation-based assessment, and to describe the content of this feedback. We developed StudentPEP, a 15-item version of EUROPEP, a tool for measuring patients' evaluation of quality in general practice. The teacher and student forms consisted of five StudentPEP-items and open-ended questions asking for approval and improvement needed on four aspects. Quantitative scores were analyzed statistically. Free-text comments were analyzed and categorized into 'specific and concrete' versus 'general and unspecific'. One hundred seventy-three students returned data from 2643 consultations. Mean patients' scores for 15 items were 4.3-4.8 on a five-point Likert scale. Mean teacher scores were 4.4 on five items, while students' mean self-assessments were 3.6-3.8. In an analysis of 380 consultations, students were more specific and concrete in their self-evaluation compared with teachers (p < 0.01). Patients scored students' performance high compared with students' self-assessments. Teachers' scores were in accordance with patients' scores. Teachers' written evaluations of students were often general. There is a potential for improving teachers' feedback in terms of more specific and concrete comments.
Statistical power as a function of Cronbach alpha of instrument questionnaire items.

PubMed

Heo, Moonseong; Kim, Namhee; Faith, Myles S

2015-10-14

In countless clinical trials, outcome measurements rely on instrument questionnaire items, which often suffer from measurement error that in turn affects the statistical power of study designs. The Cronbach alpha or coefficient alpha, here denoted by C(α), can be used as a measure of internal consistency of parallel instrument items that are developed to measure a target unidimensional outcome construct. The scale score for the target construct is often represented by the sum of the item scores. However, power functions based on C(α) have been lacking for various study designs. We formulate a statistical model for parallel items to derive power functions as a function of C(α) under several study designs. To this end, we adopt a fixed true score variance assumption as opposed to the usual fixed total variance assumption. That assumption is critical and practically relevant to show that smaller measurement errors are associated with higher inter-item correlations, and thus that greater C(α) is associated with greater statistical power. We compare the derived theoretical statistical power with empirical power obtained through Monte Carlo simulations for the following comparisons: one-sample comparison of pre- and post-treatment mean differences, two-sample comparison of pre-post mean differences between groups, and two-sample comparison of mean differences between groups. It is shown that C(α) is the same as a test-retest correlation of the scale scores of parallel items, which enables testing significance of C(α). Closed-form power functions and sample size determination formulas are derived in terms of C(α), for all of the aforementioned comparisons. Power functions are shown to be an increasing function of C(α), regardless of comparison of interest. The derived power functions are well validated by simulation studies that show that the magnitudes of theoretical power are virtually identical to those of the empirical power.
Regardless of research designs or settings, in order to increase statistical power, development and use of instruments with greater C(α), or equivalently with greater inter-item correlations, is crucial for trials that intend to use questionnaire items for measuring research outcomes. Further development of the power functions for binary or ordinal item scores and under more general item correlation structures reflecting more real world situations would be a valuable future study.
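As a minimal sketch of the quantity this record is built around, coefficient alpha can be computed from a respondents-by-items score matrix, and for parallel items with a common inter-item correlation it reduces to the standardized form k·r / (1 + (k − 1)·r). The function names and the toy data below are illustrative assumptions; the authors' power functions are not reproduced here.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Coefficient alpha from a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def standardized_alpha(k, r):
    """Alpha for k parallel items sharing a common inter-item correlation r."""
    return k * r / (1 + (k - 1) * r)

# Toy data: four respondents, two items (hypothetical scores).
scores = [[1, 2], [2, 1], [3, 3], [4, 4]]
alpha = cronbach_alpha(scores)
print(round(alpha, 4))
print(round(standardized_alpha(k=2, r=0.5), 4))
```

The second function makes the abstract's equivalence concrete: for a fixed number of items, alpha increases monotonically with the inter-item correlation.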
Grooming a CAT: customizing CAT administration rules to increase response efficiency in specific research and clinical settings.

PubMed

Kallen, Michael A; Cook, Karon F; Amtmann, Dagmar; Knowlton, Elizabeth; Gershon, Richard C

2018-05-05

To evaluate the degree to which applying alternative stopping rules would reduce response burden while maintaining score precision in the context of computer adaptive testing (CAT). Analyses were conducted on secondary data comprising CATs administered in a clinical setting at multiple time points (baseline and up to two follow-ups) to 417 study participants who had back pain (51.3%) and/or depression (47.0%). Participant mean age was 51.3 years (SD = 17.2) and ranged from 18 to 86. Participants tended to be white (84.7%), relatively well educated (77% with at least some college), female (63.9%), and married or living in a committed relationship (57.4%). The unit of analysis was individual assessment histories (i.e., CAT item response histories) from the parent study. Data were first aggregated across all individuals, domains, and time points in an omnibus dataset of assessment histories and then were disaggregated by measure for domain-specific analyses. Finally, assessment histories within a "clinically relevant range" (score ≥ 1 SD from the mean in direction of poorer health) were analyzed separately to explore score level-specific findings. Two different sets of CAT administration rules were compared. The original CAT (CAT ORIG) rules required that at least four and no more than 12 items be administered. If the score standard error (SE) reached a value < 3 points (T score metric) before 12 items were administered, the CAT was stopped. We simulated applying alternative stopping rules (CAT ALT), removing the requirement that a minimum of four items be administered, and stopped a CAT if responses to the first two items were both associated with best health, if the SE was < 3, if SE change < 0.1 (T score metric), or if 12 items were administered. We then compared score fidelity and response burden, defined as number of items administered, between CAT ORIG and CAT ALT.
CAT ORIG and CAT ALT scores varied little, especially within the clinically relevant range, and response burden was substantially lower under CAT ALT (e.g., 41.2% savings in omnibus dataset). Alternate stopping rules result in substantial reductions in response burden with minimal sacrifice in score precision.
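The two rule sets described in this record can be sketched as simple predicates over an assessment history. This is an illustration under stated assumptions, not the authors' implementation: the history is assumed to be a list of score SEs (T score metric) recorded after each administered item plus a list of flags marking "best health" responses, the SE rules in CAT ALT are assumed to be checked after every item, and all function names and SE paths are hypothetical.

```python
def stop_orig(se_history, min_items=4, max_items=12, se_cut=3.0):
    """CAT ORIG: stop at 12 items, or once SE < 3 after at least 4 items."""
    n = len(se_history)
    if n == 0:
        return False
    if n >= max_items:
        return True
    return n >= min_items and se_history[-1] < se_cut

def stop_alt(se_history, best_health, max_items=12, se_cut=3.0, delta_cut=0.1):
    """CAT ALT: no minimum; adds best-health and SE-change stopping rules."""
    n = len(se_history)
    if n == 0:
        return False
    if n >= max_items:
        return True
    if n >= 2 and all(best_health[:2]):   # first two responses both best health
        return True
    if se_history[-1] < se_cut:           # precision target reached
        return True
    # SE no longer improving meaningfully between consecutive items.
    return n >= 2 and abs(se_history[-2] - se_history[-1]) < delta_cut

def items_administered(se_path, rule):
    """Replay a full SE path and return how many items the rule would administer."""
    for n in range(1, len(se_path) + 1):
        if rule(se_path[:n]):
            return n
    return len(se_path)

# Hypothetical SE paths over 12 items.
improving = [6.0, 4.8, 3.9, 3.3, 2.9, 2.7, 2.6, 2.5, 2.4, 2.3, 2.2, 2.1]
plateau = [6.0, 5.0, 4.95, 4.9, 4.9, 4.9, 4.9, 4.9, 4.9, 4.9, 4.9, 4.9]
none_best = [False] * 12
healthy = [True, True] + [False] * 10

print(items_administered(improving, stop_orig))
print(items_administered(improving, lambda h: stop_alt(h, healthy)))
print(items_administered(plateau, stop_orig))
print(items_administered(plateau, lambda h: stop_alt(h, none_best)))
```

Response burden under each rule set is then just the item count returned by the replay, which is the quantity the study compares.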
Attitudes of medical students toward psychiatry in a Chilean medical school.

PubMed

Valdivieso, Sergio; Sirhan, Marisol; Aguirre, Constanza; Ivelic, Jose Antonio; Aillach, Emilio; Villarroel, Luis

2014-06-01

The authors assess the attitudes of seventh-year medical students with regard to psychiatry and patients with psychiatric illness during the psychiatry clerkship. A 32-item questionnaire regarding attitudes toward psychiatry and patients with psychiatric illness was administered at the beginning of the psychiatry clerkship. One hundred and ten seventh-year students participated in the study, providing responses anonymously. Average negative attitude item score was 2.45 ± 0.3 (range 1.7-3.3). Eighty-three students (75 %) responded to all the questions with an average negative attitude item score of 2.43 ± 0.3 (range 1.7-3.3) and a total negative attitude item score of 77.9 ± 10.3 (range 55-104). Undergraduate students of a Chilean medical school showed fairly positive attitudes toward psychiatry and toward patients with psychiatric illness.

Effects of asenapine on agitation and hostility in adults with acute manic or mixed episodes associated with bipolar I disorder.

PubMed

Citrome, Leslie; Landbloom, Ronald; Chang, Cheng-Tao; Earley, Willie

2017-01-01

Bipolar disorder is associated with an increased risk of aggression. However, effective management of hostility and/or agitation symptoms may prevent patients from becoming violent. This analysis investigated the efficacy of the antipsychotic asenapine on hostility and agitation in patients with bipolar I disorder. Data were pooled from three randomized, double-blind, placebo-controlled, Phase III trials of asenapine in adults with manic or mixed episodes of bipolar I disorder (NCT00159744, NCT00159796, and NCT00764478). Post hoc analyses assessed the changes from baseline to day 21 on the Young Mania Rating Scale (YMRS) and the Positive and Negative Syndrome Scale (PANSS) hostility-related item scores in asenapine- or placebo-treated patients with at least minimal or mild symptom severity and on the PANSS-excited component (PANSS-EC) total score in agitated patients. Changes were adjusted for improvements in overall mania symptoms to investigate direct effects on hostility. Significantly greater changes in favor of asenapine versus placebo were observed in YMRS hostility-related item scores (irritability: least squares mean difference [95% confidence interval] = -0.5 [-0.87, -0.22], P = 0.001; disruptive-aggressive behavior: -0.7 [-0.99, -0.37], P < 0.0001), PANSS hostility item score (-0.2 [-0.44, -0.04]; P = 0.0181), and PANSS-EC total score (-1.4 [-2.4, -0.4]; P = 0.0055). Changes in the YMRS disruptive-aggressive behavior score and the sum of the hostility-related items remained significant after adjusting for improvements in other YMRS item scores. Asenapine significantly reduced hostility and agitation in patients with bipolar I disorder; improvement was at least partially independent of overall improvement on mania symptoms.
Measurement Equivalence in ADL and IADL Difficulty Across International Surveys of Aging: Findings From the HRS, SHARE, and ELSA

PubMed Central

Kasper, Judith D.; Brandt, Jason; Pezzin, Liliana E.

2012-01-01

Objective. To examine the measurement equivalence of items on disability across three international surveys of aging. Method. Data for persons aged 65 and older were drawn from the Health and Retirement Survey (HRS, n = 10,905), English Longitudinal Study of Aging (ELSA, n = 5,437), and Survey of Health, Ageing and Retirement in Europe (SHARE, n = 13,408). Differential item functioning (DIF) was assessed using item response theory (IRT) methods for activities of daily living (ADL) and instrumental activities of daily living (IADL) items. Results. HRS and SHARE exhibited measurement equivalence, but 6 of 11 items in ELSA demonstrated meaningful DIF. At the scale level, this item-level DIF affected scores reflecting greater disability. IRT methods also spread out score distributions and shifted scores higher (toward greater disability). Results for mean disability differences by demographic characteristics, using original and DIF-adjusted scores, were the same overall but differed for some subgroup comparisons involving ELSA. Discussion. Testing and adjusting for DIF is one means of minimizing measurement error in cross-national survey comparisons. IRT methods were used to evaluate potential measurement bias in disability comparisons across three international surveys of aging. The analysis also suggested DIF was mitigated for scales including both ADL and IADL and that summary indexes (counts of limitations) likely underestimate mean disability in these international populations. PMID:22156662
Quality of Life in orthognathic surgery patients: post-surgical improvements in aesthetics and self-confidence.

PubMed

Rustemeyer, Jan; Gregersen, Johanne

2012-07-01

The objective of this prospective study was to assess changes of Quality of Life (QoL) in patients undergoing bimaxillary orthognathic surgery. Questionnaires were based on the Oral Health Impact Profile (OHIP, items OH-1-OH-14) and three additional questions (items AD-1-3), and were completed by patients (n=50; mean age 26.9±9.9 years) on average 9.1±2.4 months before surgery, and 12.1±1.4 months after surgery, using a scoring scale. Item scores describing functional limitation, physical pain, physical disability and chewing function did not change significantly, whereas item scores covering psychological discomfort and social disability domains revealed significant decreases following surgery. AD-2 "dissatisfying aesthetics" revealed the greatest difference between pre- and post-surgical scores (p<0.001). If there was a perception of aesthetic improvement of facial features post-surgery, the benefit in QoL was generally high. The significant correlation of the pre- to post-surgical changes of item OH-5 "self conscious" to nearly all other item changes suggested that OH-5 was the most sensitive indicator for post-surgical improvement of QoL. Psychological factors and aesthetics exerted a strong influence on the patients' QoL, and determined major changes more than functional aspects did. Copyright © 2011 European Association for Cranio-Maxillo-Facial Surgery. Published by Elsevier Ltd. All rights reserved.
Assessing depression outcome in patients with moderate dementia: sensitivity of the HoNOS65+ scale.

PubMed

Canuto, Alessandra; Rudhard-Thomazic, Valérie; Herrmann, François R; Delaloye, Christophe; Giannakopoulos, Panteleimon; Weber, Kerstin

2009-08-15

To date, there is no widely accepted clinical scale to monitor the evolution of depressive symptoms in demented patients. We assessed the sensitivity to treatment of a validated French version of the Health of the Nation Outcome Scale (HoNOS) 65+ compared to five routinely used scales. Thirty elderly inpatients with ICD-10 diagnosis of dementia and depression were evaluated at admission and discharge using paired t-test. Using the Brief Psychiatric Rating Scale (BPRS) "depressive mood" item as gold standard, a receiver operating characteristic curve (ROC) analysis assessed the validity of HoNOS65+F "depressive symptoms" item score changes. Unlike Geriatric Depression Scale, Mini Mental State Examination and Activities of Daily Living scores, BPRS scores decreased and Global Assessment Functioning Scale score increased significantly from admission to discharge. Amongst HoNOS65+F items, "behavioural disturbance", "depressive symptoms", "activities of daily life" and "drug management" items showed highly significant changes between the first and last day of hospitalization. The ROC analysis revealed that changes in the HoNOS65+F "depressive symptoms" item correctly classified 93% of the cases with good sensitivity (0.95) and specificity (0.88) values. These data suggest that the HoNOS65+F "depressive symptoms" item may provide a valid assessment of the evolution of depressive symptoms in demented patients.
Physical performance testing in mucopolysaccharidosis I: a pilot study.

PubMed

Dumas, Helene M; Fragala, Maria A; Haley, Stephen M; Skrinar, Alison M; Wraith, James E; Cox, Gerald F

2004-01-01

To develop and field-test a physical performance measure (MPS-PPM) for individuals with Mucopolysaccharidosis I (MPS I), a rare genetic disorder. Motor performance and endurance items were developed based on literature review, clinician feedback, feasibility, and equipment and training needs. A standardized testing protocol and scoring rules were created. The MPS-PPM includes: Arm Function (7 items), Leg Function (5 items), and Endurance (2 items). Pilot data were collected for 10 subjects (ages 5-29 years). We calculated Spearman's rho correlations between age, severity and summary z-scores on the MPS-PPM. Subjects had variable presentations, as correlations among the three sub-test scores were not significant. Increasing age was related to greater severity in physical performance (r = 0.72, p<0.05) and lower scores on the Leg Function (r = -0.67, p<0.05) and Endurance (r = -0.65, p<0.05) sub-tests. The MPS-PPM was sensitive to detecting physical performance deficits, as six subjects could not complete the full battery of Arm Function items and eight subjects were unable to complete all Leg Function items. Subjects walked more slowly and expended more energy than typically developing peers. Individuals with MPS I have difficulty with arm and leg function and reduced endurance. The MPS-PPM is a clinically feasible measure that detects limitations in physical performance and may have potential to quantify changes in function following intervention. Copyright 2004 Taylor and Francis Ltd.
When less is more: validating a brief scale to rate interprofessional team competencies.

PubMed

Lie, Désirée A; Richter-Lagha, Regina; Forest, Christopher P; Walsh, Anne; Lohenry, Kevin

2017-01-01

There is a need for validated and easy-to-apply behavior-based tools for assessing interprofessional team competencies in clinical settings. The seven-item observer-based Modified McMaster-Ottawa scale was developed for the Team Objective Structured Clinical Encounter (TOSCE) to assess individual and team performance in interprofessional patient encounters. We aimed to improve scale usability for clinical settings by reducing item numbers while maintaining generalizability; and to explore the minimum number of observed cases required to achieve modest generalizability for giving feedback. We administered a two-station TOSCE in April 2016 to 63 students split into 16 newly-formed teams, each consisting of four professions. The stations were of similar difficulty. We trained sixteen faculty to rate two teams each. We examined individual and team performance scores using generalizability (G) theory and principal component analysis (PCA). The seven-item scale shows modest generalizability (.75) with individual scores. PCA revealed multicollinearity and singularity among scale items and we identified three potential items for removal. Reducing items for individual scores from seven to four (measuring Collaboration, Roles, Patient/Family-centeredness, and Conflict Management) changed scale generalizability from .75 to .73. Performance assessment with two cases is associated with reasonable generalizability (.73). Students in newly-formed interprofessional teams show a learning curve after one patient encounter. Team scores from a two-station TOSCE demonstrate low generalizability whether the scale consisted of four (.53) or seven items (.55). The four-item Modified McMaster-Ottawa scale for assessing individual performance in interprofessional teams retains the generalizability and validity of the seven-item scale. Observation of students in teams interacting with two different patients provides reasonably reliable ratings for giving feedback. 
The four-item scale has potential for assessing individual student skills and the impact of IPE curricula in clinical practice settings. IPE: Interprofessional education; SP: Standardized patient; TOSCE: Team objective structured clinical encounter.
Functional recovery in patients with schizophrenia: recommendations from a panel of experts.

PubMed

Lahera, Guillermo; Gálvez, José L; Sánchez, Pedro; Martínez-Roig, Miguel; Pérez-Fuster, J V; García-Portilla, Paz; Herrera, Berta; Roca, Miquel

2018-06-05

The management of schizophrenia is evolving towards a more comprehensive model based on functional recovery. The concept of functional recovery goes beyond clinical remission and encompasses multiple aspects of the patient's life, making it difficult to settle on a definition and to develop reliable assessment criteria. In this consensus process based on a panel of experts in schizophrenia, we aimed to provide useful insights on functional recovery and its involvement in clinical practice and clinical research. After a literature review of functional recovery in schizophrenia, a scientific committee of 8 members prepared a 75-item questionnaire, including 6 sections: (I) the concept of functional recovery (9 items), (II) assessment of functional recovery (23 items), (III) factors influencing functional recovery (16 items), (IV) psychosocial interventions and functional recovery (8 items), (V) pharmacological treatment and functional recovery (14 items), and (VI) the perspective of patients and their relatives on functional recovery (5 items). The questionnaire was sent to a panel of 53 experts, who rated each item on a 9-point Likert scale. Consensus was achieved in a 2-round Delphi dynamics, using the median (interquartile range) scores to consider consensus in either agreement (scores 7-9) or disagreement (scores 1-3). Items not achieving consensus in the first round were sent back to the experts for a second consideration. After the two recursive rounds, consensus was achieved in 64 items (85.3%): 61 items (81.3%) in agreement and 3 (4.0%) in disagreement, all of them from section II (assessment of functional recovery). Items not reaching consensus were related to the concepts of functional recovery (1 item, 1.3%), functional assessment (5 items, 6.7%), factors influencing functional recovery (3 items, 4.0%), and psychosocial interventions (2 items, 5.6%). 
Despite the lack of a well-defined concept of functional recovery, we identified a trend towards a common archetype of the definition and factors associated with functional recovery, as well as its applicability in clinical practice and clinical research.
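The consensus bands described in this record (agreement for scores 7-9, disagreement for scores 1-3 on a 9-point scale) can be sketched as a classifier over one item's panel ratings, together with the percentage arithmetic over the 75 questionnaire items. This is a simplified illustration: it uses the median only, whereas the abstract states that the median (interquartile range) was used, and the function names and example ratings are hypothetical.

```python
from statistics import median

def classify_item(ratings):
    """Classify one item from its panel ratings on a 9-point Likert scale."""
    m = median(ratings)
    if 7 <= m <= 9:
        return "agreement"
    if 1 <= m <= 3:
        return "disagreement"
    return "no consensus"

def percent_of_items(count, total=75):
    """Share of the questionnaire's items, rounded to one decimal place."""
    return round(100.0 * count / total, 1)

# Hypothetical ratings from a small panel for three items.
print(classify_item([8, 9, 7, 8, 6]))
print(classify_item([2, 1, 3, 2, 4]))
print(classify_item([5, 4, 6, 5, 7]))

# Reported item counts expressed as shares of the 75 items.
print(percent_of_items(64), percent_of_items(61), percent_of_items(3))
```

The last line reproduces the reported 85.3%, 81.3% and 4.0% from the item counts given in the abstract.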
Modeling Incorrect Responses to Multiple-Choice Items with Multilinear Formula Score Theory.

ERIC Educational Resources Information Center

Drasgow, Fritz; And Others

This paper addresses the information revealed in incorrect option selection on multiple choice items. Multilinear Formula Scoring (MFS), a theory providing methods for solving psychological measurement problems of long standing, is first used to estimate option characteristic curves for the Armed Services Vocational Aptitude Battery Arithmetic…
Hidden Item Variance in Multiple Mini-Interview Scores

ERIC Educational Resources Information Center

Zaidi, Nikki L.; Swoboda, Christopher M.; Kelcey, Benjamin M.; Manuel, R. Stephen

2017-01-01

The extant literature has largely ignored a potentially significant source of variance in multiple mini-interview (MMI) scores by "hiding" the variance attributable to the sample of attributes used on an evaluation form. This potential source of hidden variance can be defined as rating items, which typically comprise an MMI evaluation…
34 CFR 200.8 - Assessment reports.

Code of Federal Regulations, 2013 CFR

2013-07-01

... assessment is given; (ii) In an understandable and uniform format, including an alternative format (e.g... understand. (b) Itemized score analyses for LEAs and schools. (1) A State's academic assessment system must produce and report to LEAs and schools itemized score analyses, consistent with § 200.2(b)(4), so that...
34 CFR 200.8 - Assessment reports.

Code of Federal Regulations, 2014 CFR

2014-07-01

... assessment is given; (ii) In an understandable and uniform format, including an alternative format (e.g... understand. (b) Itemized score analyses for LEAs and schools. (1) A State's academic assessment system must produce and report to LEAs and schools itemized score analyses, consistent with § 200.2(b)(4), so that...
34 CFR 200.8 - Assessment reports.

Code of Federal Regulations, 2012 CFR

2012-07-01

... assessment is given; (ii) In an understandable and uniform format, including an alternative format (e.g... understand. (b) Itemized score analyses for LEAs and schools. (1) A State's academic assessment system must produce and report to LEAs and schools itemized score analyses, consistent with § 200.2(b)(4), so that...
34 CFR 200.8 - Assessment reports.

Code of Federal Regulations, 2010 CFR

2010-07-01

... assessment is given; (ii) In an understandable and uniform format, including an alternative format (e.g... understand. (b) Itemized score analyses for LEAs and schools. (1) A State's academic assessment system must produce and report to LEAs and schools itemized score analyses, consistent with § 200.2(b)(4), so that...
34 CFR 200.8 - Assessment reports.

Code of Federal Regulations, 2011 CFR

2011-07-01

... assessment is given; (ii) In an understandable and uniform format, including an alternative format (e.g... understand. (b) Itemized score analyses for LEAs and schools. (1) A State's academic assessment system must produce and report to LEAs and schools itemized score analyses, consistent with § 200.2(b)(4), so that...
Usefulness of the dizziness handicap inventory in the screening for benign paroxysmal positional vertigo.

PubMed

Whitney, Susan L; Marchetti, Gregory F; Morris, Laura O

2005-09-01

The purpose of the study was to determine whether a newly developed subscale of the Dizziness Handicap Inventory (DHI) could assist in the screening of benign paroxysmal positional vertigo (BPPV). Retrospective case review. Tertiary balance referral center. Charts of 383 patients (mean age, 61 yr) with a variety of vestibular diagnoses (peripheral and central) were reviewed. Patients completed the DHI before the onset of physical therapy intervention. A newly developed BPPV subscale developed from current DHI items was computed to determine whether the score could assist the practitioner in identifying individuals with BPPV. Individuals with BPPV had significantly higher mean scores on the newly developed BPPV subscale of the DHI (p < 0.01). The five-item BPPV score was a significant predictor of the likelihood of having BPPV (chi2 = 8.35; p < 0.01). On the two-item BPPV scale, individuals who had a score of 8 of 8 were 4.3 times more likely to have BPPV compared with individuals who had a score of 0. Items on the DHI appear to be helpful in determining the likelihood of an individual having the diagnosis of BPPV.
Calibration of the Spanish PROMIS Smoking Item Banks.

PubMed

Huang, Wenjing; Stucky, Brian D; Edelen, Maria O; Tucker, Joan S; Shadel, William G; Hansen, Mark; Cai, Li

2016-07-01

The Patient-Reported Outcomes Measurement Information System (PROMIS) Smoking Initiative has developed item banks for assessing six smoking behaviors and biopsychosocial correlates of smoking among adult cigarette smokers. The goal of this study is to evaluate the performance of the Spanish version of the PROMIS smoking item banks as compared to the original banks developed in English. The six PROMIS banks for daily smokers were translated into Spanish and administered to a sample of Spanish-speaking adult daily smokers in the United States (N = 302). We first evaluated the unidimensionality of each bank using confirmatory factor analysis. We then conducted a two-group item response theory calibration, including an item response theory-based Differential Item Functioning (DIF) analysis by language of administration (Spanish vs. English). Finally, we generated full bank and short form scores for the translated banks and evaluated their psychometric performance. Unidimensionality of the Spanish smoking item banks was supported by confirmatory factor analysis results. Out of a total of 109 items that were evaluated for language DIF, seven items in three of the six banks were identified as having levels of DIF that exceeded an established criterion. The psychometric performance of the Spanish daily smoker banks is largely comparable to that of the English versions. The Spanish PROMIS smoking item banks are highly similar, but not entirely equivalent, to the original English versions. The parameters from these two-group calibrations can be used to generate comparable bank scores across the two language versions. In this study, we developed a Spanish version of the PROMIS smoking toolkit, which was originally designed and developed for English speakers. With the growing Spanish-speaking population, it is important to make the toolkit more accessible by translating the items and calibrating the Spanish version to be comparable with English-language scores. 
This study provided the translated item banks and short forms, comparable unbiased scores for Spanish speakers and evaluations of the psychometric properties of the new Spanish toolkit. © The Author 2016. Published by Oxford University Press on behalf of the Society for Research on Nicotine and Tobacco.
Applications of computerized adaptive testing (CAT) to the assessment of headache impact.

PubMed

Ware, John E; Kosinski, Mark; Bjorner, Jakob B; Bayliss, Martha S; Batenhorst, Alice; Dahlöf, Carl G H; Tepper, Stewart; Dowson, Andrew

2003-12-01

To evaluate the feasibility of computerized adaptive testing (CAT) and the reliability and validity of CAT-based estimates of headache impact scores in comparison with 'static' surveys. Responses to the 54-item Headache Impact Test (HIT) were re-analyzed for recent headache sufferers (n = 1016) who completed telephone interviews during the National Survey of Headache Impact (NSHI). Item response theory (IRT) calibrations and the computerized dynamic health assessment (DYNHA) software were used to simulate CAT assessments by selecting the most informative items for each person and estimating impact scores according to pre-set precision standards (CAT-HIT). Results were compared with IRT estimates based on all items (total-HIT), computerized 6-item dynamic estimates (CAT-HIT-6), and a developmental version of a 'static' 6-item form (HIT-6-D). Analyses focused on: respondent burden (survey length and administration time), score distributions ('ceiling' and 'floor' effects), reliability and standard errors, and clinical validity (diagnosis, level of severity). A random sample (n = 245) was re-assessed to test responsiveness. A second study (n = 1103) compared actual CAT surveys and an improved 'static' HIT-6 among current headache sufferers sampled on the Internet. Respondents completed measures from the first study and the generic SF-8 Health Survey; some (n = 540) were re-tested on the Internet after 2 weeks. In the first study, simulated CAT-HIT and total-HIT scores were highly correlated (r = 0.92) without 'ceiling' or 'floor' effects and with a substantial reduction (90.8%) in respondent burden. Six of the 54 items accounted for the great majority of item administrations (3603/5028, 77.6%). 
CAT-HIT reliability estimates were very high (0.975-0.992) in the range where 95% of respondents scored, and relative validity (RV) coefficients were high for diagnosis (RV = 0.87) and severity (RV = 0.89); patient-level classifications were accurate 91.3% for a diagnosis of migraine. For all three criteria of change, CAT-HIT scores were more responsive than all other measures. In the second study, estimates of respondent burden, item usage, reliability and clinical validity were replicated. The test-retest reliability of CAT-HIT was 0.79 and alternate forms coefficients ranged from 0.85 to 0.91. All correlations with the generic SF-8 were negative. CAT-based administrations of headache impact items achieved very large reductions in respondent burden without compromising validity for purposes of patient screening or monitoring changes in headache impact over time. IRT models and CAT-based dynamic health assessments warrant testing among patients with other conditions.
The Assessment of Physiotherapy Practice (APP) is a valid measure of professional competence of physiotherapy students: a cross-sectional study with Rasch analysis.

PubMed

Dalton, Megan; Davidson, Megan; Keating, Jenny

2011-01-01

Is the Assessment of Physiotherapy Practice (APP) a valid instrument for the assessment of entry-level competence in physiotherapy students? Cross-sectional study with Rasch analysis of initial (n=326) and validation samples (n=318). Students were assessed on completion of 4, 5, or 6-week clinical placements across one university semester. 298 clinical educators and 456 physiotherapy students at nine universities in Australia and New Zealand provided 644 completed APP instruments. APP data in both samples showed overall fit to a Rasch model of expected item functioning for interval scale measurement. Item 6 (Written communication) exhibited misfit in both samples, but was retained as an important element of competence. The hierarchy of item difficulty was the same in both samples with items related to professional behaviour and communication the easiest to achieve and items related to clinical reasoning the most difficult. Item difficulty was well targeted to person ability. No Differential Item Functioning was identified, indicating that the scale performed in a comparable way regardless of the student's age, gender or amount of prior clinical experience, and the educator's age, gender, or experience as an educator, or the type of facility, university, or clinical area. The instrument demonstrated unidimensionality confirming the appropriateness of summing the scale scores on each item to provide an overall score of clinical competence and was able to discriminate four levels of professional competence (Person Separation Index=0.96). Person ability and raw APP scores had a linear relationship (r²=0.99). Rasch analysis supports the interpretation that a student's APP score is an indication of their underlying level of professional competence in workplace practice. Copyright © 2011 Australian Physiotherapy Association.
Efficiently measuring dimensions of the externalizing spectrum model: Development of the Externalizing Spectrum Inventory-Computerized Adaptive Test (ESI-CAT).

PubMed

Sunderland, Matthew; Slade, Tim; Krueger, Robert F; Markon, Kristian E; Patrick, Christopher J; Kramer, Mark D

2017-07-01

The development of the Externalizing Spectrum Inventory (ESI) was motivated by the need to comprehensively assess the interrelated nature of externalizing psychopathology and personality using an empirically driven framework. The ESI measures 23 theoretically distinct yet related unidimensional facets of externalizing, which are structured under 3 superordinate factors representing general externalizing, callous aggression, and substance abuse. One limitation of the ESI is its length at 415 items. To facilitate the use of the ESI in busy clinical and research settings, the current study sought to examine the efficiency and accuracy of a computerized adaptive version of the ESI. Data were collected over 3 waves and totaled 1,787 participants recruited from undergraduate psychology courses as well as male and female state prisons. A series of 6 algorithms with different termination rules were simulated to determine the efficiency and accuracy of each test under 3 different assumed distributions. Scores generated using an optimal adaptive algorithm evidenced high correlations (r > .9) with scores generated using the full ESI, brief ESI item-based factor scales, and the 23 facet scales. The adaptive algorithms for each facet administered a combined average of 115 items, a 72% decrease in comparison to the full ESI. Similarly, scores on the item-based factor scales of the ESI-brief form (57 items) were generated using an average of 17 items, a 70% decrease. The current study successfully demonstrates that an adaptive algorithm can generate similar scores for the ESI and the 3 item-based factor scales using a fraction of the total item pool. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Changes in the nutritional quality of fast-food items marketed at restaurants, 2010 v. 2013.

PubMed

Soo, Jackie; Harris, Jennifer L; Davison, Kirsten K; Williams, David R; Roberto, Christina A

2018-03-27

To examine the nutritional quality of menu items promoted in four (US) fast-food restaurant chains (McDonald's, Burger King, Wendy's, Taco Bell) in 2010 and 2013. Menu items pictured on signs and menu boards were recorded at 400 fast-food restaurants across the USA. The Nutrient Profile Index (NPI) was used to calculate overall nutrition scores for items (higher scores indicate greater nutritional quality) and was dichotomized to denote healthier v. less healthy items. Changes over time in NPI scores and energy of promoted foods and beverages were analysed using linear regression. Four hundred fast-food restaurants (McDonald's, Burger King, Wendy's, Taco Bell; 100 locations per chain). NPI of fast-food items marketed at fast-food restaurants. Promoted foods and beverages on general menu boards and signs remained below the 'healthier' cut-off at both time points. On general menu boards, pictured items became modestly healthier from 2010 to 2013, increasing (mean (se)) by 3·08 (0·16) NPI score points (P<0·001) and decreasing (mean (se)) by 130 (15) kJ (31·1 (3·65) kcal; P<0·001). This pattern was evident in all chains except Taco Bell, where pictured items increased in energy. Foods and beverages pictured on the kids' section showed the greatest nutritional improvements. Although promoted foods on general menu boards and signs improved in nutritional quality, beverages remained the same or became worse. Foods, and to a lesser extent, beverages, promoted on menu boards and signs in fast-food restaurants showed limited improvements in nutritional quality in 2013 v. 2010.

Calibration of Automatically Generated Items Using Bayesian Hierarchical Modeling.

ERIC Educational Resources Information Center

Johnson, Matthew S.; Sinharay, Sandip

For complex educational assessments, there is an increasing use of "item families," which are groups of related items. However, calibration or scoring for such an assessment requires fitting models that take into account the dependence structure inherent among the items that belong to the same item family. C. Glas and W. van der Linden…
Household Food Security in Isfahan Based on Current Population Survey Adapted Questionnaire

PubMed Central

Rafiei, Morteza; Rastegari, Hosein Ali; Ghiasi, Mojdeh; Shahsanaie, Vahid

2013-01-01

Background: Food security is a state in which all people at every time have physical and economic access to adequate food to meet nutritional needs and live a healthy and active life. Therefore, this study was performed to quantitatively evaluate household food security in Esfahan using the localized version of the US Household Food Security Survey Module (US HFSSM). Methods: This descriptive cross-sectional study was performed in 2006 on 3000 households of Esfahan. The study instrument is the 18-item US food security module, which was adapted into a localized 15-item questionnaire. The study was performed in two stages: families with no children (under 18 years old) and families with children over 18 years old. Results: The results report the item severity coefficient, the ratio of responses given by households, and the item infit and outfit coefficients for the adults' and children's questionnaires, respectively. According to the data obtained, a scale score of +3 in the adults' group is described as the determination limit of slight food insecurity and +6 is stated as the limit for severe food insecurity. For the children's group, a scale score of +2 is defined as the limit of slight food insecurity and +5 is the determination limit of severe food insecurity. Conclusions: The main hypothesis of this survey analysis is based on the raw scale score of the US HFSSM. The item “lack of enough money for buying food” (item 2) and the item “lack of balanced meal” (3rd item) have the lowest severity coefficients. Item severity then increases through the first item and the 4th item, and keeps increasing up to the 10th item. PMID:24498498
Development and initial validation of the Bedside Paediatric Early Warning System score

PubMed Central

2009-01-01

Introduction Adverse outcomes following clinical deterioration in children admitted to hospital wards are frequently preventable. Identification of children for referral to critical care experts remains problematic. Our objective was to develop and validate a simple bedside score to quantify severity of illness in hospitalized children. Methods A case-control design was used to evaluate 11 candidate items and identify a pragmatic score for routine bedside use. Case-patients were urgently admitted to the intensive care unit (ICU). Control-patients had no 'code blue', ICU admission or care restrictions. Validation was performed using two prospectively collected datasets. Results Data from 60 case and 120 control-patients were obtained. Four out of eleven candidate-items were removed. The seven-item Bedside Paediatric Early Warning System (PEWS) score ranges from 0–26. The mean maximum scores were 10.1 in case-patients and 3.4 in control-patients. The area under the receiver operating characteristics curve was 0.91, compared with 0.84 for the retrospective nurse-rating of patient risk for near or actual cardiopulmonary arrest. At a score of 8 the sensitivity and specificity were 82% and 93%, respectively. The score increased over 24 hours preceding urgent paediatric intensive care unit (PICU) admission (P < 0.0001). In 436 urgent consultations, the Bedside PEWS score was higher in patients admitted to the ICU than patients who were not admitted (P < 0.0001). Conclusions We developed and performed the initial validation of the Bedside PEWS score. This 7-item score can quantify severity of illness in hospitalized children and identify critically ill children with at least one hour's notice. Prospective validation in other populations is required before clinical application. PMID:19678924
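The sensitivity and specificity reported at a score cut-off can be illustrated with a minimal sketch. It assumes that a patient is flagged when the score is at or above the threshold, which the abstract does not state explicitly; the function name and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[float], is_case: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when score >= threshold flags a patient.

    The ">=" rule is an assumption made for illustration only.
    """
    pairs = list(zip(scores, is_case))
    tp = sum(1 for s, c in pairs if c and s >= threshold)      # cases flagged
    fn = sum(1 for s, c in pairs if c and s < threshold)       # cases missed
    tn = sum(1 for s, c in pairs if not c and s < threshold)   # controls cleared
    fp = sum(1 for s, c in pairs if not c and s >= threshold)  # controls flagged
    return tp / (tp + fn), tn / (tn + fp)


# Made-up illustrative data: 4 case-patients followed by 6 control-patients.
scores = [10, 12, 7, 9, 3, 2, 8, 4, 1, 5]
is_case = [True, True, True, True, False, False, False, False, False, False]
sens, spec = sensitivity_specificity(scores, is_case, threshold=8)
```

Sweeping the threshold over the score range and collecting the resulting pairs would trace out the points of a receiver operating characteristic curve.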
Evaluation of a daily practice composite score for the assessment of Crohn's disease: the treatment impact of certolizumab pegol.

PubMed

Feagan, B G; Hanauer, S B; Coteur, G; Schreiber, S

2011-05-01

Successful treatment of systemic inflammatory symptoms is essential for improving health-related quality of life in patients with active Crohn's disease. Patient-reported outcomes provide unique perspectives on the impact of chronic disease. It is unknown whether a combination of different instruments might improve sensitivity to clinically relevant changes in health status. To develop a composite score based upon Crohn's Disease Activity Index (CDAI) and Inflammatory Bowel Disease Questionnaire (IBDQ) items. Patients from the PRECiSE 2 trial who responded at week 6 to certolizumab pegol (CZP) were randomised to receive treatment with CZP 400 mg or placebo for up to 26 weeks. IBDQ and CDAI scores were assessed at weeks 0, 6, 16 and 26. A 'daily practice' composite score (DP-6) containing two items from the CDAI and four items from IBDQ was constructed. Correlation coefficients between the CDAI score and IBDQ total score at baseline and at week 26 were -0.344 and -0.603, respectively (P<0.05). All IBDQ items were improved following CZP treatment. The DP-6 had the highest responsiveness at assessing response to treatment, relative to CDAI total score, when compared with other scores. The DP-6 composite score could be used to optimise the use of existing instruments by serving as an index of symptoms due to systemic inflammation. Additional studies are needed to determine if the DP-6 composite score differentiates the impact of different treatments on patient-reported outcomes, and to determine if the use of the DP-6 improves the care of patients in clinical practice. © 2011 Blackwell Publishing Ltd.
The Effects of Judgment-Based Stratum Classifications on the Efficiency of Stratum Scored CATs.

ERIC Educational Resources Information Center

Finney, Sara J.; Smith, Russell W.; Wise, Steven L.

Two operational item pools were used to investigate the performance of stratum computerized adaptive tests (CATs) when items were assigned to strata based on empirical estimates of item difficulty or human judgments of item difficulty. Items from the first data set consisted of 54 5-option multiple choice items from a form of the ACT mathematics…
The factor structure and screening utility of the Social Interaction Anxiety Scale.

PubMed

Rodebaugh, Thomas L; Woods, Carol M; Heimberg, Richard G; Liebowitz, Michael R; Schneier, Franklin R

2006-06-01

The widely used Social Interaction Anxiety Scale (SIAS; R. P. Mattick & J. C. Clarke, 1998) possesses favorable psychometric properties, but questions remain concerning its factor structure and item properties. Analyses included 445 people with social anxiety disorder and 1,689 undergraduates. Simple unifactorial models fit poorly, and models that accounted for differences due to item wording (i.e., reverse scoring) provided superior fit. It was further found that clients and undergraduates approached some items differently, and the SIAS may be somewhat overly conservative in selecting analogue participants from an undergraduate sample. Overall, this study provides support for the excellent properties of the SIAS's straightforwardly worded items, although questions remain regarding its reverse-scored items. Copyright 2006 APA, all rights reserved.
Investigating diagnostic bias in autism spectrum conditions: An item response theory analysis of sex bias in the AQ-10.

PubMed

Murray, Aja Louise; Allison, Carrie; Smith, Paula L; Baron-Cohen, Simon; Booth, Tom; Auyeung, Bonnie

2017-05-01

Diagnostic bias is a concern in autism spectrum conditions (ASC) where prevalence and presentation differ by sex. To ensure that females with ASC are not under-identified, it is important that ASC screening tools do not systematically underestimate autistic traits in females relative to males. We evaluated whether the AQ-10, a brief screen for ASC recommended by the National Institute of Clinical Excellence in cases of suspected ASC, exhibits such a bias. Using an item response theory approach, we evaluated differential item functioning and differential test functioning. We found that although individual items showed some sex bias, these biases at times favored males and at other times favored females. Thus, at the level of test scores the item-level biases cancelled out to give an unbiased overall score. Results support the continued use of the AQ-10 sum score in its current form; however, suggest that caution should be exercised when interpreting responses to individual items. The nature of the item level biases could serve as a guide for future research into how ASC affects males and females differently. Autism Res 2017, 10: 790-800. © 2016 International Society for Autism Research, Wiley Periodicals, Inc.
Determining the Sensitivity of CAT-ASVAB (Computerized Adaptive Testing- Armed Services Vocational Aptitude Battery) Scores to Changes in Item Response Curves with the Medium of Administration

DTIC Science & Technology

1986-08-01

most examinees. Therefore it appears psychometrically acceptable for the CAT-ASVAB project to proceed without item recalibration based on...MEMORANDUM DETERMINING THE SENSITIVITY OF CAT-ASVAB SCORES TO CHANGES IN ITEM RESPONSE CURVES WITH THE MEDIUM OF ADMINISTRATION D. R. Divgi...Subj: Center for Naval Analyses Research Memorandum 86-189 End: (1) CNA Research Memorandum 86-189, "Determining the Sensitivity of CAT-ASVAB
Can business and economics students perform elementary arithmetic?

PubMed

Standing, Lionel G; Sproule, Robert A; Leung, Ambrose

2006-04-01

Business and economics majors (N=146) were tested on the D'Amore Test of Elementary Arithmetic, which employs third-grade test items from 1932. Only 40% of the subjects passed the test by answering 10 out of 10 items correctly. Self-predicted scores were a good predictor of actual scores, but performance was not associated with demographic variables, grades in calculus courses, liking for science or computers, or mathematics anxiety. Scores decreased over the subjects' initial years on campus. The hardest test item, with an error rate of 23%, required the subject to evaluate (36 x 7) + (33 x 7). The results are similar to those of Standing in 2006, despite methodological changes intended to maximize performance.
Validation of Gujarati Version of ABILOCO-Kids Questionnaire

PubMed Central

Diwan, Jasmin; Patel, Pankaj; Bansal, Ankita B.

2015-01-01

Background ABILOCO-Kids is a measure of locomotion ability for children with cerebral palsy (CP) aged 6 to 15 years & is available in English & French. Aim To validate the Gujarati version of ABILOCO-Kids questionnaire to be used in clinical research on Gujarati population. Materials and Methods ABILOCO-Kids questionnaire was translated into Gujarati from English using forward-backward-forward method. To ensure face & content validity of Gujarati version using group consensus method, each item was examined by group of experts having mean experience of 24.62 years in field of paediatric and paediatric physiotherapy. Each item was analysed for content, meaning, wording, format, ease of administration & scoring. Each item was scored by expert group as either accepted, rejected or accepted with modification. Procedure was continued until 80% of consensus for all items. Concurrent validity was examined on 55 children with Cerebral Palsy (6-15 years) of all Gross Motor Functional Classification System (GMFCS) level & all clinical types by correlating score of ABILOCO-Kids with Gross Motor Functional Measure & GMFCS. Result In phase 1 of validation, 16 items were accepted as it is; 22 items accepted with modification & 3 items went for phase 2 validation. For concurrent validity, highly significant positive correlation was found between score of ABILOCO-Kids & total GMFM (r=0.713, p<0.005) & highly significant negative correlation with GMFCS (r= -0.778, p<0.005). Conclusion Gujarati translated version of ABILOCO-Kids questionnaire has good face & content validity as well as concurrent validity which can be used to measure caregiver reported locomotion ability in children with CP. PMID:26557603
Testing manifest monotonicity using order-constrained statistical inference.

PubMed

Tijmstra, Jesper; Hessen, David J; van der Heijden, Peter G M; Sijtsma, Klaas

2013-01-01

Most dichotomous item response models share the assumption of latent monotonicity, which states that the probability of a positive response to an item is a nondecreasing function of a latent variable intended to be measured. Latent monotonicity cannot be evaluated directly, but it implies manifest monotonicity across a variety of observed scores, such as the restscore, a single item score, and in some cases the total score. In this study, we show that manifest monotonicity can be tested by means of the order-constrained statistical inference framework. We propose a procedure that uses this framework to determine whether manifest monotonicity should be rejected for specific items. This approach provides a likelihood ratio test for which the p-value can be approximated through simulation. A simulation study is presented that evaluates the Type I error rate and power of the test, and the procedure is applied to empirical data.
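As a rough illustration of what manifest monotonicity on the restscore means, the sketch below computes the observed proportion of positive responses to one item within each restscore group and checks whether those proportions are nondecreasing. This is only a descriptive check, not the order-constrained likelihood ratio test proposed in the study; the function names and the toy response matrix are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def restscore_proportions(
    responses: Sequence[Sequence[int]], item: int
) -> Dict[int, float]:
    """Proportion of positive responses to `item` within each restscore group.

    The restscore is the total score on all items except `item`.
    """
    groups: Dict[int, List[int]] = defaultdict(list)
    for row in responses:
        rest = sum(row) - row[item]
        groups[rest].append(row[item])
    # Build the result in increasing restscore order.
    return {rest: sum(vals) / len(vals) for rest, vals in sorted(groups.items())}


def is_nondecreasing(values: Sequence[float]) -> bool:
    """True when each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Toy dichotomous data (rows = persons, columns = items), for illustration only.
responses = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [1, 1, 1],
]
props = restscore_proportions(responses, item=0)
monotone = is_nondecreasing(list(props.values()))
```

A sample violation found this way may be due to chance alone, which is why a formal test such as the one described above is needed before rejecting manifest monotonicity for an item.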
Lawton IADL scale in dementia: can item response theory make it more informative?

PubMed

McGrory, Sarah; Shenkin, Susan D; Austin, Elizabeth J; Starr, John M

2014-07-01

Impairment of functional abilities represents a crucial component of dementia diagnosis. Current functional measures rely on the traditional aggregate method of summing raw scores. While this summary score provides a quick representation of a person's ability, it disregards useful information on the item level. To use item response theory (IRT) methods to increase the interpretive power of the Lawton Instrumental Activities of Daily Living (IADL) scale by establishing a hierarchy of item 'difficulty' and 'discrimination'. This cross-sectional study applied IRT methods to the analysis of IADL outcomes. Participants were 202 members of the Scottish Dementia Research Interest Register (mean age = 76.39, range = 56-93, SD = 7.89 years) with complete itemised data available. A Mokken scale with good reliability (Molenaar-Sijtsma statistic 0.79) was obtained, satisfying the IRT assumption that the items comprise a single unidimensional scale. The eight items in the scale could be placed on a hierarchy of 'difficulty' (H coefficient = 0.55), with 'Shopping' being the most 'difficult' item and 'Telephone use' being the least 'difficult' item. 'Shopping' was the most discriminatory item, differentiating well between patients of different levels of ability. IRT methods are capable of providing more information about functional impairment than a summed score. 'Shopping' and 'Telephone use' were identified as items that reveal key information about a patient's level of ability, and could be useful screening questions for clinicians. © The Author 2013. Published by Oxford University Press on behalf of the British Geriatrics Society.
[Severe intimate partner violence risk prediction scale-revised].

PubMed

Echeburúa, Enrique; Amor, Pedro Javier; Loinaz, Ismael; de Corral, Paz

2010-11-01

The aim of this study was to describe the psychometric properties of the Severe Intimate Partner Violence Risk Prediction Scale and to revise it in order to weight the 20 items according to their discriminant capacity and to solve the missing item problem. The sample for this study consisted of 450 male batterers who were reported to the police station. The victims were classified as high-risk (18.2%), moderate-risk (45.8%) and low-risk (36%), depending on the cutoff scores in the original scale. Internal consistency (Cronbach's alpha=.72) and interrater reliability (r=.73) were acceptable. The point biserial correlation coefficient between each item and the corrected total score of the 20-item scale was calculated to determine the most discriminative items, which were associated with the context of intimate partner violence in the last month, with the male batterer's profile and with the victim's vulnerability. A revised scale (EPV-R) with new cutoff scores and indications on how to deal with the missing items was proposed in accordance with these results. This easy-to-use tool appears to be suited to the requirements of criminal justice professionals and is intended for use in safety planning. Implications of these results for further research are discussed.
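The item-level statistic mentioned above can be sketched as follows. The point-biserial coefficient is the Pearson correlation between a dichotomous item and a continuous score. The sketch assumes that "corrected total score" means the total with the item itself removed, which is the usual convention but is not spelled out in the abstract. The function names and toy data are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(responses: Sequence[Sequence[int]], item: int) -> float:
    """Point-biserial correlation between a 0/1 item and the total of the other items.

    Assumption: the corrected total excludes the item being evaluated.
    """
    item_scores = [row[item] for row in responses]
    rest_totals = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_totals)


# Toy 0/1 data (rows = respondents, columns = items), for illustration only.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]
r_item0 = corrected_item_total(responses, item=0)
```

Items with larger coefficients discriminate better between respondents with high and low totals, which is the sense in which they could receive more weight in a revised scale.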
Dimensionality Assessment for Dichotomously Scored Items Using Multidimensional Scaling.

ERIC Educational Resources Information Center

Jones, Patricia B.; And Others

In order to determine the effectiveness of multidimensional scaling (MDS) in recovering the dimensionality of a set of dichotomously-scored items, data were simulated in one, two, and three dimensions for a variety of correlations with the underlying latent trait. Similarity matrices were constructed from these data using three margin-sensitive…
Comparison of Reliability Measures under Factor Analysis and Item Response Theory

ERIC Educational Resources Information Center

Cheng, Ying; Yuan, Ke-Hai; Liu, Cheng

2012-01-01

Reliability of test scores is one of the most pervasive psychometric concepts in measurement. Reliability coefficients based on a unifactor model for continuous indicators include maximal reliability rho and an unweighted sum score-based omega, among many others. With increasing popularity of item response theory, a parallel reliability measure pi…
The Effects of Item by Item Feedback Given during an Ability Test.

ERIC Educational Resources Information Center

Whetton, C.; Childs, R.

1981-01-01

Answer-until-correct (AUC) is a procedure for providing feedback during a multiple-choice test, giving an increased range of scores. The performance of secondary students on a verbal ability test using AUC procedures was compared with a group using conventional instructions. AUC scores considerably enhanced reliability but not validity.…
Observed-Score Equating as a Test Assembly Problem.

ERIC Educational Resources Information Center

van der Linden, Wim J.; Luecht, Richard M.

1998-01-01

Derives a set of linear conditions of item-response functions that guarantees identical observed-score distributions on two test forms. The conditions can be added as constraints to a linear programming model for test assembly. An example illustrates the use of the model for an item pool from the Law School Admissions Test (LSAT). (SLD)
Developing and Evaluating a Machine-Scorable, Constrained Constructed-Response Item.

ERIC Educational Resources Information Center

Braun, Henry I.; And Others

The use of constructed response items in large scale standardized testing has been hampered by the costs and difficulties associated with obtaining reliable scores. The advent of expert systems may signal the eventual removal of this impediment. This study investigated the accuracy with which expert systems could score a new, non-multiple choice…
Length of stay of stroke rehabilitation inpatients: prediction through the functional independence measure.

PubMed

Franchignoni, F; Tesio, L; Martino, M T; Benevolo, E; Castagna, M

1998-01-01

A model for prediction of length of stay (LOS, in days) of stroke rehabilitation inpatients was developed, based on patients' age (years) and function at admission (scored on the Functional Independence Measure, FIM℠). One hundred and twenty-nine cases, consecutively admitted to three free-standing rehabilitation centres in Italy, were analyzed. A multiple linear regression using forward stepwise selection procedure was adopted. Median admission and discharge scores were: 57 and 75 for the total FIM score, 29 and 48 for the 13-item motor FIM subscore, 29 and 30 for the 5-item cognitive FIM subscore (potential range: 18-126, 13-91, 5-35, respectively). Median LOS was 44 days (interquartile range 30-62). The logLOS predictive model included three FIM items ("toilet transfer", TTr; "social interaction"; "expression") and patient's age (R² = 0.48). TTr alone explained 31.3% of the variance of logLOS. These results are consistent with previous American studies, showing that FIM scores at admission are strong predictors of patients' LOS, with the transfer items having the greatest predictive power.
Reliability of the Melbourne assessment of unilateral upper limb function.

PubMed

Randall, M; Carlin, J B; Chondros, P; Reddihough, D

2001-11-01

This study examines the reliability of the Melbourne Assessment of Unilateral Upper Limb Function: a quantitative test of quality of movement in children with neurological impairment. The assessment was administered to 20 children aged from 5 to 16 years (mean age 9 years 10 months, SD 2 years 10 months) who had various types and degrees of cerebral palsy (CP). The performances of the 20 children during assessment were videotaped for subsequent scoring by 15 occupational therapists. Scores were analyzed for internal consistency of test items, inter- and intrarater reliability of scorings of the same videotapes, and test-retest reliability using repeat videotaping. Results revealed very high internal consistency of test items (alpha=0.96), moderate to high agreement both within and between raters for all test items (intraclass correlations of at least 0.7) apart from item 16 (hand to mouth and down), and high interrater reliability (0.95) and intrarater reliability (0.97) for total test scores. Test-retest results revealed moderate to high intrarater reliability for item totals (mean of 0.83 and 0.79) for each rater and high reliability for test totals (0.98 and 0.97). These findings indicate that the Melbourne Assessment of Unilateral Upper Limb Function is a reliable tool for measuring the quality of unilateral upper-limb movement in children with CP.

The Bergen Shopping Addiction Scale: reliability and validity of a brief screening test.

PubMed

Andreassen, Cecilie S; Griffiths, Mark D; Pallesen, Ståle; Bilder, Robert M; Torsheim, Torbjørn; Aboujaoude, Elias

2015-01-01

Although excessive and compulsive shopping has been increasingly placed within the behavioral addiction paradigm in recent years, items in existing screens arguably do not assess the core criteria and components of addiction. To date, assessment screens for shopping disorders have primarily been rooted within the impulse-control or obsessive-compulsive disorder paradigms. Furthermore, existing screens use the terms 'shopping,' 'buying,' and 'spending' interchangeably, and do not necessarily reflect contemporary shopping habits. Consequently, a new screening tool for assessing shopping addiction was developed. Initially, 28 items, four for each of seven addiction criteria (salience, mood modification, conflict, tolerance, withdrawal, relapse, and problems), were constructed. These items and validated scales (i.e., Compulsive Buying Measurement Scale, Mini-International Personality Item Pool, Hospital Anxiety and Depression Scale, Rosenberg Self-Esteem Scale) were then administered to 23,537 participants (M age = 35.8 years, SD age = 13.3). The highest loading item from each set of four pooled items reflecting the seven addiction criteria were retained in the final scale, The Bergen Shopping Addiction Scale (BSAS). The factor structure of the BSAS was good (RMSEA = 0.064, CFI = 0.983, TLI = 0.973) and coefficient alpha was 0.87. The scores on the BSAS converged with scores on the Compulsive Buying Measurement Scale (CBMS; 0.80), and were positively correlated with extroversion and neuroticism, and negatively with conscientiousness, agreeableness, and intellect/imagination. The scores of the BSAS were positively associated with anxiety, depression, and low self-esteem and inversely related to age. Females scored higher than males on the BSAS. The BSAS is the first scale to fully embed shopping addiction within an addiction paradigm. A recommended cutoff score for the new scale and future research directions are discussed.
Psychometric Properties of the Quantitative Myasthenia Gravis Score and the Myasthenia Gravis Composite Scale.

PubMed

Barnett, Carolina; Merkies, Ingemar S J; Katzberg, Hans; Bril, Vera

2015-09-02

The Quantitative Myasthenia Gravis Score and the Myasthenia Gravis Composite are two commonly used outcome measures in Myasthenia Gravis. So far, their measurement properties have not been compared, so we aimed to study their psychometric properties using the Rasch model. 251 patients with stable myasthenia gravis were assessed with both scales, and 211 patients returned for a second assessment. We studied fit to the Rasch model at the first visit, and compared item fit, thresholds, differential item functioning, local dependence, person separation index, and tests for unidimensionality. We also assessed test-retest reliability and estimated the Minimal Detectable Change. Neither scale fit the Rasch model (χ² p < 0.05). The Myasthenia Gravis Composite had lower discrimination properties than the Quantitative Myasthenia Gravis Scale (Person Separation Index: 0.14 and 0.7). There was local dependence in both scales, as well as differential item functioning for ocular and generalized disease. Disordered thresholds were found in 6 (60%) items of the Myasthenia Gravis Composite and in 4 (31%) of the Quantitative Myasthenia Gravis Score. Both tools had adequate test-retest reliability (ICCs >0.8). The minimally detectable change was 4.9 points for the Myasthenia Gravis Composite and 4.3 points for the Quantitative Myasthenia Gravis Score. Neither scale fulfilled Rasch model expectations. The Quantitative Myasthenia Gravis Score has higher discrimination than the Myasthenia Gravis Composite. Both tools have items with disordered thresholds, differential item functioning and local dependency. There was evidence of multidimensionality in the QMGS. The minimal detectable change values are higher than previous studies on the minimal significant change. These findings might inform future modifications of these tools.
Assessing for suicidal behavior in youth using the Achenbach System of Empirically Based Assessment.

PubMed

Van Meter, Anna R; Algorta, Guillermo Perez; Youngstrom, Eric A; Lechtman, Yana; Youngstrom, Jen K; Feeny, Norah C; Findling, Robert L

2018-02-01

This study investigated the clinical utility of the Achenbach System of Empirically Based Assessment (ASEBA) for identifying youth at risk for suicide. Specifically, we investigated how well the Total Problems scores and the sum of two suicide-related items (#18 "Deliberately harms self or attempts suicide" and #91 "Talks about killing self") were able to distinguish youth with a history of suicidal behavior. Youth (N = 1117) aged 5-18 were recruited for two studies of mental illness. History of suicidal behavior was assessed by semi-structured interviews (K-SADS) with youth and caregivers. Youth, caregivers, and a primary teacher each completed the appropriate form (YSR, CBCL, and TRF, respectively) of the ASEBA. Areas under the curve (AUCs) from ROC analyses and diagnostic likelihood ratios (DLRs) were used to measure the ability of both Total Problems T scores, as well as the summed score of two suicide-related items, to identify youth with a history of suicidal behavior. The Suicide Items from the CBCL and YSR performed well (AUCs = 0.85 and 0.70, respectively). The TRF Suicide Items did not perform better than chance, AUC = 0.45. The AUCs for the Total Problems scores were poor-to-fair (0.33-0.65). The CBCL Suicide Items outperformed all other scores (ps = 0.04 to <0.0005). Combining the CBCL and YSR items did not lead to incremental improvement in prediction over the CBCL alone. The sum of two questions from a commonly used assessment tool can offer important information about a youth's risk for suicidal behavior. The low burden of this approach could facilitate wide-spread screening for suicide in an increasingly at-risk population.
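As a rough illustration of the AUC statistic reported in the record above, the following Python sketch computes the area under the ROC curve as the probability that a randomly chosen case scores higher than a randomly chosen non-case, counting ties as one half. The function name `auc`, the 0-2 item scoring, and all scores are assumptions made up for illustration; they are not data or code from the study.

```python
def auc(case_scores, control_scores):
    """Rank-based AUC: P(case score > control score) + 0.5 * P(tie)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0      # case ranked above control
            elif c == k:
                wins += 0.5      # ties count as half a win
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical summed scores on two suicide-related items
# (assuming each item is scored 0-2, so the sum ranges from 0 to 4).
with_history = [2, 3, 1, 4, 0]
without_history = [0, 0, 1, 0, 2, 1]

print(f"AUC = {auc(with_history, without_history):.3f}")
```

An AUC near 0.5 means the score separates the two groups no better than chance, which is how the TRF result in the abstract is interpreted.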
The Bergen Shopping Addiction Scale: reliability and validity of a brief screening test

PubMed Central

Andreassen, Cecilie S.; Griffiths, Mark D.; Pallesen, Ståle; Bilder, Robert M.; Torsheim, Torbjørn; Aboujaoude, Elias

2015-01-01

Although excessive and compulsive shopping has been increasingly placed within the behavioral addiction paradigm in recent years, items in existing screens arguably do not assess the core criteria and components of addiction. To date, assessment screens for shopping disorders have primarily been rooted within the impulse-control or obsessive-compulsive disorder paradigms. Furthermore, existing screens use the terms ‘shopping,’ ‘buying,’ and ‘spending’ interchangeably, and do not necessarily reflect contemporary shopping habits. Consequently, a new screening tool for assessing shopping addiction was developed. Initially, 28 items, four for each of seven addiction criteria (salience, mood modification, conflict, tolerance, withdrawal, relapse, and problems), were constructed. These items and validated scales (i.e., Compulsive Buying Measurement Scale, Mini-International Personality Item Pool, Hospital Anxiety and Depression Scale, Rosenberg Self-Esteem Scale) were then administered to 23,537 participants (Mage = 35.8 years, SDage = 13.3). The highest loading item from each set of four pooled items reflecting the seven addiction criteria were retained in the final scale, The Bergen Shopping Addiction Scale (BSAS). The factor structure of the BSAS was good (RMSEA = 0.064, CFI = 0.983, TLI = 0.973) and coefficient alpha was 0.87. The scores on the BSAS converged with scores on the Compulsive Buying Measurement Scale (CBMS; 0.80), and were positively correlated with extroversion and neuroticism, and negatively with conscientiousness, agreeableness, and intellect/imagination. The scores of the BSAS were positively associated with anxiety, depression, and low self-esteem and inversely related to age. Females scored higher than males on the BSAS. The BSAS is the first scale to fully embed shopping addiction within an addiction paradigm. A recommended cutoff score for the new scale and future research directions are discussed. PMID:26441749
Effects of Misbehaving Common Items on Aggregate Scores and an Application of the Mantel-Haenszel Statistic in Test Equating. CSE Report 688

ERIC Educational Resources Information Center

Michaelides, Michalis P.

2006-01-01

Consistent behavior is a desirable characteristic that common items are expected to have when administered to different groups. Findings from the literature have established that items do not always behave in consistent ways; item indices and IRT item parameter estimates of the same items differ when obtained from different administrations.…
One-Year Mortality in Older Patients with Cancer: Development and External Validation of an MNA-Based Prognostic Score.

PubMed

Bourdel-Marchasson, Isabelle; Diallo, Abou; Bellera, Carine; Blanc-Bisson, Christelle; Durrieu, Jessica; Germain, Christine; Mathoulin-Pélissier, Simone; Soubeyran, Pierre; Rainfray, Muriel; Fonck, Mariane; Doussau, Adelaïde

2016-01-01

The MNA (Mini Nutritional Assessment) is a known prognostic factor in the older population. We analyzed the prognostic value for one-year mortality of MNA items in older patients with cancer treated with chemotherapy as the basis of a simplified prognostic score. The prospective derivation cohort included 606 patients older than 70 years with an indication of chemotherapy for cancers. The endpoint to predict was one-year mortality. The 18 items of the Full MNA, age, gender, weight loss, cancer origin, TNM, performance status and lymphocyte count were considered to construct the prognostic model. MNA items were analyzed with a backward step-by-step multivariate logistic regression and other items were added in a forward step-by-step regression. External validation was performed on an independent cohort of 229 patients. At one year 266 deaths had occurred. Decreased dietary intake (p = 0.0002), decreased protein-rich food intake (p = 0.025), 3 or more prescribed drugs (p = 0.023), calf circumference <31 cm (p = 0.0002), tumor origin (p<0.0001), metastatic status (p = 0.0007) and lymphocyte count <1500/mm³ (p = 0.029) were found to be associated with 1-year mortality in the final model and were used to construct a prognostic score. The area under curve (AUC) of the score was 0.793, which was higher than the Full MNA AUC (0.706). The AUC of the score in validation cohort (229 subjects, 137 deaths) was 0.698. Key predictors of one-year mortality included cancer cachexia clinical features, comorbidities, the origin and the advanced status of the tumor. The prognostic value of this model combining a subset of MNA items and cancer related items was better than the full MNA, thus providing a simple score to predict 1-year mortality in older patients with an indication of chemotherapy.
Relationship between cognitive and non-cognitive symptoms of delirium.

PubMed

Rajlakshmi, Aarya Krishnan; Mattoo, Surendra Kumar; Grover, Sandeep

2013-04-01

To study the relationship between the cognitive and the non-cognitive symptoms of delirium. Eighty-four patients who were referred to psychiatry liaison services and met DSM-IV-TR criteria for delirium were assessed using the Delirium Rating Scale Revised-1998 (DRS-R-98) and Cognitive Test for Delirium (CTD). The mean DRS-R-98 severity score was 17.19 and DRS-R-98 total score was 23.36. The mean total score on CTD was 11.75. The mean scores on CTD were highest for comprehension (3.47) and lowest for vigilance (1.71). Poor attention was associated with significantly higher motor retardation and higher DRS-R-98 severity scores minus the attention scores. There were no significant differences between those with and without poor attention. Higher attention deficits were associated with higher dysfunction on all other domains of cognition on CTD. There was significant correlation between cognitive functions as assessed on CTD and total DRS-R-98 score, DRS-R-98 severity score and DRS-R-98 severity score without the attention item score. However, few correlations emerged between CTD domains and CTD total scores with cognitive symptom total score of DRS-R-98 (items 9-13) and non-cognitive symptom total score of DRS-R-98 (items 1-8). Our study suggests that in delirium, cognitive deficits are quite prevalent and correlate with overall severity of delirium. Attention deficit is a core symptom of delirium. Copyright © 2012 Elsevier B.V. All rights reserved.
Validity of the Cambridge Cognitive Examination-Revised new Executive Function Scores in the diagnosis of dementia: some early findings.

PubMed

Heinik, Jeremia; Solomesh, Isaac

2007-03-01

The Cambridge Cognitive Examination-Revised introduces 2 new executive items (Ideational Fluency and Visual Reasoning), which separately or combined with 2 executive items in the former version (word list generation and similarities) might constitute an Executive Function Score (EFS). The authors studied the validity of these new EFSs in 51 demented (dementia of the Alzheimer's type, vascular dementia) and nondemented individuals (depressives and normals). The new EFSs were found valid to accurately differentiate between demented and nondemented subjects; however, they were considerably less so when specific diagnoses were considered. Correlations between the variously combined executive scores and the cognitive scales and subscales studied were prevalently low to moderate, and ranged from high and significant to low and nonsignificant when the 4 executive items were correlated to each other. The ability of the executive scores to discriminate demented from nondemented individuals was lower compared with the Cambridge Cognitive Examination-Revised scores. EFS was found internally consistent.
The colostomy impact score: development and validation of a patient reported outcome measure for rectal cancer patients with a permanent colostomy. A population-based study.

PubMed

Thyø, A; Emmertsen, K J; Pinkney, T D; Christensen, P; Laurberg, S

2017-01-01

The aim was to develop and validate a simple scoring system evaluating the impact of colostomy dysfunction on quality of life (QOL) in patients with a permanent stoma after rectal cancer treatment. In this population-based study, 610 patients with a permanent colostomy after previous rectal cancer treatment during the period 2001-2007 completed two questionnaires: (i) the basic stoma questionnaire consisting of 22 items about stoma function with one anchor question addressing the overall stoma impact on QOL and (ii) the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire (EORTC QLQ) C30. Answers from half of the cohort were used to develop the score and subsequently validated on the remaining half. Logistic regression analyses identified and selected items for the score and multivariate analysis established the score value allocated to each item. The colostomy impact score includes seven items with a total range from 0 to 38 points. A score of ≥ 10 indicates major colostomy impact (Major CI). The score has a sensitivity of 85.7% for detecting patients with significant stoma impact on QOL. Using the EORTC QLQ scales, patients with Major CI experienced significant impairment in their QOL compared to the Minor CI group. This new scoring system appears valid for the assessment of the impact on QOL from having a permanent colostomy in a Danish rectal cancer population. It requires validation in non-Danish populations prior to its acceptance as a valuable patient-reported outcome measure for patients internationally. Colorectal Disease © 2016 The Association of Coloproctology of Great Britain and Ireland.
The medial tibial stress syndrome score: a new patient-reported outcome measure.

PubMed

Winters, Marinus; Moen, Maarten H; Zimmermann, Wessel O; Lindeboom, Robert; Weir, Adam; Backx, Frank Jg; Bakker, Eric Wp

2016-10-01

At present, there is no validated patient-reported outcome measure (PROM) for patients with medial tibial stress syndrome (MTSS). Our aim was to select and validate previously generated items and create a valid, reliable and responsive PROM for patients with MTSS: the MTSS score. A prospective cohort study was performed in multiple sports medicine, physiotherapy and military facilities in the Netherlands. Participants with MTSS filled out the previously generated items for the MTSS score on 3 occasions. From previously generated items, we selected the best items. We assessed the MTSS score for its validity, reliability and responsiveness. The MTSS score was filled out by 133 participants with MTSS. Factor analysis showed the MTSS score to exhibit a single-factor structure with acceptable internal consistency (α=0.58) and good test-retest reliability (intraclass correlation coefficient=0.81). The MTSS score ranges from 0 to 10 points. The smallest detectable change in our sample was 0.69 at the group level and 4.80 at the individual level. Construct validity analysis showed significant moderate-to-large correlations (r=0.34-0.52, p<0.01). Responsiveness of the MTSS score was confirmed by a significant relation with the global perceived effect scale (β=-0.288, R²=0.21, p<0.001). The MTSS score is a valid, reliable and responsive PROM to measure the severity of MTSS. It is designed to evaluate treatment outcomes in clinical studies. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://www.bmj.com/company/products-services/rights-and-licensing/
Measuring the Reliability of Picture Story Exercises like the TAT

PubMed Central

Gruber, Nicole; Kreuzpointner, Ludwig

2013-01-01

As frequently reported, psychometric assessments on Picture Story Exercises, especially variations of the Thematic Apperception Test, mostly reveal inadequate scores for internal consistency. We demonstrate that the reason for this apparent shortcoming lies not in the coding system itself but in the incorrect use of internal consistency coefficients, especially Cronbach’s α. This problem could be eliminated by using the category-scores as items instead of the picture-scores. In addition to a theoretical explanation we prove mathematically why the use of category-scores produces an adequate internal consistency estimation and examine our idea empirically with the original data set of the Thematic Apperception Test by Heckhausen and two additional data sets. We found generally higher values when using the category-scores as items instead of picture-scores. From an empirical and theoretical point of view, the estimated reliability is also superior to each category within a picture as item measuring. When comparing our suggestion with a multifaceted Rasch-model we provide evidence that our procedure better fits the underlying principles of PSE. PMID:24348902
Development of an instrument to measure medical students' perceptions of the assessment environment: initial validation.

PubMed

Sim, Joong Hiong; Tong, Wen Ting; Hong, Wei-Han; Vadivelu, Jamuna; Hassan, Hamimah

2015-01-01

Assessment environment, synonymous with climate or atmosphere, is multifaceted. Although there are valid and reliable instruments for measuring the educational environment, there is no validated instrument for measuring the assessment environment in medical programs. This study aimed to develop an instrument for measuring students' perceptions of the assessment environment in an undergraduate medical program and to examine the psychometric properties of the new instrument. The Assessment Environment Questionnaire (AEQ), a 40-item, four-point (1=Strongly Disagree to 4=Strongly Agree) Likert scale instrument designed by the authors, was administered to medical undergraduates from the authors' institution. The response rate was 626/794 (78.84%). To establish construct validity, exploratory factor analysis (EFA) with principal component analysis and varimax rotation was conducted. To examine the internal consistency reliability of the instrument, Cronbach's α was computed. Mean scores for the entire AEQ and for each factor/subscale were calculated. Mean AEQ scores of students from different academic years and sex were examined. Six hundred and eleven completed questionnaires were analysed. EFA extracted four factors: feedback mechanism (seven items), learning and performance (five items), information on assessment (five items), and assessment system/procedure (three items), which together explained 56.72% of the variance. Based on the four extracted factors/subscales, the AEQ was reduced to 20 items. Cronbach's α for the 20-item AEQ was 0.89, whereas Cronbach's α for the four factors/subscales ranged from 0.71 to 0.87. Mean score for the AEQ was 2.68/4.00. The factor/subscale of 'feedback mechanism' recorded the lowest mean (2.39/4.00), whereas the factor/subscale of 'assessment system/procedure' scored the highest mean (2.92/4.00). Significant differences were found among the AEQ scores of students from different academic years. The AEQ is a valid and reliable instrument. Initial validation supports its use to measure students' perceptions of the assessment environment in an undergraduate medical program.
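Several records in this listing report Cronbach's α as the internal consistency estimate. As a minimal sketch of how that coefficient is usually computed from a respondents-by-items score matrix, the Python below uses the standard formula α = k/(k-1) · (1 - Σ item variances / variance of total score). The function name `cronbach_alpha` and the four-point Likert responses are hypothetical and are not taken from any of the studies listed here.

```python
def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent, all of equal length."""
    k = len(rows[0])                                   # number of items
    item_vars = [sample_variance([r[j] for r in rows]) for j in range(k)]
    total_var = sample_variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 5 respondents x 4 items on a 1-4 Likert scale.
responses = [
    [3, 3, 2, 3],
    [2, 2, 2, 1],
    [4, 3, 4, 4],
    [1, 2, 1, 2],
    [3, 4, 3, 3],
]

print(f"alpha = {cronbach_alpha(responses):.2f}")
```

The same function can be applied to a subscale by passing only the columns that belong to that factor.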
Patient-reported questionnaires in MS rehabilitation: responsiveness and minimal important difference of the multiple sclerosis questionnaire for physiotherapists (MSQPT).

PubMed

van der Maas, Nico Arie

2017-03-16

The Multiple Sclerosis Questionnaire for Physical Therapists (MSQPT) is a patient-rated outcome questionnaire for evaluating the rehabilitation of persons with multiple sclerosis (MS). Responsiveness was evaluated, and minimal important difference (MID) estimates were calculated to provide thresholds for clinical change for four items, three sections and the total score of the MSQPT. This multicentre study used a combined distribution- and anchor-based approach with multiple anchors and multiple rating of change questions. Responsiveness was evaluated using effect size, standardized response mean (SRM), modified SRM and relative efficiency. For distribution-based MID estimates, 0.2 and 0.33 standard deviations (SD), standard error of measurement (SEM) and minimal detectable change were used. Triangulation of anchor- and distribution-based MID estimates provided a range of MID values for each of the four items, the three sections and the total score of the MSQPT. The MID values were tested for their sensitivity and specificity for amelioration and deterioration for each of the four items, the three sections and the total score of the MSQPT. The MID values of each item and section and of the total score with the best sensitivity and specificity were selected as thresholds for clinical change. The outcome measures were the MSQPT, Hamburg Quality of Life Questionnaire for Multiple Sclerosis (HAQUAMS), rating of change questionnaires, Expanded Disability Status Scale, 6-metre timed walking test, Berg Balance Scale and 6-minute walking test. The effect size ranged from 0.46 to 1.49. The SRM data showed comparable results. The modified SRM ranged from 0.00 to 0.60. Anchor-based MID estimates were very low and were comparable with SD- and SEM-based estimates. The MSQPT was more responsive than the HAQUAMS in detecting improvement but less responsive in finding deterioration. The best MID estimates of the items, sections and total score, expressed in percentage of their maximum score, were between 5.4% (activity) and 22% (item 10) change for improvement and between 5.7% (total score) and 22% (item 10) change for deterioration. The MSQPT is a responsive questionnaire with an adequate MID that may be used as threshold for change during rehabilitation of MS patients. This trial was retrospectively (01/24/2015) registered in ClinicalTrials.gov as NCT02346279.
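The distribution-based estimates named in the record above (fractions of the SD, the SEM, and the minimal detectable change) can be sketched in a few lines of Python. The abstract does not give its formulas, so this sketch assumes the common conventions SEM = SD · sqrt(1 - reliability) and MDC95 = 1.96 · sqrt(2) · SEM; the function names and the input values (SD = 10, reliability = 0.84) are hypothetical.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement, assuming SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)


def mdc95(sd, reliability):
    """Minimal detectable change at 95% confidence, assuming 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2.0) * sem(sd, reliability)


def distribution_based_estimates(sd, reliability):
    """Collect the distribution-based MID candidates in one dictionary."""
    return {
        "0.2 SD": 0.2 * sd,
        "0.33 SD": 0.33 * sd,
        "SEM": sem(sd, reliability),
        "MDC95": mdc95(sd, reliability),
    }


# Hypothetical baseline SD and test-retest reliability (e.g. an ICC).
for name, value in distribution_based_estimates(sd=10.0, reliability=0.84).items():
    print(f"{name}: {value:.2f}")
```

In a triangulation approach such as the one the abstract describes, values like these would be set alongside anchor-based estimates rather than used on their own.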
Development of a structure-validated Family Relationship Questionnaire (FRQ) with Chinese university students.

PubMed

Chen, Liuxi; Xu, Kai; Fu, Lingyun; Xu, Shaofang; Gao, Qianqian; Wang, Wei

2015-01-01

Consistent results have shown a relationship between the psychological world of children and their perceived parental bonding or family attachment style, but to date there is no single measure covering both styles. The authors designed a statement matrix with 116 items for this purpose and compared it with the Parental Bonding Instrument (PBI) in a study with 718 university students. After exploratory and confirmatory factor analyses, five factors (scales)--namely, Paternal/Maternal Encouragement (5 items each), Paternal/Maternal Abuse (5 items each), Paternal/Maternal Freedom Release (5 items each), General Attachment (5 items), and Paternal/Maternal Dominance (4 items each)--were defined to form a Family Relationship Questionnaire (FRQ). The internal alphas of the factors ranged from .64 to .83, and their congruency coefficients were .93 to .98 in samples regarding father and mother. Women scored significantly higher on FRQ General Attachment and Maternal Encouragement and lower on Paternal Abuse than men did; only children scored significantly higher on Paternal and Maternal Encouragements than children with siblings did. Women also scored significantly higher on PBI Paternal Autonomy Denial; only children scored significantly higher on Paternal and Maternal Cares and Maternal Autonomy Denial. All intercorrelations between FRQ scales were low to medium, and some correlations between FRQ and PBI scales were medium to high. This study demonstrates that the FRQ has a structure of five factors with satisfactory discriminant and convergent validities, which might help to characterize family relationships in healthy and clinical populations.
Filipino Nurses' Spirituality and Provision of Spiritual Nursing Care.

PubMed

Labrague, Leodoro J; McEnroe-Petitte, Denise M; Achaso, Romeo H; Cachero, Geifsonne S; Mohammad, Mary Rose A

2016-12-01

This study explored the perceptions of Filipino nurses' spirituality and the provision of spiritual nursing care. A descriptive, cross-sectional, and quantitative study was adopted for this study. The study was conducted in the Philippines utilizing a convenience sample of 245 nurses. Nurses' Spirituality and Delivery of Spiritual Care (NSDSC) was used as the main instrument. The items on NSDSC with higher mean scores related to nurses' perception of spirituality were Item 7, "I believe that God loves me and cares for me," and Item 8, "Prayer is an important part of my life," with mean scores of 4.87 (SD = 1.36) and 4.88 (SD = 1.34), respectively. Items on NSDSC with higher mean scores related to the practice of spiritual care were Item 26, "I usually comfort clients spiritually (e.g., reading books, prayers, music, etc.)," and Item 25, "I refer the client to his/her spiritual counselor (e.g., hospital chaplain) if needed," with mean scores of 3.16 (SD = 1.54) and 2.92 (SD = 1.59). Nurses' spirituality correlated significantly with their understanding of spiritual nursing care (r = .3376, p ≤ .05) and delivery of spiritual nursing care (r = .3980, p ≤ .05). Positive significant correlations were found between understanding of spiritual nursing care and delivery of spiritual nursing care (r = .3289, p ≤ .05). For nurses to better provide spiritual nursing care, they must care for themselves through self-awareness, self-reflection, and developing a sense of satisfaction and contentment. © The Author(s) 2015.
Exploratory Factor Analysis of the Beck Anxiety Inventory and the Beck Depression Inventory-II in a Psychiatric Outpatient Population

PubMed Central

2018-01-01

Background: To further understand the relationship between anxiety and depression, this study examined the factor structure of the combined items from two validated measures for anxiety and depression. Methods: The participants were 406 patients with mixed psychiatric diagnoses including anxiety and depressive disorders from a psychiatric outpatient unit at a university-affiliated medical center. Responses of the Beck Anxiety Inventory (BAI), Beck Depression Inventory (BDI)-II, and Symptom Checklist-90-Revised (SCL-90-R) were analyzed. We conducted an exploratory factor analysis of 42 items from the BAI and BDI-II. Correlational analyses were performed between subscale scores of the SCL-90-R and factors derived from the factor analysis. Scores of individual items of the BAI and BDI-II were also compared between groups of anxiety disorder (n = 185) and depressive disorder (n = 123). Results: Exploratory factor analysis revealed the following five factors explaining 56.2% of the total variance: somatic anxiety (factor 1), cognitive depression (factor 2), somatic depression (factor 3), subjective anxiety (factor 4), and autonomic anxiety (factor 5). The depression group had significantly higher scores for 12 items on the BDI while the anxiety group demonstrated higher scores for six items on the BAI. Conclusion: Our results suggest that anxiety and depressive symptoms as measured by the BAI and BDI-II can be empirically differentiated and that particularly items of the cognitive domain in depression and those of physical domain in anxiety are noteworthy. PMID:29651821
DIF Trees: Using Classification Trees to Detect Differential Item Functioning

ERIC Educational Resources Information Center

Vaughn, Brandon K.; Wang, Qiu

2010-01-01

A nonparametric tree classification procedure is used to detect differential item functioning for items that are dichotomously scored. Classification trees are shown to be an alternative procedure to detect differential item functioning other than the use of traditional Mantel-Haenszel and logistic regression analysis. A nonparametric…
Item Response Theory to Quantify Longitudinal Placebo and Paliperidone Effects on PANSS Scores in Schizophrenia.

PubMed

Krekels, Ehj; Novakovic, A M; Vermeulen, A M; Friberg, L E; Karlsson, M O

2017-08-01

As biomarkers are lacking, multi-item questionnaire-based tools like the Positive and Negative Syndrome Scale (PANSS) are used to quantify disease severity in schizophrenia. Analyzing composite PANSS scores as continuous data discards information and violates the numerical nature of the scale. Here a longitudinal analysis based on Item Response Theory is presented using PANSS data from phase III clinical trials. Latent disease severity variables were derived from item-level data on each of the positive, negative, and general PANSS subscales. On all subscales, the time course of placebo responses was best described with Weibull models, and dose-independent functions with exponential models to describe the onset of the full effect were used to describe paliperidone's effect. Placebo and drug effects were most pronounced on the positive subscale. The final model successfully describes the time course of treatment effects on the individual PANSS item-levels, on all PANSS subscale levels, and on the total score level. © 2017 The Authors CPT: Pharmacometrics & Systems Pharmacology published by Wiley Periodicals, Inc. on behalf of American Society for Clinical Pharmacology and Therapeutics.
[The appraisal of reliability and validity of subjective workload assessment technique and NASA-task load index].

PubMed

Xiao, Yuan-mei; Wang, Zhi-ming; Wang, Mian-zhen; Lan, Ya-jia

2005-06-01

To test the reliability and validity of two mental workload assessment scales, i.e. subjective workload assessment technique (SWAT) and NASA task load index (NASA-TLX). One thousand two hundred and sixty-eight mental workers were sampled from various occupations, such as scientific research, education, administration and medicine, using randomized cluster sampling. Test-retest reliability, split-half reliability, Cronbach's alpha coefficient and correlation coefficients between item scores and total score were used to assess reliability. The test of validity included construct validity. The test-retest reliability coefficients of the two scales and their items ranged from 0.516 to 0.753 (P < 0.01), indicating that the two scales had good test-retest reliability; the split-half reliability of SWAT was 0.645, its Cronbach's alpha coefficient was more than 0.80, and all the correlation coefficients between its item scores and total score were more than 0.70; as for NASA-TLX, both the split-half reliability and Cronbach's alpha coefficient were more than 0.80, and the correlation coefficients between its item scores and total score were all more than 0.60 (P < 0.01) except for the performance item. Both scales had good internal consistency. The Pearson correlation coefficient between the two scales was 0.492 (P < 0.01), implying that the results of the two scales had good consistency. Factor analysis showed that the two scales had good construct validity. Both SWAT and NASA-TLX have good reliability and validity and may be used as valid tools to assess mental workload in China after being revised properly.
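As a rough illustration of two of the reliability statistics named in the record above, the following sketch computes Cronbach's alpha and a Spearman-Brown corrected split-half coefficient from a respondents-by-items score table. The function names, the odd/even item split, and the tiny sample table are assumptions made for illustration; the abstract does not state how the scales were split or which software was used.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item scores (one row per respondent)."""
    k = len(rows[0])  # number of items
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half(rows):
    """Split-half reliability using an odd/even item split (an assumption),
    stepped up to full length with the Spearman-Brown formula."""
    half_a = [sum(r[0::2]) for r in rows]
    half_b = [sum(r[1::2]) for r in rows]
    r = pearson(half_a, half_b)
    return 2 * r / (1 + r)


# Hypothetical scores: 4 respondents x 4 items (not data from the study).
scores = [[3, 4, 3, 4], [2, 2, 3, 2], [5, 4, 5, 5], [1, 2, 1, 2]]
print(round(cronbach_alpha(scores), 3), round(split_half(scores), 3))
```

Different ways of splitting the items generally give different split-half values, which is one reason alpha is often reported alongside it.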
Measuring Global Physical Health in Children with Cerebral Palsy: Illustration of a Multidimensional Bi-factor Model and Computerized Adaptive Testing

PubMed Central

Haley, Stephen M.; Ni, Pengsheng; Dumas, Helene M.; Fragala-Pinkham, Maria A.; Hambleton, Ronald K.; Montpetit, Kathleen; Bilodeau, Nathalie; Gorton, George E.; Watson, Kyle; Tucker, Carole A

2009-01-01

Purpose The purpose of this study was to apply a bi-factor model for the determination of test dimensionality and a multidimensional CAT using computer simulations of real data for the assessment of a new global physical health measure for children with cerebral palsy (CP). Methods Parent respondents of 306 children with cerebral palsy were recruited from four pediatric rehabilitation hospitals and outpatient clinics. We compared confirmatory factor analysis results across four models: (1) one-factor unidimensional; (2) two-factor multidimensional (MIRT); (3) bi-factor MIRT with fixed slopes; and (4) bi-factor MIRT with varied slopes. We tested whether the general and content (fatigue and pain) person score estimates could discriminate across severity and types of CP, and whether score estimates from a simulated CAT were similar to estimates based on the total item bank, and whether they correlated as expected with external measures. Results Confirmatory factor analysis suggested separate pain and fatigue sub-factors; all 37 items were retained in the analyses. From the bi-factor MIRT model with fixed slopes, the full item bank scores discriminated across levels of severity and types of CP, and compared favorably to external instruments. CAT scores based on 10- and 15-item versions accurately captured the global physical health scores. Conclusions The bi-factor MIRT CAT application, especially the 10- and 15-item version, yielded accurate global physical health scores that discriminated across known severity groups and types of CP, and correlated as expected with concurrent measures. The CATs have potential for collecting complex data on the physical health of children with CP in an efficient manner. PMID:19221892

Two types of squalor: findings from a factor analysis of the Environmental Cleanliness and Clutter Scale (ECCS).

PubMed

Snowdon, John; Halliday, Graeme; Hunt, Glenn E

2013-07-01

Most people who collect and hoard, and then have difficulty discarding items, do not live in squalor, even though accumulation of hoarded items can make cleaning very difficult. Commonly, people living in squalor accumulate garbage, but relatively few fulfill proposed criteria for "hoarding disorder." We examined the overlap between hoarding and squalor among people referred because of unacceptable living conditions. Ongoing collection of data by a Squalor Project team, including ratings on the Environmental Cleanliness and Clutter Scale (ECCS), allowed (1) description of characteristics of cases and (2) examination of ratings of uncleanliness, and of the effect of accumulation of items or material on access within dwellings. Principal component analysis was used to examine latent variables underlying the ECCS. The mean age of the referred occupants (108 male, 95 female) was 61.9 years. The mean ECCS score in 186 rated cases was 18.5. Factor analysis of ECCS data showed a two-factor solution as the most plausible. Factor 1, comprising seven squalor items, accounted for 33.7% of the variance. Factor 2 comprised reduced accessibility and accumulation of items of little value (variance 17.6%). Accumulation of garbage loaded equally on the two factors. High levels of squalor and/or accumulation were recorded in 105 (56%) of the 186 dwellings. One-third scored high on accumulation/hoarding, while 38% scored high on squalor; 15% scored high on both squalor and accumulation. A quarter of those scoring high on squalor scored low on hoarding/accumulation. The ECCS is useful when describing whether referred cases show high levels of squalor, hoarding, or both.
Measuring the Success of a Pipeline Program to Increase Nursing Workforce Diversity.

PubMed

Katz, Janet R; Barbosa-Leiker, Celestina; Benavides-Vaello, Sandra

2016-01-01

The purpose of this study was to understand changes in knowledge and opinions of underserved American Indian and Hispanic high school students after attending a 2-week summer pipeline program using and testing a pre/postsurvey. The research aims were to (a) psychometrically analyze the survey to determine if scale items could be summed to create a total scale score or subscale scores; (b) assess change in scores pre/postprogram; and (c) examine the survey to make suggestions for modifications and further testing to develop a valid tool to measure changes in student perceptions about going to college and nursing as a result of pipeline programs. Psychometric analysis indicated poor model fit for a 1-factor model for the total scale and the majority of subscales. Nonparametric tests indicated statistically significant increases in 13 items and decreases in 2 items. Therefore, while total scores or subscale scores cannot be used to assess changes in perceptions from pre- to postprogram, the survey can be used to examine changes over time in each item. Students did not have an accurate view of nursing and college and underestimated the support needed to attend college. However, students realized that nursing was a profession with autonomy, respect, and honor. Copyright © 2016 Elsevier Inc. All rights reserved.
Development and Evaluation of a Confidence-Weighting Computerized Adaptive Testing

ERIC Educational Resources Information Center

Yen, Yung-Chin; Ho, Rong-Guey; Chen, Li-Ju; Chou, Kun-Yi; Chen, Yan-Lin

2010-01-01

The purpose of this study was to examine whether the efficiency, precision, and validity of computerized adaptive testing (CAT) could be improved by assessing confidence differences in knowledge that examinees possessed. We proposed a novel polytomous CAT model called the confidence-weighting computerized adaptive testing (CWCAT), which combined a…
Microhabitat analysis using radiotelemetry locations and polytomous logistic regression

Treesearch

Malcolm P. North; Joel H. Reynolds

1996-01-01

Microhabitat analyses often use discriminant function analysis (DFA) to compare vegetation structures or environmental conditions between sites classified by a study animal's presence or absence. These presence/absence studies make questionable assumptions about the habitat value of the comparison sites and the microhabitat data often violate the DFA's...
A General Cognitive Diagnosis Model for Continuous-Response Data

ERIC Educational Resources Information Center

Minchen, Nathan; de la Torre, Jimmy

2018-01-01

Cognitive diagnosis models (CDMs) allow for the extraction of fine-grained, multidimensional diagnostic information from appropriately designed tests. In recent years, interest in such models has grown as formative assessment grows in popularity. Many dichotomous as well as several polytomous CDMs have been proposed in the last two decades, but…
Multilevel Models for Binary Data

ERIC Educational Resources Information Center

Powers, Daniel A.

2012-01-01

The methods and models for categorical data analysis cover considerable ground, ranging from regression-type models for binary and binomial data, count data, to ordered and unordered polytomous variables, as well as regression models that mix qualitative and continuous data. This article focuses on methods for binary or binomial data, which are…
Confirmatory Factor Analysis of the Minnesota Nicotine Withdrawal Scale

PubMed Central

Toll, Benjamin A.; O’Malley, Stephanie S.; McKee, Sherry A.; Salovey, Peter; Krishnan-Sarin, Suchitra

2008-01-01

The authors examined the factor structure of the Minnesota Nicotine Withdrawal Scale (MNWS) using confirmatory factor analysis in clinical research samples of smokers trying to quit (n = 723). Three confirmatory factor analytic models, based on previous research, were tested with each of the 3 study samples at multiple points in time. A unidimensional model including all 8 MNWS items was found to be the best explanation of the data. This model produced fair to good internal consistency estimates. Additionally, these data revealed that craving should be included in the total score of the MNWS. Factor scores derived from this single-factor, 8-item model showed that increases in withdrawal were associated with poor smoking outcome for 2 of the clinical studies. Confirmatory factor analyses of change scores showed that the MNWS symptoms cohere as a syndrome over time. Future investigators should report a total score using all of the items from the MNWS. PMID:17563141
An examination of the interrater reliability between practitioners and researchers on the static-99.

PubMed

Quesada, Stephen P; Calkins, Cynthia; Jeglic, Elizabeth L

2014-11-01

Many studies have validated the psychometric properties of the Static-99, the most widely used measure of sexual offender recidivism risk. However, much of this research relied on instrument coding completed by well-trained researchers. This study is the first to examine the interrater reliability (IRR) of the Static-99 between practitioners in the field and researchers. Using archival data from a sample of 1,973 formerly incarcerated sex offenders, field raters' scores on the Static-99 were compared with those of researchers. Overall, clinicians and researchers had excellent IRR on Static-99 total scores, with IRR coefficients ranging from "substantial" to "outstanding" for the 10 individual items of the scale. The most common causes of discrepancies were coding manual errors, followed by item subjectivity, inaccurate item scoring, and calculation errors. These results offer important data with regard to the frequency and perceived nature of scoring errors. © The Author(s) 2013.
Mixed-Format Test Score Equating: Effect of Item-Type Multidimensionality, Length and Composition of Common-Item Set, and Group Ability Difference

ERIC Educational Resources Information Center

Wang, Wei

2013-01-01

Mixed-format tests containing both multiple-choice (MC) items and constructed-response (CR) items are now widely used in many testing programs. Mixed-format tests often are considered to be superior to tests containing only MC items although the use of multiple item formats leads to measurement challenges in the context of equating conducted under…
Pick-N Multiple Choice-Exams: A Comparison of Scoring Algorithms

ERIC Educational Resources Information Center

Bauer, Daniel; Holzer, Matthias; Kopp, Veronika; Fischer, Martin R.

2011-01-01

To compare different scoring algorithms for Pick-N multiple correct answer multiple-choice (MC) exams regarding test reliability, student performance, total item discrimination and item difficulty. Data from six 3rd year medical students' end of term exams in internal medicine from 2005 to 2008 at Munich University were analysed (1,255 students,…
Network Approach to Autistic Traits: Group and Subgroup Analyses of ADOS Item Scores

ERIC Educational Resources Information Center

Anderson, George M.; Montazeri, Farhad; de Bildt, Annelies

2015-01-01

A network conceptualization might contribute to understanding the occurrence and interacting nature of behavioral traits in the autism realm. Networks were constructed based on correlations of item scores of the Autism Diagnostic Observation Schedule for Modules 1, 2 and 3 obtained for a group of 477 Dutch individuals with developmental disorders.…
Effect of Examinee Certainty on Probabilistic Test Scores and a Comparison of Scoring Methods for Probabilistic Responses.

DTIC Science & Technology

1983-07-01

be a useful tool for assessing knowledge, but there are several problems with this item format. These problems include the possibility of an examinee...1959. -Kane, M. T., & Moloney, J. M. The effect of SSM grading on reliability when residual items have no discriminating power. Paper presented at
The outdoor situational fear inventory: a newer measure of an older instrument

Treesearch

Anderson B. Young; Alan Ewert; Sharon Todd; Thomas Steele; Thomas Quinn

1995-01-01

This study examined the relationship of two methods of scaling the Outdoor Situational Fear Inventory - continuum scaling and the more easily scored certainty method of scaling. Although item-by-item correlations varied widely, overall and subscale score relationships were strong. The data also suggested ways to clarify interpretations of earlier continuum scaled OSFI...
Using Confirmatory Factor Analysis and the Rasch Model to Assess Measurement Invariance in a High Stakes Reading Assessment

ERIC Educational Resources Information Center

Randall, Jennifer; Engelhard, George, Jr.

2010-01-01

The psychometric properties and multigroup measurement invariance of scores across subgroups, items, and persons on the "Reading for Meaning" items from the Georgia Criterion Referenced Competency Test (CRCT) were assessed in a sample of 778 seventh-grade students. Specifically, we sought to determine the extent to which score-based…
Multiple Imputation of Item Scores in Test and Questionnaire Data, and Influence on Psychometric Results

ERIC Educational Resources Information Center

van Ginkel, Joost R.; van der Ark, L. Andries; Sijtsma, Klaas

2007-01-01

The performance of five simple multiple imputation methods for dealing with missing data was compared. In addition, random imputation and multivariate normal imputation were used as lower and upper benchmarks, respectively. Test data were simulated and item scores were deleted such that they were either missing completely at random, missing at…
An Evaluation of a New Method of IRT Scaling

ERIC Educational Resources Information Center

Ragland, Shelley

2010-01-01

In order to be able to fairly compare scores derived from different forms of the same test within the Item Response Theory framework, all individual item parameters must be on the same scale. A new approach, the RPA method, which is based on transformations of predicted score distributions was evaluated here and was shown to produce results…
Item Review and the Rearrangement Procedure: Its Process and Its Results

ERIC Educational Resources Information Center

Papanastasiou, Elena C.

2005-01-01

Permitting item review is to the benefit of the examinees who typically increase their test scores with item review. However, testing companies do not prefer item review since it does not follow the logic on which adaptive tests are based, and since it is prone to cheating strategies. Consequently, item review is not permitted in many adaptive…
Item Reliabilities for a Family of Answer-Until-Correct (AUC) Scoring Rules.

ERIC Educational Resources Information Center

Kane, Michael T.; Moloney, James M.

The Answer-Until-Correct (AUC) procedure has been proposed in order to increase the reliability of multiple-choice items. A model for examinees' behavior when they must respond to each item until they answer it correctly is presented. An expression for the reliability of AUC items, as a function of the characteristics of the item and the scoring…
A large-scale, long-term study of scale drift: The micro view and the macro view

NASA Astrophysics Data System (ADS)

He, W.; Li, S.; Kingsbury, G. G.

2016-11-01

The development of measurement scales for use across years and grades in educational settings provides unique challenges, as instructional approaches, instructional materials, and content standards all change periodically. This study examined the measurement stability of a set of Rasch measurement scales that have been in place for almost 40 years. In order to investigate the stability of these scales, item responses were collected from a large set of students who took operational adaptive tests using items calibrated to the measurement scales. For the four scales that were examined, item samples ranged from 2183 to 7923 items. Each item was administered to at least 500 students in each grade level, resulting in approximately 3000 responses per item. Stability was examined at the micro level by analysing changes in item parameter estimates that have occurred since the items were first calibrated. It was also examined at the macro level, involving groups of items and overall test scores for students. Results indicated that individual items had changes in their parameter estimates, which require further analysis and possible recalibration. At the same time, the results at the total score level indicate substantial stability in the measurement scales over the span of their use.
Clusters of cultures: diversity in meaning of family value and gender role items across Europe.

PubMed

van Vlimmeren, Eva; Moors, Guy B D; Gelissen, John P T M

2017-01-01

Survey data are often used to map cultural diversity by aggregating scores of attitude and value items across countries. However, this procedure only makes sense if the same concept is measured in all countries. In this study we argue that when (co)variances among sets of items are similar across countries, these countries share a common way of assigning meaning to the items. Clusters of cultures can then be observed by doing a cluster analysis on the (co)variance matrices of sets of related items. This study focuses on family values and gender role attitudes. We find four clusters of cultures that assign a distinct meaning to these items, especially in the case of gender roles. Some of these differences reflect response style behavior in the form of acquiescence. Adjusting for this style effect changes country comparisons, which demonstrates the usefulness of investigating the patterns of meaning given to sets of items prior to aggregating scores into cultural characteristics.

The Protective Behavioral Strategies for Marijuana Scale: Further examination using item response theory.

PubMed

Pedersen, Eric R; Huang, Wenjing; Dvorak, Robert D; Prince, Mark A; Hummer, Justin F

2017-08-01

Given recent state legislation legalizing marijuana for recreational purposes and majority popular opinion favoring these laws, we developed the Protective Behavioral Strategies for Marijuana scale (PBSM) to identify strategies that may mitigate the harms related to marijuana use among those young people who choose to use the drug. In the current study, we expand on the initial exploratory study of the PBSM to further validate the measure with a large and geographically diverse sample (N = 2,117; 60% women, 30% non-White) of college students from 11 different universities across the United States. We sought to develop a psychometrically sound item bank for the PBSM and to create a short assessment form that minimizes respondent burden and time. Quantitative item analyses, including exploratory and confirmatory factor analyses with item response theory (IRT) and evaluation of differential item functioning (DIF), revealed an item bank of 36 items that was examined for unidimensionality and good content coverage, as well as a short form of 17 items that is free of bias in terms of gender (men vs. women), race (White vs. non-White), ethnicity (Hispanic vs. non-Hispanic), and recreational marijuana use legal status (state recreational marijuana was legal for 25.5% of participants). We also provide a scoring table for easy transformation from sum scores to IRT scale scores. The PBSM item bank and short form associated strongly and negatively with past month marijuana use and consequences. The measure may be useful to researchers and clinicians conducting intervention and prevention programs with young adults. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Handling missing Mini-Mental State Examination (MMSE) values: Results from a cross-sectional long-term-care study.

PubMed

Godin, Judith; Keefe, Janice; Andrew, Melissa K

2017-04-01

Missing values are commonly encountered on the Mini Mental State Examination (MMSE), particularly when administered to frail older people. This presents challenges for MMSE scoring in research settings. We sought to describe missingness in MMSEs administered in long-term-care facilities (LTCF) and to compare and contrast approaches to dealing with missing items. As part of the Care and Construction project in Nova Scotia, Canada, LTCF residents completed an MMSE. Different methods of dealing with missing values (e.g., use of raw scores, raw scores/number of items attempted, scale-level multiple imputation [MI], and blended approaches) are compared to item-level MI. The MMSE was administered to 320 residents living in 23 LTCF. The sample was predominately female (73%), and 38% of participants were aged >85 years. At least one item was missing from 122 (38.2%) of the MMSEs. Data were not Missing Completely at Random (MCAR), χ²(1110) = 1,351, p < 0.001. Using raw scores for those missing <6 items in combination with scale-level MI resulted in the regression coefficients and standard errors closest to item-level MI. Patterns of missing items often suggest systematic problems, such as trouble with manual dexterity, literacy, or visual impairment. While these observations may be relatively easy to take into account in clinical settings, non-random missingness presents challenges for research and must be considered in statistical analyses. We present suggestions for dealing with missing MMSE data based on the extent of missingness and the goal of analyses. Copyright © 2016 The Authors. Production and hosting by Elsevier B.V. All rights reserved.
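Two of the simpler scoring options mentioned in the record above can be sketched as follows: a score prorated by the number of items attempted, and a rule that keeps the raw score when fewer than six items are missing and otherwise defers to imputation. This is only a sketch under stated assumptions: the function names are hypothetical, missing items are represented as `None`, the raw score is taken as the sum of observed items, and the imputation step itself is not implemented here.

```python
def prorated_score(responses):
    """Raw score divided by the number of items attempted.

    responses: list of item scores, with None marking a missing item.
    Returns None when no item was attempted.
    """
    attempted = [r for r in responses if r is not None]
    if not attempted:
        return None
    return sum(attempted) / len(attempted)


def blended_score(responses, max_missing=6):
    """Keep the raw score when fewer than max_missing items are missing;
    otherwise flag the record for scale-level imputation (not done here)."""
    n_missing = sum(r is None for r in responses)
    if n_missing < max_missing:
        # Assumption: the raw score is the sum of the observed items.
        return ("raw", sum(r for r in responses if r is not None))
    return ("impute", None)


# Hypothetical record with two missing items (not data from the study).
record = [1, 1, 0, None, 1, None, 1]
print(prorated_score(record), blended_score(record))
```

Because the abstract reports that the data were not missing completely at random, a prorated score of this kind should be read as a convenience summary rather than an unbiased estimate.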
Diet-quality scores and risk of hip fractures in elderly urban Chinese in Guangdong, China: a case-control study.

PubMed

Zeng, F F; Xue, W Q; Cao, W T; Wu, B H; Xie, H L; Fan, F; Zhu, H L; Chen, Y M

2014-08-01

This case-control study compared the associations of four widely used diet-quality scoring systems with the risk of hip fractures and assessed their utility in elderly Chinese. We found that individuals avoiding a low-quality diet have a lower risk of hip fractures in elderly Chinese. Few studies have examined the associations of diet-quality scores with bone health, and none have been conducted in Asians or have compared the validity and utility of the scores within a single study. We assessed the associations and utility of four widely used diet-quality scoring systems with the risk of hip fractures. A case-control study of 726 patients with hip fractures (diagnosed within 2 weeks) aged 55-80 years and 726 age- (within 3 years) and gender-matched controls was conducted in Guangdong, China (2009-2013). Dietary intake was assessed using a 79-item food frequency questionnaire with face-to-face interviews, and the Healthy Eating Index-2005 (HEI-2005, 12 items), the alternate Healthy Eating Index (aHEI, 8 items), the Diet Quality Index-International (DQI-I, 17 items), and the alternate Mediterranean Diet Score (aMed, 9 items) (the simplest one) were calculated. Greater values of all the diet-quality scores were significantly associated with a similar decreased risk of hip fractures (all p trends <0.001). The multivariate-adjusted odds ratios (ORs) and 95% confidence intervals (95% CIs) comparing the extreme groups of diet-quality scores were 0.29 (0.18, 0.46) (HEI-2005), 0.20 (0.12, 0.33) (aHEI), 0.25 (0.16, 0.39) (DQI-I), and 0.28 (0.18, 0.43) (aMed) in total subjects; and the corresponding ORs ranged from 0.04 to 0.27 for men and from 0.26 to 0.44 for women (all p trends <0.05), respectively. Avoiding a low-quality diet is associated with a lower risk of hip fractures, and the aMed score is the best scoring system due to its equivalent performance and simplicity for the user.
Fitting measurement models to vocational interest data: are dominance models ideal?

PubMed

Tay, Louis; Drasgow, Fritz; Rounds, James; Williams, Bruce A

2009-09-01

In this study, the authors examined the item response process underlying 3 vocational interest inventories: the Occupational Preference Inventory (C.-P. Deng, P. I. Armstrong, & J. Rounds, 2007), the Interest Profiler (J. Rounds, T. Smith, L. Hubert, P. Lewis, & D. Rivkin, 1999; J. Rounds, C. M. Walker, et al., 1999), and the Interest Finder (J. E. Wall & H. E. Baker, 1997; J. E. Wall, L. L. Wise, & H. E. Baker, 1996). Item response theory (IRT) dominance models, such as the 2-parameter and 3-parameter logistic models, assume that item response functions (IRFs) are monotonically increasing as the latent trait increases. In contrast, IRT ideal point models, such as the generalized graded unfolding model, have IRFs that peak where the latent trait matches the item. Ideal point models are expected to fit better because vocational interest inventories ask about typical behavior, as opposed to requiring maximal performance. Results show that across all 3 interest inventories, the ideal point model provided better descriptions of the response process. The importance of specifying the correct item response model for precise measurement is discussed. In particular, scores computed by a dominance model were shown to be sometimes illogical: individuals endorsing mostly realistic or mostly social items were given similar scores, whereas scores based on an ideal point model were sensitive to which type of items respondents endorsed.
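The contrast drawn in the record above between dominance and ideal point item response functions can be sketched numerically. The first function below is the standard 3-parameter logistic form (the 2-parameter model is the special case with no guessing parameter); the second is a deliberately simple single-peaked curve used only to show the shape of an ideal point response and is not the generalized graded unfolding model. All function and parameter names are illustrative assumptions.

```python
import math


def irf_3pl(theta, a, b, c=0.0):
    """Dominance IRF: 3PL model; with c = 0 this is the 2PL model.

    theta: latent trait, a: discrimination, b: difficulty/location,
    c: lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def irf_single_peaked(theta, delta, width=1.0):
    """Toy ideal-point curve peaking where theta equals the item
    location delta. Illustrative only; NOT the GGUM."""
    return math.exp(-((theta - delta) / width) ** 2)


thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
print([round(irf_3pl(t, a=1.5, b=0.0), 3) for t in thetas])        # keeps rising
print([round(irf_single_peaked(t, delta=0.0), 3) for t in thetas])  # rises, then falls
```

Under the dominance form, endorsement probability never decreases as the trait increases, whereas the single-peaked form drops again once the respondent's trait level moves past the item's location.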
A test of the International Personality Item Pool representation of the Revised NEO Personality Inventory and development of a 120-item IPIP-based measure of the five-factor model.

PubMed

Maples, Jessica L; Guan, Li; Carter, Nathan T; Miller, Joshua D

2014-12-01

There has been a substantial increase in the use of personality assessment measures constructed using items from the International Personality Item Pool (IPIP) such as the 300-item IPIP-NEO (Goldberg, 1999), a representation of the Revised NEO Personality Inventory (NEO PI-R; Costa & McCrae, 1992). The IPIP-NEO is free to use and can be modified to accommodate its users' needs. Despite the substantial interest in this measure, there is still a dearth of data demonstrating its convergence with the NEO PI-R. The present study represents an investigation of the reliability and validity of scores on the IPIP-NEO. Additionally, we used item response theory (IRT) methodology to create a 120-item version of the IPIP-NEO. Using an undergraduate sample (n = 359), we examined the reliability, as well as the convergent and criterion validity, of scores from the 300-item IPIP-NEO, a previously constructed 120-item version of the IPIP-NEO (Johnson, 2011), and the newly created IRT-based IPIP-120 in comparison to the NEO PI-R across a range of outcomes. Scores from all 3 IPIP measures demonstrated strong reliability and convergence with the NEO PI-R and a high degree of similarity with regard to their correlational profiles across the criterion variables (rICC = .983, .972, and .976, respectively). The replicability of these findings was then tested in a community sample (n = 757), and the results closely mirrored the findings from Sample 1. These results provide support for the use of the IPIP-NEO and both 120-item IPIP-NEO measures as assessment tools for measurement of the five-factor model. (c) 2014 APA, all rights reserved.
Validation and psychometric properties of the Somatic and Psychological HEalth REport (SPHERE) in a young Australian-based population sample using non-parametric item response theory.

PubMed

Couvy-Duchesne, Baptiste; Davenport, Tracey A; Martin, Nicholas G; Wright, Margaret J; Hickie, Ian B

2017-08-01

The Somatic and Psychological HEalth REport (SPHERE) is a 34-item self-report questionnaire that assesses symptoms of mental distress and persistent fatigue. As it was developed as a screening instrument for use mainly in primary care-based clinical settings, its validity and psychometric properties have not been studied extensively in population-based samples. We used non-parametric Item Response Theory to assess scale validity and item properties of the SPHERE-34 scales, collected through four waves of the Brisbane Longitudinal Twin Study (N = 1707, mean age = 12, 51% females; N = 1273, mean age = 14, 50% females; N = 1513, mean age = 16, 54% females; N = 1263, mean age = 18, 56% females). We estimated the heritability of the new scores, their genetic correlation, and their predictive ability in a sub-sample (N = 1993) who completed the Composite International Diagnostic Interview. After excluding items most responsible for noise, sex or wave bias, the SPHERE-34 questionnaire was reduced to 21 items (SPHERE-21), comprising a 14-item scale for anxiety-depression and a 10-item scale for chronic fatigue (3 items overlapping). These new scores showed high internal consistency (alpha > 0.78), moderate three-month reliability (ICC = 0.47-0.58) and item scalability (Hi > 0.23), and were positively correlated (phenotypic correlations r = 0.57-0.70; rG = 0.77-1.00). Heritability estimates ranged from 0.27 to 0.51. In addition, both scores were associated with later DSM-IV diagnoses of MDD, social anxiety and alcohol dependence (ORs of 1.23-1.47). Finally, a post-hoc comparison showed that several psychometric properties of the SPHERE-21 were similar to those of the Beck Depression Inventory. The scales of SPHERE-21 measure valid and comparable constructs across sex and age groups (from 9 to 28 years). SPHERE-21 scores are heritable, genetically correlated and show good predictive ability of mental health in an Australian-based population sample of young people.
Intermediate vaginal flora and bacterial vaginosis are associated with the same factors: findings from an exploratory analysis among female sex workers in Africa and India.

PubMed

Guédou, Fernand A; Van Damme, Lut; Deese, Jennifer; Crucitti, Tania; Mirembe, Florence; Solomon, Suniti; Becker, Marissa; Alary, Michel

2014-03-01

Several recent studies suggest that intermediate vaginal flora (IVF) is associated with similar adverse health outcomes as bacterial vaginosis (BV). Yet, it is still unknown if IVF and BV share the same correlates. We conducted a cross-sectional and exploratory analysis of data from women screened prior to enrolment in a microbicide trial to estimate BV and IVF prevalence and examine their respective correlates. Participants were interviewed, examined and provided blood and genital samples for the diagnosis of IVF and BV (using Nugent score) and other reproductive tract infections. Polytomous logistic regressions were used in estimating respective ORs of IVF and BV, in relation to each potential risk factor. Among 1367 women, BV and IVF prevalences were 47.6% (95% CI 45.0% to 50.3%) and 19.2% (95% CI 17.1% to 21.2%), respectively. Multivariate polytomous analysis of IVF and BV showed that they were generally associated with the same factors. The respective adjusted ORs were for HIV 1.98 (95% CI 1.37 to 2.86) and 1.62 (95% CI 1.20 to 2.20) (p=0.2248), for gonorrhoea 1.25 (95% CI 0.64 to 2.4) and 2.01 (95% CI 1.19 to 3.49) (p=0.0906), for trichomoniasis 3.26 (95% CI 1.71 to 6.31) and 2.39 (95% CI 1.37 to 4.33) (p=0.2630), for candidiasis 0.52 (95% CI 0.36 to 0.75) and 0.59 (95% CI 0.44 to 0.78) (p=0.5288), and for hormonal contraception 0.65 (95% CI 0.40 to 1.04) and 0.62 (95% CI 0.43 to 0.90) (p=0.8819). In addition, the associations between vaginal flora abnormalities and factors such as younger age, HIV, gonorrhoea, trichomoniasis and candidiasis were modified by the study site (all p for interaction ≤0.05). IVF has almost the same correlates as BV. The relationship between some factors and vaginal flora abnormalities may be site-specific.
Comparison of scales for evaluating premenstrual symptoms in women using oral contraceptives.

PubMed

Coffee, Andrea L; Kuehl, Thomas J; Sulak, Patricia J

2008-05-01

To compare two scales used in research to evaluate daily premenstrual mood symptoms during use of a monophasic oral contraceptive. Subanalysis of data from a prospective study. University-affiliated medical center. Subjects: One hundred two reproductive-aged (18-48 yrs) women taking a monophasic oral contraceptive containing ethinyl estradiol and drospirenone in the standard 21-7 fashion (21 days of hormones followed by 7 days of placebo), and who had self-identified premenstrual symptoms of headache, mood changes, or pelvic pain. Subjects completed a single-item questionnaire, the Scott & White Daily Diary of Symptoms, and a multiple-item questionnaire, the Penn State Daily Symptom Report (DSR), to assess their premenstrual symptoms. The Scott & White diary used a visual analog scale of 0-10 to assess pelvic pain, headache, and mood (a composite of anxiety, depression, and irritability). The Penn State DSR contained 17 items: 10 behavioral and seven physical components, each rated on a scale of 0-4, with one item that specifically rated mood swings. Scores from the two scales were compared by using Spearman correlation coefficients, the Kendall W for concordance, and linear regression of ranked sums for study cycles. The Scott & White mood score significantly correlated with the total of the 17 items on the Penn State DSR, as well as the 10 behavioral items, the seven physical items, and the single mood-swing item (p<0.0001); specific coefficients of concordance were 0.44, 0.23, 0.10, and 0.28, respectively, and R2 values were 0.39, 0.39, 0.30, and 0.34, respectively. The daily Scott & White mood score was positively correlated with all 17 elements of the Penn State DSR (0.25-0.57). The greatest correlation was seen with the mood-swing element. Both instruments demonstrated the same patterns during the 21-7 oral contraceptive cycle, with symptoms increasing immediately before and peaking during the 7-day hormone-free interval. A single-item daily mood score using a rating scale of 0-10 was concordant with a relatively complex 17-element symptom index and demonstrated the same pattern of change during cycles of oral contraception. The simple scoring system offers an advantage, especially in clinical studies of long duration.
On Multidimensional Item Response Theory: A Coordinate-Free Approach. Research Report. ETS RR-07-30

ERIC Educational Resources Information Center

Antal, Tamás

2007-01-01

A coordinate-free definition of complex-structure multidimensional item response theory (MIRT) for dichotomously scored items is presented. The point of view taken emphasizes the possibilities and subtleties of understanding MIRT as a multidimensional extension of the classical unidimensional item response theory models. The main theorem of the…
Differential Item Functioning: Its Consequences. Research Report. ETS RR-10-01

ERIC Educational Resources Information Center

Lee, Yi-Hsuan; Zhang, Jinming

2010-01-01

This report examines the consequences of differential item functioning (DIF) using simulated data. Its impact on total score, item response theory (IRT) ability estimate, and test reliability was evaluated in various testing scenarios created by manipulating the following four factors: test length, percentage of DIF items per form, sample sizes of…
A Methodology for Zumbo's Third Generation DIF Analyses and the Ecology of Item Responding

ERIC Educational Resources Information Center

Zumbo, Bruno D.; Liu, Yan; Wu, Amery D.; Shear, Benjamin R.; Olvera Astivia, Oscar L.; Ark, Tavinder K.

2015-01-01

Methods for detecting differential item functioning (DIF) and item bias are typically used in the process of item analysis when developing new measures; adapting existing measures for different populations, languages, or cultures; or more generally validating test score inferences. In 2007 in "Language Assessment Quarterly," Zumbo…
Item Vector Plots for the Multidimensional Three-Parameter Logistic Model

ERIC Educational Resources Information Center

Bryant, Damon; Davis, Larry

2011-01-01

This brief technical note describes how to construct item vector plots for dichotomously scored items fitting the multidimensional three-parameter logistic model (M3PLM). As multidimensional item response theory (MIRT) shows promise of being a very useful framework in the test development life cycle, graphical tools that facilitate understanding…
An Item Response Theory Model for Test Bias.

ERIC Educational Resources Information Center

Shealy, Robin; Stout, William

This paper presents a conceptualization of test bias for standardized ability tests which is based on multidimensional, non-parametric, item response theory. An explanation of how individually-biased items can combine through a test score to produce test bias is provided. It is contended that bias, although expressed at the item level, should be…
The Structured Interview & Scoring Tool-Massachusetts Alzheimer's Disease Research Center (SIST-M): development, reliability, and cross-sectional validation of a brief structured clinical dementia rating interview.

PubMed

Okereke, Olivia I; Copeland, Maura; Hyman, Bradley T; Wanggaard, Taylor; Albert, Marilyn S; Blacker, Deborah

2011-03-01

The Clinical Dementia Rating (CDR) and CDR Sum-of-Boxes can be used to grade mild but clinically important cognitive symptoms of Alzheimer disease. However, sensitive clinical interview formats are lengthy. To develop a brief instrument for obtaining CDR scores and to assess its reliability and cross-sectional validity. Using legacy data from expanded interviews conducted among 347 community-dwelling older adults in a longitudinal study, we identified 60 questions (from a possible 131) about cognitive functioning in daily life using clinical judgment, inter-item correlations, and principal components analysis. Items were selected in 1 cohort (n=147), and a computer algorithm for generating CDR scores was developed in this same cohort and re-run in a replication cohort (n=200) to evaluate how well the 60 items retained information from the original 131 items. Short interviews based on the 60 items were then administered to 50 consecutively recruited older individuals, with no symptoms or mild cognitive symptoms, at an Alzheimer's Disease Research Center. Clinical Dementia Rating scores based on short interviews were compared with those from independent long interviews. In the replication cohort, agreement between short and long CDR interviews ranged from κ=0.65 to 0.79, with κ=0.76 for Memory, κ=0.77 for global CDR, and intraclass correlation coefficient for CDR Sum-of-Boxes=0.89. In the cross-sectional validation, short interview scores were slightly lower than those from long interviews, but good agreement was observed for global CDR and Memory (κ≥0.70) as well as for CDR Sum-of-Boxes (intraclass correlation coefficient=0.73). The Structured Interview & Scoring Tool-Massachusetts Alzheimer's Disease Research Center is a brief, reliable, and sensitive instrument for obtaining CDR scores in persons with symptoms along the spectrum of mild cognitive change.
Therapist perception of treatment outcome: Evaluating treatment outcomes among youth with antisocial behavior problems.

PubMed

Crandal, Brent R; Foster, Sharon L; Chapman, Jason E; Cunningham, Phillippe B; Brennan, Patricia A; Whitmore, Elizabeth A

2015-06-01

Effective evaluation of treatment requires the use of measurement tools producing reliable scores that can be used to make valid decisions about the outcomes of interest. Therapist-rated treatment outcome scores that are obtained within the context of empirically supported treatments (ESTs) could provide clinicians and researchers with data that are easily accessible and complementary to existing instrumentation. We examined the psychometric properties of scores from the Therapist Perception of Treatment Outcome: Youth Antisocial Behavior (TPTO:YAB), an instrument developed to assess therapist judgments of treatment success among families participating in an EST, Multisystemic Therapy (MST), for youth with antisocial behavior problems. Data were drawn from a longitudinal study of MST. The initial 20-item TPTO:YAB was completed by therapists of 111 families at midtreatment and 163 families at treatment termination. Rasch model dimensionality analyses provided evidence for 2 dimensions reflecting youth- and caregiver-related aspects of treatment outcome, although a bifactor analysis suggested that these dimensions reflected a single more general construct. Rasch analyses were also used to assess item and rating scale characteristics and refine the number of items. These analyses suggested items performed similarly across time and that scores reflect treatment outcome in similar ways at mid and posttreatment. Multilevel and zero-order analyses provided evidence for the validity of TPTO:YAB scores. TPTO:YAB scores were moderately correlated with scores of youth and caregiver behaviors targeted in treatment, adding support to its use as a treatment outcome measurement instrument.
Impact of student ethnicity and patient-centredness on communication skills performance.

PubMed

Hauer, Karen E; Boscardin, Christy; Gesundheit, Neil; Nevins, Andrew; Srinivasan, Malathi; Fernandez, Alicia

2010-07-01

The development of patient-centred attitudes by health care providers is critical to improving health care quality. A prior study showed that medical students with more patient-centred attitudes scored higher in communication skills as judged by standardised patients (SPs) than students with less patient-centred attitudes. We designed this multicentre study to examine the relationships among students' demographic characteristics, patient-centredness and communication scores on an SP examination. Early Year 4 medical students at three US schools completed a 12-item survey during an SP examination. Survey items addressed demographics (gender, ethnicity, primary childhood language) and patient-centredness. Factor analysis on the patient-centredness items defined specific patient-centred attitudes. We used multiple regression analysis incorporating demographic characteristics, school and patient-centredness items and examined the effect of these variables on the outcome variable of communication score. A total of 351 students took the SP examination and 329 (94%) completed the patient-centredness questionnaire. Responses indicated generally high patient-centredness. Student ethnicity and medical school were significantly associated with communication scores; gender and primary childhood language were not. Two attitudinal factors were identified: patient perspective and impersonal attitude. Multiple regression analysis revealed that school and scores on the impersonal factor were associated with communication scores. The effect size was modest. In a medical student SP examination, modest differences in communication scores based on ethnicity were observed and can be partially explained by student attitudes regarding patient-centredness. Curricular interventions to enhance clinical experiences, teaching and feedback are needed to address key elements of a patient-centred approach to care.
The development of a test of biodiversity knowledge of high school students

NASA Astrophysics Data System (ADS)

Ajayi, Olabisi Modupe

2002-09-01

The primary purpose of this study was to develop a valid and reliable test of the knowledge of biodiversity of high school students. The test differentiated students' knowledge on three levels of biodiversity: species, ecosystem and genetics. A secondary purpose was to examine how biodiversity scores were affected by gender, grade point average, and families' socioeconomic status. The initial phase of the instrument development involved the construction of 60 dichotomous items (true/false). To establish content validity, a panel of biodiversity experts reviewed the items for appropriateness and clarity. The items were checked for readability using Flesch-Kincaid Readability Index and the readability was at the fifth grade level. The instrument was subjected to factor analysis. As a result, the final instrument was compiled and named the Ajayi Biodiversity Instrument (ABI). The reliability of ABI was .87. The mean score on the 25-item test was 79%. No significant difference at >0.05 was found in the score of students on each of the three subtests for genetics, species, and ecosystem. No significant difference was found in the score of students relative to their family's socioeconomic status. There was a significant correlation between grade point average and participation in extracurricular activities that related to biodiversity concepts and scores on ABI. Gender differences emerged at the ecosystem level, females scoring higher than males. Differences among ethnic groups also emerged. Anglo-Americans scored significantly higher on the test of knowledge of biodiversity for high school students than the rest of the ethnic groups combined.
Development of a frequent heartburn index.

PubMed

Stull, Donald E; van Hanswijck de Jonge, Patricia; Houghton, Katherine; Kocun, Christopher; Sandor, David W

2011-09-01

The aim of this study is to develop and validate a brief instrument for the measurement of overall psychosocial impact of frequent heartburn (heartburn experienced 2+ times weekly) in the general U.S. population, yielding a single, composite score. Item reduction and psychometric analyses of an existing Frequent Heartburn (FHB) Survey, a 52-item, 13-domain, patient-reported outcomes (PRO) survey assessing the impact of frequent heartburn on psychosocial quality of life. Item reduction resulted in 9 items from the original FHB Survey measuring all domains. All retained items in this full Frequent Heartburn Index (FHBI-Full) had moderate to strong factor loadings on the underlying factor (range: 0.66-0.85) and acceptable overall model fit (CFI = 0.93, SRMR = 0.04). Coefficient alpha was 0.92. A shorter FHBI (FHBI-Brief) was created that excludes the two employment-related items. The FHBI-Brief had a coefficient alpha of 0.90. Both FHBI versions have good psychometric properties and capture a full range of psychosocial effects of frequent heartburn. Normed national scores for the FHBI are available against which an individual can compare their own FHBI score. The FHBI-Full and FHBI-Brief show promise as PRO instruments that may help individuals and clinicians better understand the effect of frequent heartburn on psychosocial functioning.
Functional recovery is considered the most important target: a survey of dedicated professionals

PubMed Central

2014-01-01

Background: The aim of this study was to survey the relative importance of postoperative recovery targets and perioperative care items, as perceived by a large group of international dedicated professionals. Methods: A questionnaire with eight postoperative recovery targets and 13 perioperative care items was mailed to participants of the first international Enhanced Recovery After Surgery (ERAS) congress and to authors of papers with a clear relevance to ERAS in abdominal surgery. The responders were divided into categories according to profession and region. Results: The recovery targets ‘To be completely free of nausea’, ‘To be independently mobile’ and ‘To be able to eat and drink as soon as possible’ received the highest score irrespective of the responder's profession or region of origin. Equally, the care items ‘Optimizing fluid balance’, ‘Preoperative counselling’ and ‘Promoting early and scheduled mobilisation’ received the highest score across all groups. Conclusions: Functional recovery, as in tolerance of food without nausea and regained mobility, was considered the most important target of recovery. There was a consistent uniformity in the way international dedicated professionals scored the relative importance of recovery targets and care items. The relative rating of the perioperative care items was not dependent on the strength of evidence supporting the items. PMID:25089195
Clinical validation of a non-heteronormative version of the Social Interaction Anxiety Scale (SIAS).

PubMed

Lindner, Philip; Martell, Christopher; Bergström, Jan; Andersson, Gerhard; Carlbring, Per

2013-12-19

Despite welcomed changes in societal attitudes and practices towards sexual minorities, instances of heteronormativity can still be found within healthcare and research. The Social Interaction Anxiety Scale (SIAS) is a valid and reliable self-rating scale of social anxiety, which includes one item (number 14) with an explicit heteronormative assumption about the respondent's sexual orientation. This heteronormative phrasing may confuse, insult or alienate sexual minority respondents. A clinically validated version of the SIAS featuring a non-heteronormative phrasing of item 14 is thus needed. 129 participants with diagnosed social anxiety disorder, enrolled in an Internet-based intervention trial, were randomly assigned to responding to the SIAS featuring either the original or a novel non-heteronormative phrasing of item 14, and then answered the other item version. Within-subject, correlation between item versions was calculated and the two scores were statistically compared. The two items' correlations with the other SIAS items and other psychiatric rating scales were also statistically compared. Item versions were highly correlated and scores did not differ statistically. The two items' correlations with other measures did not differ statistically either. The SIAS can be revised with a non-heteronormative formulation of item 14 with psychometric equivalence on item and scale level. Implications for other psychiatric instruments with heteronormative phrasings are discussed.

Is the General Self-Efficacy Scale a Reliable Measure to be used in Cross-Cultural Studies? Results from Brazil, Germany and Colombia.

PubMed

Damásio, Bruno F; Valentini, Felipe; Núñes-Rodriguez, Susana I; Kliem, Soeren; Koller, Sílvia H; Hinz, Andreas; Brähler, Elmar; Finck, Carolyn; Zenger, Markus

2016-05-26

This study evaluated cross-cultural measurement invariance for the General Self-efficacy Scale (GSES) in large Brazilian (N = 2,394) and representative German (N = 2,046) and Colombian (N = 1,500) samples. Initially, multiple-indicators multiple-causes (MIMIC) analyses showed that sex and age were biasing item responses in the total sample (2 and 10 items, respectively). After controlling for these two covariates, a multigroup confirmatory factor analysis (MGCFA) was employed. Configural invariance was attested. However, metric invariance was not supported for five of the 10 items, and scalar invariance was not supported for all items. We also evaluated the differences between the latent scores estimated by two models: MIMIC and MGCFA unconstraining the non-equivalent parameters across countries. The average difference was equal to |.07| on the estimation of the latent scores, and 22.8% of the scores were biased by at least .10 standardized points. Bias effects were above the mean for the German group, for which the average difference was equal to |.09|, and 33.7% of the scores were biased by at least .10. In sum, the GSES did not provide evidence of measurement invariance to be employed in this cross-cultural study. Moreover, our results showed that even when controlling for sex and age effects, the absence of control on item parameters in the MGCFA analyses across countries would result in bias in the estimation of the latent scores, with a larger effect for the German population.
Towards operationalising internal distractibility (Mind Wandering) in adults with ADHD.

PubMed

Biederman, Joseph; Fitzgerald, Maura; Uchida, Mai; Spencer, Thomas J; Fried, Ronna; Wicks, Jennifer; Saunders, Alexandra; Faraone, Stephen V

2017-12-01

To investigate whether specific symptoms of attention deficit hyperactivity disorder (ADHD) can help identify ADHD patients with mind wandering. Subjects were adults ages 18-55 of both sexes (n=41) who completed the Mind-Wandering Questionnaire (MWQ) and the ADHD module of the Schedule for Affective Disorders and Schizophrenia for School-Age Children Epidemiologic Version. We used Spearman's rank correlation and Pearson's χ2 analyses to examine associations between the ADHD module and the MWQ and receiver operator characteristic (ROC) analyses to evaluate the diagnostic efficiency of the ADHD module. Out of the three ADHD domains, the inattentive ADHD scores had the strongest association with the MWQ (total: rs=0.34, df=39, p=0.03; inattentive: rs=0.38, df=39, p=0.02; hyperactive: rs=0.17, df=39, p=0.28). Correlation analyses between individual items on the ADHD module and the MWQ showed that two inattention items ('failure to pay attention to detail' and 'trouble following instructions') were positively associated with total scores on the MWQ (p=0.02). These two inattention items had the strongest association with the MWQ (rs=0.45, df=38, p=0.004). ROC analyses showed that the combined score of the two significant inattention items had the highest efficiency (AUC=0.71) in classifying high-level mind wanderers as defined by scores greater than the median split on the MWQ. The combined score of the two inattention items best identified high-level mind wanderers. Results suggest a way to operationalise mind wandering using the symptoms of ADHD.
The Assignment of Raters to Items: Controlling for Rater Effects.

ERIC Educational Resources Information Center

Sykes, Robert C.; Heidorn, Mark; Lee, Guemin

A study was conducted to evaluate the effect of different modes (modalities) of assigning raters to test items. The impact on total constructed response (c.r.) score, and subsequently on total test score, of assigning a single versus multiple raters to an examination reading of a student's set of c.r. responses was evaluated for several mixed-item…
An Investigation of the Accuracy of Alternative Methods of True Score Estimation in High-Stakes Mixed-Format Examinations.

ERIC Educational Resources Information Center

Klinger, Don A.; Rogers, W. Todd

2003-01-01

The estimation accuracy of procedures based on classical test score theory and item response theory (generalized partial credit model) were compared for examinations consisting of multiple-choice and extended-response items. Analysis of British Columbia Scholarship Examination results found an error rate of about 10 percent for both methods, with…
Examining the Impact of Unscorable Item Responses on the Validity and Interpretability of MMPI-2/MMPI-2-RF Restructured Clinical (RC) Scale Scores

ERIC Educational Resources Information Center

Dragon, Wendy R.; Ben-Porath, Yossef S.; Handel, Richard W.

2012-01-01

This article examined the impact of unscorable item responses on the psychometric validity and practical interpretability of scores on the Restructured Clinical (RC) Scales of the Minnesota Multiphasic Personality Inventory-2/Minnesota Multiphasic Personality Inventory-2-Restructured Form (MMPI-2/MMPI-2-RF). In analyses conducted with five…
A Robust Outlier Approach to Prevent Type I Error Inflation in Differential Item Functioning

ERIC Educational Resources Information Center

Magis, David; De Boeck, Paul

2012-01-01

The identification of differential item functioning (DIF) is often performed by means of statistical approaches that consider the raw scores as proxies for the ability trait level. One of the most popular approaches, the Mantel-Haenszel (MH) method, belongs to this category. However, replacing the ability level by the simple raw score is a source…
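For readers unfamiliar with the Mantel-Haenszel approach named in this abstract, the following is a minimal sketch of the common odds ratio it is built on, assuming examinees are stratified by raw score and each stratum is summarised as a 2x2 table. The function name, the (a, b, c, d) layout, and the counts are hypothetical and are not taken from the article.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across 2x2 tables, one table per raw-score level.

    Each stratum is a tuple (a, b, c, d), an assumed layout:
      a = reference group correct,  b = reference group incorrect,
      c = focal group correct,      d = focal group incorrect.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical counts for two raw-score levels (illustration only).
example_strata = [(10, 5, 5, 10), (8, 4, 4, 8)]
odds_ratio = mantel_haenszel_odds_ratio(example_strata)
```

A value near 1 would suggest that, at matched raw scores, the two groups have similar odds of answering the item correctly; the matching on raw score is exactly the step the abstract identifies as a potential source of problems.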
The Impact of Reading Self-Efficacy and Task Value on Reading Comprehension Scores in Different Item Formats

ERIC Educational Resources Information Center

Solheim, Oddny Judith

2011-01-01

It has been hypothesized that students with low self-efficacy will struggle with complex reading tasks in assessment situations. In this study we examined whether perceived reading self-efficacy and reading task value uniquely predicted reading comprehension scores in two different item formats in a sample of fifth-grade students. Results showed…
Lord-Wingersky Algorithm Version 2.0 for Hierarchical Item Factor Models with Applications in Test Scoring, Scale Alignment, and Model Fit Testing. CRESST Report 830

ERIC Educational Resources Information Center

Cai, Li

2013-01-01

Lord and Wingersky's (1984) recursive algorithm for creating summed score based likelihoods and posteriors has a proven track record in unidimensional item response theory (IRT) applications. Extending the recursive algorithm to handle multidimensionality is relatively simple, especially with fixed quadrature because the recursions can be defined…
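As a rough illustration of the unidimensional recursion this abstract refers to, here is a minimal sketch assuming dichotomous items and a single fixed ability level. The function name and the probabilities are hypothetical; the hierarchical extension described in the report is not shown.

```python
def summed_score_distribution(p_correct):
    """Probability of each summed score 0..n for dichotomous items,
    given per-item success probabilities at one fixed ability level."""
    dist = [1.0]  # before any item is added, score 0 has probability 1
    for p in p_correct:
        new = [0.0] * (len(dist) + 1)
        for score, prob in enumerate(dist):
            new[score] += prob * (1.0 - p)  # item answered incorrectly
            new[score + 1] += prob * p      # item answered correctly
        dist = new
    return dist


# Hypothetical success probabilities for three items (illustration only).
example_probs = [0.8, 0.6, 0.5]
score_dist = summed_score_distribution(example_probs)
```

In an IRT application, a recursion of this kind would typically be repeated at each quadrature point and the results weighted by the ability distribution to obtain summed-score likelihoods.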
Propensity Score Matching Helps to Understand Sources of DIF and Mathematics Performance Differences of Indonesian, Turkish, Australian, and Dutch Students in PISA

ERIC Educational Resources Information Center

Arikan, Serkan; van de Vijver, Fons J. R.; Yagmur, Kutlay

2018-01-01

We examined Differential Item Functioning (DIF) and the size of cross-cultural performance differences in the Programme for International Student Assessment (PISA) 2012 mathematics data before and after application of propensity score matching. The mathematics performance of Indonesian, Turkish, Australian, and Dutch students on released items was…
Automated Scoring for the "TOEFL Junior"® Comprehensive Writing and Speaking Test. Research Report. ETS RR-15-09

ERIC Educational Resources Information Center

Evanini, Keelan; Heilman, Michael; Wang, Xinhao; Blanchard, Daniel

2015-01-01

This report describes the initial automated scoring results that were obtained using the constructed responses from the Writing and Speaking sections of the pilot forms of the "TOEFL Junior"® Comprehensive test administered in late 2011. For all of the items except one (the edit item in the Writing section), existing automated scoring…
Evaluation of Two Methods for Modeling Measurement Errors When Testing Interaction Effects with Observed Composite Scores

ERIC Educational Resources Information Center

Hsiao, Yu-Yu; Kwok, Oi-Man; Lai, Mark H. C.

2018-01-01

Path models with observed composites based on multiple items (e.g., mean or sum score of the items) are commonly used to test interaction effects. Under this practice, researchers generally assume that the observed composites are measured without errors. In this study, we reviewed and evaluated two alternative methods within the structural…
Agreement between the SCORE and D’Agostino Scales for the Classification of High Cardiovascular Risk in Sedentary Spanish Patients

PubMed Central

Gómez-Marcos, Manuel A.; Grandes, Gonzalo; Iglesias-Valiente, José A.; Sánchez, Alvaro; Montoya, Imanol; García-Ortiz, Luis

2009-01-01

Background: To evaluate agreement between cardiovascular risk in sedentary patients as estimated by the new Framingham-D’Agostino scale and by the SCORE chart, and to describe the patient characteristics associated with the observed disagreement between the scales. Design: A cross-sectional study was undertaken involving a systematic sample of 2,295 sedentary individuals between 40–65 years of age seen for any reason in 56 primary care offices. An estimation was made of the Pearson correlation coefficient and kappa statistic for the classification of high risk subjects (≥20% according to the Framingham-D’Agostino scale, and ≥5% according to SCORE). Polytomous logistic regression models were fitted to identify the variables associated with the discordance between the two scales. Results: The mean risk in males (35%) was 19.5% ± 13% with D’Agostino scale, and 3.2% ± 3.3% with SCORE. Among females, they were 8.1% ± 6.8% and 1.2% ± 2.2%, respectively. The correlation between the two scales was 0.874 in males (95% CI: 0.857–0.889) and 0.818 in females (95% CI: 0.800–0.834), while the kappa index was 0.50 in males (95% CI: 0.44–0.56) and 0.61 in females (95% CI: 0.52–0.71). The most frequent disagreement, characterized by high risk according to D’Agostino scale but not according to SCORE, was much more prevalent among males and proved more probable with increasing age and increased LDL-cholesterol, triglyceride and systolic blood pressure values, as well as among those who used antihypertensive drugs and smokers. Conclusions: The quantitative correlation between the two scales is very high. Patient categorization as corresponding to high risk generates disagreements, mainly among males, where agreement between the two classifications is only moderate. PMID:20049225
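The kappa statistic used in this abstract to compare the two high-risk classifications can be sketched as follows. This is a generic illustration of Cohen's kappa for a square agreement table; the function name and the counts are hypothetical and do not reproduce the study's data.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table.

    Rows are the categories assigned by one scale, columns the categories
    assigned by the other (for example, high risk versus not high risk).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical 2x2 agreement counts (illustration only).
example_table = [[20, 5], [10, 15]]
kappa = cohen_kappa(example_table)
```

Because kappa discounts agreement expected by chance, two scales can correlate strongly as continuous scores and still agree only moderately once a cut-off is applied, which is the pattern the abstract reports.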
Optimizing data collection for public health decisions: a data mining approach

PubMed Central

2014-01-01

Background: Collecting data can be cumbersome and expensive. Lack of relevant, accurate and timely data for research to inform policy may negatively impact public health. The aim of this study was to test if the careful removal of items from two community nutrition surveys guided by a data mining technique called feature selection, can (a) identify a reduced dataset, while (b) not damaging the signal inside that data. Methods: The Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed on 885 retail food outlets in two counties in West Virginia between May and November of 2011. A reduced dataset was identified for each outlet type using feature selection. Coefficients from linear regression modeling were used to weight items in the reduced datasets. Weighted item values were summed with the error term to compute reduced item survey scores. Scores produced by the full survey were compared to the reduced item scores using a Wilcoxon rank-sum test. Results: Feature selection identified 9 store and 16 restaurant survey items as significant predictors of the score produced from the full survey. The linear regression models built from the reduced feature sets had R2 values of 92% and 94% for restaurant and grocery store data, respectively. Conclusions: While there are many potentially important variables in any domain, the most useful set may only be a small subset. The use of feature selection in the initial phase of data collection to identify the most influential variables may be a useful tool to greatly reduce the amount of data needed thereby reducing cost. PMID:24919484
Epilepsy-related ambiguity in rating the child behavior checklist and the teacher's report form.

PubMed

Oostrom, K J; Schouten, A; Kruitwagen, C L; Peters, A C; Jennekens-Schinkel, A

2001-01-01

Although the child behavior checklist (CBCL) and the teacher's report form (TRF) were not designed for diagnosing psychopathology in children with chronic illnesses, they have become extensively used research tools to assess behavioural problems in paediatric populations, including children with epilepsy. When applied to children with epilepsy, items like "staring blankly" or "twitching" can be rated on the basis of seizure features rather than behaviour and, hence, render behavioural scores ambiguous. The aims were to detect, and to evaluate the impact of, CBCL and TRF items that elicit ambiguity when applied to children with "epilepsy only" (idiopathic or cryptogenic epilepsy, attending normal schools). Experts identified items that give rise to interpretational ambiguity of the ratings in epilepsy. By treating ratings on these items as missing values, their effect was evaluated in CBCL and TRF scores of 59 schoolchildren with "epilepsy only" and age and gender matched healthy classmates. Seven items of the CBCL gave rise to ambiguity, of which 5 items co-occur on the TRF. Rescoring reduced psychopathology scores in children with "epilepsy only", but not in healthy children: the percentage of patients exceeding the clinical cutoff score on at least one of the subscales fell from 46 to 23% on the CBCL and from 18 to 15% on the TRF. Parents and teachers run the risk of confusing behaviour and seizure features when filling out the CBCL and TRF. In "epilepsy only", prevalence estimates of psychopathology based on the CBCL and TRF should be considered with some reserve.
Head and neck cancer-specific quality of life: instrument validation.

PubMed

Terrell, J E; Nanavati, K A; Esclamado, R M; Bishop, J K; Bradford, C R; Wolf, G T

1997-10-01

The disfigurement and dysfunction associated with head and neck cancer affect emotional well-being and some of the most basic functions of life. Most cancer-specific quality-of-life assessments give a single composite score for head and neck cancer-related quality of life. The objective was to develop and evaluate an improved multidimensional instrument to assess head and neck cancer-related functional status and well-being. The item selection process included literature review, interviews with health care workers, and patient surveys. A survey with 37 disease-specific questions and the SF-12 survey were administered to 253 patients in 3 large medical centers. Factor analysis was performed to identify disease-specific domains. Domain scores were calculated as the standardized score of the component items. These domains were assessed for construct validity based on clinical hypotheses and test-retest reliability. Four relevant domains were identified: Eating (6 items), Communication (4 items), Pain (4 items), and Emotion (6 items). Each had an internal consistency (Cronbach alpha value) of greater than 0.80. Construct validity was demonstrated by moderate correlations with the SF-12 Physical and Mental component scores (r=0.43-0.60). Test-retest reliability for each domain demonstrated strong reliability between the 2 time points. Correlations were strong for each individual question, ranging from 0.53 to 0.93. Construct validity testing demonstrated that the direction of differences for each domain was as hypothesized. The Head and Neck Quality of Life questionnaire is a promising multidimensional tool with which to assess head and neck cancer-specific quality of life.
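The internal-consistency figure reported in this record (Cronbach alpha) can be illustrated with a minimal sketch of the standard formula, alpha = k/(k-1) × (1 − Σ item variances / variance of total scores). The ratings below are hypothetical and do not come from the questionnaire.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; each row holds one respondent's item ratings."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of each respondent's summed score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: five respondents on a three-item domain.
ratings = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 4],
    [3, 3, 4],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

Alpha rises when items covary strongly relative to their individual variances, which is the sense in which a value above 0.80 indicates that a domain's items move together.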
International Comparisons of the Dysregulation Profile Based on Reports by Parents, Adolescents, and Teachers.

PubMed

Rescorla, Leslie A; Blumenfeld, Mary C; Ivanova, Masha Y; Achenbach, Thomas M; International Aseba Consortium

2018-06-14

Our objective was to examine international similarities and differences in the Dysregulation Profile (DP) of the Child Behavior Checklist (CBCL), Teacher's Report Form (TRF), and Youth Self-Report (YSR) via comparisons of data from many societies. Primary samples were those studied by Rescorla et al. (2012): CBCL: N = 69,866, 42 societies; YSR: N = 38,070, 34 societies; TRF: N = 37,244, 27 societies. Omnicultural Q correlations of items composing the DP (from the Anxious/Depressed, Attention Problems, and Aggressive Behavior syndromes) indicated considerable consistency across diverse societies with respect to which of the DP items tended to receive low, medium, or high ratings, whether ratings were provided by parents (M Q = .70), adolescents (M Q = .72), or teachers (M Q = .68). Omnicultural mean item ratings indicated that, for all 3 forms, the most common items on the DP reflect a mix of problems from all 3 constituent scales. Cross-informant analyses for the CBCL-YSR and CBCL-TRF supported these results. Aggregated DP scores, derived by summing ratings on all DP items, varied significantly by society. Age and gender differences were minor for all 3 forms, but boys scored higher than girls on the TRF. Many societies differing in ethnicity, religion, political/economic system, and geographical region manifested very similar DP scores. The most commonly reported DP problems reflected the mixed symptom picture of the DP, with dysregulation in mood, attention, and aggression. Overall, societies were more similar than different on DP scale scores and item ratings.
Optimizing data collection for public health decisions: a data mining approach.

PubMed

Partington, Susan N; Papakroni, Vasil; Menzies, Tim

2014-06-12

Collecting data can be cumbersome and expensive. Lack of relevant, accurate and timely data for research to inform policy may negatively impact public health. The aim of this study was to test if the careful removal of items from two community nutrition surveys guided by a data mining technique called feature selection, can (a) identify a reduced dataset, while (b) not damaging the signal inside that data. The Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed on 885 retail food outlets in two counties in West Virginia between May and November of 2011. A reduced dataset was identified for each outlet type using feature selection. Coefficients from linear regression modeling were used to weight items in the reduced datasets. Weighted item values were summed with the error term to compute reduced item survey scores. Scores produced by the full survey were compared to the reduced item scores using a Wilcoxon rank-sum test. Feature selection identified 9 store and 16 restaurant survey items as significant predictors of the score produced from the full survey. The linear regression models built from the reduced feature sets had R² values of 92% and 94% for restaurant and grocery store data, respectively. While there are many potentially important variables in any domain, the most useful set may only be a small subset. The use of feature selection in the initial phase of data collection to identify the most influential variables may be a useful tool to greatly reduce the amount of data needed thereby reducing cost.
Validity of parent's self-reported responses to home safety questions.

PubMed

Osborne, Jodie M; Shibl, Rania; Cameron, Cate M; Kendrick, Denise; Lyons, Ronan A; Spinks, Anneliese B; Sipe, Neil; McClure, Roderick J

2016-09-01

The aim of the study was to describe the validity of parent's self-reported responses to questions on home safety practices for children of 2-4 years. A cross-sectional validation study compared parent's self-administered responses to items in the Home Injury Prevention Survey with home observations undertaken by trained researchers. The relationship between the questionnaire and observation results was assessed using percentage agreement, sensitivity, specificity, positive predictive value, negative predictive value and intraclass correlation coefficients. Percentage agreements ranged from 44% to 100% with 40 of the total 45 items scoring higher than 70%. Sensitivities ranged from 0% to 100%, with 27 items scoring at least 70%. Specificities also ranged from 0% to 100%, with 33 items scoring at least 70%. As such, the study identified a series of self-administered home safety questions that have sensitivities, specificities and predictive values sufficiently high to allow the information to be useful in research and injury prevention practice.
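The per-item agreement measures listed in this record can be sketched from a 2×2 comparison of questionnaire answers against home observations, treating the observation as the reference standard. The data and names below are hypothetical; the intraclass correlation coefficient mentioned in the abstract is not covered.

```python
def validity_measures(reported, observed):
    """Compare binary self-reports with binary observations (the reference)."""
    pairs = list(zip(reported, observed))
    tp = sum(1 for r, o in pairs if r and o)
    fp = sum(1 for r, o in pairs if r and not o)
    fn = sum(1 for r, o in pairs if not r and o)
    tn = sum(1 for r, o in pairs if not r and not o)

    def ratio(num, den):
        # Undefined when the denominator is empty (e.g. no observed positives).
        return num / den if den else None

    return {
        "agreement": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical responses to one safety item from ten households.
reported = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
observed = [1, 0, 0, 0, 1, 1, 1, 1, 0, 0]
measures = validity_measures(reported, observed)
print(measures)
```

Reporting sensitivity and specificity alongside percentage agreement matters because an item with a rare practice can show high raw agreement while still missing most of the households where the practice is actually present.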
Development of an Itemwise Efficiency Scoring Method: Concurrent, Convergent, Discriminant, and Neuroimaging-Based Predictive Validity Assessed in a Large Community Sample

PubMed Central

Moore, Tyler M.; Reise, Steven P.; Roalf, David R.; Satterthwaite, Theodore D.; Davatzikos, Christos; Bilker, Warren B.; Port, Allison M.; Jackson, Chad T.; Ruparel, Kosha; Savitt, Adam P.; Baron, Robert B.; Gur, Raquel E.; Gur, Ruben C.

2016-01-01

Traditional “paper-and-pencil” testing is imprecise in measuring speed and hence limited in assessing performance efficiency, but computerized testing permits precision in measuring itemwise response time. We present a method of scoring performance efficiency (combining information from accuracy and speed) at the item level. Using a community sample of 9,498 youths age 8-21, we calculated item-level efficiency scores on four neurocognitive tests, and compared the concurrent, convergent, discriminant, and predictive validity of these scores to simple averaging of standardized speed and accuracy-summed scores. Concurrent validity was measured by the scores' abilities to distinguish men from women and their correlations with age; convergent and discriminant validity were measured by correlations with other scores inside and outside of their neurocognitive domains; predictive validity was measured by correlations with brain volume in regions associated with the specific neurocognitive abilities. Results provide support for the ability of itemwise efficiency scoring to detect signals as strong as those detected by standard efficiency scoring methods. We find no evidence of superior validity of the itemwise scores over traditional scores, but point out several advantages of the former. The itemwise efficiency scoring method shows promise as an alternative to standard efficiency scoring methods, with overall moderate support from tests of four different types of validity. This method allows the use of existing item analysis methods and provides the convenient ability to adjust the overall emphasis of accuracy versus speed in the efficiency score, thus adjusting the scoring to the real-world demands the test is aiming to fulfill. PMID:26866796
Reducing the Impact of Inappropriate Items on Reviewable Computerized Adaptive Testing

ERIC Educational Resources Information Center

Yen, Yung-Chin; Ho, Rong-Guey; Liao, Wen-Wei; Chen, Li-Ju

2012-01-01

In a test, the score would be closer to the examinee's actual ability if careless mistakes were corrected. In CAT, however, changing the answer to one item might render the following items no longer appropriate for estimating the examinee's ability. These inappropriate items in a reviewable CAT might in turn introduce bias in ability…
