An Improved Internal Consistency Reliability Estimate.
ERIC Educational Resources Information Center
Cliff, Norman
1984-01-01
The proposed coefficient is derived by assuming that the average Goodman-Kruskal gamma between items of identical difficulty would be the same for items of different difficulty. An estimate of covariance between items of identical difficulty leads to an estimate of the correlation between two tests with identical distributions of difficulty.…
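For readers unfamiliar with the Goodman-Kruskal gamma mentioned above, the following is a minimal sketch of how gamma between two items can be computed from concordant and discordant pairs of examinees. It illustrates only the standard gamma statistic, not the proposed reliability coefficient; the function name and sample data are hypothetical.

```python
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """Return gamma = (C - D) / (C + D) for two paired score lists.

    C counts concordant pairs of examinees, D counts discordant pairs;
    pairs tied on either item are ignored.
    """
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical 0/1 responses of five examinees to two items.
item_a = [0, 0, 1, 1, 1]
item_b = [0, 1, 0, 1, 1]
gamma = goodman_kruskal_gamma(item_a, item_b)
```

For two dichotomous items this agrees with the 2x2 table form (ad - bc) / (ad + bc).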
ERIC Educational Resources Information Center
Matlock, Ki Lynn
2013-01-01
When test forms that have equal total test difficulty and number of items vary in difficulty and length within sub-content areas, an examinee's estimated score may vary across equivalent forms, depending on how well his or her true ability in each sub-content area aligns with the difficulty of items and number of items within these areas.…
Item Writer Judgments of Item Difficulty versus Actual Item Difficulty: A Case Study
ERIC Educational Resources Information Center
Sydorenko, Tetyana
2011-01-01
This study investigates how accurate one item writer can be on item difficulty estimates and whether factors affecting item writer judgments correspond to predictors of actual item difficulty. The items were based on conversational dialogs (presented as videos online) that focus on pragmatic functions. Thirty-five 2nd-, 3rd-, and 4th-year learners…
Item Response Theory Modeling of the Philadelphia Naming Test.
Fergadiotis, Gerasimos; Kellough, Stacey; Hula, William D
2015-06-01
In this study, we investigated the fit of the Philadelphia Naming Test (PNT; Roach, Schwartz, Martin, Grewal, & Brecher, 1996) to an item-response-theory measurement model, estimated the precision of the resulting scores and item parameters, and provided a theoretical rationale for the interpretation of PNT overall scores by relating explanatory variables to item difficulty. This article describes the statistical model underlying the computer adaptive PNT presented in a companion article (Hula, Kellough, & Fergadiotis, 2015). Using archival data, we evaluated the fit of the PNT to 1- and 2-parameter logistic models and examined the precision of the resulting parameter estimates. We regressed the item difficulty estimates on three predictor variables: word length, age of acquisition, and contextual diversity. The 2-parameter logistic model demonstrated marginally better fit, but the fit of the 1-parameter logistic model was adequate. Precision was excellent for both person ability and item difficulty estimates. Word length, age of acquisition, and contextual diversity all independently contributed to variance in item difficulty. Item-response-theory methods can be productively used to analyze and quantify anomia severity in aphasia. Regression of item difficulty on lexical variables supported the validity of the PNT and interpretation of anomia severity scores in the context of current word-finding models.
ERIC Educational Resources Information Center
Matlock, Ki Lynn; Turner, Ronna
2016-01-01
When constructing multiple test forms, the number of items and the total test difficulty are often equivalent. Not all test developers match the number of items and/or average item difficulty within subcontent areas. In this simulation study, six test forms were constructed having an equal number of items and average item difficulty overall.…
Multiple choice questions can be designed or revised to challenge learners' critical thinking.
Tractenberg, Rochelle E; Gushta, Matthew M; Mulroney, Susan E; Weissinger, Peggy A
2013-12-01
Multiple choice (MC) questions from a graduate physiology course were evaluated by cognitive-psychology (but not physiology) experts, and analyzed statistically, in order to test the independence of content expertise and cognitive complexity ratings of MC items. Integration of higher order thinking into MC exams is important, but widely known to be challenging, perhaps especially when content experts must think like novices. Expertise in the domain (content) may actually impede the creation of higher-complexity items. Three cognitive psychology experts independently rated cognitive complexity for 252 multiple-choice physiology items using a six-level cognitive complexity matrix that was synthesized from the literature. Rasch modeling estimated item difficulties. The complexity ratings and difficulty estimates were then analyzed together to determine the relative contributions (and independence) of complexity and difficulty to the likelihood of correct answers on each item. Cognitive complexity was found to be statistically independent of difficulty estimates for 88% of items. Using the complexity matrix, modifications were identified to increase some item complexities by one level, without affecting the item's difficulty. Cognitive complexity can effectively be rated by non-content experts. The six-level complexity matrix, if applied by faculty peer groups trained in cognitive complexity and without domain-specific expertise, could lead to improvements in the complexity targeted with item writing and revision. Targeting higher order thinking with MC questions can be achieved without changing item difficulties or other test characteristics, but this may be less likely if the content expert is left to assess items within their domain of expertise.
Item Estimates under Low-Stakes Conditions: How Should Omits Be Treated?
ERIC Educational Resources Information Center
DeMars, Christine
Using data from a pilot test of science and math from students in 30 high schools, item difficulties were estimated with a one-parameter model (partial-credit model for the multi-point items). Some items were multiple-choice items, and others were constructed-response items (open-ended). Four sets of estimates were obtained: estimates for males…
The Effects of Judgment-Based Stratum Classifications on the Efficiency of Stratum Scored CATs.
ERIC Educational Resources Information Center
Finney, Sara J.; Smith, Russell W.; Wise, Steven L.
Two operational item pools were used to investigate the performance of stratum computerized adaptive tests (CATs) when items were assigned to strata based on empirical estimates of item difficulty or human judgments of item difficulty. Items from the first data set consisted of 54 5-option multiple choice items from a form of the ACT mathematics…
Rasch Measurement and Item Banking: Theory and Practice.
ERIC Educational Resources Information Center
Nakamura, Yuji
The Rasch Model is a one-parameter item response theory model which states that the probability of a correct response on a test is a function of the difficulty of the item and the ability of the candidate. Item banking is useful for language testing. The Rasch Model provides estimates of item difficulties that are meaningful,…
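The relationship described above can be sketched in a few lines. This is a minimal illustration of the standard dichotomous Rasch formula, assuming ability and difficulty are expressed on the same logit scale; the function name and the example values are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model.

    P(correct) = 1 / (1 + exp(-(ability - difficulty))), with both
    arguments on the logit scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A candidate whose ability equals the item difficulty has a 50% chance.
p_matched = rasch_probability(ability=0.0, difficulty=0.0)

# The same candidate facing a harder item has a lower chance.
p_harder = rasch_probability(ability=0.0, difficulty=1.5)
```

Because only the difference between ability and difficulty enters the formula, items calibrated on a common scale can be stored in a bank and reused across test forms.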
A Comparison of Three Test Formats to Assess Word Difficulty
ERIC Educational Resources Information Center
Culligan, Brent
2015-01-01
This study compared three common vocabulary test formats, the Yes/No test, the Vocabulary Knowledge Scale (VKS), and the Vocabulary Levels Test (VLT), as measures of vocabulary difficulty. Vocabulary difficulty was defined as the item difficulty estimated through Item Response Theory (IRT) analysis. Three tests were given to 165 Japanese students,…
ERIC Educational Resources Information Center
Ali, Usama S.; Walker, Michael E.
2014-01-01
Two methods are currently in use at Educational Testing Service (ETS) for equating observed item difficulty statistics. The first method involves the linear equating of item statistics in an observed sample to reference statistics on the same items. The second method, or the item response curve (IRC) method, involves the summation of conditional…
Comparison of Alternate and Original Items on the Montreal Cognitive Assessment.
Lebedeva, Elena; Huang, Mei; Koski, Lisa
2016-03-01
The Montreal Cognitive Assessment (MoCA) is a screening tool for mild cognitive impairment (MCI) in elderly individuals. We hypothesized that measurement error when using the new alternate MoCA versions to monitor change over time could be related to the use of items that are not of comparable difficulty to their corresponding originals of similar content. The objective of this study was to compare the difficulty of the alternate MoCA items to the original ones. Five selected items from alternate versions of the MoCA were included with items from the original MoCA administered adaptively to geriatric outpatients (N = 78). Rasch analysis was used to estimate the difficulty level of the items. None of the five items from the alternate versions matched the difficulty level of their corresponding original items. This study demonstrates the potential benefits of a Rasch analysis-based approach for selecting items during the process of development of parallel forms. The results suggest that better match of the items from different MoCA forms by their difficulty would result in higher sensitivity to changes in cognitive function over time.
Estimation of Item Response Theory Parameters in the Presence of Missing Data
ERIC Educational Resources Information Center
Finch, Holmes
2008-01-01
Missing data are a common problem in a variety of measurement settings, including responses to items on both cognitive and affective assessments. Researchers have shown that such missing data may create problems in the estimation of item difficulty parameters in the Item Response Theory (IRT) context, particularly if they are ignored. At the same…
Estimating the Number of Examinees Who Did Not Reach the Last Item of a Section.
ERIC Educational Resources Information Center
Wainer, Howard
It is important to estimate the number of examinees who reached a test item, because item difficulty is defined by the number who answered correctly divided by the number who reached the item. A new method is presented and compared to the previously used definition of three categories of response to an item: (1) answered; (2) omitted--a…
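The definition of item difficulty quoted above (number correct divided by number who reached the item) can be sketched as follows. The response codes are hypothetical, and the sketch assumes that an omitted item still counts as reached; it does not implement the new estimation method the abstract refers to.

```python
# Hypothetical response codes for one item across examinees.
CORRECT = "correct"
INCORRECT = "incorrect"
OMITTED = "omitted"          # assumed here to count as reached
NOT_REACHED = "not_reached"  # excluded from the denominator


def item_difficulty(responses):
    """Proportion correct among examinees who reached the item."""
    reached = [r for r in responses if r != NOT_REACHED]
    if not reached:
        raise ValueError("no examinee reached this item")
    n_correct = sum(1 for r in reached if r == CORRECT)
    return n_correct / len(reached)


responses = [CORRECT, INCORRECT, OMITTED, NOT_REACHED, CORRECT]
difficulty = item_difficulty(responses)  # 2 correct out of 4 who reached
```

Misclassifying not-reached examinees as omits (or the reverse) changes the denominator, which is why the count of examinees who reached the item matters.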
Comparison of Alternate and Original Items on the Montreal Cognitive Assessment
Lebedeva, Elena; Huang, Mei; Koski, Lisa
2016-01-01
Background: The Montreal Cognitive Assessment (MoCA) is a screening tool for mild cognitive impairment (MCI) in elderly individuals. We hypothesized that measurement error when using the new alternate MoCA versions to monitor change over time could be related to the use of items that are not of comparable difficulty to their corresponding originals of similar content. The objective of this study was to compare the difficulty of the alternate MoCA items to the original ones. Methods: Five selected items from alternate versions of the MoCA were included with items from the original MoCA administered adaptively to geriatric outpatients (N = 78). Rasch analysis was used to estimate the difficulty level of the items. Results: None of the five items from the alternate versions matched the difficulty level of their corresponding original items. Conclusions: This study demonstrates the potential benefits of a Rasch analysis-based approach for selecting items during the process of development of parallel forms. The results suggest that better match of the items from different MoCA forms by their difficulty would result in higher sensitivity to changes in cognitive function over time. PMID:27076861
A Graphical Approach to Item Analysis. Research Report. ETS RR-04-10
ERIC Educational Resources Information Center
Livingston, Samuel A.; Dorans, Neil J.
2004-01-01
This paper describes an approach to item analysis that is based on the estimation of a set of response curves for each item. The response curves show, at a glance, the difficulty and the discriminating power of the item and the popularity of each distractor, at any level of the criterion variable (e.g., total score). The curves are estimated by…
Classical Item Analysis Using Latent Variable Modeling: A Note on a Direct Evaluation Procedure
ERIC Educational Resources Information Center
Raykov, Tenko; Marcoulides, George A.
2011-01-01
A directly applicable latent variable modeling procedure for classical item analysis is outlined. The method allows one to point and interval estimate item difficulty, item correlations, and item-total correlations for composites consisting of categorical items. The approach is readily employed in empirical research and as a by-product permits…
Park, Jong Cook; Kim, Kwang Sig
2012-03-01
The reliability of a test is determined by the characteristics of its items. Item analysis can be carried out with classical test theory and item response theory. The purpose of the study was to compare the discrimination indices with item response theory using the Rasch model. Thirty-one 4th-year medical school students participated in the clinical course written examination, which included 22 A-type items and 3 R-type items. The point biserial correlation coefficient (C(pbs)) was compared to the method of extreme group (D), the biserial correlation coefficient (C(bs)), the item-total correlation coefficient (C(it)), and the corrected item-total correlation coefficient (C(cit)). The Rasch model was applied to estimate item difficulty and examinee ability and to calculate item fit statistics using joint maximum likelihood. The explanatory power (r²) for C(pbs) decreased in the following order: C(cit) (1.00), C(it) (0.99), C(bs) (0.94), and D (0.45). The difficulty logits ranged from -0.82 to 0.80 with standard errors from 0.37 to 0.76, and the ability logits ranged from -3.69 to 3.19 with standard errors from 0.45 to 1.03. Items 9 and 23 had outfit ≥ 1.3. Students 1, 5, 7, 18, 26, 30, and 32 had fit ≥ 1.3. C(pbs), C(cit), and C(it) are good discrimination parameters. The Rasch model can estimate the item difficulty parameter and the examinee ability parameter with standard errors. The fit statistics can identify bad items and unpredictable examinee responses.
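Two of the classical discrimination indices named above can be sketched as follows. This is an illustrative computation only, under the assumption that the item-total correlation is a Pearson correlation between the item score and the total score, and that the corrected version removes the item from the total first; how the study itself computed each index is not stated in the abstract. For a 0/1 item, a Pearson correlation with a continuous score is numerically a point-biserial correlation. The function names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_total_correlations(scores, item):
    """Return (uncorrected, corrected) item-total correlations.

    scores is a list of rows, one per examinee, of 0/1 item scores.
    The corrected version correlates the item with the total of the
    remaining items, so the item does not correlate with itself.
    """
    item_scores = [row[item] for row in scores]
    totals = [sum(row) for row in scores]
    rest = [t - i for t, i in zip(totals, item_scores)]
    return pearson(item_scores, totals), pearson(item_scores, rest)


# Hypothetical responses: four examinees, three items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
uncorrected, corrected = item_total_correlations(scores, item=0)
```

On short tests the corrected value is noticeably smaller than the uncorrected one, because a single item makes up a large share of the total score.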
ERIC Educational Resources Information Center
Schmitt, T. A.; Sass, D. A.; Sullivan, J. R.; Walker, C. M.
2010-01-01
Imposed time limits on computer adaptive tests (CATs) can result in examinees having difficulty completing all items, thus compromising the validity and reliability of ability estimates. In this study, the effects of speededness were explored in a simulated CAT environment by varying examinee response patterns to end-of-test items. Expectedly,…
Watanabe, Yusuke; Madani, Amin; Ito, Yoichi M; Bilgic, Elif; McKendy, Katherine M; Feldman, Liane S; Fried, Gerald M; Vassiliou, Melina C
2017-02-01
The extent to which each item assessed using the Global Operative Assessment of Laparoscopic Skills (GOALS) contributes to the total score remains unknown. The purpose of this study was to evaluate the level of difficulty and discriminative ability of each of the 5 GOALS items using item response theory (IRT). A total of 396 GOALS assessments for a variety of laparoscopic procedures over a 12-year time period were included. Threshold parameters of item difficulty and discrimination power were estimated for each item using IRT. The higher slope parameters seen with "bimanual dexterity" and "efficiency" are indicative of greater discriminative ability than "depth perception", "tissue handling", and "autonomy". IRT psychometric analysis indicates that the 5 GOALS items do not demonstrate uniform difficulty and discriminative power, suggesting that they should not be scored equally. "Bimanual dexterity" and "efficiency" seem to have stronger discrimination. Weighted scores based on these findings could improve the accuracy of assessing individual laparoscopic skills.
Tsubakita, Takashi; Shimazaki, Kazuyo; Ito, Hiroshi; Kawazoe, Nobuo
2017-10-30
The Utrecht Work Engagement Scale for Students has been used internationally to assess students' academic engagement, but it has not been analyzed via item response theory. The purpose of this study was to conduct an item response theory analysis of the Japanese version of the Utrecht Work Engagement Scale for Students translated by the authors. Using a two-parameter model and Samejima's graded response model, difficulty and discrimination parameters were estimated after confirming the factor structure of the scale. The 14 items on the scale were analyzed with a sample of 3214 university and college students majoring in medical science, nursing, or natural science in Japan. The preliminary parameter estimation was conducted with the two-parameter model, and indicated that three items should be removed because they had outlying parameters. Final parameter estimation was conducted using the remaining 11 items, and indicated that all difficulty and discrimination parameters were acceptable. The test information curve suggested that the scale better assesses higher engagement than average engagement. The estimated parameters provide a basis for future comparative studies. The results also suggested that a 7-point Likert scale is too broad; thus, the scaling should be modified to a structure with fewer response categories.
Measuring the Instructional Sensitivity of ESL Reading Comprehension Items.
ERIC Educational Resources Information Center
Brutten, Sheila R.; And Others
A study attempted to estimate the instructional sensitivity of items in three reading comprehension tests in English as a second language (ESL). Instructional sensitivity is a test-item construct defined as the tendency for a test item to vary in difficulty as a function of instruction. Similar tasks were given to readers at different proficiency…
Rasch Measurement of Collaborative Problem Solving in an Online Environment.
Harding, Susan-Marie E; Griffin, Patrick E
2016-01-01
This paper describes an approach to the assessment of human-to-human collaborative problem solving using a set of online interactive tasks completed by student dyads. Within the dyad, roles were nominated as either A or B and students selected their own roles. The question as to whether role selection affected individual student performance measures is addressed. Process stream data were captured from 3402 students in six countries who explored the problem space by clicking, dragging the mouse, moving the cursor and collaborating with their partner through a chat box window. Process stream data were explored to identify behavioural indicators that represented elements of a conceptual framework. These indicative behaviours were coded into a series of dichotomous items. These items represented actions and chats performed by students. The frequency of occurrence was used as a proxy measure of item difficulty. Then, given a measure of item difficulty, student ability could be estimated using the difficulty estimates of the range of items demonstrated by the student. The Rasch simple logistic model was used to review the indicators to identify those that were consistent with the assumptions of the model and were invariant across national samples, language, curriculum and age of the student. The data were analysed using a one- and two-dimensional one-parameter model. Rasch separation reliability, fit to the model, distribution of students and items on the underpinning construct, estimates for each country and the effect of role differences are reported. This study provides evidence that collaborative problem solving can be assessed in an online environment involving human-to-human interaction using behavioural indicators shown to have a consistent relationship between the estimate of student ability, and the probability of demonstrating the behaviour.
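One common way to turn a frequency of occurrence into a provisional difficulty value on the logit scale is the log-odds of not showing the behaviour. The sketch below assumes that transformation; the abstract does not say which transformation, if any, the authors applied before Rasch calibration, and the function name and counts are hypothetical.

```python
import math


def logit_difficulty_from_frequency(n_demonstrated, n_students):
    """Provisional difficulty as ln((1 - p) / p), where p is the share
    of students who demonstrated the behaviour.

    Rare behaviours get high (hard) values, common ones low (easy) values.
    """
    p = n_demonstrated / n_students
    if p <= 0.0 or p >= 1.0:
        raise ValueError("logit is undefined when p is 0 or 1")
    return math.log((1.0 - p) / p)


# Hypothetical indicator shown by 25 of 100 students.
rare_indicator = logit_difficulty_from_frequency(25, 100)

# Hypothetical indicator shown by half of the students.
common_indicator = logit_difficulty_from_frequency(50, 100)
```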
ERIC Educational Resources Information Center
Dai, Yunyun
2013-01-01
Mixtures of item response theory (IRT) models have been proposed as a technique to explore response patterns in test data related to cognitive strategies, instructional sensitivity, and differential item functioning (DIF). Estimation proves challenging due to difficulties in identification and questions of effect size needed to recover underlying…
Regression Effects in Angoff Ratings: Examples from Credentialing Exams
ERIC Educational Resources Information Center
Wyse, Adam E.
2018-01-01
This article discusses regression effects that are commonly observed in Angoff ratings where panelists tend to think that hard items are easier than they are and easy items are more difficult than they are in comparison to estimated item difficulties. Analyses of data from two credentialing exams illustrate these regression effects and the…
ERIC Educational Resources Information Center
Chan, David W.
2010-01-01
Data of item responses to the Impossible Figures Task (IFT) from 492 Chinese primary, secondary, and university students were analyzed using the dichotomous Rasch measurement model. Item difficulty estimates and person ability estimates located on the same logit scale revealed that the pooled sample of Chinese students, who were relatively highly…
ERIC Educational Resources Information Center
Rakkapao, Suttida; Prasitpong, Singha; Arayathanitkul, Kwan
2016-01-01
This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item response theory (IRT). The difficulty, discrimination, and guessing parameters of the TUV items were fit with the three-parameter logistic model of IRT, using the PARSCALE program. The TUV ability is an ability parameter, here estimated assuming…
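The three-parameter logistic model referred to above can be sketched as a single function. This is the textbook 3PL form without the optional scaling constant (often written D = 1.7); whether the study used that constant is not stated, and the parameter values below are hypothetical rather than TUV estimates.

```python
import math


def three_pl_probability(theta, discrimination, difficulty, guessing):
    """Probability of a correct response under the 3PL model.

    P = c + (1 - c) / (1 + exp(-a * (theta - b))), where a is the
    discrimination, b the difficulty and c the guessing parameter.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return guessing + (1.0 - guessing) * logistic


# Hypothetical item: a = 1.2, b = 0.0, c = 0.2 (five-option guessing floor).
p_at_difficulty = three_pl_probability(0.0, 1.2, 0.0, 0.2)  # midway between c and 1
p_low_ability = three_pl_probability(-4.0, 1.2, 0.0, 0.2)   # approaches c
```

The guessing parameter sets a lower asymptote, so even very low-ability examinees keep a non-zero chance of answering a multiple-choice item correctly.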
Examining the Invariance of Rater and Project Calibrations Using a Multi-facet Rasch Model.
ERIC Educational Resources Information Center
O'Neill, Thomas R.; Lunz, Mary E.
To generalize test results beyond the particular test administration, an examinee's ability estimate must be independent of the particular items attempted, and the item difficulty calibrations must be independent of the particular sample of people attempting the items. This stability is a key concept of the Rasch model, a latent trait model of…
A New Functional Health Literacy Scale for Japanese Young Adults Based on Item Response Theory.
Tsubakita, Takashi; Kawazoe, Nobuo; Kasano, Eri
2017-03-01
Health literacy predicts health outcomes. Despite concerns surrounding the health of Japanese young adults, to date there has been no objective assessment of health literacy in this population. This study aimed to develop a Functional Health Literacy Scale for Young Adults (funHLS-YA) based on item response theory. Each item in the scale requires participants to choose the most relevant term from 3 choices in relation to a target item, thus assessing objective rather than perceived health literacy. The 20-item scale was administered to 1816 university students and 1751 responded. Cronbach's α coefficient was .73. Difficulty and discrimination parameters of each item were estimated, resulting in the exclusion of 1 item. Some items showed different difficulty parameters for male and female participants, reflecting that some aspects of health literacy may differ by gender. The current 19-item version of funHLS-YA can reliably assess the objective health literacy of Japanese young adults.
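The Cronbach's α reported above is computed from item variances and the variance of the total score. A minimal sketch using only the Python standard library follows; the response matrix is hypothetical and far smaller than the study sample.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row of item scores per person).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(scores[0])
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


# Hypothetical 0/1 responses: four respondents, three items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(scores)
```

The same divisor is used for the item and total variances, so choosing sample or population variance does not change the result.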
Item selection via Bayesian IRT models.
Arima, Serena
2015-02-10
With reference to a questionnaire that aimed to assess the quality of life for dysarthric speakers, we investigate the usefulness of a model-based procedure for reducing the number of items. We propose a mixed cumulative logit model, which is known in the psychometrics literature as the graded response model: responses to different items are modelled as a function of individual latent traits and as a function of item characteristics, such as their difficulty and their discrimination power. We jointly model the discrimination and the difficulty parameters by using a k-component mixture of normal distributions. Mixture components correspond to disjoint groups of items. Items that belong to the same groups can be considered equivalent in terms of both difficulty and discrimination power. According to decision criteria, we select a subset of items such that the reduced questionnaire is able to provide the same information that the complete questionnaire provides. The model is estimated by using a Bayesian approach, and the choice of the number of mixture components is justified according to information criteria. We illustrate the proposed approach on the basis of data that are collected for 104 dysarthric patients by local health authorities in Lecce and in Milan.
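The graded response model named above assigns a probability to each ordered response category from a set of cumulative logits. The sketch below covers only that response function, assuming the usual parameterisation with one discrimination and increasing thresholds per item; it does not attempt the mixture prior or the Bayesian estimation, and all names and values are hypothetical.

```python
import math


def grm_category_probabilities(theta, discrimination, thresholds):
    """Category probabilities under the graded response model.

    thresholds must be in increasing order. The cumulative probability of
    responding in category k or higher is logistic(a * (theta - b_k));
    each category probability is the difference of adjacent cumulatives.
    """
    cumulative = [1.0]
    for b in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-discrimination * (theta - b))))
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical three-category item with thresholds at -1 and 1 logits.
probabilities = grm_category_probabilities(0.0, 1.0, [-1.0, 1.0])
```

Two items with similar discrimination and thresholds yield nearly the same category probabilities at every trait level, which is the sense in which items in one group can be treated as equivalent.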
ERIC Educational Resources Information Center
Schroeders, Ulrich; Robitzsch, Alexander; Schipolowski, Stefan
2014-01-01
C-tests are a specific variant of cloze tests that are considered time-efficient, valid indicators of general language proficiency. They are commonly analyzed with models of item response theory assuming local item independence. In this article we estimated local interdependencies for 12 C-tests and compared the changes in item difficulties,…
The Accuracy of Estimated Total Test Statistics. Final Report.
ERIC Educational Resources Information Center
Kleinke, David J.
In a post-mortem study of item sampling, 1,050 examinees were divided into ten groups 50 times. Each time, their papers were scored on four different sets of item samples from a 150-item test of academic aptitude. These samples were selected using (a) unstratified random sampling and stratification on (b) content, (c) difficulty, and (d) both.…
ERIC Educational Resources Information Center
Kibble, Jonathan D.; Johnson, Teresa
2011-01-01
The purpose of this study was to evaluate whether multiple-choice item difficulty could be predicted either by a subjective judgment by the question author or by applying a learning taxonomy to the items. Eight physiology faculty members teaching an upper-level undergraduate human physiology course consented to participate in the study. The…
Eye Movements Reveal How Task Difficulty Moulds Visual Search
ERIC Educational Resources Information Center
Young, Angela H.; Hulleman, Johan
2013-01-01
In two experiments we investigated the relationship between eye movements and performance in visual search tasks of varying difficulty. Experiment 1 provided evidence that a single process is used for search among static and moving items. Moreover, we estimated the functional visual field (FVF) from the gaze coordinates and found that its size…
Cohn, Amy M.; Hagman, Brett T.; Graff, Fiona S.; Noel, Nora E.
2011-01-01
Objective: The present study examined the latent continuum of alcohol-related negative consequences among first-year college women using methods from item response theory and classical test theory. Method: Participants (N = 315) were college women in their freshman year who reported consuming any alcohol in the past 90 days and who completed assessments of alcohol consumption and alcohol-related negative consequences using the Rutgers Alcohol Problem Index. Results: Item response theory analyses showed poor model fit for five items identified in the Rutgers Alcohol Problem Index. Two-parameter item response theory logistic models were applied to the remaining 18 items to examine estimates of item difficulty (i.e., severity) and discrimination parameters. The item difficulty parameters ranged from 0.591 to 2.031, and the discrimination parameters ranged from 0.321 to 2.371. Classical test theory analyses indicated that the omission of the five misfit items did not significantly alter the psychometric properties of the construct. Conclusions: Findings suggest that those consequences that had greater severity and discrimination parameters may be used as screening items to identify female problem drinkers at risk for an alcohol use disorder. PMID:22051212
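The two-parameter logistic model used above can be sketched to show how severity (difficulty) and discrimination shape the endorsement probability. The parameter values below are hypothetical, chosen only to fall inside the ranges reported in the abstract; they are not the estimates for any particular Rutgers Alcohol Problem Index item.

```python
import math


def two_pl_probability(theta, discrimination, difficulty):
    """Probability of endorsing an item under the 2PL model.

    P = 1 / (1 + exp(-a * (theta - b))), where b is the severity
    (difficulty) and a is the discrimination.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical severe item: b = 2.0; compare weak and strong discrimination.
weak = two_pl_probability(theta=3.0, discrimination=0.4, difficulty=2.0)
strong = two_pl_probability(theta=3.0, discrimination=2.3, difficulty=2.0)
```

An item with high severity and high discrimination is rarely endorsed below its severity level and almost always endorsed above it, which is what makes such items candidates for screening.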
A new item response theory model to adjust data allowing examinee choice
Costa, Marcelo Azevedo; Braga Oliveira, Rivert Paulo
2018-01-01
In a typical questionnaire testing situation, examinees are not allowed to choose which items they answer because of a technical issue in obtaining satisfactory statistical estimates of examinee ability and item difficulty. This paper introduces a new item response theory (IRT) model that incorporates information from a novel representation of questionnaire data using network analysis. Three scenarios in which examinees select a subset of items were simulated. In the first scenario, the assumptions required to apply the standard Rasch model are met, thus establishing a reference for parameter accuracy. The second and third scenarios include five increasing levels of violating those assumptions. The results show substantial improvements over the standard model in item parameter recovery. Furthermore, the accuracy was closer to the reference in almost every evaluated scenario. To the best of our knowledge, this is the first proposal to obtain satisfactory IRT statistical estimates in the last two scenarios. PMID:29389996
ERIC Educational Resources Information Center
Hamadneh, Iyad Mohammed
2015-01-01
This study aimed to investigate the impact of changing the position of the escape alternative in a multiple-choice test on the psychometric properties of the test, its item parameters (difficulty, discrimination, and guessing), and the estimation of examinee ability. To achieve the study objectives, a 4-alternative multiple-choice achievement test…
Tulsky, David S; Kisala, Pamela A; Tate, Denise G; Spungen, Ann M; Kirshblum, Steven C
2015-05-01
To describe the development and psychometric properties of the Spinal Cord Injury--Quality of Life (SCI-QOL) Bladder Management Difficulties and Bowel Management Difficulties item banks and Bladder Complications scale. Using a mixed-methods design, a pool of items assessing bladder and bowel-related concerns were developed using focus groups with individuals with spinal cord injury (SCI) and SCI clinicians, cognitive interviews, and item response theory (IRT) analytic approaches, including tests of model fit and differential item functioning. Thirty-eight bladder items and 52 bowel items were tested at the University of Michigan, Kessler Foundation Research Center, the Rehabilitation Institute of Chicago, the University of Washington, Craig Hospital, and the James J. Peters VA Medical Center, Bronx, NY. Seven hundred fifty-seven adults with traumatic SCI. The final item banks demonstrated unidimensionality (Bladder Management Difficulties CFI=0.965; RMSEA=0.093; Bowel Management Difficulties CFI=0.955; RMSEA=0.078) and acceptable fit to a graded response IRT model. The final calibrated Bladder Management Difficulties bank includes 15 items, and the final Bowel Management Difficulties item bank consists of 26 items. Additionally, 5 items related to urinary tract infections (UTI) did not fit with the larger Bladder Management Difficulties item bank but performed relatively well independently (CFI=0.992, RMSEA=0.050) and were thus retained as a separate scale. The SCI-QOL Bladder Management Difficulties and Bowel Management Difficulties item banks are psychometrically robust and are available as computer adaptive tests or short forms. The SCI-QOL Bladder Complications scale is a brief, fixed-length outcomes instrument for individuals with a UTI.
Is the Factor Observed in Investigations on the Item-Position Effect Actually the Difficulty Factor?
Schweizer, Karl; Troche, Stefan
2018-02-01
In confirmatory factor analysis quite similar models of measurement serve the detection of the difficulty factor and the factor due to the item-position effect. The item-position effect refers to the increasing dependency among the responses to successively presented items of a test whereas the difficulty factor is ascribed to the wide range of item difficulties. The similarity of the models of measurement hampers the dissociation of these factors. Since the item-position effect should theoretically be independent of the item difficulties, the statistical ex post manipulation of the difficulties should enable the discrimination of the two types of factors. This method was investigated in two studies. In the first study, Advanced Progressive Matrices (APM) data of 300 participants were investigated. As expected, the factor thought to be due to the item-position effect was observed. In the second study, using data simulated to show the major characteristics of the APM data, the wide range of items with various difficulties was set to zero to reduce the likelihood of detecting the difficulty factor. Despite this reduction, however, the factor now identified as item-position factor, was observed in virtually all simulated datasets.
ERIC Educational Resources Information Center
Kostin, Irene
2004-01-01
The purpose of this study is to explore the relationship between a set of item characteristics and the difficulty of TOEFL[R] dialogue items. Identifying characteristics that are related to item difficulty has the potential to improve the efficiency of the item-writing process. The study employed 365 TOEFL dialogue items, which were coded on 49…
Building an Evaluation Scale using Item Response Theory.
Lalor, John P; Wu, Hao; Yu, Hong
2016-11-01
Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regards to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation. IRT is able to describe characteristics of individual items - their difficulty and discriminating power - and can account for these characteristics in its estimation of human intelligence or ability for an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that our IRT model compares NLP systems with the performance in a human population and is able to provide more insight into system performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT score, which depends on the item characteristics and the response pattern.
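The closing claim of this abstract, that equal accuracy need not mean equal IRT score, can be illustrated with a minimal sketch. It assumes a generic two-parameter logistic (2PL) model and a grid-search maximum-likelihood ability estimate; the abstract does not state which IRT model or estimator the authors used, and the item parameters, response patterns, and function names below are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct answer
    for ability theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern given item parameters."""
    total = 0.0
    for x, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def estimate_ability(responses, items):
    """Maximum-likelihood ability estimate by grid search on [-4, 4]."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical items as (discrimination, difficulty) pairs.
ITEMS = [(0.5, -1.0), (0.5, 0.0), (2.0, 0.0), (2.0, 1.0)]

# Two hypothetical systems with the same accuracy (2 of 4 correct)
# but correct answers on different items.
pattern_a = [1, 1, 0, 0]  # correct only on the weakly discriminating items
pattern_b = [0, 0, 1, 1]  # correct only on the highly discriminating items

theta_a = estimate_ability(pattern_a, ITEMS)
theta_b = estimate_ability(pattern_b, ITEMS)
print("accuracy A:", sum(pattern_a) / len(ITEMS), "theta A:", theta_a)
print("accuracy B:", sum(pattern_b) / len(ITEMS), "theta B:", theta_b)
```

Under these assumed parameters, the pattern that succeeds on the more discriminating items receives the higher ability estimate even though both patterns have the same raw accuracy.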
Building an Evaluation Scale using Item Response Theory
Lalor, John P.; Wu, Hao; Yu, Hong
2016-01-01
Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regards to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation. IRT is able to describe characteristics of individual items - their difficulty and discriminating power - and can account for these characteristics in its estimation of human intelligence or ability for an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that our IRT model compares NLP systems with the performance in a human population and is able to provide more insight into system performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT score, which depends on the item characteristics and the response pattern. PMID:28004039
Statistical Approaches to the Study of Item Difficulty.
ERIC Educational Resources Information Center
Olson, John F.; And Others
Traditionally, item difficulty has been defined in terms of the performance of examinees. For test development purposes, a more useful concept would be some kind of intrinsic item difficulty, defined in terms of the item's content, context, or characteristics and the task demands set by the item. In this investigation, the measurement literature…
A Comparison of IRT Proficiency Estimation Methods under Adaptive Multistage Testing
ERIC Educational Resources Information Center
Kim, Sooyeon; Moses, Tim; Yoo, Hanwook
2015-01-01
This inquiry is an investigation of item response theory (IRT) proficiency estimators' accuracy under multistage testing (MST). We chose a two-stage MST design that includes four modules (one at Stage 1, three at Stage 2) and three difficulty paths (low, middle, high). We assembled various two-stage MST panels (i.e., forms) by manipulating two…
Modelling Question Difficulty in an A Level Physics Examination
ERIC Educational Resources Information Center
Crisp, Victoria; Grayson, Rebecca
2013-01-01
"Item difficulty modelling" is a technique used for a number of purposes such as to support future item development, to explore validity in relation to the constructs that influence difficulty and to predict the difficulty of items. This research attempted to explore the factors influencing question difficulty in a general qualification…
Predicting Item Difficulty of Science National Curriculum Tests: The Case of Key Stage 2 Assessments
ERIC Educational Resources Information Center
El Masri, Yasmine H.; Ferrara, Steve; Foltz, Peter W.; Baird, Jo-Anne
2017-01-01
Predicting item difficulty is highly important in education for both teachers and item writers. Despite identifying a large number of explanatory variables, predicting item difficulty remains a challenge in educational assessment with empirical attempts rarely exceeding 25% of variance explained. This paper analyses 216 science items of key stage…
ERIC Educational Resources Information Center
Wind, Stefanie A.; Engelhard, George, Jr.; Wesolowski, Brian
2016-01-01
When good model-data fit is observed, the Many-Facet Rasch (MFR) model acts as a linking and equating model that can be used to estimate student achievement, item difficulties, and rater severity on the same linear continuum. Given sufficient connectivity among the facets, the MFR model provides estimates of student achievement that are equated to…
Understanding Orgasmic Difficulty in Women.
Rowland, David L; Kolba, Tiffany N
2016-08-01
Women's primary issue with the orgasmic phase is usually difficulty reaching orgasm. To identify predictors of orgasmic difficulty in women within the context of a partnered sexual experience; to assess the relation between orgasmic difficulty and self-reported levels of sexual desire or interest and arousal in women; and to assess the interrelations among three dimensions of orgasmic response during partnered sex: self-reported time to reach orgasm, general difficulty or ease of reaching orgasm, and level of distress or concern. Drawing from a community-based sample using the Internet, 866 women were queried on a 26-item survey regarding their difficulty reaching orgasm during partnered sex. Four hundred sixteen women who indicated difficulty also responded to items assessing arousal and desire difficulties, level of distress about their condition, and their estimated time to reach orgasm. Answers to a 26-item survey on surveyed women's difficulty reaching orgasm during partnered sex. Age, arousal difficulty, and lubrication difficulty predicted difficulty reaching orgasm in the overall sample. In the subsample of women reporting difficulty, approximately half reported issues with arousal. Women with arousal problems reported greater difficulty reaching orgasm but did not differ from those without arousal problems on measurements of orgasm latency or levels of distress. Slightly more than half the women experiencing difficulty reaching orgasm were distressed by their condition; distressed women reported greater difficulty reaching orgasm and longer latencies to orgasm than non-distressed counterparts. They also reported lower satisfaction with their sexual relationship. This study indicates the importance of assessing multiple parameters when investigating orgasmic problems in women, including arousal issues, levels of distress, and latency to orgasm. 
Results also clarify that women with arousal problems do not differ substantially from those without arousal problems; in contrast, women distressed by their condition differ from non-distressed women along some critical dimensions. Although orgasmic problems decreased with age, the overall relation of this variable to distress, arousal, and latency to orgasm was essentially unchanged across age groups. Copyright © 2016 International Society for Sexual Medicine. Published by Elsevier Inc. All rights reserved.
Echeverri, Margarita; Anderson, David; Nápoles, Anna María
2016-01-01
This article describes the adaptation and initial validation of the Cancer Health Literacy Test (CHLT) for Spanish speakers. A cross-sectional field test of the Spanish version of the CHLT (CHLT-30-DKspa) was conducted among healthy Latinos in Louisiana. Diagonally weighted least squares was used to confirm the factor structure. Item response analysis using 2-parameter logistic estimates was used to identify questions that may require modification to avoid bias. Cronbach's alpha coefficients estimated scale internal consistency reliability. Analysis of variance was used to test for significant differences in CHLT-30-DKspa scores by gender, origin, age and education. The mean CHLT-30-DKspa score (N = 400) was 17.13 (range = 0-30, SD = 6.65). Results confirmed a unidimensional structure, χ²(405) = 461.55, p = .027, comparative fit index = .993, Tucker-Lewis index = .992, root mean square error of approximation = .0180. Cronbach's alpha was .88. Items Q1-High Calorie and Q15-Tumor Spread had the lowest item-scale correlations (.148 and .288, respectively) and standardized factor loadings (.152 and .302, respectively). Items Q19-Smoking Risk, Q8-Palliative Care, and Q1-High Calorie had the highest item difficulty parameters (difficulty = 1.12, 1.21, and 2.40, respectively). Results generally support the applicability of the CHLT-30-DKspa for healthy Spanish-speaking populations, with the exception of 4 items that need to be deleted or revised and further studied: Q1, Q8, Q15, and Q19.
Rasch Mixture Models for DIF Detection
Strobl, Carolin; Zeileis, Achim
2014-01-01
Rasch mixture models can be a useful tool when checking the assumption of measurement invariance for a single Rasch model. They provide advantages compared to manifest differential item functioning (DIF) tests when the DIF groups are only weakly correlated with the manifest covariates available. Unlike in single Rasch models, estimation of Rasch mixture models is sensitive to the specification of the ability distribution even when the conditional maximum likelihood approach is used. It is demonstrated in a simulation study how differences in ability can influence the latent classes of a Rasch mixture model. If the aim is only DIF detection, it is not of interest to uncover such ability differences as one is only interested in a latent group structure regarding the item difficulties. To avoid any confounding effect of ability differences (or impact), a new score distribution for the Rasch mixture model is introduced here. It ensures the estimation of the Rasch mixture model to be independent of the ability distribution and thus restricts the mixture to be sensitive to latent structure in the item difficulties only. Its usefulness is demonstrated in a simulation study, and its application is illustrated in a study of verbal aggression. PMID:29795819
Andersson, Helle Wessel; Bjørngaard, Johan Håkon; Kaspersen, Silje Lill; Wang, Catharina E A; Skre, Ingunn; Dahl, Thomas
2010-05-01
The aim was to examine the prevalence of mental health difficulties and prejudices toward mental illness among adolescents, and to analyze possible school and school class effects on these issues. The sample comprised 4,046 pupils (16-19 years) in 257 school classes from 45 Norwegian upper secondary schools. The estimated response rate among the pupils was about 96%. Self-reported mental health difficulties were measured with a four-item scale that covered emotional and behavioral difficulties. Prejudiced attitudes toward mental illness were assessed using a nine-item scale. Multilevel regression analysis was used to estimate the contribution of factors at the individual level, and at the school and class levels. Most of the variance in self-reported mental health difficulties and prejudices was accounted for by individual level factors (92-94%). However, there were statistically significant school and class level effects (P < 0.01), confounded by socioeconomic factors. Mental health difficulties were commonly reported, more often by females than males (P < 0.01). Difficulties with emotions and attention were the two main problem areas, with definite to severe difficulties being reported by 19 and 21% of the females, and by 9 and 16% of the males, respectively. Prejudices were reported more often by males than females (P < 0.01). Both self-reported mental health difficulties and prejudiced attitudes were related to educational program, living situation, and parental education (P < 0.01). The relatively high prevalences of mental health difficulties and prejudiced attitudes toward mental illness among adolescents indicate a need for effective mental health intervention programs. Targeted intervention strategies should be considered when there is evidence of a high number of risk factors in schools and school classes. Furthermore, the gender differences found in self-reported mental health difficulties and prejudices suggest a need for gender-differentiated programs.
Development and Psychometric Evaluation of the Gay Male Sexual Difficulties Scale.
McDonagh, Lorraine K; Stewart, Ian; Morrison, Melanie A; Morrison, Todd G
2016-08-01
Sexual difficulties (i.e., disturbances in normal sexual responding) have the potential to significantly and negatively affect men's social and psychological well-being. However, a review of published measurement tools indicates that most have limited applicability to gay men, and none offer a nuanced understanding of sexual difficulties, as experienced by members of this population. To address this omission, the Gay Male Sexual Difficulties Scale (GMSDS) was developed using a sequential mixed-methods approach. The 25-item GMSDS uses a 6-point frequency Likert-type response format and examines: difficulties with receptive and insertive anal intercourse (5 items each); erectile difficulties (4 items); foreskin difficulties (4 items); body embarrassment (4 items); and seminal fluid concerns (3 items). The measure's scale score dimensionality, assessed using both exploratory and confirmatory factor analyses, as well as scale score reliability and validity (e.g., known-groups and convergent) was tested and deemed to be satisfactory. Limitations of the current series of studies and directions for future research are discussed.
ERIC Educational Resources Information Center
Nissan, Susan; And Others
One of the item types in the Listening Comprehension section of the Test of English as a Foreign Language (TOEFL) test is the dialogue. Because the dialogue item pool needs to have an appropriate balance of items at a range of difficulty levels, test developers have examined items at various difficulty levels in an attempt to identify their…
Numerosity underestimation with item similarity in dynamic visual display.
Au, Ricky K C; Watanabe, Katsumi
2013-01-01
The estimation of numerosity of a large number of objects in a static visual display is possible even at short durations. Such coarse approximations of numerosity are distinct from subitizing, in which the number of objects can be reported with high precision when a small number of objects are presented simultaneously. The present study examined numerosity estimation of visual objects in dynamic displays and the effect of object similarity on numerosity estimation. In the basic paradigm (Experiment 1), two streams of dots were presented and observers were asked to indicate which of the two streams contained more dots. Streams consisting of dots that were identical in color were judged as containing fewer dots than streams where the dots were different colors. This underestimation effect for identical visual items disappeared when the presentation rate was slower (Experiment 1) or the visual display was static (Experiment 2). In Experiments 3 and 4, in addition to the numerosity judgment task, observers performed an attention-demanding task at fixation. Task difficulty influenced observers' precision in the numerosity judgment task, but the underestimation effect remained evident irrespective of task difficulty. These results suggest that identical or similar visual objects presented in succession might induce substitution among themselves, leading to an illusion that there are few items overall and that exploiting attentional resources does not eliminate the underestimation effect.
Sources of difficulty in assessment: example of PISA science items
NASA Astrophysics Data System (ADS)
Le Hebel, Florence; Montpied, Pascale; Tiberghien, Andrée; Fontanieu, Valérie
2017-03-01
The understanding of what makes a question difficult is a crucial concern in assessment. To study the difficulty of test questions, we focus on the case of PISA, which assesses to what degree 15-year-old students have acquired knowledge and skills essential for full participation in society. Our research question is to identify PISA science item characteristics that could influence the item's proficiency level. It is based on an a-priori item analysis and a statistical analysis. Results show that only the cognitive complexity and the format out of the different characteristics of PISA science items determined in our a-priori analysis have an explanatory power on an item's proficiency levels. The proficiency level cannot be explained by the dependence/independence of the information provided in the unit and/or item introduction and the competence. We conclude that in PISA, it appears possible to anticipate a high proficiency level, that is, students' low scores for items displaying a high cognitive complexity. In the case of a middle or low cognitive complexity level item, the cognitive complexity level is not sufficient to predict item difficulty. Other characteristics play a crucial role in item difficulty. We discuss anticipating the difficulties in assessment in a broader perspective.
Identifying predictors of physics item difficulty: A linear regression approach
NASA Astrophysics Data System (ADS)
Mesic, Vanes; Muratovic, Hasnija
2011-06-01
Large-scale assessments of student achievement in physics are often approached with an intention to discriminate students based on the attained level of their physics competencies. Therefore, for purposes of test design, it is important that items display an acceptable discriminatory behavior. To that end, it is recommended to avoid extraordinarily difficult and very easy items. Knowing the factors that influence physics item difficulty makes it possible to model the item difficulty even before the first pilot study is conducted. Thus, by identifying predictors of physics item difficulty, we can improve the test-design process. Furthermore, we get additional qualitative feedback regarding the basic aspects of student cognitive achievement in physics that are directly responsible for the obtained, quantitative test results. In this study, we conducted a secondary analysis of data that came from two large-scale assessments of student physics achievement at the end of compulsory education in Bosnia and Herzegovina. Foremost, we explored the concept of “physics competence” and performed a content analysis of 123 physics items that were included within the above-mentioned assessments. Thereafter, an item database was created. Items were described by variables which reflect some basic cognitive aspects of physics competence. For each of the assessments, Rasch item difficulties were calculated in separate analyses. In order to make the item difficulties from different assessments comparable, a virtual test equating procedure had to be implemented. Finally, a regression model of physics item difficulty was created.
It has been shown that 61.2% of item difficulty variance can be explained by factors which reflect the automaticity, complexity, and modality of the knowledge structure that is relevant for generating the most probable correct solution, as well as by the divergence of required thinking and interference effects between intuitive and formal physics knowledge structures. Identified predictors point out the fundamental cognitive dimensions of student physics achievement at the end of compulsory education in Bosnia and Herzegovina, whose level of development influenced the test results within the conducted assessments.
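The "proportion of item difficulty variance explained" reported in this abstract is the coefficient of determination (R²) of a regression model. The sketch below shows how that quantity is computed, assuming a single predictor fitted by ordinary least squares; the study itself used several predictors, and the complexity ratings and difficulty values here are hypothetical illustrations, not data from the study.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: a coded complexity rating per item and the item's
# Rasch difficulty in logits.
complexity = [1, 2, 3, 4, 5]
difficulty = [-1.2, -0.4, 0.1, 0.3, 1.4]

slope, intercept = fit_line(complexity, difficulty)
print("slope:", slope, "intercept:", intercept)
print("R^2:", r_squared(complexity, difficulty, slope, intercept))
```

With several predictors the same ratio of residual to total sum of squares applies; only the fitting step changes.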
Is the Factor Observed in Investigations on the Item-Position Effect Actually the Difficulty Factor?
ERIC Educational Resources Information Center
Schweizer, Karl; Troche, Stefan
2018-01-01
In confirmatory factor analysis quite similar models of measurement serve the detection of the difficulty factor and the factor due to the item-position effect. The item-position effect refers to the increasing dependency among the responses to successively presented items of a test whereas the difficulty factor is ascribed to the wide range of…
Item analysis of examinations in the Faculty of Medicine of Tunis.
Hermi, Amene; Achour, Wafa
2016-04-01
Introduction: Item analysis is the process of collecting, summarizing and using information from students' responses to assess test items' quality. This study used this approach to evaluate the quality of items and examinations given in the Faculty of Medicine of Tunis (FMT). Methods: This study concerned the examinations of 2012-2013 (principal session). It analyzed 3138 items from 66 examinations, of which 46 were multidisciplinary (187 disciplines). A total of 2515 students took the examinations. The "AnItem.xls" file was used for the analysis, which focused on difficulty, discrimination and internal consistency. Results: Mean difficulty for all examinations was optimum (mean difficulty index: 0.59). The majority of items (89.17%) were either easy or of acceptable difficulty. Mean discrimination for all examinations was moderate (mean item discrimination coefficient: 0.28), with poor discrimination in 23.62% of items. Maximal discrimination occurred in disciplines with a difficulty index between 0.4 and 0.6. "Ideal" items represented 27.02%. Mean internal consistency for all examinations was acceptable (Cronbach's alpha: 0.79). Disciplines with unacceptable internal consistency (68.45%) each contained at most 33 items and showed a positive correlation between their alpha and their number of questions. Distributions were mostly platykurtic (72.73%) and negatively asymmetric (89.39%). The first year of studies had the best parameters. Conclusion: Our examinations had an acceptable internal consistency and a good level of difficulty and discrimination. They tended to be easy and discriminated mainly among students of medium level. Item analysis is useful as a guide to item writers to improve the overall quality of questions in the future.
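The three quantities this abstract reports (difficulty index, item discrimination, Cronbach's alpha) can be sketched in a few lines of classical item analysis. The sketch makes assumptions the abstract does not confirm: items are scored 0/1, the difficulty index is the proportion of correct answers, and discrimination is taken as the correlation between an item and the total of the remaining items. The score matrix and function names are hypothetical, and this is not the "AnItem.xls" procedure itself.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def difficulty_index(scores):
    """Proportion of students answering each item correctly."""
    n_items = len(scores[0])
    return [mean([row[j] for row in scores]) for j in range(n_items)]

def discrimination(scores):
    """Correlation of each item with the total score on the other items."""
    n_items = len(scores[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in scores]
        rest = [sum(row) - row[j] for row in scores]
        result.append(pearson(item, rest))
    return result

def cronbach_alpha(scores):
    """Cronbach's alpha from item variances and total-score variance."""
    k = len(scores[0])
    item_vars = [sample_variance([row[j] for row in scores]) for j in range(k)]
    total_var = sample_variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)

# Hypothetical score matrix: rows are students, columns are items (1 = correct).
SCORES = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

print("difficulty:", difficulty_index(SCORES))
print("discrimination:", discrimination(SCORES))
print("alpha:", cronbach_alpha(SCORES))
```

Note that a higher difficulty index means an easier item under this convention, which is why the abstract can describe a mean index of 0.59 as tending toward easy items.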
Dikken, Jeroen; Hoogerduijn, Jita G; Kruitwagen, Cas; Schuurmans, Marieke J
2016-11-01
To assess the content validity and psychometric characteristics of the Knowledge about Older Patients Quiz (KOP-Q), which measures nurses' knowledge regarding older hospitalized adults and their certainty regarding this knowledge. Cross-sectional. Content validity: general hospitals. Psychometric characteristics: nursing school and general hospitals in the Netherlands. Content validity: 12 nurse specialists in geriatrics. Psychometric characteristics: 107 first-year and 78 final-year bachelor of nursing students, 148 registered nurses, and 20 nurse specialists in geriatrics. Content validity: The nurse specialists rated each item of the initial KOP-Q (52 items) on relevance. Ratings were used to calculate Item-Content Validity Index and average Scale-Content Validity Index (S-CVI/ave) scores. Items with insufficient content validity were removed. Psychometric characteristics: Ratings of students, nurses, and nurse specialists were used to test for differential item functioning (DIF) and unidimensionality before item characteristics (discrimination and difficulty) were examined using Item Response Theory. Finally, norm references were calculated and nomological validity was assessed. Content validity: Forty-three items remained after assessing content validity (S-CVI/ave = 0.90). Psychometric characteristics: Of the 43 items, two demonstrating ceiling effects and 11 distorting ability estimates (DIF) were subsequently excluded. Item characteristics were assessed for the remaining 30 items, all of which demonstrated good discrimination and difficulty parameters. Knowledge was positively correlated with certainty about this knowledge. The final 30-item KOP-Q is a valid, psychometrically sound, comprehensive instrument that can be used to assess the knowledge of nursing students, hospital nurses, and nurse specialists in geriatrics regarding older hospitalized adults.
It can identify knowledge and certainty deficits for research purposes or serve as a tool in educational or quality improvement programs. © 2016, Copyright the Authors Journal compilation © 2016, The American Geriatrics Society.
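The content validity indices named in this abstract follow a simple calculation, sketched here under assumptions the abstract does not state: experts rate relevance on a 4-point scale, a rating of 3 or 4 counts as "relevant", the Item-Content Validity Index (I-CVI) is the proportion of experts rating an item relevant, and S-CVI/ave is the mean of the I-CVIs. The ratings and function names are hypothetical.

```python
def item_cvi(ratings, relevant_from=3):
    """I-CVI: share of experts whose rating is at or above relevant_from
    (assumed 4-point relevance scale, 3 and 4 counted as relevant)."""
    return sum(1 for r in ratings if r >= relevant_from) / len(ratings)

def scale_cvi_ave(rating_matrix, relevant_from=3):
    """S-CVI/ave: mean of the I-CVI values across all items."""
    cvis = [item_cvi(ratings, relevant_from) for ratings in rating_matrix]
    return sum(cvis) / len(cvis)

# Hypothetical ratings: one row per item, one rating per expert.
RATINGS = [
    [4, 4, 3, 4],  # all four experts rate the item relevant
    [4, 3, 2, 4],  # three of four experts rate the item relevant
    [2, 1, 3, 2],  # one of four experts rates the item relevant
]

for number, ratings in enumerate(RATINGS, start=1):
    print("item", number, "I-CVI:", item_cvi(ratings))
print("S-CVI/ave:", scale_cvi_ave(RATINGS))
```

Items whose I-CVI falls below a chosen cutoff would be candidates for removal; the cutoff used in the study is not given in the abstract.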
ERIC Educational Resources Information Center
Çokluk, Ömay; Gül, Emrah; Dogan-Gül, Çilem
2016-01-01
The study aims to examine whether differential item functioning is displayed in three different test forms that have item orders of random and sequential versions (easy-to-hard and hard-to-easy), based on Classical Test Theory (CTT) and Item Response Theory (IRT) methods and bearing item difficulty levels in mind. In the correlational research, the…
ERIC Educational Resources Information Center
Kim, Sooyeon; Livingston, Samuel A.
2017-01-01
The purpose of this simulation study was to assess the accuracy of a classical test theory (CTT)-based procedure for estimating the alternate-forms reliability of scores on a multistage test (MST) having 3 stages. We generated item difficulty and discrimination parameters for 10 parallel, nonoverlapping forms of the complete 3-stage test and…
ERIC Educational Resources Information Center
Scheuneman, Janice Dowd; Gerritz, Kalle
1990-01-01
Differential item functioning (DIF) methodology for revealing sources of item difficulty and performance characteristics of different groups was explored. A total of 150 Scholastic Aptitude Test items and 132 Graduate Record Examination general test items were analyzed. DIF was evaluated for males and females and Blacks and Whites. (SLD)
Item Structural Properties as Predictors of Item Difficulty and Item Association.
ERIC Educational Resources Information Center
Solano-Flores, Guillermo
1993-01-01
Studied the ability of logical test design (LTD) to predict student performance in reading Roman numerals for 211 sixth graders in Mexico City tested on Roman numeral items varying on LTD-related and non-LTD-related variables. The LTD-related variable item iterativity was found to be the best predictor of item difficulty. (SLD)
Predicting Item Difficulty in a Reading Comprehension Test with an Artificial Neural Network.
ERIC Educational Resources Information Center
Perkins, Kyle; And Others
1995-01-01
This article reports the results of using a three-layer back propagation artificial neural network to predict item difficulty in a reading comprehension test. Three classes of variables were examined: text structure, propositional analysis, and cognitive demand. Results demonstrate that the networks can consistently predict item difficulty. (JL)
Vaughn, Kalif E; Rawson, Katherine A; Pyc, Mary A
2013-12-01
A wealth of previous research has established that retrieval practice promotes memory, particularly when retrieval is successful. Although successful retrieval promotes memory, it remains unclear whether successful retrieval promotes memory equally well for items of varying difficulty. Will easy items still outperform difficult items on a final test if all items have been correctly recalled equal numbers of times during practice? In two experiments, normatively difficult and easy Lithuanian-English word pairs were learned via test-restudy practice until each item had been correctly recalled a preassigned number of times (from 1 to 11 correct recalls). Despite equating the numbers of successful recalls during practice, performance on a delayed final cued-recall test was lower for difficult than for easy items. Experiment 2 was designed to diagnose whether the disadvantage for difficult items was due to deficits in cue memory, target memory, and/or associative memory. The results revealed a disadvantage for the difficult versus the easy items only on the associative recognition test, with no differences on cue recognition, and even an advantage on target recognition. Although successful retrieval enhanced memory for both difficult and easy items, equating retrieval success during practice did not eliminate normative item difficulty differences.
The Effect of the Position of an Item within a Test on the Item Difficulty Value.
ERIC Educational Resources Information Center
Rubin, Lois S.; Mott, David E. W.
An investigation of the effect on the difficulty value of an item due to position placement within a test was made. Using a 60-item operational test comprised of 5 subtests, 60 items were placed as experimental items on a number of spiralled test forms in three different positions (first, middle, last) within the subtest composed of like items.…
Using the Nudge and Shove Methods to Adjust Item Difficulty Values.
Royal, Kenneth D
2015-01-01
In any examination, it is important that a sufficient mix of items with varying degrees of difficulty be present to produce desirable psychometric properties and increase instructors' ability to make appropriate and accurate inferences about what a student knows and/or can do. The purpose of this "teaching tip" is to demonstrate how examination items can be affected by the quality of distractors, and to present a simple method for adjusting items to meet difficulty specifications.
Component Identification and Item Difficulty of Raven's Matrices Items.
ERIC Educational Resources Information Center
Green, Kathy E.; Kluever, Raymond C.
Item components that might contribute to the difficulty of items on the Raven Colored Progressive Matrices (CPM) and the Standard Progressive Matrices (SPM) were studied. Subjects providing responses to CPM items were 269 children aged 2 years 9 months to 11 years 8 months, most of whom were referred for testing as potentially gifted. A second…
Yao, Shih-Ying; Bull, Rebecca; Khng, Kiat Hui; Rahim, Anisa
2018-01-01
Understanding a child's ability to decode emotion expressions is important to allow early interventions for potential difficulties in social and emotional functioning. This study applied the Rasch model to investigate the psychometric properties of the NEPSY-II Affect Recognition subtest, a U.S. normed measure for 3-16 year olds which assesses the ability to recognize facial expressions of emotion. Data were collected from 1222 children attending preschools in Singapore. We first performed the Rasch analysis with the raw item data, and examined the technical qualities and difficulty pattern of the studied items. We subsequently investigated the relation of the estimated affect recognition ability from the Rasch analysis to a teacher-reported measure of a child's behaviors, emotions, and relationships. Potential gender differences were also examined. The Rasch model fits our data well. Also, the NEPSY-II Affect Recognition subtest was found to have reasonable technical qualities, expected item difficulty pattern, and desired association with the external measure of children's behaviors, emotions, and relationships for both boys and girls. Overall, findings from this study suggest that the NEPSY-II Affect Recognition subtest is a promising measure of young children's affect recognition ability. Suggestions for future test improvement and research were discussed.
Cross-cultural comparisons of the Mini-mental State Examination between Japanese and U.S. cohorts
Meguro, Kenichi; Ishii, Hiroshi; Yamaguchi, Satoshi; Saxton, Judith A.; Ganguli, Mary
2009-01-01
Background: The Mini-mental State Examination (MMSE) is widely used in Japan and the U.S.A. for cognitive screening in the clinical setting and in epidemiological studies. A previous Japanese community study reported distributions of the MMSE total score very similar to that of the U.S.A. Methods: Data were obtained from the Monongahela Valley Independent Elder's Study (MoVIES), a representative sample of community-dwelling elderly people aged 65 and older living near Pittsburgh, U.S.A., and from the Tajiri Project, with similar aims in Tajiri, Japan. We examined item-by-item distributions of the MMSE between the two cohorts, comparing (1) percentage of correct answers for each item within each cohort, and (2) relative difficulty of each item measured by Item Characteristic Curve analysis (ICC), which estimates log odds of obtaining a correct answer adjusted for the remaining MMSE items, demographic variables (age, gender, education) and interactions of demographic variables and cohort. Results: Median MMSE scores were very similar between the two samples within the same education groups. However, the relative difficulty of each item differed substantially between the two cohorts. Specifically, recall and auditory comprehension were easier for the Tajiri group, but reading comprehension and sentence construction were easier for the MoVIES group. Conclusions: Our results reaffirm the importance of validation and examination of thresholds in each cohort to be studied when a common instrument is used as a dementia screening tool or for defining cognitive impairment. PMID:18925977
When Listening Is Better Than Reading: Performance Gains on Cardiac Auscultation Test Questions.
Short, Kathleen; Bucak, S Deniz; Rosenthal, Francine; Raymond, Mark R
2018-05-01
In 2007, the United States Medical Licensing Examination embedded multimedia simulations of heart sounds into multiple-choice questions. This study investigated changes in item difficulty as determined by examinee performance over time. The data reflect outcomes obtained following initial use of multimedia items from 2007 through 2012, after which an interface change occurred. A total of 233,157 examinees responded to 1,306 cardiology test items over the six-year period; 138 items included multimedia simulations of heart sounds, while 1,168 text-based items without multimedia served as controls. The authors compared changes in difficulty of multimedia items over time with changes in difficulty of text-based cardiology items over time. Further, they compared changes in item difficulty for both groups of items between graduates of Liaison Committee on Medical Education (LCME)-accredited and non-LCME-accredited (i.e., international) medical schools. Examinee performance on cardiology test items with multimedia heart sounds improved by 12.4% over the six-year period, while performance on text-based cardiology items improved by approximately 1.4%. These results were similar for graduates of LCME-accredited and non-LCME-accredited medical schools. Examinees' ability to interpret auscultation findings in test items that include multimedia presentations increased from 2007 to 2012.
ERIC Educational Resources Information Center
Vorstenbosch, Marc A. T. M.; Klaassen, Tim P. F. M.; Kooloos, Jan G. M.; Bolhuis, Sanneke M.; Laan, Roland F. J. M.
2013-01-01
Anatomists often use images in assessments and examinations. This study aims to investigate the influence of different types of images on item difficulty and item discrimination in written assessments. A total of 210 of 460 students volunteered for an extra assessment in a gross anatomy course. This assessment contained 39 test items grouped in…
Outcome-based self-assessment on a team-teaching subject in the medical school
Cho, Sa Sun
2014-01-01
We investigated why students received worse grades in gross anatomy and how the teaching method could be improved, since gaps between teaching and learning had emerged under the recently changed integrated curriculum. General characteristics of students and exploratory factors used to test validity were compared between 2011 and 2012. Students were asked to complete a short survey with a Likert scale. Although the percentage of acceptable items was similar between professors, professor C preferred questions with adequate item discrimination and inappropriate item difficulty, whereas professor Y preferred questions with adequate item discrimination and appropriate item difficulty, a statistically significant difference (P<0.01). The survey revealed that 26.5% of all students gave up on professor Y's gross anatomy exam, irrespective of year. These results suggest that students' academic achievement was affected by the corrected item difficulty rather than by item discrimination. Therefore, professors in a team-teaching subject should reach a consensus on item difficulty, along with proper teaching methods. PMID:25548724
Sim, Si-Mui; Rasiah, Raja Isaiah
2006-02-01
This paper reports the relationship between the difficulty level and the discrimination power of true/false-type multiple-choice questions (MCQs) in a multidisciplinary paper for the para-clinical year of an undergraduate medical programme. MCQ items in papers taken from Year II Parts A, B and C examinations for Sessions 2001/02, and Part B examinations for 2002/03 and 2003/04, were analysed to obtain their difficulty indices and discrimination indices. Each paper consisted of 250 true/false items (50 questions of 5 items each) on topics drawn from different disciplines. The questions were first constructed and vetted by the individual departments before being submitted to a central committee, where the final selection of the MCQs was made, based purely on the academic judgement of the committee. There was a wide distribution of item difficulty indices in all the MCQ papers analysed. Furthermore, the relationship between the difficulty index (P) and discrimination index (D) of the MCQ items in a paper was not linear, but more dome-shaped. Maximal discrimination (D = 51% to 71%) occurred with moderately easy/difficult items (P = 40% to 74%). On average, about 38% of the MCQ items in each paper were "very easy" (P ≥ 75%), while about 9% were "very difficult" (P < 25%). About two-thirds of these very easy/difficult items had "very poor" or even negative discrimination (D ≤ 20%). MCQ items that demonstrate good discriminating potential tend to be moderately difficult items, and the moderately-to-very difficult items are more likely to show negative discrimination. There is a need to evaluate the effectiveness of our MCQ items.
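As a rough illustration of the two indices discussed in this abstract, the sketch below computes a difficulty index P (percentage of examinees answering an item correctly) and a discrimination index D (difference in percent correct between high and low scorers). The abstract does not say how D was calculated, so the upper/lower-group method, the 27% group fraction, and the function name `item_indices` are assumptions made for illustration only.

```python
from typing import List, Tuple


def item_indices(responses: List[List[int]], item: int,
                 group_frac: float = 0.27) -> Tuple[float, float]:
    """Return (P, D) in percent for one item.

    responses[e][i] is 1 if examinee e answered item i correctly, else 0.
    group_frac is the share of examinees in each of the upper and lower
    groups (an assumed convention, not taken from the abstract).
    """
    n = len(responses)
    # Difficulty index P: percentage of all examinees answering correctly.
    p = 100.0 * sum(r[item] for r in responses) / n

    # Rank examinees by total score, highest first, and take both tails.
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(group_frac * n))
    upper, lower = ranked[:k], ranked[-k:]

    # Discrimination index D: percent correct in the upper group
    # minus percent correct in the lower group.
    d = 100.0 * (sum(r[item] for r in upper) - sum(r[item] for r in lower)) / k
    return p, d


if __name__ == "__main__":
    data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    for i in range(3):
        print(i, item_indices(data, i, group_frac=0.5))
```

Under this definition an item that everyone answers correctly has P = 100% and D = 0%, which is consistent with the dome-shaped relationship the abstract reports.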
Skilled but Unaware of It: CAT Undermines a Test Taker's Metacognitive Competence
ERIC Educational Resources Information Center
Ortner, Tuulia M.; Weisskopf, Eva; Gerstenberg, Friederike X. R.
2013-01-01
We investigated students' metacognitive experiences with regard to feelings of difficulty (FD), feelings of satisfaction (FS), and estimate of effort (EE), employing either computerized adaptive testing (CAT) or computerized fixed item testing (FIT). In an experimental approach, 174 students in grades 10 to 13 were tested either with a CAT or a…
Application of Rasch Measurement to a Measure of Musical Performance.
ERIC Educational Resources Information Center
Haley, Kathleen A.
1999-01-01
Describes the Rasch calibration of a portion of the Watkins Farnum Performance Scale (J. Watkins and S. Farnum, 1954), a test of instructional music performance, for 218 sixth graders. Results show how Rasch scaling allows item difficulties to be estimated, the test to be administered more efficiently, and diagnostic information to be obtained.…
ERIC Educational Resources Information Center
Rahman, Taslima; Mislevy, Robert J.
2017-01-01
To demonstrate how methodologies for assessing reading comprehension can grow out of views of the construct suggested in the reading research literature, we constructed tasks and carried out psychometric analyses that were framed in accordance with 2 leading reading models. In estimating item difficulty and subsequently, examinee proficiency, an…
Mamikonian-Zarpas, Ani; Laganá, Luciana
2016-01-01
Functional status is often defined by cumulative scores across indices of independence in performing basic and instrumental activities of daily living (ADL/IADL), but little is known about the unique relationship of each daily activity item with the fall outcome. The purpose of this retrospective study was to examine the level of relative risk for a future fall associated with difficulty with performing various tasks of normal daily functioning among older adults who had fallen at least once in the past 12 months. The sample was comprised of community-dwelling individuals 70 years and older from the 1984–1990 Longitudinal Study of Aging by Kovar, Fitti, and Chyba (1992). Risk analysis was performed on individual items quantifying 6 ADLs and 7 IADLs, as well as 10 items related to mobility limitations. Within a subsample of 1,675 older adults with a history of at least one fall within the past year, the responses of individuals who reported multiple falls were compared to the responses of participants who had a single fall and reported 1) difficulty with walking and/or balance (FRAIL group, n = 413) vs. 2) no difficulty with walking or dizziness (NDW+ND group, n = 415). The items that had the strongest relationships and highest risk ratios for the FRAIL group (which had the highest probabilities for a future fall) included difficulty with: eating (73%); managing money (70%); biting or chewing food (66%); walking a quarter of a mile (65%); using fingers to grasp (65%); and dressing without help (65%). For the NDW+ND group, the most noteworthy items included difficulty with: bathing or showering (79%); managing money (77%); shopping for personal items (75%); walking up 10 steps without rest (72%); difficulty with walking a quarter of a mile (72%); and stooping/crouching/kneeling (70%). 
These findings suggest that individual items quantifying specific ADLs and IADLs have substantive relationships with the fall outcome among older adults who have difficulty with walking and balance, as well as among older individuals without dizziness or difficulty with walking. Furthermore, the examination of the relationships between items that are related to more challenging activities and the fall outcome revealed that higher functioning older adults who reported difficulty with the 6 items that yielded the highest risk ratios may also be at elevated risk for a fall. PMID:27200366
NASA Astrophysics Data System (ADS)
Rakkapao, Suttida; Prasitpong, Singha; Arayathanitkul, Kwan
2016-12-01
This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item response theory (IRT). The difficulty, discrimination, and guessing parameters of the TUV items were fit with the three-parameter logistic model of IRT, using the PARSCALE program. The TUV ability is an ability parameter, here estimated assuming unidimensionality and local independence. Moreover, all distractors of the TUV were analyzed from item response curves (IRC) that represent simplified IRT. Data were gathered on 2392 science and engineering freshmen, from three universities in Thailand. The results revealed IRT analysis to be useful in assessing the test since its item parameters are independent of the ability parameters. The IRT framework reveals item-level information, and indicates appropriate ability ranges for the test. Moreover, the IRC analysis can be used to assess the effectiveness of the test's distractors. Both IRT and IRC approaches reveal test characteristics beyond those revealed by the classical analysis methods of tests. Test developers can apply these methods to diagnose and evaluate the features of items at various ability levels of test takers.
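For readers unfamiliar with the three-parameter logistic model named in this abstract, a minimal sketch of its item response function follows. The parameter names (`a` for discrimination, `b` for difficulty, `c` for guessing) follow common IRT usage rather than anything stated in the abstract, and the optional scaling constant used in some parameterizations is omitted; this is an assumed form, not the authors' code.

```python
import math


def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct response under an assumed 3PL form.

    theta: examinee ability
    a: item discrimination, b: item difficulty, c: guessing (lower asymptote)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # Tabulate the curve for one hypothetical item.
    for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(theta, round(p_correct_3pl(theta, a=1.2, b=0.0, c=0.2), 3))
```

Setting `c = 0` gives the two-parameter logistic model, and additionally fixing `a = 1` gives the one-parameter (Rasch-type) form that appears in several other entries in this list.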
Koh, Bongyeun; Hong, Sunggi; Kim, Soon-Sim; Hyun, Jin-Sook; Baek, Milye; Moon, Jundong; Kwon, Hayran; Kim, Gyoungyong; Min, Seonggi; Kang, Gu-Hyun
2016-01-01
The goal of this study was to characterize the difficulty index of the items in the skills test components of the class I and II Korean emergency medical technician licensing examination (KEMTLE), which requires examinees to select items randomly. The results of 1,309 class I KEMTLE examinations and 1,801 class II KEMTLE examinations in 2013 were subjected to analysis. Items from the basic and advanced skills test sections of the KEMTLE were compared to determine whether some were significantly more difficult than others. In the class I KEMTLE, all 4 of the items on the basic skills test showed significant variation in difficulty index (P<0.01), as well as 4 of the 5 items on the advanced skills test (P<0.05). In the class II KEMTLE, 4 of the 5 items on the basic skills test showed significantly different difficulty index (P<0.01), as well as all 3 of the advanced skills test items (P<0.01). In the skills test components of the class I and II KEMTLE, the procedure in which examinees randomly select questions should be revised to require examinees to respond to a set of fixed items in order to improve the reliability of the national licensing examination.
ERIC Educational Resources Information Center
Kahraman, Nilufer; De Champlain, Andre; Raymond, Mark
2012-01-01
Item-level information, such as difficulty and discrimination, is invaluable to the test assembly, equating, and scoring practices. Estimating these parameters within the context of large-scale performance assessments is often hindered by the use of unbalanced designs for assigning examinees to tasks and raters because such designs result in very…
Factors Affecting Item Difficulty in English Listening Comprehension Tests
ERIC Educational Resources Information Center
Sung, Pei-Ju; Lin, Su-Wei; Hung, Pi-Hsia
2015-01-01
Task difficulty is a critical issue affecting test developers. Controlling or balancing the item difficulty of an assessment improves its validity and discrimination. Test developers construct tests from the cognitive perspective, by making the test constructing process more scientific and efficient; thus, the scores obtained more precisely…
HIV/AIDS knowledge among men who have sex with men: applying the item response theory.
Gomes, Raquel Regina de Freitas Magalhães; Batista, José Rodrigues; Ceccato, Maria das Graças Braga; Kerr, Lígia Regina Franco Sansigolo; Guimarães, Mark Drew Crosland
2014-04-01
To evaluate the level of HIV/AIDS knowledge among men who have sex with men in Brazil using the latent trait model estimated by Item Response Theory. Multicenter, cross-sectional study, carried out in ten Brazilian cities between 2008 and 2009. Adult men who have sex with men were recruited (n = 3,746) through Respondent Driven Sampling. HIV/AIDS knowledge was ascertained through ten statements by face-to-face interview and latent scores were obtained through two-parameter logistic modeling (difficulty and discrimination) using Item Response Theory. Differential item functioning was used to examine each item characteristic curve by age and schooling. Overall, the HIV/AIDS knowledge scores using Item Response Theory did not exceed 6.0 (scale 0-10), with mean and median values of 5.0 (SD = 0.9) and 5.3, respectively, with 40.7% of the sample with knowledge levels below the average. Some beliefs still exist in this population regarding the transmission of the virus by insect bites, by using public restrooms, and by sharing utensils during meals. With regard to the difficulty and discrimination parameters, eight items were located below the mean of the scale and were considered very easy, and four items presented very low discrimination parameter (< 0.34). The absence of difficult items contributed to the inaccuracy of the measurement of knowledge among those with median level and above. Item Response Theory analysis, which focuses on the individual properties of each item, allows measures to be obtained that do not vary or depend on the questionnaire, which provides better ascertainment and accuracy of knowledge scores. Valid and reliable scales are essential for monitoring HIV/AIDS knowledge among the men who have sex with men population over time and in different geographic regions, and this psychometric model brings this advantage.
Comparison of university students' understanding of graphs in different contexts
NASA Astrophysics Data System (ADS)
Planinic, Maja; Ivanjek, Lana; Susac, Ana; Milin-Sipus, Zeljka
2013-12-01
This study investigates university students’ understanding of graphs in three different domains: mathematics, physics (kinematics), and contexts other than physics. Eight sets of parallel mathematics, physics, and other context questions about graphs were developed. A test consisting of these eight sets of questions (24 questions in all) was administered to 385 first year students at University of Zagreb who were either prospective physics or mathematics teachers or prospective physicists or mathematicians. Rasch analysis of data was conducted and linear measures for item difficulties were obtained. Average difficulties of items in three domains (mathematics, physics, and other contexts) and over two concepts (graph slope, area under the graph) were computed and compared. Analysis suggests that the variation of average difficulty among the three domains is much smaller for the concept of graph slope than for the concept of area under the graph. Most of the slope items are very close in difficulty, suggesting that students who have developed sufficient understanding of graph slope in mathematics are generally able to transfer it almost equally successfully to other contexts. A large difference was found between the difficulty of the concept of area under the graph in physics and other contexts on one side and mathematics on the other side. Comparison of average difficulty of the three domains suggests that mathematics without context is the easiest domain for students. Adding either physics or other context to mathematical items generally seems to increase item difficulty. No significant difference was found between the average item difficulty in physics and contexts other than physics, suggesting that physics (kinematics) remains a difficult context for most students despite the received instruction on kinematics in high school.
An Investigation of the Impact of Guessing on Coefficient α and Reliability
2014-01-01
Guessing is known to influence the test reliability of multiple-choice tests. Although there are many studies that have examined the impact of guessing, they used rather restrictive assumptions (e.g., parallel test assumptions, homogeneous inter-item correlations, homogeneous item difficulty, and homogeneous guessing levels across items) to evaluate the relation between guessing and test reliability. Based on the item response theory (IRT) framework, this study investigated the extent of the impact of guessing on reliability under more realistic conditions where item difficulty, item discrimination, and guessing levels actually vary across items with three different test lengths (TL). By accommodating multiple item characteristics simultaneously, this study also focused on examining interaction effects between guessing and other variables entered in the simulation to be more realistic. The simulation of the more realistic conditions and calculations of reliability and classical test theory (CTT) item statistics were facilitated by expressing CTT item statistics, coefficient α, and reliability in terms of IRT model parameters. In addition to the general negative impact of guessing on reliability, results showed interaction effects between TL and guessing and between guessing and test difficulty.
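The study above expresses coefficient α in terms of IRT model parameters; that derivation is not reproduced here. As a simpler point of reference, the sketch below computes the ordinary sample version of coefficient α from an examinee-by-item score matrix. The function name and the data layout are illustrative assumptions.

```python
from statistics import variance
from typing import List


def coefficient_alpha(scores: List[List[float]]) -> float:
    """Sample coefficient alpha for an examinee-by-item score matrix.

    scores[e][i] is the score of examinee e on item i.
    """
    k = len(scores[0])
    # Variance of each item across examinees.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of the total (sum) score across examinees.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


if __name__ == "__main__":
    data = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [1, 0, 1], [0, 1, 0]]
    print(coefficient_alpha(data))
```

Random guessing adds noise to individual item scores without adding true-score covariance between items, which is one informal way to see why it tends to lower α.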
A Study of Inference in Standardized Reading Test Items and Its Relationship to Difficulty.
ERIC Educational Resources Information Center
Marzano, Robert J.
To study the relationship between inferences made on standardized reading tests and item difficulty, 50 items on the reading comprehension section of the Metropolitan Achievement Test were analyzed independently in this study by two raters using four general categories of inferences: (1) reference inferences, (2) between proposition inferences,…
The Definition of Difficulty and Discrimination for Multidimensional Item Response Theory Models.
ERIC Educational Resources Information Center
Reckase, Mark D.; McKinley, Robert L.
A study was undertaken to develop guidelines for the interpretation of the parameters of three multidimensional item response theory models and to determine the relationship between the parameters and traditional concepts of item difficulty and discrimination. The three models considered were multidimensional extensions of the one-, two-, and…
Iwashita, Yukio; Hibi, Taizo; Ohyama, Tetsuji; Honda, Goro; Yoshida, Masahiro; Miura, Fumihiko; Takada, Tadahiro; Han, Ho-Seong; Hwang, Tsann-Long; Shinya, Satoshi; Suzuki, Kenji; Umezawa, Akiko; Yoon, Yoo-Seok; Choi, In-Seok; Huang, Wayne Shih-Wei; Chen, Kuo-Hsin; Watanabe, Manabu; Abe, Yuta; Misawa, Takeyuki; Nagakawa, Yuichi; Yoon, Dong-Sup; Jang, Jin-Young; Yu, Hee Chul; Ahn, Keun Soo; Kim, Song Cheol; Song, In Sang; Kim, Ji Hoon; Yun, Sung Su; Choi, Seong Ho; Jan, Yi-Yin; Shan, Yan-Shen; Ker, Chen-Guo; Chan, De-Chuan; Wu, Cheng-Chung; Lee, King-Teh; Toyota, Naoyuki; Higuchi, Ryota; Nakamura, Yoshiharu; Mizuguchi, Yoshiaki; Takeda, Yutaka; Ito, Masahiro; Norimizu, Shinji; Yamada, Shigetoshi; Matsumura, Naoki; Shindoh, Junichi; Sunagawa, Hiroki; Gocho, Takeshi; Hasegawa, Hiroshi; Rikiyama, Toshiki; Sata, Naohiro; Kano, Nobuyasu; Kitano, Seigo; Tokumura, Hiromi; Yamashita, Yuichi; Watanabe, Goro; Nakagawa, Kunitoshi; Kimura, Taizo; Yamakawa, Tatsuo; Wakabayashi, Go; Mori, Rintaro; Endo, Itaru; Miyazaki, Masaru; Yamamoto, Masakazu
2017-04-01
We previously identified 25 intraoperative findings during laparoscopic cholecystectomy (LC) as potential indicators of surgical difficulty per nominal group technique. This study aimed to build a consensus among expert LC surgeons on the impact of each item on surgical difficulty. Surgeons from Japan, Korea, and Taiwan (n = 554) participated in a Delphi process and graded the 25 items on a seven-stage scale (range, 0-6). Consensus was defined as (1) the interquartile range (IQR) of overall responses ≤2 and (2) ≥66% of the responses concentrated within a median ± 1 after stratification by workplace and LC experience level. Response rates for the first and the second-round Delphi were 92.6% and 90.3%, respectively. Final consensus was reached for all the 25 items. 'Diffuse scarring in the Calot's triangle area' in the 'Factors related to inflammation of the gallbladder' category had the strongest impact on surgical difficulty (median, 5; IQR, 1). Surgeons agreed that the surgical difficulty increases as more fibrotic change and scarring develop. The median point for each item was set as the difficulty score. A Delphi consensus was reached among expert LC surgeons on the impact of intraoperative findings on surgical difficulty. © 2017 Japanese Society of Hepato-Biliary-Pancreatic Surgery.
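The two-part consensus rule described in this abstract can be sketched as a small check. The abstract does not specify the quartile method or whether the "median ± 1" band uses the overall or the within-stratum median; the code below assumes inclusive quartiles and the within-stratum median, and the names `has_consensus` and `grades_by_stratum` are hypothetical.

```python
from statistics import median, quantiles
from typing import Dict, List


def share_within_one_of_median(grades: List[int]) -> float:
    """Fraction of grades lying within +/- 1 of the group's own median."""
    m = median(grades)
    return sum(abs(g - m) <= 1 for g in grades) / len(grades)


def has_consensus(grades_by_stratum: Dict[str, List[int]]) -> bool:
    """Apply the two criteria to one item's grades (0-6 scale).

    (1) IQR of all responses <= 2.
    (2) In every stratum, >= 66% of responses within median +/- 1
        (stratum median assumed).
    """
    overall = [g for grades in grades_by_stratum.values() for g in grades]
    q1, _, q3 = quantiles(overall, n=4, method="inclusive")
    if q3 - q1 > 2:
        return False
    return all(share_within_one_of_median(grades) >= 0.66
               for grades in grades_by_stratum.values())


if __name__ == "__main__":
    item = {"university hospital": [5, 5, 4, 6, 5],
            "community hospital": [5, 4, 5, 5, 6]}
    print(has_consensus(item))
```

The stratum labels in the usage example are invented; the abstract only states that responses were stratified by workplace and LC experience level.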
Kim, Stella H; Strutt, Adriana M; Olabarrieta-Landa, Laiene; Lequerica, Anthony H; Rivera, Diego; De Los Reyes Aragon, Carlos Jose; Utria, Oscar; Arango-Lasprilla, Juan Carlos
2018-02-23
The Boston Naming Test (BNT) is a widely used measure of confrontation naming ability that has been criticized for its questionable construct validity for non-English speakers. This study investigated item difficulty and construct validity of the Spanish version of the BNT to assess cultural and linguistic impact on performance. Subjects were 1298 healthy Spanish speaking adults from Colombia. They were administered the 60- and 15-item Spanish version of the BNT. A Rasch analysis was computed to assess dimensionality, item hierarchy, targeting, reliability, and item fit. Both versions of the BNT satisfied requirements for unidimensionality. Although internal consistency was excellent for the 60-item BNT, order of difficulty did not increase consistently with item number and there were a number of items that did not fit the Rasch model. For the 15-item BNT, a total of 5 items changed position on the item hierarchy with 7 poor fitting items. Internal consistency was acceptable. Construct validity of the BNT remains a concern when it is administered to non-English speaking populations. Similar to previous findings, the order of item presentation did not correspond with increasing item difficulty, and both versions were inadequate at assessing high naming ability.
2016-01-01
Purpose: The goal of this study was to characterize the difficulty index of the items in the skills test components of the class I and II Korean emergency medical technician licensing examination (KEMTLE), which requires examinees to select items randomly. Methods: The results of 1,309 class I KEMTLE examinations and 1,801 class II KEMTLE examinations in 2013 were subjected to analysis. Items from the basic and advanced skills test sections of the KEMTLE were compared to determine whether some were significantly more difficult than others. Results: In the class I KEMTLE, all 4 of the items on the basic skills test showed significant variation in difficulty index (P<0.01), as well as 4 of the 5 items on the advanced skills test (P<0.05). In the class II KEMTLE, 4 of the 5 items on the basic skills test showed significantly different difficulty index (P<0.01), as well as all 3 of the advanced skills test items (P<0.01). Conclusion: In the skills test components of the class I and II KEMTLE, the procedure in which examinees randomly select questions should be revised to require examinees to respond to a set of fixed items in order to improve the reliability of the national licensing examination. PMID:26883810
Land, Stephanie R; Warren, Graham W; Crafts, Jennifer L; Hatsukami, Dorothy K; Ostroff, Jamie S; Willis, Gordon B; Chollette, Veronica Y; Mitchell, Sandra A; Folz, Jasmine N M; Gulley, James L; Szabo, Eva; Brandon, Thomas H; Duffy, Sonia A; Toll, Benjamin A
2016-06-01
To the authors' knowledge, there are currently no standardized measures of tobacco use and secondhand smoke exposure in patients diagnosed with cancer, and this gap hinders the conduct of studies examining the impact of tobacco on cancer treatment outcomes. The objective of the current study was to evaluate and refine questionnaire items proposed by an expert task force to assess tobacco use. Trained interviewers conducted cognitive testing with cancer patients aged ≥21 years with a history of tobacco use and a cancer diagnosis of any stage and organ site who were recruited at the National Institutes of Health Clinical Center in Bethesda, Maryland. Iterative rounds of testing and item modification were conducted to identify and resolve cognitive issues (comprehension, memory retrieval, decision/judgment, and response mapping) and instrument navigation issues until no items warranted further significant modification. Thirty participants (6 current cigarette smokers, 1 current cigar smoker, and 23 former cigarette smokers) were enrolled from September 2014 to February 2015. The majority of items functioned well. However, qualitative testing identified wording ambiguities related to cancer diagnosis and treatment trajectory, such as "treatment" and "surgery"; difficulties with lifetime recall; errors in estimating quantities; and difficulties with instrument navigation. Revisions to item wording, format, order, response options, and instructions resulted in a questionnaire that demonstrated navigational ease as well as good question comprehension and response accuracy. The Cancer Patient Tobacco Use Questionnaire (C-TUQ) can be used as a standardized item set to accelerate the investigation of tobacco use in the cancer setting. Cancer 2016;122:1728-34. © 2016 American Cancer Society.
ERIC Educational Resources Information Center
Andrich, David; Marais, Ida; Humphry, Stephen Mark
2016-01-01
Recent research has shown how the statistical bias in Rasch model difficulty estimates induced by guessing in multiple-choice items can be eliminated. Using vertical scaling of a high-profile national reading test, it is shown that the dominant effect of removing such bias is a nonlinear change in the unit of scale across the continuum. The…
Reliability of self-rated tinnitus distress and association with psychological symptom patterns.
Hiller, W; Goebel, G; Rief, W
1994-05-01
Psychological complaints were investigated in two samples of 60 and 138 in-patients suffering from chronic tinnitus. We administered the Tinnitus Questionnaire (TQ), a 52-item self-rating scale which differentiates between dimensions of emotional and cognitive distress, intrusiveness, auditory perceptual difficulties, sleep disturbances and somatic complaints. The test-retest reliability was .94 for the TQ global score and between .86 and .93 for subscales. Three independent analyses were conducted to estimate the split-half reliability (internal consistency), which was only slightly lower than the test-retest values for scales with a relatively small number of items. Reliability was sufficient also on the level of single items. Low correlations between the TQ and the Hopkins Symptom Checklist (SCL-90-R) indicate a distinct quality of tinnitus-related and general psychological disturbances.
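A minimal sketch of one common way to estimate split-half reliability follows. The abstract does not describe how the halves were formed or whether a length correction was applied, so the odd/even split and the Spearman-Brown step-up used here are assumptions, as are the function names.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def split_half_reliability(scores: List[List[float]]) -> float:
    """Odd/even split-half reliability with Spearman-Brown correction.

    scores[p][i] is the rating of person p on item i.
    """
    # Sum the 1st, 3rd, ... items and the 2nd, 4th, ... items per person.
    half_a = [sum(row[0::2]) for row in scores]
    half_b = [sum(row[1::2]) for row in scores]
    r = pearson(half_a, half_b)
    # Step the half-test correlation up to full test length.
    return 2.0 * r / (1.0 + r)


if __name__ == "__main__":
    ratings = [[2, 1, 2, 2], [0, 1, 0, 0], [1, 1, 2, 1], [2, 2, 1, 2]]
    print(split_half_reliability(ratings))
```

With few items per scale, each half is short and the half-test correlation is noisier, which fits the abstract's remark about scales with a relatively small number of items.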
What Does a Verbal Test Measure? A New Approach to Understanding Sources of Item Difficulty.
ERIC Educational Resources Information Center
Berk, Eric J. Vanden; Lohman, David F.; Cassata, Jennifer Coyne
Assessing the construct relevance of mental test results continues to present many challenges, and it has proven to be particularly difficult to assess the construct relevance of verbal items. This study was conducted to gain a better understanding of the conceptual sources of verbal item difficulty using a unique approach that integrates…
On Maximizing Item Information and Matching Difficulty with Ability.
ERIC Educational Resources Information Center
Bickel, Peter; Buyske, Steven; Chang, Huahua; Ying, Zhiliang
2001-01-01
Examined the assumption that matching difficulty levels of test items with an examinee's ability makes a test more efficient and challenged this assumption through a class of one-parameter item response theory models. Found the validity of the fundamental assumption to be closely related to the van Zwet tail ordering of symmetric distributions (W.…
Detecting a Gender-Related Differential Item Functioning Using Transformed Item Difficulty
ERIC Educational Resources Information Center
Abedalaziz, Nabeel; Leng, Chin Hai; Alahmadi, Ahlam
2014-01-01
The purpose of the study was to examine gender differences in performance on a multiple-choice mathematical ability test, administered within the context of a high school graduation test that was designed to match the eleventh grade curriculum. The transformed item difficulty (TID) was used to detect gender-related DIF. A random sample of 1400 eleventh…
ERIC Educational Resources Information Center
Solano-Flores, Guillermo; Wang, Chao; Shade, Chelsey
2016-01-01
We examined multimodality (the representation of information in multiple semiotic modes) in the context of international test comparisons. Using Program of International Student Assessment (PISA)-2009 data, we examined the correlation of the difficulty of science items and the complexity of their illustrations. We observed statistically…
ERIC Educational Resources Information Center
Kramer, Gene A.; Smith, Richard M.
2001-01-01
Examined the role that gender differences play in the determination of the components influencing the difficulty of spatial ability items. Results for 2,245 examinees taking a spatial ability test that is part of the Dental School Admission Battery show that component difficulties show little variation across gender. (SLD)
Binks-Cantrell, Emily; Joshi, R Malatesha; Washburn, Erin K
2012-10-01
Recent national reports have stressed the importance of teacher knowledge in teaching reading. However, in the past, teachers' knowledge of language and literacy constructs has typically been assessed with instruments that are not fully tested for validity. In the present study, an instrument was developed, and its reliability, item difficulty, and item discrimination were computed and examined to identify model fit by applying exploratory factor analysis. Such analyses showed that the instrument demonstrated adequate estimates of reliability in assessing teachers' knowledge of language constructs. The implications for professional development of in-service teachers as well as preservice teacher education are also discussed.
Development and initial evaluation of the SCI-FI/AT
Jette, Alan M.; Slavin, Mary D.; Ni, Pengsheng; Kisala, Pamela A.; Tulsky, David S.; Heinemann, Allen W.; Charlifue, Susie; Tate, Denise G.; Fyffe, Denise; Morse, Leslie; Marino, Ralph; Smith, Ian; Williams, Steve
2015-01-01
Objectives: To describe the domain structure and calibration of the Spinal Cord Injury Functional Index for samples using Assistive Technology (SCI-FI/AT) and report the initial psychometric properties of each domain. Design: Cross sectional survey followed by computerized adaptive test (CAT) simulations. Setting: Inpatient and community settings. Participants: A sample of 460 adults with traumatic spinal cord injury (SCI) stratified by level of injury, completeness of injury, and time since injury. Interventions: None. Main outcome measure: SCI-FI/AT. Results: Confirmatory factor analysis (CFA) and Item response theory (IRT) analyses identified 4 unidimensional SCI-FI/AT domains: Basic Mobility (41 items), Self-care (71 items), Fine Motor Function (35 items), and Ambulation (29 items). High correlations of full item banks with 10-item simulated CATs indicated high accuracy of each CAT in estimating a person's function, and there was high measurement reliability for the simulated CAT scales compared with the full item bank. SCI-FI/AT item difficulties in the domains of Self-care, Fine Motor Function, and Ambulation were less difficult than the same items in the original SCI-FI item banks. Conclusion: With the development of the SCI-FI/AT, clinicians and investigators have available multidimensional assessment scales that evaluate function for users of AT to complement the scales available in the original SCI-FI. PMID:26010975
Development and initial evaluation of the SCI-FI/AT.
Jette, Alan M; Slavin, Mary D; Ni, Pengsheng; Kisala, Pamela A; Tulsky, David S; Heinemann, Allen W; Charlifue, Susie; Tate, Denise G; Fyffe, Denise; Morse, Leslie; Marino, Ralph; Smith, Ian; Williams, Steve
2015-05-01
Objectives: To describe the domain structure and calibration of the Spinal Cord Injury Functional Index for samples using Assistive Technology (SCI-FI/AT) and report the initial psychometric properties of each domain. Design: Cross-sectional survey followed by computerized adaptive test (CAT) simulations. Setting: Inpatient and community settings. Participants: A sample of 460 adults with traumatic spinal cord injury (SCI) stratified by level of injury, completeness of injury, and time since injury. Interventions: None. Main outcome measure: SCI-FI/AT. Results: Confirmatory factor analysis (CFA) and item response theory (IRT) analyses identified 4 unidimensional SCI-FI/AT domains: Basic Mobility (41 items), Self-care (71 items), Fine Motor Function (35 items), and Ambulation (29 items). High correlations of full item banks with 10-item simulated CATs indicated high accuracy of each CAT in estimating a person's function, and there was high measurement reliability for the simulated CAT scales compared with the full item bank. SCI-FI/AT item difficulties in the domains of Self-care, Fine Motor Function, and Ambulation were lower than those of the same items in the original SCI-FI item banks. Conclusion: With the development of the SCI-FI/AT, clinicians and investigators have available multidimensional assessment scales that evaluate function for users of AT to complement the scales available in the original SCI-FI.
Monclús Cols, Ester; Nicolás Ocejo, David; Sánchez Sánchez, Miquel; Ortega Romero, Mar
2015-02-01
To detect the problems hospital emergency room staff have when prescribing and administering antibiotics. A 14-item questionnaire was designed to assess staff members' knowledge of the importance of starting antibiotic treatment promptly, assigning appropriate dosing intervals, adjusting for renal function, and switching to oral therapy. Agreement with each item was expressed on a 5-point Likert scale. Items with a rate of appropriate response of less than 75% were targeted for specific attention. Two hundred questionnaires were distributed to the staff and 150 were returned completed (response rate, 75%). The following items were targeted for attention based on rates of appropriate response of less than 75%: clear medical orders (65%), understanding the implication of early empirical antibiotic therapy on prognosis in serious infections (67%), estimation of the prevalence of renal insufficiency (42%), assumption that a serum creatinine level under 1.6 mg/dL is safe (33%), use of glomerular filtration rate to adjust dose according to renal function (47%), and an understanding of switching from intravenous to oral treatment (60%). This study revealed the difficulties medical and nursing staff have in prescribing and administering antibiotics in a hospital emergency department. The results can facilitate improvements in antibiotic therapy by pinpointing areas to target for specific training interventions or the design of electronic prescribing aids.
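The 75% screening rule described in this abstract can be sketched as a simple filter. This is a minimal illustration, not the authors' analysis code: the dictionary keys are shortened labels invented here, and only the six rates reported in the abstract are included (the remaining eight items are not reported).

```python
# Appropriate-response rates (%) as reported in the abstract.
# Keys are shortened, hypothetical labels; unreported items are omitted.
reported_rates = {
    "clear medical orders": 65,
    "early empirical therapy and prognosis": 67,
    "prevalence of renal insufficiency": 42,
    "creatinine under 1.6 mg/dL assumed safe": 33,
    "GFR-based dose adjustment": 47,
    "intravenous-to-oral switch": 60,
}

THRESHOLD = 75  # items below this rate are targeted for attention


def flag_items(rates, threshold=THRESHOLD):
    """Return labels whose appropriate-response rate is below threshold, lowest first."""
    flagged = [(rate, label) for label, rate in rates.items() if rate < threshold]
    return [label for rate, label in sorted(flagged)]


flagged_items = flag_items(reported_rates)
response_rate = 150 / 200 * 100  # questionnaires returned / distributed

print(flagged_items)
print(response_rate)
```

Sorting the flagged items from lowest to highest rate is a choice made here to put the weakest areas first; the abstract itself does not rank them.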
2018-01-01
Objective: To investigate the psychometric properties of the activities of daily living (ADL) instrument used in the analysis of the Korean Longitudinal Study of Ageing (KLoSA) dataset. Methods: A retrospective study was carried out involving 2006 KLoSA records of community-dwelling adults diagnosed with stroke. The ADL instrument used for the analysis of KLoSA included 17 items, which were analyzed using Rasch modeling to develop a robust outcome measure. The unidimensionality of the ADL instrument was examined based on confirmatory factor analysis with a one-factor model. Item-level psychometric analysis of the ADL instrument included fit statistics, internal consistency, precision, and the item difficulty hierarchy. Results: The study sample included a total of 201 community-dwelling adults (1.5% of the Korean population with an age over 45 years; mean age=70.0 years, SD=9.7) having a history of stroke. The ADL instrument demonstrated a unidimensional construct. Two misfit items, money management (mean square [MnSq]=1.56, standardized Z-statistics [ZSTD]=2.3) and phone use (MnSq=1.78, ZSTD=2.3), were removed from the analysis. The remaining 15 items demonstrated good item fit, high internal consistency (person reliability=0.91), and good precision (person strata=3.48). The instrument precisely estimated person measures within a wide range of theta (−4.75 logits < θ < 3.97 logits) and a reliability of 0.9, with a conceptual hierarchy of item difficulty. Conclusion: The findings indicate that the 15 ADL items met Rasch expectations of unidimensionality and demonstrated good psychometric properties. It is proposed that the validated ADL instrument can be used as a primary outcome measure for assessing longitudinal disability trajectories in the Korean adult population and can be employed for comparative analysis of international disability across national aging studies. PMID:29765888
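For readers unfamiliar with the logit scale used above, the dichotomous Rasch model can be sketched as follows. This is an illustration only: the abstract does not say whether the ADL items were scored dichotomously or polytomously, and the item difficulties below are hypothetical values, not estimates from the study.

```python
import math


def rasch_probability(theta, b):
    """Probability of a positive response under the dichotomous Rasch model.

    theta: person measure in logits; b: item difficulty in logits.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Hypothetical item difficulties in logits (not taken from the study).
item_difficulties = {"easy item": -2.0, "medium item": 0.0, "hard item": 2.0}

# Person measures at the ends of the theta range reported in the abstract, plus 0.
for theta in (-4.75, 0.0, 3.97):
    probs = {
        name: round(rasch_probability(theta, b), 3)
        for name, b in item_difficulties.items()
    }
    print(theta, probs)
```

Because the model depends only on the difference between theta and b, a person located exactly at an item's difficulty has a 50% chance of a positive response, which is what gives the item difficulty hierarchy its interpretation.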
ERIC Educational Resources Information Center
Retnawati, Heri; Kartowagiran, Badrun; Arlinwibowo, Janu; Sulistyaningsih, Eny
2017-01-01
The quality of national examination items plays an enormous role in identifying students' mastery of competencies and their difficulties. This study aims to identify the difficult items in the Junior High School Mathematics National Examination, to find the factors that cause students' difficulty and to reveal the strategies that the teachers and the…
ERIC Educational Resources Information Center
Dodonova, Yulia A.; Dodonov, Yury S.
2013-01-01
Using more complex items than those commonly employed within the information-processing approach, but still easier than those used in intelligence tests, this study analyzed how the association between processing speed and accuracy level changes as the difficulty of the items increases. The study involved measuring cognitive ability using Raven's…
Predicting Item Difficulty in a Reading Comprehension Test with an Artificial Neural Network.
ERIC Educational Resources Information Center
Perkins, Kyle; And Others
This paper reports the results of using a three-layer backpropagation artificial neural network to predict item difficulty in a reading comprehension test. Two network structures were developed, one with and one without a sigmoid function in the output processing unit. The data set, which consisted of a table of coded test items and corresponding…
Detecting unexpected variables in the MMPI 2 Social Introversion scale.
Chang, C H; Wright, B D
2001-01-01
The standard scoring structure of the revised Minnesota Multiphasic Personality Inventory (MMPI-2) Social Introversion (Si) scale was reexamined with Rasch Measurement. The 69-item Si scale split into two distinct dimensions when their standardized residuals were factor analyzed. Items keyed "true" to Si defined one dimension and items keyed "false" defined another. Relationships between Lexile values (an index of reading difficulty and comprehension) and item difficulties were also explored. The article shows how to use Rasch Measurement to understand and improve personality assessment.
An Alternate Definition of the ETS Delta Scale of Item Difficulty. Program Statistics Research.
ERIC Educational Resources Information Center
Holland, Paul W.; Thayer, Dorothy T.
An alternative definition has been developed of the delta scale of item difficulty used at Educational Testing Service. The traditional delta scale uses an inverse normal transformation based on normal ogive models developed years ago. However, no use is made of this fact in typical uses of item deltas. It is simply one way to make the probability…
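As a point of reference for the traditional scale mentioned above, the delta index is commonly stated as 13 minus 4 times the standard normal quantile of the proportion correct. The sketch below assumes that conventional form; the constants 13 and 4 do not appear in the abstract, and the alternative definition proposed by the authors is not reproduced here.

```python
from statistics import NormalDist


def ets_delta(p_correct):
    """Conventional delta: 13 - 4 * z, with z the standard normal quantile of p_correct.

    Harder items (lower proportion correct) receive larger delta values.
    """
    if not 0.0 < p_correct < 1.0:
        raise ValueError("p_correct must lie strictly between 0 and 1")
    return 13.0 - 4.0 * NormalDist().inv_cdf(p_correct)


for p in (0.2, 0.5, 0.8):
    print(p, round(ets_delta(p), 2))
```

Under this form an item answered correctly by half of the examinees sits at delta 13, and the scale is symmetric around that point.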
The role of difficulty and gender in numbers, algebra, geometry and mathematics achievement
NASA Astrophysics Data System (ADS)
Rabab'h, Belal Sadiq Hamed; Veloo, Arsaythamby; Perumal, Selvan
2015-05-01
This study aims to identify the role of difficulty and gender in numbers, algebra, geometry and mathematics achievement among secondary school students in Jordan. The respondents were 337 students from eight public secondary schools in Alkoura district, selected by stratified random sampling. The sample comprised 179 (53%) male and 158 (47%) female students. The mathematics test comprised 30 items: eight items for numbers, 14 items for algebra and eight items for geometry. Based on difficulties among male and female students, the findings showed that in numbers, item 4 (fractions - 0.34) was most difficult for male students and item 6 (square roots - 0.39) for female students. In algebra, item 11 (inequality - 0.23) was most difficult for male students and item 6 (algebraic expressions - 0.35) for female students. In geometry, item 3 (reflection - 0.34) was most difficult for male students and item 8 (volume - 0.33) for female students. Based on gender differences, female students showed higher achievement in numbers and algebra compared to male students. On the other hand, there was no difference between male and female students' achievement in the geometry test. This study suggests that teachers need to give more attention to numbers and algebra when teaching mathematics.
Developing an African youth psychosocial assessment: an application of item response theory.
Betancourt, Theresa S; Yang, Frances; Bolton, Paul; Normand, Sharon-Lise
2014-06-01
This study aimed to refine a dimensional scale for measuring psychosocial adjustment in African youth using item response theory (IRT). A 60-item scale derived from qualitative data was administered to 667 war-affected adolescents (55% female). Exploratory factor analysis (EFA) determined the dimensionality of items based on goodness-of-fit indices. Items with loadings less than 0.4 were dropped. Confirmatory factor analysis (CFA) was used to confirm the scale's dimensionality found under the EFA. Item discrimination and difficulty were estimated using a graded response model for each subscale using weighted least squares means and variances. Predictive validity was examined through correlations between IRT scores (θ) for each subscale and ratings of functional impairment. All models were assessed using goodness-of-fit and comparative fit indices. Fisher's Information curves examined item precision at different underlying ranges of each trait. Original scale items were optimized and reconfigured into an empirically-robust 41-item scale, the African Youth Psychosocial Assessment (AYPA). Refined subscales assess internalizing and externalizing problems, prosocial attitudes/behaviors and somatic complaints without medical cause. The AYPA is a refined dimensional assessment of emotional and behavioral problems in African youth with good psychometric properties. Validation studies in other cultures are recommended. Copyright © 2014 John Wiley & Sons, Ltd.
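The idea behind the information curves mentioned above can be sketched with the two-parameter logistic (2PL) model, the dichotomous special case of the graded response model. This is a simplification chosen here for brevity: the study used polytomous items, and the discrimination and difficulty values below are hypothetical.

```python
import math


def p_2pl(theta, a, b):
    """Probability of endorsing a dichotomous item under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical item: discrimination a = 1.5, difficulty b = 0.0.
a, b = 1.5, 0.0
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(item_information(theta, a, b), 4))
```

Information peaks where theta equals the item's difficulty and grows with the square of the discrimination, which is why low-discrimination items contribute little precision anywhere on the trait range.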
Developing an African youth psychosocial assessment: an application of item response theory
BETANCOURT, THERESA S.; YANG, FRANCES; BOLTON, PAUL; NORMAND, SHARON-LISE
2014-01-01
This study aimed to refine a dimensional scale for measuring psychosocial adjustment in African youth using item response theory (IRT). A 60-item scale derived from qualitative data was administered to 667 war-affected adolescents (55% female). Exploratory factor analysis (EFA) determined the dimensionality of items based on goodness-of-fit indices. Items with loadings less than 0.4 were dropped. Confirmatory factor analysis (CFA) was used to confirm the scale's dimensionality found under the EFA. Item discrimination and difficulty were estimated using a graded response model for each subscale using weighted least squares means and variances. Predictive validity was examined through correlations between IRT scores (θ) for each subscale and ratings of functional impairment. All models were assessed using goodness-of-fit and comparative fit indices. Fisher's Information curves examined item precision at different underlying ranges of each trait. Original scale items were optimized and reconfigured into an empirically-robust 41-item scale, the African Youth Psychosocial Assessment (AYPA). Refined subscales assess internalizing and externalizing problems, prosocial attitudes/behaviors and somatic complaints without medical cause. The AYPA is a refined dimensional assessment of emotional and behavioral problems in African youth with good psychometric properties. Validation studies in other cultures are recommended. PMID:24478113
Keller, Johannes
2007-06-01
Stereotype threat research revealed that negative stereotypes can disrupt the performance of persons targeted by such stereotypes. This paper contributes to stereotype threat research by providing evidence that domain identification and the difficulty level of test items moderate stereotype threat effects on female students' maths performance. The study was designed to test theoretical ideas derived from stereotype threat theory and assumptions outlined in the Yerkes-Dodson law proposing a nonlinear relationship between arousal, task difficulty and performance. Participants were 108 high school students attending secondary schools. Participants worked on a test comprising maths problems of different difficulty levels. Half of the participants learned that the test had been shown to produce gender differences (stereotype threat). The other half learned that the test had been shown not to produce gender differences (no threat). The degree to which participants identify with the domain of maths was included as a quasi-experimental factor. Maths-identified female students showed performance decrements under conditions of stereotype threat. Moreover, the stereotype threat manipulation had different effects on low and high domain identifiers' performance depending on test item difficulty. On difficult items, low identifiers showed higher performance under threat (vs. no threat) whereas the reverse was true in high identifiers. This interaction effect did not emerge on easy items. Domain identification and test item difficulty are two important factors that need to be considered in the attempt to understand the impact of stereotype threat on performance.
Ali, Amira Mohammed; Ahmed, Anwar; Sharaf, Amira; Kawakami, Norito; Abdeldayem, Samia M; Green, Joseph
2017-12-01
This study aimed to examine the validity of the Arabic version of the Depression Anxiety Stress Scale-21 (DASS-21) in 149 illicit drug users. We calculated α coefficient, inter-item and item-total correlations, coefficients of reproducibility and scalability (CR and CS), item difficulty and discrimination indices. The DASS-21 had an acceptable reliability; but values of the CR and the CS were less than acceptable. Items varied in difficulty and discrimination; some items are candidates for elimination. The DASS-21 is a probabilistic and not a deterministic measure of distress; it has problematic items and needs further investigations. Copyright © 2017 Elsevier B.V. All rights reserved.
Do item-writing flaws reduce examinations psychometric quality?
Pais, João; Silva, Artur; Guimarães, Bruno; Povo, Ana; Coelho, Elisabete; Silva-Pereira, Fernanda; Lourinho, Isabel; Ferreira, Maria Amélia; Severo, Milton
2016-08-11
The psychometric characteristics of multiple-choice questions (MCQ) changed when taking into account their anatomical sites and the presence of item-writing flaws (IWF). The aim is to understand the impact of the anatomical sites and the presence of IWF on the psychometric qualities of the MCQ. 800 Clinical Anatomy MCQ from eight examinations were classified as standard or flawed items and according to one of the eight anatomical sites. An item was classified as flawed if it violated at least one of the principles of item writing. The difficulty and discrimination indices of each item were obtained. 55.8% of the MCQ were flawed items. The anatomical site of the items explained 6.2% and 3.2% of the difficulty and discrimination parameters, and the IWF explained 2.8% and 0.8%, respectively. The impact of the IWF was heterogeneous: the Writing the Stem and Writing the Choices categories had a negative impact (higher difficulty and lower discrimination), while the other categories did not have any impact. The anatomical site effect was higher than the IWF effect on the psychometric characteristics of the examination. When constructing MCQ, the focus should be first on the topic/area of the items and only afterwards on the presence of IWF.
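Classical difficulty and discrimination indices of the kind referred to above can be sketched as follows. The abstract does not state which formulas were used, so this assumes two common choices: proportion correct for difficulty and the upper-lower group difference (top and bottom 27% by total score) for discrimination. The response matrix is hypothetical.

```python
def difficulty_index(responses, item):
    """Proportion of examinees answering the item correctly (higher means easier)."""
    return sum(row[item] for row in responses) / len(responses)


def discrimination_index(responses, item, fraction=0.27):
    """Upper-lower index: proportion correct in the top group minus the bottom group.

    Groups are formed by total score, which here includes the item itself.
    """
    ranked = sorted(responses, key=sum)
    k = max(1, round(len(ranked) * fraction))
    lower, upper = ranked[:k], ranked[-k:]
    p_upper = sum(row[item] for row in upper) / k
    p_lower = sum(row[item] for row in lower) / k
    return p_upper - p_lower


# Hypothetical 0/1 responses: rows are examinees, columns are items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

for item in range(4):
    print(item, round(difficulty_index(responses, item), 3),
          discrimination_index(responses, item))
```

With real examination data a corrected item-total correlation is often preferred, since the upper-lower index discards the middle of the score distribution.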
Selecting Items for Criterion-Referenced Tests.
ERIC Educational Resources Information Center
Mellenbergh, Gideon J.; van der Linden, Wim J.
1982-01-01
Three item selection methods for criterion-referenced tests are examined: the classical theory of item difficulty and item-test correlation; the latent trait theory of item characteristic curves; and a decision-theoretic approach for optimal item selection. Item contribution to the standardized expected utility of mastery testing is discussed. (CM)
Item Difficulty Modeling of Paragraph Comprehension Items
ERIC Educational Resources Information Center
Gorin, Joanna S.; Embretson, Susan E.
2006-01-01
Recent assessment research joining cognitive psychology and psychometric theory has introduced a new technology, item generation. In algorithmic item generation, items are systematically created based on specific combinations of features that underlie the processing required to correctly solve a problem. Reading comprehension items have been more…
Item difficulty and item validity for the Children's Group Embedded Figures Test.
Rusch, R R; Trigg, C L; Brogan, R; Petriquin, S
1994-02-01
The validity and reliability of the Children's Group Embedded Figures Test was reported for students in Grade 2 by Cromack and Stone in 1980; however, a search of the literature indicates no evidence for internal consistency or item analysis. Hence the purpose of this study was to examine the item difficulty and item validity of the test with children in Grades 1 and 2. Confusion in the literature over development and use of this test was seemingly resolved through analysis of these descriptions and through an interview with the test developer. One early-appearing item was unreasonably difficult. Two or three other items were quite difficult and made little contribution to the total score. Caution is recommended, however, in any reordering or elimination of items based on these findings, given the limited number of subjects (n = 84).
North American Veterinary Licensing Examination pacing study.
Subhiyah, Raja G; Boyce, John R
2010-01-01
The National Board of Veterinary Medical Examiners was interested in the possible effects of word count on the outcomes of the North American Veterinary Licensing Examination. In this study, the authors investigated the effects of increasing word count on the pacing of examinees during each section of the examination and on the performance of examinees on the items. Specifically, the authors analyzed the effect of item word count on the average time spent on each item within a section of the examination, the average number of items omitted at the end of a section, and the average difficulty of items as a function of presentation order. The average word count per item increased from 2001 to 2008. As expected, there was a relationship between word count and time spent on the item. No significant relationship was found between word count and item difficulty, and an analysis of omitted items and pacing patterns showed no indication of overall pacing problems.
Difficulty and Discriminability of Introductory Psychology Test Items.
ERIC Educational Resources Information Center
Scialfa, Charles; Legare, Connie; Wenger, Larry; Dingley, Louis
2001-01-01
Analyzes multiple-choice questions provided in test banks for introductory psychology textbooks. Study 1 offered a consistent picture of the objective difficulty of multiple-choice tests for introductory psychology students, while both studies 1 and 2 indicated that test items taken from commercial test banks have poor psychometric properties.…
Working memory capacity and fluid abilities: the more difficult the item, the more more is better.
Little, Daniel R; Lewandowsky, Stephan; Craig, Stewart
2014-01-01
The relationship between fluid intelligence and working memory is of fundamental importance to understanding how capacity-limited structures such as working memory interact with inference abilities to determine intelligent behavior. Recent evidence has suggested that the relationship between a fluid abilities test, Raven's Progressive Matrices, and working memory capacity (WMC) may be invariant across difficulty levels of the Raven's items. We show that this invariance can only be observed if the overall correlation between Raven's and WMC is low. Simulations of Raven's performance revealed that as the overall correlation between Raven's and WMC increases, the item-wise point bi-serial correlations involving WMC are no longer constant but increase considerably with item difficulty. The simulation results were confirmed by two studies that used a composite measure of WMC, which yielded a higher correlation between WMC and Raven's than reported in previous studies. As expected, with the higher overall correlation, there was a significant positive relationship between Raven's item difficulty and the extent of the item-wise correlation with WMC.
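The item-wise point-biserial correlation discussed above is simply the Pearson correlation between a 0/1 item score and a continuous variable. The sketch below shows the computation on made-up data; the scores and item responses are hypothetical and are not meant to reproduce the simulations or findings of the study.

```python
import math


def point_biserial(item_scores, covariate):
    """Pearson correlation between a 0/1 item score and a continuous variable."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(covariate) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, covariate))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in covariate)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical working memory scores for six people, and their 0/1 responses
# to one easy and one hard item.
wmc_scores = [1, 2, 3, 4, 5, 6]
easy_item = [0, 1, 1, 1, 1, 1]
hard_item = [0, 0, 0, 0, 1, 1]

r_easy = point_biserial(easy_item, wmc_scores)
r_hard = point_biserial(hard_item, wmc_scores)
print(round(r_easy, 3), round(r_hard, 3))
```

Note that the attainable size of a point-biserial correlation depends on the item's proportion correct, so comparisons across items of very different difficulty need care.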
Yost, Kathleen J; Webster, Kimberly; Baker, David W; Choi, Seung W; Bode, Rita K; Hahn, Elizabeth A
2009-06-01
Current health literacy measures are too long, imprecise, or have questionable equivalence of English and Spanish versions. The purpose of this paper is to describe the development and pilot testing of a new bilingual computer-based health literacy assessment tool. We analyzed literacy data from three large studies. Using a working definition of health literacy, we developed new prose, document and quantitative items in English and Spanish. Items were pilot tested on 97 English- and 134 Spanish-speaking participants to assess item difficulty. Items covered topics relevant to primary care patients and providers. English- and Spanish-speaking participants understood the tasks involved in answering each type of question. The English Talking Touchscreen was easy to use and the English and Spanish items provided good coverage of the difficulty continuum. Qualitative and quantitative results provided useful information on computer acceptability and initial item difficulty. After the items have been administered on the Talking Touchscreen (la Pantalla Parlanchina) to 600 English-speaking (and 600 Spanish-speaking) primary care patients, we will develop a computer adaptive test. This health literacy tool will enable clinicians and researchers to more precisely determine the level at which low health literacy adversely affects health and healthcare utilization.
The Effects of Item Format and Cognitive Domain on Students' Science Performance in TIMSS 2011
NASA Astrophysics Data System (ADS)
Liou, Pey-Yan; Bulut, Okan
2017-12-01
The purpose of this study was to examine eighth-grade students' science performance in terms of two test design components: item format and cognitive domain. The Taiwanese portion of the data came from the 2011 administration of the Trends in International Mathematics and Science Study (TIMSS), one of the major international large-scale assessments in science. The item difficulty analysis was initially applied to show the proportion of correct items. A regression-based cumulative link mixed modeling (CLMM) approach was further utilized to estimate the impact of item format, cognitive domain, and their interaction on the students' science scores. The results of the proportion-correct statistics showed that constructed-response items were more difficult than multiple-choice items, and that the reasoning cognitive domain items were more difficult compared to the items in the applying and knowing domains. In terms of the CLMM results, students tended to obtain higher scores when answering constructed-response items as well as items in the applying cognitive domain. When the two predictors and the interaction term were included together, the directions and magnitudes of the predictors on student science performance changed substantially. Plausible explanations for the complex nature of the effects of the two test-design predictors on student science performance are discussed. The results provide practical, empirically based evidence for test developers, teachers, and stakeholders to be aware of the differential function of item format, cognitive domain, and their interaction in students' science performance.
[Instrument to measure adherence in hypertensive patients: contribution of Item Response Theory].
Rodrigues, Malvina Thaís Pacheco; Moreira, Thereza Maria Magalhaes; Vasconcelos, Alexandre Meira de; Andrade, Dalton Francisco de; Silva, Daniele Braz da; Barbetta, Pedro Alberto
2013-06-01
To analyze, by means of "Item Response Theory", an instrument to measure adherence to treatment for hypertension. Analytical study with 406 hypertensive patients with associated complications seen in primary care in Fortaleza, CE, Northeastern Brazil, 2011 using "Item Response Theory". The stages were: dimensionality test, calibrating the items, processing data and creating a scale, analyzed using the graded response model. A study of the dimensionality of the instrument was conducted by analyzing the polychoric correlation matrix and full-information factor analysis. Multilog software was used to calibrate items and estimate the scores. Items relating to drug therapy are the most directly related to adherence while those relating to drug-free therapy need to be reworked because they have less psychometric information and low discrimination. The independence of items, the small number of levels in the scale and low explained variance in the adjustment of the models show the main weaknesses of the instrument analyzed. The "Item Response Theory" proved to be a relevant analysis technique because it evaluated respondents for adherence to treatment for hypertension, the level of difficulty of the items and their ability to discriminate between individuals with different levels of adherence, which generates a greater amount of information. The instrument analyzed is limited in measuring adherence to hypertension treatment, as shown by the "Item Response Theory" analysis of its items, and needs adjustment. The proper formulation of the items is important in order to accurately measure the desired latent trait.
Mokken scaling of the Myocardial Infarction Dimensional Assessment Scale (MIDAS).
Thompson, David R; Watson, Roger
2011-02-01
The purpose of this study was to examine the hierarchical and cumulative nature of the 35 items of the Myocardial Infarction Dimensional Assessment Scale (MIDAS), a disease-specific health-related quality of life measure. Data from 668 participants who completed the MIDAS were analysed using the Mokken Scaling Procedure, which is a computer program that searches polychotomous data for hierarchical and cumulative scales on the basis of a range of diagnostic criteria. Fourteen MIDAS items were retained in a Mokken scale and these items included physical activity, insecurity, emotional reaction and dependency items but excluded items related to diet, medication or side-effects. Item difficulty, in item response theory terms, ran from physical activity items (low difficulty) to insecurity, suggesting that the most severe quality of life effect of myocardial infarction is loneliness and isolation. Items from the MIDAS form a strong and reliable Mokken scale, which provides new insight into the relationship between items in the MIDAS and the measurement of quality of life after myocardial infarction. © 2010 Blackwell Publishing Ltd.
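Mokken scaling judges scalability with Loevinger's H coefficient, the ratio of observed inter-item covariance to the maximum covariance possible given the item marginals. The sketch below computes the scale-level H for dichotomous data only; this is a simplification, since the MIDAS items are polytomous, and it does not reproduce the item selection performed by the Mokken Scaling Procedure program. The response matrix is a hypothetical perfect Guttman pattern.

```python
from itertools import combinations


def loevinger_h(responses):
    """Scale-level Loevinger H for 0/1 data.

    H = sum of observed inter-item covariances divided by the sum of the
    maximum covariances attainable given each pair's proportions.
    Undefined (division by zero) if every item has proportion 0 or 1.
    """
    n = len(responses)
    n_items = len(responses[0])
    p = [sum(row[i] for row in responses) / n for i in range(n_items)]
    observed = 0.0
    maximum = 0.0
    for i, j in combinations(range(n_items), 2):
        p_ij = sum(row[i] * row[j] for row in responses) / n
        observed += p_ij - p[i] * p[j]
        maximum += min(p[i], p[j]) - p[i] * p[j]
    return observed / maximum


# Hypothetical perfect Guttman pattern: anyone passing a harder item
# also passes every easier one.
guttman = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
print(loevinger_h(guttman))
```

By a widely used rule of thumb, H of at least 0.5 is read as a strong scale, 0.4 to 0.5 as moderate, and 0.3 to 0.4 as weak; the abstract does not report the H value obtained for the MIDAS items.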
Interpretation of the Rasch Ability and Difficulty Scales for Educational Purposes.
ERIC Educational Resources Information Center
Woodcock, Richard W.
Though many test developers have utilized item response theory in their work, few have taken advantage of the potential of item response theory for providing new interpretation procedures that accentuate the educational implications to be drawn from test scores. This paper describes several features, based upon the Rasch difficulty and ability…
The Effect of Anchor Test Construction on Scale Drift
ERIC Educational Resources Information Center
Antal, Judit; Proctor, Thomas P.; Melican, Gerald J.
2014-01-01
In common-item equating the anchor block is generally built to represent a miniature form of the total test in terms of content and statistical specifications. The statistical properties frequently reflect equal mean and spread of item difficulty. Sinharay and Holland (2007) suggested that the requirement for equal spread of difficulty may be too…
Simple mental addition in children with and without mild mental retardation.
Janssen, R; De Boeck, P; Viaene, M; Vallaeys, L
1999-11-01
The speeded performance on simple mental addition problems of 6- and 7-year-old children with and without mild mental retardation is modeled from a person perspective and an item perspective. On the person side, it was found that a single cognitive dimension spanned the performance differences between the two ability groups. However, a discontinuity, or "jump," was observed in the performance of the normal ability group on the easier items. On the item side, the addition problems were almost perfectly ordered in difficulty according to their problem size. Differences in difficulty were explained by factors related to the difficulty of executing nonretrieval strategies. All findings were interpreted within the framework of Siegler's (e.g., R. S. Siegler & C. Shipley, 1995) model of children's strategy choices in arithmetic. Models from item response theory were used to test the hypotheses. Copyright 1999 Academic Press.
ERIC Educational Resources Information Center
Hewitt, Margaret A.; Homan, Susan P.
2004-01-01
Test validity issues considered by test developers and school districts rarely include individual item readability levels. In this study, items from a major standardized test were examined for individual item readability level and item difficulty. The Homan-Hewitt Readability Formula was applied to items across three grade levels. Results of…
Bond, Kathy S; Chalmers, Kathryn J; Jorm, Anthony F; Kitchener, Betty A; Reavley, Nicola J
2015-06-03
There is a strong association between mental health problems and financial difficulties. Therefore, people who work with those who have financial difficulties (financial counsellors and financial institution staff) need to have knowledge and helping skills relevant to mental health problems. Conversely, people who support those with mental health problems (mental health professionals and carers) may need to have knowledge and helping skills relevant to financial difficulties. The Delphi expert consensus method was used to develop guidelines for people who work with or support those with mental health problems and financial difficulties. A systematic review of websites, books and journal articles was conducted to develop a questionnaire containing items about the knowledge, skills and actions relevant to working with or supporting someone with mental health problems and financial difficulties. These items were rated over three rounds by five Australian expert panels comprising financial counsellors (n = 33), financial institution staff (n = 54), mental health professionals (n = 31), consumers (n = 20) and carers (n = 24). A total of 897 items were rated, with 462 items endorsed by at least 80% of members of each of the expert panels. These endorsed statements were used to develop a set of guidelines for financial counsellors, financial institution staff, mental health professionals and carers about how to assist someone with mental health problems and financial difficulties. A diverse group of expert panel members was able to reach substantial consensus on the knowledge, skills and actions needed to work with and support people with mental health problems and financial difficulties. These guidelines can be used to inform policy and practice in the financial and mental health sectors.
You'll change more than I will: Adults' predictions about their own and others' future preferences.
Renoult, Louis; Kopp, Leia; Davidson, Patrick S R; Taler, Vanessa; Atance, Cristina M
2016-01-01
It has been argued that adults underestimate the extent to which their preferences will change over time. We sought to determine whether such mispredictions are the result of a difficulty imagining that one's own current and future preferences may differ or whether it also characterizes our predictions about the future preferences of others. We used a perspective-taking task in which we asked young people how much they liked stereotypically young-person items (e.g., Top 40 music, adventure vacations) and stereotypically old-person items (e.g., jazz, playing bridge) now, and how much they would like them in the distant future (i.e., when they are 70 years old). Participants also made these same predictions for a generic same-age, same-sex peer. In a third condition, participants predicted how much a generic older (i.e., age 70) same-sex adult would like items from both categories today. Participants predicted less change between their own current and future preferences than between the current and future preferences of a peer. However, participants estimated that, compared to a current older adult today, their peer would like stereotypically young items more in the future and stereotypically old items less. The fact that peers' distant-future estimated preferences were different from the ones they made for "current" older adults suggests that even though underestimation of change of preferences over time is attenuated when thinking about others, a bias still exists.
ERIC Educational Resources Information Center
Shulruf, Boaz; Jones, Phil; Turner, Rolf
2015-01-01
The determination of Pass/Fail decisions over Borderline grades (i.e., grades which do not clearly distinguish between the competent and incompetent examinees) has been an ongoing challenge for academic institutions. This study utilises the Objective Borderline Method (OBM) to determine examinee ability and item difficulty, and from that…
ERIC Educational Resources Information Center
Wu, Pei-Chen; Chang, Lily
2008-01-01
The authors investigated the Chinese version of the Beck Depression Inventory-II (BDI-II-C; Chinese Behavioral Science Corporation, 2000) within the Rasch framework in terms of dimensionality, item difficulty, and category functioning. Two underlying scale dimensions, relatively high item difficulties, and a need for collapsing 2 response…
A Comparison of Three Types of Test Development Procedures Using Classical and Latent Trait Methods.
ERIC Educational Resources Information Center
Benson, Jeri; Wilson, Michael
Three methods of item selection were used to select sets of 38 items from a 50-item verbal analogies test and the resulting item sets were compared for internal consistency, standard errors of measurement, item difficulty, biserial item-test correlations, and relative efficiency. Three groups of 1,500 cases each were used for item selection. First…
ERIC Educational Resources Information Center
Wang, Wen-Chung
2004-01-01
Scale indeterminacy in analysis of differential item functioning (DIF) within the framework of item response theory can be resolved by imposing 3 anchor item methods: the equal-mean-difficulty method, the all-other anchor item method, and the constant anchor item method. In this article, applicability and limitations of these 3 methods are…
Using Classical Test Theory and Item Response Theory to Evaluate the LSCI
NASA Astrophysics Data System (ADS)
Schlingman, Wayne M.; Prather, E. E.; Collaboration of Astronomy Teaching Scholars CATS
2011-01-01
Analyzing the data from the recent national study using the Light and Spectroscopy Concept Inventory (LSCI), this project uses both Classical Test Theory (CTT) and Item Response Theory (IRT) to investigate the LSCI itself in order to better understand what it is actually measuring. We use Classical Test Theory to form a framework of results that can be used to evaluate the effectiveness of individual questions at measuring differences in student understanding and provide further insight into the prior results presented from this data set. In the second phase of this research, we use Item Response Theory to form a theoretical model that generates parameters accounting for a student's ability and a question's difficulty, and that estimates the level of guessing. The combined results from our investigations using both CTT and IRT are used to better understand the learning that is taking place in classrooms across the country. The analysis will also allow us to evaluate the effectiveness of individual questions and determine whether the item difficulties are appropriately matched to the abilities of the students in our data set. These results may require that some questions be revised, motivating the need for further development of the LSCI. This material is based upon work supported by the National Science Foundation under Grant No. 0715517, a CCLI Phase III Grant for the Collaboration of Astronomy Teaching Scholars (CATS). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
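The abstract above mentions an IRT model with parameters for student ability, question difficulty, and guessing, but it does not name the model. As a minimal sketch, assuming the common three-parameter logistic (3PL) form, the probability of a correct answer can be written as follows. The function name and the item parameter values are hypothetical and are not taken from the LSCI study.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under an assumed 3PL model.

    theta: examinee ability
    a: item discrimination
    b: item difficulty (same scale as theta)
    c: pseudo-guessing parameter (lower asymptote)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: moderate discrimination, slightly hard, four answer options.
item = {"a": 1.2, "b": 0.5, "c": 0.25}
for theta in (-2.0, 0.5, 3.0):
    print(theta, round(p_correct_3pl(theta, **item), 3))
```

When ability equals difficulty, the probability sits halfway between the guessing level and 1, which is one way to read whether an item's difficulty matches the ability of a group of students.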
Torgén, M; Winkel, J; Alfredsson, L; Kilbom, A
1999-06-01
The principal aim of the present study was to evaluate questionnaire-based information on past physical work loads (6-year recall). Effects of memory difficulties on reproducibility were evaluated for 82 subjects by comparing previously reported results on current work loads (test-retest procedure) with the same items recalled 6 years later. Validity was assessed by comparing self-reports in 1995, regarding work loads in 1989, with worksite measurements performed in 1989. Six-year reproducibility, calculated as weighted kappa coefficients (k(w)), varied between 0.36 and 0.86, with the highest values for proportion of the workday spent sitting and for perceived general exertion and the lowest values for trunk and neck flexion. The six-year reproducibility results were similar to previously reported test-retest results for these items; this finding indicates that memory difficulties were a minor problem. The validity of the questionnaire responses, expressed as rank correlations (r(s)) between the questionnaire responses and workplace measurements, varied between -0.16 and 0.78. The highest values were obtained for the items sitting and repetitive work, and the lowest and "unacceptable" values were for head rotation and neck flexion. Misclassification of exposure did not appear to be differential with regard to musculoskeletal symptom status, as judged by the calculated risk estimates. The validity of some of these self-administered questionnaire items appears sufficient for a crude assessment of physical work loads in the past in epidemiologic studies of the general population with predominantly low levels of exposure.
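Reproducibility in the abstract above is reported as weighted kappa, but the weighting scheme is not stated. The sketch below assumes linear disagreement weights and ordinal categories coded 0 to k-1; the function name and the example ratings are hypothetical.

```python
def weighted_kappa(ratings_1, ratings_2, n_categories):
    """Linearly weighted kappa for two sets of ordinal ratings coded 0..k-1.

    Assumes linear weights |i - j| / (k - 1); undefined (division by zero)
    when the expected disagreement is zero, e.g. all ratings in one category.
    """
    n = len(ratings_1)
    k = n_categories
    # Observed joint proportions of (first rating, second rating).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_1, ratings_2):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Weighted observed and chance-expected disagreement.
    obs_dis = sum(abs(i - j) / (k - 1) * observed[i][j]
                  for i in range(k) for j in range(k))
    exp_dis = sum(abs(i - j) / (k - 1) * row[i] * col[j]
                  for i in range(k) for j in range(k))
    return 1.0 - obs_dis / exp_dis


# Hypothetical example: the same item answered at baseline and at recall.
baseline = [0, 1, 2, 1, 2, 0]
recall = [0, 1, 2, 2, 2, 1]
print(round(weighted_kappa(baseline, recall, 3), 3))
```

Linear weights penalise a two-category disagreement twice as much as a one-category disagreement; quadratic weights are a common alternative and would give different values.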
Within-item strategy switching in arithmetic: a comparative study in children
Ardiale, Eléonore; Lemaire, Patrick
2013-01-01
The present study aimed at determining whether (1) children were able to interrupt a strategy execution to switch and choose another better strategy, and (2) their ability to switch strategy within-item improved with age. Third, fifth, and seventh graders performed a computational estimation task in which they had to provide the better estimates for two-digit addition problems (e.g., 32 + 54) while using the rounding-down (e.g., 30 + 50) or the rounding-up strategy (e.g., 40 + 60). After executing the cued strategy (e.g., 30 + 50) for 1,000 ms, participants were given the opportunity to switch to another better strategy (e.g., 40 + 60) or to repeat the same strategy (e.g., 30 + 50). The results showed that children switched strategies within items, and were able to switch more often when the addition problems were cued with the poorer strategy (e.g., 40 + 60 for 32 + 54) than when cued with the better strategy (e.g., 30 + 50). As they grew up, children based their decisions to switch strategies more often on whether the 1,000-ms strategy execution concerned the better strategy or strategy difficulty (i.e., the rounding-up strategy). These findings have important implications to further understand mechanisms underlying within-item strategy switching as well as strategic variations in children. PMID:24368906
Fraundorf, Scott H; Benjamin, Aaron S
2016-09-01
Information about others' success in remembering is frequently available. For example, students taking an exam may assess its difficulty by monitoring when others turn in their exams. In two experiments, we investigated how rememberers use this information to guide recall. Participants studied paired associates, some semantically related (and thus easier to retrieve) and some unrelated (and thus harder). During a subsequent cued recall test, participants viewed fictive information about an opponent's accuracy on each item. In Experiment 1, participants responded to each cue once before seeing the opponent's performance and once afterwards. Participants reconsidered their responses least often when the opponent's accuracy matched the item difficulty (easy items the opponent recalled, hard items the opponent forgot) and most often when the opponent's accuracy and the item difficulty mismatched. When participants responded only after seeing the opponent's performance (Experiment 2), the same mismatch conditions that led to reconsideration even produced superior recall. These results suggest that rememberers monitor whether others' knowledge states accord or conflict with their own experience, and that this information shifts how they interrogate their memory and what they recall.
Intervention for children with word-finding difficulties: a parallel group randomised control trial.
Best, Wendy; Hughes, Lucy Mari; Masterson, Jackie; Thomas, Michael; Fedor, Anna; Roncoli, Silvia; Fern-Pollak, Liory; Shepherd, Donna-Lynn; Howard, David; Shobbrook, Kate; Kapikian, Anna
2017-07-31
The study investigated the outcome of a word-web intervention for children diagnosed with word-finding difficulties (WFDs). Twenty children aged 6-8 years with WFDs, confirmed by a discrepancy between comprehension and production on the Test of Word Finding-2, were randomly assigned to intervention (n = 11) and waiting control (n = 9) groups. The intervention group had six sessions of intervention which used word-webs and targeted children's meta-cognitive awareness and word-retrieval. On the treated experimental set (n = 25 items) the intervention group gained on average four times as many items as the waiting control group (d = 2.30). There were also gains on personally chosen items for the intervention group. There was little change on untreated items for either group. The study is the first randomised control trial to demonstrate an effect of word-finding therapy with children with language difficulties in mainstream school. The improvement in word-finding for treated items was obtained following a clinically realistic intervention in terms of approach, intensity and duration.
Fayyaz Khan, Humaira; Farooq Danish, Khalid; Saeed Awan, Azra; Anwar, Masood
2013-05-01
The purpose of the study was to identify technical item flaws in the multiple choice questions submitted for the final exams for the years 2009, 2010 and 2011. This descriptive analytical study was carried out at Islamic International Medical College (IIMC). The data were collected from the MCQs submitted by the faculty for the final exams for the years 2009, 2010 and 2011. The data were compiled and evaluated by a three-member assessment committee. The data were analyzed for frequencies and percentages; the categorical data were analyzed by chi-square test. The overall percentage of flawed items was 67% for the year 2009, of which 21% were for testwiseness and 40% were for irrelevant difficulty. In the year 2010 the total item flaws were 36%, with 11% for testwiseness and 22% for irrelevant difficulty. The year 2011 data showed decreased overall flaws of 21%; flaws of testwiseness were 7% and flaws of irrelevant difficulty were 11%. Technical item flaws are frequently encountered during MCQ construction, and the identification of flaws leads to improved quality of the single best MCQs.
A Review of Classical Methods of Item Analysis.
ERIC Educational Resources Information Center
French, Christine L.
Item analysis is a very important consideration in the test development process. It is a statistical procedure to analyze test items that combines methods used to evaluate the important characteristics of test items, such as difficulty, discrimination, and distractibility of the items in a test. This paper reviews some of the classical methods for…
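Two of the classical statistics named in the abstract above, difficulty and discrimination, can be sketched as follows. This is a minimal illustration and not the paper's own procedure: it assumes dichotomous (0/1) scoring, takes difficulty as the proportion correct, and takes discrimination as the upper-minus-lower group difference using the conventional 27% groups. The function name and the response matrix are hypothetical.

```python
def item_analysis(responses, fraction=0.27):
    """Return (difficulty, discrimination) per item for 0/1 scored responses.

    responses: list of examinees, each a list of 0/1 item scores.
    difficulty: proportion of all examinees answering the item correctly.
    discrimination: proportion correct in the top-scoring group minus
    proportion correct in the bottom-scoring group.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    # Examinee indices ordered from lowest to highest total score.
    order = sorted(range(n_examinees), key=lambda i: totals[i])
    group_size = max(1, round(fraction * n_examinees))
    lower, upper = order[:group_size], order[-group_size:]
    results = []
    for j in range(n_items):
        difficulty = sum(row[j] for row in responses) / n_examinees
        p_upper = sum(responses[i][j] for i in upper) / group_size
        p_lower = sum(responses[i][j] for i in lower) / group_size
        results.append((difficulty, p_upper - p_lower))
    return results


# Hypothetical responses: 4 examinees by 3 items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
for j, (p, d) in enumerate(item_analysis(data), start=1):
    print(f"item {j}: difficulty={p:.2f}, discrimination={d:.2f}")
```

Note that a higher difficulty value here means an easier item, since it is the proportion of examinees who answered correctly.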
Detecting a Gender-Related DIF Using Logistic Regression and Transformed Item Difficulty
ERIC Educational Resources Information Center
Abedlaziz, Nabeel; Ismail, Wail; Hussin, Zaharah
2011-01-01
Test items are designed to provide information about the examinees. Difficult items are designed to be more demanding and easy items are less so. However, sometimes, test items carry with them demands other than those intended by the test developer (Scheuneman & Gerritz, 1990). When personal attributes such as gender systematically affect…
Adaptable Learning Assistant for Item Bank Management
ERIC Educational Resources Information Center
Nuntiyagul, Atorn; Naruedomkul, Kanlaya; Cercone, Nick; Wongsawang, Damras
2008-01-01
We present PKIP, an adaptable learning assistant tool for managing question items in item banks. PKIP is not only able to automatically assist educational users to categorize the question items into predefined categories by their contents but also to correctly retrieve the items by specifying the category and/or the difficulty level. PKIP adapts…
Two-item same/different discrimination in rhesus monkeys (Macaca mulatta).
Basile, Benjamin M; Moylan, Emily J; Charles, David P; Murray, Elisabeth A
2015-11-01
Almost all nonhuman animals can recognize when one item is the same as another item. It is less clear whether nonhuman animals possess abstract concepts of "same" and "different" that can be divorced from perceptual similarity. Pigeons and monkeys show inconsistent performance, and often surprising difficulty, in laboratory tests of same/different learning that involve only two items. Previous results from tests using multi-item arrays suggest that nonhumans compute sameness along a continuous scale of perceptual variability, which would explain the difficulty of making two-item same/different judgments. Here, we provide evidence that rhesus monkeys can learn a two-item same/different discrimination similar to those on which monkeys and pigeons have previously failed. Monkeys' performance transferred to novel stimuli and was not affected by perceptual variations in stimulus size, rotation, view, or luminance. Success without the use of multi-item arrays, and the lack of effect of perceptual variability, suggests a computation of sameness that is more categorical, and perhaps more abstract, than previously thought.
Rasch analysis for psychometric improvement of science attitude rating scales
NASA Astrophysics Data System (ADS)
Oon, Pey-Tee; Fan, Xitao
2017-04-01
Students' attitude towards science (SAS) is often a subject of investigation in science education research. Rating-scale surveys are commonly used in the study of SAS. The present study illustrates how Rasch analysis can be used to provide psychometric information about SAS rating scales. The analyses were conducted on a 20-item SAS scale used in an existing dataset of The Trends in International Mathematics and Science Study (TIMSS) (2011). Data of all the eighth-grade participants from Hong Kong and Singapore (N = 9942) were retrieved for analyses. Additional insights from Rasch analysis that are not commonly available from conventional test and item analyses were discussed, such as invariance measurement of SAS, unidimensionality of SAS construct, optimum utilization of SAS rating categories, and item difficulty hierarchy in the SAS scale. Recommendations on how TIMSS items on the measurement of SAS can be better designed were discussed. The study also highlights the importance of using Rasch estimates for statistical parametric tests (e.g. ANOVA, t-test) that are common in science education research for group comparisons.
Ability evaluation by binary tests: Problems, challenges & recent advances
NASA Astrophysics Data System (ADS)
Bashkansky, E.; Turetsky, V.
2016-11-01
Binary tests designed to measure abilities of objects under test (OUTs) are widely used in different fields of measurement theory and practice. The number of test items in such tests is usually very limited. The response to each test item provides only one bit of information per OUT. The problem of correct ability assessment is even more complicated when the levels of difficulty of the test items are unknown beforehand. This fact makes the search for effective ways of planning and processing the results of such tests highly relevant. In recent years, there has been some progress in this direction, generated by both the development of computational tools and the emergence of new ideas. The latter are associated with the use of so-called “scale invariant item response models”. Together with the maximum likelihood estimation (MLE) approach, they helped to solve some problems of engineering and proficiency testing. However, several issues related to the assessment of uncertainties, replications scheduling, the use of placebo, as well as evaluation of multidimensional abilities still present a challenge for researchers. The authors attempt to outline the ways to solve the above problems.
Saltychev, Mikhail; Mattie, Ryan; McCormick, Zachary; Laimi, Katri
2017-05-13
The Neck Disability Index (NDI) is commonly used for clinical and research assessment for chronic neck pain, yet the original version of this tool has not undergone significant validity testing, and in particular, there has been minimal assessment using Item Response Theory. The goal of the present study was to investigate the psychometric properties of the original version of the NDI in a large sample of individuals with chronic neck pain by defining its internal consistency, construct structure and validity, and its ability to discriminate between different degrees of functional limitation. This is a cross-sectional cohort study of 585 consecutive patients with chronic neck pain seen in a university hospital rehabilitation clinic. Internal consistency was evaluated using Cronbach's alpha, construct structure was evaluated by exploratory factor analysis, and discrimination ability was determined by Item Response Theory. The NDI demonstrated good internal consistency assessed by Cronbach's alpha (0.87). The exploratory factor analysis identified only one factor with an eigenvalue considered significant (cutoff 1.0). When analyzed by Item Response Theory, eight out of 10 items demonstrated almost ideal difficulty parameter estimates. In addition, eight out of 10 items showed high to perfect estimates of discrimination ability (overall range 0.8 to 2.9). Amongst patients with chronic neck pain, the NDI was found to have good internal consistency, have unidimensional properties, and an excellent ability to distinguish patients with different levels of perceived disability. Implications for Rehabilitation: The Neck Disability Index has good internal consistency, unidimensional properties, and an excellent ability to distinguish patients with different levels of perceived disability. The Neck Disability Index is recommended for use when selecting patients for rehabilitation, setting rehabilitation goals, and measuring the outcome of intervention.
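Internal consistency in the abstract above is summarised with Cronbach's alpha. A minimal sketch of the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), is given below. The function name and the small score matrix are hypothetical and unrelated to the NDI data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    scores: list of respondents, each a list of k item scores.
    Uses sample variances (n - 1 denominator) throughout; undefined
    when the total scores have zero variance.
    """
    k = len(scores[0])
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


# Hypothetical responses: 5 respondents by 4 items on a 0-5 scale.
ratings = [
    [0, 1, 1, 0],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 5, 4, 4],
    [1, 2, 1, 2],
]
print(round(cronbach_alpha(ratings), 3))
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is usually read alongside a check of dimensionality such as the factor analysis described above.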
Echeverri, Margarita; Anderson, David; Nápoles, Anna María
2016-01-01
Objective: Describe adaptation and initial validation of the Cancer Health Literacy Test (CHLT) for Spanish-speakers. Methods: Cross-sectional field test of the CHLT Spanish version (CHLT-30-DKspa) among healthy Latinos in Louisiana. Diagonally Weighted Least Squares estimation was used to confirm the factor structure. Item-Response Analysis using 2-parameter logistic estimates was used to identify questions that may require modification to avoid bias. Cronbach's alpha coefficients estimated scale internal consistency reliability. Analysis of variance was used to test for significant differences in CHLT-30-DKspa scores by gender, origin, age and education. Results: Mean CHLT-30-DKspa score (N=400) was 17.13 (range 0 to 30; SD 6.65). Results confirmed a unidimensional structure (χ²[405]=461.55, p=.027, CFI=.993, TLI=.992, RMSEA=.0180). Cronbach's alpha was 0.88. Items Q1-High calorie and Q15-Tumor spread had the lowest item-scale correlations (.148 and .288) and standardized factor loadings (.152 and .302). Items Q1-High Calories, Q8-Palliative Care, and Q19-Smoking Risk had the highest item-difficulty parameters (diff=1.12, 1.21, and 2.40). Conclusions: Results generally supported the applicability of the CHLT-30-DKspa for Spanish-speaking healthy populations, with the exception of four items that need to be deleted or revised and further studied (Q1, Q8, Q15, and Q19). Practical Implications: The CHLT-30-DKspa can be used to assess cancer health literacy among Spanish-speaking populations to advance research on cancer health literacy and outcomes. PMID:27043760
Odukoya, Jonathan A; Adekeye, Olajide; Igbinoba, Angie O; Afolabi, A
2018-01-01
Teachers and students worldwide often dance to the tune of tests and examinations. Assessments are powerful tools for catalyzing the achievement of educational goals, especially if done rightly. One of the tools for 'doing it rightly' is item analysis. The core objectives for this study, therefore, were: ascertaining the item difficulty and distractive indices of the university-wide courses. A range of 112-1956 undergraduate students participated in this study. With the use of secondary data, the ex-post facto design was adopted for this project. In virtually all cases, the majority of the items (ranging between 65% and 97% of the 70 items fielded in each course) did not meet psychometric standards in terms of difficulty and distractive indices and consequently needed to be moderated or deleted. Considering the importance of these courses, the need to apply item analyses when developing these tests was emphasized.
ERIC Educational Resources Information Center
Marie, S. Maria Josephine Arokia; Edannur, Sreekala
2015-01-01
This paper focused on the analysis of test items constructed in the paper of teaching Physical Science for B.Ed. class. It involved the analysis of difficulty level and discrimination power of each test item. Item analysis allows selecting or omitting items from the test, but more importantly item analysis is a tool to help the item writer improve…
Park, In Sook; Suh, Yeon Ok; Park, Hae Sook; Kang, So Young; Kim, Kwang Sung; Kim, Gyung Hee; Choi, Yeon-Hee; Kim, Hyun-Ju
2017-01-01
The purpose of this study was to improve the quality of items on the Korean Nursing Licensing Examination by developing and evaluating case-based items that reflect integrated nursing knowledge. We conducted a cross-sectional observational study to develop new case-based items. The methods for developing test items included expert workshops, brainstorming, and verification of content validity. After a mock examination of undergraduate nursing students using the newly developed case-based items, we evaluated the appropriateness of the items through classical test theory and item response theory. A total of 50 case-based items were developed for the mock examination, and content validity was evaluated. The question items integrated 34 discrete elements of integrated nursing knowledge. The mock examination was taken by 741 baccalaureate students in their fourth year of study at 13 universities. Their average score on the mock examination was 57.4, and the examination showed a reliability of 0.40. According to classical test theory, the average level of item difficulty of the items was 57.4% (80%-100% for 12 items; 60%-80% for 13 items; and less than 60% for 25 items). The mean discrimination index was 0.19, and was above 0.30 for 11 items and 0.20 to 0.29 for 15 items. According to item response theory, the item discrimination parameter (in the logistic model) was none for 10 items (0.00), very low for 20 items (0.01 to 0.34), low for 12 items (0.35 to 0.64), moderate for 6 items (0.65 to 1.34), high for 1 item (1.35 to 1.69), and very high for 1 item (above 1.70). The item difficulty was very easy for 24 items (below -2.0), easy for 8 items (-2.0 to -0.5), medium for 6 items (-0.5 to 0.5), hard for 3 items (0.5 to 2.0), and very hard for 9 items (2.0 or above). The goodness-of-fit test in terms of the 2-parameter item response model between the range of 2.0 to 0.5 revealed that 12 items had an ideal correct answer rate. 
We surmised that the low reliability of the mock examination was influenced by the timing of the test for the examinees and the inappropriate difficulty of the items. Our study suggested a methodology for the development of future case-based items for the Korean Nursing Licensing Examination.
Fitting the Rasch Model to Account for Variation in Item Discrimination
ERIC Educational Resources Information Center
Weitzman, R. A.
2009-01-01
Building on the Kelley and Gulliksen versions of classical test theory, this article shows that a logistic model having only a single item parameter can account for varying item discrimination, as well as difficulty, by using item-test correlations to adjust incorrect-correct (0-1) item responses prior to an initial model fit. The fit occurs…
Effects of Item Exposure for Conventional Examinations in a Continuous Testing Environment.
ERIC Educational Resources Information Center
Hertz, Norman R.; Chinn, Roberta N.
This study explored the effect of item exposure on two conventional examinations administered as computer-based tests. A principal hypothesis was that item exposure would have little or no effect on average difficulty of the items over the course of an administrative cycle. This hypothesis was tested by exploring conventional item statistics and…
Efforts Toward the Development of Unbiased Selection and Assessment Instruments.
ERIC Educational Resources Information Center
Rudner, Lawrence M.
Investigations into item bias provide an empirical basis for the identification and elimination of test items which appear to measure different traits across populations or cultural groups. The psychometric rationales for six approaches to the identification of biased test items are reviewed: (1) Transformed item difficulties: within-group…
Automatic Item Generation of Probability Word Problems
ERIC Educational Resources Information Center
Holling, Heinz; Bertling, Jonas P.; Zeuch, Nina
2009-01-01
Mathematical word problems represent a common item format for assessing student competencies. Automatic item generation (AIG) is an effective way of constructing many items with predictable difficulties, based on a set of predefined task parameters. The current study presents a framework for the automatic generation of probability word problems…
ERIC Educational Resources Information Center
Chauvin, Bruno; Leonova, Tamara
2016-01-01
Key concerns about the psychometric properties of the 25-item version of the Strengths and Difficulties Questionnaire (SDQ) have consistently been raised in the literature. The present study aimed at examining the meaningfulness of an alternative model to the SDQ in which 7 problematic items are excluded. French-speaking parents of 262 boys and…
Hurtado-Pardos, Barbara; Casas, Irma; Lluch-Canut, Teresa; Moreno-Arroyo, Carmen; Nebot-Bergua, Carlos; Roldán-Merino, Juan
2018-01-02
The aim of this study was to design and validate an instrument to measure wellness among university nursing faculty. The study was performed in two phases. Phase I consisted of the development of the instrument with discussion groups and participant consensus. We designed an instrument including the 21 items or psychosocial risk factors identified and estimated an index by evaluating the frequency and intensity of each item. The items were grouped into 3 dimensions: teaching work demands, curricular demands, and organizational difficulties. In Phase II, we evaluated the psychometric properties of the tool in a sample of 263 participants. Exploratory factor analysis showed a 3-factor structure that explained 53% of the total variance. The internal consistency was 0.91 for the whole instrument. The results indicate that the tool developed is valid and reliable and may be a good instrument to monitor the wellness of university nursing faculty.
Smolen, Tomasz; Chuderski, Adam
2015-01-01
Fluid intelligence (Gf) is a crucial cognitive ability that involves abstract reasoning in order to solve novel problems. Recent research demonstrated that Gf strongly depends on the individual effectiveness of working memory (WM). We investigated a popular claim that if the storage capacity underlay the WM-Gf correlation, then such a correlation should increase with an increasing number of items or rules (load) in a Gf-test. As often no such link is observed, on that basis the storage-capacity account is rejected, and alternative accounts of Gf (e.g., related to executive control or processing speed) are proposed. Using both analytical inference and numerical simulations, we demonstrated that the load-dependent change in correlation is primarily a function of the amount of floor/ceiling effect for particular items. Thus, the item-wise WM correlation of a Gf-test depends on its overall difficulty, and the difficulty distribution across its items. When the early test items yield huge ceiling, but the late items do not approach floor, that correlation will increase throughout the test. If the early items locate themselves between ceiling and floor, but the late items approach floor, the respective correlation will decrease. For a hallmark Gf-test, the Raven-test, whose items span from ceiling to floor, the quadratic relationship is expected, and it was shown empirically using a large sample and two types of WMC tasks. In consequence, no changes in correlation due to varying WM/Gf load, or lack of them, can yield an argument for or against any theory of WM/Gf. Moreover, as the mathematical properties of the correlation formula make it relatively immune to ceiling/floor effects for overall moderate correlations, only minor changes (if any) in the WM-Gf correlation should be expected for many psychological tests.
Fraundorf, Scott H.; Benjamin, Aaron S.
2015-01-01
Information about others’ success in remembering is frequently available. For example, students taking an exam may assess its difficulty by monitoring when others turn in their exams. In two experiments, we investigated how rememberers use this information to guide recall. Participants studied paired associates, some semantically related (and thus easier to retrieve) and some unrelated (and thus harder). During a subsequent cued recall test, participants viewed fictive information about an opponent’s accuracy on each item. In Experiment 1, participants responded to each cue once before seeing the opponent’s performance and once afterwards. Participants reconsidered their responses least often when the opponent’s accuracy matched the item difficulty (easy items the opponent recalled, hard items the opponent forgot) and most often when the opponent’s accuracy and the item difficulty mismatched. When participants responded only after seeing the opponent’s performance (Experiment 2), the same mismatch conditions that led to reconsideration even produced superior recall. These results suggest that rememberers monitor whether others’ knowledge states accord or conflict with their own experience, and that this information shifts how they interrogate their memory and what they recall. PMID:26247369
ERIC Educational Resources Information Center
Jones, Andrew T.
2011-01-01
Practitioners often depend on item analysis to select items for exam forms and have a variety of options available to them. These include the point-biserial correlation, the agreement statistic, the B index, and the phi coefficient. Although research has demonstrated that these statistics can be useful for item selection, no research as of yet has…
Rasch Analysis of the Power as Knowing Participation in Change Tool--the Brazilian version.
Guedes, Erika de Souza; Orozco-Vargas, Luiz Carlos; Turrini, Ruth Natália Teresa; de Sousa, Regina Márcia Cardoso; dos Santos, Mariana Alvina; da Cruz, Diná de Almeida Lopes Monteiro
2013-01-01
The objective of this study was to evaluate the items contained in the Brazilian version of the Power as Knowing Participation in Change Tool (PKPCT). The psychometric properties of the questionnaire were investigated through Rasch analysis. Data from 952 nursing assistants and 627 baccalaureate nurses were analyzed (average age 44.1 (SD=9.5); 13.0% men). The subscales Choices, Awareness, Freedom and Involvement were tested separately and presented unidimensionality; the response categories for the items were collapsed from 7 to 3 levels and the items fit the model well, except for the following/leading item, for which the infit and outfit values were above 1.4; this item also presented Differential Item Functioning (DIF) according to the participant's role. The reliability of the items was 0.99 and the reliability of the participants ranged from 0.80 to 0.84 in the subscales. Items with extremely high levels of difficulty were not identified. The PKPCT should not be viewed as unidimensional, items with extremely high levels of difficulty need to be created for the scale, and the differential functioning of some items has to be further investigated.
Do Reading Experts Agree with MCAT Verbal Reasoning Item Classifications?
ERIC Educational Resources Information Center
Jackson, Evelyn W.; And Others
1994-01-01
Examined whether expert raters (n=5) could agree about classification of Medical College Admission Test (MCAT) items and whether they agreed with MCAT student manual in labeling skill being measured by each test item. Results revealed difficulties in replicating authors' labeling of skills for reading items on practice test provided with 1991 MCAT…
Measuring Student Learning with Item Response Theory
ERIC Educational Resources Information Center
Lee, Young-Jin; Palazzo, David J.; Warnakulasooriya, Rasil; Pritchard, David E.
2008-01-01
We investigate short-term learning from hints and feedback in a Web-based physics tutoring system. Both the skill of students and the difficulty and discrimination of items were determined by applying item response theory (IRT) to the first answers of students who are working on for-credit homework items in an introductory Newtonian physics…
Combining the Best of Two Standard Setting Methods: The Ordered Item Booklet Angoff
ERIC Educational Resources Information Center
Smith, Russell W.; Davis-Becker, Susan L.; O'Leary, Lisa S.
2014-01-01
This article describes a hybrid standard setting method that combines characteristics of the Angoff (1971) and Bookmark (Mitzel, Lewis, Patz & Green, 2001) methods. The proposed approach utilizes strengths of each method while addressing weaknesses. An ordered item booklet, with items sorted based on item difficulty, is used in combination…
Comparative Racial Analysis of Enlisted Advancement Exams: Item-Difficulty.
1975-07-01
Keywords: item analysis; promotion; racial comparison; equal opportunity. ...improving equal opportunity in career growth for minority groups. The study of exam item-difficulty levels is the first of a series of technical reports...under Exploratory Development Task Area PF55.521.032 (Contemporary Social Issues). J. J. CLARKIN, Commanding Officer. SUMMARY Purpose A number of
Psychometric Evaluation of a Cultural Competency Assessment Instrument for Health Professionals
Haywood, Sonja H.; Goode, Tawara; Gao, Yong; Smith, Kristyn; Bronheim, Suzanne; Flocke, Susan A; Zyzanski, Steve
2012-01-01
Background Few valid and reliable measures exist for health care professionals interested in determining their levels of cultural and linguistic competence. Objective To evaluate the measurement properties of the Cultural Competence Health Practitioner Assessment (CCHPA-129). Methods The CCHPA-129 is a 129-item web-based instrument, developed by the National Center for Cultural Competence (NCCC). Responses on the CCHPA-129 were examined using factor analysis; Rasch modeling; and Differential Item Functioning (DIF) across race, ethnicity, gender, and profession. Subjects 2504 practitioners, including 1864 nurses (RN/LPN/BSN); 341 clinicians (PA/NP); and 299 physicians (MD/DO), who completed the CCHPA-129 online between 2005 and 2008. Results Three factors representing domains of knowledge, adapting practice, and promoting health for culturally and linguistically diverse populations accounted for 46% of the variance. Among Knowledge factor items, 53% (23/43) fit the Rasch model, item difficulties ranged from −1.01 logits (least difficult) to +1.11 logits (most difficult), separation index (SI) 13.82, and Cronbach’s α 0.92. Forty-seven percent (21/44) of Adapting Practice factor items fit the model, item difficulties −0.07 to +1.11 logits, SI 11.59, Cronbach’s α 0.88; and 58% (23/39) of Promoting Health factor items fit the model, item difficulties −1.01 to +1.38 logits, SI 22.64, Cronbach’s α 0.92. Early evidence of validity was established by known groups having statistically different scores. Conclusion The 67-item CCHPA-67 is psychometrically sound. This shortened instrument can be used to establish associations between practitioners’ cultural and linguistic competence and health outcomes as well as to evaluate interventions to increase practitioners’ cultural and linguistic competence. PMID:22437625
ERIC Educational Resources Information Center
Quaigrain, Kennedy; Arhin, Ato Kwamina
2017-01-01
Item analysis is essential in improving items which will be used again in later tests; it can also be used to eliminate misleading items in a test. The study focused on item and test quality and explored the relationship between difficulty index (p-value) and discrimination index (DI) with distractor efficiency (DE). The study was conducted among…
ERIC Educational Resources Information Center
Masters, James S.
2010-01-01
With the need for larger and larger banks of items to support adaptive testing and to meet security concerns, large-scale item generation is a requirement for many certification and licensure programs. As part of the mass production of items, it is critical that the difficulty and the discrimination of the items be known without the need for…
Item response theory and the measurement of motor behavior.
Safrit, M J; Cohen, A S; Costa, M G
1989-12-01
Item response theory (IRT) has been the focus of intense research and development activity in educational and psychological measurement during the past decade. Because this theory can provide more precise information about test items than other theories usually used in measuring motor behavior, the application of IRT in physical education and exercise science merits investigation. In IRT, the difficulty level of each item (e.g., trial or task) can be estimated and placed on the same scale as the ability of the examinee. Using this information, the test developer can determine the ability levels at which the test functions best. Equating the scores of individuals on two or more items or tests can be handled efficiently by applying IRT. The precision of the identification of performance standards in a mastery test context can be enhanced, as can adaptive testing procedures. In this tutorial, several potential benefits of applying IRT to the measurement of motor behavior were described. An example is provided using bowling data and applying the graded-response form of the Rasch IRT model. The data were calibrated and the goodness of fit was examined. This analysis is described in a step-by-step approach. Limitations to using an IRT model with a test consisting of repeated measures were noted.
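The statement that item difficulty and examinee ability sit on the same scale can be made concrete with the dichotomous Rasch model. This is a sketch of the simplest case only; the tutorial itself applies a graded-response form to the bowling data, which is not reproduced here. The symbols $\theta_n$ (ability of examinee $n$), $b_i$ (difficulty of item $i$), and $X_{ni}$ (scored response) are notation assumed for illustration.

```latex
% Dichotomous Rasch model: probability that examinee n succeeds on item i.
% theta_n and b_i are both expressed in logits, i.e. on one common scale.
P(X_{ni} = 1 \mid \theta_n, b_i)
  = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}

% Equivalently, the log-odds of success is the ability-difficulty difference:
\ln \frac{P(X_{ni} = 1)}{1 - P(X_{ni} = 1)} = \theta_n - b_i
```

When $\theta_n = b_i$ the success probability is 0.5. Because the item's Fisher information, $P(1 - P)$, peaks at that point, a test is most precise for examinees whose ability is near the difficulties of its items.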
A Comparison of Alternate-Choice and True-False Item Forms Used in Classroom Examinations.
ERIC Educational Resources Information Center
Maihoff, N. A.; Mehrens, Wm. A.
A comparison is presented of alternate-choice and true-false item forms used in an undergraduate natural science course. The alternate-choice item is a modified two-choice multiple-choice item in which the two responses are included within the question stem. This study (1) compared the difficulty level, discrimination level, reliability, and…
ERIC Educational Resources Information Center
Freund, Philipp Alexander; Hofer, Stefan; Holling, Heinz
2008-01-01
Figural matrix items are a popular task type for assessing general intelligence (Spearman's g). Items of this kind can be constructed rationally, allowing the implementation of computerized generation algorithms. In this study, the influence of different task parameters on the degree of difficulty in matrix items was investigated. A sample of N =…
Item Difficulty in the Evaluation of Computer-Based Instruction: An Example from Neuroanatomy
ERIC Educational Resources Information Center
Chariker, Julia H.; Naaz, Farah; Pani, John R.
2012-01-01
This article reports large item effects in a study of computer-based learning of neuroanatomy. Outcome measures of the efficiency of learning, transfer of learning, and generalization of knowledge diverged by a wide margin across test items, with certain sets of items emerging as particularly difficult to master. In addition, the outcomes of…
Francis, Wendy S; Tokowicz, Natasha; Kroll, Judith F
2014-01-01
Repetition priming was used to assess how proficiency and the ease or difficulty of lexical access influence bilingual translation. Two experiments, conducted at different universities with different Spanish-English bilingual populations and materials, showed repetition priming in word translation for same-direction and different-direction repetitions. Experiment 1, conducted in an English-dominant environment, revealed an effect of translation direction but not of direction match, whereas Experiment 2, conducted in a more balanced bilingual environment, showed an effect of direction match but not of translation direction. A combined analysis on the items common to both studies revealed that bilingual proficiency was negatively associated with response time (RT), priming, and the degree of translation asymmetry in RTs and priming. An item analysis showed that item difficulty was positively associated with RTs, priming, and the benefit of same-direction over different-direction repetition. Thus, although both participant accuracy and item accuracy are indices of learning, they have distinct effects on translation RTs and on the learning that is captured by the repetition-priming paradigm.
The second version of the L. V. Prasad-functional vision questionnaire.
Gothwal, Vijaya K; Sumalini, Rebecca; Bharani, Seelam; Reddy, Shailaja P; Bagga, Deepak K
2012-11-01
The L. V. Prasad-Functional Vision Questionnaire (LVP-FVQ) was developed using Rasch analysis to assess self-reported difficulties in performing daily tasks in school children with visual impairment (VI) in India. However, the LVP-FVQ has psychometric problems of inadequate measurement precision and lack of detailed assessment of dimensionality. Furthermore, items pertaining to use of technology are lacking. The aim of this study was to present the development and validation of the second version of LVP-FVQ (LVP-FVQ II). Development of LVP-FVQ II involved extracting items from other similar questionnaires (albeit developed for Western populations) and focus group discussions of children with VI and their parents that resulted in a 32-item pilot questionnaire. Overall, six items from the LVP-FVQ were retained. The questionnaire underwent pilot testing in 25 such children, following which a 27-item LVP-FVQ II emerged, and this was administered to 150 children with VI. Response to each item was rated on a three-category scale. Rasch analysis was used to validate the LVP-FVQ II. Participants used the rating scale as intended. Four mobility-related items required deletion, as these did not contribute toward measurement of a single construct, indicating a secondary dimension. Deletion of the four items resulted in the 23-item unidimensional LVP-FVQ II, with good measurement precision, effective targeting of item difficulty to participant ability, and lack of notable differential item functioning. The LVP-FVQ II has high reliability, indicating that it can effectively discriminate between levels of visual disability among school children in India, and is valid across age, gender, duration of VI, and location of residence. Given the superior measurement properties and the interval-level scores, the LVP-FVQ II appears to offer advantages over LVP-FVQ in assessment of difficulties in performing daily tasks in this population. It can be adapted for use in other developing countries.
Item Difficulty in the Evaluation of Computer-Based Instruction: An Example from Neuroanatomy
Chariker, Julia H.; Naaz, Farah; Pani, John R.
2012-01-01
This article reports large item effects in a study of computer-based learning of neuroanatomy. Outcome measures of the efficiency of learning, transfer of learning, and generalization of knowledge diverged by a wide margin across test items, with certain sets of items emerging as particularly difficult to master. In addition, the outcomes of comparisons between instructional methods changed with the difficulty of the items to be learned. More challenging items better differentiated between instructional methods. This set of results is important for two reasons. First, it suggests that instruction may be more efficient if sets of consistently difficult items are the targets of instructional methods particularly suited to them. Second, there is wide variation in the published literature regarding the outcomes of empirical evaluations of computer-based instruction. As a consequence, many questions arise as to the factors that may affect such evaluations. The present paper demonstrates that the level of challenge in the material that is presented to learners is an important factor to consider in the evaluation of a computer-based instructional system. PMID:22231801
2011-01-01
Background The quality of data in national health information systems has been questionable in most developing countries. However, the mechanisms of errors in the case identification process are not fully understood. This study aimed to investigate the mechanisms of errors in the case identification process in the existing routine health information system (RHIS) in the Philippines by measuring the risk of committing errors for health program indicators used in the Field Health Services Information System (FHSIS 1996), and characterizing those indicators accordingly. Methods A structured questionnaire on the definitions of 12 selected indicators in the FHSIS was administered to 132 health workers in 14 selected municipalities in the province of Palawan. A proportion of correct answers (difficulty index) and a disparity of two proportions of correct answers between higher and lower scored groups (discrimination index) were calculated, and the patterns of wrong answers for each of the 12 items were abstracted from 113 valid responses. Results None of 12 items reached a difficulty index of 1.00. The average difficulty index of 12 items was 0.266 and the discrimination index that showed a significant difference was 0.216 and above. Compared with these two cut-offs, six items showed non-discrimination against lower difficulty indices of 0.035 (4/113) to 0.195 (22/113), two items showed a positive discrimination against lower difficulty indices of 0.142 (16/113) and 0.248 (28/113), and four items showed a positive discrimination against higher difficulty indices of 0.469 (53/113) to 0.673 (76/113). 
Conclusions The results suggest three characteristics of definitions of indicators such as those that are (1) unsupported by the current conditions in the health system, i.e., (a) data are required from a facility that cannot directly generate the data and, (b) definitions of indicators are not consistent with its corresponding program; (2) incomplete or ambiguous, which allow several interpretations; and (3) complete yet easily misunderstood by health workers. Taking systemic factors into account, the case identification step needs to be reviewed and designed to generate intended data in health information systems. PMID:21995369
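The two classical indices defined in the abstract above (proportion of correct answers, and the difference in that proportion between higher- and lower-scoring groups) can be sketched as follows. The abstract does not say how the two groups were formed, so the upper/lower split by total score and the `fraction` parameter are assumptions. The function names and toy data are hypothetical, not the study's data.

```python
def difficulty_index(responses, item):
    """Proportion of respondents answering `item` correctly (higher = easier)."""
    return sum(r[item] for r in responses) / len(responses)


def discrimination_index(responses, item, fraction=0.5):
    """Difference in proportion correct between upper and lower score groups.

    Respondents are ranked by total score; the top and bottom `fraction`
    of the sample form the two groups (the split rule is an assumption).
    """
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(r[item] for r in upper) / k
    p_lower = sum(r[item] for r in lower) / k
    return p_upper - p_lower


# Toy data: 6 respondents x 3 items, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

for item in range(3):
    p = difficulty_index(responses, item)
    d = discrimination_index(responses, item)
    print(f"item {item}: difficulty={p:.3f} discrimination={d:.3f}")

# The reported indices are simple proportions of the 113 valid responses,
# e.g. 4 correct answers out of 113:
print(round(4 / 113, 3))  # 0.035
```

Note that under this definition a low difficulty index means a hard item, which is how the abstract uses the term when it describes indices of 0.035 to 0.195 as "lower".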
Redintegration, task difficulty, and immediate serial recall tasks.
Ritchie, Gabrielle; Tolan, Georgina Anne; Tehan, Gerald
2015-03-01
While current theoretical models remain somewhat inconclusive in their explanation of short-term memory (STM), many theories suggest at least a contribution of long-term memory (LTM) to the short-term system. A number of researchers refer to this process as redintegration (e.g., Schweickert, 1993). Under short-term recall conditions, the current study investigated the effects of redintegration and task difficulty in order to extend research conducted by Neale and Tehan (2007). Thirty participants in Experiment 1 and 26 participants in Experiment 2 completed a serial recall task in which retention interval, presentation rate, and articulatory suppression were used to modify task difficulty. Redintegration was examined by manipulating the characteristics of the to-be-remembered items; lexicality in Experiment 1 and wordlikeness in Experiment 2. Responses were scored based on correct-in-position recall, item scoring, and order accuracy scoring. In line with the Neale and Tehan results, as the difficulty of the task increased so did the effects of redintegration. This was evident in that the advantage for words in Experiment 1 and wordlikeness in Experiment 2 decreased as task difficulty increased. This relationship was observed for item but not order memory, and findings were discussed in relation to the theory of redintegration. (PsycINFO Database Record (c) 2015 APA, all rights reserved).
ERIC Educational Resources Information Center
Engelen, Ron J. H.; And Others
Fisher's information measure for the item difficulty parameter in the Rasch model and its marginal and conditional formulations are investigated. It is shown that expected item information in the unconditional model equals information in the marginal model, provided the assumption of sampling examinees from an ability distribution is made. For the…
ERIC Educational Resources Information Center
Alberta Dept. of Education, Edmonton.
This document outlines the use of machine-scorable open-ended questions for the evaluation of Physics 30 in Alberta. Contents include: (1) an introduction to the questions; (2) sample instruction sheet; (3) fifteen sample items; (4) item information including the key, difficulty, and source of each item; (5) solutions to items having multiple…
2017-01-01
Background Palliative care is nowadays essential in nursing care, due to the increasing number of patients who require attention in the final stages of their life. Nurses need to acquire specific knowledge and abilities to provide quality palliative care. The Palliative Care Quiz for Nurses (PCQN) is a questionnaire that evaluates their basic knowledge about palliative care, but its adaptation into the Spanish language and the analysis of its effectiveness and utility for Spanish culture are lacking. Purpose To report the adaptation into the Spanish language and the psychometric analysis of the Palliative Care Quiz for Nurses. Method The Palliative Care Quiz for Nurses-Spanish Version (PCQN-SV) was obtained from a process including translation, back-translation, comparison with versions in other languages, revision by experts, and pilot study. Content validity and reliability of the questionnaire were analyzed. Difficulty and discrimination indexes of each item were also calculated according to Item Response Theory (IRT). Findings Adequate internal consistency was found (S-CVI = 0.83); a Cronbach's alpha coefficient of 0.67 and a KR-20 test result of 0.72 reflected the reliability of the PCQN-SV. The questionnaire had a global difficulty index of 0.55, with six items which could be considered difficult or very difficult, and five items which could be considered easy or very easy. The discrimination indexes of the 20 items show that eight items are good or very good, while six items discriminate poorly between good and bad respondents.
Discussion Although it shows internal consistency, reliability and difficulty indexes similar to those obtained by versions of the PCQN in other languages, the items with the lowest content validity or discrimination indexes, and those that respondents had difficulty understanding, should be reformulated in order to improve the PCQN-SV. Conclusion The PCQN-SV is a useful Spanish language instrument for measuring Spanish nurses’ knowledge in palliative care and it is adequate to establish international comparisons. PMID:28545037
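The KR-20 coefficient reported above is a reliability estimate for dichotomously scored items. The following is a minimal sketch of the standard formula, KR-20 = k/(k-1) * (1 - sum(p_i * q_i) / variance of total scores). It assumes the population (divide-by-n) variance, which is one common convention. The function name and toy data are hypothetical and unrelated to the PCQN-SV data.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for a matrix of 0/1 item scores.

    `responses` is a list of respondents, each a list of k item scores.
    Uses the population variance of total scores (an assumed convention).
    """
    n = len(responses)
    k = len(responses[0])
    totals = [sum(r) for r in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n

    # Sum over items of p * q, where p is the proportion correct and q = 1 - p.
    sum_pq = 0.0
    for i in range(k):
        p = sum(r[i] for r in responses) / n
        sum_pq += p * (1.0 - p)

    return (k / (k - 1)) * (1.0 - sum_pq / var_total)


# Toy data: 6 respondents x 3 items, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

print(f"KR-20 = {kr20(responses):.3f}")
```

With so few items and respondents the toy coefficient is low; the value is only meant to show the arithmetic, not a realistic reliability.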
Chover-Sierra, Elena; Martínez-Sabater, Antonio; Lapeña-Moñux, Yolanda Raquel
2017-01-01
Palliative care is nowadays essential in nursing care, due to the increasing number of patients who require attention in the final stages of their life. Nurses need to acquire specific knowledge and abilities to provide quality palliative care. The Palliative Care Quiz for Nurses (PCQN) is a questionnaire that evaluates their basic knowledge about palliative care, but its adaptation into the Spanish language and the analysis of its effectiveness and utility for Spanish culture are lacking. To report the adaptation into the Spanish language and the psychometric analysis of the Palliative Care Quiz for Nurses. The Palliative Care Quiz for Nurses-Spanish Version (PCQN-SV) was obtained from a process including translation, back-translation, comparison with versions in other languages, revision by experts, and pilot study. Content validity and reliability of the questionnaire were analyzed. Difficulty and discrimination indexes of each item were also calculated according to Item Response Theory (IRT). Adequate internal consistency was found (S-CVI = 0.83); a Cronbach's alpha coefficient of 0.67 and a KR-20 test result of 0.72 reflected the reliability of the PCQN-SV. The questionnaire had a global difficulty index of 0.55, with six items which could be considered difficult or very difficult, and five items which could be considered easy or very easy. The discrimination indexes of the 20 items show that eight items are good or very good, while six items discriminate poorly between good and bad respondents. Although it shows internal consistency, reliability and difficulty indexes similar to those obtained by versions of the PCQN in other languages, the items with the lowest content validity or discrimination indexes, and those that respondents had difficulty understanding, should be reformulated in order to improve the PCQN-SV.
The PCQN-SV is a useful Spanish language instrument for measuring Spanish nurses' knowledge in palliative care and it is adequate to establish international comparisons.
Fractionating the Neural Substrates of Incidental Recognition Memory
ERIC Educational Resources Information Center
Greene, Ciara M.; Vidaki, Kleio; Soto, David
2015-01-01
Familiar stimuli are typically accompanied by decreases in neural response relative to the presentation of novel items, but these studies often include explicit instructions to discriminate old and new items; this creates difficulties in partialling out the contribution of top-down intentional orientation to the items based on recognition goals.…
ERIC Educational Resources Information Center
Gaitas, Sérgio; Alves Martins, Margarida
2017-01-01
This study analyses teacher perceived difficulty in implementing differentiated instructional strategies in regular classes. The participants were 273 Portuguese primary school teachers with teaching experience ranging from 1 to 33 years. A 39-item questionnaire was used to evaluate teacher perceived difficulty in relation to different…
Measuring and Predicting Graded Reader Difficulty
ERIC Educational Resources Information Center
Holster, Trevor A.; Lake, J. W.; Pellowe, William R.
2017-01-01
This study used many-faceted Rasch measurement to investigate the difficulty of graded readers using a 3-item survey. Book difficulty was compared with Kyoto Level, Yomiyasusa Level, Lexile Level, book length, mean sentence length, and mean word frequency. Word frequency and Kyoto Level were found to be ineffective in predicting students'…
Critical success factors in awareness of and choice towards low vision rehabilitation.
Fraser, Sarah A; Johnson, Aaron P; Wittich, Walter; Overbury, Olga
2015-01-01
The goal of the current study was to examine the critical factors indicative of an individual's choice to access low vision rehabilitation services. Seven hundred and forty-nine visually impaired individuals, from the Montreal Barriers Study, completed a structured interview and questionnaires (on visual function, coping, depression, satisfaction with life). Seventy-five factors from the interview and questionnaires were entered into a data-driven Classification and Regression Tree Analysis in order to determine the best predictors of awareness group: positive personal choice (I knew and I went), negative personal choice (I knew and did not go), and lack of information (Nobody told me, and I did not know). Having a response of moderate to no difficulty on item 6 (reading signs) of the Visual Function Index 14 (VF-14) indicated that the person had made a positive personal choice to seek rehabilitation, whereas reporting a great deal of difficulty on this item was associated with a lack of information on low vision rehabilitation. In addition to this factor, symptom duration of under nine years, moderate difficulty or less on item 5 (seeing steps or curbs) of the VF-14, and an indication of little difficulty or less on item 3 (reading large print) of the VF-14 further identified those who were more likely to have made a positive personal choice. Individuals in the lack of information group also reported greater difficulty on items 3 and 5 of the VF-14 and were more likely to be male. The duration-of-symptoms factor suggests that, even in the positive choice group, it may be best to offer rehabilitation services early. Being male and responding moderate difficulty or greater to the VF-14 questions about far, medium-distance and near situations involving vision was associated with individuals that lack information. Consequently, these individuals may need additional education about the benefits of low vision services in order to make a positive personal choice. 
© 2014 The Authors Ophthalmic & Physiological Optics © 2014 The College of Optometrists.
Choi, Bongsam
2018-01-01
[Purpose] This study aimed to cross-culturally adapt and validate the Korean version of a physical activity measure (K-PAM) for community-dwelling elderly people. [Subjects and Methods] One hundred thirty-eight community-dwelling elderly people (32 male, 106 female) participated in the study. All participants were asked to fill out a 51-item questionnaire measuring perceived difficulty in the activities of daily living (ADL) for the elderly. The one-parameter model of item response theory (Rasch analysis) was applied to determine the construct validity and to inspect item-level psychometric properties of the 51 ADL items of the K-PAM. [Results] Person separation reliability (analogous to Cronbach's alpha) for internal consistency ranged from 0.93 to 0.94. A total of 16 items misfit the Rasch model. After deletion of the misfitting items, 35 ADL items of the K-PAM were placed in an empirically meaningful hierarchy from easy to hard. The item-person map analysis showed that item difficulty was well matched to elderly participants with moderate and low ability, but that there was a ceiling effect for those with high ability. [Conclusion] The cross-culturally adapted K-PAM was shown to have sufficient construct validity and stable psychometric properties, as confirmed by person separation reliability and fit statistics.
Increased susceptibility to proactive interference in adults with dyslexia?
Bogaerts, Louisa; Szmalec, Arnaud; Hachmann, Wibke M; Page, Mike P A; Woumans, Evy; Duyck, Wouter
2015-01-01
Recent findings show that people with dyslexia have an impairment in serial-order memory. Based on these findings, the present study aimed to test the hypothesis that people with dyslexia have difficulties dealing with proactive interference (PI) in recognition memory. A group of 25 adults with dyslexia and a group of matched controls were subjected to a 2-back recognition task, which required participants to indicate whether an item (mis)matched the item that had been presented 2 trials before. PI was elicited using lure trials in which the item matched the item in the 3-back position instead of the targeted 2-back position. Our results demonstrate that the introduction of lure trials affected 2-back recognition performance more severely in the dyslexic group than in the control group, suggesting greater difficulty in resisting PI in dyslexia.
Shift in sodium chloride sources in past 10 years of salt reduction campaign in Japan.
Shimbo, S; Hatai, I; Saito, T; Yokota, M; Imai, Y; Watanabe, T; Moon, C S; Zhang, Z W; Ikeda, M
1996-11-01
Twenty-four-hour total food duplicate samples were collected from nonsmoking housewives (aged mostly 30 to 60 years) twice at a 10-year interval in winter seasons, once around 1980 and again around 1990, in 11 prefectures in Japan. In practice, 342 and 472 samples were obtained in the 1980 and 1990 studies, respectively. Sodium chloride (NaCl) intake via each food item was estimated from the weight of the item in the duplicate. The comparison of the 1990 results with the 1980 results showed that the total NaCl intake (i.e., NaCl intake via all food items) decreased after a 10-year campaign to lower salt intake. The NaCl/energy ratio, however, stayed essentially unchanged. Whereas NaCl intake via pickles decreased remarkably and that via miso paste [a fermentation product of soy bean, rice (or wheat) and salt] decreased slightly, the decreases were counteracted by a substantial increase in NaCl intake via soy bean sauce. The meaning of this unexpected counteraction was discussed in relation to the difficulties in the campaign to lower salt intake.
ERIC Educational Resources Information Center
Lee, Young-Sun; Krishnan, Anita; Park, Yoon Soo
2012-01-01
The purpose of this study was to investigate psychometric properties of the Children's Depression Inventory within a nonclinical and longitudinal sample (8th and 12th grades). Using the Rasch rating scale, most items represented one dimension. There was adequate separation among items and no overlap between ranges of item difficulties with latent…
ERIC Educational Resources Information Center
Atalmis, Erkan Hasan
2016-01-01
Multiple-choice (MC) items are commonly used in high-stake tests. Thus, each item of such tests should be meticulously constructed to increase the accuracy of decisions based on test results. Haladyna and his colleagues (2002) addressed the valid item-writing guidelines to construct high quality MC items in order to increase test reliability and…
Maximum Likelihood Item Easiness Models for Test Theory Without an Answer Key
Batchelder, William H.
2014-01-01
Cultural consensus theory (CCT) is a data aggregation technique with many applications in the social and behavioral sciences. We describe the intuition and theory behind a set of CCT models for continuous type data using maximum likelihood inference methodology. We describe how bias parameters can be incorporated into these models. We introduce two extensions to the basic model in order to account for item rating easiness/difficulty. The first extension is a multiplicative model and the second is an additive model. We show how the multiplicative model is related to the Rasch model. We describe several maximum-likelihood estimation procedures for the models and discuss issues of model fit and identifiability. We describe how the CCT models could be used to give alternative consensus-based measures of reliability. We demonstrate the utility of both the basic and extended models on a set of essay rating data and give ideas for future research. PMID:29795812
Mackus, Marlou; Kruijff, Deborah de; Otten, Leila S; Kraneveld, Aletta D; Garssen, Johan; Verster, Joris C
2017-04-12
Altered immune functioning has been demonstrated in individuals with autism spectrum disorder (ASD). The current study explores the relationship between perceived immune functioning and experiencing ASD traits in healthy young adults. N = 410 students from Utrecht University completed a survey on immune functioning and autistic traits. In addition to a 1-item perceived immune functioning rating, the Immune Function Questionnaire (IFQ) was completed to assess perceived immune functioning. The Dutch translation of the Autism-Spectrum Quotient (AQ) was completed to examine variation in autistic traits, including the domains "social insights and behavior", "difficulties with change", "communication", "phantasy and imagination", and "detail orientation". The 1-item perceived immune functioning score did not significantly correlate with the total AQ score. However, a significant negative correlation was found between perceived immune functioning and the AQ subscale "difficulties with change" (r = -0.119, p = 0.019). In women, 1-item perceived immune functioning correlated significantly with the AQ subscales "difficulties with change" (r = -0.149, p = 0.029) and "communication" (r = -0.145, p = 0.032). In men, none of the AQ subscales significantly correlated with 1-item perceived immune functioning. In conclusion, a modest relationship between perceived immune functioning and several autistic traits was found.
Assessing the Conceptual Understanding about Heat and Thermodynamics at Undergraduate Level
ERIC Educational Resources Information Center
Kulkarni, Vasudeo Digambar; Tambade, Popat Savaleram
2013-01-01
In this study, a Thermodynamic Concept Test (TCT) was designed to assess students' conceptual understanding of heat and thermodynamics at the undergraduate level. Statistical measures such as the item difficulty index, item discrimination index, and point biserial coefficient were used to assess the TCT. For each item of the test these indices were…
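The three classical indices named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the 0/1 response matrix is hypothetical, the 27% upper/lower grouping is a common convention assumed here, and the point biserial uses the uncorrected total score (the item itself is included in the total).

```python
from statistics import mean, pstdev


def item_difficulty(responses, item):
    """Difficulty index: proportion of examinees who answered the item correctly."""
    return mean(row[item] for row in responses)


def discrimination_index(responses, item, fraction=0.27):
    """Proportion correct in the top-scoring group minus that in the bottom group.

    The 27% group size is an assumed convention, not taken from the abstract.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    upper = mean(row[item] for row in ranked[:k])
    lower = mean(row[item] for row in ranked[-k:])
    return upper - lower


def point_biserial(responses, item):
    """Correlation between a 0/1 item score and the (uncorrected) total score."""
    totals = [sum(row) for row in responses]
    scores = [row[item] for row in responses]
    p = mean(scores)
    sd = pstdev(totals)
    if sd == 0 or p in (0, 1):
        return 0.0  # undefined when there is no variance; reported as 0 here
    mean_right = mean(t for t, s in zip(totals, scores) if s == 1)
    mean_wrong = mean(t for t, s in zip(totals, scores) if s == 0)
    return (mean_right - mean_wrong) / sd * (p * (1 - p)) ** 0.5


# Hypothetical data: 5 examinees (rows) by 4 items (columns), 1 = correct.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

for j in range(4):
    print(j, item_difficulty(responses, j),
          discrimination_index(responses, j),
          round(point_biserial(responses, j), 3))
```

Note that a higher difficulty index means an easier item, since the index is the proportion answering correctly.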
Modeling Booklet Effects for Nonequivalent Group Designs in Large-Scale Assessment
ERIC Educational Resources Information Center
Hecht, Martin; Weirich, Sebastian; Siegle, Thilo; Frey, Andreas
2015-01-01
Multiple matrix designs are commonly used in large-scale assessments to distribute test items to students. These designs comprise several booklets, each containing a subset of the complete item pool. Besides reducing the test burden of individual students, using various booklets allows aligning the difficulty of the presented items to the assumed…
Effects of Using Modified Items to Test Students with Persistent Academic Difficulties
ERIC Educational Resources Information Center
Elliott, Stephen N.; Kettler, Ryan J.; Beddow, Peter A.; Kurz, Alexander; Compton, Elizabeth; McGrath, Dawn; Bruen, Charles; Hinton, Kent; Palmer, Porter; Rodriguez, Michael C.; Bolt, Daniel; Roach, Andrew T.
2010-01-01
This study investigated the effects of using modified items in achievement tests to enhance accessibility. An experiment determined whether tests composed of modified items would reduce the performance gap between students eligible for an alternate assessment based on modified achievement standards (AA-MAS) and students not eligible, and the…
A Five-Year Evaluation of Examination Structure in a Cardiovascular Pharmacotherapy Course
Kolar, Claire; Janke, Kristin K.
2015-01-01
Objective. To evaluate the composition and effectiveness as an assessment tool of a criterion-referenced examination comprised of clinical cases tied to practice decisions, to examine the effect of varying audience response system (ARS) questions on student examination preparation, and to articulate guidelines for structuring examinations to maximize evaluation of student learning. Design. Multiple-choice items developed over 5 years were evaluated using Bloom’s Taxonomy classification, point biserial correlation, item difficulty, and grade distribution. In addition, examination items were classified into categories based on similarity to items used in ARS preparation. Assessment. As the number of items directly tied to clinical practice rose, Bloom’s Taxonomy level and item difficulty also rose. In examination years where Bloom’s levels were high but preparation was minimal, average grade distribution was lower compared with years in which student preparation was higher. Conclusion. Criterion-referenced examinations can benefit from systematic evaluation of their composition and effectiveness as assessment tools. Calculated design and delivery of classroom preparation is an asset in improving examination performance on rigorous, practice-relevant examinations. PMID:27168611
Conditional statistical inference with multistage testing designs.
Zwitser, Robert J; Maris, Gunter
2015-03-01
In this paper it is demonstrated how statistical inference from multistage test designs can be made based on the conditional likelihood. Special attention is given to parameter estimation, as well as the evaluation of model fit. Two reasons are provided why the fit of simple measurement models is expected to be better in adaptive designs, compared to linear designs: more parameters are available for the same number of observations; and undesirable response behavior, like slipping and guessing, might be avoided owing to a better match between item difficulty and examinee proficiency. The results are illustrated with simulated data, as well as with real data.
Kılıç, Aslı; Hoyer, William J; Howard, Marc W
2013-01-01
BACKGROUND/STUDY CONTEXT: Older adults exhibit an age-related deficit in item memory as a function of the length of the retention interval, but older adults and young adults usually show roughly equivalent benefits due to the spacing of item repetitions in continuous memory tasks. The current experiment investigates the seemingly paradoxical effects of retention interval and spacing in young and older adults using a continuous recognition memory procedure. Fifty young adults and 52 older adults gave memory confidence ratings to words that were presented once (P1), twice (P2), or three times (P3), and the effects of the lag length and retention interval were assessed at P2 and at P3, respectively. Response times at P2 were disproportionately longer for older adults than for younger adults as a function of the number of items occurring between P1 and P2, suggestive of age-related loss in item memory. Ratings of confidence in memory responses revealed that older adults remembered fewer items at P2 with a high degree of certainty. Confidence ratings given at P3 suggested that young and older adults derived equivalent benefits from the spacing between P1 and P2. Findings of this study support theoretical accounts that suggest that recursive reminding and/or item retrieval difficulty promote item retention in older adults.
ERIC Educational Resources Information Center
Carroll, H. C. M.
2013-01-01
Two complementary studies of poor and better attenders are presented. To measure emotional and behavioural difficulties (EBD) different teacher-completed rating scales were employed, and to determine social difficulties, the studies used sociometry and some items from the scales. One study had a longitudinal design. It revealed that, after…
Psychometric assessment of HIV/STI sexual risk scale among MSM: a Rasch model approach.
Li, Jian; Liu, Hongjie; Liu, Hui; Feng, Tiejian; Cai, Yumao
2011-10-05
Little research has assessed the degree of severity and ordering of different types of sexual behaviors for HIV/STI infection in a measurement scale. The purpose of this study was to apply the Rasch model on psychometric assessment of an HIV/STI sexual risk scale among men who have sex with men (MSM). A cross-sectional study using respondent driven sampling was conducted among 351 MSM in Shenzhen, China. The Rasch model was used to examine the psychometric properties of an HIV/STI sexual risk scale including nine types of sexual behaviors. The Rasch analysis of the nine items met the unidimensionality and local independence assumption. Although the person reliability was low at 0.35, the item reliability was high at 0.99. The fit statistics provided acceptable infit and outfit values. Item difficulty invariance analysis showed that the item estimates of the risk behavior items were invariant (within error). The findings suggest that the Rasch model can be utilized for measuring the level of sexual risk for HIV/STI infection as a single latent construct and for establishing the relative degree of severity of each type of sexual behavior in HIV/STI transmission and acquisition among MSM. The measurement scale provides a useful measurement tool to inform, design and evaluate behavioral interventions for HIV/STI infection among MSM.
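For readers unfamiliar with the model named in this abstract, the dichotomous Rasch model can be sketched as below. The behavior labels and severity values are hypothetical placeholders chosen for illustration, not estimates from the study.

```python
import math


def rasch_probability(theta, b):
    """Probability of endorsing an item under the dichotomous Rasch model.

    theta: person location (here, level of risk) on the logit scale.
    b: item difficulty (here, severity) on the same scale.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Hypothetical item severities in logits; not values reported in the abstract.
severities = {"behavior_A": -1.5, "behavior_B": 0.0, "behavior_C": 2.0}

theta = 0.0  # a person located at the middle of the scale
for name, b in sorted(severities.items(), key=lambda kv: kv[1]):
    print(name, round(rasch_probability(theta, b), 3))
```

Because every item shares the same slope, ordering items by `b` gives the same severity ordering for every person, which is what allows the scale to rank behaviors by relative severity.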
Middle school students' reading comprehension of mathematical texts and algebraic equations
NASA Astrophysics Data System (ADS)
Duru, Adem; Koklu, Onder
2011-06-01
In this study, middle school students' abilities to translate mathematical texts into algebraic representations and vice versa were investigated. In addition, students' difficulties in making such translations and the potential sources for these difficulties were also explored. Both qualitative and quantitative methods were used to collect data for this study: questionnaire and clinical interviews. The questionnaire consisted of two general types of items: (1) selected-response (multiple-choice) items for which the respondent selects from multiple options and (2) open-ended items for which the respondent constructs a response. In order to further investigate the students' strategies while they were translating the given mathematical texts to algebraic equations and vice versa, five randomly chosen (n = 5) students were interviewed. Data were collected in the 2007-2008 school year from 185 middle-school students in five teachers' classrooms in three different schools in the city of Adıyaman, Turkey. After the analysis of data, it was found that students who participated in this study had difficulties in translating the mathematical texts into algebraic equations by using symbols. It was also observed that these students had difficulties in translating the symbolic representations into mathematical texts because of their weak reading comprehension. In addition, the findings of this research revealed that students' difficulties in translating the given mathematical texts into symbolic representations or vice versa come from different sources.
Gabay, Yafit; Karni, Avi; Banai, Karen
2017-01-01
Speech perception can improve substantially with practice (perceptual learning) even in adults. Here we compared the effects of four training protocols that differed in whether and how task difficulty was changed during a training session, in terms of the gains attained and the ability to apply (transfer) these gains to previously un-encountered items (tokens) and to different talkers. Participants trained in judging the semantic plausibility of sentences presented as time-compressed speech and were tested on their ability to reproduce, in writing, the target sentences; trial-by-trial feedback was afforded in all training conditions. In two conditions task difficulty (low or high compression) was kept constant throughout the training session, whereas in the other two conditions task difficulty was changed in an adaptive manner (incrementally from easy to difficult, or using a staircase procedure). Compared to a control group (no training), all four protocols resulted in significant post-training improvement in the ability to reproduce the trained sentences accurately. However, training in the constant-high-compression protocol elicited the smallest gains in deciphering and reproducing trained items and in reproducing novel, untrained, items after training. Overall, these results suggest that training procedures that start off with relatively little signal distortion (“easy” items, not far removed from standard speech) may be advantageous compared to conditions wherein severe distortions are presented to participants from the very beginning of the training session. PMID:28545039
Adaptive Mental Testing: The State of the Art
1979-11-01
typically vary in their psychometric properties, particularly in their difficulty, the test designer must decide what configuration of these item...psychometric properties best suits the test’s purpose. There are two extreme rationales to guide that decision. One rationale is to choose items that are...development of item response theory (Rasch, 1960; Lord, 1952, 1970, 1974a; Birnbaum, 1968) that provided the needed invariance properties for item
ERIC Educational Resources Information Center
van der Linden, Wim J.; Eggen, Theo J. H. M.
A procedure for the sequential optimization of the calibration of an item bank is given. The procedure is based on an empirical Bayes approach to a reformulation of the Rasch model as a model for paired comparisons between the difficulties of test items in which ties are allowed to occur. First, it is indicated how a paired-comparisons design…
Assessment of item-writing flaws in multiple-choice questions.
Nedeau-Cayo, Rosemarie; Laughlin, Deborah; Rus, Linda; Hall, John
2013-01-01
This study evaluated the quality of multiple-choice questions used in a hospital's e-learning system. Constructing well-written questions is fraught with difficulty, and item-writing flaws are common. Study results revealed that most items contained flaws and were written at the knowledge/comprehension level. Few items had linked objectives, and no association was found between the presence of objectives and flaws. Recommendations include education for writing test questions.
Rosneck, James S; Hughes, Joel; Gunstad, John; Josephson, Richard; Noe, Donald A; Waechter, Donna
2014-01-01
This article describes the systematic construction and psychometric analysis of a knowledge assessment instrument for phase II cardiac rehabilitation (CR) patients measuring risk modification disease management knowledge and behavioral outcomes derived from national standards relevant to secondary prevention and management of cardiovascular disease. First, using adult curriculum based on disease-specific learning outcomes and competencies, a systematic test item development process was completed by clinical staff. Second, a panel of educational and clinical experts used an iterative process to identify test content domain and arrive at consensus in selecting items meeting criteria. Third, the resulting 31-question instrument, the Cardiac Knowledge Assessment Tool (CKAT), was piloted in CR patients to ensure use of application. Validity and reliability analyses were performed on 3638 adults before test administrations with additional focused analyses on 1999 individuals completing both pretreatment and posttreatment administrations within 6 months. Evidence of CKAT content validity was substantiated, with 85% agreement among content experts. Evidence of construct validity was demonstrated via factor analysis identifying key underlying factors. Estimates of internal consistency, for example, Cronbach's α = .852 and Spearman-Brown split-half reliability = 0.817 on pretesting, support test reliability. Item analysis, using point biserial correlation, measured relationships between performance on single items and total score (P < .01). Analyses using item difficulty and item discrimination indices further verified item stability and validity of the CKAT. A knowledge instrument specifically designed for an adult CR population was systematically developed and tested in a large representative patient population, satisfying psychometric parameters, including validity and reliability.
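The two internal-consistency estimates mentioned in this abstract follow standard formulas, sketched here under stated assumptions: the score matrix is hypothetical, sample variances are used, and `spearman_brown` takes the correlation between two test halves as input. Nothing below reproduces the CKAT data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of item scores (rows = persons, columns = items)."""
    k = len(scores[0])
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def spearman_brown(r_half, factor=2):
    """Spearman-Brown prophecy formula; factor=2 steps a split-half r up to full length."""
    return factor * r_half / (1 + (factor - 1) * r_half)


# Hypothetical 0/1 scores for 4 persons on 3 items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(cronbach_alpha(scores), 3))
print(round(spearman_brown(0.5), 3))
```

Alpha rises as items covary more strongly relative to their individual variances, so it reflects both the number of items and their average intercorrelation.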
Sayer, Nina A; Frazier, Patricia; Orazem, Robert J; Murdoch, Maureen; Gravely, Amy; Carlson, Kathleen F; Hintz, Samuel; Noorbaloochi, Siamak
2011-12-01
The primary objective of this study was to describe the development, reliability, and construct validity of scores on the Military to Civilian Questionnaire (M2C-Q), a 16-item self-report measure of postdeployment community reintegration difficulty. We surveyed a national, stratified sample of 1,226 Iraq and Afghanistan veterans who used U.S. Department of Veterans Affairs (VA) medical care; 745 completed the M2C-Q and validated mental health screening measures. All analyses were based on weighted estimates. The internal consistency of the M2C-Q was .95 in this sample. Factor analyses indicated a single total score was the best-fitting model. Total scores were associated with measures theoretically related to reintegration difficulties including perception of overall difficulty readjusting back into civilian life (R(2) = .49), probable PTSD (d = 1.07), probable problem drug or alcohol use (d = 0.34), and overall mental health (r = -.83). Subgroup analyses revealed a similar pattern of findings in those who screened negative for PTSD. Nonwhite and unemployed veterans reported greater community reintegration difficulty (d = 0.20 and 0.45, respectively). Findings offer preliminary support for the reliability and construct validity of M2C-Q scores. Published 2011. This article is a US Government work and is in the public domain in the USA.
Narimoto, Tadamasa; Matsuura, Naomi; Takezawa, Tomohiro; Mitsuhashi, Yoshinori; Hiratani, Michio
2013-01-01
The authors investigated whether impaired spatial short-term memory exhibited by children with nonverbal learning disabilities is due to a problem in the encoding process. Children with or without nonverbal learning disabilities performed a simple spatial test that required them to remember 3, 5, or 7 spatial items presented simultaneously in random positions (i.e., spatial configuration) and to decide if a target item was changed or all items including the target were in the same position. The results showed that, even when the spatial positions in the encoding and probe phases were similar, the mean proportion correct of children with nonverbal learning disabilities was 0.58 while that of children without nonverbal learning disabilities was 0.84. Based on these results, the authors argue that children with nonverbal learning disabilities have difficulty encoding relational information between spatial items, and that this difficulty is responsible for their impaired spatial short-term memory.
Application of Computerized Adaptive Testing to Entrance Examination for Graduate Studies in Turkey
ERIC Educational Resources Information Center
Bulut, Okan; Kan, Adnan
2012-01-01
Problem Statement: Computerized adaptive testing (CAT) is a sophisticated and efficient way of delivering examinations. In CAT, items for each examinee are selected from an item bank based on the examinee's responses to the items. In this way, the difficulty level of the test is adjusted based on the examinee's ability level. Instead of…
Rasch Based Analysis of Oral Proficiency Test Data.
ERIC Educational Resources Information Center
Nakamura, Yuji
2001-01-01
This paper examines the rating scale data of oral proficiency tests analyzed by a Rasch Analysis focusing on an item map and factor analysis. In discussing the item map, the difficulty order of six items and students' answering patterns are analyzed using descriptive statistics and measures of central tendency of test scores. The data ranks the…
Investigating the Performance of Omega Index According to Item Parameters and Ability Levels
ERIC Educational Resources Information Center
Sunbul, Onder; Yormaz, Seha
2018-01-01
Purpose: Several studies can be found in the literature that investigate the performance of ω under various conditions. However, no study for the effects of item difficulty, item discrimination, and ability restrictions on the performance of ω could be found. The current study aims to investigate the performance of ω for the conditions given below.…
ERIC Educational Resources Information Center
Parish, Jane A.; Karisch, Brandi B.
2013-01-01
Item analysis can serve as a useful tool in improving multiple-choice questions used in Extension programming. It can identify gaps between instruction and assessment. An item analysis of Mississippi Master Cattle Producer program multiple-choice examination responses was performed to determine the difficulty of individual examinations, assess the…
Exploring the Manifestations of Anxiety in Children with Autism Spectrum Disorders
ERIC Educational Resources Information Center
Hallett, Victoria; Lecavalier, Luc; Sukhodolsky, Denis G.; Cipriano, Noreen; Aman, Michael G.; McCracken, James T.; McDougle, Christopher J.; Tierney, Elaine; King, Bryan H.; Hollander, Eric; Sikich, Linmarie; Bregman, Joel; Anagnostou, Evdokia; Donnelly, Craig; Katsovich, Lily; Dukes, Kimberly; Vitiello, Benedetto; Gadow, Kenneth; Scahill, Lawrence
2013-01-01
This study explores the manifestation and measurement of anxiety symptoms in 415 children with ASDs on a 20-item, parent-rated, DSM-IV referenced anxiety scale. In both high and low-functioning children (IQ above vs. below 70), commonly endorsed items assessed restlessness, tension and sleep difficulties. Items requiring verbal expression of worry…
Sensitivity of Equated Aggregate Scores to the Treatment of Misbehaving Common Items
ERIC Educational Resources Information Center
Michaelides, Michalis P.
2010-01-01
The delta-plot method (Angoff, 1972) is a graphical technique used in the context of test equating for identifying common items with aberrant changes in their item difficulties across administrations or alternate forms. This brief research report explores the effects on equated aggregate scores when delta-plot outliers are either retained in or…
de Sá Junior, Antonio Reis; de Andrade, Arthur Guerra; Andrade, Laura Helena; Gorenstein, Clarice; Wang, Yuan-Pang
2018-07-01
This study examines the response pattern of depressive symptoms in a nationwide student sample, through item analyses of a rating scale by both classical test theory (CTT) and item response theory (IRT). The 21-item Beck Depression Inventory-II (BDI-II) was administered to 12,711 college students. First, the psychometric properties of the scale were described. Thereafter, the endorsement probability of depressive symptoms in each scale item was analyzed through CTT and IRT. Graphical plots depicted the endorsement probability of scale items and intensity of depression. Three items of different difficulty levels were compared through the CTT and IRT approaches. Four in five students reported the presence of depressive symptoms. The BDI-II items presented good reliability and were distributed along the symptomatic continuum of depression. Similarly, in both CTT and IRT approaches, the item 'changes in sleep' was easily endorsed, 'loss of interest' moderately and 'suicidal thoughts' hardly. Graphical representation of BDI-II of both methods showed much equivalence in terms of item discrimination and item difficulty. The item characteristic curve of the IRT method provided informative evaluation of item performance. The inventory was applied only in college students. Depressive symptoms were frequent psychopathological manifestations among college students. The performance of the BDI-II items indicated convergent results from both methods of analysis. While the CTT was easy to understand and to apply, the IRT was more complex to understand and to implement. Comprehensive assessment of the functioning of each BDI-II item might be helpful in efficient detection of depressive conditions in college students. Copyright © 2018 Elsevier B.V. All rights reserved.
Benaïm, C; Perennou, D-A; Pelissier, J-Y; Daures, J-P
2010-02-01
Many clinical scales contain items that are scored separately prior to being compiled into a single score. However, if the items have different degrees of importance, they should be weighted differently before being compiled. The principal aims of this study were to show how the "analytic hierarchy process" (AHP), which has never been used for this purpose, can be applied to weighting the six items of the "London handicap scale", and to compare the AHP to the "conjoint analysis" (CA), which was previously implemented by Harwood et al. (1994) [1]. In order to assess the relative importance of the six items, we submitted AHP and CA to a group of 10 physiatrists. We compared the methods in terms of item ranking according to importance, assessment of fictitious patients based on weights determined by each method, and perceived difficulty by the physiatrist. For both techniques, "Physical independence" (PHY) was the best-weighted item, but other ranks varied depending on the technique. AHP was better than CA in terms of accuracy (global assessment of the clinical status) and perceived difficulty. AHP may be used to reveal the importance that experts assign to the items of a multidimensional scale, and to calculate the appropriate weights for specific items. For this purpose, AHP seems to be more accurate than CA.
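As a rough illustration of how the analytic hierarchy process turns pairwise importance judgments into item weights, the sketch below uses the row geometric-mean method, a common approximation to the principal-eigenvector weights. Both the method choice and the 3-item comparison matrix are assumptions made for illustration; the abstract does not state how the weights were computed, and the scale itself has six items.

```python
import math


def ahp_weights(matrix):
    """Priority weights from a reciprocal pairwise-comparison matrix.

    matrix[i][j] states how many times more important item i is than item j.
    Uses normalized row geometric means (an assumed, commonly used method).
    """
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical judgments for three items (not the London handicap scale data).
comparisons = [
    [1.0, 2.0, 6.0],
    [1 / 2, 1.0, 3.0],
    [1 / 6, 1 / 3, 1.0],
]
print([round(w, 3) for w in ahp_weights(comparisons)])
```

A weighted total score would then be the sum of each item score multiplied by its weight. In practice a consistency check on the judgments is usually run as well, which is omitted here.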
Validity of a Protocol for Adult Self-Report of Dyslexia and Related Difficulties
ERIC Educational Resources Information Center
Snowling, Margaret; Dawes, Piers; Nash, Hannah; Hulme, Charles
2012-01-01
Background: There is an increased prevalence of reading and related difficulties in children of dyslexic parents. In order to understand the causes of these difficulties, it is important to quantify the risk factors passed from parents to their offspring. Method: 417 adults completed a protocol comprising a 15-item questionnaire rating reading and…
Cappelleri, Joseph C; Jason Lundy, J; Hays, Ron D
2014-05-01
The US Food and Drug Administration's guidance for industry document on patient-reported outcomes (PRO) defines content validity as "the extent to which the instrument measures the concept of interest" (FDA, 2009, p. 12). According to Strauss and Smith (2009), construct validity "is now generally viewed as a unifying form of validity for psychological measurements, subsuming both content and criterion validity" (p. 7). Hence, both qualitative and quantitative information are essential in evaluating the validity of measures. We review classical test theory and item response theory (IRT) approaches to evaluating PRO measures, including frequency of responses to each category of the items in a multi-item scale, the distribution of scale scores, floor and ceiling effects, the relationship between item response options and the total score, and the extent to which hypothesized "difficulty" (severity) order of items is represented by observed responses. If a researcher has few qualitative data and wants to get preliminary information about the content validity of the instrument, then descriptive assessments using classical test theory should be the first step. As the sample size grows during subsequent stages of instrument development, confidence in the numerical estimates from Rasch and other IRT models (as well as those of classical test theory) would also grow. Classical test theory and IRT can be useful in providing a quantitative assessment of items and scales during the content-validity phase of PRO-measure development. Depending on the particular type of measure and the specific circumstances, the classical test theory and/or the IRT should be considered to help maximize the content validity of PRO measures. Copyright © 2014 Elsevier HS Journals, Inc. All rights reserved.
The degree of social difficulties experienced by cancer patients and their spouses.
Takeuchi, Takashi; Ichikura, Kanako; Amano, Kanako; Takeshita, Wakana; Hisamura, Kazuho
2018-06-08
Although recent studies have increasingly reported physical and psychological problems associated with cancer and its treatment, social problems of cancer patients and their families have not been sufficiently elucidated. The present study aimed to identify cancer-associated social problems from the perspectives of both patients and their spouses and to compare and analyze differences in their problems. This was a cross-sectional internet-based study. Subjects were 259 patients who developed cancer within the previous five years and 259 patients' spouses; the data were derived from two surveys in 2010 (patients) and 2016 (spouses) whose participants were not part of the same dyad but matched by propensity scores, estimated for age, sex, and the presence or absence of recurrence. We investigated the social difficulties of cancer patients and patients' spouses. Regarding social difficulties experienced by cancer patients and spouses, the 60 patient survey items were categorized into 14 labels by the Jiro Kawakita (KJ) method, which is a qualitative synthesis method developed by Kawakita to classify categorical data. Although patients had higher scores on most subcategories, young spouses aged 39 or younger and female spouses had difficulty scores as high as the corresponding patients on many subcategories. Health care providers should show sufficient concern for both patients and their spouses, particularly young and female spouses.
ERIC Educational Resources Information Center
Brese, Falk, Ed.
2012-01-01
The goal for selecting the released set of test items was to have approximately 25% of each of the full item sets for mathematics content knowledge (MCK) and mathematics pedagogical content knowledge (MPCK) that would represent the full range of difficulty, content, and item format used in the TEDS-M study. The initial step in the selection was to…
Gerrard, Paul
2013-01-01
Nursing facility patients are a population that has not previously been well studied with regard to functional status and independence. As such, the manner in which activities of daily living (ADL) relate to one another is not well understood in this population. An understanding of ADL difficulty ordering has helped to devise systems of functional independence grading in other populations, which have value in understanding patients' global levels of independence and providing expectations regarding changes in function. This study seeks to examine the hierarchy of ADL in the nursing facility population. Data were analyzed from the 2004 National Nursing Home Survey, a cross-sectional data set of 13 507 skilled nursing facility subjects with functional independence items. The ADL difficulty hierarchy was determined using Rasch analysis. Item fit values for the Rasch model using Mean-Square infit statistics were also determined. The robustness of the hierarchy was tested for each ADL. Two grading systems were devised from the results of the item difficulty ordering. One assigned a grade based on the most difficult item that a subject could perform, and the other assigned a grade based on the least difficult item that a subject could not perform. A total of 13 113 patients were included in this analysis, the majority of whom were female and white. They had an average age of 81 years. An ordered hierarchy of ADL was found with eating being the easiest and bathing the most difficult. All items in the Katz index fit the Rasch model adequately well. The majority of patients able to perform any particular ADL were also able to perform all easier ADL. Cohen's κ for the 2 grading systems was 0.73. This study is the first to show the expected hierarchy of difficulty of the 6 activities of daily living proposed in the Katz index in the nursing facility population. The hierarchy found in this population matches the original hierarchy found in older adults in the community and acute care settings. It is also similar to the hierarchy found in the inpatient rehabilitation setting. Patients would be expected to lose or gain function based on the order of difficulty, but this remains to be confirmed. Among the 6 activities of daily living tested here, their order from easiest to most difficult is eating, maintaining continence, transferring, toileting, dressing, and bathing. In addition, the index formed by these 6 items has construct validity in the nursing facility population.
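The two grading rules described in this abstract can be illustrated with a minimal sketch. The item order is taken from the abstract; the numeric grade scale (0 to 6), the function names, and the unweighted form of Cohen's κ are assumptions made for illustration, not details reported by the study.

```python
from collections import Counter

# Item order from easiest to most difficult, as reported in the abstract.
HIERARCHY = ["eating", "continence", "transferring", "toileting", "dressing", "bathing"]

def grade_by_hardest_performed(can_do):
    """Grade = rank (1-6) of the most difficult item performed; 0 if none (assumed scale)."""
    ranks = [i + 1 for i, item in enumerate(HIERARCHY) if can_do[item]]
    return max(ranks, default=0)

def grade_by_easiest_failed(can_do):
    """Grade = number of items below the least difficult item NOT performed; 6 if none failed."""
    for i, item in enumerate(HIERARCHY):
        if not can_do[item]:
            return i
    return len(HIERARCHY)

def cohens_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of grades.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# A subject whose pattern follows the hierarchy gets the same grade from both rules;
# a subject who fails toileting but can dress does not.
consistent = dict.fromkeys(HIERARCHY[:3], True) | dict.fromkeys(HIERARCHY[3:], False)
irregular = consistent | {"dressing": True}
```

For the `irregular` subject the first rule gives 5 and the second gives 3, which is the kind of disagreement a κ below 1 reflects.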
Validation of a clinical critical thinking skills test in nursing.
Shin, Sujin; Jung, Dukyoo; Kim, Sungeun
2015-01-27
The purpose of this study was to develop a revised version of the clinical critical thinking skills test (CCTS) and to subsequently validate its performance. This study is a secondary analysis of the CCTS. Data were obtained from a convenience sample of 284 college students in June 2011. Thirty items were analyzed using item response theory and test reliability was assessed. Test-retest reliability was measured using the results of 20 nursing college and graduate school students in July 2013. The content validity of the revised items was analyzed by calculating the degree of agreement between instrument developer intention in item development and the judgments of six experts. To analyze response process validity, qualitative data related to the response processes of nine nursing college students obtained through cognitive interviews were analyzed. Of the initial 30 items, 11 items were excluded after the analysis of the difficulty and discrimination parameters. When the 19 items of the revised version of the CCTS were analyzed, levels of item difficulty were found to be relatively low and levels of discrimination were found to be appropriate or high. The degree of agreement between item developer intention and expert judgments equaled or exceeded 50%. From the above results, evidence of the response process validity was demonstrated, indicating that subjects responded as intended by the test developer. The revised 19-item CCTS was found to have sufficient reliability and validity and will therefore represent a more convenient measurement of critical thinking ability.
Validation of a clinical critical thinking skills test in nursing
2015-01-01
Purpose: The purpose of this study was to develop a revised version of the clinical critical thinking skills test (CCTS) and to subsequently validate its performance. Methods: This study is a secondary analysis of the CCTS. Data were obtained from a convenience sample of 284 college students in June 2011. Thirty items were analyzed using item response theory and test reliability was assessed. Test-retest reliability was measured using the results of 20 nursing college and graduate school students in July 2013. The content validity of the revised items was analyzed by calculating the degree of agreement between instrument developer intention in item development and the judgments of six experts. To analyze response process validity, qualitative data related to the response processes of nine nursing college students obtained through cognitive interviews were analyzed. Results: Of the initial 30 items, 11 items were excluded after the analysis of the difficulty and discrimination parameters. When the 19 items of the revised version of the CCTS were analyzed, levels of item difficulty were found to be relatively low and levels of discrimination were found to be appropriate or high. The degree of agreement between item developer intention and expert judgments equaled or exceeded 50%. Conclusion: From the above results, evidence of the response process validity was demonstrated, indicating that subjects responded as intended by the test developer. The revised 19-item CCTS was found to have sufficient reliability and validity and will therefore represent a more convenient measurement of critical thinking ability. PMID:25622716
Do subitizing deficits in developmental dyscalculia involve pattern recognition weakness?
Ashkenazi, Sarit; Mark-Zigdon, Nitza; Henik, Avishai
2013-01-01
The abilities of children diagnosed with developmental dyscalculia (DD) were examined in two types of object enumeration: subitizing, and small estimation (5-9 dots). Subitizing is usually defined as a fast and accurate assessment of a number of small dots (range 1 to 4 dots), and estimation is an imprecise process to assess a large number of items (range 5 dots or more). Based on reaction time (RT) and accuracy analysis, our results indicated a deficit in the subitizing and small estimation range among DD participants in relation to controls. There are indications that subitizing is based on pattern recognition, thus presenting dots in a canonical shape in the estimation range should result in a subitizing-like pattern. In line with this theory, our control group presented a subitizing-like pattern in the small estimation range for canonically arranged dots, whereas the DD participants presented a deficit in the estimation of canonically arranged dots. The present finding indicates that pattern recognition difficulties may play a significant role in both subitizing and subitizing deficits among those with DD. © 2012 Blackwell Publishing Ltd.
ERIC Educational Resources Information Center
Magno, Carlo
2009-01-01
The present report demonstrates the difference between the classical test theory (CTT) and item response theory (IRT) approaches using actual test data from junior high school chemistry students. The CTT and IRT were compared across two samples and two forms of test on their item difficulty, internal consistency, and measurement errors. The specific…
ERIC Educational Resources Information Center
Brackenbury, Tim; Zickar, Michael J.; Munson, Benjamin; Storkel, Holly L.
2017-01-01
Purpose: Item response theory (IRT) is a psychometric approach to measurement that uses latent trait abilities (e.g., speech sound production skills) to model performance on individual items that vary by difficulty and discrimination. An IRT analysis was applied to preschoolers' productions of the words on the Goldman-Fristoe Test of…
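As a minimal sketch of what "items that vary by difficulty and discrimination" means, the standard two-parameter logistic (2PL) item response function is shown below. The abstract does not state which IRT model the authors fitted, so the 2PL form and the parameter names `theta`, `a`, and `b` are assumptions for illustration only.

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL item response function: P(correct) = 1 / (1 + exp(-a * (theta - b))).

    theta: latent ability of the person
    a:     item discrimination (slope); fixing a = 1 gives the 1PL/Rasch form
    b:     item difficulty (the ability at which P(correct) = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A person whose ability equals the item difficulty has a 50% chance of success,
# whatever the discrimination.
p_at_difficulty = p_correct_2pl(theta=0.0, a=1.7, b=0.0)
```

A larger `a` makes the probability change more steeply around `b`, which is why highly discriminating items separate people just above and just below the item's difficulty more sharply.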
ERIC Educational Resources Information Center
Sullins, Walter L.
Five-hundred dichotomously scored response patterns were generated with sequentially independent (SI) items and 500 with dependent (SD) items for each of thirty-six combinations of sampling parameters (i.e., three test lengths, three sample sizes, and four item difficulty distributions). KR-20, KR-21, and Split-Half (S-H) reliabilities were…
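For reference, the KR-20 and KR-21 coefficients named in this abstract can be computed from a matrix of 0/1 responses as sketched below. The textbook formulas are used with population variances throughout; the function names and the small data set are hypothetical and are not taken from the study.

```python
from statistics import mean, pvariance

def kr20(responses):
    """KR-20 = k/(k-1) * (1 - sum(p_i * q_i) / var(total)), for 0/1 item scores."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    pq_sum = 0.0
    for j in range(k):
        p = mean(row[j] for row in responses)  # proportion correct on item j
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / pvariance(totals))

def kr21(responses):
    """KR-21 = k/(k-1) * (1 - M*(k - M) / (k * var(total))); assumes equal item difficulty."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    m = mean(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * pvariance(totals)))

# Hypothetical data: 4 examinees (rows) by 3 items (columns).
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
```

On these data KR-21 comes out lower than KR-20, as expected when item difficulties are unequal.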
ERIC Educational Resources Information Center
Planinic, Maja; Boone, William J.; Krsnik, Rudolf; Beilfuss, Meredith L.
2006-01-01
Croatian 1st-year and 3rd-year high-school students (N = 170) completed a conceptual physics test. Students were evaluated with regard to two physics topics: Newtonian dynamics and simple DC circuits. Students answered test items and also indicated their confidence in each answer. Rasch analysis facilitated the calculation of three linear…
ERIC Educational Resources Information Center
Pawade, Yogesh R.; Diwase, Dipti S.
2016-01-01
Item analysis of Multiple Choice Questions (MCQs) is the process of collecting, summarizing and utilizing information from students' responses to evaluate the quality of test items. Difficulty Index (p-value), Discrimination Index (DI) and Distractor Efficiency (DE) are the parameters which help to evaluate the quality of MCQs used in an…
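Two of the three parameters named here can be sketched in a few lines. Conventions differ between textbooks, so the choices below are assumptions: the difficulty index is expressed as a proportion rather than a percentage, and the discrimination index compares the top and bottom 27% of examinees ranked by total score. Distractor efficiency is not covered, and all names are hypothetical.

```python
def difficulty_index(item_scores):
    """Proportion of examinees answering the item correctly (0/1 scores)."""
    return sum(item_scores) / len(item_scores)

def discrimination_index(item_scores, total_scores, fraction=0.27):
    """(correct in upper group - correct in lower group) / group size.

    Groups are the top and bottom `fraction` of examinees by total test score;
    the 27% default is a common convention, assumed here.
    """
    n = len(item_scores)
    group_size = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])  # ascending by total
    lower, upper = order[:group_size], order[-group_size:]
    upper_correct = sum(item_scores[i] for i in upper)
    lower_correct = sum(item_scores[i] for i in lower)
    return (upper_correct - lower_correct) / group_size

# Hypothetical item answered correctly only by the three highest scorers.
item = [1, 1, 1, 0, 0, 0]
totals = [9, 8, 7, 3, 2, 1]
```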
A measure of early physical functioning (EPF) post-stroke.
Finch, Lois E; Higgins, Johanne; Wood-Dauphinee, Sharon; Mayo, Nancy E
2008-07-01
To develop a comprehensive measure of Early Physical Functioning (EPF) post-stroke quantified through Rasch analysis and conceptualized using the International Classification of Functioning Disability and Health (ICF). An observational cohort study. A cohort of 262 subjects (mean age 71.6 (standard deviation 12.5) years) hospitalized post-acute stroke. Functional assessments were made within 3 days of stroke with items from valid and reliable indices commonly utilized to evaluate stroke survivors. Information on important variables was also collected. Principal component and Rasch analysis confirmed the factor structure, and dimensionality of the measure. Rasch analysis combined items across ICF components to develop the measure. Items were deleted iteratively, those retained fit the model and were related to the construct; reliability and validity were assessed. A 38-item unidimensional measure of the EPF met all Rasch model requirements. The item difficulty matched the person ability (mean person measure: -0.31; standard error 0.37 logits), reliability of the person-item-hierarchy was excellent at 0.97. Initial validity was adequate. The 38-item EPF measure was developed. It expands the range of assessment post acute stroke; it covers a broad spectrum of difficulty with good initial psychometric properties that, once revalidated, can assist in planning and evaluating early interventions.
Hällgren, Monica; Nygård, Louise; Kottorp, Anders
2014-05-01
While the development and possibilities of technology today are commonly regarded to be unlimited, knowledge regarding the technological needs of people with mental retardation is fairly limited. The aim of this study was to enhance knowledge of perceived relevance and difficulty in using everyday technology (ET) such as stoves, cell phones, and elevators in adults with mental retardation. 120 participants with different levels of mental retardation were interviewed with the Everyday Technology Use Questionnaire (ETUQ) about their use of such technologies in their everyday life. Analyses of variance, post hoc tests, and regression analyses were used to explore the data. Participants with moderate and severe mental retardation differed in mean perceived difficulty from those with mild mental retardation, suggesting that increased perceived difficulty in ET use is related to the level of mental retardation. Differences between groups were also found in the proportion of items that were relevant for each person. The variables Level of Mental Retardation, Additional Disabilities, and Proportional Relevance of ET Items could together predict 67.2% of the variation in perceived difficulty in technology use. The findings also indicate that age, housing, gender, and geographical district do not covary with perceived difficulty in ET use.
CTTITEM: SAS macro and SPSS syntax for classical item analysis.
Lei, Pui-Wa; Wu, Qiong
2007-08-01
This article describes the functions of a SAS macro and an SPSS syntax that produce common statistics for conventional item analysis including Cronbach's alpha, item difficulty index (p-value or item mean), and item discrimination indices (D-index, point biserial and biserial correlations for dichotomous items and item-total correlation for polytomous items). These programs represent an improvement over the existing SAS and SPSS item analysis routines in terms of completeness and user-friendliness. To promote routine evaluations of item qualities in instrument development of any scale, the programs are available at no charge for interested users. The program codes along with a brief user's manual that contains instructions and examples are downloadable from suen.ed.psu.edu/-pwlei/plei.htm.
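The statistics listed above are simple to reproduce. The sketch below is not the CTTITEM code; it is an independent illustration of Cronbach's alpha and the point-biserial correlation, with hypothetical function names and data.

```python
from statistics import mean, variance

def cronbach_alpha(rows):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item and the total score.

    The total here includes the item itself (an uncorrected item-total correlation).
    """
    mi, mt = mean(item_scores), mean(total_scores)
    cov = sum((x - mi) * (y - mt) for x, y in zip(item_scores, total_scores))
    ss_item = sum((x - mi) ** 2 for x in item_scores)
    ss_total = sum((y - mt) ** 2 for y in total_scores)
    return cov / (ss_item * ss_total) ** 0.5

# Hypothetical data: 4 respondents (rows) by 3 dichotomous items (columns).
rows = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
totals = [sum(row) for row in rows]
second_item = [row[1] for row in rows]
```

For dichotomous items alpha reduces to KR-20, so both routes give the same value on 0/1 data.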
Can Latino Food Trucks (Loncheras) Serve Healthy Meals? A Feasibility Study
Cohen, Deborah; Colaiaco, Ben; Martinez-Wenzl, Mary; Montes, Monica; Han, Bing; Berry, Sandy H.
2018-01-01
Objective: To conduct a pilot study to assess the feasibility of modifying food truck meals to meet the My Plate guidelines as well as the acceptability of healthier meals among consumers. Design: We recruited the owners of Latino food trucks (loncheras) in 2013-14 and offered an incentive for participation, assistance with marketing, and training by a bilingual dietician. We surveyed customers and we audited purchases to estimate sales of the modified meals. Setting: City of Los Angeles. Subjects: Owners or operators of Latino food trucks (loncheras) and their customers. Results: We enrolled 22 lonchera owners and 11 completed the intervention, offering more than 50 new menu items meeting meal guidelines. Sales of the meals comprised 2% of audited orders. Customers rated the meals highly; 97% said they would recommend and buy them again and 75% of participants who completed the intervention intended to continue offering the healthier meals. However, adherence to guidelines drifted after several months of operation and participant burden was cited as a reason for drop-out among 3/11 lonchera owners. Conclusions: Loncheras who participated reported minimal difficulty in modifying menu items. Given the difficulty in enrollment, expanding this program and ensuring adherence would likely need to be accomplished through regulatory requirements, monitoring and feedback, similar to the methods used to achieve compliance with sanitary standards. A companion marketing campaign would be helpful to increase consumer demand. PMID:28069099
Can Latino food trucks (loncheras) serve healthy meals? A feasibility study.
Cohen, Deborah A; Colaiaco, Ben; Martinez-Wenzl, Mary; Montes, Monica; Han, Bing; Berry, Sandy H
2017-05-01
To conduct a pilot study to assess the feasibility of modifying food truck meals to meet the My Plate guidelines as well as the acceptability of healthier meals among consumers. We recruited the owners of Latino food trucks (loncheras) in 2013-2014 and offered an incentive for participation, assistance with marketing and training by a bilingual dietitian. We surveyed customers and we audited purchases to estimate sales of the modified meals. City of Los Angeles, CA, USA. Owners or operators of Latino food trucks (loncheras) and their customers. We enrolled twenty-two lonchera owners and eleven completed the intervention, offering more than fifty new menu items meeting meal guidelines. Sales of the meals comprised 2 % of audited orders. Customers rated the meals highly; 97 % said they would recommend and buy them again and 75 % of participants who completed the intervention intended to continue offering the healthier meals. However, adherence to guidelines drifted after several months of operation and participant burden was cited as a reason for dropout among three of eleven lonchera owners who dropped out. Lonchera owners/operators who participated reported minimal difficulty in modifying menu items. Given the difficulty in enrolment, expanding this programme and ensuring adherence would likely need to be accomplished through regulatory requirements, monitoring and feedback, similar to the methods used to achieve compliance with sanitary standards. A companion marketing campaign would be helpful to increase consumer demand.
Michaelides, Michalis P.
2010-01-01
Many studies have investigated the topic of change or drift in item parameter estimates in the context of item response theory (IRT). Content effects, such as instructional variation and curricular emphasis, as well as context effects, such as the wording, position, or exposure of an item have been found to impact item parameter estimates. The issue becomes more critical when items with estimates exhibiting differential behavior across test administrations are used as common for deriving equating transformations. This paper reviews the types of effects on IRT item parameter estimates and focuses on the impact of misbehaving or aberrant common items on equating transformations. Implications relating to test validity and the judgmental nature of the decision to keep or discard aberrant common items are discussed, with recommendations for future research into more informed and formal ways of dealing with misbehaving common items. PMID:21833230
Andrich, David; Marais, Ida; Humphry, Stephen Mark
2015-01-01
Recent research has shown how the statistical bias in Rasch model difficulty estimates induced by guessing in multiple-choice items can be eliminated. Using vertical scaling of a high-profile national reading test, it is shown that the dominant effect of removing such bias is a nonlinear change in the unit of scale across the continuum. The consequence is that the proficiencies of the more proficient students are increased relative to those of the less proficient. Not controlling the guessing bias underestimates the progress of students across 7 years of schooling with important educational implications. PMID:29795871
Trani, Jean-François; Babulal, Ganesh Muneshwar; Bakhshi, Parul
2015-01-01
Although 80% of persons with disabilities live in low and middle-income countries, there is still a lack of comprehensive, cross-culturally validated tools to identify persons facing activity limitations and functioning difficulties in these settings. In absence of such a tool, disability estimates vary considerably according to the methodology used, and policies are based on unreliable estimates. The Disability Screening Questionnaire composed of 27 items (DSQ-27) was initially designed by a group of international experts in survey development and disability in Afghanistan for a national survey. Items were selected based on major domains of activity limitations and functioning difficulties linked to an impairment as defined by the International Classification of Functioning, Disability and Health. Face, content and construct validity, as well as sensitivity and specificity were examined. Based on the results obtained, the tool was subsequently refined and expanded to 34 items, tested and validated in Darfur, Sudan. Internal consistency for the total DSQ-34 using a raw and standardized Cronbach's Alpha and within each domain using a standardized Cronbach's Alpha was examined in the Asian context (India and Nepal). Exploratory factor analysis (EFA) using principal axis factoring (PAF) evaluated the lowest number of factors to account for the common variance among the questions in the screen. Test-retest reliability was determined by calculating intraclass correlation (ICC) and inter-rater reliability by calculating the kappa statistic; results were checked using Bland-Altman plots. The DSQ-34 was further tested for standard error of measurement (SEM) and for the minimum detectable change (MDC). Good internal consistency was indicated by Cronbach's Alpha of 0.83/0.82 for India and 0.76/0.78 for Nepal. We confirmed our assumption for EFA using the Kaiser-Meyer-Olkin measure of sampling well above the accepted cutoff of 0.40 for India (0.82) and Nepal (0.82). 
The criteria for Bartlett's test of sphericity were also met for both India (< .001) and Nepal (< .001). Estimates of reliability from the two countries reached acceptable levels of ICC of 0.75 (p<0.001) for India and 0.77 (p<0.001) for Nepal, and good strength of agreement for weighted kappa (respectively 0.77 and 0.79). The SEM/MDC was 0.80/2.22 for India and 0.96/2.66 for Nepal indicating a smaller amount of measurement error in the screen. In Nepal and India, the DSQ-34 shows strong psychometric properties that indicate that it effectively discriminates between persons with and without disabilities. This instrument can be used in association with other instruments for the purpose of comparing health outcomes of persons with and without disabilities in LMICs.
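The abstract does not state how the MDC was derived from the SEM. The reported pairs are, however, consistent with the commonly used 95% formula MDC = 1.96 × √2 × SEM, which the following sketch applies as an assumption. The helper `sem_from_icc` shows the usual SEM definition and is included only for context, since the score standard deviation is not given in the abstract.

```python
import math

def sem_from_icc(sd, icc):
    """Standard error of measurement from score SD and reliability: SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1 - icc)

def mdc95(sem):
    """Minimum detectable change at 95% confidence: 1.96 * sqrt(2) * SEM (assumed formula)."""
    return 1.96 * math.sqrt(2) * sem

# SEM values reported in the abstract for India and Nepal.
mdc_india = round(mdc95(0.80), 2)
mdc_nepal = round(mdc95(0.96), 2)
```

Under this assumed formula the computed values round to 2.22 and 2.66, matching the MDC figures reported for India and Nepal.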
Haggerty, Jeannie L; Levesque, Jean-Frédéric
2017-04-01
Patients are the most valid source for evaluating the accessibility of services, but a previous study observed differential psychometric performance of instruments in rural and urban respondents. To validate a measure of organizational accessibility free of differential rural-urban performance that predicts consequences of difficult access for patient-initiated care. Sequential qualitative-quantitative study. Qualitative findings used to adapt or develop evaluative and reporting items. Quantitative validation study. Primary data by telephone from 750 urban, rural and remote respondents in Quebec, Canada; follow-up mailed questionnaire to a subset of 316. Items were developed for barriers along the care trajectory. We used common factor and confirmatory factor analysis to identify constructs and compare models. We used item response theory analysis to test for differential rural-urban performance; examine individual item performance; adjust response options; and exclude redundant or non-discriminatory items. We used logistic regression to examine predictive validity of the subscale on access difficulty (outcome). Initial factor resolution suggested geographic and organizational dimensions, plus consequences of access difficulty. After second administration, organizational accommodation and geographic indicators were integrated into a 6-item subscale of Effective Availability and Accommodation, which demonstrates good variability and internal consistency (α = 0.84) and no differential functioning by geographic area. Each unit increase predicts decreased likelihood of consequences of access difficulties (unmet need and problem aggravation). The new subscale is a practical, valid and reliable measure for patients to evaluate first-contact health services accessibility, yielding valid comparisons between urban and rural contexts. © 2016 The Authors. Health Expectations published by John Wiley & Sons Ltd.
O'Brien, Kelly K; Bayoumi, Ahmed M; Stratford, Paul; Solomon, Patricia
2015-01-01
To assess the dimensions of disability measured by the HIV Disability Questionnaire (HDQ), a newly developed 72-item self-administered questionnaire that describes the presence, severity and episodic nature of disability experienced by people living with HIV. We recruited adults living with HIV from hospital clinics, AIDS service organizations and a specialty hospital and administered the HDQ followed by a demographic questionnaire. We conducted an exploratory factor analysis using disability severity scores to determine the domains of disability in the HDQ. We used the following steps: (a) ensured correlations between items were >0.30 and <0.80; (b) conducted a principal components analysis to extract factors; (c) used the Scree Test and eigenvalue threshold >1.5 to determine the number of factors to retain; and (d) used oblique rotation to simplify the factor loading matrix. We assigned items to factors based on factor loadings of >0.30. Of the 361 participants, 80% were men and 77% reported living with at least two concurrent health conditions in addition to HIV. The exploratory factor analysis suggested retaining six factors. Items related to symptoms and impairments loaded on three factors (physical [20 items], cognitive [3 items], and mental and emotional health [11 items]) and items related to worrying about the future, daily activities, and personal relationships loaded on three additional factors (uncertainty [14 items], difficulties with day-to-day activities [9 items], social inclusion [12 items]). The HDQ has six domains: physical symptoms and impairments; cognitive symptoms and impairments; mental and emotional health symptoms and impairments; uncertainty; difficulties with day-to-day activities; and challenges to social inclusion. These domains establish the scoring structure for the dimensions of disability measured by the HDQ. 
Implications for Rehabilitation As individuals live longer and age with HIV, they may be living with the health-related consequences of HIV and concurrent health conditions, a concept that may be termed disability. Measuring disability is important to understand the impact of HIV and its comorbidities. The HIV Disability Questionnaire (HDQ) is a self-administered questionnaire developed to describe the presence, severity and episodic nature of disability experienced by people living with HIV. The HDQ is comprised of six domains of disability including: physical symptoms and impairments (20 items); cognitive symptoms and impairments (3 items); mental and emotional health symptoms and impairments (11 items); uncertainty (14 items); difficulties with day-to-day activities (9 items) and challenges to social inclusion (12 items). These domains represent the dimensions of disability measured by the HDQ. The HDQ is the first known HIV-specific disability measure for adults living with HIV. The HDQ may be used by clinicians and researchers to assess disability experienced by adults living with HIV.
Hong, Ickpyo; Lee, Mi Jung; Kim, Moon Young; Park, Hae Yean
2017-10-01
The aim of this study is to investigate the psychometrics of the 12 items of an instrument assessing activities of daily living (ADL) using an item response theory model. A total of 648 adults with physical disabilities and having difficulties in ADLs were retrieved from the 2014 Korean National Survey on People with Disabilities. The psychometric testing included factor analysis, internal consistency, precision, and differential item functioning (DIF) across categories including sex, older age, marital status, and physical impairment area. The sample had a mean age of 69.7 years old (SD = 13.7). The majority of the sample had lower extremity impairments (62.0%) and had at least 2.1 chronic conditions. The instrument demonstrated unidimensional construct and good internal consistency (Cronbach's alpha = 0.95). The instrument precisely estimated person measures within a wide range of theta values (-2.22 logits < θ < 0.27 logits) with a reliability of 0.9. Only the changing position item demonstrated misfit (χ² = 36.6, df = 17, p = 0.0038), and the dressing item demonstrated DIF on the impairment type (upper extremity/others, McFadden's Pseudo R² > 5.0%). Our findings indicate that the dressing item would need to be modified to improve its psychometrics. Overall, the ADL instrument demonstrates good psychometrics, and thus, it may be used as a standardized instrument for measuring disability in rehabilitation contexts. However, the findings are limited to adults with physical disabilities. Future studies should replicate psychometric testing for survey respondents with other disorders and for children.
Development of a vision-targeted health-related quality of life item measure
Slotkin, Jerry; McKean-Cowdin, Roberta; Lee, Paul; Owsley, Cynthia; Vitale, Susan; Varma, Rohit; Gershon, Richard; Hays, Ron D.
2013-01-01
Purpose: To develop a vision-targeted health-related quality of life (HRQOL) measure for the NIH Toolbox for the Assessment of Neurological and Behavioral Function. Methods: We conducted a review of existing vision-targeted HRQOL surveys and identified color vision, low luminance vision, distance vision, general vision, near vision, ocular symptoms, psychosocial well-being, and role performance domains. Items in existing survey instruments were sorted into these domains. We selected non-redundant items and revised them to improve clarity and to limit the number of different response options. We conducted 10 cognitive interviews to evaluate the items. Finally, we revised the items and administered them to 819 individuals to calibrate the items and estimate the measure’s reliability and validity. Results: The field test provided support for the 53-item vision-targeted HRQOL measure encompassing 6 domains: color vision, distance vision, near vision, ocular symptoms, psychosocial well-being, and role performance. The domain scores had high levels of reliability (coefficient alphas ranged from 0.848 to 0.940). Validity was supported by high correlations between National Eye Institute Visual Function Questionnaire scales and the new vision-targeted scales (highest values were 0.771 between psychosocial well-being and mental health, and 0.729 between role performance and role difficulties), and by lower mean scores in those groups self-reporting eye disease (F statistic with p < 0.01 for all comparisons except cataract with ocular symptoms, psychosocial well-being, and role performance scales). Conclusions: This vision-targeted HRQOL measure provides a basis for comprehensive assessment of the impact of eye diseases and treatments on daily functioning and well-being in adults. PMID:23475688
Coons, Stephen Joel; Chongpison, Yuda; Wendel, Christopher S; Grant, Marcia; Krouse, Robert S
2007-09-01
To explore whether there was a significant relationship between difficulty paying for ostomy supplies and overall quality of life among a sample of ostomates receiving care from the Veterans Health Administration (VHA). The data were collected as part of the Veterans Affairs (VA) Ostomy Health-Related Quality of Life Study, in which 511 respondents (239 cases, 272 controls) completed a survey instrument that included the modified City of Hope Quality of Life (mCOH-QOL) Ostomy questionnaire, SF-36V, and sociodemographic items. Responses from the 239 cases (ie, patients with intestinal stomas) were used in this analysis. The modified City of Hope Quality of Life Ostomy questionnaire item, "How good is your overall quality of life?," was the dependent variable for this analysis. The primary independent variable was the response (yes/no) to the item, "If you pay for any of the (ostomy) costs, is it difficult for you?" A hierarchical regression model was used to examine whether difficulty paying was significantly related to overall quality of life after adjusting for age, income, race/ethnicity, and physical health. After accounting for the proportion of variance explained by age, income, race/ethnicity, and physical health, the additional proportion of variance explained by difficulty paying was statistically significant. Individuals reporting difficulty paying had a roughly 1 point lower (ie, beta-coefficient = -1.052; SE = 0.481) overall quality of life score on the 11-point scale. We found a significant association between difficulty paying for ostomy supplies and overall quality of life. Although the cross-sectional study design does not allow causal inference, the results suggest a relationship that merits further examination.
Validation of the Spanish Short Self-Regulation Questionnaire (SSSRQ) through Rasch Analysis
Garzón Umerenkova, Angélica; de la Fuente Arias, Jesús; Martínez-Vicente, José Manuel; Zapata Sevillano, Lucía; Pichardo, Mari Carmen; García-Berbén, Ana Belén
2017-01-01
Background: The aim of the study was to psychometrically characterize the Spanish Short Self-Regulation Questionnaire (SSSRQ) through Rasch analysis. Materials and Methods: 831 Spaniard university students (262 men), between 17 and 39 years of age and ranging from the first to the 5th year of studies, completed the SSSRQ questionnaire. Confirmatory factor analysis (CFA) was carried out in order to establish structural adequacy. Afterward, by means of the Rasch model, a study of each sub scale was conducted to test for dimensionality, fit of the sample questions, functionality of the response categories, reliability and estimation of Differential Item Functioning by gender and course. Results: The four sub-scales comply with the unidimensionality criteria, the questions are in line with the model, the response categories operate properly and the reliability of the sample is acceptable. Nonetheless, the test could benefit from the inclusion of additional items of both high and low difficulty in order to increase construct validity, discrimination and reliability for the respondents. Several items with differences in gender and course were also identified. Discussion: The results evidence the need and adequacy of this complementary psychometric analysis strategy, in relation to the CFA to enhance the instrument. PMID:28298898
ERIC Educational Resources Information Center
Semino, Sara; Ring, Melanie; Bowler, Dermot M.; Gaigg, Sebastian B.
2018-01-01
Autism Spectrum Disorder (ASD) is generally associated with difficulties in contextual source memory but not single item memory. There are surprising inconsistencies in the literature, however, that the current study seeks to address by examining item and source memory in age and ability matched groups of 22 ASD and 21 comparison adults. Results…
Arnould, Carlyne; Vandervelde, Laure; Batcho, Charles Sèbiyo; Penta, Massimo; Thonnard, Jean-Louis
2012-01-01
Objectives: Several ABILHAND Rasch-built manual ability scales were previously developed for chronic stroke (CS), cerebral palsy (CP), rheumatoid arthritis (RA), systemic sclerosis (SSc) and neuromuscular disorders (NMD). The present study aimed to explore the applicability of a generic manual ability scale unbiased by diagnosis and to study the nature of manual ability across diagnoses. Design: Cross-sectional study. Setting: Outpatient clinic homes (CS, CP, RA), specialised centres (CP), reference centres (CP, NMD) and university hospitals (SSc). Participants: 762 patients from six diagnostic groups: 103 CS adults, 113 CP children, 112 RA adults, 156 SSc adults, 124 NMD children and 124 NMD adults. Primary and secondary outcome measures: Manual ability as measured by the ABILHAND disease-specific questionnaires, diagnosis and nature (ie, uni-manual or bi-manual involvement and proximal or distal joints involvement) of the ABILHAND manual activities. Results: The difficulties of most manual activities were diagnosis dependent. A principal component analysis highlighted that 57% of the variance in the item difficulty between diagnoses was explained by the symmetric or asymmetric nature of the disorders. A generic scale was constructed, from a metric point of view, with 11 items sharing a common difficulty among diagnoses and 41 items displaying a category-specific location (asymmetric: CS, CP; and symmetric: RA, SSc, NMD). This generic scale showed that CP and NMD children had significantly less manual ability than RA patients, who had significantly less manual ability than CS, SSc and NMD adults. However, the generic scale was less discriminative and responsive to small deficits than disease-specific instruments. Conclusions: Our finding that most of the manual item difficulties were disease-dependent emphasises the danger of using generic scales without prior investigation of item invariance across diagnostic groups. Nevertheless, a generic manual ability scale could be developed by adjusting and accounting for activities perceived differently in various disorders. PMID:23117570
Lawton IADL scale in dementia: can item response theory make it more informative?
McGrory, Sarah; Shenkin, Susan D; Austin, Elizabeth J; Starr, John M
2014-07-01
Impairment of functional abilities represents a crucial component of dementia diagnosis. Current functional measures rely on the traditional aggregate method of summing raw scores. While this summary score provides a quick representation of a person's ability, it disregards useful information on the item level. The aim was to use item response theory (IRT) methods to increase the interpretive power of the Lawton Instrumental Activities of Daily Living (IADL) scale by establishing a hierarchy of item 'difficulty' and 'discrimination'. This cross-sectional study applied IRT methods to the analysis of IADL outcomes. Participants were 202 members of the Scottish Dementia Research Interest Register (mean age = 76.39, range = 56-93, SD = 7.89 years) with complete itemised data available. A Mokken scale with good reliability (Molenaar-Sijtsma statistic 0.79) was obtained, satisfying the IRT assumption that the items comprise a single unidimensional scale. The eight items in the scale could be placed on a hierarchy of 'difficulty' (H coefficient = 0.55), with 'Shopping' being the most 'difficult' item and 'Telephone use' being the least 'difficult' item. 'Shopping' was the most discriminatory item, differentiating well between patients of different levels of ability. IRT methods are capable of providing more information about functional impairment than a summed score. 'Shopping' and 'Telephone use' were identified as items that reveal key information about a patient's level of ability, and could be useful screening questions for clinicians. © The Author 2013. Published by Oxford University Press on behalf of the British Geriatrics Society.
Bootstrap Standard Errors for Maximum Likelihood Ability Estimates When Item Parameters Are Unknown
ERIC Educational Resources Information Center
Patton, Jeffrey M.; Cheng, Ying; Yuan, Ke-Hai; Diao, Qi
2014-01-01
When item parameter estimates are used to estimate the ability parameter in item response models, the standard error (SE) of the ability estimate must be corrected to reflect the error carried over from item calibration. For maximum likelihood (ML) ability estimates, a corrected asymptotic SE is available, but it requires a long test and the…
Parent outcome expectancies for purchasing fruit and vegetables: a validation.
Baranowski, Tom; Watson, Kathy; Missaghian, Mariam; Broadfoot, Alison; Baranowski, Janice; Cullen, Karen; Nicklas, Theresa; Fisher, Jennifer; O'Donnell, Sharon
2007-03-01
To validate four scales -- outcome expectancies for purchasing fruit and for purchasing vegetables, and comparative outcome expectancies for purchasing fresh fruit and for purchasing fresh vegetables versus other forms of fruit and vegetables (F&V). Survey instruments were administered twice, separated by 6 weeks. Recruited in front of supermarkets and grocery stores; interviews conducted by telephone. One hundred and sixty-one food shoppers with children (18 years or younger). Single dimension scales were specified for fruit and for vegetable purchasing outcome expectancies, and for comparative (fresh vs. other) fruit and vegetable purchasing outcome expectancies. Item Response Theory parameter estimates revealed easily interpreted patterns in the sequence of items by difficulty of response. Fruit and vegetable purchasing and fresh fruit comparative purchasing outcome expectancy scales were significantly correlated with home F&V availability, after controlling for social desirability of response. Comparative fresh vegetable outcome expectancy scale was significantly bivariately correlated with home vegetable availability, but not after controlling for social desirability. These scales are available to help better understand family F&V purchasing decisions.
ERIC Educational Resources Information Center
Ackerman, Brian P.; And Others
1990-01-01
Results of four experiments show that developmental differences in elaborative conceptual processing at acquisition and retrieval contribute independently to developmental increases in recall. Item identification processes for both words and pictures constrain children's elaborative processing. The constraints are time limited. (RH)
Treatment of Not-Administered Items on Individually Administered Intelligence Tests
ERIC Educational Resources Information Center
He, Wei; Wolfe, Edward W.
2012-01-01
In administration of individually administered intelligence tests, items are commonly presented in a sequence of increasing difficulty, and test administration is terminated after a predetermined number of incorrect answers. This practice produces stochastically censored data, a form of nonignorable missing data. By manipulating four factors…
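A minimal sketch of how a discontinue rule of this kind censors item responses. It assumes the rule counts consecutive incorrect answers (the abstract does not say whether they must be consecutive); the function name and the default threshold are hypothetical and are not taken from the study.

```python
def apply_discontinue_rule(responses, max_consecutive_wrong=3):
    """Return the responses an examiner would actually record.

    responses: scored answers (1 = correct, 0 = incorrect), ordered from
    the easiest item to the hardest.
    Items after the stopping point are recorded as None (not administered).
    ASSUMPTION: the rule counts consecutive incorrect answers.
    """
    recorded = []
    wrong_streak = 0
    stopped = False
    for score in responses:
        if stopped:
            recorded.append(None)  # censored: the item was never presented
            continue
        recorded.append(score)
        wrong_streak = wrong_streak + 1 if score == 0 else 0
        if wrong_streak >= max_consecutive_wrong:
            stopped = True
    return recorded


observed = apply_discontinue_rule([1, 0, 1, 0, 0, 0, 1, 1])
```

Whether an item is missing depends on the examinee's own earlier answers, which is why missingness of this kind cannot be ignored when estimating ability.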
A signal detection-item response theory model for evaluating neuropsychological measures.
Thomas, Michael L; Brown, Gregory G; Gur, Ruben C; Moore, Tyler M; Patt, Virginie M; Risbrough, Victoria B; Baker, Dewleen G
2018-02-05
Models from signal detection theory are commonly used to score neuropsychological test data, especially tests of recognition memory. Here we show that certain item response theory models can be formulated as signal detection theory models, thus linking two complementary but distinct methodologies. We then use the approach to evaluate the validity (construct representation) of commonly used research measures, demonstrate the impact of conditional error on neuropsychological outcomes, and evaluate measurement bias. Signal detection-item response theory (SD-IRT) models were fitted to recognition memory data for words, faces, and objects. The sample consisted of U.S. Infantry Marines and Navy Corpsmen participating in the Marine Resiliency Study. Data comprised item responses to the Penn Face Memory Test (PFMT; N = 1,338), Penn Word Memory Test (PWMT; N = 1,331), and Visual Object Learning Test (VOLT; N = 1,249), and self-report of past head injury with loss of consciousness. SD-IRT models adequately fitted recognition memory item data across all modalities. Error varied systematically with ability estimates, and distributions of residuals from the regression of memory discrimination onto self-report of past head injury were positively skewed towards regions of larger measurement error. Analyses of differential item functioning revealed little evidence of systematic bias by level of education. SD-IRT models benefit from the measurement rigor of item response theory-which permits the modeling of item difficulty and examinee ability-and from signal detection theory-which provides an interpretive framework encompassing the experimentally validated constructs of memory discrimination and response bias. We used this approach to validate the construct representation of commonly used research measures and to demonstrate how nonoptimized item parameters can lead to erroneous conclusions when interpreting neuropsychological test data. 
Future work might include the development of computerized adaptive tests and integration with mixture and random-effects models.
Development of the Sexual Minority Adolescent Stress Inventory
Schrager, Sheree M.; Goldbach, Jeremy T.; Mamey, Mary Rose
2018-01-01
Although construct measurement is critical to explanatory research and intervention efforts, rigorous measure development remains a notable challenge. For example, though the primary theoretical model for understanding health disparities among sexual minority (e.g., lesbian, gay, bisexual) adolescents is minority stress theory, nearly all published studies of this population rely on minority stress measures with poor psychometric properties and development procedures. In response, we developed the Sexual Minority Adolescent Stress Inventory (SMASI) with N = 346 diverse adolescents ages 14–17, using a comprehensive approach to de novo measure development designed to produce a measure with desirable psychometric properties. After exploratory factor analysis on 102 candidate items informed by a modified Delphi process, we applied item response theory techniques to the remaining 72 items. Discrimination and difficulty parameters and item characteristic curves were estimated overall, within each of 12 initially derived factors, and across demographic subgroups. Two items were removed for excessive discrimination and three were removed following reliability analysis. The measure demonstrated configural and scalar invariance for gender and age; a three-item factor was excluded for demonstrating substantial differences by sexual identity and race/ethnicity. The final 64-item measure comprised 11 subscales and demonstrated excellent overall (α = 0.98), subscale (α range 0.75–0.96), and test–retest (scale r > 0.99; subscale r range 0.89–0.99) reliabilities. Subscales represented a mix of proximal and distal stressors, including domains of internalized homonegativity, identity management, intersectionality, and negative expectancies (proximal) and social marginalization, family rejection, homonegative climate, homonegative communication, negative disclosure experiences, religion, and work domains (distal). 
Thus, the SMASI development process illustrates a method to incorporate information from multiple sources, including item response theory models, to guide item selection in building a psychometrically sound measure. We posit that similar methods can be used to improve construct measurement across all areas of psychological research, particularly in areas where a strong theoretical framework exists but existing measures are limited. PMID:29599737
Item Response Theory Equating Using Bayesian Informative Priors.
ERIC Educational Resources Information Center
de la Torre, Jimmy; Patz, Richard J.
This paper seeks to extend the application of Markov chain Monte Carlo (MCMC) methods in item response theory (IRT) to include the estimation of equating relationships along with the estimation of test item parameters. A method is proposed that incorporates estimation of the equating relationship in the item calibration phase. Item parameters from…
Massof, Robert W
2014-10-01
A simple theoretical framework explains patient responses to items in rating scale questionnaires. Fixed latent variables position each patient and each item on the same linear scale. Item responses are governed by a set of fixed category thresholds, one for each ordinal response category. A patient's item responses are magnitude estimates of the difference between the patient variable and the patient's estimate of the item variable, relative to his/her personally defined response category thresholds. Differences between patients in their personal estimates of the item variable and in their personal choices of category thresholds are represented by random variables added to the corresponding fixed variables. Effects of intervention correspond to changes in the patient variable, the patient's response bias, and/or latent item variables for a subset of items. Intervention effects on patients' item responses were simulated by assuming the random variables are normally distributed with a constant scalar covariance matrix. Rasch analysis was used to estimate latent variables from the simulated responses. The simulations demonstrate that changes in the patient variable and changes in response bias produce indistinguishable effects on item responses and manifest as changes only in the estimated patient variable. Changes in a subset of item variables manifest as intervention-specific differential item functioning and as changes in the estimated person variable that equal the average of changes in the item variables. Simulations demonstrate that intervention-specific differential item functioning produces inefficiencies and inaccuracies in computer adaptive testing. © The Author(s) 2013.
Statistical Indexes for Monitoring Item Behavior under Computer Adaptive Testing Environment.
ERIC Educational Resources Information Center
Zhu, Renbang; Yu, Feng; Liu, Su
A computerized adaptive test (CAT) administration usually requires a large supply of items with accurately estimated psychometric properties, such as item response theory (IRT) parameter estimates, to ensure the precision of examinee ability estimation. However, an estimated IRT model of a given item in any given pool does not always correctly…
Anesthesiology Journal club assessment by means of semantic changes.
Vieira, Joaquim Edson; Torres, Marcelo Luís Abramides; Pose, Regina Albanese; Auler, José Otávio Costa Junior
2014-01-01
The interactive approach of a journal club has been described in the medical education literature. The aim of this investigation is to present an assessment of journal club as a tool to address the question of whether residents read more and read critically. This study reports the performance of medical residents in anesthesiology from the Clinics Hospital - University of São Paulo Medical School. All medical residents were invited to answer five questions derived from discussed papers. The answer sheet consisted of affirmative statements, each related to one of the chosen articles and rated on a Likert-type scale (totally disagree-disagree-not sure-agree-totally agree). The results were evaluated by means of item analysis (difficulty index and discrimination power). Residents completed 173 evaluations in the months of December 2011 (n=51), July 2012 (n=66) and December 2012 (n=56). The first exam presented all items as straight statements; the second and third exams presented mixed items. Separating "totally agree" from "agree" increased the difficulty indices, but did not improve the discrimination power. The use of a journal club assessment with straight and inverted statements and a five-point agreement scale has been shown to increase its item difficulty and discrimination power. This may reflect involvement either with the reading or the discussion during the journal meeting. Copyright © 2013 Sociedade Brasileira de Anestesiologia. Published by Elsevier Editora Ltda.
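The two item-analysis statistics named in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: responses are assumed to be scored 0/1, the difficulty index is taken as the proportion of correct answers, and discrimination is taken as the difference between upper and lower scoring groups (27% each, a common convention). The function name `item_analysis` is hypothetical.

```python
def item_analysis(responses, group_fraction=0.27):
    """Classical item analysis for dichotomously scored items.

    responses: one list of 0/1 item scores per examinee.
    Returns one (difficulty_index, discrimination_index) pair per item.
    ASSUMPTIONS: 0/1 scoring; upper/lower groups defined by total score.
    """
    n = len(responses)
    n_items = len(responses[0])
    totals = [sum(r) for r in responses]
    order = sorted(range(n), key=lambda i: totals[i])  # lowest totals first
    k = max(1, round(n * group_fraction))
    lower, upper = order[:k], order[-k:]
    stats = []
    for j in range(n_items):
        p = sum(r[j] for r in responses) / n            # proportion correct
        p_upper = sum(responses[i][j] for i in upper) / k
        p_lower = sum(responses[i][j] for i in lower) / k
        stats.append((p, p_upper - p_lower))
    return stats


scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
stats = item_analysis(scores)
```

Under this convention a higher difficulty index means an easier item, so splitting "totally agree" from "agree" would be expected to lower the proportion scored as correct.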
Constructing three emotion knowledge tests from the invariant measurement approach
Prieto, Gerardo; Burin, Debora I.
2017-01-01
Background: Psychological constructionist models like the Conceptual Act Theory (CAT) postulate that complex states such as emotions are composed of basic psychological ingredients that are more clearly respected by the brain than basic emotions. The objective of this study was the construction and initial validation of Emotion Knowledge (EK) measures from the CAT frame by means of an invariant measurement approach, the Rasch Model (RM). Psychological distance theory was used to inform item generation. Methods: Three EK tests, namely emotion vocabulary (EV), close emotional situations (CES) and far emotional situations (FES), were constructed and tested with the RM in a community sample of 100 females and 100 males (age range: 18–65), both separately and conjointly. Results: It was corroborated that data-RM fit was sufficient. Then, the effect of type of test and emotion on Rasch-modelled item difficulty was tested. Significant effects of emotion on EK item difficulty were found, but the only statistically significant difference was that between “happiness” and the remaining emotions; neither type of test nor interaction effects on EK item difficulty were statistically significant. The testing of gender differences was carried out after corroborating that differential item functioning (DIF) would not be a plausible alternative hypothesis for the results. No statistically significant sex-related differences were found in EV, CES, FES, or total EK. However, the sign of d indicates that female participants were consistently better than male ones, a result that will be of interest for future meta-analyses. Discussion: The three EK tests are ready to be used as components of a higher-level measurement process. PMID:28929013
ERIC Educational Resources Information Center
Wallace, Colin S.; Prather, Edward E.; Duncan, Douglas K.
2012-01-01
This is the third of five papers detailing our national study of general education astronomy students' conceptual and reasoning difficulties with cosmology. In this paper, we use item response theory to analyze students' responses to three out of the four conceptual cosmology surveys we developed. The specific item response theory model we use is…
Item analysis of three Spanish naming tests: a cross-cultural investigation.
Marquez de la Plata, Carlos; Arango-Lasprilla, Juan Carlos; Alegret, Montse; Moreno, Alexander; Tárraga, Luis; Lara, Mar; Hewlitt, Margaret; Hynan, Linda; Cullum, C Munro
2009-01-01
Neuropsychological evaluations conducted in the United States and abroad commonly include the use of tests translated from English to Spanish. The use of translated naming tests for evaluating predominately Spanish-speakers has recently been challenged on the grounds that translating test items may compromise a test's construct validity. The Texas Spanish Naming Test (TNT) has been developed in Spanish specifically for use with Spanish-speakers; however, it is unlikely that patients from diverse Spanish-speaking geographical regions will perform uniformly on a naming test. The present study evaluated and compared the internal consistency and patterns of item difficulty and item discrimination for the TNT and two commonly used translated naming tests in three countries (i.e., United States, Colombia, Spain). Two hundred fifty-two subjects (136 demented, 116 nondemented) across three countries were administered the TNT, the Modified Boston Naming Test-Spanish (MBNT-S), and the naming subtest from the CERAD. Across countries, the TNT demonstrated internal consistency superior to that of its counterparts, an item difficulty pattern superior to that of the CERAD naming test, and an item discrimination pattern superior to that of the MBNT-S. Overall, all three Spanish naming tests differentiated nondemented and moderately demented individuals, but the results suggest the items of the TNT are most appropriate to use with Spanish-speakers. Preliminary normative data for the three tests examined in each country are provided.
Examination of the item structure of the Alberta infant motor scale.
Liao, Pai-Jun M; Campbell, Suzann K
2004-01-01
The Alberta Infant Motor Scale (AIMS) is a screening tool for identifying delayed motor development from birth to 18 months of age. The purpose of this study was to examine the psychometric structure of the AIMS, including the hierarchical scale of items and the precision for measuring infant ability at different ages. Ninety-seven infants with varying degrees of risk of developmental disability were recruited from three hospitals or from the community in the Chicago metropolitan area. Infants were tested on the AIMS at three, six, nine, and 12 months of age. The hierarchical structure and the range and distribution of item difficulty on the AIMS were analyzed using Rasch psychometric analysis. The Rasch analysis confirmed that items for each of the four testing positions (supine, prone, sitting, and standing) were arranged in increasing order of difficulty, but a ceiling effect was present. Gaps exist at six ability levels, indicating low precision of measurement for differentiating among infants after about nine months of age. The AIMS shows a ceiling effect, measures infant ability best from three to nine months of age, and has few items available for discriminating among infants after they pass the controlled lowering through standing item. Clinical impressions should be drawn with caution at ages when the precision of measurement is low.
A Rasch measure of teachers' views of teacher-student relationships in the primary school.
Leitao, Natalie; Waugh, Russell F
2012-01-01
This study investigated teacher-student relationships from the teachers' point of view at Perth metropolitan schools in Western Australia. The study identified three key social and emotional aspects that affect teacher-student relationships, namely, Connectedness, Availability and Communication. Data were collected by questionnaire (N = 139) with stem-items answered in three perspectives: (1) Idealistic: this is what I would like to happen; (2) Capability: this is what I am capable of; and (3) Behaviour: this is what actually happens, using four ordered response categories: not at all (score 1), some of the time (score 2), most of the time (score 3), and almost always (score 4). Data were analysed with a Rasch measurement model and a uni-dimensional, linear scale with 24 items, ordered from easy to hard, was created. The data were shown to be highly reliable, so that valid inferences could be made from the scale. The Person Separation Index (akin to a reliability index) was 0.93; there was good global teacher and item fit to the measurement model; there was good item fit; the targeting of the item difficulties against the teacher measures was good, and the response categories were answered consistently and logically. Teachers said that the ideal items were all easier than their corresponding capability items which were in turn easier than the behaviour items (where the items fitted the model), as conceptualized. The easiest ideal items were: I like this child and This child and I get along well together. The hardest ideal item (but still easy) was: I am available for this child. The easiest behaviour item (but still hard) was: This child and I get along well together. The hardest behaviour item (and very hard) was: I am interested to learn about this child's personal thoughts, feelings and experiences. The difficulties of the items supported the conceptual structure of the variable.
ERIC Educational Resources Information Center
Cacchione, Trix; Indino, Marcello; Fujita, Kazuo; Itakura, Shoji; Matsuno, Toyomi; Schaub, Simone; Amici, Federica
2014-01-01
Previous research has demonstrated that adults are successful at visually tracking rigidly moving items, but experience great difficulties when tracking substance-like "pouring" items. Using a comparative approach, we investigated whether the presence/absence of the grammatical count-mass distinction influences adults' and children's…
The Handling of Missing Binary Data in Language Research
ERIC Educational Resources Information Center
Pichette, François; Béland, Sébastien; Jolani, Shahab; Lesniewska, Justyna
2015-01-01
Researchers are frequently confronted with unanswered questions or items on their questionnaires and tests, due to factors such as item difficulty, lack of testing time, or participant distraction. This paper first presents results from a poll confirming previous claims (Rietveld & van Hout, 2006; Schafer & Graham, 2002) that data…
Decimal Fraction Arithmetic: Logical Error Analysis and Its Validation.
ERIC Educational Resources Information Center
Standiford, Sally N.; And Others
This report illustrates procedures of item construction for addition and subtraction examples involving decimal fractions. Using a procedural network of skills required to solve such examples, an item characteristic matrix of skills analysis was developed to describe the characteristics of the content domain by projected student difficulties. Then…
Mutual Information Item Selection in Adaptive Classification Testing
ERIC Educational Resources Information Center
Weissman, Alexander
2007-01-01
A general approach for item selection in adaptive multiple-category classification tests is provided. The approach uses mutual information (MI), a special case of the Kullback-Leibler distance, or relative entropy. MI works efficiently with the sequential probability ratio test and alleviates the difficulties encountered with using other local-…
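The relationship stated here, that mutual information is the Kullback-Leibler distance between a joint distribution and the product of its marginals, can be sketched as below. The joint tables are illustrative assumptions (rows as classification categories, columns as item responses) and the function name is hypothetical; this is not the item selection algorithm from the paper.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution.

    joint: list of rows; joint[i][j] = P(category = i, response = j).
    Computed as KL(joint || product of marginals).
    """
    row_marginals = [sum(row) for row in joint]
    col_marginals = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # terms with p == 0 contribute nothing
                mi += p * math.log(p / (row_marginals[i] * col_marginals[j]))
    return mi


# An item whose response is independent of the category carries no information.
mi_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
# An item whose response fully determines the category carries log(2) nats.
mi_dependent = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```

In an adaptive classification setting, one plausible use is to compute this quantity for each candidate item under the current posterior over categories and administer the item with the largest value.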
Weighted Maximum-a-Posteriori Estimation in Tests Composed of Dichotomous and Polytomous Items
ERIC Educational Resources Information Center
Sun, Shan-Shan; Tao, Jian; Chang, Hua-Hua; Shi, Ning-Zhong
2012-01-01
For mixed-type tests composed of dichotomous and polytomous items, polytomous items often yield more information than dichotomous items. To reflect the difference between the two types of items and to improve the precision of ability estimation, an adaptive weighted maximum-a-posteriori (WMAP) estimation is proposed. To evaluate the performance of…
Sadler, Philip M.; Coyle, Harold; Smith, Nancy Cook; Miller, Jaimie; Mintzes, Joel; Tanner, Kimberly; Murray, John
2013-01-01
We report on the development of an item test bank and associated instruments based on the National Research Council (NRC) K–8 life sciences content standards. Utilizing hundreds of studies in the science education research literature on student misconceptions, we constructed 476 unique multiple-choice items that measure the degree to which test takers hold either a misconception or an accepted scientific view. Tested nationally with 30,594 students, following their study of life science, and their 353 teachers, these items reveal a range of interesting results, particularly student difficulties in mastering the NRC standards. Teachers also answered test items and demonstrated a high level of subject matter knowledge reflecting the standards of the grade level at which they teach, but exhibiting few misconceptions of their own. In addition, teachers predicted the difficulty of each item for their students and which of the wrong answers would be the most popular. Teachers were found to generally overestimate their own students’ performance and to have a high level of awareness of the particular misconceptions that their students hold on the K–4 standards, but a low level of awareness of misconceptions related to the 5–8 standards. PMID:24006402
Equating with Miditests Using IRT
ERIC Educational Resources Information Center
Fitzpatrick, Joseph; Skorupski, William P.
2016-01-01
The equating performance of two internal anchor test structures--miditests and minitests--is studied for four IRT equating methods using simulated data. Originally proposed by Sinharay and Holland, miditests are anchors that have the same mean difficulty as the overall test but less variance in item difficulties. Four popular IRT equating methods…
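The defining property of the two anchor structures (same mean difficulty as the full test, with less spread for the miditest) can be illustrated with a small sketch. The selection heuristics below are assumptions made for illustration only and are not the procedure used by Sinharay and Holland or in this study; the function names and difficulty values are hypothetical.

```python
from statistics import mean, pvariance


def miditest_anchor(difficulties, k):
    """Pick the k items whose difficulty is closest to the test mean."""
    m = mean(difficulties)
    return sorted(difficulties, key=lambda b: abs(b - m))[:k]


def minitest_anchor(difficulties, k):
    """Pick k items spread evenly over the sorted difficulty range (k >= 2)."""
    ordered = sorted(difficulties)
    step = (len(ordered) - 1) / (k - 1)
    return [ordered[round(i * step)] for i in range(k)]


# Hypothetical item difficulties for a nine-item test, in logits.
test_difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
midi = miditest_anchor(test_difficulties, 3)
mini = minitest_anchor(test_difficulties, 3)
```

With these values both anchors share the full test's mean difficulty of 0, while the miditest's difficulties are far less variable than the minitest's.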
A Test of the Similar Sequence Hypothesis.
ERIC Educational Resources Information Center
Silverstein, A. B.; And Others
1982-01-01
Scales for object permanence and spatial relationships were administered to 98 severely and profoundly mentally retarded children (mean age 13 years) on three occasions, 6 months apart. Differences in the difficulty of the items were quite stable, but their order of difficulty differed appreciably from that for nonretarded infants. (Author/SB)
Reproduction of Inflectional Markers in French-Speaking Children with Reading Impairment
ERIC Educational Resources Information Center
St-Pierre, Marie-Catherine; Beland, Renee
2010-01-01
Purpose: Children with reading impairment (RI) experience difficulties in oral and written production of inflectional markers. The origin of these difficulties is not well documented in French. According to some authors, acquisition of irregular items by typically developing children is predicted by token frequency, whereas acquisition of regular…
Investigating the Impact of Uncertainty about Item Parameters on Ability Estimation
ERIC Educational Resources Information Center
Zhang, Jinming; Xie, Minge; Song, Xiaolan; Lu, Ting
2011-01-01
Asymptotic expansions of the maximum likelihood estimator (MLE) and weighted likelihood estimator (WLE) of an examinee's ability are derived while item parameter estimators are treated as covariates measured with error. The asymptotic formulae present the amount of bias of the ability estimators due to the uncertainty of item parameter estimators.…
ERIC Educational Resources Information Center
Palmer, D. G.
This publication presents an organized collection of biology questions, designed for use in evaluation at the secondary level in Tasmania. Each item has been tried for quality and is accompanied by its difficulty percentage as well as by its content area and the mental processes required to answer it. The content areas include: Diversity,…
Development and assessment of floor and ceiling items for the PROMIS physical function item bank
2013-01-01
Introduction Disability and Physical Function (PF) outcome assessment has had limited ability to measure functional status at the floor (very poor functional abilities) or the ceiling (very high functional abilities). We sought to identify, develop and evaluate new floor and ceiling items to enable broader and more precise assessment of PF outcomes for the NIH Patient-Reported-Outcomes Measurement Information System (PROMIS). Methods We conducted two cross-sectional studies using NIH PROMIS item improvement protocols with expert review, participant survey and focus group methods. In Study 1, respondents with low PF abilities evaluated new floor items, and those with high PF abilities evaluated new ceiling items for clarity, importance and relevance. In Study 2, we compared difficulty ratings of new floor items by low functioning respondents and ceiling items by high functioning respondents to reference PROMIS PF-10 items. We used frequencies, percentages, means and standard deviations to analyze the data. Results In Study 1, low (n = 84) and high (n = 90) functioning respondents were mostly White, women, 70 years old, with some college, and disability scores of 0.62 and 0.30. More than 90% of the 31 new floor and 31 new ceiling items were rated as clear, important and relevant, leaving 26 ceiling and 30 floor items for Study 2. Low (n = 246) and high (n = 637) functioning Study 2 respondents were mostly White, women, 70 years old, with some college, and Health Assessment Questionnaire (HAQ) scores of 1.62 and 0.003. Compared to difficulty ratings of reference items, ceiling items were rated to be 10% more to greater than 40% more difficult to do, and floor items were rated to be about 12% to nearly 90% less difficult to do. Conclusions These new floor and ceiling items considerably extend the measurable range of physical function at either extreme. 
They will help improve instrument performance in populations with broad functional ranges and those concentrated at one or the other extreme ends of functioning. Optimal use of these new items will be assisted by computerized adaptive testing (CAT), reducing questionnaire burden and ensuring item administration to appropriate individuals. PMID:24286166
Bayesian inference in an item response theory model with a generalized student t link function
NASA Astrophysics Data System (ADS)
Azevedo, Caio L. N.; Migon, Helio S.
2012-10-01
In this paper we introduce a new item response theory (IRT) model with a generalized Student t-link function with unknown degrees of freedom (df), named the generalized t-link (GtL) IRT model. In this model we consider only the difficulty parameter in the item response function. GtL is an alternative to the two parameter logit and probit models, since the degrees of freedom (df) play a similar role to the discrimination parameter. However, the behavior of the curves of the GtL is different from those of the two parameter models and the usual Student t link, since in GtL the curve obtained from different df's can cross the probit curves at more than one latent trait level. The GtL model has similar properties to the generalized linear mixed models, such as the existence of sufficient statistics and easy parameter interpretation. Also, many techniques of parameter estimation, model fit assessment and residual analysis developed for those models can be used for the GtL model. We develop fully Bayesian estimation and model fit assessment tools through a Metropolis-Hastings step within a Gibbs sampling algorithm. We consider a prior sensitivity choice concerning the degrees of freedom. The simulation study indicates that the algorithm recovers all parameters properly. In addition, some Bayesian model fit assessment tools are considered. Finally, a real data set is analyzed using our approach and other usual models. The results indicate that our model fits the data better than the two parameter models.
Jafari, Peyman; Bagheri, Zahra; Ayatollahi, Seyyed Mohamad Taghi; Soltani, Zahra
2012-03-13
Item response theory (IRT) is extensively used to develop adaptive instruments of health-related quality of life (HRQoL). However, each IRT model has its own function to estimate item and category parameters, and hence different results may be found using the same response categories with different IRT models. The present study used the Rasch rating scale model (RSM) to examine and reassess the psychometric properties of the Persian version of the PedsQL™ 4.0 Generic Core Scales. The PedsQL™ 4.0 Generic Core Scales was completed by 938 Iranian school children and their parents. Convergent, discriminant and construct validity of the instrument were assessed by classical test theory (CTT). The RSM was applied to investigate person and item reliability, item statistics and ordering of response categories. The CTT method showed that the scaling success rate for convergent and discriminant validity were 100% in all domains with the exception of physical health in the child self-report. Moreover, confirmatory factor analysis supported a four-factor model similar to its original version. The RSM showed that 22 out of 23 items had acceptable infit and outfit statistics (<1.4, >0.6), person reliabilities were low, item reliabilities were high, and item difficulty ranged from -1.01 to 0.71 and -0.68 to 0.43 for child self-report and parent proxy-report, respectively. Also the RSM showed that successive response categories for all items were not located in the expected order. This study revealed that, in all domains, the five response categories did not perform adequately. It is not known whether this problem is a function of the meaning of the response choices in the Persian language or an artifact of a mostly healthy population that did not use the full range of the response categories. The response categories should be evaluated in further validation studies, especially in large samples of chronically ill patients.
How Task Features Impact Evidence from Assessments Embedded in Simulations and Games
ERIC Educational Resources Information Center
Almond, Russell G.; Kim, Yoon Jeon; Velasquez, Gertrudes; Shute, Valerie J.
2014-01-01
One of the key ideas of evidence-centered assessment design (ECD) is that task features can be deliberately manipulated to change the psychometric properties of items. ECD identifies a number of roles that task-feature variables can play, including determining the focus of evidence, guiding form creation, determining item difficulty and…
An Eye-Movement Study of Relational Memory in Adults with Autism Spectrum Disorder
ERIC Educational Resources Information Center
Ring, Melanie; Bowler, Dermot M.; Gaigg, Sebastian B.
2017-01-01
Persons with Autism Spectrum Disorder (ASD) demonstrate good memory for single items but difficulties remembering contextual information related to these items. Recently, we found compromised explicit but intact implicit retrieval of object-location information in ASD (Ring et al. "Autism Res" 8(5):609-619, 2015). Eye-movement data…
Probing University Students' Pre-Knowledge in Quantum Physics with QPCS Survey
ERIC Educational Resources Information Center
Asikainen, Mervi A.
2017-01-01
The study investigated the use of the Quantum Physics Conceptual Survey (QPCS) in probing student understanding of quantum physics. Altogether 103 Finnish university students responded to the QPCS. The mean scores of the student responses were calculated and the test was evaluated using five common indices: Item difficulty index, Item discrimination…
Identifying Predictors of Physics Item Difficulty: A Linear Regression Approach
ERIC Educational Resources Information Center
Mesic, Vanes; Muratovic, Hasnija
2011-01-01
Large-scale assessments of student achievement in physics are often approached with an intention to discriminate students based on the attained level of their physics competencies. Therefore, for purposes of test design, it is important that items display an acceptable discriminatory behavior. To that end, it is recommended to avoid extraordinary…
Cognitive Complexity in the Remote Association Test--Chinese Version
ERIC Educational Resources Information Center
Hung, Su-Pin; Huang, Po-Sheng; Chen, Hsueh-Chih
2016-01-01
The remote association test (RAT) has been applied in various fields; however, evidence of construct validity for the original version and subsequent extensions of the RAT remains limited. This study aimed to elucidate the dimensionality and the relationship between item features and item difficulties for the RAT--Chinese Version (RAT-C) using the…
Analysis of Open-Ended Statistics Questions with Many Facet Rasch Model
ERIC Educational Resources Information Center
Güler, Nese
2014-01-01
Problem Statement: The most significant disadvantage of open-ended items that allow the valid measurement of upper level cognitive behaviours, such as synthesis and evaluation, is scoring. The difficulty associated with objectively scoring the answers to the items contributes to the reduction of the reliability of the scores. Moreover, other…
Developing and Evaluating a Machine-Scorable, Constrained Constructed-Response Item.
ERIC Educational Resources Information Center
Braun, Henry I.; And Others
The use of constructed response items in large scale standardized testing has been hampered by the costs and difficulties associated with obtaining reliable scores. The advent of expert systems may signal the eventual removal of this impediment. This study investigated the accuracy with which expert systems could score a new, non-multiple choice…
A Comparison between Element Salience versus Context as Item Difficulty Factors in Raven's Matrices
ERIC Educational Resources Information Center
Perez-Salas, Claudia P.; Streiner, David L.; Roberts, Maxwell J.
2012-01-01
The nature of contextual facilitation effects for items derived from Raven's Progressive Matrices was investigated in two experiments. For these, the original matrices were modified, creating either abstract versions with high element salience, or versions which comprised realistic entities set in familiar contexts. In order to replicate and…
An Application of the Rasch Model.
ERIC Educational Resources Information Center
Veitch, William R.
The one parameter latent trait theory of Georg Rasch has two assumptions: that student abilities can be measured on an equal interval scale, and that the success of a student with a given item is a function of student achievement and item difficulty. The grade four Michigan Educational Assessment Program reading test was designed to measure…
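The second assumption in the abstract above, that success on an item is a function of student achievement and item difficulty, is usually written as a logistic function of their difference. The following is a minimal sketch of that standard one-parameter form, not code from the report; the function name and the example values are hypothetical.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the one-parameter (Rasch) model.

    Both arguments are assumed to be expressed on the same logit scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values: a student whose ability equals the item difficulty,
# and a student one logit above it.
p_matched = rasch_probability(ability=0.0, difficulty=0.0)
p_stronger = rasch_probability(ability=1.0, difficulty=0.0)
print(p_matched, p_stronger)
```

When ability equals difficulty the probability is exactly one half, which is why Rasch difficulties are read as the ability level at which success becomes more likely than not.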
Shalev, Anat; Shor, Ron
2016-12-01
Limited research attention has been given to the needs of family caregivers of persons with mental illness in psychiatric hospitals despite the stressors and difficulties they experience. In light of the recognition of the significance of helping family caregivers, a new model of consultation and support centers for family caregivers, called Meital, has been developed. This study examined the needs of family caregivers who receive help in Meital, at the Beer Sheva Mental Health Center. Eighty-five family caregivers participated in the research. They completed a structured questionnaire constructed for this research two weeks after they started receiving services from Meital. The questionnaire included four areas of needs for help. These areas examined the extent of the need for help with respect to each of the items in the instrument. The mean of the extent of need for help of the items in the 'information and knowledge' subscale was the highest. Average to high means of the items of the subscales were found in the subscales relating to 'difficulties stemming from the impact of the situation of the person with mental illness on the function of the family caregiver receiving help,' 'on the function of other family members' and 'difficulties coping with the person with mental illness.' The mean of the items of the subscale 'relationships with professionals and informal systems' was the lowest. An examination of the items within the subscales indicated that items relating to the 'impact of the situation of the person with mental illness on the family caregiver who receives help' were ranked higher than the items relating to the 'impact on the function of other family caregivers.' Items relating to 'relationships with professionals' were ranked higher than items relating to 'relationships with informal systems.' This research emphasizes the importance of implementing the family-centered approach, the basis of the Meital Model, in psychiatric institutions.
The focus of this approach is on the need for help of family caregivers beyond the help needed for them to function as a resource of help for the ill person. The findings also illuminate the importance of making information and knowledge accessible for family caregivers.
2015-01-01
Purpose: The situational judgment test (SJT) shows promise for assessing the non-cognitive skills of medical school applicants, but has only been used in Europe. Since the admissions processes and education levels of applicants to medical school are different in the United States and in Europe, it is necessary to obtain validity evidence of the SJT based on a sample of United States applicants. Methods: Ninety SJT items were developed and Kane’s validity framework was used to create a test blueprint. A total of 489 applicants selected for assessment/interview day at the University of Utah School of Medicine during the 2014-2015 admissions cycle completed one of five SJTs, which assessed professionalism, coping with pressure, communication, patient focus, and teamwork. Item difficulty, each item’s discrimination index, internal consistency, and the categorization of items by two experts were used to create the test blueprint. Results: The majority of item scores were within an acceptable range of difficulty, as measured by the difficulty index (0.50-0.85) and had fair to good discrimination. However, internal consistency was low for each domain, and 63% of items appeared to assess multiple domains. The concordance of categorization between the two educational experts ranged from 24% to 76% across the five domains. Conclusion: The results of this study will help medical school admissions departments determine how to begin constructing a SJT. Further testing with a more representative sample is needed to determine if the SJT is a useful assessment tool for measuring the non-cognitive skills of medical school applicants. PMID:26582629
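The difficulty and discrimination indices mentioned in the abstract above can be illustrated with a short sketch. It assumes dichotomously scored items (1 = correct, 0 = incorrect) and the common upper-lower group method with 27% tails; the abstract does not say which variant the authors used, and the function names and data below are hypothetical.

```python
from typing import List


def difficulty_index(item_scores: List[int]) -> float:
    """Proportion of examinees answering the item correctly (the classical p-value)."""
    return sum(item_scores) / len(item_scores)


def discrimination_index(item_scores: List[int],
                         total_scores: List[float],
                         fraction: float = 0.27) -> float:
    """Upper-lower discrimination: p(correct) in the top group minus the bottom group.

    Groups are formed from total test scores; `fraction` is the assumed tail size.
    """
    n = len(item_scores)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])  # ascending by total score
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical responses of ten examinees to one item, with their total scores.
item = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0]
totals = [9, 8, 8, 7, 6, 5, 4, 3, 2, 1]
print(difficulty_index(item), discrimination_index(item, totals))
```

Under this convention an item with a difficulty index inside the 0.50-0.85 band quoted in the abstract is answered correctly by between half and 85% of examinees.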
ERIC Educational Resources Information Center
Zhang, Dake; Ding, Yi; Stegall, Joanna; Mo, Lei
2012-01-01
Students who struggle with learning mathematics often have difficulties with geometry problem solving, which requires strong visual imagery skills. These difficulties have been correlated with deficiencies in visual working memory. Cognitive psychology has shown that chunking of visual items accommodates students' working memory deficits. This…
ERIC Educational Resources Information Center
Dickey, Wayne C.; Blumberg, Stephen J.
2004-01-01
Objective: The Strengths and Difficulties Questionnaire is a 25-item instrument developed to assess emotional and behavioral problems. The current study attempted to replicate previous European structural analyses and to describe the latent dimensions that underlie responses to the parent-reported version of the Strengths and Difficulties…
Comparison of Difficulties and Reliabilities of Math-Completion and Multiple-Choice Item Formats.
ERIC Educational Resources Information Center
Oosterhof, Albert C.; Coats, Pamela K.
Instructors who develop classroom examinations that require students to provide a numerical response to a mathematical problem are often very concerned about the appropriateness of the multiple-choice format. The present study augments previous research relevant to this concern by comparing the difficulty and reliability of multiple-choice and…
Belief-bias reasoning in non-clinical delusion-prone individuals.
Anandakumar, T; Connaughton, E; Coltheart, M; Langdon, R
2017-03-01
It has been proposed that people with delusions have difficulty inhibiting beliefs (i.e., "doxastic inhibition") so as to reason about them as if they might not be true. We used a continuity approach to test this proposal in non-clinical adults scoring high and low in psychometrically assessed delusion-proneness. High delusion-prone individuals were expected to show greater difficulty than low delusion-prone individuals on "conflict" items of a "belief-bias" reasoning task (i.e. when required to reason logically about statements that conflicted with reality), but not on "non-conflict" items. Twenty high delusion-prone and twenty low delusion-prone participants (according to the Peters et al. Delusions Inventory) completed a belief-bias reasoning task and tests of IQ, working memory and general inhibition (Excluded Letter Fluency, Stroop and Hayling Sentence Completion). High delusion-prone individuals showed greater difficulty than low delusion-prone individuals on the Stroop and Excluded Letter Fluency tests of inhibition, but no greater difficulty on the conflict versus non-conflict items of the belief-bias task. They did, however, make significantly more errors overall on the belief-bias task, despite controlling for IQ, working memory and general inhibitory control. The study had a relatively small sample size and used non-clinical participants to test a theory of cognitive processing in individuals with clinically diagnosed delusions. Results failed to support a role for doxastic inhibitory failure in non-clinical delusion-prone individuals. These individuals did, however, show difficulty with conditional reasoning about statements that may or may not conflict with reality, independent of any general cognitive or inhibitory deficits. Copyright © 2016 Elsevier Ltd. All rights reserved.
Braun, J
1994-02-01
In more than one respect, visual search for the most salient or the least salient item in a display are different kinds of visual tasks. The present work investigated whether this difference is primarily one of perceptual difficulty, or whether it is more fundamental and relates to visual attention. Display items of different salience were produced by varying either size, contrast, color saturation, or pattern. Perceptual masking was employed and, on average, mask onset was delayed longer in search for the least salient item than in search for the most salient item. As a result, the two types of visual search presented comparable perceptual difficulty, as judged by psychophysical measures of performance, effective stimulus contrast, and stability of decision criterion. To investigate the role of attention in the two types of search, observers attempted to carry out a letter discrimination and a search task concurrently. To discriminate the letters, observers had to direct visual attention at the center of the display and, thus, leave unattended the periphery, which contained target and distractors of the search task. In this situation, visual search for the least salient item was severely impaired while visual search for the most salient item was only moderately affected, demonstrating a fundamental difference with respect to visual attention. A qualitatively identical pattern of results was encountered by Schiller and Lee (1991), who used similar visual search tasks to assess the effect of a lesion in extrastriate area V4 of the macaque.
Wang, Xiaoli; Xuan, Yifu; Jarrold, Christopher
2016-01-01
Previous studies have examined whether difficulties in short-term memory for verbal information, that might be associated with dyslexia, are driven by problems in retaining either information about to-be-remembered items or the order in which these items were presented. However, such studies have not used process-pure measures of short-term memory for item or order information. In this work we adapt a process dissociation procedure to properly distinguish the contributions of item and order processes to verbal short-term memory in a group of 28 adults with a self-reported diagnosis of dyslexia and a comparison sample of 29 adults without a dyslexia diagnosis. In contrast to previous work that has suggested that individuals with dyslexia experience item deficits resulting from inefficient phonological representation and language-independent order memory deficits, the results showed no evidence of specific problems in short-term retention of either item or order information among the individuals with a self-reported diagnosis of dyslexia, despite this group showing expected difficulties on separate measures of word and non-word reading. However, there was some suggestive evidence of a link between order memory for verbal material and individual differences in non-word reading, consistent with other claims for a role of order memory in phonologically mediated reading. The data from the current study therefore provide empirical evidence to question the extent to which item and order short-term memory are necessarily impaired in dyslexia. PMID:26941679
ITEM ANALYSIS OF THREE SPANISH NAMING TESTS: A CROSS-CULTURAL INVESTIGATION
de la Plata, Carlos Marquez; Arango-Lasprilla, Juan Carlos; Alegret, Montse; Moreno, Alexander; Tárraga, Luis; Lara, Mar; Hewlitt, Margaret; Hynan, Linda; Cullum, C. Munro
2009-01-01
Neuropsychological evaluations conducted in the United States and abroad commonly include the use of tests translated from English to Spanish. The use of translated naming tests for evaluating predominately Spanish-speakers has recently been challenged on the grounds that translating test items may compromise a test’s construct validity. The Texas Spanish Naming Test (TNT) has been developed in Spanish specifically for use with Spanish-speakers; however, it is unlikely patients from diverse Spanish-speaking geographical regions will perform uniformly on a naming test. The present study evaluated and compared the internal consistency and patterns of item-difficulty and -discrimination for the TNT and two commonly used translated naming tests in three countries (i.e., United States, Colombia, Spain). Two hundred fifty-two subjects (126 demented, 116 nondemented) across three countries were administered the TNT, Modified Boston Naming Test-Spanish, and the naming subtest from the CERAD. The TNT demonstrated better internal consistency than its counterparts, a better item difficulty pattern than the CERAD naming test, and a better item discrimination pattern than the MBNT-S across countries. Overall, all three Spanish naming tests differentiated nondemented and moderately demented individuals, but the results suggest the items of the TNT are most appropriate to use with Spanish-speakers. Preliminary normative data for the three tests examined in each country are provided. PMID:19208960
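Internal consistency, as compared across the naming tests above, is most often summarized with Cronbach's alpha. The abstract does not name the coefficient the authors used, so the sketch below is only an illustration of that common choice; the function name and the response matrix are hypothetical.

```python
from statistics import variance
from typing import List, Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a persons-by-items score matrix.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores),
    using sample variances throughout.
    """
    k = len(scores[0])
    item_variances: List[float] = [
        variance([row[j] for row in scores]) for j in range(k)
    ]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical matrix: five respondents, four dichotomously scored naming items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(cronbach_alpha(responses))
```

The coefficient requires at least two items and some spread in the total scores; with identical totals for every respondent the denominator is zero.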
Applying Rasch model analysis in the development of the Cantonese tone identification test (CANTIT).
Lee, Kathy Y S; Lam, Joffee H S; Chan, Kit T Y; van Hasselt, Charles Andrew; Tong, Michael C F
2017-01-01
Rasch analysis was applied to evaluate the internal structure of a lexical tone perception test known as the Cantonese Tone Identification Test (CANTIT). A 75-item pool (CANTIT-75) with pictures and sound tracks was developed. Respondents were required to make a four-alternative forced choice on each item. A short version of 30 items (CANTIT-30) was developed based on fit statistics, difficulty estimates, and content evaluation. Internal structure was evaluated by fit statistics and Rasch Factor Analysis (RFA). A total of 200 children with normal hearing and 141 children with hearing impairment were recruited. For CANTIT-75, all infit and 97% of outfit values were < 2.0. RFA revealed that 40.1% of total variance was explained by the Rasch measure. The first residual component explained 2.5% of total variance, with an eigenvalue of 3.1. For CANTIT-30, all infit and outfit values were < 2.0. The Rasch measure explained 38.8% of total variance, and the first residual component explained 3.9% of total variance, with an eigenvalue of 1.9. The Rasch model provides excellent guidance for the development of short forms. Both CANTIT-75 and CANTIT-30 possess satisfactory internal structure as construct validity evidence in measuring the lexical tone identification ability of Cantonese speakers.
Trani, Jean-François; Babulal, Ganesh Muneshwar; Bakhshi, Parul
2015-01-01
Background Although 80% of persons with disabilities live in low and middle-income countries, there is still a lack of comprehensive, cross-culturally validated tools to identify persons facing activity limitations and functioning difficulties in these settings. In absence of such a tool, disability estimates vary considerably according to the methodology used, and policies are based on unreliable estimates. Methods and Findings The Disability Screening Questionnaire composed of 27 items (DSQ-27) was initially designed by a group of international experts in survey development and disability in Afghanistan for a national survey. Items were selected based on major domains of activity limitations and functioning difficulties linked to an impairment as defined by the International Classification of Functioning, Disability and Health. Face, content and construct validity, as well as sensitivity and specificity were examined. Based on the results obtained, the tool was subsequently refined and expanded to 34 items, tested and validated in Darfur, Sudan. Internal consistency for the total DSQ-34 using a raw and standardized Cronbach’s Alpha and within each domain using a standardized Cronbach’s Alpha was examined in the Asian context (India and Nepal). Exploratory factor analysis (EFA) using principal axis factoring (PAF) evaluated the lowest number of factors to account for the common variance among the questions in the screen. Test-retest reliability was determined by calculating intraclass correlation (ICC) and inter-rater reliability by calculating the kappa statistic; results were checked using Bland-Altman plots. The DSQ-34 was further tested for standard error of measurement (SEM) and for the minimum detectable change (MDC). Good internal consistency was indicated by Cronbach’s Alpha of 0.83/0.82 for India and 0.76/0.78 for Nepal. 
We confirmed our assumption for EFA using the Kaiser-Meyer-Olkin measure of sampling adequacy, which was well above the accepted cutoff of 0.40 for India (0.82) and Nepal (0.82). The criteria for Bartlett’s test of sphericity were also met for both India (p < .001) and Nepal (p < .001). Estimates of reliability from the two countries reached acceptable levels of ICC of 0.75 (p<0.001) for India and 0.77 (p<0.001) for Nepal, and good strength of agreement for weighted kappa (respectively 0.77 and 0.79). The SEM/MDC was 0.80/2.22 for India and 0.96/2.66 for Nepal, indicating a smaller amount of measurement error in the screen. Conclusions In Nepal and India, the DSQ-34 shows strong psychometric properties that indicate that it effectively discriminates between persons with and without disabilities. This instrument can be used in association with other instruments for the purpose of comparing health outcomes of persons with and without disabilities in LMICs. PMID:26630668
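The SEM/MDC pairs reported above are consistent with the common 95% convention MDC = 1.96 × √2 × SEM. The abstract does not state which formula the authors used, so the sketch below is an assumption that happens to reproduce the published figures; the function name is hypothetical.

```python
import math


def minimum_detectable_change(sem: float, z: float = 1.96) -> float:
    """Minimum detectable change from the standard error of measurement.

    Assumes the common convention MDC = z * sqrt(2) * SEM, where sqrt(2)
    accounts for measurement error in both of the two scores being compared.
    """
    return z * math.sqrt(2.0) * sem


# SEM values as reported in the abstract for India and Nepal.
for country, sem in (("India", 0.80), ("Nepal", 0.96)):
    print(country, round(minimum_detectable_change(sem), 2))
```

Rounded to two decimals, the results agree with the 2.22 and 2.66 given in the abstract.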
Item-focussed Trees for the Identification of Items in Differential Item Functioning.
Tutz, Gerhard; Berger, Moritz
2016-09-01
A novel method for the identification of differential item functioning (DIF) by means of recursive partitioning techniques is proposed. We assume an extension of the Rasch model that allows for DIF being induced by an arbitrary number of covariates for each item. Recursive partitioning on the item level results in one tree for each item and leads to simultaneous selection of items and variables that induce DIF. For each item, it is possible to detect groups of subjects with different item difficulties, defined by combinations of characteristics that are not pre-specified. The way a DIF item is determined by covariates is visualized in a small tree and therefore easily accessible. An algorithm is proposed that is based on permutation tests. Various simulation studies, including the comparison with traditional approaches to identify items with DIF, show the applicability and the competitive performance of the method. Two applications illustrate the usefulness and the advantages of the new method.
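As a rough illustration of what "groups of subjects with different item difficulties" means, the sketch below evaluates a Rasch response probability in which one item's difficulty depends on a single covariate split. The split point, covariate, and difficulty values are hypothetical, and this is not the authors' tree-fitting algorithm, which selects splits by permutation tests.

```python
import math


def rasch_prob(theta: float, beta: float) -> float:
    """Rasch model: probability of a correct response given ability and difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def dif_difficulty(covariate: float, split: float,
                   beta_left: float, beta_right: float) -> float:
    """One-split tree for a single item: difficulty depends on the side of `split`."""
    return beta_left if covariate <= split else beta_right


# Hypothetical item: easier for respondents aged <= 30, harder for older ones.
theta = 0.0
p_young = rasch_prob(theta, dif_difficulty(25, 30, beta_left=-0.5, beta_right=0.5))
p_old = rasch_prob(theta, dif_difficulty(40, 30, beta_left=-0.5, beta_right=0.5))
print(f"P(correct | age 25) = {p_young:.3f}")
print(f"P(correct | age 40) = {p_old:.3f}")
```

Two respondents with the same ability then have different success probabilities on this item, which is the signature of DIF.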
ERIC Educational Resources Information Center
Finch, Holmes
2010-01-01
The accuracy of item parameter estimates in the multidimensional item response theory (MIRT) model context is one that has not been researched in great detail. This study examines the ability of two confirmatory factor analysis models specifically for dichotomous data to properly estimate item parameters using common formulae for converting factor…
Uncertainties in the Item Parameter Estimates and Robust Automated Test Assembly
ERIC Educational Resources Information Center
Veldkamp, Bernard P.; Matteucci, Mariagiulia; de Jong, Martijn G.
2013-01-01
Item response theory parameters have to be estimated, and because of the estimation process, they do have uncertainty in them. In most large-scale testing programs, the parameters are stored in item banks, and automated test assembly algorithms are applied to assemble operational test forms. These algorithms treat item parameters as fixed values,…
NON-SPECIFIC SYMPTOMS AND SCREENING OF NON-PSYCHOTIC MORBIDITY IN PRIMARY CARE
Srinivasan, T.N.; Suresh, T.R.
1990-01-01
SUMMARY Much of the non-psychotic mental morbidity in primary care goes undetected by primary care health personnel. This is often because of the non-specific somatic nature of these patients' presenting complaints and the difficulty primary care physicians have in eliciting the specific emotional symptoms needed to screen for psychiatric problems. This paper describes the development of the 7-item Primary care Psychiatric Questionnaire (PPQ), which, by requiring only non-specific symptoms to be elicited, could overcome this practical difficulty. This new screening method has been standardised against the 20-item version of the Self Report Questionnaire, which is commonly used in primary care. PMID:21927432
ERIC Educational Resources Information Center
Zickar, Michael J.; Ury, Karen L.
2002-01-01
Attempted to relate content features of personality items to item parameter estimates from the partial credit model of E. Muraki (1990) by administering the Adjective Checklist (L. Goldberg, 1992) to 329 undergraduates. As predicted, the discrimination parameter was related to the item subtlety ratings of personality items but the level of word…
Effects of Ignoring Item Interaction on Item Parameter Estimation and Detection of Interacting Items
ERIC Educational Resources Information Center
Chen, Cheng-Te; Wang, Wen-Chung
2007-01-01
This study explores the effects of ignoring item interaction on item parameter estimation and the efficiency of using the local dependence index Q[subscript 3] and the SAS NLMIXED procedure to detect item interaction under the three-parameter logistic model and the generalized partial credit model. Through simulations, it was found that ignoring…
Rodríguez-Díez, María Cristina; Alegre, Manuel; Díez, Nieves; Arbea, Leire; Ferrer, Marta
2016-02-03
The main factor that determines the selection of a medical specialty in Spain after obtaining a medical degree is the MIR ("médico interno residente", internal medical resident) exam. This exam consists of 235 multiple-choice questions with five options, some of which include images provided in a separate booklet. The aim of this study was to analyze the technical quality of the multiple-choice questions included in the MIR exam over the last five years. All the questions included in the exams from 2009 to 2013 were analyzed. We studied the proportion of questions including clinical vignettes, the number of items related to an image and the presence of technical flaws in the questions. For the analysis of technical flaws, we adapted the National Board of Medical Examiners (NBME) guidelines. We looked for 18 different issues included in the manual, grouped into two categories: issues related to testwiseness and issues related to irrelevant difficulties. The final number of questions analyzed was 1,143. The percentage of items based on clinical vignettes increased from 50% in 2009 to 56-58% in the following years (2010-2013). The percentage of items based on an image increased progressively from 10% in 2009 to 15% in 2012 and 2013. The percentage of items with at least one technical flaw varied between 68 and 72%. We observed a decrease in the percentage of items with flaws related to testwiseness, from 30% in 2009 to 20% in 2012 and 2013. While most of these issues decreased dramatically or even disappeared (such as the imbalance in the correct option numbers), the presence of non-plausible options remained frequent. With regard to technical flaws related to irrelevant difficulties, no improvement was observed; this is especially true with respect to negative stem questions and "hinged" questions. The formal quality of the MIR exam items has improved over the last five years with regard to testwiseness. 
A more detailed revision of the items submitted, checking systematically for the presence of technical flaws, could improve the validity and discriminatory power of the exam, without increasing its difficulty.
Refining a self-assessment of informatics competency scale using Mokken scaling analysis.
Yoon, Sunmoo; Shaffer, Jonathan A; Bakken, Suzanne
2015-01-01
Healthcare environments are increasingly implementing health information technology (HIT) and those from various professions must be competent to use HIT in meaningful ways. In addition, HIT has been shown to enable interprofessional approaches to health care. The purpose of this article is to describe the refinement of the Self-Assessment of Nursing Informatics Competencies Scale (SANICS) using analytic techniques based upon item response theory (IRT) and discuss its relevance to interprofessional education and practice. In a sample of 604 nursing students, the 93-item version of SANICS was examined using non-parametric IRT. The iterative modeling procedure included 31 steps comprising: (1) assessing scalability, (2) assessing monotonicity, (3) assessing invariant item ordering, and (4) expert input. SANICS was reduced to an 18-item hierarchical scale with excellent reliability. Fundamental skills for team functioning and shared decision making among team members (e.g. "using monitoring systems appropriately," "describing general systems to support clinical care") had the highest level of difficulty, and "demonstrating basic technology skills" had the lowest difficulty level. Most items reflect informatics competencies relevant to all health professionals. Further, the approaches can be applied to construct a new hierarchical scale or refine an existing scale related to informatics attitudes or competencies for various health professions.
Haggerty, Jeannie L.; Bouharaoui, Fatima; Santor, Darcy A.
2011-01-01
Evaluating the extent to which groups or subgroups of individuals differ with respect to primary healthcare experience depends on first ruling out the possibility of bias. Objective: To determine whether item or subscale performance differs systematically between French/English, high/low education subgroups and urban/rural residency. Method: A sample of 645 adult users balanced by French/English language (in Quebec and Nova Scotia, respectively), high/low education and urban/rural residency responded to six validated instruments: the Primary Care Assessment Survey (PCAS); the Primary Care Assessment Tool – Short Form (PCAT-S); the Components of Primary Care Index (CPCI); the first version of the EUROPEP (EUROPEP-I); the Interpersonal Processes of Care Survey, version II (IPC-II); and part of the Veterans Affairs National Outpatient Customer Satisfaction Survey (VANOCSS). We normalized subscale scores to a 0-to-10 scale and tested for between-group differences using ANOVA tests. We used a parametric item response model to test for differences between subgroups in item discriminability and item difficulty. We re-examined group differences after removing items with differential item functioning. Results: Experience of care was assessed more positively in the English-speaking (Nova Scotia) than in the French-speaking (Quebec) respondents. We found differential English/French item functioning in 48% of the 153 items: discriminability in 20% and differential difficulty in 28%. English items were more discriminating generally than the French. Removing problematic items did not change the differences in French/English assessments. Differential item functioning by high/low education status affected 27% of items, with items being generally more discriminating in high-education groups. Between-group comparisons were unchanged. In contrast, only 9% of items showed differential item functioning by geography, affecting principally the accessibility attribute. 
Removing problematic items reversed a previously non-significant finding, revealing poorer first-contact access in rural than in urban areas. Conclusion: Differential item functioning does not bias or invalidate French/English comparisons on subscales, but additional development is required to make French and English items equivalent. These instruments are relatively robust by educational status and geography, but results suggest potential differences in the underlying construct in low-education and rural respondents. PMID:23205035
A Comparative Study on Carbohydrate Estimation: GoCARB vs. Dietitians.
Vasiloglou, Maria F; Mougiakakou, Stavroula; Aubry, Emilie; Bokelmann, Anika; Fricker, Rita; Gomes, Filomena; Guntermann, Cathrin; Meyer, Alexa; Studerus, Diana; Stanga, Zeno
2018-06-07
GoCARB is a computer vision-based smartphone system designed for individuals with Type 1 Diabetes to estimate plated meals' carbohydrate (CHO) content. We aimed to compare the accuracy of GoCARB in estimating CHO with the estimations of six experienced dietitians. GoCARB was used to estimate the CHO content of 54 Central European plated meals, each containing three different weighed food items. Ground truth was calculated using the USDA food composition database. Dietitians were asked to visually estimate the CHO content based on meal photographs. GoCARB and dietitians achieved comparable accuracies. The mean absolute error of the dietitians was 14.9 (SD 10.12) g of CHO versus 14.8 (SD 9.73) g of CHO for GoCARB (p = 0.93). No differences were found between the estimations of dietitians and GoCARB, regardless of meal size. The larger the meal, the greater the estimation errors made by both. Moreover, the higher the CHO content of a food category, the more challenging its accurate estimation. GoCARB had difficulty in estimating rice, pasta, potatoes, and mashed potatoes, while dietitians had problems with pasta, chips, rice, and polenta. GoCARB may offer diabetic patients the option of an easy, accurate, and almost real-time estimation of the CHO content of plated meals, and thus enhance diabetes self-management.
Waller, Niels G; Feuerstahler, Leah
2017-01-01
In this study, we explored item and person parameter recovery of the four-parameter model (4PM) in over 24,000 real, realistic, and idealized data sets. In the first analyses, we fit the 4PM and three alternative models to data from three Minnesota Multiphasic Personality Inventory-Adolescent form factor scales using Bayesian modal estimation (BME). Our results indicated that the 4PM fits these scales better than simpler item response theory (IRT) models. Next, using the parameter estimates from these real data analyses, we estimated 4PM item parameters in 6,000 realistic data sets to establish minimum sample size requirements for accurate item and person parameter recovery. Using a factorial design that crossed discrete levels of item parameters, sample size, and test length, we also fit the 4PM to an additional 18,000 idealized data sets to extend our parameter recovery findings. Our combined results demonstrated that 4PM item parameters and parameter functions (e.g., item response functions) can be accurately estimated using BME in moderate to large samples (N ⩾ 5,000) and person parameters can be accurately estimated in smaller samples (N ⩾ 1,000). In the supplemental files, we report annotated [Formula: see text] code that shows how to estimate 4PM item and person parameters in [Formula: see text] (Chalmers, 2012).
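For readers unfamiliar with the 4PM, the item response function adds an upper asymptote to the three-parameter logistic model. The sketch below uses the standard four-parameter logistic form; the item parameter values are hypothetical, and this is not the authors' supplemental code.

```python
import math


def irf_4pm(theta: float, a: float, b: float, c: float, d: float) -> float:
    """Four-parameter logistic item response function.

    a: discrimination, b: difficulty, c: lower asymptote, d: upper asymptote.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters, chosen only for illustration.
item = dict(a=1.5, b=0.0, c=0.1, d=0.95)
for theta in (-3.0, 0.0, 3.0):
    print(f"theta={theta:+.1f}  P(endorse)={irf_4pm(theta, **item):.3f}")
```

At theta equal to the difficulty b, the probability sits midway between the two asymptotes; setting d = 1 recovers the 3PL and additionally setting c = 0 recovers the 2PL.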
Combining computer adaptive testing technology with cognitively diagnostic assessment.
McGlohen, Meghan; Chang, Hua-Hua
2008-08-01
A major advantage of computerized adaptive testing (CAT) is that it allows the test to home in on an examinee's ability level in an interactive manner. The aim of the new area of cognitive diagnosis is to provide information about specific content areas in which an examinee needs help. The goal of this study was to combine the benefit of specific feedback from cognitively diagnostic assessment with the advantages of CAT. In this study, three approaches to combining these were investigated: (1) item selection based on the traditional ability level estimate (theta), (2) item selection based on the attribute mastery feedback provided by cognitively diagnostic assessment (alpha), and (3) item selection based on both the traditional ability level estimate (theta) and the attribute mastery feedback provided by cognitively diagnostic assessment (alpha). The results from these three approaches were compared for theta estimation accuracy, attribute mastery estimation accuracy, and item exposure control. The theta- and alpha-based condition outperformed the alpha-based condition regarding theta estimation, attribute mastery pattern estimation, and item exposure control. Both the theta-based condition and the theta- and alpha-based condition performed similarly with regard to theta estimation, attribute mastery estimation, and item exposure control, but the theta- and alpha-based condition has an additional advantage in that it uses the shadow test method, which allows the administrator to incorporate additional constraints in the item selection process, such as content balancing, item type constraints, and so forth, and also to select items on the basis of both the current theta and alpha estimates, which can be built on top of existing 3PL testing programs.
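The theta-based condition can be pictured as picking, at each step, the unused item that is most informative at the current ability estimate. The abstract does not specify the selection rule, so maximum Fisher information under the 3PL is assumed here, and the item bank is hypothetical.

```python
import math


def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_next_item(theta_hat, bank, administered):
    """Return the index of the unused item with maximum information at theta_hat."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: info_3pl(theta_hat, *bank[i]))


# Hypothetical bank of (a, b, c) triples.
bank = [(1.0, 0.0, 0.2), (1.0, 2.0, 0.2), (2.0, 0.0, 0.2)]
first = select_next_item(0.0, bank, administered=set())
second = select_next_item(0.0, bank, administered={first})
print(first, second)
```

Operational CAT programs usually add exposure control and content constraints on top of such a rule, which is the role the abstract assigns to the shadow test method.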
Tong, Fang; Fu, Tong
2013-01-01
Objective To evaluate the differences in fluid intelligence tests between normal children and children with learning difficulties in China. Method PubMed, MD Consult, and other Chinese journal databases were searched from their establishment to November 2012. After finding comparative studies of Raven measurements of normal children and children with learning difficulties, full intelligence quotient (FIQ) values and the original values of the sub-measurements were extracted. The corresponding effect model was selected based on the results of heterogeneity testing, and parallel sub-group analysis was performed. Results Twelve documents were included in the meta-analysis, and the studies were all performed in mainland China. Among these, two studies were performed at child health clinics; the other ten sites were schools, and control children were schoolmates or classmates. FIQ was evaluated using a random effects model. WMD was −13.18 (95% CI: −16.50 to −9.85). Children with learning difficulties showed significantly lower FIQ scores than controls (P<0.00001). Type of learning difficulty and gender differences were evaluated using a fixed-effects model (I² = 0%). The sites and purposes of the studies evaluated here were taken into account, but the reasons for heterogeneity could not be eliminated. The sum IQ of all the subgroups showed considerable heterogeneity (I² = 76.5%). The sub-measurement score of document A showed moderate heterogeneity among all documents, and AB, B, and E showed considerable heterogeneity, so a random effects model was used. Individuals with learning difficulties showed heterogeneity as well. There was a moderate delay in the first three items (−0.5 to −0.9), and a much more pronounced delay in the latter three items (−1.4 to −1.6). Conclusion In mainland China, the level of fluid intelligence of children with learning difficulties was lower than that of normal children. Delayed development in sub-items C, D, and E was more obvious. 
PMID:24236016
The Development of a Post Separation/Post Divorce Problems and Stress Scale.
ERIC Educational Resources Information Center
Raschke, Helen J.
Factors associated with the speed and level of difficulty with which individuals adjust to separation and divorce were investigated. A scale was developed to analyze these factors, and included items dealing with the subdimensions of stress and the perception of the persons involved. Factor analysis of the scale items as well as additional tests…
ERIC Educational Resources Information Center
Trace, Jonathan; Brown, James Dean; Janssen, Gerriet; Kozhevnikova, Liudmila
2017-01-01
Cloze tests have been the subject of numerous studies regarding their function and use in both first language and second language contexts (e.g., Jonz & Oller, 1994; Watanabe & Koyama, 2008). From a validity standpoint, one area of investigation has been the extent to which cloze tests measure reading ability beyond the sentence level.…
Some Considerations on the Partial Credit Model
ERIC Educational Resources Information Center
Verhelst, N. D.; Verstralen, H. H. F. M.
2008-01-01
The Partial Credit Model (PCM) is sometimes interpreted as a model for stepwise solution of polytomously scored items, where the item parameters are interpreted as difficulties of the steps. It is argued that this interpretation is not justified. A model for stepwise solution is discussed. It is shown that the PCM is suited to model sums of binary…
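For reference, the PCM category probabilities under discussion can be computed as below. This is the standard PCM formula with hypothetical parameter values; the `deltas` are called item parameters here rather than step difficulties, in keeping with the abstract's argument that the stepwise reading is not justified.

```python
import math


def pcm_probs(theta, deltas):
    """Category probabilities P(X = 0..m) under the Partial Credit Model.

    P(X = k) is proportional to exp(sum_{j<=k} (theta - delta_j)),
    with the empty sum for k = 0 equal to 0.
    """
    cumulative = [0.0]
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    peak = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical three-category item (two item parameters).
probs = pcm_probs(theta=0.5, deltas=[-0.5, 1.0])
print([round(p, 3) for p in probs])
```

With a single item parameter the formula reduces to the dichotomous Rasch model.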
Language Effects in International Testing: The Case of PISA 2006 Science Items
ERIC Educational Resources Information Center
El Masri, Yasmine H.; Baird, Jo-Anne; Graesser, Art
2016-01-01
We investigate the extent to which language versions (English, French and Arabic) of the same science test are comparable in terms of item difficulty and demands. We argue that language is an inextricable part of the scientific literacy construct, be it intended or not by the examiner. This argument has considerable implications on methodologies…
Pick-N Multiple Choice-Exams: A Comparison of Scoring Algorithms
ERIC Educational Resources Information Center
Bauer, Daniel; Holzer, Matthias; Kopp, Veronika; Fischer, Martin R.
2011-01-01
To compare different scoring algorithms for Pick-N multiple correct answer multiple-choice (MC) exams regarding test reliability, student performance, total item discrimination and item difficulty. Data from six 3rd year medical students' end of term exams in internal medicine from 2005 to 2008 at Munich University were analysed (1,255 students,…
Item Mass and Complexity and the Arithmetic Computation of Students with Learning Disabilities.
ERIC Educational Resources Information Center
Cawley, John F.; Shepard, Teri; Smith, Maureen; Parmar, Rene S.
1997-01-01
The performance of 76 students (ages 10 to 15) with learning disabilities on four tasks of arithmetic computation within each of the four basic operations was examined. Tasks varied in difficulty level and number of strokes needed to complete all items. Intercorrelations between task sets and operations were examined as was the use of…
The Golden Rule Agreement is Psychometrically Defensible.
ERIC Educational Resources Information Center
Gonzalez-Tamayo, Eulogio
The agreement between the Educational Testing Service (ETS) and the Golden Rule Insurance Company of Illinois is interpreted as setting the general principles on which items must be selected to be included in a licensure test. These principles put a limit to the difficulty level of any item, and they also limit the size of the difference in…
Generic ABILHAND Questionnaire Can Measure Manual Ability across a Variety of Motor Impairments
ERIC Educational Resources Information Center
Simone, Anna; Rota, Viviana; Tesio, Luigi; Perucca, Laura
2011-01-01
ABILHAND is, in its original version, a 46-item, 4-level questionnaire. It measures the difficulty perceived by patients with rheumatoid arthritis as they do various daily manual tasks. ABILHAND was originally built through Rasch analysis. In a later study, it was simplified to a generic 23-item, three-level questionnaire, showing both…
Solving Graphics Problems: Student Performance in Junior Grades
ERIC Educational Resources Information Center
Lowrie, Tom; Diezmann, Carmel M.
2007-01-01
The authors investigated the performance of 172 Grade 4 students (9 to 10 years) over 12 months on a 36-item test that comprised items from 6 distinct graphical languages (e.g., maps) commonly used to convey mathematical information. Results revealed (a) difficulties in Grade 4 students' capacity to decode a variety of graphics, (b) significant…
ERIC Educational Resources Information Center
Sweller, Naomi
2015-01-01
Individuals with autism have difficulty generalising information from one situation to another, a process that requires the learning of categories and concepts. Category information may be learned through: (1) classifying items into categories, or (2) predicting missing features of category items. Predicting missing features has to this point been…
Unidimensional Interpretations for Multidimensional Test Items
ERIC Educational Resources Information Center
Kahraman, Nilufer
2013-01-01
This article considers potential problems that can arise in estimating a unidimensional item response theory (IRT) model when some test items are multidimensional (i.e., show a complex factorial structure). More specifically, this study examines (1) the consequences of model misfit on IRT item parameter estimates due to unintended minor item-level…
ERIC Educational Resources Information Center
Kersten, Paula; Czuba, Karol; McPherson, Kathryn; Dudley, Margaret; Elder, Hinemoa; Tauroa, Robyn; Vandal, Alain
2016-01-01
This article synthesized evidence for the validity and reliability of the Strengths and Difficulties Questionnaire in children aged 3-5 years. A systematic review using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement guidelines was carried out. Study quality was rated using the Consensus-based Standards for the…
ERIC Educational Resources Information Center
Keller, Johannes
2007-01-01
Background: Stereotype threat research revealed that negative stereotypes can disrupt the performance of persons targeted by such stereotypes. This paper contributes to stereotype threat research by providing evidence that domain identification and the difficulty level of test items moderate stereotype threat effects on female students' maths…
ERIC Educational Resources Information Center
Zhang, Jinming; Lu, Ting
2007-01-01
In practical applications of item response theory (IRT), item parameters are usually estimated first from a calibration sample. After treating these estimates as fixed and known, ability parameters are then estimated. However, the statistical inferences based on the estimated abilities can be misleading if the uncertainty of the item parameter…
Dabkowska, Małgorzata
2007-01-01
Alexithymia has been reported in various psychiatric disorders, including post-traumatic stress disorder (PTSD). The 20-item Toronto Alexithymia Scale (TAS-20) measures three inter-correlated dimensions of alexithymia: 1. difficulties in identifying feelings, 2. difficulties in describing feelings, 3. externally oriented thinking. The aim of the study was to assess the correlation between factors of TAS-20 and intensification of PTSD symptoms. The presence and degree of alexithymia were estimated using the three-factor, 20-item self-assessment Toronto Alexithymia Scale. Diagnosis and degree of intensification of PTSD were based on the PTSD-I of C.G. Watson et al. The study group consisted of 30 women who had experienced domestic violence. The women were residents of hostels for victims of domestic violence or residents of the Lonely Mother House. There was a significant correlation between factor 2 (difficulties describing feelings) scores of TAS-20 and intensification of PTSD (Spearman's correlation coefficient 0.383, p = 0.037; significant at the 0.05 level). There was no significant relationship between the scores of PTSD-I and the scores of sub-factors 1 and 3. The results emphasize that, in addition to the TAS-20 total score, the three sub-factors provide information about whether cognitive and/or affective aspects of alexithymia are associated with posttraumatic stress disorder. The most significant factor determining occurrence of PTSD symptoms in the study group of women who had experienced domestic violence was a difficulty in verbalising emotions.
The Effect of Error in Item Parameter Estimates on the Test Response Function Method of Linking.
ERIC Educational Resources Information Center
Kaskowitz, Gary S.; De Ayala, R. J.
2001-01-01
Studied the effect of item parameter estimation for computation of linking coefficients for the test response function (TRF) linking/equating method. Simulation results showed that linking was more accurate when there was less error in the parameter estimates, and that 15 or 25 common items provided better results than 5 common items under both…
ERIC Educational Resources Information Center
Wu, Yi-Fang
2015-01-01
Item response theory (IRT) uses a family of statistical models for estimating stable characteristics of items and examinees and defining how these characteristics interact in describing item and test performance. With a focus on the three-parameter logistic IRT (Birnbaum, 1968; Lord, 1980) model, the current study examines the accuracy and…
An analysis of the masking of speech by competing speech using self-report data.
Agus, Trevor R; Akeroyd, Michael A; Noble, William; Bhullar, Navjot
2009-01-01
Many of the items in the "Speech, Spatial, and Qualities of Hearing" scale questionnaire [S. Gatehouse and W. Noble, Int. J. Audiol. 43, 85-99 (2004)] are concerned with speech understanding in a variety of backgrounds, both speech and nonspeech. To study if this self-report data reflected informational masking, previously collected data on 414 people were analyzed. The lowest scores (greatest difficulties) were found for the two items in which there were two speech targets, with successively higher scores for competing speech (six items), energetic masking (one item), and no masking (three items). The results suggest significant masking by competing speech in everyday listening situations.
ERIC Educational Resources Information Center
Bazaldua, Diego A. Luna; Lee, Young-Sun; Keller, Bryan; Fellers, Lauren
2017-01-01
The performance of various classical test theory (CTT) item discrimination estimators has been compared in the literature using both empirical and simulated data, resulting in mixed results regarding the preference of some discrimination estimators over others. This study analyzes the performance of various item discrimination estimators in CTT:…
Nonparametric Item Response Curve Estimation with Correction for Measurement Error
ERIC Educational Resources Information Center
Guo, Hongwen; Sinharay, Sandip
2011-01-01
Nonparametric or kernel regression estimation of item response curves (IRCs) is often used in item analysis in testing programs. These estimates are biased when the observed scores are used as the regressor because the observed scores are contaminated by measurement error. Accuracy of this estimation is a concern theoretically and operationally.…
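The uncorrected estimator the abstract refers to can be sketched as a Nadaraya-Watson kernel regression of item responses on observed scores. The Gaussian kernel, bandwidth, and toy data below are assumptions for illustration; the measurement-error correction studied in the article is not implemented here.

```python
import math


def kernel_irc(scores, responses, grid, bandwidth):
    """Nadaraya-Watson estimate of P(correct | score) at each grid point.

    Uses a Gaussian kernel with observed scores as the regressor, i.e. the
    uncorrected estimator that is biased by measurement error in the scores.
    """
    estimates = []
    for x in grid:
        weights = [math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in scores]
        total = sum(weights)
        estimates.append(sum(w * y for w, y in zip(weights, responses)) / total)
    return estimates


# Toy data: six examinees' observed scores and 0/1 responses to one item.
scores = [0, 1, 2, 3, 4, 5]
responses = [0, 0, 0, 1, 1, 1]
curve = kernel_irc(scores, responses, grid=[0.0, 2.5, 5.0], bandwidth=1.0)
print([round(p, 3) for p in curve])
```

A smaller bandwidth follows the data more closely at the cost of a noisier curve; the choice is a tuning decision not addressed in the abstract.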
Dalton, Megan; Davidson, Megan; Keating, Jenny
2011-01-01
Is the Assessment of Physiotherapy Practice (APP) a valid instrument for the assessment of entry-level competence in physiotherapy students? Cross-sectional study with Rasch analysis of initial (n=326) and validation samples (n=318). Students were assessed on completion of 4, 5, or 6-week clinical placements across one university semester. 298 clinical educators and 456 physiotherapy students at nine universities in Australia and New Zealand provided 644 completed APP instruments. APP data in both samples showed overall fit to a Rasch model of expected item functioning for interval scale measurement. Item 6 (Written communication) exhibited misfit in both samples, but was retained as an important element of competence. The hierarchy of item difficulty was the same in both samples with items related to professional behaviour and communication the easiest to achieve and items related to clinical reasoning the most difficult. Item difficulty was well targeted to person ability. No Differential Item Functioning was identified, indicating that the scale performed in a comparable way regardless of the student's age, gender or amount of prior clinical experience, and the educator's age, gender, or experience as an educator, or the type of facility, university, or clinical area. The instrument demonstrated unidimensionality confirming the appropriateness of summing the scale scores on each item to provide an overall score of clinical competence and was able to discriminate four levels of professional competence (Person Separation Index=0.96). Person ability and raw APP scores had a linear relationship (r² = 0.99). Rasch analysis supports the interpretation that a student's APP score is an indication of their underlying level of professional competence in workplace practice. Copyright © 2011 Australian Physiotherapy Association.
Peterson, Alexander C; Sutherland, Jason M; Liu, Guiping; Crump, R Trafford; Karimuddin, Ahmer A
2018-06-01
The Fecal Incontinence Quality of Life Scale (FIQL) is a commonly used patient-reported outcome measure for fecal incontinence, often used in clinical trials, yet has not been validated in English since its initial development. This study uses modern methods to thoroughly evaluate the psychometric characteristics of the FIQL and its potential for differential functioning by gender. This study analyzed prospectively collected patient-reported outcome data from a sample of patients prior to colorectal surgery. Patients were recruited from 14 general and colorectal surgeons in Vancouver Coastal Health hospitals in Vancouver, Canada. Confirmatory factor analysis was used to assess construct validity. Item response theory was used to evaluate test reliability, describe item-level characteristics, identify local item dependence, and test for differential functioning by gender. 236 patients were included for analysis, with mean age 58 and approximately half female. Factor analysis failed to identify the lifestyle, coping, depression, and embarrassment domains, suggesting lack of construct validity. Items demonstrated low difficulty, indicating that the test has the highest reliability among individuals who have low quality of life. Five items are suggested for removal or replacement. Differential test functioning was minimal. This study has identified specific improvements that can be made to each domain of the Fecal Incontinence Quality of Life Scale and to the instrument overall. Formatting, scoring, and instructions may be simplified, and items with higher difficulty developed. The lifestyle domain can be used as is. The embarrassment domain should be significantly revised before use.
Rasch-Master's Partial Credit Model in the assessment of children's creativity in drawings.
Nakano, Tatiana de Cássia; Primi, Ricardo
2014-01-01
The purpose of the present study was to use the Partial Credit Model to study the factors of the Test of Creativity in Children and identify which characteristics of the creative person would be more effective to differentiate subjects according to their ability level. A sample of 1426 students from first to eighth grades answered the instrument. The Partial Credit Model was used to estimate the ability of the subjects and item difficulties on a common scale for each of the four factors, indicating which items required a higher level of creativity to be scored and would differentiate the more creative individuals. The results demonstrated that the greater part of the characteristics showed good fit indices, with values between 0.80 and 1.30 for both infit and outfit, indicating a response pattern consistent with the model. The characteristics of Unusual Perspective, Expression of Emotion and Originality were identified as better predictors of creative performance because they require a greater ability level (usually above two standard deviations). These results may be used in the future development of a reduced form of the instrument or a simplification of the current correction model.
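The Partial Credit Model named in the record above can be illustrated with a minimal sketch. The function names, step difficulties, and ability values are hypothetical and are not taken from the study; the sketch only shows how category probabilities follow from a person ability (theta) and an item's step difficulties under the standard form of the model.

```python
import math

def pcm_category_probs(theta, step_difficulties):
    """Category probabilities for one item under the Partial Credit Model.

    theta: person ability (logits).
    step_difficulties: hypothetical step parameters delta_1..delta_m (logits).
    Returns probabilities for scores 0..m.
    """
    # Cumulative sums of (theta - delta_j); score 0 has an empty sum of 0.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(theta, step_difficulties):
    """Model-expected item score at a given ability."""
    probs = pcm_category_probs(theta, step_difficulties)
    return sum(score * p for score, p in enumerate(probs))

if __name__ == "__main__":
    steps = [-0.5, 1.0, 2.0]  # illustrative values only
    for theta in (-1.0, 0.0, 2.0):
        print(theta, [round(p, 3) for p in pcm_category_probs(theta, steps)])
```

With a single step the model reduces to the dichotomous Rasch model, which is one way to sanity-check an implementation.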
Chatterji, Madhabi
2002-01-01
This study examines validity of data generated by the School Readiness for Reforms: Leader Questionnaire (SRR-LQ) using an iterative procedure that combines classical and Rasch rating scale analysis. Following content-validation and pilot-testing, principal axis factor extraction and promax rotation of factors yielded a five factor structure consistent with the content-validated subscales of the original instrument. Factors were identified based on inspection of pattern and structure coefficients. The rotated factor pattern, inter-factor correlations, convergent validity coefficients, and Cronbach's alpha reliability estimates supported the hypothesized construct properties. To further examine unidimensionality and efficacy of the rating scale structures, item-level data from each factor-defined subscale were subjected to analysis with the Rasch rating scale model. Data-to-model fit statistics and separation reliability for items and persons met acceptable criteria. Rating scale results suggested consistency of expected and observed step difficulties in rating categories, and correspondence of step calibrations with increases in the underlying variables. The combined approach yielded more comprehensive diagnostic information on the quality of the five SRR-LQ subscales; further research is continuing.
Levac, Danielle; Nawrotek, Joanna; Deschenes, Emilie; Giguere, Tia; Serafin, Julie; Bilodeau, Martin; Sveistrup, Heidi
2016-06-01
Virtual reality active video games are increasingly popular physical therapy interventions for children with cerebral palsy. However, physical therapists require educational resources to support decision making about game selection to match individual patient goals. Quantifying the movements elicited during virtual reality active video game play can inform individualized game selection in pediatric rehabilitation. The objectives of this study were to develop and evaluate the feasibility and reliability of the Movement Rating Instrument for Virtual Reality Game Play (MRI-VRGP). Item generation occurred through an iterative process of literature review and sample videotape viewing. The MRI-VRGP includes 25 items quantifying upper extremity, lower extremity, and total body movements. A total of 176 videotaped 90-second game play sessions involving 7 typically developing children and 4 children with cerebral palsy were rated by 3 raters trained in MRI-VRGP use. Children played 8 games on 2 virtual reality and active video game systems. Intraclass correlation coefficients (ICCs) determined intrarater and interrater reliability. Excellent intrarater reliability was evidenced by ICCs of >0.75 for 17 of the 25 items across the 3 raters. Interrater reliability estimates were less precise. Excellent interrater reliability was achieved for far reach upper extremity movements (ICC=0.92 for right and ICC=0.90 for left) and for squat (ICC=0.80) and jump items (ICC=0.99), with 9 items achieving ICCs of >0.70, 12 items achieving ICCs of between 0.40 and 0.70, and 4 items achieving poor reliability: close-reach upper extremity (ICC=0.14 for right and ICC=0.07 for left) and single-leg stance (ICC=0.55 for right and ICC=0.27 for left). Poor video quality, differing item interpretations between raters, and difficulty quantifying the high-speed movements involved in game play affected reliability. With item definition clarification and further psychometric property evaluation, the MRI-VRGP could inform the content of educational resources for therapists by ranking games according to frequency and type of elicited body movements.
Nawrotek, Joanna; Deschenes, Emilie; Giguere, Tia; Serafin, Julie; Bilodeau, Martin; Sveistrup, Heidi
2016-01-01
Background Virtual reality active video games are increasingly popular physical therapy interventions for children with cerebral palsy. However, physical therapists require educational resources to support decision making about game selection to match individual patient goals. Quantifying the movements elicited during virtual reality active video game play can inform individualized game selection in pediatric rehabilitation. Objective The objectives of this study were to develop and evaluate the feasibility and reliability of the Movement Rating Instrument for Virtual Reality Game Play (MRI-VRGP). Methods Item generation occurred through an iterative process of literature review and sample videotape viewing. The MRI-VRGP includes 25 items quantifying upper extremity, lower extremity, and total body movements. A total of 176 videotaped 90-second game play sessions involving 7 typically developing children and 4 children with cerebral palsy were rated by 3 raters trained in MRI-VRGP use. Children played 8 games on 2 virtual reality and active video game systems. Intraclass correlation coefficients (ICCs) determined intrarater and interrater reliability. Results Excellent intrarater reliability was evidenced by ICCs of >0.75 for 17 of the 25 items across the 3 raters. Interrater reliability estimates were less precise. Excellent interrater reliability was achieved for far reach upper extremity movements (ICC=0.92 for right and ICC=0.90 for left) and for squat (ICC=0.80) and jump items (ICC=0.99), with 9 items achieving ICCs of >0.70, 12 items achieving ICCs of between 0.40 and 0.70, and 4 items achieving poor reliability: close-reach upper extremity (ICC=0.14 for right and ICC=0.07 for left) and single-leg stance (ICC=0.55 for right and ICC=0.27 for left). Conclusions Poor video quality, differing item interpretations between raters, and difficulty quantifying the high-speed movements involved in game play affected reliability. With item definition clarification and further psychometric property evaluation, the MRI-VRGP could inform the content of educational resources for therapists by ranking games according to frequency and type of elicited body movements. PMID:27251029
ERIC Educational Resources Information Center
Penfield, Randall D.; Algina, James
2006-01-01
One approach to measuring unsigned differential test functioning is to estimate the variance of the differential item functioning (DIF) effect across the items of the test. This article proposes two estimators of the DIF effect variance for tests containing dichotomous and polytomous items. The proposed estimators are direct extensions of the…
Lynskey, M T; Agrawal, A
2007-09-01
DSM-IV criteria for illicit drug abuse and dependence are largely based on criteria developed for alcohol use disorders and there is a lack of research evidence on the psychometric properties of these symptoms when applied to illicit drugs. This study utilizes data on abuse/dependence criteria for cannabis, cocaine, stimulants, sedatives, tranquilizers, opiates, hallucinogens and inhalants from the National Epidemiological Survey on Alcohol and Related Conditions (NESARC, n=43 093). Analyses included factor analysis to explore the dimensionality of illicit drug abuse and dependence criteria, calculation of item difficulty and discrimination within an item response framework and a descriptive analysis of 'diagnostic orphans': individuals meeting criteria for 1-2 dependence symptoms but not abuse. Rates of psychiatric disorders were compared across groups. Results favor a uni-dimensional construct for abuse/dependence on each of the eight drug classes. Factor loadings, item difficulty and discrimination were remarkably consistent across drug categories. For each drug category, between 29% and 51% of all individuals meeting criteria for at least one symptom did not receive a formal diagnosis of either abuse or dependence and were therefore classified as 'orphans'. Mean rates of disorder in these individuals suggested that illicit drug use disorders may be more adequately described along a spectrum of severity. While there were remarkable similarities across categories of illicit drugs, consideration of item difficulty suggested that some alterations to DSM regarding the relevant severity of specific abuse and dependence criteria may be warranted.
An Ethical Issue Scale for Community Pharmacy Setting (EISP): Development and Validation.
Crnjanski, Tatjana; Krajnovic, Dusanka; Tadic, Ivana; Stojkov, Svetlana; Savic, Mirko
2016-04-01
Many problems that arise when providing pharmacy services may contain some ethical components and the aims of this study were to develop and validate a scale that could assess difficulties of ethical issues, as well as the frequency of those occurrences in everyday practice of community pharmacists. Development and validation of the scale was conducted in three phases: (1) generating items for the initial survey instrument after qualitative analysis; (2) defining the design and format of the instrument; (3) validation of the instrument. The constructed Ethical Issue scale for community pharmacy setting has two parts containing the same 16 items for assessing the difficulty and frequency thereof. The results of the 171 completely filled out scales were analyzed (response rate 74.89%). The Cronbach's α value of the part of the instrument that examines difficulties of the ethical situations was 0.83 and for the part of the instrument that examined frequency of the ethical situations was 0.84. Test-retest reliability for both parts of the instrument was satisfactory with all intraclass correlation coefficient (ICC) values above 0.6 (for the part that examines severity ICC = 0.809, for the part that examines frequency ICC = 0.929). The 16-item scale, as a self-assessment tool, demonstrated a high degree of content, criterion, and construct validity and test-retest reliability. The results support its use as a research tool to assess difficulty and frequency of ethical issues in community pharmacy setting. The validated scale needs to be further employed on a larger sample of pharmacists.
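Cronbach's α, reported above for both parts of the scale, can be computed from an item-score matrix as in the following minimal sketch. The function name and the toy data are hypothetical; the study's own data and software are not described in the record, so this only illustrates the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of total scores).

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondent rows.

    responses: list of rows, each row a list of k item scores
    (needs at least 2 respondents and 2 items).
    """
    k = len(responses[0])
    # Transpose rows into per-item columns.
    item_columns = list(zip(*responses))
    item_variances = [variance(column) for column in item_columns]
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

if __name__ == "__main__":
    # Illustrative ratings only: 5 respondents, 3 items.
    toy = [[3, 4, 3], [2, 2, 3], [5, 4, 5], [1, 2, 1], [4, 5, 4]]
    print(round(cronbach_alpha(toy), 3))
```

Sample variances are used for both the items and the total score; using population variances throughout gives the same α because the correction factor cancels.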
The Utility of the Family Empowerment Scale With Custodial Grandmothers
Hayslip, Bert; Smith, Gregory C.; Montoro-Rodriguez, Julian; Streider, Frederick H.; Merchant, William
2016-01-01
The Family Empowerment Scale (FES) was developed specifically to assess empowerment in families with emotional disorders. Its relevance to custodial grandfamilies is reflected in the difficulties in grandchildren's social, emotional, and behavioral functioning, wherein such difficulties may be explained via either reactions to changes in their family structure or in their responses to the newly formed family unit. Utilizing 27 items derived from the 34-item version of the FES, which had represented differential levels of empowerment (family, service system, community) as indexed by one's attitudes, knowledge, and behavior, we explored the factor structure, internal consistency, construct, and convergent validity of the FES with grandparent caregivers. Three-hundred forty-three (M age = 58.45, SD = 8.22, n Caucasian = 152, n African American = 149, n Hispanic = 38) custodial grandmothers caring for grandchildren between ages 4 and 12 years completed the 27 FES items and various measures of their psychological well-being, grandchild psychological difficulties, emotional support, and parenting practices. Factor analysis revealed three factors that differed slightly from the originally proposed FES subscales: Parental Self-Efficacy/Self-Confidence, Service Activism, and Service Knowledge. Each of the factors was internally consistent, and derived factor scores were moderately interrelated, speaking to the question of convergent validity. The construct validity of these three factors was evidenced by meaningful patterns of statistically significant correlations with grandmothers’ psychological well-being, grandchild psychological difficulties, emotional support, and parenting practices. These factor scores were independent of grandmother age, health, and education. These findings suggest the newly identified FES factors to be valuable in understanding empowerment among grandmother caregivers. PMID:26452627
The Utility of the Family Empowerment Scale With Custodial Grandmothers.
Hayslip, Bert; Smith, Gregory C; Montoro-Rodriguez, Julian; Streider, Frederick H; Merchant, William
2017-03-01
The Family Empowerment Scale (FES) was developed specifically to assess empowerment in families with emotional disorders. Its relevance to custodial grandfamilies is reflected in the difficulties in grandchildren's social, emotional, and behavioral functioning, wherein such difficulties may be explained via either reactions to changes in their family structure or in their responses to the newly formed family unit. Utilizing 27 items derived from the 34-item version of the FES, which had represented differential levels of empowerment (family, service system, community) as indexed by one's attitudes, knowledge, and behavior, we explored the factor structure, internal consistency, construct, and convergent validity of the FES with grandparent caregivers. Three-hundred forty-three (M age = 58.45, SD = 8.22, n Caucasian = 152, n African American = 149, n Hispanic = 38) custodial grandmothers caring for grandchildren between ages 4 and 12 years completed the 27 FES items and various measures of their psychological well-being, grandchild psychological difficulties, emotional support, and parenting practices. Factor analysis revealed three factors that differed slightly from the originally proposed FES subscales: Parental Self-Efficacy/Self-Confidence, Service Activism, and Service Knowledge. Each of the factors was internally consistent, and derived factor scores were moderately interrelated, speaking to the question of convergent validity. The construct validity of these three factors was evidenced by meaningful patterns of statistically significant correlations with grandmothers' psychological well-being, grandchild psychological difficulties, emotional support, and parenting practices. These factor scores were independent of grandmother age, health, and education. These findings suggest the newly identified FES factors to be valuable in understanding empowerment among grandmother caregivers.
Chan, Kitty S; Gross, Alden L; Pezzin, Liliana E; Brandt, Jason; Kasper, Judith D
2015-12-01
To harmonize measures of cognitive performance using item response theory (IRT) across two international aging studies. Data for persons ≥65 years from the Health and Retirement Study (HRS, N = 9,471) and the English Longitudinal Study of Aging (ELSA, N = 5,444). Cognitive performance measures varied (HRS fielded 25, ELSA 13); 9 were in common. Measurement precision was examined for IRT scores based on (a) common items, (b) common items adjusted for differential item functioning (DIF), and (c) DIF-adjusted all items. Three common items (day of date, immediate word recall, and delayed word recall) demonstrated DIF by survey. Adding survey-specific items improved precision but mainly for HRS respondents at lower cognitive levels. IRT offers a feasible strategy for harmonizing cognitive performance measures across other surveys and for other multi-item constructs of interest in studies of aging. Practical implications depend on sample distribution and the difficulty mix of in-common and survey-specific items. © The Author(s) 2015.
Hogge, Michaël; Adam, Stéphane; Collette, Fabienne
2008-07-01
The directed forgetting effect obtained with the item method is supposed to depend on both selective rehearsal of to-be-remembered (TBR) items and attentional inhibition of to-be-forgotten (TBF) items. In this study, we investigated the locus of the directed forgetting deficit in older adults by exploring the influence of recollection and familiarity-based retrieval processes on age-related differences in directed forgetting. Moreover, we explored the influence of processing speed, short-term memory capacity, thought suppression tendencies, and sensitivity to proactive interference on performance. The results indicated that older adults' directed forgetting difficulties are due to decreased recollection of TBR items, associated with increased automatic retrieval of TBF items. Moreover, processing speed and proactive interference appeared to be responsible for the decreased recall of TBR items.
Lim, Bee Chiu; Kueh, Yee Cheng; Arifin, Wan Nor; Ng, Kok Huan
2016-01-01
Background Heart disease knowledge is an important concept for health education, yet there is a lack of evidence on properly validated instruments used to measure levels of heart disease knowledge in the Malaysian context. Methods A cross-sectional survey design was used to examine the psychometric properties of the adapted English version of the Heart Disease Knowledge Questionnaire (HDKQ). Using proportionate cluster sampling, 788 undergraduate students at Universiti Sains Malaysia, Malaysia, were recruited and completed the HDKQ. Item analysis and confirmatory factor analysis (CFA) were used for the psychometric evaluation. Construct validity of the measurement model was included. Results Most of the students were Malay (48%), female (71%), and from the field of science (51%). An acceptable range was obtained with respect to both the difficulty and discrimination indices in the item analysis results. The difficulty index ranged from 0.12 to 0.91, and a discrimination index of ≥ 0.20 was reported for the final retained 23 items. The final CFA model showed an adequate fit to the data, yielding a 23-item, one-factor model [weighted least squares mean and variance adjusted scaled chi-square difference = 1.22, degrees of freedom = 2, P-value = 0.544, the root mean square error of approximation = 0.03 (90% confidence interval = 0.03, 0.04); close-fit P-value > 0.950]. Conclusion Adequate psychometric values were obtained for Malaysian undergraduate university students using the 23-item, one-factor model of the adapted HDKQ. PMID:27660543
Lim, Bee Chiu; Kueh, Yee Cheng; Arifin, Wan Nor; Ng, Kok Huan
2016-07-01
Heart disease knowledge is an important concept for health education, yet there is a lack of evidence on properly validated instruments used to measure levels of heart disease knowledge in the Malaysian context. A cross-sectional survey design was used to examine the psychometric properties of the adapted English version of the Heart Disease Knowledge Questionnaire (HDKQ). Using proportionate cluster sampling, 788 undergraduate students at Universiti Sains Malaysia, Malaysia, were recruited and completed the HDKQ. Item analysis and confirmatory factor analysis (CFA) were used for the psychometric evaluation. Construct validity of the measurement model was included. Most of the students were Malay (48%), female (71%), and from the field of science (51%). An acceptable range was obtained with respect to both the difficulty and discrimination indices in the item analysis results. The difficulty index ranged from 0.12 to 0.91, and a discrimination index of ≥ 0.20 was reported for the final retained 23 items. The final CFA model showed an adequate fit to the data, yielding a 23-item, one-factor model [weighted least squares mean and variance adjusted scaled chi-square difference = 1.22, degrees of freedom = 2, P-value = 0.544, the root mean square error of approximation = 0.03 (90% confidence interval = 0.03, 0.04); close-fit P-value > 0.950]. Adequate psychometric values were obtained for Malaysian undergraduate university students using the 23-item, one-factor model of the adapted HDKQ.
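The difficulty and discrimination indices mentioned in the record above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the record does not say how the indices were computed, so the sketch assumes the classical definitions (difficulty as the proportion answering correctly, discrimination as the difference in that proportion between upper and lower scoring groups, here 27% each by default). The function name and data are hypothetical.

```python
def item_analysis(scores, group_fraction=0.27):
    """Classical item analysis for dichotomous (0/1) item scores.

    scores: list of respondent rows, each a list of 0/1 item scores.
    group_fraction: assumed share of respondents in the upper and lower groups.
    Returns a list of (difficulty_index, discrimination_index) per item.
    """
    n_respondents = len(scores)
    n_items = len(scores[0])
    # Rank respondents by total score, highest first.
    ranked = sorted(scores, key=sum, reverse=True)
    group_size = max(1, round(n_respondents * group_fraction))
    upper, lower = ranked[:group_size], ranked[-group_size:]
    results = []
    for i in range(n_items):
        # Difficulty index: proportion of all respondents answering correctly.
        difficulty = sum(row[i] for row in scores) / n_respondents
        # Discrimination index: upper-group minus lower-group proportion correct.
        discrimination = (sum(row[i] for row in upper)
                          - sum(row[i] for row in lower)) / group_size
        results.append((difficulty, discrimination))
    return results

if __name__ == "__main__":
    toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]  # illustrative only
    for index, (p, d) in enumerate(item_analysis(toy, group_fraction=0.5), 1):
        print(f"item {index}: difficulty={p:.2f}, discrimination={d:.2f}")
```

Under these definitions a higher difficulty index means an easier item, and a retention rule such as "discrimination ≥ 0.20" would be applied to the second value of each pair.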
ERIC Educational Resources Information Center
Liao, Chi-Wen; Livingston, Samuel A.
2008-01-01
Randomly equivalent forms (REF) of tests in listening and reading for nonnative speakers of English were created by stratified random assignment of items to forms, stratifying on item content and predicted difficulty. The study included 50 replications of the procedure for each test. Each replication generated 2 REFs. The equivalence of those 2…
An Information Analysis of 2-, 3-, and 4-Word Verbal Discrimination Learning.
ERIC Educational Resources Information Center
Arima, James K.; Gray, Francis D.
Information theory was used to quantify the difficulty of verbal discrimination (VD) learning tasks and to measure VD performance. Words for VD items were selected with high background frequency and equal a priori probabilities of being selected as a first response. Three VD lists containing only 2-, 3-, or 4-word items were created and equated for…
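One way to read the information-theoretic framing above is through Shannon entropy: if each word in an item is equally likely to be chosen first, an n-word item carries log2(n) bits of uncertainty. The sketch below is an assumption about the kind of quantity involved, not a reconstruction of the authors' measure, and the function names are hypothetical.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def uniform_item_bits(n_words):
    """Uncertainty of an n-word item when every word is equally likely a priori."""
    return entropy_bits([1.0 / n_words] * n_words)

if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "words:", round(uniform_item_bits(n), 3), "bits")
```

On this reading, a 4-word item presents twice the initial uncertainty of a 2-word item.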
ERIC Educational Resources Information Center
Chen, Chieh-Yu; Chen, Ching-I; Squires, Jane; Bian, Xiaoyan; Heo, Kay H.; Filgueiras, Alberto; Kalinina, Svetlana; Samarina, Larissa; Ermolaeva, Evgeniya; Xie, Huichao; Yu, Ting-Ying; Wu, Pei-Fang; Landeira-Fernandez, Jesus
2017-01-01
Ages & Stages Questionnaires: Social-Emotional (ASQ:SE) is a widely used screening instrument for detecting social-emotional difficulties in infants and young children. To use a screening instrument across cultures and countries, it is necessary to identify potential item-level biases and ensure item equivalence. This study investigated the…
ERIC Educational Resources Information Center
Goldhammer, Frank
2015-01-01
The main challenge of ability tests relates to the difficulty of items, whereas speed tests demand that test takers complete very easy items quickly. This article proposes a conceptual framework to represent how performance depends on both between-person differences in speed and ability and the speed-ability compromise within persons. Related…
Screening for oral health literacy in an urban dental clinic
Atchison, Kathryn A.; Gironda, Melanie W.; Messadi, Diana; Der-Martirosian, Claudia
2013-01-01
Objective Studies show that the average person fails to understand and use health care related materials to their full potential. The goal of this study was to evaluate a health literacy instrument based on the Rapid Estimate of Adult Literacy in Medicine (REALM) that incorporates dental and medical terms into one 84-item Rapid Estimate of Adult Literacy in Medicine and Dentistry (REALM-D) measure and determine its association with patient characteristics of a culturally diverse dental clinic population. Methods An 84-item dental/medical health literacy word list and a 48-item health beliefs and attitudes survey were provided to a sample of 200 adult patients seeking treatment for the first time at an oral diagnosis clinic located in a large urban medical center in Los Angeles, California. Results Of the total sample, 154 participants read all of list 1 correctly, 141 read list 2 correctly, and only 38 read list 3 correctly. Nonwhite participants had significantly lower REALM-D scores at each level of difficulty as well as the total scale score compared to white participants. Participants who reported English as not their main language had significantly lower REALM-D scores. REALM-D scores also varied significantly by level of education: as level of education increased, oral health literacy increased. At a bivariate level, race, education, and English as a main language remain predictive of health literacy in a regression model. An interaction between education and English as a main language was significant. Conclusions The REALM-D is an effective instrument for use by medical and dental clinicians in detecting differences among people of different backgrounds and for whom English was not their primary language. PMID:20545829
Leonard, Laurence B; Deevy, Patricia; Fey, Marc E; Bredin-Oja, Shelley L
2013-04-01
This study examined sentence comprehension in children with specific language impairment (SLI) in a manner designed to separate the contribution of cognitive capacity from the effects of syntactic structure. Nineteen children with SLI, 19 typically developing children matched for age (TD-A), and 19 younger typically developing children (TD-Y) matched according to sentence comprehension test scores responded to sentence comprehension items that varied in either length or their demands on cognitive capacity, based on the nature of the foils competing with the target picture. The TD-A children were accurate across all item types. The SLI and TD-Y groups were less accurate than the TD-A group on items with greater length and, especially, on items with the greatest demands on cognitive capacity. The types of errors were consistent with failure to retain details of the sentence apart from syntactic structure. The difficulty in the more demanding conditions seemed attributable to interference. Specifically, the children with SLI and the TD-Y children appeared to have difficulty retaining details of the target sentence when the information reflected in the foils closely resembled the information in the target sentence.
Both younger and older adults have difficulty updating emotional memories.
Nashiro, Kaoru; Sakaki, Michiko; Huffman, Derek; Mather, Mara
2013-03-01
The main purpose of the study was to examine whether emotion impairs associative memory for previously seen items in older adults, as previously observed in younger adults. Thirty-two younger adults and 32 older adults participated. The experiment consisted of 2 parts. In Part 1, participants learned picture-object associations for negative and neutral pictures. In Part 2, they learned picture-location associations for negative and neutral pictures; half of these pictures were seen in Part 1 whereas the other half were new. The dependent measure was how many locations of negative versus neutral items in the new versus old categories participants remembered in Part 2. Both groups had more difficulty learning the locations of old negative pictures than of new negative pictures. However, this pattern was not observed for neutral items. Despite the fact that older adults showed overall decline in associative memory, the impairing effect of emotion on updating associative memory was similar between younger and older adults.
Ramsay-Curve Item Response Theory for the Three-Parameter Logistic Item Response Model
ERIC Educational Resources Information Center
Woods, Carol M.
2008-01-01
In Ramsay-curve item response theory (RC-IRT), the latent variable distribution is estimated simultaneously with the item parameters of a unidimensional item response model using marginal maximum likelihood estimation. This study evaluates RC-IRT for the three-parameter logistic (3PL) model with comparisons to the normal model and to the empirical…
Variability in Parameter Estimates and Model Fit across Repeated Allocations of Items to Parcels
ERIC Educational Resources Information Center
Sterba, Sonya K.; MacCallum, Robert C.
2010-01-01
Different random or purposive allocations of items to parcels within a single sample are thought not to alter structural parameter estimates as long as items are unidimensional and congeneric. If, additionally, numbers of items per parcel and parcels per factor are held fixed across allocations, different allocations of items to parcels within a…
ERIC Educational Resources Information Center
Kearns, Devin M.; Steacy, Laura M.; Compton, Donald L.; Gilbert, Jennifer K.; Goodwin, Amanda P.; Cho, Eunsoo; Lindstrom, Esther R.; Collins, Alyson A.
2016-01-01
Comprehensive models of derived polymorphemic word recognition skill in developing readers, with an emphasis on children with reading difficulty (RD), have not been developed. The purpose of the present study was to model individual differences in polymorphemic word recognition ability at the item level among 5th-grade children (N = 173)…
ERIC Educational Resources Information Center
Palmieri, Patrick A.; Smith, Gregory C.
2007-01-01
The authors examined the structural validity of the parent informant version of the Strengths and Difficulties Questionnaire (SDQ) with a sample of 733 custodial grandparents. Three models of the SDQ's factor structure were evaluated with confirmatory factor analysis based on the item covariance matrix. Although indices of fit were good across all…
ERIC Educational Resources Information Center
Oruç Ertürk, Nesrin; Mumford, Simon E.
2017-01-01
This study, conducted by two researchers who were also multiple-choice question (MCQ) test item writers at a private English-medium university in an English as a foreign language (EFL) context, was designed to shed light on the factors that influence test-takers' perceptions of difficulty in English for academic purposes (EAP) vocabulary, with the…
A new self-report inventory of dyslexia for students: criterion and construct validity.
Tamboer, Peter; Vorst, Harrie C M
2015-02-01
The validity of a Dutch self-report inventory of dyslexia was ascertained in two samples of students. Six biographical questions, 20 general language statements and 56 specific language statements were based on dyslexia as a multi-dimensional deficit. Dyslexia and non-dyslexia were assessed with two criteria: identification with test results (Sample 1) and classification using biographical information (both samples). Using discriminant analyses, these criteria were predicted with various groups of statements. All together, 11 discriminant functions were used to estimate classification accuracy of the inventory. In Sample 1, 15 statements predicted the test criterion with classification accuracy of 98%, and 18 statements predicted the biographical criterion with classification accuracy of 97%. In Sample 2, 16 statements predicted the biographical criterion with classification accuracy of 94%. Estimations of positive and negative predictive value were 89% and 99%. Items of various discriminant functions were factor analysed to find characteristic difficulties of students with dyslexia, resulting in a five-factor structure in Sample 1 and a four-factor structure in Sample 2. Answer bias was investigated with measures of internal consistency reliability. Less than 20 self-report items are sufficient to accurately classify students with and without dyslexia. This supports the usefulness of self-assessment of dyslexia as a valid alternative to diagnostic test batteries. Copyright © 2015 John Wiley & Sons, Ltd.
An Evaluation of Different Statistical Targets for Assembling Parallel Forms in Item Response Theory
Ali, Usama S.; van Rijn, Peter W.
2015-01-01
Assembly of parallel forms is an important step in the test development process. Therefore, choosing a suitable theoretical framework to generate well-defined test specifications is critical. The performance of different statistical targets of test specifications using the test characteristic curve (TCC) and the test information function (TIF) was investigated. Test length, the number of test forms, and content specifications are considered as well. The TCC target results in forms that are parallel in difficulty, but not necessarily in terms of precision. Vice versa, test forms created using a TIF target are parallel in terms of precision, but not necessarily in terms of difficulty. As sometimes the focus is either on TIF or TCC, differences in either difficulty or precision can arise. Differences in difficulty can be mitigated by equating, but differences in precision cannot. In a series of simulations using a real item bank, the two-parameter logistic model, and mixed integer linear programming for automated test assembly, these differences were found to be quite substantial. When both TIF and TCC are combined into one target with manipulation to relative importance, these differences can be made to disappear.
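The distinction drawn above between a test characteristic curve (TCC) target and a test information function (TIF) target can be illustrated with a minimal sketch under the two-parameter logistic model. The item parameters and function names are hypothetical and unrelated to the item bank used in the study, and the sketch evaluates the two forms at a single ability value only; it does not perform automated test assembly.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p_correct(theta, a, b) for a, b in items)

def tif(theta, items):
    """Test information function: sum of 2PL item informations a^2 * P * (1 - P)."""
    total = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total

if __name__ == "__main__":
    # Two illustrative forms as (a, b) pairs.
    form_a = [(1.0, 0.0), (1.0, 0.0)]
    form_b = [(2.0, 0.0), (2.0, 0.0)]
    print("TCC at 0:", tcc(0.0, form_a), tcc(0.0, form_b))
    print("TIF at 0:", tif(0.0, form_a), tif(0.0, form_b))
```

At theta = 0 the two toy forms have the same expected score but different information, which mirrors the point that matching on difficulty does not guarantee matching on precision.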
Kraft, Pål; Rise, Jostein; Sutton, Stephen; Røysamb, Espen
2005-09-01
A study was conducted to explore (a) the dimensional structure of perceived behavioural control (PBC), (b) the conceptual basis of perceived difficulty items, and (c) how PBC components and instrumental and affective attitudes, respectively, relate to intention and behaviour. The material stemmed from a two-wave study of Norwegian graduate students (N = 227 for the prediction of intention and N = 110 for the prediction of behaviour). Data were analysed using confirmatory factor analysis (CFA) and multiple regression by the application of structural equation modelling (SEM). CFA suggested that PBC could be conceived of as consisting of three separate but interrelated factors (perceived control, perceived confidence and perceived difficulty), or as two separate but interrelated factors representing self-efficacy (measured by perceived difficulty and perceived confidence or by just perceived confidence) and perceived control. However, the perceived difficulty items also overlapped substantially with affective attitude. Perceived confidence was a strong predictor of exercise intention but not of recycling intention. Perceived control, however, was a strong predictor of recycling intention but not exercise intention. Affective attitudes but not instrumental attitudes were identified as substantial predictors of intentions. The findings suggest that at least under some circumstances it may be inadequate to measure PBC by means of perceived difficulty. One possible consequence may be that the role of PBC as a predictor of intention is somewhat overestimated, whereas the role of (affective) attitude may be similarly underestimated.
Estimating Total-test Scores from Partial Scores in a Matrix Sampling Design.
ERIC Educational Resources Information Center
Sachar, Jane; Suppes, Patrick
It is sometimes desirable to obtain an estimated total-test score for an individual who was administered only a subset of the items in a total test. The present study compared six methods, two of which utilize the content structure of items, to estimate total-test scores using 450 students in grades 3-5 and 60 items of the 110-item Stanford Mental…
ERIC Educational Resources Information Center
Ferrando, Pere J.
2004-01-01
This study used kernel-smoothing procedures to estimate the item characteristic functions (ICFs) of a set of continuous personality items. The nonparametric ICFs were compared with the ICFs estimated (a) by the linear model and (b) by Samejima's continuous-response model. The study was based on a conditioned approach and used an error-in-variables…
Tin, L N W; Lui, S S Y; Ho, K K Y; Hung, K S Y; Wang, Y; Yeung, H K H; Wong, T Y; Lam, S M; Chan, R C K; Cheung, E F C
2018-06-01
Evidence suggests that autism and schizophrenia share similarities in genetic, neuropsychological and behavioural aspects. Although both disorders are associated with theory of mind (ToM) impairments, few studies have directly compared ToM between autism patients and schizophrenia patients. This study aimed to investigate to what extent high-functioning autism patients and schizophrenia patients share and differ in ToM performance. Thirty high-functioning autism patients, 30 schizophrenia patients and 30 healthy individuals were recruited. Participants were matched in age, gender and estimated intelligence quotient. The verbal-based Faux Pas Task and the visual-based Yoni Task were utilised to examine first- and higher-order, affective and cognitive ToM. The task/item difficulty of two paradigms was examined using mixed model analyses of variance (ANOVAs). Multiple ANOVAs and mixed model ANOVAs were used to examine group differences in ToM. The Faux Pas Task was more difficult than the Yoni Task. High-functioning autism patients showed more severely impaired verbal-based ToM in the Faux Pas Task, but shared similar visual-based ToM impairments in the Yoni Task with schizophrenia patients. The findings that individuals with high-functioning autism shared similar but more severe impairments in verbal ToM than individuals with schizophrenia support the autism-schizophrenia continuum. The finding that verbal-based but not visual-based ToM was more impaired in high-functioning autism patients than schizophrenia patients could be attributable to the varied task/item difficulty between the two paradigms.
A meta-analysis of response-time tests of the sequential two-systems model of moral judgment.
Baron, Jonathan; Gürçay, Burcu
2017-05-01
The (generalized) sequential two-system ("default interventionist") model of utilitarian moral judgment predicts that utilitarian responses often arise from a system-two correction of system-one deontological intuitions. Response-time (RT) results that seem to support this model are usually explained by the fact that low-probability responses have longer RTs. Following earlier results, we predicted response probability from each subject's tendency to make utilitarian responses (A, "Ability") and each dilemma's tendency to elicit deontological responses (D, "Difficulty"), estimated from a Rasch model. At the point where A = D, the two responses are equally likely, so probability effects cannot account for any RT differences between them. The sequential two-system model still predicts that many of the utilitarian responses made at this point will result from system-two corrections of system-one intuitions, hence should take longer. However, when A = D, RT for the two responses was the same, contradicting the sequential model. Here we report a meta-analysis of 26 data sets, which replicated the earlier results of no RT difference overall at the point where A = D. The data sets used three different kinds of moral judgment items, and the RT equality at the point where A = D held for all three. In addition, we found that RT increased with A-D. This result holds for subjects (characterized by Ability) but not for items (characterized by Difficulty). We explain the main features of this unanticipated effect, and of the main results, with a drift-diffusion model.
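The role of the point A = D can be made concrete with a small Python sketch, assuming the standard Rasch form in which the log-odds of a utilitarian response equal A minus D. The function name and the numeric values are hypothetical; the authors' exact parameterization is not given in the abstract.

```python
import math

def p_utilitarian(ability, difficulty):
    """Rasch probability of a utilitarian response: logistic(A - D)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When A equals D the two responses are equally likely.
print(p_utilitarian(1.3, 1.3))

# When A exceeds D the utilitarian response becomes the more probable one.
print(p_utilitarian(2.0, 0.5))
```

Because the probability is exactly one half at A = D, any response-time difference observed at that point cannot be attributed to one response being rarer than the other.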
ERIC Educational Resources Information Center
Wollack, James A.; Bolt, Daniel M.; Cohen, Allan S.; Lee, Young-Sun
2002-01-01
Compared the quality of item parameter estimates for marginal maximum likelihood (MML) and Markov Chain Monte Carlo (MCMC) with the nominal response model using simulation. The quality of item parameter recovery was nearly identical for MML and MCMC, and both methods tended to produce good estimates. (SLD)
ERIC Educational Resources Information Center
Finch, Holmes; Edwards, Julianne M.
2016-01-01
Standard approaches for estimating item response theory (IRT) model parameters generally work under the assumption that the latent trait being measured by a set of items follows the normal distribution. Estimation of IRT parameters in the presence of nonnormal latent traits has been shown to generate biased person and item parameter estimates. A…
ERIC Educational Resources Information Center
Monroe, Scott; Cai, Li
2013-01-01
In Ramsay curve item response theory (RC-IRT, Woods & Thissen, 2006) modeling, the shape of the latent trait distribution is estimated simultaneously with the item parameters. In its original implementation, RC-IRT is estimated via Bock and Aitkin's (1981) EM algorithm, which yields maximum marginal likelihood estimates. This method, however,…
ERIC Educational Resources Information Center
Monroe, Scott; Cai, Li
2014-01-01
In Ramsay curve item response theory (RC-IRT) modeling, the shape of the latent trait distribution is estimated simultaneously with the item parameters. In its original implementation, RC-IRT is estimated via Bock and Aitkin's EM algorithm, which yields maximum marginal likelihood estimates. This method, however, does not produce the…
An alternative to Rasch analysis using triadic comparisons and multi-dimensional scaling
NASA Astrophysics Data System (ADS)
Bradley, C.; Massof, R. W.
2016-11-01
Rasch analysis is a principled approach for estimating the magnitude of some shared property of a set of items when a group of people assign ordinal ratings to them. In the general case, Rasch analysis not only estimates person and item measures on the same invariant scale, but also estimates the average thresholds used by the population to define rating categories. However, Rasch analysis fails when there is insufficient variance in the observed responses because it assumes a probabilistic relationship between person measures, item measures and the rating assigned by a person to an item. When only a single person is rating all items, there may be cases where the person assigns the same rating to many items no matter how many times he rates them. We introduce an alternative to Rasch analysis for precisely these situations. Our approach leverages multi-dimensional scaling (MDS) and requires only rank orderings of items and rank orderings of pairs of distances between items to work. Simulations show one variant of this approach - triadic comparisons with non-metric MDS - provides highly accurate estimates of item measures in realistic situations.
Shen, Linjun; Li, Feiming; Wattleworth, Roberta; Filipetto, Frank
2010-10-01
The Comprehensive Osteopathic Medical Licensing Examination conducted a trial of multimedia items in the 2008-2009 Level 3 testing cycle to determine (1) if multimedia items were able to test additional elements of medical knowledge and skills and (2) how to develop effective multimedia items. Forty-four content-matched multimedia and text multiple-choice items were randomly delivered to Level 3 candidates. Logistic regression and paired-samples t tests were used for pairwise and group-level comparisons, respectively. Nine pairs showed significant differences in difficulty and/or discrimination. Content analysis found that, if text narrations were less direct, multimedia materials could make items easier. When textbook terminologies were replaced by multimedia presentations, multimedia items could become more difficult. Moreover, a multimedia item was found not to be uniformly difficult for candidates at different ability levels, possibly because multimedia and text items tested different elements of the same concept. Multimedia items may be capable of measuring some constructs different from what text items can measure. Effective multimedia items with reasonable psychometric properties can be intentionally developed.
Haberman, Shelby J; Sinharay, Sandip; Chon, Kyong Hee
2013-07-01
Residual analysis (e.g. Hambleton & Swaminathan, Item response theory: principles and applications, Kluwer Academic, Boston, 1985; Hambleton, Swaminathan, & Rogers, Fundamentals of item response theory, Sage, Newbury Park, 1991) is a popular method to assess fit of item response theory (IRT) models. We suggest a form of residual analysis that may be applied to assess item fit for unidimensional IRT models. The residual analysis consists of a comparison of the maximum-likelihood estimate of the item characteristic curve with an alternative ratio estimate of the item characteristic curve. The large sample distribution of the residual is proved to be standardized normal when the IRT model fits the data. We compare the performance of our suggested residual to the standardized residual of Hambleton et al. (Fundamentals of item response theory, Sage, Newbury Park, 1991) in a detailed simulation study. We then calculate our suggested residuals using data from an operational test. The residuals appear to be useful in assessing the item fit for unidimensional IRT models.
ERIC Educational Resources Information Center
Casabianca, Jodi M.; Lewis, Charles
2015-01-01
Loglinear smoothing (LLS) estimates the latent trait distribution while making fewer assumptions about its form and maintaining parsimony, thus leading to more precise item response theory (IRT) item parameter estimates than standard marginal maximum likelihood (MML). This article provides the expectation-maximization algorithm for MML estimation…
Estimating Total-Test Scores from Partial Scores in a Matrix Sampling Design.
ERIC Educational Resources Information Center
Sachar, Jane; Suppes, Patrick
1980-01-01
The present study compared six methods, two of which utilize the content structure of items, to estimate total-test scores using 450 students and 60 items of the 110-item Stanford Mental Arithmetic Test. Three methods yielded fairly good estimates of the total-test score. (Author/RL)
A Comparison of Methods for Nonparametric Estimation of Item Characteristic Curves for Binary Items
ERIC Educational Resources Information Center
Lee, Young-Sun
2007-01-01
This study compares the performance of three nonparametric item characteristic curve (ICC) estimation procedures: isotonic regression, smoothed isotonic regression, and kernel smoothing. Smoothed isotonic regression, employed along with an appropriate kernel function, provides better estimates and also satisfies the assumption of strict…
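Kernel smoothing of an item characteristic curve can be sketched as a Nadaraya-Watson weighted average of the observed item responses. The following Python sketch assumes a Gaussian kernel and treats the ability values as known; in practice a proxy such as a rest score would stand in for ability. The data, the bandwidth and the function name `kernel_icc` are hypothetical and are not the procedure evaluated in the study.

```python
import math

def kernel_icc(theta_grid, abilities, responses, bandwidth):
    """Estimate P(correct | theta) at each grid point by kernel-weighted averaging."""
    curve = []
    for t in theta_grid:
        # Gaussian kernel weight for every examinee, centred on the grid point.
        weights = [math.exp(-0.5 * ((t - x) / bandwidth) ** 2) for x in abilities]
        weighted_correct = sum(w * y for w, y in zip(weights, responses))
        curve.append(weighted_correct / sum(weights))
    return curve

# Hypothetical examinees: ability values and 0/1 responses to one item.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
responses = [0, 0, 1, 1, 1]

curve = kernel_icc([-2.0, 0.0, 2.0], abilities, responses, bandwidth=1.0)
print(curve)
```

A larger bandwidth gives a smoother but flatter curve. Nothing in this estimator forces the curve to be monotone, which is one reason the abstract considers isotonic variants.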
Durning, Steven J; Dong, Ting; Artino, Anthony R; van der Vleuten, Cees; Holmboe, Eric; Schuwirth, Lambert
2015-08-01
An ongoing debate exists in the medical education literature regarding the potential benefits of pattern recognition (non-analytic reasoning), actively comparing and contrasting diagnostic options (analytic reasoning) or using a combination approach. Studies have not, however, explicitly explored faculty's thought processes while tackling clinical problems through the lens of dual process theory to inform this debate. Further, these thought processes have not been studied in relation to the difficulty of the task or other potential mediating influences such as personal factors and fatigue, which could also be influenced by personal factors such as sleep deprivation. We therefore sought to determine which reasoning process(es) were used when answering clinically oriented multiple-choice questions (MCQs) and if these processes differed based on the dual process theory characteristics: accuracy, reading time and answering time as well as psychometrically determined item difficulty and sleep deprivation. We performed a think-aloud procedure to explore faculty's thought processes while taking these MCQs, coding think-aloud data based on reasoning process (analytic, nonanalytic, guessing or combination of processes) as well as word count, number of stated concepts, reading time, answering time, and accuracy. We also included questions regarding amount of work in the recent past. We then conducted statistical analyses to examine the associations between these measures such as correlations between frequencies of reasoning processes and item accuracy and difficulty. We also observed the total frequencies of different reasoning processes when questions were answered correctly and incorrectly. Regardless of whether the questions were classified as 'hard' or 'easy', non-analytical reasoning led to the correct answer more often than to an incorrect answer.
Significant correlations were found between self-reported recent number of hours worked and both think-aloud word count and number of concepts used in the reasoning, but not item accuracy. When all MCQs were included, 19% of the variance of correctness could be explained by the frequency of expression of these three think-aloud processes (analytic, nonanalytic, or combined). We found evidence to support the notion that the difficulty of an item in a test is not a systematic feature of the item itself but is always a result of the interaction between the item and the candidate. Use of analytic reasoning did not appear to improve accuracy. Our data suggest that individuals do not apply either System 1 or System 2 but instead fall along a continuum with some individuals falling at one end of the spectrum.
Wan, Li-ping; He, Run-lian; Ai, Yong-mei; Zhang, Hui-min; Xing, Min; Yang, Lin; Song, Yan-long; Yu, Hong-mei
2013-07-01
To introduce the Item Function Analysis (IFA) of the Chinese version of the Quality of Life-Alzheimer's Disease (QOL-AD) scale and to explore the feasibility of its application to Chinese patients with AD. Two hundred AD patients, selected by stratified cluster sampling, were interviewed and assessed with the QOL-AD. Multilog 7.03 was used for the Item Function Analysis. The discrimination parameter (a), difficulty parameter (b) and Item Characteristic Curve (ICC) of each item of the QOL-AD were obtained. The discrimination parameters of items 1 and 7 were below 0.6, while all the others were above 0.6. As for the ICCs, except for items 1 and 7, the first and last curves of each item were monotonic and the two curves in between had an inverted V-shape, with very steep slopes. Results from the IFA showed that the QOL-AD is applicable to Chinese patients with AD.
Development of a Frequency-based Measure of Syntactic Difficulty for Estimating Readability.
ERIC Educational Resources Information Center
Selden, Ramsay
Readability estimates are usually based on measures of word difficulty and measures of sentence difficulty. Word difficulty is measured in two ways: by the structural size and complexity of words or by reference to phenomena of language use, such as word-list frequency or the regularity of spelling patterns. Sentence difficulty is measured only in…
ERIC Educational Resources Information Center
Moses, Tim; Miao, Jing; Dorans, Neil
2010-01-01
This study compared the accuracies of four differential item functioning (DIF) estimation methods, where each method makes use of only one of the following: raw data, logistic regression, loglinear models, or kernel smoothing. The major focus was on the estimation strategies' potential for estimating score-level, conditional DIF. A secondary focus…
ERIC Educational Resources Information Center
Schlingman, Wayne M.; Prather, Edward E.; Wallace, Colin S.; Brissenden, Gina; Rudolph, Alexander L.
2012-01-01
This paper is the first in a series of investigations into the data from the recent national study using the Light and Spectroscopy Concept Inventory (LSCI). In this paper, we use classical test theory to form a framework of results that will be used to evaluate individual item difficulties, item discriminations, and the overall reliability of the…
Psychometrics of the self-report safe driving behavior measure for older adults.
Classen, Sherrilene; Wen, Pey-Shan; Velozo, Craig A; Bédard, Michel; Winter, Sandra M; Brumback, Babette; Lanford, Desiree N
2012-01-01
We investigated the psychometric properties of the 68-item Safe Driving Behavior Measure (SDBM) with 80 older drivers, 80 caregivers, and 2 evaluators from two sites. Using Rasch analysis, we examined unidimensionality and local dependence; rating scale; item- and person-level psychometrics; and item hierarchy of older drivers, caregivers, and driving evaluators who had completed the SDBM. The evidence suggested the SDBM is unidimensional, but pairs of items showed local dependency. Across the three rater groups, the data showed good person (≥3.4) and item (≥3.6) separation as well as good person (≥.93) and item reliability (≥.92). Cronbach's α was ≥.96, and few items were misfitting. Some of the items did not follow the hypothesized order of item difficulty. The SDBM classified the older drivers into six ability levels, but to fully calibrate the instrument it must be refined in terms of its items (e.g., item exclusion) and then tested among participants of lesser ability. Copyright © 2012 by the American Occupational Therapy Association, Inc.
2012-01-01
Background: Item response theory (IRT) is extensively used to develop adaptive instruments of health-related quality of life (HRQoL). However, each IRT model has its own function to estimate item and category parameters, and hence different results may be found using the same response categories with different IRT models. The present study used the Rasch rating scale model (RSM) to examine and reassess the psychometric properties of the Persian version of the PedsQL™ 4.0 Generic Core Scales. Methods: The PedsQL™ 4.0 Generic Core Scales was completed by 938 Iranian school children and their parents. Convergent, discriminant and construct validity of the instrument were assessed by classical test theory (CTT). The RSM was applied to investigate person and item reliability, item statistics and ordering of response categories. Results: The CTT method showed that the scaling success rate for convergent and discriminant validity were 100% in all domains with the exception of physical health in the child self-report. Moreover, confirmatory factor analysis supported a four-factor model similar to its original version. The RSM showed that 22 out of 23 items had acceptable infit and outfit statistics (<1.4, >0.6), person reliabilities were low, item reliabilities were high, and item difficulty ranged from -1.01 to 0.71 and -0.68 to 0.43 for child self-report and parent proxy-report, respectively. Also the RSM showed that successive response categories for all items were not located in the expected order. Conclusions: This study revealed that, in all domains, the five response categories did not perform adequately. It is not known whether this problem is a function of the meaning of the response choices in the Persian language or an artifact of a mostly healthy population that did not use the full range of the response categories. The response categories should be evaluated in further validation studies, especially in large samples of chronically ill patients. PMID: 22414135
Kashiwagi, Mitsuru; Suzuki, Shuhei
2009-09-01
Many children with developmental disorders are known to have motor impairment such as clumsiness and poor physical ability; however, the objective evaluation of such difficulties is not easy in routine clinical practice. In this study, we aimed to establish a simple method for evaluating motor difficulty of childhood. This method employs a scored interview and examination for detecting soft neurological signs (SNSs). After a preliminary survey with 22 normal children, we set the items and the cutoffs for the interview and SNSs. The interview consisted of questions pertaining to 12 items related to a child's motor skills in his/her past and current life, such as skipping, jumping a rope, ball sports, origami, and using chopsticks. The SNS evaluation included 5 tests, namely, standing on one leg with eyes closed, diadochokinesia, associated movements during diadochokinesia, finger opposition test, and laterally fixed gaze. We applied this method to 43 children, including 25 cases of developmental disorders. Children showing significantly high scores in both the interview and SNS were assigned to the "with motor difficulty" group, while those with low scores in both the tests were assigned to the "without motor difficulty" group. The remaining children were assigned to the "with suspicious motor difficulty" group. More than 90% of the children in the "with motor difficulty" group had high impairment scores in Movement Assessment Battery for Children (M-ABC), a standardized motor test, whereas 82% of the children in the "without motor difficulty" group revealed no motor impairment. Thus, we conclude that our simple method and criteria would be useful for the evaluation of motor difficulty of childhood. Further, we have discussed the diagnostic process for developmental coordination disorder using our evaluation method.
Zhao, Yue
2017-03-01
In patient-reported outcome research that utilizes item response theory (IRT), using statistical significance tests to detect misfit is usually the focus of IRT model-data fit evaluations. However, such evaluations rarely address the impact/consequence of using misfitting items on the intended clinical applications. This study was designed to evaluate the impact of IRT item misfit on score estimates and severity classifications and to demonstrate a recommended process of model-fit evaluation. Using secondary data sources collected from the Patient-Reported Outcome Measurement Information System (PROMIS) wave 1 testing phase, analyses were conducted based on PROMIS depression (28 items; 782 cases) and pain interference (41 items; 845 cases) item banks. The identification of misfitting items was assessed using Orlando and Thissen's summed-score item-fit statistics and graphical displays. The impact of misfit was evaluated according to the agreement of both IRT-derived T-scores and severity classifications between inclusion and exclusion of misfitting items. The examination of the presence and impact of misfit suggested that item misfit had a negligible impact on the T-score estimates and severity classifications with the general population sample in the PROMIS depression and pain interference item banks, implying that the impact of item misfit was insignificant. Findings support the T-score estimates in the two item banks as robust against item misfit at both the group and individual levels and add confidence to the use of T-scores for severity diagnosis in the studied sample. Recommendations on approaches for identifying item misfit (statistical significance) and assessing the misfit impact (practical significance) are given.
Lamb, Sarah E; McCabe, Chris; Becker, Clemens; Fried, Linda P; Guralnik, Jack M
2008-10-01
Falls are a major cause of disability, dependence, and death in older people. Brief screening algorithms may be helpful in identifying risk and leading to more detailed assessment. Our aim was to determine the most effective sequence of falls screening test items from a wide selection of recommended items including self-report and performance tests, and to compare performance with other published guidelines. Data were from a prospective, age-stratified, cohort study. Participants were 1002 community-dwelling women aged 65 years old or older, experiencing at least some mild disability. Assessments of fall risk factors were conducted in participants' homes. Fall outcomes were collected at 6 monthly intervals. Algorithms were built for prediction of any fall over a 12-month period using tree classification with cross-set validation. Algorithms using performance tests provided the best prediction of fall events, and achieved moderate to strong performance when compared to commonly accepted benchmarks. The items selected by the best performing algorithm were the number of falls in the last year and, in selected subpopulations, frequency of difficulty balancing while walking, a 4 m walking speed test, body mass index, and a test of knee extensor strength. The algorithm performed better than that from the American Geriatric Society/British Geriatric Society/American Academy of Orthopaedic Surgeons and other guidance, although these findings should be treated with caution. Suggestions are made on the type, number, and sequence of tests that could be used to maximize estimation of the probability of falling in older disabled women.
ERIC Educational Resources Information Center
Sinharay, Sandip
2015-01-01
The maximum likelihood estimate (MLE) of the ability parameter of an item response theory model with known item parameters was proved to be asymptotically normally distributed under a set of regularity conditions for tests involving dichotomous items and a unidimensional ability parameter (Klauer, 1990; Lord, 1983). This article first considers…
ASCAL: A Microcomputer Program for Estimating Logistic IRT Item Parameters.
ERIC Educational Resources Information Center
Vale, C. David; Gialluca, Kathleen A.
ASCAL is a microcomputer-based program for calibrating items according to the three-parameter logistic model of item response theory. It uses a modified multivariate Newton-Raphson procedure for estimating item parameters. This study evaluated this procedure using Monte Carlo Simulation Techniques. The current version of ASCAL was then compared to…
Cheng, Su-Fen; Lee-Hsieh, Jane; Turton, Michael A; Lin, Kuan-Chia
2014-06-01
Little research has investigated the establishment of norms for nursing students' self-directed learning (SDL) ability, recognized as an important capability for professional nurses. An item response theory (IRT) approach was used to establish norms for SDL abilities valid for the different nursing programs in Taiwan. The purposes of this study were (a) to use IRT with a graded response model to reexamine the SDL instrument, or the SDLI, originally developed by this research team using confirmatory factor analysis and (b) to establish SDL ability norms for the four different nursing education programs in Taiwan. Stratified random sampling with probability proportional to size was used. A minimum of 15% of students from the four different nursing education degree programs across Taiwan was selected. A total of 7,879 nursing students from 13 schools were recruited. The research instrument was the 20-item SDLI developed by Cheng, Kuo, Lin, and Lee-Hsieh (2010). IRT with the graded response model was used with a two-parameter logistic model (discrimination and difficulty) for the data analysis, calculated using MULTILOG. Norms were established using percentile rank. Analysis of item information and test information functions revealed that 18 items exhibited very high discrimination and two items had high discrimination. The test information function was higher in this range of scores, indicating greater precision in the estimate of nursing student SDL. Reliability fell between .80 and .94 for each domain and the SDLI as a whole. The total information function shows that the SDLI is appropriate for all nursing students, except for the top 2.5%. SDL ability norms were established for each nursing education program and for the nation as a whole. IRT is shown to be a potent and useful methodology for scale evaluation. The norms for SDL established in this research will provide practical standards for nursing educators and students in Taiwan.
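The graded response model named in this abstract assigns each response category a probability obtained by differencing adjacent cumulative logistic curves. A minimal Python sketch of that calculation follows, assuming Samejima's logistic form with one discrimination parameter and ordered thresholds per item. The parameter values and the function name are hypothetical, and the sketch does not reproduce MULTILOG's internal parameterization or its estimation procedure.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the graded response model.

    thresholds must be in increasing order; an item with m thresholds
    has m + 1 ordered response categories.
    """
    # Cumulative probabilities P(X >= k), padded with 1 at the bottom and 0 at the top.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the drop between neighbouring cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]

# Hypothetical five-category item with discrimination 2.0.
probs = grm_category_probs(theta=0.3, a=2.0, thresholds=[-1.5, -0.5, 0.5, 1.5])
print(probs)
```

A higher discrimination value makes the category curves steeper, so the item separates respondents near its thresholds more sharply; this is the sense in which the abstract reports items with high or very high discrimination.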
Reliability and validity of a short form household food security scale in a Caribbean community.
Gulliford, Martin C; Mahabir, Deepak; Rocke, Brian
2004-06-16
We evaluated the reliability and validity of the short form household food security scale in a different setting from the one in which it was developed. The scale was interview administered to 531 subjects from 286 households in north central Trinidad in Trinidad and Tobago, West Indies. We evaluated the six items by fitting item response theory models to estimate item thresholds, estimating agreement among respondents in the same households and estimating the slope index of income-related inequality (SII) after adjusting for age, sex and ethnicity. Item-score correlations ranged from 0.52 to 0.79 and Cronbach's alpha was 0.87. Item responses gave within-household correlation coefficients ranging from 0.70 to 0.78. Estimated item thresholds (standard errors) from the Rasch model ranged from -2.027 (0.063) for the 'balanced meal' item to 2.251 (0.116) for the 'hungry' item. The 'balanced meal' item had the lowest threshold in each ethnic group even though there was evidence of differential functioning for this item by ethnicity. Relative thresholds of other items were generally consistent with US data. Estimation of the SII, comparing those at the bottom with those at the top of the income scale, gave relative odds for an affirmative response of 3.77 (95% confidence interval 1.40 to 10.2) for the lowest severity item, and 20.8 (2.67 to 162.5) for highest severity item. Food insecurity was associated with reduced consumption of green vegetables after additionally adjusting for income and education (0.52, 0.28 to 0.96). The household food security scale gives reliable and valid responses in this setting. Differing relative item thresholds compared with US data do not require alteration to the cut-points for classification of 'food insecurity without hunger' or 'food insecurity with hunger'. The data provide further evidence that re-evaluation of the 'balanced meal' item is required.
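Cronbach's alpha, reported above as 0.87, compares the sum of the item variances with the variance of the total score. The following Python sketch applies the standard formula to a small hypothetical response matrix; the data and the function name are illustrative only and are unrelated to the Trinidad sample.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of the summed score.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)

# Hypothetical data: two items that always agree, so alpha reaches its maximum.
perfectly_consistent = [[0, 0], [1, 1], [0, 0], [1, 1]]
print(cronbach_alpha(perfectly_consistent))
```

When items agree less closely, the total-score variance shrinks relative to the summed item variances and alpha falls.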
A New Clinical Pain Knowledge Test for Nurses: Development and Psychometric Evaluation.
Bernhofer, Esther I; St Marie, Barbara; Bena, James F
2017-08-01
All nurses care for patients with pain, and pain management knowledge and attitude surveys for nurses have been around since 1987. However, no validated knowledge test exists to measure postlicensure clinicians' knowledge of the core competencies of pain management in current complex patient populations. To develop and test the psychometric properties of an instrument designed to measure pain management knowledge of postlicensure nurses. Psychometric instrument validation. Four large Midwestern U.S. hospitals. Registered nurses employed full time and part time August 2015 to April 2016, aged M = 43.25 years; time as RN, M = 16.13 years. Prospective survey design using e-mail to invite nurses to take an electronic multiple choice pain knowledge test. Content validity of initial 36-item test "very good" (95.1% agreement). Completed tests that met analysis criteria, N = 747. Mean initial test score, 69.4% correct (range 27.8-97.2). After revision/removal of 13 unacceptable questions, mean test score was 50.4% correct (range 8.7-82.6). Initial test item percent difficulty range was 15.2%-98.1%; discrimination values range, 0.03-0.50; final test item percent difficulty range, 17.6%-91.1%, discrimination values range, -0.04 to 1.04. Split-half reliability final test was 0.66. A high decision consistency reliability was identified, with test cut-score of 75%. The final 23-item Clinical Pain Knowledge Test has acceptable discrimination, difficulty, decision consistency, reliability, and validity in the general clinical inpatient nurse population. This instrument will be useful in assessing pain management knowledge of clinical nurses to determine gaps in education, evaluate knowledge after pain management education, and measure research outcomes. Copyright © 2017 American Society for Pain Management Nursing. Published by Elsevier Inc. All rights reserved.
Aritake, Sayaka; Asaoka, Shoichi; Kagimura, Tatsuo; Shimura, Akiyoshi; Futenma, Kunihiro; Komada, Yoko; Inoue, Yuichi
2015-04-01
This study was conducted to determine what symptom components or conditions of insomnia are related to subjective feelings of insomnia, low health-related quality of life (HRQOL), or depression. Data from 7,027 Japanese adults obtained using an Internet-based questionnaire survey was analyzed to examine associations between demographic variables and each sleep difficulty symptom item on the Pittsburgh Sleep Quality Index (PSQI) with the presence/absence of subjective insomnia and scores on the Short Form-8 (SF-8) and Center for Epidemiologic Studies Depression Scale (CES-D). Prevalence of subjective insomnia was 12.2% (n = 860). Discriminant function analysis revealed that item scores for sleep quality, sleep latency, and sleep medication use on the PSQI and CES-D showed relatively high discriminant function coefficients for identifying positivity for the subjective feeling of insomnia. Among respondents with subjective insomnia, a low SF-8 physical component summary score was associated with higher age, depressive state, and PSQI items for sleep difficulty and daytime dysfunction, whereas a low SF-8 mental component summary score was associated with depressive state, PSQI sleep latency, sleeping medication use, and daytime dysfunction. Depressive state was significantly associated with sleep latency, sleeping medication use, and daytime dysfunction. Among insomnia symptom components, disturbed sleep quality and sleep onset insomnia may be specifically associated with subjective feelings of the disorder. The existence of a depressive state could be significantly associated with not only subjective insomnia but also mental and physical QOL. Our results also suggest that different components of sleep difficulty, as measured by the PSQI, might be associated with mental and physical QOL and depressive status.
Dima, Alexandra Lelia; Schulz, Peter Johannes
2017-01-01
Background: The eHealth Literacy Scale (eHEALS) is a tool to assess consumers’ comfort and skills in using information technologies for health. Although evidence exists of reliability and construct validity of the scale, less agreement exists on structural validity. Objective: The aim of this study was to validate the Italian version of the eHealth Literacy Scale (I-eHEALS) in a community sample with a focus on its structural validity, by applying psychometric techniques that account for item difficulty. Methods: Two Web-based surveys were conducted among a total of 296 people living in the Italian-speaking region of Switzerland (Ticino). After examining the latent variables underlying the observed variables of the Italian scale via principal component analysis (PCA), fit indices for two alternative models were calculated using confirmatory factor analysis (CFA). The scale structure was examined via parametric and nonparametric item response theory (IRT) analyses accounting for differences between items regarding the proportion of answers indicating high ability. Convergent validity was assessed by correlations with theoretically related constructs. Results: CFA showed a suboptimal model fit for both models. IRT analyses confirmed all items measure a single dimension as intended. Reliability and construct validity of the final scale were also confirmed. The contrasting results of factor analysis (FA) and IRT analyses highlight the importance of considering differences in item difficulty when examining health literacy scales. Conclusions: The findings support the reliability and validity of the translated scale and its use for assessing Italian-speaking consumers’ eHealth literacy. PMID:28400356
Jeong, Eunju; Lesiuk, Teresa L
2011-01-01
Impairments in attention are commonly seen in individuals with traumatic brain injury (TBI). While visual attention assessment measurements have been rigorously developed and frequently used in cognitive neurorehabilitation, there is a paucity of auditory attention assessment measurements for patients with TBI. The purpose of this study was to field test a researcher-developed Music-based Attention Assessment (MAA), a melodic contour identification test designed to assess three different types of attention (i.e., sustained attention, selective attention, and divided attention), for patients with TBI. Additionally, this study aimed to evaluate the readability and comprehensibility of the test items and to examine the preliminary psychometric properties of the scale and test items. Fifteen patients diagnosed with TBI completed 3 different series of tasks in which they were required to identify melodic contours. The resulting data showed that (a) test items in each of the 3 subtests were found to have an easy to moderate level of item difficulty and an acceptable to high level of item discrimination, and (b) the musical characteristics (i.e., contour, congruence, and pitch interference) were found to be associated with the level of item difficulty, and (c) the internal consistency of the MAA as computed by Cronbach's alpha was .95. Subsequent studies using a larger sample of typical participants, along with individuals with TBI, are needed to confirm construct validity and internal consistency of the MAA. In addition, the authors recommend examination of criterion validity of the MAA as correlated with current neuropsychological attention assessment measurements.
... of items, gradual buildup of clutter in living spaces and difficulty discarding things are usually the first ... for which there is no immediate need or space. By middle age, symptoms are often severe and ...
ERIC Educational Resources Information Center
Green, Samuel B.; Yang, Yanyun
2009-01-01
A method is presented for estimating reliability using structural equation modeling (SEM) that allows for nonlinearity between factors and item scores. Assuming the focus is on consistency of summed item scores, this method for estimating reliability is preferred to those based on linear SEM models and to the most commonly reported estimate of…
Sample Size and Item Parameter Estimation Precision When Utilizing the One-Parameter "Rasch" Model
ERIC Educational Resources Information Center
Custer, Michael
2015-01-01
This study examines the relationship between sample size and item parameter estimation precision when utilizing the one-parameter model. Item parameter estimates are examined relative to "true" values by evaluating the decline in root mean squared deviation (RMSD) and the number of outliers as sample size increases. This occurs across…
Measurement Error in Nonparametric Item Response Curve Estimation. Research Report. ETS RR-11-28
ERIC Educational Resources Information Center
Guo, Hongwen; Sinharay, Sandip
2011-01-01
Nonparametric, or kernel, estimation of item response curve (IRC) is a concern theoretically and operationally. Accuracy of this estimation, often used in item analysis in testing programs, is biased when the observed scores are used as the regressor because the observed scores are contaminated by measurement error. In this study, we investigate…
A Note on the Reliability Coefficients for Item Response Model-Based Ability Estimates
ERIC Educational Resources Information Center
Kim, Seonghoon
2012-01-01
Assuming item parameters on a test are known constants, the reliability coefficient for item response theory (IRT) ability estimates is defined for a population of examinees in two different ways: as (a) the product-moment correlation between ability estimates on two parallel forms of a test and (b) the squared correlation between the true…
Evaluation of five guidelines for option development in multiple-choice item-writing.
Martínez, Rafael J; Moreno, Rafael; Martín, Irene; Trigo, M Eva
2009-05-01
This paper evaluates certain guidelines for writing multiple-choice test items. The analysis of the responses of 5013 subjects to 630 items from 21 university classroom achievement tests suggests that an option should not differ in terms of heterogeneous content because such error has a slight but harmful effect on item discrimination. This also occurs with the "None of the above" option when it is the correct one. In contrast, results do not show the supposedly negative effects of a different-length option, the use of specific determiners, or the use of the "All of the above" option, which not only decreases difficulty but also improves discrimination when it is the correct option.
Three controversies over item disclosure in medical licensure examinations.
Park, Yoon Soo; Yang, Eunbae B
2015-01-01
In response to views on the public's right to know, there is growing attention to item disclosure - release of items, answer keys, and performance data to the public - in medical licensure examinations and its potential impact on the test's ability to measure competence and select qualified candidates. Recent debates on this issue have sparked legislative action internationally, including in South Korea, with prior discussions among North American countries dating back over three decades. The purpose of this study is to identify and analyze three issues associated with item disclosure in medical licensure examinations - 1) fairness and validity, 2) impact on passing levels, and 3) utility of item disclosure - by synthesizing existing literature in relation to standards in testing. Historically, the controversy over item disclosure has centered on fairness and validity. Proponents of item disclosure stress test takers' right to know, while opponents argue from a validity perspective. Item disclosure may bias item characteristics, such as difficulty and discrimination, and has consequences for setting passing levels. To date, there has been limited research on the utility of item disclosure for large-scale testing. These issues require ongoing and careful consideration.
Best Design for Multidimensional Computerized Adaptive Testing With the Bifactor Model
Seo, Dong Gi; Weiss, David J.
2015-01-01
Most computerized adaptive tests (CATs) have been studied using the framework of unidimensional item response theory. However, many psychological variables are multidimensional and might benefit from using a multidimensional approach to CATs. This study investigated the accuracy, fidelity, and efficiency of a fully multidimensional CAT algorithm (MCAT) with a bifactor model using simulated data. Four item selection methods in MCAT were examined for three bifactor pattern designs using two multidimensional item response theory models. To compare MCAT item selection and estimation methods, a fixed test length was used. The Ds-optimality item selection improved θ estimates with respect to a general factor, and either D- or A-optimality improved estimates of the group factors in three bifactor pattern designs under two multidimensional item response theory models. The MCAT model without a guessing parameter functioned better than the MCAT model with a guessing parameter. The MAP (maximum a posteriori) estimation method provided more accurate θ estimates than the EAP (expected a posteriori) method under most conditions, and MAP showed lower observed standard errors than EAP under most conditions, except for a general factor condition using Ds-optimality item selection. PMID:29795848
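The difference between the MAP and EAP estimators compared above can be sketched in a much simpler setting than the study's. The example below is a unidimensional simplification, not the bifactor MCAT: it assumes a two-parameter logistic model, a standard normal prior, and a fixed grid, and all names are hypothetical.

```python
import math


def posterior_grid(responses, items, lo=-4.0, hi=4.0, n=161):
    """Normalized posterior over a theta grid for 0/1 responses to 2PL items (a, b)."""
    grid = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    post = []
    for t in grid:
        logp = -0.5 * t * t  # standard normal prior, up to a constant
        for u, (a, b) in zip(responses, items):
            p = 1.0 / (1.0 + math.exp(-a * (t - b)))
            logp += math.log(p if u else 1.0 - p)
        post.append(math.exp(logp))
    total = sum(post)
    return grid, [w / total for w in post]


def eap_estimate(grid, post):
    """Expected a posteriori: the posterior mean."""
    return sum(t * w for t, w in zip(grid, post))


def map_estimate(grid, post):
    """Maximum a posteriori: the grid point with the highest posterior mass."""
    return grid[max(range(len(grid)), key=post.__getitem__)]


# Hypothetical example: two identical items, one answered correctly, one not.
grid, post = posterior_grid([1, 0], [(1.0, 0.0), (1.0, 0.0)])
```

With a symmetric posterior the two estimates coincide; they diverge when the posterior is skewed, for example after an all-correct response pattern.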
ERIC Educational Resources Information Center
DeMars, Christine E.
2012-01-01
In structural equation modeling software, either limited-information (bivariate proportions) or full-information item parameter estimation routines could be used for the 2-parameter item response theory (IRT) model. Limited-information methods assume the continuous variable underlying an item response is normally distributed. For skewed and…
ERIC Educational Resources Information Center
Matthews-Lopez, Joy L.; Hombo, Catherine M.
The purpose of this study was to examine the recovery of item parameters in simulated Automatic Item Generation (AIG) conditions, using Markov chain Monte Carlo (MCMC) estimation methods to attempt to recover the generating distributions. To do this, variability in item and ability parameters was manipulated. Realistic AIG conditions were…
A Markov Chain Monte Carlo Approach to Confirmatory Item Factor Analysis
ERIC Educational Resources Information Center
Edwards, Michael C.
2010-01-01
Item factor analysis has a rich tradition in both the structural equation modeling and item response theory frameworks. The goal of this paper is to demonstrate a novel combination of various Markov chain Monte Carlo (MCMC) estimation routines to estimate parameters of a wide variety of confirmatory item factor analysis models. Further, I show…
ERIC Educational Resources Information Center
Bilir, Mustafa Kuzey
2009-01-01
This study uses a new psychometric model (mixture item response theory-MIMIC model) that simultaneously estimates differential item functioning (DIF) across manifest groups and latent classes. Current DIF detection methods investigate DIF from only one side, either across manifest groups (e.g., gender, ethnicity, etc.), or across latent classes…
ERIC Educational Resources Information Center
Kim, Kyung Yong; Lee, Won-Chan
2017-01-01
This article provides a detailed description of three factors (specification of the ability distribution, numerical integration, and frame of reference for the item parameter estimates) that might affect the item parameter estimation of the three-parameter logistic model, and compares five item calibration methods, which are combinations of the…
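For readers unfamiliar with the three-parameter logistic model named above, a minimal sketch of its item response function follows. The parameter names `a` (discrimination), `b` (difficulty), and `c` (lower asymptote) follow common convention; the optional scaling constant D = 1.7 used in some parameterizations is omitted here, which is an assumption rather than something the abstract specifies.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

Setting `c = 0` gives the two-parameter model, and additionally fixing `a = 1` gives the one-parameter (Rasch) form.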
Selective loss of verbal imagery.
Mehta, Z; Newcombe, F
1996-05-01
This single case study of the ability to generate verbal and non-verbal imagery in a woman who sustained a gunshot wound to the brain reports a significant difficulty in generating images of word shapes but not a significant problem in generating object images. Further dissociation, however, was observed in her ability to generate images of living vs non-living material. She made more errors in imagery and factual information tasks for non-living items than for living items. This pattern contrasts with our previous report of the agnosic patient, M.S., who had severe difficulty in generating images of living material, whereas his ability to image the shape of words was comparable to that of normal control subjects. Furthermore, with regard to the generation of images of living compared with non-living material, M.S. shows more errors with living than nonliving items. In contrast, the present patient, S.M., made significantly more errors with non-living relative to living items. There appear to be two types of double dissociation which reinforce the growing evidence of dissociable impairments in the ability to generate images for different types of verbal and non-verbal material. Such dissociations, presumably related to sensory and cognitive processing demands, address the problem of the neural basis of imagery.
Park, Yoon Soo; Lee, Young-Sun; Xing, Kuan
2016-01-01
This study investigates the impact of item parameter drift (IPD) on parameter and ability estimation when the underlying measurement model fits a mixture distribution, thereby violating the item invariance property of unidimensional item response theory (IRT) models. An empirical study was conducted to demonstrate the occurrence of both IPD and an underlying mixture distribution using real-world data. Twenty-one trended anchor items from the 1999, 2003, and 2007 administrations of Trends in International Mathematics and Science Study (TIMSS) were analyzed using unidimensional and mixture IRT models. TIMSS treats trended anchor items as invariant over testing administrations and uses pre-calibrated item parameters based on unidimensional IRT. However, empirical results showed evidence of two latent subgroups with IPD. Results also showed changes in the distribution of examinee ability between latent classes over the three administrations. A simulation study was conducted to examine the impact of IPD on the estimation of ability and item parameters, when data have underlying mixture distributions. Simulations used data generated from a mixture IRT model and estimated using unidimensional IRT. Results showed that data reflecting IPD using mixture IRT model led to IPD in the unidimensional IRT model. Changes in the distribution of examinee ability also affected item parameters. Moreover, drift with respect to item discrimination and distribution of examinee ability affected estimates of examinee ability. These findings demonstrate the need to caution and evaluate IPD using a mixture IRT framework to understand its effects on item parameters and examinee ability. PMID:26941699
Sources of interference in item and associative recognition memory.
Osth, Adam F; Dennis, Simon
2015-04-01
A powerful theoretical framework for exploring recognition memory is the global matching framework, in which a cue's memory strength reflects the similarity of the retrieval cues being matched against the contents of memory simultaneously. Contributions at retrieval can be categorized as matches and mismatches to the item and context cues, including the self match (match on item and context), item noise (match on context, mismatch on item), context noise (match on item, mismatch on context), and background noise (mismatch on item and context). We present a model that directly parameterizes the matches and mismatches to the item and context cues, which enables estimation of the magnitude of each interference contribution (item noise, context noise, and background noise). The model was fit within a hierarchical Bayesian framework to 10 recognition memory datasets that use manipulations of strength, list length, list strength, word frequency, study-test delay, and stimulus class in item and associative recognition. Estimates of the model parameters revealed at most a small contribution of item noise that varies by stimulus class, with virtually no item noise for single words and scenes. Despite the unpopularity of background noise in recognition memory models, background noise estimates dominated at retrieval across nearly all stimulus classes with the exception of high frequency words, which exhibited equivalent levels of context noise and background noise. These parameter estimates suggest that the majority of interference in recognition memory stems from experiences acquired before the learning episode. (c) 2015 APA, all rights reserved.
ERIC Educational Resources Information Center
Sass, D. A.; Schmitt, T. A.; Walker, C. M.
2008-01-01
Item response theory (IRT) procedures have been used extensively to study normal latent trait distributions and have been shown to perform well; however, less is known concerning the performance of IRT with non-normal latent trait distributions. This study investigated the degree of latent trait estimation error under normal and non-normal…
ERIC Educational Resources Information Center
Tsutakawa, Robert K.
This paper presents a method for estimating certain characteristics of test items which are designed to measure ability, or knowledge, in a particular area. Under the assumption that ability parameters are sampled from a normal distribution, the EM algorithm is used to derive maximum likelihood estimates to item parameters of the two-parameter…
An Analysis of the Connectedness to Nature Scale Based on Item Response Theory.
Pasca, Laura; Aragonés, Juan I; Coello, María T
2017-01-01
The Connectedness to Nature Scale (CNS) is used as a measure of the subjective cognitive connection between individuals and nature. However, to date, it has not been analyzed at the item level to confirm its quality. In the present study, we conduct such an analysis based on Item Response Theory. We employed data from previous studies using the Spanish-language version of the CNS, analyzing a sample of 1008 participants. The results show that seven items presented appropriate indices of discrimination and difficulty, in addition to a good fit. The remaining six have inadequate discrimination indices and do not present a good fit. A second study with 321 participants shows that the seven-item scale has adequate levels of reliability and validity. Therefore, it would be appropriate to use a reduced version of the scale after eliminating the items that display inappropriate behavior, since they may interfere with research results on connectedness to nature.
Short-term memory in autism spectrum disorder.
Poirier, Marie; Martin, Jonathan S; Gaigg, Sebastian B; Bowler, Dermot M
2011-02-01
Three experiments examined verbal short-term memory in comparison and autism spectrum disorder (ASD) participants. Experiment 1 involved forward and backward digit recall. Experiment 2 used a standard immediate serial recall task where, contrary to the digit-span task, items (words) were not repeated from list to list. Hence, this task called more heavily on item memory. Experiment 3 tested short-term order memory with an order recognition test: Each word list was repeated with or without the position of 2 adjacent items swapped. The ASD group showed poorer performance in all 3 experiments. Experiments 1 and 2 showed that group differences were due to memory for the order of the items, not to memory for the items themselves. Confirming these findings, the results of Experiment 3 showed that the ASD group had more difficulty detecting a change in the temporal sequence of the items. (c) 2010 APA, all rights reserved.
Development of the Serenity Scale.
Roberts, K T; Aspy, C B
1993-01-01
Serenity is a sustained inner peace. Nurses can use knowledge about serenity to help clients cope with harsh circumstances. The Serenity Scale is a 40-item self-report, summated scale that evaluates clients' serenity status. Critical attributes, identified by serenity experts, served as the theoretical framework. Sixty-five items were given to 542 male and female subjects age 20 to 95 (73% Caucasians and 27% minority) from varying income and educational levels yielding an alpha of .93. Forty items (SS.V2) were extracted for further analysis. The alpha coefficient was .92 with item-to-total correlations ranging from .25 to .67. Item means ranged from 2.6-3.7 (grand mean = 3.4). A principal components factor analysis with varimax rotation revealed nine factors explaining 58.2% of the variance. Limitations are that SS.V2 has not been tested with an independent sample and subjects with low educational levels had difficulty with some items.
Online Calibration of Polytomous Items Under the Generalized Partial Credit Model
Zheng, Yi
2016-01-01
Online calibration is a technology-enhanced architecture for item calibration in computerized adaptive tests (CATs). Many CATs are administered continuously over a long term and rely on large item banks. To ensure test validity, these item banks need to be frequently replenished with new items, and these new items need to be pretested before being used operationally. Online calibration dynamically embeds pretest items in operational tests and calibrates their parameters as response data are gradually obtained through the continuous test administration. This study extends existing formulas, procedures, and algorithms for dichotomous item response theory models to the generalized partial credit model, a popular model for items scored in more than two categories. A simulation study was conducted to investigate the developed algorithms and procedures under a variety of conditions, including two estimation algorithms, three pretest item selection methods, three seeding locations, two numbers of score categories, and three calibration sample sizes. Results demonstrated acceptable estimation accuracy of the two estimation algorithms in some of the simulated conditions. A variety of findings were also revealed for the interacted effects of included factors, and recommendations were made respectively. PMID:29881063
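The generalized partial credit model referred to above assigns a probability to each score category of a polytomous item. The sketch below shows the category probabilities under the usual parameterization with one discrimination `a` and a list of step parameters; the function name is hypothetical, and this illustrates only the model, not the online calibration algorithms developed in the study.

```python
import math


def gpcm_probs(theta, a, steps):
    """Category probabilities (scores 0..m) for an item with m step parameters."""
    # Cumulative sums z_k = sum over v <= k of a * (theta - b_v), with z_0 = 0.
    z = [0.0]
    for b in steps:
        z.append(z[-1] + a * (theta - b))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(z)
    weights = [math.exp(v - top) for v in z]
    total = sum(weights)
    return [w / total for w in weights]
```

With a single step parameter the model reduces to a two-parameter logistic item, so `gpcm_probs(0.0, 1.0, [0.0])` splits the probability evenly between the two categories.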
Using SAS PROC MCMC for Item Response Theory Models
Samonte, Kelli
2014-01-01
Interest in using Bayesian methods for estimating item response theory models has grown at a remarkable rate in recent years. This attentiveness to Bayesian estimation has also inspired a growth in available software such as WinBUGS, R packages, BMIRT, MPLUS, and SAS PROC MCMC. This article intends to provide an accessible overview of Bayesian methods in the context of item response theory to serve as a useful guide for practitioners in estimating and interpreting item response theory (IRT) models. Included is a description of the estimation procedure used by SAS PROC MCMC. Syntax is provided for estimation of both dichotomous and polytomous IRT models, as well as a discussion on how to extend the syntax to accommodate more complex IRT models. PMID:29795834
Students’ understanding of forces: Force diagrams on horizontal and inclined plane
NASA Astrophysics Data System (ADS)
Sirait, J.; Hamdani; Mursyid, S.
2018-03-01
This study aims to analyse students’ difficulties in understanding force diagrams on horizontal surfaces and inclined planes. Physics education students (pre-service physics teachers) of Tanjungpura University, who had completed a Basic Physics course, took a Force concept test which has six questions covering three concepts: an object at rest, an object moving at constant speed, and an object moving at constant acceleration, both on a horizontal surface and on an inclined plane. The test is in a multiple-choice format. It examines the ability of students to select appropriate force diagrams depending on the context. The results show that 44% of students have difficulties in solving the test (these students could solve only one or two of the six items). About 50% of students faced difficulties finding the correct diagram of an object when it has constant speed and acceleration in both contexts. In general, students could correctly identify only 48% of the force diagrams on the test. The most difficult task for the students was identifying the force diagram representing the forces exerted on an object on an inclined plane.
Reading Ability and Print Exposure: Item Response Theory Analysis of the Author Recognition Test
Moore, Mariah; Gordon, Peter C.
2015-01-01
In the Author Recognition Test (ART) participants are presented with a series of names and foils and are asked to indicate which ones they recognize as authors. The test is a strong predictor of reading skill, with this predictive ability generally explained as occurring because author knowledge is likely acquired through reading or other forms of print exposure. This large-scale study (1012 college student participants) used Item Response Theory (IRT) to analyze item (author) characteristics to facilitate identification of the determinants of item difficulty, provide a basis for further test development, and to optimize scoring of the ART. Factor analysis suggests a potential two factor structure of the ART differentiating between literary vs. popular authors. Effective and ineffective author names were identified so as to facilitate future revisions of the ART. Analyses showed that the ART is a highly significant predictor of time spent encoding words as measured using eye-tracking during reading. The relationship between the ART and time spent reading provided a basis for implementing a higher penalty for selecting foils, rather than the standard method of ART scoring (names selected minus foils selected). The findings provide novel support for the view that the ART is a valid indicator of reading volume. Further, they show that frequency data can be used to select items of appropriate difficulty and that frequency data from corpora based on particular time periods and types of text may allow test adaptation for different populations. PMID:25410405
Reading ability and print exposure: item response theory analysis of the author recognition test.
Moore, Mariah; Gordon, Peter C
2015-12-01
In the author recognition test (ART), participants are presented with a series of names and foils and are asked to indicate which ones they recognize as authors. The test is a strong predictor of reading skill, and this predictive ability is generally explained as occurring because author knowledge is likely acquired through reading or other forms of print exposure. In this large-scale study (1,012 college student participants), we used item response theory (IRT) to analyze item (author) characteristics in order to facilitate identification of the determinants of item difficulty, provide a basis for further test development, and optimize scoring of the ART. Factor analysis suggested a potential two-factor structure of the ART, differentiating between literary and popular authors. Effective and ineffective author names were identified so as to facilitate future revisions of the ART. Analyses showed that the ART is a highly significant predictor of the time spent encoding words, as measured using eyetracking during reading. The relationship between the ART and time spent reading provided a basis for implementing a higher penalty for selecting foils, rather than the standard method of ART scoring (names selected minus foils selected). The findings provide novel support for the view that the ART is a valid indicator of reading volume. Furthermore, they show that frequency data can be used to select items of appropriate difficulty, and that frequency data from corpora based on particular time periods and types of texts may allow adaptations of the test for different populations.
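The scoring rule described above (names selected minus foils selected) and the idea of a heavier foil penalty can be sketched as follows. The abstract does not state the penalty value the authors adopted, so the `foil_penalty` parameter, the function name, and the placeholder names are all hypothetical.

```python
def art_score(selected, authors, foils, foil_penalty=1.0):
    """Score an Author Recognition Test response.

    With foil_penalty=1.0 this is the standard rule: hits minus foils selected.
    """
    chosen = set(selected)
    hits = len(chosen & set(authors))
    false_alarms = len(chosen & set(foils))
    return hits - foil_penalty * false_alarms


# Placeholder names, not real test items.
authors = {"Author A", "Author B", "Author C"}
foils = {"Foil X", "Foil Y"}
picked = {"Author A", "Author B", "Foil X"}
standard = art_score(picked, authors, foils)
stricter = art_score(picked, authors, foils, foil_penalty=2.0)
```

Here the participant has two hits and one false alarm, so raising the penalty from 1 to 2 lowers the score by one point.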
Sousa, Renata M; Dewey, Michael E; Acosta, Daisy; Jotheeswaran, AT; Castro-Costa, Erico; Ferri, Cleusa P; Guerra, Mariella; Huang, Yueqin; Jacob, KS; Pichardo, Juana Guillermina Rodriguez; Ramírez, Nayeli Garcia; Rodriguez, Juan Llibre; Rodriguez, Marina Calvo; Salas, Aquiles; Sosa, Ana Luisa; Williams, Joseph; Prince, Martin J
2010-01-01
We evaluated the psychometric properties of the 12-item interviewer-administered screener version of the World Health Organization Disability Assessment Schedule – version II (WHODAS II) among older people living in seven low- and middle-income countries. Principal component analysis (PCA), confirmatory factor analysis (CFA) and Mokken analyses were carried out to test for unidimensionality, hierarchical structure, and measurement invariance across 10/66 Dementia Research Group sites. PCA generated a one-factor solution in most sites. In CFA, the two-factor solution generated in Dominican Republic fitted better for all sites other than rural China. The two factors were not easily interpretable, and may have been an artefact of differing item difficulties. Strong internal consistency and high factor loadings for the one-factor solution supported unidimensionality. Furthermore, the WHODAS II was found to be a ‘strong’ Mokken scale. Measurement invariance was supported by the similarity of factor loadings across sites, and by the high between-site correlations in item difficulties. The Mokken results strongly support that the WHODAS II 12-item screener is a unidimensional and hierarchical scale conforming to item response theory (IRT) principles, at least at the monotone homogeneity model level. More work is needed to assess the generalizability of our findings to different populations. Copyright © 2010 John Wiley & Sons, Ltd. PMID:20104493
Powell, Sarah R.; Fuchs, Lynn S.
2014-01-01
According to national mathematics standards, algebra instruction should begin at kindergarten and continue through elementary school. Most often, teachers address algebra in the elementary grades with problems related to solving equations or understanding functions. With 789 2nd-grade students, we administered (a) measures of calculations and word problems in the fall and (b) an assessment of pre-algebraic reasoning, with items that assessed solving equations and functions, in the spring. Based on the calculation and word-problem measures, we placed 148 students into 1 of 4 difficulty status categories: typically performing, calculation difficulty, word-problem difficulty, or difficulty with calculations and word problems. Analyses of variance were conducted on the 148 students; path analytic mediation analyses were conducted on the larger sample of 789 students. Across analyses, results corroborated the finding that word-problem difficulty is more strongly associated with difficulty with pre-algebraic reasoning. As an indicator of later algebra difficulty, word-problem difficulty may be a more useful predictor than calculation difficulty, and students with word-problem difficulty may require a different level of algebraic reasoning intervention than students with calculation difficulty. PMID:25309044
Platat, Carine; El Mesmoudi, Najoua; El Sadig, Mohamed; Tewfik, Ihab
2018-01-01
Although the United Arab Emirates (UAE) has one of the highest prevalences of overweight, obesity and type 2 diabetes in the world, validated dietary assessment aids to estimate food intake of individuals and populations in the UAE are currently lacking. We conducted two observational studies to evaluate the accuracy of a photographic food atlas which was developed as a tool for food portion size estimation in the UAE. The UAE Food Atlas presents eight portion sizes for each food. Study 1 involved portion size estimations of 13 food items consumed during the previous day. Study 2 involved portion size estimations of nine food items immediately after consumption. Differences between the food portion sizes estimated from the photographs and the weighed food portions (estimation error), as well as the percentage differences relative to the weighed food portion for each tested food item were calculated. Four of the evaluated food items were underestimated (by -8.9% to -18.4%), while nine were overestimated (by 9.5% to 90.9%) in Study 1. Moreover, there were significant differences between estimated and eaten food portions for eight food items (P<0.05). In Study 2, one food item was underestimated (-8.1%) while eight were overestimated (range 2.52% to 82.1%). Furthermore, there were significant differences between estimated and eaten food portions (P<0.05) for six food items. The limits of agreement between the estimated and consumed food portion size were wide indicating a large variability in food portion estimation errors. These reported findings highlight the need for further developments of the UAE Food Atlas to improve the accuracy of food portion size intake estimations in dietary assessments. Additionally, recalling food portions from the previous day did not seem to increase food portion estimation errors in this study. PMID:29698434
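The two quantities reported in the abstract above, the percentage estimation error relative to the weighed portion and the limits of agreement, might be computed along these lines. The gram values are made up for illustration, and the limits are assumed to follow the usual Bland-Altman convention (mean difference ± 1.96 standard deviations), which the abstract does not spell out.

```python
from statistics import mean, stdev


def percent_errors(estimated, weighed):
    """Percentage difference of each estimate relative to the weighed portion."""
    return [100.0 * (e - w) / w for e, w in zip(estimated, weighed)]


def limits_of_agreement(estimated, weighed):
    """Return (lower, bias, upper) using mean difference +/- 1.96 SD (assumed)."""
    diffs = [e - w for e, w in zip(estimated, weighed)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias - spread, bias, bias + spread


# Hypothetical portions in grams for four participants and one food item.
weighed = [100.0, 150.0, 200.0, 250.0]
estimated = [110.0, 140.0, 230.0, 260.0]

errors = percent_errors(estimated, weighed)
lower, bias, upper = limits_of_agreement(estimated, weighed)
```

A positive bias indicates overestimation on average; wide limits mean individual errors vary a lot even when the average error is small.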
Consequences of Ignoring Guessing when Estimating the Latent Density in Item Response Theory
ERIC Educational Resources Information Center
Woods, Carol M.
2008-01-01
In Ramsay-curve item response theory (RC-IRT), the latent variable distribution is estimated simultaneously with the item parameters. In extant Monte Carlo evaluations of RC-IRT, the item response function (IRF) used to fit the data is the same one used to generate the data. The present simulation study examines RC-IRT when the IRF is imperfectly…
ERIC Educational Resources Information Center
Jones, Douglas H.
The progress of modern mental test theory depends very much on the techniques of maximum likelihood estimation, and many popular applications make use of likelihoods induced by logistic item response models. While, in reality, item responses are nonreplicate within a single examinee and the logistic models are only ideal, practitioners make…
Variation in the Readability of Items Within Surveys
Calderón, José L.; Morales, Leo S.; Liu, Honghu; Hays, Ron D.
2006-01-01
The objective of this study was to estimate the variation in the readability of survey items within 2 widely used health-related quality-of-life surveys: the National Eye Institute Visual Functioning Questionnaire–25 (VFQ-25) and the Short Form Health Survey, version 2 (SF-36v2). Flesch-Kincaid and Flesch Reading Ease formulas were used to estimate readability. Individual survey item scores and descriptive statistics for each survey were calculated. Variation of individual item scores from the mean survey score was graphically depicted for each survey. The mean reading grade level and reading ease estimates for the VFQ-25 and SF-36v2 were 7.8 (fairly easy) and 6.4 (easy), respectively. Both surveys had notable variation in item readability; individual item readability scores ranged from 3.7 to 12.0 (very easy to difficult) for the VFQ-25 and 2.2 to 12.0 (very easy to difficult) for the SF-36v2. Because survey respondents may not comprehend items with readability scores that exceed their reading ability, estimating the readability of each survey item is an important component of evaluating survey readability. Standards for measuring the readability of surveys are needed. PMID:16401705
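As a rough illustration of item-level readability scoring as described above, the sketch below applies the standard Flesch Reading Ease and Flesch-Kincaid grade formulas to single survey items. The syllable counter is a crude vowel-group heuristic and the two example items are invented, so the numbers will not match those produced by the software the authors used.

```python
import re


def count_syllables(word):
    """Very rough syllable estimate: number of vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid grade level) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return ease, grade


# Hypothetical survey items, not taken from the VFQ-25 or SF-36v2.
easy_item = "Do you have pain?"
hard_item = ("Does visual impairment substantially restrict "
             "your occupational activities?")

easy_ease, easy_grade = readability(easy_item)
hard_ease, hard_grade = readability(hard_item)
```

Scoring each item separately, rather than the survey as a whole, is what exposes the within-survey variation the study reports.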
Exploratory Item Classification Via Spectral Graph Clustering
Chen, Yunxiao; Li, Xiaoou; Liu, Jingchen; Xu, Gongjun; Ying, Zhiliang
2017-01-01
Large-scale assessments are supported by a large item pool. An important task in test development is to assign items into scales that measure different characteristics of individuals, and a popular approach is cluster analysis of items. Classical methods in cluster analysis, such as the hierarchical clustering, K-means method, and latent-class analysis, often induce a high computational overhead and have difficulty handling missing data, especially in the presence of high-dimensional responses. In this article, the authors propose a spectral clustering algorithm for exploratory item cluster analysis. The method is computationally efficient, effective for data with missing or incomplete responses, easy to implement, and often outperforms traditional clustering algorithms in the context of high dimensionality. The spectral clustering algorithm is based on graph theory, a branch of mathematics that studies the properties of graphs. The algorithm first constructs a graph of items, characterizing the similarity structure among items. It then extracts item clusters based on the graphical structure, grouping similar items together. The proposed method is evaluated through simulations and an application to the revised Eysenck Personality Questionnaire. PMID:29033476
Liegl, Gregor; Wahl, Inka; Berghöfer, Anne; Nolte, Sandra; Pieh, Christoph; Rose, Matthias; Fischer, Felix
2016-03-01
To investigate the validity of a common depression metric in independent samples. We applied a common metrics approach based on item-response theory for measuring depression to four German-speaking samples that completed the Patient Health Questionnaire (PHQ-9). We compared the PHQ item parameters reported for this common metric to reestimated item parameters that derived from fitting a generalized partial credit model solely to the PHQ-9 items. We calibrated the new model on the same scale as the common metric using two approaches (estimation with shifted prior and Stocking-Lord linking). By fitting a mixed-effects model and using Bland-Altman plots, we investigated the agreement between latent depression scores resulting from the different estimation models. We found different item parameters across samples and estimation methods. Although differences in latent depression scores between different estimation methods were statistically significant, these were clinically irrelevant. Our findings provide evidence that it is possible to estimate latent depression scores by using the item parameters from a common metric instead of reestimating and linking a model. The use of common metric parameters is simple, for example, using a Web application (http://www.common-metrics.org) and offers a long-term perspective to improve the comparability of patient-reported outcome measures. Copyright © 2016 Elsevier Inc. All rights reserved.
Shikata, Satoru; Nakayama, Takeo; Yamagishi, Hisakazu
2008-01-01
In this study, we conducted a limited survey of reports of surgical randomized controlled trials, using the consolidated standards of reporting trials (CONSORT) statement and additional check items to clarify problems in the evaluation of surgical reports. A total of 13 randomized trials were selected from two latest review articles on biliary surgery. Each randomized trial was evaluated according to 28 quality measures that comprised items from the CONSORT statement plus additional items. Analysis focused on relationships between the quality of each study and the estimated effect gap ("pooled estimate in meta-analysis" -- "estimated effect of each study"). No definite relationships were found between individual study quality and the estimated effect gap. The following items could have been described but were not provided in almost all the surgical RCT reports: "clearly defined outcomes"; "details of randomization"; "participant flow charts"; "intention-to-treat analysis"; "ancillary analyses"; and "financial conflicts of interest". The item, "participation of a trial methodologist in the study" was not found in any of the reports. Although the quality of reporting trials is not always related to a biased estimation of treatment effect, the items used for quality measures must be described to enable readers to evaluate the quality and applicability of the reporting. Further development of an assessment tool is needed for items specific to surgical randomized controlled trials.
Schinka, John A
2012-10-01
Issues regarding the readability of self-report assessment instruments, methods for establishing the reading ability level of respondents, and guidelines for development of scales designed for marginal readers have been inconsistently addressed in the literature. A recent study by McHugh and Behar (2009) provided new findings relevant to these issues. McHugh and Behar calculated indices of readability separately for the instructions and the item sets of 105 self-report measures of anxiety and depression. Results revealed substantial variability in readability among the measures, with most measures being written at or above the mean reading grade level in the United States. These results were consistent with those reported previously by Schinka and Borum (1993, 1994) in analyses of the readability of commonly used self-report psychopathology and personality inventories. In their discussion, McHugh and Behar addressed implications of their findings for clinical assessment and for scale development. I expand on their comments by addressing the failure to consider vocabulary difficulty, a major shortcoming of readability indices that examine only text complexity. I demonstrate how vocabulary difficulty influences readability and discuss additional considerations and possible solutions for addressing the gap between scale readability and the reading skill level of the self-report respondent. The work of McHugh and Behar clearly demonstrates that the issues of reading ability that arise in collecting self-report data are neither simple nor straightforward. Comments are offered to focus attention on the problems identified by their work. These problems will require additional effort on the part of researchers and clinicians in order to obtain reliable, valid estimates of clinical status. (PsycINFO Database Record (c) 2012 APA, all rights reserved).
ERIC Educational Resources Information Center
Lee, Guemin; Park, In-Yong
2012-01-01
Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several…
ERIC Educational Resources Information Center
St-Onge, Christina; Valois, Pierre; Abdous, Belkacem; Germain, Stephane
2009-01-01
To date, there have been no studies comparing parametric and nonparametric Item Characteristic Curve (ICC) estimation methods on the effectiveness of Person-Fit Statistics (PFS). The primary aim of this study was to determine if the use of ICCs estimated by nonparametric methods would increase the accuracy of item response theory-based PFS for…
ERIC Educational Resources Information Center
Lee, Yi-Hsuan; Zhang, Jinming
2008-01-01
The method of maximum-likelihood is typically applied to item response theory (IRT) models when the ability parameter is estimated while conditioning on the true item parameters. In practice, the item parameters are unknown and need to be estimated first from a calibration sample. Lewis (1985) and Zhang and Lu (2007) proposed the expected response…
Underestimating numerosity of items in visual search tasks.
Cassenti, Daniel N; Kelley, Troy D; Ghirardelli, Thomas G
2010-10-01
Previous research on numerosity judgments addressed attended items, while the present research addresses underestimation for unattended items in visual search tasks. One potential cause of underestimation for unattended items is that estimates of quantity may depend on viewing a large portion of the display within foveal vision. Another theory follows from the occupancy model: estimating quantity of items in greater proximity to one another increases the likelihood of an underestimation error. Three experimental manipulations addressed aspects of underestimation for unattended items: the size of the distracters, the distance of the target from fixation, and whether items were clustered together. Results suggested that the underestimation effect for unattended items was best explained within a Gestalt grouping framework.
Arias, Victor B.; Nuñez, Daniel E.; Martínez-Molina, Agustín; Ponce, Fernando P.; Arias, Benito
2016-01-01
The Diagnostic and Statistical Manual of Mental Disorders (DSM) diagnostic criteria assume that the 18 symptoms carry the same weight in an Attention Deficit with Hyperactivity Disorder (ADHD) diagnosis and bear the same discriminatory capacity. However, it is reasonable to think that symptoms may differ in terms of severity and even in the reliability with which they represent the disorder. To test this hypothesis, the aim of this study was to calibrate in a sample of Spanish children (age 4–7; n = 784) a scale for assessing the symptoms of ADHD proposed by Diagnostic and Statistical Manual of Mental Disorders, IV-TR within the framework of Item Response Theory. Samejima’s Graded Response Model was used as a method for estimating the item difficulty and discrimination parameters. The results showed that ADHD subscales (Attention Deficit and Hyperactivity / Impulsivity) had good psychometric properties and also had a good fit to the model. However, relevant differences between symptoms were observed at the level of severity, informativeness and reliability for the assessment of ADHD. This finding suggests that it would be useful to identify the symptoms that are more important than the others with regard to diagnosing ADHD. PMID:27736911
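For readers unfamiliar with Samejima's Graded Response Model, the following sketch shows how category probabilities for one ordered-response item follow from a discrimination parameter and a set of increasing thresholds. The parameter values are invented for illustration and are not estimates from the study.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.

    theta: latent trait level; a: discrimination (> 0);
    thresholds: increasing boundary locations b_1 < ... < b_{K-1}.
    P(X >= k) is logistic in a * (theta - b_k); each category probability
    is the difference between adjacent cumulative probabilities.
    """
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical symptom item with four response categories.
low = grm_category_probs(-2.0, 1.5, [-1.0, 0.5, 2.0])
high = grm_category_probs(2.0, 1.5, [-1.0, 0.5, 2.0])
```

Items with higher thresholds correspond to more severe symptoms, and items with larger discrimination separate children more sharply around those thresholds.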
ERIC Educational Resources Information Center
Kawahara, Jun-ichiro; Enns, James T.
2009-01-01
When observers try to identify successive targets in a visual stream at a rate of 100 ms per item, accuracy for the 2nd target is impaired for intertarget lags of 100-500 ms. Yet, when the same stream is presented more rapidly (e.g., 50 ms per item), this pattern reverses and a 1st-target deficit is obtained. M. C. Potter, A. Staub, and D. H.…
Wu, Huey-Min; Lin, Chin-Kai; Yang, Yu-Mao; Kuo, Bor-Chen
2014-11-12
Visual perception is the fundamental skill required for a child to recognize words, and to read and write. No visual perception assessment tool based on Chinese characters had been developed for preschool children in Taiwan. The purposes of this study were to develop a computerized visual perception assessment tool for Chinese character structures and to explore the psychometric characteristics of the tool. This study adopted purposive sampling. The study evaluated 551 kindergarten-age children (293 boys, 258 girls) ranging from 46 to 81 months of age. The test instrument used in this study consisted of three subtests and 58 items, including tests of basic strokes, single-component characters, and compound characters. Based on the results of model fit analysis, the higher-order item response theory was used to estimate the performance in visual perception, basic strokes, single-component characters, and compound characters simultaneously. Analyses of variance were used to detect significant difference in age groups and gender groups. The difficulty of identifying items in a visual perception test ranged from -2 to 1. The visual perception ability of 4- to 6-year-old children ranged from -1.66 to 2.19. Gender did not have significant effects on performance. However, there were significant differences among the different age groups. The performance of 6-year-olds was better than that of 5-year-olds, which was better than that of 4-year-olds. This study obtained detailed diagnostic scores by using a higher-order item response theory model to understand the visual perception of basic strokes, single-component characters, and compound characters. Further statistical analysis showed that, for basic strokes and compound characters, girls performed better than did boys; there also were differences within each age group. For single-component characters, there was no difference in performance between boys and girls. However, again the performance of 6-year-olds was better than that of 4-year-olds, but there were no statistical differences between the performance of 5-year-olds and 6-year-olds. The basic strokes, single-component characters, and compound characters tests had good reliability and validity. Therefore, the tool can be applied to diagnose visual perception problems in preschool children. Copyright © 2014 Elsevier Ltd. All rights reserved.
Extending item response theory to online homework
NASA Astrophysics Data System (ADS)
Kortemeyer, Gerd
2014-06-01
Item response theory (IRT) becomes an increasingly important tool when analyzing "big data" gathered from online educational venues. However, the mechanism was originally developed in traditional exam settings, and several of its assumptions are infringed upon when deployed in the online realm. For a large-enrollment physics course for scientists and engineers, the study compares outcomes from IRT analyses of exam and homework data, and then proceeds to investigate the effects of each confounding factor introduced in the online realm. It is found that IRT yields the correct trends for learner ability and meaningful item parameters, yet overall agreement with exam data is moderate. It is also found that learner ability and item discrimination are robust over a wide range with respect to model assumptions and introduced noise. Item difficulty is also robust, but over a narrower range.
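To make the three quantities in the abstract above concrete (learner ability, item discrimination, item difficulty), here is a minimal sketch that assumes a two-parameter logistic (2PL) model; the abstract does not say which IRT model the study fitted. The item parameters and response patterns are invented, and the grid search is a simple stand-in for a proper estimation routine.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response: ability theta,
    discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, items, grid=None):
    """Maximum-likelihood ability on a coarse grid, item parameters known."""
    if grid is None:
        grid = [g / 10.0 for g in range(-40, 41)]  # -4.0 to 4.0 in steps of 0.1

    def log_likelihood(theta):
        total = 0.0
        for x, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p) if x else math.log(1.0 - p)
        return total

    return max(grid, key=log_likelihood)


# Hypothetical (discrimination, difficulty) pairs for three homework items.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
stronger = estimate_ability([1, 1, 0], items)
weaker = estimate_ability([1, 0, 0], items)
```

When ability equals an item's difficulty the success probability is exactly 0.5, which is what places learners and items on a common scale.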
Armour, Cherie; Raudzah Ghazali, Siti; Elklit, Ask
2013-03-30
The underlying latent structure of Posttraumatic Stress Disorder (PTSD) is widely researched. However, despite a plethora of factor analytic studies, no single model has consistently been shown as superior to alternative models. The two most often supported models are the Emotional Numbing and the Dysphoria models. However, a recently proposed five-factor Dysphoric Arousal model has been gathering support over and above existing models. Data for the current study were gathered from Malaysian Tsunami survivors (N=250). Three competing models (Emotional Numbing/Dysphoria/Dysphoric Arousal) were specified and estimated using Confirmatory Factor Analysis (CFA). The Dysphoria model provided superior fit to the data compared to the Emotional Numbing model. However, using chi-square difference tests, the Dysphoric Arousal model showed a superior fit compared to both the Emotional Numbing and Dysphoria models. In conclusion, the current results suggest that the Dysphoric Arousal model better represents PTSD's latent structure and that items measuring sleeping difficulties, irritability/anger and concentration difficulties form a separate, unique PTSD factor. These results are discussed in relation to the role of Hyperarousal in PTSD's on-going symptom maintenance and in relation to the DSM-5. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
When students can choose easy, medium, or hard homework problems
NASA Astrophysics Data System (ADS)
Teodorescu, Raluca E.; Seaton, Daniel T.; Cardamone, Caroline N.; Rayyan, Saif; Abbott, Jonathan E.; Barrantes, Analia; Pawl, Andrew; Pritchard, David E.
2012-02-01
We investigate student-chosen, multi-level homework in our Integrated Learning Environment for Mechanics [1] built using the LON-CAPA [2] open-source learning system. Multi-level refers to problems categorized as easy, medium, and hard. Problem levels were determined a priori based on the knowledge needed to solve them [3]. We analyze these problems using three measures: time-per-problem, LON-CAPA difficulty, and item difficulty measured by item response theory. Our analysis of student behavior in this environment suggests that time-per-problem is strongly dependent on problem category, unlike either score-based measure. We also found trends in student choice of problems, overall effort, and efficiency across the student population. Allowing students choice in problem solving seems to improve their motivation; 70% of students worked additional problems for which no credit was given.
Sources of Interactional Problems in a Survey of Racial/Ethnic Discrimination
Johnson, Timothy P.; Shariff-Marco, Salma; Willis, Gordon; Cho, Young Ik; Breen, Nancy; Gee, Gilbert C.; Krieger, Nancy; Grant, David; Alegria, Margarita; Mays, Vickie M.; Williams, David R.; Landrine, Hope; Liu, Benmei; Reeve, Bryce B.; Takeuchi, David; Ponce, Ninez A.
2014-01-01
Cross-cultural variability in respondent processing of survey questions may bias results from multiethnic samples. We analyzed behavior codes, which identify difficulties in the interactions of respondents and interviewers, from a discrimination module contained within a field test of the 2007 California Health Interview Survey. In all, 553 (English) telephone interviews yielded 13,999 interactions involving 22 items. Multilevel logistic regression modeling revealed that respondent age and several item characteristics (response format, customized questions, length, and first item with new response format), but not race/ethnicity, were associated with interactional problems. These findings suggest that item function within a multi-cultural, albeit English language, survey may be largely influenced by question features, as opposed to respondent characteristics such as race/ethnicity. PMID:26166949
ERIC Educational Resources Information Center
Michaelides, Michalis P.; Haertel, Edward H.
2014-01-01
The standard error of equating quantifies the variability in the estimation of an equating function. Because common items for deriving equated scores are treated as fixed, the only source of variability typically considered arises from the estimation of common-item parameters from responses of samples of examinees. Use of alternative, equally…
ERIC Educational Resources Information Center
Lee, Young-Sun; Wollack, James A.; Douglas, Jeffrey
2009-01-01
The purpose of this study was to assess the model fit of a 2PL through comparison with the nonparametric item characteristic curve (ICC) estimation procedures. Results indicate that three nonparametric procedures implemented produced ICCs that are similar to that of the 2PL for items simulated to fit the 2PL. However for misfitting items,…
ERIC Educational Resources Information Center
Karkee, Thakur B.; Wright, Karen R.
2004-01-01
Different item response theory (IRT) models may be employed for item calibration. Change of testing vendors, for example, may result in the adoption of a different model than that previously used with a testing program. To provide scale continuity and preserve cut score integrity, item parameter estimates from the new model must be linked to the…
Sun, Wei; Chou, Chih-Ping; Stacy, Alan W; Ma, Huiyan; Unger, Jennifer; Gallaher, Peggy
2007-02-01
Cronbach's α is widely used in social science research to estimate the internal consistency reliability of a measurement scale. However, when items are not strictly parallel, the Cronbach's α coefficient provides a lower-bound estimate of true reliability, and this estimate may be further biased downward when items are dichotomous. The estimation of standardized Cronbach's α for a scale with dichotomous items can be improved by using the upper bound of coefficient phi. SAS and SPSS macros have been developed in this article to obtain standardized Cronbach's α via this method. The simulation analysis showed that Cronbach's α from upper-bound phi might be appropriate for estimating the real reliability when standardized Cronbach's α is problematic.
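For orientation, the sketch below computes raw Cronbach's α and the conventional standardized α (based on the mean inter-item Pearson correlation, which equals phi for dichotomous items) on a small invented data set. It does not implement the upper-bound-phi adjustment proposed in the article, whose details are not given in the abstract.

```python
import math
from itertools import combinations
from statistics import mean, pvariance


def cronbach_alpha(rows):
    """Raw alpha: k/(k-1) * (1 - sum of item variances / total-score variance)."""
    items = list(zip(*rows))
    k = len(items)
    totals = [sum(r) for r in rows]
    item_var = sum(pvariance(col) for col in items)
    return k / (k - 1) * (1.0 - item_var / pvariance(totals))


def pearson(x, y):
    """Pearson correlation (phi coefficient when both items are 0/1)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def standardized_alpha(rows):
    """Standardized alpha: k * r_bar / (1 + (k - 1) * r_bar)."""
    items = list(zip(*rows))
    k = len(items)
    r_bar = mean([pearson(x, y) for x, y in combinations(items, 2)])
    return k * r_bar / (1.0 + (k - 1) * r_bar)


# Hypothetical responses: rows are respondents, columns are dichotomous items.
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
]
raw = cronbach_alpha(data)
std = standardized_alpha(data)
```

Because phi between two dichotomous items with different endorsement rates cannot reach 1, the mean correlation, and hence standardized α, is pulled downward, which is the problem the article's method targets.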
Chen, Liang-Yu; Wu, Yi-Hui; Huang, Chung-Yu; Liu, Li-Kuo; Hwang, An-Chun; Peng, Li-Ning; Lin, Ming-Hsieh; Chen, Liang-Kung
2017-04-01
To identify potentially modifiable risk factors for cognitive decline among veterans' home residents in Taiwan. The present retrospective cohort study was part of the Veteran Affairs-Comprehensive Geriatric Assessment study that retrieved data of the comprehensive geriatric assessment for 946 residents living at four veterans' homes in Taiwan. The study participants were interviewed every 3-6 months between January 2012 and December 2014. Demographic characteristics, multimorbidity by Charlson's Comorbidities Index, physical function by the Barthel Index, cognition by the Mini-Mental State Examination (MMSE), depression by the five-item Geriatric Depression Scale and nutritional status by the Mini-Nutrition Assessment-Short Form were collected for analysis. A generalized estimating equation model was used after it was adjusted for age, educational level, five-item Geriatric Depression Scale, and problem of communication difficulty to identify potential modifiable risk factors for cognitive decline. The mean age of the participants was 85.7 ± 5.2 years, with a mean follow-up period of 41 ± 21.6 weeks. The prevalence of cognitive impairment (defined by MMSE <24) was 65.6%, whereas 34% of the study participants were positive for depressive symptoms. Approximately one-fifth of the study participants were using psychotropic agents, which was higher among participants with cognitive impairment (23.6% vs 15.6%, P < 0.05) than those without. In the generalized estimating equation model, physical function, nutritional status, depressive symptoms, ex-drinker, multimorbidity and stool incontinence were positively correlated with MMSE score; whereas advanced age, low educational level (<6 years), presence of communication difficulty and use of psychotropic agents were inversely associated with the MMSE score. Physical function and nutritional status were positively associated with the MMSE score, and use of psychotropic agents was negatively correlated with cognitive function. Further intervention study is required to improve the cognitive health of older adults living in the veterans' retirement communities. Geriatr Gerontol Int 2017: 17 (Suppl. 1): 7-13. © 2017 Japan Geriatrics Society.
Diviani, Nicola; Dima, Alexandra Lelia; Schulz, Peter Johannes
2017-04-11
The eHealth Literacy Scale (eHEALS) is a tool to assess consumers' comfort and skills in using information technologies for health. Although evidence exists of reliability and construct validity of the scale, less agreement exists on structural validity. The aim of this study was to validate the Italian version of the eHealth Literacy Scale (I-eHEALS) in a community sample with a focus on its structural validity, by applying psychometric techniques that account for item difficulty. Two Web-based surveys were conducted among a total of 296 people living in the Italian-speaking region of Switzerland (Ticino). After examining the latent variables underlying the observed variables of the Italian scale via principal component analysis (PCA), fit indices for two alternative models were calculated using confirmatory factor analysis (CFA). The scale structure was examined via parametric and nonparametric item response theory (IRT) analyses accounting for differences between items regarding the proportion of answers indicating high ability. Convergent validity was assessed by correlations with theoretically related constructs. CFA showed a suboptimal model fit for both models. IRT analyses confirmed all items measure a single dimension as intended. Reliability and construct validity of the final scale were also confirmed. The contrasting results of factor analysis (FA) and IRT analyses highlight the importance of considering differences in item difficulty when examining health literacy scales. The findings support the reliability and validity of the translated scale and its use for assessing Italian-speaking consumers' eHealth literacy. ©Nicola Diviani, Alexandra Lelia Dima, Peter Johannes Schulz. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 11.04.2017.
Rodrigues-Bigaton, Delaine; de Castro, Ester M; Pires, Paulo F
Rasch analysis has been used in recent studies to test the psychometric properties of a questionnaire. The conditions for use of the Rasch model are one-dimensionality (assessed via prior factor analysis) and local independence (the probability of getting a particular item right or wrong should not be conditioned upon success or failure in another). To evaluate the dimensionality and the psychometric properties of the Fonseca anamnestic index (FAI), such as the fit of the data to the model, the degree of difficulty of the items, and the ability to respond in patients with myogenous temporomandibular disorder (TMD). The sample consisted of 94 women with myogenous TMD, diagnosed by the Research Diagnostic Criteria for Temporomandibular Disorders (RDC/TMD), who answered the FAI. For the factor analysis, we applied the Kaiser-Meyer-Olkin test, Bartlett's sphericity, Spearman's correlation, and the determinant of the correlation matrix. For extraction of the factors/dimensions, an eigenvalue >1.0 was used, followed by oblique oblimin rotation. The Rasch analysis was conducted on the dimension that showed the highest proportion of variance explained. Adequate sample "n" and FAI multidimensionality were observed. Dimension 1 (primary) consisted of items 1, 2, 3, 6, and 7. All items of dimension 1 showed adequate fit to the model, being observed according to the degree of difficulty (from most difficult to easiest), respectively, items 2, 1, 3, 6, and 7. The FAI presented multidimensionality with its main dimension consisting of five reliable items with adequate fit to the composition of its structure. Copyright © 2017 Associação Brasileira de Pesquisa e Pós-Graduação em Fisioterapia. Publicado por Elsevier Editora Ltda. All rights reserved.
Sakakibara, Brodie M.; Miller, William C.; Backman, Catherine L.
2012-01-01
Objective To explore shortened response formats for use with the Activities-specific Balance Confidence scale and then: 1) evaluate the unidimensionality of the scale; 2) evaluate the item difficulty; 3) evaluate the scale for redundancy and content gaps; and 4) evaluate the item standard error of measurement (SEM) and internal consistency reliability among aging individuals (≥50 years) with a lower-limb amputation living in the community. Design Secondary analysis of cross-sectional survey and chart review data. Setting Out-patient amputee clinics, Ontario, Canada. Participants Four hundred forty eight community living adults, at least 50 years old (mean = 68 years), who have used a prosthesis for at least 6 months for a major unilateral lower limb amputation. Three hundred twenty five (72.5%) were men. Intervention N/a Main Outcome Measure(s) Activities-specific Balance Confidence Scale. Results A 5-option response format outperformed 4- and 6-option formats. Factor analyses confirmed a unidimensional scale. The distance between response options is not the same for all items on the scale, evident by the Partial Credit Model (PCM) having a better fit to the data than the Rating Scale Model. Two items, however, did not fit the PCM within statistical reason. Revising the wording of the two items may resolve the misfit, and improve the construct validity and lower the SEM. Overall, the difficulty of the scale’s items is appropriate for use with aging individuals with lower-limb amputation, and is most reliable (Cronbach α = 0.94) for use with individuals with moderately low balance confidence levels. Conclusions The ABC-scale with a simplified 5-option response format is a valid and reliable measure of balance confidence for use with individuals aging with a lower limb amputation. PMID:21704978
Cairnduff, Victoria; Dean, Moira; Koidis, Anastasios
2016-09-01
Food preparation and storage behaviors in the home deviating from the "best practice" food safety recommendations may result in foodborne illnesses. Currently, there are limited tools available to fully evaluate the consumer knowledge, perceptions, and behavior in the area of refrigerator safety. The current study aimed to develop a valid and reliable tool in the form of a questionnaire, the Consumer Refrigerator Safety Questionnaire (CRSQ), for assessing systematically all these aspects. Items relating to refrigerator safety knowledge (n =17), perceptions (n =46), and reported behavior (n =30) were developed and pilot tested by an expert reference group and various consumer groups to assess face and content validity (n =20), item difficulty and consistency (n =55), and construct validity (n =23). The findings showed that the CRSQ has acceptable face and content validity with acceptable levels of item difficulty. Item consistency was observed for 12 of 15 in refrigerator safety knowledge. Further, all 5 of the subscales of consumer perceptions of refrigerator safety practices relating to risk of developing foodborne disease showed acceptable internal consistency (Cronbach's α value > 0.8). Construct validity of the CRSQ was shown to be very good (P = 0.022). The CRSQ exhibited acceptable test-retest reliability at 14 days with the majority of knowledge items (93.3%) and reported behavior items (96.4%) having correlation coefficients of greater than 0.70. Overall, the CRSQ was deemed valid and reliable in assessing refrigerator safety knowledge and behavior; therefore, it has the potential for future use in identifying groups of individuals at increased risk of deviating from recommended refrigerator safety practices, as well as the assessment of refrigerator safety knowledge and behavior for use before and after an intervention.
Tak, SangWoo; Calvert, Geoffrey M
2008-01-01
To estimate the national burden of hearing difficulty among workers in US industries and occupations. Data on 130,102 employed National Health Interview Survey respondents between the ages of 18 to 65 years who were interviewed between 1997 and 2003 were analyzed to estimate the population prevalence, adjusted prevalence ratios, and fractions of hearing difficulty attributable to employment. The estimated population prevalence of hearing difficulty was 11.4% (24% attributable to employment). The adjusted prevalence ratios of hearing difficulty were highest for railroads, mining, and primary metal manufacturing industry. Occupations with increased risk of hearing difficulty were mechanics/repairers, machine operators, and transportation equipment operators. Hearing difficulty was differentially distributed across various industries. In industries with high rates, employers and workers should take preventive action to reduce the risk of occupational hearing loss.
Three controversies over item disclosure in medical licensure examinations
Park, Yoon Soo; Yang, Eunbae B.
2015-01-01
In response to views on the public's right to know, there is growing attention to item disclosure – release of items, answer keys, and performance data to the public – in medical licensure examinations and their potential impact on the test's ability to measure competence and select qualified candidates. Recent debates on this issue have sparked legislative action internationally, including South Korea, with prior discussions among North American countries dating over three decades. The purpose of this study is to identify and analyze three issues associated with item disclosure in medical licensure examinations – 1) fairness and validity, 2) impact on passing levels, and 3) utility of item disclosure – by synthesizing existing literature in relation to standards in testing. Historically, the controversy over item disclosure has centered on fairness and validity. Proponents of item disclosure stress test takers’ right to know, while opponents argue from a validity perspective. Item disclosure may bias item characteristics, such as difficulty and discrimination, and has consequences on setting passing levels. To date, there has been limited research on the utility of item disclosure for large scale testing. These issues require ongoing and careful consideration. PMID:26374693
[Perceptions on item disclosure for the Korean medical licensing examination].
Yang, Eunbae B
2015-09-01
This study analyzed the perceptions of medical students and faculty regarding disclosure of test items on the Korean medical licensing examination. I conducted a survey of medical students from medical colleges and professional medical schools nationwide. Responses were analyzed from 718 participants as well as 69 faculty members who participated in creating the medical licensing examination item sets. Data were analyzed using descriptive statistics and the chi-square test. It is important to maintain test quality and to keep the test items unavailable to the public. There are also concerns among students that disclosure of test items would prompt increasing difficulty of test items (48.3%). Further, few students found it desirable to disclose test items regardless of any considerations (28.5%). The professors, who had experience in designing the test items, also expressed their opposition to test item disclosure (60.9%). It is desirable not to disclose the test items of the Korean medical licensing examination to the public on the condition that students are provided with a sufficient amount of information regarding the examination. This is so that the exam can appropriately identify candidates with the required qualifications.
Chen, Cheng-Te; Chen, Yu-Lan; Lin, Yu-Ching; Hsieh, Ching-Lin; Tzeng, Jeng-Yi
2018-01-01
Objective The purpose of this study was to construct a computerized adaptive test (CAT) for measuring self-care performance (the CAT-SC) in children with developmental disabilities (DD) aged from 6 months to 12 years in a content-inclusive, precise, and efficient fashion. Methods The study was divided into 3 phases: (1) item bank development, (2) item testing, and (3) a simulation study to determine the stopping rules for the administration of the CAT-SC. A total of 215 caregivers of children with DD were interviewed with the 73-item CAT-SC item bank. An item response theory model was adopted for examining the construct validity to estimate item parameters after investigation of the unidimensionality, equality of slope parameters, item fitness, and differential item functioning (DIF). In the last phase, the reliability and concurrent validity of the CAT-SC were evaluated. Results The final CAT-SC item bank contained 56 items. The stopping rules suggested were (a) reliability coefficient greater than 0.9 or (b) 14 items administered. The results of simulation also showed that 85% of the estimated self-care performance scores would reach a reliability higher than 0.9 with a mean test length of 8.5 items, and the mean reliability for the rest was 0.86. Administering the CAT-SC could reduce the number of items administered by 75% to 84%. In addition, self-care performances estimated by the CAT-SC and the full item bank were very similar to each other (Pearson r = 0.98). Conclusion The newly developed CAT-SC can efficiently measure self-care performance in children with DD whose performances are comparable to those of TD children aged from 6 months to 12 years as precisely as the whole item bank. The item bank of the CAT-SC has good reliability and a unidimensional self-care construct, and the CAT can estimate self-care performance with less than 25% of the items in the item bank. 
Therefore, the CAT-SC could be useful for measuring self-care performance in children with DD in clinical and research settings. PMID:29561879
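The two stopping rules reported in this abstract (reliability above 0.9, or 14 items administered) can be sketched as a single check. This is only an illustration under stated assumptions, not the authors' implementation: the function name `should_stop` is hypothetical, and the conversion from standard error to reliability (1 - SE^2) is an assumption that holds only when the latent trait is scaled to unit variance.

```python
def should_stop(standard_error: float, items_administered: int,
                min_reliability: float = 0.9, max_items: int = 14) -> bool:
    """Return True when either stopping rule quoted in the abstract is met.

    Assumption (not stated in the abstract): reliability is approximated
    as 1 - SE**2, i.e. the latent trait has unit variance.
    """
    reliability = 1.0 - standard_error ** 2
    return reliability > min_reliability or items_administered >= max_items


# Hypothetical values: SE of 0.30 gives reliability 0.91, so the test stops early.
print(should_stop(0.30, 5))   # precision rule met
print(should_stop(0.40, 5))   # neither rule met, keep administering items
print(should_stop(0.40, 14))  # item-count rule met
```

In an adaptive test this check would typically run after each administered item, once the ability estimate and its standard error have been updated.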
Chen, Cheng-Te; Chen, Yu-Lan; Lin, Yu-Ching; Hsieh, Ching-Lin; Tzeng, Jeng-Yi; Chen, Kuan-Lin
2018-01-01
The purpose of this study was to construct a computerized adaptive test (CAT) for measuring self-care performance (the CAT-SC) in children with developmental disabilities (DD) aged from 6 months to 12 years in a content-inclusive, precise, and efficient fashion. The study was divided into 3 phases: (1) item bank development, (2) item testing, and (3) a simulation study to determine the stopping rules for the administration of the CAT-SC. A total of 215 caregivers of children with DD were interviewed with the 73-item CAT-SC item bank. An item response theory model was adopted for examining the construct validity to estimate item parameters after investigation of the unidimensionality, equality of slope parameters, item fitness, and differential item functioning (DIF). In the last phase, the reliability and concurrent validity of the CAT-SC were evaluated. The final CAT-SC item bank contained 56 items. The stopping rules suggested were (a) reliability coefficient greater than 0.9 or (b) 14 items administered. The results of simulation also showed that 85% of the estimated self-care performance scores would reach a reliability higher than 0.9 with a mean test length of 8.5 items, and the mean reliability for the rest was 0.86. Administering the CAT-SC could reduce the number of items administered by 75% to 84%. In addition, self-care performances estimated by the CAT-SC and the full item bank were very similar to each other (Pearson r = 0.98). The newly developed CAT-SC can efficiently measure self-care performance in children with DD whose performances are comparable to those of TD children aged from 6 months to 12 years as precisely as the whole item bank. The item bank of the CAT-SC has good reliability and a unidimensional self-care construct, and the CAT can estimate self-care performance with less than 25% of the items in the item bank. Therefore, the CAT-SC could be useful for measuring self-care performance in children with DD in clinical and research settings.
Baker, Richard S; Bazargan, Mohsen; Calderón, José L; Hays, Ron D
2006-08-01
To compare the psychometric performance of Spanish versions of the 25-item National Eye Institute Visual Function Questionnaire (NEI VFQ-25) and the NEI VFQ-39 administered to Latino patients with the psychometric performance of the standard English NEI VFQ-25 and NEI VFQ-39 administered to non-Latino patients. Clinic-based cross-sectional survey. Four hundred three patients (160 Latinos and 243 non-Latinos) recruited from general ophthalmology clinics of an urban public hospital over a 6-month period. Structured face-to-face interviews were conducted in Spanish and English to collect data for the NEI VFQ-25 and NEI VFQ-39. We calculated the mean, standard deviation, and percentage of participants having the minimum (floor) and maximum (ceiling) possible score for each item and scale. Internal consistency reliability of the NEI VFQ-25 and NEI VFQ-39 was estimated using the Cronbach alpha and average inter-item correlation. Construct validity for the instruments was assessed by comparing scores for participants classified as having normal versus impaired visual acuity. Instrument scales for general health; general vision; ocular pain; near activities; distance activities; vision-specific social functioning, mental health, role difficulties, and dependency; driving; color vision; and peripheral vision. Internal consistency reliability was significantly lower in the Spanish version than in the English version for 3 scales of the NEI VFQ-25. More importantly, 3 scales in the Spanish version manifested inadequate reliability (alpha ≤ 0.70), compared with only 1 inadequately reliable subscale in the English version. Reliability coefficients associated with the Spanish NEI VFQ-39 scales exceeded commonly accepted minimum standards. Comparison of reliability coefficients between Latino and non-Latino subgroups demonstrated statistically significant differences for 4 scales: Ocular Pain, Mental Health, Role Difficulties, and Dependency.
In each case, the Latino group had the lower internal consistency reliability. However, only for the Ocular Pain subscale was reliability both significantly lower and inadequate (alpha<0.70). Overall performance of the NEI VFQ in Latino populations is adequate. However, in the absence of modifications to improve the reliability of specific Spanish version subscales, comparisons between Latino and non-Latino subgroups using the NEI VFQ must be interpreted with appropriate caution.
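The two internal consistency statistics named in this abstract, Cronbach's alpha and the average inter-item correlation, can be computed as in the following sketch. It uses the standard textbook formulas and only the Python standard library; the function names and the tiny data set are hypothetical and are not taken from the study.

```python
from itertools import combinations
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (k items each)."""
    items = list(zip(*rows))                      # transpose to per-item columns
    k = len(items)
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])  # variance of the scale totals
    return k / (k - 1) * (1 - item_var_sum / total_var)


def average_inter_item_correlation(rows):
    """Mean Pearson correlation over all distinct item pairs."""
    items = list(zip(*rows))
    return mean(pearson(a, b) for a, b in combinations(items, 2))


# Hypothetical responses: four respondents, three perfectly consistent items.
data = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
print(cronbach_alpha(data), average_inter_item_correlation(data))
```

With real questionnaire data, each scale would be passed separately, and alpha values could then be compared against a threshold such as the 0.70 mentioned in the abstract.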
Invariance Properties for General Diagnostic Classification Models
ERIC Educational Resources Information Center
Bradshaw, Laine P.; Madison, Matthew J.
2016-01-01
In item response theory (IRT), the invariance property states that item parameter estimates are independent of the examinee sample, and examinee ability estimates are independent of the test items. While this property has long been established and understood by the measurement community for IRT models, the same cannot be said for diagnostic…
Item response theory - A first approach
NASA Astrophysics Data System (ADS)
Nunes, Sandra; Oliveira, Teresa; Oliveira, Amílcar
2017-07-01
Item Response Theory (IRT) has become one of the most popular scoring frameworks for measurement data, frequently used in computerized adaptive testing, cognitively diagnostic assessment and test equating. According to Andrade et al. (2000), IRT can be defined as a set of mathematical models (Item Response Models - IRM) constructed to represent the probability of an individual giving the right answer to an item of a particular test. The number of Item Response Models available to measurement analysis has increased considerably in the last fifteen years due to increasing computer power and due to a demand for accuracy and more meaningful inferences grounded in complex data. The developments in modeling with Item Response Theory were related to developments in estimation theory, most remarkably Bayesian estimation with Markov chain Monte Carlo algorithms (Patz & Junker, 1999). The popularity of Item Response Theory has also implied numerous overviews in books and journals, and many connections between IRT and other statistical estimation procedures, such as factor analysis and structural equation modeling, have been made repeatedly (van der Linden & Hambleton, 1997). As stated before, Item Response Theory covers a variety of measurement models, ranging from basic one-dimensional models for dichotomously and polytomously scored items and their multidimensional analogues to models that incorporate information about cognitive sub-processes which influence the overall item response process. The aim of this work is to introduce the main concepts associated with one-dimensional models of Item Response Theory, to specify the logistic models with one, two and three parameters, to discuss some properties of these models and to present the main estimation procedures.
An Evaluation of Hierarchical Bayes Estimation for the Two-Parameter Logistic Model.
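The one-, two- and three-parameter logistic models mentioned in this abstract can be written as a single item response function. The sketch below uses the common textbook parameterization, with discrimination a, difficulty b and lower asymptote (guessing) c; the function names are illustrative, and the optional scaling constant D = 1.7 used in some formulations is omitted here by assumption.

```python
import math


def irf_3pl(theta: float, a: float = 1.0, b: float = 0.0, c: float = 0.0) -> float:
    """Probability of a correct response under the 3PL model:
    P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def irf_2pl(theta: float, a: float, b: float) -> float:
    """2PL model: the 3PL with no guessing (c = 0)."""
    return irf_3pl(theta, a=a, b=b, c=0.0)


def irf_1pl(theta: float, b: float) -> float:
    """1PL model: the 2PL with a common discrimination fixed at 1."""
    return irf_3pl(theta, a=1.0, b=b, c=0.0)


# When ability equals difficulty, the 1PL and 2PL give exactly 0.5,
# while the 3PL gives c + (1 - c) / 2.
print(irf_1pl(0.0, b=0.0))
print(irf_2pl(1.0, a=2.0, b=0.0))
print(irf_3pl(0.0, a=1.0, b=0.0, c=0.2))
```

Nesting the simpler models inside the 3PL function makes the relationship between them explicit: each model is obtained by fixing one more parameter.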
ERIC Educational Resources Information Center
Kim, Seock-Ho
Hierarchical Bayes procedures for the two-parameter logistic item response model were compared for estimating item parameters. Simulated data sets were analyzed using two different Bayes estimation procedures, the two-stage hierarchical Bayes estimation (HB2) and the marginal Bayesian with known hyperparameters (MB), and marginal maximum…
Development of an Inconsistent Responding Scale for the Triarchic Psychopathy Measure.
Mowle, Elyse N; Kelley, Shannon E; Edens, John F; Donnellan, M Brent; Smith, Shannon Toney; Wygant, Dustin B; Sellbom, Martin
2017-08-01
Inconsistent or careless responding to self-report measures is estimated to occur in approximately 10% of university research participants and may be even more common among offender populations. Inconsistent responding may be a result of a number of factors including inattentiveness, reading or comprehension difficulties, and cognitive impairment. Many stand-alone personality scales used in applied and research settings, however, do not include validity indicators to help identify inattentive response patterns. Using multiple archival samples, the current study describes the development of an inconsistent responding scale for the Triarchic Psychopathy Measure (TriPM; Patrick, 2010), a widely used self-report measure of psychopathy. We first identified pairs of correlated TriPM items in a derivation sample (N = 2,138) and then created a total score based on the sum of the absolute value of the differences for each item pair. The resulting scale, the Triarchic Assessment Procedure for Inconsistent Responding (TAPIR), strongly differentiated between genuine TriPM protocols and randomly generated TriPM data (N = 1,000), as well as between genuine protocols and those in which 50% of the original data were replaced with random item responses. TAPIR scores demonstrated fairly consistent patterns of association with some theoretically relevant correlates (e.g., inconsistency scales embedded in other personality inventories), although not others (e.g., measures of conscientiousness) across our cross-validation samples. Tentative TAPIR cut scores that may discriminate between attentively and carelessly completed protocols are presented. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
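The scoring rule described in this abstract, a total based on the sum of the absolute differences within each pair of correlated items, can be sketched as follows. The item numbers, pairs and responses are hypothetical placeholders; the actual TAPIR item pairs are not given in the abstract.

```python
def inconsistency_score(responses, item_pairs):
    """Sum of absolute response differences across correlated item pairs.

    responses:  mapping of item number -> numeric response
    item_pairs: iterable of (item_a, item_b) tuples expected to be answered
                similarly (any reverse-keyed items are assumed to have been
                recoded beforehand)
    """
    return sum(abs(responses[a] - responses[b]) for a, b in item_pairs)


# Hypothetical protocol with two item pairs; the second pair disagrees by 3 points.
answers = {1: 3, 2: 3, 3: 1, 4: 4}
pairs = [(1, 2), (3, 4)]
print(inconsistency_score(answers, pairs))
```

Higher totals indicate more discrepant answers to similar items, which is why such a score can separate attentive protocols from random ones once a cut score is chosen.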
Fayers, Peter M
2007-01-01
We review the papers presented at the NCI/DIA conference, to identify areas of controversy and uncertainty, and to highlight those aspects of item response theory (IRT) and computer adaptive testing (CAT) that require theoretical or empirical research in order to justify their application to patient reported outcomes (PROs). IRT and CAT offer exciting potential for the development of a new generation of PRO instruments. However, most of the research into these techniques has been in non-healthcare settings, notably in education. Educational tests are very different from PRO instruments, and consequently problematic issues arise when adapting IRT and CAT to healthcare research. Clinical scales differ appreciably from educational tests, and symptoms have characteristics distinctly different from examination questions. This affects the transferring of IRT technology. Particular areas of concern when applying IRT to PROs include inadequate software, difficulties in selecting models and communicating results, insufficient testing of local independence and other assumptions, and a need of guidelines for estimating sample size requirements. Similar concerns apply to differential item functioning (DIF), which is an important application of IRT. Multidimensional IRT is likely to be advantageous only for closely related PRO dimensions. Although IRT and CAT provide appreciable potential benefits, there is a need for circumspection. Not all PRO scales are necessarily appropriate targets for this methodology. Traditional psychometric methods, and especially qualitative methods, continue to have an important role alongside IRT. Research should be funded to address the specific concerns that have been identified.
Saudek, Kris; Treat, Robert
2015-01-01
Purpose At our institution, speculation amongst medical students and faculty exists as to whether team-based learning (TBL) can improve scores on high-stakes examinations over traditional didactic lectures. Faculty with experience using TBL developed and piloted a required TBL blood disorders (BD) module for third-year medical students on their pediatric clerkship. The purpose of this study is to analyze the BD scores from the NBME subject exams before and after the introduction of the module. Methods We analyzed institutional and national item difficulties for BD items from the NBME pediatrics content area item analysis reports from 2011 to 2014 before (pre) and after (post) the pilot (October 2012). Total scores of 590 NBME subject examination students from examinee performance profiles were analyzed pre/post. t-Tests and Cohen's d effect sizes were used to analyze item difficulties for institutional versus national scores and pre/post comparisons of item difficulties and total scores. Results BD scores for our institution were 0.65 (±0.19) compared to 0.62 (±0.15) nationally (P=0.346; Cohen's d=0.15). The average of post-consecutive BD scores for our students was 0.70 (±0.21) compared to examinees nationally [0.64 (±0.15)] with a significant mean difference (P=0.031; Cohen's d=0.43). The difference in our institution's pre [0.65 (±0.19)] and post [0.70 (±0.21)] BD scores trended higher (P=0.391; Cohen's d=0.27). Institutional BD scores were higher than national BD scores for both pre and post, with an effect size that tripled from pre to post scores. Institutional BD scores increased after the use of the TBL module, while overall exam scores remained steadily above national norms. Conclusions Institutional BD scores were higher than national BD scores for both pre and post, with an effect size that tripled from pre to post scores. Institutional BD scores increased after the use of the TBL module, while overall exam scores remained steadily above national norms.
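For readers unfamiliar with the effect size used in this abstract, Cohen's d for two independent groups can be sketched as below. The abstract does not say which variant the authors used, so the pooled-standard-deviation form shown here is an assumption, and the numbers in the example are hypothetical rather than taken from the study.

```python
import math


def cohens_d(mean1: float, sd1: float, n1: int,
             mean2: float, sd2: float, n2: int) -> float:
    """Cohen's d for two independent groups using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


# Hypothetical groups: a 2-point mean difference with a common SD of 2 gives d = 1.
print(cohens_d(10.0, 2.0, 20, 8.0, 2.0, 20))
```

Because d depends on the group sizes as well as the means and standard deviations, it cannot be reproduced exactly from summary values when the sample sizes are not reported.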
ERIC Educational Resources Information Center
Chen, Hanwei; Cui, Zhongmin; Zhu, Rongchun; Gao, Xiaohong
2010-01-01
The most critical feature of a common-item nonequivalent groups equating design is that the average score difference between the new and old groups can be accurately decomposed into a group ability difference and a form difficulty difference. Two widely used observed-score linear equating methods, the Tucker and the Levine observed-score methods,…
Objective and subjective memory ratings in cannabis-dependent adolescents.
McClure, Erin A; Lydiard, Jessica B; Goddard, Scott D; Gray, Kevin M
2015-01-01
Cannabis is the most widely used illicit substance worldwide, with an estimated 160 million users. Among adolescents, rates of cannabis use are increasing, while the perception of detrimental effects of cannabis use is declining. Difficulty with memory is one of the most frequently noted cognitive deficits associated with cannabis use, but little data exist exploring how well users can identify their own memory deficits, if present. The current secondary analysis sought to characterize objective verbal and visual memory performance via a neurocognitive battery in cannabis-dependent adolescents enrolled in a pharmacotherapeutic cannabis cessation clinical trial (N = 112) and compare this to a single self-reported item assessing difficulties with memory loss. Exploratory analyses also assessed dose-dependent effects of cannabis on memory performance. A small portion of the study sample (10%) endorsed a "serious problem" with memory loss. Those participants reporting "no problem" or "serious problem" scored similarly on visual and verbal memory tasks on the neurocognitive battery. Exploratory analyses suggested a potential relationship between days of cannabis use, amount of cannabis used, and gender with memory performance. This preliminary and exploratory analysis suggests that a sub-set of cannabis users may not accurately perceive difficulties with memory. Further work should test this hypothesis with the use of a control group, comprehensive self-reports of memory problems, and adult populations that may have more years of cannabis use and more severe cognitive deficits. © American Academy of Addiction Psychiatry.
Perez, Kathryn E.; Hiatt, Anna; Davis, Gregory K.; Trujillo, Caleb; French, Donald P.; Terry, Mark; Price, Rebecca M.
2013-01-01
The American Association for the Advancement of Science 2011 report Vision and Change in Undergraduate Biology Education encourages the teaching of developmental biology as an important part of teaching evolution. Recently, however, we found that biology majors often lack the developmental knowledge needed to understand evolutionary developmental biology, or “evo-devo.” To assist in efforts to improve evo-devo instruction among undergraduate biology majors, we designed a concept inventory (CI) for evolutionary developmental biology, the EvoDevoCI. The CI measures student understanding of six core evo-devo concepts using four scenarios and 11 multiple-choice items, all inspired by authentic scientific examples. Distracters were designed to represent the common conceptual difficulties students have with each evo-devo concept. The tool was validated by experts and administered at four institutions to 1191 students during preliminary (n = 652) and final (n = 539) field trials. We used student responses to evaluate the readability, difficulty, discriminability, validity, and reliability of the EvoDevoCI, which included items ranging in difficulty from 0.22–0.55 and in discriminability from 0.19–0.38. Such measures suggest the EvoDevoCI is an effective tool for assessing student understanding of evo-devo concepts and the prevalence of associated common conceptual difficulties among both novice and advanced undergraduate biology majors. PMID:24297293
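Item difficulty and discriminability values like those reported in this abstract are usually classical test theory statistics. The sketch below assumes that difficulty is the proportion of correct answers and that discriminability is an upper-minus-lower group index; the abstract does not state which formulas the authors used, and the function names, the 27% default group fraction and the data are illustrative.

```python
def item_difficulty(responses, item):
    """Proportion of respondents answering the item correctly (scores are 0/1)."""
    return sum(r[item] for r in responses) / len(responses)


def item_discrimination(responses, item, fraction=0.27):
    """Upper-lower discrimination index: proportion correct in the top-scoring
    group minus proportion correct in the bottom-scoring group."""
    ranked = sorted(responses, key=sum)            # ascending by total score
    n = max(1, round(len(ranked) * fraction))      # size of each extreme group
    lower, upper = ranked[:n], ranked[-n:]
    p_upper = sum(r[item] for r in upper) / n
    p_lower = sum(r[item] for r in lower) / n
    return p_upper - p_lower


# Hypothetical 0/1 responses: four students, three items.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(item_difficulty(scores, 0), item_difficulty(scores, 2))
print(item_discrimination(scores, 1, fraction=0.5))
```

Under this convention a lower difficulty value means a harder item, so a range such as 0.22 to 0.55 describes items that between 22% and 55% of students answered correctly.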
Kulich, Károly; Keininger, Dorothy L; Tiplady, Brian; Banerji, Donald
2015-01-01
Symptoms, particularly dyspnea, and activity limitation, have an impact on the health status and the ability to function normally in patients with chronic obstructive pulmonary disease (COPD). To develop an electronic patient diary (eDiary), qualitative patient interviews were conducted from 2009 to 2010 to identify relevant symptoms and degree of bother due to symptoms. The eDiary was completed by a subset of 209 patients with moderate-to-severe COPD in the 26-week QVA149 SHINE study. Two morning assessments (since awakening and since the last assessment) and one evening assessment were made each day. Assessments covered five symptoms ("shortness of breath," "phlegm/mucus," "chest tightness," "wheezing," and "coughing") and two impact items ("bothered by COPD" and "difficulty with activities") and were scored on a 10-point numeric scale. Patient compliance with the eDiary was 90.4% at baseline and 81.3% at week 26. Correlations between shortness of breath and impact items were >0.95. Regression analysis showed that shortness of breath was a highly significant (P<0.0001) predictor of impact items. Exploratory factor analysis gave a single factor comprising all eDiary items, including both symptoms and impact items. Shortness of breath, the total score (including five symptoms and two impact items), and the five-item symptom score from the eDiary performed well, with good consistency and reliability. The eDiary showed good sensitivity to change, with a 0.6 points reduction in the symptoms scores (on a 0-10 point scale) representing a meaningful change. The eDiary was found to be valid, reliable, and responsive. The high correlations obtained between "shortness of breath" and the ratings of "bother" and "difficulty with activities" confirmed the relevance of this symptom in patients with COPD. Future studies will be required to explore further psychometric properties and their ability to differentiate between COPD treatments.
The stroke impairment assessment set: its internal consistency and predictive validity.
Tsuji, T; Liu, M; Sonoda, S; Domen, K; Chino, N
2000-07-01
To study the scale quality and predictive validity of the Stroke Impairment Assessment Set (SIAS) developed for stroke outcome research. Rasch analysis of the SIAS; stepwise multiple regression analysis to predict discharge functional independence measure (FIM) raw scores from demographic data, the SIAS scores, and the admission FIM scores; cross-validation of the prediction rule. Tertiary rehabilitation center in Japan. One hundred ninety stroke inpatients for the study of the scale quality and the predictive validity; a second sample of 116 stroke inpatients for the cross-validation study. Mean square fit statistics to study the degree of fit to the unidimensional model; logits to express item difficulties; discharge FIM scores for the study of predictive validity. The degree of misfit was acceptable except for the shoulder range of motion (ROM), pain, visuospatial function, and speech items; and the SIAS items could be arranged on a common unidimensional scale. The difficulty patterns were identical at admission and at discharge except for the deep tendon reflexes, ROM, and pain items. They were also similar for the right- and left-sided brain lesion groups except for the speech and visuospatial items. For the prediction of the discharge FIM scores, the independent variables selected were age, the SIAS total scores, and the admission FIM scores; and the adjusted R2 was .64 (p < .0001). Stability of the predictive equation was confirmed in the cross-validation sample (R2 = .68, p < .001). The unidimensionality of the SIAS was confirmed, and the SIAS total scores proved useful for stroke outcome prediction.
Generalized Full-Information Item Bifactor Analysis
Cai, Li; Yang, Ji Seung; Hansen, Mark
2011-01-01
Full-information item bifactor analysis is an important statistical method in psychological and educational measurement. Current methods are limited to single group analysis and inflexible in the types of item response models supported. We propose a flexible multiple-group item bifactor analysis framework that supports a variety of multidimensional item response theory models for an arbitrary mixing of dichotomous, ordinal, and nominal items. The extended item bifactor model also enables the estimation of latent variable means and variances when data from more than one group are present. Generalized user-defined parameter restrictions are permitted within or across groups. We derive an efficient full-information maximum marginal likelihood estimator. Our estimation method achieves substantial computational savings by extending Gibbons and Hedeker’s (1992) bifactor dimension reduction method so that the optimization of the marginal log-likelihood only requires two-dimensional integration regardless of the dimensionality of the latent variables. We use simulation studies to demonstrate the flexibility and accuracy of the proposed methods. We apply the model to study cross-country differences, including differential item functioning, using data from a large international education survey on mathematics literacy. PMID:21534682
Paz, Sylvia H; Spritzer, Karen L; Reise, Steven P; Hays, Ron D
2017-06-01
About 70% of Latinos, 5 years old or older, in the United States speak Spanish at home. Measurement equivalence of the PROMIS® pain interference (PI) item bank by language of administration (English versus Spanish) has not been evaluated. A sample of 527 adult Spanish-speaking Latinos completed the Spanish version of the 41-item PROMIS® pain interference item bank. We evaluate dimensionality, monotonicity and local independence of the Spanish-language items. Then we evaluate differential item functioning (DIF) using ordinal logistic regression with item response theory scores estimated from DIF-free "anchor" items. One of the 41 items in the Spanish version of the PROMIS® PI item bank was identified as having significant uniform DIF. English- and Spanish-speaking subjects with the same level of pain interference responded differently to 1 of the 41 items in the PROMIS® PI item bank. This item was not retained due to proprietary issues. The original English language item parameters can be used when estimating PROMIS® PI scores.
Analyzing force concept inventory with item response theory
NASA Astrophysics Data System (ADS)
Wang, Jing; Bao, Lei
2010-10-01
Item response theory is a popular assessment method used in education. It rests on the assumption of a probability framework that relates students' innate ability and their performance on test questions. Item response theory transforms students' raw test scores into a scaled proficiency score, which can be used to compare results obtained with different test questions. The scaled score also addresses the issues of ceiling effects and guessing, which commonly exist in quantitative assessment. We used item response theory to analyze the force concept inventory (FCI). Our results show that item response theory can be useful for analyzing physics concept surveys such as the FCI and produces results about the individual questions and student performance that are beyond the capability of classical statistics. The theory yields detailed measurement parameters regarding the difficulty, discrimination features, and probability of correct guess for each of the FCI questions.
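The three per-question parameters named in this abstract (difficulty, discrimination, and probability of a correct guess) are the parameters of the three-parameter logistic (3PL) model. The following is a minimal sketch of that response function, not the authors' analysis code; the function name and the item values are hypothetical.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Probability of a correct answer under the 3PL model.

    theta: student proficiency, a: discrimination,
    b: difficulty, c: probability of a correct guess.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item parameters, chosen only for illustration.
item = {"a": 1.2, "b": 0.5, "c": 0.2}

for theta in (-3.0, 0.5, 3.0):
    print(theta, round(p_correct_3pl(theta, **item), 3))
```

At `theta == b` the probability sits halfway between the guessing floor `c` and 1, and for very low proficiency it approaches `c` rather than 0, which is how the model accounts for guessing.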
Improved Classification of Mammograms Following Idealized Training
Hornsby, Adam N.; Love, Bradley C.
2014-01-01
People often make decisions by stochastically retrieving a small set of relevant memories. This limited retrieval implies that human performance can be improved by training on idealized category distributions (Giguère & Love, 2013). Here, we evaluate whether the benefits of idealized training extend to categorization of real-world stimuli, namely classifying mammograms as normal or tumorous. Participants in the idealized condition were trained exclusively on items that, according to a norming study, were relatively unambiguous. Participants in the actual condition were trained on a representative range of items. Despite being exclusively trained on easy items, idealized-condition participants were more accurate than those in the actual condition when tested on a range of item types. However, idealized participants experienced difficulties when test items were very dissimilar from training cases. The benefits of idealization, attributable to reducing noise arising from cognitive limitations in memory retrieval, suggest ways to improve real-world decision making. PMID:24955325
Improved Classification of Mammograms Following Idealized Training.
Hornsby, Adam N; Love, Bradley C
2014-06-01
People often make decisions by stochastically retrieving a small set of relevant memories. This limited retrieval implies that human performance can be improved by training on idealized category distributions (Giguère & Love, 2013). Here, we evaluate whether the benefits of idealized training extend to categorization of real-world stimuli, namely classifying mammograms as normal or tumorous. Participants in the idealized condition were trained exclusively on items that, according to a norming study, were relatively unambiguous. Participants in the actual condition were trained on a representative range of items. Despite being exclusively trained on easy items, idealized-condition participants were more accurate than those in the actual condition when tested on a range of item types. However, idealized participants experienced difficulties when test items were very dissimilar from training cases. The benefits of idealization, attributable to reducing noise arising from cognitive limitations in memory retrieval, suggest ways to improve real-world decision making.
Informed choice: understanding knowledge in the context of screening uptake.
Michie, Susan; Dormandy, Elizabeth; Marteau, Theresa M
2003-07-01
This study evaluates a scale measuring knowledge about a screening test and investigates the association between knowledge, uptake and attitudes towards screening. One thousand four hundred ninety-nine pregnant women completed the knowledge scale of the multidimensional measure of informed choice (MMIC). Three hundred forty-five of these women and 152 professionals providing antenatal care also rated the importance of the knowledge items. Item characteristic curves show that, with one exception, the knowledge items reflect a spread of difficulty and are able to discriminate between people. All items were seen as essential or helpful by both women and health professionals, with two items seen as particularly important and one as unimportant. There were some differences between health professionals, women with low risk results and women with high risk results. Knowledge was not associated with uptake, attitude, or the extent to which uptake was consistent with women's attitudes towards undergoing the test.
Item Selection and Ability Estimation Procedures for a Mixed-Format Adaptive Test
ERIC Educational Resources Information Center
Ho, Tsung-Han; Dodd, Barbara G.
2012-01-01
In this study we compared five item selection procedures using three ability estimation methods in the context of a mixed-format adaptive test based on the generalized partial credit model. The item selection procedures used were maximum posterior weighted information, maximum expected information, maximum posterior weighted Kullback-Leibler…
ERIC Educational Resources Information Center
Gadermann, Anne M.; Guhn, Martin; Zumbo, Bruno D.
2012-01-01
This paper provides a conceptual, empirical, and practical guide for estimating ordinal reliability coefficients for ordinal item response data (also referred to as Likert, Likert-type, ordered categorical, or rating scale item responses). Conventionally, reliability coefficients, such as Cronbach's alpha, are calculated using a Pearson…
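For reference, the conventional Cronbach's alpha named in this abstract can be computed from item variances and the variance of the total score. The sketch below shows only that conventional coefficient, not the ordinal coefficients the paper discusses; the function name and the toy data are hypothetical.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Conventional Cronbach's alpha.

    rows: list of respondents, each a list of k numeric item scores.
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)

# Toy data (made up): four respondents, three Likert-type items.
ratings = [
    [1, 2, 1],
    [2, 2, 3],
    [4, 3, 4],
    [5, 4, 5],
]
print(round(cronbach_alpha(ratings), 3))
```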
A Feedback Control Strategy for Enhancing Item Selection Efficiency in Computerized Adaptive Testing
ERIC Educational Resources Information Center
Weissman, Alexander
2006-01-01
A computerized adaptive test (CAT) may be modeled as a closed-loop system, where item selection is influenced by trait level ([theta]) estimation and vice versa. When discrepancies exist between an examinee's estimated and true [theta] levels, nonoptimal item selection is a likely result. Nevertheless, examinee response behavior consistent with…
Investigating the Stability of Four Methods for Estimating Item Bias.
ERIC Educational Resources Information Center
Perlman, Carole L.; And Others
The reliability of item bias estimates was studied for four methods: (1) the transformed delta method; (2) Shepard's modified delta method; (3) Rasch's one-parameter residual analysis; and (4) the Mantel-Haenszel procedure. Bias statistics were computed for each sample using all methods. Data were from administration of multiple-choice items from…
IRTPRO 2.1 for Windows (Item Response Theory for Patient-Reported Outcomes)
ERIC Educational Resources Information Center
Paek, Insu; Han, Kyung T.
2013-01-01
This article reviews a new item response theory (IRT) model estimation program, IRTPRO 2.1, for Windows that is capable of unidimensional and multidimensional IRT model estimation for existing and user-specified constrained IRT models for dichotomously and polytomously scored item response data. (Contains 1 figure and 2 notes.)
Normal Theory Two-Stage ML Estimator When Data Are Missing at the Item Level
Savalei, Victoria; Rhemtulla, Mijke
2017-01-01
In many modeling contexts, the variables in the model are linear composites of the raw items measured for each participant; for instance, regression and path analysis models rely on scale scores, and structural equation models often use parcels as indicators of latent constructs. Currently, no analytic estimation method exists to appropriately handle missing data at the item level. Item-level multiple imputation (MI), however, can handle such missing data straightforwardly. In this article, we develop an analytic approach for dealing with item-level missing data—that is, one that obtains a unique set of parameter estimates directly from the incomplete data set and does not require imputations. The proposed approach is a variant of the two-stage maximum likelihood (TSML) methodology, and it is the analytic equivalent of item-level MI. We compare the new TSML approach to three existing alternatives for handling item-level missing data: scale-level full information maximum likelihood, available-case maximum likelihood, and item-level MI. We find that the TSML approach is the best analytic approach, and its performance is similar to item-level MI. We recommend its implementation in popular software and its further study. PMID:29276371
Normal Theory Two-Stage ML Estimator When Data Are Missing at the Item Level.
Savalei, Victoria; Rhemtulla, Mijke
2017-08-01
In many modeling contexts, the variables in the model are linear composites of the raw items measured for each participant; for instance, regression and path analysis models rely on scale scores, and structural equation models often use parcels as indicators of latent constructs. Currently, no analytic estimation method exists to appropriately handle missing data at the item level. Item-level multiple imputation (MI), however, can handle such missing data straightforwardly. In this article, we develop an analytic approach for dealing with item-level missing data, that is, one that obtains a unique set of parameter estimates directly from the incomplete data set and does not require imputations. The proposed approach is a variant of the two-stage maximum likelihood (TSML) methodology, and it is the analytic equivalent of item-level MI. We compare the new TSML approach to three existing alternatives for handling item-level missing data: scale-level full information maximum likelihood, available-case maximum likelihood, and item-level MI. We find that the TSML approach is the best analytic approach, and its performance is similar to item-level MI. We recommend its implementation in popular software and its further study.
A Test-Length Correction to the Estimation of Extreme Proficiency Levels
ERIC Educational Resources Information Center
Magis, David; Beland, Sebastien; Raiche, Gilles
2011-01-01
In this study, the estimation of extremely large or extremely small proficiency levels, given the item parameters of a logistic item response model, is investigated. On one hand, the estimation of proficiency levels by maximum likelihood (ML), despite being asymptotically unbiased, may yield infinite estimates. On the other hand, with an…
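The infinite ML estimates mentioned here arise for extreme response patterns: when every item is answered correctly, the log-likelihood keeps rising as the proficiency level grows, so no finite maximum exists. The sketch below illustrates this under a one-parameter logistic (Rasch) model, which is an assumption made for simplicity; the item difficulties are hypothetical.

```python
import math

def log_likelihood(theta, difficulties, responses):
    """Log-likelihood of a 0/1 response pattern under the Rasch model."""
    ll = 0.0
    for b, x in zip(difficulties, responses):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        ll += math.log(p) if x == 1 else math.log(1.0 - p)
    return ll

difficulties = [-1.0, 0.0, 1.0]   # hypothetical item difficulties
grid = [0.0, 2.0, 4.0, 8.0]       # increasing proficiency values

# All-correct pattern: the log-likelihood increases along the whole grid.
all_correct = [log_likelihood(t, difficulties, [1, 1, 1]) for t in grid]

# Mixed pattern: the log-likelihood peaks at a finite proficiency level.
mixed = [log_likelihood(t, difficulties, [1, 1, 0]) for t in grid]

print(all_correct)
print(mixed)
```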
Large Sample Confidence Intervals for Item Response Theory Reliability Coefficients
ERIC Educational Resources Information Center
Andersson, Björn; Xin, Tao
2018-01-01
In applications of item response theory (IRT), an estimate of the reliability of the ability estimates or sum scores is often reported. However, analytical expressions for the standard errors of the estimators of the reliability coefficients are not available in the literature and therefore the variability associated with the estimated reliability…
Wolfe, Edward W; McGill, Michael T
2011-01-01
This article summarizes a simulation study of the performance of five item quality indicators (the weighted and unweighted versions of the mean square and standardized mean square fit indices and the point-measure correlation) under conditions of relatively high and low amounts of missing data under both random and conditional patterns of missing data for testing contexts such as those encountered in operational administrations of a computerized adaptive certification or licensure examination. The results suggest that weighted fit indices, particularly the standardized mean square index, and the point-measure correlation provide the most consistent information between random and conditional missing data patterns and that these indices perform more comparably for items near the passing score than for items with extreme difficulty values.
Yang, Sook Ja; Chee, Yeon Kyung; An, Jisook; Park, Min Hee; Jung, Sunok
2016-05-01
The purpose of this study was to obtain an independent evaluation of the factor structure of the 12-item Health Literacy Index for Female Marriage Immigrants (HLI-FMI), the first measure for assessing health literacy for FMIs in Korea. Participants were 250 Asian women who migrated from China, Vietnam, and the Philippines to marry. The HLI-FMI was originally developed and administered in Korean, and other questionnaires were translated into participants' native languages. The HLI-FMI consisted of 2 factors: (1) Access-Understand Health Literacy (7 items) and (2) Appraise-Apply Health Literacy (5 items); Cronbach's α = .73. Confirmatory factor analysis indicated adequate fit for the 2-factor model. HLI-FMI scores were positively associated with time since immigration and Korean proficiency. Based on classical test theory and item response theory, strong support was provided for item discrimination and item difficulty. Findings suggested that the HLI-FMI is an easily administered, reliable, and valid scale. © 2016 APJPH.
An Analysis of the Connectedness to Nature Scale Based on Item Response Theory
Pasca, Laura; Aragonés, Juan I.; Coello, María T.
2017-01-01
The Connectedness to Nature Scale (CNS) is used as a measure of the subjective cognitive connection between individuals and nature. However, to date, it has not been analyzed at the item level to confirm its quality. In the present study, we conduct such an analysis based on Item Response Theory. We employed data from previous studies using the Spanish-language version of the CNS, analyzing a sample of 1008 participants. The results show that seven items presented appropriate indices of discrimination and difficulty, in addition to a good fit. The remaining six have inadequate discrimination indices and do not present a good fit. A second study with 321 participants shows that the seven-item scale has adequate levels of reliability and validity. Therefore, it would be appropriate to use a reduced version of the scale after eliminating the items that display inappropriate behavior, since they may interfere with research results on connectedness to nature. PMID:28824509
Colorado Learning Difficulties Questionnaire: Validation of a parent-report screening measure
Willcutt, Erik G.; Boada, Richard; Riddle, Margaret W.; Chhabildas, Nomita; DeFries, John C.; Pennington, Bruce F.
2011-01-01
This study evaluated the internal structure and convergent and discriminant evidence for the Colorado Learning Difficulties Questionnaire (CLDQ), a 20-item parent-report rating scale that was developed to provide a brief screening measure for learning difficulties. CLDQ ratings were obtained from parents of children in two large community samples and two samples from clinics that specialize in the assessment of learning disabilities and related disorders (total N = 8,004). Exploratory and confirmatory factor analyses revealed five correlated but separable dimensions that were labeled reading, math, social cognition, social anxiety, and spatial difficulties. Results revealed strong convergent and discriminant evidence for the CLDQ Reading scale, suggesting that this scale may provide a useful method to screen for reading difficulties in both research studies and clinical settings. Results are also promising for the other four CLDQ scales, but additional research is needed to refine each of these measures. PMID:21574721
Stability of INFIT and OUTFIT Compared to Simulated Estimates in Applied Setting.
Hodge, Kari J; Morgan, Grant B
Residual-based fit statistics are commonly used as an indication of the extent to which the item response data fit the Rasch model. Fit statistic estimates are influenced by sample size, and rule-of-thumb critical values may result in incorrect conclusions about the extent to which the model fits the data. Estimates obtained in this analysis were compared to 250 simulated data sets to examine the stability of the estimates. All INFIT estimates were within the rule-of-thumb range of 0.7 to 1.3. However, only 82% of the INFIT estimates fell within the 2.5th and 97.5th percentiles of the simulated items' INFIT distributions, using this 95% confidence-like interval. This is an 18 percentage point difference in items that were classified as acceptable. Forty-eight percent of OUTFIT estimates fell within the 0.7 to 1.3 rule-of-thumb range, whereas 34% of OUTFIT estimates fell within the 2.5th and 97.5th percentiles of the simulated items' OUTFIT distributions. This is a 13 percentage point difference in items that were classified as acceptable. When using the rule-of-thumb ranges for fit estimates, the magnitude of misfit was smaller than with the 95% confidence interval of the simulated distribution. The findings indicate that the use of confidence intervals as critical values for fit statistics leads to different model-data fit conclusions than traditional rule-of-thumb critical values.
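As background for the comparison above, the two residual-based statistics can be sketched for a single dichotomous Rasch item using their standard mean-square definitions: OUTFIT is the unweighted mean of squared standardized residuals, and INFIT weights the squared residuals by their model variance. This is an illustrative sketch, not the authors' code; the function names are hypothetical, and the 0.7 to 1.3 range is the rule of thumb quoted in the abstract.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """Return (INFIT, OUTFIT) mean squares for one dichotomous item.

    responses: 0/1 answers, one per person.
    thetas: person ability estimates, b: item difficulty (logits).
    """
    sq_resid = 0.0   # sum of squared raw residuals
    info = 0.0       # sum of model variances p * (1 - p)
    z_sq = 0.0       # sum of squared standardized residuals
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        w = p * (1.0 - p)
        sq_resid += (x - p) ** 2
        info += w
        z_sq += (x - p) ** 2 / w
    infit = sq_resid / info
    outfit = z_sq / len(responses)
    return infit, outfit

def within_rule_of_thumb(value, low=0.7, high=1.3):
    """Check a mean-square value against the 0.7 to 1.3 range."""
    return low <= value <= high

# Made-up responses and abilities for one item of difficulty 0.0 logits.
infit, outfit = item_fit([1, 0, 1, 1, 0], [0.5, -0.8, 1.2, 0.1, -1.5], 0.0)
print(round(infit, 3), round(outfit, 3), within_rule_of_thumb(infit))
```

The simulation-based alternative described in the abstract would replace `within_rule_of_thumb` with a check against the 2.5th and 97.5th percentiles of fit values computed on data simulated from the fitted model.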
Teachers' experiences supporting children after traumatic exposure.
Alisic, Eva; Bus, Marissa; Dulack, Wendel; Pennings, Lenneke; Splinter, Jessica
2012-02-01
Teachers can be instrumental in supporting children's recovery after trauma, but some work suggests that elementary school teachers are uncertain about their role and about what to do to assist children effectively after their students have been exposed to traumatic stressors. This study examined the extent to which teachers working with children from ages 8 to 12 years report similar concerns. A random sample of teachers in the Netherlands (N = 765) completed a questionnaire that included 9 items measuring difficulties on a 6-point Likert scale (potential range of total scores: 9-54). The mean total difficulty score was 29.8 (ranging from 10 to 50; SD = 7.37). On individual items, the fraction of teachers scoring 4 or more varied between 25 and 63%. A multiple regression analysis showed that teachers' total scores depended on amount of teaching experience, attendance at trauma-focused training, and the number of traumatized children they had worked with. The model explained 4% of the variance, a small effect. Because traumatic exposure in children is rather common, the findings point to a need to better understand what influences teachers' difficulties and develop trauma-informed practice in elementary schools. Copyright © 2012 International Society for Traumatic Stress Studies.
Schmitter-Edgecombe, Maureen; Parsey, Carolyn; Lamb, Richard
2014-01-01
The Instrumental Activities of Daily Living – Compensation (IADL-C) scale was developed to capture early functional difficulties and to quantify compensatory strategy use that may mitigate functional decline in the aging population. The IADL-C was validated in a sample of cognitively healthy older adults (N=184) and individuals with mild cognitive impairment (MCI; N=92) and dementia (N=24). Factor analysis and Rasch item analysis led to the 27-item IADL-C informant questionnaire with four functional domain subscales (money and self-management, home daily living, travel and event memory, and social skills). The subscales demonstrated good internal consistency (Rasch reliability 0.80 to 0.93) and test-retest reliability (Spearman coefficients 0.70 to 0.91). The IADL-C total score and subscales showed convergent validity with other IADL measures, discriminant validity with psychosocial measures, and the ability to discriminate between diagnostic groups. The money and self-management subscale showed notable difficulties for individuals with MCI, whereas difficulties with home daily living became more prominent for dementia participants. Compensatory strategy use increased in the MCI group and decreased in the dementia group. PMID:25344901
What You Don't Know Can Hurt You: Missing Data and Partial Credit Model Estimates
Thomas, Sarah L.; Schmidt, Karen M.; Erbacher, Monica K.; Bergeman, Cindy S.
2017-01-01
The authors investigated the effect of Missing Completely at Random (MCAR) item responses on partial credit model (PCM) parameter estimates in a longitudinal study of Positive Affect. Participants were 307 adults from the older cohort of the Notre Dame Study of Health and Well-Being (Bergeman and Deboeck, 2014) who completed questionnaires including Positive Affect items for 56 days. Additional missing responses were introduced to the data, randomly replacing 20%, 50%, and 70% of the responses on each item and each day with missing values, in addition to the existing missing data. Results indicated that item locations and person trait level measures diverged from the original estimates as the level of degradation from induced missing data increased. In addition, standard errors of these estimates increased with the level of degradation. Thus, MCAR data does damage the quality and precision of PCM estimates. PMID:26784376
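The degradation step described above, randomly replacing a fixed share of responses with missing values on top of the missing data already present, can be sketched as follows. This is an illustration rather than the authors' procedure: the function name is hypothetical, and it assumes the proportion applies to the currently observed responses of one item on one day, which the abstract does not specify.

```python
import random

def induce_mcar(responses, proportion, seed=0):
    """Set a random share of the observed responses to None (MCAR).

    responses: list of scores for one item on one day; None marks
    data that were already missing and is left untouched.
    proportion: share of the observed responses to remove (assumption).
    """
    rng = random.Random(seed)
    observed = [i for i, r in enumerate(responses) if r is not None]
    n_remove = round(proportion * len(observed))
    drop = set(rng.sample(observed, n_remove))
    return [None if i in drop else r for i, r in enumerate(responses)]

# Made-up ratings with one response already missing.
ratings = [1, 2, None, 3, 4, 5, 1, 2, 3, 4, 5]
for proportion in (0.2, 0.5, 0.7):
    degraded = induce_mcar(ratings, proportion)
    print(proportion, degraded.count(None))
```

Because the positions to drop are sampled without regard to the response values, the induced missingness is completely at random by construction.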
ITEMS Project: An online sequence for teaching mathematics and astronomy
NASA Astrophysics Data System (ADS)
Martínez, Bernat; Pérez, Josep
2010-10-01
This work describes an elearning sequence for teaching geometry and astronomy in lower secondary school created inside the ITEMS (Improving Teacher Education in Mathematics and Science) project. It is based on results from the astronomy education research about students' difficulties in understanding elementary astronomical observations and models. The sequence consists of a set of computer animations embedded in an elearning environment aimed at supporting students in learning about astronomy ideas that require the use of geometrical concepts and visual-spatial reasoning.
Item response theory analysis of the mechanics baseline test
NASA Astrophysics Data System (ADS)
Cardamone, Caroline N.; Abbott, Jonathan E.; Rayyan, Saif; Seaton, Daniel T.; Pawl, Andrew; Pritchard, David E.
2012-02-01
Item response theory is useful in both the development and evaluation of assessments and in computing standardized measures of student performance. In item response theory, individual parameters (difficulty, discrimination) for each item or question are fit by item response models. These parameters provide a means for evaluating a test and offer a better measure of student skill than a raw test score, because each skill calculation considers not only the number of questions answered correctly, but the individual properties of all questions answered. Here, we present the results from an analysis of the Mechanics Baseline Test given at MIT during 2005-2010. Using the item parameters, we identify questions on the Mechanics Baseline Test that are not effective in discriminating between MIT students of different abilities. We show that a limited subset of the highest quality questions on the Mechanics Baseline Test returns accurate measures of student skill. We compare student skills as determined by item response theory to the more traditional measurement of the raw score and show that a comparable measure of learning gain can be computed.
Effects of Differential Item Functioning on Examinees' Test Performance and Reliability of Test
ERIC Educational Resources Information Center
Lee, Yi-Hsuan; Zhang, Jinming
2017-01-01
Simulations were conducted to examine the effect of differential item functioning (DIF) on measurement consequences such as total scores, item response theory (IRT) ability estimates, and test reliability in terms of the ratio of true-score variance to observed-score variance and the standard error of estimation for the IRT ability parameter. The…
ERIC Educational Resources Information Center
Sahin, Alper; Ozbasi, Durmus
2017-01-01
Purpose: This study aims to reveal effects of content balancing and item selection method on ability estimation in computerized adaptive tests by comparing Fisher's maximum information (FMI) and likelihood weighted information (LWI) methods. Research Methods: Four groups of examinees (250, 500, 750, 1000) and a bank of 500 items with 10 different…
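Maximum Fisher information selection, one of the two methods compared here, picks the unused item that is most informative at the current ability estimate. The sketch below assumes a two-parameter logistic model, which the abstract does not state, and omits content balancing; the item bank and function names are hypothetical.

```python
import math

def fisher_information(theta, a, b):
    """Item information under the 2PL model: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_item(theta_hat, bank, administered):
    """Return the id of the unused item with maximum information."""
    candidates = [item_id for item_id in bank if item_id not in administered]
    return max(candidates,
               key=lambda item_id: fisher_information(theta_hat, *bank[item_id]))

# Hypothetical bank: item id -> (discrimination a, difficulty b).
bank = {
    "item1": (1.0, -1.0),
    "item2": (1.0, 0.0),
    "item3": (1.0, 1.5),
}

first = select_item(0.1, bank, administered=set())
second = select_item(0.1, bank, administered={first})
print(first, second)
```

With equal discriminations, the most informative item is the one whose difficulty is closest to the current ability estimate, so the selection moves through the bank as the estimate is updated.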
Characterizing Sources of Uncertainty in Item Response Theory Scale Scores
ERIC Educational Resources Information Center
Yang, Ji Seung; Hansen, Mark; Cai, Li
2012-01-01
Traditional estimators of item response theory scale scores ignore uncertainty carried over from the item calibration process, which can lead to incorrect estimates of the standard errors of measurement (SEMs). Here, the authors review a variety of approaches that have been applied to this problem and compare them on the basis of their statistical…
How Big Is Big Enough? Sample Size Requirements for CAST Item Parameter Estimation
ERIC Educational Resources Information Center
Chuah, Siang Chee; Drasgow, Fritz; Luecht, Richard
2006-01-01
Adaptive tests offer the advantages of reduced test length and increased accuracy in ability estimation. However, adaptive tests require large pools of precalibrated items. This study looks at the development of an item pool for 1 type of adaptive administration: the computer-adaptive sequential test. An important issue is the sample size required…