measurement error rater: Topics by Science.gov

Sample records for measurement error rater

Intra-rater and inter-rater reliability of ultrasonographic measurements of acromion-greater tuberosity distance in patients with post-stroke hemiplegia.

PubMed

Kumar, Praveen; Cruziah, Reynold; Bradley, Michael; Gray, Selena; Swinkels, Annette

2016-06-01

Glenohumeral subluxation (GHS) is reported in up to 81% of patients with stroke. Ultrasonographic measurements of GHS by measuring the acromion-greater tuberosity (AGT) distance have been found to be reliable for experienced raters. The primary aim was to assess the intra-rater reliability of measurements of AGT distance in people with stroke following a short course of rater training. A secondary aim was to compare the inter-rater reliability of these measurements between novice and experienced raters. Patients with stroke (n = 16; 5 men, 11 women; 74 ± 10 years) with 1-sided weakness who gave informed consent were recruited. Ultrasonographic measurements were recorded at the bedside by two physiotherapists with patients seated upright in a hospital chair. Reliability was assessed by intra-class correlation coefficients (ICCs) and the standard error of measurement (SEM). Minimum detectable change (MDC90) scores were used to estimate the magnitude of change that is likely to exceed measurement error. Mean ± SD AGT distances on the affected and unaffected sides for rater 1 were 2.2 ± 0.7 and 1.7 ± 0.4 cm, respectively. Corresponding values for rater 2 were 2.5 ± 0.6 and 2.0 ± 0.4 cm. Intra-class correlation coefficient values for the affected and unaffected shoulders for rater 1 were 0.96 and 0.91, respectively. Corresponding values for rater 2 were 0.95 and 0.90. SEM and MDC90 for both affected and unaffected shoulders were ≤ 0.2 cm. Inter-rater reliability coefficients were 0.86 for the affected and 0.76 for the unaffected shoulders. Ultrasonographic measurement of AGT distance demonstrates excellent intra-rater reliability for a novice rater. Ultrasonographic measurement of AGT distance also demonstrates good inter-rater reliability between novice and experienced raters.
Measurement of glenohumeral joint translation using real-time ultrasound imaging: A physiotherapist and sonographer intra-rater and inter-rater reliability study.

PubMed

Rathi, Sangeeta; Taylor, Nicholas F; Gee, Jamie; Green, Rodney A

2016-12-01

Ultrasonography is an economical and non-invasive method for measuring real-time joint movements. Although physiotherapists are increasingly using ultrasound imaging for rotator cuff disorders, there is a lack of evidence on their reliability in using ultrasonography to measure glenohumeral translation. The aim of this study was to evaluate the reliability of a physiotherapist in measuring anterior and posterior glenohumeral joint translation with ultrasound. Study design: within-day reliability. Anterior and posterior glenohumeral translations were measured at rest, in response to passive accessory motion testing force, and with isometric internal and external rotation in 12 young healthy adults. All the measurements were made in real time by a physiotherapist and an experienced sonographer in two positions (neutral and abducted) and in two views (anterior and posterior). Intra-rater and inter-rater reliability were expressed using intraclass correlation coefficients (ICC) and measurement error (mm). Intra-rater reliability was good for both raters (ICC_P: 0.86-0.98; ICC_S: 0.85-0.96). The inter-rater reliability between the physiotherapist and sonographer was moderate to good for posterior measurements (ICC 0.50-0.75) and poor to moderate for anterior measurements (ICC 0.31-0.53). For both intra-rater and inter-rater measurements, posterior translation was more reliable than the anterior translation with smaller measurement errors (posterior: 0.1-0.2 mm, anterior: 0.2-0.3 mm). A physiotherapist with minimal training was reliable in measuring glenohumeral joint translations. The ultrasound method was reliable for repeated measurement of both anterior and posterior glenohumeral translations with posterior measurements being more reliable than anterior. This method is recommended for future research to investigate the stabilising role of rotator cuff muscles. Copyright © 2016 Elsevier Ltd. All rights reserved.
Screening of the spine in adolescents: inter- and intra-rater reliability and measurement error of commonly used clinical tests.

PubMed

Aartun, Ellen; Degerfalk, Anna; Kentsdotter, Linn; Hestbaek, Lise

2014-02-10

Evidence on the reliability of clinical tests used for the spinal screening of children and adolescents is currently lacking. The aim of this study was to determine the inter- and intra-rater reliability and measurement error of clinical tests commonly used when screening young spines. Two experienced chiropractors independently assessed 111 adolescents aged 12-14 years who were recruited from a primary school in Denmark. A standardised examination protocol was used to test inter-rater reliability including tests for scoliosis, hypermobility, general mobility, inter-segmental mobility and end range pain in the spine. Seventy-five of the 111 subjects were re-examined after one to four hours to test intra-rater reliability. Percentage agreement and Cohen's Kappa were calculated for binary variables, and intraclass correlation coefficients (ICC) and Bland-Altman plots with Limits of Agreement (LoA) were calculated for continuous measures. Inter-rater percentage agreement for binary data ranged from 59.5% to 100%. Kappa ranged from 0.06 to 1.00. Kappa ≥ 0.40 was seen for elbow, thumb, fifth finger and trunk/hip flexion hypermobility, pain response in inter-segmental mobility and end range pain in lumbar flexion and extension. For continuous data, ICCs ranged from 0.40 to 0.95. Only forward flexion as measured by finger-to-floor distance reached an acceptable ICC (≥ 0.75). Overall, results for intra-rater reliability were better than for inter-rater reliability but for both components, the LoA were quite wide compared with the range of assessments. Some clinical tests showed good, and some tests poor, reliability when applied in a spinal screening of adolescents. The results could probably be improved by additional training and further test standardization. This is the first step in evaluating the value of these tests for the spinal screening of adolescents. Future research should determine the association between these tests and current and/or future neck and back pain.
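The Bland-Altman limits of agreement named in this abstract are conventionally the mean of the paired differences plus or minus 1.96 times their standard deviation. The following is a minimal Python sketch of that convention, not the study's own analysis; the function name and the finger-to-floor values are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def limits_of_agreement(rater_a, rater_b, z=1.96):
    """Return (bias, lower LoA, upper LoA) for paired measurements.

    Assumes both lists hold measurements of the same subjects in the
    same order. z = 1.96 gives the conventional 95% limits.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = mean(diffs)   # systematic difference between the raters
    sd = stdev(diffs)    # sample SD of the differences (n - 1 denominator)
    return bias, bias - z * sd, bias + z * sd


# Hypothetical finger-to-floor distances in cm (not data from the study)
rater_a = [10.0, 12.0, 15.0, 9.0, 20.0]
rater_b = [11.0, 12.0, 13.0, 10.0, 19.0]
bias, lower, upper = limits_of_agreement(rater_a, rater_b)
```

Limits that are wide relative to the range of the measurements, as the abstract reports, mean that two raters can differ by a clinically relevant amount on a single subject even when the mean bias is small.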
On Rater Agreement and Rater Training

ERIC Educational Resources Information Center

Wang, Binhong

2010-01-01

This paper first analyzes two studies on rater factors and rating criteria to raise the problem of rater agreement. The author then reveals the causes of discrepancies in rating administration by discussing rater variability and rater bias. The author argues that rater bias cannot be eliminated completely; we can only reduce the error to a…
Intra-Rater and Inter-Rater Reliability of the Balance Error Scoring System in Pre-Adolescent School Children

ERIC Educational Resources Information Center

Sheehan, Dwayne P.; Lafave, Mark R.; Katz, Larry

2011-01-01

This study was designed to test the intra- and inter-rater reliability of the University of North Carolina's Balance Error Scoring System in 9- and 10-year-old children. Additionally, a modified version of the Balance Error Scoring System was tested to determine if it was more sensitive in this population ("raw scores"). Forty-six…
Inter-Rater Variability as Mutual Disagreement: Identifying Raters' Divergent Points of View

ERIC Educational Resources Information Center

Gingerich, Andrea; Ramlo, Susan E.; van der Vleuten, Cees P. M.; Eva, Kevin W.; Regehr, Glenn

2017-01-01

Whenever multiple observers provide ratings, even of the same performance, inter-rater variation is prevalent. The resulting "idiosyncratic rater variance" is considered to be unusable error of measurement in psychometric models and is a threat to the defensibility of our assessments. Prior studies of inter-rater variation in clinical…
Longitudinal Rater Modeling with Splines

ERIC Educational Resources Information Center

Dobria, Lidia

2011-01-01

Performance assessments rely on the expert judgment of raters for the measurement of the quality of responses, and raters unavoidably introduce error in the scoring process. Defined as the tendency of a rater to assign higher or lower ratings, on average, than those assigned by other raters, even after accounting for differences in examinee…
Inter-rater Reliability of Real-Time Ultrasound to Measure Acromiohumeral Distance.

PubMed

Mackenzie, Tanya Anne; Bdaiwi, Alya H; Herrington, Lee; Cools, Ann

2016-07-01

Real-time ultrasound (RTUS) has been suggested as a reliable measure of acromiohumeral distance. However, to date, no rigorous assessment and reporting of inter-rater reliability of this method has been performed with the shoulder in a neutral position or with active and passive arm abduction. To assess intrasession inter-rater reliability of using RTUS to measure acromiohumeral distance with the shoulder in a neutral position and with 60° active and passive abduction. Inter-rater intrasession reliability of repeated measures. Human performance laboratory. Twenty persons (12 male and 8 female) with an average age of 29.86 years (standard deviation, 7.8). In an inter-rater, intrasession study, RTUS was used to measure the acromiohumeral distance with the shoulder in a neutral position and with 60° of both active and passive abduction. Acromiohumeral distance. Intraclass correlation coefficient, ICC(2,1), scores ranged between 0.65 and 0.88 (standard error of the mean = 0.81-1.2 mm and minimal detectable differences with 95% confidence = 2.2-2.3 mm) for inter-rater intrasession reliability. RTUS was found to have fair to good inter-rater reliability as a tool to measure acromiohumeral distance with the shoulder in a neutral position and with 60° of both active and passive arm abduction. Copyright © 2016 American Academy of Physical Medicine and Rehabilitation. Published by Elsevier Inc. All rights reserved.
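For readers unfamiliar with the ICC model 2,1 reported here, the sketch below computes it from a subjects-by-raters table using the two-way random-effects, absolute-agreement, single-measure formula of Shrout and Fleiss. It is an illustration in Python under that assumption, not the authors' analysis; the function name and the example values are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per subject, one score per rater.
    """
    n = len(ratings)                 # number of subjects
    k = len(ratings[0]) if n else 0  # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 subjects and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical distances in mm: rater 2 reads exactly 1 mm higher than rater 1
offset_example = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
icc = icc_2_1(offset_example)  # about 0.667
```

Because this form measures absolute agreement, a constant offset between raters lowers the coefficient even when the two raters rank the subjects identically.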
Ultrasound measures of tendon thickness: Intra-rater, Inter-rater and Inter-machine reliability.

PubMed

Del Baño-Aledo, María Elena; Martínez-Payá, Jacinto Javier; Ríos-Díaz, José; Mejías-Suárez, Silvia; Serrano-Carmona, Sergio; de Groot-Ferrando, Ana

2017-01-01

Ultrasound imaging is often used by physiotherapists and other healthcare professionals but the reliability of image acquisition with different ultrasound machines is unknown. The objective was to compare the intra-rater, inter-rater and inter-machine reliability of thickness measurements of the plantar fascia (PF), Achilles tendon (AT), patellar tendon (PT) and elbow common extensor tendon (ECET) with musculoskeletal ultrasound imaging (MSUS). Tendon thickness was measured in four anatomical structures (14 participants, 28 images per tendon) by two sonographers and with two different ultrasound machines. Intraclass Correlation Coefficients (ICCs) and Bland-Altman plots were calculated. The standard error of measurement (SEM) and minimum detectable difference (MDD) were calculated. Inter-rater reliability was excellent for AT (ICC = 0.98; 95% CI = 0.96-0.99) and very good for PT (ICC = 0.85; 95% CI = 0.67-0.93) and ECET (ICC = 0.81; 95% CI = 0.72-0.94). Reliability for PF was moderate, with an ICC of 0.63 (95% CI = 0.20-0.83). Bland-Altman plot for inter-machine reliability showed a mean difference of 1 μm for PF measurements and a mean difference of 4 μm and 20 μm for AT and PT. The relative SEMs were below 7% and the MDCs were below 0.7 mm. The MSUS reliability in measuring thickness of the four tendons is confirmed by the homogeneous readings intra sonographers, between operators and between different machines. Tendon thickness can be measured reliably on different ultrasound devices, which is an important step forward in the use of this technique in daily clinical practice and research. Level of evidence: III.
Inter-Rater Reliability of Cyclotorsion Measurements Using Fundus Photography.

PubMed

Dysli, Muriel; Kanku, Madeleine; Traber, Ghislaine L

2018-04-01

The foveo-papillary angle (FPA) on fundus photographs is the accepted standard for the measurement of ocular cyclotorsion. We assessed the inter-rater reliability of this method in healthy subjects and in patients with trochlear nerve palsies. In this methodological study, fundus photographs of healthy subjects and of patients with trochlear nerve palsies were made with a fundus camera (Zeiss Fundus Camera FF 450 plus, Jena, Germany). Three independent observers measured the FPA on the fundus photographs of all subjects in synedra View (synedra View 16, Version 16.0.0.11, Innsbruck, Austria). One hundred and four eyes of 52 subjects (26 healthy controls and 26 patients) were assessed. The mean FPA of the healthy controls was 5.80 degrees (°) [± 0.44 standard error of the mean (SEM)] compared to 11.55° (± 0.80 SEM) for patients with trochlear nerve palsies. The inter-rater reliability of all measured FPAs showed an intraclass correlation coefficient (ICC) of 0.98 (95% CI 0.97 - 0.98). The inter-rater reliability of objective cyclotorsion measurements using fundus photographs was very high. Georg Thieme Verlag KG Stuttgart · New York.
Computerized back postural assessment in physiotherapy practice: Intra-rater and inter-rater reliability of the MIDAS system.

PubMed

McAlpine, R T; Bettany-Saltikov, J A; Warren, J G

2009-01-01

Assessment of spinal posture during physiotherapy practice is routine, yet few objective measures exist to this end. The Middlesbrough Integrated Digital Assessment System (MIDAS) is a low cost portable system able to record 3D information on posture. The purpose of this study was to assess both the intra-rater and inter-rater reliability of the MIDAS system. Twenty-five healthy subjects were recruited. A repeated measures design was used to record fifteen pre-palpated landmarks on the back of each subject. To limit the sources of variability, the principal researcher palpated the landmarks for each subject. Each of three raters took two measurements on each subject in a standardized upright posture. X (medio-lateral), Y (antero-posterior) and Z (height) landmark positions were recorded via a computer interface. Both intra-rater agreement (mean ICCs: rater 1 r=0.970, rater 2 r=0.965 and rater 3 r=0.965, p< 0.001) and inter-rater agreement (mean ICCs r=0.967, p< 0.001) were very high between repeated measures and between markers. Error values for the z-axis (height) were the lowest. The MIDAS demonstrated both high inter-rater and intra-rater reliability and provides an objective method for the assessment of posture in physiotherapy practice.
Intra-rater reliability of hallux flexor strength measures using the Nintendo Wii Balance Board.

PubMed

Quek, June; Treleaven, Julia; Brauer, Sandra G; O'Leary, Shaun; Clark, Ross A

2015-01-01

The purpose of this study was to investigate the intra-rater reliability of a new method in combination with the Nintendo Wii Balance Board (NWBB) to measure the strength of hallux flexor muscle. Thirty healthy individuals (age: 34.9 ± 12.9 years, height: 170.4 ± 10.5 cm, weight: 69.3 ± 15.3 kg, female = 15) participated. Repeated testing was completed within 7 days. Participants performed strength testing in sitting using a wooden platform in combination with the NWBB. This new method was set up to selectively recruit an intrinsic muscle of the foot, specifically the flexor hallucis brevis muscle. Statistical analysis was performed using intra-class coefficients and ordinary least product analysis. To estimate measurement error, standard error of measurement (SEM), minimal detectable change (MDC) and percentage error were calculated. Results indicate excellent intra-rater reliability (ICC = 0.982, CI = 0.96-0.99) with an absence of systematic bias. SEM, MDC and percentage error value were 0.5, 1.4 and 12 % respectively. This study demonstrates that a new method in combination with the NWBB application is reliable to measure hallux flexor strength and has potential to be used for future research and clinical application.
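The abstract does not state which formulas were used for SEM and MDC. Assuming the conventional definitions, SEM = SD × sqrt(1 − ICC) and MDC = z × sqrt(2) × SEM, the reported SEM of 0.5 and MDC of 1.4 are consistent with a 95% confidence level (z = 1.96). The Python sketch below illustrates those conventional formulas only; the function names are hypothetical.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement from the sample SD and an ICC."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def mdc(sem, z=1.96):
    """Minimal detectable change; z = 1.96 for MDC95, 1.645 for MDC90.

    The sqrt(2) factor reflects that a change score combines the error
    of two measurements.
    """
    return z * math.sqrt(2.0) * sem


mdc95 = mdc(0.5)  # about 1.39, which rounds to the reported 1.4
```

A change smaller than the MDC cannot be distinguished from measurement error for an individual at the chosen confidence level.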
An Alternative Method Used in Evaluating Agreement among Repeat Measurements by Two Raters in Education

ERIC Educational Resources Information Center

Erdogan, Semra; Orekici Temel, Gülhan; Selvi, Hüseyin; Ersöz Kaya, Irem

2017-01-01

Taking more than one measurement of the same variable also hosts the possibility of contamination from error sources, both singly and in combination as a result of interactions. Therefore, although the internal consistency of scores received from measurement tools is examined by itself, it is necessary to ensure interrater or intra-rater agreement…
Summary measures of agreement and association between many raters' ordinal classifications.

PubMed

Mitani, Aya A; Freer, Phoebe E; Nelson, Kerrie P

2017-10-01

Interpretation of screening tests such as mammograms usually require a radiologist's subjective visual assessment of images, often resulting in substantial discrepancies between radiologists' classifications of subjects' test results. In clinical screening studies to assess the strength of agreement between experts, multiple raters are often recruited to assess subjects' test results using an ordinal classification scale. However, using traditional measures of agreement in some studies is challenging because of the presence of many raters, the use of an ordinal classification scale, and unbalanced data. We assess and compare the performances of existing measures of agreement and association as well as a newly developed model-based measure of agreement to three large-scale clinical screening studies involving many raters' ordinal classifications. We also conduct a simulation study to demonstrate the key properties of the summary measures. The assessment of agreement and association varied according to the choice of summary measure. Some measures were influenced by the underlying prevalence of disease and raters' marginal distributions and/or were limited in use to balanced data sets where every rater classifies every subject. Our simulation study indicated that popular measures of agreement and association are prone to underlying disease prevalence. Model-based measures provide a flexible approach for calculating agreement and association and are robust to missing and unbalanced data as well as the underlying disease prevalence. Copyright © 2017 Elsevier Inc. All rights reserved.
The inter and intra rater reliability of the Netball Movement Screening Tool.

PubMed

Reid, Duncan A; Vanweerd, Rebecca J; Larmer, Peter J; Kingstone, Rachel

2015-05-01

To establish the inter- and intra-rater reliability of the Netball Movement Screening Tool for screening adolescent female netball players. Inter- and intra-rater reliability study. Forty secondary school netball players were recruited to take part in the study. Twenty subjects were screened simultaneously and independently by two raters to ascertain inter-rater agreement. Twenty subjects were scored by rater one on two occasions, separated by a week, to ascertain intra-rater agreement. Inter and intra-rater agreement was assessed utilising the two-way mixed inter class correlation coefficient and weighted kappa statistics. No significant demographic differences were found between the inter and intra-rater groups of subjects. Inter class correlation coefficients demonstrated excellent inter-rater (two-way mixed inter class correlation coefficients 0.84, standard error of measurement 0.25) and intra-rater (two-way mixed inter class correlation coefficients 0.96, standard error of measurement 0.13) reliability for the overall Netball Movement Screening Tool score and substantial-excellent (two-way mixed inter class correlation coefficients 1.0-0.65) inter-rater and substantial-excellent intra-rater (two-way mixed inter class correlation coefficients 0.96-0.79) reliability for the component scores of the Netball Movement Screening Tool. Kappa statistic showed substantial to poor inter-rater (k=0.75-0.32) and intra-rater (k=0.77-0.27) agreement for individual tests of the NMST. The Netball Movement Screening Tool may be a reliable screening tool for adolescent netball players; however the individual test scores have low reliability. The screening tool can be administered reliably by raters with similar levels of training in the tool but variable clinical experience. On-going research needs to be undertaken to ascertain whether the Netball Movement Screening Tool is a valid tool in ascertaining increased injury risk for netball players. Copyright © 2014 Sports
Kappa and Rater Accuracy: Paradigms and Parameters.

PubMed

Conger, Anthony J

2017-12-01

Drawing parallels to classical test theory, this article clarifies the difference between rater accuracy and reliability and demonstrates how category marginal frequencies affect rater agreement and Cohen's kappa (κ). Category assignment paradigms are developed: comparing raters to a standard (index) versus comparing two raters to one another (concordance), using both nonstochastic and stochastic category membership. Using a probability model to express category assignments in terms of rater accuracy and random error, it is shown that observed agreement (Po) depends only on rater accuracy and number of categories; however, expected agreement (Pe) and κ depend additionally on category frequencies. Moreover, category frequencies affect Pe and κ solely through the variance of the category proportions, regardless of the specific frequencies underlying the variance. Paradoxically, some judgment paradigms involving stochastic categories are shown to yield higher κ values than their nonstochastic counterparts. Using the stated probability model, assignments to categories were generated for 552 combinations of paradigms, rater and category parameters, category frequencies, and number of stimuli. Observed means and standard errors for Po, Pe, and κ were fully consistent with theory expectations. Guidelines for interpretation of rater accuracy and reliability are offered, along with a discussion of alternatives to the basic model.
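The dependence of κ on category frequencies can be seen directly from the standard definition κ = (Po − Pe) / (1 − Pe). The Python sketch below uses that textbook definition of Cohen's kappa, not the article's probability model; the function name and both agreement tables are hypothetical. The two tables share the same observed agreement but have different marginal frequencies.

```python
def cohens_kappa(table):
    """Return (Po, Pe, kappa) for a square two-rater agreement table.

    table[i][j] = number of items rater A put in category i and
    rater B put in category j.
    """
    size = len(table)
    if size < 2 or any(len(row) != size for row in table):
        raise ValueError("table must be square with at least 2 categories")
    total = sum(sum(row) for row in table)
    po = sum(table[i][i] for i in range(size)) / total           # observed agreement
    row_p = [sum(table[i]) / total for i in range(size)]         # rater A marginals
    col_p = [sum(table[i][j] for i in range(size)) / total
             for j in range(size)]                               # rater B marginals
    pe = sum(r * c for r, c in zip(row_p, col_p))                # chance agreement
    if pe == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return po, pe, (po - pe) / (1.0 - pe)


# Hypothetical counts: both tables have 90 of 100 items on the diagonal
balanced = [[45, 5], [5, 45]]   # categories used equally often
skewed = [[85, 5], [5, 5]]      # one category dominates

po_b, pe_b, kappa_b = cohens_kappa(balanced)  # Po 0.90, Pe 0.50, kappa about 0.80
po_s, pe_s, kappa_s = cohens_kappa(skewed)    # Po 0.90, Pe 0.82, kappa about 0.44
```

With identical observed agreement, the skewed table yields a much lower κ because unequal category proportions raise the expected agreement Pe.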
Inter-rater reliability of output measures for a posture matching assessment approach: a pilot study with food service workers.

PubMed

Cann, A P; Connolly, M; Ruuska, R; MacNeil, M; Birmingham, T B; Vandervoort, A A; Callaghan, J P

2008-04-01

Despite the ongoing health problem of repetitive strain injuries, there are few tools currently available for ergonomic applications evaluating cumulative loading that have well-documented evidence of reliability and validity. The purpose of this study was to determine the inter-rater reliability of a posture matching based analysis tool (3DMatch, University of Waterloo) for predicting cumulative and peak spinal loads. A total of 30 food service workers were each videotaped for a 1-h period while performing typical work activities and a single work task was randomly selected from each for analysis by two raters. Inter-rater reliability was determined using intraclass correlation coefficients (ICC) model 2,1 and standard errors of measurement for cumulative and peak spinal and shoulder loading variables across all subjects. Overall, 85.5% of variables had moderate to excellent inter-rater reliability, with ICCs ranging from 0.30-0.99 for all cumulative and peak loading variables. 3DMatch was found to be a reliable ergonomic tool when more than one rater is involved.
Measuring the quality of life in mild to very severe dementia: testing the inter-rater and intra-rater reliability of the German version of the QUALIDEM.

PubMed

Dichter, Martin Nikolaus; Schwab, Christian G G; Meyer, Gabriele; Bartholomeyczik, Sabine; Dortmann, Olga; Halek, Margareta

2014-05-01

Quality of life (Qol) is an increasingly used outcome measure in dementia research. The QUALIDEM is a dementia-specific and proxy-rated Qol instrument. We aimed to determine the inter-rater and intra-rater reliability in residents with dementia in German nursing homes. The QUALIDEM consists of nine subscales that were applied to a sample of 108 people with mild to severe dementia and six consecutive subscales that were applied to a sample of 53 people with very severe dementia. The proxy raters were 49 registered nurses and nursing assistants. Inter-rater and intra-rater reliability scores were calculated on the subscale and item level. None of the QUALIDEM subscales showed strong inter-rater reliability based on the single-measure Intra-Class Correlation Coefficient (ICC) for absolute agreement ≥ 0.70. Based on the average-measure ICC for four raters, eight subscales for people with mild to severe dementia (care relationship, positive affect, negative affect, restless tense behavior, social relations, social isolation, feeling at home and having something to do) and five subscales for very severe dementia (care relationship, negative affect, restless tense behavior, social relations and social isolation) yielded a strong inter-rater agreement (ICC: 0.72-0.86). All of the QUALIDEM subscales, regardless of dementia severity, showed strong intra-rater agreement. The ICC values ranged between 0.70 and 0.79 for people with mild to severe dementia and between 0.75 and 0.87 for people with very severe dementia. This study demonstrated insufficient inter-rater reliability and sufficient intra-rater reliability for all subscales of both versions of the German QUALIDEM. The degree of inter-rater reliability can be improved by collaborative Qol rating by more than one nurse. The development of a measurement manual with accurate item definitions and a standardized education program for proxy raters is recommended.
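The gap between single-measure and average-measure ICCs in this abstract follows from the Spearman-Brown relation, which links the reliability of one rating to the reliability of the mean of k ratings. The Python sketch below assumes that relation applies to the ICC forms used in the study; the function names are hypothetical, and the back-calculated single-measure values are illustrative, not results reported by the authors.

```python
def average_measure_icc(single_icc, k):
    """Reliability of the mean of k ratings, given a single-rating ICC."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * single_icc / (1.0 + (k - 1) * single_icc)


def single_measure_icc(average_icc, k):
    """Inverse of the relation above: single-rating ICC implied by an average-measure ICC."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return average_icc / (k - (k - 1) * average_icc)


# Reported average-measure range for four raters: 0.72 to 0.86
low = single_measure_icc(0.72, 4)    # about 0.39
high = single_measure_icc(0.86, 4)   # about 0.61
```

Under this assumption, average-measure values of 0.72 to 0.86 for four raters correspond to single-measure values of roughly 0.39 to 0.61, which is in line with the finding that no subscale reached 0.70 on the single-measure ICC.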
Y-balance test: a reliability study involving multiple raters.

PubMed

Shaffer, Scott W; Teyhen, Deydre S; Lorenson, Chelsea L; Warren, Rick L; Koreerat, Christina M; Straseske, Crystal A; Childs, John D

2013-11-01

The Y-balance test (YBT) is one of the few field expedient tests that have shown predictive validity for injury risk in an athletic population. However, analysis of the YBT in a heterogeneous population of active adults (e.g., military, specific occupations) involving multiple raters with limited experience in a mass screening setting is lacking. The primary purpose of this study was to determine interrater test-retest reliability of the YBT in a military setting using multiple raters. Sixty-four service members (53 males, 11 females) actively conducting military training volunteered to participate. Interrater test-retest reliability of the maximal reach had intraclass correlation coefficients (2,1) of 0.80 to 0.85 with a standard error of measurement ranging from 3.1 to 4.2 cm for the 3 reach directions (anterior, posteromedial, and posterolateral). Interrater test-retest reliability of the average reach of 3 trials had an intraclass correlation coefficient (2,3) range of 0.85 to 0.93 with an associated standard error of measurement ranging from 2.0 to 3.5 cm. The YBT showed good interrater test-retest reliability with an acceptable level of measurement error among multiple raters screening active duty service members. In addition, 31.3% (n = 20 of 64) of participants exhibited an anterior reach asymmetry of >4 cm, suggesting impaired balance symmetry and potentially increased risk for injury. Reprint & Copyright © 2013 Association of Military Surgeons of the U.S.
Inter- and intra-rater reliability of calliper-based lymph node measurement in dogs with peripheral nodal lymphomas.

PubMed

Childress, M O; Fulkerson, C M; Lahrman, S A; Weng, H-Y

2016-08-01

The purpose of this study was to assess reliability of lymph node measurements between and within raters in dogs with nodal lymphomas. Three raters measured lymph nodes from 20 dogs twice prior to and once after administering chemotherapy. Sum tumour volume (TV) and sum longest diameter (LD) of all lymph nodes at each time point, and the percent change in measurements following chemotherapy, were calculated for each dog. Inter- and intra-rater reliability were assessed with the intraclass correlation coefficient (ICC). ICC for inter-rater sum TV and sum LD prior to chemotherapy were 0.86 and 0.80, respectively. ICC for inter-rater sum TV and sum LD after chemotherapy were 0.95 and 0.91, respectively. ICC for percent change in sum TV and sum LD were 0.96 and 0.94, respectively. ICC for intra-rater reliability ranged from 0.90 to 0.98 for each rater. Inter- and intra-rater reliability in measurements among the three raters was good to excellent. © 2014 John Wiley & Sons Ltd.

The Smile Esthetic Index (SEI): A method to measure the esthetics of the smile. An intra-rater and inter-rater agreement study.

PubMed

Rotundo, Roberto; Nieri, Michele; Bonaccini, Daniele; Mori, Massimiliano; Lamberti, Elena; Massironi, Domenico; Giachetti, Luca; Franchi, Lorenzo; Venezia, Piero; Cavalcanti, Raffaele; Bondi, Elena; Farneti, Mauro; Pinchi, Vilma; Buti, Jacopo

2015-01-01

To propose a method to measure the esthetics of the smile and to report its validation by means of an intra-rater and inter-rater agreement analysis. Ten variables were chosen as determinants for the esthetics of a smile: smile line and facial midline, tooth alignment, tooth deformity, tooth dischromy, gingival dischromy, gingival recession, gingival excess, gingival scars and diastema/missing papillae. One examiner consecutively selected seventy smile pictures, which were in the frontal view. Ten examiners, with different levels of clinical experience and specialties, applied the proposed assessment method twice on the selected pictures, independently and blindly. Intraclass correlation coefficient (ICC) and Fleiss' kappa statistics were used to analyse the intra-rater and inter-rater agreement. Considering the cumulative assessment of the Smile Esthetic Index (SEI), the ICC value for the inter-rater agreement of the 10 examiners was 0.62 (95% CI: 0.51 to 0.72), representing a substantial agreement. Intra-rater agreement ranged from 0.86 to 0.99. Inter-rater agreement (Fleiss' kappa statistics) calculated for each variable ranged from 0.17 to 0.75. The SEI was a reproducible method to assess the esthetic component of the smile, useful for the diagnostic phase and for setting appropriate treatment plans.
The Effects of Rater Training on Inter-Rater Agreement

ERIC Educational Resources Information Center

Pufpaff, Lisa A.; Clarke, Laura; Jones, Ruth E.

2015-01-01

This paper addresses the effects of rater training on the rubric-based scoring of three preservice teacher candidate performance assessments. This project sought to evaluate the consistency of ratings assigned to student learning outcome measures being used for program accreditation and to explore the need for rater training in order to increase…
Effects of measurement method and transcript availability on inexperienced raters' stuttering frequency scores.

PubMed

Chakraborty, Nalanda; Logan, Kenneth J

To examine the effects of measurement method and transcript availability on the accuracy, reliability, and efficiency of inexperienced raters' stuttering frequency measurements. 44 adults, all inexperienced at evaluating stuttered speech, underwent 20 min of preliminary training in stuttering measurement and then analyzed a series of sentences, with and without access to transcripts of sentence stimuli, using either a syllable-based analysis (SBA) or an utterance-based analysis (UBA). Participants' analyses were compared between groups and to a composite analysis from two experienced evaluators. Stuttering frequency scores from the SBA and UBA groups differed significantly from the experienced evaluators' scores; however, UBA scores were significantly closer to the experienced evaluators' scores and were completed significantly faster than the SBA scores. Transcript availability facilitated scoring accuracy and efficiency in both groups. The internal reliability of stuttering frequency scores was acceptable for the SBA and UBA groups; however, the SBA group demonstrated only modest point-by-point agreement with ratings from the experienced evaluators. Given its accuracy and efficiency advantages over syllable-based analysis, utterance-based fluency analysis appears to be an appropriate context for introducing stuttering frequency measurement to raters who have limited experience in stuttering measurement. To address accuracy gaps between experienced and inexperienced raters, however, use of either analysis must be supplemented with training activities that expose inexperienced raters to the decision-making processes used by experienced raters when identifying stuttered syllables. Copyright © 2018 Elsevier Inc. All rights reserved.
Inter-rater and intra-rater reliability of the Bahasa Melayu version of Rose Angina Questionnaire.

PubMed

Hassan, N B; Choudhury, S R; Naing, L; Conroy, R M; Rahman, A R A

2007-01-01

The objective of the study was to translate the Rose Questionnaire (RQ) into a Bahasa Melayu version, adapt it cross-culturally, and measure its inter-rater and intra-rater reliability. This cross-sectional study was conducted in the respondents' homes or workplaces in Kelantan, Malaysia. One hundred respondents aged 30 and above with different socio-demographic status were interviewed for face validity. For each of the inter-rater and intra-rater reliability assessments, a sample of 150 respondents was interviewed. Inter-rater and intra-rater reliabilities were assessed by Cohen's kappa. The overall inter-rater agreement by the five pairs of interviewers at points one and two was 0.86, and intra-rater reliability by the five interviewers on the seven-item questionnaire at points one and two was 0.88, as measured by the kappa coefficient. The translated Malay version of the RQ demonstrated almost perfect inter-rater and intra-rater reliability, and further validation of this translated questionnaire, such as sensitivity and specificity analysis, is highly recommended.
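The kappa statistic used in the record above can be illustrated with a minimal sketch. The function name `cohens_kappa` and the example ratings are hypothetical and are not taken from the study; the code only shows the standard two-rater definition, kappa = (observed agreement - chance agreement) / (1 - chance agreement).

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who classified the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty item set")
    n = len(ratings_a)
    # Observed proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings, for illustration only.
rater_1 = ["yes", "yes", "no", "no"]
rater_2 = ["yes", "no", "no", "no"]
kappa = cohens_kappa(rater_1, rater_2)
```

With these made-up ratings the observed agreement is 0.75 and the chance agreement is 0.50, so kappa is 0.5.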
Intra-rater Reliability of Arm and Hand Muscle Strength Measurements in Persons With Late Effects of Polio.

PubMed

Brogårdh, Christina; Flansbjer, Ulla-Britt; Carlsson, Håkan; Lexell, Jan

2015-10-01

Muscle weakness in the upper limb is common in persons with late effects of polio. To be able to measure muscle strength and follow changes over time, reliable measurements are needed. To evaluate the intra-rater reliability of isometric and isokinetic arm and hand muscle strength measurements in persons with late effects of polio. A test-retest design. A university hospital outpatient clinic. Twenty-eight persons (mean age 68 years, SD 11 years) with late effects of polio in their upper limbs. Isometric shoulder abduction, isokinetic concentric elbow flexion and extension, isometric elbow flexion, and isometric grip strength were measured twice, 14 days apart. Reliability was evaluated with the intra-class correlation coefficient, the mean difference between the test sessions (d¯), together with the 95% confidence intervals for d¯ , the standard error of measurement (SEM and SEM%), the smallest real difference (SRD and SRD%), and Bland-Altman graphs. A fixed dynamometer (Biodex) was used to measure arm strength and an electronic dynamometer (GRIP-it) was used to measure grip strength. Intra-rater reliability was high, with intra-class correlation coefficients between 0.87 and 0.98. The SEM%, representing the smallest change for a group of persons, ranged from 7%-24% for all strength measurements, and the SRD%, representing the smallest change for an individual person, ranged from 20%-67%. Muscle strength in the upper limbs can be reliably measured in persons with late effects of polio. However, the measurement errors indicate that the method is more suitable to detect changes in muscle strength for a group of persons than for an individual person. Copyright © 2015 American Academy of Physical Medicine and Rehabilitation. Published by Elsevier Inc. All rights reserved.
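The relationship between the standard error of measurement and the smallest real difference described above can be sketched as follows. This assumes two commonly used formulas, SEM = SD × sqrt(1 − ICC) and SRD = 1.96 × sqrt(2) × SEM; the abstract does not state which computation the authors used, and the function name and input values below are hypothetical.

```python
import math


def sem_and_srd(sd, icc, mean):
    """Return SEM, SRD and their percentage forms relative to the mean.

    Assumed formulas (not confirmed by the abstract):
      SEM = SD * sqrt(1 - ICC)
      SRD = 1.96 * sqrt(2) * SEM
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie in [0, 1]")
    if sd < 0 or mean <= 0:
        raise ValueError("sd must be non-negative and mean positive")
    sem = sd * math.sqrt(1.0 - icc)
    srd = 1.96 * math.sqrt(2.0) * sem
    return {
        "sem": sem,
        "sem_pct": 100.0 * sem / mean,
        "srd": srd,
        "srd_pct": 100.0 * srd / mean,
    }


# Hypothetical strength data: SD 10 Nm, ICC 0.91, mean 50 Nm.
result = sem_and_srd(sd=10.0, icc=0.91, mean=50.0)
```

Under these assumptions the SRD is always about 2.77 times the SEM, which is roughly consistent with the reported ranges (SEM% of 7%-24% against SRD% of 20%-67%).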
Inter and intra-rater reliability of mobile device goniometer in measuring lumbar flexion range of motion.

PubMed

Bedekar, Nilima; Suryawanshi, Mayuri; Rairikar, Savita; Sancheti, Parag; Shyam, Ashok

2014-01-01

Evaluation of range of motion (ROM) is an integral part of the assessment of the musculoskeletal system. It is required in health, fitness and pathological conditions, and it is also used as an objective outcome measure. Several methods are described to check spinal flexion range of motion, each with its advantages and disadvantages. Hence, a new device was introduced in this study using the dual inclinometer method to measure lumbar spine flexion ROM. The aim was to determine the intra- and inter-rater reliability of a mobile device goniometer in measuring lumbar flexion range of motion. An iPod mobile device with goniometer software was used. The part being measured, i.e., the back of the subject, was suitably exposed. The subject stood with feet shoulder-width apart. The spinous processes of the second sacral vertebra (S2) and T12 were located and used as the reference points, and readings were taken. Three readings were taken for each of inter-rater and intra-rater reliability. Sufficient rest was given between flexion movements. Intra-rater reliability using ICC was r=0.920 and inter-rater r=0.812 at CI 95%. Validity r=0.95. The mobile device goniometer has high intra-rater reliability. The inter-rater reliability was moderate. This device can be used to assess range of motion of spine flexion, representing uni-planar movement.
A Simulation Study of Rater Agreement Measures with 2x2 Contingency Tables

ERIC Educational Resources Information Center

Ato, Manuel; Lopez, Juan Jose; Benavente, Ana

2011-01-01

A comparison between six rater agreement measures obtained using three different approaches was achieved by means of a simulation study. Rater coefficients suggested by Bennet's [sigma] (1954), Scott's [pi] (1955), Cohen's [kappa] (1960) and Gwet's [gamma] (2008) were selected to represent the classical, descriptive approach, [alpha] agreement…
Assessing the influence of rater and subject characteristics on measures of agreement for ordinal ratings.

PubMed

Nelson, Kerrie P; Mitani, Aya A; Edwards, Don

2017-09-10

Widespread inconsistencies are commonly observed between physicians' ordinal classifications in screening test results such as mammography. These discrepancies have motivated large-scale agreement studies where many raters contribute ratings. The primary goal of these studies is to identify factors related to physicians and patients' test results, which may lead to stronger consistency between raters' classifications. While ordered categorical scales are frequently used to classify screening test results, very few statistical approaches exist to model agreement between multiple raters. Here we develop a flexible and comprehensive approach to assess the influence of rater and subject characteristics on agreement between multiple raters' ordinal classifications in large-scale agreement studies. Our approach is based upon the class of generalized linear mixed models. Novel summary model-based measures are proposed to assess agreement between all, or a subgroup of raters, such as experienced physicians. Hypothesis tests are described to formally identify factors such as physicians' level of experience that play an important role in improving consistency of ratings between raters. We demonstrate how unique characteristics of individual raters can be assessed via conditional modes generated during the modeling process. Simulation studies are presented to demonstrate the performance of the proposed methods and summary measure of agreement. The methods are applied to a large-scale mammography agreement study to investigate the effects of rater and patient characteristics on the strength of agreement between radiologists. Copyright © 2017 John Wiley & Sons, Ltd.
Writing Evaluation: Rater and Task Effects on the Reliability of Writing Scores for Children in Grades 3 and 4

ERIC Educational Resources Information Center

Kim, Young-Suk Grace; Schatschneider, Christopher; Wanzek, Jeanne; Gatlin, Brandy; Al Otaiba, Stephanie

2017-01-01

We examined how raters and tasks influence measurement error in writing evaluation and how many raters and tasks are needed to reach a desirable level of 0.90 and 0.80 reliabilities for children in Grades 3 and 4. A total of 211 children (102 boys) were administered three tasks in narrative and expository genres, respectively, and their written…
Rater methodology for stroboscopy: a systematic review.

PubMed

Bonilha, Heather Shaw; Focht, Kendrea L; Martin-Harris, Bonnie

2015-01-01

Laryngeal endoscopy with stroboscopy (LES) remains the clinical gold standard for assessing vocal fold function. LES is used to evaluate the efficacy of voice treatments in research studies and clinical practice. LES as a voice treatment outcome tool is only as good as the clinician interpreting the recordings. Research using LES as a treatment outcome measure should be evaluated based on rater methodology and reliability. The purpose of this literature review was to evaluate the rater-related methodology from studies that use stroboscopic findings as voice treatment outcome measures. Systematic literature review. Computerized journal databases were searched for relevant articles using terms: stroboscopy and treatment. Eligible articles were categorized and evaluated for the use of rater-related methodology, reporting of number of raters, types of raters, blinding, and rater reliability. Of the 738 articles reviewed, 80 articles met inclusion criteria. More than one-third of the studies included in the review did not report the number of raters who participated in the study. Eleven studies reported results of rater reliability analysis with only two studies reporting good inter- and intrarater reliability. The comparability and use of results from treatment studies that use LES are limited by a lack of rigor in rater methodology and variable, mostly poor, inter- and intrarater reliability. To improve our ability to evaluate and use the findings from voice treatment studies that use LES features as outcome measures, greater consistency of reporting rater methodology characteristics across studies and improved rater reliability is needed. Copyright © 2015 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
Rater Training to Support High-Stakes Simulation-Based Assessments

PubMed Central

Feldman, Moshe; Lazzara, Elizabeth H.; Vanderbilt, Allison A.; DiazGranados, Deborah

2013-01-01

Competency-based assessment and an emphasis on obtaining higher-level outcomes that reflect physicians’ ability to demonstrate their skills has created a need for more advanced assessment practices. Simulation-based assessments provide medical education planners with tools to better evaluate the 6 Accreditation Council for Graduate Medical Education (ACGME) and American Board of Medical Specialties (ABMS) core competencies by affording physicians opportunities to demonstrate their skills within a standardized and replicable testing environment, thus filling a gap in the current state of assessment for regulating the practice of medicine. Observational performance assessments derived from simulated clinical tasks and scenarios enable stronger inferences about the skill level a physician may possess, but also introduce the potential of rater errors into the assessment process. This article reviews the use of simulation-based assessments for certification, credentialing, initial licensure, and relicensing decisions and describes rater training strategies that may be used to reduce rater errors, increase rating accuracy, and enhance the validity of simulation-based observational performance assessments. PMID:23280532
High inter-rater reliability, agreement, and convergent validity of Constant score in patients with clavicle fractures.

PubMed

Ban, Ilija; Troelsen, Anders; Kristensen, Morten Tange

2016-10-01

The Constant score (CS) has been the primary endpoint in most studies on clavicle fractures. However, the CS was not developed to assess patients with clavicle fractures. Our aim was to examine inter-rater reliability and agreement of the CS in patients with clavicle fractures. The secondary aim was to estimate the correlation between the CS and the Disabilities of the Arm, Shoulder and Hand score and the internal consistency of the 2 scores. On the basis of sample sizing, 36 patients (31 male and 5 female patients; mean age, 41.3 years) with clavicle fractures underwent standardized CS assessment at a mean of 6.8 weeks (SD, 1.0 weeks) after injury. Reliability and agreement of the CS were determined by 2 raters. The intraclass correlation coefficient (ICC2,1), standard error of measurement, minimal detectable change, Cronbach α coefficient, and Pearson correlation coefficient were estimated. Inter-rater reliability of the total CS was excellent (intraclass correlation coefficient, 0.94; 95% confidence interval, 0.88-0.97), with no systematic difference between the 2 raters (P = .75). The standard error of measurement (measurement error at the group level) was 4.9, whereas the minimal detectable change (smallest change needed to indicate a real change for an individual) was 13.6 CS points. The internal consistency of the 10 CS items was good, with a Cronbach α of .85, and we found a strong correlation (r = -0.92) between the CS and Disabilities of the Arm, Shoulder and Hand score. The CS was found to be reliable for assessing patients with clavicle fractures, especially at the group level. With high inter-rater reliability and agreement, in addition to good internal consistency, the standardized CS used in this study can be used for comparison of results from different settings. Copyright © 2016 Journal of Shoulder and Elbow Surgery Board of Trustees. Published by Elsevier Inc. All rights reserved.
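As a consistency check on the figures in the abstract above, the reported standard error of measurement (SEM) can be compared with the reported minimal detectable change (MDC). The abstract does not state which formula the authors applied. Under the widely used relation between the MDC at 95% confidence and the SEM, which is an assumption here, the SEM of 4.9 points reproduces the MDC of 13.6 points:

```latex
\mathrm{MDC}_{95} = 1.96 \times \sqrt{2} \times \mathrm{SEM}
                  = 1.96 \times 1.414 \times 4.9
                  \approx 13.6 \ \text{CS points}
```

The factor of the square root of 2 reflects that a change score is the difference of two measurements, each carrying its own measurement error.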
Inter-rater reliability of measures to characterize the tobacco retail environment in Mexico.

PubMed

Hall, Marissa G; Kollath-Cattano, Christy; Reynales-Shigematsu, Luz Myriam; Thrasher, James F

2015-01-01

To evaluate the inter-rater reliability of a data collection instrument to assess the tobacco retail environment in Mexico, after major marketing regulations were implemented. In 2013, two data collectors independently evaluated 21 stores in two census tracts, through a data collection instrument that assessed the presence of price promotions, whether single cigarettes were sold, the number of visible advertisements, the presence of signage prohibiting the sale of cigarettes to minors, and characteristics of cigarette pack displays. We evaluated the inter-rater reliability of the collected data, through the calculation of metrics such as intraclass correlation coefficient, percent agreement, Cohen's kappa and Krippendorff's alpha. Most measures demonstrated substantial or perfect inter-rater reliability. Our results indicate the potential utility of the data collection instrument for future point-of-sale research.
The Scarbase Duo(®): Intra-rater and inter-rater reliability and validity of a compact dual scar assessment tool.

PubMed

Fell, Matthew; Meirte, Jill; Anthonissen, Mieke; Maertens, Koen; Pleat, Jonathon; Moortgat, Peter

2016-03-01

Objective scar assessment tools were designed to help identify problematic scars and direct clinical management. Their use has been restricted by their measurement of a single scar property and the bulky size of equipment. The Scarbase Duo(®) was designed to assess both trans-epidermal water loss (TEWL) and colour of a burn scar whilst being compact and easy to use. Twenty patients with a burn scar were recruited and measurements taken using the Scarbase Duo(®) by two observers. The Scarbase Duo(®) measures TEWL via an open-chamber system and undertakes colorimetry via narrow-band spectrophotometry, producing values for relative erythema and melanin pigmentation. Validity was assessed by comparing the Scarbase Duo(®) against the Dermalab(®) and the Minolta Chromameter(®) respectively for TEWL and colorimetry measurements. The intra-class correlation coefficient (ICC) was used to assess reliability with standard error of measurement (SEM) used to assess reproducibility of measurements. The Pearson correlation coefficient (r) was used to assess the convergent validity. The Scarbase Duo(®) TEWL mode had excellent reliability when used on scars for both intra- (ICC=0.95) and inter-rater (ICC=0.96) measurements with moderate SEM values. The erythema component of the colorimetry mode showed good reliability for use on scars for both intra-(ICC=0.81) and inter-rater (ICC=0.83) measurements with low SEM values. Pigmentation values showed excellent reliability on scar tissue for both intra- (ICC=0.97) and inter-rater (ICC=0.97) with moderate SEM values. The Scarbase Duo(®) TEWL function had excellent correlation with the Dermalab(®) (r=0.93) whilst the colorimetry erythema value had moderate correlation with the Minolta Chromameter (r=0.72). The Scarbase Duo(®) is a reliable and objective scar assessment tool, which is specifically designed for burn scars. However, for clinical use, standardised measurement conditions are recommended. Copyright © 2015 Elsevier
Objective measurements of excess skin in post bariatric patients--inter-rater reliability.

PubMed

Biörserud, Christina; Fagevik Olsén, Monika; Elander, Anna; Wiklund, Malin

2016-01-01

An ability to reliably assess excess skin after massive weight loss using well-described and transferrable methods is important. The aim of this trial was to evaluate inter-rater reliability of ptosis and circumference measurements in patients with excess skin after bariatric surgery. Twenty-five postbariatric patients were included in the study, and their excess skin was measured 18 months after surgery. A protocol was designed to measure excess skin in a standardised way. To evaluate the inter-rater reliability in the measuring protocol, all patients were measured twice, by a specialist nurse and a specialist physiotherapist. All circumference measurements on different body parts had an ICC > 0.9, indicating high reliability. Furthermore, all breast and abdominal ptosis measurements had high reliability. In contrast, visual evaluation of abdominal ptosis had poor reliability. Measurements of ptoses on different body parts had an ICC > 0.6. There were no systematic differences between the results of the two testers, except for measurements of the buttocks and maximal knee circumference. The measuring protocol presented in this study has high reliability and, therefore, represents a useful instrument to provide a consistent and objective assessment of excess skin in the postbariatric patient.
Test-retest intra-rater reliability of grip force in patients with stroke.

PubMed

Hammer, Ann; Lindmark, Birgitta

2003-07-01

Coefficients of repeatability and reproducibility can be guides in differentiating between real changes and measurement error. The aim was to evaluate the test-retest intra-rater reliability of a clinical procedure measuring grip force with the Grippit in stroke patients, and to assess the relationship between the grip force of the two hands and between sustained and peak grip force. Eighteen patients were tested using the Grippit on two occasions one hour apart. Each occasion comprised three consecutive trials per hand. The paretic hand needs to show a 50 N change within and between occasions to exceed the measurement error in 95% of the observations, irrespective of calculation method. Expressed as CV(within), the measurement error was 10%. There was no learning or fatigue effect during measuring. There was wide variation between subjects, but the mean ratio between sides was 0.66. The mean ratio between sustained and peak grip force was 0.80-0.84. The measurement errors were acceptable and the instrument can be recommended for use in stroke patients at a department of rehabilitation medicine.
Measuring the Impact of Rater Negotiation in Writing Performance Assessment

ERIC Educational Resources Information Center

Trace, Jonathan; Janssen, Gerriet; Meier, Valerie

2017-01-01

Previous research in second language writing has shown that when scoring performance assessments even trained raters can exhibit significant differences in severity. When raters disagree, using discussion to try to reach a consensus is one popular form of score resolution, particularly in contexts with limited resources, as it does not require…
Writing Evaluation: Rater and Task Effects on the Reliability of Writing Scores for Children in Grades 3 and 4

PubMed Central

Kim, Grace Young-Suk; Schatschneider, Christopher; Wanzek, Jeanne; Gatlin, Brandy; Al Otaiba, Stephanie

2017-01-01

We examined how raters and tasks influence measurement error in writing evaluation and how many raters and tasks are needed to reach a desirable level of .90 and .80 reliabilities for children in Grades 3 and 4. A total of 211 children (102 boys) were administered three tasks in narrative and expository genres, respectively, and their written compositions were evaluated in widely used evaluation methods for developing writers: holistic scoring, productivity, and curriculum-based writing scores. Results showed that 54% and 52% of variance in narrative and expository compositions were attributable to true individual differences in writing. Students’ scores varied largely by tasks (30.44% and 28.61% of variance), but not by raters. To reach the reliability of .90, multiple tasks and raters were needed, and for the reliability of .80, a single rater and multiple tasks were needed. These findings offer important implications about reliably evaluating children’s writing skills, given that writing is typically evaluated by a single task and a single rater in classrooms and even in state accountability systems. PMID:29075050
Effect of rater training on reliability and accuracy of mini-CEX scores: a randomized, controlled trial.

PubMed

Cook, David A; Dupras, Denise M; Beckman, Thomas J; Thomas, Kris G; Pankratz, V Shane

2009-01-01

Mini-CEX scores assess resident competence. Rater training might improve mini-CEX score interrater reliability, but evidence is lacking. Evaluate a rater training workshop using interrater reliability and accuracy. Randomized trial (immediate versus delayed workshop) and single-group pre/post study (randomized groups combined). Academic medical center. Fifty-two internal medicine clinic preceptors (31 randomized and 21 additional workshop attendees). The workshop included rater error training, performance dimension training, behavioral observation training, and frame of reference training using lecture, video, and facilitated discussion. Delayed group received no intervention until after posttest. Mini-CEX ratings at baseline (just before workshop for workshop group), and four weeks later using videotaped resident-patient encounters; mini-CEX ratings of live resident-patient encounters one year preceding and one year following the workshop; rater confidence using mini-CEX. Among 31 randomized participants, interrater reliabilities in the delayed group (baseline intraclass correlation coefficient [ICC] 0.43, follow-up 0.53) and workshop group (baseline 0.40, follow-up 0.43) were not significantly different (p = 0.19). Mean ratings were similar at baseline (delayed 4.9 [95% confidence interval 4.6-5.2], workshop 4.8 [4.5-5.1]) and follow-up (delayed 5.4 [5.0-5.7], workshop 5.3 [5.0-5.6]; p = 0.88 for interaction). For the entire cohort, rater confidence (1 = not confident, 6 = very confident) improved from mean (SD) 3.8 (1.4) to 4.4 (1.0), p = 0.018. Interrater reliability for ratings of live encounters (entire cohort) was higher after the workshop (ICC 0.34) than before (ICC 0.18) but the standard error of measurement was similar for both periods. Rater training did not improve interrater reliability or accuracy of mini-CEX scores. clinicaltrials.gov identifier NCT00667940
Reliability of the standard goniometry and diagrammatic recording of finger joint angles: a comparative study with healthy subjects and non-professional raters.

PubMed

Macionis, Valdas

2013-01-09

Diagrammatic recording of finger joint angles by using two criss-crossed paper strips can be a quick substitute to the standard goniometry. As a preliminary step toward clinical validation of the diagrammatic technique, the current study employed healthy subjects and non-professional raters to explore whether reliability estimates of the diagrammatic goniometry are comparable with those of the standard procedure. The study included two procedurally different parts, which were replicated by assigning 24 medical students to act interchangeably as 12 subjects and 12 raters. A larger component of the study was designed to compare goniometers side-by-side in measurement of finger joint angles varying from subject to subject. In the rest of the study, the instruments were compared by parallel evaluations of joint angles similar for all subjects in a situation of simulated change of joint range of motion over time. The subjects used special guides to position the joints of their left ring finger at varying angles of flexion and extension. The obtained diagrams of joint angles were converted to numerical values by computerized measurements. The statistical approaches included calculation of appropriate intraclass correlation coefficients, standard errors of measurements, proportions of measurement differences of 5 or less degrees, and significant differences between paired observations. Reliability estimates were similar for both goniometers. Intra-rater and inter-rater intraclass correlation coefficients ranged from 0.69 to 0.93. The corresponding standard errors of measurements ranged from 2.4 to 4.9 degrees. Repeated measurements of a considerable number of raters fell within clinically non-meaningful 5 degrees of each other in proportions comparable with a criterion value of 0.95. Data collected with both instruments could be similarly interpreted in a simulated situation of change of joint range of motion over time. The paper goniometer and the standard goniometer can

Reliability of the standard goniometry and diagrammatic recording of finger joint angles: a comparative study with healthy subjects and non-professional raters

PubMed Central

2013-01-01

Background Diagrammatic recording of finger joint angles by using two criss-crossed paper strips can be a quick substitute to the standard goniometry. As a preliminary step toward clinical validation of the diagrammatic technique, the current study employed healthy subjects and non-professional raters to explore whether reliability estimates of the diagrammatic goniometry are comparable with those of the standard procedure. Methods The study included two procedurally different parts, which were replicated by assigning 24 medical students to act interchangeably as 12 subjects and 12 raters. A larger component of the study was designed to compare goniometers side-by-side in measurement of finger joint angles varying from subject to subject. In the rest of the study, the instruments were compared by parallel evaluations of joint angles similar for all subjects in a situation of simulated change of joint range of motion over time. The subjects used special guides to position the joints of their left ring finger at varying angles of flexion and extension. The obtained diagrams of joint angles were converted to numerical values by computerized measurements. The statistical approaches included calculation of appropriate intraclass correlation coefficients, standard errors of measurements, proportions of measurement differences of 5 or less degrees, and significant differences between paired observations. Results Reliability estimates were similar for both goniometers. Intra-rater and inter-rater intraclass correlation coefficients ranged from 0.69 to 0.93. The corresponding standard errors of measurements ranged from 2.4 to 4.9 degrees. Repeated measurements of a considerable number of raters fell within clinically non-meaningful 5 degrees of each other in proportions comparable with a criterion value of 0.95. Data collected with both instruments could be similarly interpreted in a simulated situation of change of joint range of motion over time. Conclusions The paper
Inter-Rater Agreement of Pressure Ulcer Risk and Prevention Measures in the National Database of Nursing Quality Indicators(®) (NDNQI).

PubMed

Waugh, Shirley Moore; Bergquist-Beringer, Sandra

2016-06-01

In this descriptive multi-site study, we examined inter-rater agreement on 11 National Database of Nursing Quality Indicators(®) (NDNQI(®)) pressure ulcer (PrU) risk and prevention measures. One hundred twenty raters at 36 hospitals captured data from 1,637 patient records. At each hospital, agreement between the most experienced rater and each other team rater was calculated for each measure. In the ratings studied, 528 patients were rated as "at risk" for PrU and, therefore, were included in calculations of agreement for the prevention measures. Prevalence-adjusted kappa (PAK) was used to interpret inter-rater agreement because prevalence of single responses was high. The PAK values for eight measures indicated "substantial" to "near perfect" agreement between most experienced and other team raters: Skin assessment on admission (.977, 95% CI [.966-.989]), PrU risk assessment on admission (.978, 95% CI [.964-.993]), Time since last risk assessment (.790, 95% CI [.729-.852]), Risk assessment method (.997, 95% CI [.991-1.0]), Risk status (.877, 95% CI [.838-.917]), Any prevention (.856, 95% CI [.76-.943]), Skin assessment (.956, 95% CI [.904-1.0]), and Pressure-redistribution surface use (.839, 95% CI [.763-.916]). For three intervention measures, PAK values fell below the recommended value of ≥.610: Routine repositioning (.577, 95% CI [.494-.661]), Nutritional support (.500, 95% CI [.418-.581]), and Moisture management (.556, 95% CI [.469-.643]). Areas of disagreement were identified. Findings provide support for the reliability of 8 of the 11 measures. Further clarification of data collection procedures is needed to improve reliability for the less reliable measures. © 2016 Wiley Periodicals, Inc.
Exploring Rating Quality in Rater-Mediated Assessments Using Mokken Scale Analysis

PubMed Central

Wind, Stefanie A.; Engelhard, George

2015-01-01

Mokken scale analysis is a probabilistic nonparametric approach that offers statistical and graphical tools for evaluating the quality of social science measurement without placing potentially inappropriate restrictions on the structure of a data set. In particular, Mokken scaling provides a useful method for evaluating important measurement properties, such as invariance, in contexts where response processes are not well understood. Because rater-mediated assessments involve complex interactions among many variables, including assessment contexts, student artifacts, rubrics, individual rater characteristics, and others, rater-assigned scores are suitable candidates for Mokken scale analysis. The purposes of this study are to describe a suite of indices that can be used to explore the psychometric quality of data from rater-mediated assessments and to illustrate the substantive interpretation of Mokken-based statistics and displays in this context. Techniques that are commonly used in polytomous applications of Mokken scaling are adapted for use with rater-mediated assessments, with a focus on the substantive interpretation related to individual raters. Overall, the findings suggest that indices of rater monotonicity, rater scalability, and invariant rater ordering based on Mokken scaling provide diagnostic information at the level of individual raters related to the requirements for invariant measurement. These Mokken-based indices serve as an additional suite of diagnostic tools for exploring the quality of data from rater-mediated assessments that can supplement rating quality indices based on parametric models. PMID:29795883
Inter-rater and test–retest reliability of quality assessments by novice student raters using the Jadad and Newcastle–Ottawa Scales

PubMed Central

Oremus, Carolina; Hall, Geoffrey B C; McKinnon, Margaret C

2012-01-01

Introduction Quality assessment of included studies is an important component of systematic reviews. Objective The authors investigated inter-rater and test–retest reliability for quality assessments conducted by inexperienced student raters. Design Student raters received a training session on quality assessment using the Jadad Scale for randomised controlled trials and the Newcastle–Ottawa Scale (NOS) for observational studies. Raters were randomly assigned into five pairs and they each independently rated the quality of 13–20 articles. These articles were drawn from a pool of 78 papers examining cognitive impairment following electroconvulsive therapy to treat major depressive disorder. The articles were randomly distributed to the raters. Two months later, each rater re-assessed the quality of half of their assigned articles. Setting McMaster Integrative Neuroscience Discovery and Study Program. Participants 10 students taking McMaster Integrative Neuroscience Discovery and Study Program courses. Main outcome measures The authors measured inter-rater reliability using κ and the intraclass correlation coefficient type 2,1 or ICC(2,1). The authors measured test–retest reliability using ICC(2,1). Results Inter-rater reliability varied by scale question. For the six-item Jadad Scale, question-specific κs ranged from 0.13 (95% CI −0.11 to 0.37) to 0.56 (95% CI 0.29 to 0.83). The ranges were −0.14 (95% CI −0.28 to 0.00) to 0.39 (95% CI −0.02 to 0.81) for the NOS cohort and −0.20 (95% CI −0.49 to 0.09) to 1.00 (95% CI 1.00 to 1.00) for the NOS case–control. For overall scores on the six-item Jadad Scale, ICC(2,1)s for inter-rater and test–retest reliability (accounting for systematic differences between raters) were 0.32 (95% CI 0.08 to 0.52) and 0.55 (95% CI 0.41 to 0.67), respectively. Corresponding ICC(2,1)s for the NOS cohort were −0.19 (95% CI −0.67 to 0.35) and 0.62 (95% CI 0.25 to 0.83), and for the NOS case–control, the ICC(2
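The ICC(2,1) named in the record above is the two-way random-effects, absolute-agreement, single-rater intraclass correlation in the Shrout and Fleiss notation. A minimal sketch of that textbook computation follows; it is not the authors' analysis code, and the function name `icc_2_1` is hypothetical. The example matrix is the widely reproduced 6-target, 4-judge illustration from Shrout and Fleiss, not data from this study.

```python
def icc_2_1(scores):
    """ICC(2,1) from an n-subjects by k-raters table with no missing cells."""
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >= 2 subjects and raters")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-subjects mean square
    msc = ss_cols / (k - 1)              # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Illustrative ratings: rows are targets, columns are judges.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(ratings)
```

Because the rater mean square enters the denominator, systematic differences between raters lower ICC(2,1), which matches the abstract's note that the coefficient accounts for such differences.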
Harmonization Process and Reliability Assessment of Anthropometric Measurements in the Elderly EXERNET Multi-Centre Study

PubMed Central

Gómez-Cabello, Alba; Vicente-Rodríguez, Germán; Albers, Ulrike; Mata, Esmeralda; Rodriguez-Marroyo, Jose A.; Olivares, Pedro R.; Gusi, Narcis; Villa, Gerardo; Aznar, Susana; Gonzalez-Gross, Marcela; Casajús, Jose A.; Ara, Ignacio

2012-01-01

Background The elderly EXERNET multi-centre study aims to collect normative anthropometric data for older, functionally independent adults living in Spain. Purpose To describe the standardization process and reliability of the anthropometric measurements carried out in the pilot study and during the final workshop, examining both intra- and inter-rater errors for measurements. Materials and Methods A total of 98 elderly participants from five different regions took part in the intra-rater error assessment, and 10 different seniors living in the city of Toledo (Spain) participated in the inter-rater assessment. We examined both intra- and inter-rater errors for heights and circumferences. Results For height, intra-rater technical errors of measurement (TEMs) were smaller than 0.25 cm. For circumferences and knee height, TEMs were smaller than 1 cm, except for waist circumference in the city of Cáceres. Reliability for heights and circumferences was greater than 98% in all cases. Inter-rater TEMs were 0.61 cm for height, 0.75 cm for knee-height and ranged between 2.70 and 3.09 cm for the circumferences measured. Inter-rater reliabilities for anthropometric measurements were always higher than 90%. Conclusion The harmonization process, including the workshop and pilot study, guarantees the quality of the anthropometric measurements in the elderly EXERNET multi-centre study. High reliability and low TEM may be expected when assessing anthropometry in an elderly population. PMID:22860013
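The abstract above reports technical errors of measurement (TEMs) and reliability percentages without stating the formulas. As a rough illustration only, the sketch below assumes the common two-measurement form, TEM = sqrt(sum(d^2) / 2n), and the reliability coefficient R = 1 - TEM^2 / SD^2, where SD^2 is taken here as the sample variance of all pooled measurements. The function names and the height values in the usage note are hypothetical and are not taken from the study.

```python
import math
from statistics import variance


def technical_error_of_measurement(first, second):
    """TEM for paired repeat measurements (assumed two-measurement form)."""
    if len(first) != len(second) or not first:
        raise ValueError("need two equally long, non-empty series")
    squared_diffs = sum((a - b) ** 2 for a, b in zip(first, second))
    return math.sqrt(squared_diffs / (2 * len(first)))


def reliability_coefficient(first, second):
    """R = 1 - TEM^2 / SD^2, with SD^2 the sample variance of the pooled data.

    Pooling both series for SD^2 is an assumption of this sketch.
    """
    tem = technical_error_of_measurement(first, second)
    return 1 - tem ** 2 / variance(list(first) + list(second))
```

For example, with made-up heights `[160.0, 170.0, 180.0]` and repeats `[160.2, 169.8, 180.2]`, the TEM is about 0.14 cm and R is above 0.99.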
On the Performance of the Marginal Homogeneity Test to Detect Rater Drift.

PubMed

Sgammato, Adrienne; Donoghue, John R

2018-06-01

When constructed response items are administered repeatedly, "trend scoring" can be used to test for rater drift. In trend scoring, raters rescore responses from the previous administration. Two simulation studies evaluated the utility of Stuart's Q measure of marginal homogeneity as a way of evaluating rater drift when monitoring trend scoring. In the first study, data were generated based on trend scoring tables obtained from an operational assessment. The second study tightly controlled table margins to disentangle certain features present in the empirical data. In addition to Q, the paired t test was included as a comparison, because of its widespread use in monitoring trend scoring. Sample size, number of score categories, interrater agreement, and symmetry/asymmetry of the margins were manipulated. For identical margins, both statistics had good Type I error control. For a unidirectional shift in margins, both statistics had good power. As expected, when shifts in the margins were balanced across categories, the t test had little power. Q demonstrated good power for all conditions and identified almost all items identified by the t test. Q shows substantial promise for monitoring of trend scoring.
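To make the marginal homogeneity statistic in the abstract above more concrete, here is a minimal pure-Python sketch of the usual Stuart-Maxwell form of Q for a square table of original scores (rows) by rescores (columns). It is not the authors' implementation: the function names are hypothetical, and the formula assumed is Q = d' S^-1 d, where d holds the row-minus-column marginal differences for the first k-1 categories and S is their estimated covariance matrix. Under marginal homogeneity Q is conventionally compared with a chi-square distribution on k-1 degrees of freedom.

```python
def _solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(rhs[i])] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def stuart_q(table):
    """Return (Q, degrees of freedom) for a square k x k agreement table."""
    k = len(table)
    if k < 2 or any(len(row) != k for row in table):
        raise ValueError("table must be square with at least 2 categories")
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    m = k - 1  # drop the last category; the differences sum to zero
    d = [row_tot[i] - col_tot[i] for i in range(m)]
    s = [
        [
            row_tot[i] + col_tot[i] - 2 * table[i][i] if i == j
            else -(table[i][j] + table[j][i])
            for j in range(m)
        ]
        for i in range(m)
    ]
    x = _solve(s, d)
    return sum(di * xi for di, xi in zip(d, x)), m
```

For a 2 x 2 table this reduces to McNemar's statistic, (b - c)^2 / (b + c), which is a convenient sanity check.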
Introducing a new definition of a near fall: intra-rater and inter-rater reliability.

PubMed

Maidan, I; Freedman, T; Tzemah, R; Giladi, N; Mirelman, A; Hausdorff, J M

2014-01-01

Near falls (NFs) are more frequent than falls, and may occur before falls, potentially predicting fall risk. As such, identification of a NF is important. We aimed to assess intra and inter-rater reliability of the traditional definition of a NF and to demonstrate the potential utility of a new definition. To this end, 10 older adults, 10 idiopathic elderly fallers, and 10 patients with Parkinson's disease (PD) walked in an obstacle course while wearing a safety harness. All walks were videotaped. Forty-nine video segments were extracted to create 2 clips each of 8.48 min. Four raters scored each event using the traditional definition and, two weeks later, using the new definition. A fifth rater used only the new definition. Intra-rater reliability was determined using Kappa (K) statistics and inter-rater reliability was determined using ICC. Using the traditional definition, three raters had poor intra-rater reliability (K<0.054, p>0.137) and one rater had moderate intra-rater reliability (K=0.624, p<0.001). With the traditional definition, inter-rater reliability between the four raters was moderate (ICC=0.667, p<0.001). In contrast, the new NF definition showed high intra-rater (K>0.601, p<0.001) and excellent inter-rater reliability (ICC=0.815, p<0.001). A priori, it is easy to distinguish falls from usual walking and NFs, but it is more challenging to distinguish NFs from obstacle negotiation and usual walking. Therefore, a more precise definition of NF is required. The results of the present study suggest that the proposed new definition increases intra and inter-rater reliability, a critical step for using NFs to quantify fall risk. Copyright © 2013 Elsevier B.V. All rights reserved.
A rater training protocol to assess team performance.

PubMed

Eppich, Walter; Nannicelli, Anna P; Seivert, Nicholas P; Sohn, Min-Woong; Rozenfeld, Ranna; Woods, Donna M; Holl, Jane L

2015-01-01

Simulation-based methodologies are increasingly used to assess teamwork and communication skills and provide team training. Formative feedback regarding team performance is an essential component. While effective use of simulation for assessment or training requires accurate rating of team performance, examples of rater-training programs in health care are scarce. We describe our rater training program and report interrater reliability during phases of training and independent rating. We selected an assessment tool shown to yield valid and reliable results and developed a rater training protocol with an accompanying rater training handbook. The rater training program was modeled after previously described high-stakes assessments in the setting of 3 facilitated training sessions. Adjacent agreement was used to measure interrater reliability between raters. Nine raters with a background in health care and/or patient safety evaluated team performance of 42 in-situ simulations using post-hoc video review. Adjacent agreement increased from the second training session (83.6%) to the third training session (85.6%) when evaluating the same video segments. Adjacent agreement for the rating of overall team performance was 78.3%, which was added for the third training session. Adjacent agreement was 97% 4 weeks posttraining and 90.6% at the end of independent rating of all simulation videos. Rater training is an important element in team performance assessment, and providing examples of rater training programs is essential. Articulating key rating anchors promotes adequate interrater reliability. In addition, using adjacent agreement as a measure allows differentiation between high- and low-performing teams on video review. © 2015 The Alliance for Continuing Education in the Health Professions, the Society for Academic Continuing Medical Education, and the Council on Continuing Medical Education, Association for Hospital Medical Education.
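The abstract above uses adjacent agreement without defining it. A common reading, assumed here, is the proportion of paired ratings that differ by no more than one scale point. The sketch below follows that reading; the function name, the `tolerance` parameter, and the example ratings are hypothetical.

```python
def adjacent_agreement(rater_a, rater_b, tolerance=1):
    """Share of paired ratings that differ by at most `tolerance` scale points."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty rating lists")
    hits = sum(abs(a - b) <= tolerance for a, b in zip(rater_a, rater_b))
    return hits / len(rater_a)
```

With made-up ratings `[1, 2, 3, 4, 5]` and `[1, 3, 5, 4, 2]`, three of the five pairs fall within one point, giving 0.6. Setting `tolerance=0` yields exact agreement instead.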
Measuring Error Identification and Recovery Skills in Surgical Residents.

PubMed

Sternbach, Joel M; Wang, Kevin; El Khoury, Rym; Teitelbaum, Ezra N; Meyerson, Shari L

2017-02-01

Although error identification and recovery skills are essential for the safe practice of surgery, they have not traditionally been taught or evaluated in residency training. This study validates a method for assessing error identification and recovery skills in surgical residents using a thoracoscopic lobectomy simulator. We developed a 5-station, simulator-based examination containing the most commonly encountered cognitive and technical errors occurring during division of the superior pulmonary vein for left upper lobectomy. Successful completion of each station requires identification and correction of these errors. Examinations were video recorded and scored in a blinded fashion using an examination-specific rating instrument evaluating task performance as well as error identification and recovery skills. Evidence of validity was collected in the categories of content, response process, internal structure, and relationship to other variables. Fifteen general surgical residents (9 interns and 6 third-year residents) completed the examination. Interrater reliability was high, with an intraclass correlation coefficient of 0.78 between 4 trained raters. Station scores ranged from 64% to 84% correct. All stations adequately discriminated between high- and low-performing residents, with discrimination ranging from 0.35 to 0.65. The overall examination score was significantly higher for intermediate residents than for interns (mean, 74 versus 64 of 90 possible; p = 0.03). The described simulator-based examination with embedded errors and its accompanying assessment tool can be used to measure error identification and recovery skills in surgical residents. This examination provides a valid method for comparing teaching strategies designed to improve error recognition and recovery to enhance patient safety. Copyright © 2017 The Society of Thoracic Surgeons. Published by Elsevier Inc. All rights reserved.
Inter-rater and test-retest reliability of quality assessments by novice student raters using the Jadad and Newcastle-Ottawa Scales.

PubMed

Oremus, Mark; Oremus, Carolina; Hall, Geoffrey B C; McKinnon, Margaret C

2012-01-01

Quality assessment of included studies is an important component of systematic reviews. The authors investigated inter-rater and test-retest reliability for quality assessments conducted by inexperienced student raters. Student raters received a training session on quality assessment using the Jadad Scale for randomised controlled trials and the Newcastle-Ottawa Scale (NOS) for observational studies. Raters were randomly assigned into five pairs and they each independently rated the quality of 13-20 articles. These articles were drawn from a pool of 78 papers examining cognitive impairment following electroconvulsive therapy to treat major depressive disorder. The articles were randomly distributed to the raters. Two months later, each rater re-assessed the quality of half of their assigned articles. The setting was the McMaster Integrative Neuroscience Discovery and Study Program, and the participants were 10 students taking courses in that program. The authors measured inter-rater reliability using κ and the intraclass correlation coefficient type 2,1 or ICC(2,1). The authors measured test-retest reliability using ICC(2,1). Inter-rater reliability varied by scale question. For the six-item Jadad Scale, question-specific κs ranged from 0.13 (95% CI -0.11 to 0.37) to 0.56 (95% CI 0.29 to 0.83). The ranges were -0.14 (95% CI -0.28 to 0.00) to 0.39 (95% CI -0.02 to 0.81) for the NOS cohort and -0.20 (95% CI -0.49 to 0.09) to 1.00 (95% CI 1.00 to 1.00) for the NOS case-control. For overall scores on the six-item Jadad Scale, ICC(2,1)s for inter-rater and test-retest reliability (accounting for systematic differences between raters) were 0.32 (95% CI 0.08 to 0.52) and 0.55 (95% CI 0.41 to 0.67), respectively. Corresponding ICC(2,1)s for the NOS cohort were -0.19 (95% CI -0.67 to 0.35) and 0.62 (95% CI 0.25 to 0.83), and for the NOS case-control, the ICC(2,1)s were 0.46 (95% CI -0.13 to 0.92) and 0.83 (95% CI 0.48 to 0.95). Inter-rater reliability was generally poor
Inter-Rater Reliability and Intra-Rater Reliability of Assessing the 2-Minute Push-Up Test.

PubMed

Fielitz, Lynn; Coelho, Jeffrey; Horne, Thomas; Brechue, William

2016-02-01

The purpose of this study was to assess inter-rater reliability and intra-rater reliability of the 2-minute, 90° push-up test as utilized in the Army Physical Fitness Test. Analysis of rater assessment reliability included both total score agreement and agreement across individual push-up repetitions. This study utilized 8 Raters who assessed 15 different videotaped push-up performances over 4 iterations separated by a minimum of 1 week. The 15 push-up participants were videotaped during the semiannual Army Physical Fitness Test. Each Rater viewed the 15 push-up performances in random order and verbally responded with a "yes" or "no" to each push-up repetition. The data generated were analyzed using the Pearson product-moment correlation as well as the kappa, modified kappa and the intra-class correlation coefficient (3,1). An attribute agreement analysis was conducted to determine the percent of inter-rater and intra-rater agreement across individual push-ups. The results indicated that Raters varied a great deal in assessing push-ups. Over the 4 trials of 15 participants, the overall scores of the Raters varied between 3.0 and 35.7 push-ups. Post hoc comparisons found that there was a significant increase in the grand mean of push-ups from trials 1-3 to trial 4 (p < 0.05). Also, there was a significant difference among raters over the 4 trials (p < 0.05). Pearson correlation coefficients for inter-rater reliability were between 0.10 and 0.97. Intra-rater coefficients were between 0.48 and 0.99. Intra-rater agreement for individual push-up repetitions ranged from 41.8% to 84.8%. The results indicated that the raters failed to assess the same push-up repetition with the same score (below 70% agreement) as well as failed to agree when viewed between raters (29%). Interestingly, as previously mentioned, scores on trial 4 increased significantly which might have been caused by rater drift or that the Raters did not maintain
Body Shape Preferences: Associations with Rater Body Shape and Sociosexuality

PubMed Central

Price, Michael E.; Pound, Nicholas; Dunn, James; Hopkins, Sian; Kang, Jinsheng

2013-01-01

There is accumulating evidence of condition-dependent mate choice in many species, that is, individual preferences varying in strength according to the condition of the chooser. In humans, for example, people with more attractive faces/bodies, and who are higher in sociosexuality, exhibit stronger preferences for attractive traits in opposite-sex faces/bodies. However, previous studies have tended to use only relatively simple, isolated measures of rater attractiveness. Here we use 3D body scanning technology to examine associations between strength of rater preferences for attractive traits in opposite-sex bodies, and raters’ body shape, self-perceived attractiveness, and sociosexuality. For 118 raters and 80 stimuli models, we used a 3D scanner to extract body measurements associated with attractiveness (male waist-chest ratio [WCR], female waist-hip ratio [WHR], and volume-height index [VHI] in both sexes) and also measured rater self-perceived attractiveness and sociosexuality. As expected, WHR and VHI were important predictors of female body attractiveness, while WCR and VHI were important predictors of male body attractiveness. Results indicated that male rater sociosexuality scores were positively associated with strength of preference for attractive (low) VHI and attractive (low) WHR in female bodies. Moreover, male rater self-perceived attractiveness was positively associated with strength of preference for low VHI in female bodies. The only evidence of condition-dependent preferences in females was a positive association between attractive VHI in female raters and preferences for attractive (low) WCR in male bodies. No other significant associations were observed in either sex between aspects of rater body shape and strength of preferences for attractive opposite-sex body traits. These results suggest that among male raters, rater self-perceived attractiveness and sociosexuality are important predictors of preference strength for attractive opposite
Virtual Raters for Reproducible and Objective Assessments in Radiology

NASA Astrophysics Data System (ADS)

Kleesiek, Jens; Petersen, Jens; Döring, Markus; Maier-Hein, Klaus; Köthe, Ullrich; Wick, Wolfgang; Hamprecht, Fred A.; Bendszus, Martin; Biller, Armin

2016-04-01

Volumetric measurements in radiologic images are important for monitoring tumor growth and treatment response. To make these more reproducible and objective we introduce the concept of virtual raters (VRs). A virtual rater is obtained by combining knowledge of machine-learning algorithms trained with past annotations of multiple human raters with the instantaneous rating of one human expert. Thus, he is virtually guided by several experts. To evaluate the approach we perform experiments with multi-channel magnetic resonance imaging (MRI) data sets. Next to gross tumor volume (GTV) we also investigate subcategories like edema, contrast-enhancing and non-enhancing tumor. The first data set consists of N = 71 longitudinal follow-up scans of 15 patients suffering from glioblastoma (GB). The second data set comprises N = 30 scans of low- and high-grade gliomas. For comparison we computed Pearson Correlation, Intra-class Correlation Coefficient (ICC) and Dice score. Virtual raters always lead to an improvement w.r.t. inter- and intra-rater agreement. Comparing the 2D Response Assessment in Neuro-Oncology (RANO) measurements to the volumetric measurements of the virtual raters results in one-third of the cases in a deviating rating. Hence, we believe that our approach will have an impact on the evaluation of clinical studies as well as on routine imaging diagnostics.
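The Dice score mentioned in the abstract above measures overlap between two segmentations as 2|A ∩ B| / (|A| + |B|). The sketch below illustrates that definition on sets of voxel indices. The function name and the convention of returning 1.0 for two empty masks are assumptions of this sketch, not details from the paper.

```python
def dice_score(mask_a, mask_b):
    """Dice overlap between two collections of labelled voxel indices."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty segmentations agree fully
    return 2 * len(a & b) / (len(a) + len(b))
```

For instance, two four-voxel masks that share two voxels score 2 * 2 / (4 + 4) = 0.5.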
Rating the raters in a mixed model: An approach to deciphering the rater reliability

NASA Astrophysics Data System (ADS)

Shang, Junfeng; Wang, Yougui

2013-05-01

Rating the raters has attracted extensive attention in recent years. Ratings are quite complex in that the subjective assessment and a number of criteria are involved in a rating system. Whenever the human judgment is a part of ratings, the inconsistency of ratings is the source of variance in scores, and it is therefore quite natural for people to verify the trustworthiness of ratings. Accordingly, estimation of the rater reliability will be of great interest and an appealing issue. To facilitate the evaluation of the rater reliability in a rating system, we propose a mixed model where the scores of the ratees offered by a rater are described with the fixed effects determined by the ability of the ratees and the random effects produced by the disagreement of the raters. In such a mixed model, for the rater random effects, we derive its posterior distribution for the prediction of random effects. To quantitatively make a decision in revealing the unreliable raters, the predictive influence function (PIF) serves as a criterion which compares the posterior distributions of random effects between the full data and rater-deleted data sets. The benchmark for this criterion is also discussed. This proposed methodology of deciphering the rater reliability is investigated in the multiple simulated and two real data sets.
Inter-rater and intra-rater reliability of a movement control test in shoulder.

PubMed

Rajasekar, S; Bangera, Rakshith K; Sekaran, Padmanaban

2017-07-01

Movement faults are commonly observed in patients with musculoskeletal pain. The Kinetic Medial Rotation Test (KMRT) is a movement control test used to identify movement faults of the scapula and gleno-humeral joints during arm movement. Objective tests such as the KMRT need to be reliable and valid for the results to be applied across different clinical settings and patient populations. The primary objective of the present study was to determine the intra-rater and inter-rater reliability of KMRT in subjects with and without shoulder pain. Sixty subjects were included in this study based on specific inclusion and exclusion criteria. Two musculoskeletal physiotherapists with different levels of clinical experience performed the tests. The intra-rater reliability was tested in twenty asymptomatic subjects by a single assessor at two week intervals. An equal number of subjects with and without shoulder pain were tested by both the assessors to determine the inter-rater reliability. Both components of the KMRT, the Gleno-Humeral Anterior Translation (GHAT) and the Scapular Forward Tilt (SCFT) were tested. The Kappa values for inter-rater reliability of the GHAT and SCFT were K = 0.68 and K = 0.65 respectively in subjects with shoulder pain. In asymptomatic subjects, the inter-rater reliability of GHAT was K = 0.61 and SCFT was K = 0.85. Intra-rater reliability ranged from K = 0.66 for GHAT to K = 0.87 for SCFT. Our study found substantial agreement in inter-rater reliability of KMRT in subjects with shoulder pain, whereas substantial to near perfect agreement was found in intra-rater and inter-rater reliability of KMRT in subjects without shoulder pain. Copyright © 2017 Elsevier Ltd. All rights reserved.
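The Kappa values in the abstract above correct observed agreement for the agreement expected by chance. As an illustration, the sketch below computes unweighted Cohen's kappa for two raters, kappa = (p_o - p_e) / (1 - p_e). The abstract does not say which kappa variant was used, so the unweighted two-rater form is an assumption, and the function name and example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same subjects."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return (observed - expected) / (1 - expected)
```

With made-up fault/no-fault ratings `[1, 1, 1, 0, 0, 0, 1, 0]` and `[1, 1, 0, 0, 0, 1, 1, 0]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.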
Exploring Differences in Measurement and Reporting of Classroom Observation Inter-Rater Reliability

ERIC Educational Resources Information Center

Wilhelm, Anne Garrison; Gillespie Rouse, Amy; Jones, Francesca

2018-01-01

Although inter-rater reliability is an important aspect of using observational instruments, it has received little theoretical attention. In this article, we offer some guidance for practitioners and consumers of classroom observations so that they can make decisions about inter-rater reliability, both for study design and in the reporting of data…
Measures of Linguistic Accuracy in Second Language Writing Research.

ERIC Educational Resources Information Center

Polio, Charlene G.

1997-01-01

Investigates the reliability of measures of linguistic accuracy in second language writing. The study uses a holistic scale, error-free T-units, and an error classification system on the essays of English-as-a-Second-Language students and discusses why disagreements arise within a rater and between raters. (24 references) (Author/CK)
A Hierarchical Rater Model for Constructed Responses, with a Signal Detection Rater Model

ERIC Educational Resources Information Center

DeCarlo, Lawrence T.; Kim, YoungKoung; Johnson, Matthew S.

2011-01-01

The hierarchical rater model (HRM) recognizes the hierarchical structure of data that arises when raters score constructed response items. In this approach, raters' scores are not viewed as being direct indicators of examinee proficiency but rather as indicators of essay quality; the (latent categorical) quality of an examinee's essay in turn…
Intra- and Inter-Rater Reliability of the Rate of Force Development of Hip Abductor Muscles Measured by Hand-Held Dynamometer

ERIC Educational Resources Information Center

Takeda, Kazuya; Tanabe, Shigeo; Koyama, Soichiro; Nagai, Tomoko; Sakurai, Hiroaki; Kanada, Yoshikiyo; Shomoto, Koji

2018-01-01

The aim of this study was to clarify the intra- and inter-rater reliability of the rate of force development in hip abductor muscle force measurements using a hand-held dynamometer. Thirty healthy adults were separately assessed by two independent raters on two separate days. Rate of force development was calculated from the slope of the…
Improved assessment of multiple sclerosis lesion segmentation agreement via detection and outline error estimates

PubMed Central

2012-01-01

Background Presented is the method “Detection and Outline Error Estimates” (DOEE) for assessing rater agreement in the delineation of multiple sclerosis (MS) lesions. The DOEE method divides operator or rater assessment into two parts: 1) Detection Error (DE) -- rater agreement in detecting the same regions to mark, and 2) Outline Error (OE) -- agreement of the raters in outlining of the same lesion. Methods DE, OE and Similarity Index (SI) values were calculated for two raters tested on a set of 17 fluid-attenuated inversion-recovery (FLAIR) images of patients with MS. DE, OE, and SI values were tested for dependence with mean total area (MTA) of the raters' Region of Interests (ROIs). Results When correlated with MTA, neither DE (ρ = .056, p=.83) nor the ratio of OE to MTA (ρ = .23, p=.37), referred to as Outline Error Rate (OER), exhibited significant correlation. In contrast, SI is found to be strongly correlated with MTA (ρ = .75, p < .001). Furthermore, DE and OER values can be used to model the variation in SI with MTA. Conclusions The DE and OER indices are proposed as a better method than SI for comparing rater agreement of ROIs, which also provide specific information for raters to improve their agreement. PMID:22812697

Validity and intra-rater reliability of an Android phone application to measure cervical range-of-motion

PubMed Central

2014-01-01

Background Concurrent validity and intra-rater reliability using a customized Android phone application to measure cervical-spine range-of-motion (ROM) has not been previously validated against a gold-standard three-dimensional motion analysis (3DMA) system. Findings Twenty-one healthy individuals (age:31 ± 9.1 years, male:11) participated, with 16 re-examined for intra-rater reliability 1–7 days later. An Android phone was fixed on a helmet, which was then securely fastened on the participant’s head. Cervical-spine ROM in flexion, extension, lateral flexion and rotation were performed in sitting with concurrent measurements obtained from both a 3DMA system and the phone. The phone demonstrated moderate to excellent (ICC = 0.53-0.98, Spearman ρ = 0.52-0.98) concurrent validity for ROM measurements in cervical flexion, extension, lateral-flexion and rotation. However, cervical rotation demonstrated both proportional and fixed bias. Excellent intra-rater reliability was demonstrated for cervical flexion, extension and lateral flexion (ICC = 0.82-0.90), but poor for right- and left-rotation (ICC = 0.05-0.33) using the phone. Possible reasons for the outcome are that flexion, extension and lateral-flexion measurements are detected by gravity-dependent accelerometers while rotation measurements are detected by the magnetometer which can be adversely affected by surrounding magnetic fields. Conclusion The results of this study demonstrate that the tested Android phone application is valid and reliable to measure ROM of the cervical-spine in flexion, extension and lateral-flexion but not in rotation likely due to magnetic interference. The clinical implication of this study is that therapists should be mindful of the plane of measurement when using the Android phone to measure ROM of the cervical-spine. PMID:24742001
Validity and intra-rater reliability of an android phone application to measure cervical range-of-motion.

PubMed

Quek, June; Brauer, Sandra G; Treleaven, Julia; Pua, Yong-Hao; Mentiplay, Benjamin; Clark, Ross Allan

2014-04-17

Concurrent validity and intra-rater reliability using a customized Android phone application to measure cervical-spine range-of-motion (ROM) has not been previously validated against a gold-standard three-dimensional motion analysis (3DMA) system. Twenty-one healthy individuals (age:31 ± 9.1 years, male:11) participated, with 16 re-examined for intra-rater reliability 1-7 days later. An Android phone was fixed on a helmet, which was then securely fastened on the participant's head. Cervical-spine ROM in flexion, extension, lateral flexion and rotation were performed in sitting with concurrent measurements obtained from both a 3DMA system and the phone. The phone demonstrated moderate to excellent (ICC = 0.53-0.98, Spearman ρ = 0.52-0.98) concurrent validity for ROM measurements in cervical flexion, extension, lateral-flexion and rotation. However, cervical rotation demonstrated both proportional and fixed bias. Excellent intra-rater reliability was demonstrated for cervical flexion, extension and lateral flexion (ICC = 0.82-0.90), but poor for right- and left-rotation (ICC = 0.05-0.33) using the phone. Possible reasons for the outcome are that flexion, extension and lateral-flexion measurements are detected by gravity-dependent accelerometers while rotation measurements are detected by the magnetometer which can be adversely affected by surrounding magnetic fields. The results of this study demonstrate that the tested Android phone application is valid and reliable to measure ROM of the cervical-spine in flexion, extension and lateral-flexion but not in rotation likely due to magnetic interference. The clinical implication of this study is that therapists should be mindful of the plane of measurement when using the Android phone to measure ROM of the cervical-spine.
Intra- and Inter-rater Agreement of Superior Vena Cava Flow and Right Ventricular Outflow Measurements in Late Preterm and Term Neonates.

PubMed

Mahoney, Liam; Fernandez-Alvarez, Jose R; Rojas-Anaya, Hector; Aiton, Neil; Wertheim, David; Seddon, Paul; Rabe, Heike

2018-02-24

To explore the intra- and inter-rater agreement of superior vena cava (SVC) flow and right ventricular (RV) outflow in healthy and unwell late preterm neonates (33-37 weeks' gestational age), term neonates (≥37 weeks' gestational age), and neonates receiving total-body cooling. The intra- and inter-rater agreement (n = 25 and 41 neonates, respectively) rates for SVC flow and RV outflow were determined by echocardiography in healthy and unwell late preterm and term neonates with the use of Bland-Altman plots, the repeatability coefficient, the repeatability index, and intraclass correlation coefficients. The intra-rater repeatability index values were 41% for SVC flow and 31% for RV outflow, with intraclass correlation coefficients indicating good agreement for both measures. The inter-rater repeatability index values for SVC flow and RV outflow were 63% and 51%, respectively, with intraclass correlation coefficients indicating moderate agreement for both measures. If SVC flow or RV outflow is used in the hemodynamic treatment of neonates, sequential measurements should ideally be performed by the same clinician to reduce potential variability. © 2018 by the American Institute of Ultrasound in Medicine.
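The Bland-Altman analysis named in the abstract above summarises paired measurements by their mean difference (bias) and the 95% limits of agreement, bias ± 1.96 × SD of the differences. The sketch below shows only that calculation. It does not attempt the repeatability coefficient or repeatability index, whose exact definitions the abstract does not give. The function name and the example values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman_limits(first, second, z=1.96):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    diffs = [a - b for a, b in zip(first, second)]
    bias = mean(diffs)
    spread = z * stdev(diffs)  # sample SD of the paired differences
    return bias, bias - spread, bias + spread
```

For example, made-up readings `[10, 12, 14, 16]` and `[9, 13, 13, 17]` give a bias of 0 and limits of roughly ±2.26 in the units of measurement.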
The Assignment of Raters to Items: Controlling for Rater Effects.

ERIC Educational Resources Information Center

Sykes, Robert C.; Heidorn, Mark; Lee, Guemin

A study was conducted to evaluate the effect of different modes (modalities) of assigning raters to test items. The impact on total constructed response (c.r.) score, and subsequently on total test score, of assigning a single versus multiple raters to an examination reading of a student's set of c.r. responses was evaluated for several mixed-item…
Intra- and inter-rater reliability of digital image analysis for skin color measurement

PubMed Central

Sommers, Marilyn; Beacham, Barbara; Baker, Rachel; Fargo, Jamison

2013-01-01

Background We determined the intra- and inter-rater reliability of data from digital image color analysis between an expert and novice analyst. Methods Following training, the expert and novice independently analyzed 210 randomly ordered images. Both analysts used Adobe® Photoshop lasso or color sampler tools based on the type of image file. After color correction with Pictocolor® in camera software, they recorded L*a*b* (L*=light/dark; a*=red/green; b*=yellow/blue) color values for all skin sites. We computed intra-rater and inter-rater agreement within anatomical region, color value (L*, a*, b*), and technique (lasso, color sampler) using a series of one-way intra-class correlation coefficients (ICCs). Results Results of ICCs for intra-rater agreement showed high levels of internal consistency reliability within each rater for the lasso technique (ICC ≥ 0.99) and somewhat lower, yet acceptable, level of agreement for the color sampler technique (ICC = 0.91 for expert, ICC = 0.81 for novice). Skin L*, skin b*, and labia L* values reached the highest level of agreement (ICC ≥ 0.92) and skin a*, labia b*, and vaginal wall b* were the lowest (ICC ≥ 0.64). Conclusion Data from novice analysts can achieve high levels of agreement with data from expert analysts with training and the use of a detailed, standard protocol. PMID:23551208
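The abstract above reports one-way intra-class correlation coefficients. As an illustration, the sketch below computes the single-measure one-way random-effects form, often written ICC(1,1) = (MSB - MSW) / (MSB + (k - 1) MSW), from a subjects-by-ratings table. Whether the study used the single-measure or the average-measure form is not stated, so the single-measure choice is an assumption, and the function name and example data are hypothetical.

```python
def icc_one_way(ratings):
    """Single-measure one-way ICC for a list of per-subject rating lists."""
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 subjects, each with the same k >= 2 ratings")
    row_means = [sum(row) / k for row in ratings]
    grand_mean = sum(row_means) / n
    # Between-subject and within-subject mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_between - ms_within) / denominator
```

With made-up colour values `[[1, 3], [5, 7]]` for two sites and two readings, MSB is 16, MSW is 2, and the ICC is 14/18, about 0.78.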
Surveying for "artifacts": the susceptibility of the OCB-performance evaluation relationship to common rater, item, and measurement context effects.

PubMed

Podsakoff, Nathan P; Whiting, Steven W; Welsh, David T; Mai, Ke Michael

2013-09-01

Despite the increased attention paid to biases attributable to common method variance (CMV) over the past 50 years, researchers have only recently begun to systematically examine the effect of specific sources of CMV in previously published empirical studies. Our study contributes to this research by examining the extent to which common rater, item, and measurement context characteristics bias the relationships between organizational citizenship behaviors and performance evaluations using a mixed-effects analytic technique. Results from 173 correlations reported in 81 empirical studies (N = 31,146) indicate that even after controlling for study-level factors, common rater and anchor point number similarity substantially biased the focal correlations. Indeed, these sources of CMV (a) led to estimates that were between 60% and 96% larger when comparing measures obtained from a common rater, versus different raters; (b) led to 39% larger estimates when a common source rated the scales using the same number, versus a different number, of anchor points; and (c) when taken together with other study-level predictors, accounted for over half of the between-study variance in the focal correlations. We discuss the implications for researchers and practitioners and provide recommendations for future research. PsycINFO Database Record (c) 2013 APA, all rights reserved
Evaluating Rater Accuracy in Rater-Mediated Assessments Using an Unfolding Model

ERIC Educational Resources Information Center

Wang, Jue; Engelhard, George, Jr.; Wolfe, Edward W.

2016-01-01

The number of performance assessments continues to increase around the world, and it is important to explore new methods for evaluating the quality of ratings obtained from raters. This study describes an unfolding model for examining rater accuracy. Accuracy is defined as the difference between observed and expert ratings. Dichotomous accuracy…
Rater Effects in Clinical Performance Ratings of Surgery Residents

ERIC Educational Resources Information Center

Iramaneerat, Cherdsak; Myford, Carol M.

2006-01-01

A multi-faceted Rasch measurement (MFRM) approach was used to analyze clinical performance ratings of 24 first-year residents in one surgery residency program in Thailand to investigate three types of rater effects: leniency, rater inconsistency, and restriction of range. Faculty from 14 surgical services rated the clinical performance of…
Intra- and inter-rater reliability of digital image analysis for skin color measurement.

PubMed

Sommers, Marilyn; Beacham, Barbara; Baker, Rachel; Fargo, Jamison

2013-11-01

We determined the intra- and inter-rater reliability of data from digital image color analysis between an expert and novice analyst. Following training, the expert and novice independently analyzed 210 randomly ordered images. Both analysts used Adobe® Photoshop lasso or color sampler tools based on the type of image file. After color correction with Pictocolor® in camera software, they recorded L*a*b* (L*=light/dark; a*=red/green; b*=yellow/blue) color values for all skin sites. We computed intra-rater and inter-rater agreement within anatomical region, color value (L*, a*, b*), and technique (lasso, color sampler) using a series of one-way intra-class correlation coefficients (ICCs). Results of ICCs for intra-rater agreement showed high levels of internal consistency reliability within each rater for the lasso technique (ICC ≥ 0.99) and somewhat lower, yet acceptable, level of agreement for the color sampler technique (ICC = 0.91 for expert, ICC = 0.81 for novice). Skin L*, skin b*, and labia L* values reached the highest level of agreement (ICC ≥ 0.92) and skin a*, labia b*, and vaginal wall b* were the lowest (ICC ≥ 0.64). Data from novice analysts can achieve high levels of agreement with data from expert analysts with training and the use of a detailed, standard protocol. © 2013 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Does a Rater's Professional Background Influence Communication Skills Assessment?

PubMed

Artemiou, Elpida; Hecker, Kent G; Adams, Cindy L; Coe, Jason B

2015-01-01

There is increasing pressure in veterinary education to teach and assess communication skills, with the Objective Structured Clinical Examination (OSCE) being the most common assessment method. Previous research reveals that raters are a large source of variance in OSCEs. This study focused on examining the effect of raters' professional background as a source of variance when assessing students' communication skills. Twenty-three raters were categorized according to their professional background: clinical sciences (n=11), basic sciences (n=4), clinical communication (n=5), or hospital administrator/clinical skills technicians (n=3). Raters from each professional background were assigned to the same station and assessed the same students during two four-station OSCEs. Students were in year 2 of their pre-clinical program. Repeated-measures ANOVA results showed that OSCE scores awarded by the rater groups differed significantly: (F(matched_station_1) [2,91]=6.97, p=.002), (F(matched_station_2) [3,90]=13.95, p=.001), (F(matched_station_3) [3,90]=8.76, p=.001), and (F(matched_station_4) [2,91]=30.60, p=.001). A significant time effect between the two OSCEs was calculated for matched stations 1, 2, and 4, indicating improved student performances. Raters with a clinical communication skills background assigned scores that were significantly lower compared to the other rater groups. Analysis of written feedback provided by the clinical sciences raters showed that they were influenced by the students' clinical knowledge of the case and that they did not rely solely on the communication checklist items. This study shows that it is important to consider rater background both in recruitment and training programs for communication skills assessment.
Workplace-based assessment: raters' performance theories and constructs.

PubMed

Govaerts, M J B; Van de Wiel, M W J; Schuwirth, L W T; Van der Vleuten, C P M; Muijtjens, A M M

2013-08-01

Weaknesses in the nature of rater judgments are generally considered to compromise the utility of workplace-based assessment (WBA). In order to gain insight into the underpinnings of rater behaviours, we investigated how raters form impressions of and make judgments on trainee performance. Using theoretical frameworks of social cognition and person perception, we explored raters' implicit performance theories, use of task-specific performance schemas and the formation of person schemas during WBA. We used think-aloud procedures and verbal protocol analysis to investigate schema-based processing by experienced (N = 18) and inexperienced (N = 16) raters (supervisor-raters in general practice residency training). Qualitative data analysis was used to explore schema content and usage. We quantitatively assessed rater idiosyncrasy in the use of performance schemas and we investigated effects of rater expertise on the use of (task-specific) performance schemas. Raters used different schemas in judging trainee performance. We developed a normative performance theory comprising seventeen inter-related performance dimensions. Levels of rater idiosyncrasy were substantial and unrelated to rater expertise. Experienced raters made significantly more use of task-specific performance schemas compared to inexperienced raters, suggesting more differentiated performance schemas in experienced raters. Most raters started to develop person schemas the moment they began to observe trainee performance. The findings further our understanding of processes underpinning judgment and decision making in WBA. Raters make and justify judgments based on personal theories and performance constructs. Raters' information processing seems to be affected by differences in rater expertise. The results of this study can help to improve rater training, the design of assessment instruments and decision making in WBA.
Inter-rater reliability of select physical examination procedures in patients with neck pain.

PubMed

Hanney, William J; George, Steven Z; Kolber, Morey J; Young, Ian; Salamh, Paul A; Cleland, Joshua A

2014-07-01

This study evaluated the inter-rater reliability of select examination procedures in patients with neck pain (NP) conducted over a 24- to 48-h period. Twenty-two patients with mechanical NP participated in a standardized examination. One examiner performed standardized examination procedures and a second blinded examiner repeated the procedures 24-48 h later with no treatment administered between examinations. Inter-rater reliability was calculated with the Cohen Kappa and weighted Kappa for ordinal data while continuous level data were calculated using an intraclass correlation coefficient model 2,1 (ICC2,1). Coefficients for categorical variables ranged from poor to moderate agreement (-0.22 to 0.70 Kappa) and coefficients for continuous data ranged from slight to moderate (ICC2,1 0.28-0.74). The standard error of measurement for cervical range of motion ranged from 5.3° to 9.9° while the minimal detectable change ranged from 12.5° to 23.1°. This study is the first to report inter-rater reliability values for select components of the cervical examination in those patients with NP performed 24-48 h after the initial examination. There was considerably less reliability when compared to previous studies, thus clinicians should consider how the passage of time may influence variability in examination findings over a 24- to 48-h period.
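The relationship between the standard error of measurement (SEM) and the minimal detectable change (MDC) reported in this record can be sketched with the conventional formulas SEM = SD × √(1 − ICC) and MDC = z × √2 × SEM. The record does not state the confidence level or the exact formulas the authors used, so the default z = 1.645 (a 90% level) and the function names below are assumptions.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement from a sample SD and a reliability coefficient."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1 for this formula")
    return sd * math.sqrt(1.0 - icc)


def mdc(sem, z=1.645):
    """Minimal detectable change; z=1.645 gives MDC90, z=1.96 gives MDC95.

    The sqrt(2) factor reflects that a change score involves two
    measurements, each carrying its own measurement error.
    """
    return z * math.sqrt(2.0) * sem


# Applying the assumed 90% level to the SEM range quoted in the abstract.
for sem_degrees in (5.3, 9.9):
    print(f"SEM {sem_degrees:.1f} deg -> MDC90 about {mdc(sem_degrees):.1f} deg")
```

Under these assumptions, SEM values of 5.3° and 9.9° give MDC90 values of about 12.3° and 23.0°. This is roughly in line with the reported 12.5° to 23.1°, which suggests, but does not confirm, that a 90% level was used.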
Rater reliability and construct validity of a mobile application for posture analysis

PubMed Central

Szucs, Kimberly A.; Brown, Elena V. Donoso

2018-01-01

[Purpose] Measurement of posture is important for those with a clinical diagnosis as well as researchers aiming to understand the impact of faulty postures on the development of musculoskeletal disorders. A reliable, cost-effective and low tech posture measure may be beneficial for research and clinical applications. The purpose of this study was to determine rater reliability and construct validity of a posture screening mobile application in healthy young adults. [Subjects and Methods] Pictures of subjects were taken in three standing positions. Two raters independently digitized the static standing posture image twice. The app calculated posture variables, including sagittal and coronal plane translations and angulations. Intra- and inter-rater reliability were calculated using the appropriate ICC models for complete agreement. Construct validity was determined through comparison of known groups using repeated measures ANOVA. [Results] Intra-rater reliability ranged from 0.71 to 0.99. Inter-rater reliability was good to excellent for all translations. ICCs were stronger for translations versus angulations. The construct validity analysis found that the app was able to detect the change in the four variables selected. [Conclusion] The posture mobile application has demonstrated strong rater reliability and preliminary evidence of construct validity. This application may have utility in clinical and research settings. PMID:29410561
Rater reliability and construct validity of a mobile application for posture analysis.

PubMed

Szucs, Kimberly A; Brown, Elena V Donoso

2018-01-01

[Purpose] Measurement of posture is important for those with a clinical diagnosis as well as researchers aiming to understand the impact of faulty postures on the development of musculoskeletal disorders. A reliable, cost-effective and low tech posture measure may be beneficial for research and clinical applications. The purpose of this study was to determine rater reliability and construct validity of a posture screening mobile application in healthy young adults. [Subjects and Methods] Pictures of subjects were taken in three standing positions. Two raters independently digitized the static standing posture image twice. The app calculated posture variables, including sagittal and coronal plane translations and angulations. Intra- and inter-rater reliability were calculated using the appropriate ICC models for complete agreement. Construct validity was determined through comparison of known groups using repeated measures ANOVA. [Results] Intra-rater reliability ranged from 0.71 to 0.99. Inter-rater reliability was good to excellent for all translations. ICCs were stronger for translations versus angulations. The construct validity analysis found that the app was able to detect the change in the four variables selected. [Conclusion] The posture mobile application has demonstrated strong rater reliability and preliminary evidence of construct validity. This application may have utility in clinical and research settings.
Exploring the role of first impressions in rater-based assessments.

PubMed

Wood, Timothy J

2014-08-01

Medical education relies heavily on assessment formats that require raters to assess the competence and skills of learners. Unfortunately, there are often inconsistencies and variability in the scores raters assign. To ensure the scores from these assessment tools have validity, it is important to understand the underlying cognitive processes that raters use when judging the abilities of their learners. The goal of this paper, therefore, is to contribute to a better understanding of the cognitive processes used by raters. Representative findings from the social judgment and decision making, cognitive psychology, and educational measurement literature will be used to enlighten the underpinnings of these rater-based assessments. Of particular interest is the impact judgments referred to as first impressions (or thin slices) have on rater-based assessments. These are judgments about people made very quickly and based on very little information. A narrative review will provide a synthesis of research in these three literatures (social judgment and decision making, educational psychology, and cognitive psychology) and will focus on the underlying cognitive processes, the accuracy and the impact of first impressions on rater-based assessments. The application of these findings to the types of rater-based assessments used in medical education will then be reviewed. Gaps in understanding will be identified and suggested directions for future research studies will be discussed.
A Comparison of Assessment Methods and Raters in Product Creativity

ERIC Educational Resources Information Center

Lu, Chia-Chen; Luh, Ding-Bang

2012-01-01

Although previous studies have attempted to use different experiences of raters to rate product creativity by adopting the Consensus Assessment Method (CAT) approach, the validity of replacing CAT with another measurement tool has not been adequately tested. This study aimed to compare raters with different levels of experience (expert vs.…
Kappa and Rater Accuracy: Paradigms and Parameters

ERIC Educational Resources Information Center

Conger, Anthony J.

2017-01-01

Drawing parallels to classical test theory, this article clarifies the difference between rater accuracy and reliability and demonstrates how category marginal frequencies affect rater agreement and Cohen's kappa. Category assignment paradigms are developed: comparing raters to a standard (index) versus comparing two raters to one another…
The Stability of Rater Severity in Large-Scale Assessment Programs.

ERIC Educational Resources Information Center

Congdon, Peter J.; McQueen, Joy

2000-01-01

Studied the stability of rater severity over an extended rating period by applying multifaceted Rasch analysis to ratings of 16 raters of writing performances of 8,285 elementary school students. Findings cast doubt on the practice of using a single calibration of rater severity as the basis for adjustment of person measures. (SLD)
Rater agreement of visual lameness assessment in horses during lungeing.

PubMed

Hammarberg, M; Egenvall, A; Pfau, T; Rhodin, M

2016-01-01

Lungeing is an important part of lameness examinations as the circular path may accentuate low-grade lameness. Movement asymmetries related to the circular path, to compensatory movements and to pain make the lameness evaluation complex. Scientific studies have shown high inter-rater variation when assessing lameness during straight line movement. The aim was to estimate inter- and intra-rater agreement of equine veterinarians evaluating lameness from videos of sound and lame horses during lungeing and to investigate the influence of veterinarians' experience and the objective degree of movement asymmetry on rater agreement. Cross-sectional observational study. Video recordings and quantitative gait analysis with inertial sensors were performed in 23 riding horses of various breeds. The horses were examined at trot on a straight line and during lungeing on soft or hard surfaces in both directions. One video sequence was recorded per condition and the horses were classified as forelimb lame, hindlimb lame or sound from objective straight line symmetry measurements. Equine veterinarians (n = 86), including 43 with >5 years of orthopaedic experience, participated in a web-based survey and were asked to identify the lamest limb on 60 videos, including 10 repeats. The agreements between (inter-rater) and within (intra-rater) veterinarians were analysed with κ statistics (Fleiss, Cohen). Inter-rater agreement κ was 0.31 (0.38/0.25 for experienced/less experienced) and higher for forelimb (0.33) than for hindlimb lameness (0.11) or soundness (0.08) evaluation. Median intra-rater agreement κ was 0.57. Inter-rater agreement was poor for less experienced raters, and for all raters when evaluating hindlimb lameness. Since identification of the lame limb/limbs is a prerequisite for successful diagnosis, treatment and recovery, the high inter-rater variation when evaluating lameness on the lunge is likely to influence the accuracy and repeatability of lameness examinations.
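For readers unfamiliar with the κ statistics named in this record, the following is a minimal sketch of Cohen's κ for two sets of categorical ratings of the same items, for instance one veterinarian's first and repeated assessments of the same videos. The function name and the example labels are hypothetical. Fleiss' κ, which the authors also name, extends the idea to more than two raters and is not shown here.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(ratings_a)

    # Observed proportion of items on which the two ratings agree.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each rating set's marginal frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up labels for illustration only; not data from the study.
first_pass = ["fore", "fore", "hind", "sound", "hind", "sound"]
second_pass = ["fore", "hind", "hind", "sound", "fore", "sound"]
print(round(cohen_kappa(first_pass, second_pass), 3))
```

Because κ subtracts chance agreement, two passes that agree on half of the items can still score κ = 0 when the marginal frequencies alone predict that level of agreement.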
Rater agreement of visual lameness assessment in horses during lungeing

PubMed Central

Hammarberg, M.; Egenvall, A.; Pfau, T.

2015-01-01

Summary. Reasons for performing study: Lungeing is an important part of lameness examinations as the circular path may accentuate low‐grade lameness. Movement asymmetries related to the circular path, to compensatory movements and to pain make the lameness evaluation complex. Scientific studies have shown high inter‐rater variation when assessing lameness during straight line movement. Objectives: The aim was to estimate inter‐ and intra‐rater agreement of equine veterinarians evaluating lameness from videos of sound and lame horses during lungeing and to investigate the influence of veterinarians’ experience and the objective degree of movement asymmetry on rater agreement. Study design: Cross‐sectional observational study. Methods: Video recordings and quantitative gait analysis with inertial sensors were performed in 23 riding horses of various breeds. The horses were examined at trot on a straight line and during lungeing on soft or hard surfaces in both directions. One video sequence was recorded per condition and the horses were classified as forelimb lame, hindlimb lame or sound from objective straight line symmetry measurements. Equine veterinarians (n = 86), including 43 with >5 years of orthopaedic experience, participated in a web‐based survey and were asked to identify the lamest limb on 60 videos, including 10 repeats. The agreements between (inter‐rater) and within (intra‐rater) veterinarians were analysed with κ statistics (Fleiss, Cohen). Results: Inter‐rater agreement κ was 0.31 (0.38/0.25 for experienced/less experienced) and higher for forelimb (0.33) than for hindlimb lameness (0.11) or soundness (0.08) evaluation. Median intra‐rater agreement κ was 0.57. Conclusions: Inter‐rater agreement was poor for less experienced raters, and for all raters when evaluating hindlimb lameness. Since identification of the lame limb/limbs is a prerequisite for successful diagnosis, treatment and recovery, the high inter‐rater variation

How Do Raters Judge Spoken Vocabulary?

ERIC Educational Resources Information Center

Li, Hui

2016-01-01

The aim of the study was to investigate how raters come to their decisions when judging spoken vocabulary. Segmental rating was introduced to quantify raters' decision-making process. It is hoped that this simulated study brings fresh insight to future methodological considerations with spoken data. Twenty trainee raters assessed five Chinese…
Inter- and intra-rater reliability and agreement in determining subcutaneous tumour margins in dogs.

PubMed

Ranganathan, B; Milovancev, M; Leeper, H; Townsend, K L; Bracha, S; Curran, K

2018-03-01

The objective of this prospective study was to evaluate agreement and reliability of calliper-based measurements of locally invasive subcutaneous malignant tumours in dogs. Four raters measured the longest diameter of 12 subcutaneous tumours (7 soft tissue sarcomas and 5 mast cell tumours) from 11 client-owned dogs during 3 randomized, blinded measurement trials, both pre- and post-sedation. Inter- and intra-rater reliability was evaluated using intra-class correlation coefficient (ICC) and agreement was evaluated using Bland-Altman plots. Inter- and intra-rater reliability was good (ICC range of 0.8694-0.89520) and excellent (ICC range of 0.9720-0.9966), respectively. For agreement calculations, an a priori clinically relevant limit of agreement of 10 mm was set. Inter- and intra-rater agreement was unacceptable with inter-rater limits of agreement ranging from 15.9 to 55.6 mm and intra-rater limit of agreement ranging from 11.9 to 28.1 mm. Review of the measurement trial photographs revealed that calliper orientation changes were frequent, occurring in 9/12 (75%) and 8/12 (67%) pre- and post-sedation cases. No significant correlation was found between inter-rater measurement standard deviations and calliper orientation changes or dog body condition score. These findings suggest veterinarians may have poor agreement in determining the gross edge of tumours, which is expected to introduce bias and inconsistency in tumour staging, assessing response to therapy, and surgical margin planning. Due to the potential consequences for veterinary cancer patients, future studies are needed to validate the present findings. © 2018 John Wiley & Sons Ltd.
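The distinction this record draws between reliability (ICC) and agreement (Bland-Altman) can be illustrated with a minimal sketch of the usual 95% limits of agreement: the mean of the paired differences plus or minus 1.96 times their standard deviation. The function name and the sample values are hypothetical, and the record does not describe how the authors handled repeated measurements or more than two raters.

```python
from statistics import mean, stdev


def limits_of_agreement(measure_a, measure_b, z=1.96):
    """Return (lower limit, bias, upper limit) for paired measurements."""
    if len(measure_a) != len(measure_b) or len(measure_a) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - b for a, b in zip(measure_a, measure_b)]
    bias = mean(diffs)     # systematic difference between the two measurers
    spread = stdev(diffs)  # sample SD of the paired differences
    return bias - z * spread, bias, bias + z * spread


# Made-up calliper readings in mm for illustration only.
rater_1 = [42.0, 55.5, 31.0, 78.0, 60.5]
rater_2 = [47.5, 50.0, 36.0, 70.0, 66.0]
lower, bias, upper = limits_of_agreement(rater_1, rater_2)
print(f"bias {bias:.1f} mm, limits of agreement {lower:.1f} to {upper:.1f} mm")
```

Two raters can rank tumours almost identically, which yields a high ICC, while their individual readings still differ by more than a clinically acceptable amount. Limits of agreement are reported in measurement units for exactly this reason.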
Consensus Conference Follow-up: Inter-rater Reliability Assessment of the Best Evidence in Emergency Medicine (BEEM) Rater Scale, a Medical Literature Rating Tool for Emergency Physicians

PubMed Central

Worster, Andrew; Kulasegaram, Kulamakan; Carpenter, Christopher R.; Vallera, Teresa; Upadhye, Suneel; Sherbino, Jonathan; Haynes, R. Brian

2011-01-01

Background: Studies published in general and specialty medical journals have the potential to improve emergency medicine (EM) practice, but there can be delayed awareness of this evidence because emergency physicians (EPs) are unlikely to read most of these journals. Also, not all published studies are intended for or ready for clinical practice application. The authors developed “Best Evidence in Emergency Medicine” (BEEM) to ameliorate these problems by searching for, identifying, appraising, and translating potentially practice-changing studies for EPs. An initial step in the BEEM process is the BEEM rater scale, a novel tool for EPs to collectively evaluate the relative clinical relevance of EM-related studies found in more than 120 journals. The BEEM rater process was designed to serve as a clinical relevance filter to identify those studies with the greatest potential to affect EM practice. Therefore, only those studies identified by BEEM raters as having the highest clinical relevance are selected for the subsequent critical appraisal process and, if found methodologically sound, are promoted as the best evidence in EM. Objectives: The primary objective was to measure inter-rater reliability (IRR) of the BEEM rater scale. Secondary objectives were to determine the minimum number of EP raters needed for the BEEM rater scale to achieve acceptable reliability and to compare performance of the scale against a previously published evidence rating system, the McMaster Online Rating of Evidence (MORE), in an EP population. Methods: The authors electronically distributed the title, conclusion, and a PubMed link for 23 recently published studies related to EM to a volunteer group of 134 EPs. The volunteers answered two demographic questions and rated the articles using one of two randomly assigned seven-point Likert scales, the BEEM rater scale (n = 68) or the MORE scale (n = 66), over two separate administrations. The IRR of each scale was measured using
Emotions and assessment: considerations for rater-based judgements of entrustment.

PubMed

Gomez-Garibello, Carlos; Young, Meredith

2018-03-01

Assessment is subject to increasing scrutiny as medical education transitions towards a competency-based medical education (CBME) model. Traditional perspectives on the roles of assessment emphasise high-stakes, summative assessment, whereas CBME argues for formative assessment. Revisiting conceptualisations about the roles and formats of assessment in medical education provides opportunities to examine understandings and expectations of the assessment of learners. The act of the rater generating scores might be considered as an exclusively cognitive exercise; however, current literature has drawn attention to the notion of raters as measurement instruments, thereby attributing additional factors to their decision-making processes, such as social considerations and intuition. However, the literature has not comprehensively examined the influence of raters' emotions during assessment. In this narrative review, we explore the influence of raters' emotions in the assessment of learners. We summarise existing literature that describes the role of emotions in assessment broadly, and rater-based assessment specifically, across a variety of fields. The literature related to emotions and assessment is examined from different perspectives, including those of educational context, decision making and rater cognition. We use the concept of entrustable professional activities (EPAs) to contextualise a discussion of the ways in which raters' emotions may have meaningful impacts on the decisions they make in clinical settings. This review summarises findings from different perspectives and identifies areas for consideration for the role of emotion in rater-based assessment, and areas for future research. We identify and discuss three different interpretations of the influence of raters' emotions during assessments: (i) emotions lead to biased decision making; (ii) emotions contribute random noise to assessment, and (iii) emotions constitute legitimate sources of information that
Measurement of the Inter-Rater Reliability Rate Is Mandatory for Improving the Quality of a Medical Database: Experience with the Paulista Lung Cancer Registry.

PubMed

Lauricella, Leticia L; Costa, Priscila B; Salati, Michele; Pego-Fernandes, Paulo M; Terra, Ricardo M

2018-06-01

Database quality measurement should be considered a mandatory step to ensure an adequate level of confidence in data used for research and quality improvement. Several metrics have been described in the literature, but no standardized approach has been established. We aimed to describe a methodological approach applied to measure the quality and inter-rater reliability of a regional multicentric thoracic surgical database (Paulista Lung Cancer Registry). Data from the first 3 years of the Paulista Lung Cancer Registry underwent an audit process with 3 metrics: completeness, consistency, and inter-rater reliability. The first 2 methods were applied to the whole data set, and the last method was calculated using 100 cases randomized for direct auditing. Inter-rater reliability was evaluated using percentage of agreement between the data collector and auditor and through calculation of Cohen's κ and intraclass correlation. The overall completeness per section ranged from 0.88 to 1.00, and the overall consistency was 0.96. Inter-rater reliability showed many variables with high disagreement (>10%). For numerical variables, intraclass correlation was a better metric than inter-rater reliability. Cohen's κ showed that most variables had moderate to substantial agreement. The methodological approach applied to the Paulista Lung Cancer Registry showed that completeness and consistency metrics did not sufficiently reflect the real quality status of a database. The inter-rater reliability associated with κ and intraclass correlation was a better quality metric than completeness and consistency metrics because it could determine the reliability of specific variables used in research or benchmark reports. This report can be a paradigm for future studies of data quality measurement. Copyright © 2018 American College of Surgeons. Published by Elsevier Inc. All rights reserved.
Rater Cognition: Implications for Validity

ERIC Educational Resources Information Center

Bejar, Issac I.

2012-01-01

The scoring process is critical in the validation of tests that rely on constructed responses. Documenting that readers carry out the scoring in ways consistent with the construct and measurement goals is an important aspect of score validity. In this article, rater cognition is approached as a source of support for a validity argument for scores…
Measuring the Pain Area: An Intra- and Inter-Rater Reliability Study Using Image Analysis Software.

PubMed

Dos Reis, Felipe Jose Jandre; de Barros E Silva, Veronica; de Lucena, Raphaela Nunes; Mendes Cardoso, Bruno Alexandre; Nogueira, Leandro Calazans

2016-01-01

Pain drawings have frequently been used for clinical information and research. The aim of this study was to investigate intra- and inter-rater reliability of area measurements performed on pain drawings. Our secondary objective was to verify the reliability when using computers with different screen sizes, both with and without mouse hardware. Pain drawings were completed by patients with chronic neck pain or neck-shoulder-arm pain. Four independent examiners participated in the study. Examiners A and B used the same computer with a 16-inch screen and wired mouse hardware. Examiner C used a notebook with a 16-inch screen and no mouse hardware, and Examiner D used a computer with an 11.6-inch screen and a wireless mouse. Image measurements were obtained using GIMP and NIH ImageJ computer programs. The length of all the images was measured using GIMP software to a set scale in ImageJ. Thus, each marked area was encircled and the total surface area (cm²) was calculated for each pain drawing measurement. A total of 117 areas were identified and 52 pain drawings were analyzed. The intra-rater reliability between all examiners was high (ICC = 0.989). The inter-rater reliability was also high. No significant differences were observed when using different screen sizes or when using or not using the mouse hardware. This suggests that the precision of these measurements is acceptable for the use of this method as a measurement tool in clinical practice and research. © 2014 World Institute of Pain.
The Effects of Rater Severity and Rater Distribution on Examinees' Ability Estimation for Constructed-Response Items. Research Report. ETS RR-13-23

ERIC Educational Resources Information Center

Wang, Zhen; Yao, Lihua

2013-01-01

The current study used simulated data to investigate the properties of a newly proposed method (Yao's rater model) for modeling rater severity and its distribution under different conditions. Our study examined the effects of rater severity, distributions of rater severity, the difference between item response theory (IRT) models with rater effect…
Measurement System Characterization in the Presence of Measurement Errors

NASA Technical Reports Server (NTRS)

Commo, Sean A.

2012-01-01

In the calibration of a measurement system, data are collected in order to estimate a mathematical model between one or more factors of interest and a response. Ordinary least squares is a method employed to estimate the regression coefficients in the model. The method assumes that the factors are known without error; yet, it is implicitly known that the factors contain some uncertainty. In the literature, this uncertainty is known as measurement error. The measurement error affects both the estimates of the model coefficients and the prediction, or residual, errors. There are some methods, such as orthogonal least squares, that are employed in situations where measurement errors exist, but these methods do not directly incorporate the magnitude of the measurement errors. This research proposes a new method, known as modified least squares, that combines the principles of least squares with knowledge about the measurement errors. This knowledge is expressed in terms of the variance ratio - the ratio of response error variance to measurement error variance.
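The abstract above does not give the modified least squares estimator itself, so the sketch below uses a different, standard errors-in-variables technique, Deming regression, which also takes the ratio of response error variance to factor error variance as its input. The function name `deming_fit` and the sample data are illustrative assumptions, not taken from the report.

```python
import math


def deming_fit(x, y, variance_ratio):
    """Fit y = intercept + slope * x by Deming regression.

    variance_ratio is var(error in y) / var(error in x), i.e. the same
    kind of ratio the abstract describes. This is Deming regression,
    NOT the report's modified least squares method.
    Assumes the sample covariance of x and y is non-zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

    diff = syy - variance_ratio * sxx
    slope = (diff + math.sqrt(diff * diff + 4.0 * variance_ratio * sxy * sxy)) / (2.0 * sxy)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical calibration data lying exactly on y = 1 + 2x.
slope, intercept = deming_fit([0, 1, 2, 3, 4], [1, 3, 5, 7, 9], variance_ratio=1.0)
print(slope, intercept)
```

As the variance ratio grows very large (negligible error in the factor), the Deming slope approaches the ordinary least squares slope, which is one way to see what ordinary least squares implicitly assumes.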
Inter-rater reliability of Hamilton depression rating scale using video-recorded interviews — Focus on rater-blinding

PubMed Central

Prasad, M. Krishna; Udupa, K.; Kishore, K. R.; Thirthalli, J.; Sathyaprabha, T. N.; Gangadhar, B. N.

2009-01-01

Background: Hamilton depression rating scale (Ham-D) is the most widely used clinician rating scale for depression. There has been no Indian study that has examined the inter-rater reliability (IRR) of video-recorded interviews of the 21-item Ham-D. Aim: To study the IRR of scoring video-recorded interviews for 21-item Ham-D. Materials and Methods: Eighteen subjects with major depressive disorder involved in a larger study were interviewed using the semi-structured clinical interview of the 21-item Ham-D by a primary rater after informed consent. These interviews were video-recorded and portions edited to ensure rater blinding. Subsequently, the video-recorded interviews were rated by a “blind” rater. Both rated the different sub-domains of Ham-D according to Rhoades and Overall (1983). IRR was evaluated using intra-class correlation coefficient. Results: Excellent IRR was observed (0.9891) between the two raters. This was true for each of the primary factors and super-factors. Conclusion: Video-recorded 21-item Ham-D has excellent IRR. Video-recorded interviews of Ham-D can be reliably used to blind raters in research. PMID:19881046
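Several records in this list report intra-class correlation coefficients without stating which ICC form was used. As a minimal sketch, the code below computes one common form, the two-way random-effects, absolute-agreement, single-rater coefficient usually written ICC(2,1), from ANOVA mean squares. The choice of form, the function name `icc_2_1` and the ratings table are assumptions made for illustration; none of them comes from the studies listed here.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings is a list of rows, one row per subject, one column per rater.
    """
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for subjects, raters, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative table: six subjects rated by four raters (not study data).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(example), 2))
```

Because this form penalizes systematic differences between raters, a table in which one rater is consistently lower than the others yields a low value even when the raters rank subjects similarly.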
Rater Agreement Indexes for Performance Assessment.

ERIC Educational Resources Information Center

Burry-Stock, Judith A.; And Others

1996-01-01

It is argued that interrater agreement is a psychometric property which is theoretically different from classic reliability. Formulas are presented to illustrate a set of algebraically equivalent rater agreement indices that are intended to provide educational and psychological researchers with a practical way to establish a measure of rater…
Age Matters, and so May Raters: Rater Differences in the Assessment of Foreign Accents

ERIC Educational Resources Information Center

Huang, Becky H.; Jun, Sun-Ah

2015-01-01

Research on the age of learning effect on second language learners' foreign accents utilizes human judgments to determine speech production outcomes. Inferences drawn from analyses of these ratings are then used to inform theories. The present study focuses on rater differences in the age of learning effect research. Three groups of raters who…
Participant, Rater, and Computer Measures of Coherence in Posttraumatic Stress Disorder

PubMed Central

Rubin, David C.; Deffler, Samantha A.; Ogle, Christin M.; Dowell, Nia M.; Graesser, Arthur C.; Beckham, Jean C.

2015-01-01

We examined the coherence of trauma memories in a trauma-exposed community sample of 30 adults with and 30 without PTSD. The groups had similar categories of traumas and were matched on multiple factors that could affect the coherence of memories. We compared the transcribed oral trauma memories of participants with their most important and most positive memories. A comprehensive set of 28 measures of coherence including 3 ratings by the participants, 7 ratings by outside raters, and 18 computer-scored measures, provided a variety of approaches to defining and measuring coherence. A MANOVA indicated differences in coherence among the trauma, important, and positive memories, but not between the diagnostic groups or their interaction with these memory types. Most differences were small in magnitude; in some cases, the trauma memories were more, rather than less, coherent than the control memories. Where differences existed, the results agreed with the existing literature, suggesting that factors other than the incoherence of trauma memories are most likely to be central to the maintenance of PTSD and thus its treatment. PMID:26523945
Correcting AUC for Measurement Error.

PubMed

Rosner, Bernard; Tworoger, Shelley; Qiu, Weiliang

2015-12-01

Diagnostic biomarkers are used frequently in epidemiologic and clinical work. The ability of a diagnostic biomarker to discriminate between subjects who develop disease (cases) and subjects who do not (controls) is often measured by the area under the receiver operating characteristic curve (AUC). The diagnostic biomarkers are usually measured with error. Ignoring measurement error can cause biased estimation of AUC, which results in misleading interpretation of the efficacy of a diagnostic biomarker. Several methods have been proposed to correct AUC for measurement error, most of which required the normality assumption for the distributions of diagnostic biomarkers. In this article, we propose a new method to correct AUC for measurement error and derive approximate confidence limits for the corrected AUC. The proposed method does not require the normality assumption. Both real data analyses and simulation studies show good performance of the proposed measurement error correction method.
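To illustrate the bias the abstract refers to, the sketch below shows how measurement error pulls the AUC toward 0.5 in the simplest textbook setting: normally distributed biomarker values with equal variance in cases and controls, plus independent additive error. This binormal setting is an assumption chosen for illustration; the article's proposed method specifically does not require normality, and the functions `attenuated_auc` and `corrected_auc` are hypothetical names, not the authors' estimator.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def attenuated_auc(true_auc, reliability):
    """AUC observed when the biomarker is measured with error.

    reliability = true variance / (true variance + error variance).
    Valid only under the equal-variance binormal assumption above.
    """
    z_true = _STD_NORMAL.inv_cdf(true_auc)
    return _STD_NORMAL.cdf(z_true * math.sqrt(reliability))


def corrected_auc(observed_auc, reliability):
    """Invert the attenuation under the same binormal assumption."""
    z_observed = _STD_NORMAL.inv_cdf(observed_auc)
    return _STD_NORMAL.cdf(z_observed / math.sqrt(reliability))


# Hypothetical numbers: a true AUC of 0.80 and a reliability of 0.60.
observed = attenuated_auc(0.80, 0.60)
print(round(observed, 3), round(corrected_auc(observed, 0.60), 3))
```

The correction simply rescales the normal quantile of the observed AUC, so it is only as good as the normality assumption and the reliability estimate that feed it.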
Is the Parkinson Anxiety Scale comparable across raters?

PubMed

Forjaz, Maria João; Ayala, Alba; Martinez-Martin, Pablo; Dujardin, Kathy; Pontone, Gregory M; Starkstein, Sergio E; Weintraub, Daniel; Leentjens, Albert F G

2015-04-01

The Parkinson Anxiety Scale is a new scale developed to measure anxiety severity in Parkinson's disease specifically. It consists of three dimensions: persistent anxiety, episodic anxiety, and avoidance behavior. This study aimed to assess the measurement properties of the scale while controlling for the rater (self- vs. clinician-rated) effect. The Parkinson Anxiety Scale was administered to a cross-sectional multicenter international sample of 362 Parkinson's disease patients. Both patients and clinicians rated the patient's anxiety independently. A many-facet Rasch model design was applied to estimate and remove the rater effect. The following measurement properties were assessed: fit to the Rasch model, unidimensionality, reliability, differential item functioning, item local independency, interrater reliability (self or clinician), and scale targeting. In addition, test-retest stability, construct validity, precision, and diagnostic properties of the Parkinson Anxiety Scale were also analyzed. A good fit to the Rasch model was obtained for Parkinson Anxiety Scale dimensions A and B, after the removal of one item and rescoring of the response scale for certain items, whereas dimension C showed marginal fit. Self versus clinician rating differences were of small magnitude, with patients reporting higher anxiety levels than clinicians. The linear measure for Parkinson Anxiety Scale dimensions A and B showed good convergent construct with other anxiety measures and good diagnostic properties. Parkinson Anxiety Scale modified dimensions A and B provide valid and reliable measures of anxiety in Parkinson's disease that are comparable across raters. Further studies are needed with dimension C. © 2014 International Parkinson and Movement Disorder Society.
A Surgery Oral Examination: Interrater Agreement and the Influence of Rater Characteristics.

ERIC Educational Resources Information Center

Burchard, Kenneth W.; And Others

1995-01-01

A study measured interrater reliability among 140 United States and Canadian surgery exam raters and the influences of age, years in practice, and experience as an examiner on individual scores. Results indicate three aspects of examinee performance influenced scores: verbal style, dress, and content of answers. No rater characteristic…
Inter-rater reliability of kinesthetic measurements with the KINARM robotic exoskeleton.

PubMed

Semrau, Jennifer A; Herter, Troy M; Scott, Stephen H; Dukelow, Sean P

2017-05-22

Kinesthesia (sense of limb movement) has been extremely difficult to measure objectively, especially in individuals who have survived a stroke. The development of valid and reliable measurements for proprioception is important to developing a better understanding of proprioceptive impairments after stroke and their impact on the ability to perform daily activities. We recently developed a robotic task to evaluate kinesthetic deficits after stroke and found that the majority (~60%) of stroke survivors exhibit significant deficits in kinesthesia within the first 10 days post-stroke. Here we aim to determine the inter-rater reliability of this robotic kinesthetic matching task. Twenty-five neurologically intact control subjects and 15 individuals with first-time stroke were evaluated on a robotic kinesthetic matching task (KIN). Subjects sat in a robotic exoskeleton with their arms supported against gravity. In the KIN task, the robot moved the subjects' stroke-affected arm at a preset speed, direction and distance. As soon as subjects felt the robot begin to move their affected arm, they matched the robot movement with the unaffected arm. Subjects were tested in two sessions on the KIN task: initial session and then a second session (within an average of 18.2 ± 13.8 h of the initial session for stroke subjects), which were supervised by different technicians. The task was performed both with and without the use of vision in both sessions. We evaluated intra-class correlations of spatial and temporal parameters derived from the KIN task to determine the reliability of the robotic task. We evaluated 8 spatial and temporal parameters that quantify kinesthetic behavior. We found that the parameters exhibited moderate to high intra-class correlations between the initial and retest conditions (Range, r-value = [0.53-0.97]). The robotic KIN task exhibited good inter-rater reliability. This validates the KIN task as a reliable, objective method for quantifying
Identification of 'Point A' as the prevalent source of error in cephalometric analysis of lateral radiographs.

PubMed

Grogger, P; Sacher, C; Weber, S; Millesi, G; Seemann, R

2018-04-10

Deviations in measuring dentofacial components in a lateral X-ray represent a major hurdle in the subsequent treatment of dysgnathic patients. In a retrospective study, we investigated the most prevalent source of error in the following commonly used cephalometric measurements: the angles Sella-Nasion-Point A (SNA), Sella-Nasion-Point B (SNB) and Point A-Nasion-Point B (ANB); the Wits appraisal; the anteroposterior dysplasia indicator (APDI); and the overbite depth indicator (ODI). Preoperative lateral radiographic images of patients with dentofacial deformities were collected and the landmarks digitally traced by three independent raters. Cephalometric analysis was automatically performed based on 1116 tracings. Error analysis identified the x-coordinate of Point A as the prevalent source of error in all investigated measurements, except SNB, in which it is not incorporated. In SNB, the y-coordinate of Nasion predominated error variance. SNB showed lowest inter-rater variation. In addition, our observations confirmed previous studies showing that landmark identification variance follows characteristic error envelopes in the highest number of tracings analysed up to now. Variance orthogonal to defining planes was of relevance, while variance parallel to planes was not. Taking these findings into account, orthognathic surgeons as well as orthodontists would be able to perform cephalometry more accurately and accomplish better therapeutic results. Copyright © 2018 International Association of Oral and Maxillofacial Surgeons. Published by Elsevier Ltd. All rights reserved.
A Note on the Interpretation of Weighted Kappa and its Relations to Other Rater Agreement Statistics for Metric Scales

ERIC Educational Resources Information Center

Schuster, Christof

2004-01-01

This article presents a formula for weighted kappa in terms of rater means, rater variances, and the rater covariance that is particularly helpful in emphasizing that weighted kappa is an absolute agreement measure in the sense that it is sensitive to differences in rater's marginal distributions. Specifically, rater mean differences will decrease…
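The relationship described above can be checked numerically. The sketch below assumes quadratic disagreement weights and integer category scores, and compares weighted kappa computed from the cross-classification table with the moment expression 2·cov / (var₁ + var₂ + (mean₁ − mean₂)²), using population (divide-by-n) moments. The function names and ratings are illustrative assumptions.

```python
def quadratic_weighted_kappa(x, y, categories):
    """Weighted kappa with quadratic weights, from the agreement table."""
    n = len(x)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    # Observed joint proportions for each (rater 1, rater 2) category pair.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(x, y):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Observed and chance-expected mean squared disagreement.
    obs_dis = sum(
        observed[i][j] * (categories[i] - categories[j]) ** 2
        for i in range(k) for j in range(k)
    )
    exp_dis = sum(
        row[i] * col[j] * (categories[i] - categories[j]) ** 2
        for i in range(k) for j in range(k)
    )
    return 1.0 - obs_dis / exp_dis


def kappa_from_moments(x, y):
    """Same quantity from rater means, variances, and covariance."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2.0 * cxy / (vx + vy + (mx - my) ** 2)


# Hypothetical ratings on a 1-6 scale.
rater_1 = [1, 2, 3, 4, 5, 3, 2, 4]
rater_2 = [1, 3, 3, 5, 5, 2, 2, 4]
scale = [1, 2, 3, 4, 5, 6]
print(quadratic_weighted_kappa(rater_1, rater_2, scale), kappa_from_moments(rater_1, rater_2))
```

Adding a constant to one rater's scores leaves the variances and covariance unchanged but enlarges the squared mean difference in the denominator, which is the sense in which the statistic is sensitive to the raters' marginal distributions.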
Explaining sexual harassment judgments: looking beyond gender of the rater.

PubMed

O'Connor, Maureen; Gutek, Barbara A; Stockdale, Margaret; Geer, Tracey M; Melançon, Renée

2004-02-01

In two decades of research on sexual harassment, one finding that appears repeatedly is that gender of the rater influences judgments about sexual harassment such that women are more likely than men to label behavior as sexual harassment. Yet, sexual harassment judgments are complex, particularly in situations that culminate in legal proceedings. And, this one variable, gender, may have been overemphasized to the exclusion of other situational and rater characteristic variables. Moreover, why do gender differences appear? As in work by Wiener and his colleagues (R. L. Wiener et al., 2002; R. L. Wiener & L. Hurt, 2000; R. L. Wiener, L. Hurt, B. Russell, K. Mannen, & C. Gasper, 1997), this study attempts to look beyond gender to answer this question. In the studies reported here, raters (undergraduates and community adults) either read a written scenario or viewed a videotaped reenactment of a sexual harassment trial. The nature of the work environment was manipulated to see what, if any, effect the context would have on gender effects. Additionally, a number of rater characteristics beyond gender were measured, including ambivalent sexism attitudes of the raters, their judgments of complainant credibility, and self-referencing that might help explain rater judgments. Respondent gender, work environment, and community vs. student sample differences produced reliable differences in sexual harassment ratings in both the written and video trial versions of the study. The gender and sample differences in the sexual harassment ratings, however, are explained by a model which incorporates hostile sexism, perceptions of the complainant's credibility, and raters' own ability to put themselves in the complainant's position (self-referencing).

The Critical Thinking Analytic Rubric (CTAR): Investigating Intra-Rater and Inter-Rater Reliability of a Scoring Mechanism for Critical Thinking Performance Assessments

ERIC Educational Resources Information Center

Saxton, Emily; Belanger, Secret; Becker, William

2012-01-01

The purpose of this study was to investigate the intra-rater and inter-rater reliability of the Critical Thinking Analytic Rubric (CTAR). The CTAR is composed of 6 rubric categories: interpretation, analysis, evaluation, inference, explanation, and disposition. To investigate inter-rater reliability, two trained raters scored four sets of…
BurnCase 3D software validation study: Burn size measurement accuracy and inter-rater reliability.

PubMed

Parvizi, Daryousch; Giretzlehner, Michael; Wurzer, Paul; Klein, Limor Dinur; Shoham, Yaron; Bohanon, Fredrick J; Haller, Herbert L; Tuca, Alexandru; Branski, Ludwik K; Lumenta, David B; Herndon, David N; Kamolz, Lars-P

2016-03-01

The aim of this study was to compare the accuracy of burn size estimation using the computer-assisted software BurnCase 3D (RISC Software GmbH, Hagenberg, Austria) with that using a 2D scan, considered to be the actual burn size. Thirty artificial burn areas were preplanned and prepared on three mannequins (one child, one female, and one male). Five trained physicians (raters) were asked to assess the size of all wound areas using BurnCase 3D software. The results were then compared with the real wound areas, as determined by 2D planimetry imaging. To examine inter-rater reliability, we performed an intraclass correlation analysis with a 95% confidence interval. The mean wound area estimations of the five raters using BurnCase 3D were in total 20.7±0.9% for the child, 27.2±1.5% for the female and 16.5±0.1% for the male mannequin. Our analysis showed relative overestimations of 0.4%, 2.8% and 1.5% for the child, female and male mannequins respectively, compared to the 2D scan. The intraclass correlation between the single raters for mean percentage of the artificial burn areas was 98.6%. There was also a high intraclass correlation between the single raters and the 2D scan. BurnCase 3D is a valid and reliable tool for the determination of total body surface area burned in standard models. Further clinical studies including different pediatric and overweight adult mannequins are warranted. Copyright © 2016 Elsevier Ltd and ISBI. All rights reserved.
Effects of a rater training on rating accuracy in a physical examination skills assessment.

PubMed

Weitz, Gunther; Vinzentius, Christian; Twesten, Christoph; Lehnert, Hendrik; Bonnemeier, Hendrik; König, Inke R

2014-01-01

The accuracy and reproducibility of medical skills assessment is generally low. Rater training has little or no effect. Our knowledge in this field, however, relies on studies involving video ratings of overall clinical performances. We hypothesised that a rater training focussing on the frame of reference could improve accuracy in grading the curricular assessment of a highly standardised physical head-to-toe examination. Twenty-one raters assessed the performance of 242 third-year medical students. Eleven raters had been randomly assigned to undergo a brief frame-of-reference training a few days before the assessment. 218 encounters were successfully recorded on video and re-assessed independently by three additional observers. Accuracy was defined as the concordance between the raters' grade and the median of the observers' grade. After the assessment, both students and raters filled in a questionnaire about their views on the assessment. Rater training did not have a measurable influence on accuracy. However, trained raters rated significantly more stringently than untrained raters, and their overall stringency was closer to the stringency of the observers. The questionnaire indicated a higher awareness of the halo effect in the trained raters group. Although the self-assessment of the students mirrored the assessment of the raters in both groups, the students assessed by trained raters felt more discontent with their grade. While training had some marginal effects, it failed to have an impact on the individual accuracy. These results in real-life encounters are consistent with previous studies on rater training using video assessments of clinical performances. The high degree of standardisation in this study was not suitable to harmonize the trained raters' grading. The data support the notion that the process of appraising medical performance is highly individual. A frame-of-reference training as applied does not effectively adjust the physicians' judgement
Error measuring system of rotary Inductosyn

NASA Astrophysics Data System (ADS)

Liu, Chengjun; Zou, Jibin; Fu, Xinghe

2008-10-01

The inductosyn is a high-precision angle-position sensor with important applications in servo tables, precision machine tools and other products. Its precision is characterized by its error, so error measurement is an important problem in both the production and the application of the inductosyn. At present, the error is obtained mainly by manual measurement, which has disadvantages that cannot be ignored, such as high labour intensity for the operator, susceptibility to operator error and poor repeatability. To solve these problems, this paper puts forward a new automatic measurement method based on a high-precision optical dividing head. The error signal is obtained by precisely processing the output signals of the inductosyn and the optical dividing head. While the inductosyn rotates continuously, its zero position error can be measured dynamically and zero error curves can be output automatically. This method overcomes the measuring and calculating errors caused by human factors and makes the measuring process quicker, more accurate and more reliable. Experiments show that the accuracy of the error measuring system is 1.1 arc-seconds (peak-to-peak value).
Human errors and measurement uncertainty

NASA Astrophysics Data System (ADS)

Kuselman, Ilya; Pennecchi, Francesca

2015-04-01

Evaluating the residual risk of human errors in a measurement and testing laboratory, remaining after the error reduction by the laboratory quality system, and quantifying the consequences of this risk for the quality of the measurement/test results are discussed based on expert judgments and Monte Carlo simulations. A procedure for evaluation of the contribution of the residual risk to the measurement uncertainty budget is proposed. Examples are provided using earlier published sets of expert judgments on human errors in pH measurement of groundwater, elemental analysis of geological samples by inductively coupled plasma mass spectrometry, and multi-residue analysis of pesticides in fruits and vegetables. The human error contribution to the measurement uncertainty budget in the examples was not negligible, yet also not dominant. This was assessed as a good risk management result.
Intra and inter-rater reliability of infrared image analysis of masticatory and upper trapezius muscles in women with and without temporomandibular disorder.

PubMed

Costa, Ana C S; Dibai Filho, Almir V; Packer, Amanda C; Rodrigues-Bigaton, Delaine

2013-01-01

Infrared thermography is an aid tool that can be used to evaluate several pathologies given its efficiency in analyzing the distribution of skin surface temperature. To propose two forms of infrared image analysis of the masticatory and upper trapezius muscles, and to determine the intra and inter-rater reliability of both forms of analysis. Infrared images of masticatory and upper trapezius muscles of 64 female volunteers with and without temporomandibular disorder (TMD) were collected. Two raters performed the infrared image analysis, which occurred in two ways: temperature measurement of the muscle length and in central portion of the muscle. The Intraclass Correlation Coefficient (ICC) was used to determine the intra and inter-rater reliability. The ICC showed excellent intra and inter-rater values for both measurements: temperature measurement of the muscle length (TMD group, intra-rater, ICC ranged from 0.996 to 0.999, inter-rater, ICC ranged from 0.992 to 0.999; control group, intra-rater, ICC ranged from 0.993 to 0.998, inter-rater, ICC ranged from 0.990 to 0.998), and temperature measurement of the central portion of the muscle (TMD group, intra-rater, ICC ranged from 0.981 to 0.998, inter-rater, ICC ranged from 0.971 to 0.998; control group, intra-rater, ICC ranged from 0.887 to 0.996, inter-rater, ICC ranged from 0.852 to 0.996). The results indicated that temperature measurements of the masticatory and upper trapezius muscles carried out by the analysis of the muscle length and central portion yielded excellent intra and inter-rater reliability.
Relationships of Measurement Error and Prediction Error in Observed-Score Regression

ERIC Educational Resources Information Center

Moses, Tim

2012-01-01

The focus of this paper is assessing the impact of measurement errors on the prediction error of an observed-score regression. Measures are presented and described for decomposing the linear regression's prediction error variance into parts attributable to the true score variance and the error variances of the dependent variable and the predictor…
Inter-rater reliability for movement pattern analysis (MPA): measuring patterning of behaviors versus discrete behavior counts as indicators of decision-making style.

PubMed

Connors, Brenda L; Rende, Richard; Colton, Timothy J

2014-01-01

The unique yield of collecting observational data on human movement has received increasing attention in a number of domains, including the study of decision-making style. As such, interest has grown in the nuances of core methodological issues, including the best ways of assessing inter-rater reliability. In this paper we focus on one key topic - the distinction between establishing reliability for the patterning of behaviors as opposed to the computation of raw counts - and suggest that reliability for each be compared empirically rather than determined a priori. We illustrate by assessing inter-rater reliability for key outcome measures derived from movement pattern analysis (MPA), an observational methodology that records body movements as indicators of decision-making style with demonstrated predictive validity. While reliability ranged from moderate to good for raw counts of behaviors reflecting each of two Overall Factors generated within MPA (Assertion and Perspective), inter-rater reliability for patterning (proportional indicators of each factor) was significantly higher and excellent (ICC = 0.89). Furthermore, patterning, as compared to raw counts, provided better prediction of observable decision-making process assessed in the laboratory. These analyses support the utility of using an empirical approach to inform the consideration of measuring patterning versus discrete behavioral counts of behaviors when determining inter-rater reliability of observable behavior. They also speak to the substantial reliability that may be achieved via application of theoretically grounded observational systems such as MPA that reveal thinking and action motivations via visible movement patterns.
Compact disk error measurements

NASA Technical Reports Server (NTRS)

Howe, D.; Harriman, K.; Tehranchi, B.

1993-01-01

The objectives of this project are as follows: provide hardware and software that will perform simple, real-time, high resolution (single-byte) measurement of the error burst and good data gap statistics seen by a photoCD player read channel when recorded CD write-once discs of variable quality (i.e., condition) are being read; extend the above system to enable measurement of the hard decision (i.e., 1-bit error flags) and soft decision (i.e., 2-bit error flags) decoding information that is produced/used by the Cross Interleaved - Reed - Solomon - Code (CIRC) block decoder employed in the photoCD player read channel; construct a model that uses data obtained via the systems described above to produce meaningful estimates of output error rates (due to both uncorrected ECC words and misdecoded ECC words) when a CD disc having specific (measured) error statistics is read (completion date to be determined); and check the hypothesis that current adaptive CIRC block decoders are optimized for pressed (DAD/ROM) CD discs. If warranted, do a conceptual design of an adaptive CIRC decoder that is optimized for write-once CD discs.
Measurement Error, Reliability, and Minimum Detectable Change in the Mini-Mental State Examination, Montreal Cognitive Assessment, and Color Trails Test among Community Living Middle-Aged and Older Adults.

PubMed

Feeney, Joanne; Savva, George M; O'Regan, Claire; King-Kallimanis, Bellinda; Cronin, Hilary; Kenny, Rose Anne

2016-05-31

Knowing the reliability of cognitive tests, particularly those commonly used in clinical practice, is important in order to interpret the clinical significance of a change in performance or a low score on a single test. To report the intra-class correlation (ICC), standard error of measurement (SEM) and minimum detectable change (MDC) for the Mini-Mental State Examination (MMSE), Montreal Cognitive Assessment (MoCA), and Color Trails Test (CTT) among community dwelling older adults. 130 participants aged 55 and older without severe cognitive impairment underwent two cognitive assessments between two and four months apart. Half the group changed rater between assessments and half changed time of day. Mean (standard deviation) MMSE was 28.1 (2.1) at baseline and 28.4 (2.1) at repeat. Mean (SD) MoCA increased from 24.8 (3.6) to 25.2 (3.6). There was a rater effect on CTT, but not on the MMSE or MoCA. The SEM of the MMSE was 1.0, leading to an MDC (based on a 95% confidence interval) of 3 points. The SEM of the MoCA was 1.5, implying an MDC95 of 4 points. MoCA (ICC = 0.81) was more reliable than MMSE (ICC = 0.75), but all tests examined showed substantial within-patient variation. An individual's score would have to change by greater than or equal to 3 points on the MMSE and 4 points on the MoCA for the rater to be confident that the change was not due to measurement error. This has important implications for epidemiologists and clinicians in dementia screening and diagnosis.
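The abstract above does not state the formulas behind its SEM and MDC figures. A minimal sketch, assuming the conventional definitions SEM = SD × √(1 − ICC) and MDC95 = 1.96 × √2 × SEM, and assuming the baseline standard deviations are the ones used, gives values close to those reported. The function names are illustrative.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement from a score SD and a reliability (ICC)."""
    return sd * math.sqrt(1.0 - icc)


def minimum_detectable_change(sem, z=1.96):
    """MDC for a difference between two measurements; z = 1.96 gives MDC95."""
    return z * math.sqrt(2.0) * sem


# Baseline SDs and ICCs quoted in the abstract; using baseline SDs is an assumption.
mmse_sem = sem_from_icc(sd=2.1, icc=0.75)
moca_sem = sem_from_icc(sd=3.6, icc=0.81)

print("MMSE", round(mmse_sem, 2), round(minimum_detectable_change(mmse_sem), 2))
print("MoCA", round(moca_sem, 2), round(minimum_detectable_change(moca_sem), 2))
```

Under these assumptions the MMSE works out to an SEM of about 1.05 and an MDC95 of about 2.9, and the MoCA to about 1.57 and 4.3, in line with the rounded figures of 3 and 4 points in the abstract. Passing z = 1.645 instead yields the MDC90 used in some of the other records.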
Error-tradeoff and error-disturbance relations for incompatible quantum measurements.

PubMed

Branciard, Cyril

2013-04-23

Heisenberg's uncertainty principle is one of the main tenets of quantum theory. Nevertheless, and despite its fundamental importance for our understanding of quantum foundations, there has been some confusion in its interpretation: Although Heisenberg's first argument was that the measurement of one observable on a quantum state necessarily disturbs another incompatible observable, standard uncertainty relations typically bound the indeterminacy of the outcomes when either one or the other observable is measured. In this paper, we quantify precisely Heisenberg's intuition. Even if two incompatible observables cannot be measured together, one can still approximate their joint measurement, at the price of introducing some errors with respect to the ideal measurement of each of them. We present a tight relation characterizing the optimal tradeoff between the error on one observable vs. the error on the other. As a particular case, our approach allows us to characterize the disturbance of an observable induced by the approximate measurement of another one; we also derive a stronger error-disturbance relation for this scenario.
Three-dimensional assessment of the asymptomatic and post-stroke shoulder: intra-rater test-retest reliability and within-subject repeatability of the palpation and digitization approach.

PubMed

Pain, Liza A M; Baker, Ross; Sohail, Qazi Zain; Richardson, Denyse; Zabjek, Karl; Mogk, Jeremy P M; Agur, Anne M R

2018-03-23

Altered three-dimensional (3D) joint kinematics can contribute to shoulder pathology, including post-stroke shoulder pain. Reliable assessment methods enable comparative studies between asymptomatic shoulders of healthy subjects and painful shoulders of post-stroke subjects, and could inform treatment planning for post-stroke shoulder pain. The study purpose was to establish intra-rater test-retest reliability and within-subject repeatability of a palpation/digitization protocol, which assesses 3D clavicular/scapular/humeral rotations, in asymptomatic and painful post-stroke shoulders. Repeated measurements of 3D clavicular/scapular/humeral joint/segment rotations were obtained using palpation/digitization in 32 asymptomatic and six painful post-stroke shoulders during four reaching postures (rest/flexion/abduction/external rotation). Intra-class correlation coefficients (ICCs), standard error of the measurement and 95% confidence intervals were calculated. All ICC values indicated high to very high test-retest reliability (≥0.70), with lower reliability for scapular anterior/posterior tilt during external rotation in asymptomatic subjects, and scapular medial/lateral rotation, humeral horizontal abduction/adduction and axial rotation during abduction in post-stroke subjects. All standard error of measurement values demonstrated within-subject repeatability error ≤5° for all clavicular/scapular/humeral joint/segment rotations (asymptomatic ≤3.75°; post-stroke ≤5.0°), except for humeral axial rotation (asymptomatic ≤5°; post-stroke ≤15°). This noninvasive, clinically feasible palpation/digitization protocol was reliable and repeatable in asymptomatic shoulders, and in a smaller sample of painful post-stroke shoulders. Implications for Rehabilitation In the clinical setting, a reliable and repeatable noninvasive method for assessment of three-dimensional (3D) clavicular/scapular/humeral joint orientation and range of motion (ROM) is currently
Challenges in using rater judgements in medical education.

PubMed

Albanese, M A

2000-08-01

Changes in the healthcare environment are putting increasing pressure on medical schools to make faculty accountable and to document the quality of the medical education they provide. Faculty's ratings of students' performances and students' ratings of faculty's teaching are important elements in these efforts to document educational quality. This article discusses selected research related to factors affecting raters' judgements, analyses how changes in the health care environment are influencing such judgements, offers some suggestions to moderate some of the effects and links these influences to the system that upholds professional standards. Ratings are known to have a positive bias (generosity error), provide limited discrimination and often fail to document serious deficits. The potential sources of these problems relate to the mechanics of the rating task, the system used to obtain ratings and factors affecting rater judgement. As managed care demands reduce the time faculty have for teaching, as system-wide disincentives to provide negative ratings proliferate and as social engineering challenges, such as the Americans with Disabilities Act, impose differential standards for students, the natural tendency to avoid giving negative ratings becomes even harder to resist. Ultimately, these forces compromise the capability of faculty to uphold the standards of the profession. The author calls for a national effort to stem the erosion of those standards.
Workplace-Based Assessment: Raters' Performance Theories and Constructs

ERIC Educational Resources Information Center

Govaerts, M. J. B.; Van de Wiel, M. W. J.; Schuwirth, L. W. T.; Van der Vleuten, C. P. M.; Muijtjens, A. M. M.

2013-01-01

Weaknesses in the nature of rater judgments are generally considered to compromise the utility of workplace-based assessment (WBA). In order to gain insight into the underpinnings of rater behaviours, we investigated how raters form impressions of and make judgments on trainee performance. Using theoretical frameworks of social cognition and…
Inter-rater and intra-rater agreement on the Nordic Orofacial Test--Screening examination in children, adolescents and young adults with cerebral palsy.

PubMed

Edvinsson, Siv Elisabet; Lundqvist, Lars-Olov

2014-02-01

To evaluate inter-rater and intra-rater agreement on the Nordic Orofacial Test-Screening (NOT-S) examination applied to children, adolescents and young adults with cerebral palsy (CP). Using the NOT-S examination, two speech and language pathologists independently assessed video recordings of 48 subjects with CP aged 5-22 years and representing all CP sub-diagnoses and levels of gross motor function and manual ability. Thirty-one subjects were reassessed. Fifteen out of 17 items in the NOT-S examination domains (1) Face at rest, (2) Nose breathing, (3) Facial expression, (4) Masticatory muscle and jaw function, (5) Oral motor function and (6) Speech were rated using a 'yes' (dysfunction observed)/'no' format, generating an overall score of 0-6. Inter-rater agreement: Twelve out of 15 items and five out of six domains showed acceptable unweighted kappa values (κ = 0.46-1.00). The lowest kappa value was found for domain 4 (κ = -0.04), although it had high inter-rater agreement (92%). The linear weighted kappa value for the overall NOT-S examination score was 0.65 (95% CI = 0.49-0.82). Intra-rater agreement: All items and domains showed acceptable unweighted kappa values (items 0.58-1.00 and 0.59-1.00, domains 0.81-1.00 and 0.62-0.89) for both raters. The linear weighted kappa value for the overall NOT-S examination score was 0.81 (95% CI = 0.63-0.99) for rater A and 0.54 (95% CI = 0.25-0.82) for rater B. The NOT-S examination has acceptable inter-rater and intra-rater agreement when used in young individuals with CP.
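The combination reported above for domain 4 (92% agreement but κ near zero) is a known property of kappa when one category is rare: chance agreement is then already very high. The following is a minimal sketch of that arithmetic. The 2 × 2 counts are hypothetical, chosen only to reproduce a similar pattern for 48 subjects; they are not the study's data.

```python
def cohen_kappa(table):
    """Return (observed agreement, unweighted Cohen's kappa).

    table[i][j] = number of subjects that rater A placed in category i
    and rater B placed in category j (square table).
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    # Proportion of subjects on the diagonal (both raters agree)
    p_observed = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions for each rater
    row_marg = [sum(table[i]) / n for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected by chance from the marginals alone
    p_expected = sum(row_marg[i] * col_marg[i] for i in range(k))
    return p_observed, (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts; categories ordered ('no', 'yes').
# Dysfunction is rare, and the raters never both say 'yes'.
table = [[44, 2],
         [2, 0]]
agreement, kappa = cohen_kappa(table)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
# prints: agreement = 0.92, kappa = -0.04
```

With these counts the chance-expected agreement is itself about 0.92, so the observed agreement adds nothing beyond chance and kappa falls slightly below zero.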
Inter-rater reliability for movement pattern analysis (MPA): measuring patterning of behaviors versus discrete behavior counts as indicators of decision-making style

PubMed Central

Connors, Brenda L.; Rende, Richard; Colton, Timothy J.

2014-01-01

The unique yield of collecting observational data on human movement has received increasing attention in a number of domains, including the study of decision-making style. As such, interest has grown in the nuances of core methodological issues, including the best ways of assessing inter-rater reliability. In this paper we focus on one key topic – the distinction between establishing reliability for the patterning of behaviors as opposed to the computation of raw counts – and suggest that reliability for each be compared empirically rather than determined a priori. We illustrate by assessing inter-rater reliability for key outcome measures derived from movement pattern analysis (MPA), an observational methodology that records body movements as indicators of decision-making style with demonstrated predictive validity. While reliability ranged from moderate to good for raw counts of behaviors reflecting each of two Overall Factors generated within MPA (Assertion and Perspective), inter-rater reliability for patterning (proportional indicators of each factor) was significantly higher and excellent (ICC = 0.89). Furthermore, patterning, as compared to raw counts, provided better prediction of observable decision-making process assessed in the laboratory. These analyses support the utility of using an empirical approach to inform the consideration of measuring patterning versus discrete behavioral counts of behaviors when determining inter-rater reliability of observable behavior. They also speak to the substantial reliability that may be achieved via application of theoretically grounded observational systems such as MPA that reveal thinking and action motivations via visible movement patterns. PMID:24999336
Accuracy of Surgery Clerkship Performance Raters.

ERIC Educational Resources Information Center

Littlefield, John H.; And Others

1991-01-01

Interrater reliability in numerical ratings of clerkship performance (n=1,482 students) in five surgery programs was studied. Raters were classified as accurate or moderately or significantly stringent or lenient. Results indicate that increasing the proportion of accurate raters would substantially improve the precision of class rankings. (MSE)
Effects of a rater training on rating accuracy in a physical examination skills assessment

PubMed Central

Weitz, Gunther; Vinzentius, Christian; Twesten, Christoph; Lehnert, Hendrik; Bonnemeier, Hendrik; König, Inke R.

2014-01-01

Background: The accuracy and reproducibility of medical skills assessment is generally low. Rater training has little or no effect. Our knowledge in this field, however, relies on studies involving video ratings of overall clinical performances. We hypothesised that a rater training focussing on the frame of reference could improve accuracy in grading the curricular assessment of a highly standardised physical head-to-toe examination. Methods: Twenty-one raters assessed the performance of 242 third-year medical students. Eleven raters had been randomly assigned to undergo a brief frame-of-reference training a few days before the assessment. 218 encounters were successfully recorded on video and re-assessed independently by three additional observers. Accuracy was defined as the concordance between the raters' grade and the median of the observers' grade. After the assessment, both students and raters filled in a questionnaire about their views on the assessment. Results: Rater training did not have a measurable influence on accuracy. However, trained raters rated significantly more stringently than untrained raters, and their overall stringency was closer to the stringency of the observers. The questionnaire indicated a higher awareness of the halo effect in the trained raters group. Although the self-assessment of the students mirrored the assessment of the raters in both groups, the students assessed by trained raters felt more discontent with their grade. Conclusions: While training had some marginal effects, it failed to have an impact on the individual accuracy. These results in real-life encounters are consistent with previous studies on rater training using video assessments of clinical performances. The high degree of standardisation in this study was not suitable to harmonize the trained raters’ grading. The data support the notion that the process of appraising medical performance is highly individual. A frame-of-reference training as applied does not
Rater Variables Associated with ITER Ratings

ERIC Educational Resources Information Center

Paget, Michael; Wu, Caren; McIlwrick, Joann; Woloschuk, Wayne; Wright, Bruce; McLaughlin, Kevin

2013-01-01

Advocates of holistic assessment consider the ITER a more authentic way to assess performance. But this assessment format is subjective and, therefore, susceptible to rater bias. Here our objective was to study the association between rater variables and ITER ratings. In this observational study our participants were clerks at the University of…
Measurement Error and Equating Error in Power Analysis

ERIC Educational Resources Information Center

Phillips, Gary W.; Jiang, Tao

2016-01-01

Power analysis is a fundamental prerequisite for conducting scientific research. Without power analysis the researcher has no way of knowing whether the sample size is large enough to detect the effect he or she is looking for. This paper demonstrates how psychometric factors such as measurement error and equating error affect the power of…

Analyzing Written Comments by Performance Raters.

ERIC Educational Resources Information Center

Littlefield, John; And Others

A four-level taxonomy is proposed to define the usefulness of rater written comments for supporting letters of recommendation. The taxonomy is used to classify comments on 220 rating forms by 25 raters from two surgery departments regarding performance by third-year medical students. Written comments were classified by the following taxonomy: (1)…
DeuteRater: a tool for quantifying peptide isotope precision and kinetic proteomics.

PubMed

Naylor, Bradley C; Porter, Michael T; Wilson, Elise; Herring, Adam; Lofthouse, Spencer; Hannemann, Austin; Piccolo, Stephen R; Rockwood, Alan L; Price, John C

2017-05-15

Using mass spectrometry to measure the concentration and turnover of the individual proteins in a proteome enables the calculation of individual synthesis and degradation rates for each protein. Software to analyze concentration is readily available, but software to analyze turnover is lacking. Data analysis workflows typically don't access the full breadth of information about instrument precision and accuracy that is present in each peptide isotopic envelope measurement. This method utilizes both isotope distribution and changes in neutromer spacing, which benefits the analysis of both concentration and turnover. We have developed a data analysis tool, DeuteRater, to measure protein turnover from metabolic D2O labeling. DeuteRater uses theoretical predictions for label-dependent change in isotope abundance and inter-peak (neutromer) spacing within the isotope envelope to calculate protein turnover rate. We have also used these metrics to evaluate the accuracy and precision of peptide measurements and thereby determined the optimal data acquisition parameters of different instruments, as well as the effect of data processing steps. We show that these combined measurements can be used to remove noise and increase confidence in the protein turnover measurement for each protein. Source code and ReadMe for Python 2 and 3 versions of DeuteRater are available at https://github.com/JC-Price/DeuteRater . Data is at https://chorusproject.org/pages/index.html project number 1147. Critical intermediate calculation files are provided as Tables S3 and S4. Software has only been tested on Windows machines. Contact: jcprice@chem.byu.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Impact of Measurement Error on Synchrophasor Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Liu, Yilu; Gracia, Jose R.; Ewing, Paul D.

2015-07-01

Phasor measurement units (PMUs), a type of synchrophasor, are powerful diagnostic tools that can help avert catastrophic failures in the power grid. Because of this, PMU measurement errors are particularly worrisome. This report examines the internal and external factors contributing to PMU phase angle and frequency measurement errors and gives a reasonable explanation for them. It also analyzes the impact of those measurement errors on several synchrophasor applications: event location detection, oscillation detection, islanding detection, and dynamic line rating. The primary finding is that dynamic line rating is more likely to be influenced by measurement error. Other findings include the possibility of reporting nonoscillatory activity as an oscillation as the result of error, failing to detect oscillations submerged by error, and the unlikely impact of error on event location and islanding detection.
Rater agreement reliability of the dial test in the ACL-deficient knee.

PubMed

Slichter, Malou E; Wolterbeek, Nienke; Auw Yang, K Gie; Zijl, Jacco A C; Piscaer, Tom M

2018-06-14

Posterolateral rotatory instability (PLRI) of the knee can easily be missed, because attention is paid to injury of the cruciate ligaments. If left untreated this clinical instability may persist after reconstruction of the cruciate ligaments and may put the graft at risk of failure. Even though the dial test is widely used to diagnose PLRI, no validity and reliability studies of the manual dial test have yet been performed in patients. This study focuses on the reliability of the manual dial test by determining the rater agreement. Two independent examiners performed the dial test in knees of 52 patients after knee distortion with a suspicion of ACL rupture. The dial test was performed in prone position in 30°, 60° and 90° of flexion of the knees. A side-to-side difference of ≥10° was considered a positive dial test. For quantification of the amount of rotation in degrees, a measuring device was used with a standardized 6 Nm force, using a digital torque adapter on a booth. The intra-rater, inter-rater and rater-device agreement were determined by calculating kappa (κ) for the dial test. A positive dial test was found in 21.2% and 18.0% of the patients as assessed by a blinded examiner and orthopaedic surgeon respectively. Fair inter-rater agreement was found in 30° of flexion, κ F = 0.29 (95% CI: 0.01 to 0.56), p = 0.044 and 90° of flexion, κ F = 0.38 (95% CI: 0.10 to 0.66), p = 0.007. Almost perfect rater-device agreement was found in 30° of flexion, κ C = 0.84 (95% CI: 0.52 to 1.15), p < 0.001. Moderate rater-device agreement was found in 30° and 90° combined, κ C = 0.50 (95% CI: 0.13 to 0.86), p = 0.008. No significant intra-rater agreement was found. Rater agreement reliability of the manual dial test is questionable. It has a fair inter-rater agreement in 30° and 90° of flexion.
Rater cognition: review and integration of research findings.

PubMed

Gauthier, Geneviève; St-Onge, Christina; Tavares, Walter

2016-05-01

Given the complexity of competency frameworks, associated skills and abilities, and contexts in which they are to be assessed in competency-based education (CBE), there is an increased reliance on rater judgements when considering trainee performance. This increased dependence on rater-based assessment has led to the emergence of rater cognition as a field of research in health professions education. The topic, however, is often conceptualised and ultimately investigated using many different perspectives and theoretical frameworks. Critically analysing how researchers think about, study and discuss rater cognition or the judgement processes in assessment frameworks may provide meaningful and efficient directions in how the field continues to explore the topic. We conducted a critical and integrative review of the literature to explore common conceptualisations and unified terminology associated with rater cognition research. We identified 1045 articles on rater-based assessment in health professions education using Scopus, Medline and ERIC and 78 articles were included in our review. We propose a three-phase framework of observation, processing and integration. We situate nine specific mechanisms and sub-mechanisms described across the literature within these phases: (i) generating automatic impressions about the person; (ii) formulating high-level inferences; (iii) focusing on different dimensions of competencies; (iv) categorising through well-developed schemata based on (a) personal concept of competence, (b) comparison with various exemplars and (c) task and context specificity; (v) weighting and synthesising information differently; (vi) producing narrative judgements; and (vii) translating narrative judgements into scales. Our review has allowed us to identify common underlying conceptualisations of observed rater mechanisms and subsequently propose a comprehensive, although complex, framework for the dynamic and contextual nature of the rating process
Resampling probability values for weighted kappa with multiple raters.

PubMed

Mielke, Paul W; Berry, Kenneth J; Johnston, Janis E

2008-04-01

A new procedure to compute weighted kappa with multiple raters is described. A resampling procedure to compute approximate probability values for weighted kappa with multiple raters is presented. Applications of weighted kappa are illustrated with an example analysis of classifications by three independent raters.
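As a rough illustration of the two ideas named in this abstract, the sketch below computes a linear weighted kappa and an approximate resampling (permutation) probability value. It is a simplified two-rater version written for this listing, not the multiple-rater procedure of the paper; the function names, the linear weights, the number of resamples and the seed are all assumptions.

```python
import random


def weighted_kappa(a, b, n_categories):
    """Linear weighted kappa for two raters.

    a, b: equal-length lists of ordinal ratings coded 0..n_categories-1.
    Undefined (division by zero) if both raters use a single category.
    """
    n = len(a)
    max_dist = n_categories - 1
    # Mean observed disagreement, scaled to [0, 1]
    observed = sum(abs(x - y) for x, y in zip(a, b)) / (n * max_dist)
    # Disagreement expected by chance from each rater's marginal distribution
    pa = [a.count(c) / n for c in range(n_categories)]
    pb = [b.count(c) / n for c in range(n_categories)]
    expected = sum(pa[i] * pb[j] * abs(i - j)
                   for i in range(n_categories)
                   for j in range(n_categories)) / max_dist
    return 1 - observed / expected


def permutation_p_value(a, b, n_categories, n_resamples=2000, seed=0):
    """Approximate one-sided p-value: how often shuffled ratings reach
    a weighted kappa at least as large as the observed one."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    observed = weighted_kappa(a, b, n_categories)
    shuffled = list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(shuffled)  # breaks the pairing, keeps the marginals
        if weighted_kappa(a, shuffled, n_categories) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return observed, (hits + 1) / (n_resamples + 1)


# Hypothetical ratings of ten items on a four-point scale (0..3)
rater_a = [0, 0, 1, 1, 2, 2, 3, 3, 1, 2]
rater_b = [0, 1, 1, 1, 2, 3, 3, 2, 0, 2]
kappa, p = permutation_p_value(rater_a, rater_b, n_categories=4)
print(f"weighted kappa = {kappa:.2f}, approximate p = {p:.4f}")
```

Shuffling one rater's ratings leaves both marginal distributions unchanged, so the chance-expected disagreement stays fixed and only the observed disagreement varies across resamples.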
Cultural values and performance appraisal: assessing the effects of rater self-construal on performance ratings.

PubMed

Mishra, Vipanchi; Roch, Sylvia G

2013-01-01

Much of the prior research investigating the influence of cultural values on performance ratings has focused either on conducting cross-national comparisons among raters or using cultural level individualism/collectivism scales to measure the effects of cultural values on performance ratings. Recent research has shown that there is considerable within country variation in cultural values, i.e. people in one country can be more individualistic or collectivistic in nature. Taking the latter perspective, the present study used Markus and Kitayama's (1991) conceptualization of independent and interdependent self-construals as measures of individual variations in cultural values to investigate within culture variations in performance ratings. Results suggest that rater self-construal has a significant influence on overall performance evaluations; specifically, raters with a highly interdependent self-construal tend to show a preference for interdependent ratees, whereas raters high on independent self-construal do not show a preference for specific type of ratees when making overall performance evaluations. Although rater self-construal significantly influenced overall performance evaluations, no such effects were observed for specific dimension ratings. Implications of these results for performance appraisal research and practice are discussed.
Validity and reliability of exposure assessors' ratings of exposure intensity by type of occupational questionnaire and type of rater.

PubMed

Friesen, Melissa C; Coble, Joseph B; Katki, Hormuzd A; Ji, Bu-Tian; Xue, Shouzheng; Lu, Wei; Stewart, Patricia A

2011-07-01

In epidemiologic studies that rely on professional judgment to assess occupational exposures, the raters' accurate assessment is vital to detect associations. We examined the influence of the type of questionnaire, type of industry, and type of rater on the raters' ability to reliably and validly assess within-industry differences in exposure. Our aim was to identify areas where improvements in exposure assessment may be possible. Subjects from three foundries (n = 72) and three textile plants (n = 74) in Shanghai, China, completed an occupational history (OH) and an industry-specific questionnaire (IQ). Six total dust measurements were collected per subject and were used to calculate a subject-specific measurement mean, which was used as the gold standard. Six raters independently ranked the intensity of each subject's current job on an ordinal scale (1-4) based on the OH alone and on the OH and IQ together. Aggregate ratings were calculated for the group, for industrial hygienists, and for occupational physicians. We calculated intra-class correlation coefficients (ICCs) to evaluate the reliability of the raters. We calculated the correlation between the subject-specific measurement means and the ratings to evaluate the raters' validity. Analyses were stratified by industry, type of questionnaire, and type of rater. We also examined the agreement between the ratings by exposure category, where the subject-specific measurement means were categorized into two and four categories. The reliability and validity measures were higher for the aggregate ratings than for the ratings from the individual raters. The group's performance was maximized with three raters. Both the reliability and validity measures were higher for the foundry industry than for the textile industry. The ICCs were consistently lower in the OH/IQ round than in the OH round in both industries. In contrast, the correlations with the measurement means were higher in the OH/IQ round than in the OH round
Rater Cognition Research: Some Possible Directions for the Future

ERIC Educational Resources Information Center

Myford, Carol M.

2012-01-01

Over the last several decades, researchers have studied many and varied aspects of rater cognition. Those interested in pursuing basic research have focused on gaining an understanding of raters' thought processes as they score different types of performances and products, striving to understand how raters' mental representations and the cognitive…
Power Measurement Errors on a Utility Aircraft

NASA Technical Reports Server (NTRS)

Bousman, William G.

2002-01-01

Extensive flight test data obtained from two recent performance tests of a UH 60A aircraft are reviewed. A power difference is calculated from the power balance equation and is used to examine power measurement errors. It is shown that the baseline measurement errors are highly non-Gaussian in their frequency distribution and are therefore influenced by additional, unquantified variables. Linear regression is used to examine the influence of other variables and it is shown that a substantial portion of the variance depends upon measurements of atmospheric parameters. Correcting for temperature dependence, although reducing the variance in the measurement errors, still leaves unquantified effects. Examination of the power difference over individual test runs indicates significant errors from drift, although it is unclear how these may be corrected. In an idealized case, where the drift is correctable, it is shown that the power measurement errors are significantly reduced and the error distribution is Gaussian. A new flight test program is recommended that will quantify the thermal environment for all torque measurements on the UH 60. Subsequently, the torque measurement systems will be recalibrated based on the measured thermal environment and a new power measurement assessment performed.
The intra- and inter-rater reliability of five clinical muscle performance tests in patients with and without neck pain

PubMed Central

2013-01-01

Background: This study investigates the reliability of muscle performance tests using cost- and time-effective methods similar to those used in clinical practice. When conducting reliability studies, great effort goes into standardising test procedures to facilitate a stable outcome. Therefore, several test trials are often performed. However, when muscle performance tests are applied in the clinical setting, clinicians often only conduct a muscle performance test once as repeated testing may produce fatigue and pain, thus variation in test results. We aimed to investigate whether cervical muscle performance tests, which have shown promising psychometric properties, would remain reliable when examined under conditions similar to those of daily clinical practice. Methods: The intra-rater (between-day) and inter-rater (within-day) reliability was assessed for five cervical muscle performance tests in patients with (n = 33) and without neck pain (n = 30). The five tests were joint position error, the cranio-cervical flexion test, the neck flexor muscle endurance test performed in supine and in a 45°-upright position and a new neck extensor test. Results: Intra-rater reliability ranged from moderate to almost perfect agreement for joint position error (ICC ≥ 0.48-0.82), the cranio-cervical flexion test (ICC ≥ 0.69), the neck flexor muscle endurance test performed in supine (ICC ≥ 0.68) and in a 45°-upright position (ICC ≥ 0.41) with the exception of a new test (neck extensor test), which ranged from slight to moderate agreement (ICC = 0.14-0.41). Likewise, inter-rater reliability ranged from moderate to almost perfect agreement for joint position error (ICC ≥ 0.51-0.75), the cranio-cervical flexion test (ICC ≥ 0.85), the neck flexor muscle endurance test performed in supine (ICC ≥ 0.70) and in a 45°-upright position (ICC ≥ 0.56). However, only slight to fair agreement was found for the neck extensor test (ICC…
A paired comparison analysis of third-party rater thyroidectomy scar preference.

PubMed

Rajakumar, C; Doyle, P C; Brandt, M G; Moore, C C; Nichols, A; Franklin, J H; Yoo, J; Fung, K

2017-01-01

To determine the length and position of a thyroidectomy scar that is cosmetically most appealing to naïve raters. Images of thyroidectomy scars were reproduced on male and female necks using digital imaging software. Surgical variables studied were scar position and length. Fifteen raters were presented with 56 scar pairings and asked to identify which was preferred cosmetically. Twenty duplicate pairings were included to assess rater reliability. Analysis of variance was used to determine preference. Raters preferred low, short scars, followed by high, short scars, with long scars in either position being less desirable (p < 0.05). Twelve of 15 raters had acceptable intra-rater and inter-rater reliability. Naïve raters preferred low, short scars over the alternatives. High, short scars were the next most favourably rated. If other factors influencing incision choice are considered equal, surgeons should consider these preferences in scar position and length when planning their thyroidectomy approach.
Feasibility and inter-rater reliability of the ICU Mobility Scale.

PubMed

Hodgson, Carol; Needham, Dale; Haines, Kimberley; Bailey, Michael; Ward, Alison; Harrold, Megan; Young, Paul; Zanni, Jennifer; Buhr, Heidi; Higgins, Alisa; Presneill, Jeff; Berney, Sue

2014-01-01

The objectives of this study were to develop a scale for measuring the highest level of mobility in adult ICU patients and to assess its feasibility and inter-rater reliability. Growing evidence supports the feasibility, safety and efficacy of early mobilization in the intensive care unit (ICU). However, there are no adequately validated tools to quickly, easily, and reliably describe the mobility milestones of adult patients in ICU. Identifying or developing such a tool is a priority for evaluating mobility and rehabilitation activities for research and clinical care purposes. This study was performed at two ICUs in Australia. Thirty ICU nursing, and physiotherapy staff assessed the feasibility of the 'ICU Mobility Scale' (IMS) using a 10-item questionnaire. The inter-rater reliability of the IMS was assessed by 2 junior physical therapists, 2 senior physical therapists, and 16 nursing staff in 100 consecutive medical, surgical or trauma ICU patients. An 11 point IMS scale was developed based on multidisciplinary input. Participating clinicians reported that the scale was clear, with 95% of respondents reporting that it took <1 min to complete. The junior and senior physical therapists showed the highest inter-rater reliability with a weighted Kappa (95% confidence interval) of 0.83 (0.76-0.90), while the senior physical therapists and nurses and the junior physical therapists and nurses had a weighted Kappa of 0.72 (0.61-0.83) and 0.69 (0.56-0.81) respectively. The IMS is a feasible tool with strong inter-rater reliability for measuring the maximum level of mobility of adult patients in the ICU. Copyright © 2014 Elsevier Inc. All rights reserved.
Measuring the morphological characteristics of thoracolumbar fascia in ultrasound images: an inter-rater reliability study.

PubMed

De Coninck, Kyra; Hambly, Karen; Dickinson, John W; Passfield, Louis

2018-06-01

Chronic lower back pain is still regarded as a poorly understood multifactorial condition. Recently, the thoracolumbar fascia complex has been found to be a contributing factor. Ultrasound imaging has shown that people with chronic lower back pain demonstrate both a significant decrease in shear strain, and a 25% increase in thickness of the thoracolumbar fascia. There is sparse data on whether medical practitioners agree on the level of disorganisation in ultrasound images of thoracolumbar fascia. The purpose of this study was to establish inter-rater reliability of the ranking of architectural disorganisation of thoracolumbar fascia on a scale from 'very disorganised' to 'very organised'. An exploratory analysis was performed using a fully crossed design of inter-rater reliability. Thirty observers were recruited, consisting of 21 medical doctors, 7 physiotherapists and 2 radiologists, with an average of 13.03 ± 9.6 years of clinical experience. All 30 observers independently rated the architectural disorganisation of the thoracolumbar fascia in 30 ultrasound scans, on a Likert-type scale with rankings from 1 = very disorganised to 10 = very organised. Internal consistency was assessed using Cronbach's alpha. Krippendorff's alpha was used to calculate the overall inter-rater reliability. Krippendorff's alpha was 0.61, indicating a modest degree of agreement between observers on the different morphologies of thoracolumbar fascia. Cronbach's alpha (0.98) indicated that there was a high degree of consistency between observers. Experience in ultrasound image analysis did not affect constancy between observers (Cronbach's range between experienced and inexperienced raters: 0.95 and 0.96 respectively). Medical practitioners agree on morphological features such as levels of organisation and disorganisation in ultrasound images of thoracolumbar fascia, regardless of experience. Further analysis by an expert panel is required to develop specific
A Study of the Use of the "e-rater"® Scoring Engine for the Analytical Writing Measure of the "GRE"® revised General Test. Research Report. ETS RR-14-24

ERIC Educational Resources Information Center

Breyer, F. Jay; Attali, Yigal; Williamson, David M.; Ridolfi-McCulla, Laura; Ramineni, Chaitanya; Duchnowski, Matthew; Harris, April

2014-01-01

In this research, we investigated the feasibility of implementing the "e-rater"® scoring engine as a check score in place of all-human scoring for the "Graduate Record Examinations"® ("GRE"®) revised General Test (rGRE) Analytical Writing measure. This report provides the scientific basis for the use of e-rater as a…
A FACETS Analysis of Rater Bias in Measuring Japanese Second Language Writing Performance.

ERIC Educational Resources Information Center

Kondo-Brown, Kimi

2002-01-01

Using FACETS, investigates how judgments of trained teacher raters are biased toward certain types of candidates and certain criteria in assessing Japanese second language writing. Explores the potential for using a modified version of a rating scale for norm-referenced decisions about Japanese second language writing ability. (Author/VWL)
On individual differences in person perception: raters' personality traits relate to their psychopathy checklist-revised scoring tendencies.

PubMed

Miller, Audrey K; Rufino, Katrina A; Boccaccini, Marcus T; Jackson, Rebecca L; Murrie, Daniel C

2011-06-01

This study investigated raters' personality traits in relation to scores they assigned to offenders using the Psychopathy Checklist-Revised (PCL-R). A total of 22 participants, including graduate students and faculty members in clinical psychology programs, completed a PCL-R training session, independently scored four criminal offenders using the PCL-R, and completed a comprehensive measure of their own personality traits. A priori hypotheses specified that raters' personality traits, and their similarity to psychopathy characteristics, would relate to raters' PCL-R scoring tendencies. As hypothesized, some raters assigned consistently higher scores on the PCL-R than others, especially on PCL-R Facets 1 and 2. Also as hypothesized, raters' scoring tendencies related to their own personality traits (e.g., higher rater Agreeableness was associated with lower PCL-R Interpersonal facet scoring). Overall, findings underscore the need for future research to examine the role of evaluator characteristics on evaluation results and the need for clinical training to address evaluators' personality influences on their ostensibly objective evaluations.
Test-re-test reliability and inter-rater reliability of a digital pelvic inclinometer in young, healthy males and females.

PubMed

Beardsley, Chris; Egerton, Tim; Skinner, Brendon

2016-01-01

Objective. The purpose of this study was to investigate the reliability of a digital pelvic inclinometer (DPI) for measuring sagittal plane pelvic tilt in 18 young, healthy males and females. Method. The inter-rater reliability and test-re-test reliabilities of the DPI for measuring pelvic tilt in standing on both the right and left sides of the pelvis were measured by two raters carrying out two rating sessions of the same subjects, three weeks apart. Results. For measuring pelvic tilt, inter-rater reliability was designated as good on both sides (ICC = 0.81-0.88), test-re-test reliability within a single rating session was designated as good on both sides (ICC = 0.88-0.95), and test-re-test reliability between two rating sessions was designated as moderate on the left side (ICC = 0.65) and good on the right side (ICC = 0.85). Conclusion. Inter-rater reliability and test-re-test reliability within a single rating session of the DPI in measuring pelvic tilt were both good, while test-re-test reliability between rating sessions was moderate-to-good. Caution is required regarding the interpretation of the test-re-test reliability within a single rating session, as the raters were not blinded. Further research is required to establish validity.
Measuring Cyclic Error in Laser Heterodyne Interferometers

NASA Technical Reports Server (NTRS)

Ryan, Daniel; Abramovici, Alexander; Zhao, Feng; Dekens, Frank; An, Xin; Azizi, Alireza; Chapsky, Jacob; Halverson, Peter

2010-01-01

An improved method and apparatus have been devised for measuring cyclic errors in the readouts of laser heterodyne interferometers that are configured and operated as displacement gauges. The cyclic errors arise as a consequence of mixing of spurious optical and electrical signals in beam launchers that are subsystems of such interferometers. The conventional approach to measurement of cyclic error involves phase measurements and yields values precise to within about 10 pm over air optical paths at laser wavelengths in the visible and near infrared. The present approach, which involves amplitude measurements instead of phase measurements, yields values precise to about 0.1 microns, about 100 times the precision of the conventional approach. In a displacement gauge of the type of interest here, the laser heterodyne interferometer is used to measure any change in distance along an optical axis between two corner-cube retroreflectors. One of the corner-cube retroreflectors is mounted on a piezoelectric transducer (see figure), which is used to introduce a low-frequency periodic displacement that can be measured by the gauges. The transducer is excited at a frequency of 9 Hz by a triangular waveform to generate a 9-Hz triangular-wave displacement having an amplitude of 25 microns. The displacement gives rise to both amplitude and phase modulation of the heterodyne signals in the gauges. The modulation includes cyclic error components, and the magnitude of the cyclic-error component of the phase modulation is what one needs to measure in order to determine the magnitude of the cyclic displacement error. The precision attainable in the conventional (phase measurement) approach to measuring cyclic error is limited because the phase measurements are af-
Rater Wealth Predicts Perceptions of Outgroup Competence

PubMed Central

Chan, Wayne; McCrae, Robert R.; Rogers, Darrin L.; Weimer, Amy A.; Greenberg, David M.; Terracciano, Antonio

2011-01-01

National income has a pervasive influence on the perception of ingroup stereotypes, with high status and wealthy targets perceived as more competent. In two studies we investigated the degree to which economic wealth of raters related to perceptions of outgroup competence. Raters’ economic wealth predicted trait ratings when 1) raters in 48 other cultures rated Americans’ competence and 2) Mexican Americans rated Anglo Americans’ competence. Rater wealth also predicted ratings of interpersonal warmth on the culture level. In conclusion, raters’ economic wealth, either nationally or individually, is significantly associated with perception of outgroup members, supporting the notion that ingroup conditions or stereotypes function as frames of reference in evaluating outgroup traits. PMID:22379232

Measurement system and model for simultaneously measuring 6DOF geometric errors.

PubMed

Zhao, Yuqiong; Zhang, Bin; Feng, Qibo

2017-09-04

A measurement system to simultaneously measure six degree-of-freedom (6DOF) geometric errors is proposed. The measurement method is based on a combination of mono-frequency laser interferometry and laser fiber collimation. A simpler and more integrated optical configuration is designed. To compensate for the measurement errors introduced by error crosstalk, element fabrication error, laser beam drift, and nonparallelism of the two measurement beams, a unified measurement model, which can improve the measurement accuracy, is deduced and established using the ray-tracing method. A numerical simulation using the optical design software Zemax is conducted, and the results verify the correctness of the model. Several experiments are performed to demonstrate the feasibility and effectiveness of the proposed system and measurement model.
The effect of rater training on scoring performance and scale-specific expertise amongst occupational therapists participating in a multicentre study: a single-group pre-post-test study.

PubMed

Hansen, Tina; Elholm Madsen, Esben; Sørensen, Annette

2016-01-01

In order to enhance the quality of the data collected in a multicentre validation study of a revised Danish version of the McGill Ingestive Skills Assessment (MISA), the authors developed a rater training programme. The purpose of the present study was to evaluate the effect of the training on scoring performance and scale-specific expertise amongst raters. During 2 days of rater training, 81 occupational therapists (OTs) were qualified to observe and score dysphagic clients' mealtime performance according to the criteria of 36 MISA-items. The training effects were evaluated pre- to post-training using percentage exact agreement (PA) of scored MISA items of a case-vignette and a Likert scale self-report of scale-specific expertise. PA increased significantly from pre- to post-training (Z = -4.404, p < 0.001), although items for which the case-vignette reflected deficient mealtime performance appeared most difficult to score. The OTs' scale-specific expertise improved significantly (knowledge: Z = -7.857, p < 0.001 and confidence: Z = -7.838, p < 0.001). Rater training improved OTs' scoring performance when using the Danish MISA as well as their perceived scale-specific expertise. Future rater training should emphasise the items identified as those most difficult to score. Additionally, further studies addressing different training approaches and durations are warranted. When occupational therapists (OTs) use the McGill Ingestive Skills Assessment (MISA) they observe, interpret and record occupational performance of dysphagic clients participating in a meal. This is a highly complex task, which might introduce unwanted variability in measurement scores. A 2-day rater training programme was developed and this builds on the findings of several studies. These suggest that combinations of different training methods tend to yield the most effective results. Participation in the newly developed training programme on how to administer the MISA significantly reduces unwanted
Establishing Immediate Reliability of Sonographic Measurements of the Transversus Abdominis in Asymptomatic Adults Performing Upright Loaded Functional Tasks in a Clinical Context Without Delayed Recorded Measurement.

PubMed

McPherson, Sue; Watson, Todd; Pate, Lindsey

2016-08-01

This study examined the reliability of sonographic measurements of the transversus abdominis of adults without low back pain during upright loaded functional tasks in real time, without relying on delayed recorded images. A single-group repeated-measures reliability study was conducted on 12 healthy participants without low back pain. Six of these adults reported a prior history of abdominal drawing-in maneuver training without sonographic measurement. The participants performed 3 trials of neutral standing, a loaded forward reach, and a loaded box lift under rest and with abdominal drawing-in maneuver instructions; task order was randomized. Transversus abdominis thickness measurements were obtained by an experienced rater using B-mode sonography in real time via electronic calipers, twice on the same static image during all trials. The rater was masked to group assignment and on-screen measurement output and required to respond to trivia questions between repeated measurements. The participants included 6 male and 6 female adults with a mean age ± SD of 26.3 ± 3.7 years. Intra-rater intraclass correlation coefficients (2,3) were high and precise for the rater's first and second measurements for all tasks and instruction conditions for mean transversus abdominis thickness and percent change in thickness measurements (eg, ranges were 0.968-0.997 for intraclass correlation coefficients, 0.01-0.21 mm for standard errors of the measurement, and 0.01-0.58 mm for minimal detectable changes). Calipers cleared by the rater or a research assistant produced similar findings of excellent reliability and precision. High intra-rater reliability and precision of transversus abdominis thickness measurements were obtained by a physical therapist in real time from asymptomatic adults performing upright loaded functional tasks under rest and with abdominal drawing-in maneuver instructions.
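Several records in this listing report a standard error of measurement (SEM) and a minimal detectable change (MDC) alongside an ICC. A minimal sketch of the commonly used formulas follows, assuming SEM = SD × √(1 − ICC) and MDC = z × √2 × SEM. The abstracts do not state which variant the authors used, and the function names and input values here are hypothetical.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement from a between-subject SD and an ICC."""
    return sd * math.sqrt(1.0 - icc)


def mdc(sem, z=1.645):
    """Minimal detectable change; z=1.645 gives MDC90, z=1.96 gives MDC95.

    The sqrt(2) factor reflects that a change score is the difference
    of two measurements, each carrying its own measurement error.
    """
    return z * math.sqrt(2.0) * sem


# Hypothetical inputs: SD of 0.5 mm across subjects, ICC of 0.96.
example_sem = sem_from_icc(0.5, 0.96)
example_mdc90 = mdc(example_sem)
example_mdc95 = mdc(example_sem, z=1.96)
```

Under these assumptions, a higher ICC or a more homogeneous sample (smaller SD) both shrink the SEM, and with it the smallest change that can be distinguished from measurement noise.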
Influence of measurement error on Maxwell's demon

NASA Astrophysics Data System (ADS)

Sørdal, Vegard; Bergli, Joakim; Galperin, Y. M.

2017-06-01

In any general cycle of measurement, feedback, and erasure, the measurement will reduce the entropy of the system when information about the state is obtained, while erasure, according to Landauer's principle, is accompanied by a corresponding increase in entropy due to the compression of logical and physical phase space. The total process can in principle be fully reversible. A measurement error reduces the information obtained and the entropy decrease in the system. The erasure still gives the same increase in entropy, and the total process is irreversible. Another consequence of measurement error is that a bad feedback is applied, which further increases the entropy production if the proper protocol adapted to the expected error rate is not applied. We consider the effect of measurement error on a realistic single-electron box Szilard engine, and we find the optimal protocol for the cycle as a function of the desired power P and error ɛ.
Intertester agreement in refractive error measurements.

PubMed

Huang, Jiayan; Maguire, Maureen G; Ciner, Elise; Kulp, Marjean T; Quinn, Graham E; Orel-Bixler, Deborah; Cyert, Lynn A; Moore, Bruce; Ying, Gui-Shuang

2013-10-01

To determine the intertester agreement of refractive error measurements between lay and nurse screeners using the Retinomax Autorefractor and the SureSight Vision Screener. Trained lay and nurse screeners measured refractive error in 1452 preschoolers (3 to 5 years old) using the Retinomax and the SureSight in a random order for screeners and instruments. Intertester agreement between lay and nurse screeners was assessed for sphere, cylinder, and spherical equivalent (SE) using the mean difference and the 95% limits of agreement. The mean intertester difference (lay minus nurse) was compared between groups defined based on the child's age, cycloplegic refractive error, and the reading's confidence number using analysis of variance. The limits of agreement were compared between groups using the Brown-Forsythe test. Intereye correlation was accounted for in all analyses. The mean intertester differences (95% limits of agreement) were -0.04 (-1.63, 1.54) diopter (D) sphere, 0.00 (-0.52, 0.51) D cylinder, and -0.04 (-1.65, 1.56) D SE for the Retinomax and 0.05 (-1.48, 1.58) D sphere, 0.01 (-0.58, 0.60) D cylinder, and 0.06 (-1.45, 1.57) D SE for the SureSight. For either instrument, the mean intertester differences in sphere and SE did not differ by the child's age, cycloplegic refractive error, or the reading's confidence number. However, for both instruments, the limits of agreement were wider when eyes had significant refractive error or the reading's confidence number was below the manufacturer's recommended value. Among Head Start preschool children, trained lay and nurse screeners agree well in measuring refractive error using the Retinomax or the SureSight. Both instruments had similar intertester agreement in refractive error measurements independent of the child's age. Significant refractive error and a reading with low confidence number were associated with worse intertester agreement.
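The mean difference and 95% limits of agreement reported above can be illustrated with a short sketch. It assumes the simplest form, mean difference ± 1.96 × SD of the paired differences, and ignores the intereye correlation that the study accounted for. The function name and the paired readings are hypothetical.

```python
from statistics import mean, stdev


def limits_of_agreement(rater_a, rater_b, z=1.96):
    """Mean paired difference (a minus b) and its 95% limits of agreement."""
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = mean(diffs)
    spread = z * stdev(diffs)  # sample SD of the paired differences
    return bias, bias - spread, bias + spread


# Hypothetical sphere readings in diopters for four eyes.
lay = [1.0, 2.0, 3.0, 4.0]
nurse = [1.5, 1.5, 3.5, 3.5]
bias, lower, upper = limits_of_agreement(lay, nurse)
```

Because the limits are symmetric about the mean difference, a small mean difference always yields one negative and one positive limit, which is a quick sanity check when reading reported intervals.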
An Investigation of Rater Cognition in the Assessment of Projects

ERIC Educational Resources Information Center

Crisp, Victoria

2012-01-01

In the United Kingdom, the majority of national assessments involve human raters. The processes by which raters determine the scores to award are central to the assessment process and affect the extent to which valid inferences can be made from assessment outcomes. Thus, understanding rater cognition has become a growing area of research in the…
Rater Accuracy and Training Group Effects in Expert- and Supervisor-Based Monitoring Systems

ERIC Educational Resources Information Center

Baird, Jo-Anne; Meadows, Michelle; Leckie, George; Caro, Daniel

2017-01-01

This study evaluated rater accuracy with rater-monitoring data from high stakes examinations in England. Rater accuracy was estimated with cross-classified multilevel modelling. The data included face-to-face training and monitoring of 567 raters in 110 teams, across 22 examinations, giving a total of 5500 data points. Two rater-monitoring systems…
A novel approach to rater training and certification in multinational trials.

PubMed

Jeglic, Elizabeth; Kobak, Kenneth A; Engelhardt, Nina; Williams, Janet B W; Lipsitz, Joshua D; Salvucci, Donna; Bryson, Heather; Bellew, Kevin

2007-07-01

Clinical trials are becoming increasingly international in scope. Global studies pose unique challenges in training and calibrating raters owing to language and cultural differences. Recent findings that poorly conducted interviews reduce study power make attention to raters' clinical skills critical. In this study, 109 raters from 14 countries went through a two-step certification process on the Hamilton Depression and Anxiety Rating Scales: (i) an online didactic tutorial on scoring conventions, and (ii) applied clinical training, consisting of small language-specific groups in which raters took turns interviewing patients while observed by an expert trainer, and observation and evaluation of individual interviews. Translators were used when native-language trainers were unavailable. Those who were unable to attend the startup meeting received the training individually via telephone. Results found a significant improvement in raters' knowledge of scoring conventions, with the mean number of correct answers on the 20-item test improving from 14.59 to 17.83, P<0.0001. In addition, raters' clinical skills improved significantly, with the mean score on the Rater Applied Performance Scale improving from their first to their second testing from 10.25 to 11.31, P=0.003. These results support the efficacy of this applied training model in improving raters' applied clinical skills in multinational trials.
Rating Written Performance: What Do Raters Do and Why?

ERIC Educational Resources Information Center

Kuiken, Folkert; Vedder, Ineke

2014-01-01

This study investigates the relationship in L2 writing between raters' judgments of communicative adequacy and linguistic complexity by means of six-point Likert scales, and general measures of linguistic performance. The participants were 39 learners of Italian and 32 of Dutch, who wrote two short argumentative essays. The same writing tasks…
The new GRID Hamilton Rating Scale for Depression demonstrates excellent inter-rater reliability for inexperienced and experienced raters before and after training.

PubMed

Tabuse, Hideaki; Kalali, Amir; Azuma, Hideki; Ozaki, Norio; Iwata, Nakao; Naitoh, Hiroshi; Higuchi, Teruhiko; Kanba, Shigenobu; Shioe, Kunihiko; Akechi, Tatsuo; Furukawa, Toshi A

2007-09-30

The Hamilton Rating Scale for Depression (HAMD) is the de facto international gold standard for the assessment of depression. There are some criticisms, however, especially with regard to its inter-rater reliability, due to the lack of standardized questions or explicit scoring procedures. The GRID-HAMD was developed to provide standardized explicit scoring conventions and a structured interview guide for administration and scoring of the HAMD. We developed the Japanese version of the GRID-HAMD and examined its inter-rater reliability among experienced and inexperienced clinicians (n=70), how rater characteristics may affect it, and how training can improve it in the course of a model training program using videotaped interviews. The results showed that the inter-rater reliability of the GRID-HAMD total score was excellent to almost perfect and those of most individual items were also satisfactory to excellent, both with experienced and inexperienced raters, and both before and after the training. With its standardized definitions, questions and detailed scoring conventions, the GRID-HAMD appears to be the best achievable set of interview guides for the HAMD and can provide a solid tool for highly reliable assessment of depression severity.
Comparing Measurement Error between Two Different Methods of Measurement of Various Magnitudes

ERIC Educational Resources Information Center

Zavorsky, Gerald S.

2010-01-01

Measurement error is a common problem in several fields of research such as medicine, physiology, and exercise science. The standard deviation of repeated measurements on the same person is the measurement error. One way of presenting measurement error is called the repeatability, which is 2.77 multiplied by the within subject standard deviation.…
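The repeatability coefficient described above can be sketched in a few lines. This assumes a balanced design (the same number of repeats for every subject), so the within-subject SD is the square root of the mean per-subject variance. The factor 2.77 is approximately 1.96 × √2. The function names and the data are hypothetical.

```python
import math
from statistics import mean, variance


def within_subject_sd(repeats_by_subject):
    """Within-subject SD: sqrt of the mean per-subject sample variance.

    repeats_by_subject: list of lists, one inner list of repeated
    measurements per subject (equal number of repeats assumed).
    """
    return math.sqrt(mean(variance(r) for r in repeats_by_subject))


def repeatability(repeats_by_subject):
    """Repeatability coefficient: 2.77 times the within-subject SD."""
    return 2.77 * within_subject_sd(repeats_by_subject)


# Hypothetical data: three subjects, two measurements each.
toy_repeats = [[10, 12], [20, 22], [30, 32]]
sw = within_subject_sd(toy_repeats)
rc = repeatability(toy_repeats)
```

Two measurements on the same person are expected to differ by less than the repeatability coefficient in about 95% of pairs, which is where the 1.96 × √2 factor comes from.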
Fusing metabolomics data sets with heterogeneous measurement errors

PubMed Central

Waaijenborg, Sandra; Korobko, Oksana; Willems van Dijk, Ko; Lips, Mirjam; Hankemeier, Thomas; Wilderjans, Tom F.; Smilde, Age K.

2018-01-01

Combining different metabolomics platforms can contribute significantly to the discovery of complementary processes expressed under different conditions. However, analysing the fused data might be hampered by the difference in their quality. In metabolomics data, one often observes that measurement errors increase with increasing measurement level and that different platforms have different measurement error variance. In this paper we compare three different approaches to correct for the measurement error heterogeneity, by transformation of the raw data, by weighted filtering before modelling and by a modelling approach using a weighted sum of residuals. For an illustration of these different approaches we analyse data from healthy obese and diabetic obese individuals, obtained from two metabolomics platforms. Concluding, the filtering and modelling approaches that both estimate a model of the measurement error did not outperform the data transformation approaches for this application. This is probably due to the limited difference in measurement error and the fact that estimation of measurement error models is unstable due to the small number of repeats available. A transformation of the data improves the classification of the two groups. PMID:29698490
Rater variables associated with ITER ratings.

PubMed

Paget, Michael; Wu, Caren; McIlwrick, Joann; Woloschuk, Wayne; Wright, Bruce; McLaughlin, Kevin

2013-10-01

Advocates of holistic assessment consider the ITER a more authentic way to assess performance. But this assessment format is subjective and, therefore, susceptible to rater bias. Here our objective was to study the association between rater variables and ITER ratings. In this observational study our participants were clerks at the University of Calgary and preceptors who completed online ITERs between February 2008 and July 2009. Our outcome variable was global rating on the ITER (rated 1-5), and we used a generalized estimating equation model to identify variables associated with this rating. Students were rated "above expected level" or "outstanding" on 66.4 % of 1050 online ITERs completed during the study period. Two rater variables attenuated ITER ratings: the log transformed time taken to complete the ITER [β = -0.06, 95 % confidence interval (-0.10, -0.02), p = 0.002], and the number of ITERs that a preceptor completed over the time period of the study [β = -0.008 (-0.02, -0.001), p = 0.02]. In this study we found evidence of leniency bias that resulted in two thirds of students being rated above expected level of performance. This leniency bias appeared to be attenuated by delay in ITER completion, and was also blunted in preceptors who rated more students. As all biases threaten the internal validity of the assessment process, further research is needed to confirm these and other sources of rater bias in ITER ratings, and to explore ways of limiting their impact.
The Relationship between Lexical Frequency Profiling Measures and Rater Judgements of Spoken and Written General English Language Proficiency on the CELPIP-General Test

ERIC Educational Resources Information Center

Douglas, Scott Roy

2015-01-01

Independent confirmation that vocabulary in use unfolds across levels of performance as expected can contribute to a more complete understanding of validity in standardized English language tests. This study examined the relationship between Lexical Frequency Profiling (LFP) measures and rater judgements of test-takers' overall levels of…
A Qualitative Analysis of Rater Behavior on an L2 Speaking Assessment

ERIC Educational Resources Information Center

Kim, Hyun Jung

2015-01-01

Human raters are normally involved in L2 performance assessment; as a result, rater behavior has been widely investigated to reduce rater effects on test scores and to provide validity arguments. Yet raters' cognition and use of rubrics in their actual rating have rarely been explored qualitatively in L2 speaking assessments. In this study three…
Assessing disease severity: accuracy and reliability of rater estimates in relation to number of diagrams in a standard area diagram set

USDA-ARS's Scientific Manuscript database

Errors in rater estimates of plant disease severity occur, and standard area diagrams (SADs) help improve accuracy and reliability. The effect of diagram number in a SAD set on accuracy and reliability is unknown. The objective of this study was to compare estimates of pecan scab severity made witho...
A sequential test for assessing observed agreement between raters.

PubMed

Bersimis, Sotiris; Sachlas, Athanasios; Chakraborti, Subha

2018-01-01

Assessing the agreement between two or more raters is an important topic in medical practice. Existing techniques, which deal with categorical data, are based on contingency tables. This is often an obstacle in practice as we have to wait for a long time to collect the appropriate sample size of subjects to construct the contingency table. In this paper, we introduce a nonparametric sequential test for assessing agreement, which can be applied as data accrue and does not require a contingency table, facilitating a rapid assessment of the agreement. The proposed test is based on the cumulative sum of the number of disagreements between the two raters and a suitable statistic representing the waiting time until the cumulative sum exceeds a predefined threshold. We treat the cases of testing two raters' agreement with respect to one or more characteristics and using two or more classification categories, the case where the two raters extremely disagree, and finally the case of testing more than two raters' agreement. The numerical investigation shows that the proposed test has excellent performance. Compared to the existing methods, the proposed method appears to require significantly smaller sample size with equivalent power. Moreover, the proposed method is easily generalizable and brings the problem of assessing the agreement between two or more raters and one or more characteristics under a unified framework, thus providing an easy to use tool to medical practitioners. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
The Effect of Year-to-Year Rater Variation on IRT Linking

ERIC Educational Resources Information Center

Yen, Shu Jing; Ochieng, Charles; Michaels, Hillary; Friedman, Greg

2005-01-01

Year-to-year rater variation may result in constructed response (CR) parameter changes, making CR items inappropriate to use in anchor sets for linking or equating. This study demonstrates how rater severity affected the writing and reading scores. Rater adjustments were made to statewide results using an item response theory (IRT) methodology…
Statistically Comparing the Performance of Multiple Automated Raters across Multiple Items

ERIC Educational Resources Information Center

Kieftenbeld, Vincent; Boyer, Michelle

2017-01-01

Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to…
Multiple Indicators, Multiple Causes Measurement Error Models

PubMed Central

Tekwe, Carmen D.; Carter, Randy L.; Cullings, Harry M.; Carroll, Raymond J.

2014-01-01

Multiple Indicators, Multiple Causes Models (MIMIC) are often employed by researchers studying the effects of an unobservable latent variable on a set of outcomes, when causes of the latent variable are observed. There are times however when the causes of the latent variable are not observed because measurements of the causal variable are contaminated by measurement error. The objectives of this paper are: (1) to develop a novel model by extending the classical linear MIMIC model to allow both Berkson and classical measurement errors, defining the MIMIC measurement error (MIMIC ME) model, (2) to develop likelihood based estimation methods for the MIMIC ME model, (3) to apply the newly defined MIMIC ME model to atomic bomb survivor data to study the impact of dyslipidemia and radiation dose on the physical manifestations of dyslipidemia. As a by-product of our work, we also obtain a data-driven estimate of the variance of the classical measurement error associated with an estimate of the amount of radiation dose received by atomic bomb survivors at the time of their exposure. PMID:24962535

The challenges in defining and measuring diagnostic error.

PubMed

Zwaan, Laura; Singh, Hardeep

2015-06-01

Diagnostic errors have emerged as a serious patient safety problem but they are hard to detect and complex to define. At the research summit of the 2013 Diagnostic Error in Medicine 6th International Conference, we convened a multidisciplinary expert panel to discuss challenges in defining and measuring diagnostic errors in real-world settings. In this paper, we synthesize these discussions and outline key research challenges in operationalizing the definition and measurement of diagnostic error. Some of these challenges include 1) difficulties in determining error when the disease or diagnosis is evolving over time and in different care settings, 2) accounting for a balance between underdiagnosis and overaggressive diagnostic pursuits, and 3) determining disease diagnosis likelihood and severity in hindsight. We also build on these discussions to describe how some of these challenges can be addressed while conducting research on measuring diagnostic error.
The Impact of Rater Variability on Relationships among Different Effect-Size Indices for Inter-Rater Agreement between Human and Automated Essay Scoring

ERIC Educational Resources Information Center

Yun, Jiyeo

2017-01-01

Since researchers investigated automatic scoring systems in writing assessments, they have dealt with relationships between human and machine scoring, and then have suggested evaluation criteria for inter-rater agreement. The main purpose of my study is to investigate the magnitudes of and relationships among indices for inter-rater agreement used…
The prediction of satellite ephemeris errors as they result from surveillance system measurement errors

NASA Astrophysics Data System (ADS)

Simmons, B. E.

1981-08-01

This report derives equations predicting satellite ephemeris error as a function of measurement errors of space-surveillance sensors. These equations lend themselves to rapid computation with modest computer resources. They are applicable over prediction times such that measurement errors, rather than uncertainties of atmospheric drag and of Earth shape, dominate in producing ephemeris error. This report describes the specialization of these equations underlying the ANSER computer program, SEEM (Satellite Ephemeris Error Model). The intent is that this report be of utility to users of SEEM for interpretive purposes, and to computer programmers who may need a mathematical point of departure for limited generalization of SEEM.
Impact of exposure measurement error in air pollution epidemiology: effect of error type in time-series studies.

PubMed

Goldman, Gretchen T; Mulholland, James A; Russell, Armistead G; Strickland, Matthew J; Klein, Mitchel; Waller, Lance A; Tolbert, Paige E

2011-06-22

Two distinctly different types of measurement error are Berkson and classical. Impacts of measurement error in epidemiologic studies of ambient air pollution are expected to depend on error type. We characterize measurement error due to instrument imprecision and spatial variability as multiplicative (i.e. additive on the log scale) and model it over a range of error types to assess impacts on risk ratio estimates both on a per measurement unit basis and on a per interquartile range (IQR) basis in a time-series study in Atlanta. Daily measures of twelve ambient air pollutants were analyzed: NO2, NOx, O3, SO2, CO, PM10 mass, PM2.5 mass, and PM2.5 components sulfate, nitrate, ammonium, elemental carbon and organic carbon. Semivariogram analysis was applied to assess spatial variability. Error due to this spatial variability was added to a reference pollutant time-series on the log scale using Monte Carlo simulations. Each of these time-series was exponentiated and introduced to a Poisson generalized linear model of cardiovascular disease emergency department visits. Measurement error resulted in reduced statistical significance for the risk ratio estimates for all amounts (corresponding to different pollutants) and types of error. When modelled as classical-type error, risk ratios were attenuated, particularly for primary air pollutants, with average attenuation in risk ratios on a per unit of measurement basis ranging from 18% to 92% and on an IQR basis ranging from 18% to 86%. When modelled as Berkson-type error, risk ratios per unit of measurement were biased away from the null hypothesis by 2% to 31%, whereas risk ratios per IQR were attenuated (i.e. biased toward the null) by 5% to 34%. For CO modelled error amount, a range of error types were simulated and effects on risk ratio bias and significance were observed. For multiplicative error, both the amount and type of measurement error impact health effect estimates in air pollution epidemiology. By modelling
Incorporating measurement error in n = 1 psychological autoregressive modeling

PubMed Central

Schuurman, Noémi K.; Houtveen, Jan H.; Hamaker, Ellen L.

2015-01-01

Measurement error is omnipresent in psychological data. However, the vast majority of applications of autoregressive time series analyses in psychology do not take measurement error into account. Disregarding measurement error when it is present in the data results in a bias of the autoregressive parameters. We discuss two models that take measurement error into account: An autoregressive model with a white noise term (AR+WN), and an autoregressive moving average (ARMA) model. In a simulation study we compare the parameter recovery performance of these models, and compare this performance for both a Bayesian and frequentist approach. We find that overall, the AR+WN model performs better. Furthermore, we find that for realistic (i.e., small) sample sizes, psychological research would benefit from a Bayesian approach in fitting these models. Finally, we illustrate the effect of disregarding measurement error in an AR(1) model by means of an empirical application on mood data in women. We find that, depending on the person, approximately 30–50% of the total variance was due to measurement error, and that disregarding this measurement error results in a substantial underestimation of the autoregressive parameters. PMID:26283988
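The attenuation described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' AR+WN estimation procedure; it only assumes a latent AR(1) process observed with independent white measurement noise, and all parameter values and function names are hypothetical. With these settings roughly 43% of the observed variance is noise, and the naive lag-1 autocorrelation of the observed series falls well below the true autoregressive parameter.

```python
import random


def simulate_ar1_with_noise(phi, n, innovation_sd, noise_sd, seed=0):
    """Return (latent, observed): a latent AR(1) series and the same series
    with independent white measurement noise added (assumed error model)."""
    rng = random.Random(seed)
    latent, observed = [], []
    state = 0.0
    for _ in range(n):
        state = phi * state + rng.gauss(0.0, innovation_sd)
        latent.append(state)
        observed.append(state + rng.gauss(0.0, noise_sd))
    return latent, observed


def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation, a simple moment estimate of the AR(1) parameter."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((x - mean) ** 2 for x in series)
    return num / den


# Hypothetical settings: true phi = 0.5, unit innovation and noise variances.
latent, observed = simulate_ar1_with_noise(phi=0.5, n=20000, innovation_sd=1.0, noise_sd=1.0)
phi_latent = lag1_autocorrelation(latent)      # close to the true phi
phi_observed = lag1_autocorrelation(observed)  # attenuated by the noise share
```

Under this error model the expected attenuation factor is the ratio of latent variance to total observed variance, which is why ignoring the noise term underestimates the autoregressive parameter.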
A refined definition improves the measurement reliability of the tip-apex distance.

PubMed

Sakagoshi, Daigo; Sawaguchi, Takeshi; Shima, Yosuke; Inoue, Daisuke; Oshima, Takeshi; Goldhahn, Sabine

2016-07-01

Tip-apex distance (TAD) is reported as a good predictor for cut-outs of lag screws and spiral blades in the treatment of intertrochanteric fractures, and surgeons are advised to strive for TAD within 20 mm. However, the femoral neck axis and the position of the lower limb in the lateral radiograph are not clearly defined and can lead to measurement errors. We propose a refined TAD by defining these factors. The objective of this study was to analyze the reliability of this refined TAD. The radiographs of 130 prospective cases with unstable trochanteric fractures were used for the analysis of the refined TAD. The refined TAD was independently measured by 2 raters with clinical experience of more than 10 years (raters 1 and 2) and 2 raters with much less clinical experience (raters 3 and 4) after they received training in the new measurement method. The intraclass correlation coefficient (ICC [2,4]) was calculated to assess the interrater reliability. The mean refined TADs were 18.2, 18.4, 18.2, and 18.2 mm for raters 1, 2, 3, and 4, respectively. There was a strong correlation among all four raters (ICC 0.998; 95% CI: 0.998, 0.999). Regardless of the clinical experience of raters, the refined TAD is a reliable tool and can be used to develop new TAD recommendations for predicting failure of fixation. Future studies with larger samples are needed to evaluate the predictive value of the refined TAD. Copyright © 2016 The Japanese Orthopaedic Association. Published by Elsevier B.V. All rights reserved.
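For readers unfamiliar with the notation, the following is a minimal sketch of how a two-way random-effects ICC can be computed from a subjects-by-raters table. It assumes that "ICC [2,4]" in the abstract denotes the Shrout-Fleiss ICC(2,k) form with k = 4 raters; the function name and the small ratings table are illustrative only and are not the study's data.

```python
def icc_two_way_random(ratings):
    """Return (ICC(2,1), ICC(2,k)) for an n-subjects x k-raters table,
    using two-way ANOVA mean squares (Shrout-Fleiss formulas)."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


# Illustrative ratings: 6 subjects (rows) scored by 4 raters (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_single, icc_average = icc_two_way_random(example)
```

The average-measures value is always at least as large as the single-rater value, which is worth keeping in mind when comparing ICCs reported in different forms.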
Error-compensation model for simultaneous measurement of five degrees of freedom motion errors of a rotary axis

NASA Astrophysics Data System (ADS)

Bao, Chuanchen; Li, Jiakun; Feng, Qibo; Zhang, Bin

2018-07-01

This paper introduces an error-compensation model for our measurement method to measure five motion errors of a rotary axis based on fibre laser collimation. The error-compensation model is established in a matrix form using the homogeneous coordinate transformation theory. The influences of the installation errors, error crosstalk, and manufacturing errors are analysed. The model is verified by both ZEMAX simulation and measurement experiments. The repeatability values of the radial and axial motion errors are significantly suppressed by more than 50% after compensation. The repeatability experiments of five degrees of freedom motion errors and the comparison experiments of two degrees of freedom motion errors of an indexing table were performed by our measuring device and a standard instrument. The results show that the repeatability values of the angular positioning error ε_z and tilt motion error around the Y axis ε_y are 1.2″ and 4.4″, and the comparison deviations of the two motion errors are 4.0″ and 4.4″, respectively. The repeatability values of the radial and axial motion errors, δ_y and δ_z, are 1.3 and 0.6 µm, respectively. The repeatability value of the tilt motion error around the X axis ε_x is 3.8″.
Multiple indicators, multiple causes measurement error models

DOE PAGES

Tekwe, Carmen D.; Carter, Randy L.; Cullings, Harry M.; ...

2014-06-25

Multiple indicators, multiple causes (MIMIC) models are often employed by researchers studying the effects of an unobservable latent variable on a set of outcomes, when causes of the latent variable are observed. There are times, however, when the causes of the latent variable are not observed because measurements of the causal variable are contaminated by measurement error. The objectives of this study are as follows: (i) to develop a novel model by extending the classical linear MIMIC model to allow both Berkson and classical measurement errors, defining the MIMIC measurement error (MIMIC ME) model; (ii) to develop likelihood-based estimation methods for the MIMIC ME model; and (iii) to apply the newly defined MIMIC ME model to atomic bomb survivor data to study the impact of dyslipidemia and radiation dose on the physical manifestations of dyslipidemia. Finally, as a by-product of our work, we also obtain a data-driven estimate of the variance of the classical measurement error associated with an estimate of the amount of radiation dose received by atomic bomb survivors at the time of their exposure.
Alcohol consumption, beverage prices and measurement error.

PubMed

Young, Douglas J; Bielinska-Kwapisz, Agnieszka

2003-03-01

Alcohol price data collected by the American Chamber of Commerce Researchers Association (ACCRA) have been widely used in studies of alcohol consumption and related behaviors. A number of problems with these data suggest that they contain substantial measurement error, which biases conventional statistical estimators toward a finding of little or no effect of prices on behavior. We test for measurement error, assess the magnitude of the bias and provide an alternative estimator that is likely to be superior. The study utilizes data on per capita alcohol consumption across U.S. states and the years 1982-1997. State and federal alcohol taxes are used as instrumental variables for prices. Formal tests strongly confirm the hypothesis of measurement error. Instrumental variable estimates of the price elasticity of demand range from -0.53 to -1.24. These estimates are substantially larger in absolute value than ordinary least squares estimates, which sometimes are not significantly different from zero or even positive. The ACCRA price data are substantially contaminated with measurement error, but using state and federal taxes as instrumental variables mitigates the problem.
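The logic of using taxes as instruments can be sketched with simulated data. The example below is not the authors' model or data: it assumes a single instrument, a single mismeasured regressor, classical additive error on the log scale, and an arbitrary true elasticity of -1.0. All variable names are hypothetical.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """Simple instrumental-variable slope with one instrument z for regressor x."""
    return covariance(z, y) / covariance(z, x)


rng = random.Random(42)
true_elasticity = -1.0  # assumed value for the simulation only
tax, observed_price, consumption = [], [], []
for _ in range(20000):
    z = rng.gauss(0.0, 1.0)                  # tax (instrument), measured without error
    true_price = z + rng.gauss(0.0, 1.0)     # true log price, partly driven by tax
    tax.append(z)
    observed_price.append(true_price + rng.gauss(0.0, 1.0))  # price with classical error
    consumption.append(true_elasticity * true_price + rng.gauss(0.0, 0.5))

ols_estimate = ols_slope(observed_price, consumption)     # biased toward zero
iv_estimate = iv_slope(tax, observed_price, consumption)  # close to true_elasticity
```

Because the instrument is correlated with the true price but not with the measurement error, the ratio of covariances recovers the slope that ordinary least squares attenuates.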
Slope Error Measurement Tool for Solar Parabolic Trough Collectors: Preprint

DOE Office of Scientific and Technical Information (OSTI.GOV)

Stynes, J. K.; Ihas, B.

2012-04-01

The National Renewable Energy Laboratory (NREL) has developed an optical measurement tool for parabolic solar collectors that measures the combined errors due to absorber misalignment and reflector slope error. The combined absorber alignment and reflector slope errors are measured using a digital camera to photograph the reflected image of the absorber in the collector. Previous work using the image of the reflection of the absorber finds the reflector slope errors from the reflection of the absorber and an independent measurement of the absorber location. The accuracy of the reflector slope error measurement is thus dependent on the accuracy of the absorber location measurement. By measuring the combined reflector-absorber errors, the uncertainty in the absorber location measurement is eliminated. The related performance merit, the intercept factor, depends on the combined effects of the absorber alignment and reflector slope errors. Measuring the combined effect provides a simpler measurement and a more accurate input to the intercept factor estimate. The minimal equipment and setup required for this measurement technique make it ideal for field measurements.
Reference-free error estimation for multiple measurement methods.

PubMed

Madan, Hennadii; Pernuš, Franjo; Špiclin, Žiga

2018-01-01

We present a computational framework to select the most accurate and precise method of measurement of a certain quantity, when there is no access to the true value of the measurand. A typical use case is when several image analysis methods are applied to measure the value of a particular quantitative imaging biomarker from the same images. The accuracy of each measurement method is characterized by systematic error (bias), which is modeled as a polynomial in true values of measurand, and the precision as random error modeled with a Gaussian random variable. In contrast to previous works, the random errors are modeled jointly across all methods, thereby enabling the framework to analyze measurement methods based on similar principles, which may have correlated random errors. Furthermore, the posterior distribution of the error model parameters is estimated from samples obtained by Markov chain Monte Carlo and analyzed to estimate the parameter values and the unknown true values of the measurand. The framework was validated on six synthetic and one clinical dataset containing measurements of total lesion load, a biomarker of neurodegenerative diseases, which was obtained with four automatic methods by analyzing brain magnetic resonance images. The estimates of bias and random error were in good agreement with the corresponding least squares regression estimates against a reference.
Aliasing errors in measurements of beam position and ellipticity

NASA Astrophysics Data System (ADS)

Ekdahl, Carl

2005-09-01

Beam position monitors (BPMs) are used in accelerators and ion experiments to measure currents, position, and azimuthal asymmetry. These usually consist of discrete arrays of electromagnetic field detectors, with detectors located at several equally spaced azimuthal positions at the beam tube wall. The discrete nature of these arrays introduces systematic errors into the data, independent of uncertainties resulting from signal noise, lack of recording dynamic range, etc. Computer simulations were used to understand and quantify these aliasing errors. If required, aliasing errors can be significantly reduced by employing more than the usual four detectors in the BPMs. These simulations show that the error in measurements of the centroid position of a large beam is indistinguishable from the error in the position of a filament. The simulations also show that aliasing errors in the measurement of beam ellipticity are very large unless the beam is accurately centered. The simulations were used to quantify the aliasing errors in beam parameter measurements during early experiments on the DARHT-II accelerator, demonstrating that they affected the measurements only slightly, if at all.
Modeling Errors in Daily Precipitation Measurements: Additive or Multiplicative?

NASA Technical Reports Server (NTRS)

Tian, Yudong; Huffman, George J.; Adler, Robert F.; Tang, Ling; Sapiano, Matthew; Maggioni, Viviana; Wu, Huan

2013-01-01

The definition and quantification of uncertainty depend on the error model used. For uncertainties in precipitation measurements, two types of error models have been widely adopted: the additive error model and the multiplicative error model. This leads to incompatible specifications of uncertainties and impedes intercomparison and application. In this letter, we assess the suitability of both models for satellite-based daily precipitation measurements in an effort to clarify the uncertainty representation. Three criteria were employed to evaluate the applicability of either model: (1) better separation of the systematic and random errors; (2) applicability to the large range of variability in daily precipitation; and (3) better predictive skills. It is found that the multiplicative error model is a much better choice under all three criteria. It extracted the systematic errors more cleanly, was more consistent with the large variability of precipitation measurements, and produced superior predictions of the error characteristics. The additive error model had several weaknesses, such as nonconstant variance resulting from systematic errors leaking into random errors, and the lack of prediction capability. Therefore, the multiplicative error model is a better choice.
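The contrast between the two error models can be sketched on synthetic data. The example below assumes the common forms Y = a + bX + e (additive) and Y = a·X^b·exp(e) (multiplicative, fitted as a straight line in log space); the abstract does not spell out these equations, so the forms, the parameter values, and all names here are assumptions for illustration, not the authors' analysis.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares (intercept, slope) for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_sd(x, y, intercept, slope):
    """Standard deviation of residuals from a fitted line over the given points."""
    res = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mean_res = sum(res) / len(res)
    return math.sqrt(sum((r - mean_res) ** 2 for r in res) / (len(res) - 1))


# Synthetic "truth" (sorted, skewed like daily rain amounts) and measurements
# generated with multiplicative error: measured = 0.8 * truth**1.1 * exp(noise).
rng = random.Random(1)
truth = sorted(math.exp(rng.gauss(1.0, 1.0)) for _ in range(5000))
measured = [0.8 * t ** 1.1 * math.exp(rng.gauss(0.0, 0.3)) for t in truth]

# Additive model fitted on the original scale.
a_add, b_add = fit_line(truth, measured)

# Multiplicative model fitted on the log scale.
log_truth = [math.log(t) for t in truth]
log_measured = [math.log(m) for m in measured]
ln_a, b_mul = fit_line(log_truth, log_measured)

# Compare residual spread for light versus heavy amounts under each model.
half = len(truth) // 2
add_low = residual_sd(truth[:half], measured[:half], a_add, b_add)
add_high = residual_sd(truth[half:], measured[half:], a_add, b_add)
mul_low = residual_sd(log_truth[:half], log_measured[:half], ln_a, b_mul)
mul_high = residual_sd(log_truth[half:], log_measured[half:], ln_a, b_mul)
```

In this synthetic setting the additive residuals spread out as amounts grow, while the log-scale residuals keep a roughly constant spread, which mirrors the nonconstant-variance weakness noted in the abstract.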
Training Raters to Assess Adult ADHD: Reliability of Ratings

ERIC Educational Resources Information Center

Adler, Lenard A.; Spencer, Thomas; Faraone, Stephen V.; Reimherr, Fred W.; Kelsey, Douglas; Michelson, David; Biederman, Joseph

2005-01-01

The standardization of ADHD ratings in adults is important given their differing symptom presentation. The authors investigated the agreement and reliability of rater standardization in a large-scale trial of atomoxetine in adults with ADHD. Training of 91 raters for the investigator-administered ADHD Rating Scale (ADHDRS-IV-Inv) occurred prior to…
Rater Severity in Large-Scale Assessment: Is It Invariant?

ERIC Educational Resources Information Center

McQueen, Joy; Congdon, Peter J.

A study was conducted to investigate the stability of rater severity over an extended rating period. Multifaceted Rasch analysis was applied to ratings of writing performances of 8,285 primary school (elementary) students. Each performance was rated on two performance dimensions by two trained raters over a period of 7 rating days. Performances…
Measurement Error and Environmental Epidemiology: A Policy Perspective

PubMed Central

Edwards, Jessie K.; Keil, Alexander P.

2017-01-01

Purpose of review: Measurement error threatens public health by producing bias in estimates of the population impact of environmental exposures. Quantitative methods to account for measurement bias can improve public health decision making. Recent findings: We summarize traditional and emerging methods to improve inference under a standard perspective, in which the investigator estimates an exposure response function, and a policy perspective, in which the investigator directly estimates population impact of a proposed intervention. Summary: Under a policy perspective, the analysis must be sensitive to errors in measurement of factors that modify the effect of exposure on outcome, must consider whether policies operate on the true or measured exposures, and may increasingly need to account for potentially dependent measurement error of two or more exposures affected by the same policy or intervention. Incorporating approaches to account for measurement error into such a policy perspective will increase the impact of environmental epidemiology. PMID:28138941
Investigating Raters' Development of Rating Ability on a Second Language Speaking Assessment

ERIC Educational Resources Information Center

Kim, Hyun Jung

2011-01-01

The purpose of the study was to investigate the extent to which raters coming from diverse backgrounds exhibited different levels of rating ability while scoring speaking performances. The study also aimed to examine how raters with different backgrounds could develop their rating ability over time. For this purpose, raters' background…
Use of units of measurement error in anthropometric comparisons.

PubMed

Lucas, Teghan; Henneberg, Maciej

2017-09-01

Anthropometrists attempt to minimise measurement errors; however, errors cannot be eliminated entirely. Currently, measurement errors are simply reported. Measurement errors should be included in analyses of anthropometric data. This study proposes a method that incorporates measurement errors into reported values, replacing metric units with 'units of technical error of measurement (TEM)', and applies it to forensics, industrial anthropometry and biological variation. The USA armed forces anthropometric survey (ANSUR) contains 132 anthropometric dimensions of 3982 individuals. Concepts of duplication and Euclidean distance calculations were applied to the forensic-style identification of individuals in this survey. The National Size and Shape Survey of Australia contains 65 anthropometric measurements of 1265 women. This sample was used to show how a woman's body measurements expressed in TEM could be 'matched' to standard clothing sizes. Euclidean distances show that two sets of repeated anthropometric measurements of the same person cannot be matched (> 0) on measurements expressed in millimetres but can in units of TEM (= 0). Only 81 women can fit into any standard clothing size when matched using centimetres; with units of TEM, 1944 women fit. The proposed method can be applied to all fields that use anthropometry. Units of TEM are considered a more reliable unit of measurement for comparisons.
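The idea of comparing measurements in units of TEM can be sketched as follows. The TEM formula for paired repeat measurements, sqrt(sum of squared differences / 2n), is standard; the conversion to TEM units by dividing and rounding is only one possible reading of the abstract and is an assumption, as are the function names and all values below.

```python
import math


def technical_error_of_measurement(first, second):
    """TEM for paired repeat measurements: sqrt(sum(d^2) / (2 * n))."""
    diffs = [a - b for a, b in zip(first, second)]
    return math.sqrt(sum(d * d for d in diffs) / (2 * len(diffs)))


def to_tem_units(values, tems):
    """Express each measurement as a whole number of TEM units.
    Dividing and rounding is an assumed conversion, not taken from the study."""
    return [round(v / t) for v, t in zip(values, tems)]


def euclidean_distance(a, b):
    """Euclidean distance between two measurement vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical repeat stature measurements (mm) from a small reliability study.
stature_first = [1700, 1652, 1810, 1745]
stature_second = [1708, 1646, 1802, 1751]
stature_tem = technical_error_of_measurement(stature_first, stature_second)

# The same person measured on two occasions: [stature, sitting height] in mm.
tems = [stature_tem, 4.0]  # the second TEM is an assumed value
session_a = [1701, 881]
session_b = [1702, 879]

distance_mm = euclidean_distance(session_a, session_b)
distance_tem = euclidean_distance(to_tem_units(session_a, tems),
                                  to_tem_units(session_b, tems))
```

In this example the two sessions differ in millimetres but coincide once expressed in TEM units. With a rounding-based conversion, values that straddle a bin edge can still land in adjacent units, so an exact match is not guaranteed in general.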

Improving Teacher Selection: The Effect of Inter-Rater Reliability in the Screening Process. CEDR Working Paper. WP #2015-7

ERIC Educational Resources Information Center

Martinkova, Patricia; Goldhaber, Dan

2015-01-01

Inter-rater reliability, commonly assessed by the intra-class correlation coefficient (ICC), is an important index for describing the extent to which there is consistency amongst two or more raters in assigned measures. In organizational research, the data structure is often hierarchical and designs deviate substantially from the ideal of a balanced…
Rater Expertise in a Second Language Speaking Assessment: The Influence of Training and Experience

ERIC Educational Resources Information Center

Davis, Lawrence Edward

2012-01-01

Speaking performance tests typically employ raters to produce scores; accordingly, variability in raters' scoring decisions has important consequences for test reliability and validity. One such source of variability is the rater's level of expertise in scoring. Therefore, it is important to understand how raters' performance is influenced by…
Detecting rater bias using a person-fit statistic: a Monte Carlo simulation study.

PubMed

Aubin, André-Sébastien; St-Onge, Christina; Renaud, Jean-Sébastien

2018-04-01

With the Standards voicing concern for the appropriateness of response processes, we need to explore strategies that would allow us to identify inappropriate rater response processes. Although certain statistics can be used to help detect rater bias, their use is complicated by either a lack of data about their actual power to detect rater bias or the difficulty related to their application in the context of health professions education. This exploratory study aimed to establish the worthiness of pursuing the use of l_z to detect rater bias. We conducted a Monte Carlo simulation study to investigate the power of a specific detection statistic, that is, the standardized likelihood l_z person-fit statistic (PFS). Our primary outcome was the detection rate of biased raters, namely raters whom we manipulated into being either stringent (giving lower scores) or lenient (giving higher scores), using the l_z statistic while controlling for the number of biased raters in a sample (6 levels) and the rate of bias per rater (6 levels). Overall, stringent raters (M = 0.84, SD = 0.23) were easier to detect than lenient raters (M = 0.31, SD = 0.28). More biased raters were easier to detect than less biased raters (60% bias: 62, SD = 0.37; 10% bias: 43, SD = 0.36). The PFS l_z seems to offer an interesting potential to identify biased raters. We observed detection rates as high as 90% for stringent raters, for whom we manipulated more than half their checklist. Although we observed very interesting results, we cannot generalize these results to the use of PFS with estimated item/station parameters or real data. Such studies should be conducted to assess the feasibility of using PFS to identify rater bias.
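For orientation, the standardized log-likelihood statistic l_z for dichotomous items can be computed as in the sketch below. It assumes the model-implied probability of a positive score on each checklist item is already known; how those probabilities are obtained, and how the study mapped raters onto "persons", is outside this sketch. The function name and all values are illustrative.

```python
import math


def lz_person_fit(responses, probabilities):
    """Standardized log-likelihood person-fit statistic l_z for 0/1 responses,
    given the model-implied probability of a 1 on each item."""
    log_lik = expected = variance = 0.0
    for u, p in zip(responses, probabilities):
        log_lik += u * math.log(p) + (1 - u) * math.log(1 - p)
        expected += p * math.log(p) + (1 - p) * math.log(1 - p)
        variance += p * (1 - p) * math.log(p / (1 - p)) ** 2
    return (log_lik - expected) / math.sqrt(variance)


# Hypothetical probabilities that each of 8 checklist items is scored 1.
probabilities = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

consistent = [1, 1, 1, 1, 0, 0, 0, 0]  # pattern the model expects
aberrant = [0, 0, 0, 0, 1, 1, 1, 1]    # pattern that contradicts the model

lz_consistent = lz_person_fit(consistent, probabilities)
lz_aberrant = lz_person_fit(aberrant, probabilities)
```

Large negative values indicate response patterns that are unlikely under the model, which is the signal a detection rule would flag; the cutoff used to flag a rater is a separate design choice not shown here.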
Rating Communication in GP Consultations: The Association Between Ratings Made by Patients and Trained Clinical Raters

PubMed Central

Burt, Jenni; Abel, Gary; Elmore, Natasha; Newbould, Jenny; Davey, Antoinette; Llanwarne, Nadia; Maramba, Inocencio; Paddison, Charlotte; Benson, John; Silverman, Jonathan; Elliott, Marc N.; Campbell, John; Roland, Martin

2016-01-01

Patient evaluations of physician communication are widely used, but we know little about how these relate to professionally agreed norms of communication quality. We report an investigation into the association between patient assessments of communication quality and an observer-rated measure of communication competence. Consent was obtained to video record consultations with Family Practitioners in England, following which patients rated the physician’s communication skills. A sample of consultation videos was subsequently evaluated by trained clinical raters using an instrument derived from the Calgary-Cambridge guide to the medical interview. Consultations scored highly for communication by clinical raters were also scored highly by patients. However, when clinical raters judged communication to be of lower quality, patient scores ranged from “poor” to “very good.” Some patients may be inhibited from rating poor communication negatively. Patient evaluations can be useful for measuring relative performance of physicians’ communication skills, but absolute scores should be interpreted with caution. PMID:27698072
Individual Differences in Rater Decision-Making Style: An Exploratory Mixed-Methods Study

ERIC Educational Resources Information Center

Baker, Beverly Anne

2012-01-01

Researchers of high-stakes, subjectively scored writing assessments have done much work to better understand the process that raters go through in applying a rating scale to a language performance to arrive at a score. However, there is still unexplained, systematic variability in rater scoring that resists rater training (see Hoyt & Kerns,…
Inter-Rater and Test-Retest Reliability of the Beery VMI in Schoolchildren

PubMed Central

Harvey, Erin M.; Leonard-Green, Tina K.; Mohan, Kathleen M.; Kulp, Marjean Taylor; Davis, Amy L.; Miller, Joseph M.; Twelker, J. Daniel; Campus, Irene; Dennis, Leslie K.

2017-01-01

Purpose: To assess inter-rater and test-retest reliability of the 6th Edition Beery-Buktenica Developmental Test of Visual-Motor Integration (VMI) and test-retest reliability of the VMI Visual Perception Supplemental Test (VMIp) in school-age children. Methods: Subjects were 163 Native American 3rd – 8th grade students with no significant refractive error (astigmatism < 1.00 D, myopia < 0.75 D, hyperopia < 2.50 D, anisometropia < 1.50 D) or ocular abnormalities. The VMI and VMIp were administered twice, on separate days. All VMI tests were scored by two trained scorers and a subset of 50 tests was also scored by an experienced scorer. Scorers strictly applied objective scoring criteria. Analyses included inter-rater and test-retest assessments of bias, 95% limits of agreement (LOAs), and intraclass correlation analysis. Results: Trained scorers had no significant scoring bias compared to the experienced scorer. One of the two trained scorers tended to provide higher scores than the other (mean difference in standardized scores = 1.54). Inter-rater correlations were strong (0.75 to 0.88). VMI and VMIp test-retest comparisons indicated no significant bias (subjects did not tend to score better on retest). Test-retest correlations were moderate (0.54 to 0.58). The 95% LOAs for the VMI were −24.14 to 24.67 (scorer 1) and −26.06 to 26.58 (scorer 2) and the 95% LOAs for the VMIp were −27.11 to 27.34. Conclusions: The 95% LOA for test-retest differences will be useful for determining if the VMI and VMIp have sufficient sensitivity for detecting change with treatment in both clinical and research settings. Further research on test-retest reliability reporting 95% LOAs for children across different age ranges is recommended, particularly if the test is to be used to detect changes due to intervention or treatment. PMID:28422801
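The 95% limits of agreement reported here follow the usual Bland-Altman construction: the mean of the paired differences plus or minus 1.96 standard deviations of those differences. A minimal sketch, with hypothetical scores and a hypothetical function name:

```python
import statistics


def limits_of_agreement(first, second):
    """Return (bias, lower, upper): the mean paired difference and the
    Bland-Altman 95% limits of agreement (bias +/- 1.96 * SD of differences)."""
    diffs = [b - a for a, b in zip(first, second)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical standardized scores for six children at test and retest.
test_scores = [95, 102, 88, 110, 99, 105]
retest_scores = [97, 100, 92, 108, 103, 104]

bias, lower, upper = limits_of_agreement(test_scores, retest_scores)
```

A change in an individual's score that falls inside the interval from lower to upper is within the range expected from test-retest variability alone, which is why wide limits imply low sensitivity to change.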
Reducing measurement errors during functional capacity tests in elders.

PubMed

da Silva, Mariane Eichendorf; Orssatto, Lucas Bet da Rosa; Bezerra, Ewertton de Souza; Silva, Diego Augusto Santos; Moura, Bruno Monteiro de; Diefenthaeler, Fernando; Freitas, Cíntia de la Rocha

2018-06-01

Accuracy is essential to the validity of functional capacity measurements. This study aimed to evaluate the measurement error of functional capacity tests for elders and to suggest the use of the technical error of measurement and credibility coefficient. Twenty elders (65.8 ± 4.5 years) completed six functional capacity tests that were simultaneously filmed and timed by four evaluators using a chronometer. A fifth evaluator timed the tests by analyzing the videos (reference data). The means of most evaluators for most tests differed from the reference (p < 0.05), except for two evaluators on two different tests. The technical error of measurement differed between tests and evaluators. Bland-Altman analysis showed differences in agreement between methods. Short-duration tests showed higher technical error of measurement than longer tests. In summary, tests timed by a chronometer underestimate the real results of the functional capacity. Differences between evaluators' reaction time and perception in determining the start and end of the tests would explain the measurement errors. Calculating the technical error of measurement or using a camera can increase data validity.
Factors Influencing Mini-CEX Rater Judgments and Their Practical Implications: A Systematic Literature Review.

PubMed

Lee, Victor; Brain, Keira; Martin, Jenepher

2017-06-01

At present, little is known about how mini-clinical evaluation exercise (mini-CEX) raters translate their observations into judgments and ratings. The authors of this systematic literature review aim both to identify the factors influencing mini-CEX rater judgments in the medical education setting and to translate these findings into practical implications for clinician assessors. The authors searched for internal and external factors influencing mini-CEX rater judgments in the medical education setting from 1980 to 2015 using the Ovid MEDLINE, PsycINFO, ERIC, PubMed, and Scopus databases. They extracted the following information from each study: country of origin, educational level, study design and setting, type of observation, occurrence of rater training, provision of feedback to the trainee, research question, and identified factors influencing rater judgments. The authors also conducted a quality assessment for each study. Seventeen articles met the inclusion criteria. The authors identified both internal and external factors that influence mini-CEX rater judgments. They subcategorized the internal factors into intrinsic rater factors, judgment-making factors (conceptualization, interpretation, attention, and impressions), and scoring factors (scoring integration and domain differentiation). The current theories of rater-based judgment have not helped clinicians resolve the issues of rater idiosyncrasy, bias, gestalt, and conflicting contextual factors; therefore, the authors believe the most important solution is to increase the justification of rater judgments through the use of specific narrative and contextual comments, which are more informative for trainees. Finally, more real-world research is required to bridge the gap between the theory and practice of rater cognition.
Six of one, half a dozen of the other: A measure of multidisciplinary inter/intra-rater reliability of the society for fetal urology and urinary tract dilation grading systems for hydronephrosis.

PubMed

Rickard, Mandy; Easterbrook, Bethany; Kim, Soojin; Farrokhyar, Forough; Stein, Nina; Arora, Steven; Belostotsky, Vladamir; DeMaria, Jorge; Lorenzo, Armando J; Braga, Luis H

2017-02-01

The urinary tract dilation (UTD) classification system was introduced to standardize terminology in the reporting of hydronephrosis (HN), and bridge a gap between pre- and postnatal classification such as the Society for Fetal Urology (SFU) grading system. Herein we compare the intra/inter-rater reliability of both grading systems. SFU (I-IV) and UTD (I-III) grades were independently assigned by 13 raters (9 pediatric urology staff, 2 nephrologists, 2 radiologists), twice, 3 weeks apart, to 50 sagittal postnatal ultrasonographic views of hydronephrotic kidneys. Data regarding ureteral measurements and bladder abnormalities were included to allow proper UTD categorization. Ten images were repeated to assess intra-rater reliability. Krippendorff's alpha coefficient was used to measure overall and by grade intra/inter-rater reliability. Reliability between specialties and training levels were also analyzed. Overall inter-rater reliability was slightly higher for SFU (α = 0.842, 95% CI 0.812-0.879, in session 1; and α = 0.808, 95% CI 0.775-0.839, in session 2) than for UTD (α = 0.774, 95% CI 0.715-0.827, in session 1; and α = 0.679, 95% CI 0.605-0.750, in session 2). Reliability for intermediate grades (SFU II/III and UTD 2) of HN was poor regardless of the system. Reliabilities for SFU and UTD classifications among Urology, Nephrology, and Radiology, as well as between training levels were not significantly different. Despite the introduction of HN grading systems to standardize the interpretation and reporting of renal ultrasound in infants with HN, none have been proven superior in allowing clinicians to distinguish between "moderate" grades. While this study demonstrated high reliability in distinguishing between "mild" (SFU I/II and UTD 1) and "severe" (SFU IV and UTD 3) grades of HN, the overall reliability between specialties was poor. This is in keeping with a previous report of modest inter-rater reliability of the SFU system. This drawback is
The Child and Adolescent Services Assessment: Interrater Reliability and Predictors of Rater Disagreement.

PubMed

Schwartz, Karen T G; Bowling, Amanda A; Dickerson, John F; Lynch, Frances L; Brent, David A; Porta, Giovanna; Iyengar, Satish; Weersing, V Robin

2018-05-24

The current study evaluated the interrater reliability of the Child and Adolescent Services Assessment (CASA), a widely used structured interview measuring pediatric mental health service use. Interviews (N = 72) were randomly selected from a pediatric effectiveness trial, and audio was coded by an independent rater. Regressions were employed to identify predictors of rater disagreement. Interrater reliability was high for items (> 94%) and summary metrics (ICC > .79) across service sectors. Predictors of disagreement varied by domain; significant predictors indexed higher clinical severity or social disadvantage. Results support the CASA as a reliable and robust assessment of pediatric service use, but administrators should be alert when assessing vulnerable populations.
Raters Interpret Positively and Negatively Worded Items Similarly in a Quality of Life Instrument for Children

PubMed Central

Lin, Chung-Ying; Strong, Carol; Tsai, Meng-Che; Lee, Chih-Ting

2017-01-01

Measurement invariance is an important assumption to meaningfully compare children’s quality of life (QoL) between different raters (eg, children and parents) and across genders. Moreover, QoL instruments may combine negatively and positively worded items, a common method to reduce response bias. However, the wording effects may have different levels of impact on different raters and genders. Our aim was to investigate the measurement invariance of Kid-KINDL, a commonly used QoL instrument, across genders and raters and to consider the wording effects simultaneously. Third to sixth graders (208 boys and 235 girls) completed the self-rated Kid-KINDL, and 1 parent each of 241 children completed the parent-rated Kid-KINDL. The wording effects were accounted for by a correlated traits-uncorrelated methods model. The measurement invariance was examined using multigroup confirmatory factor analysis. Item loadings and item intercepts were invariant across gender and rater when we simultaneously accounted for the wording effects of Kid-KINDL. Our results suggest that Kid-KINDL could be used to compare QoL across gender and that parent-rated Kid-KINDL could be used to measure children’s QoL. Specifically, the invariant factor loadings across child-rated and parent-rated Kid-KINDL suggest that the score weights in each item were the same for both children and parents (ie, the important items identified by the children are the same items identified by the parents). The invariant item intercepts suggest that both children and parents share the same threshold for each item. Based on the results, we tentatively recommend that each score of a parent-rated Kid-KINDL can stand for each child’s QoL. PMID:28292193
Adjacent-Categories Mokken Models for Rater-Mediated Assessments

PubMed Central

Wind, Stefanie A.

2016-01-01

Molenaar extended Mokken’s original probabilistic-nonparametric scaling models for use with polytomous data. These polytomous extensions of Mokken’s original scaling procedure have facilitated the use of Mokken scale analysis as an approach to exploring fundamental measurement properties across a variety of domains in which polytomous ratings are used, including rater-mediated educational assessments. Because their underlying item step response functions (i.e., category response functions) are defined using cumulative probabilities, polytomous Mokken models can be classified as cumulative models based on the classifications of polytomous item response theory models proposed by several scholars. In order to permit a closer conceptual alignment with educational performance assessments, this study presents an adjacent-categories variation on the polytomous monotone homogeneity and double monotonicity models. Data from a large-scale rater-mediated writing assessment are used to illustrate the adjacent-categories approach, and results are compared with the original formulations. Major findings suggest that the adjacent-categories models provide additional diagnostic information related to individual raters’ use of rating scale categories that is not observed under the original formulation. Implications are discussed in terms of methods for evaluating rating quality. PMID:29795916
Technical approaches for measurement of human errors

NASA Technical Reports Server (NTRS)

Clement, W. F.; Heffley, R. K.; Jewell, W. F.; Mcruer, D. T.

1980-01-01

Human error is a significant contributing factor in a very high proportion of civil transport, general aviation, and rotorcraft accidents. The technical details of a variety of proven approaches for the measurement of human errors in the context of the national airspace system are presented. Unobtrusive measurements suitable for cockpit operations and procedures in part or full mission simulation are emphasized. Procedure, system performance, and human operator centered measurements are discussed as they apply to the manual control, communication, supervisory, and monitoring tasks which are relevant to aviation operations.
Exploring the Role of First Impressions in Rater-Based Assessments

ERIC Educational Resources Information Center

Wood, Timothy J.

2014-01-01

Medical education relies heavily on assessment formats that require raters to assess the competence and skills of learners. Unfortunately, there are often inconsistencies and variability in the scores raters assign. To ensure the scores from these assessment tools have validity, it is important to understand the underlying cognitive processes that…
Inter-rater Reliability of Three Musculoskeletal Physical examination Techniques Used to Assess Motion in Three Planes While Standing

PubMed Central

Prather, Heidi; Hunt, Devyani; Steger-May, Karen; Hayes, Marcie Harris; Knaus, Evan; Clohisy, John

2012-01-01

Objective: The objective of the study was to measure the reliability between examiners of three basic maneuvers of the Total Body Functional Profile© physical examination test. The hypothesis was that musculoskeletal health care providers of different disciplines could reliably use the three basic maneuvers as part of the musculoskeletal physical examination. Design: A prospective observational study was conducted. Twenty-eight adult volunteers were measured on both the left and right side by two independent raters on a single occasion. Setting: The subjects were recruited through advertisements placed by the orthopedic department at a tertiary university. Participants: 28 volunteers were recruited and completed the study. The volunteers were between 18 and 51 years of age, had no symptoms in the lower extremity or spine, had no previous history of surgery or tumor involving the lower extremity, and no medical conditions that would preclude participation. Assessment: On a single occasion, two examiners per volunteer were blinded to their own and each other's measurements. Each examiner assessed the distance of frontal and sagittal plane lunge and angle of motion for transverse plane testing. Main Outcome Measurements: Inter-rater agreement is expressed with intraclass correlation coefficients (ICCs) and corresponding 95% confidence intervals (CIs). The difference between raters is reported with 95% CIs. Baseline demographics, UCLA, and Harris hip questionnaires were completed by all participants. Results: The UCLA and Harris hip scores showed no significant activity restrictions or pain limitations in all participants. The inter-rater reliability for sagittal, frontal, and transverse plane matrix testing was good with ICCs of 0.86 (95% CI 0.77, 0.91), 0.90 (95% CI 0.84, 0.94), and 0.85 (95% CI 0.75, 0.91) respectively. The rater reliability between disciplines for transverse, sagittal and frontal plane matrix testing was good with ICCs of 0.89 (95% CI 0.80, 0
Inter-rater reliability of twelve diagnostic systems of schizophrenia.

PubMed

Helmes, E; Landmark, J; Kazarian, S S

1983-05-01

The present and past symptomatology of 31 chronic schizophrenics was rated by four independent judges, two experienced clinical psychiatrists and two psychiatric residents, in a context more representative of actual clinical practice than most research studies. Ratings were made on 64 symptoms derived from 12 diagnostic systems, based on either live or videotaped interviews for present symptomatology and case records for past symptomatology. Inter-rater reliabilities were higher for present than for past symptoms, and in general did not approach those reported for highly trained raters. There were no differences between live and videotaped interviews. Diagnostic systems differed widely in rater agreement. The most consistent across both past and present symptomatology were the systems of Langfeldt, Schneider, and DSM-III, for which the level of reliability was consistent with other studies.
Unit of Measurement Used and Parent Medication Dosing Errors

PubMed Central

Dreyer, Benard P.; Ugboaja, Donna C.; Sanchez, Dayana C.; Paul, Ian M.; Moreira, Hannah A.; Rodriguez, Luis; Mendelsohn, Alan L.

2014-01-01

BACKGROUND AND OBJECTIVES: Adopting the milliliter as the preferred unit of measurement has been suggested as a strategy to improve the clarity of medication instructions; teaspoon and tablespoon units may inadvertently endorse nonstandard kitchen spoon use. We examined the association between unit used and parent medication errors and whether nonstandard instruments mediate this relationship. METHODS: Cross-sectional analysis of baseline data from a larger study of provider communication and medication errors. English- or Spanish-speaking parents (n = 287) whose children were prescribed liquid medications in 2 emergency departments were enrolled. Medication error defined as: error in knowledge of prescribed dose, error in observed dose measurement (compared to intended or prescribed dose); >20% deviation threshold for error. Multiple logistic regression performed adjusting for parent age, language, country, race/ethnicity, socioeconomic status, education, health literacy (Short Test of Functional Health Literacy in Adults); child age, chronic disease; site. RESULTS: Medication errors were common: 39.4% of parents made an error in measurement of the intended dose, 41.1% made an error in the prescribed dose. Furthermore, 16.7% used a nonstandard instrument. Compared with parents who used milliliter-only, parents who used teaspoon or tablespoon units had twice the odds of making an error with the intended (42.5% vs 27.6%, P = .02; adjusted odds ratio=2.3; 95% confidence interval, 1.2–4.4) and prescribed (45.1% vs 31.4%, P = .04; adjusted odds ratio=1.9; 95% confidence interval, 1.03–3.5) dose; associations greater for parents with low health literacy and non–English speakers. Nonstandard instrument use partially mediated teaspoon and tablespoon–associated measurement errors. CONCLUSIONS: Findings support a milliliter-only standard to reduce medication errors. PMID:25022742
Unit of measurement used and parent medication dosing errors.

PubMed

Yin, H Shonna; Dreyer, Benard P; Ugboaja, Donna C; Sanchez, Dayana C; Paul, Ian M; Moreira, Hannah A; Rodriguez, Luis; Mendelsohn, Alan L

2014-08-01

Adopting the milliliter as the preferred unit of measurement has been suggested as a strategy to improve the clarity of medication instructions; teaspoon and tablespoon units may inadvertently endorse nonstandard kitchen spoon use. We examined the association between unit used and parent medication errors and whether nonstandard instruments mediate this relationship. Cross-sectional analysis of baseline data from a larger study of provider communication and medication errors. English- or Spanish-speaking parents (n = 287) whose children were prescribed liquid medications in 2 emergency departments were enrolled. Medication error defined as: error in knowledge of prescribed dose, error in observed dose measurement (compared to intended or prescribed dose); >20% deviation threshold for error. Multiple logistic regression performed adjusting for parent age, language, country, race/ethnicity, socioeconomic status, education, health literacy (Short Test of Functional Health Literacy in Adults); child age, chronic disease; site. Medication errors were common: 39.4% of parents made an error in measurement of the intended dose, 41.1% made an error in the prescribed dose. Furthermore, 16.7% used a nonstandard instrument. Compared with parents who used milliliter-only, parents who used teaspoon or tablespoon units had twice the odds of making an error with the intended (42.5% vs 27.6%, P = .02; adjusted odds ratio=2.3; 95% confidence interval, 1.2-4.4) and prescribed (45.1% vs 31.4%, P = .04; adjusted odds ratio=1.9; 95% confidence interval, 1.03-3.5) dose; associations greater for parents with low health literacy and non-English speakers. Nonstandard instrument use partially mediated teaspoon and tablespoon-associated measurement errors. Findings support a milliliter-only standard to reduce medication errors. Copyright © 2014 by the American Academy of Pediatrics.
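The crude odds ratios implied by the error rates in this abstract can be checked with a short calculation. This is a minimal sketch: the helper names are illustrative, and it uses only the reported percentages, so the results are unadjusted and need not match the adjusted odds ratios (2.3 and 1.9), which come from the multiple logistic regression.

```python
def odds(p):
    """Convert a proportion (0 < p < 1) into odds."""
    return p / (1.0 - p)


def odds_ratio(p_group, p_reference):
    """Crude (unadjusted) odds ratio of one group versus a reference group."""
    return odds(p_group) / odds(p_reference)


# Error rates reported in the abstract: teaspoon/tablespoon vs milliliter-only.
crude_or_intended = odds_ratio(0.425, 0.276)    # error vs intended dose
crude_or_prescribed = odds_ratio(0.451, 0.314)  # error vs prescribed dose

print(round(crude_or_intended, 2), round(crude_or_prescribed, 2))
```

Both crude ratios come out a little below the adjusted values, which is what one would expect if covariate adjustment slightly strengthened the association.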
Measurement uncertainty evaluation of conicity error inspected on CMM

NASA Astrophysics Data System (ADS)

Wang, Dongxia; Song, Aiguo; Wen, Xiulan; Xu, Youxiong; Qiao, Guifang

2016-01-01

The cone is widely used in mechanical design for rotation, centering and fixing. Whether the conicity error can be measured and evaluated accurately will directly influence its assembly accuracy and working performance. According to the new generation geometrical product specification (GPS), the error and its measurement uncertainty should be evaluated together. The mathematical model of the minimum zone conicity error is established and an improved immune evolutionary algorithm (IIEA) is proposed to search for the conicity error. In the IIEA, initial antibodies are first generated by using quasi-random sequences and two kinds of affinities are calculated. Then, each antibody clone is generated and the clones are self-adaptively mutated so as to maintain diversity. Similar antibodies are suppressed and new random antibodies are generated. Because the mathematical model of conicity error is strongly nonlinear and the input quantities are not independent, it is difficult to use the Guide to the Expression of Uncertainty in Measurement (GUM) method to evaluate measurement uncertainty. An adaptive Monte Carlo method (AMCM) is proposed to estimate measurement uncertainty, in which the number of Monte Carlo trials is selected adaptively and the quality of the numerical results is directly controlled. The cone parts were machined on lathe CK6140 and measured on a Miracle NC 454 Coordinate Measuring Machine (CMM). The experimental results confirm that the proposed method not only can search for the approximate solution of the minimum zone conicity error (MZCE) rapidly and precisely, but also can evaluate measurement uncertainty and give control variables with an expected numerical tolerance. The conicity errors computed by the proposed method are 20%-40% less than those computed by NC454 CMM software and the evaluation accuracy improves significantly.
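The idea of choosing the number of Monte Carlo trials adaptively can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy taper model, the input values and uncertainties, the independent Gaussian inputs, and the batch-means stopping rule are not the paper's conicity model or its AMCM, and the stopping rule is simpler than the one in GUM Supplement 1.

```python
import random
import statistics


def taper(d_large, d_small, length):
    """Toy measurement model (hypothetical): taper ratio of a cone frustum."""
    return (d_large - d_small) / length


def draw_inputs(rng):
    """Draw one input set; means and standard uncertainties are made up (mm)."""
    return (rng.gauss(30.0, 0.002), rng.gauss(20.0, 0.002), rng.gauss(50.0, 0.005))


def adaptive_monte_carlo(model, draw, batch=2000, tol=5e-7, max_batches=200, seed=0):
    """Add batches of trials until the estimated standard error of the mean < tol."""
    rng = random.Random(seed)
    samples, batch_means = [], []
    for _ in range(max_batches):
        block = [model(*draw(rng)) for _ in range(batch)]
        samples.extend(block)
        batch_means.append(statistics.fmean(block))
        if len(batch_means) >= 2:
            # Standard error of the overall mean, estimated from the batch means.
            std_err = statistics.stdev(batch_means) / len(batch_means) ** 0.5
            if std_err < tol:
                break
    return statistics.fmean(samples), statistics.stdev(samples), len(samples)


estimate, std_uncertainty, n_trials = adaptive_monte_carlo(taper, draw_inputs)
print(estimate, std_uncertainty, n_trials)
```

The returned standard deviation plays the role of the standard uncertainty of the output quantity; correlated inputs, as in the abstract, would require drawing the inputs jointly rather than independently.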
The Influence of Training and Experience on Rater Performance in Scoring Spoken Language

ERIC Educational Resources Information Center

Davis, Larry

2016-01-01

Two factors were investigated that are thought to contribute to consistency in rater scoring judgments: rater training and experience in scoring. Also considered were the relative effects of scoring rubrics and exemplars on rater performance. Experienced teachers of English (N = 20) scored recorded responses from the TOEFL iBT speaking test prior…

Analysis on the dynamic error for optoelectronic scanning coordinate measurement network

NASA Astrophysics Data System (ADS)

Shi, Shendong; Yang, Linghui; Lin, Jiarui; Guo, Siyang; Ren, Yongjie

2018-01-01

Large-scale dynamic three-dimensional coordinate measurement is in high demand in equipment manufacturing. Noted for its high accuracy, scale expandability and multitask parallel measurement, the optoelectronic scanning measurement network has attracted close attention. It is widely used in large-component jointing, spacecraft rendezvous and docking simulation, digital shipbuilding and automated guided vehicle navigation. At present, most research on optoelectronic scanning measurement networks focuses on static measurement capacity, and research on dynamic accuracy is insufficient. Limited by the measurement principle, the dynamic error is non-negligible and restricts the application. The workshop measurement and positioning system is a representative system that can, in theory, realize dynamic measurement. In this paper we study the sources of dynamic error in depth and divide them into two parts: phase error and synchronization error. A dynamic error model is constructed. Based on this theory, a simulation of dynamic error is carried out. The dynamic error is quantified and its volatility and periodicity are identified. Dynamic error characteristics are shown in detail. The research result lays a foundation for further accuracy improvement.
Measurement error is often neglected in medical literature: a systematic review.

PubMed

Brakenhoff, Timo B; Mitroiu, Marian; Keogh, Ruth H; Moons, Karel G M; Groenwold, Rolf H H; van Smeden, Maarten

2018-06-01

In medical research, covariates (e.g., exposure and confounder variables) are often measured with error. While it is well accepted that this introduces bias and imprecision in exposure-outcome relations, it is unclear to what extent such issues are currently considered in research practice. The objective was to study common practices regarding covariate measurement error via a systematic review of general medicine and epidemiology literature. Original research published in 2016 in 12 high impact journals was full-text searched for phrases relating to measurement error. Reporting of measurement error and methods to investigate or correct for it were quantified and characterized. Two hundred and forty-seven (44%) of the 565 original research publications reported on the presence of measurement error. 83% of these 247 did so with respect to the exposure and/or confounder variables. Only 18 publications (7% of 247) used methods to investigate or correct for measurement error. Consequently, it is difficult for readers to judge the robustness of presented results to the existence of measurement error in the majority of publications in high impact journals. Our systematic review highlights the need for increased awareness about the possible impact of covariate measurement error. Additionally, guidance on the use of measurement error correction methods is necessary. Copyright © 2018 Elsevier Inc. All rights reserved.
Intra and Inter-Rater Reliability of Screening for Movement Impairments: Movement Control Tests from The Foundation Matrix

PubMed Central

Mischiati, Carolina R.; Comerford, Mark; Gosford, Emma; Swart, Jacqueline; Ewings, Sean; Botha, Nadine; Stokes, Maria; Mottram, Sarah L.

2015-01-01

Pre-season screening is well established within the sporting arena, and aims to enhance performance and reduce injury risk. With the increasing need to identify potential injury with greater accuracy, a new risk assessment process has been produced; The Performance Matrix (battery of movement control tests). As with any new method of objective testing, it is fundamental to establish whether the same results can be reproduced between examiners and by the same examiner on consecutive occasions. This study aimed to determine the intra-rater test re-test and inter-rater reliability of tests from a component of The Performance Matrix, The Foundation Matrix. Twenty participants were screened by two experienced musculoskeletal therapists using nine tests to assess the ability to control movement during specific tasks. Movement evaluation criteria for each test were rated as pass or fail. The therapists observed participants real-time and tests were recorded on video to enable repeated ratings four months later to examine intra-rater reliability (videos rated two weeks apart). Overall test percentage agreement was 87% for inter-rater reliability; 98% Rater 1, 94% Rater 2 for test re-test reliability; and 75% for real-time versus video. Intraclass-correlation coefficients (ICCs) were excellent between raters (0.81) and within raters (Rater 1, 0.96; Rater 2, 0.88) but poor for real-time versus video (0.23). Reliability for individual components of each test was more variable: inter-rater, 68-100%; intra-rater, 88-100% Rater 1, 75-100% Rater 2; and real-time versus video 31-100%. Cohen’s Kappa values for inter-rater reliability were 0.0-1.0; intra-rater 0.6-1.0 for Rater 1; -0.1-1.0 for Rater 2; and -0.1-1 for real-time versus video. It is concluded that both inter and intra-rater reliability of tests in The Foundation Matrix are acceptable when rated by experienced therapists. Recommendations are made for modifying some of the criteria to improve reliability where
Comparing the Effectiveness of Self-Paced and Collaborative Frame-of-Reference Training on Rater Accuracy in a Large-Scale Writing Assessment

ERIC Educational Resources Information Center

Raczynski, Kevin R.; Cohen, Allan S.; Engelhard, George, Jr.; Lu, Zhenqiu

2015-01-01

There is a large body of research on the effectiveness of rater training methods in the industrial and organizational psychology literature. Less has been reported in the measurement literature on large-scale writing assessments. This study compared the effectiveness of two widely used rater training methods--self-paced and collaborative…
Adjusting for Year to Year Rater Variation in IRT Linking--An Empirical Evaluation

ERIC Educational Resources Information Center

Yen, Shu Jing; Ochieng, Charles; Michaels, Hillary; Friedman, Greg

2005-01-01

The main purpose of this study was to illustrate a polytomous IRT-based linking procedure that adjusts for rater variations. Test scores from two administrations of a statewide reading assessment were used. An anchor set of Year 1 students' constructed responses were rescored by Year 2 raters. To adjust for year-to-year rater variation in IRT…
A new approach to the characterization of subtle errors in everyday action: implications for mild cognitive impairment.

PubMed

Seligman, Sarah C; Giovannetti, Tania; Sestito, John; Libon, David J

2014-01-01

Mild functional difficulties have been associated with early cognitive decline in older adults and increased risk for conversion to dementia in mild cognitive impairment, but our understanding of this decline has been limited by a dearth of objective methods. This study evaluated the reliability and validity of a new system to code subtle errors on an established performance-based measure of everyday action and described preliminary findings within the context of a theoretical model of action disruption. Here 45 older adults completed the Naturalistic Action Test (NAT) and neuropsychological measures. NAT performance was coded for overt errors, and subtle action difficulties were scored using a novel coding system. An inter-rater reliability coefficient was calculated. Validity of the coding system was assessed using a repeated-measures ANOVA with NAT task (simple versus complex) and error type (overt versus subtle) as within-group factors. Correlation/regression analyses were conducted among overt NAT errors, subtle NAT errors, and neuropsychological variables. The coding of subtle action errors was reliable and valid, and episodic memory breakdown predicted subtle action disruption. Results suggest that the NAT can be useful in objectively assessing subtle functional decline. Treatments targeting episodic memory may be most effective in addressing early functional impairment in older age.
Measurement errors in voice-key naming latency for Hiragana.

PubMed

Yamada, Jun; Tamaoka, Katsuo

2003-12-01

This study makes explicit the limitations and possibilities of voice-key naming latency research on single hiragana symbols (a Japanese syllabic script) by examining three sets of voice-key naming data against Sakuma, Fushimi, and Tatsumi's 1997 speech-analyzer voice-waveform data. Analysis showed that voice-key measurement errors can be substantial in standard procedures as they may conceal the true effects of significant variables involved in hiragana-naming behavior. While one can avoid voice-key measurement errors to some extent by applying Sakuma, et al.'s deltas and by excluding initial phonemes which induce measurement errors, such errors may be ignored when test items are words and other higher-level linguistic materials.
Automated Quantification of the Landing Error Scoring System With a Markerless Motion-Capture System.

PubMed

Mauntel, Timothy C; Padua, Darin A; Stanley, Laura E; Frank, Barnett S; DiStefano, Lindsay J; Peck, Karen Y; Cameron, Kenneth L; Marshall, Stephen W

2017-11-01

The Landing Error Scoring System (LESS) can be used to identify individuals with an elevated risk of lower extremity injury. The limitation of the LESS is that raters identify movement errors from video replay, which is time-consuming and, therefore, may limit its use by clinicians. A markerless motion-capture system may be capable of automating LESS scoring, thereby removing this obstacle. To determine the reliability of an automated markerless motion-capture system for scoring the LESS. Cross-sectional study. United States Military Academy. A total of 57 healthy, physically active individuals (47 men, 10 women; age = 18.6 ± 0.6 years, height = 174.5 ± 6.7 cm, mass = 75.9 ± 9.2 kg). Participants completed 3 jump-landing trials that were recorded by standard video cameras and a depth camera. Their movement quality was evaluated by expert LESS raters (standard video recording) using the LESS rubric and by software that automates LESS scoring (depth-camera data). We recorded an error for a LESS item if it was present on at least 2 of 3 jump-landing trials. We calculated κ statistics, prevalence- and bias-adjusted κ (PABAK) statistics, and percentage agreement for each LESS item. Interrater reliability was evaluated between the 2 expert rater scores and between a consensus expert score and the markerless motion-capture system score. We observed reliability between the 2 expert LESS raters (average κ = 0.45 ± 0.35, average PABAK = 0.67 ± 0.34; percentage agreement = 0.83 ± 0.17). The markerless motion-capture system had similar reliability with consensus expert scores (average κ = 0.48 ± 0.40, average PABAK = 0.71 ± 0.27; percentage agreement = 0.85 ± 0.14). However, reliability was poor for 5 LESS items in both LESS score comparisons. A markerless motion-capture system had the same level of reliability as expert LESS raters, suggesting that an automated system can accurately assess movement. Therefore, clinicians can use
Error analysis and experiments of attitude measurement using laser gyroscope

NASA Astrophysics Data System (ADS)

Ren, Xin-ran; Ma, Wen-li; Jiang, Ping; Huang, Jin-long; Pan, Nian; Guo, Shuai; Luo, Jun; Li, Xiao

2018-03-01

The precision of photoelectric tracking and measuring equipment on vehicles and vessels is degraded by the platform's movement. Specifically, the platform's movement leads to deviation or loss of the target; it also causes jitter of the visual axis, which produces image blur. In order to improve the precision of photoelectric equipment, the attitude of photoelectric equipment fixed to the platform must be measured. Currently, the laser gyroscope is widely used to measure the attitude of the platform. However, the measurement accuracy of a laser gyro is affected by its zero bias, scale factor, installation error and random error. In this paper, these errors were analyzed and compensated based on the laser gyro's error model. Static and dynamic experiments were carried out on a single-axis turntable, and the error model was verified by comparing the gyro's output with that of an encoder with an accuracy of 0.1 arc sec. After error compensation, the accuracy of the gyroscope improved from 7000 arc sec to 5 arc sec over one hour. The method used in this paper is suitable for decreasing laser gyro errors in inertial measurement applications.
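How zero bias and scale-factor error accumulate into attitude error, and how compensation removes them, can be sketched for a single axis. The abstract does not give the error model, so this sketch assumes the simple form raw = (1 + scale_factor_error) × true + zero_bias, ignores installation and random errors, and uses made-up parameter values.

```python
def compensate_rate(raw_rate, zero_bias, scale_factor_error):
    """Invert the assumed model raw = (1 + scale_factor_error) * true + zero_bias."""
    return (raw_rate - zero_bias) / (1.0 + scale_factor_error)


def integrate_angle(rates, dt):
    """Accumulate angle from evenly sampled angular rates (rectangle rule)."""
    return sum(rate * dt for rate in rates)


# Hypothetical values: constant 10 deg/s turntable rate sampled at 1 Hz for 1 h.
true_rate = 10.0            # deg/s
zero_bias = 0.002           # deg/s
scale_factor_error = 1e-4   # dimensionless
raw = [(1 + scale_factor_error) * true_rate + zero_bias] * 3600

angle_true = true_rate * 3600
angle_raw = integrate_angle(raw, 1.0)
angle_comp = integrate_angle(
    [compensate_rate(r, zero_bias, scale_factor_error) for r in raw], 1.0
)
print(angle_raw - angle_true, angle_comp - angle_true)
```

In this toy case the uncompensated angle drifts by about 10.8 degrees over the hour, while the compensated angle matches the reference to within floating-point error, because the same parameters are used to generate and to correct the data.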
A Sensor Dynamic Measurement Error Prediction Model Based on NAPSO-SVM.

PubMed

Jiang, Minlan; Jiang, Lan; Jiang, Dingde; Li, Fei; Song, Houbing

2018-01-15

Dynamic measurement error correction is an effective way to improve sensor precision. Dynamic measurement error prediction is an important part of error correction, and support vector machine (SVM) is often used for predicting the dynamic measurement errors of sensors. Traditionally, the SVM parameters were always set manually, which cannot ensure the model's performance. In this paper, an SVM method based on an improved particle swarm optimization (NAPSO) is proposed to predict the dynamic measurement errors of sensors. Natural selection and simulated annealing are added to the PSO to improve its ability to avoid local optima. To verify the performance of NAPSO-SVM, three types of algorithms are selected to optimize the SVM's parameters: the particle swarm optimization algorithm (PSO), the improved PSO optimization algorithm (NAPSO), and the glowworm swarm optimization (GSO). The dynamic measurement error data of two sensors are applied as the test data. The root mean squared error and mean absolute percentage error are employed to evaluate the prediction models' performances. The experimental results show that among the three tested algorithms the NAPSO-SVM method has better prediction precision and smaller prediction errors, and it is an effective method for predicting the dynamic measurement errors of sensors.
A Sensor Dynamic Measurement Error Prediction Model Based on NAPSO-SVM

PubMed Central

Jiang, Minlan; Jiang, Lan; Jiang, Dingde; Li, Fei

2018-01-01

Dynamic measurement error correction is an effective way to improve sensor precision. Dynamic measurement error prediction is an important part of error correction, and support vector machine (SVM) is often used for predicting the dynamic measurement errors of sensors. Traditionally, the SVM parameters were always set manually, which cannot ensure the model’s performance. In this paper, an SVM method based on an improved particle swarm optimization (NAPSO) is proposed to predict the dynamic measurement errors of sensors. Natural selection and simulated annealing are added to the PSO to improve its ability to avoid local optima. To verify the performance of NAPSO-SVM, three types of algorithms are selected to optimize the SVM’s parameters: the particle swarm optimization algorithm (PSO), the improved PSO optimization algorithm (NAPSO), and the glowworm swarm optimization (GSO). The dynamic measurement error data of two sensors are applied as the test data. The root mean squared error and mean absolute percentage error are employed to evaluate the prediction models’ performances. The experimental results show that among the three tested algorithms the NAPSO-SVM method has better prediction precision and smaller prediction errors, and it is an effective method for predicting the dynamic measurement errors of sensors. PMID:29342942
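The two evaluation metrics named in this abstract, root mean squared error (RMSE) and mean absolute percentage error (MAPE), can be written out as a minimal sketch. The function names and the three sample values are illustrative only and are not data from the paper.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be nonzero."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


# Hypothetical measured errors and model predictions.
actual = [2.0, 4.0, 5.0]
predicted = [2.5, 3.0, 5.0]
print(rmse(actual, predicted), mape(actual, predicted))
```

RMSE penalizes large deviations more heavily, while MAPE expresses the error relative to the size of each actual value, which is why the two are often reported together.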
Improving Localization Accuracy: Successive Measurements Error Modeling

PubMed Central

Abu Ali, Najah; Abu-Elkheir, Mervat

2015-01-01

Vehicle self-localization is an essential requirement for many of the safety applications envisioned for vehicular networks. The mathematical models used in current vehicular localization schemes focus on modeling the localization error itself, and overlook the potential correlation between successive localization measurement errors. In this paper, we first investigate the existence of correlation between successive positioning measurements, and then incorporate this correlation into the modeling of positioning error. We use the Yule–Walker equations to determine the degree of correlation between a vehicle’s future position and its past positions, and then propose a p-order Gauss–Markov model to predict the future position of a vehicle from its past p positions. We investigate the existence of correlation for two datasets representing the mobility traces of two vehicles over a period of time. We prove the existence of correlation between successive measurements in the two datasets, and show that the time correlation between measurements can have a value up to four minutes. Through simulations, we validate the robustness of our model and show that it is possible to use the first-order Gauss–Markov model, which has the least complexity, and still maintain an accurate estimation of a vehicle’s future location over time using only its current position. Our model can assist in providing better modeling of positioning errors and can be used as a prediction tool to improve the performance of classical localization algorithms such as the Kalman filter. PMID:26140345
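For the first-order case, the Yule–Walker equations reduce to a single ratio of autocovariances, which makes the idea easy to sketch. The code below is an illustration under stated assumptions: a scalar, evenly sampled series, the biased autocovariance estimator, and a made-up five-point sequence; it is not the authors' implementation or data.

```python
def autocovariance(series, lag):
    """Biased sample autocovariance of a series at a given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    return sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    ) / n


def fit_first_order(series):
    """Yule-Walker estimate for order 1: phi = r(1) / r(0)."""
    return autocovariance(series, 1) / autocovariance(series, 0)


def predict_next(series, phi):
    """One-step prediction from the current value only (first-order model)."""
    mean = sum(series) / len(series)
    return mean + phi * (series[-1] - mean)


# Hypothetical sequence of successive position measurements along one axis.
positions = [1.0, 2.0, 3.0, 4.0, 5.0]
phi = fit_first_order(positions)
print(phi, predict_next(positions, phi))
```

A p-order model would instead solve a p-by-p linear system built from the autocovariances at lags 0 to p; the first-order form is shown because the abstract reports that it was sufficient in the simulations.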
Random measurement error: Why worry? An example of cardiovascular risk factors.

PubMed

Brakenhoff, Timo B; van Smeden, Maarten; Visseren, Frank L J; Groenwold, Rolf H H

2018-01-01

With the increased use of data not originally recorded for research, such as routine care data (or 'big data'), measurement error is bound to become an increasingly relevant problem in medical research. A common view among medical researchers on the influence of random measurement error (i.e. classical measurement error) is that its presence leads to some degree of systematic underestimation of studied exposure-outcome relations (i.e. attenuation of the effect estimate). For the common situation where the analysis involves at least one exposure and one confounder, we demonstrate that the direction of effect of random measurement error on the estimated exposure-outcome relations can be difficult to anticipate. Using three example studies on cardiovascular risk factors, we illustrate that random measurement error in the exposure and/or confounder can lead to underestimation as well as overestimation of exposure-outcome relations. We therefore advise medical researchers to refrain from making claims about the direction of effect of measurement error in their manuscripts, unless the appropriate inferential tools are used to study or alleviate the impact of measurement error from the analysis.
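The point that random error can bias an estimate in either direction can be illustrated with a small simulation. This is a sketch under assumed values (a standard-normal exposure and confounder with correlation 0.6, true exposure effect 0.2, confounder effect 1.0, unit error variance); none of these numbers or names come from the three example studies.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000
rho, beta_x, beta_z = 0.6, 0.2, 1.0  # assumed values, for illustration only

x = rng.normal(size=n)                                  # true exposure
z = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)  # true confounder
y = beta_x * x + beta_z * z + rng.normal(size=n)        # outcome

def exposure_coef(exposure, confounder, outcome):
    """Least-squares coefficient of the exposure, adjusted for the confounder."""
    design = np.column_stack([np.ones_like(exposure), exposure, confounder])
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coefs[1]

x_noisy = x + rng.normal(size=n)  # classical error in the exposure
z_noisy = z + rng.normal(size=n)  # classical error in the confounder

no_error = exposure_coef(x, z, y)
error_in_exposure = exposure_coef(x_noisy, z, y)
error_in_confounder = exposure_coef(x, z_noisy, y)
print(no_error, error_in_exposure, error_in_confounder)
```

Under these assumptions, error in the exposure pulls the estimate below the true 0.2, while error in the confounder leaves residual confounding and pushes the estimate above it.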
Descriptions of verbal communication errors between staff. An analysis of 84 root cause analysis-reports from Danish hospitals.

PubMed

Rabøl, Louise Isager; Andersen, Mette Lehmann; Østergaard, Doris; Bjørn, Brian; Lilja, Beth; Mogensen, Torben

2011-03-01

Poor teamwork and communication between healthcare staff are correlated with patient safety incidents. However, the organisational factors responsible for these issues are unexplored. Root cause analyses (RCA) use human factors thinking to analyse the systems behind severe patient safety incidents. The objective of this study is to review RCA reports (RCAR) for characteristics of verbal communication errors between hospital staff in an organisational perspective. Two independent raters analysed 84 RCARs, conducted in six Danish hospitals between 2004 and 2006, for descriptions and characteristics of verbal communication errors such as handover errors and errors during teamwork. Raters found descriptions of verbal communication errors in 44 reports (52%). These included handover errors (35 (86%)), communication errors between different staff groups (19 (43%)), misunderstandings (13 (30%)), communication errors between junior and senior staff members (11 (25%)), hesitance in speaking up (10 (23%)) and communication errors during teamwork (8 (18%)). The kappa values were 0.44-0.78. Unproceduralized communication and information exchange via telephone, related to transfer between units and consults from other specialties, were particularly vulnerable processes. With the risk of bias in mind, it is concluded that more than half of the RCARs described erroneous verbal communication between staff members as root causes of or contributing factors to severe patient safety incidents. The RCARs' rich descriptions of the incidents revealed the organisational factors and needs related to these errors.
How reliable are Functional Movement Screening scores? A systematic review of rater reliability.

PubMed

Moran, Robert W; Schneiders, Anthony G; Major, Katherine M; Sullivan, S John

2016-05-01

Several physical assessment protocols to identify intrinsic risk factors for injury aetiology related to movement quality have been described. The Functional Movement Screen (FMS) is a standardised, field-expedient test battery intended to assess movement quality and has been used clinically in preparticipation screening and in sports injury research. To critically appraise and summarise research investigating the reliability of scores obtained using the FMS battery. Systematic literature review. Systematic search of Google Scholar, Scopus (including ScienceDirect and PubMed), EBSCO (including Academic Search Complete, AMED, CINAHL, Health Source: Nursing/Academic Edition), MEDLINE and SPORTDiscus. Studies meeting eligibility criteria were assessed by 2 reviewers for risk of bias using the Quality Appraisal of Reliability Studies checklist. Overall quality of evidence was determined using van Tulder's levels of evidence approach. 12 studies were appraised. Overall, there was a 'moderate' level of evidence in favour of 'acceptable' (intraclass correlation coefficient ≥0.6) inter-rater and intra-rater reliability for composite scores derived from live scoring. For inter-rater reliability of composite scores derived from video recordings there was 'conflicting' evidence, and 'limited' evidence for intra-rater reliability. For inter-rater reliability based on live scoring of individual subtests there was 'moderate' evidence of 'acceptable' reliability (κ≥0.4) for 4 subtests (Deep Squat, Shoulder Mobility, Active Straight-leg Raise, Trunk Stability Push-up) and 'conflicting' evidence for the remaining 3 (Hurdle Step, In-line Lunge, Rotary Stability). This review found 'moderate' evidence that raters can achieve acceptable levels of inter-rater and intra-rater reliability of composite FMS scores when using live ratings. Overall, there were few high-quality studies, and the quality of several studies was impacted by poor study reporting particularly in relation to
Investigating Differences between American and Indian Raters in Assessing TOEFL iBT Speaking Tasks

ERIC Educational Resources Information Center

Wei, Jing; Llosa, Lorena

2015-01-01

This article reports on an investigation of the role raters' language background plays in raters' assessment of test takers' speaking ability. Specifically, this article examines differences between American and Indian raters in their scores and scoring processes when rating Indian test takers' responses to the Test of English as a Foreign…
Inter-rater reliability of three musculoskeletal physical examination techniques used to assess motion in three planes while standing.

PubMed

Prather, Heidi; Hunt, Devyani; Steger-May, Karen; Hayes, Marcie Harris; Knaus, Evan; Clohisy, John

2009-07-01

The objective of the study was to measure the reliability between examiners of 3 basic maneuvers of the Total Body Functional Profile physical examination test. The hypothesis was that musculoskeletal health care providers of different disciplines could reliably use the 3 basic maneuvers as part of the musculoskeletal physical examination. A prospective observational study was conducted. Twenty-eight adult volunteers were measured on both the left and right side by 2 independent raters on a single occasion. The subjects were recruited through advertisements placed by the orthopedic department at a tertiary university. Twenty-eight volunteers were recruited and completed the study. The volunteers were between 18 and 51 years of age, had no symptoms in the lower extremity or spine, had no previous history of surgery or tumor involving the lower extremity, and no medical conditions that would preclude participation. On a single occasion, 2 examiners per 1 volunteer were blinded to their own and each other's measurements. Each examiner assessed the distance of frontal and sagittal plane lunge and angle of motion for transverse plane testing. Inter-rater agreement is expressed with intraclass correlation coefficients (ICCs) and corresponding 95% confidence intervals (CIs). The difference between raters is reported with 95% CIs. Baseline demographics, University of California Los Angeles (UCLA), and Harris hip questionnaires were completed by all participants. The UCLA and Harris hip scores showed no significant activity restrictions or pain limitations in all participants. The inter-rater reliability for sagittal, frontal, and transverse plane matrix testing was good with ICCs of 0.86 (95% CI 0.77-0.91), 0.90 (95% CI 0.84-0.94), and 0.85 (95% CI 0.75-0.91), respectively. The rater reliability between disciplines for transverse, sagittal, and frontal plane matrix testing was good with ICCs of 0.89 (95% CI 0.80-0.94), 0.88 (95% CI 0.79-0.94), and 0.90 (95% CI 0
New Gear Transmission Error Measurement System Designed

NASA Technical Reports Server (NTRS)

Oswald, Fred B.

2001-01-01

The prime source of vibration and noise in a gear system is the transmission error between the meshing gears. Transmission error is caused by manufacturing inaccuracy, mounting errors, and elastic deflections under load. Gear designers often attempt to compensate for transmission error by modifying gear teeth. This is done traditionally by a rough "rule of thumb" or more recently under the guidance of an analytical code. In order for a designer to have confidence in a code, the code must be validated through experiment. NASA Glenn Research Center contracted with the Design Unit of the University of Newcastle in England for a system to measure the transmission error of spur and helical test gears in the NASA Gear Noise Rig. The new system measures transmission error optically by means of light beams directed by lenses and prisms through gratings mounted on the gear shafts. The amount of light that passes through both gratings is directly proportional to the transmission error of the gears. A photodetector circuit converts the light to an analog electrical signal. To increase accuracy and reduce "noise" due to transverse vibration, there are parallel light paths at the top and bottom of the gears. The two signals are subtracted via differential amplifiers in the electronics package. The output of the system is 40 mV/mm, giving a resolution in the time domain of better than 0.1 mm, and discrimination in the frequency domain of better than 0.01 mm. The new system will be used to validate gear analytical codes and to investigate mechanisms that produce vibration and noise in parallel axis gears.
Two Models of Raters in a Structured Oral Examination: Does It Make a Difference?

ERIC Educational Resources Information Center

Touchie, Claire; Humphrey-Murto, Susan; Ainslie, Martha; Myers, Kathryn; Wood, Timothy J.

2010-01-01

Oral examinations have become more standardized over recent years. Traditionally a small number of raters were used for this type of examination. Past studies suggested that more raters should improve reliability. We compared the results of a multi-station structured oral examination using two different rater models, those based in a station,…
Rater reliability and concurrent validity of the Keyboard Personal Computer Style instrument (K-PeCS).

PubMed

Baker, Nancy A; Cook, James R; Redfern, Mark S

2009-01-01

This paper describes the inter-rater and intra-rater reliability, and the concurrent validity of an observational instrument, the Keyboard Personal Computer Style instrument (K-PeCS), which assesses stereotypical postures and movements associated with computer keyboard use. Three trained raters independently rated the video clips of 45 computer keyboard users to ascertain inter-rater reliability, and then re-rated a sub-sample of 15 video clips to ascertain intra-rater reliability. Concurrent validity was assessed by comparing the ratings obtained using the K-PeCS to scores developed from a 3D motion analysis system. The overall K-PeCS had excellent reliability [inter-rater: intra-class correlation coefficients (ICC)=.90; intra-rater: ICC=.92]. Most individual items on the K-PeCS had from good to excellent reliability, although six items fell below ICC=.75. Those K-PeCS items that were assessed for concurrent validity compared favorably to the motion analysis data for all but two items. These results suggest that most items on the K-PeCS can be used to reliably document computer keyboarding style.

Reliability of clinical impact grading by healthcare professionals of common prescribing error and optimisation cases in critical care patients.

PubMed

Bourne, Richard S; Shulman, Rob; Tomlin, Mark; Borthwick, Mark; Berry, Will; Mills, Gary H

2017-04-01

To identify between and within profession-rater reliability of clinical impact grading for common critical care prescribing error and optimisation cases. To identify representative clinical impact grades for each individual case. Electronic questionnaire. 5 UK NHS Trusts. 30 Critical care healthcare professionals (doctors, pharmacists and nurses). Participants graded severity of clinical impact (5-point categorical scale) of 50 error and 55 optimisation cases. Case between and within profession-rater reliability and modal clinical impact grading. Between and within profession rater reliability analysis used linear mixed model and intraclass correlation, respectively. The majority of error and optimisation cases (both 76%) had a modal clinical severity grade of moderate or higher. Error cases: doctors graded clinical impact significantly lower than pharmacists (-0.25; P < 0.001) and nurses (-0.53; P < 0.001), with nurses significantly higher than pharmacists (0.28; P < 0.001). Optimisation cases: doctors graded clinical impact significantly lower than nurses and pharmacists (-0.39 and -0.5; P < 0.001, respectively). Within profession reliability grading was excellent for pharmacists (0.88 and 0.89; P < 0.001) and doctors (0.79 and 0.83; P < 0.001) but only fair to good for nurses (0.43 and 0.74; P < 0.001), for optimisation and error cases, respectively. Representative clinical impact grades for over 100 common prescribing error and optimisation cases are reported for potential clinical practice and research application. The between professional variability highlights the importance of multidisciplinary perspectives in assessment of medication error and optimisation cases in clinical practice and research. © The Author 2017. Published by Oxford University Press in association with the International Society for Quality in Health Care. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
Integration of Error Compensation of Coordinate Measuring Machines into Feature Measurement: Part I—Model Development

PubMed Central

Calvo, Roque; D’Amato, Roberto; Gómez, Emilio; Domingo, Rosario

2016-01-01

The development of an error compensation model for coordinate measuring machines (CMMs) and its integration into feature measurement is presented. CMMs are widespread and dependable instruments in industry and laboratories for dimensional measurement. From the tip probe sensor to the machine display, there is a complex transformation of probed point coordinates through the geometrical feature model that makes the assessment of the accuracy and uncertainty of measurement results difficult. Therefore, error compensation is not standardized, unlike for other, simpler instruments. Detailed coordinate error compensation models are generally based on treating the CMM as a rigid body, and they require a detailed mapping of the CMM’s behavior. In this paper a new type of error compensation model is proposed. It evaluates the error from the vectorial composition of length error by axis and its integration into the geometrical measurement model. The variability not explained by the model is incorporated into the uncertainty budget. Model parameters are analyzed and linked to the geometrical errors and uncertainty of CMM response. Next, the outstanding measurement models of flatness, angle, and roundness are developed. The proposed models are useful for measurement improvement with easy integration into CMM signal processing, in particular in industrial environments where built-in solutions are sought. A battery of implementation tests is presented in Part II, where the experimental endorsement of the model is included. PMID:27690052
Specific agreement on dichotomous outcomes can be calculated for more than two raters.

PubMed

de Vet, Henrica C W; Dikmans, Rieky E; Eekhout, Iris

2017-03-01

For assessing interrater agreement, the concepts of observed agreement and specific agreement have been proposed. The situation of two raters and dichotomous outcomes has been described, whereas often, multiple raters are involved. We aim to extend it for more than two raters and examine how to calculate agreement estimates and 95% confidence intervals (CIs). As an illustration, we used a reliability study that includes the scores of four plastic surgeons classifying photographs of breasts of 50 women after breast reconstruction into "satisfied" or "not satisfied." In a simulation study, we checked the hypothesized sample size for calculation of 95% CIs. For m raters, all pairwise tables [ie, m (m - 1)/2] were summed. Then, the discordant cells were averaged before observed and specific agreements were calculated. The total number (N) in the summed table is m (m - 1)/2 times larger than the number of subjects (n), in the example, N = 300 compared to n = 50 subjects times m = 4 raters. A correction of n√(m - 1) was appropriate to find 95% CIs comparable to bootstrapped CIs. The concept of observed agreement and specific agreement can be extended to more than two raters with a valid estimation of the 95% CIs. Copyright © 2017 Elsevier Inc. All rights reserved.
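The pooling of pairwise tables described in this abstract can be sketched as follows. This is a minimal illustration assuming the usual definitions of observed agreement, (a + d)/N, and of positive and negative specific agreement, 2a/(2a + b + c) and 2d/(2d + b + c); the function name and the toy ratings are hypothetical, and the confidence-interval correction mentioned in the abstract is not implemented.

```python
from itertools import combinations

def pooled_agreement(ratings):
    """Observed and specific agreement for m raters and dichotomous (0/1) scores.

    ratings: one row per subject, one 0/1 score per rater.
    All m(m - 1)/2 pairwise 2x2 tables are summed into a single table.
    """
    m = len(ratings[0])
    a = b = c = d = 0
    for row in ratings:
        for i, j in combinations(range(m), 2):
            if row[i] == 1 and row[j] == 1:
                a += 1   # both positive
            elif row[i] == 1 and row[j] == 0:
                b += 1   # discordant
            elif row[i] == 0 and row[j] == 1:
                c += 1   # discordant
            else:
                d += 1   # both negative
    total = a + b + c + d              # N = n * m(m - 1)/2
    discordant = (b + c) / 2           # averaged discordant cell
    return {
        "N": total,
        "observed": (a + d) / total,
        "positive": 2 * a / (2 * a + 2 * discordant),
        "negative": 2 * d / (2 * d + 2 * discordant),
    }

# Toy data: 4 subjects scored by 4 raters (1 = "satisfied", 0 = "not satisfied").
toy = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
result = pooled_agreement(toy)
print(result)
```

With n = 50 subjects and m = 4 raters the summed table would contain 50 × 6 = 300 entries, the N reported in the abstract.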
The Effects of Primacy on Rater Cognition: An Eye-Tracking Study

ERIC Educational Resources Information Center

Ballard, Laura

2017-01-01

Rater scoring has an impact on writing test reliability and validity. Thus, there has been a continued call for researchers to investigate issues related to rating (Crusan, 2015). Investigating the scoring process and understanding how raters arrive at particular scores are critical "because the score is ultimately what will be used in making…
Interval sampling methods and measurement error: a computer simulation.

PubMed

Wirth, Oliver; Slaven, James; Taylor, Matthew A

2014-01-01

A simulation study was conducted to provide a more thorough account of measurement error associated with interval sampling methods. A computer program simulated the application of momentary time sampling, partial-interval recording, and whole-interval recording methods on target events randomly distributed across an observation period. The simulation yielded measures of error for multiple combinations of observation period, interval duration, event duration, and cumulative event duration. The simulations were conducted up to 100 times to yield measures of error variability. Although the present simulation confirmed some previously reported characteristics of interval sampling methods, it also revealed many new findings that pertain to each method's inherent strengths and weaknesses. The analysis and resulting error tables can help guide the selection of the most appropriate sampling method for observation-based behavioral assessments. © Society for the Experimental Analysis of Behavior.
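A simulation of this kind can be sketched in a few lines. This is a minimal illustration, assuming a one-second time grid, fixed-length events placed at random (overlapping events simply merge), and the conventional scoring rules for the three methods; all function names and parameter values are hypothetical and are not those of the published program.

```python
import numpy as np

def simulate_session(period, event_len, n_events, rng):
    """Boolean timeline (one entry per second) with randomly placed events."""
    occupied = np.zeros(period, dtype=bool)
    starts = rng.integers(0, period - event_len + 1, size=n_events)
    for s in starts:
        occupied[s : s + event_len] = True
    return occupied

def interval_estimates(occupied, interval):
    """Proportion of intervals scored under each sampling method."""
    n_int = len(occupied) // interval
    blocks = occupied[: n_int * interval].reshape(n_int, interval)
    return {
        "momentary": blocks[:, -1].mean(),     # event present at the end of the interval
        "partial": blocks.any(axis=1).mean(),  # event present at any moment
        "whole": blocks.all(axis=1).mean(),    # event present throughout
    }

rng = np.random.default_rng(7)
timeline = simulate_session(period=3600, event_len=20, n_events=30, rng=rng)
true_proportion = timeline.mean()
estimates = interval_estimates(timeline, interval=15)
errors = {name: value - true_proportion for name, value in estimates.items()}
print(true_proportion, errors)
```

By construction, whole-interval recording can never exceed the true proportion of time occupied and partial-interval recording can never fall below it, while momentary time sampling can err in either direction. Repeating the run over many seeds would give the kind of error variability the abstract describes.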
Incorporating Measurement Error from Modeled Air Pollution Exposures into Epidemiological Analyses.

PubMed

Samoli, Evangelia; Butland, Barbara K

2017-12-01

Outdoor air pollution exposures used in epidemiological studies are commonly predicted from spatiotemporal models incorporating limited measurements, temporal factors, geographic information system variables, and/or satellite data. Measurement error in these exposure estimates leads to imprecise estimation of health effects and their standard errors. We reviewed methods for measurement error correction that have been applied in epidemiological studies that use model-derived air pollution data. We identified seven cohort studies and one panel study that have employed measurement error correction methods. These methods included regression calibration, risk set regression calibration, regression calibration with instrumental variables, the simulation extrapolation approach (SIMEX), and methods under the non-parametric or parametric bootstrap. Corrections resulted in small increases in the absolute magnitude of the health effect estimate and its standard error under most scenarios. Limited application of measurement error correction methods in air pollution studies may be attributed to the absence of exposure validation data and the methodological complexity of the proposed methods. Future epidemiological studies should consider in their design phase the requirements for the measurement error correction method to be later applied, while methodological advances are needed under the multi-pollutants setting.
Improved characterisation of measurement errors in electrical resistivity tomography (ERT) surveys

NASA Astrophysics Data System (ADS)

Tso, C. H. M.; Binley, A. M.; Kuras, O.; Graham, J.

2016-12-01

Measurement errors can play a pivotal role in geophysical inversion. Most inverse models require users to prescribe a statistical model of data errors before inversion. Wrongly prescribed error levels can lead to over- or under-fitting of data, yet commonly used models of measurement error are relatively simplistic. With the heightened interest in uncertainty estimation across hydrogeophysics, better characterisation and treatment of measurement errors is needed to provide more reliable estimates of uncertainty. We have analysed two time-lapse electrical resistivity tomography (ERT) datasets; one contains 96 sets of direct and reciprocal data collected from a surface ERT line within a 24h timeframe, while the other is a year-long cross-borehole survey at a UK nuclear site with over 50,000 daily measurements. Our study included the characterisation of the spatial and temporal behaviour of measurement errors using autocorrelation and covariance analysis. We find that, in addition to well-known proportionality effects, ERT measurements can also be sensitive to the combination of electrodes used. This agrees with reported speculation in previous literature that ERT errors could be somewhat correlated. Based on these findings, we develop a new error model that allows grouping based on electrode number in addition to fitting a linear model to transfer resistance. The new model fits the observed measurement errors better and shows superior inversion and uncertainty estimates in synthetic examples. It is robust, because it groups errors together based on the number of the four electrodes used to make each measurement. The new model can be readily applied to the diagonal data weighting matrix commonly used in classical inversion methods, as well as to the data covariance matrix in the Bayesian inversion framework. We demonstrate its application using extensive ERT monitoring datasets from the two aforementioned sites.
Reliability and diagnostic characteristics of the JFK coma recovery scale-revised: exploring the influence of rater's level of experience.

PubMed

Løvstad, Marianne; Frøslie, Kathrine F; Giacino, Joseph T; Skandsen, Toril; Anke, Audny; Schanke, Anne-Kristine

2010-01-01

To confirm the reliability and diagnostic validity of the JFK Coma Recovery Scale-Revised (CRS-R) across raters with varying levels of experience. Thirty-one patients with disorders of consciousness were recruited from 6 Norwegian hospitals. CRS-R and the Disability Rating Scale. Reliability measures were good for the CRS-R total scores and moderate to good for its subscales. Diagnostic agreement among examiners was good. Raters' experience with the CRS-R favorably influenced reliability. Sensitivity and specificity analyses demonstrated better detection of patients in minimally conscious state on the CRS-R relative to the Disability Rating Scale. The CRS-R is a reliable tool for diagnosing vegetative state and minimally conscious state. Raters' level of experience influences the reliability of the CRS-R scores.
The impact of response measurement error on the analysis of designed experiments

DOE PAGES

Anderson-Cook, Christine Michaela; Hamada, Michael Scott; Burr, Thomas Lee

2016-11-01

This study considers the analysis of designed experiments when there is measurement error in the true response or so-called response measurement error. We consider both additive and multiplicative response measurement errors. Through a simulation study, we investigate the impact of ignoring the response measurement error in the analysis, that is, by using a standard analysis based on t-tests. In addition, we examine the role of repeat measurements in improving the quality of estimation and prediction in the presence of response measurement error. We also study a Bayesian approach that accounts for the response measurement error directly through the specification of the model, and allows including additional information about variability in the analysis. We consider the impact on power, prediction, and optimization. Copyright © 2015 John Wiley & Sons, Ltd.
The impact of response measurement error on the analysis of designed experiments

DOE Office of Scientific and Technical Information (OSTI.GOV)

Anderson-Cook, Christine Michaela; Hamada, Michael Scott; Burr, Thomas Lee

This study considers the analysis of designed experiments when there is measurement error in the true response or so-called response measurement error. We consider both additive and multiplicative response measurement errors. Through a simulation study, we investigate the impact of ignoring the response measurement error in the analysis, that is, by using a standard analysis based on t-tests. In addition, we examine the role of repeat measurements in improving the quality of estimation and prediction in the presence of response measurement error. We also study a Bayesian approach that accounts for the response measurement error directly through the specification of the model, and allows including additional information about variability in the analysis. We consider the impact on power, prediction, and optimization. Copyright © 2015 John Wiley & Sons, Ltd.
An in-situ measuring method for planar straightness error

NASA Astrophysics Data System (ADS)

Chen, Xi; Fu, Luhua; Yang, Tongyu; Sun, Changku; Wang, Zhong; Zhao, Yan; Liu, Changjie

2018-01-01

To address some current problems in measuring the plane shape error of a workpiece, an in-situ measuring method based on laser triangulation is presented in this paper. The method avoids the inefficiency of traditional methods such as the knife straightedge, as well as the time and cost requirements of a coordinate measuring machine (CMM). A laser-based measuring head is designed and installed on the spindle of a numerical control (NC) machine. The measuring head moves along a planned path to measure the measuring points. The spatial coordinates of the measuring points are obtained by combining the laser triangulation displacement sensor with the coordinate system of the NC machine, which makes it possible to obtain the measurement indicators. Planar straightness error is evaluated using particle swarm optimization (PSO). To verify the feasibility and accuracy of the measuring method, simulation experiments were implemented with a CMM. Comparing the measurement results of the measuring head with the corresponding values obtained by a composite measuring machine verifies that the method can realize high-precision, automatic measurement of the planar straightness error of the workpiece.
A toolkit for measurement error correction, with a focus on nutritional epidemiology

PubMed Central

Keogh, Ruth H; White, Ian R

2014-01-01

Exposure measurement error is a problem in many epidemiological studies, including those using biomarkers and measures of dietary intake. Measurement error typically results in biased estimates of exposure-disease associations, the severity and nature of the bias depending on the form of the error. To correct for the effects of measurement error, information additional to the main study data is required. Ideally, this is a validation sample in which the true exposure is observed. However, in many situations, it is not feasible to observe the true exposure, but there may be available one or more repeated exposure measurements, for example, blood pressure or dietary intake recorded at two time points. The aim of this paper is to provide a toolkit for measurement error correction using repeated measurements. We bring together methods covering classical measurement error and several departures from classical error: systematic, heteroscedastic and differential error. The correction methods considered are regression calibration, which is already widely used in the classical error setting, and moment reconstruction and multiple imputation, which are newer approaches with the ability to handle differential error. We emphasize practical application of the methods in nutritional epidemiology and other fields. We primarily consider continuous exposures in the exposure-outcome model, but we also outline methods for use when continuous exposures are categorized. The methods are illustrated using the data from a study of the association between fibre intake and colorectal cancer, where fibre intake is measured using a diet diary and repeated measures are available for a subset. © 2014 The Authors. PMID:24497385
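Regression calibration with repeated measurements, as described in this abstract, can be sketched on simulated data. This is a minimal illustration assuming classical, non-differential error, two replicate measurements on every subject and a linear outcome model; the variable names and all parameter values are hypothetical and unrelated to the fibre intake study.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
beta = 0.5                        # assumed true exposure effect

x = rng.normal(size=n)            # true exposure (unobserved in practice)
w1 = x + rng.normal(size=n)       # first error-prone measurement
w2 = x + rng.normal(size=n)       # repeat measurement
y = beta * x + rng.normal(size=n)

def slope(u, v):
    """Simple linear regression slope of v on u."""
    return np.cov(u, v)[0, 1] / np.var(u, ddof=1)

w_bar = (w1 + w2) / 2
naive = slope(w_bar, y)           # attenuated by measurement error

# Error variance from the replicate differences: Var(w1 - w2) = 2 * sigma_u^2.
sigma_u2 = np.var(w1 - w2, ddof=1) / 2
# Reliability ratio of the replicate mean: Var(x) / Var(w_bar).
var_w_bar = np.var(w_bar, ddof=1)
reliability = (var_w_bar - sigma_u2 / 2) / var_w_bar

corrected = naive / reliability
print(naive, reliability, corrected)
```

In this simple setting, dividing the naive slope by the estimated reliability ratio is equivalent to regressing the outcome on the calibrated exposure E[X | W]. A real analysis would also need to propagate the uncertainty in the reliability estimate into the standard error, which this sketch does not do.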
Content Validity of a Tool Measuring Medication Errors.

PubMed

Tabassum, Nishat; Allana, Saleema; Saeed, Tanveer; Dias, Jacqueline Maria

2015-08-01

The objective of this study was to determine the content and face validity of a tool measuring medication errors among nursing students in baccalaureate nursing education. Data were collected from the Aga Khan University School of Nursing and Midwifery (AKUSoNaM), Karachi, from March to August 2014. The tool was developed utilizing the literature and the expertise of the team members, who were experts in different areas. The developed tool was then sent to five experts from all over Karachi to ensure the content validity of the tool, which was measured on the relevance and clarity of the questions. The Scale Content Validity Index (S-CVI) for clarity and relevance of the questions was found to be 0.94 and 0.98, respectively. The tool measuring medication errors has excellent content validity. This tool should be used for future studies on medication errors, with different study populations such as medical students, doctors, and nurses.
Measurement uncertainty relations: characterising optimal error bounds for qubits

NASA Astrophysics Data System (ADS)

Bullock, T.; Busch, P.

2018-07-01

In standard formulations of the uncertainty principle, two fundamental features are typically cast as impossibility statements: two noncommuting observables cannot in general both be sharply defined (for the same state), nor can they be measured jointly. The pioneers of quantum mechanics were acutely aware of and puzzled by this fact, and it motivated Heisenberg to seek a mitigation, which he formulated in his seminal paper of 1927. He provided intuitive arguments to show that the values of, say, the position and momentum of a particle can at least be unsharply defined, and they can be measured together provided some approximation errors are allowed. Only now, nine decades later, is a working theory of approximate joint measurements taking shape, leading to rigorous and experimentally testable formulations of associated error tradeoff relations. Here we briefly review this new development, explaining the concepts and steps taken in the construction of optimal joint approximations of pairs of incompatible observables. As a case study, we deduce measurement uncertainty relations for qubit observables using two distinct error measures. We provide an operational interpretation of the error bounds and discuss some of the first experimental tests of such relations.
Systematic reviews need to consider applicability to disadvantaged populations: inter-rater agreement for a health equity plausibility algorithm

PubMed Central

2012-01-01

Background: Systematic reviews have been challenged to consider effects on disadvantaged groups. A priori specification of subgroup analyses is recommended to increase the credibility of these analyses. This study aimed to develop and assess inter-rater agreement for an algorithm for systematic review authors to predict whether differences in effect measures are likely for disadvantaged populations relative to advantaged populations (only relative effect measures were addressed). Methods: A health equity plausibility algorithm was developed using clinimetric methods with three items based on literature review, key informant interviews and methodology studies. The three items dealt with the plausibility of differences in relative effects across sex or socioeconomic status (SES) due to: 1) patient characteristics; 2) intervention delivery (i.e., implementation); and 3) comparators. Thirty-five respondents (consisting of clinicians, methodologists and research users) assessed the likelihood of differences across sex and SES for ten systematic reviews with these questions. We assessed inter-rater reliability using Fleiss multi-rater kappa. Results: The proportion agreement was 66% for patient characteristics (95% confidence interval: 61%-71%), 67% for intervention delivery (95% confidence interval: 62% to 72%) and 55% for the comparator (95% confidence interval: 50% to 60%). Inter-rater kappa, assessed with Fleiss kappa, ranged from 0 to 0.199, representing very low agreement beyond chance. Conclusions: Users of systematic reviews rated that important differences in relative effects across sex and socioeconomic status were plausible for a range of individual and population-level interventions. However, there was very low inter-rater agreement for these assessments. There is an unmet need for discussion of plausibility of differential effects in systematic reviews. Increased consideration of external validity and applicability to different populations and settings is
Systematic reviews need to consider applicability to disadvantaged populations: inter-rater agreement for a health equity plausibility algorithm.

PubMed

Welch, Vivian; Brand, Kevin; Kristjansson, Elizabeth; Smylie, Janet; Wells, George; Tugwell, Peter

2012-12-19

Systematic reviews have been challenged to consider effects on disadvantaged groups. A priori specification of subgroup analyses is recommended to increase the credibility of these analyses. This study aimed to develop and assess inter-rater agreement for an algorithm for systematic review authors to predict whether differences in effect measures are likely for disadvantaged populations relative to advantaged populations (only relative effect measures were addressed). A health equity plausibility algorithm was developed using clinimetric methods with three items based on literature review, key informant interviews and methodology studies. The three items dealt with the plausibility of differences in relative effects across sex or socioeconomic status (SES) due to: 1) patient characteristics; 2) intervention delivery (i.e., implementation); and 3) comparators. Thirty-five respondents (consisting of clinicians, methodologists and research users) assessed the likelihood of differences across sex and SES for ten systematic reviews with these questions. We assessed inter-rater reliability using Fleiss multi-rater kappa. The proportion agreement was 66% for patient characteristics (95% confidence interval: 61%-71%), 67% for intervention delivery (95% confidence interval: 62% to 72%) and 55% for the comparator (95% confidence interval: 50% to 60%). Inter-rater kappa, assessed with Fleiss kappa, ranged from 0 to 0.199, representing very low agreement beyond chance. Users of systematic reviews rated that important differences in relative effects across sex and socioeconomic status were plausible for a range of individual and population-level interventions. However, there was very low inter-rater agreement for these assessments. There is an unmet need for discussion of plausibility of differential effects in systematic reviews. Increased consideration of external validity and applicability to different populations and settings is warranted in systematic reviews to meet this
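As a rough illustration of the statistic named in this record, the following Python sketch computes Fleiss' multi-rater kappa from raw labels. The function name `fleiss_kappa` and the toy ratings are hypothetical; they are not the study's data or code, and the sketch assumes every item is rated by the same number of raters.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category labels
    (one label per rater). Assumes the same number of raters per item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(row) for row in ratings]
    categories = {label for row in ratings for label in row}

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the pooled category proportions.
    total = n_items * n_raters
    p_expected = sum((sum(c[k] for c in counts) / total) ** 2 for k in categories)

    if p_expected == 1.0:
        return float("nan")  # only one category was used: kappa is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (hypothetical): 4 reviews rated by 3 raters.
toy = [
    ["likely", "likely", "unlikely"],
    ["unlikely", "unlikely", "unlikely"],
    ["likely", "unlikely", "likely"],
    ["likely", "likely", "likely"],
]
print(round(fleiss_kappa(toy), 3))
```

Raw proportion agreement can look respectable while kappa stays near zero, because kappa subtracts the agreement expected from the pooled category frequencies alone.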
Comparing Native and Non-Native Raters of US Federal Government Speaking Tests

ERIC Educational Resources Information Center

Brooks, Rachel Lunde

2013-01-01

Previous Language Testing research has largely reported that although many raters' characteristics affect their evaluations of language assessments (Reed & Cohen, 2001), being a native speaker or non-native speaker rater does not significantly affect final ratings (Kim, 2009). In Second Language Acquisition, some researchers conclude that…
Qualitative analysis of MMI raters' scorings of medical school candidates: A matter of taste?

PubMed

Christensen, Mette K; Lykkegaard, Eva; Lund, Ole; O'Neill, Lotte D

2018-05-01

Recent years have seen leading medical educationalists repeatedly call for a paradigm shift in the way we view, value and use subjectivity in assessment. The argument is that subjective expert raters generally bring desired quality, not just noise, to performance evaluations. While several reviews document the psychometric qualities of the Multiple Mini-Interview (MMI), we currently lack qualitative studies examining what we can learn from MMI raters' subjectivity. The present qualitative study therefore investigates rater subjectivity or taste in MMI selection interview. Taste (Bourdieu 1984) is a practical sense, which makes it possible at a pre-reflective level to apply 'invisible' or 'tacit' categories of perception for distinguishing between good and bad. The study draws on data from explorative in-depth interviews with 12 purposefully selected MMI raters. We find that MMI raters spontaneously applied subjective criteria-their taste-enabling them to assess the candidates' interpersonal attributes and to predict the candidates' potential. In addition, MMI raters seemed to share a taste for certain qualities in the candidates (e.g. reflectivity, resilience, empathy, contact, alikeness, 'the good colleague'); hence, taste may be the result of an ongoing enculturation in medical education and healthcare systems. This study suggests that taste is an inevitable condition in the assessment of students' performance. The MMI set-up should therefore make room for MMI raters' taste and their connoisseurship, i.e. their ability to taste, to improve the quality of their assessment of medical school candidates.
Backward-gazing method for measuring solar concentrators shape errors.

PubMed

Coquand, Mathieu; Henault, François; Caliot, Cyril

2017-03-01

This paper describes a backward-gazing method for measuring the optomechanical errors of solar concentrating surfaces. It makes use of four cameras placed near the solar receiver and simultaneously recording images of the sun reflected by the optical surfaces. Simple data processing then allows reconstructing the slope and shape errors of the surfaces. The originality of the method is enforced by the use of generalized quad-cell formulas and approximate mathematical relations between the slope errors of the mirrors and their reflected wavefront in the case of sun-tracking heliostats at high-incidence angles. Numerical simulations demonstrate that the measurement accuracy is compliant with standard requirements of solar concentrating optics in the presence of noise or calibration errors. The method is suited to fine characterization of the optical and mechanical errors of heliostats and their facets, or to provide better control for real-time sun tracking.
Assessing Agreement between Multiple Raters with Missing Rating Information, Applied to Breast Cancer Tumour Grading

PubMed Central

Ellis, Ian O.; Green, Andrew R.; Hanka, Rudolf

2008-01-01

Background We consider the problem of assessing inter-rater agreement when there are missing data and a large number of raters. Previous studies have shown only ‘moderate’ agreement between pathologists in grading breast cancer tumour specimens. We analyse a large but incomplete data-set consisting of 24177 grades, on a discrete 1–3 scale, provided by 732 pathologists for 52 samples. Methodology/Principal Findings We review existing methods for analysing inter-rater agreement for multiple raters and demonstrate two further methods. Firstly, we examine a simple non-chance-corrected agreement score based on the observed proportion of agreements with the consensus for each sample, which makes no allowance for missing data. Secondly, treating grades as lying on a continuous scale representing tumour severity, we use a Bayesian latent trait method to model cumulative probabilities of assigning grade values as functions of the severity and clarity of the tumour and of rater-specific parameters representing boundaries between grades 1–2 and 2–3. We simulate from the fitted model to estimate, for each rater, the probability of agreement with the majority. Both methods suggest that there are differences between raters in terms of rating behaviour, most often caused by consistent over- or under-estimation of the grade boundaries, and also considerable variability in the distribution of grades assigned to many individual samples. The Bayesian model addresses the tendency of the agreement score to be biased upwards for raters who, by chance, see a relatively ‘easy’ set of samples. Conclusions/Significance Latent trait models can be adapted to provide novel information about the nature of inter-rater agreement when the number of raters is large and there are missing data. In this large study there is substantial variability between pathologists and uncertainty in the identity of the ‘true’ grade of many of the breast cancer tumours, a fact often ignored in

Extinction measurements with low-power hsrl systems—error limits

NASA Astrophysics Data System (ADS)

Eloranta, Ed

2018-04-01

HSRL measurements of extinction are more difficult than backscatter measurements. This is particularly true for low-power, eye-safe systems. This paper looks at error sources that currently provide an error limit of 10^-5 m^-1 for boundary layer extinction measurements made with University of Wisconsin HSRL systems. These eye-safe systems typically use 300 mW transmitters and 40 cm diameter receivers with a 10^-4 radian field-of-view.
Reliability of Pain Measurements Using Computerized Cuff Algometry: A DoloCuff Reliability and Agreement Study.

PubMed

Kvistgaard Olsen, Jack; Fener, Dilay Kesgin; Waehrens, Eva Elisabet; Wulf Christensen, Anton; Jespersen, Anders; Danneskiold-Samsøe, Bente; Bartels, Else Marie

2017-07-01

Computerized pneumatic cuff pressure algometry (CPA) using the DoloCuff is a new method for pain assessment. Intra- and inter-rater reliabilities have not yet been established. Our aim was to examine the inter- and intrarater reliabilities of DoloCuff measures in healthy subjects. Twenty healthy subjects (ages 20 to 29 years) were assessed three times at 24-hour intervals by two trained raters. Inter-rater reliability was established based on the first and second assessments, whereas intrarater reliability was based on the second and third assessments. Subjects were randomized 1:1 to first assessment at either rater 1 or rater 2. The variables of interest were pressure pain threshold (PT), pressure pain tolerance (PTol), and temporal summation index (TSI). Reliability was estimated by a two-way mixed intraclass correlation coefficient (ICC) absolute agreement analysis. Reliability was considered excellent if ICC > 0.75, fair to good if 0.4 < ICC < 0.75, and poor if ICC < 0.4. Bias and random errors between raters and assessments were evaluated using 95% confidence interval (CI) and Bland-Altman plots. Inter-rater reliability for PT, PTol, and TSI was 0.88 (95% CI: 0.69 to 0.95), 0.86 (95% CI: 0.65 to 0.95), and 0.81 (95% CI: 0.42 to 0.94), respectively. The intrarater reliability for PT, PTol, and TSI was 0.81 (95% CI: 0.53 to 0.92), 0.89 (95% CI: 0.74 to 0.96), and 0.75 (95% CI: 0.28 to 0.91), respectively. Inter-rater reliability was excellent for PT, PTol, and TSI. Similarly, the intrarater reliability for PT and PTol was excellent, while borderline excellent/good for TSI. Therefore, the DoloCuff can be used to obtain reliable measures of pressure pain parameters in healthy subjects. © 2016 World Institute of Pain.
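The reliability estimate and the interpretation bands described in this record can be sketched in Python as follows. This is a minimal sketch, not the authors' analysis: it assumes the single-measure, absolute-agreement form of the two-way ICC (the abstract does not say whether single or average measures were used), the function names are hypothetical, and the label returned for values exactly on a cut-off is an assumption because the stated bands leave 0.4 and 0.75 unassigned.

```python
def icc_absolute_agreement(scores):
    """Single-measure, absolute-agreement ICC from a two-way layout.
    `scores` is a list of subjects, each a list with one score per rater."""
    n = len(scores)        # subjects
    k = len(scores[0])     # raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def classify_icc(icc):
    """Bands as stated in the abstract; exact cut-off values are labelled
    'borderline' here (an assumption)."""
    if icc > 0.75:
        return "excellent"
    if 0.4 < icc < 0.75:
        return "fair to good"
    if icc < 0.4:
        return "poor"
    return "borderline"


# Hypothetical pressure readings: 5 subjects, 2 raters.
readings = [[30.0, 32.0], [45.0, 44.0], [38.0, 41.0], [52.0, 50.0], [27.0, 29.0]]
icc = icc_absolute_agreement(readings)
print(round(icc, 2), classify_icc(icc))
```

Because the rater mean square enters the denominator, a systematic offset between raters lowers this ICC, which is what distinguishes absolute agreement from consistency.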
Error Analysis and Validation for Insar Height Measurement Induced by Slant Range

NASA Astrophysics Data System (ADS)

Zhang, X.; Li, T.; Fan, W.; Geng, X.

2018-04-01

The InSAR technique is an important method for large-area DEM extraction. Several factors have a significant influence on the accuracy of height measurement. In this research, the effect of slant range measurement on InSAR height measurement was analysed and discussed. Based on the theory of InSAR height measurement, the error propagation model was derived assuming no coupling among different factors, which directly characterises the relationship between slant range error and height measurement error. Then a theory-based analysis in combination with TanDEM-X parameters was implemented to quantitatively evaluate the influence of slant range error on height measurement. In addition, the simulation validation of the InSAR error model induced by slant range was performed on the basis of SRTM DEM and TanDEM-X parameters. The spatial distribution characteristics and error propagation rule of InSAR height measurement were further discussed and evaluated.
Correcting systematic errors in high-sensitivity deuteron polarization measurements

NASA Astrophysics Data System (ADS)

Brantjes, N. P. M.; Dzordzhadze, V.; Gebel, R.; Gonnella, F.; Gray, F. E.; van der Hoek, D. J.; Imig, A.; Kruithof, W. L.; Lazarus, D. M.; Lehrach, A.; Lorentz, B.; Messi, R.; Moricciani, D.; Morse, W. M.; Noid, G. A.; Onderwater, C. J. G.; Özben, C. S.; Prasuhn, D.; Levi Sandri, P.; Semertzidis, Y. K.; da Silva e Silva, M.; Stephenson, E. J.; Stockhorst, H.; Venanzoni, G.; Versolato, O. O.

2012-02-01

This paper reports deuteron vector and tensor beam polarization measurements taken to investigate the systematic variations due to geometric beam misalignments and high data rates. The experiments used the In-Beam Polarimeter at the KVI-Groningen and the EDDA detector at the Cooler Synchrotron COSY at Jülich. By measuring with very high statistical precision, the contributions that are second-order in the systematic errors become apparent. By calibrating the sensitivity of the polarimeter to such errors, it becomes possible to obtain information from the raw count rate values on the size of the errors and to use this information to correct the polarization measurements. During the experiment, it was possible to demonstrate that corrections were satisfactory at the level of 10^-5 for deliberately large errors. This may facilitate the real time observation of vector polarization changes smaller than 10^-6 in a search for an electric dipole moment using a storage ring.
Lock-in amplifier error prediction and correction in frequency sweep measurements.

PubMed

Sonnaillon, Maximiliano Osvaldo; Bonetto, Fabian Jose

2007-01-01

This article proposes an analytical algorithm for predicting errors in lock-in amplifiers (LIAs) working with time-varying reference frequency. Furthermore, a simple method for correcting such errors is presented. The reference frequency can be swept in order to measure the frequency response of a system within a given spectrum. The continuous variation of the reference frequency produces a measurement error that depends on three factors: the sweep speed, the LIA low-pass filters, and the frequency response of the measured system. The proposed error prediction algorithm is based on the final value theorem of the Laplace transform. The correction method uses a double-sweep measurement. A mathematical analysis is presented and validated with computational simulations and experimental measurements.
Multiscale measurement error models for aggregated small area health data.

PubMed

Aregay, Mehreteab; Lawson, Andrew B; Faes, Christel; Kirby, Russell S; Carroll, Rachel; Watjou, Kevin

2016-08-01

Spatial data are often aggregated from a finer (smaller) to a coarser (larger) geographical level. The process of data aggregation induces a scaling effect which smoothes the variation in the data. To address the scaling problem, multiscale models that link the convolution models at different scale levels via the shared random effect have been proposed. One of the main goals in aggregated health data is to investigate the relationship between predictors and an outcome at different geographical levels. In this paper, we extend multiscale models to examine whether a predictor effect at a finer level holds true at a coarser level. To adjust for predictor uncertainty due to aggregation, we applied measurement error models in the framework of the multiscale approach. To assess the benefit of using multiscale measurement error models, we compare the performance of multiscale models with and without measurement error in both real and simulated data. We found that ignoring the measurement error in multiscale models underestimates the regression coefficient, while it overestimates the variance of the spatially structured random effect. On the other hand, accounting for the measurement error in multiscale models provides a better model fit and unbiased parameter estimates. © The Author(s) 2016.
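The underestimation of the regression coefficient reported here has a simple non-spatial analogue, sketched below in Python. This is not the multiscale model of the paper: it is a plain linear regression with classical additive error in the predictor, and all names and parameter values (`simulate_attenuation`, `beta`, `sd_u` and so on) are hypothetical choices for illustration. The correction step assumes the error variance is known.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_attenuation(n=20000, beta=2.0, sd_x=1.0, sd_u=1.0, sd_e=0.5, seed=1):
    """Return (naive slope, corrected slope) when the predictor is observed
    with classical measurement error of standard deviation sd_u."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, sd_x) for _ in range(n)]
    x_obs = [x + rng.gauss(0.0, sd_u) for x in x_true]   # error-prone predictor
    y = [beta * x + rng.gauss(0.0, sd_e) for x in x_true]

    naive = ols_slope(x_obs, y)
    # Reliability ratio: share of observed-predictor variance that is signal.
    reliability = sd_x ** 2 / (sd_x ** 2 + sd_u ** 2)
    corrected = naive / reliability
    return naive, corrected


naive, corrected = simulate_attenuation()
print(f"true slope 2.0, naive {naive:.2f}, corrected {corrected:.2f}")
```

With equal signal and error variances the naive slope is pulled towards roughly half the true value, and dividing by the reliability ratio approximately undoes the bias.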
Accounting for measurement error: a critical but often overlooked process.

PubMed

Harris, Edward F; Smith, Richard N

2009-12-01

Due to instrument imprecision and human inconsistencies, measurements are not free of error. Technical error of measurement (TEM) is the variability encountered between dimensions when the same specimens are measured at multiple sessions. A goal of a data collection regimen is to minimise TEM. The few studies that actually quantify TEM, regardless of discipline, report that it is substantial and can affect results and inferences. This paper reviews some statistical approaches for identifying and controlling TEM. Statistically, TEM is part of the residual ('unexplained') variance in a statistical test, so accounting for TEM, which requires repeated measurements, enhances the chances of finding a statistically significant difference if one exists. The aim of this paper was to review and discuss common statistical designs relating to types of error and statistical approaches to error accountability. This paper addresses issues of landmark location, validity, technical and systematic error, analysis of variance, scaled measures and correlation coefficients in order to guide the reader towards correct identification of true experimental differences. Researchers commonly infer characteristics about populations from comparatively restricted study samples. Most inferences are statistical and, aside from concerns about adequate accounting for known sources of variation with the research design, an important source of variability is measurement error. Variability in locating landmarks that define variables is obvious in odontometrics, cephalometrics and anthropometry, but the same concerns about measurement accuracy and precision extend to all disciplines. With increasing accessibility to computer-assisted methods of data collection, the ease of incorporating repeated measures into statistical designs has improved. Accounting for this technical source of variation increases the chance of finding biologically true differences when they exist.
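For two measurement sessions, TEM is commonly computed with Dahlberg's formula, the square root of the summed squared between-session differences divided by twice the number of specimens. The Python sketch below assumes that two-session form (the abstract itself does not give a formula); the function names and the sample values are hypothetical.

```python
import math


def technical_error_of_measurement(first, second):
    """Dahlberg's TEM for paired measurements from two sessions."""
    if not first or len(first) != len(second):
        raise ValueError("need two equally long, non-empty series")
    squared_diffs = sum((a - b) ** 2 for a, b in zip(first, second))
    return math.sqrt(squared_diffs / (2 * len(first)))


def relative_tem(first, second):
    """TEM as a percentage of the grand mean of both sessions."""
    tem = technical_error_of_measurement(first, second)
    grand_mean = (sum(first) + sum(second)) / (2 * len(first))
    return 100.0 * tem / grand_mean


# Hypothetical tooth-crown widths (mm), measured twice on the same specimens.
session_1 = [10.0, 12.0, 14.0]
session_2 = [11.0, 11.0, 15.0]
print(round(technical_error_of_measurement(session_1, session_2), 3))
print(round(relative_tem(session_1, session_2), 2))
```

Expressing TEM relative to the mean makes it easier to compare variables measured on different scales.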
Inter-rater reliability of three standardized functional tests in patients with low back pain

PubMed Central

Tidstrand, Johan; Horneij, Eva

2009-01-01

Background Of all patients with low back pain, 85% are diagnosed as "non-specific lumbar pain". Lumbar instability is one specific diagnosis, which several authors have characterised by delayed muscular responses, impaired postural control and impaired muscular coordination among these patients. This has mostly been measured and evaluated in a laboratory setting. There are few standardized and evaluated functional tests examining functional muscular coordination which are also applicable in the non-laboratory setting. In ordinary clinical work, tests of functional muscular coordination should be easy to apply. The aim of the present study was therefore to standardize and examine the inter-rater reliability of three functional tests of muscular functional coordination of the lumbar spine in patients with low back pain. Methods Nineteen consecutive individuals (ten men and nine women; mean age 42 years, SD ± 12 years) were included. Two independent examiners assessed three tests: "single limb stance", "sitting on a Bobath ball with one leg lifted" and "unilateral pelvic lift" on the same occasion. The standardization procedure took altered positions of the spine or pelvis and compensatory movements of the free extremities into account. The inter-rater reliability was analyzed by Cohen's kappa coefficient (κ) and by percentage agreement. Results The inter-rater reliability for the right and the left leg respectively was: for the single limb stance very good (κ: 0.88–1.0), for sitting on a Bobath ball good (κ: 0.79) and very good (κ: 0.88) and for the unilateral pelvic lift: good (κ: 0.61) and moderate (κ: 0.47). Conclusion The present study showed good to very good inter-rater reliability for two standardized tests, that is, the single-limb stance and sitting on a Bobath-ball with one leg lifted. Inter-rater reliability for the unilateral pelvic lift test was moderate to good. Validation of the tests in their ability to evaluate lumbar
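The two agreement measures used in this record, percentage agreement and Cohen's kappa, can be sketched in Python as below. The function names and the example ratings are hypothetical. The verbal bands in `describe_kappa` are an assumption: they follow the widely used Altman-style cut-offs, which are consistent with the labels reported above (0.47 moderate, 0.61 and 0.79 good, 0.88 very good), but the abstract does not state which scheme was applied.

```python
from collections import Counter


def percentage_agreement(rater_a, rater_b):
    """Share of subjects (in percent) on which both raters give the same label."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same subjects."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's own marginal frequencies.
    p_expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / n ** 2
    if p_expected == 1.0:
        return float("nan")  # both raters used a single identical label
    return (p_observed - p_expected) / (1.0 - p_expected)


def describe_kappa(kappa):
    """Verbal label for kappa (assumed Altman-style bands)."""
    if kappa > 0.80:
        return "very good"
    if kappa > 0.60:
        return "good"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    return "poor"


# Hypothetical pass/fail judgements by two examiners on eight patients.
examiner_1 = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
examiner_2 = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
k = cohen_kappa(examiner_1, examiner_2)
print(percentage_agreement(examiner_1, examiner_2), round(k, 2), describe_kappa(k))
```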
Automatic diagnostic system for measuring ocular refractive errors

NASA Astrophysics Data System (ADS)

Ventura, Liliane; Chiaradia, Caio; de Sousa, Sidney J. F.; de Castro, Jarbas C.

1996-05-01

Ocular refractive errors (myopia, hyperopia and astigmatism) are automatically and objectively determined by projecting a light target onto the retina using an infra-red (850 nm) diode laser. The light vergence which emerges from the eye (light scattered from the retina) is evaluated in order to determine the corresponding ametropia. The system basically consists of projecting a target (ring) onto the retina and analyzing the scattered light with a CCD camera. The light scattered by the eye is divided into six portions (3 meridians) by using a mask and a set of six prisms. The distance between the two images provided by each of the meridians leads to the refractive error of the referred meridian. Hence, it is possible to determine the refractive error at three different meridians, which gives the exact solution for the eye's refractive error (spherical and cylindrical components and the axis of the astigmatism). The computational basis used for the image analysis is a heuristic search, which provides satisfactory calculation times for our purposes. The peculiar shape of the target, a ring, provides a wider range of measurement and also saves parts of the retina from unnecessary laser irradiation. Measurements were done in artificial and in vivo eyes (using cycloplegics) and the results were in good agreement with the retinoscopic measurements.
Inter-rater agreement on PIVC-associated phlebitis signs, symptoms and scales.

PubMed

Marsh, Nicole; Mihala, Gabor; Ray-Barruel, Gillian; Webster, Joan; Wallis, Marianne C; Rickard, Claire M

2015-10-01

Many peripheral intravenous catheter (PIVC) infusion phlebitis scales and definitions are used internationally, although no existing scale has demonstrated comprehensive reliability and validity. We examined inter-rater agreement between registered nurses on signs, symptoms and scales commonly used in phlebitis assessment. Seven PIVC-associated phlebitis signs/symptoms (pain, tenderness, swelling, erythema, palpable venous cord, purulent discharge and warmth) were observed daily by two raters (a research nurse and registered nurse). These data were modelled into phlebitis scores using 10 different tools. Proportions of agreement (e.g. positive, negative), observed and expected agreements, Cohen's kappa, the maximum achievable kappa, prevalence- and bias-adjusted kappa were calculated. Two hundred ten patients were recruited across three hospitals, with 247 sets of paired observations undertaken. The second rater was blinded to the first's findings. The Catney and Rittenberg scales were the most sensitive (phlebitis in >20% of observations), whereas the Curran, Lanbeck and Rickard scales were the most restrictive (≤2% phlebitis). Only tenderness and the Catney (one of pain, tenderness, erythema or palpable cord) and Rittenberg scales (one of erythema, swelling, tenderness or pain) had acceptable (more than two-thirds, 66.7%) levels of inter-rater agreement. Inter-rater agreement for phlebitis assessment signs/symptoms and scales is low. This likely contributes to the high degree of variability in phlebitis rates in literature. We recommend further research into assessment of infrequent signs/symptoms and the Catney or Rittenberg scales. New approaches to evaluating vein irritation that are valid, reliable and based on their ability to predict complications need exploration. © 2015 John Wiley & Sons, Ltd.
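The agreement indices listed in this record can all be derived from a 2x2 table of paired observations. The Python sketch below uses the standard definitions for two raters and a binary sign; the function name, the cell labels and the counts are hypothetical and are not the study's data. It assumes at least one discordant or concordant pair of each kind so that no denominator is zero.

```python
def agreement_indices(a, b, c, d):
    """Agreement indices for two raters and a binary sign.
    a = both positive, b = rater 1 only, c = rater 2 only, d = both negative."""
    n = a + b + c + d
    p_observed = (a + d) / n
    r1_pos = (a + b) / n
    r2_pos = (a + c) / n
    p_expected = r1_pos * r2_pos + (1 - r1_pos) * (1 - r2_pos)
    # Best agreement attainable given each rater's own marginal totals.
    p_max = min(r1_pos, r2_pos) + min(1 - r1_pos, 1 - r2_pos)
    return {
        "observed": p_observed,
        "expected": p_expected,
        "positive": 2 * a / (2 * a + b + c),
        "negative": 2 * d / (2 * d + b + c),
        "kappa": (p_observed - p_expected) / (1 - p_expected),
        "kappa_max": (p_max - p_expected) / (1 - p_expected),
        "pabak": 2 * p_observed - 1,   # prevalence- and bias-adjusted kappa
    }


# Hypothetical counts for a rare sign: 100 paired observations.
for name, value in agreement_indices(a=1, b=2, c=2, d=95).items():
    print(f"{name}: {value:.3f}")
```

When a sign is rare, observed agreement is dominated by joint negatives, so kappa can be low even though raw agreement is high; PABAK and the positive and negative agreement proportions help separate that prevalence effect from genuine disagreement.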
Inter- and intra-rater reliability of nasal auscultation in daycare children.

PubMed

Santos, Rita; Silva Alexandrino, Ana; Tomé, David; Melo, Cristina; Mesquita Montes, António; Costa, Daniel; Pinto Ferreira, João

2018-02-01

The aim of this study was to assess nasal auscultation's intra- and inter-rater reliability and to analyze ear and respiratory clinical condition according to nasal auscultation. Cross-sectional study performed in 125 children aged up to 3 years old attending daycare centers. Nasal auscultation, tympanometry and Paediatric Respiratory Severity Score (PRSS) were applied to all children. Nasal sounds were classified by an expert panel in order to determine nasal auscultation's intra- and inter-rater reliability. The classification of nasal sounds was assessed against tympanometric and PRSS values. Nasal auscultation revealed substantial inter-rater (K=0.75) and intra-rater (K=0.69; K=0.61 and K=0.72) reliability. Children with a "non-obstructed" classification revealed a lower peak pressure (t=-3.599, P<0.001 in left ear; t=-2.258, P=0.026 in right ear) and a higher compliance (t=-2.728, P=0.007 in left ear; t=-3.830, P<0.001 in right ear) in both ears. There was an association between the classification of sounds and tympanogram types in both ears (X=11.437, P=0.003 in left ear; X=13.535, P=0.001 in right ear). Children with a "non-obstructed" classification had a healthier respiratory condition. Nasal auscultation revealed substantial intra- and inter-rater reliability. Nasal auscultation exhibited important differences according to ear and respiratory clinical conditions. Nasal auscultation in pediatrics seems to be an original topic as well as a simple method that can be used to identify early signs of nasopharyngeal obstruction.
Error Modelling for Multi-Sensor Measurements in Infrastructure-Free Indoor Navigation

PubMed Central

Ruotsalainen, Laura; Kirkko-Jaakkola, Martti; Rantanen, Jesperi; Mäkelä, Maija

2018-01-01

The long-term objective of our research is to develop a method for infrastructure-free simultaneous localization and mapping (SLAM) and context recognition for tactical situational awareness. Localization will be realized by propagating motion measurements obtained using a monocular camera, a foot-mounted Inertial Measurement Unit (IMU), sonar, and a barometer. Due to the size and weight requirements set by tactical applications, Micro-Electro-Mechanical (MEMS) sensors will be used. However, MEMS sensors suffer from biases and drift errors that may substantially decrease the position accuracy. Therefore, sophisticated error modelling and implementation of integration algorithms are key for providing a viable result. Algorithms used for multi-sensor fusion have traditionally been different versions of Kalman filters. However, Kalman filters are based on the assumptions that the state propagation and measurement models are linear with additive Gaussian noise. Neither of the assumptions is correct for tactical applications, especially for dismounted soldiers, or rescue personnel. Therefore, error modelling and implementation of advanced fusion algorithms are essential for providing a viable result. Our approach is to use particle filtering (PF), which is a sophisticated option for integrating measurements emerging from pedestrian motion having non-Gaussian error characteristics. This paper discusses the statistical modelling of the measurement errors from inertial sensors and vision based heading and translation measurements to include the correct error probability density functions (pdf) in the particle filter implementation. Then, model fitting is used to verify the pdfs of the measurement errors. Based on the deduced error models of the measurements, particle filtering method is developed to fuse all this information, where the weights of each particle are computed based on the specific models derived. The performance of the developed method is tested via two
[Inter-rater concordance of the "Nursing Activities Score" in intensive care].

PubMed

Valls-Matarín, Josefa; Salamero-Amorós, Maria; Roldán-Gil, Carmen; Quintana-Riera, Salvador

2015-01-01

To evaluate inter-rater concordance in the valuation of the "Nursing Activities Score". Cross-sectional descriptive study conducted from December 2012 until June 2013 in a general intensive care unit with twelve beds. Three evaluator nurses, simultaneously and independently, through the patient daily charts, scored the nursing workload using the Nursing Activities Score scale in all admitted patients over 18 years old. Three hundred and thirty-nine records were collected. The intra-class correlation coefficient (ICC) between evaluators was 0.92 (0.89-0.94). A perfect concordance was obtained in 39.1% of the items, with 52.2% having a high, and 8.7% having lower concordance, corresponding to two of the items with multiple scoring options. Significant differences between two of the evaluators (P=.049) were found. Although the inter-rater concordance was high, more accurate records are needed to reduce the variability of the items with multiple options and to allow more accuracy in the interpretation and measurement of the data regarding nursing workload. Copyright © 2015 Elsevier España, S.L.U. All rights reserved.
Phase measurement error in summation of electron holography series.

PubMed

McLeod, Robert A; Bergen, Michael; Malac, Marek

2014-06-01

Off-axis electron holography is a method for the transmission electron microscope (TEM) that measures the electric and magnetic properties of a specimen. The electrostatic and magnetic potentials modulate the electron wavefront phase. The error in measurement of the phase therefore determines the smallest observable changes in electric and magnetic properties. Here we explore the summation of a hologram series to reduce the phase error and thereby improve the sensitivity of electron holography. Summation of hologram series requires independent registration and correction of image drift and phase wavefront drift, the consequences of which are discussed. Optimization of the electro-optical configuration of the TEM for the double biprism configuration is examined. An analytical model of image and phase drift, composed of a combination of linear drift and Brownian random-walk, is derived and experimentally verified. The accuracy of image registration via cross-correlation and phase registration is characterized by simulated hologram series. The model of series summation errors allows the optimization of phase error as a function of exposure time and fringe carrier frequency for a target spatial resolution. An experimental example of hologram series summation is provided on WS2 fullerenes. A metric is provided to measure the object phase error from experimental results and compared to analytical predictions. The ultimate experimental object root-mean-square phase error is 0.006 rad (2π/1050) at a spatial resolution less than 0.615 nm and a total exposure time of 900 s. The ultimate phase error in vacuum adjacent to the specimen is 0.0037 rad (2π/1700). The analytical prediction of phase error differs from the experimental metrics by +7% inside the object and -5% in the vacuum, indicating that the model can provide reliable quantitative predictions. Crown Copyright © 2014. Published by Elsevier B.V. All rights reserved.
Tests for detecting overdispersion in models with measurement error in covariates.

PubMed

Yang, Yingsi; Wong, Man Yu

2015-11-30

Measurement error in covariates can affect the accuracy in count data modeling and analysis. In overdispersion identification, the true mean-variance relationship can be obscured under the influence of measurement error in covariates. In this paper, we propose three tests for detecting overdispersion when covariates are measured with error: a modified score test and two score tests based on the proposed approximate likelihood and quasi-likelihood, respectively. The proposed approximate likelihood is derived under the classical measurement error model, and the resulting approximate maximum likelihood estimator is shown to have superior efficiency. Simulation results also show that the score test based on approximate likelihood outperforms the test based on quasi-likelihood and other alternatives in terms of empirical power. By analyzing a real dataset containing the health-related quality-of-life measurements of a particular group of patients, we demonstrate the importance of the proposed methods by showing that the analyses with and without measurement error correction yield significantly different results. Copyright © 2015 John Wiley & Sons, Ltd.
Effects of Rating Purpose and Rater Self-Esteem on Performance Ratings.

DTIC Science & Technology

1983-03-01

examined in a laboratory study, using a 2x2 analysis of variance design. Results indicate that low self-esteem raters assign significantly higher performance ratings when performance appraisal information will be used...studies indicated that individuals low in self-esteem have less self-confidence, feel less competent, and rely more on others’ opinions than do individuals
Measurement error associated with surveys of fish abundance in Lake Michigan

USGS Publications Warehouse

Krause, Ann E.; Hayes, Daniel B.; Bence, James R.; Madenjian, Charles P.; Stedman, Ralph M.

2002-01-01

In fisheries, imprecise measurements in catch data from surveys add uncertainty to the results of fishery stock assessments. The USGS Great Lakes Science Center (GLSC) began to survey the fall fish community of Lake Michigan in 1962 with bottom trawls. The measurement error was evaluated at the level of individual tows for nine fish species collected in this survey by applying a measurement-error regression model to replicated trawl data. It was found that the estimates of measurement-error variance ranged from 0.37 (deepwater sculpin, Myoxocephalus thompsoni) to 1.23 (alewife, Alosa pseudoharengus) on a logarithmic scale, corresponding to coefficients of variation of 66% to 156%. The estimates appeared to increase with the range of temperature occupied by the fish species. This association may be a result of the variability in the fall thermal structure of the lake. The estimates may also be influenced by other factors, such as pelagic behavior and schooling. Measurement error might be reduced by surveying the fish community during other seasons and/or by using additional technologies, such as acoustics. Measurement-error estimates should be considered when interpreting results of assessments that use abundance information from USGS-GLSC surveys of Lake Michigan and could be used if the survey design was altered. This study is the first to report estimates of measurement-error variance associated with this survey.
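The link between a log-scale variance and a coefficient of variation can be checked with a short sketch. It assumes the errors are lognormal, so that CV = sqrt(exp(σ²) − 1); that assumption and the function name `cv_from_log_variance` are illustrative and are not stated in the abstract.

```python
import math


def cv_from_log_variance(log_variance: float) -> float:
    """Coefficient of variation of a lognormal variable whose logarithm
    has the given variance (assumed model, not taken from the abstract)."""
    return math.sqrt(math.exp(log_variance) - 1.0)


# Variances quoted in the abstract for the two extreme species
for species, variance in [("deepwater sculpin", 0.37), ("alewife", 1.23)]:
    cv_percent = 100.0 * cv_from_log_variance(variance)
    print(f"{species}: CV = {cv_percent:.0f}%")
```

Under this assumption the two variances map to roughly 67% and 156%, which is close to the 66% to 156% range quoted above; the small difference at the lower end is plausibly due to rounding of the published variance.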
Measurement variability error for estimates of volume change

Treesearch

James A. Westfall; Paul L. Patterson

2007-01-01

Using quality assurance data, measurement variability distributions were developed for attributes that affect tree volume prediction. Random deviations from the measurement variability distributions were applied to 19381 remeasured sample trees in Maine. The additional error due to measurement variation and measurement bias was estimated via a simulation study for...
An Examination of Rater Performance on a Local Oral English Proficiency Test: A Mixed-Methods Approach

ERIC Educational Resources Information Center

Yan, Xun

2014-01-01

This paper reports on a mixed-methods approach to evaluate rater performance on a local oral English proficiency test. Three types of reliability estimates were reported to examine rater performance from different perspectives. Quantitative results were also triangulated with qualitative rater comments to arrive at a more representative picture of…
Construct validity and inter-rater reliability of the Dutch activity measure for post-acute care "6-clicks" basic mobility form to assess the mobility of hospitalized patients.

PubMed

Geelen, Sven Jacobus Gertruda; Valkenet, Karin; Veenhof, Cindy

2018-05-12

To evaluate the construct validity and the inter-rater reliability of the Dutch Activity Measure for Post-Acute Care "6-clicks" Basic Mobility short form measuring the patient's mobility in Dutch hospital care. First, the "6-clicks" was translated by using a forward-backward translation protocol. Next, 64 patients were assessed by the physiotherapist to determine the validity while being admitted to the Internal Medicine wards of a university medical center. Six hypotheses were tested regarding the construct "mobility" which showed that: Better "6-clicks" scores were related to less restrictive pre-admission living situations (p = 0.011), less restrictive discharge locations (p = 0.001), more independence in activities of daily living (p = 0.001) and less physiotherapy visits (p < 0.001). A correlation was found between the "6-clicks" and length of stay (r= -0.408, p = 0.001), but not between the "6-clicks" and age (r= -0.180, p = 0.528). To determine the inter-rater reliability, an additional 50 patients were assessed by pairs of physiotherapists who independently scored the patients. Intraclass Correlation Coefficients of 0.920 (95%CI: 0.828-0.964) were found. The Kappa Coefficients for the individual items ranged from 0.649 (walking stairs) to 0.841 (sit-to-stand). The Dutch "6-clicks" shows a good construct validity and moderate-to-excellent inter-rater reliability when used to assess the mobility of hospitalized patients. Implications for Rehabilitation Even though various measurement tools have been developed, it appears the majority of physiotherapists working in a hospital currently do not use these tools as a standard part of their care. The Activity Measure for Post-Acute Care "6-clicks" Basic Mobility is the only tool which is designed to be short, easy to use within usual care and has been validated in the entire hospital population. This study shows that the Dutch version of the Activity Measure for Post-Acute Care "6-clicks

Workplace-Based Assessment: Effects of Rater Expertise

ERIC Educational Resources Information Center

Govaerts, M. J. B.; Schuwirth, L. W. T.; Van der Vleuten, C. P. M.; Muijtjens, A. M. M.

2011-01-01

Traditional psychometric approaches towards assessment tend to focus exclusively on quantitative properties of assessment outcomes. This may limit more meaningful educational approaches towards workplace-based assessment (WBA). Cognition-based models of WBA argue that assessment outcomes are determined by cognitive processes by raters which are…
Analysis of measured data of human body based on error correcting frequency

NASA Astrophysics Data System (ADS)

Jin, Aiyan; Peipei, Gao; Shang, Xiaomei

2014-04-01

Anthropometry is the measurement of all parts of the human body surface, and the measured data are the basis for the analysis and study of the human body, the establishment and modification of garment sizes, and the setup and operation of online clothing stores. In this paper, several groups of measured data are obtained, and the data error is analysed by examining the error frequency and applying analysis of variance. The paper also covers the determination of the accuracy of the measured data and the difficulty of measuring particular parts of the human body, further study of the causes of data errors, and a summary of the key points for minimizing errors. By analysing the measured data on the basis of error frequency, the paper provides reference points that may support the development of the garment industry.
[Inter-rater reliability and validity of the OPD-CA axes structure and conflict].

PubMed

Benecke, Cord; Bock, Astrid; Wieser, Elke; Tschiesner, Reinhard; Lochmann, Martha; Küspert, Felicia; Schorn, Robert; Viertler, Bernhard; Steinmayr-Gensluckner, Maria

2011-01-01

The manual of the Operationalized Psychodynamic Diagnostics in childhood and adolescence (OPD-CA) is an instrument that has become widespread in clinical practice for assessing psychodynamic dimensions. Published data on inter-rater agreement and validity are still lacking. This study assessed the inter-rater reliability and validity of the axis structure and the axis conflict. Sixty adolescents between 14 and 17 years, with and without psychic disorders, were assessed with the Operationalized Psychodynamic Diagnostics in childhood and adolescence (Arbeitskreis OPD-KJ, 2007), SCID-II interviews and questionnaires. A partial sample of 36 OPD-CA interviews was the data basis for the assessment of inter-rater agreement. Calculations of validity for axis structure and axis conflict were made with the whole sample. Inter-rater agreement for the axis structure and the axis conflict showed good to very good weighted Kappa coefficients among the trained raters. Validity of the axis structure showed good results. The Operationalized Psychodynamic Diagnostics in childhood and adolescence (OPD-CA) allows a reliable diagnosis of axis structure and axis conflict if the ratings are based on semistructured videotaped interviews and made by trained raters. The axis structure shows validity, while the results concerning the validity of the axis conflict remain unclear.
The inter-rater reliability and prognostic value of coma scales in Nepali children with acute encephalitis syndrome.

PubMed

Ray, Stephen; Rayamajhi, Ajit; Bonnett, Laura J; Solomon, Tom; Kneen, Rachel; Griffiths, Michael J

2018-02-01

Background Acute encephalitis syndrome (AES) is a common cause of coma in Nepali children. The Glasgow coma scale (GCS) is used to assess the level of coma in these patients and predict outcome. Alternative coma scales may have better inter-rater reliability and prognostic value in encephalitis in Nepali children, but this has not been studied. The Adelaide coma scale (ACS), Blantyre coma scale (BCS) and the Alert, Verbal, Pain, Unresponsive scale (AVPU) are alternatives to the GCS which can be used. Methods Children aged 1-14 years who presented to Kanti Children's Hospital, Kathmandu with AES between September 2010 and November 2011 were recruited. All four coma scales (GCS, ACS, BCS and AVPU) were applied on admission, 48 h later and on discharge. Inter-rater reliability (unweighted kappa) was measured for each. Correlation and agreement between total coma score and outcome (Liverpool outcome score) were measured by Spearman's rank and Bland-Altman plot. The prognostic value of coma scales alone and in combination with physiological variables was investigated in a subgroup (n = 22). A multivariable logistic regression model was fitted by backward stepwise selection. Results Fifty children were recruited. Inter-rater reliability using the various scales was fair to moderate. However, the scales poorly predicted clinical outcome. Combining the scales with physiological parameters such as systolic blood pressure improved outcome prediction. Conclusion This is the first study to compare four coma scales in Nepali children with AES. The scales exhibited fair to moderate inter-rater reliability. However, the study is inadequately powered to answer the question on the relationship between coma scales and outcome. Further larger studies are required.
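For readers unfamiliar with the unweighted kappa statistic named in the Methods, a minimal sketch follows. It implements the standard Cohen's kappa for two raters; the function name and the AVPU ratings are hypothetical illustration data, not values from the study.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same subjects."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Proportion of subjects on which the raters agree exactly
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's category frequencies
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical AVPU ratings from two raters (illustrative data only)
rater_1 = ["A", "V", "P", "U", "A", "V", "P", "A"]
rater_2 = ["A", "V", "P", "P", "A", "P", "P", "V"]
print(round(cohen_kappa(rater_1, rater_2), 3))
```

Because kappa subtracts chance agreement, it is lower than the raw proportion of matching ratings whenever chance agreement is non-zero; in this made-up example the raters match on 5 of 8 cases yet kappa is below 0.5.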
Gait Deviation Index, Gait Profile Score and Gait Variable Score in children with spastic cerebral palsy: Intra-rater reliability and agreement across two repeated sessions.

PubMed

Rasmussen, Helle Mätzke; Nielsen, Dennis Brandborg; Pedersen, Niels Wisbech; Overgaard, Søren; Holsgaard-Larsen, Anders

2015-07-01

The Gait Deviation Index (GDI) and Gait Profile Score (GPS) are the most used summary measures of gait in children with cerebral palsy (CP). However, the reliability and agreement of these indices have not been investigated, limiting their clinimetric quality for research and clinical practice. The aim of this study was to investigate the intra-rater reliability and agreement of summary measures of gait (GDI; GPS; and the Gait Variable Score (GVS) derived from the GPS). The intra-rater reliability and agreement were investigated across two repeated sessions in 18 children aged 5-12 years diagnosed with spastic CP. No systematic bias was observed between the sessions and no heteroscedasticity was observed in Bland-Altman plots. For the GDI and GPS, excellent reliability with intraclass correlation coefficient (ICC) values of 0.8-0.9 was found, while the GVS was found to have fair to good reliability with ICCs of 0.4-0.7. The agreement for the GDI and the logarithmically transformed GPS, in terms of the standard error of measurement as a percentage of the grand mean (SEM%) varied from 4.1 to 6.7%, whilst the smallest detectable change in percent (SDC%) ranged from 11.3 to 18.5%. For the logarithmically transformed GVS, we found a fair to large variation in SEM% from 7 to 29% and in SDC% from 18 to 81%. The GDI and GPS demonstrated excellent reliability and acceptable agreement proving that they can both be used in research and clinical practice. However, the observed large variability for some of the GVS requires cautious consideration when selecting outcome measures. Copyright © 2015 Elsevier B.V. All rights reserved.
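The relationship between the standard error of measurement and the smallest detectable change can be sketched as follows. The abstract does not state its formulas, so this assumes the commonly used definitions SEM = SD·sqrt(1 − ICC) and SDC = 1.96·sqrt(2)·SEM; the function names are illustrative.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement from a between-subject SD and an ICC
    (assumed definition: SEM = SD * sqrt(1 - ICC))."""
    return sd * math.sqrt(1.0 - icc)


def smallest_detectable_change(sem: float, z: float = 1.96) -> float:
    """SDC at the confidence level implied by z (1.96 for 95%);
    sqrt(2) accounts for error in both of the two measurements compared."""
    return z * math.sqrt(2.0) * sem


# SEM% bounds quoted in the abstract for the GDI and log-transformed GPS
for sem_percent in (4.1, 6.7):
    sdc_percent = smallest_detectable_change(sem_percent)
    print(f"SEM% = {sem_percent}: SDC% = {sdc_percent:.1f}")
```

Under these assumed definitions the quoted SEM% bounds give SDC% values of about 11.4 and 18.6, close to the 11.3 to 18.5% reported above; using z = 1.645 instead would give a 90% minimum detectable change.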
Measurements of stem diameter: implications for individual- and stand-level errors.

PubMed

Paul, Keryn I; Larmour, John S; Roxburgh, Stephen H; England, Jacqueline R; Davies, Micah J; Luck, Hamish D

2017-08-01

Stem diameter is one of the most common measurements made to assess the growth of woody vegetation, and the commercial and environmental benefits that it provides (e.g. wood or biomass products, carbon sequestration, landscape remediation). Yet inconsistency in its measurement is a continuing source of error in estimates of stand-scale measures such as basal area, biomass, and volume. Here we assessed errors in stem diameter measurement through repeated measurements of individual trees and shrubs of varying size and form (i.e. single- and multi-stemmed) across a range of contrasting stands, from complex mixed-species plantings to commercial single-species plantations. We compared a standard diameter tape with a Stepped Diameter Gauge (SDG) for time efficiency and measurement error. Measurement errors in diameter were slightly (but significantly) influenced by size and form of the tree or shrub, and stem height at which the measurement was made. Compared to standard tape measurement, the mean systematic error with SDG measurement was only -0.17 cm, but varied between -0.10 and -0.52 cm. Similarly, random error was relatively large, with standard deviations (and percentage coefficients of variation) averaging only 0.36 cm (and 3.8%), but varying between 0.14 and 0.61 cm (and 1.9 and 7.1%). However, at the stand scale, sampling errors (i.e. how well individual trees or shrubs selected for measurement of diameter represented the true stand population in terms of the average and distribution of diameter) generally had at least a tenfold greater influence on random errors in basal area estimates than errors in diameter measurements. This supports the use of diameter measurement tools that have high efficiency, such as the SDG. Use of the SDG almost halved the time required for measurements compared to the diameter tape. Based on these findings, recommendations include the following: (i) use of a tape to maximise accuracy when developing allometric models, or when
Error and uncertainty in Raman thermal conductivity measurements

DOE PAGES

Thomas Edwin Beechem; Yates, Luke; Graham, Samuel

2015-04-22

We investigated error and uncertainty in Raman thermal conductivity measurements via finite element based numerical simulation of two geometries often employed -- Joule-heating of a wire and laser-heating of a suspended wafer. Using this methodology, the accuracy and precision of the Raman-derived thermal conductivity are shown to depend on (1) assumptions within the analytical model used in the deduction of thermal conductivity, (2) uncertainty in the quantification of heat flux and temperature, and (3) the evolution of thermomechanical stress during testing. Apart from the influence of stress, errors of 5% coupled with uncertainties of ±15% are achievable for most materials under conditions typical of Raman thermometry experiments. Error can increase to >20%, however, for materials having highly temperature dependent thermal conductivities or, in some materials, when thermomechanical stress develops concurrent with the heating. A dimensionless parameter -- termed the Raman stress factor -- is derived to identify when stress effects will induce large levels of error. Together, the results compare the utility of Raman based conductivity measurements relative to more established techniques while at the same time identifying situations where its use is most efficacious.
A manufacturing error measurement methodology for a rotary vector reducer cycloidal gear based on a gear measuring center

NASA Astrophysics Data System (ADS)

Li, Tianxing; Zhou, Junxiang; Deng, Xiaozhong; Li, Jubo; Xing, Chunrong; Su, Jianxin; Wang, Huiliang

2018-07-01

A manufacturing error of a cycloidal gear is the key factor affecting the transmission accuracy of a robot rotary vector (RV) reducer. A methodology is proposed to realize the digitized measurement and data processing of the cycloidal gear manufacturing error based on the gear measuring center, which can quickly and accurately measure and evaluate the manufacturing error of the cycloidal gear by using both the whole tooth profile measurement and a single tooth profile measurement. By analyzing the particularity of the cycloidal profile and its effect on the actual meshing characteristics of the RV transmission, the cycloid profile measurement strategy is planned, and the theoretical profile model and error measurement model of cycloid-pin gear transmission are established. Through the digital processing technology, the theoretical trajectory of the probe and the normal vector of the measured point are calculated. By means of precision measurement principle and error compensation theory, a mathematical model for the accurate calculation and data processing of manufacturing error is constructed, and the actual manufacturing error of the cycloidal gear is obtained by the optimization iterative solution. Finally, the measurement experiment of the cycloidal gear tooth profile is carried out on the gear measuring center and the HEXAGON coordinate measuring machine, respectively. The measurement results verify the correctness and validity of the measurement theory and method. This methodology will provide the basis for the accurate evaluation and the effective control of manufacturing precision of the cycloidal gear in a robot RV reducer.
Estimating the Imputed Social Cost of Errors of Measurement.

DTIC Science & Technology

1983-10-01

social cost of an error of measurement in the score on a unidimensional test, an asymptotic method, based on item response theory, is developed for...RR-83-33-ONR ESTIMATING THE IMPUTED SOCIAL COST OF... SOCIAL COST OF ERRORS OF MEASUREMENT Frederic M. Lord This research was sponsored in part by the Personnel and Training Research Programs Psychological
Application of round grating angle measurement composite error amendment in the online measurement accuracy improvement of large diameter

NASA Astrophysics Data System (ADS)

Wang, Biao; Yu, Xiaofen; Li, Qinzhao; Zheng, Yu

2008-10-01

This paper addresses the influence of round-grating dividing error, rolling-wheel eccentricity and surface shape errors. It provides a correction method based on the rolling wheel to obtain a composite error model that includes all of these factors, and then corrects the non-circular angle measurement error of the rolling wheel. Simulation and experiment indicate that the composite error correction method can improve the accuracy of diameter measurement based on the rolling-wheel principle. The method has wide application prospects where measurement accuracy better than 5 μm/m is required.
Assessment and Correlation of Customer and Rater Response to Cold-Start and Warmup Driveability

DTIC Science & Technology

1993-08-01

Customer satisfaction fleet Year N % 1986 13 18 1988 10 14 1987 12 18 1988 12 16 1989 14 19 1990 9 12 1991 3 4 Consumer/Rater Fleet Hydrocarbon fuel...2 4 1991 0 0 Fuel system * Customer satisfaction fleet Fuel system N % Carbureted 19 26 PFI 33 48 1T1 21 29 Consumer/Rater Fleet Hydrocarbon fuel...between the customer fleet and one of the consumer/rater subfleets; these vehicles are included in both places in the tables above. 30 TABLE 2 AVERAGE
Measurement of electromagnetic tracking error in a navigated breast surgery setup

NASA Astrophysics Data System (ADS)

Harish, Vinyas; Baksh, Aidan; Ungi, Tamas; Lasso, Andras; Baum, Zachary; Gauvin, Gabrielle; Engel, Jay; Rudan, John; Fichtinger, Gabor

2016-03-01

PURPOSE: The measurement of tracking error is crucial to ensure the safety and feasibility of electromagnetically tracked, image-guided procedures. Measurement should occur in a clinical environment because electromagnetic field distortion depends on positioning relative to the field generator and metal objects. However, we could not find an accessible and open-source system for calibration, error measurement, and visualization. We developed such a system and tested it in a navigated breast surgery setup. METHODS: A pointer tool was designed for concurrent electromagnetic and optical tracking. Software modules were developed for automatic calibration of the measurement system, real-time error visualization, and analysis. The system was taken to an operating room to test for field distortion in a navigated breast surgery setup. Positional and rotational electromagnetic tracking errors were then calculated using optical tracking as a ground truth. RESULTS: Our system is quick to set up and can be rapidly deployed. The process from calibration to visualization also only takes a few minutes. Field distortion was measured in the presence of various surgical equipment. Positional and rotational error in a clean field was approximately 0.90 mm and 0.31°. The presence of a surgical table, an electrosurgical cautery, and anesthesia machine increased the error by up to a few tenths of a millimeter and tenth of a degree. CONCLUSION: In a navigated breast surgery setup, measurement and visualization of tracking error defines a safe working area in the presence of surgical equipment. Our system is available as an extension for the open-source 3D Slicer platform.
Emergency department discharge prescription errors in an academic medical center

PubMed Central

Belanger, April; Devine, Lauren T.; Lane, Aaron; Condren, Michelle E.

2017-01-01

This study described discharge prescription medication errors written for emergency department patients. This study used content analysis in a cross-sectional design to systematically categorize prescription errors found in a report of 1000 discharge prescriptions submitted in the electronic medical record in February 2015. Two pharmacy team members reviewed the discharge prescription list for errors. Open-ended data were coded by an additional rater for agreement on coding categories. Coding was based upon majority rule. Descriptive statistics were used to address the study objective. Categories evaluated were patient age, provider type, drug class, and type and time of error. The discharge prescription error rate out of 1000 prescriptions was 13.4%, with “incomplete or inadequate prescription” being the most commonly detected error (58.2%). The adult and pediatric error rates were 11.7% and 22.7%, respectively. The antibiotics reviewed had the highest number of errors. The highest within-class error rates were with antianginal medications, antiparasitic medications, antacids, appetite stimulants, and probiotics. Emergency medicine residents wrote the highest percentage of prescriptions (46.7%) and had an error rate of 9.2%. Residents of other specialties wrote 340 prescriptions and had an error rate of 20.9%. Errors occurred most often between 10:00 am and 6:00 pm. PMID:28405061
[Determination of the error of aerosol extinction coefficient measured by DOAS].

PubMed

Si, Fu-qi; Liu, Jian-guo; Xie, Pin-hua; Zhang, Yu-jun; Wang, Mian; Liu, Wen-qing; Hiroaki, Kuze; Liu, Cheng; Nobuo, Takeuchi

2006-10-01

The method of determining the error of the aerosol extinction coefficient measured by differential optical absorption spectroscopy (DOAS) is described. Several factors that can introduce errors into the result, such as variation of the source, integration time, atmospheric turbulence, calibration of system parameters, displacement of the system, and backscattering by particles, are analyzed. The error of the aerosol extinction coefficient, 0.03 km⁻¹, is determined by theoretical analysis and practical measurement.
Tilt error in cryospheric surface radiation measurements at high latitudes: a model study

NASA Astrophysics Data System (ADS)

Bogren, Wiley Steven; Faulkner Burkhart, John; Kylling, Arve

2016-03-01

We have evaluated the magnitude and makeup of error in cryospheric radiation observations due to small sensor misalignment in in situ measurements of solar irradiance. This error is examined through simulation of diffuse and direct irradiance arriving at a detector with a cosine-response fore optic. Emphasis is placed on assessing total error over the solar shortwave spectrum from 250 to 4500 nm, as well as supporting investigation over other relevant shortwave spectral ranges. The total measurement error introduced by sensor tilt is dominated by the direct component. For a typical high-latitude albedo measurement with a solar zenith angle of 60°, a sensor tilted by 1, 3, and 5° can, respectively introduce up to 2.7, 8.1, and 13.5 % error into the measured irradiance and similar errors in the derived albedo. Depending on the daily range of solar azimuth and zenith angles, significant measurement error can persist also in integrated daily irradiance and albedo. Simulations including a cloud layer demonstrate decreasing tilt error with increasing cloud optical depth.
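The geometric origin of the direct-beam error can be illustrated with a simplified sketch. This is not the authors' radiative transfer simulation: it assumes an ideal cosine-response sensor, direct irradiance only, and a tilt directly toward the sun (the worst case), and the function name is illustrative.

```python
import math


def direct_beam_tilt_error(zenith_deg: float, tilt_deg: float) -> float:
    """Relative error in measured direct irradiance for an ideal
    cosine-response sensor tilted toward the sun by tilt_deg
    (simplified geometry, direct component only)."""
    true_cos = math.cos(math.radians(zenith_deg))
    tilted_cos = math.cos(math.radians(zenith_deg - tilt_deg))
    return tilted_cos / true_cos - 1.0


# Solar zenith angle and tilts quoted in the abstract
for tilt in (1, 3, 5):
    error_percent = 100.0 * direct_beam_tilt_error(60.0, tilt)
    print(f"tilt {tilt} deg: {error_percent:.1f}% direct-beam error")
```

These direct-only, worst-case values come out somewhat larger than the total errors quoted in the abstract, which would be consistent with a diffuse component that is less sensitive to tilt diluting the total.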
Evaluating Rater Responses to an Online Training Program for L2 Writing Assessment

ERIC Educational Resources Information Center

Elder, Catherine; Barkhuizen, Gary; Knoch, Ute; von Randow, Janet

2007-01-01

The use of online rater self-training is growing in popularity and has obvious practical benefits, facilitating access to training materials and rating samples and allowing raters to reorient themselves to the rating scale and self monitor their behaviour at their own convenience. However there has thus far been little research into rater…
The Availability of Radiological Measurement of Femoral Anteversion Angle: Three-Dimensional Computed Tomography Reconstruction

PubMed Central

Byun, Ha Young; Shin, Heesuk; Lee, Eun Shin; Kong, Min Sik; Lee, Seung Hun

2016-01-01

Objective To assess the intra-rater and inter-rater reliability for measuring femoral anteversion angle (FAA) by a radiographic method using three-dimensional computed tomography reconstruction (3D-CT). Methods The study included 82 children who presented with intoeing gait. 3D-CT data taken between 2006 and 2014 were retrospectively reviewed. FAA was measured by 3D-CT. FAA is defined as the angle between the long axis of the femoral neck and the condylar axis of the distal femur. FAA measurement was performed twice at both lower extremities by each rater. The intra-rater and inter-rater reliability were calculated by intraclass correlation coefficient (ICC). Results One hundred and sixty-four lower limbs of 82 children (31 boys and 51 girls, 6.3±3.2 years old) were included. The ICCs of intra-rater measurement for the angle of femoral neck axis (NA) were 0.89 for rater A and 0.96 for rater B, and those of condylar axis (CA) were 0.99 for rater A and 0.99 for rater B, respectively. The ICC of inter-rater measurement for the angle of NA was 0.89 and that of CA was 0.92. By each rater, the ICCs of the intra-rater measurement for FAA were 0.97 for rater A and 0.95 for rater B, respectively, and the ICC of the inter-rater measurement for FAA was 0.89. Conclusion The 3D-CT measures for FAA are reliable within individual raters and between different raters. The 3D-CT measures of FAA can be a useful method for accurate diagnosis and follow-up of femoral anteversion. PMID:27152273
Tilt Error in Cryospheric Surface Radiation Measurements at High Latitudes: A Model Study

NASA Astrophysics Data System (ADS)

Bogren, W.; Kylling, A.; Burkhart, J. F.

2015-12-01

We have evaluated the magnitude and makeup of error in cryospheric radiation observations due to small sensor misalignment in in-situ measurements of solar irradiance. This error is examined through simulation of diffuse and direct irradiance arriving at a detector with a cosine-response foreoptic. Emphasis is placed on assessing total error over the solar shortwave spectrum from 250 nm to 4500 nm, as well as supporting investigation over other relevant shortwave spectral ranges. The total measurement error introduced by sensor tilt is dominated by the direct component. For a typical high latitude albedo measurement with a solar zenith angle of 60°, a sensor tilted by 1, 3, and 5° can respectively introduce up to 2.6, 7.7, and 12.8% error into the measured irradiance and similar errors in the derived albedo. Depending on the daily range of solar azimuth and zenith angles, significant measurement error can persist also in integrated daily irradiance and albedo.
Establishing inter-rater reliability scoring in a state trauma system.

PubMed

Read-Allsopp, Christine

2004-01-01

Trauma systems rely on accurate Injury Severity Scoring (ISS) to describe trauma patient populations. Twenty-seven (27) Trauma Nurse Coordinators and Data Managers across the New South Wales, Australia state trauma network were instructed in the uses and techniques of the Abbreviated Injury Scale (AIS) from the Association for the Advancement of Automotive Medicine. The aim is to provide accurate, reliable and valid data for the state trauma network. Four (4) months after the course a coding exercise was conducted to assess inter-rater reliability. The results show that inter-rater reliability is within accepted international standards.
Improved characterisation and modelling of measurement errors in electrical resistivity tomography (ERT) surveys

NASA Astrophysics Data System (ADS)

Tso, Chak-Hau Michael; Kuras, Oliver; Wilkinson, Paul B.; Uhlemann, Sebastian; Chambers, Jonathan E.; Meldrum, Philip I.; Graham, James; Sherlock, Emma F.; Binley, Andrew

2017-11-01

Measurement errors can play a pivotal role in geophysical inversion. Most inverse models require users to prescribe or assume a statistical model of data errors before inversion. Wrongly prescribed errors can lead to over- or under-fitting of data; however, the derivation of models of data errors is often neglected. With the heightening interest in uncertainty estimation within hydrogeophysics, better characterisation and treatment of measurement errors is needed to provide improved image appraisal. Here we focus on the role of measurement errors in electrical resistivity tomography (ERT). We have analysed two time-lapse ERT datasets: one contains 96 sets of direct and reciprocal data collected from a surface ERT line within a 24 h timeframe; the other is a two-year-long cross-borehole survey at a UK nuclear site with 246 sets of over 50,000 measurements. Our study includes the characterisation of the spatial and temporal behaviour of measurement errors using autocorrelation and correlation coefficient analysis. We find that, in addition to well-known proportionality effects, ERT measurements can also be sensitive to the combination of electrodes used, i.e. errors may not be uncorrelated as often assumed. Based on these findings, we develop a new error model that allows grouping based on electrode number in addition to fitting a linear model to transfer resistance. The new model explains the observed measurement errors better and shows superior inversion results and uncertainty estimates in synthetic examples. It is robust, because it groups errors together based on the electrodes used to make the measurements. The new model can be readily applied to the diagonal data weighting matrix widely used in common inversion methods, as well as to the data covariance matrix in a Bayesian inversion framework. We demonstrate its application using extensive ERT monitoring datasets from the two aforementioned sites.
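The baseline idea of fitting a linear model of error against transfer resistance can be sketched as below. This is only an illustration of that baseline, not the authors' new model: the electrode-based grouping is not implemented, the reciprocal error is assumed to be the absolute difference between direct and reciprocal readings, and the function name and synthetic data are hypothetical.

```python
def fit_linear_error_model(r_direct, r_reciprocal):
    """Least-squares fit of |reciprocal error| = a + b * |R|, where R is
    the mean of the direct and reciprocal transfer resistances.
    Returns (a, b). Assumed model form; no electrode grouping."""
    if len(r_direct) != len(r_reciprocal) or len(r_direct) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    x = [abs((d + r) / 2.0) for d, r in zip(r_direct, r_reciprocal)]
    y = [abs(d - r) for d, r in zip(r_direct, r_reciprocal)]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("transfer resistances must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic check: errors built as 0.001 ohm + 2% of the transfer resistance
true_r = [0.1, 1.0, 10.0, 100.0]
errors = [0.001 + 0.02 * r for r in true_r]
direct = [r + e / 2.0 for r, e in zip(true_r, errors)]
reciprocal = [r - e / 2.0 for r, e in zip(true_r, errors)]
a, b = fit_linear_error_model(direct, reciprocal)
print(f"a = {a:.4f} ohm, b = {b:.4f}")
```

The fitted a and b could then populate a diagonal data weighting matrix; with real data the fit is often applied to binned errors rather than raw points, which is a further design choice not shown here.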

Putting Raters in Ratees' Shoes: Perspective Taking and Assessment of Creative Products

ERIC Educational Resources Information Center

Han, Jiantao; Long, Haiying; Pang, Weiguo

2017-01-01

This study reported 2 experiments that studied the effect of perspective taking on assessment of creative products by using human raters. Forty responses of 2 alternative uses tasks (AUTs) and 15 alien stories generated by 6th-grade students were used as assessment materials. Undergraduate students as the novice raters assessed the products under…
Transcultural Adaptation of GRID Hamilton Rating Scale For Depression (GRID-HAMD) to Brazilian Portuguese and Evaluation of the Impact of Training Upon Inter-Rater Reliability.

PubMed

Henrique-Araújo, Ricardo; Osório, Flávia L; Gonçalves Ribeiro, Mônica; Soares Monteiro, Ivandro; Williams, Janet B W; Kalali, Amir; Alexandre Crippa, José; Oliveira, Irismar Reis De

2014-07-01

GRID-HAMD is a semi-structured interview guide developed to overcome flaws in HAM-D, and has been incorporated into an increasing number of studies. Carry out the transcultural adaptation of GRID-HAMD into the Brazilian Portuguese language, evaluate the inter-rater reliability of this instrument and the training impact upon this measure, and verify the raters' opinions of said instrument. The transcultural adaptation was conducted by appropriate methodology. The measurement of inter-rater reliability was done by way of videos that were evaluated by 85 professionals before and after training for the use of this instrument. The intraclass correlation coefficient (ICC) remained between 0.76 and 0.90 for GRID-HAMD-21 and between 0.72 and 0.91 for GRID-HAMD-17. The training did not have an impact on the ICC, except for a few groups of participants with a lower level of experience. Most of the participants showed high acceptance of GRID-HAMD, when compared to other versions of HAM-D. The scale presented adequate inter-rater reliability even before training began. Training did not have an impact on this measure, except for a few groups with less experience. GRID-HAMD received favorable opinions from most of the participants.
Inter-rater Reliability of Sustained Aberrant Movement Patterns as a Clinical Assessment of Muscular Fatigue

PubMed Central

Aerts, Frank; Carrier, Kathy; Alwood, Becky

2016-01-01

Background: The assessment of clinical manifestation of muscle fatigue is an effective procedure in establishing therapeutic exercise dose. Few studies have evaluated physical therapist reliability in establishing muscle fatigue through detection of changes in quality of movement patterns in a live setting. Objective: The purpose of this study is to evaluate the inter-rater reliability of physical therapists’ ability to detect altered movement patterns due to muscle fatigue. Design: A reliability study in a live setting with multiple raters. Participants: Forty-four healthy individuals (ages 19-35) were evaluated by six physical therapists in a live setting. Methods: Participants were evaluated by physical therapists for altered movement patterns during resisted shoulder rotation. Each participant completed a total of four tests: right shoulder internal rotation, right shoulder external rotation, left shoulder internal rotation and left shoulder external rotation. Results: For all tests combined, the inter-rater reliability for a single rater scoring, ICC(2,1), was .65 (95% CI, .60 to .71). This corresponds to moderate inter-rater reliability between physical therapists. Limitations: The results of this study apply only to healthy participants and therefore cannot be generalized to a symptomatic population. Conclusion: Moderate inter-rater reliability was found between physical therapists in establishing muscle fatigue through the observation of sustained altered movement patterns during dynamic resistive shoulder internal and external rotation. PMID:27347241
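
For readers unfamiliar with the ICC(2,1) statistic reported in this abstract, the following Python sketch computes it from a subjects-by-raters matrix using the standard two-way random-effects, single-rater formula of Shrout and Fleiss. The function name and the complete, balanced input layout are assumptions made for illustration; the study's own software and data are not given in the abstract.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: array of shape (n_subjects, k_raters) with no missing values.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares from the two-way ANOVA decomposition.
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)  # between subjects
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)  # between raters
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because the rater mean square enters the denominator, systematic differences between raters lower ICC(2,1) even when raters rank participants identically.
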
Reliability of Untrained and Experienced Raters on FEES: Rating Overall Residue is a Simple Task.

PubMed

Pisegna, Jessica M; Borders, James C; Kaneoka, Asako; Coster, Wendy J; Leonard, Rebecca; Langmore, Susan E

2018-03-07

The purpose of this study was to investigate the reliability of residue ratings on Fiberoptic Endoscopic Evaluation of Swallowing (FEES). We also examined rating differences based on experience to determine if years of experience influenced residue ratings. A group of 44 raters watched 81 FEES videos representing a wide range of residue severities for thin liquid, applesauce, and cracker boluses. Raters were untrained on the rating scales and simply rated their overall impression of residue amount on a visual analog scale (VAS) and a five-point ordinal scale in a randomized fashion across two sessions. Intra-class correlation coefficients, kappa coefficients, and ANOVAs were used to analyze agreement and differences in ratings. Residue ratings on both the VAS and ordinal scales had acceptable inter- and intra-rater reliability. Inter-rater agreement was acceptable (ICC > 0.7) for all comparisons. Intra-rater agreement was excellent on the VAS scale (r_c = 0.9) and good on the ordinal scale (κ = 0.78). There was no significant difference between expert ratings and other raters based on years of experience for cracker ratings (p = 0.2119) and applesauce ratings (p = 0.2899), but there was a significant difference between clinicians on thin liquid ratings (p = 0.0005). Without any specific training, raters demonstrated high reliability when rating the overall amount of residue on FEES. Years of experience with FEES did not influence residue ratings, suggesting that expert ratings of overall residue amount are not unique or specialized. Rating the overall amount of residue on FEES appears to be a simple visual-perceptual task for puree and cracker boluses.
[Quality assurance in coding expertise of hospital cases in the German DRG system. Evaluation of inter-rater reliability in MDK expertise].

PubMed

Huber, H; Brambrink, M; Funk, R; Rieger, M

2012-10-01

The purpose of this study was to evaluate differences in the G-DRG results of a hospital case by 2 independently coding MDK raters. Calculation of the inter-rater reliability between the 2 raters was performed by examination of the coding of individual hospital cases. The reasons for the non-agreement of the expert evaluations and suggestions to improve the process are discussed. From the expert evaluation pool of the MDK-WL a random sample of 0.7% of the 57,375 expertises was taken. Distribution equality with the basic total was tested by the χ² test or, respectively, Fisher's exact test. For the total of 402 individual hospital cases, the G-DRG case sums of 2 experts of the MDK were determined independently and the results checked for each individual case for agreement or non-agreement. The corresponding confidence intervals with standard errors were analysed to test if certain major diagnosis categories (MDC) were statistically significantly more affected by differing expertise results than others. In 280 of the total 402 tested hospital cases, the 2 MDK raters independently reached the same G-DRG results; in 122 cases the G-DRG case sums determined by the 2 raters differed (agreement 70%; CI 65.2-74.1). Different DRG results between the 2 experts occurred regularly in the entire MDC spectrum. No MDC chapter in which significant differences between the 2 raters arose could be identified. The results of our study demonstrate an almost 70% agreement in the evaluation of hospital cost accounts by 2 independently operating MDK. This result leaves room for improvement. Optimisation potentials can be recognised on the basis of the results. Potential for improvement was established in combination with regular further training and the expansion of binding internal code recommendations as well as exchange of code-relevant information among experts in internal forums. The presented model is in principle suitable for cross-border examinations within the MDK system with the advantage that
THE INTRA- AND INTER-RATER RELIABILITY OF THE SOCCER INJURY MOVEMENT SCREEN (SIMS).

PubMed

McCunn, Robert; Aus der Fünten, Karen; Govus, Andrew; Julian, Ross; Schimpchen, Jan; Meyer, Tim

2017-02-01

The growing volume of movement screening research reveals a belief among practitioners and researchers alike that movement quality may have an association with injury risk. However, existing movement screening tools have not considered the sport-specific movement and injury patterns relevant to soccer. The present study introduces the Soccer Injury Movement Screen (SIMS), which has been designed specifically for use within soccer. Furthermore, the purpose of the present study was to assess the intra- and inter-rater reliability of the SIMS and determine its suitability for use in further research. The study utilized a test-retest design to discern reliability. Twenty-five (11 males, 14 females) healthy, recreationally active university students (age 25.5 ± 4.0 years, height 171 ± 9 cm, weight 64.7 ± 12.6 kg) agreed to participate. The SIMS contains five sub-tests: the anterior reach, single-leg deadlift, in-line lunge, single-leg hop for distance and tuck jump. Each movement was scored out of 10 points and summed to produce a composite score out of 50. The anterior reach and single-leg hop for distance were scored in real-time while the remaining tests were filmed and scored retrospectively. Three raters conducted the SIMS with each participant on three occasions separated by an average of three and a half days (minimum one day, maximum seven days). Rater 1 re-scored the filmed movements for all participants on all occasions six months later to establish the 'pure' intra-rater (intra-occasion) reliability for those movements. Intraclass correlation coefficient (ICC) values for intra- and inter-rater composite score reliability ranged from 0.66-0.72 and 0.79-0.86 respectively. Weighted kappa values representing the intra- and inter-rater reliability of the individual sub-tests ranged from 0.35-0.91 indicating fair to almost perfect agreement. Establishing the reliability of the SIMS is a prerequisite for further research seeking to investigate
THE INTRA- AND INTER-RATER RELIABILITY OF THE SOCCER INJURY MOVEMENT SCREEN (SIMS)

PubMed Central

aus der Fünten, Karen; Govus, Andrew; Julian, Ross; Schimpchen, Jan; Meyer, Tim

2017-01-01

Background/purpose The growing volume of movement screening research reveals a belief among practitioners and researchers alike that movement quality may have an association with injury risk. However, existing movement screening tools have not considered the sport-specific movement and injury patterns relevant to soccer. The present study introduces the Soccer Injury Movement Screen (SIMS), which has been designed specifically for use within soccer. Furthermore, the purpose of the present study was to assess the intra- and inter-rater reliability of the SIMS and determine its suitability for use in further research. Methods The study utilized a test-retest design to discern reliability. Twenty-five (11 males, 14 females) healthy, recreationally active university students (age 25.5 ± 4.0 years, height 171 ± 9 cm, weight 64.7 ± 12.6 kg) agreed to participate. The SIMS contains five sub-tests: the anterior reach, single-leg deadlift, in-line lunge, single-leg hop for distance and tuck jump. Each movement was scored out of 10 points and summed to produce a composite score out of 50. The anterior reach and single-leg hop for distance were scored in real-time while the remaining tests were filmed and scored retrospectively. Three raters conducted the SIMS with each participant on three occasions separated by an average of three and a half days (minimum one day, maximum seven days). Rater 1 re-scored the filmed movements for all participants on all occasions six months later to establish the ‘pure’ intra-rater (intra-occasion) reliability for those movements. Results Intraclass correlation coefficient (ICC) values for intra- and inter-rater composite score reliability ranged from 0.66-0.72 and 0.79-0.86 respectively. Weighted kappa values representing the intra- and inter-rater reliability of the individual sub-tests ranged from 0.35-0.91 indicating fair to almost perfect agreement. Conclusions Establishing the reliability of the SIMS is a
Selection of noisy measurement locations for error reduction in static parameter identification

NASA Astrophysics Data System (ADS)

Sanayei, Masoud; Onipede, Oladipo; Babu, Suresh R.

1992-09-01

An incomplete set of noisy static force and displacement measurements is used for parameter identification of structures at the element level. Measurement location and the level of accuracy in the measured data can drastically affect the accuracy of the identified parameters. A heuristic method is presented to select a limited number of degrees of freedom (DOF) to perform a successful parameter identification and to reduce the impact of measurement errors on the identified parameters. This pretest simulation uses an error sensitivity analysis to determine the effect of measurement errors on the parameter estimates. The selected DOF can be used for nondestructive testing and health monitoring of structures. Two numerical examples, one for a truss and one for a frame, are presented to demonstrate that using the measurements at the selected subset of DOF can limit the error in the parameter estimates.
The Hierarchical Rater Model for Rated Test Items and Its Application to Large-Scale Educational Assessment Data.

ERIC Educational Resources Information Center

Patz, Richard J.; Junker, Brian W.; Johnson, Matthew S.; Mariano, Louis T.

2002-01-01

Discusses the hierarchical rater model (HRM) of R. Patz (1996) and shows how it can be used to scale examinees and items, model aspects of consensus among raters, and model individual rater severity and consistency effects. Also shows how the HRM fits into the generalizability theory framework. Compares the HRM to the conventional item response…
Baseline Error Analysis and Experimental Validation for Height Measurement of Formation InSAR Satellite

NASA Astrophysics Data System (ADS)

Gao, X.; Li, T.; Zhang, X.; Geng, X.

2018-04-01

In this paper, we propose a stochastic model of InSAR height measurement based on the interferometric geometry of the measurement. The model directly describes the relationship between baseline error and height measurement error. A simulation analysis using TanDEM-X parameters was then carried out to quantitatively evaluate the influence of baseline error on height measurement. Furthermore, a full simulation-based validation of the InSAR stochastic model was performed on the basis of SRTM DEM and TanDEM-X parameters. The spatial distribution characteristics and error propagation rule of InSAR height measurement were fully evaluated.
Utilizing measure-based feedback in control-mastery theory: A clinical error.

PubMed

Snyder, John; Aafjes-van Doorn, Katie

2016-09-01

Clinical errors and ruptures are an inevitable part of clinical practice. Often times, therapists are unaware that a clinical error or rupture has occurred, leaving no space for repair, and potentially leading to patient dropout and/or less effective treatment. One way to overcome our blind spots is by frequently and systematically collecting measure-based feedback from the patient. Patient feedback measures that focus on the process of psychotherapy such as the Patient's Experience of Attunement and Responsiveness scale (PEAR) can be used in conjunction with treatment outcome measures such as the Outcome Questionnaire 45.2 (OQ-45.2) to monitor the patient's therapeutic experience and progress. The regular use of these types of measures can aid clinicians in the identification of clinical errors and the associated patient deterioration that might otherwise go unnoticed and unaddressed. The current case study describes an instance of clinical error that occurred during the 2-year treatment of a highly traumatized young woman. The clinical error was identified using measure-based feedback and subsequently understood and addressed from the theoretical standpoint of the control-mastery theory of psychotherapy. An alternative hypothetical response is also presented and explained using control-mastery theory. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
In the eye of the beholder: the effect of rater variability and different rating scales on QTL mapping.

PubMed

Poland, Jesse A; Nelson, Rebecca J

2011-02-01

The agronomic importance of developing durably resistant cultivars has led to substantial research in the field of quantitative disease resistance (QDR) and, in particular, mapping quantitative trait loci (QTL) for disease resistance. The assessment of QDR is typically conducted by visual estimation of disease severity, which raises concern over the accuracy and precision of visual estimates. Although previous studies have examined the factors affecting the accuracy and precision of visual disease assessment in relation to the true value of disease severity, the impact of this variability on the identification of disease resistance QTL has not been assessed. In this study, the effects of rater variability and rating scales on mapping QTL for northern leaf blight resistance in maize were evaluated in a recombinant inbred line population grown under field conditions. The population of 191 lines was evaluated by 22 different raters using a direct percentage estimate, a 0-to-9 ordinal rating scale, or both. It was found that more experienced raters had higher precision and that using a direct percentage estimation of diseased leaf area produced higher precision than using an ordinal scale. QTL mapping was then conducted using the disease estimates from each rater using stepwise general linear model selection (GLM) and inclusive composite interval mapping (ICIM). For GLM, the same QTL were largely found across raters, though some QTL were only identified by a subset of raters. The magnitudes of estimated allele effects at identified QTL varied drastically, sometimes by as much as threefold. ICIM produced highly consistent results across raters and for the different rating scales in identifying the location of QTL. We conclude that, despite variability between raters, the identification of QTL was largely consistent among raters, particularly when using ICIM. However, care should be taken in estimating QTL allele effects, because this was highly variable and rater
Reliability, standard error, and minimum detectable change of clinical pressure pain threshold testing in people with and without acute neck pain.

PubMed

Walton, David M; Macdermid, Joy C; Nielson, Warren; Teasell, Robert W; Chiasson, Marco; Brown, Lauren

2011-09-01

Clinical measurement. To evaluate the intrarater, interrater, and test-retest reliability of an accessible digital algometer, and to determine the minimum detectable change in normal healthy individuals and a clinical population with neck pain. Pressure pain threshold testing may be a valuable assessment and prognostic indicator for people with neck pain. To date, most of this research has been completed using algometers that are too resource intensive for routine clinical use. Novice raters (physiotherapy students or clinical physiotherapists) were trained to perform algometry testing over 2 clinically relevant sites: the angle of the upper trapezius and the belly of the tibialis anterior. A convenience sample of normal healthy individuals and a clinical sample of people with neck pain were tested by 2 different raters (all participants) and on 2 different days (healthy participants only). Intraclass correlation coefficient (ICC), standard error of measurement, and minimum detectable change were calculated. A total of 60 healthy volunteers and 40 people with neck pain were recruited. Intrarater reliability was almost perfect (ICC = 0.94-0.97), interrater reliability was substantial to near perfect (ICC = 0.79-0.90), and test-retest reliability was substantial (ICC = 0.76-0.79). Smaller change was detectable in the trapezius compared to the tibialis anterior. This study provides evidence that novice raters can perform digital algometry with adequate reliability for research and clinical use in people with and without neck pain.
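
The standard error of measurement (SEM) and minimum detectable change (MDC) named in this abstract are commonly derived from the ICC with the formulas SEM = SD × sqrt(1 − ICC) and MDC90 = 1.645 × sqrt(2) × SEM. The Python sketch below applies them. The abstract does not state which variant the authors used, so the formulas, the function names, and the example inputs are assumptions for illustration only.

```python
import math

def standard_error_of_measurement(sd, icc):
    """SEM = SD * sqrt(1 - ICC), with SD the between-subject standard deviation."""
    return sd * math.sqrt(1.0 - icc)

def minimum_detectable_change(sem, z=1.645):
    """MDC at the confidence level implied by z (1.645 for 90%, 1.96 for 95%).

    The sqrt(2) factor reflects that a change score combines the error of two
    measurements.
    """
    return z * math.sqrt(2.0) * sem

# Hypothetical inputs, not taken from the study.
sem = standard_error_of_measurement(sd=10.0, icc=0.90)
mdc90 = minimum_detectable_change(sem)
```

A change smaller than `mdc90` cannot be distinguished from measurement error at the 90% confidence level under these assumptions.
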
Error analysis on spinal motion measurement using skin mounted sensors.

PubMed

Yang, Zhengyi; Ma, Heather Ting; Wang, Deming; Lee, Raymond

2008-01-01

Measurement errors of skin-mounted sensors in measuring forward bending movement of the lumbar spines are investigated. In this investigation, radiographic images capturing the entire lumbar spines' positions were acquired and used as a 'gold' standard. Seventeen young male volunteers (21 (SD 1) years old) agreed to participate in the study. Light-weight miniature sensors of the electromagnetic tracking systems-Fastrak were attached to the skin overlying the spinous processes of the lumbar spine. With the sensors attached, the subjects were requested to take lateral radiographs in two postures: neutral upright and full flexion. The ranges of motions of lumbar spine were calculated from two sets of digitized data, the bony markers of vertebral bodies and the sensors, and compared. The differences between the two sets of results were then analyzed. The relative movement between sensor and vertebrae was decomposed into sensor sliding and tilting, from which sliding error and tilting error were introduced. Gross motion range of forward bending of lumbar spine measured from bony markers of vertebrae is 67.8 degrees (SD 10.6 degrees) and that from sensors is 62.8 degrees (SD 12.8 degrees). The error and absolute error for gross motion range were 5.0 degrees (SD 7.2 degrees) and 7.7 degrees (SD 3.9 degrees). The contributions of sensors placed on S1 and L1 to the absolute error were 3.9 degrees (SD 2.9 degrees) and 4.4 degrees (SD 2.8 degrees), respectively.
Design considerations for case series models with exposure onset measurement error.

PubMed

Mohammed, Sandra M; Dalrymple, Lorien S; Sentürk, Damla; Nguyen, Danh V

2013-02-28

The case series model allows for estimation of the relative incidence of events, such as cardiovascular events, within a pre-specified time window after an exposure, such as an infection. The method requires only cases (individuals with events) and controls for all fixed/time-invariant confounders. The measurement error case series model extends the original case series model to handle imperfect data, where the timing of an infection (exposure) is not known precisely. In this work, we propose a method for power/sample size determination for the measurement error case series model. Extensive simulation studies are used to assess the accuracy of the proposed sample size formulas. We also examine the magnitude of the relative loss of power due to exposure onset measurement error, compared with the ideal situation where the time of exposure is measured precisely. To facilitate the design of case series studies, we provide publicly available web-based tools for determining power/sample size for both the measurement error case series model as well as the standard case series model. Copyright © 2012 John Wiley & Sons, Ltd.
Can Raters with Reduced Job Descriptive Information Provide Accurate Position Analysis Questionnaire (PAQ) Ratings?

ERIC Educational Resources Information Center

Friedman, Lee; Harvey, Robert J.

1986-01-01

Job-naive raters provided with job descriptive information made Position Analysis Questionnaire (PAQ) ratings which were validated against ratings of job analysts who were also job content experts. None of the reduced job descriptive information conditions enabled job-naive raters to obtain either acceptable levels of convergent validity with…
Swath-altimetry measurements of the main stem Amazon River: measurement errors and hydraulic implications

NASA Astrophysics Data System (ADS)

Wilson, M. D.; Durand, M.; Jung, H. C.; Alsdorf, D.

2015-04-01

The Surface Water and Ocean Topography (SWOT) mission, scheduled for launch in 2020, will provide a step-change improvement in the measurement of terrestrial surface-water storage and dynamics. In particular, it will provide the first, routine two-dimensional measurements of water-surface elevations. In this paper, we aimed to (i) characterise and illustrate in two dimensions the errors which may be found in SWOT swath measurements of terrestrial surface water, (ii) simulate the spatio-temporal sampling scheme of SWOT for the Amazon, and (iii) assess the impact of each of these on estimates of water-surface slope and river discharge which may be obtained from SWOT imagery. We based our analysis on a virtual mission for a ~260 km reach of the central Amazon (Solimões) River, using a hydraulic model to provide water-surface elevations according to SWOT spatio-temporal sampling to which errors were added based on a two-dimensional height error spectrum derived from the SWOT design requirements. We thereby obtained water-surface elevation measurements for the Amazon main stem as may be observed by SWOT. Using these measurements, we derived estimates of river slope and discharge and compared them to those obtained directly from the hydraulic model. We found that cross-channel and along-reach averaging of SWOT measurements using reach lengths greater than 4 km for the Solimões and 7.5 km for Purus reduced the effect of systematic height errors, enabling discharge to be reproduced accurately from the water height, assuming known bathymetry and friction. Using cross-sectional averaging and 20 km reach lengths, results show Nash-Sutcliffe model efficiency values of 0.99 for the Solimões and 0.88 for the Purus, with 2.6 and 19.1 % average overall error in discharge, respectively. We extend the results to other rivers worldwide and infer that SWOT-derived discharge estimates may be more accurate for rivers with larger channel widths (permitting a greater level of cross
A Simple Endoscopic Technique for Measuring the Cross-Sectional Area of the Upper Airway in a Rabbit Model.

PubMed

Wistermayer, Paul R; McIlwain, Wesley R; Ieronimakis, Nicholas; Rogers, Derek J

2018-04-01

Validate an accurate and reproducible method of measuring the cross-sectional area (CSA) of the upper airway. This is a prospective animal study done at a tertiary care medical treatment facility. Control images were obtained using endotracheal tubes of varying sizes. In vivo images were obtained from various timepoints of a concurrent study on subglottic stenosis. Using a 0° rod telescope, an instrument was placed at the level of interest, and a photo was obtained. Three independent and blinded raters then measured the CSA of the narrowest portion of the airway using open source image analysis software. Each blinded rater measured the CSA of 79 photos. The t test used to assess accuracy showed no difference between measured and known CSAs of the control images (P = .86), with an average error of 1.5% (SD = 5.5%). All intraclass correlation (ICC) values for intrarater agreement showed excellent agreement (ICC > .75). Interrater reliability among all raters in control (ICC = .975; 95% CI, .817-.995) and in vivo (ICC = .846; 95% CI, .780-.896) images showed excellent agreement. We validate a simple, accurate, and reproducible method of measuring the CSA of the airway that can be used in a clinical or research setting.
Modal Correction Method For Dynamically Induced Errors In Wind-Tunnel Model Attitude Measurements

NASA Technical Reports Server (NTRS)

Buehrle, R. D.; Young, C. P., Jr.

1995-01-01

This paper describes a method for correcting the dynamically induced bias errors in wind tunnel model attitude measurements using measured modal properties of the model system. At NASA Langley Research Center, the predominant instrumentation used to measure model attitude is a servo-accelerometer device that senses the model attitude with respect to the local vertical. Under smooth wind tunnel operating conditions, this inertial device can measure the model attitude with an accuracy of 0.01 degree. During wind tunnel tests when the model is responding at high dynamic amplitudes, the inertial device also senses the centrifugal acceleration associated with model vibration. This centrifugal acceleration results in a bias error in the model attitude measurement. A study of the response of a cantilevered model system to a simulated dynamic environment shows significant bias error in the model attitude measurement can occur and is vibration mode and amplitude dependent. For each vibration mode contributing to the bias error, the error is estimated from the measured modal properties and tangential accelerations at the model attitude device. Linear superposition is used to combine the bias estimates for individual modes to determine the overall bias error as a function of time. The modal correction model predicts the bias error to a high degree of accuracy for the vibration modes characterized in the simulated dynamic environment.
The better way to determine the validity, reliability, objectivity and accuracy of measuring devices.

PubMed

Pazira, Parvin; Rostami Haji-Abadi, Mahdi; Zolaktaf, Vahid; Sabahi, Mohammadfarzan; Pazira, Toomaj

2016-06-08

In relation to statistical analysis, studies to determine the validity, reliability, objectivity and precision of new measuring devices are usually incomplete, due in part to using only correlation coefficient and ignoring the data dispersion. The aim of this study was to demonstrate the best way to determine the validity, reliability, objectivity and accuracy of an electro-inclinometer or other measuring devices. Another purpose of this study is to answer the question of whether reliability and objectivity represent accuracy of measuring devices. The validity of an electro-inclinometer was examined by mechanical and geometric methods. The objectivity and reliability of the device was assessed by calculating Cronbach's alpha for repeated measurements by three raters and by measurements on the same person by mechanical goniometer and the electro-inclinometer. Measurements were performed on "hip flexion with the extended knee" and "shoulder abduction with the extended elbow." The raters measured every angle three times within an interval of two hours. The three-way ANOVA was used to determine accuracy. The results of mechanical and geometric analysis showed that validity of the electro-inclinometer was 1.00 and level of error was less than one degree. Objectivity and reliability of electro-inclinometer was 0.999, while objectivity of mechanical goniometer was in the range of 0.802 to 0.966 and the reliability was 0.760 to 0.961. For hip flexion, the difference between raters in joints angle measurement by electro-inclinometer and mechanical goniometer was 1.74 and 16.33 degree (P<0.05), respectively. The differences for shoulder abduction measurement by electro-inclinometer and goniometer were 0.35 and 4.40 degree (P<0.05). Although both the objectivity and reliability are acceptable, the results showed that measurement error was very high in the mechanical goniometer. Therefore, it can be concluded that objectivity and reliability alone cannot determine the accuracy

A Cross-Linguistic Investigation of the Effect of Raters' Accent Familiarity on Speaking Assessment

ERIC Educational Resources Information Center

Huang, Becky; Alegre, Analucia; Eisenberg, Ann

2016-01-01

The project aimed to examine the effect of raters' familiarity with accents on their judgments of non-native speech. Participants included three groups of raters who were either from Spanish Heritage, Spanish Non-Heritage, or Chinese Heritage backgrounds (n = 16 in each group) using Winke & Gass's (2013) definition of a heritage learner as…
The Problem of Limited Inter-rater Agreement in Modelling Music Similarity

PubMed Central

Flexer, Arthur; Grill, Thomas

2016-01-01

One of the central goals of Music Information Retrieval (MIR) is the quantification of similarity between or within pieces of music. These quantitative relations should mirror the human perception of music similarity, which is however highly subjective with low inter-rater agreement. Unfortunately this principal problem has been given little attention in MIR so far. Since it is not meaningful to have computational models that go beyond the level of human agreement, these levels of inter-rater agreement present a natural upper bound for any algorithmic approach. We will illustrate this fundamental problem in the evaluation of MIR systems using results from two typical application scenarios: (i) modelling of music similarity between pieces of music; (ii) music structure analysis within pieces of music. For both applications, we derive upper bounds of performance which are due to the limited inter-rater agreement. We compare these upper bounds to the performance of state-of-the-art MIR systems and show how the upper bounds prevent further progress in developing better MIR systems. PMID:28190932
Quantitative evaluation of statistical errors in small-angle X-ray scattering measurements

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sedlak, Steffen M.; Bruetzel, Linda K.; Lipfert, Jan

A new model is proposed for the measurement errors incurred in typical small-angle X-ray scattering (SAXS) experiments, which takes into account the setup geometry and physics of the measurement process. The model accurately captures the experimentally determined errors from a large range of synchrotron and in-house anode-based measurements. Its most general formulation gives for the variance of the buffer-subtracted SAXS intensity σ²(q) = [I(q) + const.]/(kq), where I(q) is the scattering intensity as a function of the momentum transfer q; k and const. are fitting parameters that are characteristic of the experimental setup. The model gives a concrete procedure for calculating realistic measurement errors for simulated SAXS profiles. In addition, the results provide guidelines for optimizing SAXS measurements, which are in line with established procedures for SAXS experiments, and enable a quantitative evaluation of measurement errors.
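The variance model quoted in this abstract can be turned into a small helper for attaching error bars to a simulated profile. The following is a minimal sketch, not the authors' code: the function names, the Gaussian noise assumption, and the example values of k and const. are all illustrative assumptions.

```python
import math
import random


def saxs_sigma(q, intensity, k, const):
    """Standard deviation from sigma^2(q) = [I(q) + const] / (k * q)."""
    if q <= 0 or k <= 0:
        raise ValueError("q and k must be positive")
    variance = (intensity + const) / (k * q)
    if variance < 0:
        raise ValueError("model gives a negative variance for these inputs")
    return math.sqrt(variance)


def add_noise(qs, intensities, k, const, seed=0):
    """Return a noisy copy of a simulated profile.

    Assumption: errors are drawn independently from a normal
    distribution with the model's standard deviation at each q.
    """
    rng = random.Random(seed)
    return [
        i_q + rng.gauss(0.0, saxs_sigma(q, i_q, k, const))
        for q, i_q in zip(qs, intensities)
    ]
```

In practice k and const. would be fitted to measured errors from the setup of interest, as the abstract describes; the values passed in here are placeholders.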
Measurement Error Calibration in Mixed-Mode Sample Surveys

ERIC Educational Resources Information Center

Buelens, Bart; van den Brakel, Jan A.

2015-01-01

Mixed-mode surveys are known to be susceptible to mode-dependent selection and measurement effects, collectively referred to as mode effects. The use of different data collection modes within the same survey may reduce selectivity of the overall response but is characterized by measurement errors differing across modes. Inference in sample surveys…
Inter-rater agreement in evaluation of disability: systematic review of reproducibility studies.

PubMed

Barth, Jürgen; de Boer, Wout E L; Busse, Jason W; Hoving, Jan L; Kedzia, Sarah; Couban, Rachel; Fischer, Katrin; von Allmen, David Y; Spanjer, Jerry; Kunz, Regina

2017-01-25

To explore agreement among healthcare professionals assessing eligibility for work disability benefits. Systematic review and narrative synthesis of reproducibility studies. Medline, Embase, and PsycINFO searched up to 16 March 2016, without language restrictions, and review of bibliographies of included studies. Observational studies investigating reproducibility among healthcare professionals performing disability evaluations using a global rating of working capacity and reporting inter-rater reliability by a statistical measure or descriptively. Studies could be conducted in insurance settings, where decisions on ability to work include normative judgments based on legal considerations, or in research settings, where decisions on ability to work disregard normative considerations. Teams of paired reviewers identified eligible studies, appraised their methodological quality and generalisability, and abstracted results with pretested forms. As heterogeneity of research designs and findings impeded a quantitative analysis, a descriptive synthesis stratified by setting (insurance or research) was performed. From 4562 references, 101 full text articles were reviewed. Of these, 16 studies conducted in an insurance setting and seven in a research setting, performed in 12 countries, met the inclusion criteria. Studies in the insurance setting were conducted with medical experts assessing claimants who were actual disability claimants or played by actors, hypothetical cases, or short written scenarios. Conditions were mental (n=6, 38%), musculoskeletal (n=4, 25%), or mixed (n=6, 38%). Applicability of findings from studies conducted in an insurance setting to real life evaluations ranged from generalisable (n=7, 44%) and probably generalisable (n=3, 19%) to probably not generalisable (n=6, 37%). Median inter-rater reliability among experts was 0.45 (range intraclass correlation coefficient 0.86 to κ −0.10). Inter-rater reliability was poor in six studies (37
Inter-rater agreement in evaluation of disability: systematic review of reproducibility studies

PubMed Central

Barth, Jürgen; de Boer, Wout E L; Busse, Jason W; Hoving, Jan L; Kedzia, Sarah; Couban, Rachel; Fischer, Katrin; von Allmen, David Y; Spanjer, Jerry

2017-01-01

Objectives To explore agreement among healthcare professionals assessing eligibility for work disability benefits. Design Systematic review and narrative synthesis of reproducibility studies. Data sources Medline, Embase, and PsycINFO searched up to 16 March 2016, without language restrictions, and review of bibliographies of included studies. Eligibility criteria Observational studies investigating reproducibility among healthcare professionals performing disability evaluations using a global rating of working capacity and reporting inter-rater reliability by a statistical measure or descriptively. Studies could be conducted in insurance settings, where decisions on ability to work include normative judgments based on legal considerations, or in research settings, where decisions on ability to work disregard normative considerations. Teams of paired reviewers identified eligible studies, appraised their methodological quality and generalisability, and abstracted results with pretested forms. As heterogeneity of research designs and findings impeded a quantitative analysis, a descriptive synthesis stratified by setting (insurance or research) was performed. Results From 4562 references, 101 full text articles were reviewed. Of these, 16 studies conducted in an insurance setting and seven in a research setting, performed in 12 countries, met the inclusion criteria. Studies in the insurance setting were conducted with medical experts assessing claimants who were actual disability claimants or played by actors, hypothetical cases, or short written scenarios. Conditions were mental (n=6, 38%), musculoskeletal (n=4, 25%), or mixed (n=6, 38%). Applicability of findings from studies conducted in an insurance setting to real life evaluations ranged from generalisable (n=7, 44%) and probably generalisable (n=3, 19%) to probably not generalisable (n=6, 37%). Median inter-rater reliability among experts was 0.45 (range intraclass correlation coefficient 0.86 to κ−0
Unified Parkinson's Disease Rating Scale-Motor Exam: inter-rater reliability of advanced practice nurse and neurologist assessments.

PubMed

Palmer, Janice L; Coats, Mary A; Roe, Catherine M; Hanko, Shelly M; Xiong, Chengjie; Morris, John C

2010-06-01

This paper is a report of a study to establish the inter-rater reliability of advanced practice nurse and neurologist neurological assessments which included ratings with the Unified Parkinson's Disease Rating Scale-Motor Exam. Around the world, advanced practice nurses are performing tasks once completed only by physicians. To promote consumer and provider confidence, it is important to establish that nurse and physician ratings using assessment tools are similar. In addition in research settings, when different raters are used, establishment of inter-rater reliability for study assessments is needed. Advanced practice nurses and neurologists independently recorded findings on neurological examinations of 46 participants in a study conducted between August 2007 and January 2008. An intraclass correlation coefficient was calculated to estimate overall agreement between the nurse and neurologist ratings. Agreement for individual items measured on a dichotomous scale was assessed by calculating Cohen's kappa. There was substantial agreement between advanced practice nurses and neurologists on the mean Unified Parkinson's Disease Rating Scale-Motor Exam ratings (intraclass correlation coefficient = 0.65) and the U.S. National Alzheimer's Coordinating Center Uniform Data Set neurological examination ratings of unremarkable findings (kappa = 0.74) and of gait disorder (kappa = 0.73). Moderate agreement (kappa = 0.53) was reached for the rating of whether all Unified Parkinson's Disease Rating Scale-Motor Exam items were normal. These findings are consistent with studies of the inter-rater agreement of the Unified Parkinson's Disease Rating Scale-Motor Exam and support the conduct of neurological assessments by advanced practice nurses.
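Cohen's kappa, used in this record for the dichotomous items, corrects the observed agreement between two raters for the agreement expected by chance. The following is a minimal sketch of the standard calculation; the function name and the example ratings are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same subjects."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of subjects on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical example: 1 = finding present, 0 = finding absent.
nurse = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
neurologist = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(nurse, neurologist)
```

Here the raters agree on 8 of 10 subjects and chance agreement is 0.5, so kappa is (0.8 - 0.5) / (1 - 0.5) = 0.6.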
Optical radiation measurements: instrumentation and sources of error.

PubMed

Landry, R J; Andersen, F A

1982-07-01

Accurate measurement of optical radiation is required when sources of this radiation are used in biological research. The most difficult measurements of broadband noncoherent optical radiations usually must be performed by a highly trained specialist using sophisticated, complex, and expensive instruments. Presentation of the results of such measurement requires correct use of quantities and units with which many biological researchers are unfamiliar. The measurement process, physical quantities and units, measurement systems with instruments, and sources of error and uncertainties associated with optical radiation measurements are reviewed.
Eccentricity error identification and compensation for high-accuracy 3D optical measurement.

PubMed

He, Dong; Liu, Xiaoli; Peng, Xiang; Ding, Yabin; Gao, Bruce Z

2013-07-01

The circular target has been widely used in various three-dimensional optical measurements, such as camera calibration, photogrammetry and structured light projection measurement system. The identification and compensation of the circular target systematic eccentricity error caused by perspective projection is an important issue for ensuring accurate measurement. This paper introduces a novel approach for identifying and correcting the eccentricity error with the help of a concentric circles target. Compared with previous eccentricity error correction methods, our approach does not require taking care of the geometric parameters of the measurement system regarding target and camera. Therefore, the proposed approach is very flexible in practical applications, and in particular, it is also applicable in the case of only one image with a single target available. The experimental results are presented to prove the efficiency and stability of the proposed approach for eccentricity error compensation.
Tracking and shape errors measurement of concentrating heliostats

NASA Astrophysics Data System (ADS)

Coquand, Mathieu; Caliot, Cyril; Hénault, François

2017-09-01

In solar tower power plants, factors such as tracking accuracy, facet misalignment and surface shape errors of concentrating heliostats are of prime importance for the efficiency of the system. At industrial scale, one critical issue is the time and effort required to adjust the different mirrors of the faceted heliostats, which could take several months using current techniques. Thus, methods enabling quick adjustment of a field with a huge number of heliostats are essential for the rise of solar tower technology. This communication describes a new method for heliostat characterization that makes use of four cameras located near the solar receiver and simultaneously recording images of the sun reflected by the optical surfaces. From knowledge of a measured sun profile, data processing of the acquired images allows reconstructing the slope and shape errors of the heliostats, including tracking and canting errors. The mathematical basis of this shape reconstruction process is explained comprehensively. Numerical simulations demonstrate that the measurement accuracy of this "backward-gazing method" is compliant with the requirements of solar concentrating optics. Finally, we present our first experimental results obtained at the THEMIS experimental solar tower plant in Targasonne, France.
Inter-rater agreement in the assessment of abnormal chest X-ray findings for tuberculosis between two Asian countries

PubMed Central

2012-01-01

Background Inter-rater agreement in the interpretation of chest X-ray (CXR) films is crucial for clinical and epidemiological studies of tuberculosis. We compared the readings of CXR films used for a survey of tuberculosis between raters from two Asian countries. Methods Of the 11,624 people enrolled in a prevalence survey in Hanoi, Viet Nam, in 2003, we studied 258 individuals whose CXR films did not exclude the possibility of active tuberculosis. Follow-up films obtained from accessible individuals in 2006 were also analyzed. Two Japanese and two Vietnamese raters read the CXR films based on a coding system proposed by Den Boon et al. and another system newly developed in this study. Inter-rater agreement was evaluated by kappa statistics. Marginal homogeneity was evaluated by the generalized estimating equation (GEE). Results CXR findings suspected of tuberculosis differed between the four raters. The frequencies of infiltrates and fibrosis/scarring detected on the films significantly differed between the raters from the two countries (P < 0.0001 and P = 0.0082, respectively, by GEE). The definition of findings such as primary cavity, used in the coding systems also affected the degree of agreement. Conclusions CXR findings were inconsistent between the raters with different backgrounds. High inter-rater agreement is a component necessary for an optimal CXR coding system, particularly in international studies. An analysis of reading results and a thorough discussion to achieve a consensus would be necessary to achieve further consistency and high quality of reading. PMID:22296612
A Bayesian hierarchical latent trait model for estimating rater bias and reliability in large-scale performance assessment

PubMed Central

2018-01-01

We propose a novel approach to modelling rater effects in scoring-based assessment. The approach is based on a Bayesian hierarchical model and simulations from the posterior distribution. We apply it to large-scale essay assessment data over a period of 5 years. Empirical results suggest that the model provides a good fit for both the total scores and when applied to individual rubrics. We estimate the median impact of rater effects on the final grade to be ± 2 points on a 50 point scale, while 10% of essays would receive a score at least ± 5 different from their actual quality. Most of the impact is due to rater unreliability, not rater bias. PMID:29614129
Efficient Measurement of Quantum Gate Error by Interleaved Randomized Benchmarking

NASA Astrophysics Data System (ADS)

Magesan, Easwar; Gambetta, Jay M.; Johnson, B. R.; Ryan, Colm A.; Chow, Jerry M.; Merkel, Seth T.; da Silva, Marcus P.; Keefe, George A.; Rothwell, Mary B.; Ohki, Thomas A.; Ketchen, Mark B.; Steffen, M.

2012-08-01

We describe a scalable experimental protocol for estimating the average error of individual quantum computational gates. This protocol consists of interleaving random Clifford gates between the gate of interest and provides an estimate as well as theoretical bounds for the average error of the gate under test, so long as the average noise variation over all Clifford gates is small. This technique takes into account both state preparation and measurement errors and is scalable in the number of qubits. We apply this protocol to a superconducting qubit system and find a bounded average error of 0.003 [0,0.016] for the single-qubit gates Xπ/2 and Yπ/2. These bounded values provide better estimates of the average error than those extracted via quantum process tomography.
Measurement Error Correction for Predicted Spatiotemporal Air Pollution Exposures.

PubMed

Keller, Joshua P; Chang, Howard H; Strickland, Matthew J; Szpiro, Adam A

2017-05-01

Air pollution cohort studies are frequently analyzed in two stages, first modeling exposure then using predicted exposures to estimate health effects in a second regression model. The difference between predicted and unobserved true exposures introduces a form of measurement error in the second stage health model. Recent methods for spatial data correct for measurement error with a bootstrap and by requiring the study design ensure spatial compatibility, that is, monitor and subject locations are drawn from the same spatial distribution. These methods have not previously been applied to spatiotemporal exposure data. We analyzed the association between fine particulate matter (PM2.5) and birth weight in the US state of Georgia using records with estimated date of conception during 2002-2005 (n = 403,881). We predicted trimester-specific PM2.5 exposure using a complex spatiotemporal exposure model. To improve spatial compatibility, we restricted to mothers residing in counties with a PM2.5 monitor (n = 180,440). We accounted for additional measurement error via a nonparametric bootstrap. Third trimester PM2.5 exposure was associated with lower birth weight in the uncorrected (-2.4 g per 1 μg/m³ difference in exposure; 95% confidence interval [CI]: -3.9, -0.8) and bootstrap-corrected (-2.5 g, 95% CI: -4.2, -0.8) analyses. Results for the unrestricted analysis were attenuated (-0.66 g, 95% CI: -1.7, 0.35). This study presents a novel application of measurement error correction for spatiotemporal air pollution exposures. Our results demonstrate the importance of spatial compatibility between monitor and subject locations and provide evidence of the association between air pollution exposure and birth weight.
Inter-rater Agreement of End-of-shift Evaluations Based on a Single Encounter

PubMed Central

Warrington, Steven; Beeson, Michael; Bradford, Amber

2017-01-01

Introduction End-of-shift evaluation (ESE) forms, also known as daily encounter cards, represent a subset of encounter-based assessment forms. Encounter cards have become prevalent for formative evaluation, with some suggesting a potential for summative evaluation. Our objective was to evaluate the inter-rater agreement of ESE forms using a single scripted encounter at a conference of emergency medicine (EM) educators. Methods Following institutional review board exemption, we created a scripted video simulating an encounter between an intern and a patient with an ankle injury. That video was shown during a lecture at the Council of EM Residency Director’s Academic Assembly with attendees asked to evaluate the “resident” using one of eight possible ESE forms randomly distributed. Descriptive statistics were used to analyze the results with Fleiss’ kappa to evaluate inter-rater agreement. Results Most of the 324 respondents were leadership in residency programs (66%), with a range of 29–47 responses per evaluation form. Few individuals (5%) felt they were experts in assessing residents based on EM milestones. Fleiss’ kappa ranged from 0.157 to 0.308 and did not perform much better in two post-hoc subgroup analyses. Conclusion The kappa ranges found show only slight to fair inter-rater agreement and raise concerns about the use of ESE forms in assessment of EM residents. Despite limitations present in this study, these results and a lack of other studies on inter-rater agreement of encounter cards should prompt further studies of such methods of assessment. Additionally, EM educators should focus research on methods to improve inter-rater agreement of ESE forms or on evaluating other methods of assessment of EM residents. PMID:28435505
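Fleiss' kappa extends chance-corrected agreement to more than two raters. The following is a minimal sketch of the standard formula, assuming every subject (or rated item) receives the same number of ratings; the function name and the count tables in the usage note are hypothetical and do not reproduce the study's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a subjects-by-categories table of rating counts.

    counts[i][j] is the number of raters who assigned subject i to
    category j. Every row must sum to the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_subjects == 0 or n_raters < 2:
        raise ValueError("need at least one subject and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number of ratings")

    # Per-subject agreement: share of rater pairs that agree.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_subject) / n_subjects

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings are identical")
    return (p_bar - p_expected) / (1.0 - p_expected)
```

For example, `fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])` gives 1.0 (perfect agreement), while `fleiss_kappa([[2, 1], [1, 2]])` gives -1/3, because the raters agree less often than chance alone would predict.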
Partial compensation interferometry measurement system for parameter errors of conicoid surface

NASA Astrophysics Data System (ADS)

Hao, Qun; Li, Tengfei; Hu, Yao; Wang, Shaopu; Ning, Yan; Chen, Zhuo

2018-06-01

Surface parameters, such as vertex radius of curvature and conic constant, are used to describe the shape of an aspheric surface. Surface parameter errors (SPEs) are deviations affecting the optical characteristics of an aspheric surface. Precise measurement of SPEs is critical in the evaluation of optical surfaces. In this paper, a partial compensation interferometry measurement system for SPE of a conicoid surface is proposed based on the theory of slope asphericity and the best compensation distance. The system is developed to measure the SPE-caused best compensation distance change and SPE-caused surface shape change and then calculate the SPEs with the iteration algorithm for accuracy improvement. Experimental results indicate that the average relative measurement accuracy of the proposed system could be better than 0.02% for the vertex radius of curvature error and 2% for the conic constant error.
Eccentricity error identification and compensation for high-accuracy 3D optical measurement

PubMed Central

He, Dong; Liu, Xiaoli; Peng, Xiang; Ding, Yabin; Gao, Bruce Z

2016-01-01

The circular target has been widely used in various three-dimensional optical measurements, such as camera calibration, photogrammetry and structured light projection measurement system. The identification and compensation of the circular target systematic eccentricity error caused by perspective projection is an important issue for ensuring accurate measurement. This paper introduces a novel approach for identifying and correcting the eccentricity error with the help of a concentric circles target. Compared with previous eccentricity error correction methods, our approach does not require taking care of the geometric parameters of the measurement system regarding target and camera. Therefore, the proposed approach is very flexible in practical applications, and in particular, it is also applicable in the case of only one image with a single target available. The experimental results are presented to prove the efficiency and stability of the proposed approach for eccentricity error compensation. PMID:26900265
Influence of video compression on the measurement error of the television system

NASA Astrophysics Data System (ADS)

Sotnik, A. V.; Yarishev, S. N.; Korotaev, V. V.

2015-05-01

Video data require a very large memory capacity. Finding the encoding method with the optimal quality-to-volume ratio is one of the most pressing problems, given the urgent need to transfer large amounts of video over various networks. Digital TV signal compression reduces the amount of data used to represent a video stream, and thereby effectively reduces the stream required for transmission and storage. When television measuring systems are used, it is important to take into account the uncertainties caused by compression of the video signal. There are many digital compression methods. The aim of the proposed work is to study the influence of video compression on the measurement error in television systems. The measurement error of the object parameter is the main characteristic of television measuring systems. Accuracy characterizes the difference between the measured value and the actual parameter value. Errors caused by the optical system are one source of error in television system measurements. The method used to process the received video signal is another source of error. The presence of errors leads to large distortions in the case of compression at a constant data stream rate, and increases the amount of data required to transmit or record an image frame in the case of constant quality. The purpose of intra-coding is to reduce the spatial redundancy within a frame (or field) of a television image. This redundancy is caused by the strong correlation between the elements of the image. If a suitable orthogonal transformation can be found, an array of image samples can be converted into a matrix of coefficients that are not correlated with each other. Entropy coding can then be applied to these uncorrelated coefficients to reduce the digital stream. The transformation can be chosen so that most of the matrix coefficients are almost zero for typical images. Excluding these zero coefficients also
A new accuracy measure based on bounded relative error for time series forecasting

PubMed Central

Chen, Chao; Twycross, Jamie; Garibaldi, Jonathan M.

2017-01-01

Many accuracy measures have been proposed in the past for time series forecasting comparisons. However, many of these measures suffer from one or more issues such as poor resistance to outliers and scale dependence. In this paper, while summarising commonly used accuracy measures, a special review is made on the symmetric mean absolute percentage error. Moreover, a new accuracy measure called the Unscaled Mean Bounded Relative Absolute Error (UMBRAE), which combines the best features of various alternative measures, is proposed to address the common issues of existing measures. A comparative evaluation on the proposed and related measures has been made with both synthetic and real-world data. The results indicate that the proposed measure, with user selectable benchmark, performs as well as or better than other measures on selected criteria. Though it has been commonly accepted that there is no single best accuracy measure, we suggest that UMBRAE could be a good choice to evaluate forecasting methods, especially for cases where measures based on geometric mean of relative errors, such as the geometric mean relative absolute error, are preferred. PMID:28339480
A new accuracy measure based on bounded relative error for time series forecasting.

PubMed

Chen, Chao; Twycross, Jamie; Garibaldi, Jonathan M

2017-01-01

Many accuracy measures have been proposed in the past for time series forecasting comparisons. However, many of these measures suffer from one or more issues such as poor resistance to outliers and scale dependence. In this paper, while summarising commonly used accuracy measures, a special review is made on the symmetric mean absolute percentage error. Moreover, a new accuracy measure called the Unscaled Mean Bounded Relative Absolute Error (UMBRAE), which combines the best features of various alternative measures, is proposed to address the common issues of existing measures. A comparative evaluation on the proposed and related measures has been made with both synthetic and real-world data. The results indicate that the proposed measure, with user selectable benchmark, performs as well as or better than other measures on selected criteria. Though it has been commonly accepted that there is no single best accuracy measure, we suggest that UMBRAE could be a good choice to evaluate forecasting methods, especially for cases where measures based on geometric mean of relative errors, such as the geometric mean relative absolute error, are preferred.
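The abstract names UMBRAE without giving its formula. The following is a minimal sketch based on the definition as it is commonly stated for this measure: a bounded relative absolute error |e| / (|e| + |e*|) per time point, where e* is the benchmark forecast's error, averaged to MBRAE and then unscaled as MBRAE / (1 - MBRAE). Treat the formula, the function name, and the handling of the 0/0 case as assumptions to be checked against the paper.

```python
import math


def umbrae(actual, forecast, benchmark):
    """Unscaled Mean Bounded Relative Absolute Error (assumed definition)."""
    if not (len(actual) == len(forecast) == len(benchmark)) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    bounded = []
    for y, f, b in zip(actual, forecast, benchmark):
        err = abs(y - f)        # error of the method under evaluation
        err_star = abs(y - b)   # error of the benchmark method
        if err + err_star == 0:
            # Assumption: both forecasts exact, so count the point as a tie.
            bounded.append(0.5)
        else:
            bounded.append(err / (err + err_star))
    mbrae = sum(bounded) / len(bounded)
    if mbrae == 1.0:
        return math.inf  # benchmark exact everywhere, method never exact
    return mbrae / (1.0 - mbrae)


# Hypothetical series: values below 1 favour the forecast over the benchmark.
score = umbrae([10, 12, 14], [11, 12, 13], [9, 10, 12])
```

Under this definition a score of 1 means the method and the benchmark perform equally, and a score below 1 means the method is better; the example above evaluates to 5/13, about 0.385.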

Exploring the Effects of Rater Linking Designs and Rater Fit on Achievement Estimates within the Context of Music Performance Assessments

ERIC Educational Resources Information Center

Wind, Stefanie A.; Engelhard, George, Jr.; Wesolowski, Brian

2016-01-01

When good model-data fit is observed, the Many-Facet Rasch (MFR) model acts as a linking and equating model that can be used to estimate student achievement, item difficulties, and rater severity on the same linear continuum. Given sufficient connectivity among the facets, the MFR model provides estimates of student achievement that are equated to…
Investigation on coupling error characteristics in angular rate matching based ship deformation measurement approach

NASA Astrophysics Data System (ADS)

Yang, Shuai; Wu, Wei; Wang, Xingshu; Xu, Zhiguang

2018-01-01

The coupling error in the measurement of ship hull deformation can significantly influence the attitude accuracy of shipborne weapons and equipment. It is therefore important to study the characteristics of the coupling error. In this paper, a comprehensive investigation of the coupling error is reported, which has the potential to support deducting the coupling error in the future. Firstly, the causes and characteristics of the coupling error are analyzed theoretically based on the basic theory of measuring ship deformation. Then, simulations are conducted to verify the correctness of the theoretical analysis. Simulation results show that the cross-correlation between dynamic flexure and ship angular motion leads to the coupling error in measuring ship deformation, and that the coupling error increases with the correlation value between them. All the simulation results coincide with the theoretical analysis.
Validity and reliability of a new ankle dorsiflexion measurement device.

PubMed

Gatt, Alfred; Chockalingam, Nachiappan

2013-08-01

The assessment of the maximum ankle dorsiflexion angle is an important clinical examination procedure. Evidence shows that the traditional goniometer is highly unreliable, and various designs of goniometers to measure the maximum ankle dorsiflexion angle rely on the application of a known force to obtain reliable results. Hence, an innovative ankle dorsiflexion measurement device was designed to make this measurement more reliable by holding the foot in a selected posture without the application of a known moment. To report on the comprehensive validity and reliability testing carried out on the new device. Following validity testing, four different trials to test reliability of the ankle dorsiflexion measurement device were performed. These trials included inter-rater and intra-rater testings with a controlled moment, intra-rater reliability testing with knees flexed and extended without a controlled moment, intra-rater testing with a patient population, and inter-rater reliability testing between four raters of varying experience without controlling moment. All raters were blinded. A series of trials to test intra-rater and inter-rater reliabilities. Intra-rater reliability intraclass correlation coefficient was 0.98 and inter-rater reliability intraclass correlation coefficient (2,1) was 0.953 with a controlled moment. With uncontrolled moment, very high reliability for intra-tester was also achieved (intraclass correlation coefficient = 0.94 with knees extended and intraclass correlation coefficient = 0.95 with knees flexed). For the trial investigating test-retest reliability with actual patients, intraclass correlation coefficient of 0.99 was obtained. In the trial investigating four different raters with uncontrolled moment, intraclass correlation coefficient of 0.91 was achieved. The new ankle dorsiflexion measurement device is a valid and reliable device for measuring ankle dorsiflexion in both healthy subjects and patients, with both controlled and
The Effect of Raters and Rating Conditions on the Reliability of the Missionary Teaching Assessment

ERIC Educational Resources Information Center

Ure, Abigail C.

2011-01-01

This study investigated how 2 different rating conditions, the controlled rating condition (CRC) and the uncontrolled rating condition (URC), affected rater behavior and the reliability of a performance assessment (PA) known as the Missionary Teaching Assessment (MTA). The CRC gives raters the capability to manipulate (pause, rewind, fast-forward)…
Reliability and validity of an iPhone(®) application for the measurement of lumbar spine flexion and extension range of motion.

PubMed

Pourahmadi, Mohammad Reza; Taghipour, Morteza; Jannati, Elham; Mohseni-Bandpei, Mohammad Ali; Ebrahimi Takamjani, Ismail; Rajabzadeh, Fatemeh

2016-01-01

used to establish concurrent validity of the iPhone(®) app. Furthermore, minimum detectable change at the 95% confidence level (MDC95) was computed as 1.96 × standard error of measurement × [Formula: see text]. Good to excellent intra-rater and inter-rater reliability were demonstrated for both the gravity-based inclinometer, with ICC values of ≥0.84 and ≥0.77, and the iPhone(®) app, with ICC values of ≥0.85 and ≥0.85, respectively. The MDC95 ranged from 5.82° to 8.18° for the intra-rater analysis and from 7.38° to 8.66° for the inter-rater analysis. The concurrent validity for flexion and extension between the 2 instruments was 0.85 and 0.91, respectively. The iPhone(®) app possesses good to excellent intra-rater and inter-rater reliability and concurrent validity. It seems that the iPhone(®) app can be used for the measurement of lumbar spine flexion-extension ROM. IIb.
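The MDC95 calculation described in this abstract can be sketched as follows. The factor elided in the record ("[Formula: see text]") is assumed here to be the conventional √2, and the standard error of measurement is assumed to be derived as SD × √(1 − ICC); neither assumption is confirmed by the abstract. The function names and the input values are hypothetical and are not taken from the study.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement, assuming SEM = SD * sqrt(1 - ICC)."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie in [0, 1]")
    return sd * math.sqrt(1.0 - icc)


def mdc(sem: float, z: float = 1.96) -> float:
    """Minimum detectable change, assuming MDC = z * SEM * sqrt(2).

    z = 1.96 gives MDC95; z = 1.645 would give MDC90.
    """
    return z * sem * math.sqrt(2.0)


# Hypothetical inputs: between-subject SD of 8 degrees, ICC of 0.85.
sem = sem_from_icc(sd=8.0, icc=0.85)
mdc95 = mdc(sem)
print(f"SEM = {sem:.2f} deg, MDC95 = {mdc95:.2f} deg")
```

Under these assumptions, a change smaller than the computed MDC95 cannot be distinguished from measurement error at the 95% confidence level.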
Measurement error in environmental epidemiology and the shape of exposure-response curves.

PubMed

Rhomberg, Lorenz R; Chandalia, Juhi K; Long, Christopher M; Goodman, Julie E

2011-09-01

Both classical and Berkson exposure measurement errors as encountered in environmental epidemiology data can result in biases in fitted exposure-response relationships that are large enough to affect the interpretation and use of the apparent exposure-response shapes in risk assessment applications. A variety of sources of potential measurement error exist in the process of estimating individual exposures to environmental contaminants, and the authors review the evaluation in the literature of the magnitudes and patterns of exposure measurement errors that prevail in actual practice. It is well known among statisticians that random errors in the values of independent variables (such as exposure in exposure-response curves) may tend to bias regression results. For increasing curves, this effect tends to flatten and apparently linearize what is in truth a steeper and perhaps more curvilinear or even threshold-bearing relationship. The degree of bias is tied to the magnitude of the measurement error in the independent variables. It has been shown that the degree of bias known to apply to actual studies is sufficient to produce a false linear result, and that although nonparametric smoothing and other error-mitigating techniques may assist in identifying a threshold, they do not guarantee detection of a threshold. The consequences of this could be great, as it could lead to a misallocation of resources towards regulations that do not offer any benefit to public health.
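The flattening effect of classical measurement error on a fitted slope can be illustrated with a small simulation. This is a minimal sketch of the simplest case only (a linear relationship with additive, independent Gaussian error in the exposure), not the analysis reviewed by the authors; all parameter values and function names are hypothetical. In this setting the fitted slope is expected to shrink toward zero by roughly the ratio var(true exposure) / (var(true exposure) + var(error)).

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(beta=2.0, sd_true=1.0, sd_error=1.0, n=20000, seed=0):
    """Return (slope using true exposure, slope using error-prone exposure)."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, sd_true) for _ in range(n)]
    # The response depends on the TRUE exposure plus outcome noise.
    y = [beta * x + rng.gauss(0.0, 0.5) for x in x_true]
    # Classical error: observed exposure = true exposure + independent noise.
    x_obs = [x + rng.gauss(0.0, sd_error) for x in x_true]
    return ols_slope(x_true, y), ols_slope(x_obs, y)


slope_true, slope_obs = simulate()
attenuation = 1.0 ** 2 / (1.0 ** 2 + 1.0 ** 2)  # expected shrinkage factor
print(slope_true, slope_obs, 2.0 * attenuation)
```

With equal variances for the true exposure and the error, the slope estimated from the error-prone exposure should come out near half the true slope. Berkson error and nonlinear or threshold-shaped curves behave differently and are not covered by this sketch.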
Heisenberg's error-disturbance relations: A joint measurement-based experimental test

NASA Astrophysics Data System (ADS)

Zhao, Yuan-Yuan; Kurzyński, Paweł; Xiang, Guo-Yong; Li, Chuan-Feng; Guo, Guang-Can

2017-04-01

The original Heisenberg error-disturbance relation was recently shown to be not universally valid and two different approaches to reformulate it were proposed. The first one focuses on how the error and disturbance of two observables A and B depend on a particular quantum state. The second one asks how a joint measurement of A and B affects their eigenstates. Previous experiments focused on the first approach. Here we focus on the second one. First, we propose and implement an extendible method of quantum-walk-based joint measurements of noisy Pauli operators to test the error-disturbance relation for qubits introduced in the work of Busch et al. [Phys. Rev. A 89, 012129 (2014), 10.1103/PhysRevA.89.012129], where the polarization of the single photon, corresponding to a walker's auxiliary degree of freedom that is commonly known as a coin, undergoes a position- and time-dependent evolution. Then we formulate and experimentally test a universally valid state-dependent relation for three mutually unbiased observables. We therefore establish a method of testing error-disturbance relations.
Errors in causal inference: an organizational schema for systematic error and random error.

PubMed

Suzuki, Etsuji; Tsuda, Toshihide; Mitsuhashi, Toshiharu; Mansournia, Mohammad Ali; Yamamoto, Eiji

2016-11-01

To provide an organizational schema for systematic error and random error in estimating causal measures, aimed at clarifying the concept of errors from the perspective of causal inference. We propose to divide systematic error into structural error and analytic error. With regard to random error, our schema shows its four major sources: nondeterministic counterfactuals, sampling variability, a mechanism that generates exposure events and measurement variability. Structural error is defined from the perspective of counterfactual reasoning and divided into nonexchangeability bias (which comprises confounding bias and selection bias) and measurement bias. Directed acyclic graphs are useful to illustrate this kind of error. Nonexchangeability bias implies a lack of "exchangeability" between the selected exposed and unexposed groups. A lack of exchangeability is not a primary concern of measurement bias, justifying its separation from confounding bias and selection bias. Many forms of analytic errors result from the small-sample properties of the estimator used and vanish asymptotically. Analytic error also results from wrong (misspecified) statistical models and inappropriate statistical methods. Our organizational schema is helpful for understanding the relationship between systematic error and random error from a previously less investigated aspect, enabling us to better understand the relationship between accuracy, validity, and precision. Copyright © 2016 Elsevier Inc. All rights reserved.
A Unified Approach to Measurement Error and Missing Data: Overview and Applications

ERIC Educational Resources Information Center

Blackwell, Matthew; Honaker, James; King, Gary

2017-01-01

Although social scientists devote considerable effort to mitigating measurement error during data collection, they often ignore the issue during data analysis. And although many statistical methods have been proposed for reducing measurement error-induced biases, few have been widely used because of implausible assumptions, high levels of model…
Comparing Graphical and Verbal Representations of Measurement Error in Test Score Reports

ERIC Educational Resources Information Center

Zwick, Rebecca; Zapata-Rivera, Diego; Hegarty, Mary

2014-01-01

Research has shown that many educators do not understand the terminology or displays used in test score reports and that measurement error is a particularly challenging concept. We investigated graphical and verbal methods of representing measurement error associated with individual student scores. We created four alternative score reports, each…
Comparison of "E-Rater"[R] Automated Essay Scoring Model Calibration Methods Based on Distributional Targets

ERIC Educational Resources Information Center

Zhang, Mo; Williamson, David M.; Breyer, F. Jay; Trapani, Catherine

2012-01-01

This article describes two separate, related studies that provide insight into the effectiveness of "e-rater" score calibration methods based on different distributional targets. In the first study, we developed and evaluated a new type of "e-rater" scoring model that was cost-effective and applicable under conditions of absent human rating and…
Examining the interrater reliability of the Hare Psychopathy Checklist-Revised across a large sample of trained raters.

PubMed

Blais, Julie; Forth, Adelle E; Hare, Robert D

2017-06-01

The goal of the current study was to assess the interrater reliability of the Psychopathy Checklist-Revised (PCL-R) among a large sample of trained raters (N = 280). All raters completed PCL-R training at some point between 1989 and 2012 and subsequently provided complete coding for the same 6 practice cases. Overall, 3 major conclusions can be drawn from the results: (a) reliability of individual PCL-R items largely fell below any appropriate standards while the estimates for Total PCL-R scores and factor scores were good (but not excellent); (b) the cases representing individuals with high psychopathy scores showed better reliability than did the cases of individuals in the moderate to low PCL-R score range; and (c) there was a high degree of variability among raters; however, rater specific differences had no consistent effect on scoring the PCL-R. Therefore, despite low reliability estimates for individual items, Total scores and factor scores can be reliably scored among trained raters. We temper these conclusions by noting that scoring standardized videotaped case studies does not allow the rater to interact directly with the offender. Real-world PCL-R assessments typically involve a face-to-face interview and much more extensive collateral information. We offer recommendations for new web-based training procedures. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Theoretical and experimental errors for in situ measurements of plant water potential.

PubMed

Shackel, K A

1984-07-01

Errors in psychrometrically determined values of leaf water potential caused by tissue resistance to water vapor exchange and by lack of thermal equilibrium were evaluated using commercial in situ psychrometers (Wescor Inc., Logan, UT) on leaves of Tradescantia virginiana (L.). Theoretical errors in the dewpoint method of operation for these sensors were demonstrated. After correction for these errors, in situ measurements of leaf water potential indicated substantial errors caused by tissue resistance to water vapor exchange (4 to 6% reduction in apparent water potential per second of cooling time used) resulting from humidity depletions in the psychrometer chamber during the Peltier condensation process. These errors were avoided by use of a modified procedure for dewpoint measurement. Large changes in apparent water potential were caused by leaf and psychrometer exposure to moderate levels of irradiance. These changes were correlated with relatively small shifts in psychrometer zero offsets (-0.6 to -1.0 megapascals per microvolt), indicating substantial errors caused by nonisothermal conditions between the leaf and the psychrometer. Explicit correction for these errors is not possible with the current psychrometer design.
Compensation for positioning error of industrial robot for flexible vision measuring system

NASA Astrophysics Data System (ADS)

Guo, Lei; Liang, Yajun; Song, Jincheng; Sun, Zengyu; Zhu, Jigui

2013-01-01

The positioning error of the robot is a main factor limiting the accuracy of a flexible coordinate measuring system that consists of a universal industrial robot and a visual sensor. Present compensation methods for positioning error, which are based on a kinematic model of the robot, have a significant limitation: they are not effective across the whole measuring space. A new compensation method for robot positioning error based on vision measuring techniques is presented. One approach is to set global control points in the measured field and attach an orientation camera to the vision sensor. The global control points are then measured by the orientation camera to calculate the transformation from the current position of the sensor system to the global coordinate system, and the positioning error of the robot is compensated. Another approach is to set control points on the vision sensor and place two large-field cameras behind the sensor. The three-dimensional coordinates of the control points are then measured, and the pose and position of the sensor are calculated in real time. Experimental results show that the RMS of spatial positioning is 3.422 mm with a single camera and 0.031 mm with dual cameras. The conclusion is that the algorithm of the single-camera method needs to be improved for higher accuracy, whereas the accuracy of the dual-camera method is adequate for application.
Error in total ozone measurements arising from aerosol attenuation

NASA Technical Reports Server (NTRS)

Thomas, R. W. L.; Basher, R. E.

1979-01-01

A generalized least squares method for deducing both total ozone and aerosol extinction spectrum parameters from Dobson spectrophotometer measurements was developed. An error analysis applied to this system indicates that there is little advantage to additional measurements once a sufficient number of line pairs have been employed to solve for the selected detail in the attenuation model. It is shown that when there is a predominance of small particles (less than about 0.35 microns in diameter) the total ozone from the standard AD system is too high by about one percent. When larger particles are present the derived total ozone may be an overestimate or an underestimate but serious errors occur only for narrow polydispersions.
Measurements of the toroidal torque balance of error field penetration locked modes

DOE PAGES

Shiraki, Daisuke; Paz-Soldan, Carlos; Hanson, Jeremy M.; ...

2015-01-05

Here, detailed measurements from the DIII-D tokamak of the toroidal dynamics of error field penetration locked modes under the influence of slowly evolving external fields enable study of the toroidal torques on the mode, including interaction with the intrinsic error field. The error field in these low density Ohmic discharges is well known based on the mode penetration threshold, allowing resonant and non-resonant torque effects to be distinguished. These m/n = 2/1 locked modes are found to be well described by a toroidal torque balance between the resonant interaction with n = 1 error fields and a viscous torque in the electron diamagnetic drift direction, which is observed to scale as the square of the perturbed field due to the island. Fitting to this empirical torque balance allows a time-resolved measurement of the intrinsic error field of the device, providing evidence for a time-dependent error field in DIII-D due to ramping of the Ohmic coil current.
Measurement Model Specification Error in LISREL Structural Equation Models.

ERIC Educational Resources Information Center

Baldwin, Beatrice; Lomax, Richard

This LISREL study examines the robustness of the maximum likelihood estimates under varying degrees of measurement model misspecification. A true model containing five latent variables (two endogenous and three exogenous) and two indicator variables per latent variable was used. Measurement model misspecification considered included errors of…
Conditional Standard Errors of Measurement for Composite Scores Using IRT

ERIC Educational Resources Information Center

Kolen, Michael J.; Wang, Tianyou; Lee, Won-Chan

2012-01-01

Composite scores are often formed from test scores on educational achievement test batteries to provide a single index of achievement over two or more content areas or two or more item types on that test. Composite scores are subject to measurement error, and as with scores on individual tests, the amount of error variability typically depends on…
SU-E-T-511: Inter-Rater Variability in Classification of Incidents in a New Incident Reporting System

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pappas, D; Reis, S; Ali, A

Purpose: To determine how consistent the results of different raters are when reviewing the same cases within the Radiation Oncology Incident Learning System (ROILS). Methods: Three second-year medical physics graduate students filled out incident reports in spreadsheets set up to mimic ROILS. All students studied the same 33 cases and independently entered their assessments, for a total of 99 reviewed cases. The narratives for these cases were obtained from a published International Commission on Radiological Protection (ICRP) report which included shorter narratives selected from the Radiation Oncology Safety Information System (ROSIS) database. Each category of questions was reviewed to see how consistent the results were by utilizing free-marginal multirater kappa analysis. The percentage of cases where all raters shared full agreement or full disagreement was recorded to show which questions were answered consistently by multiple raters for a given case. The consistency among the raters was analyzed between ICRP and ROSIS cases to see if either group led to more reliable results. Results: The categories where all raters agreed 100 percent in their choices were the event type (93.94 percent of cases, 0.946 kappa) and the likelihood of the event being harmful to the patient (42.42 percent of cases, 0.409 kappa). The categories where all raters disagreed 100 percent in their choices were the dosimetric severity scale (39.39 percent of cases, 0.139 kappa) and the potential future toxicity (48.48 percent of cases, 0.205 kappa). ROSIS had more cases where all raters disagreed than ICRP (23.06 percent of cases compared to 15.58 percent, respectively). Conclusion: Despite reviewing the same cases, the results among the three raters varied widely. ROSIS narratives were shorter than ICRP narratives, which suggests that longer narratives lead to more consistent results. This study shows that the incident reporting system can be optimized to yield more consistent results.
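The free-marginal multirater kappa named in this abstract can be sketched as below. The sketch assumes the usual definition, in which chance agreement is fixed at 1/k for k categories and observed agreement is the proportion of agreeing rater pairs averaged over cases. The function name and the example ratings are hypothetical and are not data from the study.

```python
from collections import Counter


def free_marginal_kappa(ratings, n_categories):
    """Free-marginal multirater kappa.

    ratings: list of cases, each a list with one category label per rater.
    n_categories: number of categories raters could choose from (k).
    """
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(case) != n_raters for case in ratings):
        raise ValueError("every case needs the same number (>= 2) of ratings")
    if n_categories < 2:
        raise ValueError("need at least two categories")

    # Count agreeing ordered rater pairs within each case.
    agreeing_pairs = 0
    for case in ratings:
        counts = Counter(case)
        agreeing_pairs += sum(c * (c - 1) for c in counts.values())

    p_observed = agreeing_pairs / (len(ratings) * n_raters * (n_raters - 1))
    p_chance = 1.0 / n_categories  # free marginals: uniform chance agreement
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical example: 3 raters, 4 cases, 3 possible categories.
cases = [
    ["A", "A", "A"],  # full agreement
    ["A", "B", "A"],  # partial agreement
    ["C", "C", "C"],  # full agreement
    ["A", "B", "C"],  # full disagreement
]
kappa = free_marginal_kappa(cases, n_categories=3)
print(round(kappa, 3))
```

Because chance agreement depends only on the number of categories, this statistic does not require raters to assign a fixed number of cases to each category, which suits a free-choice classification form like the one described.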
A Unified Approach to Measurement Error and Missing Data: Details and Extensions

ERIC Educational Resources Information Center

Blackwell, Matthew; Honaker, James; King, Gary

2017-01-01

We extend a unified and easy-to-use approach to measurement error and missing data. In our companion article, Blackwell, Honaker, and King give an intuitive overview of the new technique, along with practical suggestions and empirical applications. Here, we offer more precise technical details, more sophisticated measurement error model…

#2 - An Empirical Assessment of Exposure Measurement Error ...

EPA Pesticide Factsheets

Background • Differing degrees of exposure error across pollutants • Previous focus on quantifying and accounting for exposure error in single-pollutant models • Examine exposure errors for multiple pollutants and provide insights on the potential for bias and attenuation of effect estimates in single and bipollutant epidemiological models. The National Exposure Research Laboratory (NERL) Human Exposure and Atmospheric Sciences Division (HEASD) conducts research in support of the EPA mission to protect human health and the environment. The HEASD research program supports Goal 1 (Clean Air) and Goal 4 (Healthy People) of the EPA strategic plan. More specifically, our division conducts research to characterize the movement of pollutants from the source to contact with humans. Our multidisciplinary research program produces Methods, Measurements, and Models to identify relationships between and characterize processes that link source emissions, environmental concentrations, human exposures, and target-tissue dose. The impact of these tools is improved regulatory programs and policies for EPA.
Validation and Error Characterization for the Global Precipitation Measurement

NASA Technical Reports Server (NTRS)

Bidwell, Steven W.; Adams, W. J.; Everett, D. F.; Smith, E. A.; Yuter, S. E.

2003-01-01

The Global Precipitation Measurement (GPM) is an international effort to increase scientific knowledge on the global water cycle with specific goals of improving the understanding and the predictions of climate, weather, and hydrology. These goals will be achieved through several satellites specifically dedicated to GPM along with the integration of numerous meteorological satellite data streams from international and domestic partners. The GPM effort is led by the National Aeronautics and Space Administration (NASA) of the United States and the National Space Development Agency (NASDA) of Japan. In addition to the spaceborne assets, international and domestic partners will provide ground-based resources for validating the satellite observations and retrievals. This paper describes the validation effort of Global Precipitation Measurement to provide quantitative estimates on the errors of the GPM satellite retrievals. The GPM validation approach will build upon the research experience of the Tropical Rainfall Measuring Mission (TRMM) retrieval comparisons and its validation program. The GPM ground validation program will employ instrumentation, physical infrastructure, and research capabilities at Supersites located in important meteorological regimes of the globe. NASA will provide two Supersites, one in a tropical oceanic and the other in a mid-latitude continental regime. GPM international partners will provide Supersites for other important regimes. Those objectives or regimes not addressed by Supersites will be covered through focused field experiments. This paper describes the specific errors that GPM ground validation will address, quantify, and relate to the GPM satellite physical retrievals. GPM will attempt to identify the source of errors within retrievals including those of instrument calibration, retrieval physical assumptions, and algorithm applicability. With the identification of error sources, improvements will be made to the respective calibration
The Reliability of Anthropometric Measurements Used Preoperatively in Aesthetic Breast Surgery.

PubMed

Isaac, Kathryn V; Murphy, Blake D; Beber, Brett; Brown, Mitchell

2016-04-01

Patient outcomes in aesthetic breast surgery are highly dependent on breast measurements used in preoperative planning. The purpose of this study is to determine the reliability of anthropometric breast measurements. Four raters measured 28 women using 7 measurements: sternal notch to nipple distance (Sn-N), nipple to midline (N-M), nipple to inframammary-fold distance under maximal stretch (N-IMF), breast base width (BW), soft tissue pinch thickness of the upper pole (STPT:UP), STPT at the inframammary fold (STPT:IMF), and anterior pull skin stretch (APSS). Reliability was assessed using intra-class correlation coefficients (ICCs). Inter-rater reliability was excellent for Sn-N, N-M, and BW (ICC = 0.94, 0.90, and 0.76, respectively) and was good for N-IMF (ICC = 0.70). The STPT:UP, STPT:IMF, and APSS measurements were not reliable between raters (ICC < 0.2). Intra-rater reliability was excellent for Sn-N, N-M, and BW for all raters (all ICC > 0.75). The N-IMF intra-rater reliability was excellent in senior raters (ICC > 0.75) and good in junior raters (ICC > 0.6). The STPT:UP, STPT:IMF, and APSS measurements showed fair or poor reliability for most raters (ICC < 0.6). The Sn-N, N-M, and BW measurements are very reliable. Dynamic measurements including APSS, STPT:UP, and STPT:IMF are unreliable. N-IMF is the only reliable dynamic measurement, and its reliability improves with increasing clinical experience. The variable reliability of preoperative measurements must be considered in the planning of aesthetic breast surgery. 4 Diagnostic. © 2015 The American Society for Aesthetic Plastic Surgery, Inc. Reprints and permission: journals.permissions@oup.com.
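An inter-rater ICC of the kind reported here can be computed from a subjects-by-raters table of scores. The abstract does not state which ICC form was used, so the sketch below assumes the two-way random-effects, absolute-agreement, single-rater form, commonly written ICC(2,1). The function name and the ratings are illustrative only and are not data from the study.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: list of rows, one row per subject, one column per rater.
    """
    n = len(scores)        # number of subjects
    k = len(scores[0])     # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >= 2 subjects and raters")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares (no replication).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-subject mean square
    ms_cols = ss_cols / (k - 1)                 # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative table: 6 subjects (rows) scored by 4 raters (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_value = icc_2_1(ratings)
print(round(icc_value, 2))
```

The absolute-agreement form penalizes systematic offsets between raters, so raters who rank subjects identically but differ in level can still yield a low value, as in this illustrative table.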
Conditional Standard Errors of Measurement for Scale Scores.

ERIC Educational Resources Information Center

Kolen, Michael J.; And Others

1992-01-01

A procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores incorporating the discrete transformation of raw scores to scale scores. The method is illustrated using a strong true score model, and practical applications are described. (SLD)
Probing-error compensation using 5 degree of freedom force/moment sensor for coordinate measuring machine

NASA Astrophysics Data System (ADS)

Lee, Minho; Cho, Nahm-Gyoo

2013-09-01

A new probing and compensation method is proposed to improve the three-dimensional (3D) measuring accuracy of 3D shapes, including irregular surfaces. A new tactile coordinate measuring machine (CMM) probe with a five-degree of freedom (5-DOF) force/moment sensor using carbon fiber plates was developed. The proposed method efficiently removes the anisotropic sensitivity error and decreases the stylus deformation and the actual contact point estimation errors that are major error components of shape measurement using touch probes. The relationship between the measuring force and estimation accuracy of the actual contact point error and stylus deformation error are examined for practical use of the proposed method. The appropriate measuring force condition is presented for the precision measurement.
Evaluation of TRMM Ground-Validation Radar-Rain Errors Using Rain Gauge Measurements

NASA Technical Reports Server (NTRS)

Wang, Jianxin; Wolff, David B.

2009-01-01

Ground-validation (GV) radar-rain products are often utilized for validation of the Tropical Rainfall Measuring Mission (TRMM) space-based rain estimates, and hence, quantitative evaluation of the GV radar-rain product error characteristics is vital. This study uses quality-controlled gauge data to compare with TRMM GV radar rain rates in an effort to provide such error characteristics. The results show that significant differences of concurrent radar-gauge rain rates exist at various time scales ranging from 5 min to 1 day, despite lower overall long-term bias. However, the differences between the radar area-averaged rain rates and gauge point rain rates cannot be explained as due to radar error only. The error variance separation method is adapted to partition the variance of radar-gauge differences into the gauge area-point error variance and radar rain estimation error variance. The results provide relatively reliable quantitative uncertainty evaluation of TRMM GV radar rain estimates at various time scales, and are helpful to better understand the differences between measured radar and gauge rain rates. It is envisaged that this study will contribute to better utilization of GV radar rain products to validate versatile space-based rain estimates from TRMM, as well as the proposed Global Precipitation Measurement, and other satellites.
Can Physicians Identify Inappropriate Nuclear Stress Tests? An Examination of Inter-rater Reliability for the 2009 Appropriate Use Criteria for Radionuclide Imaging

PubMed Central

Ye, Siqin; Rabbani, LeRoy E.; Kelly, Christopher R.; Kelly, Maureen R.; Lewis, Matthew; Paz, Yehuda; Peck, Clara L.; Rao, Shaline; Bokhari, Sabahat; Weiner, Shepard D.; Einstein, Andrew J.

2014-01-01

Background: We sought to determine inter-rater reliability of the 2009 Appropriate Use Criteria (AUC) for radionuclide imaging (RNI) and whether physicians at various levels of training can effectively identify nuclear stress tests with inappropriate indications. Methods and Results: Four hundred patients were randomly selected from a consecutive cohort of patients undergoing nuclear stress testing at an academic medical center. Raters with different levels of training (including cardiology attending physicians, cardiology fellows, internal medicine hospitalists, and internal medicine interns) classified individual nuclear stress tests using the 2009 AUC. Consensus classification by two cardiologists was considered the operational gold standard, and sensitivity and specificity of individual raters for identifying inappropriate tests were calculated. Inter-rater reliability of the AUC was assessed using Cohen’s kappa statistics for pairs of different raters. The mean age of patients was 61.5 years; 214 (54%) were female. The cardiologists rated 256 (64%) of 400 NSTs as appropriate, 68 (18%) as uncertain, 55 (14%) as inappropriate; 21 (5%) tests were unable to be classified. Inter-rater reliability for non-cardiologist raters was modest (unweighted Cohen’s kappa, 0.51, 95% confidence interval, 0.45 to 0.55). Sensitivity of individual raters for identifying inappropriate tests ranged from 47% to 82%, while specificity ranged from 85% to 97%. Conclusions: Inter-rater reliability for the 2009 AUC for RNI is modest, and there is considerable variation in the ability of raters at different levels of training to identify inappropriate tests. PMID:25563660
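The unweighted Cohen's kappa used here for pairs of raters can be sketched as follows. It compares the observed proportion of matching classifications with the agreement expected by chance from each rater's own category frequencies. The function name, category labels, and example classifications are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters classifying the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must classify the same non-empty item set")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical classifications of six tests by two raters.
rater_1 = ["appropriate", "appropriate", "inappropriate",
           "uncertain", "appropriate", "inappropriate"]
rater_2 = ["appropriate", "uncertain", "inappropriate",
           "uncertain", "appropriate", "appropriate"]
kappa_value = cohens_kappa(rater_1, rater_2)
print(round(kappa_value, 3))
```

Unlike the free-marginal variant, this statistic adjusts for each rater's tendency to favor particular categories, so two raters who both label most tests "appropriate" receive less credit for agreeing on that label.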
A Model of Self-Monitoring Blood Glucose Measurement Error.

PubMed

Vettoretti, Martina; Facchinetti, Andrea; Sparacino, Giovanni; Cobelli, Claudio

2017-07-01

A reliable model of the probability density function (PDF) of self-monitoring of blood glucose (SMBG) measurement error would be important for several applications in diabetes, like testing in silico insulin therapies. In the literature, the PDF of SMBG error is usually described by a Gaussian function, whose symmetry and simplicity are unable to properly describe the variability of experimental data. Here, we propose a new methodology to derive more realistic models of SMBG error PDF. The blood glucose range is divided into zones where error (absolute or relative) presents a constant standard deviation (SD). In each zone, a suitable PDF model is fitted by maximum-likelihood to experimental data. Model validation is performed by goodness-of-fit tests. The method is tested on two databases collected by the One Touch Ultra 2 (OTU2; Lifescan Inc, Milpitas, CA) and the Bayer Contour Next USB (BCN; Bayer HealthCare LLC, Diabetes Care, Whippany, NJ). In both cases, skew-normal and exponential models are used to describe the distribution of errors and outliers, respectively. Two zones were identified: zone 1 with constant SD absolute error; zone 2 with constant SD relative error. Goodness-of-fit tests confirmed that identified PDF models are valid and superior to Gaussian models used so far in the literature. The proposed methodology allows realistic models of SMBG error PDF to be derived. These models can be used in several investigations of present interest in the scientific community, for example, to perform in silico clinical trials to compare SMBG-based with nonadjunctive CGM-based insulin treatments.
Covariate Measurement Error Correction Methods in Mediation Analysis with Failure Time Data

PubMed Central

Zhao, Shanshan

2014-01-01

Summary Mediation analysis is important for understanding the mechanisms whereby one variable causes changes in another. Measurement error could obscure the ability of the potential mediator to explain such changes. This paper focuses on developing correction methods for measurement error in the mediator with failure time outcomes. We consider a broad definition of measurement error, including technical error and error associated with temporal variation. The underlying model with the ‘true’ mediator is assumed to be of the Cox proportional hazards model form. The induced hazard ratio for the observed mediator no longer has a simple form independent of the baseline hazard function, due to the conditioning event. We propose a mean-variance regression calibration approach and a follow-up time regression calibration approach, to approximate the partial likelihood for the induced hazard function. Both methods demonstrate value in assessing mediation effects in simulation studies. These methods are generalized to multiple biomarkers and to both case-cohort and nested case-control sampling design. We apply these correction methods to the Women's Health Initiative hormone therapy trials to understand the mediation effect of several serum sex hormone measures on the relationship between postmenopausal hormone therapy and breast cancer risk. PMID:25139469
Covariate measurement error correction methods in mediation analysis with failure time data.

PubMed

Zhao, Shanshan; Prentice, Ross L

2014-12-01

Mediation analysis is important for understanding the mechanisms whereby one variable causes changes in another. Measurement error could obscure the ability of the potential mediator to explain such changes. This article focuses on developing correction methods for measurement error in the mediator with failure time outcomes. We consider a broad definition of measurement error, including technical error, and error associated with temporal variation. The underlying model with the "true" mediator is assumed to be of the Cox proportional hazards model form. The induced hazard ratio for the observed mediator no longer has a simple form independent of the baseline hazard function, due to the conditioning event. We propose a mean-variance regression calibration approach and a follow-up time regression calibration approach, to approximate the partial likelihood for the induced hazard function. Both methods demonstrate value in assessing mediation effects in simulation studies. These methods are generalized to multiple biomarkers and to both case-cohort and nested case-control sampling designs. We apply these correction methods to the Women's Health Initiative hormone therapy trials to understand the mediation effect of several serum sex hormone measures on the relationship between postmenopausal hormone therapy and breast cancer risk. © 2014, The International Biometric Society.
Economic measurement of medical errors using a hospital claims database.

PubMed

David, Guy; Gunnarsson, Candace L; Waters, Heidi C; Horblyuk, Ruslan; Kaplan, Harold S

2013-01-01

The primary objective of this study was to estimate the occurrence and costs of medical errors from the hospital perspective. Methods from a recent actuarial study of medical errors were used to identify medical injuries. A visit qualified as an injury visit if at least 1 of 97 injury groupings occurred at that visit, and the percentage of injuries caused by medical error was estimated. Visits with more than four injuries were removed from the population to avoid overestimation of cost. Population estimates were extrapolated from the Premier hospital database to all US acute care hospitals. There were an estimated 161,655 medical errors in 2008 and 170,201 medical errors in 2009. Extrapolated to the entire US population, there were more than 4 million unique injury visits containing more than 1 million unique medical errors each year. This analysis estimated that the total annual cost of measurable medical errors in the United States was $985 million in 2008 and just over $1 billion in 2009. The median cost per error to hospitals was $892 for 2008 and rose to $939 in 2009. Nearly one third of all medical injuries were due to error in each year. Medical errors directly impact patient outcomes and hospitals' profitability, especially since 2008 when Medicare stopped reimbursing hospitals for care related to certain preventable medical errors. Hospitals must rigorously analyze causes of medical errors and implement comprehensive preventative programs to reduce their occurrence as the financial burden of medical errors shifts to hospitals. Copyright © 2013 International Society for Pharmacoeconomics and Outcomes Research (ISPOR). Published by Elsevier Inc. All rights reserved.
Characterization of measurement errors using structure-from-motion and photogrammetry to measure marine habitat structural complexity.

PubMed

Bryson, Mitch; Ferrari, Renata; Figueira, Will; Pizarro, Oscar; Madin, Josh; Williams, Stefan; Byrne, Maria

2017-08-01

Habitat structural complexity is one of the most important factors in determining the makeup of biological communities. Recent advances in structure-from-motion and photogrammetry have resulted in a proliferation of 3D digital representations of habitats from which structural complexity can be measured. Little attention has been paid to quantifying the measurement errors associated with these techniques, including the variability of results under different surveying and environmental conditions. Such errors have the potential to confound studies that compare habitat complexity over space and time. This study evaluated the accuracy, precision, and bias in measurements of marine habitat structural complexity derived from structure-from-motion and photogrammetric measurements using repeated surveys of artificial reefs (with known structure) as well as natural coral reefs. We quantified measurement errors as a function of survey image coverage, actual surface rugosity, and the morphological community composition of the habitat-forming organisms (reef corals). Our results indicated that measurements could be biased by up to 7.5% of the total observed ranges of structural complexity based on the environmental conditions present during any particular survey. Positive relationships were found between measurement errors and actual complexity, and the strength of these relationships was increased when coral morphology and abundance were also used as predictors. The numerous advantages of structure-from-motion and photogrammetry techniques for quantifying and investigating marine habitats will mean that they are likely to replace traditional measurement techniques (e.g., chain-and-tape). To this end, our results have important implications for data collection and the interpretation of measurements when examining changes in habitat complexity using structure-from-motion and photogrammetry.
Exploring Measurement Error with Cookies: A Real and Virtual Approach via Interactive Excel

ERIC Educational Resources Information Center

Sinex, Scott A; Gage, Barbara A.; Beck, Peggy J.

2007-01-01

A simple, guided-inquiry investigation using stacked sandwich cookies is employed to develop a simple linear mathematical model and to explore measurement error by incorporating errors as part of the investigation. Both random and systematic errors are presented. The model and errors are then investigated further by engaging with an interactive…
Propagation of Radiosonde Pressure Sensor Errors to Ozonesonde Measurements

NASA Technical Reports Server (NTRS)

Stauffer, R. M.; Morris, G.A.; Thompson, A. M.; Joseph, E.; Coetzee, G. J. R.; Nalli, N. R.

2014-01-01

Several previous studies highlight pressure (or equivalently, pressure altitude) discrepancies between the radiosonde pressure sensor and that derived from a GPS flown with the radiosonde. The offsets vary during the ascent both in absolute and percent pressure differences. To investigate this problem further, a total of 731 radiosonde-ozonesonde launches from the Southern Hemisphere subtropics to Northern mid-latitudes are considered, with launches between 2005 and 2013 from both longer-term and campaign-based intensive stations. Five series of radiosondes from two manufacturers (International Met Systems: iMet, iMet-P, iMet-S, and Vaisala: RS80-15N and RS92-SGP) are analyzed to determine the magnitude of the pressure offset. Additionally, electrochemical concentration cell (ECC) ozonesondes from three manufacturers (Science Pump Corporation; SPC and ENSCI-Droplet Measurement Technologies; DMT) are analyzed to quantify the effects these offsets have on the calculation of ECC ozone (O3) mixing ratio profiles (O3MR) from the ozonesonde-measured partial pressure. Approximately half of all offsets are 0.6 hPa in the free troposphere, with nearly a third 1.0 hPa at 26 km, where the 1.0 hPa error represents 5 percent of the total atmospheric pressure. Pressure offsets have negligible effects on O3MR below 20 km (96 percent of launches lie within 5 percent O3MR error at 20 km). Ozone mixing ratio errors above 10 hPa (30 km) can approach greater than 10 percent (25 percent of launches that reach 30 km exceed this threshold). These errors cause disagreement between the integrated ozonesonde-only column O3 from the GPS and radiosonde pressure profile by an average of +6.5 DU. Comparisons of total column O3 between the GPS and radiosonde pressure profiles yield average differences of +1.1 DU when the O3 is integrated to burst with addition of the McPeters and Labow (2012) above-burst O3 column climatology. Total column differences are reduced to an average of -0.5 DU when
Intra- and inter-rater reliability of 3D passive intervertebral motion in subjects with nonspecific neck pain assessed by physical therapy students: A pilot study.

PubMed

Rossettini, Giacomo; Rondoni, Angie; Lovato, Tommaso; Strobe, Marco; Verzè, Elisa; Vicentini, Marco; Testa, Marco

2016-06-03

Passive Intervertebral Movements (PIVMs) are commonly used to assess and treat patients with nonspecific neck pain. Very few studies have investigated 3D movements to date. This study assessed intra- and inter-rater reliability of three-dimensional (3D) cervical PIVMs performed by physical therapy students in patients with nonspecific neck pain. Thirty-one patients, mean age 47.2 ± 7.2 years, were independently evaluated by 2 physical therapy students. The raters (A and B) assessed mobility, end-feel and pain provocation performing bilaterally the 3D cervical segmental side-bending test (3D CSSB) from levels C2-C3 to C6-C7. Percentage agreement (raw, positive and negative), Cohen's kappa (95% CI), prevalence index and bias index were calculated to estimate intra- and inter-rater reliability. Intra-rater reliability showed kappa values ranging between fair and substantial (k 0.29-0.80) for pain provocation, mobility and end-feel, with percentage agreements between 61% and 90%. Inter-rater reliability presented kappa values ranging between fair and substantial (k 0.22-0.62) for pain provocation, mobility and end-feel, with percentage agreements between 61% and 80%. Intra-rater reliability of 3D PIVMs was superior to inter-rater reliability in patients with nonspecific neck pain. The most repeatable evaluation parameter was pain. However, the overall poor reliability suggests avoiding the use of these techniques alone to examine patients and measure their outcomes. Further studies are needed to investigate PIVM reliability in combination with other assessment procedures in symptomatic patients.
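The agreement statistics named in this abstract can be illustrated with a short sketch. It assumes a single dichotomous finding (for example, pain provoked or not) cross-tabulated for two raters into a 2x2 table, and it uses the commonly cited definitions of Cohen's kappa, the prevalence index and the bias index. The function name and the cell counts are hypothetical and are not taken from the study.

```python
def agreement_stats(a, b, c, d):
    """Agreement statistics for two raters and a yes/no finding.

    a: both raters positive      b: rater A positive, rater B negative
    c: A negative, B positive    d: both raters negative
    """
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement computed from each rater's marginal proportions.
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return {
        "raw_agreement": p_observed,
        "positive_agreement": 2 * a / (2 * a + b + c),
        "negative_agreement": 2 * d / (2 * d + b + c),
        "kappa": (p_observed - p_expected) / (1 - p_expected),
        "prevalence_index": abs(a - d) / n,
        "bias_index": abs(b - c) / n,
    }


# Hypothetical counts for illustration only, not data from the study.
stats = agreement_stats(a=20, b=5, c=10, d=15)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the raters agree on 70% of patients while kappa is 0.4, which shows how chance correction can place kappa well below raw percentage agreement.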
Bayesian adjustment for measurement error in continuous exposures in an individually matched case-control study.

PubMed

Espino-Hernandez, Gabriela; Gustafson, Paul; Burstyn, Igor

2011-05-14

In epidemiological studies explanatory variables are frequently subject to measurement error. The aim of this paper is to develop a Bayesian method to correct for measurement error in multiple continuous exposures in individually matched case-control studies. This is a topic that has not been widely investigated. The new method is illustrated using data from an individually matched case-control study of the association between thyroid hormone levels during pregnancy and exposure to perfluorinated acids. The objective of the motivating study was to examine the risk of maternal hypothyroxinemia due to exposure to three perfluorinated acids measured on a continuous scale. Results from the proposed method are compared with those obtained from a naive analysis. Using a Bayesian approach, the developed method considers a classical measurement error model for the exposures, as well as the conditional logistic regression likelihood as the disease model, together with a random-effect exposure model. Proper and diffuse prior distributions are assigned, and results from a quality control experiment are used to estimate the perfluorinated acids' measurement error variability. As a result, posterior distributions and 95% credible intervals of the odds ratios are computed. A sensitivity analysis of method's performance in this particular application with different measurement error variability was performed. The proposed Bayesian method to correct for measurement error is feasible and can be implemented using statistical software. For the study on perfluorinated acids, a comparison of the inferences which are corrected for measurement error to those which ignore it indicates that little adjustment is manifested for the level of measurement error actually exhibited in the exposures. Nevertheless, a sensitivity analysis shows that more substantial adjustments arise if larger measurement errors are assumed. In individually matched case-control studies, the use of conditional
Managing Rater Effects through the Use of FACETS Analysis: The Case of a University Placement Test

ERIC Educational Resources Information Center

Wu, Siew Mei; Tan, Susan

2016-01-01

Rating essays is a complex task where students' grades could be adversely affected by test-irrelevant factors such as rater characteristics and rating scales. Understanding these factors and controlling their effects are crucial for test validity. Rater behaviour has been extensively studied through qualitative methods such as questionnaires and…
Analysis and improvement of gas turbine blade temperature measurement error

NASA Astrophysics Data System (ADS)

Gao, Shan; Wang, Lixin; Feng, Chi; Daniel, Ketui

2015-10-01

Gas turbine blade components are easily damaged; they also operate in harsh high-temperature, high-pressure environments over extended durations. Therefore, ensuring that the blade temperature remains within the design limits is very important. In this study, measurement errors in turbine blade temperatures were analyzed, taking into account detector lens contamination, the reflection of environmental energy from the target surface, the effects of the combustion gas, and the emissivity of the blade surface. In this paper, each of the above sources of measurement error is discussed, and an iterative computing method for calculating blade temperature is proposed.
Systematic study of error sources in supersonic skin-friction balance measurements

NASA Technical Reports Server (NTRS)

Allen, J. M.

1976-01-01

An experimental study was performed to investigate potential error sources in data obtained with a self-nulling, moment-measuring, skin-friction balance. The balance was installed in the sidewall of a supersonic wind tunnel, and independent measurements of the three forces contributing to the balance output (skin friction, lip force, and off-center normal force) were made for a range of gap size and element protrusion. The relatively good agreement between the balance data and the sum of these three independently measured forces validated the three-term model used. No advantage to a small gap size was found; in fact, the larger gaps were preferable. Perfect element alignment with the surrounding test surface resulted in very small balance errors. However, if small protrusion errors are unavoidable, no advantage was found in having the element slightly below the surrounding test surface rather than above it.
Correcting for Measurement Error in Time-Varying Covariates in Marginal Structural Models.

PubMed

Kyle, Ryan P; Moodie, Erica E M; Klein, Marina B; Abrahamowicz, Michał

2016-08-01

Unbiased estimation of causal parameters from marginal structural models (MSMs) requires a fundamental assumption of no unmeasured confounding. Unfortunately, the time-varying covariates used to obtain inverse probability weights are often error-prone. Although substantial measurement error in important confounders is known to undermine control of confounders in conventional unweighted regression models, this issue has received comparatively limited attention in the MSM literature. Here we propose a novel application of the simulation-extrapolation (SIMEX) procedure to address measurement error in time-varying covariates, and we compare 2 approaches. The direct approach to SIMEX-based correction targets outcome model parameters, while the indirect approach corrects the weights estimated using the exposure model. We assess the performance of the proposed methods in simulations under different clinically plausible assumptions. The simulations demonstrate that measurement errors in time-dependent covariates may induce substantial bias in MSM estimators of causal effects of time-varying exposures, and that both proposed SIMEX approaches yield practically unbiased estimates in scenarios featuring low-to-moderate degrees of error. We illustrate the proposed approach in a simple analysis of the relationship between sustained virological response and liver fibrosis progression among persons infected with hepatitis C virus, while accounting for measurement error in γ-glutamyltransferase, using data collected in the Canadian Co-infection Cohort Study from 2003 to 2014. © The Author 2016. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
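The idea behind simulation-extrapolation can be sketched in a much simpler setting than the one studied here. The example below is not the authors' marginal structural model procedure: it assumes an ordinary linear regression with a single covariate, classical additive error of known standard deviation, and a quadratic extrapolant through three noise levels. All names and parameter values are hypothetical.

```python
import random
import statistics


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def simex_slope(w, y, sigma_u, rng, n_rep=20):
    """SIMEX estimate of the slope, assuming sigma_u is known."""

    def mean_slope(lam):
        # Simulation step: add extra error with variance lam * sigma_u**2,
        # refit the naive model, and average over n_rep replicates.
        if lam == 0:
            return ols_slope(w, y)
        extra_sd = (lam ** 0.5) * sigma_u
        slopes = []
        for _ in range(n_rep):
            noisier = [wi + rng.gauss(0.0, extra_sd) for wi in w]
            slopes.append(ols_slope(noisier, y))
        return statistics.fmean(slopes)

    f0, f1, f2 = mean_slope(0), mean_slope(1), mean_slope(2)
    # Extrapolation step: the quadratic through lam = 0, 1, 2,
    # evaluated at lam = -1 (the hypothetical error-free case).
    return 3 * f0 - 3 * f1 + f2


rng = random.Random(2016)
N, TRUE_SLOPE, SIGMA_U = 5000, 2.0, 0.5           # assumed simulation settings
x = [rng.gauss(0.0, 1.0) for _ in range(N)]        # true covariate (unobserved)
y = [TRUE_SLOPE * xi + rng.gauss(0.0, 1.0) for xi in x]
w = [xi + rng.gauss(0.0, SIGMA_U) for xi in x]     # error-prone measurement

naive = ols_slope(w, y)
corrected = simex_slope(w, y, SIGMA_U, rng)
print(f"naive slope: {naive:.3f}, SIMEX slope: {corrected:.3f}, truth: {TRUE_SLOPE}")
```

In this toy setting the naive slope is attenuated toward zero and the extrapolated estimate moves most of the way back toward the true value. A quadratic extrapolant does not remove the bias exactly, which is in keeping with the abstract's restriction to low-to-moderate degrees of error.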

Measurement error and timing of predictor values for multivariable risk prediction models are poorly reported.

PubMed

Whittle, Rebecca; Peat, George; Belcher, John; Collins, Gary S; Riley, Richard D

2018-05-18

Measurement error in predictor variables may threaten the validity of clinical prediction models. We sought to evaluate the possible extent of the problem. A secondary objective was to examine whether predictors are measured at the intended moment of model use. A systematic search of Medline was used to identify a sample of articles reporting the development of a clinical prediction model published in 2015. After screening according to predefined inclusion criteria, information on predictors, strategies to control for measurement error and intended moment of model use was extracted. Susceptibility to measurement error for each predictor was classified into low and high risk. Thirty-three studies were reviewed, including 151 different predictors in the final prediction models. Fifty-one (33.7%) predictors were categorised as high risk of error; however, this was not accounted for in the model development. Only 8 (24.2%) studies explicitly stated the intended moment of model use and when the predictors were measured. Reporting of measurement error and intended moment of model use is poor in prediction model studies. There is a need to identify circumstances where ignoring measurement error in prediction models is consequential and whether accounting for the error will improve the predictions. Copyright © 2018. Published by Elsevier Inc.
Unified Parkinson’s Disease Rating Scale-Motor Exam: Inter-rater reliability of advanced practice nurse and neurologist assessments

PubMed Central

Palmer, Janice L.; Coats, Mary A.; Roe, Catherine M.; Hanko, Shelly M.; Xiong, Chengjie; Morris, John C.

2010-01-01

Aim: This paper is a report of a study to establish the inter-rater reliability of advanced practice nurse and neurologist neurological assessments which included ratings with the Unified Parkinson’s Disease Rating Scale-Motor Exam. Background: Around the world, advanced practice nurses are performing tasks once completed by only physicians. To promote consumer and provider confidence, it is important to establish that nurse and physician ratings using assessment tools are similar. In addition, in research settings, when different raters are used, establishment of inter-rater reliability for study assessments is needed. Method: Advanced practice nurses and neurologists independently recorded findings on neurological examinations of 46 participants in a study conducted between August 2007 and January 2008. An intraclass correlation coefficient was calculated to estimate overall agreement between the nurse and neurologist ratings. Agreement for individual items measured on a dichotomous scale was assessed by calculating Cohen’s kappa. Results: There was substantial agreement between advanced practice nurses and neurologists on the mean Unified Parkinson’s Disease Rating Scale-Motor Exam ratings (intraclass correlation coefficient = 0.65) and the U.S. National Alzheimer’s Coordinating Center Uniform Data Set neurological examination ratings of unremarkable findings (kappa = 0.74) and of gait disorder (kappa = 0.73). Moderate agreement (kappa = 0.53) was reached for the rating of whether all Unified Parkinson’s Disease Rating Scale-Motor Exam items were normal. Conclusion: These findings are consistent with studies of the inter-rater agreement of the Unified Parkinson’s Disease Rating Scale-Motor Exam and support the conduct of neurological assessments by advanced practice nurses. PMID:20546368
Multiple imputation to account for measurement error in marginal structural models

PubMed Central

Edwards, Jessie K.; Cole, Stephen R.; Westreich, Daniel; Crane, Heidi; Eron, Joseph J.; Mathews, W. Christopher; Moore, Richard; Boswell, Stephen L.; Lesko, Catherine R.; Mugavero, Michael J.

2015-01-01

Background: Marginal structural models are an important tool for observational studies. These models typically assume that variables are measured without error. We describe a method to account for differential and non-differential measurement error in a marginal structural model. Methods: We illustrate the method estimating the joint effects of antiretroviral therapy initiation and current smoking on all-cause mortality in a United States cohort of 12,290 patients with HIV followed for up to 5 years between 1998 and 2011. Smoking status was likely measured with error, but a subset of 3686 patients who reported smoking status on separate questionnaires composed an internal validation subgroup. We compared a standard joint marginal structural model fit using inverse probability weights to a model that also accounted for misclassification of smoking status using multiple imputation. Results: In the standard analysis, current smoking was not associated with increased risk of mortality. After accounting for misclassification, current smoking without therapy was associated with increased mortality [hazard ratio (HR): 1.2 (95% CI: 0.6, 2.3)]. The HR for current smoking and therapy (0.4; 95% CI: 0.2, 0.7) was similar to the HR for no smoking and therapy (0.4; 95% CI: 0.2, 0.6). Conclusions: Multiple imputation can be used to account for measurement error in concert with methods for causal inference to strengthen results from observational studies. PMID:26214338
Clinical Utility of Ultrasound Measurements of Plantar Fascia Width and Cross-Sectional Area: A Novel Technique.

PubMed

Bisi-Balogun, Adebisi; Rector, Michael

2017-09-01

We sought to develop a standardized protocol for ultrasound (US) measurements of plantar fascia (PF) width and cross-sectional area (CSA), which may serve as additional outcome variables during US examinations of both healthy asymptomatic PF and in plantar fasciopathy and determine its interrater and intrarater reliability. Ten healthy individuals (20 feet) were enrolled. Participants were assessed twice by two raters each to determine intrarater and interrater reliability. For each foot, three transverse scans of the central bundle of the PF were taken at its insertion at the medial calcaneal tubercle, identified in real time on the plantar surface of the foot, using a fine wire technique. Reliability was determined using intraclass correlation coefficients (ICC), standard errors of measurement (SEM), and limits of agreement (LOA) expressed as percentages of the mean. Reliability of PF width and CSA measurements was determined using PF width and CSA measurements from one sonogram measured once and the mean of three measurements from three sonograms each measured once. Ultrasound measurements of PF width and CSA showed a mean of 18.6 ± 2.0 mm and 69.20 ± 13.6 mm², respectively. Intra-rater reliability for both raters showed an ICC > 0.84 for width and ICC > 0.92 for CSA as well as a SEM% and LOA% < 10% for both width and CSA. Inter-rater reliability showed an ICC of 0.82 for width and 0.87 for CSA as well as a SEM% and LOA% < 10% for width and a SEM% < 10% and LOA% < 20% for CSA. Relative and absolute reliability within and between raters were higher when using the mean of three sonograms compared to one sonogram. Using this novel technique, PF CSA and width may be determined reliably using measurements from one sonogram or the mean of three sonograms. Measurement of PF CSA and width in addition to already established thickness and echogenicity measurements provides additional information on structural properties of the PF for clinicians and researchers in healthy
An assessment of the inter-rater reliability of the ASA physical status score in the orthopaedic trauma population.

PubMed

Ihejirika, Rivka C; Thakore, Rachel V; Sathiyakumar, Vasanth; Ehrenfeld, Jesse M; Obremskey, William T; Sethi, Manish K

2015-04-01

Although recent literature has demonstrated the utility of the ASA score in predicting postoperative length of stay, complication risk and potential utilization of other hospital resources, the ASA score has been inconsistently assigned by anaesthesia providers. This study tested the reliability of assignment of the ASA score classification by both attending anaesthesiologists and anaesthesia residents specifically among the orthopaedic trauma patient population. Nine case-based scenarios were created involving preoperative patients with isolated operative orthopaedic trauma injuries. The cases were created and assigned a reference score by both an attending anaesthesiologist and orthopaedic trauma surgeon. Attending and resident anaesthesiologists were asked to assign an ASA score for each case. Rater versus reference and inter-rater agreement amongst respondents were then analyzed utilizing Fleiss's Kappa and weighted and unweighted Cohen's Kappa. Thirty-three individuals provided ASA scores for each of the scenarios. The average rater versus reference reliability was substantial (Kw=0.78, SD=0.131, 95% CI=0.73-0.83). The average rater versus reference Kuw was also substantial (Kuw=0.64, SD=0.21, 95% CI=0.56-0.71). The inter-rater reliability as evaluated by Fleiss's Kappa was moderate (K=0.51, p<.001). Inter-rater agreement within the group of attendings (K=0.50, p<.001) and within the group of residents (K=0.55, p<.001) was moderate in both cases. There was a significant increase in the level of inter-rater reliability from the self-reported 'very uncomfortable' participants to the 'very comfortable' participants (uncomfortable K=0.43, comfortable K=0.59, p<.001). This study shows substantial agreement strength for reliability of the ASA score among anaesthesiologists when evaluating orthopaedic trauma patients. The significant increase in inter-rater reliability based on anaesthesiologists' comfort with the ASA scoring method implies a need for further evaluation
Rater Judgment and English Language Speaking Proficiency. Research Report

ERIC Educational Resources Information Center

Chalhoub-Deville, Micheline; Wigglesworth, Gillian

2005-01-01

The paper investigates whether there is a shared perception of speaking proficiency among raters from different English speaking countries. More specifically, this study examines whether there is a significant difference among English language learning (ELL) teachers, residing in Australia, Canada, the UK, and the USA when rating speech samples of…
Error Analysis of Wind Measurements for the University of Illinois Sodium Doppler Temperature System

NASA Technical Reports Server (NTRS)

Pfenninger, W. Matthew; Papen, George C.

1992-01-01

Four-frequency lidar measurements of temperature and wind velocity require accurate frequency tuning to an absolute reference and long term frequency stability. We quantify frequency tuning errors for the Illinois sodium system, using a sodium vapor cell to measure absolute frequencies and a reference interferometer to measure relative frequencies. To determine laser tuning errors, we monitor the vapor cell and interferometer during lidar data acquisition and analyze the two signals for variations as functions of time. Both the sodium cell and the interferometer are the same as those used to frequency tune the laser. By quantifying the frequency variations of the laser during data acquisition, an error analysis of temperature and wind measurements can be calculated. These error bounds determine the confidence in the calculated temperatures and wind velocities.
Simulation of the Effects of Random Measurement Errors

ERIC Educational Resources Information Center

Kinsella, I. A.; Hannaidh, P. B. O.

1978-01-01

Describes a simulation method for measurement of errors that requires calculators and tables of random digits. Each student simulates the random behaviour of the component variables in the function and by combining the results of all students, the outline of the sampling distribution of the function can be obtained. (GA)
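The classroom exercise described above can be mimicked in code. This sketch swaps the calculators and random-digit tables for a pseudorandom generator, and it assumes a hypothetical function, the area of a rectangle computed from an independently measured length and width with Gaussian errors. The nominal values and error sizes are made up for illustration.

```python
import math
import random
import statistics

rng = random.Random(1978)

# Assumed nominal dimensions and standard deviations of the measurement errors.
LENGTH, LENGTH_SD = 10.0, 0.1
WIDTH, WIDTH_SD = 5.0, 0.1
N_STUDENTS = 20000  # each "student" contributes one simulated measurement pair

# Every student draws one noisy length and one noisy width and reports the area.
areas = [
    rng.gauss(LENGTH, LENGTH_SD) * rng.gauss(WIDTH, WIDTH_SD)
    for _ in range(N_STUDENTS)
]

sim_mean = statistics.fmean(areas)
sim_sd = statistics.stdev(areas)

# First-order error propagation for a product of independent quantities:
# relative variances add.
approx_sd = LENGTH * WIDTH * math.sqrt(
    (LENGTH_SD / LENGTH) ** 2 + (WIDTH_SD / WIDTH) ** 2
)

print(f"simulated mean area: {sim_mean:.3f}")
print(f"simulated SD: {sim_sd:.3f}, first-order approximation: {approx_sd:.3f}")
```

Pooling the simulated areas outlines the sampling distribution of the function, and its spread can be compared with the first-order propagation formula.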
A methodology for translating positional error into measures of attribute error, and combining the two error sources

Treesearch

Yohay Carmel; Curtis Flather; Denis Dean

2006-01-01

This paper summarizes our efforts to investigate the nature, behavior, and implications of positional error and attribute error in spatiotemporal datasets. Estimating the combined influence of these errors on map analysis has been hindered by the fact that these two error types are traditionally expressed in different units (distance units, and categorical units,...
Adjustment of Measurements with Multiplicative Errors: Error Analysis, Estimates of the Variance of Unit Weight, and Effect on Volume Estimation from LiDAR-Type Digital Elevation Models

PubMed Central

Shi, Yun; Xu, Peiliang; Peng, Junhuan; Shi, Chuang; Liu, Jingnan

2014-01-01

Modern observation technology has verified that measurement errors can be proportional to the true values of measurements such as GPS, VLBI baselines and LiDAR. Observational models of this type are called multiplicative error models. This paper extends the work of Xu and Shimada, published in 2000, on multiplicative error models to analytical error analysis of quantities of practical interest and estimates of the variance of unit weight. We analytically derive the variance-covariance matrices of the three least squares (LS) adjustments, the adjusted measurements and the corrections of measurements in multiplicative error models. For quality evaluation, we construct five estimators for the variance of unit weight in association with the three LS adjustment methods. Although LiDAR measurements are contaminated with multiplicative random errors, LiDAR-based digital elevation models (DEM) have been constructed as if they were subject to additive random errors. We will simulate a model landslide, which is assumed to be surveyed with LiDAR, and investigate the effect of LiDAR-type multiplicative error measurements on DEM construction and its effect on the estimate of landslide mass volume from the constructed DEM. PMID:24434880
Inter-rater reliability and generalizability of patient note scores using a scoring rubric based on the USMLE Step-2 CS format.

PubMed

Park, Yoon Soo; Hyderi, Abbas; Bordage, Georges; Xing, Kuan; Yudkowsky, Rachel

2016-10-01

Recent changes to the patient note (PN) format of the United States Medical Licensing Examination have challenged medical schools to improve the instruction and assessment of students taking the Step-2 clinical skills examination. The purpose of this study was to gather validity evidence regarding response process and internal structure, focusing on inter-rater reliability and generalizability, to determine whether a locally-developed PN scoring rubric and scoring guidelines could yield reproducible PN scores. A randomly selected subsample of historical data (post-encounter PN from 55 of 177 medical students) was rescored by six trained faculty raters in November-December 2014. Inter-rater reliability (% exact agreement and kappa) was calculated for five standardized patient cases administered in a local graduation competency examination. Generalizability studies were conducted to examine the overall reliability. Qualitative data were collected through surveys and a rater-debriefing meeting. The overall inter-rater reliability (weighted kappa) was .79 (Documentation = .63, Differential Diagnosis = .90, Justification = .48, and Workup = .54). The majority of score variance was due to case specificity (13%) and case-task specificity (31%), indicating differences in student performance by case and by case-task interactions. Variance associated with raters and their interactions was modest (<5%). Raters felt that justification was the most difficult task to score and that having case and level-specific scoring guidelines during training was most helpful for calibration. The overall inter-rater reliability indicates a high level of confidence in the consistency of note scores. Designs for scoring notes may optimize reliability by balancing the number of raters and cases.
Error Reduction Methods for Integrated-path Differential-absorption Lidar Measurements

NASA Technical Reports Server (NTRS)

Chen, Jeffrey R.; Numata, Kenji; Wu, Stewart T.

2012-01-01

We report new modeling and error reduction methods for differential-absorption optical-depth (DAOD) measurements of atmospheric constituents using direct-detection integrated-path differential-absorption lidars. Errors from laser frequency noise are quantified in terms of the line center fluctuation and spectral line shape of the laser pulses, revealing relationships verified experimentally. A significant DAOD bias is removed by introducing a correction factor. Errors from surface height and reflectance variations can be reduced to tolerable levels by incorporating altimetry knowledge and "log after averaging", or by pointing the laser and receiver to a fixed surface spot during each wavelength cycle to shorten the time of "averaging before log".
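Why the order of averaging and taking the logarithm matters can be shown with a toy calculation. This is a generic illustration of the nonlinearity of the logarithm, not the authors' lidar model: it assumes a single channel whose per-pulse signal fluctuates uniformly by up to 50% around a true value of 1, so the ideal log-signal is 0. All names and settings are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(2012)

TRUE_SIGNAL = 1.0    # assumed noise-free return signal (arbitrary units)
N_PULSES = 100_000   # assumed number of pulses being averaged

# Per-pulse signals with multiplicative fluctuations of up to +/-50%.
pulses = [TRUE_SIGNAL * (1.0 + rng.uniform(-0.5, 0.5)) for _ in range(N_PULSES)]

# "Log after averaging": average the signals first, then take the logarithm.
log_after_averaging = math.log(statistics.fmean(pulses))

# "Averaging after log": take the logarithm of each pulse, then average.
averaging_after_log = statistics.fmean([math.log(p) for p in pulses])

print(f"log after averaging: {log_after_averaging:+.4f}")
print(f"averaging after log: {averaging_after_log:+.4f}")
```

Under these assumptions the first estimate stays close to the ideal value of 0, while the second is biased negative by roughly half the relative variance of the fluctuations.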
Refraction error correction for deformation measurement by digital image correlation at elevated temperature

NASA Astrophysics Data System (ADS)

Su, Yunquan; Yao, Xuefeng; Wang, Shen; Ma, Yinji

2017-03-01

An effective correction model is proposed to eliminate the refraction error effect caused by an optical window of a furnace in digital image correlation (DIC) deformation measurement under high-temperature environment. First, a theoretical correction model with the corresponding error correction factor is established to eliminate the refraction error induced by double-deck optical glass in DIC deformation measurement. Second, a high-temperature DIC experiment using a chromium-nickel austenite stainless steel specimen is performed to verify the effectiveness of the correction model by the correlation calculation results under two different conditions (with and without the optical glass). Finally, both the full-field and the divisional displacement results with refraction influence are corrected by the theoretical model and then compared to the displacement results extracted from the images without refraction influence. The experimental results demonstrate that the proposed theoretical correction model can effectively improve the measurement accuracy of DIC method by decreasing the refraction errors from measured full-field displacements under high-temperature environment.
Accounting for measurement error in human life history trade-offs using structural equation modeling.

PubMed

Helle, Samuli

2018-03-01

Revealing causal effects from correlative data is very challenging and a contemporary problem in human life history research owing to the lack of experimental approach. Problems with causal inference arising from measurement error in independent variables, whether related either to inaccurate measurement technique or validity of measurements, seem not well-known in this field. The aim of this study is to show how structural equation modeling (SEM) with latent variables can be applied to account for measurement error in independent variables when the researcher has recorded several indicators of a hypothesized latent construct. As a simple example of this approach, measurement error in lifetime allocation of resources to reproduction in Finnish preindustrial women is modelled in the context of the survival cost of reproduction. In humans, lifetime energetic resources allocated in reproduction are almost impossible to quantify with precision and, thus, typically used measures of lifetime reproductive effort (e.g., lifetime reproductive success and parity) are likely to be plagued by measurement error. These results are contrasted with those obtained from a traditional regression approach where the single best proxy of lifetime reproductive effort available in the data is used for inference. As expected, the inability to account for measurement error in women's lifetime reproductive effort resulted in the underestimation of its underlying effect size on post-reproductive survival. This article emphasizes the advantages that the SEM framework can provide in handling measurement error via multiple-indicator latent variables in human life history studies. © 2017 Wiley Periodicals, Inc.
Building "e-rater"® Scoring Models Using Machine Learning Methods. Research Report. ETS RR-16-04

ERIC Educational Resources Information Center

Chen, Jing; Fife, James H.; Bejar, Isaac I.; Rupp, André A.

2016-01-01

The "e-rater"® automated scoring engine used at Educational Testing Service (ETS) scores the writing quality of essays. In the current practice, e-rater scores are generated via a multiple linear regression (MLR) model as a linear combination of various features evaluated for each essay and human scores as the outcome variable. This…
An analysis of errors, discrepancies, and variation in opioid prescriptions for adult outpatients at a teaching hospital

PubMed Central

Bicket, Mark C.; Kattail, Deepa; Yaster, Myron; Wu, Christopher L.; Pronovost, Peter

2017-01-01

Objective To determine opioid prescribing patterns and rate of three types of errors, discrepancies, and variation from ideal practice. Design Retrospective review of opioid prescriptions processed at an outpatient pharmacy Setting Tertiary institutional medical center Patients We examined 510 consecutive opioid medication prescriptions for adult patients processed at an institutional outpatient pharmacy in June 2016 for patient, provider, and prescription characteristics. Main Outcome Measure(s) We analyzed prescriptions for deviation from best practice guidelines, lack of two patient identifiers, and noncompliance with Drug Enforcement Agency (DEA) rules. Results Mean patient age (SD) was 47.5 years (17.4). The most commonly prescribed opioid was oxycodone (71%), usually not combined with acetaminophen. Practitioners prescribed tablet formulation to 92% of the sample, averaging 57 (47) pills. We identified at least one error on 42% of prescriptions. Among all prescriptions, 9% deviated from best practice guidelines, 21% failed to include two patient identifiers, and 41% were noncompliant with DEA rules. Errors occurred in 89% of handwritten prescriptions, 0% of electronic health record (EHR) computer-generated prescriptions, and 12% of non-EHR computer-generated prescriptions. Inter-rater reliability by kappa was 0.993. Conclusions Inconsistencies in opioid prescribing remain common. Handwritten prescriptions continue to demonstrate higher associations of errors, discrepancies, and variation from ideal practice and government regulations. All computer-generated prescriptions adhered to best practice guidelines and contained two patient identifiers, and all EHR prescriptions were fully compliant with DEA rules. PMID:28345746
Distribution of standing-wave errors in real-ear sound-level measurements.

PubMed

Richmond, Susan A; Kopun, Judy G; Neely, Stephen T; Tan, Hongyang; Gorga, Michael P

2011-05-01

Standing waves can cause measurement errors when sound-pressure level (SPL) measurements are performed in a closed ear canal, e.g., during probe-microphone system calibration for distortion-product otoacoustic emission (DPOAE) testing. Alternative calibration methods, such as forward-pressure level (FPL), minimize the influence of standing waves by calculating the forward-going sound waves separate from the reflections that cause errors. Previous research compared test performance (Burke et al., 2010) and threshold prediction (Rogers et al., 2010) using SPL and multiple FPL calibration conditions, and surprisingly found no significant improvements when using FPL relative to SPL, except at 8 kHz. The present study examined the calibration data collected by Burke et al. and Rogers et al. from 155 human subjects in order to describe the frequency location and magnitude of standing-wave pressure minima to see if these errors might explain trends in test performance. Results indicate that while individual results varied widely, pressure variability was larger around 4 kHz and smaller at 8 kHz, consistent with the dimensions of the adult ear canal. The present data suggest that standing-wave errors are not responsible for the historically poor (8 kHz) or good (4 kHz) performance of DPOAE measures at specific test frequencies.
Reliability of capturing foot parameters using digital scanning and the neutral suspension casting technique

PubMed Central

2011-01-01

Background A clinical study was conducted to determine the intra- and inter-rater reliability of digital scanning and the neutral suspension casting technique to measure six foot parameters. The neutral suspension casting technique is a commonly utilised method for obtaining a negative impression of the foot prior to orthotic fabrication. Digital scanning offers an alternative to the traditional plaster of Paris techniques. Methods Twenty-one healthy participants volunteered to take part in the study. Six casts and six digital scans were obtained from each participant by two raters of differing clinical experience. The foot parameters chosen for investigation were cast length (mm), forefoot width (mm), rearfoot width (mm), medial arch height (mm), lateral arch height (mm) and forefoot to rearfoot alignment (degrees). Intraclass correlation coefficients (ICC) with 95% confidence intervals (CI) were calculated to determine the intra- and inter-rater reliability. Measurement error was assessed through the calculation of the standard error of the measurement (SEM) and smallest real difference (SRD). Results ICC values for all foot parameters using digital scanning ranged from 0.81 to 0.99 for both intra- and inter-rater reliability. For the neutral suspension casting technique, inter-rater reliability values ranged from 0.57 to 0.99 and intra-rater reliability values ranged from 0.36 to 0.99 for rater 1 and from 0.49 to 0.99 for rater 2. Conclusions The findings of this study indicate that digital scanning is a reliable technique, irrespective of clinical experience, with reduced measurement variability in all foot parameters investigated when compared to neutral suspension casting. PMID:21375757
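The SEM and SRD mentioned in this abstract are commonly derived from the ICC and the between-subject standard deviation. The Python sketch below uses the widely cited formulas SEM = SD × sqrt(1 − ICC) and SRD = 1.96 × sqrt(2) × SEM. Whether the study used exactly these formulas is an assumption, and the function names are hypothetical.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement from the sample SD and an ICC.

    Assumes the common formula SEM = SD * sqrt(1 - ICC).
    """
    return sd * math.sqrt(1.0 - icc)


def smallest_real_difference(sem, z=1.96):
    """Smallest real difference (also called minimal detectable change).

    z = 1.96 gives a 95% level; z = 1.645 would give a 90% level.
    The sqrt(2) accounts for error in both of the two measurements compared.
    """
    return z * math.sqrt(2.0) * sem


# Illustrative values only (not taken from the study).
example_sem = sem_from_icc(sd=2.0, icc=0.75)
example_srd = smallest_real_difference(example_sem)
```

A change between two measurements smaller than the SRD cannot be distinguished from measurement noise at the chosen confidence level.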
Reliability and validity of CODA motion analysis system for measuring cervical range of motion in patients with cervical spondylosis and anterior cervical fusion.

PubMed

Gao, Zhongyang; Song, Hui; Ren, Fenggang; Li, Yuhuan; Wang, Dong; He, Xijing

2017-12-01

The aim of the present study was to evaluate the reliability of the Cartesian Optoelectronic Dynamic Anthropometer (CODA) motion system in measuring the cervical range of motion (ROM) and verify the construct validity of the CODA motion system. A total of 26 patients with cervical spondylosis and 22 patients with anterior cervical fusion were enrolled and the CODA motion analysis system was used to measure the three-dimensional cervical ROM. Intra- and inter-rater reliability was assessed by intraclass correlation coefficients (ICCs), standard error of measurement (SEm), limits of agreement (LOA) and minimal detectable change (MDC). Independent samples t-tests were performed to examine the differences of cervical ROM between cervical spondylosis and anterior cervical fusion patients. The results revealed that in the cervical spondylosis group, the reliability was almost perfect (intra-rater reliability: ICC, 0.87-0.95; LOA, -12.86-13.70; SEm, 2.97-4.58; inter-rater reliability: ICC, 0.84-0.95; LOA, -13.09-13.48; SEm, 3.13-4.32). In the anterior cervical fusion group, the reliability was high (intra-rater reliability: ICC, 0.88-0.97; LOA, -10.65-11.08; SEm, 2.10-3.77; inter-rater reliability: ICC, 0.86-0.96; LOA, -10.91-13.66; SEm, 2.20-4.45). The cervical ROM in the cervical spondylosis group was significantly higher than that in the anterior cervical fusion group in all directions except for left rotation. In conclusion, the CODA motion analysis system is highly reliable in measuring cervical ROM and the construct validity was verified, as the system was sufficiently sensitive to distinguish between the cervical spondylosis and anterior cervical fusion groups based on their ROM.
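Limits of agreement such as those reported in this abstract are usually computed with the Bland-Altman approach: the mean of the paired differences plus or minus 1.96 standard deviations of those differences. A minimal Python sketch follows, assuming paired measurements from two raters (or two sessions). The function name and sample data are illustrative, not from the study.

```python
import statistics


def limits_of_agreement(first, second, z=1.96):
    """Bland-Altman bias and limits of agreement for paired measurements.

    Returns (lower_limit, bias, upper_limit), where bias is the mean
    paired difference and the limits are bias -/+ z * SD of the differences.
    """
    diffs = [a - b for a, b in zip(first, second)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return bias - z * sd, bias, bias + z * sd


# Hypothetical range-of-motion readings in degrees from two raters.
rater_1 = [10.0, 12.0, 14.0]
rater_2 = [9.0, 12.0, 15.0]
lower, bias, upper = limits_of_agreement(rater_1, rater_2)
```

Unlike the ICC, which is unitless, the limits are expressed in the units of the measurement, so they can be compared directly against a clinically meaningful difference.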
The Berg Balance Scale has high intra- and inter-rater reliability but absolute reliability varies across the scale: a systematic review.

PubMed

Downs, Stephen; Marquez, Jodie; Chiarelli, Pauline

2013-06-01

What is the intra-rater and inter-rater relative reliability of the Berg Balance Scale? What is the absolute reliability of the Berg Balance Scale? Does the absolute reliability of the Berg Balance Scale vary across the scale? Systematic review with meta-analysis of reliability studies. Any clinical population that has undergone assessment with the Berg Balance Scale. Relative intra-rater reliability, relative inter-rater reliability, and absolute reliability. Eleven studies involving 668 participants were included in the review. The relative intrarater reliability of the Berg Balance Scale was high, with a pooled estimate of 0.98 (95% CI 0.97 to 0.99). Relative inter-rater reliability was also high, with a pooled estimate of 0.97 (95% CI 0.96 to 0.98). A ceiling effect of the Berg Balance Scale was evident for some participants. In the analysis of absolute reliability, all of the relevant studies had an average score of 20 or above on the 0 to 56 point Berg Balance Scale. The absolute reliability across this part of the scale, as measured by the minimal detectable change with 95% confidence, varied between 2.8 points and 6.6 points. The Berg Balance Scale has a higher absolute reliability when close to 56 points due to the ceiling effect. We identified no data that estimated the absolute reliability of the Berg Balance Scale among participants with a mean score below 20 out of 56. The Berg Balance Scale has acceptable reliability, although it might not detect modest, clinically important changes in balance in individual subjects. The review was only able to comment on the absolute reliability of the Berg Balance Scale among people with moderately poor to normal balance. Copyright © 2013 Australian Physiotherapy Association. Published by .. All rights reserved.

Construct Validity of "e-rater"® in Scoring TOEFL® Essays. Research Report. ETS RR-07-21

ERIC Educational Resources Information Center

Attali, Yigal

2007-01-01

This study examined the construct validity of the "e-rater"® automated essay scoring engine as an alternative to human scoring in the context of TOEFL® essay writing. Analyses were based on a sample of students who repeated the TOEFL within a short time period. Two "e-rater" scores were investigated in this study, the first…
Statistical methods for biodosimetry in the presence of both Berkson and classical measurement error

NASA Astrophysics Data System (ADS)

Miller, Austin

In radiation epidemiology, the true dose received by those exposed cannot be assessed directly. Physical dosimetry uses a deterministic function of the source term, distance and shielding to estimate dose. For the atomic bomb survivors, the physical dosimetry system is well established. The classical measurement errors plaguing the location and shielding inputs to the physical dosimetry system are well known. Adjusting for the associated biases requires an estimate for the classical measurement error variance, for which no data-driven estimate exists. In this case, an instrumental variable solution is the most viable option to overcome the classical measurement error indeterminacy. Biological indicators of dose may serve as instrumental variables. Specification of the biodosimeter dose-response model requires identification of the radiosensitivity variables, for which we develop statistical definitions and variables. More recently, researchers have recognized Berkson error in the dose estimates, introduced by averaging assumptions for many components in the physical dosimetry system. We show that Berkson error induces a bias in the instrumental variable estimate of the dose-response coefficient, and then address the estimation problem. This model is specified by developing an instrumental variable mixed measurement error likelihood function, which is then maximized using a Monte Carlo EM Algorithm. These methods produce dose estimates that incorporate information from both physical and biological indicators of dose, as well as the first instrumental variable based data-driven estimate for the classical measurement error variance.
The Influence of Training Phase on Error of Measurement in Jump Performance.

PubMed

Taylor, Kristie-Lee; Hopkins, Will G; Chapman, Dale W; Cronin, John B

2016-03-01

The purpose of this study was to calculate the coefficients of variation in jump performance for individual participants in multiple trials over time to determine the extent to which there are real differences in the error of measurement between participants. The effect of training phase on measurement error was also investigated. Six subjects participated in a resistance-training intervention for 12 wk with mean power from a countermovement jump measured 6 d/wk. Using a mixed-model meta-analysis, differences between subjects, within-subject changes between training phases, and the mean error values during different phases of training were examined. Small, substantial factor differences of 1.11 were observed between subjects; however, the finding was unclear based on the width of the confidence limits. The mean error was clearly higher during overload training than baseline training, by a factor of ×/÷ 1.3 (confidence limits 1.0-1.6). The random factor representing the interaction between subjects and training phases revealed further substantial differences of ×/÷ 1.2 (1.1-1.3), indicating that on average, the error of measurement in some subjects changes more than in others when overload training is introduced. The results from this study provide the first indication that within-subject variability in performance is substantially different between training phases and, possibly, different between individuals. The implications of these findings for monitoring individuals and estimating sample size are discussed.
Error model of geomagnetic-field measurement and extended Kalman-filter based compensation method

PubMed Central

Ge, Zhilei; Liu, Suyun; Li, Guopeng; Huang, Yan; Wang, Yanni

2017-01-01

The real-time accurate measurement of the geomagnetic-field is the foundation to achieving high-precision geomagnetic navigation. The existing geomagnetic-field measurement models are essentially simplified models that cannot accurately describe the sources of measurement error. This paper, on the basis of systematically analyzing the source of geomagnetic-field measurement error, built a complete measurement model, into which the previously unconsidered geomagnetic daily variation field was introduced. This paper proposed an extended Kalman-filter based compensation method, which allows a large amount of measurement data to be used in estimating parameters to obtain the optimal solution in the sense of statistics. The experiment results showed that the compensated strength of the geomagnetic field remained close to the real value and the measurement error was basically controlled within 5 nT. In addition, this compensation method has strong applicability due to its easy data collection and ability to remove the dependence on a high-precision measurement instrument. PMID:28445508
Performance of bias-correction methods for exposure measurement error using repeated measurements with and without missing data.

PubMed

Batistatou, Evridiki; McNamee, Roseanne

2012-12-10

It is known that measurement error leads to bias in assessing exposure effects, which can, however, be corrected if independent replicates are available. For expensive replicates, two-stage (2S) studies that produce data 'missing by design' may be preferred over a single-stage (1S) study, because in the second stage, measurement of replicates is restricted to a sample of first-stage subjects. Motivated by an occupational study on the acute effect of carbon black exposure on respiratory morbidity, we compare the performance of several bias-correction methods for both designs in a simulation study: an instrumental variable method (EVROS IV) based on grouping strategies, which had been recommended especially when measurement error is large, the regression calibration and the simulation extrapolation methods. For the 2S design, either the problem of 'missing' data was ignored or the 'missing' data were imputed using multiple imputations. Both in 1S and 2S designs, in the case of small or moderate measurement error, regression calibration was shown to be the preferred approach in terms of root mean square error. For 2S designs, regression calibration as implemented by Stata software is not recommended, in contrast to our implementation of this method, although the 'problematic' implementation of regression calibration improved substantially with use of multiple imputations. The EVROS IV method, under a good/fairly good grouping, outperforms the regression calibration approach in both design scenarios when exposure mismeasurement is severe. Both in 1S and 2S designs with moderate or large measurement error, simulation extrapolation severely failed to correct for bias. Copyright © 2012 John Wiley & Sons, Ltd.
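To make the idea of correcting exposure-effect bias with replicates concrete, the following Python sketch simulates a simple single-stage design with two replicates per subject and applies the simplest linear form of regression calibration: the naive slope is divided by a reliability ratio estimated from the replicates. All names, the simulated data, and the parameter values are assumptions for illustration. This is not the authors' implementation, nor the Stata one discussed in the abstract.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def regression_calibration_slope(w1, w2, y):
    """Return (naive_slope, corrected_slope) using two replicate measurements.

    Assumes classical additive error: w = x + u, with u independent of x.
    """
    w_mean = [(a + b) / 2.0 for a, b in zip(w1, w2)]
    naive = ols_slope(w_mean, y)
    # Var(w1 - w2) = 2 * Var(u), so halve it to estimate the error variance.
    var_u = statistics.variance([a - b for a, b in zip(w1, w2)]) / 2.0
    var_w_mean = statistics.variance(w_mean)
    # The mean of two replicates carries error variance Var(u) / 2.
    reliability = (var_w_mean - var_u / 2.0) / var_w_mean
    return naive, naive / reliability


# Simulated data: true slope 2.0, error variance equal to exposure variance.
rng = random.Random(0)
n = 20000
x_true = [rng.gauss(0, 1) for _ in range(n)]
w1 = [x + rng.gauss(0, 1) for x in x_true]
w2 = [x + rng.gauss(0, 1) for x in x_true]
y = [2.0 * x + rng.gauss(0, 1) for x in x_true]
naive_slope, corrected_slope = regression_calibration_slope(w1, w2, y)
```

In this setup the naive slope is attenuated toward zero, while the corrected slope should land close to the simulated true value of 2.0.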
Fluorescence errors in integrating sphere measurements of remote phosphor type LED light sources

NASA Astrophysics Data System (ADS)

Keppens, A.; Zong, Y.; Podobedov, V. B.; Nadal, M. E.; Hanselaer, P.; Ohno, Y.

2011-05-01

The relative spectral radiant flux error caused by phosphor fluorescence during integrating sphere measurements is investigated both theoretically and experimentally. Integrating sphere and goniophotometer measurements are compared and used for model validation, while a case study provides additional clarification. Criteria for reducing fluorescence errors to a degree of negligibility as well as a fluorescence error correction method based on simple matrix algebra are presented. Only remote phosphor type LED light sources are studied because of their large phosphor surfaces and high application potential in general lighting.
Correction of motion measurement errors beyond the range resolution of a synthetic aperture radar

DOEpatents

Doerry, Armin W [Albuquerque, NM]; Heard, Freddie E [Albuquerque, NM]; Cordaro, J Thomas [Albuquerque, NM]

2008-06-24

Motion measurement errors that extend beyond the range resolution of a synthetic aperture radar (SAR) can be corrected by effectively decreasing the range resolution of the SAR in order to permit measurement of the error. Range profiles can be compared across the slow-time dimension of the input data in order to estimate the error. Once the error has been determined, appropriate frequency and phase correction can be applied to the uncompressed input data, after which range and azimuth compression can be performed to produce a desired SAR image.
Effects of measurement errors on psychometric measurements in ergonomics studies: Implications for correlations, ANOVA, linear regression, factor analysis, and linear discriminant analysis.

PubMed

Liu, Yan; Salvendy, Gavriel

2009-05-01

This paper aims to demonstrate the effects of measurement errors on psychometric measurements in ergonomics studies. A variety of sources can cause random measurement errors in ergonomics studies and these errors can distort virtually every statistic computed and lead investigators to erroneous conclusions. The effects of measurement errors on five most widely used statistical analysis tools have been discussed and illustrated: correlation; ANOVA; linear regression; factor analysis; linear discriminant analysis. It has been shown that measurement errors can greatly attenuate correlations between variables, reduce statistical power of ANOVA, distort (overestimate, underestimate or even change the sign of) regression coefficients, underrate the explanation contributions of the most important factors in factor analysis and depreciate the significance of discriminant function and discrimination abilities of individual variables in discrimination analysis. The discussions will be restricted to subjective scales and survey methods and their reliability estimates. Other methods applied in ergonomics research, such as physical and electrophysiological measurements and chemical and biomedical analysis methods, also have issues of measurement errors, but they are beyond the scope of this paper. As there has been increasing interest in the development and testing of theories in ergonomics research, it has become very important for ergonomics researchers to understand the effects of measurement errors on their experiment results, which the authors believe is very critical to research progress in theory development and cumulative knowledge in the ergonomics field.
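The attenuation of correlations described in this abstract follows the classical Spearman formula: the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The Python sketch below is illustrative only, with hypothetical function names and example numbers.

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Expected observed correlation when both measures contain random error."""
    return true_r * math.sqrt(reliability_x * reliability_y)


def disattenuated_correlation(observed_r, reliability_x, reliability_y):
    """Spearman's correction for attenuation (inverse of the function above)."""
    return observed_r / math.sqrt(reliability_x * reliability_y)


# Example: a true correlation of 0.50 measured with reliabilities 0.64 and 0.81.
observed = attenuated_correlation(0.50, 0.64, 0.81)
recovered = disattenuated_correlation(observed, 0.64, 0.81)
```

With these example reliabilities, the observed correlation is expected to be only 72% of the true one, which shows how quickly moderately unreliable scales can mask a real relationship.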
Reliability of Semi-Automated Segmentations in Glioblastoma.

PubMed

Huber, T; Alber, G; Bette, S; Boeckh-Behrens, T; Gempt, J; Ringel, F; Alberts, E; Zimmer, C; Bauer, J S

2017-06-01

In glioblastoma, quantitative volumetric measurements of contrast-enhancing or fluid-attenuated inversion recovery (FLAIR) hyperintense tumor compartments are needed for an objective assessment of therapy response. The aim of this study was to evaluate the reliability of a semi-automated, region-growing segmentation tool for determining tumor volume in patients with glioblastoma among different users of the software. A total of 320 segmentations of tumor-associated FLAIR changes and contrast-enhancing tumor tissue were performed by different raters (neuroradiologists, medical students, and volunteers). All patients underwent high-resolution magnetic resonance imaging including a 3D-FLAIR and a 3D-MPRage sequence. Segmentations were done using a semi-automated, region-growing segmentation tool. Intra- and inter-rater-reliability were addressed by intra-class-correlation (ICC). Root-mean-square error (RMSE) was used to determine the precision error. Dice score was calculated to measure the overlap between segmentations. Semi-automated segmentation showed a high ICC (> 0.985) for all groups, indicating an excellent intra- and inter-rater-reliability. Significantly smaller precision errors and higher Dice scores were observed for FLAIR segmentations compared with segmentations of contrast-enhancement. Single rater segmentations showed the lowest RMSE for FLAIR of 3.3 % (MPRage: 8.2 %). Both single raters and neuroradiologists had the lowest precision error for longitudinal evaluation of FLAIR changes. Semi-automated volumetry of glioblastoma was reliably performed by all groups of raters, even without neuroradiologic expertise. Interestingly, segmentations of tumor-associated FLAIR changes were more reliable than segmentations of contrast enhancement. In longitudinal evaluations, an experienced rater can detect progressive FLAIR changes of less than 15 % reliably in a quantitative way which could help to detect progressive disease earlier.
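The Dice score and RMSE used in this abstract can be sketched in a few lines of Python. The sketch assumes a segmentation is represented as a collection of voxel indices and that RMSE is taken over paired volume measurements. The abstract reports RMSE as a percentage, and how that percentage was normalised is not stated, so only the plain RMSE is shown. Function names and data are hypothetical.

```python
import math


def dice_score(voxels_a, voxels_b):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two segmentations.

    Each segmentation is an iterable of hashable voxel indices.
    """
    a, b = set(voxels_a), set(voxels_b)
    if not a and not b:
        return 1.0  # two empty segmentations are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))


def rmse(first, second):
    """Root-mean-square error between two paired series of measurements."""
    squared = [(p - q) ** 2 for p, q in zip(first, second)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical voxel sets from two raters segmenting the same lesion.
overlap = dice_score({1, 2, 3}, {2, 3, 4})
# Hypothetical repeated volume measurements (arbitrary units).
precision_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Dice captures spatial overlap, whereas RMSE on volumes only captures size agreement: two segmentations can have equal volumes yet overlap poorly.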
Beyond alpha: an empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual differences constructs.

PubMed

Schmidt, Frank L; Le, Huy; Ilies, Remus

2003-06-01

On the basis of an empirical study of measures of constructs from the cognitive domain, the personality domain, and the domain of affective traits, the authors of this study examine the implications of transient measurement error for the measurement of frequently studied individual differences variables. The authors clarify relevant reliability concepts as they relate to transient error and present a procedure for estimating the coefficient of equivalence and stability (L. J. Cronbach, 1947), the only classical reliability coefficient that assesses all 3 major sources of measurement error (random response, transient, and specific factor errors). The authors conclude that transient error exists in all 3 trait domains and is especially large in the domain of affective traits. Their findings indicate that the nearly universal use of the coefficient of equivalence (Cronbach's alpha; L. J. Cronbach, 1951), which fails to assess transient error, leads to overestimates of reliability and undercorrections for biases due to measurement error.
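For reference, the coefficient of equivalence discussed in this abstract (Cronbach's alpha) is computed from a single test administration, which is why it cannot reflect transient error that varies between occasions. A minimal Python sketch of the standard formula follows. The function name and data layout are assumptions for illustration.

```python
import statistics


def cronbach_alpha(items):
    """Cronbach's alpha from one administration of a k-item scale.

    `items` is a list of k lists, each holding one item's scores for the
    same respondents in the same order. Uses sample variances throughout.
    """
    k = len(items)
    sum_item_var = sum(statistics.variance(scores) for scores in items)
    totals = [sum(values) for values in zip(*items)]
    total_var = statistics.variance(totals)
    return k / (k - 1) * (1.0 - sum_item_var / total_var)


# Two perfectly consistent items versus two unrelated items (toy data).
alpha_consistent = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]])
alpha_unrelated = cronbach_alpha([[1, 2, 1, 2], [1, 1, 2, 2]])
```

Because every score entering the formula comes from one occasion, a respondent's mood on that day inflates the item covariances and is counted as true-score variance.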
Inter-rater reliability of a modified version of Delitto et al.’s classification-based system for low back pain: a pilot study

PubMed Central

Apeldoorn, Adri T.; van Helvoirt, Hans; Ostelo, Raymond W.; Meihuizen, Hanneke; Kamper, Steven J.; van Tulder, Maurits W.; de Vet, Henrica C. W.

2016-01-01

Study design Observational inter-rater reliability study. Objectives To examine: (1) the inter-rater reliability of a modified version of Delitto et al.’s classification-based algorithm for patients with low back pain; (2) the influence of different levels of familiarity with the system; and (3) the inter-rater reliability of algorithm decisions in patients who clearly fit into a subgroup (clear classifications) and those who do not (unclear classifications). Methods Patients were examined twice on the same day by two of three participating physical therapists with different levels of familiarity with the system. Patients were classified into one of four classification groups. Raters were blind to the others’ classification decision. In order to quantify the inter-rater reliability, percentages of agreement and Cohen’s Kappa were calculated. Results A total of 36 patients were included (clear classification n = 23; unclear classification n = 13). The overall rate of agreement was 53% and the Kappa value was 0·34 [95% confidence interval (CI): 0·11–0·57], which indicated only fair inter-rater reliability. Inter-rater reliability for patients with a clear classification (agreement 52%, Kappa value 0·29) was not higher than for patients with an unclear classification (agreement 54%, Kappa value 0·33). Familiarity with the system (i.e. trained with written instructions and previous research experience with the algorithm) did not improve the inter-rater reliability. Conclusion Our pilot study challenges the inter-rater reliability of the classification procedure in clinical practice. Therefore, more knowledge is needed about factors that affect the inter-rater reliability, in order to improve the clinical applicability of the classification scheme. PMID:27559279
Magnetic resonance enterography has good inter-rater agreement and diagnostic accuracy for detecting inflammation in pediatric Crohn disease.

PubMed

Church, Peter C; Greer, Mary-Louise C; Cytter-Kuint, Ruth; Doria, Andrea S; Griffiths, Anne M; Turner, Dan; Walters, Thomas D; Feldman, Brian M

2017-05-01

Magnetic resonance enterography (MRE) is increasingly relied upon for noninvasive assessment of intestinal inflammation in Crohn disease. However very few studies have examined the diagnostic accuracy of individual MRE signs in children. We have created an MR-based multi-item measure of intestinal inflammation in children with Crohn disease - the Pediatric Inflammatory Crohn's MRE Index (PICMI). To inform item selection for this instrument, we explored the inter-rater agreement and diagnostic accuracy of individual MRE signs of inflammation in pediatric Crohn disease and compared our findings with the reference standards of the weighted Pediatric Crohn's Disease Activity Index (wPCDAI) and C-reactive protein (CRP). In this cross-sectional single-center study, MRE studies in 48 children with diagnosed Crohn disease (66% male, median age 15.5 years) were reviewed by two independent radiologists for the presence of 15 MRE signs of inflammation. Using kappa statistics we explored inter-rater agreement for each MRE sign across 10 anatomical segments of the gastrointestinal tract. We correlated MRE signs with the reference standards using correlation coefficients. Radiologists measured the length of inflamed bowel in each segment of the gastrointestinal tract. In each segment, MRE signs were scored as either binary (0-absent, 1-present), or ordinal (0-absent, 1-mild, 2-marked). These segmental scores were weighted by the length of involved bowel and were summed to produce a weighted score per patient for each MRE sign. Using a combination of wPCDAI≥12.5 and CRP≥5 to define active inflammation, we calculated area under the receiver operating characteristic curve (AUC) for each weighted MRE sign. Bowel wall enhancement, wall T2 hyperintensity, wall thickening and wall diffusion-weighted imaging (DWI) hyperintensity were most commonly identified. Inter-rater agreement was best for decreased motility and wall DWI hyperintensity (kappa≥0.64). Correlation between MRE
Measuring Scale Errors in a Laser Tracker’s Horizontal Angle Encoder Through Simple Length Measurement and Two-Face System Tests

PubMed Central

Muralikrishnan, B.; Blackburn, C.; Sawyer, D.; Phillips, S.; Bridges, R.

2010-01-01

We describe a method to estimate the scale errors in the horizontal angle encoder of a laser tracker in this paper. The method does not require expensive instrumentation such as a rotary stage or even a calibrated artifact. An uncalibrated but stable length is realized between two targets mounted on stands that are at tracker height. The tracker measures the distance between these two targets from different azimuthal positions (say, in intervals of 20° over 360°). Each target is measured in both front face and back face. Low order harmonic scale errors can be estimated from this data and may then be used to correct the encoder’s error map to improve the tracker’s angle measurement accuracy. We have demonstrated this for the second order harmonic in this paper. It is important to compensate for even order harmonics as their influence cannot be removed by averaging front face and back face measurements whereas odd orders can be removed by averaging. We tested six trackers from three different manufacturers. Two of those trackers are newer models introduced at the time of writing of this paper. For older trackers from two manufacturers, the length errors in a 7.75 m horizontal length placed 7 m away from a tracker were of the order of ± 65 μm before correcting the error map. They reduced to less than ± 25 μm after correcting the error map for second order scale errors. Newer trackers from the same manufacturers did not show this error. An older tracker from a third manufacturer also did not show this error. PMID:27134789
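The statement that averaging front-face and back-face measurements removes odd-order harmonics but not even-order ones can be checked numerically. The Python sketch below assumes a simplified model in which the encoder scale error is a single sinusoidal harmonic and the back-face reading uses the encoder at the azimuth plus 180°. Both the model and the function names are assumptions for illustration, not the paper's error model.

```python
import math


def harmonic_error(theta, order, amplitude=1.0, phase=0.0):
    """Encoder scale error modelled as one harmonic of the azimuth angle."""
    return amplitude * math.sin(order * theta + phase)


def two_face_average(theta, order, amplitude=1.0, phase=0.0):
    """Average of front-face and back-face errors at azimuth theta.

    Assumes the back-face measurement reads the encoder at theta + pi.
    """
    front = harmonic_error(theta, order, amplitude, phase)
    back = harmonic_error(theta + math.pi, order, amplitude, phase)
    return 0.5 * (front + back)


# Azimuth positions every 20 degrees over a full turn.
angles = [math.radians(a) for a in range(0, 360, 20)]
first_order_residual = max(abs(two_face_average(t, 1)) for t in angles)
second_order_residual = max(abs(two_face_average(t, 2)) for t in angles)
```

Under this model, sin(n(θ + π)) equals (−1)^n sin(nθ), so the two faces cancel for odd n and reinforce for even n, which is consistent with the abstract's reason for correcting the second-order term in the error map.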
Inter-rater reliability and aspects of validity of the parent-infant relationship global assessment scale (PIR-GAS)

PubMed Central

2013-01-01

Background The Parent-Infant Relationship Global Assessment Scale (PIR-GAS) signifies a conceptually relevant development in the multi-axial, developmentally sensitive classification system DC:0-3R for preschool children. However, information about the reliability and validity of the PIR-GAS is rare. A review of the available empirical studies suggests that in research, PIR-GAS ratings can be based on a ten-minute videotaped interaction sequence. The qualification of raters may be very heterogeneous across studies. Methods To test whether the use of the PIR-GAS still allows for a reliable assessment of the parent-infant relationship, our study compared PIR-GAS ratings based on a full-information procedure across multiple settings with ratings based on a ten-minute video by two doctoral candidates of medicine. For each mother-child dyad at a family day hospital (N = 48), we obtained two video ratings and one full-information rating at admission to therapy and at discharge. This pre-post design allowed for a replication of our findings across the two measurement points. We focused on the inter-rater reliability between the video coders, as well as between the video and full-information procedure, including mean differences and correlations between the raters. Additionally, we examined aspects of the validity of video and full-information ratings based on their correlation with measures of child and maternal psychopathology. Results Our results showed that ten-minute video and full-information PIR-GAS ratings were not interchangeable. Most results at admission could be replicated by the data obtained at discharge. We concluded that a higher degree of standardization of the assessment procedure should increase the reliability of the PIR-GAS, and a more thorough theoretical foundation of the manual should increase its validity. PMID:23705962
Theoretical and Experimental Errors for In Situ Measurements of Plant Water Potential

PubMed Central

Shackel, Kenneth A.

1984-01-01

Errors in psychrometrically determined values of leaf water potential caused by tissue resistance to water vapor exchange and by lack of thermal equilibrium were evaluated using commercial in situ psychrometers (Wescor Inc., Logan, UT) on leaves of Tradescantia virginiana (L.). Theoretical errors in the dewpoint method of operation for these sensors were demonstrated. After correction for these errors, in situ measurements of leaf water potential indicated substantial errors caused by tissue resistance to water vapor exchange (4 to 6% reduction in apparent water potential per second of cooling time used) resulting from humidity depletions in the psychrometer chamber during the Peltier condensation process. These errors were avoided by use of a modified procedure for dewpoint measurement. Large changes in apparent water potential were caused by leaf and psychrometer exposure to moderate levels of irradiance. These changes were correlated with relatively small shifts in psychrometer zero offsets (−0.6 to −1.0 megapascals per microvolt), indicating substantial errors caused by nonisothermal conditions between the leaf and the psychrometer. Explicit correction for these errors is not possible with the current psychrometer design. PMID:16663701
A measurement error model for physical activity level as measured by a questionnaire with application to the 1999-2006 NHANES questionnaire.

PubMed

Tooze, Janet A; Troiano, Richard P; Carroll, Raymond J; Moshfegh, Alanna J; Freedman, Laurence S

2013-06-01

Systematic investigations into the structure of measurement error of physical activity questionnaires are lacking. We propose a measurement error model for a physical activity questionnaire that uses physical activity level (the ratio of total energy expenditure to basal energy expenditure) to relate questionnaire-based reports of physical activity level to true physical activity levels. The 1999-2006 National Health and Nutrition Examination Survey physical activity questionnaire was administered to 433 participants aged 40-69 years in the Observing Protein and Energy Nutrition (OPEN) Study (Maryland, 1999-2000). Valid estimates of participants' total energy expenditure were also available from doubly labeled water, and basal energy expenditure was estimated from an equation; the ratio of those measures estimated true physical activity level ("truth"). We present a measurement error model that accommodates the mixture of errors that arise from assuming a classical measurement error model for doubly labeled water and a Berkson error model for the equation used to estimate basal energy expenditure. The method was then applied to the OPEN Study. Correlations between the questionnaire-based physical activity level and truth were modest (r = 0.32-0.41); attenuation factors (0.43-0.73) indicate that the use of questionnaire-based physical activity level would lead to attenuated estimates of effect size. Results suggest that sample sizes for estimating relationships between physical activity level and disease should be inflated, and that regression calibration can be used to provide measurement error-adjusted estimates of relationships between physical activity and disease.
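The attenuation described in this abstract can be illustrated with a small simulation. The sketch below is a simplification and not the authors' model: it assumes purely classical error (W = X + U) and leaves out the Berkson component the paper also accommodates. All names and parameter values are hypothetical.

```python
import random

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate(n=20000, beta=2.0, sigma_u=1.0, seed=0):
    """Return (naive slope, attenuation factor, calibrated slope)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]        # true exposure, variance 1
    w = [xi + rng.gauss(0.0, sigma_u) for xi in x]     # error-prone report
    y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]  # outcome
    naive = slope(w, y)
    # Attenuation factor var(X) / (var(X) + var(U)), known here by construction.
    lam = 1.0 / (1.0 + sigma_u ** 2)
    return naive, lam, naive / lam

if __name__ == "__main__":
    naive, lam, corrected = simulate()
    print(f"naive slope={naive:.3f}, attenuation={lam:.2f}, corrected={corrected:.3f}")
```

In practice the attenuation factor is not known and has to be estimated from a validation or calibration study, which is the role doubly labeled water plays in the abstract.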
The reliability of non-invasive biophysical outcome measures for evaluating normal and hyperkeratotic foot skin.

PubMed

Hashmi, Farina; Wright, Ciaran; Nester, Christopher; Lam, Sharon

2015-01-01

Hyperkeratosis of foot skin is a common skin problem affecting people of different ages. The clinical presentation of this condition can range from dry flaky skin, which can lead to fissures, to hard callused skin which is often painful and debilitating. The purpose of this study was to test the reliability of certain non-invasive skin measurement devices on foot skin in normal and hyperkeratotic states, with a view to confirming their use as quantitative outcome measures in future clinical trials. Twelve healthy adult participants with a range of foot skin conditions (xerotic skin, heel fissures and plantar calluses) were recruited to the study. Measurements of normal and hyperkeratotic skin sites were taken using the following devices: Corneometer® CM 825, Cutometer® 580 MPA, Reviscometer® RVM 600, Visioline® VL 650 Quantiride® and Visioscan® VC 98, by two investigators on two consecutive days. The intra and inter rater reliability and standard error of measurement for each device was calculated. The data revealed the majority of the devices to be reliable measurement tools for normal and hyperkeratotic foot skin (ICC values > 0.6). The surface evaluation parameters for skin: SEsc and SEsm have greater reliability compared to the SEr measure. The Cutometer® is sensitive to soft tissue movement within the probe, therefore measurement of plantar soft tissue areas should be approached with caution. Reviscometer® measures on callused skin demonstrated an unusually high degree of error. These results confirm the intra and inter rater reliability of the Corneometer®, Cutometer®, Visioline® and Visioscan® in quantifying specific foot skin biophysical properties.
Computational fluid dynamics analysis and experimental study of a low measurement error temperature sensor used in climate observation.

PubMed

Yang, Jie; Liu, Qingquan; Dai, Wei

2017-02-01

To improve the air temperature observation accuracy, a low measurement error temperature sensor is proposed. A computational fluid dynamics (CFD) method is implemented to obtain temperature errors under various environmental conditions. Then, a temperature error correction equation is obtained by fitting the CFD results using a genetic algorithm method. The low measurement error temperature sensor, a naturally ventilated radiation shield, a thermometer screen, and an aspirated temperature measurement platform are characterized in the same environment to conduct the intercomparison. The aspirated platform served as an air temperature reference. The mean temperature errors of the naturally ventilated radiation shield and the thermometer screen are 0.74 °C and 0.37 °C, respectively. In contrast, the mean temperature error of the low measurement error temperature sensor is 0.11 °C. The mean absolute error and the root mean square error between the corrected results and the measured results are 0.008 °C and 0.01 °C, respectively. The correction equation allows the temperature error of the low measurement error temperature sensor to be reduced by approximately 93.8%. The low measurement error temperature sensor proposed in this research may be helpful to provide a relatively accurate air temperature result.
Prompt and Rater Effects in Second Language Writing Performance Assessment

ERIC Educational Resources Information Center

Lim, Gad S.

2009-01-01

Performance assessments have become the norm for evaluating language learners' writing abilities in international examinations of English proficiency. Two aspects of these assessments are usually systematically varied: test takers respond to different prompts, and their responses are read by different raters. This raises the possibility of undue…
Use of e-rater[R] in Scoring of the TOEFL iBT[R] Writing Test. Research Report. ETS RR-11-25

ERIC Educational Resources Information Center

Haberman, Shelby J.

2011-01-01

Alternative approaches are discussed for use of e-rater[R] to score the TOEFL iBT[R] Writing test. These approaches involve alternate criteria. In the 1st approach, the predicted variable is the expected rater score of the examinee's 2 essays. In the 2nd approach, the predicted variable is the expected rater score of 2 essay responses by the…

Causal inference with measurement error in outcomes: Bias analysis and estimation methods.

PubMed

Shu, Di; Yi, Grace Y

2017-01-01

Inverse probability weighting estimation has been popularly used to consistently estimate the average treatment effect. Its validity, however, is challenged by the presence of error-prone variables. In this paper, we explore the inverse probability weighting estimation with mismeasured outcome variables. We study the impact of measurement error for both continuous and discrete outcome variables and reveal interesting consequences of the naive analysis which ignores measurement error. When a continuous outcome variable is mismeasured under an additive measurement error model, the naive analysis may still yield a consistent estimator; when the outcome is binary, we derive the asymptotic bias in a closed-form. Furthermore, we develop consistent estimation procedures for practical scenarios where either validation data or replicates are available. With validation data, we propose an efficient method for estimation of average treatment effect; the efficiency gain is substantial relative to usual methods of using validation data. To provide protection against model misspecification, we further propose a doubly robust estimator which is consistent even when either the treatment model or the outcome model is misspecified. Simulation studies are reported to assess the performance of the proposed methods. An application to a smoking cessation dataset is presented.
Integration of Error Compensation of Coordinate Measuring Machines into Feature Measurement: Part II—Experimental Implementation

PubMed Central

Calvo, Roque; D’Amato, Roberto; Gómez, Emilio; Domingo, Rosario

2016-01-01

Coordinate measuring machines (CMM) are main instruments of measurement in laboratories and in industrial quality control. A compensation error model has been formulated (Part I). It integrates error and uncertainty in the feature measurement model. Experimental implementation for the verification of this model is carried out based on the direct testing on a moving bridge CMM. The regression results by axis are quantified and compared to CMM indication with respect to the assigned values of the measurand. Next, testing of selected measurements of length, flatness, dihedral angle, and roundness features are accomplished. The measurement of calibrated gauge blocks for length or angle, flatness verification of the CMM granite table and roundness of a precision glass hemisphere are presented under a setup of repeatability conditions. The results are analysed and compared with alternative methods of estimation. The overall performance of the model is endorsed through experimental verification, as well as the practical use and the model capability to contribute in the improvement of current standard CMM measuring capabilities. PMID:27754441
The effect of misclassification errors on case mix measurement.

PubMed

Sutherland, Jason M; Botz, Chas K

2006-12-01

Case mix systems have been implemented for hospital reimbursement and performance measurement across Europe and North America. Case mix categorizes patients into discrete groups based on clinical information obtained from patient charts in an attempt to identify clinical or cost differences amongst these groups. The diagnosis related group (DRG) case mix system is the most common methodology, with variants adopted in many countries. External validation studies of coding quality have confirmed that widespread variability exists between originally recorded diagnoses and re-abstracted clinical information. DRG assignment errors in hospitals that share patient level cost data for the purpose of establishing cost weights affect cost weight accuracy. The purpose of this study is to estimate bias in cost weights due to measurement error of reported clinical information. DRG assignment error rates are simulated based on recent clinical re-abstraction study results. Our simulation study estimates that 47% of cost weights representing the least severe cases are over weight by 10%, while 32% of cost weights representing the most severe cases are under weight by 10%. Applying the simulated weights to a cross-section of hospitals, we find that teaching hospitals tend to be under weight. Since inaccurate cost weights challenge the ability of case mix systems to accurately reflect patient mix and may lead to potential distortions in hospital funding, bias in hospital case mix measurement highlights the role clinical data quality plays in hospital funding in countries that use DRG-type case mix systems. Quality of clinical information should be carefully considered from hospitals that contribute financial data for establishing cost weights.
On using summary statistics from an external calibration sample to correct for covariate measurement error.

PubMed

Guo, Ying; Little, Roderick J; McConnell, Daniel S

2012-01-01

Covariate measurement error is common in epidemiologic studies. Current methods for correcting measurement error with information from external calibration samples are insufficient to provide valid adjusted inferences. We consider the problem of estimating the regression of an outcome Y on covariates X and Z, where Y and Z are observed, X is unobserved, but a variable W that measures X with error is observed. Information about measurement error is provided in an external calibration sample where data on X and W (but not Y and Z) are recorded. We describe a method that uses summary statistics from the calibration sample to create multiple imputations of the missing values of X in the regression sample, so that the regression coefficients of Y on X and Z and associated standard errors can be estimated using simple multiple imputation combining rules, yielding valid statistical inferences under the assumption of a multivariate normal distribution. The proposed method is shown by simulation to provide better inferences than existing methods, namely the naive method, classical calibration, and regression calibration, particularly for correction for bias and achieving nominal confidence levels. We also illustrate our method with an example using linear regression to examine the relation between serum reproductive hormone concentrations and bone mineral density loss in midlife women in the Michigan Bone Health and Metabolism Study. Existing methods fail to adjust appropriately for bias due to measurement error in the regression setting, particularly when measurement error is substantial. The proposed method corrects this deficiency.
Inter-rater reliability of h-index scores calculated by Web of Science and Scopus for clinical epidemiology scientists.

PubMed

Walker, Benjamin; Alavifard, Sepand; Roberts, Surain; Lanes, Andrea; Ramsay, Tim; Boet, Sylvain

2016-06-01

We investigated the inter-rater reliability of Web of Science (WoS) and Scopus when calculating the h-index of 25 senior scientists in the Clinical Epidemiology Program of the Ottawa Hospital Research Institute. Bibliometric information and the h-indices for the subjects were computed by four raters using the automatic calculators in WoS and Scopus. Correlation and agreement between ratings was assessed using Spearman's correlation coefficient and a Bland-Altman plot, respectively. Data could not be gathered from Google Scholar due to feasibility constraints. The Spearman's rank correlation between the h-index of scientists calculated with WoS was 0.81 (95% CI 0.72-0.92) and with Scopus was 0.95 (95% CI 0.92-0.99). The Bland-Altman plot showed no significant rater bias in WoS and Scopus; however, the agreement between ratings is higher in Scopus compared to WoS. Our results showed a stronger relationship and increased agreement between raters when calculating the h-index of a scientist using Scopus compared to WoS. The higher inter-rater reliability and simple user interface used in Scopus may render it the more effective database when calculating the h-index of senior scientists in epidemiology. © 2016 Health Libraries Group.
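The Bland-Altman summary mentioned in this abstract reduces to a mean difference (bias) and limits of agreement. The sketch below is a minimal two-rater version with made-up h-index values. The study itself involved four raters, so this is an illustration of the technique and not a reproduction of its analysis.

```python
import statistics

def bland_altman(rater_a, rater_b, z=1.96):
    """Return (bias, lower limit, upper limit) for paired ratings."""
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - z * sd, bias + z * sd

if __name__ == "__main__":
    # Hypothetical h-index values for the same scientists from two raters.
    a = [10, 12, 14, 20, 25]
    b = [9, 12, 15, 21, 24]
    print(bland_altman(a, b))
```

A bias near zero with narrow limits suggests the raters can be used interchangeably. A wide interval indicates disagreement even when the rank correlation is high.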
Reliability and validity of an iPhone® application for the measurement of lumbar spine flexion and extension range of motion

PubMed Central

Pourahmadi, Mohammad Reza; Jannati, Elham; Mohseni-Bandpei, Mohammad Ali; Ebrahimi Takamjani, Ismail; Rajabzadeh, Fatemeh

2016-01-01

reliability, respectively. The Pearson correlation coefficients were used to establish concurrent validity of the iPhone® app. Furthermore, minimum detectable change at the 95% confidence level (MDC95) was computed as 1.96 × standard error of measurement × √2. Results Good to excellent intra-rater and inter-rater reliability were demonstrated for both the gravity-based inclinometer with ICC values of ≥0.84 and ≥0.77 and the iPhone® app with ICC values of ≥0.85 and ≥0.85, respectively. The MDC95 ranged from 5.82° to 8.18° for the intra-rater analysis and from 7.38° to 8.66° for the inter-rater analysis. The concurrent validity for flexion and extension between the 2 instruments was 0.85 and 0.91, respectively. Conclusions The iPhone® app possesses good to excellent intra-rater and inter-rater reliability and concurrent validity. It seems that the iPhone® app can be used for the measurement of lumbar spine flexion–extension ROM. Level of evidence IIb. PMID:27635328
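The MDC95 formula quoted in this abstract (1.96 × SEM × √2) is easy to express in code. In the sketch below, the helper that derives SEM from an ICC uses the common convention SEM = SD × √(1 − ICC). That convention is an assumption, since the abstract does not say how SEM was obtained, and the function names and example numbers are hypothetical.

```python
import math

def sem_from_icc(sd, icc):
    """Standard error of measurement via SD * sqrt(1 - ICC) (assumed convention)."""
    return sd * math.sqrt(1.0 - icc)

def mdc(sem, z=1.96):
    """Minimum detectable change: z * SEM * sqrt(2); z=1.96 gives MDC95."""
    return z * sem * math.sqrt(2.0)

if __name__ == "__main__":
    sem = sem_from_icc(sd=10.0, icc=0.84)  # illustrative values, not study data
    print(f"SEM = {sem:.2f} deg, MDC95 = {mdc(sem):.2f} deg")
```

Passing z=1.645 gives the corresponding 90% value (MDC90).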
Continuous glucose monitoring in newborn infants: how do errors in calibration measurements affect detected hypoglycemia?

PubMed

Thomas, Felicity; Signal, Mathew; Harris, Deborah L; Weston, Philip J; Harding, Jane E; Shaw, Geoffrey M; Chase, J Geoffrey

2014-05-01

Neonatal hypoglycemia is common and can cause serious brain injury. Continuous glucose monitoring (CGM) could improve hypoglycemia detection, while reducing blood glucose (BG) measurements. Calibration algorithms use BG measurements to convert sensor signals into CGM data. Thus, inaccuracies in calibration BG measurements directly affect CGM values and any metrics calculated from them. The aim was to quantify the effect of timing delays and calibration BG measurement errors on hypoglycemia metrics in newborn infants. Data from 155 babies were used. Two timing and 3 BG meter error models (Abbott Optium Xceed, Roche Accu-Chek Inform II, Nova Statstrip) were created using empirical data. Monte-Carlo methods were employed, and each simulation was run 1000 times. Each set of patient data in each simulation had randomly selected timing and/or measurement error added to BG measurements before CGM data were calibrated. The number of hypoglycemic events, duration of hypoglycemia, and hypoglycemic index were then calculated using the CGM data and compared to baseline values. Timing error alone had little effect on hypoglycemia metrics, but measurement error caused substantial variation. Abbott results underreported the number of hypoglycemic events by up to 8 and Roche overreported by up to 4 where the original number reported was 2. Nova results were closest to baseline. Similar trends were observed in the other hypoglycemia metrics. Errors in blood glucose concentration measurements used for calibration of CGM devices can have a clinically important impact on detection of hypoglycemia. If CGM devices are going to be used for assessing hypoglycemia it is important to understand the impact of these errors on CGM data. © 2014 Diabetes Technology Society.
Volumetric error modeling, identification and compensation based on screw theory for a large multi-axis propeller-measuring machine

NASA Astrophysics Data System (ADS)

Zhong, Xuemin; Liu, Hongqi; Mao, Xinyong; Li, Bin; He, Songping; Peng, Fangyu

2018-05-01

Large multi-axis propeller-measuring machines have two types of geometric error, position-independent geometric errors (PIGEs) and position-dependent geometric errors (PDGEs), which both have significant effects on the volumetric error of the measuring tool relative to the worktable. This paper focuses on modeling, identifying and compensating for the volumetric error of the measuring machine. A volumetric error model in the base coordinate system is established based on screw theory considering all the geometric errors. In order to fully identify all the geometric error parameters, a new method for systematic measurement and identification is proposed. All the PIGEs of adjacent axes and the six PDGEs of the linear axes are identified with a laser tracker using the proposed model. Finally, a volumetric error compensation strategy is presented and an inverse kinematic solution for compensation is proposed. The final measuring and compensation experiments have further verified the efficiency and effectiveness of the measuring and identification method, indicating that the method can be used in volumetric error compensation for large machine tools.
Effects of Rater Characteristics and Scoring Methods on Speaking Assessment

ERIC Educational Resources Information Center

Matsugu, Sawako

2013-01-01

Understanding the sources of variance in speaking assessment is important in Japan where society's high demand for English speaking skills is growing. Three challenges threaten fair assessment of speaking. First, in Japanese university speaking courses, teachers are typically the only raters, but teachers' knowledge of their students may unfairly…
Exploring the Impact of Mental Workload on Rater-Based Assessments

ERIC Educational Resources Information Center

Tavares, Walter; Eva, Kevin W.

2013-01-01

When appraising the performance of others, assessors must acquire relevant information and process it in a meaningful way in order to translate it effectively into ratings, comments, or judgments about how well the performance meets appropriate standards. Rater-based assessment strategies in health professional education, including scale and…
Exploring Examiner Judgement of Professional Competence in Rater Based Assessment

ERIC Educational Resources Information Center

Naumann, Fiona L.; Marshall, Stephen; Shulruf, Boaz; Jones, Philip D.

2016-01-01

Exercise physiology courses have transitioned to competency based, forcing Universities to rethink assessment to ensure students are competent to practice. This study built on earlier research to explore rater cognition, capturing factors that contribute to assessor decision making about students' competency. The aims were to determine the source…
Evaluating the Construct-Coverage of the e-rater[R] Scoring Engine. Research Report. ETS RR-09-01

ERIC Educational Resources Information Center

Quinlan, Thomas; Higgins, Derrick; Wolff, Susanne

2009-01-01

This report evaluates the construct coverage of the e-rater[R] scoring engine. The matter of construct coverage depends on whether one defines writing skill in terms of process or product. Originally, the e-rater engine consisted of a large set of components with a proven ability to predict human holistic scores. By organizing these capabilities…
Error of the slanted edge method for measuring the modulation transfer function of imaging systems.

PubMed

Xie, Xufen; Fan, Hongda; Wang, Hongyuan; Wang, Zebin; Zou, Nianyu

2018-03-01

The slanted edge method is a basic approach for measuring the modulation transfer function (MTF) of imaging systems; however, its measurement accuracy is limited in practice. Theoretical analysis of the slanted edge MTF measurement method performed in this paper reveals that inappropriate edge angles and random noise reduce this accuracy. The error caused by edge angles is analyzed using sampling and reconstruction theory. Furthermore, an error model combining noise and edge angles is proposed. We verify the analyses and model with respect to (i) the edge angle, (ii) a statistical analysis of the measurement error, (iii) the full width at half-maximum of a point spread function, and (iv) the error model. The experimental results verify the theoretical findings. This research can be referential for applications of the slanted edge MTF measurement method.
Do you see what I see? Mobile eye-tracker contextual analysis and inter-rater reliability.

PubMed

Stuart, S; Hunt, D; Nell, J; Godfrey, A; Hausdorff, J M; Rochester, L; Alcock, L

2018-02-01

Mobile eye-trackers are currently used during real-world tasks (e.g. gait) to monitor visual and cognitive processes, particularly in ageing and Parkinson's disease (PD). However, contextual analysis involving fixation locations during such tasks is rarely performed due to its complexity. This study adapted a validated algorithm and developed a classification method to semi-automate contextual analysis of mobile eye-tracking data. We further assessed inter-rater reliability of the proposed classification method. A mobile eye-tracker recorded eye-movements during walking in five healthy older adult controls (HC) and five people with PD. Fixations were identified using a previously validated algorithm, which was adapted to provide still images of fixation locations (n = 116). The fixation location was manually identified by two raters (DH, JN), who classified the locations. Cohen's kappa correlation coefficients determined the inter-rater reliability. The algorithm successfully provided still images for each fixation, allowing manual contextual analysis to be performed. The inter-rater reliability for classifying the fixation location was high for both PD (kappa = 0.80, 95% agreement) and HC groups (kappa = 0.80, 91% agreement), which indicated a reliable classification method. This study developed a reliable semi-automated contextual analysis method for gait studies in HC and PD. Future studies could adapt this methodology for various gait-related eye-tracking studies.
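Cohen's kappa, the agreement statistic used in this abstract, corrects raw percent agreement for the agreement expected by chance. The sketch below is a minimal implementation for two raters assigning categorical labels. The fixation categories in the example are invented for illustration and are not the study's data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / n ** 2
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_obs - p_exp) / (1.0 - p_exp)

if __name__ == "__main__":
    a = ["floor", "floor", "wall", "wall"]
    b = ["floor", "floor", "wall", "floor"]
    print(cohens_kappa(a, b))
```

Because kappa depends on the marginal label frequencies, two groups can share the same kappa while differing in raw percent agreement, as the PD and HC figures reported here do.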
Error analysis for the ground-based microwave ozone measurements during STOIC

NASA Technical Reports Server (NTRS)

Connor, Brian J.; Parrish, Alan; Tsou, Jung-Jung; McCormick, M. Patrick

1995-01-01

We present a formal error analysis and characterization of the microwave measurements made during the Stratospheric Ozone Intercomparison Campaign (STOIC). The most important error sources are found to be determination of the tropospheric opacity, the pressure-broadening coefficient of the observed line, and systematic variations in instrument response as a function of frequency ('baseline'). Net precision is 4-6% between 55 and 0.2 mbar, while accuracy is 6-10%. Resolution is 8-10 km below 3 mbar and increases to 17km at 0.2 mbar. We show the 'blind' microwave measurements from STOIC and make limited comparisons to other measurements. We use the averaging kernels of the microwave measurement to eliminate resolution and a priori effects from a comparison to SAGE 2. The STOIC results and comparisons are broadly consistent with the formal analysis.
Psychrometric Measurement of Leaf Water Potential: Lack of Error Attributable to Leaf Permeability.

PubMed

Barrs, H D

1965-07-02

A report that low permeability could cause gross errors in psychrometric determinations of water potential in leaves has not been confirmed. No measurable error from this source could be detected for either of two types of thermocouple psychrometer tested on four species, each at four levels of water potential. No source of error other than tissue respiration could be demonstrated.
Expert and Naive Raters Using the PAQ: Does it Matter?

ERIC Educational Resources Information Center

Cornelius, Edwin T.; And Others

1984-01-01

Questions the observed correlation between job experts and naive raters using the Position Analysis Questionnaire (PAQ); and conducts a replication of the Smith and Hakel study (1979) with college students (N=39). Concluded that PAQ ratings from job experts and college students are not equivalent and therefore are not interchangeable. (LLL)
Unaccounted source of systematic errors in measurements of the Newtonian gravitational constant G

NASA Astrophysics Data System (ADS)

DeSalvo, Riccardo

2015-06-01

Many precision measurements of G have produced a spread of results incompatible with measurement errors. Clearly an unknown source of systematic errors is at work. It is proposed here that most of the discrepancies derive from subtle deviations from Hooke's law, caused by avalanches of entangled dislocations. The idea is supported by deviations from linearity reported by experimenters measuring G, similarly to what is observed, on a larger scale, in low-frequency spring oscillators. Some mitigating experimental apparatus modifications are suggested.
CPS-Rater: Automated Sequential Annotation for Conversations in Collaborative Problem-Solving Activities. Research Report. ETS RR-17-58

ERIC Educational Resources Information Center

Hao, Jiangang; Chen, Lei; Flor, Michael; Liu, Lei; von Davier, Alina A.

2017-01-01

Conversations in collaborative problem-solving activities can be used to probe the collaboration skills of the team members. Annotating the conversations into different collaboration skills by human raters is laborious and time consuming. In this report, we report our work on developing an automated annotation system, CPS-rater, for conversational…
Errors induced by catalytic effects in premixed flame temperature measurements

NASA Astrophysics Data System (ADS)

Pita, G. P. A.; Nina, M. N. R.

The evaluation of instantaneous temperature in a premixed flame using fine-wire Pt/Pt-(13 pct)Rh thermocouples was found to be subject to significant errors due to catalytic effects. An experimental study was undertaken to assess the influence of local fuel/air ratio, thermocouple wire diameter, and gas velocity on the thermocouple reading errors induced by the catalytic surface reactions. Measurements made with both coated and uncoated thermocouples showed that the catalytic effect imposes severe limitations on the accuracy of mean and fluctuating gas temperature in the radical-rich flame zone.

[Intra-rater Reliability for the Questionnaire on Activity Limitations and Participation Restrictions of Children With ADHD].

PubMed

Salamanca Duque, Luisa Matilde; Naranjo Aristizábal, María Mercedes; Gutiérrez Ríos, Gladys Helena; Prieto, Jaime Bayona

2014-03-01

Questionnaires for evaluating activity limitations and participation restrictions in children with ADHD (CLARP-TDAH) have recently been developed in Colombia, based on the suggestions made by the WHO from the International Classification of Functioning, Disability and Health (ICF), allowing clinical evaluation beyond an evaluation of the functionality and functioning of children in their family and school environments. Previous research found the questionnaire useful in the multidisciplinary approach to Colombian children with ADHD. This study determines the level of intra-rater reliability for the CLARP-TDAH Parents and Teachers questionnaires. The study included a non-random sample of 203 Colombian children attending school and diagnosed with ADHD. Intra-rater reliability and the reproducibility of the results were determined using the Kappa index. The informants were parents and teachers. Kappa values >0.7 were obtained for the intra-rater reliability of the questionnaire domains of CLARP-TDAH Parents, while for the CLARP-TDAH Teachers domains these values were >0.8. The CLARP-TDAH questionnaires are a tool with a good level of intra-rater reliability, which allows a reliable assessment of activity limitations and participation restrictions in order to determine the level of functioning at home and school. Copyright © 2014 Asociación Colombiana de Psiquiatría. Publicado por Elsevier España. All rights reserved.
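The Kappa index mentioned in this record can be illustrated with a minimal sketch. It assumes the common two-occasion form (Cohen's kappa) applied to one informant's categorical answers at test and retest; the function name and the sample ratings are hypothetical and are not taken from the study.

```python
from collections import Counter

def cohen_kappa(first, second):
    """Cohen's kappa for two equally long sequences of categorical ratings."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first)
    # Observed agreement: share of items rated identically on both occasions.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of the marginal proportions, summed over categories.
    c1, c2 = Counter(first), Counter(second)
    expected = sum(c1[c] * c2[c] for c in c1.keys() | c2.keys()) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)

# Hypothetical test/retest answers from a single informant.
test_answers = ["mild", "mild", "none", "none"]
retest_answers = ["mild", "mild", "none", "mild"]
print(round(cohen_kappa(test_answers, retest_answers), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over percent agreement for categorical questionnaire items.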
Accounting for the measurement error of spectroscopically inferred soil carbon data for improved precision of spatial predictions.

PubMed

Somarathna, P D S N; Minasny, Budiman; Malone, Brendan P; Stockmann, Uta; McBratney, Alex B

2018-08-01

Spatial modelling of environmental data commonly only considers spatial variability as the single source of uncertainty. In reality, however, the measurement errors should also be accounted for. In recent years, infrared spectroscopy has been shown to offer low cost, yet invaluable information needed for digital soil mapping at meaningful spatial scales for land management. However, spectrally inferred soil carbon data are known to be less accurate compared to laboratory analysed measurements. This study establishes a methodology to filter out the measurement error variability by incorporating the measurement error variance in the spatial covariance structure of the model. The study was carried out in the Lower Hunter Valley, New South Wales, Australia where a combination of laboratory measured, and vis-NIR and MIR inferred topsoil and subsoil soil carbon data are available. We investigated the applicability of residual maximum likelihood (REML) and Markov Chain Monte Carlo (MCMC) simulation methods to generate parameters of the Matérn covariance function directly from the data in the presence of measurement error. The results revealed that the measurement error can be effectively filtered out through the proposed technique. When the measurement error was filtered from the data, the prediction variance almost halved, which ultimately yielded a greater certainty in spatial predictions of soil carbon. Further, the MCMC technique was successfully used to define the posterior distribution of measurement error. This is an important outcome, as the MCMC technique can be used to estimate the measurement error if it is not explicitly quantified. Although this study dealt with soil carbon data, this method is amenable for filtering the measurement error of any kind of continuous spatial environmental data. Copyright © 2018 Elsevier B.V. All rights reserved.
Reliability and concurrent validity of a new iPhone® goniometric application for measuring active wrist range of motion: a cross-sectional study in asymptomatic subjects.

PubMed

Pourahmadi, Mohammad Reza; Ebrahimi Takamjani, Ismail; Sarrafzadeh, Javad; Bahramian, Mehrdad; Mohseni-Bandpei, Mohammad Ali; Rajabzadeh, Fatemeh; Taghipour, Morteza

2017-03-01

Measurement of wrist range of motion (ROM) is often considered to be an essential component of wrist physical examination. The measurement can be carried out through various instruments such as goniometers and inclinometers. Recent smartphones have been equipped with accelerometers and magnetometers, which, through specific software applications (apps) can be used for goniometric functions. This study, for the first time, aimed to evaluate the reliability and concurrent validity of a new smartphone goniometric app (Goniometer Pro©) for measuring active wrist ROM. In all, 120 wrists of 70 asymptomatic adults (38 men and 32 women; aged 18-40 years) were assessed in a physiotherapy clinic located at the School of Rehabilitation Sciences, Iran University of Medical Science and Health Services, Tehran, Iran. Following the recruitment process, active wrist ROM was measured using a universal goniometer and iPhone® 5 app. Two blinded examiners each utilized the universal goniometer and iPhone® to measure active wrist ROM using a volar/dorsal alignment technique in the following sequences: flexion, extension, radial deviation, and ulnar deviation. The second (2 h later) and third (48 h later) sessions were carried out in the same manner as the first session. All the measurements were conducted three times and the mean value of three repetitions for each measurement was used for analysis. Intraclass correlation coefficient (ICC) models (3, k) and (2, k) were used to determine the intra-rater and inter-rater reliability, respectively. The Pearson correlation coefficients were used to establish concurrent validity of the iPhone® app. Good to excellent intra-rater and inter-rater reliability was demonstrated for the goniometer with ICC values of ≥ 0.82 and ≥ 0.73 and the iPhone® app with ICC values of ≥ 0.83 and ≥ 0.79, respectively. Minimum detectable change at the 95% confidence level (MDC95) was computed as 1.96 × standard error of measurement
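The ICC models (3, k) and (2, k) named in this record can be sketched from the mean squares of a two-way subjects-by-raters table. This is a minimal illustration using the standard average-measures formulas; the function name and the small ratings table are assumptions for demonstration and are not data from the study.

```python
def icc_two_way(scores):
    """Return (ICC(2,k), ICC(3,k)) for a table of n subjects (rows) by k raters (columns)."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)

    bms = ss_rows / (n - 1)                                      # between-subjects mean square
    jms = ss_cols / (k - 1)                                      # between-raters mean square
    ems = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))   # residual mean square

    icc2k = (bms - ems) / (bms + (jms - ems) / n)  # raters treated as random
    icc3k = (bms - ems) / bms                      # raters treated as fixed
    return icc2k, icc3k

# Illustrative table: 6 subjects rated by 4 raters.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc2k, icc3k = icc_two_way(ratings)
print(f"ICC(2,k) = {icc2k:.2f}, ICC(3,k) = {icc3k:.2f}")
```

ICC(2,k) penalizes systematic differences between raters, whereas ICC(3,k) does not, so the former is the usual choice when the raters are meant to stand for a wider population of raters.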
And the Winner Is … : Inter-Rater Reliability among Scholarship Assessors

ERIC Educational Resources Information Center

Johnston, Lucy; Schluter, Philip J.

2017-01-01

With increasing competition for postgraduate research scholarships, awarding processes demand attention and scrutiny. We examine inter-rater reliability for two prestigious New Zealand scholarships, the Shirtcliffe Fellowship and the Gordon Watson Scholarship. For each scholarship, five assessors (three academic; two non-academic) independently…
Rain radar measurement error estimation using data assimilation in an advection-based nowcasting system

NASA Astrophysics Data System (ADS)

Merker, Claire; Ament, Felix; Clemens, Marco

2017-04-01

The quantification of measurement uncertainty for rain radar data remains challenging. Radar reflectivity measurements are affected, amongst other things, by calibration errors, noise, blocking and clutter, and attenuation. Their combined impact on measurement accuracy is difficult to quantify due to incomplete process understanding and complex interdependencies. An improved quality assessment of rain radar measurements is of interest for applications both in meteorology and hydrology, for example for precipitation ensemble generation, rainfall runoff simulations, or in data assimilation for numerical weather prediction. Especially a detailed description of the spatial and temporal structure of errors is beneficial in order to make best use of the areal precipitation information provided by radars. Radar precipitation ensembles are one promising approach to represent spatially variable radar measurement errors. We present a method combining ensemble radar precipitation nowcasting with data assimilation to estimate radar measurement uncertainty at each pixel. This combination of ensemble forecast and observation yields a consistent spatial and temporal evolution of the radar error field. We use an advection-based nowcasting method to generate an ensemble reflectivity forecast from initial data of a rain radar network. Subsequently, reflectivity data from single radars is assimilated into the forecast using the Local Ensemble Transform Kalman Filter. The spread of the resulting analysis ensemble provides a flow-dependent, spatially and temporally correlated reflectivity error estimate at each pixel. We will present first case studies that illustrate the method using data from a high-resolution X-band radar network.
Generalized Structured Component Analysis with Uniqueness Terms for Accommodating Measurement Error

PubMed Central

Hwang, Heungsun; Takane, Yoshio; Jung, Kwanghee

2017-01-01

Generalized structured component analysis (GSCA) is a component-based approach to structural equation modeling (SEM), where latent variables are approximated by weighted composites of indicators. It has no formal mechanism to incorporate errors in indicators, which in turn renders components prone to the errors as well. We propose to extend GSCA to account for errors in indicators explicitly. This extension, called GSCAM, considers both common and unique parts of indicators, as postulated in common factor analysis, and estimates a weighted composite of indicators with their unique parts removed. Adding such unique parts or uniqueness terms serves to account for measurement errors in indicators in a manner similar to common factor analysis. Simulation studies are conducted to compare parameter recovery of GSCAM and existing methods. These methods are also applied to fit a substantively well-established model to real data. PMID:29270146
The Outdoor MEDIA DOT: The Development and Inter-Rater Reliability of a Tool Designed to Measure Food and Beverage Outlets and Outdoor Advertising

PubMed Central

Poulos, Natalie S.; Pasch, Keryn E.

2015-01-01

Few studies of the food environment have collected primary data, and even fewer have reported reliability of the tool used. This study focused on the development of an innovative electronic data collection tool used to document outdoor food and beverage (FB) advertising and establishments near 43 middle and high schools in the Outdoor MEDIA Study. Tool development used GIS based mapping, an electronic data collection form on handheld devices, and an easily adaptable interface to efficiently collect primary data within the food environment. For the reliability study, two teams of data collectors documented all FB advertising and establishments within one half-mile of six middle schools. Inter-rater reliability was calculated overall and by advertisement or establishment category using percent agreement. A total of 824 advertisements (n=233), establishment advertisements (n=499), and establishments (n=92) were documented (range=8–229 per school). Overall inter-rater reliability of the developed tool ranged from 69–89% for advertisements and establishments. Results suggest that the developed tool is highly reliable and effective for documenting the outdoor FB environment. PMID:26022774
High dimensional linear regression models under long memory dependence and measurement error

NASA Astrophysics Data System (ADS)

Kaul, Abhishek

This dissertation consists of three chapters. The first chapter introduces the models under consideration and motivates problems of interest. A brief literature review is also provided in this chapter. The second chapter investigates the properties of Lasso under long range dependent model errors. Lasso is a computationally efficient approach to model selection and estimation, and its properties are well studied when the regression errors are independent and identically distributed. We study the case where the regression errors form a long memory moving average process. We establish a finite sample oracle inequality for the Lasso solution. We then show the asymptotic sign consistency in this setup. These results are established in the high dimensional setup (p > n) where p can be increasing exponentially with n. Finally, we show the consistency, n^(1/2-d)-consistency of Lasso, along with the oracle property of adaptive Lasso, in the case where p is fixed. Here d is the memory parameter of the stationary error sequence. The performance of Lasso is also analysed in the present setup with a simulation study. The third chapter proposes and investigates the properties of a penalized quantile based estimator for measurement error models. Standard formulations of prediction problems in high dimension regression models assume the availability of fully observed covariates and sub-Gaussian and homogeneous model errors. This makes these methods inapplicable to measurement error models where covariates are unobservable and observations are possibly non sub-Gaussian and heterogeneous. We propose weighted penalized corrected quantile estimators for the regression parameter vector in linear regression models with additive measurement errors, where unobservable covariates are nonrandom. The proposed estimators forgo the need for the above mentioned model assumptions. We study these estimators in both the fixed dimension and high dimensional sparse setups, in the latter setup, the
Wavefront-aberration measurement and systematic-error analysis of a high numerical-aperture objective

NASA Astrophysics Data System (ADS)

Liu, Zhixiang; Xing, Tingwen; Jiang, Yadong; Lv, Baobin

2018-02-01

A two-dimensional (2-D) shearing interferometer based on an amplitude chessboard grating was designed to measure the wavefront aberration of a high numerical-aperture (NA) objective. Chessboard gratings offer better diffraction efficiencies and fewer disturbing diffraction orders than traditional cross gratings. The wavefront aberration of the tested objective was retrieved from the shearing interferogram using the Fourier transform and differential Zernike polynomial-fitting methods. Grating manufacturing errors, including the duty-cycle and pattern-deviation errors, were analyzed with the Fourier transform method. Then, according to the relation between the spherical pupil and planar detector coordinates, the influence of the distortion of the pupil coordinates was simulated. Finally, the systematic error attributable to grating alignment errors was deduced through the geometrical ray-tracing method. Experimental results indicate that the measuring repeatability (3σ) of the wavefront aberration of an objective with NA 0.4 was 3.4 mλ. The systematic-error results were consistent with previous analyses. Thus, the correct wavefront aberration can be obtained after calibration.
Considerations for analysis of time-to-event outcomes measured with error: Bias and correction with SIMEX.

PubMed

Oh, Eric J; Shepherd, Bryan E; Lumley, Thomas; Shaw, Pamela A

2018-04-15

For time-to-event outcomes, a rich literature exists on the bias introduced by covariate measurement error in regression models, such as the Cox model, and methods of analysis to address this bias. By comparison, less attention has been given to understanding the impact or addressing errors in the failure time outcome. For many diseases, the timing of an event of interest (such as progression-free survival or time to AIDS progression) can be difficult to assess or reliant on self-report and therefore prone to measurement error. For linear models, it is well known that random errors in the outcome variable do not bias regression estimates. With nonlinear models, however, even random error or misclassification can introduce bias into estimated parameters. We compare the performance of 2 common regression models, the Cox and Weibull models, in the setting of measurement error in the failure time outcome. We introduce an extension of the SIMEX method to correct for bias in hazard ratio estimates from the Cox model and discuss other analysis options to address measurement error in the response. A formula to estimate the bias induced into the hazard ratio by classical measurement error in the event time for a log-linear survival model is presented. Detailed numerical studies are presented to examine the performance of the proposed SIMEX method under varying levels and parametric forms of the error in the outcome. We further illustrate the method with observational data on HIV outcomes from the Vanderbilt Comprehensive Care Clinic. Copyright © 2017 John Wiley & Sons, Ltd.
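The remark that random error in the outcome does not bias linear regression estimates can be checked with a small simulation. This sketch covers only that linear-model statement, not the Cox or Weibull setting or the SIMEX extension; the true slope, noise levels, sample size, and seed are arbitrary assumptions.

```python
import random

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (single predictor with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)          # fixed seed so the run is repeatable
true_slope = 2.0
x = [rng.gauss(0, 1) for _ in range(20000)]
y_true = [true_slope * xi + rng.gauss(0, 1) for xi in x]
# Add independent, zero-mean measurement error to the outcome only.
y_noisy = [yi + rng.gauss(0, 2) for yi in y_true]

slope_clean = ols_slope(x, y_true)
slope_noisy = ols_slope(x, y_noisy)
print(f"slope without outcome error: {slope_clean:.3f}")
print(f"slope with outcome error:    {slope_noisy:.3f}")
```

Both estimates should sit close to the true slope; the outcome error widens the sampling variability but does not shift the estimate, which is the contrast the abstract draws with nonlinear models.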
Comparison of Algorithm-based Estimates of Occupational Diesel Exhaust Exposure to Those of Multiple Independent Raters in a Population-based Case–Control Study

PubMed Central

Friesen, Melissa C.

2013-01-01

Objectives: Algorithm-based exposure assessments based on patterns in questionnaire responses and professional judgment can readily apply transparent exposure decision rules to thousands of jobs quickly. However, we need to better understand how algorithms compare to a one-by-one job review by an exposure assessor. We compared algorithm-based estimates of diesel exhaust exposure to those of three independent raters within the New England Bladder Cancer Study, a population-based case–control study, and identified conditions under which disparities occurred in the assessments of the algorithm and the raters. Methods: Occupational diesel exhaust exposure was assessed previously using an algorithm and a single rater for all 14 983 jobs reported by 2631 study participants during personal interviews conducted from 2001 to 2004. Two additional raters independently assessed a random subset of 324 jobs that were selected based on strata defined by the cross-tabulations of the algorithm and the first rater’s probability assessments for each job, oversampling their disagreements. The algorithm and each rater assessed the probability, intensity and frequency of occupational diesel exhaust exposure, as well as a confidence rating for each metric. Agreement among the raters, their aggregate rating (average of the three raters’ ratings) and the algorithm were evaluated using proportion of agreement, kappa and weighted kappa (κw). Agreement analyses on the subset used inverse probability weighting to extrapolate the subset to estimate agreement for all jobs. Classification and Regression Tree (CART) models were used to identify patterns in questionnaire responses that predicted disparities in exposure status (i.e., unexposed versus exposed) between the first rater and the algorithm-based estimates. Results: For the probability, intensity and frequency exposure metrics, moderate to moderately high agreement was observed among raters (κw = 0.50–0.76) and between the
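Weighted kappa, one of the agreement statistics listed in this record, gives partial credit when two ordinal ratings are close but not identical. The sketch below assumes linear weights, since the weighting scheme is not stated in the abstract, and the ratings are hypothetical ordinal codes.

```python
def weighted_kappa(r1, r2, n_levels):
    """Linearly weighted kappa for ordinal ratings coded 0 .. n_levels-1."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(r1)

    def weight(i, j):
        # 1 for exact agreement, falling linearly to 0 at maximal disagreement.
        return 1.0 - abs(i - j) / (n_levels - 1)

    observed = sum(weight(a, b) for a, b in zip(r1, r2)) / n
    p1 = [r1.count(c) / n for c in range(n_levels)]
    p2 = [r2.count(c) / n for c in range(n_levels)]
    expected = sum(weight(i, j) * p1[i] * p2[j]
                   for i in range(n_levels) for j in range(n_levels))
    return (observed - expected) / (1.0 - expected)

# Hypothetical exposure-probability codes (0 = none, 1 = low, 2 = high) for four jobs.
rater = [0, 1, 2, 2]
algorithm = [0, 1, 2, 1]
print(round(weighted_kappa(rater, algorithm, 3), 3))
```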
Error analysis in stereo vision for location measurement of 3D point

NASA Astrophysics Data System (ADS)

Li, Yunting; Zhang, Jun; Tian, Jinwen

2015-12-01

Location measurement of a 3D point in stereo vision is subject to different sources of uncertainty that propagate to the final result. Most current methods of error analysis are based on an ideal intersection model that calculates the uncertainty region of the point location by intersecting the fields of view of two pixels, which may produce loose bounds. Besides, only a few sources of error, such as pixel error or camera position, are taken into account in the analysis. In this paper we present a straightforward and practical method to estimate the location error that takes most sources of error into account. We summed up and simplified all the input errors to five parameters by rotation transformation. Then we use the fast algorithm of the midpoint method to deduce the mathematical relationships between the target point and the parameters. Thus, the expectation and covariance matrix of the 3D point location are obtained, which constitute the uncertainty region of the point location. Afterwards, we trace the error propagation of the primitive input errors through the stereo system, covering the whole analysis process from primitive input errors to localization error. Our method has the same level of computational complexity as the state-of-the-art method. Finally, extensive experiments are performed to verify the performance of our method.
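The midpoint method referred to in this record can be sketched as follows: each camera defines a viewing ray, and the 3D point is taken as the midpoint of the shortest segment joining the two rays. This is a generic textbook version, not the authors' fast algorithm; the function name and the camera geometry are hypothetical.

```python
def midpoint_triangulate(p1, d1, p2, d2):
    """Midpoint of the closest points on rays p1 + s*d1 and p2 + t*d2."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w0 = [a - b for a, b in zip(p1, p2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; no unique closest point")
    # Ray parameters that minimize the distance between the two rays.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    q1 = [p + s * v for p, v in zip(p1, d1)]
    q2 = [p + t * v for p, v in zip(p2, d2)]
    return [(u + v) / 2 for u, v in zip(q1, q2)]

# Two hypothetical cameras 2 units apart, both looking at a point 10 units away.
exact = midpoint_triangulate([-1, 0, 0], [1, 0, 10], [1, 0, 0], [-1, 0, 10])
# Shifting one camera sideways makes the rays skew; the midpoint splits the gap.
skewed = midpoint_triangulate([-1, 0, 0], [1, 0, 10], [1, 0.2, 0], [-1, 0, 10])
print(exact, skewed)
```

Perturbing the inputs, as in the second call, is the simplest way to see how an input error propagates to the estimated location.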
Deriving Oral Assessment Scales across Different Tests and Rater Groups.

ERIC Educational Resources Information Center

Chalhoub-Deville, Micheline

1995-01-01

The purpose of this study was to derive the criteria/dimensions underlying learners' second-language oral ability scores across three tests: an oral interview, a narration, and a read-aloud. A stimulus tape of 18 speech samples was presented to 3 native speaker rater groups for evaluation. Results indicate that researchers might need to reconsider…
Direct measurement of the poliovirus RNA polymerase error frequency in vitro

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ward, C.D.; Stokes, M.A.M.; Flanegan, J.B.

1988-02-01

The fidelity of RNA replication by the poliovirus-RNA-dependent RNA polymerase was examined by copying homopolymeric RNA templates in vitro. The poliovirus RNA polymerase was extensively purified and used to copy poly(A), poly(C), or poly(I) templates with equimolar concentrations of noncomplementary and complementary ribonucleotides. The error frequency was expressed as the amount of a noncomplementary nucleotide incorporated divided by the total amount of complementary and noncomplementary nucleotide incorporated. The polymerase error frequencies were very high, depending on the specific reaction conditions. The activity of the polymerase on poly(U) and poly(G) was too low to measure error frequencies on these templates. A fivefold increase in the error frequency was observed when the reaction conditions were changed from 3.0 mM Mg²⁺ (pH 7.0) to 7.0 mM Mg²⁺ (pH 8.0). This increase in the error frequency correlates with an eightfold increase in the elongation rate that was observed under the same conditions in a previous study.
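The error-frequency definition given in this record is a simple ratio, sketched below. The incorporation amounts are hypothetical placeholders chosen only to show the arithmetic; no measured values from the study are reproduced.

```python
def error_frequency(noncomplementary, complementary):
    """Noncomplementary incorporation divided by total incorporation."""
    total = noncomplementary + complementary
    if total <= 0:
        raise ValueError("total incorporation must be positive")
    return noncomplementary / total

# Hypothetical incorporation amounts (same units for both, e.g. pmol).
freq = error_frequency(noncomplementary=2.0, complementary=998.0)
print(f"error frequency: {freq:.4f} (about 1 in {1 / freq:.0f})")
```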
Simulated patient training: Using inter-rater reliability to evaluate simulated patient consistency in nursing education.

PubMed

MacLean, Sharon; Geddes, Fiona; Kelly, Michelle; Della, Phillip

2018-03-01

Simulated patients (SPs) are frequently used for training nursing students in communication skills. An acknowledged benefit of using SPs is the opportunity to provide a standardized approach by which participants can demonstrate and develop communication skills. However, relatively little evidence is available on how to best facilitate and evaluate the reliability and accuracy of SPs' performances. The aim of this study is to investigate the effectiveness of an evidence-based SP training framework to ensure standardization of SPs. The training framework was employed to improve inter-rater reliability of SPs. A quasi-experimental study was employed to assess SP post-training understanding of simulation scenario parameters using inter-rater reliability agreement indices. Two phases of data collection took place. In an initial trial phase, audio-visual (AV) recordings of two undergraduate nursing students completing a simulation scenario were rated by eight SPs using the Interpersonal Communication Assessments Scale (ICAS) and Quality of Discharge Teaching Scale (QDTS). In phase 2, eight SP raters and four nursing faculty raters independently evaluated students' (N=42) communication practices using the QDTS. Intraclass correlation coefficients (ICC) were >0.80 for both stages of the study in clinical communication skills. The results support the premise that if trained appropriately, SPs have a high degree of reliability and validity to both facilitate and evaluate student performance in nurse education. Crown Copyright © 2018. Published by Elsevier Ltd. All rights reserved.
Inter-rater reliability of an observation-based ergonomics assessment checklist for office workers.

PubMed

Pereira, Michelle Jessica; Straker, Leon Melville; Comans, Tracy Anne; Johnston, Venerina

2016-12-01

To establish the inter-rater reliability of an observation-based ergonomics assessment checklist for computer workers. A 37-item (38-item if a laptop was part of the workstation) comprehensive observational ergonomics assessment checklist comparable to government guidelines and up to date with empirical evidence was developed. Two trained practitioners assessed full-time office workers performing their usual computer-based work and evaluated the suitability of workstations used. Practitioners assessed each participant consecutively. The order of assessors was randomised, and the second assessor was blinded to the findings of the first. Unadjusted kappa coefficients between the raters were obtained for the overall checklist and subsections that were formed from question-items relevant to specific workstation equipment. Twenty-seven office workers were recruited. The inter-rater reliability between two trained practitioners achieved moderate to good reliability for all except one checklist component. This checklist has mostly moderate to good reliability between two trained practitioners. Practitioner Summary: This reliable ergonomics assessment checklist for computer workers was designed using accessible government guidelines and supplemented with up-to-date evidence. Employers in Queensland (Australia) can fulfil legislative requirements by using this reliable checklist to identify and subsequently address potential risk factors for work-related injury to provide a safe working environment.
False Positives in Multiple Regression: Unanticipated Consequences of Measurement Error in the Predictor Variables

ERIC Educational Resources Information Center

Shear, Benjamin R.; Zumbo, Bruno D.

2013-01-01

Type I error rates in multiple regression, and hence the chance for false positive research findings, can be drastically inflated when multiple regression models are used to analyze data that contain random measurement error. This article shows the potential for inflated Type I error rates in commonly encountered scenarios and provides new…
Error Characterization of Altimetry Measurements at Climate Scales

NASA Astrophysics Data System (ADS)

Ablain, Michael; Larnicol, Gilles; Faugere, Yannice; Cazenave, Anny; Meyssignac, Benoit; Picot, Nicolas; Benveniste, Jerome

2013-09-01

Thanks to studies performed in the framework of the SALP project (supported by CNES) since the TOPEX era and more recently in the framework of the Sea-Level Climate Change Initiative project (supported by ESA), strong improvements have been made in the estimation of the global and regional mean sea level over the whole altimeter period for all the altimetric missions. Thanks to these efforts, a better characterization of altimeter measurement errors at climate scales has been performed and is presented in this paper. These errors have been compared to user requirements in order to determine whether scientific goals are reached by altimeter missions. The main point of this paper is the importance of strengthening the link between the altimeter and climate communities in order to improve or refine user requirements, to better specify future altimeter systems for climate applications, and also to reprocess older missions beyond their original specifications.
Joint nonparametric correction estimator for excess relative risk regression in survival analysis with exposure measurement error

PubMed Central

Wang, Ching-Yun; Cullings, Harry; Song, Xiao; Kopecky, Kenneth J.

2017-01-01

SUMMARY Observational epidemiological studies often confront the problem of estimating exposure-disease relationships when the exposure is not measured exactly. In the paper, we investigate exposure measurement error in excess relative risk regression, which is a widely used model in radiation exposure effect research. In the study cohort, a surrogate variable is available for the true unobserved exposure variable. The surrogate variable satisfies a generalized version of the classical additive measurement error model, but it may or may not have repeated measurements. In addition, an instrumental variable is available for individuals in a subset of the whole cohort. We develop a nonparametric correction (NPC) estimator using data from the subcohort, and further propose a joint nonparametric correction (JNPC) estimator using all observed data to adjust for exposure measurement error. An optimal linear combination estimator of JNPC and NPC is further developed. The proposed estimators are nonparametric, which are consistent without imposing a covariate or error distribution, and are robust to heteroscedastic errors. Finite sample performance is examined via a simulation study. We apply the developed methods to data from the Radiation Effects Research Foundation, in which chromosome aberration is used to adjust for the effects of radiation dose measurement error on the estimation of radiation dose responses. PMID:29354018
Patient motion tracking in the presence of measurement errors.

PubMed

Haidegger, Tamás; Benyó, Zoltán; Kazanzides, Peter

2009-01-01

The primary aim of computer-integrated surgical systems is to provide physicians with superior surgical tools for better patient outcome. Robotic technology is capable of both minimally invasive surgery and microsurgery, offering remarkable advantages for the surgeon and the patient. Current systems allow for sub-millimeter intraoperative spatial positioning, however certain limitations still remain. Measurement noise and unintended changes in the operating room environment can result in major errors. Positioning errors are a significant danger to patients in procedures involving robots and other automated devices. We have developed a new robotic system at the Johns Hopkins University to support cranial drilling in neurosurgery procedures. The robot provides advanced visualization and safety features. The generic algorithm described in this paper allows for automated compensation of patient motion through optical tracking and Kalman filtering. When applied to the neurosurgery setup, preliminary results show that it is possible to identify patient motion within 700 ms, and apply the appropriate compensation with an average of 1.24 mm positioning error after 2 s of setup time.
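The Kalman filtering step mentioned in this record can be illustrated with a deliberately reduced scalar filter. It assumes a random-walk model for a single position coordinate; the real system tracks motion in more dimensions and its model is not given in the abstract, so the noise variances, the function name, and the readings below are all hypothetical.

```python
def kalman_1d(measurements, process_var, meas_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a random-walk state model; returns the filtered estimates."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the position is assumed to persist, uncertainty grows.
        p = p + process_var
        # Update: blend the prediction and the measurement by the Kalman gain.
        gain = p / (p + meas_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates

# Hypothetical tracker readings (mm) after the patient shifts to a new position at 5 mm.
readings = [5.0] * 50
filtered = kalman_1d(readings, process_var=1e-4, meas_var=0.1)
print(f"first estimate: {filtered[0]:.2f} mm, last estimate: {filtered[-1]:.2f} mm")
```

The ratio of process variance to measurement variance sets the trade-off between fast detection of a real shift and suppression of measurement noise.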

Inter-Rater Reliability and Validity of the Australian Football League’s Kicking and Handball Tests

PubMed Central

Cripps, Ashley J.; Hopper, Luke S.; Joyce, Christopher

2015-01-01

validity of the handball test. Key points: The skill tests created by the AFL demonstrated acceptable levels of relative and absolute inter-rater reliability. Both of the AFL’s skills tests are able to differentiate between athletes’ dominant and non-dominant limbs. However, only the kicking test could consistently differentiate between score outcomes over a range of Australian Football specific disposal distances. Both tests demonstrated poor concurrent validity, with no correlation found between coaches’ perceptions of technical skills and actual skill outcomes measured. PMID:26336356
Inter-rater reliability of direct observations of the physical and psychosocial working conditions in eldercare: An evaluation in the DOSES project.

PubMed

Karstad, Kristina; Rugulies, Reiner; Skotte, Jørgen; Munch, Pernille Kold; Greiner, Birgit A; Burdorf, Alex; Søgaard, Karen; Holtermann, Andreas

2018-05-01

The aim of the study was to develop and evaluate the reliability of the "Danish observational study of eldercare work and musculoskeletal disorders" (DOSES) observation instrument to assess physical and psychosocial risk factors for musculoskeletal disorders (MSD) in eldercare work. During 1.5 years, sixteen raters conducted 117 inter-rater observations from 11 nursing homes. Reliability was evaluated using percent agreement and Gwet's AC1 coefficient. Of the 18 examined items, inter-rater reliability was excellent for 7 items (AC1 >0.75), fair to good for 7 items (AC1 0.40-0.75), and poor for 2 items (AC1 0-0.40). For 2 items there was no agreement between the raters (AC1 <0). The reliability did not differ between the first and second half of the data collection period and the inter-rater observations were representative regarding occurrence of events in eldercare work. The instrument is appropriate for assessing physical and psychosocial risk factors for MSD among eldercare workers. Copyright © 2018 The Authors. Published by Elsevier Ltd. All rights reserved.
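Percent agreement and Gwet's AC1, the two statistics named in this record, can be computed together for a pair of raters. The sketch uses the standard two-rater AC1 formula; the function name and the example observations are hypothetical.

```python
def gwet_ac1(r1, r2, categories):
    """Return (percent agreement, Gwet's AC1) for two raters over the given categories."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("need two non-empty sequences of equal length")
    n, q = len(r1), len(categories)
    agreement = sum(a == b for a, b in zip(r1, r2)) / n
    # Average share of each category across both raters.
    pi = [(r1.count(c) + r2.count(c)) / (2 * n) for c in categories]
    # Chance agreement under Gwet's model.
    chance = sum(p * (1 - p) for p in pi) / (q - 1)
    return agreement, (agreement - chance) / (1 - chance)

# Hypothetical paired observations of one item (1 = event observed, 0 = not observed).
rater_a = [1, 1, 0, 0]
rater_b = [1, 1, 0, 1]
pa, ac1 = gwet_ac1(rater_a, rater_b, categories=[0, 1])
print(f"percent agreement = {pa:.2f}, AC1 = {ac1:.3f}")
```

AC1 is often chosen over Cohen's kappa for observational items with very unbalanced category frequencies, where kappa can be low despite high raw agreement.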
The Reliability of Lumbar Lordosis Measurements Using a Flexible-Rule.

DTIC Science & Technology

The purpose of this study was to examine the intra-rater and inter-rater reliability of lumbar lordosis measurements taken with a flexible-rule. Two...coefficients (ICC) were used to determine the degree of agreement between measurements. The results suggest that measurements of lumbar lordosis with a
Assessment of Systematic Measurement Errors for Acoustic Travel-Time Tomography of the Atmosphere

DTIC Science & Technology

2013-01-01

measurements include assessment of the time delays in electronic circuits and mechanical hardware (e.g., drivers and microphones) of a tomography array ...hardware and electronic circuits of the tomography array and errors in synchronization of the transmitted and recorded signals. For example, if...coordinates can be as large as 30 cm. These errors are equivalent to the systematic errors in the travel times of 0.9 ms. Third, loudspeakers which are used
Analysis and compensation of synchronous measurement error for multi-channel laser interferometer

NASA Astrophysics Data System (ADS)

Du, Shengwu; Hu, Jinchun; Zhu, Yu; Hu, Chuxiong

2017-05-01

Dual-frequency laser interferometers are widely used as displacement sensors in precision motion systems to achieve nanoscale positioning or synchronization accuracy. In a multi-channel laser interferometer synchronous measurement system, signal delays differ between channels, which causes asynchronous measurement and in turn a measurement error known as synchronous measurement error (SME). Based on signal delay analysis of the measurement system, this paper presents a multi-channel SME framework for synchronous measurement and establishes a model relating SME to motion velocity. Further, a real-time compensation method for SME is proposed. This method has been verified on a self-developed laser interferometer signal processing board (SPB). The experimental results showed that, using this compensation method, at a motion velocity of 0.89 m s⁻¹, the maximum SME between two measuring channels in the SPB is 1.1 nm. This method is easier to implement and apply in engineering than the method of directly testing smaller signal delays.
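To give a feel for the scale of the numbers in this record, here is a rough sketch that assumes the simplest first-order relation between SME and velocity: the displacement error equals the motion velocity multiplied by the delay mismatch between two channels. This relation and both function names are assumptions made for illustration; the paper's actual model is not reproduced in the abstract.

```python
def synchronous_measurement_error(velocity_m_per_s, delay_mismatch_s):
    """First-order SME (in metres): the distance travelled during the delay mismatch."""
    return velocity_m_per_s * delay_mismatch_s


def implied_delay_mismatch(sme_m, velocity_m_per_s):
    """Delay mismatch (in seconds) that would produce a given SME at a given velocity."""
    if velocity_m_per_s == 0:
        raise ValueError("velocity must be non-zero")
    return sme_m / velocity_m_per_s


# Figures quoted in the abstract: 1.1 nm residual SME at 0.89 m/s.
residual_delay_s = implied_delay_mismatch(1.1e-9, 0.89)
residual_delay_ns = residual_delay_s * 1e9
```

Under that assumption, a residual SME of 1.1 nm at 0.89 m/s corresponds to an effective channel-to-channel delay mismatch on the order of a nanosecond.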
Detecting Measurement Disturbances in Rater-Mediated Assessments

ERIC Educational Resources Information Center

Wind, Stefanie A.; Schumacker, Randall E.

2017-01-01

The term measurement disturbance has been used to describe systematic conditions that affect a measurement process, resulting in a compromised interpretation of person or item estimates. Measurement disturbances have been discussed in relation to systematic response patterns associated with items and persons, such as start-up, plodding, boredom,…
Development and evaluation of the "BRISK Scale," a brief observational measure of risk communication competence.

PubMed

Han, Paul K J; Joekes, Katherine; Mills, Greg; Gutheil, Caitlin; Smith, Kahsi; Cochran, Nancy E; Elwyn, Glyn

2016-12-01

To develop and evaluate a brief observational measure of clinical risk communication competence. A 4-item checklist-type measure, the BRISK (Brief Risk Information Skill) Scale, was developed by selecting and refining items from a more comprehensive measure of clinical risk communication competence. Six volunteer raters received brief training on the measure and then used the BRISK Scale to evaluate 52 video-recorded encounters between 2nd-year medical students and standardized patients conducted as part of an Observed Structured Clinical Examination (OSCE) involving a risk communication task. Internal consistency reliability, inter-rater reliability, and criterion validity were assessed. Raters reported no difficulties using the BRISK Scale; scores across all raters and subjects ranged from 0 to 16 with a mean score of 6.49 (SD=3.17). The BRISK Scale showed good internal consistency reliability (α=0.64), and inter-rater reliability at the scale level (Intraclass Correlation Coefficient (ICC)=0.79 for consistency, and 0.75 for absolute agreement) and individual-item level (ICC range: 0.62-0.91). Novice raters' BRISK Scale scores were highly correlated (r=0.84, p<0.01) with expert raters' scores on the Risk Communication Content measure, a more comprehensive measure of risk communication competence. The BRISK Scale is a promising new brief observational measure of clinical risk communication competence. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Psychometrics of the Fitness-to-Drive Screening Measure.

PubMed

Classen, Sherrilene; Velozo, Craig A; Winter, Sandra M; Bédard, Michel; Wang, Yanning

2015-01-01

We employed item response theory (IRT), specifically using Rasch modeling, to determine the measurement precision of the Fitness-to-Drive Screening Measure (FTDS), a tool that can be used by caregivers and occupational therapists to help detect at-risk drivers. We examined unidimensionality through the factor structure (how items contribute to the central construct of fitness to drive), rating scale (use of the categories of the rating scale), item/person-level separation (distinguishing between items with different difficulty levels or persons with different ability levels) and reliability, item hierarchy (easier driving items advancing to more difficult driving items), rater reliability, rater effects (severity vs. leniency of a rater), and criterion validity of the FTDS to an on-road assessment, via three rater groups (n = 200 older drivers; n = 200 caregivers; n = 2 evaluators). The FTDS is unidimensional, the rating scale performed well, has good person (> 3.07) and item (> 5.43) separation, good person (> 0.90) and item reliability (> 0.97), with < 10% misfitting items for two rater groups (caregivers and drivers). The intraclass correlation (ICC) coefficient among the three rater groups was significant (.253, p < .001) and the evaluators were the most severe raters. When comparing the caregivers' FTDS rating with the drivers' on-road assessment, the areas under the curve (index of discriminability; caregivers .726, p < .001) suggested concurrent validity between the FTDS and the on-road assessment. Despite limitations, the FTDS is a reliable and accurate screening measure for caregivers to help identify at-risk older drivers and for occupational therapy practitioners to start conversations about driving.
Inter-rater reliability of the Full Outline of UnResponsiveness score and the Glasgow Coma Scale in critically ill patients: a prospective observational study

PubMed Central

2010-01-01

Introduction The Glasgow Coma Scale (GCS) is the most widely used scoring system for comatose patients in intensive care. Limitations of the GCS include the impossibility to assess the verbal score in intubated or aphasic patients, and an inconsistent inter-rater reliability. The FOUR (Full Outline of UnResponsiveness) score, a new coma scale not reliant on verbal response, was recently proposed. The aim of the present study was to compare the inter-rater reliability of the GCS and the FOUR score among unselected patients in general critical care. A further aim was to compare the inter-rater reliability of neurologists with that of intensive care unit (ICU) staff. Methods In this prospective observational study, scoring of GCS and FOUR score was performed by neurologists and ICU staff on 267 consecutive patients admitted to intensive care. Results In a total of 437 pair wise ratings the exact inter-rater agreement for the GCS was 71%, and for the FOUR score 82% (P = 0.0016); the inter-rater agreement within a range of ± 1 score point for the GCS was 90%, and for the FOUR score 92% (P = ns.). The exact inter-rater agreement among neurologists was superior to that among ICU staff for the FOUR score (87% vs. 79%, P = 0.04) but not for the GCS (73% vs. 73%). Neurologists and ICU staff did not significantly differ in the inter-rater agreement within a range of ± 1 score point for both GCS (88% vs. 93%) and the FOUR score (91% vs. 88%). Conclusions The FOUR score performed better than the GCS for exact inter-rater agreement, but not for the clinically more relevant agreement within the range of ± 1 score point. Though neurologists outperformed ICU staff with regard to exact inter-rater agreement, the inter-rater agreement of ICU staff within the clinically more relevant range of ± 1 score point equalled that of the neurologists. The small advantage in inter-rater reliability of the FOUR score is most likely insufficient to replace the GCS, a score with a long
Intra-rater reliability and agreement of various methods of measurement to assess dorsiflexion in the Weight Bearing Dorsiflexion Lunge Test (WBLT) among female athletes.

PubMed

Langarika-Rocafort, Argia; Emparanza, José Ignacio; Aramendi, José F; Castellano, Julen; Calleja-González, Julio

2017-01-01

To examine the intra-observer reliability and agreement between five methods of measurement for dorsiflexion during Weight Bearing Dorsiflexion Lunge Test and to assess the degree of agreement between three methods in female athletes. Repeated measurements study design. Volleyball club. Twenty-five volleyball players. Dorsiflexion was evaluated using five methods: heel-wall distance, first toe-wall distance, inclinometer at tibia, inclinometer at Achilles tendon and the dorsiflexion angle obtained by a simple trigonometric function. For the statistical analysis, agreement was studied using the Bland-Altman method, the Standard Error of Measurement and the Minimum Detectable Change. Reliability analysis was performed using the Intraclass Correlation Coefficient. Measurement methods using the inclinometer had more than 6° of measurement error. The angle calculated by trigonometric function had 3.28° error. The inclinometer-based methods had ICC values < 0.90. Distance-based methods and trigonometric angle measurement had ICC values > 0.90. Concerning the agreement between methods, bias ranged from 1.93° to 14.42° and random error from 4.24° to 7.96°. To assess DF angle in WBLT, the angle calculated by a trigonometric function is the most repeatable method. The methods of measurement cannot be used interchangeably. Copyright © 2016 Elsevier Ltd. All rights reserved.
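The Standard Error of Measurement and Minimum Detectable Change named in this record are commonly derived from the between-subject standard deviation and the ICC. The sketch below uses the conventional formulas SEM = SD × sqrt(1 − ICC) and MDC = z × sqrt(2) × SEM; whether the authors used exactly these forms is an assumption, and the input values are hypothetical rather than taken from the study.

```python
import math


def standard_error_of_measurement(sd, icc):
    """SEM = SD * sqrt(1 - ICC), in the same units as the measurement."""
    if sd < 0 or icc > 1:
        raise ValueError("sd must be non-negative and icc at most 1")
    return sd * math.sqrt(1.0 - icc)


def minimum_detectable_change(sem, z=1.645):
    """MDC = z * sqrt(2) * SEM; z = 1.645 gives MDC90, z = 1.96 gives MDC95."""
    return z * math.sqrt(2.0) * sem


# Hypothetical inputs: a 5.0 degree between-subject SD and an ICC of 0.91.
sem = standard_error_of_measurement(sd=5.0, icc=0.91)
mdc90 = minimum_detectable_change(sem)
mdc95 = minimum_detectable_change(sem, z=1.96)
```

The sqrt(2) factor reflects that a change score is the difference of two measurements, each carrying its own measurement error.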
Measuring Systematic Error with Curve Fits

ERIC Educational Resources Information Center

Rupright, Mark E.

2011-01-01

Systematic errors are often unavoidable in the introductory physics laboratory. As has been demonstrated in many papers in this journal, such errors can present a fundamental problem for data analysis, particularly when comparing the data to a given model. In this paper I give three examples in which my students use popular curve-fitting software…
Impact of gradient timing error on the tissue sodium concentration bioscale measured using flexible twisted projection imaging

NASA Astrophysics Data System (ADS)

Lu, Aiming; Atkinson, Ian C.; Vaughn, J. Thomas; Thulborn, Keith R.

2011-12-01

The rapid biexponential transverse relaxation of the sodium MR signal from brain tissue requires efficient k-space sampling for quantitative imaging in a time that is acceptable for human subjects. The flexible twisted projection imaging (flexTPI) sequence has been shown to be suitable for quantitative sodium imaging with an ultra-short echo time to minimize signal loss. The fidelity of the k-space center location is affected by the readout gradient timing errors on the three physical axes, which is known to cause image distortion for projection-based acquisitions. This study investigated the impact of these timing errors on the voxel-wise accuracy of the tissue sodium concentration (TSC) bioscale measured with the flexTPI sequence. Our simulations show greater than 20% spatially varying quantification errors when the gradient timing errors are larger than 10 μs on all three axes. The quantification is more tolerant of gradient timing errors on the Z-axis. An existing method was used to measure the gradient timing errors with <1 μs error. The gradient timing error measurement is shown to be RF coil dependent, and timing error differences of up to ~16 μs have been observed between different RF coils used on the same scanner. The measured timing errors can be corrected prospectively or retrospectively to obtain accurate TSC values.
Multi-faceted Rasch measurement and bias patterns in EFL writing performance assessment.

PubMed

He, Tung-Hsien; Gou, Wen Johnny; Chien, Ya-Chen; Chen, I-Shan Jenny; Chang, Shan-Mao

2013-04-01

This study applied multi-faceted Rasch measurement to examine rater bias in the assessment of essays written by college students learning English as a foreign language. Four raters who had received different academic training from four distinctive disciplines applied a six-category rating scale to analytically rate essays on an argumentative topic and on a descriptive topic. FACETS, a Rasch computer program, was utilized to pinpoint bias patterns by analyzing the rater-topic, rater-category, and topic-category interactions. Results showed: argumentative essays were rated more severely than were descriptive essays; the linguistics-major rater was the most lenient rater, while the literature-major rater was the severest one; and the category of language use received the severest ratings, whereas content was given the most lenient ratings. The severity hierarchies for raters, essay topics, and rating categories suggested that raters' academic training and their perceptions of the importance of categories were associated with their bias patterns. Implications for rater training are discussed.
Using Computation Curriculum-Based Measurement Probes for Error Pattern Analysis

ERIC Educational Resources Information Center

Dennis, Minyi Shih; Calhoon, Mary Beth; Olson, Christopher L.; Williams, Cara

2014-01-01

This article describes how "curriculum-based measurement--computation" (CBM-C) mathematics probes can be used in combination with "error pattern analysis" (EPA) to pinpoint difficulties in basic computation skills for students who struggle with learning mathematics. Both assessment procedures provide ongoing assessment data…
GY SAMPLING THEORY IN ENVIRONMENTAL STUDIES 2: SUBSAMPLING ERROR MEASUREMENTS

EPA Science Inventory

Sampling can be a significant source of error in the measurement process. The characterization and cleanup of hazardous waste sites require data that meet site-specific levels of acceptable quality if scientifically supportable decisions are to be made. In support of this effort,...
Development of a simple system for simultaneously measuring 6DOF geometric motion errors of a linear guide.

PubMed

Qibo, Feng; Bin, Zhang; Cunxing, Cui; Cuifang, Kuang; Yusheng, Zhai; Fenglin, You

2013-11-04

A simple method for simultaneously measuring the 6DOF geometric motion errors of the linear guide was proposed. The mechanisms for measuring straightness and angular errors and for enhancing their resolution are described in detail. A common-path method for measuring the laser beam drift was proposed and it was used to compensate the errors produced by the laser beam drift in the 6DOF geometric error measurements. A compact 6DOF system was built. Calibration experiments with certain standard measurement meters showed that our system has a standard deviation of 0.5 µm in a range of ± 100 µm for the straightness measurements, and standard deviations of 0.5", 0.5", and 1.0" in the range of ± 100" for pitch, yaw, and roll measurements, respectively.
Irradiance measurement errors due to the assumption of a Lambertian reference panel

NASA Technical Reports Server (NTRS)

Kimes, D. S.; Kirchner, J. A.

1982-01-01

A technique is presented for determining the error in diurnal irradiance measurements that results from the non-Lambertian behavior of a reference panel under various irradiance conditions. Spectral biconical reflectance factors of a spray-painted barium sulfate panel, along with simulated sky radiance data for clear and hazy skies at six solar zenith angles, were used to calculate the estimated panel irradiances and true irradiances for a nadir-looking sensor in two wavelength bands. The inherent errors in total spectral irradiance (0.68 microns) for a clear sky were 0.60, 6.0, 13.0, and 27.0% for solar zenith angles of 0, 45, 60, and 75 deg, respectively. The technique can be used to characterize the error of a specific panel used in field measurements, and thus eliminate any ambiguity of the effects of the type, preparation, and aging of the paint.
Grant Peer Review: Improving Inter-Rater Reliability with Training.

PubMed

Sattler, David N; McKnight, Patrick E; Naney, Linda; Mathis, Randy

2015-01-01

This study developed and evaluated a brief training program for grant reviewers that aimed to increase inter-rater reliability, rating scale knowledge, and effort to read the grant review criteria. Enhancing reviewer training may improve the reliability and accuracy of research grant proposal scoring and funding recommendations. Seventy-five Public Health professors from U.S. research universities watched the training video we produced and assigned scores to the National Institutes of Health scoring criteria proposal summary descriptions. For both novice and experienced reviewers, the training video increased scoring accuracy (the percentage of scores that reflect the true rating scale values), inter-rater reliability, and the amount of time reading the review criteria compared to the no video condition. The increase in reliability for experienced reviewers is notable because it is commonly assumed that reviewers--especially those with experience--have good understanding of the grant review rating scale. The findings suggest that both experienced and novice reviewers who had not received the type of training developed in this study may not have appropriate understanding of the definitions and meaning for each value of the rating scale and that experienced reviewers may overestimate their knowledge of the rating scale. The results underscore the benefits of and need for specialized peer reviewer training.
Error reduction by combining strapdown inertial measurement units in a baseball stitch

NASA Astrophysics Data System (ADS)

Tracy, Leah

A poor musical performance is rarely due to an inferior instrument. When a device is underperforming, the temptation is to find a better device or a new technology to achieve performance objectives; however, another solution may be improving how existing technology is used through a better understanding of device characteristics, i.e., learning to play the instrument better. This thesis explores improving position and attitude estimates of inertial navigation systems (INS) through an understanding of inertial sensor errors, manipulating inertial measurement units (IMUs) to reduce that error, and multisensor fusion of multiple IMUs to reduce error in a GPS-denied environment.
INTRA-RATER RELIABILITY OF THE MULTIPLE SINGLE-LEG HOP-STABILIZATION TEST AND RELATIONSHIPS WITH AGE, LEG DOMINANCE AND TRAINING.

PubMed

Sawle, Leanne; Freeman, Jennifer; Marsden, Jonathan

2017-04-01

Balance is a complex construct, affected by multiple components such as strength and co-ordination. However, whilst assessing an athlete's dynamic balance is an important part of clinical examination, there is no gold standard measure. The multiple single-leg hop-stabilization test is a functional test which may offer a method of evaluating the dynamic attributes of balance, but it needs to show adequate intra-tester reliability. The purpose of this study was to assess the intra-rater reliability of a dynamic balance test, the multiple single-leg hop-stabilization test on the dominant and non-dominant legs. Intra-rater reliability study. Fifteen active participants were tested twice with a 10-minute break between tests. The outcome measure was the multiple single-leg hop-stabilization test score, based on a clinically assessed numerical scoring system. Results were analysed using an Intraclass Correlation Coefficient (ICC(2,1)) and Bland-Altman plots. Regression analyses explored relationships between test scores, leg dominance, age and training (an alpha level of p = 0.05 was selected). ICCs for intra-rater reliability were 0.85 for the dominant and non-dominant legs (confidence intervals = 0.62-0.95 and 0.61-0.95 respectively). Bland-Altman plots showed scores within two standard deviations. A significant correlation was observed between the dominant and non-dominant leg on balance scores (R² = 0.49, p<0.05), and better balance was associated with younger participants in their non-dominant leg (R² = 0.28, p<0.05) and their dominant leg (R² = 0.39, p<0.05), and a higher number of hours spent training for the non-dominant leg (R² = 0.37, p<0.05). The multiple single-leg hop-stabilisation test demonstrated strong intra-tester reliability with active participants. Younger participants who trained more have better balance scores. This test may be a useful measure for evaluating the dynamic attributes of balance. 3.

The Combined Effects of Measurement Error and Omitting Confounders in the Single-Mediator Model

PubMed Central

Fritz, Matthew S.; Kenny, David A.; MacKinnon, David P.

2016-01-01

Mediation analysis requires a number of strong assumptions be met in order to make valid causal inferences. Failing to account for violations of these assumptions, such as not modeling measurement error or omitting a common cause of the effects in the model, can bias the parameter estimates of the mediated effect. When the independent variable is perfectly reliable, for example when participants are randomly assigned to levels of treatment, measurement error in the mediator tends to underestimate the mediated effect, while the omission of a confounding variable of the mediator to outcome relation tends to overestimate the mediated effect. Violations of these two assumptions often co-occur, however, in which case the mediated effect could be overestimated, underestimated, or even, in very rare circumstances, unbiased. In order to explore the combined effect of measurement error and omitted confounders in the same model, the impact of each violation on the single-mediator model is first examined individually. Then the combined effect of having measurement error and omitted confounders in the same model is discussed. Throughout, an empirical example is provided to illustrate the effect of violating these assumptions on the mediated effect. PMID:27739903
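The direction of the first bias described in this record can be illustrated with the textbook errors-in-variables result for a linear single-mediator model. The sketch below is not the authors' derivation. It assumes X is randomly assigned and measured without error, that the mediator carries classical measurement error independent of everything else, and that there are no omitted confounders. The function name and all numeric values are hypothetical.

```python
def attenuated_mediation(a, b, c_prime, resid_var_m, error_var_m):
    """Large-sample OLS estimates when the mediator M is observed with classical error.

    True model: M = a*X + e_M and Y = c_prime*X + b*M + e_Y.
    Observed mediator: M_obs = M + u, with Var(e_M) = resid_var_m and Var(u) = error_var_m.
    """
    if resid_var_m <= 0 or error_var_m < 0:
        raise ValueError("resid_var_m must be positive and error_var_m non-negative")
    # Reliability of the mediator after conditioning on X.
    partial_reliability = resid_var_m / (resid_var_m + error_var_m)
    b_obs = b * partial_reliability                           # b path is attenuated
    c_prime_obs = c_prime + a * b * (1.0 - partial_reliability)  # direct effect absorbs the rest
    return {
        "a": a,                       # X -> M path is unaffected by error in M
        "b": b_obs,
        "c_prime": c_prime_obs,
        "mediated": a * b_obs,
    }


# Hypothetical values: true mediated effect a*b = 0.20, mediator error variance 0.25.
estimates = attenuated_mediation(a=0.5, b=0.4, c_prime=0.1, resid_var_m=1.0, error_var_m=0.25)
```

In this setup the total effect is preserved, so whatever the mediated effect loses to attenuation reappears in the estimated direct effect. An omitted confounder of the mediator-outcome relation would push the b path in the opposite direction, which is why the combined bias discussed in the abstract can take either sign.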
Optics measurement algorithms and error analysis for the proton energy frontier

NASA Astrophysics Data System (ADS)

Langner, A.; Tomás, R.

2015-03-01

Optics measurement algorithms have been improved in preparation for the commissioning of the LHC at higher energy, i.e., with an increased damage potential. Due to machine protection considerations the higher energy sets tighter limits on the maximum excitation amplitude and the total beam charge, reducing the signal to noise ratio of optics measurements. Furthermore the precision in 2012 (4 TeV) was insufficient to understand beam size measurements and determine interaction point (IP) β-functions (β*). A new, more sophisticated algorithm has been developed which takes into account both the statistical and systematic errors involved in this measurement. This makes it possible to combine more beam position monitor measurements for deriving the optical parameters and significantly improves accuracy and precision. Measurements from the 2012 run have been reanalyzed which, due to the improved algorithms, result in a significantly higher precision of the derived optical parameters and decrease the average error bars by a factor of three to four. This allowed the calculation of β* values and proved fundamental to the understanding of emittance evolution during the energy ramp.
Measurements of Aperture Averaging on Bit-Error-Rate

NASA Technical Reports Server (NTRS)

Bastin, Gary L.; Andrews, Larry C.; Phillips, Ronald L.; Nelson, Richard A.; Ferrell, Bobby A.; Borbath, Michael R.; Galus, Darren J.; Chin, Peter G.; Harris, William G.; Marin, Jose A.;

2005-01-01

We report on measurements made at the Shuttle Landing Facility (SLF) runway at Kennedy Space Center of receiver aperture averaging effects on a propagating optical Gaussian beam wave over a propagation path of 1,000 m. A commercially available instrument with both transmit and receive apertures was used to transmit a modulated laser beam operating at 1550 nm through a transmit aperture of 2.54 cm. An identical model of the same instrument was used as a receiver with a single aperture that was varied in size up to 20 cm to measure the effect of receiver aperture averaging on Bit Error Rate. Simultaneous measurements were also made with a scintillometer instrument and local weather station instruments to characterize atmospheric conditions along the propagation path during the experiments.

Measurements of aperture averaging on bit-error-rate

NASA Astrophysics Data System (ADS)

Bastin, Gary L.; Andrews, Larry C.; Phillips, Ronald L.; Nelson, Richard A.; Ferrell, Bobby A.; Borbath, Michael R.; Galus, Darren J.; Chin, Peter G.; Harris, William G.; Marin, Jose A.; Burdge, Geoffrey L.; Wayne, David; Pescatore, Robert

2005-08-01

We report on measurements made at the Shuttle Landing Facility (SLF) runway at Kennedy Space Center of receiver aperture averaging effects on a propagating optical Gaussian beam wave over a propagation path of 1,000 m. A commercially available instrument with both transmit and receive apertures was used to transmit a modulated laser beam operating at 1550 nm through a transmit aperture of 2.54 cm. An identical model of the same instrument was used as a receiver with a single aperture that was varied in size up to 20 cm to measure the effect of receiver aperture averaging on Bit Error Rate. Simultaneous measurements were also made with a scintillometer instrument and local weather station instruments to characterize atmospheric conditions along the propagation path during the experiments.
Inter-rater reliability of data elements from a prototype of the Paul Coverdell National Acute Stroke Registry

PubMed Central

Reeves, Mathew J; Mullard, Andrew J; Wehner, Susan

2008-01-01

Background The Paul Coverdell National Acute Stroke Registry (PCNASR) is a U.S. based national registry designed to monitor and improve the quality of acute stroke care delivered by hospitals. The registry monitors care through specific performance measures, the accuracy of which depends in part on the reliability of the individual data elements used to construct them. This study describes the inter-rater reliability of data elements collected in Michigan's state-based prototype of the PCNASR. Methods Over a 6-month period, 15 hospitals participating in the Michigan PCNASR prototype submitted data on 2566 acute stroke admissions. Trained hospital staff prospectively identified acute stroke admissions, abstracted chart information, and submitted data to the registry. At each hospital 8 randomly selected cases were re-abstracted by an experienced research nurse. Inter-rater reliability was estimated by the kappa statistic for nominal variables, and intraclass correlation coefficient (ICC) for ordinal and continuous variables. Factors that can negatively impact the kappa statistic (i.e., trait prevalence and rater bias) were also evaluated. Results A total of 104 charts were available for re-abstraction. Excellent reliability (kappa or ICC > 0.75) was observed for many registry variables including age, gender, black race, hemorrhagic stroke, discharge medications, and modified Rankin Score. Agreement was at least moderate (i.e., 0.75 > kappa ≥ 0.40) for ischemic stroke, TIA, white race, non-ambulance arrival, hospital transfer and direct admit. However, several variables had poor reliability (kappa < 0.40) including stroke onset time, stroke team consultation, time of initial brain imaging, and discharge destination. There were marked systematic differences between hospital abstractors and the audit abstractor (i.e., rater bias) for many of the data elements recorded in the emergency department. Conclusion The excellent reliability of many of the data elements
Assessing Gaussian Assumption of PMU Measurement Error Using Field Data

DOE PAGES

Wang, Shaobu; Zhao, Junbo; Huang, Zhenyu; ...

2017-10-13

Gaussian PMU measurement error has been assumed for many power system applications, such as state estimation, oscillatory modes monitoring, voltage stability analysis, to cite a few. This letter proposes a simple yet effective approach to assess this assumption by using the stability property of a probability distribution and the concept of redundant measurement. Extensive results using field PMU data from WECC system reveal that the Gaussian assumption is questionable.
SU-F-J-65: Prediction of Patient Setup Errors and Errors in the Calibration Curve from Prompt Gamma Proton Range Measurements

DOE Office of Scientific and Technical Information (OSTI.GOV)

Albert, J; Labarbe, R; Sterpin, E

2016-06-15

Purpose: To understand the extent to which the prompt gamma camera measurements can be used to predict the residual proton range due to setup errors and errors in the calibration curve. Methods: We generated ten variations on a default calibration curve (CC) and ten corresponding range maps (RM). Starting with the default RM, we chose a square array of N beamlets, which were then rotated by a random angle θ and shifted by a random vector s. We added a 5% distal Gaussian noise to each beamlet in order to introduce discrepancies that exist between the ranges predicted from the prompt gamma measurements and those simulated with Monte Carlo algorithms. For each RM, s, θ, along with an offset u in the CC, were optimized using a simple Euclidean distance between the default ranges and the ranges produced by the given RM. Results: The application of our method led to the maximal overrange of 2.0 mm and underrange of 0.6 mm on average. Compared to the situations where s, θ, and u were ignored, these values were larger: 2.1 mm and 4.3 mm. In order to quantify the need for setup error corrections, we also performed computations in which u was corrected for, but s and θ were not. This yielded: 3.2 mm and 3.2 mm. The average computation time for 170 beamlets was 65 seconds. Conclusion: These results emphasize the necessity to correct for setup errors and the errors in the calibration curve. The simplicity and speed of our method make it a good candidate for being implemented as a tool for in-room adaptive therapy. This work also demonstrates that the prompt gamma range measurements can indeed be useful in the effort to reduce range errors. Given these results, and barring further refinements, this approach is a promising step towards an adaptive proton radiotherapy.
The Outdoor MEDIA DOT: The development and inter-rater reliability of a tool designed to measure food and beverage outlets and outdoor advertising.

PubMed

Poulos, Natalie S; Pasch, Keryn E

2015-07-01

Few studies of the food environment have collected primary data, and even fewer have reported reliability of the tool used. This study focused on the development of an innovative electronic data collection tool used to document outdoor food and beverage (FB) advertising and establishments near 43 middle and high schools in the Outdoor MEDIA Study. Tool development used GIS based mapping, an electronic data collection form on handheld devices, and an easily adaptable interface to efficiently collect primary data within the food environment. For the reliability study, two teams of data collectors documented all FB advertising and establishments within one half-mile of six middle schools. Inter-rater reliability was calculated overall and by advertisement or establishment category using percent agreement. A total of 824 advertisements (n=233), establishment advertisements (n=499), and establishments (n=92) were documented (range=8-229 per school). Overall inter-rater reliability of the developed tool ranged from 69-89% for advertisements and establishments. Results suggest that the developed tool is highly reliable and effective for documenting the outdoor FB environment. Copyright © 2015 Elsevier Ltd. All rights reserved.
Validity and intra-rater reliability of MyJump app on iPhone 6s in jump performance.

PubMed

Stanton, Robert; Wintour, Sally-Anne; Kean, Crystal O

2017-05-01

Smartphone applications are increasingly used by researchers, coaches, athletes and clinicians. The aim of this study was to examine the concurrent validity and intra-rater reliability of the smartphone-based application, MyJump, against laboratory-based force plate measurements. Cross-sectional study. Participants completed counter-movement jumps (CMJ) (n=29) and 30cm drop jumps (DJ) (n=27) on a force plate which were simultaneously recorded using MyJump. To assess concurrent validity, jump height, derived from flight time acquired from each device, was compared for each jump type. Intra-rater reliability was determined by replicating data analysis of MyJump recordings on two occasions separated by seven days. CMJ and DJ heights derived from MyJump showed excellent agreement with the force plate (ICC values range from 0.991 for CMJ to 0.993 for DJ). However, mean DJ height from the force plate was significantly higher than MyJump (mean difference: 0.87cm, 95% CI: 0.69-1.04cm). Intra-rater reliability of MyJump for both CMJ and DJ was almost perfect (ICC values range from 0.997 for CMJ to 0.998 for DJ); however, mean CMJ and DJ jump height for Day 1 was significantly higher than Day 2 (CMJ: 0.43cm, 95% CI: 0.23-0.62cm; DJ: 0.38cm, 95% CI: 0.23-0.53cm). The present study finds MyJump to be a valid and highly reliable tool for researchers, coaches, athletes and clinicians; however, systematic bias should be considered when comparing MyJump outputs to other testing devices. Copyright © 2016 Sports Medicine Australia. Published by Elsevier Ltd. All rights reserved.
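The abstract above derives jump height from flight time. A common way to do this, assumed here rather than confirmed as the exact method of either device, is the projectile relation h = g·t²/8, which follows from the jumper spending half the flight time t rising under gravity g. A minimal Python sketch under that assumption:

```python
G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def jump_height_from_flight_time(flight_time_s, g=G):
    """Estimate jump height in metres from flight time in seconds.

    Assumes take-off and landing occur at the same body height, so the
    rise lasts t/2 and h = 0.5 * g * (t/2)**2 = g * t**2 / 8.
    """
    if flight_time_s < 0:
        raise ValueError("flight time must be non-negative")
    return g * flight_time_s ** 2 / 8.0


# Hypothetical flight time of 0.5 s (not a value from the study).
print(jump_height_from_flight_time(0.5))
```

Because height scales with the square of flight time, small timing differences between devices translate into systematic height differences, which is consistent with the bias the authors caution about.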
System Error Compensation Methodology Based on a Neural Network for a Micromachined Inertial Measurement Unit

PubMed Central

Liu, Shi Qiang; Zhu, Rong

2016-01-01

Error compensation of micromachined-inertial-measurement-units (MIMU) is essential in practical applications. This paper presents a new compensation method using a neural-network-based identification for MIMU, which capably solves the universal problems of cross-coupling, misalignment, eccentricity, and other deterministic errors existing in a three-dimensional integrated system. Using a neural network to model a complex multivariate and nonlinear coupling system, the errors could be readily compensated through a comprehensive calibration. In this paper, we also present a thermal-gas MIMU based on thermal expansion, which measures three-axis angular rates and three-axis accelerations using only three thermal-gas inertial sensors, each of which capably measures one-axis angular rate and one-axis acceleration simultaneously in one chip. The developed MIMU (100 × 100 × 100 mm³) possesses the advantages of simple structure, high shock resistance, and large measuring ranges (three-axes angular rates of ±4000°/s and three-axes accelerations of ±10 g) compared with conventional MIMU, due to using gas medium instead of mechanical proof mass as the key moving and sensing elements. However, the gas MIMU suffers from cross-coupling effects, which corrupt the system accuracy. The proposed compensation method is, therefore, applied to compensate the system errors of the MIMU. Experiments validate the effectiveness of the compensation, and the measurement errors of three-axis angular rates and three-axis accelerations are reduced to less than 1% and 3% of uncompensated errors in the rotation range of ±600°/s and the acceleration range of ±1 g, respectively. PMID:26840314
Quantum error correction for continuously detected errors with any number of error channels per qubit

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ahn, Charlene; Wiseman, Howard; Jacobs, Kurt

2004-08-01

It was shown by Ahn, Wiseman, and Milburn [Phys. Rev. A 67, 052310 (2003)] that feedback control could be used as a quantum error correction process for errors induced by weak continuous measurement, given one perfectly measured error channel per qubit. Here we point out that this method can be easily extended to an arbitrary number of error channels per qubit. We show that the feedback protocols generated by our method encode n-2 logical qubits in n physical qubits, thus requiring just one more physical qubit than in the previous case.
Research on Measurement Accuracy of Laser Tracking System Based on Spherical Mirror with Rotation Errors of Gimbal Mount Axes

NASA Astrophysics Data System (ADS)

Shi, Zhaoyao; Song, Huixu; Chen, Hongfang; Sun, Yanqiang

2018-02-01

This paper presents a novel experimental approach for confirming that spherical mirror of a laser tracking system can reduce the influences of rotation errors of gimbal mount axes on the measurement accuracy. By simplifying the optical system model of laser tracking system based on spherical mirror, we can easily extract the laser ranging measurement error caused by rotation errors of gimbal mount axes with the positions of spherical mirror, biconvex lens, cat's eye reflector, and measuring beam. The motions of polarization beam splitter and biconvex lens along the optical axis and vertical direction of optical axis are driven by error motions of gimbal mount axes. In order to simplify the experimental process, the motion of biconvex lens is substituted by the motion of spherical mirror according to the principle of relative motion. The laser ranging measurement error caused by the rotation errors of gimbal mount axes could be recorded in the readings of laser interferometer. The experimental results showed that the laser ranging measurement error caused by rotation errors was less than 0.1 μm if radial error motion and axial error motion were within ±10 μm. The experimental method simplified the experimental procedure and the spherical mirror could reduce the influences of rotation errors of gimbal mount axes on the measurement accuracy of the laser tracking system.
Long-term continuous acoustical suspended-sediment measurements in rivers - Theory, application, bias, and error

USGS Publications Warehouse

Topping, David J.; Wright, Scott A.

2016-05-04

these sites. In addition, detailed, step-by-step procedures are presented for the general river application of the method. Quantification of errors in sediment-transport measurements made using this acoustical method is essential if the measurements are to be used effectively, for example, to evaluate uncertainty in long-term sediment loads and budgets. Several types of error analyses are presented to evaluate (1) the stability of acoustical calibrations over time, (2) the effect of neglecting backscatter from silt and clay, (3) the bias arising from changes in sand grain size, (4) the time-varying error in the method, and (5) the influence of nonrandom processes on error. Results indicate that (1) acoustical calibrations can be stable for long durations (multiple years), (2) neglecting backscatter from silt and clay can result in unacceptably high bias, (3) two frequencies are likely required to obtain sand-concentration measurements that are unbiased by changes in grain size, depending on site-specific conditions and acoustic frequency, (4) relative errors in silt-and-clay- and sand-concentration measurements decrease substantially as concentration increases, and (5) nonrandom errors may arise from slow changes in the spatial structure of suspended sediment that affect the relations between concentration in the acoustically ensonified part of the cross section and concentration in the entire river cross section. Taken together, the error analyses indicate that the two-frequency method produces unbiased measurements of suspended-silt-and-clay and sand concentration, with errors that are similar to, or larger than, those associated with conventional sampling methods.
Analysis of the Rater Effects on the Scoring of Diagnostic Trees Prepared by Teacher Candidates with the Many-Facet Rasch Model

ERIC Educational Resources Information Center

Nalbantoglu Yilmaz, Funda

2017-01-01

In the study, it was aimed to investigate the leniency/severity, bias and halo effect of the raters which were used in the scoring of the diagnostic tree prepared by the teacher candidates with the many-facet Rasch model. The research study group constitutes 24 teacher candidates who are taking measurement and evaluation lesson from the students…
Inter-Rater Agreement of Auscultation, Palpable Fremitus, and Ventilator Waveform Sawtooth Patterns Between Clinicians.

PubMed

Berry, Marc P; Martí, Joan-Daniel; Ntoumenopoulos, George

2016-10-01

Clinicians often use numerous bedside assessments for secretion retention in participants who are receiving invasive mechanical ventilation. This study aimed to evaluate inter-rater agreement between clinicians when using standard clinical assessments of secretion retention and whether differences in clinician experience influenced inter-rater agreement. Seventy-one mechanically ventilated participants were assessed by a research clinician and by one of 13 ICU clinicians. Each clinician conducted a standardized assessment of lung auscultation, palpation for chest-wall (rhonchal) fremitus, and ventilator inspiratory/expiratory flow-time waveforms for the sawtooth pattern. On the presence of breath sounds, agreement ranged from absolute to moderate in the upper zones and the lower zones, respectively. Kappa values for abnormal and adventitious lung sounds achieved moderate agreement in the upper zones, less than chance agreement to substantial agreement in the middle zones, and moderate agreement to almost perfect agreement in the lower zones. Moderate to almost perfect agreement was established for palpable fremitus in the upper zones, moderate to substantial agreement in the middle zones, and less than chance to moderate agreement in the lower zones. Inter-rater agreement on the presence of expiratory sawtooth pattern identification showed moderate agreement. The level of percentage agreement between the research and ICU clinicians for each respiratory assessment studied did not relate directly to level of clinical experience. Inter-rater agreement for all assessments showed variability between lung regions but maintained reasonable percentage agreement in mechanically ventilated participants. The level of percentage agreement achieved between clinicians did not directly relate to clinical experience for all respiratory assessments. Therefore, these respiratory assessments should not necessarily be viewed in isolation but interpreted within the context of a full
Stabilizing Conditional Standard Errors of Measurement in Scale Score Transformations

ERIC Educational Resources Information Center

Moses, Tim; Kim, YoungKoung

2017-01-01

The focus of this article is on scale score transformations that can be used to stabilize conditional standard errors of measurement (CSEMs). Three transformations for stabilizing the estimated CSEMs are reviewed, including the traditional arcsine transformation, a recently developed general variance stabilization transformation, and a new method…
Exploring the Effectiveness of a Measurement Error Tutorial in Helping Teachers Understand Score Report Results

ERIC Educational Resources Information Center

Zapata-Rivera, Diego; Zwick, Rebecca; Vezzu, Margaret

2016-01-01

The goal of this study was to explore the effectiveness of a short web-based tutorial in helping teachers to better understand the portrayal of measurement error in test score reports. The short video tutorial included both verbal and graphical representations of measurement error. Results showed a significant difference in comprehension scores…
Do Survey Data Estimate Earnings Inequality Correctly? Measurement Errors among Black and White Male Workers

ERIC Educational Resources Information Center

Kim, ChangHwan; Tamborini, Christopher R.

2012-01-01

Few studies have considered how earnings inequality estimates may be affected by measurement error in self-reported earnings in surveys. Utilizing restricted-use data that links workers in the Survey of Income and Program Participation with their W-2 earnings records, we examine the effect of measurement error on estimates of racial earnings…
Measures of shoulder protraction and thoracolumbar rotation.

PubMed

Schenkman, M; Laub, K C; Kuchibhatla, M; Ray, L; Shinberg, M

1997-05-01

Physical therapists need objective measures that can be used reliably with a variety of subject groups to document upper quadrant function. Two aspects of upper quadrant motion, shoulder protraction and thoracolumbar rotation, are assessed routinely in clinical practice, but no standard measurement techniques have been reported. We hypothesized that there would be significant differences, by age and state of health, for both shoulder protraction and thoracolumbar rotation. The purposes of this study were: 1) to develop measurement approaches for shoulder protraction and thoracolumbar rotation; 2) to determine if there are significant differences in these motions for four subject groups: healthy young, healthy elders, functionally limited elders, and people with Parkinson's disease; and 3) to describe between-rater and within-rater reliability for these measures. Fifty-five subjects participated in this investigation. All subjects were rated by a physical therapist and two research assistants. Using an analysis of variance followed by Scheffe's post hoc analysis, significant differences were demonstrated between the groups. Between-rater and within-rater reliability ranged from ICCs of 0.54 to 0.95. Clinicians can use these measures to quantify aspects of upper quadrant function treated routinely in physical therapy practice. These measures also have applicability for researchers.
Model selection for marginal regression analysis of longitudinal data with missing observations and covariate measurement error.

PubMed

Shen, Chung-Wei; Chen, Yi-Hau

2015-10-01

Missing observations and covariate measurement error commonly arise in longitudinal data. However, existing methods for model selection in marginal regression analysis of longitudinal data fail to address the potential bias resulting from these issues. To tackle this problem, we propose a new model selection criterion, the Generalized Longitudinal Information Criterion, which is based on an approximately unbiased estimator for the expected quadratic error of a considered marginal model accounting for both data missingness and covariate measurement error. The simulation results reveal that the proposed method performs quite well in the presence of missing data and covariate measurement error. On the contrary, the naive procedures without taking care of such complexity in data may perform quite poorly. The proposed method is applied to data from the Taiwan Longitudinal Study on Aging to assess the relationship of depression with health and social status in the elderly, accommodating measurement error in the covariate as well as missing observations. © The Author 2015. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

Multipollutant measurement error in air pollution epidemiology studies arising from predicting exposures with penalized regression splines

PubMed Central

Bergen, Silas; Sheppard, Lianne; Kaufman, Joel D.; Szpiro, Adam A.

2016-01-01

Air pollution epidemiology studies are trending towards a multi-pollutant approach. In these studies, exposures at subject locations are unobserved and must be predicted using observed exposures at misaligned monitoring locations. This induces measurement error, which can bias the estimated health effects and affect standard error estimates. We characterize this measurement error and develop an analytic bias correction when using penalized regression splines to predict exposure. Our simulations show bias from multi-pollutant measurement error can be severe, and in opposite directions or simultaneously positive or negative. Our analytic bias correction combined with a non-parametric bootstrap yields accurate coverage of 95% confidence intervals. We apply our methodology to analyze the association of systolic blood pressure with PM2.5 and NO2 in the NIEHS Sister Study. We find that NO2 confounds the association of systolic blood pressure with PM2.5 and vice versa. Elevated systolic blood pressure was significantly associated with increased PM2.5 and decreased NO2. Correcting for measurement error bias strengthened these associations and widened 95% confidence intervals. PMID:27789915
Measurement error: Implications for diagnosis and discrepancy models of developmental dyslexia.

PubMed

Cotton, Sue M; Crewther, David P; Crewther, Sheila G

2005-08-01

The diagnosis of developmental dyslexia (DD) is reliant on a discrepancy between intellectual functioning and reading achievement. Discrepancy-based formulae have frequently been employed to establish the significance of the difference between 'intelligence' and 'actual' reading achievement. These formulae, however, often fail to take into consideration test reliability and the error associated with a single test score. This paper provides an illustration of the potential effects that test reliability and measurement error can have on the diagnosis of dyslexia, with particular reference to discrepancy models. The roles of reliability and standard error of measurement (SEM) in classic test theory are also briefly reviewed. This is followed by illustrations of how SEM and test reliability can aid with the interpretation of a simple discrepancy-based formula of DD. It is proposed that a lack of consideration of test theory in the use of discrepancy-based models of DD can lead to misdiagnosis (both false positives and false negatives). Further, misdiagnosis in research samples affects reproducibility and generalizability of findings. This in turn, may explain current inconsistencies in research on the perceptual, sensory, and motor correlates of dyslexia.
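To make the role of reliability concrete, the sketch below uses the classical test theory relation SEM = SD·√(1 − reliability) and, assuming the two tests have independent errors and share a score scale, the standard error of a difference √(SEM₁² + SEM₂²). The function names, the reliabilities (0.91 and 0.84), the SD of 15, and the 95% criterion are illustrative assumptions and are not values from the paper.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), from classical test theory."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def standard_error_of_difference(sem_a, sem_b):
    """SE of a difference between two scores, assuming independent errors."""
    return math.hypot(sem_a, sem_b)


def discrepancy_exceeds_error(score_a, score_b, sem_a, sem_b, z=1.96):
    """True if the observed discrepancy is larger than z * SE of the difference."""
    threshold = z * standard_error_of_difference(sem_a, sem_b)
    return abs(score_a - score_b) > threshold


# Hypothetical tests, both scaled to SD = 15.
sem_iq = standard_error_of_measurement(15.0, 0.91)       # about 4.5
sem_reading = standard_error_of_measurement(15.0, 0.84)  # about 6.0
print(standard_error_of_difference(sem_iq, sem_reading)) # about 7.5
print(discrepancy_exceeds_error(100, 88, sem_iq, sem_reading))
print(discrepancy_exceeds_error(100, 84, sem_iq, sem_reading))
```

Under these assumed values, a 12-point gap between intelligence and reading scores falls inside the error band (1.96 × 7.5 ≈ 14.7 points), while a 16-point gap falls outside it. This illustrates how a fixed discrepancy cut-off that ignores SEM could misclassify cases in either direction.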
Measurement methods to assess diastasis of the rectus abdominis muscle (DRAM): A systematic review of their measurement properties and meta-analytic reliability generalisation.

PubMed

van de Water, A T M; Benjamin, D R

2016-02-01

Systematic literature review. Diastasis of the rectus abdominis muscle (DRAM) has been linked with low back pain, abdominal and pelvic dysfunction. Measurement is used to either screen or to monitor DRAM width. Determining which methods are suitable for screening and monitoring DRAM is of clinical value. To identify the best methods to screen for DRAM presence and monitor DRAM width. AMED, Embase, Medline, PubMed and CINAHL databases were searched for measurement property studies of DRAM measurement methods. Population characteristics, measurement methods/procedures and measurement information were extracted from included studies. Quality of all studies was evaluated using 'quality rating criteria'. When possible, reliability generalisation was conducted to provide combined reliability estimations. Thirteen studies evaluated measurement properties of the 'finger width'-method, tape measure, calipers, ultrasound, CT and MRI. Ultrasound was most evaluated. Methodological quality of these studies varied widely. Pearson's correlations of r = 0.66-0.79 were found between calipers and ultrasound measurements. Calipers and ultrasound had Intraclass Correlation Coefficients (ICC) of 0.78-0.97 for test-retest, inter- and intra-rater reliability. The 'finger width'-method had weighted Kappas of 0.73-0.77 for test-retest reliability, but moderate agreement (63%; weighted Kappa = 0.53) between raters. Comparing calipers and ultrasound, low measurement error was found (above the umbilicus), and the methods had good agreement (83%; weighted Kappa = 0.66) for discriminative purposes. The available information supports ultrasound and calipers as adequate methods to assess DRAM. For other methods limited measurement information of low to moderate quality is available and further evaluation of their measurement properties is required. Copyright © 2015 Elsevier Ltd. All rights reserved.
Least-Squares Models to Correct for Rater Effects in Performance Assessment.

ERIC Educational Resources Information Center

Raymond, Mark R.; Viswesvaran, Chockalingam

This study illustrates the use of three least-squares models to control for rater effects in performance evaluation: (1) ordinary least squares (OLS); (2) weighted least squares (WLS); and (3) OLS subsequent to applying a logistic transformation to observed ratings (LOG-OLS). The three models were applied to ratings obtained from four…
RELIABILITY OF ANKLE-FOOT MORPHOLOGY, MOBILITY, STRENGTH, AND MOTOR PERFORMANCE MEASURES.

PubMed

Fraser, John J; Koldenhoven, Rachel M; Saliba, Susan A; Hertel, Jay

2017-12-01

Assessment of foot posture, morphology, intersegmental mobility, strength and motor control of the ankle-foot complex are commonly used clinically, but measurement properties of many assessments are unclear. To determine test-retest and inter-rater reliability, standard error of measurement, and minimal detectable change of morphology, joint excursion and play, strength, and motor control of the ankle-foot complex. Reliability study. 24 healthy, recreationally-active young adults without history of ankle-foot injury were assessed by two clinicians on two occasions, three to ten days apart. Measurement properties were assessed for foot morphology (foot posture index, total and truncated length, width, arch height), joint excursion (weight-bearing dorsiflexion, rearfoot and hallux goniometry, forefoot inclinometry, 1st metatarsal displacement) and joint play, strength (handheld dynamometry), and motor control rating during intrinsic foot muscle (IFM) exercises. Clinician order was randomized using a Latin Square. The clinicians performed independent examinations and did not confer on the findings for the duration of the study. Test-retest and inter-tester reliability and agreement were assessed using intraclass correlation coefficients (ICC2,k) and weighted kappa (Kw). Test-retest reliability ICC were as follows: morphology: .80-1.00, joint excursion: .58-.97, joint play: -.67 to .84, strength: .67-.92, IFM motor rating: Kw -.01 to .71. Inter-rater reliability ICC were as follows: morphology: .81-1.00, joint excursion: .32-.97, joint play: -1.06 to 1.00, strength: .53-.90, and IFM motor rating: Kw .02-.56. Measures of ankle-foot posture, morphology, joint excursion, and strength demonstrated fair to excellent test-retest and inter-rater reliability. Test-retest reliability for rating of perceived difficulty and motor performance was good to excellent for short-foot, toe-spread-out, and hallux exercises and poor to fair for lesser toe extension. Joint play measures had
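The record above reports ICC2,k values. As a hedged illustration of what that statistic computes, the sketch below implements the two-way random-effects intraclass correlations, single-rater ICC(2,1) and average-rater ICC(2,k), from the usual ANOVA mean squares. The function name and the small ratings table are illustrative assumptions and are not data from the study.

```python
def icc_two_way_random(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a subjects-by-raters table.

    ratings: list of rows, one per subject, each with one score per rater.
    Uses two-way ANOVA mean squares for subjects (MSR), raters (MSC),
    and residual error (MSE).
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 subjects and raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))

    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average


# Illustrative table: 6 subjects rated by 4 raters (not study data).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
single, average = icc_two_way_random(example)
print(round(single, 2), round(average, 2))
```

The average-rater form is larger than the single-rater form because averaging over k raters reduces error variance. These estimators are not bounded below by zero, so negative values like some of those reported above can occur when agreement is poor.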
Instrumental variables vs. grouping approach for reducing bias due to measurement error.

PubMed

Batistatou, Evridiki; McNamee, Roseanne

2008-01-01

Attenuation of the exposure-response relationship due to exposure measurement error is often encountered in epidemiology. Given that error cannot be totally eliminated, bias correction methods of analysis are needed. Many methods require more than one exposure measurement per person to be made, but the 'group mean OLS method', in which subjects are grouped into several a priori defined groups followed by ordinary least squares (OLS) regression on the group means, can be applied with one measurement. An alternative approach is to use an instrumental variable (IV) method in which both the single error-prone measure and an IV are used in IV analysis. In this paper we show that the 'group mean OLS' estimator is equal to an IV estimator with the group mean used as IV, but that the variance estimators for the two methods are different. We derive a simple expression for the bias in the common estimator which is a simple function of group size, reliability and contrast of exposure between groups, and show that the bias can be very small when group size is large. We compare this method with a new proposal (group mean ranking method), also applicable with a single exposure measurement, in which the IV is the rank of the group means. When there are two independent exposure measurements per subject, we propose a new IV method (EVROS IV) and compare it with Carroll and Stefanski's (CS IV) proposal in which the second measure is used as an IV; the new IV estimator combines aspects of the 'group mean' and 'CS' strategies. All methods are evaluated in terms of bias, precision and root mean square error via simulations and a dataset from occupational epidemiology. The 'group mean ranking method' does not offer much improvement over the 'group mean method'. Compared with the 'CS' method, the 'EVROS' method is less affected by low reliability of exposure. We conclude that the group IV methods we propose may provide a useful way to handle mismeasured exposures in epidemiology with or
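The stated equality between the group mean OLS estimator and an IV estimator that uses the group mean as the instrument can be checked numerically. The sketch below is a toy simulation under assumptions of our own (equal group sizes, a linear outcome model, Gaussian errors, and hypothetical function names); it is not the authors' simulation design.

```python
import random


def mean(values):
    return sum(values) / len(values)


def ols_slope(x, y):
    """Slope from simple OLS regression of y on x."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den


def iv_slope(z, w, y):
    """Simple IV (ratio) estimator: cov(z, y) / cov(z, w)."""
    mz, mw, my = mean(z), mean(w), mean(y)
    num = sum((a - mz) * (c - my) for a, c in zip(z, y))
    den = sum((a - mz) * (b - mw) for a, b in zip(z, w))
    return num / den


def simulate(n_groups=5, group_size=40, beta=2.0, seed=0):
    """Generate group labels, error-prone exposures w, and outcomes y."""
    rng = random.Random(seed)
    groups, w, y = [], [], []
    for g in range(n_groups):
        for _ in range(group_size):
            true_x = g + rng.gauss(0, 1)          # true exposure
            w.append(true_x + rng.gauss(0, 1))    # single error-prone measure
            y.append(beta * true_x + rng.gauss(0, 1))
            groups.append(g)
    return groups, w, y


groups, w, y = simulate()
labels = sorted(set(groups))
w_bar = {g: mean([wi for wi, gi in zip(w, groups) if gi == g]) for g in labels}
y_bar = {g: mean([yi for yi, gi in zip(y, groups) if gi == g]) for g in labels}

# Group mean OLS: regress group mean outcome on group mean exposure.
group_mean_estimate = ols_slope([w_bar[g] for g in labels],
                                [y_bar[g] for g in labels])

# IV: each subject's instrument is the mean exposure of their group.
z = [w_bar[g] for g in groups]
iv_estimate = iv_slope(z, w, y)

print(group_mean_estimate, iv_estimate)
```

With equal group sizes the two point estimates coincide up to floating-point rounding. With unequal group sizes, the match would require weighting the group-level regression by group size.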
Self-test web-based pure-tone audiometry: validity evaluation and measurement error analysis.

PubMed

Masalski, Marcin; Kręcicki, Tomasz

2013-04-12

Potential methods of application of self-administered Web-based pure-tone audiometry conducted at home on a PC with a sound card and ordinary headphones depend on the value of measurement error in such tests. The aim of this research was to determine the measurement error of the hearing threshold determined in the way described above and to identify and analyze factors influencing its value. The evaluation of the hearing threshold was made in three series: (1) tests on a clinical audiometer, (2) self-tests done on a specially calibrated computer under the supervision of an audiologist, and (3) self-tests conducted at home. The research was carried out on the group of 51 participants selected from patients of an audiology outpatient clinic. From the group of 51 patients examined in the first two series, the third series was self-administered at home by 37 subjects (73%). The average difference between the value of the hearing threshold determined in series 1 and in series 2 was -1.54dB with standard deviation of 7.88dB and a Pearson correlation coefficient of .90. Between the first and third series, these values were -1.35dB±10.66dB and .84, respectively. In series 3, the standard deviation was most influenced by the error connected with the procedure of hearing threshold identification (6.64dB), calibration error (6.19dB), and additionally at the frequency of 250Hz by frequency nonlinearity error (7.28dB). The obtained results confirm the possibility of applying Web-based pure-tone audiometry in screening tests. In the future, modifications of the method leading to the decrease in measurement error can broaden the scope of Web-based pure-tone audiometry application.
Sinusoidal Siemens star spatial frequency response measurement errors due to misidentified target centers

DOE PAGES

Birch, Gabriel Carisle; Griffin, John Clark

2015-07-23

Numerous methods are available to measure the spatial frequency response (SFR) of an optical system. A recent change to the ISO 12233 photography resolution standard includes a sinusoidal Siemens star test target. We take the sinusoidal Siemens star proposed by the ISO 12233 standard, measure system SFR, and perform an analysis of errors induced by incorrectly identifying the center of a test target. We show a closed-form solution for the radial profile intensity measurement given an incorrectly determined center and describe how this error reduces the measured SFR of the system. As a result, using the closed-form solution, we propose a two-step process by which test target centers are corrected and the measured SFR is restored to the nominal, correctly centered values.
Ultrasonographic measurement of the acromiohumeral distance in spinal cord injury: Reliability and effects of shoulder positioning.

PubMed

Lin, Yen-Sheng; Boninger, Michael L; Day, Kevin A; Koontz, Alicia M

2015-11-01

To investigate the reliability of ultrasonographic measurement of acromiohumeral distance (AHD) and the effects of shoulder positioning on AHD among manual wheelchair users (MWUs) with spinal cord injury (SCI) and an able-bodied control group. Ten MWUs with SCI and 10 able-bodied subjects participated in this study. The ultrasonographic measurements of AHD from each subject were obtained by two raters during passive and active scapular plane arm elevation in neutral, 45°, 90° with and without resistance and in a weight relief raise position. The measurements were recorded again by each rater using the same procedures after a 30-minute time interval. All raters were blinded to each other's measurements. University Laboratories and Veteran Affairs Healthcare System. Intra-rater (intraclass correlation coefficient, ICC > 0.83) and inter-rater (ICC > 0.78) reliability was excellent for both the MWUs with SCI and able-bodied groups across all arm positions except for the 45° position in the control group for one of the raters (intra-rater: ICC < 0.40 and inter-rater: ICC < 0.60). AHD significantly reduced when the shoulder was in the 90° arm elevated positions with or without resistance. Findings from our study demonstrated that ultrasonography is a reliable means to evaluate AHD in both able bodied and individuals with SCI, who are known to have significant shoulder pathology. This technique could be used to develop reference measures and to identify changes in AHD caused by interventions.
Assessment of measurement errors and dynamic calibration methods for three different tipping bucket rain gauges

NASA Astrophysics Data System (ADS)

Shedekar, Vinayak S.; King, Kevin W.; Fausey, Norman R.; Soboyejo, Alfred B. O.; Harmel, R. Daren; Brown, Larry C.

2016-09-01

Three different models of tipping bucket rain gauges (TBRs), viz. HS-TB3 (Hydrological Services Pty Ltd.), ISCO-674 (Isco, Inc.) and TR-525 (Texas Electronics, Inc.), were calibrated in the lab to quantify measurement errors across a range of rainfall intensities (5 mm·h⁻¹ to 250 mm·h⁻¹) and three different volumetric settings. Instantaneous and cumulative values of simulated rainfall were recorded at 1, 2, 5, 10 and 20-min intervals. All three TBR models showed a substantial deviation (α = 0.05) in measurements from actual rainfall depths, with increasing underestimation errors at greater rainfall intensities. Simple linear regression equations were developed for each TBR to correct the TBR readings based on measured intensities (R² > 0.98). Additionally, two dynamic calibration techniques, viz. quadratic model (R² > 0.7) and T vs. 1/Q model (R² > 0.98), were tested and found to be useful in situations when the volumetric settings of TBRs are unknown. The correction models were successfully applied to correct field-collected rainfall data from respective TBR models. The calibration parameters of correction models were found to be highly sensitive to changes in volumetric calibration of TBRs. Overall, the HS-TB3 model (with a better protected tipping bucket mechanism, and consistent measurement errors across a range of rainfall intensities) was found to be the most reliable and consistent for rainfall measurements, followed by the ISCO-674 (with susceptibility to clogging and relatively smaller measurement errors across a range of rainfall intensities) and the TR-525 (with high susceptibility to clogging and frequent changes in volumetric calibration, and highly intensity-dependent measurement errors). The study demonstrated that corrections based on dynamic and volumetric calibration can only help minimize, but not completely eliminate, the measurement errors. The findings from this study will be useful for correcting field data from TBRs, and may have major
The Thirty Gigahertz Instrument Receiver for the QUIJOTE Experiment: Preliminary Polarization Measurements and Systematic-Error Analysis.

PubMed

Casas, Francisco J; Ortiz, David; Villa, Enrique; Cano, Juan L; Cagigas, Jaime; Pérez, Ana R; Aja, Beatriz; Terán, J Vicente; de la Fuente, Luisa; Artal, Eduardo; Hoyland, Roger; Génova-Santos, Ricardo

2015-08-05

This paper presents preliminary polarization measurements and systematic-error characterization of the Thirty Gigahertz Instrument receiver developed for the QUIJOTE experiment. The instrument has been designed to measure the polarization of Cosmic Microwave Background radiation from the sky, obtaining the Q, U, and I Stokes parameters of the incoming signal simultaneously. Two kinds of linearly polarized input signals have been used as excitations in the polarimeter measurement tests in the laboratory; these show consistent results in terms of the Stokes parameters obtained. A measurement-based systematic-error characterization technique has been used in order to determine the possible sources of instrumental errors and to assist in the polarimeter calibration process.
The TiltMeter app is a novel and accurate measurement tool for the weight bearing lunge test.

PubMed

Williams, Cylie M; Caserta, Antoni J; Haines, Terry P

2013-09-01

The weight bearing lunge test is increasingly being used by health care clinicians who treat lower limb and foot pathology. This measure is commonly established accurately and reliably with the use of expensive equipment. This study aims to compare the digital inclinometer with a free app, TiltMeter, on an Apple iPhone. This was an intra-rater and inter-rater reliability study. Two raters (novice and experienced) conducted the measurements in both a bent knee and straight leg position to determine the intra-rater and inter-rater reliability. Concurrent validity was also established. Allied health practitioners were recruited as participants from the workplace. A preconditioning stretch was conducted and the ankle range of motion was established in the weight bearing lunge test position, first with the leg straight and then with the knee bent. The measurement device and each participant were randomised during measurement. The intra-rater reliability and inter-rater reliability for the devices and in both positions were all over ICC 0.8 except for one intra-rater measure (digital inclinometer, novice, ICC 0.65). The inter-rater reliability between the digital inclinometer and the TiltMeter was near perfect, ICC 0.96 (CI: 0.898-0.983); concurrent validity ICC between the two devices was 0.83 (CI: -0.740 to 0.445). The use of the TiltMeter app on the iPhone is a reliable and inexpensive tool to measure the available ankle range of motion. Health practitioners should use caution in applying these findings to other smartphone equipment if surface areas are not comparable. Crown Copyright © 2013. Published by Elsevier Ltd. All rights reserved.
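For readers unfamiliar with how an inter-rater ICC is obtained, the following is a minimal sketch. The abstract does not state which ICC form was used; the sketch assumes the two-way random-effects, absolute-agreement, single-measure form, commonly labelled ICC(2,1). The function name `icc_2_1` and the angle values are hypothetical and are not data from the study.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `ratings` is an (n subjects) x (k raters) table. The choice of this
    ICC form is an assumption; the abstract does not specify one.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares from a two-way ANOVA without replication
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)   # between subjects
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)   # between raters
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical lunge angles in degrees: 5 participants x 2 raters
angles = [[38.0, 39.5], [42.0, 41.0], [35.5, 36.0], [45.0, 46.5], [40.0, 39.0]]
print(round(icc_2_1(angles), 3))
```

Because this form measures absolute agreement, a constant offset between raters lowers the coefficient even when the raters rank participants identically.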
Dealing with dietary measurement error in nutritional cohort studies.

PubMed

Freedman, Laurence S; Schatzkin, Arthur; Midthune, Douglas; Kipnis, Victor

2011-07-20

Dietary measurement error creates serious challenges to reliably discovering new diet-disease associations in nutritional cohort studies. Such error causes substantial underestimation of relative risks and reduction of statistical power for detecting associations. On the basis of data from the Observing Protein and Energy Nutrition Study, we recommend the following approaches to deal with these problems. Regarding data analysis of cohort studies using food-frequency questionnaires, we recommend 1) using energy adjustment for relative risk estimation; 2) reporting estimates adjusted for measurement error along with the usual relative risk estimates, whenever possible (this requires data from a relevant, preferably internal, validation study in which participants report intakes using both the main instrument and a more detailed reference instrument such as a 24-hour recall or multiple-day food record); 3) performing statistical adjustment of relative risks, based on such validation data, if they exist, using univariate (only for energy-adjusted intakes such as densities or residuals) or multivariate regression calibration. We note that whereas unadjusted relative risk estimates are biased toward the null value, statistical significance tests of unadjusted relative risk estimates are approximately valid. Regarding study design, we recommend increasing the sample size to remedy loss of power; however, it is important to understand that this will often be an incomplete solution because the attenuated signal may be too small to distinguish from unmeasured confounding in the model relating disease to reported intake. Future work should be devoted to alleviating the problem of signal attenuation, possibly through the use of improved self-report instruments or by combining dietary biomarkers with self-report instruments.
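The attenuation toward the null and its correction by regression calibration can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' analysis: it uses a linear outcome model instead of relative risks, classical additive error in the reported intake, and a reference instrument whose error is independent of the main instrument's error. All variances, sample sizes, and names (`ols_slope`, `simulate`) are hypothetical.

```python
import numpy as np

def ols_slope(predictor, response):
    """Slope from a simple least-squares regression of response on predictor."""
    return np.cov(predictor, response)[0, 1] / np.var(predictor, ddof=1)

def simulate(n_main=200_000, n_valid=2_000, beta=1.0, seed=0):
    """Return (naive slope, estimated attenuation factor, corrected slope)."""
    rng = np.random.default_rng(seed)
    true_intake = rng.normal(0.0, 1.0, n_main)             # unobserved truth
    reported = true_intake + rng.normal(0.0, 1.0, n_main)  # main instrument (assumed classical error)
    outcome = beta * true_intake + rng.normal(0.0, 1.0, n_main)

    # Naive analysis: regress the outcome on the error-prone report
    naive = ols_slope(reported, outcome)

    # Hypothetical internal validation substudy: the first n_valid participants
    # also complete a reference instrument with independent, unbiased error
    reference = true_intake[:n_valid] + rng.normal(0.0, 0.5, n_valid)

    # Univariate regression calibration: the slope of reference on report
    # estimates the attenuation factor; dividing by it de-attenuates the slope
    attenuation = ols_slope(reported[:n_valid], reference)
    return naive, attenuation, naive / attenuation

naive, attenuation, corrected = simulate()
print(f"naive={naive:.3f} attenuation={attenuation:.3f} corrected={corrected:.3f}")
```

With equal variances for the true intake and the reporting error, the naive slope in this toy setting comes out at roughly half the true coefficient, and the calibrated slope moves back toward it at the cost of a wider uncertainty.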
Reproducibility of thoracic kyphosis measurements in patients with adolescent idiopathic scoliosis.

PubMed

Ohrt-Nissen, Søren; Cheung, Jason Pui Yin; Hallager, Dennis Winge; Gehrchen, Martin; Kwan, Kenny; Dahl, Benny; Cheung, Kenneth M C; Samartzis, Dino

2017-01-01

Current surgical treatment for adolescent idiopathic scoliosis (AIS) involves correction in both the coronal and sagittal plane, and thorough assessment of these parameters is essential for evaluation of surgical results. However, various definitions of thoracic kyphosis (TK) have been proposed, and the intra- and inter-rater reproducibility of these measures has not been determined. As such, the purpose of the current study was to determine the intra- and inter-rater reproducibility of several TK measurements used in the assessment of AIS. Twenty patients (90% females) surgically treated for AIS with alternate-level pedicle screw fixation were included in the study. Three raters independently evaluated pre- and postoperative standing lateral plain radiographs. For each radiograph, several definitions of TK were measured as well as L1-S1 and nonfixed lumbar lordosis. All variables were measured twice 14 days apart, and a mixed effects model was used to determine the repeatability coefficient (RC), which is a measure of the agreement between repeated measurements. Also, the intra- and inter-rater intra-class correlation coefficient (ICC) was determined as a measure of reliability. Preoperative median Cobb angle was 58° (range 41°-86°), and median surgical curve correction was 68% (range 49-87%). Overall intra-rater RC was highest for T2-T12 and nonfixed TK (11°) and lowest for T4-T12 and T5-T12 (8°). Inter-rater RC was highest for T1-T12, T1-nonfixed, and nonfixed TK (13°) and lowest for T5-T12 (9°). Agreement varied substantially between pre- and postoperative radiographs. Inter-rater ICC was highest for T4-T12 (0.92; 95% CI 0.88-0.95) and T5-T12 (0.92; 95% CI 0.88-0.95) and lowest for T1-nonfixed (0.80; 95% CI 0.72-0.88). Considerable variation for all TK measurements was noted. Intra- and inter-rater reproducibility was best for T4-T12 and T5-T12. Future studies should consider adopting a relevant minimum difference as a limit for true change in TK.
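As a rough illustration of what a repeatability coefficient expresses, the sketch below computes it for the simplest case of two repeated measurements per radiograph by one rater. It assumes the conventional definition RC = 1.96 × √2 × within-subject SD. The study itself estimated RC from a mixed effects model, which this simplified version does not reproduce. The function name and the angle values are hypothetical.

```python
import numpy as np

def repeatability_coefficient(first, second):
    """RC for paired repeated measurements (simplified, no mixed model).

    The within-subject SD is estimated from the paired differences d as
    sqrt(sum(d^2) / (2n)); RC = 1.96 * sqrt(2) * within-subject SD.
    """
    d = np.asarray(first, dtype=float) - np.asarray(second, dtype=float)
    within_sd = np.sqrt(np.sum(d ** 2) / (2 * d.size))
    return 1.96 * np.sqrt(2.0) * within_sd

# Hypothetical kyphosis angles (degrees) measured twice, 14 days apart
session_1 = [30.0, 40.0, 50.0, 35.0]
session_2 = [32.0, 38.0, 50.0, 39.0]
print(round(repeatability_coefficient(session_1, session_2), 1))
```

Under this definition, two measurements of the same radiograph are expected to differ by less than the RC in about 95% of cases, which is why an RC of 8° to 13° implies that smaller changes in TK cannot be distinguished from measurement variation.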
Proxies and Other External Raters: Methodological Considerations

PubMed Central

Snow, A Lynn; Cook, Karon F; Lin, Pay-Shin; Morgan, Robert O; Magaziner, Jay

2005-01-01

Objective: The purpose of this paper is to introduce researchers to the measurement and subsequent analysis considerations involved when using externally rated data. We will define and describe two categories of externally rated data, recommend methodological approaches for analyzing and interpreting data in these two categories, and explore factors affecting agreement between self-rated and externally rated reports. We conclude with a discussion of needs for future research. Data Sources/Study Setting: Data sources for this paper are previous published studies and reviews comparing self-rated with externally rated data. Study Design/Data Collection/Extraction Methods: This is a psychometric conceptual paper. Principal Findings: We define two types of externally rated data: proxy data and other-rated data. Proxy data refer to those collected from someone who speaks for a patient who cannot, will not, or is unavailable to speak for him or herself, whereas we use the term other-rater data to refer to situations in which the researcher collects ratings from a person other than the patient to gain multiple perspectives on the assessed construct. These two types of data differ in the way the measurement model is defined, the definition of the gold standard against which the measurements are validated, the analysis strategies appropriately used, and how the analyses are interpreted. There are many factors affecting the discrepancies between self- and external ratings, including characteristics of the patient, the proxy, and of the rated construct. Several psychological theories can be helpful in predicting such discrepancies. Conclusions: Externally rated data have an important place in health services research, but use of such data requires careful consideration of the nature of the data and how it will be analyzed and interpreted. PMID:16179002
Simulation on measurement of five-DOF motion errors of high precision spindle with cylindrical capacitive sensor

NASA Astrophysics Data System (ADS)

Zhang, Min; Wang, Wen; Xiang, Kui; Lu, Keqing; Fan, Zongwei

2015-02-01

This paper describes a novel cylindrical capacitive sensor (CCS) for measuring the five degree-of-freedom (DOF) motion errors of a spindle. The operating principle and mathematical models of the CCS are presented. The structural parameters of the end-face electrode are investigated by using Ansoft Maxwell software to calculate the capacitances in different configurations. Radial, axial and tilt motions are also simulated, and the simulated values are compared with the given displacements. The proposed CCS measures radial motion error with high accuracy when the average eccentricity is about 15 μm. In addition, the maximum relative error of the axial displacement is 1.3% when the axial motion is within [0.7, 1.3] mm, and the maximum relative error of the tilt displacement is 1.6% when the rotor tilts around a single axis within [-0.6, 0.6]°. Finally, the feasibility of the CCS for measuring five-DOF motion errors is verified through simulation and analysis.
Identifying the latent failures underpinning medication administration errors: an exploratory study.

PubMed

Lawton, Rebecca; Carruthers, Sam; Gardner, Peter; Wright, John; McEachan, Rosie R C

2012-08-01

The primary aim of this article was to identify the latent failures that are perceived to underpin medication errors. The study was conducted within three medical wards in a hospital in the United Kingdom. The study employed a cross-sectional qualitative design. Interviews were conducted with 12 nurses and eight managers. Interviews were transcribed and subject to thematic content analysis. A two-step inter-rater comparison tested the reliability of the themes. Ten latent failures were identified based on the analysis of the interviews. These were ward climate, local working environment, workload, human resources, team communication, routine procedures, bed management, written policies and procedures, supervision and leadership, and training. The discussion focuses on ward climate, the most prevalent theme, which is conceptualized here as interacting with failures in the nine other organizational structures and processes. This study is the first of its kind to identify the latent failures perceived to underpin medication errors in a systematic way. The findings can be used as a platform for researchers to test the impact of organization-level patient safety interventions and to design proactive error management tools and incident reporting systems in hospitals. © Health Research and Educational Trust.
Effects of Measurement Errors on Individual Tree Stem Volume Estimates for the Austrian National Forest Inventory

Treesearch

Ambros Berger; Thomas Gschwantner; Ronald E. McRoberts; Klemens Schadauer

2014-01-01

National forest inventories typically estimate individual tree volumes using models that rely on measurements of predictor variables such as tree height and diameter, both of which are subject to measurement error. The aim of this study was to quantify the impacts of these measurement errors on the uncertainty of the model-based tree stem volume estimates. The impacts...
Rater Biases in Genetically Informative Research Designs: Comment on Bartels, Boomsma, Hudziak, van Beijsterveldt, and van den Oord (2007)

ERIC Educational Resources Information Center

Hoyt, William T.

2007-01-01

Rater biases are of interest to behavior genetic researchers, who often use ratings data as a basis for studying heritability. Inclusion of multiple raters for each sibling pair (M. Bartels, D. I. Boomsma, J. J. Hudziak, T. C. E. M. van Beijsterveldt, & E. J. C. G. van den Oord, 2007) is a promising strategy for controlling bias variance and may…
GPS measurement error gives rise to spurious 180 degree turning angles and strong directional biases in animal movement data.

PubMed

Hurford, Amy

2009-05-20

Movement data are frequently collected using Global Positioning System (GPS) receivers, but recorded GPS locations are subject to errors. While past studies have suggested methods to improve location accuracy, mechanistic movement models utilize distributions of turning angles and directional biases and these data present a new challenge in recognizing and reducing the effect of measurement error. I collected locations from a stationary GPS collar, analyzed a probabilistic model and used Monte Carlo simulations to understand how measurement error affects measured turning angles and directional biases. Results from each of the three methods were in complete agreement: measurement error gives rise to a systematic bias where a stationary animal is most likely to be measured as turning 180 degrees or moving towards a fixed point in space. These spurious effects occur in GPS data when the measured distance between locations is <20 meters. Measurement error must be considered as a possible cause of 180 degree turning angles in GPS data. Consequences of failing to account for measurement error are predicting overly tortuous movement, numerous returns to previously visited locations, inaccurately predicting species range, core areas, and the frequency of crossing linear features. By understanding the effect of GPS measurement error, ecologists are able to disregard false signals to more accurately design conservation plans for endangered wildlife.
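The spurious reversals described in this abstract can be reproduced with a short Monte Carlo sketch. It assumes that fixes from a stationary collar are independent draws from an isotropic Gaussian around the true location, which is a simplification and not the author's error model. The 10 m standard deviation, the sample size, and the function names are hypothetical.

```python
import numpy as np

def simulate_stationary_fixes(n_fixes, sigma_m, seed=0):
    """Recorded (x, y) positions of a collar that never moves (true location = origin)."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, sigma_m, size=(n_fixes, 2))

def turning_angles(xy):
    """Turning angle between consecutive steps, wrapped to [-pi, pi)."""
    steps = np.diff(xy, axis=0)
    headings = np.arctan2(steps[:, 1], steps[:, 0])
    turns = np.diff(headings)
    return (turns + np.pi) % (2.0 * np.pi) - np.pi

fixes = simulate_stationary_fixes(100_000, sigma_m=10.0)
turns = turning_angles(fixes)

# Share of apparent turns sharper than 90 degrees (that is, closer to a reversal)
frac_reversal = np.mean(np.abs(turns) > np.pi / 2)
print(f"fraction of turns beyond 90 degrees: {frac_reversal:.3f}")
```

If the error had no directional effect, about half of the apparent turns would exceed 90 degrees. Under the independence assumption used here the share is close to three quarters, because each spurious step away from the true location tends to be followed by a step back toward it.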
