inter-rater reliability study: Topics by Science.gov

Sample records for inter-rater reliability study

The inter and intra rater reliability of the Netball Movement Screening Tool.

PubMed

Reid, Duncan A; Vanweerd, Rebecca J; Larmer, Peter J; Kingstone, Rachel

2015-05-01

To establish the inter- and intra-rater reliability of the Netball Movement Screening Tool (NMST) for screening adolescent female netball players. Inter- and intra-rater reliability study. Forty secondary school netball players were recruited to take part in the study. Twenty subjects were screened simultaneously and independently by two raters to ascertain inter-rater agreement. Twenty subjects were scored by rater one on two occasions, separated by a week, to ascertain intra-rater agreement. Inter- and intra-rater agreement was assessed using the two-way mixed intraclass correlation coefficient (ICC) and weighted kappa statistics. No significant demographic differences were found between the inter- and intra-rater groups of subjects. ICCs demonstrated excellent inter-rater (ICC 0.84, standard error of measurement 0.25) and intra-rater (ICC 0.96, standard error of measurement 0.13) reliability for the overall NMST score, and substantial to excellent inter-rater (ICC 0.65-1.0) and intra-rater (ICC 0.79-0.96) reliability for the component scores of the NMST. The kappa statistic showed substantial to poor inter-rater (k=0.75-0.32) and intra-rater (k=0.77-0.27) agreement for individual tests of the NMST. The NMST may be a reliable screening tool for adolescent netball players; however, the individual test scores have low reliability. The screening tool can be administered reliably by raters with similar levels of training in the tool but variable clinical experience. Ongoing research needs to be undertaken to ascertain whether the NMST is a valid tool for identifying increased injury risk in netball players.
Copyright © 2014 Sports Medicine Australia. Published by Elsevier Ltd. All rights reserved.
Inter-rater and intra-rater reliability of the Bahasa Melayu version of Rose Angina Questionnaire.

PubMed

Hassan, N B; Choudhury, S R; Naing, L; Conroy, R M; Rahman, A R A

2007-01-01

The objective of the study is to translate the Rose Questionnaire (RQ) into a Bahasa Melayu version and adapt it cross-culturally, and to measure its inter-rater and intra-rater reliability. This cross-sectional study was conducted in the respondents' homes or workplaces in Kelantan, Malaysia. One hundred respondents aged 30 and above with different socio-demographic status were interviewed for face validity. For each of the inter-rater and intra-rater reliability assessments, a sample of 150 respondents was interviewed. Inter-rater and intra-rater reliabilities were assessed by Cohen's kappa. The overall inter-rater agreement by the five pairs of interviewers at points one and two was 0.86, and the intra-rater reliability by the five interviewers on the seven-item questionnaire at points one and two was 0.88, as measured by the kappa coefficient. The translated Malay version of the RQ demonstrated almost perfect inter-rater and intra-rater reliability; further validation of this translated questionnaire, such as sensitivity and specificity analysis, is highly recommended.
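Several records in this listing report Cohen's kappa for two raters. As a rough illustration of what that statistic measures, the sketch below computes unweighted kappa from two lists of categorical ratings. The function name and the example labels are hypothetical and are not taken from any of the studies listed here.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same subjects."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of subjects")
    n = len(rater_a)
    # Observed agreement: share of subjects given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical ratings for ten respondents (illustration only).
ratings_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "no", "no"]
ratings_b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "no", "no"]
kappa = cohen_kappa(ratings_a, ratings_b)
```

In this made-up example the raters agree on 8 of 10 subjects, but because half of that agreement is expected by chance, kappa is noticeably lower than the raw 80% agreement.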
Inter-rater reliability of a modified version of Delitto et al.’s classification-based system for low back pain: a pilot study

PubMed Central

Apeldoorn, Adri T.; van Helvoirt, Hans; Ostelo, Raymond W.; Meihuizen, Hanneke; Kamper, Steven J.; van Tulder, Maurits W.; de Vet, Henrica C. W.

2016-01-01

Study design Observational inter-rater reliability study. Objectives To examine: (1) the inter-rater reliability of a modified version of Delitto et al.’s classification-based algorithm for patients with low back pain; (2) the influence of different levels of familiarity with the system; and (3) the inter-rater reliability of algorithm decisions in patients who clearly fit into a subgroup (clear classifications) and those who do not (unclear classifications). Methods Patients were examined twice on the same day by two of three participating physical therapists with different levels of familiarity with the system. Patients were classified into one of four classification groups. Raters were blind to the others’ classification decision. In order to quantify the inter-rater reliability, percentages of agreement and Cohen’s Kappa were calculated. Results A total of 36 patients were included (clear classification n = 23; unclear classification n = 13). The overall rate of agreement was 53% and the Kappa value was 0·34 [95% confidence interval (CI): 0·11–0·57], which indicated only fair inter-rater reliability. Inter-rater reliability for patients with a clear classification (agreement 52%, Kappa value 0·29) was not higher than for patients with an unclear classification (agreement 54%, Kappa value 0·33). Familiarity with the system (i.e. trained with written instructions and previous research experience with the algorithm) did not improve the inter-rater reliability. Conclusion Our pilot study challenges the inter-rater reliability of the classification procedure in clinical practice. Therefore, more knowledge is needed about factors that affect the inter-rater reliability, in order to improve the clinical applicability of the classification scheme. PMID:27559279
The Critical Thinking Analytic Rubric (CTAR): Investigating Intra-Rater and Inter-Rater Reliability of a Scoring Mechanism for Critical Thinking Performance Assessments

ERIC Educational Resources Information Center

Saxton, Emily; Belanger, Secret; Becker, William

2012-01-01

The purpose of this study was to investigate the intra-rater and inter-rater reliability of the Critical Thinking Analytic Rubric (CTAR). The CTAR is composed of 6 rubric categories: interpretation, analysis, evaluation, inference, explanation, and disposition. To investigate inter-rater reliability, two trained raters scored four sets of…
Is laser speckle contrast analysis (LASCA) the new kid on the block in systemic sclerosis? A systematic literature review and pilot study to evaluate reliability of LASCA to measure peripheral blood perfusion in scleroderma patients.

PubMed

Cutolo, Maurizio; Vanhaecke, Amber; Ruaro, Barbara; Deschepper, Ellen; Ickinger, Claudia; Melsens, Karin; Piette, Yves; Trombetta, Amelia Chiara; De Keyser, Filip; Smith, Vanessa

2018-06-06

A reliable tool to evaluate flow is paramount in systemic sclerosis (SSc). We describe a systematic literature review on the reliability of laser speckle contrast analysis (LASCA) to measure peripheral blood perfusion (PBP) in SSc, and an additional pilot study investigating the intra- and inter-rater reliability of LASCA. A systematic search was performed in 3 electronic databases, according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. In the pilot study, 30 SSc patients and 30 healthy subjects (HS) underwent LASCA assessment. Intra-rater reliability was assessed by having a first anchor rater perform the measurements at 2 time-points, and inter-rater reliability by having the anchor rater and a team of second raters perform the measurements in 15 SSc patients and 30 HS. The measurements were repeated with a second anchor rater in the other 15 SSc patients, as external validation. Only 1 of the 14 records of interest identified through the systematic search was included in the final analysis. In the additional pilot study, the intra-class correlation coefficient (ICC) for intra-rater reliability of the first anchor rater was 0.95 in SSc and 0.93 in HS, and the ICC for inter-rater reliability was 0.97 in SSc and 0.93 in HS. Intra- and inter-rater reliability of the second anchor rater was 0.78 and 0.87. The identified literature regarding the reliability of LASCA measurements reports good to excellent inter-rater agreement. This pilot study was able to confirm the reliability of LASCA measurements, with good to excellent inter-rater agreement, and additionally found good to excellent intra-rater reliability. Furthermore, similar results were found in the external validation. Copyright © 2018. Published by Elsevier B.V.
Inter-rater and intra-rater reliability of a movement control test in shoulder.

PubMed

Rajasekar, S; Bangera, Rakshith K; Sekaran, Padmanaban

2017-07-01

Movement faults are commonly observed in patients with musculoskeletal pain. The Kinetic Medial Rotation Test (KMRT) is a movement control test used to identify movement faults of the scapula and gleno-humeral joints during arm movement. Objective tests such as the KMRT need to be reliable and valid for the results to be applied across different clinical settings and patient populations. The primary objective of the present study was to determine the intra-rater and inter-rater reliability of KMRT in subjects with and without shoulder pain. Sixty subjects were included in this study based on specific inclusion and exclusion criteria. Two musculoskeletal physiotherapists with different levels of clinical experience performed the tests. The intra-rater reliability was tested in twenty asymptomatic subjects by a single assessor at two week intervals. An equal number of subjects with and without shoulder pain were tested by both the assessors to determine the inter-rater reliability. Both components of the KMRT, the Gleno-Humeral Anterior Translation (GHAT) and the Scapular Forward Tilt (SCFT), were tested. The Kappa values for inter-rater reliability of the GHAT and SCFT were K = 0.68 and K = 0.65, respectively, in subjects with shoulder pain. In asymptomatic subjects, the inter-rater reliability of GHAT was K = 0.61 and SCFT was K = 0.85. Intra-rater reliability ranged from K = 0.66 for GHAT to K = 0.87 for SCFT. Our study found substantial agreement in inter-rater reliability of KMRT in subjects with shoulder pain, whereas substantial to near perfect agreement was found in intra-rater and inter-rater reliability of KMRT in subjects without shoulder pain. Copyright © 2017 Elsevier Ltd. All rights reserved.
Introducing a new definition of a near fall: intra-rater and inter-rater reliability.

PubMed

Maidan, I; Freedman, T; Tzemah, R; Giladi, N; Mirelman, A; Hausdorff, J M

2014-01-01

Near falls (NFs) are more frequent than falls, and may occur before falls, potentially predicting fall risk. As such, identification of a NF is important. We aimed to assess intra and inter-rater reliability of the traditional definition of a NF and to demonstrate the potential utility of a new definition. To this end, 10 older adults, 10 idiopathic elderly fallers, and 10 patients with Parkinson's disease (PD) walked in an obstacle course while wearing a safety harness. All walks were videotaped. Forty-nine video segments were extracted to create 2 clips each of 8.48 min. Four raters scored each event using the traditional definition and, two weeks later, using the new definition. A fifth rater used only the new definition. Intra-rater reliability was determined using Kappa (K) statistics and inter-rater reliability was determined using ICC. Using the traditional definition, three raters had poor intra-rater reliability (K<0.054, p>0.137) and one rater had moderate intra-rater reliability (K=0.624, p<0.001). With the traditional definition, inter-rater reliability between the four raters was moderate (ICC=0.667, p<0.001). In contrast, the new NF definition showed high intra-rater (K>0.601, p<0.001) and excellent inter-rater reliability (ICC=0.815, p<0.001). A priori, it is easy to distinguish falls from usual walking and NFs, but it is more challenging to distinguish NFs from obstacle negotiation and usual walking. Therefore, a more precise definition of NF is required. The results of the present study suggest that the proposed new definition increases intra and inter-rater reliability, a critical step for using NFs to quantify fall risk. Copyright © 2013 Elsevier B.V. All rights reserved.
Inter- and intra-rater reliability of calliper-based lymph node measurement in dogs with peripheral nodal lymphomas.

PubMed

Childress, M O; Fulkerson, C M; Lahrman, S A; Weng, H-Y

2016-08-01

The purpose of this study was to assess reliability of lymph node measurements between and within raters in dogs with nodal lymphomas. Three raters measured lymph nodes from 20 dogs twice prior to and once after administering chemotherapy. Sum tumour volume (TV) and sum longest diameter (LD) of all lymph nodes at each time point, and the percent change in measurements following chemotherapy, were calculated for each dog. Inter- and intra-rater reliability were assessed with the intraclass correlation coefficient (ICC). ICC for inter-rater sum TV and sum LD prior to chemotherapy were 0.86 and 0.80, respectively. ICC for inter-rater sum TV and sum LD after chemotherapy were 0.95 and 0.91, respectively. ICC for percent change in sum TV and sum LD were 0.96 and 0.94, respectively. ICC for intra-rater reliability ranged from 0.90 to 0.98 for each rater. Inter- and intra-rater reliability in measurements among the three raters was good to excellent. © 2014 John Wiley & Sons Ltd.
Exploring Differences in Measurement and Reporting of Classroom Observation Inter-Rater Reliability

ERIC Educational Resources Information Center

Wilhelm, Anne Garrison; Gillespie Rouse, Amy; Jones, Francesca

2018-01-01

Although inter-rater reliability is an important aspect of using observational instruments, it has received little theoretical attention. In this article, we offer some guidance for practitioners and consumers of classroom observations so that they can make decisions about inter-rater reliability, both for study design and in the reporting of data…
Inter-rater reliability of the Full Outline of UnResponsiveness score and the Glasgow Coma Scale in critically ill patients: a prospective observational study

PubMed Central

2010-01-01

Introduction The Glasgow Coma Scale (GCS) is the most widely used scoring system for comatose patients in intensive care. Limitations of the GCS include the impossibility to assess the verbal score in intubated or aphasic patients, and an inconsistent inter-rater reliability. The FOUR (Full Outline of UnResponsiveness) score, a new coma scale not reliant on verbal response, was recently proposed. The aim of the present study was to compare the inter-rater reliability of the GCS and the FOUR score among unselected patients in general critical care. A further aim was to compare the inter-rater reliability of neurologists with that of intensive care unit (ICU) staff. Methods In this prospective observational study, scoring of GCS and FOUR score was performed by neurologists and ICU staff on 267 consecutive patients admitted to intensive care. Results In a total of 437 pair wise ratings the exact inter-rater agreement for the GCS was 71%, and for the FOUR score 82% (P = 0.0016); the inter-rater agreement within a range of ± 1 score point for the GCS was 90%, and for the FOUR score 92% (P = ns.). The exact inter-rater agreement among neurologists was superior to that among ICU staff for the FOUR score (87% vs. 79%, P = 0.04) but not for the GCS (73% vs. 73%). Neurologists and ICU staff did not significantly differ in the inter-rater agreement within a range of ± 1 score point for both GCS (88% vs. 93%) and the FOUR score (91% vs. 88%). Conclusions The FOUR score performed better than the GCS for exact inter-rater agreement, but not for the clinically more relevant agreement within the range of ± 1 score point. Though neurologists outperformed ICU staff with regard to exact inter-rater agreement, the inter-rater agreement of ICU staff within the clinically more relevant range of ± 1 score point equalled that of the neurologists. 
The small advantage in inter-rater reliability of the FOUR score is most likely insufficient to replace the GCS, a score with a long tradition in intensive care. PMID:20398274
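The distinction between exact agreement and agreement within ± 1 score point can be made concrete with a short sketch. The function name, the tolerance parameter, and the paired scores below are assumptions for illustration only and do not reproduce the study's data.

```python
def agreement_rates(scores_a, scores_b, tolerance=1):
    """Return (exact agreement, agreement within +/- tolerance) for paired scores."""
    pairs = list(zip(scores_a, scores_b))
    if not pairs:
        raise ValueError("at least one pair of ratings is required")
    n = len(pairs)
    exact = sum(a == b for a, b in pairs) / n
    within = sum(abs(a - b) <= tolerance for a, b in pairs) / n
    return exact, within


# Hypothetical paired coma-scale totals from two raters (illustration only).
rater_1 = [15, 8, 10, 3, 12, 7, 14, 9]
rater_2 = [15, 9, 10, 5, 12, 7, 13, 9]
exact, within_one = agreement_rates(rater_1, rater_2)
```

Here five of eight pairs match exactly and seven of eight differ by at most one point, which shows how a tolerance band can narrow the apparent gap between two scales.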
Inter-rater Reliability of Sustained Aberrant Movement Patterns as a Clinical Assessment of Muscular Fatigue

PubMed Central

Aerts, Frank; Carrier, Kathy; Alwood, Becky

2016-01-01

Background: The assessment of clinical manifestation of muscle fatigue is an effective procedure in establishing therapeutic exercise dose. Few studies have evaluated physical therapist reliability in establishing muscle fatigue through detection of changes in quality of movement patterns in a live setting. Objective: The purpose of this study is to evaluate the inter-rater reliability of physical therapists’ ability to detect altered movement patterns due to muscle fatigue. Design: A reliability study in a live setting with multiple raters. Participants: Forty-four healthy individuals (ages 19-35) were evaluated by six physical therapists in a live setting. Methods: Participants were evaluated by physical therapists for altered movement patterns during resisted shoulder rotation. Each participant completed a total of four tests: right shoulder internal rotation, right shoulder external rotation, left shoulder internal rotation and left shoulder external rotation. Results: For all tests combined, the inter-rater reliability for a single rater, ICC (2,1), was .65 (95% CI: .60, .71). This corresponds to moderate inter-rater reliability between physical therapists. Limitations: The results of this study apply only to healthy participants and therefore cannot be generalized to a symptomatic population. Conclusion: Moderate inter-rater reliability was found between physical therapists in establishing muscle fatigue through the observation of sustained altered movement patterns during dynamic resistive shoulder internal and external rotation. PMID:27347241
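For readers unfamiliar with the ICC (2,1) notation, the following is a minimal sketch of the two-way random-effects, absolute-agreement, single-rater ICC computed from ANOVA mean squares. It assumes NumPy is available and a complete subjects-by-raters matrix with no missing values; the function name and the small ratings matrix are illustrative and are not the study's data or software.

```python
import numpy as np


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n subjects) x (k raters) array with no missing values.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    # Sums of squares for subjects (rows), raters (columns), and residual error.
    ss_rows = k * ((x.mean(axis=1) - grand_mean) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand_mean) ** 2).sum()
    ss_error = ((x - grand_mean) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative matrix: six subjects (rows) scored by four raters (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(example)
```

Because this form penalises systematic differences between raters (the column term in the denominator), it is typically lower than a consistency-type ICC computed on the same data.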
Measuring the quality of life in mild to very severe dementia: testing the inter-rater and intra-rater reliability of the German version of the QUALIDEM.

PubMed

Dichter, Martin Nikolaus; Schwab, Christian G G; Meyer, Gabriele; Bartholomeyczik, Sabine; Dortmann, Olga; Halek, Margareta

2014-05-01

Quality of life (Qol) is an increasingly used outcome measure in dementia research. The QUALIDEM is a dementia-specific and proxy-rated Qol instrument. We aimed to determine the inter-rater and intra-rater reliability in residents with dementia in German nursing homes. The QUALIDEM consists of nine subscales that were applied to a sample of 108 people with mild to severe dementia and six consecutive subscales that were applied to a sample of 53 people with very severe dementia. The proxy raters were 49 registered nurses and nursing assistants. Inter-rater and intra-rater reliability scores were calculated on the subscale and item level. None of the QUALIDEM subscales showed strong inter-rater reliability based on the single-measure Intra-Class Correlation Coefficient (ICC) for absolute agreement ≥ 0.70. Based on the average-measure ICC for four raters, eight subscales for people with mild to severe dementia (care relationship, positive affect, negative affect, restless tense behavior, social relations, social isolation, feeling at home and having something to do) and five subscales for very severe dementia (care relationship, negative affect, restless tense behavior, social relations and social isolation) yielded a strong inter-rater agreement (ICC: 0.72-0.86). All of the QUALIDEM subscales, regardless of dementia severity, showed strong intra-rater agreement. The ICC values ranged between 0.70 and 0.79 for people with mild to severe dementia and between 0.75 and 0.87 for people with very severe dementia. This study demonstrated insufficient inter-rater reliability and sufficient intra-rater reliability for all subscales of both versions of the German QUALIDEM. The degree of inter-rater reliability can be improved by collaborative Qol rating by more than one nurse. The development of a measurement manual with accurate item definitions and a standardized education program for proxy raters is recommended.
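The gap between single-measure and average-measure ICCs follows from the Spearman-Brown relationship, which for the standard ICC forms links the reliability of one rater to the reliability of the mean of k raters. The sketch below illustrates that relationship with a made-up single-measure value; it is not a reconstruction of the authors' analysis, and the function name is hypothetical.

```python
def average_measure_icc(single_icc, k):
    """Spearman-Brown step-up: reliability of the mean of k raters."""
    if k <= 0:
        raise ValueError("k must be a positive number of raters")
    return k * single_icc / (1 + (k - 1) * single_icc)


# Hypothetical single-measure ICC of 0.40 averaged over four raters.
icc_four_raters = average_measure_icc(0.40, 4)
```

A single-measure ICC of 0.40 rises to roughly 0.73 when four raters are averaged, which is consistent with the pattern of weak single-measure but strong average-measure values described in the abstract.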
Development and inter-rater reliability of a standardized verbal instruction manual for the Chinese Geriatric Depression Scale-short form.

PubMed

Wong, M T P; Ho, T P; Ho, M Y; Yu, C S; Wong, Y H; Lee, S Y

2002-05-01

The Geriatric Depression Scale (GDS) is a common screening tool for elderly depression in Hong Kong. This study aimed at (1) developing a standardized manual for the verbal administration and scoring of the GDS-SF, and (2) comparing the inter-rater reliability between the standardized and non-standardized verbal administration of GDS-SF. Two studies were reported. In Study 1, the process of developing the manual was described. In Study 2, we compared the inter-rater reliabilities of GDS-SF scores using the standardized verbal instructions and the traditional non-standardized administration. Results of Study 2 indicated that the standardized procedure in verbal administration and scoring improved the inter-rater reliabilities of GDS-SF. Copyright 2002 John Wiley & Sons, Ltd.
Inter-rater Reliability of Real-Time Ultrasound to Measure Acromiohumeral Distance.

PubMed

Mackenzie, Tanya Anne; Bdaiwi, Alya H; Herrington, Lee; Cools, Ann

2016-07-01

Real-time ultrasound (RTUS) has been suggested as a reliable measure of acromiohumeral distance. However, to date, no rigorous assessment and reporting of inter-rater reliability of this method has been performed with the shoulder in a neutral position or with active and passive arm abduction. To assess intrasession inter-rater reliability of using RTUS to measure acromiohumeral distance with the shoulder in a neutral position and with 60° active and passive abduction. Inter-rater intrasession reliability of repeated measures. Human performance laboratory. Twenty persons (12 male and 8 female) with an average age of 29.86 years (standard deviation, 7.8). In an inter-rater, intrasession study, RTUS was used to measure the acromiohumeral distance with the shoulder in a neutral position and with 60° of both active and passive abduction. Acromiohumeral distance. Intraclass correlation coefficient, ICC (2,1), scores ranged between 0.65-0.88 (standard error of the mean = 0.81-1.2 mm and minimal detectable differences with 95% confidence = 2.2-2.3 mm) for inter-rater intrasession reliability. RTUS was found to have fair to good inter-rater reliability as a tool to measure acromiohumeral distance with the shoulder in a neutral position and with 60° of both active and passive arm abduction. Copyright © 2016 American Academy of Physical Medicine and Rehabilitation. Published by Elsevier Inc. All rights reserved.
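Reliability studies often report a standard error of measurement (SEM) and a minimal detectable change or difference at 95% confidence alongside the ICC. The sketch below uses the conventional formulas SEM = SD × sqrt(1 − ICC) and MDC95 = 1.96 × sqrt(2) × SEM. Whether the authors applied exactly these formulas is not stated in the abstract, and the standard deviation used in the example is hypothetical.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement from the sample SD and a reliability coefficient."""
    return sd * math.sqrt(1 - icc)


def mdc95(sem):
    """Minimal detectable change at 95% confidence for a difference of two measurements."""
    return 1.96 * math.sqrt(2) * sem


# Hypothetical inputs: SD of 2.0 mm and an ICC of 0.84 (illustration only).
sem = sem_from_icc(2.0, 0.84)
mdc = mdc95(sem)
```

The factor sqrt(2) reflects that a change score combines the measurement error of two separate measurements.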
Inter-rater reliability of output measures for a posture matching assessment approach: a pilot study with food service workers.

PubMed

Cann, A P; Connolly, M; Ruuska, R; MacNeil, M; Birmingham, T B; Vandervoort, A A; Callaghan, J P

2008-04-01

Despite the ongoing health problem of repetitive strain injuries, there are few tools currently available for ergonomic applications evaluating cumulative loading that have well-documented evidence of reliability and validity. The purpose of this study was to determine the inter-rater reliability of a posture matching based analysis tool (3DMatch, University of Waterloo) for predicting cumulative and peak spinal loads. A total of 30 food service workers were each videotaped for a 1-h period while performing typical work activities and a single work task was randomly selected from each for analysis by two raters. Inter-rater reliability was determined using intraclass correlation coefficients (ICC) model 2,1 and standard errors of measurement for cumulative and peak spinal and shoulder loading variables across all subjects. Overall, 85.5% of variables had moderate to excellent inter-rater reliability, with ICCs ranging from 0.30-0.99 for all cumulative and peak loading variables. 3DMatch was found to be a reliable ergonomic tool when more than one rater is involved.
Impact of clinical history on chest radiograph interpretation.

PubMed

Test, Matthew; Shah, Samir S; Monuteaux, Michael; Ambroggio, Lilliam; Lee, Edward Y; Markowitz, Richard I; Bixby, Sarah; Diperna, Stephanie; Servaes, Sabah; Hellinger, Jeffrey C; Neuman, Mark I

2013-07-01

The inclusion of clinical information may have unrecognized influence in the interpretation of diagnostic testing. The objective of the study was to determine the impact of clinical history on chest radiograph interpretation in the diagnosis of pneumonia. Prospective case-based study. Radiologists interpreted 110 radiographs of children evaluated for suspicion of pneumonia. Clinical information was withheld during the first interpretation. After 6 months the radiographs were reviewed with clinical information. Radiologists reported on pneumonia indicators described by the World Health Organization (ie, any infiltrate, alveolar infiltrate, interstitial infiltrate, air bronchograms, hilar adenopathy, pleural effusion). Children's Hospital of Philadelphia and Boston Children's Hospital. Six board-certified radiologists. Inter- and intra-rater reliability were assessed using the kappa statistic. The addition of clinical history did not have a substantial impact on the inter-rater reliability in the identification of any infiltrate, alveolar infiltrate, interstitial infiltrate, pleural effusion, or hilar adenopathy. Inter-rater reliability in the identification of air bronchograms improved from fair (k = 0.32) to moderate (k = 0.53). Intra-rater reliability for the identification of alveolar infiltrate remained substantial to almost perfect for all 6 raters with and without clinical information. One rater had a decrease in inter-rater reliability from almost perfect (k = 1.0) to fair (k = 0.21) in the identification of interstitial infiltrate with the addition of clinical history. Alveolar infiltrate and pleural effusion are findings with high intra- and inter-rater reliability in the diagnosis of bacterial pneumonia. The addition of clinical information did not have a substantial impact on the reliability of these findings. © 2012 Society of Hospital Medicine.
Intra- and inter-rater reliability of 3D passive intervertebral motion in subjects with nonspecific neck pain assessed by physical therapy students: A pilot study.

PubMed

Rossettini, Giacomo; Rondoni, Angie; Lovato, Tommaso; Strobe, Marco; Verzè, Elisa; Vicentini, Marco; Testa, Marco

2016-06-03

Passive Intervertebral Movements (PIVMs) are commonly used to assess and treat patients with nonspecific neck pain. Only very few studies have investigated 3D movements until now. This study assessed intra- and inter-rater reliability of three-dimensional (3D) cervical PIVMs performed by physical therapy students in patients with nonspecific neck pain. Thirty-one patients, mean age 47.2 ± 7.2 years, were independently evaluated by 2 physical therapy students. The raters (A and B) assessed mobility, end-feel and pain provocation, performing the 3D cervical segmental side-bending test (3D CSSB) bilaterally from levels C2-C3 to C6-C7. Percentage agreement (raw, positive and negative), Cohen's kappa (95% CI), prevalence index and bias index were calculated to estimate intra- and inter-rater reliability. Intra-rater reliability showed kappa values ranging between fair and substantial (k 0.29-0.80) for pain provocation, mobility and end-feel, with percentage agreements between 61% and 90%. Inter-rater reliability presented kappa values ranging between fair and substantial (k 0.22-0.62) for pain provocation, mobility and end-feel, with percentage agreements between 61% and 80%. Intra-rater reliability of 3D PIVMs was superior to inter-rater reliability in patients with nonspecific neck pain. The most repeatable evaluation parameter was pain. However, the overall poor reliability suggests avoiding the use of these techniques alone to examine patients and measure their outcome. Further studies are needed to investigate the reliability of PIVMs in combination with other assessment procedures in symptomatic patients.
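The agreement measures named in this abstract (raw, positive and negative agreement, prevalence index and bias index) all derive from a 2×2 table of two raters' binary judgements. The sketch below uses the commonly cited definitions; the cell counts are hypothetical, and the function name and cell labelling (a = both positive, b and c = discordant cells, d = both negative) are assumptions made for illustration.

```python
def two_by_two_indices(a, b, c, d):
    """Agreement indices for a 2x2 table of two raters' binary ratings.

    a: both raters positive; d: both raters negative;
    b: rater A positive, rater B negative; c: rater A negative, rater B positive.
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("the table must contain at least one observation")
    return {
        "raw_agreement": (a + d) / n,
        "positive_agreement": 2 * a / (2 * a + b + c),
        "negative_agreement": 2 * d / (2 * d + b + c),
        # Prevalence index: imbalance between the two agreement cells.
        "prevalence_index": abs(a - d) / n,
        # Bias index: imbalance between the two disagreement cells.
        "bias_index": abs(b - c) / n,
    }


# Hypothetical counts for 31 subjects (illustration only).
indices = two_by_two_indices(a=20, b=3, c=5, d=3)
```

A high prevalence index, as in this made-up table, is one reason kappa can be low even when raw agreement looks acceptable, which is why these indices are usually reported next to kappa.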
Anatomical landmark position--can we trust what we see? Results from an online reliability and validity study of osteopaths.

PubMed

Pattyn, Elise; Rajendran, Dévan

2014-04-01

Practitioners traditionally use observation to classify the position of patients' anatomical landmarks. This information may contribute to diagnosis and patient management. To calculate a) Inter-rater reliability of categorising the sagittal plane position of four anatomical landmarks (lateral femoral epicondyle, greater trochanter, mastoid process and acromion) on side-view photographs (with landmarks highlighted and not-highlighted) of anonymised subjects; b) Intra-rater reliability; c) Individual landmark inter-rater reliability; d) Validity against a 'gold standard' photograph. Online inter- and intra-rater reliability study. Photographed subjects: convenience sample of asymptomatic students; raters: randomly selected UK registered osteopaths. 40 photographs of 30 subjects were used, a priori clinically acceptable reliability was ≥0.4. Inter-rater arm: 20 photographs without landmark highlights plus 10 with highlights; Intra-rater arm: 10 duplicate photographs (non-highlighted landmarks). Validity arm: highlighted landmark scores versus 'gold standard' photographs with vertical line. Research ethics approval obtained. Osteopaths (n = 48) categorised landmark position relative to imagined vertical-line; Gwet's Agreement Coefficient 1 (AC1) calculated and chance-corrected coefficient benchmarked against Landis and Koch's scale; Validity calculation used Kendall's tau-B. Inter-rater reliability was 'fair' (AC1 = 0.342; 95% confidence interval (CI) = 0.279-0.404) for non-highlighted landmarks and 'moderate' (AC1 = 0.700; 95% CI = 0.596-0.805) for highlighted landmarks. Intra-rater reliability was 'fair' (AC1 = 0.522); range was 'poor' (AC1 = 0.160) to 'substantial' (AC1 = 0.896). No differences were found between individual landmarks. Validity was 'low' (TB = 0.327; p = 0.104). Both inter- and intra-rater reliability was 'fair' but below clinically acceptable levels, validity was 'low'. 
Together these results challenge the clinical practice of using observation to categorise anterio-posterior landmark position. Copyright © 2014 Elsevier Ltd. All rights reserved.
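Gwet's AC1 is less familiar than kappa, so a simplified sketch may help. The version below handles only two raters, whereas the study pooled 48 raters, so it illustrates the chance-correction idea and not the study's actual computation. The category labels, ratings, and function name are hypothetical.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories):
    """Gwet's AC1 for two raters and a fixed list of nominal categories."""
    n = len(rater_a)
    q = len(categories)
    if n == 0 or n != len(rater_b) or q == 1:
        raise ValueError("need paired ratings and at least two categories")
    p_agree = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Average proportion of ratings falling in each category across both raters.
    pi = [(count_a[c] + count_b[c]) / (2 * n) for c in categories]
    # AC1 chance agreement: sum of pi_k * (1 - pi_k), scaled by 1 / (q - 1).
    p_chance = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_agree - p_chance) / (1 - p_chance)


# Hypothetical landmark positions judged by two raters (illustration only).
cats = ["anterior", "in line", "posterior"]
obs_a = ["anterior", "anterior", "in line", "in line", "posterior",
         "in line", "anterior", "in line", "in line", "posterior"]
obs_b = ["anterior", "in line", "in line", "in line", "posterior",
         "in line", "anterior", "anterior", "in line", "posterior"]
ac1 = gwet_ac1(obs_a, obs_b, cats)
```

Unlike kappa, the chance term here shrinks as ratings concentrate in one category, so AC1 tends to be more stable when category prevalence is very uneven.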
Intra and Inter-Rater Reliability of Screening for Movement Impairments: Movement Control Tests from The Foundation Matrix

PubMed Central

Mischiati, Carolina R.; Comerford, Mark; Gosford, Emma; Swart, Jacqueline; Ewings, Sean; Botha, Nadine; Stokes, Maria; Mottram, Sarah L.

2015-01-01

Pre-season screening is well established within the sporting arena, and aims to enhance performance and reduce injury risk. With the increasing need to identify potential injury with greater accuracy, a new risk assessment process has been produced; The Performance Matrix (battery of movement control tests). As with any new method of objective testing, it is fundamental to establish whether the same results can be reproduced between examiners and by the same examiner on consecutive occasions. This study aimed to determine the intra-rater test re-test and inter-rater reliability of tests from a component of The Performance Matrix, The Foundation Matrix. Twenty participants were screened by two experienced musculoskeletal therapists using nine tests to assess the ability to control movement during specific tasks. Movement evaluation criteria for each test were rated as pass or fail. The therapists observed participants real-time and tests were recorded on video to enable repeated ratings four months later to examine intra-rater reliability (videos rated two weeks apart). Overall test percentage agreement was 87% for inter-rater reliability; 98% Rater 1, 94% Rater 2 for test re-test reliability; and 75% for real-time versus video. Intraclass-correlation coefficients (ICCs) were excellent between raters (0.81) and within raters (Rater 1, 0.96; Rater 2, 0.88) but poor for real-time versus video (0.23). Reliability for individual components of each test was more variable: inter-rater, 68-100%; intra-rater, 88-100% Rater 1, 75-100% Rater 2; and real-time versus video 31-100%. Cohen’s Kappa values for inter-rater reliability were 0.0-1.0; intra-rater 0.6-1.0 for Rater 1; -0.1-1.0 for Rater 2; and -0.1-1 for real-time versus video. It is concluded that both inter and intra-rater reliability of tests in The Foundation Matrix are acceptable when rated by experienced therapists. 
Recommendations are made for modifying some of the criteria to improve reliability where excellence was not reached. Key points: (1) The movement control tests of The Foundation Matrix had acceptable reliability between raters and within raters on different days. (2) Agreement between observations made on tests performed real-time and on video recordings was low, indicating poor validity of use of video recordings. (3) Some movement evaluation criteria related to specific tests that did not achieve excellent agreement could be modified to improve reliability. PMID:25983594
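
The abstract above reports both percentage agreement and Cohen's kappa for pass/fail criteria. A minimal sketch of how the two statistics relate is given here, assuming two raters and nominal ratings; the ratings are hypothetical and are not data from the study, and the function names are illustrative only.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of items on which the two raters gave the same rating."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    categories = set(count_a) | set(count_b)
    # Chance agreement from the product of each rater's marginal proportions.
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use a single category
    return (p_o - p_e) / (1 - p_e)


# Hypothetical pass/fail ratings (assumption, not study data).
rater_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
rater_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Because kappa subtracts the agreement expected by chance, a criterion that almost every participant passes can show high percentage agreement and still have a kappa near zero, which is consistent with the wide kappa ranges reported alongside high agreement percentages.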
Reproducibility of African giant pouched rats detecting Mycobacterium tuberculosis.

PubMed

Ellis, Haylee; Mulder, Christiaan; Valverde, Emilio; Poling, Alan; Edwards, Timothy

2017-04-24

African pouched rats sniffing sputum samples provided by local clinics have significantly increased tuberculosis case findings in Tanzania and Mozambique. The objective of this study was to determine the reproducibility of rat results. Over an 18-month period 11,869 samples were examined by the rats. Intra-rater reliability was assessed through Yule's Q. Inter-rater reliability was assessed with Krippendorff's alpha. Intra-rater reliability was high, with a mean Yule's Q of 0.9. Inter-rater agreement was fair, with Krippendorff's alpha ranging from 0.15 to 0.45. Both intra- and inter-rater reliability were independent of the sex of the animals, but they were positively correlated with age. Both intra- and inter-rater agreement were lowest for samples designated as smear-negative by the clinics. Overall, the reproducibility of tuberculosis detection rat results was fair and diagnostic results were therefore independent of the rats used.
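
Yule's Q, used above for intra-rater reliability, can be computed from a 2x2 table of repeated binary decisions. The sketch below assumes the table cross-classifies a first and a second evaluation of the same samples; the counts are hypothetical and the function name is illustrative.

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2x2 table [[a, b], [c, d]]: (ad - bc) / (ad + bc)."""
    concordant = a * d
    discordant = b * c
    if concordant + discordant == 0:
        raise ValueError("Yule's Q is undefined when ad + bc is zero")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical counts (assumption, not study data):
# a = positive on both evaluations, b = positive on the first only,
# c = positive on the second only, d = negative on both evaluations.
q = yules_q(a=40, b=10, c=8, d=142)
print(f"Yule's Q = {q:.3f}")
```

Q ranges from -1 to 1, with values close to 1 indicating that the two evaluations nearly always point the same way.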

Harmonization Process and Reliability Assessment of Anthropometric Measurements in the Elderly EXERNET Multi-Centre Study

PubMed Central

Gómez-Cabello, Alba; Vicente-Rodríguez, Germán; Albers, Ulrike; Mata, Esmeralda; Rodriguez-Marroyo, Jose A.; Olivares, Pedro R.; Gusi, Narcis; Villa, Gerardo; Aznar, Susana; Gonzalez-Gross, Marcela; Casajús, Jose A.; Ara, Ignacio

2012-01-01

Background The elderly EXERNET multi-centre study aims to collect normative anthropometric data for old functionally independent adults living in Spain. Purpose To describe the standardization process and reliability of the anthropometric measurements carried out in the pilot study and during the final workshop, examining both intra- and inter-rater errors for measurements. Materials and Methods A total of 98 elderly people from five different regions participated in the intra-rater error assessment, and 10 different seniors living in the city of Toledo (Spain) participated in the inter-rater assessment. We examined both intra- and inter-rater errors for heights and circumferences. Results For height, intra-rater technical errors of measurement (TEMs) were smaller than 0.25 cm. For circumferences and knee height, TEMs were smaller than 1 cm, except for waist circumference in the city of Cáceres. Reliability for heights and circumferences was greater than 98% in all cases. Inter-rater TEMs were 0.61 cm for height, 0.75 cm for knee-height and ranged between 2.70 and 3.09 cm for the circumferences measured. Inter-rater reliabilities for anthropometric measurements were always higher than 90%. Conclusion The harmonization process, including the workshop and pilot study, guarantees the quality of the anthropometric measurements in the elderly EXERNET multi-centre study. High reliability and low TEM may be expected when assessing anthropometry in the elderly population. PMID:22860013
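
The technical error of measurement (TEM) and the reliability percentages quoted above are commonly computed as TEM = sqrt(sum(d^2) / 2n) and R = 1 - TEM^2 / SD^2. The sketch below assumes two measurements per subject and takes SD over all pooled measurements, which is one common convention; the abstract does not state which convention the authors used, and the heights are hypothetical.

```python
import math
import statistics


def technical_error_of_measurement(first, second):
    """TEM = sqrt(sum(d^2) / (2 * n)) for two measurements per subject."""
    n = len(first)
    squared_diffs = sum((x - y) ** 2 for x, y in zip(first, second))
    return math.sqrt(squared_diffs / (2 * n))


def reliability_coefficient(first, second):
    """R = 1 - TEM^2 / SD^2, with SD taken over all pooled measurements."""
    tem = technical_error_of_measurement(first, second)
    sd = statistics.stdev(list(first) + list(second))
    return 1 - tem ** 2 / sd ** 2


# Hypothetical repeated height measurements in cm (assumption, not study data).
first_pass = [160.0, 152.5, 171.0, 165.5, 158.0]
second_pass = [160.5, 152.0, 171.0, 166.0, 158.5]

tem = technical_error_of_measurement(first_pass, second_pass)
r = reliability_coefficient(first_pass, second_pass)
print(f"TEM = {tem:.2f} cm, R = {100 * r:.1f}%")
```

TEM is expressed in the unit of the measurement, whereas R is unitless and depends on how much the subjects differ from one another, so the same TEM gives a lower R in a more homogeneous sample.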
Measurement of glenohumeral joint translation using real-time ultrasound imaging: A physiotherapist and sonographer intra-rater and inter-rater reliability study.

PubMed

Rathi, Sangeeta; Taylor, Nicholas F; Gee, Jamie; Green, Rodney A

2016-12-01

Ultrasonography is an economical and non-invasive method for measuring real-time joint movements. Although physiotherapists are increasingly using ultrasound imaging for rotator cuff disorders, there is a lack of evidence on their reliability in using ultrasonography to measure glenohumeral translation. The aim of this study was to evaluate the reliability of a physiotherapist in measuring anterior and posterior glenohumeral joint translation with ultrasound. Study design: within day reliability. Anterior and posterior glenohumeral translations were measured at rest, in response to passive accessory motion testing force, and with isometric internal and external rotation in 12 young healthy adults. All the measurements were made in real time by a physiotherapist and an experienced sonographer in two positions (neutral and abducted) and in two views (anterior and posterior). Intra-rater and inter-rater reliability were expressed using intraclass correlation coefficients (ICC) and measurement error (mm). Intra-rater reliability was good for both raters (physiotherapist ICC: 0.86-0.98; sonographer ICC: 0.85-0.96). The inter-rater reliability between the physiotherapist and sonographer was moderate to good for posterior measurements (ICC 0.50-0.75) and poor to moderate for anterior measurements (ICC 0.31-0.53). For both intra-rater and inter-rater measurements, posterior translation was more reliable than the anterior translation with smaller measurement errors (posterior: 0.1-0.2 mm, anterior: 0.2-0.3 mm). A physiotherapist with minimal training was reliable in measuring glenohumeral joint translations. The ultrasound method was reliable for repeated measurement of both anterior and posterior glenohumeral translations with posterior measurements being more reliable than anterior. This method is recommended for future research to investigate the stabilising role of rotator cuff muscles. Copyright © 2016 Elsevier Ltd. All rights reserved.
Inter and intra-rater reliability of mobile device goniometer in measuring lumbar flexion range of motion.

PubMed

Bedekar, Nilima; Suryawanshi, Mayuri; Rairikar, Savita; Sancheti, Parag; Shyam, Ashok

2014-01-01

Evaluation of range of motion (ROM) is an integral part of the assessment of the musculoskeletal system. It is required in health, fitness and pathological conditions, and it is also used as an objective outcome measure. Several methods are described to check spinal flexion range of motion. Different methods for measuring spine ranges have their advantages and disadvantages. Hence, a new device was introduced in this study using the dual inclinometer method to measure lumbar spine flexion ROM. The aim was to determine the intra- and inter-rater reliability of a mobile device goniometer in measuring lumbar flexion range of motion. An iPod mobile device with goniometer software was used. The part being measured, i.e. the back of the subject, was suitably exposed. The subject stood with feet shoulder-width apart. The spinous processes of the second sacral vertebra (S2) and T12 were located; these were used as the reference points and readings were taken. Three readings were taken for each of inter-rater and intra-rater reliability. Sufficient rest was given between each flexion movement. Intra-rater reliability using ICC was r=0.920 and inter-rater r=0.812 at CI 95%. Validity r=0.95. The mobile device goniometer has high intra-rater reliability. The inter-rater reliability was moderate. This device can be used to assess range of motion of spine flexion, representing uni-planar movement.
Inter-rater reliability and generalizability of patient note scores using a scoring rubric based on the USMLE Step-2 CS format.

PubMed

Park, Yoon Soo; Hyderi, Abbas; Bordage, Georges; Xing, Kuan; Yudkowsky, Rachel

2016-10-01

Recent changes to the patient note (PN) format of the United States Medical Licensing Examination have challenged medical schools to improve the instruction and assessment of students taking the Step-2 clinical skills examination. The purpose of this study was to gather validity evidence regarding response process and internal structure, focusing on inter-rater reliability and generalizability, to determine whether a locally-developed PN scoring rubric and scoring guidelines could yield reproducible PN scores. A randomly selected subsample of historical data (post-encounter PN from 55 of 177 medical students) was rescored by six trained faculty raters in November-December 2014. Inter-rater reliability (% exact agreement and kappa) was calculated for five standardized patient cases administered in a local graduation competency examination. Generalizability studies were conducted to examine the overall reliability. Qualitative data were collected through surveys and a rater-debriefing meeting. The overall inter-rater reliability (weighted kappa) was .79 (Documentation = .63, Differential Diagnosis = .90, Justification = .48, and Workup = .54). The majority of score variance was due to case specificity (13%) and case-task specificity (31%), indicating differences in student performance by case and by case-task interactions. Variance associated with raters and their interactions was modest (<5%). Raters felt that justification was the most difficult task to score and that having case and level-specific scoring guidelines during training was most helpful for calibration. The overall inter-rater reliability indicates a high level of confidence in the consistency of note scores. Designs for scoring notes may optimize reliability by balancing the number of raters and cases.
Inter-rater reliability of direct observations of the physical and psychosocial working conditions in eldercare: An evaluation in the DOSES project.

PubMed

Karstad, Kristina; Rugulies, Reiner; Skotte, Jørgen; Munch, Pernille Kold; Greiner, Birgit A; Burdorf, Alex; Søgaard, Karen; Holtermann, Andreas

2018-05-01

The aim of the study was to develop and evaluate the reliability of the "Danish observational study of eldercare work and musculoskeletal disorders" (DOSES) observation instrument to assess physical and psychosocial risk factors for musculoskeletal disorders (MSD) in eldercare work. During 1.5 years, sixteen raters conducted 117 inter-rater observations from 11 nursing homes. Reliability was evaluated using percent agreement and Gwet's AC1 coefficient. Of the 18 examined items, inter-rater reliability was excellent for 7 items (AC1>0.75), fair to good for 7 items (AC1 0.40-0.75) and poor for 2 items (AC1 0-0.40). For 2 items there was no agreement between the raters (AC1 <0). The reliability did not differ between the first and second half of the data collection period and the inter-rater observations were representative regarding occurrence of events in eldercare work. The instrument is appropriate for assessing physical and psychosocial risk factors for MSD among eldercare workers. Copyright © 2018 The Authors. Published by Elsevier Ltd. All rights reserved.
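
Gwet's AC1, the coefficient used above, replaces Cohen's chance-agreement term with one based on the average prevalence of each category. A minimal two-rater sketch follows, assuming nominal categories supplied by the caller; the observations are hypothetical and the function name is illustrative.

```python
def gwet_ac1(rater_a, rater_b, categories):
    """Gwet's AC1 for two raters and a fixed list of possible categories."""
    n = len(rater_a)
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two possible categories")
    # Observed agreement.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum of pi_k * (1 - pi_k) / (q - 1), where pi_k is the
    # mean of the two raters' marginal proportions for category k.
    p_e = 0.0
    for category in categories:
        pi = (rater_a.count(category) + rater_b.count(category)) / (2 * n)
        p_e += pi * (1 - pi)
    p_e /= q - 1
    return (p_a - p_e) / (1 - p_e)


# Hypothetical observations of a rare event (assumption, not study data):
# 17 joint "no", 1 joint "yes", and 2 disagreements.
rater_1 = ["no"] * 17 + ["yes"] + ["yes", "no"]
rater_2 = ["no"] * 17 + ["yes"] + ["no", "yes"]

ac1 = gwet_ac1(rater_1, rater_2, categories=["yes", "no"])
print(f"AC1 = {ac1:.3f}")
```

For these hypothetical ratings the observed agreement is 90% and Cohen's kappa would be about 0.44, while AC1 is considerably higher; this lower sensitivity to skewed prevalence is a common reason for choosing AC1 when the observed events are rare.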
Inter-Rater Reliability and Intra-Rater Reliability of Assessing the 2-Minute Push-Up Test.

PubMed

Fielitz, Lynn; Coelho, Jeffrey; Horne, Thomas; Brechue, William

2016-02-01

The purpose of this study was to assess inter-rater reliability and intra-rater reliability of the 2-minute, 90° push-up test as utilized in the Army Physical Fitness Test. Analysis of rater assessment reliability included both total score agreement and agreement across individual push-up repetitions. This study utilized 8 Raters who assessed 15 different videotaped push-up performances over 4 iterations separated by a minimum of 1 week. The 15 push-up participants were videotaped during the semiannual Army Physical Fitness Test. Each Rater randomly viewed the 15 push-up performances and verbally responded with a "yes" or "no" to each push-up repetition. The data generated were analyzed using the Pearson product-moment correlation as well as the kappa, modified kappa and the intra-class correlation coefficient (3,1). An attribute agreement analysis was conducted to determine the percent of inter-rater and intra-rater agreement across individual push-ups. The results indicated that Raters varied a great deal in assessing push-ups. Over the 4 trials of 15 participants, the overall scores of the Raters varied between 3.0 and 35.7 push-ups. Post hoc comparisons found that there was a significant increase in the grand mean of push-ups from trials 1-3 to trial 4 (p < 0.05). Also, there was a significant difference among raters over the 4 trials (p < 0.05). Pearson correlation coefficients for inter-rater reliability were between 0.10 and 0.97. Intra-rater coefficients were between 0.48 and 0.99. Intra-rater agreement for individual push-up repetitions ranged from 41.8% to 84.8%. The results indicated that the raters failed to assess the same push-up repetition with the same score (below 70% agreement) as well as failed to agree when viewed between raters (29%).
Interestingly, as previously mentioned, scores on trial 4 increased significantly, which might have been caused by rater drift or by the Raters not maintaining the push-up standard over the trials. It does appear that the final push-up scores received by each participant were a close approximation of actual performance (within 65%) but when assessing physical performance for retention in the Army, a more reliable test might be considered. Reprint & Copyright © 2016 Association of Military Surgeons of the U.S.
How reliable are Functional Movement Screening scores? A systematic review of rater reliability.

PubMed

Moran, Robert W; Schneiders, Anthony G; Major, Katherine M; Sullivan, S John

2016-05-01

Several physical assessment protocols to identify intrinsic risk factors for injury aetiology related to movement quality have been described. The Functional Movement Screen (FMS) is a standardised, field-expedient test battery intended to assess movement quality and has been used clinically in preparticipation screening and in sports injury research. To critically appraise and summarise research investigating the reliability of scores obtained using the FMS battery. Systematic literature review. Systematic search of Google Scholar, Scopus (including ScienceDirect and PubMed), EBSCO (including Academic Search Complete, AMED, CINAHL, Health Source: Nursing/Academic Edition), MEDLINE and SPORTDiscus. Studies meeting eligibility criteria were assessed by 2 reviewers for risk of bias using the Quality Appraisal of Reliability Studies checklist. Overall quality of evidence was determined using van Tulder's levels of evidence approach. 12 studies were appraised. Overall, there was a 'moderate' level of evidence in favour of 'acceptable' (intraclass correlation coefficient ≥0.6) inter-rater and intra-rater reliability for composite scores derived from live scoring. For inter-rater reliability of composite scores derived from video recordings there was 'conflicting' evidence, and 'limited' evidence for intra-rater reliability. For inter-rater reliability based on live scoring of individual subtests there was 'moderate' evidence of 'acceptable' reliability (κ≥0.4) for 4 subtests (Deep Squat, Shoulder Mobility, Active Straight-leg Raise, Trunk Stability Push-up) and 'conflicting' evidence for the remaining 3 (Hurdle Step, In-line Lunge, Rotary Stability). This review found 'moderate' evidence that raters can achieve acceptable levels of inter-rater and intra-rater reliability of composite FMS scores when using live ratings. 
Overall, there were few high-quality studies, and the quality of several studies was impacted by poor study reporting particularly in relation to rater blinding. Published by the BMJ Publishing Group Limited.
Reliability of doming and toe flexion testing to quantify foot muscle strength.

PubMed

Ridge, Sarah Trager; Myrer, J William; Olsen, Mark T; Jurgensmeier, Kevin; Johnson, A Wayne

2017-01-01

Quantifying the strength of the intrinsic foot muscles has been a challenge for clinicians and researchers. The reliable measurement of this strength is important in order to assess weakness, which may contribute to a variety of functional issues in the foot and lower leg, including plantar fasciitis and hallux valgus. This study reports 3 novel methods for measuring foot strength - doming (previously unmeasured), hallux flexion, and flexion of the lesser toes. Twenty-one healthy volunteers performed the strength tests during two testing sessions which occurred one to five days apart. Each participant performed each series of strength tests (doming, hallux flexion, and lesser toe flexion) four times during the first testing session (twice with each of two raters) and two times during the second testing session (once with each rater). Intra-class correlation coefficients were calculated to test for reliability for the following comparisons: between raters during the same testing session on the same day (inter-rater, intra-day, intra-session), between raters on different days (inter-rater, inter-day, inter-session), between days for the same rater (intra-rater, inter-day, inter-session), and between sessions on the same day by the same rater (intra-rater, intra-day, inter-session). ICCs showed good to excellent reliability for all tests between days, raters, and sessions. Average doming strength was 99.96 ± 47.04 N. Average hallux flexion strength was 65.66 ± 24.5 N. Average lateral toe flexion was 50.96 ± 22.54 N. These simple tests using relatively low cost equipment can be used for research or clinical purposes. If repeated testing will be conducted on the same participant, it is suggested that the same researcher or clinician perform the testing each time for optimal reliability.
Reliability of visual and instrumental color matching.

PubMed

Igiel, Christopher; Lehmann, Karl Martin; Ghinea, Razvan; Weyhrauch, Michael; Hangx, Ysbrand; Scheller, Herbert; Paravina, Rade D

2017-09-01

The aim of this investigation was to evaluate intra-rater and inter-rater reliability of visual and instrumental shade matching. Forty individuals with normal color perception participated in this study. The right maxillary central incisor of a teaching model was prepared and restored with 10 feldspathic all-ceramic crowns of different shades. A shade matching session consisted of the observer (rater) visually selecting the best match by using VITA classical A1-D4 (VC) and VITA Toothguide 3D Master (3D) shade guides and the VITA Easyshade Advance intraoral spectrophotometer (ES) to obtain both VC and 3D matches. Three shade matching sessions were held with 4 to 6 weeks between sessions. Intra-rater reliability was assessed based on the percentage of agreement for the three sessions for the same observer, whereas the inter-rater reliability was calculated as mean percentage of agreement between different observers. The Fleiss' Kappa statistical analysis was used to evaluate visual inter-rater reliability. The mean intra-rater reliability for the visual shade selection was 64(11) for VC and 48(10) for 3D. The corresponding ES values were 96(4) for both VC and 3D. The percentages of observers who matched the same shade with VC and 3D were 55(10) and 43(12), respectively, while corresponding ES values were 88(8) for VC and 92(4) for 3D. The results for visual shade matching exhibited a high to moderate level of inconsistency for both intra-rater and inter-rater comparisons. The VITA Easyshade Advance intraoral spectrophotometer exhibited significantly better reliability compared with visual shade selection. This study evaluates the ability of observers to consistently match the same shade visually and with a dental spectrophotometer in different sessions. 
The intra-rater and inter-rater reliability (agreement of repeated shade matching) of visual and instrumental tooth color matching strongly suggest the use of color matching instruments as a supplementary tool in everyday dental practice to enhance the esthetic outcome. © 2017 Wiley Periodicals, Inc.
Test-re-test reliability and inter-rater reliability of a digital pelvic inclinometer in young, healthy males and females.

PubMed

Beardsley, Chris; Egerton, Tim; Skinner, Brendon

2016-01-01

Objective. The purpose of this study was to investigate the reliability of a digital pelvic inclinometer (DPI) for measuring sagittal plane pelvic tilt in 18 young, healthy males and females. Method. The inter-rater reliability and test-re-test reliabilities of the DPI for measuring pelvic tilt in standing on both the right and left sides of the pelvis were measured by two raters carrying out two rating sessions of the same subjects, three weeks apart. Results. For measuring pelvic tilt, inter-rater reliability was designated as good on both sides (ICC = 0.81-0.88), test-re-test reliability within a single rating session was designated as good on both sides (ICC = 0.88-0.95), and test-re-test reliability between two rating sessions was designated as moderate on the left side (ICC = 0.65) and good on the right side (ICC = 0.85). Conclusion. Inter-rater reliability and test-re-test reliability within a single rating session of the DPI in measuring pelvic tilt were both good, while test-re-test reliability between rating sessions was moderate-to-good. Caution is required regarding the interpretation of the test-re-test reliability within a single rating session, as the raters were not blinded. Further research is required to establish validity.
The development and validation of a custom built device for assessing frontal knee joint laxity.

PubMed

Ismail, Shiek Abdullah; Simic, Milena; Clarke, Jillian L; Lopes, Thiago Jambo Alves; Pappas, Evangelos

2017-12-01

This study reports the development and validation of a quantitative technique of assessing frontal knee joint laxity through a custom built device named KLICP. The objectives of this study were to determine: (i) the intra- and inter-rater reliability and (ii) the validity of the device when compared to real time ultrasound. Twenty-five participants had their frontal knee joint laxity assessed by the KLICP, by manual varus/valgus tests and by ultrasound. Two raters independently assessed laxity manually by three repeated measurements, repeated at least 48h later. Results were validated by comparing them to the medial and lateral joint space opening measured by the ultrasound. Intraclass correlation coefficients and standard error of measurement reliability were calculated. Pearson's correlation coefficients were calculated to determine the correlation between the KLICP and the joint space. Intra-rater reliability (intra-session) for each rater was good on both sessions (0.91-0.98), intra-rater reliability (inter-sessions) was moderate to good (0.62-0.87), and inter-rater reliability (intra-session) was good (0.75-0.80). There is low agreement for intra-rater (inter-session) and for inter-rater (intra-session) reliability. The KLICP measurement has a significant positive fair to moderate correlation to the ultrasound measurement at the left (r: 0.61, p: 0.01) and right (r: 0.48, p: 0.02) knee in the valgus direction and at the left (r: 0.51, p: 0.01) and right (r: 0.39, p: 0.05) knee in the varus direction. There is low agreement between the KLICP and the RTU. Reliability and agreement was good only when measured for intra-rater, within session. Copyright © 2017 Elsevier B.V. All rights reserved.
Inter-rater and test–retest reliability of quality assessments by novice student raters using the Jadad and Newcastle–Ottawa Scales

PubMed Central

Oremus, Carolina; Hall, Geoffrey B C; McKinnon, Margaret C

2012-01-01

Introduction Quality assessment of included studies is an important component of systematic reviews. Objective The authors investigated inter-rater and test–retest reliability for quality assessments conducted by inexperienced student raters. Design Student raters received a training session on quality assessment using the Jadad Scale for randomised controlled trials and the Newcastle–Ottawa Scale (NOS) for observational studies. Raters were randomly assigned into five pairs and they each independently rated the quality of 13–20 articles. These articles were drawn from a pool of 78 papers examining cognitive impairment following electroconvulsive therapy to treat major depressive disorder. The articles were randomly distributed to the raters. Two months later, each rater re-assessed the quality of half of their assigned articles. Setting McMaster Integrative Neuroscience Discovery and Study Program. Participants 10 students taking McMaster Integrative Neuroscience Discovery and Study Program courses. Main outcome measures The authors measured inter-rater reliability using κ and the intraclass correlation coefficient type 2,1 or ICC(2,1). The authors measured test–retest reliability using ICC(2,1). Results Inter-rater reliability varied by scale question. For the six-item Jadad Scale, question-specific κs ranged from 0.13 (95% CI −0.11 to 0.37) to 0.56 (95% CI 0.29 to 0.83). The ranges were −0.14 (95% CI −0.28 to 0.00) to 0.39 (95% CI −0.02 to 0.81) for the NOS cohort and −0.20 (95% CI −0.49 to 0.09) to 1.00 (95% CI 1.00 to 1.00) for the NOS case–control. For overall scores on the six-item Jadad Scale, ICC(2,1)s for inter-rater and test–retest reliability (accounting for systematic differences between raters) were 0.32 (95% CI 0.08 to 0.52) and 0.55 (95% CI 0.41 to 0.67), respectively. 
Corresponding ICC(2,1)s for the NOS cohort were −0.19 (95% CI −0.67 to 0.35) and 0.62 (95% CI 0.25 to 0.83), and for the NOS case–control, the ICC(2,1)s were 0.46 (95% CI −0.13 to 0.92) and 0.83 (95% CI 0.48 to 0.95). Conclusions Inter-rater reliability was generally poor to fair and test–retest reliability was fair to excellent. A pilot rating phase following rater training may be one way to improve agreement. PMID:22855629
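
ICC(2,1), the coefficient named in the abstract above, treats both subjects and raters as random and penalises systematic differences between raters. The sketch below computes it from the mean squares of a two-way layout, assuming a complete table with one score per rater per subject; the ratings are illustrative (six subjects, four raters) and are not data from the study.

```python
def icc_2_1(scores):
    """ICC(2,1) for a complete table of n subjects (rows) by k raters (columns)."""
    n = len(scores)
    k = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)               # between-subjects mean square
    jms = ss_cols / (k - 1)               # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Illustrative ratings (assumption, not study data): rows are subjects,
# columns are raters.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

icc = icc_2_1(ratings)
print(f"ICC(2,1) = {icc:.2f}")
```

In this illustration the raters rank the subjects similarly but use very different parts of the scale, so the between-raters mean square is large and ICC(2,1) comes out low; this is the sense in which the coefficient accounts for systematic differences between raters.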
Inter-rater and test-retest reliability of quality assessments by novice student raters using the Jadad and Newcastle-Ottawa Scales.

PubMed

Oremus, Mark; Oremus, Carolina; Hall, Geoffrey B C; McKinnon, Margaret C

2012-01-01

Quality assessment of included studies is an important component of systematic reviews. The authors investigated inter-rater and test-retest reliability for quality assessments conducted by inexperienced student raters. Student raters received a training session on quality assessment using the Jadad Scale for randomised controlled trials and the Newcastle-Ottawa Scale (NOS) for observational studies. Raters were randomly assigned into five pairs and they each independently rated the quality of 13-20 articles. These articles were drawn from a pool of 78 papers examining cognitive impairment following electroconvulsive therapy to treat major depressive disorder. The articles were randomly distributed to the raters. Two months later, each rater re-assessed the quality of half of their assigned articles. McMaster Integrative Neuroscience Discovery and Study Program. 10 students taking McMaster Integrative Neuroscience Discovery and Study Program courses. The authors measured inter-rater reliability using κ and the intraclass correlation coefficient type 2,1 or ICC(2,1). The authors measured test-retest reliability using ICC(2,1). Inter-rater reliability varied by scale question. For the six-item Jadad Scale, question-specific κs ranged from 0.13 (95% CI -0.11 to 0.37) to 0.56 (95% CI 0.29 to 0.83). The ranges were -0.14 (95% CI -0.28 to 0.00) to 0.39 (95% CI -0.02 to 0.81) for the NOS cohort and -0.20 (95% CI -0.49 to 0.09) to 1.00 (95% CI 1.00 to 1.00) for the NOS case-control. For overall scores on the six-item Jadad Scale, ICC(2,1)s for inter-rater and test-retest reliability (accounting for systematic differences between raters) were 0.32 (95% CI 0.08 to 0.52) and 0.55 (95% CI 0.41 to 0.67), respectively. Corresponding ICC(2,1)s for the NOS cohort were -0.19 (95% CI -0.67 to 0.35) and 0.62 (95% CI 0.25 to 0.83), and for the NOS case-control, the ICC(2,1)s were 0.46 (95% CI -0.13 to 0.92) and 0.83 (95% CI 0.48 to 0.95). 
Inter-rater reliability was generally poor to fair and test-retest reliability was fair to excellent. A pilot rating phase following rater training may be one way to improve agreement.
Measurement of the Inter-Rater Reliability Rate Is Mandatory for Improving the Quality of a Medical Database: Experience with the Paulista Lung Cancer Registry.

PubMed

Lauricella, Leticia L; Costa, Priscila B; Salati, Michele; Pego-Fernandes, Paulo M; Terra, Ricardo M

2018-06-01

Database quality measurement should be considered a mandatory step to ensure an adequate level of confidence in data used for research and quality improvement. Several metrics have been described in the literature, but no standardized approach has been established. We aimed to describe a methodological approach applied to measure the quality and inter-rater reliability of a regional multicentric thoracic surgical database (Paulista Lung Cancer Registry). Data from the first 3 years of the Paulista Lung Cancer Registry underwent an audit process with 3 metrics: completeness, consistency, and inter-rater reliability. The first 2 methods were applied to the whole data set, and the last method was calculated using 100 cases randomized for direct auditing. Inter-rater reliability was evaluated using percentage of agreement between the data collector and auditor and through calculation of Cohen's κ and intraclass correlation. The overall completeness per section ranged from 0.88 to 1.00, and the overall consistency was 0.96. Inter-rater reliability showed many variables with high disagreement (>10%). For numerical variables, intraclass correlation was a better metric than inter-rater reliability. Cohen's κ showed that most variables had moderate to substantial agreement. The methodological approach applied to the Paulista Lung Cancer Registry showed that completeness and consistency metrics did not sufficiently reflect the real quality status of a database. The inter-rater reliability associated with κ and intraclass correlation was a better quality metric than completeness and consistency metrics because it could determine the reliability of specific variables used in research or benchmark reports. This report can be a paradigm for future studies of data quality measurement. Copyright © 2018 American College of Surgeons. Published by Elsevier Inc. All rights reserved.
Reliability of capturing foot parameters using digital scanning and the neutral suspension casting technique

PubMed Central

2011-01-01

Background A clinical study was conducted to determine the intra and inter-rater reliability of digital scanning and the neutral suspension casting technique to measure six foot parameters. The neutral suspension casting technique is a commonly utilised method for obtaining a negative impression of the foot prior to orthotic fabrication. Digital scanning offers an alternative to the traditional plaster of Paris techniques. Methods Twenty one healthy participants volunteered to take part in the study. Six casts and six digital scans were obtained from each participant by two raters of differing clinical experience. The foot parameters chosen for investigation were cast length (mm), forefoot width (mm), rearfoot width (mm), medial arch height (mm), lateral arch height (mm) and forefoot to rearfoot alignment (degrees). Intraclass correlation coefficients (ICC) with 95% confidence intervals (CI) were calculated to determine the intra and inter-rater reliability. Measurement error was assessed through the calculation of the standard error of the measurement (SEM) and smallest real difference (SRD). Results ICC values for all foot parameters using digital scanning ranged between 0.81-0.99 for both intra and inter-rater reliability. For the neutral suspension casting technique, inter-rater reliability values ranged from 0.57-0.99 and intra-rater reliability values ranged from 0.36-0.99 for rater 1 and 0.49-0.99 for rater 2. Conclusions The findings of this study indicate that digital scanning is a reliable technique, irrespective of clinical experience, with reduced measurement variability in all foot parameters investigated when compared to neutral suspension casting. PMID:21375757
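
The standard error of the measurement (SEM) and smallest real difference (SRD) mentioned above are often derived from the ICC as SEM = SD * sqrt(1 - ICC) and SRD = 1.96 * sqrt(2) * SEM. The sketch below assumes those formulas; the abstract does not say which variant the authors used, and the input values are hypothetical.

```python
import math


def standard_error_of_measurement(sd, icc):
    """SEM = SD * sqrt(1 - ICC), in the unit of the measurement."""
    return sd * math.sqrt(1 - icc)


def smallest_real_difference(sem, z=1.96):
    """SRD = z * sqrt(2) * SEM; z = 1.96 corresponds to a 95% level."""
    return z * math.sqrt(2) * sem


# Hypothetical values for a foot parameter in mm (assumption, not study data).
sem = standard_error_of_measurement(sd=4.0, icc=0.91)
srd = smallest_real_difference(sem)
print(f"SEM = {sem:.2f} mm, SRD = {srd:.2f} mm")
```

Under these assumptions, a change between two measurements smaller than the SRD cannot be distinguished from measurement error at the chosen confidence level.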
Reliability and Concurrent Validity of Dynamic Rotator Stability Test-A Cross Sectional study.

PubMed

Binoy Mathew, K V; Eapen, Charu; Kumar, P Senthil

2012-01-01

To determine the intra-rater and inter-rater reliability of the Dynamic Rotator Stability Test (DRST) and its concurrent validity with the University of Pennsylvania Shoulder Score (PENN) scale. Forty subjects of either gender, aged 18-70, with painful shoulder conditions of musculoskeletal origin were selected through convenience sampling. Tester 1 and tester 2 administered the DRST and PENN scale randomly. In a subgroup of 20 subjects, the DRST was administered by both testers to determine inter-rater reliability. A 180° standard universal goniometer was used to take measurements. For intra-rater reliability, all test variables showed highly significant correlation (p=.94-1). For inter-rater reliability with tester 2, the test variables position, ROM, force, direction of abnormal translation, pain during the test, and compensatory movement during the test were found to be significant (p=.71-1). Only some variables of the DRST showed significant correlation with the PENN scale (P=.320-.450). The Dynamic Rotator Stability Test has good intra-rater and moderate inter-rater reliability. Its concurrent validity was found to be poor when compared with the PENN Shoulder Score.
Tackling reliability and construct validity: the systematic development of a qualitative protocol for skill and incident analysis.

PubMed

Savage, Trevor Nicholas; McIntosh, Andrew Stuart

2017-03-01

It is important to understand factors contributing to and directly causing sports injuries to improve the effectiveness and safety of sports skills. The characteristics of injury events must be evaluated and described meaningfully and reliably. However, many complex skills cannot be effectively investigated quantitatively because of ethical, technological and validity considerations. Increasingly, qualitative methods are being used to investigate human movement for research purposes, but there are concerns about reliability and measurement bias of such methods. Using the tackle in Rugby union as an example, we outline a systematic approach for developing a skill analysis protocol with a focus on improving objectivity, validity and reliability. Characteristics for analysis were selected using qualitative analysis and biomechanical theoretical models and epidemiological and coaching literature. An expert panel comprising subject matter experts provided feedback and the inter-rater reliability of the protocol was assessed using ten trained raters. The inter-rater reliability results were reviewed by the expert panel and the protocol was revised and assessed in a second inter-rater reliability study. Mean agreement in the second study improved and was comparable (52-90% agreement and ICC between 0.6 and 0.9) with other studies that have reported inter-rater reliability of qualitative analysis of human movement.
Intra and inter-rater reliability of infrared image analysis of masticatory and upper trapezius muscles in women with and without temporomandibular disorder.

PubMed

Costa, Ana C S; Dibai Filho, Almir V; Packer, Amanda C; Rodrigues-Bigaton, Delaine

2013-01-01

Infrared thermography is an aid tool that can be used to evaluate several pathologies given its efficiency in analyzing the distribution of skin surface temperature. To propose two forms of infrared image analysis of the masticatory and upper trapezius muscles, and to determine the intra and inter-rater reliability of both forms of analysis. Infrared images of masticatory and upper trapezius muscles of 64 female volunteers with and without temporomandibular disorder (TMD) were collected. Two raters performed the infrared image analysis, which occurred in two ways: temperature measurement of the muscle length and in central portion of the muscle. The Intraclass Correlation Coefficient (ICC) was used to determine the intra and inter-rater reliability. The ICC showed excellent intra and inter-rater values for both measurements: temperature measurement of the muscle length (TMD group, intra-rater, ICC ranged from 0.996 to 0.999, inter-rater, ICC ranged from 0.992 to 0.999; control group, intra-rater, ICC ranged from 0.993 to 0.998, inter-rater, ICC ranged from 0.990 to 0.998), and temperature measurement of the central portion of the muscle (TMD group, intra-rater, ICC ranged from 0.981 to 0.998, inter-rater, ICC ranged from 0.971 to 0.998; control group, intra-rater, ICC ranged from 0.887 to 0.996, inter-rater, ICC ranged from 0.852 to 0.996). The results indicated that temperature measurements of the masticatory and upper trapezius muscles carried out by the analysis of the muscle length and central portion yielded excellent intra and inter-rater reliability.
Reliability of intracerebral hemorrhage classification systems: A systematic review.

PubMed

Rannikmäe, Kristiina; Woodfield, Rebecca; Anderson, Craig S; Charidimou, Andreas; Chiewvit, Pipat; Greenberg, Steven M; Jeng, Jiann-Shing; Meretoja, Atte; Palm, Frederic; Putaala, Jukka; Rinkel, Gabriel Je; Rosand, Jonathan; Rost, Natalia S; Strbian, Daniel; Tatlisumak, Turgut; Tsai, Chung-Fen; Wermer, Marieke Jh; Werring, David; Yeh, Shin-Joe; Al-Shahi Salman, Rustam; Sudlow, Cathie Lm

2016-08-01

Accurately distinguishing non-traumatic intracerebral hemorrhage (ICH) subtypes is important since they may have different risk factors, causal pathways, management, and prognosis. We systematically assessed the inter- and intra-rater reliability of ICH classification systems. We sought all available reliability assessments of anatomical and mechanistic ICH classification systems from electronic databases and personal contacts until October 2014. We assessed included studies' characteristics, reporting quality and potential for bias; summarized reliability with kappa value forest plots; and performed meta-analyses of the proportion of cases classified into each subtype. We included 8 of 2152 studies identified. Inter- and intra-rater reliabilities were substantial to perfect for anatomical and mechanistic systems (inter-rater kappa values: anatomical 0.78-0.97 [six studies, 518 cases], mechanistic 0.89-0.93 [three studies, 510 cases]; intra-rater kappas: anatomical 0.80-1 [three studies, 137 cases], mechanistic 0.92-0.93 [two studies, 368 cases]). Reporting quality varied but no study fulfilled all criteria and none was free from potential bias. All reliability studies were performed with experienced raters in specialist centers. Proportions of ICH subtypes were largely consistent with previous reports suggesting that included studies are appropriately representative. Reliability of existing classification systems appears excellent but is unknown outside specialist centers with experienced raters. Future reliability comparisons should be facilitated by studies following recently published reporting guidelines. © 2016 World Stroke Organization.
Computerized back postural assessment in physiotherapy practice: Intra-rater and inter-rater reliability of the MIDAS system.

PubMed

McAlpine, R T; Bettany-Saltikov, J A; Warren, J G

2009-01-01

Assessment of spinal posture during physiotherapy practice is routine, yet few objective measures exist to this end. The Middlesbrough Integrated Digital Assessment System (MIDAS) is a low cost portable system able to record 3D information on posture. The purpose of this study was to assess both the intra-rater and inter-rater reliability of the MIDAS system. Twenty-five healthy subjects were recruited. A repeated measures design was used to record fifteen pre-palpated landmarks on the back of each subject. To limit the sources of variability, the principal researcher palpated the landmarks for each subject. Each of three raters took two measurements on each subject in a standardized upright posture. X (medio-lateral), Y (antero-posterior) and Z (height) landmark positions were recorded via a computer interface. Both intra-rater agreement (mean ICCs - rater 1 r=0.970, rater 2 r=0.965 and rater 3 r=0.965, p< 0.001) and inter-rater agreement (mean ICCs r=0.967, p< 0.001) were very high between repeated measures and between markers. Error values for the z-axis (height) were the lowest. The MIDAS demonstrated both high inter-rater and intra-rater reliability and provides an objective method for the assessment of posture in physiotherapy practice.

Inter- and intra-rater reliability and agreement in determining subcutaneous tumour margins in dogs.

PubMed

Ranganathan, B; Milovancev, M; Leeper, H; Townsend, K L; Bracha, S; Curran, K

2018-03-01

The objective of this prospective study was to evaluate agreement and reliability of calliper-based measurements of locally invasive subcutaneous malignant tumours in dogs. Four raters measured the longest diameter of 12 subcutaneous tumours (7 soft tissue sarcomas and 5 mast cell tumours) from 11 client-owned dogs during 3 randomized, blinded measurement trials, both pre- and post-sedation. Inter- and intra-rater reliability was evaluated using intra-class correlation coefficient (ICC) and agreement was evaluated using Bland-Altman plots. Inter- and intra-rater reliability was good (ICC range of 0.8694-0.89520) and excellent (ICC range of 0.9720-0.9966), respectively. For agreement calculations, an a priori clinically relevant limit of agreement of 10 mm was set. Inter- and intra-rater agreement was unacceptable with inter-rater limits of agreement ranging from 15.9 to 55.6 mm and intra-rater limit of agreement ranging from 11.9 to 28.1 mm. Review of the measurement trial photographs revealed that calliper orientation changes were frequent, occurring in 9/12 (75%) and 8/12 (67%) pre- and post-sedation cases. No significant correlation was found between inter-rater measurement standard deviations and calliper orientation changes or dog body condition score. These findings suggest veterinarians may have poor agreement in determining the gross edge of tumours, which is expected to introduce bias and inconsistency in tumour staging, assessing response to therapy, and surgical margin planning. Due to the potential consequences for veterinary cancer patients, future studies are needed to validate the present findings. © 2018 John Wiley & Sons Ltd.
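The contrast in the abstract above between high ICCs and unacceptable agreement rests on Bland-Altman limits of agreement compared with an a priori limit of 10 mm. The sketch below shows the usual calculation (bias ± 1.96 × SD of the paired differences) for two sets of measurements. It is an illustration under that assumption, not the study's analysis, and the function names and data are hypothetical.

```python
import statistics


def bland_altman_limits(first, second, z=1.96):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - b for a, b in zip(first, second)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - z * sd, bias + z * sd


def agreement_acceptable(lower, upper, clinical_limit):
    """True when both limits of agreement fall inside +/- clinical_limit."""
    return abs(lower) <= clinical_limit and abs(upper) <= clinical_limit


# Hypothetical longest-diameter measurements (mm) from two raters.
rater_1 = [10.0, 12.0, 14.0, 16.0]
rater_2 = [9.0, 13.0, 13.0, 17.0]
bias, lower, upper = bland_altman_limits(rater_1, rater_2)
print(bias, lower, upper, agreement_acceptable(lower, upper, 10.0))
```

Because the limits are expressed in the measurement's own units, they can be judged directly against a clinical threshold, which a unitless ICC cannot.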
A study of the reliability of the Nociception Coma Scale.

PubMed

Riganello, F; Cortese, M D; Arcuri, F; Candelieri, A; Guglielmino, F; Dolce, G; Sannita, W G; Schnakers, C

2015-04-01

In this study, we investigated the reliability of the Nociception Coma Scale which has recently been developed to assess nociception in non-communicative, severely brain-injured patients. Prospective cross-sequential study. Semi-intensive care unit and long-term brain injury care. Forty-four patients diagnosed as being in a vegetative state (n=26) or in a minimally conscious state (n=18). Patients were assessed by two experts (rater A and rater B) on two consecutive weeks to measure inter-rater agreement and test-retest reliability. Total scores and subscores of the Nociception Coma Scale. We performed a total of 176 assessments. The inter-rater agreement was moderate for the total scores (k = 0.57) and fair to substantial for the subscores (0.33 ≤ k ≤ 0.62) on week 2. The test-retest reliability was substantial for the total scores (k = 0.66) and moderate to almost perfect for the subscores (0.53 ≤ k ≤ 0.96) for rater A. The inter-rater agreement was weaker on week 1, whereas the test-retest reliability was lower for the least experienced rater (rater B). This study provides further evidence of the psychometric qualities of the Nociception Coma Scale. Future studies should assess the impact of practical experience and background on administration and scoring of the scale. © The Author(s) 2014.
The TiltMeter app is a novel and accurate measurement tool for the weight bearing lunge test.

PubMed

Williams, Cylie M; Caserta, Antoni J; Haines, Terry P

2013-09-01

The weight bearing lunge test is increasingly being used by health care clinicians who treat lower limb and foot pathology. This measure is commonly established accurately and reliably with the use of expensive equipment. This study aims to compare the digital inclinometer with a free app, TiltMeter, on an Apple iPhone. This was an intra-rater and inter-rater reliability study. Two raters (novice and experienced) conducted the measurements in both a bent knee and straight leg position to determine the intra-rater and inter-rater reliability. Concurrent validity was also established. Allied health practitioners were recruited as participants from the workplace. A preconditioning stretch was conducted and the ankle range of motion was established with the weight bearing lunge test position with firstly the leg straight and secondly with the knee bent. The measurement device and each participant were randomised during measurement. The intra-rater reliability and inter-rater reliability for the devices and in both positions were all over ICC 0.8 except for one intra-rater measure (Digital inclinometer, novice, ICC 0.65). The inter-rater reliability between the digital inclinometer and the TiltMeter app was near perfect, ICC 0.96 (CI: 0.898-0.983); concurrent validity ICC between the two devices was 0.83 (CI: -0.740 to 0.445). The use of the TiltMeter app on the iPhone is a reliable and inexpensive tool to measure the available ankle range of motion. Health practitioners should use caution in applying these findings to other smart phone equipment if surface areas are not comparable. Crown Copyright © 2013. Published by Elsevier Ltd. All rights reserved.
Reliability of Pain Measurements Using Computerized Cuff Algometry: A DoloCuff Reliability and Agreement Study.

PubMed

Kvistgaard Olsen, Jack; Fener, Dilay Kesgin; Waehrens, Eva Elisabet; Wulf Christensen, Anton; Jespersen, Anders; Danneskiold-Samsøe, Bente; Bartels, Else Marie

2017-07-01

Computerized pneumatic cuff pressure algometry (CPA) using the DoloCuff is a new method for pain assessment. Intra- and inter-rater reliabilities have not yet been established. Our aim was to examine the inter- and intrarater reliabilities of DoloCuff measures in healthy subjects. Twenty healthy subjects (ages 20 to 29 years) were assessed three times at 24-hour intervals by two trained raters. Inter-rater reliability was established based on the first and second assessments, whereas intrarater reliability was based on the second and third assessments. Subjects were randomized 1:1 to first assessment at either rater 1 or rater 2. The variables of interest were pressure pain threshold (PT), pressure pain tolerance (PTol), and temporal summation index (TSI). Reliability was estimated by a two-way mixed intraclass correlation coefficient (ICC) absolute agreement analysis. Reliability was considered excellent if ICC > 0.75, fair to good if 0.4 < ICC < 0.75, and poor if ICC < 0.4. Bias and random errors between raters and assessments were evaluated using 95% confidence interval (CI) and Bland-Altman plots. Inter-rater reliability for PT, PTol, and TSI was 0.88 (95% CI: 0.69 to 0.95), 0.86 (95% CI: 0.65 to 0.95), and 0.81 (95% CI: 0.42 to 0.94), respectively. The intrarater reliability for PT, PTol, and TSI was 0.81 (95% CI: 0.53 to 0.92), 0.89 (95% CI: 0.74 to 0.96), and 0.75 (95% CI: 0.28 to 0.91), respectively. Inter-rater reliability was excellent for PT, PTol, and TSI. Similarly, the intrarater reliability for PT and PTol was excellent, while borderline excellent/good for TSI. Therefore, the DoloCuff can be used to obtain reliable measures of pressure pain parameters in healthy subjects. © 2016 World Institute of Pain.
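To make the ICC analysis and the interpretation thresholds in the abstract above concrete, the sketch below computes a two-way, absolute-agreement, single-measures ICC from a subjects-by-raters table and then labels it with the stated cut-offs. Two points are assumptions: the abstract does not say whether single or average measures were used, and it does not say how values of exactly 0.4 or 0.75 are classified (they are assigned to "fair to good" here). Function names and data are hypothetical.

```python
def icc_absolute_agreement(scores):
    """Two-way, absolute-agreement, single-measures ICC.

    scores: list of rows, one per subject, each holding one score per rater.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)              # between-subject mean square
    ms_cols = ss_cols / (k - 1)              # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def classify_icc(icc):
    """Apply the cut-offs quoted in the abstract (boundary handling assumed)."""
    if icc > 0.75:
        return "excellent"
    if icc >= 0.4:
        return "fair to good"
    return "poor"


# Hypothetical scores: rater 2 reads one unit higher than rater 1 throughout.
offset_scores = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
print(icc_absolute_agreement(offset_scores), classify_icc(0.88))
```

The constant offset between raters lowers the absolute-agreement ICC even though the two raters rank the subjects identically, which is the reason to prefer this form when systematic bias between raters matters.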
An assessment of the inter-rater reliability of the ASA physical status score in the orthopaedic trauma population.

PubMed

Ihejirika, Rivka C; Thakore, Rachel V; Sathiyakumar, Vasanth; Ehrenfeld, Jesse M; Obremskey, William T; Sethi, Manish K

2015-04-01

Although recent literature has demonstrated the utility of the ASA score in predicting postoperative length of stay, complication risk and potential utilization of other hospital resources, the ASA score has been inconsistently assigned by anaesthesia providers. This study tested the reliability of assignment of the ASA score classification by both attending anaesthesiologists and anaesthesia residents specifically among the orthopaedic trauma patient population. Nine case-based scenarios were created involving preoperative patients with isolated operative orthopaedic trauma injuries. The cases were created and assigned a reference score by both an attending anaesthesiologist and orthopaedic trauma surgeon. Attending and resident anaesthesiologists were asked to assign an ASA score for each case. Rater versus reference and inter-rater agreement amongst respondents was then analyzed utilizing Fleiss's Kappa and weighted and unweighted Cohen's Kappa. Thirty three individuals provided ASA scores for each of the scenarios. The average rater versus reference reliability was substantial (Kw=0.78, SD=0.131, 95% CI=0.73-0.83). The average rater versus reference Kuw was also substantial (Kuw=0.64, SD=0.21, 95% CI=0.56-0.71). The inter-rater reliability as evaluated by Fleiss's Kappa was moderate (K=0.51, p<.001). Inter-rater agreement within the group of attendings (K=0.50, p<.001) and within the group of residents (K=0.55, p<.001) was moderate in both cases. There was a significant increase in the level of inter-rater reliability from the self-reported 'very uncomfortable' participants to the 'very comfortable' participants (uncomfortable K=0.43, comfortable K=0.59, p<.001). This study shows substantial agreement strength for reliability of the ASA score among anaesthesiologists when evaluating orthopaedic trauma patients. The significant increase in inter-rater reliability based on anaesthesiologists' comfort with the ASA scoring method implies a need for further evaluation of ASA assessment training and routine use on the ground. These findings support the use of the ASA score as a statistically reliable tool in orthopaedic trauma. Copyright © 2014 Elsevier Ltd. All rights reserved.
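The study above uses Fleiss's Kappa because more than two raters scored every scenario. The sketch below implements the standard Fleiss formula on a table of counts (one row per case, one column per category, each cell the number of raters choosing that category). It is an illustration only; the function name and the small tables are hypothetical and unrelated to the study's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a cases-by-categories table of rater counts."""
    n_cases = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every case must be scored by the same number (>= 2) of raters")
    n_categories = len(counts[0])
    total = n_cases * n_raters
    # Overall proportion of all assignments that fall in each category.
    p_category = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Per-case agreement: share of rater pairs that chose the same category.
    p_case = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_case) / n_cases
    p_expected = sum(p * p for p in p_category)
    if p_expected == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical tables: three raters, two categories.
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2]]
print(fleiss_kappa(unanimous), fleiss_kappa(split))
```

Unlike Cohen's kappa, this statistic does not require the same individuals to rate every case, only the same number of raters per case.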
Establishing Inter- and Intrarater Reliability for High-Stakes Testing Using Simulation.

PubMed

Kardong-Edgren, Suzan; Oermann, Marilyn H; Rizzolo, Mary Anne; Odom-Maryon, Tamara

This article reports one method to develop a standardized training method to establish the inter- and intrarater reliability of a group of raters for high-stakes testing. Simulation is used increasingly for high-stakes testing, but without research into the development of inter- and intrarater reliability for raters. Eleven raters were trained using a standardized methodology. Raters scored 28 student videos over a six-week period. Raters then rescored all videos over a two-day period to establish both intra- and interrater reliability. One rater demonstrated poor intrarater reliability; a second rater failed all students. Kappa statistics improved from the moderate to substantial agreement range with the exclusion of the two outlier raters' scores. There may be faculty who, for different reasons, should not be included in high-stakes testing evaluations. All faculty are content experts, but not all are expert evaluators.
The Mental Disability Military Assessment Tool: A Reliable Tool for Determining Disability in Veterans with Post-traumatic Stress Disorder.

PubMed

Fokkens, Andrea S; Groothoff, Johan W; van der Klink, Jac J L; Popping, Roel; Stewart, Roy E; van de Ven, Lex; Brouwer, Sandra; Tuinstra, Jolanda

2015-09-01

An assessment tool was developed to assess disability in veterans who suffer from post-traumatic stress disorder (PTSD) due to a military mission. The objective of this study was to determine the reliability, intra-rater and inter-rater variation of the Mental Disability Military (MDM) assessment tool. Twenty-four assessment interviews of veterans with an insurance physician were videotaped. Each videotaped interview was assessed by a group of five independent raters on limitations of the veterans using the MDM assessment tool. After 2 months the raters repeated this procedure. Next the intra-rater and inter-rater variation was assessed with an adjusted version of AG09 computing weighted percentage agreement. The results of this study showed that both the intra-rater variation and inter-rater variation on the ten subcategories of the MDM assessment tool were small, with an agreement of 84-100% within raters and 93-100% between raters. The MDM assessment tool proves to be a reliable instrument to measure PTSD limitations in functioning in Dutch military veterans who apply for disability compensation. Further research is needed to assess the validity of this instrument.
Nutrition Environment Measures Survey in stores (NEMS-S): development and evaluation.

PubMed

Glanz, Karen; Sallis, James F; Saelens, Brian E; Frank, Lawrence D

2007-04-01

Eating, or nutrition, environments are believed to contribute to obesity and chronic diseases. There is a need for valid, reliable measures of nutrition environments. This article reports on the development and evaluation of measures of nutrition environments in retail food stores. The Nutrition Environment Measures Study developed observational measures of the nutrition environment within retail food stores (NEMS-S) to assess availability of healthy options, price, and quality. After pretesting, measures were completed by independent raters to evaluate inter-rater reliability and across two occasions to assess test-retest reliability in grocery and convenience stores in four neighborhoods differing on income and community design in the Atlanta metropolitan area. Data were collected and analyzed in 2004 and 2005. Ten food categories (e.g., fruits) or indicator food items (e.g., ground beef) were evaluated in 85 stores. Inter-rater reliability and test-retest reliability of availability were high: inter-rater reliability kappas were 0.84 to 1.00, and test-retest reliabilities were .73 to 1.00. Inter-rater reliability for quality across fresh produce was moderate (kappas, 0.44 to 1.00). Healthier options were higher priced for hot dogs, lean ground beef, and baked chips. More healthful options were available in grocery than convenience stores and in stores in higher income neighborhoods. The NEMS-S tool was found to have a high degree of inter-rater and test-retest reliability, and to reveal significant differences across store types and neighborhoods of high and low socioeconomic status. These observational measures of nutrition environments can be applied in multilevel studies of community nutrition, and can inform new approaches to conducting and evaluating nutrition interventions.
Intra-Rater and Inter-Rater Reliability of the Balance Error Scoring System in Pre-Adolescent School Children

ERIC Educational Resources Information Center

Sheehan, Dwayne P.; Lafave, Mark R.; Katz, Larry

2011-01-01

This study was designed to test the intra- and inter-rater reliability of the University of North Carolina's Balance Error Scoring System in 9- and 10-year-old children. Additionally, a modified version of the Balance Error Scoring System was tested to determine if it was more sensitive in this population ("raw scores"). Forty-six…
Screening of the spine in adolescents: inter- and intra-rater reliability and measurement error of commonly used clinical tests.

PubMed

Aartun, Ellen; Degerfalk, Anna; Kentsdotter, Linn; Hestbaek, Lise

2014-02-10

Evidence on the reliability of clinical tests used for the spinal screening of children and adolescents is currently lacking. The aim of this study was to determine the inter- and intra-rater reliability and measurement error of clinical tests commonly used when screening young spines. Two experienced chiropractors independently assessed 111 adolescents aged 12-14 years who were recruited from a primary school in Denmark. A standardised examination protocol was used to test inter-rater reliability including tests for scoliosis, hypermobility, general mobility, inter-segmental mobility and end range pain in the spine. Seventy-five of the 111 subjects were re-examined after one to four hours to test intra-rater reliability. Percentage agreement and Cohen's Kappa were calculated for binary variables, and interclass correlation (ICC) and Bland-Altman plots with Limits of Agreement (LoA) were calculated for continuous measures. Inter-rater percentage agreement for binary data ranged from 59.5% to 100%. Kappa ranged from 0.06-1.00. Kappa ≥ 0.40 was seen for elbow, thumb, fifth finger and trunk/hip flexion hypermobility, pain response in inter-segmental mobility and end range pain in lumbar flexion and extension. For continuous data, ICCs ranged from 0.40-0.95. Only forward flexion as measured by finger-to-floor distance reached an acceptable ICC(≥ 0.75). Overall, results for intra-rater reliability were better than for inter-rater reliability but for both components, the LoA were quite wide compared with the range of assessments. Some clinical tests showed good, and some tests poor, reliability when applied in a spinal screening of adolescents. The results could probably be improved by additional training and further test standardization. This is the first step in evaluating the value of these tests for the spinal screening of adolescents. Future research should determine the association between these tests and current and/or future neck and back pain.
Do you see what I see? Mobile eye-tracker contextual analysis and inter-rater reliability.

PubMed

Stuart, S; Hunt, D; Nell, J; Godfrey, A; Hausdorff, J M; Rochester, L; Alcock, L

2018-02-01

Mobile eye-trackers are currently used during real-world tasks (e.g. gait) to monitor visual and cognitive processes, particularly in ageing and Parkinson's disease (PD). However, contextual analysis involving fixation locations during such tasks is rarely performed due to its complexity. This study adapted a validated algorithm and developed a classification method to semi-automate contextual analysis of mobile eye-tracking data. We further assessed inter-rater reliability of the proposed classification method. A mobile eye-tracker recorded eye-movements during walking in five healthy older adult controls (HC) and five people with PD. Fixations were identified using a previously validated algorithm, which was adapted to provide still images of fixation locations (n = 116). The fixation location was manually identified by two raters (DH, JN), who classified the locations. Cohen's kappa correlation coefficients determined the inter-rater reliability. The algorithm successfully provided still images for each fixation, allowing manual contextual analysis to be performed. The inter-rater reliability for classifying the fixation location was high for both PD (kappa = 0.80, 95% agreement) and HC groups (kappa = 0.80, 91% agreement), which indicated a reliable classification method. This study developed a reliable semi-automated contextual analysis method for gait studies in HC and PD. Future studies could adapt this methodology for various gait-related eye-tracking studies.
The development of an instrument to match individuals with disabilities and service animals.

PubMed

Zapf, S A; Rough, R B

There has been an increase in the use of service animals assisting persons with disabilities in the past decade. However, many service dog agencies do not utilize an assessment that is designed to match the person to the animal in the rehabilitation and psycho-social domains. The purpose of this study was to develop the Service Animal Adaptive Intervention Assessment (SAAIA) and to measure the content validity, inter-rater reliability and clinical utility of the assessment. Two subject groups were used. Subject group one had 43 subjects who measured the content validity and clinical utility of the SAAIA Survey. Subject group two had 12 subjects who measured the inter-rater reliability by completing the SAAIA using information obtained through a video-taped client case scenario. Content validity results indicated a good to high percentage of agreement and a fair percentage of agreement for clinical utility. Inter-rater reliability results indicate good to high agreement on six of the eight variables of the SAAIA. However, the Kappa score indicates low inter-rater reliability. Results indicate the SAAIA has good content validity and inter-rater reliability and fair clinical utility based on percent agreement. However, further research is needed on the reliability of the SAAIA.
Rater reliability and concurrent validity of the Keyboard Personal Computer Style instrument (K-PeCS).

PubMed

Baker, Nancy A; Cook, James R; Redfern, Mark S

2009-01-01

This paper describes the inter-rater and intra-rater reliability, and the concurrent validity of an observational instrument, the Keyboard Personal Computer Style instrument (K-PeCS), which assesses stereotypical postures and movements associated with computer keyboard use. Three trained raters independently rated the video clips of 45 computer keyboard users to ascertain inter-rater reliability, and then re-rated a sub-sample of 15 video clips to ascertain intra-rater reliability. Concurrent validity was assessed by comparing the ratings obtained using the K-PeCS to scores developed from a 3D motion analysis system. The overall K-PeCS had excellent reliability [inter-rater: intra-class correlation coefficients (ICC)=.90; intra-rater: ICC=.92]. Most individual items on the K-PeCS had from good to excellent reliability, although six items fell below ICC=.75. Those K-PeCS items that were assessed for concurrent validity compared favorably to the motion analysis data for all but two items. These results suggest that most items on the K-PeCS can be used to reliably document computer keyboarding style.
Intra-rater and inter-rater reliability of ultrasonographic measurements of acromion-greater tuberosity distance in patients with post-stroke hemiplegia.

PubMed

Kumar, Praveen; Cruziah, Reynold; Bradley, Michael; Gray, Selena; Swinkels, Annette

2016-06-01

Glenohumeral subluxation (GHS) is reported in up to 81% of patients with stroke. Ultrasonographic measurements of GHS by measuring the acromion-greater tuberosity (AGT) have been found to be reliable for experienced raters. The primary aim was to assess the intra-rater reliability of measurements of AGT distance in people with stroke following a short course of rater training. A secondary aim was to compare the inter-rater reliability of these measurements between novice and experienced raters. Patients with stroke (n = 16; 5 men, 11 women; 74 ± 10 years) with 1-sided weakness who gave informed consent were recruited. Ultrasonographic measurements were recorded at the bedside by two physiotherapists with patients seated upright in a hospital chair. Reliability was assessed by intra-class correlation coefficients (ICCs) and the standard error of measurements (SEM). Minimum detectable change (MDC90) scores were used to estimate the magnitude of change that is likely to exceed measurement error. Mean ± SD AGT distances on the affected and unaffected sides for rater 1 were 2.2 ± 0.7 and 1.7 ± 0.4 cm, respectively. Corresponding values for rater 2 were 2.5 ± 0.6 and 2.0 ± 0.4 cm. Intra-class correlation coefficient values for the affected and unaffected shoulders for rater 1 were 0.96 and 0.91, respectively. Corresponding values for rater 2 were 0.95 and 0.90. SEM and MDC90 for both affected and unaffected shoulders were ≤ 0.2 cm. Inter-rater reliability coefficients were 0.86 (affected) and 0.76 (unaffected) shoulders. Ultrasonographic measurement of AGT distance demonstrates excellent intra-rater reliability for a novice rater. Inter-rater reliability of ultrasonographic measurement of AGT also demonstrates good reliability between novice and experienced raters.
Evaluating Written Patient Information for Eczema in German: Comparing the Reliability of Two Instruments, DISCERN and EQIP.

PubMed

McCool, Megan E; Wahl, Josepha; Schlecht, Inga; Apfelbacher, Christian

2015-01-01

Patients actively seek information about how to cope with their health problems, but the quality of the information available varies. A number of instruments have been developed to assess the quality of patient information, primarily though in English. Little is known about the reliability of these instruments when applied to patient information in German. The objective of our study was to investigate and compare the reliability of two validated instruments, DISCERN and EQIP, in order to determine which of these instruments is better suited for a further study pertaining to the quality of information available to German patients with eczema. Two independent raters evaluated a random sample of 20 informational brochures in German. All the brochures addressed eczema as a disorder and/or therapy options and care. Intra-rater and inter-rater reliability were assessed by calculating intra-class correlation coefficients, agreement was tested with weighted kappas, and the correlation of the raters' scores for each instrument was measured with Pearson's correlation coefficient. DISCERN demonstrated substantial intra- and inter-rater reliability. It also showed slightly better agreement than EQIP. There was a strong correlation of the raters' scores for both instruments. The findings of this study support the reliability of both DISCERN and EQIP. However, based on the results of the inter-rater reliability, agreement and correlation analyses, we consider DISCERN to be the more precise tool for our project on patient information concerning the treatment and care of eczema.
Intra- and Inter-Rater Reliability of the Rate of Force Development of Hip Abductor Muscles Measured by Hand-Held Dynamometer

ERIC Educational Resources Information Center

Takeda, Kazuya; Tanabe, Shigeo; Koyama, Soichiro; Nagai, Tomoko; Sakurai, Hiroaki; Kanada, Yoshikiyo; Shomoto, Koji

2018-01-01

The aim of this study was to clarify the intra- and inter-rater reliability of the rate of force development in hip abductor muscle force measurements using a hand-held dynamometer. Thirty healthy adults were separately assessed by two independent raters on two separate days. Rate of force development was calculated from the slope of the…
The Berg Balance Scale has high intra- and inter-rater reliability but absolute reliability varies across the scale: a systematic review.

PubMed

Downs, Stephen; Marquez, Jodie; Chiarelli, Pauline

2013-06-01

What is the intra-rater and inter-rater relative reliability of the Berg Balance Scale? What is the absolute reliability of the Berg Balance Scale? Does the absolute reliability of the Berg Balance Scale vary across the scale? Systematic review with meta-analysis of reliability studies. Any clinical population that has undergone assessment with the Berg Balance Scale. Relative intra-rater reliability, relative inter-rater reliability, and absolute reliability. Eleven studies involving 668 participants were included in the review. The relative intrarater reliability of the Berg Balance Scale was high, with a pooled estimate of 0.98 (95% CI 0.97 to 0.99). Relative inter-rater reliability was also high, with a pooled estimate of 0.97 (95% CI 0.96 to 0.98). A ceiling effect of the Berg Balance Scale was evident for some participants. In the analysis of absolute reliability, all of the relevant studies had an average score of 20 or above on the 0 to 56 point Berg Balance Scale. The absolute reliability across this part of the scale, as measured by the minimal detectable change with 95% confidence, varied between 2.8 points and 6.6 points. The Berg Balance Scale has a higher absolute reliability when close to 56 points due to the ceiling effect. We identified no data that estimated the absolute reliability of the Berg Balance Scale among participants with a mean score below 20 out of 56. The Berg Balance Scale has acceptable reliability, although it might not detect modest, clinically important changes in balance in individual subjects. The review was only able to comment on the absolute reliability of the Berg Balance Scale among people with moderately poor to normal balance. Copyright © 2013 Australian Physiotherapy Association. Published by .. All rights reserved.
Rater methodology for stroboscopy: a systematic review.

PubMed

Bonilha, Heather Shaw; Focht, Kendrea L; Martin-Harris, Bonnie

2015-01-01

Laryngeal endoscopy with stroboscopy (LES) remains the clinical gold standard for assessing vocal fold function. LES is used to evaluate the efficacy of voice treatments in research studies and clinical practice. LES as a voice treatment outcome tool is only as good as the clinician interpreting the recordings. Research using LES as a treatment outcome measure should be evaluated based on rater methodology and reliability. The purpose of this literature review was to evaluate the rater-related methodology from studies that use stroboscopic findings as voice treatment outcome measures. Systematic literature review. Computerized journal databases were searched for relevant articles using terms: stroboscopy and treatment. Eligible articles were categorized and evaluated for the use of rater-related methodology, reporting of number of raters, types of raters, blinding, and rater reliability. Of the 738 articles reviewed, 80 articles met inclusion criteria. More than one-third of the studies included in the review did not report the number of raters who participated in the study. Eleven studies reported results of rater reliability analysis with only two studies reporting good inter- and intrarater reliability. The comparability and use of results from treatment studies that use LES are limited by a lack of rigor in rater methodology and variable, mostly poor, inter- and intrarater reliability. To improve our ability to evaluate and use the findings from voice treatment studies that use LES features as outcome measures, greater consistency of reporting rater methodology characteristics across studies and improved rater reliability is needed. Copyright © 2015 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
Reliability of the Cooking Task in adults with acquired brain injury.

PubMed

Poncet, Frédérique; Swaine, Bonnie; Taillefer, Chantal; Lamoureux, Julie; Pradat-Diehl, Pascale; Chevignard, Mathilde

2015-01-01

Acquired brain injury (ABI) often leads to deficits in executive functioning (EF) responsible for severe and long-standing disabilities in daily life activities. The Cooking Task is an ecological and valid test of EF involving multi-tasking in a real environment. Given its complex scoring system, it is important to establish the tool's reliability. The objective of the study was to examine the reliability of the Cooking Task (internal consistency, inter-rater and test-retest reliability). A total of 160 patients with ABI (113 men, mean age 37 years, SD = 14.3) were tested using the Cooking Task. For test-retest reliability, patients were assessed by the same rater on two occasions (mean interval 11 days) while two raters independently and simultaneously observed and scored patients' performances to estimate inter-rater reliability. Internal consistency was high for the global scale (Cronbach α = .74). Inter-rater reliability (n = 66) for total errors was also high (ICC = .93), however the test-retest reliability (n = 11) was poor (ICC = .36). In general the Cooking Task appears to be a reliable tool. The low test-retest results were expected given the importance of EF in the performance of novel tasks.
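For readers unfamiliar with the internal-consistency figure quoted above, the following is a minimal sketch of Cronbach's alpha using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The item scores are invented for illustration and have nothing to do with the Cooking Task data; the function name is likewise an assumption.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha.

    items: one list of scores per item, all in the same respondent order.
    """
    k = len(items)
    # Total score per respondent (sum across items).
    totals = [sum(scores) for scores in zip(*items)]
    item_var_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

# Hypothetical scores: 3 items rated for 5 respondents.
example_items = [
    [2, 3, 4, 4, 5],
    [3, 3, 4, 5, 5],
    [1, 3, 3, 4, 5],
]
alpha = cronbach_alpha(example_items)
print(f"alpha = {alpha:.3f}")
```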
Six of one, half a dozen of the other: A measure of multidisciplinary inter/intra-rater reliability of the society for fetal urology and urinary tract dilation grading systems for hydronephrosis.

PubMed

Rickard, Mandy; Easterbrook, Bethany; Kim, Soojin; Farrokhyar, Forough; Stein, Nina; Arora, Steven; Belostotsky, Vladamir; DeMaria, Jorge; Lorenzo, Armando J; Braga, Luis H

2017-02-01

The urinary tract dilation (UTD) classification system was introduced to standardize terminology in the reporting of hydronephrosis (HN), and bridge a gap between pre- and postnatal classification such as the Society for Fetal Urology (SFU) grading system. Herein we compare the intra/inter-rater reliability of both grading systems. SFU (I-IV) and UTD (I-III) grades were independently assigned by 13 raters (9 pediatric urology staff, 2 nephrologists, 2 radiologists), twice, 3 weeks apart, to 50 sagittal postnatal ultrasonographic views of hydronephrotic kidneys. Data regarding ureteral measurements and bladder abnormalities were included to allow proper UTD categorization. Ten images were repeated to assess intra-rater reliability. Krippendorff's alpha coefficient was used to measure overall and by grade intra/inter-rater reliability. Reliability between specialties and training levels were also analyzed. Overall inter-rater reliability was slightly higher for SFU (α = 0.842, 95% CI 0.812-0.879, in session 1; and α = 0.808, 95% CI 0.775-0.839, in session 2) than for UTD (α = 0.774, 95% CI 0.715-0.827, in session 1; and α = 0.679, 95% CI 0.605-0.750, in session 2). Reliability for intermediate grades (SFU II/III and UTD 2) of HN was poor regardless of the system. Reliabilities for SFU and UTD classifications among Urology, Nephrology, and Radiology, as well as between training levels were not significantly different. Despite the introduction of HN grading systems to standardize the interpretation and reporting of renal ultrasound in infants with HN, none have been proven superior in allowing clinicians to distinguish between "moderate" grades. While this study demonstrated high reliability in distinguishing between "mild" (SFU I/II and UTD 1) and "severe" (SFU IV and UTD 3) grades of HN, the overall reliability between specialties was poor. This is in keeping with a previous report of modest inter-rater reliability of the SFU system. 
This drawback is likely explained by the subjective interpretation required to assign grades, which can be impacted by experience, image quality, and scanning technique. As shown in the figure, which demonstrates SFU II (a) and SFU III (b), as assigned by a radiologist, it is possible to make an argument that either of these images can be classified into both categories that were observed during the grading sessions of this study. Although both systems have acceptable reliability, the SFU grading system showed higher overall intra/inter-rater reliability regardless of rater specialty than the UTD classification. Inter-rater reliability for SFU grades II/III and UTD 2 was low, highlighting the limitations of both classifications in regards to properly segregating moderate HN grades. Copyright © 2016 Journal of Pediatric Urology Company. Published by Elsevier Ltd. All rights reserved.

Indices of Paraspinal Muscles Degeneration: Reliability and Association With Facet Joint Osteoarthritis: Feasibility Study.

PubMed

Kalichman, Leonid; Klindukhov, Alexander; Li, Ling; Linov, Lina

2016-11-01

A reliability and cross-sectional observational study. To introduce a scoring system for visible fat infiltration in paraspinal muscles; to evaluate intertester and intratester reliability of this system and its relationship with indices of muscle density; to evaluate the association between indices of paraspinal muscle degeneration and facet joint osteoarthritis. Current evidence suggests that the paraspinal muscles degeneration is associated with low back pain, facet joint osteoarthritis, spondylolisthesis, and degenerative disc disease. However, the evaluation of paraspinal muscles on computed tomography is not radiological routine, probably because of absence of simple and reliable indices of paraspinal degeneration. One hundred fifty consecutive computed tomography scans of the lower back (N=75) or abdomen (N=75) were evaluated. Mean radiographic density (in Hounsfield units) and SD of the density of multifidus and erector spinae were evaluated at the L4-L5 spinal level. A new index of muscle degeneration, radiographic density ratio=muscle density/SD of density, was calculated. To evaluate the visible fat infiltration in paraspinal muscles, we proposed a 3-graded scoring system. The prevalence of facet joint osteoarthritis was also evaluated. Intraclass correlation and κ statistics were used to evaluate inter-rater and intra-rater reliability. Logistic regression examined the association between paraspinal muscle indices and facet joint osteoarthritis. Intra-rater reliability for fat infiltration score (κ) ranged between 0.87 and 0.92; inter-rater reliability between 0.70 and 0.81. Intra-rater reliability (intraclass correlation) for mean density of paraspinal muscles ranged between 0.96 and 0.99, inter-rater reliability between 0.95 and 0.99; SD intra-rater reliability ranged between 0.82 and 0.91, inter-rater reliability between 0.80 and 0.89. 
Significant associations (P<0.01) were found between facet joint osteoarthritis, fat infiltration score, and radiographic density ratio. Two suggested indices of paraspinal muscle degeneration showed excellent reliability and were significantly associated with facet joint osteoarthritis. Additional studies are needed to evaluate the associations with other spinal degeneration features and low back pain.
Can Physicians Identify Inappropriate Nuclear Stress Tests? An Examination of Inter-rater Reliability for the 2009 Appropriate Use Criteria for Radionuclide Imaging

PubMed Central

Ye, Siqin; Rabbani, LeRoy E.; Kelly, Christopher R.; Kelly, Maureen R.; Lewis, Matthew; Paz, Yehuda; Peck, Clara L.; Rao, Shaline; Bokhari, Sabahat; Weiner, Shepard D.; Einstein, Andrew J.

2014-01-01

Background We sought to determine inter-rater reliability of the 2009 Appropriate Use Criteria (AUC) for radionuclide imaging (RNI) and whether physicians at various levels of training can effectively identify nuclear stress tests with inappropriate indications. Methods and Results Four hundred patients were randomly selected from a consecutive cohort of patients undergoing nuclear stress testing at an academic medical center. Raters with different levels of training (including cardiology attending physicians, cardiology fellows, internal medicine hospitalists, and internal medicine interns) classified individual nuclear stress tests using the 2009 AUC. Consensus classification by two cardiologists was considered the operational gold standard, and sensitivity and specificity of individual raters for identifying inappropriate tests was calculated. Inter-rater reliability of the AUC was assessed using Cohen’s kappa statistics for pairs of different raters. The mean age of patients was 61.5 years; 214 (54%) were female. The cardiologists rated 256 (64%) of 400 NSTs as appropriate, 68 (18%) as uncertain, 55 (14%) as inappropriate; 21 (5%) tests were unable to be classified. Inter-rater reliability for non-cardiologist raters was modest (unweighted Cohen’s kappa, 0.51, 95% confidence interval, 0.45 to 0.55). Sensitivity of individual raters for identifying inappropriate tests ranged from 47% to 82%, while specificity ranged from 85% to 97%. Conclusions Inter-rater reliability for the 2009 AUC for RNI is modest, and there is considerable variation in the ability of raters at different levels of training to identify inappropriate tests. PMID:25563660
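The unweighted Cohen's kappa used in the study above corrects raw agreement for the agreement expected by chance. A minimal sketch follows, assuming two raters who each assign one category per test; the ratings are hypothetical and do not reproduce the study data.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same cases."""
    n = len(rater_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the product of each rater's marginal proportions.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical classifications: A = appropriate, I = inappropriate, U = uncertain.
ratings_a = ["A", "A", "A", "A", "I", "I", "I", "U", "U", "A"]
ratings_b = ["A", "A", "A", "I", "I", "I", "U", "U", "U", "A"]
kappa = cohen_kappa(ratings_a, ratings_b)
print(f"kappa = {kappa:.3f}")
```

In this example raw agreement is 80%, but kappa is lower because roughly a third of that agreement would be expected by chance.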
Reliability of videotaped observational gait analysis in patients with orthopedic impairments

PubMed Central

Brunnekreef, Jaap J; van Uden, Caro JT; van Moorsel, Steven; Kooloos, Jan GM

2005-01-01

Background In clinical practice, visual gait observation is often used to determine gait disorders and to evaluate treatment. Several reliability studies on observational gait analysis have been described in the literature and generally showed moderate reliability. However, patients with orthopedic disorders have received little attention. The objective of this study is to determine the reliability levels of visual observation of gait in patients with orthopedic disorders. Methods The gait of thirty patients referred to a physical therapist for gait treatment was videotaped. Ten raters, 4 experienced, 4 inexperienced and 2 experts, individually evaluated these videotaped gait patterns of the patients twice, by using a structured gait analysis form. Reliability levels were established by calculating the Intraclass Correlation Coefficient (ICC), using a two-way random design and based on absolute agreement. Results The inter-rater reliability among experienced raters (ICC = 0.42; 95%CI: 0.38–0.46) was comparable to that of the inexperienced raters (ICC = 0.40; 95%CI: 0.36–0.44). The expert raters reached a higher inter-rater reliability level (ICC = 0.54; 95%CI: 0.48–0.60). The average intra-rater reliability of the experienced raters was 0.63 (ICCs ranging from 0.57 to 0.70). The inexperienced raters reached an average intra-rater reliability of 0.57 (ICCs ranging from 0.52 to 0.62). The two expert raters attained ICC values of 0.70 and 0.74 respectively. Conclusion Structured visual gait observation by use of a gait analysis form as described in this study was found to be moderately reliable. Clinical experience appears to increase the reliability of visual gait analysis. PMID:15774012
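The study above describes its ICC as a two-way random design based on absolute agreement. A minimal sketch of the single-measure form of that coefficient, often written ICC(2,1), is given below, computed from the two-way ANOVA mean squares. Whether the authors reported the single-measure or average-measure form is not stated in the abstract, so this is an assumption, and the ratings are illustrative rather than study data.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per subject and one column per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(col) / n for col in zip(*ratings)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Illustrative ratings: 6 subjects scored by 4 raters.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(example)
print(f"ICC(2,1) = {icc:.3f}")
```

Because absolute agreement penalises systematic differences between raters, the large rater-to-rater offsets in this example pull the coefficient down even though the raters rank subjects similarly.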
Inter- and intra-rater reliability of nasal auscultation in daycare children.

PubMed

Santos, Rita; Silva Alexandrino, Ana; Tomé, David; Melo, Cristina; Mesquita Montes, António; Costa, Daniel; Pinto Ferreira, João

2018-02-01

The aim of this study was to assess nasal auscultation's intra- and inter-rater reliability and to analyze ear and respiratory clinical condition according to nasal auscultation. Cross-sectional study performed in 125 children aged up to 3 years old attending daycare centers. Nasal auscultation, tympanometry and Paediatric Respiratory Severity Score (PRSS) were applied to all children. Nasal sounds were classified by an expert panel in order to determine nasal auscultation's intra- and inter-rater reliability. The classification of nasal sounds was assessed against tympanometric and PRSS values. Nasal auscultation revealed substantial inter-rater (K=0.75) and intra-rater (K=0.69; K=0.61 and K=0.72) reliability. Children with a "non-obstructed" classification revealed a lower peak pressure (t=-3.599, P<0.001 in left ear; t=-2.258, P=0.026 in right ear) and a higher compliance (t=-2.728, P=0.007 in left ear; t=-3.830, P<0.001 in right ear) in both ears. There was an association between the classification of sounds and tympanogram types in both ears (X=11.437, P=0.003 in left ear; X=13.535, P=0.001 in right ear). Children with a "non-obstructed" classification had a healthier respiratory condition. Nasal auscultation revealed substantial intra- and inter-rater reliability. Nasal auscultation exhibited important differences according to ear and respiratory clinical conditions. Nasal auscultation in pediatrics seems to be an original topic as well as a simple method that can be used to identify early signs of nasopharyngeal obstruction.
The intensive care delirium screening checklist: translation and reliability testing in a Swedish ICU.

PubMed

Neziraj, M; Sarac Kart, N; Samuelson, Karin

2011-08-01

The view of delirium has changed considerably over the last decade, and delirium is now a very topical issue within the intensive care unit (ICU) setting. Delirium has proved to be common in critically ill patients and is manifested as acute changes in mental status with reduced cognitive ability, incoherent thought patterns, impaired consciousness, agitation and acute confusion. In order to be able to prevent, identify and alleviate problems related to delirium it is important that validated instruments for delirium screening are implemented and evaluated. The aim of this study was to translate the Intensive Care Delirium Screening Checklist (ICDSC) into Swedish and test the inter-rater reliability in a Swedish general ICU setting. The study was carried out during 2009 in a general Swedish ICU. A translation of the scale from English into Swedish was made, including back-translation, critical review and pilot testing. A total of 49 paired ratings were carried out using the Swedish version of the ICDSC scale. The inter-rater reliability was tested using weighted kappa (κ) statistics (linear weighting). The ICDSC scale was successfully translated into Swedish and the inter-rater reliability testing of the Swedish version resulted in a weighted k value of 0.92. The result of this study indicates that the Swedish version of the ICDSC scale has a very good inter-rater reliability. The high inter-rater reliability and the ease of administration make the ICDSC scale applicable for delirium screening in a Swedish ICU setting. © 2011 The Authors. Acta Anaesthesiologica Scandinavica © 2011 The Acta Anaesthesiologica Scandinavica Foundation.
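The linear-weighted kappa mentioned above gives partial credit when two raters differ by only a few points on an ordinal scale. A minimal sketch follows, written as one minus the ratio of observed to chance-expected absolute score difference, which is algebraically the linear-weighted form (the scale-length normaliser cancels). The scores are hypothetical and are not the 49 paired ratings from the study.

```python
def linear_weighted_kappa(rater_a, rater_b):
    """Linear-weighted Cohen's kappa for paired ordinal (integer) scores."""
    n = len(rater_a)
    # Mean absolute difference actually observed between paired ratings.
    observed = sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / n
    # Mean absolute difference expected if the two raters' score
    # distributions were paired at random (product of marginals).
    expected = sum(abs(a - b) for a in rater_a for b in rater_b) / (n * n)
    return 1 - observed / expected

# Hypothetical checklist scores from two raters for 8 patients.
scores_a = [0, 1, 2, 3, 4, 2, 1, 0]
scores_b = [0, 1, 2, 4, 4, 3, 1, 0]
kappa_w = linear_weighted_kappa(scores_a, scores_b)
print(f"weighted kappa = {kappa_w:.3f}")
```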
The reliability of WorkWell Systems Functional Capacity Evaluation: a systematic review

PubMed Central

2014-01-01

Background Functional capacity evaluation (FCE) determines a person’s ability to perform work-related tasks and is a major component of the rehabilitation process. The WorkWell Systems (WWS) FCE (formerly known as Isernhagen Work Systems FCE) is currently the most commonly used FCE tool in German rehabilitation centres. Our systematic review investigated the inter-rater, intra-rater and test-retest reliability of the WWS FCE. Methods We performed a systematic literature search of studies on the reliability of the WWS FCE and extracted item-specific measures of inter-rater, intra-rater and test-retest reliability from the identified studies. Intraclass correlation coefficients ≥ 0.75, percentages of agreement ≥ 80%, and kappa coefficients ≥ 0.60 were categorised as acceptable, otherwise they were considered non-acceptable. The extracted values were summarised for the five performance categories of the WWS FCE, and the results were classified as either consistent or inconsistent. Results From 11 identified studies, 150 item-specific reliability measures were extracted. 89% of the extracted inter-rater reliability measures, all of the intra-rater reliability measures and 96% of the test-retest reliability measures of the weight handling and strength tests had an acceptable level of reliability, compared to only 67% of the test-retest reliability measures of the posture/mobility tests and 56% of the test-retest reliability measures of the locomotion tests. Both of the extracted test-retest reliability measures of the balance test were acceptable. Conclusions Weight handling and strength tests were found to have consistently acceptable reliability. Further research is needed to explore the reliability of the other tests as inconsistent findings or a lack of data prevented definitive conclusions. PMID:24674029
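The acceptability rule stated in the review above (ICC ≥ 0.75, percentage agreement ≥ 80%, kappa ≥ 0.60) is simple enough to express directly. The sketch below encodes those three cut-offs; the dictionary keys, function names and sample values are hypothetical and are not the extracted measures from the review.

```python
# Cut-offs as stated in the abstract; key names are illustrative.
THRESHOLDS = {"icc": 0.75, "percent_agreement": 80.0, "kappa": 0.60}

def is_acceptable(measure: str, value: float) -> bool:
    """Return True when a reliability measure meets its cut-off."""
    return value >= THRESHOLDS[measure]

def share_acceptable(measures) -> float:
    """Percentage of (measure, value) pairs classed as acceptable."""
    hits = sum(is_acceptable(name, value) for name, value in measures)
    return 100.0 * hits / len(measures)

# Hypothetical extracted values.
sample = [("icc", 0.81), ("icc", 0.70), ("kappa", 0.60), ("percent_agreement", 75.0)]
share = share_acceptable(sample)
print(f"{share:.0f}% acceptable")
```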
Inter-Rater and Test-Retest (Between-Sessions) Reliability of the 4-Skills Scan for Dutch Elementary School Children

ERIC Educational Resources Information Center

van Kernebeek, Willem G.; de Schipper, Antoine W.; Savelsbergh, Geert J. P.; Toussaint, Huub M.

2018-01-01

In The Netherlands, the 4-Skills Scan is an instrument for physical education teachers to assess gross motor skills of elementary school children. Little is known about its reliability. Therefore, in this study the test-retest and inter-rater reliability was determined. Respectively, 624 and 557 Dutch 6- to 12-year-old children were analyzed for…
The Availability of Radiological Measurement of Femoral Anteversion Angle: Three-Dimensional Computed Tomography Reconstruction

PubMed Central

Byun, Ha Young; Shin, Heesuk; Lee, Eun Shin; Kong, Min Sik; Lee, Seung Hun

2016-01-01

Objective To assess the intra-rater and inter-rater reliability for measuring femoral anteversion angle (FAA) by a radiographic method using three-dimensional computed tomography reconstruction (3D-CT). Methods The study included 82 children who presented with intoeing gait. 3D-CT data taken between 2006 and 2014 were retrospectively reviewed. FAA was measured by 3D-CT. FAA is defined as the angle between the long axis of the femur neck and condylar axis of the distal femur. FAA measurement was performed twice at both lower extremities by each rater. The intra-rater and inter-rater reliability were calculated by intraclass correlation coefficient (ICC). Results One hundred and sixty-four lower limbs of 82 children (31 boys and 51 girls, 6.3±3.2 years old) were included. The ICCs of intra-rater measurement for the angle of femoral neck axis (NA) were 0.89 for rater A and 0.96 for rater B, and those of condylar axis (CA) were 0.99 for rater A and 0.99 for rater B, respectively. The ICC of inter-rater measurement for the angle of NA was 0.89 and that of CA was 0.92. By each rater, the ICCs of the intrarater measurement for FAA were 0.97 for rater A and 0.95 for rater B, respectively and the ICC of the inter-rater measurement for FAA was 0.89. Conclusion The 3D-CT measures for FAA are reliable within individual raters and between different raters. The 3D-CT measures of FAA can be a useful method for accurate diagnosis and follow-up of femoral anteversion. PMID:27152273
Validation of different pediatric triage systems in the emergency department

PubMed Central

Aeimchanbanjong, Kanokwan; Pandee, Uthen

2017-01-01

BACKGROUND: Triage system in children seems to be more challenging compared to adults because of their different response to physiological and psychosocial stressors. This study aimed to determine the best triage system in the pediatric emergency department. METHODS: This was a prospective observational study. This study was divided into two phases. The first phase determined the inter-rater reliability of five triage systems: Manchester Triage System (MTS), Emergency Severity Index (ESI) version 4, Pediatric Canadian Triage and Acuity Scale (CTAS), Australasian Triage Scale (ATS), and Ramathibodi Triage System (RTS) by triage nurses and pediatric residents. In the second phase, to analyze the validity of each triage system, patients were categorized as two groups, i.e., high acuity patients (triage level 1, 2) and low acuity patients (triage level 3, 4, and 5). Then we compared the triage acuity with actual admission. RESULTS: In phase I, RTS illustrated almost perfect inter-rater reliability with kappa of 1.0 (P<0.01). ESI and CTAS illustrated good inter-rater reliability with kappa of 0.8–0.9 (P<0.01). Meanwhile, ATS and MTS illustrated moderate to good inter-rater reliability with kappa of 0.5–0.7 (P<0.01). In phase II, we included 1 041 participants with average age of 4.7±4.2 years, of which 55% were male and 45% were female. In addition 32% of the participants had underlying diseases, and 123 (11.8%) patients were admitted. We found that ESI illustrated the most appropriate predicting ability for admission with sensitivity of 52%, specificity of 81%, and AUC 0.78 (95%CI 0.74–0.81). CONCLUSION: RTS illustrated almost perfect inter-rater reliability. Meanwhile, ESI and CTAS illustrated good inter-rater reliability. Finally, ESI illustrated the appropriate validity for triage system. PMID:28680520
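The validity analysis above compares a binary triage grouping (high versus low acuity) with actual admission. A minimal sketch of how sensitivity and specificity follow from such paired outcomes is shown below; the function name and the eight hypothetical patients are assumptions for illustration, not study data.

```python
def sensitivity_specificity(high_acuity, admitted):
    """Sensitivity and specificity of a binary triage flag against admission.

    high_acuity, admitted: equal-length sequences of booleans, one per patient.
    """
    pairs = list(zip(high_acuity, admitted))
    tp = sum(1 for flag, adm in pairs if flag and adm)          # flagged, admitted
    fn = sum(1 for flag, adm in pairs if not flag and adm)      # missed admissions
    tn = sum(1 for flag, adm in pairs if not flag and not adm)  # correctly low acuity
    fp = sum(1 for flag, adm in pairs if flag and not adm)      # flagged, discharged
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical patients.
flags = [True, True, False, False, True, False, False, False]
admissions = [True, False, True, False, True, False, False, False]
sens, spec = sensitivity_specificity(flags, admissions)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```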
Evaluation of General Classes of Reliability Estimators Often Used in Statistical Analyses of Quasi-Experimental Designs

NASA Astrophysics Data System (ADS)

Saini, K. K.; Sehgal, R. K.; Sethi, B. L.

2008-10-01

In this paper, major reliability estimators are analyzed and their comparative results are discussed. Their strengths and weaknesses are evaluated in this case study. Each of the reliability estimators has certain advantages and disadvantages. Inter-rater reliability is one of the best ways to estimate reliability when the measure is an observation. However, it requires multiple raters or observers. As an alternative, one could look at the correlation of ratings by the same single observer repeated on two different occasions. Each of the reliability estimators will give a different value for reliability. In general, the test-retest and inter-rater reliability estimates will be lower in value than the parallel forms and internal consistency ones because they involve measuring at different times or with different raters. Reliability estimates are often used in statistical analyses of quasi-experimental designs.
The new GRID Hamilton Rating Scale for Depression demonstrates excellent inter-rater reliability for inexperienced and experienced raters before and after training.

PubMed

Tabuse, Hideaki; Kalali, Amir; Azuma, Hideki; Ozaki, Norio; Iwata, Nakao; Naitoh, Hiroshi; Higuchi, Teruhiko; Kanba, Shigenobu; Shioe, Kunihiko; Akechi, Tatsuo; Furukawa, Toshi A

2007-09-30

The Hamilton Rating Scale for Depression (HAMD) is the de facto international gold standard for the assessment of depression. There are some criticisms, however, especially with regard to its inter-rater reliability, due to the lack of standardized questions or explicit scoring procedures. The GRID-HAMD was developed to provide standardized explicit scoring conventions and a structured interview guide for administration and scoring of the HAMD. We developed the Japanese version of the GRID-HAMD and examined its inter-rater reliability among experienced and inexperienced clinicians (n=70), how rater characteristics may affect it, and how training can improve it in the course of a model training program using videotaped interviews. The results showed that the inter-rater reliability of the GRID-HAMD total score was excellent to almost perfect and those of most individual items were also satisfactory to excellent, both with experienced and inexperienced raters, and both before and after the training. With its standardized definitions, questions and detailed scoring conventions, the GRID-HAMD appears to be the best achievable set of interview guides for the HAMD and can provide a solid tool for highly reliable assessment of depression severity.
Inter-rater reliability for movement pattern analysis (MPA): measuring patterning of behaviors versus discrete behavior counts as indicators of decision-making style

PubMed Central

Connors, Brenda L.; Rende, Richard; Colton, Timothy J.

2014-01-01

The unique yield of collecting observational data on human movement has received increasing attention in a number of domains, including the study of decision-making style. As such, interest has grown in the nuances of core methodological issues, including the best ways of assessing inter-rater reliability. In this paper we focus on one key topic – the distinction between establishing reliability for the patterning of behaviors as opposed to the computation of raw counts – and suggest that reliability for each be compared empirically rather than determined a priori. We illustrate by assessing inter-rater reliability for key outcome measures derived from movement pattern analysis (MPA), an observational methodology that records body movements as indicators of decision-making style with demonstrated predictive validity. While reliability ranged from moderate to good for raw counts of behaviors reflecting each of two Overall Factors generated within MPA (Assertion and Perspective), inter-rater reliability for patterning (proportional indicators of each factor) was significantly higher and excellent (ICC = 0.89). Furthermore, patterning, as compared to raw counts, provided better prediction of observable decision-making process assessed in the laboratory. These analyses support the utility of using an empirical approach to inform the consideration of measuring patterning versus discrete behavioral counts of behaviors when determining inter-rater reliability of observable behavior. They also speak to the substantial reliability that may be achieved via application of theoretically grounded observational systems such as MPA that reveal thinking and action motivations via visible movement patterns. PMID:24999336
A comparison of Google Glass and traditional video vantage points for bedside procedural skill assessment.

PubMed

Evans, Heather L; O'Shea, Dylan J; Morris, Amy E; Keys, Kari A; Wright, Andrew S; Schaad, Douglas C; Ilgen, Jonathan S

2016-02-01

This pilot study assessed the feasibility of using first person (1P) video recording with Google Glass (GG) to assess procedural skills, as compared with traditional third person (3P) video. We hypothesized that raters reviewing 1P videos would visualize more procedural steps with greater inter-rater reliability than 3P rating vantages. Seven subjects performed simulated internal jugular catheter insertions. Procedures were recorded by both Google Glass and an observer's head-mounted camera. Videos were assessed by 3 expert raters using a task-specific checklist (CL) and both an additive- and summative-global rating scale (GRS). Mean scores were compared by t-tests. Inter-rater reliabilities were calculated using intraclass correlation coefficients. The 1P vantage was associated with a significantly higher mean CL score than the 3P vantage (7.9 vs 6.9, P = .02). Mean GRS scores were not significantly different. Mean inter-rater reliabilities for the CL, additive-GRS, and summative-GRS were similar between vantages. 1P vantage recordings may improve visualization of tasks for behaviorally anchored instruments (eg, CLs), while maintaining similar global ratings and inter-rater reliability when compared with conventional 3P vantage recordings. Copyright © 2016 Elsevier Inc. All rights reserved.
Carotid and vertebral injury study (CAVIS) technique for characterization of blunt traumatic aneurysms with reliability assessment.

PubMed

Griessenauer, Christoph J; Foreman, Paul; Shoja, Mohammadali M; Kicielinski, Kimberly P; Deveikis, John P; Walters, Beverly C; Harrigan, Mark R

2015-04-01

Traumatic aneurysms occur in up to 20% of blunt traumatic extracranial carotid artery injuries. Currently there is no standardized method for characterization of traumatic aneurysms. For the carotid and vertebral injury study (CAVIS), a prospective study of traumatic cerebrovascular injury, we established a method for aneurysm characterization and tested its reliability. Saccular aneurysm size was defined as the greatest linear distance between the expected location of the normal artery wall and the outer edge of the aneurysm lumen ("depth"). Fusiform aneurysm size was defined as the "depth" and longitudinal distance ("length") paralleling the normal artery. The size of the aneurysm relative to the normal artery was also assessed. Reliability measurements were made using four raters who independently reviewed 15 computed tomographic angiograms (CTAs) and 13 digital subtraction angiograms (DSAs) demonstrating a traumatic aneurysm of the internal carotid artery. Raters categorized the aneurysms as either "saccular" or "fusiform" and made measurements. Five scans of each imaging modality were repeated to evaluate intra-rater reliability. Fleiss's free-marginal multi-rater kappa (κ), Cohen's kappa (κ), and intraclass correlation coefficient (ICC) determined inter- and intra-rater reliability. Inter-rater agreement as to the aneurysm "shape" was almost perfect for CTA (κ = 0.82) and DSA (κ = 0.897). Agreements on aneurysm "depth," "length," "aneurysm plus parent artery," and "parent artery" for CTA and DSA were excellent (ICC > 0.75). Intra-rater agreement as to aneurysm "shape" was substantial to almost perfect (κ > 0.60). The CAVIS method of traumatic aneurysm characterization has remarkable inter- and intra-rater reliability and will facilitate further studies of the natural history and management of extracranial cerebrovascular traumatic aneurysms. © The Author(s) 2015.
Intra- and interrater reliability of the 'lumbar-locked thoracic rotation test' in competitive swimmers ages 10 through 18 years.

PubMed

Feijen, Stef; Kuppens, Kevin; Tate, Angela; Baert, Isabel; Struyf, Thomas; Struyf, Filip

2018-04-17

Measuring thoracic spine mobility can be of interest to competitive swimmers as it has been associated with shoulder girdle function and scapular position in subjects with and without shoulder pain. At present, no reliability data of thoracic spine mobility measurements are available in the swimming population. This study aims to evaluate the within-session intra- and interrater reliability of the "lumbar-locked rotation test" for thoracic spine rotation in competitive swimmers aged 10 to 18 years. This reliability study is part of a larger prospective cohort study investigating potential risk factors for the development of shoulder pain in competitive swimmers. Within-session, intra- and inter-rater reliability. Competitive swimming clubs in Belgium. 21 competitive swimmers. Intra- and inter-rater reliability of the lumbar-locked thoracic rotation test. Intraclass correlation coefficients (ICCs) ranged from 0.91 (95% CI 0.78 to 0.96) to 0.96 (0.89-0.98) for intra-rater reliability. ICCs for inter-rater reliability were 0.89 (0.72-0.95) and 0.86 (0.65-0.94) for right and left thoracic rotation, respectively. Results suggest good to excellent reliability of the lumbar-locked thoracic rotation test, indicating this test can be used reliably in clinical practice. Copyright © 2018 Elsevier Ltd. All rights reserved.
Inter-rater reliability of measures to characterize the tobacco retail environment in Mexico.

PubMed

Hall, Marissa G; Kollath-Cattano, Christy; Reynales-Shigematsu, Luz Myriam; Thrasher, James F

2015-01-01

To evaluate the inter-rater reliability of a data collection instrument to assess the tobacco retail environment in Mexico, after major marketing regulations were implemented. In 2013, two data collectors independently evaluated 21 stores in two census tracts, through a data collection instrument that assessed the presence of price promotions, whether single cigarettes were sold, the number of visible advertisements, the presence of signage prohibiting the sale of cigarettes to minors, and characteristics of cigarette pack displays. We evaluated the inter-rater reliability of the collected data, through the calculation of metrics such as intraclass correlation coefficient, percent agreement, Cohen's kappa and Krippendorff's alpha. Most measures demonstrated substantial or perfect inter-rater reliability. Our results indicate the potential utility of the data collection instrument for future point-of-sale research.
IRR (Inter-Rater Reliability) of a COP (Classroom Observation Protocol)--A Critical Appraisal

ERIC Educational Resources Information Center

Rui, Ning; Feldman, Jill M.

2012-01-01

Notwithstanding broad utility of COPs (classroom observation protocols), there has been limited documentation of the psychometric properties of even the most popular COPs. This study attempted to fill this void by closely examining the item and domain-level IRR (inter-rater reliability) of a COP that was used in a federally funded striving readers…
Assessing Reliability of Medical Record Reviews for the Detection of Hospital Adverse Events.

PubMed

Ock, Minsu; Lee, Sang-il; Jo, Min-Woo; Lee, Jin Yong; Kim, Seon-Ha

2015-09-01

The purpose of this study was to assess the inter-rater reliability and intra-rater reliability of medical record review for the detection of hospital adverse events. We conducted a two-stage retrospective medical record review of a random sample of 96 patients from one acute-care general hospital. The first stage was an explicit patient record review by two nurses to detect the presence of 41 screening criteria (SC). The second stage was an implicit structured review by two physicians to identify the occurrence of adverse events from the positive cases on the SC. The inter-rater reliability of two nurses and that of two physicians were assessed. The intra-rater reliability was also evaluated using a test-retest method approximately two weeks later. In 84.2% of the patient medical records, the nurses agreed as to the necessity for the second stage review (kappa, 0.68; 95% confidence interval [CI], 0.54 to 0.83). In 93.0% of the patient medical records screened by nurses, the physicians agreed about the absence or presence of adverse events (kappa, 0.71; 95% CI, 0.44 to 0.97). When assessing intra-rater reliability, the kappa indices of two nurses were 0.54 (95% CI, 0.31 to 0.77) and 0.67 (95% CI, 0.47 to 0.87), whereas those of two physicians were 0.87 (95% CI, 0.62 to 1.00) and 0.37 (95% CI, -0.16 to 0.89). In this study, the medical record review for detecting adverse events showed an intermediate to good level of inter-rater and intra-rater reliability. A well-organized training program for reviewers and clearly defined SC are required to obtain more reliable results in hospital adverse event studies.
[Inter-rater reliability and validity of the OPD-CA axes structure and conflict].

PubMed

Benecke, Cord; Bock, Astrid; Wieser, Elke; Tschiesner, Reinhard; Lochmann, Martha; Küspert, Felicia; Schorn, Robert; Viertler, Bernhard; Steinmayr-Gensluckner, Maria

2011-01-01

The manual of the Operationalized Psychodynamic Diagnostics in childhood and adolescence (OPD-CA) is an instrument that is now widely used in clinical practice to assess psychodynamic dimensions. Published data on its inter-rater agreement and validity have been lacking. This study assessed the inter-rater reliability and validity of the structure axis and the conflict axis. Sixty adolescents between 14 and 17 years of age, with and without mental disorders, were assessed with the OPD-CA (Arbeitskreis OPD-KJ, 2007), SCID-II interviews and questionnaires. A subsample of 36 OPD-CA interviews formed the data basis for the assessment of inter-rater agreement. Validity of the structure axis and the conflict axis was calculated with the whole sample. Inter-rater agreement for the structure axis and the conflict axis showed good to very good weighted kappa coefficients among the trained raters. Validity of the structure axis showed good results. The OPD-CA allows a reliable diagnosis on the structure axis and the conflict axis if the ratings are made by trained raters on the basis of semistructured videotaped interviews. The structure axis shows validity, while the results concerning the validity of the conflict axis remain unclear.

Inter-rater reliability of three standardized functional tests in patients with low back pain

PubMed Central

Tidstrand, Johan; Horneij, Eva

2009-01-01

Background Of all patients with low back pain, 85% are diagnosed as having "non-specific lumbar pain". Lumbar instability has been described as one specific diagnosis, which several authors have characterized by delayed muscular responses, impaired postural control and impaired muscular coordination among these patients. This has mostly been measured and evaluated in a laboratory setting. There are few standardized and evaluated functional tests examining functional muscular coordination that are also applicable in the non-laboratory setting. In ordinary clinical work, tests of functional muscular coordination should be easy to apply. The aim of the present study was therefore to standardize and examine the inter-rater reliability of three functional tests of muscular functional coordination of the lumbar spine in patients with low back pain. Methods Nineteen consecutive individuals, ten men and nine women, were included (mean age 42 years, SD ± 12 years). Two independent examiners assessed three tests: "single limb stance", "sitting on a Bobath ball with one leg lifted" and "unilateral pelvic lift" on the same occasion. The standardization procedure took altered positions of the spine or pelvis and compensatory movements of the free extremities into account. The inter-rater reliability was analyzed by Cohen's kappa coefficient (κ) and by percentage agreement. Results The inter-rater reliability for the right and the left leg respectively was: for the single limb stance very good (κ: 0.88–1.0), for sitting on a Bobath ball good (κ: 0.79) and very good (κ: 0.88) and for the unilateral pelvic lift: good (κ: 0.61) and moderate (κ: 0.47). Conclusion The present study showed good to very good inter-rater reliability for two standardized tests, that is, the single-limb stance and sitting on a Bobath-ball with one leg lifted. Inter-rater reliability for the unilateral pelvic lift test was moderate to good. Validation of the tests in their ability to evaluate lumbar stability is required. PMID:19490644
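The two agreement statistics named in this record, percentage agreement and Cohen's kappa, can be illustrated with a minimal Python sketch. It is not the study's analysis: the function name `cohen_kappa` and the pass/fail ratings below are hypothetical, and the formula is the standard unweighted kappa for two raters.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Return (percent agreement, unweighted Cohen's kappa) for two raters.

    Both arguments are equal-length sequences of category labels, one label
    per subject. Hypothetical helper, for illustration only.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of subjects")
    n = len(rater_a)
    # Observed agreement: share of subjects given the same category by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return p_observed, (p_observed - p_chance) / (1.0 - p_chance)


# Made-up ratings (1 = test passed, 0 = test failed) for ten subjects.
examiner_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
examiner_2 = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
agreement, kappa = cohen_kappa(examiner_1, examiner_2)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")  # agreement = 0.80, kappa = 0.58
```

In this made-up example the raters agree on 80% of subjects, yet kappa is only about 0.58 because half of that agreement would be expected by chance given the marginal frequencies.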
Reliability of the modified Tufts Lumbar Degenerative Disc Classification between neurosurgeons and neuroradiologists.

PubMed

Burke, Shane M; Hwang, Steven W; Mehan, William A; Bedi, Harprit S; Ogbuji, Richard; Riesenburger, Ron I

2016-07-01

Cross-specialty inter-rater reliability has not been explicitly reported for imaging characteristics that are thought to be important in lumbar intervertebral disc degeneration. Sufficient cross-specialty reliability is an essential consideration if radiographic stratification of symptomatic patients to specific treatment modalities is to ever be realized. Therefore the purpose of this study was to directly compare the assessment of such characteristics between neurosurgeons and neuroradiologists. Sixty consecutive patients with a diagnosis of lumbago and appropriate imaging were selected for inclusion. Lumbar MRI scans were evaluated using the Tufts Degenerative Disc Classification by two neurosurgeons and two neuroradiologists. Inter-rater reliability was assessed using Cohen's κ values both within and between specialties. A sensitivity analysis was performed for a modified grading system, which excluded high intensity zones (HIZ) because of their poor cross-specialty inter-rater reliability. The reliability of HIZ between neurosurgeons and neuroradiologists was fair in two of the four cross-specialty comparisons in this study (neurosurgeon 1 versus both radiologists κ=0.364 and κ=0.290). Removing HIZ from the classification improved inter-rater reliability for all comparisons within and between specialties (0.465⩽κ⩽0.576). In addition, intra-rater reliability remained in the moderate to substantial range (0.523⩽κ⩽0.649). Given our findings and corroboration with previous studies, identification of HIZ seems to have a markedly variable reliability. Thus we recommend modification of the original Tufts Degenerative Disc Classification by removing HIZ in order to make the overall grade provided by this classification more reproducible when scored by practitioners of different training backgrounds. Copyright © 2015 Elsevier Ltd. All rights reserved.
Reliability of physical examination tests for the diagnosis of knee disorders: Evidence from a systematic review.

PubMed

Décary, Simon; Ouellet, Philippe; Vendittoli, Pascal-André; Desmeules, François

2016-12-01

Clinicians often rely on physical examination tests to guide them in the diagnostic process of knee disorders. However, reliability of these tests is often overlooked and may influence the consistency of results and overall diagnostic validity. Therefore, the objective of this study was to systematically review evidence on the reliability of physical examination tests for the diagnosis of knee disorders. A structured literature search was conducted in databases up to January 2016. Included studies needed to report reliability measures of at least one physical test for any knee disorder. Methodological quality was evaluated using the QAREL checklist. A qualitative synthesis of the evidence was performed. Thirty-three studies were included with a mean QAREL score of 5.5 ± 0.5. Based on low to moderate quality evidence, the Thessaly test for meniscal injuries reached moderate inter-rater reliability (k = 0.54). Based on moderate to excellent quality evidence, the Lachman for anterior cruciate ligament injuries reached moderate to excellent inter-rater reliability (k = 0.42 to 0.81). Based on low to moderate quality evidence, the Tibiofemoral Crepitus, Joint Line and Patellofemoral Pain/Tenderness, Bony Enlargement and Joint Pain on Movement tests for knee osteoarthritis reached fair to excellent inter-rater reliability (k = 0.29 to 0.93). Based on low to moderate quality evidence, the Lateral Glide, Lateral Tilt, Lateral Pull and Quality of Movement tests for patellofemoral pain reached moderate to good inter-rater reliability (k = 0.49 to 0.73). Many physical tests appear to reach good inter-rater reliability, but this is based on low-quality and conflicting evidence. High-quality research is required to evaluate the reliability of knee physical examination tests. Copyright © 2016 Elsevier Ltd. All rights reserved.
Reliability and criterion validity of two applications of the iPhone™ to measure cervical range of motion in healthy participants

PubMed Central

2013-01-01

Summary of background data Recent smartphones, such as the iPhone, are often equipped with an accelerometer and magnetometer, which, through software applications, can perform various inclinometric functions. Although these applications are intended for recreational use, they have the potential to measure and quantify range of motion. The purpose of this study was to estimate the intra- and inter-rater reliability as well as the criterion validity of the clinometer and compass applications of the iPhone in the assessment of cervical range of motion in healthy participants. Methods The sample consisted of 28 healthy participants. Two examiners measured cervical range of motion of each participant twice using the iPhone (for the estimation of intra- and inter-rater reliability) and once with the CROM (for the estimation of criterion validity). Estimates of reliability and validity were then established using the intraclass correlation coefficient (ICC). Results We observed a moderate intra-rater reliability for each movement (ICC = 0.65-0.85) but a poor inter-rater reliability (ICC < 0.60). For the criterion validity, the ICCs are moderate (>0.50) to good (>0.65) for movements of flexion, extension, lateral flexions and right rotation, but poor (<0.50) for the movement left rotation. Conclusion We found good intra-rater reliability and lower inter-rater reliability. When compared to the gold standard, these applications showed moderate to good validity. However, before using the iPhone as an outcome measure in clinical settings, studies should be done on patients presenting with cervical problems. PMID:23829201
The reliability of three psoriasis assessment tools: Psoriasis area and severity index, body surface area and physician global assessment.

PubMed

Bożek, Agnieszka; Reich, Adam

2017-08-01

A wide variety of psoriasis assessment tools have been proposed to evaluate the severity of psoriasis in clinical trials and daily practice. The most frequently used clinical instrument is the psoriasis area and severity index (PASI); however, none of the currently published severity scores used for psoriasis meets all the validation criteria required for an ideal score. The aim of this study was to compare and assess the reliability of 3 commonly used assessment instruments for psoriasis severity: the psoriasis area and severity index (PASI), body surface area (BSA) and physician global assessment (PGA). On the scoring day, 10 trained dermatologists evaluated 9 adult patients with plaque-type psoriasis using the PASI, BSA and PGA. All the subjects were assessed twice by each physician. Correlations between the assessments were analyzed using the Pearson correlation coefficient. Intra-class correlation coefficient (ICC) was calculated to analyze intra-rater reliability, and the coefficient of variation (CV) was used to assess inter-rater variability. Significant correlations were observed among the 3 scales in both assessments. In all 3 scales the ICCs were > 0.75, indicating high intra-rater reliability. The highest ICC was for the BSA (0.96) and the lowest one for the PGA (0.87). The CV for the PGA and PASI were 29.3 and 36.9, respectively, indicating moderate inter-rater variability. The CV for the BSA was 57.1, indicating high inter-rater variability. Comparing the PASI, PGA and BSA, it was shown that the PGA had the highest inter-rater reliability, whereas the BSA had the highest intra-rater reliability. The PASI showed intermediate values in terms of inter- and intra-rater reliability. None of the 3 assessment instruments showed a significant advantage over the others. A reliable assessment of psoriasis severity requires the use of several independent evaluations simultaneously.
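The abstract does not say how the coefficient of variation was aggregated across patients. As a rough sketch under one plausible assumption (the CV of the raters' scores is computed per patient and then averaged), the calculation could look like the following. The function names and the scores are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """CV in percent: sample standard deviation divided by the mean."""
    m = mean(scores)
    if m == 0:
        raise ValueError("CV is undefined when the mean score is zero")
    return stdev(scores) / m * 100.0


def mean_inter_rater_cv(scores_by_patient):
    """Average the per-patient CV across patients (an assumed aggregation).

    scores_by_patient holds one list per patient, each containing the
    scores that the different raters gave to that patient.
    """
    return mean(coefficient_of_variation(scores) for scores in scores_by_patient)


# Made-up severity scores from four raters for three patients.
example_scores = [
    [10.0, 12.0, 8.0, 10.0],
    [20.0, 25.0, 15.0, 20.0],
    [5.0, 5.0, 5.0, 5.0],
]
print(f"mean inter-rater CV = {mean_inter_rater_cv(example_scores):.1f}%")
```

A higher CV means the raters' scores are more spread out relative to their mean, which is why the abstract reads the largest CV (BSA) as the highest inter-rater variability.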
Validity and reliability of a new ankle dorsiflexion measurement device.

PubMed

Gatt, Alfred; Chockalingam, Nachiappan

2013-08-01

The assessment of the maximum ankle dorsiflexion angle is an important clinical examination procedure. Evidence shows that the traditional goniometer is highly unreliable, and various designs of goniometers to measure the maximum ankle dorsiflexion angle rely on the application of a known force to obtain reliable results. Hence, an innovative ankle dorsiflexion measurement device was designed to make this measurement more reliable by holding the foot in a selected posture without the application of a known moment. To report on the comprehensive validity and reliability testing carried out on the new device. Following validity testing, four different trials to test reliability of the ankle dorsiflexion measurement device were performed. These trials included inter-rater and intra-rater testing with a controlled moment, intra-rater reliability testing with knees flexed and extended without a controlled moment, intra-rater testing with a patient population, and inter-rater reliability testing between four raters of varying experience without controlling moment. All raters were blinded. A series of trials to test intra-rater and inter-rater reliabilities. Intra-rater reliability intraclass correlation coefficient was 0.98 and inter-rater reliability intraclass correlation coefficient (2,1) was 0.953 with a controlled moment. With uncontrolled moment, very high intra-tester reliability was also achieved (intraclass correlation coefficient = 0.94 with knees extended and intraclass correlation coefficient = 0.95 with knees flexed). For the trial investigating test-retest reliability with actual patients, intraclass correlation coefficient of 0.99 was obtained. In the trial investigating four different raters with uncontrolled moment, intraclass correlation coefficient of 0.91 was achieved. The new ankle dorsiflexion measurement device is a valid and reliable device for measuring ankle dorsiflexion in both healthy subjects and patients, with both controlled and uncontrolled moments, even by multiple raters of varying experience when the foot is dorsiflexed to its end of range of motion. An ankle dorsiflexion measuring device has been designed to increase the reliability of ankle dorsiflexion measurement and replace the traditional goniometer. While the majority of similar devices rely on application of a known moment to perform this measurement, it has been shown that this is not required with the new ankle dorsiflexion measurement device and, rather, foot posture should be taken into consideration as this affects the maximum ankle dorsiflexion angle.
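The "(2,1)" label in this record is assumed here to follow the common Shrout and Fleiss convention: a two-way random-effects, absolute-agreement, single-rater ICC. Under that assumption, a minimal pure-Python sketch of the calculation from a subjects-by-raters table is shown below. The function name `icc_2_1` and the ratings are illustrative and are not data from the study.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings is a list of rows, one row per subject, each row holding one
    score per rater (no missing values). Hypothetical helper.
    """
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with at least 2 subjects and 2 raters")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_rows - ms_error) / denominator


# Illustrative table: six subjects (rows) scored by four raters (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(f"ICC(2,1) = {icc_2_1(example):.2f}")  # ICC(2,1) = 0.29
```

Because this form penalizes systematic offsets between raters (the rater mean square appears in the denominator), raters who rank subjects similarly but score at different levels, as in the example, yield a low value.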
Can we have an overall osteoarthritis severity score for the patellofemoral joint using magnetic resonance imaging? Reliability and validity.

PubMed

Kobayashi, Sarah; Peduto, Anthony; Simic, Milena; Fransen, Marlene; Refshauge, Kathryn; Mah, Jean; Pappas, Evangelos

2018-04-01

This work aimed to assess inter-rater reliability and agreement of a magnetic resonance imaging (MRI)-based Kellgren and Lawrence (K&L) grading for patellofemoral joint osteoarthritis (OA) and to validate it against the MRI Osteoarthritis Knee Score (MOAKS). MRI scans from people aged 45 to 75 years with chronic knee pain participating in a randomised clinical trial evaluating dietary supplements were utilised. Fifty participants were randomly selected and scored using the MRI-based K&L grading using axial and sagittal MRI scans. Raters scored the scans for inter-rater reliability while blinded to clinical information, radiology reports and other raters' results. Intra- and inter-rater reliability and agreement were evaluated using the intra-class correlation coefficient (ICC) and Cohen's weighted kappa. There was a 2-week interval between the first and second readings for intra-rater reliability. Validity was assessed using the MOAKS and evaluated using Spearman's correlation coefficient. Intra-rater reliability of the K&L system was excellent: ICC 0.91 (95% CI 0.82-0.95); weighted kappa κ = 0.69. Inter-rater reliability was high (ICC 0.88; 95% CI 0.79-0.93), while agreement between raters was moderate (κ = 0.49-0.57). Validity analysis demonstrated a strong correlation between the total MOAKS features score and the K&L grading system (ρ = 0.62-0.67) but weak correlations when compared with individual MOAKS features (ρ = 0.19-0.61). The high reliability and good agreement show consistency in grading the severity of patellofemoral OA with the MRI-based K&L score. Our validity results suggest that the scale may be useful, particularly in the clinical environment. Future research should validate this method against clinical findings.
Establishing inter-rater reliability scoring in a state trauma system.

PubMed

Read-Allsopp, Christine

2004-01-01

Trauma systems rely on accurate Injury Severity Scoring (ISS) to describe trauma patient populations. Twenty-seven Trauma Nurse Coordinators and Data Managers across the New South Wales, Australia, state trauma network were instructed in the uses and techniques of the Abbreviated Injury Scale (AIS) from the Association for the Advancement of Automotive Medicine. The aim was to provide accurate, reliable and valid data for the state trauma network. Four months after the course, a coding exercise was conducted to assess inter-rater reliability. The results show that inter-rater reliability is within accepted international standards.
Reproducibility of cervical range of motion in patients with neck pain

PubMed Central

Hoving, Jan Lucas; Pool, Jan JM; van Mameren, Henk; Devillé, Walter JLM; Assendelft, Willem JJ; de Vet, Henrica CW; de Winter, Andrea F; Koes, Bart W; Bouter, Lex M

2005-01-01

Background Reproducibility measurements of the range of motion are an important prerequisite for the interpretation of study results. The aim of the study is to assess the intra-rater and inter-rater reproducibility of the measurement of active Range of Motion (ROM) in patients with neck pain using the Cybex Electronic Digital Inclinometer-320 (EDI-320). Methods In an outpatient clinic in a primary care setting, 32 patients with at least 2 weeks of pain and/or stiffness in the neck were randomly assessed in a test-retest design with blinded raters using a standardized measurement protocol. Cervical flexion-extension, lateral flexion and rotation were assessed. Results Reliability expressed by the Intraclass Correlation Coefficient (ICC) was 0.93 (lateral flexion) or higher for intra-rater reliability and 0.89 (lateral flexion) or higher for inter-rater reliability. The 95% limits of agreement for intra-rater agreement, expressing the range of the differences between two ratings, were -2.5 ± 11.1° for flexion-extension, -0.1 ± 10.4° for lateral flexion and -5.9 ± 13.5° for rotation. For inter-rater agreement the limits of agreement were 3.3 ± 17.0° for flexion-extension, 0.5 ± 17.0° for lateral flexion and -1.3 ± 24.6° for rotation. Conclusion In general, the intra-rater reproducibility and the inter-rater reproducibility were good. We recommend comparing the reproducibility and clinical applicability of the EDI-320 inclinometer with other cervical ROM measures in symptomatic patients. PMID:16351719
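The "mean ± value" form of the limits of agreement in this record matches the usual Bland-Altman construction: the mean of the paired differences, plus or minus 1.96 times their standard deviation. The sketch below assumes that construction. The function name `limits_of_agreement` and the angle readings are hypothetical.

```python
from statistics import mean, stdev


def limits_of_agreement(first, second):
    """Return (bias, half_width) so that the 95% limits are bias ± half_width.

    first and second are paired measurements of the same subjects, for
    example two raters or two sessions. Assumes the differences are roughly
    normally distributed.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need at least two paired measurements")
    differences = [a - b for a, b in zip(first, second)]
    bias = mean(differences)  # systematic difference between the two ratings
    half_width = 1.96 * stdev(differences)  # spread of the differences
    return bias, half_width


# Made-up range-of-motion readings in degrees from two raters.
rater_1 = [120.0, 115.0, 130.0, 110.0, 125.0]
rater_2 = [118.0, 119.0, 127.0, 112.0, 121.0]
bias, half_width = limits_of_agreement(rater_1, rater_2)
print(f"limits of agreement: {bias:.1f} ± {half_width:.1f} degrees")
```

Unlike the ICC, the result is in the units of the measurement, so it can be compared directly against a clinically relevant change in degrees.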
Inter-rater reliability of twelve diagnostic systems of schizophrenia.

PubMed

Helmes, E; Landmark, J; Kazarian, S S

1983-05-01

The present and past symptomatology of 31 chronic schizophrenics was rated by four independent judges, two experienced clinical psychiatrists and two psychiatric residents, in a context more representative of actual clinical practice than most research studies. Ratings were made on 64 symptoms derived from 12 diagnostic systems, based on either live or videotaped interviews for present symptomatology and case records for past symptomatology. Inter-rater reliabilities were higher for present than for past symptoms, and in general did not approach those reported for highly trained raters. There were no differences between live and videotaped interviews. Diagnostic systems differed widely in rater agreement. The most consistent across both past and present symptomatology were the systems of Langfeldt, Schneider, and DSM-III, for which the level of reliability was consistent with other studies.
The inter-rater reliability and prognostic value of coma scales in Nepali children with acute encephalitis syndrome.

PubMed

Ray, Stephen; Rayamajhi, Ajit; Bonnett, Laura J; Solomon, Tom; Kneen, Rachel; Griffiths, Michael J

2018-02-01

Background Acute encephalitis syndrome (AES) is a common cause of coma in Nepali children. The Glasgow coma scale (GCS) is used to assess the level of coma in these patients and predict outcome. Alternative coma scales may have better inter-rater reliability and prognostic value in encephalitis in Nepali children, but this has not been studied. The Adelaide coma scale (ACS), Blantyre coma scale (BCS) and the Alert, Verbal, Pain, Unresponsive scale (AVPU) are alternatives to the GCS which can be used. Methods Children aged 1-14 years who presented to Kanti Children's Hospital, Kathmandu with AES between September 2010 and November 2011 were recruited. All four coma scales (GCS, ACS, BCS and AVPU) were applied on admission, 48 h later and on discharge. Inter-rater reliability (unweighted kappa) was measured for each. Correlation and agreement between total coma score and outcome (Liverpool outcome score) were measured by Spearman's rank correlation and Bland-Altman plot. The prognostic value of coma scales alone and in combination with physiological variables was investigated in a subgroup (n = 22). A multivariable logistic regression model was fitted by backward stepwise selection. Results Fifty children were recruited. Inter-rater reliability using the various scales was fair to moderate. However, the scales poorly predicted clinical outcome. Combining the scales with physiological parameters such as systolic blood pressure improved outcome prediction. Conclusion This is the first study to compare four coma scales in Nepali children with AES. The scales exhibited fair to moderate inter-rater reliability. However, the study is inadequately powered to answer the question on the relationship between coma scales and outcome. Further larger studies are required.
Translation, adaptation and inter-rater reliability of the administration manual for the Fugl-Meyer assessment.

PubMed

Michaelsen, Stella M; Rocha, André S; Knabben, Rodrigo J; Rodrigues, Luciano P; Fernandes, Claudia G C

2011-01-01

Recently, the reliability of the Brazilian version of the Fugl-Meyer Assessment (FMA) was assessed through the scoring given according to observations made by a single evaluator who applied the test. When different raters apply the scale, the reliability may depend on the interpretation given to the assessment sheet. In such cases, a clear administration manual is essential for ensuring homogeneity of application. To translate and adapt the French Canadian version of the FMA administration manual into Brazilian Portuguese and to evaluate the inter-rater reliability when different evaluators apply the FMA on the basis of the information contained in the manual. Eighteen adults (59±10 years) with chronic hemiparesis (38±35 months after a stroke) took part in this study. Eight patients participated in the first part of the study and 10 in the second part. Based on analyzing the results from part 1, an adapted version was developed, in which information and photos were added to illustrate the positions of the patient and evaluator. The inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). The reliability of the FMA based on the adapted version of the manual was excellent for the total motor scores for the upper limbs (ICC=0.98) and lower limbs (ICC=0.90), as well as for movement sense (ICC=0.98) and upper and lower-limb passive range of motion (ICC=0.84 and 0.90, respectively). The reliability was moderate for tactile sensitivity (0.75). The joint pain assessment presented low reliability. The results showed that, except for pain assessment, application of the FMA based on the adapted version of the application manual for Brazilian Portuguese presented adequate inter-rater reliability.
The impact of revised DSM-5 criteria on the relative distribution and inter-rater reliability of eating disorder diagnoses in a residential treatment setting.

PubMed

Thomas, Jennifer J; Eddy, Kamryn T; Murray, Helen B; Tromp, Marilou D P; Hartmann, Andrea S; Stone, Melissa T; Levendusky, Philip G; Becker, Anne E

2015-09-30

This study evaluated the relative distribution and inter-rater reliability of revised DSM-5 criteria for eating disorders in a residential treatment program. Consecutive adolescent and young adult females (N=150) admitted to a residential eating disorder treatment facility were assigned both DSM-IV and DSM-5 diagnoses by a clinician (n=14) via routine clinical interview and a research assessor (n=4) via structured interview. We compared the frequency of diagnostic assignments under each taxonomy and by type of assessor. We evaluated concordance between clinician and researcher assignment through inter-rater reliability kappa and percent agreement. Significantly fewer patients received either clinician or researcher diagnoses of a residual eating disorder under DSM-5 (clinician-12.0%; researcher-31.3%) versus DSM-IV (clinician-28.7%; researcher-59.3%), with the majority of reassigned DSM-IV residual cases reclassified as DSM-5 anorexia nervosa. Researcher and clinician diagnoses showed moderate inter-rater reliability under DSM-IV (κ=.48) and DSM-5 (κ=.57), though agreement for specific DSM-5 other specified feeding or eating disorder (OSFED) presentations was poor (κ=.05). DSM-5 revisions were associated with significantly less frequent residual eating disorder diagnoses, but not with reduced inter-rater reliability. Findings support specific dimensions of clinical utility for revised DSM-5 criteria for eating disorders. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Translation, reliability, and clinical utility of the Melbourne Assessment 2.

PubMed

Gerber, Corinna N; Plebani, Anael; Labruyère, Rob

2017-10-12

The aims were to (i) provide a German translation of the Melbourne Assessment 2 (MA2), a quantitative test to measure unilateral upper limb function in children with neurological disabilities and (ii) to evaluate its reliability and aspects of clinical utility. After its translation into German and approval of the back translation by the original authors, the MA2 was performed and videotaped twice with 30 children with neuromotor disorders. For each participant, two raters scored the video of the first test for inter-rater reliability. To determine test-retest reliability, one rater additionally scored the video of the second test while the other rater repeated the scoring of the first video to evaluate intra-rater reliability. Time needed for rater training, test administration, and scoring was recorded. The four subscale scores showed excellent intra-, inter-rater, and test-retest reliability with intraclass correlation coefficients of 0.90-1.00 (95%-confidence intervals 0.78-1.00). Score items revealed substantial to almost perfect intra-rater reliability (weighted kappa κw = 0.66-1.00) for the more affected side. Score item inter-rater and test-retest reliability of the same extremity were, with one exception, moderate to almost perfect (κw = 0.42-0.97; κw = 0.40-0.89). Furthermore, the MA2 was feasible and acceptable for patients and clinicians. The MA2 showed excellent subscale and moderate to almost perfect score item reliability. Implications for Rehabilitation There is a lack of high-quality studies about psychometric properties of upper limb measurement tools in the neuropediatric population. The Melbourne Assessment 2 is a promising tool for reliable measurement of unilateral upper limb movement quality in the neuropediatric population. The Melbourne Assessment 2 is acceptable and practicable to therapists and patients for routine use in clinical care.
Feasibility and inter-rater reliability of the ICU Mobility Scale.

PubMed

Hodgson, Carol; Needham, Dale; Haines, Kimberley; Bailey, Michael; Ward, Alison; Harrold, Megan; Young, Paul; Zanni, Jennifer; Buhr, Heidi; Higgins, Alisa; Presneill, Jeff; Berney, Sue

2014-01-01

The objectives of this study were to develop a scale for measuring the highest level of mobility in adult ICU patients and to assess its feasibility and inter-rater reliability. Growing evidence supports the feasibility, safety and efficacy of early mobilization in the intensive care unit (ICU). However, there are no adequately validated tools to quickly, easily, and reliably describe the mobility milestones of adult patients in ICU. Identifying or developing such a tool is a priority for evaluating mobility and rehabilitation activities for research and clinical care purposes. This study was performed at two ICUs in Australia. Thirty ICU nursing and physiotherapy staff assessed the feasibility of the 'ICU Mobility Scale' (IMS) using a 10-item questionnaire. The inter-rater reliability of the IMS was assessed by 2 junior physical therapists, 2 senior physical therapists, and 16 nursing staff in 100 consecutive medical, surgical or trauma ICU patients. An 11-point IMS scale was developed based on multidisciplinary input. Participating clinicians reported that the scale was clear, with 95% of respondents reporting that it took <1 min to complete. The junior and senior physical therapists showed the highest inter-rater reliability with a weighted Kappa (95% confidence interval) of 0.83 (0.76-0.90), while the senior physical therapists and nurses and the junior physical therapists and nurses had a weighted Kappa of 0.72 (0.61-0.83) and 0.69 (0.56-0.81) respectively. The IMS is a feasible tool with strong inter-rater reliability for measuring the maximum level of mobility of adult patients in the ICU. Copyright © 2014 Elsevier Inc. All rights reserved.
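A weighted kappa, as reported for the IMS above, penalises a disagreement more the further apart the two ordinal scores are. The sketch below uses linear weights, which is an assumption: the abstract does not say whether linear or quadratic weights were used. It also assumes the 11 points are coded 0 to 10; the function name `weighted_kappa` and the scores are hypothetical.

```python
def weighted_kappa(rater_a, rater_b, n_categories):
    """Linearly weighted kappa for ordinal scores coded 0 .. n_categories - 1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Marginal proportion of each score for each rater.
    marg_a = [0.0] * n_categories
    marg_b = [0.0] * n_categories
    for a, b in zip(rater_a, rater_b):
        marg_a[a] += 1.0 / n
        marg_b[b] += 1.0 / n
    # Mean observed distance between the two raters' scores.
    observed = sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / n
    # Mean distance expected if the raters scored independently.
    # The usual 1 / (n_categories - 1) scaling cancels in the ratio below.
    expected = sum(
        marg_a[i] * marg_b[j] * abs(i - j)
        for i in range(n_categories)
        for j in range(n_categories)
    )
    if expected == 0.0:
        raise ValueError("kappa is undefined when both raters use one single score")
    return 1.0 - observed / expected


# Hypothetical mobility scores on a 0-10 scale (invented values, not study data).
therapist = [0, 3, 5, 7, 10, 4, 6, 2]
nurse = [0, 3, 4, 7, 9, 4, 8, 2]
print(round(weighted_kappa(therapist, nurse, n_categories=11), 3))
```

Swapping `abs(i - j)` for `(i - j) ** 2` in both sums would give the quadratic-weighted variant.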
Psychometric evaluation of a motor control test battery of the craniofacial region.

PubMed

von Piekartz, H; Stotz, E; Both, A; Bahn, G; Armijo-Olivo, S; Ballenberger, N

2017-12-01

The primary objective of this study was to determine the structural and known-group validity as well as the inter-rater reliability of a test battery to evaluate the motor control of the craniofacial region. Seventy volunteers without TMD and 25 subjects with TMD (Axis I) per the DC/TMD were asked to execute a test battery consisting of eight tests. The tests were video-taped in the same sequence in a standardised manner. Two experienced physical therapists participated in this study as blinded assessors. We used exploratory factor analysis to identify the underlying component structure of the eight tests. Internal consistency (Cronbach's α), inter-rater reliability (intra-class correlation coefficient) and construct validity (ie, hypothesis testing-known-group validity) (receiver operating curves) were also explored for the test battery. The structural validity showed the presence of one factor underlying the construct of the test battery. The internal consistency was excellent (0.90) as well as the inter-rater reliability. All values of reliability were close to 0.9 or above indicating very high inter-rater reliability. The area under the curve (AUC) was 0.93 for rater 1 and 0.94 for rater 2, respectively, indicating excellent discrimination between subjects with TMD and healthy controls. The results of the present study support the psychometric properties of the test battery to measure motor control of the craniofacial region when evaluated through videotaping. This test battery could be used to differentiate between healthy subjects and subjects with musculoskeletal impairments in the cervical and oro-facial regions. In addition, this test battery could be used to assess the effectiveness of management strategies in the craniofacial region. © 2017 John Wiley & Sons Ltd.
Transcultural Adaptation of GRID Hamilton Rating Scale For Depression (GRID-HAMD) to Brazilian Portuguese and Evaluation of the Impact of Training Upon Inter-Rater Reliability.

PubMed

Henrique-Araújo, Ricardo; Osório, Flávia L; Gonçalves Ribeiro, Mônica; Soares Monteiro, Ivandro; Williams, Janet B W; Kalali, Amir; Alexandre Crippa, José; Oliveira, Irismar Reis De

2014-07-01

GRID-HAMD is a semi-structured interview guide developed to overcome flaws in HAM-D, and has been incorporated into an increasing number of studies. Carry out the transcultural adaptation of GRID-HAMD into the Brazilian Portuguese language, evaluate the inter-rater reliability of this instrument and the training impact upon this measure, and verify the raters' opinions of said instrument. The transcultural adaptation was conducted by appropriate methodology. The measurement of inter-rater reliability was done by way of videos that were evaluated by 85 professionals before and after training for the use of this instrument. The intraclass correlation coefficient (ICC) remained between 0.76 and 0.90 for GRID-HAMD-21 and between 0.72 and 0.91 for GRID-HAMD-17. The training did not have an impact on the ICC, except for a few groups of participants with a lower level of experience. Most of the participants showed high acceptance of GRID-HAMD, when compared to other versions of HAM-D. The scale presented adequate inter-rater reliability even before training began. Training did not have an impact on this measure, except for a few groups with less experience. GRID-HAMD received favorable opinions from most of the participants.
Evaluating Written Patient Information for Eczema in German: Comparing the Reliability of Two Instruments, DISCERN and EQIP

PubMed Central

McCool, Megan E.; Wahl, Josepha; Schlecht, Inga; Apfelbacher, Christian

2015-01-01

Patients actively seek information about how to cope with their health problems, but the quality of the information available varies. A number of instruments have been developed to assess the quality of patient information, though primarily in English. Little is known about the reliability of these instruments when applied to patient information in German. The objective of our study was to investigate and compare the reliability of two validated instruments, DISCERN and EQIP, in order to determine which of these instruments is better suited for a further study pertaining to the quality of information available to German patients with eczema. Two independent raters evaluated a random sample of 20 informational brochures in German. All the brochures addressed eczema as a disorder and/or therapy options and care. Intra-rater and inter-rater reliability were assessed by calculating intra-class correlation coefficients, agreement was tested with weighted kappas, and the correlation of the raters’ scores for each instrument was measured with Pearson’s correlation coefficient. DISCERN demonstrated substantial intra- and inter-rater reliability. It also showed slightly better agreement than EQIP. There was a strong correlation of the raters’ scores for both instruments. The findings of this study support the reliability of both DISCERN and EQIP. However, based on the results of the inter-rater reliability, agreement and correlation analyses, we consider DISCERN to be the more precise tool for our project on patient information concerning the treatment and care of eczema. PMID:26440612
The admissions process of a bachelor of science in nursing program: initial reliability and validity of the personal interview.

PubMed

Carpio, B; Brown, B

1993-01-01

The undergraduate nursing degree program (B.Sc.N.) at McMaster University School of Nursing uses small groups, and is learner-centered and problem-based. A study was conducted during the 1991 admissions cycle to determine the initial reliability and validity of the semi-structured personal interview which constitutes the final component of candidate selection for this program. During the interview, three-member teams assess applicant suitability to the program based on six dimensions: applicant motivation, awareness of the program, problem-solving abilities, ability to relate to others, self-appraisal skills, and career goals. Each interviewer assigns the applicant a global rating using a seven-point scale. For the purposes of this study four interviewer teams were randomly selected from the pool of 31 teams to interview four simulated (preprogrammed) applicants. Using two-factor repeated-measures ANOVA to analyze interview ratings, inter-rater and inter-team intraclass correlation coefficients (ICC) were calculated. Inter-team reliability ranged from .64 to .97 for the individual dimensions, and .66 to .89 on global ratings. Inter-rater ICC for the six dimensions ranged from .81 to .99, and .96 to .99 for the global ratings. The item-to-total correlation coefficients between individual dimensions and global ratings ranged from .8 to 1.0. Pearson correlations between items ranged from .77 to 1.0. The ICC were then calculated for the interview scores of 108 actual applicants to the program. Inter-rater reliability based on global ratings was .79 for the single (1 rater) observation, and .91 for the multiple (3 rater) observation. These findings support the continued use of the interview as a reliable instrument with face validity. Studies of predictive validity will be undertaken.
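The step from a single-rater reliability of .79 to a three-rater reliability of .91 in the admissions interview abstract above is what the Spearman-Brown formula predicts when ratings are averaged. The sketch below is illustrative only; the abstract does not say how the three-rater figure was derived, and the function name `spearman_brown` is an assumption.

```python
def spearman_brown(single_rater_reliability, n_raters):
    """Reliability of the mean of n_raters ratings, given the single-rater reliability."""
    r = single_rater_reliability
    return n_raters * r / (1.0 + (n_raters - 1) * r)


# Single-rater ICC of .79 reported for the 108 actual applicants.
predicted = spearman_brown(0.79, 3)
print(round(predicted, 2))  # about 0.92, close to the reported .91
```

The small gap from the reported value is consistent with the single-rater figure having been rounded before publication.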
Inter-rater reliability of Hamilton depression rating scale using video-recorded interviews — Focus on rater-blinding

PubMed Central

Prasad, M. Krishna; Udupa, K.; Kishore, K. R.; Thirthalli, J.; Sathyaprabha, T. N.; Gangadhar, B. N.

2009-01-01

Background: Hamilton depression rating scale (Ham-D) is the most widely used clinician rating scale for depression. There has been no Indian study that has examined the inter-rater reliability (IRR) of video-recorded interviews of the 21-item Ham-D. Aim: To study the IRR of scoring video-recorded interviews for 21-item Ham-D. Materials and Methods: Eighteen subjects with major depressive disorder involved in a larger study were interviewed using the semi-structured clinical interview of the 21-item Ham-D by a primary rater after informed consent. These interviews were video-recorded and portions edited to ensure rater blinding. Subsequently, the video-recorded interviews were rated by a “blind” rater. Both rated the different sub-domains of Ham-D according to Rhoades and Overall (1983). IRR was evaluated using intra-class correlation coefficient. Results: Excellent IRR was observed (0.9891) between the two raters. This was true for each of the primary factors and super-factors. Conclusion: Video-recorded 21-item Ham-D has excellent IRR. Video-recorded interviews of Ham-D can be reliably used to blind raters in research. PMID:19881046

Rater reliability and construct validity of a mobile application for posture analysis

PubMed Central

Szucs, Kimberly A.; Brown, Elena V. Donoso

2018-01-01

[Purpose] Measurement of posture is important for those with a clinical diagnosis as well as researchers aiming to understand the impact of faulty postures on the development of musculoskeletal disorders. A reliable, cost-effective and low tech posture measure may be beneficial for research and clinical applications. The purpose of this study was to determine rater reliability and construct validity of a posture screening mobile application in healthy young adults. [Subjects and Methods] Pictures of subjects were taken in three standing positions. Two raters independently digitized the static standing posture image twice. The app calculated posture variables, including sagittal and coronal plane translations and angulations. Intra- and inter-rater reliability were calculated using the appropriate ICC models for complete agreement. Construct validity was determined through comparison of known groups using repeated measures ANOVA. [Results] Intra-rater reliability ranged from 0.71 to 0.99. Inter-rater reliability was good to excellent for all translations. ICCs were stronger for translations versus angulations. The construct validity analysis found that the app was able to detect the change in the four variables selected. [Conclusion] The posture mobile application has demonstrated strong rater reliability and preliminary evidence of construct validity. This application may have utility in clinical and research settings. PMID:29410561
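The posture study above refers to ICC models for complete agreement without naming them. One common absolute-agreement form is ICC(2,1) of Shrout and Fleiss (two-way random effects, single rater), computed from the mean squares of a subjects-by-raters table. The sketch below assumes that form; the function name `icc_2_1` and the angle values are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores holds one row per subject and one column per rater.
    """
    n = len(scores)       # number of subjects
    k = len(scores[0])    # number of raters
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Sums of squares for subjects, raters, and the residual.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    # Rater variance sits in the denominator, so systematic offsets lower the ICC.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical posture angles in degrees from two raters (invented values).
angles = [[12.1, 12.4], [8.3, 8.0], [15.6, 15.9], [10.2, 10.1], [13.7, 14.2]]
print(round(icc_2_1(angles), 3))
```

A consistency-type ICC would drop the rater term from the denominator and so would not penalise one rater scoring uniformly higher than the other.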
Reliability and main findings of the FEES-Tensilon Test in patients with myasthenia gravis and dysphagia.

PubMed

Im, Sun; Suntrup-Krueger, Sonja; Colbow, Sigrid; Sauer, Sonja; Claus, Inga; Meuth, Sven G; Dziewas, Rainer; Warnecke, Tobias

2018-05-26

Diagnosis of pharyngeal dysphagia caused by myasthenia gravis (MG) based on clinical examination alone is often challenging. Flexible endoscopic evaluation of swallowing (FEES) combined with Tensilon (edrophonium) application, referred to as the FEES-Tensilon Test, was developed to improve diagnostic accuracy and to detect the main symptoms of pharyngeal dysphagia in MG. Here we investigated inter- and intra-rater reliability of the FEES-Tensilon Test and analyzed the main endoscopic findings. Four experienced raters reviewed a total of 20 FEES-Tensilon-Test videos in randomized order. Residue severity was graded at 4 different pharyngeal spaces before and after Tensilon administration. All interpretations were performed twice per rater, 4 weeks apart (a total of 160 scorings). Intra-rater test-retest reliability and inter-rater reliability levels were calculated. The most frequent FEES findings in MG patients before Tensilon application were prominent residues of semisolids spread all over the hypopharynx in varying locations. The reliability level in the interpretation of the FEES-Tensilon test was excellent regardless of the raters' profession or years of experience with FEES. All 4 raters showed high inter- and intra-rater reliability levels in interpreting the FEES-Tensilon Test based on residue clearance (kappa=0.922, 0.981). Degree of residue normalization in the vallecular space after Tensilon application showed the highest inter- and intra-rater reliability level (kappa=0.863, 0.957) followed by the epiglottis (kappa=0.813, 0.946) and pyriform sinuses (kappa=0.836, 0.929). Interpretation of the FEES-Tensilon Test based on residue severity and degree of Tensilon clearance, especially in the vallecular space, is consistent and reliable. This article is protected by copyright. All rights reserved.
The Reliability of Environmental Measures of the College Alcohol Environment.

ERIC Educational Resources Information Center

Clapp, John D.; Whitney, Mike; Shillington, Audrey M.

2002-01-01

Assesses the inter-rater reliability of two environmental scanning tools designed to identify alcohol-related advertisements targeting college students. Inter-rater reliability for these forms varied across different rating categories and ranged from poor to excellent. Suggestions for future research are addressed. (Contains 26 references and 6…
Inter-rater reliability of select physical examination procedures in patients with neck pain.

PubMed

Hanney, William J; George, Steven Z; Kolber, Morey J; Young, Ian; Salamh, Paul A; Cleland, Joshua A

2014-07-01

This study evaluated the inter-rater reliability of select examination procedures in patients with neck pain (NP) conducted over a 24- to 48-h period. Twenty-two patients with mechanical NP participated in a standardized examination. One examiner performed standardized examination procedures and a second blinded examiner repeated the procedures 24-48 h later with no treatment administered between examinations. Inter-rater reliability was calculated with the Cohen Kappa and weighted Kappa for ordinal data while continuous level data were calculated using an intraclass correlation coefficient model 2,1 (ICC2,1). Coefficients for categorical variables ranged from poor to moderate agreement (-0.22 to 0.70 Kappa) and coefficients for continuous data ranged from slight to moderate (ICC2,1 0.28-0.74). The standard error of measurement for cervical range of motion ranged from 5.3° to 9.9° while the minimal detectable change ranged from 12.5° to 23.1°. This study is the first to report inter-rater reliability values for select components of the cervical examination in those patients with NP performed 24-48 h after the initial examination. There was considerably less reliability when compared to previous studies, thus clinicians should consider how the passage of time may influence variability in examination findings over a 24- to 48-h period.
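The minimal detectable change (MDC) quoted in the neck pain abstract above is conventionally derived from the standard error of measurement as MDC = z × √2 × SEM. The reported MDC values are roughly 2.3 times the reported SEM values, which is what a 90% confidence level (z = 1.645) would give; the abstract does not state the level, so that is an inference. The function name `minimal_detectable_change` is hypothetical.

```python
import math


def minimal_detectable_change(sem, z=1.645):
    """MDC from the standard error of measurement.

    z = 1.645 corresponds to 90% confidence (assumed here); use 1.96 for 95%.
    The sqrt(2) term reflects measurement error in both of the two sessions.
    """
    return z * math.sqrt(2.0) * sem


# SEM range for cervical range of motion reported in the abstract, in degrees.
for sem_deg in (5.3, 9.9):
    print(sem_deg, round(minimal_detectable_change(sem_deg), 1))
```

Under this assumption the two SEM values map to about 12.3° and 23.0°, near the reported 12.5° and 23.1°; the remaining difference is plausibly due to rounding of the SEM.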
Reliability and validity of a nutrition and physical activity environmental self-assessment for child care

PubMed Central

Benjamin, Sara E; Neelon, Brian; Ball, Sarah C; Bangdiwala, Shrikant I; Ammerman, Alice S; Ward, Dianne S

2007-01-01

Background Few assessment instruments have examined the nutrition and physical activity environments in child care, and none are self-administered. Given the emerging focus on child care settings as a target for intervention, a valid and reliable measure of the nutrition and physical activity environment is needed. Methods To measure inter-rater reliability, 59 child care center directors and 109 staff completed the self-assessment concurrently, but independently. Three weeks later, a repeat self-assessment was completed by a sub-sample of 38 directors to assess test-retest reliability. To assess criterion validity, a researcher-administered environmental assessment was conducted at 69 centers and was compared to a self-assessment completed by the director. A weighted kappa test statistic and percent agreement were calculated to assess agreement for each question on the self-assessment. Results For inter-rater reliability, kappa statistics ranged from 0.20 to 1.00 across all questions. Test-retest reliability of the self-assessment yielded kappa statistics that ranged from 0.07 to 1.00. The inter-quartile kappa statistic ranges for inter-rater and test-retest reliability were 0.45 to 0.63 and 0.27 to 0.45, respectively. When percent agreement was calculated, questions ranged from 52.6% to 100% for inter-rater reliability and 34.3% to 100% for test-retest reliability. Kappa statistics for validity ranged from -0.01 to 0.79, with an inter-quartile range of 0.08 to 0.34. Percent agreement for validity ranged from 12.9% to 93.7%. Conclusion This study provides estimates of criterion validity, inter-rater reliability and test-retest reliability for an environmental nutrition and physical activity self-assessment instrument for child care. Results indicate that the self-assessment is a stable and reasonably accurate instrument for use with child care interventions. We therefore recommend the Nutrition and Physical Activity Self-Assessment for Child Care (NAP SACC) instrument to researchers and practitioners interested in conducting healthy weight intervention in child care. However, a more robust, less subjective measure would be more appropriate for researchers seeking an outcome measure to assess intervention impact. PMID:17615078
Improving Teacher Selection: The Effect of Inter-Rater Reliability in the Screening Process. CEDR Working Paper. WP #2015-7

ERIC Educational Resources Information Center

Martinkova, Patricia; Goldhaber, Dan

2015-01-01

Inter-rater reliability, commonly assessed by intra-class correlation coefficient ICC, is an important index for describing the extent to which there is consistency amongst two or more raters in assigned measures. In organizational research, the data structure is often hierarchical and designs deviate substantially from the ideal of a balanced…
The reliability of a segmentation methodology for assessing intramuscular adipose tissue and other soft-tissue compartments of lower leg MRI images.

PubMed

Karampatos, Sarah; Papaioannou, Alexandra; Beattie, Karen A; Maly, Monica R; Chan, Adrian; Adachi, Jonathan D; Pritchard, Janet M

2016-04-01

Determine the reliability of a magnetic resonance (MR) image segmentation protocol for quantifying intramuscular adipose tissue (IntraMAT), subcutaneous adipose tissue, total muscle and intermuscular adipose tissue (InterMAT) of the lower leg. Ten axial lower leg MRI slices were obtained from 21 postmenopausal women using a 1 Tesla peripheral MRI system. Images were analyzed using sliceOmatic™ software. The average cross-sectional areas of the tissues were computed for the ten slices. Intra-rater and inter-rater reliability were determined and expressed as the standard error of measurement (SEM) (absolute reliability) and intraclass correlation coefficient (ICC) (relative reliability). Intra-rater and inter-rater reliability for IntraMAT were 0.991 (95% confidence interval [CI] 0.978-0.996, p < 0.05) and 0.983 (95% CI 0.958-0.993, p < 0.05), respectively. For the other soft tissue compartments, the ICCs were all >0.90 (p < 0.05). The absolute intra-rater and inter-rater reliability (expressed as SEM) for segmenting IntraMAT were 22.19 mm² (95% CI 16.97-32.04) and 78.89 mm² (95% CI 60.36-113.92), respectively. This is a reliable segmentation protocol for quantifying IntraMAT and other soft-tissue compartments of the lower leg. A standard operating procedure manual is provided to assist users, and SEM values can be used to estimate sample size and determine confidence in repeated measurements in future research.
Inter-Rater Reliability of Cyclotorsion Measurements Using Fundus Photography.

PubMed

Dysli, Muriel; Kanku, Madeleine; Traber, Ghislaine L

2018-04-01

The foveo-papillary angle (FPA) on fundus photographs is the accepted standard for the measurement of ocular cyclotorsion. We assessed the inter-rater reliability of this method in healthy subjects and in patients with trochlear nerve palsies. In this methodological study, fundus photographs of healthy subjects and of patients with trochlear nerve palsies were made with a fundus camera (Zeiss Fundus Camera FF 450 plus, Jena, Germany). Three independent observers measured the FPA on the fundus photographs of all subjects in synedra View (synedra View 16, Version 16.0.0.11, Innsbruck, Austria). One hundred and four eyes of 52 subjects (26 healthy controls and 26 patients) were assessed. The mean FPA of the healthy controls was 5.80 degrees (°) [± 0.44 standard error of the mean (SEM)] compared to 11.55° (± 0.80 SEM) for patients with trochlear nerve palsies. The inter-rater reliability of all measured FPAs showed an intraclass correlation coefficient (ICC) of 0.98 (95% CI 0.97 - 0.98). The inter-rater reliability of objective cyclotorsion measurements using fundus photographs was very high. Georg Thieme Verlag KG Stuttgart · New York.
Braden scale (ALB) for assessing pressure ulcer risk in hospital patients: A validity and reliability study.

PubMed

Chen, Hong-Lin; Cao, Ying-Juan; Zhang, Wei; Wang, Jing; Huai, Bao-Sha

2017-02-01

The inter-rater reliability of the Braden scale is not very good. We modified the Braden scale into the Braden(ALB) scale by defining the nutrition subscale based on serum albumin, then assessed its validity and reliability in hospital patients. We designed a retrospective study for validity analysis, and a prospective study for reliability analysis. Receiver operating curve (ROC) and area under the curve (AUC) were used to evaluate the predictive validity. Intra-class correlation coefficient (ICC) was used to investigate the inter-rater reliability. Two thousand five hundred twenty-five patients were included for validity analysis; 76 patients (3.0%) developed pressure ulcer. Positive correlation was found between serum albumin and nutrition score in Braden scale (Spearman's coefficient 0.2203, P<0.0001). The AUCs for Braden scale and Braden(ALB) scale predicting pressure ulcer risk were 0.813 (95% CI 0.797-0.828; P<0.0001), and 0.859 (95% CI 0.845-0.872; P<0.0001), respectively. The Braden(ALB) scale was even more valid than the Braden scale (z=1.860, P=0.0628). In different age subgroups, the Braden(ALB) scale also seems more valid than the original Braden scale, but no statistically significant differences were found (P>0.05). The inter-rater reliability study showed that the ICC value increased by 45.9% for the nutrition subscale and by 4.3% for the total score. The Braden(ALB) scale has similar validity compared with the original Braden scale for in-hospital patients. However, the inter-rater reliability was significantly increased. Copyright © 2016 Elsevier Inc. All rights reserved.
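The pairing z=1.860, P=0.0628 in the Braden(ALB) abstract above matches a two-sided p-value from the standard normal distribution, which is why the AUC difference falls short of significance at the 0.05 level. The sketch below only reproduces that conversion; it does not reproduce the AUC comparison itself, and the function name `two_sided_p_from_z` is hypothetical.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2.0))


print(round(two_sided_p_from_z(1.860), 4))  # roughly 0.063, in line with the reported P=0.0628
```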
Inter-Rater Reliability of the Modified Ashworth Scale and Modified Modified Ashworth Scale in Assessing Poststroke Elbow Flexor Spasticity

ERIC Educational Resources Information Center

Kaya, Taciser; Goksel Karatepe, Altinay; Gunaydin, Rezzan; Koc, Aysegul; Altundal Ercan, Ulku

2011-01-01

The Modified Ashworth Scale (MAS) is commonly used in clinical practice for grading spasticity. However, it was modified recently by omitting grade "1+" of the MAS and redefining grade "2". The aim of this study was to investigate the inter-rater reliability of MAS and modified MAS (MMAS) for the assessment of poststroke elbow flexor spasticity.…
Comparison of in vivo 3D cone-beam computed tomography tooth volume measurement protocols.

PubMed

Forst, Darren; Nijjar, Simrit; Flores-Mir, Carlos; Carey, Jason; Secanell, Marc; Lagravere, Manuel

2014-12-23

The objective of this study is to analyze a set of previously developed and proposed image segmentation protocols for precision in both intra- and inter-rater reliability for in vivo tooth volume measurements using cone-beam computed tomography (CBCT) images. Six 3D volume segmentation procedures were proposed and tested for intra- and inter-rater reliability to quantify maxillary first molar volumes. Ten randomly selected maxillary first molars were measured in vivo in random order three times with 10 days separation between measurements. Intra- and inter-rater agreement for all segmentation procedures was attained using intra-class correlation coefficient (ICC). The highest precision was for automated thresholding with manual refinements. A tooth volume measurement protocol for CBCT images employing automated segmentation with manual human refinement on a 2D slice-by-slice basis in all three planes of space possessed excellent intra- and inter-rater reliability. Three-dimensional volume measurements of the entire tooth structure are more precise than 3D volume measurements of only the dental roots apical to the cemento-enamel junction (CEJ).
THE INTRA- AND INTER-RATER RELIABILITY OF THE SOCCER INJURY MOVEMENT SCREEN (SIMS).

PubMed

McCunn, Robert; Aus der Fünten, Karen; Govus, Andrew; Julian, Ross; Schimpchen, Jan; Meyer, Tim

2017-02-01

The growing volume of movement screening research reveals a belief among practitioners and researchers alike that movement quality may have an association with injury risk. However, existing movement screening tools have not considered the sport-specific movement and injury patterns relevant to soccer. The present study introduces the Soccer Injury Movement Screen (SIMS), which has been designed specifically for use within soccer. Furthermore, the purpose of the present study was to assess the intra- and inter-rater reliability of the SIMS and determine its suitability for use in further research. The study utilized a test-retest design to discern reliability. Twenty-five (11 males, 14 females) healthy, recreationally active university students (age 25.5 ± 4.0 years, height 171 ± 9 cm, weight 64.7 ± 12.6 kg) agreed to participate. The SIMS contains five sub-tests: the anterior reach, single-leg deadlift, in-line lunge, single-leg hop for distance and tuck jump. Each movement was scored out of 10 points and summed to produce a composite score out of 50. The anterior reach and single-leg hop for distance were scored in real-time while the remaining tests were filmed and scored retrospectively. Three raters conducted the SIMS with each participant on three occasions separated by an average of three and a half days (minimum one day, maximum seven days). Rater 1 re-scored the filmed movements for all participants on all occasions six months later to establish the 'pure' intra-rater (intra-occasion) reliability for those movements. Intraclass correlation coefficient (ICC) values for intra- and inter-rater composite score reliability ranged from 0.66-0.72 and 0.79-0.86 respectively. Weighted kappa values representing the intra- and inter-rater reliability of the individual sub-tests ranged from 0.35-0.91 indicating fair to almost perfect agreement. Establishing the reliability of the SIMS is a prerequisite for further research seeking to investigate the relationship between test score and subsequent injury. The present results indicate acceptable reliability for this purpose; however, room for further development of the intra-rater reliability exists for some of the individual sub-tests. Level of evidence: 2b.
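The weighted kappa used for the individual sub-tests can be sketched as follows. This is a minimal Python illustration that assumes linear weights; the abstract does not state whether linear or quadratic weights were used. The function name weighted_kappa, the 0 to 10 category list, and the example scores are hypothetical.

```python
import math


def weighted_kappa(rater_a, rater_b, categories):
    """Linearly weighted Cohen's kappa for two raters on an ordinal scale.

    categories: the ordered list of all possible scores.
    Assumption: linear weights w = 1 - |i - j| / (k - 1).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty score lists")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    def weight(i, j):
        # Full credit for identical scores, partial credit for near misses.
        return 1.0 - abs(i - j) / (k - 1)

    observed = sum(weight(index[a], index[b]) for a, b in zip(rater_a, rater_b)) / n

    # Marginal proportions of each category for each rater.
    p_a = [sum(1 for a in rater_a if index[a] == i) / n for i in range(k)]
    p_b = [sum(1 for b in rater_b if index[b] == i) / n for i in range(k)]
    expected = sum(
        weight(i, j) * p_a[i] * p_b[j] for i in range(k) for j in range(k)
    )

    if expected == 1.0:
        return math.nan  # undefined: both raters used one identical category
    return (observed - expected) / (1.0 - expected)


# Hypothetical sub-test scores (0 to 10) from two raters for six participants.
scores_rater_1 = [7, 8, 5, 9, 6, 10]
scores_rater_2 = [7, 7, 5, 9, 8, 10]
print(round(weighted_kappa(scores_rater_1, scores_rater_2, list(range(11))), 3))
```

Unlike unweighted kappa, this version gives partial credit when two raters differ by only a point or two, which suits ordinal movement scores.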
THE INTRA- AND INTER-RATER RELIABILITY OF THE SOCCER INJURY MOVEMENT SCREEN (SIMS)

PubMed Central

McCunn, Robert; aus der Fünten, Karen; Govus, Andrew; Julian, Ross; Schimpchen, Jan; Meyer, Tim

2017-01-01

Background/purpose The growing volume of movement screening research reveals a belief among practitioners and researchers alike that movement quality may have an association with injury risk. However, existing movement screening tools have not considered the sport-specific movement and injury patterns relevant to soccer. The present study introduces the Soccer Injury Movement Screen (SIMS), which has been designed specifically for use within soccer. Furthermore, the purpose of the present study was to assess the intra- and inter-rater reliability of the SIMS and determine its suitability for use in further research. Methods The study utilized a test-retest design to discern reliability. Twenty-five (11 males, 14 females) healthy, recreationally active university students (age 25.5 ± 4.0 years, height 171 ± 9 cm, weight 64.7 ± 12.6 kg) agreed to participate. The SIMS contains five sub-tests: the anterior reach, single-leg deadlift, in-line lunge, single-leg hop for distance and tuck jump. Each movement was scored out of 10 points and summed to produce a composite score out of 50. The anterior reach and single-leg hop for distance were scored in real-time while the remaining tests were filmed and scored retrospectively. Three raters conducted the SIMS with each participant on three occasions separated by an average of three and a half days (minimum one day, maximum seven days). Rater 1 re-scored the filmed movements for all participants on all occasions six months later to establish the ‘pure’ intra-rater (intra-occasion) reliability for those movements. Results Intraclass correlation coefficient (ICC) values for intra- and inter-rater composite score reliability ranged from 0.66-0.72 and 0.79-0.86 respectively. Weighted kappa values representing the intra- and inter-rater reliability of the individual sub-tests ranged from 0.35-0.91 indicating fair to almost perfect agreement.
Conclusions Establishing the reliability of the SIMS is a prerequisite for further research seeking to investigate the relationship between test score and subsequent injury. The present results indicate acceptable reliability for this purpose; however, room for further development of the intra-rater reliability exists for some of the individual sub-tests. Level of evidence 2b PMID:28217416
The reliability and validity of the Turkish version of Fullerton Advanced Balance (FAB-T) scale.

PubMed

Iyigun, Gozde; Kirmizigil, Berkiye; Angin, Ender; Oksuz, Sevim; Can, Filiz; Eker, Levent; Rose, Debra J

2018-06-04

The aim of this study was to evaluate the reliability and validity of the Turkish version of the FAB (FAB-T) scale in older Turkish adults. The reliability and validity of the scale were tested on 200 community-dwelling older adults. The FAB-T scale was scored by different physiotherapists on different days to evaluate inter-rater and intra-rater reliability. The Berg Balance Scale (BBS) was used for the evaluation of convergent validity, and the content validity of the FAB-T scale was investigated. The FAB-T scale showed very high inter- and intra-rater reliability. For inter-rater agreement, on the individual test items and total score, ICC values were 0.92 (95% CI; 0.90-0.94) and 0.96 (95% CI; 0.95-0.97) respectively. For intra-rater agreement, on the individual test items and total score, ICC values were 0.93 (95% CI; 0.91-0.95) and 0.96 (95% CI; 0.95-0.97) respectively. There was good agreement between the FAB-T and BBS scales. A high correlation was found between the BBS and FAB-T scales [rho = 0.70 (95% CI; 0.62-0.76)] indicating good convergent validity. Considering the content validity of the FAB-T scale, no floor (floor score: 0%) or ceiling (ceiling score: 6.5%) effect was detected. The FAB-T scale was successfully translated from the original English version (FAB) and demonstrated strong psychometric features. It was found that the FAB-T scale has very high inter-rater and intra-rater reliability. Considering the convergent validity, the scale has high correlation with the BBS. The FAB-T has no floor and ceiling effect. Copyright © 2018 Elsevier B.V. All rights reserved.
The reliability of a modified Kalamazoo Consensus Statement Checklist for assessing the communication skills of multidisciplinary clinicians in the simulated environment.

PubMed

Peterson, Eleanor B; Calhoun, Aaron W; Rider, Elizabeth A

2014-09-01

With increased recognition of the importance of sound communication skills and communication skills education, reliable assessment tools are essential. This study reports on the psychometric properties of an assessment tool based on the Kalamazoo Consensus Statement Essential Elements Communication Checklist. The Gap-Kalamazoo Communication Skills Assessment Form (GKCSAF), a modified version of an existing communication skills assessment tool, the Kalamazoo Essential Elements Communication Checklist-Adapted, was used to assess learners in a multidisciplinary, simulation-based communication skills educational program using multiple raters. 118 simulated conversations were available for analysis. Internal consistency and inter-rater reliability were determined by calculating a Cronbach's alpha score and intra-class correlation coefficients (ICC), respectively. The GKCSAF demonstrated high internal consistency with a Cronbach's alpha score of 0.844 (faculty raters) and 0.880 (peer observer raters), and high inter-rater reliability with an ICC of 0.830 (faculty raters) and 0.89 (peer observer raters). The Gap-Kalamazoo Communication Skills Assessment Form is a reliable method of assessing the communication skills of multidisciplinary learners using multi-rater methods within the learning environment. The Gap-Kalamazoo Communication Skills Assessment Form can be used by educational programs that wish to implement a reliable assessment and feedback system for a variety of learners. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
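The internal-consistency statistic named above, Cronbach's alpha, can be sketched in a few lines of Python. This is an illustrative implementation of the standard formula, not the authors' analysis code. The function name cronbach_alpha and the example ratings are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of item scores.

    scores: one row per rated conversation, one column per checklist item.
    Uses alpha = k / (k - 1) * (1 - sum(item variances) / variance(row totals)).
    """
    if len(scores) < 2:
        raise ValueError("need at least two rows")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with at least two items")

    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: four conversations scored on three checklist items.
example_scores = [
    [2, 3, 3],
    [4, 4, 5],
    [3, 2, 4],
    [5, 5, 4],
]
print(round(cronbach_alpha(example_scores), 3))  # 0.808
```

Alpha describes how consistently the items move together within one rater's scores, which is why the abstract reports it separately from the ICC used for agreement between raters.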
Inter-rater agreement in evaluation of disability: systematic review of reproducibility studies

PubMed Central

Barth, Jürgen; de Boer, Wout E L; Busse, Jason W; Hoving, Jan L; Kedzia, Sarah; Couban, Rachel; Fischer, Katrin; von Allmen, David Y; Spanjer, Jerry

2017-01-01

Objectives To explore agreement among healthcare professionals assessing eligibility for work disability benefits. Design Systematic review and narrative synthesis of reproducibility studies. Data sources Medline, Embase, and PsycINFO searched up to 16 March 2016, without language restrictions, and review of bibliographies of included studies. Eligibility criteria Observational studies investigating reproducibility among healthcare professionals performing disability evaluations using a global rating of working capacity and reporting inter-rater reliability by a statistical measure or descriptively. Studies could be conducted in insurance settings, where decisions on ability to work include normative judgments based on legal considerations, or in research settings, where decisions on ability to work disregard normative considerations. Teams of paired reviewers identified eligible studies, appraised their methodological quality and generalisability, and abstracted results with pretested forms. As heterogeneity of research designs and findings impeded a quantitative analysis, a descriptive synthesis stratified by setting (insurance or research) was performed. Results From 4562 references, 101 full text articles were reviewed. Of these, 16 studies conducted in an insurance setting and seven in a research setting, performed in 12 countries, met the inclusion criteria. Studies in the insurance setting were conducted with medical experts assessing claimants who were actual disability claimants or played by actors, hypothetical cases, or short written scenarios. Conditions were mental (n=6, 38%), musculoskeletal (n=4, 25%), or mixed (n=6, 38%). Applicability of findings from studies conducted in an insurance setting to real life evaluations ranged from generalisable (n=7, 44%) and probably generalisable (n=3, 19%) to probably not generalisable (n=6, 37%). Median inter-rater reliability among experts was 0.45 (range intraclass correlation coefficient 0.86 to κ −0.10).
Inter-rater reliability was poor in six studies (37%) and excellent in only two (13%). This contrasts with studies conducted in the research setting, where the median inter-rater reliability was 0.76 (range 0.91-0.53), and 71% (5/7) studies achieved excellent inter-rater reliability. Reliability between assessing professionals was higher when the evaluation was guided by a standardised instrument (23 studies, P=0.006). No such association was detected for subjective or chronic health conditions or the studies’ generalisability to real world evaluation of disability (P=0.46, 0.45, and 0.65, respectively). Conclusions Despite their common use and far reaching consequences for workers claiming disabling injury or illness, research on the reliability of medical evaluations of disability for work is limited and indicates high variation in judgments among assessing professionals. Standardising the evaluation process could improve reliability. Development and testing of instruments and structured approaches to improve reliability in evaluation of disability are urgently needed. PMID:28122727
Validity and reliability of balance assessment software using the Nintendo Wii balance board: usability and validation

PubMed Central

2014-01-01

Background A balance test provides important information such as the standard to judge an individual’s functional recovery or make the prediction of falls. The development of a tool for a balance test that is inexpensive and widely available is needed, especially in clinical settings. The Wii Balance Board (WBB) is designed to test balance, but there is little software used in balance tests, and there are few studies on reliability and validity. Thus, we developed a balance assessment software using the Nintendo Wii Balance Board, investigated its reliability and validity, and compared it with a laboratory-grade force platform. Methods Twenty healthy adults participated in our study. The participants participated in the test for inter-rater reliability, intra-rater reliability, and concurrent validity. The tests were performed with balance assessment software using the Nintendo Wii balance board and a laboratory-grade force platform. Data such as Center of Pressure (COP) path length and COP velocity were acquired from the assessment systems. The inter-rater reliability, the intra-rater reliability, and concurrent validity were analyzed by an intraclass correlation coefficient (ICC) value and a standard error of measurement (SEM). Results The inter-rater reliability (ICC: 0.89-0.79, SEM in path length: 7.14-1.90, SEM in velocity: 0.74-0.07), intra-rater reliability (ICC: 0.92-0.70, SEM in path length: 7.59-2.04, SEM in velocity: 0.80-0.07), and concurrent validity (ICC: 0.87-0.73, SEM in path length: 5.94-0.32, SEM in velocity: 0.62-0.08) were high in terms of COP path length and COP velocity. Conclusion The balance assessment software incorporating the Nintendo Wii balance board was used in our study and was found to be a reliable assessment device. In clinical settings, the device can be remarkably inexpensive, portable, and convenient for the balance assessment. PMID:24912769
Validity and reliability of balance assessment software using the Nintendo Wii balance board: usability and validation.

PubMed

Park, Dae-Sung; Lee, GyuChang

2014-06-10

A balance test provides important information such as the standard to judge an individual's functional recovery or make the prediction of falls. The development of a tool for a balance test that is inexpensive and widely available is needed, especially in clinical settings. The Wii Balance Board (WBB) is designed to test balance, but there is little software used in balance tests, and there are few studies on reliability and validity. Thus, we developed a balance assessment software using the Nintendo Wii Balance Board, investigated its reliability and validity, and compared it with a laboratory-grade force platform. Twenty healthy adults participated in our study. The participants participated in the test for inter-rater reliability, intra-rater reliability, and concurrent validity. The tests were performed with balance assessment software using the Nintendo Wii balance board and a laboratory-grade force platform. Data such as Center of Pressure (COP) path length and COP velocity were acquired from the assessment systems. The inter-rater reliability, the intra-rater reliability, and concurrent validity were analyzed by an intraclass correlation coefficient (ICC) value and a standard error of measurement (SEM). The inter-rater reliability (ICC: 0.89-0.79, SEM in path length: 7.14-1.90, SEM in velocity: 0.74-0.07), intra-rater reliability (ICC: 0.92-0.70, SEM in path length: 7.59-2.04, SEM in velocity: 0.80-0.07), and concurrent validity (ICC: 0.87-0.73, SEM in path length: 5.94-0.32, SEM in velocity: 0.62-0.08) were high in terms of COP path length and COP velocity. The balance assessment software incorporating the Nintendo Wii balance board was used in our study and was found to be a reliable assessment device. In clinical settings, the device can be remarkably inexpensive, portable, and convenient for the balance assessment.
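The abstract reports a standard error of measurement (SEM) alongside each ICC but does not say how it was derived. One widely used relation is SEM = SD × sqrt(1 − ICC), and the sketch below assumes that relation. The function name and the example path lengths are hypothetical.

```python
import math
from statistics import stdev


def standard_error_of_measurement(scores, icc):
    """SEM from the sample SD of the scores and a reliability coefficient.

    Assumption: SEM = SD * sqrt(1 - ICC); the source does not state its formula.
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc is expected to lie between 0 and 1")
    return stdev(scores) * math.sqrt(1.0 - icc)


# Hypothetical COP path lengths (same unit as the returned SEM).
path_lengths = [10, 12, 14, 16, 18]
print(round(standard_error_of_measurement(path_lengths, 0.84), 3))  # 1.265
```

Because the SEM is expressed in the unit of the measurement itself, it complements the unitless ICC when judging whether a change between two sessions exceeds measurement noise.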
Evaluating the reliability of an injury prevention screening tool: Test-retest study.

PubMed

Gittelman, Michael A; Kincaid, Madeline; Denny, Sarah; Wervey Arnold, Melissa; FitzGerald, Michael; Carle, Adam C; Mara, Constance A

2016-10-01

A standardized injury prevention (IP) screening tool can identify family risks and allow pediatricians to address behaviors. To assess behavior changes on later screens, the tool must be reliable for an individual and ideally between household members. Little research has examined the reliability of safety screening tool questions. This study utilized test-retest reliability of parent responses on an existing IP questionnaire and also compared responses between household parents. Investigators recruited parents of children 0 to 1 year of age during admission to a tertiary care children's hospital. When both parents were present, one was chosen as the "primary" respondent. Primary respondents completed the 30-question IP screening tool after consent, and they were re-screened approximately 4 hours later to test individual reliability. The "second" parent, when present, only completed the tool once. All participants received a 10-dollar gift card. Cohen's Kappa was used to estimate test-retest reliability and inter-rater agreement. Standard test-retest criteria classify Kappa values of 0.0 to 0.40 as poor to fair, 0.41 to 0.60 as moderate, 0.61 to 0.80 as substantial, and 0.81 to 1.00 as almost perfect reliability. One hundred five families participated, with five lost to follow-up. Thirty-two (30.5%) parent dyads completed the tool. Primary respondents were generally mothers (88%) and Caucasian (72%). Test-retest of the primary respondents showed their responses to be almost perfect; average 0.82 (SD = 0.13, range 0.49 to 1.00). Seventeen questions had almost perfect test-retest reliability and 11 had substantial reliability. However, inter-rater agreement between household members for 12 objective questions showed little agreement between responses; inter-rater agreement averaged 0.35 (SD = 0.34, range -0.19 to 1.00). One question had almost perfect inter-rater agreement and two had substantial inter-rater agreement.
The IP screening tool used by a single individual had excellent test-retest reliability for nearly all questions. However, when a reporter changes from pre- to postintervention, differences may reflect poor reliability or different subjective experiences rather than true change.
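The Kappa criteria quoted in this abstract can be made concrete with a short Python sketch that computes Cohen's kappa for one question and maps the result onto the stated bands. The function names and example answers are hypothetical. Treating negative values and values between band edges (for example 0.405) as belonging to the lower band is an assumption, since the abstract does not cover those cases.

```python
import math
from collections import Counter


def cohens_kappa(first, second):
    """Unweighted Cohen's kappa for two equally long lists of categorical answers."""
    if len(first) != len(second) or not first:
        raise ValueError("need two equally long, non-empty answer lists")
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_first = Counter(first)
    counts_second = Counter(second)
    # Chance agreement from the product of the two marginal distributions.
    expected = sum(counts_first[c] * counts_second[c] for c in counts_first) / (n * n)
    if expected == 1.0:
        return math.nan  # undefined: every answer falls in one shared category
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa):
    """Map kappa onto the bands quoted in the abstract (edge handling assumed)."""
    if kappa <= 0.40:
        return "poor to fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical yes/no answers to one question at screen and re-screen.
screen = ["y", "y", "n", "n", "y", "n", "y", "y", "n", "n"]
rescreen = ["y", "y", "n", "n", "y", "n", "y", "n", "n", "n"]
print(round(cohens_kappa(screen, rescreen), 2))
print(interpret_kappa(0.82), "|", interpret_kappa(0.35))
```

The same function serves both uses described in the abstract: one respondent at two time points (test-retest) or two household members at one time point (inter-rater agreement).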

Self-audit of lockout/tagout in manufacturing workplaces: A pilot study.

PubMed

Yamin, Samuel C; Parker, David L; Xi, Min; Stanley, Rodney

2017-05-01

Occupational health and safety (OHS) self-auditing is a common practice in industrial workplaces. However, few audit instruments have been tested for inter-rater reliability and accuracy. A lockout/tagout (LOTO) self-audit checklist was developed for use in manufacturing enterprises. It was tested for inter-rater reliability and accuracy using responses of business self-auditors and external auditors. Inter-rater reliability at ten businesses was excellent (κ = 0.84). Business self-auditors had high (100%) accuracy in identifying elements of LOTO practice that were present as well as those that were absent (81% accuracy). Reliability and accuracy increased further when problematic checklist questions were removed from the analysis. Results indicate that the LOTO self-audit checklist would be useful in manufacturing firms' efforts to assess and improve their LOTO programs. In addition, a reliable self-audit instrument removes the need for external auditors to visit worksites, thereby expanding capacity for outreach and intervention while minimizing costs. © 2017 Wiley Periodicals, Inc.
The Outdoor MEDIA DOT: The development and inter-rater reliability of a tool designed to measure food and beverage outlets and outdoor advertising.

PubMed

Poulos, Natalie S; Pasch, Keryn E

2015-07-01

Few studies of the food environment have collected primary data, and even fewer have reported reliability of the tool used. This study focused on the development of an innovative electronic data collection tool used to document outdoor food and beverage (FB) advertising and establishments near 43 middle and high schools in the Outdoor MEDIA Study. Tool development used GIS based mapping, an electronic data collection form on handheld devices, and an easily adaptable interface to efficiently collect primary data within the food environment. For the reliability study, two teams of data collectors documented all FB advertising and establishments within one half-mile of six middle schools. Inter-rater reliability was calculated overall and by advertisement or establishment category using percent agreement. A total of 824 advertisements (n=233), establishment advertisements (n=499), and establishments (n=92) were documented (range=8-229 per school). Overall inter-rater reliability of the developed tool ranged from 69-89% for advertisements and establishments. Results suggest that the developed tool is highly reliable and effective for documenting the outdoor FB environment. Copyright © 2015 Elsevier Ltd. All rights reserved.
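Percent agreement, the reliability measure used here, is simple enough to show directly. The sketch below assumes each documented item has been matched between the two teams and carries a category label. The abstract does not describe the matching procedure, so the data layout, function name, and example records are hypothetical.

```python
from collections import defaultdict


def percent_agreement(records):
    """Overall and per-category percent agreement between two teams.

    records: iterable of (category, code_team_a, code_team_b) tuples,
             one tuple per item that both teams were expected to document.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for category, code_a, code_b in records:
        totals[category] += 1
        if code_a == code_b:
            matches[category] += 1
    if not totals:
        raise ValueError("no records supplied")

    by_category = {c: 100.0 * matches[c] / totals[c] for c in totals}
    overall = 100.0 * sum(matches.values()) / sum(totals.values())
    return overall, by_category


# Hypothetical matched records from two data collection teams.
example_records = [
    ("advertisement", "beverage", "beverage"),
    ("advertisement", "food", "food"),
    ("advertisement", "food", "beverage"),
    ("advertisement", "beverage", "beverage"),
    ("establishment", "restaurant", "restaurant"),
    ("establishment", "store", "store"),
]
overall, by_category = percent_agreement(example_records)
print(round(overall, 1), by_category)
```

Percent agreement is not corrected for chance, so it tends to read higher than kappa on the same data when one category dominates.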
The Outdoor MEDIA DOT: The Development and Inter-Rater Reliability of a Tool Designed to Measure Food and Beverage Outlets and Outdoor Advertising

PubMed Central

Poulos, Natalie S.; Pasch, Keryn E.

2015-01-01

Few studies of the food environment have collected primary data, and even fewer have reported reliability of the tool used. This study focused on the development of an innovative electronic data collection tool used to document outdoor food and beverage (FB) advertising and establishments near 43 middle and high schools in the Outdoor MEDIA Study. Tool development used GIS based mapping, an electronic data collection form on handheld devices, and an easily adaptable interface to efficiently collect primary data within the food environment. For the reliability study, two teams of data collectors documented all FB advertising and establishments within one half-mile of six middle schools. Inter-rater reliability was calculated overall and by advertisement or establishment category using percent agreement. A total of 824 advertisements (n=233), establishment advertisements (n=499), and establishments (n=92) were documented (range=8–229 per school). Overall inter-rater reliability of the developed tool ranged from 69–89% for advertisements and establishments. Results suggest that the developed tool is highly reliable and effective for documenting the outdoor FB environment. PMID:26022774
Objective measurements of excess skin in post bariatric patients--inter-rater reliability.

PubMed

Biörserud, Christina; Fagevik Olsén, Monika; Elander, Anna; Wiklund, Malin

2016-01-01

An ability to reliably assess excess skin after massive weight loss using well-described and transferrable methods is important. The aim of this trial was to evaluate inter-rater reliability of ptosis and circumference measurements in patients with excess skin after bariatric surgery. Twenty-five postbariatric patients were included in the study, and their excess skin was measured 18 months after surgery. A protocol was designed to measure excess skin in a standardised way. To evaluate the inter-rater reliability in the measuring protocol, all patients were measured twice, by a specialist nurse and a specialist physiotherapist. All circumference measurements on different body parts had an ICC > 0.9, indicating high reliability. Furthermore, all breast and abdominal ptosis measurements had high reliability. In contrast, visual evaluation of abdominal ptosis had poor reliability. Measurements of ptoses on different body parts had an ICC > 0.6. There were no systematic differences between the results of the two testers, except for measurements of the buttocks and maximal knee circumference. The measuring protocol presented in this study has high reliability and, therefore, represents a useful instrument to provide a consistent and objective assessment of excess skin in the postbariatric patient.
Simulated patient training: Using inter-rater reliability to evaluate simulated patient consistency in nursing education.

PubMed

MacLean, Sharon; Geddes, Fiona; Kelly, Michelle; Della, Phillip

2018-03-01

Simulated patients (SPs) are frequently used for training nursing students in communication skills. An acknowledged benefit of using SPs is the opportunity to provide a standardized approach by which participants can demonstrate and develop communication skills. However, relatively little evidence is available on how to best facilitate and evaluate the reliability and accuracy of SPs' performances. The aim of this study was to investigate the effectiveness of an evidence-based SP training framework to ensure standardization of SPs. The training framework was employed to improve inter-rater reliability of SPs. A quasi-experimental study was employed to assess SP post-training understanding of simulation scenario parameters using inter-rater reliability agreement indices. Two phases of data collection took place. In the initial trial phase, audio-visual (AV) recordings of two undergraduate nursing students completing a simulation scenario were rated by eight SPs using the Interpersonal Communication Assessments Scale (ICAS) and Quality of Discharge Teaching Scale (QDTS). In phase 2, eight SP raters and four nursing faculty raters independently evaluated students' (N=42) communication practices using the QDTS. Intraclass correlation coefficients (ICC) were >0.80 for both stages of the study in clinical communication skills. The results support the premise that if trained appropriately, SPs have a high degree of reliability and validity to both facilitate and evaluate student performance in nurse education. Crown Copyright © 2018. Published by Elsevier Ltd. All rights reserved.
Training less-experienced faculty improves reliability of skills assessment in cardiac surgery.

PubMed

Lou, Xiaoying; Lee, Richard; Feins, Richard H; Enter, Daniel; Hicks, George L; Verrier, Edward D; Fann, James I

2014-12-01

Previous work has demonstrated high inter-rater reliability in the objective assessment of simulated anastomoses among experienced educators. We evaluated the inter-rater reliability of less-experienced educators and the impact of focused training with a video-embedded coronary anastomosis assessment tool. Nine less-experienced cardiothoracic surgery faculty members from different institutions evaluated 2 videos of simulated coronary anastomoses (1 by a medical student and 1 by a resident) at the Thoracic Surgery Directors Association Boot Camp. They then underwent a 30-minute training session using an assessment tool with embedded videos to anchor rating scores for 10 components of coronary artery anastomosis. Afterward, they evaluated 2 videos of a different student and resident performing the task. Components were scored on a 1 to 5 Likert scale, yielding an average composite score. Inter-rater reliabilities of component and composite scores were assessed using intraclass correlation coefficients (ICCs) and overall pass/fail ratings with kappa. All components of the assessment tool exhibited improvement in reliability, with 4 (bite, needle holder use, needle angles, and hand mechanics) improving the most from poor (ICC range, 0.09-0.48) to strong (ICC range, 0.80-0.90) agreement. After training, inter-rater reliabilities for composite scores improved from moderate (ICC, 0.76) to strong (ICC, 0.90) agreement, and for overall pass/fail ratings, from poor (kappa = 0.20) to moderate (kappa = 0.78) agreement. Focused, video-based anchor training facilitates greater inter-rater reliability in the objective assessment of simulated coronary anastomoses. Among raters with less teaching experience, such training may be needed before objective evaluation of technical skills. Published by Elsevier Inc.
A new scale for the assessment of performance and capacity of hand function in children with hemiplegic cerebral palsy: reliability and validity studies.

PubMed

Rosa-Rizzotto, M; Visonà Dalla Pozza, L; Corlatti, A; Luparia, A; Marchi, A; Molteni, F; Facchin, P; Pagliano, E; Fedrizzi, E

2014-10-01

In hemiplegic children, recognizing the activity limitation pattern and grading its severity are relevant for clinicians when planning interventions, monitoring results, and predicting outcomes. The aim of the study was to examine the reliability and validity of the Besta Scale, an instrument used in hemiplegic children from 18 months to 12 years of age to measure both grasp on request (capacity) and spontaneous use of the upper limb (performance) in bimanual play activities and in ADL. Psychometric analysis of the reliability and validity of the Besta scale was performed. Outpatient study sample. Reliability study: A sample of 39 patients was enrolled. The administration of the Besta scale was video-recorded in a standardized manner. All videos were scored by 20 independent raters on subsequent viewing. Three raters randomly selected from the 20-rater group rescored the same video two years later for intra-rater reliability. Intra- and inter-rater reliability were calculated using the Intraclass Correlation Coefficient (ICC) and Kendall's coefficient (K), respectively. Internal consistency reliability was assessed using Cronbach's alpha coefficient. Validity study: A sample of 105 children was assessed 5 times (at t0 and 2, 3, 6 and 12 months later) by 20 independent raters. Each patient underwent QUEST and Besta scale administration and assessment at the same time. Criterion validity was calculated using the rho-Pearson coefficient. Reliability study: The inter-rater reliability calculated with Kendall's coefficient was moderate (K=0.47). The intra-rater (or test-retest) reliability for 3 raters was excellent (ICC=0.927). The Cronbach's alpha for internal consistency was 0.972. Validity study: The Besta scale showed good criterion validity compared with QUEST, increasing with age and severity of impairment. The rho-Pearson correlation coefficient r was 0.81 (P<0.0001). Limitations: In infants, the Besta scale finds it hard to distinguish between mild and moderately impaired hand function. The Besta scale scoring system is a valid and reliable tool that can be used in a clinical setting to monitor the evolution of unimanual and bimanual manipulation and to distinguish the hand's capacity from its performance.
Standard setting: comparison of two methods.

PubMed

George, Sanju; Haque, M Sayeed; Oyebode, Femi

2006-09-14

The outcome of assessments is determined by the standard-setting method used. There is a wide range of standard-setting methods and the two used most extensively in undergraduate medical education in the UK are the norm-reference and the criterion-reference methods. The aims of the study were to compare these two standard-setting methods for a multiple-choice question examination and to estimate the test-retest and inter-rater reliability of the modified Angoff method. The norm-reference method of standard-setting (mean minus 1 SD) was applied to the 'raw' scores of 78 4th-year medical students on a multiple-choice question (MCQ) examination. Two panels of raters also set the standard using the modified Angoff method for the same multiple-choice question paper on two occasions (6 months apart). We compared the pass/fail rates derived from the norm-reference and the Angoff methods and also assessed the test-retest and inter-rater reliability of the modified Angoff method. The pass rate with the norm-reference method was 85% (66/78) and that with the Angoff method was 100% (78/78). The percentage agreement between the Angoff method and the norm-reference method was 78% (95% CI 69% - 87%). The modified Angoff method had an inter-rater reliability of 0.81-0.82 and a test-retest reliability of 0.59-0.74. There were significant differences in the outcomes of these two standard-setting methods, as shown by the difference in the proportion of candidates that passed and failed the assessment. The modified Angoff method was found to have good inter-rater reliability and moderate test-retest reliability.
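The norm-reference rule quoted in this abstract (mean minus 1 SD) can be illustrated with a minimal Python sketch. The scores below are hypothetical, and two details are assumptions not stated in the abstract: the sample standard deviation is used, and a candidate passes when the score is at or above the cut score.

```python
import statistics

def norm_reference_cut_score(scores, k=1.0):
    """Cut score = mean - k * SD (sample SD is an assumption)."""
    return statistics.mean(scores) - k * statistics.stdev(scores)

def pass_rate(scores, cut_score):
    """Share of candidates at or above the cut score (>= is an assumption)."""
    return sum(s >= cut_score for s in scores) / len(scores)

# Hypothetical raw scores, not data from the study.
raw_scores = [40, 50, 60, 70, 80]
cut = norm_reference_cut_score(raw_scores)
print(f"cut score: {cut:.2f}, pass rate: {pass_rate(raw_scores, cut):.0%}")
```

Because the cut score moves with the cohort's own mean and spread, this rule will typically fail some candidates unless scores are tightly clustered, which is one reason it can diverge from a criterion-referenced standard such as the Angoff method.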
And the Winner Is … : Inter-Rater Reliability among Scholarship Assessors

ERIC Educational Resources Information Center

Johnston, Lucy; Schluter, Philip J.

2017-01-01

With increasing competition for postgraduate research scholarships, awarding processes demand attention and scrutiny. We examine inter-rater reliability for two prestigious New Zealand scholarships, the Shirtcliffe Fellowship and the Gordon Watson Scholarship. For each scholarship, five assessors (three academic; two non-academic) independently…
Development and reliability of a Turkish version of the Short Form-Joint Protection Behavior Assessment (JPBA-S).

PubMed

Tonga, Eda; Atasavun Uysal, Songul; Karayazgan, Sedef; Hayran, Mutlu; Düger, Tülin

2016-01-01

Clinical measurement. To adapt the original JPBA-S to a Turkish version (TUR-JPBA-S) and to investigate its reliability in assessing patients with rheumatoid arthritis (RA). Twenty-two participants with RA and 21 healthy people were videotaped while performing tasks listed in the TUR-JPBA-S. Two raters scored the video recordings to evaluate inter-rater reliability. One rater re-analyzed the recordings at a different time point for intra-rater reliability. Participants with RA were asked to perform the same tasks after three to four weeks, which was also recorded to evaluate test-retest reliability. Internal consistency (Cronbach's α value) was found to be high (0.89) for participants with RA. Our results demonstrate excellent intra-rater (ICC: 0.99, SEM 1.2) and inter-rater (ICC: 0.99, SEM 1.7) reliability, as well as excellent test-retest reliability (ICC: 0.96). The TUR-JPBA-S is a valid and reliable instrument for assessing JP behavior in patients with RA in Turkey. Level 2. Copyright © 2016 Hanley & Belfus. Published by Elsevier Inc. All rights reserved.
Reliability of digital ulcer definitions as proposed by the UK Scleroderma Study Group: A challenge for clinical trial design.

PubMed

Hughes, Michael; Tracey, Andrew; Bhushan, Monica; Chakravarty, Kuntal; Denton, Christopher P; Dubey, Shirish; Guiducci, Serena; Muir, Lindsay; Ong, Voon; Parker, Louise; Pauling, John D; Prabu, Athiveeraramapandian; Rogers, Christine; Roberts, Christopher; Herrick, Ariane L

2018-06-01

The reliability of clinician grading of systemic sclerosis-related digital ulcers has been reported to be poor to moderate at best, which has important implications for clinical trial design. The aim of this study was to examine the reliability of new proposed UK Scleroderma Study Group digital ulcer definitions among UK clinicians with an interest in systemic sclerosis. Raters graded (through a custom-built interface) 90 images (80 unique and 10 repeat) of a range of digital lesions collected from patients with systemic sclerosis. Lesions were graded on an ordinal scale of severity: 'no ulcer', 'healed ulcer' or 'digital ulcer'. A total of 23 clinicians - 18 rheumatologists, 3 dermatologists, 1 hand surgeon and 1 specialist rheumatology nurse - completed the study. A total of 2070 (1840 unique + 230 repeat) image gradings were obtained. For intra-rater reliability, across all images, the overall weighted kappa coefficient was high (0.71) and was moderate (0.55) when averaged across individual raters. Overall inter-rater reliability was poor (0.15). Although our proposed digital ulcer definitions had high intra-rater reliability, the overall inter-rater reliability was poor. Our study highlights the challenges of digital ulcer assessment by clinicians with an interest in systemic sclerosis and provides a number of useful insights for future clinical trial design. Further research is warranted to improve the reliability of digital ulcer definition/rating as an outcome measure in clinical trials, including examining the role for objective measurement techniques, and the development of digital ulcer patient-reported outcome measures.
Ultrasound measures of tendon thickness: Intra-rater, Inter-rater and Inter-machine reliability.

PubMed

Del Baño-Aledo, María Elena; Martínez-Payá, Jacinto Javier; Ríos-Díaz, José; Mejías-Suárez, Silvia; Serrano-Carmona, Sergio; de Groot-Ferrando, Ana

2017-01-01

Ultrasound imaging is often used by physiotherapists and other healthcare professionals but the reliability of image acquisition with different ultrasound machines is unknown. The objective was to compare the intra-rater, inter-rater and inter-machine reliability of thickness measurements of the plantar fascia (PF), Achilles tendon (AT), patellar tendon (PT) and elbow common extensor tendon (ECET) with musculoskeletal ultrasound imaging (MSUS). Tendon thickness was measured in four anatomical structures (14 participants, 28 images per tendon) by two sonographers and with two different ultrasound machines. Intraclass Correlation Coefficients (ICCs) and Bland-Altman plots were calculated. The standard error of measurement (SEM) and minimum detectable difference (MDD) were calculated. Inter-rater reliability was excellent for AT (ICC=0.98; 95% CI = 0.96-0.99) and very good for PT (ICC=0.85; 95% CI = 0.67-0.93) and ECET (ICC=0.81; 95% CI = 0.72-0.94). Reliability for PF was moderate, with an ICC of 0.63 (95% CI = 0.20-0.83). Bland-Altman plot for inter-machine reliability showed a mean difference of 1 m for PF measurements and a mean difference of 4 m and 20 m for AT and PT. The relative SEMs were below 7% and the MDCs were below 0.7 mm. The MSUS reliability in measuring thickness of the four tendons is confirmed by the homogeneous readings within sonographers, between operators and between different machines. Tendon thickness can be measured reliably on different ultrasound devices, which is an important step forward in the use of this technique in daily clinical practice and research. Level of evidence: III.
Preliminary appraisal of the reliability and validity of the Colorado State University Feline Acute Pain Scale.

PubMed

Shipley, Hilary; Guedes, Alonso; Graham, Lynelle; Goudie-DeAngelis, Elizabeth; Wendt-Hornickle, Erin

2018-05-01

Objectives: The objective of this study was to determine the inter-rater reliability and convergent validity of the Colorado State University Feline Acute Pain Scale (CSU-FAPS) in a preliminary appraisal of its performance in a clinical teaching setting. Methods: Sixty-eight female cats were assessed for pain after ovariohysterectomy. A cohort of 21 cats was examined independently by four raters (two board-certified anesthesiologists and two anesthesia residents) with the CSU-FAPS, and intra-class correlation coefficient (ICC) was used to determine inter-rater reliability. Weighted Cohen's kappa was used to determine inter-rater reliability centered on the 'need to reassess analgesic plan' (dichotomous scale). A separate cohort of 47 cats was evaluated independently by two raters (one board-certified anesthesiologist and one veterinary small animal rotating intern) using the CSU-FAPS and the Glasgow Composite Measure Pain Scale (CMPS-Feline), and Spearman rank-order correlation was determined to assess convergent validity. Reliability was interpreted using Altman's classification as very good, good, moderate, fair and poor. Validity was considered adequate if correlation coefficients were between 0.4 and 0.8. Results: The ICC was 0.61 for anesthesiologists and 0.67 for residents, indicating good reliability. Weighted Cohen's kappa was 0.79 for anesthesiologists and 0.44 for residents, indicating moderate to good reliability. The Spearman rank correlation indicated a statistically significant (P = 0.0003) positive correlation (0.31; 95% confidence interval 0.14-0.46) between the CSU-FAPS and the CMPS-Feline. Conclusions and relevance: The CSU-FAPS showed moderate-to-good inter-rater reliability when used by veterinarians to assess pain level or need to reassess analgesic plan after ovariohysterectomy in cats. The validity fell short of current guidelines for correlation coefficients and further refinement and testing are warranted to improve its performance.
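The abstract interprets its coefficients with Altman's classification but does not list the cut-offs. The sketch below assumes the commonly cited bands (up to 0.20 poor, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 good, 0.81-1.00 very good); the function name is illustrative and the handling of exact boundary values is an assumption.

```python
def altman_category(coefficient):
    """Map an agreement coefficient to Altman's verbal label.

    Cut-offs are the commonly cited bands; the abstract does not state them.
    """
    if coefficient <= 0.20:
        return "poor"
    if coefficient <= 0.40:
        return "fair"
    if coefficient <= 0.60:
        return "moderate"
    if coefficient <= 0.80:
        return "good"
    return "very good"

# Coefficients reported in the abstract above.
for label, value in [("ICC, anesthesiologists", 0.61),
                     ("ICC, residents", 0.67),
                     ("kappa, anesthesiologists", 0.79),
                     ("kappa, residents", 0.44)]:
    print(f"{label}: {value} -> {altman_category(value)}")
```

Under these assumed bands, the reported values map to the same labels the abstract uses: good for both ICCs, and moderate to good for the weighted kappas.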
High inter-rater reliability, agreement, and convergent validity of Constant score in patients with clavicle fractures.

PubMed

Ban, Ilija; Troelsen, Anders; Kristensen, Morten Tange

2016-10-01

The Constant score (CS) has been the primary endpoint in most studies on clavicle fractures. However, the CS was not developed to assess patients with clavicle fractures. Our aim was to examine inter-rater reliability and agreement of the CS in patients with clavicle fractures. The secondary aim was to estimate the correlation between the CS and the Disabilities of the Arm, Shoulder and Hand score and the internal consistency of the 2 scores. On the basis of sample sizing, 36 patients (31 male and 5 female patients; mean age, 41.3 years) with clavicle fractures underwent standardized CS assessment at a mean of 6.8 weeks (SD, 1.0 weeks) after injury. Reliability and agreement of the CS were determined by 2 raters. The intraclass correlation coefficient (ICC2,1), standard error of measurement, minimal detectable change, Cronbach α coefficient, and Pearson correlation coefficient were estimated. Inter-rater reliability of the total CS was excellent (intraclass correlation coefficient, 0.94; 95% confidence interval, 0.88-0.97), with no systematic difference between the 2 raters (P = .75). The standard error of measurement (measurement error at the group level) was 4.9, whereas the minimal detectable change (smallest change needed to indicate a real change for an individual) was 13.6 CS points. The internal consistency of the 10 CS items was good, with a Cronbach α of .85, and we found a strong correlation (r = -0.92) between the CS and Disabilities of the Arm, Shoulder and Hand score. The CS was found to be reliable for assessing patients with clavicle fractures, especially at the group level. With high inter-rater reliability and agreement, in addition to good internal consistency, the standardized CS used in this study can be used for comparison of results from different settings. Copyright © 2016 Journal of Shoulder and Elbow Surgery Board of Trustees. Published by Elsevier Inc. All rights reserved.
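The two error figures reported here are consistent with the conventional 95% relationship MDC = 1.96 × √2 × SEM. The abstract does not state the formula, so the short sketch below is a plausibility check under that assumption, not the authors' documented calculation.

```python
import math

def mdc95_from_sem(sem, z=1.96):
    """Minimal detectable change at 95% confidence: z * sqrt(2) * SEM."""
    return z * math.sqrt(2.0) * sem

# SEM of 4.9 CS points, as reported in the abstract.
print(round(mdc95_from_sem(4.9), 1))  # 13.6
```

The factor √2 reflects that a change score combines the measurement error of two assessments.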
Analyses of inter-rater reliability between professionals, medical students and trained school children as assessors of basic life support skills.

PubMed

Beck, Stefanie; Ruhnke, Bjarne; Issleib, Malte; Daubmann, Anne; Harendza, Sigrid; Zöllner, Christian

2016-10-07

Training of lay-rescuers is essential to improve survival-rates after cardiac arrest. Multiple campaigns emphasise the importance of basic life support (BLS) training for school children. Trainings require a valid assessment to give feedback to school children and to compare the outcomes of different training formats. Considering these requirements, we developed an assessment of BLS skills using MiniAnne and tested the inter-rater reliability between professionals, medical students and trained school children as assessors. Fifteen professional assessors, 10 medical students and 111 trained school children (peers) assessed 1087 school children at the end of a CPR-training event using the new assessment format. Analyses of inter-rater reliability (intraclass correlation coefficient; ICC) were performed. Overall inter-rater reliability of the summative assessment was high (ICC = 0.84, 95% CI: 0.84 to 0.86, n = 889). The number of comparisons between peer-peer assessors (n = 303), peer-professional assessors (n = 339), and peer-student assessors (n = 191) was adequate to demonstrate high inter-rater reliability between peer- and professional-assessors (ICC: 0.76), peer- and student-assessors (ICC: 0.88) and peer- and other peer-assessors (ICC: 0.91). Systematic variation in rating of specific items was observed for three items between professional- and peer-assessors. Using this assessment and integrating peers and medical students as assessors gives the opportunity to assess hands-on skills of school children with high reliability.
Leveraging Data Sampling and Practical Knowledge: Field Instructors' Perceptions about Inter-Rater Reliability Data

ERIC Educational Resources Information Center

Soslau, Elizabeth; Lewis, Kandia

2014-01-01

For accreditation and programmatic decision making, education school administrators use inter-rater reliability analyses to judge credibility of student-teacher assessments. Although weak levels of agreement between university-appointed supervisors and cooperating teachers are usually interpreted to indicate that the process is not being…
Minimal detectable change of the Personal and Social Performance scale in individuals with schizophrenia.

PubMed

Lee, Shu-Chun; Tang, Shih-Fen; Lu, Wen-Shian; Huang, Sheau-Ling; Deng, Nai-Yu; Lue, Wen-Chyn; Hsieh, Ching-Lin

2016-12-30

The minimal detectable change (MDC) of the Personal and Social Performance scale (PSP) has not yet been investigated, limiting its utility in data interpretation. The purpose of this study was to determine the MDCs of the PSP administered by the same rater or different raters in individuals with schizophrenia. Participants with schizophrenia were recruited from two psychiatric community rehabilitation centers to complete the PSP assessments twice, 2 weeks apart, by the same rater or 2 different raters. MDC values were calculated from the coefficients of intra- and inter-rater reliability (i.e., intraclass correlation coefficients). Forty patients (mean age 36.9 years, SD 9.7) from one center participated in the intra-rater reliability study. Another 40 patients (mean age 44.3 years, SD 11.1) from the other center participated in the inter-rater study. The MDCs (MDC%) of the PSP were 10.7 (17.1%) for the same rater and 16.2 (24.1%) for different raters. The MDCs of the PSP appeared appropriate for clinical trials aiming to determine whether a real change in social functioning has occurred in people with schizophrenia. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
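The abstract says the MDCs were derived from the ICCs but does not give the formulas. A minimal sketch of the conventional chain follows, assuming SEM = SD × √(1 − ICC), MDC = 1.96 × √2 × SEM, and MDC% expressed relative to the mean score. The input values are hypothetical and are not taken from the study.

```python
import math

def sem_from_icc(sd, icc):
    """Standard error of measurement from the score SD and the ICC."""
    return sd * math.sqrt(1.0 - icc)

def mdc95(sem):
    """Minimal detectable change at the 95% confidence level."""
    return 1.96 * math.sqrt(2.0) * sem

def mdc_percent(mdc, mean_score):
    """MDC as a percentage of the mean score (assumed definition of MDC%)."""
    return 100.0 * mdc / mean_score

# Hypothetical inputs for illustration only.
sd, icc, mean_score = 12.0, 0.90, 60.0
sem = sem_from_icc(sd, icc)
mdc = mdc95(sem)
print(f"SEM={sem:.2f}, MDC={mdc:.2f}, MDC%={mdc_percent(mdc, mean_score):.1f}%")
```

A lower ICC inflates the SEM and therefore the MDC, which matches the direction of the reported results: the MDC for different raters is larger than for the same rater.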
Inter-Rater Reliability of Total Body Score-A Scale for Quantification of Corpse Decomposition.

PubMed

Nawrocka, Marta; Frątczak, Katarzyna; Matuszewski, Szymon

2016-05-01

The degree of body decomposition can be quantified using Total Body Score (TBS), a scale frequently used in taphonomic or entomological studies of decomposition. Here, the inter-rater reliability of the scale is analyzed. The study was conducted with 120 laypeople, who were trained in the use of the scale. Participants scored decomposition of pig carcasses from photographs. It was found that the scale, when used by different people, gives homogeneous results irrespective of the user's qualifications (Krippendorff's alpha for all participants was 0.818). The study also indicated that carcasses in advanced decomposition receive significantly less accurate scores. Moreover, it was found that scores for cadavers in mosaic decomposition (i.e., representing signs of at least two stages of decomposition) are less accurate. These results demonstrate that the scale may be regarded as inter-rater reliable. Some proposals for refinement of the scale were also discussed. © 2016 American Academy of Forensic Sciences.
Reliability of Real-time Ultrasound Imaging for the Assessment of Trunk Stabilizer Muscles: A Systematic Review of the Literature.

PubMed

Taghipour, Morteza; Mohseni-Bandpei, Mohammad Ali; Behtash, Hamid; Abdollahi, Iraj; Rajabzadeh, Fatemeh; Pourahmadi, Mohammad Reza; Emami, Mahnaz

2018-04-24

Rehabilitative ultrasound (US) imaging has become one of the popular methods for investigating muscle morphologic characteristics and dimensions in recent years. The reliability of this method has been investigated in different studies. As studies have been performed with different designs and quality, reported reliability values for rehabilitative US have a wide range. The objective of this study was to systematically review the literature on the reliability of rehabilitative US imaging for the assessment of deep abdominal and lumbar trunk muscle dimensions. The PubMed/MEDLINE, Scopus, Google Scholar, Science Direct, Embase, Physiotherapy Evidence, Ovid, and CINAHL databases were searched to identify original research articles on the reliability of rehabilitative US imaging published from June 2007 to August 2017. The articles were qualitatively assessed; reliability data were extracted; and the methodological quality was evaluated by 2 independent reviewers. Of the 26 included studies, 16 were considered of high methodological quality. Except for 2 studies, all high-quality studies reported intraclass correlation coefficients (ICCs) for intra-rater reliability of 0.70 or greater. Also, ICCs reported for inter-rater reliability in high-quality studies were generally greater than 0.70. Among low-quality studies, reported ICCs ranged from 0.26 to 0.99 and 0.68 to 0.97 for intra- and inter-rater reliability, respectively. Also, the reported standard error of measurement and minimal detectable change for rehabilitative US were generally in an acceptable range. Generally, the results of the reviewed studies indicate that rehabilitative US imaging has good levels of both inter- and intra-rater reliability. © 2018 by the American Institute of Ultrasound in Medicine.
The Statin-Associated Muscle Symptom Clinical Index (SAMS-CI): Revision for Clinical Use, Content Validation, and Inter-rater Reliability.

PubMed

Rosenson, Robert S; Miller, Kate; Bayliss, Martha; Sanchez, Robert J; Baccara-Dinet, Marie T; Chibedi-De-Roche, Daniela; Taylor, Beth; Khan, Irfan; Manvelian, Garen; White, Michelle; Jacobson, Terry A

2017-04-01

The Statin-Associated Muscle Symptom Clinical Index (SAMS-CI) is a method for assessing the likelihood that a patient's muscle symptoms (e.g., myalgia or myopathy) were caused or worsened by statin use. The objectives of this study were to prepare the SAMS-CI for clinical use, estimate its inter-rater reliability, and collect feedback from physicians on its practical application. For content validity, we conducted structured in-depth interviews with its original authors as well as with a panel of independent physicians. Estimation of inter-rater reliability involved an analysis of 30 written clinical cases which were scored by a sample of physicians. A separate group of physicians provided feedback on the clinical use of the SAMS-CI and its potential utility in practice. Qualitative interviews with providers supported the content validity of the SAMS-CI. Feedback on the clinical use of the SAMS-CI included several perceived benefits (such as brevity, clear wording, and simple scoring process) and some possible concerns (workflow issues and applicability in primary care). The inter-rater reliability of the SAMS-CI was estimated to be 0.77 (confidence interval 0.66-0.85), indicating high concordance between raters. With additional provider feedback, a revised SAMS-CI instrument was created suitable for further testing, both in the clinical setting and in prospective validation studies. With standardized questions, vetted language, easily interpreted scores, and demonstrated reliability, the SAMS aims to estimate the likelihood that a patient's muscle symptoms were attributable to statins. The SAMS-CI may support better detection of statin-associated muscle symptoms in clinical practice, optimize treatment for patients experiencing muscle symptoms, and provide a useful tool for further clinical research.

Inter-rater agreement in evaluation of disability: systematic review of reproducibility studies.

PubMed

Barth, Jürgen; de Boer, Wout E L; Busse, Jason W; Hoving, Jan L; Kedzia, Sarah; Couban, Rachel; Fischer, Katrin; von Allmen, David Y; Spanjer, Jerry; Kunz, Regina

2017-01-25

To explore agreement among healthcare professionals assessing eligibility for work disability benefits. Systematic review and narrative synthesis of reproducibility studies. Medline, Embase, and PsycINFO searched up to 16 March 2016, without language restrictions, and review of bibliographies of included studies. Observational studies investigating reproducibility among healthcare professionals performing disability evaluations using a global rating of working capacity and reporting inter-rater reliability by a statistical measure or descriptively. Studies could be conducted in insurance settings, where decisions on ability to work include normative judgments based on legal considerations, or in research settings, where decisions on ability to work disregard normative considerations. Teams of paired reviewers identified eligible studies, appraised their methodological quality and generalisability, and abstracted results with pretested forms. As heterogeneity of research designs and findings impeded a quantitative analysis, a descriptive synthesis stratified by setting (insurance or research) was performed. From 4562 references, 101 full text articles were reviewed. Of these, 16 studies conducted in an insurance setting and seven in a research setting, performed in 12 countries, met the inclusion criteria. Studies in the insurance setting were conducted with medical experts assessing claimants who were actual disability claimants or played by actors, hypothetical cases, or short written scenarios. Conditions were mental (n=6, 38%), musculoskeletal (n=4, 25%), or mixed (n=6, 38%). Applicability of findings from studies conducted in an insurance setting to real life evaluations ranged from generalisable (n=7, 44%) and probably generalisable (n=3, 19%) to probably not generalisable (n=6, 37%). Median inter-rater reliability among experts was 0.45 (range intraclass correlation coefficient 0.86 to κ = -0.10).
Inter-rater reliability was poor in six studies (37%) and excellent in only two (13%). This contrasts with studies conducted in the research setting, where the median inter-rater reliability was 0.76 (range 0.91-0.53), and 71% (5/7) studies achieved excellent inter-rater reliability. Reliability between assessing professionals was higher when the evaluation was guided by a standardised instrument (23 studies, P=0.006). No such association was detected for subjective or chronic health conditions or the studies' generalisability to real world evaluation of disability (P=0.46, 0.45, and 0.65, respectively). Despite their common use and far reaching consequences for workers claiming disabling injury or illness, research on the reliability of medical evaluations of disability for work is limited and indicates high variation in judgments among assessing professionals. Standardising the evaluation process could improve reliability. Development and testing of instruments and structured approaches to improve reliability in evaluation of disability are urgently needed. Published by the BMJ Publishing Group Limited.
Reliability of real-time ultrasound measurement of transversus abdominis thickness in healthy trained subjects.

PubMed

Gnat, Rafael; Saulicz, Edward; Miądowicz, Barbara

2012-08-01

To investigate intra- and inter-rater reliability of the ultrasound measurement of transversus abdominis (TrA) thickness and thickness change (difference between thickness at rest and during contraction) in asymptomatic, trained subjects. To define the number of repeated measurements that provides an acceptable level of reliability. To investigate variability of the measurements over a period of 5 days and the reliability of duplicate analysis of images. A single-group repeated-measures design was used to assess reliability. Healthy volunteers (n = 10) underwent 1 week of training in voluntary activation of TrA. Real-time ultrasound imaging and subsequent measurement of the TrA thickness at rest and during voluntary contraction were repeated on Monday, Wednesday and Friday of the next week. Using a single repeated measurement, intraclass correlation coefficients (ICCs) for TrA thickness were: 0.86-0.95 (intra-rater), 0.86-0.92 (inter-rater); and for TrA thickness change: 0.34-0.56 (intra-rater), 0.47-0.61 (inter-rater). Using the mean of three repeated measurements, the respective values were: 0.97, 0.96-0.98; and 0.81-0.84, 0.80-0.90. No significant differences were found between mean values of TrA thickness or thickness change obtained on the three measurement days. Duplicate analysis of the images was highly reliable with ICCs of 0.89-0.99. Two repeated measurements for TrA thickness and at least three measurements for TrA thickness change are needed to achieve acceptable levels of intra- and inter-rater reliability. In healthy trained volunteers TrA thickness and thickness change are relatively stable parameters over a 5-day period. Duplicate analysis of the same images by two blinded observers is reliable.
A Spanish validation of the Coma Recovery Scale-Revised (CRS-R).

PubMed

Tamashiro, Mercedes; Rivas, Maria Elisa; Ron, Melania; Salierno, Fernando; Dalera, Marisol; Olmos, Lisandro

2014-01-01

Analysis of inter-rater reliability and concurrent validity. To determine measurement properties of a Spanish version of the Coma Recovery Scale-Revised (CRS-R). A sample of 35 in-patients with severe acquired brain injury. To test concurrent validity of the translated scale, the Glasgow Coma Scale (GCS) and Disability Rating Scale (DRS) were also administered. Two experts in the field were recruited to assess inter-rater agreement. Inter-rater reliability was good for total CRS-R scores (Cronbach α = 0.973, p = 0.001). Sub-scale analysis showed moderate-to-high inter-rater agreement. Total CRS-R scores correlated significantly (p < 0.05) with total GCS (r = 0.74) and DRS (r = 0.54) scores, indicating acceptable concurrent validity. The Spanish version of CRS-R can be administered reliably by trained and experienced examiners. CRS-R appears capable of differentiating patients in Emergence from Minimally Conscious State (EMCS) or in Minimally Conscious State (MCS) from those in a Vegetative State (VS).
Does a Rater's Familiarity with a Candidate's Pronunciation Affect the Rating in Oral Proficiency Interviews?

ERIC Educational Resources Information Center

Carey, Michael D.; Mannell, Robert H.; Dunn, Peter K.

2011-01-01

This study investigated factors that could affect inter-examiner reliability in the pronunciation assessment component of speaking tests. We hypothesized that the rating of pronunciation is susceptible to variation in assessment due to the amount of exposure examiners have to nonnative English accents. An inter-rater variability analysis was…
The inter-rater reliability test of the modified Morse Fall Scale among patients ≥ 55 years old in an acute care hospital in Singapore.

PubMed

Tang, Wing Sze; Chow, Yeow Leng; Koh, Serena Siew Lin

2014-02-01

A prospective, descriptive study was conducted in an acute care hospital in Singapore to determine the inter-rater reliability of the modified Morse Fall Scale by evaluating the degrees of agreement on the ratings of the individual items and overall score between the 'gold standard' assessor and the facility assessors. One hundred and forty-two subjects were recruited during the 1.5 month data collection period. The simple and weighted κ-values were all > 0.8 except for the item 'effects of medications' (κ and κw = 0.63), and the correlation coefficient was high (rs = 0.89) and significant at the < 0.001 level. The modified Morse Fall Scale was shown to be a reliable fall risk assessment tool with a relatively high inter-rater reliability level for the overall score and individual items. This study provides evidence-based psychometric support for the clinical application of this tool. © 2013 Wiley Publishing Asia Pty Ltd.
Use of volunteer student abstractors for a retrospective cohort analysis: a study of inter-rater reliability.

PubMed

Gritsiouk, Yaroslav; Hegsted, Damian; Gardiner, Stuart; Merriman, Lisa; Gubler, Kelly Dean

2013-05-01

Little is known about the reliability of data collected by abstractors without professional medical training. This investigation sought to determine the level of agreement among untrained volunteer abstractors as part of a study to evaluate the risk assessment of venous thromboembolism in patients who have undergone trauma. Forty-nine paper charts were chosen randomly from a volunteer-reviewed cohort of 2,339 and were compared with those of a single experienced abstractor. Inter-rater agreement was assessed using percent agreement, Cohen's kappa, and prevalence-adjusted bias-adjusted kappa (PABAK). Of the 71 data points, 28 had perfect agreement. The average agreement across all charts was 97%. Data with imperfect agreement had kappa values between .27 and .96 (mean, .75), with one additional value at zero even though it was associated with an agreement of 94%. PABAK values ranged from .67 to .98 (mean, .91), an average increase of .17 compared with kappa values. The performance of volunteers showed outstanding inter-rater reliability; however, limitations of interpretation can influence reliability. Copyright © 2013 Elsevier Inc. All rights reserved.
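The pattern noted in this abstract, a kappa of zero alongside 94% agreement, is the well-known prevalence effect on Cohen's kappa. The sketch below reproduces that pattern with hypothetical binary ratings (not the study's data) and computes percent agreement, Cohen's kappa, and PABAK. It uses the two-category form PABAK = 2 × Po − 1, so it applies only to dichotomous items.

```python
from collections import Counter

def agreement_stats(rater_a, rater_b):
    """Return (percent agreement, Cohen's kappa, PABAK) for binary ratings."""
    n = len(rater_a)
    po = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement from each rater's marginal proportions.
    pe = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    kappa = (po - pe) / (1.0 - pe) if pe < 1.0 else float("nan")
    pabak = 2.0 * po - 1.0  # valid for two categories only
    return po, kappa, pabak

# Hypothetical data: one rater codes every chart "1"; the other disagrees on 3 of 50.
a = [1] * 50
b = [1] * 47 + [0] * 3
po, kappa, pabak = agreement_stats(a, b)
print(f"agreement={po:.2f}, kappa={kappa:.2f}, PABAK={pabak:.2f}")
```

When almost all items fall in one category, expected chance agreement approaches observed agreement and kappa collapses toward zero. PABAK ignores the marginal distributions, which is why it stays high in the same situation.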
Reliability of different methodologies of infrared image analysis of myofascial trigger points in the upper trapezius muscle

PubMed Central

Dibai-Filho, Almir V.; Guirro, Elaine C. O.; Ferreira, Vânia T. K.; Brandino, Hugo E.; Vaz, Maíta M. O. L. L.; Guirro, Rinaldo R. J.

2015-01-01

BACKGROUND: Infrared thermography is recognized as a viable method for evaluation of subjects with myofascial pain. OBJECTIVE: The aim of the present study was to assess the intra- and inter-rater reliability of infrared image analysis of myofascial trigger points in the upper trapezius muscle. METHOD: A reliability study was conducted with 24 volunteers of both genders (23 females) between 18 and 30 years of age (22.12±2.54), all having cervical pain and presence of active myofascial trigger point in the upper trapezius muscle. Two trained examiners performed analysis of point, line, and area of the infrared images at two different periods with a 1-week interval. The intra-class correlation coefficient (ICC2,1) was used to assess the intra- and inter-rater reliability. RESULTS: With regard to the intra-rater reliability, ICC values were between 0.591 and 0.993, with temperatures between 0.13 and 1.57 °C for values of standard error of measurement (SEM) and between 0.36 and 4.35 °C for the minimal detectable change (MDC). For the inter-rater reliability, ICC ranged from 0.615 to 0.918, with temperatures between 0.43 and 1.22 °C for the SEM and between 1.19 and 3.38 °C for the MDC. CONCLUSION: The methods of infrared image analyses of myofascial trigger points in the upper trapezius muscle employed in the present study are suitable for clinical and research practices. PMID:25993626
Reliability and validity of CODA motion analysis system for measuring cervical range of motion in patients with cervical spondylosis and anterior cervical fusion.

PubMed

Gao, Zhongyang; Song, Hui; Ren, Fenggang; Li, Yuhuan; Wang, Dong; He, Xijing

2017-12-01

The aim of the present study was to evaluate the reliability of the Cartesian Optoelectronic Dynamic Anthropometer (CODA) motion system in measuring the cervical range of motion (ROM) and verify the construct validity of the CODA motion system. A total of 26 patients with cervical spondylosis and 22 patients with anterior cervical fusion were enrolled and the CODA motion analysis system was used to measure the three-dimensional cervical ROM. Intra- and inter-rater reliability was assessed by intraclass correlation coefficients (ICCs), standard error of measurement (SEm), limits of agreement (LOA) and minimal detectable change (MDC). Independent samples t-tests were performed to examine the differences of cervical ROM between cervical spondylosis and anterior cervical fusion patients. The results revealed that in the cervical spondylosis group, the reliability was almost perfect (intra-rater reliability: ICC, 0.87-0.95; LOA, -12.86 to 13.70; SEm, 2.97-4.58; inter-rater reliability: ICC, 0.84-0.95; LOA, -13.09 to 13.48; SEm, 3.13-4.32). In the anterior cervical fusion group, the reliability was high (intra-rater reliability: ICC, 0.88-0.97; LOA, -10.65 to 11.08; SEm, 2.10-3.77; inter-rater reliability: ICC, 0.86-0.96; LOA, -10.91 to 13.66; SEm, 2.20-4.45). The cervical ROM in the cervical spondylosis group was significantly higher than that in the anterior cervical fusion group in all directions except for left rotation. In conclusion, the CODA motion analysis system is highly reliable in measuring cervical ROM and the construct validity was verified, as the system was sufficiently sensitive to distinguish between the cervical spondylosis and anterior cervical fusion groups based on their ROM.
Evaluation of the Walking Index for Spinal Cord Injury II (WISCI-II) in children with Spinal Cord Injury (SCI).

PubMed

Calhoun Thielen, C; Sadowsky, C; Vogel, L C; Taylor, H; Davidson, L; Bultman, J; Gaughan, J; Mulcahey, M J

2017-05-01

Mixed methods were used in this study. The appropriateness of the levels of the Walking Index for Spinal Cord Injury II (WISCI-II) for application in children was critically reviewed by physical therapists using the Modified Delphi Technique, and the inter- and intra-rater reliability of the WISCI-II in children was evaluated. To examine the construct validity, and to establish reliability of the WISCI-II related to its use in children with spinal cord injury (SCI). United States of America. Using a Modified Delphi Technique, physical therapists critically reviewed the WISCI-II levels for pediatric utilization. Concurrently, ambulatory children under age 18 years with SCI were evaluated using the WISCI-II on two occasions by the same therapist to establish intra-rater reliability. One trial was photographed and de-identified. Each photograph was reviewed by four different physical therapists who gave WISCI-II scores to establish inter-rater reliability. Summary and descriptive statistics were used to calculate the frequency of yes/no responses for each WISCI-II level question and to determine the percent agreement for each question. Inter- and intra-rater reliability was calculated using intraclass correlation coefficients (ICCs) with 95% confidence intervals (CI). Construct validity was confirmed after one Delphi round during which at least 80% agreement was established by 51 physical therapists on the appropriateness of the WISCI-II levels for children. Fifty-two children with SCI aged 2-17 years completed repeated WISCI-II assessments and 40 de-identified photographs were scored by four physical therapists. Intra- and inter-rater reliability was high (ICC=0.997, CI=0.995-0.998 and ICC=0.97, CI=0.95-0.98, respectively). This study demonstrates support for the use of the WISCI-II in ambulatory children with SCI.
This study was funded by the Craig H Neilsen Foundation, Spinal Cord Injury Research on the Translation Spectrum, Senior Research Award #282592 (Mulcahey, PI).
Influence of speech sample on perceptual rating of hypernasality.

PubMed

Medeiros, Maria Natália Leite de; Fukushiro, Ana Paula; Yamashita, Renata Paciello

2016-07-07

To investigate the influence of the speech sample (spontaneous conversation or sentence repetition) on intra- and inter-rater reliability of hypernasality ratings. One hundred and twenty audio-recorded speech samples (60 containing spontaneous conversation and 60 containing repeated sentences) of individuals of both genders with repaired cleft palate±lip, aged between 6 and 52 years (mean=21±10), were selected and edited. Three experienced speech and language pathologists rated hypernasality according to their own criteria using a 4-point scale (1=absence of hypernasality, 2=mild hypernasality, 3=moderate hypernasality and 4=severe hypernasality), first in the spontaneous speech samples and, 30 days later, in the sentence repetition samples. Intra- and inter-rater agreements were calculated for both speech samples and were statistically compared by the Z test at a significance level of 5%. Comparison of intra-rater agreement between the two speech samples showed higher coefficients for sentence repetition than for spontaneous conversation. Comparison of inter-rater agreement showed no significant difference among the three raters for the two speech samples. Sentence repetition improved intra-rater reliability of perceptual judgment of hypernasality. However, the speech sample had no influence on reliability among different raters.
Ultrasonographic measurement of the acromiohumeral distance in spinal cord injury: Reliability and effects of shoulder positioning.

PubMed

Lin, Yen-Sheng; Boninger, Michael L; Day, Kevin A; Koontz, Alicia M

2015-11-01

To investigate the reliability of ultrasonographic measurement of acromiohumeral distance (AHD) and the effects of shoulder positioning on AHD among manual wheelchair users (MWUs) with spinal cord injury (SCI) and an able-bodied control group. Ten MWUs with SCI and 10 able-bodied subjects participated in this study. The ultrasonographic measurements of AHD from each subject were obtained by two raters during passive and active scapular plane arm elevation in neutral, 45°, 90° with and without resistance and in a weight relief raise position. The measurements were recorded again by each rater using the same procedures after a 30-minute time interval. All raters were blinded to each other's measurements. University Laboratories and Veteran Affairs Healthcare System. Intra-rater (intraclass correlation coefficient, ICC > 0.83) and inter-rater (ICC > 0.78) reliability was excellent for both the MWUs with SCI and able-bodied groups across all arm positions except for the 45° position in the control group for one of the raters (intra-rater: ICC < 0.40 and inter-rater: ICC < 0.60). AHD significantly reduced when the shoulder was in the 90° arm elevated positions with or without resistance. Findings from our study demonstrated that ultrasonography is a reliable means to evaluate AHD in both able bodied and individuals with SCI, who are known to have significant shoulder pathology. This technique could be used to develop reference measures and to identify changes in AHD caused by interventions.
Unified Parkinson's Disease Rating Scale-Motor Exam: inter-rater reliability of advanced practice nurse and neurologist assessments.

PubMed

Palmer, Janice L; Coats, Mary A; Roe, Catherine M; Hanko, Shelly M; Xiong, Chengjie; Morris, John C

2010-06-01

This paper is a report of a study to establish the inter-rater reliability of advanced practice nurse and neurologist neurological assessments which included ratings with the Unified Parkinson's Disease Rating Scale-Motor Exam. Around the world, advanced practice nurses are performing tasks once completed only by physicians. To promote consumer and provider confidence, it is important to establish that nurse and physician ratings using assessment tools are similar. In addition in research settings, when different raters are used, establishment of inter-rater reliability for study assessments is needed. Advanced practice nurses and neurologists independently recorded findings on neurological examinations of 46 participants in a study conducted between August 2007 and January 2008. An intraclass correlation coefficient was calculated to estimate overall agreement between the nurse and neurologist ratings. Agreement for individual items measured on a dichotomous scale was assessed by calculating Cohen's kappa. There was substantial agreement between advanced practice nurses and neurologists on the mean Unified Parkinson's Disease Rating Scale-Motor Exam ratings (intraclass correlation coefficient = 0.65) and the U.S. National Alzheimer's Coordinating Center Uniform Data Set neurological examination ratings of unremarkable findings (kappa = 0.74) and of gait disorder (kappa = 0.73). Moderate agreement (kappa = 0.53) was reached for the rating of whether all Unified Parkinson's Disease Rating Scale-Motor Exam items were normal. These findings are consistent with studies of the inter-rater agreement of the Unified Parkinson's Disease Rating Scale-Motor Exam and support the conduct of neurological assessments by advanced practice nurses.
Validating the Danish adaptation of the World Health Organization's International Classification for Patient Safety classification of patient safety incident types

PubMed Central

Mikkelsen, Kim Lyngby; Thommesen, Jacob; Andersen, Henning Boje

2013-01-01

Objectives Validation of a Danish patient safety incident classification adapted from the World Health Organization's International Classification for Patient Safety (ICPS-WHO). Design Thirty-three hospital safety management experts classified 58 safety incident cases selected to represent all types and subtypes of the Danish adaptation of the ICPS (ICPS-DK). Outcome Measures Two measures of inter-rater agreement: kappa and intra-class correlation (ICC). Results The average number of incident types used per case per rater was 2.5. The mean ICC was 0.521 (range: 0.199–0.809) and the mean kappa was 0.513 (range: 0.193–0.804). Kappa and ICC showed high correlation (r = 0.99). An inverse correlation was found between the prevalence of type and inter-rater reliability. Results are discussed according to four factors known to determine the inter-rater agreement: skill and motivation of raters; clarity of case descriptions; clarity of the operational definitions of the types and the instructions guiding the coding process; adequacy of the underlying classification scheme. Conclusions The incident types of the ICPS-DK are adequate, exhaustive and well suited for classifying and structuring incident reports. With a mean kappa a little above 0.5 the inter-rater agreement of the classification system is considered ‘fair’ to ‘good’. The wide variation in the inter-rater reliability and low reliability and poor discrimination among the highly prevalent incident types suggest that for these types, precisely defined incident sub-types may be preferred. This evaluation of the reliability and usability of WHO's ICPS should be useful for healthcare administrations that consider or are in the process of adapting the ICPS. PMID:23287641
A Comparison of Rubrics and Graded Category Rating Scales with Various Methods Regarding Raters' Reliability

ERIC Educational Resources Information Center

Dogan, C. Deha; Uluman, Müge

2017-01-01

The aim of this study was to determine the extent to which graded-category rating scales and rubrics contribute to inter-rater reliability. The research was designed as a correlational study. The study group consisted of 82 students attending sixth grade and three writing course teachers in a private elementary school. A performance task was…
Reliability of the School Food Checklist for in-school audits and photograph analysis of children's packed lunches.

PubMed

Mitchell, S A; Miles, C L; Brennan, L; Matthews, J

2010-02-01

Assessment of children's diets is problematic, typically relying on error-prone parent or child recall or reporting, or resource-intensive direct observation. The School Food Checklist (SFC) is an objective instrument comprising 20 food and beverage categories designed to measure the foods contained in children's packed lunches. The present study aimed to assess intra-rater and inter-rater reliability of each of the food and beverage categories of the SFC for both in-school audits and photograph analysis of children's school lunches. Participants comprised 176 children aged 5-8 years from five primary schools in Northern Metropolitan Melbourne. The SFC was used to measure the foods contained in children's packed lunches in the school setting and using photographs. Photograph analysis was conducted by the auditors 2-3 months after completion of in-school audits. Both intra-rater [intra-class correlation coefficient (ICC) = 0.78-1] and inter-rater (ICC = 0.50-0.95) reliability analysis indicated strong agreement for in-school auditing. With the exception of the food category titled 'leftovers', there was strong intra-rater reliability for auditors' live audits and their analysis of photographs [ICC = 0.57-0.98 (Auditor 1); ICC = 0.72-0.90 (Auditor 2)], and strong inter-rater reliability for photograph analysis (ICC = 0.68-0.92). The SFC is a reliable method of measuring the foods and beverages contained in children's packed lunches when used in the school setting or for photograph analysis. This finding has broad implications, particularly for the use of photograph analysis, because this approach offers a convenient and cost-effective method of measuring what food and beverages children bring to school in home-packed lunches.
A comparison of the reliability of make versus break testing in measuring palmar abduction strength of the thumb.

PubMed

Lim, J X; Toh, R X; Chook, S K H; Sebastin, S J; Karjalainen, T

2014-06-01

Previous studies have established the role of quantitative measurements of palmar abduction strength of the thumb (PAST). This study compares the reliability of the 'make' versus the 'break' test in measuring PAST in healthy volunteers. In a 'make' test, the body part being tested is positioned at the start of its range of motion and the participant is asked to exert his/her maximal force. In a 'break' test, increasing force is applied to a body part after it has completed its range of motion, until the joint being tested gives way. PAST was measured in both hands in 100 healthy volunteers using a handheld device. Two examiners measured PAST using both the 'make' and 'break' test to determine inter-rater reliability. The tests were repeated in 30 volunteers 6 weeks after the initial testing to determine intra-rater reliability. Our results showed that the 'make' test has better inter and intra-rater reliability.
Inter-rater Reliability of Three Musculoskeletal Physical examination Techniques Used to Assess Motion in Three Planes While Standing

PubMed Central

Prather, Heidi; Hunt, Devyani; Steger-May, Karen; Hayes, Marcie Harris; Knaus, Evan; Clohisy, John

2012-01-01

Objective The objective of the study was to measure the reliability between examiners of three basic maneuvers of the Total Body Functional Profile© physical examination test. The hypothesis was musculoskeletal health care providers of different disciplines could reliably use the three basic maneuvers as part of the musculoskeletal physical examination. Design A prospective observational study was conducted. Twenty-eight adult volunteers were measured on both the left and right side by two independent raters on a single occasion. Setting The subjects were recruited through advertisements placed by the orthopedic department at a tertiary university. Participants 28 volunteers were recruited and completed the study. The volunteers were between the ages of 18 and 51 years of age, had no symptoms in the lower extremity or spine, had no previous history of surgery or tumor involving the lower extremity, and no medical conditions that would preclude participation. Assessment On a single occasion, two examiners per one volunteer were blinded to their own and each others' measurements. Each examiner assessed the distance of frontal and sagittal plane lunge and angle of motion for transverse plane testing. Main Outcome Measurements Inter-rater agreement is expressed with intraclass correlation coefficients (ICCs) and corresponding 95% confidence intervals (CIs). The difference between raters is reported with 95% CIs. Baseline demographics, UCLA, and Harris hip questionnaires were completed by all participants. Results The UCLA and Harris hip scores showed no significant activity restrictions or pain limitations in all participants. The inter-rater reliability for sagittal, frontal, and transverse plane matrix testing was good with ICCs of 0.86 (95% CI 0.77, 0.91), 0.90 (95% CI 0.84, 0.94), and 0.85 (95% CI 0.75, 0.91) respectively. 
The rater reliability between disciplines for transverse, sagittal and frontal plane matrix testing was good with ICCs of 0.89 (95% CI 0.80, 0.94), 0.88 (95% CI 0.79, 0.94), 0.90 (95% CI 0.81, 0.95). Conclusion The inter-rater reliability for three basic maneuvers of the Total Body Functional Profile© is good amongst musculoskeletal healthcare providers of different disciplines. These three maneuvers may be used consistently as part of the musculoskeletal physical examination. PMID:19627956
Inter-rater reliability of three musculoskeletal physical examination techniques used to assess motion in three planes while standing.

PubMed

Prather, Heidi; Hunt, Devyani; Steger-May, Karen; Hayes, Marcie Harris; Knaus, Evan; Clohisy, John

2009-07-01

The objective of the study was to measure the reliability between examiners of 3 basic maneuvers of the Total Body Functional Profile physical examination test. The hypothesis was musculoskeletal health care providers of different disciplines could reliably use the 3 basic maneuvers as part of the musculoskeletal physical examination. A prospective observational study was conducted. Twenty-eight adult volunteers were measured on both the left and right side by 2 independent raters on a single occasion. The subjects were recruited through advertisements placed by the orthopedic department at a tertiary university. Twenty-eight volunteers were recruited and completed the study. The volunteers were between the ages of 18 and 51 years of age, had no symptoms in the lower extremity or spine, had no previous history of surgery or tumor involving the lower extremity, and no medical conditions that would preclude participation. On a single occasion, 2 examiners per 1 volunteer were blinded to their own and each others' measurements. Each examiner assessed the distance of frontal and sagittal plane lunge and angle of motion for transverse plane testing. Inter-rater agreement is expressed with intraclass correlation coefficients (ICCs) and corresponding 95% confidence intervals (CIs). The difference between raters is reported with 95% CIs. Baseline demographics, University of California Los Angeles (UCLA), and Harris hip questionnaires were completed by all participants. The UCLA and Harris hip scores showed no significant activity restrictions or pain limitations in all participants. The inter-rater reliability for sagittal, frontal, and transverse plane matrix testing was good with ICCs of 0.86 (95% CI 0.77-0.91), 0.90 (95% CI 0.84-0.94), and 0.85 (95% CI 0.75-0.91), respectively. 
The rater reliability between disciplines for transverse, sagittal, and frontal plane matrix testing was good with ICCs of 0.89 (95% CI 0.80-0.94), 0.88 (95% CI 0.79-0.94), and 0.90 (95% CI 0.81-0.95), respectively. The inter-rater reliability for 3 basic maneuvers of the Total Body Functional Profile is good among musculoskeletal health care providers of different disciplines. These 3 maneuvers may be used consistently as part of the musculoskeletal physical examination.
Rater agreement reliability of the dial test in the ACL-deficient knee.

PubMed

Slichter, Malou E; Wolterbeek, Nienke; Auw Yang, K Gie; Zijl, Jacco A C; Piscaer, Tom M

2018-06-14

Posterolateral rotatory instability (PLRI) of the knee can easily be missed, because attention is paid to injury of the cruciate ligaments. If left untreated this clinical instability may persist after reconstruction of the cruciate ligaments and may put the graft at risk of failure. Even though the dial test is widely used to diagnose PLRI, no validity and reliability studies of the manual dial test have yet been performed in patients. This study focuses on the reliability of the manual dial test by determining the rater agreement. Two independent examiners performed the dial test in the knees of 52 patients after knee distortion with suspected ACL rupture. The dial test was performed in prone position in 30°, 60° and 90° of flexion of the knees. A side-to-side difference of ≥10° was considered a positive dial test. For quantification of the amount of rotation in degrees, a measuring device was used with a standardized 6 Nm force, using a digital torque adapter on a booth. The intra-rater, inter-rater and rater-device agreement were determined by calculating kappa (κ) for the dial test. A positive dial test was found in 21.2% and 18.0% of the patients as assessed by a blinded examiner and orthopaedic surgeon respectively. Fair inter-rater agreement was found in 30° of flexion, κ F = 0.29 (95% CI: 0.01 to 0.56), p = 0.044 and 90° of flexion, κ F = 0.38 (95% CI: 0.10 to 0.66), p = 0.007. Almost perfect rater-device agreement was found in 30° of flexion, κ C = 0.84 (95% CI: 0.52 to 1.15), p < 0.001. Moderate rater-device agreement was found in 30° and 90° combined, κ C = 0.50 (95% CI: 0.13 to 0.86), p = 0.008. No significant intra-rater agreement was found. Rater agreement reliability of the manual dial test is questionable. It has a fair inter-rater agreement in 30° and 90° of flexion.
Brief Report: Interrater Reliability of Clinical Diagnosis and DSM-IV Criteria for Autistic Disorder: Results of the DSM-IV Autism Field Trial.

ERIC Educational Resources Information Center

Klin, Ami; Lang, Jason; Cicchetti, Domenic V.; Volkmar, Fred R.

2000-01-01

This study examined the inter-rater reliability of clinician-assigned diagnosis of autism using or not using the criteria specified in the Diagnostic and Statistical Manual IV (DSM-IV). For experienced raters there was little difference in reliability in the two conditions. However, a clinically significant improvement in diagnostic reliability…

Feasibility of a Semi-computerized Line Bisection Test for Unilateral Visual Neglect Assessment.

PubMed

Jee, H; Kim, J; Kim, C; Kim, T; Park, J

2015-01-01

Commonly used paper-and-pencil based test modalities for assessing the degree of unilateral visual neglect (ULN) in patients with hemispheric cerebral lesions consume human resources and show significant inter- and intra-rater variability. To explore the feasibility of a semi-computerized electronic-pen based ULN assessment system (e-system) to improve assessment quality without altering the conventional user interface. Thirty cognitively healthy participants (HG) and 11 participants diagnosed with right-hemispheric lesion and unilateral visual neglect (NG) were recruited to evaluate the e-system. Line bisection tests (LBT) were conducted twice for the inter-rater and intra-rater (reliability) comparisons. The LBT results were assessed by the e-system and the gold standard methods (manual rater assessment). The percent deviation (%), assessment duration (sec), and number of neglected lines (each) were evaluated. The inter-rater comparisons of the assessed deviation (%) variable showed excellent inter-rater reliabilities (CCCs) ranging from .84 (.59 to .95; p < .001) to .99 (.90 to .99; p < .001) for HG and NG. The Bland Altman mean difference (B-A) plots with bias (95% LOA (limits of agreement)) showed similar agreements between the e-system and the raters ranging from -.04 % (-2.10 to 1.97) to 1.30 % (-2.23 to 4.84) for HG and NG. The effect sizes (ES), which show similarities between the assessment methods, yielded smaller ranges from .01 to .30 for HG and NG. The reliability (test-retest) comparisons showed similar assessment results between the e-system, rater 1, and rater 2. The manual rater assessment time, which ranged from 5.85 to 6.00 minutes, and the inter- and intra-assessment variations were virtually eliminated with the e-system. The semi-computerized system with the conventional paper-and-pencil user interface showed valid and reliable assessment results.
It may be a feasible replacement for the manual rater assessment modality even in a clinical setting.
Reliability and validity of the de Morton Mobility Index in individuals with sub-acute stroke.

PubMed

Braun, Tobias; Marks, Detlef; Thiel, Christian; Grüneberg, Christian

2018-02-04

To establish the validity and reliability of the de Morton Mobility Index (DEMMI) in patients with sub-acute stroke. This cross-sectional study was performed in a neurological rehabilitation hospital. We assessed unidimensionality, construct validity, internal consistency reliability, inter-rater reliability, minimal detectable change and possible floor and ceiling effects of the DEMMI in adult patients with sub-acute stroke. The study included a total sample of 121 patients with sub-acute stroke. We analysed validity (n = 109) and reliability (n = 51) in two sub-samples. Rasch analysis indicated unidimensionality with an overall fit to the model (chi-square = 12.37, p = 0.577). All hypotheses on construct validity were confirmed. Internal consistency reliability (Cronbach's alpha = 0.94) and inter-rater reliability (intraclass correlation coefficient = 0.95; 95% confidence interval: 0.92-0.97) were excellent. The minimal detectable change with 90% confidence was 13 points. No floor or ceiling effects were evident. These results indicate unidimensionality, sufficient internal consistency reliability, inter-rater reliability, and construct validity of the DEMMI in patients with a sub-acute stroke. Advantages of the DEMMI in clinical application are the short administration time, no need for special equipment and interval level data. The de Morton Mobility Index, therefore, may be a useful performance-based bedside test to measure mobility in individuals with a sub-acute stroke across the whole mobility spectrum. Implications for Rehabilitation The de Morton Mobility Index (DEMMI) is an unidimensional measurement instrument of mobility in individuals with sub-acute stroke. The DEMMI has excellent internal consistency and inter-rater reliability, and sufficient construct validity. The minimal detectable change of the DEMMI with 90% confidence in stroke rehabilitation is 13 points. 
The lack of any floor or ceiling effects on hospital admission indicates applicability across the whole mobility spectrum of patients with sub-acute stroke.
Greater understanding of normal hip physical function may guide clinicians in providing targeted rehabilitation programmes.

PubMed

Kemp, Joanne L; Schache, Anthony G; Makdissi, Michael; Sims, Kevin J; Crossley, Kay M

2013-07-01

This study investigated tests of hip muscle strength and functional performance. The specific objectives were to: (i) establish intra- and inter-rater reliability; (ii) compare differences between dominant and non-dominant limbs; (iii) compare agonist and antagonist muscle strength ratios; (iv) compare differences between genders; and (v) examine relationships between hip muscle strength, baseline measures and functional performance. Reliability study and cross-sectional analysis of hip strength and functional performance. In healthy adults aged 18-50years, normalised hip muscle peak torque and functional performance were evaluated to: (i) establish intra-rater and inter-rater reliability; (ii) analyse differences between limbs, between antagonistic muscle groups and genders; and (iii) associations between strength and functional performance. Excellent reliability (intra-rater ICC=0.77-0.96; inter-rater ICC=0.82-0.95) was observed. No difference existed between dominant and non-dominant limbs. Differences in strength existed between antagonistic pairs of muscles: hip abduction was greater than adduction (p<0.001) and hip ER was greater than IR (p<0.001). Men had greater ER strength (p=0.006) and hop for distance (p<0.001) than women. Strong associations were observed between measures of hip muscle strength (except hip flexion) and age, height, and functional performance. Deficits in hip muscle strength or functional performance may influence hip pain. In order to provide targeted rehabilitation programmes to address patient-specific impairments, and determine when individuals are ready to return to physical activity, clinicians are increasingly utilising tests of hip strength and functional performance. This study provides a battery of reliable, clinically applicable tests which can be used for these purposes. Copyright © 2012 Sports Medicine Australia. Published by Elsevier Ltd. All rights reserved.
Norwegian version of the rating anxiety in dementia scale (RAID-N): a validity and reliability study.

PubMed

Goyal, Alka R; Bergh, Sverre; Engedal, Knut; Kirkevold, Marit; Kirkevold, Øyvind

2017-12-01

Dementia-specific anxiety scales in the Norwegian language are lacking; the aim of this study was to investigate the validity and inter-rater reliability of a Norwegian version of the Rating Anxiety in Dementia (RAID-N) scale. The validity of the RAID-N was tested in a sample of 101 patients with dementia from seven Norwegian nursing homes. One psychogeriatrician (n = 50) or a physician with long experience with nursing home patients (n = 51) 'blind' to the RAID-N score diagnosed anxiety according to DSM-5 criteria of generalised anxiety disorder (GAD). A receiver operating characteristic (ROC) analysis assessed the best cut-off point for the RAID-N, and the area under the curve (AUC) was calculated. Inter-rater reliability was tested in a subgroup of 53 patients by intraclass correlation (ICC) and Cohen's kappa. Twenty-eight of 101 (27.7%) met the GAD criteria. The mean RAID-N score for patients with GAD was 16.1 (SD 6.3) and without GAD, 8.8 (SD 6.5) (p < 0.001). A cut-off score of ≥12 on the RAID-N gave a sensitivity of 82.1%, specificity of 70.0%, and 73.3% accuracy in identifying clinically significant GAD in patients with dementia. Inter-rater reliability on overall RAID-N items was good (ICC = 0.82), Cohen's kappa was 0.58 for total RAID-N score, with satisfactory internal consistency (Cronbach's alpha = 0.81). The RAID-N has fairly good validity and inter-rater reliability, and could be useful to assess GAD in patients with dementia. Further studies should investigate the optimal RAID-N cut-off score in different settings.
Reliability of the hand held dynamometer in measuring muscle strength in people with interstitial lung disease.

PubMed

Dowman, Leona; McDonald, Christine F; Hill, Catherine J; Lee, Annemarie; Barker, Kathryn; Boote, Claire; Glaspole, Ian; Goh, Nicole; Southcott, Annemarie; Burge, Angela; Ndongo, Rebecca; Martin, Alicia; Holland, Anne E

2016-09-01

To evaluate the inter-rater and intra-rater reliability of the hand held dynamometer in measuring muscle strength in people with interstitial lung disease (ILD). Test retest reliability of hand-held dynamometry for elbow flexor and knee extensor strength between two independent raters and two testing sessions. Physiotherapy department within a tertiary hospital. Thirty participants with ILD of varying aetiology were included. Twenty participants completed the inter-rater reliability protocol (10 idiopathic pulmonary fibrosis, mean (SD) age 73 (10) years, 11 male) and 21 participants completed the intra-rater reliability protocol (10 idiopathic pulmonary fibrosis, mean age 71 (10) years, 11 male). Mean muscle strength (kg). Agreement between the two raters and testing sessions was analyzed using Bland-Altman plots and reliability was estimated using intraclass correlation coefficients (ICC). For elbow flexor strength there was a mean difference between raters of -0.6kg (limits of agreement (LOA) -5.6 to 4.4kg) and within raters of -0.3kg (LOA -2.8 to 2.3kg). The ICCs were 0.95 and 0.98, respectively. For knee extensor strength there was a mean difference between raters of -1.5kg (LOA -6.9 to 3.9kg) and within raters of -0.7kg (LOA -3.9 to 2.4kg). The ICCs were 0.95 and 0.97, respectively. Hand-held dynamometry is reliable in measuring elbow flexor and knee extensor strength in people with ILD. Copyright © 2015 Chartered Society of Physiotherapy. Published by Elsevier Ltd. All rights reserved.
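The Bland-Altman figures above (mean difference with limits of agreement) follow a simple recipe: take the paired differences, then report their mean (the bias) and the bias ± 1.96 standard deviations. The sketch below uses hypothetical strength readings in kg (not study data) and also back-calculates the SD of differences implied by a reported pair of limits, assuming the conventional 1.96 multiplier was used.

```python
import statistics


def bland_altman(first, second):
    """Return (bias, lower LOA, upper LOA) for paired measurements."""
    diffs = [x - y for x, y in zip(first, second)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)          # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def implied_sd(lower, upper):
    """SD of differences implied by reported limits (assumes bias +/- 1.96 SD)."""
    return (upper - lower) / (2 * 1.96)


# Hypothetical readings (kg) from two raters on five participants.
rater_a = [20.0, 25.0, 30.0, 22.0, 28.0]
rater_b = [21.0, 24.0, 32.0, 22.0, 29.0]
bias, lower, upper = bland_altman(rater_a, rater_b)
print(f"bias={bias:.2f} kg, LOA {lower:.2f} to {upper:.2f} kg")

# Reported inter-rater limits for elbow flexor strength: -5.6 to 4.4 kg.
print(f"implied SD of differences: {implied_sd(-5.6, 4.4):.2f} kg")
```

Under that assumption, the reported inter-rater limits for elbow flexion imply an SD of differences of roughly 2.6 kg, which gives a sense of scale that the ICC of 0.95 alone does not convey.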
Ultrasonographic measurements of lower trapezius muscle thickness at rest and during isometric contraction: a reliability study.

PubMed

Talbott, Nancy R; Witt, Dexter W

2014-07-01

The purpose of this study was to determine the intra-rater reliability and inter-rater reliability of ultrasound imaging (USI) thickness measurements of the lower trapezius (LT) at rest and during active contractions when the transverse process and the lamina were used as reference sites for the measurement process. Twenty healthy individuals between the ages of 22 and 32 years volunteered. With the subject prone and the shoulder in 145° of abduction, images of the LT were taken bilaterally by one examiner as the subject: (1) rested; (2) actively held the test position; and (3) actively held the test position while holding a weight. Ten subjects returned and testing was repeated by the same examiner and by a second examiner. LT thickness measurements were recorded at the level of the transverse process and at the level of the lamina. Intra-class correlation coefficients (ICC) for within session intra-rater reliability (ICC3,3) ranged from 0.951 to 0.986 for both measurement sites while between session intra-rater reliability (ICC3,2) ranged from 0.935 to 0.962. Within session inter-rater reliability (ICC2,2) ranged from 0.934 to 0.973. USI can be used to reliably measure LT thickness at rest, during active contraction and during active contraction when holding a weight. The described protocol can be utilized during shoulder examinations to provide an additional assessment tool for monitoring changes in LT thickness.
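The ICC(2,k) and ICC(3,k) notation above follows the Shrout and Fleiss convention, in which the coefficients are built from the mean squares of a two-way subjects-by-raters ANOVA. The following is a minimal pure-Python sketch of the single-measure and average-measure forms; the function name and the ratings matrix are illustrative assumptions, not data from the study.

```python
def icc_two_way(ratings):
    """ratings: list of subjects, each a list of k scores (one per rater)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return {
        "ICC(2,1)": (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n),
        "ICC(2,k)": (msr - mse) / (msr + (msc - mse) / n),
        "ICC(3,1)": (msr - mse) / (msr + (k - 1) * mse),
        "ICC(3,k)": (msr - mse) / msr,
    }


# Illustrative ratings: 6 subjects scored by 4 raters.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
result = icc_two_way(ratings)
for name, value in result.items():
    print(f"{name} = {value:.2f}")
```

ICC(2,·) treats raters as a random sample and penalises systematic differences between them, whereas ICC(3,·) treats the raters as fixed and measures consistency only, which is why the two can differ sharply on the same data.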
Ultrasonographic measurement of the acromiohumeral distance in spinal cord injury: Reliability and effects of shoulder positioning

PubMed Central

Lin, Yen-Sheng; Boninger, Michael L.; Day, Kevin A.

2015-01-01

Objective To investigate the reliability of ultrasonographic measurement of acromiohumeral distance (AHD) and the effects of shoulder positioning on AHD among manual wheelchair users (MWUs) with spinal cord injury (SCI) and an able-bodied control group. Methods Ten MWUs with SCI and 10 able-bodied subjects participated in this study. The ultrasonographic measurements of AHD from each subject were obtained by two raters during passive and active scapular plane arm elevation in neutral, 45°, 90° with and without resistance and in a weight relief raise position. The measurements were recorded again by each rater using the same procedures after a 30-minute time interval. All raters were blinded to each other's measurements. Setting University Laboratories and Veteran Affairs Healthcare System. Results Intra-rater (intraclass correlation coefficient, ICC > 0.83) and inter-rater (ICC > 0.78) reliability was excellent for both the MWUs with SCI and able-bodied groups across all arm positions except for the 45° position in the control group for one of the raters (intra-rater: ICC < 0.40 and inter-rater: ICC < 0.60). AHD significantly reduced when the shoulder was in the 90° arm elevated positions with or without resistance. Conclusion Findings from our study demonstrated that ultrasonography is a reliable means to evaluate AHD in both able bodied and individuals with SCI, who are known to have significant shoulder pathology. This technique could be used to develop reference measures and to identify changes in AHD caused by interventions. PMID:24968117
Reliable and fast volumetry of the lumbar spinal cord using cord image analyser (Cordial).

PubMed

Tsagkas, Charidimos; Altermatt, Anna; Bonati, Ulrike; Pezold, Simon; Reinhard, Julia; Amann, Michael; Cattin, Philippe; Wuerfel, Jens; Fischer, Dirk; Parmar, Katrin; Fischmann, Arne

2018-04-30

To validate the precision and accuracy of the semi-automated cord image analyser (Cordial) for lumbar spinal cord (SC) volumetry in 3D T1w MRI data of healthy controls (HC). 40 3D T1w images of 10 HC (w/m: 6/4; age range: 18-41 years) were acquired at one 3T-scanner in two MRI sessions (time interval 14.9±6.1 days). Each subject was scanned twice per session, allowing determination of test-retest reliability both in back-to-back (intra-session) and scan-rescan images (inter-session). Cordial was applied for lumbar cord segmentation twice per image by two raters, allowing for assessment of intra- and inter-rater reliability, and compared to a manual gold standard. While manually segmented volumes were larger (mean: 2028±245 mm³ vs. Cordial: 1636±300 mm³, p<0.001), accuracy assessments between manually and semi-automatically segmented images showed a mean Dice-coefficient of 0.88±0.05. Calculation of within-subject coefficients of variation (COV) demonstrated high intra-session (1.22-1.86%), inter-session (1.26-1.84%), as well as intra-rater (1.73-1.83%) reproducibility. No significant difference was shown between intra- and inter-session reproducibility or between intra-rater reliabilities. Although inter-rater reproducibility (COV: 2.87%) was slightly lower compared to all other reproducibility measures, between rater consistency was very strong (intraclass correlation coefficient: 0.974). While under-estimating the lumbar SCV, Cordial still provides excellent inter- and intra-session reproducibility showing high potential for application in longitudinal trials. • Lumbar spinal cord segmentation using the semi-automated cord image analyser (Cordial) is feasible. • Lumbar spinal cord is 40-mm cord segment 60 mm above conus medullaris. • Cordial provides excellent inter- and intra-session reproducibility in lumbar spinal cord region. • Cordial shows high potential for application in longitudinal trials.
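The Dice coefficient reported above measures the overlap of two segmentations as twice the shared volume divided by the sum of the two volumes. A minimal sketch on flattened binary masks follows; the function name and the tiny masks are hypothetical, and real segmentations would be 3D voxel arrays.

```python
def dice(mask_a, mask_b):
    """Dice overlap of two equal-length binary masks (1 = voxel in segment)."""
    size_a = sum(1 for v in mask_a if v)
    size_b = sum(1 for v in mask_b if v)
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    if size_a + size_b == 0:
        return 1.0                          # two empty masks agree trivially
    return 2 * overlap / (size_a + size_b)


# Hypothetical flattened masks: manual vs. semi-automated segmentation.
manual = [1, 1, 1, 1, 0, 0, 0, 0]
semi_auto = [0, 1, 1, 1, 0, 0, 0, 0]
print(f"Dice = {dice(manual, semi_auto):.3f}")
```

Because Dice depends on overlap relative to combined size, a method can systematically under-segment, as the abstract reports, and still score highly when the smaller segment lies almost entirely inside the larger one.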
Geometric classification of scalp hair for valid drug testing, 6 more reliable than 8 hair curl groups.

PubMed

Mkentane, K; Van Wyk, J C; Sishi, N; Gumedze, F; Ngoepe, M; Davids, L M; Khumalo, N P

2017-01-01

Curly hair is reported to contain higher lipid content than straight hair, which may influence incorporation of lipid soluble drugs. The use of race to describe hair curl variation (Asian, Caucasian and African) is unscientific yet common in medical literature (including reports of drug levels in hair). This study investigated the reliability of a geometric classification of hair (based on 3 measurements: the curve diameter, curl index and number of waves). After ethical approval and informed consent, proximal virgin (6cm) hair sampled from the vertex of scalp in 48 healthy volunteers were evaluated. Three raters each scored hairs from 48 volunteers at two occasions each for the 8 and 6-group classifications. One rater applied the 6-group classification to 80 additional volunteers in order to further confirm the reliability of this system. The Kappa statistic was used to assess intra and inter rater agreement. Each rater classified 480 hairs on each occasion. No rater classified any volunteer's 10 hairs into the same group; the most frequently occurring group was used for analysis. The inter-rater agreement was poor for the 8-groups (k = 0.418) but improved for the 6-groups (k = 0.671). The intra-rater agreement also improved (k = 0.444 to 0.648 versus 0.599 to 0.836) for 6-groups; that for the one evaluator for all volunteers was good (k = 0.754). Although small, this is the first study to test the reliability of a geometric classification. The 6-group method is more reliable. However, a digital classification system is likely to reduce operator error. A reliable objective classification of human hair curl is long overdue, particularly with the increasing use of hair as a testing substrate for treatment compliance in Medicine.
Unified Parkinson’s Disease Rating Scale-Motor Exam: Inter-rater reliability of advanced practice nurse and neurologist assessments

PubMed Central

Palmer, Janice L.; Coats, Mary A.; Roe, Catherine M.; Hanko, Shelly M.; Xiong, Chengjie; Morris, John C.

2010-01-01

Aim This paper is a report of a study to establish the inter-rater reliability of advanced practice nurse and neurologist neurological assessments which included ratings with the Unified Parkinson’s Disease Rating Scale-Motor Exam. Background Around the world, advanced practice nurses are performing tasks once completed by only physicians. To promote consumer and provider confidence, it is important to establish that nurse and physician ratings using assessment tools are similar. In addition in research settings, when different raters are used, establishment of inter-rater reliability for study assessments is needed. Method Advanced practice nurses and neurologists independently recorded findings on neurological examinations of 46 participants in a study conducted between August 2007 and January 2008. An intraclass correlation coefficient was calculated to estimate overall agreement between the nurse and neurologist ratings. Agreement for individual items measured on a dichotomous scale was assessed by calculating Cohen’s kappa. Results There was substantial agreement between advanced practice nurses and neurologists on the mean Unified Parkinson’s Disease Rating Scale-Motor Exam ratings (intraclass correlation coefficient = 0.65) and the U.S. National Alzheimer’s Coordinating Center Uniform Data Set neurological examination ratings of unremarkable findings (kappa = 0.74) and of gait disorder (kappa = 0.73). Moderate agreement (kappa = 0.53) was reached for the rating of whether all Unified Parkinson’s Disease Rating Scale-Motor Exam items were normal. Conclusion These findings are consistent with studies of the inter-rater agreement of the Unified Parkinson’s Disease Rating Scale-Motor Exam and support the conduct of neurological assessments by advanced practice nurses. PMID:20546368
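Cohen's kappa, used above for the dichotomous items, corrects the observed proportion of agreement for the agreement expected by chance from each rater's marginal frequencies. The sketch below illustrates this; the function name and the two rating vectors (1 = abnormal, 0 = normal) are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_1, rater_2):
    """Unweighted Cohen's kappa for two raters scoring the same subjects."""
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    c1, c2 = Counter(rater_1), Counter(rater_2)
    expected = sum(c1[c] * c2[c] for c in set(c1) | set(c2)) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of 10 participants by a nurse and a neurologist.
nurse = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
neurologist = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(nurse, neurologist)
print(f"kappa = {kappa:.2f}")
```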
Assessment of Lower Limb Muscle Strength and Power Using Hand-Held and Fixed Dynamometry: A Reliability and Validity Study

PubMed Central

Perraton, Luke G.; Bower, Kelly J.; Adair, Brooke; Pua, Yong-Hao; Williams, Gavin P.; McGaw, Rebekah

2015-01-01

Introduction Hand-held dynamometry (HHD) has never previously been used to examine isometric muscle power. Rate of force development (RFD) is often used for muscle power assessment, however no consensus currently exists on the most appropriate method of calculation. The aim of this study was to examine the reliability of different algorithms for RFD calculation and to examine the intra-rater, inter-rater, and inter-device reliability of HHD as well as the concurrent validity of HHD for the assessment of isometric lower limb muscle strength and power. Methods 30 healthy young adults (age: 23±5yrs, male: 15) were assessed on two sessions. Isometric muscle strength and power were measured using peak force and RFD respectively using two HHDs (Lafayette Model-01165 and Hoggan microFET2) and a criterion-reference KinCom dynamometer. Statistical analysis of reliability and validity comprised intraclass correlation coefficients (ICC), Pearson correlations, concordance correlations, standard error of measurement, and minimal detectable change. Results Comparison of RFD methods revealed that a peak 200ms moving window algorithm provided optimal reliability results. Intra-rater, inter-rater, and inter-device reliability analysis of peak force and RFD revealed mostly good to excellent reliability (coefficients ≥ 0.70) for all muscle groups. Concurrent validity analysis showed moderate to excellent relationships between HHD and fixed dynamometry for the hip and knee (ICCs ≥ 0.70) for both peak force and RFD, with mostly poor to good results shown for the ankle muscles (ICCs = 0.31–0.79). Conclusions Hand-held dynamometry has good to excellent reliability and validity for most measures of isometric lower limb strength and power in a healthy population, particularly for proximal muscle groups. To aid implementation we have created freely available software to extract these variables from data stored on the Lafayette device. 
Future research should examine the reliability and validity of these variables in clinical populations. PMID:26509265
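One plausible reading of the "peak 200 ms moving window" approach mentioned above is to slide a 200 ms window along the force trace and keep the steepest average slope. The sketch below implements that reading only; the function name, sampling rate, units, and synthetic trace are assumptions, and the authors' software may differ in detail.

```python
def peak_rfd(force, sample_rate_hz, window_s=0.2):
    """Largest average rate of force development over any window of window_s."""
    w = round(window_s * sample_rate_hz)     # window length in samples
    if w < 1 or len(force) <= w:
        raise ValueError("force trace is shorter than the window")
    duration = w / sample_rate_hz            # actual window duration in seconds
    return max((force[i + w] - force[i]) / duration
               for i in range(len(force) - w))


# Synthetic trace (assumption): 100 Hz sampling, linear ramp to 100 N in 0.5 s,
# then a plateau.
trace = [min(100, 2 * i) for i in range(100)]
print(f"peak RFD = {peak_rfd(trace, sample_rate_hz=100):.0f} N/s")
```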
Development of the Therapist Empathy Scale.

PubMed

Decker, Suzanne E; Nich, Charla; Carroll, Kathleen M; Martino, Steve

2014-05-01

Few measures exist to examine therapist empathy as it occurs in session. A 9-item observer rating scale, called the Therapist Empathy Scale (TES), was developed based on Watson's (1999) work to assess affective, cognitive, attitudinal, and attunement aspects of therapist empathy. The aim of this study was to evaluate the inter-rater reliability, internal consistency, and construct and criterion validity of the TES. Raters evaluated therapist empathy in 315 client sessions conducted by 91 therapists, using data from a multi-site therapist training trial (Martino et al., 2010) in Motivational Interviewing (MI). Inter-rater reliability (ICC = .87 to .91) and internal consistency (Cronbach's alpha = .94) were high. Confirmatory factor analyses indicated some support for single-factor fit. Convergent validity was supported by correlations between TES scores and MI fundamental adherence (r range .50 to .67) and competence scores (r range .56 to .69). Discriminant validity was indicated by negative or nonsignificant correlations between TES and MI-inconsistent behavior (r range .05 to -.33). The TES demonstrates excellent inter-rater reliability and internal consistency. Results indicate some support for a single-factor solution and convergent and discriminant validity. Future studies should examine the use of the TES to evaluate therapist empathy in different psychotherapy approaches and to determine the impact of therapist empathy on client outcome.
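Cronbach's alpha, the internal-consistency statistic quoted above, compares the sum of the individual item variances with the variance of the total score. A minimal sketch follows; the function name and the three-item scores are hypothetical and unrelated to the TES data.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: list of k lists, each one item's scores across the same subjects."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]    # total score per subject
    item_variance = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


# Hypothetical scores for 3 items rated on 5 sessions.
items = [
    [3, 4, 2, 5, 4],
    [2, 4, 3, 5, 3],
    [3, 5, 2, 4, 4],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha = {alpha:.2f}")
```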
General motor function assessment scale--reliability of a Norwegian version.

PubMed

Langhammer, Birgitta; Lindmark, Birgitta

2014-01-01

The General Motor Function assessment scale (GMF) measures activity-related dependence, pain and insecurity among older people in frail health. The aim of the present study was to translate the GMF into a Norwegian version (N-GMF) and establish its reliability and clinical feasibility. The procedure used in translating the GMF was a forward and backward process, testing a convenience sample of 30 frail elderly people with it. The intra-rater reliability tests were performed by three physiotherapists, and the inter-rater reliability test was done by the same three plus nine independent colleagues. The statistical analyses were performed with a pairwise analysis for intra- and inter-rater reliability, using Cronbach's α, Percentage Agreement (PA), Svensson's rank transformable method and Cohen's κ. The Cronbach's α coefficients for the different subscales of N-GMF were 0.68 for Dependency, 0.73 for Pain and 0.75 for Insecurity. Intra-rater reliability: The variation in the PA for the total score was 40-70% in Dependence, 30-40% in Pain and 30-60% in Insecurity. The Relative Rank Variant (RV) indicated a modest individual bias and an augmented rank-order agreement coefficient ra of 0.96, 0.96 and 0.99, respectively. The variation in the κ statistics was 0.27-0.62 for Dependence, 0.17-0.35 for Pain and 0.13-0.47 for Insecurity. Inter-rater reliability: The PA between different testers in Dependence, Pain and Insecurity was 74%, 89% and 74%, respectively. The augmented rank-order agreement coefficients were: for Dependence r(a) = 0.97; for Pain, r(a) = 0.99; and for Insecurity, r(a) = 0.99. The N-GMF is a fairly reliable instrument for use with frail elderly people, with intra-rater and inter-rater reliability moderate in Dependence and slight to fair in Pain and Insecurity. The clinical usefulness was stressed in regard to its main focus, the frail elderly, and for communication within a multidisciplinary team.
Implications for Rehabilitation The Norwegian-General Motor Function Assessment Scale (N-GMF) is a reliable instrument. The N-GMF is an instrument for screening and assessment of activity-related dependence, pain and insecurity in frail older people. The N-GMF may be used as a tool of communication in a multidisciplinary team.
Inter-Rater Reliability and Validity of the Australian Football League’s Kicking and Handball Tests

PubMed Central

Cripps, Ashley J.; Hopper, Luke S.; Joyce, Christopher

2015-01-01

Talent identification tests used at the Australian Football League’s National Draft Combine assess the capacities of athletes to compete at a professional level. Tests created for the National Draft Combine are also commonly used for talent identification and athlete development in development pathways. The skills tests created by the Australian Football League required players to either handball (striking the ball with the hand) or kick to a series of 6 randomly generated targets. Assessors subjectively rate each skill execution giving a 0-5 score for each disposal. This study aimed to investigate the inter-rater reliability and validity of the skills tests at an adolescent sub-elite level. Male Australian footballers were recruited from sub-elite adolescent teams (n = 121, age = 15.7 ± 0.3 years, height = 1.77 ± 0.07 m, mass = 69.17 ± 8.08 kg). The coaches (n = 7) of each team were also recruited. Inter-rater reliability was assessed using intraclass correlations (ICC) and Limits of Agreement statistics. Both the kicking (ICC = 0.96, p < .01) and handball tests (ICC = 0.89, p < .01) demonstrated strong reliability and acceptable levels of absolute agreement. Content validity was determined by examining the test scores’ sensitivity to laterality and distance. Concurrent validity was assessed by comparing coaches’ perceptions of skill to actual test outcomes. Multivariate analysis of variance (MANOVA) examined the main effect of laterality, with scores on the dominant hand (p = .04) and foot (p < .01) significantly higher compared to the non-dominant side. Follow-up univariate analysis reported significant differences at every distance in the kicking test. A poor correlation was found between coaches’ perceptions of skill and testing outcomes. The results of this study show that both skill tests demonstrate acceptable inter-rater reliability. Partial content validity was confirmed for the kicking test; however, further research is required to confirm validity of the handball test. Key points The skill tests created by the AFL demonstrated acceptable levels of relative and absolute inter-rater reliability. Both the AFL’s skills tests are able to differentiate between athletes’ dominant and non-dominant limbs. However, only the kicking test could consistently differentiate between score outcomes over a range of Australian Football specific disposal distances. Both tests demonstrated poor concurrent validity, with no correlation found between coaches’ perceptions of technical skills and actual skill outcomes measured. PMID:26336356
Inter-Rater Agreement of Pressure Ulcer Risk and Prevention Measures in the National Database of Nursing Quality Indicators(®) (NDNQI).

PubMed

Waugh, Shirley Moore; Bergquist-Beringer, Sandra

2016-06-01

In this descriptive multi-site study, we examined inter-rater agreement on 11 National Database of Nursing Quality Indicators(®) (NDNQI(®) ) pressure ulcer (PrU) risk and prevention measures. One hundred twenty raters at 36 hospitals captured data from 1,637 patient records. At each hospital, agreement between the most experienced rater and each other team rater was calculated for each measure. In the ratings studied, 528 patients were rated as "at risk" for PrU and, therefore, were included in calculations of agreement for the prevention measures. Prevalence-adjusted kappa (PAK) was used to interpret inter-rater agreement because prevalence of single responses was high. The PAK values for eight measures indicated "substantial" to "near perfect" agreement between most experienced and other team raters: Skin assessment on admission (.977, 95% CI [.966-.989]), PrU risk assessment on admission (.978, 95% CI [.964-.993]), Time since last risk assessment (.790, 95% CI [.729-.852]), Risk assessment method (.997, 95% CI [.991-1.0]), Risk status (.877, 95% CI [.838-.917]), Any prevention (.856, 95% CI [.76-.943]), Skin assessment (.956, 95% CI [.904-1.0]), and Pressure-redistribution surface use (.839, 95% CI [.763-.916]). For three intervention measures, PAK values fell below the recommended value of ≥.610: Routine repositioning (.577, 95% CI [.494-.661]), Nutritional support (.500, 95% CI [.418-.581]), and Moisture management (.556, 95% CI [.469-.643]). Areas of disagreement were identified. Findings provide support for the reliability of 8 of the 11 measures. Further clarification of data collection procedures is needed to improve reliability for the less reliable measures. © 2016 Wiley Periodicals, Inc.
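The abstract above does not give the formula for its prevalence-adjusted kappa. Assuming it corresponds to the commonly described prevalence-adjusted, bias-adjusted form, which replaces chance agreement with 1/k for k categories, the sketch below shows why such an adjustment matters when one response dominates. The function names and the 100 hypothetical record pairs are illustrative assumptions.

```python
from collections import Counter


def cohen_kappa(rater_1, rater_2):
    """Unweighted Cohen's kappa (chance agreement from the raters' marginals)."""
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    c1, c2 = Counter(rater_1), Counter(rater_2)
    expected = sum(c1[c] * c2[c] for c in set(c1) | set(c2)) / n ** 2
    return (observed - expected) / (1 - expected)


def prevalence_adjusted_kappa(rater_1, rater_2, n_categories=2):
    """Assumed form: chance agreement fixed at 1 / n_categories."""
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    return (n_categories * observed - 1) / (n_categories - 1)


# Hypothetical records: 96 rated "yes" by both raters, 4 disagreements, no joint "no".
expert = ["yes"] * 96 + ["yes", "yes", "no", "no"]
other = ["yes"] * 96 + ["no", "no", "yes", "yes"]
kappa = cohen_kappa(expert, other)
pak = prevalence_adjusted_kappa(expert, other)
print(f"Cohen's kappa = {kappa:.3f}, prevalence-adjusted kappa = {pak:.3f}")
```

In this constructed case the raters agree on 96% of records, yet Cohen's kappa is slightly negative because nearly all of that agreement is attributed to chance; the adjusted statistic stays high.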
The psychometric properties of Observer OPTION(5), an observer measure of shared decision making.

PubMed

Barr, Paul J; O'Malley, Alistair James; Tsulukidze, Maka; Gionfriddo, Michael R; Montori, Victor; Elwyn, Glyn

2015-08-01

Observer OPTION(5) was designed as a more efficient version of OPTION(12), the most commonly used measure of shared decision making (SDM). The current paper assesses the psychometric properties of OPTION(5). Two raters used OPTION(5) to rate recordings of clinical encounters from two previous patient decision aid (PDA) trials (n=201; n=110). A subsample was re-rated two weeks later. We assessed discriminative validity, inter-rater reliability, intra-rater reliability, and concurrent validity. OPTION(5) demonstrated discriminative validity, with increases in SDM between usual care and PDA arms. OPTION(5) also demonstrated concurrent validity with OPTION(12), r=0.61 (95%CI 0.54, 0.68) and intra-rater reliability, r=0.93 (0.83, 0.97). The mean difference in rater score was 8.89 (95% Credibility Interval, 7.5, 10.3), with intraclass correlation (ICC) of 0.67 (95% Credibility Interval, 0.51, 0.91) for the accuracy of rater scores and 0.70 (95% Credibility Interval, 0.56, 0.94) for the consistency of rater scores across encounters, indicating good inter-rater reliability. Raters reported lower cognitive burden when using OPTION(5) compared to OPTION(12). OPTION(5) is a brief, theoretically grounded observer measure of SDM with promising psychometric properties in this sample and low burden on raters. OPTION(5) has potential to provide reliable, valid assessment of SDM in clinical encounters. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Assessment of lumbosacral kyphosis in spondylolisthesis: a computer-assisted reliability study of six measurement techniques

PubMed Central

Glavas, Panagiotis; Mac-Thiong, Jean-Marc; Parent, Stefan; de Guise, Jacques A.

2008-01-01

Although recognized as an important aspect in the management of spondylolisthesis, there is no consensus on the most reliable and optimal measure of lumbosacral kyphosis (LSK). Using a custom computer software, four raters evaluated 60 standing lateral radiographs of the lumbosacral spine during two sessions at a 1-week interval. The sample size consisted of 20 normal, 20 low and 20 high grade spondylolisthetic subjects. Six parameters were included for analysis: Boxall’s slip angle, Dubousset’s lumbosacral angle (LSA), the Spinal Deformity Study Group’s (SDSG) LSA, dysplastic SDSG LSA, sagittal rotation (SR), kyphotic Cobb angle (k-Cobb). Intra- and inter-rater reliability for all parameters was assessed using intra-class correlation coefficients (ICC). Correlations between parameters and slip percentage were evaluated with Pearson coefficients. The intra-rater ICC’s for all the parameters ranged between 0.81 and 0.97 and the inter-rater ICC’s were between 0.74 and 0.98. All parameters except sagittal rotation showed a medium to large correlation with slip percentage. Dubousset’s LSA and the k-Cobb showed the largest correlations (r = −0.78 and r = −0.50, respectively). SR was associated with the weakest correlation (r = −0.10). All other parameters had medium correlations with percent slip (r = 0.31–0.43). All measurement techniques provided excellent inter- and intra-rater reliability. Dubousset’s LSA showed the strongest correlation with slip grade. This parameter can be used in the clinical setting with PACS software capabilities to assess LSK. A computer-assisted technique is recommended in order to increase the reliability of the measurement of LSK in spondylolisthesis. PMID:19015898
Assessing the environmental characteristics of cycling routes to school: a study on the reliability and validity of a Google Street View-based audit.

PubMed

Vanwolleghem, Griet; Van Dyck, Delfien; Ducheyne, Fabian; De Bourdeaudhuij, Ilse; Cardon, Greet

2014-06-10

Google Street View provides a valuable and efficient alternative to observe the physical environment compared to on-site fieldwork. However, studies on the use, reliability and validity of Google Street View in a cycling-to-school context are lacking. We aimed to study the intra-, inter-rater reliability and criterion validity of EGA-Cycling (Environmental Google Street View Based Audit - Cycling to school), a newly developed audit using Google Street View to assess the physical environment along cycling routes to school. Parents (n = 52) of 11-to-12-year old Flemish children, who mostly cycled to school, completed a questionnaire and identified their child's cycling route to school on a street map. Fifty cycling routes of 11-to-12-year olds were identified and physical environmental characteristics along the identified routes were rated with EGA-Cycling (5 subscales; 37 items), based on Google Street View. To assess reliability, two researchers performed the audit. Criterion validity of the audit was examined by comparing the ratings based on Google Street View with ratings through on-site assessments. Intra-rater reliability was high (kappa range 0.47-1.00). Large variations in the inter-rater reliability (kappa range -0.03-1.00) and criterion validity scores (kappa range -0.06-1.00) were reported, with acceptable inter-rater reliability values for 43% of all items and acceptable criterion validity for 54% of all items. EGA-Cycling can be used to assess physical environmental characteristics along cycling routes to school. However, to assess the micro-environment specifically related to cycling, on-site assessments have to be added.
Testing fine motor coordination via telehealth: effects of video characteristics on reliability and validity.

PubMed

Hoenig, Helen M; Amis, Kristopher; Edmonds, Carol; Morgan, Michelle S; Landerman, Lawrence; Caves, Kevin

2017-01-01

Background There is limited research about the effects of video quality on the accuracy of assessments of physical function. Methods A repeated measures study design was used to assess reliability and validity of the finger-nose test (FNT) and the finger-tapping test (FTT) carried out with 50 veterans who had impairment in gross and/or fine motor coordination. Videos were scored by expert raters under eight differing conditions, including in-person, high definition video with slow motion review and standard speed videos with varying bit rates and frame rates. Results FTT inter-rater reliability was excellent with slow motion video (ICC 0.98-0.99) and good (ICC 0.59) under the normal speed conditions. Inter-rater reliability for FNT 'attempts' was excellent (ICC 0.97-0.99) for all viewing conditions; for FNT 'misses' it was good to excellent (ICC 0.89) with slow motion review but substantially worse (ICC 0.44) on the normal speed videos. FTT criterion validity (i.e. compared to slow motion review) was excellent (β = 0.94) for the in-person rater and good ( β = 0.77) on normal speed videos. Criterion validity for FNT 'attempts' was excellent under all conditions ( r ≥ 0.97) and for FNT 'misses' it was good to excellent under all conditions ( β = 0.61-0.81). Conclusions In general, the inter-rater reliability and validity of the FNT and FTT assessed via video technology is similar to standard clinical practices, but is enhanced with slow motion review and/or higher bit rate.
Reliability of the Functional Mobility Scale for Children with Cerebral Palsy

ERIC Educational Resources Information Center

Harvey, Adrienne R.; Morris, Meg E.; Graham, H. Kerr; Wolfe, Rory; Baker, Richard

2010-01-01

This study examined inter-rater reliability of the Functional Mobility Scale (FMS) for children with cerebral palsy (CP) and the presence of rater bias. A consecutive sample of 118 children with CP, 2-18 years old (mean 10.3 years, SD 3.6), was recruited from a hospital setting. Children were classified using the gross motor function…

Can we perceptually rate alaryngeal voice? Developing the Sunderland Tracheoesophageal Voice Perceptual Scale.

PubMed

Hurren, A; Hildreth, A J; Carding, P N

2009-12-01

To investigate the inter and intra reliability of raters (in relation to both profession and expertise) when judging two alaryngeal voice parameters: 'Overall Grade' and 'Neoglottal Tonicity'. Reliable perceptual assessment is essential for surgical and therapeutic outcome measurement but has been minimally researched to date. Test of inter and intra rater agreement from audio recordings of 55 tracheoesophageal speakers. Cancer Unit. Twelve speech and language therapists and ten Ear, Nose and Throat surgeons. Perceptual voice parameters of 'Overall Grade' rated with a 0-3 equally appearing interval scale and 'Neoglottal Tonicity' with an 11-point bipolar semantic scale. All raters achieved 'good' agreement for 'Overall Grade' with mean weighted kappa coefficients of 0.78 for intra and 0.70 for inter-rater agreement. All raters achieved 'good' intra-rater agreement for 'Neoglottal Tonicity' (0.64) but inter-rater agreement was only 'moderate' (0.40). However, the expert speech and language therapists sub-group attained 'good' inter-rater agreement with this parameter (0.63). The effect of 'Neoglottal Tonicity' on 'Overall Grade' was examined utilising only expert speech and language therapists data. Linear regression analysis resulted in an r-squared coefficient of 0.67. Analysis of the perceptual impression of hypotonicity and hypertonicity in relation to mean 'Overall Grade' score demonstrated neither tone was linked to a more favourable grade (P = 0.42). Expert speech and language therapist raters may be the optimal judges for tracheoesophageal voice assessment. Tonicity appears to be a good predictor of 'Overall Grade'. These scales have clinical applicability to investigate techniques that facilitate optotonic neoglottal voice quality.
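Weighted kappa, used above for the ordinal 'Overall Grade' scale, gives partial credit to near-misses by weighting each disagreement by its distance on the scale. The abstract does not state the weighting scheme, so the sketch below assumes linear weights; the function name and the example ratings on a 0-3 scale are hypothetical.

```python
from collections import Counter


def linear_weighted_kappa(rater_1, rater_2, categories):
    """Weighted kappa with linear disagreement weights |i - j| (an assumption)."""
    n = len(rater_1)
    observed = sum(abs(a - b) for a, b in zip(rater_1, rater_2)) / n
    c1, c2 = Counter(rater_1), Counter(rater_2)
    expected = sum(c1[i] * c2[j] * abs(i - j)
                   for i in categories for j in categories) / n ** 2
    if expected == 0:
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1 - observed / expected


# Hypothetical 'Overall Grade' ratings (0-3) of four voice samples by two raters.
rater_1 = [0, 1, 2, 3]
rater_2 = [0, 1, 2, 2]
kw = linear_weighted_kappa(rater_1, rater_2, categories=range(4))
print(f"weighted kappa = {kw:.3f}")
```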
Reliability of Semi-Automated Segmentations in Glioblastoma.

PubMed

Huber, T; Alber, G; Bette, S; Boeckh-Behrens, T; Gempt, J; Ringel, F; Alberts, E; Zimmer, C; Bauer, J S

2017-06-01

In glioblastoma, quantitative volumetric measurements of contrast-enhancing or fluid-attenuated inversion recovery (FLAIR) hyperintense tumor compartments are needed for an objective assessment of therapy response. The aim of this study was to evaluate the reliability of a semi-automated, region-growing segmentation tool for determining tumor volume in patients with glioblastoma among different users of the software. A total of 320 segmentations of tumor-associated FLAIR changes and contrast-enhancing tumor tissue were performed by different raters (neuroradiologists, medical students, and volunteers). All patients underwent high-resolution magnetic resonance imaging including a 3D-FLAIR and a 3D-MPRage sequence. Segmentations were done using a semi-automated, region-growing segmentation tool. Intra- and inter-rater-reliability were addressed by intra-class-correlation (ICC). Root-mean-square error (RMSE) was used to determine the precision error. Dice score was calculated to measure the overlap between segmentations. Semi-automated segmentation showed a high ICC (> 0.985) for all groups indicating an excellent intra- and inter-rater-reliability. Significantly smaller precision errors and higher Dice scores were observed for FLAIR segmentations compared with segmentations of contrast-enhancement. Single rater segmentations showed the lowest RMSE for FLAIR of 3.3 % (MPRage: 8.2 %). Both single raters and neuroradiologists had the lowest precision error for longitudinal evaluation of FLAIR changes. Semi-automated volumetry of glioblastoma was reliably performed by all groups of raters, even without neuroradiologic expertise. Interestingly, segmentations of tumor-associated FLAIR changes were more reliable than segmentations of contrast enhancement. In longitudinal evaluations, an experienced rater can detect progressive FLAIR changes of less than 15 % reliably in a quantitative way which could help to detect progressive disease earlier.
Inter-rater reliability of an observation-based ergonomics assessment checklist for office workers.

PubMed

Pereira, Michelle Jessica; Straker, Leon Melville; Comans, Tracy Anne; Johnston, Venerina

2016-12-01

To establish the inter-rater reliability of an observation-based ergonomics assessment checklist for computer workers. A 37-item (38-item if a laptop was part of the workstation) comprehensive observational ergonomics assessment checklist comparable to government guidelines and up to date with empirical evidence was developed. Two trained practitioners assessed full-time office workers performing their usual computer-based work and evaluated the suitability of workstations used. Practitioners assessed each participant consecutively. The order of assessors was randomised, and the second assessor was blinded to the findings of the first. Unadjusted kappa coefficients between the raters were obtained for the overall checklist and subsections that were formed from question-items relevant to specific workstation equipment. Twenty-seven office workers were recruited. The inter-rater reliability between two trained practitioners achieved moderate to good reliability for all except one checklist component. This checklist has mostly moderate to good reliability between two trained practitioners. Practitioner Summary: This reliable ergonomics assessment checklist for computer workers was designed using accessible government guidelines and supplemented with up-to-date evidence. Employers in Queensland (Australia) can fulfil legislative requirements by using this reliable checklist to identify and subsequently address potential risk factors for work-related injury to provide a safe working environment.
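For readers unfamiliar with the unadjusted kappa coefficient used in this study, the sketch below shows the standard two-rater Cohen's kappa calculation. It is not the authors' analysis code; the function name `cohen_kappa`, the category labels, and the ten ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(ratings_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical checklist item scored by two practitioners for ten workers.
first = ["ok", "ok", "risk", "risk", "ok", "risk", "ok", "ok", "risk", "ok"]
second = ["ok", "ok", "risk", "ok", "ok", "risk", "ok", "risk", "risk", "ok"]
kappa = cohen_kappa(first, second)
print(f"kappa = {kappa:.3f}")
```

In this made-up example the raters agree on 8 of 10 items, but because 52% agreement would be expected by chance alone, kappa is considerably lower than the raw agreement.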
Assessing physiotherapists' communication skills for promoting patient autonomy for self-management: reliability and validity of the communication evaluation in rehabilitation tool.

PubMed

Murray, Aileen; Hall, Amanda; Williams, Geoffrey C; McDonough, Suzanne M; Ntoumanis, Nikos; Taylor, Ian; Jackson, Ben; Copsey, Bethan; Hurley, Deirdre A; Matthews, James

2018-02-27

To assess the inter-rater reliability and concurrent validity of the Communication Evaluation in Rehabilitation Tool, which aims to externally assess physiotherapists' competency in using Self-Determination Theory-based communication strategies in practice. Audio recordings of initial consultations between 24 physiotherapists and 24 patients with chronic low back pain in four hospitals in Ireland were obtained as part of a larger randomised controlled trial. Three raters, all of whom had PhDs in psychology and expertise in motivation and physical activity, independently listened to the 24 audio recordings and completed the 18-item Communication Evaluation in Rehabilitation Tool. Inter-rater reliability between all three raters was assessed using intraclass correlation coefficients. Concurrent validity was assessed using Pearson's r correlations with a reference standard, the Health Care Climate Questionnaire. The total score for the Communication Evaluation in Rehabilitation Tool is an average of all 18 items. Total scores demonstrated good inter-rater reliability (intraclass correlation coefficient (ICC) = 0.8) and concurrent validity with the Health Care Climate Questionnaire total score (range: r = 0.7-0.88). Item-level scores of the Communication Evaluation in Rehabilitation Tool identified five items that need improvement. Results provide preliminary evidence to support future use and testing of the Communication Evaluation in Rehabilitation Tool. Implications for Rehabilitation: Promoting patient autonomy is a learned skill, and while interventions exist to train clinicians in these skills, there are no tools to assess how well clinicians use these skills when interacting with a patient. The lack of robust assessment has severe implications regarding both the fidelity of clinician training packages and resulting outcomes for promoting patient autonomy. This study has developed a novel measurement tool, the Communication Evaluation in Rehabilitation Tool, and a comprehensive user manual to assess how well health care providers use autonomy-supportive communication strategies in real-world clinical settings. This tool has demonstrated good inter-rater reliability and concurrent validity in its initial testing phase. The Communication Evaluation in Rehabilitation Tool can be used in future studies to assess autonomy-supportive communication and undergo further measurement property testing as per our recommendations.
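As a rough illustration of how an intraclass correlation coefficient can be obtained when every rater scores every recording, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) from ANOVA mean squares. The abstract does not state which ICC form was used, so the choice of ICC(2,1) is an assumption, as are the function name `icc_2_1` and the data, which are the widely reproduced six-subject, four-judge example table from Shrout and Fleiss (1979), not data from this study.

```python
def icc_2_1(scores):
    """ICC(2,1) for a complete subjects-by-raters table (list of rows)."""
    n = len(scores)     # number of subjects (rows)
    k = len(scores[0])  # number of raters (columns)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Example table: 6 subjects (rows) rated by 4 judges (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(example)
print(f"ICC(2,1) = {icc:.2f}")
```

Other ICC forms (for example consistency rather than absolute agreement, or average-of-raters rather than single-rater) use the same mean squares with a different denominator, so the reported value depends on which form a study chooses.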
Grant Peer Review: Improving Inter-Rater Reliability with Training.

PubMed

Sattler, David N; McKnight, Patrick E; Naney, Linda; Mathis, Randy

2015-01-01

This study developed and evaluated a brief training program for grant reviewers that aimed to increase inter-rater reliability, rating scale knowledge, and effort to read the grant review criteria. Enhancing reviewer training may improve the reliability and accuracy of research grant proposal scoring and funding recommendations. Seventy-five Public Health professors from U.S. research universities watched the training video we produced and assigned scores to the National Institutes of Health scoring criteria proposal summary descriptions. For both novice and experienced reviewers, the training video increased scoring accuracy (the percentage of scores that reflect the true rating scale values), inter-rater reliability, and the amount of time reading the review criteria compared to the no video condition. The increase in reliability for experienced reviewers is notable because it is commonly assumed that reviewers--especially those with experience--have good understanding of the grant review rating scale. The findings suggest that both experienced and novice reviewers who had not received the type of training developed in this study may not have appropriate understanding of the definitions and meaning for each value of the rating scale and that experienced reviewers may overestimate their knowledge of the rating scale. The results underscore the benefits of and need for specialized peer reviewer training.
Validity and reliability of the robotic objective structured assessment of technical skills

PubMed Central

Siddiqui, Nazema Y.; Galloway, Michael L.; Geller, Elizabeth J.; Green, Isabel C.; Hur, Hye-Chun; Langston, Kyle; Pitter, Michael C.; Tarr, Megan E.; Martino, Martin A.

2015-01-01

Objective: Objective structured assessments of technical skills (OSATS) have been developed to measure the skill of surgical trainees. Our aim was to develop an OSATS specifically for trainees learning robotic surgery. Study Design: This is a multi-institutional study in eight academic training programs. We created an assessment form to evaluate robotic surgical skill through five inanimate exercises. Obstetrics/gynecology, general surgery, and urology residents, fellows, and faculty completed five robotic exercises on a standard training model. Study sessions were recorded and randomly assigned to three blinded judges who scored performance using the assessment form. Construct validity was evaluated by comparing scores between participants with different levels of surgical experience; inter- and intra-rater reliability were also assessed. Results: We evaluated 83 residents, 9 fellows, and 13 faculty, totaling 105 participants; 88 (84%) were from obstetrics/gynecology. Our assessment form demonstrated construct validity, with faculty and fellows performing significantly better than residents (mean scores: 89 ± 8 faculty; 74 ± 17 fellows; 59 ± 22 residents, p<0.01). In addition, participants with more robotic console experience scored significantly higher than those with fewer prior console surgeries (p<0.01). R-OSATS demonstrated good inter-rater reliability across all five drills (mean Cronbach's α: 0.79 ± 0.02). Intra-rater reliability was also high (mean Spearman's correlation: 0.91 ± 0.11). Conclusions: We developed an assessment form for robotic surgical skill that demonstrates construct validity and inter- and intra-rater reliability. When paired with standardized robotic skill drills, this form may be useful to distinguish between levels of trainee performance. PMID:24807319
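Cronbach's α is more often seen as an internal-consistency statistic, so its use here for inter-rater reliability may be unfamiliar. One common reading, assumed in the sketch below, is that each judge is treated as an "item" and each recorded performance as a "case". The function name `cronbach_alpha` and the scores are hypothetical and are not taken from the study.

```python
import statistics


def cronbach_alpha(scores_by_judge):
    """Cronbach's alpha, treating each judge as one 'item'.

    `scores_by_judge` holds one list per judge, each giving that judge's
    score for every performance, in the same order.
    """
    k = len(scores_by_judge)
    if k < 2:
        raise ValueError("at least two judges are required")
    # Total score per performance, summed across judges.
    totals = [sum(column) for column in zip(*scores_by_judge)]
    item_variance = sum(statistics.variance(j) for j in scores_by_judge)
    total_variance = statistics.variance(totals)
    return k / (k - 1) * (1.0 - item_variance / total_variance)


# Hypothetical scores from three blinded judges for five recorded sessions.
judge_1 = [60, 72, 85, 90, 55]
judge_2 = [58, 75, 80, 92, 60]
judge_3 = [65, 70, 88, 85, 50]
alpha = cronbach_alpha([judge_1, judge_2, judge_3])
print(f"alpha = {alpha:.2f}")
```

Alpha approaches 1 when the judges rank and score the performances almost identically, and falls as their scores diverge.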
Reliability and concurrent validity of the iPhone® Compass application to measure thoracic rotation range of motion (ROM) in healthy participants

PubMed Central

Schram, Ben; Cox, Alistair J.; Anderson, Sarah L.; Keogh, Justin

2018-01-01

Background: Several water-based sports (swimming, surfing and stand up paddle boarding) require adequate thoracic mobility (specifically rotation) in order to perform the appropriate activity requirements. The measurement of thoracic spine rotation is problematic for clinicians due to a lack of convenient and reliable measurement techniques. More recently, smartphones have been used to quantify movement in various joints in the body; however, there appears to be a paucity of research using smartphones to assess thoracic spine movement. Therefore, the aim of this study is to determine the reliability (intra- and inter-rater) and validity of the iPhone® app (Compass) when assessing thoracic spine rotation ROM in healthy individuals. Methods: A total of thirty participants were recruited for this study. Thoracic spine rotation ROM was measured using both the current clinical gold standard, a universal goniometer (UG), and the smartphone Compass app. Intra-rater and inter-rater reliability were determined with an Intraclass Correlation Coefficient (ICC) and associated 95% confidence intervals (CI). Validation of the Compass app in comparison to the UG was measured using Pearson’s correlation coefficient, and levels of agreement were identified with Bland–Altman plots and 95% limits of agreement. Results: The UG and Compass app measurements both had excellent reproducibility for intra-rater (ICC 0.94–0.98) and inter-rater reliability (ICC 0.72–0.89). However, the Compass app measurements had higher intra-rater reliability (ICC = 0.96–0.98; 95% CI [0.93–0.99]; vs. ICC = 0.94–0.98; 95% CI [0.88–0.99]) and inter-rater reliability (ICC = 0.87–0.89; 95% CI [0.74–0.95] vs. ICC = 0.72–0.82; 95% CI [0.21–0.94]). A strong and significant correlation was found between the UG and the Compass app, demonstrating good concurrent validity (r = 0.835, p < 0.001). Levels of agreement between the two devices were 24.8° (LoA −9.5°, +15.3°). The UG was found to consistently measure higher values than the Compass app (mean difference 2.8°, P < 0.001). Conclusion: This study reveals that the iPhone® app (Compass) is a reliable tool for measuring thoracic spine rotation which produces greater reproducibility of measurements both within and between raters than a UG. As a significant positive correlation exists between the Compass app and UG, this supports the use of either device in clinical practice as a reliable and valid tool to measure thoracic rotation. Considering the levels of agreement are clinically unacceptable, the devices should not be used interchangeably for initial and follow up measurements. PMID:29568701
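The Bland-Altman quantities reported in this abstract (a mean difference and 95% limits of agreement) follow a simple recipe: take the paired differences between devices, then report their mean and the mean plus or minus 1.96 standard deviations. The sketch below assumes that conventional definition; the function name `bland_altman` and the five paired readings are invented for illustration and are not the study data.

```python
import statistics


def bland_altman(measure_a, measure_b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    if len(measure_a) != len(measure_b) or len(measure_a) < 2:
        raise ValueError("need at least two paired measurements")
    differences = [a - b for a, b in zip(measure_a, measure_b)]
    bias = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical rotation readings in degrees for five participants.
goniometer = [40.0, 45.0, 50.0, 38.0, 42.0]
compass_app = [38.0, 41.0, 49.0, 36.0, 36.0]
bias, lower, upper = bland_altman(goniometer, compass_app)
print(f"bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
```

The width of the interval (upper minus lower) is what a clinician compares against the smallest difference that matters in practice. In the abstract, the limits of −9.5° and +15.3° span the reported 24.8°.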
Reliability and concurrent validity of the iPhone® Compass application to measure thoracic rotation range of motion (ROM) in healthy participants.

PubMed

Furness, James; Schram, Ben; Cox, Alistair J; Anderson, Sarah L; Keogh, Justin

2018-01-01

Several water-based sports (swimming, surfing and stand up paddle boarding) require adequate thoracic mobility (specifically rotation) in order to perform the appropriate activity requirements. The measurement of thoracic spine rotation is problematic for clinicians due to a lack of convenient and reliable measurement techniques. More recently, smartphones have been used to quantify movement in various joints in the body; however, there appears to be a paucity of research using smartphones to assess thoracic spine movement. Therefore, the aim of this study is to determine the reliability (intra- and inter-rater) and validity of the iPhone® app (Compass) when assessing thoracic spine rotation ROM in healthy individuals. A total of thirty participants were recruited for this study. Thoracic spine rotation ROM was measured using both the current clinical gold standard, a universal goniometer (UG), and the smartphone Compass app. Intra-rater and inter-rater reliability were determined with an Intraclass Correlation Coefficient (ICC) and associated 95% confidence intervals (CI). Validation of the Compass app in comparison to the UG was measured using Pearson's correlation coefficient, and levels of agreement were identified with Bland-Altman plots and 95% limits of agreement. The UG and Compass app measurements both had excellent reproducibility for intra-rater (ICC 0.94-0.98) and inter-rater reliability (ICC 0.72-0.89). However, the Compass app measurements had higher intra-rater reliability (ICC = 0.96-0.98; 95% CI [0.93-0.99]; vs. ICC = 0.94-0.98; 95% CI [0.88-0.99]) and inter-rater reliability (ICC = 0.87-0.89; 95% CI [0.74-0.95] vs. ICC = 0.72-0.82; 95% CI [0.21-0.94]). A strong and significant correlation was found between the UG and the Compass app, demonstrating good concurrent validity (r = 0.835, p < 0.001). Levels of agreement between the two devices were 24.8° (LoA -9.5°, +15.3°). The UG was found to consistently measure higher values than the Compass app (mean difference 2.8°, P < 0.001). This study reveals that the iPhone® app (Compass) is a reliable tool for measuring thoracic spine rotation which produces greater reproducibility of measurements both within and between raters than a UG. As a significant positive correlation exists between the Compass app and UG, this supports the use of either device in clinical practice as a reliable and valid tool to measure thoracic rotation. Considering the levels of agreement are clinically unacceptable, the devices should not be used interchangeably for initial and follow up measurements.
Reliability of the Quality of Upper Extremity Skills Test for Children with Cerebral Palsy Aged 2 to 12 Years

ERIC Educational Resources Information Center

Thorley, Megan; Lannin, Natasha; Cusick, Anne; Novak, Iona; Boyd, Roslyn

2012-01-01

Aim: To investigate reliability of the Quality of Upper Extremity Skills Test (QUEST) scores for children with cerebral palsy (CP) aged 2-12 years. Method: Thirty-one QUESTs from 24 children with CP were rated once by two raters and twice by one rater. Internal consistency of total scores, inter- and intra-rater reliability findings for total,…
The inter-rater reliability of the incontinence-associated dermatitis intervention tool-D (IADIT-D) between two independent registered nurses of nursing home residents in long-term care facilities.

PubMed

Braunschmidt, Brigitte; Müller, Gerhard; Jukic-Puntigam, Margareta; Steininger, Alfred

2013-01-01

Incontinence-associated dermatitis (IAD) is the clinical manifestation of moisture-related skin damage (Beeckman, Woodward, & Gray, 2011). Valid assessment instruments are needed for risk assessment and classification of IAD. The aim of this quantitative-descriptive cross-sectional study was to determine the inter-rater reliability of the item scores of the German Incontinence Associated Dermatitis Intervention Tool (IADIT-D) between two independent assessors of nursing home residents (n = 381) in long-term care facilities. The 19 pairs of assessors consisted of registered nurses. The data analysis began with the calculation of the total percentage of agreement. Because this value is not adjusted for chance agreement, kappa coefficients and the AC1 statistic were calculated as well. The total percentage of inter-rater agreement was 84% (n = 319). In a second step of analysis, the calculation across all items yielded high (kappa = .70) and very high (AC1 = .83) agreement levels, respectively. For the risk assessment (kappa = .82; AC1 = .94), the values amounted to very high agreement levels, and for the classification (kappa(w) = .70; AC1 = .76) to high agreement levels. The high to very high agreement values of the IADIT-D demonstrate that the items can be regarded as stable with regard to inter-rater reliability for use in long-term care facilities. In addition, further validation studies are needed.
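The abstract reports raw percentage agreement alongside chance-corrected statistics. As a minimal sketch of the less familiar of these, the code below computes percentage agreement and Gwet's AC1 for two raters, using the usual two-rater definition in which chance agreement is the sum of pi_k(1 - pi_k) over categories divided by (q - 1), where pi_k is the average share of ratings in category k and q is the number of categories. The function names and the ten ratings are hypothetical, and this is not the study's analysis code.

```python
def percent_agreement(ratings_a, ratings_b):
    """Share of items on which the two raters gave the same rating."""
    return sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)


def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 for two raters and nominal categories."""
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)
    if q < 2:
        return 1.0  # only one category was ever used, so agreement is complete
    observed = percent_agreement(ratings_a, ratings_b)
    # pi_k: average share of ratings in category k across both raters.
    pi = [(ratings_a.count(c) + ratings_b.count(c)) / (2 * n) for c in categories]
    expected = sum(p * (1 - p) for p in pi) / (q - 1)
    return (observed - expected) / (1 - expected)


# Hypothetical item scored by two nurses (0 = no risk, 1 = at risk).
nurse_1 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
nurse_2 = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
agreement = percent_agreement(nurse_1, nurse_2)
ac1 = gwet_ac1(nurse_1, nurse_2)
print(f"agreement = {agreement:.0%}, AC1 = {ac1:.2f}")
```

AC1 is often reported next to kappa because kappa can drop sharply when one category dominates, even if raw agreement is high.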
A TWIN STUDY OF SCHIZOAFFECTIVE-MANIA, SCHIZOAFFECTIVE-DEPRESSION AND OTHER PSYCHOTIC SYNDROMES

PubMed Central

Cardno, Alastair G; Rijsdijk, Frühling V; West, Robert M; Gottesman, Irving I; Craddock, Nick; Murray, Robin M; McGuffin, Peter

2012-01-01

The nosological status of schizoaffective disorders remains controversial. Twin studies are potentially valuable for investigating relationships between schizoaffective-mania, schizoaffective-depression and other psychotic syndromes, but no such study has yet been reported. We ascertained 224 probandwise twin pairs (106 monozygotic, 118 same-sex dizygotic), where probands had psychotic or manic symptoms, from the Maudsley Twin Register in London (1948–1993). We investigated Research Diagnostic Criteria schizoaffective-mania, schizoaffective-depression, schizophrenia, mania and depressive psychosis primarily using a non-hierarchical classification, and additionally using hierarchical and data-derived classifications, and a classification featuring broad schizophrenic and manic syndromes without separate schizoaffective syndromes. We investigated inter-rater reliability and co-occurrence of syndromes within twin probands and twin pairs. The schizoaffective syndromes showed only moderate inter-rater reliability. There was general significant co-occurrence between syndromes within twin probands and monozygotic pairs, and a trend for schizoaffective-mania and mania to have the greatest co-occurrence. Schizoaffective syndromes in monozygotic probands were associated with relatively high risk of a psychotic syndrome occurring in their co-twins. The classification of broad schizophrenic and manic syndromes without separate schizoaffective syndromes showed improved inter-rater reliability, but high genetic and environmental correlations between the two broad syndromes. The results are consistent with regarding schizoaffective-mania as due to co-occurring elevated liability to schizophrenia, mania and depression; and schizoaffective-depression as due to co-occurring elevated liability to schizophrenia and depression, but with less elevation of liability to mania. 
If in due course schizoaffective syndromes show satisfactory inter-rater reliability and some specific etiological factors they could alternatively be regarded as partly independent disorders. PMID:22213671
A twin study of schizoaffective-mania, schizoaffective-depression, and other psychotic syndromes.

PubMed

Cardno, Alastair G; Rijsdijk, Frühling V; West, Robert M; Gottesman, Irving I; Craddock, Nick; Murray, Robin M; McGuffin, Peter

2012-03-01

The nosological status of schizoaffective disorders remains controversial. Twin studies are potentially valuable for investigating relationships between schizoaffective-mania, schizoaffective-depression, and other psychotic syndromes, but no such study has yet been reported. We ascertained 224 probandwise twin pairs [106 monozygotic (MZ), 118 same-sex dizygotic (DZ)], where probands had psychotic or manic symptoms, from the Maudsley Twin Register in London (1948-1993). We investigated Research Diagnostic Criteria schizoaffective-mania, schizoaffective-depression, schizophrenia, mania and depressive psychosis primarily using a non-hierarchical classification, and additionally using hierarchical and data-derived classifications, and a classification featuring broad schizophrenic and manic syndromes without separate schizoaffective syndromes. We investigated inter-rater reliability and co-occurrence of syndromes within twin probands and twin pairs. The schizoaffective syndromes showed only moderate inter-rater reliability. There was general significant co-occurrence between syndromes within twin probands and MZ pairs, and a trend for schizoaffective-mania and mania to have the greatest co-occurrence. Schizoaffective syndromes in MZ probands were associated with relatively high risk of a psychotic syndrome occurring in their co-twins. The classification of broad schizophrenic and manic syndromes without separate schizoaffective syndromes showed improved inter-rater reliability, but high genetic and environmental correlations between the two broad syndromes. The results are consistent with regarding schizoaffective-mania as due to co-occurring elevated liability to schizophrenia, mania, and depression; and schizoaffective-depression as due to co-occurring elevated liability to schizophrenia and depression, but with less elevation of liability to mania. 
If in due course schizoaffective syndromes show satisfactory inter-rater reliability and some specific etiological factors they could alternatively be regarded as partly independent disorders. Copyright © 2011 Wiley Periodicals, Inc.
Inter-rater reliability of PATH observations for assessment of ergonomic risk factors in hospital work.

PubMed

Park, Jung-Keun; Boyer, Jon; Tessler, Jamie; Casey, Jeffrey; Schemm, Linda; Gore, Rebecca; Punnett, Laura

2009-07-01

This study examined the inter-rater reliability of expert observations of ergonomic risk factors by four analysts. Ten jobs were observed at a hospital using a newly expanded version of the PATH method (Buchholz et al. 1996), to which selected upper extremity exposures had been added. Two of the four raters simultaneously observed each worker onsite for a total of 443 observation pairs containing 18 categorical exposure items each. For most exposure items, kappa coefficients were 0.4 or higher. For some items, agreement was higher both for the jobs with less rapid hand activity and for the analysts with a higher level of ergonomic job analysis experience. These upper extremity exposures could be characterised reliably with real-time observation, given adequate experience and training of the observers. The revised version of PATH is applicable to the analysis of jobs where upper extremity musculoskeletal strain is of concern.
Inter-rater reliability of h-index scores calculated by Web of Science and Scopus for clinical epidemiology scientists.

PubMed

Walker, Benjamin; Alavifard, Sepand; Roberts, Surain; Lanes, Andrea; Ramsay, Tim; Boet, Sylvain

2016-06-01

We investigated the inter-rater reliability of Web of Science (WoS) and Scopus when calculating the h-index of 25 senior scientists in the Clinical Epidemiology Program of the Ottawa Hospital Research Institute. Bibliometric information and the h-indices for the subjects were computed by four raters using the automatic calculators in WoS and Scopus. Correlation and agreement between ratings were assessed using Spearman's correlation coefficient and a Bland-Altman plot, respectively. Data could not be gathered from Google Scholar due to feasibility constraints. The Spearman's rank correlation between the h-indices of scientists calculated with WoS was 0.81 (95% CI 0.72-0.92) and with Scopus was 0.95 (95% CI 0.92-0.99). The Bland-Altman plot showed no significant rater bias in WoS and Scopus; however, the agreement between ratings was higher in Scopus than in WoS. Our results showed a stronger relationship and increased agreement between raters when calculating the h-index of a scientist using Scopus compared to WoS. The higher inter-rater reliability and simple user interface of Scopus may render it the more effective database when calculating the h-index of senior scientists in epidemiology. © 2016 Health Libraries Group.
The Reliability and Validity of the Thoracolumbar Injury Classification System in Pediatric Spine Trauma.

PubMed

Savage, Jason W; Moore, Timothy A; Arnold, Paul M; Thakur, Nikhil; Hsu, Wellington K; Patel, Alpesh A; McCarthy, Kathryn; Schroeder, Gregory D; Vaccaro, Alexander R; Dimar, John R; Anderson, Paul A

2015-09-15

The thoracolumbar injury classification system (TLICS) was evaluated in 20 consecutive pediatric spine trauma cases. The purpose of this study was to determine the reliability and validity of the TLICS in pediatric spine trauma. The TLICS was developed to improve the categorization and management of thoracolumbar trauma. TLICS has been shown to have good reliability and validity in the adult population. The clinical and radiographical findings of 20 pediatric thoracolumbar fractures were prospectively presented to 20 surgeons with disparate levels of training and experience with spinal trauma. These injuries were consecutively scored using the TLICS. Cohen unweighted κ coefficients and Spearman rank order correlation values were calculated for the key parameters (injury morphology, status of posterior ligamentous complex, neurological status, TLICS total score, and proposed management) to assess the inter-rater reliabilities. Five surgeons scored the same cases 3 months later to assess the intra-rater reliability. The actual management of each case was then compared with the treatment recommended by the TLICS algorithm to assess validity. The inter-rater κ statistics of all subgroups (injury morphology, status of the posterior ligamentous complex, neurological status, TLICS total score, and proposed treatment) were within the range of moderate to substantial reproducibility (0.524-0.958). All subgroups had excellent intra-rater reliability (0.748-1.000). The various indices for validity were calculated (80.3% correct, 0.836 sensitivity, 0.785 specificity, 0.676 positive predictive value, 0.899 negative predictive value). Overall, TLICS demonstrated good validity. The TLICS has good reliability and validity when used in the pediatric population. 
The inter-rater reliability of predicting management and indices for validity are lower than those in adults with thoracolumbar fractures, which is likely due to differences in the way children are treated for certain types of injuries. TLICS can be used to reliably categorize thoracolumbar injuries in the pediatric population; however, modifications may be needed to better guide treatment in this specific patient population. 4.
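The validity indices listed in this abstract (percentage correct, sensitivity, specificity, positive and negative predictive value) all derive from a 2x2 table comparing the classification's recommendation with the management actually given. The sketch below shows the standard definitions; the function name `validity_indices` and the counts are hypothetical and are not the study's data, and which outcome counts as "positive" (assumed here to be a recommendation for surgery) is also an assumption.

```python
def validity_indices(tp, fp, fn, tn):
    """Standard 2x2 indices from true/false positive/negative counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),   # positives correctly identified
        "specificity": tn / (tn + fp),   # negatives correctly identified
        "ppv": tp / (tp + fp),           # positive calls that were correct
        "npv": tn / (tn + fn),           # negative calls that were correct
    }


# Hypothetical counts: score recommends surgery vs. surgery actually performed.
indices = validity_indices(tp=30, fp=10, fn=5, tn=55)
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

Predictive values depend on how common the positive outcome is in the sample, so they are not directly comparable between populations with different treatment rates.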
Post-traumatic subtalar osteoarthritis: which grading system should we use?

PubMed

de Muinck Keizer, Robert-Jan O; Backes, Manouk; Dingemans, Siem A; Goslings, J Carel; Schepers, Tim

2016-09-01

To assess and compare post-traumatic osteoarthritis following intra-articular calcaneal fractures, one must have a reliable grading system that consistently grades the post-traumatic changes of the joint. A reliable grading system aids in the communication between treating physicians and improves the interpretation of research. To date, there is no consensus on what grading system to use in the evaluation of post-traumatic subtalar osteoarthritis. The objective of this study was to determine and compare the inter- and intra-rater reliability of two grading systems for post-traumatic subtalar osteoarthritis. Four observers evaluated 50 calcaneal fractures at least one year after trauma on conventional oblique lateral, internally and externally rotated views, and graded post-traumatic subtalar osteoarthritis using the Kellgren and Lawrence Grading Scale (KLGS) and the Paley Grading System (PGS). Inter- and intra-rater reliability were calculated and compared. The inter-rater reliability showed an intra-class correlation (ICC) of 0.54 (95 % CI 0.40-0.67) for the KLGS and an ICC of 0.41 (95 % CI 0.26 - 0.57) for the PGS. This difference was not statistically significant. The intra-rater reliability showed a mean weighted kappa of 0.62 for both the KLGS and the PGS. There is no statistically significant difference in reliability between the Kellgren and Lawrence Grading System (KLGS) and the Paley Grading System (PGS). The PGS allows for an easy two-step approach making it easy for everyday clinical purposes. For research purposes however, the more detailed and widely used KLGS seems preferable.
A paired comparison analysis of third-party rater thyroidectomy scar preference.

PubMed

Rajakumar, C; Doyle, P C; Brandt, M G; Moore, C C; Nichols, A; Franklin, J H; Yoo, J; Fung, K

2017-01-01

To determine the length and position of a thyroidectomy scar that is cosmetically most appealing to naïve raters. Images of thyroidectomy scars were reproduced on male and female necks using digital imaging software. Surgical variables studied were scar position and length. Fifteen raters were presented with 56 scar pairings and asked to identify which was preferred cosmetically. Twenty duplicate pairings were included to assess rater reliability. Analysis of variance was used to determine preference. Raters preferred low, short scars, followed by high, short scars, with long scars in either position being less desirable (p < 0.05). Twelve of 15 raters had acceptable intra-rater and inter-rater reliability. Naïve raters preferred low, short scars over the alternatives. High, short scars were the next most favourably rated. If other factors influencing incision choice are considered equal, surgeons should consider these preferences in scar position and length when planning their thyroidectomy approach.
Validation of the Spanish adaptation of the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V).

PubMed

Núñez-Batalla, Faustino; Morato-Galán, Marta; García-López, Isabel; Ávila-Menéndez, Arántzazu

2015-01-01

The Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) was developed to promote a standardised approach to evaluating and documenting auditory perceptual judgments of vocal quality. This tool was originally developed in English, and no Spanish version yet exists. The aim of this study was to develop a Spanish adaptation of CAPE-V and to examine the reliability and empirical validity of this Spanish version. To adapt the CAPE-V protocol to the Spanish language, we proposed 6 phrases phonetically designed according to the CAPE-V requirements. Prospective instrument validation was performed. The validity of the Spanish version of the CAPE-V was examined in 4 ways: intra-rater reliability, inter-rater reliability and CAPE-V versus GRABS judgments. Inter-rater reliability coefficients for the CAPE-V ranged from 0.93 for overall severity to 0.54 for intensity; intra-rater reliability ranged from 0.98 for overall severity to 0.85 for intensity. The comparison of judgments between GRABS and CAPE-V ranged from 0.86 for overall severity to 0.61 for breathiness. The present study supports the use of the Spanish version of CAPE-V because of its validity and reliability. Copyright © 2014 Elsevier España, S.L.U. and Sociedad Española de Otorrinolaringología y Patología Cérvico-Facial. All rights reserved.
The Americleft Project: A Modification of Asher-McDade Method for Rating Nasolabial Esthetics in Patients With Unilateral Cleft Lip and Palate Using Q-sort.

PubMed

Stoutland, Alicia; Long, Ross E; Mercado, Ana; Daskalogiannakis, John; Hathaway, Ronald R; Russell, Kathleen A; Singer, Emily; Semb, Gunvor; Shaw, William C

2017-11-01

The purpose of this study was to investigate ways to improve rater reliability and satisfaction in nasolabial esthetic evaluations of patients with complete unilateral cleft lip and palate (UCLP), by modifying the Asher-McDade method with use of Q-sort methodology. Blinded ratings of cropped photographs of one hundred forty-nine 5- to 7-year-old consecutively treated patients with complete UCLP from 4 different centers were used in a rating of frontal and profile nasolabial esthetic outcomes by 6 judges involved in the Americleft Project's intercenter outcome comparisons. Four judges rated in previous studies using the original Asher-McDade approach. For the Q-sort modification, rather than projection of images, each judge had cards with frontal and profile photographs of each patient and rated them on a scale of 1 to 5 for vermillion border, nasolabial frontal, and profile, using the Q-sort method with placement of cards into categories 1 to 5. Inter- and intrarater reliabilities were calculated using the Weighted Kappa (95% confidence interval). For 4 raters, the reliabilities were compared with those in previous studies. There was no significant improvement in inter-rater reliabilities using the new method. Intrarater reliability consistently improved. All raters preferred the Q-sort method with rating cards rather than a PowerPoint of photos, which improved internal consistency in rating compared to previous studies using the original Asher-McDade method. All raters preferred this method because of the ability to continuously compare photos and adjust relative ratings between patients.
The Arthroscopic Surgical Skill Evaluation Tool (ASSET)

PubMed Central

Koehler, Ryan J.; Amsdell, Simon; Arendt, Elizabeth A; Bisson, Leslie J; Braman, Jonathan P; Butler, Aaron; Cosgarea, Andrew J; Harner, Christopher D; Garrett, William E; Olson, Tyson; Warme, Winston J.; Nicandri, Gregg T.

2014-01-01

Background: Surgeries employing arthroscopic techniques are among the most commonly performed in orthopaedic clinical practice; however, valid and reliable methods of assessing the arthroscopic skill of orthopaedic surgeons are lacking. Hypothesis: The Arthroscopic Surgery Skill Evaluation Tool (ASSET) will demonstrate content validity, concurrent criterion-oriented validity, and reliability, when used to assess the technical ability of surgeons performing diagnostic knee arthroscopy on cadaveric specimens. Study Design: Cross-sectional study; Level of evidence, 3. Methods: Content validity was determined by a group of seven experts using a Delphi process. Intra-articular performance of a right and left diagnostic knee arthroscopy was recorded for twenty-eight residents and two sports medicine fellowship trained attending surgeons. Subject performance was assessed by two blinded raters using the ASSET. Concurrent criterion-oriented validity, inter-rater reliability, and test-retest reliability were evaluated. Results: Content validity: The content development group identified 8 arthroscopic skill domains to evaluate using the ASSET. Concurrent criterion-oriented validity: Significant differences in total ASSET score (p<0.05) between novice, intermediate, and advanced experience groups were identified. Inter-rater reliability: The ASSET scores assigned by each rater were strongly correlated (r=0.91, p<0.01) and the intra-class correlation coefficient between raters for the total ASSET score was 0.90. Test-retest reliability: There was a significant correlation between ASSET scores for both procedures attempted by each individual (r=0.79, p<0.01). Conclusion: The ASSET appears to be a useful, valid, and reliable method for assessing surgeon performance of diagnostic knee arthroscopy in cadaveric specimens. Studies are ongoing to determine its generalizability to other procedures as well as to the live OR and other simulated environments. PMID:23548808

Measuring the morphological characteristics of thoracolumbar fascia in ultrasound images: an inter-rater reliability study.

PubMed

De Coninck, Kyra; Hambly, Karen; Dickinson, John W; Passfield, Louis

2018-06-01

Chronic lower back pain is still regarded as a poorly understood multifactorial condition. Recently, the thoracolumbar fascia complex has been found to be a contributing factor. Ultrasound imaging has shown that people with chronic lower back pain demonstrate both a significant decrease in shear strain, and a 25% increase in thickness of the thoracolumbar fascia. There is sparse data on whether medical practitioners agree on the level of disorganisation in ultrasound images of thoracolumbar fascia. The purpose of this study was to establish inter-rater reliability of the ranking of architectural disorganisation of thoracolumbar fascia on a scale from 'very disorganised' to 'very organised'. An exploratory analysis was performed using a fully crossed design of inter-rater reliability. Thirty observers were recruited, consisting of 21 medical doctors, 7 physiotherapists and 2 radiologists, with an average of 13.03 ± 9.6 years of clinical experience. All 30 observers independently rated the architectural disorganisation of the thoracolumbar fascia in 30 ultrasound scans, on a Likert-type scale with rankings from 1 = very disorganised to 10 = very organised. Internal consistency was assessed using Cronbach's alpha. Krippendorff's alpha was used to calculate the overall inter-rater reliability. Krippendorff's alpha was 0.61, indicating a modest degree of agreement between observers on the different morphologies of thoracolumbar fascia. Cronbach's alpha (0.98) indicated that there was a high degree of consistency between observers. Experience in ultrasound image analysis did not affect consistency between observers (Cronbach's alpha for experienced and inexperienced raters: 0.95 and 0.96, respectively). Medical practitioners agree on morphological features such as levels of organisation and disorganisation in ultrasound images of thoracolumbar fascia, regardless of experience. Further analysis by an expert panel is required to develop specific classification criteria for thoracolumbar fascia.
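The gap between a very high Cronbach's alpha and a more modest agreement coefficient is easier to see with a small worked example. The sketch below computes Cronbach's alpha with raters treated as "items"; the function name and the two hypothetical raters are assumptions for illustration, and no study data are reproduced. It shows that alpha rewards consistent rank ordering even when raters never give the same score.

```python
from statistics import variance


def cronbach_alpha(scores_by_rater):
    """Cronbach's alpha with each rater treated as an 'item'.

    `scores_by_rater` is a list of equal-length score lists, one per rater,
    aligned so that position i always refers to the same scan.
    """
    k = len(scores_by_rater)
    if k < 2:
        raise ValueError("need at least two raters")
    n = len(scores_by_rater[0])
    if n < 2 or any(len(r) != n for r in scores_by_rater):
        raise ValueError("every rater must score the same two or more scans")

    # Sum of each rater's own score variance.
    rater_variances = sum(variance(r) for r in scores_by_rater)
    # Variance of the per-scan totals across raters.
    totals = [sum(r[i] for r in scores_by_rater) for i in range(n)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("alpha is undefined when all scan totals are identical")
    return k / (k - 1) * (1 - rater_variances / total_variance)


# Hypothetical raters: B always scores one point higher than A.
rater_a = [2, 4, 6, 8]
rater_b = [3, 5, 7, 9]
alpha = cronbach_alpha([rater_a, rater_b])
exact_matches = sum(a == b for a, b in zip(rater_a, rater_b))
```

Here the two raters never assign the same score, yet alpha is at its maximum because their scores move together perfectly. Chance-corrected agreement coefficients such as Krippendorff's alpha penalise that systematic offset, which is one plausible reason the two statistics can diverge.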
Evaluation of previously embolized intracranial aneurysms: inter-and intra-rater reliability among neurosurgeons and interventional neuroradiologists.

PubMed

Zuckerman, Scott L; Lakomkin, Nikita; Magarik, Jordan A; Vargas, Jan; Stephens, Marcus; Akinpelu, Babatunde; Spiotta, Alejandro M; Ahmed, Azam; Arthur, Adam S; Fiorella, David; Hanel, Ricardo; Hirsch, Joshua A; Hui, Ferdinand K; James, Robert F; Kallmes, David F; Meyers, Philip M; Niemann, David B; Rasmussen, Peter; Turner, Raymond D; Welch, Babu G; Mocco, J

2018-05-01

The angiographic evaluation of previously coiled aneurysms can be difficult yet remains critical for determining re-treatment. The main objective of this study was to determine the inter-rater reliability for both the Raymond Scale and per cent embolization among a group of neurointerventionalists evaluating previously embolized aneurysms. A panel of 15 neurointerventionalists examined 92 distinct cases of immediate post-coil embolization and 1 year post-embolization angiographs. Each case was presented four times throughout the study, along with alterations in demographics in order to evaluate intra-rater reliability. All respondents were asked to provide the per cent embolization (0-100%) and Raymond Scale grade (1-3) for each aneurysm. Inter-rater reliability was evaluated by computing weighted kappa values (for the Raymond Scale) and intraclass correlation coefficients (ICC) for per cent embolization. 10 neurosurgeons and 5 interventional neuroradiologists evaluated 368 simulated cases. The agreement among all readers employing the Raymond Scale was fair (κ=0.35) while concordance in per cent embolization was good (ICC=0.64). Clinicians with fewer than 10 years of experience demonstrated a significantly greater level of agreement than the group with greater than 10 years (κ=0.39 and ICC=0.70 vs κ=0.28 and ICC=0.58). When the same aneurysm was presented multiple times, clinicians demonstrated excellent consistency when assessing per cent embolization (ICC=0.82), but moderate agreement when employing the Raymond classification (κ=0.58). Identifying the per cent embolization in previously coiled aneurysms resulted in good inter- and intra-rater agreement, regardless of years of experience. The strong agreement among providers employing per cent embolization may make it a valuable tool for embolization assessment in this patient population. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. 
No commercial use is permitted unless otherwise expressly granted.
Urdu translation of the Hamilton Rating Scale for Depression: Results of a validation study

PubMed Central

Hashmi, Ali M.; Naz, Shahana; Asif, Aftab; Khawaja, Imran S.

2016-01-01

Objective: To develop a standardized validated version of the Hamilton Rating Scale for Depression (HAM-D) in Urdu. Methods: After translation of the HAM-D into the Urdu language following standard guidelines, the final Urdu version (HAM-D-U) was administered to 160 depressed outpatients. Inter-item correlation was assessed by calculating Cronbach alpha. Correlation between HAM-D-U scores at baseline and after a 2-week interval was evaluated for test-retest reliability. Moreover, scores of two clinicians on HAM-D-U were compared for inter-rater reliability. For establishing concurrent validity, scores of HAM-D-U and BDI-U were compared by using Spearman correlation coefficient. The study was conducted at Mayo Hospital, Lahore, from May to December 2014. Results: The Cronbach alpha for HAM-D-U was 0.71. Composite scores for HAM-D-U at baseline and after a 2-week interval were also highly correlated with each other (Spearman correlation coefficient 0.83, p-value < 0.01), indicating good test-retest reliability. Composite scores for HAM-D-U and BDI-U were positively correlated with each other (Spearman correlation coefficient 0.85, p < 0.01), indicating good concurrent validity. Scores of two clinicians for HAM-D-U were also positively correlated (Spearman correlation coefficient 0.82, p-value < 0.01), indicating good inter-rater reliability. Conclusion: The HAM-D-U is a valid and reliable instrument for the assessment of depression. It shows good inter-rater and test-retest reliability. The HAM-D-U can be a tool either for clinical management or research. PMID:28083049
Urdu translation of the Hamilton Rating Scale for Depression: Results of a validation study.

PubMed

Hashmi, Ali M; Naz, Shahana; Asif, Aftab; Khawaja, Imran S

2016-01-01

To develop a standardized validated version of the Hamilton Rating Scale for Depression (HAM-D) in Urdu. After translation of the HAM-D into the Urdu language following standard guidelines, the final Urdu version (HAM-D-U) was administered to 160 depressed outpatients. Inter-item correlation was assessed by calculating Cronbach alpha. Correlation between HAM-D-U scores at baseline and after a 2-week interval was evaluated for test-retest reliability. Moreover, scores of two clinicians on HAM-D-U were compared for inter-rater reliability. For establishing concurrent validity, scores of HAM-D-U and BDI-U were compared by using Spearman correlation coefficient. The study was conducted at Mayo Hospital, Lahore, from May to December 2014. The Cronbach alpha for HAM-D-U was 0.71. Composite scores for HAM-D-U at baseline and after a 2-week interval were also highly correlated with each other (Spearman correlation coefficient 0.83, p-value < 0.01), indicating good test-retest reliability. Composite scores for HAM-D-U and BDI-U were positively correlated with each other (Spearman correlation coefficient 0.85, p < 0.01), indicating good concurrent validity. Scores of two clinicians for HAM-D-U were also positively correlated (Spearman correlation coefficient 0.82, p-value < 0.01), indicating good inter-rater reliability. The HAM-D-U is a valid and reliable instrument for the assessment of depression. It shows good inter-rater and test-retest reliability. The HAM-D-U can be a tool either for clinical management or research.
A new iPhone application for measuring active craniocervical range of motion in patients with non-specific neck pain: a reliability and validity study.

PubMed

Pourahmadi, Mohammad Reza; Bagheri, Rasool; Taghipour, Morteza; Takamjani, Ismail Ebrahimi; Sarrafzadeh, Javad; Mohseni-Bandpei, Mohammad Ali

2018-03-01

Measurement of cervical spine range of motion (ROM) is often considered to be an essential component of cervical spine physiotherapy assessment. This study aimed to investigate the reliability and validity of an iPhone application (app) (Goniometer Pro) for measuring active craniocervical ROM (ACCROM) in patients with non-specific neck pain. A cross-sectional study was conducted at the musculoskeletal biomechanics laboratory located at Iran University of Medical Sciences. Forty non-specific neck pain patients participated in this study. The outcome measure was the ACCROM, including flexion, extension, lateral flexion, and rotation. Following the recruitment process, ACCROM was measured using a universal goniometer (UG) and iPhone 7 app. Two blinded examiners each used the UG and iPhone to measure ACCROM in the following sequences: flexion, extension, lateral flexion, and rotation. The second (2 hours later) and third (48 hours later) sessions were carried out in the same manner as the first session. Intraclass correlation coefficient (ICC) models were used to determine the intra-rater and inter-rater reliability. The Pearson correlation coefficients were used to establish concurrent validity of the iPhone app. Minimum detectable change at the 95% confidence level (MDC95) was also computed. Good intra-rater and inter-rater reliability was demonstrated for the goniometer with ICC values of ≥0.66 and ≥0.70 and the iPhone app with ICC values of ≥0.62 and ≥0.65, respectively. The MDC95 ranged from 2.21° to 12.50° for the intra-rater analysis and from 3.40° to 12.61° for the inter-rater analysis. The concurrent validity between the two instruments was high, with r values of ≥0.63. The magnitude of the differences between the UG and iPhone app values (effect sizes) was small, with Cohen d values of ≤0.17. The iPhone app possesses good reliability and high validity. It seems that this app can be used for measuring ACCROM. Copyright © 2017 Elsevier Inc. All rights reserved.
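The abstract reports MDC95 values without giving the formula. A widely used convention derives the standard error of measurement (SEM) from the between-subject standard deviation and the ICC, then scales it to a 95% change threshold. The sketch below follows that convention; whether the authors used exactly this formula is an assumption, and the input values are hypothetical.

```python
import math


def standard_error_of_measurement(sd, icc):
    """SEM = SD * sqrt(1 - ICC), a common textbook convention."""
    if sd < 0 or not 0.0 <= icc <= 1.0:
        raise ValueError("sd must be non-negative and icc must lie in [0, 1]")
    return sd * math.sqrt(1.0 - icc)


def minimum_detectable_change_95(sd, icc):
    """MDC95 = 1.96 * sqrt(2) * SEM.

    The sqrt(2) term accounts for measurement error in both the first and
    the second measurement when a change score is formed.
    """
    return 1.96 * math.sqrt(2.0) * standard_error_of_measurement(sd, icc)


# Hypothetical example: SD of 10 degrees across patients, ICC of 0.75.
sem = standard_error_of_measurement(10.0, 0.75)
mdc95 = minimum_detectable_change_95(10.0, 0.75)
```

Under this convention a lower ICC or a larger spread between patients both increase MDC95, which is consistent with the wide range of MDC95 values reported across movement directions.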
Rating the methodological quality of single-subject designs and n-of-1 trials: introducing the Single-Case Experimental Design (SCED) Scale.

PubMed

Tate, Robyn L; McDonald, Skye; Perdices, Michael; Togher, Leanne; Schultz, Regina; Savage, Sharon

2008-08-01

Rating scales that assess methodological quality of clinical trials provide a means to critically appraise the literature. Scales are currently available to rate randomised and non-randomised controlled trials, but there are none that assess single-subject designs. The Single-Case Experimental Design (SCED) Scale was developed for this purpose and evaluated for reliability. Six clinical researchers who were trained and experienced in rating methodological quality of clinical trials developed the scale and participated in reliability studies. The SCED Scale is an 11-item rating scale for single-subject designs, of which 10 items are used to assess methodological quality and use of statistical analysis. The scale was developed and refined over a 3-year period. Content validity was addressed by identifying items to reduce the main sources of bias in single-case methodology as stipulated by authorities in the field, which were empirically tested against 85 published reports. Inter-rater reliability was assessed using a random sample of 20/312 single-subject reports archived in the Psychological Database of Brain Impairment Treatment Efficacy (PsycBITE). Inter-rater reliability for the total score was excellent, both for individual raters (overall ICC = 0.84; 95% confidence interval 0.73-0.92) and for consensus ratings between pairs of raters (overall ICC = 0.88; 95% confidence interval 0.78-0.95). Item reliability was fair to excellent for consensus ratings between pairs of raters (range k = 0.48 to 1.00). The results were replicated with two independent novice raters who were trained in the use of the scale (ICC = 0.88, 95% confidence interval 0.73-0.95). The SCED Scale thus provides a brief and valid evaluation of methodological quality of single-subject designs, with the total score demonstrating excellent inter-rater reliability using both individual and consensus ratings. 
Items from the scale can also be used as a checklist in the design, reporting and critical appraisal of single-subject designs, thereby assisting to improve standards of single-case methodology.
Reliability, repeatability, and reproducibility of pulmonary transit time assessment by contrast enhanced echocardiography.

PubMed

Herold, Ingeborg H F; Saporito, Salvatore; Bouwman, R Arthur; Houthuizen, Patrick; van Assen, Hans C; Mischi, Massimo; Korsten, Hendrikus H M

2016-01-05

The aim of this study is to investigate the inter and intra-rater reliability, repeatability, and reproducibility of pulmonary transit time (PTT) measurement in patients using contrast enhanced ultrasound (CEUS), as an indirect measure of preload and left ventricular function. Mean transit times (MTT) were measured by drawing a region of interest (ROI) in right and left cardiac ventricle in the CEUS loops. Acoustic intensity dilution curves were obtained from the ROIs. MTTs were calculated by applying model-based fitting on the dilution curves. PTT was calculated as the difference of the MTTs. Eight raters with different levels of experience measured the PTT (time moment 1) and repeated the measurement within a week (time moment 2). Reliability and agreement were assessed using intra-class correlations (ICC) and Bland-Altman analysis. Repeatability was tested by estimating the variance of means (ANOVA) of three injections in each patient at different doses. Reproducibility was tested by the ICC of the two time moments. Fifteen patients with heart failure were included. The mean PTT was 11.8 ± 3.1 s at time moment 1 and 11.7 ± 2.9 s at time moment 2. The inter-rater reliability for PTT was excellent (ICC = 0.94). The intra-rater reliability per rater was between 0.81-0.99. Bland-Altman analysis revealed a bias of 0.10 s within the rater groups. Reproducibility for PTT showed an ICC = 0.94 between the two time moments. ANOVA showed no significant difference between the means of the three different doses F = 0.048 (P = 0.95). The mean and standard deviation for PTT estimates at three different doses was 11.6 ± 3.3 s. PTT estimation using CEUS shows a high inter- and intra-rater reliability, repeatability at three different doses, and reproducibility by ROI drawing. This makes the minimally invasive PTT measurement using contrast echocardiography ready for clinical evaluation in patients with heart failure and for preload estimation.
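The Bland-Altman analysis mentioned above summarises agreement between two sets of measurements as a mean difference (bias) with 95% limits of agreement. A minimal sketch follows, assuming paired measurements and the usual bias ± 1.96 SD convention; the function name and the sample values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(first, second):
    """Return (bias, lower limit, upper limit) for paired measurements.

    Bias is the mean of the pairwise differences; the 95% limits of
    agreement are bias +/- 1.96 times the SD of those differences.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long series with at least two pairs")
    differences = [a - b for a, b in zip(first, second)]
    bias = mean(differences)
    spread = 1.96 * stdev(differences)
    return bias, bias - spread, bias + spread


# Hypothetical transit times in seconds at two time moments.
moment_1 = [10.0, 12.0, 14.0]
moment_2 = [9.0, 12.0, 12.0]
bias, lower, upper = bland_altman(moment_1, moment_2)
```

A bias near zero with narrow limits suggests the two sets of measurements can be used interchangeably; wide limits indicate poor agreement even when the correlation is high.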
The intra- and inter-rater reliability of five clinical muscle performance tests in patients with and without neck pain

PubMed Central

2013-01-01

Background This study investigates the reliability of muscle performance tests using cost- and time-effective methods similar to those used in clinical practice. When conducting reliability studies, great effort goes into standardising test procedures to facilitate a stable outcome. Therefore, several test trials are often performed. However, when muscle performance tests are applied in the clinical setting, clinicians often only conduct a muscle performance test once as repeated testing may produce fatigue and pain, thus variation in test results. We aimed to investigate whether cervical muscle performance tests, which have shown promising psychometric properties, would remain reliable when examined under conditions similar to those of daily clinical practice. Methods The intra-rater (between-day) and inter-rater (within-day) reliability was assessed for five cervical muscle performance tests in patients with (n = 33) and without neck pain (n = 30). The five tests were joint position error, the cranio-cervical flexion test, the neck flexor muscle endurance test performed in supine and in a 45°-upright position and a new neck extensor test. Results Intra-rater reliability ranged from moderate to almost perfect agreement for joint position error (ICC ≥ 0.48-0.82), the cranio-cervical flexion test (ICC ≥ 0.69), the neck flexor muscle endurance test performed in supine (ICC ≥ 0.68) and in a 45°-upright position (ICC ≥ 0.41) with the exception of a new test (neck extensor test), which ranged from slight to moderate agreement (ICC = 0.14-0.41). Likewise, inter-rater reliability ranged from moderate to almost perfect agreement for joint position error (ICC ≥ 0.51-0.75), the cranio-cervical flexion test (ICC ≥ 0.85), the neck flexor muscle endurance test performed in supine (ICC ≥ 0.70) and in a 45°-upright position (ICC ≥ 0.56). However, only slight to fair agreement was found for the neck extensor test (ICC = 0.19-0.25). 
Conclusions Intra- and inter-rater reliability ranged from moderate to almost perfect agreement with the exception of a new test (neck extensor test), which ranged from slight to moderate agreement. The significant variability observed suggests that tests like the neck extensor test and the neck flexor muscle endurance test performed in a 45°-upright position are too unstable to be used when evaluating neck muscle performance. PMID:24299621
The reliability and validity of video analysis for the assessment of the clinical signs of concussion in Australian football.

PubMed

Makdissi, Michael; Davis, Gavin

2016-10-01

The objective of this study was to determine the reliability and validity of identifying clinical signs of concussion using video analysis in Australian football. Prospective cohort study. All impacts and collisions potentially resulting in a concussion were identified during 2012 and 2013 Australian Football League seasons. Consensus definitions were developed for clinical signs associated with concussion. For intra- and inter-rater reliability analysis, two experienced clinicians independently assessed 102 randomly selected videos on two occasions. Sensitivity, specificity, positive and negative predictive values were calculated based on the diagnosis provided by team medical staff. 212 incidents resulting in possible concussion were identified in 414 Australian Football League games. The intra-rater reliability of the video-based identification of signs associated with concussion was good to excellent. Inter-rater reliability was good to excellent for impact seizure, slow to get up, motor incoordination, ragdoll appearance (2 of 4 analyses), clutching at head and facial injury. Inter-rater reliability for loss of responsiveness and blank and vacant look was only fair and did not reach statistical significance. The feature with the highest sensitivity was slow to get up (87%), but this sign had a low specificity (19%). Other video signs had a high specificity but low sensitivity. Blank and vacant look (100%) and motor incoordination (81%) had the highest positive predictive value. Video analysis may be a useful adjunct to the side-line assessment of a possible concussion. Video analysis however should not replace the need for a thorough multimodal clinical assessment. Copyright © 2016 Sports Medicine Australia. Published by Elsevier Ltd. All rights reserved.
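Sensitivity, specificity and the predictive values quoted above all derive from a 2x2 table of video sign (present/absent) against the reference diagnosis. The sketch below shows the arithmetic. The counts are hypothetical and are not the study's data; the function name is an assumption.

```python
def diagnostic_accuracy(true_pos, false_pos, false_neg, true_neg):
    """Sensitivity, specificity, PPV and NPV from 2x2 counts.

    Assumes every marginal total is non-zero; otherwise a ratio is undefined.
    """
    if min(true_pos + false_neg, true_neg + false_pos,
           true_pos + false_pos, true_neg + false_neg) == 0:
        raise ValueError("a marginal total is zero; a measure is undefined")
    return {
        "sensitivity": true_pos / (true_pos + false_neg),  # sign present in cases
        "specificity": true_neg / (true_neg + false_pos),  # sign absent in non-cases
        "ppv": true_pos / (true_pos + false_pos),          # cases among sign-positive
        "npv": true_neg / (true_neg + false_neg),          # non-cases among sign-negative
    }


# Hypothetical counts for a sign that is common in both groups.
result = diagnostic_accuracy(true_pos=8, false_pos=30, false_neg=2, true_neg=10)
```

A sign that appears after most impacts, concussive or not, yields high sensitivity but low specificity, which matches the pattern described for "slow to get up".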
How to assess and compare inter-rater reliability, agreement and correlation of ratings: an exemplary analysis of mother-father and parent-teacher expressive vocabulary rating pairs

PubMed Central

Stolarova, Margarita; Wolf, Corinna; Rinker, Tanja; Brielmann, Aenne

2014-01-01

This report has two main purposes. First, we combine well-known analytical approaches to conduct a comprehensive assessment of agreement and correlation of rating-pairs and to disentangle these often confused concepts, providing a best-practice example on concrete data and a tutorial for future reference. Second, we explore whether a screening questionnaire developed for use with parents can be reliably employed with daycare teachers when assessing early expressive vocabulary. A total of 53 vocabulary rating pairs (34 parent–teacher and 19 mother–father pairs) collected for two-year-old children (12 bilingual) are evaluated. First, inter-rater reliability both within and across subgroups is assessed using the intra-class correlation coefficient (ICC). Next, based on this analysis of reliability and on the test-retest reliability of the employed tool, inter-rater agreement is analyzed, and the magnitude and direction of rating differences are considered. Finally, Pearson correlation coefficients of standardized vocabulary scores are calculated and compared across subgroups. The results underline the necessity to distinguish between reliability measures, agreement and correlation. They also demonstrate the impact of the employed reliability on agreement evaluations. This study provides evidence that parent–teacher ratings of children's early vocabulary can achieve agreement and correlation comparable to those of mother–father ratings on the assessed vocabulary scale. Bilingualism of the evaluated child decreased the likelihood of raters' agreement. We conclude that future reports of agreement, correlation and reliability of ratings will benefit from better definition of terms and stricter methodological approaches. The methodological tutorial provided here holds the potential to increase comparability across empirical reports and can help improve research practices and knowledge transfer to educational and therapeutic settings. PMID:24994985
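The distinction between correlation and reliability drawn above can be illustrated numerically. The sketch below computes Pearson's r and a two-way random-effects, absolute-agreement, single-rater ICC (often written ICC(2,1)) for the same ratings. The choice of ICC form is an assumption made for illustration; the abstract does not state which form the authors used, and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def icc_2_1(rows):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `rows` holds one tuple per rated child, one score per rater.
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-subject mean square
    ms_cols = ss_cols / (k - 1)                # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical pairs: the second rater always scores one point higher.
pairs = [(2, 3), (4, 5), (6, 7), (8, 9)]
r = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])
icc = icc_2_1(pairs)
```

With a constant one-point offset, Pearson's r is 1 while the absolute-agreement ICC falls below 1, because the ICC counts the systematic difference between raters as disagreement. This is the kind of divergence that motivates reporting the three concepts separately.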
An initial reliability and validity study of the Interaction, Communication, and Literacy Skills Audit.

PubMed

El-Choueifati, Nisrine; Purcell, Alison; McCabe, Patricia; Heard, Robert; Munro, Natalie

2014-06-01

Early childhood educators (ECEs) have an important role in promoting positive outcomes for children's language and literacy development. This paper reports the development of a new tool, The Interaction Communication and Literacy (ICL) Skills Audit, and pilots its reliability and validity. Intra- and inter-rater reliability was examined by three speech-language pathologists (SLPs). Five skill areas relating to ECE language and literacy practice were rated. The face and content validity of the ICL Skills Audit was examined by expert SLPs (n = 8) and expert ECEs (n = 4) via questionnaire. The overall intra-rater reliability for the ICL Skills Audit was excellent with percentage close agreement (PCA) of 91-94. Inter-rater agreement was PCA 68-80. Expert SLPs and ECEs agreed that the content was comprehensive and practical. Based on this preliminary study, the ICL Skills Audit appears to be a promising tool that can be used by SLPs and ECEs in collaboration to measure the skills of ECEs in the areas of language and literacy support. Future psychometric and outcome research on the revised ICL Skills Audit is warranted.
Reliable and valid assessment of competence in endoscopic ultrasonography and fine-needle aspiration for mediastinal staging of non-small cell lung cancer.

PubMed

Konge, L; Vilmann, P; Clementsen, P; Annema, J T; Ringsted, C

2012-10-01

Fine-needle aspiration (FNA) guided by endoscopic ultrasonography (EUS) is important in mediastinal staging of non-small cell lung cancer (NSCLC). Training standards and implementation strategies of this technique are currently under discussion. The aim of this study was to explore the reliability and validity of a newly developed EUS Assessment Tool (EUSAT) designed to measure competence in EUS-FNA for mediastinal staging of NSCLC. A total of 30 patients with proven or suspected NSCLC underwent EUS-FNA for mediastinal staging by three trainees and three experienced physicians. Their performances were assessed prospectively by three experts in EUS under direct observation and again 2 months later in a blinded fashion using digital video-recordings. Based on the assessments, intra-rater reliability, inter-rater reliability, and construct validity were explored. The intra-rater reliability was good (Cronbach's α = 0.80), but comparison of results based on direct observations and blinded video-recordings indicated a significant bias favoring consultants (P = 0.022). Inter-rater reliability was very good (Cronbach's α = 0.93). However, one rater assessing five procedures or two raters each assessing four procedures were necessary to secure a generalizability coefficient of 0.80. The assessment tool demonstrated construct validity by discriminating between trainees and experienced physicians (P = 0.034). Competency in mediastinal staging of NSCLC using EUS and EUS-FNA can be assessed in a reliable and valid way using the EUSAT assessment tool. Measuring and defining competency and training requirements could improve EUS quality and benefit patient care. © Georg Thieme Verlag KG Stuttgart · New York.
A medical record review for functional somatic symptoms in children.

PubMed

Rask, Charlotte Ulrikka; Borg, Carsten; Søndergaard, Charlotte; Schulz-Pedersen, Søren; Thomsen, Per Hove; Fink, Per

2010-04-01

The objectives of this study were to develop and test a systematic medical record review for functional somatic symptoms (FSSs) in paediatric patients and to estimate the inter-rater reliability of paediatricians' recognition of FSSs and their associated impairments while using this method. We developed the Medical Record Review for Functional Somatic Symptoms in Children (MRFC) for retrospective medical record review. Described symptoms were categorised as probably, definitely, or not FSSs. FSS-associated impairment was also determined. Three paediatricians performed the MRFC on the medical records of 54 children with a diagnosed, well-defined physical disease and 59 with 'symptom' diagnoses. The inter-rater reliabilities of the recognition and associated impairment of FSSs were tested on 20 of these records. The MRFC allowed identification of subgroups of children with multisymptomatic FSSs, long-term FSSs, and/or impairing FSSs. The FSS inter-rater reliability was good (combined kappa=0.69) but only fair as far as associated impairment was concerned (combined kappa=0.29). In the hands of skilled paediatricians, the MRFC is a reliable method for identifying paediatric patients with diverse types of FSSs for clinical research. However, additional information is needed for reliable judgement of impairment. The method may also prove useful in clinical practice. Copyright 2010 Elsevier Inc. All rights reserved.
Validation of the one pass measure for motivational interviewing competence.

PubMed

McMaster, Fiona; Resnicow, Ken

2015-04-01

This paper examines the psychometric properties of the OnePass coding system: a new, user-friendly tool for evaluating practitioner competence in motivational interviewing (MI). We provide data on reliability and validity with the current gold-standard: Motivational Interviewing Treatment Integrity tool (MITI). We compared scores from 27 videotaped MI sessions performed by student counselors trained in MI and simulated patients using both OnePass and MITI, with three different raters for each tool. Reliability was estimated using intra-class coefficients (ICCs), and validity was assessed using Pearson's r. OnePass had high levels of inter-rater reliability with 19/23 items found from substantial to almost perfect agreement. Taking the pair of scores with the highest inter-rater reliability on the MITI, the concurrent validity between the two measures ranged from moderate to high. Validity was highest for evocation, autonomy, direction and empathy. OnePass appears to have good inter-rater reliability while capturing similar dimensions of MI as the MITI. Despite the moderate concurrent validity with the MITI, the OnePass shows promise in evaluating both traditional and novel interpretations of MI. OnePass may be a useful tool for developing and improving practitioner competence in MI where access to MITI coders is limited. Copyright © 2015. Published by Elsevier Ireland Ltd.
Consensus Conference Follow-up: Inter-rater Reliability Assessment of the Best Evidence in Emergency Medicine (BEEM) Rater Scale, a Medical Literature Rating Tool for Emergency Physicians

PubMed Central

Worster, Andrew; Kulasegaram, Kulamakan; Carpenter, Christopher R.; Vallera, Teresa; Upadhye, Suneel; Sherbino, Jonathan; Haynes, R. Brian

2011-01-01

Background Studies published in general and specialty medical journals have the potential to improve emergency medicine (EM) practice, but there can be delayed awareness of this evidence because emergency physicians (EPs) are unlikely to read most of these journals. Also, not all published studies are intended for or ready for clinical practice application. The authors developed “Best Evidence in Emergency Medicine” (BEEM) to ameliorate these problems by searching for, identifying, appraising, and translating potentially practice-changing studies for EPs. An initial step in the BEEM process is the BEEM rater scale, a novel tool for EPs to collectively evaluate the relative clinical relevance of EM-related studies found in more than 120 journals. The BEEM rater process was designed to serve as a clinical relevance filter to identify those studies with the greatest potential to affect EM practice. Therefore, only those studies identified by BEEM raters as having the highest clinical relevance are selected for the subsequent critical appraisal process and, if found methodologically sound, are promoted as the best evidence in EM. Objectives The primary objective was to measure inter-rater reliability (IRR) of the BEEM rater scale. Secondary objectives were to determine the minimum number of EP raters needed for the BEEM rater scale to achieve acceptable reliability and to compare performance of the scale against a previously published evidence rating system, the McMaster Online Rating of Evidence (MORE), in an EP population. Methods The authors electronically distributed the title, conclusion, and a PubMed link for 23 recently published studies related to EM to a volunteer group of 134 EPs. The volunteers answered two demographic questions and rated the articles using one of two randomly assigned seven-point Likert scales, the BEEM rater scale (n = 68) or the MORE scale (n = 66), over two separate administrations. 
The IRR of each scale was measured using generalizability theory. Results The IRR of the BEEM rater scale ranged from 0.90 (95% confidence interval [CI] = 0.86 to 0.93) to 0.92 (95% CI = 0.89 to 0.94) across administrations. Decision studies showed that a minimum of 12 raters is required for acceptable reliability of the BEEM rater scale. The IRR of the MORE scale was 0.82 to 0.84. Conclusions The BEEM rater scale is a highly reliable, single-question tool for a small number of EPs to collectively rate the relative clinical relevance within the specialty of EM of recently published studies from a variety of medical journals. It compares favorably with the MORE system because it achieves a high IRR despite simply requiring raters to read each article’s title and conclusion. PMID:22092904
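As a rough illustration of how a decision study arrives at a minimum number of raters, the sketch below assumes the simplest one-facet design (articles crossed with raters), in which the reliability of the mean rating over n raters is var_object / (var_object + var_error / n). The function names and the variance components are hypothetical and are not taken from the BEEM study, whose actual design may have been more elaborate.

```python
def g_coefficient(var_object, var_error, n_raters):
    """Reliability of the mean rating over n_raters in a one-facet crossed design."""
    if var_object <= 0 or var_error < 0 or n_raters < 1:
        raise ValueError("variances must be non-negative and n_raters >= 1")
    return var_object / (var_object + var_error / n_raters)


def min_raters(var_object, var_error, target, max_raters=1000):
    """Smallest number of raters whose averaged rating reaches the target reliability."""
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    for n in range(1, max_raters + 1):
        if g_coefficient(var_object, var_error, n) >= target:
            return n
    raise ValueError("target not reached within max_raters")


# Hypothetical variance components (NOT values from the BEEM study).
VAR_ARTICLE = 1.0   # variance due to true differences between articles
VAR_ERROR = 1.5     # rater-related error variance for a single rater

print(min_raters(VAR_ARTICLE, VAR_ERROR, target=0.9))
```

Averaging over more raters shrinks only the error term, which is why a scale with modest single-rater reliability can still be highly reliable for a panel.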
Psychometric Evaluation of the D-Catch, an Instrument to Measure the Accuracy of Nursing Documentation.

PubMed

D'Agostino, Fabio; Barbaranelli, Claudio; Paans, Wolter; Belsito, Romina; Juarez Vela, Raul; Alvaro, Rosaria; Vellone, Ercole

2017-07-01

To evaluate the psychometric properties of the D-Catch instrument. A cross-sectional methodological study. Validity and reliability were estimated with confirmatory factor analysis (CFA) and internal consistency and inter-rater reliability, respectively. A sample of 250 nursing documentations was selected. CFA showed the adequacy of a 1-factor model (chronologically descriptive accuracy) with an outlier item (nursing diagnosis accuracy). Internal consistency and inter-rater reliability were adequate. The D-Catch is a valid and reliable instrument for measuring the accuracy of nursing documentation. Caution is needed when measuring diagnostic accuracy since only one item measures this dimension. The D-Catch can be used as an indicator of the accuracy of nursing documentation and the quality of nursing care. © 2015 NANDA International, Inc.
Towards an Operational Definition of Clinical Competency in Pharmacy

PubMed Central

2015-01-01

Objective. To estimate the inter-rater reliability and accuracy of ratings of competence in student pharmacist/patient clinical interactions as depicted in videotaped simulations and to compare expert panelist and typical preceptor ratings of those interactions. Methods. This study used a multifactorial experimental design to estimate inter-rater reliability and accuracy of preceptors’ assessment of student performance in clinical simulations. The study protocol used nine 5-10 minute video vignettes portraying different levels of competency in student performance in simulated clinical interactions. Intra-Class Correlation (ICC) was used to calculate inter-rater reliability and Fisher exact test was used to compare differences in distribution of scores between expert and nonexpert assessments. Results. Preceptors (n=42) across 5 states assessed the simulated performances. Intra-Class Correlation estimates were higher for 3 nonrandomized video simulations compared to the 6 randomized simulations. Preceptors more readily identified high and low student performances compared to satisfactory performances. In nearly two-thirds of the rating opportunities, a higher proportion of expert panelists than preceptors rated the student performance correctly (18 of 27 scenarios). Conclusion. Valid and reliable assessments are critically important because they affect student grades and formative student feedback. Study results indicate the need for pharmacy preceptor training in performance assessment. The process demonstrated in this study can be used to establish minimum preceptor benchmarks for future national training programs. PMID:26089563
Color-coded fluid-attenuated inversion recovery images improve inter-rater reliability of fluid-attenuated inversion recovery signal changes within acute diffusion-weighted image lesions.

PubMed

Kim, Bum Joon; Kim, Yong-Hwan; Kim, Yeon-Jung; Ahn, Sung Ho; Lee, Deok Hee; Kwon, Sun U; Kim, Sang Joon; Kim, Jong S; Kang, Dong-Wha

2014-09-01

Diffusion-weighted image fluid-attenuated inversion recovery (FLAIR) mismatch has been considered to represent ischemic lesion age. However, the inter-rater agreement of diffusion-weighted image FLAIR mismatch is low. We hypothesized that color-coded images would increase its inter-rater agreement. Patients with ischemic stroke <24 hours of a clear onset were retrospectively studied. FLAIR signal change was rated as negative, subtle, or obvious on conventional and color-coded FLAIR images based on visual inspection. Inter-rater agreement was evaluated using κ and percent agreement. The predictive value of diffusion-weighted image FLAIR mismatch for identification of patients <4.5 hours of symptom onset was evaluated. One hundred and thirteen patients were enrolled. The inter-rater agreement of FLAIR signal change improved from 69.9% (k=0.538) with conventional images to 85.8% (k=0.754) with color-coded images (P=0.004). Discrepantly rated patients on conventional, but not on color-coded images, had a higher prevalence of cardioembolic stroke (P=0.02) and cortical infarction (P=0.04). The positive predictive value for patients <4.5 hours of onset was 85.3% and 71.9% with conventional and 95.7% and 82.1% with color-coded images, by each rater. Color-coded FLAIR images increased the inter-rater agreement of diffusion-weighted image FLAIR recovery mismatch and may ultimately help identify unknown-onset stroke patients appropriate for thrombolysis. © 2014 American Heart Association, Inc.
Geometric classification of scalp hair for valid drug testing, 6 more reliable than 8 hair curl groups

PubMed Central

Mkentane, K.; Gumedze, F.; Ngoepe, M.; Davids, L. M.; Khumalo, N. P.

2017-01-01

Introduction Curly hair is reported to contain higher lipid content than straight hair, which may influence incorporation of lipid soluble drugs. The use of race to describe hair curl variation (Asian, Caucasian and African) is unscientific yet common in medical literature (including reports of drug levels in hair). This study investigated the reliability of a geometric classification of hair (based on 3 measurements: the curve diameter, curl index and number of waves). Materials and methods After ethical approval and informed consent, proximal virgin (6cm) hair sampled from the vertex of scalp in 48 healthy volunteers were evaluated. Three raters each scored hairs from 48 volunteers at two occasions each for the 8 and 6-group classifications. One rater applied the 6-group classification to 80 additional volunteers in order to further confirm the reliability of this system. The Kappa statistic was used to assess intra and inter rater agreement. Results Each rater classified 480 hairs on each occasion. No rater classified any volunteer’s 10 hairs into the same group; the most frequently occurring group was used for analysis. The inter-rater agreement was poor for the 8-groups (k = 0.418) but improved for the 6-groups (k = 0.671). The intra-rater agreement also improved (k = 0.444 to 0.648 versus 0.599 to 0.836) for 6-groups; that for the one evaluator for all volunteers was good (k = 0.754). Conclusions Although small, this is the first study to test the reliability of a geometric classification. The 6-group method is more reliable. However, a digital classification system is likely to reduce operator error. A reliable objective classification of human hair curl is long overdue, particularly with the increasing use of hair as a testing substrate for treatment compliance in Medicine. PMID:28570555
Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial

PubMed Central

Hallgren, Kevin A.

2012-01-01

Many research designs require the assessment of inter-rater reliability (IRR) to demonstrate consistency among observational ratings provided by multiple coders. However, many studies use incorrect statistical procedures, fail to fully report the information necessary to interpret their results, or do not address how IRR affects the power of their subsequent analyses for hypothesis testing. This paper provides an overview of methodological issues related to the assessment of IRR with a focus on study design, selection of appropriate statistics, and the computation, interpretation, and reporting of some commonly-used IRR statistics. Computational examples include SPSS and R syntax for computing Cohen’s kappa and intra-class correlations to assess IRR. PMID:22833776
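The tutorial's own computational examples are in SPSS and R. As a companion, here is a minimal Python sketch of Cohen's kappa for two raters assigning nominal categories. The function name and the ratings are hypothetical and are not drawn from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each label the same subjects once."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of subjects")
    n = len(rater_a)
    # Observed proportion of subjects on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal category frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for ten subjects.
coder_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
coder_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
print(round(cohens_kappa(coder_1, coder_2), 3))
```

In this made-up example the raters agree on 8 of 10 subjects (80%), yet kappa is noticeably lower because about half of that agreement would be expected by chance.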

Intra- and inter-rater reliability of digital image analysis for skin color measurement

PubMed Central

Sommers, Marilyn; Beacham, Barbara; Baker, Rachel; Fargo, Jamison

2013-01-01

Background We determined the intra- and inter-rater reliability of data from digital image color analysis between an expert and novice analyst. Methods Following training, the expert and novice independently analyzed 210 randomly ordered images. Both analysts used Adobe® Photoshop lasso or color sampler tools based on the type of image file. After color correction with Pictocolor® in camera software, they recorded L*a*b* (L*=light/dark; a*=red/green; b*=yellow/blue) color values for all skin sites. We computed intra-rater and inter-rater agreement within anatomical region, color value (L*, a*, b*), and technique (lasso, color sampler) using a series of one-way intra-class correlation coefficients (ICCs). Results Results of ICCs for intra-rater agreement showed high levels of internal consistency reliability within each rater for the lasso technique (ICC ≥ 0.99) and somewhat lower, yet acceptable, level of agreement for the color sampler technique (ICC = 0.91 for expert, ICC = 0.81 for novice). Skin L*, skin b*, and labia L* values reached the highest level of agreement (ICC ≥ 0.92) and skin a*, labia b*, and vaginal wall b* were the lowest (ICC ≥ 0.64). Conclusion Data from novice analysts can achieve high levels of agreement with data from expert analysts with training and the use of a detailed, standard protocol. PMID:23551208
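For readers unfamiliar with the one-way ICC mentioned above, the following sketch computes the single-measure one-way random-effects coefficient, often written ICC(1,1), from the between-target and within-target mean squares. The abstract does not say which one-way form was used or give raw data, so both the choice of ICC(1,1) and the values below are assumptions for illustration.

```python
def icc_one_way(ratings):
    """ICC(1,1): one-way random effects, single measure.

    ratings is a list of rows, one row per target, each holding k ratings.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 targets, each with the same k >= 2 ratings")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Mean square between targets and mean square within targets.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_between - ms_within) / denominator


# Hypothetical colour values for five skin sites, each measured by two analysts.
measurements = [[9, 10], [6, 7], [8, 8], [4, 5], [7, 6]]
print(round(icc_one_way(measurements), 3))
```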
Intra- and inter-rater reliability of digital image analysis for skin color measurement.

PubMed

Sommers, Marilyn; Beacham, Barbara; Baker, Rachel; Fargo, Jamison

2013-11-01

We determined the intra- and inter-rater reliability of data from digital image color analysis between an expert and novice analyst. Following training, the expert and novice independently analyzed 210 randomly ordered images. Both analysts used Adobe® Photoshop lasso or color sampler tools based on the type of image file. After color correction with Pictocolor® in camera software, they recorded L*a*b* (L*=light/dark; a*=red/green; b*=yellow/blue) color values for all skin sites. We computed intra-rater and inter-rater agreement within anatomical region, color value (L*, a*, b*), and technique (lasso, color sampler) using a series of one-way intra-class correlation coefficients (ICCs). Results of ICCs for intra-rater agreement showed high levels of internal consistency reliability within each rater for the lasso technique (ICC ≥ 0.99) and somewhat lower, yet acceptable, level of agreement for the color sampler technique (ICC = 0.91 for expert, ICC = 0.81 for novice). Skin L*, skin b*, and labia L* values reached the highest level of agreement (ICC ≥ 0.92) and skin a*, labia b*, and vaginal wall b* were the lowest (ICC ≥ 0.64). Data from novice analysts can achieve high levels of agreement with data from expert analysts with training and the use of a detailed, standard protocol. © 2013 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Reliability of Untrained and Experienced Raters on FEES: Rating Overall Residue is a Simple Task.

PubMed

Pisegna, Jessica M; Borders, James C; Kaneoka, Asako; Coster, Wendy J; Leonard, Rebecca; Langmore, Susan E

2018-03-07

The purpose of this study was to investigate the reliability of residue ratings on Fiberoptic Endoscopic Evaluation of Swallowing (FEES). We also examined rating differences based on experience to determine if years of experience influenced residue ratings. A group of 44 raters watched 81 FEES videos representing a wide range of residue severities for thin liquid, applesauce, and cracker boluses. Raters were untrained on the rating scales and simply rated their overall impression of residue amount on a visual analog scale (VAS) and a five-point ordinal scale in a randomized fashion across two sessions. Intra-class correlation coefficients, kappa coefficients, and ANOVAs were used to analyze agreement and differences in ratings. Residue ratings on both the VAS and ordinal scales had acceptable inter- and intra-rater reliability. Inter-rater agreement was acceptable (ICC > 0.7) for all comparisons. Intra-rater agreement was excellent on the VAS scale (r_c = 0.9) and good on the ordinal scale (k = 0.78). There was no significant difference between expert ratings and other raters based on years of experience for cracker ratings (p = 0.2119) and applesauce ratings (p = 0.2899), but there was a significant difference between clinicians on thin liquid ratings (p = 0.0005). Without any specific training, raters demonstrated high reliability when rating the overall amount of residue on FEES. Years of experience with FEES did not influence residue ratings, suggesting that expert ratings of overall residue amount are not unique or specialized. Rating the overall amount of residue on FEES appears to be a simple visual-perceptual task for puree and cracker boluses.
Validity and Reliability of the Clinical Competency Evaluation Instrument for Use among Physiotherapy Students: Pilot study.

PubMed

Muhamad, Zailani; Ramli, Ayiesah; Amat, Salleh

2015-05-01

The aim of this study was to determine the content validity, internal consistency, test-retest reliability and inter-rater reliability of the Clinical Competency Evaluation Instrument (CCEVI) in assessing the clinical performance of physiotherapy students. This study was carried out between June and September 2013 at University Kebangsaan Malaysia (UKM), Kuala Lumpur, Malaysia. A panel of 10 experts was identified to establish content validity by evaluating and rating each of the items used in the CCEVI with regard to their relevance in measuring students' clinical competency. A total of 50 UKM undergraduate physiotherapy students were assessed throughout their clinical placement to determine the construct validity of these items. The instrument's reliability was determined through a cross-sectional study involving a clinical performance assessment of 14 final-year undergraduate physiotherapy students. The content validity index of the entire CCEVI was 0.91, while the proportion of agreement on the content validity indices ranged from 0.83 to 1.00. The CCEVI construct validity was established with factor loading of ≥0.6, while internal consistency (Cronbach's alpha) overall was 0.97. Test-retest reliability of the CCEVI was confirmed with a Pearson's correlation range of 0.91-0.97 and an intraclass correlation coefficient range of 0.95-0.98. Inter-rater reliability of the CCEVI domains ranged from 0.59 to 0.97 on initial and subsequent assessments. This pilot study confirmed the content validity of the CCEVI. It showed high internal consistency, thereby providing evidence that the CCEVI has moderate to excellent inter-rater reliability. However, additional refinement in the wording of the CCEVI items, particularly in the domains of safety and documentation, is recommended to further improve the validity and reliability of the instrument.
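The internal-consistency figure quoted above is Cronbach's alpha. A minimal sketch of the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), is shown below. The function name and the scores are hypothetical and unrelated to the CCEVI data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha; scores holds one row per respondent, one column per item."""
    n_items = len(scores[0]) if scores else 0
    if len(scores) < 2 or n_items < 2 or any(len(row) != n_items for row in scores):
        raise ValueError("need at least 2 respondents and 2 items per respondent")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all total scores are equal")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: five students scored on three items.
item_scores = [[3, 4, 3], [4, 5, 5], [2, 2, 3], [5, 5, 4], [3, 3, 3]]
print(round(cronbach_alpha(item_scores), 3))
```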
Reliability of the Community Balance and Mobility Scale (CB&M) in high-functioning school-aged children and adolescents who have an acquired brain injury.

PubMed

Wright, F Virginia; Ryan, Jennifer; Brewer, Kelly

2010-01-01

To examine inter-rater, intra-rater and test-re-test reliability of the Community Balance and Mobility Scale (CB&M) and compare reliability in live vs videotape rating contexts for children with acquired brain injury (ABI). Repeated measures design. Seven physiotherapists (PTs) were trained as assessors. The primary assessor administered and scored baseline CB&M while the second assessor observed and scored independently (inter-rater reliability). Re-assessment occurred 3-10 days later by primary assessor (test-re-test reliability). Assessments were videotaped. There were 32 participants with ABI (mean age = 14 years 1 month (SD = 2 years 1 month)). Baseline mean scores were 67.4% (18.2) and 66.7% (18.3) for primary and second assessor, respectively. Primary assessors' re-test mean score was 69.3%. Inter-rater reliability ICC was 0.93 (95% confidence interval (CI) = 0.87-0.97). Test-re-test ICC was 0.90 (95%CI = 0.81-0.95) and Bland-Altman plot indicated greatest test-re-test differences for mid-range CB&M scores. Minimum detectable change (MDC₉₀) was 13.5% points. The CB&M showed excellent reliability in youth. Reliability was comparable for live and videotape rating approaches, meaning that the easier and less expensive live-rating can be recommended. Future work should focus on evaluation of responsiveness to change in rehabilitation centre and community intervention contexts.
Reliability of Autism-Tics, AD/HD, and other Comorbidities (A-TAC) inventory in a test-retest design.

PubMed

Larson, Tomas; Kerekes, Nóra; Selinus, Eva Norén; Lichtenstein, Paul; Gumpert, Clara Hellner; Anckarsäter, Henrik; Nilsson, Thomas; Lundström, Sebastian

2014-02-01

The Autism-Tics, AD/HD, and other Comorbidities (A-TAC) inventory is used in epidemiological research to assess neurodevelopmental problems and coexisting conditions. Although the A-TAC has been applied in various populations, data on retest reliability are limited. The objective of the present study was to present additional reliability data. The A-TAC was administered by lay assessors and was completed on two occasions by parents of 400 individual twins, with an average interval of 70 days between test sessions. Intra- and inter-rater reliability were analysed with intraclass correlations and Cohen's kappa. A-TAC showed excellent test-retest intraclass correlations for both autism spectrum disorder and attention deficit hyperactivity disorder (each at .84). Most modules in the A-TAC had intra- and inter-rater reliability intraclass correlation coefficients of ≥ .60. Cohen's kappa indicated acceptable reliability. The current study provides statistical evidence that the A-TAC yields good test-retest reliability in a population-based cohort of children.
Exploring students' perceptions on the use of significant event analysis, as part of a portfolio assessment process in general practice, as a tool for learning how to use reflection in learning

PubMed Central

Grant, Andrew J; Vermunt, Jan D; Kinnersley, Paul; Houston, Helen

2007-01-01

Background Portfolio learning enables students to collect evidence of their learning. Component tasks making up a portfolio can be devised that relate directly to intended learning outcomes. Reflective tasks can stimulate students to recognise their own learning needs. Assessment of portfolios using a rating scale relating to intended learning outcomes offers high content validity. This study evaluated a reflective portfolio used during a final-year attachment in general practice (family medicine). Students were asked to evaluate the portfolio (which used significant event analysis as a basis for reflection) as a learning tool. The validity and reliability of the portfolio as an assessment tool were also measured. Methods 81 final-year medical students completed reflective significant event analyses as part of a portfolio created during a three-week attachment (clerkship) in general practice (family medicine). As well as two reflective significant event analyses each portfolio contained an audit and a health needs assessment. Portfolios were marked three times; by the student's GP teacher, the course organiser and by another teacher in the university department of general practice. Inter-rater reliability between pairs of markers was calculated. A questionnaire enabled the students' experience of portfolio learning to be determined. Results Benefits to learning from reflective learning were limited. Students said that they thought more about the patients they wrote up in significant event analyses but information as to the nature and effect of this was not forthcoming. Moderate inter-rater reliability (Spearman's Rho .65) was found between pairs of departmental raters dealing with larger numbers (20 – 60) of portfolios. Inter-rater reliability of marking involving GP tutors who only marked 1 – 3 portfolios was very low. Students rated highly their mentoring relationship with their GP teacher but found the portfolio tasks time-consuming. 
Conclusion The inter-rater reliability observed in this study should be viewed alongside the high validity afforded by the authenticity of the learning tasks (compared with a sample of a student's learning taken by an exam question). Validity is enhanced by the rating scale which directly connects the grade given with intended learning outcomes. The moderate inter-rater reliability may be increased if a portfolio is completed over a longer period of time and contains more component pieces of work. The questionnaire used in this study only accessed limited information about the effect of reflection on students' learning. Qualitative methods of evaluation would determine the students experience in greater depth. It would be useful to evaluate the effects of reflective learning after students have had more time to get used to this unfamiliar method of learning and to overcome any problems in understanding the task. PMID:17397544
Exploring students' perceptions on the use of significant event analysis, as part of a portfolio assessment process in general practice, as a tool for learning how to use reflection in learning.

PubMed

Grant, Andrew J; Vermunt, Jan D; Kinnersley, Paul; Houston, Helen

2007-03-30

Portfolio learning enables students to collect evidence of their learning. Component tasks making up a portfolio can be devised that relate directly to intended learning outcomes. Reflective tasks can stimulate students to recognise their own learning needs. Assessment of portfolios using a rating scale relating to intended learning outcomes offers high content validity. This study evaluated a reflective portfolio used during a final-year attachment in general practice (family medicine). Students were asked to evaluate the portfolio (which used significant event analysis as a basis for reflection) as a learning tool. The validity and reliability of the portfolio as an assessment tool were also measured. 81 final-year medical students completed reflective significant event analyses as part of a portfolio created during a three-week attachment (clerkship) in general practice (family medicine). As well as two reflective significant event analyses each portfolio contained an audit and a health needs assessment. Portfolios were marked three times; by the student's GP teacher, the course organiser and by another teacher in the university department of general practice. Inter-rater reliability between pairs of markers was calculated. A questionnaire enabled the students' experience of portfolio learning to be determined. Benefits to learning from reflective learning were limited. Students said that they thought more about the patients they wrote up in significant event analyses but information as to the nature and effect of this was not forthcoming. Moderate inter-rater reliability (Spearman's Rho .65) was found between pairs of departmental raters dealing with larger numbers (20-60) of portfolios. Inter-rater reliability of marking involving GP tutors who only marked 1-3 portfolios was very low. Students rated highly their mentoring relationship with their GP teacher but found the portfolio tasks time-consuming. 
The inter-rater reliability observed in this study should be viewed alongside the high validity afforded by the authenticity of the learning tasks (compared with a sample of a student's learning taken by an exam question). Validity is enhanced by the rating scale which directly connects the grade given with intended learning outcomes. The moderate inter-rater reliability may be increased if a portfolio is completed over a longer period of time and contains more component pieces of work. The questionnaire used in this study only accessed limited information about the effect of reflection on students' learning. Qualitative methods of evaluation would determine the students experience in greater depth. It would be useful to evaluate the effects of reflective learning after students have had more time to get used to this unfamiliar method of learning and to overcome any problems in understanding the task.
Reliability and concurrent validity of a new iPhone® goniometric application for measuring active wrist range of motion: a cross-sectional study in asymptomatic subjects.

PubMed

Pourahmadi, Mohammad Reza; Ebrahimi Takamjani, Ismail; Sarrafzadeh, Javad; Bahramian, Mehrdad; Mohseni-Bandpei, Mohammad Ali; Rajabzadeh, Fatemeh; Taghipour, Morteza

2017-03-01

Measurement of wrist range of motion (ROM) is often considered to be an essential component of wrist physical examination. The measurement can be carried out through various instruments such as goniometers and inclinometers. Recent smartphones have been equipped with accelerometers and magnetometers, which, through specific software applications (apps) can be used for goniometric functions. This study, for the first time, aimed to evaluate the reliability and concurrent validity of a new smartphone goniometric app (Goniometer Pro©) for measuring active wrist ROM. In all, 120 wrists of 70 asymptomatic adults (38 men and 32 women; aged 18-40 years) were assessed in a physiotherapy clinic located at the School of Rehabilitation Sciences, Iran University of Medical Science and Health Services, Tehran, Iran. Following the recruitment process, active wrist ROM was measured using a universal goniometer and iPhone® 5 app. Two blinded examiners each utilized the universal goniometer and iPhone® to measure active wrist ROM using a volar/dorsal alignment technique in the following sequences: flexion, extension, radial deviation, and ulnar deviation. The second (2 h later) and third (48 h later) sessions were carried out in the same manner as the first session. All the measurements were conducted three times and the mean value of three repetitions for each measurement was used for analysis. Intraclass correlation coefficient (ICC) models (3, k) and (2, k) were used to determine the intra-rater and inter-rater reliability, respectively. The Pearson correlation coefficients were used to establish concurrent validity of the iPhone® app. Good to excellent intra-rater and inter-rater reliability was demonstrated for the goniometer with ICC values of ≥ 0.82 and ≥ 0.73 and the iPhone® app with ICC values of ≥ 0.83 and ≥ 0.79, respectively. Minimum detectable change at the 95% confidence level (MDC95) was computed as 1.96 × standard error of measurement × √2. 
The MDC95 ranged from 1.66° to 5.35° for the intra-rater analysis and from 1.97° to 6.15° for the inter-rater analysis. The concurrent validity between the two instruments was high, with r values of ≥ 0.80. From the results of this cross-sectional study, it can be concluded that the iPhone® app possesses good to excellent intra-rater and inter-rater reliability and concurrent validity. It seems that this app can be used for the measurement of wrist ROM. However, further research is needed to evaluate symptomatic subjects using this app. © 2016 Anatomical Society.
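The minimum detectable change formula quoted in the abstract (1.96 × standard error of measurement × √2) can be sketched as follows. The abstract does not say how the standard error of measurement (SEM) was obtained. The sketch assumes the commonly used estimate SEM = SD × √(1 − ICC), and the standard deviation and ICC below are hypothetical rather than values from the study.

```python
import math


def standard_error_of_measurement(sd, icc):
    """SEM estimated as SD * sqrt(1 - ICC); this estimator is an assumption here."""
    if sd < 0 or not 0 <= icc <= 1:
        raise ValueError("sd must be non-negative and icc must lie in [0, 1]")
    return sd * math.sqrt(1 - icc)


def minimal_detectable_change(sem, z=1.96):
    """MDC = z * SEM * sqrt(2); z = 1.96 gives the 95% level used in the abstract."""
    return z * sem * math.sqrt(2)


# Hypothetical inputs: SD of 5.0 degrees across subjects and an ICC of 0.84.
sem = standard_error_of_measurement(sd=5.0, icc=0.84)
mdc95 = minimal_detectable_change(sem)
print(round(sem, 2), round(mdc95, 2))
```

The √2 factor reflects that a change score is the difference of two measurements, each carrying its own measurement error.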
The development and testing of a qualitative instrument designed to assess critical thinking

NASA Astrophysics Data System (ADS)

Clauson, Cynthia Louisa

This study examined a qualitative approach to assess critical thinking. An instrument was developed that incorporates an assessment process based on Dewey's (1933) concepts of self-reflection and critical thinking as problem solving. The study was designed to pilot test the critical thinking assessment process with writing samples collected from a heterogeneous group of students. The pilot test included two phases. Phase 1 was designed to determine the validity and inter-rater reliability of the instrument using two experts in critical thinking, problem solving, and literacy development. Validity of the instrument was addressed by requesting both experts to respond to ten questions in an interview. The inter-rater reliability was assessed by analyzing the consistency of the two experts' scorings of the 20 writing samples to each other, as well as to my scoring of the same 20 writing samples. Statistical analyses included the Spearman Rho and the Kuder-Richardson (Formula 20). Phase 2 was designed to determine the validity and reliability of the critical thinking assessment process with seven science teachers. Validity was addressed by requesting the teachers to respond to ten questions in a survey and interview. Inter-rater reliability was addressed by comparing the seven teachers' scoring of five writing samples with my scoring of the same five writing samples. Again, the Spearman Rho and the Kuder-Richardson (Formula 20) were used to determine the inter-rater reliability. The validity results suggest that the instrument is helpful as a guide for instruction and provides a systematic method to teach and assess critical thinking while problem solving with students in the classroom. The reliability results show the critical thinking assessment instrument to possess fairly high reliability when used by the experts, but weak reliability when used by classroom teachers. 
A major conclusion was drawn that teachers, as well as students, would need to receive instruction in critical thinking and in how to use the assessment process in order to gain more consistent interpretations of the six problem-solving steps. Specific changes needing to be made in the instrument to improve the quality are included.
Comparing the ability of OPTION(12) and OPTION(5) to assess shared decision-making in genetic counselling.

PubMed

Vortel, Martina A; Adam, Shelin; Port-Thompson, Ashley V; Friedman, Jan M; Grande, Stuart W; Birch, Patricia H

2016-10-01

OPTION(12) is the most widely used tool to measure shared decision-making (SDM) in health care. A newer scale, OPTION(5), has been proposed as a more parsimonious measure that better addresses core concepts of SDM. This study compares OPTION(5) to OPTION(12) in prenatal genetic counselling. Two raters independently used OPTION(12) and OPTION(5) to score 27 clinical encounters between genetic counsellors (GC) and women with pregnancies at increased risk for genetic conditions. Global and item scores on the two instruments were compared to test concurrent validity and to identify usability in this context. Inter-rater reliability was also assessed for both instruments. Mean scores for OPTION(12) were 43.8 (SD=9.7), and for OPTION(5) were 60.6 (SD=12.5). The correlation between OPTION(12) and OPTION(5) scores was r=0.70. Inter-rater reliability was 0.70 and 0.85 for OPTION(12) and OPTION(5), respectively; however, mean inter-rater reliability for individual items was 0.31 and 0.63 for OPTION(12) and OPTION(5), respectively. GCs exhibit SDM as measured by both OPTION instruments. OPTION(5) exhibits improved psychometric performance relative to OPTION(12), and more specifically targets the core constructs of SDM. However, refinement of OPTION instruments or manuals is needed to improve reliability and validity in GC assessment. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Focused physician-performed echocardiography in sports medicine: a potential screening tool for detecting aortic root dilatation in athletes.

PubMed

Yim, Eugene S; Kao, Daniel; Gillis, Edward F; Basilico, Frederick C; Corrado, Gianmichael D

2013-12-01

The purpose of this study was to investigate whether sports medicine physicians can obtain accurate measurements of the aortic root in young athletes. Twenty male collegiate athletes, aged 18 to 21 years, were prospectively enrolled. Focused echocardiography was performed by a board-certified sports medicine physician and a medical student, followed by comprehensive echocardiography within 2 weeks by a cardiac sonographer. A left parasternal long-axis view was acquired to measure the aortic root diameter at the sinuses of Valsalva. Intraclass correlation coefficients (ICCs) were used to assess inter-rater reliability compared to a reference standard and intra-rater reliability of repeated measurements obtained by the sports medicine physician and medical student. The ICCs between the sports medicine physician and cardiac sonographer and between the medical student and cardiac sonographer were strong: 0.80 and 0.76, respectively. Across all 3 readers, the ICC was 0.89, indicating strong inter-rater reliability and concordance. The ICC for the 2 measurements taken by the sports medicine physician for each athlete was 0.75, indicating strong intra-rater reliability. The medical student had moderate intra-rater reliability, with an ICC of 0.59. Sports medicine physicians are able to obtain measurements of the aortic root by focused echocardiography that are consistent with those obtained by a cardiac sonographer. Focused physician-performed echocardiography may serve as a promising technique for detecting aortic root dilatation and may contribute in this manner to preparticipation cardiovascular screening for athletes.
RELIABILITY OF ANKLE-FOOT MORPHOLOGY, MOBILITY, STRENGTH, AND MOTOR PERFORMANCE MEASURES.

PubMed

Fraser, John J; Koldenhoven, Rachel M; Saliba, Susan A; Hertel, Jay

2017-12-01

Assessments of foot posture, morphology, intersegmental mobility, strength and motor control of the ankle-foot complex are commonly used clinically, but measurement properties of many assessments are unclear. To determine test-retest and inter-rater reliability, standard error of measurement, and minimal detectable change of morphology, joint excursion and play, strength, and motor control of the ankle-foot complex. Reliability study. 24 healthy, recreationally-active young adults without history of ankle-foot injury were assessed by two clinicians on two occasions, three to ten days apart. Measurement properties were assessed for foot morphology (foot posture index, total and truncated length, width, arch height), joint excursion (weight-bearing dorsiflexion, rearfoot and hallux goniometry, forefoot inclinometry, 1st metatarsal displacement) and joint play, strength (handheld dynamometry), and motor control rating during intrinsic foot muscle (IFM) exercises. Clinician order was randomized using a Latin Square. The clinicians performed independent examinations and did not confer on the findings for the duration of the study. Test-retest and inter-tester reliability and agreement were assessed using intraclass correlation coefficients (ICC(2,k)) and weighted kappa (Kw). Test-retest reliability ICC were as follows: morphology: .80 to 1.00, joint excursion: .58 to .97, joint play: -.67 to .84, strength: .67 to .92, IFM motor rating: Kw -.01 to .71. Inter-rater reliability ICC were as follows: morphology: .81 to 1.00, joint excursion: .32 to .97, joint play: -1.06 to 1.00, strength: .53 to .90, and IFM motor rating: Kw .02 to .56. Measures of ankle-foot posture, morphology, joint excursion, and strength demonstrated fair to excellent test-retest and inter-rater reliability. Test-retest reliability for rating of perceived difficulty and motor performance was good to excellent for short-foot, toe-spread-out, and hallux exercises and poor to fair for lesser toe extension. 
Joint play measures had poor to fair reliability overall. The findings of this study should be considered when choosing methods of clinical assessment and outcome measures in practice and research. 3.
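The abstract above reports standard error of measurement (SEM) and minimal detectable change (MDC) without stating how they were computed. A minimal sketch of the conventional formulas, SEM = SD × √(1 − ICC) and MDC95 = 1.96 × √2 × SEM, is given below; these formulas are assumed rather than taken from the study, and the SD and ICC values are hypothetical illustrations, not results from this record.

```python
import math


def standard_error_of_measurement(sd: float, icc: float) -> float:
    """Conventional SEM: between-subject SD scaled by sqrt(1 - ICC)."""
    if icc > 1.0:
        raise ValueError("ICC cannot exceed 1")
    return sd * math.sqrt(1.0 - icc)


def minimal_detectable_change_95(sem: float) -> float:
    """Conventional MDC at the 95% level for a difference of two measurements."""
    return 1.96 * math.sqrt(2.0) * sem


# Hypothetical inputs (NOT from the study): SD of 5.0 units, ICC of 0.91.
sd_hypothetical = 5.0
icc_hypothetical = 0.91

sem = standard_error_of_measurement(sd_hypothetical, icc_hypothetical)
mdc95 = minimal_detectable_change_95(sem)
print(f"SEM = {sem:.2f}, MDC95 = {mdc95:.2f}")
```

Under these formulas a lower ICC inflates both SEM and MDC, which is one reason low-reliability measures such as joint play are hard to use for tracking change.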
Validity and reliability of the Diagnostic Adaptive Behaviour Scale.

PubMed

Tassé, M J; Schalock, R L; Balboni, G; Spreat, S; Navas, P

2016-01-01

The Diagnostic Adaptive Behaviour Scale (DABS) is a new standardised adaptive behaviour measure that provides information for evaluating limitations in adaptive behaviour for the purpose of determining a diagnosis of intellectual disability. This article presents validity evidence and reliability data for the DABS. Validity evidence was based on comparing DABS scores with scores obtained on the Vineland Adaptive Behaviour Scale, second edition. The stability of the test scores was measured using a test and retest, and inter-rater reliability was assessed by computing the inter-respondent concordance. The DABS convergent validity coefficients ranged from 0.70 to 0.84, while the test-retest reliability coefficients ranged from 0.78 to 0.95, and the inter-rater concordance as measured by intraclass correlation coefficients ranged from 0.61 to 0.87. All obtained validity and reliability indicators were strong and comparable with the validity and reliability coefficients of the most commonly used adaptive behaviour instruments. These results and the advantages of the DABS for clinician and researcher use are discussed. © 2015 MENCAP and International Association of the Scientific Study of Intellectual and Developmental Disabilities and John Wiley & Sons Ltd.
Development and evaluation of the "BRISK Scale," a brief observational measure of risk communication competence.

PubMed

Han, Paul K J; Joekes, Katherine; Mills, Greg; Gutheil, Caitlin; Smith, Kahsi; Cochran, Nancy E; Elwyn, Glyn

2016-12-01

To develop and evaluate a brief observational measure of clinical risk communication competence. A 4-item checklist-type measure, the BRISK (Brief Risk Information Skill) Scale, was developed by selecting and refining items from a more comprehensive measure of clinical risk communication competence. Six volunteer raters received brief training on the measure and then used the BRISK Scale to evaluate 52 video-recorded encounters between 2nd-year medical students and standardized patients conducted as part of an Observed Structured Clinical Examination (OSCE) involving a risk communication task. Internal consistency reliability, inter-rater reliability, and criterion validity were assessed. Raters reported no difficulties using the BRISK Scale; scores across all raters and subjects ranged from 0 to 16 with a mean score of 6.49 (SD=3.17). The BRISK Scale showed good internal consistency reliability (α=0.64), and inter-rater reliability at the scale level (Intraclass Correlation Coefficient (ICC)=0.79 for consistency, and 0.75 for absolute agreement) and individual-item level (ICC range: 0.62-.91). Novice raters' BRISK Scale scores were highly correlated (r=0.84, p<0.01) with expert raters' scores on the Risk Communication Content measure, a more comprehensive measure of risk communication competence. The BRISK Scale is a promising new brief observational measure of clinical risk communication competence. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
The Reliability of Anthropometric Measurements Used Preoperatively in Aesthetic Breast Surgery.

PubMed

Isaac, Kathryn V; Murphy, Blake D; Beber, Brett; Brown, Mitchell

2016-04-01

Patient outcomes in aesthetic breast surgery are highly dependent on breast measurements used in preoperative planning. The purpose of this study is to determine the reliability of anthropometric breast measurements. Four raters measured 28 women using 7 measurements: sternal notch to nipple distance (Sn-N), nipple to midline (N-M), nipple to inframammary-fold distance under maximal stretch (N-IMF), breast base width (BW), soft tissue pinch thickness of the upper pole (STPT:UP), STPT at the inframammary fold (STPT:IMF), and anterior pull skin stretch (APSS). Reliability was assessed using intra-class correlation coefficients (ICCs). Inter-rater reliability was excellent for Sn-N, N-M, and BW (ICC = 0.94, 0.90, and 0.76, respectively) and was good for N-IMF (ICC = 0.70). The STPT:UP, STPT:IMF, and APSS measurements were not reliable between raters (ICC < 0.2). Intra-rater reliability was excellent for Sn-N, N-M, and BW for all raters (all ICC > 0.75). The N-IMF intra-rater reliability was excellent in senior raters (ICC > 0.75) and good in junior raters (ICC > 0.6). The STPT:UP, STPT:IMF, and APSS measurements showed fair or poor reliability for most raters (ICC < 0.6). The Sn-N, N-M, and BW measurements are very reliable. Dynamic measurements including APSS, STPT:UP, and STPT:IMF are unreliable. N-IMF is the only reliable dynamic measurement, and its reliability improves with increasing clinical experience. The variable reliability of preoperative measurements must be considered in the planning of aesthetic breast surgery. 4 Diagnostic. © 2015 The American Society for Aesthetic Plastic Surgery, Inc. Reprints and permission: journals.permissions@oup.com.
Video training and certification program improves reliability of postischemic neurologic deficit measurement in the rat.

PubMed

Taninishi, Hideki; Pearlstein, Molly; Sheng, Huaxin; Izutsu, Miwa; Chaparro, Rafael E; Goldstein, Larry B; Warner, David S

2016-12-01

Scoring systems are used to measure behavioral deficits in stroke research. Video-assisted training is used to standardize stroke-related neurologic deficit scoring in humans. We hypothesized that a video-assisted training and certification program can improve inter-rater reliability in assessing neurologic function after middle cerebral artery occlusion in rats. Three expert raters scored neurologic deficits in post-middle cerebral artery occlusion rats using three published systems having different complexity levels (3, 18, or 48 points). The system having the highest point estimate for the correlation between neurologic score and infarct size was selected to create a video-assisted training and certification program. Eight trainee raters completed the video-assisted training and certification program. Inter-rater agreement (κ score) and agreement with expert consensus scores were measured before and after video-assisted training and certification program completion. The 48-point system correlated best with infarct size. Video-assisted training and certification improved agreement with expert consensus scores (pretraining = 65 ± 10, posttraining = 87 ± 14, 112 possible scores, P < 0.0001), median number of trainee raters with scores within ±2 points of the expert consensus score (pretraining = 4, posttraining = 6.5, P < 0.01), categories with κ > 0.4 (pretraining = 4, posttraining = 9), and number of categories with an improvement in the κ score from pretraining to posttraining (n = 6). Video-assisted training and certification improved trainee inter-rater reliability and agreement with expert consensus behavioral scores in rats after middle cerebral artery occlusion. Video-assisted training and certification may be useful in multilaboratory preclinical studies. © The Author(s) 2015.
Considerations in the use of reflective writing for student assessment: issues of reliability and validity.

PubMed

Moniz, Tracy; Arntfield, Shannon; Miller, Kristina; Lingard, Lorelei; Watling, Chris; Regehr, Glenn

2015-09-01

Reflective writing is a popular tool to support the growth of reflective capacity in undergraduate medical learners. Its popularity stems from research suggesting that reflective capacity may lead to improvements in skills such as empathy, communication, collaboration and professionalism. This has led to assumptions that reflective writing can also serve as a tool for student assessment. However, evidence to support the reliability and validity of reflective writing as a meaningful assessment strategy is lacking. Using a published instrument for measuring 'reflective capacity' (the Reflection Evaluation for Learners' Enhanced Competencies Tool [REFLECT]), four trained raters independently scored four samples of writing from each of 107 undergraduate medical students to determine the reliability of reflective writing scores. REFLECT scores were then correlated with scores on a Year 4 objective structured clinical examination (OSCE) and Year 2 multiple-choice question (MCQ) examinations to examine, respectively, convergent and divergent validity. Across four writing samples, four-rater Cronbach's α-values ranged from 0.72 to 0.82, demonstrating reasonable inter-rater reliability with four raters using the REFLECT rubric. However, inter-sample reliability was fairly low (four-sample Cronbach's α = 0.54, single-sample intraclass correlation coefficient: 0.23), which suggests that performance on one reflective writing sample was not strongly indicative of performance on the next. Approximately 14 writing samples are required to achieve reasonable inter-sample reliability. The study found weak, non-significant correlations between reflective writing scores and both OSCE global scores (r = 0.13) and MCQ examination scores (r = 0.10), demonstrating a lack of relationship between reflective writing and these measures of performance. 
Our findings suggest that to draw meaningful conclusions about reflective capacity as a stable construct in individuals requires 14 writing samples per student, each assessed by four or five raters. This calls into question the feasibility and utility of using reflective writing rigorously as an assessment tool in undergraduate medical education. © 2015 John Wiley & Sons Ltd.
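The sample-count figures in the abstract above are consistent with the Spearman-Brown prophecy formula, although the abstract does not name the method the authors used. The sketch below applies that formula to the reported single-sample ICC of 0.23; the target reliability of 0.80 is an assumption standing in for the abstract's "reasonable" reliability, and the function names are illustrative.

```python
import math


def spearman_brown(single_reliability: float, k: float) -> float:
    """Predicted reliability of the mean of k parallel measurements."""
    return k * single_reliability / (1.0 + (k - 1.0) * single_reliability)


def measurements_needed(single_reliability: float, target: float) -> float:
    """Spearman-Brown solved for k: measurements needed to reach the target."""
    return target * (1.0 - single_reliability) / (
        single_reliability * (1.0 - target)
    )


icc_single_sample = 0.23   # reported in the abstract
target_reliability = 0.80  # assumed threshold, not stated in the abstract

four_sample = spearman_brown(icc_single_sample, 4)
needed = math.ceil(measurements_needed(icc_single_sample, target_reliability))

print(f"Predicted four-sample reliability: {four_sample:.2f}")
print(f"Samples needed for {target_reliability:.2f}: {needed}")
```

With these inputs the predicted four-sample reliability is about 0.54 and the required number of samples rounds up to 14, matching the values quoted in the abstract.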
Inter-Rater Reliability of Provider Interpretations of Irritable Bowel Syndrome Food and Symptom Journals

PubMed Central

Chung, Chia-Fang; Xu, Kaiyuan; Dong, Yi; Schenk, Jeanette M.; Cain, Kevin; Munson, Sean; Heitkemper, Margaret M.

2017-01-01

There are currently no standardized methods for identifying trigger food(s) from irritable bowel syndrome (IBS) food and symptom journals. The primary aim of this study was to assess the inter-rater reliability of providers’ interpretations of IBS journals. A second aim was to describe whether these interpretations varied for each patient. Eight providers reviewed 17 IBS journals and rated how likely key food groups (fermentable oligo-di-monosaccharides and polyols, high-calorie, gluten, caffeine, high-fiber) were to trigger IBS symptoms for each patient. Agreement of trigger food ratings was calculated using Krippendorff’s α-reliability estimate. Providers were also asked to write down recommendations they would give to each patient. Estimates of agreement of trigger food likelihood ratings were poor (average α = 0.07). Most providers gave similar trigger food likelihood ratings for over half the food groups. Four providers gave the exact same written recommendation(s) (range 3–7) to over half the patients. Inter-rater reliability of provider interpretations of IBS food and symptom journals was poor. Providers favored certain trigger food likelihood ratings and written recommendations. This supports the need for a more standardized method for interpreting these journals and/or more rigorous techniques to accurately identify personalized IBS food triggers. PMID:29113044
Validation of the secretion severity rating scale.

PubMed

Pluschinski, Petra; Zaretsky, Eugen; Stöver, Timo; Murray, Joseph; Sader, Robert; Hey, Christiane

2016-10-01

Accumulation of secretions within the hypopharynx, aditus laryngis, and trachea is one characteristic of severe dysphagia and is of high clinical and therapeutic relevance. For grading the secretion severity level, a secretion scale was provided by Murray et al. in 1996. The purpose of the study presented here is the validation of this scale by analyzing the intra-rater and inter-rater reliability as well as concurrent validity. For examination of reliability and validity, a reference standard was defined by two expert clinicians who reviewed 40 video recordings of fiberendoscopic swallowing evaluations, with 10 videos for each severity grade. These videos were rated and rerated independently and blinded by 4 ENT residents with an interval of 4 weeks. Both the intra-rater (Kendall's τ > 0.847***) and inter-rater reliability (Kendall's W > 0.951***) were highly significant and can be considered good or very good. Correlation of the median of all ratings with the reference standard was close to the highest possible value of 1 (τ = 0.984***). The scale proved to be a reliable and valid instrument for grading one of the principal symptoms of oropharyngeal dysphagia and is recommended as an evidence-based instrument for standardized fiberoptic endoscopic evaluation of swallowing.

Reliability and type of consumer health documents on the World Wide Web: an annotation study.

PubMed

Martin, Melanie J

2011-01-01

In this paper we present a detailed scheme for annotating medical web pages designed for health care consumers. The annotation is along two axes: first, by reliability (the extent to which the medical information on the page can be trusted), second, by the type of page (patient leaflet, commercial, link, medical article, testimonial, or support). We analyze inter-rater agreement among three judges for each axis. Inter-rater agreement was moderate (0.77 accuracy, 0.62 F-measure, 0.49 Kappa) on the page reliability axis and good (0.81 accuracy, 0.72 F-measure, 0.73 Kappa) along the page type axis. We have shown promising results in this study that appropriate classes of pages can be developed and used by human annotators to annotate web pages with reasonable to good agreement.
BurnCase 3D software validation study: Burn size measurement accuracy and inter-rater reliability.

PubMed

Parvizi, Daryousch; Giretzlehner, Michael; Wurzer, Paul; Klein, Limor Dinur; Shoham, Yaron; Bohanon, Fredrick J; Haller, Herbert L; Tuca, Alexandru; Branski, Ludwik K; Lumenta, David B; Herndon, David N; Kamolz, Lars-P

2016-03-01

The aim of this study was to compare the accuracy of burn size estimation using the computer-assisted software BurnCase 3D (RISC Software GmbH, Hagenberg, Austria) with that using a 2D scan, considered to be the actual burn size. Thirty artificial burn areas were preplanned and prepared on three mannequins (one child, one female, and one male). Five trained physicians (raters) were asked to assess the size of all wound areas using BurnCase 3D software. The results were then compared with the real wound areas, as determined by 2D planimetry imaging. To examine inter-rater reliability, we performed an intraclass correlation analysis with a 95% confidence interval. The mean wound area estimations of the five raters using BurnCase 3D were in total 20.7±0.9% for the child, 27.2±1.5% for the female and 16.5±0.1% for the male mannequin. Our analysis showed relative overestimations of 0.4%, 2.8% and 1.5% for the child, female and male mannequins respectively, compared to the 2D scan. The intraclass correlation between the single raters for mean percentage of the artificial burn areas was 98.6%. There was also a high intraclass correlation between the single raters and the 2D scan. BurnCase 3D is a valid and reliable tool for the determination of total body surface area burned in standard models. Further clinical studies including different pediatric and overweight adult mannequins are warranted. Copyright © 2016 Elsevier Ltd and ISBI. All rights reserved.
Measuring the needs of mental health patients in Greece: reliability and validity of the Greek version of the Camberwell assessment of need.

PubMed

Stefanatou, Pentagiotissa; Giannouli, Eleni; Konstantakopoulos, George; Vitoratou, Silia; Mavreas, Venetsanos

2014-11-01

Evaluation of mental health services based on patients' needs assessments has never taken place in Greece, although it is a crucial factor for the efficient use of their limited resources. To examine the inter-rater and test-retest reliability and the concurrent/convergent validity of the Greek research version of the Camberwell Assessment of Need-Research (CAN-R). A total of 53 schizophrenic patient-staff pairs were interviewed twice to test the inter-rater and test-retest reliability of the Greek version of the CAN-R. The World Health Organization Quality of Life-Brief Form (WHOQOL-BREF) and World Health Organization Disability Assessment Schedule-2.0 (WHODAS-2.0) were administered to the patients to examine concurrent validity. The inter-rater and test-retest reliability of patient and staff interviews for the 22 individual items and the eight summary scores of the instrument's four sections were good to excellent. Significant correlations emerged between CAN scores and the WHOQOL-BREF and WHODAS-2.0 domains for both patient and staff ratings, indicating good concurrent validity. Our results suggest that the Greek version of the CAN-R is a reliable instrument for assessing mental health patients' needs. Moreover, it is the first CAN-R validity study with satisfactory results using WHOQOL-BREF and WHODAS-2.0 as criterion variables. © The Author(s) 2013.
The Scarbase Duo(®): Intra-rater and inter-rater reliability and validity of a compact dual scar assessment tool.

PubMed

Fell, Matthew; Meirte, Jill; Anthonissen, Mieke; Maertens, Koen; Pleat, Jonathon; Moortgat, Peter

2016-03-01

Objective scar assessment tools were designed to help identify problematic scars and direct clinical management. Their use has been restricted by their measurement of a single scar property and the bulky size of equipment. The Scarbase Duo(®) was designed to assess both trans-epidermal water loss (TEWL) and colour of a burn scar whilst being compact and easy to use. Twenty patients with a burn scar were recruited and measurements taken using the Scarbase Duo(®) by two observers. The Scarbase Duo(®) measures TEWL via an open-chamber system and undertakes colorimetry via narrow-band spectrophotometry, producing values for relative erythema and melanin pigmentation. Validity was assessed by comparing the Scarbase Duo(®) against the Dermalab(®) and the Minolta Chromameter(®) respectively for TEWL and colorimetry measurements. The intra-class correlation coefficient (ICC) was used to assess reliability with standard error of measurement (SEM) used to assess reproducibility of measurements. The Pearson correlation coefficient (r) was used to assess the convergent validity. The Scarbase Duo(®) TEWL mode had excellent reliability when used on scars for both intra- (ICC=0.95) and inter-rater (ICC=0.96) measurements with moderate SEM values. The erythema component of the colorimetry mode showed good reliability for use on scars for both intra-(ICC=0.81) and inter-rater (ICC=0.83) measurements with low SEM values. Pigmentation values showed excellent reliability on scar tissue for both intra- (ICC=0.97) and inter-rater (ICC=0.97) with moderate SEM values. The Scarbase Duo(®) TEWL function had excellent correlation with the Dermalab(®) (r=0.93) whilst the colorimetry erythema value had moderate correlation with the Minolta Chromameter (r=0.72). The Scarbase Duo(®) is a reliable and objective scar assessment tool, which is specifically designed for burn scars. However, for clinical use, standardised measurement conditions are recommended. Copyright © 2015 Elsevier Ltd and ISBI. 
All rights reserved.
Inter-rater reliability of surgical reviews for AREN03B2: a COG renal tumor committee study.

PubMed

Hamilton, Thomas E; Barnhart, Douglas; Gow, Kenneth; Ferrer, Fernando; Kandel, Jessica; Glick, Richard; Dasgupta, Roshni; Naranjo, Arlene; He, Ying; Gratias, Eric; Geller, James; Mullen, Elizabeth; Ehrlich, Peter

2014-01-01

The Children's Oncology Group (COG) renal tumor study (AREN03B2) requires real-time central review of radiology, pathology, and the surgical procedure to determine appropriate risk-based therapy. The purpose of this study was to determine the inter-rater reliability of the surgical reviews. Of the first 3200 enrolled AREN03B2 patients, a sample of 100 enriched for blood vessel involvement, spill, rupture, and lymph node involvement was selected for analysis. The surgical assessment was then performed independently by two blinded surgical reviewers and compared to the original assessment, which had been completed by another of the committee surgeons. Variables assessed included surgeon-determined local tumor stage, overall disease stage, type of renal procedure performed, presence of tumor rupture, occurrence of intraoperative tumor spill, blood vessel involvement, presence of peritoneal implants, and interpretation of residual disease. Inter-rater reliability was measured using the Fleiss' Kappa statistic two-sided hypothesis tests (Kappa, p-value). Local tumor stage correlated in all 3 reviews except in one case (Kappa=0.9775, p<0.001). Similarly, overall disease stage had excellent correlation (0.9422, p<0.001). There was strong correlation for type of renal procedure (0.8357, p<0.001), presence of tumor rupture (0.6858, p<0.001), intraoperative tumor spill (0.6493, p<0.001), and blood vessel involvement (0.6470, p<0.001). Variables that had lower correlation were determination of the presence of peritoneal implants (0.2753, p<0.001) and interpretation of residual disease status (0.5310, p<0.001). The inter-rater reliability of the surgical review is high based on the great consistency in the 3 independent review results. This analysis provides validation and establishes precedent for real-time central surgical review to determine treatment assignment in a risk-based stratagem for multimodal cancer therapy. © 2014.
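The abstract above uses Fleiss' kappa, which extends chance-corrected agreement to more than two raters. A minimal sketch of the standard calculation follows, assuming every case is rated by the same number of reviewers; the function name and the small example data set are hypothetical and are not drawn from the AREN03B2 reviews.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of cases, each a list of category labels.

    Every case must carry the same number of ratings (one per reviewer).
    Returns None when chance agreement is 1 and kappa is undefined.
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    if any(len(case) != n_raters for case in ratings) or n_raters < 2:
        raise ValueError("each case needs the same number (>= 2) of ratings")

    categories = sorted({label for case in ratings for label in case})
    # counts[i][j] = number of raters assigning case i to category j
    counts = [[case.count(c) for c in categories] for case in ratings]

    # Overall proportion of all ratings falling in each category
    p_j = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(len(categories))
    ]
    # Observed agreement within each case
    p_i = [
        (sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_cases
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return None
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical local-stage calls by three reviewers on four cases.
example = [[1, 1, 1], [2, 2, 2], [1, 1, 2], [3, 3, 3]]
print(f"Fleiss' kappa = {fleiss_kappa(example):.3f}")
```

A single disagreement in a small sample pulls kappa well below 1, which helps explain why rarely observed findings such as peritoneal implants can show low kappa even when raw agreement looks high.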
Characterising smoking cessation smartphone applications in terms of behaviour change techniques, engagement and ease-of-use features.

PubMed

Ubhi, Harveen Kaur; Michie, Susan; Kotz, Daniel; van Schayck, Onno C P; Selladurai, Abiram; West, Robert

2016-09-01

The aim of this study was to assess whether or not behaviour change techniques (BCTs) as well as engagement and ease-of-use features used in smartphone applications (apps) to aid smoking cessation can be identified reliably. Apps were coded for presence of potentially effective BCTs, and engagement and ease-of-use features. Inter-rater reliability for this coding was assessed. Inter-rater agreement for identifying presence of potentially effective BCTs ranged from 66.8 to 95.1 % with 'prevalence and bias adjusted kappas' (PABAK) ranging from 0.35 to 0.90 (p < 0.001). The intra-class correlation coefficients between the two coders for scores denoting the proportions of (a) a set of engagement features and (b) a set of ease-of-use features, which were included, were 0.77 and 0.75, respectively (p < 0.001). Prevalence estimates for BCTs ranged from <10 % for medication advice to >50 % for rewarding abstinence. The average proportions of specified engagement and ease-of-use features included in the apps were 69 and 83 %, respectively. The study found that it is possible to identify potentially effective BCTs, and engagement and ease-of-use features in smoking cessation apps with fair to high inter-rater reliability.
Reliability, Construct Validity and Interpretability of the Brazilian version of the Rapid Upper Limb Assessment (RULA) and Strain Index (SI).

PubMed

Valentim, Daniela Pereira; Sato, Tatiana de Oliveira; Comper, Maria Luiza Caíres; Silva, Anderson Martins da; Boas, Cristiana Villas; Padula, Rosimeire Simprini

There are very few observational methods for analysis of biomechanical exposure available in Brazilian Portuguese. This study aimed to cross-culturally adapt and test the measurement properties of the Rapid Upper Limb Assessment (RULA) and Strain Index (SI). The cross-cultural adaptation and measurement properties test were established according to Beaton et al. and COSMIN guidelines, respectively. Several tasks that required static posture and/or repetitive motion of upper limbs were evaluated (n>100). The intra-rater reliability for the RULA ranged from poor to almost perfect (k: 0.00-0.93), and for the SI from poor to excellent (ICC(2,1): 0.05-0.99). The inter-rater reliability was very poor for RULA (k: -0.12 to 0.13) and ranged from very poor to moderate for SI (ICC(2,1): 0.00-0.53). The agreement was good for RULA (75-100% intra-rater, and 42.24-100% inter-rater) and for SI (EPM: -1.03% to 1.97% intra-rater, and -0.17% to 1.51% inter-rater). The internal consistency was appropriate for RULA (α=0.88), and low for SI (α=0.65). Moderate construct validity was observed between RULA and SI, in wrist/hand-wrist posture (rho: 0.61) and strength/intensity of exertion (rho: 0.39). The adapted versions of the RULA and SI presented semantic and cultural equivalence for Brazilian Portuguese. The RULA and SI had reliability estimates ranging from very poor to almost perfect. The internal consistency for RULA was better than for the SI. The correlation between methods was moderate only for muscle request/movement repetition. Previous training is mandatory for the use of observational methods for biomechanical exposure assessment, although it does not guarantee good reproducibility of these measures. Copyright © 2017 Associação Brasileira de Pesquisa e Pós-Graduação em Fisioterapia. Publicado por Elsevier Editora Ltda. All rights reserved.
Neurobehavioural assessment and diagnosis in disorders of consciousness: a preliminary study of the Sensory Tool to Assess Responsiveness (STAR).

PubMed

Stokes, Verity; Gunn, Sarah; Schouwenaars, Katie; Badwan, Derar

2018-09-01

The Sensory Tool to Assess Responsiveness (STAR) is an interdisciplinary neurobehavioural diagnostic tool for individuals with prolonged disorders of consciousness. It utilises current diagnostic criteria and is intended to improve upon the high misdiagnosis rate in this population. This study assesses the inter-rater reliability of the STAR and its diagnostic validity in comparison with the Coma Recovery Scale-Revised (CRS-R) and the Wessex Head Injury Matrix (WHIM). Participants were patients with severe acquired brain injury resulting in a disorder of consciousness, who were admitted to the Royal Leamington Spa Rehabilitation Hospital between 1999 and 2009. Patients underwent sensory stimulation sessions during their period of admission, which were recorded on video. Using this footage, patients were re-assessed for this study using the STAR, WHIM and CRS-R criteria. The STAR demonstrated "moderate" inter-rater reliability, "substantial" diagnostic agreement with the CRS-R, and "moderate" agreement with the WHIM. There were no significant differences between diagnoses assigned by the different assessments. The STAR demonstrated a good degree of inter-rater reliability in identification of diagnoses for patients with disorders of consciousness. The diagnostic outcomes of the STAR agreed at a good level with the CRS-R, moderately with the WHIM, and did not significantly differ from either. This demonstrates the reliability and validity of the STAR, showing its appropriateness for clinical use. Future longitudinal studies and research into the STAR's applicability in long-stay rehabilitation are indicated.
Inter-rater reliability of a food store checklist to assess availability of healthier alternatives to the energy-dense snacks and beverages commonly consumed by children.

PubMed

Izumi, Betty T; Findholt, Nancy E; Pickus, Hayley A; Nguyen, Thuan; Cuneo, Monica K

2014-06-01

Food stores have gained attention as potential intervention targets for improving children's eating habits. There is a need for valid and reliable instruments to evaluate changes in food store snack and beverage availability secondary to intervention. The aim of this study was to develop a valid, reliable, and resource-efficient instrument to evaluate the healthfulness of food store environments faced by children. The SNACZ food store checklist was developed to assess availability of healthier alternatives to the energy-dense snacks and beverages commonly consumed by children. After pretesting, two trained observers independently assessed the availability of 48 snack and beverage items in 50 food stores located near elementary and middle schools in Portland, Oregon, over a 2-week period in summer 2012. Inter-rater reliability was calculated using the kappa statistic. Overall, the instrument had mostly high inter-rater reliability. Seventy-three percent of items assessed had almost perfect or substantial reliability. Two items had moderate reliability (0.41-0.60), and no items had a reliability score less than 0.41. Eleven items occurred too infrequently to generate a kappa score. The SNACZ food store checklist is a first-step toward developing a valid and reliable tool to evaluate the healthfulness of food store environments faced by children. The tool can be used to compare availability of healthier snack and beverage alternatives across communities and measure change secondary to intervention. As a wider variety of healthier snack and beverage alternatives become available in food stores, the checklist should be updated.
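For a two-observer design like the one above, the kappa statistic is usually Cohen's kappa, although the abstract does not specify the variant. The sketch below shows that calculation on hypothetical availability codes (1 = item available, 0 = not available) and also illustrates one way an item can fail to yield a kappa score: when both observers give the same answer in every store, chance agreement equals 1 and kappa is undefined.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters; returns None when kappa is undefined."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)

    # Observed proportion of exact agreement
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance from each rater's marginal frequencies
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(
        count_a[c] * count_b[c] for c in set(count_a) | set(count_b)
    ) / (n * n)

    if p_e == 1.0:
        return None  # no variation in either rater's codes
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical codes for one snack item across ten stores.
observer_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
observer_2 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
print(f"kappa = {cohen_kappa(observer_1, observer_2):.2f}")

# An item never found by either observer: kappa cannot be computed.
print(cohen_kappa([0] * 10, [0] * 10))
```

In the hypothetical example the observers agree in 8 of 10 stores, chance agreement is 0.5, and kappa is 0.60, which sits at the upper edge of the 0.41-0.60 "moderate" band quoted in the abstract.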
Reliability and Validity of Autism Diagnostic Interview-Revised, Japanese Version

ERIC Educational Resources Information Center

Tsuchiya, Kenji J.; Matsumoto, Kaori; Yagi, Atsuko; Inada, Naoko; Kuroda, Miho; Inokuchi, Eiko; Koyama, Tomonori; Kamio, Yoko; Tsujii, Masatsugu; Sakai, Saeko; Mohri, Ikuko; Taniike, Masako; Iwanaga, Ryoichiro; Ogasahara, Kei; Miyachi, Taishi; Nakajima, Shunji; Tani, Iori; Ohnishi, Masafumi; Inoue, Masahiko; Nomura, Kazuyo; Hagiwara, Taku; Uchiyama, Tokio; Ichikawa, Hironobu; Kobayashi, Shuji; Miyamoto, Ken; Nakamura, Kazuhiko; Suzuki, Katsuaki; Mori, Norio; Takei, Nori

2013-01-01

To examine the inter-rater reliability of Autism Diagnostic Interview-Revised, Japanese Version (ADI-R-JV), the authors recruited 51 individuals aged 3-19 years, interviewed by two independent raters. Subsequently, to assess the discriminant and diagnostic validity of ADI-R-JV, the authors investigated 317 individuals aged 2-19 years, who were…
[Kennedy V Axis assessment in an Italian outpatient and inpatient population].

PubMed

Mundo, Emanuela; Bonalume, Laura; Del Corno, Franco; Madeddu, Fabio; Lang, Margherita

2010-01-01

Kennedy Axis V, or K Axis, is an alternative tool to the DSM-IV-TR Global Assessment of Functioning (GAF) Scale, which many researchers describe as a scale with poor inter-rater reliability and clinical utility. Unlike the GAF scale, K Axis provides a multidimensional and multiaxial approach to measuring personal, social and interpersonal functioning in psychiatric outpatients and inpatients. In this study, we examined the inter-rater reliability of K Axis by using it with an Italian clinical population. Clinicians used Kennedy Axis V to assess global functioning among 180 inpatients in 9 psychiatric services in Lombardia and Piemonte. Patients were divided into 4 different diagnostic groups, according to the DSM-IV-TR criteria. Intraclass correlations between two independent raters' scores reveal a high level of inter-rater reliability for all K Axis scales (0.633 < ICC < 0.813). Highly significant results in the Kruskal-Wallis test demonstrate that the patient diagnosis influences all the scale scores. Significant differences in patients' functioning profiles were noted between diagnostic groups on all K Axis scales except the Violence scale. In this study a high level of rater agreement was noted, even though K Axis scales were used in different mental health services by different clinicians. K Axis scales provide a useful profile of patient global functioning, in line with the specific pathology.
Reliability of sagittal plane hip, knee, and ankle joint angles from a single frame of video data using the GAITRite camera system.

PubMed

Ross, Sandy A; Rice, Clinton; Von Behren, Kristyn; Meyer, April; Alexander, Rachel; Murfin, Scott

2015-01-01

The purpose of this study was to establish intra-rater, intra-session, and inter-rater reliability of sagittal plane hip, knee, and ankle angles with and without reflective markers using the GAITRite walkway and single video camera between student physical therapists and an experienced physical therapist. This study included thirty-two healthy participants aged 20-59, stratified by age and gender. Participants performed three successful walks with and without markers applied to anatomical landmarks. GAITRite software was used to digitize sagittal hip, knee, and ankle angles at two phases of gait: (1) initial contact; and (2) mid-stance. Intra-rater reliability was more consistent for the experienced physical therapist, regardless of joint or phase of gait. Intra-session reliability was variable: the experienced physical therapist showed moderate to high reliability (intra-class correlation coefficient (ICC) = 0.50-0.89) and the student physical therapist showed very poor to high reliability (ICC = 0.07-0.85). Inter-rater reliability was highest during mid-stance at the knee with markers (ICC = 0.86) and lowest during mid-stance at the hip without markers (ICC = 0.25). Reliability of a single camera system, especially at the knee joint, shows promise. Depending on the specific type of reliability, error can be attributed to the testers (e.g. lack of digitization practice and marker placement), participants (e.g. loose fitting clothing) and camera systems (e.g. frame rate and resolution). However, until the camera technology can be upgraded to a higher frame rate and resolution, and the software can be linked to the GAITRite walkway, the clinical utility for pre/post measures is limited.
Reliability of analysis of the bone mineral density of the second and fifth metatarsals using dual-energy x-ray absorptiometry (DXA).

PubMed

Pritchard, N Stewart; Smoliga, James M; Nguyen, Anh-Dung; Branscomb, Micah C; Sinacore, David R; Taylor, Jeffrey B; Ford, Kevin R

2017-01-01

Metatarsal fractures, especially of the fifth metatarsal, are common injuries of the foot in a young athletic population, but the risk factors for this injury are not well understood. Dual-energy x-ray absorptiometry (DXA) provides reliable measures of regional bone mineral density to predict fracture risk in the hip and lumbar spine. Recently, sub-regional metatarsal reliability was established in fresh cadaveric specimens and associated with ultimate fracture force. The purpose of this study was to assess the reliability of DXA bone mineral density measurements of sub-regions of the second and fifth metatarsals in a young, active population. Thirty-two recreationally active individuals participated in the study, and the bone density of the second (2MT) and fifth (5MT) metatarsals of each subject was measured using a Hologic QDR x-ray bone densitometer. Scans were analyzed separately by two raters, and regional bone mineral density, bone mineral content, and area measurements were calculated for the proximal, shaft, and distal regions of the bone. Intra-rater, inter-rater, and scan-rescan reliability were then determined for each region. Proximal and shaft bone mineral density measurements of the second and fifth metatarsal were reliable. ICCs were variable across regions and metatarsals, with the distal region being the poorest. Bone mineral density measurements of the metatarsals may be a better indicator of fracture risk of the metatarsals than whole body measurements. A reliable method for measuring the regional bone mineral densities of the metatarsals was found. However, inter-rater reliability and scan-rescan reliability for the distal regions were poor. Future research should examine the relationship between DXA bone mineral density measurements and fracture risk at the metatarsals.
En Face Optical Coherence Tomography Angiography Imaging Versus Fundus Photography in the Measurement of Choroidal Nevi.

PubMed

Lee, Michele D; Kaidonis, Georgia; Kim, Alice Y; Shields, Ryan A; Leng, Theodore

2017-09-01

Choroidal nevi are common benign intraocular tumors with a small risk of malignant transformation. This retrospective study investigates the use of en face spectral-domain optical coherence tomography angiography (SD-OCTA) in determining the clinical features and measurement of choroidal nevi. Patients with choroidal nevi were imaged with both OCTA and a fundus photography device. Greatest longitudinal dimension (GLD), perpendicular dimension (PD), and the GLD/PD ratio were assessed on each device. Inter-device variation and intra- and inter-rater reliability analyses were performed. Fourteen patients with choroidal nevi were included. No significant difference between the GLD/PD ratio as measured by all three devices was found (Chi-square = 2.8, 2 df, P = .247). Intraclass correlation coefficients were greater than 0.7 for repeated measures on all devices, suggesting good repeatability and reproducibility. This study demonstrated inter-device consistency and high intra- and inter-rater reliability when measuring choroidal nevi. [Ophthalmic Surg Lasers Imaging Retina. 2017;48:741-747.]. Copyright 2017, SLACK Incorporated.
Inter-rater reliability of data elements from a prototype of the Paul Coverdell National Acute Stroke Registry

PubMed Central

Reeves, Mathew J; Mullard, Andrew J; Wehner, Susan

2008-01-01

Background The Paul Coverdell National Acute Stroke Registry (PCNASR) is a U.S. based national registry designed to monitor and improve the quality of acute stroke care delivered by hospitals. The registry monitors care through specific performance measures, the accuracy of which depends in part on the reliability of the individual data elements used to construct them. This study describes the inter-rater reliability of data elements collected in Michigan's state-based prototype of the PCNASR. Methods Over a 6-month period, 15 hospitals participating in the Michigan PCNASR prototype submitted data on 2566 acute stroke admissions. Trained hospital staff prospectively identified acute stroke admissions, abstracted chart information, and submitted data to the registry. At each hospital 8 randomly selected cases were re-abstracted by an experienced research nurse. Inter-rater reliability was estimated by the kappa statistic for nominal variables, and intraclass correlation coefficient (ICC) for ordinal and continuous variables. Factors that can negatively impact the kappa statistic (i.e., trait prevalence and rater bias) were also evaluated. Results A total of 104 charts were available for re-abstraction. Excellent reliability (kappa or ICC > 0.75) was observed for many registry variables including age, gender, black race, hemorrhagic stroke, discharge medications, and modified Rankin Score. Agreement was at least moderate (i.e., 0.75 > kappa ≥ 0.40) for ischemic stroke, TIA, white race, non-ambulance arrival, hospital transfer and direct admit. However, several variables had poor reliability (kappa < 0.40) including stroke onset time, stroke team consultation, time of initial brain imaging, and discharge destination. There were marked systematic differences between hospital abstractors and the audit abstractor (i.e., rater bias) for many of the data elements recorded in the emergency department.
Conclusion The excellent reliability of many of the data elements supports the use of the PCNASR to monitor and improve care. However, the poor reliability for several variables, particularly time-related events in the emergency department, indicates the need for concerted efforts to improve the quality of data collection. Specific recommendations include improvements to data definitions, abstractor training, and the development of ED-based real-time data collection systems. PMID:18547421
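The abstract notes that trait prevalence and rater bias can depress kappa. The sketch below shows one common way to quantify those two effects for a yes/no data element, using the prevalence index, the bias index, and the prevalence-adjusted bias-adjusted kappa (PABAK). The abstract does not say which indices the authors used, so the choice of indices, the function name, and the cell counts are all assumptions made for illustration.

```python
def kappa_diagnostics(a, b, c, d):
    """Agreement summaries for a 2x2 table from two abstractors.

    a: both recorded "yes"            b: abstractor 1 "yes", abstractor 2 "no"
    c: abstractor 1 "no", 2 "yes"     d: both recorded "no"
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("table must contain at least one chart")
    p_o = (a + d) / n
    # Chance agreement from the two abstractors' marginal totals.
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else None
    return {
        "observed_agreement": p_o,
        "kappa": kappa,
        # Large when one category dominates the agreements (skewed prevalence).
        "prevalence_index": abs(a - d) / n,
        # Large when the abstractors disagree in a systematic direction.
        "bias_index": abs(b - c) / n,
        # Kappa recomputed as if prevalence were balanced and bias were absent.
        "pabak": 2 * p_o - 1,
    }


# Hypothetical counts for a common trait: the abstractors agree on 92 of 100 charts.
summary = kappa_diagnostics(a=90, b=4, c=4, d=2)
for name, value in summary.items():
    print(name, round(value, 3))
```

With these invented counts the raw agreement is high while kappa is low, because nearly all charts fall in the "yes" category; the prevalence index makes that imbalance visible.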
Reliability testing of a portfolio assessment tool for postgraduate family medicine training in South Africa

PubMed Central

Mash, Bob; Derese, Anselme

2013-01-01

Abstract Background Competency-based education and the validity and reliability of workplace-based assessment of postgraduate trainees have received increasing attention worldwide. Family medicine was recognised as a speciality in South Africa six years ago and a satisfactory portfolio of learning is a prerequisite to sit the national exit exam. A massive scaling up of the number of family physicians is needed in order to meet the health needs of the country. Aim The aim of this study was to develop a reliable, robust and feasible portfolio assessment tool (PAT) for South Africa. Methods Six raters each rated nine portfolios from the Stellenbosch University programme, using the PAT, to test for inter-rater reliability. This rating was repeated three months later to determine test–retest reliability. Following initial analysis and feedback the PAT was modified and the inter-rater reliability again assessed on nine new portfolios. An acceptable intra-class correlation was considered to be > 0.80. Results The total score was found to be reliable, with a coefficient of 0.92. For test–retest reliability, the difference in mean total score was 1.7%, which was not statistically significant. Amongst the subsections, only assessment of the educational meetings and the logbook showed reliability coefficients > 0.80. Conclusion This was the first attempt to develop a reliable, robust and feasible national portfolio assessment tool to assess postgraduate family medicine training in the South African context. The tool was reliable for the total score, but the low reliability of several sections in the PAT helped us to develop 12 recommendations regarding the use of the portfolio, the design of the PAT and the training of raters.
ASSOCIATIONS BETWEEN THREE CLINICAL ASSESSMENT TOOLS FOR POSTURAL STABILITY

PubMed Central

Saxion, Casie E.; Cameron, Kenneth L.; Gerber, J. Parry

2010-01-01

Study Design: Clinical Measurement, Correlation, Reliability Objectives: To assess the relationship between the Single Leg Balance (SLB), modified Balance Error Scoring System (mBESS), and modified Star Excursion Balance (mSEBT) tests and secondarily to assess inter-rater and test-retest reliability of these tests. Background: Ankle sprains often result in chronic instability and dysfunction. Several clinical tests assess postural deficits as a potential cause of this dysfunction; however, limited information exists pertaining to the relationship that these tests have with one another. Methods: Two independent examiners measured the performance of 34 healthy participants completing the SLB Test, mBESS test, and mSEBT at two different time periods. The relationship between tests was assessed using the Pearson Correlation and Fisher's Exact Tests. Inter-rater and test-retest reliability were assessed using the intraclass correlation coefficient (ICC) and Kappa statistics. Results: A significant correlation (r = -0.35) was observed between the mSEBT and the mBESS. Fisher's Exact Test showed a significant association between the SLB Test and mBESS (P = .048), but no association between the SLB and mSEBT (P = 1.000). Inter-rater reliability was excellent for the mSEBT and fair for the mBESS (ICCs of .91 and .61 respectively). Excellent agreement was observed between raters for the SLB test (k = 1.00). Test-retest reliability was excellent for the mSEBT (ICC = 0.98) and fair for the mBESS (ICC = 0.74). There was poor test-retest agreement for the SLB test (k = .211). Conclusion: There was a significant relationship observed between the SLB Test, mBESS test, and mSEBT; however, strength of association measures showed limited overlap between these tests. This suggests that these tests are interrelated but may not assess equal components of postural stability. PMID:21589668
The children's menu assessment: development, evaluation, and relevance of a tool for evaluating children's menus.

PubMed

Krukowski, Rebecca A; Eddings, Kenya; West, Delia Smith

2011-06-01

Restaurant foods represent a substantial portion of children's dietary intake, and consumption of foods away from home has been shown to contribute to excess adiposity. This descriptive study aimed to pilot-test and establish the reliability of a standardized and comprehensive assessment tool, the Children's Menu Assessment, for evaluating the restaurant food environment for children. The tool is an expansion of the Nutrition Environment Measures Survey-Restaurant. In 2009-2010, a randomly selected sample of 130 local and chain restaurants were chosen from within 20 miles of Little Rock, AR, to examine the availability of children's menus and to conduct initial calibration of the Children's Menu Assessment tool (final sample: n=46). Independent raters completed the Children's Menu Assessment in order to determine inter-rater reliability. Test-retest reliability was also examined. Inter-rater reliability was high: percent agreement was 97% and Spearman correlation was 0.90. Test-retest was also high: percent agreement was 91% and Spearman correlation was 0.96. Mean Children's Menu Assessment completion time was 14 minutes, 56 seconds ± 10 minutes, 21 seconds. Analysis of Children's Menu Assessment findings revealed that few healthier options were available on children's menus, and most menus did not provide parents with information for making healthy choices, including nutrition information or identification of healthier options. The Children's Menu Assessment tool allows for comprehensive, rapid measurement of the restaurant food environment for children with high inter-rater reliability. This tool has the potential to contribute to public health efforts to develop and evaluate targeted environmental interventions and/or policy changes regarding restaurant foods. Copyright © 2011 American Dietetic Association. Published by Elsevier Inc. All rights reserved.
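The two inter-rater indices reported above, percent agreement and Spearman correlation, answer different questions: the first counts exact matches, the second asks whether the raters order the restaurants the same way. A minimal pure-Python sketch follows. The scores are invented, the function names are hypothetical, and the abstract does not say whether agreement was computed on item responses or on total scores, so treating the inputs as per-restaurant scores is an assumption.

```python
def percent_agreement(x, y):
    """Percentage of paired ratings that are exactly equal."""
    return 100 * sum(a == b for a, b in zip(x, y)) / len(x)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical menu scores from two raters for five restaurants.
rater_1 = [3, 5, 2, 8, 6]
rater_2 = [3, 5, 1, 8, 7]
print(percent_agreement(rater_1, rater_2), spearman(rater_1, rater_2))
```

In this invented example the raters match exactly on only three of five restaurants yet rank them identically, which is why studies often report both figures.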
Measuring the Pain Area: An Intra- and Inter-Rater Reliability Study Using Image Analysis Software.

PubMed

Dos Reis, Felipe Jose Jandre; de Barros E Silva, Veronica; de Lucena, Raphaela Nunes; Mendes Cardoso, Bruno Alexandre; Nogueira, Leandro Calazans

2016-01-01

Pain drawings have frequently been used for clinical information and research. The aim of this study was to investigate intra- and inter-rater reliability of area measurements performed on pain drawings. Our secondary objective was to verify the reliability when using computers with different screen sizes, both with and without mouse hardware. Pain drawings were completed by patients with chronic neck pain or neck-shoulder-arm pain. Four independent examiners participated in the study. Examiners A and B used the same computer with a 16-inch screen and wired mouse hardware. Examiner C used a notebook with a 16-inch screen and no mouse hardware, and Examiner D used a computer with an 11.6-inch screen and a wireless mouse. Image measurements were obtained using GIMP and NIH ImageJ computer programs. The length of all the images was measured using GIMP software to a set scale in ImageJ. Thus, each marked area was encircled and the total surface area (cm²) was calculated for each pain drawing measurement. A total of 117 areas were identified and 52 pain drawings were analyzed. The intra-rater reliability between all examiners was high (ICC = 0.989). The inter-rater reliability was also high. No significant differences were observed when using different screen sizes or when using or not using the mouse hardware. This suggests that the precision of these measurements is acceptable for the use of this method as a measurement tool in clinical practice and research. © 2014 World Institute of Pain.
Development, reliability and validation of an infant mammalian penetration-aspiration scale

PubMed Central

Holman, Shaina Devi; Campbell-Malone, Regina; Ding, Peng; Gierbolini-Norat, Estela M.; Griffioen, Anne M.; Inokuchi, Haruhi; Lukasik, Stacey L.; German, Rebecca Z.

2012-01-01

A penetration-aspiration scale exists for assessing airway protection in adult videofluoroscopy and fiberoptic endoscopic swallowing studies; however, no such scale exists for animal models. The aim of this study was threefold: 1) to develop a Penetration-Aspiration Scale (PAS) for infant mammals, 2) to test the scale’s intra- and inter-rater reliability, and 3) to validate the use of the scale for distinguishing between abnormal and normal animals. After discussion and reviewing many videos, the result was a 7-Point Infant Mammal PAS. Reliability was tested by having 5 judges score 90 swallows recorded with videofluoroscopy across two time points. In these videos, the frame rate was either 30 or 60 frames per second and the animals were either normal, had a unilateral superior laryngeal nerve (SLN) lesion, or had hard palate local anesthesia. The scale was validated by having one judge score videos of both normal and SLN lesioned pigs and testing the difference using a t-test. Raters had a high intra-rater (average kappa of 0.82, intraclass correlation coefficient (ICC) = 0.92) and high inter-rater reliability (average kappa of 0.68, ICC = 0.66). There was a significant difference in reliability for videos captured at 30 and 60 frames per second for scores of 3 and 7 (p<0.001). The scale was also validated for distinguishing between normal and abnormal pigs (p<0.001). Given the increasing number of animal studies using videofluoroscopy to study dysphagia, this scale provides a valid and reliable measure of airway protection during swallowing in infant pigs that will give these animal models increased translational significance. PMID:23129423

Reliability and validity of an iPhone® application for the measurement of lumbar spine flexion and extension range of motion.

PubMed

Pourahmadi, Mohammad Reza; Taghipour, Morteza; Jannati, Elham; Mohseni-Bandpei, Mohammad Ali; Ebrahimi Takamjani, Ismail; Rajabzadeh, Fatemeh

2016-01-01

Measurement of lumbar spine range of motion (ROM) is often considered to be an essential component of lumbar spine physiotherapy and orthopedic assessment. The measurement can be carried out with various instruments such as inclinometers and goniometers. Recent smartphones have been equipped with accelerometers and magnetometers, which, through specific software applications (apps), can be used for inclinometric functions. The main purpose was to investigate the reliability and validity of an iPhone® app (TiltMeter© - advanced level and inclinometer) for measuring standing lumbar spine flexion-extension ROM in asymptomatic subjects. A cross-sectional study was carried out. This study was conducted in a physiotherapy clinic located at School of Rehabilitation Sciences, Iran University of Medical Science and Health Services, Tehran, Iran. A convenience sample of 30 asymptomatic adults (15 males; 15 females; age range = 18-55 years) was recruited between August 2015 and December 2015. Following a 2-minute warm-up, the subjects were asked to stand in a relaxed position and their skin was marked at the T12-L1 and S1-S2 spinal levels. From this position, they were asked to perform maximum lumbar flexion followed by maximum lumbar extension with their knees straight. Two blinded raters each used an inclinometer and the iPhone® app to measure lumbar spine flexion-extension ROM. A third rater read the measured angles. To calculate total lumbar spine flexion-extension ROM, the measurement from S1-S2 was subtracted from T12-L1. The second (2 hours later) and third (48 hours later) sessions were carried out in the same manner as the first session. All of the measurements were conducted 3 times and the mean value of 3 repetitions for each measurement was used for analysis. Intraclass correlation coefficient (ICC) models (3, k) and (2, k) were used to determine the intra-rater and inter-rater reliability, respectively.
The Pearson correlation coefficients were used to establish concurrent validity of the iPhone® app. Furthermore, minimum detectable change at the 95% confidence level (MDC95) was computed as 1.96 × standard error of measurement × [Formula: see text]. Good to excellent intra-rater and inter-rater reliability were demonstrated for both the gravity-based inclinometer with ICC values of ≥0.84 and ≥0.77 and the iPhone® app with ICC values of ≥0.85 and ≥0.85, respectively. The MDC95 ranged from 5.82° to 8.18° for the intra-rater analysis and from 7.38° to 8.66° for the inter-rater analysis. The concurrent validity for flexion and extension between the 2 instruments was 0.85 and 0.91, respectively. The iPhone® app possesses good to excellent intra-rater and inter-rater reliability and concurrent validity. It seems that the iPhone® app can be used for the measurement of lumbar spine flexion-extension ROM. IIb.
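The ICC models and the MDC95 named in this abstract can be illustrated with a short Python sketch. It uses the Shrout and Fleiss mean-square formulas for ICC(2,k) and ICC(3,k). For the MDC95, the final factor is elided in the record above ("[Formula: see text]"), so the sketch assumes the conventional form MDC95 = 1.96 × SEM × √2 with SEM = SD × √(1 − ICC); that form, the function names, and the example numbers are assumptions, not details confirmed by the source.

```python
import math


def two_way_mean_squares(scores):
    """Mean squares for a subjects-by-raters table; scores[i][j] is subject i, rater j."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return msr, msc, mse, n


def icc_2k(scores):
    """ICC(2,k): two-way random effects, absolute agreement, mean of k raters."""
    msr, msc, mse, n = two_way_mean_squares(scores)
    return (msr - mse) / (msr + (msc - mse) / n)


def icc_3k(scores):
    """ICC(3,k): two-way mixed effects, consistency, mean of k raters."""
    msr, _, mse, _ = two_way_mean_squares(scores)
    return (msr - mse) / msr


def mdc95(sd, icc):
    """Assumed conventional form: 1.96 * SEM * sqrt(2), with SEM = SD * sqrt(1 - ICC)."""
    sem = sd * math.sqrt(1 - icc)
    return 1.96 * sem * math.sqrt(2)


# Hypothetical ROM readings in degrees: rater 2 reads 2 degrees higher on every subject.
readings = [[50, 52], [60, 62], [70, 72]]
print(icc_3k(readings), icc_2k(readings))
print(mdc95(sd=5.0, icc=0.85))
```

The invented readings show why the two models differ: a constant offset between raters leaves the consistency form, ICC(3,k), at 1, while the absolute-agreement form, ICC(2,k), drops below 1 because it counts the systematic difference as error.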
Reliability of 3D laser-based anthropometry and comparison with classical anthropometry.

PubMed

Kuehnapfel, Andreas; Ahnert, Peter; Loeffler, Markus; Broda, Anja; Scholz, Markus

2016-05-26

Anthropometric quantities are widely used in epidemiologic research as possible confounders, risk factors, or outcomes. 3D laser-based body scans (BS) allow evaluation of dozens of quantities in a short time with minimal physical contact between observers and probands. The aim of this study was to compare BS with classical manual anthropometric (CA) assessments with respect to feasibility, reliability, and validity. We performed a study on 108 individuals with multiple measurements of BS and CA to estimate intra- and inter-rater reliabilities for both. We suggested BS equivalents of CA measurements and determined validity of BS considering CA the gold standard. Throughout the study, the overall concordance correlation coefficient (OCCC) was chosen as indicator of agreement. BS was slightly more time consuming but better accepted than CA. For CA, OCCCs for intra- and inter-rater reliability were greater than 0.8 for all nine quantities studied. For BS, 9 of 154 quantities showed reliabilities below 0.7. BS proxies for CA measurements showed good agreement (minimum OCCC > 0.77) after offset correction. Thigh length showed higher reliability in BS while upper arm length showed higher reliability in CA. Except for these issues, reliabilities of CA measurements and their BS equivalents were comparable.
Reliability and validity of the Korean version of the community balance and mobility scale in patients with hemiplegia after stroke

PubMed Central

Lee, Kyoung-bo; Lee, Paul; Yoo, Sang-won; Kim, Young-dong

2016-01-01

[Purpose] The aim of this study was to translate and adapt the Community Balance and Mobility Scale (CB&M) into Korean (K-CB&M) and to verify the reliability and validity of scores obtained with Korean patients. [Subjects and Methods] A total of 16 subjects were recruited from St. Vincent’s Hospital in South Korea. At each testing session, subjects completed the K-CB&M, Berg balance scale (BBS), timed up and go test (TUG), and functional reaching test. All tests were administered by a physical therapist, and subjects completed the tests in an identical standardized order during all testing sessions. [Results] The inter- and intra-rater reliability coefficients were high for most subscores, while moderate inter-rater reliability was observed for the items “walking and looking” and “walk, look, and carry”, and moderate intra-rater reliability was observed for “forward to backward walking”. There was a positive correlation between the K-CB&M and BBS and a negative correlation between the K-CB&M and TUG in the convergent validity assessments. [Conclusion] The reliability and validity of the K-CB&M was high, suggesting that clinical practitioners treating Korean patients with hemiplegia can use this material for assessing static and dynamic balance. PMID:27630420
Hip range of motion and provocative physical examination tests reliability and agreement in asymptomatic volunteers

PubMed Central

Prather, H; Harris-Hayes, M; Hunt, D; Steger-May, K; Mathew, V; Clohisy, JC

2012-01-01

Objective The objectives of this study are the following: 1) report passive hip ROM in asymptomatic young adults, 2) report the intra-tester and inter-tester reliability of hip ROM measurements among testers of multiple disciplines, 3) report the results of provocative hip tests and tester agreement. Design descriptive epidemiology study Setting tertiary university Participants Twenty-eight young adult volunteers without musculoskeletal symptoms, history of disorder or surgery involving the lumbar spine or lower extremities were enrolled and completed the study. Methods Asymptomatic young adult volunteers completed questionnaires and were examined by two blinded examiners during a single session. The testers were physical therapists and physicians. Hip range of motion and provocative tests were completed by both examiners on each hip. Main Outcome Measurements Inter and intra-rater reliability for ROM and agreement for provocative tests was determined. Results Twenty-eight asymptomatic adults with mean age 31 years old (range 18–51 years) and mean modified Harris Hip Score of 99.5 ± 1.5 and UCLA Activity score of 8.8 ± 1.2 completed the study. Intra-rater agreement was excellent for all hip range of motion measurements, with intraclass correlation coefficients (ICCs) ranging from 0.76 to 0.97 with similar agreement if the examiner was a physical therapist or a physician. Excellent inter-rater reliability was found for hip flexion ICC 0.87 (95% CI 0.78 to 0.92), supine internal rotation ICC 0.75 (95% CI 0.60 to 0.84) and prone internal rotation ICC 0.79 (95% CI 0.66 to 0.87). The least reliable measurements were supine hip abduction (ICC 0.34) and supine external rotation (ICC 0.18). Agreement between examiners ranged from 96–100% for provocative hip tests which included the hip impingement, resisted straight leg raise, FABER/Patrick’s and log roll tests. 
Conclusions Specific hip ROM measures show excellent inter-rater reliability and provocative hip tests show good agreement among multiple examiners and medical disciplines. Further studies are needed to assess the utilization of these measurements and tests as a part of a hip screening examination to assess for young adults at risk of intra-articular hip disorders prior to the onset of degenerative changes. PMID:20970757
Validity and reliability of a low-cost digital dynamometer for measuring isometric strength of lower limb.

PubMed

Romero-Franco, Natalia; Jiménez-Reyes, Pedro; Montaño-Munuera, Juan A

2017-11-01

Lower limb isometric strength is a key parameter to monitor the training process or recognise muscle weakness and injury risk. However, valid and reliable methods to evaluate it often require high-cost tools. The aim of this study was to analyse the concurrent validity and reliability of a low-cost digital dynamometer for measuring isometric strength in lower limb. Eleven physically active and healthy participants performed maximal isometric strength for: flexion and extension of ankle, flexion and extension of knee, flexion, extension, adduction, abduction, internal and external rotation of hip. Data obtained by the digital dynamometer were compared with the isokinetic dynamometer to examine its concurrent validity. Data obtained by the digital dynamometer from 2 different evaluators and 2 different sessions were compared to examine its inter-rater and intra-rater reliability. Intra-class correlation (ICC) for validity was excellent in every movement (ICC > 0.9). Intra and inter-tester reliability was excellent for all the movements assessed (ICC > 0.75). The low-cost digital dynamometer demonstrated strong concurrent validity and excellent intra and inter-tester reliability for assessing isometric strength in the main lower limb movements.
Emotional and Behavioral Screener: Test-Retest Reliability, Inter-Rater Reliability, and Convergent Validity

ERIC Educational Resources Information Center

Nordness, Philip D.; Epstein, Michael H.; Cullinan, Douglas; Pierce, Corey D.

2014-01-01

The Emotional and Behavioral Screener (EBS) is a universal screening instrument designed to identify students whose excessive problem behaviors put them at risk of the education disability category of emotional disturbance (ED). This article reports findings from three studies that address the reliability and validity of the EBS. Studies 1 and 2…
The assessment of fidelity in a motor speech-treatment approach

PubMed Central

Hayden, Deborah; Namasivayam, Aravind Kumar; Ward, Roslyn

2015-01-01

Objective To demonstrate the application of the constructs of treatment fidelity for research and clinical practice for motor speech disorders, using the Prompts for Restructuring Oral Muscular Phonetic Targets (PROMPT) Fidelity Measure (PFM). Treatment fidelity refers to a set of procedures used to monitor and improve the validity and reliability of behavioral intervention. While the concept of treatment fidelity has been emphasized in medical and allied health sciences, documentation of procedures for the systematic evaluation of treatment fidelity in Speech-Language Pathology is sparse. Methods The development of the PFM and the iterative process used to improve it are discussed. Further, the PFM is evaluated against recommended measurement strategies documented in the literature. This includes evaluating the appropriateness of goals and objectives, and the training of speech–language pathologists, using direct and indirect procedures. Three expert raters scored the PFM to examine inter-rater reliability. Results Three raters, blinded to each other's scores, completed fidelity ratings on three separate occasions. Inter-rater reliability, using Krippendorff's Alpha, was >80% for the PFM on the final scoring occasion. This indicates strong inter-rater reliability. Conclusion The development of fidelity measures for the training of service providers and treatment delivery is important in specialized treatment approaches where certain ‘active ingredients’ (e.g. specific treatment targets and therapeutic techniques) must be present in order for treatment to be effective. The PFM reflects evidence-based practice by integrating treatment delivery and clinical skill as a single quantifiable metric. PFM enables researchers and clinicians to objectively measure treatment outcomes within the PROMPT approach. PMID:26213623
Validity and reliability of a novel measure of activity performance and participation.

PubMed

Murgatroyd, Phil; Karimi, Leila

2016-01-01

To develop and evaluate an innovative clinician-rated measure, which produces global numerical ratings of activity performance and participation. Repeated measures study with 48 community-dwelling participants investigating clinical sensibility, comprehensiveness, practicality, inter-rater reliability, responsiveness, sensitivity and concurrent validity with Barthel Index. Important clinimetric characteristics including comprehensiveness and ease of use were rated >8/10 by clinicians. Inter-rater reliability was excellent on the summary scores (intraclass correlation of 0.95-0.98). There was good evidence that the new outcome measure distinguished between known high and low functional scoring groups, including both responsiveness to change and sensitivity at the same time point in numerous tests. Concurrent validity with the Barthel Index was fair to high (Spearman Rank Order Correlation 0.32-0.85, p > 0.05). The new measure's summary scores were nearly twice as responsive to change compared with the Barthel Index. Other more detailed data could also be generated by the new measure. The Activity Performance Measure is an innovative outcome instrument that showed good clinimetric qualities in this initial study. Some of the results were strong, given the sample size, and further trial and evaluation is appropriate. Implications for Rehabilitation The Activity Performance Measure is an innovative outcome measure covering activity performance and participation. In an initial evaluation, it showed good clinimetric qualities including responsiveness to change, sensitivity, practicality, clinical sensibility, item coverage, inter-rater reliability and concurrent validity with the Barthel Index. Further trial and evaluation is appropriate.
Reliability of a quantitative clinical posture assessment tool among persons with idiopathic scoliosis.

PubMed

Fortin, Carole; Feldman, Debbie Ehrmann; Cheriet, Farida; Gravel, Denis; Gauthier, Frédérique; Labelle, Hubert

2012-03-01

To determine overall, test-retest and inter-rater reliability of posture indices among persons with idiopathic scoliosis. A reliability study using two raters and two test sessions. Tertiary care paediatric centre. Seventy participants aged between 10 and 20 years with different types of idiopathic scoliosis (Cobb angle 15 to 60°) were recruited from the scoliosis clinic. Based on the XY co-ordinates of natural reference points (e.g., eyes) as well as markers placed on several anatomical landmarks, 32 angular and linear posture indices taken from digital photographs in the standing position were calculated from a specially developed software program. Generalisability theory served to estimate the reliability and standard error of measurement (SEM) for the overall, test-retest and inter-rater designs. Bland and Altman's method was also used to document agreement between sessions and raters. In the random design, dependability coefficients demonstrated a moderate level of reliability for six posture indices (ϕ=0.51 to 0.72) and a good level of reliability for 26 posture indices out of 32 (ϕ≥0.79). Error attributable to marker placement was negligible for most indices. Limits of agreement and SEM values were larger for shoulder protraction, trunk list, Q angle, cervical lordosis and scoliosis angles. The most reproducible indices were waist angles and knee valgus and varus. Posture can be assessed in a global fashion from photographs in persons with idiopathic scoliosis. Despite the good reliability of marker placement, other studies are needed to minimise measurement errors in order to provide a suitable tool for monitoring change in posture over time. Copyright © 2011 Chartered Society of Physiotherapy. Published by Elsevier Ltd. All rights reserved.
Psychometric properties of the Calgary Cambridge guides to assess communication skills of undergraduate medical students.

PubMed

Simmenroth-Nayda, Anne; Heinemann, Stephanie; Nolte, Catharina; Fischer, Thomas; Himmel, Wolfgang

2014-12-06

The aim of this study was to analyse the psychometric properties of the short version of the Calgary Cambridge Guides and to decide whether it can be recommended for use in the assessment of communication skills in young undergraduate medical students. Using a translated version of the Guide, 30 members from the Department of General Practice rated 5 videotaped encounters between students and simulated patients twice. Item analysis was used to detect possible floor and/or ceiling effects. The construct validity was investigated using exploratory factor analysis. Intra-rater reliability was measured at an interval of 3 months; inter-rater reliability was assessed by the intraclass correlation coefficient. The score distribution of the items showed no ceiling or floor effects. Four of the five factors extracted from the factor analysis represented important constructs of doctor-patient communication. The ratings for the first and second round of assessing the videos correlated at 0.75 (p<0.0001). Intraclass correlation coefficients for each item were moderate and ranged from 0.05 to 0.57. Reasonable score distributions of most items without ceiling or floor effects as well as a good test-retest reliability and construct validity recommend the C-CG as an instrument for assessing communication skills in undergraduate medical students. Some deficiencies in inter-rater reliability are a clear indication that raters need thorough instruction before using the C-CG.
Cross-cultural validation of the Persian version of the Functional Independence Measure for patients with stroke.

PubMed

Naghdi, Soofia; Ansari, Noureddin Nakhostin; Raji, Parvin; Shamili, Aryan; Amini, Malek; Hasson, Scott

2016-01-01

To translate and cross-culturally adapt the Functional Independence Measure (FIM) into the Persian language and to test the reliability and validity of the Persian FIM (PFIM) in patients with stroke. In this cross-sectional study carried out in an outpatient stroke rehabilitation center, 40 patients with stroke (mean age 60 years) participated. A standard forward-backward translation method and expert panel validation were followed to develop the PFIM. Two experienced occupational therapists (OTs) assessed the patients independently in all items of the PFIM in a single session for inter-rater reliability. One of the OTs reassessed the patients after 1 week for intra-rater reliability. There were no floor or ceiling effects for the PFIM. Excellent inter-rater and intra-rater reliability was noted for the PFIM total score, motor and cognitive subscales (ICC(agreement) 0.88-0.98). According to the Bland-Altman agreement analysis, there was no systematic bias between raters and within raters. Cronbach's alpha for the PFIM ranged from 0.70 to 0.96. The principal component analysis with varimax rotation indicated a three-factor structure: (1) self-care and mobility; (2) sphincter control and (3) cognitive, which jointly accounted for 74.8% of the total variance. Construct validity was supported by a significant Pearson correlation between the PFIM and the Persian Barthel Index (r = 0.95; p < 0.001). The PFIM is a highly reliable and valid instrument for measuring functional status of Persian patients with stroke. The Functional Independence Measure (FIM) is an outcome measure for disability based on the International Classification of Functioning, Disability and Health (ICF). The FIM was cross-culturally adapted and validated into Persian language. The Persian version of the FIM (PFIM) is reliable and valid for assessing functional status of patients with stroke.
The PFIM can be used in Persian speaking countries to assess the limitations in activities of daily living of patients with stroke.
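A Bland-Altman agreement analysis of the kind mentioned in this record is commonly summarised by the mean difference between raters (the bias) and the 95% limits of agreement. The following is a minimal Python sketch under that assumption; the function name and the example scores are hypothetical and are not taken from the study.

```python
import statistics


def bland_altman(rater_a, rater_b):
    """Return (bias, lower limit, upper limit) for paired scores from two raters.

    Assumes both lists hold one score per patient, in the same patient order.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same patients")
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = statistics.mean(diffs)    # systematic difference between raters
    sd = statistics.stdev(diffs)     # sample SD of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical total scores for six patients (illustrative values only).
scores_rater_1 = [88, 102, 95, 110, 76, 120]
scores_rater_2 = [90, 100, 96, 108, 78, 119]
bias, lower, upper = bland_altman(scores_rater_1, scores_rater_2)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

A bias close to zero, with limits of agreement that are narrow relative to the scale range, is what "no systematic bias between raters" usually refers to.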
The Reliability and Validity of the Computerized Double Inclinometer in Measuring Lumbar Mobility

PubMed Central

MacDermid, Joy Christine; Arumugam, Vanitha; Vincent, Joshua Israel; Carroll, Krista L

2014-01-01

Study Design: Repeated measures reliability/validity study. Objectives: To determine the concurrent validity, test-retest, inter-rater and intra-rater reliability of lumbar flexion and extension measurements using the Tracker M.E. computerized dual inclinometer (CDI) in comparison to the modified-modified Schober (MMS). Summary of Background: Numerous studies have evaluated the reliability and validity of the various methods of measuring spinal motion, but the results are inconsistent. Differences in equipment and techniques make it difficult to correlate results. Methods: Twenty subjects with back pain and twenty without back pain were selected through convenience sampling. Two examiners measured sagittal plane lumbar range of motion for each subject. Two separate tests with the CDI and one test with the MMS were conducted. Each test consisted of three trials. Instrument and examiner order was randomly assigned. Intra-class correlations (ICCs 2, 2 and 2, 2) and Pearson correlation coefficients (r) were used to calculate reliability and concurrent validity respectively. Results: Intra-trial reliability was high to very high for both the CDI (ICCs 0.85 - 0.96) and MMS (ICCs 0.84 - 0.98). However, the reliability was poor to moderate when the CDI unit had to be repositioned either by the same rater (ICCs 0.16 - 0.59) or a different rater (ICCs 0.45 - 0.52). Inter-rater reliability for the MMS was moderate to high (ICCs 0.75 - 0.82), which bettered the moderate correlation obtained for the CDI (ICCs 0.45 - 0.52). Correlations between the CDI and MMS were poor for flexion (0.32; p<0.05) and poor to moderate (-0.42 - -0.51; p<0.05) for extension measurements. Conclusion: When using the CDI, an average of subsequent tests is required to obtain moderate reliability. The MMS was more reliable than the CDI. The MMS and the CDI measure lumbar movement on different metrics that are not highly related to each other. PMID:25352928
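Many records in this listing report intraclass correlation coefficients without showing the calculation. As an illustration only, the sketch below computes the two-way random-effects, absolute-agreement forms from the standard ANOVA mean squares, for a single measurement (often written ICC(2,1)) and for the mean of k raters (ICC(2,k)). It assumes a complete subjects-by-raters table; the function name and the example angles are hypothetical, and the abstract does not state exactly which ICC model was applied.

```python
def icc_two_way_random(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a complete table.

    ratings[i][j] is the score given to subject i by rater j.
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (no replication).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                  # between-subject mean square
    ms_cols = ss_cols / (k - 1)                  # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))    # residual mean square

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


# Hypothetical flexion angles (degrees) for five subjects and two raters.
angles = [[42, 45], [55, 53], [38, 41], [60, 58], [47, 49]]
icc_single, icc_average = icc_two_way_random(angles)
print(f"ICC(2,1)={icc_single:.2f}, ICC(2,k)={icc_average:.2f}")
```

The single-measure form is the relevant one when one rater's score will be used in practice; the average form applies when the mean of several raters or trials is reported.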
Schedule for personality assessment from notes and documents (SPAN-DOC): Preliminary validation, links to the ICD-11 classification of personality disorder, and use in eating disorders.

PubMed

Kim, Youl-Ri; Tyrer, Peter; Lee, Hong-Seock; Kim, Sung-Gon; Connan, Frances; Kinnaird, Emma; Olajide, Kike; Crawford, Mike

2016-05-01

The underlying core of personality is insufficiently assessed by any single instrument. This has led to the development of instruments adapted for written records in the assessment of personality disorder. To test the construct validity and inter-rater reliability of a new personality assessment method. This study (four parts) assessed the construct validity of the Schedule for Personality Assessment from Notes and Documents (SPAN-DOC), a dimensional assessment from clinical records. We examined inter-rater reliability using case vignettes (Part 1) and convergent validity in three ways: by comparison with NEO Five-Factor Inventory in 130 Korean patients (Part 2), with agreed ICD-11 personality severity levels in two populations (Part 3) and determining its use in assessing the personality status in 90 British patients with eating disorders (Part 4). Internal consistency (alpha = .90) and inter-rater reliability (intraclass correlation coefficient ≥ .88) were satisfactory. Each factor in the five-factor model of personality was correlated with conceptually valid SPAN-DOC variables. The SPAN-DOC domain traits in those with eating disorders were categorized into 3 clusters: self-aggrandisement, emotionally unstable, and anxious/dependent. This study provides preliminary support for the usefulness of SPAN-DOC in the assessment of personality disorder. Copyright © 2016 John Wiley & Sons, Ltd.
Reliability and validity of goniometric iPhone applications for the assessment of active shoulder external rotation.

PubMed

Mitchell, Katy; Gutierrez, Simran Bakshi; Sutton, Stacy; Morton, Stephanie; Morgenthaler, Andrea

2014-10-01

The purpose of this study was to determine the reliability and validity of two smartphone applications, (1) GetMyROM (inclinometry-based) and (2) DrGoniometry (photo-based), in the measurement of active shoulder external rotation (ER) as compared to standard goniometry (SG). Ninety-four Texas Woman's University Doctor of Physical Therapy students from the School of Physical Therapy, Houston campus, were recruited to participate in this study. Two iPhone applications were compared to SG using both novice and experienced raters. Active shoulder ER range of motion was measured over two time periods in random order by blinded novice and experienced raters. Intra-rater reliability using novice raters for the two applications ranged from an intraclass correlation coefficient (ICC) of 0.79 to 0.81 with SG at 0.82. Inter-rater reliability (novice/expert) for the two applications ranged from an ICC of 0.92 to 0.94 with SG at 0.91. Concurrent validity (when compared to SG) ranged from 0.93 to 0.94. There were no significant differences between the novice and experienced raters. Both applications were found to be reliable and comparable to SG. A photo-based application potentially offers a superior method of measurement as visualizing the landmarks may be simplified in this format and it provides a record of measurement. Further study using patient populations may find the two studied applications are useful as an adjunct for clinical practice.
Reliability and convergent validity of the five-step test in people with chronic stroke.

PubMed

Ng, Shamay S M; Tse, Mimi M Y; Tam, Eric W C; Lai, Cynthia Y Y

2018-01-10

(i) To estimate the intra-rater, inter-rater and test-retest reliabilities of the Five-Step Test (FST), as well as the minimum detectable change in FST completion times in people with stroke. (ii) To estimate the convergent validity of the FST with other measures of stroke-specific impairments. (iii) To identify the best cut-off times for distinguishing FST performance in people with stroke from that of healthy older adults. A cross-sectional study. University-based rehabilitation centre. Forty-eight people with stroke and 39 healthy controls. None. The FST, along with (for the stroke survivors only) scores on the Fugl-Meyer Lower Extremity Assessment (FMA-LE), the Berg Balance Scale (BBS), Limits of Stability (LOS) tests, and Activities-specific Balance Confidence (ABC) scale were tested. The FST showed excellent intra-rater (intra-class correlation coefficient; ICC = 0.866-0.905), inter-rater (ICC = 0.998), and test-retest (ICC = 0.838-0.842) reliabilities. A minimum detectable change of 9.16 s was found for the FST in people with stroke. The FST correlated significantly with the FMA-LE, BBS, and LOS results in the forward and sideways directions (r = -0.411 to -0.716, p < 0.004). The FST completion time of 13.35 s was shown to discriminate reliably between people with stroke and healthy older adults. The FST is a reliable, easy-to-administer clinical test for assessing stroke survivors' ability to negotiate steps and stairs.
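The record above reports a minimum detectable change (MDC) but not the formula behind it. One widely used convention derives the standard error of measurement (SEM) from the between-subject standard deviation and the ICC, then scales it to a 95% MDC. The sketch below follows that convention as an assumption; the function names and input values are hypothetical and are not the study's data.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement: SD * sqrt(1 - ICC)."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def mdc95(sem):
    """95% minimum detectable change: 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2.0) * sem


# Hypothetical inputs: SD of completion times in seconds and a test-retest ICC.
sd_seconds = 8.0
test_retest_icc = 0.84
sem = sem_from_icc(sd_seconds, test_retest_icc)
print(f"SEM={sem:.2f} s, MDC95={mdc95(sem):.2f} s")
```

The factor sqrt(2) reflects that a change score combines measurement error from two occasions; a change smaller than the MDC cannot be distinguished from that error at the chosen confidence level.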
Structured assessment of current mental state in clinical practice: an international study of the reliability and validity of the Current Psychiatric State interview, CPS-50.

PubMed

Falloon, I R H; Mizuno, M; Murakami, M; Roncone, R; Unoka, Z; Harangozo, J; Pullman, J; Gedye, R; Held, T; Hager, B; Erickson, D; Burnett, K

2005-01-01

To develop a reliable standardized assessment of psychiatric symptoms for use in clinical practice. A 50-item interview, the Current Psychiatric State 50 (CPS-50), was used to assess 237 patients with a range of psychiatric diagnoses. Ratings were made by interviewers after a 2-day training. Comparisons of inter-rater reliability on each item and on eight clinical subscales were made across four international centres and between psychiatrists and non-psychiatrists. A principal components analysis was used to validate these clinical scales. Acceptable inter-rater reliability (intra-class coefficient > 0.80) was found for 46 of the 50 items, and for all eight subscales. There was no difference between centres or between psychiatrists and non-psychiatrists. The principal components analysis factors were similar to the clinical scales. The CPS-50 is a reliable standardized assessment of current mental status that can be used in clinical practice by all mental health professionals after brief training. Blackwell Munksgaard 2004
Inter-rater reliability of motor unit number estimates and quantitative motor unit analysis in the tibialis anterior muscle.

PubMed

Boe, S G; Dalton, B H; Harwood, B; Doherty, T J; Rice, C L

2009-05-01

To establish the inter-rater reliability of decomposition-based quantitative electromyography (DQEMG) derived motor unit number estimates (MUNEs) and quantitative motor unit (MU) analysis. Using DQEMG, two examiners independently obtained a sample of needle and surface-detected motor unit potentials (MUPs) from the tibialis anterior muscle from 10 subjects. Coupled with a maximal M wave, surface-detected MUPs were used to derive a MUNE for each subject and each examiner. Additionally, size-related parameters of the individual MUs were obtained following quantitative MUP analysis. Test-retest MUNE values were similar with high reliability observed between examiners (ICC=0.87). Additionally, MUNE variability from test-retest as quantified by a 95% confidence interval was relatively low (+/-28 MUs). Lastly, quantitative data pertaining to MU size, complexity and firing rate were similar between examiners. MUNEs and quantitative MU data can be obtained with high reliability by two independent examiners using DQEMG. Establishing the inter-rater reliability of MUNEs and quantitative MU analysis using DQEMG is central to the clinical applicability of the technique. In addition to assessing response to treatments over time, multiple clinicians may be involved in the longitudinal assessment of the MU pool of individuals with disorders of the central or peripheral nervous system.
Reproducibility of thoracic kyphosis measurements in patients with adolescent idiopathic scoliosis.

PubMed

Ohrt-Nissen, Søren; Cheung, Jason Pui Yin; Hallager, Dennis Winge; Gehrchen, Martin; Kwan, Kenny; Dahl, Benny; Cheung, Kenneth M C; Samartzis, Dino

2017-01-01

Current surgical treatment for adolescent idiopathic scoliosis (AIS) involves correction in both the coronal and sagittal plane, and thorough assessment of these parameters is essential for evaluation of surgical results. However, various definitions of thoracic kyphosis (TK) have been proposed, and the intra- and inter-rater reproducibility of these measures has not been determined. As such, the purpose of the current study was to determine the intra- and inter-rater reproducibility of several TK measurements used in the assessment of AIS. Twenty patients (90% females) surgically treated for AIS with alternate-level pedicle screw fixation were included in the study. Three raters independently evaluated pre- and postoperative standing lateral plain radiographs. For each radiograph, several definitions of TK were measured as well as L1-S1 and nonfixed lumbar lordosis. All variables were measured twice 14 days apart, and a mixed effects model was used to determine the repeatability coefficient (RC), which is a measure of the agreement between repeated measurements. Also, the intra- and inter-rater intra-class correlation coefficient (ICC) was determined as a measure of reliability. Preoperative median Cobb angle was 58° (range 41°-86°), and median surgical curve correction was 68% (range 49-87%). Overall intra-rater RC was highest for T2-T12 and nonfixed TK (11°) and lowest for T4-T12 and T5-T12 (8°). Inter-rater RC was highest for T1-T12, T1-nonfixed, and nonfixed TK (13°) and lowest for T5-T12 (9°). Agreement varied substantially between pre- and postoperative radiographs. Inter-rater ICC was highest for T4-T12 (0.92; 95% CI 0.88-0.95) and T5-T12 (0.92; 95% CI 0.88-0.95) and lowest for T1-nonfixed (0.80; 95% CI 0.72-0.88). Considerable variation for all TK measurements was noted. Intra- and inter-rater reproducibility was best for T4-T12 and T5-T12. Future studies should consider adopting a relevant minimum difference as a limit for true change in TK.
Measuring symptoms and functioning of youth with ADHD in middle schools.

PubMed

Evans, Steven W; Allen, Jessica; Moore, Sheryle; Strauss, Victoria

2005-12-01

Reliable and valid means of evaluating the effectiveness of school-based treatments and of completing diagnostic evaluations of middle school aged students are needed. The present study examined the inter-rater agreement of teacher ratings and the relationship between ratings and observational data in a middle school setting. The data are interpreted in the context of differences between a secondary and elementary school setting. Teacher ratings and observational data were collected regularly over the course of two academic years for middle school students diagnosed with ADHD. The results indicate low rates of inter-rater agreement as well as low rates of agreement between teachers and observational data, and between observational data collected in different classrooms. Inter-rater agreement was lowest in late fall and gradually increased over the second half of the year. Implications for conducting treatment outcome evaluations of school-based treatment programs and diagnostic evaluations are discussed.
Inter-rater agreement on PIVC-associated phlebitis signs, symptoms and scales.

PubMed

Marsh, Nicole; Mihala, Gabor; Ray-Barruel, Gillian; Webster, Joan; Wallis, Marianne C; Rickard, Claire M

2015-10-01

Many peripheral intravenous catheter (PIVC) infusion phlebitis scales and definitions are used internationally, although no existing scale has demonstrated comprehensive reliability and validity. We examined inter-rater agreement between registered nurses on signs, symptoms and scales commonly used in phlebitis assessment. Seven PIVC-associated phlebitis signs/symptoms (pain, tenderness, swelling, erythema, palpable venous cord, purulent discharge and warmth) were observed daily by two raters (a research nurse and registered nurse). These data were modelled into phlebitis scores using 10 different tools. Proportions of agreement (e.g. positive, negative), observed and expected agreements, Cohen's kappa, the maximum achievable kappa, prevalence- and bias-adjusted kappa were calculated. Two hundred ten patients were recruited across three hospitals, with 247 sets of paired observations undertaken. The second rater was blinded to the first's findings. The Catney and Rittenberg scales were the most sensitive (phlebitis in >20% of observations), whereas the Curran, Lanbeck and Rickard scales were the most restrictive (≤2% phlebitis). Only tenderness and the Catney (one of pain, tenderness, erythema or palpable cord) and Rittenberg scales (one of erythema, swelling, tenderness or pain) had acceptable (more than two-thirds, 66.7%) levels of inter-rater agreement. Inter-rater agreement for phlebitis assessment signs/symptoms and scales is low. This likely contributes to the high degree of variability in phlebitis rates in literature. We recommend further research into assessment of infrequent signs/symptoms and the Catney or Rittenberg scales. New approaches to evaluating vein irritation that are valid, reliable and based on their ability to predict complications need exploration. © 2015 John Wiley & Sons, Ltd.
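For a single yes/no sign scored by two raters, the observed agreement, expected agreement, Cohen's kappa and prevalence- and bias-adjusted kappa (PABAK) named in this record can all be obtained from the paired ratings. The sketch below uses the standard two-category definitions; the function name and the example observations are hypothetical and do not reproduce the study data.

```python
def agreement_stats(pairs):
    """pairs: list of (rater_a, rater_b) ratings coded 0 (absent) or 1 (present)."""
    n = len(pairs)
    both_present = sum(1 for a, b in pairs if a == 1 and b == 1)
    both_absent = sum(1 for a, b in pairs if a == 0 and b == 0)
    a_rate = sum(a for a, _ in pairs) / n   # proportion rated present by rater A
    b_rate = sum(b for _, b in pairs) / n   # proportion rated present by rater B

    observed = (both_present + both_absent) / n
    expected = a_rate * b_rate + (1 - a_rate) * (1 - b_rate)
    # Kappa is undefined when chance agreement is already perfect.
    kappa = (observed - expected) / (1 - expected) if expected < 1 else float("nan")
    pabak = 2 * observed - 1                # two-category PABAK
    return {"observed": observed, "expected": expected, "kappa": kappa, "pabak": pabak}


# Hypothetical paired observations of one sign across 20 catheter sites.
observations = [(1, 1)] * 2 + [(0, 0)] * 14 + [(1, 0)] * 2 + [(0, 1)] * 2
for name, value in agreement_stats(observations).items():
    print(f"{name}: {value:.3f}")
```

When a sign is rare, observed agreement can be high while kappa stays low, which is one reason studies such as this report PABAK alongside Cohen's kappa.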

A Reliable, Feasible Method to Observe Neighborhoods at High Spatial Resolution

PubMed Central

Kepper, Maura M.; Sothern, Melinda S.; Theall, Katherine P.; Griffiths, Lauren A.; Scribner, Richard; Tseng, Tung-Sung; Schaettle, Paul; Cwik, Jessica M.; Felker-Kantor, Erica; Broyles, Stephanie T.

2016-01-01

Introduction Systematic social observation (SSO) methods traditionally measure neighborhoods at street level and have been performed reliably using virtual applications to increase feasibility. Research indicates that collection at even higher spatial resolution may better elucidate the health impact of neighborhood factors, but whether virtual applications can reliably capture social determinants of health at the smallest geographic resolution (parcel level) remains uncertain. This paper presents a novel, parcel-level SSO methodology and assesses whether this new method can be collected reliably using Google Street View and is feasible. Methods Multiple raters (N=5) observed 42 neighborhoods. In 2016, inter-rater reliability (observed agreement and kappa coefficient) was compared for four SSO methods: (1) street-level in person; (2) street-level virtual; (3) parcel-level in person; and (4) parcel-level virtual. Intra-rater reliability (observed agreement and kappa coefficient) was calculated to determine whether parcel-level methods produce results comparable to traditional street-level observation. Results Substantial levels of inter-rater agreement were documented across all four methods; all methods had >70% of items with at least substantial agreement. Only physical decay showed higher levels of agreement (83% of items with >75% agreement) for direct versus virtual rating source. Intra-rater agreement comparing street- versus parcel-level methods resulted in observed agreement >75% for all but one item (90%). Conclusions Results support the use of Google Street View as a reliable, feasible tool for performing SSO at the smallest geographic resolution. Validation of a new parcel-level method collected virtually may improve the assessment of social determinants contributing to disparities in health behaviors and outcomes. PMID:27989289
Presentation of the Coding and Assessment System for Narratives of Trauma (CASNOT): Application in Spanish Battered Women and Preliminary Analyses.

PubMed

Fernández-Lansac, Violeta; Crespo, María

2017-07-26

This study introduces a new coding system, the Coding and Assessment System for Narratives of Trauma (CASNOT), to analyse several language domains in narratives of autobiographical memories, especially in trauma narratives. The development of the coding system is described. It was applied to assess positive and traumatic/negative narratives in 50 battered women (trauma-exposed group) and 50 nontrauma-exposed women (control group). Three blind raters coded each narrative. Inter-rater reliability analyses were conducted for the CASNOT language categories (multirater Kfree coefficients) and dimensions (intraclass correlation coefficients). High levels of inter-rater agreement were found for most of the language domains. Categories that did not reach the expected reliability were mainly those related to cognitive processes, which reflects difficulties in operationalizing constructs such as lack of control or helplessness, control or planning, and rationalization or memory elaboration. Applications and limitations of the CASNOT are discussed to enhance narrative measures for autobiographical memories.
Seven Reliability Indices for High-Stakes Decision Making: Description, Selection, and Simple Calculation

ERIC Educational Resources Information Center

Smith, Stacey L.; Vannest, Kimberly J.; Davis, John L.

2011-01-01

The reliability of data is a critical issue in decision-making for practitioners in the school. Percent Agreement and Cohen's kappa are the two most widely reported indices of inter-rater reliability; however, a recent Monte Carlo study on the reliability of multi-category scales found other indices to be more trustworthy given the type of data…
Quantitative Muscle Ultrasonography in Carpal Tunnel Syndrome.

PubMed

Lee, Hyewon; Jee, Sungju; Park, Soo Ho; Ahn, Seung-Chan; Im, Juneho; Sohn, Min Kyun

2016-12-01

To assess the reliability of quantitative muscle ultrasonography (US) in healthy subjects and to evaluate the correlation between quantitative muscle US findings and electrodiagnostic study results in patients with carpal tunnel syndrome (CTS). The clinical significance of quantitative muscle US in CTS was also assessed. Twenty patients with CTS and 20 age-matched healthy volunteers were recruited. All control and CTS subjects underwent a bilateral median and ulnar nerve conduction study (NCS) and quantitative muscle US. Transverse US images of the abductor pollicis brevis (APB) and abductor digiti minimi (ADM) were obtained to measure muscle cross-sectional area (CSA), thickness, and echo intensity (EI). EI was determined using computer-assisted, grayscale analysis. Inter-rater and intra-rater reliability for quantitative muscle US in control subjects, and differences in muscle thickness, CSA, and EI between the CTS patient and control groups were analyzed. Relationships between quantitative US parameters and electrodiagnostic study results were evaluated. Quantitative muscle US had high inter-rater and intra-rater reliability in the control group. Muscle thickness and CSA were significantly decreased, and EI was significantly increased in the APB of the CTS group (all p<0.05). EI demonstrated a significant positive correlation with latency of the median motor and sensory NCS in CTS patients (p<0.05). These findings suggest that quantitative muscle US parameters may be useful for detecting muscle changes in CTS. Further study involving patients with other neuromuscular diseases is needed to evaluate peripheral muscle change using quantitative muscle US.
A validation study of the Keyboard Personal Computer Style instrument (K-PeCS) for use with children.

PubMed

Green, Dido; Meroz, Anat; Margalit, Adi Edit; Ratzon, Navah Z

2012-11-01

This study examines a potential instrument for measurement of typing postures of children. This paper describes inter-rater, test-retest reliability and concurrent validity of the Keyboard Personal Computer Style instrument (K-PeCS), an observational measurement of postures and movements during keyboarding, for use with children. Two trained raters independently rated videos of 24 children (aged 7-10 years). Six children returned one week later to assess test-retest reliability. Concurrent validity was assessed by comparing ratings obtained using the K-PeCS to scores from a 3D motion analysis system. Inter-rater reliability was moderate to high for 12 out of 16 items (Kappa: 0.46 to 1.00; correlation coefficients: 0.77-0.95) and test-retest reliability varied across items (Kappa: 0.25 to 0.67; correlation coefficients: r = 0.20 to r = 0.95). Concurrent validity compared favourably across arm pathlength, wrist extension and ulnar deviation. In light of the limitations of other tools, the K-PeCS offers a fairly affordable, reliable and valid instrument to address the gap for measurement of typing styles of children, despite the shortcomings of some items. However, further research is required to refine the instrument for use in evaluating typing among children. Copyright © 2012 Elsevier Ltd and The Ergonomics Society. All rights reserved.
Systematic reviews need to consider applicability to disadvantaged populations: inter-rater agreement for a health equity plausibility algorithm

PubMed Central

2012-01-01

Background Systematic reviews have been challenged to consider effects on disadvantaged groups. A priori specification of subgroup analyses is recommended to increase the credibility of these analyses. This study aimed to develop and assess inter-rater agreement for an algorithm for systematic review authors to predict whether differences in effect measures are likely for disadvantaged populations relative to advantaged populations (only relative effect measures were addressed). Methods A health equity plausibility algorithm was developed using clinimetric methods with three items based on literature review, key informant interviews and methodology studies. The three items dealt with the plausibility of differences in relative effects across sex or socioeconomic status (SES) due to: 1) patient characteristics; 2) intervention delivery (i.e., implementation); and 3) comparators. Thirty-five respondents (consisting of clinicians, methodologists and research users) assessed the likelihood of differences across sex and SES for ten systematic reviews with these questions. We assessed inter-rater reliability using Fleiss multi-rater kappa. Results The proportion agreement was 66% for patient characteristics (95% confidence interval: 61%-71%), 67% for intervention delivery (95% confidence interval: 62% to 72%) and 55% for the comparator (95% confidence interval: 50% to 60%). Inter-rater kappa, assessed with Fleiss kappa, ranged from 0 to 0.199, representing very low agreement beyond chance. Conclusions Users of systematic reviews rated that important differences in relative effects across sex and socioeconomic status were plausible for a range of individual and population-level interventions. However, there was very low inter-rater agreement for these assessments. There is an unmet need for discussion of plausibility of differential effects in systematic reviews. 
Increased consideration of external validity and applicability to different populations and settings is warranted in systematic reviews to meet this need. PMID:23253632
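As an illustration of the statistic named in this record, Fleiss' multi-rater kappa can be computed from a subjects-by-categories table of rating counts. The following is a minimal Python sketch under stated assumptions, not the authors' analysis code: the function name `fleiss_kappa` and the toy counts are hypothetical, and every subject is assumed to be rated by the same number of raters.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned subject i to category j (hypothetical helper)."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per subject are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject must have the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Agreement among rater pairs, computed separately for each subject.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_i) / n_subjects
    p_expected = sum(p * p for p in p_j)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented toy data: 4 subjects, 2 raters, 2 categories.
toy_counts = [[2, 0], [0, 2], [1, 1], [1, 1]]
print(fleiss_kappa(toy_counts))
```

Kappa rescales observed agreement against the agreement expected by chance, so a raw proportion agreement well above 50% can still correspond to a kappa close to zero, as in the results reported above.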
Systematic reviews need to consider applicability to disadvantaged populations: inter-rater agreement for a health equity plausibility algorithm.

PubMed

Welch, Vivian; Brand, Kevin; Kristjansson, Elizabeth; Smylie, Janet; Wells, George; Tugwell, Peter

2012-12-19

Systematic reviews have been challenged to consider effects on disadvantaged groups. A priori specification of subgroup analyses is recommended to increase the credibility of these analyses. This study aimed to develop and assess inter-rater agreement for an algorithm for systematic review authors to predict whether differences in effect measures are likely for disadvantaged populations relative to advantaged populations (only relative effect measures were addressed). A health equity plausibility algorithm was developed using clinimetric methods with three items based on literature review, key informant interviews and methodology studies. The three items dealt with the plausibility of differences in relative effects across sex or socioeconomic status (SES) due to: 1) patient characteristics; 2) intervention delivery (i.e., implementation); and 3) comparators. Thirty-five respondents (consisting of clinicians, methodologists and research users) assessed the likelihood of differences across sex and SES for ten systematic reviews with these questions. We assessed inter-rater reliability using Fleiss multi-rater kappa. The proportion agreement was 66% for patient characteristics (95% confidence interval: 61%-71%), 67% for intervention delivery (95% confidence interval: 62% to 72%) and 55% for the comparator (95% confidence interval: 50% to 60%). Inter-rater kappa, assessed with Fleiss kappa, ranged from 0 to 0.199, representing very low agreement beyond chance. Users of systematic reviews rated that important differences in relative effects across sex and socioeconomic status were plausible for a range of individual and population-level interventions. However, there was very low inter-rater agreement for these assessments. There is an unmet need for discussion of plausibility of differential effects in systematic reviews. 
Increased consideration of external validity and applicability to different populations and settings is warranted in systematic reviews to meet this need.
Can real time location system technology (RTLS) provide useful estimates of time use by nursing personnel?

PubMed

Jones, Terry L; Schlegel, Cara

2014-02-01

Accurate, precise, unbiased, reliable, and cost-effective estimates of nursing time use are needed to ensure safe staffing levels. Direct observation of nurses is costly, and conventional surrogate measures have limitations. To test the potential of electronic capture of time and motion through real time location systems (RTLS), a pilot study was conducted to assess efficacy (method agreement) of RTLS time use; inter-rater reliability of RTLS time-use estimates; and associated costs. Method agreement was high (mean absolute difference = 28 seconds); inter-rater reliability was high (ICC = 0.81-0.95; mean absolute difference = 2 seconds); and costs for obtaining RTLS time-use estimates on a single nursing unit exceeded $25,000. Continued experimentation with RTLS to obtain time-use estimates for nursing staff is warranted. © 2013 Wiley Periodicals, Inc.
[A systematic social observation tool: methods and results of inter-rater reliability].

PubMed

Freitas, Eulilian Dias de; Camargos, Vitor Passos; Xavier, César Coelho; Caiaffa, Waleska Teixeira; Proietti, Fernando Augusto

2013-10-01

Systematic social observation has been used as a health research methodology for collecting information from the neighborhood physical and social environment. The objectives of this article were to describe the operationalization of direct observation of the physical and social environment in urban areas and to evaluate the instrument's reliability. The systematic social observation instrument was designed to collect information in several domains. A total of 1,306 street segments belonging to 149 different neighborhoods in Belo Horizonte, Minas Gerais, Brazil, were observed. For the reliability study, 149 segments (1 per neighborhood) were re-audited, and Fleiss kappa was used to assess inter-rater agreement. Mean agreement was 0.57 (SD = 0.24); 53% had substantial or almost perfect agreement, and 20.4%, moderate agreement. The instrument appears to be appropriate for observing neighborhood characteristics that are not time-dependent, especially urban services, property characterization, pedestrian environment, and security.
Grant Peer Review: Improving Inter-Rater Reliability with Training

DOE PAGES

Sattler, David N.; McKnight, Patrick E.; Naney, Linda; ...

2015-06-15

In this study, we developed and evaluated a brief training program for grant reviewers that aimed to increase inter-rater reliability, rating scale knowledge, and effort to read the grant review criteria. Enhancing reviewer training may improve the reliability and accuracy of research grant proposal scoring and funding recommendations. Seventy-five Public Health professors from U.S. research universities watched the training video we produced and assigned scores to the National Institutes of Health scoring criteria proposal summary descriptions. For both novice and experienced reviewers, the training video increased scoring accuracy (the percentage of scores that reflect the true rating scale values), inter-rater reliability, and the amount of time reading the review criteria compared to the no video condition. The increase in reliability for experienced reviewers is notable because it is commonly assumed that reviewers, especially those with experience, have good understanding of the grant review rating scale. Our findings suggest that both experienced and novice reviewers who had not received the type of training developed in this study may not have appropriate understanding of the definitions and meaning for each value of the rating scale and that experienced reviewers may overestimate their knowledge of the rating scale. Lastly, the results underscore the benefits of and need for specialized peer reviewer training.
Inter-rater reliability and aspects of validity of the parent-infant relationship global assessment scale (PIR-GAS)

PubMed Central

2013-01-01

Background The Parent-Infant Relationship Global Assessment Scale (PIR-GAS) signifies a conceptually relevant development in the multi-axial, developmentally sensitive classification system DC:0-3R for preschool children. However, information about the reliability and validity of the PIR-GAS is rare. A review of the available empirical studies suggests that in research, PIR-GAS ratings can be based on a ten-minute videotaped interaction sequence. The qualification of raters may be very heterogeneous across studies. Methods To test whether the use of the PIR-GAS still allows for a reliable assessment of the parent-infant relationship, our study compared PIR-GAS ratings based on a full-information procedure across multiple settings with ratings based on a ten-minute video by two doctoral candidates of medicine. For each mother-child dyad at a family day hospital (N = 48), we obtained two video ratings and one full-information rating at admission to therapy and at discharge. This pre-post design allowed for a replication of our findings across the two measurement points. We focused on the inter-rater reliability between the video coders, as well as between the video and full-information procedure, including mean differences and correlations between the raters. Additionally, we examined aspects of the validity of video and full-information ratings based on their correlation with measures of child and maternal psychopathology. Results Our results showed that ten-minute video and full-information PIR-GAS ratings were not interchangeable. Most results at admission could be replicated by the data obtained at discharge. We concluded that a higher degree of standardization of the assessment procedure should increase the reliability of the PIR-GAS, and a more thorough theoretical foundation of the manual should increase its validity. PMID:23705962
A novel computer system for the evaluation of nasolabial morphology, symmetry and aesthetics after cleft lip and palate treatment. Part 1: General concept and validation.

PubMed

Pietruski, Piotr; Majak, Marcin; Debski, Tomasz; Antoszewski, Boguslaw

2017-04-01

The need for a widely accepted method suitable for a multicentre quantitative evaluation of facial aesthetics after surgical treatment of cleft lip and palate (CLP) has been emphasized for years. The aim of this study was to validate a novel computer system 'Analyse It Doc' (A.I.D.) as a tool for objective anthropometric analysis of the nasolabial region. An indirect anthropometric analysis of facial photographs was conducted with the A.I.D. system and Adobe Photoshop/ImageJ software. Intra-rater and inter-rater reliability and the time required for the analysis were estimated separately for each method and compared. Analysis with A.I.D. system was nearly 10-fold faster than that with the reference evaluation method. The A.I.D. system provided strong inter-rater and intra-rater correlations for linear, angular and area measurements of the nasolabial region, as well as a significantly higher accuracy and reproducibility of angular measurements in submental view. No statistically significant inter-method differences were found for other measurements. The hereby presented novel computer system is suitable for simple, time-efficient and reliable multicenter photogrammetric analyses of the nasolabial region in CLP patients and healthy subjects. Copyright © 2017 European Association for Cranio-Maxillo-Facial Surgery. Published by Elsevier Ltd. All rights reserved.
Evaluation of high fidelity patient simulator in assessment of performance of anaesthetists.

PubMed

Weller, J M; Bloch, M; Young, S; Maze, M; Oyesola, S; Wyner, J; Dob, D; Haire, K; Durbridge, J; Walker, T; Newble, D

2003-01-01

There is increasing emphasis on performance-based assessment of clinical competence. The High Fidelity Patient Simulator (HPS) may be useful for assessment of clinical practice in anaesthesia, but needs formal evaluation of validity, reliability, feasibility and effect on learning. We set out to assess the reliability of a global rating scale for scoring simulator performance in crisis management. Using a global rating scale, three judges independently rated videotapes of anaesthetists in simulated crises in the operating theatre. Five anaesthetists then independently rated subsets of these videotapes. There was good agreement between raters for medical management, behavioural attributes and overall performance. Agreement was high for both the initial judges and the five additional raters. Using a global scale to assess simulator performance, we found good inter-rater reliability for scoring performance in a crisis. We estimate that two judges should provide a reliable assessment. High fidelity simulation should be studied further for assessing clinical performance.
Reliability of Multi-Category Rating Scales

ERIC Educational Resources Information Center

Parker, Richard I.; Vannest, Kimberly J.; Davis, John L.

2013-01-01

The use of multi-category scales is increasing for the monitoring of IEP goals, classroom and school rules, and Behavior Improvement Plans (BIPs). Although they require greater inference than traditional data counting, little is known about the inter-rater reliability of these scales. This simulation study examined the performance of nine…
Reliability and validity of an iPhone® application for the measurement of lumbar spine flexion and extension range of motion

PubMed Central

Pourahmadi, Mohammad Reza; Jannati, Elham; Mohseni-Bandpei, Mohammad Ali; Ebrahimi Takamjani, Ismail; Rajabzadeh, Fatemeh

2016-01-01

Background Measurement of lumbar spine range of motion (ROM) is often considered to be an essential component of lumbar spine physiotherapy and orthopedic assessment. The measurement can be carried out with various instruments such as inclinometers and goniometers. Recent smartphones have been equipped with accelerometers and magnetometers, which, through specific software applications (apps), can be used for inclinometric functions. Purpose The main purpose was to investigate the reliability and validity of an iPhone® app (TiltMeter© -advanced level and inclinometer) for measuring standing lumbar spine flexion–extension ROM in asymptomatic subjects. Design A cross-sectional study was carried out. Setting This study was conducted in a physiotherapy clinic located at School of Rehabilitation Sciences, Iran University of Medical Science and Health Services, Tehran, Iran. Subjects A convenience sample of 30 asymptomatic adults (15 males; 15 females; age range = 18–55 years) was recruited between August 2015 and December 2015. Methods Following a 2-minute warm-up, the subjects were asked to stand in a relaxed position and their skin was marked at the T12–L1 and S1–S2 spinal levels. From this position, they were asked to perform maximum lumbar flexion followed by maximum lumbar extension with their knees straight. Two blinded raters each used an inclinometer and the iPhone® app to measure lumbar spine flexion–extension ROM. A third rater read the measured angles. To calculate total lumbar spine flexion–extension ROM, the measurement from S1–S2 was subtracted from T12–L1. The second (2 hours later) and third (48 hours later) sessions were carried out in the same manner as the first session. All of the measurements were conducted 3 times and the mean value of 3 repetitions for each measurement was used for analysis. Intraclass correlation coefficient (ICC) models (3, k) and (2, k) were used to determine the intra-rater and inter-rater reliability, respectively. 
The Pearson correlation coefficients were used to establish concurrent validity of the iPhone® app. Furthermore, minimum detectable change at the 95% confidence level (MDC95) was computed as 1.96 × standard error of measurement × $\sqrt{2}$. Results Good to excellent intra-rater and inter-rater reliability were demonstrated for both the gravity-based inclinometer with ICC values of ≥0.84 and ≥0.77 and the iPhone® app with ICC values of ≥0.85 and ≥0.85, respectively. The MDC95 ranged from 5.82° to 8.18° for the intra-rater analysis and from 7.38° to 8.66° for the inter-rater analysis. The concurrent validity for flexion and extension between the 2 instruments was 0.85 and 0.91, respectively. Conclusions The iPhone® app possesses good to excellent intra-rater and inter-rater reliability and concurrent validity. It seems that the iPhone® app can be used for the measurement of lumbar spine flexion–extension ROM. Level of evidence IIb. PMID:27635328
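The MDC95 definition given in this record (1.96 × standard error of measurement × √2) can be sketched in a few lines of Python. This is an illustrative sketch only. The abstract does not say how the standard error of measurement (SEM) was obtained, so the helper `sem_from_icc` assumes the commonly used convention SEM = SD × √(1 − ICC). The function names and the input values are hypothetical and are not taken from the study.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement from a sample SD and an ICC.
    Assumption: SEM = SD * sqrt(1 - ICC), a common convention that the
    abstract itself does not state."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC is expected to lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def mdc95(sem):
    """Minimum detectable change at the 95% confidence level,
    following the definition in the abstract: 1.96 * SEM * sqrt(2)."""
    return 1.96 * sem * math.sqrt(2.0)


# Hypothetical inputs: SD of 10 degrees and an ICC of 0.84.
example_sem = sem_from_icc(sd=10.0, icc=0.84)
print(round(example_sem, 2), round(mdc95(example_sem), 2))
```

The √2 factor reflects that a change score is the difference of two measurements, each carrying its own measurement error.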
Glioblastoma Segmentation: Comparison of Three Different Software Packages.

PubMed

Fyllingen, Even Hovig; Stensjøen, Anne Line; Berntsen, Erik Magnus; Solheim, Ole; Reinertsen, Ingerid

2016-01-01

To facilitate a more widespread use of volumetric tumor segmentation in clinical studies, there is an urgent need for reliable, user-friendly segmentation software. The aim of this study was therefore to compare three different software packages for semi-automatic brain tumor segmentation of glioblastoma, namely BrainVoyager™ QX, ITK-Snap and 3D Slicer, and to make data available for future reference. Pre-operative, contrast enhanced T1-weighted 1.5 or 3 Tesla Magnetic Resonance Imaging (MRI) scans were obtained in 20 consecutive patients who underwent surgery for glioblastoma. MRI scans were segmented twice in each software package by two investigators. Intra-rater, inter-rater and between-software agreement was compared by using differences of means with 95% limits of agreement (LoA), Dice's similarity coefficients (DSC) and Hausdorff distance (HD). Time expenditure of segmentations was measured using a stopwatch. Eighteen tumors were included in the analyses. Inter-rater agreement was highest for BrainVoyager with difference of means of 0.19 mL and 95% LoA from -2.42 mL to 2.81 mL. Between-software agreement and 95% LoA were very similar for the different software packages. Intra-rater, inter-rater and between-software DSC were ≥ 0.93 in all analyses. Time expenditure was approximately 41 min per segmentation in BrainVoyager, and 18 min per segmentation in both 3D Slicer and ITK-Snap. Our main findings were that there is a high agreement within and between the software packages in terms of small intra-rater, inter-rater and between-software differences of means and high Dice's similarity coefficients. Time expenditure was highest for BrainVoyager, but all software packages were relatively time-consuming, which may limit usability in an everyday clinical setting.
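Two of the agreement measures named in this record, Dice's similarity coefficient and the 95% limits of agreement, are simple enough to sketch in Python. The code below is a hedged illustration rather than the study's pipeline. Segmentations are represented as sets of voxel indices, which is an assumption made for brevity. The limits of agreement follow the usual Bland-Altman form, mean difference ± 1.96 × SD of the differences, and all names and numbers are hypothetical.

```python
from statistics import mean, stdev


def dice_coefficient(voxels_a, voxels_b):
    """Dice similarity coefficient 2|A ∩ B| / (|A| + |B|) for two
    segmentations given as sets of voxel indices."""
    if not voxels_a and not voxels_b:
        return 1.0  # convention chosen here: two empty masks agree fully
    overlap = len(voxels_a & voxels_b)
    return 2.0 * overlap / (len(voxels_a) + len(voxels_b))


def limits_of_agreement(first, second):
    """Mean difference and 95% limits of agreement for paired
    measurements (e.g. tumor volumes in mL from two raters)."""
    diffs = [a - b for a, b in zip(first, second)]
    centre = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return centre, centre - spread, centre + spread


# Invented example: two raters segment the same small lesion.
rater_1 = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
rater_2 = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
print(dice_coefficient(rater_1, rater_2))
print(limits_of_agreement([10.0, 12.0, 9.0, 11.0], [9.0, 13.0, 8.0, 12.0]))
```

The Dice coefficient measures spatial overlap, whereas the limits of agreement describe how far paired volume estimates can be expected to differ, which is why the record reports both.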
Development and Reliability Testing of the FEDS System for Classifying Glenohumeral Instability

PubMed Central

Kuhn, John E.; Helmer, Tara T.; Dunn, Warren R.; Throckmorton V, Thomas W.

2010-01-01

Background Classification systems for glenohumeral instability (GHI) are opinion based, not validated, and poorly defined. This study is designed to methodologically develop and test a GHI classification system. Methods: Classification System Development A systematic literature review identified 18 systems for classifying GHI. The frequency of each characteristic used was recorded. Additionally, 31 members of the American Shoulder and Elbow Surgeons responded to a survey to identify features important to characterize GHI. Frequency, Etiology, Direction, and Severity (FEDS) were found to be most important. Frequency was defined as solitary (one episode), occasional (2–5x/year), or frequent (>5x/year). Etiology was defined as traumatic or atraumatic. Direction referred to the primary direction of instability (anterior, posterior, or inferior). Severity was defined as either subluxation or dislocation. Methods: Reliability Testing Fifty GHI patients completed a questionnaire at their initial visit. One of six sports medicine fellowship trained physicians completed a similar questionnaire after examining the patient. Patients returned after two weeks and were examined by the original physician and two other physicians. Inter- and intra-rater agreement for the FEDS classification system was calculated. Results Agreement between patients and physicians was lowest for frequency (39%; k=0.130) and highest for direction (82%; k=0.636). Physician intra-rater agreement was 84–97% for the individual FEDS characteristics (k=0.69 to 0.87). Physician inter-rater agreement ranged from 82–90% (k=0.44 to 0.76). Conclusions The FEDS system has content validity and is highly reliable for classifying GHI. Physical examination using provocative testing to determine the primary direction of instability produces very high levels of inter- and intra-rater agreement. Level of evidence Level II, Development of Diagnostic Criteria with Consecutive Series of Patients, Diagnosis Study. PMID:21277809
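This record reports percent agreement alongside kappa values. The relationship between the two can be sketched for the two-rater case with Cohen's kappa. The abstract does not state which kappa variant was used, so treating it as unweighted Cohen's kappa is an assumption. The function name `cohen_kappa` and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters who classified the same
    cases. Returns (percent_agreement, kappa)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the two raters' marginal proportions,
    # summed over every category either rater used.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(
        freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return observed, (observed - expected) / (1.0 - expected)


# Invented ratings of instability direction for six cases.
physician_1 = ["anterior", "anterior", "posterior", "anterior", "inferior", "anterior"]
physician_2 = ["anterior", "anterior", "posterior", "posterior", "inferior", "anterior"]
print(cohen_kappa(physician_1, physician_2))
```

Because kappa discounts agreement that would occur by chance, two characteristics with similar percent agreement can have quite different kappa values when their category distributions differ.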
Reliability and Validity of the Dutch Physical Activity Questionnaires for Children (PAQ-C) and Adolescents (PAQ-A).

PubMed

Bervoets, Liene; Van Noten, Caroline; Van Roosbroeck, Sofie; Hansen, Dominique; Van Hoorenbeeck, Kim; Verheyen, Els; Van Hal, Guido; Vankerckhoven, Vanessa

2014-01-01

This study was designed to validate the Dutch Physical Activity Questionnaires for Children (PAQ-C) and Adolescents (PAQ-A). After adjustment of the original Canadian PAQ-C and PAQ-A (i.e. translation/back-translation and evaluation by expert committee), content validity of both PAQs was assessed and calculated using item-level (I-CVI) and scale-level (S-CVI) content validity indexes. Inter-item and inter-rater reliability of 196 PAQ-C and 95 PAQ-A filled in by both children or adolescents and their parent, were evaluated. Inter-item reliability was calculated by Cronbach's alpha (α) and inter-rater reliability was examined by percent observed agreement and weighted kappa (κ). Concurrent validity of PAQ-A was examined in a subsample of 28 obese and 16 normal-weight children by comparing it with concurrently measured physical activity using a maximal cardiopulmonary exercise test for the assessment of peak oxygen uptake (VO2 peak). For both PAQs, I-CVI ranged 0.67-1.00. S-CVI was 0.89 for PAQ-C and 0.90 for PAQ-A. A total of 192 PAQ-C and 94 PAQ-A were fully completed by both child and parent. Cronbach's α was 0.777 for PAQ-C and 0.758 for PAQ-A. Percent agreement ranged 59.9-74.0% for PAQ-C and 51.1-77.7% for PAQ-A, and weighted κ ranged 0.48-0.69 for PAQ-C and 0.51-0.68 for PAQ-A. The correlation between total PAQ-A score and VO2 peak - corrected for age, gender, height and weight - was 0.516 (p = 0.001). Both PAQs have an excellent content validity, an acceptable inter-item reliability and a moderate to good strength of inter-rater agreement. In addition, total PAQ-A score showed a moderate positive correlation with VO2 peak. Both PAQs have an acceptable to good reliability and validity, however, further validity testing is recommended to provide a more complete assessment of both PAQs.
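The inter-item reliability statistic used in this record, Cronbach's alpha, follows the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The Python sketch below is a minimal illustration with invented scores. The function name `cronbach_alpha` is hypothetical, and the use of sample variances is a choice made here, not something the abstract specifies.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for item_scores[i][r], the score of respondent r
    on item i. Uses sample variances throughout."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    n_respondents = len(item_scores[0])
    if any(len(item) != n_respondents for item in item_scores):
        raise ValueError("every item needs a score from every respondent")

    # Total questionnaire score per respondent.
    totals = [sum(scores) for scores in zip(*item_scores)]
    sum_item_variances = sum(variance(item) for item in item_scores)
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1.0 - sum_item_variances / total_variance)


# Invented data: 2 items answered by 4 respondents.
print(cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]]))
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is read as a measure of internal consistency rather than of agreement between raters.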
Determinants of the reliability of ultrasound tomography sound speed estimates as a surrogate for volumetric breast density

PubMed Central

Khodr, Zeina G.; Sak, Mark A.; Pfeiffer, Ruth M.; Duric, Nebojsa; Littrup, Peter; Bey-Knight, Lisa; Ali, Haythem; Vallieres, Patricia; Sherman, Mark E.; Gierach, Gretchen L.

2015-01-01

Purpose: High breast density, as measured by mammography, is associated with increased breast cancer risk, but standard methods of assessment have limitations including 2D representation of breast tissue, distortion due to breast compression, and use of ionizing radiation. Ultrasound tomography (UST) is a novel imaging method that averts these limitations and uses sound speed measures rather than x-ray imaging to estimate breast density. The authors evaluated the reproducibility of measures of speed of sound and changes in this parameter using UST. Methods: One experienced and five newly trained raters measured sound speed in serial UST scans for 22 women (two scans per person) to assess inter-rater reliability. Intrarater reliability was assessed for four raters. A random effects model was used to calculate the percent variation in sound speed and change in sound speed attributable to subject, scan, rater, and repeat reads. The authors estimated the intraclass correlation coefficients (ICCs) for these measures based on data from the authors’ experienced rater. Results: Median (range) time between baseline and follow-up UST scans was five (1–13) months. Contributions of factors to sound speed variance were differences between subjects (86.0%), baseline versus follow-up scans (7.5%), inter-rater evaluations (1.1%), and intrarater reproducibility (∼0%). When evaluating change in sound speed between scans, 2.7% and ∼0% of variation were attributed to inter- and intrarater variation, respectively. For the experienced rater’s repeat reads, agreement for sound speed was excellent (ICC = 93.4%) and for change in sound speed substantial (ICC = 70.4%), indicating very good reproducibility of these measures. Conclusions: UST provided highly reproducible sound speed measurements, which reflect breast density, suggesting that UST has utility in sensitively assessing change in density. PMID:26429241
Determinants of the reliability of ultrasound tomography sound speed estimates as a surrogate for volumetric breast density

DOE Office of Scientific and Technical Information (OSTI.GOV)

Khodr, Zeina G.; Pfeiffer, Ruth M.; Gierach, Gretchen L., E-mail: GierachG@mail.nih.gov

Purpose: High breast density, as measured by mammography, is associated with increased breast cancer risk, but standard methods of assessment have limitations including 2D representation of breast tissue, distortion due to breast compression, and use of ionizing radiation. Ultrasound tomography (UST) is a novel imaging method that averts these limitations and uses sound speed measures rather than x-ray imaging to estimate breast density. The authors evaluated the reproducibility of measures of speed of sound and changes in this parameter using UST. Methods: One experienced and five newly trained raters measured sound speed in serial UST scans for 22 women (two scans per person) to assess inter-rater reliability. Intrarater reliability was assessed for four raters. A random effects model was used to calculate the percent variation in sound speed and change in sound speed attributable to subject, scan, rater, and repeat reads. The authors estimated the intraclass correlation coefficients (ICCs) for these measures based on data from the authors’ experienced rater. Results: Median (range) time between baseline and follow-up UST scans was five (1–13) months. Contributions of factors to sound speed variance were differences between subjects (86.0%), baseline versus follow-up scans (7.5%), inter-rater evaluations (1.1%), and intrarater reproducibility (∼0%). When evaluating change in sound speed between scans, 2.7% and ∼0% of variation were attributed to inter- and intrarater variation, respectively. For the experienced rater’s repeat reads, agreement for sound speed was excellent (ICC = 93.4%) and for change in sound speed substantial (ICC = 70.4%), indicating very good reproducibility of these measures. Conclusions: UST provided highly reproducible sound speed measurements, which reflect breast density, suggesting that UST has utility in sensitively assessing change in density.

The reliability of the Adelaide in-shoe foot model.

PubMed

Bishop, Chris; Hillier, Susan; Thewlis, Dominic

2017-07-01

Understanding the biomechanics of the foot is essential for many areas of research and clinical practice such as orthotic interventions and footwear development. Despite the widespread attention paid to the biomechanics of the foot during gait, what largely remains unknown is how the foot moves inside the shoe. This study investigated the reliability of the Adelaide In-Shoe Foot Model, which was designed to quantify in-shoe foot kinematics and kinetics during walking. Intra-rater reliability was assessed in 30 participants over five walking trials whilst wearing shoes during two data collection sessions, separated by one week. Sufficient reliability for use was interpreted as a coefficient of multiple correlation and intra-class correlation coefficient of >0.61. Inter-rater reliability was investigated separately in a second sample of 10 adults by two researchers with experience in applying markers for the purpose of motion analysis. The results indicated good consistency in waveform estimation for most kinematic and kinetic data, as well as good inter- and intra-rater reliability. The exceptions were the peak medial ground reaction force, the minimum abduction angle and the peak abduction/adduction external hindfoot joint moments, which showed less than acceptable repeatability. Based on our results, the Adelaide in-shoe foot model can be used with confidence for 24 commonly measured biomechanical variables during shod walking. Copyright © 2017 Elsevier B.V. All rights reserved.
Delirium diagnosis methodology used in research: a survey-based study.

PubMed

Neufeld, Karin J; Nelliot, Archana; Inouye, Sharon K; Ely, E Wesley; Bienvenu, O Joseph; Lee, Hochang Benjamin; Needham, Dale M

2014-12-01

To describe methodology used to diagnose delirium in research studies evaluating delirium detection tools. The authors used a survey to address reference rater methodology for delirium diagnosis, including rater characteristics, sources of patient information, and diagnostic process, completed via web or telephone interview according to respondent preference. Participants were authors of 39 studies included in three recent systematic reviews of delirium detection instruments in hospitalized patients. Authors from 85% (N = 33) of the 39 eligible studies responded to the survey. The median number of raters per study was 2.5 (interquartile range: 2-3); 79% were physicians. The raters' median duration of clinical experience with delirium diagnosis was 7 years (interquartile range: 4-10), with 5% having no prior clinical experience. Inter-rater reliability was evaluated in 70% of studies. Cognitive tests and delirium detection tools were used in the delirium reference rating process in 61% (N = 21) and 45% (N = 15) of studies, respectively, with 33% (N = 11) using both and 27% (N = 9) using neither. When patients were too drowsy or declined to participate in delirium evaluation, 70% of studies (N = 23) used all available information for delirium diagnosis, whereas 15% excluded such patients. Significant variability exists in reference standard methods for delirium diagnosis in published research. Increasing standardization by documenting inter-rater reliability, using standardized cognitive and delirium detection tools, incorporating diagnostic expert consensus panels, and using all available information in patients declining or unable to participate with formal testing may help advance delirium research by increasing consistency of case detection and improving generalizability of research results. Copyright © 2014 American Association for Geriatric Psychiatry. Published by Elsevier Inc. All rights reserved.
[Inter-rater reliability and construct validity of the OPD-CA axis structure: first study results regarding the integration of OPD-CA into clinical practice].

PubMed

Cropp, Carola; Salzer, Simone; Häusser, Leonard F; Streeck-Fischer, Annette

2013-01-01

The axis structure of the Operationalized Psychodynamic Diagnostics in childhood and adolescence (OPD-CA) has proven to be a reliable and valid diagnostic tool under research conditions. However, corresponding data regarding the integration of OPD-CA axis structure into clinical practice is still lacking. Hence, this aspect was examined as part of a randomized controlled clinical trial realized at Asklepios Fachklinikum Tiefenbrunn. Here, the OPD-CA axis structure has been applied to assess the structural level of 42 adolescent patients (15-19 years). In contrast to previous studies, the assessment was not carried out by independent raters using a videotaped OPD-CA interview, but the rating was part of clinical routine procedures. Also under these conditions, inter-rater reliability was high, in particular regarding the four subscales of the OPD-CA axis structure. With respect to construct validity, the results of our study supported a two-factor solution, which is in accordance with the findings of two previous works. One factor corresponded to the dimension "self-regulation" while the other factor included both the dimension "self-perception and object perception" as well as the dimension "communication skills". Implications of the findings for research and practice are discussed.
PubMed

Brosseau, Lucie; Laroche, Chantal; Guitard, Paulette; King, Judy; Poitras, Stéphane; Casimiro, Lynn; Barette, Julie Alexandra; Cardinal, Dominique; Cavallo, Sabrina; Laferrière, Lucie; Martini, Rose; Champoux, Nicholas; Taverne, Jennifer; Paquette, Chanyque; Tremblay, Sébastien; Sutton, Ann; Galipeau, Roseline; Tourigny, Jocelyne; Toupin-April, Karine; Loew, Laurianne; Demers, Catrine; Sauvé-Schenk, Katrine; Paquet, Nicole; Savard, Jacinthe; Lagacé, Josée; Pharand, Denyse; Vaillancourt, Véronique

2017-01-01

Objectives: The primary objective was to produce a French-Canadian translation of AMSTAR (a measurement tool to assess systematic reviews) and to examine the validity of the translation's contents. The secondary and tertiary objectives were to assess the inter-rater reliability and factorial construct validity of this French-Canadian version of AMSTAR. Methods: A modified approach to Vallerand's methodology (1989) for cross-cultural validation was used. First, a parallel back-translation of AMSTAR was performed, by both professionals and future professionals. Next, a first committee of experts (P1) examined the translations to create a first draft of the French-Canadian version of the AMSTAR tool. This draft was then evaluated and modified by a second committee of experts (P2). Following that, 18 future professionals (master's students in physiotherapy) rated this second draft of the instrument for clarity using a seven-point scale (1: very clear; 7: very ambiguous). Lastly, the principal co-investigators reviewed the problematic elements and proposed final changes. Four independent raters used this French-Canadian version of AMSTAR to assess 20 systematic reviews that were published in French after the year 2000. An intraclass correlation coefficient (ICC) and kappa coefficient were calculated to measure the tool's inter-rater reliability. A Cronbach's alpha coefficient was also calculated to measure internal consistency. In addition, factor analysis was used to evaluate construct validity in order to determine the number of dimensions. Results: The statements on the final version of the AMSTAR tool received an average ambiguity rating of between 1.0 and 1.4. No statement received an average rating above 1.4, which indicates a high level of clarity. Inter-rater reliability (n = 4) for the instrument's total score was moderate, with an intraclass correlation coefficient of 0.61 (95% confidence interval [CI]: 0.29, 0.97). 
Inter-rater reliability for 82% of the individual items was good, according to the kappa values obtained. Internal consistency was excellent, with a Cronbach's alpha coefficient of 0.91 (95% CI: 0.83, 0.99). The French-Canadian version of AMSTAR is a unidimensional tool, as confirmed by factor analysis and community values greater than 0.30. Conclusion: A valid French-Canadian version of AMSTAR was created using this rigorous five-step process. This version is unidimensional, with moderate inter-rater reliability for the elements overall, and with excellent internal consistency. This tool could be valuable to French-Canadian professionals and researchers, and could also be of interest to the international Francophone community.
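For readers unfamiliar with the internal-consistency statistic reported above, Cronbach's alpha can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' analysis: the function name `cronbach_alpha` and the example ratings are hypothetical, and the record does not say which software or exact variant was used.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one row per assessed unit (e.g. a systematic
    review), one column per instrument item. Sample variances are used.
    """
    k = len(scores[0])                      # number of items
    item_columns = list(zip(*scores))       # transpose rows -> item columns
    item_var_sum = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical ratings: 5 units scored on 3 items (not data from the study).
example = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 4, 5],
    [3, 3, 4],
]
print(round(cronbach_alpha(example), 3))  # about 0.944
```

Values near 1 indicate that the items vary together; the function is undefined when every unit has the same total score, because the total variance is then zero.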
Psychometric properties of the Calgary Cambridge guides to assess communication skills of undergraduate medical students

PubMed Central

Simmenroth-Nayda, Anne; Heinemann, Stephanie; Nolte, Catharina; Fischer, Thomas; Himmel, Wolfgang

2014-01-01

Objectives: The aim of this study was to analyse the psychometric properties of the short version of the Calgary Cambridge Guides and to decide whether it can be recommended for use in the assessment of communication skills in young undergraduate medical students. Methods: Using a translated version of the Guide, 30 members from the Department of General Practice rated 5 videotaped encounters between students and simulated patients twice. Item analysis was used to detect possible floor and/or ceiling effects. The construct validity was investigated using exploratory factor analysis. Intra-rater reliability was measured over an interval of 3 months; inter-rater reliability was assessed by the intraclass correlation coefficient. Results: The score distribution of the items showed no ceiling or floor effects. Four of the five factors extracted from the factor analysis represented important constructs of doctor-patient communication. The ratings for the first and second round of assessing the videos correlated at 0.75 (p < 0.0001). Intraclass correlation coefficients for each item were moderate and ranged from 0.05 to 0.57. Conclusions: Reasonable score distributions of most items without ceiling or floor effects as well as a good test-retest reliability and construct validity recommend the C-CG as an instrument for assessing communication skills in undergraduate medical students. Some deficiencies in inter-rater reliability are a clear indication that raters need a thorough instruction before using the C-CG. PMID:25480988
Mapping the Association of College and Research Libraries information literacy framework and nursing professional standards onto an assessment rubric.

PubMed

Willson, Gloria; Angell, Katelyn

2017-04-01

The authors developed a rubric for assessing undergraduate nursing research papers for information literacy skills critical to their development as researchers and health professionals. We developed a rubric mapping six American Nurses Association professional standards onto six related concepts of the Association of College & Research Libraries (ACRL) Framework for Information Literacy for Higher Education. We used this rubric to evaluate fifty student research papers and assess inter-rater reliability. Students tended to score highest on the "Information Has Value" dimension and lowest on the "Scholarship as Conversation" dimension. However, we found a discrepancy between the grading patterns of the two investigators, with inter-rater reliability being "fair" or "poor" for all six rubric dimensions. The development of a rubric that dually assesses information literacy skills and maps relevant disciplinary competencies holds potential. This study offers a template for a rubric inspired by the ACRL Framework and outside professional standards. However, the overall low inter-rater reliability demands further calibration of the rubric. Following additional norming, this rubric can be used to help students identify the key information literacy competencies that they need in order to succeed as college students and future nurses. These skills include developing an authoritative voice, determining the scope of their information needs, and understanding the ramifications of their information choices.
Specific algorithm method of scoring the Clock Drawing Test applied in cognitively normal elderly

PubMed Central

Mendes-Santos, Liana Chaves; Mograbi, Daniel; Spenciere, Bárbara; Charchat-Fichman, Helenice

2015-01-01

The Clock Drawing Test (CDT) is an inexpensive, fast and easily administered measure of cognitive function, especially in the elderly. This instrument is a popular clinical tool widely used in screening for cognitive disorders and dementia. The CDT can be applied in different ways and scoring procedures also vary. Objective: The aims of this study were to analyze the performance of elderly on the CDT and evaluate inter-rater reliability of the CDT scored by using a specific algorithm method adapted from Sunderland et al. (1989). Methods: We analyzed the CDT of 100 cognitively normal elderly aged 60 years or older. The CDT ("free-drawn") and Mini-Mental State Examination (MMSE) were administered to all participants. Six independent examiners scored the CDT of 30 participants to evaluate inter-rater reliability. Results and Conclusion: A score of 5 on the proposed algorithm ("Numbers in reverse order or concentrated"), equivalent to 5 points on the original Sunderland scale, was the most frequent (53.5%). The CDT specific algorithm method used had high inter-rater reliability (p<0.01), and mean score ranged from 5.06 to 5.96. The high frequency of an overall score of 5 points may suggest the need to create more nuanced evaluation criteria, which are sensitive to differences in levels of impairment in visuoconstructive and executive abilities during aging. PMID:29213954
Mapping the Association of College and Research Libraries information literacy framework and nursing professional standards onto an assessment rubric

PubMed Central

Willson, Gloria; Angell, Katelyn

2017-01-01

Objective: The authors developed a rubric for assessing undergraduate nursing research papers for information literacy skills critical to their development as researchers and health professionals. Methods: We developed a rubric mapping six American Nurses Association professional standards onto six related concepts of the Association of College & Research Libraries (ACRL) Framework for Information Literacy for Higher Education. We used this rubric to evaluate fifty student research papers and assess inter-rater reliability. Results: Students tended to score highest on the “Information Has Value” dimension and lowest on the “Scholarship as Conversation” dimension. However, we found a discrepancy between the grading patterns of the two investigators, with inter-rater reliability being “fair” or “poor” for all six rubric dimensions. Conclusions: The development of a rubric that dually assesses information literacy skills and maps relevant disciplinary competencies holds potential. This study offers a template for a rubric inspired by the ACRL Framework and outside professional standards. However, the overall low inter-rater reliability demands further calibration of the rubric. Following additional norming, this rubric can be used to help students identify the key information literacy competencies that they need in order to succeed as college students and future nurses. These skills include developing an authoritative voice, determining the scope of their information needs, and understanding the ramifications of their information choices. PMID:28377678
Ecologically relevant outcome measure for post-inpatient rehabilitation.

PubMed

Marquez de la Plata, Carlos; Qualls, Devin; Plenger, Patrick; Malec, James F; Hayden, Mary Ellen

2017-01-01

Transfer of skills learned within the clinic environment to patients' home or community is important in post-inpatient brain injury rehabilitation (PBIR). Outcome measures used in PBIR assess level of independence during functional tasks; however, available functional instruments do not quantitate the environment in which the behaviors occur. To examine the reliability and validity of an instrument used to assess patients' functional abilities while quantifying the amount of structure and distractions in the environment. 2501 patients who sustained a traumatic brain injury (TBI) or cerebrovascular accident (CVA) and participated in a multidisciplinary PBIR program between 2006 and 2014 were identified retrospectively for this study. The PERPOS and MPAI-4 were used to assess functional abilities at admission and at discharge. Construct validity was assessed using a bivariate Spearman rho analysis. A subsample of 56 consecutive admissions during 2014 was examined to determine inter-rater reliability. Intra-class correlation coefficient (ICC) and Kappa coefficients assessed inter-rater agreement of the total PERPOS and PERPOS subscales respectively. The PERPOS and MPAI-4 demonstrated a strong negative association among both TBI and CVA patients. Kappa scores for the three PERPOS scales each demonstrated good to excellent inter-rater agreement. The ICC for overall PERPOS scores fell in the good agreement range. The PERPOS can be used reliably in PBIR to quantify patients' functional abilities within the context of environmental demands.
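As a reminder of what a kappa coefficient of the kind mentioned above measures, the following sketch computes unweighted Cohen's kappa for two raters. It is illustrative only: the function name `cohens_kappa` and the category labels are hypothetical, and the record does not state whether the study used a weighted or unweighted form.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    n = len(rater_a)
    # Observed agreement: share of cases where both raters chose the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_chance = sum(counts_a[lab] * counts_b[lab] for lab in labels) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical subscale ratings for 8 cases (not data from the study).
rater_1 = ["low", "low", "mid", "high", "mid", "low", "high", "mid"]
rater_2 = ["low", "mid", "mid", "high", "mid", "low", "high", "low"]
print(round(cohens_kappa(rater_1, rater_2), 2))
```

Kappa is 1 for perfect agreement and 0 when observed agreement equals what the raters' marginal frequencies would produce by chance.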
Reliability of infrared thermometric measurements of skin temperature in the hand.

PubMed

Packham, Tara L; Fok, Diana; Frederiksen, Karen; Thabane, Lehana; Buckley, Norman

2012-01-01

Clinical measurement study. Skin temperature asymmetries (STAs) are used in the diagnosis of complex regional pain syndrome (CRPS), but little evidence exists for reliability of the equipment and methods. This study examined the reliability of an inexpensive infrared (IR) thermometer and measurement points in the hand for the study of STA. ST was measured three times at five points on both hands with an IR thermometer by two raters in 20 volunteers (12 normals and 8 CRPS). ST measurement results using IR thermometers support inter-rater reliability: intraclass correlation coefficient (ICC) estimate for single measures 0.80; all ST measurement points were also highly reliable (ICC single measures, 0.83-0.91). The equipment demonstrated excellent reliability, with little difference in the reliability of the five measurement sites. These preliminary findings support their use in future CRPS research. Not applicable. Copyright © 2012 Hanley & Belfus. Published by Elsevier Inc. All rights reserved.
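The record above reports single-measures ICC values but does not say which ICC model was fitted. As one possibility, the sketch below computes ICC(2,1) in the Shrout and Fleiss notation (two-way random effects, absolute agreement, single rater) from the usual ANOVA mean squares. The choice of model, the function name `icc_2_1`, and the ratings table are assumptions made for illustration.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per subject, one column per rater.
    """
    n = len(ratings)            # subjects
    k = len(ratings[0])         # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(col) / n for col in zip(*ratings)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-subject mean square
    ms_cols = ss_cols / (k - 1)                 # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative table: 6 subjects rated by 4 raters (not data from the study).
table = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(table), 2))  # 0.29
```

Because this form penalises systematic differences between raters, it is typically lower than a consistency-type ICC computed on the same table.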
Effects of questionnaire-based diagnosis and training on inter-rater reliability among practitioners of traditional Chinese medicine.

PubMed

Mist, Scott; Ritenbaugh, Cheryl; Aickin, Mikel

2009-07-01

To investigate whether a training process that focused on a questionnaire-based diagnosis in Traditional Chinese Medicine (TCM), and developing diagnostic consensus, would improve the agreement of TCM diagnoses among 10 TCM practitioners evaluating patients with temporomandibular joint disorder (TMJD). Evaluation of a diagnostic training program at the Department of Family and Community Medicine, University of Arizona, Tucson, Arizona, and the Oregon College of Oriental Medicine, Portland, Oregon. Screened participants for a study of TCM for TMJD. PRACTITIONERS: Ten (10) licensed acupuncturists with a minimum of 5 years licensure and education in Chinese herbs. A training session using a questionnaire-based diagnostic form was conducted, followed by waves of diagnostic sessions. Between sessions, practitioners discussed the results of the previous round of participants with a focus on reducing variability in primary diagnosis and severity rating of each diagnosis: 3 waves of 5 patients were assessed by 4 practitioner pairs for a total of 120 diagnoses. At 18 months, practitioners completed a recalibration exercise with a similar format with a total of 32 diagnoses. These diagnoses were then examined with respect to the rate of agreement among the 10 practitioners using inter-rater correlations and kappas. The inter-rater correlation with respect to the TCM diagnoses among the 10 practitioners increased from 0.112 to 0.618 with training. Statistically significant improvements were found between the baseline and 18 month exercises (p < 0.01). Inter-rater reliability of TCM diagnosis may be improved through a training process and a questionnaire-based diagnosis process. The improvements varied by diagnosis, with the greatest congruence among primary and more severe diagnoses. Future TCM studies should consider including calibration training to improve the validity of results.
A comparison of the reliability of the trochanteric prominence angle test and the alternative method in healthy subjects.

PubMed

Yoon, Tae-Lim; Park, Kyung-Mi; Choi, Sil-Ah; Lee, Ji-Hyun; Jeong, Hyo-Jung; Cynn, Heon-Seock

2014-04-01

A wide range of intra- and inter-rater reliabilities of the trochanteric prominence angle test (TPAT) has been reported. We introduced the transcondylar angle test (TCAT) as an alternative to the TPAT, and the use of a smartphone as a reliable measurement tool for femoral neck anteversion (FNA). The reliabilities of the TPAT and the TCAT, the reliability of using a smartphone as a clinical measurement tool, and the correlation between the difference value of medial knee joint space (KJS) between rest and tested positions and the difference value between the TPAT and TCAT were assessed. Two physical therapists independently determined the reliabilities of the TPAT with a digital inclinometer, the TCAT with a digital inclinometer, and the TCAT with a smartphone in 19 hips of 10 healthy subjects (5 male and 5 female, 22.2 ± 1.69 years). The medial KJS at rest and in the tested position was assessed using sonography. The intra-class correlation coefficients (ICC) for the intra-rater reliabilities of TPAT with a digital inclinometer (ICC = 0.92), TCAT with a digital inclinometer (ICC = 0.94) and a smartphone (ICC = 0.95) in both testers were substantial. The inter-rater reliability of TPAT with a digital inclinometer was fair (ICC = 0.48) while TCAT with a digital inclinometer (ICC = 0.89) and a smartphone (ICC = 0.85) were substantial. The correlation between the difference value of medial KJS between rest and tested positions and the difference value between TPAT and TCAT was low and statistically non-significant (r = 0.114; p = 0.325). The TCAT would be more reliable than the TPAT in inter-rater testing. A smartphone is a clinically comparable measuring tool to a digital inclinometer. Copyright © 2013 Elsevier Ltd. All rights reserved.
Reliability of two social cognition tests: The combined stories test and the social knowledge test.

PubMed

Thibaudeau, Élisabeth; Cellard, Caroline; Legendre, Maxime; Villeneuve, Karèle; Achim, Amélie M

2018-04-01

Deficits in social cognition are common in psychiatric disorders. Validated social cognition measures with good psychometric properties are necessary to assess and target social cognitive deficits. Two recent social cognition tests, the Combined Stories Test (COST) and the Social Knowledge Test (SKT), respectively assess theory of mind and social knowledge. Previous studies have shown good psychometric properties for these tests, but the test-retest reliability has never been documented. The aim of this study was to evaluate the test-retest reliability and the inter-rater reliability of the COST and the SKT. The COST and the SKT were administered twice to a group of forty-two healthy adults, with a delay of approximately four weeks between the assessments. Excellent test-retest reliability was observed for the COST, and a good test-retest reliability was observed for the SKT. There was no evidence of practice effect. Furthermore, an excellent inter-rater reliability was observed for both tests. This study shows a good reliability of the COST and the SKT that adds to the good validity previously reported for these two tests. These good psychometric properties thus support that the COST and the SKT are adequate measures for the assessment of social cognition. Copyright © 2018. Published by Elsevier B.V.
Motivational Interviewing Skills in Health Care Encounters (MISHCE): Development and psychometric testing of an assessment tool.

PubMed

Petrova, Tatjana; Kavookjian, Jan; Madson, Michael B; Dagley, John; Shannon, David; McDonough, Sharon K

2015-01-01

Motivational interviewing (MI) has demonstrated a significant impact as an intervention strategy for addiction management, change in lifestyle behaviors, and adherence to prescribed medication and other treatments. Key elements to studying MI include training in MI of professionals who will use it, assessment of skills acquisition in trainees, and the use of a validated skills assessment tool. The purpose of this research project was to develop a psychometrically valid and reliable tool that has been designed to assess MI skills competence in health care provider trainees. The goal was to develop an assessment tool that would evaluate the acquisition and use of specific MI skills and principles, as well as the quality of the patient-provider therapeutic alliance in brief health care encounters. To address this purpose, specific steps were followed, beginning with a literature review. This review contributed to the development of relevant conceptual and operational definitions, selecting a scaling technique and response format, and methods for analyzing validity and reliability. Internal consistency reliability was established on 88 video recorded interactions. The inter-rater and test-retest reliability were established using 18 interactions randomly selected from the 88. The assessment tool Motivational Interviewing Skills for Health Care Encounters (MISHCE) and a manual for use of the tool were developed. Validity and reliability of MISHCE were examined. Face and content validity were supported with well-defined conceptual and operational definitions and feedback from an expert panel. Reliability was established through internal consistency, inter-rater reliability, and test-retest reliability. The overall internal consistency reliability (Cronbach's alpha) for all fifteen items was 0.75. MISHCE demonstrated good inter-rater reliability and good to excellent test-retest reliability. MISHCE assesses the health provider's level of knowledge and skills in brief disease management encounters. MISHCE also evaluates quality of the patient-provider therapeutic alliance, i.e., the "flow" of the interaction. Copyright © 2015 Elsevier Inc. All rights reserved.
Reliability and group differences in quantitative cervicothoracic measures among individuals with and without chronic neck pain

PubMed Central

2012-01-01

Background: Clinicians frequently rely on subjective categorization of impairments in mobility, strength, and endurance for clinical decision-making; however, these assessments are often unreliable and lack sensitivity to change. The objective of this study was to determine the inter-rater reliability, minimum detectable change (MDC), and group differences in quantitative cervicothoracic measures for individuals with and without chronic neck pain (NP). Methods: Nineteen individuals with NP and 20 healthy controls participated in this case control study. Two physical therapists performed a 30-minute examination on separate days. A handheld dynamometer, gravity inclinometer, ruler, and stopwatch were used to quantify cervical range of motion (ROM), cervical muscle strength and endurance, and scapulothoracic muscle length and strength, respectively. Results: Intraclass correlation coefficients for inter-rater reliability were significantly greater than zero for most impairment measures, with point estimates ranging from 0.45 to 0.93. The NP group exhibited reduced cervical ROM (P ≤ 0.012) and muscle strength (P ≤ 0.038) in most movement directions, reduced cervical extensor endurance (P = 0.029), and reduced rhomboid and middle trapezius muscle strength (P ≤ 0.049). Conclusions: Results demonstrate the feasibility of obtaining objective cervicothoracic impairment measures with acceptable inter-rater agreement across time. The clinical utility of these measures is supported by evidence of impaired mobility, strength, and endurance among patients with NP, with corresponding MDC values that can help establish benchmarks for clinically significant change. PMID:23114092
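The abstract above reports MDC values without giving the formula behind them. A widely used convention derives the standard error of measurement (SEM) from the between-subject standard deviation and the ICC, and then scales it to a 95% minimum detectable change. The sketch below follows that convention as an assumption; the function names and the input numbers are hypothetical and are not taken from the study.

```python
import math


def standard_error_of_measurement(sd, icc):
    """SEM = SD * sqrt(1 - ICC), with SD the between-subject standard deviation."""
    return sd * math.sqrt(1.0 - icc)


def minimum_detectable_change_95(sd, icc):
    """MDC95 = 1.96 * sqrt(2) * SEM (change between two measurements)."""
    return 1.96 * math.sqrt(2.0) * standard_error_of_measurement(sd, icc)


# Hypothetical example: a range-of-motion measure with SD = 10 degrees
# and an inter-rater ICC of 0.75.
print(standard_error_of_measurement(10.0, 0.75))           # 5.0
print(round(minimum_detectable_change_95(10.0, 0.75), 2))  # 13.86
```

Under this convention, a lower ICC inflates the SEM and therefore the change a patient must show before it can be distinguished from measurement error.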
Frame-of-reference training for simulation-based intraoperative communication assessment.

PubMed

Gardner, Aimee K; Russo, Michael A; Jabbour, Ibrahim I; Kosemund, Matthew; Scott, Daniel J

2016-09-01

The purpose of this study was to examine the impact of frame-of-reference (FOR) training on assessments of intraoperative communication skills and identify areas of need to inform curricular efforts. Simulation instructors (M.D., Ph.D., Research Fellow, Simulation Technician) underwent a 2-hour FOR training session with the operating room communication instrument. They then independently rated communication skills of 19 PGY1s who participated in a team-based simulation. Residents completed self-assessments via video review of the scenario. Intraclass correlation coefficients were used to examine inter-rater reliability. Relationships between trained raters and resident scores were assessed with Pearson correlation coefficients and paired sample t tests. Inter-rater reliability after FOR training was .91. The correlation between trained rater scores and resident evaluations was nonsignificant. Residents significantly underestimated their intraoperative communication skills (P < .05). Use of names, closed loop communication, and sharing information with team members demonstrated consistently low ratings among all residents. These findings reveal that a number of individuals can be trained to reliably rate resident intraoperative communication performance and that residents tend to under-rate their communication skills. Copyright © 2016 Elsevier Inc. All rights reserved.
Educational Testing and Validity of Conclusions in the Scholarship of Teaching and Learning

PubMed Central

Beltyukova, Svetlana A.; Martin, Beth A.

2013-01-01

Validity and its integral evidence of reliability are fundamentals for educational and psychological measurement, and standards of educational testing. Herein, we describe these standards of educational testing, along with their subtypes including internal consistency, inter-rater reliability, and inter-rater agreement. Next, related issues of measurement error and effect size are discussed. This article concludes with a call for future authors to improve reporting of psychometrics and practical significance with educational testing in the pharmacy education literature. By increasing the scientific rigor of educational research and reporting, the overall quality and meaningfulness of SoTL will be improved. PMID:24249848
Design, implementation, and psychometric analysis of a scoring instrument for simulated pediatric resuscitation: a report from the EXPRESS pediatric investigators.

PubMed

Donoghue, Aaron; Ventre, Kathleen; Boulet, John; Brett-Fleegler, Marisa; Nishisaki, Akira; Overly, Frank; Cheng, Adam

2011-04-01

Robustly tested instruments for quantifying clinical performance during pediatric resuscitation are lacking. The Examining Pediatric Resuscitation Education through Simulation and Scripting Collaborative was established to conduct multicenter trials of simulation education in pediatric resuscitation, evaluating performance with multiple instruments, one of which is the Clinical Performance Tool (CPT). We hypothesize that the CPT will measure clinical performance during simulated pediatric resuscitation in a reliable and valid manner. Using a pediatric resuscitation scenario as a basis, a scoring system was designed based on Pediatric Advanced Life Support algorithms comprising 21 tasks. Each task was scored as follows: task not performed (0 points); task performed partially, incorrectly, or late (1 point); and task performed completely, correctly, and within the recommended time frame (2 points). Study teams at 14 children's hospitals went through the scenario twice (PRE and POST) with an interposed 20-minute debriefing. Both scenarios for each of eight study teams were scored by multiple raters. A generalizability study, based on the PRE scores, was conducted to investigate the sources of measurement error in the CPT total scores. Inter-rater reliability was estimated based on the variance components. Validity was assessed by repeated measures analysis of variance comparing PRE and POST scores. Sixteen resuscitation scenarios were reviewed and scored by seven raters. Inter-rater reliability for the overall CPT score was 0.63. POST scores were found to be significantly improved compared with PRE scores when controlled for within-subject covariance (F(1,15) = 4.64, P < 0.05). The variance component ascribable to rater was 2.4%. Reliable and valid measures of performance in simulated pediatric resuscitation can be obtained from the CPT. Future studies should examine the applicability of trichotomous scoring instruments to other clinical scenarios, as well as performance during actual resuscitations.
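The trichotomous scoring scheme described above (21 tasks, each worth 0, 1, or 2 points) can be sketched as a small data structure. The code assumes that the CPT total is the plain sum of the task scores, which the abstract implies but does not spell out; the names `TaskScore`, `N_TASKS`, and `cpt_total` are hypothetical.

```python
from enum import IntEnum


class TaskScore(IntEnum):
    NOT_PERFORMED = 0   # task not performed
    PARTIAL = 1         # performed partially, incorrectly, or late
    COMPLETE = 2        # performed completely, correctly, and on time


N_TASKS = 21  # number of tasks in the scoring system, per the abstract


def cpt_total(task_scores):
    """Sum the per-task scores after validating count and values.

    Assumes the total is an unweighted sum (range 0 to 42).
    """
    if len(task_scores) != N_TASKS:
        raise ValueError(f"expected {N_TASKS} task scores, got {len(task_scores)}")
    # TaskScore(...) raises ValueError for anything other than 0, 1, or 2.
    return sum(int(TaskScore(score)) for score in task_scores)


# Hypothetical scenario: 11 tasks complete, 7 partial, 3 not performed.
scores = [2] * 11 + [1] * 7 + [0] * 3
print(cpt_total(scores))  # 29
```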
Infant polysomnography: reliability and validity of infant arousal assessment.

PubMed

Crowell, David H; Kulp, Thomas D; Kapuniai, Linda E; Hunt, Carl E; Brooks, Lee J; Weese-Mayer, Debra E; Silvestri, Jean; Ward, Sally Davidson; Corwin, Michael; Tinsley, Larry; Peucker, Mark

2002-10-01

Infant arousal scoring based on the Atlas Task Force definition of transient EEG arousal was evaluated to determine (1) whether transient arousals can be identified and assessed reliably in infants and (2) whether arousal and no-arousal epochs scored previously by trained raters can be validated reliably by independent sleep experts. Phase I for inter- and intrarater reliability scoring was based on two datasets of sleep epochs selected randomly from nocturnal polysomnograms of healthy full-term, preterm, idiopathic apparent life-threatening event cases, and siblings of Sudden Infant Death Syndrome infants of 35 to 64 weeks postconceptional age. After training, test set 1 reliability was assessed and discrepancies identified. After retraining, test set 2 was scored by the same raters to determine interrater reliability. Later, three raters from the trained group rescored test set 2 to assess inter- and intrarater reliabilities. Interrater and intrarater reliability kappas, with 95% confidence intervals, ranged from substantial to almost perfect levels of agreement. Interrater reliabilities for spontaneous arousals were initially moderate and then substantial. During the validation phase, 315 previously scored epochs were presented to four sleep experts to rate as containing arousal or no-arousal events. Interrater expert agreements were diverse and considered as noninterpretable. Concordance in sleep experts' agreements, based on identification of the previously sampled arousal and no-arousal epochs, was used as a secondary evaluative technique. Results showed agreement by two or more experts on 86% of the Collaborative Home Infant Monitoring Evaluation Study arousal scored events. Conversely, only 1% of the Collaborative Home Infant Monitoring Evaluation Study-scored no-arousal epochs were rated as an arousal. In summary, this study presents an empirically tested model with procedures and criteria for attaining improved reliability in transient EEG arousal assessments in infants using the modified Atlas Task Force standards. With training based on specific criteria, substantial inter- and intrarater agreement in identifying infant arousals was demonstrated. Corroborative validation results were too disparate for meaningful interpretation. Alternate evaluation based on concordance agreements supports reliance on infant EEG criteria for assessment. Results mandate additional confirmatory validation studies with specific training on infant EEG arousal assessment criteria.
Reliability of shear wave ultrasound elastography for neck lesions identified in routine clinical practice.

PubMed

Bhatia, K; Tong, C S L; Cho, C C M; Yuen, E H Y; Lee, J; Ahuja, A T

2012-10-01

To evaluate the reliability of shear wave ultrasound elastography (SWE) in the neck. 176 neck lesions (40 thyroid, 56 lymph nodes, 46 salivary, 34 miscellaneous) identified in a routine US clinic underwent SWE by one or two blinded radiologists. For this study, SWE required the operator to acquire three 10 second dynamic colour-coded SWE cineloops per lesion, select one static image per cineloop, and place circular regions-of-interest within the entire lesion and stiffest part to generate 3 SWE measurements per static image. For logistical reasons, one radiologist evaluated all 176 lesions and the other evaluated 58 lesions. Both radiologists also reviewed 27 archived cineloops independently to assess SWE excluding practical technique. Reliability was assessed using intraclass correlation coefficients (ICCs), concordance correlation coefficients (CCCs), and coefficients of repeatability (CORs). Test-retest ICCs for the radiologist evaluating 176 lesions were 0.78 - 0.85 (fair-excellent agreement), CCCs were 0.85 - 0.88 (substantial agreement), and CORs were 14.9 - 36.1 kPa. For both radiologists evaluating 58 lesions, intra-rater and inter-rater ICCs were 0.65 - 0.78 and 0.72 - 0.77 respectively. For SWE excluding practical technique, inter-rater ICCs were 0.97 - 0.98 (excellent agreement). ICCs differed according to tissue, being higher in thyroid lesions than lymph nodes (p < 0.001), and higher in benign than malignant lesions (p values < 0.001). Intra- and inter-rater reliability of SWE is fair to excellent according to ICCs. SWE reliability is influenced appreciably by acquisition technique. Nevertheless, CORs for SWE are not negligible. To determine whether these results are acceptable clinically, further research is required to establish SWE stiffness values of normal and pathological tissues in the neck. © Georg Thieme Verlag KG Stuttgart · New York.

Abbreviated Injury Scale: not a reliable basis for summation of injury severity in trauma facilities?

PubMed

Ringdal, Kjetil G; Skaga, Nils Oddvar; Hestnes, Morten; Steen, Petter Andreas; Røislien, Jo; Rehn, Marius; Røise, Olav; Krüger, Andreas J; Lossius, Hans Morten

2013-05-01

Injury severity is most frequently classified using the Abbreviated Injury Scale (AIS) as a basis for the Injury Severity Score (ISS) and the New Injury Severity Score (NISS), which are used for assessment of overall injury severity in the multiply injured patient and in outcome prediction. European trauma registries recommended the AIS 2008 edition, but the levels of inter-rater agreement and reliability of ISS and NISS, associated with its use, have not been reported. Nineteen Norwegian AIS-certified trauma registry coders were invited to score 50 real, anonymised patient medical records using AIS 2008. Rater agreements for ISS and NISS were analysed using Bland-Altman plots with 95% limits of agreement (LoA). A clinically acceptable LoA range was set at ± 9 units. Reliability was analysed using a two-way mixed model intraclass correlation coefficient (ICC) statistics with corresponding 95% confidence intervals (CI) and hierarchical agglomerative clustering. Ten coders submitted their coding results. Of their AIS codes, 2189 (61.5%) agreed with a reference standard, 1187 (31.1%) real injuries were missed, and 392 non-existing injuries were recorded. All LoAs were wider than the predefined, clinically acceptable limit of ± 9, for both ISS and NISS. The joint ICC (range) between each rater and the reference standard was 0.51 (0.29,0.86) for ISS and 0.51 (0.27,0.78) for NISS. The joint ICC (range) for inter-rater reliability was 0.49 (0.19,0.85) for ISS and 0.49 (0.16,0.82) for NISS. Univariate linear regression analyses indicated a significant relationship between the number of correctly AIS-coded injuries and total number of cases coded during the rater's career, but no significant relationship between the rater-against-reference ISS and NISS ICC values and total number of cases coded during the rater's career. Based on AIS 2008, ISS and NISS were not reliable for summarising anatomic injury severity in this study. 
This result indicates a limitation in their use as benchmarking tools for trauma system performance. Copyright © 2012 Elsevier Ltd. All rights reserved.
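The Bland-Altman agreement analysis described in this record can be sketched as follows. This is a minimal illustration, not the authors' analysis: the ISS values, the function name `bland_altman_limits`, and the use of the usual 1.96 multiplier are assumptions; only the ± 9 unit acceptability range comes from the abstract.

```python
from statistics import mean, stdev

def bland_altman_limits(scores_a, scores_b, z=1.96):
    """Return (bias, lower, upper): mean difference and 95% limits of agreement."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both raters must score the same cases")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)          # systematic difference between the two raters
    sd = stdev(diffs)           # sample standard deviation of the differences
    return bias, bias - z * sd, bias + z * sd

# Hypothetical ISS values for six cases (not data from the study).
rater = [9, 17, 25, 34, 16, 29]
reference = [10, 14, 29, 27, 22, 25]

bias, lower, upper = bland_altman_limits(rater, reference)
# The abstract's predefined clinically acceptable range is +/- 9 units.
acceptable = lower >= -9 and upper <= 9
print(f"bias={bias:.2f}, LoA=({lower:.2f}, {upper:.2f}), acceptable={acceptable}")
```

With these made-up scores the limits of agreement fall outside ± 9, which is the kind of pattern the abstract reports for all coders.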
Reliability and validity of a novel tool to comprehensively assess food and beverage marketing in recreational sport settings.

PubMed

Prowse, Rachel J L; Naylor, Patti-Jean; Olstad, Dana Lee; Carson, Valerie; Mâsse, Louise C; Storey, Kate; Kirk, Sara F L; Raine, Kim D

2018-05-31

Current methods for evaluating food marketing to children often study a single marketing channel or approach. As the World Health Organization urges the removal of unhealthy food marketing in children's settings, methods that comprehensively explore the exposure and power of food marketing within a setting from multiple marketing channels and approaches are needed. The purpose of this study was to test the inter-rater reliability and the validity of a novel settings-based food marketing audit tool. The Food and beverage Marketing Assessment Tool for Settings (FoodMATS) was developed and its psychometric properties evaluated in five public recreation and sport facilities (sites) and subsequently used in 51 sites across Canada for a cross-sectional analysis of food marketing. Raters recorded the count of food marketing occasions, presence of child-targeted and sports-related marketing techniques, and the physical size of marketing occasions. Marketing occasions were classified by healthfulness. Inter-rater reliability was tested using Cohen's kappa (κ) and intra-class correlations (ICC). FoodMATS scores for each site were calculated using an algorithm that represented the theoretical impact of the marketing environment on food preferences, purchases, and consumption. Higher FoodMATS scores represented sites with higher exposure to, and more powerful (unhealthy, child-targeted, sports-related, large) food marketing. Validity of the scoring algorithm was tested through (1) Pearson's correlations between FoodMATS scores and facility sponsorship dollars, and (2) sequential multiple regression for predicting "Least Healthy" food sales from FoodMATS scores. Inter-rater reliability was very good to excellent (κ = 0.88-1.00, p < 0.001; ICC = 0.97, p < 0.001). There was a strong positive correlation between FoodMATS scores and food sponsorship dollars, after controlling for facility size (r = 0.86, p < 0.001). 
The FoodMATS score explained 14% of the variability in "Least Healthy" concession sales (p = 0.012) and 24% of the variability total concession and vending "Least Healthy" food sales (p = 0.003). FoodMATS has high inter-rater reliability and good validity. As the first validated tool to evaluate the exposure and power of food marketing in recreation facilities, the FoodMATS provides a novel means to comprehensively track changes in food marketing environments that can assist in developing and monitoring the impact of policies and interventions.
Reliability of the Matson Evaluation of Social Skills with Youngsters (MESSY) for Children with Autism Spectrum Disorders

ERIC Educational Resources Information Center

Matson, Johnny L.; Horovitz, Max; Mahan, Sara; Fodstad, Jill

2013-01-01

The purpose of this paper was to update the psychometrics of the "Matson Evaluation of Social Skills for Youngsters" ("MESSY") with children with Autism Spectrum Disorders (ASD), specifically with respect to internal consistency, split-half reliability, and inter-rater reliability. In Study 1, 114 children with ASD (Autistic Disorder, Asperger's…
Inter-rater reliability of the Sødring Motor Evaluation of Stroke patients (SMES).

PubMed

Halsaa, K E; Sødring, K M; Bjelland, E; Finsrud, K; Bautz-Holter, E

1999-12-01

The Sødring Motor Evaluation of Stroke patients is an instrument for physiotherapists to evaluate motor function and activities in stroke patients. The rating reflects quality as well as quantity of the patient's unassisted performance within three domains: leg, arm and gross function. The inter-rater reliability of the method was studied in a sample of 30 patients admitted to a stroke rehabilitation unit. Three therapists were involved in the study; two therapists assessed the same patient on two consecutive days in a balanced design. Cohen's weighted kappa and McNemar's test of symmetry were used as measures of item reliability, and the intraclass correlation coefficient was used to express the reliability of the sumscores. For 24 out of 32 items the weighted kappa statistic was excellent (0.75-0.98), while 7 items had a kappa statistic within the range 0.53-0.74 (fair to good). The reliability of one item was poor (0.13). The intraclass correlation coefficient for the three sumscores was 0.97, 0.91 and 0.97. We conclude that the Sødring Motor Evaluation of Stroke patients is a reliable measure of motor function in stroke patients undergoing rehabilitation.
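As an illustration of the item-level statistic named in this record, a weighted kappa for two raters on an ordinal item might be computed as below. The abstract does not say which weighting scheme was used; linear disagreement weights, the function name `weighted_kappa`, and the example scores are assumptions made for this sketch.

```python
def weighted_kappa(ratings_a, ratings_b, categories):
    """Cohen's weighted kappa for two raters, using linear disagreement weights."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)
    # Joint proportion table: observed[i][j] = share of cases rated i by A and j by B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)      # 0 on the diagonal, 1 for maximal disagreement
            num += w * observed[i][j]     # weighted observed disagreement
            den += w * row[i] * col[j]    # weighted disagreement expected by chance
    # den is zero only if both raters use a single identical category throughout.
    return 1 - num / den

# Hypothetical scores on a 3-level ordinal item for eight patients.
therapist_1 = [0, 1, 2, 2, 1, 0, 2, 1]
therapist_2 = [0, 1, 2, 1, 1, 0, 2, 2]
kappa = weighted_kappa(therapist_1, therapist_2, categories=[0, 1, 2])
print(round(kappa, 3))
```

Weighting gives partial credit when raters differ by one level, which is why it is usually preferred over unweighted kappa for ordinal items.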
A Brazilian-Portuguese version of the Kinesthetic and Visual Motor Imagery Questionnaire.

PubMed

Demanboro, Alan; Sterr, Annette; Anjos, Sarah Monteiro Dos; Conforto, Adriana Bastos

2018-01-01

Motor imagery has emerged as a potential rehabilitation tool in stroke. The goals of this study were: 1) to develop a translated and culturally-adapted Brazilian-Portuguese version of the Kinesthetic and Visual Motor Imagery Questionnaire (KVIQ20-P); 2) to evaluate the psychometric characteristics of the scale in a group of patients with stroke and in an age-matched control group; 3) to compare the KVIQ20 performance between the two groups. Test-retest, inter-rater reliabilities, and internal consistencies were evaluated in 40 patients with stroke and 31 healthy participants. In the stroke group, ICC confidence intervals showed excellent test-retest and inter-rater reliabilities. Cronbach's alpha also indicated excellent internal consistency. Results for controls were comparable to those obtained in persons with stroke. The excellent psychometric properties of the KVIQ20-P should be considered during the design of studies of motor imagery interventions for stroke rehabilitation.
Inter-rater reliability of categorical versus continuous scoring of fish vitality: Does it affect the utility of the reflex action mortality predictor (RAMP) approach?

PubMed Central

Yochum, Noëlle; Kochzius, Marc; Ampe, Bart; Tuyttens, Frank A. M.

2017-01-01

Scoring reflex responsiveness and injury of aquatic organisms has gained popularity as a predictor of discard survival. Given that this method relies upon the individual interpretation of scoring criteria, its robustness is evaluated here to test whether protocol-instructed, multiple raters with diverse backgrounds (research scientist, technician, and student) are able to produce similar or the same reflex and injury scores for one and the same flatfish (European plaice, Pleuronectes platessa) after experiencing commercial fishing stressors. Inter-rater reliability for three raters was assessed by using a 3-point categorical scale (‘absent’, ‘weak’, ‘strong’) and a tagged visual analogue continuous scale (tVAS, a 10 cm bar split in three labelled sections: 0 for ‘absent’, ‘weak’, ‘moderate’, and ‘strong’) for six reflex responses, and a 4-point scale for four injury types. Plaice (n = 304) were sampled from 17 research beam-trawl deployments during four trips. Fleiss kappa (categorical scores) and intra-class correlation coefficients (ICC, continuous scores) indicated variable inter-rater agreement by reflex type (ranging between 0.55 and 0.88, and 67% and 91% for Fleiss kappa and ICC, respectively), with least agreement among raters on extent of injury (Fleiss kappa between 0.08 and 0.27). Despite differences among raters, which did not significantly influence the relationship between impairment and predicted survival, combining categorical reflex and injury scores always produced a close relationship of such vitality indices and observed delayed mortality. The use of the continuous scale did not improve fit of these models compared with using the reflex impairment index based on categorical scores. Given these findings, we recommend using a 3-point categorical over a continuous scale. We also determined that training rather than experience of raters minimised inter-rater differences. 
Our results suggest that cost-efficient reflex impairment and injury scoring may be considered a robust technique to evaluate lethal stress and damage of this flatfish species on-board commercial beam-trawl vessels. PMID:28704390
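For readers unfamiliar with the multi-rater statistic used in this record, the following is a small sketch of Fleiss' kappa for a fixed number of raters per subject. The count table, category order, and function name `fleiss_kappa` are hypothetical and only illustrate the three-rater, 3-point layout the abstract describes.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters placing subject i in category j."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject must be scored by the same number of raters")
    n_categories = len(counts[0])
    # Overall proportion of assignments falling in each category.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(n_categories)]
    # Per-subject agreement: share of rater pairs that agree.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects        # mean observed agreement
    p_e = sum(p * p for p in p_j)        # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical reflex scores: 6 fish, 3 raters, columns = (absent, weak, strong).
table = [
    [3, 0, 0],
    [0, 3, 0],
    [0, 2, 1],
    [0, 0, 3],
    [1, 2, 0],
    [0, 0, 3],
]
kappa = fleiss_kappa(table)
print(round(kappa, 3))
```

Unlike Cohen's kappa, this form does not require the same individual raters for every subject, only the same number of ratings per subject.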
The neurobehavioural rating scale: assessment of the behavioural sequelae of head injury by the clinician.

PubMed Central

Levin, H S; High, W M; Goethe, K E; Sisson, R A; Overall, J E; Rhoades, H M; Eisenberg, H M; Kalisky, Z; Gary, H E

1987-01-01

To investigate the inter-rater reliability and validity of the Neurobehavioural Rating Scale at various stages of recovery after hospitalisation for closed head injury, we studied 101 head trauma patients who had no antecedent neuropsychiatric disorder. The results demonstrated satisfactory inter-rater reliability and showed that the Neurobehavioural Rating Scale reflects both the severity and chronicity of closed head injury. A principal components analysis revealed four factors which were differentially related to severity of head injury and the presence of a frontal lobe mass lesion. Although our findings provide support for utilising clinical ratings of behaviour to investigate sequelae of head injury, extension of this technique to other settings is necessary to evaluate the distinctiveness of the neurobehavioural profile of closed head injury as compared with other aetiologies of brain damage. PMID:3572433
Development and evaluation of the OHCITIES instrument: assessing alcohol urban environments in the Heart Healthy Hoods project

PubMed Central

Sureda, Xisca; Espelt, Albert; Villalbí, Joan R; Cebrecos, Alba; Baranda, Lucía; Pearce, Jamie; Franco, Manuel

2017-01-01

Objectives: To describe the development and test–retest reliability of OHCITIES, an instrument characterising alcohol urban environment in terms of availability, promotion and signs of consumption. Design: This study involved: (1) developing the conceptual framework for alcohol urban environment by means of literature reviewing and previous alcohol environment research experience; (2) pilot testing and redesigning the instrument; (3) instrument digitalisation; (4) instrument evaluation using test–retest reliability. Setting: Data for testing the reliability of the instrument were collected in seven census sections in Madrid in 2016 by two observers. Primary and secondary outcome measures: We computed per cent agreement and Cohen’s kappa coefficients to estimate inter-rater and test–retest reliability for alcohol outlet environment measures. We calculated intraclass correlation coefficients and their 95% CIs to provide a measure of inter-rater reliability for signs of alcohol consumption measures. Results: We collected information on 92 on-premise and 24 off-premise alcohol outlets identified in the studied areas about availability, accessibility and promotion of alcohol. Most per cent-agreement values for alcohol measures in on-premise and off-premise alcohol outlets were greater than 80%, and inter-rater and test–retest reliability values were generally above 0.80. Observers identified 26 streets and 3 public squares with signs of alcohol consumption. Intraclass correlation coefficient between observers for any type of signs of alcohol consumption was 0.50 (95% CI −0.09 to 0.77). Few items promoting alcohol unrelated to alcohol outlets were found on public spaces. Conclusions: The OHCITIES instrument is a reliable instrument to characterise alcohol urban environment. 
This instrument might be used to understand how alcohol environment associates with alcohol behaviours and its related health outcomes, and can help in the design and evaluation of policies to reduce the harm caused by alcohol. PMID:28982829
Development and evaluation of the OHCITIES instrument: assessing alcohol urban environments in the Heart Healthy Hoods project.

PubMed

Sureda, Xisca; Espelt, Albert; Villalbí, Joan R; Cebrecos, Alba; Baranda, Lucía; Pearce, Jamie; Franco, Manuel

2017-10-05

To describe the development and test-retest reliability of OHCITIES, an instrument characterising alcohol urban environment in terms of availability, promotion and signs of consumption. This study involved: (1) developing the conceptual framework for alcohol urban environment by means of literature reviewing and previous alcohol environment research experience; (2) pilot testing and redesigning the instrument; (3) instrument digitalisation; (4) instrument evaluation using test-retest reliability. Data for testing the reliability of the instrument were collected in seven census sections in Madrid in 2016 by two observers. We computed per cent agreement and Cohen's kappa coefficients to estimate inter-rater and test-retest reliability for alcohol outlet environment measures. We calculated intraclass correlation coefficients and their 95% CIs to provide a measure of inter-rater reliability for signs of alcohol consumption measures. We collected information on 92 on-premise and 24 off-premise alcohol outlets identified in the studied areas about availability, accessibility and promotion of alcohol. Most per cent-agreement values for alcohol measures in on-premise and off-premise alcohol outlets were greater than 80%, and inter-rater and test-retest reliability values were generally above 0.80. Observers identified 26 streets and 3 public squares with signs of alcohol consumption. Intraclass correlation coefficient between observers for any type of signs of alcohol consumption was 0.50 (95% CI -0.09 to 0.77). Few items promoting alcohol unrelated to alcohol outlets were found on public spaces. The OHCITIES instrument is a reliable instrument to characterise alcohol urban environment. This instrument might be used to understand how alcohol environment associates with alcohol behaviours and its related health outcomes, and can help in the design and evaluation of policies to reduce the harm caused by alcohol. 
Validity, Reliability and Acceptability of the Team Standardized Assessment of Clinical Encounter Report*

PubMed Central

Wong, Camilla L.; Norris, Mireille; Sinha, Samir S.; Zorzitto, Maria L.; Madala, Sushma; Hamid, Jemila S.

2016-01-01

Background: The Team Standardized Assessment of a Clinical Encounter Report (StACER) was designed for use in Geriatric Medicine residency programs to evaluate Communicator and Collaborator competencies. Methods: The Team StACER was completed by two geriatricians and interdisciplinary team members based on observations during a geriatric medicine team meeting. Postgraduate trainees were recruited from July 2010–November 2013. Inter-rater reliability between two geriatricians and between all team members was determined. Internal consistency of items for the constructs Communicator and Collaborator competencies was calculated. Raters completed a survey previously administered to Canadian geriatricians to assess face validity. Trainees completed a survey to determine the usefulness of this instrument as a feedback tool. Results: Thirty postgraduate trainees participated. The prevalence-adjusted bias-adjusted kappa ranges for inter-rater reliability of Communicator and Collaborator items were 0.87–1.00 and 0.86–1.00, respectively. The Cronbach’s alpha coefficient for Communicator and Collaborator items was 0.997 (95% CI: 0.993–1.00) and 0.997 (95% CI: 0.997–1.00), respectively. The instrument lacked discriminatory power, as all trainees scored “meets requirements” in the overall assessment. Ninety-three per cent and 86% of trainees found feedback useful for developing Communicator and Collaborator competencies, respectively. Conclusions: The Team StACER has adequate inter-rater reliability and internal consistency. Poor discriminatory power and face validity challenge the merit of using this evaluation tool. Trainees felt the tool provided useful feedback on Collaborator and Communicator competencies. PMID:28050222
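The prevalence-adjusted bias-adjusted kappa (PABAK) reported in this record depends only on the observed proportion of agreement, which the short sketch below makes explicit. The ratings, labels, and function name `pabak` are hypothetical; the general k-category form is shown, which reduces to 2 × Po − 1 for two categories.

```python
def pabak(ratings_a, ratings_b, n_categories=2):
    """Prevalence-adjusted bias-adjusted kappa from two raters' paired ratings."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must rate the same subjects")
    agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))
    p_o = agreements / len(ratings_a)                 # observed agreement
    # Chance agreement is fixed at 1/k, so prevalence and rater bias drop out.
    return (n_categories * p_o - 1) / (n_categories - 1)

# Hypothetical item: two raters, 30 trainees, agreement on 28 of them.
rater_1 = ["meets"] * 28 + ["meets", "below"]
rater_2 = ["meets"] * 28 + ["below", "meets"]
value = pabak(rater_1, rater_2)
print(round(value, 2))
```

When nearly every trainee receives the same rating, as the abstract describes, ordinary Cohen's kappa can be very low despite high raw agreement, which is the usual motivation for reporting PABAK instead.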
The push-off test: development of a simple, reliable test of upper extremity weight-bearing capability.

PubMed

Vincent, Joshua I; MacDermid, Joy C; Michlovitz, Susan L; Rafuse, Richard; Wells-Rowsell, Christina; Wong, Owen; Bisbee, Leslie

2014-01-01

Longitudinal clinical measurement study. The push-off test (POT) is a novel and simple measure of upper extremity weight-bearing that can be measured with a grip dynamometer. There are no published studies on the validity and reliability of the POT. The relationship between upper extremity self-report activity/participation and impairment measures remain an unexplored realm. The primary purpose of this study is to estimate the intra and inter-rater reliability and construct validity of the POT. The secondary purpose is to estimate the relationship between upper extremity self-report activity/participation questionnaires and impairment measures. A convenience sample of 22 patients with wrist or elbow injuries were tested for POT, wrist/elbow range of motion (ROM), isometric wrist extension strength (WES) and grip strength; and completed two self-report activity/participation questionnaires: Disability of the Arm, Shoulder and the Hand (DASH) and Work Limitations Questionnaire (WLQ-26). POT's inter and intra-rater reliability and construct validity was tested. Pearson's correlations were run between the impairment measures and self-report questionnaires to look into the relationship amongst them. The POT demonstrated high inter-rater reliability (ICC affected = 0.97; 95% C.I. 0.93-0.99; ICC unaffected = 0.85; 95% C.I. 0.68-0.94) and intra-rater reliability (ICC affected = 0.96; 95% C.I. 0.92-0.97; ICC unaffected = 0.92; 95% C.I. 0.85-0.97). The POT was correlated moderately with the DASH (r = -0.47; p = 0.03). While examining the relationship between upper extremity self-reported activity/participation questionnaires and impairment measures the strongest correlation was between the DASH and the POT (r = -0.47; p = 0.03) and none of the correlations with the other physical impairment measures reached significance. At-work disability demonstrated insignificant correlations with physical impairments. 
The POT provides a reliable and easily administered quantitative measure of ability to bear the load through an injured arm. Preliminary evidence supports a moderate relationship between load bearing measured by the POT and upper extremity function measured by the DASH. 1b. Copyright © 2014 Hanley & Belfus. Published by Elsevier Inc. All rights reserved.
Delirium assessment in hospitalized elderly patients: Italian translation and validation of the nursing delirium screening scale.

PubMed

Spedale, Valentina; Di Mauro, Stefania; Del Giorno, Giulia; Barilaro, Monica; Villa, Candida E; Gaudreau, Jean D; Ausili, Davide

2017-08-01

Delirium is a high-incidence pathology associated with negative outcomes. Although highly preventable, half the cases are not recognized. One major cause of delirium misdiagnosis is the absence of a versatile instrument to measure it. Our objective was to translate the nursing delirium screening scale (Nu-DESC) and evaluate its performance in Italian settings. This was a methodological study conducted in two sequential phases. The first was the Italian translation of Nu-DESC through a translation and back-translation process. The second aimed to test the inter-rater reliability, the sensitivity and specificity of the instrument on a convenience sample of 101 hospitalized elderly people admitted to relevant wards of the San Gerardo Hospital in Monza. To evaluate the inter-rater reliability, two examiners tested Nu-DESC on 20 patients concurrently without comparison. To measure the sensitivity and specificity of Nu-DESC, the confusion assessment method was used as a gold standard measure. The inter-rater reliability (Cohen's kappa) was 0.87, indicating excellent agreement between examiners. The study of the ROC curve showed an AUC value of 0.9461, suggesting high test accuracy. Using 3 as a cut-off value, Nu-DESC showed 100% sensitivity and 76% specificity. Further research is needed to test Nu-DESC on a larger sample. However, based on our results, Nu-DESC can be used in research and clinical practice in Italian settings because its performance is very good and similar to that in previous validation studies. The value of 3 appears to be the optimal cut-off in the Italian context.
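The cut-off evaluation in this record can be illustrated with a short sketch that computes sensitivity and specificity of a screening score against a reference standard. The scores and reference labels are hypothetical, and treating a score at or above the cut-off as a positive screen is an assumption of this example, since the abstract only states that 3 was used as the cut-off value.

```python
def sensitivity_specificity(scores, reference_positive, cutoff):
    """Return (sensitivity, specificity) when score >= cutoff counts as a positive screen."""
    tp = fn = tn = fp = 0
    for score, has_condition in zip(scores, reference_positive):
        screened_positive = score >= cutoff
        if has_condition and screened_positive:
            tp += 1
        elif has_condition:
            fn += 1
        elif screened_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical screening scores and reference-standard results for ten patients.
scores = [0, 1, 2, 3, 4, 5, 2, 3, 0, 6]
reference = [False, False, False, True, True, True, False, False, False, True]

sens, spec = sensitivity_specificity(scores, reference, cutoff=3)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Repeating the calculation over every candidate cut-off yields the points of the ROC curve whose area the abstract reports.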
Inter-Rater Variability as Mutual Disagreement: Identifying Raters' Divergent Points of View

ERIC Educational Resources Information Center

Gingerich, Andrea; Ramlo, Susan E.; van der Vleuten, Cees P. M.; Eva, Kevin W.; Regehr, Glenn

2017-01-01

Whenever multiple observers provide ratings, even of the same performance, inter-rater variation is prevalent. The resulting "idiosyncratic rater variance" is considered to be unusable error of measurement in psychometric models and is a threat to the defensibility of our assessments. Prior studies of inter-rater variation in clinical…
Inter-observer and intra-observer reliability in the radiographic diagnosis of avascular necrosis of the femoral head following reconstructive hip surgery in children with cerebral palsy.

PubMed

Hesketh, Kim; Sankar, Wudbhav; Joseph, Benjamin; Narayanan, Unni; Mulpuri, Kishore

2016-04-01

The incidence of avascular necrosis (AVN) following reconstructive hip surgery in cerebral palsy (CP) ranges from 0 to 69 % in the current literature. The purpose of this study was to determine the inter- and intra-observer reliability of radiographically diagnosing AVN in children with CP after hip surgery. A retrospective review of 65 children with CP who had reconstructive hip surgery between 2009 and 2012 at BC Children's Hospital was completed. Anterior-posterior and lateral radiographs were presented to four pediatric orthopaedic surgeons over two rounds. Surgeons were asked to review the set of unidentified radiographs and comment 'yes' or 'no' for the presence of AVN. Two weeks later the same set of radiographs was sent in a different order and the surgeons were again asked to comment on AVN. Inter- and intra-observer reliability was determined using kappa statistics. The intra-observer reliability ranged from 0.65 to 0.88 with an average score of 0.76. Inter-observer reliability showed greater variability, ranging from 0.41 to 0.77 with an average score of 0.56 across all surgeons. Although the intra-rater reliability produced a strength of "good" and the inter-rater reliability a strength of "moderate" agreement, the variability within these scores is clinically important as it demonstrates the difficulty in identifying AVN. This may explain the variability in AVN that is reported in the literature. The need for further education and research in the diagnosis of AVN in children with CP who have undergone reconstructive hip surgery is clinically necessary.
Validity and reliability of Internet-based physiotherapy assessment for musculoskeletal disorders: a systematic review.

PubMed

Mani, Suresh; Sharma, Shobha; Omar, Baharudin; Paungmali, Aatit; Joseph, Leonard

2017-04-01

Purpose: The purpose of this review is to systematically explore and summarise the validity and reliability of telerehabilitation (TR)-based physiotherapy assessment for musculoskeletal disorders. Method: A comprehensive systematic literature review was conducted using a number of electronic databases (PubMed, EMBASE, PsycINFO, Cochrane Library and CINAHL), covering articles published between January 2000 and May 2015. Studies that examined the validity and the inter- and intra-rater reliabilities of TR-based physiotherapy assessment for musculoskeletal conditions were included. Two independent reviewers used the Quality Appraisal Tool for studies of diagnostic Reliability (QAREL) and the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool to assess the methodological quality of reliability and validity studies respectively. Results: A total of 898 hits were achieved, of which 11 articles based on inclusion criteria were reviewed. Nine studies explored the concurrent validity, inter- and intra-rater reliabilities, while two studies examined only the concurrent validity. Reviewed studies were moderate to good in methodological quality. The physiotherapy assessments such as pain, swelling, range of motion, muscle strength, balance, gait and functional assessment demonstrated good concurrent validity. However, the reported concurrent validity of lumbar spine posture, special orthopaedic tests, neurodynamic tests and scar assessments ranged from low to moderate. Conclusion: TR-based physiotherapy assessment was technically feasible with overall good concurrent validity and excellent reliability, except for lumbar spine posture, orthopaedic special tests, neurodynamic tests and scar assessment.
Assessment of nursing home residents in Europe: the Services and Health for Elderly in Long TERm care (SHELTER) study

PubMed Central

2012-01-01

Background: Aims of the present study are the following: 1. to describe the rationale and methodology of the Services and Health for Elderly in Long TERm care (SHELTER) study, a project funded by the European Union, aimed at implementing the interRAI instrument for Long Term Care Facilities (interRAI LTCF) as a tool to assess and gather uniform information about nursing home (NH) residents across different health systems in European countries; 2. to present the results about the test-retest and inter-rater reliability of the interRAI LTCF instrument translated into the languages of participating countries; 3. to illustrate the characteristics of NH residents at study entry. Methods: A 12-month prospective cohort study was conducted in 57 NH in 7 EU countries (Czech Republic, England, Finland, France, Germany, Italy, The Netherlands) and 1 non-EU country (Israel). Weighted kappa coefficients were used to evaluate the reliability of interRAI LTCF items. Results: Mean age of 4156 residents entering the study was 83.4 ± 9.4 years, 73% were female. ADL disability and cognitive impairment were observed in 81.3% and 68.0% of residents, respectively. Clinical complexity of residents was confirmed by a high prevalence of behavioral symptoms (27.5% of residents), falls (18.6%), pressure ulcers (10.4%), pain (36.0%) and urinary incontinence (73.5%). Overall, 197 of the 198 items tested met or exceeded standard cut-offs for acceptable test-retest and inter-rater reliability after translation into the target languages. Conclusion: The interRAI LTCF appears to be a reliable instrument. It enables the creation of databases that can be used to govern the provision of long-term care across different health systems in Europe, to answer relevant research and policy questions and to compare characteristics of NH residents across countries, languages and cultures. PMID:22230771
The relative reliability of actively participating and passively observing raters in a simulation-based assessment for selection to specialty training in anaesthesia.

PubMed

Roberts, M J; Gale, T C E; Sice, P J A; Anderson, I R

2013-06-01

Selection to specialty training is a high-stakes assessment demanding valuable consultant time. In one initial entry level and two higher level anaesthesia selection centres, we investigated the feasibility of using staff participating in simulation scenarios, rather than observing consultants, to rate candidate performance. We compared participant and observer scores using four different outcomes: inter-rater reliability; score distributions; correlation of candidate rankings; and percentage of candidates whose selection might be affected by substituting participants' for observers' ratings. Inter-rater reliability between observers was good (correlation coefficient 0.73-0.96) but lower between participants (correlation coefficient 0.39-0.92), particularly at higher level where participants also rated candidates more favourably than did observers. Station rank orderings were strongly correlated between the rater groups at entry level (rho 0.81, p < 0.001) but weaker at the two higher level centres (rho 0.52, p = 0.018; rho 0.58, p = 0.001). Substituting participants' for observers' ratings had less effect once scores were combined with those from other selection centre stations. Selection decisions for 0-20% of candidates could have changed, depending on the numbers of training posts available. We conclude that using participating raters is feasible at initial entry level only. Anaesthesia © 2013 The Association of Anaesthetists of Great Britain and Ireland.
3D photography is as accurate as digital planimetry tracing in determining burn wound area.

PubMed

Stockton, K A; McMillan, C M; Storey, K J; David, M C; Kimble, R M

2015-02-01

In the paediatric population careful attention needs to be paid to the techniques utilised for wound assessment to minimise discomfort and stress to the child. To investigate whether 3D photography is a valid measure of burn wound area in children compared to the current clinical gold standard method of digital planimetry using Visitrak™. Twenty-five children presenting to the Stuart Pegg Paediatric Burn Centre for burn dressing change following acute burn injury were included in the study. Burn wound area measurement was undertaken using both digital planimetry (Visitrak™ system) and 3D camera analysis. Inter-rater reliability of the 3D camera software was determined by three investigators independently assessing the burn wound area. A comparison of wound area was assessed using intraclass correlation coefficients (ICC), which demonstrated excellent agreement, 0.994 (CI 0.986, 0.997). Inter-rater reliability measured using ICC, 0.989 (95% CI 0.979, 0.995), was excellent. Time taken to map the wound was significantly shorter using the camera at bedside compared to Visitrak™, 14.68 (7.00) s versus 36.84 (23.51) s (p<0.001). In contrast, analysing wound area was significantly quicker using the Visitrak™ tablet compared to Dermapix® software for the 3D images, 31.36 (19.67) s versus 179.48 (56.86) s (p<0.001). This study demonstrates that taking images with the 3D LifeViz™ camera and assessing them with Dermapix® software is a reliable method for wound area assessment in the acute paediatric burn setting. Copyright © 2014 Elsevier Ltd and ISBI. All rights reserved.
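To make the ICC figures in this record more concrete, the sketch below computes two common single-measure, two-way ICC forms from a subjects-by-raters table. The abstract does not state which ICC model was used, so the choice of the consistency form (often labelled ICC(3,1)) and the absolute-agreement form (ICC(2,1)), the function name `icc_two_way`, and the wound areas are all assumptions for illustration.

```python
def icc_two_way(ratings):
    """Return (consistency, agreement) single-measure ICCs; rows = subjects, columns = raters."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                  # between-subjects mean square
    msc = ss_cols / (k - 1)                  # between-raters mean square
    mse = ss_error / ((n - 1) * (k - 1))     # residual mean square
    consistency = (msr - mse) / (msr + (k - 1) * mse)
    agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return consistency, agreement

# Hypothetical wound areas (cm^2) for five wounds measured by three investigators.
areas = [
    [12.1, 12.4, 11.9],
    [30.5, 31.0, 30.2],
    [8.3, 8.1, 8.6],
    [45.0, 44.2, 45.9],
    [20.7, 21.3, 20.4],
]
consistency, agreement = icc_two_way(areas)
print(f"consistency ICC={consistency:.3f}, agreement ICC={agreement:.3f}")
```

The two forms differ only in whether systematic offsets between raters count against reliability, so reporting which one was used matters when comparing ICC values across studies.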
Reliability of light microscopy and a computer-assisted replica measurement technique for evaluating the fit of dental copings.

PubMed

Rudolph, Heike; Ostertag, Silke; Ostertag, Michael; Walter, Michael H; Luthardt, Ralph Gunnar; Kuhn, Katharina

2018-02-01

The aim of this in vitro study was to assess the reliability of two measurement systems for evaluating the marginal and internal fit of dental copings. Sixteen CAD/CAM titanium copings were produced for a prepared maxillary canine. To modify the CAD surface model using different parameters (data density; enlargement in different directions), varying fit was created. Five light-body silicone replicas representing the gap between the canine and the coping were made for each coping and for each measurement method: (1) light microscopy measurements (LMMs); and (2) computer-assisted measurements (CASMs) using an optical digitizing system. Two investigators independently measured the marginal and internal fit using both methods. The inter-rater reliability [intraclass correlation coefficient (ICC)] and agreement [Bland-Altman (bias) analyses]: mean of the differences (bias) between two measurements [the closer to zero the mean (bias) is, the higher the agreement between the two measurements] were calculated for several measurement points (marginal-distal, marginal-buccal, axial-buccal, incisal). For the LMM technique, one investigator repeated the measurements to determine repeatability (intra-rater reliability and agreement). For inter-rater reliability, the ICC was 0.848-0.998 for LMMs and 0.945-0.999 for CASMs, depending on the measurement point. Bland-Altman bias was -15.7 to 3.5 μm for LMMs and -3.0 to 1.9 μm for CASMs. For LMMs, the marginal-distal and marginal-buccal measurement points showed the lowest ICC (0.848/0.978) and the highest bias (-15.7 μm/-7.6 μm). With the intra-rater reliability and agreement (repeatability) for LMMs, the ICC was 0.970-0.998 and bias was -1.3 to 2.3 μm. LMMs showed lower interrater reliability and agreement at the marginal measurement points than CASMs, which indicates a more subjective influence with LMMs at these measurement points. The values, however, were still clinically acceptable. 
LMMs showed very high intra-rater reliability and agreement for all measurement points, indicating high repeatability.
Reliability of light microscopy and a computer-assisted replica measurement technique for evaluating the fit of dental copings

PubMed Central

Rudolph, Heike; Ostertag, Silke; Ostertag, Michael; Walter, Michael H.; Luthardt, Ralph Gunnar; Kuhn, Katharina

2018-01-01

Abstract The aim of this in vitro study was to assess the reliability of two measurement systems for evaluating the marginal and internal fit of dental copings. Material and Methods Sixteen CAD/CAM titanium copings were produced for a prepared maxillary canine. To modify the CAD surface model using different parameters (data density; enlargement in different directions), varying fit was created. Five light-body silicone replicas representing the gap between the canine and the coping were made for each coping and for each measurement method: (1) light microscopy measurements (LMMs); and (2) computer-assisted measurements (CASMs) using an optical digitizing system. Two investigators independently measured the marginal and internal fit using both methods. The inter-rater reliability [intraclass correlation coefficient (ICC)] and agreement [Bland-Altman (bias) analyses]: mean of the differences (bias) between two measurements [the closer to zero the mean (bias) is, the higher the agreement between the two measurements] were calculated for several measurement points (marginal-distal, marginal-buccal, axial-buccal, incisal). For the LMM technique, one investigator repeated the measurements to determine repeatability (intra-rater reliability and agreement). Results For inter-rater reliability, the ICC was 0.848-0.998 for LMMs and 0.945-0.999 for CASMs, depending on the measurement point. Bland-Altman bias was −15.7 to 3.5 μm for LMMs and −3.0 to 1.9 μm for CASMs. For LMMs, the marginal-distal and marginal-buccal measurement points showed the lowest ICC (0.848/0.978) and the highest bias (-15.7 μm/-7.6 μm). With the intra-rater reliability and agreement (repeatability) for LMMs, the ICC was 0.970-0.998 and bias was −1.3 to 2.3 μm. Conclusion LMMs showed lower interrater reliability and agreement at the marginal measurement points than CASMs, which indicates a more subjective influence with LMMs at these measurement points. 
The values, however, were still clinically acceptable. LMMs showed very high intra-rater reliability and agreement for all measurement points, indicating high repeatability. PMID:29412364
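The Bland-Altman bias reported in this record is the mean of the paired differences between two sets of measurements. A minimal Python sketch follows, assuming hypothetical gap widths in μm; the function name `bland_altman` and the sample values are illustrative and are not data from the study. The 95% limits of agreement (bias ± 1.96 SD of the differences) are a conventional addition that the abstract itself does not report.

```python
from statistics import mean, stdev


def bland_altman(rater_a, rater_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    bias is the mean of the differences a - b; the limits are the
    conventional 95% limits of agreement, bias +/- 1.96 * SD.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical gap widths (in micrometres) measured by two investigators.
investigator_1 = [100, 110, 120, 130]
investigator_2 = [102, 108, 123, 131]

bias, lower, upper = bland_altman(investigator_1, investigator_2)
print(f"bias = {bias:.1f} um, limits of agreement = [{lower:.1f}, {upper:.1f}] um")
```

A bias close to zero means the two raters do not differ systematically, which is the sense in which the abstract reads the values for CASMs as showing higher agreement than those for LMMs.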

Agreement of repeated motor and sensory scores at individual myotomes and dermatomes in young persons with spinal cord injury.

PubMed

Krisa, L; Gaughan, J; Vogel, L; Betz, R R; Mulcahey, M J

2013-01-01

A prospective repeated measures multicenter study to determine reliability at individual spinal levels when applied to young persons with spinal cord injury (SCI). To evaluate intra- and inter-rater agreement of repeated motor and sensory scores at individual spinal levels. Shriners Hospitals for Children--Philadelphia and Chicago, USA. A total of 189 youth with complete and incomplete SCI underwent four neurological exams by two different raters. Agreement between and within raters for each myotome and dermatome was evaluated for complete and incomplete SCI separately. Intraclass correlation coefficients and 95% confidence intervals were calculated. Overall, both intra- and inter-rater agreement resulted in moderate-to-high agreement among myotomes. Subjects with complete SCI had moderate agreement for light touch (LT) and pin prick (PP) testing, whereas subjects with incomplete SCI had >60.0% of dermatomes resulting in poor agreement for PP testing. Overall, moderate-to-high agreement was found for muscle strength comparisons and moderate-to-poor agreement was found for PP and LT.
Inferior turbinate classification system, grades 1 to 4: development and validation study.

PubMed

Camacho, Macario; Zaghi, Soroush; Certal, Victor; Abdullatif, Jose; Means, Casey; Acevedo, Jason; Liu, Stanley; Brietzke, Scott E; Kushida, Clete A; Capasso, Robson

2015-02-01

To develop a validated inferior turbinate grading scale. Development and validation study. Phase 1 development (alpha test) consisted of a proposal of 10 different inferior turbinate grading scales (>1,000 clinic patients). Phase 2 validation (beta test) utilized 10 providers grading 27 standardized endoscopic photos of inferior turbinates using two different classification systems. Phase 3 validation (pilot study) consisted of 100 live consecutive clinic patients (n = 200 inferior turbinates) who were each prospectively graded by 18 different combinations of two independent raters, and grading was repeated by each of the same two raters, two separate times for each patient. In the development phase, 25% (grades 1-4) and 33% (grades 1-4) were the most useful systems. In the validation phase, the 25% classification system was found to be the best balance between potential clinical utility and ability to grade; the photo grading demonstrated a Cohen's kappa (κ) = 0.4671 ± 0.0082 (moderate inter-rater agreement). Live-patient grading with the 25% classification system demonstrated an overall inter-rater reliability of 71.5% (95% confidence interval [CI]: 64.8-77.3), with overall substantial agreement (κ = 0.704 ± 0.028). Intrarater reliability was 91.5% (95% CI: 88.7-94.3). Distribution for the 200 inferior turbinates was as follows: 25% quartile = grade 1, 50% quartile (median) = grade 2, 75% quartile = grade 3, and 90% quartile = grade 4. Mean turbinate size was 2.22 (95% CI: 2.07-2.34; standard deviation 1.02). Categorical κ was as follows: grade 1, 0.8541 ± 0.0289; grade 2, 0.7310 ± 0.0289; grade 3, 0.6997 ± 0.0289, and grade 4, 0.7760 ± 0.0289. The 25% (grades 1-4) inferior turbinate classification system is a validated grading scale with high intrarater and inter-rater reliability. This system can facilitate future research by tracking the effect of interventions on inferior turbinates. Level of evidence: 2c.
© 2014 The American Laryngological, Rhinological and Otological Society, Inc.
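Cohen's kappa, used in this record for graded agreement, corrects the raw proportion of agreement between two raters for the agreement expected by chance. The sketch below is a minimal Python illustration with hypothetical grades on a 1 to 4 scale; the function name `cohen_kappa` and the data are assumptions for illustration, and the abstract does not say how ratings from its 18 rater combinations were pooled.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters grading the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(rater_a)
    # Observed agreement: share of items given the same grade by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per grade.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical grade throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical grades (1-4) assigned by two raters to eight turbinates.
grades_rater_1 = [1, 1, 2, 2, 3, 3, 4, 4]
grades_rater_2 = [1, 1, 2, 3, 3, 3, 4, 1]

print(round(cohen_kappa(grades_rater_1, grades_rater_2), 3))  # 0.667
```

In this toy data the raters agree on 6 of 8 items (75%), yet kappa is about 0.67 because a quarter of that agreement would be expected by chance alone. This is why percent agreement and kappa for the same ratings usually differ.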
Spanish translation, cross-cultural adaptation, and validation of the Questionnaire for Diabetes-Related Foot Disease (Q-DFD)

PubMed Central

Castillo-Tandazo, Wilson; Flores-Fortty, Adolfo; Feraud, Lourdes; Tettamanti, Daniel

2013-01-01

Purpose To translate, cross-culturally adapt, and validate the Questionnaire for Diabetes-Related Foot Disease (Q-DFD), originally created and validated in Australia, for its use in Spanish-speaking patients with diabetes mellitus. Patients and methods The translation and cross-cultural adaptation were based on international guidelines. The Spanish version of the survey was applied to a community-based (sample A) and a hospital clinic-based sample (samples B and C). Samples A and B were used to determine criterion and construct validity comparing the survey findings with clinical evaluation and medical records, respectively; while sample C was used to determine intra- and inter-rater reliability. Results After completing the rigorous translation process, only four items were considered problematic and required a new translation. In total, 127 patients were included in the validation study: 76 to determine criterion and construct validity and 41 to establish intra- and inter-rater reliability. For an overall diagnosis of diabetes-related foot disease, a substantial level of agreement was obtained when we compared the Q-DFD with the clinical assessment (kappa 0.77, sensitivity 80.4%, specificity 91.5%, positive likelihood ratio [LR+] 9.46, negative likelihood ratio [LR−] 0.21); while an almost perfect level of agreement was obtained when it was compared with medical records (kappa 0.88, sensitivity 87%, specificity 97%, LR+ 29.0, LR− 0.13). Survey reliability showed substantial levels of agreement, with kappa scores of 0.63 and 0.73 for intra- and inter-rater reliability, respectively. Conclusion The translated and cross-culturally adapted Q-DFD showed good psychometric properties (validity, reproducibility, and reliability) that allow its use in Spanish-speaking diabetic populations. PMID:24039434
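The likelihood ratios in this record follow directly from the reported sensitivity and specificity: LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. The short Python sketch below recomputes them from the percentages given in the abstract; the function name `likelihood_ratios` is illustrative.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as proportions."""
    if not (0 < sensitivity <= 1 and 0 < specificity < 1):
        raise ValueError("sensitivity must be in (0, 1] and specificity in (0, 1)")
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


# Q-DFD versus clinical assessment: sensitivity 80.4%, specificity 91.5%.
lr_pos, lr_neg = likelihood_ratios(0.804, 0.915)
print(round(lr_pos, 2), round(lr_neg, 2))  # 9.46 0.21

# Q-DFD versus medical records: sensitivity 87%, specificity 97%.
lr_pos, lr_neg = likelihood_ratios(0.87, 0.97)
print(round(lr_pos, 1), round(lr_neg, 2))  # 29.0 0.13
```

Both pairs reproduce the values reported in the abstract.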
Point-of-care urine tests for smoking status and isoniazid treatment monitoring in adult patients.

PubMed

Nicolau, Ioana; Tian, Lulu; Menzies, Dick; Ostiguy, Gaston; Pai, Madhukar

2012-01-01

Poor adherence to isoniazid (INH) preventive therapy (IPT) is an impediment to effective control of latent tuberculosis (TB) infection. TB patients who smoke are at higher risk of latent TB infection, active disease, and TB mortality, and may have lower adherence to their TB medications. The objective of our study was to validate IsoScreen and SmokeScreen (GFC Diagnostics, UK), two point-of-care tests for monitoring INH intake and determining smoking status. The tests could be used together in the same individual to help identify patients with a high-risk profile and provide a tailored treatment plan that includes medication management, adherence interventions, and smoking cessation programs. 200 adult outpatients attending the TB and/or the smoking cessation clinic were recruited at the Montreal Chest Institute. Sensitivity and specificity were measured for each test against the corresponding composite reference standard. Test reliability was measured using kappa statistic for intra-rater and inter-rater agreement. Univariate and multivariate logistic regression models were used to explore possible covariates that might be related to false-positive and false-negative test results. IsoScreen had a sensitivity of 93.2% (95% confidence interval [CI] 80.3, 98.2) and specificity of 98.7% (94.8, 99.8). IsoScreen had intra-rater agreement (kappa) of 0.75 (0.48, 0.94) and inter-rater agreement of 0.61 (0.27, 0.90). SmokeScreen had a sensitivity of 69.2% (56.4, 79.8), specificity of 81.6% (73.0, 88.0), intra-rater agreement of 0.77 (0.56, 0.94), and inter-rater agreement of 0.66 (0.42, 0.88). False-positive SmokeScreen tests were strongly associated with INH treatment. IsoScreen had high validity and reliability, whereas SmokeScreen had modest validity and reliability. SmokeScreen tests did not perform well in a population receiving INH due to the association between INH treatment and false-positive SmokeScreen test results. 
Development of the next generation SmokeScreen assay should account for this potential interference.
Point-of-Care Urine Tests for Smoking Status and Isoniazid Treatment Monitoring in Adult Patients

PubMed Central

Nicolau, Ioana; Tian, Lulu; Menzies, Dick; Ostiguy, Gaston; Pai, Madhukar

2012-01-01

Background Poor adherence to isoniazid (INH) preventive therapy (IPT) is an impediment to effective control of latent tuberculosis (TB) infection. TB patients who smoke are at higher risk of latent TB infection, active disease, and TB mortality, and may have lower adherence to their TB medications. The objective of our study was to validate IsoScreen and SmokeScreen (GFC Diagnostics, UK), two point-of-care tests for monitoring INH intake and determining smoking status. The tests could be used together in the same individual to help identify patients with a high-risk profile and provide a tailored treatment plan that includes medication management, adherence interventions, and smoking cessation programs. Methodology/Principal Findings 200 adult outpatients attending the TB and/or the smoking cessation clinic were recruited at the Montreal Chest Institute. Sensitivity and specificity were measured for each test against the corresponding composite reference standard. Test reliability was measured using kappa statistic for intra-rater and inter-rater agreement. Univariate and multivariate logistic regression models were used to explore possible covariates that might be related to false-positive and false-negative test results. IsoScreen had a sensitivity of 93.2% (95% confidence interval [CI] 80.3, 98.2) and specificity of 98.7% (94.8, 99.8). IsoScreen had intra-rater agreement (kappa) of 0.75 (0.48, 0.94) and inter-rater agreement of 0.61 (0.27, 0.90). SmokeScreen had a sensitivity of 69.2% (56.4, 79.8), specificity of 81.6% (73.0, 88.0), intra-rater agreement of 0.77 (0.56, 0.94), and inter-rater agreement of 0.66 (0.42, 0.88). False-positive SmokeScreen tests were strongly associated with INH treatment. Conclusions IsoScreen had high validity and reliability, whereas SmokeScreen had modest validity and reliability. 
SmokeScreen tests did not perform well in a population receiving INH due to the association between INH treatment and false-positive SmokeScreen test results. Development of the next generation SmokeScreen assay should account for this potential interference. PMID:23029310
Evaluating "The Safe Living Guide": A Home Hazard Checklist for Seniors

ERIC Educational Resources Information Center

Sorcinelli, Andrea; Shaw, Lynn; Freeman, Andrew; Cooper, Kim

2007-01-01

Purpose: The purpose of this study was to evaluate the utility and reliability of a home hazard checklist published in Health Canada, "The Safe Living Guide: A Guide to Home Safety for Seniors" (2003). Methods: 76 community-dwelling seniors evaluated the guide, and inter-rater reliability was determined through comparison of ratings of…
Developing a digital photography-based method for dietary analysis in self-serve dining settings.

PubMed

Christoph, Mary J; Loman, Brett R; Ellison, Brenna

2017-07-01

Current population-based methods for assessing dietary intake, including food frequency questionnaires, food diaries, and 24-h dietary recall, are limited in their ability to objectively measure food intake. Digital photography has been identified as a promising addition to these techniques but has rarely been assessed in self-serve settings. We utilized digital photography to examine university students' food choices and consumption in a self-serve dining hall setting. Research assistants took pre- and post-photos of students' plates during lunch and dinner to assess selection (presence), servings, and consumption of MyPlate food groups. Four coders rated the same set of approximately 180 meals for inter-rater reliability analyses; approximately 50 additional meals were coded twice by each coder to assess intra-rater agreement. Inter-rater agreement on the selection, servings, and consumption of food groups was high at 93.5%; intra-rater agreement was similarly high with an average of 95.6% agreement. Coders achieved the highest rates of agreement in assessing if a food group was present on the plate (95-99% inter-rater agreement, depending on food group) and estimating the servings of food selected (81-98% inter-rater agreement). Estimating consumption, particularly for items such as beans and cheese that were often in mixed dishes, was more challenging (77-94% inter-rater agreement). Results suggest that the digital photography method presented is feasible for large studies in real-world environments and can provide an objective measure of food selection, servings, and consumption with a high degree of agreement between coders; however, to make accurate claims about the state of dietary intake in all-you-can-eat, self-serve settings, researchers will need to account for the possibility of diners taking multiple trips through the serving line. Copyright © 2017 Elsevier Ltd. All rights reserved.
Construct validity and inter-rater reliability of the Dutch activity measure for post-acute care "6-clicks" basic mobility form to assess the mobility of hospitalized patients.

PubMed

Geelen, Sven Jacobus Gertruda; Valkenet, Karin; Veenhof, Cindy

2018-05-12

To evaluate the construct validity and the inter-rater reliability of the Dutch Activity Measure for Post-Acute Care "6-clicks" Basic Mobility short form measuring the patient's mobility in Dutch hospital care. First, the "6-clicks" was translated by using a forward-backward translation protocol. Next, 64 patients were assessed by the physiotherapist to determine the validity while being admitted to the Internal Medicine wards of a university medical center. Six hypotheses were tested regarding the construct "mobility" which showed that: Better "6-clicks" scores were related to less restrictive pre-admission living situations (p = 0.011), less restrictive discharge locations (p = 0.001), more independence in activities of daily living (p = 0.001) and less physiotherapy visits (p < 0.001). A correlation was found between the "6-clicks" and length of stay (r= -0.408, p = 0.001), but not between the "6-clicks" and age (r= -0.180, p = 0.528). To determine the inter-rater reliability, an additional 50 patients were assessed by pairs of physiotherapists who independently scored the patients. Intraclass Correlation Coefficients of 0.920 (95%CI: 0.828-0.964) were found. The Kappa Coefficients for the individual items ranged from 0.649 (walking stairs) to 0.841 (sit-to-stand). The Dutch "6-clicks" shows a good construct validity and moderate-to-excellent inter-rater reliability when used to assess the mobility of hospitalized patients. Implications for Rehabilitation Even though various measurement tools have been developed, it appears the majority of physiotherapists working in a hospital currently do not use these tools as a standard part of their care. The Activity Measure for Post-Acute Care "6-clicks" Basic Mobility is the only tool which is designed to be short, easy to use within usual care and has been validated in the entire hospital population. 
This study shows that the Dutch version of the Activity Measure for Post-Acute Care "6-clicks" Basic Mobility form is a valid, easy to use, quick tool to assess the basic mobility of Dutch hospitalized patients.
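Several records in this list, including this one, report an intraclass correlation coefficient (ICC) for pairs of raters. The abstract does not state which ICC form was used. The Python sketch below shows one common choice, the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1) in the Shrout and Fleiss notation, computed from ANOVA mean squares. The function name `icc_2_1` and the ratings are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per subject, one column per rater.
    """
    n = len(ratings)     # number of subjects
    k = len(ratings[0])  # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 subjects and >= 2 raters")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Sums of squares for subjects (rows), raters (columns) and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical mobility scores: five patients, each scored by two raters.
scores = [[18, 19], [12, 12], [22, 21], [9, 11], [15, 15]]
print(round(icc_2_1(scores), 3))
```

Because this form penalises systematic differences between raters as well as random disagreement, it is typically lower than a consistency-type ICC computed on the same data.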
Evaluating the use of in-store measures in retail food stores and restaurants in Brazil.

PubMed

Duran, Ana Clara; Lock, Karen; Latorre, Maria do Rosario D O; Jaime, Patricia Constante

2015-01-01

To assess inter-rater reliability, test-retest reliability, and construct validity of retail food store, open-air food market, and restaurant observation tools adapted to the Brazilian urban context. This study is part of a cross-sectional observation survey conducted in 13 districts across the city of Sao Paulo, Brazil in 2010-2011. Food store and restaurant observational tools were developed based on previously available tools, and then tested. They included measures on the availability, variety, quality, pricing, and promotion of fruits and vegetables and ultra-processed foods. We used Kappa statistics and intra-class correlation coefficients to assess inter-rater and test-retest reliabilities in samples of 142 restaurants, 97 retail food stores (including open-air food markets), and of 62 restaurants and 45 retail food stores (including open-air food markets), respectively. Construct validity as the tool's abilities to discriminate based on store types and different income contexts were assessed in the entire sample: 305 retail food stores, 8 fruits and vegetable markets, and 472 restaurants. Inter-rater and test-retest reliability were generally high, with most Kappa values greater than 0.70 (range 0.49-1.00). Both tools discriminated between store types and neighborhoods with different median income. Fruits and vegetables were more likely to be found in middle to higher-income neighborhoods, while soda, fruit-flavored drink mixes, cookies, and chips were cheaper and more likely to be found in lower-income neighborhoods. The measures were reliable and able to reveal significant differences across store types and different contexts. Although some items may require revision, results suggest that the tools may be used to reliably measure the food stores and restaurant food environment in urban settings of middle-income countries. Such studies can help inform health promotion interventions and policies in these contexts.
Evaluating the use of in-store measures in retail food stores and restaurants in Brazil

PubMed Central

Duran, Ana Clara; Lock, Karen; Latorre, Maria do Rosario D O; Jaime, Patricia Constante

2015-01-01

ABSTRACT OBJECTIVE To assess inter-rater reliability, test-retest reliability, and construct validity of retail food store, open-air food market, and restaurant observation tools adapted to the Brazilian urban context. METHODS This study is part of a cross-sectional observation survey conducted in 13 districts across the city of Sao Paulo, Brazil in 2010-2011. Food store and restaurant observational tools were developed based on previously available tools, and then tested. They included measures on the availability, variety, quality, pricing, and promotion of fruits and vegetables and ultra-processed foods. We used Kappa statistics and intra-class correlation coefficients to assess inter-rater and test-retest reliabilities in samples of 142 restaurants, 97 retail food stores (including open-air food markets), and of 62 restaurants and 45 retail food stores (including open-air food markets), respectively. Construct validity as the tool’s abilities to discriminate based on store types and different income contexts were assessed in the entire sample: 305 retail food stores, 8 fruits and vegetable markets, and 472 restaurants. RESULTS Inter-rater and test-retest reliability were generally high, with most Kappa values greater than 0.70 (range 0.49-1.00). Both tools discriminated between store types and neighborhoods with different median income. Fruits and vegetables were more likely to be found in middle to higher-income neighborhoods, while soda, fruit-flavored drink mixes, cookies, and chips were cheaper and more likely to be found in lower-income neighborhoods. CONCLUSIONS The measures were reliable and able to reveal significant differences across store types and different contexts. Although some items may require revision, results suggest that the tools may be used to reliably measure the food stores and restaurant food environment in urban settings of middle-income countries. 
Such studies can help inform health promotion interventions and policies in these contexts. PMID:26538101
Validity and reliability of criterion based clinical audit to assess obstetrical quality of care in West Africa.

PubMed

Pirkle, Catherine M; Dumont, Alexandre; Traore, Mamadou; Zunzunegui, Maria-Victoria

2012-10-29

In Mali and Senegal, over 1% of women die giving birth in hospital. At some hospitals, over a third of infants are stillborn. Many deaths are due to substandard medical practices. Criterion-based clinical audits (CBCA) are increasingly used to measure and improve obstetrical care in resource-limited settings, but their measurement properties have not been formally evaluated. In 2011, we published a systematic review of obstetrical CBCA highlighting insufficient considerations of validity and reliability. The objective of this study is to develop an obstetrical CBCA adapted to the West African context and assess its reliability and validity. This work was conducted as a sub-study within a cluster randomized trial known as QUARITE. Criteria were selected based on extensive literature review and expert opinion. Early 2010, two auditors applied the CBCA to identical samples at 8 sites in Mali and Senegal (n = 185) to evaluate inter-rater reliability. In 2010-11, we conducted CBCA at 32 hospitals to assess construct validity (n = 633 patients). We correlated hospital characteristics (resource availability, facility perinatal and maternal mortality) with mean hospital CBCA scores. We used generalized estimating equations to assess whether patient CBCA scores were associated with perinatal mortality. Results demonstrate substantial (ICC = 0.67, 95% CI 0.54; 0.76) to elevated inter-rater reliability (ICC = 0.84, 95% CI 0.77; 0.89) in Senegal and Mali, respectively. Resource availability positively correlated with mean hospital CBCA scores and maternal and perinatal mortality were inversely correlated with hospital CBCA scores. Poor CBCA scores, adjusted for hospital and patient characteristics, were significantly associated with perinatal mortality (OR 1.84, 95% CI 1.01-3.34). Our CBCA has substantial inter-rater reliability and there is compelling evidence of its validity as the tool performs according to theory. Current Controlled Trials ISRCTN46950658.
The Shoulder Objective Practical Assessment Tool: Evaluation of a New Tool Assessing Residents Learning in Diagnostic Shoulder Arthroscopy.

PubMed

Talbot, Christopher L; Holt, Edward M; Gooding, Benjamin W T; Tennent, Thomas D; Foden, Philip

2015-08-01

To design and validate an objective practical assessment tool for diagnostic shoulder arthroscopy that would provide residents with a method to evaluate their progression in this field of surgery and to identify specific learning needs. We designed and evaluated the shoulder Objective Practical Assessment Tool (OPAT). The shoulder OPAT was designed by us, and scoring domains were created using a Delphi process. The shoulder OPAT was trialed by members of the British Elbow & Shoulder Society Education Committee for internal consistency and ease of use before being offered to other trainers and residents. Inter-rater reliability and intrarater reliability were calculated. One hundred forty orthopaedic residents, of varying seniority, within 5 training regions in the United Kingdom, were questioned regarding the tool. A pilot study of 6 residents was undertaken. Internal consistency was 0.77 (standardized Cronbach α). Inter-rater reliability was 0.60, and intrarater reliability was 0.82. The Spearman correlation coefficient (r) between the global summary score for the shoulder OPAT and the current assessment tool used in postgraduate training for orthopaedic residents undertaking diagnostic shoulder arthroscopy equaled 0.74. Of the residents, 82% agreed or strongly agreed when asked if the shoulder OPAT would be a useful tool in monitoring progression and 72% agreed or strongly agreed with the introduction of the shoulder OPAT within the orthopaedic domain. This study shows that the shoulder OPAT fulfills several aspects of reliability and validity when tested. Despite the inter-rater reliability being 0.60, we believe that the shoulder OPAT has the potential to play a role alongside the current assessment tool in the training of orthopaedic residents. The shoulder OPAT can be used to assess residents during shoulder arthroscopy and has the potential for use in medical education, as well as arthroscopic skills training in the operating theater. 
Copyright © 2015 Arthroscopy Association of North America. Published by Elsevier Inc. All rights reserved.
Manual for the Extrapyramidal Symptom Rating Scale (ESRS).

PubMed

Chouinard, Guy; Margolese, Howard C

2005-07-15

The Extrapyramidal Symptom Rating Scale (ESRS) was developed to assess four types of drug-induced movement disorders (DIMD): Parkinsonism, akathisia, dystonia, and tardive dyskinesia (TD). Comprehensive ESRS definitions and basic instructions are given. Factor analysis provided six ESRS factors: 1) hypokinetic Parkinsonism; 2) orofacial dyskinesia; 3) trunk/limb dyskinesia; 4) akathisia; 5) tremor; and 6) tardive dystonia. Two pivotal studies found high inter-rater reliability correlations in both antipsychotic-induced movement disorders and idiopathic Parkinson disease. For inter-rater reliability and certification of raters, ≥80% of item ratings of the complete scale should be ±1 point of expert ratings and ≥70% of ratings on individual items of each ESRS subscale should be ±1 point of expert ratings. During a cross-scale comparison, AIMS and ESRS were found to have a 96% (359/374) agreement between TD-defined cases by DSM-IV TD criteria. Two recent international studies using the ESRS included over 3000 patients worldwide and showed an incidence of TD ranging from 10.2% (2000) to 12% (1998). ESRS specificity was investigated through two different approaches, path analyses and ANCOVA PANSS factors changes, which found that ESRS measurement of drug-induced EPS is valid and discriminative from psychiatric symptoms.
Development and evaluation of an instrument for assessing brief behavioral change interventions.

PubMed

Strayer, Scott M; Martindale, James R; Pelletier, Sandra L; Rais, Salehin; Powell, Jon; Schorling, John B

2011-04-01

To develop an observational coding instrument for evaluating the fidelity and quality of brief behavioral change interventions based on the behavioral theories of the 5 A's, Stages of Change and Motivational Interviewing. Content and face validity were assessed prior to an intervention where psychometric properties were evaluated with a prospective cohort of 116 medical students. Properties assessed included the inter-rater reliability of the instrument, internal consistency of the full scale and sub-scales and descriptive statistics of the instrument. Construct validity was assessed based on student's scores. Inter-rater reliability for the instrument was 0.82 (intraclass correlation). Internal consistency for the full scale was 0.70 (KR20). Internal consistencies for the sub-scales were as follows: MI intervention component (KR20=.7); stage-appropriate MI-based intervention (KR20=.55); MI spirit (KR20=.5); appropriate assessment (KR20=.45) and appropriate assisting (KR20=.56). The instrument demonstrated good inter-rater reliability and moderate overall internal consistency when used to assess performing brief behavioral change interventions by medical students. This practical instrument can be used with minimal training and demonstrates promising psychometric properties when evaluated with medical students counseling standardized patients. Further testing is required to evaluate its usefulness in clinical settings. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
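The internal consistency figures in this record are Kuder-Richardson 20 (KR-20) coefficients, which apply to items scored dichotomously (done / not done). A minimal Python sketch of the formula KR-20 = k/(k − 1) × (1 − Σ p_i q_i / σ²) follows, where k is the number of items, p_i is the proportion scoring 1 on item i, q_i = 1 − p_i, and σ² is the variance of the total scores. The data and the function name `kr20` are hypothetical, and the population variance is used here as one common convention.

```python
from statistics import pvariance


def kr20(item_scores):
    """KR-20 for a table of 0/1 scores: one row per examinee, one column per item."""
    n = len(item_scores)
    k = len(item_scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in item_scores):
        raise ValueError("need a complete table with >= 2 examinees and >= 2 items")
    # Sum over items of p * q, where p is the proportion scoring 1 on the item.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n
        sum_pq += p * (1 - p)
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; KR-20 is undefined")
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Hypothetical checklist: four students scored on three dichotomous items.
checklist = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(checklist))  # 0.75
```

KR-20 is the special case of Cronbach's alpha for dichotomous items, so the same interpretation of values such as 0.70 for the full scale applies.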
Implicit Review Instrument to Evaluate Quality of Care Delivered by Physicians to Children in Emergency Departments.

PubMed

Marcin, James P; Romano, Patrick S; Dharmar, Madan; Chamberlain, James M; Dudley, Nanette; Macias, Charles G; Nigrovic, Lise E; Powell, Elizabeth C; Rogers, Alexander J; Sonnett, Meridith; Tzimenatos, Leah; Alpern, Elizabeth R; Andrews-Dickert, Rebecca; Borgialli, Dominic A; Sidney, Erika; Casper, Charlie; Dean, Jonathan Michael; Kuppermann, Nathan

2018-06-01

To evaluate the consistency, reliability, and validity of an implicit review instrument that measures the quality of care provided to children in the emergency department (ED). Medical records of randomly selected children from 12 EDs in the Pediatric Emergency Care Applied Research Network (PECARN). Eight pediatric emergency medicine physicians applied the instrument to 620 medical records. We determined internal consistency using Cronbach's alpha and inter-rater reliability using the intraclass correlation coefficient (ICC). We evaluated the validity of the instrument by correlating scores with four condition-specific explicit review instruments. Individual reviewers' Cronbach's alpha had a mean of 0.85 with a range of 0.76-0.97; overall Cronbach's alpha was 0.90. The ICC was 0.49 for the summary score with a range from 0.40 to 0.46. Correlations between the quality of care score and the four condition-specific explicit review scores ranged from 0.24 to 0.38. The quality of care instrument demonstrated good internal consistency, moderate inter-rater reliability, high inter-rater agreement, and evidence supporting validity. The instrument could be useful for systems' assessment and research in evaluating the care delivered to children in the ED. © Health Research and Educational Trust.
An analysis of functional shoulder movements during task performance using Dartfish movement analysis software.

PubMed

Khadilkar, Leenesh; MacDermid, Joy C; Sinden, Kathryn E; Jenkyn, Thomas R; Birmingham, Trevor B; Athwal, George S

2014-01-01

Video-based movement analysis software (Dartfish) has potential for clinical applications for understanding shoulder motion if functional measures can be reliably obtained. The primary purpose of this study was to describe the functional range of motion (ROM) of the shoulder used to perform a subset of functional tasks. A second purpose was to assess the reliability of functional ROM measurements obtained by different raters using Dartfish software. Ten healthy participants, mean age 29 ± 5 years, were videotaped while performing five tasks selected from the Disabilities of the Arm, Shoulder and Hand (DASH). Video cameras and markers were used to obtain video images suitable for analysis in Dartfish software. Three repetitions of each task were performed. Shoulder movements from all three repetitions were analyzed using Dartfish software. The tracking tool of the Dartfish software was used to obtain shoulder joint angles and arcs of motion. Test-retest and inter-rater reliability of the measurements were evaluated using intraclass correlation coefficients (ICC). Maximum (coronal plane) abduction (118° ± 16°) and (sagittal plane) flexion (111° ± 15°) were observed during 'washing one's hair;' maximum extension (-68° ± 9°) was identified during 'washing one's own back.' Minimum shoulder ROM was observed during 'opening a tight jar' (33° ± 13° abduction and 13° ± 19° flexion). Test-retest reliability (ICC = 0.45 to 0.94) suggests high inter-individual task variability, and inter-rater reliability (ICC = 0.68 to 1.00) showed moderate to excellent agreement. Key findings include: 1) functional shoulder ROM identified in this study compared to similar studies; 2) healthy individuals require less than full ROM when performing five common ADL tasks; 3) high participant variability was observed during performance of the five ADL tasks; and 4) Dartfish software provides a clinically relevant tool to analyze shoulder function.
The London field trial for hoarding disorder.

PubMed

Mataix-Cols, D; Billotti, D; Fernández de la Cruz, L; Nordsletten, A E

2013-04-01

A new diagnostic category, hoarding disorder (HD), has been proposed for inclusion in DSM-5. This study field-tested the validity, reliability and perceived acceptability of the proposed diagnostic criteria for HD. Method Fifty unselected individuals with prominent hoarding behavior and 20 unselected, self-defined 'collectors' participated in thorough psychiatric assessments, involving home visits whenever possible. A semi-structured interview based on the proposed diagnostic criteria for HD was administered and scored by two independent raters. 'True' diagnoses were made by consensus according to the best-estimate diagnosis procedure. The percentage of true positive HD cases (sensitivity) and true negative HD cases (specificity) was calculated, along with inter-rater reliability for the diagnosis and each criterion. Participants were asked about their perceptions of the acceptability, utility and stigma associated with the new diagnosis. Twenty-nine (58%) of the hoarding individuals and none of the collectors fulfilled diagnostic criteria for HD. The sensitivity, specificity and inter-rater reliability of the diagnosis, and of each individual criterion and the specifiers, were excellent. Most participants with HD (96%) felt that creating a new disorder would be very or somewhat acceptable, useful (96%) and not too stigmatizing (59%). The proposed HD criteria are valid, reliable and perceived as acceptable and useful by the sufferers. Crucially, they seem to be sufficiently conservative and unlikely to overpathologize normative behavior. Minor changes in the wording of the criteria are suggested.
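The sensitivity and specificity mentioned in this record compare a rater's diagnosis with the consensus ("true") diagnosis. A minimal sketch of that calculation follows; the function name and the boolean data are hypothetical, and the abstract does not publish per-participant results.

```python
def sensitivity_specificity(rater_positive, truly_positive):
    """Return (sensitivity, specificity) for paired boolean judgements.

    rater_positive[i] is the rater's call for participant i and
    truly_positive[i] is the reference diagnosis (assumed layout).
    """
    pairs = list(zip(rater_positive, truly_positive))
    tp = sum(r and t for r, t in pairs)              # true positives
    fn = sum((not r) and t for r, t in pairs)        # missed cases
    tn = sum((not r) and (not t) for r, t in pairs)  # true negatives
    fp = sum(r and (not t) for r, t in pairs)        # false alarms
    return tp / (tp + fn), tn / (tn + fp)


# Invented example: 3 true cases and 3 non-cases.
sens, spec = sensitivity_specificity(
    [True, True, False, True, False, False],
    [True, True, True, False, False, False],
)
```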
Development of the Music Therapy Assessment Tool for Advanced Huntington's Disease: A Pilot Validation Study.

PubMed

O'Kelly, Julian; Bodak, Rebeka

2016-01-01

Case studies of people with Huntington's disease (HD) report that music therapy provides a range of benefits that may improve quality of life; however, no robust music therapy assessment tools exist for this population. The objective was to develop and conduct preliminary psychometric testing of a music therapy assessment tool for patients with advanced HD. First, we established content and face validity of the Music Therapy Assessment Tool for Advanced HD (MATA-HD) through focus groups and field testing. Second, we examined psychometric properties of the resulting MATA-HD in terms of its construct validity, internal consistency, and inter-rater and intra-rater reliability over 10 group music therapy sessions with 19 patients. The resulting MATA-HD included a total of 15 items across six subscales (Arousal/Attention, Physical Presentation, Communication, Musical, Cognition, and Psychological/Behavioral). We found good construct validity (r ≥ 0.7) for Mood, Communication Level, Communication Effectiveness, Choice, Social Behavior, Arousal, and Attention items. Cronbach's α of 0.825 indicated good internal consistency across 11 items with a common focus of engagement in therapy. The inter-rater reliability (IRR) intraclass correlation coefficient (ICC) scores averaged 0.65, and a mean intra-rater ICC of 0.68 was obtained. Further training and retesting provided a mean IRR ICC of 0.7. Preliminary data indicate that the MATA-HD is a promising tool for measuring patient responses to music therapy interventions across psychological, physical, social, and communication domains of functioning in patients with advanced HD. © the American Music Therapy Association 2016. All rights reserved.
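Cronbach's α, the internal-consistency statistic cited in this record, compares the sum of the item variances with the variance of the total score. The sketch below shows the standard formula; the function name and the tiny data set are invented for illustration and are unrelated to the MATA-HD data.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table with one row per respondent
    and one column per item (assumed layout)."""
    k = len(item_scores[0])  # number of items
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Invented scores for three respondents on two items.
alpha = cronbach_alpha([[1, 2], [2, 1], [3, 3]])
```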
Inter-Rater and Test-Retest Reliability of the Beery VMI in Schoolchildren

PubMed Central

Harvey, Erin M.; Leonard-Green, Tina K.; Mohan, Kathleen M.; Kulp, Marjean Taylor; Davis, Amy L.; Miller, Joseph M.; Twelker, J. Daniel; Campus, Irene; Dennis, Leslie K.

2017-01-01

Purpose To assess inter-rater and test-retest reliability of the 6th Edition Beery-Buktenica Developmental Test of Visual-Motor Integration (VMI) and test-retest reliability of the VMI Visual Perception Supplemental Test (VMIp) in school-age children. Methods Subjects were 163 Native American 3rd – 8th grade students with no significant refractive error (astigmatism < 1.00 D, myopia: < 0.75 D, hyperopia: < 2.50 D, anisometropia < 1.50 D) or ocular abnormalities. The VMI and VMIp were administered twice, on separate days. All VMI tests were scored by two trained scorers and a subset of 50 tests was also scored by an experienced scorer. Scorers strictly applied objective scoring criteria. Analyses included inter-rater and test-retest assessments of bias, 95% limits of agreement, and intraclass correlation analysis. Results Trained scorers had no significant scoring bias compared to the experienced scorer. One of the two trained scorers tended to provide higher scores than the other (mean difference in standardized scores = 1.54). Inter-rater correlations were strong (0.75 to 0.88). VMI and VMIp test-retest comparisons indicated no significant bias (subjects did not tend to score better on retest). Test-retest correlations were moderate (0.54 to 0.58). The 95% LOAs for the VMI were −24.14 to 24.67 (scorer 1) and −26.06 to 26.58 (scorer 2) and the 95% LOAs for the VMIp were −27.11 to 27.34. Conclusions The 95% LOA for test-retest differences will be useful for determining if the VMI and VMIp have sufficient sensitivity for detecting change with treatment in both clinical and research settings. Further research on test-retest reliability reporting 95% LOAs for children across different age ranges is recommended, particularly if the test is to be used to detect changes due to intervention or treatment. PMID:28422801
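The 95% limits of agreement (LOA) reported in this record are conventionally computed as the mean test-retest difference plus or minus 1.96 standard deviations of the differences. A minimal sketch follows, assuming that convention and taking each difference as retest minus first test; the function name and scores are invented and do not reproduce the study's data.

```python
from statistics import mean, stdev


def limits_of_agreement(first, second):
    """Return (bias, lower LOA, upper LOA) for paired scores.

    Differences are taken as second minus first (an assumption).
    """
    diffs = [b - a for a, b in zip(first, second)]
    bias = mean(diffs)   # systematic shift between the two sessions
    sd = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Invented standard scores for four children tested on two days.
bias, lower, upper = limits_of_agreement([100, 95, 110, 105],
                                         [102, 93, 111, 104])
```

A change in an individual's score that falls inside the interval from `lower` to `upper` cannot be distinguished from ordinary retest variation, which is why wide limits bear on the test's sensitivity to change.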
Brazilian version of the Nottingham Sensory Assessment: validity, agreement and reliability.

PubMed

Lima, Daniela H F; Queiroz, Ana P; De Salvo, Geovana; Yoneyama, Simone M; Oberg, Telma D; Lima, Núbia M F V

2010-01-01

To investigate the inter-rater and intra-rater reliability, construct validity and internal consistency of the Brazilian version of the Nottingham Sensory Assessment for Stroke Patients (NSA). The instrument was translated into Portuguese from its original in English by a bilingual translator and was then back-translated into English. Twenty-one hemiparetics were evaluated by two examiners using the NSA and the Fugl-Meyer Assessment (FMA) of physical performance. A significant correlation was found between the FMA and the NSA (r=0.752). The NSA showed excellent internal consistency (0.86), and there was acceptable inter- and intra-rater reliability for all items of the NSA, except temperature. Significant ceiling effects were found for the NSA and the FMA. The Brazilian version of the NSA met the criteria for agreement, internal consistency and concurrent validity. It was quick and easy to apply, and it could be used within clinical practice in neuro-rehabilitation outpatient clinics to assess sensory functions following stroke. The significant ceiling effect for the NSA did not limit its use, given that for the same patients, the FMA also showed ceiling effects.

Inter-rater reliability of the PIPES tool: validation of a surgical capacity index for use in resource-limited settings.

PubMed

Markin, Abraham; Barbero, Roxana; Leow, Jeffrey J; Groen, Reinou S; Perlman, Greg; Habermann, Elizabeth B; Apelgren, Keith N; Kushner, Adam L; Nwomeh, Benedict C

2014-09-01

In response to the need for simple, rapid means of quantifying surgical capacity in low resource settings, Surgeons OverSeas (SOS) developed the personnel, infrastructure, procedures, equipment and supplies (PIPES) tool. The present investigation assessed the inter-rater reliability of the PIPES tool. As part of a government assessment of surgical services in Santa Cruz, Bolivia, the PIPES tool was translated into Spanish and applied in interviews with physicians at 31 public hospitals. An additional interview was conducted with nurses at a convenience sample of 25 of these hospitals. Physician and nurse responses were then compared to generate an estimate of reliability. For dichotomous survey items, inter-rater reliability between physicians and nurses was assessed using the Cohen's kappa statistic and percent agreement. The Pearson correlation coefficient was used to assess agreement for continuous items. Cohen's kappa was 0.46 for the infrastructure, 0.43 for the procedures, 0.26 for the equipment, and 0 for the supplies sections. The median correlation coefficient was 0.91 for continuous items. Correlation was 0.79 for the PIPES index, and ranged from 0.32 to 0.98 for continuous response items. Reliability of the PIPES tool was moderate for the infrastructure and procedures sections, fair for the equipment section, and poor for the supplies section when comparing surgeons' responses to nurses' responses, which is an extremely rigorous test of reliability. These results indicate that the PIPES tool is an effective measure of surgical capacity but that the equipment and supplies sections may need to be revised.
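Cohen's kappa corrects percent agreement for the agreement expected by chance, which is why a section can show a kappa of 0 even when two respondents usually give the same answer. The sketch below illustrates this with invented yes/no answers; the function names and data are hypothetical and are not drawn from the PIPES survey.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which the two raters gave the same answer."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal answer frequencies.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / n ** 2
    if p_e == 1:
        return float("nan")  # undefined when both raters use one category
    return (p_o - p_e) / (1 - p_e)


# Invented answers (1 = item available): one rater always says yes.
physician = [1] * 10
nurse = [1] * 9 + [0]
agreement = percent_agreement(physician, nurse)  # 9 of 10 items match
kappa = cohens_kappa(physician, nurse)           # no better than chance
```

In this made-up case the raters agree on 90% of items, yet kappa is 0 because that level of agreement is exactly what the marginal frequencies predict by chance.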
Additional Evidence for the Reliability and Validity of the Student Risk Screening Scale at the High School Level: A Replication and Extension

ERIC Educational Resources Information Center

Lane, Kathleen Lynne; Oakes, Wendy P.; Ennis, Robin Parks; Cox, Meredith Lucille; Schatschneider, Christopher; Lambert, Warren

2013-01-01

This study reports findings from a validation study of the Student Risk Screening Scale for use with 9th- through 12th-grade students (N = 1854) attending a rural fringe school. Results indicated high internal consistency, test-retest stability, and inter-rater reliability. Predictive validity was established across two academic years, with Spring…
Norming a VALUE rubric to assess graduate information literacy skills.

PubMed

Turbow, David J; Evener, Julie

2016-07-01

The study evaluated whether a modified version of the information literacy Valid Assessment of Learning in Undergraduate Education (VALUE) rubric would be useful for assessing the information literacy skills of graduate health sciences students. Through facilitated calibration workshops, an interdepartmental six-person team of librarians and faculty engaged in guided discussion about the meaning of the rubric criteria. They applied the rubric to score student work for a peer-review essay assignment in the "Information Literacy for Evidence-Based Practice" course. To determine inter-rater reliability, the raters participated in a follow-up exercise in which they independently applied the rubric to ten samples of work from a research project in the doctor of physical therapy program: the patient case report assignment. For the peer-review essay, a high level of consistency in scoring was achieved for the second workshop, with statistically significant intra-class correlation coefficients above 0.8 for 3 criteria: "Determine the extent of evidence needed," "Use evidence effectively to accomplish a specific purpose," and "Access the needed evidence." Participants concurred that the essay prompt and rubric criteria adequately discriminated the quality of student work for the peer-review essay assignment. When raters independently scored the patient case report assignment, inter-rater agreement was low and statistically insignificant for all rubric criteria (kappa = -0.16, p > 0.05 to kappa = 0.12, p > 0.05). While the peer-review essay assignment lent itself well to rubric calibration, scorers had a difficult time with the patient case report. Lack of familiarity among some raters with the specifics of the patient case report assignment and subject matter might have accounted for low inter-rater reliability. When norming, it is important to hold conversations about search strategies and expectations of performance. Overall, the authors found the rubric to be appropriate for assessing information literacy skills of graduate health sciences students.
Norming a VALUE rubric to assess graduate information literacy skills

PubMed Central

Turbow, David J.; Evener, Julie

2016-01-01

Objective The study evaluated whether a modified version of the information literacy Valid Assessment of Learning in Undergraduate Education (VALUE) rubric would be useful for assessing the information literacy skills of graduate health sciences students. Methods Through facilitated calibration workshops, an interdepartmental six-person team of librarians and faculty engaged in guided discussion about the meaning of the rubric criteria. They applied the rubric to score student work for a peer-review essay assignment in the “Information Literacy for Evidence-Based Practice” course. To determine inter-rater reliability, the raters participated in a follow-up exercise in which they independently applied the rubric to ten samples of work from a research project in the doctor of physical therapy program: the patient case report assignment. Results For the peer-review essay, a high level of consistency in scoring was achieved for the second workshop, with statistically significant intra-class correlation coefficients above 0.8 for 3 criteria: “Determine the extent of evidence needed,” “Use evidence effectively to accomplish a specific purpose,” and “Access the needed evidence.” Participants concurred that the essay prompt and rubric criteria adequately discriminated the quality of student work for the peer-review essay assignment. When raters independently scored the patient case report assignment, inter-rater agreement was low and statistically insignificant for all rubric criteria (kappa=−0.16, p>0.05–kappa=0.12, p>0.05). Conclusions While the peer-review essay assignment lent itself well to rubric calibration, scorers had a difficult time with the patient case report. Lack of familiarity among some raters with the specifics of the patient case report assignment and subject matter might have accounted for low inter-rater reliability. When norming, it is important to hold conversations about search strategies and expectations of performance. 
Overall, the authors found the rubric to be appropriate for assessing information literacy skills of graduate health sciences students. PMID:27366121
A semi-automated algorithm for hypothalamus volumetry in 3 Tesla magnetic resonance images.

PubMed

Wolff, Julia; Schindler, Stephanie; Lucas, Christian; Binninger, Anne-Sophie; Weinrich, Luise; Schreiber, Jan; Hegerl, Ulrich; Möller, Harald E; Leitzke, Marco; Geyer, Stefan; Schönknecht, Peter

2018-07-30

The hypothalamus, a small diencephalic gray matter structure, is part of the limbic system. Volumetric changes of this structure occur in psychiatric diseases, therefore there is increasing interest in precise volumetry. Based on our detailed volumetry algorithm for 7 Tesla magnetic resonance imaging (MRI), we developed a method for 3 Tesla MRI, adopting anatomical landmarks and work in triplanar view. We overlaid T1-weighted MR images with gray matter-tissue probability maps to combine anatomical information with tissue class segmentation. Then, we outlined regions of interest (ROIs) that covered potential hypothalamus voxels. Within these ROIs, seed growing technique helped define the hypothalamic volume using gray matter probabilities from the tissue probability maps. This yielded a semi-automated method with short processing times of 20-40 min per hypothalamus. In the MRIs of ten subjects, reliabilities were determined as intraclass correlations (ICC) and volume overlaps in percent. Three raters achieved very good intra-rater reliabilities (ICC 0.82-0.97) and good inter-rater reliabilities (ICC 0.78 and 0.82). Overlaps of intra- and inter-rater runs were very good (≥ 89.7%). We present a fast, semi-automated method for in vivo hypothalamus volumetry in 3 Tesla MRI. Copyright © 2018 Elsevier B.V. All rights reserved.
Reliability of the nursing care hour measure: a descriptive study.

PubMed

Klaus, Susan F; Dunton, Nancy; Gajewski, Byron; Potter, Catima

2013-07-01

The nursing care hour has become an international standard unit of measure in research where nurse staffing is a key variable. Until now, there have been no studies verifying whether nursing care hours obtained from hospital data sources can be collected reliably. To examine the processes used by hospitals to generate nursing care hour data and to evaluate inter-rater reliability and guideline compliance with standards of the National Database of Nursing Quality Indicators® (NDNQI®) and the National Quality Forum. Two-phase descriptive study of all NDNQI hospitals that submitted data in third quarter of 2007. Data for phase I came from an online survey created by the authors to ascertain the processes used by hospitals to collect nursing care hours and their compliance with standardized data collection guidelines. In phase II, inter-rater reliability was measured using intra-class correlations between nursing care hours generated from clock hour files submitted to the study team by participants' payroll/accounting departments and aggregated data submitted previously. Phase I data were obtained from a total of 714 respondents. Nearly half (48%) of all sites use payroll records to obtain nursing care hour data and 70% use one of the standardized methods for converting the bi-weekly hours into months. Unit secretaries were reportedly included in NCH by 17.4% of respondents and only 26.2% of sites could accurately identify the point at which newly hired nurses should be included. The phase II findings (n=11) support the ability of two independent raters to obtain similar results when calculating total nursing care hours according to standard guidelines (ICC=0.76-0.99). Although barriers exist, this study found support for hospitals' abilities to collect reliable nursing care hour data. Copyright © 2012 Elsevier Ltd. All rights reserved.
Skin colour assessment of replanted fingers in digital images and its reliability for the incorporation of images in nursing progress notes.

PubMed

Terashima, Taiko; Yoshimura, Sadako

2018-03-01

To determine whether nurses can accurately assess the skin colour of replanted fingers displayed as digital images on a computer screen. Colour measurement and clinical diagnostic methods for medical digital images have been studied, but reproducing skin colour on a computer screen remains difficult. The inter-rater reliability of skin colour assessment scores was evaluated. In May 2014, 21 nurses who worked on a trauma ward in Japan participated in testing. Six digital images with different skin colours were used. Colours were scored from both digital images and direct patient observation. The score from a digital image was defined as the test score, and its difference from the direct assessment score as the difference score. Intraclass correlation coefficients were calculated. Nurses' opinions were classified and summarised. The intraclass correlation coefficients for the test scores were fair. Although the intraclass correlation coefficients for the difference scores were poor, they improved to good when three images that might have contributed to poor reliability were excluded. Most nurses stated that it is difficult to assess skin colour in digital images; they did not think it could be a substitute for direct visual assessment. However, most nurses were in favour of including images in nursing progress notes. Although the inter-rater reliability was fairly high, the reliability of colour reproduction in digital images as indicated by the difference scores was poor. Nevertheless, nurses expect the incorporation of digital images in nursing progress notes to be useful. This gap between the reliability of digital colour reproduction and nurses' expectations towards it must be addressed. High inter-rater reliability for digital images in nursing progress notes was not observed. Assessments of future improvements in colour reproduction technologies are required. Further digitisation and visualisation of nursing records might pose challenges. © 2017 John Wiley & Sons Ltd.
Reliability of specific physical examination tests for the diagnosis of shoulder pathologies: a systematic review and meta-analysis.

PubMed

Lange, Toni; Matthijs, Omer; Jain, Nitin B; Schmitt, Jochen; Lützner, Jörg; Kopkow, Christian

2017-03-01

Shoulder pain in the general population is common and to identify the aetiology of shoulder pain, history, motion and muscle testing, and physical examination tests are usually performed. The aim of this systematic review was to summarise and evaluate intrarater and inter-rater reliability of physical examination tests in the diagnosis of shoulder pathologies. A comprehensive systematic literature search was conducted using MEDLINE, EMBASE, Allied and Complementary Medicine Database (AMED) and Physiotherapy Evidence Database (PEDro) through 20 March 2015. Methodological quality was assessed using the Quality Appraisal of Reliability Studies (QAREL) tool by 2 independent reviewers. The search strategy revealed 3259 articles, of which 18 finally met the inclusion criteria. These studies evaluated the reliability of 62 test and test variations used for the specific physical examination tests for the diagnosis of shoulder pathologies. Methodological quality ranged from 2 to 7 positive criteria of the 11 items of the QAREL tool. This review identified a lack of high-quality studies evaluating inter-rater as well as intrarater reliability of specific physical examination tests for the diagnosis of shoulder pathologies. In addition, reliability measures differed between included studies hindering proper cross-study comparisons. PROSPERO CRD42014009018. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://www.bmj.com/company/products-services/rights-and-licensing/.
New methods for analyzing semantic graph based assessments in science education

NASA Astrophysics Data System (ADS)

Vikaros, Lance Steven

This research investigated how the scoring of semantic graphs (known by many as concept maps) could be improved and automated in order to address issues of inter-rater reliability and scalability. As part of the NSF funded SENSE-IT project to introduce secondary school science students to sensor networks (NSF Grant No. 0833440), semantic graphs illustrating how temperature change affects water ecology were collected from 221 students across 16 schools. The graphing task did not constrain students' use of terms, as is often done with semantic graph based assessment due to coding and scoring concerns. The graphing software used provided real-time feedback to help students learn how to construct graphs, stay on topic and effectively communicate ideas. The collected graphs were scored by human raters using assessment methods expected to boost reliability, which included adaptations of traditional holistic and propositional scoring methods, use of expert raters, topical rubrics, and criterion graphs. High levels of inter-rater reliability were achieved, demonstrating that vocabulary constraints may not be necessary after all. To investigate a new approach to automating the scoring of graphs, thirty-two different graph features characterizing graphs' structure, semantics, configuration and process of construction were then used to predict human raters' scoring of graphs in order to identify feature patterns correlated to raters' evaluations of graphs' topical accuracy and complexity. Results led to the development of a regression model able to predict raters' scoring with 77% accuracy, with 46% accuracy expected when used to score new sets of graphs, as estimated via cross-validation tests. Although such performance is comparable to other graph and essay based scoring systems, cross-context testing of the model and methods used to develop it would be needed before it could be recommended for widespread use. 
Still, the findings suggest techniques for improving the reliability and scalability of semantic graph based assessments without requiring constraint of how ideas are expressed.
A Study to Identify and Analyze the Effects of Category and Frequency Sampling on the Reporting of Total Nursing Care Hour Requirements

DTIC Science & Technology

1989-06-26

quarterly basis as a means to validate inter-rater reliability. Reliability testing will be conducted by an independent, expert patient classifier...appointed by nursing administration (Vail, Norton, & Rimm, 1984). An independent, expert patient classifier is defined as an RN not assigned to the unit
The Evaluation of a Screening Tool for Children with an Intellectual Disability: The Child and Adolescent Intellectual Disability Screening Questionnaire

ERIC Educational Resources Information Center

McKenzie, Karen; Paxton, Donna; Murray, George; Milanesi, Paula; Murray, Aja Louise

2012-01-01

The study outlines the evaluation of an intellectual disability screening tool, the "Child and Adolescent Intellectual Disability Screening Questionnaire" ("CAIDS-Q"), with two age groups. A number of aspects of the reliability and validity of the "CAIDS-Q" were assessed for these two groups, including inter-rater reliability, convergent and…
Applying Resource Utilization Groups (RUG-III) in Hong Kong nursing homes.

PubMed

Chou, Kee-Lee; Chi, Iris; Leung, Joe C B

2008-01-01

Resource Utilization Groups III (RUG-III) is a case-mix system developed in the United States for categorization of nursing home residents and the financing of residential care services. In Hong Kong, RUG-III is based on several broad groups of residents. The aim of this study was to examine the reliability and validity of the RUG-III in Hong Kong nursing homes. A cross-sectional survey was conducted in seven residential facilities operated by one agency. Residents (N = 1,127) were assessed by the Minimum Data Set (MDS), and nursing as well as auxiliary staff care times were recorded within 2 weeks before or after the completion of MDS assessment. Forty-five out of 1,127 residents were re-interviewed by an independent assessor to assess the inter-rater reliability. The inter-rater reliability of MDS assessment was excellent (kappa = 0.76) and the original RUG-III accounted for about 30 per cent of nursing staff time. Results provide preliminary evidence to support that RUG-III is a reliable and valid case-mix system for Hong Kong nursing homes, but future studies must be explored to reduce the variance of resource use explained by this case-mix system.
Bimanual Capacity of Children With Cerebral Palsy: Intra- and Interrater Reliability of a Revised Edition of the Bimanual Fine Motor Function Classification.

PubMed

Elvrum, Ann-Kristin G; Beckung, Eva; Sæther, Rannei; Lydersen, Stian; Vik, Torstein; Himmelmann, Kate

2017-08-01

To develop a revised edition of the Bimanual Fine Motor Function (BFMF 2), as a classification of fine motor capacity in children with cerebral palsy (CP), and establish intra- and interrater reliability of this edition. The content of the original BFMF was discussed by an expert panel, resulting in a revised edition comprising the original description of the classification levels, but in addition including figures with specific explanatory text. Four professionals classified fine motor function of 79 children (3-17 years; 45 boys) who represented all subtypes of CP and Manual Ability Classification levels (I-V). Intra- and inter-rater reliability was assessed using overall intra-class correlation coefficient (ICC), and Cohen's quadratic weighted kappa. The overall ICC was 0.86. Cohen's weighted kappa indicated high intra-rater (κw > 0.90) and inter-rater (κw > 0.85) reliability. The revised BFMF 2 had high intra- and interrater reliability. The classification levels could be determined from short video recordings (<5 minutes), using the figures and precise descriptions of the fine motor function levels included in the BFMF 2. Thus, the BFMF 2 may be a feasible and useful classification of fine motor capacity both in research and in clinical practice.
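Quadratic weighted kappa, the statistic used in this record for the ordered classification levels, penalises a disagreement by the squared distance between the two levels chosen, so near misses cost less than distant ones. A minimal sketch follows; the function name, the numeric level coding and the ratings are assumptions for illustration and are not taken from the BFMF 2 data.

```python
def quadratic_weighted_kappa(a, b, categories):
    """Cohen's kappa with quadratic disagreement weights.

    `categories` lists the ordered levels, e.g. [1, 2, 3, 4, 5].
    """
    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    n = len(a)

    # Cross-tabulate the two raters' classifications.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_obs = 0.0
    weighted_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # 0 on the diagonal, 1 at the extremes
            weighted_obs += w * observed[i][j]
            weighted_exp += w * row_tot[i] * col_tot[j] / n
    return 1 - weighted_obs / weighted_exp


# Invented ratings on a three-level scale: one adjacent-level disagreement.
kw = quadratic_weighted_kappa([1, 2, 3], [1, 2, 2], categories=[1, 2, 3])
```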
Dental examiners consistency in applying the ICDAS criteria for a caries prevention community trial.

PubMed

Nelson, S; Eggertsson, H; Powell, B; Mandelaris, J; Ntragatakis, M; Richardson, T; Ferretti, G

2011-09-01

To examine dental examiners' one-year consistency in utilizing the International Caries Detection and Assessment System (ICDAS) criteria after baseline training and calibration. A total of three examiners received baseline training/calibration by a "gold standard" examiner, and one year later re-calibration was conducted. For the baseline training/calibration, subjects aged 8-16 years, and for the re-calibration subjects aged five to six years were recruited for the study. The ICDAS criteria were used to classify visual caries lesion severity (0-6 scale), lesion activity (active/inactive), and presence of filling material (0-9 scale) of all available tooth surfaces of permanent and primary teeth. The examination used a clinical light, mirror and air syringe. Kappa (weighted: Wkappa, unweighted: Kappa) statistics were used to determine inter- and intra-examiner reliability at baseline and re-calibration. For lesion severity and filling criteria, the baseline calibration on 35 subjects indicated an inter-rater Wkappa ranging from 0.69-0.92 and an intra-rater Wkappa ranging from 0.81-0.92. Re-calibration on 22 subjects indicated an inter-rater Wkappa of 0.77-0.98 and an intra-rater Wkappa of 0.93-1.00. The Wkappa for filling was consistently in the excellent range, while lesion severity was in the good to excellent range. Activity kappa was in the poor to good range. All examiners improved with time. The baseline training/calibration in ICDAS was crucial to maintain the stability of the examiners' reliability over a one-year period. The ICDAS can be an effective assessment tool for community-based clinical trials.
Mammography image quality and evidence based practice: Analysis of the demonstration of the inframammary angle in the digital setting.

PubMed

Spuur, Kelly; Webb, Jodi; Poulos, Ann; Nielsen, Sharon; Robinson, Wayne

2018-03-01

The aim of this study is to determine the clinical rates of the demonstration of the inframammary angle (IMA) on the mediolateral oblique (MLO) view of the breast on digital mammograms and to compare the outcomes with current accreditation standards for compliance. Relationships between the IMA, age, the posterior nipple line (PNL) and compressed breast thickness will be identified and the study outcomes validated using appropriate analyses of inter-reader and inter-rater reliability and variability. Differences in left versus right data were also investigated. A quantitative retrospective study of 2270 randomly selected paired digital mammograms performed by BreastScreen NSW was undertaken. Data was collected by direct measurement and visual analysis. Intra-class correlation analyses were used to evaluate inter- and intra-rater reliability. The IMA was demonstrated on 52.4% of individual and 42.6% of paired mammograms. A linear relationship was found between the posterior nipple line (PNL) and age (p-value <0.001). The PNL was predicted to increase by 0.48 mm for every one-year increment in age. The odds of demonstrating the IMA decreased by 2% for every one-year increase in age (p-value = 0.001), increased by 0.4% for every 1 mm increase in PNL (p-value = 0.001) and decreased by 1.6% for every 1 mm increase in compressed breast thickness (p-value <0.001). There was high inter- and intra-rater reliability for the PNL while there was 100% agreement for the demonstration of the IMA. Analysis of the demonstration of the IMA indicates clinically achievable rates (42.6%) well below that required for compliance (50%-75%) with known worldwide accreditation standards for screening mammography. These standards should be aligned to the reported evidence base. Visualisation of the IMA is impacted negatively by increasing age and compressed breast thickness but positively by breast size (PNL). Copyright © 2018 Elsevier B.V. All rights reserved.
Analyzing Movements Development and Evaluation of the Body Awareness Scale Movement Quality (BAS MQ).

PubMed

Sundén, A; Ekdahl, C; Horstman, V; Gyllensten, A L

2016-06-01

Limitations in everyday movements, physical activities and/or pain are the main reasons for seeking help from a physiotherapist. The purpose of this study was to investigate the psychometric properties of the Body Awareness Scale Movement Quality (BAS MQ) focusing on factor structure, validity and reliability and to explore whether BAS MQ could discriminate between healthy individuals and patients. BAS MQ assesses both limitations and resources concerning functional ability and quality of movements. The total sample in the study (n = 172) consisted of individuals with hip osteoarthritis (OA) (n = 132), individuals with psychiatric disorders (n = 33) and healthy individuals (n = 7). A factor analysis of the BAS MQ was performed for the total group. Inter-rater reliability was tested in a group of individuals with hip OA (n = 24). Concurrent validity was tested in a group of individuals with hip OA (n = 89). The Medical Outcomes Study 36-Item Short-Form Health Survey (SF-36), the 6-Minute Walk Test (6MWT) and the Hip Osteoarthritis Outcome Score (HOOS) were chosen in the validation process. The factor analysis revealed three factors that together explained 60.8% of the total variance of BAS MQ. The inter-rater reliability was considered good or very good with a kappa value of 0.61. Significant correlations between BAS MQ and SF-36, HOOS and 6MWT in the subjects with hip OA confirmed the validity. The BAS MQ was able to discriminate between healthy individuals and individuals with physical and psychiatric limitations. Results of the study revealed that BAS MQ has a satisfactory factor structure. The inter-rater reliability and validity were acceptable in a group of individuals with hip OA. BAS MQ could be a useful assessment tool for physiotherapists when evaluating the quality of everyday movements in different patient groups. Copyright © 2014 John Wiley & Sons, Ltd.
[Reliability and Validity of the Behavioral Check List for Preschool Children to Measure Attention Deficit Hyperactivity Behaviors].

PubMed

Tsuno, Kanami; Yoshimasu, Kouichi; Hayashi, Takashi; Tatsuta, Nozomi; Ito, Yuki; Kamijima, Michihiro; Nakai, Kunihiko

2018-01-01

Nowadays, attention deficit hyperactivity (ADH) problems are observed commonly among school-age children. However, questionnaires specific to ADH behaviors among preschool children are very few. The aim of this study was to investigate the reliability and validity of the 25-item Behavioral Check List (BCL), which was developed from interviews of parents with children who were diagnosed as having attention-deficit/hyperactivity disorder (ADHD) and measures ADH behaviors in preschool age. We recruited 22 teachers from 10 nurseries/kindergartens in Miyagi Prefecture, Japan. A total of 138 preschool children were assessed using the BCL. To investigate inter-rater reliability, two teachers from each facility assessed seven to twenty children in their class, and intraclass correlation coefficients (ICCs) were calculated. The teachers additionally answered questions in the 1.5-5 Caregiver-Teacher Report Form (C-TRF) to investigate the criterion validity of the BCL. To investigate structural validity, exploratory factor analysis with promax rotation and confirmatory factor analysis were performed. The internal consistency reliability of the BCL was good (α = 0.92) and correlation analyses also confirmed its excellent criterion validity. Although exploratory factor analysis for the BCL yielded a five-factor model that consisted of a factor structure different from that of the original one, the results were similar to the original six factors. The ICCs of the BCL were 0.38-0.99, which was not high enough for inter-rater reliability in some facilities. However, there is a possibility to improve it by giving raters adequate explanations when using BCL. The present study showed acceptable levels of reliability and validity of the BCL among Japanese preschool children.
Brown Adipose Tissue Quantification in Human Neonates Using Water-Fat Separated MRI

PubMed Central

Rasmussen, Jerod M.; Entringer, Sonja; Nguyen, Annie; van Erp, Theo G. M.; Guijarro, Ana; Oveisi, Fariba; Swanson, James M.; Piomelli, Daniele; Wadhwa, Pathik D.

2013-01-01

There is a major resurgence of interest in brown adipose tissue (BAT) biology, particularly regarding its determinants and consequences in newborns and infants. Reliable methods for non-invasive BAT measurement in human infants have yet to be demonstrated. The current study first validates methods for quantitative BAT imaging of rodents post mortem followed by BAT excision and re-imaging of excised tissues. Identical methods are then employed in a cohort of in vivo infants to establish the reliability of these measures and provide normative statistics for BAT depot volume and fat fraction. Using multi-echo water-fat MRI, fat- and water-based images of rodents and neonates were acquired and ratios of fat to the combined signal from fat and water (fat signal fraction) were calculated. Neonatal scans (n = 22) were acquired during natural sleep to quantify BAT and WAT deposits for depot volume and fat fraction. Acquisition repeatability was assessed based on multiple scans from the same neonate. Intra- and inter-rater measures of reliability in regional BAT depot volume and fat fraction quantification were determined based on multiple segmentations by two raters. Rodent BAT was characterized as having significantly higher water content than WAT in both in situ as well as ex vivo imaging assessments. Human neonate deposits indicative of bilateral BAT in spinal, supraclavicular and axillary regions were observed. Pairwise, WAT fat fraction was significantly greater than BAT fat fraction throughout the sample (Δ_WAT-BAT = 38%, p < 10^-4). Repeated scans demonstrated a high voxelwise correlation for fat fraction (R_all = 0.99). BAT depot volume and fat fraction measurements showed high intra-rater (ICC_BAT,VOL = 0.93, ICC_BAT,FF = 0.93) and inter-rater reliability (ICC_BAT,VOL = 0.86, ICC_BAT,FF = 0.93).
This study demonstrates the reliability of using multi-echo water-fat MRI in human neonates for quantification throughout the torso of BAT depot volume and fat fraction measurements. PMID:24205024
Translation, cross-cultural adaptation to Brazilian-Portuguese and reliability analysis of the instrument Rapid Entire Body Assessment-REBA

PubMed Central

Lamarão, Andressa M.; Costa, Lucíola C. M.; Comper, Maria L. C.; Padula, Rosimeire S.

2014-01-01

Background: Observational instruments, such as the Rapid Entire Body Assessment, quickly assess biomechanical risks present in the workplace. However, in order to use these instruments, it is necessary to conduct the translational/cross-cultural adaptation of the instrument and test its measurement properties. Objectives: To perform the translation and the cross-cultural adaptation to Brazilian-Portuguese and test the reliability of the REBA instrument. Method: The procedures of translation and cross-cultural adaptation to Brazilian-Portuguese were conducted following proposed guidelines that involved translation, synthesis of translations, back translation, committee review and testing of the pre-final version. In addition, reliability and the intra- and inter-rater percent agreement were obtained with the Linear Weighted Kappa Coefficient that was associated with the 95% Confidence Interval and the cross tabulation 2×2. Results: The procedures for translation and adaptation were adequate and the necessary adjustments were conducted on the instrument. The intra- and inter-rater reliability showed values of 0.104 to 0.504, respectively, ranging from very poor to moderate. The percentage agreement values ranged from 5.66% to 69.81%. The percentage agreement was closer to 100% at the item 'upper arm' (69.81%) for the Intra-rater 1 and at the items 'legs' and 'upper arm' for the Intra-rater 2 (62.26%). Conclusions: The processes of translation and cross-cultural adaptation were conducted on the REBA instrument and the Brazilian version of the instrument was obtained. However, despite the reliability of the tests used to correct the translated and adapted version, the reliability values are unacceptable according to the guidelines standard, indicating that the reliability must be re-evaluated. Therefore, caution in the interpretation of the biomechanical risks measured by this instrument should be taken. PMID:25003273
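For readers unfamiliar with the linear weighted kappa named in the record above, the following is a minimal sketch of how such a coefficient is commonly computed for two raters on an ordinal scale. The function name, the three-category scale and the ratings are hypothetical illustrations and are not taken from the REBA study; the record does not describe the authors' software or exact procedure.

```python
from collections import Counter


def linear_weighted_kappa(ratings_a, ratings_b, categories):
    """Cohen's kappa with linear agreement weights for two raters.

    `categories` lists the ordinal levels in order (at least two levels).
    The result is undefined when chance agreement equals 1.
    """
    k = len(categories)
    n = len(ratings_a)
    index = {c: i for i, c in enumerate(categories)}

    def weight(i, j):
        # 1 for exact agreement, falling linearly to 0 at maximum distance.
        return 1 - abs(i - j) / (k - 1)

    # Weighted observed agreement across all rated subjects.
    observed = sum(
        weight(index[a], index[b]) for a, b in zip(ratings_a, ratings_b)
    ) / n

    # Weighted agreement expected by chance from each rater's marginals.
    marg_a, marg_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(
        weight(index[ca], index[cb]) * (marg_a[ca] / n) * (marg_b[cb] / n)
        for ca in categories
        for cb in categories
    )
    return (observed - expected) / (1 - expected)


# Hypothetical ratings on a 3-level ordinal scale (not study data).
rater_1 = [1, 2, 3, 1, 2, 3]
rater_2 = [1, 2, 3, 2, 3, 3]
kappa_w = linear_weighted_kappa(rater_1, rater_2, categories=[1, 2, 3])
print(round(kappa_w, 3))  # 0.625
```

Linear weights give partial credit for near misses, which is why a weighted kappa is usually preferred over the unweighted form for ordinal risk scores.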
Treatment fidelity instrument to measure a brief opportunistic intervention for prenatal substance use.

PubMed

Torrey, Antonia Rae

2012-01-01

To develop and psychometrically evaluate an instrument designed to measure the treatment fidelity associated with implementation of the I Am Concerned (IAC) brief opportunistic intervention by frontline, prenatal, primary care staff. A methodologic approach framed development of the IAC Treatment Fidelity Instrument in a six-phase protocol. A simulated prenatal clinic with standardized patients portraying substance-using pregnant women. Prenatal, primary care, frontline staff (N = 6), experienced in IAC implementation. Following development of the IAC treatment fidelity instrument, independent raters used the instrument to evaluate audio recordings (N = 49) of frontline staff implementing the IAC brief opportunistic intervention with standardized patients representing substance-using pregnant women. Psychometric analysis provided evidence of content validity. Intraclass correlation coefficients calculated for inter-rater reliability were satisfactory for subscales (0.64) and (0.62) and ranged from -0.07 to 0.81 for individual items. Internal consistency alpha coefficients were satisfactory for the total scale (0.72) and lower than acceptable for adherence (0.54) and competence (0.56) subscales. Overall high rater percentage agreement and negatively skewed ratings distribution indicated reliability results were paradoxically low due to the base rate problem. Results support revision and ongoing testing of the IAC treatment fidelity instrument. The impact on reliability statistics exerted by this study's skewed data distribution has implications for nursing research as low variance can be anticipated when measuring care provided to homogenous patient populations. It is important to recognize the resulting influence on inter-rater agreement to avoid making inaccurate interpretations about the reliability of an instrument's measurements. © 2012 AWHONN, the Association of Women's Health, Obstetric and Neonatal Nurses.
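The base rate problem mentioned in the record above can be illustrated with a small sketch. It uses Cohen's kappa on a 2×2 table rather than the intraclass correlation the study reports, because kappa makes the effect easy to see; the counts are hypothetical and the function name is an illustration only.

```python
def cohen_kappa_2x2(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Return (percent agreement, Cohen's kappa) for two raters, two categories."""
    n = both_yes + a_yes_b_no + a_no_b_yes + both_no
    observed = (both_yes + both_no) / n
    # Marginal proportion of "yes" ratings for each rater.
    a_yes = (both_yes + a_yes_b_no) / n
    b_yes = (both_yes + a_no_b_yes) / n
    # Agreement expected by chance given those marginals.
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return observed, (observed - expected) / (1 - expected)


# Hypothetical, heavily skewed ratings: almost every item is rated "yes".
agreement, kappa = cohen_kappa_2x2(both_yes=94, a_yes_b_no=3, a_no_b_yes=3, both_no=0)
print(f"agreement={agreement:.2f}, kappa={kappa:.3f}")
```

With these counts the raters agree on 94% of items, yet kappa is slightly negative, because nearly all of that agreement is already expected by chance when one category dominates.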

Validity and Inter-Rater Reliability of a Novel Bedside Referral Tool for Spasticity

ClinicalTrials.gov

2018-02-20

Spasticity, Muscle; Muscular Diseases; Musculoskeletal Disease; Muscle Hypertonia; Muscle Spasticity; Neuromuscular Manifestations; Signs and Symptoms; Nervous System Diseases; Neurologic Manifestations
A new MRI rating scale for progressive supranuclear palsy and multiple system atrophy: validity and reliability

PubMed Central

Rolland, Yan; Vérin, Marc; Payan, Christine A; Duchesne, Simon; Kraft, Eduard; Hauser, Till K; Jarosz, Josef; Deasy, Neil; Defevbre, Luc; Delmaire, Christine; Dormont, Didier; Ludolph, Albert C; Bensimon, Gilbert

2011-01-01

Aim To evaluate a standardised MRI acquisition protocol and a new image rating scale for disease severity in patients with progressive supranuclear palsy (PSP) and multiple system atrophy (MSA) in a large multicentre study. Methods The MRI protocol consisted of two-dimensional sagittal and axial T1, axial PD, and axial and coronal T2 weighted acquisitions. The 32 item ordinal scale evaluated abnormalities within the basal ganglia and posterior fossa, blind to diagnosis. Among 760 patients in the study population (PSP=362, MSA=398), 627 had per protocol images (PSP=297, MSA=330). Intra-rater (n=60) and inter-rater (n=555) reliability were assessed through Cohen's kappa statistic, and scale structure through principal component analysis (PCA) (n=441). Internal consistency and reliability were checked. Discriminant and predictive validity of extracted factors and total scores were tested for disease severity as per clinical diagnosis. Results Intra-rater and inter-rater reliability were acceptable for 25 (78%) of the items scored (kappa ≥0.41). PCA revealed four meaningful clusters of covarying parameters (factor (F) F1: brainstem and cerebellum; F2: midbrain; F3: putamen; F4: other basal ganglia) with good to excellent internal consistency (Cronbach α 0.75–0.93) and moderate to excellent reliability (intraclass coefficient: F1: 0.92; F2: 0.79; F3: 0.71; F4: 0.49). The total score significantly discriminated for disease severity or diagnosis; factorial scores differentially discriminated for disease severity according to diagnosis (PSP: F1–F2; MSA: F2–F3). The total score was significantly related to survival in PSP (p<0.0007) or MSA (p<0.0005), indicating good predictive validity. Conclusions The scale is suitable for use in the context of multicentre studies and can reliably and consistently measure MRI abnormalities in PSP and MSA.
Clinical Trial Registration Number The study protocol was filed in the open clinical trial registry (http://www.clinicaltrials.gov) with ID No NCT00211224. PMID:21386111
The reliability and validity of the Complex Task Performance Assessment: A performance-based assessment of executive function.

PubMed

Wolf, Timothy J; Dahl, Abigail; Auen, Colleen; Doherty, Meghan

2017-07-01

The objective of this study was to evaluate the inter-rater reliability, test-retest reliability, concurrent validity, and discriminant validity of the Complex Task Performance Assessment (CTPA): an ecologically valid performance-based assessment of executive function. Community control participants (n = 20) and individuals with mild stroke (n = 14) participated in this study. All participants completed the CTPA and a battery of cognitive assessments at initial testing. The control participants completed the CTPA at two different times one week apart. The intra-class correlation coefficient (ICC) for inter-rater reliability for the total score on the CTPA was .991. The ICCs for all of the sub-scores of the CTPA were also high (.889-.977). The CTPA total score was significantly correlated to Condition 4 of the DKEFS Color-Word Interference Test (ρ = -.425), and the Wechsler Test of Adult Reading (ρ = -.493). Finally, there were significant differences between control subjects and individuals with mild stroke on the total score of the CTPA (p = .007) and all sub-scores except interpretation failures and total items incorrect. These results are also consistent with other current executive function performance-based assessments and indicate that the CTPA is a reliable and valid performance-based measure of executive function.
Assessing the Quality of Mobile Exercise Apps Based on the American College of Sports Medicine Guidelines: A Reliable and Valid Scoring Instrument.

PubMed

Guo, Yi; Bian, Jiang; Leavitt, Trevor; Vincent, Heather K; Vander Zalm, Lindsey; Teurlings, Tyler L; Smith, Megan D; Modave, François

2017-03-07

Regular physical activity can not only help with weight management, but also lower cardiovascular risks, cancer rates, and chronic disease burden. Yet, only approximately 20% of Americans currently meet the physical activity guidelines recommended by the US Department of Health and Human Services. With the rapid development of mobile technologies, mobile apps have the potential to improve participation rates in exercise programs, particularly if they are evidence-based and are of sufficient content quality. The goal of this study was to develop and test an instrument, which was designed to score the content quality of exercise program apps with respect to the exercise guidelines set forth by the American College of Sports Medicine (ACSM). We conducted two focus groups (N=14) to elicit input for developing a preliminary 27-item scoring instrument based on the ACSM exercise prescription guidelines. Three reviewers who were not sports medicine experts independently scored 28 exercise program apps using the instrument. Inter- and intra-rater reliability was assessed among the 3 reviewers. An expert reviewer, a Fellow of the ACSM, also scored the 28 apps to create criterion scores. Criterion validity was assessed by comparing nonexpert reviewers' scores to the criterion scores. Overall, inter- and intra-rater reliability was high with most coefficients being greater than .7. Inter-rater reliability coefficients ranged from .59 to .99, and intra-rater reliability coefficients ranged from .47 to 1.00. All reliability coefficients were statistically significant. Criterion validity was found to be excellent, with the weighted kappa statistics ranging from .67 to .99, indicating a substantial agreement between the scores of expert and nonexpert reviewers. Finally, all apps scored poorly against the ACSM exercise prescription guidelines. None of the apps received a score greater than 35, out of a possible maximal score of 70.
We have developed and presented valid and reliable scoring instruments for exercise program apps. Our instrument may be useful for consumers and health care providers who are looking for apps that provide safe, progressive general exercise programs for health and fitness. ©Yi Guo, Jiang Bian, Trevor Leavitt, Heather K Vincent, Lindsey Vander Zalm, Tyler L Teurlings, Megan D Smith, François Modave. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 07.03.2017.
Reliable and valid assessment of Lichtenstein hernia repair skills.

PubMed

Carlsen, C G; Lindorff-Larsen, K; Funch-Jensen, P; Lund, L; Charles, P; Konge, L

2014-08-01

Lichtenstein hernia repair is a common surgical procedure and one of the first procedures performed by a surgical trainee. However, formal assessment tools developed for this procedure are few and sparsely validated. The aim of this study was to determine the reliability and validity of an assessment tool designed to measure surgical skills in Lichtenstein hernia repair. Key issues were identified through a focus group interview. On this basis, an assessment tool with eight items was designed. Ten surgeons and surgical trainees were video recorded while performing Lichtenstein hernia repair, (four experts, three intermediates, and three novices). The videos were blindly and individually assessed by three raters (surgical consultants) using the assessment tool. Based on these assessments, validity and reliability were explored. The internal consistency of the items was high (Cronbach's alpha = 0.97). The inter-rater reliability was very good with an intra-class correlation coefficient (ICC) = 0.93. Generalizability analysis showed a coefficient above 0.8 even with one rater. The coefficient improved to 0.92 if three raters were used. One-way analysis of variance found a significant difference between the three groups which indicates construct validity, p < 0.001. Lichtenstein hernia repair skills can be assessed blindly by a single rater in a reliable and valid fashion with the new procedure-specific assessment tool. We recommend this tool for future assessment of trainees performing Lichtenstein hernia repair to ensure that the objectives of competency-based surgical training are met.
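The gain from one rater to three raters reported in the record above is of the size the Spearman-Brown formula would predict. The sketch below assumes a simple persons-by-raters design, in which projecting a generalizability coefficient to more raters takes the Spearman-Brown form; the record does not state the study's actual design, and the single-rater value of 0.80 is an illustrative stand-in for the reported "above 0.8".

```python
def spearman_brown(single_rater_reliability, n_raters):
    """Projected reliability of the mean of `n_raters` ratings.

    Assumes raters are interchangeable and their errors are uncorrelated.
    """
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# Illustrative single-rater coefficient (assumed, see lead-in).
projected = spearman_brown(0.80, n_raters=3)
print(round(projected, 3))  # 0.923
```

A projection of about 0.92 for three raters is in line with the coefficient the abstract reports, and it also shows the diminishing return: most of the reliability is already present with a single rater.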
Peer-review for selection of oral presentations for conferences: Are we reliable?

PubMed

Deveugele, Myriam; Silverman, Jonathan

2017-11-01

Although peer-review for journal submission, grant-applications and conference submissions has been called 'a cornerstone of science', and even 'the gold standard for evaluating scientific merit', publications on this topic remain scarce. Research that has investigated peer-review reveals several issues and criticisms concerning bias, poor quality review, unreliability and inefficiency. The most important weakness of the peer review process is the inconsistency between reviewers leading to inadequate inter-rater reliability. To report the reliability of ratings for a large international conference and to suggest possible solutions to overcome the problem. In 2016 during the International Conference on Communication in Healthcare, organized by EACH: International Association for Communication in Healthcare, a calibration exercise was proposed and feedback was reported back to the participants of the exercise. Most abstracts, as well as most peer-reviewers, receive and give scores around the median. Contrary to the general assumption that there are high and low scorers, in this group only 3 peer-reviewers could be identified with a high mean, while 7 had a low mean score. Only 2 reviewers gave only high ratings (4 and 5). Of the eight abstracts included in this exercise, only one abstract received a high mean score and one a low mean score. Nevertheless, both these abstracts received both low and high scores; all other abstracts received all possible scores. Peer-review of submissions for conferences is, in accordance with the literature, unreliable. New and creative methods will be needed to give the participants of a conference what they really deserve: a more reliable selection of the best abstracts.
More raters per abstract improve the inter-rater reliability; training of reviewers could be helpful; providing feedback to reviewers can lead to less inter-rater disagreement; fostering negative peer-review (rejecting the inappropriate submissions) rather than positive peer-review (accepting the best) could be fruitful for selecting abstracts for conferences. Copyright © 2017 Elsevier B.V. All rights reserved.
Utility and Reliability of an App for the System for Observing Play and Recreation in Communities (iSOPARC®)

ERIC Educational Resources Information Center

Santos, Maria P. M.; Rech, Cassiano R.; Alberico, Claudia O.; Fermino, Rogério C.; Rios, Ana P.; David, João; Reis, Rodrigo S.; Sarmiento, Olga L.; McKenzie, Thomas L.; Mota, Jorge

2016-01-01

The app for the System for Observing Play and Recreation in Communities (iSOPARC®) was developed to enhance System for Observing Play and Recreation in Communities data collection and management. The study aim was to examine the usability and inter-rater reliability of iSOPARC®. Trained observers collected data in 16 park areas in two Latin…
Assessment of first-year veterinary students' communication skills using an objective structured clinical examination: the importance of context.

PubMed

Hecker, Kent G; Adams, Cindy L; Coe, Jason B

2012-01-01

Communication skills are considered to be a core clinical skill in veterinary medicine and essential for practice success, including outcomes of care for patients and clients. While veterinary schools include communication skills training in their programs, there is minimal knowledge on how best to assess communication competence throughout the undergraduate program. The purpose of this study was to further our understanding of the reliability, utility, and suitability of a communication skills Objective Structured Clinical Examination (OSCE). Specifically we wanted to (1) identify the greatest source of variability (student, rater, station, and track) within a first-year, four station OSCE using exam scores and scores from videotape review by two trained raters, and (2) determine the effect of different stations on students' communication skills performance. Reliability of the scores from both the exam data and the two expert raters was 0.50 and 0.46 respectively, with the greatest amount of variance attributable to student by station. The percentage of variance due to raters in the exam data was 16.35%, whereas the percentage of variance for the two expert raters was 0%. These results have three important implications. First, the results reinforce the need for communication educators to emphasize that use of communication skills is moderated by the context of the clinical interaction. Second, by increasing rater training the amount of error in the scores due to raters can be reduced and inter-rater reliability increases. Third, the communication assessment method (in this case the OSCE checklist) should be built purposefully, taking into consideration the context of the case.
Development and validation of a Malawian version of the primary care assessment tool.

PubMed

Dullie, Luckson; Meland, Eivind; Hetlevik, Øystein; Mildestvedt, Thomas; Gjesdal, Sturla

2018-05-16

Malawi does not have validated tools for assessing primary care performance from patients' experience. The aim of this study was to develop a Malawian version of Primary Care Assessment Tool (PCAT-Mw) and to evaluate its reliability and validity in the assessment of the core primary care dimensions from adult patients' perspective in Malawi. A team of experts assessed the South African version of the primary care assessment tool (ZA-PCAT) for face and content validity. The adapted questionnaire underwent forward and backward translation and a pilot study. The tool was then used in an interviewer administered cross-sectional survey in Neno district, Malawi, to test validity and reliability. Exploratory factor analysis was performed on a random half of the sample to evaluate internal consistency, reliability and construct validity of items and scales. The identified constructs were then tested with confirmatory factor analysis. Likert scale assumption testing and descriptive statistics were done on the final factor structure. The PCAT-Mw was further tested for intra-rater and inter-rater reliability. From the responses of 631 patients, a 29-item PCAT-Mw was constructed comprising seven multi-item scales, representing five primary care dimensions (first contact, continuity, comprehensiveness, coordination and community orientation). All the seven scales achieved good internal consistency, item-total correlations and construct validity. Cronbach's alpha coefficient ranged from 0.66 to 0.91. A satisfactory goodness of fit model was achieved (GFI = 0.90, CFI = 0.91, RMSEA = 0.05, PCLOSE = 0.65). The full range of possible scores was observed for all scales. Scaling assumptions tests were achieved for all except the two comprehensiveness scales. Intra-class correlation coefficient (ICC) was 0.90 (n = 44, 95% CI 0.81-0.94, p < 0.001) for intra-rater reliability and 0.84 (n = 42, 95% CI 0.71-0.96, p < 0.001) for inter-rater reliability. 
Comprehensive metric analyses supported the reliability and validity of PCAT-Mw in assessing the core concepts of primary care from adult patients' experience. This tool could be used for health service research in primary care in Malawi.
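Cronbach's alpha, reported in the record above as the measure of internal consistency, can be computed directly from item variances and the variance of the total score. The following is a minimal sketch with hypothetical Likert-type responses; the data and function name are illustrations and have no connection to the PCAT-Mw survey.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 4 respondents, 3 items (not study data).
responses = [
    [3, 4, 3],
    [2, 2, 3],
    [4, 4, 4],
    [1, 2, 1],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))  # 0.94
```

Alpha rises when items vary together across respondents, so a scale whose items track one another closely yields a value near 1.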
[Care quality: reliability and usefulness of observation data in bench marking nursing homes and homes for the aged in the Netherlands].

PubMed

Frijters, Dinnus; Gerritsen, Debby; Steverink, Nardi

2003-02-01

Before including quality of care indicators in the Benchmark of Nursing Homes and Homes for the Aged in the Netherlands, the reliability and usefulness of the patient data collection had to be established. The patient data items were derived from the Resident Assessment Instruments (RAI) and a questionnaire on social interaction in elderly people. Three nursing homes and three homes for the aged participated in the test with 550 patients. 279 x 2 assessments were collected by independent raters for an inter-rater reliability test; 259 x 2 by the same rater for a test-retest reliability test; and 24 by a single rater. The scores on paired assessment forms were compared with the weighted Kappa agreement test. The test results allowed 10 of the 13 quality indicators from RAI to be retained. In addition new quality indicators could be defined on 'giving attention' and 'unrespectful addressing'. We estimate on the basis of a questionnaire for the raters that on average 9 to 12 minutes per patient are needed to collect and enter data for the resulting 12 quality indicators.
Reliability and validity of the range of motion scale (ROMS) in patients with abnormal postures.

PubMed

van Rooijen, Diana E; Lalli, Stefania; Marinus, Johan; Maihöfner, Christian; McCabe, Candida S; Munts, Alex G; van der Plas, Anton A; Tijssen, Marina A J; van de Warrenburg, Bart P; Albanese, Alberto; van Hilten, Jacobus J

2015-03-01

Sustained abnormal postures (i.e., fixed dystonia) are the most frequently reported motor abnormalities in complex regional pain syndrome (CRPS), but these symptoms may also develop after peripheral trauma without CRPS. Currently, there is no valid and reliable measurement instrument available to measure the severity and distribution of these postures. The range of motion scale (ROMS) was therefore developed to assess the severity based on the possible active range of motion of all joints (arms, legs, trunk, and neck), and the present study evaluates its reliability and validity. Inter- and intra-rater reliability of the ROMS was determined in 16 patients with abnormal sustained postures, who were videotaped following a standard video protocol in a university hospital. The recordings were rated by a panel of international experts. In addition, 30 patients were clinically tested with both the Burke-Fahn-Marsden (BFM) scale as well as the ROMS to assess construct validity. Inter-rater reliability for total ROMS scores showed an intra-class correlation coefficient (ICC) of 0.85. The majority of the scores for the separate joints (13 out of 18) demonstrated an almost perfect agreement with ICCs ranging from 0.81 to 0.94; of the other items, one showed fair, one moderate, and three substantial agreement. The ICCs for the intra-rater reliability ranged from moderate to almost perfect (0.68-0.98). Spearman's correlation coefficients between corresponding body areas as measured with the ROMS or BFM were all above 0.82. The ROMS is a reliable and valid instrument to evaluate the severity and distribution of sustained abnormal postures. Wiley Periodicals, Inc.
Factor validity and reliability of the aberrant behavior checklist-community (ABC-C) in an Indian population with intellectual disability.

PubMed

Lehotkay, R; Saraswathi Devi, T; Raju, M V R; Bada, P K; Nuti, S; Kempf, N; Carminati, G Galli

2015-03-01

In this study realised in collaboration with the department of psychology and parapsychology of Andhra University, validation of the Aberrant Behavior Checklist-Community (ABC-C) in Telugu, the official language of Andhra Pradesh, one of India's 28 states, was carried out. To assess the factor validity and reliability of this Telugu version, 120 participants with moderate to profound intellectual disability (94 men and 26 women, mean age 25.2, SD 7.1) were rated by the staff of the Lebenshilfe Institution for Mentally Handicapped in Visakhapatnam, Andhra Pradesh, India. Rating data were analysed with a confirmatory factor analysis. The internal consistency was estimated by Cronbach's alpha. To confirm the test-retest reliability, 50 participants were rated twice with an interval of 4 weeks, and 50 were rated by pairs of raters to assess inter-rater reliability. Confirmatory factor analysis revealed that the root mean square error of approximation (RMSEA) was equal to 0.06, the comparative fit index (CFI) was equal to 0.77, and the Tucker Lewis index (TLI) was equal to 0.77, which indicated that the model with five correlated factors had a good fit. Coefficient alpha ranged from 0.85 to 0.92 across the five subscales. Spearman's rank correlation coefficients for inter-rater reliability tests ranged from 0.65 to 0.75, and the correlations for test-retest reliability ranged from 0.58 to 0.76. All reliability coefficients were statistically significant (P < 0.01). The Telugu version of the ABC-C evidenced factor validity and reliability comparable to the original English version and appears to be useful for assessing behaviour disorders in Indian people with intellectual disabilities. © 2014 MENCAP and International Association of the Scientific Study of Intellectual and Developmental Disabilities and John Wiley & Sons Ltd.
Reliability of diagnosis and clinical efficacy of visceral osteopathy: a systematic review.

PubMed

Guillaud, Albin; Darbois, Nelly; Monvoisin, Richard; Pinsault, Nicolas

2018-02-17

In 2010, the World Health Organization published benchmarks for training in osteopathy in which osteopathic visceral techniques are included. The purpose of this study was to identify and critically appraise the scientific literature concerning the reliability of diagnosis and the clinical efficacy of techniques used in visceral osteopathy. Databases MEDLINE, OSTMED.DR, the Cochrane Library, Osteopathic Research Web, Google Scholar, Journal of American Osteopathic Association (JAOA) website, International Journal of Osteopathic Medicine (IJOM) website, and the catalog of Académie d'ostéopathie de France website were searched through December 2017. Only inter-rater reliability studies including at least two raters or the intra-rater reliability studies including at least two assessments by the same rater were included. For efficacy studies, only randomized-controlled-trials (RCT) or crossover studies on unhealthy subjects (any condition, duration and outcome) were included. Risk of bias was determined using a modified version of the quality appraisal tool for studies of diagnostic reliability (QAREL) in reliability studies. For the efficacy studies, the Cochrane risk of bias tool was used to assess their methodological design. Two authors performed data extraction and analysis. Eight reliability studies and six efficacy studies were included. The analysis of reliability studies shows that the diagnostic techniques used in visceral osteopathy are unreliable. Regarding efficacy studies, the least biased study shows no significant difference for the main outcome. The main risks of bias found in the included studies were due to the absence of blinding of the examiners, an unsuitable statistical method or an absence of primary study outcome. The results of the systematic review lead us to conclude that well-conducted and sound evidence on the reliability and the efficacy of techniques in visceral osteopathy is absent. 
The review was registered with PROSPERO on 12 December 2016. The registration number is CRD4201605286.
The Impact of Rater Variability on Relationships among Different Effect-Size Indices for Inter-Rater Agreement between Human and Automated Essay Scoring

ERIC Educational Resources Information Center

Yun, Jiyeo

2017-01-01

Since researchers investigated automatic scoring systems in writing assessments, they have dealt with relationships between human and machine scoring, and then have suggested evaluation criteria for inter-rater agreement. The main purpose of my study is to investigate the magnitudes of and relationships among indices for inter-rater agreement used…
Inter-rater reliability of the German version of the Nurses' Global Assessment of Suicide Risk scale.

PubMed

Kozel, Bernd; Grieser, Manuela; Abderhalden, Christoph; Cutcliffe, John R

2016-10-01

In comparison to the general population, the suicide rates of psychiatric inpatient populations in Germany and Switzerland are very high. An important preventive contribution to the lowering of the suicide rates in mental health care is to ensure that the risk of suicide of psychiatric inpatients is assessed as accurately as possible. While risk-assessment instruments can serve an important function in determining such risk, very few have been translated into German. Therefore, in the present study, we report on the German version of the Nurses' Global Assessment of Suicide Risk (NGASR) scale. After translating the original instrument into German and pretesting the German version, we tested the inter-rater reliability of the instrument. Twelve video case studies were evaluated by 13 raters with the NGASR scale in a 'laboratory' trial. In each case, the observers' agreement was calculated for the single items, the overall scale, the risk levels, and the sum scores. The statistical data analysis was conducted with kappa and AC1 statistics for dichotomous (items, scale) scales. A high-to-very high observers' agreement (AC1: 0.62-1.00, kappa: 0.00-1.00) was determined for 16 items of the German version of the NGASR scale. We conclude that the German version of the NGASR scale is a reliable instrument for evaluating risk factors for suicide. A reliable application in clinical practice appears to be enhanced by training in the use of the instrument and the right implementation instructions. © 2016 Australian College of Mental Health Nurses Inc.
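The gap between the kappa range (0.00-1.00) and the AC1 range (0.62-1.00) reported above is typical when one category dominates the ratings. The following is a minimal sketch, not the study's analysis: it assumes only two raters (the study used 13) and dichotomous ratings coded 0/1, and the function name and the example data are hypothetical.

```python
def cohen_kappa_and_ac1(r1, r2):
    """Return (Cohen's kappa, Gwet's AC1) for two raters with 0/1 ratings."""
    n = len(r1)
    # Observed proportion of agreement.
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Proportion of positive ratings for each rater.
    p1 = sum(r1) / n
    p2 = sum(r2) / n
    # Chance agreement under Cohen's kappa (product of marginals).
    pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)
    # Chance agreement under Gwet's AC1 (two categories).
    pi = (p1 + p2) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    kappa = (po - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else float("nan")
    ac1 = (po - pe_ac1) / (1 - pe_ac1)
    return kappa, ac1


# Hypothetical data: 12 cases, each rater marks 11 of them as "risk present",
# and the two raters disagree on cases 11 and 12.
rater_a = [1] * 10 + [1, 0]
rater_b = [1] * 10 + [0, 1]
kappa, ac1 = cohen_kappa_and_ac1(rater_a, rater_b)
print(round(kappa, 2), round(ac1, 2))
```

With 10 of 12 agreements, kappa comes out slightly negative while AC1 is about 0.80, which illustrates why both statistics are often reported side by side for skewed dichotomous items.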
[Quality Assurance in Sociomedical Evaluation by Peer Review: A Pilot Project of the German Statutory Pension Insurance].

PubMed

Strahl, A; Gerlich, C; Wolf, H-D; Gehrke, J; Müller-Garnn, A; Vogel, H

2016-03-01

The sociomedical evaluation by the German Pension Insurance serves the purpose of determining entitlement to disability pensions. A quality assurance concept for the sociomedical evaluation was developed, which is based on a peer review process. Peer review is an established process of external quality assurance in health care. The review is based on a hierarchically constructed manual that was evaluated in this pilot project. The database consists of 260 medical reports for disability pension from 12 pension insurance agencies. 771 reviews from 19 peers were included in the evaluation of the inter-rater reliability. Kendall's coefficient of concordance W for more than 2 raters was used as the primary measure of inter-rater reliability. Reliability appeared to be heterogeneous. Kendall's W varied across the individual criteria from 0.09 to 0.88 and reached a value of 0.37 for the primary criterion reproducibility. The reliability of the manual seemed acceptable in the context of existing research data and is in line with existing peer review research outcomes. Nevertheless, the concordance is limited and requires optimisation. Starting points for improvement can be seen in systematic training and regular user meetings of the peers involved. © Georg Thieme Verlag KG Stuttgart · New York.
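For readers unfamiliar with Kendall's W, the sketch below shows the textbook form of the coefficient. It assumes a complete design in which every rater ranks every item and there are no tied ranks; the pilot project's actual design (different peers reviewing different reports) would need a more elaborate treatment. The function name and data layout are hypothetical.

```python
def kendalls_w(ranks):
    """Kendall's coefficient of concordance W (no tie correction).

    ranks[j][i] is the rank (1..n) that rater j assigns to item i.
    """
    m = len(ranks)        # number of raters
    n = len(ranks[0])     # number of items
    # Sum of ranks received by each item.
    totals = [sum(r[i] for r in ranks) for i in range(n)]
    mean_total = m * (n + 1) / 2
    # Sum of squared deviations of the rank totals from their mean.
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Three raters who rank four items identically give W = 1.
print(kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))
```

W ranges from 0 (no concordance among the raters) to 1 (identical rankings), so the reported span of 0.09 to 0.88 covers almost the whole scale.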
Validation of personal digital photography to assess dietary quality among people with intellectual disabilities.

PubMed

Elinder, L S; Brunosson, A; Bergström, H; Hagströmer, M; Patterson, E

2012-02-01

Dietary assessment is a challenge in general, and specifically in individuals with intellectual disabilities (ID). This study aimed to evaluate personal digital photography as a method of assessing different aspects of dietary quality in this target group. Eighteen adults with ID were recruited from community residences and activity centres in Stockholm County. Participants were instructed to photograph all foods and beverages consumed during 1 day, while observed. Photographs were coded by two raters. Observations and photographs of meal frequency, intake occasions of four specific food and beverage items, meal quality and dietary diversity were compared. Evaluation of inter-rater reliability and validity of the method was performed by intra-class correlation analysis. With reminders from staff, 85% of all observed eating or drinking occasions were photographed. The inter-rater reliability was excellent for all assessed variables (ICC ≥ 0.88), except for meal quality where ICC was 0.66. The correlations between items assessed in photos and observations were strong to almost perfect with ICC values ranging from 0.71 to 0.92 and all were statistically significant. Personal digital photography appears to be a feasible, reliable and valid method for assessing dietary quality in people with mild to moderate ID, who have daily staff support. © 2011 The Authors. Journal of Intellectual Disability Research © 2011 Blackwell Publishing Ltd.
Inter-agency communication and operations capabilities during a hospital functional exercise: reliability and validity of a measurement tool.

PubMed

Savoia, Elena; Biddinger, Paul D; Burstein, Jon; Stoto, Michael A

2010-01-01

As proxies for actual emergencies, drills and exercises can raise awareness, stimulate improvements in planning and training, and provide an opportunity to examine how different components of the public health system would combine to respond to a challenge. Despite these benefits, there remains a substantial need for widely accepted and prospectively validated tools to evaluate agencies' and hospitals' performance during such events. Unfortunately, to date, few studies have focused on addressing this need. The purpose of this study was to assess the validity and reliability of a qualitative performance assessment tool designed to measure hospitals' communication and operational capabilities during a functional exercise. The study population included 154 hospital personnel representing nine hospitals that participated in a functional exercise in Massachusetts in June 2008. A 25-item questionnaire was developed to assess the following three hospital functional capabilities: (1) inter-agency communication; (2) communication with the public; and (3) disaster operations. Analyses were conducted to examine internal consistency, associations among scales, the empirical structure of the items, and inter-rater agreement. Twenty-two questions were retained in the final instrument, which demonstrated reliability with alpha coefficients of 0.83 or higher for all scales. A three-factor solution from the principal components analysis accounted for 57% of the total variance, and the factor structure was consistent with the original hypothesized domains. Inter-rater agreement between participants' self reported scores and external evaluators' scores ranged from moderate to good. The resulting 22-item performance measurement tool reliably measured hospital capabilities in a functional exercise setting, with preliminary evidence of concurrent and criterion-related validity.
Reliability of the test of gross motor development second edition (TGMD-2) for Kindergarten children in Myanmar

PubMed Central

Aye, Thanda; Oo, Khin Saw; Khin, Myo Thuzar; Kuramoto-Ahuja, Tsugumi; Maruyama, Hitoshi

2017-01-01

[Purpose] The purpose of this study was to investigate the reliability of the test of gross motor development second edition (TGMD-2) for Kindergarten children in Myanmar. [Subjects and Methods] Fifty healthy Kindergarten children (23 males, 27 females) whose parents/guardians had given written consent participated. All 12 gross motor skills of the TGMD-2 were explained and demonstrated to the subjects before the assessment. Each subject individually performed two trials for each gross motor skill and the performance was video recorded. Three raters separately watched the video recordings and rated them for inter-rater reliability. The second assessment was done one month later with 25 out of 50 subjects for test-retest reliability. The video recordings of 12 subjects were randomly selected from the first 50 recordings for intra-rater reliability six weeks after the first assessment. The agreement on the locomotor and object control raw scores and the gross motor quotient (GMQ) was calculated. [Results] All the reliability coefficients for the locomotor and object control raw scores and the GMQ were interpreted as good to excellent reliability. [Conclusion] The results indicated that the TGMD-2 is a highly reliable and appropriate assessment tool for assessing gross motor skill development of Kindergarten children in Myanmar. PMID:29184278
Reliability of the test of gross motor development second edition (TGMD-2) for Kindergarten children in Myanmar.

PubMed

Aye, Thanda; Oo, Khin Saw; Khin, Myo Thuzar; Kuramoto-Ahuja, Tsugumi; Maruyama, Hitoshi

2017-10-01

[Purpose] The purpose of this study was to investigate the reliability of the test of gross motor development second edition (TGMD-2) for Kindergarten children in Myanmar. [Subjects and Methods] Fifty healthy Kindergarten children (23 males, 27 females) whose parents/guardians had given written consent participated. All 12 gross motor skills of the TGMD-2 were explained and demonstrated to the subjects before the assessment. Each subject individually performed two trials for each gross motor skill and the performance was video recorded. Three raters separately watched the video recordings and rated them for inter-rater reliability. The second assessment was done one month later with 25 out of 50 subjects for test-retest reliability. The video recordings of 12 subjects were randomly selected from the first 50 recordings for intra-rater reliability six weeks after the first assessment. The agreement on the locomotor and object control raw scores and the gross motor quotient (GMQ) was calculated. [Results] All the reliability coefficients for the locomotor and object control raw scores and the GMQ were interpreted as good to excellent reliability. [Conclusion] The results indicated that the TGMD-2 is a highly reliable and appropriate assessment tool for assessing gross motor skill development of Kindergarten children in Myanmar.

Fatigue after stroke: the development and evaluation of a case definition.

PubMed

Lynch, Joanna; Mead, Gillian; Greig, Carolyn; Young, Archie; Lewis, Susan; Sharpe, Michael

2007-11-01

While fatigue after stroke is a common problem, it has no generally accepted definition. Our aim was to develop a case definition for post-stroke fatigue and to test its psychometric properties. A case definition with face validity and an associated structured interview was constructed. After initial piloting, the feasibility, reliability (test-retest and inter-rater) and concurrent validity (in relation to four fatigue severity scales) were determined in 55 patients with stroke. All participating patients provided satisfactory answers to all the case definition probe questions, demonstrating its feasibility. For test-retest reliability, kappa was 0.78 (95% CI, 0.57-0.94, P<.01) and for inter-rater reliability kappa was 0.80 (95% CI, 0.62-0.99, P<.01). Patients fulfilling the case definition also had substantially higher fatigue scores on four fatigue severity scales (P<.001), indicating concurrent validity. The proposed case definition is feasible to administer and reliable in practice, and there is evidence of concurrent validity. It requires further evaluation in different settings.
Validity and reliability of a Malay version of the Lawton instrumental activities of daily living scale among the Malay speaking elderly in Malaysia.

PubMed

Kadar, Masne; Ibrahim, Suhaili; Razaob, Nor Afifi; Chai, Siaw Chui; Harun, Dzalani

2018-02-01

The Lawton Instrumental Activities of Daily Living Scale is a tool often used to assess independence among elderly at home. Its suitability to be used with the elderly population in Malaysia has not been validated. This current study aimed to assess the validity and reliability of the Lawton Instrumental Activities of Daily Living Scale - Malay Version to Malay speaking elderly in Malaysia. This study was divided into three phases: (1) translation and linguistic validity involving both forward and backward translations; (2) establishment of face validity and content validity; and (3) establishment of reliability involving inter-rater, test-retest and internal consistency analyses. Data used for these analyses were obtained by interviewing 65 elderly respondents. Percentages of Content Validity Index for 4 criteria were from 88.89 to 100.0. The Cronbach α coefficient for internal consistency was 0.838. Intra-class Correlation Coefficient of inter-rater reliability and test-retest reliability was 0.957 and 0.950 respectively. The result shows that the Lawton Instrumental Activities of Daily Living Scale - Malay Version has excellent reliability and validity for use with the Malay speaking elderly people in Malaysia. This scale could be used by professionals to assess functional ability of elderly who live independently in community. © 2018 Occupational Therapy Australia.
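The internal-consistency figure quoted above is a Cronbach α coefficient. As a rough illustration of how such a coefficient is obtained from item-level data, here is a minimal sketch; the function name, the data layout, and the example scores are hypothetical and are not taken from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha; scores[p][i] is respondent p's score on item i."""
    k = len(scores[0])  # number of items
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the respondents' total scores.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: 5 respondents, 4 items.
example = [
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 1, 1],
    [3, 3, 4, 4],
    [5, 4, 5, 5],
]
print(round(cronbach_alpha(example), 3))
```

Alpha approaches 1 when items rise and fall together across respondents, which is why it is read as a measure of internal consistency rather than of agreement between raters.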
Standards Performance Continuum: Development and Validation of a Measure of Effective Pedagogy.

ERIC Educational Resources Information Center

Doherty, R. William; Hilberg, R. Soleste; Epaloose, Georgia; Tharp, Roland G.

2002-01-01

Describes the development and validation of the Standards Performance Continuum (SPC) for assessing teacher performance of the Standards for Effective Pedagogy. Three studies involving Florida, California, and New Mexico public school teachers provided evidence of inter-rater reliability, concurrent validity, and criterion-related validity…
The reliability, accuracy and minimal detectable difference of a multi-segment kinematic model of the foot-shoe complex.

PubMed

Bishop, Chris; Paul, Gunther; Thewlis, Dominic

2013-04-01

Kinematic models are commonly used to quantify foot and ankle kinematics, yet no marker sets or models have been proven reliable or accurate when wearing shoes. Further, the minimal detectable difference of a developed model is often not reported. We present a kinematic model that is reliable, accurate and sensitive to describe the kinematics of the foot-shoe complex and lower leg during walking gait. In order to achieve this, a new marker set was established, consisting of 25 markers applied on the shoe and skin surface, which informed a four segment kinematic model of the foot-shoe complex and lower leg. Three independent experiments were conducted to determine the reliability, accuracy and minimal detectable difference of the marker set and model. Inter-rater reliability of marker placement on the shoe was proven to be good to excellent (ICC=0.75-0.98) indicating that markers could be applied reliably between raters. Intra-rater reliability was better for the experienced rater (ICC=0.68-0.99) than the inexperienced rater (ICC=0.38-0.97). The accuracy of marker placement along each axis was <6.7 mm for all markers studied. Minimal detectable difference (MDD90) thresholds were defined for each joint; tibiocalcaneal joint--MDD90=2.17-9.36°, tarsometatarsal joint--MDD90=1.03-9.29° and the metatarsophalangeal joint--MDD90=1.75-9.12°. These thresholds proposed are specific for the description of shod motion, and can be used in future research designed at comparing between different footwear. Copyright © 2012 Elsevier B.V. All rights reserved.
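The abstract does not state how the MDD90 thresholds were derived. One widely used convention, assumed here purely for illustration, computes the standard error of measurement from the between-subject standard deviation and the ICC, then scales it to a 90% confidence level. The function name and the input values below are hypothetical.

```python
import math


def mdd90(sd, icc):
    """Minimal detectable difference at 90% confidence.

    Assumes SEM = SD * sqrt(1 - ICC) and MDD90 = 1.645 * sqrt(2) * SEM,
    a common convention that the source abstract does not confirm.
    """
    sem = sd * math.sqrt(1 - icc)
    return 1.645 * math.sqrt(2) * sem


# Hypothetical joint angle with SD = 2.0 degrees and ICC = 0.75.
print(round(mdd90(2.0, 0.75), 2))
```

Under this convention, a lower ICC or a larger spread between subjects both raise the threshold a change must exceed before it can be distinguished from measurement error.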
The German Version of the Manchester Triage System and Its Quality Criteria – First Assessment of Validity and Reliability

PubMed Central

Gräff, Ingo; Goldschmidt, Bernd; Glien, Procula; Bogdanow, Manuela; Fimmers, Rolf; Hoeft, Andreas; Kim, Se-Chan; Grigutsch, Daniel

2014-01-01

Background: The German Version of the Manchester Triage System (MTS) has found widespread use in EDs across German-speaking Europe. Studies about the quality criteria validity and reliability of the MTS currently only exist for the English-language version. Most importantly, the content of the German version differs from the English version with respect to presentation diagrams and change indicators, which have a significant impact on the category assigned. This investigation offers a preliminary assessment in terms of validity and inter-rater reliability of the German MTS. Methods: Construct validity of assigned MTS level was assessed based on comparisons to hospitalization (general / intensive care), mortality, ED and hospital length of stay, level of prehospital care and number of invasive diagnostics. A sample of 45,469 patients was used. Inter-rater agreement between an expert and triage nurses (reliability) was calculated separately for a subset group of 167 emergency patients. Results: For general hospital admission the area under the curve (AUC) of the receiver operating characteristic was 0.749; for admission to ICU it was 0.871. An examination of MTS-level and number of deceased patients showed that the higher the priority derived from MTS, the higher the number of deaths (p<0.0001 / χ² test). There was a substantial difference in the 30-day survival among the 5 MTS categories (p<0.0001 / log-rank test). The AUC for predicting 30-day mortality was 0.613. Categories orange and red had the highest numbers of heart catheter and endoscopy procedures. Patients in categories red and orange were mostly accompanied by an emergency physician, whereas patients in categories blue and green were walk-in patients. Inter-rater agreement between the expert and triage nurses was almost perfect (κ = 0.954). Conclusion: The German version of the MTS is a reliable and valid instrument for a first assessment of emergency patients in the emergency department. PMID:24586477
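The AUC values above summarise how well the ordinal triage level separates patients with and without an outcome such as hospital admission. A minimal sketch of that idea follows, using the rank-based (Mann-Whitney) definition of the AUC. The numeric urgency coding (1 = blue up to 5 = red), the function name, and the example data are assumptions for illustration and do not come from the study.

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a patient with the outcome has a
    higher score than a patient without it; ties count as one half."""
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical urgency levels (1 = blue ... 5 = red).
admitted = [5, 4, 4, 3]
discharged = [3, 2, 2, 1]
print(auc(admitted, discharged))
```

An AUC of 0.5 means the triage level carries no information about the outcome, while 1.0 means every patient with the outcome was triaged as more urgent than every patient without it.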
Inter-operator and inter-device agreement and reliability of the SEM Scanner.

PubMed

Clendenin, Marta; Jaradeh, Kindah; Shamirian, Anasheh; Rhodes, Shannon L

2015-02-01

The SEM Scanner is a medical device designed for use by healthcare providers as part of pressure ulcer prevention programs. The objective of this study was to evaluate the inter-rater and inter-device agreement and reliability of the SEM Scanner. Thirty-one (31) volunteers free of pressure ulcers or broken skin at the sternum, sacrum, and heels were assessed with the SEM Scanner. Each of three operators utilized each of three devices to collect readings from four anatomical sites (sternum, sacrum, left and right heels) on each subject for a total of 108 readings per subject collected over approximately 30 min. For each combination of operator-device-anatomical site, three SEM readings were collected. Inter-operator and inter-device agreement and reliability were estimated. Over the course of this study, more than 3000 SEM Scanner readings were collected. Agreement between operators was good with mean differences ranging from -0.01 to 0.11. Inter-operator and inter-device reliability exceeded 0.80 at all anatomical sites assessed. The results of this study demonstrate the high reliability and good agreement of the SEM Scanner across different operators and different devices. Given the limitations of current methods to prevent and detect pressure ulcers, the SEM Scanner shows promise as an objective, reliable tool for assessing the presence or absence of pressure-induced tissue damage such as pressure ulcers. Copyright © 2015 Bruin Biometrics, LLC. Published by Elsevier Ltd. All rights reserved.
Brief Assessment of Motor Function: Content Validity and Reliability of the Upper Extremity Gross Motor Scale

PubMed Central

Cintas, Holly Lea; Parks, Rebecca; Don, Sarah; Gerber, Lynn

2011-01-01

Content validity and reliability of the Brief Assessment of Motor Function (BAMF) Upper Extremity Gross Motor Scale (UEGMS) were evaluated in this prospective, descriptive study. The UEGMS is one of five ordinal scales designed for quick documentation of gross, fine and oral motor skill levels. Designed to be independent of age and diagnosis, it is intended for use for infants through young adults. An expert panel of 17 physical therapists and 13 occupational therapists refined the content by responding to a standard questionnaire comprised of questions which asked whether each item should be included, is clearly worded, should be reordered higher or lower, is functionally relevant, and is easily discriminated. Ratings of content validity exceeded the criterion except for two items which may represent different perspectives of physical and occupational therapists. The UEGMS was modified using the quantitative and qualitative feedback from the questionnaires. For reliability, five raters scored videotaped motor performances of ten children. Coefficients for inter-rater (0.94) and intra-rater (0.95) reliability were high. The results provide evidence of content validity and reliability of the UEGMS for assessment of upper extremity gross motor skill. PMID:21599568
Assessing movement quality in persons with severe mental illness - Reliability and validity of the Body Awareness Scale Movement Quality and Experience.

PubMed

Hedlund, Lena; Gyllensten, Amanda Lundvik; Waldegren, Tomas; Hansson, Lars

2016-05-01

Motor disturbances and disturbed self-recognition are common features that affect mobility in persons with schizophrenia spectrum disorder and bipolar disorder. Physiotherapists in Scandinavia assess and treat movement difficulties in persons with severe mental illness. The Body Awareness Scale Movement Quality and Experience (BAS MQ-E) is a new and shortened version of the commonly used Body Awareness Scale-Health (BAS-H). The purpose of this study was to investigate the inter-rater reliability and the concurrent validity of BAS MQ-E in persons with severe mental illness. The concurrent validity was examined by investigating the relationships between neurological soft signs, alexithymia, fatigue, anxiety, and mastery. Sixty-two persons with severe mental illness participated in the study. The results showed a satisfactory inter-rater reliability (n = 53) and a concurrent validity (n = 62) with neurological soft signs, especially cognitive and perceptual based signs. There was also a concurrent validity linked to physical fatigue and aspects of alexithymia. The scores of BAS MQ-E were in general higher for persons with schizophrenia compared to persons with other diagnoses within the schizophrenia spectrum disorders and bipolar disorder. The clinical implications are presented in the discussion.
Reliability of Hypernasality Rating: Comparison of 3 Different Methods for Perceptual Assessment.

PubMed

Yamashita, Renata Paciello; Borg, Elisabet; Granqvist, Svante; Lohmander, Anette

2018-01-01

To compare reliability in auditory-perceptual assessment of hypernasality for 3 different methods and to explore the influence of language background. Comparative methodological study. Participants and Materials: Audio recordings of 5-year-old Swedish-speaking children with repaired cleft lip and palate consisting of 73 stimuli of 9 nonnasal single-word strings in 3 different randomized orders. Four experienced speech-language pathologists (2 native speakers of Brazilian-Portuguese and 2 native speakers of Swedish) participated as listeners. After individual training, each listener performed the hypernasality rating task. Each order of stimuli was analyzed individually using the 2-step, VISOR and Borg centiMax scale methods. Comparison of intra- and inter-rater reliability, and consistency for each method within language of the listener and between listener languages (Swedish and Brazilian-Portuguese). Good to excellent intra-rater reliability was found within each listener for all methods, 2-step: κ = 0.59-0.93; VISOR: intraclass correlation coefficient (ICC) = 0.80-0.99; Borg centiMax (cM) scale: ICC = 0.80-1.00. The highest inter-rater reliability was demonstrated for VISOR (ICC = 0.60-0.90) and Borg cM-scale (ICC = 0.40-0.80). High consistency within each method was found with the highest for the Borg cM scale (ICC = 0.89-0.91). There was a significant difference in the ratings between the Swedish and the Brazilian listeners for all methods. The category-ratio scale Borg cM was considered most reliable in the assessment of hypernasality. Language background of Brazilian-Portuguese listeners influenced the perceptual ratings of hypernasality in Swedish speech samples, despite their experience in perceptual assessment of cleft palate speech disorders.
Kinematic predictors of single-leg squat performance: a comparison of experienced physiotherapists and student physiotherapists.

PubMed

Weeks, Benjamin K; Carty, Christopher P; Horan, Sean A

2012-10-25

The single-leg squat (SLS) is a common test used by clinicians for the musculoskeletal assessment of the lower limb. The aim of the current study was to reveal the kinematic parameters used by experienced and inexperienced clinicians to determine SLS performance and establish reliability of such assessment. Twenty-two healthy, young adults (23.8 ± 3.1 years) performed three SLSs on each leg whilst being videoed. Three-dimensional data for the hip and knee was recorded using a 10-camera optical motion analysis system (Vicon, Oxford, UK). SLS performance was rated from video data using a 10-point ordinal scale by experienced musculoskeletal physiotherapists and student physiotherapists. All ratings were undertaken a second time at least two weeks after the first by the same raters. Stepwise multiple regression analysis was performed to determine kinematic predictors of SLS performance scores and inter- and intra-rater reliability were determined using a two-way mixed model to generate intra-class correlation coefficients (ICC(3,1)) of consistency. One SLS per leg for each participant was used for analysis, providing 44 SLSs in total. Eight experienced physiotherapists and eight physiotherapy students agreed to rate each SLS. Variance in physiotherapist scores was predicted by peak knee flexion, knee medio-lateral displacement, and peak hip adduction (R² = 0.64, p = 0.01), while variance in student scores was predicted only by peak knee flexion, and knee medio-lateral displacement (R² = 0.57, p = 0.01). Inter-rater reliability was good for physiotherapists (ICC(3,1) = 0.71) and students (ICC(3,1) = 0.60), whilst intra-rater reliability was excellent for physiotherapists (ICC(3,1) = 0.81) and good for students (ICC(3,1) = 0.71). Physiotherapists and students are both capable of reliable assessment of SLS performance. Physiotherapist assessments, however, bear stronger relationships to lower limb kinematics and are more sensitive to hip joint motion than student assessments.
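The reliability coefficient used above is the two-way mixed, single-rater, consistency form of the intra-class correlation in the Shrout and Fleiss notation. The following is a minimal sketch of that coefficient computed from two-way ANOVA mean squares. It assumes a complete subjects-by-raters table with no missing ratings; the function name and the example ratings are hypothetical.

```python
def icc_3_1(x):
    """ICC(3,1), consistency; x[i][j] is the rating of subject i by rater j."""
    n, k = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (n * k)
    row_means = [sum(row) / k for row in x]
    col_means = [sum(row[j] for row in x) / n for j in range(k)]
    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual
    bms = ss_rows / (n - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems)


# Hypothetical ratings: rater 2 scores every subject 2 points higher than
# rater 1, so consistency is perfect despite the systematic offset.
print(icc_3_1([[1, 3], [2, 4], [3, 5], [4, 6]]))
```

Because the between-rater sum of squares is removed before the ratio is formed, a constant offset between raters does not lower this coefficient. An absolute-agreement ICC would penalise that offset.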
Brief Report: "Quick and (Not So) Dirty" Assessment of Change in Autism--Cross-Cultural Reliability of the Developmental Disabilities CGAS and the OSU Autism CGI

ERIC Educational Resources Information Center

Choque Olsson, Nora; Bölte, Sven

2014-01-01

There are few evaluated economic tools to assess change in autism. This study examined the inter-rater reliability of the Developmental Disabilities Children's Global Assessment Scale (DD-CGAS), and the OSU Autism Clinical Global Impression (OSU Autism CGI) in a European setting. Using these scales, 16 clinicians with multidisciplinary…
Assessing adherence to the evidence base in the management of poststroke dysphagia.

PubMed

Burton, Christopher; Pennington, Lindsay; Roddam, Hazel; Russell, Ian; Russell, Daphne; Krawczyk, Karen; Smith, Hilary A

2006-01-01

To evaluate the reliability and responsiveness to change of an audit tool to assess adherence to evidence of effectiveness in the speech and language therapy (SLT) management of poststroke dysphagia. The tool was used to review SLT practice as part of a randomized study of different education strategies. Medical records were audited before and after delivery of the trial intervention. Seventeen SLT departments in the north-west of England participated in the study. The assessment tool was used to assess the medical records of 753 patients before and 717 patients after delivery of the trial intervention across the 17 departments. A target of 10 records per department per month was sought, using systematic sampling with a random start. Inter- and intra-rater reliability were explored, together with the tool's internal consistency and responsiveness to change. The assessment tool had high face validity, although internal consistency was low (ra = 0.37). Composite scores on the tool were however responsive to differences between SLT departments. Both inter- and intra-rater reliability ranged from 'substantial' to 'near perfect' across all items. The audit tool has high face validity and measurement reliability. The use of a composite adherence score should, however, proceed with caution as internal consistency is low.
A literature review of clinical tests for lumbar instability in low back pain: validity and applicability in clinical practice.

PubMed

Ferrari, Silvano; Manni, Tiziana; Bonetti, Francesca; Villafañe, Jorge Hugo; Vanti, Carla

2015-01-01

Several clinical tests have been proposed on low back pain (LBP), but their usefulness in detecting lumbar instability is not yet clear. The objective of this literature review was to investigate the clinical validity of the main clinical tests used for the diagnosis of lumbar instability in individuals with LBP and to verify their applicability in everyday clinical practice. We searched studies of the accuracy and/or reliability of Prone Instability Test (PIT), Passive Lumbar Extension Test (PLE), Aberrant Movements Pattern (AMP), Posterior Shear Test (PST), Active Straight Leg Raise Test (ASLR) and Prone and Supine Bridge Tests (PB and SB) in Medline, Embase, Cinahl, PubMed, and Scopus databases. Only the studies in which each test was investigated by at least one study concerning both the accuracy and the reliability were considered eligible. The quality of the studies was evaluated by QUADAS and QAREL scales. Six papers considering 333 LBP patients were included. The PLE was the most accurate and informative clinical test, with high sensitivity (0.84, 95% CI: 0.69 - 0.91) and high specificity (0.90, 95% CI: 0.85 -0.97). The diagnostic accuracy of AMP depends on each singular test. The PIT and the PST demonstrated by fair to moderate sensitivity and specificity [PIT sensitivity = 0.71 (95% CI: 0.51 - 0.83), PIT specificity = 0.57 (95% CI: 039 - 0.78); PST sensitivity = 0.50 (95% CI: 0.41 - 0.76), PST specificity = 0.48 (95% CI: 0.22 - 0.58)]. The PLE showed a good reliability (k = 0.76), but this result comes from a single study. The inter-rater reliability of the PIT ranged by slight (k = 0.10 and 0.04), to good (k = 0.87). The inter-rater reliability of the AMP ranged by slight (k = -0.07) to moderate (k = 0.64), whereas the inter-rater reliability of the PST was fair (k = 0.27). The data from the studies provided information on the methods used and suggest that PLE is the most appropriate tests to detect lumbar instability in specific LBP. 
However, due to the lack of available papers on other lumbar conditions, these findings should be confirmed with studies on non-specific LBP patients.
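
As a rough illustration of how sensitivity, specificity and their 95% confidence intervals such as those quoted above can be derived from a 2x2 table, here is a minimal Python sketch. The counts are hypothetical and are not taken from the review, and the Wilson score interval is only one common choice; the abstract does not say which interval method the included studies used.

```python
import math


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from 2x2 diagnostic counts."""
    sensitivity = tp / (tp + fn)   # true positives among all with the condition
    specificity = tn / (tn + fp)   # true negatives among all without it
    return sensitivity, specificity


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts, chosen only for illustration (not data from the review).
tp, fn, tn, fp = 42, 8, 90, 10
sens, spec = sensitivity_specificity(tp, fn, tn, fp)
sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)
print(f"sensitivity = {sens:.2f} (95% CI: {sens_ci[0]:.2f} - {sens_ci[1]:.2f})")
print(f"specificity = {spec:.2f} (95% CI: {spec_ci[0]:.2f} - {spec_ci[1]:.2f})")
```

With small samples the interval around a sensitivity estimate can be wide, which is one reason single-study results such as the PLE figures need confirmation.
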
Developing the Person-Environment Apathy Rating for persons with dementia.

PubMed

Jao, Ying-Ling; Algase, Donna L; Specht, Janet K; Williams, Kristine

2016-08-01

To develop the Person-Environment Apathy Rating (PEAR) scale that measures environmental stimulation and apathy in persons with dementia and to evaluate its psychometrics. The PEAR scale consists of the PEAR-Environment and PEAR-Apathy subscales. The items were developed via literature review, field testing, expert review, and pilot testing. The construct validity and reliability were examined through video observation. The parent study enrolled 185 institutionalized residents with dementia. For this study, 96 videos were selected from 24 participants. The PEAR-Environment subscale was validated using the Ambiance Scale and the Crowding Index. The PEAR-Apathy subscale was validated using the Neuropsychiatric Inventory (NPI)-Apathy, Passivity in Dementia Scale (PDS), and NPI-Depression. The PEAR-Environment and PEAR-Apathy subscales each consist of six items rated on a 1-4 scale. For validity, the Crowding Index correlated weakly, yet significantly, with the PEAR-Environment subscale total score and three of the individual scores. Ambiance Scale scores, both engaging and soothing, did not correlate with the PEAR-Environment subscale. The PEAR-Apathy subscale correlated highly with the PDS and NPI-Apathy and moderately with the NPI-Depression, suggesting good convergent validity and moderate discriminant validity. For reliability, both environment and apathy subscales demonstrated excellent internal consistency. Although facial expression and eye contact showed moderate inter-rater reliability, all other items showed good to excellent inter-rater and intra-rater reliability. This study has successfully developed the PEAR scale and established its psychometrics based on the compatible scales available. The PEAR scale is the first scale that concurrently assesses apathy and environmental stimulation, and is recommended for use in persons with dementia.
Reliability of the Test of Integrated Language and Literacy Skills (TILLS).

PubMed

Mailend, Marja-Liisa; Plante, Elena; Anderson, Michele A; Applegate, E Brooks; Nelson, Nickola W

2016-07-01

As new standardized tests become commercially available, it is critical that clinicians have access to the information about a test's psychometric properties, including aspects of reliability. The purpose of the three studies reported in this article was to investigate the reliability of a new test, the Test of Integrated Language and Literacy Skills (TILLS), with consideration of both internal and external sources of measurement error. The TILLS was administered to children aged 6;0-18;11 years. The participants varied in terms of their language and literacy skills and included children with typical language development as well as those diagnosed with language or learning disability. The sample of children also varied in terms of their racial and socioeconomic backgrounds. Study 1 (N = 1056) assessed the internal consistency of TILLS calculating the coefficient omega for each subtest. Study 2 (N = 103) and Study 3 (N = 39) used the intra-class correlation coefficients to report on test-retest and inter-rater reliability respectively. The results indicate strong internal consistency and inter-rater reliability for all subtests of TILLS. The test-retest reliability was strong for all but one subtest, for which the intra-class correlation coefficient was in the acceptable range. This article provides clinicians with essential scientific information that supports the internal and external reliability of a new test of oral and written language skills, the TILLS. Information about reliability is critical for guiding the selection of an appropriate diagnostic tool amongst a number of options. © 2016 Royal College of Speech and Language Therapists.
[Overall cognitive assessment in Basque-speaking people with advanced dementia. Validation to the Basque language of the Severe Mini-Mental State Examination SMMSE (SMMSE-eus)].

PubMed

Buiza, Cristina; Yanguas, Javier; Zulaica, Amaia; Antón, Iván; Arriola, Enrique; García, Alvaro

2018-04-13

Adaptation and validation to the Basque language of tests to assess advanced cognitive impairment is an unmet need for Basque-speaking people. The present work shows the validation of the Basque version of the Severe Mini Mental State Examination (SMMSE). A total of 109 people with advanced dementia (MEC<15) took part in the validation study, and were classified as GDS 5-7 on the Geriatric Depression Scale (GDS). All participants were Spanish-Basque bilingual. It was shown that SMMSE-eus has a high internal consistency (alpha=0.92), a good test-retest reliability (r=0.88; P<.01), and a high inter-rater reliability (ICC=0.99; P<.00) for the overall score, as well as for each item. Both the high internal consistency and inter-rater reliability, and to a lesser extent, test-retest reliability, made the SMMSE-eus a valid test for the brief assessment of cognitive status in Basque-speaking people with advanced dementia. For this reason, the SMMSE-eus is a usable and reliable alternative for assessing Basque-speaking people in their mother tongue or preferred language. Copyright © 2017 SEGG. Publicado por Elsevier España, S.L.U. All rights reserved.
Clinical indicators for routine use in the evaluation of early psychosis intervention: development, training support and inter-rater reliability.

PubMed

Catts, Stanley V; Frost, Aaron D J; O'Toole, Brian I; Carr, Vaughan J; Lewin, Terry; Neil, Amanda L; Harris, Meredith G; Evans, Russell W; Crissman, Belinda R; Eadie, Kathy

2011-01-01

Clinical practice improvement carried out in a quality assurance framework relies on routinely collected data using clinical indicators. Herein we describe the development, minimum training requirements, and inter-rater agreement of indicators that were used in an Australian multi-site evaluation of the effectiveness of early psychosis (EP) teams. Surveys of clinician opinion and face-to-face consensus-building meetings were used to select and conceptually define indicators. Operationalization of definitions was achieved by iterative refinement until clinicians could be quickly trained to code indicators reliably. Calculation of percentage agreement with expert consensus coding was based on ratings of paper-based clinical vignettes embedded in a 2-h clinician training package. Consensually agreed upon conceptual definitions for seven clinical indicators judged most relevant to evaluating EP teams were operationalized for ease-of-training. Brief training enabled typical clinicians to code indicators with acceptable percentage agreement (60% to 86%). For indicators of suicide risk, psychosocial function, and family functioning this level of agreement was only possible with less precise 'broad range' expert consensus scores. Estimated kappa values indicated fair to good inter-rater reliability (kappa > 0.65). Inspection of contingency tables (coding category by health service) and modal scores across services suggested consistent, unbiased coding across services. Clinicians are able to agree upon what information is essential to routinely evaluate clinical practice. Simple indicators of this information can be designed and coding rules can be reliably applied to written vignettes after brief training. The real world feasibility of the indicators remains to be tested in field trials.
A French validation study of the Coma Recovery Scale-Revised (CRS-R).

PubMed

Schnakers, Caroline; Majerus, Steve; Giacino, Joseph; Vanhaudenhuyse, Audrey; Bruno, Marie-Aurelie; Boly, Melanie; Moonen, Gustave; Damas, Pierre; Lambermont, Bernard; Lamy, Maurice; Damas, Francois; Ventura, Manfredi; Laureys, Steven

2008-09-01

The aim of the present study was to explore the concurrent validity, inter-rater agreement and diagnostic sensitivity of a French adaptation of the Coma Recovery Scale-Revised (CRS-R) as compared to other coma scales such as the Glasgow Coma Scale (GCS), the Full Outline of UnResponsiveness scale (FOUR) and the Wessex Head Injury Matrix (WHIM). Multi-centric prospective study. To test concurrent validity and diagnostic sensitivity, the four behavioural scales were administered in a randomized order in 77 vegetative and minimally conscious patients. Twenty-four clinicians with different professional backgrounds, levels of expertise and CRS-R experience were recruited to assess inter-rater agreement. Good concurrent validity was obtained between the CRS-R and the three other standardized behavioural scales. Inter-rater reliability for the CRS-R total score and sub-scores was good, indicating that the scale yields reproducible findings across examiners and does not appear to be systematically biased by profession, level of expertise or CRS-R experience. Finally, the CRS-R demonstrated a significantly higher sensitivity to detect MCS patients, as compared to the GCS, the FOUR and the WHIM. The results show that the French version of the CRS-R is a valid and sensitive scale which can be used in severely brain damaged patients by all members of the medical staff.
Validity and Reliability of Field-Based Measures for Assessing Movement Skill Competency in Lifelong Physical Activities: A Systematic Review.

PubMed

Hulteen, Ryan M; Lander, Natalie J; Morgan, Philip J; Barnett, Lisa M; Robertson, Samuel J; Lubans, David R

2015-10-01

It has been suggested that young people should develop competence in a variety of 'lifelong physical activities' to ensure that they can be active across the lifespan. The primary aim of this systematic review is to report the methodological properties, validity, reliability, and test duration of field-based measures that assess movement skill competency in lifelong physical activities. A secondary aim was to clearly define those characteristics unique to lifelong physical activities. A search of four electronic databases (Scopus, SPORTDiscus, ProQuest, and PubMed) was conducted between June 2014 and April 2015 with no date restrictions. Studies addressing the validity and/or reliability of lifelong physical activity tests were reviewed. Included articles were required to assess lifelong physical activities using process-oriented measures, as well as report either one type of validity or reliability. Assessment criteria for methodological quality were adapted from a checklist used in a previous review of sport skill outcome assessments. Movement skill assessments for eight different lifelong physical activities (badminton, cycling, dance, golf, racquetball, resistance training, swimming, and tennis) in 17 studies were identified for inclusion. Methodological quality, validity, reliability, and test duration (time to assess a single participant), for each article were assessed. Moderate to excellent reliability results were found in 16 of 17 studies, with 71% reporting inter-rater reliability and 41% reporting intra-rater reliability. Only four studies in this review reported test-retest reliability. Ten studies reported validity results; content validity was cited in 41% of these studies. Construct validity was reported in 24% of studies, while criterion validity was only reported in 12% of studies. Numerous assessments for lifelong physical activities may exist, yet only assessments for eight lifelong physical activities were included in this review. 
Generalizability of results may be more applicable if more heterogeneous samples are used in future research. Moderate to excellent levels of inter- and intra-rater reliability were reported in the majority of studies. However, future work should look to establish test-retest reliability. Validity was less commonly reported than reliability, and further types of validity other than content validity need to be established in future research. Specifically, predictive validity of 'lifelong physical activity' movement skill competency is needed to support the assertion that such activities provide the foundation for a lifetime of activity.
Inter-rater reliability of malaria parasite counts and comparison of methods

PubMed Central

2009-01-01

Background: The introduction of artemisinin-based treatment for falciparum malaria has led to a shift away from symptom-based diagnosis. Diagnosis may be achieved by using rapid non-microscopic diagnostic tests (RDTs), of which there are many available. Light microscopy, however, has a central role in parasite identification and quantification and remains the main method of parasite-based diagnosis in clinic and hospital settings and is necessary for monitoring the accuracy of RDTs. The World Health Organization has prepared a proficiency testing panel containing a range of malaria-positive blood samples of known parasitaemia, to be used for the assessment of commercially available malaria RDTs. Different blood film and counting methods may be used for this purpose, which raises questions regarding accuracy and reproducibility. A comparison was made of the established methods for parasitaemia estimation to determine which would give the least inter-rater and inter-method variation. Methods: Experienced malaria microscopists counted asexual parasitaemia on different slides using three methods: the thin film method using the total erythrocyte count, the thick film method using the total white cell count and the Earle and Perez method. All the slides were stained using Giemsa pH 7.2. Analysis of variance (ANOVA) models were used to find the inter-rater reliability for the different methods. The paired t-test was used to assess any systematic bias between the two methods, and a regression analysis was used to see if there was a changing bias with parasite count level. Results: The thin blood film gave parasite counts around 30% higher than those obtained by the thick film and Earle and Perez methods, but exhibited a loss of sensitivity with low parasitaemia. The thick film and Earle and Perez methods showed little or no bias in counts between the two methods; however, estimated inter-rater reliability was slightly better for the thick film method. 
Conclusion: The thin film method gave results closer to the true parasite count but is not feasible at a parasitaemia below 500 parasites per microlitre. The thick film method was both reproducible and practical for this project. The determination of malarial parasitaemia must be applied by skilled operators using standardized techniques. PMID:19939271

Reliability, validity and description of timed performance of the Jebsen-Taylor Test in patients with muscular dystrophies.

PubMed

Artilheiro, Mariana Cunha; Fávero, Francis Meire; Caromano, Fátima Aparecida; Oliveira, Acary de Souza Bulle; Carvas, Nelson; Voos, Mariana Callil; Sá, Cristina Dos Santos Cardoso de

2017-12-08

The Jebsen-Taylor Test evaluates upper limb function by measuring timed performance on everyday activities. The test is used to assess and monitor the progression of patients with Parkinson disease, cerebral palsy, stroke and brain injury. To analyze the reliability, internal consistency and validity of the Jebsen-Taylor Test in people with Muscular Dystrophy and to describe and classify upper limb timed performance of people with Muscular Dystrophy. Fifty patients with Muscular Dystrophy were assessed. Non-dominant and dominant upper limb performances on the Jebsen-Taylor Test were filmed. Two raters evaluated timed performance for inter-rater reliability analysis. Test-retest reliability was investigated by using intraclass correlation coefficients. Internal consistency was assessed using the Cronbach alpha. Construct validity was assessed by comparing the Jebsen-Taylor Test with the Performance of Upper Limb. The internal consistency of the Jebsen-Taylor Test was good (Cronbach's α=0.98). Inter-rater reliability was very high (0.903-0.999), except for writing, with an intraclass correlation coefficient of 0.772-1.000. Strong correlations between the Jebsen-Taylor Test and the Performance of Upper Limb Module were found (rho=-0.712). The Jebsen-Taylor Test is a reliable and valid measure of timed performance for people with Muscular Dystrophy. Copyright © 2017 Associação Brasileira de Pesquisa e Pós-Graduação em Fisioterapia. Publicado por Elsevier Editora Ltda. All rights reserved.
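
For readers unfamiliar with the internal-consistency statistic reported above, the following minimal Python sketch computes Cronbach's alpha from the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name `cronbach_alpha` and the timing data are hypothetical and are not drawn from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-participant rows, one value per item."""
    k = len(scores[0])  # number of items (here: subtests)
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical timings in seconds: 5 participants on 3 subtests (illustration only).
timings = [
    [10, 12, 11],
    [15, 16, 17],
    [20, 22, 21],
    [25, 27, 24],
    [30, 31, 33],
]
alpha = cronbach_alpha(timings)
print(f"Cronbach's alpha = {alpha:.3f}")
```

Because alpha rises when items track one another closely, timed subtests that all slow down together in more affected participants will tend to give very high values.
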
Measuring competence in endoscopic sinus surgery.

PubMed

Syme-Grant, J; White, P S; McAleer, J P G

2008-02-01

Competence based education is currently being introduced into higher surgical training in the UK. Valid and reliable performance assessment tools are essential to ensure competencies are achieved. No such tools have yet been reported in the UK literature. We sought to develop and pilot test an Endoscopic Sinus Surgery Competence Assessment Tool (ESSCAT). The ESSCAT was designed for in-theatre assessment of higher surgical trainees in the UK. The ESSCAT rating matrix was developed through task analysis of ESS procedures. All otolaryngology consultants and specialist registrars in Scotland were given the opportunity to contribute to its refinement. Two cycles of in-theatre testing were used to ensure utility and gather quantitative data on validity and reliability. Videos of trainees performing surgery were used in establishing inter-rater reliability. National consultation, the consensus derived minimum standard of performance, Cronbach's alpha = 0.89 and demonstration of trainee learning (p = 0.027) during the in vivo application of the ESSCAT suggest a high level of validity. Inter-rater reliability was moderate for competence decisions (Cohen's Kappa = 0.5) and good for total scores (Intra-Class Correlation Co-efficient = 0.63). Intra-rater reliability was good for both competence decisions (Kappa = 0.67) and total scores (Kendall's Tau-b = 0.73). The ESSCAT generates a valid and reliable assessment of trainees' in-theatre performance of endoscopic sinus surgery. In conjunction with ongoing evaluation of the instrument we recommend the use of the ESSCAT in higher specialist training in otolaryngology in the UK.
Does a child's language ability affect the correspondence between parent and teacher ratings of ADHD symptoms?

PubMed

Gooch, Debbie; Maydew, Harriet; Sears, Claire; Norbury, Courtenay Frazier

2017-04-05

Rating scales are often used to identify children with potential Attention-Deficit/Hyperactivity Disorder (ADHD), yet there are frequently discrepancies between informants which may be moderated by child characteristics. The current study asked whether correspondence between parent and teacher ratings on the Strengths and Weaknesses of ADHD symptoms and Normal behaviour scale (SWAN) varied systematically with child language ability. Parent and teacher SWAN questionnaires were returned for 200 children (aged 61-81 months); 106 had low language ability (LL) and 94 had typically developing language (TL). After exploring informant correspondence (using Pearson correlation) and the discrepancy between raters, we report inter-class correlation coefficients, to assess inter-rater reliability, and Cohen's kappa, to assess agreement regarding possible ADHD caseness. Correlations between informant ratings on the SWAN were moderate. Children with LL were rated as having increased inattention and hyperactivity relative to children with TL; teachers, however, rated children with LL as having more inattention than parents. Inter-rater reliability of the SWAN was good and there were no systematic differences between the LL and TL groups. Case agreement between parent and teachers was fair; this varied by language group with poorer case agreement for children with LL. Children's language abilities affect the discrepancy between informant ratings of ADHD symptomatology and the agreement between parents and teachers regarding potential ADHD caseness. The assessment of children's core language ability would be a beneficial addition to the ADHD diagnostic process.
The evaluation of lumbar multifidus muscle function via palpation: reliability and validity of a new clinical test.

PubMed

Hebert, Jeffrey J; Koppenhaver, Shane L; Teyhen, Deydre S; Walker, Bruce F; Fritz, Julie M

2015-06-01

The lumbar multifidus muscle provides an important contribution to lumbar spine stability, and the restoration of lumbar multifidus function is a frequent goal of rehabilitation. Currently, there are no reliable and valid physical examination procedures available to assess lumbar multifidus function among patients with low back pain. To examine the inter-rater reliability and concurrent validity of the multifidus lift test (MLT) to identify lumbar multifidus dysfunction among patients with low back pain. A cross-sectional analysis of reliability and concurrent validity performed in a university outpatient research facility. Thirty-two persons aged 18 to 60 years with current low back pain and a minimum modified Oswestry disability score of 20%. Study participants were excluded if they reported a history of lumbar spine surgery, lumbar radiculopathy, medical red flags, osteoporosis, or had recently been treated with spinal manipulation or trunk stabilization exercises. Concurrent measures of lumbar multifidus muscle function at the L4-L5 and L5-S1 levels were obtained with the MLT (index test) and real-time ultrasound imaging (reference standard). The inter-rater reliability of the MLT was examined by measuring the level of agreement between two blinded examiners. Concurrent validity of the MLT was investigated by comparing clinicians' judgments with real-time ultrasound imaging measures of lumbar multifidus function. Inter-rater reliability of the MLT was substantial to excellent (κ=0.75 to 0.81, p≤.01) and free from errors of bias and prevalence. When performed at L4-L5 or L5-S1, the MLT demonstrated evidence of concurrent validity through its relationship with the reference standard results at L4-L5 (rbis=0.59-0.73, p≤.01). The MLT generally failed to demonstrate a relationship with the reference standard results from the L5-S1 level. 
Our results provide preliminary evidence supporting the reliability and validity of the MLT to assess lumbar multifidus function at the L4-L5 spinal level. Additional research examining the measurement properties and utility of this test should be undertaken before confident implementation with patients. Copyright © 2015 Elsevier Inc. All rights reserved.
Inter-Rater Reliability for Speech-Language Therapists' Judgement of Oesophageal Abnormality during Oesophageal Visualization

ERIC Educational Resources Information Center

Miles, Anna

2017-01-01

Background: Oesophageal abnormalities are common findings in a speech-language therapy videofluoroscopy clinic. Fluoroscopic screening involving oropharynx alone fails to identify these patients. Oesophageal screening as an adjunct to videofluoroscopy is gaining popularity. Yet currently, little is known about the reliability of speech and…
Reliability and validity of a Chinese version of the Diagnostic Interview for Borderlines-Revised.

PubMed

Wang, Lanlan; Yuan, Chenmei; Qiu, Jianying; Gunderson, John; Zhang, Min; Jiang, Kaida; Leung, Freedom; Zhong, Jie; Xiao, Zeping

2014-09-01

Borderline personality disorder (BPD) is the most studied of the axis II disorders. One of the most widely used diagnostic instruments is the Diagnostic Interview for Borderline Patients-Revised (DIB-R). The aim of this study was to test the reliability and validity of DIB-R for use in the Chinese culture. The reliability and validity of the DIB-R Chinese version were assessed in a sample of 236 outpatients with a probable BPD diagnosis. The Structured Clinical Interview for DSM-IV Personality Disorders (SCID-II) was used as a standard. Test-retest reliability was tested six months later with 20 patients, and inter-rater reliability was tested on 32 patients. The Chinese version of the DIB-R showed good internal global consistency (Cronbach's α of 0.916), good test-retest reliability (Pearson correlation of 0.704), good inter-rater reliability (intra-class correlation coefficient of 0.892 and kappa of 0.861). When compared with the DSM-IV diagnosis as measured by the SCID-II, the DIB-R showed relatively good sensitivity (0.768) and specificity (0.891) at the cutoff of 7, moderate diagnostic convergence (kappa of 0.631), as well as good discriminating validity. The Chinese version of the DIB-R has good psychometric properties, which renders it a valuable method for examining the presence, the severity, and component phenotypes of BPD in Chinese samples. © 2013 Wiley Publishing Asia Pty Ltd.
Assessing clinical competency in the health sciences

NASA Astrophysics Data System (ADS)

Panzarella, Karen Joanne

To test the success of integrated curricula in schools of health sciences, meaningful measurements of student performance are required to assess clinical competency. This research project analyzed a new performance assessment tool, the Integrated Standardized Patient Examination (ISPE), for assessing clinical competency: specifically, to assess Doctor of Physical Therapy (DPT) students' clinical competence as the ability to integrate basic science knowledge with clinical communication skills. Thirty-four DPT students performed two ISPE cases, one of a patient who sustained a stroke and the other a patient with a herniated lumbar disc. Cases were portrayed by standardized patients (SPs) in a simulated clinical setting. Each case was scored by an expert evaluator in the exam room and then by one investigator and the students themselves via videotape. The SPs scored each student on an overall encounter rubric. Written feedback was obtained from all participants in the study. Acceptable reliability was demonstrated via inter-rater agreement as well as inter-rater correlations on items that used a dichotomous scale, whereas the items requiring the use of the 4-point rubric were somewhat less reliable. For the entire scale both cases had a significant correlation between the Expert-Investigator pair of raters, for the CVA case r = .547, p < .05 and for the HD case r = .700, p < .01. The SPs scored students higher than the other raters. Students' self-assessments were most closely aligned with the investigator. Effects were apparent due to case. Content validity was gathered in the process of developing cases and patient scenarios that were used in this study. Construct validity was obtained from the survey results analyzed from the experts and students. Future studies should examine the effect of rater training upon the reliability. 
Criterion or predictive validity could be further studied by comparing students' performances on the ISPE with other independent estimates of students' competence. The unique integration questions of the ISPE were judged to have good content validity from experts and students, suggestive that integration, a most crucial element of clinical competence, while done in the mind of the student, can be practiced, learned and assessed.
How Effective Are Self- and Peer Assessment of Oral Presentation Skills Compared with Teachers' Assessments?

ERIC Educational Resources Information Center

De Grez, Luc; Valcke, Martin; Roozen, Irene

2012-01-01

Assessment of oral presentation skills is an underexplored area. The study described here focuses on the agreement between professional assessment and self- and peer assessment of oral presentation skills and explores student perceptions about peer assessment. The study has the merit of paying attention to the inter-rater reliability of the…
Development and psychometric evaluation of a clinical global impression for schizoaffective disorder scale.

PubMed

Allen, Michael H; Daniel, David G; Revicki, Dennis A; Canuso, Carla M; Turkoz, Ibrahim; Fu, Dong-Jing; Alphs, Larry; Ishak, K Jack; Bartko, John J; Lindenmayer, Jean-Pierre

2012-01-01

The Clinical Global Impression for Schizoaffective Disorder scale is a new rating scale adapted from the Clinical Global Impression scale for use in patients with schizoaffective disorder. The psychometric characteristics of the Clinical Global Impression for Schizoaffective Disorder are described. Content validity was assessed using an investigator questionnaire. Inter-rater reliability was determined with 12 sets of videotaped interviews rated independently by two trained individuals. Test-retest reliability was assessed using 30 randomly selected raters from clinical trials who evaluated the same videos on separate occasions two weeks apart. Convergent and divergent validity and effect size were evaluated by comparing scores between the Clinical Global Impression for Schizoaffective Disorder and the Positive and Negative Syndrome Scale, 21-item Hamilton Rating Scale for Depression, and Young Mania Rating Scale using pooled patient data from two clinical trials. Clinical Global Impression for Schizoaffective Disorder scores were then linked to corresponding Positive and Negative Syndrome Scale scores. Content validity was strong. Inter-rater agreement was good to excellent for most scales and subscales (intra-class correlation coefficient ≥ 0.50). Test-retest showed good reproducibility, with intraclass correlation coefficients ranging from 0.444 to 0.898. Spearman correlations between Clinical Global Impression for Schizoaffective Disorder domains and corresponding symptom scales were 0.60 or greater, and effect sizes for Clinical Global Impression for Schizoaffective Disorder overall and domain scores were similar to Positive and Negative Syndrome Scale, Young Mania Rating Scale, and 21-item Hamilton Rating Scale for Depression scores. Raters anticipated that the scale might be less effective in distinguishing negative from depressive symptoms, and, in fact, the results here may reflect that clinical reality. 
Multiple lines of evidence support the reliability and validity of the Clinical Global Impression for Schizoaffective Disorder for studies in schizoaffective disorder.
Exploring precarious employment and women's health within the context of U.S. microenterprise using focus groups.

PubMed

Salt, Rebekah; Lee, Jongwon

2014-01-01

Nursing has been a leader in exploring social determinants of health within the context of U.S. microenterprise and women's health. The purpose of this study was to explore precarious employment within the context of microenterprise and women's health using focus groups with clientele from New Mexico (NM). The specific aims were to identify (1) the health concerns of low-income women who utilized resources from Women's Economic Self-Sufficiency Team (WESST), and (2) the meaning of precarious employment in low-income women's lives. Fourteen women, ranging in age from 21-65 years, who were affiliated with regional WESST sites around NM participated in focus groups and completed a demographic questionnaire. Focus group data were analyzed using content analysis. The degree of interrater agreement was determined by calculating the Cohen's kappa, percentage agreement, and prevalence-adjusted and bias-adjusted kappa (PABAK). Two broad themes emerged from these data: (1) Working for Yourself and (2) Strategies. Although the women identified concerns about participation in microenterprise, flexibility, freedom, and feeling purposeful were motivators to pursue a small business. The kappa statistics on the five transcripts revealed poor inter-rater agreement, yet PABAK, which is a more sophisticated inter-rater reliability index, indicated that inter-rater agreement between the two raters was satisfactory. Despite the challenges associated with microenterprise in the US, women found value in working for themselves. © 2014 Wiley Periodicals, Inc.
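
The contrast between a poor Cohen's kappa and a satisfactory PABAK reported above can arise when one coding category is rare. The minimal Python sketch below uses hypothetical codes (not the study's transcripts) from two raters who agree on 18 of 20 segments, and computes percentage agreement, Cohen's kappa, and PABAK with the standard formulas; the function names are illustrative.

```python
from collections import Counter


def percent_agreement(r1, r2):
    """Proportion of units on which the two raters give the same code."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) if chance agreement equals 1.
    """
    n = len(r1)
    p_obs = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    p_exp = sum(c1[c] * c2[c] for c in set(r1) | set(r2)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)


def pabak(r1, r2, n_categories=2):
    """Prevalence-adjusted and bias-adjusted kappa: (k * p_obs - 1) / (k - 1)."""
    p_obs = percent_agreement(r1, r2)
    return (n_categories * p_obs - 1) / (n_categories - 1)


# Hypothetical codes for 20 transcript segments; "present" is a rare theme.
rater_a = ["absent"] * 18 + ["present", "absent"]
rater_b = ["absent"] * 18 + ["absent", "present"]

print(f"agreement = {percent_agreement(rater_a, rater_b):.2f}")
print(f"kappa     = {cohens_kappa(rater_a, rater_b):.3f}")
print(f"PABAK     = {pabak(rater_a, rater_b):.2f}")
```

In this example the raters agree on 90% of segments, yet kappa is slightly negative because almost all of that agreement is expected by chance when one category dominates. PABAK depends only on observed agreement, so it stays high.
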
Psychometric properties of the Peer Proficiency Assessment (PEPA): a tool for evaluation of undergraduate peer counselors' motivational interviewing fidelity.

PubMed

Mastroleo, Nadine R; Mallett, Kimberly A; Turrisi, Rob; Ray, Anne E

2009-09-01

Despite the expanding use of undergraduate student peer counseling interventions aimed at reducing college student drinking, few programs evaluate peer counselors' competency to conduct these interventions. The present research describes the development and psychometric assessments of the Peer Proficiency Assessment (PEPA), a new tool for examining Motivational Interviewing adherence in undergraduate student peer delivered interventions. Twenty peer delivered sessions were evaluated by master and undergraduate student coders using a cross-validation design to examine peer based alcohol intervention sessions. Assessments revealed high inter-rater reliability between student and master coders and good correlations between previously established fidelity tools. Findings lend support for the use of the PEPA to examine peer counselor competency. The PEPA, training for use, inter-rater reliability information, construct and predictive validity, and tool usefulness are described.
How much do family physicians involve pregnant women in decisions about prenatal screening for Down syndrome?

PubMed

Gagnon, Susie; Labrecque, Michel; Njoya, Merlin; Rousseau, François; St-Jacques, Sylvie; Légaré, France

2010-02-01

To assess the extent to which family physicians (FPs) involve women in decisions about prenatal screening for Down syndrome. Based on transcripts of consultations between 41 FPs and 128 women, two raters independently assessed clinicians' efforts to involve women in decisions about prenatal screening for Down syndrome using the French-language version of OPTION. Descriptive statistics of OPTION scores were calculated. Construct validity was assessed by performing a principal factor analysis and by measuring association with consultation duration and FPs' sociodemographics. Internal consistency was assessed with Cronbach's alpha and inter-rater reliability with the intraclass correlation coefficient. The overall mean OPTION score was low: 19 +/- 7 (range = 0 [no involvement] to 100 [high involvement]). One factor accounted for 80% of the variance. Both internal consistency and inter-rater reliability were very good (Cronbach's alpha = 0.73; ICC = 0.76). OPTION scores were lower for residents than for licensed FPs (17 +/- 5 vs 21 +/- 4; p = 0.02) and were positively associated with duration of consultation (r = 0.56; p < 0.001). Based on the French-language version of OPTION, which showed satisfactory psychometric properties, the FPs studied made minimal efforts to involve women in decisions about prenatal screening for Down syndrome. (c) 2009 John Wiley & Sons, Ltd.
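Internal consistency of the kind reported above is usually summarised with Cronbach's alpha. The sketch below applies the standard formula to a subjects-by-items score matrix. The toy scores are made up for illustration and have nothing to do with the OPTION data, and `cronbach_alpha` is a hypothetical helper name.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per subject, one column per item)."""
    k = len(scores[0])  # number of items
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up scores: 4 subjects rated on 2 items.
toy_scores = [[2, 3], [4, 4], [3, 5], [5, 6]]
print(round(cronbach_alpha(toy_scores), 3))
```

Alpha rises as items covary more strongly relative to their individual variances. It says nothing about agreement between raters, which is why the abstract reports the ICC separately.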
The relationship between quantitative measures of disc height and disc signal intensity with Pfirrmann score of disc degeneration.

PubMed

Salamat, Sara; Hutchings, John; Kwong, Clemens; Magnussen, John; Hancock, Mark J

2016-01-01

To assess the relationship between quantitative measures of disc height and signal intensity with the Pfirrmann disc degeneration scoring system and to test the inter-rater reliability of the quantitative measures. Participants were 76 people who had recently recovered from their last episode of acute low back pain and underwent MRI scan on a single 3T machine. At all 380 lumbar discs, quantitative measures of disc height and signal intensity were made by 2 independent raters and compared to Pfirrmann scores from a single radiologist. For quantitative measures of disc height and signal intensity a "raw" score and 2 adjusted ratios were calculated and the relationship with Pfirrmann scores was assessed. The inter-tester reliability of quantitative measures was also investigated. There was a strong linear relationship between quantitative disc signal intensity and Pfirrmann scores for grades 1-4, but not for grades 4 and 5. For disc height only, Pfirrmann grade 5 had significantly reduced disc height compared to all other grades. Results were similar regardless of whether raw or adjusted scores were used. Inter-rater reliability for the quantitative measures was excellent (ICC > 0.97). Quantitative measures of disc signal intensity were strongly related to Pfirrmann scores from grade 1 to 4; however disc height only differentiated between grade 4 and 5 Pfirrmann scores. Using adjusted ratios for quantitative measures of disc height or signal intensity did not significantly alter the relationship with Pfirrmann scores.
Development of a Chinese Version of the Suicide Intent Scale

ERIC Educational Resources Information Center

Gau, Susan S. F.; Chen, Chin-Hung; Lee, Charles T. C.; Chang, Jung-Chen; Cheng, Andrew T. A.

2009-01-01

This study established the psychometric properties of the Chinese version of the Suicide Intent Scale (SIS) in a clinic- and community-based sample of 36 patients and 592 respondents, respectively. Results showed that the Chinese SIS demonstrated good inter-rater and test-retest reliability. Factor analysis generated three factors (Precautions,…
Psychometrics and Validation of a Brief Rating Measure of Parent-Infant Interaction: Manchester Assessment of Caregiver-Infant Interaction

ERIC Educational Resources Information Center

Wan, Ming Wai; Brooks, Ami; Green, Jonathan; Abel, Kathryn; Elmadih, Alya

2017-01-01

This study investigated the psychometrics of a recently developed global rating measure of videotaped parent-infant interaction, the "Manchester Assessment of Caregiver-Infant Interaction" (MACI), in a normative sample. Inter-rater reliability, stability over time, and convergent and discriminant validity were tested. Six-minute play…
Concurrent validity of the Alberta Infant Motor Scale to detect delayed gross motor development in preterm infants: A comparative study with the Bayley III.

PubMed

Albuquerque, Plínio Luna de; Guerra, Miriam Queiroz de Farias; Lima, Marília de Carvalho; Eickmann, Sophie Helena

2017-05-24

To investigate the concurrent validity of AIMS in relation to the gross motor subtest of the Bayley Scale III/GM in preterm infants. A total of 159 gross motor development assessments were performed with the AIMS and Bayley-III/GM. Linear regression was used to assess the correlation between AIMS and Bayley-III/GM scores. The intra-class correlation coefficient (ICC) and the Bland-Altman plot were used to analyze intra- and inter-rater reliability. There was a prevalence of delayed gross motor development of 20.8% according to the Bayley-III/GM, and 11.9% for the 5th percentile and 21.4% for the 10th percentile of AIMS. A good correlation of AIMS with Bayley-III/GM scores and intra- and inter-rater reliability was encountered in this study. AIMS proved very capable of detecting delayed gross motor development in preterm infants when compared with the Bayley-III/GM. The 10th percentile of AIMS provided the best combination of indicators, with greater specificity.
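A Bland-Altman analysis, as mentioned above, summarises agreement between two sets of scores by the mean difference (bias) and the 95% limits of agreement. The sketch below uses invented paired scores, not AIMS data, and the function name `bland_altman` is an assumption made for illustration.

```python
from statistics import mean, stdev


def bland_altman(scores_a, scores_b):
    """Return (bias, lower limit, upper limit) for paired scores from two raters."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # 95% limits assume roughly normal differences
    return bias, bias - spread, bias + spread


# Invented paired scores for five infants.
rater_1 = [10, 12, 14, 16, 18]
rater_2 = [11, 12, 13, 17, 18]
bias, lower, upper = bland_altman(rater_1, rater_2)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

In a full analysis the differences would also be plotted against the pairwise means to check whether disagreement grows with the score.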
Development and Inter-Rater Reliability of the Liverpool Adverse Drug Reaction Causality Assessment Tool

PubMed Central

Gallagher, Ruairi M.; Kirkham, Jamie J.; Mason, Jennifer R.; Bird, Kim A.; Williamson, Paula R.; Nunn, Anthony J.; Turner, Mark A.; Smyth, Rosalind L.; Pirmohamed, Munir

2011-01-01

Aim To develop and test a new adverse drug reaction (ADR) causality assessment tool (CAT). Methods A comparison between seven assessors of a new CAT, formulated by an expert focus group, compared with the Naranjo CAT in 80 cases from a prospective observational study and 37 published ADR case reports (819 causality assessments in total). Main Outcome Measures Utilisation of causality categories, measure of disagreements, inter-rater reliability (IRR). Results The Liverpool ADR CAT, using 40 cases from an observational study, showed causality categories of 1 unlikely, 62 possible, 92 probable and 125 definite (1, 62, 92, 125) and ‘moderate’ IRR (kappa 0.48), compared to Naranjo (0, 100, 172, 8) with ‘moderate’ IRR (kappa 0.45). In a further 40 cases, the Liverpool tool (0, 66, 81, 133) showed ‘good’ IRR (kappa 0.6) while Naranjo (1, 90, 185, 4) remained ‘moderate’. Conclusion The Liverpool tool assigns the full range of causality categories and shows good IRR. Further assessment by different investigators in different settings is needed to fully assess the utility of this tool. PMID:22194808
The Italian version of the Mayo-Portland Adaptability Inventory-4. A new measure of brain injury outcome.

PubMed

Cattelani, R; Corsini, D; Posteraro, L; Agosti, M; Saccavini, M

2009-12-01

The assessment of major obstacles to community integration which may result from an acquired brain injury (ABI) is needed for rational planning and effective management of ABI patients' social adjustment. Currently, such a generally acceptable measure is not available for the Italian population. This paper reports the translation process, the internal consistency, and the inter-rater reliability data for the Italian version of the Mayo-Portland Adaptability Inventory-4 (MPAI-4), a useful measure with highly developed and well documented psychometric properties. The MPAI-4 is specifically designed to assess socially relevant aspects of physical status and cognitive-behavioural competence following ABI. It is a 29-item inventory which is divided into three subdomains (Abilities, Adjustment, and Participation indices) covering a reasonably representative range. Twenty ABI patients with at least one-year discharge from the rehabilitation facilities were submitted to the Italian MPAI-4. They were independently rated by two different rehabilitation professionals and a family member/significant other serving as informant (SO). Internal consistency was assessed by calculating the Cronbach's alpha values. Inter-rater agreement for individual items was statistically examined by determining the intraclass correlation coefficient (ICC). In addition to the 8% of perfectly correspondent sentences, a clear prevalence (75.5%) of minor semantic variations and formal variations with no semantic value at the sentence-to-sentence matching was found. Full-scale Cronbach's alpha was 0.951 and 0.947 for the two professionals (rater #1 and rater #2, respectively), and was 0.957 for the family member serving as informant (rater #3). Full-scale ICC(2,1) between professionals and SOs was 0.804 (95% CI: 0.688-0.901). 
The Italian MPAI-4 shares many psychometric features with the original English version and demonstrates both good internal consistency and good inter-rater reliability. The MPAI-4 proves suitable for research applications in postacute settings as an efficient, broad and inclusive outcome measure for adult subjects with ABI.
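The ICC(2,1) reported above is the two-way random-effects, absolute-agreement, single-rater form of the intraclass correlation. The sketch below computes it from two-way ANOVA mean squares for a complete subjects-by-raters matrix. The ratings are illustrative only and are not MPAI-4 data, and `icc_2_1` is a hypothetical helper name.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a complete matrix: one row per subject, one column per rater."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Illustrative ratings: 6 subjects scored by 4 raters.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(example), 2))
```

Because this form measures absolute agreement, a constant offset between raters lowers the coefficient even when the raters rank subjects identically.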
Post-operative rotator cuff integrity, based on Sugaya's classification, can reflect abduction muscle strength of the shoulder.

PubMed

Yoshida, Masahito; Collin, Phillipe; Josseaume, Thierry; Lädermann, Alexandre; Goto, Hideyuki; Sugimoto, Katumasa; Otsuka, Takanobu

2018-01-01

Magnetic resonance (MR) imaging is common in structural and qualitative assessment of the rotator cuff post-operatively. Rotator cuff integrity has been thought to be associated with clinical outcome. The purpose of this study was to evaluate the inter-observer reliability of cuff integrity (Sugaya's classification) and assess the correlation between Sugaya's classification and the clinical outcome. It was hypothesized that Sugaya's classification would show good reliability and good correlation with the clinical outcome. Post-operative MR images were taken two years post-operatively, following arthroscopic rotator cuff repair. For assessment of inter-rater reliability, all radiographic evaluations for the supraspinatus muscle were done by two orthopaedic surgeons and one radiologist. Rotator cuff integrity was classified into five categories, according to Sugaya's classification. Fatty infiltration was graded into four categories, based on the Fuchs' classification grading system. Muscle hypotrophy was graded into four grades, according to the scale proposed by Warner. The clinical outcome was assessed according to the Constant scoring system pre-operatively and 2 years post-operatively. Of the sixty-two consecutive patients with full-thickness rotator cuff tears, fifty-two patients were reviewed in this study. These subjects included twenty-three men and twenty-nine women, with an average age of fifty-seven years. In terms of the inter-rater reliability between orthopaedic surgeons, Sugaya's classification showed the highest agreement [ICC(2,1) = 0.82] for rotator cuff integrity. The grades of fatty infiltration and muscle atrophy demonstrated good agreement (0.722 and 0.758, respectively). With regard to the inter-rater reliability between orthopaedic surgeon and radiologist, Sugaya's classification showed good reliability [ICC(2,1) = 0.70]. 
On the other hand, fatty infiltration and muscle hypotrophy classifications demonstrated fair and moderate agreement [ICC(2,1) = 0.39 and 0.49]. Although no significant correlation was found between overall post-operative Constant score and Sugaya's classification, Sugaya's classification indicated significant correlation with the muscle strength score. Sugaya's classification showed repeatability and good agreement between the orthopaedist and radiologist, who are involved in the patient care for the rotator cuff tear. Common classification of rotator cuff integrity with good reliability will give appropriate information for clinicians to improve the patient care of the rotator cuff tear. This classification also would be helpful to predict the strength of arm abduction in the scapular plane. Level of evidence: IV.
Content Validity Index and Intra- and Inter-Rater Reliability of a New Muscle Strength/Endurance Test Battery for Swedish Soldiers

PubMed Central

Larsson, Helena; Tegern, Matthias; Monnier, Andreas; Skoglund, Jörgen; Helander, Charlotte; Persson, Emelie; Malm, Christer; Broman, Lisbet; Aasa, Ulrika

2015-01-01

The objective of this study was to examine the content validity of commonly used muscle performance tests in military personnel and to investigate the reliability of a proposed test battery. For the content validity investigation, thirty selected tests were those described in the literature and/or commonly used in the Nordic and North Atlantic Treaty Organization (NATO) countries. Nine selected experts rated, on a four-point Likert scale, the relevance of these tests in relation to five different work tasks: lifting, carrying equipment on the body or in the hands, climbing, and digging. Thereafter, a content validity index (CVI) was calculated for each work task. The result showed excellent CVI (≥0.78) for sixteen tests in relation to one or more of the military work tasks. Three of the tests, the functional lower-limb loading test (the Ranger test), dead-lift with kettlebells, and back extension, showed excellent content validity for four of the work tasks. For the development of a new muscle strength/endurance test battery, these three tests were further supplemented with two other tests, namely, the chins and side-bridge test. The inter-rater reliability was high (intraclass correlation coefficient, ICC2,1 0.99) for all five tests. The intra-rater reliability was good to high (ICC3,1 0.82–0.96) with an acceptable standard error of mean (SEM), except for the side-bridge test (SEM%>15). Thus, the final suggested test battery for a valid and reliable evaluation of soldiers’ muscle performance comprised the following four tests: the Ranger test, dead-lift with kettlebells, chins, and back extension test. The criterion-related validity of the test battery should be further evaluated for soldiers exposed to varying physical workload. PMID:26177030
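The CVI threshold above can be made concrete with a short calculation. The sketch below assumes the common convention that a rating of 3 or 4 on a four-point scale counts as "relevant"; the abstract does not state the rule the authors used. The ratings are hypothetical.

```python
def content_validity_index(ratings, relevant_from=3):
    """Share of experts whose rating is at or above the relevance cut-off (assumed to be 3)."""
    return sum(1 for r in ratings if r >= relevant_from) / len(ratings)


# Hypothetical ratings from nine experts for one test and one work task.
expert_ratings = [4, 4, 3, 3, 4, 3, 4, 2, 1]
cvi = content_validity_index(expert_ratings)
print(round(cvi, 2), cvi >= 0.775)
```

Under this assumption, seven of nine experts rating a test as relevant gives 7/9, which rounds to the 0.78 cut-off quoted in the abstract.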

Reliability and Validity of the Italian Version of the Protocol of Orofacial Myofunctional Evaluation with Scores (I-OMES).

PubMed

Scarponi, Letizia; de Felicio, Claudia Maria; Sforza, Chiarella; Pimenta Ferreira, Claudia Lucia; Ginocchio, Daniela; Pizzorni, Nicole; Barozzi, Stefania; Mozzanica, Francesco; Schindler, Antonio

2018-05-30

To evaluate the reliability, validity, and responsiveness of the Italian OMES (I-OMES). The study consisted of 3 phases: (1) internal consistency and reliability, (2) validity, and (3) responsiveness analysis. The recruited population included 27 patients with orofacial myofunctional disorders (OMD) and 174 healthy volunteers. Forty-seven subjects, 18 healthy and all recruited patients with OMD, were assessed for inter-rater and test-retest reliability analysis. I-OMES and Nordic Orofacial Test - Screening (NOT-S) scores of the patients were correlated for concurrent validity analysis. I-OMES scores from 27 patients with OMD and 27 age- and gender-matched healthy subjects were compared to investigate construct validity. I-OMES scores before and after successful swallowing rehabilitation in patients were compared for responsiveness analysis. Adequate internal consistency (Cronbach α = 0.71) and strong inter-rater and test-retest reliability (intraclass correlation coefficient = 0.97 and 0.98, respectively) were found. I-OMES and NOT-S scores were significantly and inversely correlated (r = -0.38). A statistically significant difference (p < 0.001) was found between the pathological group and the control group for the total I-OMES score. The mean I-OMES score improved from 90 (78-102) to 99 (89-103) after myofunctional rehabilitation (p < 0.001). The I-OMES is a reliable and valid tool to evaluate OMD. © 2018 S. Karger AG, Basel.
Reliability of tristimulus colourimetry in the assessment of cutaneous bruise colour.

PubMed

Scafide, Katherine N; Sheridan, Daniel J; Taylor, Laura A; Hayat, Matthew J

2016-06-01

Bruising is one of the most common types of injury clinicians observe among victims of violence and other trauma patients. However, research has shown commonly used qualitative description of cutaneous bruise colour via the naked eye is subjective and unreliable. No published work has formally evaluated the reliability of tristimulus colourimetry as an alternative for assessing bruise colour, despite its clinical and research applications in accurately assessing skin colour. The purpose of this study was to systematically evaluate the test-retest and inter-observer reliability of tristimulus colourimetry in the assessment of cutaneous bruise colour. Two researchers obtained repeated tristimulus colourimetry measures of cutaneous bruises with participants of diverse skin colour. Measures were obtained using the Minolta CR-400 Chroma Meter. Commission Internationale d'Eclairage (CIE) L*a*b* colour space was used. Data were analysed using intraclass correlation coefficients (ICC), Cronbach's alpha, and minimal detectable change (MDC) on all three L*a*b* values. The colorimeter demonstrated excellent test-retest or intra-rater reliability (L* ICC=0.999; a* ICC=0.973; b* ICC=0.892) and inter-rater reliability (L* ICC=0.997; a* ICC=0.976; b* ICC=0.982). With consistent placement, tristimulus colourimetry is reliable for the objective assessment and documentation of cutaneous bruise colour for purposes of clinical practice and research. Recommendations for use in practice/research are provided. Copyright © 2016 Elsevier Ltd. All rights reserved.
Cutting costs of multiple mini-interviews – changes in reliability and efficiency of the Hamburg medical school admission test between two applications

PubMed Central

2014-01-01

Background Multiple mini-interviews (MMIs) are a valuable tool in medical school selection due to their broad acceptance and promising psychometric properties. With respect to the high expenses associated with this procedure, the discussion about its feasibility should be extended to cost-effectiveness issues. Methods Following a pilot test of MMIs for medical school admission at Hamburg University in 2009 (HAM-Int), we took several actions to improve reliability and to reduce costs of the subsequent procedure in 2010. For both years, we assessed overall and inter-rater reliabilities based on multilevel analyses. Moreover, we provide a detailed specification of costs, as well as an extrapolation of the interrelation of costs, reliability, and the setup of the procedure. Results The overall reliability of the initial 2009 HAM-Int procedure with twelve stations and an average of 2.33 raters per station was ICC=0.75. Following the improvement actions, in 2010 the ICC remained stable at 0.76, despite the reduction of the process to nine stations and 2.17 raters per station. Moreover, costs were cut down from $915 to $495 per candidate. With the 2010 modalities, we could have reached an ICC of 0.80 with 16 single rater stations ($570 per candidate). Conclusions With respect to reliability and cost-efficiency, it is generally worthwhile to invest in scoring, rater training and scenario development. Moreover, it is more beneficial to increase the number of stations instead of raters within stations. However, if we want to achieve more than 80 % reliability, a minor improvement is paid with skyrocketing costs. PMID:24645665
Optical coherence tomography allows for the reliable identification of laryngeal epithelial dysplasia and for precise biopsy: a clinicopathological study of 61 patients undergoing microlaryngoscopy.

PubMed

Just, Tino; Lankenau, Eva; Prall, Friedrich; Hüttmann, Gereon; Pau, Hans Wilhelm; Sommer, Konrad

2010-10-01

A newly developed microscope-based spectral-domain optical coherence tomography (SD-OCT) device and an endoscope-based time-domain OCT (TD-OCT) were used to assess the inter-rater reliability, sensitivity, specificity, and accuracy of benign and dysplastic laryngeal epithelial lesions. Prospective study. OCT during microlaryngoscopy was done on 35 patients with an endoscope-based TD-OCT, and on 26 patients by an SD-OCT system integrated into an operating microscope. Biopsies were taken from microscopically suspicious lesions allowing comparative study of OCT images and histology. Thickness of the epithelium was seen to be the main criterion for degree of dysplasia. The inter-rater reliability for two observers was found to be kappa = 0.74 (P <.001) for OCT. OCT provided test outcomes for differentiation between benign laryngeal lesions and dysplasia/CIS with sensitivity of 88%, specificity of 89%, PPV of 85%, NPV of 91%, and predictive accuracy of 88%. However, because of the limited penetration depth of the laser light primarily in hyperkeratotic lesions (thickness above 1.5 mm), the basal cell layer was no longer visible, precluding reliable assessment of such lesions. OCT allows for a fairly accurate assessment of benign and dysplastic laryngeal epithelial lesion and greatly facilitates the taking of precise biopsies. Laryngoscope, 2010.
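The diagnostic figures quoted above all derive from a 2x2 table of test result against histology. The sketch below shows the arithmetic. The counts are hypothetical, chosen only to give values close to those reported; the abstract does not give the underlying table.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard accuracy measures from true/false positive and negative counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts (not from the paper): dysplasia/CIS is the positive class.
metrics = diagnostic_metrics(tp=22, fp=4, fn=3, tn=32)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are properties of the test, whereas PPV and NPV also depend on how common dysplasia is in the sample.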
Facial Angiofibroma Severity Index (FASI): reliability assessment of a new tool developed to measure severity and responsiveness to therapy in tuberous sclerosis-associated facial angiofibroma.

PubMed

Salido-Vallejo, R; Ruano, J; Garnacho-Saucedo, G; Godoy-Gijón, E; Llorca, D; Gómez-Fernández, C; Moreno-Giménez, J C

2014-12-01

Tuberous sclerosis complex (TSC) is an autosomal dominant neurocutaneous disorder characterized by the development of multisystem hamartomatous tumours. Topical sirolimus has recently been suggested as a potential treatment for TSC-associated facial angiofibroma (FA). To validate a reproducible scale created for the assessment of clinical severity and treatment response in these patients. We developed a new tool, the Facial Angiofibroma Severity Index (FASI), to evaluate the grade of erythema and the size and extent of FAs. In total, 30 different photographs of patients with TSC were shown to 56 dermatologists at each evaluation. Three evaluations using the same photographs but in a different random order were performed 1 week apart. Test and retest reliability and interobserver reproducibility were determined. There was good agreement between the investigators. Inter-rater reliability was strong, with intraclass correlation coefficients (ICCs) for the FASI of > 0.98 (range 0.97-0.99). The global estimated kappa coefficient for the degree of intra-rater agreement (test-retest) was 0.94 (range 0.91-0.97). The FASI is a valid and reliable tool for measuring the clinical severity of TSC-associated FAs, which can be applied in clinical practice to evaluate the response to treatment in these patients. © 2014 British Association of Dermatologists.
Validity and reliability of the de Morton Mobility Index in the subacute hospital setting in a geriatric evaluation and management population.

PubMed

de Morton, Natalie A; Lane, Kylie

2010-11-01

To investigate the clinimetric properties of the de Morton Mobility Index (DEMMI) in a Geriatric Evaluation and Management (GEM) population. A longitudinal validation study (n = 100) and inter-rater reliability study (n = 29) in a GEM population. Consecutive patients admitted to a GEM rehabilitation ward were eligible for inclusion. At hospital admission and discharge, a physical therapist assessed patients with physical performance instruments that included the 6-metre walk test, step test, Clinical Test of Sensory Organization and Balance, Timed Up and Go test, 6-minute walk test and the DEMMI. Consecutively eligible patients were included in an inter-rater reliability study between physical therapists. DEMMI admission scores were normally distributed (mean 30.2, standard deviation 16.7) and other activity limitation instruments had either a floor or a ceiling effect. Evidence of convergent, discriminant and known groups validity for the DEMMI were obtained. The minimal detectable change with 90% confidence was 10.5 (95% confidence interval 6.1-17.9) points and the minimally clinically important difference was 8.4 points on the 100-point interval DEMMI scale. The DEMMI provides clinicians with an accurate and valid method of measuring mobility for geriatric patients in the subacute hospital setting.
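The minimal detectable change at 90% confidence (MDC90) reported above is conventionally derived from the standard error of measurement (SEM). The sketch below uses the usual formulas. The SEM and reliability values are hypothetical, because the abstract does not report them, and both function names are illustrative.

```python
import math


def sem_from_reliability(sd, reliability):
    """Standard error of measurement from a sample SD and a reliability coefficient."""
    return sd * math.sqrt(1 - reliability)


def minimal_detectable_change(sem, z=1.645):
    """MDC at the confidence level implied by z (1.645 for 90%, 1.96 for 95%)."""
    # sqrt(2) accounts for measurement error in both the first and second score.
    return z * math.sqrt(2) * sem


# Hypothetical SEM of 4.5 points on a 100-point scale.
print(round(minimal_detectable_change(4.5), 1))
```

Under these assumptions, an SEM of about 4.5 points would correspond to an MDC90 close to the 10.5 points reported for the DEMMI.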
Review and Evaluation of Mindfulness-Based iPhone Apps

PubMed Central

Kavanagh, David J; Hides, Leanne; Stoyanov, Stoyan R

2015-01-01

Background There is growing evidence for the positive impact of mindfulness on wellbeing. Mindfulness-based mobile apps may have potential as an alternative delivery medium for training. While there are hundreds of such apps, there is little information on their quality. Objective This study aimed to conduct a systematic review of mindfulness-based iPhone mobile apps and to evaluate their quality using a recently-developed expert rating scale, the Mobile Application Rating Scale (MARS). It also aimed to describe features of selected high-quality mindfulness apps. Methods A search for “mindfulness” was conducted in iTunes and Google Apps Marketplace. Apps that provided mindfulness training and education were included. Those containing only reminders, timers or guided meditation tracks were excluded. An expert rater reviewed and rated app quality using the MARS engagement, functionality, visual aesthetics, information quality and subjective quality subscales. A second rater provided MARS ratings on 30% of the apps for inter-rater reliability purposes. Results The “mindfulness” search identified 700 apps. However, 94 were duplicates, 6 were not accessible and 40 were not in English. Of the remaining 560, 23 apps met inclusion criteria and were reviewed. The median MARS score was 3.2 (out of 5.0), which exceeded the minimum acceptable score (3.0). The Headspace app had the highest average score (4.0), followed by Smiling Mind (3.7), iMindfulness (3.5) and Mindfulness Daily (3.5). There was a high level of inter-rater reliability between the two MARS raters. Conclusions Though many apps claim to be mindfulness-related, most were guided meditation apps, timers, or reminders. Very few had high ratings on the MARS subscales of visual aesthetics, engagement, functionality or information quality. Little evidence is available on the efficacy of the apps in developing mindfulness. PMID:26290327
Review and Evaluation of Mindfulness-Based iPhone Apps.

PubMed

Mani, Madhavan; Kavanagh, David J; Hides, Leanne; Stoyanov, Stoyan R

2015-08-19

There is growing evidence for the positive impact of mindfulness on wellbeing. Mindfulness-based mobile apps may have potential as an alternative delivery medium for training. While there are hundreds of such apps, there is little information on their quality. This study aimed to conduct a systematic review of mindfulness-based iPhone mobile apps and to evaluate their quality using a recently-developed expert rating scale, the Mobile Application Rating Scale (MARS). It also aimed to describe features of selected high-quality mindfulness apps. A search for "mindfulness" was conducted in iTunes and Google Apps Marketplace. Apps that provided mindfulness training and education were included. Those containing only reminders, timers or guided meditation tracks were excluded. An expert rater reviewed and rated app quality using the MARS engagement, functionality, visual aesthetics, information quality and subjective quality subscales. A second rater provided MARS ratings on 30% of the apps for inter-rater reliability purposes. The "mindfulness" search identified 700 apps. However, 94 were duplicates, 6 were not accessible and 40 were not in English. Of the remaining 560, 23 apps met inclusion criteria and were reviewed. The median MARS score was 3.2 (out of 5.0), which exceeded the minimum acceptable score (3.0). The Headspace app had the highest average score (4.0), followed by Smiling Mind (3.7), iMindfulness (3.5) and Mindfulness Daily (3.5). There was a high level of inter-rater reliability between the two MARS raters. Though many apps claim to be mindfulness-related, most were guided meditation apps, timers, or reminders. Very few had high ratings on the MARS subscales of visual aesthetics, engagement, functionality or information quality. Little evidence is available on the efficacy of the apps in developing mindfulness.
The Frontal Behavioural Inventory (Italian version) differentiates frontotemporal lobar degeneration variants from Alzheimer's disease.

PubMed

Alberici, A; Geroldi, C; Cotelli, M; Adorni, A; Calabria, M; Rossi, G; Borroni, B; Padovani, A; Zanetti, O; Kertesz, A

2007-04-01

The objective was to evaluate the construct validity of the Italian version of the Frontal Behavioural Inventory (FBI) and its usefulness in the differential diagnosis of dementias. Standard criteria were used in the clinical diagnosis of dementias in 83 patients and 33 age-matched healthy volunteers. The FBI scale was translated from English into Italian and back-translated. Cronbach's alpha, inter-rater and test-retest reliability, FBI convergent validity and discriminant analysis were calculated. FBI profile was compared between patients affected by frontotemporal lobar degeneration (FTLD) and Alzheimer's disease (AD). The FBI showed a high internal consistency and inter-rater reliability and it distinguished normal behavioural conditions from those presented in FTLD or AD. An 86.8% diagnostic accuracy was calculated by the discriminant analysis, selecting only age at disease onset and FBI, and particularly distinguishing behavioural variants within the FTLD spectrum. FTLD patients showed a characteristic behavioural profile. The FBI might be a reliable and useful diagnostic tool for dementias in clinical practice.
Reliability of a New Radiographic Classification for Developmental Dysplasia of the Hip.

PubMed

Narayanan, Unni; Mulpuri, Kishore; Sankar, Wudbhav N; Clarke, Nicholas M P; Hosalkar, Harish; Price, Charles T

2015-01-01

Existing radiographic classification schemes (eg, Tönnis criteria) for DDH quantify the severity of disease based on the position of the ossific nucleus relative to Hilgenreiner's and Perkin's lines. By definition, this method requires the presence of an ossification centre, which can be delayed in appearance and eccentric in location within the femoral head. A new radiographic classification system has been developed by the International Hip Dysplasia Institute (IHDI), which uses the mid-point of the proximal femoral metaphysis as a reference landmark, and can therefore be applied to children of all ages. The purpose of this study was to compare the reliability of this new method with that of Tönnis, as the first step in establishing its validity and clinical utility. Twenty standardized anteroposterior pelvic radiographs of children with untreated DDH were selected purposefully to capture the spectrum of age (range, 3 to 32 mo) at presentation and disease severity. Each of the hips was classified separately by the IHDI and Tönnis methods by 6 experienced pediatric orthopaedists from the United States, Canada, Mexico, and the United Kingdom, and by 2 orthopaedic senior residents. The inter-rater reliability was tested using the intraclass correlation coefficient (ICC) to measure concordance between raters. All 40 hips were classifiable by the IHDI method by all raters. Ten of the 40 hips could not be classified by the Tönnis method because of the absence of the ossific nucleus on one or both sides. The ICC (95% confidence interval) for the IHDI method for all raters was 0.90 (0.83-0.95) and 0.95 (0.91-0.98) for the right and left hips, respectively. The corresponding ICCs for the Tönnis method were 0.63 (0.46-0.80) and 0.60 (0.43-0.78), respectively. There was no significant difference between the ICCs of the 6 experts and 2 trainees. 
The IHDI method of classification has excellent inter-rater reliability, both among experts and novices, and is more widely applicable than the Tönnis method as it can be applied even when the ossification centre is absent. Level II (diagnostic).
Bronchiolitis Score of Sant Joan de Déu: BROSJOD Score, validation and usefulness.

PubMed

Balaguer, Mònica; Alejandre, Carme; Vila, David; Esteban, Elisabeth; Carrasco, Josep L; Cambra, Francisco José; Jordan, Iolanda

2017-04-01

To validate the bronchiolitis score of Sant Joan de Déu (BROSJOD) and to examine the previously defined scoring cutoff. Prospective, observational study. BROSJOD scoring was done by two independent physicians (at admission, 24 and 48 hr). Internal consistency of the score was assessed using Cronbach's α. To determine inter-rater reliability, the concordance correlation coefficient (CCC), estimated as an intraclass correlation coefficient, and limits of agreement, estimated as the 90% total deviation index (TDI), were calculated. An expert opinion was used to classify patients according to clinical severity. A validity analysis was conducted comparing the 3-level classification score to that expert opinion. Volume under the surface (VUS), predictive values, and probability of correct classification (PCC) were measured to assess discriminant validity. A total of 112 patients were recruited, 62 of them (55.4%) males. Median age: 52.5 days (IQR: 32.75-115.25). The admission Cronbach's α was 0.77 (CI95%: 0.71; 0.82) and at 24 hr it was 0.65 (CI95%: 0.48; 0.7). The inter-rater reliability analysis was: CCC at admission 0.96 (95%CI 0.94-0.97), at 24 hr 0.77 (95%CI 0.65-0.86), and at 48 hr 0.94 (95%CI 0.94-0.97); TDI 90%: 1.6, 2.9, and 1.57, respectively. The discriminant validity at admission: VUS of 0.8 (95%CI 0.70-0.90), at 24 hr 0.92 (95%CI 0.85-0.99), and at 48 hr 0.93 (95%CI 0.87-0.99). The predictive values and PCC values were within 38-100% depending on the level of clinical severity. There is a high inter-rater reliability, showing the BROSJOD score to be reliable and valid, even when different observers apply it. Pediatr Pulmonol. 2017;52:533-539. © 2016 Wiley Periodicals, Inc.
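As an illustration of the internal-consistency statistic reported in the record above, the following is a minimal sketch of Cronbach's α computed from a subjects-by-items score table. The function name `cronbach_alpha` and the toy data are assumptions made for illustration; this is not the authors' analysis code, and the BROSJOD item scores themselves are not given in the abstract.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table with one row per subject, one column per item."""
    k = len(scores[0])  # number of items in the score
    # Sample variance of each item across subjects.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of each subject's total score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical two-item score for three subjects.
toy_scores = [[1, 1], [2, 3], [3, 2]]
alpha = cronbach_alpha(toy_scores)
```

Alpha rises as the items covary more strongly, because the variance of the totals then exceeds the sum of the item variances by a wider margin.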
Assessment of wheelchair driving performance in a virtual reality-based simulator

PubMed Central

Mahajan, Harshal P.; Dicianno, Brad E.; Cooper, Rory A.; Ding, Dan

2013-01-01

Objective: To develop a virtual reality (VR)-based simulator that can assist clinicians in performing standardized wheelchair driving assessments. Design: A completely within-subjects repeated measures design. Methods: Participants drove their wheelchairs along a virtual driving circuit modeled after the Power Mobility Road Test (PMRT) and in a hallway with decreasing width. The virtual simulator was displayed on a computer screen and VR screens, and participants interacted with it using a set of instrumented rollers and a wheelchair joystick. Driving performances of participants were estimated and compared using quantitative metrics from the simulator. Qualitative ratings from two experienced clinicians were used to estimate intra- and inter-rater reliability. Results: Ten regular wheelchair users (seven men, three women; mean age ± SD, 39.5 ± 15.39 years) participated. The virtual PMRT scores from the two clinicians show high inter-rater reliability (78–90%) and high intra-rater reliability (71–90%) for all test conditions. More research is required to explore user preferences and effectiveness of the two control methods (rollers and mathematical model) and the display screens. Conclusions: The virtual driving simulator seems to be a promising tool for wheelchair driving assessment that clinicians can use to supplement their real-world evaluations. PMID:23820148
The Chelsea critical care physical assessment tool (CPAx): validation of an innovative new tool to measure physical morbidity in the general adult critical care population; an observational proof-of-concept pilot study.

PubMed

Corner, E J; Wood, H; Englebretsen, C; Thomas, A; Grant, R L; Nikoletou, D; Soni, N

2013-03-01

To develop a scoring system to measure physical morbidity in critical care - the Chelsea Critical Care Physical Assessment Tool (CPAx). The development process was iterative involving content validity indices (CVI), a focus group and an observational study of 33 patients to test construct validity against the Medical Research Council score for muscle strength, peak cough flow, Australian Therapy Outcome Measures score, Glasgow Coma Scale score, Bloomsbury sedation score, Sequential Organ Failure Assessment score, Short Form 36 (SF-36) score, days of mechanical ventilation and inter-rater reliability. Trauma and general critical care patients from two London teaching hospitals. Users of the CPAx felt that it possessed content validity, giving a final CVI of 1.00 (P<0.05). Construct validation data showed moderate to strong significant correlations between the CPAx score and all secondary measures, apart from the mental component of the SF-36 which demonstrated weak correlation with the CPAx score (r=0.024, P=0.720). Reliability testing showed internal consistency of α=0.798 and inter-rater reliability of κ=0.988 (95% confidence interval 0.791 to 1.000) between five raters. This pilot work supports proof of concept of the CPAx as a measure of physical morbidity in the critical care population, and is a cogent argument for further investigation of the scoring system. Copyright © 2012 Chartered Society of Physiotherapy. Published by Elsevier Ltd. All rights reserved.
How Reliable Are Students' Evaluations of Teaching Quality? A Variance Components Approach

ERIC Educational Resources Information Center

Feistauer, Daniela; Richter, Tobias

2017-01-01

The inter-rater reliability of university students' evaluations of teaching quality was examined with cross-classified multilevel models. Students (N = 480) evaluated lectures and seminars over three years with a standardised evaluation questionnaire, yielding 4224 data points. The total variance of these student evaluations was separated into the…
Proposing Melasma Severity Index: A New, More Practical, Office-based Scoring System for Assessing the Severity of Melasma

PubMed Central

Majid, Imran; Haq, Inaamul; Imran, Saher; Keen, Abid; Aziz, Khalid; Arif, Tasleem

2016-01-01

Background: Melasma Area and Severity Index (MASI), the scoring system in melasma, needs to be refined. Aims and Objectives: To propose a more practical scoring system, named the Melasma Severity Index (MSI), for assessing the disease severity and treatment response in melasma. Materials and Methods: Four dermatologists were trained to calculate MASI and also the proposed MSI scores. For MSI, the formula used was $0.4\,(a \times p^{2})_{l} + 0.4\,(a \times p^{2})_{r} + 0.2\,(a \times p^{2})_{n}$, where “a” stands for area, “p” for pigmentation, “l” for left face, “r” for right face, and “n” for nose. On a single day, 30 enrolled patients were randomly examined by each trained dermatologist and their MASI and MSI scores were calculated. Next, each rater re-examined every 6th patient for repeat MASI and MSI scoring to assess intra- and inter-rater reliability of MASI and MSI scores. Validity was assessed by comparing the individual scores of each rater with objective data from mexameter and ImageJ software. Results: Inter-rater reliability, as assessed by intraclass correlation coefficient, was significantly higher for MSI (0.955) as compared to MASI (0.816). Correlation of scores with objective data by Spearman's correlation revealed higher rho values for MSI than for MASI for all raters. Limitations: Sample population belonged to a single ethnic group. Conclusions: MSI is a simpler and more practical scoring system for melasma. PMID:26955093
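The MSI formula quoted in the record above is a weighted sum of area times squared pigmentation over three facial regions, which can be sketched in a few lines. The function name and the example grades below are hypothetical; the abstract does not state the grading scales used for area and pigmentation, so the inputs are assumed to be already-graded numbers.

```python
def melasma_severity_index(left, right, nose):
    """MSI from (area, pigmentation) pairs for left face, right face and nose."""

    def term(region):
        area, pigmentation = region
        return area * pigmentation ** 2  # a × p² for one region

    # Weights 0.4 / 0.4 / 0.2 as given in the abstract.
    return 0.4 * term(left) + 0.4 * term(right) + 0.2 * term(nose)


# Hypothetical grades: left (a=2, p=3), right (a=1, p=2), nose (a=1, p=1).
score = melasma_severity_index((2, 3), (1, 2), (1, 1))
```

Because pigmentation enters squared, a one-step change in pigmentation grade moves the score more than a one-step change in area grade.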
The reliability of dual-energy X-ray absorptiometry measurements of bone mineral density in the metatarsals.

PubMed

Fuller, Joel T; Archer, Jane; Buckley, Jonathan D; Tsiros, Margarita D; Thewlis, Dominic

2016-01-01

To investigate the reliability of a simple, efficient technique for measuring bone mineral density (BMD) in the metatarsals using dual-energy X-ray absorptiometry (DXA). BMD of the right foot of 32 trained male distance runners was measured using a DXA scanner with the foot in the plantar position. Separate regions of interest (ROI) were used to assess the BMD of each metatarsal shaft (1st-5th) for each participant. ROI analysis was repeated by the same investigator to determine within-scan intra-rater reliability and by a different investigator to determine within-scan inter-rater reliability. Repeat DXA scans were undertaken for ten participants to assess between-scan intra-rater reliability. Assessment of BMD was consistently most reliable for the first metatarsal across all domains of reliability assessed (intra-class correlation coefficient [ICC] ≥0.97; coefficient of variation [CV] ≤1.5%; limits of agreement [LOA] ≤4.2%). Reasonable levels of intra-rater reliability were also achieved for the second and fifth metatarsals (ICC ≥0.90; CV ≤4.2%; LOA ≤11.9%). Poorer levels of reliability were demonstrated for the third (ICC ≥0.64; CV ≤8.2%; LOA ≤23.6%) and fourth metatarsals (ICC ≥0.67; CV ≤9.6%; LOA ≤27.5%). BMD was greatest in the first and second metatarsals (P < 0.01). Reliable measurements of BMD were achieved for the first, second and fifth metatarsals.
The Smile Esthetic Index (SEI): A method to measure the esthetics of the smile. An intra-rater and inter-rater agreement study.

PubMed

Rotundo, Roberto; Nieri, Michele; Bonaccini, Daniele; Mori, Massimiliano; Lamberti, Elena; Massironi, Domenico; Giachetti, Luca; Franchi, Lorenzo; Venezia, Piero; Cavalcanti, Raffaele; Bondi, Elena; Farneti, Mauro; Pinchi, Vilma; Buti, Jacopo

2015-01-01

To propose a method to measure the esthetics of the smile and to report its validation by means of an intra-rater and inter-rater agreement analysis. Ten variables were chosen as determinants for the esthetics of a smile: smile line and facial midline, tooth alignment, tooth deformity, tooth dischromy, gingival dischromy, gingival recession, gingival excess, gingival scars and diastema/missing papillae. One examiner consecutively selected seventy smile pictures, which were in the frontal view. Ten examiners, with different levels of clinical experience and specialties, applied the proposed assessment method twice on the selected pictures, independently and blindly. Intraclass correlation coefficient (ICC) and Fleiss' kappa statistics were performed to analyse the intra-rater and inter-rater agreement. Considering the cumulative assessment of the Smile Esthetic Index (SEI), the ICC value for the inter-rater agreement of the 10 examiners was 0.62 (95% CI: 0.51 to 0.72), representing a substantial agreement. Intra-rater agreement ranged from 0.86 to 0.99. Inter-rater agreement (Fleiss' kappa statistics) calculated for each variable ranged from 0.17 to 0.75. The SEI was a reproducible method to assess the esthetic component of the smile, useful for the diagnostic phase and for setting appropriate treatment plans.
Rater agreement of visual lameness assessment in horses during lungeing.

PubMed

Hammarberg, M; Egenvall, A; Pfau, T; Rhodin, M

2016-01-01

Lungeing is an important part of lameness examinations as the circular path may accentuate low-grade lameness. Movement asymmetries related to the circular path, to compensatory movements and to pain make the lameness evaluation complex. Scientific studies have shown high inter-rater variation when assessing lameness during straight line movement. The aim was to estimate inter- and intra-rater agreement of equine veterinarians evaluating lameness from videos of sound and lame horses during lungeing and to investigate the influence of veterinarians' experience and the objective degree of movement asymmetry on rater agreement. Cross-sectional observational study. Video recordings and quantitative gait analysis with inertial sensors were performed in 23 riding horses of various breeds. The horses were examined at trot on a straight line and during lungeing on soft or hard surfaces in both directions. One video sequence was recorded per condition and the horses were classified as forelimb lame, hindlimb lame or sound from objective straight line symmetry measurements. Equine veterinarians (n = 86), including 43 with >5 years of orthopaedic experience, participated in a web-based survey and were asked to identify the lamest limb on 60 videos, including 10 repeats. The agreements between (inter-rater) and within (intra-rater) veterinarians were analysed with κ statistics (Fleiss, Cohen). Inter-rater agreement κ was 0.31 (0.38/0.25 for experienced/less experienced) and higher for forelimb (0.33) than for hindlimb lameness (0.11) or soundness (0.08) evaluation. Median intra-rater agreement κ was 0.57. Inter-rater agreement was poor for less experienced raters, and for all raters when evaluating hindlimb lameness. 
Since identification of the lame limb/limbs is a prerequisite for successful diagnosis, treatment and recovery, the high inter-rater variation when evaluating lameness on the lunge is likely to influence the accuracy and repeatability of lameness examinations and, indirectly, the efficacy of treatment. © 2015 The Authors. Equine Veterinary Journal published by John Wiley & Sons Ltd on behalf of EVJ Ltd.
Inter-rater reliability of kinesthetic measurements with the KINARM robotic exoskeleton.

PubMed

Semrau, Jennifer A; Herter, Troy M; Scott, Stephen H; Dukelow, Sean P

2017-05-22

Kinesthesia (sense of limb movement) has been extremely difficult to measure objectively, especially in individuals who have survived a stroke. The development of valid and reliable measurements for proprioception is important to developing a better understanding of proprioceptive impairments after stroke and their impact on the ability to perform daily activities. We recently developed a robotic task to evaluate kinesthetic deficits after stroke and found that the majority (~60%) of stroke survivors exhibit significant deficits in kinesthesia within the first 10 days post-stroke. Here we aim to determine the inter-rater reliability of this robotic kinesthetic matching task. Twenty-five neurologically intact control subjects and 15 individuals with first-time stroke were evaluated on a robotic kinesthetic matching task (KIN). Subjects sat in a robotic exoskeleton with their arms supported against gravity. In the KIN task, the robot moved the subjects' stroke-affected arm at a preset speed, direction and distance. As soon as subjects felt the robot begin to move their affected arm, they matched the robot movement with the unaffected arm. Subjects were tested in two sessions on the KIN task: initial session and then a second session (within an average of 18.2 ± 13.8 h of the initial session for stroke subjects), which were supervised by different technicians. The task was performed both with and without the use of vision in both sessions. We evaluated intra-class correlations of spatial and temporal parameters derived from the KIN task to determine the reliability of the robotic task. We evaluated 8 spatial and temporal parameters that quantify kinesthetic behavior. We found that the parameters exhibited moderate to high intra-class correlations between the initial and retest conditions (Range, r-value = [0.53-0.97]). The robotic KIN task exhibited good inter-rater reliability. 
This validates the KIN task as a reliable, objective method for quantifying kinesthesia after stroke.
A Study of the Inter-Rater Reliability of University Application Readers in a Holistic Admissions Review Process

ERIC Educational Resources Information Center

Moody Rideout, Blaire Lauren

2017-01-01

In 2015, the American Council on Education surveyed undergraduate admission and enrollment management leaders at 338 four-year institutions to understand holistic admissions review (Espinosa, Gaertner, and Orfield, 2015). In the report titled, Race, Class and College Access: Achieving Diversity in a Shifting Legal Landscape, 92% of selective…

Utility of Angle Correction for Hemodynamic Measurements with Doppler Echocardiography.

PubMed

Sigurdsson, Martin I; Eoh, Eun J; Chow, Vinca W; Waldron, Nathan H; Cleve, Jayne; Nicoara, Alina; Swaminathan, Madhav

2018-04-06

The routine application of angle correction (AnC) in hemodynamic measurements with transesophageal echocardiography currently is not recommended but potentially could be beneficial. The authors hypothesized that AnC can be applied reliably and may change grading of aortic stenosis (AS). Retrospective analysis. Single institution, university hospital. During phase I, use of AnC was assessed in 60 consecutive patients with intraoperative transesophageal echocardiography. During phase II, 129 images from a retrospective cohort of 117 cases were used to quantify AS by mean pressure gradient. A panel of observers used custom-written software in Java to measure intra-individual and inter-individual correlation in AnC application, correlation with preoperative transthoracic echocardiography gradients, and regrading of AS after AnC. For phase I, the median AnC was 21 (16-35) degrees, and 17% of patients required no AnC. For phase II, the median AnC was 7 (0-15) degrees, and 37% of assessed images required no AnC. The mean inter-individual and intra-individual correlation for AnC was 0.50 (95% confidence interval [CI] 0.49-0.52) and 0.87 (95% CI 0.82-0.92), respectively. AnC did not improve agreement with the transthoracic echocardiography mean pressure gradient. The mean inter-rater and intra-rater agreement for grading AS severity was 0.82 (95% CI 0.81-0.83) and 0.95 (95% CI 0.91-0.95), respectively. A total of 241 (7%) AS gradings were reclassified after AnC was applied, mostly when the uncorrected mean gradient was within 5 mmHg of the severity classification cutoff. AnC can be performed with a modest inter-rater and intra-rater correlation and high degree of inter-rater and intra-rater agreement for AS severity grading. Copyright © 2018 Elsevier Inc. All rights reserved.
Are photographic records reliable for orthodontic screening?

PubMed

Mandall, N A

2002-06-01

The aim of the study was to evaluate the reliability of a panel of orthodontists for accepting new patient referrals based on clinical photographs. Eight orthodontists from Greater Manchester, Lancashire, Chester, and Derbyshire observed clinical photographs of 40 consecutive new patients attending the orthodontic department, Hope Hospital, Salford. They recorded whether or not they would accept the patient, as a new patient referral, in their department. Each consultant was asked to take into account factors, such as oral hygiene, dental development, and severity of the malocclusion. Kappa statistic for multiple-rater agreement and kappa statistic for intra-observer reliability were calculated. Inter-observer panel agreement for accepting new patient referrals based on photographic information was low (multiple rater kappa score 0.37). Intra-examiner agreement was better (kappa range 0.34-0.90). Clinician agreement for screening and accepting orthodontic referrals based on clinical photographs is comparable to that previously reported for other clinical decision making.
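For readers unfamiliar with the kappa statistic used for intra-observer reliability in the record above, the following is a minimal sketch of Cohen's kappa for two sets of categorical decisions on the same cases (for example, one examiner's accept/reject decisions on two occasions). The function name and toy data are assumptions for illustration; the study's own data and software are not described in the abstract.

```python
import math
from collections import Counter


def cohen_kappa(first, second):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    n = len(first)
    # Observed proportion of cases on which the two ratings agree.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement expected from each rating's own category frequencies.
    count_first, count_second = Counter(first), Counter(second)
    expected = sum(count_first[c] * count_second[c] for c in count_first) / n ** 2
    if expected == 1:
        return math.nan  # undefined when both ratings use a single category
    return (observed - expected) / (1 - expected)


# Hypothetical accept (1) / reject (0) decisions on four referrals.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Kappa discounts the raw agreement rate by the agreement expected by chance, which is why it can be modest even when most decisions match.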
Digital assessment of the fetal alcohol syndrome facial phenotype: reliability and agreement study.

PubMed

Tsang, Tracey W; Laing-Aiken, Zoe; Latimer, Jane; Fitzpatrick, James; Oscar, June; Carter, Maureen; Elliott, Elizabeth J

2017-01-01

To examine the three facial features of fetal alcohol syndrome (FAS) in a cohort of Australian Aboriginal children from two-dimensional digital facial photographs to: (1) assess intrarater and inter-rater reliability; (2) identify the racial norms with the best fit for this population; and (3) assess agreement with clinician direct measures. Photographs and clinical data for 106 Aboriginal children (aged 7.4-9.6 years) were sourced from the Lililwan Project. Fifty-eight per cent had a confirmed prenatal alcohol exposure and 13 (12%) met the Canadian 2005 criteria for FAS/partial FAS. Photographs were analysed using the FAS Facial Photographic Analysis Software to generate the mean PFL three-point ABC-Score, five-point lip and philtrum ranks and four-point face rank in accordance with the 4-Digit Diagnostic Code. Intrarater and inter-rater reliability of digital ratings was examined in two assessors. Caucasian or African American racial norms for PFL and lip thickness were assessed for best fit; and agreement between digital and direct measurement methods was assessed. Reliability of digital measures was substantial within (kappa: 0.70-1.00) and between assessors (kappa: 0.64-0.89). Clinician and digital ratings showed moderate agreement (kappa: 0.47-0.58). Caucasian PFL norms and the African American Lip-Philtrum Guide 2 provided the best fit for this cohort. In an Aboriginal cohort with a high rate of FAS, assessment of facial dysmorphology using digital methods showed substantial inter- and intrarater reliability. Digital measurement of features has high reliability and until data are available from a larger population of Aboriginal children, the African American Lip-Philtrum Guide 2 and Caucasian (Strömland) PFL norms provide the best fit for Australian Aboriginal children.
Psychometric evaluation of the Work Readiness Questionnaire in schizophrenia.

PubMed

Potkin, Steven G; Bugarski-Kirola, Dragana; Edgar, Chris J; Soliman, Sherif; Le Scouiller, Stephanie; Kunovac, Jelena; Miguel Velasco, Eugenio; Garibaldi, George M

2016-04-01

Unemployment can negatively impact quality of life among patients with schizophrenia. Employment status depends on ability, opportunity, education, and cultural influences. A clinician-rated scale of work readiness, independent of current work status, can be a valuable assessment tool. A series of studies were conducted to create and validate a Work Readiness Questionnaire (WoRQ) for clinicians to assess patient ability to engage in socially useful activity, independent of work availability. Content validity, test-retest and inter-rater reliability, and construct validity were evaluated in three separate studies. Content validity was supported. Cronbach's α was 0.91, in the excellent range. Clinicians endorsed WoRQ concepts, including treatment adherence, physical appearance, social competence, and symptom control. The final readiness decision showed good test-retest reliability and moderate inter-rater reliability. Work readiness was associated with higher function and lower levels of negative symptoms. Low positive and high negative predictive values confirmed the concept validity. The WoRQ has suitable psychometric properties for use in a clinical trial for patients with a broad range of symptom severity. The scale may be applicable to assess therapeutic interventions. It is not intended to assess eligibility for supported work interventions. The WoRQ is suitable for use in schizophrenia clinical trials to assess patient work functional potential.
Intraosseous access can be taught to medical students using the four-step approach.

PubMed

Afzali, Monika; Kvisselgaard, Ask Daffy; Lyngeraa, Tobias Stenbjerg; Viggers, Sandra

2017-03-02

The intraosseous (IO) access is an alternative route for vascular access when peripheral intravascular catheterization cannot be obtained. In Denmark the IO access is reported as infrequently trained and used. The aim of this pilot study was to investigate if medical students can obtain competencies in IO access when taught by a modified Walker and Peyton's four-step approach. Nineteen students attended a human cadaver course in emergency procedures. A lecture was followed by a workshop. Fifteen students were presented with a case where IO access was indicated and their performance was evaluated by an objective structured clinical examination (OSCE) and rated using a weighted checklist. To evaluate the validity of the checklist, three raters rated performance and Cohen's kappa was performed to assess inter-rater reliability (IRR). To examine the strength of the overall IRR, Randolph's free-marginal multi rater kappa was used. A maximum score of 15 points was obtained by nine (60%) of the participants and two participants (13%) scored 13 points with all three raters. Only one participant failed more than one item on the checklist. The expert rater rated lower with a mean score of 14.2 versus the non-expert raters with mean 14.6 and 14.3. The overall IRR calculated with Randolph's free-marginal multi rater kappa was 0.71. The essentials of the IO access procedure can be taught to medical students using a modified version of the Walker and Peyton's four-step approach and the checklist used was found reliable.
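The overall statistic named in the record above, Randolph's free-marginal multirater kappa, replaces the chance-agreement term with 1/k for k categories. A minimal sketch follows, assuming every case is rated by the same number of raters; the function name, the pass/fail coding and the toy data are hypothetical and do not come from the study.

```python
from collections import Counter


def free_marginal_kappa(ratings, n_categories):
    """Randolph's free-marginal kappa; one row of rater codes per rated case."""
    n_raters = len(ratings[0])
    pairs = n_raters * (n_raters - 1)  # ordered rater pairs per case
    # Share of rater pairs that agree, computed case by case.
    per_case = [
        sum(c * (c - 1) for c in Counter(row).values()) / pairs
        for row in ratings
    ]
    observed = sum(per_case) / len(per_case)
    chance = 1 / n_categories  # free-marginal chance agreement
    return (observed - chance) / (1 - chance)


# Hypothetical pass (1) / fail (0) codes from three raters on three checklist items.
kappa_free = free_marginal_kappa([[1, 1, 1], [0, 0, 0], [1, 1, 0]], n_categories=2)
```

Unlike fixed-marginal kappas, this variant does not penalise raters for using one category far more often than another, which matters when almost every checklist item is passed.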
Performance of intraclass correlation coefficient (ICC) as a reliability index under various distributions in scale reliability studies.

PubMed

Mehta, Shraddha; Bastero-Caballero, Rowena F; Sun, Yijun; Zhu, Ray; Murphy, Diane K; Hardas, Bhushan; Koch, Gary

2018-04-29

Many published scale validation studies determine inter-rater reliability using the intra-class correlation coefficient (ICC). However, the use of this statistic must consider its advantages, limitations, and applicability. This paper evaluates how interaction of subject distribution, sample size, and levels of rater disagreement affects ICC and provides an approach for obtaining relevant ICC estimates under suboptimal conditions. Simulation results suggest that for a fixed number of subjects, ICC from the convex distribution is smaller than ICC for the uniform distribution, which in turn is smaller than ICC for the concave distribution. The variance component estimates also show that the dissimilarity of ICC among distributions is attributed to the study design (ie, distribution of subjects) component of subject variability and not the scale quality component of rater error variability. The dependency of ICC on the distribution of subjects makes it difficult to compare results across reliability studies. Hence, it is proposed that reliability studies should be designed using a uniform distribution of subjects because of the standardization it provides for representing objective disagreement. In the absence of uniform distribution, a sampling method is proposed to reduce the non-uniformity. In addition, as expected, high levels of disagreement result in low ICC, and when the type of distribution is fixed, any increase in the number of subjects beyond a moderately large specification such as n = 80 does not have a major impact on ICC. Copyright © 2018 John Wiley & Sons, Ltd.
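To make the dependence of ICC on between-subject spread in the record above concrete, the following is a minimal sketch of one common form, the one-way random-effects single-rater ICC computed from ANOVA mean squares. The choice of this ICC form is an assumption, since the abstract does not say which form the authors simulated; the function name and toy data are also hypothetical.

```python
def icc_one_way(ratings):
    """One-way random-effects, single-rater ICC; one row of ratings per subject."""
    n = len(ratings)     # number of subjects
    k = len(ratings[0])  # number of ratings per subject
    grand = sum(sum(row) for row in ratings) / (n * k)
    means = [sum(row) / k for row in ratings]
    # Between-subject and within-subject mean squares.
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Same one-point rater disagreement, narrow versus wide spread of subjects.
narrow = icc_one_way([[1, 2], [2, 3], [3, 4]])
wide = icc_one_way([[1, 2], [5, 6], [9, 10]])
```

With identical rater disagreement in both toy data sets, the wider spread of subjects gives the larger ICC, which is the comparability problem the abstract describes.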
Measurement of patient safety: a systematic review of the reliability and validity of adverse event detection with record review

PubMed Central

Hanskamp-Sebregts, Mirelle; Zegers, Marieke; Vincent, Charles; van Gurp, Petra J; de Vet, Henrica C W; Wollersheim, Hub

2016-01-01

Objectives: Record review is the most used method to quantify patient safety. We systematically reviewed the reliability and validity of adverse event detection with record review. Design: A systematic review of the literature. Methods: We searched PubMed, EMBASE, CINAHL, PsycINFO and the Cochrane Library from their inception through February 2015. We included all studies that aimed to describe the reliability and/or validity of record review. Two reviewers conducted data extraction. We pooled κ values (κ) and analysed the differences in subgroups according to number of reviewers, reviewer experience and training level, adjusted for the prevalence of adverse events. Results: In 25 studies, the psychometric data of the Global Trigger Tool (GTT) and the Harvard Medical Practice Study (HMPS) were reported and 24 studies were included for statistical pooling. The inter-rater reliability of the GTT and HMPS showed a pooled κ of 0.65 and 0.55, respectively. The inter-rater agreement was statistically significantly higher when the group of reviewers within a study consisted of a maximum of five reviewers. We found no studies reporting on the validity of the GTT and HMPS. Conclusions: The reliability of record review is moderate to substantial and improved when a small group of reviewers carried out record review. The validity of the record review method has never been evaluated, while clinical data registries, autopsy or direct observations of patient care are potential reference methods that can be used to test concurrent validity. PMID:27550650
Development and testing of the KERNset: an instrument to assess the quality of telephone triage in out-of-hours primary care services.

PubMed

Smits, Marleen; Keizer, Ellen; Ram, Paul; Giesen, Paul

2017-12-02

Telephone triage is a core but vulnerable part of the care process at out-of-hours general practitioner (GP) cooperatives. In the Netherlands, different instruments have been used for assessing the quality of telephone triage. These instruments focussed mainly on communicational aspects, and less on the medical quality of triage decisions. Our aim was to develop and test a minimum set of items to assess the quality of telephone triage. A national survey among all GP cooperatives in the Netherlands was performed to examine the most important aspects of telephone triage. Next, corresponding items from existing instruments were searched on these topics. Subsequently, an expert panel judged these items on importance, completeness and formulation. The concept KERNset consisted of 24 items about the telephone conversation: 13 medical, ten communicational and one regarding both types. It was pilot tested on measurement characteristics, reliability, validity and variation between triagists. In this pilot study, 114 anonymous calls from four GP cooperatives spread across the Netherlands were judged by three out of eight raters, both internal and external raters. Cronbach's alpha was .94 for the medical items and .75 for the communicational items. Inter-rater reliability: complete agreement between the external raters was 45% and reasonable agreement 73% (difference of maximally one point on the five-point scale). Intra-rater reliability: complete agreement within raters was 55% and reasonable agreement 84%. There were hardly any differences between internal and external raters, but there were differences in strictness between individual raters. The construct validity was confirmed by the high correlation between the general impression of the call and the items of the KERNset. Of the differences within items 19% could be explained by differences between triage nurses, which means the KERNset is able to demonstrate differences between triage nurses. 
The KERNset can be used to assess the quality of telephone triage. The validity is good and differences between calls and between triage nurses can be measured. A more intensive training for raters could improve the reliability.
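The two agreement measures reported in the record above, complete agreement and reasonable agreement (a difference of at most one point on the five-point scale), are simple proportions and can be sketched as follows. The function name and the example ratings are hypothetical; the actual KERNset item scores are not given in the abstract.

```python
def agreement_rates(first, second, tolerance=1):
    """Return (complete, reasonable) agreement for two raters' ordinal scores."""
    diffs = [abs(a - b) for a, b in zip(first, second)]
    complete = sum(d == 0 for d in diffs) / len(diffs)
    # Reasonable agreement: scores differ by at most `tolerance` points.
    reasonable = sum(d <= tolerance for d in diffs) / len(diffs)
    return complete, reasonable


# Hypothetical five-point scores from two raters on five calls.
complete, reasonable = agreement_rates([1, 2, 3, 4, 5], [1, 3, 5, 4, 4])
```

Reasonable agreement can never be lower than complete agreement, since every exact match also falls within the one-point tolerance.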
Reliability and Normative Data for the Dynamic Visual Acuity Test for Vestibular Screening.

PubMed

Riska, Kristal M; Hall, Courtney D

2016-06-01

The purpose of this study was to determine reliability of computerized dynamic visual acuity (DVA) testing and to determine reference values for younger and older adults. A primary function of the vestibular system is to maintain gaze stability during head motion. The DVA test quantifies gaze stabilization with the head moving versus stationary. Commercially available computerized systems allow clinicians to incorporate DVA into their assessment; however, information regarding reliability and normative values of these systems is sparse. Forty-six healthy adults, grouped by age, with normal vestibular function were recruited. Each participant completed computerized DVA testing including static visual acuity, minimum perception time, and DVA using the NeuroCom inVision System. Testing was performed by two examiners in the same session and then repeated at a follow-up session 3 to 14 days later. Intraclass correlation coefficients (ICCs) were used to determine inter-rater and test-retest reliability. ICCs for inter-rater reliability ranged from 0.323 to 0.937 and from 0.434 to 0.909 for horizontal and vertical head movements, respectively. ICCs for test-retest reliability ranged from 0.154 to 0.856 and from 0.377 to 0.9062 for horizontal and vertical head movements, respectively. Overall, raw scores (left/right DVA and up/down DVA) were more reliable than DVA loss scores. A commercially available DVA system has poor-to-fair reliability for DVA loss scores. The use of a convergence paradigm and not incorporating the forced choice paradigm may contribute to poor reliability.
The Number of Feedbacks Needed for Reliable Evaluation. A Multilevel Analysis of the Reliability, Stability and Generalisability of Students' Evaluation of Teaching

ERIC Educational Resources Information Center

Rantanen, Pekka

2013-01-01

A multilevel analysis approach was used to analyse students' evaluation of teaching (SET). The low value of inter-rater reliability stresses that any solid conclusions on teaching cannot be made on the basis of single feedbacks. To assess a teacher's general teaching effectiveness, one needs to evaluate four randomly chosen course implementations.…
Therapist adherence in the strong without anorexia nervosa (SWAN) study: A randomized controlled trial of three treatments for adults with anorexia nervosa.

PubMed

Andony, Louise J; Tay, Elaine; Allen, Karina L; Wade, Tracey D; Hay, Phillipa; Touyz, Stephen; McIntosh, Virginia V W; Treasure, Janet; Schmidt, Ulrike H; Fairburn, Christopher G; Erceg-Hurn, David M; Fursland, Anthea; Crosby, Ross D; Byrne, Susan M

2015-12-01

To develop a psychotherapy rating scale to measure therapist adherence in the Strong Without Anorexia Nervosa (SWAN) study, a multi-center randomized controlled trial comparing three different psychological treatments for adults with anorexia nervosa. The three treatments under investigation were Enhanced Cognitive Behavioural Therapy (CBT-E), the Maudsley Anorexia Nervosa Treatment for Adults (MANTRA), and Specialist Supportive Clinical Management (SSCM). The SWAN Psychotherapy Rating Scale (SWAN-PRS) was developed, after consultation with the developers of the treatments, and refined. Using the SWAN-PRS, two independent raters initially rated 48 audiotapes of treatment sessions to yield inter-rater reliability data. One rater proceeded to rate a total of 98 audiotapes from 64 trial participants. The SWAN-PRS demonstrated sound psychometric properties, and was considered a reliable measure of therapist adherence. The three treatments were highly distinguishable by independent raters, with therapists demonstrating significantly more behaviors consistent with the actual allocated treatment compared to the other two treatment modalities. There were no significant site differences in therapist adherence observed. The findings provide support for the internal validity of the SWAN study. The SWAN-PRS was deemed suitable for use in other trials involving CBT-E, MANTRA, or SSCM. The Authors. International Journal of Eating Disorders Published by Wiley Periodicals, Inc.
Reliability of a survey tool for measuring consumer nutrition environment in urban food stores.

PubMed

Hosler, Akiko S; Dharssi, Aliza

2011-01-01

Despite the increase in the volume and importance of food environment research, there is a general lack of reliable measurement tools. This study presents the development and reliability assessment of a tool for measuring consumer nutrition environment in urban food stores. Cross-sectional design. A racially diverse downtown portion (6 ZIP code areas) in Albany, New York. A sample of 39 food stores was visited by our research team in 2009 to 2010. These stores were randomly selected from 123 eligible food stores identified through multiple government lists and ground-truthing. The Food Retail Outlet Survey Tool was developed to assess the presence of selected food and nonfood items, placement, milk prices, physical characteristics of the store, policy implementation, and advertisements on outside windows. For in-store items, agreement of observations between experienced and lightly trained surveyors was assessed. For window advertisement assessments, inter-method agreement (on-site sketch vs digital photo), and inter-rater agreement (both on-site) among lightly trained surveyors were evaluated. Percent agreement, Kappa, and prevalence-adjusted bias-adjusted kappa were calculated for in-store observations. Intraclass correlation coefficients were calculated for window observations. Twenty-seven of the 47 in-store items had 100% agreement. The prevalence-adjusted bias-adjusted kappa indicated excellent agreement (≥0.90) on all items, except aisle width (0.74) and dark-green/orange colored fresh vegetables (0.85). The store type (nonconvenience store), the order of visits (first half), and the time to complete survey (>10 minutes) were associated with lower reliability in these 2 items. Both the inter-method and inter-rater agreements for window advertisements were uniformly high (intraclass correlation coefficient ranged 0.94-1.00), indicating high reliability. The Food Retail Outlet Survey Tool is a reliable tool for quickly measuring consumer nutrition environment. 
It can be effectively used by an individual who attended a 30-minute group briefing and practiced with 3 to 4 stores.
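This record reports percent agreement, kappa, and prevalence-adjusted bias-adjusted kappa (PABAK) for the in-store items. As a hedged sketch of how the three relate for two raters, the standard formulas are kappa = (Po - Pe) / (1 - Pe) and PABAK = (k·Po - 1) / (k - 1), which reduces to 2·Po - 1 for a present/absent item. The function names and the observations below are invented for illustration and are not taken from the study.

```python
from collections import Counter


def percent_agreement(a, b):
    """Observed proportion of items on which both raters agree (Po)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa; undefined (division by zero) if chance agreement is 1."""
    n = len(a)
    po = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal category frequencies.
    pe = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / n**2
    return (po - pe) / (1 - pe)


def pabak(a, b, k=2):
    """Prevalence-adjusted bias-adjusted kappa for k categories."""
    po = percent_agreement(a, b)
    return (k * po - 1) / (k - 1)


# Hypothetical item that is present (1) in almost every store.
surveyor_1 = [1] * 18 + [1, 0]
surveyor_2 = [1] * 18 + [0, 0]
print(round(percent_agreement(surveyor_1, surveyor_2), 2))
print(round(cohen_kappa(surveyor_1, surveyor_2), 2))
print(round(pabak(surveyor_1, surveyor_2), 2))
```

With such a skewed prevalence, kappa comes out much lower than PABAK even though the raters disagree on only one of 20 stores, which is the situation PABAK is designed to adjust for.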
A behaviourally anchored rating scale for evaluating the use of the WHO surgical safety checklist: development and initial evaluation of the WHOBARS.

PubMed

Devcich, Daniel A; Weller, Jennifer; Mitchell, Simon J; McLaughlin, Scott; Barker, Lauren; Rudolph, Jenny W; Raemer, Daniel B; Zammert, Martin; Singer, Sara J; Torrie, Jane; Frampton, Chris Ma; Merry, Alan F

2016-10-01

Realising the full potential of the WHO Surgical Safety Checklist (SSC) to reduce perioperative harm requires the constructive engagement of all operating room (OR) team members during its administration. To facilitate research on SSC implementation, a valid and reliable instrument is needed for measuring OR team behaviours during its administration. We developed a behaviourally anchored rating scale (BARS) for this purpose. We used a modified Delphi process, involving 16 subject matter experts, to compile a BARS with behavioural domains applicable to all three phases of the SSC. We evaluated the instrument in 80 adult OR cases and 30 simulated cases using two medical student raters and seven expert raters, respectively. Intraclass correlation coefficients were calculated to assess inter-rater reliability. Internal consistency and instrument discrimination were explored. Sample size estimates for potential study designs using the instrument were calculated. The Delphi process resulted in a BARS instrument (the WHOBARS) with five behavioural domains. Intraclass correlation coefficients calculated from the OR cases exceeded 0.80 for 80% of the instrument's domains across the SSC phases. The WHOBARS showed high internal consistency across the three phases of the SSC and ability to discriminate among surgical cases in both clinical and simulated settings. Fewer than 20 cases per group would be required to show a difference of 1 point between groups in studies of the SSC, where α=0.05 and β=0.8. We have developed a generic instrument for comprehensively rating the administration of the SSC and informing initiatives to realise its full potential. We have provided data supporting its capacity for discrimination, internal consistency and inter-rater reliability. Further psychometric evaluation is warranted. Published by the BMJ Publishing Group Limited. 
Reliability of rehabilitative ultrasonographic imaging for muscle thickness measurement of the rhomboid major.

PubMed

Jeong, Ju Ri; Ko, Young Jun; Ha, Hyun Geun; Lee, Wan Hee

2016-03-01

This study aimed to establish the inter-rater and intra-rater reliability of the rehabilitative ultrasonographic imaging (RUSI) technique for muscle thickness measurement of the rhomboid major at rest and with the shoulder abducted to 90°. Twenty-four young adults (eight men, 16 women; right-handed; mean age [±SD], 24·4 years [±2·6]) with no history of neck, shoulder, or arm pain were recruited. Rhomboid major muscle images were obtained in the resting position and with shoulder in 90° abduction using an ultrasonography system with a 7·5-MHz linear transducer. In these two positions, the examiners found the site at which the transducer could be placed. Two examiners obtained the images of all participants in three test sessions at random. Intraclass correlation coefficients (ICC) were used to estimate reliability. All ICCs (95% CI) were >0·75, ranging from 0·93 to 0·98, which indicates good reliability. The ICCs for inter-rater reliability ranged from 0·75 to 0·94. For the absolute value of the difference in the intra-examiner reliability between the right and left ratios, the ICCs ranged from 0·58 to 0·91. In this study, the intra- and inter-examiner reliability of muscle thickness measurements of the rhomboid major was good. Therefore, we suggest that muscle thickness measurements of the rhomboid major obtained with the RUSI technique would be useful for clinical rehabilitative assessment. © 2014 Scandinavian Society of Clinical Physiology and Nuclear Medicine. Published by John Wiley & Sons Ltd.
The Health Informatics Trial Enhancement Project (HITE): Using routinely collected primary care data to identify potential participants for a depression trial

PubMed Central

2010-01-01

Background Recruitment to clinical trials can be challenging. We identified anonymous potential participants to an existing pragmatic randomised controlled depression trial to assess the feasibility of using routinely collected data to identify potential trial participants. We discuss the strengths and limitations of this approach, assess its potential value, and report challenges and ethical issues encountered. Methods Swansea University's Health Information Research Unit's Secure Anonymised Information Linkage (SAIL) database of routinely collected health records was interrogated, using Structured Query Language (SQL). Read codes were used to create an algorithm of inclusion/exclusion criteria with which to identify suitable anonymous participants. Two independent clinicians rated the eligibility of the potential participants identified. Inter-rater reliability was assessed using the kappa statistic and inter-class correlation. Results The study population (N = 37263) comprised all adults registered at five general practices in Swansea, UK. Using the algorithm, 867 anonymous potential participants were identified. The sensitivity and specificity results > 0.9 suggested a high degree of accuracy from the algorithm. The inter-rater reliability results indicated strong agreement between the confirming raters. The Intra Class Correlation Coefficient (Cronbach's Alpha) > 0.9 suggested excellent agreement, and the Kappa coefficient > 0.8 suggested almost perfect agreement. Conclusions This proof of concept study showed that routinely collected primary care data can be used to identify potential participants for a pragmatic randomised controlled trial of folate augmentation of antidepressant therapy for the treatment of depression. Further work will be needed to assess generalisability to other conditions and settings and the inclusion of this approach to support Electronic Enhanced Recruitment (EER). PMID:20398303
Assessing disease stress and modeling yield losses in alfalfa

NASA Astrophysics Data System (ADS)

Guan, Jie

Alfalfa is the most important forage crop in the U.S. and worldwide. Fungal foliar diseases are believed to cause significant yield losses in alfalfa, yet, little quantitative information exists regarding the amount of crop loss. Different fungicides and application frequencies were used as tools to generate a range of foliar disease intensities in Ames and Nashua, IA. Visual disease assessments (disease incidence, disease severity, and percentage defoliation) were obtained weekly for each alfalfa growth cycle (two to three growing cycles per season). Remote sensing assessments were performed using a hand-held, multispectral radiometer to measure the amount and quality of sunlight reflected from alfalfa canopies. Factors such as incident radiation, sun angle, sensor height, and leaf wetness were all found to significantly affect the percentage reflectance of sunlight reflected from alfalfa canopies. The precision of visual and remote sensing assessment methods was quantified. Precision was defined as the intra-rater repeatability and inter-rater reliability of assessment methods. F-tests, slopes, intercepts, and coefficients of determination (R2) were used to compare assessment methods for precision. Results showed that among the three visual disease assessment methods (disease incidence, disease severity, and percentage defoliation), percentage defoliation had the highest intra-rater repeatability and inter-rater reliability. Remote sensing assessment method had better precision than the percentage defoliation assessment method based upon higher intra-rater repeatability and inter-rater reliability. Significant linear relationships between canopy reflectance (810 nm), percentage defoliation and yield were detected using linear regression and percentage reflectance (810 nm) assessments were found to have a stronger relationship with yield than percentage defoliation assessments. 
There were also significant linear relationships between percentage defoliation, dry weight, percentage reflectance (810 nm), and green leaf area index (GLAI). Percentage reflectance (810 nm) assessments had a stronger relationship with dry weight and green leaf area index than percentage defoliation assessments. Our research conclusively demonstrates that percentage reflectance measurements can be used to nondestructively assess green leaf area index, which is a direct measure of plant health and an indirect measure of productivity. This research conclusively demonstrates that remote sensing is superior to visual assessment methods for assessing alfalfa stress and for modeling yield and GLAI in the alfalfa foliar disease pathosystem.
Is the Bayley Scales of Infant and Toddler Developmental Screening Test, Valid and Reliable for Persian Speaking Children?

PubMed

Soleimani, Farin; Azari, Nadia; Vameghi, Roshanak; Sajedi, Firoozeh; Shahshahani, Soheila; Karimi, Hossein; Kraskian, Adis; Shahrokhi, Amin; Teymouri, Robab; Gharib, Masoud

2016-10-01

Advances in perinatal and neonatal care have substantially improved the survival of at-risk infants over the past two decades. The purpose of this study was to assess the reliability and validity of the Bayley Scales of Infant and Toddler Developmental Screening Test in Persian-speaking children. This was a cross-sectional prospective study of 403 children aged 1 to 42 months. The Bayley scales screening instrument, which consists of five domains (cognitive, receptive, and expressive communication and fine and gross motor items), was used to measure infants' and toddlers' development. The psychometric properties examined included the face and content validity of the scale, in addition to cultural and linguistic modifications to the scale and its test-retest and inter-rater reliability. An expert team changed some of the test items relating to cultural and linguistic issues. In almost all the age groups, cultural or linguistic changes were made to items in the communication domains. According to Cronbach's alpha for internal consistency, the reliability of the cognitive scale was r = 0.79, and the reliability of the receptive scale was r = 0.76. The reliability for expressive communication, fine motor, and gross motor scales was r = 0.81, r = 0.80, and r = 0.81, respectively. The construct validity of the tests was confirmed using a factor analysis and comparison of the mean scores of the age groups. The intra- and inter-rater reliabilities of the Bayley Scales were good-to-excellent. The results indicated that the Bayley Scales had a high level of reliability in the present study. Thus, the scale can be used in a Persian population.
Narrow Band Imaging Enhances the Detection Rate of Penetration and Aspiration in FEES.

PubMed

Nienstedt, Julie C; Müller, Frank; Nießen, Almut; Fleischer, Susanne; Koseki, Jana-Christiane; Flügel, Till; Pflug, Christina

2017-06-01

Narrow band imaging (NBI) is widely used in gastrointestinal, laryngeal, and urological endoscopy. Its original purpose was to visualize vessels and epithelial irregularities. Based on our observation that adding NBI to common white light (WL) improves the contrast of the test bolus in fiberoptic endoscopic evaluation of swallowing (FEES), we now investigated the potential value of NBI in swallowing disorders. 148 FEES images were analyzed from 74 consecutive patients with swallowing disorders, including 74 with and 74 without NBI. All images were evaluated by four dysphagia specialists. Findings were classified according to Rosenbek's penetration-aspiration scale modified for evaluating these FEES images. Intra- and inter-rater reliability was determined as well as observer confidence. A better visualization of the bolus is the main advantage of NBI in FEES. This generally leads to sharper optical contrasts and better detection of small bolus quantities. Accordingly, NBI enhances the detection rate of penetration and aspiration. On average, identification of laryngeal penetration increased from 40 to 73% and of aspiration from 13 to 24% (each p < 0.01) of patients. In contrast to WL alone, the use of NBI also markedly increased the inter- and intra-rater reliability (p < 0.01) and the rating confidence of all experts (p < 0.05). NBI is an easy and cost-effective tool simplifying dysphagia evaluation and shortening FEES evaluation time. It leads to a markedly higher detection rate of pathological findings. The significantly better intra- and inter-rater reliability argues further for a better overall reproducibility of FEES interpretation.
Assessment of the severity of dementia: validity and reliability of the Chinese (Cantonese) version of the Hierarchic Dementia Scale (CV-HDS).

PubMed

Poon, Vickie Wan-kei; Lam, Linda Chiu-wa; Wong, Samuel Yeung-shan

2008-09-01

With the rapid growth of the older population, early detection of cognitive deficits is crucial in slowing down functional deterioration of elderly persons. To examine the validity and reliability of the Chinese (Cantonese) version of the Hierarchic Dementia Scale (CV-HDS) for Chinese older persons in Hong Kong. The HDS was translated into Cantonese Chinese. The content and cultural validity were evaluated by six expert panel members. Sixty-two participants with diagnosis of dementia were recruited for evaluation. Inter-rater reliability, test-retest reliability, internal consistency and concurrent validity were examined. The CV-HDS demonstrated satisfactory psychometric properties. Inter-rater reliability and test-retest reliability were high (alpha=0.89 and alpha=0.94 respectively). High value of Cronbach's alpha (alpha=0.94) demonstrated good internal consistency. The concurrent validity of the CV-HDS, through correlation of its scores with those of the Chinese version of the Mini Mental Status Examination, was established (ranged from r=0.58 to r=0.78, p<0.01). The CV-HDS is a reliable and valid instrument for assessing severity of cognitive impairment in Cantonese speaking Chinese people with dementia. It facilitates treatment planning to optimize the effects of functional training and rehabilitation.
Children's Reaction to Types of Television. Technical Report No. 28.

ERIC Educational Resources Information Center

Hines, Brainard W.

An observational system having high inter-rater reliability and providing a reliable estimate of patterns of behavior across time periods is developed and tested for use in evaluating children's responses to a number of television styles and modes of presentation. This project was designed to meet three goals: first, to develop a valid and…

The use of portable 2D echocardiography and 'frame-based' bubble counting as a tool to evaluate diving decompression stress.

PubMed

Germonpré, Peter; Papadopoulou, Virginie; Hemelryck, Walter; Obeid, Georges; Lafère, Pierre; Eckersley, Robert J; Tang, Meng-Xing; Balestra, Costantino

2014-03-01

'Decompression stress' is commonly evaluated by scoring circulating bubble numbers post dive using Doppler or cardiac echography. This information may be used to develop safer decompression algorithms, assuming that the lower the numbers of venous gas emboli (VGE) observed post dive, the lower the statistical risk of decompression sickness (DCS). Current echocardiographic evaluation of VGE, using the Eftedal and Brubakk method, has some disadvantages as it is less well suited for large-scale evaluation of recreational diving profiles. We propose and validate a new 'frame-based' VGE-counting method which offers a continuous scale of measurement. Nine 'raters' of varying familiarity with echocardiography were asked to grade 20 echocardiograph recordings using both the Eftedal and Brubakk grading and the new 'frame-based' counting method. They were also asked to count the number of bubbles in 50 still-frame images, some of which were randomly repeated. A Wilcoxon Spearman ρ calculation was used to assess test-retest reliability of each rater for the repeated still frames. For the video images, weighted kappa statistics, with linear and quadratic weightings, were calculated to measure agreement between raters for the Eftedal and Brubakk method. Bland-Altman plots and intra-class correlation coefficients were used to measure agreement between raters for the frame-based counting method. Frame-based counting showed a better inter-rater agreement than the Eftedal and Brubakk grading, even with relatively inexperienced assessors, and has good intra- and inter-rater reliability. Frame-based bubble counting could be used to evaluate post-dive decompression stress, and offers possibilities for computer-automated algorithms to allow near-real-time counting.
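The record above mentions weighted kappa with linear and quadratic weightings for ordinal grades. The following is a minimal sketch of that statistic for two raters, assuming grades coded as integers 0 to k-1 and the usual disagreement weights |i - j| / (k - 1) (linear) or its square (quadratic). The function name and the grades are hypothetical and are not the study's data.

```python
def weighted_kappa(rater_a, rater_b, n_categories, weighting="linear"):
    """Weighted kappa for two raters on an ordinal scale coded 0..k-1.

    Computed as 1 - (weighted observed disagreement) /
    (weighted chance-expected disagreement). Undefined if the expected
    disagreement is zero (both raters always use one identical grade).
    """
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty grade lists")
    k = n_categories
    n = len(rater_a)
    power = 1 if weighting == "linear" else 2

    # Joint proportion table: observed[i][j] = share of items graded i by A, j by B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    disagreement_obs = 0.0
    disagreement_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            disagreement_obs += w * observed[i][j]
            disagreement_exp += w * row[i] * col[j]
    return 1.0 - disagreement_obs / disagreement_exp


# Hypothetical three-level grades for four recordings.
grades_a = [0, 1, 2, 2]
grades_b = [0, 1, 2, 1]
kappa_linear = weighted_kappa(grades_a, grades_b, 3, "linear")
kappa_quadratic = weighted_kappa(grades_a, grades_b, 3, "quadratic")
print(round(kappa_linear, 3), round(kappa_quadratic, 3))
```

Quadratic weights penalise a one-grade disagreement less than linear weights do, so in this example the quadratic value is the higher of the two.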
Measurement of cervical flexor endurance following whiplash.

PubMed

Kumbhare, Dinesh A; Balsor, Brad; Parkinson, William L; Harding Bsckin, Peter; Bedard, Michel; Papaioannou, Alexandra; Adachi, Jonathan D

2005-07-22

To investigate measurement properties of a practical test of cervical flexor endurance (CFE) in whiplash patients including inter-rater reliability, sensitivity to clinical change, criterion related validity against the Neck Disability Index (NDI), and discriminant validity for injured versus uninjured populations. Two samples were recruited, 81 whiplash patients, and a convenience sample of 160 subjects who were not seeking treatment and met criteria for normal pain and range of motion. CFE was measured using a stopwatch while the subject, in crook lying, held their head against gravity to fatigue. Inter-rater reliability in whiplash patients was in a range considered 'almost perfect' (Intraclass Correlation=0.96). CFE had greater inter-subject variability than the NDI or range of motion in any of three planes. However, the effect size for improvement in CFE over treatment was as large as the effect sizes for all of those measures. In multivariate regression, CFE changes accounted for changes on the NDI better than the three ranges of motion. CFE discriminated whiplash patients who were within six months of injury (n=71) from age and gender matched normals with high effect size (ES=1.5). These findings provide evidence of reliability and validity for CFE measurement, and demonstrate that CFE detects clinical improvements. Variance on CFE emphasizes the need to consider inter-, and intra-subject standard deviations to interpret scores.
Reliability, validity and minimal detectable change of the Mini-BESTest in Greek participants with chronic stroke.

PubMed

Lampropoulou, Sofia I; Billis, Evdokia; Gedikoglou, Ingrid A; Michailidou, Christina; Nowicky, Alexander V; Skrinou, Dimitra; Michailidi, Fotini; Chandrinou, Danae; Meligkoni, Margarita

2018-02-23

This study aimed to investigate the psychometric characteristics of reliability, validity and ability to detect change of a newly developed balance assessment tool, the Mini-BESTest, in Greek patients with stroke. A prospective, observational design study with test-retest measures was conducted. A convenience sample of 21 Greek patients with chronic stroke (14 male, 7 female; age of 63 ± 16 years) was recruited. Two independent examiners administered the scale, for the inter-rater reliability, twice within 10 days for the test-retest reliability. Bland Altman Analysis for repeated measures assessed the absolute reliability and the Standard Error of Measurement (SEM) and the Minimum Detectable Change at 95% confidence interval (MDC95%) were established. The Greek Mini-BESTest (Mini-BESTest-GR) was correlated with the Greek Berg Balance Scale (BBS-GR) for assessing the concurrent validity and with the Timed Up and Go (TUG), the Functional Reach Test (FRT) and the Greek Falls Efficacy Scale-International (FES-I-GR) for the convergent validity. The Mini-BESTest-GR demonstrated excellent inter-rater reliability (ICC (95%CI) = 0.997 (0.995-0.999), SEM = 0.46) with the scores of two raters within the limits of agreement (mean dif = -0.143 ± 0.727, p > 0.05) and test-retest reliability (ICC (95%CI) = 0.966 (0.926-0.988), SEM = 1.53). Additionally, the Mini-BESTest-GR yielded very strong to moderate correlations with BBS-GR (r = 0.924, p < 0.001), TUG (r = -0.823, p < 0.001), FES-I-GR (r = -0.734, p < 0.001) and FRT (r = 0.689, p < 0.001). MDC95% was 4.25 points. The exceptionally high reliability and the equally good validity of the Mini-BESTest-GR strongly support its utility in Greek people with chronic stroke. Its ability to identify clinically meaningful changes and falls risk needs further investigation.
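The link between the reported test-retest SEM and the minimum detectable change can be sketched with the commonly used convention MDC95 = 1.96 × √2 × SEM. This is an assumption: the abstract does not state which formula the authors applied, and the function name below is illustrative.

```python
import math


def mdc95(sem, z=1.96):
    """Minimum detectable change at 95% confidence from the SEM.

    Assumes the conventional formula z * sqrt(2) * SEM, where sqrt(2)
    accounts for measurement error in both of the two compared scores.
    """
    return z * math.sqrt(2) * sem


# Test-retest SEM reported in the abstract (1.53 points).
print(round(mdc95(1.53), 2))
```

Under this assumption the result is about 4.24 points, close to the 4.25 points reported; the small gap would be consistent with rounding of the SEM, though the abstract does not confirm this.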
Is it reliable to assess visual attention of drivers affected by Parkinson's disease from the backseat?—a simulator study

PubMed Central

Lee, Hoe C.; Yanting Chee, Derserri; Selander, Helena; Falkmer, Torbjorn

2012-01-01

Background Licence retention or cancellation is currently determined through on-road driving tests. Previous research has shown that occupational therapists frequently assess drivers’ visual attention while sitting in the back seat on the opposite side of the driver. Since the eyes of the driver are not always visible, assessment by eye contact becomes problematic. Such procedural drawbacks may challenge validity and reliability of the visual attention assessments. In terms of correctly classified attention, the aim of the study was to establish the accuracy and the inter-rater reliability of driving assessments of visual attention from the back seat. Furthermore, by establishing eye contact between the assessor and the driver through an additional mirror on the wind screen, the present study aimed to establish how much such an intervention would enhance the accuracy of the visual attention assessment. Methods Two drivers with Parkinson's disease (PD) and six control drivers drove a fixed route in a driving simulator while wearing a head mounted eye tracker. The eye tracker data showed where the foveal visual attention actually was directed. These data were time stamped and compared with the simultaneous manual scoring of the visual attention of the drivers. In four of the drivers, one with Parkinson's disease, a mirror on the windscreen was set up to arrange for eye contact between the driver and the assessor. Inter-rater reliability was assessed with one of the Parkinson drivers driving, but without the mirror. Results Without mirror, the overall accuracy was 56% when assessing the three control drivers and with mirror 83%. However, for the PD driver without mirror the accuracy was 94%, whereas for the PD driver with a mirror the accuracy was 90%. With respect to the inter-rater reliability, a 73% agreement was found. 
Conclusion If the final outcome of a driving assessment is dependent on the subcategory of a protocol assessing visual attention, we suggest the use of an additional mirror to establish eye contact between the assessor and the driver. The clinicians’ observations on-road should not be a standalone assessment in driving assessments. Instead, eye trackers should be employed for further analyses and correlation in cases where there is doubt about a driver's attention. PMID:22461850
The Hong Kong version of the Oxford Cognitive Screen (HK-OCS): validation study for Cantonese-speaking chronic stroke survivors.

PubMed

Kong, Anthony Pak-Hin; Lam, Pinky Hiu-Ping; Ho, Diana Wai-Lam; Lau, Johnny King; Humphreys, Glyn W; Riddoch, Jane; Weekes, Brendan

2016-09-01

This study reports the validation of the Hong Kong version of Oxford Cognitive Screen (HK-OCS). Seventy Cantonese-speaking healthy individuals participated to establish normative data and 46 chronic stroke survivors were assessed using the HK-OCS, Albert's Test of Visual Neglect, short test of gestural production, and Hong Kong version of the following assessments: Western Aphasia Battery, MMSE, MoCA, Modified Barthel Index, and Lawton Instrumental Activities of Daily Living scale. The validity of the HK-OCS was appraised by the difference between the two participant groups. Neurologically unimpaired individuals performed significantly better than stroke survivors on the HK-OCS. Positive and significant correlations found between cognitive subtests in the HK-OCS and related assessments indicated good concurrent validity. Excellent intra-rater and inter-rater reliabilities, fair test-retest reliability, and acceptable internal consistency suggested that the HK-OCS had good reliability. Specific HK-OCS subtests including semantics, episodic memory, number writing, and orientation were the best predictors of functional outcomes.
A fatigue resistance test for elderly persons based on grip strength: reliability and comparison with healthy young subjects.

PubMed

Bautmans, Ivan; Mets, Tony

2005-06-01

Although a wide variety of protocols are available for evaluating skeletal muscle fatigue resistance, they often necessitate important technological resources or are too complicated for elderly subjects. We present here a new test, designed for elderly persons, based on maintaining maximal voluntary grip strength as long as possible. The aim of the study was to determine the reliability of this test procedure in hospitalized geriatric patients and in young healthy persons. Fatigue resistance was considered as the time in which grip strength decreases to 50% of its maximum value. Twenty geriatric, hospitalized patients (age 83 +/- 6 yrs) and thirty-nine young, healthy persons (age 23 +/- 4 yrs) were evaluated for fatigue resistance by two different observers. Height, weight and body mass index were determined for each participant and the current amount of sports activity was recorded in the young subjects. All participants were able to perform the test. Inter- and intra-rater reliability in both subgroups was good to excellent, with ICC(3,1) values ranging from 0.77 to 0.94. No significant differences in inter- and intra-rater measurements were found, except for inter-observer evaluations of the dominant hand in hospitalized geriatric patients. No significant relationships were found between fatigue resistance and maximal grip strength, anthropometrics or gender. The proposed fatigue resistance test is a reliable tool to evaluate geriatric hospitalized patients as well as young, active and healthy persons. Fatigue resistance scores are not related to gender, maximal strength or anthropometrics within the observed subgroups.
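The record above reports ICC(3,1) values. As a minimal sketch, assuming the Shrout and Fleiss definition ICC(3,1) = (BMS - EMS) / (BMS + (k - 1)·EMS), with BMS the between-subjects mean square and EMS the residual mean square from a two-way ANOVA without replication, the statistic can be computed as follows. The function name and the score matrix are illustrative and are not data from the study.

```python
def icc_3_1(ratings):
    """ICC(3,1): two-way mixed model, single rater, consistency.

    `ratings` is a list of rows, one row per subject, each row holding
    one score from each of the same k raters.
    """
    n = len(ratings)          # subjects
    k = len(ratings[0])       # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_subjects - ss_raters

    bms = ss_subjects / (n - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems)


# Illustrative scores: 6 subjects (rows) rated by 4 raters (columns).
scores = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_3_1(scores)
print(round(icc, 2))
```

Because ICC(3,1) removes the rater main effect, systematic offsets between raters (visible in the columns above) do not lower it; an absolute-agreement form such as ICC(2,1) would be lower for the same data.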
A study to evaluate the reliability of using two-dimensional photographs, three-dimensional images, and stereoscopic projected three-dimensional images for patient assessment.

PubMed

Zhu, S; Yang, Y; Khambay, B

2017-03-01

Clinicians are accustomed to viewing conventional two-dimensional (2D) photographs and assume that viewing three-dimensional (3D) images is similar. Facial images captured in 3D are not viewed in true 3D; this may alter clinical judgement. The aim of this study was to evaluate the reliability of using conventional photographs, 3D images, and stereoscopic projected 3D images to rate the severity of the deformity in pre-surgical class III patients. Forty adult patients were recruited. Eight raters assessed facial height, symmetry, and profile using the three different viewing media and a 100-mm visual analogue scale (VAS), and appraised the most informative viewing medium. Inter-rater consistency was above good for all three media. Intra-rater reliability was not significantly different for rating facial height using 2D (P=0.704), symmetry using 3D (P=0.056), and profile using projected 3D (P=0.749). Using projected 3D for rating profile and symmetry resulted in significantly lower median VAS scores than either 3D or 2D images (all P<0.05). For 75% of the raters, stereoscopic 3D projection was the preferred method for rating. The reliability of assessing specific characteristics was dependent on the viewing medium. Clinicians should be aware that the visual information provided when viewing 3D images is not the same as when viewing 2D photographs, especially for facial depth, and this may change the clinical impression. Crown Copyright © 2016. Published by Elsevier Ltd. All rights reserved.
Ventilatory threshold may be a more specific measure of aerobic capacity than peak oxygen consumption rate in persons with stroke.

PubMed

Boyne, Pierce; Reisman, Darcy; Brian, Michael; Barney, Brian; Franke, Ava; Carl, Daniel; Khoury, Jane; Dunning, Kari

2017-03-01

After stroke, aerobic deconditioning can have a profound impact on daily activities. This is usually measured by the peak oxygen consumption rate achieved during exercise testing (VO2-peak). However, VO2-peak may be distorted by motor function. The oxygen uptake efficiency slope (OUES) and VO2 at the ventilatory threshold (VO2-VT) could more specifically assess aerobic capacity after stroke, but this has not been tested. To assess the differential influence of motor function on three measures of aerobic capacity (VO2-peak, OUES, and VO2-VT) and to evaluate the inter-rater reliability of VO2-VT determination post-stroke. Among 59 persons with chronic stroke, cross-sectional correlations with motor function (comfortable gait speed [CGS] and lower extremity Fugl-Meyer [LEFM]) were compared between the different aerobic capacity measures, after adjustment for covariates, in order to isolate any distorting effect of motor function. Reliability of VO2-VT determination between three raters was assessed with intra-class correlation (ICC). CGS was moderately correlated with VO2-peak (r = 0.52, p < 0.0001) and weakly correlated with OUES (r = 0.41, p = 0.002) and VO2-VT (r = 0.37, p = 0.01). LEFM was weakly correlated with VO2-peak (r = 0.26, p = 0.055) and very weakly correlated with OUES (r = 0.19, p = 0.17) and VO2-VT (r = 0.14, p = 0.31). Compared to VO2-peak, VO2-VT was significantly less correlated with CGS (r difference = -0.16, p = 0.02). Inter-rater reliability of VO2-VT determination was high (ICC: 0.93, 95% CI: 0.89-0.96). Motor dysfunction appears to artificially lower measured aerobic capacity. VO2-VT seemed to be less distorted than VO2-peak and had good inter-rater reliability, so it may provide more specific assessment of aerobic capacity post-stroke.
Systematic behavioural observation of executive performance after brain injury.

PubMed

Lewis, Mark W; Babbage, Duncan R; Leathem, Janet M

2017-01-01

To develop an ecologically valid measure of executive functioning (i.e. Planning and Organization, Executive Memory, Initiation, Cognitive Shifting, Impulsivity, Sustained and Directed Attention, Error Detection, Error Correction and Time Management) during a functional chocolate brownie cooking task. In Study 1, the inter-rater reliability of a novel behavioural observation assessment method was assessed with 10 people with traumatic brain injury (TBI). In Study 2, 27 people with TBI and 16 healthy controls completed the functional task along with other measures of executive functioning to assess validity. Intraclass correlation coefficients for six of the nine aspects of executive functioning ranged from .54 to 1.00. Percentage agreements for the remaining aspects ranged from 70% to 90%. Significant and non-significant, moderate, correlations were found between the functional cooking task and standard neuropsychological measures. The healthy control group performed better than the TBI group in six areas (d = 0.56 to 1.23). In this initial trial of a novel assessment method, adequate inter-rater reliability was found. The measure was associated with standard neuropsychological measures, and our healthy control group performed better than the TBI group. The measure appears to be an ecologically valid measure of executive functioning.
Clothing Protection from Ultraviolet Radiation: A New Method for Assessment.

PubMed

Gage, Ryan; Leung, William; Stanley, James; Reeder, Anthony; Barr, Michelle; Chambers, Tim; Smith, Moira; Signal, Louise

2017-11-01

Clothing modifies ultraviolet radiation (UVR) exposure from the sun and has an impact on skin cancer risk and the endogenous synthesis of vitamin D. There is no standardized method available for assessing body surface area (BSA) covered by clothing, which limits generalizability between study findings. We calculated the body cover provided by 38 clothing items using diagrams of BSA, adjusting the values to account for differences in BSA by age. Diagrams displaying each clothing item were developed and incorporated into a coverage assessment procedure (CAP). Five assessors used the CAP and Lund & Browder chart, an existing method for estimating BSA, to calculate the clothing coverage of an image sample of 100 schoolchildren. Values of clothing coverage, inter-rater reliability and assessment time were compared between CAP and Lund & Browder methods. Both methods had excellent inter-rater reliability (>0.90) and returned comparable results, although the CAP method was significantly faster in determining a person's clothing coverage. On balance, the CAP method appears to be a feasible method for calculating clothing coverage. Its use could improve comparability between sun-safety studies and aid in quantifying the health effects of UVR exposure. © 2017 The American Society of Photobiology.
The reliability and validity of measurements of human dental casts made by an intra-oral 3D scanner, with conventional hand-held digital callipers as the comparison measure.

PubMed

Rajshekar, Mithun; Julian, Roberta; Williams, Anne-Marie; Tennant, Marc; Forrest, Alex; Walsh, Laurence J; Wilson, Gary; Blizzard, Leigh

2017-09-01

Intra-oral 3D scanning of dentitions has the potential to provide a fast, accurate and non-invasive method of recording dental information. The aim of this study was to assess the reliability of measurements of human dental casts made using a portable intra-oral 3D scanner appropriate for field use. Two examiners each measured 84 tooth and 26 arch features of 50 sets of upper and lower human dental casts using digital hand-held callipers, and secondly using the measuring tool provided with the Zfx IntraScan intraoral 3D scanner applied to the virtual dental casts. The measurements were repeated at least one week later. Reliability and validity were quantified concurrently by calculation of intra-class correlation coefficients (ICC) and standard errors of measurement (SEM). The measurements of the 110 landmark features of human dental casts made using the intra-oral 3D scanner were virtually indistinguishable from measurements of the same features made using conventional hand-held callipers. The difference of means as a percentage of the average of the measurements by each method ranged between 0.030% and 1.134%. The intermethod SEMs ranged between 0.037% and 0.535%, and the inter-method ICCs ranged between 0.904 and 0.999, for both the upper and the lower arches. The inter-rater SEMs were one-half and the intra-method/rater SEMs were one-third of the inter-method values. This study demonstrates that the Zfx IntraScan intra-oral 3D scanner with its virtual on-screen measuring tool is a reliable and valid method for measuring the key features of dental casts. Copyright © 2017 Elsevier B.V. All rights reserved.
Reliability of the hospital nutrition environment scan for cafeterias, vending machines, and gift shops.

PubMed

Winston, Courtney P; Sallis, James F; Swartz, Michael D; Hoelscher, Deanna M; Peskin, Melissa F

2013-08-01

According to ecological models, the physical environment plays a major role in determining individual health behaviors. As such, researchers have started targeting the consumer nutrition environment of large-scale foodservice operations when implementing obesity-prevention programs. In 2010, the American Hospital Association released a call-to-action encouraging health care facilities to join in this movement and improve their facilities' consumer nutrition environments. The Hospital Nutrition Environment Scan (HNES) for Cafeterias, Vending Machines, and Gift Shops was developed in 2011, and the present study evaluated the inter-rater reliability of this instrument. Two trained raters visited 39 hospitals in southern California and completed the HNES. Percent agreement, kappa statistics, and intraclass correlation coefficients were calculated. Percent agreement between raters ranged from 74.4% to 100% and kappa statistics ranged from 0.458 to 1.0. The intraclass correlation coefficient for the overall nutrition composite scores was 0.961. Given these results, the HNES demonstrated acceptable reliability metrics and can now be disseminated to assess the current state of hospital consumer nutrition environments. Copyright © 2013 Academy of Nutrition and Dietetics. Published by Elsevier Inc. All rights reserved.
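Percent agreement and the kappa statistic, both reported in this abstract, answer slightly different questions: the first is the raw share of items on which two raters agree, while Cohen's kappa discounts the agreement expected by chance. The following is a minimal Python sketch of the unweighted two-rater case; the function names and the example ratings are hypothetical and are not data from the HNES study.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which the two raters gave the same category."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters over the same items.

    Raises ZeroDivisionError if chance agreement is exactly 1
    (both raters always use one and the same category).
    """
    n = len(a)
    p_o = percent_agreement(a, b)                 # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the product of each rater's marginal proportions.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical yes/no ratings of four items by two raters (illustrative only).
rater_1 = ["yes", "yes", "no", "no"]
rater_2 = ["yes", "no", "no", "no"]
print(percent_agreement(rater_1, rater_2), cohen_kappa(rater_1, rater_2))
```

In this made-up example the raters agree on three of four items, yet kappa is noticeably lower than the raw agreement because half of that agreement would be expected by chance. This is one reason an instrument can show high percent agreement alongside only moderate kappa on some items.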
The Queensland high risk foot form (QHRFF) – is it a reliable and valid clinical research tool for foot disease?

PubMed Central

2014-01-01

Background Foot disease complications, such as foot ulcers and infection, contribute to considerable morbidity and mortality. These complications are typically precipitated by “high-risk factors”, such as peripheral neuropathy and peripheral arterial disease. High-risk factors are more prevalent in specific “at risk” populations such as diabetes, kidney disease and cardiovascular disease. To the best of the authors’ knowledge a tool capturing multiple high-risk factors and foot disease complications in multiple at risk populations has yet to be tested. This study aimed to develop and test the validity and reliability of a Queensland High Risk Foot Form (QHRFF) tool. Methods The study was conducted in two phases. Phase one developed a QHRFF using an existing diabetes foot disease tool, literature searches, stakeholder groups and expert panel. Phase two tested the QHRFF for validity and reliability. Four clinicians, representing different levels of expertise, were recruited to test validity and reliability. Three cohorts of patients were recruited; one tested criterion measure reliability (n = 32), another tested criterion validity and inter-rater reliability (n = 43), and another tested intra-rater reliability (n = 19). Validity was determined using sensitivity, specificity and positive predictive values (PPV). Reliability was determined using Kappa, weighted Kappa and intra-class correlation (ICC) statistics. Results A QHRFF tool containing 46 items across seven domains was developed. Criterion measure reliability of at least moderate categories of agreement (Kappa > 0.4; ICC > 0.75) was seen in 91% (29 of 32) tested items. Criterion validity of at least moderate categories (PPV > 0.7) was seen in 83% (60 of 72) tested items. Inter- and intra-rater reliability of at least moderate categories (Kappa > 0.4; ICC > 0.75) was seen in 88% (84 of 96) and 87% (20 of 23) tested items respectively. 
Conclusions The QHRFF had acceptable validity and reliability across the majority of items; particularly items identifying relevant co-morbidities, high-risk factors and foot disease complications. Recommendations have been made to improve or remove identified weaker items for future QHRFF versions. Overall, the QHRFF possesses suitable practicality, validity and reliability to assess and capture relevant foot disease items across multiple at risk populations. PMID:24468080
Development and Validation of a Family Meeting Assessment Tool (FMAT).

PubMed

Hagiwara, Yuya; Healy, Jennifer; Lee, Shuko; Ross, Jeanette; Fischer, Dixie; Sanchez-Reilly, Sandra

2018-01-01

A cornerstone procedure in Palliative Medicine is to perform family meetings. Learning how to lead a family meeting is an important skill for physicians and others who care for patients with serious illnesses and their families. There is limited evidence on how to assess best practice behaviors during end-of-life family meetings. Our aim was to develop and validate an observational tool to assess trainees' ability to lead a simulated end-of-life family meeting. Building on evidence from published studies and accrediting agency guidelines, an expert panel at our institution developed the Family Meeting Assessment Tool. All fourth-year medical students (MS4) and eight geriatric and palliative medicine fellows (GPFs) were invited to participate in a Family Meeting Objective Structured Clinical Examination, where each trainee assumed the physician role leading a complex family meeting. Two evaluators observed and rated randomly chosen students' performances using the Family Meeting Assessment Tool during the examination. Inter-rater reliability was measured using percent agreement. Internal consistency was measured using Cronbach α. A total of 141 trainees (MS4 = 133 and GPF = 8) and 26 interdisciplinary evaluators participated in the study. Internal reliability (Cronbach α) of the tool was 0.85. Number of trainees rated by two evaluators was 210 (MS4 = 202 and GPF = 8). Rater agreement was 84%. Composite scores, on average, were significantly higher for fellows than for medical students (P < 0.001). Expert-based content, high inter-rater reliability, good internal consistency, and ability to predict educational level provided initial evidence for construct validity for this novel assessment tool. Copyright © 2017 American Academy of Hospice and Palliative Medicine. All rights reserved.
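The internal consistency figure in this abstract is a Cronbach α. As a minimal sketch, assuming the usual formula α = k / (k - 1) * (1 - sum of item variances / variance of total scores), it can be computed in Python as follows. The function name cronbach_alpha and the example scores are hypothetical and do not come from the Family Meeting Assessment Tool data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items table of scores.

    scores: list of rows, one row per respondent, one score per item.
    Uses sample variances; fails if all total scores are identical.
    """
    k = len(scores[0])                                   # number of items
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])   # variance of summed scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical scores of three trainees on three items (illustrative only).
example = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(cronbach_alpha(example))
```

In the hypothetical table the three items rank the trainees identically, so the items are perfectly consistent and α reaches its maximum. Real rating data would show item-level disagreement and a lower value.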
Diagnostic reliability of MMPI-2 computer-based test interpretations.

PubMed

Pant, Hina; McCabe, Brian J; Deskovitz, Mark A; Weed, Nathan C; Williams, John E

2014-09-01

Reflecting the common use of the MMPI-2 to provide diagnostic considerations, computer-based test interpretations (CBTIs) also typically offer diagnostic suggestions. However, these diagnostic suggestions can sometimes be shown to vary widely across different CBTI programs even for identical MMPI-2 profiles. The present study evaluated the diagnostic reliability of 6 commercially available CBTIs using a 20-item Q-sort task developed for this study. Four raters each sorted diagnostic classifications based on these 6 CBTI reports for 20 MMPI-2 profiles. Two questions were addressed. First, do users of CBTIs understand the diagnostic information contained within the reports similarly? Overall, diagnostic sorts of the CBTIs showed moderate inter-interpreter diagnostic reliability (mean r = .56), with sorts for the 1/2/3 profile showing the highest inter-interpreter diagnostic reliability (mean r = .67). Second, do different CBTI programs vary with respect to diagnostic suggestions? It was found that diagnostic sorts of the CBTIs had a mean inter-CBTI diagnostic reliability of r = .56, indicating moderate but not strong agreement across CBTIs in terms of diagnostic suggestions. The strongest inter-CBTI diagnostic agreement was found for sorts of the 1/2/3 profile CBTIs (mean r = .71). Limitations and future directions are discussed. PsycINFO Database Record (c) 2014 APA, all rights reserved.
[Quality assurance in coding expertise of hospital cases in the German DRG system. Evaluation of inter-rater reliability in MDK expertise].

PubMed

Huber, H; Brambrink, M; Funk, R; Rieger, M

2012-10-01

The purpose of this study was to evaluate differences in the G-DRG results of a hospital case by 2 independently coding MDK raters. Inter-rater reliability between the 2 raters was calculated by examining the coding of individual hospital cases. The reasons for the non-agreement of the expert evaluations and suggestions to improve the process are discussed. From the expert evaluation pool of the MDK-WL a random sample of 0.7% of the 57,375 expert evaluations was taken. Equality of distribution with the parent population was tested by the χ² test or, respectively, Fisher's exact test. For the total of 402 individual hospital cases, the G-DRG case sums of 2 experts of the MDK were determined independently and the results checked for each individual case for agreement or non-agreement. The corresponding confidence intervals with standard errors were analysed to test if certain major diagnosis categories (MDC) were statistically significantly more affected by differing expert results than others. In 280 of the total 402 tested hospital cases, the 2 MDK raters independently reached the same G-DRG results; in 122 cases the G-DRG case sums determined by the 2 raters differed (agreement 70%; CI 65.2-74.1). Different DRG results between the 2 experts occurred regularly in the entire MDC spectrum. No MDC chapter in which significant differences between the 2 raters arose could be identified. The results of our study demonstrate an almost 70% agreement in the evaluation of hospital cost accounts by 2 independently operating MDK raters. This result leaves room for improvement. Potential for optimisation can be recognised on the basis of the results. Potential for improvement was established in combination with regular further training and the expansion of binding internal code recommendations as well as exchange of code-relevant information among experts in internal forums. The presented model is in principle suitable for cross-border examinations within the MDK system with the advantage that further trends could be uncovered by more variety and larger numbers of the randomly selected cases. © Georg Thieme Verlag KG Stuttgart · New York.
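The agreement figure in this abstract (280 of 402 cases) and its confidence interval can be checked with a short calculation. The sketch below assumes a 95% normal-approximation (Wald) interval; the abstract does not state which interval method or confidence level the authors used, so that choice and the function name wald_ci are assumptions made for illustration.

```python
import math


def wald_ci(successes, n, z=1.96):
    """Proportion with a normal-approximation (Wald) confidence interval.

    z = 1.96 corresponds to a two-sided 95% interval (an assumption here).
    Returns (proportion, lower bound, upper bound).
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


# Counts taken from the abstract: 280 agreeing cases out of 402.
p, lower, upper = wald_ci(280, 402)
print(f"{100 * p:.1f}% ({100 * lower:.1f}-{100 * upper:.1f})")
```

Under these assumptions the bounds round to 65.2 and 74.1, which is consistent with the interval reported in the abstract. Other interval methods, such as the Wilson score interval, would give slightly different bounds.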
Reliability and validity of the upper-body dressing scale in Japanese patients with vascular dementia with hemiparesis.

PubMed

Endo, Arisa; Suzuki, Makoto; Akagi, Atsumi; Chiba, Naoyuki; Ishizaka, Ikuyo; Matsunaga, Atsuhiko; Fukuda, Michinari

2015-03-01

The purpose of this study was to examine the reliability and validity of the Upper-body Dressing Scale (UBDS) for buttoned shirt dressing, which evaluates the learning process of new component actions of upper-body dressing in patients diagnosed with dementia and hemiparesis. This was a preliminary correlational study of concurrent validity and reliability in which 10 vascular dementia patients with hemiparesis were enrolled and assessed repeatedly by six occupational therapists by means of the UBDS and the dressing item of the Functional Independence Measure (FIM). Intraclass correlation coefficient was 0.97 for intra-rater reliability and 0.99 for inter-rater reliability. The level of correlation between UBDS score and FIM dressing item scores was -0.93. UBDS scores for paralytic hand passed into the sleeve and sleeve pulled up beyond the shoulder joint were worse than the scores for the other components of the task. The UBDS has good reliability and validity for vascular dementia patients with hemiparesis. Further research is needed to investigate the relation between UBDS score and the effect of intervention and to clarify sensitivity or responsiveness of the scale to clinical change. Copyright © 2014 John Wiley & Sons, Ltd.
Complexity of GPs' explanations about mental health problems: development, reliability, and validity of a measure

PubMed Central

Cape, John; Morris, Elena; Burd, Mary; Buszewicz, Marta

2008-01-01

Background How GPs understand mental health problems determines their treatment choices; however, measures describing GPs' thinking about such problems are not currently available. Aim To develop a measure of the complexity of GP explanations of common mental health problems and to pilot its reliability and validity. Design of study A qualitative development of the measure, followed by inter-rater reliability and validation pilot studies. Setting General practices in North London. Method Vignettes of simulated consultations with patients with mental health problems were videotaped, and an anchored measure of complexity of psychosocial explanation in response to these vignettes was developed. Six GPs, four psychologists, and two lay people viewed the vignettes. Their responses were rated for complexity, both using the anchored measure and independently by two experts in primary care mental health. In a second reliability and revalidation study, responses of 50 GPs to two vignettes were rated for complexity. The GPs also completed a questionnaire to determine their interest and training in mental health, and they completed the Depression Attitudes Questionnaire. Results Inter-rater reliability of the measure of complexity of explanation in both pilot studies was satisfactory (intraclass correlation coefficient = 0.78 and 0.72). The measure correlated with expert opinion as to what constitutes a complex explanation, and the responses of psychologists, GPs, and lay people differed in measured complexity. GPs with higher complexity scores had greater interest, more training in mental health, and more positive attitudes to depression. Conclusion Results suggest that the complexity of GPs' psychosocial explanations about common mental health problems can be reliably and validly assessed by this new standardised measure. PMID:18505616
A quick aphasia battery for efficient, reliable, and multidimensional assessment of language function.

PubMed

Wilson, Stephen M; Eriksson, Dana K; Schneck, Sarah M; Lucanie, Jillian M

2018-01-01

This paper describes a quick aphasia battery (QAB) that aims to provide a reliable and multidimensional assessment of language function in about a quarter of an hour, bridging the gap between comprehensive batteries that are time-consuming to administer, and rapid screening instruments that provide limited detail regarding individual profiles of deficits. The QAB is made up of eight subtests, each comprising sets of items that probe different language domains, vary in difficulty, and are scored with a graded system to maximize the informativeness of each item. From the eight subtests, eight summary measures are derived, which constitute a multidimensional profile of language function, quantifying strengths and weaknesses across core language domains. The QAB was administered to 28 individuals with acute stroke and aphasia, 25 individuals with acute stroke but no aphasia, 16 individuals with chronic post-stroke aphasia, and 14 healthy controls. The patients with chronic post-stroke aphasia were tested 3 times each and scored independently by 2 raters to establish test-retest and inter-rater reliability. The Western Aphasia Battery (WAB) was also administered to these patients to assess concurrent validity. We found that all QAB summary measures were sensitive to aphasic deficits in the two groups with aphasia. All measures showed good or excellent test-retest reliability (overall summary measure: intraclass correlation coefficient (ICC) = 0.98), and excellent inter-rater reliability (overall summary measure: ICC = 0.99). Sensitivity and specificity for diagnosis of aphasia (relative to clinical impression) were 0.91 and 0.95 respectively. All QAB measures were highly correlated with corresponding WAB measures where available. Individual patients showed distinct profiles of spared and impaired function across different language domains. In sum, the QAB efficiently and reliably characterized individual profiles of language deficits.
A quick aphasia battery for efficient, reliable, and multidimensional assessment of language function

PubMed Central

Eriksson, Dana K.; Schneck, Sarah M.; Lucanie, Jillian M.

2018-01-01

This paper describes a quick aphasia battery (QAB) that aims to provide a reliable and multidimensional assessment of language function in about a quarter of an hour, bridging the gap between comprehensive batteries that are time-consuming to administer, and rapid screening instruments that provide limited detail regarding individual profiles of deficits. The QAB is made up of eight subtests, each comprising sets of items that probe different language domains, vary in difficulty, and are scored with a graded system to maximize the informativeness of each item. From the eight subtests, eight summary measures are derived, which constitute a multidimensional profile of language function, quantifying strengths and weaknesses across core language domains. The QAB was administered to 28 individuals with acute stroke and aphasia, 25 individuals with acute stroke but no aphasia, 16 individuals with chronic post-stroke aphasia, and 14 healthy controls. The patients with chronic post-stroke aphasia were tested 3 times each and scored independently by 2 raters to establish test-retest and inter-rater reliability. The Western Aphasia Battery (WAB) was also administered to these patients to assess concurrent validity. We found that all QAB summary measures were sensitive to aphasic deficits in the two groups with aphasia. All measures showed good or excellent test-retest reliability (overall summary measure: intraclass correlation coefficient (ICC) = 0.98), and excellent inter-rater reliability (overall summary measure: ICC = 0.99). Sensitivity and specificity for diagnosis of aphasia (relative to clinical impression) were 0.91 and 0.95 respectively. All QAB measures were highly correlated with corresponding WAB measures where available. Individual patients showed distinct profiles of spared and impaired function across different language domains. In sum, the QAB efficiently and reliably characterized individual profiles of language deficits. PMID:29425241