Sociologický časopis / Czech Sociological Review 2025, 61(3): 277-300 | DOI: 10.13060/csr.2024.007
Options, Problems and Guidelines for Measuring Interrater Agreement – a Descriptive Approach
- Ústav experimentálnej psychológie, Slovenská akadémia vied, Bratislava
Interrater agreement is one way to establish reliability (and also validity) in social science research. The descriptive approach has traditionally been the preferred way of measuring interrater agreement owing to its simplicity, yet it is associated with a number of different agreement indices, which makes it difficult to select the right one. This article summarises the theoretical background of the prevailing approach used to measure interrater agreement (in both quantitative and qualitative research). From a practical point of view, the article focuses on measuring agreement with percent agreement, the kappa coefficient, and the AC1 coefficient. A more detailed description of the indices explains how to define, calculate, and interpret them and what problems accompany their use; the indices are then compared. Although underestimated and criticised, percent agreement can be a good indicator of interrater agreement. Several paradoxes accompany the kappa coefficient, whose use is therefore appropriate only under certain conditions; the AC1 coefficient is a suitable alternative. The article concludes with a summary of recommendations for improving the quantification of interrater agreement.
Keywords: interrater agreement, agreement index, percent agreement, kappa coefficient, AC1 coefficient
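To make the three indices named in the abstract concrete, the following is a minimal sketch (not the article's own procedure) that computes percent agreement, Cohen's kappa, and Gwet's AC1 for two raters coding the same items on a nominal scale, using the standard two-rater formulas from the agreement literature. The function name agreement_indices and the illustrative data are hypothetical.

```python
from collections import Counter

def agreement_indices(ratings_a, ratings_b):
    """Percent agreement, Cohen's kappa, and Gwet's AC1 for two raters
    who code the same items on a nominal scale."""
    assert len(ratings_a) == len(ratings_b) and ratings_a, "need paired ratings"
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    k = len(categories)

    # Observed (percent) agreement: share of items given identical codes.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Marginal category proportions for each rater.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    marg_a = {c: count_a[c] / n for c in categories}
    marg_b = {c: count_b[c] / n for c in categories}

    # Cohen's kappa: chance agreement from the product of the raters' marginals.
    pe_kappa = sum(marg_a[c] * marg_b[c] for c in categories)
    kappa = (p_o - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else float("nan")

    # Gwet's AC1: chance agreement from the averaged marginals pi_c.
    pi = {c: (marg_a[c] + marg_b[c]) / 2 for c in categories}
    pe_ac1 = sum(pi[c] * (1 - pi[c]) for c in categories) / (k - 1) if k > 1 else 0.0
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1) if pe_ac1 < 1 else float("nan")

    return {"percent_agreement": p_o, "kappa": kappa, "AC1": ac1}

# Hypothetical example with a skewed category distribution: 46 of 50 codes match.
rater_1 = ["yes"] * 45 + ["no"] * 5
rater_2 = ["yes"] * 44 + ["no"] * 1 + ["yes"] * 3 + ["no"] * 2
print(agreement_indices(rater_1, rater_2))
```

On this deliberately skewed example the sketch yields a percent agreement of 0.92, a kappa of roughly 0.46, and an AC1 of roughly 0.91, which illustrates the kappa paradox discussed in the article: with a highly uneven category distribution, kappa can be modest even though the raters agree on almost every item.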
Received: February 28, 2023; Revised: February 12, 2024; Accepted: February 26, 2024; Prepublished online: June 6, 2024; Published: September 1, 2025
Attachment: Apendix on-line-priloha-377-Kocisova20.pdf (online appendix)
This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0), which permits non-commercial use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.

