Sociologický časopis / Czech Sociological Review 2025, 61(3): 277-300 | DOI: 10.13060/csr.2024.007
Options, Problems and Guidelines for Measuring Interrater Agreement – a Descriptive Approach
- Ústav experimentálnej psychológie, Slovenská akadémia vied, Bratislava
Interrater agreement is one way to establish reliability (and also validity) in social science research. The descriptive approach has traditionally been the preferred way of measuring interrater agreement owing to its simplicity, yet it is associated with a number of different agreement indices, which makes it difficult to select the right one. This article summarises the theoretical background of the prevailing approach used to measure interrater agreement (in both quantitative and qualitative research). From a practical point of view, the article focuses on measuring agreement with percent agreement, the kappa coefficient, and the AC1 coefficient. A more detailed description of the indices explains how to define, calculate, and interpret them and what problems accompany their use; the indices are then compared. Although underestimated and criticised, percent agreement can be a good indicator of interrater agreement. Several paradoxes accompany the kappa coefficient, whose use is therefore appropriate only under certain conditions; the AC1 coefficient is a suitable alternative. The article concludes with a summary of recommendations for improving the quantification of interrater agreement.
Keywords: interrater agreement, agreement index, percent agreement, kappa coefficient, AC1 coefficient
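To make the three indices named in the abstract concrete, the following is a minimal sketch (not the article's own procedure) that computes percent agreement, Cohen's kappa, and Gwet's AC1 for two raters coding the same items on a nominal scale, using the standard two-rater formulas from the agreement literature. The function name agreement_indices and the illustrative data are hypothetical.

```python
from collections import Counter

def agreement_indices(ratings_a, ratings_b):
    """Percent agreement, Cohen's kappa, and Gwet's AC1 for two raters
    who code the same items on a nominal scale."""
    assert len(ratings_a) == len(ratings_b) and ratings_a, "need paired ratings"
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    k = len(categories)

    # Observed (percent) agreement: share of items given identical codes.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Marginal category proportions for each rater.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    marg_a = {c: count_a[c] / n for c in categories}
    marg_b = {c: count_b[c] / n for c in categories}

    # Cohen's kappa: chance agreement from the product of the raters' marginals.
    pe_kappa = sum(marg_a[c] * marg_b[c] for c in categories)
    kappa = (p_o - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else float("nan")

    # Gwet's AC1: chance agreement from the averaged marginals pi_c.
    pi = {c: (marg_a[c] + marg_b[c]) / 2 for c in categories}
    pe_ac1 = sum(pi[c] * (1 - pi[c]) for c in categories) / (k - 1) if k > 1 else 0.0
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1) if pe_ac1 < 1 else float("nan")

    return {"percent_agreement": p_o, "kappa": kappa, "AC1": ac1}

# Hypothetical example with a skewed category distribution: 46 of 50 codes match.
rater_1 = ["yes"] * 45 + ["no"] * 5
rater_2 = ["yes"] * 44 + ["no"] * 1 + ["yes"] * 3 + ["no"] * 2
print(agreement_indices(rater_1, rater_2))
```

On this deliberately skewed example the sketch yields a percent agreement of 0.92, a kappa of roughly 0.46, and an AC1 of roughly 0.91, which illustrates the kappa paradox discussed in the article: with a highly uneven category distribution, kappa can be modest even though the raters agree on almost every item.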
Received: February 28, 2023; Revised: February 12, 2024; Accepted: February 26, 2024; Prepublished online: June 6, 2024; Published: September 1, 2025
Attachment: Apendix on-line-priloha-377-Kocisova20.pdf (online appendix)
This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0), which permits non-commercial use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.

