Many cancer diagnostic tests require a medical expert to classify a patient on an ordered categorical scale. Because the expert must interpret imperfect test results, such classifications involve subjectivity and estimation, which leads to discrepancies between experts' classifications. These discrepancies are often severe, even in common diagnostic procedures such as mammography and in the classification of breast density, an important predictor of breast cancer. This has motivated many large-scale studies that examine levels of agreement between experts in common diagnostic settings and investigate whether factors such as rater experience affect the consistency of ratings made by different experts. However, few statistical methods currently exist for assessing agreement in such large-scale studies. Our overall goals are twofold: (1) to develop novel and flexible statistical methods and agreement measures for assessing reliability in large-scale studies involving two or more medical experts using one or more diagnostic tests with ordered categorical scales, and (2) to use these methods to assess reliability in recently conducted large-scale breast cancer and breast density studies and to examine the impact of factors, such as rater experience and the patient's prior history, that can play important roles in reliability in these population-based settings. Because screening mammography is widely used in the community, conclusions drawn from our analyses of large-scale agreement studies in diagnostic testing will have significant and far-reaching implications for breast cancer screening and diagnosis. The proposed methods provide a novel and comprehensive approach to examining agreement in large-scale studies, with a focus on assessing and comparing agreement between experts who classify subjects on ordered categorical scales in diagnostic tests.
The methods developed will be made freely available and will be easy to implement in standard statistical software. Our analyses of large-scale cancer agreement studies using the proposed methods will provide new insights into the interpretive performance of radiologists in screening.