Main

Breast screening aims to find cancers early, when treatment is more likely to be successful. The NHS Breast Screening Programme (NHSBSP) invites women aged 50 to 70 years for mammographic screening every 3 years; women aged >70 years can self-refer. Each mammogram is read by two mammography readers (radiologists, breast clinicians, associate specialists, consultant radiographers or enhanced-level practitioners), who decide whether to recall the woman for additional follow-up. An arbitration process takes place to reach a final decision about whether to recall a woman, either when the two mammography readers disagree or for all recalled women, depending on local screening center processes. A radiologist workforce crisis in the UK threatens the long-term sustainability of the NHSBSP. The UK has a 30% shortfall of clinical radiologists and this is forecast to rise to 40% by 20281. Artificial intelligence (AI)-assisted screening could play an important role in future-proofing the UK's NHSBSP.

AI has shown promise as a standalone tool for detecting breast cancer in breast screening. It could reduce mammography reader workload while potentially improving outcomes. AI could be used in breast screening in several ways: (1) as a standalone tool to replace one of the human readers; (2) as decision support for readers; (3) as a triage tool, using the AI estimate of cancer likelihood to reduce screening by human readers for low-risk cases and increase it for high-risk cases; or a combination of these approaches2.

A systematic review3 of standalone AI performance for breast cancer detection at screening concluded that it performed as well as or better than radiologists. However, it is less clear how radiologists perform when interacting with AI tools in situations closer to real-world screening. An important aspect of breast screening in the UK is the use of double reading with arbitration. A pilot of an independent external validation process4 concluded that there needed to be a clinical validation of the impact of AI on the decisions made by radiologists during arbitration. This study addresses this issue by performing a large retrospective study including mammograms and clinical data from 50,000 women at two screening centers. There have been retrospective studies that have simulated the arbitration outcome with AI2,5,6,7. In addition, a small reader study (278 cases)8 looked at the impact on arbitration; however, this was not clinically relevant to the NHSBSP because of its enriched dataset, its use of both ultrasound and mammography, and its different arbitration processes. The study presented here uses large-scale retrospective data in a reader study in which radiologists and consultant radiographers prospectively arbitrated cases, using arbitration protocols relevant to the NHSBSP.

The AI system used in this evaluation was created by Google (v1.2, Google LLC) and is an updated version of the v1.0 model9. Further details of the AI model are given in Methods and the relevant paper10. In the human arm of the study, the workflow is based on the recall decisions of the two historical mammography readers (referred to as the first and second human readers). In the AI arm, the workflow is based on the recall decision of the first historical human reader and the AI tool. To determine the impact of the AI tool on arbitration, the arbitration criteria at each center were applied. Any cases requiring arbitration were read by pairs of readers; the arbitration decisions were made in a reader study with 22 readers. Because historical data were used, the reader study results did not affect the clinical care of the patients.
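The routing of a case to arbitration differs between the two arms only in who provides the second opinion. A minimal sketch of the two centers' routing rules as described in this study (function and parameter names are our own):

```python
def needs_arbitration(reader1_recall: bool, second_recall: bool, center: int) -> bool:
    """Whether a case goes to arbitration under each center's protocol,
    as described in the study: center 1 arbitrates discordant recall
    opinions only; center 2 arbitrates every recalled case, concordant
    or discordant. `second_recall` is the second human reader's opinion
    (human arm) or the AI tool's recall suggestion (AI arm).
    """
    if center == 1:
        return reader1_recall != second_recall
    return reader1_recall or second_recall
```

For example, a discordant pair of opinions triggers arbitration at both centers, whereas a concordant recall triggers arbitration only at center 2.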

An overview of the study is given in Fig. 1. The study has several key strengths. The long-term (>3 years) follow-up enabled an assessment of whether an AI-assisted pathway can detect cancers earlier than standard care. The cancer locations and recall decisions after arbitration were annotated on the mammograms, allowing the localization accuracy of the AI tool and the human readers to be assessed before and after arbitration. The study included two large screening centers with different arbitration criteria, which allowed us to understand whether clinical pathway variability in current practice has an impact on cancer detection with AI. Finally, the scale of the study allowed analysis by subgroups such as ethnicity, age, index of multiple deprivation, X-ray system manufacturer, cancer grade, invasive status and breast density.

Fig. 1: Overview of study design and aims.

a–e, Study design (a), subgroup analysis (b), metrics (c), follow-up (d) and flowchart of data (e). Selection was for two clinical centers: center 1 (left) where discordant recalls are arbitrated and center 2 (right) where discordant and concordant recalls are arbitrated (n = 50,000 women).


Results

Study population

The study included mammography images and clinical data for 50,000 representative women from two NHSBSP screening centers in London (25,000 from each center), selected from the OPTIMAM Mammography Image Database OMI-DB11. Further details of the data selection are given in Methods. Of the 50,000 women, 4,354 (8.7%) were excluded due to the AI exclusion criteria (technical recalls, cases containing more or fewer than four images, and implants) and 44 (0.1%) were excluded due to insufficient or conflicting clinical information. After exclusions, 45,602 women were included. A flowchart of the data selection and an overview of the study are shown in Fig. 1e. The distributions of age, ethnicity, X-ray system manufacturer and cancer grade are given in Table 1.

Table 1 Description of study population from each screening center

Sensitivity, specificity, recall rate and cancer detection rate

Table 2 shows the overall performance. The sensitivity and specificity in the AI arm were 1.2% (95% confidence interval (CI) āˆ’0.7%, 3.2%) and 0.3% (95% CI 0.0%, 0.6%) higher than in the human arm, respectively. The sensitivity and specificity of the AI arm were noninferior (5% margin) to the human arm (P < 0.001). This satisfied the prespecified noninferiority endpoint. There was no significant difference in the recall rate between the AI arm and the human arm (AI versus human difference: āˆ’0.3% (95% CI āˆ’0.6%, āˆ’0.0%), P = 0.076). There was no significant difference in the cancer detection rate between the AI arm and the human arm (AI versus human difference 0.0% (95% CI āˆ’0.0%, 0.1%), P = 0.299).
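The noninferiority comparisons above rest on Wald's interval for paired binary data (each woman is read in both arms). A minimal sketch of that calculation, with hypothetical discordant counts rather than the study's data:

```python
import math

def paired_wald_diff(b, c, n, z=1.96):
    """Wald 95% interval for the difference between two paired proportions.

    b: cases correct in the AI arm but not the human arm,
    c: cases correct in the human arm but not the AI arm,
    n: number of paired cases.
    Returns (difference, lower bound, upper bound).
    """
    d = (b - c) / n
    se = math.sqrt(b + c - (b - c) ** 2 / n) / n
    return d, d - z * se, d + z * se

def noninferior(b, c, n, margin=0.05):
    """Noninferiority holds if the lower CI bound exceeds -margin."""
    _, lower, _ = paired_wald_diff(b, c, n)
    return lower > -margin
```

With illustrative counts b = 60, c = 52 over n = 682 positive cases, the difference is about +1.2% and the lower bound clears a 5% margin, mirroring the shape of the sensitivity result reported above.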

Table 2 Overall performance metrics following arbitration

Workload metrics

The impact on workload was considered separately for each center because they had different arbitration protocols. Center 1 arbitrated discordant recalls only, whereas center 2 arbitrated all recall decisions (discordant and concordant). As seen in Table 3, at both centers the number of human screen readings in the AI arm was 50% lower than in the human arm, due to AI replacing the second reader. In addition, the AI excluded 4,354 cases (8.7%), which would still need to be read by two human readers, resulting in an overall 46% lower number of screen readings in the AI arm. However, the number of arbitrations required in the AI arm was 142% and 22% higher than in the human arm, for center 1 and center 2, respectively. Different professional groups may perform these different tasks, although this varies by screening center.

Table 3 Workload metrics following arbitration

Impact of arbitration

Arbitration decisions in the reader study were made by pairs of readers, who together reached a consensus opinion on whether the woman needed to be recalled for further assessment, drawing a region of interest (ROI) around the area prompting recall. This mimicked the arbitration process performed clinically. The arbitration process is described in detail in Methods.

Before arbitration the AI tool had similar sensitivity and specificity at both centers (Fig. 2a,b). The sensitivity and specificity of the first and second human readers differed between the two centers before arbitration due to their different arbitration practices. After arbitration the difference between the two centers disappeared.

Fig. 2: Comparison of performance before and after arbitration.

a,b, Sensitivity and false-positive rate (1 āˆ’ specificity) of the AI read10 as an ROC curve (line), at the operating point used in the study (yellow), first (blue) and second (turquoise) historical reader before arbitration, and AI arm (blue and yellow) and human arm (blue and turquoise) after arbitration, for center 1 (a) (n = 23,019 women) and center 2 (b) (n = 22,583 women), respectively. Error bars for sensitivity and specificity values are Wald's CI. c, Sensitivity by time of cancer detection before and after arbitration for the AI and human arms, and specificity before and after arbitration for the human and AI arms, for both sites combined. Numbers on top of the bars are the sensitivity or specificity for that bar and error bars are Wald's CI (n = 45,602 women). d, Outcome of comparing the ROIs shown by AI at the ground-truth location, for the 84 positive cases that AI correctly recalled on a case level before arbitration, but the reader pair overruled at arbitration (n = 84 women).


The sensitivity after arbitration was 48.0% in the human arm and 49.2% in the AI arm (Table 2). This may appear low, but a sensitivity of around 50% is expected, because of the long period of follow-up (39 months) used. This means that, in addition to screen-detected cancers, positive cases include interval cancers (symptomatic after negative screen) and cancers detected at the next screening round (3 years later). These would not have been detected by human readers at screening. We included the next-round cancers and interval cancers to assess whether the AI tool detects these cancers earlier than humans. The AI arm sensitivity was 92.3% for screen-detected cancers, 8.8% for interval cancers and 8.1% for next-round cancers (Table 4).

Table 4 Sensitivity and specificity before and after arbitration

Figure 2c and Table 4 compare the sensitivity and specificity before and after arbitration. Before arbitration, recall by either reader was considered a recall decision for the case. Arbitration aims to improve specificity with minimal loss in sensitivity; therefore, as expected, specificity is higher and sensitivity lower after arbitration in both arms. However, the change in sensitivity with arbitration differs by time of detection. Before arbitration, the sensitivity for interval and next-round cancers was higher in the AI arm (32.4% and 34.0%, respectively) than in the human arm (15.4% and 12.8%, respectively). After arbitration, however, the sensitivities for interval and next-round cancers were similar in the AI arm (8.8% and 8.1%) and the human arm (5.9% and 5.5%, respectively).

Understanding cases that are overruled at arbitration

There were 93 positive cases that AI correctly recalled at a case level, but that the reader pair overruled at arbitration. Of these, 13 were screen-detected cancers, 28 interval cancers and 52 next-round cancers.

Most (94.5%) of the positive cases in the study had the ground-truth location annotated by expert radiologists or consultant radiographers who did not participate in the study. To understand the drop in sensitivity during arbitration, we reviewed these 93 positive cases that AI correctly recalled on a case level, but the reader pair overruled at arbitration.

We compared the ROIs drawn at arbitration to the ground-truth location. For 9 women there was no ground-truth location annotated, leaving 84 women. As shown in Fig. 2d, for 24 women the cancer was correctly localized in either one or both views and there were no other AI boxes. For 20 women the cancer was localized in at least one view but there were other AI boxes. For four women with more than one cancer, one cancer was correctly localized by the AI tool and the other was not. For 36 women, the cancer was not correctly localized by the AI tool. The arbitration readers were 'very dissatisfied' or 'somewhat dissatisfied' with AI's assessment for 87% of these 93 cases, compared with 48% of AI arm-arbitrated cases overall. Furthermore, 75.3% of these 93 cases had prior images, compared with 50.5% for cases that went to arbitration in the AI arm of the study. Of these cases with prior images, readers said that the priors changed their decision to recall or not for 65.7% of cases, compared with 60.0% of all cases with priors that went to arbitration in the AI arm.

Subgroups

The sensitivity and specificity of the two arms were calculated for each of the subgroups: X-ray system manufacturer, breast density category, age, type of screen (first or subsequent), index of multiple deprivation (IMD) and ethnicity (Fig. 3). We observed no notable differences in performance between the AI arm and the human arm across the subgroups tested. Two subgroups where the AI arm sensitivity was lower than in the human arm were Siemens (AI versus human difference: āˆ’5.7 (95% CI āˆ’14.3, 0), n = 35) and 'not specified' ethnicity (AI versus human difference: āˆ’6.0 (95% CI āˆ’12.7, 0.6), n = 83). Both were small groups with large error bars, limiting the strength of the statistical conclusions possible. For 25 out of 29 subgroups, the sample size was <300 positive cases, affecting the accuracy of the sensitivity calculations.

Fig. 3: Performance by subgroup.

Top: sensitivity and specificity for the human arm (purple triangle) and AI arm (turquoise cross) with CIs, and difference in sensitivity and specificity (green circle) between arms for each of the subgroups for age, ethnicity, IMD, density, X-ray manufacturer and screen type (n = 45,602 women). Bottom: sensitivity for the human arm (purple triangle) and AI arm (turquoise cross) and difference in sensitivity (green circle) between arms for each of the subgroups for cancer type, invasive grades, in situ grades, lesion type and invasive lesion size. All characteristics are from the time of diagnosis of the cancer. Error bars for sensitivity and specificity values are Wald's CI for groups with >50 cases and bootstrapped for groups with <50 cases. Data for differences are presented as the midpoint, lower and upper value of Wald's interval for paired binary data. The P values with an asterisk (*) indicate a value <0.05 for AI arm superiority compared to the human arm, determined using Obuchowski's extension of the two-sided McNemar's test for clustered data. All other P values are for AI arm noninferiority compared to the human arm using Wald's noninferiority test for paired binary data. The gray area in the difference column is the region below the 5% noninferiority margin. Adjustment for multiple comparisons was not performed and there is no defined level of significance for any P values (n = 724 lesions from 682 women; positive cases with both NBSS data and expert radiologist annotation data). Subgroups where the characteristic is not specified in the National Breast Screening System (NBSS), or subgroups with one case for sensitivity, are not displayed in this figure for clarity, but are included in the source data.


Sensitivity was also calculated for each cancer subgroup: type, grade, lesion type and size (Fig. 3). We observed no notable differences in performance between the AI arm and the human arm across the subgroups tested. When split by subgroups, the numbers in the subgroups are low (for 11 out of 14 subgroups the sample included <300 positive cases) and in some subgroups there were <20.

Localization performance

The proportions of cases with varying numbers of AI-generated suspicious ROIs across the four images in each case are shown in Fig. 4a. Most negative cases (93.0%) had no ROI; among those with ROIs, the number ranged from one to nine. For cases with screen-detected cancer, 94.3% had at least one ROI, most had two ROIs and the number ranged from one to ten. Interval cancers and next-round cancers had a distribution of ROIs more similar to negative cases than to screen-detected cancers, with 71.7% of cases having no ROI. Of the negative cases with ROIs, most had one ROI (Fig. 4b).

Fig. 4: Localization analysis.

a, The percentage of women with each number of AI ROIs for screen-detected cancers, interval cancers, next-round screen-detected cancers and negative cases, respectively (n = 45,602 women). b, The number of AI ROIs per woman for positive and negative cases that had at least one ROI (n = 45,602 women). c, Sensitivity before arbitration for the AI tool by time of cancer detection for centers 1 and 2 on case level and lesion level (using AI ROI). Error bars for sensitivity values are Wald’s CI for case level and lesion level. d, FROC for the AI arm and human arm for center 2 only where all recalled cases are arbitrated, so ROI available for analysis (using ROI drawn at arbitration; n = 723 women).


The sensitivity of the AI tool before arbitration by the time of cancer detection, at case and lesion levels, is given in Fig. 4c. At a case level, the sensitivity of the AI tool is 27% and 29% for interval and next-round cancers, respectively; at a lesion level this reduces to 16% and 15%. This drop in sensitivity at the lesion level compared to the case level indicates that the AI tool marks ROIs in the incorrect location for around half of such cases, and highlights the importance of performing localization analysis in studies. The false-positive AI ROI rate was 0.12 per case for all cases, 0.12 per case for negative cases and 0.52 per case for positive cases. The false-positive AI ROI rate was 1.27 per case for the positive cases that the AI correctly recalled on a case level, but the reader pair overruled at arbitration.
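The gap between case-level and lesion-level sensitivity comes down to how an AI ROI is matched against the annotated cancer location. The study's exact matching rule is not given here; the sketch below uses one common criterion, center-of-ROI containment, as a stated assumption:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in image pixels

def center_in_box(roi: Box, truth: Box) -> bool:
    """True if the predicted ROI's center lies inside the ground-truth box.

    Center containment is one common localization criterion; the study's
    actual matching rule may differ.
    """
    cx, cy = (roi[0] + roi[2]) / 2, (roi[1] + roi[3]) / 2
    return truth[0] <= cx <= truth[2] and truth[1] <= cy <= truth[3]

def case_and_lesion_hit(ai_rois: List[Box], truths: List[Box]) -> Tuple[bool, bool]:
    """Case-level hit: any ROI at all (the case is recalled).
    Lesion-level hit: at least one ROI matches an annotated lesion."""
    case_hit = len(ai_rois) > 0
    lesion_hit = any(center_in_box(r, t) for r in ai_rois for t in truths)
    return case_hit, lesion_hit

# An ROI far from the lesion recalls the case but fails localization:
rois = [(300.0, 300.0, 400.0, 400.0)]
truth = [(0.0, 0.0, 100.0, 100.0)]
print(case_and_lesion_hit(rois, truth))  # (True, False)
```

A case like the one printed above counts toward case-level sensitivity but not lesion-level sensitivity, which is exactly the pattern behind the 27% versus 16% figures.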

The JAFROC figure of merit was 71.4 (95% CI 68.8, 74.0) for the AI arm and 70.5 (67.9, 73.1) for the human arm. The difference in the figure of merit was 0.88 (95% CI āˆ’0.63, 2.38), which was not statistically significant (P = 0.255). Figure 4d shows the free response operating characteristic (FROC) curve.

Human factor analysis

In the post-study surveys completed by 21 of 22 readers, most users reported that, by the end of the study, they 'somewhat trusted' the information provided by the tool, on a 5-point scale of 'did not trust at all' (1 of 21), 'somewhat trusted' (15 of 21), 'moderately trusted' (4 of 21), 'very trusted' (1 of 21) and 'extremely trusted' (0 of 21). In an open-ended question asking in which situations, if any, they thought the AI was unreliable, over half the participants stated that the AI was unreliable because of overcalling calcifications and for cases with prior images (12 of 21 for each statement, with 6 participants citing both).

Discussion

There are a growing number of studies assessing the performance of AI compared to historical human reads in breast cancer screening using retrospective data12,13,14,15,16. However, such studies do not take arbitration into account. Some retrospective studies have simulated arbitration2,5,6,7; however, these cannot fully predict how human reader opinions will be affected at arbitration when both human and AI opinions are considered to reach a consensus. Here, a reader study assessed the impact of replacing the second reader with AI, including arbitration.

Overall, on a case level, after arbitration, replacing the second reader with an AI read (AI arm) was noninferior for sensitivity and specificity, compared to two human readers (human arm). The AI arm had a slightly higher cancer detection rate and a slightly lower recall rate than the human arm, but these differences were not statistically significant.

There are seven aspects that we considered and discuss in this section. First, we considered the workload implications of replacing a second reader with AI. The human reading workload at screening in the AI arm was 46% lower than in the human arm because the AI tool replaced the second reader. However, the arbitration rate in the AI arm compared to the human arm increased from 3.9% to 9.4% at center 1 and from 11.3% to 13.8% at center 2. These increases, although different between centers due to the different arbitration approaches, are broadly in line with the simulations of arbitration in previous studies6,7. In addition, 8.7% of cases were not read by the AI tool because they fell within the AI tool's exclusion criteria. These cases would need to be read by two human readers if the tool were to be used clinically. Therefore, the resources and infrastructure to facilitate this different workflow would need to be considered before deployment. Taking all of this into account, the overall reductions in reading time for centers 1 and 2 were 36% and 44%, respectively, under the simplified assumption that an arbitration read takes about 5Ɨ as long as a single read17. A full cost analysis was not performed here; however, such an analysis using information from this study could be useful for planning prospective trials and the introduction of AI into the NHSBSP.

Second, we considered whether replacing a second reader with AI leads to breast cancers being detected earlier. A strength of this study is the longitudinal nature of the retrospective data, which allows us to assess performance by time of detection of positive cases. Before arbitration, the AI tool had similar sensitivity to first and second human readers for screen-detected cancers and a higher sensitivity than first and second human readers for interval cancers and next-round cancers. This implied that the AI tool could detect cancers earlier than two humans and we investigated whether this prevailed through arbitration. Arbitration improved specificity at both centers in both arms. For screen-detected cancers, there was a small decrease in sensitivity after arbitration in both arms. For interval cancers and next-round cancers, there was a larger reduction in sensitivity after arbitration and the decrease was larger for the AI arm than the human arm. Consequently, there was not a notable difference in sensitivity between the two arms for interval cancers and next-round cancers after arbitration. Therefore, after arbitration, replacing the second reader with AI did not result in cancers being detected earlier.

Third, we considered why some correct AI prompts were dismissed at arbitration. Arbitration here succeeded in improving specificity with minimal reduction in sensitivity. However, some cases (93 out of 8,732 arbitrations) were correctly recalled by the AI tool but overruled at arbitration. These were primarily (86%) interval cancers and next-round cancers, and therefore challenging cases that were not recalled clinically. For half of the cases, the AI tool correctly localized the cancer (albeit half of these also had other ROIs in incorrect areas). For the other half, the AI tool did not correctly localize the cancer, which would explain why it was overruled. A higher proportion of these cases had prior images (75.3%) compared with all arbitrated cases (48.4%), and the readers indicated that the prior changed their decision 71% of the time. Some cases may therefore have been overruled because the readers saw no change between the current and prior mammograms and/or knew that the AI tool does not analyze prior images. Alternatively, the issue may be more fundamental: readers may not be able to see why the AI tool recommends recall.

The false-positive AI ROI rate was 0.12 per case for negative cases. This is much better than in traditional computer-aided diagnosis; for instance, a previous prospective study in the UK NHSBSP17 found a false-positive AI ROI rate of 1.59 per case. Here for positive cases the false-positive AI ROI rate was 0.52 per case. On the positive cases that were correctly recalled by the AI tool, but were overruled at arbitration, there were nearly 3Ɨ more false-positive ROIs than for positive cases where human arbitration agreed with the AI decision to recall. These superfluous ROIs may be a contributing factor to the readers overruling the AI tool for some positive cases.

These 93 incorrectly overruled cases need to be weighed against the 2,307 out of 3,124 (73.8%) times that human arbitration correctly overruled the AI tool when it recalled a negative case. It should also be noted that 8,148 of the 8,732 arbitrations were negative cases, highlighting the challenge that arbitration readers faced in identifying the true positive cases among such large numbers of arbitration reads. Arbitration is therefore a very challenging task, but the readers achieved its aim: improving specificity with only a small loss of sensitivity.

Fourth, we considered whether these results change with different subgroups. Sensitivity was analyzed across subgroups. The AI arm performed better than the human arm for Hologic images, but the opposite was true for Siemens images (although with a smaller number of cases). This was found before arbitration10 and persisted after arbitration (albeit with small numbers). Only 0.9% of the images used to train the AI tool were Siemens images. In the NHSBSP, the proportion of machines by manufacturer is 1% Fuji, 31% GE, 52% Hologic and 16% Siemens (J. Loveland, personal communication, National Co-ordinating Centre for the Physics of Mammography, 26 February 2025). The wide range of image processing used in the NHSBSP could also affect AI performance18. It has been shown that software upgrades by an X-ray vendor caused a change in AI performance19. This highlights the need to evaluate the performance of AI tools by X-ray system manufacturers and also to monitor the performance over time once deployed. There may also be a need to re-tune or re-train AI tools on deployment at a new site or after equipment upgrade.

Fifth, we considered whether the results differ by screening workflow. NHSBSP guidance recommends that breast-screening services should determine their local reading policy (Section 1.1.7 in https://www.gov.uk/government/publications/breast-screening-guidance-for-image-reading/breast-screening-guidance-for-image-reading). The two arbitration protocols here are the two main methods used in the NHSBSP and allowed the impact of an AI reader in different arbitration protocols to be examined. Before arbitration, the performance of readers at the two centers was very different due to the different clinical workflow used (Fig. 2a,b), whereas AI tool performance was very similar at both centers. This contributed to the larger increase in arbitrated cases with the introduction of AI at one center. After arbitration, the difference between the centers disappeared. Therefore, AI vendors should consider carefully the operating point or threshold of their tool at each screening center and how this complements the workflow and arbitration practice.

Sixth, we considered the strengths and limitations of the study design. The main strength of this study was the availability of long-term follow-up, allowing us to investigate whether the AI arm could detect cancers earlier than the human arm. Having the locations of cancerous lesions, along with the AI ROIs and the ROIs drawn by readers at arbitration, allowed analysis at a lesion level. Although other studies have looked at localization accuracy12,20, our study compared localization accuracy at arbitration. The AI tool excluded 8.7% of women. The proportions of episode outcomes, age, ethnicity, IMD, X-ray system manufacturer and type of screen (first or subsequent) were not significantly different between the included and the excluded women.

A limitation is that, for 25.6% of women at center 1 and 7.5% of women at center 2, the paperwork used in the study had the historical human reader opinion but no location data. If these 954 women were excluded from the study, the results remained unchanged; that is, overall sensitivity and specificity remained noninferior. Ethnicity data were grouped according to categories taken from the NHS Data Model and Dictionary Service (https://digital.nhs.uk/services/nhs-data-model-and-dictionary-service#top). Ideally, more detailed ethnicity subgroups would be analyzed, but the amount of data in each subgroup was too small, despite the study including 50,000 women overall; this highlights the need for enriched as well as representative datasets when investigating performance by ethnicity. This study used data from the NHSBSP, where screening is performed every 3 years. In countries where screening is biennial or annual, there will be fewer interval cancers due to more frequent screening, which could affect the generalizability of these results to those programs.

In this study, the same noninferiority margin was used for specificity and sensitivity, as in other studies6,21. This did not impact the results because specificity was similar in both arms. However, because of the larger number of negative cases compared with positive cases, the study could have been powered for a smaller noninferiority margin in specificity. Having different noninferiority margins for sensitivity and specificity should be considered in future. It was not possible to blind readers to the study arm because the AI output was overlaid on the images and the human readers' decisions appeared on the paperwork. However, this is clinically realistic because it is how the images would be read clinically. This may introduce bias, but there will also be bias in a real-world setting, and this bias may change over time through additional training and daily experience with AI. Finally, by re-reading the standard care arm (human arm), rather than using historical clinical outcomes as a comparator, both arms were read similarly, reducing any laboratory effect.

Finally, we considered whether there is a better way in which the AI tool could be used. Some interval cancers and cancers detected at the next screening round were recalled by AI but overruled at arbitration, despite the AI localizing the cancer correctly in around half of these cases. This may be overcome by displaying the type of finding localized (mass, calcification), incorporating prior images into the model prediction, reducing false-positive prompts to prevent readers being distracted, providing more explainability in the model output and providing AI confidence values for each ROI.

Post-study surveys completed by 21 of 22 readers indicated that, as well as changes to the AI tool listed above, training of readers may help readers understand better when to trust the AI tool. Feedback about AI predictions and actual cancer outcomes may help readers interpret AI predictions better, especially around superfluous ROIs as seen in other AI-assisted reader settings22. The human factor post-study survey in this study was exploratory. Further work could include a more extensive questionnaire study design to assess participants’ views, including on wider workforce considerations.

In addition, the operating point or threshold of the AI tool could be altered. The operating point of the AI tool determines its sensitivity and specificity and therefore the arbitration rate. The prespecified operating point in this study was selected using a tuning dataset, based on set rules10, which struck a balance across sensitivity, specificity and arbitration rates. These rules were pre-agreed between the members of the research team. It is possible to construct a different set of rules that weigh the target metrics differently and select a different operating point that could be, for example, more specific, to reduce the number of arbitrations required. In a live screening deployment, the screening center staff would agree to the operating point selected.
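To illustrate how an operating point is selected from a tuning set, the sketch below applies a deliberately simple, hypothetical rule: meet a target specificity and accept whatever sensitivity follows. The study's actual pre-agreed rules balanced sensitivity, specificity and arbitration rate, so this is an illustration of the mechanism, not the method used:

```python
import math

def pick_operating_point(scores, labels, target_specificity=0.96):
    """Pick the score threshold meeting a target specificity on a tuning set.

    scores: AI cancer-likelihood scores; labels: True for cancer cases.
    This single-metric rule is hypothetical and simpler than the study's
    pre-agreed rules.
    """
    neg = sorted(s for s, y in zip(scores, labels) if not y)
    pos = [s for s, y in zip(scores, labels) if y]
    idx = math.ceil(target_specificity * len(neg)) - 1
    thr = neg[idx]  # at least the target fraction of negatives score <= thr
    sensitivity = sum(s > thr for s in pos) / len(pos)
    specificity = sum(s <= thr for s in neg) / len(neg)
    return thr, sensitivity, specificity
```

Raising the target specificity moves the threshold up, reducing recalls and therefore arbitrations, at the cost of sensitivity; this is the trade-off a screening center would weigh when agreeing an operating point.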

Finally, there could be alternative ways of incorporating the AI tool into screening. Alternatives could include sending cases where the AI tool had very high confidence straight to recall, avoiding arbitration15, or adaptive screening, offering certain groups earlier follow-up23 or additional imaging24. In any case, a more effective human–AI interface is needed to optimize implementation of AI tools in breast screening. We hope that the lessons learnt from this study will help inform upcoming prospective clinical trials investigating deployment of AI tools in breast screening, such as the EDITH trial (https://www.gov.uk/government/news/world-leading-ai-trial-to-tackle-breast-cancer-launched).

In conclusion, this study explored the effect of introducing AI as a second reader in a double-reader workflow, crucially including the process of specialist arbitration. It showed that AI-enabled reading was noninferior to a standard two-reader workflow. It highlighted that, when replacing a second reader with AI, overall reading workload was reduced. Further development of the AI tool alongside improvement in explainability and acceptance of the tool by mammography readers could lead to the detection of cancers earlier than with two human readers.

Methods

This study is part of the Artificial Intelligence in Mammography Screening (AIMS) study. The AIMS study protocol was approved by East Midlands Nottingham Research Ethics Committee (no. 22/EM/0038) and NHS England Breast Screening Programme Research Advisory Committee (no. BSPRAC_0093). The study was registered with the ISRCTN (no. 60839016). The AIMS study was funded by a National Institute for Health and Care Research (NIHR) award from the Secretary of State for Health and Social Care. An overview of the study is given in Fig. 1a–e. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Case selection

Mammography images and clinical data for 50,000 women from two NHSBSP screening centers were selected from the OPTIMAM Mammography Image Database OMI-DB11.

The AI developer had used 10,000 cases from each screening center to select the operating point or threshold that is optimal for the recall rate for the local center. The vendor advised that they would do this before clinical implementation at any NHSBSP screening centers and so this mimicked the clinical situation. Therefore, all episodes for these women from any year were removed before selection of the study dataset.

We randomly selected 25,000 women aged 50–70 years per screening center from 2016. This permitted a 3-year follow-up in 2019, avoiding any potential impact of COVID-19 on the data. At both screening centers, the number of interval cancers recorded in the National Breast Screening System (NBSS) was lower than expected from the national Screening History Information Management system. Therefore, additional interval cancers were selected from a wider year range (2011–2018) to reach the numbers reported by the Screening History Information Management system. A proportion of women had normal mammograms but no subsequent normal mammogram to confirm the negative ground truth. To ensure a high-quality ground truth, women younger than 68 years with normal mammograms but no subsequent screening episode were replaced with women who did have a follow-up mammogram, also drawn from the wider year range of 2011–2018. The women were matched by episode outcome, by whether it was the first or a subsequent mammogram, and by age (within ±1 year for screening center 1 and within ±3 years for screening center 2). Women aged 68+ years were permitted to have no follow-up screen, because they would not typically be invited back as part of national screening at this age. For screening center 2, there was also a proportion of women whose cases had been used to train the AI tool. These were replaced with women whose cases had not been used to train the AI tool, using the same matching criteria as above.

The clinical data included pathological information and the recall or no-recall decisions of the historical first and second readers and of arbitration. For 94.5% of positive cases, the location of the cancer was recorded with a rectangle around the cancer or, for cancers detected as interval cancers or at the next screening round, around the area where the cancer was later detected. These bounding boxes were the ground-truth ROIs. For the remaining 5.5% of positive cases, insufficient clinical data were available for such annotation.

AI exclusion criteria were applied for technical recalls, any study containing >4 or <4 images, or cases with implants, resulting in 4,354 women (8.7%) being excluded. The AI tool was run on the mammography images of the remaining women and an AI recall decision for each woman was obtained using the site-specific operating points10. This included a case-level decision and ROIs (bounding boxes marking suspicious areas). In addition, 44 (0.1%) cases were excluded due to insufficient or conflicting clinical information.

In the human arm of the study, the workflow was based on the recall decision of historical first and second human readers. In the AI arm of the study, the workflow was based on the recall decision of the first historical human reader and the AI tool. To determine the impact of the AI tool on arbitration, the arbitration criteria at each center were applied and selected cases read in a reader study with two readers making the arbitration decision. At center 1, women went to arbitration if, for either breast, there was a disagreement between the first and second readers. At center 2, women went to arbitration if recalled by either the first or second reader or both. A flowchart of the case selection, study exclusions and allocation to arms is given in Fig. 1e.

Design of arbitration reader study

AI system

The AI system used in this evaluation was created by Google (v1.2, Google LLC) and is an updated version of the v1.0 model9,10. This is an AI-powered, independent mammography reader product for double-read breast cancer-screening workflows. It analyzes two-dimensional, full-field digital mammograms to give a normal or abnormal screening determination and highlights suspicious ROIs. The AI system has three components: (1) a global model, which takes the four mammograms and produces a case-level prediction; (2) a detection model, which detects bounding boxes of lesions for each view; and (3) a hybrid model, which takes as input the features from the last layer of the global model and the bounding boxes from the detection model and produces a score for each bounding box. The final case-level cancer prediction is the maximum score of the bounding boxes for that case. The AI system outputs DICOM images showing the bounding boxes with scores above the operating point; however, case-level and bounding box scores are not displayed to the user. Data from 76,142 women (63,918 from the UK, 12,224 from the USA) were used to train the AI tool. Among all the studies, 88.7% were Hologic images, 9.6% GE images, 0.9% Siemens images and 0.8% Philips images. The exclusion criteria of the AI tool include technical recalls, cases containing more or fewer than four images and implants. The four-image limitation is due to the design of the AI tool, which processes one image for each of the four mammogram views (that is, left craniocaudal, left mediolateral oblique, right craniocaudal and right mediolateral oblique) for a complete analysis (no missing view allowed); when multiple images of the same view are present, it defers the selection of the image to the operator.
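The case-level decision logic described above can be illustrated with a minimal sketch (all function names and score values here are hypothetical; the deployed product does not expose these scores to the user):

```python
# Hypothetical sketch of the case-level decision logic: the case score is
# the maximum hybrid-model score over detected bounding boxes, compared
# against a site-specific operating point. Only boxes at or above the
# operating point are shown on the output DICOM images.

def case_decision(box_scores, operating_point):
    """box_scores: hybrid-model score for each detected bounding box."""
    if not box_scores:
        # No detected boxes: the case is treated as normal.
        return "normal", []
    case_score = max(box_scores)
    shown_boxes = [s for s in box_scores if s >= operating_point]
    determination = "abnormal" if case_score >= operating_point else "normal"
    return determination, shown_boxes
```

For example, with an operating point of 0.5, a case with box scores of 0.2 and 0.7 would be flagged as abnormal with one box displayed.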

Readers

Nine radiologists from center 1 and nine radiologists and four consultant radiographers from center 2 participated in the reader study. All were NHSBSP-accredited mammography readers, with between 3 years and 36 years of experience (mean = 13.5 years) and reading between 2,300 and 15,000 examinations per year (mean = 6,000). Only one reader had experience in using AI previously.

Reader training

All readers were provided with an information pack and completed a consent form before the study. All readers completed training in interpreting the AI tool. This was provided by the AI vendor to mimic what would happen clinically and included a 10-min video explaining the AI tool and 100 cases that showed the AI decision and the ground truth (cancer or no cancer), including the location of any cancers. These training cases were from a screening center not included in the study. In addition, a pilot study was performed by the research team. This included 28 cases, to train the readers in how to use the viewing software (RiViewer) and to test the entire process before the main study, including paperwork generation, hanging protocol, clarity of questions asked and timing. There was no overlap between the pilot study cases and the cases used in the main study.

Study paperwork

Clinically, when making arbitration decisions, the arbitration panel can view the decisions of the first and second readers on the NBSS and the clinical paperwork, where the readers have written their opinions and/or drawn areas of suspicion on a diagram. It was not possible to show the readers in this study the original paperwork or NBSS records, because these would reveal the original arbitration decision and the data would not be anonymized. To overcome this, research radiographers manually transcribed the original first and second reader opinions and diagrams of suspicious regions to create an anonymized copy of the study paperwork, blinded to the screening outcome. For cases in the human arm, the study paperwork contained the opinions of both the first and second readers. For cases in the AI arm, the study paperwork contained only the first human reader’s opinion (the second reader being AI).

Arbitration reading conditions and hanging protocol

Batches of ten cases for arbitration were reviewed by pairs of readers. As these sessions were outside of working hours, the pairing depended primarily on working patterns. This mimics the clinical situation, where readers working at the same time arbitrate together. The pairs were not fixed, to allow for flexibility around clinical and personal commitments. The pair reading each batch was recorded. The reading took place on clinical workstations at the screening center using RiViewer software in a reading room with normal clinical conditions, including low lighting and high-resolution monitors.

The proportion of AI arm and human arm cases within a batch was based on the proportion in the entire dataset at that center. The readers saw the study images (termed ā€˜current images’), the images of the immediately prior screening round if there was one and, for the AI arm, the images produced by the AI model with any areas of concern annotated. For both arms, the paperwork was shown after the readers had looked at the current images and prior images. For the AI arm, the study paperwork was shown at the same time as the AI images, so that the readers saw the human and AI decisions at the same time. For both arms, the readers had to complete a whole loop of a defined hanging protocol before they could make a decision for that case. It was not possible to blind the AI arm to the readers because the AI output was overlaid on images and the human readers’ decisions on paperwork. However, this is clinically realistic because it is how the images would be read clinically.

For all cases, readers were asked to provide the Royal College of Radiologists 5-point M-score25 (M1, no recall; M2, no recall; M3, recall; M4, recall; and M5, recall) for each breast and the Breast Imaging Reporting and Data System (BIRADS) breast density category (A–D). For cases with prior imaging, they were additionally asked whether the priors changed their recall decision. For cases in the AI arm, they were also asked whether they were satisfied with the AI assessment of the case. If recalling a case, the reading pairs were asked to draw a bounding box around each area being recalled and to provide the conspicuity, lesion type and suspicion of malignancy. The readers were asked to draw a rectangle around the region in both views. Each region was assigned an ID and, if a lesion was seen in both views, the two regions were linked using the same ID.

Data quality control

Collection of all clinical data and images was automated and the images and data fields were not altered during collection. This ensured that the data were clinically relevant and representative. The reader study required study paperwork to be transcribed from clinical paperwork by research radiographers. The trial manager checked that the clinical paperwork had been correctly transcribed for 1% of the study paperwork during on-site monitoring. The data entered by the readers from the reader study were checked fortnightly with automated scripts, for any inconsistencies or incomplete data. These data checks were outlined in a data management plan at the start of the study.

Exploratory human factor surveys

Participants completed online surveys before, during and after the study. This included relevant questions from the NASA Task Load Index26, trust and general impressions of the AI tool. Results in this paper are shown for the surveys after the study.

Positive and negative definitions used in the analysis

Positive cases

A positive case is a woman diagnosed with cancer within 39 months of the screening mammogram used in the study, based on pathological information. This therefore includes screen-detected cancers, interval cancers and screen-detected cancers detected at the next screening round.

Negative cases

A negative case is a woman whose mammograms used in the study resulted in a normal outcome with routine recall to screening 3 years later, and whose follow-up mammograms from 24 months onward also resulted in a normal outcome with routine recall to screening 3 years later (the follow-up requirement applied to women aged <68 years only).

Localization ground truth

The mammograms of all the positive cases were annotated by expert radiologists or consultant radiographers who did not participate in the study. They drew a rectangular ROI tightly around each lesion. They then described the radiological appearance of the lesion (mass, distortion, asymmetry, calcification), whether the lesion was malignant or benign and the conspicuity of the lesion on a three-point scale (very subtle, subtle or obvious). Conspicuity was defined as how visible the lesion was in the image, in the annotator’s judgment. For interval cancers and next-round cancers, the cancer was annotated on the diagnostic image (where available) and, in addition, the location the cancer would have been as annotated in the prior image.

Statistics and reproducibility

Study characteristics

Descriptive analysis was used to summarize study population characteristics. Frequencies and percentages were calculated for categorical data. A χ2 test was used to compare proportions of characteristics between included and excluded groups.

Primary analysis

Our primary endpoint was noninferiority (prespecified 5% absolute margin) of the AI arm for sensitivity and specificity at the case level, compared to the human arm, measured against a 39-month ground truth. Statistical testing was performed using one-sided tests at the 0.025 significance level (after correcting for multiple comparisons using Holm–Bonferroni). CIs on the difference were Wald’s intervals27 and Wald’s test was used for noninferiority28. Both used Obuchowski’s variance estimate29. If noninferiority was shown, a one-sided superiority test was planned to follow without loss of power or requirement for multiple testing30,31. Superiority comparisons were conducted using Obuchowski’s extension of the two-sided McNemar’s test for clustered data. Clusters were defined to group arbitrations read by the same reader pair. For the case-level analysis, the highest RCR M-score for each breast was used. The data met the requirements of the paired binary tests used (Wald’s and McNemar’s tests).
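For illustration, the one-sided paired Wald noninferiority test on sensitivity can be sketched as follows. This is a simplified version: it uses the ordinary paired variance of the per-case differences rather than Obuchowski’s cluster-adjusted variance estimate used in the study.

```python
# Simplified one-sided paired Wald test for noninferiority of sensitivity
# (5% absolute margin). Inputs are per-case 0/1 recall-correct indicators
# for the same positive cases under each arm. The study's actual analysis
# used Obuchowski's variance estimate with reader-pair clusters.
from math import sqrt
from statistics import NormalDist

def noninferiority_test(ai_correct, human_correct, margin=0.05, alpha=0.025):
    n = len(ai_correct)
    diffs = [a - h for a, h in zip(ai_correct, human_correct)]
    d = sum(diffs) / n                      # sensitivity difference (AI arm - human arm)
    var = sum((x - d) ** 2 for x in diffs) / (n * (n - 1))  # variance of the mean difference
    se = sqrt(var)
    if se == 0:
        return d, 0.0, d > -margin          # degenerate case: arms fully agree
    z = (d + margin) / se                   # H0: d <= -margin; H1: d > -margin
    p = 1 - NormalDist().cdf(z)
    return d, p, p < alpha                  # noninferior if p < alpha
```

With, say, 200 positive cases where each arm misses 10 cases the other catches, the observed difference is zero and the test demonstrates noninferiority at the 5% margin.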

Secondary analysis

Case-level secondary analysis included positive predictive value (PPV), negative predictive value (NPV), cancer detection rate (CDR) and recall rate (RR). For PPV and NPV, CIs on the absolute values, differences and CIs on difference were calculated by bootstrapping. For CDR and RR, differences were calculated using Obuchowski’s extension of the two-sided McNemar’s test for clustered data. For CDR and RR, Wald’s CIs were calculated with Obuchowski’s clusters based on reader pairs.
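A percentile bootstrap for a CI on a difference such as the PPV difference between arms might look like the following sketch (the data layout and function names are hypothetical):

```python
# Hypothetical percentile-bootstrap sketch for a CI on the difference in
# PPV between the AI and human arms, resampling cases with replacement.
import random

def bootstrap_ppv_diff_ci(cases, n_boot=2000, alpha=0.05, seed=0):
    """cases: list of (recalled_ai, recalled_human, is_cancer) 0/1 tuples."""
    rng = random.Random(seed)

    def ppv(sample, arm):
        # PPV = cancers among recalled cases / all recalled cases
        recalled = [c for c in sample if c[arm]]
        return sum(c[2] for c in recalled) / len(recalled) if recalled else 0.0

    diffs = []
    for _ in range(n_boot):
        sample = [rng.choice(cases) for _ in cases]  # resample with replacement
        diffs.append(ppv(sample, 0) - ppv(sample, 1))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```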

Exploratory analysis

Case-level subgroup sensitivity and specificity were calculated by type of screen, age, ethnicity, X-ray system manufacturer, index of multiple deprivation (IMD) and breast density. In addition, subgroup sensitivity was calculated by cancer type, cancer grade, lesion characteristic and lesion size.

The age was taken from the NBSS. The grouping of age (50–54 years, 55–59 years, 60–64 years, 65–70 years) used as subgroups was as reported in published NHSBSP statistics. The ethnicity was taken from the NBSS. The grouping of ethnicities (white, mixed, Asian, black, other, not specified) as subgroups was based on the NHS Data Dictionary ethnic categories (https://www.datadictionary.nhs.uk/data_elements/ethnic_category.html). The IMD 1–10 (as defined in https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019) was calculated from lower layer super output area data before de-identification. Breast density values (BIRADS 1–4) were calculated for mammograms acquired using Hologic devices with software developed by Royal Surrey32. The breast density subgroups were the categories from BIRADS, 5th edn. X-ray manufacturer values (Hologic and Siemens) were taken from the DICOM header of the mammography images. The screen type (first or subsequent screen) was taken from the NBSS. The subgroups used were as in NHSBSP statistics. The cancer type (invasive or in situ) was taken from the NBSS. These subgroups are reported in published NHSBSP statistics. The invasive grades (1, 2 and 3) and in situ grades (low, intermediate and high) were taken from the NBSS. The subgroups were based on the NHS Data Dictionary tumor grades for breast screening (https://archive.datadictionary.nhs.uk/DD%20Release%20June%202023/attributes/tumour_grade_for_breast_screening.html). The lesion type was obtained by an expert radiologist annotating the cancers; if that was not possible due to the diagnostic images not being available, the lesion type was taken from the NBSS. The invasive lesion size (small, <15 mm, and large, ≄15 mm) was taken from the NBSS. The subgroups used were as in NHSBSP screening statistics.

As the study was not powered for subgroup analysis and there were no prespecified subgroup endpoints, these subgroup analyses should be considered exploratory and hypothesis generating. We therefore present unadjusted CIs for subgroup differences to describe observed trends and magnitudes of effect within subgroups. It is important to note that these CIs should be interpreted cautiously due to the lack of power and the increased risk of false-positive findings associated with multiple subgroup comparisons. No formal hypothesis testing or multiplicity adjustments were conducted for these exploratory subgroup analyses. Case-level CIs were calculated using Wald’s CIs for groups of ≄50 cases and bootstrapping for groups of <50 cases.

Finally, localization analysis of the bounding boxes drawn during arbitration was performed using the RJafroc package v2.1.2 in R v4.3.3 (ref. 33). A correctly localized lesion was defined as one where the overlap between the ROI drawn at arbitration and the corresponding ground-truth ROI had an intersection-over-union value ≄0.1. All intersection-over-union values <0.3 were reviewed by a radiologist who did not participate in the reader study and the hit-or-miss decision was changed accordingly.
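The intersection-over-union hit criterion can be sketched for axis-aligned rectangular ROIs as follows (the (x, y, width, height) box representation is an illustrative choice, not the study’s data format):

```python
# Sketch of the intersection-over-union (IoU) hit criterion for rectangular
# ROIs, with boxes given as (x, y, width, height). A lesion counts as
# correctly localized when IoU >= 0.1 against the ground-truth ROI.

def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Corners of the intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def is_hit(reader_roi, ground_truth_roi, threshold=0.1):
    return iou(reader_roi, ground_truth_roi) >= threshold
```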

For the human factor analysis, perceived task load differences between the human and AI arms were analyzed using Wilcoxon’s signed-rank test. Other questions, such as those on trust and general impressions, were examined using descriptive statistics for closed-ended questions, while open-ended responses underwent dual-coder thematic analysis.

Post-hoc analysis

For all positive cases where the AI correctly recalled but human arbitration then overrode the decision, we checked whether the AI ROI had correctly localized the ground-truth ROI. In addition, the average number of false-positive prompts per case was calculated for: all cases, positive cases, negative cases and positive cases where the AI correctly recalled but human arbitration then overrode the decision. Furthermore, 2 × 2 tables of outcomes for the human and AI arms were provided for all positive cases (Supplementary Table 1), screen-detected cancers only (Supplementary Table 2) and negative cases only (Supplementary Table 3).

Sample size estimation

We powered the study by simulating a two-arm, within-case design (routine versus AI assisted), in which each case is read under both regimens and the primary analysis is a matched-pair Wald’s test for noninferiority on sensitivity (specificity was expected to be amply powered, given the low prevalence). We assumed identical underlying performance in both arms: latent continuous scores with an area under the curve of 0.90, binarized at a common threshold to yield 73% sensitivity and specificity using 39-month outcomes9. Between-arm correlation was modeled via an agreement parameter set to 84.5%, matching previously observed R1–R2 concordance on positives. We modeled the two site-specific arbitration protocols (R1 | R2 and R1 ≠ R2) and powered the study using a worst-case scenario that combined the R1 | R2 arbitration style, consensus panel recall of 0.73 and agreement between arms of 0.70. Under these assumptions, 275 cancer-positive cases exceeded 90% power, whereas 200 positives provided 80% power. We therefore targeted a minimum of 200 positive cases per site to achieve 80% power. Given the expected population prevalence of cancer, this corresponded to 25,000 cases per site.
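The simulation-based power estimate can be illustrated with a simplified Monte Carlo sketch. This collapses the latent-score and arbitration modeling described above into a single between-arm agreement parameter, so it approximates, rather than reproduces, the study’s calculation.

```python
# Simplified Monte Carlo power sketch for the matched-pair noninferiority
# design: each positive case is read under both arms with equal sensitivity
# and a given between-arm agreement; power is the fraction of simulated
# trials in which the one-sided paired Wald test passes the 5% margin.
import random
from math import sqrt
from statistics import NormalDist

def simulate_power(n_pos=200, sens=0.73, agree=0.845, margin=0.05,
                   alpha=0.025, n_sims=1000, seed=0):
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    passes = 0
    for _ in range(n_sims):
        diffs = []
        for _ in range(n_pos):
            human = rng.random() < sens
            # With probability `agree` the AI copies the human outcome;
            # otherwise its outcome is drawn independently.
            ai = human if rng.random() < agree else (rng.random() < sens)
            diffs.append(int(ai) - int(human))
        d = sum(diffs) / n_pos
        var = sum((x - d) ** 2 for x in diffs) / (n_pos * (n_pos - 1))
        se = sqrt(var) if var > 0 else 1e-9
        if (d + margin) / se > z_crit:
            passes += 1
    return passes / n_sims
```

Varying `n_pos`, `agree` and `sens` in such a simulation shows how the number of positive cases required depends on the assumed between-arm concordance.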

Reproducibility

Randomization is not applicable to this study because it was a retrospective study and all women were in both arms of the study. As described above, it was not possible to blind the AI arm to the readers because the AI output was overlaid on images and the human readers’ decisions on paperwork. However, this is clinically realistic because it is how the images would be read clinically. The data met the requirements of the paired binary tests used (Wald’s and McNemar’s tests). The data exclusions were defined before the study. Of the 50,000 women, 4,354 (8.7%) were excluded because they met the AI exclusion criteria (technical recalls, cases containing >4 or <4 images, and implants) and 44 (0.1%) cases were excluded due to insufficient or conflicting clinical information.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.