Main

According to the latest statistics from the National Cancer Institute in the United States, 12.1 per 100 000 women developed ovarian cancer per year between 2008 and 2012, with a mortality of 7.7 per 100 000 women (Howlader et al, 2015). The overall 5-year survival is estimated to be 45.6% for all stages of the disease (Howlader et al, 2015). However, for early localised ovarian cancers, the 5-year survival exceeds 90% (Howlader et al, 2015). Early diagnosis and centralised management are thought to be key factors in optimising survival (Bristow et al, 2013, 2014; Howlader et al, 2015). For early diagnosis, previous trials to evaluate ovarian cancer screening have not been successful (Kobayashi et al, 2008; Buys et al, 2011). However, recently, the United Kingdom Collaborative Trial of Ovarian Cancer Screening (UKCTOCS) showed that screening using the risk of ovarian cancer algorithm (ROCA) doubled the number of detected primary invasive epithelial ovarian or tubal cancers (iEOCs) compared with a fixed cutoff of CA125 (Menon et al, 2015). The researchers also reported a significant mortality reduction with annual multimodal screening (MMS) when prevalent cases were excluded. However, the effect of this mortality reduction on final ovarian cancer screening cost effectiveness requires longer-term follow-up of the study patients (Jacobs et al, 2015).

A further important aspect of clinical management is that an accurate diagnosis is made when a woman presents with an ovarian mass. This is essential if women with cancer are to be referred to specialist oncology services. The International Ovarian Tumour Analysis group (IOTA) have developed and validated models and rules to characterise ovarian masses as benign or malignant (Timmerman et al, 2005, 2010a, b; Van Holsbeke et al, 2012). These models and rules have also been validated in the hands of less experienced (level II) ultrasound examiners (Sayasneh et al, 2013a, 2013b).

The IOTA group has developed the multiclass ADNEX (The Assessment of Different NEoplasias in the adneXa) model that can differentiate between benign tumours, borderline tumours, early-stage primary cancers, late-stage primary cancers (stages II–IV) and secondary metastatic cancers (Van Calster et al, 2014). The ADNEX is based on three clinical (including CA125) and six ultrasound parameters (Van Calster et al, 2014), and also offers risk calculation without CA125. The model was developed and temporally validated using parameters collected by experienced (or level III) ultrasound examiners, equivalent to a UK consultant level with a special interest in gynaecological ultrasonography (Education and Practical Standards Committee, European Federation of Societies for Ultrasound in Medicine and Biology (EFSUMB), 2006; Van Calster et al, 2014). This model should facilitate the management of ovarian masses more efficiently as it allows patients to be triaged to the correct management pathway, whether for conservative follow-up, surgery at a general gynaecology unit or management at high-volume specialised cancer centres. Correctly classifying the subtype of malignancy is also of critical importance as borderline ovarian tumours and early-stage ovarian cancers can be treated less aggressively, leading to the possibility of fertility preservation in younger women (Hennessy et al, 2009; Darai et al, 2013). On the other hand, metastatic ovarian cancers should be managed according to the origin of the primary cancer (Hennessy et al, 2009).

The primary aim of this project was to externally validate the ADNEX model. The secondary aim was to assess the performance of the model by level II examiners with varied training (nonconsultant doctors (MDs) and sonographers) (Education and Practical Standards Committee, European Federation of Societies for Ultrasound in Medicine and Biology (EFSUMB), 2006; Van Calster et al, 2014). We hypothesised that the discriminatory performance of ADNEX would be retained, that is, it would be similar to the validation performance in the original ADNEX study.

Materials and methods

Setting and design

This was a multicentre cross-sectional cohort study for diagnostic accuracy. Data were collected prospectively, with the purpose of developing and validating ultrasound-based prediction models from transvaginal ultrasound examinations performed by level II ultrasound examiners (nonconsultant gynaecology specialists, gynaecology trainee doctors and gynaecology sonographers) (Education and Practical Standards Committee, European Federation of Societies for Ultrasound in Medicine and Biology (EFSUMB), 2006; The Royal College of Radiologists (RCR) Board of the Faculty of Clinical Radiology, 2012). The ultrasound examiners were blind to the results of the reference test, that is, the final histological outcome or, in the event of cancer, the stage of the disease. The ADNEX model was applied by a single investigator (AS) using a dedicated Excel spreadsheet. Patients were recruited from three cancer centres (Queen Charlotte’s and Chelsea Hospital (QCCH), London, UK; Princess Anne Hospital (PAH), Southampton, UK; and Garibaldi Nesima Hospital (GNH), Catania, Italy). The study was approved as a service evaluation audit at the UK centres and as a validation study by the hospital authority at the Italian centre. The guidelines of the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis) initiative were used (Collins et al, 2015). Patients were recruited consecutively from September 2010 to November 2014 at QCCH, from May 2012 to May 2014 at PAH and from September 2012 to February 2015 at GNH. Patients at QCCH and PAH were also recruited to the IOTA 4 study (Sayasneh et al, 2013a, 2013b). Transvaginal ultrasonography was performed using the standardised approach previously published by the IOTA group (Timmerman et al, 2000, 2010b). Transabdominal ultrasonography was undertaken when a large mass could not be fully evaluated transvaginally (Timmerman et al, 2010b).

Participants and data collection

The inclusion criteria were patients presenting with at least one adnexal mass who underwent transvaginal ultrasonography at one of the participating centres. For bilateral adnexal masses, the mass with the most complex ultrasound features was included (Timmerman et al, 2000, 2010b). If both masses had similar ultrasound morphology, the largest mass or the one most easily accessible by ultrasonography was included (Timmerman et al, 2010b).

The exclusion criteria were (1) pregnancy, (2) patients examined by a consultant, (3) refusal of transvaginal ultrasonography, (4) cytology rather than histology as an outcome and (5) failure to undergo surgery within 120 days of the ultrasound examination. At PAH, 8 cases were included in the final analysis, although they had the ultrasound examination more than 120 days before surgery. These cases underwent a CT scan within 120 days, confirming the persistent presence of the mass.

The NHS Caldicott report guidelines were followed in all steps of data handling (Great Britain; Department of Health, 1997). At QCCH and GNH, a secure electronic data collection system was used (Astraia Software, Munich, Germany). A unique identifier was generated automatically for each patient’s record. Dedicated data collection forms and Excel sheets were used at PAH. Serum CA125 was measured at the clinician’s discretion or according to clinical practice in each centre, using the Abbott Architect CA125 II (Abbott Park, IL, USA) immunoassay kit at QCCH and GNH, and the UniCel DxI Immunoassay System (Beckman Coulter Inc., Brea, CA, USA) assay at PAH.

The ADNEX model

The ADNEX model contains three clinical and six ultrasound predictors: age (in years), serum CA125 level (U ml−1), type of centre (oncology centres vs other hospitals), maximum diameter of lesion (in mm), proportion of solid tissue, more than 10 cyst locules (yes or no), number of papillary projections (0, 1, 2, 3 or >3), acoustic shadows (yes or no) and ascites (yes or no) (Van Calster et al, 2014). Oncology centres were defined as ‘tertiary referral centres with a specific gynaecology oncology unit’. The proportion of solid tissue is obtained as the ratio of the maximum diameter of the largest solid component to the maximum diameter of the lesion. The ADNEX model is available online and in mobile applications (www.iotagroup.org/adnexmodel/) (Van Calster et al, 2014). The ADNEX model can still be calculated without including the serum CA125 value. In this study we calculated the performance of the ADNEX model with and without CA125. The temporal validation of the model with CA125 in the original paper yielded an area under the receiver operating characteristic curve (AUC) of 0.943 (0.934–0.952) to discriminate benign from malignant tumours. The model without CA125 had an AUC of 0.932 (0.922–0.941). Validation AUCs between all pairs of the five categories varied between 0.71 (stage I cancer vs secondary metastatic cancer) and 0.99 (benign tumours vs late stage primary cancer). We applied the model exactly as presented in the original publication, that is, without any changes to the model formula or coefficients.

Reference tests

The reference standard was the histopathological diagnosis of the mass after surgical removal. The excised tissues underwent histological examination at the local centre. Tumours were classified according to the WHO (World Health Organisation) classification of tumours and malignant tumours were staged according to the FIGO (International Federation of Gynaecology and Obstetrics) criteria (Tavassoli et al, 2003; Heintz et al, 2006). Histological classification was performed without knowledge of the ADNEX results or clinical and ultrasound findings for the patient. The final diagnosis was categorised into five types: benign, borderline, stage I invasive, stage II–IV invasive and secondary metastatic cancer.

Statistical analysis

There were missing values for serum CA125 and for the presence of >10 cyst locules (loc10). Missing values were handled differently for serum CA125 and loc10. The number of missing values for the latter variable was small (3%), and hence these were dealt with using single stochastic imputation based on logistic regression. Missing loc10 values were predicted by a logistic regression model with Firth correction with the following predictors: age, maximum diameter of the lesion, proportion of solid tissue, number of papillations, presence of acoustic shadows, ascites, type of ovarian tumour and type of operator. The missing serum CA125 values were handled with multiple stochastic imputation using predictive mean matching regression. As the distribution of serum CA125 was heavily skewed, the log–log transformation of CA125 was used (i.e., log(log(CA125))). In this imputation model, age, maximum diameter of the lesion, proportion of solid tissue, loc10, number of papillations, presence of acoustic shadows, ascites, type of ovarian tumour, hospital and operator type were used as predictors. Using this approach, the missing values were replaced by 100 plausible values, leading to 100 completed data sets. Imputed values were back transformed to the original scale. For the ADNEX model with CA125, each of the 100 completed data sets was analysed separately and the results were combined using Rubin’s Rules (Rubin, 1987).
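The transformation and pooling steps described above can be illustrated with a minimal sketch. It is not the authors' analysis code. The function names (`loglog`, `inv_loglog`, `pool_rubin`) and the numbers in the usage note are hypothetical, and pooling the estimate on its original scale is an assumption, since the text does not state the scale on which results were combined.

```python
import math
from statistics import mean, variance


def loglog(ca125):
    """Log-log transform, log(log(CA125)); only defined for CA125 > 1."""
    return math.log(math.log(ca125))


def inv_loglog(z):
    """Back-transform an imputed value to the original CA125 scale."""
    return math.exp(math.exp(z))


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates with Rubin's Rules.

    estimates: one point estimate per completed data set
    variances: the matching within-imputation variances
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)           # average of the m estimates
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total
```

For example, `pool_rubin([0.93, 0.94, 0.95], [0.0001, 0.0001, 0.0001])` averages three illustrative estimates and inflates the variance by the between-imputation spread. In the study, the same combination would run over 100 completed data sets.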

External validation of the ADNEX model with and without CA125 was performed by evaluating discrimination and calibration performance. The AUC was calculated for the basic discrimination between benign and malignant tumours using the total risk of malignancy (i.e., the sum of the estimated risks of the four malignant subtypes). The 95% confidence intervals for differences in AUCs were computed based on 1000 bootstrap samples, where for each bootstrap sample the same patients were selected across the imputed data sets (Musoro et al, 2014). In addition, AUCs were computed for each pair of tumour types using the conditional risk method (Van Calster et al, 2012b). Finally, the polytomous discrimination index was calculated (Van Calster et al, 2012a) that estimates the average proportion of correctly classified patients by the model when presented with five patients, one with each tumour type. Sensitivity and specificity were calculated using a 1%, 5%, 10%, 15%, 20% and 30% cutoff denoting the total risk of malignancy. Calibration of the predicted probabilities was assessed through use of calibration plots that show the relation between the observed and predicted probabilities for malignant tumours. The calibration curve was estimated by using a loess smoother (Van Calster et al, 2016).
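As a rough illustration of the discrimination measures described above, the sketch below computes the total risk of malignancy, a benign-vs-malignant AUC and sensitivity/specificity at a given cutoff. It is not the authors' code. The function names and example data are hypothetical, the AUC uses the pairwise (Mann-Whitney) formulation, and treating a risk equal to the cutoff as test-positive is an assumption.

```python
def total_malignancy_risk(subtype_risks):
    """Sum of the estimated risks of the four malignant subtypes."""
    return sum(subtype_risks)


def auc(risks, labels):
    """Probability that a malignant case (label 1) has a higher risk than
    a benign case (label 0); ties count as one half."""
    pos = [r for r, y in zip(risks, labels) if y == 1]
    neg = [r for r, y in zip(risks, labels) if y == 0]
    score = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                score += 1.0
            elif p == n:
                score += 0.5
    return score / (len(pos) * len(neg))


def sens_spec(risks, labels, cutoff):
    """Sensitivity and specificity when risk >= cutoff is test-positive
    (the >= convention is an assumption of this sketch)."""
    tp = sum(1 for r, y in zip(risks, labels) if y == 1 and r >= cutoff)
    fn = sum(1 for r, y in zip(risks, labels) if y == 1 and r < cutoff)
    tn = sum(1 for r, y in zip(risks, labels) if y == 0 and r < cutoff)
    fp = sum(1 for r, y in zip(risks, labels) if y == 0 and r >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up risks for three benign (0) and three malignant (1) masses.
example_risks = [0.02, 0.08, 0.40, 0.12, 0.90, 0.25]
example_labels = [0, 0, 0, 1, 1, 1]
example_auc = auc(example_risks, example_labels)
example_at_10 = sens_spec(example_risks, example_labels, 0.10)
```

Calling `sens_spec` once per cutoff (0.01, 0.05, 0.10, 0.15, 0.20, 0.30) would give the kind of cutoff table reported in the Results. Confidence intervals would additionally require the bootstrap procedure described in the text, which is not shown here.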

Results

During the study period, 751 women underwent ultrasonography by level II examiners (one associate specialist in gynaecology, 12 resident gynaecology trainees and 29 sonographers) for a pelvic mass and went through the surgical management pathway. Of these, 141 women were excluded from the final analysis for the following reasons: 65 women were examined by a consultant, 26 women had no histology result (14 only cytology, 12 no cytology or histology), 24 women had surgery >120 days from the characterising ultrasound scan, 15 women were pregnant, 5 women only had a transabdominal scan, 5 women had no surgery performed (declined or were not medically fit) and finally 1 woman who had a recurrence of cervical cancer in the pelvis a few years after radical hysterectomy and underwent a bilateral salpingo-oophorectomy was excluded as the tumour was not considered adnexal. Supplementary Table 1 presents exclusions for each centre. In the final analysis, 610 women were included (Supplementary Figure 1). Of these patients, 142 (23%) had a missing CA125 level and 17 (3%) had a missing value for loc10. Supplementary Table 2 presents the numbers of missing values for each of the study centres. The prevalence of malignancy was 30% (n=182), with 33% for QCCH, 32% for PAH and 19% for GNH. There were 42 (7%) borderline tumours, 47 (8%) stage I primary ovarian cancers, 69 (11%) stage II–IV primary ovarian cancers and 24 (4%) secondary metastatic cancers (see Supplementary Table 3 for a breakdown per centre). The median age was 47 years with 352 (58%) premenopausal and 258 (42%) postmenopausal women. Table 1 shows descriptive statistics of the ADNEX predictors per tumour subtype. Supplementary Tables 4–6 show descriptive statistics per centre.

Table 1 Descriptive information about the patients and masses included in the study according to tumour subtype

The calibration plots suggest good correspondence between the total predicted risk of malignancy and the observed proportion of malignant tumours, both for the ADNEX model with and without CA125 (Figure 1).

Figure 1

(A) Calibration plot for the ADNEX model with serum CA125. (B) Calibration plot for the ADNEX model without serum CA125.

The AUC to differentiate between benign and malignant masses was 0.937 (95% CI: 0.915–0.954) for ADNEX with CA125 and 0.925 (95% CI: 0.902–0.943) for ADNEX without CA125 (Figure 2 and Table 2). The model with CA125 showed slightly better performance (AUC difference: 0.012, 95% CI: 0.006–0.020). At risk cutoffs of 1%, 10% and 30%, sensitivities were 100%, 97% and 86% for ADNEX with CA125 (Table 3). Corresponding specificities were 12%, 68% and 84%. As in the original study, centre differences were observed with centre-specific AUCs for ADNEX with CA125 that varied from 0.90 for PAH to 0.99 for GNH (Table 2). The AUC was higher for premenopausal than for postmenopausal women (Table 2): 0.939 vs 0.899 for the model with CA125 (difference 0.04, 95% CI −0.009 to 0.084) and 0.935 vs 0.873 for the model without CA125 (difference 0.062, 95% CI 0.012−0.116).

Figure 2

Receiver operating characteristic curves for the ADNEX model with and without serum CA125 levels to discriminate between benign and malignant masses.

Table 2 The area under the receiver operator curve for the discrimination between benign and malignant lesions for ADNEX with and without CA125 according to type of centre and sonographer
Table 3 The overall sensitivity and specificity (benign vs malignant) of the ADNEX model with and without the inclusion of serum CA125

When tumours were classified into benign, borderline, stage I invasive, stage II–IV invasive and secondary metastatic, the model showed good discrimination between the different subtypes (Table 4). For example, discrimination between benign and stage II–IV tumours was near perfect for the model with CA125 (AUC 0.99). In comparison, the model had most difficulty discriminating between borderline and stage I tumours (AUC 0.78), although performance was still good. The model without CA125 mainly had lower AUCs for stage II−IV tumours vs other groups, in particular vs secondary metastatic cancers (AUC 0.88 for model with CA125, AUC 0.77 for model without CA125). The polytomous discrimination index (PDI) was 0.58 for ADNEX with CA125 and 0.52 for ADNEX without CA125 (Table 4), whereas PDI for random performance would be 0.20 for five categories.
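The chance level of 0.20 follows from the set-based definition of the PDI given in the statistical analysis section. As a sketch, with $K$ and $P_i$ introduced here as notation: let $K$ be the number of tumour categories and $P_i$ the probability that, within a set of $K$ patients containing one patient from each category, the patient from category $i$ is correctly identified. A model that assigns risks at random picks the right patient with probability $1/K$ in every category, so:

```latex
\[
\mathrm{PDI} \;=\; \frac{1}{K}\sum_{i=1}^{K} P_i ,
\qquad
\mathrm{PDI}_{\text{random}} \;=\; \frac{1}{K}\sum_{i=1}^{K}\frac{1}{K} \;=\; \frac{1}{K} \;=\; \frac{1}{5} \;=\; 0.20 .
\]
```

Against this reference, the observed values of 0.58 and 0.52 are well above chance but below the maximum of 1.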

Table 4 Pairwise AUCs and PDI of the ADNEX model with and without serum CA125

Discussion

In this study, we have shown that in the hands of level II ultrasound examiners, the ADNEX model was able to discriminate between benign and malignant masses with a very similar level of performance to that achieved by experienced ultrasound examiners in the original ADNEX temporal validation study published by the IOTA group (Van Calster et al, 2014). In our external validation study using a 10% cutoff to define malignancy, the ADNEX model achieved a sensitivity of 97.3% and a specificity of 67.7% compared with 96.5% and 71.3% in the original study (Van Calster et al, 2014). The optimal cutoff for selecting patients for conservative management may vary (e.g., between 1 and 5%) depending on the health-care system, cost of surgery and surgical risk factors (age, previous medical and surgical history). However, as this study only included patients who underwent surgical management, we cannot conclude which cutoff is optimal for conservative management. This will be investigated in the IOTA5 study (https://clinicaltrials.gov/ct2/show/NCT01698632). In contrast, in a tertiary centre it may be preferable to have a lower false positive rate, and a cutoff value of 30% may be more appropriate (Van Calster et al, 2015).

To the best of our knowledge, this is the first external validation study of the IOTA ADNEX model. Furthermore, the validation was carried out by level II ultrasound examiners, whereas in the previous IOTA development and temporal validation study (Van Calster et al, 2014), the ultrasound scan parameters were collected by experienced level III examiners. A strength of our study is that it is multicentre, and as it includes level II examiners with varied training and experience (sonographers and medical doctors), we think the performance of the ADNEX model in this study is likely to be generalisable. Another strength of our study is the robust selection of the reference test, as only cases with a histological outcome were included. However, this may also be seen as a weakness in relation to the potential performance of the ADNEX model for masses that are selected for conservative management as these were not included in the study. This is an issue that applies to most, if not all, of the diagnostic research carried out to date on ovarian masses. The previously mentioned IOTA 5 study should give us useful information on the diagnostic performance of ADNEX and the long-term behaviour of these masses.

A potential limitation is the use of different assay kits for serum CA125 measurements; however, the inconsistency in CA125 levels resulting from this is thought to be limited (Davelaar et al, 1998). Furthermore, the variance in CA125 assay kits used in the study is a reflection of clinical reality and again means results are more likely to be reproducible (Van Calster et al, 2014). A further possible limitation of the study is that all three participating hospitals were referral centres for gynaecological cancers, resulting in there being a relatively high prevalence of malignant disease in the study population. Accordingly, it is possible that our findings may have limitations when trying to predict test performance either in primary care or secondary gynaecology units. However, it should be noted that in the original ADNEX study the prevalence of malignancy ranged from 0 to 66% in the 24 participating centres (Van Calster et al, 2014), and hence this makes it more likely that results will be generalisable. Furthermore, ADNEX explicitly corrects its prediction for type of centre (oncology centres vs other centres). In this sense, the potential for selection bias is accounted for by the model.

Finally, having no centralised histopathology review in our study may have led to bias. For example, distinguishing borderline tumours from benign tumours or even stage I cancer may be challenging for pathologists, where disagreement can occur and this may give inaccurate diagnostic performance results for the ADNEX model in these cases (Van Calster et al, 2014). However, as all the histopathology departments involved in this study were tertiary referral centres for gynaecological cancers, in the event of a discrepancy (including discrepancies in the referring units) a local review at the tertiary centre would have been held to resolve the disagreement. Furthermore, centralised review of pathology was discontinued in IOTA studies as it was shown in initial studies that there were minimal differences between local and central reports (Timmerman et al, 2005).

It is worth noting that we have observed variation in the ADNEX performance between centres that is comparable to the one observed in the original IOTA validation study (Van Calster et al, 2014). This variation could be explained by the differences in the case mix between these centres with a higher number of secondary metastatic cancers in PAH compared with QCCH and GNH. It is important to investigate heterogeneity between centres, but this data set is not ideal for this objective because this requires a larger database derived from a large number of centres.

In our study, the classification of the level of experience of the ultrasound examiners (level II) was based on the recommendations published by the European Federation of Societies for Ultrasound in Medicine and Biology (Education and Practical Standards Committee, European Federation of Societies for Ultrasound in Medicine and Biology (EFSUMB), 2006) and by the Royal College of Radiologists (The Royal College of Radiologists (RCR) Board of the Faculty of Clinical Radiology, 2012). As guidance, a level III examiner in the United Kingdom equates to a consultant with a special interest in gynaecological ultrasonography (The Royal College of Radiologists (RCR) Board of the Faculty of Clinical Radiology, 2012). We acknowledge that this approach has limitations as some level II examiners may have similar levels of competence to someone with level III experience. However, it is acknowledged that the boundaries between these levels can be difficult to distinguish and may overlap (The Royal College of Radiologists (RCR) Board of the Faculty of Clinical Radiology, 2012). In our study, similar to previous findings when the IOTA model LR2 was validated in the hands of level II examiners (Sayasneh et al, 2013b), we found the AUC for the ADNEX model was slightly higher when the scans were performed by doctors compared with sonographers (Table 2).

By characterising the type of malignancy (borderline, primary stage I cancer, primary stage II–IV cancer or secondary metastatic), the ADNEX model offers the possibility of a more personalised diagnosis in the event of an ovarian mass. This may enable fertility-preserving surgery in some women, help plan the most appropriate surgical approach (laparoscopy or laparotomy) in others or direct attention to the primary site of malignancy in the event of metastasis. Although the ADNEX model gives absolute risks, relative risks can be computed to give a comparison with the background risk for an individual patient (Van Calster et al, 2015). External validation is a critical step for any diagnostic test before it can be introduced into clinical practice. We have shown that the performance of the ADNEX model is retained in units with different patient populations to the original study, and that it performs well in the hands of examiners with different levels of experience and background training. Our findings suggest that the ADNEX model has the potential to improve management decisions in daily clinical practice for women with adnexal tumours.