Ophthalmologists and primary care practitioners often examine macula-centered retinal fundus images for comprehensive screening and efficient management of vision-threatening eye diseases such as diabetic retinopathy (DR)1, glaucoma2, age-related macular degeneration (AMD)3, and retinal vein occlusion (RVO)4. Deep learning (DL) algorithms5 have been developed to automate the assessment of DR6,7, glaucoma8,9, and AMD10,11, as well as multiple ophthalmologic findings12, achieving performance comparable to that of human experts. A major obstacle hindering the applicability of DL-based computer-aided diagnosis (CAD) systems in clinical settings is their limited interpretability; that is, the rationale behind their diagnostic conclusions is obscure. Several visualization techniques such as class activation maps13,14 and integrated gradients15 have been developed to highlight lesions as preliminary solutions. However, the 'heatmap' indicates only ambiguous regions of the image that contributed to the final prediction and cannot explicitly differentiate the lesions that contributed to it. Therefore, users may not fully understand which findings contributed to the DL system's diagnostic predictions. Another limitation of preexisting DL-based algorithms for fundus image analysis is that they examine only a few ophthalmologic findings or diseases (e.g., DR), whereas more comprehensive coverage of common abnormal retinal conditions is necessary for practical deployment of DL-based CAD systems in clinical settings.

We present a DL-based CAD system that not only comprehensively identifies multiple abnormal retinal findings in color fundus images and diagnoses major eye diseases, but also quantifies the attribution of each finding to the final diagnosis. The training procedure resembles ophthalmologists' typical workflow, first identifying abnormal findings and then diagnosing diseases based on the findings present in the fundus image. Like other available DL systems, this system presents the final diagnostic predictions and their accompanying heatmaps. It also provides quantitative and explicit attributions of each finding to the proposed diagnoses, which enhances the interpretability of the diagnostic decision and helps physicians make their final decisions on the right treatment or management of ophthalmic diseases. The model's performance was validated on a held-out, in-house dataset as well as 9 external datasets. A novel notion of counterfactual attribution ratio (CAR) was used to elucidate the rationale behind our DL system's decision-making process by quantifying the extent to which each finding contributes to its diagnostic prediction. Statistical analysis of CAR was performed to evaluate whether the DL system's clinical relations between finding identification and disease diagnosis were similar to those of human experts.


Reliability of the DL-system

The system consists of two major components that are implemented in a single neural network: (1) a fifteen-finding identification subnetwork specialized to predict the likelihood that each finding is present in a fundus image, and (2) an eight-disease diagnosis subnetwork that diagnoses retinal diseases based on features extracted from the finding-identification network (Fig. 1a). The 15 findings considered in this system are hemorrhage, hard exudate, cotton wool patch (CWP), drusen, membrane, macular hole, myelinated nerve fiber, chorioretinal atrophy or scar, any vascular abnormality, retinal nerve fiber layer (RNFL) defect, glaucomatous disc change, non-glaucomatous disc change, fluid accumulation, retinal pigmentary change, and choroidal lesion. The 8 major diseases considered are dry AMD, wet AMD, any DR, referable DR, central retinal vein occlusion (CRVO), branch retinal vein occlusion (BRVO)/hemi-CRVO, epiretinal membrane, and glaucoma suspect16.

Figure 1

(a) Overall architecture of the deep learning-based model. (b) Relationships between ophthalmologic findings and eye diseases computed from aggregate annotations of human experts and the model's counterfactual attribution ratio.

The finding-identification models and disease diagnosis models were trained and tested on 103,262 macula-centered fundus images from 47,764 patients (Supplementary Table 1). All models were evaluated with respect to their area under the receiver operating characteristic curve (AUROC), with 95% confidence intervals computed using the Clopper-Pearson method. Operating points were chosen to maximize the harmonic mean between sensitivity and specificity (i.e. F1-score) on the in-house validation dataset, except for BRVO/hemi-CRVO and CRVO, whose operating points were set to have approximately 90% sensitivity because only a small number of positive cases were available.
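The operating-point rule described above can be illustrated with a minimal sketch. It assumes binary labels and one score per image, and it scans the observed scores as candidate thresholds. The function names and the toy data are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers (0 if both are 0)."""
    return 2 * a * b / (a + b) if a + b > 0 else 0.0


def choose_operating_point(labels, scores):
    """Pick the observed score that maximizes the harmonic mean of sensitivity and specificity."""
    best_threshold, best_value = None, -1.0
    for t in sorted(set(scores)):
        sens, spec = sensitivity_specificity(labels, scores, t)
        value = harmonic_mean(sens, spec)
        if value > best_value:  # ties keep the lowest threshold
            best_threshold, best_value = t, value
    return best_threshold, best_value


# Toy validation data, made up for illustration only.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.30, 0.75, 0.40, 0.70, 0.80, 0.95]
threshold, value = choose_operating_point(labels, scores)
```

For the two RVO classes, the same scan could instead return the highest threshold whose sensitivity is still about 90%.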

Component #1: identification of fifteen abnormal findings

As shown in Table 1, the system identified findings in retinal fundus images with a mean AUROC of 0.980 across all 15 findings on a held-out, in-house test set. AUROCs ranged from a minimum of 0.972 for retinal pigmentary change to a maximum of 1.000 for myelinated nerve fiber, with the majority consensus of three ophthalmologists as reference. Sensitivities ranged from 92.5% for glaucomatous disc change to 100.0% for myelinated nerve fiber, and specificities ranged from 88.3% for retinal pigmentary change to 100.0% for myelinated nerve fiber (Supplementary Table 2, Supplementary Fig. 1). This performance is comparable to that of human experts as reported in previous literature12.

Table 1 Area under receiver operating characteristic curves of the proposed models on the in-house test dataset and external datasets.

The models were then validated on 4 external datasets without additional tuning to identify findings included in each dataset: MESSIDOR17 […] is the number of images with finding \(k\). We found that the average cosine distance between finding pairs started to increase beyond 'Block5a' as in Supplementary Figs. 4 and 5, signifying that feature maps at 'Block4a' convey universal features informative of all findings whereas deeper layers learn discriminative features specific to each finding. We thus branched out after 'Block4a'. The encoder was then frozen to fine-tune the finding branches until the validation AUROC saturated. Experimentally, no significant performance difference was observed between architectures of different sizes, so we chose to append the top layers of B0, the smallest network in terms of parameter count, for each finding branch (Supplementary Table 5). Freezing the encoder had only a minor effect on the test and validation AUROC on the in-house dataset (Supplementary Fig. 6).

Target labels were assigned using the aforementioned Naïve Bayes and the model was trained using the sum of BCE and guidance loss29. Training samples were drawn non-uniformly at random in batches of size 6 such that the expected numbers of positive and negative samples in a batch are equal. The B7-B0 network was trained using Nesterov-SGD, with the learning rate set to 0.001 through the 9th epoch and reduced by a factor of 10 at epoch 10, and training continued until the validation AUROC decreased. Both linear projection matrices were trained for a maximum of 10 epochs with batch size 64 using the same Nesterov-SGD with an L2-regularization coefficient of 0.0005. Other training details such as augmentation and sampling ratio were identical to those used for training the B7-B0 network. The learning curves are illustrated in Supplementary Fig. 7. In the experiment on predicting whether a fundus is normal, this final architecture did not show any degradation in AUROC compared to end-to-end models and other models with additional parameters (Supplementary Table 3).
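One way to obtain batches whose expected numbers of positive and negative samples are equal is to give each class half of the total sampling probability. The sketch below is an assumption about how such sampling could be implemented. It treats a single binary label per image and samples with replacement, whereas the actual multi-finding sampling scheme is not specified here. All names are hypothetical.

```python
import random


def balanced_sampling_weights(labels):
    """Per-sample probabilities giving positives and negatives 0.5 total mass each."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    return [0.5 / n_pos if y == 1 else 0.5 / n_neg for y in labels]


def sample_batch(labels, batch_size=6, rng=random):
    """Draw sample indices with replacement according to the balanced weights."""
    weights = balanced_sampling_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)


# Toy label vector with 5% prevalence, made up for illustration.
labels = [1] * 5 + [0] * 95
weights = balanced_sampling_weights(labels)

# Expected number of positives in a batch of 6 is 6 * 0.5 = 3.
expected_positives = 6 * sum(w for w, y in zip(weights, labels) if y == 1)
batch = sample_batch(labels, batch_size=6, rng=random.Random(0))
```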

Quantifying clinical relations between finding-disease pairs

To understand how our DL-based CAD system infers its diagnostic predictions, CAR was defined as a measure of how much a specific finding contributed to diagnosing a certain disease by comparing its prediction with what its prediction would have been in a hypothetical situation in which the finding under consideration is present or absent. Before defining CAR, we first define the instance-dependent counterfactual attribution, which can be computed for all finding-disease pairs \(\left( f,d \right)\) in a fundus image \(x\). First notice how the finding features \(\overline{z}_f\) can be decomposed as \(\overline{z}_f = \overline{z}_{f,\parallel w_f} + \overline{z}_{f,\bot w_f}\) with a component \(\overline{z}_{f,\parallel w_f}\) parallel to \(w_f\) and its orthogonal counterpart \(\overline{z}_{f,\bot w_f}\):

$$\overline{z}_f = w_f^T \overline{z}_f \frac{w_f}{\left\| w_f \right\|^2} + \left[ \overline{z}_f - w_f^T \overline{z}_f \frac{w_f}{\left\| w_f \right\|^2} \right] = \left( \sigma^{-1}\left( \hat{y}_f \right) - b_f \right)\frac{w_f}{\left\| w_f \right\|^2} + \overline{z}_{f,\bot w_f}.$$
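The decomposition can be checked numerically with a small sketch. It assumes the finding prediction is a sigmoid of a linear projection of the finding features, with the weight vector and bias named `w_f` and `b_f` here. The vectors below are arbitrary toy values, not model parameters.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    """Inverse sigmoid."""
    return math.log(p / (1.0 - p))


def decompose(z, w):
    """Split z into its projection onto w and the remainder orthogonal to w."""
    scale = dot(w, z) / dot(w, w)
    parallel = [scale * wi for wi in w]
    orthogonal = [zi - pi for zi, pi in zip(z, parallel)]
    return parallel, orthogonal


# Toy values, assumed for illustration.
w_f = [0.5, -1.0, 2.0]
b_f = -0.3
z_f = [1.0, 0.2, 0.7]

y_hat_f = sigmoid(dot(w_f, z_f) + b_f)
parallel, orthogonal = decompose(z_f, w_f)

# The parallel component rewritten in terms of the finding prediction.
coef = (logit(y_hat_f) - b_f) / dot(w_f, w_f)
parallel_from_prediction = [coef * wi for wi in w_f]
```

Both forms of the parallel component agree, which is what allows a finding prediction to be swapped for a counterfactual value while the orthogonal part of the features is left untouched.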

The odds \(\mathcal{O}\left( d;x \right)\) of disease \(d\) given a fundus image \(x\) are defined as the ratio between the model's predicted probabilities of disease \(d\) being present and absent:

$$\mathcal{O}\left( d;x \right) \triangleq \frac{\hat{y}_d}{1 - \hat{y}_d} = \exp\left( \sum\limits_f v_{d,f}^T \overline{z}_f + c_d \right).$$

Let the latent vector \(\overline{z}_{\backslash f} = \left( \sigma^{-1}\left( \epsilon \right) - b_f \right)\frac{w_f}{\left\| w_f \right\|^2} + \overline{z}_{f,\bot w_f}\) be the hypothetical feature map had the feature corresponding to finding \(f\) not been present in the image. The instance-dependent counterfactual attribution of finding \(f\) in diagnosing disease \(d\) from a fundus image \(x\) is the odds after removing the diagnostic prediction's dependency on finding \(f\), hence the name counterfactual:

$$\mathcal{C}\left( f,d;x \right) = \exp\left( \sum\limits_{f^\prime \ne f} v_{d,f^\prime}^T \overline{z}_{f^\prime} + v_{d,f}^T \overline{z}_{\backslash f} + c_d \right).$$

For a finding-disease pair \(\left( f,d \right)\) and finding prediction \(\hat{y}_f\) on a fundus image \(x\), the instance-dependent counterfactual attribution ratio (I-CAR) \(\mathcal{R}_{\text{I-CAR}}\left( f,d;x \right)\) is the ratio between the odds and the counterfactual attribution:

$$\mathcal{R}_{\text{I-CAR}}\left( f,d;x \right) := \frac{\mathcal{O}\left( d;x \right)}{\mathcal{C}\left( f,d;x \right)} = \exp\left( \left( \sigma^{-1}\left( \hat{y}_f \right) - \sigma^{-1}\left( \epsilon \right) \right) v_{d,f}^T \frac{w_f}{\left\| w_f \right\|^2} \right),$$

where \(\sigma^{-1}\) is the inverse sigmoid function and \(\epsilon \in \left( 0, 1/100 \right)\) is a small positive number. If a user wishes to modify the attribution due to some finding prediction, the diagnostic prediction of diseases is modified accordingly by changing \(\hat{y}_f\) in \(\overline{z}_f\). This is useful when a user wants to reject the model's finding prediction in the case of a false positive or false negative.
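As a sanity check on the I-CAR definition, the ratio of the odds to the counterfactual attribution can be computed directly and compared with the closed-form exponential. The sketch below assumes a linear-sigmoid finding head (`w`, `b`) and a linear disease head (`v_d`, `c_d`) acting on per-finding feature vectors. The finding names and all numbers are toy values chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    return math.log(p / (1.0 - p))


def odds(v_d, z, c_d):
    """Disease odds exp(sum_f v_{d,f}^T z_f + c_d)."""
    return math.exp(sum(dot(v_d[f], z[f]) for f in z) + c_d)


def counterfactual_feature(z_f, w_f, b_f, eps):
    """Shift z_f along w_f so that the finding prediction becomes eps."""
    shift = (logit(eps) - b_f - dot(w_f, z_f)) / dot(w_f, w_f)
    return [zi + shift * wi for zi, wi in zip(z_f, w_f)]


def i_car_ratio(f, v_d, z, w, b, c_d, eps):
    """I-CAR as odds divided by the counterfactual attribution."""
    z_cf = dict(z)
    z_cf[f] = counterfactual_feature(z[f], w[f], b[f], eps)
    return odds(v_d, z, c_d) / odds(v_d, z_cf, c_d)


def i_car_closed_form(f, v_d, z, w, b, eps):
    """I-CAR from the closed-form expression."""
    y_hat = sigmoid(dot(w[f], z[f]) + b[f])
    gap = logit(y_hat) - logit(eps)
    return math.exp(gap * dot(v_d[f], w[f]) / dot(w[f], w[f]))


# Toy parameters and features (assumed, not from the trained model).
w = {"hemorrhage": [1.0, 0.5], "drusen": [-0.5, 1.0]}
b = {"hemorrhage": -0.2, "drusen": 0.1}
z = {"hemorrhage": [1.2, 0.4], "drusen": [0.3, -0.6]}
v_d = {"hemorrhage": [0.8, 0.3], "drusen": [0.1, -0.2]}
c_d, eps = -1.0, 0.01

ratio = i_car_ratio("hemorrhage", v_d, z, w, b, c_d, eps)
closed = i_car_closed_form("hemorrhage", v_d, z, w, b, eps)
```

The same `counterfactual_feature` helper could serve the user-override case: passing a user-chosen probability instead of `eps` gives the features that correspond to a corrected finding prediction.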

The two quantities described above establish the key intuition behind our main notion of CAR, which is used to understand the decision-making process behind the DL-based CAD system. Replacing the prediction \(\hat{y}_f\) in I-CAR with a high confidence of \(1 - \epsilon\) yields the finding-disease CAR, which compares two hypothetical situations in which a finding is present with high confidence and absent with high confidence:

$$\mathcal{R}_{\text{CAR}}\left( f,d \right) = \exp\left( \left( \sigma^{-1}\left( 1 - \epsilon \right) - \sigma^{-1}\left( \epsilon \right) \right) v_{d,f}^T \frac{w_f}{\left\| w_f \right\|^2} \right).$$

The confidence level \(\epsilon\) was chosen to be the 5th-percentile order statistic of prediction values on benign cases in the in-house validation dataset.
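A minimal sketch of the CAR computation follows. The percentile is taken as a plain order statistic (the `ceil(q * n)`-th smallest value), which is one of several common percentile conventions and is an assumption here. The prediction values and weight vectors are toy values.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def logit(p):
    return math.log(p / (1.0 - p))


def percentile_order_statistic(values, q):
    """Return the ceil(q * n)-th smallest value (1-indexed)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def car(v_df, w_f, eps):
    """Finding-disease CAR: present with confidence 1 - eps versus eps."""
    gap = logit(1.0 - eps) - logit(eps)
    return math.exp(gap * dot(v_df, w_f) / dot(w_f, w_f))


# Toy finding predictions on benign validation images (made up).
benign_predictions = [0.002 * (i + 1) for i in range(20)]
eps = percentile_order_statistic(benign_predictions, 0.05)

w_f = [1.0, 0.5]
car_aligned = car([0.8, 0.3], w_f, eps)      # v_{d,f} aligned with w_f
car_opposed = car([-0.8, -0.3], w_f, eps)    # v_{d,f} opposed to w_f
car_orthogonal = car([0.5, -1.0], w_f, eps)  # v_{d,f} orthogonal to w_f
```

A CAR above 1 means the presence of the finding raises the odds of the disease, a value below 1 means it lowers them, and a value of 1 means the finding carries no weight for that diagnosis.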

Given a finding-disease pair, an attribution activation map that quantifies the influence of each finding \(f\) in diagnosing disease \(d\) can be visualized by modifying the class activation map13 as

$$A\left( f,d;x \right) = v_{d,f}^T \frac{w_f}{\left\| w_f \right\|^2} w_f^T g_f\left( x \right).$$

Computation of odds ratios for human experts

Odds ratios for human experts were computed as follows. Let \(S\) and \(I\) denote the sets of annotator and image indices. Every image \(x_i\) indexed by \(i\) is associated with a finding label \(f_i^s\) and a disease label \(d_i^s\) indicating the presence of the finding or diagnosis as assigned by reader \(s \in S\). All annotations were accumulated into a single \(2 \times 2\) matrix \(N\) as

$$N = \left( \begin{array}{cc} \sum\limits_{s \in S} \sum\limits_{i \in I} 1\left\{ f_i^s = 1 \wedge d_i^s = 1 \right\} & \sum\limits_{s \in S} \sum\limits_{i \in I} 1\left\{ f_i^s = 1 \wedge d_i^s = 0 \right\} \\ \sum\limits_{s \in S} \sum\limits_{i \in I} 1\left\{ f_i^s = 0 \wedge d_i^s = 1 \right\} & \sum\limits_{s \in S} \sum\limits_{i \in I} 1\left\{ f_i^s = 0 \wedge d_i^s = 0 \right\} \\ \end{array} \right),$$

where \(\wedge\) is the Boolean AND operation and \(1\left\{ \cdot \right\}\) is the indicator function. The odds ratio was then computed as

$$OR = \frac{N_{0,0} N_{1,1}}{N_{0,1} N_{1,0}}.$$
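The pooling of annotations and the resulting odds ratio can be sketched as follows. Annotations are assumed to arrive as (finding label, disease label) pairs already pooled over all readers and images. The counts are invented for illustration, and the zero-based indices follow the matrix layout above (row 0 is finding present, column 0 is disease present).

```python
def contingency_matrix(annotations):
    """Accumulate pooled (finding, disease) labels into a 2x2 count matrix."""
    n = [[0, 0], [0, 0]]
    for finding, disease in annotations:
        row = 0 if finding == 1 else 1
        col = 0 if disease == 1 else 1
        n[row][col] += 1
    return n


def odds_ratio(n):
    """OR = N00 * N11 / (N01 * N10)."""
    denominator = n[0][1] * n[1][0]
    if denominator == 0:
        raise ZeroDivisionError("odds ratio undefined when an off-diagonal count is zero")
    return n[0][0] * n[1][1] / denominator


# Toy pooled annotations (made up): 100 reader-image pairs.
annotations = [(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 5 + [(0, 0)] * 55
matrix = contingency_matrix(annotations)
ratio = odds_ratio(matrix)  # 30 * 55 / (10 * 5)
```

When an off-diagonal count is zero the ratio is undefined. This sketch raises an error in that case, and how such pairs were handled in the study is not stated here.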

External datasets

The proposed models were tested on 9 external datasets, whose summary statistics are described in Supplementary Table 6. Two datasets, MESSIDOR and STARE, contain both finding and disease annotations, whereas 2 (e-ophtha and IDRiD-segmentation) contain only finding annotations and 5 contain disease annotations: Kaggle-APTOS (2019), IDRiD-classification, REFUGE (training), REFUGE (val, test), and ADAM.

MESSIDOR consists of 1200 macula-centered images taken by a TOPCON TRC NW6 digital fundus camera [TOPCON, Tokyo, Japan] with a 45-degree field of view. The dataset provides DR grades on a 4-level scale, from 0 to 3, which does not align with the 5-level grading in ICDRDSS. Three retinal specialists (SJP, JYS, HDK) who participated in annotating the in-house data independently assessed the images in the dataset regarding 15 findings and 8 diseases and adjusted the annotations to be compatible with the ICDRDSS grading. Images considered ungradable by any of the 3 specialists were excluded from our study.

The other 8 external datasets are public datasets available online. Assessments provided with the datasets were used as-is when the decisions were binary (present/absent or positive/negative). Annotations on the ICDRDSS scale in the Kaggle-APTOS and IDRiD-classification datasets were converted to binary annotations: DR for grades \(\ge 1\) and referable DR for grades \(\ge 2\). Binary labels in the ADAM dataset indicate the presence of AMD without subcategorizing it as dry or wet, so the two subcategories were grouped into a single AMD class, with annotations assigned positive/present if either dry or wet AMD was present and absent otherwise. The higher of the predictions for dry and wet AMD was used to evaluate the model. The laterality of an image was derived from the position of the optic disc center, located using a segmentation network for the optic disc37, e.g. right eye if the disc center is on the right side.
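The label harmonization rules above are simple enough to sketch directly. The function names are hypothetical, grades are assumed to be integers from 0 to 4, and the laterality rule assumes the disc center is given as a horizontal pixel coordinate compared against the image midline.

```python
def binarize_dr_grade(grade):
    """Map a 5-level DR grade (0-4) to binary 'any DR' and 'referable DR' labels."""
    if grade not in range(5):
        raise ValueError("grade must be an integer from 0 to 4")
    return {"any_dr": int(grade >= 1), "referable_dr": int(grade >= 2)}


def amd_label(dry_amd, wet_amd):
    """Single AMD label: positive if either dry or wet AMD is present."""
    return int(bool(dry_amd) or bool(wet_amd))


def amd_score(dry_amd_prediction, wet_amd_prediction):
    """Single AMD score: the higher of the two subtype predictions."""
    return max(dry_amd_prediction, wet_amd_prediction)


def laterality_from_disc(disc_center_x, image_width):
    """Right eye if the optic disc center lies in the right half of the image."""
    return "right" if disc_center_x > image_width / 2 else "left"


# Toy inputs for illustration.
grades = [binarize_dr_grade(g) for g in range(5)]
score = amd_score(0.2, 0.7)
side = laterality_from_disc(disc_center_x=1500, image_width=2000)
```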

Comparison with readers' performance

To compare with human readers' performance, 150 fundus images corresponding to 150 distinct subjects were sampled at the health screening center and ophthalmology outpatient clinic at SNUBH from July 1st, 2016 to June 30th, 2018. The images were captured with various fundus cameras including the VX-10α, nonmyd 7, and nonmyd WX [Kowa Optimed, Tokyo, Japan] with varying resolutions of (2144, 1424), (2992, 2000), (2464, 1632), and (4288, 2848), and the data were annotated with disease names. The average age was 59.4 with a standard deviation of 11.9, and there were 74 females (49.7%). The sampled data included cases of 25 DR (14 referable DR), 27 AMD (17 dry AMD, 10 wet AMD), 20 RVO (10 CRVO, 10 BRVO/hemi-CRVO), 13 glaucoma suspect, and 18 epiretinal membrane. We measured the performance of 4 physicians and compared it with that of the DL algorithm. This dataset is denoted as 'Reader Study' and the operating point of each reader is shown in Fig. 2.