Is artificial intelligence (AI) better than humans for diagnosing the eye condition 'exudative age-related macular degeneration'?

Key messages

• Artificial intelligence (AI)-based tests may be as accurate as human experts at detecting the exudative (or wet) form of age-related macular degeneration (eAMD).

• AI performance did not differ significantly according to the other eye conditions included in the image datasets or the types of images used.

• More research and consistent reporting are needed to define the role of AI in the diagnosis of eAMD.

What is age-related macular degeneration?
The macula is the central part of the retina, which is located at the back of the eye. As people age, cells in the macula die or become damaged, making it difficult to see clearly. Age-related macular degeneration (AMD) is a common eye condition that can worsen to exudative (or wet) AMD (eAMD), in which the growth of abnormal blood vessels reduces central vision. Accurate diagnosis of eAMD is important because it allows patients to receive treatment from a retinal specialist. Traditional methods of diagnosing eAMD rely on an eyecare specialist and multiple imaging techniques, which can be time- and resource-consuming. Tests that use artificial intelligence (AI) hold the promise of automatically identifying eAMD. This could help more people with AMD get their eyes checked and receive timely diagnosis and treatment.

How can AI help?
AI is a branch of computer science that aims to accomplish tasks that traditionally require human intelligence. AI applications have been developed to examine images of the eye and trained to flag those that may show signs of eAMD. Patients can then be referred for timely treatment, and eye specialists are freed from time-consuming image review.

What did we want to find out?
We wanted to find out how accurate AI tests are compared to human experts in diagnosing eAMD from images of eyes.

What did we do?
We searched for studies anywhere in the world that compared the diagnostic performance of AI tests with that of human experts in reading eye images to diagnose eAMD. The images could come from patients seeking eye care at a community clinic or academic medical center, or from a database of images. The AI readings were compared with those of human experts who reviewed the same images before the AI tests were applied.

What did we find?
We identified 36 studies, involving more than 16,000 people and 62,000 images, that reported the results of 40 different AI tests. More than half of the studies were carried out in Asia, followed by Europe, the USA, and multicountry collaborations. On average, 33% of people in the studies had eAMD.

For the three AI tests evaluated on new data beyond the training images, when applied to detect eAMD in 10,000 individuals (including 100 who actually had eAMD), the AI tests would incorrectly identify about 99 people as having eAMD (false positives) and miss approximately 6 cases (false negatives).

For the 28 AI tests evaluated solely on training data, using the same scenario, the tests would incorrectly identify about 396 people as having eAMD (false positives) and miss approximately 7 cases (false negatives).
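These counts follow directly from the summary sensitivity and specificity reported in the abstract below (0.94 and 0.99 for the externally validated tests; 0.93 and 0.96 for the rest). As a minimal illustration only, with the 10,000-person, 1%-prevalence scenario and all names being our own, the arithmetic can be written as:

def screening_counts(sensitivity, specificity, population=10_000, cases=100):
    # Hypothetical triage scenario: counts of people wrongly flagged
    # (false positives) and missed (false negatives).
    non_cases = population - cases
    false_positives = round(non_cases * (1 - specificity))
    false_negatives = round(cases * (1 - sensitivity))
    return false_positives, false_negatives

# Externally validated AI tests: summary sensitivity 0.94, specificity 0.99
print(screening_counts(0.94, 0.99))   # -> (99, 6)

# Internally validated / development-set tests: 0.93 and 0.96
print(screening_counts(0.93, 0.96))   # -> (396, 7)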

The AI tests demonstrated performance similar to that of human experts, whether they were evaluated on images from their training set or on a new dataset. Performance was also similar regardless of the other eye conditions in the image datasets and the image types used.

What are the limitations of the evidence?
Most of the included studies had flaws in how the AI tests were selected, trained, or evaluated. These flaws may have made the test results appear better than they really are. Consequently, our confidence in the accuracy of the test results was low. Future studies should recruit participants whose age and disease severity reflect real-world conditions.

How up-to-date is this evidence?
The evidence is current as of April 2024.

Authors' conclusions: 

Low- to very low-certainty evidence suggests that an algorithm-based test may correctly identify most individuals with eAMD without increasing unnecessary referrals (false positives) in either primary or specialty care settings. There were significant concerns about applying the review findings due to variations in eAMD prevalence across the included studies. In addition, among the included algorithm-based tests, diagnostic accuracy estimates were at risk of bias because study participants did not reflect real-world characteristics, model validation was inadequate, and selective results reporting was likely. The limited quality and quantity of externally validated algorithms highlight the need for high-certainty evidence. Generating this evidence will require a standardized definition of eAMD across imaging modalities and external validation of the algorithms to assess generalizability.

Background: 

Age-related macular degeneration (AMD) is a retinal disorder characterized by central retinal (macular) damage. Approximately 10% to 20% of non-exudative AMD cases progress to the exudative form, which may result in rapid deterioration of central vision. Individuals with exudative AMD (eAMD) need prompt consultation with retinal specialists to minimize the risk and extent of vision loss. Traditional methods of diagnosing ophthalmic disease rely on clinical evaluation and multiple imaging techniques, which can be resource-consuming. Tests leveraging artificial intelligence (AI) hold the promise of automatically identifying and categorizing pathological features, enabling the timely diagnosis and treatment of eAMD.

Objectives: 

To determine the diagnostic accuracy of artificial intelligence (AI) as a triaging tool for exudative age-related macular degeneration (eAMD).

Search strategy: 

We searched CENTRAL, MEDLINE, Embase, three clinical trials registries, and Data Archiving and Networked Services (DANS) for gray literature. We did not restrict searches by language or publication date. The date of the last search was April 2024.

Selection criteria: 

Included studies compared the test performance of algorithms with that of human readers to detect eAMD on retinal images collected from people with AMD who were evaluated at eye clinics in community or academic medical centers, and who were not receiving treatment for eAMD when the images were taken. We included algorithms that were either internally or externally validated or both.

Data collection and analysis: 

Pairs of review authors independently extracted data and assessed study quality using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool with revised signaling questions. For studies that reported more than one set of performance results, we extracted only one set of diagnostic accuracy data per study, based on the last development stage or the optimal algorithm as indicated by the study authors. For two-class algorithms, we collected data from the 2x2 table whenever feasible. For multi-class algorithms, we first consolidated data from all classes other than eAMD before constructing the corresponding 2x2 tables. Assuming a common positivity threshold across the included studies, we fitted random-effects bivariate logistic models to estimate summary sensitivity and specificity as the primary performance metrics.
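To illustrate the consolidation step for multi-class algorithms, here is a minimal sketch in Python; the class labels, data, and function are hypothetical, and the actual meta-analysis pooled these per-study 2x2 tables with bivariate logistic models rather than this code:

from collections import Counter

def consolidate_to_2x2(predicted, reference, positive="eAMD"):
    # Collapse every class other than eAMD into one negative class,
    # then tally the 2x2 table of predicted versus reference labels.
    cells = Counter()
    for pred, ref in zip(predicted, reference):
        called_positive = pred == positive
        correct = called_positive == (ref == positive)
        cells[("T" if correct else "F") + ("P" if called_positive else "N")] += 1
    return cells

# Hypothetical per-image labels from a three-class algorithm
pred = ["eAMD", "drusen", "normal", "eAMD", "eAMD"]
ref  = ["eAMD", "eAMD",   "normal", "drusen", "eAMD"]
table = consolidate_to_2x2(pred, ref)
sensitivity = table["TP"] / (table["TP"] + table["FN"])  # 2/3 in this toy example
specificity = table["TN"] / (table["TN"] + table["FP"])  # 1/2 in this toy example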

Main results: 

We identified 36 eligible studies that reported 40 sets of algorithm performance data, encompassing over 16,000 participants and 62,000 images. We included 28 studies (78%) that reported 31 algorithms with performance data in the meta-analysis. The remaining eight studies (22%) reported nine algorithms that lacked usable performance data; we reported them in the qualitative synthesis.

Study characteristics and risk of bias

Most studies were conducted in Asia, followed by Europe, the USA, and collaborative efforts spanning multiple countries. Most studies identified participants in hospital settings, while others used retinal images from public repositories; a few did not specify image sources. In the four of 36 studies that reported demographic information, participants' ages ranged from 62 to 82 years. The included algorithms used various retinal image types as model input: optical coherence tomography (OCT) images (N = 15), fundus images (N = 6), and multimodal imaging (N = 7). The predominant core method was deep neural networks. All studies that reported externally validated algorithms were at high risk of bias, mainly due to potential selection bias from either a two-gate design or the inappropriate exclusion of potentially eligible retinal images (or participants).

Findings

Only three of the 40 included algorithms were externally validated (7.5%, 3/40). The summary sensitivity and specificity were 0.94 (95% confidence interval (CI) 0.90 to 0.97) and 0.99 (95% CI 0.76 to 1.00), respectively, when compared to human graders (3 studies; 27,872 images; low-certainty evidence). The prevalence of images with eAMD ranged from 0.3% to 49%.
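Because predictive values depend on prevalence, the same summary sensitivity and specificity imply very different post-test probabilities across this 0.3% to 49% range. A minimal sketch using Bayes' rule (our own illustration, not an analysis from the review):

def ppv(sensitivity, specificity, prevalence):
    # Positive predictive value: the probability that a positive AI call
    # is a true eAMD case, given the prevalence in the image set.
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.003, 0.30, 0.49):  # prevalence range reported above
    print(f"prevalence {prev:.1%}: PPV = {ppv(0.94, 0.99, prev):.2f}")
# prevalence 0.3%: PPV = 0.22; 30.0%: PPV = 0.98; 49.0%: PPV = 0.99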

Twenty-eight algorithms were reportedly either internally validated (20%, 8/40) or tested on a development set (50%, 20/40); the pooled sensitivity and specificity were 0.93 (95% CI 0.89 to 0.96) and 0.96 (95% CI 0.94 to 0.98), respectively, when compared to human graders (28 studies; 33,409 images; low-certainty evidence). We did not identify significant sources of heterogeneity among these 28 algorithms. Although algorithms using OCT images appeared more homogeneous and had the highest summary specificity (0.97, 95% CI 0.93 to 0.98), they were not superior to algorithms using fundus images alone (0.94, 95% CI 0.89 to 0.97) or multimodal imaging (0.96, 95% CI 0.88 to 0.99; P for meta-regression = 0.239). The median prevalence of images with eAMD was 30% (interquartile range [IQR] 22% to 39%).

We did not include in the meta-analysis the eight studies that described nine algorithms (one study reported two sets of algorithm results) for distinguishing eAMD from normal images, images of other AMD, or other non-AMD retinal lesions. Five of these algorithms were based on smaller datasets (range 21 to 218 participants per study), yet with a higher prevalence of eAMD images (range 33% to 66%). Relative to human graders, the reported sensitivity in these studies ranged from 0.95 to 0.97, while the specificity ranged from 0.94 to 0.99. Similarly, using small datasets (range 46 to 106), an additional four algorithms for detecting eAMD from other retinal lesions showed high sensitivity (range 0.96 to 1.00) and specificity (range 0.77 to 1.00).