A foundation artificial intelligence model outperformed physicians with less than 3 years of dermoscopy experience but did not surpass dermatologists with more than 10 years of experience in diagnosing skin lesions, according to a diagnostic study published in JAMA Dermatology. The rankings depended on the task: the AI systems led on benign-vs-malignant discrimination but trailed the most experienced readers on fine-grained, multiclass diagnosis.
Researchers compared 3 artificial intelligence (AI) systems with physician readers using the Test of Dermoscopy for International Validation (TODIV) platform. The AI systems included a first-generation convolutional neural network and 2 configurations of the PanDerm foundation model: a unimodal version using dermoscopic images only and a multimodal version using dermoscopic images, clinical photographs, and metadata.
The prospective diagnostic study used retrospectively collected images from 1,117 skin lesion cases. Each case included clinical information, at least 1 macroscopic photograph, and at least 1 polarized dermoscopic image. The data set was designed to reflect a broad diagnostic spectrum and included rare and atypical presentations. Because it was assembled primarily for physician education and evaluation, however, it contained a higher proportion of malignant and diagnostically challenging lesions than would be expected in routine clinical practice.
A total of 652 physicians contributed 1,092 completed test iterations. Each test iteration included 100 stratified, randomly selected cases. Most readers had limited dermoscopy experience: 478 physicians had less than 3 years of experience, and 60 had more than 10 years of experience.
The primary outcome was reader-level multiclass diagnostic accuracy across 9 lesion categories. Secondary outcomes included benign vs malignant sensitivity, specificity, balanced accuracy, and area under the receiver operating characteristic curve.
Multiclass Accuracy: Experts on Top
For the primary endpoint, physicians with more than 10 years of dermoscopy experience had the highest diagnostic accuracy, at 74%. The unimodal PanDerm model had 72% accuracy, compared with 59% among physicians with less than 1 year of dermoscopy experience, 68% among those with 1 to less than 3 years of experience, and 73% among those with 3 to less than 10 years of experience. The multimodal PanDerm model had 66% accuracy, and the convolutional neural network had 57% accuracy. Every physician group, including the least experienced readers, outperformed the first-generation convolutional neural network.
Binary Discrimination: A Different Ranking
Performance rankings differed when lesions were grouped as benign or malignant. In the binary analysis, the unimodal model had the highest balanced accuracy, at 0.82, compared with 0.65 for physicians overall and 0.72 for the multimodal model. The unimodal model had 71% sensitivity and 94% specificity. The multimodal model had 47% sensitivity and 97% specificity. Overall physician sensitivity and specificity were 66% and 65%, respectively, and improved with dermoscopy experience.
The AI systems' edge in balanced accuracy was driven largely by specificity rather than sensitivity. The researchers reported no statistically significant difference in sensitivity between physicians overall and either AI configuration, but physicians with more than 10 years of experience had significantly higher sensitivity than both models. The researchers described the multimodal model as favoring higher specificity at the expense of sensitivity, whereas they characterized the unimodal model as showing a more balanced operating profile.
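For readers who want to check the arithmetic, balanced accuracy in a two-class task is conventionally the mean of sensitivity and specificity. The short Python sketch below applies that standard definition to the rounded percentages quoted in this article. The function and variable names are illustrative and do not come from the study, and because the inputs are rounded, the results may differ in the last digit from the values the researchers computed on their full data.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Standard binary balanced accuracy: the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Rounded (sensitivity, specificity) pairs as quoted in the article.
# The dictionary keys are illustrative labels, not identifiers from the study.
reported = {
    "PanDerm unimodal": (0.71, 0.94),
    "PanDerm multimodal": (0.47, 0.97),
    "Physicians overall": (0.66, 0.65),
}

for name, (sens, spec) in reported.items():
    print(f"{name}: balanced accuracy = {balanced_accuracy(sens, spec):.3f}")

# How much of the unimodal model's lead over physicians overall comes from
# each component: the specificity gap is far larger than the sensitivity gap.
uni_sens, uni_spec = reported["PanDerm unimodal"]
phys_sens, phys_spec = reported["Physicians overall"]
sensitivity_gap = uni_sens - phys_sens
specificity_gap = uni_spec - phys_spec
print(f"Sensitivity gap: {sensitivity_gap:.2f}, specificity gap: {specificity_gap:.2f}")
```

Run on the rounded figures, the calculation lands within rounding of the reported balanced accuracies of 0.82, 0.72, and 0.65, and it makes the specificity-driven nature of the AI advantage visible: the unimodal model's specificity lead over physicians overall is several times its sensitivity lead.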
Receiver operating characteristic analysis showed an area under the curve of 0.91 for both PanDerm configurations, compared with 0.78 for the human consensus pseudo-ROC. However, the researchers noted that individual reader confidence scores were unavailable, so area under the curve comparisons could not be performed at the individual physician level. The higher benign vs malignant discrimination among the AI systems did not translate into higher multiclass diagnostic accuracy or sensitivity compared with expert readers.
Why the Multimodal Model Lagged
The researchers also found that the multimodal model performed worse than the unimodal model despite having access to additional clinical information. They suggested that a distribution shift between training images and the more distant, complex clinical images in the test set may have contributed to the finding. Among malignant lesions missed by both PanDerm configurations, the researchers noted an apparent preponderance of acral localizations, which they said could reflect underrepresentation of acral melanoma in publicly available training data sets.
What the Researchers Concluded
The researchers framed their results as an argument for human-AI collaboration rather than substitution. They proposed that AI tools may be most valuable as decision-support systems for less experienced clinicians, functioning as a kind of virtual mentor, and as a triage aid for experts. In support of the educational role, the researchers cited prior work in which general practitioner trainees improved their diagnostic accuracy and confidence for pigmented lesions after dermoscopy training that incorporated AI.
The researchers also raised a note of caution about overreliance. They pointed to a multicenter study in endoscopy that documented what they described as a substantial decline in unassisted adenoma detection rates following the introduction of AI-assisted colonoscopy, which they characterized as a measurable deskilling effect. On that basis, they argued that dermoscopy training must be actively maintained rather than replaced by AI tools, regardless of those tools' measured performance, and that clinicians should also be trained in how AI works and where it falls short. The researchers concluded that the future of skin cancer diagnosis likely lies in collaborative workflows in which AI provides systematic secondary review to reduce errors caused by fatigue or inattention, while expert clinicians supply nuanced interpretation and judgment.
Limitations
Several limitations may affect interpretation. The images were retrospectively collected, and the TODIV data set was curated for education and evaluation rather than clinical prevalence. The benign-to-malignant ratio differed from routine practice. All malignant lesions were histopathologically confirmed, but some benign lesions were verified by expert consensus rather than biopsy. Most readers were recruited in France, darker skin phototypes were underrepresented, and the study did not evaluate combined physician-AI decision-making.
"Results of this diagnostic study show that foundation models approached the diagnostic accuracy of well-trained clinicians and surpassed novices but still fell short of the best experts, who remain the reference standard," wrote lead study author Julien Anriot, MD, of Claude Bernard University Lyon 1 and Centre Léon Bérard, and colleagues.
Disclosures: Several researchers reported financial relationships with pharmaceutical and medical technology companies, equipment support, institutional grants, and/or patent-related disclosures. Full disclosures are available in the study.
Source: JAMA Dermatology