Artificial intelligence systems may demonstrate high accuracy in detecting abnormal voices but substantially lower performance in identifying specific laryngeal disorders, according to an integrative review.
In the review, investigators found that artificial intelligence (AI) models consistently performed best in binary classification tasks that distinguished healthy from pathologic voices, with reported accuracies ranging from 88% to 99% across the literature. By comparison, performance declined to approximately 70% to 90% when classifying broader pathophysiologic categories and generally remained below 75% when identifying specific disorders.
The investigators reviewed 88 studies published between 2015 and 2025 that evaluated machine- and deep-learning approaches for the recognition, detection, classification, or severity assessment of laryngeal disorders. The structured search included PubMed, Scopus, and the Cochrane Library. The investigators examined study populations, recording protocols, AI architectures, validation strategies, and diagnostic performance.
To contextualize current performance, the investigators proposed a three-level clinical recognition framework. Level 1 involved the binary detection of abnormal vs. healthy voices; level 2 classified patients into broader pathophysiologic categories such as structural lesions, neuromuscular disorders, inflammatory conditions, or incomplete glottic closure conditions; and level 3 attempted to identify specific diagnoses. According to the review, performance became progressively less reliable as the classification tasks moved from detection to diagnosis.
The investigators attributed much of this decline to acoustic overlap among the laryngeal disorders. Distinct diseases frequently produce similar measurable voice abnormalities, limiting the ability of acoustic analysis alone to distinguish between the diagnoses. The investigators noted that vocal fold nodules, polyps, spasmodic dysphonia, hyperfunctional dysphonia, and unilateral vocal fold paralysis can share acoustic characteristics despite differing underlying pathology.
AI's performance also varied according to model architecture and data type. Traditional machine-learning approaches commonly achieved internal accuracies of 88% to 96% for binary detection tasks, whereas deep-learning systems reported accuracies of 97% to 99% on standardized data sets. Image-based deep-learning models showed strong performance for morphologic lesion characterization, achieving approximately 92% sequence-level accuracy for benign-vs.-malignant lesion classification in one cited study.
A recurring finding throughout the review was the gap between internal and external validation. Most studies relied on internal cross-validation using a limited number of commonly used historical databases. When evaluated on independent cohorts, performance often declined by 10 to 20 percentage points, with some studies reporting decreases of 20 to 30 percentage points in multimodal and multiclass settings. Only seven of the 88 studies included both internal and external validation.
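As a rough illustration of how such a gap is expressed, the short Python sketch below computes the drop between an internal and an external accuracy in percentage points, along with the share of reviewed studies that reported both kinds of validation. The function name and the example accuracies are hypothetical and are not taken from any study in the review; only the counts of seven and 88 come from the text above.

```python
def validation_gap(internal_acc: float, external_acc: float) -> float:
    """Return the drop from internal to external accuracy, in percentage points."""
    return round((internal_acc - external_acc) * 100, 1)


# Hypothetical accuracies chosen only for illustration (not from the review).
gap = validation_gap(0.95, 0.80)

# Share of the 88 reviewed studies that included both internal and
# external validation (seven studies, per the review).
share_with_both = round(7 / 88 * 100, 1)

print(f"Gap: {gap} percentage points")
print(f"Studies with both validations: {share_with_both}%")
```

A percentage-point gap is an absolute difference between two accuracies, not a relative change, which is why the function subtracts rather than divides.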
The investigators identified several methodologic concerns that may contribute to optimistic performance estimates. These included dependence on only a few legacy databases, class imbalance, limited demographic diversity, restricted pathology representation, and heavy reliance on sustained-vowel recordings. Approximately 82% of the reviewed studies used sustained-vowel tasks rather than connected speech, which may not fully capture clinically relevant vocal variability.
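To see why class imbalance can inflate reported accuracy, consider the minimal Python sketch below. It assumes a hypothetical data set of 90 pathologic and 10 healthy recordings and a trivial model that labels every voice as pathologic. The comparison with balanced accuracy (the mean of per-class recall) is an illustrative choice here and is not a metric the review is stated to have used.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced data set: 1 = pathologic, 0 = healthy.
y_true = [1] * 90 + [0] * 10
# A trivial model that predicts "pathologic" for every recording.
y_pred = [1] * 100

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc:.2f}, balanced accuracy={bal:.2f}")
```

In this contrived case, plain accuracy looks strong even though the model never recognizes a healthy voice, while balanced accuracy falls to chance level.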
The review also highlighted evidence gaps involving pediatric and geriatric populations, underrepresented languages, neurologic voice disorders, laryngopharyngeal reflux disease, and complex clinical presentations. Fewer than 15% of the studies shared source code or complete model documentation, limiting reproducibility.
The researchers concluded that current evidence supported AI primarily as a tool for screening, triage, telemedicine applications, longitudinal monitoring, and decision support rather than as an autonomous diagnostic system.
“These systems cannot yet replace endoscopic assessment for specific diagnosis,” wrote lead study author Samantha Mairesse, of the Department of Surgery at the UMONS Research Institute for Language Science and Technology at the University of Mons in Belgium, and colleagues.
The study received no external funding, and the study authors declared no conflicts of interest.
Source: Journal of Personalized Medicine