
A recent study highlights significant limitations of large language models (LLMs) in medical diagnostics, questioning their reliability in clinical settings.
Artificial intelligence is making strides in healthcare, particularly through specialized algorithms designed for tasks like detecting diabetic eye disease or analyzing CT scans for early signs of cancer. The growing use of general-purpose LLMs such as ChatGPT and Claude for diagnostic questions, however, raises concerns about their reliability in that role. A study by researchers at the New York Institute of Technology College of Osteopathic Medicine examined how several advanced multimodal LLMs, including GPT-5 and Claude Opus 4.5 Extended, performed when asked to analyze a CT brain scan.
The findings revealed a 20% diagnostic error rate across the models, along with marked inconsistencies in their interpretations. Although every model correctly identified the type of scan, one misclassified an ischemic stroke as a hemorrhage, a critical error because the two conditions call for sharply different treatments. Even the models that reached the correct diagnosis gave conflicting assessments of the stroke's timing and of alternative diagnoses.
According to Dr. Milan Toma, the study's lead author, this disparity underscores a key difference between specialized medical AI tools and LLMs. The former are trained for specific diagnostic tasks, whereas LLMs are built primarily for language processing and conversation, which can produce authoritative-sounding yet flawed interpretations. The researchers suggest that the future of healthcare AI may lie in a hybrid approach, pairing specialized diagnostic tools with LLMs for tasks such as clinical documentation, provided healthcare professionals retain oversight.