ChatGPT version 3.5 failed to formulate a correct diagnosis in 83 of 100 pediatric cases, according to recent research published in JAMA Pediatrics.
According to the study's authors, of the 83 missed cases, 72 diagnoses were entirely wrong and 11 were clinically related to the correct diagnosis but too broad to count as correct.
A caveat of this study is that the large language model tested was an older version of ChatGPT. Still, what do these results mean for healthcare and the use of AI?
The study underscores the importance of physician oversight when implementing AI tools and large language models in clinical medicine. These tools are still in their infancy, and much more research is needed before they become mainstream in healthcare. Physicians are, and should always remain, the final arbiters and stewards of patient care, particularly when the stakes are as high as human life and death.
Medical interpretation is often nuanced and requires contextual understanding of many factors. For example, when radiologists interpret a CT scan of the legs, they may encounter subcutaneous edema in the calf. This finding is nonspecific and can be seen with many diagnoses, including cellulitis, contusion from trauma, and vascular disease from heart failure. Physicians integrate information from the patient's history to reach the final diagnosis: in the scenario above, a fever would point toward cellulitis, whereas a recent motor vehicle accident would suggest the subcutaneous edema is from a contusion.
It is precisely this contextual reasoning that AI still needs to develop, as the JAMA Pediatrics study exemplifies. Making the proper diagnosis in the pediatric cases requires not only pattern recognition of symptoms but also consideration of the patient's age and additional clinical context. AI excels at pattern recognition but likely struggles in more complex scenarios where symptoms overlap across multiple diagnoses. This limitation is exactly why physicians must oversee and regulate decisions and diagnoses generated by large language models.
So should the healthcare industry give up on AI as a means to augment patient care?
There are tremendous advantages to AI, and this study should be an impetus for researchers and scientists to keep developing large language models and improving their performance. These tools have the potential to transform medicine by reducing burnout, communicating with patients, transcribing prescriptions, and treating patients remotely.
AI tools and chatbots require datasets for training, and richer, more complex datasets should be used to improve the performance of tools such as ChatGPT. The more comprehensive and less biased these datasets are, the better the resulting performance will be. Bias remains a well-recognized limitation of AI tools and should always be considered when evaluating and improving AI software.
The results of the JAMA Pediatrics study should serve as a gentle reminder that we are not yet where we need to be in the AI revolution in medicine. AI is a tool, not a solution, for healthcare's challenges, and it should always be used hand in hand with the expertise of physicians.