2025 ISAKOS Congress in Munich, Germany

2025 ISAKOS Biennial Congress ePoster


Can Popular AI Large Language Models Provide Reliable Answers to Frequently Asked Questions About Rotator Cuff Tears?

Orhan Mete Karademir, MD, Ankara TURKEY
Ulaş Can Kolaç, MD, Ankara TURKEY
Gökhan Ayik, PhD, Ankara TURKEY
Mehmet Kaymakoglu, MD, Izmir TURKEY
Erdi Ozdemir, MD, Hershey UNITED STATES
Filippo Familiari, MD, Prof., Catanzaro ITALY
Gazi Huri, MD, Prof., Doha QATAR

Hacettepe University, Department of Orthopedics and Traumatology, Ankara, TURKEY

FDA Status Not Applicable

Summary

This study evaluates the information quality and readability of responses from popular AI large language models to patients' frequently asked questions about rotator cuff tears.

Abstract

Background

Rotator cuff tears (RCTs) are common upper extremity injuries that significantly impair shoulder function, leading to pain, reduced range of motion, and decreased quality of life. With patients increasingly relying on artificial intelligence large language models (AI LLMs) for health information, it is crucial to evaluate the quality and readability of the information these models provide.

Methods

A pool of 50 questions related to RCTs was generated by querying popular AI LLMs (ChatGPT 3.5, ChatGPT 4, Gemini, and Microsoft Copilot) and by using Google search. The responses of each AI LLM to these questions were then saved and evaluated. Information quality was assessed with the DISCERN tool and a Likert scale; readability was assessed with the PEMAT Understandability Score and the Flesch-Kincaid Reading Ease Score. Two orthopedic surgeons assessed the responses, and discrepancies were resolved by a senior author.
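
Of these measures, the Flesch-Kincaid Reading Ease Score is a fixed formula over average sentence length and average syllables per word. As a rough illustration only (this is not the study's scoring pipeline, and the syllable counter below is a crude heuristic), it can be computed in Python as follows:

import re

def count_syllables(word):
    # Heuristic: count groups of consecutive vowels; a trailing
    # silent 'e' usually does not add a syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text):
    # Simple sentence and word segmentation via regexes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch Reading Ease formula; higher = easier to read.
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(round(flesch_reading_ease("The rotator cuff stabilizes the shoulder."), 1))

On this 0-100 scale, higher scores indicate easier text; the study's median of 42.05 falls in the "difficult, college-level" band, well above the reading level typically recommended for patient education materials.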

Results

Of the 198 answers, the median DISCERN score was 40, with 56.6% considered sufficient, while the Likert scale showed 96% sufficiency. The median PEMAT Understandability score was 83.33 (77.3% sufficient), and the median Flesch-Kincaid Reading Ease score was 42.05 (88.9% sufficient). Overall, 39.8% of the answers were sufficient in both information quality and readability. Differences were found among the AI models in DISCERN, Likert, PEMAT Understandability, and Flesch-Kincaid scores.

Conclusion

AI LLMs generally cannot yet offer sufficient information quality and readability. While they are not ready for use in the medical field, they show promise for the future. Continuous re-evaluation of these models is necessary given their rapid evolution. Developing new, comprehensive tools for evaluating medical information quality and readability is crucial to ensuring these models can effectively support patient education. Future research should focus on improving readability and delivering consistent information quality to better serve patients.