2025 ISAKOS Biennial Congress Paper
A Custom ChatGPT Can Comprehensively Answer Questions From The International Osteotomy Consensus Statement for the Painful Degenerative Varus Knee
Ahmed Mabrouk, MBBCH (HONS), MRCS, FRCS (Trauma & Orthopaedics), Wakefield, West Yorkshire UNITED KINGDOM
Rohan Bidwai, FRCS, FEBOT, MS, MCh, Birmingham, England UNITED KINGDOM
Shahbaz S Malik, BSc, MB BCh, MSc (Orth Engin), LLM, FRCS (Tr&Orth), Birmingham UNITED KINGDOM
Tarek Boutefnouchet, MBChB MRCS PGCMed MSc FRCS (Tr&Orth) Dip. FIFA Med, Birmingham UNITED KINGDOM
Tamer Sweed, FRCS(Orth), Birmingham, West Midlands UNITED KINGDOM
University Hospitals of Birmingham, Birmingham, UNITED KINGDOM
FDA Status Not Applicable
Summary
A custom ChatGPT can be trained to comprehensively answer questions from specific documents. This can serve as a valuable tool, guiding surgeons in their daily practice to obtain evidence-based answers to scientific questions in a time-efficient manner.
Abstract
Background
Artificial Intelligence (AI) and advanced language models such as ChatGPT are increasingly being utilised in orthopaedic surgery. Models such as ChatGPT, developed by OpenAI, have been tested for their ability to answer questions related to orthopaedic surgery. The responses generated by ChatGPT are based on a blend of licensed data, data curated by human trainers, and publicly available information, which can vary in accuracy. This study aimed to assess the accuracy of a custom-trained ChatGPT model in responding to questions specifically related to high tibial osteotomies, using the ESSKA osteotomy consensus for the degenerative varus knee as the source of information.
Methods
A custom version of ChatGPT was developed using the ESSKA osteotomy consensus for the degenerative varus knee as the primary training material. The custom ChatGPT model was then tested for accuracy by generating responses to a series of 10 questions: 5 directly extracted from the ESSKA consensus and 5 other common questions related to high tibial osteotomies. The generated responses were assessed by three knee surgeons using a custom-made scoring system. The scoring system evaluated accuracy, relevance, clarity, completeness, and adherence to the osteotomy consensus statement document. Each item was scored on a Likert scale from 0 to 3, with 3 being the highest score. Inter-rater reliability was calculated using the intra-class correlation coefficient (ICC).
Results
A total of 30 questions were posed to the custom-trained ChatGPT by the three raters. The mean scores for accuracy, relevance, and clarity were 2.5 ± 0.8, 2.9 ± 0.3, and 2.9 ± 0.2, respectively, with good inter-rater reliability (ICC 0.7, p = 0.004). The mean score for completeness was 2.6 ± 0.5, with moderate inter-rater reliability (ICC 0.5, p = 0.1), and the mean score for adherence to the consensus statement document was 2.5 ± 1.1, with excellent inter-rater reliability (ICC 0.9, p < 0.001).
Conclusion
A custom ChatGPT can be trained to comprehensively answer questions from specific documents. This can serve as a valuable tool, guiding surgeons in their daily practice to obtain evidence-based answers to scientific questions in a time-efficient manner.