
Original research
Use of artificial intelligence chatbots in clinical management of immune-related adverse events
  1. Hannah Burnette (1),
  2. Aliyah Pabani (2),
  3. Mitchell S von Itzstein (3),
  4. Benjamin Switzer (4),
  5. Run Fan (5),
  6. Fei Ye (5),
  7. Igor Puzanov (4),
  8. Jarushka Naidoo (6),
  9. Paolo A Ascierto (7),
  10. David E Gerber (3),
  11. Marc S Ernstoff (8) and
  12. Douglas B Johnson (1)
  1. Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  2. Department of Oncology, Johns Hopkins University, Baltimore, Maryland, USA
  3. Harold C Simmons Comprehensive Cancer Center, The University of Texas Southwestern Medical Center, Dallas, Texas, USA
  4. Department of Medicine, Roswell Park Comprehensive Cancer Center, Buffalo, New York, USA
  5. Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  6. RCSI Cancer Centre, Beaumont Hospital, Dublin, Ireland
  7. Department of Melanoma, Cancer Immunotherapy and Development Therapeutics, Istituto Nazionale Tumori IRCCS Fondazione Pascale, Napoli, Campania, Italy
  8. ImmunoOncology Branch (IOB), Developmental Therapeutics Program, Cancer Therapy and Diagnosis Division, National Cancer Institute (NCI), National Institutes of Health, Bethesda, Maryland, USA
  Correspondence to Dr Douglas B Johnson; douglas.b.johnson@vumc.org

Abstract

Background Artificial intelligence (AI) chatbots have become a major source of general and medical information, though their accuracy and completeness are still being assessed. Their utility in answering questions about immune-related adverse events (irAEs), common and potentially dangerous toxicities of cancer immunotherapy, is not well defined.

Methods We developed 50 distinct questions spanning 10 irAE categories, each with answers available in published guidelines, and queried two AI chatbots (ChatGPT and Bard), along with an additional 20 patient-specific scenarios. Experts in irAE management scored answers for accuracy and completeness using a Likert scale ranging from 1 (least accurate/complete) to 4 (most accurate/complete). Answers were compared across categories and between chatbots.

Results Overall, both engines scored highly for accuracy (mean scores for ChatGPT and Bard 3.87 vs 3.5, p<0.01) and completeness (3.83 vs 3.46, p<0.01). Scores of 1–2 (completely or mostly inaccurate or incomplete) were particularly rare for ChatGPT (6/800 answer-ratings, 0.75%). Of the 50 questions, all eight physician raters gave ChatGPT a rating of 4 (fully accurate or complete) on 22 questions for accuracy and 16 questions for completeness. In the 20 patient-specific scenarios, the mean accuracy score was 3.73 (median 4) and the mean completeness score was 3.61 (median 4).

Conclusions AI chatbots provided largely accurate and complete information regarding irAEs, and wildly inaccurate information (“hallucinations”) was uncommon. However, until accuracy and completeness increase further, appropriate guidelines remain the gold standard to follow.

  • Immune Checkpoint Inhibitor
  • Immune related adverse event - irAE
  • Thyroiditis
  • Colitis
  • Pneumonitis


This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See http://creativecommons.org/licenses/by-nc/4.0/.


WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Large language model (LLM) chatbots can provide information on a wide variety of topics, including medical questions. However, the utility of LLMs for complex immune-related adverse event (irAE) questions is unclear.

WHAT THIS STUDY ADDS

  • We found that ChatGPT provided generally accurate and comprehensive answers to queries about irAEs, though occasional errors were noted.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • Clinicians may use ChatGPT as a resource for irAEs, though additional verification is needed.

Background

The advent of new artificial intelligence chatbots such as ChatGPT, Google Bard, and many others (hereafter referred to as chatbots) has the potential to change medical diagnostics and treatment drastically. These chatbots, built on large language models, learn from various data sets procured from internet sources and produce human-like answers to user queries.1 The answers generated by the chatbots evolve based on human feedback combined with the availability of new or updated sources of information. This allows the chatbot to provide more complex answers that are better aligned with the end-user’s original intentions.

The ever-increasing extent and availability of medical information presents substantial challenges to physicians. Increasingly, both physicians and patients are turning to chatbots to help make medical information more digestible and accessible. Determining whether chatbot answers are accurate and reliable is important, especially given that patients increasingly rely on these answers to inform their medical decision-making.2 Several studies have shown that earlier versions of chatbots provide digestible and fairly accurate information, but may also provide incomplete, inaccurate, or out-of-date answers.3 4 Many of these studies, however, focus on multiple-choice or binary answers, which often do not reflect the open-ended nature of real-world medical practice. Lastly, chatbot responses may also lack the emotional aspects of healthcare, such as empathy, although some studies suggest they perform well in this regard.5–7

This study seeks to analyze the accuracy and completeness of chatbot-generated answers surrounding complex, open-ended questions regarding immune-related adverse events (irAEs). These immune-related toxicities impact multiple organs,8 are treated algorithmically by defined guidelines,9–11 and are common medical problems for physicians caring for patients with cancer. Further, the diverse range of organs affected, the often non-specific clinical presentations, and the multidisciplinary management required make this a challenging area for clinicians, and thus a potentially attractive area for chatbot-derived assistance.

Methods

This cross-sectional study was exempt from institutional review board review given the lack of patient data. Available guidelines for the management of irAEs were reviewed. Based on these guidelines, a total of 50 questions were generated by the senior author (DBJ) and refined/approved by other study authors as representative of common questions that arise in clinical settings. Five questions each were generated from nine common irAE categories (gastrointestinal, hepatic, pulmonary, dermatologic, thyroid, pituitary/adrenal, rheumatologic, neuromuscular, cardiac), with an additional five questions about general irAE management. All questions were designed as descriptive and open-ended in nature (online supplemental table 1), but with clearly defined answers present in available guidelines from international committees with expertise in irAEs.9–11

Supplemental material

Finalized questions were entered into two chatbots (ChatGPT (V.GPT-4) and Google Bard) by the first author (HB) on October 6, 2023, and the answers were provided to the rating physicians. Rating physicians were either members of the Society for Immunotherapy of Cancer immune checkpoint inhibitor and cytokine-related adverse events subcommittee (n=5) or their colleagues with a strong focus on irAE management (n=3). Each rater graded every answer from both chatbots for accuracy and completeness. Accuracy was graded on a 1–4 point Likert scale, with 1 signifying completely inaccurate, 2 mostly inaccurate, 3 mostly accurate, and 4 fully accurate. Raters were instructed to grade accuracy based on guideline content, not personal management style. Similarly, completeness was graded on a 1–4 point Likert scale, with 1 signifying incomplete, 2 missing multiple pieces of key information, 3 missing one piece of key information, and 4 complete. Raters were instructed to grade based on major pieces of key information rather than minor or optional items, with colitis given as a specific example (endoscopic evaluation being major; fecal calprotectin testing being minor/optional).

Grades were summarized with means, medians, and ranges for each chatbot overall and for each irAE category. Scores for completeness and accuracy were compared between chatbots using Wilcoxon signed-rank tests. Inter-rater agreement was assessed with Kendall’s coefficient of concordance since there were >2 raters. The two-sample binomial proportion test was used to compare the proportions of certain ratings between chatbots.
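To make the analysis concrete, the following minimal sketch (Python with NumPy and SciPy; not the authors' actual analysis code) illustrates how the two main comparisons described above could be computed from a questions-by-raters matrix of Likert scores: a Wilcoxon signed-rank test comparing per-question mean scores between chatbots, and Kendall's coefficient of concordance as a measure of inter-rater agreement. The data layout, variable names, and simulated scores are hypothetical, and no tie correction is applied to Kendall's W in this sketch.

import numpy as np
from scipy import stats

def kendalls_w(ratings):
    """Kendall's coefficient of concordance for an (items x raters) score matrix.
    Scores are converted to within-rater ranks (ties receive average ranks);
    no tie correction is applied in this simple sketch."""
    n_items, n_raters = ratings.shape
    ranks = np.apply_along_axis(stats.rankdata, 0, ratings)  # rank questions within each rater
    rank_sums = ranks.sum(axis=1)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (n_raters ** 2 * (n_items ** 3 - n_items))

# Hypothetical 50-question x 8-rater accuracy matrices (Likert scores 1-4)
rng = np.random.default_rng(0)
chatgpt_acc = rng.integers(3, 5, size=(50, 8)).astype(float)
bard_acc = rng.integers(2, 5, size=(50, 8)).astype(float)

# Paired comparison of per-question mean accuracy between chatbots
stat, p = stats.wilcoxon(chatgpt_acc.mean(axis=1), bard_acc.mean(axis=1))
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.4g}")

# Inter-rater agreement for each chatbot's accuracy ratings
print(f"Kendall's W, ChatGPT accuracy: {kendalls_w(chatgpt_acc):.2f}")
print(f"Kendall's W, Bard accuracy: {kendalls_w(bard_acc):.2f}")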

To further judge accuracy and completeness, 20 clinical scenarios were generated by DBJ, approved by the other participating authors, and entered into ChatGPT (but not Bard, given the time that had elapsed and Bard’s poorer performance) on March 20, 2024. Two scenarios were generated for each of the 10 categories and were judged by four of the rating physicians.

Results

Both chatbots were rated for accuracy and completeness on 50 questions from 10 different categories (see online supplemental file). Both chatbots had relatively high scores overall: ChatGPT scored a median of 3.88 for accuracy (mean 3.87) and 3.88 for completeness (mean 3.83) across all questions and raters, while Bard scored a median of 3.5 for accuracy (mean 3.5) and 3.5 for completeness (mean 3.46). Inter-rater agreement was fair across all raters (Kendall’s coefficients of concordance for accuracy and completeness were 0.21 and 0.24 for ChatGPT and 0.27 and 0.24 for Bard).12 Overall, ChatGPT (GPT-4) received significantly higher ratings than Bard for both accuracy and completeness (p<0.001).

We then assessed scores stratified by category by pooling scores across the five questions per category (maximum of 20 per category: five questions × maximum score of 4). For ChatGPT, both mean and median pooled scores for accuracy and completeness were between 19 and 20 in every category except the general immune checkpoint inhibitor (ICI) questions (table 1). Median scores for Bard ranged from 15.5 to 19, with a similar range of mean scores (16–18.5) (table 1). ChatGPT scored numerically higher than Bard in all categories. This difference reached statistical significance (p<0.05) in one category for accuracy (cardiac) and five categories for completeness (hepatic, dermatologic, thyroid, pituitary/adrenal, and cardiac). An additional six categories for accuracy and two categories for completeness showed marginal statistical significance (p<0.1) favoring ChatGPT. By category, ChatGPT scored lowest on the “general” questions and generally high across the specific irAE categories, whereas Bard appeared to perform best in the dermatologic, rheumatologic, neuromuscular, and cardiac categories.

Table 1

Scores for accuracy and completeness for each engine in each category

Multiple questions received ratings of 4 from all eight reviewers: 22/50 (44%) for ChatGPT accuracy and 16/50 (32%) for ChatGPT completeness, compared with 2/50 (4%) for Bard accuracy and 1/50 (2%) for Bard completeness (p<0.001). Ratings of 1 (fully inaccurate or incomplete) were uncommon, given in 2/800 ChatGPT rater-responses (0.3%) and 9/800 Bard rater-responses (1.1%). Ratings of 2 (mostly incorrect or missing multiple key pieces of information) were similarly infrequent for ChatGPT (4/800 rater-responses, 0.5%), though more common for Bard (83/800 rater-responses, 10.4%) (p<0.001).

To assess utility in specific clinical scenarios, we entered 20 patient-specific scenarios (see online supplemental file) into ChatGPT. These answers were also rated highly: mean accuracy was 3.73 (median 4) and mean completeness was 3.61 (median 4). Of the 80 physician ratings, scores were 4 (n=53), 3 (n=23), 2 (n=4), and 1 (n=0).

Discussion

In this study, we found that chatbots, particularly ChatGPT (V.GPT-4), provided generally accurate and complete information surrounding irAEs. Questions were open-ended (not multiple choice), mirroring real-life clinical situations rather than board examinations. The median rating for many questions was 4 (fully accurate and complete), and egregiously wrong answers were uncommon. Thus, these engines appear promising as sources of guidance on irAEs.

Although both engines had a reasonably high degree of accuracy and completeness, ChatGPT appeared further advanced than Bard in providing accurate and comprehensive information. Ratings of 3 or 4 predominated for ChatGPT (794 of 800 rater-responses), showing consistently high grades across physician raters. As chatbots are a new technology, however, they are likely to change and upgrade rapidly, so comparisons between engines may quickly become outdated. It is also likely that different engines will ultimately be optimized for distinct tasks and prioritize different capabilities (eg, accuracy vs comprehensiveness). In addition, chatbots may be designed to maximize other goals, such as conciseness (eg, avoiding extraneous information) or delivering information at a specific educational attainment level. These goals are also important for delivering high-yield information to busy clinicians. Of note, ChatGPT and other engines have shown promise in providing high-quality medical information across a range of medical conditions.13–15 This includes general immuno-oncology questions,16 urological cancers,17 and preoperative counseling for head and neck cancer surgery.18

Interestingly, ratings of 1 (fully incorrect or incomplete) were very uncommon, suggesting that outright “hallucinations” were very rare. When these technologies first emerged, this phenomenon appeared to occur with troubling frequency.19 The rarity of egregiously wrong answers in this data set suggests that such hallucinations may be a surmountable problem, at least for this type of focused question set with concrete answers available in publicly available guidelines. However, it could be argued that less frequent wrong answers may increase the impact of the residual incorrect information, since growing trust in chatbot output may decrease reliance on other, more validated sources.

Tempering this enthusiasm is the fact that most questions did not receive a rating of 4 (fully accurate and/or complete) from all raters. This could reflect subjective disagreement among highly experienced physicians, but could also suggest that these chatbots may not be reliable as stand-alone sources of medical information. A potentially important future direction could include training chatbots specifically on irAE and other cancer-specific guidelines, as has been done with other corpora of text. Until such advances, available guidelines remain the gold standard for medical decision-making. It is also important to note that ratings were subjective and could differ among clinicians (and could be affected by the particular Likert scale used). It is also possible that new features or upgrades could worsen model performance; this will be difficult to assess.

In conclusion, current iterations of chatbots provide fairly accurate and complete information to many questions surrounding irAEs, though important differences are present between different chatbots. Additional research and validation are needed prior to using these engines as “stand-alone” resources.

Data availability statement

All data relevant to the study are included in the article or uploaded as supplementary information. All data associated with this manuscript have been provided as supplemental materials and can be found in online supplemental table 1.

Ethics statements

Patient consent for publication

Ethics approval

Not applicable.

References

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • X @BenSwitzerDO, @DrJNaidoo, @PAscierto

  • Contributors DBJ and HB contributed to conception, design, and writing the manuscript. AP, MSvI, BS, RF, FY, IP, JN, PAA, DEG, and MSE reviewed the manuscript and participated in the research. DBJ is responsible for the overall content as guarantor. As this study is based on assessing AI’s potential impact on the healthcare field, AI was used to generate answers to 50 questions and 20 scenarios so they could be assessed for accuracy and completeness by physicians. AI was not used to write the manuscript in any way nor was it used to analyze the data.

  • Funding DBJ receives funding from the NCI (R01CA227481), the Susan and Luke Simons Directorship for Melanoma, the James C. Bradford Melanoma Fund, and the Van Stephenson Melanoma Fund.

  • Competing interests AP reports grants and personal fees from Bristol-Myers Squibb; personal fees from AstraZeneca, Pfizer, Merck, Roche, and Canadian Agency for Drugs and Technologies in Health; and grants from Alberta Cancer Foundation outside the submitted work. IP has served on advisory boards for Nektar, Iovance, Nouscom, I-O Bio and has stock ownership in Ideaya. DBJ has served on advisory boards or as a consultant for BMS, Catalyst Biopharma, Iovance, Mallinckrodt, Merck, Mosaic ImmunoEngineering, Novartis, Pfizer, Targovax, and Teiko, has received research funding from BMS and Incyte, and has patents pending for use of MHC-II as a biomarker for immune checkpoint inhibitor response, and abatacept as treatment for immune-related adverse events. DEG reports research funding from AstraZeneca, BerGenBio, Karyopharm, and Novocure; stock ownership in Gilead; service on advisory boards or consulting for AstraZeneca, Catalyst Pharmaceuticals, Daiichi-Sankyo, Elevation Oncology, Janssen Scientific Affairs, LLC, Jazz Pharmaceuticals, Regeneron Pharmaceuticals, Sanofi; U.S. patent 11,747,345, patent applications 17/045,482, 63/386,387, 63/382,972, 63/382,257; and is Co-founder and Chief Scientific Officer of OncoSeer Diagnostics, LLC.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.