Article Text

Artificial intelligence and radiomics: fundamentals, applications, and challenges in immunotherapy
  1. Laurent Dercle1,
  2. Jeremy McGale1,
  3. Shawn Sun1,
  4. Aurelien Marabelle2,
  5. Randy Yeh3,
  6. Eric Deutsch4,
  7. Fatima-Zohra Mokrane5,
  8. Michael Farwell6,
  9. Samy Ammari4,7,
  10. Heiko Schoder8,
  11. Binsheng Zhao1 and
  12. Lawrence H Schwartz1
  1. 1Radiology, NewYork-Presbyterian/Columbia University Medical Center, New York, New York, USA
  2. 2Therapeutic Innovation and Early Trials, Gustave Roussy, Villejuif, Île-de-France, France
  3. 3Molecular Imaging and Therapy Service, Memorial Sloan Kettering Cancer Center, New York, New York, USA
  4. 4Radiation Oncology, Gustave Roussy, Villejuif, Île-de-France, France
  5. 5Radiology, Hospital Rangueil, Toulouse, Occitanie, France
  6. 6Division of Nuclear Medicine and Molecular Imaging, Hospital of the University of Pennsylvania, Philadelphia, Pennsylvania, USA
  7. 7Radiology, Institut de Cancérologie Paris Nord, Sarcelles, France
  8. 8Radiology, Memorial Sloan Kettering Cancer Center, New York, New York, USA
  1. Correspondence to Dr Laurent Dercle; laurent.dercle{at}gmail.com

Abstract

Immunotherapy offers the potential for durable clinical benefit but calls into question the association between tumor size and outcome that currently forms the basis for imaging-guided treatment. Artificial intelligence (AI) and radiomics allow for discovery of novel patterns in medical images that can increase radiology’s role in management of patients with cancer, although methodological issues in the literature limit its clinical application. Using keywords related to immunotherapy and radiomics, we performed a literature review of MEDLINE, CENTRAL, and Embase from database inception through February 2022. We removed all duplicates, non-English language reports, abstracts, reviews, editorials, perspectives, case reports, book chapters, and non-relevant studies. From the remaining articles, the following information was extracted: publication information, sample size, primary tumor site, imaging modality, primary and secondary study objectives, data collection strategy (retrospective vs prospective, single center vs multicenter), radiomic signature validation strategy, signature performance, and metrics for calculation of a Radiomics Quality Score (RQS). We identified 351 studies, of which 87 were unique reports relevant to our research question. The median (IQR) of cohort sizes was 101 (57–180). Primary stated goals for radiomics model development were prognostication (n=29, 33.3%), treatment response prediction (n=24, 27.6%), and characterization of tumor phenotype (n=14, 16.1%) or immune environment (n=13, 14.9%). Most studies were retrospective (n=75, 86.2%) and recruited patients from a single center (n=57, 65.5%). For studies with available information on model testing, most (n=54, 65.9%) used a validation set or better. Performance metrics were generally highest for radiomics signatures predicting treatment response or tumor phenotype, as opposed to immune environment and overall prognosis. Out of a possible maximum of 36 points, the median (IQR) of RQS was 12 (10–16). While a rapidly increasing number of promising results offer proof of concept that AI and radiomics could drive precision medicine approaches for a wide range of indications, standardizing the data collection as well as optimizing the methodological quality and rigor are necessary before these results can be translated into clinical practice.

  • tumor biomarkers
  • translational medical research
  • review
  • immunotherapy
  • immunologic surveillance

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See http://creativecommons.org/licenses/by-nc/4.0/.


Introduction

Immunotherapy, a treatment strategy that harnesses a patient’s own immune system, can improve outcomes in several types of cancer. Durable objective responses and improved overall survival (OS) have been achieved in patients with a wide range of malignancies through treatment with intratumoral oncolytic viruses and chimeric antigen receptor T cells, such as those directed against CD19 and B-cell maturation antigen, as well as antagonistic monoclonal antibodies directed against T-cell checkpoint molecules (eg, programmed cell death protein-1 (PD-1), programmed cell death-ligand 1 (PD-L1), cytotoxic T-lymphocyte-associated protein 4, and more recently lymphocyte-activation gene 3 (LAG-3)).1 However, the unique mechanisms of action of these novel treatments have led to the emergence of new, atypical response patterns, including delayed response, pseudoprogression, and mixed response, all of which confound the size-based criteria typically used to guide clinical decision-making.2 3 Moreover, these treatments have been associated with immune-related adverse events (irAEs) and potentially hyperprogression, which can be life-threatening and require identification and prediction by imaging.4–6 Adapting to these novel phenomena demands innovation in imaging, a long-standing cornerstone for the evaluation of cancer treatment response. Artificial intelligence (AI), including deep learning and radiomics, offers a potential solution to the growing complexity of response assessment and represents a pivotal upgrade to the role of imaging in immunotherapy, but many challenges must be overcome before a bench-to-bedside transition can be implemented. The present review will first introduce the fundamentals of this exciting technology before describing results from a literature survey covering ‘AI in immunotherapy’, synthesizing current trends in the field, and identifying broad characteristics of immunotherapy-related radiomics research.

Fundamentals of AI

AI broadly refers to algorithms created to perform tasks previously only achievable by human intelligence. Machine learning is a subfield of AI which involves algorithms that can modify themselves, or ‘learn’, to produce a desired output using the data available. The theory is that computer vision can distinguish characteristic and distinct phenotypes produced on imaging by underlying biological processes.7

Radiomics offers one method to convert images into statistically interpretable and quantifiable data.8 Traditionally, radiomics characterizes images using hand-crafted quantitative features, mathematical formulas that describe the relationships of image pixels in a meaningful way. Domain expertise is required to choose features, such as intensity heterogeneity, edge sharpness, and shape irregularity, that describe characteristics known to be associated with disease (eg, the pattern of central necrosis in an aggressive malignancy). A machine learning algorithm can then learn to adjust the importance of each feature or pattern, as well as their interactions, in order to combine radiomic features into a predictive signature. With the rise of open-source software packages to extract radiomic features from images, researchers around the world have built signatures for diverse clinical applications. These signatures have been correlated with distant metastasis, pathological response, cancer recurrence after radiation therapy, disease-free survival, and even certain genotypes in one or more cancer types.9 10
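As a concrete illustration of this workflow, the sketch below extracts hand-crafted features with pyradiomics, one of the open-source packages alluded to above; the file paths and extraction settings are hypothetical placeholders, not drawn from any study in this review.

```python
# Minimal sketch of hand-crafted radiomic feature extraction with the
# open-source pyradiomics package. Paths and settings are illustrative.
from radiomics import featureextractor

settings = {
    "binWidth": 25,                       # intensity discretization step
    "resampledPixelSpacing": [1, 1, 1],   # isotropic resampling (mm)
    "interpolator": "sitkBSpline",
}
extractor = featureextractor.RadiomicsFeatureExtractor(**settings)
extractor.disableAllFeatures()
extractor.enableFeatureClassByName("firstorder")  # intensity statistics
extractor.enableFeatureClassByName("shape")       # eg, sphericity
extractor.enableFeatureClassByName("glcm")        # texture/heterogeneity

# A CT volume and its tumor segmentation mask (hypothetical NIfTI files).
features = extractor.execute("patient001_ct.nii.gz", "patient001_mask.nii.gz")
for name, value in features.items():
    if not name.startswith("diagnostics"):  # skip extraction metadata
        print(name, value)
```

Reporting such acquisition and extraction settings explicitly is precisely the kind of practice that reproducibility instruments such as the RQS, discussed below, are designed to reward.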

Deep learning is another type of machine learning that minimizes human input, instead seeking to discover patterns algorithmically. Each pixel, or set of pixels, serves as an initial data point for an algorithm commonly known as a neural network. Through training, the neural network learns to progressively combine information to automatically discover patterns, starting with simple characteristics such as a line or a circle before proceeding to more complex representations. There is much excitement about the potential for deep learning to discover previously unknown relationships in data and perform almost any complex mapping given the correct training. These algorithms have thus far produced remarkable advancements in the field of medical imaging and, in particular, have achieved impressive results in cancer detection, characterization, and monitoring.11 12
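A minimal sketch of this hierarchy, assuming a toy PyTorch network over single-channel 64×64 image patches (the architecture, input size, and class count are illustrative, not drawn from the studies reviewed):

```python
# Toy CNN sketch: the first convolution responds to simple patterns such
# as edges, the second composes them into more complex motifs, and a
# linear head maps the pooled features to class scores.
import torch
import torch.nn as nn

class TinyTumorCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # low-level edges
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid-level motifs
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

logits = TinyTumorCNN()(torch.randn(1, 1, 64, 64))  # one grayscale patch
print(logits.shape)  # torch.Size([1, 2])
```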

Radiomics and deep learning each offer advantages. The strength of radiomics lies in its intuitive operation, which affords more easily interpretable results. Using features engineered by domain experts provides reassurance that the algorithm is learning in the correct direction and is basing its decision-making on proven imaging patterns. Radiomics also typically requires a smaller data set to reach the threshold of learning, making it more feasible for experiments with sparse data. Lastly, radiomic features reflect fundamental properties of the images themselves, such as those directly observed by radiologists (contrast, shape, heterogeneity) as well as novel characteristics. These are more easily explained to other physicians, a fact that may help alleviate the historical apprehension towards this new technology.

On the other hand, deep learning holds a unique advantage in that the algorithm is allowed to create its own ‘features’, setting the stage for significant scientific progress. In 2012, AlexNet, one of the first modern convolutional neural networks (CNN), revolutionized the field of image recognition by winning the ImageNet competition with a substantial increase in accuracy over its competitors.13 Deep learning has no theoretical limit to what it can learn and can, with appropriate training data, continue to increase its performance, even learning to identify unexpected new associations between imaging and disease. Incorporating both deep learning and radiomics has been shown to produce even further performance improvements, potentially due to the analysis of previously unseen relationships that are uncovered by deep learning.14 15
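One simple way to realize this combination is late fusion, concatenating deep features with hand-crafted radiomic features ahead of a final classifier. In the sketch below the arrays are random stand-ins, and the feature counts and logistic regression head are assumptions for illustration, not a description of the cited studies’ methods.

```python
# Late-fusion sketch: concatenate deep and radiomic feature vectors,
# then fit a single downstream classifier (illustrative data only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_patients = 120
deep_features = rng.normal(size=(n_patients, 64))      # eg, CNN embeddings
radiomic_features = rng.normal(size=(n_patients, 20))  # eg, texture/shape
labels = rng.integers(0, 2, size=n_patients)           # responder vs not

fused = np.concatenate([deep_features, radiomic_features], axis=1)
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print(clf.score(fused, labels))  # training accuracy; validation still needed
```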

Challenges of AI

The main barrier to integration of AI technology into clinical practice is the need for clinical validation of initial proof-of-concept studies. As preliminary reports are frequently prone to overfitting due to data set limitations, validation requires appropriate experimental design in order to properly assess generalizability.12 Several organizations have published recommendations for appropriate validation, which the Food and Drug Administration has incorporated into its guidance for Software as a Medical Device. It is recommended that validation be performed on data with high technical and demographic diversity, which may be best provided by multicenter data sets. A recent study evaluating the design characteristics of AI medical imaging research found that only 6.0% of the over 500 included articles qualified as validation studies, concluding that the literature may not produce results robust enough for clinical translation.16

Another barrier, especially prominent in radiomics, is the issue of reproducibility. Sources of noise and variation are introduced at multiple steps of the radiomic pathway. Differences in scanner type, reconstruction parameters, and acquisition techniques have all been shown to affect radiomic feature reproducibility and model performance.7 17 18 Until 2020, no standardized definition had been published for even the most common imaging biomarkers.19 To establish a set of suggested research practices, such as the proper description of imaging protocols and providing open-source data and methods, a Radiomics Quality Score (RQS) was developed by Lambin et al.20 Studies that achieve higher scores using this metric could more easily facilitate translation of radiomic results into clinical practice.

Deep learning poses its own obstacles. These models can forgo several of the steps necessary for radiomics, including segmentation and feature extraction, and may be more robust to noise variability in image acquisition. However, deep learning requires a sufficiently large and high-quality data set for neural network training. The ImageNet challenge in 2012, which first established the breakthrough performance of CNNs, had a training data set of over 1.2 million images.13 Obtaining similarly large training data sets of medical images has historically been difficult due to high cost and concerns about patient confidentiality. Promising efforts toward resolving this problem have been made, such as The Cancer Imaging Archive and the National Lung Screening Trial.21 22

Other issues with the deep learning strategy include a somewhat ‘black box’ decision-making process, unexplainable selected features, and a higher risk of overfitting when using small training sets. All of these may contribute to a lack of confidence from physicians that, in turn, impedes implementation in routine clinical settings. Current neural networks can have millions of parameters, leading to extreme model complexity. Repeat training on the same data set may generate networks that look significantly different yet attain similar results. It can be unclear what patterns the algorithm is learning and which parts of the image it considers important. As a result, there is a risk that algorithms may prioritize noise over true signal, prompting a need for extensive validation.23 To overcome this challenge, several breakthroughs have been made in visualizing neural networks and providing insight into the mechanism of their decision-making. SHAP and Grad-CAM are two techniques used to visualize the importance of certain pixels in input images and how they relate to the output, effectively allowing users to understand what the model ‘sees’ via a heat map.24 25
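To make the heat-map idea concrete, here is a compact Grad-CAM sketch in PyTorch: the gradient of a class score with respect to the final convolutional feature maps is pooled into per-channel weights, which then combine those maps into a coarse saliency map. The backbone (an untrained ResNet-18) and the random input are stand-ins for illustration only.

```python
# Grad-CAM sketch: hooks capture the last conv layer's activations and
# the gradient of the top-class score with respect to them.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["maps"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["maps"] = grad_out[0].detach()

layer = model.layer4[-1].conv2                 # final conv layer
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)                # stand-in imaging slice
score = model(x)[0].max()                      # top-class logit
model.zero_grad()
score.backward()

weights = gradients["maps"].mean(dim=(2, 3), keepdim=True)  # pooled grads
cam = F.relu((weights * activations["maps"]).sum(dim=1))    # weighted sum
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224),
                    mode="bilinear", align_corners=False)
print(cam.shape)  # (1, 1, 224, 224): heat map overlaid on the input
```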

Material and methods

Literature search strategy and study selection

In order to review the published literature relating to the use of radiomics in immunotherapy, we queried three databases, MEDLINE (PubMed), CENTRAL (Cochrane Central Register of Controlled Trials), and Embase, from their inception through February 26, 2022. This scoping review included the following key search terms: (immunotherapy) AND (CT OR MRI OR PET) AND (radiomics OR texture OR deep learning OR artificial intelligence). One additional relevant study that did not appear in this search but was identified in literature references was added.26 Titles and abstracts of the articles were screened to determine eligibility; articles were included if they (1) involved immunotherapeutic treatment of human cancers or murine models of human cancers and (2) employed radiomics with positron emission tomography (PET), CT, or MRI (or any combination of the three). Case reports, systematic reviews or meta-analyses, perspectives, editorials, book chapters, workshop reports, and conference abstracts were excluded from the analysis, as were duplicate or non-English studies and publications including only an abstract.

Data extraction

Relevant data were extracted from each eligible publication using a standardized form recording the following information: (1) general publication information (date, PubMed reference number (PMID)/digital object identifier (DOI)), (2) sample size, (3) location of primary tumor, (4) imaging modality, (5) primary and secondary study objectives, (6) data collection strategy (retrospective vs prospective, single center vs multicenter), (7) radiomic signature validation strategy, (8) signature performance, and (9) metrics for calculation of an RQS.

Sample size

Sample sizes were calculated by summing every cohort utilized in a study, including all training, validation, and test sets. Studies were grouped into sample size buckets (<50, 50–99, 100–199, 200–299, 300–399, 400–499, 500–599, 600–699, ≥700) for analysis. The 0–99 range was split into two buckets to better resolve the large number of studies falling within it.
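Bucketing of this kind can be reproduced in a few lines of pandas; the cohort sizes below are fabricated stand-ins, not data from the reviewed studies.

```python
# Sketch of the sample-size bucketing described above (illustrative data).
import numpy as np
import pandas as pd

cohort_sizes = pd.Series([34, 57, 72, 101, 180, 240, 515, 820])
bins = [0, 50, 100, 200, 300, 400, 500, 600, 700, np.inf]
labels = ["<50", "50-99", "100-199", "200-299", "300-399",
          "400-499", "500-599", "600-699", ">=700"]
buckets = pd.cut(cohort_sizes, bins=bins, labels=labels, right=False)
print(buckets.value_counts().sort_index())
```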

Imaging modality

The imaging modality used to generate each study’s radiomics signature was identified as one of the following: MRI alone, CT alone, 18F-fluorodeoxyglucose (FDG) PET alone (PET), a combination of 18F-FDG PET and CT images (PET/CT), or a combination of MRI and CT (MRI/CT).

Tumor type

Primary tumor location of the cancer investigated in each study was identified. For simplicity and due to low numbers of individual studies, reports examining rectal, colon, gastric, and esophageal tumors were grouped together under ‘gastrointestinal tract’. Those investigating a variety of different cancers with one radiomics model (eg, ‘solid tumors’) were categorized as ‘mixed’.

Primary task

For identifying the primary task of each study, we established five broad categories: prognosis, treatment response, general classification, classification by immune environment, and classification by tumor phenotype. Measures of ‘prognosis’ were defined as OS, progression-free survival (time from treatment initiation to clinical or radiological disease progression), and durable clinical benefit (progression-free survival past a predetermined time point, eg, 6 months). ‘Treatment response’ included studies utilizing a primary endpoint of disease response as defined by Response Evaluation Criteria in Solid Tumors (RECIST) V.1.1. ‘General classification’ comprised studies performing other categorization-based tasks (eg, serious sequelae and adverse events from immunotherapy or adjuvant treatment). ‘Immune environment’ included reports examining immune cell (eg, CD8+, CD4+, CD3, T-helper 1/2, B-cells, natural killer cells, among others) infiltration of primary tumors, while ‘tumor phenotype’ covered studies focused on tumor PD-L1 expression or microsatellite instability.

Validation strategy

We examined how each radiomics model was validated. In the interest of clarity when describing model validation, we created the following definitions, which were subsequently used to classify studies. ‘Cross validation’ was listed as the strategy when simple cross validation was used without a separate validation set. ‘Validation set’ indicated one of two things: (1) a study’s original cohort was split into two distinct groups, the first used to train and refine the radiomics signature (the training set) and the second kept separate from training in order to be used for testing (the validation set), or (2) a training set and a validation set were recruited separately, but from the same institution using similar criteria (ie, they were not part of an aggregate original cohort that was split, but they had similar patient distributions and characteristics). The validation set thus acted as an ‘extension’ of the training set, drawn from the same or a similar patient base, but to which the model had not been previously exposed. In a similar fashion, ‘tuning; validation set’ meant that a study’s original cohort was split into three groups: a training, tuning (for adjustment of model parameters after training), and validation set. A test set was identified when the cohort used to evaluate model performance consisted of patients independent of the training/validation sets, that is, they were drawn from a different database/institution/clinical trial, received a different type of treatment (chemotherapy vs immunotherapy), or were collected prospectively following initial training of the radiomics model. Test sets were used by studies either directly after training, denoted by ‘test set’, or following training and the use of a validation set, which we labeled as ‘validation set; test set’. Studies were thus classified by their validation strategies as: ‘none’, ‘cross validation’, ‘validation set’, ‘tuning; validation set’, ‘test set’, or ‘validation set; test set’.
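A minimal sketch of the two most common strategies defined above (‘validation set’ and ‘validation set; test set’), assuming synthetic stand-in data and a logistic regression signature; the split ratio and cohort sizes are illustrative.

```python
# Cohort-splitting sketch: an internal training/validation split, plus a
# held-out external test cohort evaluated only once (illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_internal = rng.normal(size=(150, 10))    # single-institution cohort
y_internal = rng.integers(0, 2, size=150)
X_external = rng.normal(size=(60, 10))     # independent cohort (other center)
y_external = rng.integers(0, 2, size=60)

# 'Validation set': split the original cohort into training and validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_internal, y_internal, test_size=0.33, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation AUC:",
      roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
# 'Validation set; test set': evaluate once on the independent cohort.
print("test AUC:",
      roc_auc_score(y_external, model.predict_proba(X_external)[:, 1]))
```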

Temporal trend

We grouped studies by individual year of publication, except for those published in 2021 and 2022. For plotting purposes, these were combined as ‘2021+’ due to the relatively few studies published since the beginning of 2022.

Methodology

Each report was examined for investigational methodology in two domains: how the data was collected (retrospectively or prospectively) and where the data was collected (single or multiple institutions). If a study utilized an external data set (eg, The Cancer Genome Atlas Project) in addition to single-center cohort recruitment, the location was classified as ‘multiple institutions’.

Performance

Performance evaluation of radiomic signatures generally involves sequential application to each cohort available in the study—moving through the training set, validation set, and test sets, depending on study design—with a different performance metric (PM) reported at each stage. We extracted PMs from the most robust cohort application reported in each individual study, such that metrics were taken from the test set whenever present, from the validation set if one was present but no test set was available, and from the training set if neither test nor validation set was available. The highest reported performance was utilized if multiple metrics were reported at a given level (eg, if the investigators tested their radiomics model on two validation sets).
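This extraction rule reduces to a short precedence function; the dictionary layout below is a hypothetical representation of one study’s reported metrics, used only to illustrate the logic.

```python
# Sketch of the PM extraction rule: prefer test > validation > training,
# and take the highest value when several metrics exist at one level.
def select_pm(reported: dict[str, list[float]]) -> float | None:
    for level in ("test", "validation", "training"):
        values = reported.get(level)
        if values:
            return max(values)
    return None  # no discrete PM reported

study = {"training": [0.91], "validation": [0.82, 0.79]}  # no test set
print(select_pm(study))  # 0.82: the best validation-set metric
```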

RQS

An RQS was calculated for each eligible study. The articles were individually reviewed and data were extracted based on adherence to a 16-component scoring system originally defined by Lambin et al.20 Higher scores indicate higher-quality studies, and the highest possible score is 36 points. This system was developed to provide a measure of standardization for the assessment of radiomics studies, assigning point values for different features of the scientific protocol (eg, whether multiple segmentations were performed, whether measures against overfitting were taken, whether the study was prospective, and whether and how the model was validated). The score does not take into account sample size or how the model actually performs, but instead represents an evaluation of both how rigorous model development is and how impactful the study may be to the field. A full summary of the features evaluated, as well as the associated point values, is available in online supplemental table S1.
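Mechanically, an RQS is a checklist tally. The sketch below computes a score for a hypothetical study over a subset of the components named in this review; the point values are illustrative placeholders, and the definitive rubric (summing to a maximum of 36) is given in Lambin et al and online supplemental table S1.

```python
# Sketch of an RQS-style tally. Component names follow this review;
# point values are illustrative placeholders, not the official rubric.
RQS_POINTS = {
    "image_protocol_quality": 1,
    "feature_reduction": 3,
    "biological_correlates": 1,
    "discrimination_statistics": 1,
    "calibration_statistics": 1,
    "comparison_to_gold_standard": 2,
    "prospective_design": 7,
}

def rqs(checklist: dict[str, bool]) -> int:
    """Sum the points for every criterion the study satisfies."""
    return sum(points for item, points in RQS_POINTS.items()
               if checklist.get(item))

example_study = {
    "feature_reduction": True,
    "discrimination_statistics": True,
    "comparison_to_gold_standard": True,
}
print(rqs(example_study))  # 3 + 1 + 2 = 6 for this hypothetical study
```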


Tabulation and analysis

During our review, all eligible studies were recorded using Google Sheets (V.2022; Google, USA). Figures were created using Adobe Illustrator (V.2022; Adobe, USA).

Results

Identification and selection of studies

Our query of MEDLINE, CENTRAL, and Embase identified 350 studies reporting on the use of radiomics in immunotherapy; one additional study not captured by the search but known from literature references was added, for 351 records in total. Among these, 53 studies were removed as duplicates, 1 for not being in English, and 99 for having only an abstract available (including symposium/conference abstracts). The remaining 198 studies were screened to remove reviews, perspectives, and editorials (n=68), studies not pertaining to immunotherapy or radiomics (n=29), case reports (n=8), workshop reports (n=1), book chapters (n=4), and paper corrections (n=1). Our selection process is outlined in figure 1 and yielded a total of 87 studies that were included in the final report. Their details are summarized in table 1 and figure 2.

Figure 1

Visualization of our literature survey and study selection.

Table 1

Literature review of MEDLINE, CENTRAL, and Embase from database inception through February 2022.

Figure 2

General overview of study characteristics for reports involving radiomics and immunotherapy. (A) Aggregate number of patients included in the study for all purposes; (B) Primary tumor site of the disease investigated; (C) Stated task of the research: prognosis (overall survival, progression-free survival, durable clinical benefit), treatment response (defined by Response Evaluation Criteria in Solid Tumors (RECIST) V.1.1), tumor phenotype (programmed cell death-ligand 1 expression, microsatellite instability), immune environment (tumor immune cell infiltration), general classification (serious sequelae and adverse events from immunotherapy or adjuvant treatment); (D) Strategy for radiomics model performance validation; (E) Year of publication; (F) Data collection strategy and data source. GI, gastrointestinal.

Sample size

The vast majority of studies in this review had total sample sizes above 50 patients (n=69, 79.3%), with a mean of 146 patients. The median (IQR) of cohort sizes was 101 (57–180) (figure 2A). In studies with validation sets, many investigators split the original study cohort into training and validation divisions at a ratio of at least 2:1, respectively. Although this placed an emphasis on large samples for signature development, evaluation of performance metrics was limited by the resulting smaller validation cohorts.

Imaging modality

Radiomics was primarily applied to the evaluation of CT images in isolation (n=58, 66.7%), MRI alone (n=13, 14.9%), and PET/CT studies (n=9, 10.3%). There were few studies (n=6, 6.9%) building radiomics signatures based on PET alone and only one study (1.1%) utilizing an MRI/CT combination.

Tumor type

The site of primary malignancy for the studies reviewed was: lung (n=42, 48.3%), melanoma (n=9, 10.3%), gastrointestinal tract (n=7, 8.0%), pancreas (n=5, 5.7%), brain (n=4, 4.6%), bladder (n=4, 4.6%), head and neck (n=3, 3.4%), liver (n=3, 3.4%), ovary (n=1, 1.1%), breast (n=1, 1.1%), kidney (n=1, 1.1%), lymphatic system (n=1, 1.1%), and blood (study targeting myeloid-derived suppressor cells, n=1, 1.1%) (figure 2B). Five studies (5.7%) employed mixed primary malignancy sites to develop predictive models for solid tumors in general.

Primary task

Radiomic signatures were primarily employed to develop predictive models of prognosis (n=29, 33.3%) or treatment response (n=24, 27.6%) (figure 2C). The remaining studies focused on classification, describing either tumor phenotype (n=14, 16.1%), tumor immune microenvironment (n=13, 14.9%), or other general characteristics of the investigated disease (n=7, 8.0%). Secondary objectives were defined in 24 studies (27.6%), most of which were prognostic (n=11, 12.6%) or described treatment response (n=6, 6.9%).

Validation strategy

Assessment of validation strategy was possible for 82 studies (94.3%). Of these, the most popular strategy for model testing involved the use of a validation set (n=38, 46.3%). Other reports utilized both a tuning and a validation set (n=2, 2.4%), an independent test set alone (n=7, 8.5%), or a validation set followed by an independent test (n=7, 8.5%). The remaining studies either did not perform validation testing (n=10, 12.2%) or relied on less rigorous validation protocols, such as cross-validation on training set data alone (n=18, 22.0%).

Of the test sets identified (n=14, 17.1%), more than half were collected retrospectively from an external institution or database (n=9), while most of the others were derived from cohorts at the same institution as the training/validation sets but differing in treatment regimen (n=2) or consisting of data collected prospectively after model training (n=2). Only one study (1.2%) followed what is considered the most thorough strategy for model development: progression through training and validation sets followed by performance evaluation on a prospective, external cohort. This latter workflow is illustrated in online supplemental figure S2.

Temporal trend

The literature search returned 87 relevant articles from 2018 through late February 2022. Interest in this application of radiomics is steadily increasing, with the number of published articles rising nearly linearly year over year (figure 2E). In fact, just over half of the reports detailed herein were published since January 2021 (n=47, 54.0%). Additionally, no MRI or PET/CT studies were identified before 2019.

Methodology

The data collection strategy for the vast majority of studies was retrospective in nature (n=75, 86.2%), with only a small percentage using prospective data sets (n=8, 9.2%) (figure 2F). Four (4.6%) relevant experimental studies have been published since 2018, employing murine models of human cancers to investigate radiomic predictors of serious sequelae, treatment response, or prognosis.

The majority of studies used patient cohorts recruited from a single institution (n=57, 65.5%) rather than multiple institutions (n=21, 24.1%) (figure 2F). Cohort origin was not applicable for the four experimental studies (4.6%), and could not be identified in five (5.7%) other publications.

Performance

A discrete measure of radiomic signature PM was identified in most studies (n=68, 78.2%), reported either as an area under the receiver operating characteristic curve or as a concordance index. Nine of these reports described the performances of two separate models. Performance was either not reported or was addressed using a metric other than those listed above (eg, HR) in 19 studies (21.8%).

Most (n=69) radiomics signatures could be placed into four broad categories by prediction target: prognosis, treatment response, immune environment, and tumor phenotype. The remaining eight radiomics models, which could not be categorized in this way, involved a variety of classification goals, such as prediction of serious sequelae or immunotherapy side effects. PMs, delineated by predictive aim, are detailed in online supplemental table S3. Within each category, we also clarified how PMs were validated, that is, whether only cross-validation on the training set was performed or whether a validation set/independent test set was employed after model training. Signatures describing prognosis (n=26) had a mean PM of 0.787, with a median (IQR) of 0.771 (0.711–0.875). For signatures of treatment response (n=20), mean performance was higher at 0.808, with a median (IQR) of 0.810 (0.785–0.860). PMs reported on validation sets within this category were higher than those for training and independent test sets, although this could be an artifact of the low numbers of the latter two study types. When describing immune environment (n=10) and tumor phenotype (n=13), models had PMs with means/medians (IQRs) of 0.787/0.760 (0.727–0.848) and 0.816/0.834 (0.790–0.848), respectively. The highest overall mean performances were seen in the radiomics signatures for predicting treatment response and describing tumor phenotype, with a larger number of reports contributing to the former.

RQS

We computed an RQS for the studies identified in our survey based on the metrics detailed in online supplemental table S1. Distribution of radiomics scores as well as a visualization of study adherence to the different components of the RQS can be seen in figure 3. The four experimental studies were excluded from this analysis, and an additional four articles could not be assessed due to incomplete data. Out of a possible total score of 36 points, the median (IQR) RQS observed in the 79 studies reviewed was 12 (10–16). The vast majority of studies (n=55, 63.2%) fell within the range of 11–20, while only one study (1.3%) achieved a score greater than or equal to 25. The RQS categories with the highest adherence, that is, those for which over 75% of studies had at least one point, were image protocol quality (n=63, 79.7%), feature reduction (n=75, 94.9%), biological correlates (n=66, 83.5%), discrimination statistics (n=73, 92.4%), and comparison to a gold standard method (n=69, 87.3%). Most studies did not report calibration statistics (n=64, 81.0%) or clinical utility in the form of a decision curve (n=63, 79.7%) and most did not have or did not use open source data (n=62, 78.5%) or prospective cohorts (n=68, 86.1%). Notably, only one (1.3%) article reported a phantom study and none performed cost-analysis including a report of quality-adjusted life-years.

Figure 3

Top: (Left) Histogram of radiomics quality scores assigned to the 87 studies included in this review. Scores have been placed in bins of five with the exception of the highest bracket, which ranges from 30 (inclusive) to the highest theoretical score of 36. (Right) Breakdown of individual radiomics score components in the studies surveyed. The red bar indicates the proportion of studies (out of 79 with scorable data) that did not receive a point in that category. The green bar overall represents the portion of studies which received at least a point in the category, with darker shades indicating serially increasing point values (eg, validation: lightest shade of green is one point with successively darker shades indicating additional points for more robust validation methods, as defined in the methods section and online supplemental table S1). Bottom: Comparison of key studies with a radiomics quality score >15 (n=23, 26.4%). The articles are organized by sample size (x-axis) and radiomics quality score (y-axis) and are represented by icons denoting studied sample size and reported performance metrics (area under the receiver operating curve, concordance index, etc). Studies without formal validation or test sets are demarcated by a black outline.

Discussion

Integration of AI to guide the care of patients with cancer treated with immunotherapy holds great promise. Per our review, several studies have successfully trained models to perform specific tasks for guidance in patient management, including prediction of response to treatment (n=24, 27.6%) or prognosis (eg, survival, remission time) following immunotherapy (n=29, 33.3%). However, the field is undoubtedly still in its infancy. The earliest of the 87 studies meeting our search criteria was published in 2018. The median cohort size of 101 suggests a high risk of overfitted results, further demonstrated by the fact that only 54 studies (out of 82 for which data were available, 65.9%) included a validation or independent testing data set. Out of a theoretical maximum RQS of 36 points, the median (IQR) was 12 (10–16), indicating a need for the field to adopt more robust methodology for radiomics model development and application.20

Moving forward, establishing a set of best practices for AI model development will involve addressing a few key points. Primarily, there is a need to precisely define the outcomes to be predicted by AI. For immunotherapy specifically, reference standards for treatment response and disease progression will need to account for atypical patterns of response, including pseudoprogression and hyperprogression. Additionally, in order to direct the focus of future model development, the most interesting and predictive imaging biomarkers, such as those involved in radiogenomics and radiotranscriptomics (eg, PD-1/PD-L1 expression, tumor mutational burden), must be identified and described using standardized definitions. With these addressed, a new gold-standard strategy for model development, involving adherence to a set of systematic processes and measures as described previously, should be established.9 20 To avoid overfitting, studies should divide initial patient cohorts into two groups: one for training and the second for model validation on a patient population similar to that of the training set but to which the model has not been previously exposed. The models must then be externally validated on multicenter data and prospective cohorts in order to best simulate real-world environments.16 Lastly, model development will need to implement stress testing to address the issue of underspecification, or failing to capture the inner logic of an underlying system due to confounding factors in data distributions.27 To ensure broad generalizability, these stress tests should become standard practice, just as crash tests are fundamental to the automotive industry.

Our survey revealed a dearth of truly robust cohort testing, limiting the widespread applicability of reported models and overall indicating that the use of AI for medical imaging in immunotherapy remains in a preliminary stage. The level of evidence and standardization will need to progress before the technology can be applied to clinical practice.

Notable studies and biomarkers

Notwithstanding the challenges with validation previously discussed, AI has so far achieved some success in the baseline assessment of tumors prior to therapy. He et al utilized deep learning features to distinguish levels of tumor mutational burden, which resulted in groups with distinctly different overall and progression-free survival after treatment with immune checkpoint inhibitors.28 Prior evidence also demonstrated the correlation between tumor infiltration by CD8+ immune cells and response to immunotherapy, especially treatment with anti-PD-1 and anti-PD-L1 agents.29 Separately, a CT-based radiomic signature was created to predict the tumor immune environment and discriminate between tumor-inflamed and tumor-desert phenotypes.30 High baseline scores on this signature (and therefore greater CD8+ T cell infiltration) were associated with a higher chance of achieving objective response at 3 and 6 months as well as improved OS. This same radiomics signature was subsequently used to assess response after stereotactic body radiation therapy and pembrolizumab in metastatic treatment-refractory adult solid tumors, once again demonstrating a significant association with progression-free survival.31 Additionally, a deep learning network was trained on multimodal data sources, combining imaging, laboratory, and clinical data from patients with non-small cell lung cancer (NSCLC) who received anti-PD-1/PD-L1 agents.32 This model was capable of distinguishing responders from non-responders and predicting survival benefit from therapy in certain patients with stable disease.

Post-immunotherapy assessment is another promising application of AI. In a retrospective analysis of the CheckMate trials, a radiomics signature was developed that performed quantitative analysis of early tumor changes between baseline and the first on-treatment assessment.26 Treatment insensitivity and shorter OS were found to be associated with an exponential increase in radiomic features assessing tumor volume, invasion of tumor boundaries, and tumor spatial heterogeneity. Applying a similar concept to metastatic colorectal cancer treated with chemotherapy and targeted therapies, the same group demonstrated improved performance in prognostic classification over standard RECIST V.1.1 criteria.33 Another radiomics signature applied to metastatic melanoma also outperformed RECIST V.1.1 in estimating OS and was better able to distinguish between pseudoprogression and true progression at 3 months.34 Lastly, a neural network identified morphological changes on pretreatment and post-treatment chest CTs to predict 1-year survival in patients with NSCLC who received nivolumab.35 Here, the use of visualization heat mapping revealed the importance of gross morphologic changes, including those in nodal features, lung and bone metastases, pleural effusions, atelectasis, and consolidations.

AI has also found success in metabolic imaging. Certain PET features were found to be associated with OS and disease progression, predicting the response of patients with NSCLC to immunotherapy.36 37 PET, CT, and PET/CT fusion features were extracted from pretreatment images of patients with NSCLC to create a multiparametric radiomics signature that accurately identified durable clinical benefit resulting from checkpoint blockade immunotherapy.38 These results were subsequently validated in both retrospective and prospective test cohorts. The same group then applied deep learning techniques to achieve similar results in the same data sets.39

While immunotherapy is a cornerstone of treatment in patients with advanced cancer, irAEs due to unbridled T-cell activation have emerged as a concern requiring specific detection and support. Several studies have applied radiomics to predict toxicity and irAEs related to immunotherapy. One report utilized AI to distinguish between pituitary metastasis and hypophysitis by incorporating both MRI and clinical features.5 In another study, PET/CT radiomics predicted the development of severe irAEs in patients with NSCLC treated with immunotherapy.40 This model was validated on a prospective cohort in addition to standard training and testing data sets.

Finally, the application of AI has extended beyond radiographic and functional imaging alone towards the incorporation of histopathological slides. This approach has been studied for predicting diagnosis and prognosis, forecasting response to immune checkpoint blockers, and characterizing the tumor immune microenvironment through determination of genomic class, but is beyond the scope of this study.41–43 For further details, refer to several reviews published on these topics.41 44

Clinical application

The adoption of AI in clinical practice will require developing transparent machine-learning models in which the underlying logic of the learning process can be revealed to and understood by humans. Knowing how the model arrived at prognostic and predictive outputs allows rational application to real-world clinical scenarios. On review, we observed that the majority of relevant radiomics research using CT, MRI, and PET relies on a combination of a limited subset of imaging biomarkers that capture specific macroscopic, anatomical, and functional characteristics. The macroscopic features in particular are of great clinical use for assessing prognosis. For example, overall tumor burden and organ-specific localization of metastatic disease were found to be associated with survival, with liver metastases indicating the shortest survival and greatest acceleration in tumor growth.45 Additionally, skeletal muscle index—a surrogate marker of sarcopenia derived from the analysis of muscle surface area on a single CT scan slice at the level of the lumbar vertebra L3—was significantly associated with OS in patients after initiation of immunotherapy.45 Among more functional classifiers, increased bone marrow glucose metabolism was significantly and positively associated with transcriptomic profiles of regulatory T-cells, and spleen metabolism was associated with an immunosuppressive environment and poor outcome.46 47
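For reference, the skeletal muscle index mentioned above is conventionally computed by normalizing the L3 muscle cross-sectional area to the square of the patient’s height:

```latex
\[
  \mathrm{SMI}\;(\mathrm{cm^{2}/m^{2}})
  \;=\;
  \frac{\text{skeletal muscle cross-sectional area at L3}\;(\mathrm{cm^{2}})}
       {\text{height}^{2}\;(\mathrm{m^{2}})}
\]
```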

The goal of the RQS discussed in this review is to provide a measure of standardization for the assessment of studies in the field. The scoring system considers robustness of scientific protocol (including corrections for overfitting, consideration of temporal variations, and cut-off analyses) as well as the relevance of results in real-world settings (including identification of biological correlates, comparison to gold-standard methods, and extent of validation). The RQS thus assesses many of the features required to instill confidence in the technology for clinical practice, allowing physicians to transparently view how a biomarker was selected, how and on what cohorts the algorithm was validated, and overall how repeatable the entire sequence of testing would be in a different environment. However, the score overlooks some study components that may be useful for defining generalizability, particularly in oncology. Ideally, a more comprehensive score would incorporate the total number of lesions included in the model as well as how the lesions were actually segmented (two-dimensional vs three-dimensional). Additionally, quantification of how the examined immunotherapies were documented and identification of possible confounding variables (eg, local vs radiation therapy, intratumoral immunotherapy) would elucidate the applicability of a particular radiomic signature to other tumor types and therapies. Finally, inter-evaluator reproducibility may be a concern for RQS assignment, as demonstrated by Sanduleanu et al.48 Clearer descriptions of individual score components might help the next iteration reduce the role of subjective evaluation in the RQS.

Conclusion

The results for AI application to medical imaging of patients on immunotherapy are promising but preliminary. We can envision a future in which AI-derived clinical decision support algorithms assist clinicians in distinguishing between the myriad immune response patterns to achieve earlier and better identification of responders, non-responders, and those likely to develop adverse events. However, routine clinical implementation will require further evaluation, as the current evidence still lacks external validation and proof of generalizability. Once these challenges of AI implementation have been overcome, three areas of application offer potentially transformative outcomes. First, there will be an opportunity to better define the diagnosis and character of a patient’s cancer with the help of AI to analyze pathology slides, blood biomarkers, and radiologic images. Paired with its ever-improving predictive value through data accumulation, this deployment of AI will enable increased personalization of therapeutic strategies. Second, by using AI tools at baseline and throughout treatment, clinicians will have the opportunity to dynamically adapt their treatments earlier and more reliably than is currently possible, reacting to disease evolution at the first sign of change. Last but not least, AI will provide the field of immuno-oncology with new tools for deciphering the tumor immune environment by recognizing patterns on medical images that are associated with therapeutically actionable pathways and subsequently guiding theranostic approaches.

Ethics statements

Patient consent for publication

Ethics approval

Not applicable.

References

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • Twitter @laurentdercle

  • LD and JM contributed equally.

  • Contributors LD and JM contributed equally. The manuscript was written through contributions of all authors.

  • Funding JM was funded by the Kirschstein-NRSA #5T35HL007616-42.

  • Competing interests None declared.

  • Provenance and peer review Commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.