Abstract
Background The inflamed immune phenotype (IIP), defined by enrichment of tumor-infiltrating lymphocytes (TILs) within intratumoral areas, is a promising tumor-agnostic biomarker of response to immune checkpoint inhibitor (ICI) therapy. However, it is challenging to define the IIP in an objective and reproducible manner during manual histopathologic examination. Here, we investigate artificial intelligence (AI)-based immune phenotypes capable of predicting ICI clinical outcomes in multiple solid tumor types.
Methods Lunit SCOPE IO is a deep learning model which determines the immune phenotype of the tumor microenvironment based on TIL analysis. We evaluated the correlation between the IIP and ICI treatment outcomes in terms of objective response rates (ORR), progression-free survival (PFS), and overall survival (OS) in a cohort of 1,806 ICI-treated patients representing over 27 solid tumor types retrospectively collected from multiple institutions.
Results We observed an overall IIP prevalence of 35.2% and significantly more favorable ORRs (26.3% vs 15.8%), PFS (median 5.3 vs 3.1 months, HR 0.68, 95% CI 0.61 to 0.76), and OS (median 25.3 vs 13.6 months, HR 0.66, 95% CI 0.57 to 0.75) after ICI therapy in IIP compared with non-IIP patients, respectively (p<0.001 for all comparisons). On subgroup analysis, the IIP was generally prognostic of favorable PFS across major patient subgroups, with the exception of the microsatellite unstable/mismatch repair deficient subgroup.
Conclusion The AI-based IIP may represent a practical, affordable, clinically actionable, and tumor-agnostic biomarker prognostic of ICI therapy response across diverse tumor types.
- Lymphocytes, Tumor-Infiltrating
- Tumor Biomarkers
- Immune Checkpoint Inhibitors
Data availability statement
Data are available upon reasonable request. The TCGA data used in the current study are publicly available via the National Cancer Institute Genomic Data Commons portal (https://portal.gdc.cancer.gov/). The data in the multi-institutional ICI-patient external validation data set are not publicly available due to institutional restrictions governing human subject privacy protection. As these data are stored in controlled access repositories, the data, or a subset of the data, may be made available upon reasonable request following submission of a research protocol detailing intended use and institutional review board ethics review and approval; inquiries should be directed to JS (Stanford), Y-LC (SMC), TL (CNUH), HK (SNUBH), and YKC (Northwestern University). The remaining data are available within the Article and Supplementary materials.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Recent studies have demonstrated potential association between the distribution of tumor-infiltrating lymphocytes (TIL) within the tumor microenvironment and response to immune checkpoint inhibitor (ICI) therapies. However, manual evaluation of TILs can be time-consuming, labor intensive, and subject to interobserver variability.
WHAT THIS STUDY ADDS
This study demonstrates the ability of an artificial intelligence (AI) model that runs on routine H&E-stained pathology whole-slide images of pretreatment tumor samples to predict ICI treatment outcomes in a real-world multicenter cohort of 1,806 ICI-treated patients representing over 27 different solid tumor types.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
An AI-based assessment of immune phenotypes is associated with better clinical outcomes after ICI therapy across diverse solid tumor types, suggesting its potential as a prognostic biomarker for ICI treatment planning.
Introduction
Immune checkpoint inhibitors (ICI) have become a major part of the standard of care for various tumor types.1 Several predictive and prognostic biomarkers, including programmed cell death ligand 1 (PD-L1) expression, high microsatellite instability/mismatch repair deficiency (hereafter referred to as MSI), and tumor mutational burden (TMB) have been approved for use to guide ICI treatment decisions, but concern exists regarding their technical and clinical limitations.2 PD-L1 expression by immunohistochemistry (IHC) has been the most extensively investigated. However, pivotal studies supporting the Food and Drug Administration’s (FDA) ICI drug approvals have shown that PD-L1 IHC predicts ICI treatment response in only 28.9% of cases,3 limited to specific tumor types. Furthermore, PD-L1 IHC performance and interpretation strongly rely on factors such as specimen fixation method, the choice of antibody clone and staining platform, pathologist experience level, and the specific IHC scoring method used. Moreover, the thresholds for PD-L1 positivity vary by cancer type and treatment indication.4
MSI and TMB have recently been approved as tissue-agnostic biomarkers of ICI response, as MSI and TMB-high are associated with increased numbers of tumor neoantigens.5 6 However, the overall prevalence of MSI is less than 2% in pan-cancer studies,7 8 and estimation of TMB by next-generation sequencing (NGS) has also been subject to technical and practical limitations. Compared with IHC, NGS testing has more stringent specimen requirements, a longer turnaround time, and higher costs. Although a cut-off of 10 or more mutations per megabase (Mb) is often used for defining TMB-high, as per a recent FDA approval, this threshold is limited to the few tumor types investigated in one study6 and likely needs to be adjusted for other tumor types.9–12
Given the limitations of current biomarkers, there is a significant need for additional novel biomarkers that are more time and cost efficient, universally applicable, and able to capture a significant proportion of potential ICI responders. The inflamed immune phenotype (IIP) is a potential new biomarker which is directly associated with, or reflective of, the mechanism of action of ICIs, which impact the activity of immune cells within the tumor microenvironment (TME).
Clinical outcomes after ICI treatment have been associated with the spatial localization of tumor-infiltrating lymphocytes (TILs) within the TME across several tumor types.13 14 TIL evaluation on H&E-stained tumor pathology slides collected during routine clinical care could potentially serve as a novel tumor-agnostic biomarker for ICI response. Advances in artificial intelligence (AI), and specifically, deep learning, offer the possibility of more objective and reproducible automated computational TIL assessment.15 Building on our prior work in automated histopathologic cancer region segmentation and object detection,16 17 we recently demonstrated the ability of deep learning to perform TME immune phenotyping on H&E-stained whole-slide images (WSI) of non-small cell lung cancer (NSCLC) and nasopharyngeal carcinoma, showing that the IIP is correlated with survival and response to ICI treatment.18–20 In the current study, we extend this work in a pan-cancer analysis to assess whether the immune phenotype (IP), as determined by an automated deep learning model which performs TIL analysis on routine H&E-stained WSI, might serve as a novel, clinically-actionable “tumor-agnostic” biomarker for predicting ICI treatment outcomes in a large, real-world sample of patients representing over 27 different solid tumor types.
Methods
Classification of the artificial intelligence-based immune phenotype
The Lunit SCOPE IO model (Lunit, Seoul, Republic of Korea)18 is a deep learning model that classifies the IP of the TME based on TIL distribution and density within H&E WSI. Model inference consists of two main stages: (1) tissue segmentation and cell detection, followed by (2) IP classification based on the outputs from the preceding stage.
Deep-learning-based tissue segmentation and cell detection stage
In this first stage, a convolutional neural network (CNN), hereafter referred to as the tissue segmentation model, performs semantic segmentation of cancer area (CA) and cancer stroma (CS) within a WSI (where CA refers to cancer epithelium or, in the case of non-epithelial tumors, the non-stromal tumor cells). In parallel, another CNN detects TILs using a cell detection model that identifies both tumor cells and lymphocytes. The data sets for training and tuning (optimizing) these CNNs were drawn from a pool of 17,296 H&E WSIs of over 24 different solid tumor types, including NSCLC and rarer cancer types (online supplemental tables 1a,b), collected from over nine different sources/institutions, scanned at 20× to 40× magnification (0.25 µm to 0.5 µm per pixel). As WSIs cannot be directly input into models due to their large size, representative 1,024×1,024 pixel patches were computationally extracted from pathologist-delineated tumor regions from each WSI, summing to a total effective area of 3.44×10¹⁰ µm² for training and 8.75×10⁹ µm² for tuning. The training and tuning sets for the tissue segmentation model consisted of 55,325 patches extracted from 13,962 WSIs and 13,962 patches extracted from 849 WSIs, respectively, manually annotated for CA and CS regions by board-certified pathologists. The training and tuning subsets for the cell detection model consisted of 5,698 patches extracted from 2,485 WSIs and 1,925 patches extracted from 849 WSIs, respectively, manually annotated by board-certified pathologists for tumor cells and lymphocytes by placing a point annotation within the nucleus of each cell. All annotations were independently verified by a second pathologist before being used as the ground truth for model development. A total of 104 board-certified pathologists participated in annotation.
The architecture of the tissue segmentation model was based on DeepLabV3+,21 with a Resnet-34 backbone as a feature extractor.22 This model takes as input patches of size 1,024×1,024 pixels, performing semantic segmentation on each patch (predicting the likelihood of each pixel belonging to the CA, CS, or background non-cancer tissue classes) and outputting a tissue class probability map of size 1,024×1,024 pixels. The model was trained using a Dice loss function23 and optimized using the Adam optimizer24 with a learning rate of 0.0001, achieving a performance of 0.82 and 0.67 on the Intersection-over-Union metric for CA and CS, respectively (please see online supplemental methods for additional details).
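To make the two quantities named above concrete, the following is a minimal sketch of a soft Dice loss and the Intersection-over-Union metric for a single tissue class. It is an illustration only, not the authors' training code: the function names, the flattened-list representation of a patch, and the smoothing constant `eps` are assumptions introduced for this example.

```python
def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss for one class.

    probs:   predicted per-pixel probabilities in [0, 1] (flattened patch)
    targets: ground-truth labels, 0 or 1 (flattened patch)
    eps:     smoothing term (an assumption) that avoids division by zero
    """
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def iou(pred_mask, true_mask):
    """Intersection-over-Union between two binary masks (flattened)."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    # Two empty masks are treated as a perfect match (a convention chosen here).
    return inter / union if union else 1.0
```

A perfect prediction gives a Dice loss close to 0, and a prediction with no overlap gives a loss close to 1.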
The cell detection model was also based on the DeepLabV3+ and Resnet-34 architectures.21 22 Cell nuclei were annotated as points, but this model casts the cell detection problem as a dense pixel prediction task. We therefore generated a circle of radius 0.95 µm centered around each point annotation during the training stage and trained the model using a Dice loss function23 with Adam optimization24 with a learning rate of 0.002. The inputs to the model were also patches of size 1,024×1,024 pixels, with the outputs being probability maps of size 1,024×1,024 pixels where each pixel represents the likelihood of a TIL existing in that location. A post-processing stage was applied to extract the location of the cells (described in the online supplemental methods). The F1-score for this model on lymphocyte detection was 0.69.
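The conversion of point annotations into a dense training target can be sketched as follows. This is a hypothetical illustration, assuming a resolution of 0.25 µm per pixel (so that a 0.95 µm radius corresponds to 3.8 pixels) and a plain nested-list mask; the function name `disk_mask` and its parameters are not taken from the original work.

```python
def disk_mask(height, width, points_px, radius_um=0.95, um_per_px=0.25):
    """Render a binary mask with a filled disk around each point annotation.

    points_px: iterable of (x, y) nucleus positions in pixel coordinates
    radius_um: disk radius in micrometres (0.95 µm, as stated in the text)
    um_per_px: scan resolution (assumed to be 0.25 µm per pixel here)
    """
    radius_px = radius_um / um_per_px
    r2 = radius_px ** 2
    reach = int(radius_px) + 1  # half-width of the bounding box to scan
    mask = [[0] * width for _ in range(height)]
    for cx, cy in points_px:
        for y in range(max(0, int(cy) - reach), min(height, int(cy) + reach + 1)):
            for x in range(max(0, int(cx) - reach), min(width, int(cx) + reach + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= r2:
                    mask[y][x] = 1
    return mask
```

Disks near the patch border are simply clipped to the patch, which is one reasonable convention among several.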
The performance of the tissue region segmentation and cell detection models was validated on a separate internal validation set consisting of 356 WSIs from over 17 different tumor types collected from over 8 different sources/institutions (online supplemental table 2), scanned at 20× to 40× magnification (0.25 µm to 0.5 µm per pixel), with resultant area under the receiver operating characteristic (AUROC) values for segmentation of CA and CS and TIL detection of above 0.95 (figure 1B). The outputs from the preceding tissue segmentation and cell detection models were used in the subsequent IP classification stage.
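For readers less familiar with the AUROC, it can be computed directly as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below is a generic illustration of that definition; the function name and the input format are assumptions, and it is not the evaluation code used in the study.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores.

    scores: model outputs (higher means more likely positive)
    labels: ground truth, 1 for positive and 0 for negative
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This quadratic-time form is adequate for small examples; rank-based implementations are preferable for large data sets.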
Immune phenotype classification stage
In this stage, we sought to classify the tumor present in each patient’s WSI into one of three general IP.25 26 In the IIP, there is a high density of TILs present within the CA. In the immune excluded phenotype (IEP), TILs are abundant within the CS but excluded from the CA. In the immune desert phenotype (IDP), TILs are scarce within both the CA and CS.
To perform TIL analysis within WSI of various sizes, each WSI was divided into 1 mm² grids, with the IP of each grid classified as IIP, IEP, or IDP based on empirically determined TIL density criteria (described in the following paragraphs). The overall WSI-level Inflamed Score (IS), Immune-Excluded Score (IES), and Immune-Desert Score (IDS) were calculated by dividing the number of grids having that respective phenotype by the total number of grids analyzed within the WSI. If the WSI-level IS exceeded a prespecified threshold, that WSI was classified as having a WSI-level IIP. The overall workflow for IP classification is illustrated in figure 1A.
Determination of the TIL density cut-offs for grid-level IP classification and the cancer-type agnostic IS threshold for classification of a WSI as IIP was based on prior evidence demonstrating that a T-cell-inflamed gene expression profile (as characterized by the interferon-gamma responsive gene (IFNG) signature) is predictive of ICI clinical response.27 We hypothesized that ICI responders would exhibit a stronger IFNG signature and comprise approximately 25% of patients with pan-carcinoma. Therefore, the TIL density cut-off and optimal IS threshold were determined as those scores that predicted the upper 25% of IFNG signature levels in The Cancer Genome Atlas (TCGA) pan-carcinoma data set (n=7,454, online supplemental table 3)28 with the highest AUROC and greatest sum of sensitivity and specificity.
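Selecting the threshold with the greatest sum of sensitivity and specificity is equivalent to maximizing Youden's J statistic. A minimal sketch of that selection step is given below, under the assumption that a sample is called positive when its score is at or above the candidate threshold; the function name and the exhaustive search over observed scores are illustrative choices, not a description of the authors' implementation.

```python
def best_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity) maximizing sens + spec.

    scores: candidate biomarker values (e.g. an Inflamed Score per slide)
    labels: 1 if the sample is in the target group (e.g. top 25% of the
            IFNG signature), else 0; both classes must be present
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best = None
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 0)
        sens, spec = tp / n_pos, tn / n_neg
        # Keep the first threshold that achieves the largest sum.
        if best is None or sens + spec > best[1] + best[2]:
            best = (thr, sens, spec)
    return best
```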
We found that an intratumoral TIL density cut-off of 200/mm² yielded the highest AUROC (0.7772) for predicting a 75th percentile or higher IFNG signature level. As the IFNG signature levels in the TCGA data set were derived from bulk sequencing data without distinction between cancer parenchyma and stroma, we set a consistent cut-off of 200/mm² for stromal (CS) TILs to ensure a comparable distribution to that observed in cancer parenchyma (CA). The IP of each grid was therefore classified using the following criteria: grid-level IIP, if the TIL density within the total CA in the grid is ≥200/mm²; grid-level IEP, if the TIL density within the total CA is <200/mm² and that within the total CS is ≥200/mm²; and grid-level IDP, if the TIL density is <200/mm² in both the total CA and CS within the grid. At the 200/mm² TIL cut-off, an optimal IS of 19.479% resulted in the maximum sum of sensitivity (65.0%) and specificity (78.2%), irrespective of TMB status (AUROC 0.76 for the TMB-high population and 0.77 for the TMB-low population). Therefore, we set 20.0% as the IS threshold for WSI-level IIP classification, regardless of cancer type (hereafter referred to as the universal threshold) (figure 1C, online supplemental table 3).
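The grid-level criteria and the WSI-level scores described above can be summarized in a short sketch. The cut-off of 200 TILs/mm² and the 20.0% universal threshold come from the text; the function names, the input format (one pair of CA and CS TIL densities per 1 mm² grid), and the strict "greater than" comparison at the WSI level (following the wording "exceeded") are assumptions made for illustration.

```python
TIL_CUTOFF = 200.0    # TILs per mm², grid-level cut-off from the text
IS_THRESHOLD = 20.0   # universal Inflamed Score threshold (%), from the text


def classify_grid(til_density_ca, til_density_cs, cutoff=TIL_CUTOFF):
    """Classify one grid from TIL densities in cancer area (CA) and stroma (CS)."""
    if til_density_ca >= cutoff:
        return "IIP"   # inflamed: TILs dense within the cancer area
    if til_density_cs >= cutoff:
        return "IEP"   # excluded: TILs dense in stroma only
    return "IDP"       # desert: TILs scarce in both compartments


def wsi_scores(grids):
    """Return the IS, IES, and IDS (as percentages) for a list of (CA, CS) grids."""
    if not grids:
        raise ValueError("at least one grid is required")
    labels = [classify_grid(ca, cs) for ca, cs in grids]
    return {p: 100.0 * labels.count(p) / len(labels) for p in ("IIP", "IEP", "IDP")}


def is_wsi_iip(grids, threshold=IS_THRESHOLD):
    """A slide is called WSI-level IIP when its Inflamed Score exceeds the threshold."""
    return wsi_scores(grids)["IIP"] > threshold
```

For example, a slide with five grids of which two meet the CA criterion has an IS of 40% and would be classified as IIP under the universal threshold.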
ICI-treated patient data set
The final optimized Lunit SCOPE IO model was applied to an independent real-world data set (N=1,806 patients) of H&E-stained WSIs scanned at 40× magnification (0.25 µm per pixel), derived from pre-ICI treatment formalin-fixed, paraffin-embedded (FFPE) surgical resection and biopsy specimens with accompanying clinical outcomes, including progression-free survival (PFS), overall survival (OS), and best overall response (BOR) after ICI monotherapy or ICI combination therapy, as assessed by Response Evaluation Criteria In Solid Tumors (RECIST) V.1.1.29 The WSIs were collected from Stanford University Medical Center (Stanford, n=688), Samsung Medical Center (SMC, n=653), Seoul National University Bundang Hospital (SNUBH, n=269), Chonnam National University Hospital (CNUH, n=183), and Northwestern Memorial Hospital (Northwestern, n=13). ICI combination therapy regimens included at least one other anti-neoplastic drug, such as conventional chemotherapy. All work was conducted in accordance with the Declaration of Helsinki for biomedical research, after institutional review board approval at each participating institution.
PD-L1 immunohistochemistry and other ancillary biomarker testing
PD-L1 IHC was performed using the US FDA-approved Dako PD-L1 IHC 22C3 PharmDx kit (Agilent Technologies, Santa Clara, California, USA), with scoring of PD-L1 expression (%) determined using the Tumor Proportion Score (TPS), representing the percentage of viable tumor cells showing partial or complete membranous staining for PD-L1 with 1+ to 3+ intensity.30
Determination of microsatellite status at Stanford was done using DNA mismatch repair IHC and/or MSI PCR. For MSI PCR, standard multiplex PCR amplification of a panel of five microsatellites (BAT-25, BAT-26, MONO-27, NR-21, and NR-24) was performed with comparison of tumor and normal samples from the same patient by the Stanford Molecular Pathology laboratory (order code: TMSI (Tumor Microsatellite Instability), additional assay details available at: https://stanfordlab.com/content/stanfordlab/en/test-details/t/TMSI.html). If two of five microsatellite loci showed a difference in length between tumor and normal samples, the tumor was designated MSI high. If only one locus showed a difference in length, the tumor was considered MSI low. If no loci showed a difference in length, the tumor was designated microsatellite stable. IHC for DNA mismatch repair proteins was performed using standard protocols with monoclonal antisera reacting to MLH1 (clone G168-728, BD Biosciences), MSH2 (clone FE11, Calbiochem), MSH6 (clone 44, Cell Marque), and PMS2 (clone MRQ-28, Cell Marque). Normal expression was defined as nuclear staining within tumor cells, using the nuclei of stromal cells and infiltrating lymphocytes as positive internal controls.
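The MSI PCR interpretation rule can be written as a small decision function. This sketch assumes that "two of five" loci is read as "two or more", which is the usual convention for a five-marker panel but is not stated explicitly above; the function name and return labels are likewise illustrative.

```python
def msi_status(unstable_loci, panel_size=5):
    """Classify microsatellite status from the number of shifted loci.

    unstable_loci: loci showing a tumor-versus-normal length difference
    panel_size:    number of loci tested (five in the panel described)
    """
    if not 0 <= unstable_loci <= panel_size:
        raise ValueError("unstable_loci must be between 0 and panel_size")
    if unstable_loci >= 2:   # assumption: two or more loci counts as MSI high
        return "MSI-H"
    if unstable_loci == 1:
        return "MSI-L"
    return "MSS"
```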
The determination of TMB status on the Stanford samples was made using an NGS-based targeted gene panel, the Stanford Actionable Mutation Panel for Solid Tumors (order code: STAMPT, additional assay details available at: https://stanfordlab.com/content/stanfordlab/en/test-details/s/STAMPT.html) and/or the FoundationOne CDx panel (Foundation Medicine, Cambridge, Massachusetts, USA). At SMC, microsatellite and TMB status were determined by whole-exome sequencing, with MSI quantified using MSIsensor, as previously described.31 32
Statistical analysis
Receiver operating characteristic curves and the AUROC were used to evaluate the performance of the AI models in this study. For PFS and OS estimation, the Kaplan-Meier method was used, and the log-rank test was used to assess differences in PFS and OS between groups. HRs and 95% CIs were computed using the Cox proportional hazards model. Between-group differences in categorical variables were compared using Fisher’s exact test, and differences in means or medians for continuous variables were assessed using the non-parametric Mann-Whitney U test. All p values were two-tailed, with a significance threshold of p<0.05.
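As a reminder of how the survival estimates are obtained, the following is a minimal pure-Python sketch of the Kaplan-Meier product-limit estimator and the resulting median survival time. It is illustrative only; the function names and input format are assumptions, and an established statistical package would normally be used for the analyses reported here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times:  follow-up time for each patient
    events: 1 if the event (progression or death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def median_survival(curve):
    """First time at which estimated survival drops to 0.5 or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None
```

When the curve never falls to 0.5, the median is not reached and the function returns `None`.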
Results
Distribution of the inflamed immune phenotype in a large-scale ICI-treated cohort
We examined the prevalence of the H&E-based WSI-level IIP (hereafter simply referred to as the IIP) across multiple tumor types and explored its potential as a biomarker for guiding ICI treatment planning, using the retrospectively-collected FFPE pre-ICI treatment tumor WSI from our large, multicenter cohort of ICI-treated patients (N=1,806) representing over 27 different solid tumor types (online supplemental figure S1 and table 3). The clinical and histopathologic characteristics of these patients are summarized in table 1 and table 2, respectively. Most samples were collected from the primary tumor (62.0%), and the most prevalent tumor type was NSCLC (49.7%). 1,502 (83.2%) patients received ICI monotherapy (mono) and 304 (16.8%) received an ICI in combination with at least one other anti-neoplastic drug (ICI combo). Most patients received the ICI as part of their first (25.1%) or second line (43.5%) of treatment.
Of the 798 patients with available PD-L1 TPS results, the proportions with PD-L1 TPS<1% and PD-L1 TPS≥1% were 24.9% and 75.1%, respectively. Among the patients with both microsatellite and TMB status available (n=130), 67.7% had microsatellite stable/low microsatellite instability (MSS/MSI-L), TMB-low tumors, while 32.3% had MSI or TMB-high tumors (cut-off of 10 mutations per Mb).
By Lunit SCOPE IO analysis, 636 of the 1,806 patients (35.2%) were classified as IIP (online supplemental table 4). The IIP was highly enriched in patients with nasopharyngeal carcinoma (68.0%), melanoma (56.3%), renal cell carcinoma (52.9%), and NSCLC (33.7%) (figure 2). With regard to ICI treatment line, 39.4%, 34.2%, and 31.5% of patients receiving first-line, second-line, and ≥third-line treatment were classified as IIP. The IIP proportion was 40.7% in TPS≥1% patients and 21.6% in TPS<1% patients, respectively. In the primary tumor, lymph node, and distant metastatic samples, the IIP proportions were 35.1%, 41.4%, and 32.2%, respectively. While 33.3% of tumors that were MSI and/or TMB-high (≥10 mutations/Mb) were IIP, a substantial proportion (26.1%) of tumors that were both MSS/MSI-L and TMB-low were also IIP (online supplemental figure S2A–E).
Association between the inflamed immune phenotype and ICI-treatment outcomes across multiple tumor types when applying a universal threshold
In the overall cohort (N=1,806), the objective response rate (ORR) was significantly higher in IIP than in non-IIP patients (26.3% vs 15.8%, p<0.001, figure 3A) and there was a significant downward trend in the median IS with respect to RECIST response groups (p<0.001, online supplemental figure S3A). Interestingly, only the IS, and not the IES or IDS, was positively prognostic of response to ICI (online supplemental table 5 and figure S3B). In the subset of 798 patients with available PD-L1 TPS results, the AUROC for predicting the best overall ICI response was 0.60 for the IS and 0.68 for the TPS (online supplemental table 6). However, in the subset of these patients without NSCLC, the corresponding AUROC was 0.70 for the IS and 0.64 for the TPS (online supplemental table 6). Median PFS was significantly longer in IIP compared with non-IIP patients (5.3 vs 3.1 months, HR 0.68, 95% CI 0.61 to 0.76, p<0.001) (figure 3A). A similar improvement in the median OS after ICI treatment was also observed in IIP compared with non-IIP patients (25.3 vs 13.6 months, HR 0.66, 95% CI 0.57 to 0.75, p<0.001). The same trends were observed in the subset of 909 patients without NSCLC (online supplemental figure S4), suggesting that the results for the overall cohort were not solely driven by effects in the NSCLC subset.
On subgroup analysis, the IIP was prognostic of favorable PFS, irrespective of ICI regimen (monotherapy HR 0.68, 95% CI 0.60 to 0.77, p<0.001; combo therapy HR 0.68, 95% CI 0.51 to 0.91, p=0.008, figure 3B) and PD-L1 TPS status (positive HR 0.67, 95% CI 0.56 to 0.81, p<0.001; negative HR 0.66, 95% CI 0.44 to 0.98, p=0.038, online supplemental figure S5A).
Additionally, the IIP was consistently prognostic of favorable response to ICI across various subgroups, including: first-line and second-line ICI treatment; the timing of FFPE tissue collection relative to the start of ICI treatment (less than or ≥1 year before ICI treatment); MSS/MSI-L and TMB-low; histologic subtype; specimen type (surgical resection or biopsy); and tissue harvest site (primary, lymph node or distant metastasis). However, the IIP was not significantly associated with favorable response to ICIs in the MSI and TMB-high subgroups or in patients with squamous cell carcinomas (online supplemental figure S5B).
Application of individual (tumor type-specific) thresholds in defining the inflamed immune phenotype
Given the variability in the proportions of IIP patients within each tumor type when applying a universal threshold, we also performed a stratified analysis in which an individual threshold, defined as the IS which distinguished the top 20% of IS values from the remaining 80% within each tumor type, was used to define the IIP. Although these individual thresholds varied across the cancer types, the overall trends in improved clinical outcomes for IIP patients were similar to those observed when using the universal threshold. The improvements in ORR, PFS, and OS for the IIP compared with non-IIP patients remained significant after applying individual thresholds for each tumor type within both the overall cohort and the subset of patients with tumors other than NSCLC (online supplemental figures S6A and S7). The subgroup analyses showed similar trends, although the improvement in PFS for IIP versus non-IIP patients did not reach statistical significance within the subgroup of MSS/MSI-L and TMB-low patients (HR 0.43, 95% CI 0.17 to 1.07, p=0.062), or in the subgroup of patients receiving the ICI as first-line treatment (HR 0.74, 95% CI 0.53 to 1.04, p=0.078, online supplemental figure S6B).
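One way to derive such tumor type-specific thresholds is sketched below. The exact percentile convention is not specified in the text, so this example assumes a simple rank-based rule: within each tumor type, the threshold is the lowest IS among the top 20% of patients (with at least one patient always retained), and patients at or above it are labeled IIP. The function name and record format are hypothetical.

```python
import math
from collections import defaultdict


def per_type_thresholds(records, top_fraction=0.20):
    """Return {tumor_type: IS threshold separating the top fraction of scores}.

    records:      iterable of (tumor_type, inflamed_score) pairs
    top_fraction: share of patients per tumor type to label as IIP
    """
    by_type = defaultdict(list)
    for tumor_type, score in records:
        by_type[tumor_type].append(score)
    thresholds = {}
    for tumor_type, scores in by_type.items():
        ranked = sorted(scores, reverse=True)
        # Number of patients in the top group; keep at least one (assumption).
        k = max(1, math.floor(top_fraction * len(ranked)))
        thresholds[tumor_type] = ranked[k - 1]
    return thresholds
```

With very small tumor-type groups the resulting threshold is unstable, which is consistent with the sample size caveats raised for rarer cancer types.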
Association between the other immune phenotypes and ICI treatment outcomes
We further analyzed clinical outcomes of ICI treatment with respect to the IES and IDS, using two thresholds: a 20% threshold, consistent with the IS threshold for the IIP, and a 33.3% threshold for ternary classification of the WSI-level IP. In the overall cohort (N=1,806), WSI-level non-IEP patients exhibited a significant increase in median PFS compared with IEP patients (5.0 vs 3.3 months, HR 1.26 with a 20% threshold; 4.9 vs 3.0 months, HR 1.29 with a 33.3% threshold; both p<0.001) (online supplemental table 7). Additionally, WSI-level non-IEP patients showed higher ORR at both thresholds—25.5% (20% threshold) and 24.7% (33.3% threshold)—compared with IEP patients (both p<0.001). However, there were no significant differences in OS between WSI-level non-IEP and IEP patients. With regard to the IDS, median OS was significantly increased in WSI-level non-IDP compared with IDP patients, only at the 33.3% threshold (18.0 vs 14.0 months, HR 1.18, p=0.014); no significant differences were observed for the other outcomes.
Discussion
Here, we present an automated deep learning model, Lunit SCOPE IO, which classifies the immune phenotype of the TME on H&E-stained WSI using TIL distribution and density analysis. We demonstrate, for the first time, the ability of an AI model which runs on routine H&E-stained WSI of pre-ICI treatment FFPE tumor samples to predict ICI clinical outcomes across a broad range of solid tumor types, using a real-world multicenter cohort of 1,806 ICI-treated patients. We show that the IIP, as defined by either a universal threshold or tumor type-specific thresholds, is significantly prognostic of favorable clinical outcomes after ICI treatment. Furthermore, the IIP appears to correlate with significantly prolonged PFS, regardless of ICI treatment regimen or PD-L1 expression level, and is prognostic of favorable PFS in MSS/MSI-L, TMB-low patients, a clinically important subgroup in whom biomarkers are urgently needed.
Although the utility of the TMB and MSI as universal biomarkers of potential ICI response has primarily been attributed to their association with increased tumor neoantigenicity and heterogeneity of T-cell receptor clones,33–35 some criticism has since been directed toward their reliability as predictive biomarkers.36 Furthermore, even in the TMB-high or MSI setting, which would be expected to increase TIL recruitment to CAs, stromal interference modulated by the transforming growth factor (TGF)-beta or other immunosuppressive pathways may result in the spatial exclusion of TILs from these CAs (as reflected in the IEP).26 This is supported by the observation that, of the patients in our cohort for whom both MSI and TMB status were available, 57% of patients whose tumors were MSI and TMB-high were also of the IEP. In addition, immunoediting, whereby less immunogenic tumor cell clones are selected for, may result in a decreased antitumor immune response, even in tumors with a high mutational burden.37
Immune phenotyping based on TIL analysis avoids many of the limitations of the TMB and MSI. Our TIL analysis directly assesses the degree of lymphocytic infiltration of both CA and CS regions, allowing for the detection of tumors which might be TMB-high or MSI but unresponsive to ICI therapy due to immunoediting or the activation of immunosuppressive pathways, as these tumors would be classified as IEP (ie, non-IIP) tumors by TIL analysis. Furthermore, the H&E-based IIP appears to reflect an active antitumor immune response, based on the observed correlation between the IS and high IFNG pathway activation in a pan-cancer TCGA data set. H&E-based immune phenotyping uses pre-existing FFPE H&E slides collected during the course of routine clinical care, therefore requiring no additional tissue section procurement, and computational TIL assessment will enable more objective, time-efficient, and labor-efficient analysis at scale, avoiding interobserver variability and bias in interpretation.38–40
The promise of computational TIL analysis of routine H&E-stained images has been demonstrated by earlier studies showing a correlation between the spatial architecture of TILs and patient prognosis (1) in non-ICI-treated patients with cancer41–43 and (2) in small cohorts of patients with NSCLC treated with immunotherapy.44 For example, in a seminal study using over 5,000 H&E WSIs from 13 tumor types represented in the TCGA, Saltz and colleagues applied deep learning-based binary patch classification (wherein each tissue-containing patch in a WSI was classified as either positive or negative for TILs) to generate a patch-based WSI-level spatial TIL map, finding a significant association between various structural features derived from these TIL maps and OS across four TCGA tumor types (breast invasive adenocarcinoma, lung adenocarcinoma, prostate adenocarcinoma, and cutaneous melanoma).43 Our study builds on all of these prior contributions by presenting the first comprehensive pan-cancer analysis examining the association between AI-enabled TIL-based immune phenotypes, as assessed on routine H&E-stained slides representing over 27 different solid tumor types, and ICI treatment outcomes. Furthermore, we apply a universal cut-off across multiple tumor types to evaluate the feasibility of using the TIL-based immunophenotype as a tumor-type agnostic biomarker of ICI response, which has not previously been done.
We acknowledge that this study was subject to limitations. Given its focus on ICI-treated patients, it is likely that our real-world data set was enriched for IIP patients, as many of the eligibility requirements for ICI treatment are tied to the relative “immunogenicity” of the tumor. Therefore, it is possible that our current model might not generalize as well when applied to less immunogenic tumor types. Due to the retrospective nature of the study, we were unable to systematically control for heterogeneity in patient treatment regimens and other potential confounders. For instance, although previous reports have not clearly shown a significant difference in efficacy between different ICI agents,45–47 we could not entirely exclude the possibility of confounding of ICI response rates by the specific ICI regimen used. In addition, the limited number of patients in the current data set for whom ancillary biomarker status was available (MSI, PD-L1 and TMB) precluded more sophisticated analyses of the relationship between these biomarkers and the AI-based immune phenotype. However, the results from our analysis of the TCGA data set showed that the IS was predictive of a high IFNG expression signature regardless of TMB status, suggesting that the AI-based immune phenotype contributes additional predictive and prognostic value independent of the TMB. Also, as the current study was focused on the development and validation of a readily scalable H&E-based TIL analysis model, more granular analyses of individual TIL types and activation states based on immunohistochemical and gene expression profiling were not performed on the current data set. However, in prior analyses, we have found that H&E-based TIL distribution and density analysis indirectly reflects antitumor lymphocytic activity, as assessed by gene expression profile-based cytolytic activity scores and IFNG signatures.18 Lastly, our study was subject to sample size limitations for some cancer types. 
For example, we were unable to include mesenchymal tumors or other rarer tumor types, which will be important to include in future studies as larger ICI-treated cohorts become available. In addition, it should be noted that the direction of the association between the IIP and PFS for colorectal carcinoma (CRC) in our study was contrary to what might typically be expected. We believe that this might have been due to the small sample size; among the 28 patients with CRC in the study, only one was classified as IIP when we applied the universal IS cut-off. This patient belonged to the MSI-high group and had a best overall response (BOR) of stable disease with pembrolizumab monotherapy. Among the remaining 24 non-IIP patients with BOR data available, four showed a partial or better response. When the IS cut-off for IIP was set to 9.1% (resulting in six IIP patients), the difference in PFS between IIP and non-IIP patients with CRC was not statistically significant (p=0.305). In future studies, we plan to conduct more comprehensive analyses of the relationships between the IS, PD-L1, MSI status, and TMB in order to develop a more robust model for predicting BOR. We also believe that further investigations in individual tumor types with larger sample sizes are strongly warranted to optimize thresholds and to determine more definitively whether a universal or tumor type-specific threshold would be more appropriate.
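The effect of moving the IS cut-off, as in the CRC example above, can be sketched as follows. This is a hedged illustration only: the patient identifiers and IS values are invented, the 20.0 value stands in for the universal cut-off as a placeholder rather than a figure taken from this section, and treating a score at or above the cut-off as IIP is an assumption. Only the 9.1% cut-off comes from the text.

```python
from typing import Dict

# Placeholder for the universal cut-off (percent); an assumption for illustration.
PLACEHOLDER_UNIVERSAL_CUTOFF = 20.0
# Alternative CRC cut-off (percent) mentioned in the text.
ALTERNATIVE_CRC_CUTOFF = 9.1


def classify_iip(is_pct_by_patient: Dict[str, float], cutoff_pct: float) -> Dict[str, bool]:
    """Label each patient IIP (True) when the IS is at or above the cut-off.

    The 'at or above' rule is an assumption of this sketch.
    """
    return {pid: score >= cutoff_pct for pid, score in is_pct_by_patient.items()}


# Hypothetical IS values (percent); not patient data.
toy_scores = {"p1": 25.0, "p2": 12.0, "p3": 9.5, "p4": 3.0, "p5": 0.5}

n_iip_universal = sum(classify_iip(toy_scores, PLACEHOLDER_UNIVERSAL_CUTOFF).values())
n_iip_alternative = sum(classify_iip(toy_scores, ALTERNATIVE_CRC_CUTOFF).values())

print(f"IIP at placeholder universal cut-off: {n_iip_universal}")
print(f"IIP at 9.1% cut-off: {n_iip_alternative}")
```

Lowering the cut-off can only keep or enlarge the IIP group, which is why a tumor type with few high-IS cases may need a type-specific threshold before any subgroup comparison has adequate numbers.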
Nonetheless, in this first effort to examine the correlation between clinical outcomes (including ORR, PFS, and OS) in ICI-treated patients and the H&E TIL-based immune phenotype, as assessed in a large multi-institutional cohort encompassing multiple diverse tumor types, we observed convincing results suggesting that the IIP may represent a practical, clinically actionable biomarker of favorable clinical outcomes, particularly in patients with PD-L1 negative, MSS/MSI-L, and TMB-low tumors, in whom biomarkers are urgently needed. Furthermore, we demonstrate that the application of deep learning to H&E-based immune phenotyping can provide an automated, readily scalable tool for guiding the selection of patients for ICI treatment across a wide range of different solid tumors. Further optimization and validation of the IIP thresholds used in this study in prospective clinical trials represents an important next step, which, if successful, might one day enable more precise selection of patients for ICI therapy.
Ethics statements
Patient consent for publication
Ethics approval
All work was conducted according to appropriate guidelines for the protection of human subjects in biomedical research, and with institutional review board (IRB) approval at Stanford (IRB no. 58610), SMC (IRB no. 2018-06-103 and 2021-02-011), SNUBH (IRB no. B-2006-619-307 and B-2209-780-303), CNUH (IRB no. CNUH-2021-023) and Northwestern University (IRB no. STU00207117). Obtaining informed consent from individual patients was waived, considering the retrospective nature of this study.
Acknowledgments
The authors would like to thank Christopher King from the Stanford Department of Pathology and Curtis P. Langlotz and Johanna Kim from the Stanford Center for Artificial Intelligence in Medicine & Imaging for additional administrative support, and Sang Yong Song, Sangjoon Choi, Hyun Ae Jung, Jong Mu Sun, Jin Seok Ahn and Myung Ju Ahn from Samsung Medical Center; Yoo Duk Choi and Kyung Hwa Lee from Chonnam National University Medical School; Sooick Cho, Huijeong Kim, SengHui Seo, Jiwon Shin, Jisoo Shin, Seungje Lee, Eunji Baek, Seonwook Park, Mohammad Mostafavi, Jeongun Ryu, Minuk Ma, Donggeun Yoo, Sangheon Ahn and Kyunghyun Paeng from Lunit; and Koung Jin Suh, Se Hyun Kim, Yu Jung Kim and Jong Seok Lee of Seoul National University Bundang Hospital for contributing to the development of Lunit SCOPE IO.
Footnotes
Twitter @loki_phd
JS, Y-LC, TL, HK and YKC contributed equally.
Presented at Preliminary results were presented in part at the ASCO Annual Meeting 2022 (June 3 – June 7, 2022), Chicago, Illinois, USA.
Contributors Conception and design: JS, C-YO, Y-JB, and TSKM. Financial support: JS and C-YO. Administrative support: JS, Y-LC, TL, HK, YKC, and S-HL. Provision of study material or patients: JS, Y-LC, TL, HK, YKC, BWD, SB, MH, GAF, SP, S-HL, J-EH, J-HC, and LK. Collection and assembly of data: JS, Y-LC, TL, HK, YKC, SSh, YL, CHO, Seulki Kim, CO, and WJ. Data analysis and interpretation: HS, SP, CO, Sukjun Kim, GP, SSo, Seokhwi Kim, Y-JB, TSKM, SA, and C-YO. Manuscript writing: All authors. Final approval of manuscript: All authors. Accountable for all aspects of the work: All authors. Guarantor: C-YO.
Funding Funding for this study was provided by Lunit Inc., with additional infrastructural support from the Stanford Center for Artificial Intelligence in Medicine & Imaging (AIMI) and the Department of Pathology, Stanford University School of Medicine. JS additionally received support from the United States National Cancer Institute (NCI), National Institutes of Health (NIH) (R01 CA270437).
Competing interests JS and SB received institutional research funding from Lunit, Inc. HS, SP, SSh, YL, CHO, Seulki Kim, CO, Sukjun Kim, GP, SSo, WJ, SA and C-YO are employees of Lunit, Inc. Y-JB is a Consultant/Advisory Board member for Merck Sharp and Dohme (MSD), Merck Serono, Daiichi-Sankyo, Astellas, Alexo Oncology, Samyang Biopharm, Hanmi, Daewoong, and Amgen, and received institutional research grants for clinical trials from Genentech/Roche, MSD, Merck Serono, Daiichi-Sankyo, Astellas, and Amgen in the past 3 years. Other authors declare no potential conflicts of interest.
Provenance and peer review Not commissioned; externally peer reviewed.