Non-invasive measurement of PD-L1 status and prediction of immunotherapy response using deep learning of PET/CT images

Background Currently, only a fraction of patients with non-small cell lung cancer (NSCLC) treated with immune checkpoint inhibitors (ICIs) experience a durable clinical benefit (DCB). According to NCCN guidelines, Programmed death-ligand 1 (PD-L1) expression status determined by immunohistochemistry (IHC) of biopsies is the only clinically approved companion biomarker to trigger the use of ICI therapy. Based on prior work showing a relationship between quantitative imaging and gene expression, we hypothesize that quantitative imaging (radiomics) can provide an alternative surrogate for PD-L1 expression status in clinical decision support. Methods 18F-FDG-PET/CT images and clinical data were curated from 697 patients with NSCLC from three institutions and these were analyzed using a small-residual-convolutional-network (SResCNN) to develop a deeply learned score (DLS) to predict the PD-L1 expression status. This developed model was further used to predict DCB, progression-free survival (PFS), and overall survival (OS) in two retrospective and one prospective test cohorts of ICI-treated patients with advanced stage NSCLC. Results The PD-L1 DLS significantly discriminated between PD-L1 positive and negative patients (area under receiver operating characteristics curve ≥0.82 in the training, validation, and two external test cohorts). Importantly, the DLS was indistinguishable from IHC-derived PD-L1 status in predicting PFS and OS, suggesting the utility of DLS as a surrogate for IHC. A score generated by combining the DLS with clinical characteristics was able to accurately (C-indexes of 0.70–0.87) predict DCB, PFS, and OS in retrospective training, prospective testing and external validation cohorts. Conclusion Hence, we propose DLS as a surrogate or substitute for IHC-determined PD-L1 measurement to guide individual pretherapy decisions pending in larger prospective trials.


Supplemental S1: Inclusion criteria for SPH and MCC cohorts
For PD-L1 prediction, patients with informed consent were accrued from the Shanghai Pulmonary Hospital (SPH), Shanghai, China, between January 2017 and June 2018 with following inclusion criteria were included: 1) histologically confirmed primary NSCLC; 2) pathological examination of PD-L1 status before any treatment; 3) PET/CT scans obtained within one month before biopsy (Bx) for immunohistochemistry (IHC) and no treatments was performed during this interval; 5) baseline clinical characteristics (including age, sex, stage, histology, and smoking history) and gene (EGFR, ALK and ROS1) mutation status were available. Based on these inclusion criteria, 400 patients were identified and subsequently assigned to a training cohort (SPHtraining,N = 284) and an independent test cohort N = 116). Using the same inclusion criteria, 85 NSCLC patients were accrued from H. Lee Moffitt Cancer Center & Research Institute (MCC), Tampa, FL and were used as external independent test cohort for prediction of PD-L1 expression (MCC-PD-L1 cohort).
For the distinct cohorts to predict patient response and outcomes, 128 patients were identified with histologically confirmed advanced stage (stage IIIB and IV) NSCLC who were treated with immunotherapy (anti-PD-L1 or anti-PD-1) between June 2011 and December 2017 at MCC using the following criteria: 1) PET/CT images were available during the interval (less than 6 months) of the last treatment (or diagnosis) and the start of immunotherapy; 2) no other treatment were performed during the interval; 3) follow-up time was greater than 6 months; and 4) no immune-related severe adverse events (Grade according to Common Terminology Criteria for Adverse Events (CTCAE)>=3 S1 ) were observed or reported during treatment; 5) baseline clinical characteristics (including age, sex, stage, histology, Eastern Cooperative Oncology Group (ECOG) scale, brain metastasis status, and smoking history) and gene (EGFR, ALK and ROS1) mutation status were available. Using the same inclusion criteria, a prospective validation cohort was curated of 49 NSCLC patients who were treated with immunotherapy between January 2018 to June 2019.
For the external VA cohort to validate the DLS and the prognostic models, 35 patients with available PET/CT images from 72 patients with advanced stage NSCLC treated with immunotherapy (anti-PD-L1 or anti-PD-1) between July 2015 and February 2019 were identified according to the above criteria.
The progression of the distinct ICI-treated cohorts used to investigate the association of the DLS and clinical characteristics with the clinical outcome including DCB (PFS>6month 20 ), PFS, and OS, was defined using clinical or RECIST based progression of cancer and the data were right-censored at 6 years and 1.5 years for the retrospective and prospective cohorts, respectively. For OS, an event was defined as death and the data were right censored at 6 years and 1.5 years for the retrospective and prospective cohorts, respectively.
Because we don't have as much follow-up time for the prospective cohort, these two cohorts have different censoring values. The index date for both OS and PFS was the date of initiation of immunotherapy.

Supplemental S2: PD-L1 expression by immunohistochemistry (IHC)
To ensure a reliable PD-L1 IHC score, all patients in this study underwent surgical resection or biopsy of the primary tumor using standardized protocol. To reduce the sampling artifact, the portion of the tumor specimen was carefully examined, and the portion with more malignant cells, less differentiated cells, and less hemorrhage was subjected to histopathological confirmation within 2 weeks after the 18 F-FDG PET/CT scan.
Furthermore, routine IHC analysis was performed to determine PD-L1 expression in all the lesions of SPH cohort and MCC-PD-L1 cohort using the same antibody. For the SPH cohort, the platform of Dako Link 48 and the antibody of Dako 22C3 (Agilent Cat# GE00621-2, RRID:AB_2833074) were used for PD-L1 staining to quantify the presence of PD-L1. For the MCC-PD-L1 cohort, the PD-L1 22C3 mouse monoclonal antibody (Agilent Cat# GE00621-2, RRID:AB_2833074) purchased from Dako, was performed utilizing the Dako EnVision FLEX visualization system on the Dako Autostainer Link 48. To compensate for reader bias, all the staining results were reviewed and analyzed by 2 experienced pathologists who were blinded to each other's scores and unaware of the patients' clinical information. When there was discrepancy, the two pathologists would have a mutual discussion to reach a consensus.

Details of the small-residual-convolutional-network
The SResCNN is similar to the Resnet50 but with fewer layers and smaller number of residual blocks, and the architecture was shown in supplemental Figure S1. The architecture was comprised with one convblock (including a 3 × 3 convolutional layer followed by a batch normalization layer and a rectified linear unit (ReLU) activation layer), 8 residual blocks (Resblock), and one fully connected layer. Finally, a softmax activation layer was connected to the last fully connected layer, which was used to yield the prediction probabilities of nodule BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance Supplemental material placed on this supplemental material which has been supplied by the author(s) candidates. To prevent overfitting, one dropout layer with probability of 0.5 was added to the fully connected layers. Additionally, the model was optimized using the binary cross entropy loss function.

Preparation of the input images
After registration using ITK-SNAP, a square or an irregular box, which was close to the boundary of the tumor, was delineated manually in ITK software firstly by experienced nuclear medicine radiologist, and then the input regions of interest (ROIs) could be generated automatically after, dilation, resize and fusion to keep the entire tumor and its peripheral region were included (Supplemental Figure S3). For each slice of the tumor, the area of the smallest square mask (SSM) including the delineated region was regarded as the area of the tumor in this slice (Supplemental Figure S3B). Because of the big difference of the central slice and peripheral slices, only the slices with the area larger than the 30% of the maximum tumor cross-sectional area of this patient were regarded as valid input images and were used as the input of the deep learning model. The area here means the area of the smallest square including the delineated region (Supplemental Figure S3C). 10,650 ROIs were generated for training. In order to keep the training data had more balanced label, cubic spline interpolation of the adjacent two slices from the same patient with positive PD-L1 expression was used to generate new augmented slices on the condition that each patient was used with the same times, and finally 14,011 ROIs (6,722 PD-L1 positive and 7,289 PD-L1 negative) were used as the training dataset.
All ROIs were resized to the same size (64×64) using cubic spline interpolation and were standardized by zscore normalization before input to the model. As such, the input ROI was subtracted from the mean intensity value and divided by the standard deviation of the image intensity, before inputting to the deep learning model, to reduce the offset effect due to different equipment and different reconstruction parameters [S2] .

Training of the SResCNN network
The training of the model focuses on the optimization of the parameters of the SResCNN model to build a relationship between PET/CT images and PD-L1 expression status (positive: 1 or negative: 0). We employed binary cross entropy as the loss function and the Adam optimizer with an initial learning rate = 0.0001, beta_1=0.9, beta_2=0.999. The learning rate was reduced by a factor of 5 if no improvement of the loss of the validation dataset was seen for a 'patience' number (n=50) [S3] of epochs. The batch size was set to 64. During the training, augmentation including width/height-shift, horizontal/vertical-flip, rotation and zoom were used to reduce overfitting. The training was stopped after waiting an additional 50 epochs since the validation loss stopped to degrade. The learning rate and batch size was determined with five-fold-cross validation under the patient level, and the combination that yielded the best average accuracy on the validation folds was chosen.
BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance Supplemental material placed on this supplemental material which has been supplied by the author(s)

Application of the SResCNN network
The generated ROI-based hyper-image was input into the SResCNN model after z-score normalization, and a deeply learned score (DLS) representing the PD-L1 positivity could be yielded after a sequential activation of convolution and pooling layers. To develop a robust prediction, all valid slices of each patient were fed into the SResCNN model and the average DLSs with equal weight for each slice was regarded as the final PD-L1 positive probability of the tumor.

Supplemental S4: Correlation investigation between necrosis and DLS
First, necrosis in PET images was defined as an area of hypometabolism within the hypermetabolic tumor (i.e., classically a rim of hypermetabolism with a hypometabolic center) S4 . Hypometabolism (necrosis) is defined as the region with SUV less than 42% of the maximum SUV 39 . Then, the ratio of necrosis to global lesion volume of the PET images (termed the necrosis-to-global volume ratio, NVR) was calculated and expressed as percentage to quantify the necrosis.
Thus, the relation between necrosis and the DLS could be investigated by calculating the Spearman's correlation, and linear regression between NVR and DLS, and only the cases with necrosis regions were included for this experiment.

Supplemental S5: Radiomic quality score (RQS)
Radiomics is a rapidly maturing field in machine learning. To rigorously assess the quality of study design, Lambin et al. developed a 36-point "Radiomics Quality Score" (RQS) metric that evaluates 16 different key components 31 . The full list of criteria is described in Supplemental Table S2, which shows that the current study had a RQS of 22. To put this in perspective, a recent meta-analysis S5 analyzed 77 radiomics publications BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance Supplemental material placed on this supplemental material which has been supplied by the author (s) and documented that the mean ± S.D. RQS across all studies was 9.4 ± 5.6, indicating that the current study is in the upper 5 percentage of radiomics study designs.
[S5]. Park JE, Kim D, Kim HS et al. Quality of science and reporting of radiomics in oncologic studies: room for improvement according to radiomics quality score and TRIPOD statement. Eur Radiol 2019.
BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance Supplemental material placed on this supplemental material which has been supplied by the author(s) Figure S2. Illustration of the ResCNN model. This model is composed of convolutional layers with kernel size 3x3, batch normalization, pooling, and drop out layers. Note. /2 means the convolution layer of the Convblock with stride of 2. Figure S3. Illustration of the generation of the input hyper-image. A square or an irregular box, which was close to the boundary of the tumor, was delineated manually in ITK software firstly, and then the hyper-image was generated after dilation, resize and fusion automatically.  Figure S4. Two different digital simulated phantoms were constructed as A and B. To show the importance of the fusion images, A and B were kept to have the same heterogeneity distribution. Entropy and Inverse Difference calculated from 3D co-occurrence matrix were used to measure the heterogeneity and the homogeneity of the phantoms. From A1 and B1 (simulated PET images), A2 and B2 (simulated CT images), the two phantoms have the same heterogeneity and homogeneity distribution. But from A3 and B3, the two phantoms could be identified based on different heterogeneity and homogeneity, which means the fusion images could reflect the relative different positional relationship of the heterogeneity. Using this image to construct a 3-chanel hyper-image together with PET and CT images was more convenient for the training of the ResCNN model in one hand. In the other hand, incorporating the prior knowledge into the model can decrease the size of the deep learning model and limit the risk of overfitting.       Figure S8. Nomograms for multivariable regression. (A) DCB nomogram obtained with multivariable logistic regression for DCB prediction (e.g., for a ADC patient with DLS of 0.6 and ECOG 1, his total points are 78 (DLS 0.6 corresponding to point 0, ECOG 1 corresponding to point 50, ADC corresponding to point 28, 0+50+28=78), which corresponds to a DCB probability of 0.40); (B) PFS nomogram obtained with multivariable Cox proportional hazards regression for PFS prediction (e.g., for a ADC patient with DLS of 0.6 and ECOG 1, his total points are 50 (DLS 0.6 corresponding to point 0, ECOG 1 corresponding to point 50, ADC corresponding to point 0, 0+50+0=50), which corresponds to a 6-month PFS probability of 0.85, 1-year PFS probability of 0.68, and 2-year PFS probability of 0.59). (C) OS nomograms obtained with multivariable Cox proportional hazards regression for OS prediction (e.g., for a SCC patient with DLS of 0.6 and ECOG 1, his total points are 74 (DLS 0.6 corresponding to point 0, ECOG 1 corresponding to point 50, SCC corresponding to point 24, 0+50+24=74), which corresponds to a 6-month OS probability of 0.87, 1-year OS probability of 0.63, and 2-year OS probability of 0.37).   Calibration statistics -report calibration statistics (for example, Calibration-in-the-large/slope, calibration plots) and their statistical significance (for example, P-values, confidence intervals). One can also apply resampling method (for example, bootstrapping, cross-validation) + 1 (if a calibration statistic and its statistical significance are reported) + 1 (if a resampling method technique is also applied) 2 0 Prospective study registered in a trial database -provides the highest level of evidence supporting the clinical validity and usefulness of the radiomics biomarker + 7 (for prospective validation of a radiomics signature in an appropriate trial) 7 7 Validation -the validation is performed without retraining and without adaptation of the cut-off value, provides crucial information with regard to credible clinical performance -5 (if validation is missing) + 2 (if validation is based on a dataset from the same institute) + 3 (if validation is based on a dataset from another institute) + 4 (if validation is based on two datasets from two distinct institutes) + 4 (if the study validates a previously published signature) + 5 (if validation is based on three or more datasets from distinct institutes)  Note., For Sex: male was assigned 1 and female was assigned 2; for histology, ADC was assigned 1 and SCC was assigned 2. Note., For Sex: male was assigned 1 and female was assigned 2; for histology, ADC was assigned 1 and SCC was assigned 2. Note., For Sex: male was assigned 1 and female was assigned 2; for histology, ADC was assigned 1 and SCC was assigned 2.
BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance Supplemental material placed on this supplemental material which has been supplied by the author(s)