A commentary on the original research article: ‘Radiomics analysis for predicting pembrolizumab response in patients with advanced rare cancers’. Of note, the predictor selection process, the cross-validation method, and the lack of final testing of the developed model on a separate data set may mask overfitting and lead to overestimated performance metrics.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See http://creativecommons.org/licenses/by-nc/4.0/.
In the original research article, Colen et al1 use classic statistics and machine learning methods in order to identify significant radiomics features and predict pembrolizumab response in advanced rare cancers. This novel approach raises relevant hypotheses and may eventually prove useful in the expansion of the therapeutic arsenal for some patients.
However, the encouraging results obtained are, until further clarification, to be interpreted with caution. Machine learning is the software-mediated attempt to produce accurate output from previously unseen data through mostly automatic adjustment of parameters based on previous experience.2 Effectively, the ‘learning’ step in this study occurs in a supervised fashion, that is, feeding the algorithm examples of labeled data (ie, the characteristics of each patient along with the label of ‘responder’ or ‘non-responder’). The learning algorithm then builds models to predict each patient’s label as accurately as possible.3
After initial training, model validation is carried out. This is usually done by splitting the data set into training and validation sets: two groups with no overlapping patients, each used exclusively in its respective phase. To increase the model’s generalization capability and decrease any sample selection bias, resampling methods are used. Bootstrapping is the process of resampling data with replacement, usually producing several new pairs of training and test data sets, which sometimes contain multiple instances of the same original cases while omitting others. Cross-validation comprises resampling without replacement, systematically producing k surrogate data sets, with each of the n original cases being part of a validation data set exactly once. This is called k-fold cross-validation. A special case is leave-one-out cross-validation (LOOCV), in which the training set consists of all cases but one, and the remaining case is used as a one-case validation set. The process repeats until every case has been used separately for validation. LOOCV is usually reserved for small data sets, in which the omission of a significant part of the training data (ie, 10%–20%) might hinder algorithm learning and thus performance.4
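The three resampling schemes described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the procedure used in the study; the function names `k_fold_indices`, `loocv_indices` and `bootstrap_indices` are hypothetical and chosen here for clarity.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index lists; each case is validated exactly once."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k disjoint, nearly equal folds
    for fold in folds:
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        yield train, sorted(fold)

def loocv_indices(n):
    """Leave-one-out: k-fold with k == n, so each validation set holds one case."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def bootstrap_indices(n, rounds, seed=0):
    """Yield (train, out_of_bag); train is drawn with replacement and may repeat cases."""
    rng = random.Random(seed)
    for _ in range(rounds):
        train = [rng.randrange(n) for _ in range(n)]
        drawn = set(train)
        out_of_bag = [i for i in range(n) if i not in drawn]
        yield train, out_of_bag

if __name__ == "__main__":
    for train, validation in k_fold_indices(n=10, k=5):
        print("k-fold    train:", sorted(train), "validation:", validation)
    for train, out_of_bag in bootstrap_indices(n=10, rounds=2):
        print("bootstrap train:", sorted(train), "out-of-bag:", out_of_bag)
```

In the bootstrap output, repeated indices in the training list correspond to the 'multiple instances of the same original cases', and the out-of-bag list holds the omitted cases that can serve as test data.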
Following the validation phase, the investigators may adjust the algorithm’s hyperparameters and try again until satisfactory performance is achieved. Since many changes are made to make the model more accurate for the validation data, overfitting may occur. This usually causes high performance metrics in the validation set, with poor prediction capability in a distinct data set. To detect such phenomena, testing on sequestered, previously unseen data is performed, differences in model metrics are analyzed, methodology problems are addressed, and the process is repeated.4
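A toy simulation may make this concrete. In the sketch below, all data are hypothetical: the labels are pure noise and each candidate 'model' is only a fixed table of random guesses, so no candidate has real predictive ability. Selecting the candidate that scores best on a small validation set stands in, under that assumption, for repeated tuning against the validation data.

```python
import random

rng = random.Random(42)

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical labels with no real signal: 20 validation and 200 test cases.
val_labels = [rng.randint(0, 1) for _ in range(20)]
test_labels = [rng.randint(0, 1) for _ in range(200)]

# 200 candidate "models", each a fixed table of random guesses.
candidates = []
for _ in range(200):
    val_pred = [rng.randint(0, 1) for _ in range(20)]
    test_pred = [rng.randint(0, 1) for _ in range(200)]
    candidates.append((val_pred, test_pred))

# "Tuning": keep whichever candidate scores best on the validation set.
best_val_pred, best_test_pred = max(
    candidates, key=lambda c: accuracy(c[0], val_labels)
)

val_acc = accuracy(best_val_pred, val_labels)
test_acc = accuracy(best_test_pred, test_labels)
print(f"validation accuracy of the selected model: {val_acc:.2f}")
print(f"test accuracy of the same model:           {test_acc:.2f}")
```

Because the selected candidate was chosen for its validation score, that score is optimistic by construction, while its accuracy on the sequestered test labels is expected to stay near chance. The gap between the two numbers is the signal that final testing is meant to reveal.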
In the study, Colen et al1 address the objective with an admittedly small, but multidimensional patient data set, using LOOCV to assess model accuracy and C-statistic. However, caveats to their study design should be noted. Regarding feature selection, in both tables 3 and 4, multiple instances of the same feature at different grayscale levels can be seen. While their relevance was reportedly identified by a sound method (L1 penalty), one cannot but wonder about their collinearity (which can be assessed by the variance inflation factor5), and whether data preprocessing or the use of other selection methods (wrapper or embedded methods) would change the outcomes. This must be carefully considered when small-n-large-p problems, known to lead to feature selection instability, are involved.6 7 In relation to cross-validation, while LOOCV maximizes training data, testing a single point at a time implies a large variance in the error estimate and a similarly high variance of CIs. The method underestimates error rates, especially in small samples with high dimensionality (ie, few patients with several features), which can explain the reported results.8 9 The lack of cross-validation on blocks of correlated data may introduce another bias in the study: the algorithm might have been able to distinguish between different primary sites and, by correlating tumor origin with outcome, always guess the correct label, inflating accuracy and the C-statistic. For example, penile carcinomas, small cell malignancies of non-pulmonary origin, and retroperitoneal spindle cell sarcoma had no responders in the sample, always yielding perfect predictions of no response in testing, which might not hold true in external validation.9
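Cross-validation on blocks of correlated data can be illustrated by holding out one whole group per split, here taking the primary site as the block. The sketch below is an assumption-laden illustration and not the study's method: the function `leave_one_group_out` and the site labels are hypothetical.

```python
from collections import defaultdict

def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, validation_idx), one whole group per split."""
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    for held_out, validation in by_group.items():
        train = [i for i in range(len(groups)) if groups[i] != held_out]
        yield held_out, train, validation

# Hypothetical primary-site label for each patient (illustrative, not study data).
sites = ["penile", "penile", "adrenal", "adrenal", "adrenal", "sarcoma"]
splits = list(leave_one_group_out(sites))
for site, train, validation in splits:
    print(f"held-out site: {site:8s} train: {train} validation: {validation}")
```

With this scheme, no patient from the held-out site appears in training, so a model cannot score well merely by recognizing a site whose outcomes it has already seen.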
Additionally, the lack of testing on a separate set after cross-validation undermines the credibility of the outstanding metrics achieved, at least until independent verification.8
To address the outlined issues, the following procedures might be applied: assessment of feature collinearity and the use of different feature selection methods might help with the small-n-large-p problem; and the separation of lesions (in case of metastatic sites) into distinct data points, as well as data augmentation methods (such as the synthetic minority over-sampling technique10), may help enlarge the data set. Once a larger data set is available, other resampling strategies (k-fold cross-validation or bootstrapping) may be employed, and more data (with no synthetic points) can be set aside for final testing and overfitting assessment. The analysis of one lesion may not be a surrogate marker for cancer response to immunotherapy, but it may be an interesting hypothesis generator. Also, other predictive models for treatment response, based on voting on the probability of response for each tumor in a patient, may be developed from the original algorithms.
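One possible form of such a voting model is sketched below, under stated assumptions: each lesion already has a predicted probability of response from some lesion-level model, the patient-level score is the plain mean of these probabilities, and a patient is called a responder when the mean reaches a threshold of 0.5. The function `patient_level_votes`, the threshold, and the example values are all hypothetical.

```python
from collections import defaultdict

def patient_level_votes(lesion_probabilities, threshold=0.5):
    """Average per-lesion response probabilities per patient, then threshold the mean."""
    per_patient = defaultdict(list)
    for patient_id, probability in lesion_probabilities:
        per_patient[patient_id].append(probability)
    result = {}
    for patient_id, probabilities in per_patient.items():
        mean_probability = sum(probabilities) / len(probabilities)
        result[patient_id] = (mean_probability, mean_probability >= threshold)
    return result

# Hypothetical (patient, lesion-level probability of response) pairs.
lesions = [("A", 0.9), ("A", 0.7), ("B", 0.2), ("B", 0.6), ("B", 0.1)]
votes = patient_level_votes(lesions)
for patient_id, (mean_probability, responder) in votes.items():
    print(f"patient {patient_id}: mean probability {mean_probability:.2f}, "
          f"predicted responder: {responder}")
```

Averaging probabilities is only one choice. A majority vote over per-lesion labels, or a weighting by lesion size, would be equally plausible starting points and would need the same independent testing.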
Twitter @MateusTrinconi, @RMaffeiLoureiro, @GilbertodeCas13
Contributors All authors contributed equally to letter conception, design, data analysis and interpretation, manuscript writing, and final approval.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests GdC has received personal fees from AstraZeneca, Bayer, Bristol-Myers Squibb, Boehringer Ingelheim, Janssen, Lilly, Merck Serono, Merck Sharp and Dohme, Novartis, Pfizer, Roche, Teva, and Yuhan, none related to this publication.
Provenance and peer review Commissioned; internally peer reviewed.