Article Text
Abstract
Immuno-oncology holds promise for transforming patient care having achieved durable clinical response rates across a variety of advanced and metastatic cancers. Despite these achievements, only a minority of patients respond to immunotherapy, underscoring the importance of elucidating molecular mechanisms responsible for response and resistance to inform the development and selection of treatments. Breakthroughs in molecular sequencing technologies have led to the generation of an immense amount of genomic and transcriptomic sequencing data that can be mined to uncover complex tumor-immune interactions using computational tools. In this review, we discuss existing and emerging computational methods that contextualize the composition and functional state of the tumor microenvironment, infer the reactivity and clonal dynamics from reconstructed immune cell receptor repertoires, and predict the antigenic landscape for immune cell recognition. We further describe the advantage of multi-omics analyses for capturing multidimensional relationships and artificial intelligence techniques for integrating omics data with histopathological and radiological images to encapsulate patterns of treatment response and tumor-immune biology. Finally, we discuss key challenges impeding their widespread use and clinical application and conclude with future perspectives. We are hopeful that this review will both serve as a guide for prospective researchers seeking to use existing tools for scientific discoveries and inspire the optimization or development of novel tools to enhance precision, ultimately expediting advancements in immunotherapy that improve patient survival and quality of life.
- Computational Biology
- Immunotherapy
- Tumor Microenvironment
- Biostatistics
- Gene Expression Profiling
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See http://creativecommons.org/licenses/by-nc/4.0/.
Statistics from Altmetric.com
Background
The concept of immuno-oncology, or modulating the immune system to treat cancer, can be traced back to over a century ago when Wilhelm Busch and Friedrich Fehleisen independently observed tumor regression in patients with cancer after a superficial skin infection and subsequently intentionally inoculated other patients with cancer.1 Leveraging their work, William Coley treated patients with sarcoma with extracts of heat-inactivated Streptococcus pyogenes and Serratia marcescens and observed favorable responses.2 3 Lack of reproducibility and harmful side effects hindered further use of Coley’s strategy but provided proof of concept for the use of cancer immunotherapies. Since then, numerous basic science investigations, along with novel experimental systems and technologies, have deepened our understanding of the relationship between the immune system and cancer cells and ascertained the immune system’s ability to eliminate malignant cells. Over the last 30 years, the foundational discoveries and subsequent clinical successes of novel immunotherapeutic modalities have revolutionized the field of oncology; yet, challenges remain to maximize their potential.
Discoveries in the late 1990s on the roles of cytotoxic T lymphocyte-associated antigen 4 (CTLA-4) and programmed cell death 1 (PD-1) as negative regulators of T cell activation gave rise to a class of immunotherapies that target immune inhibitory proteins, known as immune checkpoint inhibitors (ICIs).3 Durable responses have been achieved by ICIs across several advanced and metastatic cancers,4–10 leading to Food and Drug Administration (FDA) approval of a number of ICIs spanning many clinical indications. However, many patients experience primary or acquired resistance to ICIs. Efforts to identify predictive biomarkers of response and elucidate mechanisms of intrinsic and acquired resistance are met with considerable challenges due to the evolving and heterogeneous nature of the tumor-immune landscape. Emerging evidence has uncovered several genetic aberrations and pathways in tumor cells that interfere with the immune contexture, which not only differs greatly in the composition of immune cells both within and between cancer types but also evolves through processes such as immunoediting.11
Additional immuno-oncology approaches, including adoptive cell transfer (ACT) therapies and cancer vaccines, also hold promise and have resulted in sustained clinical responses in patient subsets.12–14 ACT therapies directly exert anticancer effects through the expansion and infusion of tumor-reactive lymphocytes, while cancer vaccines promote an immune response by exposure of tumor antigens. Early approaches for ACT therapies were met with severe toxicity and significant off-target effects in patients with colorectal cancer treated with T cells targeting carcinoembryonic antigen and MAGE-A3 in melanoma.15 16 Discovering biomarkers of response to immunotherapy and functionally characterizing the tumor microenvironment (TME) and antigens are essential for efficacious therapies with minimal toxicity and can be informed by next-generation sequencing (NGS) and computational methodologies.
Alongside breakthroughs in immunology and clinical response to treatments, advances in sequencing technologies have led to drastic decreases in the cost of sequencing, and subsequently, a rapid accumulation in available genetic sequencing data. Mining, processing, and interpreting large amounts of sequencing data require the application of computational tools. Several tools and pipelines have been developed for mining genomic data, characterizing mutational patterns among patients with cancer, and extracting immunologically relevant information, including immune infiltrate inference, neoantigen prediction, and more. Thus, computational genomics holds promise for providing mechanistic insights that will inform clinical decision-making and the development of novel immunotherapies (figure 1). In this review, we will discuss computational tools commonly used in the field of immuno-oncology, including methods for characterizing the TME, reconstructing immune cell receptor repertoire, predicting neoantigens, and multi-omic analysis.
Characterization of the TME
Contrary to traditional views that consider cancer as a disease primarily caused by genomic alterations within cancerous cells, there is an increasing acceptance of the role of the TME in tumor progression and response to treatment. The TME consists of an amalgam of immunocytes, fibroblasts, stromal cells, endothelial cells, pericytes, and other cell types that surround cancer cells within a vascularized extracellular matrix. Reciprocal interactions between tumor and non-cancerous cells influence tumor growth and suppression by modifying the surrounding vasculature, extracellular matrix, and immune response.17 Further, some cells within the TME can exert pro-tumorigenic or anti-tumorigenic functions, have distinct transcriptional patterns in different tumor types, and associate both with poor and favorable outcomes.17 Due to the dynamic, pleiotropic, and in some cases, opposing functions of cells within the TME, mediating their functionality to inhibit tumor growth requires a detailed understanding of the TME cellular composition and activities.
Differential gene expression analysis
Capturing and characterizing differential expression patterns identified from bulk and single-cell RNA sequencing (RNA-seq) data is critical for identifying cellular and molecular activity within the TME. PD-L1, for example, is frequently expressed on the surface of tumor and immune cells and attenuates T lymphocyte signaling when bound to PD-1. The overexpression of PD-L1 has thus been used as a marker for immune cell infiltration and has been associated with response to anti-PD-1/PD-L1 therapies in some settings, leading to FDA-approved companion diagnostic tests for anti-PD-1/PD-L1 therapies in select cancers.18 19 However, efficacy of ICIs in patients with PD-L1 negative tumors prompt further investigation into immunomodulatory mechanisms and robust predictive markers of response to immunotherapy.20 To identify genes that are differentially expressed among two conditions (eg, clinical response groups, cell types, or tissue types), several computational and statistical methods can be applied (table 1), with DESeq2,21 edgeR,22 and Wilcoxon rank-sum test23 24 being among the most widely used strategies. Parametric methods, such as DESeq and edgeR, are typically favored when the per-condition sample size is less than 8, as the power that can be achieved with these methods is likely to outweigh potential false-positive calls resulting from the presence of outliers. On the other hand, non-parametric tests, such as the Wilcoxon rank-sum test, is often recommended for large samples sizes (>8 samples per condition) due to its ability to achieve statistical power and control the false discovery rate (FDR).24 In cases where large numbers of genes are differentially expressed, differentially expressed genes are involved in multiple pathways, and differences in expression are nuanced by biological variability between samples rather than true differences across conditions, biological interpretation of differentially expressed genes can be ambiguous. Gene set enrichment analysis (GSEA) involves aggregating the differential expression of genes to a set of genes that share biological functions, chromosomal location, or regulation. GSEA25 and variations of GSEA, such as single-sample GSEA (ssGSEA),26 gene set variation analysis (GSVA),27 and seqGSEA28 (table 1), are widely used for identifying enrichment of a defined set of genes between the two groups. Exploiting global expression patterns within the TME may also involve clustering methodologies, including consensus clustering and Bayesian non-negative matrix factorization, where gene expression values or GSE scores are clustered. Thorsson et al,29 for example, clustered ssGSEA scores from 10,000 tumors across 33 cancer types to identify six immune subtypes (wound healing, IFN-γ dominant, inflammatory, lymphocyte depleted, immunologically quiet, and TGF-β dominant), and found significant differences in overall survival among some subtypes. Several other studies have demonstrated a role of specific genes and/or gene sets in mediating antitumor immune response,30–34 highlighting the importance of differential expression tools and analyses in elucidating tumor or immune cell expression programs that associate with distinct clinical outcomes.
Immune cell inference
Rather than consisting exclusively of tumor cells, bulk tumor samples comprise immune and stromal cells intermixed with tumor cells. Consequently, gene expression from bulk RNA-seq resembles the average expression across a mixture of cells. To identify discrete cell populations within a given tumor and adjust for cell type composition in relevant analyses, such as differential gene expression, computational approaches use cell type-specific gene signatures to infer cell type proportions within the tumor, primarily by either deconvolution-based or gene marker-based methods.35 Deconvolution algorithms model the expression of a given gene as the linear combination of gene expression levels across a set of cell types, where each cell type is represented by a reference expression profile weighted by the relative fraction of the cell types within the mixture.36 Deconvolution-based methods that use bulk RNA-seq as the reference expression profile include CIBERSORT,37 EPIC,38 TIMER,39 and quanTIseq40 (table 1), among others as previously reviewed.36 41 Some methods also allow users to input their own reference (ie, EPIC), and as an extension of CIBERSORT, CIBERSORTx was developed to use single-cell RNA-seq (scRNA-seq) data to deconvolve bulk RNA-seq data.42 Other deconvolution-based methods that use scRNA-seq data include DWLS,43 MuSiC,44 45 and BayesPrism46 (table 1). Marker-based algorithms, including ESTIMATE,47 xCell,48 TIminer,49 and MCP-counter,50 use cell type-specific gene sets to perform gene set enrichment or to compute an abundance score. An R package called ImmuneDeconv51 unifies these approaches by integrating six algorithms (TIMER, EPIC, CIBERSORT, xCell, MCP-counter, and quanTIseq). One of the most significant limitations with reference-based methods is the accuracy of the reference profile. Reference-free methods, such as LinSeed,52 do not require an external reference or predefined cell type markers and thus can be useful for an unbiased inference of cell types, especially for cell types with little or no reference information available. However, these methods can be sensitive to noise and variation within a given RNA-seq dataset given the lack of prior knowledge. Interpreting performance between cell type quantification methods can be challenging due to differences in reference profiles required and scoring methods. Further, a limitation of these methodologies is that cell types present at low proportions and poorly characterized cell types may be difficult to distinguish. These limitations may be improved with growing scRNA-seq data that can serve as a reference (see below), toward maximizing the inference potentiation of bulk RNA-seq data from clinical cohorts for TME investigations.
scRNA-seq analysis of the TME
In contrast to bulk tumor samples, scRNA-seq enables transcriptomic analysis of individual cells within the TME and provides opportunities to identify and compare expression of individual cell types within the TME, including rare and unknown immune cell types,53–55 along with mechanisms underlying the functional heterogeneity of immune cells.56 To identify cells with similar expression profiles, possibly representing distinct cell types or intermediate cell states, clustering is typically performed and resultant clusters are annotated manually or with tools, such as MetaNeighbor,57 CellAssign,58 and Garnett59 (table 2). Furthermore, Spectra,60 Vision,61 CytoSig,62 and SCENIC+63 can be employed to infer gene expression programs associated with biological processes within cells, while differential expression analysis of cellular subpopulations can aid in discriminating between clusters of cells or identifying cell-specific markers. With these tools, extensive taxonomies of cells within the TME have been built, providing atlases for healthy and diseased tissues. These atlases showcase intratumoral and intertumoral heterogeneity64 65 along with a heterogeneous microenvironmental composition where distinct transcriptional patterns in tumor-associated macrophages (TAMs) and neutrophils (TANs), for example, have been revealed across different tumor types.65–69
The composition and functional state of cells within the TME continuously evolves due to cellular cues and oncogenic signaling from progressing tumors, creating another layer of complexity in disease modeling. Reshaping of the TME, or more broadly, cellular trajectories within the TME, can be inferred through pseudotime analyses using tools, such as Monocle,70 TSCAN,71 PAGA72 and other previously reviewed methods73 that track transcriptional changes between neighboring cells (table 2). Complex cell-cell interactions that may be directing cellular states within the TME can also be inferred through CellPhoneDB,74 CellChat,75 NicheNet,76 or MultiNicheNet.77 Further, integrating scRNA-seq data with spatial transcriptomics, where tissue architecture is preserved, can help provide a comprehensive understanding of cellular communication in situ. Spatially expressed genes can be identified using imaging tools such as MERFISH,78 seqFISH+,79 and STARmap80 (table 2). A number of studies have used spatial transcriptomics to map cell niches with clinical outcomes or therapy response, determine tumor-immune interactions and compartmentalization, spatial regulation and distribution of biomarkers (eg, T cell exhaustion markers and cancer progression), discover novel cell types, and examine tumor progression and evolution, such as cancer-associated fibroblasts, TAMs, and TANs with distinct antitumor characteristics within cellular neighborhoods.81–83 These studies have revealed highly diverse cellular states that are both spatially and temporally regulated, and associate with both better and worse outcomes under certain contexts,67 underscoring the importance of expanding our knowledge on tumor and host environment interactions.
With advances in computational methods for interrogating RNA-seq data, expression profiling has played valuable roles in unveiling immunological phenotypes, quantifying tumor immune infiltrate, and inferring interactions among cells within the TME. Several challenges remain to advance our understanding of the components and mechanisms that shape the TME, including limiting biases in RNA-seq analysis (eg, batch, sample, and technical biases),84 which may affect the generalizability of expression markers and signatures. Creating integrated or automated workflows (eg, for bulk RNA-seq, RIMA85; for scRNA-seq, Scater,86 SEURAT87 and ScanPy54) will increase usability and expand accessibility to users with variable bioinformatics knowledge. As the cost of scRNA-seq decreases, comprehensive atlases that include cellular composition, signaling pathways, and functional states will complement the overall differences in gene expression profiling from new and existing bulk RNA-seq datasets. Understanding cellular activity, heterogeneity, and plasticity within unique and prospective contexts will be critical for leveraging anti-tumorigenic immune cell subsets while inhibiting or depleting immune subsets with pro-tumorigenic roles through targeting pathways or the cells within the TME.
Immune repertoire reconstruction
Once T and B cells infiltrate into the TME, an antitumor immune response may be initiated if an antigen presented on the cell surface of a tumor cell is recognized by T or B cell receptors (TCR, BCR). Specificity of the set of available receptors is initially determined during the early stages of lymphocyte development, where variable (V), diversity (D), and joining (J) gene segments are randomly combined in a process known as somatic rearrangement. Diversification of the receptor repertoires is further increased by the addition or removal of nucleotides at the junction sites and pairing of receptor chains (TCRα to TCRβ, TCRγ to TCRδ, Ig heavy chain to Ig light chain).88 89 When activated, B cells undergo somatic hypermutation and class switch recombination, which further diversify the BCR repertoire.89 Each TCR and BCR chain contains three loops known as complementarity-determining regions (CDR1-3). CDR1 and CDR2 are encoded within the V gene, while CDR3 is encoded by the V-J junctional sequence (Ig light chain, TCRα, and TCRγ) or V-D-J junctional sequence (Ig heavy chain, TCRβ, and TCRδ) (figure 2A). Although CDR1 and CDR2 are thought to play a role in contacting the major histocompatibility complex (MHC) molecule, CDR3 is most studied due to the higher diversity in this region and direct interaction with antigen binding.
Since a difference in CDR3 sequence length reflects the addition or removal of nucleotides, early approaches assessed the CDR3 length distribution for immune cell diversity.90 Sequencing technology and bioinformatic tools can now extract TCR and BCR amino acid sequences. In general, tools for reconstructing the receptor repertoire from RNA-seq data either directly map sequencing reads to germline reference V, D, J and C genes, or perform de novo assembly of reads into continuous sequences called contigs that are then annotated by a reference set of known sequences, such as the ImmunoGeneTics (IMGT) database (figure 2B, online supplemental table 1).91 Tools also often incorporate PCR and sequencing error correction through clustering of near-identical CDR3 sequences or unique molecular identifiers and building a consensus sequence from each cluster based on metrics such as base quality and read coverage. Methods for TCR and BCR repertoire construction from bulk RNA-seq data include TRUST4,92 MiXCR,93 and ImRep94 (table 3). CATT95 reconstructs the TCR repertoire only and V’DJer96 reconstructs the BCR only. MiXCR and TRUST4 are the most widely used methods, both highly customizable and applicable to both bulk RNA-seq and single-cell sequencing data. An advantage of scRNA-seq is the pairing of receptor chains, which cannot be inferred from bulk RNA-seq. Methods based on scRNA-seq, such as TraCeR,97 TRAPeS,98 and scTCRseq,99 reconstruct TCRs, while BraCeR100 and BALDR101 reconstruct BCRs, and BASIC102 and VDJPuzzle103 are capable of reconstructing both TCRs and BCRs (table 3). Downstream filtering and processing of TCR and BCR sequences have previously been reviewed.88 104
Supplemental material
With the frequency and sequences of T and B cell receptor chains found from immune cell reconstruction methods, T and B cell population dynamics can be inferred and used for therapeutic strategies. Clonality and diversity are popular measures of immune fitness, which can be computed in several different ways (previously reviewed104 105) that broadly reflect the number of clonotypes and their abundance. While clone size distribution provides important information on the occupancy of T and B cell clones at a given time, T and B cells are dynamic and evolve across time and tissue compartments.106 107 By tracking clones over time or across cellular compartments, changes in clonality or diversity may be revealed due to clonal expansion or differentiation in response to disease progression or treatment. For example, Chapuis et al 108 tracked tumor-specific, adoptively transferred cytotoxic T cells in peripheral blood and found that a single clonotype dominated in patients with complete response, suggesting that the expanded clones were responsible for tumor control. However, clonal expansion is not necessarily synonymous with antitumor activity. Mapping of T cell clonotypes to their cellular states has revealed clonal expansion of T cells displaying an exhausted phenotype in patients with disease persistence and poor response to ICB.109–112 While these studies suggest that exhausted T cells become unresponsive due to chronic tumor-specific antigen stimulation, epitopes also need to be recognized by tumor-infiltrating lymphocytes (TILs) to elicit a response. Several studies have demonstrated that only a small portion of TILs are tumor-specific, and thus capable of recognizing and reacting to tumor cells, while the majority of T cells have been reported to be non-tumor-reactive bystander T cells.113 114 In silico prediction of receptor targets leading to an antitumor response remains a challenge due to a number of factors modulating response, such as signaling strength and specificity, and the necessity of in vitro validation studies. Common patterns among clonotypes, such as sequence similarity and gene usage, have been found from receptors that recognize the same epitope.115–117 Using tools, such as TCRdist118 and GLIPH,119 that cluster similar clonotype sequences may point to shared mechanisms of epitope recognition and response. To improve receptor binding specificity predictions, several tools have been developed that incorporate features, such as receptor-antigen binding strength and structural modeling.120–122 In-depth characterization of TCR and BCR clonotypes that lead to antitumor control will be essential to further improve predictive models by identifying attributes that foster a reactive receptor and design of treatments that may expand pre-existing cytotoxic T cells through TIL therapy, replenish rare or exhausted T cells through TCR or chimeric antigen receptor (CAR) T-cell therapy, or modulate lymphocyte pathways or functions through monoclonal or bispecific antibodies. Understanding the epitopes eliciting a response will further empower the design and selection of immunotherapeutic strategies through precise knowledge of therapeutic targets or development of personalized cancer vaccines that invigorate and boost antitumor response.
Neoantigen prediction
Neoantigens are mutated peptides that, given their absence from the normal human genome, are not subject to immune tolerance and may generate an immune response when bound to the MHC and presented on the cell surface. Neoantigens can be produced from point mutations, frameshifts that alter the reading frame of the protein, splicing aberrations, gene rearrangement, post-translational modifications (methylation, phosphorylation, and acetylation), translation of upstream open reading frames and stop codon readthrough, and viral infection. Whether the immune system recognizes or tolerates the neoantigen is dependent on several factors, such as variable cross-presentation due to tumor cell heterogeneity, leading to neoantigens presented on the surface of a limited number of tumor cells and decreasing the likelihood of recognition by T cells. T cells also preferentially react to certain neoantigens over others, due to sequence similarity to the normal genome, specificity to the T cell repertoire, proximity of surrounding presented neoantigens, and other features. Thus, a computational pipeline that can predict and identify immunogenic neoantigens is paramount.
HLA typing
T cell recognition and response to antigens presented on the cell surface of tumors is dependent on the MHC. In humans, MHC proteins that present neoantigens to the surface of the cell are encoded by a complex of human leukocyte antigen (HLA) genes. These set of genes are located on the short arm of chromosome 6, which are conventionally divided into three regions defined by HLA class I genes (HLA-A, HLA-B, and HLA-C), HLA class II genes (HLA-DR, HLA-DQ, and HLA-DP), and HLA class III genes (encode for immunoregulatory molecules, such as tumor necrosis factors, heat shock proteins, and components of the complement system). HLA class I and class II genes are highly polymorphic, while HLA class III genes display low to moderate levels of genetic variability.
The genetic variability and complexity of the HLA loci pose challenges with read mapping and variant calling methods that rely on a single reference human genome. To increase the genetic diversity of the reference HLA allele sequences and limit biases introduced from a single reference, algorithms for HLA typing leverage existing sets of HLA allele sequences available in the IMGT/HLA database (online supplemental table 1).123 HLA typing methods are generally either alignment-based, where they directly map sequencing reads to reference HLA alleles, or assembly-based and perform de novo assembly of reads into contigs that may be queried against a reference set of known HLA alleles.124 125 Inference of HLA alleles then varies among HLA typing methods (mentioned below) from Bayesian classification approaches, scoring systems, or the use of alignment properties (eg, coverage or number of reads mapped).
For HLA class I typing, Optitype,126 Polysolver,127 and PHLAT128 have been previously evaluated as having the highest reported performance rates.125 129–133 Optitype and Polysolver are limited to HLA class I calling, while PHLAT, among other methods (table 4), can infer both HLA class I and II alleles. However, HLA class II typing tools are comparatively less studied than HLA class I typing tools and substantially less HLA class II alleles are reported in the IMGT/HLA database. Historically, only MHC class I molecules were thought to contribute to neoantigen recognition by T cells, but attention surrounding immunogenic MHC class II neoantigens has grown since their recognition by CD4+T cells was discovered.134 Of the few benchmarking studies that have reported prediction performance on HLA class II alleles, either alone or in combination with HLA class I alleles, HLA-HD135 and HISAT-genotype,136 and PHLAT have the highest performance rates.131 137 138 A newly developed method, T1K, can infer both HLA class I and II alleles, along with KIR genotypes.139 To increase specificity of predicting HLA alleles, consensus algorithms, which determine the concordance between multiple HLA callers, have been developed.125 However, this approach is constrained by the biases inherent within each method.
While a patient’s HLA alleles determine the repertoire of neoantigens that will be presented for T cell recognition, the loss of HLA expression can prevent recognition by T cells altogether and promote resistance to ICI. Currently, only a few tools exist that determine HLA-specific variants, including Polysolver127 and LOHHLA140 (table 4). HLA expression can also be explored in combination with HLA variants to gain a more complete understanding of tumor-immune interactions, which can subsequently aid in selection and/or development of HLA-dependent immunotherapies.141 142
Peptide processing and presentation prediction
For a neoantigen to be presented on the surface of a cell, a protein must be degraded into smaller peptides. For MHC class I antigens, peptide fragments within the cytoplasm are then transported into the endoplasmic reticulum by transporter associated with antigen processing (TAP) before binding to a given MHC class I molecule. Cleavage of proteins that bind to MHC class I occurs on denatured proteins, while cleavage of proteins that bind to MHC class II molecules occurs on folded proteins, where even minor structural changes can impact peptide cleavage and subsequent MHC class II binding.143 Several tools have been developed for predicting proteasome specificity and peptide cleavage, such as NetChop,144 Pcleavage,145 and PAProC,146 for class I antigens. Tools for TAP transport efficiency include TAPPred147 and DeepTAP148 (table 4).
Prediction of binding of peptides to MHC is a pivotal step in identifying neoantigens due to the small number of peptides that bind to MHC molecules.149 Significant efforts have focused on the development of tools that accurately predict peptide binding to MHC class I and class II molecules and have previously been reviewed (table 4).150 151 Early methods for peptide binding were allele-specific, where binding predictions are made exclusively for the HLA allele that the model had been trained on. However, due to the high polymorphism of HLA molecules, experimental validation on ligands may be limited or non-existent for certain HLA alleles. Thus, pan-specific methods were later built to predict binding for any allele by training models on binding affinity data across multiple alleles. Allele-specific methods tend to outperform pan-specific methods where sufficient data are available, while pan-specific outperform allele-specific methods where experimental data are lacking or non-existent.150 Consensus methods aim to improve the binding predictions by combining both pan-specific and allele-specific models.152
A number of prediction tools have been developed to determine binding and non-binding peptides (table 4). Allele-specific methods for predicting MHC class I binding include NetMHC153 154 and mixMHCpred,155 while pan-specific methods include NetMHCpan,156 157 MHCflurry 2.0,158 HLAthena,159 PickPocket,160 and netMHCcons152 among other previously reviewed methods.150 Predicting MHC-II binding specificity has been more challenging compared with MHC-I in part due to the structure of the MHC class II binding groove. While the MHC-I binding groove has closed ends and is restricted to binding ligands of a limited range (typically 8-11mers), the open ends of the binding groove on MHC-II allow peptides of longer lengths (14-18mers) to bind (figure 3A).161 Only a portion of the ligand (usually 9 amino acid residues) may bind to the groove with the remaining peptide termini extending outside the groove, making prediction of peptide binding to MHC-II challenging since a single peptide could have several binding motifs.161 Tools for predicting MHC class II binding include NetMHCII,162 NetMHCIIpan,156 MARIA,163 MHCNuggets,164 and MixMHC2pred165 (table 4). In earlier methods, putative neoantigens were classified as MHC binding candidates using a half-maximal inhibitory concentration (IC50) threshold of 500 nM, where peptides with IC50 >500 nM were considered likely non-binders and filtered from the remaining peptides. As biases of certain MHC molecules toward the presentation of peptides at distinct binding thresholds were revealed in newer studies, methods began ranking the binding affinity of each peptide-MHC pair based on the predicted binding affinities of a random set of natural peptides.166 Given the high sensitivity and specificity rates observed at a percentile rank threshold of 2%, peptides with a percentile rank <2 are considered MHC binding candidates, with weak binders classified by ranks between 0.5% and 2% and strong binders characterized by a rank <0.5%.166 167
Less than 3% of possible peptides are thought to bind to a given HLA allele with sufficient strength and stability for T cell recognition and response, underscoring the importance of filtering and prioritizing predicted neoantigens by immunogenicity. While there is currently no consensus on the features that constitute immunogenic peptides, several features have been proposed that broadly reflect HLA preprocessing and presentation and immune cell receptor recognition, including features describing proteasomal cleavage, TAP binding efficiency, MHC binding affinity, peptide-MHC binding stability,168 gene expression, peptide hydrophobicity,169 neoantigen clonality,170 molecular size of antigen,171 mutation position,172 differential agretopicity index (ratio of mutant binding affinity to wild-type binding affinity),173 peptide foreignness (TCR recognition probability based on homology of the tumor peptide to known pathogenic peptides in the Immune Epitope Database (IEDB) and dissimilarity to non-mutated proteome),174–176 and peptide-MHC binding specificity and promiscuity.177 178 More recently, machine learning methods have been developed that incorporate selected features to predict neoantigen immunogenicity.178–180 Other groups have created pipelines that typically include a combination of mutation calling, HLA typing, MHC binding prediction, and neoantigen filtering or ranking (figure 2B). Workflows for predicting class I neoantigens include MuPeXI,181 Neopepsee,179 pTuneos,182 NeoPredPipe,183 and LENS,184 and workflows that can predict both class I and II neoantigens include NeoFuse,185 nextNEOpi,186 and pVACtools187 (table 4). Several recent studies have begun experimentally validating the T cell reactivity of computationally predicted neoantigens.188 189 Future functional studies that validate candidate neoantigens and elucidate features essential for neoantigen presentation and antitumor response will be paramount for optimizing neoantigen prediction workflows and improving their applicability in clinical settings.
Neoantigens represent ideal targets for antitumor immunotherapy without destruction of normal tissue given their ability to escape T cell central tolerance. In contrast to tumor-associated antigens, therapies targeting neoantigens have shown durable complete response.13 190–192 Moreover, reports of off-target side effects are extremely rare.193 While neoantigen-based therapies hold promise for improving patient outcomes in a safe and effective manner, a major challenge in effectively disseminating neoantigen-based therapies to more patients is the ability of the tumor to escape immune response through a variety of mechanisms. A comprehensive understanding of intrinsic oncogenic mechanisms driving tumor progression, the composition and functional state of cells within the TME creating either a supportive or suppressive immune environment, and active lymphocyte receptor and presented neoantigens will be crucial for selecting effective immunotherapeutic strategies that result in durable response with limited toxicity for patients with cancer.
Multi-omics and multimodal approaches
Despite significant advances in uncovering features of the tumor architecture and microenvironment, single-omics studies are limited in their ability to capture complex relationships between multiple molecular layers. Integration of single-level omics data, such as genomics, transcriptomics, epigenomics, proteomics, and metabolomics, offers higher resolving power for detecting causal relationships among interrelated perturbations and their phenotypic manifestations.194 Machine learning approaches are increasingly being used for both integrating and extracting meaningful information from multidimensional data. Several studies have applied deep learning models to integrate multi-omics data and predict survival outcomes, establish predictive biomarkers, and classify molecular subtypes.195–197 In particular, one study developed a machine learning model that classified tumors into four subgroups based on tumor mutation burden (TMB), microsatellite instability (MSI), and somatic copy number alterations (SCNAs).198 Group classifications within two melanoma cohorts associated with response to ICB while a simplified combination of the genomic features (TMB, MSI, SCNAs) using median thresholds did not associate with response, highlighting both the complexity in deciphering tumor-immune biology and the benefit of using deep learning methods in future studies with expanded datasets and different tumor types.
Omics complemented with radiological and histopathological images has the potential to further contextualize tumor-immune biology and improve prognostication for response to immunotherapy by extracting both modality-specific and cross-modality relationships. Several studies have applied deep learning models to infer PD-L1 status, TMB, MSI, and other genomic alterations from histopathological images.199–205 Further, deep convolutional neural network classifiers were previously applied to whole slide images to predict response to ICB response in patients with advanced melanoma.206 Deep learning models that integrate whole slide images with genomics data to predict survival outcomes and identify genes and pathways associated with each outcome include Pathomic fusion,207 PORPOISE,208 GPDBN,209 and HFBSurv.210
Unlike histologic images and sequencing technologies that only capture a portion of the tumor, radiological images, such as CT scans, can capture the entire tumor. This is particularly important for identifying heterogeneous response patterns to ICB. Using machine learning, a CT-based radiomics signature of tumor-infiltrating CD8 cells with eight variables was built from 135 patients with advanced solid tumors to predict response to anti-PD-1/PD-L1.211 While other studies have developed radiomics-based machine learning models to predict response to ICB, a number of these studies were found to have a limited sample size and do not correct for overfitting.212 Aggregating larger datasets in future studies may mitigate the risk of overfitting, although at the cost of increased data complexity, storage demands, and computational power.
Integrating unimodal datasets presents notable challenges in multimodal studies. Multimodal dataset sizes are limited to the availability of matched data across modalities or require the use of modeling techniques that can handle missing data.213 As more data are aggregated in future studies, heterogeneous forms of the data may be introduced. For example, variations in staining techniques of histopathological images or scanner type of radiomic images can confound both intermodal and intramodal relationships. Considering and appropriately adjusting for covariates in study design and analysis can enhance validity and interpretability of findings but may be challenging when certain variables are underrepresented or contain sparse data. Cross-modal discrepancies create additional complexity in data modeling, where, for example, tumors that are PD-L1-negative by immunohistochemistry staining express PD-L1 in RNA sequencing data. Additional computational power and costs associated with integrating multimodal datasets can further compound these challenges. Histopathological slides or radiographic scans often include several images for a single case, each ranging from approximately 100 kilobytes to 1 gigabytes or above.214 215 When integrated with omics data, the aggregate dataset could surpass thousands of megabytes, thus, requiring a robust data infrastructure for storage and processing capacity at manageable costs. Multidisciplinary efforts across research and clinical settings will be imperative to addressing these challenges and ultimately improving the predictive capacity and precision of multi-omic and multimodal machine learning models.
Future directions and concluding remarks
Computational methods that interrogate genomic datasets offer windows into the mechanisms driving disease progression and response to treatment. Characterization of the TME using single-cell, bulk, and spatial transcriptomic data has revealed highly complex and dynamic cellular networks across and within tumors. Inference of the immune repertoire has provided insight into the activation, expansion, and differentiation of lymphocytes and features modulating antigen-specific binding and activation. Excel spreadsheets and manual curation have been supplanted by machine learning algorithms and convolutional neural networks that predict neoantigen presentation. Multi-omics methods have accelerated our ability to capture complex tumor-immune relationships, while AI methods are increasingly being implemented to integrate data, identify complex patterns, and extract immunologically relevant information within a flexible framework. Given the limited number of patients that respond to immunotherapy, there is a need to optimize and develop novel tools to improve precision and selection of treatments.
Advances in NGS technologies have led to a growth in computational methods, as outlined in this review, but reproducibility, transparency, and accessibility limit their potential for becoming streamlined into clinical applications. With a wealth of tools available, each with their own strengths and weaknesses, and no universal standard, studies using the same datasets can easily diverge in findings based on choice of tool. To address inconsistencies between computational methods, some consensus approaches have been developed that determine concordance among multiple algorithms, reducing the number of false positives, although potentially at the cost of sensitivity. Additionally, discrepancies in findings may further be compounded when multiple steps are involved. Integrated pipelines can mitigate this issue while also increasing accessibility to users with variable bioinformatics experience depending on configurability. To build on current knowledge, pipelines and methods need to be accessible, executable, and well documented. At present, a number of algorithms mentioned in peer-reviewed papers are not publicly available, fail to execute, or are missing pertinent information.216–218 Importantly, inaccessible, inexecutable, or insufficiently documented code has the potential to impede progress in research, undermine confidence in the scientific community through inconsistent study results, and lead to unintended detrimental consequences in clinical settings. Moving forward, collaborative, sustainable efforts that increase method usability and precision will be vital for advancing the field and improving patient care.
Prospectively, we are optimistic that advances in computational techniques will enable comprehensive capture of the tumor-immune contexture, revealing antitumor response and resistance mechanisms and inspiring selection and development of therapeutic strategies. With the growing amount of genomics data, especially matched multi-omic and clinically annotated data, we anticipate that key mechanistic, predictive, and prognostic questions surrounding tumor-immune relationships and response to treatment will surface through statistically powered, multimodal analyses. Specifically, the exploitation of the composition and functional state of cells within the TME coupled with lymphocyte tumor-specificity and neoantigen immunogenicity features mediating antitumor control in various disease contexts could provide a therapeutic blueprint outlining an ideal therapeutic strategy. We further anticipate that advances in preclinical models and functional assays will substantiate findings from immuno-genomic datasets and computational predictive methods, while machine learning approaches will offer a flexible framework to account for multiple levels of complexity, including tumor intrinsic, extrinsic, and systemic mediators of tumor control and progression. By espousing a systems-level approach within an adaptable model, we expect that disentangling complex, casual relationships for combination therapies that limit immune-related adverse events will become more feasible. Ultimately, comprehensive analyses leveraged by advances in computational methods and functional validation will unequivocally expand the efficacy and durability of immunotherapy to a wider scope of patients with cancer.
Ethics statements
Patient consent for publication
Ethics approval
Not applicable.
References
Supplementary materials
Supplementary Data
This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.
Footnotes
Twitter @kevinmbio, @vanallenlab
Contributors CAR and EMVA conceptualized the review. CAR wrote the original draft of this manuscript and created the figures in BioRender. KM and EMVA critically reviewed and provided feedback on the manuscript.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests CAR and KM declare no conflicts of interest. EMVA disclosures: Advisory/consulting: Tango Therapeutics, Genome Medical, Genomic Life, Enara Bio, Manifold Bio, Monte Rosa, Novartis Institute for Biomedical Research, Riva Therapeutics, Serinus Bio. Research support: Novartis, Bristol-Myers Squibb, Sanofi. Equity: Tango Therapeutics, Genome Medical, Genomic Life, Syapse, Enara Bio, Manifold Bio, Microsoft, Monte Rosa, Riva Therapeutics, Serinus Bio. Travel reimbursement: None. Patents: Institutional patents filed on chromatin mutations and immunotherapy response, and methods for clinical interpretation; intermittent legal consulting on patents for Foaley & Hoag.
Provenance and peer review Commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.