Introduction

The role of tumour-specific antigens (TSAs) as targets of anticancer immunity was first recognized in the last century, with studies of TSA-based vaccines becoming more prevalent in the past decade1,2,3 (Box 1). Neoantigens are defined here as a subset of TSAs generated by non-synonymous mutations and other genetic variations specific to the genome of a tumour, presented by major histocompatibility complex (MHC) molecules, and recognized by endogenous T cells. The most commonly studied class of neoantigens are those derived from single-nucleotide variants (SNVs), which cause non-synonymous changes in a protein that subsequently may trigger antigen-specific T cell responses against the tumour. These conventional neoantigens have the distinct advantage over other classes of tumour antigens (for example, tumour-associated antigens and cancer–testis antigens) of having no expression in normal tissues4. As a result, T cells with specificity for these neoantigens can escape negative selection in the thymus, leading to the generation of a TSA-specific T cell repertoire5.

Despite the advantages of SNV neoantigens, their applicability as vaccine targets may be limited to cancers with highly immunogenic neoantigens, likely a subset of the total neoantigen load for any given tumour. Metastatic melanoma (which contains the highest SNV burden of any cancer6) has been the primary focus of initial neoantigen clinical studies2,3, and in this tumour type, as in lung cancer, tumour mutational burden (which estimates neoantigen load) has been associated with the response to immune checkpoint inhibition7. One hypothesis for this association is the increased likelihood in these tumour types of neoantigen generation and T cells bearing neoantigen-specific T cell receptors (TCRs). However, the number of neoantigens required for driving a clinical response is unknown, and it has been shown that tumours with a low mutational burden can have neoantigen-specific T cell populations boosted by therapeutic personalized neoantigen vaccines8,9.

Many investigators, including our group, have begun to evaluate alternative TSAs — defined as high-specificity tumour antigens arising from non-SNV-containing genomic sources. Unlike SNV neoantigens, alternative TSAs are not necessarily restricted to protein-coding exons, allowing for a greater repertoire of available targets. Predicted tumour antigen burden has demonstrated that the expression of various classes of TSAs is not always correlated, with some SNV-low cancers containing high alternative-TSA expression. This is exemplified by clear-cell renal-cell carcinoma (ccRCC), an immune checkpoint inhibitor sensitive cancer that has a low predicted SNV burden but high expression of predicted frameshift neoantigens10 and tumour-specific endogenous retroviral antigens11. Thus, studying these alternative TSAs may broaden the scope and increase the number of targets available to test in therapeutic vaccines and/or cellular therapies. Additionally, leukaemia and sarcoma (which are among the cancers with the lowest predicted SNV burden12) express gene fusions13,14 and splice variant transcripts15,16 shared across multiple tumours, potentially allowing for universal off-the-shelf therapies.

In this Opinion article, we will characterize several major classes of alternative TSAs, including those generated from mutational frameshifts, splice variants, gene fusions, endogenous retroelements and other classes, such as human leukocyte antigen (HLA)-somatic mutation-derived antigens and post-translational TSAs (Fig. 1; Table 1). One class of TSA not covered here is the viral-derived cancer antigens (for example, human papillomavirus (HPV) and Epstein–Barr virus (EBV)), which have been previously reviewed17,18,19,20,21. We will begin by providing a brief overview of TSA computational prediction and then discuss the biology, available computational tools, preclinical and/or clinical studies, and relevant cancers for each alternative TSA class. Finally, we will discuss the current challenges impeding therapeutic application of alternative TSAs and solutions to aid their clinical translation. In addition to a review of the literature, recent studies (including several from our group) have provided estimates of the antigenic burden of each TSA class among The Cancer Genome Atlas (TCGA) pan-cancer data (including selected tumour-specific viral antigens), which we have compiled here as a resource (Fig. 2 and Supplementary Fig. 1; viral antigens including those derived from EBV, herpes simplex virus and cytomegalovirus are included for comparison).

Fig. 1: Summary of tumour-specific antigen production in the tumour cell.
figure 1

Mutations and other tumour-specific nucleotide sequences (shown in red) can be observed at the genomic DNA level, where they undergo transcription (1) and splicing to form mRNA (2). Alternative splicing can occur at this step, to form splice variant mRNA. Next, translation occurs on variant mRNA, resulting in the production of variant proteins (3). Post-transcriptional frameshifts (for example, ribosomal slippage, among other mechanisms) can occur at this step, resulting in frameshifted protein variants. These proteins can then undergo proteasomal degradation (4) and transport to the endoplasmic reticulum (ER), to subsequently be loaded on major histocompatibility complexes (MHCs) (5). Other forms of post-translational frameshift can occur during these steps (for example, protein splicing). Finally, peptides containing variant sequences can be presented at the cell surface in the context of MHC, resulting in T cell targetable tumour-specific antigens (6). Chr, chromosome; ERV, endogenous retrovirus; INDEL, insertion or deletion; SNV, single-nucleotide variant.

Table 1 Advantages, disadvantages and relevant cancers for each tumour-specific antigen class
Fig. 2: Average tumour-specific antigen counts by cancer type.
figure 2

Plots represent the number of unique identified epitopes by The Cancer Genome Atlas (TCGA) cancer type. Insertion or deletion (INDEL) neoantigen counts have demonstrated significant correlation with single-nucleotide variant (SNV) neoantigens among all cancer types (coefficient: 0.81, P < 0.0001). Notable outliers in this correlation are kidney renal clear-cell carcinoma (KIRC; commonly known as clear-cell renal-cell carcinoma (ccRCC)) and kidney renal papillary cell carcinoma (KIRP; commonly known as papillary RCC), where the INDEL-to-SNV ratio is significantly higher than in other cancer types (ccRCC, 0.85, and papillary RCC, 0.90; all others, 0.43 – 0.72). Analysis of splice variant antigens has demonstrated a burden similar to that of INDEL neoantigens, with significant correlations with both INDEL and SNV neoantigen burden. A notable outlier is thyroid cancer (thyroid carcinoma (THCA)), where the average number of splice variant antigens per sample is higher than that of SNV neoantigens. The mean burden of fusion-derived neoantigens is highest in sarcomas (for example, sarcoma (SARC), 1.1; uterine carcinosarcoma (UCS), 0.78), with the carcinoma fusion burden being highest in breast (breast invasive carcinoma (BRCA), 0.70) and prostate (prostate adenocarcinoma (PRAD), 0.58) cancer. Testicular cancer (testicular germ cell tumour (TGCT)) has a substantially greater burden of human endogenous retrovirus (hERV)-derived tumour-specific antigens (TSAs) than any other TCGA cancer type. SNV and INDEL epitopes are derived from Thorsson et al.12. Fusion epitopes are derived from Gao et al.199. Splice variant epitopes are derived from Jayasinghe et al.52. The viral epitopes are derived from Selitsky et al.200; the hERV epitopes are derived from differentially expressed hERVs (>10-fold tumour-versus-mean normal expression by DESeq2) in Smith et al.11; all TSA classes represent the average numbers of predicted class I human leukocyte antigen (HLA) binders (8–11 mers, <500 nM) predicted from NetMHCPan. Stomach adenocarcinoma (STAD) INDEL and SNV calls were absent from Thorsson et al.12, and oesophageal carcinoma, acute myeloid leukaemia and ovarian serous cystadenocarcinoma were not included in all original reports and so are absent from the figure. The data shown represent reanalysis of the above reports, with modification of the data to derive values comparable across TSA groups. ACC, adrenocortical carcinoma; BLCA, bladder urothelial carcinoma; CESC, cervical and endocervical cancers; CHOL, cholangiocarcinoma; COAD, colon adenocarcinoma; DLBC, lymphoid neoplasm diffuse large B cell lymphoma; GBM, glioblastoma multiforme; HNSC, head and neck squamous cell carcinoma; KICH, kidney chromophobe; LGG, brain lower-grade glioma; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; MESO, mesothelioma; PAAD, pancreatic adenocarcinoma; PCPG, phaeochromocytoma and paraganglioma; READ, rectum adenocarcinoma; SKCM, skin cutaneous melanoma; THYM, thymoma; UCEC, uterine corpus endometrial carcinoma; UVM, uveal melanoma. A version of these data with individual numbers of unique TSAs by cancer type is available online (Supplementary Fig. 1).

Computational prediction of TSAs

Recent advancements in DNA and RNA sequencing (RNA-seq) have enabled the development of genomic and computational methods of TSA prediction (Table 2). Methods for generating TSA immunotherapies generally rely on a conserved set of steps: variant calling, HLA typing, peptide enumeration, HLA binding prediction and therapy generation (Fig. 3). Variant calling is the identification of genomic regions with tumour specificity. In the case of SNVs, insertion or deletion (INDEL) mutations and gene fusions, variants are derived from mutations within the tumour exome that are not expressed by germline DNA. In contrast, endogenous retroelement-derived antigens are identified from RNA expression data, selecting for elements with higher expression in the tumour than in matched normal tissues. Splice variant antigens can be identified through a variety of techniques, discussed in-depth later. Subsequently, tumour HLA typing is derived using an HLA caller (for example, POLYSOLVER22, OptiType23, PHLAT24, HLAScan25 or HLAProfiler26), which relies on DNA and/or RNA-seq data, depending upon the platform. Peptide enumeration is then performed, whereby variant genomic regions are translated into peptide sequences, with removal of translation-incompatible sequences such as nonsense mutations. Following this, HLA binding prediction is performed using prediction software (for example, NetMHCpan27), with higher-affinity peptides being characterized by either ranked percentiles or KD values of ≤500 nM (the commonly accepted binding affinity cutoff in the field)3,27,28. The majority of MHC binding affinity prediction tools rely on machine-learning algorithms (including artificial neural networks) trained on validated epitope reference databases, where peptide binding to MHC molecules has been measured using biochemical assays29,30. Finally, the predicted TSAs are used to generate a therapeutic product, either as a vaccine (that is, a DNA or RNA, peptide or dendritic cell vaccine) or a cellular therapy product (that is, adoptive T cell therapy). Below, we will discuss the biology of each alternative TSA class, with detailed descriptions of the available computational prediction tools.

Table 2 Computational workflows for tumour-specific antigen calling
Fig. 3: Computational workflow for tumour-specific antigen calling.
figure 3

a | The identification of tumour-specific antigens begins with variant calling. This can be done through the comparison of tumour versus normal-tissue DNA sequences (single-nucleotide variants (SNVs) and insertions or deletions (INDELs)) or RNA sequences (splice variants, fusions, viral sequences and retroelements) to look for tumour-specific variants in the exome or tumour-specific transcripts in the transcriptome, respectively. b | Tumour human leukocyte antigen (HLA) typing is performed to enable downstream major histocompatibility complex (MHC) binding prediction. c | Peptide enumeration occurs through the translation of variant nucleotide sequences into their respective amino acid sequences, filtering for translation-incompatible sequences, such as those containing intervening stop codons or those with low evidence of RNA expression. These polypeptides are then used to derive 8–11 mer sequences (for MHC class I epitopes) or 15 mer sequences (MHC class II epitopes). d | This then enables downstream MHC or HLA binding prediction of each sequence. Binders are typically defined in the literature as those with a predicted binding affinity (KD) of ≤500 nM or are selected from those with the highest ranked percentile for predicted binding affinity. Other filtering criteria may be performed after this step, such as immunogenicity prediction or filtering away sequences with high homology to self-antigens. e | Finally, therapies are generated using predicted tumour-specific antigens. These can be DNA, RNA or peptide vaccines or cellular therapies such as adoptive T cell therapy.

Mutational frameshift neoantigens

Biology of INDEL mutations

INDEL mutations are derived from the insertion of base pairs into or deletion from the genome, which has the capacity to generate non-synonymous novel open reading frames, known as mutational frameshifts. INDEL-derived neoantigens have been hypothesized (but not yet proven) to generate more robust immune responses than SNV-derived neoantigens, as their sequences are completely unique from germline sequences downstream of the INDEL10,31. Epitopes generated from these mutations could induce a T cell response similar to SNV neoantigens, due to decreased potential for negative selection in the thymus against the INDEL neoantigen-specific T cell.

Cancer types that are particularly relevant for targeting INDEL neoantigens include microsatellite instability-high (MSI-H) tumours, as well as all renal cell carcinomas (RCCs). Early studies examining the role of INDEL mutations for antitumour immunity were mainly pursued in colon cancer, where MSI caused by hereditary diseases (for example, Lynch syndrome) and in sporadic tumours (MSI-H in 15%) is common32,33. MSI-H tumours are also observed in other non-hereditary cancers, including gastric, endometrial and pancreatic cancers34. MSI-H cancers are characterized by impaired DNA mismatch repair pathways and are associated with significantly greater INDEL burden than non-MSI-H tumours31,35. The association between INDEL burden and the presence of tumour-infiltrating T cells has been well described in the literature, providing early support for the hypothesis that MSI-H tumours would be susceptible to immunotherapies31,36,37,38,39. Concurrent with these findings, immune checkpoint inhibitors have demonstrated clinical activity for patients with MSI-H tumours, independent of the tissue of origin40. As a result, MSI-H tumours are the only non-tissue-restricted class of tumours with US Food and Drug Administration (FDA) approval for immune checkpoint inhibitor therapy41. In MSI-H tumours, the burden of both SNV and INDEL neoantigens is high, making both neoantigen classes potentially useful for targeted therapy10.

In contrast, RCC contains relatively few SNVs, despite having immune infiltrates and a high clinical response rate to immune checkpoint inhibitor therapy42. A potential explanation for this was explored recently by Turajlic et al.10, whereby examining the pan-cancer INDEL profile in The Cancer Genome Atlas (TCGA) dataset revealed that all RCC subtypes (clear-cell RCC (ccRCC), renal papillary cell carcinoma and chromophobe RCC) have the highest proportion and number of INDEL mutations of any cancer types. The presence of INDELs was also associated with immune features (for example, T cell activation and immune checkpoint inhibitor response) in three individual cohorts of patients with melanoma. While the number of predicted INDELs across the pan-cancer cohort was orders of magnitude lower than the SNV mutations, they were estimated to produce approximately 3 to 9 times more predicted neoantigens per mutation than SNVs10.

Tools for predicting INDEL-derived neoantigens

Currently, we are aware of at least six tools in peer-reviewed publications with the capacity to predict INDEL-derived neoantigens — pVACseq43, Neopepsee44, MuPeXI45, Epidisco46, Antigen.garnish47, and TSNAD48 (not included here are the custom neoantigen prediction pipelines being used in translational and clinical studies, which may contain proprietary methods for antigen prediction along with integration of a publicly available variant caller (for example, Indelocator and Strelka49) and peptide–MHC binding prediction methods). Among these tools, Neopepsee is unique in its integration of machine-learning algorithms to predict immunogenicity — as well as peptide–MHC binding — a feature not easily validated biologically in human neoantigen studies before the induction of therapy.

Translation of INDEL-derived antigens into the clinic

A rare example of a publicly shared neoantigen has been observed in a common frameshift mutation in the gene transforming growth factor-β receptor type 2 (TGFBR2), frequently found in Lynch syndrome and 15% of sporadic gastric and colon cancers with MSI39. Three independent studies published in 2001 demonstrated HLA-specific epitopes generated from mutated TGFβR2 capable of generating antigen-specific T cells, one associated with MHC class I–CD8+ T cell responses31,39 and one with MHC class II–CD4+ T cell responses50. A more recent study from Inderberg et al.51 isolated cytotoxic T lymphocytes (CTLs) from a patient with colon cancer who had shown greater than 10-year survival after vaccination with a TGFβR2 frameshift mutation-derived peptide, and used these CTLs to generate a TCR-paired α- and β-chain clone, which was subsequently transfected into both CD4+ and CD8+ T cells. The transfected T cells demonstrated evidence of efficacy against colon cancer cell lines containing the TGFβR2 mutation both in vitro (cytotoxicity and cytokine release) and in vivo (an immunodeficient xenograft mouse model).

A recent publication from Ott et al.3 studied the use of personalized neoantigen vaccines in the treatment of metastatic melanoma, with prioritization of INDEL neoantigens in their prediction pipeline. Four unique INDEL mutations across six tumours were predicted, with T cell cultures generated that were specific to two of those INDEL neoantigens (one CD4+ T cell epitope and one CD8+ T cell epitope), which in turn demonstrated detectable interferon-γ (IFNγ) secretion in response to their respective epitopes. This was compared with only 3–5 of 28 predicted SNV neoantigens, which exhibited IFNγ responses of a similar concentration. Although INDEL neoantigen cross-reactivity with the respective reference wild-type epitope was not measured (presumably as it was expected there would be no cross-reactivity), over half of the SNV-specific T cells demonstrated cross-reactivity with their wild-type epitope at escalating concentrations. Due to the small patient cohort and follow-up so far only at 20–32 months, the clinical benefit of INDEL neoantigens cannot yet be determined from this study.

Splice variant antigens

Splice variant antigen frequency in cancer

Splice variant antigens are post-transcriptionally derived TSAs arising from alternative splicing events, including those from mRNA splice junction mutations52,53,54,55,56,57, intron retention58,59,60,61,62,63 or dysregulation of the spliceosome machinery in the tumour cell15,64,65. Other types of post-transcriptionally derived TSAs include alternative ribosomal products (for example, ribosomal frameshifting66,67, non-canonical initiation68,69,70,71, termination codon read-through69, reverse-stand transcription72 and doublet decoding73) and post-translational splicing74,75,76 — these two mechanisms are difficult to apply in anticancer therapies, given the lack of tools for predicting such products.

The study of splice variant proteins has historically focused on haematological malignancies, with splice variant protein expression being understudied in solid tumours. As such, putative splice variant antigens derived from these proteins have received less attention in solid tumours, with expression only recently validated77. In haematological cancers in which SNV burden is relatively low6, splice variant antigens could broaden the number of available TSA targets for therapeutic application. Splice variant proteins can arise through cis-acting mutations that disrupt or create splice site motifs or through trans-acting alterations in slicing factors that have historically been identified in haematological malignancies77,78. The role of spliceosome machinery in the generation of splice variants in haematological malignancies is a current area of investigation. Mutations in spliceosome proteins (for example, splicing factor 3b subunit 1 (SF3B1), serine- and arginine-rich splicing factor 2 (SRSF2), U2 small nuclear RNA auxiliary factor 1 (U2AF1) and U2AF2) are common in myelodysplastic syndrome, acute myeloid leukaemia (AML), chronic myelomonocytic leukaemia (CMML), and chronic lymphocytic leukaemia (CLL)79,80,81,82,83. Sharing of these spliceosome protein mutations across haematological cancer types has led to the hypothesis that spliceosome dysregulation may cause the expression of splice variant mRNAs, which are not detectable in normal tissues, leading to the translation of TSAs84,85,86. Beyond haematological malignancies, recent reanalysis of the TCGA pan-cancer dataset demonstrated a strong association between somatic mutations in components of the spliceosome machinery and the expression of splice variant products77, providing evidence for the relevance of splice variant antigens in solid tumours.

Tools for predicting splicing events and splice variant antigens

Several types of splice variant callers have been described in the literature. Two of these tools, Spliceman87 and MutPred Splice88, predict the capacity of exonic variants surrounding an annotated splice junction to interfere with normal splicing. Other tools provide de novo identification of alternative splicing events, including JuncBase89, SpliceGrapher90, rMATS91, SplAdder92 and ASGAL93. Many of these tools (for example, SpliceGrapher, SplAdder and ASGAL) predict alternative splicing events through the generation of splicing graphs. This splicing graph is generated through comparisons of spliced alignments of RNA-seq reads against a genome reference, which consists of vertices (nodes) that represent predicted splicing sites for a given gene as well as edges that represent exons and introns between splicing sites. In addition to these splice variant callers, at least one peer-reviewed tool, Epidisco46 (the computational pipeline for the multi-institutional PGV-001 personalized vaccine trial94), has been described with the capacity to predict for splice variant antigens.

Jayasinghe et al.52 reported MiSplice, which integrates DNA-seq and RNA-seq data in order to discover mutation-induced splice sites, which they applied to the TCGA pan-cancer dataset. Splice variant mutations contained 2–2.5x more predicted TSA candidates than did SNVs, with some tumorigenesis-related genes containing ≥40 unique predicted TSAs. Furthermore, predicted splice variant antigen burden was correlated with programmed cell death 1 ligand 1 (PDL1) expression, suggesting that PDL1 blockade therapy may be efficacious in tumours with a high frequency of splice variant antigens. Additionally, Kahles et al.77 reported a comprehensive analysis of splice variants in the TCGA pan-cancer dataset and then used mass spectrometry to identify tryptic-digested polypeptides that contained splice variant antigens in 63 primary breast and ovarian cancer samples. This method found, on average, 1.7 predicted splice variant antigens per sample, with up to 30% more alternative splicing events in tumours than in normal tissues. Notably, Kahles et al.77 also reported several known (SF3B1 and U2AF1) and novel (transcriptional adaptor 1 (TADA1), serine–threonine protein phosphatase PPP2R1A and isocitrate dehydrogenase 1 (IDH1)) splicing quantitative trait loci that were associated with alternative splicing events in 385 genes, suggesting that these loci are important for predicting the burden of splice variant antigens.

While these studies have demonstrated TSAs derived from cancer-specific splice junctions, further work will be needed to refine the computational methods for splice variant antigen prediction. Particular emphasis is needed on identifying novel splice junctions that are likely to yield mRNA isoforms that will not undergo nonsense-mediated decay95. To address this problem, improved full-length mRNA isoform inference procedures or hybrid (that is, long- and short-read) RNA-seq algorithms will need to be developed. These procedures would identify the full-length splice variant transcript, allowing for filtering of transcripts that do not contain premature stop codons that could subsequently trigger nonsense-mediated decay.

While tumour-specific splice variants of particular genes have been described in multiple tumour types, there are currently no reports of the use of splice variant antigens in personalized therapies. For example, the presence of tumour-associated splice variants has been described in select genes, including receptor for hyaluronan-mediated motility (RHAMM; two tumour-enriched variants, RHAMM-48 and RHAMM-147 in multiple myeloma)96 and Wilms tumour protein 1 (WT1; one variant, E5+, enriched in multiple cancers)97,98,99. WT1-derived peptides have been studied as a therapeutic target in leukaemias100,101,102,103,104 and in lung105 and kidney cancers106; however, these trials did not use epitopes specific for the E5+ splice variant. Additionally, an HLA–B44-restricted epitope derived from a variant of the minor histocompatibility antigen HMSD (HMSD-v) selectively expressed by primary haematological malignant cells (including those of myeloid lineage as well as multiple myeloma), but also by normal mature dendritic cells, was observed to be targeted by the CD8+ cytotoxic T cell clone 2A12-CTL107. Co-incubation of 2A12-CTL with primary AML cells conferred tumour resistance to immunodeficient mice after injection, suggesting that this HMSD-v derived antigen is a viable target for immunotherapy. Finally, Vauchy et al.108 described a CD20 splice variant (D393–CD20) whose expression is detectable in transformed B cells and upregulated in various B cell lymphomas. They subsequently demonstrated the capacity of D393–CD20-derived epitope vaccines to trigger both CD4+ and CD8+ T cell responses in HLA-humanized transgenic mice, supporting the use of CD20 splice variant epitopes for targeted immunotherapies in B cell malignancies.

Gene fusion neoantigens

Gene fusion occurrence in cancer

Gene fusions were originally identified in leukaemia109, with subsequent observations in bladder110, breast111, renal112, colon113 and lung cancers114 (among others). Similar to splice variants, gene fusion proteins have been a focus of study in leukaemia (particularly AML, acute lymphocytic leukaemia (ALL), and chronic myeloid leukaemia (CML)115) but also sarcomas116, where SNV burden is limited. These cancers contain conserved gene fusions, some of which are observed in nearly 100% of cancer subtypes (for example, t(11;22)(p13;q12) in synovial sarcoma117). Because gene fusions are often driver mutations of certain tumours, compounds aimed at inhibiting fusion protein function have been clinically successful118. Immunotherapies directed against driver mutation gene fusions may be especially beneficial, as they would directly target the source of oncogenesis. However, while driver mutation expression has been demonstrated to be highly clonal in early cancers119, studies in non-small-cell lung cancer have demonstrated highly heterogeneous driver alterations119, frequent loss of HLA heterozygosity120 and epigenetic silencing of neoantigen-containing genes occurring in later disease121, all of which may contribute to immune escape. As such, targeting of a single driver mutation may limit long-term therapeutic efficacy, whereby therapy-resistant sub-clones with differential driver mutations and HLA expression profiles may arise. Although overall gene fusion frequency is relatively low compared to SNV and INDEL mutations, they can be shared within and between different tumour types122, making them identifiable through targeted methods (for example, fluorescence in situ hybridization) and potentially targetable by universal (as opposed to patient-specific) neoantigen-based strategies.

Prediction tools for gene fusion neoantigens

Using current genomic techniques, gene fusions are typically identified through the alignment of fusion-containing reads from RNA-seq to more than one reference gene. In addition to general gene fusion callers123, several personalized gene fusion neoantigen-calling pipelines have been developed, including INTEGRATE-neo, which is specifically designed for the prediction of gene fusion neoantigens124. Using INTEGRATE-neo for analysis of the TCGA prostate adenocarcinoma cohort, 1,761 gene fusions were identified in 333 patient samples that generated 2,707 fusion transcript isoforms. Among this set, 61 (3.5% of the total) gene fusions were identified in >1 patient. Furthermore, 1,600 fusion junction peptides were identified from the 2,707 transcripts, of which 240 (15%) were predicted HLA binders124. Notably, the binding affinity scores for these 240 predicted neoantigens were skewed toward tighter affinity, suggesting that predicted fusion-derived neoantigens might have substantially better MHC binding capacity than SNV neoantigens. In addition to INTEGRATE-neo, several other tools have been described for gene fusion neoantigen calling, including pVACfuse (which performs neoantigen epitope calling using fusion variants reported from INTEGRATE-neo), NeoepitopePred125, Antigen.garnish47 and Epidisco46.

Clinical studies with gene fusion neoantigens

Clinical trials targeting gene fusion neoantigens have been pursued in CML (targeting the BCR–ABL fusion) and paediatric sarcomas. Pinilla-Ibarz et al.126 demonstrated that three of six patients with CML receiving a high dose of a BCR–ABL fusion protein breakpoint peptide vaccine developed antigen-specific T cell responses, although no cytotoxic response was observed. Although this phase I study was designed to assess safety and not clinical efficacy, one patient demonstrated transient loss of BCR–ABL mRNA, one patient experienced transient and partial cytogenic response during vaccination, and two patients progressed to an accelerated phase of disease during the study period. A follow-up phase II trial from the same group similarly demonstrated evidence of vaccine safety and measurable immunogenic response, but no evidence of clinical efficacy127. Another trial, summarized in a publication from Mackall et al.128, studied the effects of dendritic cells pulsed with tumour-specific translocation breakpoints and E7, a peptide known to bind to HLA-A2 (given alongside autologous T cells +/- IL-2 and, serving as a control, influenza vaccinations) in patients with Ewing sarcoma and alveolar rhabdomyosarcoma. Compared with the 31% five-year overall survival in patients who underwent control apheresis, immunotherapy-treated patients had 43% five-year overall survival with minimal toxicity. These studies (among others129,130) underscore the potential for gene fusion neoantigens as universal off-the-shelf therapeutics, although their current clinical efficacy remains modest. This may be related in part to therapies only targeting a single gene fusion epitope, allowing for resistant sub-clones to arise in the later disease course119,120. While an off-the-shelf approach has clear logistical merit, the identification and application of multiple patient-specific gene fusion epitopes may improve therapeutic efficacy.

Currently, few studies have applied patient-specific fusion proteins predicted through DNA- and/or RNA-seq methods for therapeutic vaccination. One recent example from Yang et al.131 demonstrated the capacity of INTEGRATE-neo-derived fusion epitopes from head and neck cancers, including fusion epitopes derived from cancers with low overall mutational burden, to generate ex vivo activation of host and healthy donor T cells. Large-cohort clinical studies (for example, PGV-001 (refs46,94)) are currently underway that will include gene fusion neoantigens among the set of targeted TSAs. Future use of this potential class of neoantigens, alone or in combination, will require larger clinical trials with more robust clinical and immunological endpoints.

Endogenous retroelement antigens

Retrotransposons in cancer

Retrotransposons are mobile genetic elements capable of self-replication through transcription and reverse transcription from genomic DNA132. They can be broadly divided into long-terminal repeat (LTR, also known as retroviral-like) and non-LTR subclasses, which differ in their genomic structures and replication mechanisms132. Retrotransposons can be expressed in cancer through epigenetic dysregulation, either through inherently low methylation states11,133,134 or following pharmacological induction of demethylation135,136,137,138, resulting in transcription (and potential translation) of retroviral TSAs139. Among the many classes of retrotransposons, long interspersed nuclear elements (LINEs, a class of non-LTR retrotransposon) have been best characterized in terms of their ability to impact cancer biology. LINE-1 has been shown to induce cancer cell apoptosis140, trigger adenomatous polyposis coli (APC)-mediated tumorigenesis in colon cancer141 and associate with clinical features and changes in cellular morphology in breast cancer142,143, among other roles.

Endogenous retroviruses (ERVs), a type of LTR retrotransposon in mammals, are remnants of exogenous retroviruses that have been incorporated into the genome throughout evolution144. Human ERVs (hERVs) impact the pathogenesis and progression of cancers, including melanoma, lymphoma, leukaemia and ovarian, prostate, urothelial and renal carcinomas134,145,146,147,148,149,150,151,152,153. Transcription of tumour-specific or enriched hERVs arises through epigenetic dysregulation of the cancer genome (which can be either inherent to the epigenetic state of the cancer or pharmacologically induced through epigenetic modulating agents), resulting in the expression of hERV-containing genomic regions otherwise not observed under physiological conditions136,138. These tumour-specific or enriched hERVs can impact both the innate and adaptive immune system through distinct mechanisms. With the innate immune system, hERVs signal through innate sensors, most commonly the RIG-I-like pathway recognition of viral double-stranded RNAs136,138. This results in downstream nuclear factor-κB (NF-κB)-mediated inflammation, with release of type I interferon, which causes immune activation and the expression of class I MHCs on tumour cells. Additionally, hERV-derived protein antigens can induce B cell and T cell activation154,155,156. Therefore, it has been proposed that tumour-specific hERV antigens could be applied to antitumour adoptive cellular therapies and therapeutic vaccines.

In addition to INDEL-derived neoantigens, hERVs have been proposed as key drivers of antitumour immunity in ccRCC11,157. In ccRCC, hERVs demonstrate baseline expression in the tumour without exogenous pharmacological epigenetic modulation, with expression of these hERVs showing strong association with both clinical prognosis and response to immunotherapy11,157. A 2015 study from Rooney et al.158 provided an initial genomic evaluation into the interaction between hERVs and the tumour immune microenvironment, demonstrating three of 66 hERVs (ERVH-5, ERVH48-1, ERVE-4; identified in a previous study from Mayer et al.159) to have tumour-specific expression and correlate with a cytotoxicity signature (granzyme A and perforin-1) in several cancers. On the basis of this study as well as several other translational studies showing the presence of an hERV-specific T cell response in ccRCC155,160, our group performed comprehensive analyses into the role of hERVs in ccRCC11,157. From immunogenomic analysis of hERVs in ccRCC, we demonstrated hERV-derived signatures to be the best predictor of patient prognosis, outperforming both clinical stage and M1–M4 molecular subtyping11. Additionally, the expression of tumour-specific hERV 4700 in pretreatment ccRCC samples was strongly associated with post-treatment response rates to anti-programmed cell death 1 (PD1) therapy. As such, hERV-derived antigens may be a viable alternative TSA target in ccRCC. Additionally, recent evidence suggests a potential role for hERVs in the modulation of low-grade glioma (where SNV burden is among the lowest of any cancer)11 and testicular cancer (particularly those with KIT mutations), in which global DNA hypomethylation is associated with high hERV expression133.

Computational methods for the quantification of retroelement expression

Several computational methods for retroelement quantification currently exist, with the majority providing quantification of ERV-like or retrotransposon-like elements (partial or full-length) rather than full-length, intact ERVs at specific genomic coordinates. This is due to the historic lack of well-annotated ERV references containing full proviral sequences and coordinates (rather than segments of ERV-like elements), which have only recently been published, to allow for mapping of full-length, intact ERVs161,162. The most well-known tool is RepeatMasker, designed to identify interspersed repeats and low-complexity sequences of any class, including simple and tandem repeats, segmental duplications, and interspersed repeats (including ERV-like elements, LINEs and short interspersed nuclear elements (SINEs), LTRs and other classes)163. RepeatMasker used in its default state is not optimal for the detection of ERVs. However, many ERV-specific databases (for example, HERVd164, HESAS165 and EnHERV166) have subsequently been generated using RepeatMasker. A more recent quantifier designed by our group, aimed specifically for the analysis of hERVs from RNA-seq data, is hervQuant11, which quantifies full-length, intact hERV proviral sequences. The hervQuant reference is derived from Vargiu et al.161, which compiled genomic coordinates for 3,173 full-length hERV proviruses. Notably, hervQuant provided the first description of a broad genomic screening method for tumour-specific hERV antigens.

Because no tools are currently available to identify retroelement TSAs, the retroelement or ERV quantifiers described above must be paired with downstream epitope prediction software (for example, NetMHCpan27) for retroelement antigen binding predictions. Additionally, because retroelements are present in the genome of both tumour and normal tissues, the prediction of tumour-specific retroelements provides unique challenges. Unlike the identification of neoantigens, retroelement TSAs must be derived through differential expression analysis of tumour versus normal-tissue RNA-seq. While hERVs and other retroelements share common homology among their overall sequences, which might theoretically make them unsuitable targets for TSA therapeutic approaches, they also exhibit highly unique regions specific to each hERV, capable of generating equally unique peptide epitopes11. Our analysis of hERV homology during the design of hervQuant revealed that only a minority of hERVs contain >95% sequence homology with one or more other hERVs, providing the basis for our ability to differentiate hERVs from short-read RNA-seq data. Such hERV unique regions can be leveraged for hERV-based TSA therapies, as long as one can confirm the specificity of expression of that particular hERV within a tumour. Additionally, evidence of hERV-specific T cells found natively within the tumour immune microenvironment (for example, hERV 4700 (ref.11) and CT RCC HERV-E167) suggests a lack of thymic central tolerance against these hERV-specific epitopes.

Translational relevance of tumour-specific hERV targets

Several studies have described the translational application of tumour-specific hERV targets. A 2016 study from Cherkasova et al.155 identified a CD8+ T cell clone from a patient with regressing ccRCC and found the clone to have tumour-specific cytotoxicity in vitro. The CTL recognized an antigen from a specific hERV, CT RCC HERV-E — which was the same as one of the tumour-specific hERVs (ERVE-4) described by Rooney et al.158 and was also identified during our screen of differentially expressed hERVs in ccRCC (hERV 2256). This particular CTL clone is being studied in clinical trials for adoptive T cell therapy in metastatic ccRCC167. Our analysis also identified a second hERV (hERV 4700) with preferential expression in ccRCC compared to normal tissues, evidence of translation, and the presence of tumour-infiltrating CTLs specific for hERV-4700 gag- and pol-derived antigens of the virus11.

Other alternative TSAs

HLA somatic mutation-derived neoantigens

Several studies have described somatic mutations in tumour HLAs that allow for altered T cell recognition. This was first described by Brandle et al.168, where mutated HLA-A2*0201 in a ccRCC tumour promoted an antitumour T cell response. This study did not elucidate whether the mechanism for T cell response was a result of TCR recognition of the antigen presented by the mutated HLA or whether the recognition was against the HLA molecule itself. Huang et al.169 later demonstrated a similar finding in metastatic melanoma, with evidence that tumour-specific T cells may recognize an unknown antigen or set of antigens presented on mutated tumour HLA-A11. Together, these studies provided early evidence for potential targeting of novel antigens with specificity of binding to somatically mutated HLA on the tumour. A recent publication from Shulka et al.22 presented whole-exome-based HLA-typing software, POLYSOLVER, that is able to call HLA somatic mutations with high prediction power, validated by RNA-seq (estimated sensitivity 94.1%, specificity 53.3%). More recently, HLAProfiler improved upon the breadth and accuracy of HLA somatic mutation calls and is able to work from RNA-seq data alone26. Combined with existing tools capable of predicting antigen binding directly from HLA sequences (for example, NetMHCpan), it is possible to predict for sets of antigens with specificity for the mutated tumour HLA, and thus specificity for antitumour T cell responses. Notably, a more advanced version of NetMHCpan is theoretically able to predict MHC binding to novel MHC molecules (including those containing mutations) through machine-learning prediction of MHC binding based on the amino acid sequence of the MHC variant27.

Post-translational TSAs

TSAs can arise post-translationally in tumours, with the potential to be targets for therapy, but they are difficult to predict with current computational tools. Post-translational splicing may occur in human cancers, resulting in the excision of a polypeptide segment followed by subsequent ligation of the free carboxyl-terminal with the amino-terminal of a new peptide74,75. Additionally, a second class of antigens, known as T cell epitopes associated with impaired peptide processing (TEIPP), has been described as being presented on transporter involved in antigen processing (TAP)-deficient, MHC-low tumours and as being recognized by a TEIPP-specific T cell population170,171,172,173,174. Interestingly, these epitopes are non-mutated and derived from housekeeping genes. Yet they are not presented on normal cells. TEIPP-specific T cells can escape thymic selection in wild-type mice (but not in TAP1-deficient mice), making them promising candidates for antitumour therapeutic targets.

Challenges and future directions

Among the challenges impeding broad clinical application of alternative TSAs as therapeutics is the need to increase the sensitivity and accuracy of epitope prediction. The computational methods described above have provided avenues to predict a greater number of TSAs from a broader variety of genomic sources. However, methods both upstream and downstream of these algorithms can generally be applied to improve the prediction performance for all TSA classes. Here, we highlight several strategies that may universally increase the number and accuracy of all TSA predictions: improvement of MHC epitope binding predictions, algorithms for the direct prediction of TSA generation and immunogenicity, and mass spectrometry approaches to improve TSA calling accuracy.

MHC epitope calling

Most TSA therapeutic vaccine studies to date have focused on the use of predicted MHC I binding epitopes, largely due to the classical hypothesis that CD8+ T cells play a greater role in antitumour immunity than do CD4+ T cells, as well as the better performance of MHC I epitope prediction algorithms than of MHC II epitope predictors. Despite this, further improvements will be required for both MHC I and II prediction algorithms to identify greater numbers of accurately predicted TSAs. A recent analysis from our group demonstrated that the accuracy of MHC I binding affinity predictions by NetMHCpan varied greatly by allele type, with performance measures being strongly correlated with the proportion of training data epitopes that were ‘binders’ (KD ≤ 500 nM, the generally accepted cut-off within the field for MHC binding), and less so with the amount of total training data per allele175. As such, alleles with fewer ‘binders’ in the training set suffered from poor sensitivity and specificity, suggesting that more high-quality data will be necessary for the application of MHC I predictors for clinical TSA prediction.

Regarding MHC II predictions, recent preclinical and clinical studies have suggested the importance of MHC II binding neoantigens in promoting antitumour immunity. A study from Kreiter et al.176 was the first to describe MHC I-predicted neoantigens in fact being presented on MHC II, subsequently triggering CD4+ T cell responses. The relevance of SNV-specific CD4+ T cells in antitumour immunity is further supported by an earlier study from Tran et al.177, whereby infusion of an ERBB2 interacting protein (ERBB2IP) mutation-specific CD4+ T cell population abrogated tumour growth for 35 months in a patient with metastatic cholangiocarcinoma. Clinical studies from Sahin et al.2,3 confirmed the importance of CD4+ T cell responses in human trials, providing evidence in support of the clinical importance of MHC II TSAs. Despite this evidence, a major hurdle faced by computational prediction methods for MHC II epitopes arises from the open-binding cleft structure of the MHC II complex. This structure results in relatively promiscuous binding of epitopes compared with MHC I, whereby the binding core of the longer class II epitope must be accurately predicted before binding affinity can subsequently be calculated178.

Recent improvements have been made in the computational prediction of MHC II epitope binding affinity, largely facilitated by the application of machine-learning algorithms trained on large, validated epitope datasets. Many earlier algorithms focused on the identification of the epitope binding core, with predictions based on the interactions between this peptide core and the MHC complex. Neilson and colleagues179 first described NN-align, which provided MHC II binding predictions trained on both the peptide-core and flanking-region characteristics, which significantly improved MHC II binding prediction performance. Although the binding affinity of an epitope is primarily determined by its peptide core, flanking-region characteristics can also influence the binding affinity. NN-align was later adapted as the core algorithm for NetMHCIIpan by Andreatta et al.180, which further improved performance, and led to the description of alternative epitope–MHC interactions. Even with these improvements to MHC II binding prediction, the performance characteristics of state-of-the-art algorithms still lag behind MHC I binding predictors. While the importance of MHC II epitopes in promoting antitumour immunity has primarily been observed with SNV neoantigens, it is expected that alternative TSAs would similarly be applicable as MHC II epitopes. As such, increasing both the breadth of available TSAs from alternative sources and the improvements to MHC II epitope prediction can together provide a concerted strategy to multiplicatively increase the targetable pool of tumour antigens.

Direct prediction of TSA generation and immunogenicity

In addition to MHC binding affinity prediction, new methods for direct prediction of TSA generation and immunogenicity might aid in the clinical selection of therapeutic epitopes. The majority of neoantigen prediction algorithms currently rely on predicted peptide–MHC binding affinity as the primary method for epitope screening. However, preclinical and clinical studies have demonstrated that only a minority of all neoantigen candidates are capable of producing immune responses2,3,176,181. One such explanation for this high false-positive rate is that the current binding prediction tools do not account for other steps involved in MHC peptide processing182. A study from Pearson et al.183 demonstrated that MHC class I associated peptides (MAPs; that is, epitopes) were derived from only 10% of exomic sequences expressed in B lymphocytes, with 41% of protein-coding genes generating no MAPs. Using features of transcripts and proteins associated with efficient MAP production, they generated a logistic regression model to predict whether a gene is capable of MAP generation.

Another approach to improving TSA prediction is to directly predict epitope immunogenicity. As we briefly mentioned above, Neopepsee is a new tool that incorporates a machine-learning algorithm trained on HLA alleles that generate T cell responses to directly predict the immunogenicity of putative neoantigens44. Compared with conventional binding affinity metrics, Neopepsee predicted immunogenicity in two external validation datasets with significantly improved sensitivity and specificity, providing evidence in support of direct immunogenicity prediction approaches. Because the Neopepsee algorithm was trained on a broad set of HLA alleles rather than specifically using TSA epitope immunogenicity, biological differences between self-derived neoantigens and the non-self epitopes of the training set may be a limiting factor for algorithmic performance. With an increasingly growing number of clinical trials collecting neoantigen immunogenicity data, future algorithms trained specifically on TSAs may potentially provide even better predictive capabilities.

Mass spectrometry approaches

Apart from computational TSA prediction, mass spectrometry-based peptidomic approaches have been applied for the identification of tumour antigens184. The identification of endogenously presented epitopes by mass spectrometry began in the early 1990s185,186. The first peptide antigens were discovered with manual interpretation of tandem mass spectrometry; however, computational methods are now the routine strategy to make comparisons between tandem mass spectrometry and peptide sequences on proteomics databases. While conceptually similar to genomic alignment and sequencing, tandem mass spectrometry sequencing is substantially more error prone and less sensitive. Standard proteomics experiments with complex peptide mixtures from well characterized biological samples are typically only able to identify ~25% of the tandem mass spectra in a proteomics database187,188. In addition to the computational difficulties with sequence identification, peptides can undergo rearrangements in the mass spectrometer, generating sequences that were not present in the original biological sample189,190. Despite these challenges, progress has been made in confirming predicted neoantigens using mass spectrometry. Immunogenomic methods have been used to generate virtual peptidomes from tumour sequencing data, and neoantigens have been identified191,192. A recent study by Laumont et al.193 used mass spectrometry approaches and observed that approximately 90% of the identified TSAs from two mouse cancer cell lines and seven human primary tumours were derived from noncoding regions, including introns, alternative reading frames, noncoding exons, untranslated region–exon junctions, structural variants and endogenous retroelements. Notably, these noncoding regions are not identified through current exome or transcriptome-based sequencing approaches. This study underscores the potential importance of the alternative TSAs and provides strong evidence for their application in therapy design, importantly demonstrating that classical SNV neoantigens may comprise only a minority of the total TSA repertoire. While these studies have enabled neoantigen discovery on a large scale, the limitations of tandem mass spectrometry alignment and the possibility of unexpected peptide rearrangements mean that suspected neoantigens should be confirmed by direct comparison of the sample’s tandem mass spectrum with that of the synthetic peptide175,191,194.

Conclusion

Conventional SNV neoantigens remain the most well-studied class of TSA, with distinct advantages in terms of ease of prediction, prevalence in a wide cohort of patients and promising preclinical and clinical therapeutic evidence of immunogenicity. While SNV neoantigens will continue to be a driving force for therapeutic vaccine development in the coming years, many groups have broadened the search for other alternative TSAs derived from self- and non-self-antigens. While certain sources of alternative TSAs have been studied for decades (for example, gene fusion proteins and viral antigens), the advent of powerful computational methods for the patient-specific prediction of TSAs has expanded the breadth of targets available for potential clinical applications. Unlike SNV neoantigens, which are largely patient specific in expression12, some classes of alternative TSAs are shared among the population (for example, splice variant antigens, gene fusion antigens and hERV antigens), making them ideal for off-the-shelf therapies. Additionally, many of these peptide sequences are highly dissimilar from germline sequences (for example, frameshifts), allowing for potentially greater immunogenicity than SNV neoantigens. Thus, alternative TSAs should play a major role in the future of cancer immunotherapy.