Original research
Comprehensive profiling of cancer neoantigens from aberrant RNA splicing
  1. Daniel P Wickland (1),
  2. Colton McNinch (2,3),
  3. Erik Jessen (3),
  4. Brian Necela (4),
  5. Barath Shreeder (4),
  6. Yi Lin (5),
  7. Keith L Knutson (4) and
  8. Yan W Asmann (1)
  1. Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, Florida, USA
  2. National Institute of Allergy and Infectious Diseases, Bethesda, Maryland, USA
  3. Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota, USA
  4. Department of Immunology, Mayo Clinic, Jacksonville, Florida, USA
  5. Division of Hematology, Department of Internal Medicine, Mayo Clinic, Rochester, Minnesota, USA
  Correspondence to Dr Yan W Asmann; asmann.yan{at}mayo.edu

Abstract

Background Cancer neoantigens arise from protein-altering somatic mutations in tumor and rank among the most promising next-generation immuno-oncology agents when used in combination with immune checkpoint inhibitors. We previously developed a computational framework, REAL-neo, for identification, quality control, and prioritization of both class-I and class-II human leucocyte antigen (HLA)-presented neoantigens resulting from somatic single-nucleotide mutations, small insertions and deletions, and gene fusions. In this study, we developed a new module, SPLICE-neo, to identify neoantigens from aberrant RNA transcripts from two distinct sources: (1) DNA mutations within splice sites and (2) de novo RNA aberrant splicings.

Methods First, SPLICE-neo was used to profile all DNA splice-site mutations in 11,892 tumors from The Cancer Genome Atlas (TCGA) and identified 11 profiles of splicing donor or acceptor site gains or losses. Transcript isoforms resulting from the top seven most frequent profiles were computed using novel logic models. Second, SPLICE-neo identified de novo RNA splicing events using RNA sequencing reads mapped to novel exon junctions from either single, double, or multiple exon-skipping events. The aberrant transcripts from both sources were then ranked based on isoform expression levels and z-scores assuming that individual aberrant splicing events are rare. Finally, top-ranked novel isoforms were translated into protein, and the resulting neoepitopes were evaluated for neoantigen potential using REAL-neo. The top splicing neoantigen candidates binding to HLA-A*02:01 were validated using in vitro T2 binding assays.

Results We identified abundant splicing neoantigens in four representative TCGA cancers: BRCA, LUAD, LUSC, and LIHC. In addition to their substantial contribution to neoantigen load, several splicing neoantigens were potent tumor antigens that bound HLA more strongly than the influenza virus antigen used as a positive control.

Conclusions SPLICE-neo is the first tool to comprehensively identify and prioritize splicing neoantigens from both DNA splice-site mutations and de novo RNA aberrant splicings. There are two major advances of SPLICE-neo. First, we developed novel logic models that assemble and prioritize full-length aberrant transcripts from DNA splice-site mutations. Second, SPLICE-neo can identify exon-skipping events involving more than two exons, which account for a quarter to one-third of all skipping events.

  • Next generation sequencing - NGS
  • Major histocompatibility complex - MHC
  • Immunotherapy
  • Human leukocyte antigen - HLA
  • Genome

Data availability statement

Data are available in a public, open access repository. The raw data used to support the findings of this study are publicly available in the TCGA Repository of the Genomic Data Commons (GDC) (https://portal.gdc.cancer.gov), release 29.0.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See http://creativecommons.org/licenses/by-nc/4.0/.

WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Tumor neoantigens can be leveraged to develop personalized, immune-based therapies that target cancer. While current bioinformatics tools can discover neoantigens from single-nucleotide mutations, small insertions and deletions, and gene fusions, the detection of neoantigens from aberrant RNA splicing remains challenging.

WHAT THIS STUDY ADDS

  • This study reports a new approach to comprehensively identify neoantigens arising from aberrant splicing both with and without DNA splice-site mutations.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • The methods described in this study support the specific and sensitive identification of clinically potent neoantigens from aberrant splicing that may be used to develop personalized cancer therapies.

Background

The immune system plays a vital role in the surveillance and destruction of cancer cells by recognizing neoantigens resulting from tumor-specific somatic mutations. On presentation by class-I or class-II human leucocyte antigens (HLAs), neoantigens are recognized as foreign by T-cells, which mediate the antitumor immune response.1–3 Sources of neoantigens include protein-altering single-nucleotide mutations (SNMs), insertions or deletions (INDELs), gene fusions, and aberrant splicing.2 4 In general, neoantigens with a higher degree of non-selfness, such as those from gene fusions, aberrant RNA splicing, and frame-shift INDELs, are more immunogenic.2 4–11

While multiple tools and workflows can discover neoantigens from SNMs, INDELs, and gene fusions,12–15 the identification of neoantigens from aberrant RNA splicing remains challenging. Aberrant RNA splicing events are a rich source of potent neoantigens because they are highly tumor specific and abundant in tumors compared with normal tissues.6 8 Aberrant splicing originates from two categories of mutations: (1) DNA somatic mutations within splice sites and (2) de novo RNA aberrant splicings, which may arise from intronic mutations, alterations in splicing factor expression, transcriptional errors, or splicing errors that create novel splice sites or modify splicing regulatory elements.16–20 Multiple studies have compared the performance of various bioinformatics methods for splice-site identification, including NNSplice,21 MaxEntScan,22 and SpliceAI,23 among others. SpliceAI, which has consistently been ranked among the top tools,24–27 relies on a deep neural network not only to identify splice sites in DNA/RNA sequences but also to predict the consequences of splice-site mutations and whether the mutations lead to gain or loss of the splicing acceptor or donor site. This unique ability is critical for identification of splicing neoantigens, as described below. However, SpliceAI does not compute the transcription products resulting from the predicted acceptor and donor gains or losses, and therefore, cannot directly derive neoepitopes for neoantigen prediction. In addition, there are tools designed to identify novel splicing junctions directly from RNA sequencing data without requiring DNA splice-site mutation evidence.28–30 However, to the best of our knowledge, these RNA splicing identification tools are only capable of identifying single-exon or double-exon skipping events while events skipping three or more exons are missed. Furthermore, most methods detect either intron retention or alternatively spliced exons but rarely both.18 30 31

Despite its prevalence and specificity in cancer as well as its potential to generate highly immunogenic neoepitopes, aberrant RNA splicing remains a relatively unexplored source of neoantigens, due in large part to the computational complexity inherent in determining consequential splice-site mutations and more importantly in assembling the resulting full-length aberrant transcripts. The full-length aberrant RNA transcripts are required to compute putative neoepitopes for neoantigen prediction, and therefore, are essential. Recently, Chai et al32 developed a graph-based method to assemble splicing isoforms and predict splicing neoantigens using paired tumor and normal tissue RNA sequencing data. Because patient tumor-paired normal tissue RNA sequencing is not a common practice, and because some aberrant splicing transcripts with moderate to lower expression levels are not detectable de novo from the RNA sequencing data (eg, the case study of the mesenchymal-epithelial transition (MET) gene exon-14 skipping events described in the results below), a need exists for a more comprehensive and sensitive splicing neoantigen prediction method.

To address the current analytical shortcomings, we implemented an additional module within the framework of our previously published REAL-neo pipeline33 to comprehensively identify splicing neoantigens from aberrant RNA isoforms resulting from both DNA splice-site mutations and de novo RNA splicing. This new module, SPLICE-neo, leverages somatic splice-site mutations in tumor DNA to predict likely splicing acceptor and donor gain and loss and computes consequent aberrant splicing isoforms. In addition, SPLICE-neo detects de novo splicing isoforms without underlying DNA splice-site mutations by identifying novel junctions from single-exon, double-exon or multiple-exon skipping in tumor RNA and does not require patient-paired normal-tissue RNA sequencing. The putative aberrant splicing isoforms are then annotated, quantified, and prioritized using a z-score approach with the assumption that specific aberrant splicing events are rare. The top-ranked aberrant transcripts are translated into protein sequences, and the neoantigen potentials of all neoepitopes are evaluated by REAL-neo.

We applied SPLICE-neo to representative M-class and C-class cancers34 profiled by The Cancer Genome Atlas (TCGA)35 with mutational profiles of mostly small somatic mutations (SNMs and INDELs; M-class) or large genomic rearrangements (copy number alterations, gene fusions, and other structural variants; C-class). The 15 top-ranked splicing neoantigens were validated in the laboratory using in vitro T2 binding assays.

Materials and methods

Selection of TCGA cancers

Four representative TCGA cancer cohorts were profiled: breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC) and liver hepatocellular carcinoma (LIHC). According to TCGA,34 LUAD and LUSC are M-class cancers dominated by small SNM and INDEL mutations. BRCA is a C-class cancer with extensive copy number alterations. LIHC is intermediate between M-class and C-class, with moderate levels of both SNMs/INDELs and genomic rearrangements.

Determination of splice-site mutations with transcriptional consequences

Tumor somatic mutations in Mutation Annotation Format (MAF) (unannotated_union_set_somatic_hg38.maf) were downloaded from TCGA Genomic Data Commons (https://portal.gdc.cancer.gov) and processed using REAL-neo (online supplemental figure 1). Gene structures (exon and intron coordinates) were defined based on the Consensus Coding Sequence Database (CCDS) assembly GRCh38.p12. SNMs and INDELs located at the boundary of an exon and an intron (splice sites) were evaluated by SpliceAI23 to select those with delta score >0.5, the SpliceAI-recommended threshold for high probability of loss/gain of splicing acceptor/donor sites.
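The delta-score selection described above can be illustrated with a short Python sketch. This is not the SPLICE-neo or REAL-neo code. It assumes the pipe-delimited annotation layout documented for SpliceAI output (ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL followed by four position fields); the function name and the example annotation string are hypothetical.

```python
# Illustrative sketch only; the annotation string below is made up.
DELTA_THRESHOLD = 0.5  # high-probability cutoff quoted in the text
LABELS = ("acceptor_gain", "acceptor_loss", "donor_gain", "donor_loss")


def splice_profile(annotation, threshold=DELTA_THRESHOLD):
    """Return the splice-site effects whose delta score exceeds the threshold.

    `annotation` is assumed to look like
    ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL.
    """
    fields = annotation.split("|")
    scores = [float(value) for value in fields[2:6]]
    return tuple(label for label, score in zip(LABELS, scores) if score > threshold)


if __name__ == "__main__":
    # Hypothetical mutation with a strong predicted acceptor loss.
    print(splice_profile("T|MET|0.01|0.93|0.00|0.02|-12|1|30|-5"))
```

A mutation whose returned tuple is empty would not be carried forward; a tuple with more than one label corresponds to a combined gain/loss profile.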

Computation of splicing products from splicing acceptor or donor gains or losses

We developed 18 novel logic models to compute and assemble the putative aberrant splicing products from splice acceptor and donor site gains and losses predicted by SpliceAI. This non-trivial process calculated all possible aberrant RNA products from each of the splice-site mutation profiles (online supplemental figure 2). For example, a single loss of splice acceptor (online supplemental figure 2A) created three putative aberrant transcripts: (1) if the acceptor loss occurs at the 5’ end of an internal exon, there are potentially two products: one with the retention of the intron before the affected exon, and the other with the skipping of the affected exon assuming the acceptor site of the exon after the affected exon can be utilized by the transcription machinery (online supplemental figure 2A, options 1 and 2); (2) if the acceptor loss occurs at the 5’ end of the last exon of the gene, the consequent product is a transcript with retention of the last intron (online supplemental figure 2A, option 3). We developed algorithms to compute all putative aberrant transcripts resulting from the top seven most prevalent splice-site mutation profiles from TCGA as detailed in online supplemental figure 2A–G and the results.
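To make the single acceptor-loss example concrete, the following Python sketch enumerates the putative products for a gene with numbered exons. It is a simplified illustration, not one of the 18 published logic models: exons and introns are represented only by labels (E1, I1, ...), and the function name is hypothetical.

```python
# Simplified illustration of the single acceptor-loss case described above.
def acceptor_loss_products(n_exons, affected_exon):
    """List putative transcripts (as segment labels) after loss of the
    acceptor site at the 5' end of `affected_exon` (1-based).

    The first exon has no acceptor site, so `affected_exon` must be >= 2.
    """
    if not 2 <= affected_exon <= n_exons:
        raise ValueError("affected_exon must be between 2 and n_exons")

    exons = [f"E{i}" for i in range(1, n_exons + 1)]
    upstream = exons[: affected_exon - 1]

    # Retention of the intron immediately before the affected exon.
    retained = upstream + [f"I{affected_exon - 1}"] + exons[affected_exon - 1:]
    if affected_exon == n_exons:
        # Last exon: intron retention is the only product (option 3).
        return [retained]

    # Internal exon: intron retention or skipping of the affected exon
    # (options 1 and 2), assuming the next acceptor site is usable.
    skipped = upstream + exons[affected_exon:]
    return [retained, skipped]


if __name__ == "__main__":
    for product in acceptor_loss_products(3, 2):
        print("-".join(product))
```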

Identification of de novo aberrant splicing from RNA without underlying DNA splice-site mutations

Only exon-skipping events are considered by SPLICE-neo for de novo aberrant splicing detection without considering intron retention scenarios to minimize false positives. First, SPLICE-neo identifies and quantifies novel junctions. Novel and aberrant splicing junctions are identified from (1) single reads split between two non-adjacent exons (split reads) and/or (2) paired-end reads each aligned to two non-adjacent exons (junction-encompassing reads). To normalize for differences in RNA sequencing depth between samples, junction expression is then quantified by JPM: number of reads supporting a junction, per million of total reads supporting all junctions of all genes in that sample. Second, SPLICE-neo selects and prioritizes novel junctions. For each putative aberrant junction, a z-score is calculated as follows: the numerator of the z-score is the junction JPM from the sample where the novel junction is nominated, minus the mean JPM of the candidate junction calculated from a reference set of tumor RNA sequencing data; and the denominator is the SD of all JPMs of the candidate junction in the reference cohort. We define aberrant junctions as those with a z-score>10 (corresponding to the 99.75 percentile), which indicates that the junction is rare (novel). To remove artifacts and other noise, novel junctions with JPM<1 are discarded. In addition, novel junctions are filtered based on utilization in a sample, defined as the number of reads supporting the novel junction divided by the total number of reads supporting all junctions using the same exon. Utilization estimates the percentage of transcripts with an aberrant junction relative to other canonical and alternative transcripts. Specifically, junctions with utilization <10% are removed to eliminate small subclonal events.
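The three junction filters above (z-score, JPM, and utilization) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SPLICE-neo implementation: the SD is taken as the sample SD (the text does not specify sample versus population), a zero SD in the reference cohort is handled by an arbitrary convention, and all function names and read counts are hypothetical.

```python
from statistics import mean, stdev

Z_CUTOFF = 10.0         # junctions must have z-score > 10
MIN_JPM = 1.0           # junctions with JPM < 1 are discarded
MIN_UTILIZATION = 0.10  # junctions with utilization < 10% are removed


def jpm(junction_reads, total_junction_reads):
    """Reads supporting one junction per million junction-supporting reads."""
    return junction_reads * 1_000_000 / total_junction_reads


def z_score(sample_jpm, reference_jpms):
    """(sample JPM - reference mean) / reference SD for one candidate junction."""
    mu = mean(reference_jpms)
    sd = stdev(reference_jpms)  # assumption: sample SD
    if sd == 0:
        # Assumption: a junction never seen to vary in the reference is
        # treated as maximally rare when the sample exceeds the mean.
        return float("inf") if sample_jpm > mu else 0.0
    return (sample_jpm - mu) / sd


def utilization(novel_reads, exon_junction_reads):
    """Fraction of reads at the shared exon that support the novel junction."""
    return novel_reads / exon_junction_reads


def keep_junction(sample_jpm, reference_jpms, novel_reads, exon_junction_reads):
    """Apply the z-score, JPM, and utilization filters together."""
    return (
        z_score(sample_jpm, reference_jpms) > Z_CUTOFF
        and sample_jpm >= MIN_JPM
        and utilization(novel_reads, exon_junction_reads) >= MIN_UTILIZATION
    )
```

For example, a junction supported by 50 of 10 million junction reads has a JPM of 5; whether it is retained then depends on how rare that value is in the reference cohort and on its share of reads at the donor exon.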

Quantification of the aberrant splicing transcripts

SPLICE-neo uses Salmon36 to quantify transcripts. The FASTA of the aberrant splicing transcripts predicted from both the splice-site mutations and de novo RNA aberrant splicings is added to the CCDS FASTA of all genes. Salmon then quantifies the canonical CCDS and the aberrant splicing transcripts simultaneously by an optimized and iterative expectation and maximization process to allocate an entire read, or a fraction of a read, to each input transcript. Based on the bimodal distribution of the log2 TPM (transcripts per million) values of all transcripts, a transcript is “expressed” with TPM>2^−5. If multiple isoforms containing the aberrant junction are expressed, only the one most highly expressed is selected for neoantigen prediction.
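The expression rule can be written as a small Python sketch, shown here only to make the threshold and the tie-breaking explicit. The function name and the isoform labels are hypothetical, and the TPM values would in practice come from the Salmon quantification.

```python
TPM_THRESHOLD = 2 ** -5  # 0.03125 TPM, i.e. log2(TPM) > -5


def select_isoform(tpm_by_isoform, threshold=TPM_THRESHOLD):
    """Return the most highly expressed isoform above the threshold, or None.

    `tpm_by_isoform` maps an isoform label to its TPM; all isoforms in the
    mapping are assumed to contain the same aberrant junction.
    """
    expressed = {name: tpm for name, tpm in tpm_by_isoform.items() if tpm > threshold}
    if not expressed:
        return None
    return max(expressed, key=expressed.get)
```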

Generation of neoepitopes and removal of wild-type epitopes using a reference peptide database

The selected transcript is translated into protein and a sliding window is used to obtain neoepitopes containing non-wild-type sequences. The window sizes are 8, 9, 10, 11, 12 amino acids (aa) for class-I neoepitopes and 15 aa for class-II epitopes. Neoepitopes matching any translated wild-type CCDS are removed, and the remaining peptides are used in neoantigen prediction.
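A minimal Python sketch of the sliding-window step follows. It assumes that "matching any translated wild-type CCDS" means an exact match to a same-length window of any wild-type protein; the function names and the toy sequences are hypothetical and are not taken from REAL-neo.

```python
CLASS_I_LENGTHS = (8, 9, 10, 11, 12)
CLASS_II_LENGTHS = (15,)


def kmers(protein, k):
    """All length-k windows of a protein sequence."""
    return {protein[i:i + k] for i in range(len(protein) - k + 1)}


def neoepitopes(aberrant_protein, wildtype_proteins,
                lengths=CLASS_I_LENGTHS + CLASS_II_LENGTHS):
    """Map each window size to the peptides absent from every wild-type protein."""
    result = {}
    for k in lengths:
        wildtype = set()
        for protein in wildtype_proteins:
            wildtype |= kmers(protein, k)
        result[k] = kmers(aberrant_protein, k) - wildtype
    return result


if __name__ == "__main__":
    # Toy example: the aberrant protein lacks the residues "KLMN".
    wild = "ACDEFGHIKLMNPQRSTVWY"
    aberrant = "ACDEFGHIPQRSTVWY"
    for length, peptides in neoepitopes(aberrant, [wild]).items():
        print(length, sorted(peptides))
```

In the toy example every retained peptide spans the new junction, because windows lying entirely on one side of it are identical to wild-type sequence and are removed.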

Prediction of binding affinities between neoepitopes and HLAs

Patient-specific HLA genotyping was performed using REAL-neo. The bindings between candidate neoepitopes and HLAs were predicted using REAL-neo. Briefly, six tools were used to predict bindings between class-I HLAs and neoepitopes of 8–12 aa in length: NetMHC37 V.4, NetMHCpan38 V.2.8, SMM,39 SMMPMBEC,40 PickPocket41 V.1.1 and MHCflurry42 V.2.0.1. Three tools were used to predict bindings between class-II HLAs and 15-mer neoepitopes: NetMHCII/SMMalign43 V.1.1, NetMHCII/NNalign44 V.2.2, and NetMHCIIpan45 V.3.1. Because different sets of HLAs were used in model training by the developers of each tool, we focused on the 65 HLA alleles common to all class-I prediction methods and the 24 HLAs common to all class-II prediction methods. Peptides with predicted IC50 binding affinities <500 nM by at least two class-I or class-II tools were retained.
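The retention rule (predicted IC50 <500 nM by at least two tools of the same class) amounts to a simple consensus vote, sketched below in Python. The function name and the IC50 values in the usage note are hypothetical; the predictions themselves would come from the tools listed above, which this sketch does not call.

```python
IC50_CUTOFF_NM = 500.0
MIN_AGREEING_TOOLS = 2


def is_retained(ic50_by_tool, cutoff_nm=IC50_CUTOFF_NM, min_tools=MIN_AGREEING_TOOLS):
    """True if at least `min_tools` predictions fall below the IC50 cutoff.

    `ic50_by_tool` maps a tool name to its predicted IC50 (nM) for one
    peptide-HLA pair; all tools are assumed to be of the same HLA class.
    """
    n_binders = sum(ic50 < cutoff_nm for ic50 in ic50_by_tool.values())
    return n_binders >= min_tools
```

For instance, a peptide with made-up predictions of 120 nM, 450 nM, and 900 nM from three class-I tools would be retained, whereas one with 120 nM, 650 nM, and 900 nM would not.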

T2 in vitro binding assays

T2 cells (ATCC, CRL-1992) were grown in Iscove Modified Dulbecco’s Medium (IMDM) supplemented with 20% Fetal Bovine Serum. On the day of the assay, T2 cells were washed 2X with PBS and plated in serum-free IMDM at a density of 250,000 cells/well in a 12-well plate. Cells were incubated with peptides at 25 µM for 16 hours at 26°C, 5% CO2. All peptides used in the assay were synthesized by GenScript Biotech (Piscataway, New Jersey) with a minimum purity of 85.0%. The Influenza peptide GILGFVFTL, a validated HLA-A*02:01 binder, and peptide HPVGEADYF, an HLA-B*35:01 binder, served as positive and negative controls, respectively. Following incubation, cells were collected, washed 2X with PBS, and resuspended in 100 µL of FACS buffer. Following blocking with TruStain FcX Fc receptor block (BioLegend, San Diego, California), cells were stained with the viability stain FVS780 and anti-HLA-A2 PE (BD Biosciences) and events collected on an Attune NxT flow cytometer. Cells were gated on singlet discrimination and viability prior to acquisition of HLA-A2 signal.

Results

Evidence of aberrant splicing events from both DNA splice-site mutations and de novo RNA splicing

We investigated the well-known occurrence of mesenchymal-epithelial transition (MET) exon-14 skipping46 47 in TCGA lung cancer patients. SpliceAI identified DNA splice-site mutations in eight patients, two with donor site loss at the end of MET exon-13, and six with acceptor site loss at the beginning of exon-14 (figure 1A). As shown in figure 1B, only the six patients with acceptor site loss mutations had expressed exon-14 skipped transcripts confirmed by Salmon. Two patients with donor site loss at the end of exon-13 had no evidence of exon skipping in RNA. SPLICE-neo also identified exon-14 skipping in 12 additional patients who had no DNA splice-site mutations, most likely resulting from de novo RNA aberrant splicings. These results indicate that it is essential to profile aberrant RNA splicing events from both DNA and RNA mutational sources.

Figure 1

Mesenchymal-epithelial transition (MET) gene exon-14 skipping events resulting from both DNA splice-site mutations and de novo aberrant RNA splicings. (A) SpliceAI predicted consequential splice-site mutations in TCGA LUAD and LUSC cases including one mutation in the splicing donor site of exon-13 in two patients (gray arrow) and additional mutations in the splicing acceptor site of exon-14 in six patients (red arrow). (B) Salmon confirmed exon-14 skipping from six samples with exon-14 splice acceptor site mutations (red dots) but not in two samples with exon-13 splicing donor site mutations (gray dots). Salmon confirmed exon-14 skipping in additional 12 samples (blue dots) without DNA splice-site mutations. LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; TCGA, The Cancer Genome Atlas.

The abundance of RNA aberrant splicing

Figure 2 illustrates the per-patient counts of somatic nonsynonymous SNM and INDEL mutations; expressed fusion genes detected in tumor RNA; splice-site mutations detected in tumor DNA and predicted by SpliceAI to have transcriptional consequences; and de novo RNA splicing events detected in tumor RNA without splice-site mutations from TCGA LUSC, LUAD, LIHC, and BRCA patients. While SNMs and INDELs predominate, splicing alterations (with and without underlying DNA splice-site mutations) are comparable in prevalence to expressed fusions and therefore represent a significant source of potential neoantigens since each aberrantly spliced transcript often results in multiple neoantigens, similar to frame-shift INDELs and fusion genes.

Figure 2

Quantity of different mutation types per patient in TCGA cancers. The box plots for number of SNM, INDEL, expressed fusion genes, SpliceAI-predicted DNA splice-site mutations, and de novo aberrant RNA splicing per patient in TCGA breast (BRCA, 1090 patients), lung (LUAD, 515 patients; LUSC, 501 patients), and liver (LIHC, 370 patients) cancer cohorts. BRCA, breast invasive carcinoma; INDEL, small insertions and deletion; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; SNM, single-nucleotide mutation; TCGA, The Cancer Genome Atlas.

Aberrant splicing resulting from DNA splice-site mutations and its contribution to neoantigen load

The splicing donor and acceptor sites consist of sequences that designate cut positions for the spliceosome (figure 3A, green and red arrows, respectively). We screened 11,892 tumors in 33 TCGA cancer types, and 119,675 out of 3,142,532 DNA splice-site mutations were predicted by SpliceAI to have potential transcriptional consequences. As shown in figure 3B, there are 11 distinct profiles of acceptor and donor site gain or loss. The top seven most frequent profiles accounted for 95% of all events in TCGA. Since SpliceAI only evaluates the likelihood of the splice-site mutations to impact splicings without computing the consequent transcripts, we developed 18 logic models to assemble the full-length splicing products resulting from the top seven profiles (figure 3C, and online supplemental figure 2A–G). Figure 3C highlights an example logic model of a single splicing acceptor site loss that generated three different aberrant splicing products as described in the methods. We used Salmon-quantified transcript expression levels and z-scores to prioritize the aberrant isoforms that were likely real with TPM>2^−5 and z-score>8 (corresponding to the 99.75 percentile) (figure 3D). Figure 3E shows the numerical breakdown of types of splice-site mutation profiles, expressed splice-site mutations, and neoantigen-producing splice-site mutations in four TCGA cancers. For example, in BRCA patients SpliceAI identified 539 splice-site mutations that led to an acceptor site gain, of which 90 were expressed/confirmed in RNA sequencing and 72 resulted in neoantigens. Similarly, figure 3F shows the transcript counts according to the type of aberrant splicing alterations.

Figure 3

Identification of neoantigens in aberrant RNAs resulting from DNA splice-site mutations. (A) Schematic showing example donor and acceptor splice variants predicted by SpliceAI for a gene with three exons. Donor and acceptor sites are indicated by green and salmon arrows, respectively, and the mutation positions are indicated by black arrows. SpliceAI identifies mutations in splice sites that result in acceptor gain, acceptor loss, donor gain, or donor loss. (B) Summary table for TCGA splice-site mutation impact predictions by SpliceAI. There are 11 acceptor and donor site impact profiles ranked from the most to the least prevalent in TCGA. The top seven profiles account for 95% of all impactful splice-site changes. (C) An example of how SPLICE-neo’s logic models interpret a profile of a single splicing acceptor site loss. Three aberrant RNA products were predicted by the logical models depending on the position of the impacted acceptor: (1) if the acceptor loss occurs at the 5’ end of an internal exon, there are two products: one with the retention of the intron before the affected exon, and the other with skipping of the affected exon assuming the acceptor site of the exon after the affected exon can be used by the transcription machinery (options 1 and 2); (2) if the acceptor loss occurs at the 5’ end of the last exon of the gene, the consequent product is a transcript with retention of the last intron (option 3). The 7 top profiles and all 18 logical models are detailed in online supplemental figure 2. (D) Prioritization and selection of aberrant RNAs based on both expression (transcripts per million, TPM) and z-score calculated from TCGA cohorts. The horizontal black dashed line denotes the expression threshold (2^−5 TPM) between two modes of TPMs, and the vertical black and red dashed lines denote z-scores of 0 and 8, respectively.
(E) Counts and relative proportions of (from left to right) the top seven impactful splice-site mutation profiles, the top seven profiles with expressed aberrant RNAs based on TPM and z-scores, and the top seven profiles that produced neoantigens. The seven splice-site mutation profiles are color coded as illustrated by the legend to the right side of the figure: DL (donor loss), DG (donor gain), AL (acceptor loss), and AG (acceptor gain). (F) Counts and relative proportions of (from left to right) total predicted, expressed, and neoantigen-producing aberrant splicing RNA from impactful splice-site mutations. Different types of splicing alterations are color coded as indicated by the legend. BRCA, breast invasive carcinoma; INDEL, small insertions and deletion; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; SNM, single-nucleotide mutation; TCGA, The Cancer Genome Atlas.

Aberrant de novo RNA splicing and its contribution to neoantigen load

The aberrant de novo RNA splicing isoforms predicted by SPLICE-neo (figure 4A) were filtered and prioritized using both expression levels and z-scores (figure 4B). Figure 4C illustrates the numerical breakdown of the de novo RNA splicing events that skipped 1, 2, or 3+ exons. Skipping of 3+ exons comprises 26%–33% of the total in each cancer type and represents an important source of neoantigens missed by existing tools.

Figure 4

Identification of neoantigens resulting from de novo aberrant RNA splicing. (A) The SPLICE-neo algorithm. (1) detection and quantification of all de novo junctions. The example illustrates the identification of two potential aberrant junctions between exons 1 and 4 (E1–E4) and E1–E5 by split reads and/or junction-encompassing reads. The JPM plots to the right represent the number of samples with the canonical E1–E2 vs E1–E4 (upper) or E1–E5 (lower) junctions. Compared with the number of samples with the canonical E1–E2 junction, the samples with E1–E4 or E1–E5 junctions are rare. (2) Novel junction selection and filtering. Z-scores of every novel junction are calculated using E1–E4 as an example: the numerator of the Z-score is the junction JPM from the sample where the novel junction is nominated minus the mean JPM of the candidate junction calculated from a reference set of tumors; and the denominator of the Z-score is the SD of JPMs of the candidate junction in all samples from the reference tumor set. We define novel junctions as those with a z-score above 10. Next, to remove junction artifacts supported by weak evidence, novel junctions with JPM<1 are discarded. In addition, junctions are filtered based on utilization percentage, defined as the number of reads supporting the novel junction (E1–E4) divided by the total number of reads supporting all junctions involving the same exon donor, E1. Junctions with utilization <10% are removed. In the case of nested junctions (eg, E1–E4 nesting inside E1–E5), only the innermost junction is retained (E1–E5 will be removed). (3) Expression quantification of RNAs containing aberrant junctions. RNA transcripts containing predicted aberrant junctions are assembled from exhaustive combinations of canonical CCDS isoforms using the upstream portion of the donor exon and downstream portion of the acceptor. Salmon is then used to quantify the expression of all canonical CCDS isoforms as well as the assembled aberrant RNAs. 
If multiple aberrant RNAs are expressed higher than the TPM threshold of 2^−5, only the highest expressed aberrant RNA is retained. (4) Nucleotide and amino acid sequence assembly. The aberrant RNAs with confirmed expression are translated into proteins. (5) Neoepitope kmer generation. Finally, the peptide sequence is cut into kmers using a sliding window of 8–12 and 15 amino acids, and kmers matching any wild-type CCDS reference sequence are removed. (B) Prioritization and selection of aberrant RNAs based on both expression (TPM>2^−5, black horizontal line) and z-score>10 (red vertical line). (C) Counts and relative proportions of de novo aberrant RNA splicing events that skipped one, two, or three or more exons in the four representative TCGA cohorts. BRCA, breast invasive carcinoma; CCDS, Consensus Coding Sequence Database; JPM, junctions per million; INDEL, small insertions and deletion; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; SNM, single-nucleotide mutation; TCGA, The Cancer Genome Atlas.

Comparison of aberrant splicing neoantigens resulting from DNA splice-site mutations or de novo RNA splicings

As shown in figure 5A, aberrant splicings from DNA splicing donor/acceptor gain/loss or from de novo RNA splicings resulted in five categories of splicing events: shifted exon boundaries (2034, 11.3%), intron retention (2449, 13.6%), single-exon skipping (4183 events from DNA splice-site mutations and 6858 from de novo RNA splicing, of which 3501 were detected by both, giving 4183+6858−3501=7540 unique events, 41.8%), double-exon skipping (2517, 14%), and multiple-exon skipping (3497, 19.4%). We identified 3501 single-exon skipping events in common from both DNA splice-site mutations and de novo RNA splicing, 5165 (28.7%) events from splice-site mutations only, and 9371 (52%) events from de novo RNA splicings only. Figure 5B illustrates the number of neoantigen candidates identified by SPLICE-neo from the five categories of splicing events.

Figure 5

Summary of SPLICE-neo-detected aberrant splicing events with and without DNA splice-site mutations. (A) Sankey plot of aberrant splicings from both DNA splicing donor/acceptor gain/loss or de novo RNA splicings without splice-site mutations in five categories: shifted exon boundaries (2034, 11.3%, unique to the DNA splice-site mutation group), intron retention (2449, 13.6%, unique to the splice-site mutation group), single-exon skipping (4183 in the DNA splice-site mutation group and 6858 in the de novo RNA splicing group, 3501 of them shared, for 7540 unique events, 41.8%), two-exon and multiple-exon skipping (2517, 14%; and 3497, 19.4%; unique to the de novo RNA splicing group). There were 3501 single-exon skipping events identified in common from both DNA splice-site mutations and de novo RNA splicing, and 2034+2449+682=5165 (28.7%) splicing events were identified from splice-site mutations only while 3357+2517+3497=9371 (52%) events were only detected from de novo RNA splicings. (B) Number of unique neo-epitopes (in thousands) categorized by event type that gave rise to them.
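The event counts in the caption can be cross-checked with simple arithmetic. The Python sketch below uses only the numbers reported above; the variable names are ours, and the total number of unique events is derived here rather than stated in the source.

```python
# Counts taken from the figure 5A caption.
shifted_boundaries = 2034   # DNA splice-site mutation group only
intron_retention = 2449     # DNA splice-site mutation group only
single_dna_only = 682       # single-exon skipping, DNA group only
single_shared = 3501        # single-exon skipping, found by both sources
single_rna_only = 3357      # single-exon skipping, de novo RNA group only
double_skipping = 2517      # de novo RNA group only
multiple_skipping = 3497    # de novo RNA group only

dna_only = shifted_boundaries + intron_retention + single_dna_only
rna_only = single_rna_only + double_skipping + multiple_skipping
total_unique = dna_only + rna_only + single_shared  # derived, not in the source

# Single-exon skipping per source, and the unique count across sources.
single_dna = single_dna_only + single_shared
single_rna = single_rna_only + single_shared
single_unique = single_dna + single_rna - single_shared

if __name__ == "__main__":
    print("DNA only:", dna_only)
    print("RNA only:", rna_only)
    print("total unique events:", total_unique)
    print("unique single-exon skipping:", single_unique)
```

The shared events must be subtracted once when combining the two single-exon skipping counts; otherwise they are counted twice.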

Recurrent neoantigens from aberrant splicing events

Next, we assessed the recurrence of neoantigens in four TCGA cohorts (figure 6A–D). In BRCA and LIHC, top recurrent neoantigens arose from SNM and INDEL mutations, whereas the majority of recurrent neoantigens in the LUAD and LUSC cohorts arose from aberrant splicings. Overall, the splicing neoantigens had low recurrence in all four cancers (<3%), similar to SNM/INDEL neoantigens. We selected the top 15 splicing neoantigens predicted to bind to HLA-A*02:01 for laboratory validation using in vitro T2 binding assays. All 15 neoantigens were validated with strong binding affinities, and 7 (46.7%) had higher affinities than the positive control influenza virus peptide (figure 6E).

Figure 6

Top recurrent neoantigens from different mutational sources. (A–D) The top 40 most recurrent neoantigens in the BRCA, LUAD, LUSC, and LIHC cohorts. The left y-axis denotes the fraction of samples carrying a given neoantigen. All recurrent neoantigens came from either SNMs/INDELs or aberrantly spliced RNAs. Each plot also includes the cumulative sample fraction (right y-axis) in purple. (E) Experimental validation of selected aberrant splicing neoantigens. Mean fluorescence intensity (MFI) from the in vitro T2 assay of the top 15 neoantigen candidates bound to HLA-A*02:01. The positive control of the assay was the influenza peptide GILGFVFTL, a known HLA-A*02:01 binder. Two negative controls were also included: the peptide HPVGEADYF, an HLA-B*35:01 binder; and an assay without any peptide. The horizontal dashed line represents three SD above the mean of the negative controls. BRCA, breast invasive carcinoma; INDEL, small insertions and deletions; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; SNM, single-nucleotide mutation.
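The two quantities plotted in panels A–D can be illustrated with a short sketch. It assumes that recurrence is the fraction of samples in which a neoantigen is found, and that the cumulative sample fraction is the fraction of samples carrying at least one of the neoantigens ranked so far; the latter is our reading of the legend, not a definition given in the text. The function name, sample identifiers and peptide labels are hypothetical placeholders, not TCGA data.

```python
from collections import Counter


def recurrence_table(sample_to_neoantigens, top_n=40):
    """Return (neoantigen, sample fraction, cumulative sample fraction) rows.

    Neoantigens are ranked by the number of samples carrying them; the
    cumulative fraction counts samples with at least one of the neoantigens
    ranked so far (an assumed definition).
    """
    n_samples = len(sample_to_neoantigens)
    counts = Counter()
    for neoantigens in sample_to_neoantigens.values():
        counts.update(set(neoantigens))

    # Most recurrent first; ties broken alphabetically for reproducibility.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

    covered = set()
    rows = []
    for neoantigen, count in ranked:
        covered |= {
            sample
            for sample, neoantigens in sample_to_neoantigens.items()
            if neoantigen in neoantigens
        }
        rows.append((neoantigen, count / n_samples, len(covered) / n_samples))
    return rows


# Hypothetical toy cohort of four samples.
toy_cohort = {
    "S1": {"PEP_A", "PEP_B"},
    "S2": {"PEP_A"},
    "S3": {"PEP_C"},
    "S4": set(),
}
rows = recurrence_table(toy_cohort)
for row in rows:
    print(row)
```

In this toy cohort the second-ranked peptide adds no new samples to the cumulative fraction because it occurs only in a sample that already carries the top-ranked peptide, which is why a cumulative curve can plateau even when individual recurrences are non-zero.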

Discussion

Aberrant RNA splicings are errors in the normal process of RNA splicing that can lead to the exclusion of specific exons, retention of introns, or shifting of exon boundaries. Because they are abundant in cancer and often result in proteins with altered coding sequences and a high degree of dissimilarity from wild-type proteins, aberrant RNA splicings are a rich source of potent neoantigens.

We developed a novel module, SPLICE-neo, within REAL-neo, our previously published bioinformatics pipeline for neoantigen discovery and prioritization.33 SPLICE-neo is the first tool to comprehensively identify and prioritize splicing neoantigens from both DNA splice-site mutations and de novo RNA aberrant splicings. SPLICE-neo offers two major advances. First, we developed novel logic models that assemble and prioritize full-length aberrant transcripts from DNA splice-site mutations. After surveying 11,892 tumors in 33 TCGA cancers, we identified seven patterns of splice-site mutations that accounted for 95% of the consequential splice-site mutations predicted by SpliceAI. These seven patterns were exhaustively computed into 18 aberrant transcript outcomes using our logic models (as described in online supplemental figure 2), a non-trivial process critical to the functionality of SPLICE-neo. Second, SPLICE-neo can identify exon-skipping events involving more than two exons, which account for a quarter to one-third of all skipping events.
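To make the distinction between single-, double- and multiple-exon skipping concrete, the following minimal sketch classifies a novel exon junction by the number of annotated exons it skips. It is an illustration only and not the SPLICE-neo implementation: it assumes a plus-strand transcript, exons given as ordered (start, end) pairs, and junctions whose two ends coincide with annotated exon boundaries. The function name and coordinates are hypothetical.

```python
def classify_junction(exons, donor, acceptor):
    """Classify a junction by how many annotated exons it skips.

    exons    -- list of (start, end) tuples in transcript order (plus strand)
    donor    -- last base of the upstream exon joined by the junction
    acceptor -- first base of the downstream exon joined by the junction
    Returns (label, number_of_skipped_exons or None).
    """
    exon_by_end = {end: i for i, (_, end) in enumerate(exons)}
    exon_by_start = {start: i for i, (start, _) in enumerate(exons)}

    # Junction ends that do not match annotated boundaries are out of scope
    # for this sketch.
    if donor not in exon_by_end or acceptor not in exon_by_start:
        return ("not classified", None)

    skipped = exon_by_start[acceptor] - exon_by_end[donor] - 1
    if skipped < 0:
        return ("not classified", None)
    if skipped == 0:
        return ("canonical", 0)
    if skipped == 1:
        return ("single-exon skipping", 1)
    if skipped == 2:
        return ("double-exon skipping", 2)
    return ("multiple-exon skipping", skipped)


# Hypothetical five-exon transcript.
toy_exons = [(100, 200), (300, 400), (500, 600), (700, 800), (900, 1000)]

print(classify_junction(toy_exons, 200, 500))  # joins exon 1 to exon 3
print(classify_junction(toy_exons, 200, 900))  # joins exon 1 to exon 5
```

Counting skipped exons by index keeps the classification independent of exon length, so the same rule covers any number of skipped exons rather than only one or two.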

We identified abundant splicing neoantigens in four representative TCGA cancers: BRCA, LUAD, LUSC, and LIHC. In addition to their substantial contribution to neoantigen load, many splicing neoantigens are potent tumor antigens that bind HLA more strongly than the positive control influenza virus peptide, as we demonstrated for several splicing neoantigens identified by SPLICE-neo. We acknowledge that T2 binding assays do not directly evaluate neoantigen immunogenicity. However, using postvaccination peripheral blood mononuclear cells from a patient in the Mayo Clinic neoantigen clinical trial, we demonstrated the accuracy of REAL-neo in predicting immunogenic neoantigen vaccines by an ex vivo IFNγ ELISPOT assay. As shown in online supplemental figure 3, 18 out of 19 neoantigens predicted by REAL-neo resulted in IFNγ+ spot-forming T-cells (neoantigen-specific T-cells).

To assess tumor specificity of the splicing neoantigens, we checked for the presence of the top 15 splicing neoantigens across 24 different normal tissue types using 680 tumor-paired normal tissue RNA sequencing samples from TCGA (online supplemental methods). We found only 1 out of the 15 splicing neoantigens in a single normal stomach tissue RNA-seq sample. The remaining 679 samples showed no expression of any of the 15 neoantigens. This result demonstrates that SPLICE-neo identifies highly tumor-specific neoantigens, which is expected because the algorithms developed for SPLICE-neo were optimized to detect tumor-specific neoantigens from tumor-specific splice-site mutations and aberrant junctions with high z-scores.
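The idea of prioritizing junctions with high z-scores can be sketched as follows. The exact statistic used by SPLICE-neo is not given in this section, so the sketch assumes a plain z-score of each sample's junction read support against the cohort mean and sample SD; the function name and read counts are hypothetical placeholders.

```python
from statistics import mean, stdev


def junction_z_scores(support_by_sample):
    """Z-score of each sample's read support for one junction.

    support_by_sample -- dict mapping sample ID to junction read support.
    Assumes at least two samples; uses the sample SD (an assumption).
    """
    values = list(support_by_sample.values())
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        # No variation across the cohort: no sample stands out.
        return {sample: 0.0 for sample in support_by_sample}
    return {sample: (v - mu) / sd for sample, v in support_by_sample.items()}


# Hypothetical read support for one junction across five tumors.
toy_support = {"T1": 0, "T2": 0, "T3": 0, "T4": 0, "T5": 20}
z = junction_z_scores(toy_support)
top_sample = max(z, key=z.get)
print(top_sample, round(z[top_sample], 3))
```

Because the score is relative to the cohort, a junction supported in most samples yields low z-scores everywhere, which matches the stated assumption that individual aberrant splicing events are rare.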

One limitation of the SPLICE-neo module is that it ignores non-canonical splice sites. Therefore, cryptic exons, microexons and recursive splice sites are not examined or used for neoantigen discovery. We plan to add functionality to assess non-canonical splice sites in future versions of REAL-neo. In addition, we will continue to investigate neoantigen potential from other mutational sources, including expressed viral proteins from viral DNA integrated into the host genome, micropeptides, and aberrant protein translations from intergenic regions. With neoantigens being reported as both therapeutic and recurrence-preventive agents,48–52 the full potential of neoantigen-related treatments can only be realized if all neoantigen sources can be identified from a patient’s tumor.

Data availability statement

Data are available in a public, open access repository. The raw data used to support the findings of this study are publicly available in the TCGA Repository of the Genomic Data Commons (GDC) (https://portal.gdc.cancer.gov), release 29.0.

Ethics statements

Patient consent for publication

Acknowledgments

The results published here are based on raw data generated by the TCGA Research Network: https://www.cancer.gov/tcga.

References

Footnotes

  • DPW and CM contributed equally.

  • Contributors Conceptualization: YWA, KLK and YL. Methodology: CM, EJ and DPW. Investigation: CM, EJ and DPW. Visualization: CM and EJ. Funding acquisition: YWA. Supervision: YWA. Writing–original draft: DPW and YWA. Writing–review and editing: DPW, YWA and EJ. Guarantor: YWA.

  • Funding This work was supported by the Mayo Clinic Center for Individualized Medicine, the Mayo Clinic Comprehensive Cancer Center grant P30CA15083, the GI Research Foundation, and the Casey DeSantis Cancer Research Act Florida Statute 381.922.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.
