Background Determining the specificity of T cells from their receptor sequences is a demanding task because of cross-reactivity, complicated sequence patterns and the limited size of public datasets. An effective computational model that finds CDR3 patterns of shared antigen specificity in existing datasets could predict the specificity of T cells accurately. In this work, we developed a deep learning methodology that computes the similarity among T cells in terms of antigen specificity using k-mer features.
Methods Our model consists of two parts. First, it encodes every overlapping k-mer of a CDR3 into a numerical vector. We parallelize these k-mer encodings in several allowable ways, so that the independent semantics of each k-mer are effectively learned. Second, among the encoded k-mer features, we select only the meaningful k-mers using a self-attention structure. By doing this, we remove unwanted correlations among overlapping k-mers.
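The two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the one-hot encoder stands in for the learned k-mer encoder, the self-attention uses identity projections (queries, keys and values are all the input vectors), and the example sequence, the choice k=3 and all function names are assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def overlapping_kmers(cdr3, k=3):
    """Return every overlapping k-mer of a CDR3 string, left to right."""
    return [cdr3[i:i + k] for i in range(len(cdr3) - k + 1)]


def one_hot_kmer(kmer):
    """Toy encoder (assumption): concatenate one-hot vectors of the residues."""
    vec = []
    for aa in kmer:
        row = [0.0] * len(AMINO_ACIDS)
        row[AMINO_ACIDS.index(aa)] = 1.0
        vec.extend(row)
    return vec


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections (Q = K = V)."""
    d = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all k-mer vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        )
    return outputs


# Illustrative CDR3-like string, not taken from the study's data.
kmers = overlapping_kmers("CASSLGQAYEQYF", k=3)
features = self_attention([one_hot_kmer(km) for km in kmers])
```

In a trained model the attention weights would come from learned projections, so that uninformative k-mers receive low weight; here they only reflect residue overlap between k-mers.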
We train our model on preprocessed public datasets: IEDB, VDJdb and McPAS. We optimize the whole model under a contrastive predictive coding objective, which is an unsupervised objective function. After optimization, we define a kernel function on the k-mer features to measure the similarity between two CDR3s.
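For readers unfamiliar with contrastive objectives, the following Python sketch shows an InfoNCE-style loss, the form commonly used in contrastive predictive coding, together with a cosine kernel. The abstract does not specify the exact loss, kernel or temperature, so all three are assumptions here, and the function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_kernel(u, v):
    """Cosine similarity, used here as an assumed stand-in for the kernel."""
    norm_u, norm_v = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=1.0):
    """InfoNCE: negative log-probability of the positive among all candidates."""
    scores = [cosine_kernel(anchor, positive) / temperature]
    scores += [cosine_kernel(anchor, n) / temperature for n in negatives]
    # log-sum-exp computed stably by subtracting the maximum score
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]
```

The loss is small when the anchor is more similar to its positive than to the negatives, which is the property that lets a kernel on the learned features rank CDR3s by shared specificity.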
Results We designed a one-of-many unsupervised task: given an arbitrary CDR3 sequence, can our model correctly select the CDR3 with a similar specificity from among N randomly sampled candidates? With N=10, our model achieves an accuracy of 0.3 on an independent dataset. We also tested a supervised task: whether our model can infer probable cognate antigens for a given CDR3. Our model achieves a precision of 0.7.
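The one-of-many evaluation can be sketched in Python as follows. This is an assumed reading of the task: each trial has exactly one correct candidate, so random guessing would score 1/N (0.1 for N=10). The shared-k-mer Jaccard similarity is a hypothetical stand-in for the learned kernel, and any sequences passed to it are the caller's own.

```python
def kmer_set(seq, k=3):
    """Set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_similarity(a, b, k=3):
    """Jaccard overlap of k-mer sets (stand-in for the learned kernel)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def pick_candidate(query, candidates, similarity):
    """Index of the candidate most similar to the query (first one on ties)."""
    return max(
        range(len(candidates)),
        key=lambda i: similarity(query, candidates[i]),
    )


def one_of_many_accuracy(trials, similarity):
    """trials: list of (query, candidates, index_of_true_match) tuples."""
    if not trials:
        return 0.0
    correct = sum(
        1
        for query, candidates, true_index in trials
        if pick_candidate(query, candidates, similarity) == true_index
    )
    return correct / len(trials)
```

Passing a different `similarity` callable, such as a kernel on learned k-mer features, leaves the evaluation loop unchanged.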
Conclusions Our deep learning model extracts k-mer information that represents only antigen specificity. This information, encoded as a numerical vector, is valuable for computing similarity of antigen specificity. With it, our model can solve the one-of-many problem and predict antigen specificity. We expect the model's performance to improve as the size of the training dataset grows.