Tutorials
Installation
TCRembedding is a composite of multiple methods for embedding amino acid sequences.It is available on PyPI and can be downloaded and installed via pip:
pip install tcrembedding
Installation Tutorial
1.python venv
Since different methods rely on different runtime environments and there may be version conflicts between the dependent packages, we suggest that you create a virtual environment to use the embedding methods. At the same time, we provide an installation script env_creator.py, the script will be based on different embedding methods, create the corresponding virtual environment. The following is an example of how to use it:
(recommended) Based on Linux , python 3.8.
python env_creator.py <base_dir> <env_name> [--mirror_url=<url>]
base_dir : The base directory where virtual environments will be created.You also need to make sure that the corresponding requirements.txt file is in this directory.The requirements.txt file for each embedding method is available under src/TCRembedding/method_name/.
env_name : The name of the virtual environment.
mirror_url : The mirror URL for pip installations.
Example:
python env_creator.py /media/lihe/TCR/Word2Vec Word2vec --mirror_url=https://pypi.tuna.tsinghua.edu.cn/simple
The command to activate the virtual environment is printed at the end of the script run and the user can run the virtual environment according to the instructions.
Example:
source /media/lihe/TCR/Word2Vec/Word2vec/Word2vec_venv/bin/activate
After entering the virtual environment, use the pip command to install TCRembedding.
pip install tcrembedding
2.conda
In addition to running the env_creator.py script to create virtual environments, you can also create and manage virtual environments via conda.
Example:
conda create --name word2vec python=3.8
conda activate word2vec
pip install -r src/TCRembedding/Word2Vec/requirements.txt
pip install tcrembedding
Note:
If you encounter an issue during installation with the message “cannot import name ‘msvccompiler’ from ‘distutils’,” please first ensure that the setuptools version is below 65.0.0 or use version 65.0.2. Additionally, add the --no-build-isolation parameter when using the pip command.
Data
All the data used in the paper is publicly available, so we suggest readers refer to the original papers for more details. We uploaded the processed data to dataset folder for download.
Usage Tutorial
1.ATM-TCR
1.1 Input file format
Epitope |
CDR3B |
Affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
Note : Epitope is placed in the first column, CDR3 in the second column, and Affinity in the third column. The maximum length of the CDR3 sequence does not exceed 30, and the maximum length of the Epitope sequence does not exceed 20.
1.2 All parameters in class EmbeddingATMTCR
file_path: The file path specifying the location of the data file to be processed.
blosum: BLOSUM (Blocks Substitution Matrix), a scoring matrix used for protein sequence alignment. If provided, it is used to transform sequence data. Default to using the BLOSUM45 matrix.
model_name: The model name, specifying the filename to use when saving or loading the model.
cuda: A boolean indicating whether to use CUDA acceleration (i.e., run the model on a GPU).
seed: The random seed used to ensure repeatability of model training and data splitting.
model_type: The type of model, specifying the model architecture used, for example, “attention” indicates a model using an attention mechanism.
drop_rate: The dropout rate used in the model’s dropout layers to prevent overfitting.
lin_size: The size of the linear layer, specifying the number of neurons in the model’s linear layers.
padding: The type of padding, used for the padding strategy of sequence data, for example, “mid” indicates padding in the middle.
heads: The number of heads in the multi-head attention mechanism, applicable only when using attention models.
max_len_tcr: The maximum length of TCR sequences, used to determine the truncating or padding length of sequences.
max_len_pep: The maximum length of peptide sequences, used to determine the truncating or padding length of sequences.
split_type: The type of data splitting, specifying how the data is split into training and test sets, for example, “random” indicates a random split.
1.3 Example Script
from TCRembedding import get_embedding_instance
EmbeddingATMTCR = get_embedding_instance("EmbeddingATMTCR")
EmbeddingATMTCR.load_data("data/testdata_ATM-TCR.csv")
encode_tcr = EmbeddingATMTCR.embed()
encode_pep = EmbeddingATMTCR.embed_epitode()
print(encode_tcr.shape)
print(encode_pep.shape)
2.catELMo
2.1 Input file format
CDR3b |
Epitope |
Affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
2.2 All parameters in class EmbeddingcatELMo
None
2.3 Example Script
from TCRembedding import get_embedding_instance
embedder = get_embedding_instance("EmbeddingcatELMo")
embedder.load_data("data/testdata_catELMo.csv")
embedder.load_model(weights_file='model/weights.hdf5', options_file='model/options.json')
tcr_embeds = embedder.embed()
print(tcr_embeds.shape)
# if you want to embed epitope, you can set the value of the use_columns parameter in
# load_data() to the column name of the column where the epitope is located.
# then use embed_epitope() to embed.
embedder.load_data("data/testdata_catELMo.csv", use_columns='Epitope')
epi_embeds = embedder.embed_epitope()
print(epi_embeds.shape)
Note: We place models for download at Hugging Face. You need to specify the path to the model via weights_file in load_model() and the path to options file via options_file in load_model() before you run catELMo to embed sequences.Download link:lihe088/TCRembedding at main (huggingface.co).
3.clusTCR
3.1 Inupt file format
CDR3b |
|---|
CAISVAGGPGETQYF |
CASSYGGSPYEQYF |
CATGTQGDQPQHF |
3.2 All parameters in class EmbeddingclusTCR
max_sequence_size (Optional[int]): The maximum sequence length to consider when processing TCR sequences. If specified, it is used to define the maximum length of sequences during the encoding process. Defaults to None, which means the maximum sequence length will be determined dynamically based on the data.
properties (list): A list of properties to use for encoding the TCR sequences. These properties define how TCR sequences are transformed into numerical representations. Defaults to the OPTIMAL list, which is a predefined set of optimal properties.
n_cpus (Union[str, int]): The number of CPUs to use for parallel computing. Can be an integer specifying the exact number of CPUs, or a string for more flexible configurations (e.g., “auto” might denote automatic determination based on available resources). Defaults to 1, meaning processing is not parallelized.
3.3 Example Script
from TCRembedding import get_embedding_instance
encoder = get_embedding_instance("EmbeddingclusTCR")
encoder.load_data("data/testdata_clusTCR.csv")
encode_result = encoder.embed()
print(encode_result.shape)
4.DeepRC
4.1 Input file format
CDR3b |
Epitope |
Affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
4.2 All parameters in class EmbeddingDeepRC
None
4.3 Example Script
from TCRembedding import get_embedding_instance
encoder = get_embedding_instance("EmbeddingDeepRC")
encoder.load_data("data/testdata_DeepRC.csv", use_columns="CDR3b")
encode_result = encoder.embed()
print(encode_result.shape)
5.DeepTCR
5.1 Input file format
CDR3b |
Epitope |
Affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
5.2 All parameters in class EmbeddingDeepTCR
None
5.3 Example Script
from TCRembedding import get_embedding_instance
encoder = get_embedding_instance("EmbeddingDeepTCR")
# If you have a pre-trained model, you can specify the path to the folder where the model is
# located by using the model_folder_name parameter of the load_model() function.
encoder.load_model(train_data_directory='Data/data', model_folder_name="Test_Model", Load_Prev_Data=True)
encoder_result = encoder.embed(encode_data_directory="Data/data")
print(encoder_result.shape)
Note: The parameter “encode_data_directory” in embed() specifies the path of the input file, but the embed() method will take all the files under the path as input files.
We place models for download at Hugging Face. You need to specify the path to the folder where the model is located before you run DeepTCR to embed sequences.Download link:lihe088/TCRembedding at main (huggingface.co)
6.ERGO-II
6.1 Input file format
CDR3b |
Epitope |
Affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
6.2 All parameters in class EmbeddingERGO
tcr_encoding_mode : “AE”/”LSTM”
6.3 Example Script
from TCRembedding import get_embedding_instance
from TCRembedding.ERGOII.data_loader import Data_Loader
encoder = get_embedding_instance("EmbeddingERGO")
encoder.tcr_encoding_model = "AE"
encoder.load_model(model_path="Models/version_10ve/checkpoints", args_path="Models/10ve/meta_tags.csv")
data_loader = Data_Loader()
tcrb_list, peptide = data_loader.collate("data/testdata_ERGO-II.csv", "AE")
tcrb_batch, pep_batch = encoder.forward(tcrb_list, peptide)
tcrb_encoding, pep_encoding = encoder.embed(tcrb_batch, pep_batch)
tcr_encode_result = tcrb_encoding.detach().numpy()
pep_encode_result = pep_encoding.detach().numpy()
print(tcrb_encoding.shape)
print(pep_encoding.shape)
Note: We place models for download at Hugging Face. You need to specify the path to the model via model_path of load_model() before you run catELMo to embed sequences.Download link:https://huggingface.co/lihe088/TCRembedding/tree/main/ERGOII.
7.ESM
7.1 Input file format
The input file format of the ESM model is .fasta, so we have to process the sequences that need to be encoded into the .fasta file format.
Example:
The contents of a csv file are as follows
CDR3b |
Epitope |
Affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
First,we extract the columns that need to be coded from this csv file, excluding the table headers, and save it as a text file where each row is a tcr sequence.
tail -n +2 filename.csv | awk '{print $1}' > filename.tcr
Then,convert tcr text file to fasta format.
index=1;for i in `cat filename.tcr` ;do echo '>'$index && echo $i && let index++ ;done > filename.fasta
7.2 All parameters in class EmbeddingESM
toks_per_batch (int): The number of tokens per batch. This parameter controls how many tokens are processed in a single batch during the embedding generation process. Default value is 4096.
repr_layers (list): A list of integers indicating the layers of the model from which representations will be extracted. The layers are indexed starting from 0, with -1 representing the last layer. Default value is [-1], meaning only the last layer’s representations are extracted.
include (list): A list of string identifiers indicating which type of sequence representations to include. Currently supports [“mean”], which computes the mean of the embeddings across the sequence length. Default is [“mean”].
truncation_seq_length (int): The maximum sequence length. Sequences longer than this will be truncated to this length. It’s important for managing memory usage and computational efficiency. Default value is 1022.
nogpu (bool): A flag indicating whether to force the model to run on CPU even if a GPU is available. Setting this to True can be useful for debugging or environments without a GPU. Default value is False.
7.3 Example Script
from TCRembedding import get_embedding_instance
encoder = get_embedding_instance("EmbeddingESM")
model_location = "esm1b_t33_650M_UR50S"
fasta_file = "data/IEDB_uniqueTCR_top10_filter.fasta"
encoder.toks_per_batch = 2048
encoder.repr_layers = [33]
encoder.include = ["mean"]
encoder.truncation_seq_length = 1022
encoder.nogpu = True
encode_result = encoder.run(model_location, fasta_file)
print(encode_result.shape)
8.GIANA
8.1 Input file format
CDR3b |
Epitope |
Affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
8.2 All parameters in class EmbeddingGIANA
None
8.3 Example Script
from TCRembedding import get_embedding_instance
import numpy as np
encoder = get_embedding_instance("EmbeddingGIANA")
encoder.read_csv("data/testdata_GIANA.csv", use_columns="CDR3b")
encoder.load_model()
vectors = encoder.embed()
encode_result = np.vstack(vectors)
print(encode_result.shape)
9.ImRex
9.1 Input file format
CDR3b |
Epitope |
Affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
9.2 All parameters in class EmbeddingImRex
cdr3_range : The minimum and maximum desired cdr3 sequence length.
create_neg_dataset : Whether to create negatives by shuffling/sampling, by default True.
Note: Should always be set to False when evaluating a dataset that already contains negatives.
9.3 Example Script
from TCRembedding import get_embedding_instance
import numpy as np
encoder = get_embedding_instance("EmbeddingImRex")
encoder.load_data("data/testdata_ImRex.csv")
encode_result = encoder.embed()
iter_tf_dataset = iter(encode_result)
paired_map_list = []
for item in iter_tf_dataset:
paired_map, affinity = item
paired_map_list.append(paired_map.numpy())
encode_result = np.stack(paired_map_list)
print(encode_result.shape)
10.iSMART
10.1 Input file format
CDR3b |
Epitope |
Affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
10.2 All parameters in class EmbeddingiSMART
None
10.3 Example Script
from TCRembedding import get_embedding_instance
encoder = get_embedding_instance("EmbeddingiSMART")
encoder.load_data("data/testdata_iSMART.csv", use_columns="CDR3b")
encoder.load_model()
encode_result = encoder.embed()
print(encode_result.shape)
11.Luu et al.
11.1 Input file format
CDR3 |
antigen.epitope |
|---|---|
EAAGIGILTV |
CASSLGNEQF |
EAAGIGILTV |
CASSLGVATGELF |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
11.2 All parameters in class EmbeddingLuuEtAl
tcr_pad_length : The padding length for TCR sequences. This defines the fixed length to which all TCR sequences will be padded or truncated, ensuring consistent input size. Default value is 30.
ep_pad_length : The padding length for epitope sequences. This sets the fixed length to which all epitope sequences will be padded or truncated, ensuring consistent input size. Default value is 20.
11.3 Example Script
from TCRembedding import get_embedding_instance
EmbeddingLuuEtAl = get_embedding_instance("EmbeddingLuuEtAl")
EmbeddingLuuEtAl.load_data('data/testdata_Luu_et_al.csv')
TCR_encode_result, epitope_encode_result = EmbeddingLuuEtAl.embed()
print(TCR_encode_result.shape)
print(epitope_encode_result.shape)
12.NetTCR2.0
12.1 Input file format
CDR3b |
Epitope |
binder |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
12.2 All parameters in class EmbeddingNetTCR2
None
12.3 Example Script
from TCRembedding import get_embedding_instance
embedding = get_embedding_instance("EmbeddingNetTCR2")
embedding.load_data(file_path="data/testdata_NetTCR-2.0.csv")
embedding_data = embedding.embed(header='CDR3b')
print(embedding_data.shape)
13.pMTnet
13.1 Input file format
CDR3b |
Epitope |
affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
13.2 All parameters in class EmbeddingpMTnet
None
13.3 Example Script
from TCRembedding import get_embedding_instance
encoder = get_embedding_instance("EmbeddingpMTnet")
# If you have a pre-trained model, you can specify the path to the folder where the
# model(.h5) is located by using the model_dir parameter of the embed() function.
TCR_encoded_matrix, antigen_array = encoder.embed("data/testdata_pMTnet.csv", model_dir='library/h5_file')
print(TCR_encoded_matrix.shape)
print(antigen_array.shape)
Note: We place models for download at Hugging Face. Download link:lihe088/TCRembedding at main (huggingface.co)
14.SETE
14.1 Input file format
CDR3 |
Epitope |
affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
14.2 All parameters in class EmbeddingSETE
None
14.3 Example Script
from TCRembedding import get_embedding_instance
encoder = get_embedding_instance("EmbeddingSETE")
encoder.load_data("data/testdata_SETE.csv")
X, y, kmerDict = encoder.embed(k=3) # Only X is encoded.
print(X.shape)
15.TCRanno
15.1 Input file format
CDR3b |
Epitope |
affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
15.2 All parameters in class EmbeddingTCRanno
None
15.3 Example Script
from TCRembedding import get_embedding_instance
embedder = get_embedding_instance("EmbeddingTCRanno")
embedder.load_model(model_path = None) ## set model_path=None to use the default model (provided by TCRanno)
embedder.load_data(file_path='data/testdata_TCRanno.csv', column_name='CDR3b')
X = embedder.embed()
print(X.shape)
16.TCRGP
16.1 Input file format
CDR3b |
Epitope |
affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
16.2 All parameters in class EmbeddingTCRGP
None
16.3 Example Script
from TCRembedding import get_embedding_instance
EmbeddingTCRGP = get_embedding_instance("EmbeddingTCRGP")
filepath = "data/testdata_TCRGP.csv"
epitope = 'ATDALMTGY' # epitope name in datafile, ignore if balance control is False
EmbeddingTCRGP.datafile = filepath
embedded_data = EmbeddingTCRGP.embed(epitope,dimension=1)
print(embedded_data.shape)
17.TITAN
17.1 Input file format
CDR3b |
Epitope |
affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
17.2 All parameters in class EmbeddingTCRGP
model_path (str): The path to the directory where the model is stored. This path is used to load the model for further operations such as training, evaluation, or inference. Default value is ‘TITAN_model’, which assumes there is a folder named ‘TITAN_model’ in the current directory that contains the model files.
params_filepath (str): The path to the JSON file containing model parameters. These parameters are essential for initializing the model with the correct configuration before loading its weights. Default value is ‘TITAN_model/model_params.json’, indicating the model parameters file is located inside the ‘TITAN_model’ directory.
17.3 Example Script
from TCRembedding import get_embedding_instance
encoder = get_embedding_instance("EmbeddingTITAN")
encoder.load_data("data/testdata_TITAN.csv", use_columns="CDR3b")
encoder.load_model()
TCR_encode_result = encoder.embed()
epi_encode_result = encoder.embed_epi()
print(TCR_encode_result.shape)
18.Word2Vec
18.1 Input file format
CDR3b |
Epitope |
affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
18.2 All parameters in class EmbeddingWord2Vec
vector_size (int): The size of the vector to be learnt.
model_type : The context which will be used to infer the representation of the sequence. If :py:obj:~immuneML.encodings.word2vec.model_creator.ModelType.ModelType.SEQUENCE is used, the context of a k-mer is defined by the sequence it occurs in (e.g. if the sequence is CASTTY and k-mer is AST, then its context consists of k-mers CAS, STT, TTY) If :py:obj:~immuneML.encodings.word2vec.model_creator.ModelType.ModelType.KMER_PAIR is used, the context for the k-mer is defined as all the k-mers that within one edit distance (e.g. for k-mer CAS, the context includes CAA, CAC, CAD etc.). Valid values for this parameter are names of the ModelType enum.
k (int): The length of the k-mers used for the encoding.
epochs (int): for how many epochs to train the word2vec model for a given set of sentences (corresponding to epochs parameter in gensim package)
window (int): max distance between two k-mers in a sequence (same as window parameter in gensim’s word2vec)
18.3 Example Script
from TCRembedding import get_embedding_instance
import numpy as np
from gensim.models.callbacks import CallbackAny2Vec
class LossLogger(CallbackAny2Vec):
"""Callback to log loss after each epoch, differentiated by kmer batch."""
def __init__(self, log_file):
self.epoch = 0
self.kmer_index = 0
self.log_file = log_file
self.loss_previous_step = 0
def on_epoch_end(self, model):
loss = model.get_latest_training_loss()
loss_this_step = loss - self.loss_previous_step
self.loss_previous_step = loss
self.epoch += 1
with open(self.log_file, 'a') as f:
f.write(f'Kmer {self.kmer_index}, Epoch {self.epoch}: Loss {loss_this_step}\n')
def reset_epoch(self, kmer_index):
"""Resets the epoch count and previous loss when a new kmer batch starts."""
self.epoch = 0
self.kmer_index = kmer_index
self.loss_previous_step = 0
encoder = get_embedding_instance("EmbeddingWord2Vec")
# If there is a pre-trained model, you can specify the model path via the
# pretrained_word2vec_model parameter
# encoder.pretrained_word2vec_model = 'model path'
encoder.load_data("data/testdata_Word2Vec.csv", use_columns='CDR3b')
encode_result = encoder.embed()
encode_result = np.vstack(encode_result)
print(encode_result.shape)
Note:We place models for download at Hugging Face. Download link:lihe088/TCRembedding at main (huggingface.co)
19.TCRpeg
19.1 Input file format
CDR3b |
Epitope |
affinity |
|---|---|---|
EAAGIGILTV |
CASSLGNEQF |
1 |
EAAGIGILTV |
CASSLGVATGELF |
1 |
EAAGIGILTV |
CASSQEEGGGSWGNTIYF |
1 |
19.2 All parameters in class EmbeddingTCRpeg
hidden_size (int): The number of features in the hidden state of the model.
num_layers (int): The number of recurrent layers in the model.
embedding_path (str): The path to the embedding file to be used by the model.
model_path (str): The path to the pre-trained model file (.pth) to load.
device (str): The device to run the model on (‘cpu’ or ‘cuda:index’).
load_data (bool): Flag to indicate whether to load data upon initialization. Typically false when embedding is the main purpose.
19.3 Example Script
from TCRembedding import get_embedding_instance
encoder = get_embedding_instance("EmbeddingTCRpeg")
encoder.model_path = "tcrpeg/models/tcrpeg.pth"
encoder.embedding_path = "tcrpeg/data/embedding_32.txt"
encoder.load_data(file_path="data/testdata_TCRpeg.csv", use_columns="CDR3b")
encode_result = encoder.embed()
print(encode_result.shape)
Note: We place models for download at Hugging Face. You need to specify the path to the model before you run TCRpeg to embed sequences.Download link:lihe088/TCRembedding at main (huggingface.co).
Citation
“Citation”
Contact
“Contact”