The Antibody annotator annotates your input data against the reference sequences in the Reference Database. This pipeline is suitable for analyzing both NGS and Sanger-type data.
To run the Antibody annotator, select a file in your folder, then select Antibody annotator in the Annotation dropdown.
The following outlines the steps to successfully carry out an analysis using the Antibody annotator and how each section and option works.
Reference sequences and reads
In the NGS analysis popup window, select a reference database from the Reference Database dropdown. To learn more about reference databases, please refer to the following article.
This option allows you to specify custom nucleotide features, such as His-tags, that will be annotated on each sequence. To learn more about annotation of additional features, please refer to the following article.
Chain type and Order
To improve performance, specify the expected chain by selecting a chain type in the Unpaired reads are dropdown and the chain order in the Paired reads are dropdown.
In the Antibody annotator pipeline, the reference database is split into heavy and light sections. By specifying the expected chain, only the appropriate database section is used, improving performance and potentially accuracy too.
- Unpaired reads are - This specifies the expected chain type. If you expect your sequences to contain both heavy and light chains, select the Both chains value in the dropdown. If you are analyzing scFv sequences, select the Both chains with linker (scFv) value. If you do not know the chain type, use the default Unknown chain value.
If one read in each pair is heavy and the other is light, and they are in a known order, then specifying the order means only the appropriate database section is used, improving performance and potentially accuracy too. No matter what value is specified here, when each read in a pair is from a different chain, the two reads are grouped together in the output. Paired reads from the same chain are not grouped together in the output; merge these prior to annotation instead.
- Paired reads are - This specifies the expected chain order. If your paired read starts with a light chain followed by a heavy chain, select the Light then heavy chain value in the dropdown. If you do not know the chain order of your sequences, use the default Unknown order value.
Gene classification and annotation scheme
When multiple genes are equally close to a query sequence, there are three ways to handle the tie when counting the number of sequences that match each gene. For example, if a query sequence is equally close to IGHD1-26 and IGHD2-15, each option below handles it differently:
- Each gene with partial frequency - This will assign that query sequence to all matching genes with partial frequency. Based on the example above, this sequence will add 0.5 to IGHD1-26's and 0.5 to IGHD2-15's total number of sequences.
- Group of genes - This will create a separate entry in the list of genes that represents this combination of genes. Based on the example above, this sequence will contribute 0 towards the total for each of IGHD2-15 and IGHD1-26, and instead add 1 to the total for a gene called "IGHD1-26/IGHD2-15".
- Unknown - This will treat this as an unknown gene. Based on the example above, this sequence will add nothing to the totals of IGHD1-26 and IGHD2-15.
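The three counting options above can be illustrated with a minimal sketch. This is not the pipeline's implementation: the function name `tally_genes`, the mode names, and the use of an "Unknown" entry to hold tied sequences are assumptions made for illustration.

```python
from collections import Counter


def tally_genes(assignments, mode):
    """Count sequences per gene under one of three tie-handling modes.

    assignments: one non-empty list per query sequence, holding the
                 gene(s) that are equally close to that sequence.
    mode: "partial", "group", or "unknown" (hypothetical names).
    """
    if mode not in ("partial", "group", "unknown"):
        raise ValueError(f"unsupported mode: {mode!r}")

    totals = Counter()
    for genes in assignments:
        genes = sorted(set(genes))
        if len(genes) == 1:
            # An unambiguous match always adds 1 to that gene.
            totals[genes[0]] += 1
        elif mode == "partial":
            # Split one count evenly across all tied genes.
            for gene in genes:
                totals[gene] += 1 / len(genes)
        elif mode == "group":
            # Count the tie under a combined entry such as "A/B".
            totals["/".join(genes)] += 1
        else:
            # Assumption: tied sequences are tallied under "Unknown".
            totals["Unknown"] += 1
    return dict(totals)
```

For two sequences, one tied between IGHD1-26 and IGHD2-15 and one matching only IGHD1-26, the "partial" mode gives IGHD1-26 a total of 1.5 and IGHD2-15 a total of 0.5, while the "group" mode gives 1 to IGHD1-26 and 1 to "IGHD1-26/IGHD2-15".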
The bundled human and mouse immunoglobulin databases may contain a number of pseudogenes and ORF genes along with fully functional genes. You can choose to include the Pseudogenes from database and/or the ORF genes from database in the analysis, as mutations in these genes might be corrected prior to expression. If these are not included, each sequence is classified according to the most closely related functional gene in the database.
Gene difference annotation
To annotate and view the differences between your input sequences and the reference sequences, select the Annotate germline gene differences option. When this option is selected, nucleotide and amino acid differences are annotated on your input sequences.
The database is assumed to contain IMGT style annotations. To change the annotation scheme from IMGT to Kabat, select Kabat in the Results Annotation Scheme dropdown.
Note that the Kabat style results are produced by adjusting the IMGT CDR end points.
Liability and clustering
Liability and asset search
To search for and score motifs that are liable to post-translational or other modifications, as well as beneficial motifs, select the Find liabilities and assets option. The Antibody annotator pipeline has a default set of sequence liability checks. To learn more about configuring this option, please refer to the following article.
To define your preferred cluster combination of regions or genes, enter a cluster name, followed by a colon, followed by a comma-separated list of the genes or regions to combine, in the text box. To learn more about configuring this option, please refer to the following article.
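The "name, colon, comma-separated list" format can be sketched as a small parser. This is only an illustration of the format as described here, not the application's own parsing code. The function name `parse_cluster_spec` and the region names in the usage note are hypothetical, and the sketch handles a single specification.

```python
def parse_cluster_spec(text):
    """Split 'Name: item1, item2' into (name, [item1, item2]).

    Raises ValueError if the colon, the name, or the list is missing.
    """
    name, colon, members = text.partition(":")
    if not colon:
        raise ValueError("expected a cluster name followed by a colon")

    name = name.strip()
    # Drop surrounding whitespace and ignore empty entries such as "a,,b".
    items = [item.strip() for item in members.split(",") if item.strip()]

    if not name:
        raise ValueError("cluster name is empty")
    if not items:
        raise ValueError("no genes or regions listed after the colon")
    return name, items
```

For example, `parse_cluster_spec("My cluster: Region A, Region B")` returns the name `"My cluster"` and the list `["Region A", "Region B"]`. Use the gene and region names that appear in your own results.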
Sequence trimming and result clustering
To trim the non-annotated regions upstream and downstream of your annotated sequences, select the Trim each side of fully annotated region if over option and specify the number of nucleotides to leave untrimmed on each side.
- The default is set to 10. This means that your annotated sequence will have 10 base pairs flanking the 5' and 3' ends of the sequence.
- For single-chain sequences, if the start of FR1 is untruncated, the start is trimmed. However, if the V-Gene starts before FR1, trimming starts outside of the V-Gene. Similar rules are applied for FR4.
- For scFv data, this applies to the first FR1 and the last FR4.
Note that trimming applies only to fully annotated sequences; sequences that the NGS operation classifies as not fully annotated are left untrimmed.
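The trimming behavior can be sketched as follows, assuming the annotated region is given as 0-based coordinates with an exclusive end. The function name `trim_flanks` and its parameters are hypothetical. The sketch ignores the FR1, FR4, and V-Gene rules above and shows only the flank arithmetic.

```python
def trim_flanks(sequence, region_start, region_end, max_flank=10,
                fully_annotated=True):
    """Keep at most max_flank bases on each side of the annotated region.

    region_start is inclusive and region_end is exclusive (0-based).
    Sequences that are not fully annotated are returned unchanged.
    """
    if not fully_annotated:
        return sequence
    if not 0 <= region_start <= region_end <= len(sequence):
        raise ValueError("annotated region lies outside the sequence")

    # A flank shorter than max_flank is kept whole: clamp to the ends.
    keep_from = max(0, region_start - max_flank)
    keep_to = min(len(sequence), region_end + max_flank)
    return sequence[keep_from:keep_to]
```

With the default of 10, a sequence with 25 unannotated bases upstream and 4 downstream keeps 10 bases upstream and all 4 downstream.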
NGS datasets are generally large. To restrict which results are clustered, select one or more of the following options:
- Only cluster results with asset and liability score of at least - This will cluster the sequences based on the score specified. For example, if you specify a score of 1000, only sequences that have a liability and asset score of 1000 or more will be included in the clusters.
- Only cluster results which are - This will cluster only the sequences that are Fully annotated, Fully annotated and in frame, or Fully annotated, in frame, and without stop codon. For example, if you choose to cluster Fully annotated and in frame sequences, only sequences that are both fully annotated and in frame are clustered. Sequences that are not fully annotated or that are frameshifted are excluded from the clustering operation.
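The two clustering filters can be combined as in the sketch below. The `Result` record, its field names, and the function `eligible_for_clustering` are hypothetical. The sketch assumes each level adds one requirement to the previous one, so stop codons are checked only at the strictest level.

```python
from dataclasses import dataclass

# Dropdown values, ordered from least to most strict.
LEVELS = (
    "Fully annotated",
    "Fully annotated and in frame",
    "Fully annotated, in frame, and without stop codon",
)


@dataclass
class Result:
    name: str
    fully_annotated: bool
    in_frame: bool
    has_stop_codon: bool
    score: float  # liability and asset score


def eligible_for_clustering(result, level=None, min_score=None):
    """Return True if the result passes every filter that is enabled."""
    if level is not None and level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    if min_score is not None and result.score < min_score:
        return False
    if level is None:
        return True
    # Each level requires its own check plus all less strict ones.
    checks = (
        result.fully_annotated,
        result.in_frame,
        not result.has_stop_codon,
    )
    return all(checks[: LEVELS.index(level) + 1])
```

Under these assumptions, a fully annotated, in-frame result with a score of 1200 passes a minimum score of 1000 at the "Fully annotated and in frame" level. If it contains a stop codon, it fails only at the strictest level.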