The Antibody annotator annotates your input data against the reference sequences in the Reference Database. This pipeline is suitable for the analysis of both NGS and Sanger-type data.
To run the Antibody annotator, select a file in your folder, then choose Antibody annotator from the Annotation dropdown.
The following outlines the steps to successfully carry out an analysis using the Antibody annotator and how each section and option works.
Reference sequences and reads
In the NGS analysis popup window, select a reference database from the Reference Database dropdown. To learn more about reference databases, please refer to the following article.
This option allows you to specify custom nucleotide features, such as HisTags, that will be annotated on each sequence. To learn more about annotation of additional features, please refer to the following article.
The Selected sequences are option specifies the expected chain type(s) found in the input sequences.
- Single chain (either heavy or light) tells the annotator to only look for one chain in each sequence. The annotator will determine whether each sequence more closely matches the Heavy or the Light chains in your database.
- In the Antibody annotator pipeline, the reference database is split into heavy and light sections. By specifying the expected chain (either all Heavy or all Light), only the appropriate database section is used. This improves performance and potentially accuracy too.
- When selecting "Single chain" options, you can send paired reads as input, for example paired reads that had no overlap and could not be merged. The Antibody annotator will attempt to use both reads in a pair to generate a single V(D)J region. If successful, there will be a linker section of amino acid ambiguities joining the two ends.
- If you expect your sequences to have both heavy and light chains in the same sequence (scFv sequences), select either the Both chains in a single sequence or the Both chains in a single sequence with linker (scFv) value in the dropdown. These two options behave very similarly.
- If you have two chains in separate sequences that you have associated together using the Pair Heavy/Light chains operation, select the Both chains in associated sequences option to ensure that these separate sequences are analysed together. For more information on associating Heavy and Light chains, see here.
- Single Heavy or Light chain sequences will be annotated as usual with this option, however they will be marked as "Not Fully Annotated" in the analysis, as only one chain could be found.
Gene classification and annotation scheme
When multiple genes are equally close to a query sequence, there are three ways to handle the tie when counting the number of sequences that match each gene. For example, if a query sequence is equally close to IGHD1-26 and IGHD2-15, each option below will handle this differently:
- Each gene with partial frequency - This will assign that query sequence to all matching genes with partial frequency. Based on the example above, this sequence will add 0.5 to IGHD1-26's and 0.5 to IGHD2-15's total number of sequences.
- Group of genes - This will create a separate entry in the list of genes that represents this combination of genes. Based on the example above, this sequence will contribute 0 towards the total for each of IGHD2-15 and IGHD1-26, and instead add 1 to the total for a gene called "IGHD1-26/IGHD2-15".
- Unknown - This will treat this as an unknown gene. Based on the example above, this sequence will add nothing to the totals of IGHD1-26 and IGHD2-15.
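The three counting options above can be illustrated with a minimal sketch. The function name `tally_genes`, the mode strings, and the "Unknown" label used as a bucket are assumptions made for illustration only; they are not the annotator's actual implementation or output labels.

```python
from collections import Counter
from fractions import Fraction


def tally_genes(best_matches, mode):
    """Count sequences per gene.

    best_matches: one list per query sequence, holding the gene names
                  that are tied as the closest match (hypothetical input).
    mode: "partial", "group", or "unknown" (hypothetical option names).
    """
    counts = Counter()
    for genes in best_matches:
        genes = sorted(set(genes))
        if not genes:
            continue  # no match at all; not covered by this sketch
        if len(genes) == 1:
            counts[genes[0]] += 1  # unambiguous match
        elif mode == "partial":
            # Each tied gene receives an equal share of one sequence.
            for gene in genes:
                counts[gene] += Fraction(1, len(genes))
        elif mode == "group":
            # The tie becomes its own entry, e.g. "IGHD1-26/IGHD2-15".
            counts["/".join(genes)] += 1
        elif mode == "unknown":
            # Assumption: ties are collected under an "Unknown" label.
            counts["Unknown"] += 1
        else:
            raise ValueError(f"unsupported mode: {mode!r}")
    return counts


matches = [["IGHD1-26", "IGHD2-15"], ["IGHD1-26"]]
for mode in ("partial", "group", "unknown"):
    print(mode, dict(tally_genes(matches, mode)))
```

In this sketch, only the first sequence is ambiguous, so the three modes differ only in how that one sequence is counted.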
The bundled human and mouse immunoglobulin databases may contain a number of pseudogenes and ORF genes along with fully functional genes. You can choose to include the Pseudogenes from database and/or the ORF genes from database in the analysis, as mutations in these genes might be corrected prior to expression. If these are not included, each sequence will be classified according to the most closely related functional gene in the database.
Gene difference annotation
To annotate and see the differences between your input sequences and reference sequences, select the Annotate germline gene differences option. When this option is selected, the nucleotide and amino acid differences are annotated on your input sequences.
The database is assumed to contain IMGT-style annotations. To change the annotation scheme from IMGT to Kabat, select Kabat in the Results Annotation Scheme dropdown.
Note that the Kabat-style results are produced by adjusting the end points of the IMGT CDRs.
Liability and clustering
Liability and asset search
To search for and score motifs liable to post-translational or other types of modifications, as well as beneficial motifs, select the Find liabilities and assets option. The Antibody annotator pipeline has a default set of sequence liability checks. To learn more about configuring this option, please refer to the following article.
To specify your preferred cluster combination of regions or genes, enter a cluster name, followed by a colon, followed by a comma-separated list of the genes or regions to combine in the text box. To learn more about configuring this option, please refer to the following article.
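To make the expected shape of a cluster combination entry concrete, here is a minimal parsing sketch of the "name: item, item" format described above. The function `parse_cluster_spec` and the example names ("Heavy CDRs", "Heavy CDR1", and so on) are hypothetical and chosen only for illustration; the exact region and gene names the annotator accepts are covered in the linked article, not here.

```python
def parse_cluster_spec(spec):
    """Split 'name: a, b, c' into (name, [a, b, c]).

    Illustrative only: the annotator may apply stricter or looser rules.
    """
    name, colon, members = spec.partition(":")
    name = name.strip()
    if not colon or not name:
        raise ValueError("expected '<cluster name>: <item>, <item>, ...'")
    # Drop surrounding whitespace and ignore empty entries such as 'a,,b'.
    items = [m.strip() for m in members.split(",") if m.strip()]
    if not items:
        raise ValueError("expected at least one gene or region after the colon")
    return name, items


# Hypothetical entry combining three regions into one named cluster.
name, items = parse_cluster_spec("Heavy CDRs: Heavy CDR1, Heavy CDR2, Heavy CDR3")
print(name, items)
```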
Sequence trimming and result clustering
To trim the non-annotated regions upstream and downstream of your annotated sequences, select the Trim each side of fully annotated region if over option and specify the number of nucleotide base pairs to leave untrimmed.
- The default is set to 10. This means that your annotated sequence will have 10 base pairs flanking the 5' and 3' ends of the sequence.
- For single chain sequences, if the start of FR1 is untruncated, the start is trimmed. However, if the V-Gene starts before FR1, trimming starts outside of the V-Gene. Similar rules are applied for FR4.
- For scFv data, this applies to the first FR1 and the last FR4.
Note that trimming only applies to fully annotated sequences. Sequences that are classified as not fully annotated by the NGS operation are left untrimmed.
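The basic effect of the trimming option can be sketched as follows. This is a simplified illustration under stated assumptions: the function `trim_flanks` is hypothetical, coordinates are 0-based with an exclusive end, and the sketch does not model the FR1, FR4, or V-Gene boundary rules described above.

```python
def trim_flanks(seq, region_start, region_end, keep=10, fully_annotated=True):
    """Keep at most `keep` bases on each side of the annotated region.

    region_start / region_end: 0-based, end-exclusive coordinates of the
    fully annotated region within `seq` (assumed to be known already).
    """
    if not fully_annotated:
        return seq  # sequences that are not fully annotated stay untrimmed
    start = max(0, region_start - keep)       # flank shorter than `keep` is kept whole
    end = min(len(seq), region_end + keep)
    return seq[start:end]


# 25 bases upstream, a 30-base annotated region, 5 bases downstream.
read = "A" * 25 + "C" * 30 + "G" * 5
trimmed = trim_flanks(read, region_start=25, region_end=55, keep=10)
print(len(read), len(trimmed))
```

With the default of 10, the upstream flank here is cut from 25 bases to 10, while the 5-base downstream flank is already under the limit and is left as it is.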
Most NGS data sets are relatively large. To restrict which results are clustered, select one or more of the following options:
- Only cluster results with asset and liability score of at least - This will cluster the sequences based on the score specified. For example, if you specify a score of 1000, only sequences that have a liability and asset score of 1000 or more will be included in the clusters.
- Only cluster results which are - This will cluster only the sequences that are either Fully annotated, Fully annotated and in frame, or Fully annotated, in frame, and without stop codon. For example, if you choose to cluster Fully annotated and in frame sequences, only sequences that are both fully annotated and in frame will be clustered. Sequences that are not fully annotated or that are frameshifted will not be included in the clustering operation.
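The two clustering filters above can be combined as in the sketch below. The `Result` record, its field names, and the level keys are assumptions introduced for illustration; they do not correspond to the annotator's internal data model.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical per-sequence summary of an annotation result.
    name: str
    fully_annotated: bool
    in_frame: bool
    has_stop_codon: bool
    score: float  # liability and asset score


# Each level is stricter than the one before it.
LEVELS = {
    "fully_annotated": lambda r: r.fully_annotated,
    "in_frame": lambda r: r.fully_annotated and r.in_frame,
    "no_stop_codon": lambda r: r.fully_annotated and r.in_frame and not r.has_stop_codon,
}


def select_for_clustering(results, level=None, min_score=None):
    """Return the results that pass every filter that is switched on."""
    selected = []
    for r in results:
        if level is not None and not LEVELS[level](r):
            continue
        if min_score is not None and r.score < min_score:
            continue
        selected.append(r)
    return selected


results = [
    Result("seq1", True, True, False, 1200),
    Result("seq2", True, False, False, 1500),   # frameshifted
    Result("seq3", True, True, True, 900),      # stop codon, low score
    Result("seq4", False, False, False, 2000),  # not fully annotated
]
kept = select_for_clustering(results, level="in_frame", min_score=1000)
print([r.name for r in kept])
```

Leaving either argument as `None` mirrors leaving that option unselected, so the two filters can be used independently or together.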