The NGS analysis pipeline analyses and annotates your input data against the reference sequences in the Reference Database. NGS analyses are run in experiment folders, which are indicated with a conical flask icon in the folder tree. To learn more about creating experiment folders, please refer to the following article.
To run an NGS analysis, select a file within the experiment folder and select NGS analysis v2 in the Pipelines dropdown. The following steps outline how to carry out an NGS analysis and how each section and option works:
In the NGS analysis popup window, select a reference database from the Reference Database dropdown. To learn more about reference databases, please refer to the following article.
To improve performance, specify the expected chain by selecting a chain type in the Unpaired reads are dropdown and the chain order in the Paired reads are dropdown.
In the NGS analysis operation, the reference database is split into heavy and light sections. By specifying the expected chain, only the appropriate database section is used, improving performance and potentially accuracy too.
- Unpaired reads are - This specifies the expected chain type. If you expect your sequences to have both heavy and light chains, select the Both chains (scFv) value in the dropdown. If you do not know the chain type, use the default Unknown chain value.
If one read in each pair is heavy and the other is light, and they are in a known order, then specifying the order means only the appropriate database section is used, improving performance and potentially accuracy too. No matter what value is specified here, when each read in a pair is from a different chain, the two reads are grouped together in the output. Paired reads from the same chain are not grouped together in the output; these should instead be merged prior to annotation.
- Paired reads are - This specifies the expected chain order. If your paired read starts with a light chain followed by a heavy chain, select the Light then heavy chain value in the dropdown. If you do not know the chain order of your sequences, use the default Unknown order value.
When multiple genes are equally close to a query sequence, there are three possible ways to count that sequence in the list of the number of sequences that match each gene. For example, if a query sequence is equally close to IGHD1-26 and IGHD2-15, each option below will handle this differently:
- Each gene with partial frequency - This will assign that query sequence to all matching genes with partial frequency. Based on the example above, this sequence will add 0.5 to IGHD1-26's and 0.5 to IGHD2-15's total number of sequences.
- Group of genes - This will create a separate entry in the list of genes that represents this combination of genes. Based on the example above, this sequence will contribute 0 towards the total for each of IGHD2-15 and IGHD1-26, and instead add 1 to the total for a gene called "IGHD1-26/IGHD2-15".
- Unknown - This will treat this as an unknown gene. Based on the example above, this sequence will add nothing to the totals of IGHD1-26 and IGHD2-15.
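The three counting behaviours above can be illustrated with a short Python sketch. This is not the pipeline's implementation: the function name `tally_genes`, the mode strings, the "Unknown" bucket, and the alphabetical ordering of names in a gene group are all assumptions made for illustration.

```python
from collections import defaultdict


def tally_genes(assignments, mode):
    """Count sequences per gene under one of three tie-handling modes.

    assignments: one list per query sequence, holding the gene(s) that
                 are equally close to that sequence.
    mode: "partial", "group" or "unknown" (hypothetical names, not the
          pipeline's own option identifiers).
    """
    totals = defaultdict(float)
    for genes in assignments:
        if len(genes) == 1:
            # Unambiguous match: the gene gets the full count.
            totals[genes[0]] += 1
        elif mode == "partial":
            # Split one count evenly across all tied genes.
            share = 1 / len(genes)
            for gene in genes:
                totals[gene] += share
        elif mode == "group":
            # Count towards a combined entry; the name order is assumed.
            totals["/".join(sorted(genes))] += 1
        elif mode == "unknown":
            # Tied genes receive nothing; the "Unknown" key is assumed.
            totals["Unknown"] += 1
        else:
            raise ValueError(f"unrecognised mode: {mode}")
    return dict(totals)


tied = [["IGHD1-26", "IGHD2-15"]]
print(tally_genes(tied, "partial"))
print(tally_genes(tied, "group"))
print(tally_genes(tied, "unknown"))
```

With the example sequence, the partial mode adds 0.5 to each gene, the group mode adds 1 to a combined entry, and the unknown mode leaves both gene totals untouched.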
The bundled human and mouse immunoglobulin databases may contain a number of pseudogenes and ORF genes along with fully functional genes. You can choose to include the Pseudogenes from database and/or the ORF genes from database in the analysis, as mutations in these genes might be corrected prior to expression. If these are not included, each sequence will be classified according to the most closely related functional gene in the database.
Liability and asset search
To search for and score motifs liable to post-translational or other types of modifications, as well as beneficial motifs, select the Find liabilities and assets option. The NGS analysis v2 pipeline has a default set of sequence liability checks. To learn more about configuring this option, please refer to the following article.
To trim the non-annotated regions upstream and downstream of your annotated sequences, select the Trim each side of fully annotated region if over option and specify the number of base pairs to leave untrimmed.
- The default is set to 10. This means that your annotated sequence will have 10 base pairs flanking the 5' and 3' ends of the sequence.
- For single chain sequences, if the start of FR1 is untruncated, the start is trimmed. However, if the V-Gene starts before FR1, trimming starts outside of the V-Gene. Similar rules are applied for FR4.
- For scFv data, this applies to the first FR1 and the last FR4.
Note that trimming only applies to fully annotated sequences. Sequences that the NGS operation classifies as not fully annotated are left untrimmed.
Most NGS data sets are relatively large. To restrict which results are clustered, select one or more of the following options:
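As a rough illustration of the trimming rule, the sketch below keeps a fixed number of flanking bases on each side of the annotated region and leaves sequences that are not fully annotated untouched. The function `trim_flanks` and its parameters are hypothetical, the coordinates are assumed to be 0-based and end-exclusive, and the special handling of FR1, FR4 and V-Gene boundaries described above is not modelled.

```python
def trim_flanks(sequence, region_start, region_end, keep=10, fully_annotated=True):
    """Trim bases outside the annotated region, keeping up to `keep` per side.

    region_start / region_end: assumed 0-based, end-exclusive bounds of
    the fully annotated region within `sequence`.
    """
    if not fully_annotated:
        # Sequences that are not fully annotated are left untrimmed.
        return sequence
    # A side is only shortened when its flank is longer than `keep`.
    start = max(0, region_start - keep)
    end = min(len(sequence), region_end + keep)
    return sequence[start:end]


# 25 upstream bases, a 30-base annotated region, 4 downstream bases.
read = "A" * 25 + "C" * 30 + "G" * 4
trimmed = trim_flanks(read, region_start=25, region_end=55, keep=10)
print(len(read), len(trimmed))
```

Here the upstream flank is cut from 25 to 10 bases, while the 4-base downstream flank is already under the limit and stays as it is.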
- Only cluster results with asset and liability score of at least - This will cluster the sequences based on the score specified. For example, if you specify a score of 1000, only sequences that have a liability and asset score of 1000 or more will be included in the clusters.
- Only cluster results which are - This will cluster only the sequences that are either Fully annotated, Fully annotated and in frame, or Fully annotated, in frame, and without stop codon. For example, if you choose to cluster Fully annotated and in frame sequences, only sequences that are both fully annotated and in frame will be clustered. Sequences that are not fully annotated or are frameshifted will not be included in the clustering operation.
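The two clustering filters can be pictured as a simple predicate applied to each result before clustering. The sketch below is illustrative only: the `Result` fields, the level names, and the assumption that both filters must pass when both are selected are not taken from the pipeline itself.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical summary of one annotated sequence."""
    name: str
    score: float
    fully_annotated: bool
    in_frame: bool
    has_stop_codon: bool


# Hypothetical names for the three "Only cluster results which are" levels.
LEVELS = ("fully_annotated", "in_frame", "no_stop")


def eligible_for_clustering(result, min_score=None, level=None):
    """Return True if the result passes every filter that is switched on."""
    if min_score is not None and result.score < min_score:
        return False
    if level is None:
        return True
    if level not in LEVELS:
        raise ValueError(f"unrecognised level: {level}")
    # Each level adds one requirement on top of the previous level.
    if not result.fully_annotated:
        return False
    if level in ("in_frame", "no_stop") and not result.in_frame:
        return False
    if level == "no_stop" and result.has_stop_codon:
        return False
    return True


results = [
    Result("seq1", 1200, True, True, False),
    Result("seq2", 1500, True, False, False),   # frameshifted
    Result("seq3", 800, True, True, False),     # score below threshold
]
kept = [r.name for r in results
        if eligible_for_clustering(r, min_score=1000, level="in_frame")]
print(kept)
```

With a minimum score of 1000 and the in-frame level, only seq1 is kept: seq2 is frameshifted and seq3 scores below the threshold.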