Before proceeding with this tutorial, it is advisable to read this article to familiarise yourself with Geneious Biologics.
In this tutorial, you will learn how to merge and annotate next-generation sequencing (NGS) reads produced by sequencing variable gene repertoires from immunized mice. You will also learn how to assess antibody repertoire diversity through sequence clustering.
This tutorial will cover the following exercises:
- Merge overlapping paired-end NGS reads
- Sequence annotation
- Understanding sequence clusters
- Sequence filtering
- Similarity clustering
To start this tutorial, you will need input data. If you have recently started using Geneious Biologics, your organisation may already have the tutorial folders set up as described in the tutorial below. If not, you can still follow this tutorial by first downloading the input sequences here and then uploading them into Geneious Biologics.
Merging paired reads, also known as overlapping or assembly of read pairs, converts a read pair into a single read containing a sequence and a set of quality scores. The two reads of a pair must overlap over a significant fraction of their length to be merged.
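To make the merging idea concrete, the following minimal Python sketch merges one inward-pointing read pair. It is a toy illustration only and is not the BBMerge algorithm used by Set & Merge Paired Reads: the function names, the `min_overlap` and `max_mismatch_frac` parameters, and the rule of keeping the higher-quality base call in the overlap are all assumptions made for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def merge_pair(fwd, fwd_qual, rev, rev_qual, min_overlap=10, max_mismatch_frac=0.1):
    """Merge an inward-pointing read pair; return None if no overlap is acceptable.

    Qualities are lists of integer Phred scores, one per base (assumed format).
    """
    # Bring the reverse read onto the same strand as the forward read.
    rev_rc = reverse_complement(rev)
    rev_rc_qual = rev_qual[::-1]

    # Try the longest overlap first: suffix of fwd against prefix of rev_rc.
    for overlap in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
        start = len(fwd) - overlap
        mismatches = sum(a != b for a, b in zip(fwd[start:], rev_rc[:overlap]))
        if mismatches <= max_mismatch_frac * overlap:
            break
    else:
        return None  # the pair could not be merged

    # Forward-only part, then the overlap, then the reverse-only part.
    seq = list(fwd[:start])
    qual = list(fwd_qual[:start])
    for i in range(overlap):
        # Within the overlap, keep the base call with the higher quality score.
        if fwd_qual[start + i] >= rev_rc_qual[i]:
            seq.append(fwd[start + i])
            qual.append(fwd_qual[start + i])
        else:
            seq.append(rev_rc[i])
            qual.append(rev_rc_qual[i])
    seq.extend(rev_rc[overlap:])
    qual.extend(rev_rc_qual[overlap:])
    return "".join(seq), qual


fragment = "ACGTTGCAAGGCTTACCGATGCATTGCA"      # toy 28-base template
forward = fragment[:20]                       # read from one end
reverse = reverse_complement(fragment[8:])    # read from the opposite strand
merged = merge_pair(forward, [30] * 20, reverse, [20] * 20)
print(merged[0] == fragment)  # True: the 12-base overlap rebuilds the template
```

A pair with no acceptable overlap returns `None`, which corresponds to the reads that end up in a "couldn't be merged" output.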
In this exercise, you will learn how to merge paired-end Illumina MiSeq reads. Immunoglobulin heavy chain variable genes are approximately 300-350 bp long, and because this example read library was obtained by 250 bp paired-end sequencing, the reads must be merged to obtain full-length heavy chain sequences. To merge these paired-end reads, select both paired-end documents in the Input data folder and click Pre-processing > Set & Merge Paired Reads (see image below).
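The read and fragment lengths quoted above imply how much the two reads of a pair should overlap. The short sketch below works through that arithmetic. The helper name `expected_overlap` is hypothetical, and the calculation assumes equal-length reads with no primer or adapter trimming.

```python
def expected_overlap(read_length, fragment_length):
    """Bases covered by both reads of an inward-pointing pair (idealised)."""
    if fragment_length <= read_length:
        # Each read spans the whole fragment, so the entire fragment is shared.
        return fragment_length
    return max(0, 2 * read_length - fragment_length)


for fragment_length in (300, 350):
    print(fragment_length, expected_overlap(250, fragment_length))
# 300 200
# 350 150
```

Under these assumptions, a single 250 bp read cannot span a 300-350 bp fragment, but the pair overlaps by roughly 150-200 bp, which leaves ample room for merging.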
As the read libraries are paired-end, select the following options in the Set & Merge Paired Reads dialog box and click Run to start the analysis (see sections and image below).
- Pairs of lists
- Forward/Reverse (inward pointing, e.g. Illumina paired end)
- Set and merge paired reads using BBMerge
Once the operation is completed, two new documents will be generated in the Set and merge paired reads folder: an ERR346598 (merged) document and an ERR346598 (couldn't be merged) document. The ERR346598 (merged) document consists of reads that were successfully paired and merged, while the ERR346598 (couldn't be merged) document consists of reads that were paired but couldn't be merged.
**Note that the numbers of merged and unmerged reads depend on the read quality, and increasing the merge rate may result in more false positives. Read more on Set & Merge Paired Reads here.
The Antibody Annotator identifies immunoglobulin framework regions (FR), complementarity-determining regions (CDR) and V(D)JC genes, and annotates input sequences against a selected reference database.
In this exercise, you will learn how to annotate mouse variable heavy immunoglobulin genes produced by PCR amplification and how to analyze the results with the help of the Pipeline Report and Graphs. To annotate these heavy IgG genes, select the ERR346598 (merged) document in the Set and merge paired reads folder and click Annotation > Antibody Annotator (see image below).
Select the following options from the Antibody Annotator dialog box and click Run to start the analysis (see sections and image below).
Select the following options:
- Reference database: Mouse Ig
- Selected sequences are: Single chain (heavy)
Select the following options:
- Include pseudo genes from database
- Find liabilities and assets
This operation will produce an ERR346598 (merged) Annotated & Clustered Biologics Annotator Result document in the Sequence annotation folder.
**Note that the bundled IgG reference databases are split into light and heavy sections. If the sequence type (the Selected sequences are: option) is specified, only the appropriate database section is used, thus improving performance and potentially annotation accuracy. Read more about the Antibody Annotator here.
A pipeline report is generated for every Biologics Annotator Result document. Derived from the Antibody Annotator analysis, this report indicates, among other things, the annotation rate of the input data, the region cluster diversity and the gene mutation distribution.
In the following section, we will determine how well the sequences are annotated. To get a quick overview of the Antibody Annotator analysis, select the ERR346598 (merged) Annotated & Clustered document and click Pipeline Report.
Approximately 88% of the sequences were identified as Heavy Chain, and these sequences were fully annotated (containing all of the FR and CDR regions), in-frame and without stop codons (Figure 1.1).
Figure 1.1 | The number of sequences (without stop codons, in-frame and fully annotated) identified and annotated by the Antibody Annotator.
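As a rough illustration of what an annotation rate such as the one above expresses, the sketch below computes the percentage of sequences that satisfy all three conditions at once. The `AnnotatedSequence` record, its field names and the four toy entries are hypothetical. They do not reflect the internal data model of the Antibody Annotator or the tutorial dataset.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedSequence:
    """Hypothetical per-sequence summary of an annotation run."""
    name: str
    fully_annotated: bool   # all FR and CDR regions were found
    in_frame: bool
    has_stop_codon: bool


def passing_rate(sequences):
    """Percentage of sequences that are fully annotated, in-frame and stop-free."""
    if not sequences:
        return 0.0
    passing = sum(
        1 for s in sequences
        if s.fully_annotated and s.in_frame and not s.has_stop_codon
    )
    return 100.0 * passing / len(sequences)


toy_sequences = [
    AnnotatedSequence("read_1", True, True, False),
    AnnotatedSequence("read_2", True, True, False),
    AnnotatedSequence("read_3", True, False, False),   # frameshift
    AnnotatedSequence("read_4", True, True, False),
]
print(passing_rate(toy_sequences))  # 75.0
```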
**Note that the Pipeline Report can be exported as a PDF document. Click Export to PDF to export the report as a PDF document.
The Graphs option is available for every Biologics Annotator Result document. Graphs is a collection of graphs derived from the Antibody Annotator analysis.
In the following sections, we will learn more about clusters and assess the cluster diversity of the Heavy CDR3 region. The immunoglobulin CDR3 region has been reported to contribute to antibody diversity and, for this reason, CDR3 sequences have been widely used as unique identifiers. To assess the Heavy CDR3 cluster diversity, first select the ERR346598 (merged) Annotated & Clustered document and click Graphs. Then, click Annotations rates and select Cluster diversity in the dropdown. Finally, select Heavy CDR3 in the Show: dropdown (see image below).
**Note that these graphs can be exported as image (png) or table (csv) files that can be used for publication or as laboratory documentation. To export a graph, click Export.
The three CDR regions, which interact with the antigen, are more diverse than the framework regions, and among the CDR regions, CDR3 varies the most. The CDR cluster diversity and cluster length graphs provide a quick, indirect comparison of the CDR clusters.
The Heavy CDR cluster diversity graphs show that Heavy CDR3 is the CDR region with the highest cluster diversity, with approximately 100,000 clusters, while Heavy CDR1 and CDR2 consist of approximately 14,000 and 18,000 clusters respectively. Additionally, the majority of the Heavy CDR3 clusters in this dataset consist of a single unique CDR3 amino acid sequence, suggesting high sequence diversity. The Heavy CDR3 cluster diversity is also reflected in its cluster lengths: the top 5 cluster lengths range from 10 to 14 amino acids, while the most common length of both Heavy CDR1 and CDR2 is 8 amino acids, as shown in the Heavy CDR length graphs (Figure 1.2).
Figure 1.2 | Heavy CDRs cluster diversity and cluster length distribution. The graphs on the left show the CDR cluster diversity and the graphs on the right show the CDR cluster length.
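The two quantities plotted in these graphs, the number of clusters and the distribution of cluster lengths, can be approximated for any list of CDR amino acid sequences. The sketch below does this with exact-match clustering. The function name is hypothetical, and every toy sequence other than “ARWEYYAMDY” is invented for illustration.

```python
from collections import Counter


def cluster_diversity(cdr_sequences):
    """Summarise exact-match clusters of CDR amino acid sequences.

    Returns (number of clusters, number of single-member clusters,
    Counter mapping CDR length to the number of clusters of that length).
    """
    cluster_sizes = Counter(cdr_sequences)   # one cluster per distinct sequence
    singletons = sum(1 for size in cluster_sizes.values() if size == 1)
    length_distribution = Counter(len(cdr) for cdr in cluster_sizes)
    return len(cluster_sizes), singletons, length_distribution


toy_cdr3 = ["ARWEYYAMDY", "ARWEYYAMDY", "ARGDY", "TRSGYFDV", "ARWEYYAMDY"]
n_clusters, n_singletons, lengths = cluster_diversity(toy_cdr3)
print(n_clusters, n_singletons)  # 3 2
```

A high cluster count relative to the number of sequences, together with many single-member clusters, is the pattern that points to a diverse region.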
Next-generation sequencing enables the discovery of the great diversity of natural antibody repertoires, producing a vast volume of sequencing data for a fraction of the cost of Sanger sequencing. Sequence clustering is the process of grouping similar sequences into clusters, which reduces sequence redundancy and makes data analysis more straightforward.
In this exercise, you will learn how to view Heavy CDR3 region clusters and identify their most abundant associated regions (CDR1 and CDR2, and FR1-FR4). To view sequences within a cluster with an identical Heavy CDR3, select ERR346598 (merged) Annotated & Clustered in the Sequence annotation folder and click All Sequences > Heavy CDR3.
To view the most abundant Heavy CDR3 cluster, click the Total column header twice to sort the table in descending order by Total (indicated by the downwards arrow in the Total column header). Select the top cluster and select Translations in the Sequence Viewer to view the cluster of sequences sharing the identical Heavy CDR3 sequence “ARWEYYAMDY” (see image below).
**Note that all the sequences within a region cluster share an identical region. For example, when sequences are grouped by Heavy CDR3, all the sequences within a cluster will have an identical Heavy CDR3 sequence, but they may have distinct CDR1, CDR2 and FR1-FR4 regions. Learn more about clusters here.
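The grouping described in this note can be sketched in a few lines of Python. The dictionary keys mirror the region names used in this tutorial, but the record layout, the `group_by_region` helper and the toy records are assumptions. Only “ARWEYYAMDY”, “GFNIKDTY” and “IDPANGNT” come from the tutorial data.

```python
from collections import defaultdict


def group_by_region(records, region):
    """Group records that share an identical value for one region.

    Returns (region sequence, members) pairs with the largest cluster first,
    similar to sorting the Total column in descending order.
    """
    clusters = defaultdict(list)
    for record in records:
        clusters[record[region]].append(record)
    return sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)


toy_records = [
    {"Heavy CDR1": "GFNIKDTY", "Heavy CDR2": "INPSNGGT", "Heavy CDR3": "ARGDYYGSSY"},
    {"Heavy CDR1": "GFNIKDTY", "Heavy CDR2": "IDPANGNT", "Heavy CDR3": "ARWEYYAMDY"},
    {"Heavy CDR1": "GYTFTSYW", "Heavy CDR2": "IDPANGNT", "Heavy CDR3": "ARWEYYAMDY"},
]
top_cdr3, members = group_by_region(toy_records, "Heavy CDR3")[0]
print(top_cdr3, len(members))                       # ARWEYYAMDY 2
print(sorted({m["Heavy CDR1"] for m in members}))   # ['GFNIKDTY', 'GYTFTSYW']
```

The two members of the top cluster share their Heavy CDR3 but differ in Heavy CDR1, which is exactly the situation the note describes.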
The Sequences Table can be used to quickly identify the most frequent FR and CDR clusters for a selected cluster. To view the most abundant regions associated with the selected Heavy CDR3 cluster, scroll to the right of the Sequences Table or use the Focus column button located in the Table Preferences panel to quickly navigate to your column of interest.
The Sequences Table shows that the most abundant associated Heavy CDR1 and CDR2 sequences for the Heavy CDR3 “ARWEYYAMDY” cluster are “GFNIKDTY” (95.55%) and “IDPANGNT” (96.59%) respectively (Figure 1.3).
Figure 1.3 | The Heavy CDR3 “ARWEYYAMDY” cluster and its most abundant associated CDR1 and CDR2 sequences.
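Percentages like those quoted for the associated CDR1 and CDR2 sequences are simple frequencies within one cluster. The sketch below shows the calculation on toy data. The helper name and the four-member cluster are hypothetical, so the resulting percentage is not the tutorial's value.

```python
from collections import Counter


def most_abundant_associated(cluster_members, region):
    """Return the most frequent value of `region` in a cluster and its percentage."""
    counts = Counter(member[region] for member in cluster_members)
    value, count = counts.most_common(1)[0]
    return value, 100.0 * count / len(cluster_members)


# Toy members of one Heavy CDR3 cluster (invented counts).
toy_cluster = [
    {"Heavy CDR1": "GFNIKDTY", "Heavy CDR3": "ARWEYYAMDY"},
    {"Heavy CDR1": "GFNIKDTY", "Heavy CDR3": "ARWEYYAMDY"},
    {"Heavy CDR1": "GYTFTSYW", "Heavy CDR3": "ARWEYYAMDY"},
    {"Heavy CDR1": "GFNIKDTY", "Heavy CDR3": "ARWEYYAMDY"},
]
print(most_abundant_associated(toy_cluster, "Heavy CDR1"))  # ('GFNIKDTY', 75.0)
```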
**Note that you can create custom cluster combinations in addition to the default clusters and cluster combinations. Learn more on how to create custom cluster combinations here.
NGS data generally comprises a large number of reads, making antibody candidate selection difficult. Sequence filtering, coupled with the asset and liability score, may aid in identifying suitable candidates for further downstream analyses.
In this exercise, you will learn how to filter the All Sequences table for sequences that meet a set of conditions. To filter for sequences that are fully annotated, in-frame and without stop codons, with a score of > -100, first right-click a cell in the Without Stop Codons & In Frame & Fully Annotated column and click the Filter syntax. Then, right-click a cell in the Score column and click the Filter syntax. Finally, in the Filter box, ensure that the filter syntax is as below and click Filter.
['Without Stop Codons & In Frame & Fully Annotated'] = 'Yes' AND ['Score'] > -100
A total of 27,298 sequences that are without stop codons, in-frame and fully annotated, and have a Score of > -100, were identified (Figure 1.4). The high score suggests low sequence annotation error with a low number of liability sites such as post-translational modification (PTM) sites.
Figure 1.4 | A total of 27,298 of the 2,550,484 sequences meet the conditions of having a score > -100 and being fully annotated, in-frame and without stop codons.
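The filter expression used in this exercise is a conjunction of two conditions on table columns. For readers who think in code, the sketch below applies the same logic to rows represented as Python dictionaries. The column names are taken from the filter syntax, while the row representation, the `passes_filter` helper and the toy rows are assumptions.

```python
def passes_filter(row):
    """Mirror of: ['Without Stop Codons & In Frame & Fully Annotated'] = 'Yes' AND ['Score'] > -100"""
    return (
        row["Without Stop Codons & In Frame & Fully Annotated"] == "Yes"
        and row["Score"] > -100
    )


flag = "Without Stop Codons & In Frame & Fully Annotated"
toy_rows = [
    {"Name": "seq_1", flag: "Yes", "Score": -20},
    {"Name": "seq_2", flag: "Yes", "Score": -100},   # excluded: > is strict
    {"Name": "seq_3", flag: "No", "Score": 0},       # excluded: flag is not 'Yes'
    {"Name": "seq_4", flag: "Yes", "Score": -99.5},
]
kept = [row["Name"] for row in toy_rows if passes_filter(row)]
print(kept)  # ['seq_1', 'seq_4']
```

Because the comparison is strict, a sequence whose score is exactly -100 does not pass the filter.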
**Note that you can filter sequences on all of the columns within a Biologics Annotator Result document. Learn more about sequence filtering and filtering using scripts here.
Sequence clustering is commonly used to group highly similar immunoglobulin sequences together, on the assumption that their similarity results from their descent from the same initial B cell. Reclustering is the process of grouping sequences that share a similar region into clusters based on a set threshold.
In this exercise, you will learn how to cluster the Heavy CDR3 region at 90% similarity. To recluster the Heavy CDR3 region, select the ERR346598 (merged) Annotated & Clustered document in the Sequence annotation folder and click Post-processing > Re-cluster.
Select the following options from the Additional Clustering dialog box and click Run to start the analysis (see image below).
This analysis will produce an ERR346598 (merged) Annotated & Clustered (Reclustered) document in the Similarity clustering folder.
To view the reclustered sequences, select the ERR346598 (merged) Annotated & Clustered (Reclustered) document and select the Heavy CDR3 90 Percent Similarity cluster from the Group By: dropdown.
Prior to reclustering, a total of 102,089 clusters of Heavy CDR3 were identified (top) and upon reclustering a total of 35,674 clusters of Heavy CDR3 were identified (bottom) (Figure 1.5).
Figure 1.5 | Sequence re-clustering groups similar sequences into clusters based on a threshold.
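To illustrate why a similarity threshold reduces the number of clusters, the sketch below groups sequences greedily at 90% similarity. This is not the algorithm used by the Re-cluster operation, which is not described in this tutorial. The edit-distance similarity measure, the greedy first-match strategy and the function names are assumptions, and the toy sequences other than “ARWEYYAMDY” are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(a, b):
    """Fraction of positions preserved, relative to the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return (longest - edit_distance(a, b)) / longest


def greedy_cluster(sequences, threshold=0.9):
    """Assign each sequence to the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative, members) pairs
    for seq in sequences:
        for representative, members in clusters:
            if similarity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))  # seq starts a new cluster
    return clusters


toy_cdr3 = ["ARWEYYAMDY", "ARWEYYAMDF", "ARGDYYGSSY", "ARWEYYAMDY"]
clusters = greedy_cluster(toy_cdr3)
print(len(set(toy_cdr3)), len(clusters))  # 3 2
```

Three exact-match clusters collapse into two because “ARWEYYAMDF” differs from “ARWEYYAMDY” at only one of ten positions. Note that a greedy scheme like this one depends on the input order, which is one reason real tools may behave differently.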
**Note that the Re-cluster operation essentially produces another Biologics Annotator Result document with an additional cluster table. This new document will not include the Graphs and Pipeline Report options. Read more about similarity and identity clustering here.