"Cluster" is the term we generally use for a group of sequences in a dataset that share the same sequence in a particular region. For example, when viewing result documents, the "Group by" dropdown lets you collapse all the annotated sequences together by annotated region. If you choose "Heavy CDR3" from the "Group by" dropdown, you get a table in which sequences with exactly the same CDR3 region sequence are "clustered" together into one row.
In the picture above 75 sequences have a Heavy CDR3 of "WEYYAMDY". We would refer to this as a single CDR3 "cluster". We generally cluster by amino acid translations of regions, rather than the underlying nucleotides. This means that the exact nucleotide sequences may differ within a single amino acid cluster.
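As a rough illustration of exact clustering on amino acid translations, here is a minimal Python sketch. It is not how the software itself is implemented; the record layout, the `cluster_by_region` helper, and the nucleotide sequences are all made up for the example.

```python
from collections import defaultdict

# Hypothetical records: (sequence name, CDR3 nucleotides, CDR3 amino acids).
# The nucleotide sequences are invented for illustration only.
records = [
    ("seq1", "TGGGAATATTATGCTATGGATTAT", "WEYYAMDY"),
    ("seq2", "TGGGAGTACTACGCCATGGACTAC", "WEYYAMDY"),
    ("seq3", "GCTCGTGATTAT", "ARDY"),
]


def cluster_by_region(records):
    """Group records whose amino acid region sequence is exactly identical."""
    clusters = defaultdict(list)
    for name, nucleotides, amino_acids in records:
        # The amino acid translation is the cluster key, so records with
        # different nucleotides can still land in the same cluster.
        clusters[amino_acids].append((name, nucleotides))
    return dict(clusters)


clusters = cluster_by_region(records)
for amino_acids, members in clusters.items():
    print(amino_acids, len(members), [name for name, _ in members])
```

Here `seq1` and `seq2` differ at the nucleotide level but translate to the same CDR3, so they fall into one "WEYYAMDY" cluster.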
If you have a lot of analysed sequences, looking at clusters can be a good way of understanding the variability in your dataset. It can also tell you whether a single CDR3 region is dominant within your dataset. And if you were to align the CDR3 regions together, you could see how they relate to each other sequence-wise. You can find more information about defining custom clusters here, and creating inexact (fuzzy) similarity clusters here.
Another way to visualise clusters is using our graphing suite in the "Graphs" tab of Antibody Annotator result documents.
The Number of Clusters graph shows which regions have the most variability within your annotated sequences. For example, in the above graph we can see that the FR/CDR region with the largest number of different protein sequences is the Heavy CDR3 region, closely followed by the Heavy FR1 region. There are far fewer unique sequences in Heavy FR4.
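The idea behind a per-region cluster count can be sketched in a few lines of Python: count the distinct amino acid sequences seen in each region. The data and the `count_clusters` helper below are hypothetical and only illustrate the calculation.

```python
# Hypothetical annotated sequences: region name -> amino acid sequence.
annotated = [
    {"Heavy FR1": "EVQLVESGG", "Heavy CDR3": "WEYYAMDY", "Heavy FR4": "WGQGTLVTVSS"},
    {"Heavy FR1": "EVQLVESGG", "Heavy CDR3": "ARDY", "Heavy FR4": "WGQGTLVTVSS"},
    {"Heavy FR1": "QVQLQESGP", "Heavy CDR3": "ARGYFDV", "Heavy FR4": "WGQGTLVTVSS"},
]


def count_clusters(annotated):
    """Return the number of distinct amino acid sequences per region."""
    unique = {}
    for sequence in annotated:
        for region, amino_acids in sequence.items():
            unique.setdefault(region, set()).add(amino_acids)
    return {region: len(values) for region, values in unique.items()}


cluster_counts = count_clusters(annotated)
# Print regions from most to least variable.
for region, count in sorted(cluster_counts.items(), key=lambda item: -item[1]):
    print(region, count)
```

In this toy dataset Heavy CDR3 has the most clusters and Heavy FR4 has only one, mirroring the kind of pattern described above.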
The cluster size graph can be a good way of looking at the distribution of clusters within your sequences. We can see from the graph above that the majority of Heavy CDR3 clusters in this dataset contain only a single sequence, meaning that CDR3 is unique. On the other hand, we can also see that there is at least one Heavy CDR3 cluster that contains 75 sequences (this is our "WEYYAMDY" cluster from above). If we wanted to double-check that, we could look at the CDR3 count graph below, which shows the 25 most abundant Heavy CDR3 regions in this dataset.
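Both views can be approximated with a short Python sketch, assuming you have one Heavy CDR3 amino acid sequence per annotated sequence. The list below is invented for illustration and is far smaller than a real dataset.

```python
from collections import Counter

# Hypothetical Heavy CDR3 amino acid sequences, one per annotated sequence.
cdr3_sequences = ["WEYYAMDY", "WEYYAMDY", "WEYYAMDY", "ARDY", "ARGYFDV"]

# Cluster -> number of sequences in that cluster.
cluster_sizes = Counter(cdr3_sequences)

# Cluster size -> number of clusters of that size (the cluster size distribution).
size_distribution = Counter(cluster_sizes.values())

# The most abundant clusters, largest first (at most 25, as in the count graph).
most_abundant = cluster_sizes.most_common(25)

print(size_distribution)
print(most_abundant)
```

Here two clusters contain a single sequence each and one cluster contains three, so the size distribution has two entries.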
One final graph that may be of interest is the "Cluster length" graph, which shows the variation in length within the Heavy CDR3 region. Below we can see that the most frequent CDR3 length is 9 amino acids.
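A length distribution of this kind can be sketched as follows. This example counts each distinct CDR3 cluster once; that is an assumption made for the illustration, not a statement about how the graph itself is computed, and the sequences are hypothetical.

```python
from collections import Counter

# Hypothetical distinct Heavy CDR3 amino acid sequences (one per cluster).
cdr3_clusters = ["WEYYAMDYW", "ARDYGSSYW", "ARGYFDV", "ARDY"]

# Length in amino acids -> number of clusters with that length.
length_counts = Counter(len(cdr3) for cdr3 in cdr3_clusters)

# The most frequent length and how many clusters have it.
most_frequent_length, cluster_count = length_counts.most_common(1)[0]
print(most_frequent_length, cluster_count)
```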
Note: These graphs are also available for other cluster regions; Heavy CDR3 is used above only as an example.