Repertoire comparison by next-generation sequencing gives insight into the breadth of the antibody repertoire used in an immune response. Antibody repertoire comparisons can be performed using the Compare Results operation.
To compare results: (1) select two or more Biologics Annotator Result documents, (2) click Post-processing, and (3) click Compare Results in the dropdown (see image below).
Depending on your samples, you can customize the comparison run by selecting or adjusting the following parameters:
- Filtering (optional)
- Normalization (compulsory)
- Additional Clustering (optional)
- Experiment (compulsory)
Depending on your samples, some sequences may not be fully annotated, some may contain stop codons and some may be out of frame. These sequences can influence the comparison analysis because they contribute to the normalization calculations. Filtering allows you to remove:
- Sequences that contain stop codons, are out of frame, or are not fully annotated
- Sequences with a sum of counts (across all selected samples) lower than a specified value
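The two filters can be pictured as a single pass over the sequences. The sketch below is illustrative only: the `Sequence` record, its field names and the `filter_sequences` function are hypothetical and do not reflect the tool's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Sequence:
    # Hypothetical record; field names are assumptions made for this sketch.
    name: str
    fully_annotated: bool
    in_frame: bool
    has_stop_codon: bool
    counts: dict  # sample name -> count in that sample


def filter_sequences(sequences, productive_only=True, min_total_count=0):
    """Keep sequences that pass both optional filters."""
    kept = []
    for seq in sequences:
        productive = seq.fully_annotated and seq.in_frame and not seq.has_stop_codon
        if productive_only and not productive:
            continue
        # Sum of counts across all selected samples.
        if sum(seq.counts.values()) < min_total_count:
            continue
        kept.append(seq)
    return kept


example = [
    Sequence("A", True, True, False, {"s1": 40, "s2": 10}),
    Sequence("B", True, True, True, {"s1": 500, "s2": 450}),  # contains a stop codon
    Sequence("C", True, False, False, {"s1": 30, "s2": 30}),  # out of frame
    Sequence("D", True, True, False, {"s1": 1, "s2": 2}),     # summed count below 5
]
kept_names = [s.name for s in filter_sequences(example, min_total_count=5)]
print(kept_names)
```

Note that sequence B would have contributed 950 reads to the totals had it not been filtered, which is why these sequences can matter for normalization.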
In order to accurately compare samples, normalization or scaling is often performed to remove variation between samples that prevents direct comparison of data. Depending on your experimental design and data, you can choose a normalization method from the following options:
Total count normalization (the 'Frequency % (Total Count)' method) is a simple way to compare samples: the raw frequency counts are scaled by the total number of sequences in each sample relative to the other sample. In effect, it compares the raw frequency percentage of each cluster. However, this is prone to problems. For example, if one sample has a single new cluster that makes up 50% of its sequences, all other clusters will appear to have half their usual frequency, even though they are not actually any less frequent.
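The dilution effect described above can be checked with a few lines of arithmetic. This is a minimal sketch with made-up counts; the cluster names and the `frequency_percent` helper are hypothetical.

```python
def frequency_percent(counts):
    """Express each cluster count as a percentage of the sample total."""
    total = sum(counts.values())
    return {cluster: 100.0 * n / total for cluster, n in counts.items()}


# Sample B is identical to sample A except for one new, very large cluster.
sample_a = {"c1": 500, "c2": 300, "c3": 200, "new": 0}
sample_b = {"c1": 500, "c2": 300, "c3": 200, "new": 1000}

freq_a = frequency_percent(sample_a)
freq_b = frequency_percent(sample_b)

for cluster in sample_a:
    print(cluster, freq_a[cluster], freq_b[cluster])
```

Cluster `c1` has 500 sequences in both samples, yet its frequency drops from 50% to 25% because the new cluster accounts for half of sample B.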
Total count excluding upper quartile
This is another approach used in differential gene expression normalization, and it is usually better suited to immune system data comparison than the DESeq2 normalization method (Median of frequency ratios, described below). This is because samples often share only a few regions that have significant numbers of sequences, so the median ratio of those regions is quite sensitive to a change in only a few sequence frequencies.
The choice of normalization method affects both the normalized ratio of each cluster and its P-Value.
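The exact calculation behind this option is not described here. One plausible reading, assumed purely for illustration, is that each sample's scaling total is the sum of its cluster counts after the largest quarter of clusters has been dropped. The `trimmed_total` helper and the counts below are hypothetical.

```python
def trimmed_total(counts, fraction=0.25):
    """Sum of counts after dropping the largest `fraction` of clusters.

    Assumed interpretation of 'total count excluding upper quartile';
    the tool's real calculation may differ.
    """
    ordered = sorted(counts)
    n_drop = int(len(ordered) * fraction)
    return sum(ordered[: len(ordered) - n_drop])


# 12 clusters per sample; the first entry is a cluster found only in sample B.
sample_a = [0] + [10] * 10 + [900]
sample_b = [1000] + [10] * 10 + [900]

plain_ratio = sum(sample_b) / sum(sample_a)
trimmed_ratio = trimmed_total(sample_b) / trimmed_total(sample_a)

print(plain_ratio, trimmed_ratio)
```

Under this assumption the single dominant cluster doubles the plain total of sample B, while the trimmed totals stay close to each other, so the remaining clusters are scaled far less.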
Median of frequency ratios
This method uses the DESeq2 approach to sample normalization for differential gene expression, with an additional heuristic that excludes clusters with very low frequencies when calculating the normalization ratio. This is because the median cluster frequency is often only 1 or 2 sequences, which can lead to inaccurate normalization ratios. For example, if the median cluster has 1 sequence in one sample and 2 sequences in the other, this would produce a normalization ratio of 2. Instead, the median ratio is taken over those clusters where both samples have at least as many sequences as specified by the 'of frequencies at least' setting.
Note that the Median of frequency ratios normalization method is not supported for comparisons of more than 2 Biologics Annotator Result documents.
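The effect of the minimum-frequency heuristic can be sketched for the two-sample case as follows. This is a simplification, not the tool's implementation: it takes the median of direct per-cluster ratios between two samples, and the `median_ratio` function, its `min_frequency` parameter (standing in for the 'of frequencies at least' setting) and the counts are all hypothetical.

```python
from statistics import median


def median_ratio(sample_a, sample_b, min_frequency=1):
    """Median of per-cluster count ratios (b / a), using only clusters where
    both samples have at least `min_frequency` sequences."""
    ratios = [
        sample_b[c] / sample_a[c]
        for c in sample_a
        if c in sample_b
        and sample_a[c] >= min_frequency
        and sample_b[c] >= min_frequency
    ]
    if not ratios:
        raise ValueError("no cluster meets the minimum frequency in both samples")
    return median(ratios)


# Three tiny clusters and two well-populated ones.
sample_a = {"c1": 1, "c2": 1, "c3": 1, "c4": 200, "c5": 400}
sample_b = {"c1": 2, "c2": 2, "c3": 3, "c4": 210, "c5": 380}

naive = median_ratio(sample_a, sample_b, min_frequency=1)
thresholded = median_ratio(sample_a, sample_b, min_frequency=50)

print(naive, thresholded)
```

Without a threshold the median is dominated by clusters with 1 to 3 sequences and comes out at 2. With a threshold of 50, only the well-populated clusters contribute and the ratio is close to 1.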
To reduce sequence redundancy, you can cluster similar sequences based on a specific region and threshold by selecting the Group similar sequences across all samples option. Please refer to the following article to learn more about sequence clustering.
This option allows you to set a reference sample for sample comparisons. To set a reference sample, select one of the Biologics Annotator Result documents in the Reference sample dropdown.
- Total: The count that is used to calculate the normalization scale, and the value that each count is divided by when calculating the Frequency %.
- Count before filtering: The number of sequences prior to filtering.
- Count: The number of sequences used in the comparison.
- Score: A higher score indicates that the differences between counts in this cluster are more interesting. This is a combination of the normalized count ratio and the p-value. For example, a normalized count change from 100 to 200 (ratio 2) is more interesting than a normalized count change from 1 to 4 (ratio 4), because the latter case is likely to have happened by chance rather than reflecting any real difference between the samples.
- P-value: The probability that the difference between observed normalized counts would happen by chance if there is actually no difference in the levels of these clones between the samples.
- Adjusted P-Value: The P-Value adjusted upwards to account for the fact that many different clusters are being analyzed. For example, when there are 1000 clusters, we would expect about one of them to have a p-value of 0.001 or lower by chance even when there is no actual difference between the samples.
- Log2 Normalized Count Ratio: The base 2 log of the normalized ratio that corresponds to the normalization method selected in the options.
- Frequency %: The percentage of reads in this cluster out of the total number of reads from the sample used during analysis. If the `Only use sequences that are fully annotated, in frame, and without stop codons` setting is off, this total is the number of reads in the sample. If that setting is on, then the frequency is out of all reads in the sample which meet those conditions.
- Normalized Frequency %: This is the normalized count as a percentage of the 'total sample count adjusted according to frequency % normalization'. If the normalization method is 'Frequency % (Total Count)' then the 'Normalized Frequency %' will be identical to the 'Frequency %'.
- Frequency % Ratio / Total Count Excluding Upper Quartile Normalized Ratio / Median Ratio Normalized Ratio: The ratio between the normalized counts for each of the 3 normalization methods. When a normalized count is less than 1, both counts are incremented by the same amount before the ratio is calculated, so that both values are at least 1. For example, if the normalized counts are 0.3 and 1.2, both are first increased by 0.7 to give 1.0 and 1.9, and the ratio is then 1.9.
- Normalized Count: The raw count of the number of reads in this cluster, scaled according to the normalization method. For example, if the 'frequency % (total count)' normalization method is used (which is not recommended), the ratio between the normalized counts of two samples will be the same as the ratio between the frequency % of those two samples.
- Max Ratio: Only shown when analyzing more than two samples. This is the most extreme normalized count ratio between any two samples.
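The ratio columns in the list above can be reproduced with a short calculation. The sketch below follows the description of the shift applied when a normalized count is below 1. The function names are hypothetical, and treating "most extreme" as the largest fold change in either direction is an assumption.

```python
import math
from itertools import combinations


def normalized_ratio(a, b):
    """Ratio b / a after shifting both counts so the smaller is at least 1."""
    shift = max(0.0, 1.0 - min(a, b))
    return (b + shift) / (a + shift)


def max_ratio(counts):
    """Largest pairwise ratio across samples (assumed meaning of Max Ratio)."""
    return max(
        normalized_ratio(min(a, b), max(a, b))
        for a, b in combinations(counts, 2)
    )


ratio = normalized_ratio(0.3, 1.2)          # 0.3 and 1.2 become 1.0 and 1.9
log2_ratio = math.log2(normalized_ratio(100, 200))
extreme = max_ratio([0.3, 1.2, 8.0])        # three samples, one cluster

print(ratio, log2_ratio, extreme)
```

Without the shift, the counts 0.3 and 1.2 would give a ratio of 4, overstating a difference that amounts to less than one read.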
Querying comparison results
The comparison tables can be filtered as needed. In addition to standard SQL filtering, scripts can be used to search for differences between samples. Please refer to the following article (under Filtering using scripts) to learn more about using scripts to filter comparison tables.
The "Frequency Graph" tab can be used to compare the relative frequencies of clusters between samples. For example, the picture below shows how you could identify trends in the Heavy CDR3 region across 5 successive panning rounds. The Heavy CDR3 regions selected in the table are shown in the graph below it. These regions all have high fold change values, and their increased frequency indicates that they have been enriched in the later panning rounds compared to the original sample.