Repertoire comparison by next-generation sequencing gives insight into the breadth of the antibody repertoire used in an immune response. Antibody repertoire comparisons can be performed using the Compare Results operation.
To compare results: (1) select two or more Biologics Annotator Result documents, (2) click Post-processing, and (3) click Compare Results in the dropdown (see image below).
In order to compare samples, normalization or scaling is often performed to remove variation between samples that prevents direct comparison of data. To perform normalization, select a normalization method from the Normalization Method dropdown. Please refer to the section below to learn more about the normalization methods.
Note that the Median of frequency ratios normalization method is not supported for comparisons of more than 2 Biologics Annotator Result documents.
Depending on your input sequences, some sequences may not be fully annotated and some may contain stop codons. These sequences can influence the comparison analysis because they contribute to the normalization calculations. To compare only sequences that are fully annotated, in frame, and without stop codons, select the `Only use sequences that are fully annotated, in frame, and without stop codons` option.
To start the Compare Results operation, click Run. This operation will produce a Biologics Comparison Result document.
Frequency % (Total Count)
This is a simple way to compare samples: the raw frequency counts are scaled by the total number of sequences in one sample relative to the other. Total count normalization effectively just compares the raw frequency percentages of each cluster. However, this is prone to bias. For example, if one sample has a single new cluster that makes up 50% of its sequences, then all other clusters will appear to have half their usual frequency, even though they are not actually any less frequent.
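The bias described above can be reproduced in a few lines of Python. This is a minimal sketch, not the software's implementation: the function name `total_count_normalize`, the cluster names, and the counts are all made up for illustration.

```python
def total_count_normalize(counts_a, counts_b):
    """Scale sample B's raw counts so that its total matches sample A's total."""
    scale = sum(counts_a.values()) / sum(counts_b.values())
    return {cluster: n * scale for cluster, n in counts_b.items()}

# Hypothetical data: sample B is identical to sample A except for one
# new cluster that makes up 50% of its sequences.
sample_a = {"c1": 400, "c2": 300, "c3": 200, "c4": 100}  # 1000 sequences
sample_b = dict(sample_a, new=1000)                      # 2000 sequences

normalized_b = total_count_normalize(sample_a, sample_b)
for cluster in sample_a:
    # Every shared cluster appears at half its frequency in B,
    # although its raw count did not change.
    print(cluster, sample_a[cluster], normalized_b[cluster])
```

Each shared cluster is reported at half its count in sample B, which is the artifact that the other two normalization methods try to avoid.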
Median of frequency ratios
To reduce the bias described above, this method uses the DESeq2 approach to sample normalization from differential gene expression analysis, with an additional heuristic that excludes clusters with very low frequencies when calculating the normalization ratio. This is because the median cluster frequency is often only 1 or 2 sequences, which can lead to inaccurate normalization ratios. For example, if the median cluster has 1 sequence in one sample and 2 sequences in the other, this would produce a normalization ratio of 2. Instead, the median ratio is taken over only those clusters where both samples have at least as many sequences as specified by the 'of frequencies at least' setting.
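The effect of the low-frequency cutoff can be illustrated with a simplified two-sample sketch. It is not the DESeq2 algorithm itself nor the software's code: the function `median_ratio`, its `min_count` parameter (standing in for the 'of frequencies at least' setting), and the counts are assumptions made for illustration.

```python
from statistics import median

def median_ratio(counts_a, counts_b, min_count=1):
    """Median of per-cluster ratios B/A over clusters where both samples
    have at least min_count sequences."""
    ratios = [
        counts_b[cluster] / counts_a[cluster]
        for cluster in counts_a
        if cluster in counts_b
        and counts_a[cluster] >= min_count
        and counts_b[cluster] >= min_count
    ]
    if not ratios:
        raise ValueError("no cluster meets the minimum count in both samples")
    return median(ratios)

# Hypothetical data: three large clusters that barely change, plus four
# tiny clusters that go from 1 to 2 sequences.
sample_a = {"c1": 500, "c2": 300, "c3": 200, "c4": 1, "c5": 1, "c6": 1, "c7": 1}
sample_b = {"c1": 510, "c2": 300, "c3": 190, "c4": 2, "c5": 2, "c6": 2, "c7": 2}

print(median_ratio(sample_a, sample_b, min_count=1))   # dominated by the tiny clusters
print(median_ratio(sample_a, sample_b, min_count=10))  # based on the large clusters only
```

Without the cutoff, the tiny clusters set the normalization ratio to 2. With a cutoff of 10, only the three large clusters contribute and the ratio is 1.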
Total count excluding upper quartile (recommended)
This is another approach used in differential gene expression normalization, and it is usually better suited to immune system data comparison than the DESeq2 normalization method. Often only a few regions shared between samples have significant numbers of sequences, and the median ratio of so few regions is quite sensitive to a change in just a few sequence frequencies.
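A rough sketch of the idea follows. The exact calculation used by the software is not described here, so this code assumes one plausible reading of the method name: clusters whose count lies above the 75th percentile of their sample are left out of the total used for scaling. The function `trimmed_total` and the counts are hypothetical.

```python
from statistics import quantiles

def trimmed_total(counts):
    """Sum the counts of clusters at or below the sample's upper quartile.

    Assumption: "excluding upper quartile" means dropping clusters whose
    count is above the 75th percentile of that sample's cluster counts.
    """
    values = list(counts.values())
    upper_quartile = quantiles(values, n=4, method="inclusive")[2]
    return sum(v for v in values if v <= upper_quartile)

# Hypothetical data: sample B equals sample A except that cluster c1
# has expanded tenfold.
sample_a = {"c1": 500, "c2": 450, "c3": 60, "c4": 50,
            "c5": 40, "c6": 30, "c7": 20, "c8": 10}
sample_b = dict(sample_a, c1=5000)

plain_scale = sum(sample_a.values()) / sum(sample_b.values())
trimmed_scale = trimmed_total(sample_a) / trimmed_total(sample_b)
print(round(plain_scale, 3))  # total count scaling shrinks every cluster in B
print(trimmed_scale)          # trimmed scaling leaves unchanged clusters unchanged
```

Because the expanded cluster falls in the upper quartile, it does not affect the scaling factor, and the unchanged clusters keep their counts after normalization.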
The choice of the normalization method used affects what the normalized ratio of each cluster will be as well as the P-Value.
P-Values are used to indicate whether or not the difference in size between the two clusters is statistically significant. When comparing samples, we may not be interested in cases where the frequencies differ only by a ratio of 2, for example. However, the P-Values calculated in such a case could still indicate a statistically significant difference.
The minimum ratio setting assumes that any ratio smaller than the specified value is not significant, and it reduces the significance of larger ratios accordingly when calculating P-Values.
For example, if this setting is 2, and a region has a ratio of 2.5, the calculated P-Value would be similar to that of a case where this setting is 1 and the region ratio is 1.5. This minimum ratio is applied equally in both directions, so if this setting is 2, then a ratio of 0.5 would not be considered significant either.
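The example above can be mimicked with a small sketch. The actual formula used when calculating P-Values is not given here; the subtractive shift below is an assumption chosen only because it reproduces the numbers in the example, and the function name `effective_ratio` is hypothetical.

```python
def effective_ratio(ratio, min_ratio):
    """Ratio that would be tested for significance after applying min_ratio.

    Assumption: ratios below 1 are inverted so both directions are treated
    equally, then the ratio is shifted down by (min_ratio - 1) and clamped at 1.
    """
    fold = ratio if ratio >= 1 else 1 / ratio
    return max(1.0, fold - (min_ratio - 1.0))

print(effective_ratio(2.5, 2))  # behaves like a ratio of 1.5 with a setting of 1
print(effective_ratio(1.5, 1))
print(effective_ratio(0.5, 2))  # a ratio of 0.5 is not significant when the setting is 2
```

An effective ratio of 1 corresponds to no difference, so any ratio within the minimum in either direction contributes no significance under this sketch.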
- Score: A higher score indicates that the differences between counts in this cluster are more interesting. This is a combination of the normalized count ratio and the P-Value. For example, a normalized count change from 100 to 200 (ratio 2) is more interesting than a normalized count change from 1 to 4 (ratio 4), because the latter case is likely to have happened by chance rather than reflecting any real difference between the samples.
- P-Value: The probability that the difference between observed normalized counts would happen by chance if there is actually no difference in the levels of these clones between the samples.
- Adjusted P-Value: The P-Value adjusted upwards to account for the fact that many different clusters are being analyzed. For example, when there are 1000 clusters, we would expect one of these to have a P-Value of 0.001 by chance when there is no actual difference between the samples.
- Log2 Normalized Count Ratio: The base 2 log of one of the normalized count ratios described in this list, depending on which normalization method was selected in the options.
- Frequency %: The percentage of reads in this cluster out of the total number of reads from the sample used during analysis. If the `Only use sequences that are fully annotated, in frame, and without stop codons` setting is off, this total is the number of reads in the sample. If that setting is on, then the frequency is out of all reads in the sample which meet those conditions.
- Normalized Frequency %: This is the normalized count as a percentage of the 'total sample count adjusted according to frequency % normalization'. If the normalization method is 'Frequency % (Total Count)' then the 'Normalized Frequency %' will be identical to the 'Frequency %'.
- Frequency % Ratio / Total Count Excluding Upper Quartile Normalized Ratio / Median Ratio Normalized Ratio: The ratio between the normalized counts for each of the 3 normalization methods. When a normalized count is less than 1, for the purposes of calculating a ratio, both counts are incremented by the same amount to ensure both values are at least 1. For example, if the normalized counts are 0.3 and 1.2, then both are first increased by 0.7 to give 1.0 and 1.9, and the ratio is then calculated as 1.9.
- Normalized Count: The raw count of the number of reads in this cluster, scaled according to the normalization method. For example, if the 'Frequency % (Total Count)' normalization method is used (which is not recommended), then the ratio between the normalized counts of two samples will be the same as the ratio between the Frequency % of those two samples.
- Max Ratio: Only shown when analyzing more than two samples. This is the most extreme normalized count ratio between any two samples.
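The ratio, Log2 Normalized Count Ratio, and Max Ratio columns described above can be sketched as follows. This is an illustration under stated assumptions, not the software's code: the function names are hypothetical, the ratio is taken as the second count over the first, and Max Ratio is assumed to be reported as the largest count over the smallest.

```python
import math

def count_ratio(count_a, count_b):
    """Ratio count_b / count_a after raising both counts by the same amount
    so that neither is below 1."""
    shift = max(0.0, 1.0 - min(count_a, count_b))
    return (count_b + shift) / (count_a + shift)

def max_ratio(counts):
    """Most extreme ratio between any two samples (assumed: largest over smallest)."""
    return count_ratio(min(counts), max(counts))

ratio = count_ratio(0.3, 1.2)         # both counts are raised by 0.7 first
print(round(ratio, 2))                # ratio between the two normalized counts
print(round(math.log2(ratio), 3))     # Log2 Normalized Count Ratio
print(round(max_ratio([0.3, 1.2, 5.0]), 2))  # three samples
```

Because every count receives the same increment, the most extreme pair is always the smallest and largest normalized counts, so `max_ratio` does not need to check every pair.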