enclone has some tools for understanding D gene assignment and junction region structure.
enclone can assign D genes to each IGH or TRB exact subclonotype, independent of the assignment made by Cell Ranger. Every such exact subclonotype is assigned the optimal D gene, or two D genes (in a VDDJ) configuration, or none, depending on score. The none case is applied only when no insertion is observed.
In general, although D genes are always assigned, they cannot be assigned confidently.
β’ This is a consequence of the biology: D genes are short, and junction regions can be
heavily edited during SHM and through non-templated indels during VDJ recombination, so in
general it is just not possible to know.
β’ It is possible that where we align a D gene to given transcript bases, it is not the right
D gene, or that the transcript bases represent some other part of the genome (not a D gene
at all), or even random bases that were created during formation of the junction region.
β’ The reason we make these assignments, even though they are not confident, is that in general
they allow one to better understand what happened during junction
region rearrangement, even though that understanding is often incomplete.
β’ D gene assignments are not guaranteed to be consistent across a clonotype.
β’ We have no way of knowing the true error rate in D gene assignment. However because on
very large data sets we observe an inconsistency rate within clonotypes of
~13%
, we very roughly estimate the error rate for D gene assignment at
5-15%
. Note of course that the true rate would likely
depend on sample properties including the rate of SHM.
There are variables that show the best and second best D gene assignment, and the difference
in score between them, see
enclone help cvars
. Here is an
example:
enclone BCR=123085 CVARS=d1_name,d2_name,d_Ξ CDR3=CTRDRDLRGATDAFDIW
[1] GROUP = 1 CLONOTYPES = 51 CELLS
[1.1] CLONOTYPE = 51 CELLS
βββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββ
β β CHAIN 1 β CHAIN 2 β
β β 144.1.2|IGHV3-49 β 53|IGHJ3 β 279|IGKV3-11 β 217|IGKJ5β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββ€
β β 1 11111111111111111 1 β 1111111111111 β
β β 51 11112222222222333 4 β 6 0001111111111 β
β β 53 67890123456789012 1 β 4 7890123456789 β
β β βββββββCDR3ββββββ β βββββCDR3ββββ β
βreference β VV β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W S β R CQQβ¦β¦β¦β¦β¦β¦β¦β¦β¦β¦ β
βdonor ref β FV β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W S β R CQQβ¦β¦β¦β¦β¦β¦β¦β¦β¦β¦ β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββ€
β# n β .x ................. x d1_name d2_name d_Ξ β x .x........... β
β1 46 β FV CTRDRDLRGATDAFDIW S IGHD1-14 IGHD1-26 4.0 β R CQQRSNWPPSITF β
β2 3 β FM CTRDRDLRGATDAFDIW S IGHD1-14 IGHD1-26 4.0 β R CHQRSNWPPSITF β
β3 1 β FV CTRDRDLRGATDAFDIW S IGHD1-14 IGHD1-26 4.0 β S CQQRSNWPPSITF β
β4 1 β FV CTRDRDLRGATDAFDIW S IGHD1-14 IGHD1-26 4.0 β R CQQRSNWPPSITF β
βββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββ
In this example, the D genes are assigned consistently across the clonotype. Here is an example
where they are assigned inconsistently:
enclone BCR=123085 CVARS=d1_name,d2_name,d_Ξ CDR3=CAREGGVGVVTATDWYFDLW COMPLETE
[1] GROUP = 1 CLONOTYPES = 6 CELLS
[1.1] CLONOTYPE = 6 CELLS
βββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββ
β β CHAIN 1 β CHAIN 2 β
β β 146.1.1|IGHV3-53 β 51|IGHJ2 β 279|IGKV3-11 β 215|IGKJ3β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββ€
β β 11111111111111111111 β 1111111111111 β
β β 12 11111112222222222333 β 0001111111111 β
β β 35 34567890123456789012 β 7890123456789 β
β β ββββββββCDR3ββββββββ β βββββCDR3ββββ β
βreference β ST β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦LW β CQQβ¦β¦β¦β¦β¦β¦β¦β¦β¦β¦ β
βdonor ref β LS β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦LW β CQQβ¦β¦β¦β¦β¦β¦β¦β¦β¦β¦ β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββ€
β# n β .. ...........x........ d1_name d2_name d_Ξ β ............. β
β1 5 β LS CAREGGVGVVTATDWYFDLW IGHD2-21 IGHD2-15 13.7 β CQQRSNWPPLFTF β
β2 1 β LS CAREGGVGVVTTTDWYFDLW IGHD2-15 IGHD3-22 1.5 β CQQRSNWPPLFTF β
βββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββ
enclone can compute the rate at which D genes are inconsistently assigned across all the data. This is the probability, given two different exact subclonotypes from the same clonotype, that their heavy chains are assigned different D genes. Here you can see the rate (at the bottom of the summary):
enclone BCR=123085 GVARS=d_inconsistent_%,d_inconsistent_n NOPRINT SUMMARY
SUMMARY STATISTICS
1. overall
β’ number of datasets = 1
β’ number of donors = 1
β’ original number of cells = 1654
2. barcode fate
ββββββββββββ¬βββββββββββββββββββββββββββββ
βbarcodes β why deleted β
ββββββββββββΌβββββββββββββββββββββββββββββ€
β 173 β failed UMI filter β
β 54 β failed IMPROPER filter β
β 25 β failed FOURSIE_KILL filterβ
β 11 β failed QUAL filter β
β 8 β failed GRAPH_FILTER filterβ
β 5 β failed WHITEF filter β
β 4 β failed WEAK_CHAINS filter β
β 1 β failed DOUBLET filter β
β 281 β total β
ββββββββββββ΄βββββββββββββββββββββββββββββ
3. for the selected clonotypes
ββββββββββ¬βββββββββββββββββββββββββ¬βββββββββββββββββββ¬ββββββββ
βchains β clonotypes with this β cells in these β %β
β β number of chains β clonotypes β β
ββββββββββΌβββββββββββββββββββββββββΌβββββββββββββββββββΌββββββββ€
β1 β 139 β 140 β 10.2β
β2 β 295 β 959 β 69.8β
β3 β 16 β 262 β 19.1β
β4 β 4 β 12 β 0.9β
βtotal β 454 β 1373 β 100.0β
ββββββββββ΄βββββββββββββββββββββββββ΄βββββββββββββββββββ΄ββββββββ
β’ number of clonotypes having at least two cells = 144
β’ number of intradonor cell-cell merges = 919
β’ number of intradonor cell-cell merges (quadratic) = 25,915
β’ number of cells having 1 chain = 218
β’ number of cells having 2 or 3 chains = 1152
β’ estimated B-B doublet rate = 0.3% = 3/942 = cells with 4 chains / cells with 2 or 4 chains
β’ mean over middle third of contig UMI counts (heavy chain) = 331.32
β’ mean over middle third of contig UMI counts (light chain) = 1860.48
β’ mean over middle third of cell UMI counts for cells having two chains = 3141.60
β’ mean UMIs per cell = 5845.16
β’ mean UMIs per cell having two chains = 6276.57
β’ for reads contributing to UMIs in reported chains, mean reads per UMI = 6.25
GLOBAL VARIABLES
d_inconsistent_% = 2.28
d_inconsistent_n = 747
The second variable is the sample size: the number of pairs of exact subclonotypes that were examined.
The inconsistency rate for this dataset is deceptively low, perhaps because the sample size is
too small. For larger datasets we see a rate of around 13%, however the rate likely depends on
the particular samples, and not just the sample size. The option D_INCONSISTENT
can
be used to show only those clonotypes having D gene assignment inconsistencies.
For any chain in any exact subclonotype, enclone can display the alignment of the entire V..J
sequence to the reference V..J sequence, and it can also display the alignment of just the
junction region (extended by a small and arbitrary amount on both ends to get the display to
work). This feature is enabled by adding
ALIGN<n>
or
JALIGN<n>
to the command line, where n
is the chain number. It
displays one alignment per exact subclonotype, so for brevity we'll show examples where there is
just one. Here is an example for the full alignment:
enclone BCR=123085 ALIGN1 CDR3=CARYIVVVVAATINVGWFDPW CVARSP=d1_name
[1] GROUP = 1 CLONOTYPES = 1 CELLS
[1.1] CLONOTYPE = 1 CELLS
βββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββ
β β CHAIN 1 β CHAIN 2 β
β β 190.1.1|IGHV4-39 β 13|IGHD2-15 β 57|IGHJ5 β 219|IGKV1-12 β 216|IGKJ4 β
β βββββββββββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ€
β β 111111111111111111111 β 11111111111 β
β β 135 222222223333333333444 β 369 01111111111 β
β β 648 234567890123456789012 β 155 90123456789 β
β β βββββββββCDR3ββββββββ β ββββCDR3βββ β
βreference β LPS β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W β SPT CQQβ¦β¦β¦β¦β¦β¦β¦β¦ β
βdonor ref β VPS β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W β SPT CQQβ¦β¦β¦β¦β¦β¦β¦β¦ β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ€
β# n β ... ..................... u const d1_name β ... ........... u constβ
β1 1 β VPG CARYIVVVVAATINVGWFDPW 4 IGHG1 IGHD2-15 β SPT CQQANSFPLTF 2 IGKC β
βββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββ
CHAIN 1 OF #1 β’ CONCATENATED VDJ REFERENCE β’ D = 1ST = IGHD2-15
ATGGATCTCATGTGCAAGAAAATGAAGCACCTGTGGTTCTTCCTCCTGGTGGTGGCGGCTCCCAGATGGGTCCTGTCCCAGCTGCAGCTGCAGGAGTCGG
ATGGATCTCATGTGCAAGAAAATGAAGCACCTGTGGTTCTTCCTCCTGGTGGTGGCGGCTCCCAGATGGGTCCTGTCCCAGCTGCAGCTGCAGGAGTCGG
* *
GCCCTGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTAGTGGTTACTACTGGGGCTGGATCCGCCA
GCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTAGTAGTTACTACTGGGGCTGGATCCGCCA
GCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCCGTAGAC
GCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCCGTAGAC
* *||||
ACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCTGTGTATTTCTGTGCGAGAT ATATTGTAGTGGTGGTAGCT
ACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCTGTGTATTACTGTGCGAGACAAGGATATTGTAGTGGTGGTAGCT
**|||||****
GCTACTATAAACGTAGGCTGGTTCGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG
GCTACTCC ACAACTGGTTCGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG
And here is the same example, but showing just the junction region:
enclone BCR=123085 JALIGN1 CDR3=CARYIVVVVAATINVGWFDPW CVARSP=d1_name
[1] GROUP = 1 CLONOTYPES = 1 CELLS
[1.1] CLONOTYPE = 1 CELLS
βββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββ
β β CHAIN 1 β CHAIN 2 β
β β 190.1.1|IGHV4-39 β 13|IGHD2-15 β 57|IGHJ5 β 219|IGKV1-12 β 216|IGKJ4 β
β βββββββββββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ€
β β 111111111111111111111 β 11111111111 β
β β 135 222222223333333333444 β 369 01111111111 β
β β 648 234567890123456789012 β 155 90123456789 β
β β βββββββββCDR3ββββββββ β ββββCDR3βββ β
βreference β LPS β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W β SPT CQQβ¦β¦β¦β¦β¦β¦β¦β¦ β
βdonor ref β VPS β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W β SPT CQQβ¦β¦β¦β¦β¦β¦β¦β¦ β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ€
β# n β ... ..................... u const d1_name β ... ........... u constβ
β1 1 β VPG CARYIVVVVAATINVGWFDPW 4 IGHG1 IGHD2-15 β SPT CQQANSFPLTF 2 IGKC β
βββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββ
CHAIN 1 OF #1 β’ CONCATENATED VDJ REFERENCE β’ D = 1ST = IGHD2-15
C A R Y I V V V V A A T I N V G W F D P
* *|||| **|||||****
TCTGTGCGAGAT ATATTGTAGTGGTGGTAGCTGCTACTATAAACGTAGGCTGGTTCGACCCC
ACTGTGCGAGACAAGGATATTGTAGTGGTGGTAGCTGCTACTCC ACAACTGGTTCGACCCC
For JALIGN
, we show a line of amino acids that represent the translation of
bases in the exact subclonotype sequence. The amino acid lies over the middle base of the
corresponding codon.
There are also options ALIGN_2ND<n>
and JALIGN_2ND<n>
to instead use the second best D segments.
Here is an example showing a VDDJ clonotype:
enclone BCR=165808 JALIGN1 CDR3=CARAYDILTGYYERGYSYGWGFDYW
[1] GROUP = 1 CLONOTYPES = 1 CELLS
[1.1] CLONOTYPE = 1 CELLS
βββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββ
β β CHAIN 1 β CHAIN 2 β
β β 122.1.1|IGHV3-23 β 22|IGHD3-9 β 55|IGHJ4 β 331|IGLV1-51 β 311|IGLJ2 β
β βββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββββββ€
β β 1111111111111111111111111 β 1111111111111 11 β
β β 2677 1111112222222222333333333 β 66 0001111111111 22 β
β β 3857 4567890123456789012345678 β 45 7890123456789 48 β
β β βββββββββββCDR3ββββββββββ β βββββCDR3ββββ β
βreference β VASY β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦ β KL CGTWDβ¦β¦β¦β¦β¦β¦β¦β¦ KL β
βdonor ref β LASY β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦ β KL CGTWDβ¦β¦β¦β¦β¦β¦β¦β¦ KL β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββββββ€
β# n β .... ......................... u const β .. ............. .. u constβ
β1 1 β LTNY CARAYDILTGYYERGYSYGWGFDYW 16 IGHM β KF CGTWDSSLSAVVF TP 15 IGLC2β
βββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββ
CHAIN 1 OF #1 β’ CONCATENATED VDDJ REFERENCE β’ D = 1ST = IGHD3-9:IGHD5-18
C A R A Y D I L T G Y Y E R G Y S Y G W G F D Y W
* |||* ||||| *||***
ACTGTGCGAGAG CTTACGATATTTTGACTGGTTATTACGAACGTGGATACAGCTATGGTTG GGGCTTTGACTACTGG
ACTGTGCGAAAGAGTATTACGATATTTTGACTGGTTATTA GTGGATACAGCTATGGTTACACTACTTTGACTACTGG
For VDDJ clonotypes, we do not check that the two D genes are in order on the genome, which would make sense biologically. We do not carry out the check because an individual genome might be rearranged, and in an case, we are simply reporting what we observe.
The problems we are solving here are to (a) pick the "best" reference D segment, in the case of IGH or TRB, and (b) exhibit the "correct" alignment of the transcript to the concatenated reference.
The algorithm aligns the V(D)J region on a transcript to the concatenated V(D)J reference, allowing for each possible D reference segment (or the null D segment, or DD), in the case of IGH or TRB. These alignments are carried out using the following scoring scheme:
case | score |
match | 2 |
mismatch | -2 |
gap open for insertion between V/D/J segments | -4 |
gap open for deletion bridging V/D/J segments | -4 |
gap open (otherwise) | -12 |
gap extend for insertion between V/D/J segments | -1 |
gap extend (otherwise) | -2 |
To the score from this, we add 2.2
times a "bit score" for the alignment,
defined as -log2
of the probability that a random DNA sequence of length n will match
a given DNA sequence with β€ k mismatches = sum{l=0..=k} (n choose l) * 3^l / 4^n
.
The alignment that is produced is approximately optimal, relative to this scoring scheme. It is not exactly optimal because we first produce an alignment using a Smith-Waterman algorithm, which does not fully incorporate the complexity of the scoring scheme, and then edit both the alignment and its score.
Then the D segment having the highest score is selected, arbitrarily selecting a winner in the case of a tie.
The following were optimized in designing the algorithm:
In more detail, here is how we assess the inconsistency rate. There are two issues. First,
if one allows clonotypes having a large number of exact subclonotypes, then measurement is noisy
because a single clonotype can overly influence the rate. For this reason, we restrict to
clonotypes having at most 10
exact subclonotypes. Second, for experimental purposes,
it is too slow to keep recomputing using a very large dataset. Therefore we do the following
(with a large dataset substituted in):
enclone BCR=... SUBSET_JSON=subset/outs/all_contig_annotations.json NOPRINT MIN_EXACTS=2 MAX_EXACTS=10
which is slow, followed by:
enclone BCR=subset GVARS=d_inconsistent_%,d_inconsistent_n NOPRINT
which is much faster, to get an inconsistency rate.