D genes and junction regions

enclone has some tools for understanding D gene assignment and junction region structure.

enclone can assign D genes to each IGH or TRB exact subclonotype, independent of the assignment made by Cell Ranger. Every such exact subclonotype is assigned the optimal D gene, or two D genes (in a VDDJ configuration), or none, depending on score. The none case is applied only when no insertion is observed.

In general, although D genes are always assigned, they cannot be assigned confidently.
• This is a consequence of the biology: D genes are short, and junction regions can be heavily edited during SHM and through non-templated indels during VDJ recombination, so in general it is just not possible to know.
• It is possible that where we align a D gene to given transcript bases, it is not the right D gene, or that the transcript bases represent some other part of the genome (not a D gene at all), or even random bases that were created during formation of the junction region.
• The reason we make these assignments, even though they are not confident, is that in general they allow one to better understand what happened during junction region rearrangement, even though that understanding is often incomplete.
• D gene assignments are not guaranteed to be consistent across a clonotype.
• We have no way of knowing the true error rate in D gene assignment. However, because on very large datasets we observe an inconsistency rate within clonotypes of ~13%, we very roughly estimate the error rate for D gene assignment at 5-15%. Note that the true rate would likely depend on sample properties, including the rate of SHM.

There are variables that show the best and second best D gene assignments, and the difference in score between them; see enclone help cvars. Here is an example:

enclone BCR=123085 CVARS=d1_name,d2_name,d_Δ CDR3=CTRDRDLRGATDAFDIW

[1] GROUP = 1 CLONOTYPES = 51 CELLS

[1.1] CLONOTYPE = 51 CELLS
┌───────────┬───────────────────────────────────────────────────┬──────────────────────────┐
│           │  CHAIN 1                                          │  CHAIN 2                 │
│           │  144.1.2|IGHV3-49 ◆ 53|IGHJ3                      │  279|IGKV3-11 ◆ 217|IGKJ5│
│           ├───────────────────────────────────────────────────┼──────────────────────────┤
│           │   1 11111111111111111 1                           │    1111111111111         │
│           │  51 11112222222222333 4                           │  6 0001111111111         │
│           │  53 67890123456789012 1                           │  4 7890123456789         │
│           │     ═══════CDR3══════                             │    ═════CDR3════         │
│reference  │  VV ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦W S                           │  R CQQ◦◦◦◦◦◦◦◦◦◦         │
│donor ref  │  FV ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦W S                           │  R CQQ◦◦◦◦◦◦◦◦◦◦         │
├───────────┼───────────────────────────────────────────────────┼──────────────────────────┤
│#   n      │  .x ................. x  d1_name   d2_name   d_Δ  │  x .x...........         │
│1  46      │  FV CTRDRDLRGATDAFDIW S  IGHD1-14  IGHD1-26  4.0  │  R CQQRSNWPPSITF         │
│2   3      │  FM CTRDRDLRGATDAFDIW S  IGHD1-14  IGHD1-26  4.0  │  R CHQRSNWPPSITF         │
│3   1      │  FV CTRDRDLRGATDAFDIW S  IGHD1-14  IGHD1-26  4.0  │  S CQQRSNWPPSITF         │
│4   1      │  FV CTRDRDLRGATDAFDIW S  IGHD1-14  IGHD1-26  4.0  │  R CQQRSNWPPSITF         │
└───────────┴───────────────────────────────────────────────────┴──────────────────────────┘

In this example, the D genes are assigned consistently across the clonotype. Here is an example where they are assigned inconsistently:

enclone BCR=123085 CVARS=d1_name,d2_name,d_Δ CDR3=CAREGGVGVVTATDWYFDLW COMPLETE

[1] GROUP = 1 CLONOTYPES = 6 CELLS

[1.1] CLONOTYPE = 6 CELLS
┌───────────┬─────────────────────────────────────────────────────┬──────────────────────────┐
│           │  CHAIN 1                                            │  CHAIN 2                 │
│           │  146.1.1|IGHV3-53 ◆ 51|IGHJ2                        │  279|IGKV3-11 ◆ 215|IGKJ3│
│           ├─────────────────────────────────────────────────────┼──────────────────────────┤
│           │     11111111111111111111                            │  1111111111111           │
│           │  12 11111112222222222333                            │  0001111111111           │
│           │  35 34567890123456789012                            │  7890123456789           │
│           │     ════════CDR3════════                            │  ═════CDR3════           │
│reference  │  ST ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦LW                            │  CQQ◦◦◦◦◦◦◦◦◦◦           │
│donor ref  │  LS ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦LW                            │  CQQ◦◦◦◦◦◦◦◦◦◦           │
├───────────┼─────────────────────────────────────────────────────┼──────────────────────────┤
│#  n       │  .. ...........x........  d1_name   d2_name    d_Δ  │  .............           │
│1  5       │  LS CAREGGVGVVTATDWYFDLW  IGHD2-21  IGHD2-15  13.7  │  CQQRSNWPPLFTF           │
│2  1       │  LS CAREGGVGVVTTTDWYFDLW  IGHD2-15  IGHD3-22   1.5  │  CQQRSNWPPLFTF           │
└───────────┴─────────────────────────────────────────────────────┴──────────────────────────┘

enclone can compute the rate at which D genes are inconsistently assigned across all the data. This is the probability, given two different exact subclonotypes from the same clonotype, that their heavy chains are assigned different D genes. Here you can see the rate (at the bottom of the summary):

enclone BCR=123085 GVARS=d_inconsistent_%,d_inconsistent_n NOPRINT SUMMARY

SUMMARY STATISTICS
1. overall
   • number of datasets = 1
   • number of donors = 1
   • original number of cells = 1654
2. barcode fate
   ┌──────────┬────────────────────────────┐
   │barcodes  │  why deleted               │
   ├──────────┼────────────────────────────┤
   │     173  │  failed UMI filter         │
   │      54  │  failed IMPROPER filter    │
   │      25  │  failed FOURSIE_KILL filter│
   │      11  │  failed QUAL filter        │
   │       8  │  failed GRAPH_FILTER filter│
   │       5  │  failed WHITEF filter      │
   │       4  │  failed WEAK_CHAINS filter │
   │       1  │  failed DOUBLET filter     │
   │     281  │  total                     │
   └──────────┴────────────────────────────┘
3. for the selected clonotypes
   ┌────────┬────────────────────────┬──────────────────┬───────┐
   │chains  │  clonotypes with this  │  cells in these  │      %│
   │        │      number of chains  │      clonotypes  │       │
   ├────────┼────────────────────────┼──────────────────┼───────┤
   │1       │                   139  │             140  │   10.2│
   │2       │                   295  │             959  │   69.8│
   │3       │                    16  │             262  │   19.1│
   │4       │                     4  │              12  │    0.9│
   │total   │                   454  │            1373  │  100.0│
   └────────┴────────────────────────┴──────────────────┴───────┘
   • number of clonotypes having at least two cells = 144
   • number of intradonor cell-cell merges = 919
   • number of intradonor cell-cell merges (quadratic) = 25,915
   • number of cells having 1 chain = 218
   • number of cells having 2 or 3 chains = 1152
   • estimated B-B doublet rate = 0.3% = 3/942 = cells with 4 chains / cells with 2 or 4 chains
   • mean over middle third of contig UMI counts (heavy chain) = 331.32
   • mean over middle third of contig UMI counts (light chain) = 1860.48
   • mean over middle third of cell UMI counts for cells having two chains = 3141.60
   • mean UMIs per cell = 5845.16
   • mean UMIs per cell having two chains = 6276.57
   • for reads contributing to UMIs in reported chains, mean reads per UMI = 6.25

GLOBAL VARIABLES

d_inconsistent_% = 2.28
d_inconsistent_n = 747

The second variable is the sample size: the number of pairs of exact subclonotypes that were examined.

The inconsistency rate for this dataset is deceptively low, perhaps because the sample size is too small. For larger datasets we see a rate of around 13%; however, the rate likely depends on the particular samples, and not just the sample size. The option D_INCONSISTENT can be used to show only those clonotypes having D gene assignment inconsistencies.
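The definition of the inconsistency rate can be sketched in a few lines of Python. This is only an illustration of the definition, not enclone's implementation: the input format (one list of heavy chain D gene names per clonotype, one name per exact subclonotype) and the function name are our own assumptions, and pairs are simply pooled over all clonotypes.

```python
from itertools import combinations


def d_inconsistency(clonotypes):
    """Return (percent_inconsistent, n_pairs).

    `clonotypes` is a hypothetical input format: one list per clonotype,
    holding the D gene name assigned to the heavy chain of each exact
    subclonotype in that clonotype.
    """
    n_pairs = 0
    n_different = 0
    for d_names in clonotypes:
        # Examine every unordered pair of exact subclonotypes in the clonotype.
        for a, b in combinations(d_names, 2):
            n_pairs += 1
            if a != b:
                n_different += 1
    percent = 100.0 * n_different / n_pairs if n_pairs else 0.0
    return percent, n_pairs


# The two example clonotypes shown earlier: four exact subclonotypes that
# agree, and two that disagree.
example = [
    ["IGHD1-14", "IGHD1-14", "IGHD1-14", "IGHD1-14"],
    ["IGHD2-21", "IGHD2-15"],
]
percent, n = d_inconsistency(example)
print(round(percent, 2), n)  # 14.29 7
```

With only seven pairs, a single disagreement moves the rate by about 14 percentage points, which is why the sample size is reported alongside the rate.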

For any chain in any exact subclonotype, enclone can display the alignment of the entire V..J sequence to the reference V..J sequence, and it can also display the alignment of just the junction region (extended by a small and arbitrary amount on both ends to get the display to work). This feature is enabled by adding ALIGN<n> or JALIGN<n> to the command line, where n is the chain number. It displays one alignment per exact subclonotype, so for brevity we'll show examples where there is just one. Here is an example for the full alignment:

enclone BCR=123085 ALIGN1 CDR3=CARYIVVVVAATINVGWFDPW CVARSP=d1_name

[1] GROUP = 1 CLONOTYPES = 1 CELLS

[1.1] CLONOTYPE = 1 CELLS
┌───────────┬─────────────────────────────────────────────────┬───────────────────────────┐
│           │  CHAIN 1                                        │  CHAIN 2                  │
│           │  190.1.1|IGHV4-39 ◆ 13|IGHD2-15 ◆ 57|IGHJ5      │  219|IGKV1-12 ◆ 216|IGKJ4 │
│           ├─────────────────────────────────────────────────┼───────────────────────────┤
│           │      111111111111111111111                      │      11111111111          │
│           │  135 222222223333333333444                      │  369 01111111111          │
│           │  648 234567890123456789012                      │  155 90123456789          │
│           │      ═════════CDR3════════                      │      ════CDR3═══          │
│reference  │  LPS ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦W                      │  SPT CQQ◦◦◦◦◦◦◦◦          │
│donor ref  │  VPS ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦W                      │  SPT CQQ◦◦◦◦◦◦◦◦          │
├───────────┼─────────────────────────────────────────────────┼───────────────────────────┤
│#  n       │  ... .....................  u  const  d1_name   │  ... ...........  u  const│
│1  1       │  VPG CARYIVVVVAATINVGWFDPW  4  IGHG1  IGHD2-15  │  SPT CQQANSFPLTF  2  IGKC │
└───────────┴─────────────────────────────────────────────────┴───────────────────────────┘

CHAIN 1 OF #1  •  CONCATENATED VDJ REFERENCE  •  D = 1ST = IGHD2-15
                                                                                                    
ATGGATCTCATGTGCAAGAAAATGAAGCACCTGTGGTTCTTCCTCCTGGTGGTGGCGGCTCCCAGATGGGTCCTGTCCCAGCTGCAGCTGCAGGAGTCGG
ATGGATCTCATGTGCAAGAAAATGAAGCACCTGTGGTTCTTCCTCCTGGTGGTGGCGGCTCCCAGATGGGTCCTGTCCCAGCTGCAGCTGCAGGAGTCGG

    *                                                                     *                         
GCCCTGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTAGTGGTTACTACTGGGGCTGGATCCGCCA
GCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTAGTAGTTACTACTGGGGCTGGATCCGCCA

                                                                                                    
GCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCCGTAGAC
GCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCCGTAGAC

                                                                *          *||||                    
ACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCTGTGTATTTCTGTGCGAGAT    ATATTGTAGTGGTGGTAGCT
ACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCTGTGTATTACTGTGCGAGACAAGGATATTGTAGTGGTGGTAGCT

      **|||||****                                               
GCTACTATAAACGTAGGCTGGTTCGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG
GCTACTCC     ACAACTGGTTCGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG

And here is the same example, but showing just the junction region:

enclone BCR=123085 JALIGN1 CDR3=CARYIVVVVAATINVGWFDPW CVARSP=d1_name

[1] GROUP = 1 CLONOTYPES = 1 CELLS

[1.1] CLONOTYPE = 1 CELLS
┌───────────┬─────────────────────────────────────────────────┬───────────────────────────┐
│           │  CHAIN 1                                        │  CHAIN 2                  │
│           │  190.1.1|IGHV4-39 ◆ 13|IGHD2-15 ◆ 57|IGHJ5      │  219|IGKV1-12 ◆ 216|IGKJ4 │
│           ├─────────────────────────────────────────────────┼───────────────────────────┤
│           │      111111111111111111111                      │      11111111111          │
│           │  135 222222223333333333444                      │  369 01111111111          │
│           │  648 234567890123456789012                      │  155 90123456789          │
│           │      ═════════CDR3════════                      │      ════CDR3═══          │
│reference  │  LPS ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦W                      │  SPT CQQ◦◦◦◦◦◦◦◦          │
│donor ref  │  VPS ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦W                      │  SPT CQQ◦◦◦◦◦◦◦◦          │
├───────────┼─────────────────────────────────────────────────┼───────────────────────────┤
│#  n       │  ... .....................  u  const  d1_name   │  ... ...........  u  const│
│1  1       │  VPG CARYIVVVVAATINVGWFDPW  4  IGHG1  IGHD2-15  │  SPT CQQANSFPLTF  2  IGKC │
└───────────┴─────────────────────────────────────────────────┴───────────────────────────┘

CHAIN 1 OF #1  •  CONCATENATED VDJ REFERENCE  •  D = 1ST = IGHD2-15
   C  A  R      Y  I  V  V  V  V  A  A  T  I  N  V  G  W  F  D  P 
*          *||||                          **|||||****             
TCTGTGCGAGAT    ATATTGTAGTGGTGGTAGCTGCTACTATAAACGTAGGCTGGTTCGACCCC
ACTGTGCGAGACAAGGATATTGTAGTGGTGGTAGCTGCTACTCC     ACAACTGGTTCGACCCC

For JALIGN, we show a line of amino acids that represent the translation of bases in the exact subclonotype sequence. The amino acid lies over the middle base of the corresponding codon.
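The placement rule for the amino acid line can be sketched as follows. This is a minimal illustration under our own assumptions, not enclone's code: the function name is hypothetical, gaps in the displayed transcript row are taken to be spaces, and the reading frame is passed in as the index (among non-gap bases) of the first base of the first complete codon.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order at each position.
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def amino_acid_line(aligned_row, frame_offset):
    """Build the amino acid line for one displayed transcript row.

    aligned_row: transcript bases as displayed, with spaces for gap columns.
    frame_offset: index, counting only bases, of the first codon's first base.
    """
    # Display columns that hold an actual base (gap columns are skipped).
    cols = [i for i, ch in enumerate(aligned_row) if ch != " "]
    line = [" "] * len(aligned_row)
    for start in range(frame_offset, len(cols) - 2, 3):
        codon = "".join(aligned_row[c] for c in cols[start:start + 3])
        # The amino acid goes over the middle base of its codon.
        line[cols[start + 1]] = CODON_TABLE.get(codon, "?")
    return "".join(line)


# Transcript row from the JALIGN1 example above (four gap columns after
# the first twelve bases).
row = (
    "TCTGTGCGAGAT"
    + " " * 4
    + "ATATTGTAGTGGTGGTAGCTGCTACTATAAACGTAGGCTGGTTCGACCCC"
)
print(amino_acid_line(row, 2))
```

Applied to the transcript row of the JALIGN1 example with a frame offset of 2, this reproduces the amino acid line shown there, including the Y placed to the right of the gap because the middle base of its codon lies on the far side of the gap.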

There are also options ALIGN_2ND<n> and JALIGN_2ND<n> to instead use the second best D segments.

Here is an example showing a VDDJ clonotype:

enclone BCR=165808 JALIGN1 CDR3=CARAYDILTGYYERGYSYGWGFDYW

[1] GROUP = 1 CLONOTYPES = 1 CELLS

[1.1] CLONOTYPE = 1 CELLS
┌───────────┬─────────────────────────────────────────────┬────────────────────────────────┐
│           │  CHAIN 1                                    │  CHAIN 2                       │
│           │  122.1.1|IGHV3-23 ◆ 22|IGHD3-9 ◆ 55|IGHJ4   │  331|IGLV1-51 ◆ 311|IGLJ2      │
│           ├─────────────────────────────────────────────┼────────────────────────────────┤
│           │       1111111111111111111111111             │     1111111111111 11           │
│           │  2677 1111112222222222333333333             │  66 0001111111111 22           │
│           │  3857 4567890123456789012345678             │  45 7890123456789 48           │
│           │       ═══════════CDR3══════════             │     ═════CDR3════              │
│reference  │  VASY ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦             │  KL CGTWD◦◦◦◦◦◦◦◦ KL           │
│donor ref  │  LASY ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦             │  KL CGTWD◦◦◦◦◦◦◦◦ KL           │
├───────────┼─────────────────────────────────────────────┼────────────────────────────────┤
│#  n       │  .... .........................   u  const  │  .. ............. ..   u  const│
│1  1       │  LTNY CARAYDILTGYYERGYSYGWGFDYW  16  IGHM   │  KF CGTWDSSLSAVVF TP  15  IGLC2│
└───────────┴─────────────────────────────────────────────┴────────────────────────────────┘

CHAIN 1 OF #1  •  CONCATENATED VDDJ REFERENCE  •  D = 1ST = IGHD3-9:IGHD5-18
   C  A  R     A  Y  D  I  L  T  G  Y  Y  E  R  G  Y  S  Y  G  W    G  F  D  Y  W 
         *  |||*                        |||||                  *||***             
ACTGTGCGAGAG   CTTACGATATTTTGACTGGTTATTACGAACGTGGATACAGCTATGGTTG  GGGCTTTGACTACTGG
ACTGTGCGAAAGAGTATTACGATATTTTGACTGGTTATTA     GTGGATACAGCTATGGTTACACTACTTTGACTACTGG

For VDDJ clonotypes, we do not check that the two D genes are in order on the genome, which would make sense biologically. We do not carry out the check because an individual genome might be rearranged, and in any case, we are simply reporting what we observe.

How the algorithm works

The problems we are solving here are to (a) pick the "best" reference D segment, in the case of IGH or TRB, and (b) exhibit the "correct" alignment of the transcript to the concatenated reference.

The algorithm aligns the V(D)J region on a transcript to the concatenated V(D)J reference, allowing for each possible D reference segment (or the null D segment, or DD), in the case of IGH or TRB. These alignments are carried out using the following scoring scheme:

case	score
match	2
mismatch	-2
gap open for insertion between V/D/J segments	-4
gap open for deletion bridging V/D/J segments	-4
gap open (otherwise)	-12
gap extend for insertion between V/D/J segments	-1
gap extend (otherwise)	-2
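To make the gap rows of the table concrete, here is a small sketch of how a single gap might be scored. The values come from the table, but the rest is assumption: we take the common affine convention in which the gap-open score is charged for the first gap position and the gap-extend score for each additional position (the text does not say which convention enclone uses), and the category names are ours.

```python
# Scores taken from the table above.
MATCH = 2
MISMATCH = -2
GAP_OPEN = {
    "junction_insertion": -4,  # insertion between V/D/J segments
    "bridging_deletion": -4,   # deletion bridging V/D/J segments
    "other": -12,
}
GAP_EXTEND = {
    "junction_insertion": -1,
    "bridging_deletion": -2,   # falls under "gap extend (otherwise)"
    "other": -2,
}


def gap_score(kind, length):
    """Score one gap of `length` positions.

    Assumed convention: the first position costs the gap-open score and
    each further position costs the gap-extend score.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return GAP_OPEN[kind] + GAP_EXTEND[kind] * (length - 1)


print(gap_score("junction_insertion", 4))  # -7
print(gap_score("other", 4))               # -18
```

Under these assumptions a four-base insertion between segments costs far less than a four-base gap inside a segment, which reflects the fact that non-templated bases at the junctions are expected.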

To the score from this, we add 2.2 times a "bit score" for the alignment, defined as -log2 of the probability that a random DNA sequence of length n will match a given DNA sequence with ≤ k mismatches = sum{l=0..=k} (n choose l) * 3^l / 4^n.
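The bit score formula can be evaluated directly. The sketch below is a literal transcription of the formula; the function and constant names are ours, and which n and k enclone feeds into it is not specified here.

```python
from math import comb, log2

BIT_SCORE_WEIGHT = 2.2  # multiplier stated in the text


def bit_score(n, k):
    """-log2 of the probability that a random DNA sequence of length n
    matches a given sequence with at most k mismatches."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    # Number of length-n sequences within k mismatches of the given one.
    matching = sum(comb(n, l) * 3**l for l in range(k + 1))
    # -log2(matching / 4^n) = 2n - log2(matching), kept in exact integers
    # until the final logarithm.
    return 2 * n - log2(matching)


print(bit_score(10, 0))   # 20.0
print(bit_score(10, 10))  # 0.0
```

A perfect match of length n is worth 2n bits, and the value falls toward zero as more mismatches are allowed. The quantity added to the alignment score is then BIT_SCORE_WEIGHT * bit_score(n, k).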

The alignment that is produced is approximately optimal, relative to this scoring scheme. It is not exactly optimal because we first produce an alignment using a Smith-Waterman algorithm, which does not fully incorporate the complexity of the scoring scheme, and then edit both the alignment and its score.

Then the D segment having the highest score is selected, arbitrarily selecting a winner in the case of a tie.
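The selection step, together with the best and second best assignments and their score difference, can be sketched as below. The candidate names and scores are made up for illustration, and breaking ties by input order is just one arbitrary choice, not necessarily the one enclone makes.

```python
def rank_d_candidates(scores):
    """Return (best, second_best, score_difference).

    `scores` maps each candidate (a D segment, a DD pair, or the null
    D segment) to its alignment score. Python's sort is stable, so tied
    candidates keep their input order.
    """
    if len(scores) < 2:
        raise ValueError("need at least two candidates")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (second, second_score) = ranked[0], ranked[1]
    return best, second, best_score - second_score


# Hypothetical candidates and scores, for illustration only.
print(rank_d_candidates({"D_a": 50.0, "D_b": 62.5, "D_c": 61.0}))
# ('D_b', 'D_c', 1.5)
```

A small difference, as in this made-up case, means the second best candidate explains the junction almost as well as the best one, so the assignment deserves less trust.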

The following were optimized in designing the algorithm:
• the inconsistency rate for a large dataset (over a million cells)
• placement of indels (manual examination)
• consistency with IgBLAST, or if not, justifiable difference from it.

However, there is no rigorous way to balance these criteria. The algorithm is not optimal but it is unclear how one would decide that a new algorithm was better.

In more detail, here is how we assess the inconsistency rate. There are two issues. First, if one allows clonotypes having a large number of exact subclonotypes, then measurement is noisy because a single clonotype can overly influence the rate. For this reason, we restrict to clonotypes having at most 10 exact subclonotypes. Second, for experimental purposes, it is too slow to keep recomputing using a very large dataset. Therefore we do the following (with a large dataset substituted in):

enclone BCR=... SUBSET_JSON=subset/outs/all_contig_annotations.json NOPRINT MIN_EXACTS=2 MAX_EXACTS=10

which is slow, followed by:

enclone BCR=subset GVARS=d_inconsistent_%,d_inconsistent_n NOPRINT

which is much faster, to get an inconsistency rate.

References for VDDJ recombination

Briney, B.S. et al. 2012. Frequency and genetic characterization of V(DD)J recombinants in the human peripheral blood antibody repertoire. Immunology.
Briney, B.S. et al. 2013. Secondary mechanisms of diversification in the human antibody repertoire. Frontiers in Immunology.
Safonova, Y. et al. 2019. De novo Inference of Diversity Genes and Analysis of Non-canonical V(DD)J Recombination in Immunoglobulins. Frontiers in Immunology.
Safonova, Y. et al. 2020. V(DD)J recombination is an important and evolutionarily conserved mechanism for generating antibodies with unusually long CDR3s. Genome Research.