```
information about how enclone works
The goal of enclone is to find and display the clonotypes within single cell VDJ datasets: groups
of cells having the same fully rearranged common ancestor.
enclone provides the foundation for fully understanding each cell's antigen affinity and the
evolutionary relationship between cells within one or more datasets. This starts with, for each
cell, the full length sequence of all its VDJ receptor chains. Such data may be obtained using
the 10x Genomics immune profiling platform.
See also the heuristics page at bit.ly/enclone.
Finding clonotypes poses fundamental challenges:
┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│1. It is extremely easy to get false positives: the incorrect appearance that two cells have a │
│common ancestor. │
│ │
│2. Because of somatic hypermutation in B cells, it can be difficult to know that two B cells share│
│a common ancestor. │
│ │
│3. There is always some background noise, e.g. from ambient mRNA. When building large clonotypes,│
│this noise tends to pile up, yielding ectopic chains, i.e. chains within a clonotype that are │
│artifacts and do not represent true biology. │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
To address these challenges, the enclone algorithm has several steps, which we outline:
1. Input data. enclone gets its information from the file all_contig_annotations.json that is
produced by Cell Ranger. Only productive contigs are used. Each has an annotated V and J
segment. The V segment alignment may have a single indel whose length is divisible by three, and
in that case, the V reference sequence is edited either to delete or insert sequence. In the
insertion case, the bases are taken from the contig. These indels are noted in the enclone
output.
2. Exact subclonotypes. enclone groups cells into exact subclonotypes, provided that they have
the same number of chains, identical V..J sequences, identical C segment assignments, and the same
distance between the J stop and the C start (which is usually zero).
3. Finding the germline sequences. For datasets from a given donor, enclone derives "donor
reference sequences" for the V chains present in the donor's genome. This is powerful, even
though based on imperfect information. V segments vary in their expression frequency and thus the
more cells which are present, the more complete the information will be. It is also not possible
to accurately determine the terminal bases in a V chain from transcript data alone because these
bases mutate during recombination and because of non-templated nucleotide addition.
Roughly, this is done as follows: for each V segment, we choose one cell from each clonotype
(although these have not actually been computed yet, so it's an approximation). Next, for each
position on the V segment, excluding the last 15 bases, we
determine the distribution of bases that occur within these selected cells. We only consider
those positions where a non-reference base occurs at least four times and is at least 25% of the
total. Then each cell has a footprint relative to these positions; we require that these
footprints satisfy similar evidence criteria. Each such non-reference footprint then defines an
"alternate allele". We do not restrict the number of alternate alleles because they may arise
from duplicated gene copies.
A similar approach was attempted for J segments but at the time of testing did not appear to
enhance clonotyping specificity. This could be revisited later and might be of interest even if
it does not improve specificity.
4. What joins are tested. Pairs of exact subclonotypes are considered for joining, as described
below. This process only considers exact subclonotypes that have two or three chains. There is some
separate joining for the case of one chain. Exact subclonotypes having four chains are not joined
at present. These cases are clearly harder because these exact subclonotypes are highly enriched
for cell doublets, which we discard if we can identify them as such.
5. Initial grouping. For each pair of exact subclonotypes, and for each pair of chains in each
of the two exact subclonotypes, for which V..J has the same length for the corresponding chains,
and the CDR3 segments have the same length for the corresponding chains, enclone considers joining
the exact subclonotypes into the same clonotype.
6. Error bounding. To proceed, as a minimum requirement, there must be at most 55 total
mismatches between the two exact subclonotypes, within the given two V..J segments.
This can be changed by setting MAX_DIFFS=n on the command line. (Note
that for Cell Ranger version 5.0, the value is instead 50.)
7. Shared mutations. enclone next finds shared mutations between exact subclonotypes, that is,
for two exact subclonotypes, common mutations from the reference sequence, using the donor
reference for the V segments and the universal reference for the J segments. Shared mutations are
presumed to be somatic hypermutations, which would be evidence of common ancestry. By using the
donor reference sequences, most shared germline mutations are excluded, and this is critical for
the algorithm's success.
8. Are there enough shared mutations? We find the probability p that “the shared mutations occur
by chance”. More specifically, given d shared mutations, and k total mutations (across the two
cells), we compute the probability p that a sample with replacement of k items from a set whose
size is the total number of bases in the V..J segments, yields at most k – d distinct elements.
The probability is an approximation; for the method, please see
https://docs.rs/stirling_numbers/0.1.0/stirling_numbers.
9. Are there too many CDR3 mutations? Next, let N be "the number of DNA sequences that differ
from the given CDR3 sequences by at most the number of observed differences". More specifically,
if cd is the number of differences between the given CDR3 nucleotide sequences, and n is the total
length in nucleotides of the CDR3 sequences (for the two chains), we compute the total number N of
strings of length n that are obtainable by perturbing a given string of length n, which is
sum( choose(n,m), m = 0..=cd ). We also require that cd is at most 15 (and this bound is
adjustable via the command-line argument MAX_CDR3_DIFFS). (The value for Cell Ranger 5.0 is 10.)
10. Key join criteria. Two cells having sufficiently many shared mutations and sufficiently
few CDR3 differences are deemed to be in the same clonotype. That is, the lower p is, and the
lower N is, the more likely it is that the shared mutations represent bona fide shared ancestry.
Accordingly, the smaller p*N is, the more likely it is that two cells lie in the same true
clonotype. To join two cells into the same clonotype, we require that the bound p*N ≤ C is
satisfied, where C is the constant 500,000. (The value for Cell Ranger 5.0 is 1,000,000.) The
value may be adjusted using the command-line argument MAX_SCORE, or the log10 of this,
MAX_LOG_SCORE. This constant was arrived at by empirically balancing sensitivity and specificity
across a large collection of datasets. See results described at bit.ly/enclone.
11. Other join criteria. We do not join two clonotypes which were assigned different reference
sequences unless those reference sequences differ by at most 2 positions. This value can be
controlled using the command-line argument MAX_DEGRADATION. (Note that for Cell Ranger 5.0, the
value is instead 3.) There is an additional restriction imposed when creating two-cell
clonotypes: we require that cd ≤ d, where cd is the number of CDR3 differences and d is the
number of shared mutations, as above. This filter may be turned off using the command-line
argument EASY.
12. Junk. Spurious chains are filtered out based on frequency and connections. See "enclone help
special" for a description of the filters.
We are actively working to improve the algorithm. Test results for the current version may be
found at bit.ly/enclone.
```
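The grouping rule in step 2 can be illustrated with a minimal Python sketch. It is not enclone's implementation: the cell layout and the field names (`barcode`, `chains`, `vj_seq`, `c_gene`, `j_stop_to_c_start`) are hypothetical, and treating chain order as irrelevant is an assumption made here for simplicity.

```python
from collections import defaultdict


def exact_subclonotype_key(cell):
    """Build a hashable key from the properties named in step 2.

    The tuple length encodes the number of chains. Sorting makes the key
    independent of chain order (an assumption of this sketch).
    """
    return tuple(sorted(
        (ch["vj_seq"], ch["c_gene"], ch["j_stop_to_c_start"])
        for ch in cell["chains"]
    ))


def group_exact_subclonotypes(cells):
    """Group cell barcodes whose keys are identical."""
    groups = defaultdict(list)
    for cell in cells:
        groups[exact_subclonotype_key(cell)].append(cell["barcode"])
    return list(groups.values())
```

Two cells land in the same group only if every chain matches in V..J sequence, C segment assignment, and J-stop-to-C-start distance.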
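The position filter described in step 3 (a non-reference base seen at least four times and in at least 25% of the selected cells, ignoring the last 15 bases) might look roughly like the following. This is only a sketch of that one criterion: the function name and parameters are hypothetical, and the later footprint and allele-calling logic is not modeled.

```python
from collections import Counter


def candidate_variant_positions(ref, seqs, skip_tail=15, min_count=4, min_frac=0.25):
    """Return positions on the V segment where some non-reference base
    passes both evidence thresholds.

    ref  : reference V segment sequence
    seqs : V segment sequences, one per selected cell, aligned to ref
    """
    positions = []
    for pos in range(max(0, len(ref) - skip_tail)):
        counts = Counter(s[pos] for s in seqs if pos < len(s))
        total = sum(counts.values())
        for base, count in counts.items():
            if base != ref[pos] and count >= min_count and count >= min_frac * total:
                positions.append(pos)
                break  # one qualifying base is enough to keep the position
    return positions
```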
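For small inputs, the probability in step 8 can be computed exactly rather than approximated. The sketch below uses Stirling numbers of the second kind: the chance that k draws with replacement from n items give exactly m distinct items is S(k, m) * n!/(n-m)! / n^k. It is an illustration of the quantity being estimated, not the approximation enclone uses, and the function names are hypothetical.

```python
from fractions import Fraction
from math import perm


def stirling2_row(k):
    """Return [S(k, 0), ..., S(k, k)], Stirling numbers of the second kind."""
    row = [1]  # S(0, 0) = 1
    for i in range(1, k + 1):
        new = [0] * (i + 1)
        for m in range(1, i + 1):
            # Recurrence: S(i, m) = m * S(i-1, m) + S(i-1, m-1)
            prev_same = row[m] if m < len(row) else 0
            new[m] = m * prev_same + row[m - 1]
        row = new
    return row


def prob_at_most_distinct(n, k, d):
    """Exact P(k draws with replacement from n items yield <= k - d distinct items)."""
    if d <= 0:
        return Fraction(1)
    s = stirling2_row(k)
    favorable = sum(s[m] * perm(n, m) for m in range(0, k - d + 1))
    return Fraction(favorable, n ** k)
```

With n = 365, k = 23 and d = 1 this reduces to the classic birthday problem, which is a convenient sanity check.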
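Steps 9 and 10 combine into a single numeric test, sketched below. N follows the formula given in step 9, and the defaults (15 and 500,000) are the values stated in the text for MAX_CDR3_DIFFS and MAX_SCORE. The function names are hypothetical, and p is assumed to have been computed as in step 8.

```python
from math import comb


def cdr3_neighborhood_size(n, cd):
    """N = sum of choose(n, m) for m = 0..=cd, as written in step 9."""
    return sum(comb(n, m) for m in range(cd + 1))


def passes_key_join_test(p, n, cd, max_cdr3_diffs=15, max_score=500_000):
    """Return True if cd is within bounds and p * N <= max_score.

    p  : probability that the shared mutations occur by chance (step 8)
    n  : total CDR3 length in nucleotides over the two chains
    cd : number of CDR3 nucleotide differences
    """
    if cd > max_cdr3_diffs:
        return False
    return p * cdr3_neighborhood_size(n, cd) <= max_score
```

Because N grows quickly with cd, even a few CDR3 differences require a very small p for the join to pass.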