enclone help how

information about how enclone works

The goal of enclone is to find and display the clonotypes within single cell VDJ datasets: groups
of cells having the same fully rearranged common ancestor.

enclone provides the foundation for fully understanding each cell's antigen affinity and the
evolutionary relationship between cells within one or more datasets. This starts with, for each
cell, the full length sequence of all its VDJ receptor chains. Such data may be obtained using
the 10x Genomics immune profiling platform.

See also the heuristics page at bit.ly/enclone.

Clonotyping faces several fundamental challenges:

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│1. It is extremely easy to get false positives: the incorrect appearance that two cells have a    │
│common ancestor.                                                                                  │
│                                                                                                  │
│2. Because of somatic hypermutation in B cells, it can be difficult to know that two B cells share│
│a common ancestor.                                                                                │
│                                                                                                  │
│3. There is always some background noise, e.g. from ambient mRNA. When building large clonotypes, │
│this noise tends to pile up, yielding ectopic chains, i.e. chains within a clonotype that are     │
│artifacts and do not represent true biology.                                                      │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘

To address these challenges, the enclone algorithm has several steps, which we outline:

1. Input data. enclone gets its information from the file all_contig_annotations.json that is
produced by Cell Ranger. Only productive contigs are used. Each has an annotated V and J
segment. The V segment alignment may have a single indel whose length is divisible by three, and
in that case, the V reference sequence is edited either to delete or insert sequence. In the
insertion case, the bases are taken from the contig. These indels are noted in the enclone
output.
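
As a rough illustration of this input step, the following Python sketch keeps only productive
contigs and groups them by cell barcode. The key names "barcode" and "productive" are assumptions
about the JSON layout, and the function names are hypothetical; this is not enclone's own loader.

```python
import json
from collections import defaultdict


def productive_contigs_by_cell(records):
    """Group productive contig records by cell barcode.

    `records` is the parsed JSON, assumed here to be a list of dicts, each
    with a "barcode" string and a boolean "productive" flag.
    """
    cells = defaultdict(list)
    for rec in records:
        # Anything other than an explicit true value is treated as not productive.
        if rec.get("productive") is True:
            cells[rec["barcode"]].append(rec)
    return dict(cells)


def load_productive_contigs(path):
    """Read an all_contig_annotations.json file and group its productive contigs."""
    with open(path) as handle:
        return productive_contigs_by_cell(json.load(handle))
```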

2. Exact subclonotypes. enclone groups cells into exact subclonotypes, provided that they have
the same number of chains, identical V..J sequences, identical C segment assignments, and the same
distance between the J stop and the C start (which is usually zero).
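
The grouping rule above can be sketched as a dictionary keyed on everything that must match. This
is an illustration only: the tuple layout (V..J sequence, C segment name, J-stop-to-C-start
distance) and the function name are assumptions made for the example.

```python
from collections import defaultdict


def exact_subclonotypes(cells):
    """Group barcodes whose chains agree exactly.

    `cells` maps barcode -> list of (vj_seq, c_gene, jc_gap) tuples, where all
    three fields are assumed to be present (strings and an int).
    """
    groups = defaultdict(list)
    for barcode, chains in cells.items():
        # Sorting makes the key independent of chain order; the tuple length
        # encodes the number of chains, so cells with different chain counts
        # can never share a key.
        key = tuple(sorted(chains))
        groups[key].append(barcode)
    return dict(groups)
```

Two cells land in the same group only if every chain matches on all three fields, so a difference
in C segment assignment alone is enough to separate them.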

3. Finding the germline sequences. For datasets from a given donor, enclone derives "donor
reference sequences" for the V segments present in the donor's genome. This is powerful, even
though it is based on imperfect information. V segments vary in their expression frequency, and
thus the more cells that are present, the more complete the information will be. It is also not
possible to accurately determine the terminal bases of a V segment from transcript data alone,
because these bases mutate during recombination and because of non-templated nucleotide addition.

The approach is roughly as follows: for each V segment, we choose one cell
from each clonotype (although these have not actually been computed yet, so it's an
approximation). Next, for each position on the V segment, excluding the last 15 bases, we
determine the distribution of bases that occur within these selected cells. We only consider
those positions where a non-reference base occurs at least four times and is at least 25% of the
total. Then each cell has a footprint relative to these positions; we require that these
footprints satisfy similar evidence criteria. Each such non-reference footprint then defines an
"alternate allele". We do not restrict the number of alternate alleles because they may arise
from duplicated gene copies.
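
The position test and the footprint idea can be sketched as follows. This is a simplified
illustration, not enclone's implementation: it assumes every selected sequence has the same length
as the reference, and it interprets "similar evidence criteria" for footprints as the same count
and fraction thresholds, which is an assumption. All names are hypothetical.

```python
from collections import Counter


def candidate_positions(ref, seqs, min_count=4, min_frac=0.25, skip_tail=15):
    """Positions where some non-reference base has enough support."""
    positions = []
    for pos in range(max(0, len(ref) - skip_tail)):  # ignore the last 15 bases
        counts = Counter(s[pos] for s in seqs)
        total = sum(counts.values())
        if any(
            base != ref[pos] and n >= min_count and n >= min_frac * total
            for base, n in counts.items()
        ):
            positions.append(pos)
    return positions


def alternate_alleles(ref, seqs, min_count=4, min_frac=0.25, skip_tail=15):
    """Return (positions, footprints), one footprint per supported alternate allele."""
    positions = candidate_positions(ref, seqs, min_count, min_frac, skip_tail)
    ref_print = "".join(ref[p] for p in positions)
    prints = Counter("".join(s[p] for p in positions) for s in seqs)
    alleles = sorted(
        fp
        for fp, n in prints.items()
        if fp != ref_print and n >= min_count and n >= min_frac * len(seqs)
    )
    return positions, alleles
```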

A similar approach was attempted for J segments but at the time of testing did not appear to
enhance clonotyping specificity. This could be revisited later and might be of interest even if
it does not improve specificity.

4. What joins are tested. Pairs of exact subclonotypes are considered for joining, as described
below. This process only considers exact subclonotypes that have two or three chains. There is some
separate joining for the case of one chain. Exact subclonotypes having four chains are not joined
at present. These cases are clearly harder because these exact subclonotypes are highly enriched
for cell doublets, which we discard if we can identify them as such.

5. Initial grouping. For each pair of exact subclonotypes, and for each pair of chains in each
of the two exact subclonotypes, for which V..J has the same length for the corresponding chains,
and the CDR3 segments have the same length for the corresponding chains, enclone considers joining
the exact subclonotypes into the same clonotype.
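
A minimal sketch of this length pre-filter is shown below. It assumes the pair of chains has
already been chosen from each exact subclonotype and placed in corresponding order, and it uses
hypothetical dictionary keys "vj" and "cdr3" for the two sequences.

```python
def is_join_candidate(chains1, chains2):
    """True if corresponding chains have equal V..J and equal CDR3 lengths."""
    if len(chains1) != len(chains2):
        return False
    return all(
        len(a["vj"]) == len(b["vj"]) and len(a["cdr3"]) == len(b["cdr3"])
        for a, b in zip(chains1, chains2)
    )
```

Only lengths are compared at this stage; the sequence content is examined in the later steps.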

6. Shared mutations. enclone next finds shared mutations between exact subclonotypes, that is,
for two exact subclonotypes, common mutations from the reference sequence, using the donor
reference for the V segments and the universal reference for the J segments. Shared mutations are
presumed to be somatic hypermutations, which would be evidence of common ancestry. By using the
donor reference sequences, most shared germline mutations are excluded, and this is critical for
the algorithm's success.
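
The following sketch shows one way to count shared mutations, and why the choice of reference
matters. It assumes the two sequences and the reference are already aligned position by position
and have equal length; the function names are hypothetical.

```python
def mutations(ref, seq):
    """Set of (position, base) pairs where seq differs from ref."""
    return {(i, b) for i, (r, b) in enumerate(zip(ref, seq)) if b != r}


def shared_mutations(ref, seq1, seq2):
    """Mutations from ref that both sequences carry, with the same base."""
    return mutations(ref, seq1) & mutations(ref, seq2)


universal_ref = "ACGTACGT"
donor_ref = "ACTTACGT"  # hypothetical donor allele: T instead of G at position 2
cell1 = "ACTTACGA"
cell2 = "ACTTCCGT"

# Against the universal reference the germline difference looks like a shared mutation.
print(shared_mutations(universal_ref, cell1, cell2))
# Against the donor reference it disappears, leaving no evidence of common ancestry.
print(shared_mutations(donor_ref, cell1, cell2))
```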

7. Are there enough shared mutations? We find the probability p that “the shared mutations occur
by chance”. More specifically, given d shared mutations, and k total mutations (across the two
cells), we compute the probability p that a sample with replacement of k items from a set whose
size is the total number of bases in the V..J segments, yields at most k – d distinct elements.
The probability is an approximation; for the method, please see
https://docs.rs/stirling_numbers/0.1.0/stirling_numbers.
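
For small inputs the same quantity can be computed exactly rather than approximately. The sketch
below is not the method used by enclone; it relies on the standard identity that a sample of k
draws with replacement from n items has exactly j distinct values with probability
S(k, j) * n(n-1)...(n-j+1) / n^k, where S(k, j) is a Stirling number of the second kind. Function
names are hypothetical, and n is assumed to be at least 1.

```python
from fractions import Fraction


def stirling2_row(k):
    """Return [S(k, 0), ..., S(k, k)], Stirling numbers of the second kind."""
    row = [1]  # S(0, 0) = 1
    for i in range(1, k + 1):
        new = [0] * (i + 1)
        for j in range(1, i + 1):
            # Recurrence: S(i, j) = j * S(i-1, j) + S(i-1, j-1)
            upper = row[j] if j < len(row) else 0
            new[j] = j * upper + row[j - 1]
        row = new
    return row


def prob_at_most_distinct(n, k, m):
    """P(k draws with replacement from n items give at most m distinct items)."""
    s = stirling2_row(k)
    total = Fraction(0)
    falling = 1  # running value of n * (n-1) * ... * (n-j+1)
    for j in range(0, min(m, k) + 1):
        if j > 0:
            falling *= n - j + 1
        total += Fraction(s[j] * falling, n ** k)
    return total


def shared_mutation_p(n, k, d):
    """p from the text: d shared mutations, k total mutations, n bases in V..J."""
    return prob_at_most_distinct(n, k, k - d)
```

With no shared mutations (d = 0) the probability is 1, and it falls as d grows for fixed n and k,
which matches the intuition that many coincidences are unlikely to arise by chance.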

8. Are there too many CDR3 mutations? We define a constant N that is used below. We first set
cd1 to the number of heavy chain CDR3 nucleotide differences, and cd2 to the number of light chain
CDR3 nucleotide differences. Let n1 be the nucleotide length of the heavy chain CDR3, and
likewise n2 for the light chain. Then N = 80^(42 * (cd1/n1 + cd2/n2)). The number 80 may be
alternately specified via MULT_POW and the number 42 via CDR3_NORMAL_LEN.
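
The formula for N can be written out directly. In this sketch the parameter names mult_pow and
cdr3_normal_len simply mirror the MULT_POW and CDR3_NORMAL_LEN arguments; the function name is
hypothetical.

```python
def cdr3_penalty(cd1, n1, cd2, n2, mult_pow=80.0, cdr3_normal_len=42.0):
    """N = mult_pow ^ (cdr3_normal_len * (cd1/n1 + cd2/n2))."""
    return mult_pow ** (cdr3_normal_len * (cd1 / n1 + cd2 / n2))


# With identical CDR3s the exponent is zero, so N is 1 and leaves the score unchanged.
print(cdr3_penalty(0, 42, 0, 33))
# One difference in a heavy chain CDR3 of length 42 gives N of about 80.
print(cdr3_penalty(1, 42, 0, 33))
```

Read this way, each CDR3 difference in a chain of "normal" length 42 multiplies N by roughly 80,
and differences in shorter CDR3s are penalized proportionally more.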

9. We also require CDR3 nucleotide identity of at least 85%. The number 85 may be alternately
set using JOIN_CDR3_IDENT=.... The nucleotide identity is computed by dividing cd, the number of
CDR3 nucleotide differences, by the total nucleotide length of the heavy and light chains,
normalized.

10. Key join criteria. Two cells having sufficiently many shared mutations and sufficiently
few CDR3 differences are deemed to be in the same clonotype. That is, the lower p is, and the
lower N is, the more likely it is that the shared mutations represent bona fide shared ancestry.
Accordingly, the smaller p*N is, the more likely it is that two cells lie in the same true
clonotype. To join two cells into the same clonotype, we require that the bound p*N ≤ C is
satisfied, where C is the constant 100,000. The value may be adjusted using the command-line
argument MAX_SCORE, or the log10 of this, MAX_LOG_SCORE. This constant was arrived at by
empirically balancing sensitivity and specificity across a large collection of datasets. See
results described at bit.ly/enclone.
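
The key criterion can be sketched in log space, which also shows how MAX_SCORE and MAX_LOG_SCORE
relate. The function names are hypothetical, and p and N are assumed to be ordinary floats with
N > 0.

```python
import math

MAX_SCORE = 100_000.0  # the constant C from the text


def log10_score(p, big_n):
    """log10(p * N), computed as a sum of logs."""
    if p == 0.0:
        return -math.inf
    return math.log10(p) + math.log10(big_n)


def passes_key_criterion(p, big_n, max_log_score=math.log10(MAX_SCORE)):
    """True if p * N <= C, i.e. log10(p) + log10(N) <= log10(C)."""
    return log10_score(p, big_n) <= max_log_score
```

For example, with p = 1 (no shared mutations) the bound tolerates N = 80^2 = 6,400 but not
N = 80^3 = 512,000.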

11. Other join criteria.
• If V gene names are different (after removing trailing *...), and either the V gene reference
sequences are different (after truncation on the right to the same length) or the 5' UTR reference
sequences are different (after truncation on the left to the same length), then the join is rejected.
• As an exception to the key join criterion, we allow a join which has at least 15 shared
mutations, even if p*N > C. The constant 15 is modifiable via the argument AUTO_SHARES.
• As a second exception to the key join criterion, we first compute heavy chain join complexity.
This is done by finding the optimal D gene (allowing no D, or DD), and aligning the junction
region on the contig to the concatenated reference. (This choice can be visualized using the
option JALIGN1, see enclone help display.) The heavy chain join complexity hcomp is then a sum
as follows: each inserted base counts one, each substitution counts one, and each deletion
(regardless of length) counts one. Then we allow a join if it has hcomp - cd ≥ 8, so long as the
number of differences between both chains outside the junction regions is at most 80, even if p*N
> C.
• We do not join two clonotypes which were assigned different reference sequences unless those
reference sequences differ by at most 2 positions. This value can be controlled using the
command-line argument MAX_DEGRADATION.
• There is an additional restriction imposed when creating two-cell clonotypes: we require
that cd ≤ d, where cd is the number of CDR3 differences and d is the number of shared mutations,
as above. This filter may be turned off using the command-line argument EASY.
• We do not join in cases where light chain constant regions are different and cd > 0. This
filter may be turned off using the command-line argument OLD_LIGHT.
• If the percent nucleotide identity on heavy chain FWR1 is at least 20 more than the percent
nucleotide identity on heavy chain CDR1+CDR2 (combined), then the join is rejected.
• We do not join in cases where there is too high a concentration of changes in the junction
region. More specifically, if the number of mutations in CDR3 is at least 5 times the number of
non-shared mutations outside CDR3 (maxed with 1), the join is rejected. The number 5 is the
parameter CDR3_MULT.
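
Several of the numeric rules in this list reduce to one-line predicates. The sketch below writes
them out individually; it does not attempt to reproduce the order in which enclone applies them or
how they interact. All function and parameter names are hypothetical, with defaults taken from the
constants quoted above.

```python
def auto_share_exception(shares, auto_shares=15):
    """Join allowed regardless of p*N when there are enough shared mutations."""
    return shares >= auto_shares


def hcomp_exception(hcomp, cd, diffs_outside_junction, min_excess=8, max_diffs=80):
    """Join allowed regardless of p*N when the junction is complex enough."""
    return hcomp - cd >= min_excess and diffs_outside_junction <= max_diffs


def violates_two_cell_rule(cd, d):
    """Two-cell clonotypes require cd <= d."""
    return cd > d


def rejects_fwr1_excess(fwr1_ident_pct, cdr12_ident_pct, margin=20):
    """Reject if FWR1 identity exceeds CDR1+CDR2 identity by at least the margin."""
    return fwr1_ident_pct - cdr12_ident_pct >= margin


def rejects_cdr3_concentration(cdr3_muts, nonshared_outside_cdr3, cdr3_mult=5):
    """Reject if mutations are too concentrated in CDR3."""
    return cdr3_muts >= cdr3_mult * max(1, nonshared_outside_cdr3)
```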

12. Junk. Spurious chains are filtered out based on frequency and connections. See "enclone help
special" for a description of the filters.

13. Alternate algorithm. An alternate and much simpler clonotyping algorithm can be invoked by
specifying JOIN_BASIC=90. This causes two exact subclonotypes to be joined if they have the same
V and J gene assignments, the same CDR3 lengths, and CDR3 nucleotide identity of at least 90% on
each chain. The number 90 can be changed.
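
The alternate rule is simple enough to sketch in full. This illustration assumes the chains of the
two exact subclonotypes are listed in corresponding order and uses hypothetical dictionary keys
"v_gene", "j_gene" and "cdr3_nt"; CDR3s are assumed to be non-empty.

```python
def percent_identity(a, b):
    """Percent of positions that match between two equal-length sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def join_basic(sub1, sub2, min_ident=90.0):
    """True if every corresponding chain pair passes the JOIN_BASIC-style test."""
    if len(sub1) != len(sub2):
        return False
    for c1, c2 in zip(sub1, sub2):
        if c1["v_gene"] != c2["v_gene"] or c1["j_gene"] != c2["j_gene"]:
            return False
        if len(c1["cdr3_nt"]) != len(c2["cdr3_nt"]):
            return False
        if percent_identity(c1["cdr3_nt"], c2["cdr3_nt"]) < min_ident:
            return False
    return True
```

Unlike the main algorithm, this rule uses no shared-mutation evidence and no donor reference, so
it depends entirely on gene assignments and CDR3 similarity.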

We are actively working to improve the algorithm. Test results for the current version may be
found at bit.ly/enclone.