enclone heuristics

This pages provides technical descriptions of some of the heuristics that enclone uses. Please see also enclone help how and enclone default filters.

Clonotype chain grouping. After exact subclonotypes have been grouped into clonotypes, we decide which chains from which exact subclonotypes are placed in the same column of the table for the clonotype. While in most particular instances the answer is "obvious", the general problem is complicated. We proceed by "joining" chains, i.e. deciding that they will go in the same column. There are several steps:

At the earlier point in the algorithm where we decide that two exact subclonotypes go in the same clonotype, we align a heavy or TRB chain from each (one from each exact subclonotype) to the other, and likewise for the light or TRA chains. This defines a correspondence between chains, and at the subsequent point when we generate clonotype tables, this information is carried forward to join chains.
The initial process misses some joins for two reasons: (1) because in the initial join step, for computational performance reasons, we only test as many joins as are needed to form the clonotypes, so some joins are not seen, and this is compounded by filtering steps that delete putatively artifactual exact subclonotypes; (2) when we join two exact subclonotypes, and one or both have three chains, we stop looking once we've joined them and thus do not look at all the chains. To mitigate these two problems, at the time of forming clonotype tables, we recover some of the "lost" joins.
In the special case where two exact subclonotypes are joined, and both have three chains, we apply a lower threshold for merging the "third" chain. This threshold is that the V..J sequences have the same length and differ at at most 10 bases.
We also connect onesie exact subclonotypes to other chains by matching based on exact identity of V..J.

This description is accurate for the current enclone, and corresponds to changes that will appear in Cell Ranger in a version after 6.0.

Exact subclonotype ordering within a clonotype. Originally we tried ordering the exact clonotypes in reverse order by the number of cells in a given exact subclonotype. This had the perverse effect of sometimes placing onesie (single chain) clonotypes at the top, which we thought not helpful. After some experimentation we arrived at the following ordering, which although arbitrary, tends to place "more interesting" exact subclonotypes near the top:

For each exact subclonotype, we form a vector of boolean values reflecting the presence of a given chain in an exact subclonotype. For example, in a two-chain clonotype, if both chains are present, the vector is (true, true). The ordering is reverse lexicographical, so that in the two chain case, the ordering is:
```
(true, true)
(true, false)
(false, true).
```
Subject to this, exact subclonotypes are then reverse ordered by number of cells.
Subject to the first two orderings, exact subclonotypes are then reverse ordered by the total number of UMIs, summed across chains.