When enclone is run, a series of filters are applied, resulting in deletion of barcodes or in some cases in changing which cells are combined together. Here we describe the order of the filters and technical details about some of them. Please see also the page enclone help special that describes some details about the filters and how they can be turned off.
Note that when you run enclone, if you specify the SUMMARY option, then a table will be printed showing the filters that removed cells, and how many cells each removed. You can use this as a guide regarding which filters are most important for your dataset. If you only want to see the summary, then you can use the two options SUMMARY and NOPRINT.
Filter order. The following table enumerates the filters, in order of application, along with a brief description of what they do. Please understand that in general, artifactual barcodes cannot be surgically removed by sharply defined tests. Rather, the filters are heuristic, and have the cumulative effect of removing nearly all artifacts (in most cases), while removing few valid barcodes. In the enclone codebase, there are regression tests, some of which provide representative examples for filtering. When we modify the filters, we examine the effect on these examples, and add others as needed as protection against accidental deterioration of performance.
number | filter name | brief description |
---|---|---|
1 | cell filter | remove barcodes not called cells in Cell Ranger VDJ pipeline |
2 | maximum contigs filter | remove barcodes having more than four productive contigs |
3 | graph filter | remove some exact subclonotypes that appear to be background |
4 | cross filter | use cross-library information to remove spurious exact subclonotypes |
5 | barcode duplication filter | remove duplicated barcodes within an exact subclonotype |
6 | whitelist filter | remove rare artifacts arising from gel bead contamination |
7 | foursie filter | remove some four-chain clonotypes that might represent doublets |
8 | improper filter | remove exact subclonotypes having multiple chains, all of the same type |
9 | weak onesie filter | disintegrate some single-chain clonotypes into single cells |
10 | UMI filter | remove some B cells having very low UMI counts |
11 | UMI ratio filter | remove some B cells having very low UMI counts, relative to clonotype |
12 | GEX filter | remove cells called by VDJ pipeline but not by GEX pipeline |
13 | doublet filter | remove some barcodes that appear to represent doublets |
14 | signature filter | remove some barcodes that appear to be contaminants, based on their chain signature |
15 | onesie merger | prevent merger of some single-chain clonotypes into other clonotypes |
16 | weak chain filter | remove cells having a chain that is probably spurious |
17 | quality filter | filter out exact subclonotypes having a position with low quality scores |
The remainder of this page and enclone help special provide more details about the filters.
Maximum contigs filtering. Remove barcodes that were assigned more than four productive contigs. Specifying NMAX turns off this filter. This only has an effect if cell filtering is also turned off. Also, deletion of cells by this filter is not tracked by the SUMMARY option or the lvar filter.
Cross filtering. If multiple draws are made from the same tube of cells, and one library
made from each, yielding multiple "datasets" having the same "origin", then the clonotypes
observed in different libraries should be statistically consistent. Otherwise, they likely
represent an artifact, for example, possibly resulting from fragmentation of a plasma cell. We
apply the following test as a proxy for statistical consistency (unless NCROSS is specified):
- If a V..J segment appears in exactly one dataset, with frequency n, let x be the total number of productive pairs for that dataset, and let y be the total number of productive pairs for all datasets from the same origin.
- If (x/y)^n ≤ 10^-6, that is, if the probability that all n instances of that V..J ended up in that one dataset (assuming even distribution) is at most 10^-6, then delete all the productive pairs for that V..J segment that do not have at least 100 supporting UMIs.
This test could clearly be strengthened.
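The arithmetic of the cross filter test can be sketched as follows. This is only an illustration of the rule as stated above, not enclone's actual code; the function names and the input layout (a list of supporting UMI counts, one per productive pair) are assumptions made for the example.

```python
def cross_filter_triggers(n, x, y, threshold=1e-6):
    """True if (x/y)^n <= threshold for a V..J seen n times in exactly one dataset."""
    return (x / y) ** n <= threshold


def surviving_pairs(umi_counts, n, x, y, min_umis=100):
    """Return the UMI counts of the productive pairs that survive the test.

    umi_counts: supporting UMI count of each productive pair for this V..J
    (a hypothetical input layout chosen for this sketch).
    """
    if cross_filter_triggers(n, x, y):
        # Keep only the pairs that have at least min_umis supporting UMIs.
        return [u for u in umi_counts if u >= min_umis]
    return list(umi_counts)
```

For instance, with x = 1000 and y = 4000, ten instances give (1/4)^10, about 9.5e-7, which triggers the test, whereas nine instances give about 3.8e-6, which does not.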
Foursie filtering. Foursie exact subclonotypes are highly enriched for cell doublets. Deleting them all might be justified, but because it is hypothetically possible that sometimes they represent the actual biology of single cells, we do not do this. However, we never merge them with other exact subclonotypes, and sometimes we delete them, if we have other evidence that they are doublets. Specifically, for each foursie exact subclonotype, enclone looks at each pair of two chains within it (with one heavy and one light, or TRB/TRA), and if the V..J sequences for those appear in a twosie exact subclonotype having at least ten cells, then the foursie exact subclonotype is deleted, no matter how many cells it has. For example, this shows two foursie clonotypes that are present if the filtering is off:
enclone BCR=123085 CDR3=CARRYFGVVADAFDIW NFOURSIE_KILL
[1] GROUP = 1 CLONOTYPES = 34 CELLS
[1.1] CLONOTYPE = 34 CELLS
βββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββ
β β CHAIN 1 β CHAIN 2 β
β β 740.1.1|IGHV4-30-4 β 53|IGHJ3 β 253|IGKV1D-39 β 217|IGKJ5β
β βββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ€
β β 1 1111111111111111 β 111111111111 β
β β 25690 1111122222222223 β 011111111112 β
β β 01048 5678901234567890 β 901234567890 β
β β ββββββCDR3ββββββ β ββββCDR3ββββ β
βreference β LDPSA β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W β CQQβ¦β¦β¦β¦β¦β¦β¦β¦β¦ β
βdonor ref β VGHSA β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W β CQQβ¦β¦β¦β¦β¦β¦β¦β¦β¦ β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ€
β# n β ..... ................ u const β ............ u const β
β1 34 β VGHSA CARRYFGVVADAFDIW 57 IGHM β CQQSYSTPPITF 207 IGKC β
βββββββββββββ΄ββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββ
βΊββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΈ
[2] GROUP = 1 CLONOTYPES = 22 CELLS
[2.1] CLONOTYPE = 22 CELLS
βββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββ
β β CHAIN 1 β CHAIN 2 β CHAIN 3 β CHAIN 4 β
β β 740.1.1|IGHV4-30-4 β 53|IGHJ3 β 144.1.2|IGHV3-49 β 737|IGHJ6 β 273|IGKV2D-28 β 213|IGKJ1 β 253|IGKV1D-39 β 217|IGKJ5β
β βββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ€
β β 1 1111111111111111 β 1111111111111111111111 β 11111111111 β 111111111111 β
β β 25690 1111122222222223 β 35 1111222222222233333333 β 11111111222 β 011111111112 β
β β 01048 5678901234567890 β 15 6789012345678901234567 β 23456789012 β 901234567890 β
β β ββββββCDR3ββββββ β βββββββββCDR3βββββββββ β ββββCDR3βββ β ββββCDR3ββββ β
βreference β LDPSA β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W β QV β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦GMDVW β CMQβ¦β¦β¦β¦β¦β¦β¦β¦ β CQQβ¦β¦β¦β¦β¦β¦β¦β¦β¦ β
βdonor ref β VGHSA β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W β QF β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦GMDVW β CMQβ¦β¦β¦β¦β¦β¦β¦β¦ β CQQβ¦β¦β¦β¦β¦β¦β¦β¦β¦ β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ€
β# n β ..... ................ u const β .. ...................... u const β ........... u const β ............ u const β
β1 22 β VGHSA CARRYFGVVADAFDIW 65 IGHM β KF CTRAGFLSYQLLSYYYYGMDVW 308 IGHG1 β CMQALQTPWTF 647 IGKC β CQQSYSTPPITF 264 IGKC β
βββββββββββββ΄ββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββ
βΊββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΈ
[3] GROUP = 1 CLONOTYPES = 1 CELLS
[3.1] CLONOTYPE = 1 CELLS
βββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββ
β β CHAIN 1 β CHAIN 2 β CHAIN 3 β CHAIN 4 β
β β 740.1.1|IGHV4-30-4 β 53|IGHJ3 β 144.1.2|IGHV3-49 β 737|IGHJ6 β 273|IGKV2D-28 β 213|IGKJ1 β 253|IGKV1D-39 β 217|IGKJ5β
β βββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ€
β β 1 1111111111111111 β 1111111111111111111111 β 11111111111 β 111111111111 β
β β 25690 1111122222222223 β 35 1111222222222233333333 β 11111111222 β 011111111112 β
β β 01048 5678901234567890 β 15 6789012345678901234567 β 23456789012 β 901234567890 β
β β ββββββCDR3ββββββ β βββββββββCDR3βββββββββ β ββββCDR3βββ β ββββCDR3ββββ β
βreference β LDPSA β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W β QV β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦GMDVW β CMQβ¦β¦β¦β¦β¦β¦β¦β¦ β CQQβ¦β¦β¦β¦β¦β¦β¦β¦β¦ β
βdonor ref β VGHSA β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W β QF β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦GMDVW β CMQβ¦β¦β¦β¦β¦β¦β¦β¦ β CQQβ¦β¦β¦β¦β¦β¦β¦β¦β¦ β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ€
β# n β ..... ................ u const β .. ...................... u const β ........... u const β ............ u const β
β1 1 β VGHSA CARRYFGVVADAFDIW 69 ? β KF CTRAGFLSYQLLSYYYYGMDVW 304 IGHG1 β CMQALQTPWTF 562 IGKC β CQQSYSTPPITF 266 IGKC β
βββββββββββββ΄ββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββ
and which are deleted if the foursie filtering is on:
enclone BCR=123085 CDR3=CARRYFGVVADAFDIW
[1] GROUP = 1 CLONOTYPES = 34 CELLS
[1.1] CLONOTYPE = 34 CELLS
βββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββ
β β CHAIN 1 β CHAIN 2 β
β β 740.1.1|IGHV4-30-4 β 53|IGHJ3 β 253|IGKV1D-39 β 217|IGKJ5β
β βββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ€
β β 1 1111111111111111 β 111111111111 β
β β 25690 1111122222222223 β 011111111112 β
β β 01048 5678901234567890 β 901234567890 β
β β ββββββCDR3ββββββ β ββββCDR3ββββ β
βreference β LDPSA β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W β CQQβ¦β¦β¦β¦β¦β¦β¦β¦β¦ β
βdonor ref β VGHSA β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦β¦W β CQQβ¦β¦β¦β¦β¦β¦β¦β¦β¦ β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ€
β# n β ..... ................ u const β ............ u const β
β1 34 β VGHSA CARRYFGVVADAFDIW 57 IGHM β CQQSYSTPPITF 207 IGKC β
βββββββββββββ΄ββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββ
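The foursie rule described at the start of this section can be sketched as below. This is a minimal illustration, not enclone's implementation: the representation of an exact subclonotype as a dict with a "chains" list of (chain type, V..J sequence) tuples and an "ncells" count is hypothetical, as are the function and constant names.

```python
from itertools import combinations

# Chain-type pairings treated as "one heavy and one light, or TRB/TRA".
VALID_PAIRS = {
    frozenset(("IGH", "IGK")),
    frozenset(("IGH", "IGL")),
    frozenset(("TRB", "TRA")),
}


def foursies_to_delete(exacts, min_cells=10):
    """Return indices of foursie exact subclonotypes that the rule would delete."""
    # Collect the V..J sequence pairs of twosies having at least min_cells cells.
    big_twosies = set()
    for e in exacts:
        if len(e["chains"]) == 2 and e["ncells"] >= min_cells:
            big_twosies.add(frozenset(seq for _, seq in e["chains"]))

    doomed = []
    for i, e in enumerate(exacts):
        if len(e["chains"]) != 4:
            continue
        for (t1, s1), (t2, s2) in combinations(e["chains"], 2):
            paired = frozenset((t1, t2)) in VALID_PAIRS
            if paired and frozenset((s1, s2)) in big_twosies:
                doomed.append(i)  # deleted regardless of its own cell count
                break
    return doomed
```

In the example above, the 22-cell and 1-cell foursies both contain the heavy and light chains of the 34-cell twosie, so under this sketch both would be deleted.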
UMI filtering. enclone filters out B cells having low UMI counts, relative to a baseline that is determined for each dataset, according to a heuristic described here, unless the argument NUMI is supplied, to turn off that filter.
The motivation for this filter is to mitigate illusory clonotype expansions arising from fragmentation of plasma cells or other physical processes (not all fully understood). These processes all result in "cells" having low UMI counts, many of which do not correspond to intact real cells. Illusory clonotype expansions are generally infrequent, but occasionally cluster in individual datasets.
Nomenclature: for any cell, find the maximum UMI count for its heavy chains, if any, and the maximum for its light chains, if any. The sum of these two maxima is denoted umitot.
The algorithm for this filter first establishes a baseline for the expected value of umitot, for each dataset taken individually. To do this, all clonotypes having exactly one cell and exactly one heavy and light chain each are examined. If there are fewer than 20 such cells, the filter is not applied to cells in that dataset. Otherwise, let n_50% denote the median of the umitot values for the dataset, and let n_10% denote the 10th percentile. Let
umin = min( n_10%, n_50% - 4 * sqrt(n_50%) ).
This is the baseline low value for umitot. The reason for having the second part of the min is to prevent filtering in cases where UMI counts are sufficiently low that Poisson variability could cause a real cell to appear fake.
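A minimal sketch of the baseline computation follows. The text does not say how the median and the 10th percentile are computed, so the sketch assumes simple index-based picks from the sorted values; the function names are illustrative and are not taken from enclone.

```python
import math


def umitot(heavy_umis, light_umis):
    """Sum of the maximum heavy-chain and maximum light-chain UMI counts."""
    return max(heavy_umis, default=0) + max(light_umis, default=0)


def baseline_umin(umitots, min_cells=20):
    """Baseline low value for umitot, or None if the filter is not applied.

    umitots: umitot values of the single-cell clonotypes having exactly one
    heavy and one light chain, for one dataset.
    """
    if len(umitots) < min_cells:
        return None
    v = sorted(umitots)
    n50 = v[len(v) // 2]   # assumed median convention
    n10 = v[len(v) // 10]  # assumed 10th percentile convention
    return min(n10, n50 - 4 * math.sqrt(n50))
```

With a median of 9, for example, the second term is 9 - 4 * 3 = -3, so umin is negative and no cell can fall below it, which is the protective behavior described above for low UMI counts.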
Next we scan each clonotype having at least two cells, and delete every cell having umitot < umin, with the following qualifications:
- Let k be the number of cells to be deleted in a clonotype having n cells. Then we require that for a binomial distribution having p = 0.1, the probability of observing k or more events in a sample of size n is less than 0.01. The more cells are flagged in a clonotype, the more likely this test is satisfied, which is the point of the test.
- If every cell in a clonotype would be deleted, then we find its exact subclonotype having the highest sum for umitot, summing across its cells. Then we protect from deletion the cell in this exact subclonotype having the highest umitot value. We do this because in general even if a clonotype expansion did not occur, there was probably at least a single bona fide cell that gave rise to it.
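The deletion step can be sketched as below. It assumes that the protection rule applies only when every cell of the clonotype is flagged, and it represents a clonotype as a list of non-empty exact subclonotypes, each a list of umitot values; both the representation and the function names are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p=0.1):
    """Probability of observing k or more events in n trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def cells_to_delete(clonotype, umin, alpha=0.01):
    """Return (exact index, cell index) pairs for the cells to delete."""
    cells = [(i, j, u) for i, ex in enumerate(clonotype) for j, u in enumerate(ex)]
    n = len(cells)
    if n < 2:
        return []
    flagged = [(i, j) for i, j, u in cells if u < umin]
    k = len(flagged)
    # Delete only if k low-UMI cells out of n is improbable enough.
    if k == 0 or binomial_tail(k, n) >= alpha:
        return []
    if k == n:
        # Protect the best cell of the exact subclonotype with the highest umitot sum.
        best_exact = max(range(len(clonotype)), key=lambda i: sum(clonotype[i]))
        row = clonotype[best_exact]
        best_cell = max(range(len(row)), key=lambda j: row[j])
        flagged.remove((best_exact, best_cell))
    return flagged
```

For a clonotype of ten cells, four flagged cells give a tail probability of about 0.013 and nothing is deleted, whereas five flagged cells give about 0.0016 and all five are deleted.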
A better test could probably be devised that started from the expected distribution of UMI counts. The test would trigger based on the number and improbability of low UMI counts. The current test only considers the number of counts that fall below a threshold, and not their particular values.
UMI ratio filtering. enclone filters out B cells having low UMI counts, relative to other UMI counts in a given clonotype, according to a heuristic described here, unless the argument NUMI_RATIO is supplied, to turn off that filter.
First we mark a cell for possible deletion, if the VDJ UMI count for some chain of some other cell is at least 500 times greater than the total VDJ UMI count for the given cell.
Then we scan each clonotype having at least two cells, and delete every cell marked as above, with the following qualification. Let k be the number of cells to be deleted in a clonotype having n cells. Then we require that for a binomial distribution having p = 0.1, the probability of observing k or more events in a sample of size n is less than 0.01.
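The UMI ratio rule can be sketched in the same spirit. Here a clonotype is a list of cells, each given as a list of per-chain VDJ UMI counts with at least one UMI in total; this layout and the function names are assumptions for illustration only.

```python
from math import comb


def binomial_tail(k, n, p=0.1):
    """Probability of observing k or more events in n trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def umi_ratio_deletions(cells, ratio=500, alpha=0.01):
    """Return indices of the cells that the UMI ratio rule would delete."""
    n = len(cells)
    if n < 2:
        return []
    marked = []
    for i, chains in enumerate(cells):
        total = sum(chains)
        # Largest single-chain UMI count among the other cells of the clonotype.
        others_max = max(
            (u for j, other in enumerate(cells) if j != i for u in other),
            default=0,
        )
        if others_max >= ratio * total:
            marked.append(i)
    k = len(marked)
    if k == 0 or binomial_tail(k, n) >= alpha:
        return []
    return marked
```

Note that a single marked cell in a ten-cell clonotype is not deleted under this sketch, because the tail probability for k = 1 and n = 10 is about 0.65.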
Doublet filtering. This filtering removes some exact subclonotypes that appear to represent doublets (or possibly higher-order multiplets). The first Cell Ranger version in which this appeared was 6.0.
The algorithm works by first computing pure subclonotypes. This is done by taking each clonotype and breaking it apart according to its chain signature. All the exact subclonotypes that have entries for particular chains (and not entries for the other chains) are merged together to form a pure subclonotype.
In the simplest case, where the clonotype has two chains, the clonotype could give rise to three pure subclonotypes: one for the exact subclonotypes that have both chains, and one each for the subclonotypes that have only one chain.
The algorithm then finds triples (p0, p1, p2) of pure subclonotypes, for which the following three conditions are all satisfied:
- p0 and p1 share an identical CDR3 DNA sequence
- p0 and p2 share an identical CDR3 DNA sequence
- p1 and p2 do not share an identical CDR3 DNA sequence.
Finally, if 5 * ncells(p0) <= min( ncells(p1), ncells(p2) ), the entire pure subclonotype p0 is deleted. And after all these operations are completed, some of the original clonotypes may break up into separate clonotypes, as they may no longer be held together by shared chains.
If the argument NDOUBLET is supplied to enclone, then doublet filtering is not applied.
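The triple search can be sketched as follows. Each pure subclonotype is represented here as a (set of CDR3 DNA sequences, cell count) tuple, which is a hypothetical layout chosen for the example, and the brute-force triple loop is only meant to mirror the stated conditions, not enclone's actual code.

```python
def doublet_pures(pures, factor=5):
    """Return indices of pure subclonotypes p0 that the rule would delete.

    pures: list of (frozenset of CDR3 DNA sequences, number of cells).
    """
    doomed = set()
    for i0, (c0, n0) in enumerate(pures):
        for i1, (c1, n1) in enumerate(pures):
            if i1 == i0 or not (c0 & c1):
                continue  # p0 and p1 must share a CDR3
            for i2, (c2, n2) in enumerate(pures):
                if i2 in (i0, i1):
                    continue
                shares_with_p0 = bool(c0 & c2)
                disjoint_from_p1 = not (c1 & c2)
                if shares_with_p0 and disjoint_from_p1 and factor * n0 <= min(n1, n2):
                    doomed.add(i0)
    return sorted(doomed)
```

A typical hit is a small four-chain pure subclonotype whose chains are split between two much larger, unrelated two-chain pure subclonotypes.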
Signature filtering. This filter removes some exact subclonotypes that appear to represent contaminants, based on their chain signature. This filter sometimes breaks up complex clonotypes having many chains and representing multiple true clonotypes that are glued together into a single clonotype via exact subclonotypes whose constituent barcodes do not arise fully from single cells.
This filter can dramatically affect certain datasets, but has almost no effect on typical data (less than one per million clonotypes tested).
The algorithm uses some terminology described at doublet filtering, above. Given a pure subclonotype p having at least two chains, if the total cells in the two-chain pure subclonotypes that are different from it but share a chain with it is at least 20 times greater than the number of cells in p, then p is deleted.
If the argument NSIG is supplied to enclone, then signature filtering is not applied.
This filter first appeared after Cell Ranger version 6.1.
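A sketch of the signature test is given below. It reads "at least 20 times greater" as total >= 20 * ncells(p), and it represents each pure subclonotype of a clonotype as a (set of chain indices, cell count) tuple; both choices are assumptions made for the example.

```python
def signature_deletions(pures, factor=20):
    """Return indices of pure subclonotypes that the signature rule would delete.

    pures: list of (frozenset of chain indices, number of cells).
    """
    doomed = []
    for i, (chains, ncells) in enumerate(pures):
        if len(chains) < 2:
            continue
        # Cells in other two-chain pure subclonotypes sharing a chain with this one.
        total = sum(
            m
            for other, m in pures
            if len(other) == 2 and other != chains and (other & chains)
        )
        if total >= factor * ncells:
            doomed.append(i)
    return doomed
```

In this sketch, a three-cell pure subclonotype carrying chains from two large two-chain pure subclonotypes is removed, which may then let the clonotype split into its true components.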
Weak chain filtering. If a clonotype has three or more chains, and amongst those there is a chain that appears in a relatively small number of cells, we delete all the cells that support that chain. This filter is turned off if NWEAK_CHAINS is specified. The precise condition is that the number of cells supporting the chain is at most 20, and 8 times that number of cells is less than the total number of cells in the clonotype. For the current Cell Ranger, replace 20 by 5. This will change at some point after Cell Ranger 6.0.
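The weak chain condition reduces to two comparisons per chain, sketched below. The function name and the input layout (one supporting-cell count per chain) are hypothetical; the max_support parameter defaults to the enclone value of 20 and can be set to 5 to mimic the Cell Ranger variant mentioned above.

```python
def weak_chains(support, total_cells, max_support=20, factor=8):
    """Return indices of chains whose supporting cells would be deleted.

    support: number of cells supporting each chain of the clonotype.
    total_cells: total number of cells in the clonotype.
    """
    if len(support) < 3:
        return []  # the filter applies only to clonotypes with three or more chains
    return [
        c
        for c, s in enumerate(support)
        if s <= max_support and factor * s < total_cells
    ]
```

For a 200-cell clonotype with chain supports of 200, 198 and 4, the third chain qualifies, since 4 is at most 20 and 8 * 4 = 32 is less than 200.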