clonotype filtering options
enclone provides filtering by cell, by exact subclonotype, and by clonotype. This page describes
filtering by clonotype. These options cause only certain clonotypes to be printed. See also
"enclone help special", which describes other filtering options. This page also describes
scanning for feature enrichment.
┌─────────────────────┬───────────────────────────────────────────────────────────────────────────┐
│MIN_CELLS=n │ only show clonotypes having at least n cells │
│MAX_CELLS=n │ only show clonotypes having at most n cells │
│CELLS=n │ only show clonotypes having exactly n cells │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MIN_UMIS=n │ only show clonotypes having ≳ n UMIs on some chain on some cell │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MIN_CHAINS=n │ only show clonotypes having at least n chains │
│MAX_CHAINS=n │ only show clonotypes having at most n chains │
│CHAINS=n │ only show clonotypes having exactly n chains │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│CDR3=<pattern> │ only show clonotypes having a CDR3 amino acid seq that matches │
│ │ the given pattern*, from beginning to end │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│SEG="s_1|...|s_n" │ only show clonotypes using one of the given reference segment names │
│SEGN="s_1|...|s_n" │ only show clonotypes using one of the given reference segment numbers │
│ │ both: looks for V, D, J and C segments; double quote only │
│ │ needed if n > 1 │
│ │ For both SEG and SEGN, multiple instances are allowed, and their │
│ │ effects are cumulative. │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│NSEG="s_1|...|s_n" │ do not show clonotypes using one of the given reference segment names │
│NSEGN="s_1|...|s_n" │ do not show clonotypes using one of the given reference segment numbers │
│ │ Otherwise similar to SEG and SEGN. │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MAX_EXACTS=n │ only show clonotypes having at most n exact subclonotypes │
│MIN_EXACTS=n │ only show clonotypes having at least n exact subclonotypes │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│VJ=seq │ only show clonotypes using exactly the given V..J sequence │
│ │ (string in alphabet ACGT) │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MIN_DATASETS=n │ only show clonotypes containing cells from at least n datasets │
│MAX_DATASETS=n │ only show clonotypes containing cells from at most n datasets │
│MIN_DATASET_RATIO=n │ only show clonotypes having at least n cells and for which the ratio │
│ │ of the number of cells in the most abundant dataset to the next most │
│ │ abundant one is at least n │
│DATASET="d1|...|dn" │ only show clonotypes having at least one of the listed datasets │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MIN_ORIGINS=n │ only show clonotypes containing cells from at least n origins │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MIN_DONORS=n │ only show clonotypes containing cells from at least n donors │
│ │ If n ≥ 2, this automatically turns on MIX_DONORS, as otherwise cells from│
│ │ two or more donors would not be combined into the same clonotype. │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│CDIFF │ only show clonotypes having a difference in constant region with the │
│ │ universal reference │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│DEL │ only show clonotypes exhibiting a deletion │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│BARCODE=bc1,...,bcn │ only show clonotypes that use one of the given barcodes; note that such │
│ │ clonotypes will typically contain cells that are not in your │
│ │ list; if you want to fully restrict to a list of barcodes you can use │
│ │ the KEEP_CELL_IF option; please see "enclone help special" │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│INKT │ only show clonotypes for which some exact subclonotype is annotated as │
│ │ having some iNKT evidence, see bit.ly/enclone for details │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MAIT │ only show clonotypes for which some exact subclonotype is annotated as │
│ │ having some MAIT evidence, see bit.ly/enclone for details │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│D_INCONSISTENT │ only show clonotypes having an inconsistent assignment of D genes │
│D_NONE │ only show clonotypes having a null D gene assignment │
│D_SECOND │ only show VDDJ clonotypes │
└─────────────────────┴───────────────────────────────────────────────────────────────────────────┘
* Examples of how to specify CDR3:
Two pattern types are allowed: either regular expressions, or "Levenshtein distance patterns", as
exhibited by examples below.
┌────────────────────────────────────────┬────────────────────────────────────────────────────┐
│CDR3=CARPKSDYIIDAFDIW │ have exactly this sequence as a CDR3 │
│CDR3="CARPKSDYIIDAFDIW|CQVWDSSSDHPYVF" │ have at least one of these sequences as a CDR3 │
│CDR3=".*DYIID.*" │ have a CDR3 that contains DYIID inside it │
│CDR3="CQTWGTGIRVF~3" │ CDR3s within Levenshtein distance 3 of CQTWGTGIRVF│
│CDR3="CQTWGTGIRVF~3|CQVWDSSSDHPYVF~2" │ CDR3s within Levenshtein distance 3 of CQTWGTGIRVF│
│ │ or Levenshtein distance 2 of CQVWDSSSDHPYVF │
└────────────────────────────────────────┴────────────────────────────────────────────────────┘
Note that double quotes should be used if the pattern contains characters other than letters.
A gentle introduction to regular expressions may be found at
https://en.wikipedia.org/wiki/Regular_expression#Basic_concepts, and a precise
specification for the regular expression version used by enclone may be found at
https://docs.rs/regex.
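The two pattern types can be illustrated with a minimal Python sketch. It is not enclone's
implementation: the helper names levenshtein and cdr3_matches are hypothetical, the rule used to
tell the two pattern types apart (every alternative carries a ~k suffix) is an assumption, and
Python's re module stands in for the Rust regex crate, whose syntax differs in some details.
"Within distance k" is read here as distance at most k.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def cdr3_matches(pattern: str, cdr3: str) -> bool:
    """Hypothetical helper: does cdr3 satisfy a CDR3=<pattern> style argument?"""
    alternatives = pattern.split("|")
    # Assumption: a Levenshtein pattern is one where every alternative is SEQ~k.
    if all("~" in alt for alt in alternatives):
        for alt in alternatives:
            seq, k = alt.rsplit("~", 1)
            if levenshtein(seq, cdr3) <= int(k):
                return True
        return False
    # Otherwise treat it as a regular expression that must match the whole CDR3.
    return re.fullmatch(pattern, cdr3) is not None
```

Under these assumptions, cdr3_matches(".*DYIID.*", "CARPKSDYIIDAFDIW") holds, whereas the bare
pattern DYIID does not match the same sequence, because the match runs from beginning to end.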
linear conditions
enclone understands linear conditions of the form
c1*v1 ± ... ± cn*vn > d
where each ci is a constant, "ci*" may be omitted, each vi is a parseable variable, and d is a
constant. Blank spaces are ignored. The > sign may be replaced by
• >= or equivalently ≥ or ⩾
• <
• <= or equivalently ≤ or ⩽.
The details of how enclone evaluates a linear condition for a clonotype are subtle, and these
subtleties may or may not matter for what you're doing. You may wish to look at the specific
examples given below. For more detail, here are the rules:
• When a variable is assessed for a given cell, we use the value that would have been obtained
using parseable output (including with the PCELL mode); see "enclone help parseable". In most
cases it will make more sense to use the per-cell version of a variable, if it is defined.
For example, u1_cell would be the number of UMIs for the first chain for a given cell, but u1
would be the median value for all cells in an exact subclonotype, regardless of which cell is
examined.
• For each variable, enclone finds its values for all cells in the clonotype. Values that are not
finite numbers are ignored. This can have unintended consequences, so be careful not to
accidentally use a variable that is non-numeric.
• If no such values are found for some variable, then the constraint fails.
• Otherwise, some function is applied to all the values for a given variable (e.g. the mean
function) and the constraint is tested, after substituting in the values from the function.
The particular function that is used is documented at the appropriate point.
Because the minus sign - doubles as a hyphen and is used in some feature names, we allow
parentheses around variable names to prevent erroneous parsing, like this: (IGHV3-7_g) >= 1. Such
an expression needs to be quoted on the command line.
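The evaluation rules above can be sketched in Python as follows. This is an illustration under
stated assumptions, not enclone's code: the function name evaluate_condition, the representation
of a condition as a list of (coefficient, variable) pairs, and the per-cell dictionaries are all
hypothetical, and parsing of the condition string is left out.

```python
import math
import operator
from statistics import mean

# Comparison operators allowed in a linear condition.
COMPARE = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def evaluate_condition(terms, op, bound, cells, reduce=mean):
    """Evaluate c1*v1 + ... + cn*vn <op> bound for one clonotype.

    terms  : list of (coefficient, variable_name) pairs (hypothetical representation)
    cells  : list of dicts mapping variable names to per-cell values, one dict per cell
    reduce : function applied to the values of each variable (e.g. mean, min, max)
    """
    total = 0.0
    for coeff, var in terms:
        values = []
        for cell in cells:
            try:
                x = float(cell.get(var, ""))
            except (TypeError, ValueError):
                continue  # blank or non-numeric values are ignored
            if math.isfinite(x):
                values.append(x)  # NaN and infinities are ignored too
        if not values:
            return False  # no usable value for this variable: the constraint fails
        total += coeff * reduce(values)
    return COMPARE[op](total, bound)
```

For instance, with u1_cell values 10, 30 and blank across three cells, the mean is taken over 10
and 30 only, so the condition u1_cell >= 20 holds while u1_cell > 20 does not.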
filtering by linear conditions
enclone has the capability to filter by bounding variables, using the command-line argument:
KEEP_CLONO_IF_CELL_MEAN="L"
where L is a linear condition (as defined above). Multiple bounds may be imposed by using
multiple instances of KEEP_CLONO_IF_CELL_MEAN=... . As explained above, note that
KEEP_CLONO_IF_CELL_MEAN=... filters by computing the mean across all cells in the clonotype. See
also KEEP_CELL_IF= at "enclone help special".
If for a given clonotype and a given variable, not all values are specified (e.g. if for a
user-specified variable, values are blank), then only the values that are specified are used in
the computation of mean and max. If no values are specified, then the condition fails.
Similarly, to filter by the min or max across all cells in a clonotype, one may use
KEEP_CLONO_IF_CELL_MIN="L"
or
KEEP_CLONO_IF_CELL_MAX="L"
and otherwise as above.
Caution. Because of interactions between filters (including built-in filters), the results of
filtering can be counterintuitive. In particular, cells might be removed from a clonotype after a
linear condition is applied, leading to confusing results.
For cell-exact variables (see https://10xgenomics.github.io/enclone/pages/auto/variables.html),
note that linear conditions are applied to the cell version of the variable.
feature scanning
If gene expression and/or feature barcode data have been generated, enclone can scan all features
to find those that are enriched in certain clonotypes relative to certain other clonotypes. This
feature is turned on using the command line argument
SCAN="test,control,threshold"
where each of test, control and threshold is a linear condition as defined above. Blank spaces
are ignored. The test condition defines the "test clonotypes" and the control condition defines
the "control clonotypes". The threshold condition is special: it may use only the variables "t"
and "c" that represent the raw UMI count for a particular gene or feature, for the test (t) or
control (c) clonotypes. To get a meaningful result, you should specify MIN_CELLS appropriately
and manually examine the test and control clonotypes to make sure that they make sense.
If in addition the argument SCAN_EXACT is supplied, then scanning will be carried out over exact
subclonotypes rather than clonotypes.
an example
Suppose that your data are comprised of two origins with datasets
named pre and post, representing time points relative to some event. Then
SCAN="n_post - 10*n_pre >= 0, n_pre - 0.5*n_post >= 0, t - 2*c >= 0.1"
would define the test clonotypes to be those satisfying n_post >= 10*n_pre (so having far more
post cells than pre cells), the control clonotypes to be those satisfying n_pre >= 0.5*n_post (so
having lots of pre cells), and thresholding on t >= 2*c + 0.1, so that the feature must have a bit
more than twice as many UMIs in the test than the control. The 0.1 is there to exclude noise from
features having very low UMI counts.
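The arithmetic of this example can be checked with a short Python sketch. The function names
classify and passes_threshold are hypothetical, and t and c are simply taken as given numbers;
how enclone aggregates UMI counts into t and c is not modeled here.

```python
def classify(n_pre: int, n_post: int) -> tuple[bool, bool]:
    """Return (is_test, is_control) for a clonotype with the given cell counts."""
    is_test = n_post - 10 * n_pre >= 0      # n_post >= 10*n_pre
    is_control = n_pre - 0.5 * n_post >= 0  # n_pre >= 0.5*n_post
    return is_test, is_control


def passes_threshold(t: float, c: float) -> bool:
    """Threshold condition t - 2*c >= 0.1, i.e. t >= 2*c + 0.1."""
    return t - 2 * c >= 0.1


# A clonotype with 0 pre cells and 12 post cells is a test clonotype.
print(classify(0, 12))
# A clonotype with 8 pre cells and 10 post cells is a control clonotype.
print(classify(8, 10))
# A feature with t = 5 and c = 2 passes; t = 0.05 and c = 0 is rejected as noise.
print(passes_threshold(5.0, 2.0), passes_threshold(0.05, 0.0))
```

The last case shows the role of the constant 0.1: without it, a feature with a tiny test count
and a zero control count would satisfy t >= 2*c.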
Feature scanning is not a proper statistical test. It is a tool for generating a list of feature
candidates that may then be examined in more detail by rerunning enclone using some of the
detected features as lead variables (appropriately suffixed). Ultimately the power of the scan is
determined by having "enough" cells in both the test and control sets, and in having those sets
cleanly defined.
Currently feature scanning requires that each dataset have identical features.