enclone help lvars


lead column options

These options define lead variables, which are variables that are computed for each exact
subclonotype, and if using the PER_CELL option, also computed for each cell.  In addition, lead
variables can be used for parseable output.

Lead variables are shown in columns on the left side of each clonotype.  These columns appear once
per clonotype and have one entry for each exact subclonotype row.

Note that for medians of integers, we actually report the "rounded median", the result of rounding
the true median up to the nearest integer, so that e.g. 6.5 is rounded up to 7.
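
As an illustration only, the rounding rule can be sketched in Python.  The function name
rounded_median is hypothetical and this is not enclone's own code; it simply follows the
definition stated above for a nonempty list of integers.

```python
def rounded_median(values):
    """Median of a nonempty list of integers, with a half rounded up."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the median is the middle element, already an integer.
        return xs[mid]
    # Even count: the true median is (a + b) / 2, which is an integer or ends in .5.
    total = xs[mid - 1] + xs[mid]
    # (total + 1) // 2 is the ceiling of total / 2, so 6.5 becomes 7.
    return (total + 1) // 2


print(rounded_median([6, 7]))     # true median is 6.5
print(rounded_median([1, 2, 3]))  # true median is already an integer
```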

See also enclone help cvars and the inventory of all variables at
            https://10xgenomics.github.io/enclone/pages/auto/inventory.html.

Lead variables are specified using LVARS=x1,...,xn where each xi is one of:

┌──────────────────┬───────────────────────────────────────────────────────────────────────────────┐
│nchains           │  total number of chains in the clonotype                                      │
│nchains_present   │  number of chains present in an exact subclonotype                            │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│datasets          │  dataset identifiers                                                          │
│origin            │  origin identifiers                                                           │
│donors            │  donor identifiers                                                            │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│n                 │  number of cells                                                              │
│n_<name>          │  number of cells associated to the given name, which can be a dataset         │
│                  │  or origin or donor or tag short name; may name only one such category        │
│clonotype_ncells  │  total number of cells in the clonotype                                       │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│nd<k>             │  For k a positive integer, this creates k+1 fields that are specific to each  │
│                  │  clonotype.  The first field is n_<d1>, where d1 is the name of the dataset   │
│                  │  having the most cells in the clonotype.  If k ≥ 2, then you'll get a         │
│                  │  "runner-up" field n_<d2>, etc.  Finally you get a field n_other; however,    │
│                  │  fields will be elided if they represent no cells.  Use a variable of this    │
│                  │  type at most once.                                                           │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│near              │  Hamming distance of V..J DNA sequence to nearest neighbor                    │
│far               │  Hamming distance of V..J DNA sequence to farthest neighbor                   │
│                  │  both compare to cells having chains in the same columns of the clonotype,    │
│                  │  with - shown if there is no other exact subclonotype to compare to           │
│dref              │  Hamming distance of V..J DNA sequence to donor reference, excluding          │
│                  │  region of recombination, sum over all chains                                 │
│dref_aa           │  Hamming distance of V..J amino acid sequence to donor reference, excluding   │
│                  │  region of recombination, sum over all chains                                 │
│dref_max          │  Hamming distance of V..J DNA sequence to donor reference, max over all       │
│                  │  chains                                                                       │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│count_<reg>       │  Number of matches of the V..J amino acid sequences of all chains to the given│
│                  │  regular expression, which is treated as a subset match, so for example,      │
│                  │  count_CAR would count the total number of occurrences of the string CAR in   │
│                  │  all the chains.  Please see enclone help filter for a discussion             │
│                  │  about regular expressions.  We also allow the form abbr:count_<regex>,       │
│                  │  where abbr is an abbreviation that will appear as the field label.           │
│count_<f>_<reg>   │  Supposing that f is in {cdr1,..,cdr3,fwr1,..,fwr4,cdr,fwr}, this is similar  │
│                  │  to the above but restricted to motifs lying entirely within                  │
│                  │  a given feature or feature set.                                              │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│inkt              │  A string showing the extent to which the T cells in an exact subclonotype    │
│                  │  have evidence for being an iNKT cell.  The most evidence is denoted 𝝰gj𝝱gj,  │
│                  │  representing both gene name and junction sequence (CDR3) requirements for    │
│                  │  both chains.  See bit.ly/enclone for details on the requirements.            │
│mait              │  Same as with inkt but for MAIT cells instead.                                │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│g<d>              │  Here d is a nonnegative integer.  Then all the exact subclonotypes are       │
│                  │  grouped according to the Hamming distance of their V..J sequences.  Those    │
│                  │  within distance d are defined to be in the same group, and this is           │
│                  │  extended transitively.  The group identifier 1, 2, ... is shown.  The        │
│                  │  ordering of these identifiers is arbitrary.  This option is best applied     │
│                  │  to cases where all exact subclonotypes have a complete set of chains.        │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│gex               │  ● median gene expression UMI count                                           │
│n_gex             │  ● number of cells found by cellranger using GEX or Ab data                   │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│<gene>_g          │  ● all four feature types: look for a declared feature of the given type      │
│<antibody>_ab     │  with the given id or name; report the median UMI count for it;               │
│<crispr>_cr       │  we also allow <regular expression>_g where g can be replaced by ab, ag, cr   │
│<custom>_cu       │  or cu; this represents a sum of UMI counts across the matching features. ●   │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│cred              │  Short for credibility.  It is a measure of the extent to which cells         │
│                  │  having gene expression similar to a given putative B cell are themselves     │
│                  │  B cells.  (Or similarly for T cells.)  For the actual definition, let n      │
│                  │  be the number of VDJ cells that are also GEX cells.  For a given cell,       │
│                  │  find the n GEX cells that are closest to it in PCA space, and report the     │
│                  │  percent of those that are also VDJ cells.  For multiple datasets, it would   │
│                  │  be better to "aggr" the data; however, that is not currently supported.      │
│                  │  The computation is also inefficient, so let us know if it's causing          │
│                  │  problems for you.  And cred makes much better sense for datasets that        │
│                  │  consist of mixed cell types, rather than consisting of pure B or T cells.    │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│filter            │  See enclone help special.  Use with PER_CELL.  If you turn off some          │
│                  │  default filters (or all default filters, e.g. with NALL_CELL), and this      │
│                  │  cell would have been deleted by one of the default filters, then this will   │
│                  │  show the name of the last filter that would have been applied to delete the  │
│                  │  cell.  (There are exceptions, please see enclone help special.)  Note        │
│                  │  that there are complex interactions between filters, so the actual effect    │
│                  │  with all default filters on may be significantly different.  Note also that  │
│                  │  use of NALL_CELL will typically result in peculiar artifacts, so this        │
│                  │  should only be used as an exploratory tool.                                  │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│nbc               │  numerically encoded barcode: a ten-digit number, padded with zeros           │
│                  │  on the left, which represents the base four encoding of the barcode DNA      │
│                  │  sequence, with A ==> 0, C ==> 1, G ==> 2 and T ==> 3; only defined for cells │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│hcomp             │  complexity of heavy chain, only computed if two chains, one heavy, one light │
│                  │  and computed by finding optimal D, aligning to concatenated VDJ,             │
│                  │  and then scoring +1 for each inserted base, +1 for each deletion,            │
│                  │  regardless of size, and +1 for each substitution                             │
│jun_ins           │  like hcomp, but only counts inserted bases                                   │
└──────────────────┴───────────────────────────────────────────────────────────────────────────────┘
For gene expression and feature barcode stats, such data must be provided as input to enclone.
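
Some entries in the table above are defined by small computations.  The Python sketch below
illustrates three of them (count_<reg>, g<d> and nbc) under stated assumptions; it is not
enclone's implementation, and all function names are hypothetical.  Assumptions: matches for
count_<reg> are counted without overlap; for g<d>, each exact subclonotype is represented by a
single sequence and only sequences of equal length are compared; for nbc, the first base of the
barcode is the most significant base-four digit and the input contains only A, C, G and T.

```python
import re


def count_matches(pattern, chain_aa_seqs):
    """count_<reg>: total occurrences of the pattern across all chains."""
    # Assumption: non-overlapping matches, as returned by re.findall.
    return sum(len(re.findall(pattern, seq)) for seq in chain_aa_seqs)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def group_ids(seqs, d):
    """g<d>: link sequences within Hamming distance d, extended transitively."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            # Assumption: sequences of different lengths are never linked directly.
            if len(seqs[i]) == len(seqs[j]) and hamming(seqs[i], seqs[j]) <= d:
                parent[find(i)] = find(j)

    # Number the groups 1, 2, ... in order of first appearance (the order is arbitrary).
    labels = {}
    result = []
    for i in range(len(seqs)):
        root = find(i)
        labels.setdefault(root, len(labels) + 1)
        result.append(labels[root])
    return result


def nbc(barcode_seq):
    """nbc: base-four encoding of a barcode, zero-padded to ten digits."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for base in barcode_seq:
        value = value * 4 + code[base]
    return f"{value:010d}"


print(count_matches("CAR", ["CARDCAR", "XXCARX"]))
print(group_ids(["AAAA", "AAAT", "AATT", "GGGG"], 1))
print(nbc("AAAAAAAAAAAAAAAC"))
```

In the g<d> example, the first and third sequences differ at two positions but still share a
group, because each is within distance 1 of the second sequence.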

● Example: IG.*_g matches all genes that begin with IG, and TR(A|B).*_g matches all genes that
begin with TRA or TRB.  Double quotes as in LVARS="..." may be needed.  The regular expression
must be in the alphabet A-Za-z0-9+_-.[]()|* and is only interpreted as a regular expression if it
contains a character in []()|*.  See enclone help filter for more information about regular
expressions.
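
The rule above can be sketched in Python as follows.  This is an illustration, not enclone's
code: the function names are hypothetical, the argument is the part of the variable before the
feature-type suffix (for example IG.* for IG.*_g), and it is an assumption that a regular
expression must match an entire feature name.

```python
import re
import string

# Alphabet and trigger characters as stated in the help text.
ALLOWED = set(string.ascii_letters + string.digits + "+_-.[]()|*")
TRIGGERS = set("[]()|*")


def is_regex(spec):
    """True if spec is interpreted as a regular expression rather than a literal name."""
    bad = set(spec) - ALLOWED
    if bad:
        raise ValueError(f"characters outside the allowed alphabet: {sorted(bad)}")
    return bool(set(spec) & TRIGGERS)


def matching_features(spec, feature_names):
    """Feature names selected by spec (assumption: whole-name match for regexes)."""
    if is_regex(spec):
        pattern = re.compile(spec)
        return [name for name in feature_names if pattern.fullmatch(name)]
    return [name for name in feature_names if name == spec]


genes = ["TRAC", "TRBC1", "TRGC1", "IGHM", "CD19"]
print(matching_features("TR(A|B).*", genes))
print(matching_features("CD19", genes))
```

Note that a period alone does not trigger interpretation as a regular expression, since it is
not one of the characters []()|*.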

  ● These variables have some alternate versions, as shown in the table below.
  
  ┌──────────┬───────────────────────────────┬──────────┬──────────────┬─────────────┬────────────┐
  │variable  │  semantics                    │  visual  │  visual      │  parseable  │  parseable │
  │          │                               │          │  (one cell)  │             │  (one cell)│
  ├──────────┼───────────────────────────────┼──────────┼──────────────┼─────────────┼────────────┤
  │x         │  median over cells            │  yes     │  this cell   │  yes        │  yes       │
  │x_mean    │  mean over cells              │  yes     │  null        │  yes        │  yes       │
  │x_μ       │  (same as above)              │  yes     │  null        │  yes        │  yes       │
  │x_sum     │  sum over cells               │  yes     │  null        │  yes        │  yes       │
  │x_Σ       │  (same as above)              │  yes     │  null        │  yes        │  yes       │
  │x_min     │  min over cells               │  yes     │  null        │  yes        │  yes       │
  │x_max     │  max over cells               │  yes     │  null        │  yes        │  yes       │
  │x_%       │  % of total GEX (genes only)  │  yes     │  this cell   │  yes        │  yes       │
  │x_cell    │  this cell                    │  no      │  no          │  no         │  this cell │
  └──────────┴───────────────────────────────┴──────────┴──────────────┴─────────────┴────────────┘
  Some explanation is required.  If you use enclone without certain options, you get the "visual"
  column.
  • Add the option PER_CELL (see enclone help display) and then you get visual output with extra
  lines for each cell within an exact subclonotype, and each of those extra lines is described by
  the "visual (one cell)" column.
  • If you generate parseable output (see enclone help parseable), then you get the "parseable"
  column for that output, unless you specify PCELL, in which case you get the "parseable (one cell)"
  column.
  • For the forms with μ and Σ, the Greek letters are only used in column headings for visual output
  (to save space), and optionally, in names of fields on the command line.
  ▶ If you try out these features, you'll see exactly what happens! ◀
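
For readers who prefer code, here is a minimal Python sketch of the statistics in the
"semantics" column, applied to the per-cell UMI counts of one exact subclonotype.  The function
name is hypothetical, the x_% form is omitted, and any rounding that enclone applies to the mean
when displaying it is not modeled.

```python
from statistics import mean


def summarize(per_cell_counts):
    """Compute x, x_mean, x_sum, x_min and x_max from a nonempty list of integer counts."""
    xs = sorted(per_cell_counts)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one cell")
    if n % 2 == 1:
        median = xs[n // 2]
    else:
        # Rounded median: a half is rounded up, as described earlier on this page.
        median = (xs[n // 2 - 1] + xs[n // 2] + 1) // 2
    return {
        "x": median,
        "x_mean": mean(xs),
        "x_sum": sum(xs),
        "x_min": xs[0],
        "x_max": xs[-1],
    }


print(summarize([3, 10, 0, 4]))
```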

● Similar to the above but simpler: n_gex is just a count of cells, visual (one cell) shows 0 or
1, n_gex_cell is defined for parseable (one cell), and the x_mean etc. forms do not apply.

The default is datasets,n, except that datasets is suppressed if there is only one dataset.

LVARSP=x1,...,xn is like LVARS but appends to the list.
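
The interaction between the default, LVARS and LVARSP can be sketched in Python as below.  This
is a model of the behavior described above, not enclone's argument parser.  The function name is
hypothetical, and two points are assumptions: arguments are processed from left to right, and
the suppression of datasets applies only to the default list.

```python
def resolve_lvars(args, n_datasets):
    """Return the list of lead variables implied by the command-line arguments."""
    # Default: datasets,n, with datasets suppressed if there is only one dataset.
    lvars = ["datasets", "n"] if n_datasets > 1 else ["n"]
    for arg in args:
        if arg.startswith("LVARSP="):
            # LVARSP appends to the current list.
            lvars = lvars + arg[len("LVARSP="):].split(",")
        elif arg.startswith("LVARS="):
            # LVARS replaces the current list.
            lvars = arg[len("LVARS="):].split(",")
    return lvars


print(resolve_lvars([], 2))
print(resolve_lvars(["LVARSP=near,far"], 2))
print(resolve_lvars(["LVARS=n,gex"], 2))
```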