enclone help input


enclone has two mechanisms for specifying input datasets: either directly on the command line or
via a supplementary metadata file. Only one mechanism may be used at a time.

In both cases, you will need to provide paths to directories where the outputs of the Cell Ranger
pipeline may be found.  enclone uses only some of the pipeline output files, so it is enough that
those files are present in the given directory.  The particular files that are needed may be found
by typing enclone help input_tech.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃If you use the argument PRE=p then p/ will be prepended to all pipeline paths.  A comma-separated ┃
┃list is also allowed PRE=p1,...,pn, in which case these directories are searched from left to     ┃
┃right, until one works, and if all fail, the path is used without prepending anything.  Lastly,   ┃
┃(see enclone help command), you can avoid putting PRE on the command line by setting the          ┃
┃environment variable ENCLONE_PRE to the desired value.  The default value for PRE is              ┃
┃~/enclone/datasets_me,~/enclone/datasets,~/enclone/datasets2.  There is also an argument PREPOST=x┃
┃that causes /x to be appended to all entries in PRE.                                              ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
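The PRE search described in the box can be pictured with the following Python sketch.  It is not
enclone's code: the function name resolve_with_pre and the exists parameter are hypothetical, and
it assumes that a directory "works" when the prepended path exists.

```python
import os

# Default value of PRE, as stated in the box above.
DEFAULT_PRE = "~/enclone/datasets_me,~/enclone/datasets,~/enclone/datasets2"

def resolve_with_pre(path, pre=DEFAULT_PRE, prepost=None, exists=os.path.exists):
    """Try each PRE directory from left to right; fall back to the bare path.

    Assumption: "works" means that the prepended path exists on disk.
    """
    pre_dirs = [p for p in pre.split(",") if p]
    if prepost is not None:
        # PREPOST=x appends /x to every entry in PRE.
        pre_dirs = [f"{p}/{prepost}" for p in pre_dirs]
    for p in pre_dirs:
        candidate = f"{os.path.expanduser(p)}/{path}"
        if exists(candidate):
            return candidate
    # All entries failed: use the path without prepending anything.
    return path
```

The exists parameter is only there so that the search order can be tried out without touching the
file system.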

Both input forms involve abbreviated names (discussed below), which should be as short as
possible, as longer abbreviations will increase the width of the clonotype displays.

█ 1 █ To point directly at input files on the command line, use e.g.
TCR=/home/jdoe/runs/dataset345
or likewise for BCR.  A more complicated syntax is allowed in which commas, colons and semicolons
act as delimiters.  Commas go between datasets from the same origin, colons between datasets from
the same donor, and semicolons separate donors.  If semicolons are used, the value must be quoted.

enclone uses the distinction between datasets, origins and donors in the following ways:
1. If two datasets come from the same origin, then enclone can filter to remove certain artifacts,
unless you specify the option NCROSS.
See also the illusory clonotype expansion page at bit.ly/enclone.
2. If two cells came from different donors, then enclone will not put them in the same clonotype,
unless you specify the option MIX_DONORS.
More information may be found at `enclone help special`.  In addition, the distinction between
datasets, origins and donors is enclone's way of keeping datasets organized, and it affects the
output of fields like origin.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃Naming.  Using this input system, each dataset is assigned an abbreviated name, which is         ┃
┃everything after the final slash in the directory name (e.g. dataset345 in the above example), or┃
┃the entire name if there is no slash; origins and donors are assigned identifiers s1,... and     ┃
┃d1,..., respectively; numbering of origins restarts with each new donor.  To specify origins     ┃
┃and donors, use the second input form, and see in particular abbr:path.                          ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Examples:
TCR=p1,p2   -- input data from two libraries from the same origin
TCR=p1,p2:q -- input data as above plus another from a different origin from the same donor
TCR="a;b"   -- input one library from each of two donors.
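The delimiter and naming rules above can be summarized in a short Python sketch.  It is only an
illustration, not enclone's parser; the function name parse_datasets is hypothetical, and the
sketch does no validation of the paths.

```python
def parse_datasets(value):
    """Split a TCR=... or BCR=... value into (donor, origin, abbreviation, path) tuples."""
    rows = []
    # Semicolons separate donors, which are named d1, d2, ...
    for d, donor_part in enumerate(value.split(";"), start=1):
        # Colons separate origins; origin numbering restarts with each new donor.
        for s, origin_part in enumerate(donor_part.split(":"), start=1):
            # Commas separate datasets from the same origin.
            for path in origin_part.split(","):
                # The abbreviated name is everything after the final slash,
                # or the entire name if there is no slash.
                abbr = path.rsplit("/", 1)[-1]
                rows.append((f"d{d}", f"s{s}", abbr, path))
    return rows
```

For instance, parse_datasets("p1,p2:q") places p1 and p2 in origin s1 and q in origin s2, all
under donor d1.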

Matching gene expression and/or feature barcode data may also be supplied using an argument GEX=...,
whose right side must have the exact same structure as the TCR or BCR argument.  Specification of
both TCR and BCR is not allowed.  If both BCR and GEX data are in the same directory (from a multi
run), a single argument BCR_GEX=... may be used, and similarly one may use TCR_GEX.

In addition, barcode-level data may be specified using BC=..., whose right side is a list of paths
having the same structure as the TCR or BCR argument.  Each such path must be for a CSV or TSV
file, which must include the field barcode, may include special fields origin, donor, tag and color,
and may also include arbitrary other fields.  The origin and donor fields allow a particular
origin and donor to be associated to a given barcode.  A use case for this is genetic
demultiplexing.  The tag field is intended to be used with tag demultiplexing.  The color field is
used by the PLOT option.  All other fields are treated as lead variables, but values are only
displayed in PER_CELL mode, or for parseable output using PCELL.  These fields should not include
existing lead variable names.  Use of BC automatically turns on the MIX_DONORS option.
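As a rough illustration of the BC file layout, the following Python sketch reads such a file and
separates the special fields from the others.  The function name read_bc is hypothetical, the
delimiter is passed explicitly rather than detected, and the sketch does not check the other
field names against existing lead variable names.

```python
import csv
import io

# Special fields named in the text above; barcode is the one required field.
SPECIAL_FIELDS = {"origin", "donor", "tag", "color"}

def read_bc(text, delimiter=","):
    """Return (rows keyed by barcode, list of the non-special field names)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    fields = reader.fieldnames or []
    if "barcode" not in fields:
        raise ValueError("a BC file must include the field barcode")
    # Every other field would be treated as a lead variable.
    other = [f for f in fields if f != "barcode" and f not in SPECIAL_FIELDS]
    rows = {row["barcode"]: row for row in reader}
    return rows, other
```

Use delimiter="\t" for a TSV file.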

Alternatively, an argument BC_JOINT=filename may be specified, where the filename is a CSV or TSV
file like those for BC=..., but with an additional field dataset, whose value is an abbreviated
dataset name, and which enables the information to be split up to mirror the specification of TCR
or BCR.
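One way to picture the role of the dataset field is the Python sketch below, which groups the
rows of a BC_JOINT-style file by abbreviated dataset name.  It is an assumption about the effect,
not a description of how enclone does it, and the function name split_bc_joint is hypothetical.

```python
import csv
import io
from collections import defaultdict

def split_bc_joint(text, delimiter=","):
    """Group the rows of a joint barcode file by the value of the dataset field."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    per_dataset = defaultdict(list)
    for row in reader:
        # Remove the dataset field; what remains looks like a row of a BC file.
        dataset = row.pop("dataset")
        per_dataset[dataset].append(row)
    return dict(per_dataset)
```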

The argument BC=... or equivalently BC_JOINT=filename may be used in conjunction with
KEEP_CELL_IF=... (see enclone help special) to restrict the barcodes used by enclone to a
specified set.

█ 2 █ To specify a metadata file, use the command line argument
META=filename
This file should be a CSV (comma-separated values) file, with one line per cell group.  After the
first line, blank lines and lines starting with # are ignored.  There must be a field tcr or bcr,
and some other fields are allowed:
┌────────┬───────────────┬──────────────────────────────────────────────────────────────┐
│field   │  default      │  meaning                                                     │
├────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│tcr     │  (required!)  │  path to dataset, or abbr:path, where abbr is an abbreviated │
│or bcr  │               │  name for the dataset; exactly one of tcr or bcr must be used│
├────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│gex     │  null         │  path to GEX dataset, which may include or consist entirely  │
│        │               │  of FB data                                                  │
├────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│origin  │  s1           │  abbreviated name of origin                                  │
├────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│donor   │  d1           │  abbreviated name of donor                                   │
├────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│color   │  null         │  color to associate to this dataset (for PLOT option)        │
├────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│bc      │  null         │  name of CSV file as in the BC option                        │
└────────┴───────────────┴──────────────────────────────────────────────────────────────┘
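The META file rules can be illustrated with a small Python sketch.  It is not enclone's parser:
the function name parse_meta is hypothetical, an empty field is assumed to take the default from
the table, and the abbreviated name is left unset when no abbr: prefix is given.

```python
import csv

# Defaults from the table above; None stands for null.
DEFAULTS = {"gex": None, "origin": "s1", "donor": "d1", "color": None, "bc": None}

def parse_meta(text):
    """Parse META-style CSV text into a list of records, one per cell group."""
    lines = text.splitlines()
    fields = lines[0].split(",")
    # After the first line, blank lines and lines starting with # are ignored.
    body = [ln for ln in lines[1:] if ln.strip() and not ln.startswith("#")]
    if ("tcr" in fields) == ("bcr" in fields):
        raise ValueError("exactly one of tcr or bcr must be used")
    key = "tcr" if "tcr" in fields else "bcr"
    records = []
    for row in csv.DictReader(body, fieldnames=fields):
        rec = dict(DEFAULTS)
        # Assumption: an empty field falls back to its default.
        rec.update({k: v for k, v in row.items() if v})
        if key not in rec:
            raise ValueError(f"missing required field {key}")
        # The value is either path or abbr:path.
        abbr, sep, path = rec[key].partition(":")
        rec["abbr"], rec[key] = (abbr, path) if sep else (None, abbr)
        records.append(rec)
    return records
```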

Multiple META arguments are cumulative and we also allow META to be a comma-separated list of
filenames.  In both cases the META files must have identical header lines.  In addition, metadata
may be fully specified on the command line via METAX="l1;...;ln" where the li are the lines that
you would otherwise put in the META file.

█ 3 █ enclone can also read an ancillary CSV file that specifies arbitrary fields that are
associated to particular immune receptor sequences.  This is done using INFO=path.csv.  The CSV
file must have fields vj_seq1, specifying the full heavy or TRB sequence from the beginning of the
V segment to the end of the J segment, and vj_seq2, for the light or TRA chain.  The other fields
are then made accessible as lvars (see enclone help lvars), which are populated for any exact
subclonotype having exactly two chains (heavy/light or TRB/TRA) that match the data in the CSV
file.  By default, one cannot have two lines for the same antibody; however, a separate argument
INFO_RESOLVE may be used to "pick the first one".
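The duplicate rule can be sketched in Python as follows.  This is only an illustration: the
function name load_info and its resolve flag are hypothetical stand-ins for INFO and INFO_RESOLVE,
and the sketch assumes that "the same antibody" means the same pair of vj_seq1 and vj_seq2 values.

```python
import csv
import io

def load_info(text, resolve=False):
    """Map (vj_seq1, vj_seq2) to the other fields of an INFO-style CSV file."""
    reader = csv.DictReader(io.StringIO(text))
    fields = reader.fieldnames or []
    for required in ("vj_seq1", "vj_seq2"):
        if required not in fields:
            raise ValueError(f"the CSV file must have the field {required}")
    table = {}
    for row in reader:
        key = (row.pop("vj_seq1"), row.pop("vj_seq2"))
        if key in table:
            if not resolve:
                raise ValueError("two lines for the same antibody")
            continue  # with resolve set, pick the first one
        table[key] = row
    return table
```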