enclone help faq


▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone main help page (what you get by typing "enclone")
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

The mission of enclone is to:

  Find and display the clonotypes within single cell VDJ datasets:
  groups of cells having the same fully rearranged common ancestor.

enclone is part of the 10x Genomics immune profiling tools, including Cell Ranger and Loupe. 
enclone uses output from Cell Ranger version ≥ 3.1.

The complete enclone documentation is at bit.ly/enclone.  This page catalogs the subset of those
pages that are directly accessible from the enclone command line.  These pages can be viewed in a
100 wide x 56 high window, except for those labeled "long" or "wide".

┌─────────────────────────┬─────────────────────────────────────────────────────────────────────┐
│command                  │  what it provides                                                   │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│enclone help             │  help to test for correct setup                                     │
│enclone                  │  what you see here: guide to all the doc                            │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│enclone help quick       │  quick guide to getting started                                     │
│enclone help how         │  how enclone works (long)                                           │
│enclone help command     │  info about enclone command line argument processing                │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│enclone help glossary    │  glossary of terms used by enclone, and conventions                 │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│enclone help example1    │  explanation of an example                                          │
│enclone help example2    │  example showing gene expression and feature barcodes (wide)        │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│enclone help input       │  how to provide input to enclone (long)                             │
│enclone help input_tech  │  how to provide input to enclone (technical notes)                  │
│enclone help parseable   │  parseable output (long)                                            │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│enclone help filter      │  clonotype filtering options, scanning for feature enrichment (long)│
│enclone help special     │  special filtering options (long)                                   │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│enclone help lvars       │  lead column options (long)                                         │
│enclone help cvars       │  per chain column options (long)                                    │
│enclone help amino       │  per chain column options for amino acids                           │
│enclone help display     │  other clonotype display options (long)                             │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│enclone help indels      │  insertion and deletion handling                                    │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│enclone help color       │  how enclone uses color, and related things                         │
│enclone help faq         │  frequently asked questions (long)                                  │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│enclone help all         │  concatenation of all the help pages (long, wide)                   │
│                         │  ███ USE THIS TO SEARCH ALL THE HELP PAGES! ███                     │
└─────────────────────────┴─────────────────────────────────────────────────────────────────────┘
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone setup page (for one time use, what you get by typing "enclone help")
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓


Welcome to enclone!

The purpose of this first page is to help you make sure that you're set up properly
to run enclone.  PLEASE READ!

(for the main help page, please type instead: enclone)

Here we go through several setup tests.

1. Are you using a fixed width font?
Look at this:
A FAT BROWN CAT JUMPED OVER THE WALL
||||||||||||||||||||||||||||||||||||
Do those two lines end at the same position?  If not, you need to switch your font.

2. Is your terminal window wide enough to see the help pages?
Your terminal needs to be at least 100 columns wide.  Look at this:
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
Does it appear as a single line?  If not, please widen your window.

3. Can your terminal display box characters?
Look at this:
┌────────┬─────────┐
│banana  │  peel   │
├────────┼─────────┤
│oops    │  slipped│
└────────┴─────────┘
Do you see a neat rectangle composed of four rectangles with words inside them?  Are the vertical
lines contiguous?  If not, something is wrong with your terminal!  You may need to change the
terminal font.  We use Menlo Regular at 13pt.  However, you may still observe small vertical
gaps between characters in some instances, depending on your computer and apparently
resulting from bugs in font rendition.

4. Can your terminal correctly display ANSI escape sequences?
The following word should be bold.  The following word should be blue.
If that doesn't make sense, or is messed up, something is wrong, and you have two options:
(a) seek help to fix your terminal window
(b) turn off escape sequences by adding PLAIN to every enclone command, or set
the environment variable ENCLONE_PLAIN.
But that should be only a last resort.

5. Can your terminal correctly display unicode characters?
Do you see a centered dot here • ?
If not, your terminal has a problem!

6. Does this entire help page appear at once in your terminal window?
If not, please increase the number of rows in your window to 56.


If you go through all those tests and everything worked, you should be good to go!


▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help quick
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

quick guide to getting started

Just type this:

enclone BCR=p

where p is the path to your Cell Ranger VDJ directory.

Substitute TCR if that's what you've got.

This will show you all the clonotypes, in descending order by number of cells.

You'll need to make your window wide enough so that lines are not folded.  This depends on the
dataset.

Only one page of output is shown at a time.  To navigate within the full output, use the space bar
to go forward and the b key to go backward.

See enclone help example1 for a detailed guide to how to read the enclone output.  A few key
things you should know:

1. You'll see numbers near the top.  These are amino acid position numbers, and
   they read downwards.  Numbering starts at the start codon, numbered zero.

2. Each numbered line represents an exact subclonotype: cells having identical V(D)J transcripts.

3. By default, you'll see data in amino acid space.  Only "interesting" amino acids are shown.

Please read on to learn more!

navigation in enclone

enclone automatically sends its output through the program "less".  This allows you to navigate
within the output, using the following keys (and many more, not shown, and which you don't need to
know):
• space: causes output to page forward
• b: causes output to page backward
• /string: finds instances of "string" in the output
• n: having done the previous, jump to the next instance
• q: quit, to return to the command line.

When enclone uses less, it passes the argument -R, which causes the escape codes that color or
bold text to be interpreted by the terminal rather than displayed as literal characters.

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help how
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

information about how enclone works

The goal of enclone is to find and display the clonotypes within single cell VDJ datasets: groups
of cells having the same fully rearranged common ancestor.

enclone provides the foundation for fully understanding each cell's antigen affinity and the
evolutionary relationship between cells within one or more datasets.  This starts with, for each
cell, the full length sequence of all its VDJ receptor chains.  Such data may be obtained using
the 10x Genomics immune profiling platform.

See also the heuristics page at bit.ly/enclone.

For this, there are fundamental challenges:

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│1. It is extremely easy to get false positives: the incorrect appearance that two cells have a    │
│common ancestor.                                                                                  │
│                                                                                                  │
│2. Because of somatic hypermutation in B cells, it can be difficult to know that two B cells share│
│a common ancestor.                                                                                │
│                                                                                                  │
│3. There is always some background noise, e.g. from ambient mRNA.  When building large clonotypes,│
│this noise tends to pile up, yielding ectopic chains, i.e. chains within a clonotype that are     │
│artifacts and do not represent true biology.                                                      │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘

To address these challenges, the enclone algorithm has several steps, which we outline:

1.  Input data.  enclone gets its information from the file all_contig_annotations.json that is
produced by Cell Ranger.  Only productive contigs are used.  Each has an annotated V and J
segment.  The V segment alignment may have a single indel whose length is divisible by three, and
in that case, the V reference sequence is edited either to delete or insert sequence.  In the
insertion case, the bases are taken from the contig.  These indels are noted in the enclone
output.

2.  Exact subclonotypes.  enclone groups cells into exact subclonotypes, provided that they have
the same number of chains, identical V..J sequences, identical C segment assignments, and the same
distance between the J stop and the C start (which is usually zero).
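
As a rough illustration of step 2 (a sketch, not enclone's actual implementation), the grouping
can be thought of as bucketing cells by a key built from the four criteria.  The field names
used here (barcode, chains, vj_seq, c_gene, j_stop_to_c_start) are hypothetical.

```python
from collections import defaultdict

def exact_subclonotype_key(cell):
    # Hypothetical fields: V..J bases, C segment name, J stop to C start distance.
    # Sorting makes the key independent of chain order; the tuple length
    # encodes the number of chains.
    return tuple(
        sorted((c["vj_seq"], c["c_gene"], c["j_stop_to_c_start"]) for c in cell["chains"])
    )

def group_exact_subclonotypes(cells):
    groups = defaultdict(list)
    for cell in cells:
        groups[exact_subclonotype_key(cell)].append(cell["barcode"])
    return list(groups.values())

heavy = {"vj_seq": "ATGGAC", "c_gene": "IGHM", "j_stop_to_c_start": 0}
light = {"vj_seq": "ATGGCC", "c_gene": "IGKC", "j_stop_to_c_start": 0}
heavy_g = dict(heavy, c_gene="IGHG1")  # same V..J, different C segment

cells = [
    {"barcode": "cell1", "chains": [heavy, light]},
    {"barcode": "cell2", "chains": [light, heavy]},
    {"barcode": "cell3", "chains": [heavy_g, light]},
    {"barcode": "cell4", "chains": [heavy]},
]
groups = group_exact_subclonotypes(cells)
print(groups)
```

In this toy input, cell1 and cell2 fall into one group, while cell3 (different C segment) and
cell4 (different number of chains) each form their own group.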

3.  Finding the germline sequences.  For datasets from a given donor, enclone derives "donor
reference sequences" for the V chains present in the donor's genome.  This is powerful, even
though based on imperfect information.  V segments vary in their expression frequency and thus the
more cells which are present, the more complete the information will be.  It is also not possible
to accurately determine the terminal bases in a V chain from transcript data alone because these
bases mutate during recombination and because of non-templated nucleotide addition.

The idea for how this is done is roughly the following: for each V segment, we choose one cell
from each clonotype (although these have not actually been computed yet, so it's an
approximation).  Next for each position on the V segment, excluding the last 15 bases, we
determine the distribution of bases that occur within these selected cells.  We only consider
those positions where a non-reference base occurs at least four times and is at least 25% of the
total.  Then each cell has a footprint relative to these positions; we require that these
footprints satisfy similar evidence criteria.  Each such non-reference footprint then defines an
"alternate allele".  We do not restrict the number of alternate alleles because they may arise
from duplicated gene copies.
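
The position-selection rule above can be sketched as follows.  This is only an illustration of
the stated thresholds (at least four occurrences, at least 25% of the total, last 15 bases
excluded); the function name, its parameters, and the assumption that "total" means the number
of selected cells covering the position are ours, and the footprint step is not modeled.

```python
from collections import Counter

def candidate_positions(reference, sequences, min_count=4, min_frac=0.25, skip_tail=15):
    """Map position -> non-reference bases that pass both evidence thresholds."""
    found = {}
    for pos in range(max(0, len(reference) - skip_tail)):
        counts = Counter(seq[pos] for seq in sequences if pos < len(seq))
        total = sum(counts.values())  # assumption: total = cells covering this position
        for base, n in sorted(counts.items()):
            if base != reference[pos] and n >= min_count and n >= min_frac * total:
                found.setdefault(pos, []).append(base)
    return found

reference = "ACGT" * 5  # toy 20-base V segment
# Variant at position 2 (kept) and at position 10 (ignored: within the last 15 bases).
alt = reference[:2] + "T" + reference[3:10] + "T" + reference[11:]
sequences = [reference] * 6 + [alt] * 4  # one sequence per selected cell

positions = candidate_positions(reference, sequences)
print(positions)
```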

A similar approach was attempted for J segments but at the time of testing did not appear to
enhance clonotyping specificity.  This could be revisited later and might be of interest even if
it does not improve specificity.

4.  What joins are tested.  Pairs of exact subclonotypes are considered for joining, as described
below.  This process only considers exact subclonotypes having two or three chains.  There is some
separate joining for the case of one chain.  Exact subclonotypes having four chains are not joined
at present.  These cases are clearly harder because these exact subclonotypes are highly enriched
for cell doublets, which we discard if we can identify them as such.

5.  Initial grouping.  For each pair of exact subclonotypes, and for each pair of chains in each
of the two exact subclonotypes, for which V..J has the same length for the corresponding chains,
and the CDR3 segments have the same length for the corresponding chains, enclone considers joining
the exact subclonotypes into the same clonotype.

6.  Shared mutations.  enclone next finds shared mutations between exact subclonotypes, that is,
for two exact subclonotypes, common mutations from the reference sequence, using the donor
reference for the V segments and the universal reference for the J segments.  Shared mutations are
presumed to be somatic hypermutations, which would be evidence of common ancestry.  By using the
donor reference sequences, most shared germline mutations are excluded, and this is critical for
the algorithm's success.

7.  Are there enough shared mutations?  We find the probability p that “the shared mutations occur
by chance”.  More specifically, given d shared mutations, and k total mutations (across the two
cells), we compute the probability p that a sample with replacement of k items from a set whose
size is the total number of bases in the V..J segments, yields at most k – d distinct elements. 
The probability is an approximation; for the method, please see
https://docs.rs/stirling_numbers/0.1.0/stirling_numbers.
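
For intuition, the probability described in step 7 can be computed exactly for small inputs using
Stirling numbers of the second kind S(k, j): the chance of seeing exactly j distinct items in k
draws from n is n(n-1)...(n-j+1) * S(k, j) / n^k.  The sketch below is an exact reference
calculation under that formula, not the approximation that enclone uses, and the function names
are ours.

```python
from fractions import Fraction

def stirling2_row(k):
    """Return [S(k, 0), ..., S(k, k)] via S(n, j) = j*S(n-1, j) + S(n-1, j-1)."""
    row = [1]  # S(0, 0)
    for n in range(1, k + 1):
        new = [0] * (n + 1)
        for j in range(1, n + 1):
            new[j] = (j * row[j] if j < n else 0) + row[j - 1]
        row = new
    return row

def prob_at_most_distinct(n, k, d):
    """P(k draws with replacement from n items yield at most k - d distinct items)."""
    s = stirling2_row(k)
    total = Fraction(0)
    falling = 1  # n * (n - 1) * ... * (n - j + 1)
    for j in range(1, k - d + 1):
        falling *= n - j + 1
        total += Fraction(falling * s[j], n ** k)
    return total

# Toy check: two draws from six items coincide with probability 1/6.
print(prob_at_most_distinct(6, 2, 1))
# A more realistic scale: 700 bases, 20 total mutations, 5 of them shared.
print(float(prob_at_most_distinct(700, 20, 5)))
```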

8.  Are there too many CDR3 mutations?  We define a constant N that is used below.  We first set
cd1 to the number of heavy chain CDR3 nucleotide differences, and cd2 to the number of light chain
CDR3 nucleotide differences.  Let n1 be the nucleotide length of the heavy chain CDR3, and
likewise n2 for the light chain.  Then N = 80^(42 * (cd1/n1 + cd2/n2)).  The number 80 may be
alternately specified via MULT_POW and the number 42 via CDR3_NORMAL_LEN.
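
The formula for N can be evaluated directly.  In this small sketch the function name is ours, and
the keyword defaults simply mirror the values 80 and 42 quoted above.

```python
def cdr3_penalty(cd1, n1, cd2, n2, mult_pow=80.0, cdr3_normal_len=42.0):
    """N = mult_pow ** (cdr3_normal_len * (cd1/n1 + cd2/n2))."""
    return mult_pow ** (cdr3_normal_len * (cd1 / n1 + cd2 / n2))

# No CDR3 differences: N is 1, so the score is just p.
print(cdr3_penalty(0, 42, 0, 36))
# One difference in a 42-base heavy chain CDR3: N is about 80.
print(cdr3_penalty(1, 42, 0, 36))
# One difference on each chain, both CDR3s 42 bases long: N is about 80 squared.
print(cdr3_penalty(1, 42, 1, 42))
```

Each CDR3 difference in a CDR3 of "normal" length thus multiplies N by roughly 80, and shorter
CDR3s are penalized more heavily per difference.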

9.  We also require CDR3 nucleotide identity of at least 85%.  The number 85 may be alternately
set using JOIN_CDR3_IDENT=....  The nucleotide identity is computed by dividing cd by the total
nucleotide length of the heavy and light chains, normalized.

10.  Key join criteria.  Two cells sharing sufficiently many shared differences and sufficiently
few CDR3 differences are deemed to be in the same clonotype.  That is, the lower p is, and the
lower N is, the more likely it is that the shared mutations represent bona fide shared ancestry. 
Accordingly, the smaller p*N is, the more likely it is that two cells lie in the same true
clonotype.  To join two cells into the same clonotype, we require that the bound p*N ≤ C is
satisfied, where C is the constant 100,000.  The value may be adjusted using the command-line
argument MAX_SCORE, or the log10 of this, MAX_LOG_SCORE.  This constant was arrived at by
empirically balancing sensitivity and specificity across a large collection of datasets.  See
results described at bit.ly/enclone.
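
The key join test is a single comparison.  The sketch below shows it both directly and in log10
form, which is equivalent for positive p and N and avoids very large or very small floating
point values; the function names are ours, and the defaults mirror C = 100,000 (log10 of 5).

```python
import math

def passes_key_join(p, n_penalty, max_score=100_000.0):
    """True if p*N <= C."""
    return p * n_penalty <= max_score

def passes_key_join_log(log10_p, log10_n, max_log_score=5.0):
    """The same test on log10 values: log10(p) + log10(N) <= log10(C)."""
    return log10_p + log10_n <= max_log_score

# Strong shared-mutation evidence, two CDR3 differences' worth of penalty.
print(passes_key_join(1e-9, 6400.0))
# Weak evidence and a large CDR3 penalty.
print(passes_key_join(1e-3, 80.0 ** 5))
print(passes_key_join_log(-9.0, math.log10(6400.0)))
```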

11.  Other join criteria.
• If V gene names are different (after removing trailing *...), and either the V gene reference
sequences are different (after truncation on the right to the same length) or the 5' UTR reference
sequences are different (after truncation on the left to the same length), then the join is
rejected.
• As an exception to the key join criterion, we allow a join which has at least 15 shares, even if
p*N > C.  The constant 15 is modifiable via the argument AUTO_SHARES.
• As a second exception to the key join criterion, we first compute heavy chain join complexity. 
This is done by finding the optimal D gene (allowing no D, or DD), and aligning the junction
region on the contig to the concatenated reference.  (This choice can be visualized using the
option JALIGN1, see enclone help display.)  The heavy chain join complexity hcomp is then a sum
as follows: each inserted base counts one, each substitution counts one, and each deletion
(regardless of length) counts one.  Then we allow a join if it has hcomp - cd ≥ 8, so long as the
number of differences between both chains outside the junction regions is at most 80, even if p*N
> C.
• We do not join two clonotypes which were assigned different reference sequences unless those
reference sequences differ by at most 2 positions.  This value can be controlled using the
command-line argument MAX_DEGRADATION.
• There is an additional restriction imposed when creating two-cell clonotypes: we require that
cd ≤ d, where cd is the number of CDR3 differences and d is the number of shared mutations,
as above.  This filter may be turned off using the command-line argument EASY.
• We do not join in cases where light chain constant regions are different and cd > 0.  This
filter may be turned off using the command-line argument OLD_LIGHT.
• If the percent nucleotide identity on heavy chain FWR1 is at least 20 more than the percent
nucleotide identity on heavy chain CDR1+CDR2 (combined), then the join is rejected.
• We do not join in cases where there is too high a concentration of changes in the junction
region.  More specifically, if the number of mutations in CDR3 is at least 5 times the number of
non-shared mutations outside CDR3 (maxed with 1), the join is rejected.  The number 5 is the
parameter CDR3_MULT.
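
Three of the rejection rules above are simple numeric comparisons and can be sketched as below. 
This is an illustration only: the function and argument names are hypothetical, the "number of
mutations in CDR3" is assumed to be the same quantity as cd, and the remaining rules (V gene
names, MAX_DEGRADATION, light chain constant regions, and the exceptions) are not modeled.

```python
def extra_filter_rejections(cd, d, nonshared_outside_cdr3, fwr1_ident, cdr12_ident,
                            two_cell_clonotype=False, cdr3_mult=5):
    """Return the reasons (possibly none) for rejecting a candidate join."""
    reasons = []
    # Two-cell clonotypes: require cd <= d (the filter that EASY turns off).
    if two_cell_clonotype and cd > d:
        reasons.append("cd > d in a two-cell clonotype")
    # Heavy chain FWR1 percent identity at least 20 above CDR1+CDR2 percent identity.
    if fwr1_ident - cdr12_ident >= 20:
        reasons.append("FWR1 identity exceeds CDR1+CDR2 identity by 20 or more")
    # Changes too concentrated in the junction region (CDR3_MULT, default 5).
    if cd >= cdr3_mult * max(1, nonshared_outside_cdr3):
        reasons.append("CDR3 mutations concentrated in the junction")
    return reasons

ok = extra_filter_rejections(cd=1, d=6, nonshared_outside_cdr3=4,
                             fwr1_ident=97.0, cdr12_ident=92.0)
bad = extra_filter_rejections(cd=5, d=6, nonshared_outside_cdr3=0,
                              fwr1_ident=97.0, cdr12_ident=92.0)
print(ok)
print(bad)
```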

12.  Junk.  Spurious chains are filtered out based on frequency and connections. See "enclone help
special" for a description of the filters.

13.  Alternate algorithm.  An alternate and much simpler clonotyping algorithm can be invoked by
specifying JOIN_BASIC=90.  This causes two exact subclonotypes to be joined if they have the same
V and J gene assignments, the same CDR3 lengths, and CDR3 nucleotide identity of at least 90% on
each chain.  The number 90 can be changed.
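
The JOIN_BASIC rule is simple enough to sketch in a few lines.  The data layout (a list of chain
dictionaries with v_gene, j_gene and cdr3_nt fields) and the assumption that corresponding chains
are compared in order are ours, not taken from the enclone source.

```python
def cdr3_identity_pct(a, b):
    """Percent of positions at which two equal-length CDR3 sequences agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def join_basic(chains1, chains2, min_ident=90.0):
    # Assumption: both exact subclonotypes list their chains in corresponding order.
    if len(chains1) != len(chains2):
        return False
    for c1, c2 in zip(chains1, chains2):
        if (c1["v_gene"], c1["j_gene"]) != (c2["v_gene"], c2["j_gene"]):
            return False
        if len(c1["cdr3_nt"]) != len(c2["cdr3_nt"]):
            return False
        if cdr3_identity_pct(c1["cdr3_nt"], c2["cdr3_nt"]) < min_ident:
            return False
    return True

def chain(cdr3_nt):
    return {"v_gene": "IGHV4-30-4", "j_gene": "IGHJ3", "cdr3_nt": cdr3_nt}

base = "ACGT" * 5                # toy 20-base CDR3
two_off = "TTGT" + base[4:]      # 2 differences: 90% identity
three_off = "TTTT" + base[4:]    # 3 differences: 85% identity

print(join_basic([chain(base)], [chain(two_off)]))
print(join_basic([chain(base)], [chain(three_off)]))
```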

We are actively working to improve the algorithm.  Test results for the current version may be
found at bit.ly/enclone.

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help command
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

information about enclone command-line argument processing

1. Order of processing

• Before processing its command line, enclone first checks for environment
variables of the form ENCLONE_<x>.  These are converted into command-line arguments.  You can set
any command-line argument this way.  The reason why you might want to use this feature is if you
find yourself using the same command-line option over and over, and it is more convenient to set
it once as an environment variable.
• For example, setting the environment variable ENCLONE_PRE to /Users/me/enclone_data is
equivalent to providing the command-line argument PRE=/Users/me/enclone_data.
• After checking environment variables, arguments on the command line are read from left to right;
if an argument name is repeated, only the rightmost value is used, except as noted specifically in
the documentation.
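
The order of processing can be pictured with the following sketch.  It is a simplified model, not
enclone's parser: it assumes that ENCLONE_<x> set to v becomes the argument x=v, it treats every
argument as NAME=VALUE (a bare argument such as PLAIN gets an empty value), and it ignores the
documented exceptions to the rightmost-wins rule.

```python
def effective_args(environ, argv):
    prefix = "ENCLONE_"
    # Environment variables come first, so command-line values can override them.
    args = [f"{name[len(prefix):]}={value}"
            for name, value in environ.items() if name.startswith(prefix)]
    args += argv
    resolved = {}
    for arg in args:  # left to right; a repeated name keeps only its rightmost value
        name, _, value = arg.partition("=")
        resolved[name] = value
    return resolved

environ = {"ENCLONE_PRE": "/Users/me/enclone_data", "HOME": "/Users/me"}
print(effective_args(environ, ["BCR=123085"]))
print(effective_args(environ, ["BCR=123085", "PRE=/tmp/a", "PRE=/tmp/b", "PLAIN"]))
```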

2. Importing arguments

Extra arguments can be imported on the command line using SOURCE=filename.  The file may have
newlines, and more than one SOURCE command may be used.  Any line starting with # is treated as a
comment.
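
A minimal sketch of how such a file might be read is shown below.  It assumes that arguments in
the file are separated by whitespace or newlines, which the text above does not state
explicitly; only the handling of newlines and of lines starting with # comes from the
description.

```python
def read_source_args(text):
    args = []
    for line in text.splitlines():
        if line.startswith("#"):  # comment line
            continue
        args.extend(line.split())  # assumption: whitespace-separated arguments
    return args

example = """# datasets for the demo
BCR=123085 PLAIN

MAX_SCORE=10
"""
print(read_source_args(example))
```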

3. Color

enclone uses ANSI escape codes for color and bolding, frivolously, for emphasis, and more
importantly for amino acids, to represent different codons.  This is done automatically but you
can turn it off....

PLEASE READ THIS:

You can turn off escape codes by adding PLAIN to any command.  Use this if you want to peruse
output using a text editor which does not grok the escape codes.  However some things will not
make sense without color.

4. Paging

• enclone automatically pipes its output to less -R -F -X.
• The effect of this will be that you'll see only the first screen of output.  You can then use
the spacebar to go forward, b to go backward, and q to quit.  The -R option causes escape
characters to be correctly displayed, the -F option causes an automatic exit if output fits on a
single screen, and the -X option prevents output from being sent to the "alternate screen" under
certain platform/version combinations.
• Type man less if you need more information.
• If for whatever reason you need to turn off output paging, add the argument NOPAGER to the
enclone command.

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help glossary
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

glossary of terms used by enclone

┌────────────────────┬─────────────────────────────────────────────────────────────────────────────┐
│V..J                │  the full sequence of a V(D)J transcript, from the beginning of the V       │
│                    │  segment to the end of the J segment; this sequence begins with a start     │
│                    │  codon and ends with a partial codon (its first base)                       │
│CDR3                │  The terms CDR3 and junction are commonly confused and often                │
│                    │  used interchangeably.  In enclone's nomenclature, "CDR3"                   │
│                    │  actually refers to the junction (the CDR3 loop plus the                    │
│                    │  canonical C and W/F at the N and C termini respectively).                  │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│clonotype           │  all the cells descended from a single fully rearranged T or B cell         │
│                    │  (approximated computationally)                                             │
│exact subclonotype  │  all cells having identical transcripts ○                                   │
│                    │  (every clonotype is a union of exact subclonotypes)                        │
│clone               │  a cell in a clonotype, or in an exact subclonotype                         │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│onesie              │  a clonotype or exact subclonotype having exactly one chain                 │
│twosie              │  a clonotype or exact subclonotype having exactly two chains                │
│threesie            │  a clonotype or exact subclonotype having exactly three chains;             │
│                    │  these frequently represent true biological events, arising from expression │
│                    │  of both alleles                                                            │
│foursie             │  a clonotype or exact subclonotype having exactly four chains;              │
│                    │  these very rarely represent true biological events                         │
│moresie             │  a clonotype having more than four chains;                                  │
│                    │  these sad clonotypes do not represent true biological events               │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│donor               │  an individual from whom datasets of an origin are obtained                 │
│origin              │  a tube of cells from a donor, from a particular tissue at a                │
│                    │  particular point in time, and possibly enriched for particular cells       │
│cell group          │  an aliquot from an origin, presumed to be a random draw                    │
│dataset             │  all sequencing data obtained from a particular library type                │
│                    │  (e.g. TCR or BCR or GEX or FB), from one cell group, processed by running  │
│                    │  through the Cell Ranger pipeline                                           │
└────────────────────┴─────────────────────────────────────────────────────────────────────────────┘

○ The exact requirements for being in the same exact subclonotype are that cells:
• have the same number of productive contigs identified
• have identical bases within V..J on those contigs
• are assigned the same constant region reference sequences
• have the same distance between the J stop and the C start
  (noting that this distance is nearly always zero).
Note that we allow mutations within the 5'-UTR and constant regions.

conventions

• When we refer to "V segments", we always include the leader segment.
• Zero or one?  We number exact subclonotypes as 1, 2, ... and likewise with
chains within a clonotype, however DNA and amino-acid positions are numbered starting at zero.

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help example1
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

Shown below is the output of the command:

enclone BCR=123089 CDR3=CARRYFGVVADAFDIW

[1] GROUP = 1 CLONOTYPES = 13 CELLS

[1.1] CLONOTYPE = 13 CELLS
┌───────────┬────────────────────────────────────────┬──────────────────────────────┐
│           │  CHAIN 1                               │  CHAIN 2                     │
│           │  740.1.2|IGHV4-30-4 ◆ 53|IGHJ3         │  253|IGKV1D-39 ◆ 217|IGKJ5   │
│           ├────────────────────────────────────────┼──────────────────────────────┤
│           │       1 1111111111111111               │  1 111111111111              │
│           │  256890 1111122222222223               │  0 011111111112              │
│           │  010148 5678901234567890               │  6 901234567890              │
│           │         ══════CDR3══════               │    ════CDR3════              │
│reference  │  LDPPSA ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦W               │  T CQQ◦◦◦◦◦◦◦◦◦              │
│donor ref  │  VGHPSA ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦W               │  T CQQ◦◦◦◦◦◦◦◦◦              │
├───────────┼────────────────────────────────────────┼──────────────────────────────┤
│#   n      │  ...x.. ..............x.     u  const  │  x ......x.....      u  const│
│1  10      │  VGHPSA CARRYFGVVADAFDIW  3953  IGHM   │  T CQQSYSTPPITF  10648  IGKC │
│2   2      │  VGHSSA CARRYFGVVADAFDIW  2626  IGHM   │  A CQQSYSPPPITF   9766  IGKC │
│3   1      │  VGHPSA CARRYFGVVADAFDIW     5  IGHM   │                              │
└───────────┴────────────────────────────────────────┴──────────────────────────────┘

This shows an invocation of enclone that takes one dataset as input and exhibits
all clonotypes for which some chain has the given CDR3 sequence.

What you see here is a compressed view of the entire information encoded in the
full length transcripts of the 13 cells comprising this clonotype: every base!
There is a lot to explain about the compression, so please read carefully.

• Clonotypes are grouped.  Here we see just one group having one clonotype in it.
• This clonotype has three exact subclonotypes in it, the first of which has 10 cells.
• This clonotype has two chains.  The reference segments for them are shown at the top.
• The notation 740.1.2 says that this V reference sequence is an alternate allele
  derived from the universal reference sequence (contig in the reference file)
  numbered 740, that is from donor 1 ("740.1") and is alternate allele 2 for that donor.
• Sometimes chains are missing from exact subclonotypes.
• Amino acids are assigned different colors depending on which codon they represent.
• Numbered columns show the state of particular amino acids, e.g. the first column is for amino
  acid 20 in chain 1 (where 0 is the start codon).  The numbers read vertically, downward!
• Universal ref: state for the contig in the reference file.
• Donor ref: state for the inferred donor germline sequence.
• ◦s are "holes" in the recombined region where the reference doesn't make sense.
• The "dot and x" line has xs where there's a difference *within* the clonotype.
• Amino acids are shown if they differ from the universal reference or are in the CDR3.
• u = median UMI count for a chain in the exact subclonotype.
• const = const region name for a chain in the exact subclonotype.
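
To make the "numbers read vertically" convention concrete, here is a small sketch that decodes
the three header rows of chain 1 from the table above into amino acid positions.  The helper
name is ours, and the rows are copied from the display with the two leading spaces removed.

```python
def read_vertical_numbers(rows):
    """Read each character column top to bottom and return the numbers found."""
    width = max(len(r) for r in rows)
    padded = [r.ljust(width) for r in rows]
    numbers = []
    for column in zip(*padded):
        digits = "".join(column).strip()
        if digits:  # a blank column separates the two blocks
            numbers.append(int(digits))
    return numbers

rows = [
    "     1 1111111111111111",
    "256890 1111122222222223",
    "010148 5678901234567890",
]
positions = read_vertical_numbers(rows)
print(positions[:6])                 # the columns shown before the CDR3
print(positions[6], positions[-1])   # first and last CDR3 positions
```

The first column decodes to 20, matching the bullet above, and the CDR3 block covers positions
115 through 130.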

The view you see here is configurable: see the documentation at enclone help lvars and enclone
help cvars.

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help example2
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

Shown below is the output of the command:

enclone BCR=123085 GEX=123217 LVARSP=gex,IGHV2-5_g_μ CDR3=CALMGTYCSGDNCYSWFDPW

[1] GROUP = 1 CLONOTYPES = 6 CELLS

[1.1] CLONOTYPE = 6 CELLS
┌──────────────────────────┬───────────────────────────────────────┬─────────────────────────────┐
│                          │  CHAIN 1                              │  CHAIN 2                    │
│                          │  98|IGHV2-5 ◆ 13|IGHD2-15 ◆ 57|IGHJ5  │  349|IGLV3-1 ◆ 311|IGLJ2    │
│                          ├───────────────────────────────────────┼─────────────────────────────┤
│                          │    11111111111111111111               │    11111111111              │
│                          │  8 11111222222222233333               │  6 00000111111              │
│                          │  5 56789012345678901234               │  2 56789012345              │
│                          │    ════════CDR3════════               │    ════CDR3═══              │
│reference                 │  S ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦W               │  V CQAWD◦◦◦◦◦◦              │
│donor ref                 │  S ◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦W               │  V CQAWD◦◦◦◦◦◦              │
├──────────────────────────┼───────────────────────────────────────┼─────────────────────────────┤
│#  n    gex  IGHV2-5_g_μ  │  x ..x.......xx........     u  const  │  x ...........      u  const│
│1  3   9011         1849  │  S CALMGTYCSGDNCYSWFDPW   592  IGHM   │  V CQAWDSSVVVF   2995  IGLC2│
│2  1  29846         6515  │  S CALMGTYCSGDNCYSWFDPW  6218  IGHG1  │  V CQAWDSSVVVF  15182  IGLC2│
│3  1  14995         3326  │  T CALMGTYCSGDNCYSWFDPW  4033  IGHG1  │  V CQAWDSSVVVF   6777  IGLC2│
│4  1   3250            2  │  S CAHMGTYCSGGSCYSWFDPW    18  IGHG1  │  V CQAWDSSVVVF    592  IGLC2│
└──────────────────────────┴───────────────────────────────────────┴─────────────────────────────┘

This shows an invocation of enclone that takes VDJ and gene expression data as input, and
exhibits all clonotypes for which some chain has the given CDR3 sequence.  The command also
requests UMI (molecule) counts for one hand-selected gene.  You can use any gene(s) you like and
any antibodies for which you have feature barcodes.

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help input
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

enclone has two mechanisms for specifying input datasets: either directly on the command line or
via a supplementary metadata file. Only one mechanism may be used at a time.

In both cases, you will need to provide paths to directories where the outputs of the Cell Ranger
pipeline may be found.  enclone uses only some of the pipeline output files, so it is enough that
those files are present in the given directory.  The particular files that are needed are listed
by typing enclone help input_tech.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃If you use the argument PRE=p then p/ will be prepended to all pipeline paths.  A comma-separated ┃
┃list is also allowed PRE=p1,...,pn, in which case these directories are searched from left to     ┃
┃right, until one works, and if all fail, the path is used without prepending anything.  Lastly,   ┃
┃(see enclone help command), you can avoid putting PRE on the command line by setting the          ┃
┃environment variable ENCLONE_PRE to the desired value.  The default value for PRE is              ┃
┃~/enclone/datasets_me,~/enclone/datasets,~/enclone/datasets2.  There is also an argument PREPOST=x┃
┃that causes /x to be appended to all entries in PRE.                                              ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
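
The search order described in the box above can be sketched in Python as follows.  This is only an
illustration: it assumes that a directory "works" when the prepended path exists on disk, and the
function name resolve_pipeline_path is hypothetical and not part of enclone.

```python
import os
import tempfile


def resolve_pipeline_path(path, pre_dirs):
    """Try each PRE directory from left to right; fall back to the bare path."""
    for pre in pre_dirs:
        candidate = os.path.expanduser(pre).rstrip("/") + "/" + path
        # Assumption: "works" means that the prepended path exists.
        if os.path.exists(candidate):
            return candidate
    return path  # every PRE entry failed: use the path without prepending


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical layout: the dataset lives under the second PRE entry only.
        os.makedirs(os.path.join(root, "second", "dataset345"))
        pre = [os.path.join(root, "first"), os.path.join(root, "second")]
        print(resolve_pipeline_path("dataset345", pre))  # found under .../second
        print(resolve_pipeline_path("elsewhere", pre))   # falls back to "elsewhere"
```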

Both input forms involve abbreviated names (discussed below), which should be as short as
possible, as longer abbreviations will increase the width of the clonotype displays.

█ 1 █ To point directly at input files on the command line, use e.g.
TCR=/home/jdoe/runs/dataset345
or likewise for BCR.  A more complicated syntax is allowed in which commas, colons and semicolons
act as delimiters.  Commas go between datasets from the same origin, colons between datasets from
the same donor, and semicolons separate donors.  If semicolons are used, the value must be quoted.

enclone uses the distinction between datasets, origins and donors in the following ways:
1. If two datasets come from the same origin, then enclone can filter to remove certain artifacts,
unless you specify the option NCROSS.
See also the illusory clonotype expansion page at bit.ly/enclone.
2. If two cells came from different donors, then enclone will not put them in the same clonotype,
unless you specify the option MIX_DONORS.
More information may be found at `enclone help special`.  In addition, this is enclone's way of
keeping datasets organized and affects the output of fields like origin, etc.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃Naming.  Using this input system, each dataset is assigned an abbreviated name, which is         ┃
┃everything after the final slash in the directory name (e.g. dataset345 in the above example), or┃
┃the entire name if there is no slash; origins and donors are assigned identifiers s1,... and     ┃
┃d1,..., respectively; numbering of origins restarts with each new donor.  To specify origins     ┃
┃and donors, use the second input form, and see in particular abbr:path.                          ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Examples:
TCR=p1,p2   -- input data from two libraries from the same origin
TCR=p1,p2:q -- input data as above plus another from a different origin from the same donor
TCR="a;b"   -- input one library from each of two donors.
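
To make the delimiter and naming rules concrete, here is a small Python sketch that splits a TCR or
BCR value into donors, origins and datasets.  It is not enclone's own parser; the function name
parse_dataset_spec is hypothetical, and the sketch only mirrors the rules stated above (semicolons
between donors, colons between origins, commas between datasets, abbreviated name = text after the
final slash, origin numbering restarting with each donor).

```python
def parse_dataset_spec(spec):
    """Return [(donor_id, [(origin_id, [(abbreviated_name, path), ...]), ...]), ...]."""
    donors = []
    for d, donor_part in enumerate(spec.split(";"), start=1):
        origins = []
        # Origin numbering restarts at s1 for each new donor.
        for s, origin_part in enumerate(donor_part.split(":"), start=1):
            datasets = []
            for path in origin_part.split(","):
                # Abbreviated name: everything after the final slash,
                # or the entire name if there is no slash.
                datasets.append((path.rsplit("/", 1)[-1], path))
            origins.append((f"s{s}", datasets))
        donors.append((f"d{d}", origins))
    return donors


if __name__ == "__main__":
    for spec in ["p1,p2", "p1,p2:q", "a;b", "/home/jdoe/runs/dataset345"]:
        print(spec, "->", parse_dataset_spec(spec))
```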

Matching gene expression and/or feature barcode data may also be supplied using an argument GEX=...,
whose right side must have the exact same structure as the TCR or BCR argument.  Specification of
both TCR and BCR is not allowed.  If both BCR and GEX data are in the same directory (from a multi
run), a single argument BCR_GEX=... may be used, and similarly one may use TCR_GEX.

In addition, barcode-level data may be specified using BC=..., whose right side is a list of paths
having the same structure as the TCR or BCR argument.  Each such path must be for a CSV or TSV
file, which must include the field barcode, may include special fields origin, donor, tag and color,
and may also include arbitrary other fields.  The origin and donor fields allow a particular
origin and donor to be associated to a given barcode.  A use case for this is genetic
demultiplexing.  The tag field is intended to be used with tag demultiplexing.  The color field is
used by the PLOT option.  All other fields are treated as lead variables, but values are only
displayed in PER_CELL mode, or for parseable output using PCELL.  These fields should not include
existing lead variable names.  Use of BC automatically turns on the MIX_DONORS option.
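
As an illustration of the file layout just described, the following Python sketch reads a BC-style
CSV and sorts its header into the special fields and the remaining custom fields.  The barcode and
the field my_score are made-up sample values, and classify_bc_fields is a hypothetical helper, not
something enclone provides.

```python
import csv
import io

# Special fields named in the text above; "barcode" is mandatory.
SPECIAL_FIELDS = {"origin", "donor", "tag", "color"}


def classify_bc_fields(csv_text, delimiter=","):
    """Return (special fields present, custom fields, rows) for a BC-style file."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    fields = reader.fieldnames or []
    if "barcode" not in fields:
        raise ValueError("a BC file must include the field barcode")
    special = [f for f in fields if f in SPECIAL_FIELDS]
    custom = [f for f in fields if f != "barcode" and f not in SPECIAL_FIELDS]
    return special, custom, list(reader)


if __name__ == "__main__":
    # Hypothetical file content, e.g. the result of genetic demultiplexing.
    sample = "barcode,origin,donor,my_score\nAAACCTGAGAAACCAT-1,s1,d1,0.7\n"
    print(classify_bc_fields(sample))
```

For a TSV file, pass delimiter="\t".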

Alternatively, an argument BC_JOINT=filename may be specified, where the filename is a CSV or TSV
file like those for BC=..., but with an additional field dataset, whose value is an abbreviated
dataset name, and which enables the information to be split up to mirror the specification of TCR
or BCR.

The argument BC=... or equivalently BC_JOINT=filename may be used in conjunction with
KEEP_CELL_IF=... (see enclone help special) to restrict the barcodes used by enclone to a
specified set.

█ 2 █ To specify a metadata file, use the command line argument
META=filename
This file should be a CSV (comma-separated values) file, with one line per cell group.  After the
first line, blank lines and lines starting with # are ignored.  There must be a field tcr or bcr,
and some other fields are allowed:
┌────────┬───────────────┬──────────────────────────────────────────────────────────────┐
│field   │  default      │  meaning                                                     │
├────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│tcr     │  (required!)  │  path to dataset, or abbr:path, where abbr is an abbreviated │
│or bcr  │               │  name for the dataset; exactly one of tcr or bcr must be used│
├────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│gex     │  null         │  path to GEX dataset, which may include or consist entirely  │
│        │               │  of FB data                                                  │
├────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│origin  │  s1           │  abbreviated name of origin                                  │
├────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│donor   │  d1           │  abbreviated name of donor                                   │
├────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│color   │  null         │  color to associate to this dataset (for PLOT option)        │
├────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│bc      │  null         │  name of CSV file as in the BC option                        │
└────────┴───────────────┴──────────────────────────────────────────────────────────────┘

Multiple META arguments are cumulative and we also allow META to be a comma-separated list of
filenames.  In both cases the META files must have identical header lines.  In addition, metadata
may be fully specified on the command line via METAX="l1;...;ln" where the li are the lines that
you would otherwise put in the META file.
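
A short Python sketch of how the lines of a META file, and the equivalent METAX value, might be
generated.  The dataset paths and abbreviations (a:vdj_a, gex_a, and so on) are placeholders, and
the helper names meta_lines and metax_value are hypothetical.

```python
import csv
import io


def meta_lines(rows, fields):
    """Render the header line plus one CSV line per dataset."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().splitlines()


def metax_value(lines):
    """Join the META lines with semicolons, as in METAX="l1;...;ln"."""
    return ";".join(lines)


if __name__ == "__main__":
    fields = ["bcr", "gex", "origin", "donor"]
    rows = [
        # Placeholder abbr:path entries; substitute real pipeline paths.
        {"bcr": "a:vdj_a", "gex": "gex_a", "origin": "s1", "donor": "d1"},
        {"bcr": "b:vdj_b", "gex": "gex_b", "origin": "s2", "donor": "d1"},
    ]
    lines = meta_lines(rows, fields)
    print("\n".join(lines))               # content for META=filename
    print('METAX="' + metax_value(lines) + '"')
```

Because the METAX value contains semicolons, it needs to be quoted on the command line, as shown.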

█ 3 █ enclone can also read an ancillary CSV file that specifies arbitrary fields that are
associated to particular immune receptor sequences.  This is done using INFO=path.csv.  The CSV
file must have fields vj_seq1, specifying the full heavy or TRB sequence from the beginning of the
V segment to the end of the J segment, and vj_seq2, for the light or TRA chain.  The other fields
are then made accessible as lvars (see enclone help lvars), which are populated for any exact
subclonotype having exactly two chains (heavy/light or TRB/TRA) that match the data in the CSV
file.  By default, one cannot have two lines for the same antibody, however a separate argument
INFO_RESOLVE may be used to "pick the first one".▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help cvars
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

per-chain column options: These options define per-chain variables, which correspond to columns
that appear once for each chain in each clonotype, and have one entry for each exact subclonotype.
Please note that for medians of integers, we actually report the "rounded median", the result of
rounding the true median up to the nearest integer, so that e.g. 6.5 is rounded up to 7.
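
The "rounded median" can be written down in a few lines of Python.  This is a sketch of the rule as
stated above, not enclone's source code; the name rounded_median is hypothetical.

```python
import math
import statistics


def rounded_median(values):
    """True median of a list of integers, rounded up to the nearest integer."""
    return math.ceil(statistics.median(values))


if __name__ == "__main__":
    print(rounded_median([6, 7]))     # true median is 6.5
    print(rounded_median([3, 9, 5]))  # true median is already an integer
```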

See also enclone help lvars and the inventory of all variables at
            https://10xgenomics.github.io/enclone/pages/auto/inventory.html.

Per-chain variables are specified using
CVARS=x1,...,xn
where each xi is one of:

┌─────────────────┬──────────────────────────────────────────────────────────────────────────────┐
│var              │  bases at positions in chain that vary across the clonotype                  │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│u                │  ● VDJ UMI count for each exact subclonotype, median across cells            │
│r                │  ● VDJ read count for each exact subclonotype, median across cells           │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│edit             │  a string that defines the edit of the reference V(D)J concatenation versus  │
│                 │  the contig, from the beginning of the CDR3 to the end of the J segment;     │
│                 │  this uses a coordinate system in which 0 is the first base of the J ref     │
│                 │  segment (or the first base of the D ref segment for IGH and TRB); for       │
│                 │  example D-4:4 denotes the deletion of the last 4 bases of the V segment,    │
│                 │  I0:2 denotes an insertion of 2 bases after the V                            │
│                 │  and I0:2•S5 denotes that plus a substitution at position 5; in computing    │
│                 │  "edit", for IGH and TRB, we always test every possible D segment,           │
│                 │  regardless of whether one is annotated, and pick the best one; for this     │
│                 │  reason, "edit" may be slow                                                  │
│comp             │  a measure of CDR3 complexity, which is the total number of S, D and I       │
│                 │  symbols in "edit" as defined above                                          │
│cigar            │  the CIGAR string that defines the edit of the V..J contig sequence versus   │
│                 │  the universal reference V(D)J concatenation                                 │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│cdr*_aa          │  the CDR*_AA sequence, or "unknown" if not computed                          │
│cdr*_aa_L_R_ext  │  the CDR*_AA sequence, with L amino acids added on the left and R amino acids│
│                 │  added on the right; either may be negative, denoting trimming instead       │
│                 │  of extension                                                                │
│cdr*_aa_north    │  the CDR*_AA sequence for BCR defined by North B et al. (2011), A new        │
│                 │  clustering of antibody CDR loop conformations, J Mol Biol 406, 228-256.     │
│                 │  cdr1_aa_north = cdr1_aa_3_3_ext for heavy chains                            │
│                 │  cdr1_aa_north = cdr1_aa for light chains                                    │
│                 │  cdr2_aa_north = cdr2_aa_2_3_ext for heavy chains                            │
│                 │  cdr2_aa_north = cdr2_aa_1_0_ext for light chains                            │
│                 │  cdr3_aa_north = cdr3_aa_-1_-1_ext                                           │
│cdr*_aa_ref      │  cdr*_aa, for the universal reference sequence (but not for cdr3)            │
│cdr*_len         │  number of amino acids in the CDR* sequence, or "unknown" if not computed    │
│cdr*_dna         │  the CDR*_DNA sequence, or "unknown" if not computed                         │
│cdr*_dna_ref     │  same, for the universal reference sequence (but not for cdr3)               │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│cdr3_aa_conx     │  consensus for CDR3 across the clonotype, showing X for each variant residue │
│cdr3_aa_conp     │  consensus for CDR3 across the clonotype, showing a property symbol whenever │
│                 │  two different amino acids are observed, per the following table:            │
│                 │  --------------------------------------------------------------------        │
│                 │  asparagine or aspartic acid   B   DN                                        │
│                 │  glutamine or glutamic acid    Z   EQ                                        │
│                 │  leucine or isoleucine         J   IL                                        │
│                 │  negatively charged            -   DE                                        │
│                 │  positively charged            +   KRH                                       │
│                 │  aliphatic (non-aromatic)      Ψ   VILM                                      │
│                 │  small                         π   PGAS                                      │
│                 │  aromatic                      Ω   FWYH                                      │
│                 │  hydrophobic                   Φ   VILFWYM                                   │
│                 │  hydrophilic                   ζ   STHNQEDKR                                 │
│                 │  any                           X   ADEFGHIKLMNPQRSTVWY                       │
│                 │  --------------------------------------------------------------------        │
│                 │  The table is searched top to bottom until a matching class is found.        │
│                 │  In the special case where every amino acid is shown as a gap (-),           │
│                 │  a "g" is printed.                                                           │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│fwr*_aa          │  the FWR*_AA sequence, or "unknown" if not computed                          │
│fwr*_aa_ref      │  same, for the universal reference sequence                                  │
│fwr*_len         │  number of amino acids in the FWR* sequence, or "unknown" if not computed    │
│fwr*_dna         │  the FWR*_DNA sequence, or "unknown" if not computed                         │
│fwr*_dna_ref     │  same, for the universal reference sequence                                  │
│                 │  For all of these, * is 1 or 2 or 3 (or 4, for the fwr variables).           │
│                 │  For CDR1 and CDR2, please see enclone help amino and the page on            │
│                 │  bit.ly/enclone on V(D)J features.                                           │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│v_start          │  start of V segment on full DNA sequence                                     │
│d_start          │  start of D segment on full DNA sequence (or null)                           │
│cdr3_start       │  base position start of CDR3 sequence on full contig                         │
│d_frame          │  reading frame of D segment, either 0 or 1 or 2 (or null)                    │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│aa%              │  amino acid percent identity with donor reference, outside junction region   │
│dna%             │  nucleotide percent identity with donor reference, outside junction region   │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│v_name_orig      │  name of V region originally assigned (per cell);                            │
│                 │  values below are clonotype consensuses                                      │
│utr_name         │  name of 5'-UTR region                                                       │
│v_name           │  name of V region                                                            │
│d_name           │  name of D region (or null)                                                  │
│j_name           │  name of J region                                                            │
│const            │  name of constant region                                                     │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│utr_id           │  id of 5'-UTR region                                                         │
│v_id             │  id of V region                                                              │
│d_id             │  id of D region (or null)                                                    │
│j_id             │  id of J region                                                              │
│const_id         │  id of constant region (or null, if not known)                               │
│                 │  (these are the numbers after ">" in the VDJ reference file)                 │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│allele           │  numerical identifier of the computed donor reference allele                 │
│                 │  for this exact subclonotype                                                 │
│allele_d         │  variant bases in the allele for this exact subclonotype,                    │
│                 │  and a list of all the possibilities for this                                │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│d1_name          │  name of optimal D gene, or none                                             │
│d2_name          │  name of second best D gene, or none                                         │
│d1_score         │  score for optimal D gene                                                    │
│d2_score         │  score for second best D gene                                                │
│d_delta          │  score difference between first and second best D gene                       │
│d_Δ              │  same                                                                        │
│                 │  These are recomputed from scratch and ignore the given assignment.          │
│                 │  Note that in many cases D gene assignments are essentially random, as       │
│                 │  it is often not possible to know the true D gene assignment.                │
│                 │  If the value is "null" it means that having no D gene at all scores better. │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│vjlen            │  number of bases from the start of the V region to the end of the J region   │
│                 │  Please note that D gene assignments are frequently "random" -- it is not    │
│                 │  possible to know the actual D gene that was assigned.                       │
│clen             │  length of observed constant region (usually truncated at primer start)      │
│ulen             │  length of observed 5'-UTR sequence;                                         │
│                 │  note however that what we report is just the start of the V segment         │
│                 │  on the contig, and thus the length may include junk before the UTR          │
│cdiff            │  differences with universal reference constant region, shown in the          │
│                 │  abbreviated form e.g. 22T (ref changed to T at base 22) or 22T+10           │
│                 │  (same but contig has 10 additional bases beyond end of ref C region)        │
│                 │  At most five differences are shown, and if there are more, ... is appended. │
│udiff            │  like cdiff, but for the 5'-UTR                                              │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│q<n>_            │  comma-separated list of the quality                                         │
│                 │  scores at zero-based position n, numbered starting at the                   │
│                 │  beginning of the V segment, for each cell in the exact subclonotype         │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│notes            │  optional note if there is an insertion or the end of J does not exactly abut│
│                 │  the beginning of C; elided if empty; also single base overlaps between      │
│                 │  J and C are not shown unless you use the special option JC1; we do this     │
│                 │  because with some VDJ references, one nearly always has such an overlap     │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│ndiff<n>vj       │  number of base differences within V..J between this exact subclonotype and  │
│                 │  exact subclonotype n                                                        │
│d_univ           │  distance from universal reference, more specifically,                       │
│                 │  number of base differences within V..J between this exact                   │
│                 │  subclonotype and universal reference, exclusive of indels, the last 15      │
│                 │  bases of the V and the first 15 bases of the J                              │
│d_donor          │  distance from donor reference,                                              │
│                 │  as above but computed using donor reference                                 │
└─────────────────┴──────────────────────────────────────────────────────────────────────────────┘

  ● These variables have some alternate versions, as shown in the table below.
  
  ┌──────────┬───────────────────────────────┬──────────┬──────────────┬─────────────┬────────────┐
  │variable  │  semantics                    │  visual  │  visual      │  parseable  │  parseable │
  │          │                               │          │  (one cell)  │             │  (one cell)│
  ├──────────┼───────────────────────────────┼──────────┼──────────────┼─────────────┼────────────┤
  │x         │  median over cells            │  yes     │  this cell   │  yes        │  yes       │
  │x_mean    │  mean over cells              │  yes     │  null        │  yes        │  yes       │
  │x_μ       │  (same as above)              │  yes     │  null        │  yes        │  yes       │
  │x_sum     │  sum over cells               │  yes     │  null        │  yes        │  yes       │
  │x_Σ       │  (same as above)              │  yes     │  null        │  yes        │  yes       │
  │x_min     │  min over cells               │  yes     │  null        │  yes        │  yes       │
  │x_max     │  max over cells               │  yes     │  null        │  yes        │  yes       │
  │x_%       │  % of total GEX (genes only)  │  yes     │  this cell   │  yes        │  yes       │
  │x_cell    │  this cell                    │  no      │  no          │  no         │  this cell │
  └──────────┴───────────────────────────────┴──────────┴──────────────┴─────────────┴────────────┘
  Some explanation is required.  If you use enclone without certain options, you get the "visual"
  column.
  • Add the option PER_CELL (see enclone help display) and then you get visual output with extra
  lines for each cell within an exact subclonotype, and each of those extra lines is described by
  the "visual (one cell)" column.
  • If you generate parseable output (see enclone help parseable), then you get the "parseable"
  column for that output, unless you specify PCELL, and then you get the last column.
  • For the forms with μ and Σ, the Greek letters are only used in column headings for visual output
  (to save space), and optionally, in names of fields on the command line.
  ▶ If you try out these features, you'll see exactly what happens! ◀

At least one variable must be listed.  The default is u,const,notes.  CVARSP: same as CVARS but
appends.
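
The cdr3_aa_conx and cdr3_aa_conp rules from the table above can be illustrated with a Python
sketch.  It is not enclone's implementation.  It assumes that the CDR3 sequences all have the same
length, that a column consisting only of gaps yields "g", and that any set of residues not covered
by a class falls back to X; the function names conx and conp are hypothetical.  The two input
sequences are the CDR3s from the example clonotype shown earlier in this help text.

```python
# Property classes, in the order in which the table is searched.
CLASSES = [
    ("B", "DN"),
    ("Z", "EQ"),
    ("J", "IL"),
    ("-", "DE"),
    ("+", "KRH"),
    ("Ψ", "VILM"),
    ("π", "PGAS"),
    ("Ω", "FWYH"),
    ("Φ", "VILFWYM"),
    ("ζ", "STHNQEDKR"),
    ("X", "ADEFGHIKLMNPQRSTVWY"),
]


def _column_symbol(observed, use_properties):
    if len(observed) == 1:
        aa = next(iter(observed))
        return "g" if aa == "-" else aa  # assumption: an all-gap column prints "g"
    if use_properties:
        for symbol, members in CLASSES:  # first matching class wins
            if observed <= set(members):
                return symbol
    return "X"  # conx, or fallback when no class matches (assumption)


def conx(cdr3s):
    """Consensus with X at every variant position."""
    return "".join(_column_symbol(set(col), False) for col in zip(*cdr3s))


def conp(cdr3s):
    """Consensus with a property symbol at every variant position."""
    return "".join(_column_symbol(set(col), True) for col in zip(*cdr3s))


if __name__ == "__main__":
    seqs = ["CALMGTYCSGDNCYSWFDPW", "CAHMGTYCSGGSCYSWFDPW"]
    print(conx(seqs))
    print(conp(seqs))
```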

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help input_tech
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

information about providing input to enclone (technical notes)

enclone only uses certain files, which are all in the outs subdirectory of a Cell Ranger pipeline
directory:

┌────────────────────────────────────────────────────────────────────────────┬──────────┐
│file                                                                        │  pipeline│
├────────────────────────────────────────────────────────────────────────────┼──────────┤
│all_contig_annotations.json                                                 │  VDJ     │
├────────────────────────────────────────────────────────────────────────────┼──────────┤
│vdj_reference/fasta/regions.fa                                              │  VDJ     │
├────────────────────────────────────────────────────────────────────────────┼──────────┤
│metrics_summary.csv                                                         │  GEX     │
├────────────────────────────────────────────────────────────────────────────┼──────────┤
│raw_feature_bc_matrix.h5                                                    │  GEX     │
├────────────────────────────────────────────────────────────────────────────┼──────────┤
│analysis/clustering/graphclust/clusters.csv                                 │  GEX     │
├────────────────────────────────────────────────────────────────────────────┼──────────┤
│analysis/pca/10_components/projection.csv                                   │  GEX     │
├────────────────────────────────────────────────────────────────────────────┼──────────┤
│per_feature_metrics.csv (optional, by default not generated by cellranger)  │  GEX     │
└────────────────────────────────────────────────────────────────────────────┴──────────┘

The first file is required, and the second should be supplied if Cell Ranger version 4.0 or
greater was used.  The others are required, in the indicated structure, if GEX or META/gex
arguments are provided.  The exact files that are used could be changed in the future.
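
Before running enclone, one might check that an outs directory contains the files listed in the
table above.  The following Python sketch does this; missing_files is a hypothetical helper, the
file lists are copied from the table, and since the exact files that are used could change, the
lists should be treated as a snapshot.

```python
import tempfile
from pathlib import Path

VDJ_REQUIRED = ["all_contig_annotations.json"]
# Should be supplied if Cell Ranger version 4.0 or greater was used.
VDJ_REFERENCE = "vdj_reference/fasta/regions.fa"
GEX_REQUIRED = [
    "metrics_summary.csv",
    "raw_feature_bc_matrix.h5",
    "analysis/clustering/graphclust/clusters.csv",
    "analysis/pca/10_components/projection.csv",
]  # per_feature_metrics.csv is optional and therefore not checked


def missing_files(outs_dir, gex=False, cellranger_4_or_later=False):
    """List the expected files that are absent from an outs directory."""
    outs = Path(outs_dir)
    if gex:
        needed = list(GEX_REQUIRED)
    else:
        needed = list(VDJ_REQUIRED)
        if cellranger_4_or_later:
            needed.append(VDJ_REFERENCE)
    return [name for name in needed if not (outs / name).is_file()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as outs:
        # Hypothetical VDJ outs directory holding only the contig annotations.
        (Path(outs) / "all_contig_annotations.json").write_text("[]")
        print(missing_files(outs))
        print(missing_files(outs, cellranger_4_or_later=True))
        print(missing_files(outs, gex=True))
```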

Note that the VDJ outs directories must be from Cell Ranger version ≥ 3.1.  There is a workaround
for earlier versions (which you will be informed of if you try), but it is much slower and the
results may not be as good.

Note also that running "cellranger count" using only feature barcodes (antibodies), with fewer
than ten features, will not yield all the needed files.  You can work around this by adding "fake
antibodies" to the feature list, so as to pad out the total number to ten.

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help parseable
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

parseable output

The standard output of enclone is designed to be read by humans, but is not readily parseable by
computers.  We supplement this with parseable output that can be easily read by computers.

The default behavior for this is to generate a CSV file having "every possible" field (over a
hundred).  We also provide an option to print only selected fields, and some options which enable
inspection, short of generating a separate CSV file.

Parseable output is targeted primarily at R and Python users, because of the ease of wrangling CSV
files with these languages.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃Parseable output is invoked by using the argument                                                ┃
┃POUT=filename                                                                                    ┃
┃specifying the name of the file that is to be written to.                                        ┃
┃  The filename "stdout" may be used for a preview; in that case parseable output is generated    ┃
┃  separately for each clonotype and the two output types are integrated.  There is also          ┃
┃  "stdouth", which is similar, but uses spaces instead of commas, and lines things up in columns.┃
┃By default, we show four chains for each clonotype, regardless of how many chains it             ┃
┃has, filling in with null entries.  One may instead specify n chains using the argument          ┃
┃PCHAINS=n                                                                                        ┃
┃and if you use max in place of n, then the maximum value for your dataset will be used.          ┃
┃The parseable output fields may be specified using                                               ┃
┃PCOLS=x1,...,xn                                                                                  ┃
┃where each xi is one of the field names shown below.                                             ┃
┃The argument PNO_HEADER may be used to suppress the CSV header line.                             ┃
┃If you use POUT, the PCOLS option reduces run time and memory usage, and prevents voluminous     ┃
┃output.  Please use it!                                                                          ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Over time additional fields may be added and the order of fields may change.
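
Because the field order may change, it is safer to read the CSV by field name than by column
position.  A minimal Python sketch follows; the CSV content is made-up sample data, it assumes that
a comma-separated barcodes value arrives as a single quoted CSV cell, and cells_per_clonotype is a
hypothetical helper.

```python
import csv
import io


def cells_per_clonotype(csv_text):
    """Count barcodes per (group_id, clonotype_id), looking fields up by name."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["group_id"], row["clonotype_id"])
        n = len(row["barcodes"].split(","))  # one barcode per cell
        counts[key] = counts.get(key, 0) + n
    return counts


if __name__ == "__main__":
    # Made-up sample in the shape of a file written with
    # PCOLS=clonotype_id,group_id,exact_subclonotype_id,barcodes
    sample = (
        "clonotype_id,group_id,exact_subclonotype_id,barcodes\n"
        '1,1,1,"AAAC-1,AAAG-1"\n'
        "1,1,2,AACT-1\n"
        "2,2,1,ACGT-1\n"
    )
    print(cells_per_clonotype(sample))
```

For a real file, open it with open(filename, newline="") and pass the file object to
csv.DictReader in place of the io.StringIO wrapper.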

There is an alternate parseable output mode in which one line is emitted for each cell, rather
than each exact subclonotype.  This mode is enabled by adding the argument PCELL to the command
line.  Each exact subclonotype then yields a sequence of output lines that are identical except as
noted below.

If you want to completely suppress the generation of visual clonotypes, add NOPRINT to the enclone
command line.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃FASTA output.  This is a separate feature.  To generate nucleotide FASTA output for each chain in ┃
┃each exact subclonotype, use the argument FASTA=filename.  The special case stdout will cause the ┃
┃FASTA records to be shown as part of standard output.  The FASTA records that are generated are of┃
┃the form V(D)JC, where V is the full V segment (including the leader) and C is the full constant  ┃
┃region, copied verbatim from the reference.  If a particular chain in a particular exact          ┃
┃subclonotype is not assigned a constant region, then we use the constant region that was assigned ┃
┃to the clonotype.  If no constant region at all was assigned, then the FASTA record is omitted.   ┃
┃Similarly, FASTA_AA=filename may be used to generate a matching amino acid FASTA file.            ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

───────────────────────
parseable output fields
───────────────────────

See also enclone help lvars, enclone help cvars, and the inventory of all variables at
https://10xgenomics.github.io/enclone/pages/auto/inventory.html.

1. per clonotype group fields

┌──────────────┬──────────────────────────────────────────┐
│group_id      │  identifier of clonotype group - 0,1, ...│
├──────────────┼──────────────────────────────────────────┤
│group_ncells  │  total number of cells in the group      │
│              │  (cannot be used in linear conditions)   │
└──────────────┴──────────────────────────────────────────┘

2. per clonotype fields

┌──────────────┬────────────────────────────────────────────────────────────────┐
│clonotype_id  │  identifier of clonotype within the clonotype group = 0, 1, ...│
└──────────────┴────────────────────────────────────────────────────────────────┘

3. per chain fields, where <i> is 1,2,... (see above)
each of these has the same value for each exact subclonotype

┌──────────────────────┬──────────────────────────────────────────────────────────────┐
│var_indices_dna<i>    │  DNA positions in chain that vary across the clonotype       │
│var_indices_aa<i>     │  amino acid positions in chain that vary across the clonotype│
│share_indices_dna<i>  │  DNA positions in chain that are constant across the         │
│                      │  clonotype, but differ from the donor ref                    │
│share_indices_aa<i>   │  amino acid positions in chain that are constant across the  │
│                      │  clonotype, but differ from the donor ref                    │
│                      │  all of these are comma-separated lists                      │
└──────────────────────┴──────────────────────────────────────────────────────────────┘

4. per exact subclonotype fields

┌───────────────────────┬─────────────────────────────────────────────────────────────────────────┐
│exact_subclonotype_id  │  identifier of exact subclonotype = 1, 2, ...                           │
├───────────────────────┼─────────────────────────────────────────────────────────────────────────┤
│barcodes               │  comma-separated list of barcodes for the exact subclonotype            │
│<dataset>_barcodes     │  like "barcodes", but restricted to the dataset with the given name     │
│barcode                │  if PCELL is specified, barcode for one cell                            │
│<dataset>_barcode      │  if PCELL is specified, barcode for one cell, or null, if the barcode is│
│                       │  not from the given dataset                                             │
├───────────────────────┴─────────────────────────────────────────────────────────────────────────┤
│In addition, every lead variable may be specified as a field.  See enclone help lvars.           │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘

5. per chain, per exact subclonotype fields, where <i> is 1,2,... (see above)

[all apply to chain i of a particular exact subclonotype]

┌──────────────┬────────────────────────────────────────────────────────────────────────────┐
│vj_seq<i>     │  DNA sequence of V..J                                                      │
│vj_seq_nl<i>  │  DNA sequence of V..J, but starting after the leader                       │
│vj_aa<i>      │  amino acid sequence of V..J (excludes last base, in incomplete codon)     │
│vj_aa_nl<i>   │  amino acid sequence of V..J (excludes last base, in incomplete codon),    │
│              │  but starting after the leader                                             │
│seq<i>        │  full DNA sequence                                                         │
├──────────────┼────────────────────────────────────────────────────────────────────────────┤
│var_aa<i>     │  amino acids that vary across the clonotype (synonymous changes included)  │
├──────────────┴────────────────────────────────────────────────────────────────────────────┤
│In addition, every chain variable, after suffixing by <i>, may be used as a field.  However│
│parametrizable chain variables e.g. ndiff1vj1 must be explicitly listed using PCOLS;       │
│they are not in the default list.  See enclone help cvars.                                 │
└───────────────────────────────────────────────────────────────────────────────────────────┘

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help filter
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

clonotype filtering options

enclone provides filtering by cell, by exact subclonotype, and by clonotype.  This page describes
filtering by clonotype.  These options cause only certain clonotypes to be printed.  See also
enclone help special, which describes other filtering options.  This page also describes
scanning for feature enrichment.

┌─────────────────────┬───────────────────────────────────────────────────────────────────────────┐
│MIN_CELLS=n          │  only show clonotypes having at least n cells                             │
│MAX_CELLS=n          │  only show clonotypes having at most n cells                              │
│CELLS=n              │  only show clonotypes having exactly n cells                              │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MIN_UMIS=n           │  only show clonotypes having ≳ n UMIs on some chain on some cell          │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MIN_CHAINS=n         │  only show clonotypes having at least n chains                            │
│MAX_CHAINS=n         │  only show clonotypes having at most n chains                             │
│CHAINS=n             │  only show clonotypes having exactly n chains                             │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│CDR3=<pattern>       │  only show clonotypes having a CDR3 amino acid seq that matches           │
│                     │  the given pattern*, from beginning to end                                │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│SEG="s_1|...|s_n"    │  only show clonotypes using one of the given reference segment names      │
│SEGN="s_1|...|s_n"   │  only show clonotypes using one of the given reference segment numbers    │
│                     │  both: looks for V, D, J and C segments; double quote only                │
│                     │  needed if n > 1                                                          │
│                     │  For both SEG and SEGN, multiple instances are allowed, and their         │
│                     │  effects are cumulative.                                                  │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│NSEG="s_1|...|s_n"   │  do not show clonotypes using one of the given reference segment names    │
│NSEGN="s_1|...|s_n"  │  do not show clonotypes using one of the given reference segment numbers  │
│                     │  Otherwise similar to SEG and SEGN.                                       │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MAX_EXACTS=n         │  only show clonotypes having at most n exact subclonotypes                │
│MIN_EXACTS=n         │  only show clonotypes having at least n exact subclonotypes               │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│VJ=seq               │  only show clonotypes using exactly the given V..J sequence               │
│                     │  (string in alphabet ACGT)                                                │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MIN_DATASETS=n       │  only show clonotypes containing cells from at least n datasets           │
│MAX_DATASETS=n       │  only show clonotypes containing cells from at most n datasets            │
│DATASET="d1|...|dn"  │  only show clonotypes having at least one of the listed datasets          │
│MIN_DATASET_RATIO=n  │  only show clonotypes having at least n cells and for which the ratio     │
│                     │  of the number of cells in the most abundant dataset to the next most     │
│                     │  abundant one is at least n                                               │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MIN_ORIGINS=n        │  only show clonotypes containing cells from at least n origins            │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MIN_DONORS=n         │  only show clonotypes containing cells from at least n donors             │
│                     │  If n ≥ 2, this automatically turns on MIX_DONORS, as otherwise cells from│
│                     │  two or more donors would not be combined into the same clonotype.        │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│CDIFF                │  only show clonotypes having a difference in constant region with the     │
│                     │  universal reference                                                      │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│DEL                  │  only show clonotypes exhibiting a deletion                               │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│BARCODE=bc1,...,bcn  │  only show clonotypes that use one of the given barcodes; note that such  │
│                     │  clonotypes will typically contain cells that are not in your             │
│                     │  list; if you want to fully restrict to a list of barcodes you can use    │
│                     │  the KEEP_CELL_IF option, please see enclone help special                 │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│INKT                 │  only show clonotypes for which some exact subclonotype is annotated as   │
│                     │  having some iNKT evidence, see bit.ly/enclone for details                │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│MAIT                 │  only show clonotypes for which some exact subclonotype is annotated as   │
│                     │  having some MAIT evidence, see bit.ly/enclone for details                │
├─────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│D_INCONSISTENT       │  only show clonotypes having an inconsistent assignment of D genes        │
│D_NONE               │  only show clonotypes having a null D gene assignment                     │
│D_SECOND             │  only show VDDJ clonotypes                                                │
└─────────────────────┴───────────────────────────────────────────────────────────────────────────┘

* Examples of how to specify CDR3:

Two pattern types are allowed: either regular expressions, or "Levenshtein distance patterns", as
exhibited by examples below.

┌────────────────────────────────────────┬────────────────────────────────────────────────────┐
│CDR3=CARPKSDYIIDAFDIW                   │  have exactly this sequence as a CDR3              │
│CDR3="CARPKSDYIIDAFDIW|CQVWDSSSDHPYVF"  │  have at least one of these sequences as a CDR3    │
│CDR3=".*DYIID.*"                        │  have a CDR3 that contains DYIID inside it         │
│CDR3="CQTWGTGIRVF~3"                    │  CDR3s within Levenshtein distance 3 of CQTWGTGIRVF│
│CDR3="CQTWGTGIRVF~3|CQVWDSSSDHPYVF~2"   │  CDR3s within Levenshtein distance 3 of CQTWGTGIRVF│
│                                        │  or Levenshtein distance 2 of CQVWDSSSDHPYVF       │
└────────────────────────────────────────┴────────────────────────────────────────────────────┘

Note that double quotes should be used if the pattern contains characters other than letters.

A gentle introduction to regular expressions may be found at
https://en.wikipedia.org/wiki/Regular_expression#Basic_concepts, and a precise
specification for the regular expression version used by enclone may be found at
https://docs.rs/regex.
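
The two pattern types can be illustrated with a short Python sketch.  This is not enclone's
implementation: Python's re module differs in details from the regular expression version used
by enclone, and the rule used here to tell the two pattern types apart (every |-separated part
has the form SEQ~n) is an assumption.  The names levenshtein and cdr3_matches are hypothetical.

```python
import re

def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

# One alternative of a Levenshtein distance pattern: a sequence, a tilde, a distance.
LEV_PART = re.compile(r"([A-Z]+)~(\d+)")

def cdr3_matches(pattern, cdr3):
    """Return True if cdr3 matches pattern, in either of the two pattern styles."""
    parts = pattern.split("|")
    lev = [LEV_PART.fullmatch(p) for p in parts]
    if all(lev):
        # Levenshtein distance pattern: SEQ~n alternatives joined by |.
        return any(levenshtein(m.group(1), cdr3) <= int(m.group(2)) for m in lev)
    # Otherwise treat the pattern as a regular expression matched from beginning to end.
    return re.fullmatch(pattern, cdr3) is not None

print(cdr3_matches(".*DYIID.*", "CARPKSDYIIDAFDIW"))
print(cdr3_matches("CQTWGTGIRVF~3", "CQTWGTGIRVA"))
```

Because matching is from beginning to end, the pattern DYIID alone does not match
CARPKSDYIIDAFDIW, whereas .*DYIID.* does.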

linear conditions

enclone understands linear conditions of the form
c1*v1 ± ... ± cn*vn > d
where each ci is a constant, "ci*" may be omitted, each vi is a parseable variable, and d is a
constant.  Blank spaces are ignored.  The > sign may be replaced by
• >= or equivalently ≥ or ⩾
• <
• <= or equivalently ≤ or ⩽.

The details of how enclone evaluates a linear condition for a clonotype are subtle, and these
subtleties may or may not matter for what you're doing.  You may wish to look at the specific
examples given below.  For more detail, here are the rules:
• When a variable is assessed for a given cell, we use the value that would have been obtained
  using parseable output (including with the PCELL mode); see "enclone help parseable".  In most
  cases it will make more sense to use the per-cell version of a variable, if it is defined.
  For example, u1_cell would be the number of UMIs for the first chain for a given cell, but u1
  would be the median value for all cells in an exact subclonotype, regardless of which cell is
  examined.
• For each variable, enclone finds its values for all cells in the clonotype.  Values that are not
  finite numbers are ignored.  This can have unintended consequences, so be careful not to
  accidentally use a variable that is non-numeric.
• If no such values are found for some variable, then the constraint fails.
• Otherwise, some function is applied to all the values for a given variable (e.g. the mean
  function) and the constraint is tested, after substituting in the values from the function.
  The particular function that is used is documented at the appropriate point.
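
The rules above can be sketched in Python as follows.  This is an illustration under stated
assumptions, not enclone's code: the condition is taken as already parsed into (coefficient,
variable) pairs, cells are represented as dictionaries of per-cell values, and the names
finite_values and condition_holds are hypothetical.

```python
import math
import operator
from statistics import mean

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def finite_values(cells, var):
    """Collect the finite numeric values of var over all cells, ignoring everything else."""
    out = []
    for cell in cells:
        try:
            x = float(cell.get(var, ""))
        except (TypeError, ValueError):
            continue  # non-numeric or missing value: ignored
        if math.isfinite(x):
            out.append(x)
    return out

def condition_holds(terms, op, d, cells, func=mean):
    """Test sum(c_i * func(values of v_i)) <op> d for one clonotype.

    terms is a list of (coefficient, variable name) pairs; cells is a list of
    dicts mapping variable names to per-cell values.
    """
    total = 0.0
    for coeff, var in terms:
        values = finite_values(cells, var)
        if not values:
            return False  # no usable values for some variable: the constraint fails
        total += coeff * func(values)
    return OPS[op](total, d)

# Hypothetical clonotype with three cells; the blank value is ignored.
cells = [{"u1_cell": "10"}, {"u1_cell": "30"}, {"u1_cell": ""}]
print(condition_holds([(1, "u1_cell")], ">=", 15, cells))
```

A variable that is non-numeric for every cell yields no values, so any condition that uses it
fails, which is the unintended consequence warned about above.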

Because the minus sign - doubles as a hyphen and is used in some feature names, we allow
parentheses around variable names to prevent erroneous parsing, like this: (IGHV3-7_g) >= 1.  Such
a condition would need to be quoted on the command line.

filtering by linear conditions

enclone has the capability to filter by bounding variables, using the command-line argument:
KEEP_CLONO_IF_CELL_MEAN="L"
where L is a linear condition (as defined above).  Multiple bounds may be imposed by using
multiple instances of KEEP_CLONO_IF_CELL_MEAN=... .  As explained above, note that
KEEP_CLONO_IF_CELL_MEAN=... filters by computing the mean across all cells in the clonotype.  See
also KEEP_CELL_IF= at enclone help special.

If for a given clonotype and a given variable, not all values are specified (e.g. if for a
user-specified variable, values are blank), then only the values that are specified are used in
the computation of mean, min, or max.  If no values are specified, then the condition fails.

Similarly, to filter by the min or max across all cells in a clonotype, one may use
KEEP_CLONO_IF_CELL_MIN="L"
or
KEEP_CLONO_IF_CELL_MAX="L"
and otherwise as above.
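
To illustrate how the three variants can disagree on the same clonotype, here is a small Python
sketch for the one-variable condition u1_cell >= 5.  The per-cell values are made up, and the
helper name keep is hypothetical.

```python
from statistics import mean

# Hypothetical per-cell values of u1_cell for the cells of one clonotype.
u1_cell = [2.0, 10.0, 30.0]

def keep(values, func, bound):
    """Evaluate the one-variable condition  u1_cell >= bound  using the given function."""
    return func(values) >= bound

results = {name: keep(u1_cell, func, 5)
           for name, func in (("MEAN", mean), ("MIN", min), ("MAX", max))}
print(results)
```

Here the mean (14) and the max (30) satisfy the bound, but the min (2) does not, so in this
sketch only the MIN form of the condition would reject the clonotype.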

Caution.  Because of interactions between filters (including built-in filters), the results of
filtering can be counterintuitive.  In particular, cells might be removed from a clonotype after a
linear condition is applied, leading to confusing results.

For cell-exact variables (see https://10xgenomics.github.io/enclone/pages/auto/variables.html),
note that linear conditions are applied to the cell version of the variable.

feature scanning

If gene expression and/or feature barcode data have been generated, enclone can scan all features
to find those that are enriched in certain clonotypes relative to certain other clonotypes.  This
feature is turned on using the command line argument
SCAN="test,control,threshold"
where each of test, control and threshold are linear conditions as defined above.  Blank spaces
are ignored.  The test condition defines the "test clonotypes" and the control condition defines
the "control clonotypes".  The threshold condition is special: it may use only the variables "t"
and "c" that represent the raw UMI count for a particular gene or feature, for the test (t) or
control (c) clonotypes.  To get a meaningful result, you should specify MIN_CELLS appropriately
and manually examine the test and control clonotypes to make sure that they make sense.

If in addition the argument SCAN_EXACT is supplied, then scanning will be carried out over exact
subclonotypes rather than clonotypes.

an example

Suppose that your data consist of two origins with datasets named pre and post, representing time
points relative to some event.  Then
SCAN="n_post - 10*n_pre >= 0, n_pre - 0.5*n_post >= 0, t - 2*c >= 0.1"
would define the test clonotypes to be those satisfying n_post >= 10*n_pre (so having far more
post cells than pre cells), the control clonotypes to be those satisfying n_pre >= 0.5*n_post (so
having lots of pre cells), and thresholding on t >= 2*c + 0.1, so that the feature must have a bit
more than twice as many UMIs in the test than the control.  The 0.1 is there to exclude noise from
features having very low UMI counts.
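
The three conditions in this SCAN argument can be written out as a Python sketch.  The function
names and the numbers are hypothetical, and how enclone computes t and c from the cells of the
test and control clonotypes is not shown here.

```python
def is_test(n_pre, n_post):
    """Test clonotypes: n_post - 10*n_pre >= 0."""
    return n_post - 10 * n_pre >= 0

def is_control(n_pre, n_post):
    """Control clonotypes: n_pre - 0.5*n_post >= 0."""
    return n_pre - 0.5 * n_post >= 0

def passes_threshold(t, c):
    """Threshold: t - 2*c >= 0.1, with t and c the feature's values in test and control."""
    return t - 2 * c >= 0.1

# Hypothetical clonotypes given as (n_pre, n_post).
for n_pre, n_post in [(1, 12), (5, 8), (3, 10)]:
    print((n_pre, n_post), is_test(n_pre, n_post), is_control(n_pre, n_post))

# Hypothetical feature values given as (t, c).
for t, c in [(5.0, 2.0), (4.0, 2.0), (0.05, 0.0)]:
    print((t, c), passes_threshold(t, c))
```

A clonotype such as (3, 10) falls into neither set.  The last feature has more than twice the
control value, but is rejected because its count is below the 0.1 offset.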

Feature scanning is not a proper statistical test.  It is a tool for generating a list of feature
candidates that may then be examined in more detail by rerunning enclone using some of the
detected features as lead variables (appropriately suffixed).  Ultimately the power of the scan is
determined by having "enough" cells in both the test and control sets, and in having those sets
cleanly defined.

Currently feature scanning requires that each dataset have identical features.

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help amino
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

There is a complex per-chain column to the left of other per-chain columns, defined by
AMINO=x1,...,xn: display amino acid columns for the given categories, in one combined ordered
group, where each xi is one of:

┌────────┬───────────────────────────────────────────────────────────────────────────────────────┐
│cdr1    │  CDR1 sequence                                                                        │
│cdr2    │  CDR2 sequence                                                                        │
│cdr3    │  CDR3 sequence                                                                        │
│fwr1    │  FWR1 sequence                                                                        │
│fwr2    │  FWR2 sequence                                                                        │
│fwr3    │  FWR3 sequence                                                                        │
│fwr4    │  FWR4 sequence                                                                        │
│        │  Notes:                                                                               │
│        │  1. Please see the page on bit.ly/enclone about V(D)J features for notes              │
│        │  on our method and interpretation.                                                    │
│        │  2. There are circumstances under which these cannot be calculated, most notably in   │
│        │  cases where something is wrong with the associated reference sequence.  In such      │
│        │  cases, even though you specify CDR1 or CDR2, they will not be shown.                 │
│        │  3. If the CDR1 and CDR2 sequences are sufficiently short, the part of the header line│
│        │  that looks like e.g. ═CDR1═ will get contracted e.g. to DR1 or something even more   │
│        │  cryptic.  It is also possible that the computed CDR1 or CDR2 is empty.               │
│        │  4. The same stipulations apply to FWR1, FWR2 and FWR3.                               │
│        │  5. Spaces are shown between features unless NOSPACES is specified.                   │
├────────┼───────────────────────────────────────────────────────────────────────────────────────┤
│var     │  positions in chain that vary across the clonotype                                    │
│share   │  positions in chain that differ consistently from the donor reference                 │
├────────┼───────────────────────────────────────────────────────────────────────────────────────┤
│donor   │  positions in chain where the donor reference differs from the universal reference    │
├────────┼───────────────────────────────────────────────────────────────────────────────────────┤
│donorn  │  positions in chain where the donor reference differs nonsynonymously                 │
│        │  from the universal reference                                                         │
├────────┼───────────────────────────────────────────────────────────────────────────────────────┤
│a-b     │  amino acids numbered a through b (zero-based, inclusive)                             │
└────────┴───────────────────────────────────────────────────────────────────────────────────────┘

Note that we compute positions in base space, and then divide by three to get positions in amino
acid space.  Thus it can happen that a position in amino acid space is shown for both var and share.
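
A minimal Python sketch of this effect, assuming zero-based base positions and integer division
by three.  The positions are made up and aa_positions is a hypothetical helper.

```python
def aa_positions(base_positions):
    """Map zero-based base positions to the sorted set of amino acid positions."""
    return sorted({p // 3 for p in base_positions})

# Hypothetical base positions within one chain.
var_bases = [61, 170]     # vary across the clonotype
share_bases = [62, 300]   # constant across the clonotype but differ from the donor ref

var_aa = aa_positions(var_bases)       # [20, 56]
share_aa = aa_positions(share_bases)   # [20, 100]
both = sorted(set(var_aa) & set(share_aa))
print(var_aa, share_aa, both)
```

Bases 61 and 62 lie in the same codon, so amino acid position 20 appears under both var and
share.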

The default value for AMINO is cdr3,var,share,donor.  Note that we only report amino acids that
are strictly within V..J, thus specifically excluding the codon bridging J and C.

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help special
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

special filtering options

This page documents some options, most of which allow noise filters to be turned off, and which
normally should not be invoked.  Some of these options delete barcodes, and a summary of this
action is included in the SUMMARY option.  See also the lead variable "filter", see "enclone help
lvars".  At the bottom of this page we provide some other options that are not noise filters.

┌─────────────────────────┬─────────────────────────────────────────────────────────────────────┐
│NALL                     │  Turn off all the noise filters shown below.  This may              │
│                         │  yield quite a mess.                                                │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NCELL                    │  Use contigs found by Cell Ranger even if they were not             │
│                         │  in a called cell, or not called high confidence.                   │
│NALL_CELL                │  Turn off all the noise filters except for the cell filter.         │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NMAX                     │  Allow barcodes for which more than four contigs were identified.   │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NGEX                     │  If gene expression and/or feature barcode data are                 │
│                         │  provided, if a barcode is called a cell by the VDJ part            │
│                         │  of the Cell Ranger pipeline, but not called a cell by              │
│                         │  the gene expression and/or feature barcode part, then              │
│                         │  the default behavior of enclone is to remove such cells            │
│                         │  from clonotypes.  This option disables that behavior.              │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NCROSS                   │  If you specify that two or more libraries arose from               │
│                         │  the same origin (i.e. cells from the same tube or                  │
│                         │  tissue), then by default enclone will "cross filter" so            │
│                         │  as to remove expanded exact subclonotypes that are                 │
│                         │  present in one library but not another, in a fashion               │
│                         │  that would be highly improbable, assuming random draws             │
│                         │  of cells from the tube.  These are believed to arise               │
│                         │  when a plasma or plasmablast cell breaks up                        │
│                         │  during or after pipetting from the tube, and the                   │
│                         │  resulting fragments seed GEMs, yielding expanded 'fake'            │
│                         │  clonotypes that are residues of real single plasma                 │
│                         │  cells.  The NCROSS option turns off this filter, which             │
│                         │  could be useful so long as you interpret the restored              │
│                         │  clonotypes as representing what are probably single                │
│                         │  cells.  There may also be other situations where the               │
│                         │  filter should be turned off, and in particular the                 │
│                         │  filter can do weird things if inputs are somehow                   │
│                         │  mis-specified to enclone.  Note that for purposes of               │
│                         │  this option, enclone defines an origin by the pair                 │
│                         │  (origin name, donor name).                                         │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NUMI                     │  By default, enclone filters out B cells based on low BCR UMI       │
│                         │  counts.  The heuristics for this are described on the enclone      │
│                         │  site at bit.ly/enclone.  The NUMI option turns off this filter.    │
│NUMI_RATIO               │  By default, enclone filters out B cells based on low BCR UMI       │
│                         │  counts relative to another cell in a given clonotype.  The         │
│                         │  heuristics for this are described on the enclone site at           │
│                         │  bit.ly/enclone.  The NUMI_RATIO option turns off this filter.      │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NGRAPH_FILTER            │  By default, enclone filters to remove exact                        │
│                         │  subclonotypes that by virtue of their relationship to              │
│                         │  other exact subclonotypes, appear to arise from                    │
│                         │  background mRNA or a phenotypically similar phenomenon.            │
│                         │  The NGRAPH_FILTER option turns off this filtering.                 │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NQUAL                    │  By default, enclone filters out exact subclonotypes                │
│                         │  having a base in V..J that looks like it might be                  │
│                         │  wrong.  More specifically, enclone finds bases which               │
│                         │  are not Q60 for a barcode, not Q40 for two barcodes,               │
│                         │  are not supported by other exact subclonotypes, are                │
│                         │  variant within the clonotype, and which disagree with              │
│                         │  the donor reference.  NQUAL turns this off.                        │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NWEAK_CHAINS             │  By default, enclone filters chains from clonotypes that            │
│                         │  are weak and appear to be artifacts, perhaps arising               │
│                         │  from a stray mRNA molecule that floated into a GEM.                │
│                         │  The NWEAK_CHAINS option turns off this filter.                     │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NWEAK_ONESIES            │  By default, enclone disintegrates certain untrusted                │
│                         │  clonotypes into single cell clonotypes.  The untrusted             │
│                         │  clonotypes are onesies that are light chain or TRA and             │
│                         │  whose number of cells is less than 0.1% of the total               │
│                         │  number of cells.  This operation reduces the likelihood            │
│                         │  of creating clonotypes containing cells that arose from            │
│                         │  different recombination events.  NWEAK_ONESIES turns               │
│                         │  this operation off.                                                │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NMERGE_ONESIES           │  enclone merges certain onesie clonotypes into                      │
│                         │  clonotypes having two or more chains.  By default, this            │
│                         │  merger is prevented if the number of cells in the                  │
│                         │  onesie is less than 0.01% of the total number of cells.            │
│                         │  NMERGE_ONESIES causes these merges to happen anyway.               │
│                         │  The naming of this option is confusing.                            │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NFOURSIE_KILL            │  Under certain circumstances, enclone will delete foursie exact     │
│                         │  subclonotypes.  Please see                                         │
│                         │  10xgenomics.github.io/enclone/pages/auto/default_filters.html.     │
│                         │  The foursies that are killed are believed to be artifacts          │
│                         │  arising from repeated cell doublets or GEMs that contain two       │
│                         │  cells and multiple gel beads.  The argument NFOURSIE_KILL          │
│                         │  turns off this filtering.                                          │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NDOUBLET                 │  Under certain circumstances, enclone will delete exact             │
│                         │  subclonotypes that appear to represent doublets.  Please see       │
│                         │  10xgenomics.github.io/enclone/pages/auto/default_filters.html.     │
│                         │  The argument NDOUBLET turns off this filtering.                    │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NSIG                     │  Under certain circumstances, enclone will delete exact             │
│                         │  subclonotypes that appear to be contaminants, based on their       │
│                         │  chain signature.  Please see                                       │
│                         │  10xgenomics.github.io/enclone/pages/auto/default_filters.html.     │
│                         │  The argument NSIG turns off this filtering.                        │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NWHITEF                  │  By default, enclone filters out rare artifacts arising             │
│                         │  from contamination of oligos on gel beads.  The NWHITEF            │
│                         │  option turns off this filter.                                      │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NBC_DUP                  │  By default, enclone filters out duplicated barcodes within an exact│
│                         │  subclonotype.  The NBC_DUP option turns off this filter.           │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│MIX_DONORS               │  By default, enclone will prevent cells from different              │
│                         │  donors from being placed in the same clonotype.  The               │
│                         │  MIX_DONORS option turns off this behavior, thus                    │
│                         │  allowing cells from different donors to be placed in               │
│                         │  the same clonotype.  The main use of this option is for            │
│                         │  specificity testing, in which data from different                  │
│                         │  donors are deliberately combined in an attempt to find             │
│                         │  errors.  Use of the bc field for META input                        │
│                         │  specification automatically turns on this option.                  │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│NIMPROPER                │  enclone filters out exact subclonotypes having more                │
│                         │  than one chain, but all of the same type.  For example,            │
│                         │  the filter removes all exact subclonotypes having two              │
│                         │  TRA chains and no other chains.  The NIMPROPER option              │
│                         │  turns off this filter.                                             │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│MIN_CHAINS_EXACT=n       │  Delete any exact subclonotype having fewer than n                  │
│                         │  chains.  You can use this to "purify" a clonotype so as            │
│                         │  to display only exact subclonotypes having all their chains.       │
│CHAINS_EXACT=n           │  Delete any exact subclonotype not having exactly n chains.         │
│MIN_CELLS_EXACT=n        │  Delete any exact subclonotype having fewer than n cells.  You might│
│                         │  want to use this if you have a very large and complex expanded     │
│                         │  clonotype, for which you would like to see a simplified view.      │
│COMPLETE                 │  delete any exact subclonotype that has fewer chains than the       │
│                         │  clonotype                                                          │
│CONST_IGH="<pattern>"    │  for BCR, keep only exact subclonotypes having a heavy              │
│                         │  chain whose constant region gene name matches the given            │
│                         │  pattern (meaning regular expression, see enclone help filter)      │
│CONST_IGKL="<pattern>"   │  for BCR, keep only exact subclonotypes having a light              │
│                         │  chain whose constant region gene name matches the given            │
│                         │  pattern (meaning regular expression, see enclone help filter)      │
│MAX_HEAVIES=1            │  ignore any cell having more than one IGH or TRB chain              │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────┤
│KEEP_CELL_IF=constraint  │  Let "constraint" be any constraint involving arithmetic            │
│                         │  and boolean operators, and variables that are specified            │
│                         │  as fields using the BC option (or equivalently, using              │
│                         │  bc, via META), see enclone help input, or feature                  │
│                         │  variables: <gene>_g or <antibody>_ab or <crispr>_cr or             │
│                         │  <custom>_cu, as described at enclone help lvars (but               │
│                         │  without regular expressions, as these would conflict               │
│                         │  with arithmetic operators).  This option filters out               │
│                         │  all barcodes that do not satisfy the given constraint.             │
│                         │  Note that for purposes of testing the constraint, if               │
│                         │  the value for a particular barcode has not been                    │
│                         │  specified, then its value is taken to be null.  Also               │
│                         │  multiple instances of KEEP_CELL_IF may be used to                  │
│                         │  impose multiple filters.  See the examples below, and              │
│                         │  be very careful about syntax, which should match the               │
│                         │  given examples exactly.  In particular,                            │
│                         │  • use == for equality, and not =                                   │
│                         │  • put string values in single quotes                               │
│                         │  • put the entire expression in double quotes.                      │
│                         │                                                                     │
│                         │  As a toy example, suppose you had a CSV file f having five lines:  │
│                         │  barcode,nice,rank                                                  │
│                         │  AGCATACTCAGAGGTG-1,true,3                                          │
│                         │  CGTGAGCGTATATGGA-1,true,7                                          │
│                         │  CGTTAGAAGGAGTAGA-1,false,99                                        │
│                         │  CGTTAGAAGGAGTAGA-1,dunno,43                                        │
│                         │  then the command                                                   │
│                         │  enclone BCR=123085 BC=f KEEP_CELL_IF="nice == 'true'"              │
│                         │  would cause enclone to use only the first two barcodes shown in    │
│                         │  the file, and the command                                          │
│                         │  enclone BCR=123085 BC=f KEEP_CELL_IF="nice == 'true' && rank <= 5" │
│                         │  would cause only the first barcode to be used.                     │
│                         │                                                                     │
│                         │  See also KEEP_CLONO_IF_CELL_MEAN=... and                           │
│                         │  KEEP_CLONO_IF_CELL_MAX=... at enclone help filter.                 │
└─────────────────────────┴─────────────────────────────────────────────────────────────────────┘
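
The KEEP_CELL_IF toy example above can be mimicked outside of enclone.  The following Python
sketch applies the same two constraints to the rows of the file f.  It is only an illustration
of the filtering logic, not how enclone evaluates constraints: the function keep_rows and the
lambda predicates are hypothetical stand-ins for the constraint expressions.

```python
import csv
import io

# Contents of the toy CSV file f from the example above.
F_TEXT = """barcode,nice,rank
AGCATACTCAGAGGTG-1,true,3
CGTGAGCGTATATGGA-1,true,7
CGTTAGAAGGAGTAGA-1,false,99
CGTTAGAAGGAGTAGA-1,dunno,43
"""


def keep_rows(text, predicate):
    """Return the barcodes of the rows for which predicate(row) is true."""
    reader = csv.DictReader(io.StringIO(text))
    return [row["barcode"] for row in reader if predicate(row)]


# Stand-in for KEEP_CELL_IF="nice == 'true'"
kept_nice = keep_rows(F_TEXT, lambda r: r["nice"] == "true")

# Stand-in for KEEP_CELL_IF="nice == 'true' && rank <= 5"
kept_nice_low_rank = keep_rows(
    F_TEXT, lambda r: r["nice"] == "true" and int(r["rank"]) <= 5
)

print(kept_nice)           # the first two barcodes in the file
print(kept_nice_low_rank)  # only the first barcode
```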

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help lvars
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

lead column options

These options define lead variables, which are variables that are computed for each exact
subclonotype, and if using the PER_CELL option, also computed for each cell.  In addition, lead
variables can be used for parseable output.

Lead variables appear in columns that appear once in each clonotype, on the left side, and have
one entry for each exact subclonotype row.

Note that for medians of integers, we actually report the "rounded median", the result of rounding
the true median up to the nearest integer, so that e.g. 6.5 is rounded up to 7.
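
As a small illustration of the "rounded median" described above, here is a Python sketch.  The
function name rounded_median is hypothetical, and the sketch assumes nonnegative integer inputs,
so that the true median is either an integer or ends in .5.

```python
import math
import statistics


def rounded_median(values):
    """Median of a list of integers, with halves rounded up."""
    m = statistics.median(values)
    # floor(m + 0.5) rounds x.5 up.  Python's built-in round() uses
    # round-half-to-even, so round(6.5) would give 6 instead of 7.
    return math.floor(m + 0.5)


print(rounded_median([6, 7]))     # true median 6.5, reported as 7
print(rounded_median([1, 2, 3]))  # true median 2, reported as 2
```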

See also enclone help cvars and the inventory of all variables at
            https://10xgenomics.github.io/enclone/pages/auto/inventory.html.

Lead variables are specified using LVARS=x1,...,xn where each xi is one of:

┌──────────────────┬───────────────────────────────────────────────────────────────────────────────┐
│nchains           │  total number of chains in the clonotype                                      │
│nchains_present   │  number of chains present in an exact subclonotype                            │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│datasets          │  dataset identifiers                                                          │
│origin            │  origin identifiers                                                           │
│donors            │  donor identifiers                                                            │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│n                 │  number of cells                                                              │
│n_<name>          │  number of cells associated to the given name, which can be a dataset         │
│                  │  or origin or donor or tag short name; may name only one such category        │
│clonotype_ncells  │  total number of cells in the clonotype                                       │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│nd<k>             │  For k a positive integer, this creates k+1 fields, that are specific to each │
│                  │  clonotype.  The first field is n_<d1>, where d1 is the name of the dataset   │
│                  │  having the most cells in the clonotype.  If k ≥ 2, then you'll get a         │
│                  │  "runner-up" field n_<d2>, etc.  Finally you get a field n_other, however     │
│                  │  fields will be elided if they represent no cells.  Use a variable of this    │
│                  │  type at most once.                                                           │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│near              │  Hamming distance of V..J DNA sequence to nearest neighbor                    │
│far               │  Hamming distance of V..J DNA sequence to farthest neighbor                   │
│                  │  both compare to cells having chains in the same columns of the clonotype,    │
│                  │  with - shown if there is no other exact subclonotype to compare to           │
│dref              │  Hamming distance of V..J DNA sequence to donor reference, excluding          │
│                  │  region of recombination, sum over all chains                                 │
│dref_aa           │  Hamming distance of V..J amino acid sequence to donor reference, excluding   │
│                  │  region of recombination, sum over all chains                                 │
│dref_max          │  Hamming distance of V..J DNA sequence to donor reference, max over all       │
│                  │  chains                                                                       │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│count_<reg>       │  Number of matches of the V..J amino acid sequences of all chains to the given│
│                  │  regular expression, which is treated as a subset match, so for example,      │
│                  │  count_CAR would count the total number of occurrences of the string CAR in   │
│                  │  all the chains.  Please see enclone help filter for a discussion             │
│                  │  about regular expressions.  We also allow the form abbr:count_<regex>,       │
│                  │  where abbr is an abbreviation that will appear as the field label.           │
│count_<f>_<reg>   │  Supposing that f is in {cdr1,..,cdr3,fwr1,..,fwr4,cdr,fwr}, this is similar  │
│                  │  to the above but restricted to motifs lying entirely within                  │
│                  │  a given feature or feature set.                                              │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│inkt              │  A string showing the extent to which the T cells in an exact subclonotype    │
│                  │  have evidence for being an iNKT cell.  The most evidence is denoted 𝝰gj𝝱gj,  │
│                  │  representing both gene name and junction sequence (CDR3) requirements for    │
│                  │  both chains.  See bit.ly/enclone for details on the requirements.            │
│mait              │  Same as with inkt but for MAIT cells instead.                                │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│g<d>              │  Here d is a nonnegative integer.  Then all the exact subclonotypes are       │
│                  │  grouped according to the Hamming distance of their V..J sequences.  Those    │
│                  │  within distance d are defined to be in the same group, and this is           │
│                  │  extended transitively.  The group identifier 1, 2, ... is shown.  The        │
│                  │  ordering of these identifiers is arbitrary.  This option is best applied     │
│                  │  to cases where all exact subclonotypes have a complete set of chains.        │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│gex               │  ● median gene expression UMI count                                           │
│n_gex             │  ● number of cells found by cellranger using GEX or Ab data                   │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│<gene>_g          │  ● all four feature types: look for a declared feature of the given type      │
│<antibody>_ab     │  with the given id or name; report the median UMI count for it;               │
│<crispr>_cr       │  we also allow <regular expression>_g where g can be replaced by ab, ag, cr   │
│<custom>_cu       │  or cu; this represents a sum of UMI counts across the matching features. ●   │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│cred              │  Short for credibility.  It is a measure of the extent to which cells         │
│                  │  having gene expression similar to a given putative B cell are themselves     │
│                  │  B cells.  (Or similarly for T cells.)  For the actual definition, let n      │
│                  │  be the number of VDJ cells that are also GEX cells.  For a given cell,       │
│                  │  find the n GEX cells that are closest to it in PCA space, and report the     │
│                  │  percent of those that are also VDJ cells.  For multiple datasets, it would   │
│                  │  be better to "aggr" the data, however that is not currently supported.       │
│                  │  The computation is also inefficient, so let us know if it's causing          │
│                  │  problems for you.  And cred makes much better sense for datasets that        │
│                  │  consist of mixed cell types, rather than consisting of pure B or T cells.    │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│filter            │  See enclone help special.  Use with PER_CELL.  If you turn off some          │
│                  │  default filters (or all default filters, e.g. with NALL_CELL), and this      │
│                  │  cell would have been deleted by one of the default filters, then this will   │
│                  │  show the name of the last filter that would have been applied to delete the  │
│                  │  cell.  (There are exceptions, please see enclone help special.)  Note        │
│                  │  that there are complex interactions between filters, so the actual effect    │
│                  │  with all default filters on may be significantly different.  Note also that  │
│                  │  use of NALL_CELL will typically result in peculiar artifacts, so this        │
│                  │  should only be used as an exploratory tool.                                  │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│nbc               │  numerically encoded barcode: a ten-digit number, padded with zeros           │
│                  │  on the left, which represents the base four encoding of the barcode DNA      │
│                  │  sequence, with A ==> 0, C ==> 1, G ==> 2 and T ==> 3; only defined for cells │
├──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│hcomp             │  complexity of heavy chain, only computed if two chains, one heavy, one light │
│                  │  and computed by finding optimal D, aligning to concatenated VDJ,             │
│                  │  and then scoring +1 for each inserted base, +1 for each deletion,            │
│                  │  regardless of size, and +1 for each substitution                             │
│jun_ins           │  like hcomp, but only counts inserted bases                                   │
└──────────────────┴───────────────────────────────────────────────────────────────────────────────┘
For gene expression and feature barcode stats, such data must be provided as input to enclone.
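
The transitive grouping behind the g<d> variable in the table above can be sketched in Python.
This is an illustration only, not enclone's implementation.  It assumes that each exact
subclonotype is represented by a single string (for example its concatenated V..J sequences),
that all strings have the same length, and that the names hamming and group_ids are
hypothetical.  Exact subclonotypes within Hamming distance d are joined, and the joining is
extended transitively with a union-find structure.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def group_ids(seqs, d):
    """Assign group identifiers 1, 2, ... to sequences, joining any two
    sequences within Hamming distance d and extending transitively."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if hamming(seqs[i], seqs[j]) <= d:
                parent[find(i)] = find(j)

    # Number the groups in order of first appearance; the numbering itself
    # carries no meaning.
    ids = {}
    return [ids.setdefault(find(i), len(ids) + 1) for i in range(len(seqs))]


# AAAA and AATT differ at two positions, but both are within distance 1 of
# AAAT, so all three land in the same group.
print(group_ids(["AAAA", "AAAT", "AATT", "GGGG"], 1))
```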

● Example: IG.*_g matches all genes that begin with IG, and TR(A|B).*_g matches all genes that
begin with TRA or TRB.  Double quotes as in LVARS="..." may be needed.  The regular expression
must be in the alphabet A-Za-z0-9+_-.[]()|* and is only interpreted as a regular expression if it
contains a character in []()|*.  See enclone help filter for more information about regular
expressions.
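
The rule just stated, for deciding when a feature name is treated as a regular expression, can
be written as a short Python sketch.  The function classify_feature_pattern is hypothetical and
applies to the part of the variable name before the _g, _ab, _cr or _cu suffix; it only encodes
the alphabet and trigger characters given above and is not enclone's own parser.

```python
import string

# Alphabet allowed in a pattern: A-Za-z0-9+_-.[]()|*
ALLOWED = set(string.ascii_letters + string.digits + "+_-.[]()|*")

# A pattern is interpreted as a regular expression only if it contains one
# of these characters.
REGEX_TRIGGERS = set("[]()|*")


def classify_feature_pattern(pattern):
    """Return "invalid", "regex" or "literal" for a feature name pattern."""
    if not pattern or any(c not in ALLOWED for c in pattern):
        return "invalid"
    if any(c in REGEX_TRIGGERS for c in pattern):
        return "regex"
    return "literal"


for p in ["IG.*", "TR(A|B).*", "CD19", "CD 19"]:
    print(p, "->", classify_feature_pattern(p))
```

Note that under this rule a name containing a period but none of the characters []()|* is
still treated literally.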

  ● These variables have some alternate versions, as shown in the table below.
  
  ┌──────────┬───────────────────────────────┬──────────┬──────────────┬─────────────┬────────────┐
  │variable  │  semantics                    │  visual  │  visual      │  parseable  │  parseable │
  │          │                               │          │  (one cell)  │             │  (one cell)│
  ├──────────┼───────────────────────────────┼──────────┼──────────────┼─────────────┼────────────┤
  │x         │  median over cells            │  yes     │  this cell   │  yes        │  yes       │
  │x_mean    │  mean over cells              │  yes     │  null        │  yes        │  yes       │
  │x_μ       │  (same as above)              │  yes     │  null        │  yes        │  yes       │
  │x_sum     │  sum over cells               │  yes     │  null        │  yes        │  yes       │
  │x_Σ       │  (same as above)              │  yes     │  null        │  yes        │  yes       │
  │x_min     │  min over cells               │  yes     │  null        │  yes        │  yes       │
  │x_max     │  max over cells               │  yes     │  null        │  yes        │  yes       │
  │x_%       │  % of total GEX (genes only)  │  yes     │  this cell   │  yes        │  yes       │
  │x_cell    │  this cell                    │  no      │  no          │  no         │  this cell │
  └──────────┴───────────────────────────────┴──────────┴──────────────┴─────────────┴────────────┘
  Some explanation is required.  If you use enclone without certain options, you get the "visual"
  column.
  • Add the option PER_CELL (see enclone help display) and then you get visual output with extra
  lines for each cell within an exact subclonotype, and each of those extra lines is described by
  the "visual (one cell)" column.
  • If you generate parseable output (see enclone help parseable), then you get the "parseable"
  column for that output, unless you specify PCELL, and then you get the last column.
  • For the forms with μ and Σ, the Greek letters are only used in column headings for visual output
  (to save space), and optionally, in names of fields on the command line.
  ▶ If you try out these features, you'll see exactly what happens! ◀

● Similar to the above but simpler: n_gex is just a count of cells, visual (one cell) shows 0 or
1, n_gex_cell is defined for parseable (one cell), and the x_mean etc. forms do not apply.

The default is datasets,n, except that datasets is suppressed if there is only one dataset.

LVARSP=x1,...,xn is like LVARS but appends to the list.
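
The nbc lead variable described in the table above can be illustrated with a Python sketch.
It reads the description as follows: the barcode DNA sequence is treated as a base four number
and printed in decimal, padded on the left with zeros to ten digits.  Two details are
assumptions that the text does not state: the first base is taken as the most significant
digit, and any suffix such as -1 is dropped before encoding.  The function name nbc is used
here only for convenience.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def nbc(barcode):
    """Encode a barcode as a zero-padded ten-digit decimal string."""
    seq = barcode.split("-")[0]  # assumption: drop a suffix such as "-1"
    value = 0
    for base in seq:
        # assumption: the first base is the most significant base-4 digit
        value = 4 * value + BASE_CODE[base]
    return f"{value:010d}"


print(nbc("AAAAAAAAAAAAAAAC-1"))
print(nbc("TTTTTTTTTTTTTTTT-1"))
```

A 16-base barcode has at most 4^16 - 1 = 4294967295 as its value, which is why ten digits
suffice.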

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help display
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

other options that control clonotype display

┌───────────────┬───────────────────────────────────────────────────────────────────────────────┐
│PER_CELL       │  expand out each exact subclonotype line, showing one line per cell,          │
│               │  for each such line, displaying the barcode name, the number of UMIs assigned,│
│               │  and the gene expression UMI count, if applicable, under gex_med              │
├───────────────┼───────────────────────────────────────────────────────────────────────────────┤
│BARCODES       │  print list of all barcodes of the cells in each clonotype, in a              │
│               │  single line near the top of the printout for a given clonotype               │
├───────────────┼───────────────────────────────────────────────────────────────────────────────┤
│SEQC           │  print V..J sequence for each chain in the first exact subclonotype, near     │
│               │  the top of the printout for a given clonotype                                │
├───────────────┼───────────────────────────────────────────────────────────────────────────────┤
│FULL_SEQC      │  print full sequence for each chain in the first exact subclonotype,          │
│               │  near the top of the printout for a given clonotype                           │
├───────────────┼───────────────────────────────────────────────────────────────────────────────┤
│SUM            │  print sum row for each clonotype (sum is across cells)                       │
│MEAN           │  print mean row for each clonotype (mean is across cells)                     │
├───────────────┼───────────────────────────────────────────────────────────────────────────────┤
│DIFF_STYLE=C1  │  instead of showing an x for each amino acid column containing a difference,  │
│               │  show a C if the column lies within a complementarity-determining region,     │
│               │  and F if it lies in a framework region, and an L if it lies in the leader    │
│DIFF_STYLE=C2  │  instead of showing an x for each amino acid column containing a difference,  │
│               │  show a ◼ if the column lies within a complementarity-determining region,     │
│               │  and otherwise show a ▮.                                                      │
├───────────────┼───────────────────────────────────────────────────────────────────────────────┤
│CONX           │  add an additional row to each clonotype table, showing the amino acid        │
│               │  consensus across the clonotype, with X for each variant residue              │
│CONP           │  add an additional row to each clonotype table, showing the amino acid        │
│               │  consensus across the clonotype, with a property symbol whenever two different│
│               │  amino acids are observed, see enclone help cvars                             │
├───────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ALIGN<n>       │  exhibit a visual alignment for chain n (for each exact subclonotype) to the  │
│               │  donor V(D)J reference, picking the best D for heavy chains / TRB.            │
│               │  Multiple values of n may be specified using multiple arguments.              │
│ALIGN_2ND<n>   │  same as ALIGN<n> but use second best D segment                               │
│JALIGN<n>      │  same as ALIGN<n> but only show the region from 15 bases before the end of the│
│               │  V segment to 35 bases into the J segment                                     │
│JALIGN_2ND<n>  │  same as JALIGN<n> but use second best D segment                              │
└───────────────┴───────────────────────────────────────────────────────────────────────────────┘

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

options that control clonotype grouping

By default, enclone organizes clonotypes into groups, and each group contains just one clonotype! 
We offer some options to do actual grouping, with the intention of reflecting functional
(antigen-binding) differences, but with many caveats because this is a hard problem.

These options are experimental.  There are many natural extensions that we have not implemented.

enclone has two types of grouping: symmetric and asymmetric.  Symmetric grouping creates
nonoverlapping groups, whereas asymmetric grouping creates groups that may overlap.

To turn on symmetric grouping, one uses a command of the form
GROUP=c1,...,cn
where each ci is a condition.  Two clonotypes are placed in the same group if all the conditions
are satisfied, and that grouping is extended transitively.

In what follows, heavy chain means IGH or TRB, and light chain means IGK or IGL or TRA.  

Here are the conditions:

┌───────────────────────┬───────────────────────────────────────────────────────────────────────┐
│vj_refname             │  V segments have the same reference sequence name,                    │
│                       │  and likewise for J segments                                          │
│v_heavy_refname        │  heavy chain V segments have the same reference sequence name         │
│vj_heavy_refname       │  heavy chain V segments have the same reference sequence name,        │
│                       │  and likewise for J segments                                          │
│                       │  (only applied to heavy chains)                                       │
│vdj_refname            │  V segments have the same reference sequence name,                    │
│                       │  and likewise for D segments, computed from scratch, and J segments   │
│vdj_heavy_refname      │  V segments have the same reference sequence name,                    │
│                       │  and likewise for D segments, computed from scratch, and J segments   │
│                       │  (only applied to heavy chains)                                       │
├───────────────────────┼───────────────────────────────────────────────────────────────────────┤
│len                    │  the lengths of V..J are the same (after correction for indels)       │
│cdr3_len               │  CDR3 sequences have the same length                                  │
│cdr3_heavy_len         │  heavy chain CDR3 sequences have the same length                      │
│cdr3_light_len         │  light chain CDR3 sequences have the same length                      │
├───────────────────────┼───────────────────────────────────────────────────────────────────────┤
│cdr3_heavy≥n%          │  nucleotide identity on heavy chain CDR3 sequences is at least n%     │
│cdr3_light≥n%          │  nucleotide identity on light chain CDR3 sequences is at least n%     │
│cdr3_aa_heavy≥n%       │  amino acid identity on heavy chain CDR3 sequences is at least n%     │
│cdr3_aa_light≥n%       │  amino acid identity on light chain CDR3 sequences is at least n%     │
│                       │  (note that use of either of these options without at least one of the│
│                       │  earlier options may be slow)                                         │
│                       │  (in both cases, we also recognize >= (with quoting) and ⩾)           │
│                       │  (all of the above options use Levenshtein distance)                  │
│cdr3_aa_heavy≥n%:h:@f  │  given a file f containing 20 lines,                                  │
│                       │  each having 20 numbers, separated by single spaces,                  │
│                       │  compute the Hamming distance between heavy chain amino acid          │
│                       │  sequences of the same length, weighted by the matrix defined by f,   │
│                       │  and require that percent identity is bounded accordingly             │
├───────────────────────┼───────────────────────────────────────────────────────────────────────┤
│heavy≥n%               │  nucleotide identity on heavy chain V..J sequences is at least n%     │
│light≥n%               │  nucleotide identity on light chain V..J sequences is at least n%     │
│aa_heavy≥n%            │  amino acid identity on heavy chain V..J sequences is at least n%     │
│aa_light≥n%            │  amino acid identity on light chain V..J sequences is at least n%     │
│                       │  (note that use of either of these options without at least one of the│
│                       │  earlier options may be very slow)                                    │
│                       │  (in both cases, we also recognize >= (with quoting) and ⩾)           │
│                       │  (all of the above options use Levenshtein distance)                  │
└───────────────────────┴───────────────────────────────────────────────────────────────────────┘
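
To make the percent identity conditions in the table above more concrete, here is a Python
sketch of one way to test a condition such as cdr3_aa_heavy≥80% from a Levenshtein distance.
The table says only that Levenshtein distance is used.  The conversion to a percentage shown
here, 100 * (1 - distance / length of the longer sequence), is an assumption, and enclone's
exact formula may differ.  All function names are hypothetical.

```python
def levenshtein(a, b):
    """Levenshtein (edit) distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution or match
        prev = cur
    return prev[-1]


def percent_identity(a, b):
    """ASSUMED definition: 100 * (1 - edit distance / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def satisfies(a, b, n):
    """True if the two sequences have percent identity of at least n."""
    return percent_identity(a, b) >= n


# One substitution in six residues gives about 83.3% under this definition.
print(satisfies("CARDYW", "CARDFW", 80))
print(satisfies("CARDYW", "CARDFW", 90))
```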

To instead turn on asymmetric grouping, one uses the AGROUP option.  To use this, it is in
addition necessary to define "center clonotypes", a "distance formula", and a "distance bound". 
Each group will then consist of the center clonotype (which comes first), followed by, in order by
distance (relative to the formula), all those clonotypes that satisfy the distance bound (with
ties broken arbitrarily).  For each clonotype in a group, we print its distance from the first
clonotype, and this is also available as a parseable variable dist_center.

Center clonotypes.  These are in principle any set of clonotypes.  For now we allow two options:
AG_CENTER=from_filters
which causes all the filters described at "enclone help filter" to NOT filter clonotypes in the
usual way, but instead filter to define the center, and
AG_CENTER=copy_filters
which effectively does nothing -- it just says that filters apply to all clonotypes, whether in
the center or not.

┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│Please note that asymmetric grouping is very time consuming, and run time is roughly a linear│
│function of (number of center clonotypes) * (number of clonotypes).  So it is advisable to   │
│restrict the number of center clonotypes.                                                    │
└─────────────────────────────────────────────────────────────────────────────────────────────┘

Distance formula.  This could in principle be any function that takes as input two clonotypes and
returns a number.  For now we allow only:
AG_DIST_FORMULA=cdr3_edit_distance
which is the "Levenshtein CDR3 edit distance between two clonotypes".  This is the minimum, over
all pairs of exact subclonotypes, one from each of the two clonotypes, of the edit distance
between two exact subclonotypes, which is the sum of the edit distances between the heavy chains
and between the light chains.

Technical note.  The description above covers the case where there are two chains of different
types.  For the "non-standard" cases, we take the sum, over all pairs of heavy chains, one from
each of the two exact subclonotypes, of the edit distance between the CDR3 sequences for the heavy
chains, plus the same for light chains.  Exact subclonotypes that lack a heavy or a light chain
are ignored by this computation.  Also the distance between two clonotypes is declared infinite if
one of them lacks a heavy chain or one of them lacks a light chain.
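
As an illustration only, the distance described above can be sketched in Python.  This is not
enclone's code.  The data layout (a clonotype as a list of exact subclonotypes, each a dict with
"heavy" and "light" lists of CDR3 strings) is hypothetical, and we assume that a clonotype "lacks
a heavy chain" when none of its exact subclonotypes has one.

```python
import math
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def exact_distance(x: dict, y: dict) -> int:
    """Sum of CDR3 edit distances over heavy-heavy pairs plus light-light pairs."""
    heavy = sum(edit_distance(a, b) for a, b in product(x["heavy"], y["heavy"]))
    light = sum(edit_distance(a, b) for a, b in product(x["light"], y["light"]))
    return heavy + light


def clonotype_distance(c1: list, c2: list) -> float:
    """Minimum of exact_distance over pairs of exact subclonotypes, one per clonotype."""
    for c in (c1, c2):
        # Assumption: a clonotype with no heavy chain at all, or no light chain at all,
        # is at infinite distance from everything.
        if not any(e["heavy"] for e in c) or not any(e["light"] for e in c):
            return math.inf
    # Exact subclonotypes lacking a heavy or a light chain are ignored.
    usable1 = [e for e in c1 if e["heavy"] and e["light"]]
    usable2 = [e for e in c2 if e["heavy"] and e["light"]]
    return min((exact_distance(x, y) for x, y in product(usable1, usable2)),
               default=math.inf)
```

When each exact subclonotype has exactly one heavy and one light chain, exact_distance reduces to
the sum of the heavy chain and light chain edit distances.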

Distance bound.  For now we allow the following two forms:
AG_DIST_BOUND=top=n
which returns the top n clonotypes (plus the center), and
AG_DIST_BOUND=max=d
which returns all clonotypes having distance ≤ d from the center clonotype.
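
A minimal Python sketch of the two bound forms follows.  It is not enclone's code.  It assumes that
"top n" means the n clonotypes nearest to the center, and that the distances from the center have
already been computed; the function name and the dict layout are hypothetical.

```python
def apply_bound(distances: dict, bound: str) -> list:
    """Select clonotypes for a group.

    distances maps a clonotype id to its distance from the center clonotype;
    bound is the value of AG_DIST_BOUND, either "top=n" or "max=d".
    The center itself is not included in the returned list.
    """
    kind, _, value = bound.partition("=")
    # Nearest first.  Ties keep their input order here, which is one arbitrary choice.
    ranked = sorted(distances.items(), key=lambda item: item[1])
    if kind == "top":
        return [cid for cid, _ in ranked[: int(value)]]
    if kind == "max":
        return [cid for cid, dist in ranked if dist <= float(value)]
    raise ValueError(f"unsupported bound: {bound}")
```

For example, with distances {"c1": 3, "c2": 0, "c3": 7, "c4": 1}, the bound top=2 selects c2 and c4,
and the bound max=3 selects c2, c4 and c1.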

In addition, there are the following grouping options, for both the symmetric and asymmetric
cases:

┌─────────────────────┬──────────────────────────────────────────────────────────────────┐
│MIN_GROUP            │  minimum number of clonotypes in group to print (default = 1)    │
│MIN_GROUP_DONORS     │  minimum number of donors for a group to be printed (default = 1)│
│GROUP_CDR3H_LEN_VAR  │  only print groups having variable heavy chain CDR3 length       │
│GROUP_CDR3=x         │  only print groups containing the CDR3 amino acid sequence x     │
│GROUP_DONOR=d        │  only print groups containing a cell from the given donor;       │
│                     │  multiple instances may be used to jointly restrict              │
│GROUP_NAIVE          │  only show groups having an exact subclonotype with dref = 0     │
│GROUP_NO_NAIVE       │  only show groups lacking an exact subclonotype with dref = 0    │
├─────────────────────┼──────────────────────────────────────────────────────────────────┤
│NGROUP               │  don't display grouping messages                                 │
└─────────────────────┴──────────────────────────────────────────────────────────────────┘

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

options that display dataset variables

enclone has some variables that are computed for each dataset, and whose values may be printed as
a table in the summary, and not otherwise used (currently).  These may be specified using
DVARS=var1,...,varn.  The dataset-level variables that are supported currently are:
<feature>_cellular_r
<feature>_cellular_u
which are, respectively, the percentage of reads [UMIs] for the given feature that are in cells
that were called by the cellranger pipeline.  A feature is e.g. IGHG1_g etc. as discussed at
enclone help lvars.  To compute the metrics, the cellranger output file per_feature_metrics.csv
is read.  In addition, one may also use numeric values defined in the file
metrics_summary_json.json, but this file is in general not available.  To get it, it may be
necessary to rerun the cellranger pipeline using --vdrmode=disable and then copy the json file to
outs.  Finally, variable names may be prefaced with abbreviation:, and in such cases, it is the
abbreviation that is displayed in the table.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

options that control global variables

enclone has some global variables that can be computed, with values printed in the summary, and
not otherwise used (currently).  These may be specified using GVARS=var1,...,varn.  The global
variables that are supported currently are:
d_inconsistent_%
d_inconsistent_n
Please see https://10xgenomics.github.io/enclone/pages/auto/d_genes.html for more information.

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help indels
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

handling of insertions and deletions

enclone can recognize and display a single insertion or deletion in a contig relative to the
reference, so long as its length is divisible by three, is relatively short, and occurs within the
V segment, not too close to its right end.

These indels could be germline; however, most such events are already captured in a reference
sequence.  Currently the donor reference code in enclone does not recognize indels.

SHM deletions are rare, and SHM insertions are even more rare.

Deletions are displayed using hyphens (-).  If you use the var option for cvars, the hyphens will
be displayed in base space, where they are initially observed.  For the AMINO option, the deletion
is first shifted by up to two bases, so that the deletion starts at a base position that is
divisible by three.  Then the deleted amino acids are shown as hyphens.

Insertions are shown only in amino acid space, in a special per-chain column called notes that
appears if there is an insertion.  Colored amino acids are shown for the insertion, and the
position of the insertion is shown.  The notation e.g.
ins = TFT at 46
means that TFT is inserted after the first 46 amino acids.  Since the first amino acid (which is a
start codon) is numbered 0, the insertion actually occurs after the amino acid numbered 45.
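
The numbering convention can be checked with a short Python sketch.  It assumes a note of exactly
the form shown above; the function name is hypothetical, and the notes column may hold other text
that this does not handle.

```python
import re


def apply_insertion(amino_acids: str, note: str) -> str:
    """Return the sequence with the insertion described by a note such as 'ins = TFT at 46'."""
    match = re.fullmatch(r"ins = ([A-Z]+) at (\d+)", note.strip())
    if match is None:
        raise ValueError(f"unrecognized note: {note!r}")
    inserted, position = match.group(1), int(match.group(2))
    # "at 46" means after the first 46 amino acids, i.e. after zero-based index 45.
    return amino_acids[:position] + inserted + amino_acids[position:]
```

With zero-based indexing, the inserted amino acids then occupy positions 46, 47 and 48 of the
result.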

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help color
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

Here is the color palette that enclone uses for amino acids:

█ █ █ █ █ █ █ 

When enclone shows amino acids, it uses one of three coloring schemes.  The first scheme (the
default, also selected by the argument COLOR=codon) colors amino acids by codon, according to the
following scheme:

Alanine        A  GCT GCC GCA GCG
Arginine       R  CGT CGC CGA CGG AGA AGG
Asparagine     N  AAT AAC
Aspartic Acid  D  GAT GAC
Cysteine       C  TGT TGC
Glutamine      Q  CAA CAG
Glutamic Acid  E  GAA GAG
Glycine        G  GGT GGC GGA GGG
Histidine      H  CAT CAC
Isoleucine     I  ATT ATC ATA
Leucine        L  TTA TTG CTT CTC CTA CTG
Lysine         K  AAA AAG
Methionine     M  ATG
Phenylalanine  F  TTT TTC
Proline        P  CCT CCC CCA CCG
Serine         S  TCT TCC TCA TCG AGT AGC
Threonine      T  ACT ACC ACA ACG
Tryptophan     W  TGG
Tyrosine       Y  TAT TAC
Valine         V  GTT GTC GTA GTG

Colored amino acids enable the compact display of all the information in a clonotype.
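
For readers who want to work with the table programmatically, here is a small Python sketch that
turns it into a codon lookup.  The names are hypothetical, and the sketch says nothing about which
color enclone assigns to which codon.

```python
CODON_TABLE_TEXT = """
A GCT GCC GCA GCG
R CGT CGC CGA CGG AGA AGG
N AAT AAC
D GAT GAC
C TGT TGC
Q CAA CAG
E GAA GAG
G GGT GGC GGA GGG
H CAT CAC
I ATT ATC ATA
L TTA TTG CTT CTC CTA CTG
K AAA AAG
M ATG
F TTT TTC
P CCT CCC CCA CCG
S TCT TCC TCA TCG AGT AGC
T ACT ACC ACA ACG
W TGG
Y TAT TAC
V GTT GTC GTA GTG
"""

# Map each codon to its one-letter amino acid code.
CODON_TO_AA = {
    codon: fields[0]
    for fields in (line.split() for line in CODON_TABLE_TEXT.strip().splitlines())
    for codon in fields[1:]
}


def translate(dna: str) -> str:
    """Translate complete codons; trailing bases are dropped, stop codons raise KeyError."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, usable, 3))
```

The table covers the 61 sense codons; the three stop codons are not part of it.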

The second scheme, COLOR=codon-diffs, is the same as the first, except that some amino acids are
"grayed out".  An amino acid is highlighted (not grayed out) if (a) its codon differs from the
universal reference or (b) it is in a CDR3 and the codon is shared by half or less of the exact
subclonotypes having the given chain.  You may wish to use this with the CONX or CONP option.
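
The highlighting rule can be written as a short Python sketch.  It is our reading of the rule
above, not enclone's code, and the function and parameter names are hypothetical.

```python
def is_highlighted(codon: str, reference_codon: str, in_cdr3: bool,
                   codons_in_exacts: list) -> bool:
    """Decide whether an amino acid is highlighted (not grayed out) under COLOR=codon-diffs.

    codons_in_exacts holds the codon at this position for every exact subclonotype
    that has the given chain, including the one being displayed.
    """
    # (a) The codon differs from the universal reference.
    if codon != reference_codon:
        return True
    # (b) In a CDR3, the codon is shared by half or less of the exact subclonotypes.
    if in_cdr3:
        share = codons_in_exacts.count(codon) / len(codons_in_exacts)
        return share <= 0.5
    return False
```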

The third scheme for coloring amino acids, COLOR=property, colors amino acids by their properties,
according to the following scheme:

1. Aliphatic: A G I L P V
2. Aromatic: F W Y
3. Acidic: D E
4. Basic: R H K
5. Hydroxylic: S T
6. Sulfurous: C M
7. Amidic: N Q
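
The same classification as a Python lookup, for use in your own scripts (the names are
hypothetical):

```python
PROPERTY_CLASSES = {
    1: ("Aliphatic", "AGILPV"),
    2: ("Aromatic", "FWY"),
    3: ("Acidic", "DE"),
    4: ("Basic", "RHK"),
    5: ("Hydroxylic", "ST"),
    6: ("Sulfurous", "CM"),
    7: ("Amidic", "NQ"),
}

# Map each one-letter amino acid code to its class number from the list above.
AA_TO_CLASS = {
    aa: number
    for number, (_, members) in PROPERTY_CLASSES.items()
    for aa in members
}


def property_classes(amino_acids: str) -> list:
    """Return the class number for each amino acid in the sequence."""
    return [AA_TO_CLASS[aa] for aa in amino_acids]
```

The seven classes partition the twenty standard amino acids, so every amino acid gets exactly one
class.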

In all cases, the coloring is done using special characters, called ANSI escape characters.  Color
is used occasionally elsewhere by enclone, and there is also some bolding, accomplished using the
same mechanism.

Correct display of colors and bolding depends on having a terminal window that is properly set up.
As far as we know, this is generally the case, but it is possible that there are exceptions.  In
addition, in general, text editors do not correctly interpret escape characters.

For both of these reasons, you may wish to turn off the "special effects", either some or all of
the time.  You can do this by adding the argument
PLAIN
to any enclone command.
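
If you already have saved output that contains escape characters, they can also be removed
afterwards.  The Python sketch below is not part of enclone.  It assumes that the escape sequences
have the common form ESC [ parameters letter, and would miss any other kind.

```python
import re

# Matches sequences such as ESC[1m (bold), ESC[0m (reset) and ESC[38;5;75m (256-color).
ANSI_SEQUENCE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def strip_ansi(text: str) -> str:
    """Remove ANSI color and bolding sequences, leaving only the visible text."""
    return ANSI_SEQUENCE.sub("", text)
```

Rerunning the command with PLAIN is the simpler route when that is possible.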

We know of two methods to get enclone output into another document, along with colors:
1. Take a screenshot.
2. Open a new terminal window, type the enclone command, and then convert the terminal window into
a pdf.  See enclone help faq for related instructions.

▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
enclone help faq
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓

Frequently Asked Questions

We're sorry you're having difficulty! Please see the answers below or check out the other help
guides.

1. Why is my enclone output garbled?

We can think of two possibilities:

A. The escape characters that enclone emits for color and bolding are not getting
translated.  You have some options:
(a) Turn off escape character generation by adding PLAIN to your enclone commands.
This will work but you'll lose some information.
(b) If your terminal window is not translating escape characters, ask someone
with appropriate expertise to help you.  We have not observed this phenomenon,
but it should be fixable.
(c) If you're trying to view enclone output, with escape characters, using an editor,
that's probably not going to work well.

B. Perhaps enclone is emitting very wide lines.  Here are things you can do about this:
(a) Make your terminal window wider or reduce the font size.
(b) Identify the field that is very wide and use the column controls to remove that
field.  See the help for lvars and cvars.  For example,
AMINO=cdr3
may help, or even
AMINO=
These options may also help: CVARS=u FOLD_HEADERS.

2. Can I convert the enclone visual output into other forms?

Yes, there are choices:
A. On a Mac, you can screenshot from a terminal window.
B. Add the argument HTML to the enclone command line.  Then the output will be presented as html,
with title "enclone output".  If you want to set the title, use HTML="...".
C. You can then convert the html to pdf.  The best way on a Mac is to open Safari, which is the
best browser for this particular purpose, select the file where you've saved the html, and then
export as pdf.  Do not convert to pdf via printing, which produces a less readable file, and also
distorts colors.  (We do not know why the colors are distorted.)
D. If you want to put enclone output in a Google Doc, you can do it via approach A, although then
you won't be able to select text within the copied region.  Alternatively, if you open the html
file in a browser, you can then select text (including clonotype box text) and paste into a Google
Doc.  It will be pretty ugly, but will capture color and correctly render the box structure,
provided that you use an appropriate fixed-width font for that part of the Doc.  We found that
Courier New works, with line spacing set to 0.88.  You may have to reduce the font size.

3. Why is enclone slow for me?

On a single VDJ dataset, it typically runs for us in a few seconds, on a Mac or Linux server. 
Runs where we combine several hundred datasets execute in a couple minutes (on a server).  Your
mileage could vary, and we are interested in cases where it is underperforming.  Let us know.  We
are aware of several things that could be done to speed up enclone.

4. How does enclone fit into the 10x Genomics software ecosystem?

There are several parts to the answer:
• enclone is a standalone executable that by default produces human-readable output.
• You can also run enclone to produce parseable output (see enclone help parseable), and that
output can be digested using code that you write (for example, in R).
• When you run Cell Ranger to process 10x single cell immune profiling data, it in effect calls
enclone with a special option that yields only an output file for the 10x visualization tool
Loupe.
• Clonotypes may then be viewed using Loupe.  The view of a clonotype provided by Loupe is
different from the view provided by enclone.  Loupe shows a continuous expanse of bases across
each chain, which you can scroll across, rather than the compressed view of "important" bases or
amino acids that enclone shows.

5. What platforms does enclone run on?

1. Linux/x86-64 (that's most servers)
2. Mac.

However, we have not and cannot test every possible configuration of these platforms.  Please let
us know if you encounter problems!

6. How can I print out all the donor reference sequences?

Add the argument DONOR_REF_FILE=filename to your enclone command, and fasta for the donor
reference sequences will be dumped there.

7. How does enclone know what VDJ reference sequences I'm using?

If you used Cell Ranger version 4.0 or greater, then the VDJ reference file was included in the
outs directory, and so enclone knows the reference sequence from that.

For outs from older Cell Ranger versions, enclone has to guess which VDJ reference sequences were
used, and may or may not do so correctly.  As part of this, if you have mouse data from older Cell
Ranger versions, you need to supply the argument MOUSE on the command line.

It is also possible to set the reference sequence directly by adding REF=f to your command line,
where f is the name of your VDJ reference fasta file, but if that is different from the reference
supplied to Cell Ranger, then you will have to add the additional argument RE to recompute
annotations, and that will slow down enclone somewhat.

8. Can I provide data from more than one donor?

Yes.  Type enclone help input for details.  The default behavior of enclone is to prevent cells
from different donors from being placed in the same clonotype.  The MIX_DONORS option may be used
to turn off this behavior.  If you employ this option, then clonotypes containing cells from more
than one donor will be flagged as errors, unless you use the NWARN option to turn off those
warnings.  The primary reason for allowing entry of data from multiple donors is to allow
estimation of enclone's error rate.

9. Why are some command line argument values quoted?

Command line argument values that contain any of these characters ;|* need to be quoted like so
TCR="a;b"
to prevent the shell from interpreting them for a purpose completely unrelated to enclone.  This
is a trap, because forgetting to add the quotes can result in nonsensical and confusing behavior!
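
To see what goes wrong, here is a Python sketch that uses the standard shlex module to approximate
how a POSIX shell splits a command.  It is only an approximation of real shell behavior, and the
helper name is hypothetical.

```python
import shlex


def shell_tokens(command: str) -> list:
    """Split a command roughly as a POSIX shell would, treating ; and | as operators."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    return list(lexer)


# Without quotes, the semicolon ends the enclone command and "b" becomes a second command.
unquoted = shell_tokens("enclone TCR=a;b")    # ['enclone', 'TCR=a', ';', 'b']

# With quotes, enclone receives the whole value as a single argument.
quoted = shell_tokens('enclone TCR="a;b"')    # ['enclone', 'TCR=a;b']

# shlex.quote produces a safely quoted value when building a command line in a script.
safe_value = shlex.quote("a;b")               # "'a;b'"
```

The character * is not an operator, so this sketch does not show its effect: an unquoted * is
instead subject to filename expansion by the shell.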

10. If enclone fails, does it return nonzero exit status?

Yes, unless output of enclone is going to a terminal.  In that case, you'll always get zero.
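
In a script, this means the exit status is usable whenever the output is captured or redirected.
A Python sketch follows; the helper name is hypothetical, and a stand-in command is used so that
the example runs without enclone installed.

```python
import subprocess
import sys


def run_captured(command: list) -> subprocess.CompletedProcess:
    """Run a command with its output captured, so that stdout is a pipe and not a terminal."""
    return subprocess.run(command, capture_output=True, text=True)


# Stand-in for an enclone command line: a child process that exits with status 3.
result = run_captured([sys.executable, "-c", "import sys; sys.exit(3)"])
if result.returncode != 0:
    print(f"command failed with exit status {result.returncode}")
```

To use it with enclone, replace the stand-in list with your own command, one list element per
argument.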

11. Could a cell be missing from an enclone clonotype?

Yes, some cells are deliberately deleted.  The cell might have been deleted by one of the filters
described in enclone help special, and which you can turn off.  We also delete cells for which
more than four chains were found.

12. Can enclone print summary stats?

Yes, if you add the option SUMMARY, then some summary stats will be printed.  If you wish to
suppress visual output, then also add the option NOPRINT.

13. What is the notes column?

The notes column appears if one of two relatively rare events occurs:

1. An insertion is detected in a chain sequence, relative to the reference.

2. The end of the J segment on a chain sequence does not exactly coincide with
   the beginning of the C segment.
The latter could correspond to one of several phenomena:
a. A transcript has an insertion between its J and C segments.
   This can happen.  See e.g. Behlke MA, Loh DY.
   Alternative splicing of murine T-cell receptor beta-chain transcripts.
   Nature 322(1986), 379-382.
b. There is an error in a reference sequence segment.
   We have tried to eliminate all such errors from the built-in references for
   human and mouse.
c. A cell produced a nonstandard transcript and also standard ones, and the
   Cell Ranger pipeline just happened to pick a nonstandard one.
d. There was a technical artifact and the sequence does not actually represent
   an mRNA molecule.

Overlaps of length exactly one between J and C segments are not shown unless you specify the
option JC1.  The reason for this is that certain reference sequences (notably those from IMGT and
those supplied with Cell Ranger 3.1) often have an extra base at the beginning of their C
segments, resulting in annoying overlap notes for a large fraction of clonotypes.

14. Can I cap the number of threads used by enclone?

You can use the command-line argument MAX_CORES=n to cap the number of cores used in parallel
loops.  The number of threads used is typically one higher.

15. Can I use enclone if I have only gene expression data?

Possibly.  In some cases this works very well, but in other cases it does not.  Success depends on
dataset characteristics that have not been carefully investigated.  To attempt this, you need to
invoke Cell Ranger on the GEX dataset as if it were a VDJ dataset, and you need to specify to Cell
Ranger that the run is to be treated as BCR or TCR.  Two separate invocations can be used to get
both.  Note also that Cell Ranger has been only minimally tested for this configuration and that
this is not an officially supported Cell Ranger configuration.

16. How can I cite enclone?

10x Genomics, https://github.com/10XGenomics/enclone,
(your enclone version information will be printed here).
You can cite the enclone preprint, which can be found on bioRxiv in the link below or by using the
DOI 10.1101/2022.04.21.489084. The latest version of the preprint can be found at:
https://www.biorxiv.org/content/10.1101/2022.04.21.489084v1.

17. Can I print the enclone version?

Yes, type "enclone version".

18. Can enclone ingest multiple datasets from the same library?

If enclone detects significant (≥ 25%) barcode reuse between datasets, it will exit.  This
behavior can be overridden using the argument ACCEPT_REUSE.

19. Can I turn off all the filters used in joining clonotypes?

Pretty much.  You can run with the following arguments:
MAX_CDR3_DIFFS=100
MAX_LOG_SCORE=100
EASY
MAX_DIFFS=200
MAX_DEGRADATION=150,
however this will in general be very slow and not produce useful results.  Depending on what your
goal is, you may find it helpful to use some of these arguments, and with lower values.  You can
see the meaning of the arguments and their default values by typing enclone help how.

20. How can I send the developers an example?

Use filters to select a clonotype or clonotypes of interest.  Then you can cut and paste enclone
output into an email.  If you want the example to be reproducible by us, add the argument
SUBSET_JSON=filename to the command line, which will create a json file containing only data for
the barcodes in the clonotype.  Then send us the file as an email attachment.  This only works for
VDJ data, and we do not have a parallel mechanism for gene expression and antibody data.  Please
note also that running enclone on the barcodes from a single clonotype will not necessarily
reproduce the results you observed, because the clonotyping algorithm uses all the data, even if
only some clonotypes are selected.