5' UTR | leader | FWR1 | CDR1 | FWR2 | CDR2 | FWR3 | CDR3 | FWR4 | constant region | 3' UTR | poly-A tail |
---|
Here we describe functions that start from a reference sequence for a V segment, and from that compute all the features lying fully within such a sequence, comprising the leader up through FWR3.
These functions all work via a position-weight matrix (PWM) that characterizes a particular site. The matrices were empirically and manually configured to match the available truth data, insofar as this was possible.
The functions have been tested on several species: human, rhesus, rabbit, mouse and rat, as well as some other species that were less abundant in the truth data. Because of very high conservation in the V region, we anticipate that the functions would likely work well on most other mammalian species, but it is impossible to know without trying. We are interested in hearing about issues and also about improvements to the functions that we currently use.
The functions return values that are consistent with the Martin (or equivalently Chothia) system for BCR, and the IMGT system for TCR.
It is very important to understand that for BCR, different numbering systems give rise to different answers, and for example, different actual sequences for the CDR1 and CDR2 regions!
If the functions do not return consistent answers, i.e. with features ordered properly along the transcript, then we change their values to null. This can happen if a V reference sequence is defective, for example if it has a frameshift. It may happen very rarely in other cases. We encountered exactly one such instance and report it below.
We used three sources of truth data:
1. abYbank (link breaks transiently). These data consist of PDB (crystallographic protein) BCR records augmented with Martin numbering. There is one heavy and one light chain for each. These data were downloaded from http://www.abybank.org/abdb/Data/NR_LH_Combined_Martin.tar.bz2 on 7/12/2020. After filtering there were the following records:
811 | Homo sapiens |
761 | Mus musculus |
19 | Macacca mulatta |
13 | Oryctolagus cuniculus |
15 | Other species. |
We used this description to define the boundaries of the CDR1 and CDR2 sequences. (Note that this link is occasionally nonfunctional. There is an archived version here.)
2. STCRDab. (Note that this link is occasionally nonfunctional. There is an archived version here.) These data are also crystallographic protein records, but for TCR. They contain CDR1 and CDR2 annotations that we use as truth data. There were 832 records when downloaded on 7/22/20, all from human and mouse.
3. IMGT. (Note that this link is occasionally nonfunctional. There is an archived version here.) The data we used consisted of annotated reference sequences for human, rhesus, rabbit, mouse and rat (Rattus norvegicus). These were for BCR and TCR, with the exception of rat, for which we had only BCR data. The sequences themselves were also downloaded from IMGT, except for human and mouse, for which we used the V(D)J reference sequences supplied by 10X Genomics. After filtering there were 1245 sequences. This was our only source for FWR1 start position truth data.
Filtering and editing. We removed any sequences containing evidence of frameshifts, for example stop codons, and also removed sequences that did not reach into the CDR3 region. For the protein sources, we made the amino acid sequences look more like V reference sequences (for which the algorithm is designed), by prepending 21 amino acids and truncating 5 amino acids after the CDR3 start.
Reconciliation of numbering systems. To compensate for differences with the IMGT numbering system:
A. The FWR1 start algorithm could only be tested versus the IMGT data. We found a single
discrepancy, for rabbit gene IGHV1S18*01
, for which we report a start position one
smaller than what IMGT reports.
B. We tested transcripts from all three sources for consistency with both CDR1 and CDR2.
4TNW_1
in mouse,
where we report a CDR1 sequence that is one shorter on the right, and a CDR2 sequence that is one
longer on the left.
IGHV2-50*01
, where the entire region
before the CDR1 start is deleted in the reference sequence. For this case, our algorithm does not
report a value for CDR1.
The V region has already been located with the aid of the reference sequences. It begins with a start codon.
We search the first 40
(25
for IGL) amino acids in the V region,
looking for the following PWM:
pos | weight | values | notes |
---|---|---|---|
1 | 150 | D, E, G, K, Q | |
2 | 50 | A, I, Q, V | |
4 | 100 | L, M, V | |
6 | 250 | E, Q | |
22 | 250 | C | use weight 500 for IGH |
23 | 250 | C |
Example. Suppose that for IGH, starting at a particular position, we observe the following amino acids (downward reading relative column positions shown for clarity):
Then the total score for the particular position would be11111111112222
12345678901234567890123
QVQLVQSGAEVKKPGASVKVSCK
150 + 50 + 100 + 250 + 500 + 0 = 1050
, which is
the maximum score one could normally get, since the likelihood of observing a C
at both positions 22
and 23
is very small. The position would almost
certainly be the "winner" and thus declared the start of the FWR1 region. We make one
exception: if the first codon is C, we add one to the start position.
We use the following PWM:
pos | weight | values |
---|---|---|
1 | 50 | V |
2 | 30 | T |
3 | 200 | I, L, M, V |
4 | 80 | R, S, T |
5 | 250 | C |
8 | 100 | D, I, S |
and search for this motif, starting at zero-based positions between 25 and the FWR1 start plus 19, inclusive. As with FWR1, the highest scoring position is declared the winner. The value returned by the function then depends on the chain type. For IGK and IGL, we return 5 plus the winning position, and for other chain types, we return 8 plus the winning position.
We use the following PWM:
pos | weight | values | notes |
---|---|---|---|
1 | 50 | F, L, M, V | |
2 | 250 | Y | only for IGL |
3 | 250 | W | |
4 | 150 | Y | |
5 | 100 | R | |
6 | 250 | Q | |
9 | 110 | G | |
10 | 60 | K, Q | |
11 | 40 | A, G, K |
and search for this motif, starting at zero-based positions between 40 and 73, inclusive (except for IGH, where we stop at 62). The highest scoring position is declared the winner. The value returned by the function then depends on the chain type. For IGH, we return the winner. For IGK and IGL we add one to the winner. For TRA and TRB, we subtract one from the winner.
We use different functions, depending on the chain type. For IGH, we use the following PWM:
pos | weight | values |
---|---|---|
1 | 80 | L |
2 | 80 | E |
3 | 80 | W |
4 | 40 | I, L, M, V |
5 | 40 | A, G, S |
and search for this motif, between 8 and 13 amino acids after the start of FWR2 (as calculated above). Then we return 7 plus the highest scoring position.
For IGK and IGL, we simply return 15 plus the start of FWR2.
For TRA, we use the following PWM:
pos | weight | values |
---|---|---|
1 | 15 | L, P |
2 | 15 | E, I, Q, T, V |
3 | 20 | F, L |
4 | 35 | L |
5 | 15 | I, L |
and search for this motif, between 10 and 12 amino acids after the start of FWR2. Then we return 6 plus the highest scoring position.
For TRB, we simply return 17 plus the start of FWR2.
In order to determine the start of the FWR3 region, we first determine the start of the CDR3 region (see below).
We then use different functions, depending on the chain type. For IGH, we use the following PWM:
pos | weight | values |
---|---|---|
1 | 600 | N, Y |
2 | 500 | Y |
3 | 400 | A, N |
6 | 850 | F, L |
7 | 800 | K, Q, R |
9 | 1000 | K, R |
10 | 700 | A, F, L, V |
searching between the start of the CDR3 minus 40 and the start of the CDR3 minus 34, inclusive. Then we return the start of the motif, minus one.
For IGK and IGL we use the following PWM:
pos | weight | values |
---|---|---|
1 | 100 | G |
3 | 100 | P |
5 | 100 | R |
6 | 100 | F |
8 | 100 | G |
searching between the start of the CDR3 minus 35 and the start of the CDR3 minus 28, inclusive. Then we return the start of the motif.
For TRA we use the following PWM:
pos | weight | values |
---|---|---|
1 | 50 | E, K, N, V |
2 | 50 | A, E, K, T |
3 | 60 | E, S |
4 | 50 | D, N, S |
5 | 50 | N |
6 | 80 | G, M, R |
7 | 50 | A, F, I, Y |
8 | 50 | S, T |
9 | 50 | A, V |
10 | 50 | E, T |
12 | 50 | D, N |
searching between the start of the CDR3 minus 36 and the start of the CDR3 minus 33, inclusive. Then we return the start of the motif, plus one.
For TRB we use the following PWM:
pos | weight | values |
---|---|---|
1 | 50 | D, E, K |
2 | 200 | G, Q, S |
3 | 200 | D, E, G, S |
4 | 200 | I, L, M, V |
5 | 100 | P, S |
searching between the start of the CDR3 minus 38 and the start of the CDR3 minus 35, inclusive. Then we return the start of the motif, minus two.
We use the following PWM:
pos | weight | values |
---|---|---|
1 | 100 | A, L, V |
2 | 100 | E, Q, T |
3 | 100 | A, P, S |
4 | 100 | E, G, S |
5 | 100 | D, Q |
6 | 100 | A, S, T |
7 | 100 | A, G, S |
8 | 100 | L, T, V |
9 | 300 | Y |
10 | 100 | F, L, Y |
11 | 300 | C |
and look for this motif, terminating right before the CDR3, assuming that up to 18 amino acids of the CDR3 are present in the reference sequence.
Case 1. This is for the gene TRAV12-1
in human. We agree with IMGT for
CDR1, and IMGT shows CDR2 extending one amino acid further on the right. However the motif we use
strongly supports the right boundary that we give, matching at eight amino acids.
Note that in this case that the motif is designed
to overlap by one base with the CDR2.
MISLRVLLVILWLQLSWVWSQRKEVEQDPGPFNVPEGATVAFNCTYSNSASQSFFWYRQDCRKEPKLLMSVYSSGNEDGRFTAQLNRASQYISLLIRDSKLSDSATYLCVVN --▊▊▊▊▊▊▊--▊ EAEDGGAEAE-D KESNNMFSVT-N NT-S-RIT---- VK----Y-----
Case 2. This is for the gene TRBV6-8
in human. We agree with IMGT for CDR1,
and IMGT shows CDR2 extending one amino acid further on the right. However
the motif we use strongly supports the right boundary that we give:
MSLGLLCCAAFSLLWAGPVNAGVTQTPKFHILKTGQSMTLQCAQDMNHGYMSWYRQDPGMGLRLIYYSAAAGTTDKEVPNGYNVSRLNTEDFPLRLVSAAPSRTSVYLCASSY --▊-▊▊▊ --EGDIP --DQELS --KSGM- ----SV-Case 3. This is a different allele for the gene
TRBV6-8
, and the story is
exactly the same.
MSLGLLCCAAFSLLWAGPVNAGVTQTPKFHILKTGQSMTLQCAQDMNHGYMSWYRQDPGMGLRLIYYSAAAGTTDKEVPNGYNVSRLNTEDFPLRLVSAAPSQTSVYLCASSY --▊-▊▊▊ --EGDIP --DQELS --KSGM- ----SV-
Case 4. This is for the gene IGHV11-1*01
in rat.
For CDR1, we compute GFTFSNS
, which partially agrees with IMGT (GFTFSN
,
after adjusting for the numbering scheme), and fully agrees with ANARCI.
For CDR2, we do not compute a value, because the left and right motifs for CDR2 match the
reference sequence very strongly and overlap. The implied length of the CDR2 is minus one,
although this is for the Martin/Chothia numbering system, and for a different numbering system
which extended further, the length would be positive.
---I--- ---LA-- ---MG-- LEWVS-- ▊▊▊▊▊-- MELELLSFIFALLKCNVQCEAQLVESGGGWVQPGASLKLSCVASGFTFSNSWMYWFHQFPGKALEWIGGIKYAPSIKNRVTMSRENAKSTLNLQMNNVRSDYTATYYCVR --▊▊---▊-▊▊ -NYA--FK-KA -Y N--LQ-RF -------R--L ----------V
Case 5. This is for the gene IGHV3-1*01
in rat.
The computed value for CDR1 is GYSITSN
, which agrees with IMGT. For
CDR2, we compute SYSGS
, which agrees exactly with the
bracketing motifs (see below), whereas IMGT computes ISYSG
.
---L--- ---IA-- ---MS-- LEWVG-- -▊▊▊▊-- MRTLGLLYLLTALPGECILSEVQLQESGPGLVKPSQSLSLTCSVTGYSITSNYWGWIRKFPGNKMEWIGHISYSGSTSYNPSLKSRISITRDTSKNQFFLQLNSVTTEDTATYYCAR --▊▊--▊▊-▊- -NYA--FK-KA -Y-N--LQ-RF -------R--L ----------V
Case 6. This is for the gene IGHV8-30*01
in rat.
For CDR2, we report DWDGD
, whereas the adjusted value for IMGT
is IDWDG
. However the PWMs strongly support our choice:
---L--- ---IA-- ---MS-- LEWVG-- ▊▊▊▊▊-- MDRLTSSFLLLLVPAYVLSQVTLKESGPGMLQPSKTLSLTCSFSGFSLTIHGTHGAPMPIRGSLEWLAAIDWDGDKYYNPSLKSRLTVSKDTSNTQVFLKITSVDIADTATYYCARR -▊▊▊--▊▊-▊▊ -NYA--FK-KA -Y-N--LQ-RF -------R--L ----------V