Detecting V(D)J features

A V(D)J transcript has the following structure:

5' UTR	leader	FWR1	CDR1	FWR2	CDR2	FWR3	CDR3	FWR4	constant region	3' UTR	poly-A tail

Here we describe functions that start from a reference sequence for a V segment, and from that compute all the features lying fully within such a sequence, comprising the leader up through FWR3.

These functions all work via a position-weight matrix (PWM) that characterizes a particular site. The matrices were empirically and manually configured to match the available truth data, insofar as this was possible.

That one can have such a matrix at all is due to conservation, and in the vast majority of cases, the exact details of the matrix would not matter.
Yet the fine details of the matrix do affect a handful of cases, thus reflecting the characteristics of the truth data, including errors therein, if present.

The functions have been tested on several species: human, rhesus, rabbit, mouse and rat, as well as some other species that were less abundant in the truth data. Because of very high conservation in the V region, we anticipate that the functions would likely work well on most other mammalian species, but it is impossible to know without trying. We are interested in hearing about issues and also about improvements to the functions that we currently use.

The functions return values that are consistent with the Martin (or equivalently Chothia) system for BCR, and the IMGT system for TCR.

It is very important to understand that for BCR, different numbering systems give rise to different answers, and for example, different actual sequences for the CDR1 and CDR2 regions!

If the functions do not return consistent answers, i.e. with features ordered properly along the transcript, then we change their values to null. This can happen if a V reference sequence is defective, for example if it has a frameshift. It may happen very rarely in other cases. We encountered exactly one such instance and report it below.

Truth data

We used three sources of truth data:

1. abYbank (link breaks transiently). These data consist of PDB (crystallographic protein) BCR records augmented with Martin numbering. There is one heavy and one light chain for each. These data were downloaded from http://www.abybank.org/abdb/Data/NR_LH_Combined_Martin.tar.bz2 on 7/12/2020. After filtering there were the following records:

811	Homo sapiens
761	Mus musculus
19	Macacca mulatta
13	Oryctolagus cuniculus
15	Other species.

We used this description to define the boundaries of the CDR1 and CDR2 sequences. (Note that this link is occasionally nonfunctional. There is an archived version here.)

2. STCRDab. (Note that this link is occasionally nonfunctional. There is an archived version here.) These data are also crystallographic protein records, but for TCR. They contain CDR1 and CDR2 annotations that we use as truth data. There were 832 records when downloaded on 7/22/20, all from human and mouse.

3. IMGT. (Note that this link is occasionally nonfunctional. There is an archived version here.) The data we used consisted of annotated reference sequences for human, rhesus, rabbit, mouse and rat (Rattus norvegicus). These were for BCR and TCR, with the exception of rat, for which we had only BCR data. The sequences themselves were also downloaded from IMGT, except for human and mouse, for which we used the V(D)J reference sequences supplied by 10X Genomics. After filtering there were 1245 sequences. This was our only source for FWR1 start position truth data.

Filtering and editing. We removed any sequences containing evidence of frameshifts, for example stop codons, and also removed sequences that did not reach into the CDR3 region. For the protein sources, we made the amino acid sequences look more like V reference sequences (for which the algorithm is designed), by prepending 21 amino acids and truncating 5 amino acids after the CDR3 start.

Reconciliation of numbering systems. To compensate for differences with the IMGT numbering system:

IGH CDR1: we remove the last amino acid from the IMGT sequence.
IGK and IGL CDR1: we remove the first three and the last two amino acids from our sequence.
IGH CDR2: we remove the first and last amino acids from the IMGT sequence.
IGK and IGL CDR2: we remove the last four amino acids from our sequence.

Test results

A. The FWR1 start algorithm could only be tested versus the IMGT data. We found a single discrepancy, for rabbit gene IGHV1S18*01, for which we report a start position one smaller than what IMGT reports.

B. We tested transcripts from all three sources for consistency with both CDR1 and CDR2.

abYbank: there was a single discrepancy, for the light chain of 4TNW_1 in mouse, where we report a CDR1 sequence that is one shorter on the right, and a CDR2 sequence that is one longer on the left.
STCRDab: there were no discrepancies.
IMGT: there were 348 discrepancies. Of these 341 were cases where there were highly conserved positions for which the truth data for the FWR2 start appeared to be one too low. These were the W or Q positions in the PWM with score 250 (see below). By species there were 153 in rat, 97 in mouse and 91 in rhesus. We were surprised by this and wonder if somehow we may have misinterpreted the IMGT conventions. In general for these cases, both CDR1 and CDR2 were different. There were six additional cases where conservation supports the choice made by our algorithm (see appendix). Finally we observed a single case, rat gene IGHV2-50*01, where the entire region before the CDR1 start is deleted in the reference sequence. For this case, our algorithm does not report a value for CDR1.

Detecting the start of the FWR1 region

The V region has already been located with the aid of the reference sequences. It begins with a start codon.

We search the first 40 (25 for IGL) amino acids in the V region, looking for the following PWM:

pos	weight	values	notes
1	150	D, E, G, K, Q
2	50	A, I, Q, V
4	100	L, M, V
6	250	E, Q
22	250	C	use weight 500 for IGH
23	250	C

Example. Suppose that for IGH, starting at a particular position, we observe the following amino acids (downward reading relative column positions shown for clarity):

         11111111112222
12345678901234567890123
QVQLVQSGAEVKKPGASVKVSCK

Then the total score for the particular position would be 150 + 50 + 100 + 250 + 500 + 0 = 1050, which is the maximum score one could normally get, since the likelihood of observing a C at both positions 22 and 23 is very small. The position would almost certainly be the "winner" and thus declared the start of the FWR1 region. We make one exception: if the first codon is C, we add one to the start position.

Detecting the start of the CDR1 region

We use the following PWM:

pos	weight	values
1	50	V
2	30	T
3	200	I, L, M, V
4	80	R, S, T
5	250	C
8	100	D, I, S

and search for this motif, starting at zero-based positions between 25 and the FWR1 start plus 19, inclusive. As with FWR1, the highest scoring position is declared the winner. The value returned by the function then depends on the chain type. For IGK and IGL, we return 5 plus the winning position, and for other chain types, we return 8 plus the winning position.

Detecting the start of the FWR2 region

We use the following PWM:

pos	weight	values	notes
1	50	F, L, M, V
2	250	Y	only for IGL
3	250	W
4	150	Y
5	100	R
6	250	Q
9	110	G
10	60	K, Q
11	40	A, G, K

and search for this motif, starting at zero-based positions between 40 and 73, inclusive (except for IGH, where we stop at 62). The highest scoring position is declared the winner. The value returned by the function then depends on the chain type. For IGH, we return the winner. For IGK and IGL we add one to the winner. For TRA and TRB, we subtract one from the winner.

Detecting the start of the CDR2 region

We use different functions, depending on the chain type. For IGH, we use the following PWM:

pos	weight	values
1	80	L
2	80	E
3	80	W
4	40	I, L, M, V
5	40	A, G, S

and search for this motif, between 8 and 13 amino acids after the start of FWR2 (as calculated above). Then we return 7 plus the highest scoring position.

For IGK and IGL, we simply return 15 plus the start of FWR2.

For TRA, we use the following PWM:

pos	weight	values
1	15	L, P
2	15	E, I, Q, T, V
3	20	F, L
4	35	L
5	15	I, L

and search for this motif, between 10 and 12 amino acids after the start of FWR2. Then we return 6 plus the highest scoring position.

For TRB, we simply return 17 plus the start of FWR2.

Detecting the start of the FWR3 region

In order to determine the start of the FWR3 region, we first determine the start of the CDR3 region (see below).

We then use different functions, depending on the chain type. For IGH, we use the following PWM:

pos	weight	values
1	600	N, Y
2	500	Y
3	400	A, N
6	850	F, L
7	800	K, Q, R
9	1000	K, R
10	700	A, F, L, V

searching between the start of the CDR3 minus 40 and the start of the CDR3 minus 34, inclusive. Then we return the start of the motif, minus one.

For IGK and IGL we use the following PWM:

pos	weight	values
1	100	G
3	100	P
5	100	R
6	100	F
8	100	G

searching between the start of the CDR3 minus 35 and the start of the CDR3 minus 28, inclusive. Then we return the start of the motif.

For TRA we use the following PWM:

pos	weight	values
1	50	E, K, N, V
2	50	A, E, K, T
3	60	E, S
4	50	D, N, S
5	50	N
6	80	G, M, R
7	50	A, F, I, Y
8	50	S, T
9	50	A, V
10	50	E, T
12	50	D, N

searching between the start of the CDR3 minus 36 and the start of the CDR3 minus 33, inclusive. Then we return the start of the motif, plus one.

For TRB we use the following PWM:

pos	weight	values
1	50	D, E, K
2	200	G, Q, S
3	200	D, E, G, S
4	200	I, L, M, V
5	100	P, S

searching between the start of the CDR3 minus 38 and the start of the CDR3 minus 35, inclusive. Then we return the start of the motif, minus two.

Detecting the start of the CDR3 region

We use the following PWM:

pos	weight	values
1	100	A, L, V
2	100	E, Q, T
3	100	A, P, S
4	100	E, G, S
5	100	D, Q
6	100	A, S, T
7	100	A, G, S
8	100	L, T, V
9	300	Y
10	100	F, L, Y
11	300	C

and look for this motif, terminating right before the CDR3, assuming that up to 18 amino acids of the CDR3 are present in the reference sequence.

Appendix: six miscellaneous discrepancies with IMGT

Case 1. This is for the gene TRAV12-1 in human. We agree with IMGT for CDR1, and IMGT shows CDR2 extending one amino acid further on the right. However the motif we use strongly supports the right boundary that we give, matching at eight amino acids. Note that in this case that the motif is designed to overlap by one base with the CDR2.

MISLRVLLVILWLQLSWVWSQRKEVEQDPGPFNVPEGATVAFNCTYSNSASQSFFWYRQDCRKEPKLLMSVYSSGNEDGRFTAQLNRASQYISLLIRDSKLSDSATYLCVVN
                                                                          --▊▊▊▊▊▊▊--▊
                                                                          EAEDGGAEAE-D
                                                                          KESNNMFSVT-N
                                                                          NT-S-RIT----
                                                                          VK----Y-----

Case 2. This is for the gene TRBV6-8 in human. We agree with IMGT for CDR1, and IMGT shows CDR2 extending one amino acid further on the right. However the motif we use strongly supports the right boundary that we give:

MSLGLLCCAAFSLLWAGPVNAGVTQTPKFHILKTGQSMTLQCAQDMNHGYMSWYRQDPGMGLRLIYYSAAAGTTDKEVPNGYNVSRLNTEDFPLRLVSAAPSRTSVYLCASSY
                                                                        --▊-▊▊▊
                                                                        --EGDIP
                                                                        --DQELS
                                                                        --KSGM-
                                                                        ----SV-

Case 3. This is a different allele for the gene TRBV6-8, and the story is exactly the same.

MSLGLLCCAAFSLLWAGPVNAGVTQTPKFHILKTGQSMTLQCAQDMNHGYMSWYRQDPGMGLRLIYYSAAAGTTDKEVPNGYNVSRLNTEDFPLRLVSAAPSQTSVYLCASSY
                                                                        --▊-▊▊▊
                                                                        --EGDIP
                                                                        --DQELS
                                                                        --KSGM-
                                                                        ----SV-

Case 4. This is for the gene IGHV11-1*01 in rat.

For CDR1, we compute GFTFSNS, which partially agrees with IMGT (GFTFSN, after adjusting for the numbering scheme), and fully agrees with ANARCI. For CDR2, we do not compute a value, because the left and right motifs for CDR2 match the reference sequence very strongly and overlap. The implied length of the CDR2 is minus one, although this is for the Martin/Chothia numbering system, and for a different numbering system which extended further, the length would be positive.

                                                               ---I---
                                                               ---LA--
                                                               ---MG--
                                                               LEWVS--
                                                               ▊▊▊▊▊--
MELELLSFIFALLKCNVQCEAQLVESGGGWVQPGASLKLSCVASGFTFSNSWMYWFHQFPGKALEWIGGIKYAPSIKNRVTMSRENAKSTLNLQMNNVRSDYTATYYCVR
                                                                     --▊▊---▊-▊▊
                                                                     -NYA--FK-KA
                                                                     -Y N--LQ-RF
                                                                     -------R--L
                                                                     ----------V

Case 5. This is for the gene IGHV3-1*01 in rat.

The computed value for CDR1 is GYSITSN, which agrees with IMGT. For CDR2, we compute SYSGS, which agrees exactly with the bracketing motifs (see below), whereas IMGT computes ISYSG.

                                                                ---L---
                                                                ---IA--
                                                                ---MS--
                                                                LEWVG--
                                                                -▊▊▊▊--
MRTLGLLYLLTALPGECILSEVQLQESGPGLVKPSQSLSLTCSVTGYSITSNYWGWIRKFPGNKMEWIGHISYSGSTSYNPSLKSRISITRDTSKNQFFLQLNSVTTEDTATYYCAR
                                                                            --▊▊--▊▊-▊-
                                                                            -NYA--FK-KA
                                                                            -Y-N--LQ-RF
                                                                            -------R--L
                                                                            ----------V

Case 6. This is for the gene IGHV8-30*01 in rat.

For CDR2, we report DWDGD, whereas the adjusted value for IMGT is IDWDG. However the PWMs strongly support our choice:

                                                               ---L---
                                                               ---IA--
                                                               ---MS--
                                                               LEWVG--
                                                               ▊▊▊▊▊--
MDRLTSSFLLLLVPAYVLSQVTLKESGPGMLQPSKTLSLTCSFSGFSLTIHGTHGAPMPIRGSLEWLAAIDWDGDKYYNPSLKSRLTVSKDTSNTQVFLKITSVDIADTATYYCARR
                                                                           -▊▊▊--▊▊-▊▊
                                                                           -NYA--FK-KA
                                                                           -Y-N--LQ-RF
                                                                           -------R--L
                                                                           ----------V