Looks like they are correct and are treated as exceptions, but probably non-functional
For TRAJ35*01:
(6) noncanonical J-MOTIF: Cys-Gly-X-Gly instead of Phe-Gly-X-Gly. The functionality of TRAJ35*01, initially considered as ORF owing to the presence of Cys (C118) instead of Phe (F118) in the J-MOTIF, has been changed to F (functional) as TRAJ35*01 has been found rearranged and expressed in several specific cDNA clones (comments added by G. Folch and M.-P. Lefranc on the 23/01/2019, following the identification of an additional clone and a question by C. Joos (Germany)).
https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRAJ
GeneCards says it's non-functional:
https://www.genecards.org/cgi-bin/carddisp.pl?gene=TRAJ35
TRBJ2-2P*01 also looks like an exception and a non-functional open reading frame:
(1) This sequence has diverged considerably from that of other TRBJ2-2: it has no canonical J-NONAMER [2], J-PHE 118 is replaced by Gly (G) and G 119 by R in J-MOTIF.
https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRBJ
https://www.genecards.org/cgi-bin/carddisp.pl?gene=TRBJ2-2P
TRBJ2-7*02
(2) J-PHE 118 replaced by Val in J-MOTIF.
from scirpy.
I just checked the IgBLAST auxiliary files that indicate where the CDR3 ends and cross-checked them against the FASTA files, and it looks like the codons are correct for those 3 genes as well. So it should be all right to use J-junction-motifs.txt.
from scirpy.
Hi Rachel,
Thanks for bringing this up. This was meant as a workaround to make it work with ir_dist (which is currently hardcoded to work on the junction sequence).
A proper solution is probably to
- convert junction to cdr3 (which means just removing the constant amino acids)
- provide an option in ir_dist to use the cdr3 column instead of junction_aa
For now, as a workaround, I would suggest storing CDR3 sequences in junction_aa in both the query and the database object.
You can trim the string with awkward array functions:
import awkward as ak
mdata["airr"].obsm["airr"]["junction_aa"] = ak.str.slice(mdata["airr"].obsm["airr"]["junction_aa"], 1, -1)
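For readers without awkward arrays at hand, the same trimming can be sketched in plain Python. This is only an illustration, not scirpy's implementation: `junction_to_cdr3` is a hypothetical helper name, and it assumes each junction carries exactly one conserved residue at each end.

```python
from typing import Optional


def junction_to_cdr3(junction_aa: Optional[str]) -> Optional[str]:
    """Strip the first and last residue of a junction (hypothetical helper)."""
    # Missing values stay missing; a junction shorter than 3 residues
    # has nothing left between the two conserved positions.
    if junction_aa is None or len(junction_aa) < 3:
        return None
    return junction_aa[1:-1]


print(junction_to_cdr3("CASSLGQAYEQYF"))  # ASSLGQAYEQY
```

Only `None` is treated as missing here; NaN values coming out of a pandas column would need to be handled separately.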
from scirpy.
@zktuong, just wanted to ask you to be sure:
- junction_aa -> cdr3 should be trivial, right? Just clip off the first and last amino acid. Any caveats?
- cdr3 -> junction_aa: Is it correct that the first AA is always C and the last is always W/F? Can this be robustly inferred from the V call?
from scirpy.
junction_aa -> cdr3 should be trivial, right? Just clip off the first and last amino acid. Any caveats?
Yes, that should be correct.
cdr3 -> junction_aa: Is it correct that the first AA is always C and the last is always W/F? Can this be robustly inferred from the V call?
Yes, the first is always C and the last should always be either F or W. It should be possible to infer the F/W from the second codon from the start of the J call (TTT/TTC for F or TGG for W).
from scirpy.
Hi @racng,
I believe I fixed this in #476 by extracting the junction_aa sequence from the full protein sequence based on the start/end coordinates provided in IEDB. It would be great if you could give this a try!
You can install that version using
pip install git+https://github.com/scverse/scirpy@issue-469
Please make sure to remove the cached version of iedb before rerunning ir.datasets.iedb by deleting the data directory that was created.
from scirpy.
@grst Thank you for working on a fix for this issue!
I have copied the new function into a script and tested it.
For some reason I am getting a lot of NaN values for both junction_aa and cdr3_aa.
adata = iedb(cached=True, cache_path='test/iedb.h5ad')
ir.get.airr(adata, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 30080
ir.get.airr(adata, 'cdr3_aa')['VDJ_1_cdr3_aa'].isna().sum()
# 30110
# Old copy of iedb reference
ir.get.airr(adata_old, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 118
from scirpy.
Bad news! I took another look and it seems that the Start/End position and Protein sequence are indeed only available for ~5000 receptors:
>>> iedb_df["Chain 1 Protein Sequence"].dropna().size
5083
For the rest, there is "CDR3 Curated", but not "CDR3 Calculated" available. Taking a closer look at "CDR3 Curated", it seems that some (but not all) sequences there are actually junction sequences, including the C and W/F.
So I'm afraid this would take quite some cleanup to get it right!
On the scirpy side, I could consider adding the option to use cdr3_aa sequences, instead of junction_aa, for sequence comparisons. Then the CDR3 sequences in the database could remain as they are.
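To get a rough idea of how many "CDR3 Curated" entries are really junctions, one could flag sequences that start with C and end with W or F. This is a heuristic sketch only: `looks_like_junction` is a hypothetical name, the example sequences are made up for illustration, a genuine CDR3 can match the pattern by chance, and junctions from J genes with a non-canonical terminal residue would be missed.

```python
def looks_like_junction(seq: str) -> bool:
    """Heuristic: True if the sequence starts with C and ends with W or F."""
    seq = seq.strip().upper()
    return len(seq) >= 3 and seq[0] == "C" and seq[-1] in ("W", "F")


# Made-up example sequences, not taken from IEDB.
curated = ["CASSLGQAYEQYF", "ASSLGQAYEQY", "CAVRDSNYQLIW"]
flags = [looks_like_junction(s) for s in curated]
print(flags)  # [True, False, True]
```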
from scirpy.
I think I found a solution. We can use the J-motif sequences in J-region-motifs.tsv, generated by IMGTgeneDL run in stitchr mode, to determine the terminal residue of the J gene. It "contains automatically inferred CDR3 junction ending motifs and residues (using the process established in the autoDCR TCR assignation tool)".
I am attaching it here:
J-region-motifs.txt
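As a rough sketch of how such a lookup could be used to rebuild a junction from a CDR3: the table below is an assumption for illustration and holds only the three non-canonical residues quoted from IMGT in this thread. A real table would be built from the attached file, whose layout is not shown here. `cdr3_to_junction` is a hypothetical helper, and it assumes the leading residue is always C.

```python
from typing import Optional

# Illustrative subset only: terminal J residues quoted from IMGT in this thread.
J_TERMINAL_RESIDUE = {
    "TRAJ35*01": "C",    # Cys instead of Phe at position 118
    "TRBJ2-2P*01": "G",  # Gly instead of Phe at position 118
    "TRBJ2-7*02": "V",   # Val instead of Phe at position 118
}


def cdr3_to_junction(cdr3_aa: str, j_call: str) -> Optional[str]:
    """Add the leading C and the J-specific terminal residue (hypothetical helper)."""
    terminal = J_TERMINAL_RESIDUE.get(j_call)
    if terminal is None:
        # Unknown J gene: return None rather than guessing F or W.
        return None
    return "C" + cdr3_aa + terminal


print(cdr3_to_junction("ASSLG", "TRBJ2-7*02"))  # CASSLGV
```

Returning `None` for genes missing from the table keeps unresolved cases visible instead of silently defaulting to F.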
from scirpy.
@zktuong, what do you say about this one? Is that wrong, or is it the ever-present exception to the rule in biology?
(L and V are "not confident" but C is)
(this is from the J-junction-motifs.txt above)
from scirpy.