
stitchr's People

Contributors

jamieheather

stitchr's Issues

errors if J interface is not found

determine_j_interface is supposed to return a two element tuple. However, if the J interface is not found, stitchr is quite cryptic. Example:

$ python stitchr.py -v TRAV1-2 -j TRAJ33 -cdr3 CAVERGEGF
Traceback (most recent call last):
  File "stitchr.py", line 107, in <module>
    c_term_nt_trimmed, cdr3_c_end = fxn.determine_j_interface(input_args['cdr3'], c_term_nt, c_term_aa)
TypeError: 'NoneType' object is not iterable

This is because the condition inside the for loop in determine_j_interface is never met, so the for loop ends and defaults to return None.

I would guess that either determine_j_interface should raise an error, or stitchr should check the return value and exit with a nicer message.
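
One way to make this failure explicit is to raise an error instead of falling off the end of the loop. The sketch below is not stitchr's actual implementation: `find_j_overlap`, its matching rule, and the J-region string are hypothetical stand-ins chosen only to show the pattern.

```python
def find_j_overlap(cdr3, j_region_aa, min_len=2):
    """Return (overlap, index) for the longest CDR3 suffix found in the J region.

    Hypothetical stand-in for determine_j_interface: it raises rather than
    silently returning None when nothing matches.
    """
    for length in range(len(cdr3), min_len - 1, -1):
        tail = cdr3[-length:]
        idx = j_region_aa.find(tail)
        if idx != -1:
            return tail, idx
    raise ValueError(
        "No overlap found between CDR3 %r and the J gene; "
        "check that the CDR3 matches the J gene given." % cdr3
    )


# Illustrative J-region amino acids (an assumption, not read from the data files).
j_aa = "DSNYQLIWGAGTKLIIKP"
print(find_j_overlap("CAVLDSNYQLIW", j_aa))
try:
    find_j_overlap("CAVERGEGF", j_aa)
except ValueError as err:
    print("Error:", err)
```

With this shape the caller can either let the exception propagate or catch it and print a friendlier message before exiting.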

TRDV2 error

I get the following error when running stitchr or thimble on TRDV2:

can only concatenate list (not "str") to list

It appears to be a python error. I have no issues with any of the other Delta chains. Do you know how to resolve?

Integration with tidytcells

Hello! I'm Yuta, a current PhD candidate at the Chain lab. I'm a big fan and user of stitchr, so thank you for the software!

I understand that stitchr by construction expects IMGT-compliant V/J gene symbols as input, and I often find myself using stitchr in conjunction with a tool I wrote called tidytcells, which can dynamically auto-correct non-standard TR gene symbols.

Do you think tidytcells would be useful to integrate into stitchr's input parsing? Or, if not, could it be referenced in the docs where IMGT standardization of input is mentioned?

Kind of a plug for tidytcells 😁 but also thought it might actually help some people out!

Can't get demo TCRs or help functions to print

Hey Jamie—have tried a few times with Python 3.6.5 and can't seem to get either of the example calls to work, or even the help command. I have biopython installed and have no issue importing it as import Bio. All three of the following snippets:

  1. Demo TCR 1
python stitchr.py -v TRBV7-3*01 -j TRBJ1-1*01 -cdr3 CASSYLQAQYTEAFF
  2. Demo TCR 2
python stitchr.py -v TRAV1-2 -j TRAJ33 -cdr3 CAVLDSNYQLIW
  3. Help menu
python stitchr.py -h

... all produce the exact same error message:

File "stitchr.py", line 81
    print "\tCannot find", r.upper(), "gene", input_args[r] + \
                        ^
SyntaxError: invalid syntax

I tried poking around and adding some quotes around lines 78-83 but can't seem to find exactly where the missing apostrophe is.

Best wishes,
Wyatt
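
The caret in that traceback points at a Python 2 style print statement, which Python 3 rejects because `print` is a function there, so no quote is actually missing. A minimal sketch of the Python 3 form follows. The rest of the original message is cut off in the traceback, so it is left out here, and `missing_gene_message` and the meaning of its arguments are assumptions made for illustration.

```python
def missing_gene_message(region, gene):
    """Build the message as one string (hypothetical helper)."""
    return "\tCannot find " + region.upper() + " gene " + gene


# Python 3: print is called as a function, with parentheses.
print(missing_gene_message("v", "TRBV7-3*01"))
```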

Simplifying output / silent mode

Great tool - one feature request would be to have an output switch for a simplified output reporting of just the NT or AA sequence.

This would make it easier to implement stitchr with other pipelines (in our case in R through reticulate), without having to process the intermediate output files (i.e. strip out the fasta header and select just AA or NT). --aa/--nt ?
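
Until such a switch exists, the printed output can be reduced to bare sequences with a few lines of parsing. This is only a sketch: it assumes the `>nt|...` and `>aa|...` FASTA-style records that stitchr output takes in other reports on this page, `split_stitchr_fasta` is a hypothetical helper, and the example sequences are truncated for brevity.

```python
def split_stitchr_fasta(text):
    """Map 'nt' / 'aa' to its sequence, with headers and line breaks removed."""
    seqs = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # The record type is the text before the first '|' in the header.
            current = line[1:].split("|", 1)[0]
            seqs[current] = ""
        elif current is not None and line and not line.startswith("-"):
            seqs[current] += line
    return seqs


example = """\
----------
>nt||TRBV7-3*01|TRBJ1-1*01|TRBC1*01|CASSYLQAQYTEAFF|TRBV7-3*01(L)
ATGGGCACC
AGGCTC

>aa||TRBV7-3*01|TRBJ1-1*01|TRBC1*01|CASSYLQAQYTEAFF|TRBV7-3*01(L)
MGTRL
"""
print(split_stitchr_fasta(example)["aa"])
```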

exits with an error but zero status code

Multiple functions in functions.py exit with sys.exit if they fail input validation. However, they exit with a zero status code:

$ python stitchr.py -v TRBV17 -j TRBJ1-2 -cdr3 YSSGEAGSGYTF
Error, CDR3 does not fit expected parameters. Please ensure it includes the conserved C/F residues.
$ echo $?
0

Exiting with a zero status code means that, for all practical purposes, the script reports success. The status code should be non-zero.

I would recommend raising ValueError instead of calling sys.exit, at least for this specific case.
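
A sketch of that suggestion, under stated assumptions: `validate_cdr3` and its rule (starts with C, ends with F or W) are hypothetical and may not match the checks in functions.py, and `run` is an illustrative wrapper that converts the exception into a shell-style status.

```python
import sys


def validate_cdr3(cdr3):
    """Hypothetical check: raise ValueError instead of calling sys.exit."""
    if not (cdr3.startswith("C") and cdr3[-1:] in ("F", "W")):
        raise ValueError(
            "CDR3 does not fit expected parameters. "
            "Please ensure it includes the conserved C/F residues."
        )
    return cdr3


def run(cdr3):
    """Return a status code: 0 on success, 1 on invalid input."""
    try:
        validate_cdr3(cdr3)
    except ValueError as err:
        print("Error:", err, file=sys.stderr)
        return 1
    return 0


print(run("YSSGEAGSGYTF"))
print(run("CASSYLQAQYTEAFF"))
```

A command-line entry point could then end with `sys.exit(run(...))`. Passing a string to `sys.exit` also works: Python prints it to stderr and exits with status 1.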

Alternative non-templated codon usage?

Currently stitchr uses Kazusa frequency data to pick codons for residues that are not wholly germline-encodable when providing AA CDR3s. This basically picks the most common nucleotide triplet used in the relevant species' exome.

I'm wondering whether it might be useful in certain circumstances to instead draw on codon frequency usage taken from non-templated sections of CDR3s of the corresponding species/locus. This might theoretically deliver NT sequences that are more representative of rearranged TCRs of the specific locus, which are not necessarily the same as the wider exome.
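
To make the idea concrete, the sketch below picks the most frequent codon per amino acid from a table of counts, so swapping the exome-derived table for one built from non-templated CDR3 regions changes the choice. All counts are made up for illustration, and `most_common_codons` is a hypothetical helper rather than stitchr's own codon-selection code.

```python
from collections import Counter, defaultdict


def most_common_codons(codon_counts, codon_to_aa):
    """Pick the most frequently observed codon for each amino acid."""
    by_aa = defaultdict(Counter)
    for codon, count in codon_counts.items():
        by_aa[codon_to_aa[codon]][codon] += count
    return {aa: counts.most_common(1)[0][0] for aa, counts in by_aa.items()}


# Made-up counts for illustration only, not real usage data.
codon_to_aa = {"GGG": "G", "GGC": "G", "CTG": "L", "TTA": "L"}
exome_counts = {"GGG": 10, "GGC": 30, "CTG": 40, "TTA": 5}
cdr3_counts = {"GGG": 50, "GGC": 20, "CTG": 15, "TTA": 2}

print(most_common_codons(exome_counts, codon_to_aa))
print(most_common_codons(cdr3_counts, codon_to_aa))
```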

Importing stitchr for use in other scripts - obtain stitched aa sequence without C region information

I am actually trying to run stitchr inside a Shiny R app, for which I use reticulate to run it inside a conda environment.

For that, I do the following:

reticulate::use_condaenv('myenv')
reticulate::py_install("stitchr", pip = TRUE)

Then I create a run_stitchr.py script as detailed in https://jamieheather.github.io/stitchr/importing.html

run_stitchr.py looks exactly like the one in the link above:

# import stitchr
from Stitchr import stitchrfunctions as fxn
from Stitchr import stitchr as st

# specify details about the locus to be stitched
chain = 'TRB'
species = 'HUMAN'

# initialise the necessary data
tcr_dat, functionality, partial = fxn.get_imgt_data(chain, st.gene_types, species)
codons = fxn.get_optimal_codons('', species)

# provide details of the rearrangement to be stitched
tcr_bits = {'v': 'TRBV7-3*01', 'j': 'TRBJ1-1*01', 'cdr3': 'CASSYLQAQYTEAFF',
            'l': 'TRBV7-3*01', 'c': 'TRBC1*01',
            'skip_c_checks': False, 'species': species, 'seamless': False,
            '5_prime_seq': '', '3_prime_seq': '', 'name': 'TCR'}

# then run stitchr on that rearrangement
stitched = st.stitch(tcr_bits, tcr_dat, functionality, partial, codons, 3, '')

print(stitched)

and I run it as follows:

run_results <- reticulate::py_run_file(run_script)

This works perfectly, and run_results$stitched contains the following output:

[[1]]
[1] "TCR"             "TRBV7-3*01"      "TRBJ1-1*01"      "TRBC1*01"        "CASSYLQAQYTEAFF"
[6] "TRBV7-3*01(L)"  

[[2]]
[1] "ATGGGCACCAGGCTCCTCTGCTGGGCAGCCCTGTGCCTCCTGGGGGCAGATCACACAGGTGCTGGAGTCTCCCAGACCCCCAGTAACAAGGTCACAGAGAAGGGAAAATATGTAGAGCTCAGGTGTGATCCAATTTCAGGTCATACTGCCCTTTACTGGTACCGACAAAGCCTGGGGCAGGGCCCAGAGTTTCTAATTTACTTCCAAGGCACGGGTGCGGCAGATGACTCAGGGCTGCCCAACGATCGGTTCTTTGCAGTCAGGCCTGAGGGATCCGTCTCTACTCTGAAGATCCAGCGCACAGAGCGGGGGGACTCAGCCGTGTATCTCTGTGCCAGCAGCTACCTGCAGGCCCAGTACACTGAAGCTTTCTTTGGACAAGGCACCAGACTCACAGTTGTAGAGGACCTGAACAAGGTGTTCCCACCCGAGGTCGCTGTGTTTGAGCCATCAGAAGCAGAGATCTCCCACACCCAAAAGGCCACACTGGTGTGCCTGGCCACAGGCTTCTTCCCCGACCACGTGGAGCTGAGCTGGTGGGTGAATGGGAAGGAGGTGCACAGTGGGGTCAGCACGGACCCGCAGCCCCTCAAGGAGCAGCCCGCCCTCAATGACTCCAGATACTGCCTGAGCAGCCGCCTGAGGGTCTCGGCCACCTTCTGGCAGAACCCCCGCAACCACTTCCGCTGTCAAGTCCAGTTCTACGGGCTCTCGGAGAATGACGAGTGGACCCAGGATAGGGCCAAACCCGTCACCCAGATCGTCAGCGCCGAGGCCTGGGGTAGAGCAGACTGTGGCTTTACCTCGGTGTCCTACCAGCAAGGGGTCCTGTCTGCCACCATCCTCTATGAGATCCTGCTAGGGAAGGCCACCCTGTATGCTGTGCTGGTCAGCGCCCTTGTGTTGATGGCCATGGTCAAGAGAAAGGATTTC"

[[3]]
[1] 0

Now here I have 2 questions:

1- using this approach, is there a possibility to obtain the amino acid stitched sequence instead (similarly to command-line stitchr, which returns both DNA and aa sequences)?

2- I don't have C region information, but if I just write 'c': '' in the tcr_bits section of the run_stitchr.py above, I get the following:

Error: a CONSTANT sequence region has not been found for gene in the IMGT data for this chain/species. Please check your TCR and species data.

No C region information does not seem to be a problem for command-line stitchr... what is the correct way to run stitchr with no C region information using the approach above with run_stitchr.py?

Many thanks!
Daniel
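
On the first question, one option is to translate the returned nucleotide sequence directly. This is a sketch under the assumption that the second element of the returned value (index 1 in Python, `[[2]]` in the R output above) is the in-frame nucleotide sequence. `translate` is a hypothetical helper using the standard genetic code, not part of stitchr's API; Biopython's `Seq.translate` would do the same job.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nt):
    """Translate an in-frame nucleotide string; trailing partial codons are ignored."""
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3
    return "".join(CODONS[nt[i:i + 3]] for i in range(0, usable, 3))


# First 24 nucleotides of the stitched sequence shown above.
print(translate("ATGGGCACCAGGCTCCTCTGCTGG"))
```

In run_stitchr.py this would be called as `translate(stitched[1])`, again assuming that index holds the nucleotide sequence.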

Make stitchr work on mouse TCRs

Perhaps with a command line option to choose between species.

NB: it seems IMGT doesn't have fully annotated murine constant regions, so some manual splicing/checking will be required to get the final EX1+EX2+EX3+EX4 sequences.

none of the examples in README are valid

The two examples provided in the README:

python stitchr.py -v TRBV7-3*01 -j TRBJ1-3*01 -cdr3 CASSPGRGTTNEKLFF
python stitchr.py -v TRAV1-2 -j TRAJ33 -cdr3 CAVERGEGF

do not work (see also issue #1). Manually checking the data files, it seems that the problem is in the examples, not in the actual stitchr code.

CDR1/2

Works brilliantly! Thank you so much, great work.

Just had an enquiry about inclusion of CDR1/CDR2. Is there an easy way to include these, and more broadly do you see any merit in doing so? How faithful should the current approach be?

I'm using these amino acids to then perform structural modelling for context.

Many thanks!

First example at https://jamieheather.github.io/stitchr/installation.html not working as expected

Hello! I want to start running stitchr routinely, but right after installation, I find the following problem:

The first example at https://jamieheather.github.io/stitchr/installation.html does not work the way I expected it to.

This is the command:

stitchr -v TRBV7-3*01 -j TRBJ1-1*01 -cdr3 CASSYLQAQYTEAFF

which returns:

zsh: no matches found: TRBV7-3*01

However, if I remove the allele information and run just the following:

stitchr -v TRBV7-3 -j TRBJ1-1 -cdr3 CASSYLQAQYTEAFF

it works more as expected, returning this:

No valid leader region allele determined yet for TRBV7-3.
Defaulting to *01 for the leader region, in the absence of a preferred allele file being specified.
NB: the prototypical '*01' allele is being used for the leader region by default, but other alleles are available - consider double checking the right allele is asked for.
No valid variable region allele determined yet for TRBV7-3.
Defaulting to *01 for the variable region, in the absence of a preferred allele file being specified.
NB: the prototypical '*01' allele is being used for the variable region by default, but other alleles are available - consider double checking the right allele is asked for.
No valid joining region allele determined yet for TRBJ1-1.
Defaulting to *01 for the joining region, in the absence of a preferred allele file being specified.
----------------------------------------------------------------------------------------------
>nt||TRBV7-3*01|TRBJ1-1*01|TRBC1*01|CASSYLQAQYTEAFF|TRBV7-3*01(L)
ATGGGCACCAGGCTCCTCTGCTGGGCAGCCCTGTGCCTCCTGGGGGCAGATCACACAGGT
GCTGGAGTCTCCCAGACCCCCAGTAACAAGGTCACAGAGAAGGGAAAATATGTAGAGCTC
AGGTGTGATCCAATTTCAGGTCATACTGCCCTTTACTGGTACCGACAAAGCCTGGGGCAG
GGCCCAGAGTTTCTAATTTACTTCCAAGGCACGGGTGCGGCAGATGACTCAGGGCTGCCC
AACGATCGGTTCTTTGCAGTCAGGCCTGAGGGATCCGTCTCTACTCTGAAGATCCAGCGC
ACAGAGCGGGGGGACTCAGCCGTGTATCTCTGTGCCAGCAGCTACCTGCAGGCCCAGTAC
ACTGAAGCTTTCTTTGGACAAGGCACCAGACTCACAGTTGTAGAGGACCTGAACAAGGTG
TTCCCACCCGAGGTCGCTGTGTTTGAGCCATCAGAAGCAGAGATCTCCCACACCCAAAAG
GCCACACTGGTGTGCCTGGCCACAGGCTTCTTCCCCGACCACGTGGAGCTGAGCTGGTGG
GTGAATGGGAAGGAGGTGCACAGTGGGGTCAGCACGGACCCGCAGCCCCTCAAGGAGCAG
CCCGCCCTCAATGACTCCAGATACTGCCTGAGCAGCCGCCTGAGGGTCTCGGCCACCTTC
TGGCAGAACCCCCGCAACCACTTCCGCTGTCAAGTCCAGTTCTACGGGCTCTCGGAGAAT
GACGAGTGGACCCAGGATAGGGCCAAACCCGTCACCCAGATCGTCAGCGCCGAGGCCTGG
GGTAGAGCAGACTGTGGCTTTACCTCGGTGTCCTACCAGCAAGGGGTCCTGTCTGCCACC
ATCCTCTATGAGATCCTGCTAGGGAAGGCCACCCTGTATGCTGTGCTGGTCAGCGCCCTT
GTGTTGATGGCCATGGTCAAGAGAAAGGATTTC

>aa||TRBV7-3*01|TRBJ1-1*01|TRBC1*01|CASSYLQAQYTEAFF|TRBV7-3*01(L)
MGTRLLCWAALCLLGADHTGAGVSQTPSNKVTEKGKYVELRCDPISGHTALYWYRQSLGQ
GPEFLIYFQGTGAADDSGLPNDRFFAVRPEGSVSTLKIQRTERGDSAVYLCASSYLQAQY
TEAFFGQGTRLTVVEDLNKVFPPEVAVFEPSEAEISHTQKATLVCLATGFFPDHVELSWW
VNGKEVHSGVSTDPQPLKEQPALNDSRYCLSSRLRVSATFWQNPRNHFRCQVQFYGLSEN
DEWTQDRAKPVTQIVSAEAWGRADCGFTSVSYQQGVLSATILYEILLGKATLYAVLVSAL
VLMAMVKRKDF

Of note, since no allele information is provided, stitchr defaults to *01 for both V and J genes, exactly as in the first command that found no results... is "*01" not the correct way to specify it in the first command?

Thanks!
Daniel
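
The `no matches found` message comes from zsh itself rather than from stitchr: an unquoted `*` is treated as a filename wildcard, so the shell rejects the command before stitchr ever runs. Quoting the argument (for example `'TRBV7-3*01'`) passes it through literally. When commands are built programmatically, Python's standard `shlex.quote` applies this quoting. The sketch below only constructs the command string and does not run stitchr; `build_stitchr_command` is a hypothetical helper.

```python
import shlex


def build_stitchr_command(v, j, cdr3):
    """Return a shell-safe command line; allele asterisks get quoted."""
    args = ["stitchr", "-v", v, "-j", j, "-cdr3", cdr3]
    return " ".join(shlex.quote(a) for a in args)


print(build_stitchr_command("TRBV7-3*01", "TRBJ1-1*01", "CASSYLQAQYTEAFF"))
```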

stitchr errors if -aa option is not specified

If the -aa option is not specified, then stitchr errors:

python stitchr.py -v TRAV13-1 -j TRAJ34 -cdr3 CAAYNTDKLIF
[...]
Traceback (most recent call last):
  File "stitchr.py", line 124, in <module>
    if input_args['aa']:
KeyError: 'aa'

The issue is that the code checks the value of aa in the options dictionary instead of its presence.
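
A minimal sketch of the safer check, assuming `input_args` is a plain dictionary as the traceback suggests. `wants_aa` is a hypothetical helper name and the example values are placeholders.

```python
def wants_aa(input_args):
    """True only when an 'aa' entry is present and non-empty."""
    # dict.get returns None for a missing key instead of raising KeyError.
    return bool(input_args.get("aa"))


print(wants_aa({"v": "TRAV13-1", "j": "TRAJ34"}))
print(wants_aa({"v": "TRAV13-1", "aa": "PLACEHOLDER"}))
```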

Improve % of perfectly replicated CDR3s when using a NT CDR3

Currently, when submitting rearrangements using nucleotide CDR3s, Stitchr will produce a small proportion of TCRs where the stitched nucleotide sequence does not exactly replicate the sequence that appeared in the original TCR (see Fig 3D of the NAR paper). This is due to Stitchr determining the edge of the germline-encodable sequence at the codon level, whereas V(D)J recombination will often delete only part of a codon, and then P/NP/D gene nucleotide additions can complete a different but redundant codon encoding the same amino acid (see Fig 1C of the paper). Of course the final stitched amino acid sequence may be the same, but it would be nice for these nucleotide-provided CDR3s to perfectly match the actual rearranged TCR, without needing to resort to the slower seamless option.

It's just occurred to me that the vast majority of these mismatches could probably be avoided with a simple switch. Currently the NT-CDR3 option copies the AA-CDR3 option, defining the wholly germline-encodable sequence and then filling in with the non-templated sequence. Instead, when using NT-CDR3 the script could delete one amino acid further back from the edge of what is potentially encodable, and simply add back in an extra 3 NT from the provided CDR3 sequence.

It still wouldn't be 100% perfect (as rare TCRs will have coincidentally deleted further back and replaced multiple redundant codons), but it would improve accuracy a lot.

Given the sparsity of applications that need this, I won't be implementing it in a hurry, but I'm adding it here in case anyone needs it and fancies implementing it.
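
The proposed switch can be sketched for the V side as follows. This is a toy illustration rather than stitchr's code: the sequences are invented, both inputs are assumed to start in frame at the conserved cysteine codon, and `join_v_side` and its helpers are hypothetical names.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nt):
    usable = len(nt) - len(nt) % 3
    return "".join(CODONS[nt[i:i + 3]] for i in range(0, usable, 3))


def germline_codons_matched(germline_nt, cdr3_nt):
    """Count leading codons whose amino acids agree (codon-level matching)."""
    n = 0
    for g, c in zip(translate(germline_nt), translate(cdr3_nt)):
        if g != c:
            break
        n += 1
    return n


def join_v_side(germline_nt, cdr3_nt, back_off=1):
    """Take germline up to `back_off` codons short of the match edge, then the CDR3."""
    keep = max(germline_codons_matched(germline_nt, cdr3_nt) - back_off, 0)
    return germline_nt[:keep * 3] + cdr3_nt[keep * 3:]


# Invented example: the germline ends in TTA (Leu), the rearrangement uses TTG (Leu).
germline = "TGTGCCAGCAGCTTA"
cdr3 = "TGTGCCAGCAGCTTGCAGGCC"
print(join_v_side(germline, cdr3, back_off=0) == cdr3)
print(join_v_side(germline, cdr3, back_off=1) == cdr3)
```

With no back-off the synonymous germline codon is kept and the nucleotide sequence differs from the input; backing off by one codon reproduces the provided CDR3 exactly in this example.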

fails weirdly if it can't find tcr data

$ python stitchr.py -v TRAV29DV5 -j TRAJ23 -cdr3 CAAFNQGGKLIF
Warning: gene TRAV29DV5*01
Traceback (most recent call last):
  File "stitchr.py", line 90, in <module>
    functionality[gene][allele] + "\', and thus may not express or function correctly."
TypeError: cannot concatenate 'str' and 'list' objects

The problem is that functionality[gene][allele] is an empty list. I'm guessing the code should never have reached that state.
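
One defensive option is to validate the lookup before building the warning string. In the sketch below, the nested gene/allele mapping is inferred from the traceback, `functionality_warning` is a hypothetical helper, and the wording between the gene name and the flag is a placeholder because the original message is only partly visible.

```python
def functionality_warning(gene, allele, functionality):
    """Return a warning string, or raise if no usable flag is recorded."""
    flag = functionality.get(gene, {}).get(allele)
    if not flag:
        # Covers a missing gene, a missing allele, and the empty-list case above.
        raise ValueError("No functionality data found for %s*%s" % (gene, allele))
    if isinstance(flag, (list, tuple)):
        flag = ", ".join(str(f) for f in flag)
    return (
        "Warning: gene " + gene + "*" + allele + " is flagged '" + flag
        + "', and thus may not express or function correctly."
    )


try:
    functionality_warning("TRAV29DV5", "01", {"TRAV29DV5": {"01": []}})
except ValueError as err:
    print("Error:", err)
```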

Thimble GUI example files throw error

Was testing the code and thought to use the GUI mode first out of idle curiosity. Uploading the example file:
stitchr-main\Templates\GUI-Examples\human_TRA-TRB.tsv and attempting to process it (without P2A linker) leads to the following error:

gui-stitcher.py:436: UserWarning: WArning: user-provided gene TCRgenename*01 contains non-DNA sequences....
KeyError: ('TRA/TRB', 'HUMAN')

This is a fatal error and leads to dropping back to console. Seems like it's probably a teething error with the move to 0.91, thought I'd raise an issue (I'll go back to command line use)

thimble function gives an error for LEADER sequence when stitching TCR

Dear Jamie,
Thank you for developing this package!

I'm trying to use thimble.py function to get full paired-chain TCR sequences.
Unfortunately, for every TCR in the .tsv file (~4,500 TCRs), I get the same error:
(TRA) Error: a LEADER sequence region has not been found for genes in the IMGT data for this chain/species. Please check your TCR and species data. Cannot stitch a sequence for TRA.
(TRB) Error: a LEADER sequence region has not been found for gene in the IMGT data for this chain/species. Please check your TCR and species data. Cannot stitch a sequence for TRB.
(Link) Error: need both a TRA and TRB to link.

However, when I tried to stitch the same TCRs manually using stitchr.py function, it worked perfectly fine.
For example, a line in the .tsv table:
TCR0 TRAV3 TRAJ37 CAAPPSNTGKLIF TRBV19 TRBJ2-3 CASSTRAADTQYF - gives an error when using thimble.py
But both of these work well:
python3 stitchr.py -v TRAV3 -j TRAJ37 -cdr3 CAAPPSNTGKLIF
python3 stitchr.py -v TRBV19 -j TRBJ2-3 -cdr3 CASSTRAADTQYF

My TCR gene names should be consistent with the IMGT nomenclature.
What could be the reason for such an error, and how can I fix it?
Thank you for your time.

Add option to read FASTA automatically into additional-genes.fasta

Currently it's awkward to add additional genes, both due to having to conform to the altered IMGT header format and due to having to locate the relevant file in the squirreled-away data directory.

An optional script (perhaps part of stitchrdl) could:

  • be provided a path to a fasta file which it could parse
  • QC the reads therein
  • generate a suitable fasta header
  • append the reads to the correct file
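
The parsing and QC steps in that list could look roughly like the sketch below. It is only an illustration: `parse_fasta`, `is_dna`, and `qc_records` are hypothetical helpers, the QC rule (unambiguous A/C/G/T only) is an assumption, and header generation and appending are left out because they depend on stitchr's own header format and file locations.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def is_dna(seq):
    """Assumed QC rule: non-empty and made only of A, C, G, T."""
    return bool(seq) and set(seq) <= set("ACGT")


def qc_records(text):
    """Split records into those passing and failing the DNA check."""
    good, bad = [], []
    for header, seq in parse_fasta(text):
        (good if is_dna(seq) else bad).append((header, seq))
    return good, bad


example = ">geneA\nATGG\ncctt\n>geneB\nATGXYZ\n"
good, bad = qc_records(example)
print(len(good), len(bad))
```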

Add capacity for gamma/delta TCR stitching

Note that this may require some tweaking of how stitchr currently handles constant regions (given that human/mice TRGC have multiple isoforms on account of exon 2 duplications).

Wild card usage in Thimble ignores extra genes (-xg/additional-genes.fasta)

Using 'TRAVx*%' in Thimble should produce rows in the output for that rearrangement with all included alleles of TRAVx, however it currently only includes alleles that are included in the appropriate locus' fasta file in the relevant species folder in the data directory.

Not a big issue (as users can just pop their additional genes in those files directly), but it's not expected behaviour and encourages editing files which are best left managed by the automatic downloading of stitchrdl.

Compatibility of stitchr with Windows

Hi @JamieHeather,

I'm trying to use stitchr on my Windows machine, but I'm encountering an error message that looks like this:

 C:\Users\user1>stitchr -v TRBV2-1 -j TRBJ2-1 -cdr3 CASSRLAGGMDEQFF
Traceback (most recent call last):
  File "C:\Users\user1\anaconda3\envs\ml\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\user1\anaconda3\envs\ml\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\user1\anaconda3\envs\ml\Scripts\stitchr.exe\__main__.py", line 7, in <module>
  File "C:\Users\user1\anaconda3\envs\ml\lib\site-packages\Scripts\stitchr.py", line 399, in main
    input_args, chain = fxn.sort_input(vars(args()))
  File "C:\Users\user1\anaconda3\envs\ml\lib\site-packages\Scripts\functions.py", line 161, in sort_input
    species_dirs = find_species_covered()
  File "C:\Users\user1\anaconda3\envs\ml\lib\site-packages\Scripts\functions.py", line 131, in find_species_covered
    species_list = [x for x in os.listdir(data_dir) if os.path.isdir(data_dir + '/' + x) and
FileNotFoundError: [WinError 3] The system cannot find the path specified: ''

I'm not sure if this error is caused by a compatibility issue between stitchr and Windows or if it's due to a different problem. Do you have any suggestions for how I might resolve this issue?

Thank you!
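
The last frame shows `os.listdir` being called with an empty `data_dir`, so the path to the data was never resolved; the string concatenation with `'/'` is a separate portability concern. A defensive sketch using `pathlib` follows. `find_species_dirs` is a hypothetical helper, not the function in functions.py, and the part of the original condition that is cut off in the traceback is not reproduced.

```python
import tempfile
from pathlib import Path


def find_species_dirs(data_dir):
    """List sub-directory names of data_dir, failing clearly if it is unset or missing."""
    if not data_dir or not Path(data_dir).is_dir():
        raise FileNotFoundError("Data directory not found: %r" % str(data_dir))
    return sorted(p.name for p in Path(data_dir).iterdir() if p.is_dir())


# Self-contained demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "HUMAN").mkdir()
    (Path(tmp) / "MOUSE").mkdir()
    (Path(tmp) / "notes.txt").write_text("not a species folder")
    demo_species = find_species_dirs(tmp)

print(demo_species)
```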

TRBD gene

Hi, thank you for this very useful tool!

I'm new to TCR assembly, so this might be a naive question. May I know why there is no option to fill in the TRBD gene? There are three different TRBD alleles in my dataset:
TRBD1*01, TRBD2*01, TRBD2*02
Are they all the same? Will not including them make the output inaccurate?

Possible to use custom species (non-human/non-mouse)?

Hi Jamie

I'd like to be able to use mouse constant regions (TRAC/TRBC) with the human TRA/TRB genes, but am not able to do so. It wasn't clear whether I could do this by adding custom regions to the additional-genes.fasta in combination with '-xg', and editing the thimble file to list the mouse genes also didn't work.

I thought of creating a custom human/mouse hybrid reference, where I replaced human TRAC and TRBC genes with the mouse versions in the imgt, and also in the 'C-region-motifs.tsv' file. To try and use this, I called the folder 'HUMAN_mCONST' and specified this with the '-s' option. But no luck here either. I couldn't see in your code why this wouldn't work, as it should only be looking for the named folder in the Data directory, but I get the following error:

  File "/data/github/stitchr/Scripts/thimble.py", line 137, in <module>
    raise IOError("No data available for requested species: " + input_args['species'])
OSError: No data available for requested species: HUMAN_mCONST

Would you have any advice on how or if this might be possible?

Thanks!
Sam

Specify databases location

Hi! is there a way to explicitly specify the location of the databases (TRA.fasta and TRB.fasta) to use?

I made a package that uses Stitchr (and other things) and want to put the FASTA files in a folder that ships with the package. However, when running it, it complains with "TRA.fasta not detected in the Data directory. Please check data exists for this species/locus combination." because I no longer have the files in the Data/HUMAN directory.

Is there a way to pass the databases location to Stitchr?

I use the importing approach as discussed in issue #35

Thanks!
