crowelab / pyir

Immunoglobulin and T-Cell receptor rearrangement software

License: Other

Python 73.85% Perl 0.39% Shell 18.11% TeX 7.66%

pyir's People

wangdi2014 atdurian menchant liukai1029 jwillis0720 peji-moghimi abivarsh sailfish009 sarangians sarangian khalildaibes erichardson97 lorenzgerber

pyir's Issues

Support for different species

PyIR/pyir/arg_parse.py

Line 175 in 1a4d9b8

# choices=['human', 'mouse', 'rabbit', 'rat', 'rhesus_monkey'],

Hello, I am looking to run PyIR on mice, but while going through the code I noticed that mouse is no longer an option that can be passed to the tool. Was this done intentionally, or does the tool no longer work on mice?

Thanks,

Kuzirh
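
For context, argparse rejects any value that is not listed in an option's choices, which would explain why mouse is refused while the longer list stays commented out. The following is a minimal sketch of that behavior, assuming the option is named --species as in the commands quoted elsewhere on this page. The parser and the helper names (build_parser, accepts) are illustrative and are not taken from PyIR's arg_parse.py.

```python
import argparse


def build_parser(species_choices):
    """Build a toy parser whose --species option only accepts the given values."""
    parser = argparse.ArgumentParser(prog="species-demo")
    parser.add_argument("--species", default="human", choices=species_choices)
    return parser


def accepts(species_choices, value):
    """Return True if `value` passes argparse validation for --species."""
    try:
        build_parser(species_choices).parse_args(["--species", value])
        return True
    except SystemExit:
        # argparse prints a usage error and exits on an invalid choice
        return False
```

With choices limited to ['human'], accepts(['human'], 'mouse') is False. Widening the list only lets the value through the parser; whether a matching database exists is a separate question.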

empty output

Dear PyIR maintainer,

Thanks for maintaining this tool. I installed it and gave it a test run on one of my FASTA files, but got an empty output file.

$ pyir myfasta.fa  --outfmt tsv -m 8
4,555,603 sequences successfully split into 4554 pieces
Starting process pool using 8 processors
  0%|                                                                                                | 0/4555603 [04:45<?, ?seq/s]
4,555,603 sequences processed in 297.47 seconds, 15,314 sequences / s
Zipping up final output
Analysis complete, result file: myfasta.tsv.gz

Do you have any idea why this might happen?

Thanks,
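
One way to confirm that the result file really is empty is to count its data rows. This is a small sketch, assuming the output is a gzipped TSV with a single header line (as the .tsv.gz name suggests). The helper count_tsv_rows is hypothetical and not part of PyIR.

```python
import csv
import gzip


def count_tsv_rows(path):
    """Return (number of header columns, number of data rows) for a gzipped TSV."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty list if the file has no lines at all
        rows = sum(1 for _ in reader)
    return len(header), rows
```

A result of (0, 0) would mean the file has no header at all, while a non-zero column count with zero rows would mean only the header was written.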

ABhelix data

Hi, I am looking at the ABhelix-derived data from your paper "High frequency of shared clonotypes in human B cell receptor repertoires". The reads corresponding to IgA and IgM cover a part of the C gene, which is enough for isotype identification, but the reads in the files corresponding to IgG1-IgG4 cover only the last 5 nucleotides of the C gene. How were the IgG1-IgG4 isotypes determined?

IGBLAST_TSV_HEADER misalignment

The column names in the TSV output file are misaligned. The TSV header has fewer entries than the actual number of output columns, so every column name after the newly added "complete_vdj" is shifted away from the data it is meant to label.
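
A header/data width mismatch of this kind can be detected mechanically. This is a minimal sketch, assuming a tab-delimited file whose first line is the header. The function name find_misaligned_rows is made up for illustration.

```python
def find_misaligned_rows(lines, delimiter="\t"):
    """Return (header_width, [(line_number, width), ...]) for rows whose
    column count differs from the header's."""
    header_width = None
    mismatches = []
    for line_number, line in enumerate(lines, start=1):
        width = len(line.rstrip("\n").split(delimiter))
        if header_width is None:
            header_width = width  # first line is treated as the header
        elif width != header_width:
            mismatches.append((line_number, width))
    return header_width, mismatches
```

It accepts any iterable of lines, so an open file handle can be passed directly.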

Issue with using paired-end WES data

Dear developer,

I noticed that pyIR currently supports input in the form of a single FASTA/FASTQ file. I'm working with paired-end Whole Exome Sequencing (WES) data and was wondering about the best approach to use this data with pyIR. Should I preprocess the paired-end data into a single file using tools like pRESTO before inputting it into pyIR?

Thanks for your guidance!

Result difference between PyIR and IgBLAST server

Dear developer,

I noticed that when I input the same TCR FASTA file (50 sequences; I used a small file so that the IgBLAST server could run it) into both PyIR and the IgBLAST server, IgBLAST recognized the CDR3 sequences in all 50 sequences, while PyIR recognized them in only 24. Did this happen because of PyIR's quality-control function? Would the issue be solved simply by using a high-quality FASTA/FASTQ file?

Below was my code in PyIR:
PyIR(query=FILE, args=['--outfmt', 'tsv', '-r', 'TCR', '--species', 'human'])

Result from PyIR:

Result from IgBLAST server:

Example PyIR output for immunarch

Hi, thank you for creating PyIR! Would you be open to providing an example output of PyIR so we can add it as a parsing option to https://github.com/immunomind/immunarch ? I tried to download files from the Wiki, but some of them are not available, and I am having trouble with the PyIR installation. It would be great to have some PyIR outputs so that we can quickly implement the parser and help PyIR users easily explore their AIRR data. Thank you, and let me know if you have any questions!

Issue to track the parser: immunomind/immunarch#84

-- Vadim

Isotype annotation

Hi,

Thanks for creating this software. Is there any functionality for annotating B cell isotypes? If not, are there any plans to add this in future versions?

Thanks

Annotation for most alpha chains missing, but not for beta

Hi there,

Thanks for creating PyIR! I am seeing an odd issue where TCR beta chains are annotated perfectly out of the box, but alpha chains are not. Alpha chains are always marked as complete_vdj=F, most gene calls are missing, and most sequences (fwr1 to junctions) are missing. Is there something obvious I might be doing wrong? All the sequences I used came from filtered and complete 10X output.

Many thanks,
Andreas

Murine blast?

Is murine support no longer available?

Issue downloading the database

Hello, when I tried to download the database from your website (http://www.imgt.org/download/V-QUEST/IMGT_V-QUEST_reference_directory/Homo_sapiens/IG/IGHV.fasta), there was an error. Could you please check the website? Thanks for your assistance.

Add license?

For those who find this repository by means other than the BMC Bioinformatics publication, it may be helpful to have a LICENSE file available. I see the publication has the following:

License: Free to academics
Any restrictions to use by non-academics: Yes; non academics should contact the author for permission to use the software or license options for incorporation into software that is being sold for profit.

Could we have an option to use blastp?

Given that we have a list of protein sequences rather than nucleotide sequences, could we use the blastp program for immunoglobulin protein sequence searches, as shown on the IgBLAST website?

Failure in building docker image

When building the Docker image, the Ig test fails with what looks like an improperly written regex:

Starting process pool using 4 processors
 22%|██▏       | 108/500 [00:00<00:03, 109.25seq/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.7/site-packages/pyir/igblast.py", line 20, in run
    return igblast_run.run_single_process(fasta_input_file, fastq_input_file)
  File "/usr/local/lib/python3.7/site-packages/pyir/igblast.py", line 144, in run_single_process
    total_parsed = parser.parse()
  File "/usr/local/lib/python3.7/site-packages/pyir/parsers.py", line 615, in parse
    self.output = parser.parse(line, self.output, previous_line_whitespace, self.seq_dict)
  File "/usr/local/lib/python3.7/site-packages/pyir/parsers.py", line 92, in parse
    self.hits.append({'gene': matches.group(1), 'bit_score': float(matches.group(2)), 'e_value':float(matches.group(3))})
ValueError: could not convert string to float: 'sapiens|IGHV8|P|V-REGION'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/pyir", line 14, in <module>
    py_ir.run()
  File "/usr/local/lib/python3.7/site-packages/pyir/factory.py", line 65, in run
    output_pieces = self.run_pool(input_pieces, fastq_input_pieces, total_seqs)
  File "/usr/local/lib/python3.7/site-packages/pyir/factory.py", line 218, in run_pool
    for x in imap:
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
ValueError: could not convert string to float: 'sapiens|IGHV8|P|V-REGION'
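
The value that failed to convert, 'sapiens|IGHV8|P|V-REGION', looks like the tail of a gene name that contains a space (for example one including "Homo sapiens"), which would be consistent with a pattern that assumes names have no whitespace. That reading is unconfirmed. The sketch below shows one way to parse such a line from the right so that spaces in the name cannot shift the numeric columns. The function name and the sample line format are assumptions, not PyIR's actual parser or IgBLAST's documented output.

```python
def parse_hit_line(line):
    """Split a hit line into gene name, bit score and e-value.

    The last two whitespace-separated fields are taken as the numbers,
    so spaces inside the gene name no longer shift the columns.
    """
    gene, bit_score, e_value = line.rstrip().rsplit(None, 2)
    return {
        "gene": gene.strip(),
        "bit_score": float(bit_score),
        "e_value": float(e_value),
    }
```

A line with fewer than three fields raises ValueError at the unpacking step, which makes malformed input fail loudly instead of producing shifted values.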

Stops at 9%

Allow argument for setting IGBlast word size

This publication suggests that the default IGBlast word_size (9) is already too high, and PyIR is hardcoded to 11.

I'm running into real data where this is causing errors.

Please consider adding an argument that would allow users to set their desired word_size.

http://simlab.biomed.drexel.edu/papers_published/discrimination_zhang_2015.pdf
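
The request could look roughly like the sketch below. The --word_size option and the build_igblast_command helper are hypothetical and do not exist in PyIR. The default of 11 mirrors the hardcoded value described above, and the command list is only assembled, not executed.

```python
import argparse


def build_igblast_command(argv):
    """Turn a hypothetical --word_size option into igblastn arguments."""
    parser = argparse.ArgumentParser(prog="word-size-demo")
    parser.add_argument("query")
    # 11 mirrors the hardcoded value described in this issue
    parser.add_argument("--word_size", type=int, default=11)
    args = parser.parse_args(argv)
    if args.word_size < 1:
        parser.error("--word_size must be a positive integer")
    # Only the arguments relevant to this issue are shown
    return ["igblastn", "-query", args.query, "-word_size", str(args.word_size)]
```

Keeping the current value as the default would leave existing runs unchanged while letting users lower the word size for difficult data.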

after pyir set up, pyir still can't be used with other species

Hi,

I have been able to run pyir only on human. When I set up the database for other species with "pyir setup" and then run a pyir blast search, it does not work; it always defaults back to human. I noticed that in pyir's arg_parse.py the choices list contains only human. I tried uncommenting line 175 (choices=['human']) and commenting out line 176 (choices=['human','mouse', 'rabbit', 'rat', 'rhesus_monkey' ]), but pyir still does not work for any species other than human. The pyir setup command seems to run fine, so something other than the species choices argument appears to be preventing it from finding the database. Any suggestions?

missing FR4 with PyIR 1.3 AIRR parser

Hi PyIR team,

With PyIR 1.3, we have noticed that many queries/alignments are missing FR4. A closer look indicates that many of these sequences do contain FR4, but it is not reported by PyIR 1.3.

Below is one example annotated by PyIR 1.0 and PyIR 1.3 respectively, with only the minimal information needed to reproduce the behavior. The PyIR 1.0 result has the proper end of FR4 and agrees much better with the IgBLAST web tool.

PyIR 1.0
"Raw Sequence":"CAACATCCGAGCAGGGTTATCTGGTCTGATGGCTCAAACACAGCGACCTCGGGTGGGAACACGTTTTTCAGGTCCTCTGTGACCGTGAGCCTGGTGCCCGGCCCGAAGTACTGCTCGTAGGATTCCAACCCCCACTCCCATCTAGCACTGCAGATGTAGAAGCTGCTGTCTTCAGGATGGGCACTGGTCACTGTCAGAGTGGACAAGGTCAGGCTTGCATGGTTGATGAGAAACTTGTCCTTCTCGACGCCTTGCTCGTATGTGGCCTTGGAGCCCTCATTGGAAGTTGCCATCAGCATGAGACTCTGTTTCGGGAACTGACGATACCAAAACATAGTTGTGGCCTGAAAGTCCAGGGAACGGCACTCGATCTTCACAGAGGTTCCACTCTTACAGATAACCCTGCTCGGATGTTGAGAGACGACAGCACCAAGCCCGGAGCCTGGCCCCAGAAGCAGCAGAAGCAGCAGCATCTTCCGTGATGGCCTCACACCACCTTCTCTGGGGAGAGTTCAGAGCGCAGAGC",
"CDR1":{
"from":174.0,
"to":191.0,
"length":18.0,
"matches":18.0,
"mismatches":0.0,
"gaps":0.0,
"percent identity":100.0,
"AA":"DFQATT",
"NT":"GACTTTCAGGCCACAACT"
},
"FR2":{
"from":192.0,
"to":242.0,
"length":51.0,
"matches":51.0,
"mismatches":0.0,
"gaps":0.0,
"percent identity":100.0,
"AA":"MFWYRQFPKQSLMLMAT",
"NT":"ATGTTTTGGTATCGTCAGTTCCCGAAACAGAGTCTCATGCTGATGGCAACT"
},
"CDR2":{
"from":243.0,
"to":263.0,
"length":21.0,
"matches":21.0,
"mismatches":0.0,
"gaps":0.0,
"percent identity":100.0,
"AA":"SNEGSKA",
"NT":"TCCAATGAGGGCTCCAAGGCC"
},
"FR3":{
"from":264.0,
"to":377.0,
"length":114.0,
"matches":114.0,
"mismatches":0.0,
"gaps":0.0,
"percent identity":100.0,
"AA":"TYEQGVEKDKFLINHASLTLSTLTVTSAHPEDSSFYIC",
"NT":"ACATACGAGCAAGGCGTCGAGAAGGACAAGTTTCTCATCAACCATGCAAGCCTGACCTTGTCCACTCTGACAGTGACCAGTGCCCATCCTGAAGACAGCAGCTTCTACATCTGC"
},
"CDR3":{
"from":378.0,
"to":386.0,
"length":9.0,
"matches":9.0,
"mismatches":0.0,
"gaps":0.0,
"percent identity":100.0,
"AA":"SARWEWGLESYEQY",
"NT":"AGTGCTAGATGGGAGTGGGGGTTGGAATCCTACGAGCAGTAC"
"FR4":{
"AA":"FGPGTRLTVT",
"NT":"TTCGGGCCGGGCACCAGGCTCACGGTCACAG"

PyIR 1.3

"sequence":"GCTCTGCGCTCTGAACTCTCCCCAGAGAAGGTGGTGTGAGGCCATCACGGAAGATGCTGCTGCTTCTGCTGCTTCTGGGGCCAGGCTCCGGGCTTGGTGCTGTCGTCTCTCAACATCCGAGCAGGGTTATCTGTAAGAGTGGAACCTCTGTGAAGATCGAGTGCCGTTCCCTGGACTTTCAGGCCACAACTATGTTTTGGTATCGTCAGTTCCCGAAACAGAGTCTCATGCTGATGGCAACTTCCAATGAGGGCTCCAAGGCCACATACGAGCAAGGCGTCGAGAAGGACAAGTTTCTCATCAACCATGCAAGCCTGACCTTGTCCACTCTGACAGTGACCAGTGCCCATCCTGAAGACAGCAGCTTCTACATCTGCAGTGCTAGATGGGAGTGGGGGTTGGAATCCTACGAGCAGTACTTCGGGCCGGGCACCAGGCTCACGGTCACAGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTGTTTGAGCCATCAGACCAGATAACCCTGCTCGGATGTTG",
"fwr1":"GGTGCTGTCGTCTCTCAACATCCGAGCAGGGTTATCTGTAAGAGTGGAACCTCTGTGAAGATCGAGTGCCGTTCCCTG",
"fwr1_aa":"GAVVSQHPSRVICKSGTSVKIECRSL",
"cdr1":"GACTTTCAGGCCACAACT",
"cdr1_aa":"DFQATT",
"fwr2":"ATGTTTTGGTATCGTCAGTTCCCGAAACAGAGTCTCATGCTGATGGCAACT",
"fwr2_aa":"MFWYRQFPKQSLMLMAT",
"cdr2":"TCCAATGAGGGCTCCAAGGCC",
"cdr2_aa":"SNEGSKA",
"fwr3":"ACATACGAGCAAGGCGTCGAGAAGGACAAGTTTCTCATCAACCATGCAAGCCTGACCTTGTCCACTCTGACAGTGACCAGTGCCCATCCTGAAGACAGCAGCTTCTACATCTGC",
"fwr3_aa":"TYEQGVEKDKFLINHASLTLSTLTVTSAHPEDSSFYIC",
"fwr4":"",
"fwr4_aa":"",
"cdr3":"AGTGCTAGATGGGAGTGGGGGTTGGAATCCTACGAGCAGTAC",
"cdr3_aa":"SARWEWGLESYEQY",

UPDATE: A closer look shows that this only affects the parser for IgBLAST's AIRR output. The legacy parser still works well but does not support the tsv outfmt. It would be nice to have the AIRR parser return FR4.
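
Records affected by this can be flagged with a simple heuristic: fwr4 is empty even though the read continues past the end of cdr3. This is a rough sketch using the field names from the PyIR 1.3 output above (sequence, cdr3, fwr4). The helper name is made up, and the bases after cdr3 may belong to the constant region rather than FR4, so a hit is only a prompt to inspect the record.

```python
def fwr4_looks_dropped(record):
    """Return True when fwr4 is empty although bases follow cdr3 in the read."""
    sequence = record.get("sequence", "")
    cdr3 = record.get("cdr3", "")
    if record.get("fwr4") or not cdr3:
        return False
    start = sequence.find(cdr3)
    if start == -1:
        return False  # cdr3 not found on this strand; cannot judge
    return len(sequence) > start + len(cdr3)
```

Running this over parsed output rows gives a quick estimate of how many records are affected.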

Adaptive Biotechnologies data set

Hi, I am trying to download Adaptive Biotechnologies data sets (FASTA) from the publication "High frequency of shared clonotypes in human B cell receptor repertoires". But the link does not seem to work.

tempfile.mkdtemp() causing "No space left on device error"

Hi,

I am trying to run PyIR on many files simultaneously. It is a great wrapper for IgBlast and I have been able to parse the json files easily. But, I am running into a space issue with the tmpdir.

Line 26 of factory.py contains this command:
args['tmp_dir'] = tempfile.mkdtemp()

This appears to be the root cause of the "No space left on device" error shown below. Some of the jobs work, but most fail. I am running this job on a cluster.

My current workaround will be to set the environment variable TMPDIR to a directory with sufficient space, but this may be an issue that others may run into as well.

Error output:

  File "/data/omicscore/Easterhoff-Easterhoff-20190501/scripts/PyIR/bin/./pyir", line 14, in <module>
    py_ir.run()
  File "/data/omicscore/Easterhoff-Easterhoff-20190501/scripts/conda/envs/py36/lib/python3.6/site-packages/pyir/factory.py", line 53, in run
    total_seqs, input_pieces, fastq_input_pieces = self.split_input_file(input_format)
  File "/data/omicscore/Easterhoff-Easterhoff-20190501/scripts/conda/envs/py36/lib/python3.6/site-packages/pyir/factory.py", line 159, in split_input_file
    Bio.SeqIO.write(seq, current_pieces[proc_index], 'fasta')
  File "/data/omicscore/Easterhoff-Easterhoff-20190501/scripts/conda/envs/py36/lib/python3.6/site-packages/Bio/SeqIO/__init__.py", line 529, in write
    fp.write(format_function(record))
OSError: [Errno 28] No space left on device

Thanks,
Jen
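
One possible shape for a fix is to let the caller choose where the scratch directory is created. tempfile.mkdtemp accepts a dir argument, and when dir is None it falls back to the default temporary location, which honors TMPDIR. The make_work_dir helper and the "pyir_" prefix below are illustrative assumptions and not PyIR's code.

```python
import os
import tempfile


def make_work_dir(base_dir=None):
    """Create a scratch directory under base_dir, or the default temp location."""
    if base_dir is not None:
        os.makedirs(base_dir, exist_ok=True)
    # dir=None makes mkdtemp fall back to TMPDIR / the platform default
    return tempfile.mkdtemp(prefix="pyir_", dir=base_dir)
```

On a cluster, base_dir could point at node-local scratch or a large shared volume, which is the same effect as the TMPDIR workaround but explicit per run.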

I get an empty output json when running blastp on TCR data.

These are my args
'--species', 'human', '-r', 'TCR', '--legacy', '--sequence_type','prot'

When I reverse-translate any given sequence into nucleotides and run it under the nucl flag, I get correct output.