andrewjpage / tiptoft

Predict plasmids from uncorrected long read data

License: GNU General Public License v3.0

Python 92.73% TeX 6.88% Dockerfile 0.39%

genomics oxford-nanopore nanopore plasmid pacbio uncorrected research bioinformatics bioinformatics-pipeline global-health

tiptoft's Issues

Successfully installed tiptoft-x.y.z ?

x.y.z needs to be filled in?

$ pip3 install tiptoft

Collecting tiptoft
  Downloading https://files.pythonhosted.org/packages/c0/9a/3b39936b78e2d0c1aeeb61ec6a6b03737e4bae7e5874ea121dbb6b3d0588/tiptoft-0.1.3.tar.gz (88kB)
    100% |████████████████████████████████| 92kB 7.1MB/s
Requirement already satisfied: biopython>=1.68 in /home/linuxbrew/.linuxbrew/lib/python3.7/site-packages (from tiptoft) (1.72)
Requirement already satisfied: pyfastaq>=3.12.0 in /home/linuxbrew/.linuxbrew/lib/python3.7/site-packages (from tiptoft) (3.17.0)
Requirement already satisfied: cython in /home/linuxbrew/.linuxbrew/lib/python3.7/site-packages (from tiptoft) (0.28.5)
Requirement already satisfied: numpy in /home/linuxbrew/.linuxbrew/lib/python3.7/site-packages (from biopython>=1.68->tiptoft) (1.15.2)
Skipping bdist_wheel for tiptoft, due to binaries being disabled for it.
Installing collected packages: tiptoft
  Running setup.py install for tiptoft ... done
Successfully installed tiptoft-x.y.z

cannot find example data file

the link here to an example data file does not work & the data file is not included in the repo any more.

I found the commit that deleted it like so,

git rev-list -n 1 HEAD -- ERS654932_plasmids.fastq.gz

and then checked it out like so,

git show 7c7d3e55da84cf814c783dbaf429def261c77328^:example_data/ERS654932_plasmids.fastq.gz > ERS654932_plasmids.fastq.gz

I think it'd be great if the file remained in the repo; failing that, it should be provided somewhere else (or removed from the README, though -1 on that :). Note that deleting it without rewriting history means it's still in the git history, so it doesn't speed up git clones, although it does reduce the bundled repo size for releases etc.

Default to include data/ files?

I note that data/ has the files downloaded.
It would be great if it could default to use those.
Are they part of the pip resources section?

eg.

parser.add_argument('--output_prefix', '-o', help='Output directory',
    default=pkg_resources.get_distribution("plasmidpredictor").the_files_we_need)

I guess the reason is ethical/political

Could make everyone cite Plasmidfinder as well as your future JOSS paper.
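
To make the request above concrete, here is a minimal sketch of an argument parser whose database option defaults to a file shipped with the package. The `data/` layout, the file name `plasmid_data.fa`, and the helper names are assumptions for illustration; only the `--plasmid_data` option name comes from these issues, and this is not the tool's actual code.

```python
import argparse
import os


def bundled_database(package_dir, filename="plasmid_data.fa"):
    """Return the path of a database file under <package_dir>/data/.

    Both the data/ layout and the file name are hypothetical.
    """
    return os.path.join(package_dir, "data", filename)


def build_parser(package_dir):
    """Build a parser whose --plasmid_data default points at the bundled file."""
    parser = argparse.ArgumentParser(description="plasmid prediction (sketch)")
    parser.add_argument(
        "--plasmid_data",
        help="FASTA file of plasmid genes [%(default)s]",
        default=bundled_database(package_dir),
    )
    parser.add_argument("input_fastq", help="Input FASTQ file (optionally gzipped)")
    return parser


# With no --plasmid_data on the command line, the bundled path is used.
args = build_parser("/opt/tiptoft").parse_args(["reads.fastq.gz"])
print(args.plasmid_data)
```

In a real package, `package_dir` would typically be derived from the module's own location, and the data files would have to be declared as package data so that pip installs them.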

ValueError: 'homopolymer_compression.pyx' doesn't match any files

Some kind of cython issue?

pip3 install tiptoft
Collecting tiptoft
  Downloading https://files.pythonhosted.org/packages/8d/1e/0df98a3565f3f656ca625bc427804de61282593c0fed17fcb14da7605891/tiptoft-0.1.1.tar.gz (86kB)
    100% |████████████████████████████████| 92kB 20.5MB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-qkhicklb/tiptoft/setup.py", line 36, in <module>
        ext_modules = cythonize(extensions),
      File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/Cython/Build/Dependencies.py", line 897, in cythonize
        aliases=aliases)
      File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/Cython/Build/Dependencies.py", line 777, in create_extension_list
        for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
      File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/Cython/Build/Dependencies.py", line 102, in nonempty
        raise ValueError(error_msg)
    ValueError: 'homopolymer_compression.pyx' doesn't match any files

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-qkhicklb/tiptoft/

pip3 --version
pip 18.0 from /home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/pip (python 3.7)

cython --version
Cython version 0.28.5
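
The traceback shows `cythonize()` raising because the pattern `homopolymer_compression.pyx` matched no files. One possible cause (not confirmed in this thread) is that the `.pyx` file is missing from the source tarball, or that the path is resolved against the wrong directory. Below is a minimal sketch of a check that `setup.py` could run before calling `cythonize()`; the helper name `resolve_sources` is hypothetical.

```python
import os


def resolve_sources(setup_dir, relative_names):
    """Turn source names into absolute paths and fail early if any is missing."""
    resolved = []
    for name in relative_names:
        path = os.path.join(setup_dir, name)
        if not os.path.isfile(path):
            raise FileNotFoundError(
                f"{name} not found under {setup_dir}; is it listed in MANIFEST.in?"
            )
        resolved.append(path)
    return resolved
```

If the file really is absent from the tarball, adding a line such as `include homopolymer_compression.pyx` to `MANIFEST.in` is the usual way to get it into the sdist.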

[JOSS-review] Please remember to cite the plasmidFinder paper

As you mention in your README.md: "Please remember to cite the plasmidFinder paper"

The paper.md would benefit from the addition of a brief explanation of the reference database it requires, along with the appropriate bibliographical reference.

create a CITATION file and add plasmidFinder to it?

One way to help users cite the appropriate literature is to create a CITATION.cff file, https://citation-file-format.github.io/. (there are other approaches, too, of course ;). ref #16.

This is just a suggestion, not critical :)

Add 3.7-dev test

language: python

python:
  - "3.5"
  - "3.6"
  - "3.7-dev"

Recent Python branches require OpenSSL 1.0.2+. As this library is not available for Trusty, 3.7, 3.7-dev, 3.8-dev, and nightly do not work (or use outdated archive).

OR NOT ?

Author's github photo needs cheering up

Is this Quadram productions?

-o FILE does not create FILE

% plasmidpredictor -o out.txt db/plasmids.fa subreads.fq.gz

rep2.1_ORF(E.faeciumContig1183)_JDOE    100
rep2.2_repR(pEF1)_DQ198088*     99
rep11.1_repA(pB82)_AB178871     100
rep14.3_ORF(pRI1)_EU327398*     99
<snip>

% less out.txt
No such file or directory
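
For reference, the behaviour I expected from `-o` can be sketched as follows: write to the named file when one is given, otherwise fall back to stdout. The function names are hypothetical and this is not the tool's actual code.

```python
import contextlib
import sys


def open_output(path):
    """Return a context manager yielding a writable handle.

    The handle is the file at `path`, or sys.stdout when `path` is None.
    """
    if path is None:
        return contextlib.nullcontext(sys.stdout)
    return open(path, "w")


def write_results(results, path=None):
    """Write one tab-separated 'gene<TAB>score' line per result."""
    with open_output(path) as handle:
        for gene, score in results:
            handle.write(f"{gene}\t{score}\n")
```

Calling `write_results(rows, "out.txt")` would then create `out.txt`, while `write_results(rows)` prints to the terminal as the tool does now.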

Error downloading database

Hello,

I wanted to try tiptoft but, similarly to issue #26, I encounter the following error when trying to run tiptoft_database_downloader v1.0.2. Before breaking, the downloader saved nine .fsa files to tiptoft_db.tmp.download.
If you require additional information please let me know.
Thanks for having a look at it!

Best,
Laura

Combining downloaded fasta files...
    RepA_N.fsa
    enterobacteriaceae.fsa
Traceback (most recent call last):
  File "/home/user/.local/bin/tiptoft_database_downloader", line 30, in <module>
    tiptoft.run()
  File "/home/user/.local/lib/python3.9/site-packages/tiptoft/TipToftDatabaseDownloader.py", line 23, in run
    refgenes.run(self.output_prefix)
  File "/home/user/.local/lib/python3.9/site-packages/tiptoft/RefGenesGetter.py", line 85, in run
    exec('self._get_from_' + self.ref_db + '(outprefix)')
  File "<string>", line 1, in <module>
  File "/home/user/.local/lib/python3.9/site-packages/tiptoft/RefGenesGetter.py", line 58, in _get_from_plasmidfinder
    for seq in file_reader:
  File "/home/user/.local/lib/python3.9/site-packages/pyfastaq/sequences.py", line 141, in file_reader
    raise Error('Error determining file type from file "' + fname + '". First line is:\n' + line.rstrip())
pyfastaq.sequences.Error: Error determining file type from file "/path/to/location/enterobacteriaceae.fsa". First line is:
<!DOCTYPE html>
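
The first line of the offending file is `<!DOCTYPE html>`, so the download step seems to have saved a web page rather than a FASTA file. A minimal sketch of a sanity check that could run before the files are combined is shown below; the function names are hypothetical and this is not part of tiptoft.

```python
def looks_like_fasta(path):
    """Return True if the first non-blank line of the file starts with '>'."""
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    return False  # an empty file is not valid FASTA either


def check_downloads(paths):
    """Raise a readable error naming every download that is not FASTA."""
    bad = [p for p in paths if not looks_like_fasta(p)]
    if bad:
        raise ValueError(
            "These downloads are not FASTA (the server may have returned an "
            "HTML page instead): " + ", ".join(bad)
        )
```

A check like this would not fix the broken download, but it would replace the parser traceback with a message that points at the likely cause.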

plasmidpredictor_database_downloader shouldn't download surprisingly

% plasmidpredictor_database_downloader
Downloading data with:
curl -X POST --data "folder=plasmidfinder&filename=plasmidfinder.zip"

Noooo! What's going on? What's all this scrolling? What's happened??

I think --outdir should be required ... "principle of LEAST SURPRISE".

Most people expect to see --help on stderr when no parameter is supplied.
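
A minimal sketch of what I mean, using argparse: no arguments prints the help to stderr and exits with a non-zero code, and `--outdir` has to be given explicitly before anything is downloaded. The program name and the function names are hypothetical.

```python
import argparse
import sys


def build_parser():
    """Parser with a mandatory --outdir, so nothing is written by surprise."""
    parser = argparse.ArgumentParser(prog="database_downloader_sketch")
    parser.add_argument(
        "--outdir", required=True, help="Directory to write the database to"
    )
    return parser


def main(argv):
    """Return an exit code: 2 for a bare invocation, 0 otherwise."""
    parser = build_parser()
    if not argv:
        # Bare invocation: show usage on stderr instead of starting a download.
        parser.print_help(sys.stderr)
        return 2
    args = parser.parse_args(argv)
    print(f"Would download the database into {args.outdir}")
    return 0
```

A console script would call `sys.exit(main(sys.argv[1:]))`.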

Best parameters for pacbio reads?

k=13 is common for Nanopore seeding

Would it go faster for pacbio with a bigger k ?

Interpreting the output

When I run it I get a lot of output:

<SNIP>
**
rep2.1_ORF(E.faeciumContig1183)_JDOE    100
rep2.2_repR(pEF1)_DQ198088*     99
rep11.1_repA(pB82)_AB178871     100
rep14.3_ORF(pRI1)_EU327398*     99
rep17.1_CDS29(pRUM)_AF507977    100
repUS15._ORF(E.faecium287)_NZAAAK010000287*     89
****

I think this is 'progressive' output as you go through reads? print_interval?
Do I just focus on the "final" section (see above)?
What are the numbers? Coverage?
How many plasmids should I expect given the above output? Six?

% abricate --db plasmidfinder E.faecium/canu/canu.contigs.fasta  | cut -f 2,5,6 | column -t

SEQUENCE     GENE                                  COVERAGE
tig00000002  rep2_1_ORF(E.faeciumContig1183)_JDOE  1-1494/1494
tig00000002  rep2_1_ORF(E.faeciumContig1183)_JDOE  1-1494/1494

tig00000010  repUS15__ORF(E.faecium287)            1-1041/1041

tig00000011  rep17_1_CDS29(pRUM)                   537-1041/1041

tig00000012  rep6_1_repA(p703/5)                   14-663/723
tig00000012  rep17_1_CDS29(pRUM)                   1-1041/1041
tig00000012  rep6_1_repA(p703/5)                   14-663/723

tig00000013  rep18_1_repA(p200B)                   413-931/933
tig00000013  rep18_1_repA(p200B)                   413-931/933
tig00000013  rep18_1_repA(p200B)                   413-931/933

tig00000016  rep14_3_ORF(pRI1)                     1-951/951
tig00000016  rep14_3_ORF(pRI1)                     772-951/951

tig00000026  rep2_1_ORF(E.faeciumContig1183)_JDOE  1-1494/1494
tig00000027  rep2_1_ORF(E.faeciumContig1183)_JDOE  1-1494/1494
tig00000028  rep2_1_ORF(E.faeciumContig1183)_JDOE  1-1494/1494

add CONTRIBUTING.md?

in the README you have,

If you wish to fix a bug or add new features to the software we welcome Pull Requests. Please fork the repo, make the change, then submit a Pull Request with details about what the change is and what it fixes/adds.

I'd suggest adding an explicit CONTRIBUTING.md and maybe a CoC, issue template, and PR template per suggestions here:

https://github.com/andrewjpage/tiptoft/community

See sourmash's CONTRIBUTING.md here: https://github.com/dib-lab/sourmash/blob/master/CONTRIBUTING.md - the content's more or less the same as what you have, it's just in a standardized place :)

0.0.3 crash - TypeError: get_one_x_coverage_of_kmers() missing 3 required positional arguments: 'sequence', 'k', and 'end'

plasmidpredictor --verbose subreads.fq.gz

GENE    COMPLETENESS    %COVERAGE       ACCESSION       DATABASE        PRODUCT
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/bin/plasmidpredictor", line 53, in <module>
    plasmidpredictor.run()
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.6/site-packages/plasmidpredictor/PlasmidPredictor.py", line 49, in run
    fastq.read_filter_and_map()
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.6/site-packages/plasmidpredictor/Fastq.py", line 52, in read_filter_and_map
    if self.map_read(read):
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.6/site-packages/plasmidpredictor/Fastq.py", line 66, in map_read
    candidate_gene_names = self.does_read_contain_quick_pass_kmers(read.seq)
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.6/site-packages/plasmidpredictor/Fastq.py", line 81, in does_read_contain_quick_pass_kmers
    read_onex_kmers = kmers_obj.get_one_x_coverage_of_kmers()
TypeError: get_one_x_coverage_of_kmers() missing 3 required positional arguments: 'sequence', 'k', and 'end'

Consider homopolymer compressed kmers

Heng uses homo-compressed k-mers in minimap2
Might be useful here
Or not
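
To illustrate the idea: homopolymer compression collapses every run of identical bases to a single base before k-mers are taken, which makes the k-mers insensitive to errors in homopolymer run length. This is a minimal sketch with hypothetical function names, not minimap2's or this tool's implementation.

```python
from itertools import groupby


def homopolymer_compress(sequence):
    """Collapse each run of identical bases to a single base."""
    return "".join(base for base, _run in groupby(sequence))


def compressed_kmers(sequence, k):
    """Yield k-mers taken from the homopolymer-compressed sequence."""
    compressed = homopolymer_compress(sequence)
    for i in range(len(compressed) - k + 1):
        yield compressed[i:i + k]


print(homopolymer_compress("GATTTACCA"))       # GATACA
print(list(compressed_kmers("GATTTACCA", 4)))  # ['GATA', 'ATAC', 'TACA']
```

Reads that differ only in homopolymer length, such as `GATTACA` and `GATTTTACA`, compress to the same string and therefore share all of their k-mers.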

Usage question

I realize this tool is designed for detecting plasmids, but I'm wondering if it could be modified for more general purposes, such as detecting which samples contain particular distinct sequences. There is a surprising lack of tools that do this using a k-mer approach.

I'm assuming it would not be as simple as providing these distinct sequences to the "--plasmid_data" parameter?

Error downloading database

Tried both conda and pip installations and get the same error. Do you know what causes it?

tiptoft_database_downloader plasmidfinder
Downloading data with:
curl -o enterobacteriaceae.fsa https://bitbucket.org/genomicepidemiology/plasmidfinder_db/raw/master/enterobacteriaceae.fsa

Downloading data with:
curl -o gram_positive.fsa https://bitbucket.org/genomicepidemiology/plasmidfinder_db/raw/master/gram_positive.fsa

Combining downloaded fasta files...
    gram_positive.fsa
Traceback (most recent call last):
  File "/gpfs2/well/bag/users/lipworth/python3venv/bin/tiptoft_database_downloader", line 30, in <module>
    tiptoft.run()
  File "/gpfs2/well/bag/users/lipworth/python3venv/lib/python3.4/site-packages/tiptoft/TipToftDatabaseDownloader.py", line 23, in run
    refgenes.run(self.output_prefix)
  File "/gpfs2/well/bag/users/lipworth/python3venv/lib/python3.4/site-packages/tiptoft/RefGenesGetter.py", line 87, in run
    exec('self._get_from_' + self.ref_db + '(outprefix)')
  File "<string>", line 1, in <module>
  File "/gpfs2/well/bag/users/lipworth/python3venv/lib/python3.4/site-packages/tiptoft/RefGenesGetter.py", line 60, in _get_from_plasmidfinder
    for seq in file_reader:
  File "/gpfs2/well/bag/users/lipworth/python3venv/lib/python3.4/site-packages/pyfastaq/sequences.py", line 141, in file_reader
    raise Error('Error determining file type from file "' + fname + '". First line is:\n' + line.rstrip())
pyfastaq.sequences.Error: Error determining file type from file "/gpfs2/well/bag/users/lipworth/gram_neg/gnbc_nanopore/easymag/plasmidfinder.tmp.download/gram_positive.fsa". First line is:
<!DOCTYPE html>