stringmlst's People

Contributors

andrewjpage, anujg30, ar0ch, rrwick, takadonet

stringmlst's Issues

Memory usage balloons when building rMLST kmer-database

Hi,

Thanks for your work on this tool. I'm interested in pairing it with rMLST like you describe in the original paper, so I obtained the database and have been trying to build a kmer database for classifying samples. However, the memory required seems to be far above 128 GB, and I've killed the build script before the OOM killer activates.

Any advice on building the DB? I suppose I could do it on AWS if needed, or construct the DB manually using jellyfish.

Proxy issue?

Hello,
while trying to run stringMLST on a machine behind a corporate proxy, we get the following error message:

[xxx@xxx ~]$ stringMLST.py --getMLST -P datasets/ --species all
Using a kmer size of 35 for all databases.
Preparing: Achromobacter spp.
Traceback (most recent call last):
File "/usr/local/bin/stringMLST.py", line 1639, in
profileURL = get_links(key,schemes)
File "/usr/local/bin/stringMLST.py", line 263, in get_links
xml = urlopen(URL)
File "/usr/lib64/python2.7/urllib.py", line 87, in urlopen
return opener.open(url)
File "/usr/lib64/python2.7/urllib.py", line 208, in open
return getattr(self, name)(url)
File "/usr/lib64/python2.7/urllib.py", line 359, in open_http
return self.http_error(url, fp, errcode, errmsg, headers)
File "/usr/lib64/python2.7/urllib.py", line 372, in http_error
result = method(url, fp, errcode, errmsg, headers)
File "/usr/lib64/python2.7/urllib.py", line 665, in http_error_301
return self.http_error_302(url, fp, errcode, errmsg, headers, data)
File "/usr/lib64/python2.7/urllib.py", line 635, in http_error_302
data)
File "/usr/lib64/python2.7/urllib.py", line 661, in redirect_internal
return self.open(newurl)
File "/usr/lib64/python2.7/urllib.py", line 208, in open
return getattr(self, name)(url)
File "/usr/lib64/python2.7/urllib.py", line 437, in open_https
h.endheaders(data)
File "/usr/lib64/python2.7/httplib.py", line 1013, in endheaders
self._send_output(message_body)
File "/usr/lib64/python2.7/httplib.py", line 864, in _send_output
self.send(msg)
File "/usr/lib64/python2.7/httplib.py", line 826, in send
self.connect()
File "/usr/lib64/python2.7/httplib.py", line 1236, in connect
server_hostname=sni_hostname)
File "/usr/lib64/python2.7/ssl.py", line 350, in wrap_socket
_context=self)
File "/usr/lib64/python2.7/ssl.py", line 611, in __init__
self.do_handshake()
File "/usr/lib64/python2.7/ssl.py", line 833, in do_handshake
self._sslobj.do_handshake()
IOError: [Errno socket error] EOF occurred in violation of protocol (_ssl.c:579)

All our openssl libs and libraries are installed.
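
One thing that may be worth checking is whether Python is actually sending the request through the proxy. The sketch below is Python 3 (the traceback above comes from Python 2.7) and is not taken from stringMLST; `make_proxy_opener` and the proxy address are hypothetical names used only for illustration.

```python
import urllib.request


def make_proxy_opener(proxies=None):
    """Build a urllib opener that sends requests through the given proxies.

    If proxies is None, ProxyHandler falls back to the environment
    (for example the https_proxy variable).
    """
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler)


# Hypothetical proxy address; replace it with the real corporate proxy.
opener = make_proxy_opener({"https": "http://proxy.example.com:3128"})
# opener.open(url) would then fetch a URL through that proxy.
```

If a request made this way succeeds where the plain download fails, the proxy configuration is the likely culprit rather than the OpenSSL installation.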

Cut a 1.0.0 release

It's probably time to release 1.0. stringMLST has been feature complete and stable for quite some time now.

--getMLST command error: IndexError: child index out of range

Hi all,

We have a user running stringMLST v0.6.2 --getMLST command for neisseria spp. However, the --getMLST step seems to be pulling down an empty directory called neisseria_db:

image

As you can see from the image, the --getMLST step produces an error (IndexError: child index out of range) and also creates an empty directory called neisseria_db, which doesn't have the nmb subdirectory mentioned in the quick start guide. Any advice on how we can proceed, or are we doing something wrong?

Best,
Nishant Gerald

Cannot reproduce * with -z parameter

Hi!
I tried to reproduce your script with different values of the -z parameter (10, 100, 1000), as shown in the closed issue comment "ar0ch commented on Oct 25, 2016":

for i in {10,100,1000};~/stringMLST/stringMLST.py --predict -1 ./tests/fastqs/ERR026529_1.fastq -2 ./tests/fastqs/ERR026529_2.fastq -p -P ./tests/testdb -z $i -t
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529_ 231 180 306 612 269 277* 260 10174 11.21
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529_ 231 180* 306 612 269 277* 260 10174 12.00
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529_ 231 180* 306* 612 269 277* 260* 10174 12.05

This is my command with -z 1000 :

python3.6 stringMLST.py --predict -1 ./ERR026529_1.fastq.gz -2 ./ERR026529_2.fastq.gz -p -P ./neisseria/nmb -z 1000 -t
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529 231 180 306 612 269 277 260 10174 8.19

I don't see the "*" in adk, aroE, pdhC and pgm.

I use the latest version of stringMLST (0.5.1), bwa 0.7.16a-r1181, samtools 1-6, bedtools 2.26, and pyfaidx is installed.

Am I missing something?
Thank you very much,
-A

Doc for output format from `stringMLST.py --predict`

Could you please document the output format of stringMLST.py --predict and how the numbers and fields relate to the database files?

For example, what are the numbers in this table:

Sample	abcZ	adk	aroE	fumC	gdh	pdhC	pgm	ST
ERR026529	231	180	306	612	269	277	260	10174

Thanks,
Stephen

Different alleles predicted despite mapping showing they are the same

Hi everyone,

I have an issue that is confusing.
I have used stringMLST successfully before with a cgMLST scheme, and the results seemed to make sense.
For another project, I ran some data through stringMLST with the same scheme. The results mostly make sense, but not quite. We had a few strains that we are very certain are clonal, yet according to stringMLST they are all slightly above the accepted threshold for clonality (which is probably not valid, but that's not the point right now).
So I had a look at the diverging allele predictions. I mapped the reads with bowtie2 to the first alleles of all cgMLST loci (about 2200) and manually inspected the mapping of the diverging alleles in a genome viewer, just to be certain.
The diverging alleles are the same between at least 2 of the samples: stringMLST predicts them to be different, but the mapping is the same.
The diverging allele predictions are not tremendously off (1 or 2 bp), but they are still wrong.
The same alleles are affected across multiple samples, so this seems to be systematic.

Any idea what could be the cause of this?

This affects version 0.6.2, in case this matters.

Regards,
Bastian

EDIT: The samples I am talking about are ERR2232520 and ERR2232524.
I used this scheme https://www.cgmlst.org/ncs/schema/3560802/ , and the affected alleles for these 2 samples are 03340, 04010, 08730, 08930, 10560, 13310, 16340.
E.g. allele 08930 is predicted to be different in 5 samples that we expect to be the same (a different prediction in each of the 5 samples), so this is probably systematic.

python error with --coverage

In attempting to use '--coverage' I encounter the following python error, which I have been unable to correct. Admittedly, I am not a skilled python user.

Traceback (most recent call last):
File "/home/yrh8/Tools/stringMLST/stringMLST.py", line 1613, in
getCoverage(results)
File "/home/yrh8/Tools/stringMLST/stringMLST.py", line 790, in getCoverage
allele = gene+'_'+re.sub("*", "", str(results[sample][gene]))
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/lib/python2.7/re.py", line 244, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
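
The "nothing to repeat" error is what Python's `re` module raises when a pattern starts with a bare `*`, since `*` is a quantifier and has nothing in front of it to repeat. A minimal sketch of two ways to strip the asterisk follows; the helper names `strip_flag` and `strip_flag_plain` are hypothetical and not part of stringMLST.

```python
import re


def strip_flag(allele):
    """Remove the '*' marker from an allele call such as '277*'."""
    # Escape the asterisk so it is treated as a literal character.
    return re.sub(r"\*", "", str(allele))


def strip_flag_plain(allele):
    """Same result without a regular expression."""
    return str(allele).replace("*", "")


print(strip_flag("277*"))        # 277
print(strip_flag_plain("180*"))  # 180
```

Either form could replace the `re.sub("*", ...)` call shown in the traceback; the plain `replace` version avoids regex escaping altogether.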

Optional kmer depth coverage output

Output kmer depth for each allele profiled. This can be useful for determining why stringMLST fails to type a sample (e.g. coverage that is too low or absent) and for non-MLST screens (such as AMR).

ERROR while loading ST

Hello @ar0ch,

I am using stringMLST and get an ST assigned in the terminal, but the log shows several lines of "ERROR while loading ST".

Can anybody help?

Not compatible with python3

Currently some function calls are not compatible with python 3.5+

  • Automated porting via modernize
  • Check for regressions
  • Write test cases for porting script

-z parameter says 500 is not an integer

I have been using stringMLST with a PubMLST database. However, it is missing new alleles that have been identified in the isolates using two other methods. I tried adjusting the -z parameter in case that would help, but the only response I got was that the various numbers I tried (100, 500, 400, 350) were not integers.

Enterococcus faecium issue

Dear stringMLST author,
we encountered an issue while testing your software on Enterococcus faecium. The error reported is:

./stringMLST.py --predict -1 ~/SRR980587/SRR980587_1.fastq -2 ~/SRR980587/SRR980587_2.fastq -k 21 -P ENT


Traceback (most recent call last):
  File "./stringMLST.py", line 1145, in <module>
    results = singleSampleTool(fastq1, fastq2, paired, k, results)
  File "./stringMLST.py", line 181, in singleSampleTool
    finalProfile = getMaxCount(weightedProfile, fileName)
  File "./stringMLST.py", line 334, in getMaxCount
    compare = int(re.sub("\*$", "", str(max_n[loc])))
ValueError: invalid literal for int() with base 10: '3799252.24074'

The sample SRR980587 used for testing has been downloaded from:
www.ebi.ac.uk/ena/data/view/SRR980587.
Enterococcus faecium data are not present in the datasets folder, so we added them by downloading the allele files from PubMLST.

We created the config file following your README instructions:
[loci]
adk    datasets/Enterococcus_faecium/adk.fa
atpA    datasets/Enterococcus_faecium/atpA.fa
ddl    datasets/Enterococcus_faecium/ddl.fa
gdh    datasets/Enterococcus_faecium/gdh.fa
gyd    datasets/Enterococcus_faecium/gyd.fa
pstS    datasets/Enterococcus_faecium/pstS.fa
purK    datasets/Enterococcus_faecium/purK.fa
   
[profile]   
profile    datasets/Enterococcus_faecium/profile_Enterococcus_faecium.txt

some samples came back with traceback error or empty result

Hi,

I just installed stringMLST through bioconda on my VM (Ubuntu 18.04 LTS on Windows 7 host) and ran it on a bunch of samples using E coli 1 MLST scheme. About 80% of the samples gave a ST and 3.5% returned 0, which were expected.

However, of the remaining 16.5%, some returned an empty list and some gave a traceback error message. All of these problem samples were downloaded from NCBI. I had no problem running them through QC inspection or reference sequence alignment (bowtie2/samtools, etc.), so I know the files were not corrupted. I did notice, however, that most of the samples that returned an empty list were submitted from one source and contained contaminating reads from another serotype. I didn't observe any pattern among the ones that gave a traceback error, other than that the fastq.gz files might be on the small side (<50 MB each). I was wondering if there's an explanation and, better still, a fix for these samples.

Please see examples below (Ec is the prefix I gave E. coli MLST scheme 1 when I created DB):

ST = stringMLST.py --predict -1 DRR015930_1.fastq.gz -2 DRR015930_2.fastq.gz -P Ec
print(ST)

[]

ST = stringMLST.py --predict -1 ERR1777574_1.fastq.gz -2 ERR1777574_2.fastq.gz -P Ec
print(ST)

['Traceback (most recent call last):', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 1605, in ', ' results = singleSampleTool(fastq1, fastq2, paired, k, results)', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 399, in singleSampleTool', ' singleFileTool(fastq1, k)', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 452, in singleFileTool', ' fileExplorer(fastq, k, non_overlapping_window)', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 468, in fileExplorer', ' lines = f.readlines()', ' File "/home/florathecat/anaconda3/lib/python3.6/gzip.py", line 289, in read1', ' return self._buffer.read1(size)', ' File "/home/florathecat/anaconda3/lib/python3.6/_compression.py", line 68, in readinto', ' data = self.read(len(byte_view))', ' File "/home/florathecat/anaconda3/lib/python3.6/gzip.py", line 482, in read', ' raise EOFError("Compressed file ended before the "', 'EOFError: Compressed file ended before the end-of-stream marker was reached']

Thanks for your time.
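
The `EOFError` in the second example is raised by Python's `gzip` module when a compressed stream ends before its end-of-stream marker, which usually points to an incomplete download. A small sketch for checking a file before running a prediction is shown below; `gzip_is_complete` is a hypothetical helper, not a stringMLST function.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file at path decompresses to the end."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks so large FASTQ files are not held in memory.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # Truncated or otherwise unreadable gzip data.
        return False
    return True
```

Running `gzip -t` on the file from the shell performs a similar integrity test.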

Git repo too big with example dataset

When downloading the master branch as a zip file, the total size is 160.98M which is HUGE for a code base. Perhaps replace example read files folder with a script that would download them separately instead.

It is taking over 15 minutes to clone the repo now.

'{speciesName}' does not exist on PubMLST

$ stringMLST.py --getMLST -P neisseria/nmb --species neisseria  
Traceback (most recent call last):
  File "/home/hadoop/anaconda2/bin/stringMLST.py", line 4, in <module>
    __import__('pkg_resources').run_script('stringMLST==0.6.1', 'stringMLST.py')
  File "/home/hadoop/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 664, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/hadoop/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1450, in run_script
    script_code = compile(script_text, script_filename, 'exec')
  File "/home/hadoop/anaconda2/lib/python2.7/site-packages/stringMLST-0.6.1-py2.7.egg/EGG-INFO/scripts/stringMLST.py", line 275
    print(f"This usually means the provided species, '{speciesName}', does not exist on PubMLST")
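
The traceback stops at an f-string on line 275 while the script is being compiled by Python 2.7. F-strings are only valid syntax from Python 3.6 onward, which would be consistent with a failure at compile time. A sketch of the same message built with `str.format` follows; the function name `species_not_found_message` is hypothetical.

```python
def species_not_found_message(species_name):
    """Build the warning text without an f-string."""
    # str.format with auto-numbered fields works on Python 2.7 and Python 3.
    return (
        "This usually means the provided species, '{}', "
        "does not exist on PubMLST".format(species_name)
    )


print(species_not_found_message("neisseria"))
```

Alternatively, running the script under Python 3.6 or later would accept the f-string as written.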

Given 'random' ST type when no k-mers match any allele

When running a Strep. pneumoniae sample against the Listeria monocytogenes database to see what a failure looks like, I get the following result.

No k-mer matches were found for the sample . strain_R1_001.fastq and strain_R2_001.fastq. Probable cause of the error: low quality data/too many N's in the data
Sample ST Time
strain 344 15.88

I noticed in the code that the "No k-mer match" statement was followed by an exit statement, but it was commented out. I think reporting back to the user with N/A or no result would be better than just exiting. If someone is running in batch mode or combining individual results, it's very useful to know which strain actually had no result. If a strain does not show up at all, there is no way to tell whether the software failed or no k-mers were found.
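
A rough sketch of the kind of reporting suggested here: every sample gets a row, and a placeholder fills the ST column when nothing could be assigned. The function `format_result_row` and the `NA` placeholder are assumptions for illustration, not existing stringMLST behaviour.

```python
def format_result_row(sample, st=None, elapsed=None, missing="NA"):
    """Return one tab-separated output row for a sample.

    When no sequence type could be assigned (st is None), the row still
    appears, with a placeholder in the ST column.
    """
    fields = [sample, missing if st is None else str(st)]
    if elapsed is not None:
        fields.append("{:.2f}".format(elapsed))
    return "\t".join(fields)


print(format_result_row("strain", st=None, elapsed=15.88))
```

With a row like this, a batch script can tell "ran but found nothing" apart from "never ran".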

BuildDB will incorrectly parse ST profile data

I created a database for Listeria monocytogenes from http://bigsdb.pasteur.fr/listeria/ and the build finished correctly. However, when running the prediction, one allele was missing from the results and no ST was found.

Looks like the issue is with this section of code: https://github.com/anujg1991/stringMLST/blob/master/stringMLST.py#L438

The script always assumes there is only ONE extra column at the end of the ST profile. In the Listeria case, there are two extra columns, "CC" and "lineage".
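
One way to avoid counting trailing columns is to pick the locus columns by header name. The sketch below assumes a tab-separated profile with a header row containing `ST` and the locus names; `parse_profile` is a hypothetical function and the example rows are made up for illustration.

```python
def parse_profile(lines, loci):
    """Map each ST to its allele numbers, reading columns by header name.

    lines: iterable of tab-separated rows, header first.
    loci:  locus names to extract; any other column (CC, lineage, ...)
           is ignored regardless of how many there are.
    """
    rows = iter(lines)
    header = next(rows).rstrip("\n").split("\t")
    st_col = header.index("ST")
    locus_cols = [header.index(locus) for locus in loci]
    profiles = {}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) <= max(locus_cols + [st_col]):
            continue  # skip blank or short rows
        profiles[fields[st_col]] = [fields[i] for i in locus_cols]
    return profiles


# Made-up two-locus profile with two extra trailing columns.
example = [
    "ST\tabcZ\tbglA\tCC\tlineage\n",
    "1\t3\t1\tCC1\tI\n",
    "2\t1\t1\tCC2\tIV\n",
]
print(parse_profile(example, ["abcZ", "bglA"]))
```

Because the lookup is by name, the same code handles profiles with zero, one, or several extra columns.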

Allele order not consistent when using different sample names

The order of the alleles can differ when using different strain names, e.g.:

Sample mdh gyrB recA fumC purA icd adk ST
strain1 1 1 2 3 4 5 5

Sample purA icd adk mdh gyrB recA fumC ST
strain2 4 5 4 1 1 2 3 5

Makes it difficult to merge the two individual runs together into a single report.
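
Until the column order is made stable, one workaround is to merge reports by column name instead of by position. This is a sketch under the assumption that each report is tab-separated with its own header line; `merge_reports` is a hypothetical helper and the two small reports are made-up values.

```python
import csv
import io


def merge_reports(reports, columns):
    """Combine tab-separated reports into rows with one fixed column order.

    reports: iterable of text blocks, each with its own header line.
    columns: the column order wanted in the merged output.
    """
    merged = []
    for text in reports:
        for record in csv.DictReader(io.StringIO(text), delimiter="\t"):
            # Look values up by column name, so header order is irrelevant.
            merged.append([record.get(col, "") for col in columns])
    return merged


# Made-up reports whose columns are in different orders.
run1 = "Sample\tmdh\tadk\tST\ns1\t1\t5\t10\n"
run2 = "Sample\tadk\tmdh\tST\ns2\t6\t2\t11\n"
for row in merge_reports([run1, run2], ["Sample", "adk", "mdh", "ST"]):
    print("\t".join(row))
```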

running time

Hi Aroon,
Happy New Year.

I noticed a discrepancy between the time reported with "-t" option by stringMLST and the time measured by /usr/bin/time -v.

This is what is reported by stringMLST:
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529 231 180* 306 612 269* 277* 260 10174 7.67

And this is reported by /usr/bin/time -v

Command being timed: "./stringMLST_iss36.py --predict -P neisseria/nmb -1 ERR026529_1.fastq -2 ERR026529_2.fastq -x -t"
User time (seconds): 19.86
System time (seconds): 1.82
Percent of CPU this job got: 89%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.28

Why are they different?

Bests,
-A

No k-mer matches were found for the sample ERR026529....

Hi,
I am trying to get stringMLST to work, and I am following the Quickstart guide.
Using this command line (after pip installing, downloading database and data):
stringMLST.py --predict -P neisseria/nmb -1 ERR026529_1.fastq.gz -2 ERR026529_2.fastq.gz
I get this result:

No k-mer matches were found for the sample ERR026529_1.fastq.gz and ERR026529_2.fastq.gz. Probable cause of the error: low quality data/too many N's in the data
Sample ST
ERR026529 ST

I checked both the data and the database, and they seem to be all right.
OS: Ubuntu 14.04
Python: Python 3.6.0 |Anaconda 4.3.1 (64-bit)

Cannot build db with lower case characters sequence in fasta file

I could not build the database because some of my sequences contain lower-case characters. I got the following stack trace:

Info: Making DB for k = 35
Info: Making DB with prefix = LM
Traceback (most recent call last):
File "/share/apps/stringMLST/stringMLST.py", line 1079, in
makeCustomDB(config,k,dbPrefix)
File "/share/apps/stringMLST/stringMLST.py", line 671, in makeCustomDB
formKmerDB(configDict,k,output_filename)
File "/share/apps/stringMLST/stringMLST.py", line 613, in formKmerDB
string = key+'\t'+key1+'\t'+str(kmerDict[key][key1]).replace(" ","")+'\n'
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

It looks like the reverseComplement function only works on upper-case characters. Perhaps you should convert all input sequences to upper case before attempting the reverse complement.
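
A minimal sketch of a case-insensitive reverse complement, written for Python 3; it is an illustration and not the `reverseComplement` function from stringMLST.

```python
# Translation table for the four bases plus N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement, accepting lower- or upper-case input."""
    # Upper-case first so 'acgt' and 'ACGT' are handled identically.
    return seq.upper().translate(_COMPLEMENT)[::-1]


print(reverse_complement("acgtTn"))  # NAACGT
```

Characters outside A, C, G, T and N pass through unchanged in this sketch, so other IUPAC codes would need extra entries in the table.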

Formatting of the locus files ("Allele name in locus file should be seperated by '_' or '-'")

Sorry, I need to ask something again.
When I try to build the DB, I get the following:

Info: Making DB for k =  35
Info: Making DB with prefix = /exports/mm-hpc/bacteriologie/bastian/data/Cdif/cgmlst/stringMLST_db/cdif_cgmlst
Info: Log file written to  /exports/mm-hpc/bacteriologie/bastian/data/Cdif/cgmlst/stringMLST_db/cdif_cgmlst.log
Error : Allele name in locus file should be seperated by '_' or '-'

I'm unsure what it refers to.
My fasta files with the different loci are called e.g. CD630_00010.fasta , but the names of the fasta entries are only integers (>1, >2, >3). Do I need to change these fasta headers?

In the profile file, does it then still need only the integers, even if I rename all the headers to e.g. >CD630_00010_1, >CD630_00010_2 etc?

Am I guessing correctly here?

Thanks,
Bastian
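
If the error message does mean that each FASTA header must look like locus, separator, allele number (an assumption based only on the wording of the error), the headers could be rewritten with something like the sketch below. `prefix_fasta_headers` is a hypothetical helper and the sequences are made up.

```python
def prefix_fasta_headers(lines, locus, sep="_"):
    """Yield FASTA lines with each header rewritten as >locus<sep>allele."""
    for line in lines:
        if line.startswith(">"):
            allele = line[1:].strip()
            yield ">{}{}{}\n".format(locus, sep, allele)
        else:
            # Sequence lines are passed through untouched.
            yield line


# Made-up records with integer-only headers.
records = [">1\n", "ATGACC\n", ">2\n", "ATGACT\n"]
print("".join(prefix_fasta_headers(records, "CD630_00010")), end="")
```

This does not settle whether the profile file should keep plain integers; that part of the question still needs an answer from the maintainers.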

--getMLST fails due to SSL self-signed cert error

When I tried updating my mlst database using --getMLST, it failed with ssl errors. I was able to correct the problem by importing ssl and adding in a line to ignore the certs. The attached version of stringMLST.py contains the patched code. I've confirmed it works.

patch.zip
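
The contents of the attached patch are not shown here. As a general illustration of the approach described (not the poster's actual code), certificate verification can be switched off in Python 3 as sketched below. The function names are hypothetical, and disabling verification removes protection against man-in-the-middle attacks, so it should be treated as a temporary workaround.

```python
import ssl
import urllib.request


def unverified_context():
    """Return an SSL context that skips certificate verification.

    This removes protection against man-in-the-middle attacks, so it is
    only a workaround for hosts with self-signed certificates.
    """
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch(url):
    """Download url without verifying the server certificate."""
    with urllib.request.urlopen(url, context=unverified_context()) as resp:
        return resp.read()
```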

How do you define an unknown allele in the profile file?

Hey there,

I'm attempting to run stringMLST for a cgMLST scheme.
I'm about to build everything needed to run it: I have the allele fasta files and will now create the profile file, where all the cgMLST types and their alleles are listed.
Since it's a cgMLST scheme, not every gene has an allele in every profile.
I have multiple profiles where some alleles are put down as "?" in the definition on the cgMLST website.
Do I also put a ? in the profile file? Or something else?
Any advice :)?
(I'll try with a ? in there anyway)

Bastian

[Breaking change] Make options and syntax more obvious

Make program flags more obvious:

-P/--prefix is awkward for setting the database. Move to -db/--database

-x Replaced with -f/--force, standard syntax for forcibly running/overwriting

Update help statements to be module specific
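
To make the proposal concrete, here is a sketch of how the renamed flags might be declared with `argparse`. It covers only the two options mentioned above; the help texts and the `build_parser` function are assumptions, and this is not the actual stringMLST command-line definition.

```python
import argparse


def build_parser():
    """Parser using the proposed flag names (a sketch, not the real CLI)."""
    parser = argparse.ArgumentParser(prog="stringMLST.py")
    parser.add_argument("-db", "--database", required=True,
                        help="prefix of the k-mer database to use")
    parser.add_argument("-f", "--force", action="store_true",
                        help="forcibly run / overwrite")
    return parser


args = build_parser().parse_args(["-db", "neisseria/nmb", "--force"])
print(args.database, args.force)  # neisseria/nmb True
```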

Change of license

stringMLST will be relicensed under CC Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) on Friday, April 21, 2017.

We feel CC BY-NC-SA 4.0 accurately reflects our original intent and that no additional restrictions or limitations are placed on users of stringMLST with this change. We hope this will clarify the licensing of stringMLST, and facilitate third-party packages and derivative works. stringMLST will remain Open Source Software and free for noncommercial use.

0 versus empty

Hi!
I ran stringMLST on the sample from EBI SRA database ERR024377 (S. enterica):

stringMLST.py
--predict
-1 ERR024377_1.fastq.gz
-2 ERR024377_2.fastq.gz
-k 35
--prefix ERR024377
--output res.txt

and the res.txt file contains the following:

Sample aroC dnaN purE ST
ERR024377 345 342 0 0

Since there are 7 loci in total (aroC, dnaN, hemD, hisD, purE, sucA, thrA), what is the difference between purE, which got a 0, and hemD, hisD, sucA and thrA, which got an empty value?

Bests,
-A

IndexError: child index out of range

If I run the following line:
stringMLST.py --getMLST -P Listeria --species Listeria monocytogenes

I get the following error:

Preparing: Listeria
Traceback (most recent call last):
File "/mnt/disk1/bin/miniconda3/bin/stringMLST.py", line 1562, in
profileURL, loci = get_links(dbRoot, filePrefix, species)
File "/mnt/disk1/bin/miniconda3/bin/stringMLST.py", line 268, in get_links
profileURL = child[1].text
IndexError: child index out of range
