jordanlab / stringMLST
Fast k-mer based tool for multi locus sequence typing (MLST)
License: Other
stringMLST.py --buildDB breaks if there is a 'U' in the multi-FASTA locus file, because the reverse complement dictionary has no key for 'U'.
Hi,
Thanks for your work on this tool. I'm interested in pairing it with rMLST like you describe in the original paper, so I obtained the database and have been trying to build a kmer database for classifying samples. However, the memory required seems to be far above 128 GB, and I've killed the build script before the OOM killer activates.
Any advice on building the DB? I suppose I could do it on AWS if needed, or construct the DB manually using jellyfish.
Hello,
while trying to run stringMLST on a machine behind a corporate proxy, we get the following error message:
[xxx@xxx ~]$ stringMLST.py --getMLST -P datasets/ --species all
Using a kmer size of 35 for all databases.
Preparing: Achromobacter spp.
Traceback (most recent call last):
File "/usr/local/bin/stringMLST.py", line 1639, in <module>
profileURL = get_links(key,schemes)
File "/usr/local/bin/stringMLST.py", line 263, in get_links
xml = urlopen(URL)
File "/usr/lib64/python2.7/urllib.py", line 87, in urlopen
return opener.open(url)
File "/usr/lib64/python2.7/urllib.py", line 208, in open
return getattr(self, name)(url)
File "/usr/lib64/python2.7/urllib.py", line 359, in open_http
return self.http_error(url, fp, errcode, errmsg, headers)
File "/usr/lib64/python2.7/urllib.py", line 372, in http_error
result = method(url, fp, errcode, errmsg, headers)
File "/usr/lib64/python2.7/urllib.py", line 665, in http_error_301
return self.http_error_302(url, fp, errcode, errmsg, headers, data)
File "/usr/lib64/python2.7/urllib.py", line 635, in http_error_302
data)
File "/usr/lib64/python2.7/urllib.py", line 661, in redirect_internal
return self.open(newurl)
File "/usr/lib64/python2.7/urllib.py", line 208, in open
return getattr(self, name)(url)
File "/usr/lib64/python2.7/urllib.py", line 437, in open_https
h.endheaders(data)
File "/usr/lib64/python2.7/httplib.py", line 1013, in endheaders
self._send_output(message_body)
File "/usr/lib64/python2.7/httplib.py", line 864, in _send_output
self.send(msg)
File "/usr/lib64/python2.7/httplib.py", line 826, in send
self.connect()
File "/usr/lib64/python2.7/httplib.py", line 1236, in connect
server_hostname=sni_hostname)
File "/usr/lib64/python2.7/ssl.py", line 350, in wrap_socket
_context=self)
File "/usr/lib64/python2.7/ssl.py", line 611, in __init__
self.do_handshake()
File "/usr/lib64/python2.7/ssl.py", line 833, in do_handshake
self._sslobj.do_handshake()
IOError: [Errno socket error] EOF occurred in violation of protocol (_ssl.c:579)
All our OpenSSL packages and libraries are installed.
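One thing worth checking in a proxy setup is whether Python's URL opener is actually routed through the proxy. The following is a minimal Python 3 sketch, not the tool's own code: the helper name build_proxy_opener and the proxy address are hypothetical, and it assumes the proxy is either passed in explicitly or exported in the usual http_proxy / https_proxy environment variables.

```python
import urllib.request


def build_proxy_opener(proxies=None):
    """Return an opener that routes requests through the given proxies.

    `proxies` maps a scheme to a proxy URL, e.g. {"https": "http://proxy:3128"}.
    When it is None, the mapping is read from the environment
    (http_proxy / https_proxy) via urllib.request.getproxies().
    """
    if proxies is None:
        proxies = urllib.request.getproxies()
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler)


# Hypothetical proxy address, for illustration only.
opener = build_proxy_opener({"https": "http://proxy.example.org:3128"})
# opener.open("https://pubmlst.org/") would now be sent through the proxy.
```

If the download works through such an opener but not through the script, the proxy settings are probably not reaching the script's environment.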
Produce an error and exit when no files (in the right format) are found in the query directory
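A minimal sketch of such a check, assuming that "the right format" means FASTQ files recognised by their extension. The function names and the list of extensions are hypothetical and are not taken from stringMLST.

```python
import glob
import os
import sys

# Assumed set of accepted read-file extensions; the real tool may accept others.
FASTQ_PATTERNS = ("*.fastq", "*.fq", "*.fastq.gz", "*.fq.gz")


def find_fastq_files(directory):
    """Return a sorted list of FASTQ files found directly in `directory`."""
    found = []
    for pattern in FASTQ_PATTERNS:
        found.extend(glob.glob(os.path.join(directory, pattern)))
    return sorted(found)


def require_fastq_files(directory):
    """Exit with a non-zero status when `directory` holds no FASTQ files."""
    files = find_fastq_files(directory)
    if not files:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("Error: no FASTQ files found in '%s'" % directory)
    return files
```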
What happens if a strain carries an allele whose particular sequence is new to the database? How is this represented to the user? SRST2 marks such calls with a * character beside the allele number.
It's probably time to release 1.0. stringMLST has been feature complete and stable for quite some time now.
Attempted to build a database where some alleles contain 'N'. The reverse complement function handles all other IUPAC codes, but not 'N'.
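For reference, a reverse complement that covers 'N', 'U' and lower-case input could look like the sketch below. This is an illustration only, not the function in stringMLST.py; the names _COMPLEMENT and reverse_complement are hypothetical, and treating 'U' like 'T' is an assumption.

```python
# Complement table for the DNA IUPAC codes, plus 'U' (treated like 'T').
# Each ambiguity code maps to the code for the complementary base set,
# so 'N', 'S' and 'W' map to themselves.
_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "U": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W", "K": "M", "M": "K",
    "B": "V", "V": "B", "D": "H", "H": "D", "N": "N",
}


def reverse_complement(seq):
    """Reverse-complement `seq`, upper-casing it first.

    Raises ValueError on characters outside the table instead of
    silently returning None.
    """
    seq = seq.upper()
    try:
        return "".join(_COMPLEMENT[base] for base in reversed(seq))
    except KeyError as err:
        raise ValueError("unsupported base %s in sequence" % err)
```

Raising an error on an unknown character makes the failure visible at the offending sequence rather than later, when the database file is written.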
Hi all,
We have a user running the stringMLST v0.6.2 --getMLST command for neisseria spp. However, the --getMLST step only seems to create an empty directory called neisseria_db.
The --getMLST step fails with IndexError: child index out of range and leaves behind an empty neisseria_db directory, without the nmb subdirectory mentioned in the quick start guide. Any advice on how we can proceed, or on what we are doing wrong?
Best,
Nishant Gerald
Hi!
I tried to reproduce your run with the different fuzzy values (10, 100, 1000) shown in the closed issue "ar0ch commented on Oct 25, 2016":
for i in 10 100 1000; do ~/stringMLST/stringMLST.py --predict -1 ./tests/fastqs/ERR026529_1.fastq -2 ./tests/fastqs/ERR026529_2.fastq -p -P ./tests/testdb -z $i -t; done
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529_ 231 180 306 612 269 277* 260 10174 11.21
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529_ 231 180* 306 612 269 277* 260 10174 12.00
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529_ 231 180* 306* 612 269 277* 260* 10174 12.05
This is my command with -z 1000 :
python3.6 stringMLST.py --predict -1 ./ERR026529_1.fastq.gz -2 ./ERR026529_2.fastq.gz -p -P ./neisseria/nmb -z 1000 -t
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529 231 180 306 612 269 277 260 10174 8.19
I don't see the "*" for adk, aroE, pdhC and pgm.
I use the latest version of stringMLST (0.5.1), bwa 0.7.16a-r1181, samtools 1-6 and bedtools 2.26, and pyfaidx is installed.
Am I missing something?
Thank you very much,
-A
Could you please document the output format of stringMLST.py --predict and how the numbers and fields relate to the database files?
For example, what are the counts in this table:
Sample abcZ adk aroE fumC gdh pdhC pgm ST
ERR026529 231 180 306 612 269 277 260 10174
Thanks,
Stephen
Hi everyone,
I have an issue which is... confusing.
I have used stringMLST successfully before with a cgMLST scheme, and the results seemed to make sense.
For another project, I ran more data through stringMLST with the same scheme. The results mostly make sense, but not quite. We have a few strains that we are very certain are clonal, yet according to stringMLST they are all slightly above the accepted threshold for clonality (which is probably not valid, but that's not the point right now).
So I had a look at the diverging allele predictions. I mapped the reads with bowtie2 to the first allele of every cgMLST locus (about 2200) and inspected the mappings of the diverging alleles manually in a genome viewer, just to be certain.
The diverging alleles are identical between at least 2 of the samples: stringMLST predicts different alleles, but the mappings are the same.
The diverging allele predictions are not far off (1 or 2 bp), but they are still wrong.
The same alleles are affected across multiple samples, so this seems to be systematic.
Any idea what could be the cause of this?
This affects version 0.6.2, in case this matters.
Regards,
Bastian
EDIT: The samples I am talking about are ERR2232520 and ERR2232524.
I used this scheme https://www.cgmlst.org/ncs/schema/3560802/ , and the affected alleles for these 2 samples are 03340, 04010, 08730, 08930, 10560, 13310, 16340.
E.g. allele 08930 is predicted differently in each of 5 samples that we expect to be the same, so this is probably systematic.
It's unclear from README
In attempting to use '--coverage' I encounter the following python error, which I have been unable to correct. Admittedly, I am not a skilled python user.
Traceback (most recent call last):
File "/home/yrh8/Tools/stringMLST/stringMLST.py", line 1613, in <module>
getCoverage(results)
File "/home/yrh8/Tools/stringMLST/stringMLST.py", line 790, in getCoverage
allele = gene+'_'+re.sub("*", "", str(results[sample][gene]))
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/lib/python2.7/re.py", line 244, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
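The "nothing to repeat" error comes from the pattern "*" in the re.sub call shown in the traceback: a bare * is a regex quantifier with nothing in front of it. A minimal sketch of two ways to strip the marker, using hypothetical helper names rather than the tool's own functions:

```python
import re


def strip_star(allele):
    """Remove the '*' marker from an allele call such as '277*'."""
    # '*' is a regex quantifier, so it must be escaped to match literally.
    return re.sub(r"\*", "", str(allele))


def strip_star_plain(allele):
    """Same result without a regular expression."""
    return str(allele).replace("*", "")
```

Applied to the failing line, the pattern argument would become r"\*" (or the call would be replaced by str.replace).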
Output k-mer depth for each allele profiled. This can be useful for determining why stringMLST fails to assign a type (e.g. too little or no coverage) and for non-MLST screens (such as AMR).
Handle broken links, empty files and odd HTTP status codes more gracefully. Currently, stringMLST crashes when it encounters bad input. This is increasingly becoming an issue: PubMLST has pushed several empty or otherwise malformed profile definitions over the last month.
This will also help address jordanlab/stringMLST_datasets#1
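One possible shape for this, sketched in Python 3 under the assumption that a failed or empty download should be skipped with a warning rather than abort the whole run. The function name fetch_text and its opener parameter are hypothetical.

```python
import urllib.error
import urllib.request


def fetch_text(url, opener=urllib.request.urlopen, timeout=30):
    """Return the decoded body of `url`, or None if it cannot be used.

    None is returned for HTTP errors, unreachable hosts and empty bodies,
    so the caller can skip the scheme and continue with the next one.
    """
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read()
    except (urllib.error.HTTPError, urllib.error.URLError, OSError) as err:
        print("Warning: could not download %s (%s)" % (url, err))
        return None
    if not body.strip():
        print("Warning: %s is empty, skipping" % url)
        return None
    return body.decode("utf-8", errors="replace")
```

The opener argument exists only so the function can be exercised without network access; in normal use the default is enough.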
What happens if strain has missing alleles? How does the user know it was missing?
Hello @ar0ch,
I am using stringMLST and get an ST assigned in the terminal, but the log shows several lines of "ERROR while loading ST".
Can anybody help?
Currently some function calls are not compatible with python 3.5+
modernize
https://github.com/anujg1991/stringMLST/blob/master/stringMLST.py#L941
When running the script with no arguments, the fuzzy option is not mentioned at all. If run with -h, it appears.
See https://github.com/jordanlab/stringMLST#usage-documentation
It seems to be Markdown within a Markdown verbatim block?
I have been using stringMLST with a PubMLST database. However, it is missing new alleles that have been identified in the isolates using two other methods. I tried adjusting the -z parameter in case that would help, but the only response I got was that the various numbers I tried (100, 500, 400, 350) were not integers.
Dear string MLST author,
we encountered an issue while testing your software on Enterococcus faecium. The error reported is:
./stringMLST.py --predict -1 ~/SRR980587/SRR980587_1.fastq -2 ~/SRR980587/SRR980587_2.fastq -k 21 -P ENT
Traceback (most recent call last):
File "./stringMLST.py", line 1145, in <module>
results = singleSampleTool(fastq1, fastq2, paired, k, results)
File "./stringMLST.py", line 181, in singleSampleTool
finalProfile = getMaxCount(weightedProfile, fileName)
File "./stringMLST.py", line 334, in getMaxCount
compare = int(re.sub("\*$", "", str(max_n[loc])))
ValueError: invalid literal for int() with base 10: '3799252.24074'
The sample SRR980587 used for testing has been downloaded from:
www.ebi.ac.uk/ena/data/view/SRR980587.
Enterococcus faecium data are not present in the datasets folder, so we added them by downloading the allele files from PubMLST.
We created the config file following your README instructions:
[loci]
adk datasets/Enterococcus_faecium/adk.fa
atpA datasets/Enterococcus_faecium/atpA.fa
ddl datasets/Enterococcus_faecium/ddl.fa
gdh datasets/Enterococcus_faecium/gdh.fa
gyd datasets/Enterococcus_faecium/gyd.fa
pstS datasets/Enterococcus_faecium/pstS.fa
purK datasets/Enterococcus_faecium/purK.fa
[profile]
profile datasets/Enterococcus_faecium/profile_Enterococcus_faecium.txt
Hi,
I just installed stringMLST through bioconda on my VM (Ubuntu 18.04 LTS on Windows 7 host) and ran it on a bunch of samples using E coli 1 MLST scheme. About 80% of the samples gave a ST and 3.5% returned 0, which were expected.
However, for the remaining 16.5%, some returned an empty list and some gave a traceback error message. All of these problem samples were downloaded from NCBI. I had no problem running them through QC inspection or reference sequence alignment (bowtie2/samtools, etc.), so I know the files were not corrupted. Nevertheless, I did notice that most of the samples that returned an empty list were submitted from one source and contained contaminating reads from another serotype. I didn't observe any pattern for the ones that gave me a traceback error, other than that the fastq.gz files might be on the small side (<50 MB each). I was wondering if there's an explanation, or better, a fix, for these samples.
Please see examples below (Ec is the prefix I gave E. coli MLST scheme 1 when I created DB):
ST = stringMLST.py --predict -1 DRR015930_1.fastq.gz -2 DRR015930_2.fastq.gz -P Ec
print(ST)
[]
ST = stringMLST.py --predict -1 ERR1777574_1.fastq.gz -2 ERR1777574_2.fastq.gz -P Ec
print(ST)
['Traceback (most recent call last):', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 1605, in ', ' results = singleSampleTool(fastq1, fastq2, paired, k, results)', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 399, in singleSampleTool', ' singleFileTool(fastq1, k)', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 452, in singleFileTool', ' fileExplorer(fastq, k, non_overlapping_window)', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 468, in fileExplorer', ' lines = f.readlines()', ' File "/home/florathecat/anaconda3/lib/python3.6/gzip.py", line 289, in read1', ' return self._buffer.read1(size)', ' File "/home/florathecat/anaconda3/lib/python3.6/_compression.py", line 68, in readinto', ' data = self.read(len(byte_view))', ' File "/home/florathecat/anaconda3/lib/python3.6/gzip.py", line 482, in read', ' raise EOFError("Compressed file ended before the "', 'EOFError: Compressed file ended before the end-of-stream marker was reached']
Thanks for your time.
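The EOFError in the traceback above is what Python's gzip module raises when a compressed stream ends before its end-of-stream marker, which usually points to a truncated .gz file. A minimal sketch for checking a file before running a prediction (the function name gzip_is_complete is hypothetical):

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1024 * 1024):
    """Return True if `path` decompresses cleanly to the end of the stream."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks so large FASTQ files are not held in memory.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError is what a truncated download typically raises.
        return False
    return True
```

On the command line, gzip -t performs a comparable integrity test.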
When downloading the master branch as a zip file, the total size is 160.98M which is HUGE for a code base. Perhaps replace example read files folder with a script that would download them separately instead.
It is taking over 15 minutes to clone the repo now.
$ stringMLST.py --getMLST -P neisseria/nmb --species neisseria
Traceback (most recent call last):
File "/home/hadoop/anaconda2/bin/stringMLST.py", line 4, in <module>
__import__('pkg_resources').run_script('stringMLST==0.6.1', 'stringMLST.py')
File "/home/hadoop/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 664, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/hadoop/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1450, in run_script
script_code = compile(script_text, script_filename, 'exec')
File "/home/hadoop/anaconda2/lib/python2.7/site-packages/stringMLST-0.6.1-py2.7.egg/EGG-INFO/scripts/stringMLST.py", line 275
print(f"This usually means the provided species, '{speciesName}', does not exist on PubMLST")
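The line the traceback stops at is an f-string, which requires Python 3.6 or later, while the paths show the script being compiled by a Python 2.7 interpreter. For illustration, the same message can be built with str.format, which works on both; the helper name below is hypothetical.

```python
def missing_species_message(species_name):
    """Build the warning text without an f-string."""
    return ("This usually means the provided species, '{0}', "
            "does not exist on PubMLST".format(species_name))


print(missing_species_message("neisseria"))
```

Running the installed script under a Python 3.6+ environment avoids the problem without changing the code.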
When running Strep. pneumoniae reads against the Listeria monocytogenes database to see what a failure looks like, I get the following result.
No k-mer matches were found for the sample . strain_R1_001.fastq and strain_R2_001.fastq. Probable cause of the error: low quality data/too many N's in the data
Sample ST Time
strain 344 15.88
I noticed in the code that the "No k-mer matches" statement was followed by an exit statement, but it was commented out. I think reporting N/A or "no result" back to the user would be better than just exiting. If someone is running in batch mode or combining individual results, it's very useful to know which strains actually had no result. If they do not show up at all, there is no way to tell whether the software failed or no k-mers were found.
Created a database for Listeria monocytogenes from http://bigsdb.pasteur.fr/listeria/ and it finished correctly. However, when running the prediction, one allele was missing from the results and no ST was found.
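A small sketch of the suggested behaviour, assuming a tab-separated row per sample and "NA" as the placeholder. Both the function name and the placeholder are illustrative choices, not the tool's actual output format.

```python
def format_result_row(sample, loci, calls):
    """Return one tab-separated result line for `sample`.

    `calls` maps locus names (and "ST") to predicted values; anything
    missing is written as "NA" so the sample still shows up in the report.
    """
    fields = [sample]
    for column in list(loci) + ["ST"]:
        fields.append(str(calls.get(column, "NA")))
    return "\t".join(fields)


# A sample with no k-mer matches still produces a row.
print(format_result_row("strain", ["abcZ", "adk"], {}))
```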
Looks like the issue is with this section of code: https://github.com/anujg1991/stringMLST/blob/master/stringMLST.py#L438
The script always assumes there is only ONE extra column after the allele columns in the profile file. In the Listeria case, there are two extra columns, "CC" and "lineage".
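One way to avoid depending on the number of trailing columns is to select the locus columns by name. The sketch below assumes a tab-separated profile file with a header row and the ST in the first column; read_profiles is a hypothetical helper, not the code at the linked line.

```python
import csv
import io


def read_profiles(handle, loci):
    """Map each allele combination to its ST, ignoring extra columns.

    `handle` is an open tab-separated profile file whose first column is
    the ST; `loci` lists the locus columns to use. Trailing annotation
    columns (for example CC or lineage) are skipped because only the
    named columns are read.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    st_column = reader.fieldnames[0]
    profiles = {}
    for row in reader:
        key = tuple(row[locus] for locus in loci)
        profiles[key] = row[st_column]
    return profiles


# Made-up two-locus profile with two extra columns, for illustration.
example = "ST\tabcZ\tbglA\tCC\tLineage\n1\t3\t1\tCC1\tI\n2\t1\t1\tCC2\tI\n"
print(read_profiles(io.StringIO(example), ["abcZ", "bglA"]))
```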
No matter how high I set the -z value, there are no stars. Removing the backslash in the line below fixes the problem for me.
Line 604 in 84e7433
The order of the alleles can differ when a different strain name is used,
e.g.
Sample mdh gyrB recA fumC purA icd adk ST
strain1 1 1 2 3 4 5 5
Sample purA icd adk mdh gyrB recA fumC ST
strain2 4 5 4 1 1 2 3 5
Makes it difficult to merge the two individual runs together into a single report.
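Until the column order is stable, individual reports can be merged by column name instead of position. A minimal sketch, assuming each report is a whitespace-separated header line followed by one sample line, and that sample names contain no spaces; the function names are hypothetical.

```python
def parse_report(text):
    """Parse a two-line report into (sample, {column: value})."""
    header, row = [line.split() for line in text.strip().splitlines()[:2]]
    return row[0], dict(zip(header[1:], row[1:]))


def merge_reports(reports):
    """Merge reports into one table with a single, sorted column order."""
    parsed = [parse_report(text) for text in reports]
    loci = {col for _, calls in parsed for col in calls if col != "ST"}
    columns = sorted(loci) + ["ST"]
    lines = ["\t".join(["Sample"] + columns)]
    for sample, calls in parsed:
        # Columns absent from a report are filled with "NA".
        lines.append("\t".join([sample] + [calls.get(c, "NA") for c in columns]))
    return "\n".join(lines)
```

Sorting the locus names is an arbitrary choice here; any fixed order, such as the order in the database config file, would serve the same purpose.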
Hi Aroon,
Happy New Year.
I noticed a discrepancy between the time reported with "-t" option by stringMLST and the time measured by /usr/bin/time -v.
This is what is reported by stringMLST:
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529 231 180* 306 612 269* 277* 260 10174 7.67
And this is reported by /usr/bin/time -v
Command being timed: "./stringMLST_iss36.py --predict -P neisseria/nmb -1 ERR026529_1.fastq -2 ERR026529_2.fastq -x -t"
User time (seconds): 19.86
System time (seconds): 1.82
Percent of CPU this job got: 89%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.28
Why are they different?
Bests,
-A
Users should be able to specify where the log file for an individual run is written, instead of everything being pooled into a single common log.
Hi,
I am trying to get stringMLST to work, and I am following the Quickstart guide.
Using this command line (after pip installing, downloading database and data):
stringMLST.py --predict -P neisseria/nmb -1 ERR026529_1.fastq.gz -2 ERR026529_2.fastq.gz
I get this result:
No k-mer matches were found for the sample ERR026529_1.fastq.gz and ERR026529_2.fastq.gz. Probable cause of the error: low quality data/too many N's in the data
Sample ST
ERR026529 ST
I checked both the data and the database, and both seem to be all right.
OS: Ubuntu 14.04
Python: Python 3.6.0 |Anaconda 4.3.1 (64-bit)
Could not build the database because some of my sequences contain lower-case characters. Got the following stack trace:
Info: Making DB for k = 35
Info: Making DB with prefix = LM
Traceback (most recent call last):
File "/share/apps/stringMLST/stringMLST.py", line 1079, in <module>
makeCustomDB(config,k,dbPrefix)
File "/share/apps/stringMLST/stringMLST.py", line 671, in makeCustomDB
formKmerDB(configDict,k,output_filename)
File "/share/apps/stringMLST/stringMLST.py", line 613, in formKmerDB
string = key+'\t'+key1+'\t'+str(kmerDict[key][key1]).replace(" ","")+'\n'
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
It looks like the reverseComplement function only works on upper-case characters. Perhaps you should convert all input sequences to upper case before attempting the reverse complement.
Sorry, need to ask again something.
When I try to build the DB, I get the following:
Info: Making DB for k = 35
Info: Making DB with prefix = /exports/mm-hpc/bacteriologie/bastian/data/Cdif/cgmlst/stringMLST_db/cdif_cgmlst
Info: Log file written to /exports/mm-hpc/bacteriologie/bastian/data/Cdif/cgmlst/stringMLST_db/cdif_cgmlst.log
Error : Allele name in locus file should be seperated by '_' or '-'
I'm unsure what this refers to.
My FASTA files for the different loci are named e.g. CD630_00010.fasta, but the names of the FASTA entries are only integers (>1, >2, >3). Do I need to change these FASTA headers?
In the profile file, do I then still need only the integers, even if I rename all the headers to e.g. >CD630_00010_1, >CD630_00010_2 etc.?
Am I guessing correctly here?
Thanks,
Bastian
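If the headers do need renaming, it can be scripted. The sketch below assumes, based only on the error message, that each header should read locus name, separator, allele number. Whether a locus name that itself contains '_' is parsed correctly by stringMLST is not confirmed here, and prefix_fasta_headers is a hypothetical helper.

```python
def prefix_fasta_headers(lines, locus, separator="_"):
    """Yield FASTA lines with each header rewritten to <locus><separator><id>.

    A header such as '>1' becomes '>CD630_00010_1' for locus 'CD630_00010'.
    Headers that already start with the locus name are left untouched.
    """
    for line in lines:
        if line.startswith(">"):
            allele_id = line[1:].strip()
            if not allele_id.startswith(locus + separator):
                line = ">%s%s%s\n" % (locus, separator, allele_id)
        yield line


# Usage: rewrite the headers of an in-memory FASTA record list.
records = [">1\n", "ACGT\n", ">2\n", "ACGA\n"]
print("".join(prefix_fasta_headers(records, "CD630_00010")))
```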
When I tried updating my MLST database using --getMLST, it failed with SSL errors. I was able to correct the problem by importing ssl and adding a line to ignore the certificates. The attached version of stringMLST.py contains the patched code. I've confirmed it works.
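The attached patch is not reproduced here. As an illustration of the general approach in Python 3, certificate checks can be switched off for a single request by passing a custom SSL context. The function names below are hypothetical, and since this disables verification of the server's identity, it is a workaround for networks where the certificate chain cannot be validated rather than a general fix.

```python
import ssl
import urllib.request


def unverified_context():
    """Return an SSL context that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname has to be disabled before verify_mode can be relaxed.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def open_unverified(url, timeout=30):
    """Open `url` over HTTPS without verifying the server certificate."""
    return urllib.request.urlopen(url, context=unverified_context(), timeout=timeout)
```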
Download profiles and alleles on demand for the PubMLST curated MLST schemes
Hey there,
I'm attempting to run stringMLST for a cgMLST scheme.
I'm about to build everything needed to run it: I have the allele FASTA files, and will now create the profile file, where all the cgMLST types and their alleles are listed.
Since it's a cgMLST scheme, not every gene has an allele in every profile.
I have multiple profiles, where some alleles are put down as "?" in the definition at the cgMLST website.
Do I also put a ? in the profile file? Or something else?
Any advice :)?
(I'll try anyways with a ? in there)
Bastian
The apostrophes on lines 114 and 163 were causing an error. "SyntaxError: Non-ASCII character '\xe2' in file /home/username/.local/bin/stringMLST.py on line 164, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details"
Not sure if this is unique to my system but removing these resolved the issue.
Make program flags more obvious:
-P/--prefix is awkward for setting the database. Move to -db/--database
-x Replaced with -f/--force, standard syntax for forcibly running/overwriting
Update help statements to be module specific
stringMLST will be relicensed into CC Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) on Friday, April 21 2017.
We feel CC BY-NC-SA 4.0 accurately reflects our original intent and that no additional restrictions or limitations are placed on users of stringMLST with this change. We hope this will clarify the licensing of stringMLST, and facilitate third-party packages and derivative works. stringMLST will remain Open Source Software and free for noncommercial use.
Hi!
I ran stringMLST on the sample from EBI SRA database ERR024377 (S. enterica):
stringMLST.py --predict -1 ERR024377_1.fastq.gz -2 ERR024377_2.fastq.gz -k 35 --prefix ERR024377 --output res.txt
and the res.txt file contains the following:
Sample aroC dnaN purE ST
ERR024377 345 342 0 0
Since there are 7 loci in total (aroC, dnaN, hemD, hisD, purE, sucA, thrA), what is the difference between purE, which got a 0, and hemD, hisD, sucA and thrA, which got an empty value?
Bests,
-A
Document which versions of the external tools need to be on the PATH for the --coverage option.
If I run the following command:
stringMLST.py --getMLST -P Listeria --species Listeria monocytogenes
I get the following error:
Preparing: Listeria
Traceback (most recent call last):
File "/mnt/disk1/bin/miniconda3/bin/stringMLST.py", line 1562, in <module>
profileURL, loci = get_links(dbRoot, filePrefix, species)
File "/mnt/disk1/bin/miniconda3/bin/stringMLST.py", line 268, in get_links
profileURL = child[1].text
IndexError: child index out of range
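The "Preparing: Listeria" line suggests that only the first word of the species name reached the script, which is what happens when a two-word value is passed unquoted. The sketch below demonstrates the effect with Python's getopt module; that stringMLST parses its arguments in a comparable way is an assumption, and parse_species is a hypothetical helper.

```python
import getopt
import shlex


def parse_species(command_line):
    """Return (species, leftover_arguments) for a --getMLST style command."""
    argv = shlex.split(command_line)[1:]  # drop the program name
    options, leftover = getopt.getopt(argv, "P:", ["getMLST", "species="])
    species = dict(options).get("--species")
    return species, leftover


unquoted = "stringMLST.py --getMLST -P Listeria --species Listeria monocytogenes"
quoted = 'stringMLST.py --getMLST -P Listeria --species "Listeria monocytogenes"'
print(parse_species(unquoted))  # ('Listeria', ['monocytogenes'])
print(parse_species(quoted))    # ('Listeria monocytogenes', [])
```

Quoting the species name keeps it as a single argument; whether the resulting name matches a scheme on PubMLST is a separate question.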