mattb112885 / clusterdbanalysis Goto Github PK
View Code? Open in Web Editor NEWITEP - Integrated Toolkit for Exploration of microbial Pan-genomes
ITEP - Integrated Toolkit for Exploration of microbial Pan-genomes
Given the tab-delimtied blast result format, return a sequence for the HSP (protein or DNA).
Right now it only just computes BBH for everything in the database but we should be able to link it up with existing scripts to filter BLAST results by organism and gene to filter the BBH calculation as well.
Probably should have two scripts (which both essentially call the same function).
The first script takes an existing genbank file, makes up ITEP IDs and produces a table and a genbank file containing the same IDs as the table.
The second script takes a genbank file and a raw table containing ITEP IDs and produces a genbank file with the ITEP IDs added.
Around line 71 db_getClusterGeneInformation.py calls:
cmd = "cat %s | db_getClusterGeneInformation.py -r %d -c %d | annoteSeq2Fasta.py -g 3 -a 5 -s 6 > %s" %(fname, rc+1, cc+1, fasta)
These column options should be removed (the defualt is, I thik, right) or should be -g 1 -a 5 -s 12
Bug reproduction:
jamesrh@biome:~/iTEP_Matt/pipeline_treetables_orthocluster_4$ echo "all_I_1.7_c_0.1_m_maxbit805" | db_makeClusterAlignment.py -m mafft_default --notrim
Error returned:
cat 786782332.tmp | db_getClusterGeneInformation.py -r 1 -c 2 | annoteSeq2Fasta.py -g 3 -a 5 -s 6 > 786782332.fasta
Traceback (most recent call last):
File "/data/Cluster_Files/src/db_getClusterGeneInformation.py", line 52, in <module>
con.execute(query, (spl[rc], spl[cc]))
IndexError: list index out of range
In the past I made these changes to db_makeClusterAlignment.py, as we discussed, I was making it multiplex many cluster runs as well as fixing this bug, which is not necessary.
< cmd = "cat %s | db_getClusterGeneInformation.py -r %d -c %d | annoteSeq2Fasta.py -g 3 -a 5 -s 6 > %s" %(fname, rc+1, cc+1, fasta)
---
> cmd = "cat %s | db_getClusterGeneInformation.py -r %d -c %d | cut -f 1 |sort -u | db_getGeneInformation.py | annoteSeq2Fasta.py -g 1 -a 5 -s 12 > %s" %(fname, rc+1, cc+1, fasta)
i.e. db_makeClusterAlignment.py
I should be using argparse rather than optparse because the latter is obsoleted. Also, it works with this package:
https://github.com/kislyuk/argcomplete
I see the following uses (at least) for such a package:
1 - Automatically suggest run IDs for the many functions that require input of a single run ID
2 - Automatically suggest scoring metrics for the functions that only allow use of a specific set of scoring metrics.
Would it get confusing if I use such an ability for some arguments but not for most?
Just raw MCL doesn't do a great job finding orthologs. I need to improve the BBH scripts and also very likely hook up orthoMCL (with appropriate reformattings to avoid re-calculating all the data) as an alternative way to identify clusters.
Since SourceMe.sh will not change the bash variables if run instead of sourced (as the user returns to their own shell before the exports occurred), we should remove the executable bit (and maybe change it to SourceMe.path or something). We should also make it clear in the docs that it should be sourced, not run.
Alternatively, we could make a wrapper script (SourceMe.sh) that sources the actual file (SourceMe.path), which would open up the possability of allowing the user to specify a file to concatenate to ( SourceMe.sh ~/.profile => cat SourceMe.path >> ~/.profile ). There is probably some way to do an if/then and combine the two in the same file.
James and I talked about this... the organism abbreviations are the only manual part about the organisms file that is left, and they are hard \ buggy to maintain. They aren't really used for anything within ITEP... we have thus decided to get rid of it.
The following will have to adapt to this:
I should probably make a standard organism file parser function to prevent any issues if we need to change this around again.
E.g. try downloading 573061 (C. cellulovorans)
It should be possible to automatically look for an organism ID in the organisms file, obtain the organism name, and if it isn't present, add it to the file without asking the user to do so manually.
Unfortunately this isn't possible from the RAST files so the user should still be warned if they are providing files from RAST.
Probably missing an AS in the join command somewhere...
PRAGMA TABLE_INFO(blastres_selfbit);
returns
0|querygene|TEXT|0||0
1|targetgene|TEXT|0||0
2|pctid|REAL|0||0
3|alnlen|INT|0||0
4|mismatches|INT|0||0
5|gapopens|INT|0||0
6|querystart|INT|0||0
7|queryend|INT|0||0
8|substart|INT|0||0
9|subend|INT|0||0
10|evalue|REAL|0||0
11|bitscore|REAL|0||0
12|bitscore:1|REAL|0||0
13|blast_self.bitscore|REAL|0||0
One of columns 12 and 13 should be query self-bit and the other is target self-bit score.
We should modify the codes to do a blastn all vs all in a similar manner to blastp - useful for closely-related organisms. Switches could be added to much of the code to query a different table but use exactly the same analysis otherwise. Table structure should be exactly the same but using nucleotide data rather than amino acid data as inputs.
Apparently Biopython reorders contigs from the way they appear in the Genbank file (this according to Nick). Need to investigate and fix - this could fix our discrepancies with RAST IDs...
For example:
$ echo "fig|192952.1.peg.556" | db_getGeneInformation.py
fig|192952.1.peg.556 Methanosarcina mazei Go1 192952.1 mazGO1 192952.1.NC_003901 682111 682161 + 1 unknown function_MM0556 gtggagaaatgtttcaatccttgttttaatggatcttgctcgcgaatatga MEKCFNPCFNGSCSRI
But this gene doesn't appear in any clusters or in the BLAST results because the self-hit is not sufficiently significant...
So that the genes are back-searchable to their original database later, we should add the original alias (from NCBI, KBASE, whatever) to the "aliases" table when generating the raw files. That way, at least the original IDs are retained somewhere.
Possibly print out a header as stderr? If not, definitely within the help text.
It disrupts the pipeline if getClusterFastas.py cannot be run due to non-existence of the specified folder to place the fasta files.
Unlike all other scripts, get_RAST_jobs.pl -h
prints to std error, not std out
Right now the E-value cutoff is 1E-5. It's been fine for our purposes but people might want to go lower than that... we can just make it an optional parameter in the script.
Should have a way to just dump the clusterID info table for specific clusters.
(This bug is filed for reference - it has already been fixed)
The main3.sh script was adding contigs to the contig file even if they were already there, thus leading to duplicate entries in the contig table of the db.
As title implies... we have some very old parts of the python code that call python scripts because there were no librarified versions of these functions. The latter is (or soon will be) fixed so the former should be too.
Make lib directory
Four files:
Need to rewrite existing imports to use new files and update docs
In order to make this function more generally useful (e.g. to people who don't have their genbank files in NCBI yet like us), we should change it so that it doesn't depend on the GI numbers to generate gene IDs. It'll have to be incremental I think.
Low-complexity region filtering will be useful particularly for TBLASTN... repeats in the genome that can give a false result. If I understand the NCBI docs correctly SEG is only added to the query sequences automatically, not the target sequences...
I have functions to identify organisms present in a cluster but the inverse (identifying organisms missing in a cluster) is more of a pain with the current codebase.
The current one is really ugly and fragile...
E.g. synonym:[nameofstuff]
I should be able to wrap the existing presence/absence script to look for presence and absence in a clade relative only to neighboring clades (instead of relative to the entire tree) - this would make the analysis less sensitive to the choice of organisms to put in the tree and allow us to identify families that are gained or lost in multiple clades.
Due to a change in the PubSEED FTP server the convertPubseedToRast.sh and associated scripts no longer work.
The other two (not pegs) are outdated and not used for much - just make one function with a flag for 0/1 if we want a presence-absence pasta.
In addition to bit scores, overlap lengths are useful to identify how good the match was from the TBLASTN target to the called gene. I should add a new column where this is computed.
Making these changes will make people like me less confused about what SQLite is doing with the data types in the database and more clear what the limitations really are. They won't (shouldn't) change anything about the database as it is now, just clarifies things in the docs.
SQLite does not enforce varchar lengths (https://www.sqlite.org/faq.html#q9) and implicitly changes them into TEXT, so we should make them all TEXT as well so people (like me) don't think their lengths are really limited. I need to comb through and remove checks for this (there's one in checkInputFormat and some in the SQL code itself) since it's pointless.
SQLite does not have a FLOAT type and implicitly changes them into REALs. We should declare them as such
All integer types are treated the same - as INTEGER - as is the boolean type.
Some functions (like the neighborhood tree displayer) require you to still have the original IDs in there, and FigTree adds a quote to every ID which makes it very hard to figure out what is wrong if you don't know what you're looking for. I should just make a function to reroot the tree so we don't have to go through db_displayTree to do it (which replaces the gene IDs with annotations and organism names - good for displaying the tree, not so good for making a new tree for further analysis that interacts with the database).
I just wrote library functions to take BLAST results and compute scores in a standard way - but I still have some separate implementations of these floating around in various functions. I need to clean these up (particularly, db_bidirectionalBestHits.py, blastResultsToDistanceMatrix.py ... others?). makeBlastScoreTable.py has already been cleaned up and the recent function for Cytoscape export already was written (recently) to do this properly.
For global bootstrapping (-g), use the -intree1 option to speed up runs.
http://www.microbesonline.org/fasttree/#Support
Besides the changes we discussed in replaceOrgWithAbbrev.py, other files use organism names in their output or input.
in src/makeCoreClusterAnalysisTree.py, the input and output use sanitized organism names:
"The input MUST be a Newick file with organism IDs REPLACED with their names"
"WARNING: Organism name %s in the database was not found in the provided tree. It will be deleted!!\n" %(collist[ii]))
The description and the header comment in this file conflict about the function of the script:
src/db_getBlastResultsBetweenSpecificGenes.py
description = "Given list of genes to match, returns a list of BLAST results between genes in the list only"
I think this is from duplication between thses scripts:
src/db_getBlastResultsBetweenSpecificGenes.py src/db_getBlastResultsBetweenSpecificOrganisms.py
Other scripts to check if the organism name or ID are used:
db_findClustersByOrganismList.py
db_getOrganismsInClusterRun.py
db_getOrganismsInCluster.py
db_addOrganismNameToTable.py
db_bidirectionalBestHits.py
db_TBlastN_wrapper.py
We discussed keeping the library functions, but another way to find the dependences is to see what called these library functions:
lib/TreeFuncs.py: '''Parse a node name into an organism ID.
lib/ClusterFuncs.py: Given an organism name, return the ID for that organism name.
lib/CoreGeneFunctions.py: The return object is a list of (runid, clusterid, organism) tuples sorted by run ID then by cluster ID.'''
lib/CoreGeneFunctions.py:def findGenesByOrganismList(orglist
lib/CoreGeneFunctions.py: The organisms in "orglist" are considered the "ingroup"
Currently there are several functions in the pipeline that require organism or groups files as explicit input arguments. I should instead make one of the metadata functions for each of these files return the path to those files. However, it might be worth keeping the overwrites as optional arguments (just no longer having to explicitly identify them).
In getClusterFastas.py the fasta files printed are incorrect if two consecutive clusters have different run IDs but the same cluster ID.
Since we are dependent on the taxIDs in the genbank file all being the same but we dont actually ever check this, we should check it. I should also put this in as part of the consistency checking script...
Yes, NCBI managed to go to a new low and put things with two different tax IDs under the same strain (C. tetani) - one of them was a virus. SIGH.
Apparently NCBI changed the format of the CDD data so only the newer version of RPSBLAST works with it - so I should be using makeprofiledb to build the database instead of formatrpsdb. There's a bit of a naming nightmare that I hope we don't have too many issues with... namely, they didn't change the name of the rpsblast executable but changed the syntax. We might just have to switch to the new syntax and let it crash if people are using the wrong one...
The VM I currently have is unable of handling RPSBlast, I think because of memory limitations. It could be memory limit of my host but it could also be because its a 32-bit guest. Need to test this before sending out.
Have a user option to set the number of threads used by FastTree.
The number of threads for FastTreeMP used is determined by the environmental variable OMP_NUM_THREADS
ncoers=10
env = os.environ.copy()
env['OMP_NUM_THREADS'] =str(ncores)
Both of these functions provide "geneinfo" tables so the format of the output should be the same.
it would be easier to grep the output of db_listDbFiles.py if the default was in one column. Alternatively, if there was a sys.argv[1], then the script could use that as a to grep through the list.
Very minor, I know.
The -h flag for db_getGeneInformation.py
should have what is in each column
It would be useful to implement a hook to a neighborhood-based clustering tool that helps separate out groups by synteny. This one looks potentially promising:
There should be an option to get nucleotide in addition to protein FASTA files from getClusterFastas.py - maybe I should have it specify column numbers in the same way annoteSeq2Fasta.py does?
I can not run /data/Cluster_Files/completeCoreAnalysis.sh from my own directory without making a copy to my directory.
I think changing all of /scripts to be in group gtl_cluster will solve this:
4 -rwxr-xr-- 1 benedic1 benedic1 2874 Nov 30 09:55 /data/Cluster_Files/scripts/completeCoreAnalysis.sh
we should also probably make all the files in /src consistent too:
4 -rwxrwxr-x 1 benedic1 benedic1 2138 Oct 25 17:06 replaceOrgWithAbbrev.py
4 -rwxrwxr-x 1 benedic1 benedic1 2106 Nov 16 14:41 Rpsblast_all_vs_one.py
4 -rwxr-xr-x 1 benedic1 gtl_cluster 867 Oct 20 12:24 sanitizeString.py
4 -rw-r--r-- 1 benedic1 benedic1 1090 Oct 20 12:26 sanitizeString.pyc
4 -rwxr-xr-x 1 benedic1 gtl_cluster 1406 Oct 20 12:06 startDifference.py
4 -rwxr-xr-x 1 benedic1 gtl_cluster 3404 Aug 21 12:35 tblastn_all_vs_all.py
4 -rwxrwxr-x 1 benedic1 benedic1 564 Oct 20 12:07 unsanitizeGeneIds.py
The removeOrganisms and removeOrganisms_SAFE functions are relying on rather obscure file naming conventions that aren't consistent with the rest of the code (and therefore may not work when we need to use them). I might want to use a slower but more-sure method for dealing with this than looking at file names (unfortunately though, some input files like the genbank files won't necessarily have the organism IDs anywhere in them...). This will require some thought - we can't make it too generic or else if we have e.g. two genome IDs
83333.1
and
83333.10
and try to delete the former, we need to make sure not to delete the latter.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.