mattb112885 / clusterdbanalysis Goto Github PK

View Code? Open in Web Editor NEW

26.0 26.0 15.0 3 MB

ITEP - Integrated Toolkit for Exploration of microbial Pan-genomes

Shell 10.63% Python 88.80% Perl 0.57%

clusterdbanalysis's People

Contributors

Stargazers

Watchers

Forkers

bheavner pricelab juanu greg568 josephryanpeterson nevegav alenzhao tybiot oberonthewizard tomos-s91 wangdi2014 kdbrumfield maggishaggy hites77 germant13

clusterdbanalysis's Issues

Request: Obtain sequences of HSPs from the blast results

Given the tab-delimtied blast result format, return a sequence for the HSP (protein or DNA).

Bidirectional-best hits script should include the option to filter by organisms or genes

Right now it only just computes BBH for everything in the database but we should be able to link it up with existing scripts to filter BLAST results by organism and gene to filter the BBH calculation as well.

Add gene IDs to genbank files

Probably should have two scripts (which both essentially call the same function).

The first script takes an existing genbank file, makes up ITEP IDs and produces a table and a genbank file containing the same IDs as the table.

The second script takes a genbank file and a raw table containing ITEP IDs and produces a genbank file with the ITEP IDs added.

db_getClusterGeneInformation.py uses wrong columns

Around line 71 db_getClusterGeneInformation.py calls:

cmd = "cat %s | db_getClusterGeneInformation.py -r %d -c %d | annoteSeq2Fasta.py -g 3 -a 5 -s 6 > %s" %(fname, rc+1, cc+1, fasta)

These column options should be removed (the defualt is, I thik, right) or should be -g 1 -a 5 -s 12

Bug reproduction:

jamesrh@biome:~/iTEP_Matt/pipeline_treetables_orthocluster_4$ echo "all_I_1.7_c_0.1_m_maxbit805" | db_makeClusterAlignment.py   -m  mafft_default --notrim

Error returned: 
cat 786782332.tmp | db_getClusterGeneInformation.py -r 1 -c 2 | annoteSeq2Fasta.py -g 3 -a 5 -s 6 > 786782332.fasta
Traceback (most recent call last):
  File "/data/Cluster_Files/src/db_getClusterGeneInformation.py", line 52, in <module>
    con.execute(query, (spl[rc], spl[cc]))
IndexError: list index out of range

In the past I made these changes to db_makeClusterAlignment.py, as we discussed, I was making it multiplex many cluster runs as well as fixing this bug, which is not necessary.

< cmd = "cat %s | db_getClusterGeneInformation.py -r %d -c %d | annoteSeq2Fasta.py -g 3 -a 5 -s 6 > %s" %(fname, rc+1, cc+1, fasta)

---
> cmd = "cat %s | db_getClusterGeneInformation.py -r %d -c %d | cut -f 1 |sort -u | db_getGeneInformation.py | annoteSeq2Fasta.py -g 1 -a 5 -s 12 > %s" %(fname, rc+1, cc+1, fasta)

Functions limited to only singular inputs should have command-line options

i.e. db_makeClusterAlignment.py

Switch to argparse and use argcomplete to deal with options with fixed list of possibilities

I should be using argparse rather than optparse because the latter is obsoleted. Also, it works with this package:

https://github.com/kislyuk/argcomplete

I see the following uses (at least) for such a package:

1 - Automatically suggest run IDs for the many functions that require input of a single run ID
2 - Automatically suggest scoring metrics for the functions that only allow use of a specific set of scoring metrics.

Would it get confusing if I use such an ability for some arguments but not for most?

Hook up the db to a real ortholog finder

Just raw MCL doesn't do a great job finding orthologs. I need to improve the BBH scripts and also very likely hook up orthoMCL (with appropriate reformattings to avoid re-calculating all the data) as an alternative way to identify clusters.

change SourceMe so people don't run it and not have the right PATH set

Since SourceMe.sh will not change the bash variables if run instead of sourced (as the user returns to their own shell before the exports occurred), we should remove the executable bit (and maybe change it to SourceMe.path or something). We should also make it clear in the docs that it should be sourced, not run.

Alternatively, we could make a wrapper script (SourceMe.sh) that sources the actual file (SourceMe.path), which would open up the possability of allowing the user to specify a file to concatenate to ( SourceMe.sh ~/.profile => cat SourceMe.path >> ~/.profile ). There is probably some way to do an if/then and combine the two in the same file.

Remove references to organism abbreviations in the organisms file

James and I talked about this... the organism abbreviations are the only manual part about the organisms file that is left, and they are hard \ buggy to maintain. They aren't really used for anything within ITEP... we have thus decided to get rid of it.

The following will have to adapt to this:

Any functions that use the organisms file and expect the ID to be in the third column (and not the second)...
The SQLite schema

I should probably make a standard organism file parser function to prevent any issues if we need to change this around again.

Downloading from NCBI should default to completed genomes if both complete and incomplete are available (but doesn't)

E.g. try downloading 573061 (C. cellulovorans)

Automatic adding of organism name and ID from NCBI or pubseed parsing functions

It should be possible to automatically look for an organism ID in the organisms file, obtain the organism name, and if it isn't present, add it to the file without asking the user to do so manually.

Unfortunately this isn't possible from the RAST files so the user should still be warned if they are providing files from RAST.

Bad column names in blastres_selfbit

Probably missing an AS in the join command somewhere...

PRAGMA TABLE_INFO(blastres_selfbit);

returns

0|querygene|TEXT|0||0
1|targetgene|TEXT|0||0
2|pctid|REAL|0||0
3|alnlen|INT|0||0
4|mismatches|INT|0||0
5|gapopens|INT|0||0
6|querystart|INT|0||0
7|queryend|INT|0||0
8|substart|INT|0||0
9|subend|INT|0||0
10|evalue|REAL|0||0
11|bitscore|REAL|0||0
12|bitscore:1|REAL|0||0
13|blast_self.bitscore|REAL|0||0

One of columns 12 and 13 should be query self-bit and the other is target self-bit score.

BLASTN all vs all

We should modify the codes to do a blastn all vs all in a similar manner to blastp - useful for closely-related organisms. Switches could be added to much of the code to query a different table but use exactly the same analysis otherwise. Table structure should be exactly the same but using nucleotide data rather than amino acid data as inputs.

Issue with contig order from Biopython?

Apparently Biopython reorders contigs from the way they appear in the Genbank file (this according to Nick). Need to investigate and fix - this could fix our discrepancies with RAST IDs...

Very short genes are excluded from clustering results

For example:

$ echo "fig|192952.1.peg.556" | db_getGeneInformation.py
fig|192952.1.peg.556 Methanosarcina mazei Go1 192952.1 mazGO1 192952.1.NC_003901 682111 682161 + 1 unknown function_MM0556 gtggagaaatgtttcaatccttgttttaatggatcttgctcgcgaatatga MEKCFNPCFNGSCSRI

But this gene doesn't appear in any clusters or in the BLAST results because the self-hit is not sufficiently significant...

Automatic adding of gene aliases

So that the genes are back-searchable to their original database later, we should add the original alias (from NCBI, KBASE, whatever) to the "aliases" table when generating the raw files. That way, at least the original IDs are retained somewhere.

Output formats for tables need documentation within the function

Possibly print out a header as stderr? If not, definitely within the help text.

getClusterFastas.py fails if output folder does not exist

It disrupts the pipeline if getClusterFastas.py cannot be run due to non-existence of the specified folder to place the fasta files.

get_RAST_jobs.pl -h prints to std error, not std out

Unlike all other scripts, get_RAST_jobs.pl -h prints to std error, not std out

Make BLAST E-value cutoff a parameter to main.sh

Right now the E-value cutoff is 1E-5. It's been fine for our purposes but people might want to go lower than that... we can just make it an optional parameter in the script.

There is no function to get description \name for a cluster ID

Should have a way to just dump the clusterID info table for specific clusters.

duplicate contigs in contig table

(This bug is filed for reference - it has already been fixed)

The main3.sh script was adding contigs to the contig file even if they were already there, thus leading to duplicate entries in the contig table of the db.

Remove old implicit shell script calls and replace with library functions

As title implies... we have some very old parts of the python code that call python scripts because there were no librarified versions of these functions. The latter is (or soon will be) fixed so the former should be too.

modularizing commonly used functions

Make lib directory
Four files:

treefuncs.py : Tree functions
metadata: e.g. locateDatabase()
stringfunctions.py (?)
commondbaccess.py - common functions for getting data from db (e.g. organism name <-> ID), presence\absence slices, ...

Need to rewrite existing imports to use new files and update docs

convertGenbankToTable should not depend on GI Ids being available

In order to make this function more generally useful (e.g. to people who don't have their genbank files in NCBI yet like us), we should change it so that it doesn't depend on the GI numbers to generate gene IDs. It'll have to be incremental I think.

Need option to add filtering (SEG or DUST) to automatically-generated BLAST databases

Low-complexity region filtering will be useful particularly for TBLASTN... repeats in the genome that can give a false result. If I understand the NCBI docs correctly SEG is only added to the query sequences automatically, not the target sequences...

Feature request: Find missing organisms in a cluster

I have functions to identify organisms present in a cluster but the inverse (identifying organisms missing in a cluster) is more of a pain with the current codebase.

come up with better mapping from org groups to runid names

The current one is really ugly and fragile...

E.g. synonym:[nameofstuff]

Presence and absence relative to tree neighbors (instead of whole tree)

I should be able to wrap the existing presence/absence script to look for presence and absence in a clade relative only to neighboring clades (instead of relative to the entire tree) - this would make the analysis less sensitive to the choice of organisms to put in the tree and allow us to identify families that are gained or lost in multiple clades.

pubseed interface has broken

Due to a change in the PubSEED FTP server the convertPubseedToRast.sh and associated scripts no longer work.

getPresenceAbsence functions should be combined

The other two (not pegs) are outdated and not used for much - just make one function with a flag for 0/1 if we want a presence-absence pasta.

TBLASTN output needs percent overlap of query as a metric

In addition to bit scores, overlap lengths are useful to identify how good the match was from the TBLASTN target to the called gene. I should add a new column where this is computed.

Change database declarations into SQLite standard types

Making these changes will make people like me less confused about what SQLite is doing with the data types in the database and more clear what the limitations really are. They won't (shouldn't) change anything about the database as it is now, just clarifies things in the docs.

SQLite does not enforce varchar lengths (https://www.sqlite.org/faq.html#q9) and implicitly changes them into TEXT, so we should make them all TEXT as well so people (like me) don't think their lengths are really limited. I need to comb through and remove checks for this (there's one in checkInputFormat and some in the SQL code itself) since it's pointless.

SQLite does not have a FLOAT type and implicitly changes them into REALs. We should declare them as such

All integer types are treated the same - as INTEGER - as is the boolean type.

Need a function to reroot a tree without replacing gene names with organism names and annotations

Some functions (like the neighborhood tree displayer) require you to still have the original IDs in there, and FigTree adds a quote to every ID which makes it very hard to figure out what is wrong if you don't know what you're looking for. I should just make a function to reroot the tree so we don't have to go through db_displayTree to do it (which replaces the gene IDs with annotations and organism names - good for displaying the tree, not so good for making a new tree for further analysis that interacts with the database).

Need to move other things that make up their own scoring schemes to the standard library functions

I just wrote library functions to take BLAST results and compute scores in a standard way - but I still have some separate implementations of these floating around in various functions. I need to clean these up (particularly, db_bidirectionalBestHits.py, blastResultsToDistanceMatrix.py ... others?). makeBlastScoreTable.py has already been cleaned up and the recent function for Cytoscape export already was written (recently) to do this properly.

Global bootstrapping defaults for FastTree_wrapper.py

For global bootstrapping (-g), use the -intree1 option to speed up runs.
http://www.microbesonline.org/fasttree/#Support

Change file input and output from sanatised organism name to (sanatized) organism ID

Besides the changes we discussed in replaceOrgWithAbbrev.py, other files use organism names in their output or input.

in src/makeCoreClusterAnalysisTree.py, the input and output use sanitized organism names:
"The input MUST be a Newick file with organism IDs REPLACED with their names"
"WARNING: Organism name %s in the database was not found in the provided tree. It will be deleted!!\n" %(collist[ii]))

The description and the header comment in this file conflict about the function of the script:
src/db_getBlastResultsBetweenSpecificGenes.py
description = "Given list of genes to match, returns a list of BLAST results between genes in the list only"

Provide a list of organisms to match [can match any portion of the organism so if you give it just "mazei" it will return to you a list of Methanosarcina mazei]

I think this is from duplication between thses scripts:
src/db_getBlastResultsBetweenSpecificGenes.py src/db_getBlastResultsBetweenSpecificOrganisms.py

Other scripts to check if the organism name or ID are used:
db_findClustersByOrganismList.py
db_getOrganismsInClusterRun.py
db_getOrganismsInCluster.py
db_addOrganismNameToTable.py
db_bidirectionalBestHits.py
db_TBlastN_wrapper.py

We discussed keeping the library functions, but another way to find the dependences is to see what called these library functions:
lib/TreeFuncs.py: '''Parse a node name into an organism ID.
lib/ClusterFuncs.py: Given an organism name, return the ID for that organism name.
lib/CoreGeneFunctions.py: The return object is a list of (runid, clusterid, organism) tuples sorted by run ID then by cluster ID.'''
lib/CoreGeneFunctions.py:def findGenesByOrganismList(orglist
lib/CoreGeneFunctions.py: The organisms in "orglist" are considered the "ingroup"

Use metadata to identify location of organism and groups files

Currently there are several functions in the pipeline that require organism or groups files as explicit input arguments. I should instead make one of the metadata functions for each of these files return the path to those files. However, it might be worth keeping the overwrites as optional arguments (just no longer having to explicitly identify them).

Issue with duplicate clusterIDs in getClusterFastas.py

In getClusterFastas.py the fasta files printed are incorrect if two consecutive clusters have different run IDs but the same cluster ID.

convertGenbankToTable needs to check for taxID consistency in concatinated genbank file

Since we are dependent on the taxIDs in the genbank file all being the same but we dont actually ever check this, we should check it. I should also put this in as part of the consistency checking script...

Yes, NCBI managed to go to a new low and put things with two different tax IDs under the same strain (C. tetani) - one of them was a virus. SIGH.

Building RPSBLAST databases no longer works with current CDD data

Apparently NCBI changed the format of the CDD data so only the newer version of RPSBLAST works with it - so I should be using makeprofiledb to build the database instead of formatrpsdb. There's a bit of a naming nightmare that I hope we don't have too many issues with... namely, they didn't change the name of the rpsblast executable but changed the syntax. We might just have to switch to the new syntax and let it crash if people are using the wrong one...

Need to build a 64-bit version of the VM

The VM I currently have is unable of handling RPSBlast, I think because of memory limitations. It could be memory limit of my host but it could also be because its a 32-bit guest. Need to test this before sending out.

env variable for FastTree_wrapper.py to set number of cores

Have a user option to set the number of threads used by FastTree.

The number of threads for FastTreeMP used is determined by the environmental variable OMP_NUM_THREADS

ncoers=10
env = os.environ.copy()
env['OMP_NUM_THREADS'] =str(ncores)

Different number of columns from db_getGeneInformation.py and db_getClusterGeneInformation.py

Both of these functions provide "geneinfo" tables so the format of the output should be the same.

db_listDbFiles.py features for easier grepping

it would be easier to grep the output of db_listDbFiles.py if the default was in one column. Alternatively, if there was a sys.argv[1], then the script could use that as a to grep through the list.

Very minor, I know.

Documentation for table headers

The -h flag for db_getGeneInformation.py
should have what is in each column

Neighborhood-based clustering

It would be useful to implement a hook to a neighborhood-based clustering tool that helps separate out groups by synteny. This one looks potentially promising:

http://nar.oxfordjournals.org/content/40/22/e172.

Add nucleotide option to getClusterFastas.py

There should be an option to get nucleotide in addition to protein FASTA files from getClusterFastas.py - maybe I should have it specify column numbers in the same way annoteSeq2Fasta.py does?

permission on /data/Cluster_Files/completeCoreAnalysis.sh

I can not run /data/Cluster_Files/completeCoreAnalysis.sh from my own directory without making a copy to my directory.

I think changing all of /scripts to be in group gtl_cluster will solve this:

4 -rwxr-xr-- 1 benedic1 benedic1 2874 Nov 30 09:55 /data/Cluster_Files/scripts/completeCoreAnalysis.sh

we should also probably make all the files in /src consistent too:

4 -rwxrwxr-x 1 benedic1 benedic1 2138 Oct 25 17:06 replaceOrgWithAbbrev.py
4 -rwxrwxr-x 1 benedic1 benedic1 2106 Nov 16 14:41 Rpsblast_all_vs_one.py
4 -rwxr-xr-x 1 benedic1 gtl_cluster 867 Oct 20 12:24 sanitizeString.py
4 -rw-r--r-- 1 benedic1 benedic1 1090 Oct 20 12:26 sanitizeString.pyc
4 -rwxr-xr-x 1 benedic1 gtl_cluster 1406 Oct 20 12:06 startDifference.py
4 -rwxr-xr-x 1 benedic1 gtl_cluster 3404 Aug 21 12:35 tblastn_all_vs_all.py
4 -rwxrwxr-x 1 benedic1 benedic1 564 Oct 20 12:07 unsanitizeGeneIds.py

removeOrganisms and removeOrganisms_SAFE need cleanup

The removeOrganisms and removeOrganisms_SAFE functions are relying on rather obscure file naming conventions that aren't consistent with the rest of the code (and therefore may not work when we need to use them). I might want to use a slower but more-sure method for dealing with this than looking at file names (unfortunately though, some input files like the genbank files won't necessarily have the organism IDs anywhere in them...). This will require some thought - we can't make it too generic or else if we have e.g. two genome IDs

83333.1

and

83333.10

and try to delete the former, we need to make sure not to delete the latter.