Git Product home page Git Product logo

clusterdbanalysis's People

Contributors

jamesrh avatar josephryanpeterson avatar mattb112885 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

clusterdbanalysis's Issues

Add gene IDs to genbank files

Probably should have two scripts (which both essentially call the same function).

The first script takes an existing genbank file, makes up ITEP IDs and produces a table and a genbank file containing the same IDs as the table.

The second script takes a genbank file and a raw table containing ITEP IDs and produces a genbank file with the ITEP IDs added.

db_getClusterGeneInformation.py uses wrong columns

Around line 71 db_getClusterGeneInformation.py calls:

cmd = "cat %s | db_getClusterGeneInformation.py -r %d -c %d | annoteSeq2Fasta.py -g 3 -a 5 -s 6 > %s" %(fname, rc+1, cc+1, fasta)

These column options should be removed (the defualt is, I thik, right) or should be -g 1 -a 5 -s 12

Bug reproduction:

jamesrh@biome:~/iTEP_Matt/pipeline_treetables_orthocluster_4$ echo "all_I_1.7_c_0.1_m_maxbit805" | db_makeClusterAlignment.py   -m  mafft_default --notrim

Error returned: 
cat 786782332.tmp | db_getClusterGeneInformation.py -r 1 -c 2 | annoteSeq2Fasta.py -g 3 -a 5 -s 6 > 786782332.fasta
Traceback (most recent call last):
  File "/data/Cluster_Files/src/db_getClusterGeneInformation.py", line 52, in <module>
    con.execute(query, (spl[rc], spl[cc]))
IndexError: list index out of range

In the past I made these changes to db_makeClusterAlignment.py, as we discussed, I was making it multiplex many cluster runs as well as fixing this bug, which is not necessary.

< cmd = "cat %s | db_getClusterGeneInformation.py -r %d -c %d | annoteSeq2Fasta.py -g 3 -a 5 -s 6 > %s" %(fname, rc+1, cc+1, fasta)

---
> cmd = "cat %s | db_getClusterGeneInformation.py -r %d -c %d | cut -f 1 |sort -u | db_getGeneInformation.py | annoteSeq2Fasta.py -g 1 -a 5 -s 12 > %s" %(fname, rc+1, cc+1, fasta)

Switch to argparse and use argcomplete to deal with options with fixed list of possibilities

I should be using argparse rather than optparse because the latter is obsoleted. Also, it works with this package:

https://github.com/kislyuk/argcomplete

I see the following uses (at least) for such a package:

1 - Automatically suggest run IDs for the many functions that require input of a single run ID
2 - Automatically suggest scoring metrics for the functions that only allow use of a specific set of scoring metrics.

Would it get confusing if I use such an ability for some arguments but not for most?

Hook up the db to a real ortholog finder

Just raw MCL doesn't do a great job finding orthologs. I need to improve the BBH scripts and also very likely hook up orthoMCL (with appropriate reformattings to avoid re-calculating all the data) as an alternative way to identify clusters.

change SourceMe so people don't run it and not have the right PATH set

Since SourceMe.sh will not change the bash variables if run instead of sourced (as the user returns to their own shell before the exports occurred), we should remove the executable bit (and maybe change it to SourceMe.path or something). We should also make it clear in the docs that it should be sourced, not run.

Alternatively, we could make a wrapper script (SourceMe.sh) that sources the actual file (SourceMe.path), which would open up the possability of allowing the user to specify a file to concatenate to ( SourceMe.sh ~/.profile => cat SourceMe.path >> ~/.profile ). There is probably some way to do an if/then and combine the two in the same file.

Remove references to organism abbreviations in the organisms file

James and I talked about this... the organism abbreviations are the only manual part about the organisms file that is left, and they are hard \ buggy to maintain. They aren't really used for anything within ITEP... we have thus decided to get rid of it.

The following will have to adapt to this:

  • Any functions that use the organisms file and expect the ID to be in the third column (and not the second)...
  • The SQLite schema

I should probably make a standard organism file parser function to prevent any issues if we need to change this around again.

Automatic adding of organism name and ID from NCBI or pubseed parsing functions

It should be possible to automatically look for an organism ID in the organisms file, obtain the organism name, and if it isn't present, add it to the file without asking the user to do so manually.

Unfortunately this isn't possible from the RAST files so the user should still be warned if they are providing files from RAST.

Bad column names in blastres_selfbit

Probably missing an AS in the join command somewhere...

PRAGMA TABLE_INFO(blastres_selfbit);

returns

0|querygene|TEXT|0||0
1|targetgene|TEXT|0||0
2|pctid|REAL|0||0
3|alnlen|INT|0||0
4|mismatches|INT|0||0
5|gapopens|INT|0||0
6|querystart|INT|0||0
7|queryend|INT|0||0
8|substart|INT|0||0
9|subend|INT|0||0
10|evalue|REAL|0||0
11|bitscore|REAL|0||0
12|bitscore:1|REAL|0||0
13|blast_self.bitscore|REAL|0||0

One of columns 12 and 13 should be query self-bit and the other is target self-bit score.

BLASTN all vs all

We should modify the codes to do a blastn all vs all in a similar manner to blastp - useful for closely-related organisms. Switches could be added to much of the code to query a different table but use exactly the same analysis otherwise. Table structure should be exactly the same but using nucleotide data rather than amino acid data as inputs.

Issue with contig order from Biopython?

Apparently Biopython reorders contigs from the way they appear in the Genbank file (this according to Nick). Need to investigate and fix - this could fix our discrepancies with RAST IDs...

Very short genes are excluded from clustering results

For example:

$ echo "fig|192952.1.peg.556" | db_getGeneInformation.py
fig|192952.1.peg.556 Methanosarcina mazei Go1 192952.1 mazGO1 192952.1.NC_003901 682111 682161 + 1 unknown function_MM0556 gtggagaaatgtttcaatccttgttttaatggatcttgctcgcgaatatga MEKCFNPCFNGSCSRI

But this gene doesn't appear in any clusters or in the BLAST results because the self-hit is not sufficiently significant...

Automatic adding of gene aliases

So that the genes are back-searchable to their original database later, we should add the original alias (from NCBI, KBASE, whatever) to the "aliases" table when generating the raw files. That way, at least the original IDs are retained somewhere.

duplicate contigs in contig table

(This bug is filed for reference - it has already been fixed)

The main3.sh script was adding contigs to the contig file even if they were already there, thus leading to duplicate entries in the contig table of the db.

modularizing commonly used functions

Make lib directory
Four files:

  • treefuncs.py : Tree functions
  • metadata: e.g. locateDatabase()
  • stringfunctions.py (?)
  • commondbaccess.py - common functions for getting data from db (e.g. organism name <-> ID), presence\absence slices, ...

Need to rewrite existing imports to use new files and update docs

Presence and absence relative to tree neighbors (instead of whole tree)

I should be able to wrap the existing presence/absence script to look for presence and absence in a clade relative only to neighboring clades (instead of relative to the entire tree) - this would make the analysis less sensitive to the choice of organisms to put in the tree and allow us to identify families that are gained or lost in multiple clades.

Change database declarations into SQLite standard types

Making these changes will make people like me less confused about what SQLite is doing with the data types in the database and more clear what the limitations really are. They won't (shouldn't) change anything about the database as it is now, just clarifies things in the docs.

SQLite does not enforce varchar lengths (https://www.sqlite.org/faq.html#q9) and implicitly changes them into TEXT, so we should make them all TEXT as well so people (like me) don't think their lengths are really limited. I need to comb through and remove checks for this (there's one in checkInputFormat and some in the SQL code itself) since it's pointless.

SQLite does not have a FLOAT type and implicitly changes them into REALs. We should declare them as such

All integer types are treated the same - as INTEGER - as is the boolean type.

Need a function to reroot a tree without replacing gene names with organism names and annotations

Some functions (like the neighborhood tree displayer) require you to still have the original IDs in there, and FigTree adds a quote to every ID which makes it very hard to figure out what is wrong if you don't know what you're looking for. I should just make a function to reroot the tree so we don't have to go through db_displayTree to do it (which replaces the gene IDs with annotations and organism names - good for displaying the tree, not so good for making a new tree for further analysis that interacts with the database).

Need to move other things that make up their own scoring schemes to the standard library functions

I just wrote library functions to take BLAST results and compute scores in a standard way - but I still have some separate implementations of these floating around in various functions. I need to clean these up (particularly, db_bidirectionalBestHits.py, blastResultsToDistanceMatrix.py ... others?). makeBlastScoreTable.py has already been cleaned up and the recent function for Cytoscape export already was written (recently) to do this properly.

Change file input and output from sanatised organism name to (sanatized) organism ID

Besides the changes we discussed in replaceOrgWithAbbrev.py, other files use organism names in their output or input.

in src/makeCoreClusterAnalysisTree.py, the input and output use sanitized organism names:
"The input MUST be a Newick file with organism IDs REPLACED with their names"
"WARNING: Organism name %s in the database was not found in the provided tree. It will be deleted!!\n" %(collist[ii]))

The description and the header comment in this file conflict about the function of the script:
src/db_getBlastResultsBetweenSpecificGenes.py
description = "Given list of genes to match, returns a list of BLAST results between genes in the list only"

Provide a list of organisms to match [can match any portion of the organism so if you give it just "mazei" it will return to you a list of Methanosarcina mazei]

I think this is from duplication between thses scripts:
src/db_getBlastResultsBetweenSpecificGenes.py src/db_getBlastResultsBetweenSpecificOrganisms.py

Other scripts to check if the organism name or ID are used:
db_findClustersByOrganismList.py
db_getOrganismsInClusterRun.py
db_getOrganismsInCluster.py
db_addOrganismNameToTable.py
db_bidirectionalBestHits.py
db_TBlastN_wrapper.py

We discussed keeping the library functions, but another way to find the dependences is to see what called these library functions:
lib/TreeFuncs.py: '''Parse a node name into an organism ID.
lib/ClusterFuncs.py: Given an organism name, return the ID for that organism name.
lib/CoreGeneFunctions.py: The return object is a list of (runid, clusterid, organism) tuples sorted by run ID then by cluster ID.'''
lib/CoreGeneFunctions.py:def findGenesByOrganismList(orglist
lib/CoreGeneFunctions.py: The organisms in "orglist" are considered the "ingroup"

Use metadata to identify location of organism and groups files

Currently there are several functions in the pipeline that require organism or groups files as explicit input arguments. I should instead make one of the metadata functions for each of these files return the path to those files. However, it might be worth keeping the overwrites as optional arguments (just no longer having to explicitly identify them).

convertGenbankToTable needs to check for taxID consistency in concatinated genbank file

Since we are dependent on the taxIDs in the genbank file all being the same but we dont actually ever check this, we should check it. I should also put this in as part of the consistency checking script...

Yes, NCBI managed to go to a new low and put things with two different tax IDs under the same strain (C. tetani) - one of them was a virus. SIGH.

Building RPSBLAST databases no longer works with current CDD data

Apparently NCBI changed the format of the CDD data so only the newer version of RPSBLAST works with it - so I should be using makeprofiledb to build the database instead of formatrpsdb. There's a bit of a naming nightmare that I hope we don't have too many issues with... namely, they didn't change the name of the rpsblast executable but changed the syntax. We might just have to switch to the new syntax and let it crash if people are using the wrong one...

Need to build a 64-bit version of the VM

The VM I currently have is unable of handling RPSBlast, I think because of memory limitations. It could be memory limit of my host but it could also be because its a 32-bit guest. Need to test this before sending out.

env variable for FastTree_wrapper.py to set number of cores

Have a user option to set the number of threads used by FastTree.

The number of threads for FastTreeMP used is determined by the environmental variable OMP_NUM_THREADS

ncoers=10
env = os.environ.copy()
env['OMP_NUM_THREADS'] =str(ncores)

db_listDbFiles.py features for easier grepping

it would be easier to grep the output of db_listDbFiles.py if the default was in one column. Alternatively, if there was a sys.argv[1], then the script could use that as a to grep through the list.

Very minor, I know.

Add nucleotide option to getClusterFastas.py

There should be an option to get nucleotide in addition to protein FASTA files from getClusterFastas.py - maybe I should have it specify column numbers in the same way annoteSeq2Fasta.py does?

permission on /data/Cluster_Files/completeCoreAnalysis.sh

I can not run /data/Cluster_Files/completeCoreAnalysis.sh from my own directory without making a copy to my directory.

I think changing all of /scripts to be in group gtl_cluster will solve this:

4 -rwxr-xr-- 1 benedic1 benedic1 2874 Nov 30 09:55 /data/Cluster_Files/scripts/completeCoreAnalysis.sh

we should also probably make all the files in /src consistent too:

4 -rwxrwxr-x 1 benedic1 benedic1 2138 Oct 25 17:06 replaceOrgWithAbbrev.py
4 -rwxrwxr-x 1 benedic1 benedic1 2106 Nov 16 14:41 Rpsblast_all_vs_one.py
4 -rwxr-xr-x 1 benedic1 gtl_cluster 867 Oct 20 12:24 sanitizeString.py
4 -rw-r--r-- 1 benedic1 benedic1 1090 Oct 20 12:26 sanitizeString.pyc
4 -rwxr-xr-x 1 benedic1 gtl_cluster 1406 Oct 20 12:06 startDifference.py
4 -rwxr-xr-x 1 benedic1 gtl_cluster 3404 Aug 21 12:35 tblastn_all_vs_all.py
4 -rwxrwxr-x 1 benedic1 benedic1 564 Oct 20 12:07 unsanitizeGeneIds.py

removeOrganisms and removeOrganisms_SAFE need cleanup

The removeOrganisms and removeOrganisms_SAFE functions are relying on rather obscure file naming conventions that aren't consistent with the rest of the code (and therefore may not work when we need to use them). I might want to use a slower but more-sure method for dealing with this than looking at file names (unfortunately though, some input files like the genbank files won't necessarily have the organism IDs anywhere in them...). This will require some thought - we can't make it too generic or else if we have e.g. two genome IDs

83333.1

and

83333.10

and try to delete the former, we need to make sure not to delete the latter.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.