etabari / porthomcl

Parallel implementation of OrthoMCL

License: GNU General Public License v3.0

Perl 16.59% Shell 28.89% Python 54.52%

porthomcl's Introduction

PorthoMCL

Parallel implementation of OrthoMCL

We have reimplemented the sections of OrthoMCL that rely on databases. This way, OrthoMCL can be run in parallel for a large number of genomes.

Check Wiki

Please cite the PorthoMCL paper if you use it in your research:

PorthoMCL: Parallel orthology prediction using MCL for the realm of massive genome availability
Ehsan Tabari and Zhengchang Su
Big Data Analytics 2017 2:4 DOI: 10.1186/s41044-016-0019-8

porthomcl's Issues

How to work with Output?

Hello,

I just ran this program on 13 genomes and it was very easy and fast. However, I cannot find anything that explains the output files and how to use them. Is there any documentation that explains the output files and how to determine orthologs from it? Thank you.

-Brittany

Difference between original orthomcl and porthomcl

Hi,

I read your paper with great interest. Thanks for developing this tool. I also noticed some differences in methodology which perhaps you can help me understand further.

In the original orthomcl, protein pairs of 1) orthologs, 2) paralogs and 3) coorthologs were first derived and merged into a single large network before mcl is run.

However, from your paper and the detailed manual I gather that you do not merge these pairs into one single network, but instead run mcl on separate networks representing the orthologs and the paralogs. I am curious to know whether this might cause differences in the final interpretation of results, since the starting network is different. Also, it seems that coorthologs are not mentioned in either the manuscript or the detailed manual, even though you have a script for them. I am curious to learn why.

It would be helpful if you can explain the rationale behind the choice of your methodology.

Thanks a bunch!

Regards,
K

FilterFasta error

Hi,

thanks for this tool which promises to be very useful for my work! Unfortunately, I am stuck with an error at step 2.

I am running PorthoMCL on a bunch of files exactly as suggested.

./porthomcl.sh -s 1 -e 2 -n '(.*)_protein\.faa' container

The files are in container/0.input_faa folder and named like- "GCF_000013025.1_ASM1302v1_protein.faa".

However, I get this error at step 2-

The ID on def line '>GCF_000013025.1_ASM1302v1|WP_011645035.1' is missing the prefix '1_ASM1302v1|' 'GCF_000013025.1_ASM1302v1'

I tried looking for the cause of the error, but I do not know perl at all. What may be the problem?

Saurabh
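One way to narrow this down is to check what the `-n` pattern captures from the file name and compare it with the prefix quoted in the error. The sketch below uses only the pattern and file name from the report. It shows that the rejected prefix `1_ASM1302v1` is exactly the part of the captured name after its first dot, which suggests (but does not prove) that the dot in the taxon name is involved. The `dot_free_name` helper is hypothetical and is an untested workaround idea, not a documented PorthoMCL fix.

```python
import re

# Pattern and file name copied from the report above.
pattern = r'(.*)_protein\.faa'
filename = 'GCF_000013025.1_ASM1302v1_protein.faa'

# What the capture group extracts as the taxon name.
taxon = re.fullmatch(pattern, filename).group(1)
print(taxon)

# The prefix named in the error message equals the text after the first dot.
after_first_dot = taxon.split('.', 1)[1]
print(after_first_dot)


def dot_free_name(name):
    """Hypothetical workaround: replace dots in the stem, keep the extension."""
    stem, ext = name.rsplit('.', 1)
    return stem.replace('.', '_') + '.' + ext


print(dot_free_name(filename))
```

If renaming the input files this way makes the step 2 error disappear, that would support the dot being the trigger.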

Python3 Update?

When will these scripts work with the newer Python 3 release?

Different clustering from orthoMCL; possible bug

I am super happy I discovered this parallel, much faster version. However, I find that sometimes orthoMCL performs better or is at least more stable. I typically work with clustering orthologs of very similar sequences, like strains of viruses or human and primate sequences, so it could be that these small differences between sequences are causing issues.

Essentially, I am at times seeing fewer clusters than orthoMCL would produce. Today I was clustering Monkeypox ORFs, and while orthoMCL assigned a cluster to all 186 ORFs of my reference genome, porthoMCL only assigned 36 ORFs to clusters. It's bizarre, because it correctly classified all the ORFs of the other 11 similar viruses.

I am also frequently noticing that one taxon will have 0's for all pairwise comparisons in the 6.ortholog file. There isn't an obvious reason. Yesterday it was one primate organism out of 15 and today it was one virus out of 11. Rerunning all the steps does not seem to fix the issue.

How are blast scores normalized for mclInput? Thanks

some of the sequences in the input file are missing in the output file

Hi, I am using this tool to divide the genes of a single species (Arabidopsis thaliana) into paralogous families.
After running PorthoMCL, the orthologs output is empty, as expected, but the paralogs output does not contain all the sequences from the input (almost all of which were catalogued as "good proteins"). About 5000 sequences are missing.
Are they counted as singletons/orphan genes? If so, what is the difference between them and the sequences that are present in the output but have no other sequence in their family?
I would also mention that the missing sequences are present during steps 1-4 and disappear from step 5 onward (in fact, the folder 5.besthit is always empty).

I am wondering whether this happens because there is only one species, or whether I am reading the results the wrong way.

thank you very much,
Nitzan

Unable to reproduce the final result

Hi. A few years ago I pointed out in #4 that proteins with e-value 0 were not included in the final result table (8.all.ort.group), but I recently realized that perhaps the problem has not been completely solved.
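As background on why an e-value of exactly 0 tends to need special handling: any score derived from the logarithm of the e-value is undefined at 0. The sketch below is not PorthoMCL's code. The function name `evalue_weight` and the `floor` value are assumptions chosen only to illustrate one common workaround, clamping the e-value to a small positive number before taking the logarithm.

```python
import math


def evalue_weight(evalue, floor=1e-181):
    """Return -log10(evalue), clamping zero or tiny values to `floor`.

    `floor` is a hypothetical value for illustration, not a PorthoMCL setting.
    """
    return -math.log10(max(evalue, floor))


# The naive computation fails for an e-value of exactly 0.
try:
    math.log10(0.0)
    naive_failed = False
except ValueError:
    naive_failed = True

print(naive_failed)
print(evalue_weight(0.0))   # clamped, so a finite weight is returned
print(evalue_weight(1e-5))
```

A pipeline that silently skips hits whose weight cannot be computed would drop exactly the strongest matches, which is the pattern described in this report.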

I noticed that in my final result table 8.all.ort.group, there were many lines with only one protein at the end of the file, which was not observed in final output from the example run (the 12-species example provided by the author). I also noticed that many of these "one-line-one-gene" proteins actually have very clear orthology across species (for example, I tried to run PorthoMCL with the default parameters to identify homologous genes between mouse and chicken, but many possibly homologous or even 1-to-1-orthologous genes became these "one-line-one-gene" proteins).

In addition, I tried to count how many genes were captured in the final table 8.all.ort.group. If I omit the "one-line-one-gene" proteins from the table, then only around 30% of all the proteins in each species (in my case, around 6,000 out of ~20,000 proteins) were identified as homologous genes in 8.all.ort.group. Clearly something went wrong: there are usually at least around 10,000 1-to-1 orthologs (e-value cutoff 1e-5) between vertebrates, and PorthoMCL (or orthoMCL) is expected to find many more genes with 1-to-many and many-to-many relationships, yet far fewer were captured in the final output.
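The count described above can be reproduced with a few lines of Python. This is a minimal sketch that assumes each line of 8.all.ort.group is one group of whitespace-separated protein IDs, and that each ID has the form `taxon|protein`. Both are assumptions about the file layout, and the function name `clustered_counts` is made up for this example.

```python
from collections import Counter


def clustered_counts(group_lines, min_size=2):
    """Count distinct proteins per taxon in groups with at least `min_size` members.

    Assumes one group per line, whitespace-separated IDs of the form 'taxon|protein'.
    """
    seen = set()
    for line in group_lines:
        ids = line.split()
        if len(ids) < min_size:
            continue  # skip "one-line-one-gene" entries
        seen.update(ids)
    return Counter(pid.split('|', 1)[0] for pid in seen)


# Tiny made-up example: two real groups and one singleton line.
example = [
    "mouse|A1\tchicken|B1",
    "mouse|A2\tmouse|A3\tchicken|B2",
    "chicken|B9",
]
counts = clustered_counts(example)
print(counts['mouse'], counts['chicken'])
```

On a real run, the same function can be given an open file handle for 8.all.ort.group. Dividing each count by the number of input proteins for that taxon gives the coverage figure discussed above.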

Then I went back to see if it's possible to reproduce the output from the example run (the 12-species example) using the updated code after issue #4 (I started from step 4, which is the step after running BLASTP). The answer is NO. Judging from the sizes of the intermediate files, the files from the step 6.orthologs onwards are not the same. And now with the updated code, the final output table 8.all.ort.group from the example also included these "one-line-one-gene" proteins. A search in the BLASTP result table indicated that many of these genes still had e-value of 0, and many of them also appeared in the result in 7.ogenes (indicating that they have orthologous genes in the other species?). The orthology relationships in 8.all.ort.group (lines with >= 2 protein IDs) also differ slightly from the example output.

I have been quite frustrated with the results from my recent run (as too few protein IDs were included in the orthology output) and have now finally realized that the issue reported a few years ago may still persist. May I know if you have any idea how to solve this? Thank you so much.

Jason.

Original software licence of orthoMCL violated?

Dear Ehsan,

thanks for your work on PorthoMCL. It looks promising for faster determination of orthologs in huge data sets. I am no lawyer, but I think you have to change your licence to GPL3, because the original orthoMCL files require that licence (http://www.orthomcl.org/common/downloads/software/v2.0/SoftwareLicense.txt). Otherwise you should re-implement the functionality of those scripts.

Thanks again!

Best regards,
Frank

Replacing BLAST with DIAMOND

Hi,
Do you have any plans to replace BLAST with DIAMOND?

Thank you in advance,

Michal

proteins going missing at find best blast hit step.

Dear developers

I've used porthoMCL to cluster proteins from 10 isolates of a fungus.
Most of them get clustered alright, but some proteins disappear entirely in the find best hits step. Checking manually, these proteins do have counterparts in the other isolates, sometimes with 100 percent matches over the entire length.
Also, Blastp finds them as they're included in the blastp output files. These matches also survive the splitSimSeq step but then get lost at the find the best blast hit step.

I've run:
porthomclPairsBestHit.py -t taxon_list -s splitSimSeq -b besthit -q paralogTemp -x 1
(repeating this for x 2 to 10)
I saw no error messages.
I'm happy to send some files in confidence for you to try and replicate the problem.

Kind regards and many thanks!
A Webb
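When a step has to be repeated once per taxon like this, generating the commands in a loop avoids skipping or mistyping an index. The sketch below only builds and prints the ten command lines, using the script name and flags exactly as quoted in the report. The helper name `best_hit_commands` is made up for this example, and the sketch does not run PorthoMCL itself.

```python
def best_hit_commands(n_taxa):
    """Build one porthomclPairsBestHit.py command per taxon index (1-based)."""
    base = [
        'porthomclPairsBestHit.py',
        '-t', 'taxon_list',
        '-s', 'splitSimSeq',
        '-b', 'besthit',
        '-q', 'paralogTemp',
    ]
    return [base + ['-x', str(i)] for i in range(1, n_taxa + 1)]


commands = best_hit_commands(10)
for cmd in commands:
    print(' '.join(cmd))
```

To actually execute them, each list could be passed to `subprocess.run(cmd, check=True)`, which raises an exception if the script exits with a non-zero status, so a failure at one index is not missed.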