Hi. A few years ago I pointed out in #4 that proteins with an e-value of 0 were not included in the final result table (8.all.ort.group), but I recently realized that the problem may not have been completely solved.
I noticed that my final result table, 8.all.ort.group, has many lines containing only one protein at the end of the file, which is not the case in the final output of the example run (the 12-species example provided by the author). Many of these "one-line-one-gene" proteins actually have very clear orthology across species. For example, when I ran PorthoMCL with the default parameters to identify homologous genes between mouse and chicken, many likely homologous or even 1-to-1 orthologous genes ended up as "one-line-one-gene" proteins.
In addition, I counted how many genes are captured in the final table 8.all.ort.group. If I omit the "one-line-one-gene" proteins from the table, only around 30% of the proteins in each species (in my case, around 6,000 of ~20,000 proteins) were identified as homologous genes in 8.all.ort.group. Clearly something has gone wrong: there are usually at least around 10,000 1-to-1 orthologs between vertebrates (e-value cutoff 1e-5), and PorthoMCL (or orthoMCL) is expected to find many more genes with 1-to-many and many-to-many relationships, yet far fewer were captured in the final output.
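For anyone who wants to check the same numbers on their own run, the count above can be reproduced with something like the following. This is only a minimal sketch: it assumes that each line of 8.all.ort.group is a whitespace-separated list of protein IDs and that IDs follow the orthoMCL-style `taxon|protein` convention. The function name `summarize_groups` is my own and is not part of PorthoMCL.

```python
from collections import Counter


def summarize_groups(lines):
    """Count singleton and multi-member groups, plus grouped proteins per taxon.

    Assumption: one group per line, IDs separated by whitespace,
    each ID written as 'taxon|protein'.
    """
    singletons = 0
    multi = 0
    grouped_per_taxon = Counter()
    for line in lines:
        ids = line.split()
        if not ids:
            continue  # skip blank lines
        if len(ids) == 1:
            singletons += 1  # a "one-line-one-gene" protein
            continue
        multi += 1
        for pid in ids:
            taxon = pid.split("|", 1)[0]
            grouped_per_taxon[taxon] += 1
    return singletons, multi, grouped_per_taxon


# Small in-memory demo; for a real run, pass an open file handle instead:
#     with open("8.all.ort.group") as fh:
#         print(summarize_groups(fh))
demo = ["mouse|A\tchicken|B", "mouse|C chicken|D chicken|E", "mouse|F"]
print(summarize_groups(demo))
```

Dividing each per-taxon count by the number of proteins in that species' input FASTA gives the coverage figure quoted above.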
Then I went back to check whether the output of the example run (the 12-species example) can be reproduced with the code as updated after issue #4 (I started from step 4, which is the step after running BLASTP). The answer is no. Judging from the sizes of the intermediate files, the files from step 6.orthologs onwards are not the same. With the updated code, the final output table 8.all.ort.group from the example also includes these "one-line-one-gene" proteins. A search in the BLASTP result table showed that many of these genes still have an e-value of 0, and many of them also appear in the results in 7.ogenes (indicating that they have orthologous genes in the other species?). The orthology relationships in 8.all.ort.group (lines with >= 2 protein IDs) also differ slightly from the example output.
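The e-value check described above can be scripted roughly as follows. Again a sketch under assumptions: the BLASTP results are taken to be in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, e-value in column 11), and the group file is taken to be whitespace-separated with one group per line. All function names here are hypothetical helpers, not PorthoMCL code.

```python
def zero_evalue_queries(blast_lines):
    """Return query IDs that have at least one non-self hit with e-value 0."""
    queries = set()
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a 12-column tabular hit line
        try:
            evalue = float(fields[10])
        except ValueError:
            continue  # header or malformed line
        if evalue == 0.0 and fields[0] != fields[1]:
            queries.add(fields[0])
    return queries


def singleton_ids(group_lines):
    """Return the IDs that sit alone on a line of the group file."""
    ids = set()
    for line in group_lines:
        members = line.split()
        if len(members) == 1:
            ids.add(members[0])
    return ids


def singletons_with_zero_evalue(group_lines, blast_lines):
    """Singleton proteins that nevertheless have an e-value 0 hit to another protein."""
    return sorted(singleton_ids(group_lines) & zero_evalue_queries(blast_lines))
```

A non-empty result from `singletons_with_zero_evalue` on the example data would be consistent with e-value 0 hits still being dropped somewhere after the BLASTP step, although it does not by itself show where.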
I have been quite frustrated with the results of my recent run (too few protein IDs were included in the orthology output), and I have now realized that the issue reported a few years ago may still persist. Do you have any idea how to solve this? Thank you so much.
Jason.