Issue: we need to decide on the specific settings for running muscle.

Decision needed: Selecting number of alignment iterations (m-orton/evolutionary-rates-analysis-pipeline)

Comments (16)

sadamowi commented on August 15, 2024

Hi Matt,

Regarding the gap penalty, I think we should examine alignments from a few more phyla before making any adjustment to the default gap penalty. For Annelida (we had data for classes Polychaeta and Clitellata), I think the alignment performed well. I provided a more detailed explanation in the now-closed "alignment" thread about why I think the gap penalty performed well in that group.

That is a good question about whether we can save computing time by reducing the number of alignment iterations. I'd like to suggest that we run Cnidaria and Echinodermata soon to see how the alignment performs. Marine animal groups with ancient diversification would likely pose the greatest challenges for our pipeline. Therefore, I'd like to suggest that we ensure the "longer" alignment version is performing well before reducing this.

That is a very good question about whether this would be fine for Arthropoda. I think we will likely run into memory challenges on that group. So, this is a good option to explore. Also, I will follow up with Compute Canada about my account. I submitted this application some weeks ago, and it was supposed to be processed within a few days.

These are just my thoughts for now. I suggest we leave this issue open until we run a few more phyla... i.e. those with modest numbers of sequences but deep evolutionary divergences. Then, we should reconsider this before closing this issue.

Cheers,
Sally

from evolutionary-rates-analysis-pipeline.

sadamowi commented on August 15, 2024

Update on trials to determine alignment settings

For Annelida, I have found that imposing no constraint vs. maxiters = 2 and diags = TRUE made a very slight (nearly negligible) difference in the alignment and final results. The conclusions that would be drawn from that dataset would be the same. However, there was a large difference in processing time.

The largest difference in computation time was for alignment 2, which took 59 min. vs. around 10 min.

Alignment 3 took about 20 min vs. 1 min. Compared to some prior versions of this script, this step also seems to have been sped up a lot (with fewer sequences going into it) by including the divergent sequence exclusion step prior to running alignment 3. So, I think that was a good choice.

Based upon these results, I suggest that maxiters = 2 is a viable option for this data set. However, as this group is small, running at a higher number of maxiters isn't difficult. If we find that performance is better for the other primarily marine phyla (with ancient divergences) at a higher level of maxiters, we could go with that level for Annelida too for consistency.

I will try Cnidaria and Echinodermata also. And, it would be interesting to see the influence of this setting for Mollusca, if that's feasible for you to run the script again for Mollusca.

Cheers,
Sally

sadamowi commented on August 15, 2024

Hi Matt,

Here is just a brief update on the progress with the alignment settings.

I am struggling a little to figure out if we can use the same alignment settings for all 6 phyla. I am starting to think it is OK if there are different settings for the different phyla. OK with you?

For example, for Cnidaria, I am finding the best alignments with gapopen = -3000.

This higher gap opening penalty is maintaining the reading frame well. Superfluous gaps that disrupt the reading frame are NOT inserted while a gap of 3 positions (reflecting one amino acid indel) is inserted in Hydrozoa. I think this setting yields the most biologically reasonable alignments for this phylum.

I think that if all sequences are correct, a heavy gap opening penalty makes the most sense for COI. HOWEVER, the problem is when there is an erroneous EXTRA nucleotide inserted near the end. When this happens near the end, there may not be enough distance left for a stop codon to show up (a stop codon is what blocks a sequence from receiving a BIN assignment). We want a single-position gap to be inserted in such a case; thus, homology is maintained, and that position is ignored in later distance calculations that use pairwise deletion.
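The reading-frame reasoning above can be checked mechanically. The following is a minimal Python sketch, not part of the pipeline (which is written in R): the function names `gap_runs` and `frame_breaking_gaps` and the toy sequences are hypothetical. It lists the gap runs in one aligned sequence and flags internal runs whose length is not a multiple of three.

```python
import re

def gap_runs(aligned_seq):
    """Return (start, length) for each run of '-' in an aligned sequence."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"-+", aligned_seq)]

def frame_breaking_gaps(aligned_seq, ignore_terminal=True):
    """Return gap runs whose length is not a multiple of 3.

    Leading and trailing gaps usually reflect differing sequence lengths
    rather than indels, so they are skipped by default.
    """
    n = len(aligned_seq)
    flagged = []
    for start, length in gap_runs(aligned_seq):
        is_terminal = start == 0 or start + length == n
        if ignore_terminal and is_terminal:
            continue
        if length % 3 != 0:
            flagged.append((start, length))
    return flagged

# Toy examples (invented, not real COI data).
in_frame = "ATGGCA---TTTGGA"   # one 3-position gap: a codon-sized indel
shifted = "ATGGCA-TTTGGAAC"    # one 1-position gap: reading frame disrupted

print(frame_breaking_gaps(in_frame))
print(frame_breaking_gaps(shifted))
```

A flagged single-position gap is not necessarily an alignment error. As described above, it can be exactly the outcome we want when one sequence carries an erroneous extra nucleotide, so flagged gaps are candidates for visual inspection rather than automatic rejection.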

I have found very minimal influence of the maxiters setting. As Cnidaria is such a small group, I've been using maxiters=3 rather than 2.

I am exploring the results from a few other phyla to see if -3000 as the gapopen setting makes sense for us as our default for the code. We could comment in the code that this parameter may need adjusting. I hope to provide another update this evening, with a final recommendation about what I think makes the most sense heading into the two largest phyla: Chordata and Arthropoda.

Cheers,
Sally

m-orton commented on August 15, 2024

Hi Sally, thanks very much for looking into the best alignment settings. Your suggestion to vary the alignment settings by phylum sounds good to me.

Thanks,
Matt

sadamowi commented on August 15, 2024

Hi Matt,

OK! So, I think I have recommendations for the alignment settings that can be considered near final. As usual, I think we should continue to examine the alignments (especially the final, trimmed alignments), for any groups for which it is feasible to do so, through the end. I can look at these for any analyses that you run on your computer or the cloud. I can now access files through the Google Drive - thank you. Also, let me know if you want me to place my results there. I'm not sure if you want to go through those.

So:

I have been using diags=TRUE for all three alignments. I have not found this to make a difference so far. However, I have not done an in-depth exploration. Basically, I used this feature, and then I examined whether the alignments make sense biologically... and, they did.

Also, I have found the maxiters parameters to make either no difference or just a minimal difference to the alignments. I compared max2, max3, and no limit imposed for a couple of groups. Given that this has minimal cost for small phyla, I've mainly been using maxiters = 3 for all three alignment stages. Perhaps we could go with that for the smaller phyla and try out maxiters = 2 for Arthropoda?

I have found the gapopen parameters to be the most important. I recommend gapopen = -3000 as our own default. Again, I suggest we continue to inspect new alignments as they come out. For the phyla I have examined so far (four phyla), this helped to maintain homology and conservatism of the alignment at the amino acid level, i.e. without the alignment being excessively gappy when there is substantial variability at the nucleotide level.

I think our new 640 filter/620 trim approach has helped with the earlier issue I raised that a stringent gap opening penalty can mean a gap is erroneously NOT placed in light of an indel based upon a base calling error near the end of the sequence.

So, in summary, my recommendation as our default for all three alignment steps is:

maxiters = 3, diags = TRUE, gapopen = -3000

I suggest we consider changing to maxiters = 2 for Arthropoda.
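One way to record these defaults together with the per-phylum exception is a small lookup table. This is only an illustrative Python sketch: the pipeline itself is in R, and the names `DEFAULT_ALIGNMENT_SETTINGS`, `PHYLUM_OVERRIDES`, and `alignment_settings` are hypothetical. The keys mirror the parameter names quoted in this thread.

```python
# Defaults recommended in this thread for all three alignment steps.
DEFAULT_ALIGNMENT_SETTINGS = {"maxiters": 3, "diags": True, "gapopen": -3000}

# Per-phylum exceptions; only the values that differ from the defaults.
PHYLUM_OVERRIDES = {
    "Arthropoda": {"maxiters": 2},  # large phylum: fewer iterations to save time
}

def alignment_settings(phylum):
    """Return the alignment settings for a phylum, applying any override."""
    settings = dict(DEFAULT_ALIGNMENT_SETTINGS)  # copy so defaults stay intact
    settings.update(PHYLUM_OVERRIDES.get(phylum, {}))
    return settings

for phylum in ("Cnidaria", "Arthropoda"):
    print(phylum, alignment_settings(phylum))
```

Keeping the exceptions in one table means a later adjustment for a specific taxon is a one-line change rather than an edit to each alignment call.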

I also suggest making one other change in light of the Mollusca alignment. In general, we are using pairwise.deletion = TRUE throughout. I suggest keeping that for most groups.

For Mollusca, I suggest changing this to pairwise.deletion = FALSE, but only for the final distance calculation step. I think pairwise deletion would be fine, and even preferable, for earlier steps such as the centroid finder and the divergent sequence ejection step. I recommend turning off pairwise deletion just for the final distance matrix step for Mollusca.

This recommendation was based upon the observation that the specific gaps were dubiously placed for the two largest mollusc classes: Bivalvia and Gastropoda. With the stringent gap setting, the amino acid alignment was well maintained on both sides up to the gappy regions, but then the gaps were dubiously placed within those regions. I think we will get more accurate results using complete deletion for molluscs specifically, the phylum with the most indels among our 6 target phyla.
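To make the difference between the two deletion modes concrete, here is a minimal Python sketch. It uses a simple p-distance (proportion of differing sites) as a stand-in, which is an assumption for illustration and not necessarily the distance model used in the pipeline's R code. The function names and the toy alignment are hypothetical.

```python
def gapped_columns(alignment):
    """Indices of alignment columns where any sequence has a gap."""
    return frozenset(i for i, col in enumerate(zip(*alignment)) if "-" in col)

def p_distance(a, b, skip_columns=frozenset()):
    """Proportion of differing sites, ignoring gaps and skipped columns."""
    compared = diffs = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in skip_columns or x == "-" or y == "-":
            continue
        compared += 1
        diffs += int(x != y)
    return diffs / compared if compared else float("nan")

def distance_matrix(alignment, pairwise_deletion=True):
    """Pairwise deletion drops gapped sites per pair of sequences;
    complete deletion drops every column that has a gap in any sequence."""
    skip = frozenset() if pairwise_deletion else gapped_columns(alignment)
    n = len(alignment)
    return [[p_distance(alignment[i], alignment[j], skip) for j in range(n)]
            for i in range(n)]

# Toy alignment (invented): the third sequence has one gap at column 2.
toy = ["ACGTAC", "ACGTAA", "AC-TTC"]
pairwise = distance_matrix(toy, pairwise_deletion=True)
complete = distance_matrix(toy, pairwise_deletion=False)
print(pairwise[0][1], complete[0][1])
```

With complete deletion, the distance between the first two sequences changes even though neither contains a gap, because the gapped column is removed for every pair. That is the intended effect for Mollusca: columns in regions where gap placement is dubious do not contribute to any comparison.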

So, please let me know if you have any comments on the above!

Otherwise, I am happy to go forward with these as the near-final settings. I will continue to evaluate any new taxa or new alignments produced.

Best wishes,
Sally

m-orton commented on August 15, 2024

Hi Sally,

I think it would be good to have all of the results in the Google Drive. The one limitation of Google Drive that I don't like is that I can't post R workspaces, so Dropbox or some other alternative might actually be better.

Your suggestions on the alignment parameters all sound great to me.

I think maxiters=2 for Arthropoda should be sufficient. Supposing the script runs really fast, I could try experimenting with maxiters=3 to see if that has any effect. But I will go with the alignment parameters you suggested for Arthropoda and see how it goes.

I can also modify the branches with the new alignment parameters and run through Mollusca (with the new pairwise deletion parameter) again if you like.

Just to clarify, the default alignment parameters you suggested are across all branches, correct?

Thanks,
Matt

sadamowi commented on August 15, 2024

Hi Matt,

I am already using Dropbox. Shall I just invite you to the entire project folder for this project/paper? That could work well for sharing various files associated with the project. The current folder size is just 580 MB. I presume that's fine in terms of sharing with you? Do you have a free Dropbox account? I have a TB paid account, and so normally I don't worry about space much, given my typical activities.

Note I didn't save every single run I performed, but I saved the ones that I thought were important for justifying some of the key decisions, such as comparing the maxiters setting and the default vs. a heavy gap opening parameter. For Cnidaria, I tried a lot of interim gap opening parameters (I didn't save all those interim results), and I kept going until I got the parameter that yielded the alignment I wanted based upon the translated sequences: i.e. a small number of gaps reflecting amino acid indels. Then, for other groups, I just compared default vs. -3000 to make sure the results made sense.

Also, I performed some additional explorations, such as realigning the Mollusca sequences in MEGA. I wrote up the observations to the issues tracker as a summary but didn't save the results. Next time, I will save the results. However, I am hoping that the intensive tinkering phase is now over!

Also, I think I should have a final look at the alignments outputted from all groups, as we have tweaked quite a lot of things with the code recently. I suggest I should do a final final check of outputs (especially the final, trimmed alignments) when we run everything on the cloud in the new year, for consistency.

I suggest that if the maxiters = 2 yields biologically reasonable alignments for Arthropoda, then we should leave the setting there. Hopefully we can figure out how to examine at least part of the large alignment visually. I should be able to copy a portion of the alignment file to MEGA, for example, to confirm that nothing unexpected has happened.
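Copying out part of a large alignment for visual inspection could look roughly like the following Python sketch. It is an assumption-laden illustration, not the pipeline's code: it presumes the alignment is available as FASTA text, and the function names `parse_fasta`, `alignment_window`, and `to_fasta` are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def alignment_window(records, max_seqs, start, end):
    """Keep the first max_seqs sequences and columns start..end-1."""
    return [(name, seq[start:end]) for name, seq in records[:max_seqs]]

def to_fasta(records):
    """Write (name, sequence) tuples back out as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)

# Toy input (invented); a real alignment would be read from a file.
text = ">a\nACGT\nAC\n>b\nAC-TAC\n>c\nTTTTTT\n"
subset = alignment_window(parse_fasta(text), max_seqs=2, start=1, end=4)
print(to_fasta(subset))
```

Taking the first few sequences is the simplest choice. Sampling sequences from different parts of the file, or from different taxonomic groups, would give a more representative look at a large alignment.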

Yes, the alignment settings that I recommended are for all taxonomic branches and for all three alignment steps. Then, I recommend a unique setting (turn off pairwise deletion) for the final distance matrix for Mollusca.

I will examine the alignments for the groups we haven't run yet (Arthropoda and Chordata) to make sure that those settings work well for those groups too. The settings SHOULD work well if the sequences are perfect! :-)

The issue is whether these large datasets contain sequences with errors (especially indel-related calling errors) that are severe enough to mess things up while also having passed through the filters. There may be some tricky kinds of cases we haven't thought of - but hopefully not!

I look forward to seeing how this trial goes!

Cheers,
Sally

sadamowi commented on August 15, 2024

PS. I forgot to clarify... yes, that would be great if you'd run through Mollusca with the revised code. Thank you very much.

m-orton commented on August 15, 2024

Hi Sally, thanks for clarifying the points that I mentioned above. That would be great if you could add me to the project folder in Dropbox. I have a free account. Would I need to upgrade to a paid account to add things to your folder?

I'll update all of the branches with the new alignment parameters and run through Mollusca again as well with the pairwise deletion setting. I think tonight I'll focus on the Lepidoptera test and then tomorrow I'll go through Mollusca again. Fingers crossed everything goes well!

Best Regards,
Matt

sadamowi commented on August 15, 2024

Hi Matt,

No, you don't need a paid account to share a folder with me. I am just asking about space, because the free accounts have limited storage space. The contents of a shared folder count towards each person's total storage quota.

I'd be happy to add you. What email is your Dropbox account under?

OK awesome - thank you for updating the code and tackling the Lep test. Please do let me know when the code is updated, as I will then run new analyses for the three smallest phyla (Annelida, Echinodermata, and Cnidaria) in order to verify that all went well with our recent decisions and these updates, and also to get the penultimate results for helping me to work on the manuscript draft over the break.

Cheers,
Sally

m-orton commented on August 15, 2024

Ok no problem, I can upgrade to the paid account for more space. Email for my dropbox is my gmail: [email protected]

m-orton commented on August 15, 2024

All branches have now been updated with new alignment parameters and Mollusca has also had its pairwise deletion setting changed to false for the final distance matrix.

Lepidoptera will be run with maxiters=2.

Also, I upgraded to Dropbox Pro so we have enough space in the shared folder.

One thing I suggest is that for the final results of each phylum, we post in Dropbox the initial tsv workspace as well as a final R workspace containing all pairing results.

Best Regards
Matt

sadamowi commented on August 15, 2024

Hi Matt,

Is your regular account nearly full? We could always see how things go with that, if you have some space left. I don't want you to have to upgrade, unless you had been thinking of that anyhow for your own usage, as it is rather expensive. My accounts are each $99 US/year. I have one for myself and one for my lab group.

So, if you need more space in order to be able to use the shared folder, another option is that I can give you access to my lab's shared account and we can put files there. That is shared with all lab members, and so you would see various folders (e.g. lab SOPs, reprints, etc.), but there aren't personal files there. So, that would be an option to consider.

Cheers,
Sally

m-orton commented on August 15, 2024

Hi Sally,

My regular account still has some space but I decided to go with the monthly option for upgrading to Dropbox pro.

No worries on the upgrade, I sometimes back up files from my desktop so I decided it would be beneficial for me to upgrade anyways. So I should be good on space for the shared folder but thanks for offering.

Best Regards,
Matt

sadamowi commented on August 15, 2024

Dear Matt,

That sounds good. I have invited you to share the folder. Let me know if you have any trouble accessing the folder.

After we wrap up our analyses, we could back up our results in multiple places, and then you could downgrade your account and delete the large results files if you wish.

That's a good idea about saving the initial and final workspaces. We are hopefully getting towards the real results at this point!

Cheers,
Sally

sadamowi commented on August 15, 2024

Dear Matt and Jacqueline,

Given we are getting good alignments for leps based upon Matt's recent runs, I think this issue of alignment iterations is now settled, as described several comments up. To reduce clutter on the screen, I will close this issue. Of course, I think it would be wise for us to continue to check the alignments through the end to be sure nothing goes awry for new runs or taxa. We may need to make an adjustment for specific taxa, but that could be discussed in its own thread.

Best wishes,
Sally
