PS. I didn't have that quite right. The sequence wasn't necessarily shorter than 620 bp, but part of its length was at the front, leaving a number of gaps at the end after trimming using the reference.
One idea to address this would be to make the length filter 640 bp (instead of 620) but still trim to 620. That way, we would hopefully be trimming to higher-quality sequence most of the time, rather than using the last few nucleotides of a sequence whose quality is declining towards its end.
What do you think?
Thanks for any thoughts on that.
from evolutionary-rates-analysis-pipeline.
Hi Sally,
Thanks for your suggestion, I agree on changing the filter to 640 bp. I can make the changes to the branches.
The length filter counts all characters but I could find a way of only counting nucleotides if that works better. I'm guessing the Biostrings package probably has a function for easily counting nucleotides. I'll look into it.
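One possibility, as an untested sketch (assuming the sequences are plain character strings in dfInitial$nucleotides), might be:

```r
# Untested sketch: count only A/C/G/T characters with Biostrings.
# "-" is part of the DNA alphabet, so gapped sequences are accepted.
library(Biostrings)

dna <- DNAStringSet(dfInitial$nucleotides)
# letterFrequency with a multi-letter string returns the combined count
# of A, C, G and T per sequence (gaps and ambiguity codes excluded).
sequenceLengths <- letterFrequency(dna, letters = "ACGT")[, 1]
```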
Best Regards,
Matt
Thank you very much Matt.
Line 258 (Annelida branch) uses "length" for calculating the proportion of internal gaps. Could that be used? I think the most important thing is excluding the external gaps for counting the sequence length. I'm not so concerned about internal gaps, as those are few.
Again, the few sequences that I looked at relating to this issue were short on the right side but could have been 620 bp in total. I am having a run at Annelida now with the 640 bp filter to see if that helps.
Cheers,
Sally
Thanks Sally, you're right about the command at line 258; I should be able to reuse it to determine the sequence length.
Hi Matt,
OK - thank you very much for trying that.
Cheers,
Sally
I think I came up with an easy fix for determining the correct sequence length:
sequenceLengths <- nchar(gsub("-", "", dfInitial$nucleotides))
sequenceLengthCheck <- which(sequenceLengths > 1000 | sequenceLengths < 640)
# Guard against the empty-index case: dfInitial[-integer(0), ] would drop all rows
if (length(sequenceLengthCheck) > 0) dfInitial <- dfInitial[-sequenceLengthCheck, ]
The gsub function eliminates the gaps in any sequences that have them, so nchar should only count actual nucleotides. Also, the minimum sequence length will now be 640 bp. Just tested on Annelida and it seems to work!
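A quick toy check of the idea (illustrative values only):

```r
# Two toy sequences, one with internal gaps
seqs <- c("ATG---CGT", "ATGCGTAA")
nchar(seqs)                 # 9 8 -- gaps are counted
nchar(gsub("-", "", seqs))  # 6 8 -- nucleotides only
```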
I'll modify all of the branches with the above code.
Best Regards,
Matt
Fantastic - thanks Matt!
Hi Matt,
I think this looks great. Thank you for implementing this. I think the different filtering and trimming settings will be helpful for the quality of the sequences and results.
I do have a final question before we close this issue, about the specific bp choices. I think 640 bp filtering and 620 bp trimming are good choices. However, would we lose a lot of pairs for Arthropoda? For example, if we would lose 2% of pairs, I think that is fine. However, if we'd lose 20% of pairs, then I suggest we consider 620 bp filtering and 600 bp trimming.
I recall you did some explorations a while ago about median sequence lengths and how pairing success relates to sequence length? Do you have any prior results that could help us out here?
If not, then perhaps we could do a final consideration of this issue if things run smoothly for Arthropoda on the cloud. We could run the analysis both ways.
Thank you for any thoughts.
Best wishes,
Sally
Hi Matt,
Sorry to add one more thing to your list on this issue. I just realized that for Echinodermata one quite short sequence got through to the final trimmed alignment (looking first at class Asteroidea, a 449 bp sequence made it into that alignment). Perhaps this sequence was longer and hence passed the filter, but some of its length was in the "wrong" region.
Is it possible to add a second length-based exclusion step after completion of the final, trimmed alignment? Or, do you have a different idea for this?
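For concreteness, the kind of post-trimming check I have in mind might look something like this (a hypothetical sketch only; dfTrimmed and the 620 bp cutoff are placeholders):

```r
# Hypothetical second length filter, applied after the final trimmed
# alignment; dfTrimmed is a placeholder for the trimmed data frame.
trimmedLengths <- nchar(gsub("-", "", dfTrimmed$nucleotides))
dfTrimmed <- dfTrimmed[trimmedLengths >= 620, ]
```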
For some taxa, this would have a minimal impact, as probably there are very few sequences that are long but not in the "correct" genetic region for us. Most of the sequences in the database will have been generated using primers that bind in the standard region. So, if this is tricky to fix, I think it is OK and we can manually check against that, specifically for Mollusca.
However, if this can be fixed without too much trouble, I suggest it would be helpful. As you will see in another thread, I am going to recommend complete deletion rather than pairwise deletion for calculating Mollusca genetic distances, so allowing shorter sequences through could skew the result. We can manually check for this scenario for Mollusca specifically if we don't want to touch the code at this late stage. But if the code can readily be made more general, this would save us time if we run the code again in the future, e.g. if we are fine-tuning additional parameters based upon new discoveries about the data or results, or if we are rerunning after checking for updated sequence data after review.
Thank you for considering this issue.
Cheers,
Sally
PS. Or, perhaps there is a better way, and the reference sequence could be included in alignment 2 and the length filtering done after that, prior to the final alignment? Any different ideas welcome!
PPS. Pardon me for this long chain! I should have looked at all of the echinoderm classes before posting. The diversity of genetic regions within COI and the diversity of primers used appears to be much higher for Echinodermata than for the other phyla (based upon info for four phyla: I have run Annelida, Cnidaria, and Echinodermata, and I looked at your Mollusca results).
We would lose a lot of data for echinoderms if we implement the solution that I proposed above: deleting sequences based on length a final time after trimming using the reference. I wanted to post now so that you don't work on this issue now. I am exploring other options for dealing with this... such as using partial deletion at the stage of calculating the distance matrix.
I will update again soon on this issue!
OK - So, unfortunately, I do not see an option for partial deletion in the distance calculator we are using. (This is an option in MEGA.) I was going to recommend partial deletion for Echinodermata with a setting such as 0.60 (i.e. positions with less than 60% data coverage would be deleted at the time of calculating the distance matrix).
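If we ever wanted to implement it ourselves, partial deletion is straightforward to sketch (hypothetical code; it assumes the alignment is held as a character matrix "aln" with rows as sequences and columns as sites):

```r
# Sketch of partial deletion at 60% site coverage: drop alignment
# columns where fewer than 60% of sequences have a nucleotide.
coverage <- colMeans(aln != "-")
alnPartial <- aln[, coverage >= 0.60]
```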
I suggest that we leave this issue alone. For most groups, there will be very very few short sequences that get through, based upon the filters that we do have in place. In echinoderms there will be a few.
What I will do is compare the results for echinoderms with pairwise vs complete deletion to make sure this doesn't have a large impact upon the conclusions. For all other groups, I think this will most likely be an extremely minor issue.
So, again, pardon the long chain with all these comments. However, at least you will see this issue I was considering and you can let me know if you have any alternative view compared to the conclusion I came to here.
Also, in light of my many subsequent comments, please do see further up this chain for my question about 2% vs. 20% of Arthropoda pairs. Thank you!
Hi Sally,
Sorry for the delay in responding (had to go and buy a turkey for Christmas).
Thanks for all of your work on this! Regarding the 2% vs. 20% of Arthropoda pairs, I just did a few tests on the Lepidoptera initial dataframe that I had saved. I ran a sequence length check with the original code:
sequenceLengths <- nchar(dfInitial$nucleotides)
sequenceLengthCheck <- which(sequenceLengths > 1000 | sequenceLengths < 620)
I ended up with 26,097 of 573,242 sequences being filtered out, or roughly 4.6%.
Then I tried with the new code:
sequenceLengths <- nchar(gsub("-", "", dfInitial$nucleotides))
sequenceLengthCheck <- which(sequenceLengths > 1000 | sequenceLengths < 640)
I ended up with 65,798 of 573,242 sequences being filtered out, or roughly 11.5%.
Then I tried the new method of counting nucleotides with the minimum sequence length of 620:
sequenceLengths <- nchar(gsub("-", "", dfInitial$nucleotides))
sequenceLengthCheck <- which(sequenceLengths > 1000 | sequenceLengths < 620)
This gave only a marginal increase over the original code: 26,136 sequences filtered, which still works out to around 4.6% of total sequences.
So the question is whether we are willing to exclude about 11% of sequences with a 640 bp minimum sequence length.
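For reference, the percentages quoted above can be reproduced directly from the counts:

```r
# Sanity check of the percentages quoted above
round(100 * 26097 / 573242, 1)  # 4.6  -- 620 bp filter
round(100 * 65798 / 573242, 1)  # 11.5 -- 640 bp filter
```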
Best Regards,
Matt
Adding to that, I can't seem to find any results I have on the effect minimum sequence length would have on pairing. I think when I did those tests it was with an earlier version of the script, so it's hard to say what effect the minimum sequence length would have with the current iteration. Hopefully what I mentioned above can help us make a decision.
Best Regards,
Matt
Hi Matt,
Thank you very much for this summary. I think that if 11% of the sequences are filtered out, the percentage of BINs filtered out would likely be smaller. At BIO, many insects are sequenced unidirectionally, but there is then a step for seeking bidirectional sequences for a few representatives per BIN. It's just that this work is in progress, so not all new BINs would be covered yet. Nevertheless, I predict that our percentage of BINs dropped would be lower than the percentage of sequences dropped.
So, I am OK with proceeding with the 640 bp filtering and 620 bp trimming approach. I think that overall our results will be more accurate: they will be based on longer sequences and a higher proportion of sequences with bidirectional sequencing coverage.
Sound OK?
Best wishes,
Sally
Sounds good, I'll proceed with the 640 and 620 approach.
Best Regards,
Matt
Great! Thanks Matt. I will close this issue. As usual, we will look out for additional issues that arise in the new analyses.
Best wishes,
Sally