PS. I didn't have that quite right. The sequence wasn't necessarily shorter than 620 bp, but part of its length was at the front, leaving a number of gaps at the end after trimming using the reference.
One idea to address this would be to make the length filter 640 bp (instead of 620) but still trim to 620. That way, we would hopefully be trimming to higher-quality sequence most of the time, rather than using the last few nucleotides of a sequence whose quality is declining towards its end.
What do you think?
Thanks for any thoughts on that.
from evolutionary-rates-analysis-pipeline.
Hi Sally,
Thanks for your suggestion, I agree on changing the filter to 640 bp. I can make the changes to the branches.
The length filter counts all characters but I could find a way of only counting nucleotides if that works better. I'm guessing the Biostrings package probably has a function for easily counting nucleotides. I'll look into it.
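One possibility, as an untested sketch (assuming the sequences are plain character strings in dfInitial$nucleotides), might be:

```r
# Untested sketch: count only A/C/G/T characters with Biostrings.
# "-" is part of the DNA alphabet, so gapped sequences are accepted.
library(Biostrings)

dna <- DNAStringSet(dfInitial$nucleotides)
# letterFrequency with a multi-letter string returns the combined count
# of A, C, G and T per sequence (gaps and ambiguity codes excluded).
sequenceLengths <- letterFrequency(dna, letters = "ACGT")[, 1]
```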
Best Regards,
Matt
Thank you very much Matt.
Line 258 (Annelida branch) uses "length" for calculating the proportion of internal gaps. Could that be used? I think the most important thing is excluding the external gaps for counting the sequence length. I'm not so concerned about internal gaps, as those are few.
Again, the few sequences that I looked at relating to this issue were short on the right side but could have been 620 bp in total. I am having a run at Annelida now with the 640 bp filter to see if that helps.
Cheers,
Sally
Thanks Sally, you're right about the command at line 258; I should be able to reuse it to determine the sequence length.
Hi Matt,
OK - thank you very much for trying that.
Cheers,
Sally
I think I came up with an easy fix for determining the correct sequence length:
sequenceLengths <- nchar(gsub("-", "", dfInitial$nucleotides))
sequenceLengthCheck <- which(sequenceLengths > 1000 | sequenceLengths < 640)
# Guard against the empty-index case: dfInitial[-integer(0), ] would drop all rows
if (length(sequenceLengthCheck) > 0) dfInitial <- dfInitial[-sequenceLengthCheck, ]
The gsub function eliminates the gaps in any sequences that have them, so nchar should only count actual nucleotides. Also, the minimum sequence length will now be 640 bp. Just tested on Annelida and it seems to work!
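A quick toy check of the idea (illustrative values only):

```r
# Two toy sequences, one with internal gaps
seqs <- c("ATG---CGT", "ATGCGTAA")
nchar(seqs)                 # 9 8 -- gaps are counted
nchar(gsub("-", "", seqs))  # 6 8 -- nucleotides only
```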
I'll modify all of the branches with the above code.
Best Regards,
Matt
Fantastic - thanks Matt!
Hi Matt,
I think this looks great. Thank you for implementing this. I think the different filtering and trimming settings will be helpful for the quality of the sequences and results.
I do have a final question before we close this issue, about the specific bp choices. I think 640 bp filtering and 620 bp trimming are good choices. However, would we lose a lot of pairs for Arthropoda? For example, if we would lose 2% of pairs, I think that is fine. However, if we'd lose 20% of pairs, then I suggest we consider 620 bp filtering and 600 bp trimming.
I recall you did some explorations a while ago about median sequence lengths and how pairing success relates to sequence length? Do you have any prior results that could help us out here?
If not, then perhaps we could do a final consideration of this issue if things run smoothly for Arthropoda on the cloud. We could run the analysis both ways.
Thank you for any thoughts.
Best wishes,
Sally
Hi Matt,
Sorry to add one more thing to your list on this issue. I just realized that for Echinodermata one quite short sequence got through to the final trimmed alignment (looking first at class Asteroidea, a 449 bp sequence made it into that alignment). Perhaps this sequence was longer and hence passed the filter, but some of its length was in the "wrong" region.
Is it possible to add a second length-based exclusion step after completion of the final, trimmed alignment? Or, do you have a different idea for this?
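For concreteness, the kind of post-trimming check I have in mind might look something like this (a hypothetical sketch only; dfTrimmed and the 620 bp cutoff are placeholders):

```r
# Hypothetical second length filter, applied after the final trimmed
# alignment; dfTrimmed is a placeholder for the trimmed data frame.
trimmedLengths <- nchar(gsub("-", "", dfTrimmed$nucleotides))
dfTrimmed <- dfTrimmed[trimmedLengths >= 620, ]
```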
For some taxa, this would have a minimal impact, as probably there are very few sequences that are long but not in the "correct" genetic region for us. Most of the sequences in the database will have been generated using primers that bind in the standard region. So, if this is tricky to fix, I think it is OK and we can manually check against that, specifically for Mollusca.
However, if this can be fixed without too much trouble, I suggest it would be helpful. As you will see in another thread, I am going to recommend complete deletion rather than pairwise deletion for calculating Mollusca genetic distances, so allowing shorter sequences through could skew the result. We can manually check for this scenario for Mollusca specifically if we don't want to touch the code at this late stage. But if the code can readily be made more general, this would save us time if we run the code again in the future, e.g. if we are fine-tuning additional parameters based upon new discoveries about the data or results, or if we are rerunning after checking for updated sequence data after review.
Thank you for considering this issue.
Cheers,
Sally
PS. Or, perhaps there is a better way, and the reference sequence could be included in alignment 2 and the length filtering done after that, prior to the final alignment? Any different ideas welcome!
PPS. Pardon me for this long chain! I should have looked at all of the echinoderm classes before posting. The diversity of genetic regions within COI and the diversity of primers used appears to be much higher for Echinodermata than for the other phyla (based upon info for four phyla: I have run Annelida, Cnidaria, and Echinodermata, and I looked at your Mollusca results).
We would lose a lot of data for echinoderms if we implement the solution that I proposed above: deleting sequences based on length a final time after trimming using the reference. I wanted to post now so that you don't work on this issue now. I am exploring other options for dealing with this... such as using partial deletion at the stage of calculating the distance matrix.
I will update again soon on this issue!
OK - So, unfortunately, I do not see an option for partial deletion in the distance calculator we are using. (This is an option in MEGA.) I was going to recommend partial deletion for Echinodermata with a setting such as 0.60 (i.e. positions with less than 60% data coverage would be deleted at the time of calculating the distance matrix).
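If we ever wanted to implement it ourselves, partial deletion is straightforward to sketch (hypothetical code; it assumes the alignment is held as a character matrix "aln" with rows as sequences and columns as sites):

```r
# Sketch of partial deletion at 60% site coverage: drop alignment
# columns where fewer than 60% of sequences have a nucleotide.
coverage <- colMeans(aln != "-")
alnPartial <- aln[, coverage >= 0.60]
```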
I suggest that we leave this issue alone. For most groups, there will be very very few short sequences that get through, based upon the filters that we do have in place. In echinoderms there will be a few.
What I will do is compare the results for echinoderms with pairwise vs complete deletion to make sure this doesn't have a large impact upon the conclusions. For all other groups, I think this will most likely be an extremely minor issue.
So, again, pardon the long chain with all these comments. However, at least you will see this issue I was considering and you can let me know if you have any alternative view compared to the conclusion I came to here.
Also, in light of my many subsequent comments, please do see further up this chain for my question about 2% vs. 20% of Arthropoda pairs. Thank you!
Hi Sally,
Sorry for the delay in responding (had to go and buy a turkey for Christmas).
Thanks for all of your work on this! Regarding the 2% vs. 20% of Arthropoda pairs, I just did a few tests on the Lepidoptera initial dataframe that I had saved. I ran a sequence length check with the original code:
sequenceLengths <- nchar(dfInitial$nucleotides)
sequenceLengthCheck <- which(sequenceLengths > 1000 | sequenceLengths < 620)
I ended up with 26,097 of 573,242 sequences being filtered out, or roughly 4.6%.
Then I tried with the new code:
sequenceLengths <- nchar(gsub("-", "", dfInitial$nucleotides))
sequenceLengthCheck <- which(sequenceLengths > 1000 | sequenceLengths < 640)
I ended up with 65,798 of 573,242 sequences being filtered out, or roughly 11.5%.
Then I tried the new method of counting nucleotides with the minimum sequence length of 620:
sequenceLengths <- nchar(gsub("-", "", dfInitial$nucleotides))
sequenceLengthCheck <- which(sequenceLengths > 1000 | sequenceLengths < 620)
This gave only a marginal increase over the original code: 26,136 sequences filtered, which still works out to around 4.6% of total sequences.
So the question is whether we are willing to exclude about 11% of sequences with a 640 bp minimum sequence length.
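For reference, the percentages quoted above can be reproduced directly from the counts:

```r
# Sanity check of the percentages quoted above
round(100 * 26097 / 573242, 1)  # 4.6  -- 620 bp filter
round(100 * 65798 / 573242, 1)  # 11.5 -- 640 bp filter
```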
Best Regards,
Matt
Adding to that, I can't seem to find any results I have on the effect minimum sequence length would have on pairing. I think when I did those tests it was with an earlier version of the script, so it's hard to say what effect the minimum sequence length would have with the current iteration. Hopefully what I mentioned above can help us make a decision.
Best Regards,
Matt
Hi Matt,
Thank you very much for this summary. I think that if 11% of the sequences are filtered out, the percentage of BINs filtered out would likely be smaller. At BIO, many insects are sequenced unidirectionally, but there is then a step for seeking bidirectional sequences for a few representatives per BIN. It's just that this work is in progress, so not all new BINs would be covered yet. Nevertheless, I predict that our percentage of BINs dropped would be lower than the percentage of sequences dropped.
So, I am OK with proceeding with the 640 bp filtering and 620 bp trimming approach. I think that overall our results will be more accurate: they will be based on longer sequences and a higher proportion of sequences with bidirectional sequencing coverage.
Sound OK?
Best wishes,
Sally
Sounds good, I'll proceed with the 640 and 620 approach.
Best Regards,
Matt
Great! Thanks Matt. I will close this issue. As usual, we will look out for additional issues that arise in the new analyses.
Best wishes,
Sally