Git Product home page Git Product logo

m-orton / evolutionary-rates-analysis-pipeline Goto Github PK

View Code? Open in Web Editor NEW
7.0 6.0 1.0 39.9 MB

The purpose of this repository is to develop software pipelines in R that can perform large scale phylogenetics comparisons of various taxa found on the Barcode of Life Database (BOLD) API.

License: GNU General Public License v3.0

R 100.00%
r molecular-rates phylogenetics barcode-index-number barcode

evolutionary-rates-analysis-pipeline's People

Contributors

m-orton avatar matt-orton avatar sadamowi avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

Forkers

jmay29

evolutionary-rates-analysis-pipeline's Issues

Question - Are "zero" differences in relative distances reflected?

Hi Matt,

I am wondering if the current pipeline is biasing the results a little bit towards "failures". I notice there is code for designating "successes" (defined as a directional result). Then, "failures" seem to be not successes.

Line 1261:

#Also making a vector for pairings that were not successes
failureOverall <- dfPairingResultsL1$inGroupPairing[-successOverall]

What about cases if there is absolutely no difference in the relative outgroup distance? I suggest those cases should be omitted priot to statistical testing rather than counted as failures. Are those accounted for somewhere? Perhaps that is addressed elsewhere, and I just overlooked that.

Thanks very much for checking.

Cheers,
Sally

Sally got an error on line 1610

Hi Matt,

I also got an error when plotting, using the block of code starting on line 1610.

plot_ly(dfPairingResultsL1L2, lat = medianLatMap.x, lon = medianLon.x, text = hover,

  •     color = as.ordered(inGroupPairing), colors = "Spectral", mode = "markers+lines", type = 'scattergeo') %>%
    
  • layout(title = paste0(mapTitle) , geo = mapLayout)

Error in plot_ly(dfPairingResultsL1L2, lat = medianLatMap.x, lon = medianLon.x, :
object 'medianLatMap.x' not found

plot_ly(dfPairingResultsL1L2, lat = medianLatMap.x, lon = medianLon.x, text = hover,

  •     color = class_name.x, mode = "markers", type = 'scattergeo') %>%
    
  • layout(title = paste0(mapTitle) , geo = mapLayout)

Error in plot_ly(dfPairingResultsL1L2, lat = medianLatMap.x, lon = medianLon.x, :
object 'medianLatMap.x' not found

Sally got an error on line 1532

Hi Matt,

I got the below error when running the block of code starting on line 1532. I got a plot for Polychaeta but no others to scroll through.

foreach(i=1:length(relativeDistClass)) %do%

  • (print(ggplot(relativeDistClass[[i]], aes(x = variable, y = value, color = sign))
  • foreach(i=1:length(relativeDistClass)) %do%

Error: unexpected symbol in:
" (print(ggplot(relativeDistClass[[i]], aes(x = variable, y = value, color = sign))
foreach"

(print(ggplot(relativeDistClass[[i]], aes(x = variable, y = value, color = sign))

  •      + geom_point(stat="identity", size = 2.5) 
    
  •      + theme(text = element_text(size=13), axis.text.x = element_text(face="bold",angle=90, vjust=1))
    
  •      + ggtitle(paste0(classTitle[i],"\n", pValBinomialTitle[i], "\n", pvalWilcoxonTitle[i])) 
    
  •      + labs(x="Pairing Number", y="Signed Relative OutGroup Distance", color = "sign")))
    

Cheers,
Sally

one more line needed for using muscle?

Hi Matt,

I found I needed to add this around line 140 in order to move forward.

biocLite("muscle")

I didn't want to change anything directly in the code, and so I created an issue. If you agree, then please do go ahead and add that. Then, I think we can close this minor issue.

Thanks!

Cheers,
Sally

Notes on Annelida branch

Just created a separate branch for Annelida. Purposely not merging to keep things separate. Not sure if the commit comments are showing up so Im posting here.

I think what I will do is make separate branches for each specific phyla so each phyla can be modified independently depending on specific changes that may need to be made for it.

Notes on changes made:
Reference sequence dataframe now has real reference sequences for Clitellata and Polychaeta
Minimum sequence length in filtering changed to 620 bp, maximum remains at 1000 and longer sequences will be trimmed down to reference length.
Symmetric trimming of the reference sequences - 19 bp on either end to a final sequence length of 620 for all sequences

Formatting changes to reflect that the analyses is being made at the class level instead of order level
Plot of relative outgroup distance now generated according to class
Pvalue dataframe now has total pvalues for binomial and wilcoxon as well as pvalues per class

Notes on Annelida analyses:
-seem to be running into some issues with the alignment for Polychaeta, some large gaps are present in the alignment
-the Clitellata alignment looks good though there is one indel apparent in the alignment

Sending results of Annelida shortly

Using Cloud Computing with RStudio

Hi Sally,

I think I found another way of doing the more computationally demanding taxa in case we cant use the Compute Canada resources. Sort of a back plan I guess.

Basically it involves using the cloud computing resources offered by Amazon Web Services. Specifically, it would be using one of their services called Elastic Compute Cloud. I found a useful guide with how to integrate this resource with a web browser based version of RStudio:
http://strimas.com/r/rstudio-cloud-1/

I managed to go through the steps and run a free instance of this service. I was able to test some of the script and it seems to work well. The only thing is that in order to use greater amounts of computational power, there is a small cost per hour. It would be something on the order of $1-2 per hour in order to get the computation we would need for the larger taxa I would guess.

Just thought it might be something to consider.

Best Regards,
Matt

Map Figures

Hi Sally, I thought I would make an issue for discussing the map figures.

I looked into that one Echinoderm pair that was in the middle of Africa. It turns out its because the BIN that lineage belongs to only has two specimens; one at 28 degrees and another at 3 degrees in latitude (which both are on the coast of Africa) but when you take the median you get a point in the middle of Africa. Smaller BIN's like this could pose a problem for mapping so I suggest we only show pairings with BINs which are larger in size (say maybe 10 specimens/records or larger?) such that the median is more representative. I also like your idea of showing the pairings with the largest latitudinal differences.

Let me know what you think,
Matt

Identify weird sequences

So, I was fiddling around with some code to identify sequences with vastly different numbers of gaps as Sally had suggested in another issue. I got some code working but I was wondering what would be a good "range" of gappiness? I was thinking something like this:

This will give the number of positions where internal gaps are found for each sequence.
internalGaps <- sapply(regmatches(dfInitial$nucleotides, gregexpr("[-+]", dfInitial$nucleotides)), length)

Add it to the dataframe so you can get a sense of which sequences are weird.
dfInitial$numberOfGaps <- internalGaps

Mean (or median) number of gaps?
meanGap <- mean(internalGaps)

Upper range of gappiness
extremeHighGap <- meanGap + 7

Lower range of gappiness
extremeLowGap <- meanGap - 7

We then loop through each sequence to see if the number of gaps deviates greatly from the mean. Any sequences with gaps greater than extremeHighGap or lower than extremeLowGap will be identified.
extremeSeqs <- foreach(i=1:nrow(dfInitial)) %do%
which(internalGaps[[i]] > extremeHighGap | internalGaps[[i]] < extremeLowGap)
extremeBins <- which(extremeSeqs > 0)

Subset out these extreme gappy/ungappy sequences so you can look at them.
dfExtreme <- dfInitial[extremeBins, ]

If you want to remove them from the original dataframe.
dfNormal <- dfInitial[-extremeBins, ]

Discussion - Running Cnidaria

Hi Matt,

Sorry for my delay in posting. I have been working on the project a fair bit recently but have also made some errors, such as forgetting to save the Workspace. I needed to re-run some things in order to be able to compare properly the different trimming lengths as well as alignment settings. In any case, I have now moved on to Cnidaria.

I have been encountering a few errors. I have been running the Annelida code from Github (most recently obtained today) but with the Cnidaria reference sequences inserted. I've also been using maxiters = 3 and diags = TRUE for all alignments. The alignments mostly look very good. I will try a different run with a different setting to see if it's possible to resolve one case, but I think overall the alignments look good and that close relatives are properly aligned with one another, which is the most important thing here.

However, I am getting some errors in some other parts of the code.

The first one is in lline 625 in the code (note these line numbers are after I made adjustments for Cnidaria, such as inputting the Cnidaria ref seqs. Line numbers would be slightly different compared to the Annelida branch, but within a couple of lines of the Github version.)

dnaStringSet4 <- foreach(i=1:nrow(dfRefSeq)) %do% subset(dnaStringSet4[[i]][-refSeqRemove[[i]]])

Error message:

Error in subset(dnaStringSet4[[i]][-refSeqRemove[[i]]]) :
task 4 failed - "subscript is a logical vector with out-of-bounds TRUE values"

Same message for similar code a couple of lines later.

Also, error at line 711.

dfPairingResultsL1L2$inGroupPairing <- rep(1:(nrow(dfPairingResultsL1L2)/2), each = 2)

error message:

Error in $<-.data.frame(*tmp*, "inGroupPairing", value = c(1L, 1L, :
replacement has 2320 rows, data has 2321

Also error after 716

dfPairingResultsL1L2 <- (dfPairingResultsL1L2[,c("inGroupPairing","record_id","bin_uri","values",
"inGroupDistx1.3","medianLatAbs","medianLatMap","latMin","latMax",
"binSize","phylum_taxID","phylum_name","class_taxID",
"class_name","order_taxID","order_name","family_taxID",
"family_name","subfamily_taxID","subfamily_name",
"genus_taxID","genus_name","species_taxID","species_name",
"nucleotides","ind","medianLon")])

Error message:

Error in [.data.frame(dfPairingResultsL1L2, , c("inGroupPairing", "record_id", :
undefined columns selected

and, further error messages are beyond these also.

I have thought of something to try (deleting those reference sequences associated with classes that didn't end up yielding data). I will keep you posted as to whether that works.

Cheers,
Sally

Question for Matt: Can efficiency of outputting alignments be improved?

Hi Matt,

Thank you for adding the code for outputting key interim alignments and the final trimmed alignments. Those lines of code have been very helpful for exploring the performance of the analysis.

This is not an urgent issue, and I think these extra steps are fine overall for these smaller phyla (for which we potentially expect more serious alignment problems). However, I'm wondering if there is a straightforward way to have the alignments exported without having to rerun them?

If there isn't a straightforward way to combine those steps, then don't worry about this. Thank you.

Cheers,
Sally

Make sure Sally and Matt get the same results before finalizing pipeline

Hi Matt,

After I ran through the script again with Annelida, using the workspace you sent me, I still got different results compared to yours. I'm not sure why that would be. Is there any random component still left in the pipeline? A large alignment may have a random component to it. Also, is it possible that I inadvertently inserted a mistake into the code during the editing of the comments?

I wanted to open an issue on this topic to be sure we remember to address it before calling the pipeline final. However, I suggest that we don't spend excessive time scrutinizing the previous interim results. I suggest that I should rerun the analysis after you finish coding the components we just discussed in the "alignment" thread. Then, perhaps we should both rerun the analysis using the revised code and compare? At that point, would you also please re-send me the workspace so that we are positive we are beginning from the same starting point?

Again, I need to leave this aside, but shoudl be able to resume work on this project on Friday. I look forward to completing this.

Cheers,
Sally

Additional package needed? colorspace

Hi Matt,

I am working with the Annelida branch. It seemed that RStudio expected me to install the package "colorspace" around line 157 of the code. Does that make sense? Should the command to install that package be added to the code?

Cheers,
Sally

Sally task - check p values

I am opening this issue to remind me to check the binomial and Wilcoxon tests, as requested. I will close this issue after I have done that.

Median latitude

I was wondering if median latitude in this pipeline could be determined prior to running some of the filters (i.e. removing sequences with large N/gap content, short/long sequences)? This could increase the amount of data that is used to determine the median latitude of each BIN.
I think you and I discussed this just prior to break, Sally, as I have it on my "to-do" list!

Update to Pipeline and New Branch

Hi Sally and Jacqueline,

This is more of an update rather than an issue so I can close this once everyone has seen it.

Just letting you know I have updated all of the branches so that they can now run single classes and orders at a time.

I also created a branch that is intended to be run with any of the classes/orders in Arthropoda but could be used as a template for other very large taxa. In running Lepidoptera, I've had to make some minor adjustments to the pipeline in order to minimize memory load, reduce workspace sizes (since saving workspaces has been vital in case of crashes) and improve processing speed (since some steps of the pipeline can be very time consuming). This version of the pipeline should allow very large datasets to be run more smoothly. Although, its probably not needed for the smaller phyla that can be run locally.

Best Regards,
Matt

Alignment of mite group Mesostigmata

The preliminary alignment (Jan 1, 2017) of Mesostigmata contained large gaps.

Sally to check into this:

Is the beginning of the alignment correct?

Is there any information relevant for this in Young and Hebert 2015?

Sally tasks for next week (note on Dec 7th)

Notes about my tasks:

  1. update R (to very very secure dishes)
  2. rerun Annelida script to ensure everything works smoothly with new R version (and with the newest updates to the script from Matt, revisions made Dec. 7th)
  3. compare results to previous R version I ran (sincere pumpkin patch) to assess consistency
  4. at that point, if all seems good, contact Jacqueline to run Annelida script on an additional taxon and to evaluate interim steps (remember to mention R versioning issue)
  5. manually generate a script file with the reference sequences for Cnidaria and Echinodermata
  6. run those phyla
  7. test alignment settings (number of iterations - can we reduce?)
  8. assess consistency of results across trimming lengths (600 bp, 620 bp, 640 bp)
  9. more forward with the remaining three phyla (Mollusca, Chordata, Echinodermata)
  10. continue work on manuscript prose

(For step #9, I was wondering if you'd be willing to try to run Mollusca at the class level on your better computer? If that's possible, I think that would be a better choice for that phylum. However, if we do end up needing to go with order, we would lose some unidentified sequences, or sequences identified using an alternative taxonomic hierarchy, but I think that isn't catastrophic for the project. I contacted Compute Canada, but they indicated they are still awaiting approval from Guelph for my account. Hopefully that will be sorted out soon and so we would hopefully have access to more computing resources if needed.)

Let me know if I missed something!

Cheers,
Sally

Error message when Sally running Annelida branch - line 869 - Nov 28

Hi Matt,

I am running the Annelida branch, using the Annelida workspace you sent me. I got the following error when I ran line 869 of the code.

noOutGroupCheck <- foreach(i=1:length(noOutGroupCheck)) %do% which(dfPairingResultsL1$bin_uri == noOutGroupCheck[[i]])

Error in which(dfPairingResultsL1$bin_uri == noOutGroupCheck[[i]]) :
task 1 failed - "subscript out of bounds"

Have you seen this error before? Thanks.

Cheers,
Sally

Model test

I was wondering if the function "modelTest" from the package "phangorn" was worth exploring for model selection in some parts of the pipeline. I had used it previously in a script to compare different models of DNA evolution for a phyDat object (a multiple sequence alignment, in our case) for building a ML tree. Here is a link to the documentation:

https://www.rdocumentation.org/packages/phangorn/versions/2.0.4/topics/modelTest

Although I just ran modelTest in RStudio on my data and it crashed ( 👎 ), so I will experiment a bit more with it!

Matt - task list

Hi Matt,

To help us stay organized, I've also prepared a task list for you. Please feel free to edit!

  1. Work on methods section:
    -consider my comments from the last round
    -add the divergent sequence ejector step
    -briefly explain the "rules" for dropping sequences on the basis of generating indels in the alignment
    -explain the subsetting of large insect orders by geographic region
    -return the draft to me for the round round of revisions

  2. Cross check the separate geographic output files for the large insect orders. Are there ingroup BINs that appear in multiple regions? If this occurs frequently, then I think we should discuss potential solutions. We could, for example, consider retaining just the pair with the smallest ingroup distance. I hope that this does not occur frequently. We would mention in the methods section that we checked for this.

  3. Have a look at the initial workspaces and consider if we can run the pipeline on any additional markers (for select taxa).

  4. Can legends be made larger in the map figures?

  5. Prepare the final map figure (after discussion).

  6. Collaborate in writing up the discussion.

Additional package needed

Hi Matt,

I am redoing everything for the Annelida pipeline after having installed R version very very secure dishes.

I found that before line 172:

library(plotly)

I had to first type:

install.packages("jsonlite")
library(jsonlite)

This seems to be required before plotly would load happily.

Cheers,
Sally

Error message when Sally running Annelida branch - line 916

Hi Matt,

I also got an error message when running line 916, using the Annelida branch and the Annelida workspace you sent me.

dfPairingResultsL1$latDelta <- latDelta

Error in $<-.data.frame(*tmp*, "latDelta", value = list(numeric(0), :
replacement has 77 rows, data has 1

Cheers,
Sally

Sister vs. Phylo pipeline comparison

Hi Jacqueline,

Matt has now completed analysis for the majority of the taxa using the sister pipeline. As we previously discussed, it would be very helpful if you would please run a few groups using your phylo pipeline for comparison.

I'd like to suggest to start by separately running Cypriniformes and Perciformes. These are both large orders of fish with a lot of barcode records. However, by the end alignment ("finaltrim"), there were 1100-1200 sequences left, and so I think these should run in a reasonable amount of time in terms of the phylogenetic analysis. Trying to run a larger group, such as the entire class, would likely be difficult. Also, these are large groups and so I think would be good for us in terms of comparing the pipelines. If these run successfully, then we might discuss running a couple of other groups (e.g. Echinodermata, which had significant results in the sister pipeline).

If possible, I'd like to suggest to use Matt's end workspaces and the "finaltrim" alignments as your starting point. This would give the same input dataset for comparison. Does that work? These files are within the taxon-specific folders under "Chordata" within the "Results" folder.

For these orders, you could use the reference sequence from the opposite order to root the tree. The reference sequences are in the "Reference sequences" folder.

Please let me know if you have any comments about these suggested plans (and Matt too). Thank you very much.

Cheers,
Sally

Jacqueline got an error line 569 of Annelida branch

So I got this error when running line 571 of the Annelida branch:

dnaStringSet3 <- foreach(i=1:nrow(dfRefSeq)) %do% DNAStringSet(alignmentSequencesPlusRef[[i]])
Error in DNAStringSet(alignmentSequencesPlusRef[[i]]) :
task 1 failed - "key 49 (char '1') not in lookup table"

I believe this happens when only using 1 class, as it has troubles with the code on line 565:

alignmentSequencesPlusRef <- foreach(i=1:nrow(dfRefSeq)) %do% append(classSequences[[i]],alignmentRef[[i]])

Would an if statement that checks for the length of classSequences be appropriate before creating alignmentSequencesPlusRef?

Alignment method selection - Annelida branch

Hi Matt,

I created this issue for us to discuss the alignment algorithm choice, as I remembered I didn't want to bury discussions in email. So, I will try to use the issue tracker more judiciously! I also got a lot of gaps in Polychaeta alignment, as you mentioned. I am reading up on the different alignment options and will try to see if there is an option that would give us a better alignment. ClustalOmega seems to be only for amino acid sequences. Ideally, we will want to consider the amino acid translation when aligning the nucleotides.

If you have any thoughts to share based upon any other alignment options you have tried, please let me know.

Cheers,
Sally

Decision needed: Selecting number of alignment iterations

Issue: we need to decide on the specific settings for running muscle.

Pasting comments on this specific issue from Matt:

"Hi Sally, just to add to the email I sent about the script changes and the new Muscle package. The package has a few parameters that can be adjusted.

Documentation of muscle algorithm with all parameters can be found here:
http://www.drive5.com/muscle/muscle_userguide3.8.html
If you go to section 5 it will give a list of all parameters with all defaults.

I think there are a few that may become important to our analysis:
diags = TRUE - parameter will enhance the speed of the algorithm with a potential loss to accuracy, this may become important for the larger insect taxa possibly since the muscle algorithm takes a while to process. I guess the question is are we willing to sacrifice accuracy for decreased computation time? Or is this irrelevant since we may be able to use the compute canada resources?

gapopen - parameter will determine what level of stringency to assign to gaps in the alignment. We may want to consider changing this depending on the taxa being run. For instance really gappy alignments (like Polychaeta for example) we may want to consider increasing the gap open penalty so it penalizes gaps more heavily.

maxiters - parameter will determine how many iterations the muscle algorithm goes through. Apparently the default is 16 but they suggest in the documentation of Muscle that for large alignments, it is better to restrict the number of iterations to 2. They say for large alignments:

"If you have a large number of sequences (a few thousand), or they are very long, then the default settings of may be too slow for practical use. A good compromise between speed and accuracy is to run just the first two iterations of the algorithm. This is done by the option –maxiters 2, as in the following example."

"forgot to add the example they use for maxiters is for a command line not for the muscle package. In the muscle package it would be muscle(dnastringset, maxiters = 2)."

Question about filtering by sequence length

Hi Matt,

I am running Annelida again to further explore and finetune the alignment parameters, specifically trying -3000 gapopen in case that might be a general parameter that is useful for us.

In the Clitellata alignment, I notice that there are some sequences shorter than 620 bp, which have terminal gaps. I am wondering if the length filter (lines 277-279) counts all characters or just nucleotides? Would you mind having a look?

I suggest we would want sequences of at least 620 bp in length. If we implement this filter, then the sequences will be longer than this most of the time, which is good, and then we'd be trimming down. i.e. The sequence read would have gone right through the end of the portion we are using, yielding higher quality data. I noticed that a few of the shorter sequences have an unusual amino acid at the end, suggesting a poorer quality of sequence. For example, possibly a base was added or dropped, yielding a reading frame shift. It's difficult to detect that or for a gap to be added when this is very near the end of the sequence. Therefore, I'd like to suggest that all sequences be at least 620 bp in length, counting only nucleotides.

Thanks very much for having a look.

Cheers,
Sally

New Results Thread

Making a thread to post updates on results.

Chilopoda results are now up on dropbox. Alignment looked great, no gaps or issues. Chilopods only had one pairing but it was a really small dataset.

New Mollusca Results

Hi Sally,

Just letting you know I am currently uploading the new Mollusca results to the shared folder in dropobox. The results are from using the alignment parameters you suggested (maxiters = 3, diags = TRUE, gapopen =-3000) and pairwise deletion set to FALSE for the final pairwise distance matrix. The alignments look good overall though there are still a few gaps I'm noticing in Bivalvia and Gastropoda. Though I think its still an improvement from the second runthrough.

The number of pairings has increased slightly with this runthrough however I am getting more pseudoreplicates than before.

Best Regards,
Matt

Error on line 1572 of Annelida branch

So, I got this error on line 1572:

pValBinomialTitle <- foreach(i=1:length(pvalBinomialClass)) %do%
paste("Binomial Test p-value:", round(pvalBinomialClass[i], digits = 4))
Error in paste("Binomial Test p-value:", round(pvalBinomialClass[i], digits = 4)) :
task 1 failed - "non-numeric argument to mathematical function"

I seemed to have fixed it by changing pvalBinomialClass[i] to pvalBinomialClass[[i]] as pvalBinomialClass[i] is of the list class and pvalBinomialClass[[i]] is of numeric class. Please let me know if you guys come across the same error.

Sally's next tasks

Hi Matt,

Here is my list of planned next steps.

HIGHEST PRIORITY:

  1. Select group for cloud test for Arthropoda and generate ref seqs for that group plus Lepidoptera.

  2. Select ref seqs for Arthropoda and Chordata.

NEXT PRIORITIES:

  1. Verify the Wilcoxon test is running correctly, as requested by Matt.

  2. Try Matt's suggestions for resolving plotting issues.

  3. Focus on the draft manuscript next, specifically:
    a) complete draft of methods section (be sure to include software version info)
    b) type up more detailed analysis information and justification for settings as a supplementary file
    c) prepare a results table, including preliminary results (and placeholder spots for those taxa remaining to be run)

AFTER THAT. Sensitivity analyses and exploration to add further justification for choices:

  1. Verify for Echinodermata that pairwise vs complete deletion make a minimal difference to the results. (i.e. I hypothesize that the molecular rates in the front section with substantial missing data in echinoderms is similar to the molecular rate in the remainder of the barcode region of COI.) We have currently settled upon pairwise deletion as our default for most of our phyla.

  2. Explore impact of adding a gamma parameter upon the distance matrices and results.

  3. Explore whether altering the outgroup minimum distance (to 1.4 or 1.5) makes any difference.

Please do let me know if I've overlooked anything!

Cheers,
Sally

Very minor edits!

Hello!! These are some minor spelling errors I came across and some thoughts I had about certain parts of the code:

Line 205: I believe "markercode" is missing from this line (after 'nucleotides')

Line 360: Should 'dfInitial' instead be 'dfCentroid' in this comment?

Line 456: I'm just wondering if "taxa" should instead be changed to "class" in dfRefSeq? Just so the user is clear what should be entered here? Just a thought!

Line 448: "approprate" spelling error

Line 455: Should there be a note around this part of the code that tells the user that all of the ref sequences in dfRefSeq should be the same length (i.e. 658 bp) for line 465 to work on each sequence equally? What if ref sequences of different lengths are used?

Line 479: Change 'allseq dataframe' to dfAllSeq in this comment? Just for consistency purposes.

Line 486: "closed" should be "closely"

Line 490: Change 'taxalist' to taxaListComplete in this comment?

Line 523: Weird, this link brings me to https://mran.microsoft.com/.

Line 597: "consist" should be "consistent"

Line 743: "intial" spelling error

Line 753: "pairingresults" to "dfPairingResultsL1L2" in this comment?

Line 889: "Fist" spelling error

Line 940: "an" to "a"

Line 1324: "realtiveDistOverall" spelling error

Jacqueline will run the code on fish

Hi Matt,

I have spoken with Jacqueline, and once we feel the Annelida pipeline is fairly stable, we can let her know. I think we are almost there!

Then, she will run the code on fish and check the interim steps. I think this would be really helpful to have another person read over and run the code and also check what it is doing. She is much more advanced in R coding than I am. Once she verifies performance, I will close this issue.

Cheers,
Sally

Discussion on running the phylum Mollusca

Dear Matt,

Through email (Dec. 8th), I just sent you an updated Excel file that contains reference sequences for Mollusca. As we previously discussed for other phyla, I only selected references for those classes that had candidate pairs according to your preliminary analysis.

Note that for three of the classes, the sequences are 658 bp, as typical for many animals. For two classes, the sequences are 661 bp. These sequences are amplified by primers that bind in the typical "Folmer" binding region of COI. The Folmer primers and many slight variants of those primers bind there. Some molluscs truly have amino acid indels (insertions or deletions) compared to the most common length for this gene region. Note the difference is a multiple of 3 nucleotides, reflecting one amino acid difference. I think the 661 bp sequences are real biological sequences for the barcode region. I suggest to keep the end trimming to the same number of nucleotides as the 658 bp sequences. This will keep the same gene region for comparison across taxonomic groups and will also keep the same amount of trimming in terms of number of nucleotides. The expectation would that there would be a slight length different in the final alignments among the classes. But I think that is OK. Evolutionarily, that is the same gene region being compared.

Depending upon how much time you have, you could either try running this in the near future, and possibly playing with the alignment settings (e.g. maxiters 2), to see how that influences the mollusc alignments. Or, you could wait for me to get back to you about my findings for the other phyla regarding alignment settings. I have a hunch (to be confirmed!) that we may be able to get away with those less stringent settings. COI is protein-coding and is a relatively conserved marker. Alignment settings are much more important and affect the result much more when non-coding regions are being aligned or when very evolutionarily distant taxa or ancient gene duplications are being investigated.

Another note ... most of the classes have far fewer than our original sample size cut-off of BINs for analyzing in one alignment (fewer than 2000 BINs). The only exception within Mollusca is the class Gastropoda, with ~11.5K BINs. So, you could consider analyzing all the other classes first, just subtracting Gastropoda from the dataset, if you don't want the analysis to get stuck on preliminary trial runs.

Let me know how it goes! Thank you very much for running this phylum.

Best wishes,
Sally

Error message when Sally running line 945 - Annelida branch

Hi again Matt,

I am recording the below error message for our information. However, I will verify this by running everything again. Perhaps I skipped a line by accident at some point.

dfPairingResultsL1 <- (dfPairingResultsL1[,c("inGroupPairing","associatedInGroupBin","inGroupDist","inGroupDistx1.3",

  •                                          "medianLatAbs.x","latDelta","latMin.x","latMax.x","record_id.x",
    
  •                                          "binSize.x","phylum_taxID.x","phylum_name.x","class_taxID.x",
    
  •                                          "class_name.x","order_taxID.x","order_name.x","family_taxID.x",
    
  •                                          "family_name.x","subfamily_taxID.x","subfamily_name.x",
    
  •                                          "genus_taxID.x","genus_name.x","species_taxID.x","species_name.x",
    
  •                                          "trimmedNucleotides.x","medianLatMap.x","medianLon.x",
    
  •                                          "outGroupBin","outGroupDistance","record_id.y","binSize.y",
    
  •                                          "phylum_taxID.y","phylum_name.y","class_taxID.y",
    
  •                                          "class_name.y","order_taxID.y","order_name.y","family_taxID.y",
    
  •                                          "family_name.y","subfamily_taxID.y","subfamily_name.y",
    
  •                                          "genus_taxID.y","genus_name.y","species_taxID.y","species_name.y",
    
  •                                          "trimmedNucleotides.y")])
    

Error in [.data.frame(dfPairingResultsL1, , c("inGroupPairing", "associatedInGroupBin", :
undefined columns selected

Ideas to consider for V2

Making a separate thread for ideas and code sections that could be implemented in version 2 (V2) of the pipeline.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.