green-striped-gecko / dartr Goto Github PK

Importing and Analysing DArT type snp and silicodart data

License: GNU General Public License v3.0

R 100.00%

dartr's Issues

Package plotly not loaded when dartR is loaded.

Create test examples script

Automate the test examples script, to ensure all examples work with testset.gl

gl2related deleted

Deleted gl2related.

It was not working well anyway, and installation of the whole package was stopped because of it.

Now deprecated until a better version exists.

Bug in gl.filter.callrate -- subscript out of bounds

gl.filter.callrate(testset.gl, method="ind", t=0.95)
Reporting for a genlight object
Note: Missing values most commonly arise from restriction site mutation.
Initial no. of individuals = 250

Error in x@gen[[1]] : subscript out of bounds

Filter for missing data per population

Hi Arthur and Bernd,

I am trying to use the DIYABC software and they require an input file which has been filtered for loci with missing values per population.

The error message I received was: "Loci 314 in population Bredbo has only missing values. This is not allowed. Please remove this locus from your data file."

Using dartR, I had filtered my dataset by call rate for loci and individuals but I was wondering if there is a way I can filter for call rate based on populations in dartR. If not, could you please suggest a possible workaround?

Thank you,

Yael
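
One possible workaround, sketched in base R on a toy matrix rather than a real genlight object: compute the call rate of every locus within every population and drop loci that have no calls in at least one population. The objects geno, pop and callrate_by_pop are hypothetical names for illustration, not part of dartR.

```r
# Toy genotype matrix: rows = individuals, columns = loci (0/1/2, NA = missing).
geno <- matrix(
  c(0, 1, NA,
    2, 1, NA,
    0, NA, 1,
    1, 2, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("loc", 1:3))
)
pop <- factor(c("Bredbo", "Bredbo", "Other", "Other"))

# Call rate per locus within each population (populations x loci).
callrate_by_pop <- apply(geno, 2, function(g) tapply(!is.na(g), pop, mean))

# Keep only loci that have at least one call in every population.
keep <- apply(callrate_by_pop > 0, 2, all)
geno_filtered <- geno[, keep, drop = FALSE]
```

With a genlight object, as.matrix(gl) and pop(gl) from adegenet should supply the matrix and the population factor, and gl[, keep] should subset the loci; any locus metadata stored in @other may need to be subset separately.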

gl2fasta to work with trimmedsequence and snppos only, because some DArT files do not have alleleseq

Bug in gl2fasta -- output fasta records differ in length

Output fastA records differ in length. Rarely, records are one base short. Suggest looking at script behaviour when SNPosition = 1 or SNPosition is the last base of the sequence tag. @green-striped-gecko
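
A minimal sketch of the two edge cases mentioned above, assuming the SNP position is zero-based (an assumption to check against the DArT file in use). The helper set_base is hypothetical and is not the gl2fasta code; it only shows that replacing a base in place keeps the record length constant at the first and last positions.

```r
# Replace the base at a zero-based position without changing sequence length.
set_base <- function(seq, pos0, base) {
  pos1 <- pos0 + 1L                       # convert to R's one-based indexing
  stopifnot(pos1 >= 1L, pos1 <= nchar(seq), nchar(base) == 1L)
  substr(seq, pos1, pos1) <- base         # in-place replacement, same length
  seq
}

tag <- "ACGT"
first_base <- set_base(tag, 0L, "T")      # SNP at the first base
last_base  <- set_base(tag, 3L, "A")      # SNP at the last base
```

The stopifnot() guard also catches positions that fall beyond the end of the trimmed sequence, which would otherwise be skipped silently.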

Convert pipe to underscore in terminal labels for gl2fastA

gl.pcoa.plot does not work without pop specified

even if only individuals are used for labels

gl.filter.dups -- suggested change of name.

Change the name to gl.filter.secondaries to avoid confusion re duplicated sequence tags.

Pierre Feutry: Error installing

Error message below.
Any idea how to fix this? Cheers
Pierre

installing source package ‘dartR’ ...
** R
** data
*** moving datasets to lazyload DB
** inst
** preparing package for lazy loading
Warning: namespace ‘DBI’ is not available and has been replaced
by .GlobalEnv when processing object ‘testset.gl’
Warning: namespace ‘DBI’ is not available and has been replaced
by .GlobalEnv when processing object ‘testset.gl’
Warning: namespace ‘DBI’ is not available and has been replaced
by .GlobalEnv when processing object ‘testset.gl’
Warning: namespace ‘DBI’ is not available and has been replaced
by .GlobalEnv when processing object ‘testset.gl’
Warning: namespace ‘DBI’ is not available and has been replaced
by .GlobalEnv when processing object ‘testset.gl’
Error : .onLoad failed in loadNamespace() for 'rgl', details:
call: dyn.load(file, DLLpath = DLLpath, ...)
error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.4/Resources/library/rgl/libs/rgl.so':
dlopen(/Library/Frameworks/R.framework/Versions/3.4/Resources/library/rgl/libs/rgl.so, 6): Library not loaded: /opt/X11/lib/libGLU.1.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rgl/libs/rgl.so
Reason: image not found
ERROR: lazy loading failed for package ‘dartR’
removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/dartR’
Installation failed: Command failed (1)

Vignette -- add hwe scripts to the vignette

gl.report.monomorphs value incorrect

I think I may have found a bug in gl.report.monomorphs.
When running gl.report.monomorphs after running gl.filter.monomorphs on a genlight object the report indicates that there are still monomorphs present in the object.
Looking at the code in gl.report.monomorphs, it appears that when a locus has only 1 (heterozygote) values, or a combination of 1 and NA (i.e. no 0 or 2 present), it is counted as monomorphic:
line 50: c[i] <- all(xmat[,i]==1,na.rm=TRUE)
which is different to the code in gl.filter.monomorphs used to determine monomorphs:
line 46: a[i] <- all(xmat[,i]==0,na.rm=TRUE) || all(xmat[,i]==2,na.rm=TRUE)
I have used a genlight object and excel to test this and can provide files if required.
Regards
Rob
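
The discrepancy can be reproduced on a toy matrix using the two tests quoted above. This is only a sketch in base R; the matrix xmat and its column names are made up for illustration.

```r
# Toy matrix: rows = individuals, columns = loci (0/1/2, NA = missing).
xmat <- matrix(
  c(0,  1,  0,
    0,  1,  2,
    NA, 1,  0,
    0,  NA, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("all_ref", "all_het", "polymorphic"))
)

# Test quoted from gl.filter.monomorphs: all 0 or all 2.
is_mono_filter <- apply(xmat, 2, function(g)
  all(g == 0, na.rm = TRUE) || all(g == 2, na.rm = TRUE))

# Additional test quoted from gl.report.monomorphs: all 1.
is_all_het <- apply(xmat, 2, function(g) all(g == 1, na.rm = TRUE))
```

The all_het locus is flagged only by the second test, so a report that counts it will still show monomorphs after the filter has run.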

adding color and shape options to gl.pcoa.plot

Given how often questions about modifying colors and shapes appear in the dartR Google group, it is probably worth adding these options to gl.pcoa.plot().

There are three ways to do this with varying levels of flexibility and implementation cost.

1. Add scale_color/shape_manual to the current function. I did this for a previous project and the code is here: https://github.com/Maschette/redartR/blob/master/plot_pcoa.R. It adds the options col and shape to the current function so these can be changed. It also sets the default theme to theme_bw because that's what I needed for a publication.
2. Add a new function which returns a ggplot object using ggbuilder, which can be edited to modify a range of the ggplot settings. @raymondben had a first pass of this which I forked here: https://github.com/Maschette/redartR
3. Rewrite the function implementing the ggplot2 style guide (https://ggplot2.tidyverse.org/dev/articles/ggplot2-in-packages.html#referring-to-ggplot2-functions) in combination with ggbuilder to give a more flexible function. The advantage of this is that people would be able to implement things such as gl.pcoa.plot(glPca, gl)+scale_color_manual(values=...) to change things.

My recommendation for the short term would be implement 1 and explore 2-3.

Olly Bolly: Add a coverage filter

Fixed difference analysis -- give options for the content of upper and lower matrices. Provide the SNP sample sizes for which the percentages are calculated. Requested by Kylie Ewart.

Filter heterozygosity

Hi DARTr team,

Are we able to exclude loci based on heterozygosity?

How can we estimate heterozygosity across individuals per population?

Thanks, Jenny
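
A rough sketch of both calculations in base R, assuming genotypes are coded 0/1/2 with 1 as the heterozygote and NA as missing. The names geno, pop and the 0.75 threshold are hypothetical and chosen only for illustration.

```r
# Toy genotype matrix: rows = individuals, columns = loci.
geno <- matrix(
  c(0, 1, 2,
    1, 1, NA,
    0, 1, 2,
    2, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("loc", 1:3))
)
pop <- factor(c("A", "A", "B", "B"))

# Observed heterozygosity per locus within each population (populations x loci).
ho_by_pop <- apply(geno, 2, function(g) tapply(g == 1, pop, mean, na.rm = TRUE))

# Mean observed heterozygosity per population, averaged over loci.
ho_pop <- rowMeans(ho_by_pop, na.rm = TRUE)

# Observed heterozygosity per locus over all individuals, and a filter on it.
ho_locus <- colMeans(geno == 1, na.rm = TRUE)
keep <- ho_locus <= 0.75
geno_filtered <- geno[, keep, drop = FALSE]
```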

Remove spaces from OTU names on input from DArT

Spaces in OTU names present a range of problems, particularly trailing spaces. We need to remove all spaces from OTU names at the point when the data are input from DArT.
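
As a small illustration of the clean-up step, here is a base R sketch on a made-up vector of OTU names. It is not the dartR input code, only one way the whitespace could be stripped.

```r
# Hypothetical OTU names with leading, trailing and internal spaces.
otu <- c("Emydura macquarii ", " Chelodina longicollis", "Elseya")

# Remove every whitespace character, including trailing ones.
otu_clean <- gsub("[[:space:]]+", "", otu)
```

If internal spaces should be kept readable, replacing them with an underscore after trimws() would be an alternative.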

Vignette: Add in a section on reporting and filtering on Linkage Disequilibrium

The vignette does not currently have a section on analysis of linkage disequilibrium. It needs to provide advice on this issue (single population, sample size etc.), then explain how to report on departure from linkage equilibrium (with and without Bonferroni correction), and then how to filter out all but one SNP in a linkage group.

Peter Unmack: Identify locus metadata provided by DArT that become redundant on deletion of one or more populations, and amend scripts accordingly

Olly Bolly: Add MAF report and filter scripts.

Minimum allele frequency

Add ctb names

Olly Berry, Jason Bragg, Peter J. Unmack, Aaron T Adamack

gl.filter.hamming removes information from gl@other$loc.metrics (rdepth missing)

Hi,

I noticed that the gl.filter.hamming command removes information from gl@other$loc.metrics, when I went through the analysis.
I wanted to look at the final read depth of the SNPs that I had retained and compare it to the average read depth at the beginning, but I noticed that it was missing from the loc metrics after the hamming filter step. Basically it's replaced with the MAF in the loc metrics.

I assume this information is just dropped then? Is there some way to retain it when using the command?

Thanks

Script: Filter SNPs where snposition > TrimmedSequence length

dartR - Installation problems using Ubuntu

Hi,
I have recently started to use Linux (I'm still not very familiar with it) and I am having problems installing dartR. I'm using RStudio and when I try to follow the recommended steps I receive the following message:

"package dartR is not available (for R version 3.2.3)"

I checked for updates of RStudio and it says that I am using the newest version. Therefore I am not sure how to fix this problem. Do you have any recommendations?

Thanks!!

ind.metrics File -- independent of order of OTUs

Currently the order of the individuals in the ind.metrics file needs to match the order in the DArT input file. Need to make it so that the order of the individuals in the ind.metrics file does not matter, while retaining the checks for all individuals in the DArT file present in the ind.metrics file, and vice versa.
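
The change described above could look roughly like the following base R sketch: check that both ID sets agree, then reorder the metadata with match(). The objects dart_ids and ind_metrics are toy stand-ins, not the structures used inside dartR.

```r
# Individual IDs in the order they appear in the DArT file.
dart_ids <- c("ind3", "ind1", "ind2")

# Individual metadata supplied in a different order.
ind_metrics <- data.frame(
  id  = c("ind1", "ind2", "ind3"),
  pop = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

# Retain the checks: every individual must be present in both files.
missing_from_metrics <- setdiff(dart_ids, ind_metrics$id)
missing_from_dart    <- setdiff(ind_metrics$id, dart_ids)
stopifnot(length(missing_from_metrics) == 0, length(missing_from_dart) == 0)

# Reorder the metadata rows to follow the DArT file order.
ind_metrics_sorted <- ind_metrics[match(dart_ids, ind_metrics$id), ]
```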

Add trimmed sequences to testset.gl

testset.gl does not have trimmed sequences in it, so some of the examples will fall over. May need to recreate the testset.gl.

gl.filter.repavg -- not working

Hi DARTR team,

The tutorial says "CloneID is essential (with its very special format), and dartR scripts for loading your data sets will terminate with an error message if this is not present."

I have DART data without 'CloneID' field. The data loaded OK into a genlight object. The console said
.
.
Try to add covariate file: xxx_2018_metadata.csv .
Ids of covariate file (at least a subset of) are matching!
Found 147 matching ids out of 147 ids provided in the covariate file. Subsetting snps now!.
Added pop factor.
Please note:there is no lat column
Please note:there is no lon column
Added id to the other$ind.metrics slot.
Added pop to the other$ind.metrics slot.
Warning message:
In .local(.Object, ...) :
Miss-formed strings in loc.all (must be e.g. 'c/g') - storing this argument in @other.

Could the lack of CloneID be causing gl.filter.repavg to return "[1] NA" when I request number of loci (nLoc) after using filter? The number of individuals (nInd) is correct after filtering.

I duplicated the AlleleID column and called it CloneID in the DART.csv but this didn't allow the filter to proceed.

Any ideas?

Thanks, Jenny

gl.filter.hamming and related -- skip comparisons of fragments of different length

Homologous fragments are expected to be the same length, so the computations can be sped up by only comparing fragments that are the same length. Implement this change to the scripts.
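
A minimal sketch of the idea, in base R on made-up sequences: group the fragments by length first, then compute Hamming distances only within each group. The function hamming and the vector seqs are hypothetical and do not reproduce the gl.filter.hamming code.

```r
# Hamming distance between two strings of equal length.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

seqs <- c(s1 = "ACGTAC", s2 = "ACGTTA", s3 = "ACGT", s4 = "ACGA")

# Build candidate pairs only among fragments of the same length.
by_length <- split(names(seqs), nchar(seqs))
pairs <- do.call(rbind, lapply(by_length, function(ids) {
  if (length(ids) < 2) return(NULL)   # nothing to compare in this group
  t(combn(ids, 2))                    # one row per pair
}))

# Distances for the retained pairs only.
dists <- apply(pairs, 1, function(p) hamming(seqs[[p[1]]], seqs[[p[2]]]))
```

Here only 2 of the 6 possible pairs are compared, which is where the speed-up would come from on a real data set.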

gl.collapse.recursive -- needs to list populations that remain unamalgamated.

Currently the reporting includes populations that are amalgamated on the basis of no fixed differences, but it does not list the populations that do not amalgamate. The end report needs to list all surviving OTUs.

Outliers for downstream analysis

Hi,

I've run the gl.outflank and was able to produce a report on the outliers in my dataset of 40,746 SNPs. 303 loci were flagged as outliers. I'd like to now subset my data into outliers and non-outliers to run downstream analyses (e.g. PCA, etc.).

However, I've been unable to figure out how to pull out those 303 loci to run downstream analyses. Is this function already available, or do you have recommendations on how to do it? If it can't be done to the gl object, I'm assuming there is a way to add to the info of a vcf file, but it is beyond my abilities. Any advice would be much appreciated.

Please let me know if I need to clarify anything or provide further information. Thanks in advance for your help!
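
One way this might be done, sketched in base R with a toy matrix standing in for the genlight object. The results table and its column names LocusName and OutlierFlag are assumptions about what the outlier report contains, so check them against the actual output.

```r
# Toy genotype matrix: rows = individuals, columns = loci.
geno <- matrix(
  c(0, 1, 2, 0,
    1, 1, 0, 2,
    2, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("ind", 1:3), paste0("loc", 1:4))
)

# Hypothetical outlier report: one row per locus with a logical flag.
outflank_results <- data.frame(
  LocusName   = colnames(geno),
  OutlierFlag = c(FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Names of flagged loci (unique() guards against repeated rows per locus).
outlier_loci <- unique(outflank_results$LocusName[outflank_results$OutlierFlag])

# Logical index over loci, then split into the two data sets.
is_outlier    <- colnames(geno) %in% outlier_loci
geno_outliers <- geno[, is_outlier, drop = FALSE]
geno_neutral  <- geno[, !is_outlier, drop = FALSE]
```

For a genlight object the same index could be built from locNames(gl) and applied as gl[, is_outlier] and gl[, !is_outlier], provided the locus names in the report match those in the object.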

Renee Catullo: Add an option 5 to gl2fasta

Most phylogenetic methods that analyse SNPs (e.g. IQtree/SNAPP) function better if there are no constant sites. These programs define "constant" as no individual being homozygous for the minor allele. So it would be great to have option 5, which is option 3 but with no "constant" sites. IQtree fully rejects datasets with these SNPS, even though they are very useful for popgen.

Script: Hamming distance filter

gl.report.ld crash

Hi Bernd & Arthur,

I am struggling to get the gl.report.ld function to work, as it always crashes at some point. It is also not clear to me what the command is to restart the function. I tried reusing the previous command, but it seems to start from the beginning rather than from the last completed chunk. I got 3 chunks; how do I restart using them?
Finally, in an old post I found the function gl.filter.ld, but it seems to be gone in the current version. Is there a way to filter based on LD in dartR?

ld_rep <- gl.report.ld(gl, save = TRUE, nchunks = 4, name = ld_test, ncores = 16, chunkname = NULL, probar = TRUE)

The gl is relatively big, but on a machine with 16 cores and 128GB RAM it shouldn't be too much of a problem, I guess.

/// GENLIGHT OBJECT /////////

// 257 genotypes, 24,010 binary SNPs, size: 39 Mb
80052 (1.3 %) missing data

// Basic content
@gen: list of 257 SNPbin
@ploidy: ploidy of each individual (range: 2-2)

// Optional content
@ind.names: 257 individual labels
@loc.names: 24010 locus labels
@loc.all: 24010 alleles
@position: integer storing positions of the SNPs
@pop: population of each individual (group size range: 17-101)
@other: a list containing: loc.metrics latlong ind.metrics

sessioninfo::session_info()

Session info
setting value
version R version 3.5.2 (2018-12-20)
os Ubuntu 14.04.6 LTS
system x86_64, linux-gnu
ui RStudio
language (EN)
collate en_NZ.UTF-8
ctype en_NZ.UTF-8
tz Pacific/Auckland
date 2019-03-27

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
ade4 * 1.7-13 2018-08-31 [1] CRAN (R 3.5.2)
adegenet * 2.1.1 2018-02-02 [1] CRAN (R 3.5.2)
ape 5.3 2019-03-17 [1] CRAN (R 3.5.2)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.5.2)
backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.2)
boot 1.3-20 2017-07-30 [4] CRAN (R 3.5.0)
broom 0.5.1 2018-12-05 [1] CRAN (R 3.5.2)
calibrate 1.7.2 2013-09-10 [1] CRAN (R 3.5.2)
callr 3.2.0 2019-03-15 [1] CRAN (R 3.5.2)
class 7.3-15 2019-01-01 [4] CRAN (R 3.5.2)
classInt 0.3-1 2018-12-18 [1] CRAN (R 3.5.2)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.5.2)
cluster 2.0.7-1 2018-04-09 [4] CRAN (R 3.5.0)
coda 0.19-2 2018-10-08 [1] CRAN (R 3.5.2)
codetools 0.2-16 2018-12-24 [4] CRAN (R 3.5.2)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.5.2)
combinat 0.0-8 2012-10-29 [1] CRAN (R 3.5.2)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.2)
crosstalk 1.0.0 2016-12-21 [1] CRAN (R 3.5.2)
dartR * 1.2.0 2019-03-25 [1] Github (10b4b5e)
data.table 1.12.0 2019-01-13 [1] CRAN (R 3.5.2)

Thanks,
Flo

LD scripts finalized

Create gl.report.ld
Create gl.filter.ld
Bring gl2gi inside the ld scripts

Bug in gl.pcoa.plot -- no visible binding for global variable 'pt.labels'

gl.pcoa.plot: no visible binding for global variable 'pt.labels'
gl.percent.freq: no visible binding for global variable 'snp'

Those are ones which are not easy to fix for me (as I need your head, what kind of labels you wanted to supply etc.)

Can you check? I think pt.labels is not defined anywhere, hence the error.

Remove dart2fasta, deprecated. Replaced by gl2fasta.r

Fixed difference analysis -- provide a verbose option to identify the SNP loci with fixed differences in paired comparisons. Requested Kyle Ewart.

Join the read dart and dart2gl scripts into one gl.read.dart.r

There needs to be one script to take the DArT csv files (1-row or 2-row) and convert to a genlight object -- gl.read.dart(). At present, the user must read the DArT file in and then convert to a genlight object as a two step process. Leave the two scripts without export.

gl.filter.hamming -- Add selection of best fragment to retain

When the script finds two fragments within the threshold distance, it deletes the first. We need to amend the script so that it deletes the fragment with the lowest PIC | Call Rate | Reproducibility

Olly Bolly: Add a heterozygosity filter

Vector Size cannot be NA?

Hey Bernd,

Having some trouble with the vector.

fmat <- as.matrix(ggbfmales)[father,]

fhet <- sum(fmat==1)
father.half <- ifelse(fmat==1, sample(c(0,2), fhet, replace = T), fmat)

Error in sample.int(length(x), size, replace, prob) :
vector size cannot be NA

Where am I going wrong here?
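
One likely cause, assuming the genotype vector contains missing values: fmat == 1 is NA for missing genotypes, so sum() returns NA and sample() is asked for NA draws. A sketch of a version that tolerates NA, using a made-up vector in place of the real data:

```r
set.seed(42)

# Toy paternal genotypes with one missing value.
fmat <- c(0, 1, 2, NA, 1, 1, 0)

sum(fmat == 1)                          # NA, because of the missing genotype
fhet <- sum(fmat == 1, na.rm = TRUE)    # counts heterozygotes, ignoring NA

# Replace each heterozygote with a randomly chosen homozygote (0 or 2).
father.half <- fmat
het <- which(fmat == 1)                 # which() drops NA positions
father.half[het] <- sample(c(0, 2), length(het), replace = TRUE)
```

Assigning by index rather than through ifelse() also makes it explicit that one draw is made per heterozygous locus, and missing genotypes stay NA.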

gl.diversity inoperable on Linux

Hi Bernd,

I am using a fresh install of R and RStudio on a brand new computer with Pop!_OS 18.04, and I noticed that when computing Shannon Diversity, the function outputs straight 1s for every metric. I have not had the same issue on my Mac, even with the same version of R and dartR. The remainder of dartR appears to function as normal, that is filtering and other functions produce identical outputs to my Mac. I have also used gl.diversity on Ubuntu/Pop!_OS before without any issue, though that was quite a while ago. I am using the same dataset that I provided you in 2018 to help write the function. Any ideas? Thanks in advance for helping with this.

gl.outflank settings

Hi Bernd & Arthur,
I was just running gl.outflank and realised that I'd written one of the settings, "LeftTrimFraction", as "eftTrimFraction", but it didn't give me any error. Is that because you've built in some cunning typo-tolerance, or is it not reading that setting? Cheers, Olly
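
A possible explanation, assuming gl.outflank accepts a ... argument: in R, a misspelled argument name is absorbed by ... without any error, so the default value is used instead. The functions below are hypothetical stand-ins that only demonstrate this behaviour.

```r
# A function with a named setting and a catch-all ... argument.
outflank_like <- function(x, LeftTrimFraction = 0.05, ...) {
  LeftTrimFraction
}

typo <- outflank_like(1, eftTrimFraction = 0.2)   # typo is swallowed by ...
ok   <- outflank_like(1, LeftTrimFraction = 0.2)  # setting is applied

# A stricter variant that reports anything caught by ...
strict <- function(x, LeftTrimFraction = 0.05, ...) {
  extra <- names(list(...))
  if (length(extra) > 0) {
    stop("unknown argument(s): ", paste(extra, collapse = ", "))
  }
  LeftTrimFraction
}
```

Under that assumption the misspelled setting was simply ignored and the run used the default trim fraction.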

Peter Unmack: Order the descriptions of filters in the vignette in the order in which they would normally be used.

We need to use forks and pull requests

Currently it is quite tedious to have two repositories (thought it would be easier, but we need to learn to use only one).

gl2treemix problems opening outfile

Hi there,

I was very excited to try out the new script for converting a gl to a treemix file but it doesn't seem to have worked. It took a long time to process (20812 snps in the gl) and it created a .gz file as intended but when I try to open the file, it says "does not appear to be a valid archive". The size of the .gz file is 1385KB - does that seem right?
I was a bit confused by the description of the gl2treemix function itself as it says "The file needs to be gzipped before it will be recognised by treemix." but the outfile we need to specify has to be xxx.gz so is this being done within the script?

Apologies if this is a silly question - I am new to this.

Thanks,

Yael

Filter individuals based on heterozygosity

Hi DARTr team

Can we calculate the heterozygosity of single individuals (across large number of SNP loci)?

Thanks, Jenny
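
As a rough sketch in base R, individual heterozygosity can be taken as the proportion of non-missing loci scored 1 for each individual, assuming 0/1/2 coding. The matrix geno and the 0.5 cut-off are made up for illustration.

```r
# Toy genotype matrix: rows = individuals, columns = loci.
geno <- matrix(
  c(0, 1, 2,  0,
    1, 1, NA, 1,
    0, 0, 2,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("ind", 1:3), paste0("loc", 1:4))
)

# Proportion of heterozygous calls per individual, ignoring missing data.
ind_het <- rowMeans(geno == 1, na.rm = TRUE)

# Keep individuals at or below a chosen heterozygosity threshold.
keep_ind <- ind_het <= 0.5
geno_kept <- geno[keep_ind, , drop = FALSE]
```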

Outlier analysis

Hi ArthR and grubR,

For your digestion - Some things I've encountered and wasn't sure they needed tweaking or I just wasn't using them correctly:

When I output results from the Outflank analysis (i.e. the spreadsheet with the He, Fst, outlier flags etc) I see that both alleles at each locus are present in the table and the total number of outlier loci identified and reported in the table is this number - i.e. 2 x the number of loci. Is there a reason for this? The stats for each allele are of course identical.
I was interested in pulling out the sequences of the outlier loci so I could blast them for homology to genes of known function (long shot, I know). I thought I'd take the list of outliers from dartR/Outflank and then use that as a lookup table in Excel to pull out the sequences (sorry - I use Excel). The thing is, the locus names in the dart raw data file (CloneID) have a different format (e.g. 13451614|F|0-43:G>A-43:G>A) to the locus names in the genlight file (e.g. 13469410-11292-A/G.A). It seems as though the allele names have lost some of their content (i.e. their position [43 in the above example]) and gained a unique number that is probably their number in the sequence of loci in the whole dataset. This makes it unclear whether I'm looking at the same locus in the dart and genlight file because, as you know, there can be multiple SNPs with the same starting number in their CloneID. Have I missed how dartR/adegenet renames loci, or is that information still accessible?
I'm sure you would do the above in a more elegant way than using Excel. Perhaps that is a suggestion for an addition to dartR - to enable pulling out the trimmed sequences after an outlier analysis so that they can be used in downstream analysis like blasting etc.

Adios,

Olly

Problems with gl2svdquartets

Hi,

I'm trying to use gl2svdquartets but I keep getting the following error message:

gl2svdquartets(gl, outfile="svd.nex", outpath=getwd(), method=1)
Starting gl2svdquartets: Create nexus file
Extacting SNP bases and creating records for each individual
Error in strsplit(snp, ">") : non-character argument

I'm converting my vcf file to a genlight object using vcfR2genlight, and then converting it to the svdquartets format. Do you have any suggestions about what could be causing this error message?

Thank you,
Ana
