knausb / vcfr Goto Github PK

View Code? Open in Web Editor NEW

235.0 16.0 54.0 21.47 MB

Tools to work with variant call format files

R 73.87% Shell 0.01% C++ 26.12%

vcf-data rstats genomics visualization rcpp population-genetics population-genomics

vcfr's Introduction

VcfR: a package to manipulate and visualize VCF data in R

On CRAN:

Appveyor (Windows):

Coveralls:

VcfR is an R package intended to allow easy manipulation and visualization of variant call format (VCF) data. Functions are provided to rapidly read from and write to VCF files. Once VCF data is read into R a parser function extracts matrices from the VCF data for use with typical R functions. This information can then be used for quality control or other purposes. Additional functions provide visualization of genomic data. Once processing is complete data may be written to a VCF file or converted into other popular R objects (e.g., genlight, DNAbin). VcfR provides a link between VCF data and the R environment connecting familiar software with genomic data.

VcfR is built upon two data structures.

vcfR - S4 class to contain data from a VCF file.

chromR - S4 class to contain variant information (VCF) and optional sequence (FASTA) and annotation (GFF) information.

Functions in vcfR provide the ability to subset VCF data as well as to extract and parse the data. For example, individual genotypes, sequence depths or genotype likelihoods (when provided in the VCF file) can easily be accessed. These tools are provided to aid researchers in rapidly surveying the quality and other characteristics of data provided as VCF data. With this information in hand, researchers should be able to determine criteria for hard filtering in order to attempt to maximize biological variation and minimize technical variation.

Documentation

Documentation for vcfR can now be found here: vcfR_documentation.

We also have Population genetics and genomics in R which is more general and provides examples of analyses.

If you think you've found a bug, please see reporting an issue.

Publication

vcfR articles

Knaus, Brian J., and Niklaus J. Grunwald. 2017. VCFR: a package to manipulate and visualize variant call format data in R. Molecular Ecology Resources 17(1):44-53. http://dx.doi.org/10.1111/1755-0998.12549.

Knaus, Brian J., and Niklaus J. Grunwald. 2016. VcfR: an R package to manipulate and visualize VCF format data. bioRxiv: 041277. http://dx.doi.org/10.1101/041277.

Copy number variation article

Knaus, Brian, and Niklaus J. Grünwald. 2018. Inferring variation in copy number using high throughput sequencing data in R. Frontiers in Genetics 9: 123. http://dx.doi.org/10.3389/fgene.2018.00123.

Download

vcfR is available at CRAN. To install use:

install.packages('vcfR')

The development version can be installed through github:

devtools::install_github(repo="knausb/vcfR")
library(vcfR)

If you would like the vignettes use:

devtools::install_github(repo="knausb/vcfR", build_vignettes=TRUE)

If you've built the vignettes, you can browse them with:

browseVignettes(package="vcfR")

If you've installed this package with devtools you will probably need to run:

devtools::install(build_vignettes = TRUE)

Devel branch

The devel branch (which may not be stable) can also be installed:

devtools::install_github(repo="knausb/vcfR@devel")
library(vcfR)

And to build the vignettes:

devtools::install_github(repo="knausb/vcfR@devel", build_vignettes=TRUE)

Software that produce VCF files

A fun part of this project has been learning about how people use vcfR. One facet of this is learning about the software that create VCF files. So I've decided to make a list of these software. If you know of a software that I have not included on this list, particularly if you can report that vcfR works with its files, feel free to let me know!

Genomic variant callers:

Restriction site associated DNA markers (e.g., RADseq, GBS):

Manipulation of VCF data:

Beagle v4.1 Inputs VCF genotypes and outputs phased genotypes to VCF format
pegas::read.vcf Population and Evolutionary Genetics Analysis System
PyVCF
SnpEff Genetic variant annotation and effect prediction toolbox
Picard A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF
VCF-kit VCF-kit is a command-line based collection of utilities for performing analysis on Variant Call Format (VCF) files.
VCFtools General manipulation and analysis
VariantAnnotation::readVcf Bioconductor package for annotating variants

R packages that read VCF data

Enjoy!

vcfr's People

Contributors

Stargazers

Watchers

vcfr's Issues

Error: File does not appear to exist!

Dear developer of the vcfR package,

I like to work with this package and will use it to implement into a Shinyapp.
But unfortunately, at different PCs and Laptops, read.vcfR doesn't work.
First, I thought it has to do with the operating system, but even on some Macs it doesn't work.
It always tells me that the File does not appear to exist... what could be the reason? Especially, because it works on some PCs/Laptops?!

Many thanks in advance,
Angela

DP=NA in chromo

If DP is all NA then chromoqc and chromo throws an error. This should be handled more gracifully.

Type=Flag in INFO

The INFO column contains key, value pairs where the keys should be defined in the meta region. The function extract_info_tidy() returns a tibble where the column names are the keys and the values populate the cells. Section 1.4.2 of VCFv4.3.pdf indicates that data of type=Flag may also be contained in the INFO field. This data includes no value. Instead it can be treated as a logical where the presence of the acronym indicates TRUE and absence indicated FALSE. At present, extract_info_tidy() returns NA for Type=Flag. An improvement would be for extract_info_tidy() to return a vector of logicals in place of the NAs.

Add relevant information to vcfR2genlight

Hi Brian,

Thank you again for making vcfR2genlight a reality. I have one more request:

Since the VCF file possibly contains the chromosomal locations of each SNP along with their genotype and other information, it would be useful to include that information as well. I would make a pull request, but I'm not quite sure the proper way to access that information from a vcfR or chrom object at the moment.

You can set these values like so:

chromosome(x) <- chromosome_vector # vector with the chromosome designation for each SNP
position(x)   <- position_vector # vector containing the numeric position of each SNP along the chromosome
alleles(x)    <- genotype_vector # vector of genotypes coded as A/C...
indNames(x)   <- sample_names # vector of sample designators
locNames(x)   <- locus_names # vector of locus names

plot.chromR with missing data

When the plot method of chromR is called and DP, QUAL, ... etc. are misssing, unexpected results occur. We need to handle this better.

chromR with only GT

Beagle creates VCF files that only contain the GT field in the GT section. Plotting a chromR object with this data appears to render correctly, however an error is thrown. Could this be handled more delicately?

Allow numeric sample names in read.vcf

When reading in the vcf file with read.vcf, sample names such as 2968 are coverted to X2968 because of the default argument check.names = TRUE in the function read.table on line 252. If you change it to FALSE, numeric sample

Haploid example data

A haploid example dataset could demonstrate the relevance of this software to researchers who study haploid organisms, such as bacteria and true fungi. Pseudomonas may be a good choice.

Annotation functions

Add annotation capabilities similar to those found in vcfanno:

http://biorxiv.org/content/early/2016/02/29/041863

Add tidyr compatibility

Reviewer #2 has suggested adding tidyr compatibility to vcfR so that users familiar with these tools can use them to work with the VCF data.

Standardize show methods for vcf and chrom objects with cat

Feel free to ignore this, for it's mainly a stylistic request, but I've noticed that you use message() to display the vcfR object and print() to display chromR objects. This results in very different aesthetics in the output, which could lead to confusion from novice users.

> chrom
[1] "*****   Class chromR, method Show   *****"
[1] "Name:  Supercontig"
[1] "Length:  1042442"
[1] "Use head(object) for more details."
[1] "*****      End Show (chromR)        *****"
> vcf
***** Object of Class vcfR *****
18 samples
22,031 variants
7.929 percent missing data
*****        *****         *****

What's more is that, by using message() to to display this information, Rmarkdown understands this as output that's printed line by line in vignettes:

## ***** Object of Class vcfR *****

## 18 samples

## 22,031 variants

## 7.929 percent missing data

## *****        *****         *****

My proposed solution would be to use cat() for both of them. This has precedence in several other object classes including DNAbin (ape), igraph (igraph), tbl_df (dplyr), and many more.

ERROR: compilation failed for package 'vcfR'

Hi Brian,
I get an error when installing the devel version of vcfR. This is the session info:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252    LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                        LC_TIME=German_Switzerland.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] devtools_1.9.1 Rcpp_0.12.1   

loaded via a namespace (and not attached):
[1] httr_1.0.0    R6_2.1.1      magrittr_1.5  tools_3.2.2   curl_0.9.3    memoise_0.2.1 stringi_1.0-1 stringr_1.0.0 digest_0.6.8

This is the message stuff I get when installing via github:

> library(devtools)
> install_github(repo="knausb/vcfR")
Downloading GitHub repo knausb/vcfR@master
Installing vcfR
"C:/ProgramData/R-32~1.2/bin/x64/R" --no-site-file --no-environ --no-save --no-restore CMD INSTALL "C:/Users/S Crameri/AppData/Local/Temp/RtmpSuY9RD/devtools21f47f3462c0/knausb-vcfR-dbd1a2d"  \
  --library="C:/ProgramData/R-3.2.2/library" --install-tests 

* installing *source* package 'vcfR' ...
** libs
g++ -m64 -std=c++0x -I"C:/ProgramData/R-32~1.2/include" -DNDEBUG    -I"C:/ProgramData/R-3.2.2/library/Rcpp/include" -I"d:/RCompile/r-compiling/local/local320/include"     -O2 -Wall  -mtune=core2 -c NM2winNM.cpp -o NM2winNM.o
NM2winNM.cpp: In function 'Rcpp::NumericVector win_mean(std::vector<std::vector<double> >)':
NM2winNM.cpp:11:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
NM2winNM.cpp:15:36: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
NM2winNM.cpp: In function 'Rcpp::NumericMatrix NM2winNM(Rcpp::NumericMatrix, std::vector<int>, int, int)':
NM2winNM.cpp:79:28: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
NM2winNM.cpp:90:28: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
NM2winNM.cpp: In function 'double vector_sum(std::vector<double>)':
NM2winNM.cpp:119:25: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
NM2winNM.cpp: In function 'double vector_mean(std::vector<double>)':
NM2winNM.cpp:132:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
g++ -m64 -std=c++0x -I"C:/ProgramData/R-32~1.2/include" -DNDEBUG    -I"C:/ProgramData/R-3.2.2/library/Rcpp/include" -I"d:/RCompile/r-compiling/local/local320/include"     -O2 -Wall  -mtune=core2 -c RcppExports.cpp -o RcppExports.o
g++ -m64 -std=c++0x -I"C:/ProgramData/R-32~1.2/include" -DNDEBUG    -I"C:/ProgramData/R-3.2.2/library/Rcpp/include" -I"d:/RCompile/r-compiling/local/local320/include"     -O2 -Wall  -mtune=core2 -c extractGT2NM.cpp -o extractGT2NM.o
extractGT2NM.cpp: In function 'int elementNumber(Rcpp::String, std::string)':
extractGT2NM.cpp:35:30: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
extractGT2NM.cpp: In function 'Rcpp::String extractElementS(Rcpp::String, int, int)':
extractGT2NM.cpp:69:30: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
extractGT2NM.cpp: In function 'double extractElementD(Rcpp::String, int)':
extractGT2NM.cpp:119:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
extractGT2NM.cpp: In function 'std::string gt2alleles(Rcpp::String, std::vector<std::basic_string<char> >, char)':
extractGT2NM.cpp:199:23: error: 'stoi' is not a member of 'std'
extractGT2NM.cpp:205:36: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
extractGT2NM.cpp:207:40: error: 'stoi' was not declared in this scope
extractGT2NM.cpp: In function 'Rcpp::StringMatrix extract_haps(Rcpp::StringVector, Rcpp::StringVector, Rcpp::StringMatrix, char, int)':
extractGT2NM.cpp:329:26: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
make: *** [extractGT2NM.o] Error 1
Warning: running command 'make -f "Makevars.win" -f "C:/ProgramData/R-32~1.2/etc/x64/Makeconf" -f "C:/ProgramData/R-32~1.2/share/make/winshlib.mk" CXX='$(CXX1X) $(CXX1XSTD)' CXXFLAGS='$(CXX1XFLAGS)' CXXPICFLAGS='$(CXX1XPICFLAGS)' SHLIB_LDFLAGS='$(SHLIB_CXX1XLDFLAGS)' SHLIB_LD='$(SHLIB_CXX1XLD)' SHLIB="vcfR.dll" WIN=64 TCLBIN=64 OBJECTS="NM2winNM.o RcppExports.o extractGT2NM.o gt_to_popsum.o pair_sort.o rank_variants.o seq_to_rects.o var_window.o vcfRCommon.o vcf_io.o"' had status 2
ERROR: compilation failed for package 'vcfR'
* removing 'C:/ProgramData/R-3.2.2/library/vcfR'
Error: Command failed (1)
> library(vcfR)
Error in library(vcfR) : there is no package called ‘vcfR

As can be seen in the sessionInfo(), having Rcpp package attached did not help (not having it attached gave the same error). RTools is installed (newest version).

My home directory has a blank, may that cause troubles?

"C:/Users/S Crameri/AppData/Local/Temp/RtmpSuY9RD/devtools21f47f3462c0/knausb-vcfR-dbd1a2d"  \
  --library="C:/ProgramData/R-3.2.2/library" --install-tests

Any ideas on how I could circumvent this compilation problem?
Thanks in advance!

incorrect parsing of GT using vcfR::extract.gt with VCF from Tassel

Hi Brian, congratulations again for the paper.

this might not be a bug related to vcfR, but I wanted to clear this with you before asking Tassel team.

I'm using vcfR for some functions in my packages stackr and assigner.
Jeremy Leluyer @jleluyer, recently tried stackr with a VCF made and filtered in Tassel and got an error (see below). I don't think it's following the vcf 4.0 specification, although it's says ##fileformat=VCFv4.0 in the header.

The VCF header:

##fileformat=VCFv4.0
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT

So you expect to find GT, AD, DP, GQ, PL in the FORMAT field and NS, DP, AF in the INFO field however this is not the case:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	IND1	IND2	IND3	IND4	IND5
0	192	TP192	A	C	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	754	TP754	G	A	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	848	TP848	C	A	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	946	TP946	A	G	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	1563	TP1563	C	A	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	2438	TP2438	T	A	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	2808	TP2808	A	G	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	3264	TP3264	G	T	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	3708	TP3708	C	T	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	4616	TP4616	G	A	.	PASS	.	GT	0/0	0/0	./.	0/0	0/0
0	4617	TP4617	G	A	.	PASS	.	GT	0/0	0/0	./.	0/0	0/0
0	4635	TP4635	G	A	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	4979	TP4979	T	A	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	5087	TP5087	C	A	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	5970	TP5970	A	C	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	6000	TP6000	A	G	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	6226	TP6226	A	G	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	6528	TP6528	A	C	.	PASS	.	GT	0/0	0/0	0/0	0/0	0/0
0	7080	TP7080	C	A	.	PASS	.	GT	0/0	0/0	./.	0/0	0/0
0	7387	TP7387	C	T	.	PASS	.	GT	0/0	0/0	0/1	0/0	0/0

So you see that there is just the GT filled in. There is no problem reading the VCF in VCFTOOLS.

Problem:
In vcfR the ./. genotypes get parsed as NA instead of remaining ./. like other VCF's I tried in vcfR.

Question 1: is this a bug in vcfR?

Question 2: even if they are not following VCF specifications (from 4.0 to 4.3) is this something you want to fix ?

There is potential for opening a door you might not want, as they can be a thousand way a VCF can go wrong an I've seen horrible genepop file going around... ;)

I've fix this behaviour in my package by scanning for NA's and replacing it back to ./.
Cheers
Thierry

names method for chromR

Reviewer #1 has requested that a names method be set to display and change the name of chromR objects.

CHROM param in create.chromR

Currently create.chromR requires the user to subset the sequence, variant and annotation data to a single chromosome. Some users have expressed that this is overly complex. Adding a parameter CHROM to the function could help. It could have 3 states:

NULL - default behavior requires user to subset the data as in its current implementation
length(CHROM) == 1 - a single chromosome name to be used for subsetting, requires standardized names among all data
length(CHROM) == 3 - 3 names to be used for the chromosome, variants and the annotations

Someday this may be extended to read data from files as opposed to memory as in the current implementation.

vcfR to gff/bed

Summarizing variants per sample in a gff/bed type file could allow transfer from vcfR to browsers such as IGV, gbrowse, jbrowse, etc. Flybase may have a good example from their point mutation tracks. See the Integrative Genomics Viewer as an example for an input software.

Wrap tests in `test_that` function; add content to testthat.R

The tests that are currently in the files to be tested are evaluated by testthat, but they are not true unit tests in that if one fails at the beginning, you have no clue if any others will pass later. Utilizing the function test_that with your tests will ensure that all the tests will run, even if one or two fail.

Example:

currently you have this code in test_create_chrom.R:

# create_chrom tests.

# detach(package:vcfR, unload=T)
library(vcfR)
context("create_chrom functions")

data(vcfR_example)

pinf_mt <- create_chrom('pinf_mt', seq=pinf_dna, vcf=pinf_vcf, ann=pinf_gff, verbose=F)
expect_that(pinf_mt, is_a("Chrom"))

pinf_mt <- create_chrom('pinf_mt', seq=pinf_dna, vcf=pinf_vcf, verbose=F)
expect_that(pinf_mt, is_a("Chrom"))

pinf_mt <- create_chrom('pinf_mt', vcf=pinf_vcf, ann=pinf_gff, verbose=F)
expect_that(pinf_mt, is_a("Chrom"))

pinf_mt <- create_chrom('pinf_mt', vcf=pinf_vcf, verbose=F)
expect_that(pinf_mt, is_a("Chrom"))

It could easily be converted by adding the test_that function using the syntax of:

test_that("something happens", {
  expect_that(code(), equals("something happens"))
})

In context, this is how it could be implemented:

# create_chrom tests.

# detach(package:vcfR, unload=T)
library(vcfR)
context("create_chrom functions")

test_that("chrom objects can be created", {
  data(vcfR_example)

  pinf_mt <- create_chrom('pinf_mt', seq=pinf_dna, vcf=pinf_vcf, ann=pinf_gff, verbose=F)
  expect_that(pinf_mt, is_a("Chrom"))

  pinf_mt <- create_chrom('pinf_mt', seq=pinf_dna, vcf=pinf_vcf, verbose=F)
  expect_that(pinf_mt, is_a("Chrom"))

  pinf_mt <- create_chrom('pinf_mt', vcf=pinf_vcf, ann=pinf_gff, verbose=F)
  expect_that(pinf_mt, is_a("Chrom"))

  pinf_mt <- create_chrom('pinf_mt', vcf=pinf_vcf, verbose=F)
  expect_that(pinf_mt, is_a("Chrom"))
}

Additionally, you should place the following in testthat.R:

library("testthat")
test_check("vcfR")

INFO2df function

The INFO column contains summary information about each variant. The values and classes of this data can be extracted from the META portion of the VCF data and used to initialize a data.frame. This data.frame could then be populated with values parsed from the INFO column. By converting the data contained in the INFO column into a data.frame it would make this information more accessible.

chromR2vcfR implements use.mask=T does not subset data

vcfR2genlight

You currently have the ability to transfer from vcfR to loci, genind, and DNAbin. As it is clear that genind objects are memory hogs (creating a separate column for each allele), it might be beneficial to also include a vcfR2genlight function, which would be able to store binary SNP data.

Benefits:

This gives the user a versatile way to transfer their data into genlight format so they can calculate the index of association and DAPC quickly.
The function can automatically insert the allele calls and position information (which the user must do manually for genlight)

Drawbacks:

All alleles going in should be binary.

I know it's not a priority by any means, but it's something to think about as a way to get around the memory issue and make things more accessible.

Memory profiler

An effective memory profiler may be:

https://github.com/shinra-dev/memuse

We'll see.

read_vcf3.cpp error using zlib

For new ubuntu users using 14.04 LTS, there is a new version of zlib (zlib 1.2.8). this version breaks read_vcf3.cpp on line 772.

to fix the error, I changed the pointer in line 732:

-   gzFile *fi = (gzFile *)gzopen(filename.c_str(),"ab");
+   gzFile fi = (gzFile )gzopen(filename.c_str(),"ab");

subsetting chrom

Allow chrom objects to be subset using [].

tidyr data structures

VCF data may take the form of:

POS REF ALT
1110696 A G,T

A tidR representation would be:

POS REF ALT
1110696 A G
1110696 A T

Implementing this, however, would break the relationship among rows in vcfR@fix and vcfR@gt. It appears that a tidyR solution would be to implement a primary key for each variant. This would provide a unique identifier for each variant and allow for the selection of a variant's data from either the @fix or @gt slot. This may require a little homework to implement.

Utilize viridis for default color ramp

The current color ramp for heatmap.bp is not colorblind friendly. Viridis is a color palette that was explicitly designed for python's matplotlib to tease out large and small differences in data. Heres a video describing how they came up with the palette. It is available in R as the package "viridis".

extract.gt() with return.alleles and multiallelic variants

Reviewer #2 reports that extract.gt() 'probably chokes' for multiallelic variants. This should either be addressed or documented.

extract haps does not handle all missing data

The VCF definition allows missing genotypes to be specified as '.' as well as '.|.' and possibly '0|.' and '.|1'. At the present, only the first case is handled by vcfR_extract_haps.

create.chrom() argument order

Reviewer #1 has asked that the order of arguments in create.chrom() be changed so that required arguments appear before arguments with defaults.

summary() for vcfR

A summary function for objects of class vcfR would provide feedback on what was read in to R. It could include:

Sample number
Variant number
Percent NA
Number of chromosomes

See genlight or DNAbin for examples.

examples for read.vcfR()

Reviewer #2 has requested that examples be included for read.vcfR().

extract.indels() does not handle ALT = NA as zero length

data(vcfR_test)
getFIX(vcfR_test)
vcf <- extract.indels(vcfR_test)
getFIX(vcf)

Variant (row) 4 includes NA as the alternate allele. This should be treated as a deletion.

write.vcf append includes the @fix header

Let's assume I have a list of filtered chrom objects called chrom.win. Each of the elements of a list represent a chromosome, so chrom.win[[1]] is chromosome 1.

I'm trying to create an output vcf file using write.vcf for each of the elements of the chrom.win in order to append all elements into a vcf file that laready has chromosome 1 and all the vcf headers + info:

write.vcf(chrom.win[[1]], file="filtered.vcf.gz", mask=T)

for (i in 2:length(chrom.win)){
  write.vcf(chrom.win[[i]],file = filtered.vcf.gz,APPEND = T,mask = T)
}

when I open filtered.vcf.gz I find the same number of #CHROM headers than chromosomes. I can go around it by creating one vcf file per chromosome and concatenate them using vcf-tools, but having the #CHROM header breaks the idea of appending the output, doesn't it?

dplyr 0.5.0 will break vcfR

Received the below e-mail from h.wickham:

This is an automated email to let you know about the upcoming release
of dplyr, which will be submitted to CRAN on June 23 (in about two
weeks). This is my second attempt at getting dplyr 0.5.0 onto CRAN,
and this time it's definitely going to succeed. Please get in touch if
you need any help getting your package ready for this dplyr update!

To check for potential problems, I ran R CMD check on your package
vcfR (1.1.0).

I found: 1 error | 0 warnings | 0 notes.

checking examples ... ERROR
Running examples in ‘vcfR-Ex.R’ failed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: Convert to tidy data frames
> ### Title: Convert vcfR objects to tidy data frames
> ### Aliases: 'Convert to tidy data frames' extract_gt_tidy
> ###   extract_info_tidy vcfR2tidy vcf_field_names
>
... 8 lines ...
> # data frames: fix, gt, and meta. Here we don't coerce columns
> # to integer or numeric types...
> Z <- vcfR2tidy(vcf)
Extracting gt element AD
Extracting gt element DP
Extracting gt element GQ
Extracting gt element GT
Extracting gt element PL
Error in eval(expr, envir, enclos) : could not find function "everything"
Calls: vcfR2tidy ... select_vars_ -> <Anonymous> -> lapply -> FUN

-> eval -> eval
Execution halted

If I got an ERROR because I couldn't install your package (or one of
it's dependencies), my apologies. You'll have to run the checks
yourself (unfortunately I don't have the time to diagnose installation
failures as I have to run checks on hundreds of packages).

Otherwise, please carefully look at the results, and let me know if
I've introduced a bug in dplyr. If I have, I'd really appreciate a
minimal reproducible example that uses only dplyr functions. That way
I can find and fix the bug as quickly as possible.

If it doesn't look like a bug in dplyr, please prepare an update for
CRAN. Ideally you'll tweak your package so it works with both the
released and development versions of dplyr. Otherwise, be prepared to
submit your package to CRAN soon after I let you know that I've
submitted.

To get the development version of dplyr so you can run the checks
yourself, you can run:

# install.packages("devtools")
devtools::install_github("hadley/dplyr")

To see what's changed visit
https://github.com/hadley/dplyr/blob/master/NEWS.md.

If you have any questions about this email, please feel free to
respond directly.

Regards,

Hadley

Tilde expansion for read.vcfR()

Reviewer #2 reports that tilde expansion does not work in read.vcfR(). The reviewer also correctly identifies the issue to be with zlib, which is used to open gzipped files.

extract_info_tidy(info_types = TRUE) as default

The function extract_info_tidy() currently has a default setting of info_types = NULL. If it were instead set to TRUE the function would query the meta information to guess its type. This seems desirable. But it seems so simple to change I wonder if the original author had a vision here that I am not seeing. @eriqande, would you like to contribute to this decision?

ERROR: compilation failed for package ‘vcfR’

Hi Brian,
I get an error when installing vcfR in my home directory in cluster. Here is the error message:

installing source package ‘vcfR’ ...
** package ‘vcfR’ successfully unpacked and MD5 sums checked
** libs
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c NM2winNM.cpp -o NM2winNM.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [NM2winNM.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c RcppExports.cpp -o RcppExports.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [RcppExports.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c ad_frequency.cpp -o ad_frequency.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [ad_frequency.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c extractGT2NM.cpp -o extractGT2NM.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [extractGT2NM.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c freq_peak.cpp -o freq_peak.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [freq_peak.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c gt_to_popsum.cpp -o gt_to_popsum.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [gt_to_popsum.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c is_het.cpp -o is_het.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [is_het.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c masplit.cpp -o masplit.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [masplit.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c pair_sort.cpp -o pair_sort.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [pair_sort.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c rank_variants.cpp -o rank_variants.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [rank_variants.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c seq_to_rects.cpp -o seq_to_rects.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [seq_to_rects.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c var_window.cpp -o var_window.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [var_window.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c vcfRCommon.cpp -o vcfRCommon.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [vcfRCommon.o] Error 127 (ignored)
I/usr/share/R/include -DNDEBUG -I"$LOCAL_R_LIB/Rcpp/include" -c vcf_io.cpp -o vcf_io.o
/bin/bash: I/usr/share/R/include: No such file or directory
make: [vcf_io.o] Error 127 (ignored)
-o vcfR.so NM2winNM.o RcppExports.o ad_frequency.o extractGT2NM.o freq_peak.o gt_to_popsum.o is_het.o masplit.o pair_sort.o rank_variants.o seq_to_rects.o var_window.o vcfRCommon.o vcf_io.o -lz -L/usr/lib/R/lib -lR
/bin/bash: line 2: -o: command not found
make: *** [vcfR.so] Error 127
ERROR: compilation failed for package ‘vcfR’
removing ‘$LOCAL_R_LIB/vcfR’

The downloaded source packages are in
‘/tmp/RtmpCsjA3I/downloaded_packages’
Warning message:
In install.packages("vcfR", repos = "http://cran.r-project.org", :
installation of package ‘vcfR’ had non-zero exit status

How can I fix this error?

Thanks,
Mao

is.vcf check

Should consider functions:

is.vcf()

and

is.gff()

to help qc incoming data.

extractGT2NM.cpp ASAN

Address sanitizer throws errors:

extractGT2NM.cpp:189
extractGT2NM.cpp:260
extractGT2NM.cpp:355

Perhaps see:

sanitizers package
Dirk's sanitizers
Using-Address-Sanitizer

Annotation challenges

Hi- thanks for making this package available. I am really excited about all of the features.

I am working with a non-model organism, though I have .gff and .vcf files. I am working with SNP's from RNA-seq work, and therefore SNPs aren't called in all of the areas which have annotations. I was trying to create a chomR object, but received the following message:

Annotation positions exceed chromosome positions. Is this the correct set of annotations?

I checked both the .vcf file and the .gff file and there aren't any annotations or SNP's called in a position that doesn't exist in the files. I tried this with a few other files as well and got the same message. Any assistance would be greatly appreciated. I have included two example files.

example.zip

vcfR2Dnabin converts one row matrix to character vector

vcf file with only one variant threw an error (shown below). The error was it was converting one row matrix to character vector.

Error in variants[pos >= start.pos, , ] :
incorrect number of dimensions

variants <- variants[pos < start.pos + dim(ref.seq)[2], ]

Solution:

variants <- variants[pos < start.pos + dim(ref.seq)[2], drop = FALSE]

create.chromR() with zero variants

create.chromR() currently appears to fail when there are zero variants in the vcfR object. It would be nice to add this functionality in order to help automation. For example, when processing Supercontigs throughout a genome some Supercontigs may not have variants. Adding this functionality would therefore allow processing of the entire genome.

Suppress labels in heatmap.bp()

Reviewer #2 has noted that the row labels in Figure 2, which was produced by heatmap.bp(), smear together due to the high number of variants. The reviewer suggests an option to suppress these labels.

Expand show method for objects of class chromR

The show method could include summaries of variant content, sequence length (already implemented but perhaps not named the best) and annotation content. Also, the chromosome names could be reported to help validate they all refer to the correct chromosome (scaffold, supercontig, etc.).

maspit does not handle '.,.'

library(vcfR)
data("vcfR_test")
gt <- extract.gt(vcfR_test, element = 'HQ')
myHQ1 <- masplit(gt)
Failed to convert to a float.
Failed to convert to a float.

Needs to be fixed.

Handle INFO column when its set to NA.

Some VCF files may contain an INFO column that is entirely set to NA. This currently throws an error when trying to plot() the object.

library(vcfR)
data(vcfR_test)
is.na(vcfR_test@fix[,'INFO']) <- TRUE
chrom <- create.chromR(vcfR_test)
plot(chrom)
Error in hist.default(DP, col = 3, main = "Read depth (DP)", xlab = "") : invalid number of 'breaks'

This could be handled by plotting a blank window with a message that no data was found instead of the error.

Memory issues

An email has reported:

This still has the (sometimes fatal) memory-access errors linked from [https://cran.r-project.org/web/checks/check_results_vcfR.html].

In addition, valgrind is reporting use of uninitialized memory and many small memory leaks.

Error with output of vcf.write()

Hi,

I've been running through the vignettes outlined on the CRAN repository with both my own data and the example data from pinfsc50. I noticed that vcf.write() does not appear to be writing a vcf that excludes low quality variants (identified by masker()) when mask = TRUE. The only difference between the new and original vcf I can discern are entries in the FILTER column (which all say PASS). Am I misunderstanding what the output of write.vcf() is when mask = TRUE? Or is this an error? This is the code I've been using:

library(vcfR)

vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
dna_file <- system.file("extdata", "pinf_sc50.fasta", package = "pinfsc50")
gff_file <- system.file("extdata", "pinf_sc50.gff", package = "pinfsc50")

vcf <- read.vcfR(vcf_file, verbose = FALSE)
dna <- ape::read.dna(dna_file, format = "fasta")
gff <- read.table(gff_file, sep="\t", quote="")

chrom <- create.chromR(name="Supercontig", vcf=vcf, seq=dna, ann=gff, verbose=FALSE)
chrom <- proc.chromR(chrom, verbose = TRUE)

chrom <- masker(chrom, min_QUAL=0, min_DP=350, max_DP=650, min_MQ=59.5, max_MQ=60.5)
chrom <- proc.chromR(chrom, verbose = FALSE)

write.vcf(chrom, file="good_variants.vcf.gz", mask=TRUE)

Any help on this would be greatly appreciated

Give users TRUE/FALSE option for masking Chrom objects in extract.gt()

Currently, the behavior of extract.gt() is to simply extract the information from a vcfR or chrom object, but it will currently will take into account masking if the user explicitly supplies a logical vector of variants to mask. Since the information masking the data does live within a Chrom object, can you give the users an option to simply use the mask field in the Chrom object?

Currently, the implementation looks like this:

myChrom.gt <- extract.gt(myChrom, element = "GT", mask = myChrom@var.info$mask)

What I expect would look like:

myChrom.gt <- extract.gt(myChrom, element = "GT", mask = TRUE)

internally, you can check the length of the vector and class of the object with something like:

if ("Chrom" %in% class(x) && length(mask) == 1){
  ... use masking from var.info slot ...
} else {
  ... do what you normally do ...
}

Add color scale option to heatmap function

The heatmap function is very useful, but it's not very friendly to red-green colorblind people. Is there a way that you could include parameters to make this an option?

heatmap.bp fails unless viridisLite is in imports

if the user does not have viridisLite installed, the heatmap.bp() function will fail because it has no color palette:

library("vcfR")
x <- as.matrix(mtcars)
heatmap.bp(x)
#> Error in loadNamespace(name) : there is no package called ‘viridisLite’

This can be remedied by placing viridisLite in the Imports section of your DESCRIPTION, ensuring that the user has access to this package.