checkvcf's Introduction

checkVCF.py

checkVCF.py is a small tool written in Python that checks input VCF files before association tests. It reports monomorphic sites, sites whose reference alleles are inconsistent with the reference genome, sites with invalid genotypes, non-SNP sites (e.g. indels), and all sites with alternative allele frequencies greater than 0.5. Once your file has passed these checks, you can go on to run rvtests, a rare-variant test software.

Download

Download this bundle (839 Mb) and unzip the downloaded file. The bundle includes the checkVCF.py script, the reference genome (hs37d5.fa) in FASTA format, and its index file (hs37d5.fa.fai).

Example

python checkVCF.py -r hs37d5.fa -o test $your_VCF

Outputs

Console output and .log file

Upon successfully running checkVCF.py on the example file, you will see the following output:

checkVCF.py -- check validity of VCF file for meta-analysis
version 1.3 (20130223)
contact [email protected] or [email protected] for problems.
Python version is [ 2.7.3.final.0 ] 
Begin checking vcfFile [ example.vcf.gz ]
---------------     REPORT     ---------------
Total [ 18 ] lines processed
Examine [ 7 ] VCF header lines, [ 11 ] variant sites, [ 6 ] samples
[ 0 ] duplicated sites
[ 0 ] NonSNP site are outputted to [ tmp.check.nonSnp ]
[ 10 ] Inconsistent reference sites are outputted to [ tmp.check.ref ]
[ 0 ] Variant sites with invalid genotypes are outputted to [ tmp.check.geno ]
[ 1 ] Alternative allele frequency > 0.5 sites are outputted to [ tmp.check.af ]
[ 1 ] Monomorphic sites are outputted to [ tmp.check.mono ]
---------------     ACTION ITEM     ---------------
* Read tmp.check.ref, for autosomal sites, make sure the you are using the forward strand
* Upload these files to the ftp: tmp.check.log tmp.check.dup tmp.check.noSnp tmp.check.ref tmp.check.geno tmp.check.af tmp.check.mono

.check.nonSnp file

This file includes all non-SNP sites. A site is flagged as non-SNP when the length of the reference allele or the alternative allele is larger than one, for example when the reference allele is AT. Non-SNP sites also include reference alleles that are not composed of 'A', 'C', 'G', 'T' and alternative alleles that are not composed of 'A', 'T', 'G', 'C', '.'.
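The rules above can be sketched in a few lines of Python. This is an illustration of the description, not the code used by checkVCF.py: the function name `is_non_snp` is hypothetical, and treating the comparison as case-sensitive is an assumption.

```python
# Hypothetical helper mirroring the non-SNP rules described above.
VALID_REF_BASES = set("ACGT")    # allowed single-base reference alleles
VALID_ALT_BASES = set("ACGT.")   # allowed single-base alternative alleles

def is_non_snp(ref: str, alt: str) -> bool:
    """Return True when a site would count as non-SNP under the stated rules."""
    # Rule 1: either allele is longer than one character (e.g. REF = "AT").
    if len(ref) > 1 or len(alt) > 1:
        return True
    # Rule 2: the reference allele is not one of A, C, G, T.
    if ref not in VALID_REF_BASES:
        return True
    # Rule 3: the alternative allele is not one of A, C, G, T or '.'.
    if alt not in VALID_ALT_BASES:
        return True
    return False
```

Under this sketch a multi-allelic ALT field such as "C,T" is also flagged, simply because its length is larger than one.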

.check.ref file

This file includes the variant sites whose reference alleles do not match the reference genome. This can happen when: (1) the variant's chromosome name does not appear in the reference genome file, in which case you will see a line with "FailedGetBase" and chromosome:position from the input VCF file; or (2) the reference allele does not match, in which case you will see "MismatchRefBase" and chromosome:position:trueReferenceAllele-referenceAlleleInVCF/alternativeAlleleInVCF. For example:

MismatchRefBase 19:50578409:G-C/T
FailedGetBase   23:208316
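To triage a long .check.ref file, lines like the two above can be split into their parts. The following is a minimal sketch that assumes each line has exactly two whitespace-separated fields, as in the example, and that chromosome names contain no colon; `RefIssue` and `parse_ref_line` are hypothetical names, not part of checkVCF.py.

```python
from typing import NamedTuple, Optional

class RefIssue(NamedTuple):
    kind: str               # "MismatchRefBase" or "FailedGetBase"
    chrom: str
    pos: int
    detail: Optional[str]   # allele string such as "G-C/T", if present

def parse_ref_line(line: str) -> RefIssue:
    """Parse one .check.ref line (assumed layout: '<kind> <chrom>:<pos>[:<alleles>]')."""
    kind, site = line.split()          # split on any run of whitespace
    parts = site.split(":")
    chrom, pos = parts[0], int(parts[1])
    # FailedGetBase lines carry no allele information.
    detail = parts[2] if len(parts) > 2 else None
    return RefIssue(kind, chrom, pos, detail)
```

Counting issues per chromosome with such a parser can show quickly whether mismatches cluster on particular chromosomes or are spread across the whole file.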

.check.geno file

This file contains the line numbers at which genotypes are missing or incorrectly formatted. Each entry carries either an "IndividualMissingGTField" warning or an "IndividualHasInvalidGT" warning.
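The exact rules checkVCF.py applies are not documented here, so the following is only a sketch of how the two warnings might arise. It assumes that a missing GT key in the FORMAT column yields "IndividualMissingGTField" and that a GT value not made of allele indices or '.' separated by '/' or '|' yields "IndividualHasInvalidGT". The function name `check_genotype` and the regular expression are assumptions.

```python
import re
from typing import Optional

# Assumed shape of a valid GT value: allele indices or '.', joined by '/' or '|'.
GT_PATTERN = re.compile(r"(\.|\d+)([/|](\.|\d+))*")

def check_genotype(format_field: str, sample_field: str) -> Optional[str]:
    """Return a warning name for one sample column, or None if it looks fine."""
    keys = format_field.split(":")
    if "GT" not in keys:
        return "IndividualMissingGTField"
    values = sample_field.split(":")
    idx = keys.index("GT")
    if idx >= len(values):
        # FORMAT declares GT but the sample column is too short to hold it.
        return "IndividualMissingGTField"
    if GT_PATTERN.fullmatch(values[idx]) is None:
        return "IndividualHasInvalidGT"
    return None
```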

.check.af file

This file contains the sites where the alternative allele frequency is larger than 0.5. It is normal for this file to contain a number of lines. For a human exome chip, you are likely to have ~10k lines in this file. That means that out of ~250k variants in total, around 10k SNP variants (roughly 4%) have alternative allele frequencies larger than 0.5.
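As an illustration of the quantity being checked, the alternative allele frequency at one site can be computed from the samples' GT values. This sketch assumes that missing alleles ('.') are excluded from the denominator and that any non-zero allele index counts as alternative; checkVCF.py may handle these cases differently, and `alt_allele_frequency` is a hypothetical name.

```python
import re
from typing import Iterable, Optional

def alt_allele_frequency(genotypes: Iterable[str]) -> Optional[float]:
    """Alternative allele frequency from GT strings such as '0/1' or '1|1'."""
    alt = total = 0
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue            # assumption: missing alleles are skipped
            total += 1
            if allele != "0":
                alt += 1            # assumption: any non-reference allele counts
    return alt / total if total else None

# Three samples, one missing: alleles 0,1,1,1 give 3/4.
print(alt_allele_frequency(["0/1", "1/1", "./."]))  # 0.75
```

A site would land in the .check.af file when this value exceeds 0.5.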

.check.mono file

This file contains the monomorphic sites. It is normal for this file to contain a number of lines. In the ideal case, VCF files should contain only variant sites. However, it is sometimes practical or convenient to keep some monomorphic sites in the VCF file, and this file records them.

Contact

Questions or comments can be sent to Xiaowei Zhan or Dajiang Liu.
