mandawilson / pyvcf
This project forked from jamescasbon/pyvcf
A Variant Call Format reader for Python.
Home Page: http://pyvcf.readthedocs.org/en/latest/index.html
License: MIT License
Preserve the order of all unique LABELs, where LABEL is any ##LABEL in the header, so that a diff of input and output shows only the important changes.
This is different from mandawilson / PyVCF Issue #3, which will just preserve the order within metadata, info, format and filters, but not between them.
When there is a non-numeric value in INFO.Number or FORMAT.Number it is lost and replaced with None. But VCF 4.1 allows A and G (this is partially done in the code). Also make sure that a "." is read and written correctly (if we want it to be stored as None, make sure it is output as ".").
VCF 4.1:
The Number entry is an Integer that describes the number of values that can be included with the INFO field. For example, if the INFO field contains a single number, then this value should be 1; if the INFO field describes a pair of numbers, then this value should be 2 and so on. If the field has one value per alternate allele then this value should be 'A'; if the field has one value for each possible genotype (more relevant to the FORMAT tags) then this value should be 'G'. If the number of possible values varies, is unknown, or is unbounded, then this value should be '.'. The 'Flag' type indicates that the INFO field does not contain a Value entry, and hence the Number should be 0 in this case.
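A minimal sketch of how the Number attribute could be read and written so that nothing is lost, assuming "." is stored as None as suggested above. The helper names `parse_number` and `format_number` are hypothetical and are not part of PyVCF.

```python
def parse_number(raw):
    """Parse the Number attribute of an INFO or FORMAT header line.

    Hypothetical helper: 'A' and 'G' are kept as strings, '.' becomes None,
    and anything else is read as an integer count.
    """
    if raw == ".":
        return None
    if raw in ("A", "G"):
        return raw
    return int(raw)


def format_number(number):
    """Write a parsed Number back out, turning None into '.' again."""
    return "." if number is None else str(number)
```

With this pairing, `format_number(parse_number(x))` returns `x` for each of "0", "2", "A", "G" and ".".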
Output "a;b" for filters a and b instead of ["a", "b"] which is invalid.
Use collections.OrderedDicts for metadata, info, format and filters so that it is easier to diff input and output. This was a feature request from users.
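As a rough illustration of the idea, the sketch below collects INFO definitions into a collections.OrderedDict so that they can later be written back in file order. The function name `read_info_ids` and the sample header lines are assumptions made for this example, not PyVCF code.

```python
from collections import OrderedDict

INFO_PREFIX = "##INFO=<ID="


def read_info_ids(header_lines):
    """Map each INFO ID to its raw header line, in the order encountered."""
    infos = OrderedDict()
    for line in header_lines:
        line = line.rstrip("\n")
        if line.startswith(INFO_PREFIX):
            # The ID is everything between 'ID=' and the first comma.
            info_id = line[len(INFO_PREFIX):].split(",", 1)[0]
            infos[info_id] = line
    return infos


# Illustrative header lines (made up for this sketch).
header = [
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Samples with data">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
]
infos = read_info_ids(header)
```

Iterating over `infos` yields NS, DP, AF in that order, so writing the values back out reproduces the input ordering.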
mandawilson/PyVCF will require Python 2.7.
It expects things like this:
But now we have things like multiple contigs, alts, and samples. These specific examples should be parsed in the way that the INFO and FORMAT fields are, but in the meantime we should at least allow multiple values for a key:
E.G.:
When a key has just one value, it might be nicer to store the value itself instead of a list of size 1.
In:
The key will be 'KEY=<ID=X,Description' and the value is '"Testing multiple keys in metadata (like multiple contigs)">'.
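One possible fix, sketched under assumptions: split each meta-information line on the first "=" only, and append repeated keys to a list. The helper `add_metadata` is hypothetical, and the input line is reconstructed from the key and value quoted above.

```python
from collections import OrderedDict


def add_metadata(metadata, line):
    """Store one '##key=value' line, allowing the same key to repeat."""
    # partition splits on the FIRST '=', so '=' inside <...> stays in the value.
    key, _, value = line.rstrip("\n")[2:].partition("=")
    metadata.setdefault(key, []).append(value)


metadata = OrderedDict()
# Line reconstructed from the key/value pair described in the issue.
add_metadata(
    metadata,
    '##KEY=<ID=X,Description="Testing multiple keys in metadata '
    '(like multiple contigs)">',
)
add_metadata(metadata, '##KEY=<ID=Y,Description="A second entry">')
add_metadata(metadata, "##reference=GRCh37")
```

Here `metadata["KEY"]` holds two values and `metadata["reference"]` holds one. Collapsing single-element lists to a bare value could be done in a later pass if that is preferred.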
In PGM tools not all INFO keys are defined in the meta-information section. It turns out they don't have to be defined there, even though it is strongly encouraged. The code assumes they are, and this breaks when there is an undefined flag field in the INFO section.
From VCF 4.1 format specification: It is strongly encouraged that information lines describing the INFO, FILTER and FORMAT entries used in the body of the VCF file be included in the meta-information section. Although they are optional, if these lines are present then they must be completely well-formed.
cat test/pgm_tools.vcf | ./vcfIdentity.py
Traceback (most recent call last):
  File "./vcfIdentity.py", line 11, in <module>
    for rec in vcfin:
  File "/home/wilson/PyVCF/vcf/parser.py", line 702, in next
    info = self._parse_info(row[7])
  File "/home/wilson/PyVCF/vcf/parser.py", line 594, in _parse_info
    val = entry[1].split(',')
IndexError: list index out of range
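The IndexError comes from indexing `entry[1]` on an entry that has no "=". A minimal sketch of a more tolerant parse follows, assuming that a bare key with no header definition should be treated as a flag. `parse_info` here is a standalone hypothetical function, not the `_parse_info` method in vcf/parser.py.

```python
from collections import OrderedDict


def parse_info(info_str):
    """Parse an INFO column without consulting the header definitions.

    Hypothetical helper: a bare key is treated as a flag (True), and any
    other value is kept as a list of strings split on ','.
    """
    info = OrderedDict()
    if info_str == ".":
        return info  # missing INFO column
    for entry in info_str.split(";"):
        key, sep, raw = entry.partition("=")
        if not sep:
            info[key] = True  # no '=', so there is no value to split
        else:
            info[key] = raw.split(",")
    return info
```

Values stay as strings in this sketch because, without a header definition, the Type is unknown.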
If something is supposed to be a list and it is None, make it the empty list; if it is just one element, make it a one-element list.
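The rule above could be captured in a small helper along these lines. The name `as_list` is hypothetical. Strings are deliberately treated as single values rather than sequences.

```python
def as_list(value):
    """Normalize a value that is supposed to be a list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    # Any other value, including a string, becomes a one-element list.
    return [value]
```

For example, `as_list("Test")` gives `["Test"]` rather than a list of characters.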
Currently "PASS" is changed to a "." and the writer will output "." instead of "PASS". These two things have different meanings according to the VCF 4.1 spec (see below). In addition, this change makes it much harder for users to find significant differences using diff between a VCF file pre and post filtering with PyVCF.
According to the VCF 4.1 spec:
PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g. “q10;s50” might indicate that at this site the quality is below 10 and the number of samples with data is below 50% of the total number of samples. “0” is reserved and should not be used as a filter String. If filters have not been applied, then this field should be set to the missing value. (String, no white-space or semi-colons permitted).
...
In all cases, missing values are specified with a dot (“.”)
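One way to keep the two meanings apart is sketched below. The in-memory representation is an assumption chosen for this example: None for "filters not applied", an empty list for PASS, and a list of codes otherwise. Neither function is taken from PyVCF.

```python
def parse_filter(field):
    """Read the FILTER column, keeping '.' and 'PASS' distinct."""
    if field == ".":
        return None  # filters have not been applied
    if field == "PASS":
        return []  # passed all filters
    return field.split(";")


def format_filter(filters):
    """Write the FILTER column as a semicolon-separated string."""
    if filters is None:
        return "."
    if not filters:
        return "PASS"
    return ";".join(filters)
```

Because `format_filter` joins the codes with ";", filters a and b are written as "a;b" rather than as the repr of a Python list, and "PASS" and "." both survive a read/write round trip unchanged.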
(1) INFO values that are strings are being truncated to their first letter, because strings are sequences just like lists.
E.g. when STR=Test, what is stored is STR=T.
(2) When INFO.Number > 1 or INFO.Number is ".", "A" or "G", the value should be a list according to the VCF 4.1 format described below. (Note: not sure if it should be a list when the length is 1 or 0, nor whether lists for "A" and "G" should always be padded with Nones to the correct length.)
(3) Also, when INFO.Type == "Flag" the output was "KEY=True;" instead of just "KEY;".
(4) It would be nice if the KEY/VALUE pairs were output in the order in which they are read, so that it is easier for users to see significant changes when doing a diff between an input VCF file and one that is output after filtering/modification. This applies to everything, not just INFO fields, so I will create a separate issue for ordering.
According to VCF format 4.1:
Possible Types for INFO fields are: Integer, Float, Flag, Character, and String.
The Number entry is an Integer that describes the number of values that can be included with the INFO field. For example, if the INFO field contains a single number, then this value should be 1; if the INFO field describes a pair of numbers, then this value should be 2 and so on. If the field has one value per alternate allele then this value should be 'A'; if the field has one value for each possible genotype (more relevant to the FORMAT tags) then this value should be 'G'. If the number of possible values varies, is unknown, or is unbounded, then this value should be '.'. The 'Flag' type indicates that the INFO field does not contain a Value entry, and hence the Number should be 0 in this case. The Description value must be surrounded by double-quotes. Double-quote character can be escaped with backslash (\") and backslash as \\.
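The writing side of points (1) to (3) might look like the sketch below, assuming flags are stored as True, multi-valued fields as lists, and missing values as None. `format_info` is a hypothetical helper, not the PyVCF writer.

```python
from collections import OrderedDict


def format_info(info):
    """Serialize an ordered INFO mapping back into the INFO column."""
    parts = []
    for key, value in info.items():
        if value is True:
            parts.append(key)  # flag: key only, no '=True'
        elif isinstance(value, (list, tuple)):
            joined = ",".join("." if v is None else str(v) for v in value)
            parts.append("%s=%s" % (key, joined))
        else:
            # Scalars, including whole strings, are written as-is.
            parts.append("%s=%s" % (key, "." if value is None else value))
    return ";".join(parts) if parts else "."


example = OrderedDict([("ASP", True), ("STR", "Test"), ("AF", [0.5, None])])
line = format_info(example)
```

With the example mapping, `line` is "ASP;STR=Test;AF=0.5,.". The string value is written whole, and the key order follows the order of insertion.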
Ability to write sample-less VCFs like feat01.vcf (below).
I.e., only the 8 columns CHROM through INFO: drop FORMAT in the header and in the record rows. To know whether you have a sample-less VCF, check the length of the .samples variable:
import vcf

vin = vcf.Reader(open("dbsnp.vcf"))
if len(vin.samples) == 0:
    # Sample-less VCF: do not output the FORMAT column.
    pass
##fileformat=VCFv4.1
##FILTER=<ID=NC,Description="Inconsistent Genotype Submission For At Least One Sample">
##INFO=<ID=ASP,Number=0,Type=Flag,Description="Is Assembly specific. This is set if the snp only maps to one assembly">
##dbSNP_BUILD_ID=132
##fileDate=20110320
##phasing=partial
##reference=GRCh37
##source=dbSNP
##variationPropertyDocumentationUrl=ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 10433 rs56289060 A AC . PASS ASP
chr1 10439 rs112766696 AC A . PASS ASP
chr1 10519 rs62636508 G C . PASS ASP
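A minimal sketch of the requested writer behaviour, assuming the writer knows the list of sample names: FORMAT and the sample columns are emitted only when that list is non-empty. `header_line` and `record_line` are hypothetical helpers, not PyVCF functions.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def header_line(samples):
    """Build the '#CHROM ...' line; FORMAT appears only with samples."""
    cols = list(FIXED_COLUMNS)
    if samples:
        cols.append("FORMAT")
        cols.extend(samples)
    return "#" + "\t".join(cols)


def record_line(fixed_fields, fmt=None, sample_fields=()):
    """Build one tab-separated record; sample-less records have 8 columns."""
    fields = list(fixed_fields)
    if fmt is not None:
        fields.append(fmt)
        fields.extend(sample_fields)
    return "\t".join(fields)


row = record_line(["chr1", "10433", "rs56289060", "A", "AC", ".", "PASS", "ASP"])
```

For a file like the one above, `header_line([])` ends at INFO and `row` contains exactly 8 tab-separated fields.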