pyvcf's People

Contributors

arq5x, brentp, ian1roberts, jdoughertyii, mandawilson, martijnvermaat

pyvcf's Issues

Preserve order of all unique ##LABELs

Preserve the order of all unique ##LABEL lines (where LABEL is any meta-information key), so that a diff between input and output shows only meaningful changes.

This is different from mandawilson / PyVCF Issue #3, which will just preserve the order within metadata, info, format and filters, but not between them.
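A minimal sketch of the idea, assuming the header is available as a list of raw lines. The helper name `header_label_order` is hypothetical and not part of PyVCF; it only records the order in which each ##LABEL is first seen, which a writer could then use to emit the groups in the same order.

```python
from collections import OrderedDict


def header_label_order(lines):
    """Return the unique ## labels in first-seen order (hypothetical helper)."""
    order = OrderedDict()
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM header line
        label = line[2:].split("=", 1)[0]
        order.setdefault(label, None)
    return list(order)


header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth">',
    '##FILTER=<ID=q10,Description="Quality below 10">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
labels = header_label_order(header)
```

Here `labels` lists fileformat first, then INFO, then FILTER, matching the input order between groups.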

if INFO.Number or FORMAT.Number equal A or G they become None

When there is a non-numeric value in INFO.Number or FORMAT.Number it is lost and replaced with None. But VCF 4.1 allows A and G (this is partially done in the code). Also make sure that a "." is read and written correctly (if we want it to be stored as None, make sure it is output as ".").

VCF 4.1:

The Number entry is an Integer that describes the number of values that can be included with the INFO field. For example, if the INFO field contains a single number, then this value should be 1; if the INFO field describes a pair of numbers, then this value should be 2 and so on. If the field has one value per alternate allele then this value should be 'A'; if the field has one value for each possible genotype (more relevant to the FORMAT tags) then this value should be 'G'. If the number of possible values varies, is unknown, or is unbounded, then this value should be '.'. The 'Flag' type indicates that the INFO field does not contain a Value entry, and hence the Number should be 0 in this case.
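One possible way to keep A and G and still round-trip ".", sketched with two hypothetical helpers (`parse_number` and `format_number` are illustrative names, not PyVCF functions). The sketch assumes "." is stored as None, as suggested above.

```python
def parse_number(text):
    """Parse a Number entry: '.' -> None, 'A'/'G' kept as-is, otherwise int."""
    if text == ".":
        return None
    if text in ("A", "G"):
        return text
    return int(text)


def format_number(value):
    """Inverse of parse_number: None is written back as '.'."""
    return "." if value is None else str(value)


# Every value should survive a read/write round trip unchanged.
round_trip = [format_number(parse_number(t)) for t in ("0", "2", "A", "G", ".")]
```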

assumes 1 value per metadata key

It expects things like this:

assembly=url

But now we have things like multiple contigs, alts, and samples. These specific examples should be parsed in the way that the INFO and FORMAT fields are, but in the meantime we should at least allow multiple values for a:

KEY=

KEY=

E.g.:

ALT=<ID=type1,Description=description1>

ALT=<ID=type2,Description=description2>

When a key has only one value, it might be nicer to store the value itself rather than a list of size 1.
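A rough sketch of collecting repeated keys, assuming the lines have already been split into (key, value) pairs. `collect_metadata` is a hypothetical name. This version always stores a list; unwrapping single values would be a separate step.

```python
from collections import OrderedDict


def collect_metadata(pairs):
    """Group metadata values by key, keeping first-seen key order."""
    meta = OrderedDict()
    for key, value in pairs:
        meta.setdefault(key, []).append(value)
    return meta


pairs = [
    ("assembly", "url"),
    ("ALT", "<ID=type1,Description=description1>"),
    ("ALT", "<ID=type2,Description=description2>"),
]
meta = collect_metadata(pairs)
```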

fix greedy parsing of key=value

In:

KEY=<ID=X,Description="Testing multiple keys in metadata (like multiple contigs)">

The key will be 'KEY=<ID=X,Description' and the value is '"Testing multiple keys in metadata (like multiple contigs)">'.
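A possible fix is to split on the first "=" only. The sketch below uses `str.partition`, which stops at the first separator; `split_meta_line` is a hypothetical helper, and it assumes the line still carries its leading "##".

```python
def split_meta_line(line):
    """Split '##KEY=value' at the first '=' only (hypothetical helper)."""
    key, _, value = line[2:].partition("=")
    return key, value


line = '##KEY=<ID=X,Description="Testing multiple keys in metadata (like multiple contigs)">'
key, value = split_meta_line(line)
```

With this, the key is 'KEY' and the value is the whole '<ID=X,...>' structure.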

Error when INFO flag key not defined in meta-information section

In PGM tools not all INFO keys are defined in the meta-information section. It turns out they don't have to be defined there, even though it is strongly encouraged. The code assumes they are, and this breaks when there is an undefined flag field in the INFO section.

From VCF 4.1 format specification: It is strongly encouraged that information lines describing the INFO, FILTER and FORMAT entries used in the body of the VCF file be included in the meta-information section. Although they are optional, if these lines are present then they must be completely well-formed.

cat test/pgm_tools.vcf | ./vcfIdentity.py
Traceback (most recent call last):
  File "./vcfIdentity.py", line 11, in <module>
    for rec in vcfin:
  File "/home/wilson/PyVCF/vcf/parser.py", line 702, in next
    info = self._parse_info(row[7])
  File "/home/wilson/PyVCF/vcf/parser.py", line 594, in _parse_info
    val = entry[1].split(',')
IndexError: list index out of range
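The IndexError comes from indexing `entry[1]` when an entry has no "=". A sketch of a tolerant parser follows. It is not the actual `_parse_info` code: `parse_info` is a hypothetical stand-in that treats any bare key as a flag and leaves all values as strings.

```python
def parse_info(info_str):
    """Parse an INFO column without relying on header definitions."""
    info = {}
    if info_str == ".":
        return info  # missing INFO column
    for entry in info_str.split(";"):
        key, sep, value = entry.partition("=")
        if not sep:
            info[key] = True  # bare key with no '=': treat as a flag
        else:
            info[key] = value.split(",")
    return info


parsed = parse_info("DP=10;UNDEFINED_FLAG;AF=0.5,0.25")
```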

Lists should always be lists

If something is supposed to be a list and it is None, make it the empty list; if it is a single element, make it a one-element list.
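A small normalizer along these lines could be applied wherever a list is expected. `as_list` is a hypothetical helper name. Note that a string counts as a single element, so it is wrapped rather than split into characters.

```python
def as_list(value):
    """Normalize None, a scalar, or a sequence into a list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]  # scalars, including strings, become one-element lists
```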

if FILTER == "PASS", don't change to "."

Currently "PASS" is changed to a "." and the writer will output "." instead of "PASS". These two things have different meanings according to the VCF 4.1 spec (see below). In addition, this change makes it much harder for users to find significant differences using diff between a VCF file pre and post filtering with PyVCF.

According to the VCF 4.1 spec:

PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g. “q10;s50” might indicate that at this site the quality is below 10 and the number of samples with data is below 50% of the total number of samples. “0” is reserved and should not be used as a filter String. If filters have not been applied, then this field should be set to the missing value. (String, no white-space or semi-colons permitted).
...

In all cases, missing values are specified with a dot (".")
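One way to keep the two cases apart is to use distinct in-memory values. The sketch below assumes None means "filters not applied" and an empty list means "passed all filters". That representation is an assumption for illustration, and `parse_filter` and `format_filter` are hypothetical names.

```python
def parse_filter(text):
    """Parse the FILTER column, keeping 'PASS' and '.' distinct."""
    if text == ".":
        return None  # filters have not been applied
    if text == "PASS":
        return []  # passed all filters
    return text.split(";")


def format_filter(value):
    """Inverse of parse_filter."""
    if value is None:
        return "."
    if not value:
        return "PASS"
    return ";".join(value)


round_trip = [format_filter(parse_filter(t)) for t in (".", "PASS", "q10;s50")]
```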

INFO fields not read/written correctly

(1) INFO values that are strings are being truncated to the first letter (since strings are sequences, like lists).

E.g. when STR=Test, what is stored is STR=T.

(2) When INFO.Number > 1 or INFO.Number == "." or INFO.Number == "A" or INFO.Number == "G" the value should be a list according to the VCF 4.1 format described below. (Note: not sure if it should be a list when the length is 1 or 0; also not sure whether lists for "A" and "G" should always be padded to the correct length with Nones.)

(3) Also when INFO.Type == "Flag" the output was "KEY=True;" instead of just "KEY;".

(4) It would be nice if the KEY/VALUE pairs were output in the order in which they are read so that it is easier for users to see significant changes when doing a diff between an input vcf file and one that is output after filtering/modification. This applies to everything, not just INFO fields so I will create a separate issue for ordering.

According to VCF format 4.1:

INFO=<ID=ID,Number=number,Type=type,Description="description">

Possible Types for INFO fields are: Integer, Float, Flag, Character, and String.

The Number entry is an Integer that describes the number of values that can be included with the INFO field. For example, if the INFO field contains a single number, then this value should be 1; if the INFO field describes a pair of numbers, then this value should be 2 and so on. If the field has one value per alternate allele then this value should be 'A'; if the field has one value for each possible genotype (more relevant to the FORMAT tags) then this value should be 'G'. If the number of possible values varies, is unknown, or is unbounded, then this value should be '.'. The 'Flag' type indicates that the INFO field does not contain a Value entry, and hence the Number should be 0 in this case. The Description value must be surrounded by double-quotes. Double-quote character can be escaped with backslash (\") and backslash as \\.
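Points (1), (3) and (4) can be illustrated on the writing side with a short sketch. `format_info` is a hypothetical function, not the PyVCF writer; it assumes flags are stored as True, multi-valued fields as lists, and that the mapping preserves insertion order.

```python
from collections import OrderedDict


def format_info(info):
    """Serialize an ordered INFO mapping back into a VCF INFO column."""
    parts = []
    for key, value in info.items():
        if value is True:
            parts.append(key)  # flags are written as the bare key
        elif isinstance(value, (list, tuple)):
            items = ("." if v is None else str(v) for v in value)
            parts.append(key + "=" + ",".join(items))
        else:
            parts.append("%s=%s" % (key, value))  # strings are kept whole
    return ";".join(parts) if parts else "."


info = OrderedDict([("STR", "Test"), ("DB", True), ("AF", [0.5, None])])
column = format_info(info)
```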

Write VCF with no samples

Add the ability to write sample-less VCFs like feat01.vcf (below), i.e. only the 8 columns CHROM through INFO, with FORMAT dropped from both the header line and the record rows. To detect a sample-less VCF, check the length of the reader's .samples attribute:

import vcf

vin = vcf.Reader(open("dbsnp.vcf"))

if len(vin.samples) == 0:
    pass  # sample-less VCF: do not output the FORMAT column

##fileformat=VCFv4.1
##FILTER=<ID=NC,Description="Inconsistent Genotype Submission For At Least One Sample">
##INFO=<ID=ASP,Number=0,Type=Flag,Description="Is Assembly specific. This is set if the snp only maps to one assembly">
##dbSNP_BUILD_ID=132
##fileDate=20110320
##phasing=partial
##reference=GRCh37
##source=dbSNP
##variationPropertyDocumentationUrl=ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr1    10433   rs56289060  A   AC  .   PASS    ASP
chr1    10439   rs112766696 AC  A   .   PASS    ASP
chr1    10519   rs62636508  G   C   .   PASS    ASP
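A sketch of how a writer might build the column header line for both cases. `column_header` and `FIXED_COLUMNS` are hypothetical names, and the same check on the sample list would also decide whether record rows get a FORMAT column.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def column_header(samples):
    """Build the #CHROM line; FORMAT is added only when samples exist."""
    columns = list(FIXED_COLUMNS)
    if samples:
        columns += ["FORMAT"] + list(samples)
    return "#" + "\t".join(columns)


no_samples = column_header([])
with_samples = column_header(["NA00001", "NA00002"])
```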
