bihealth / vcfpy Goto Github PK

View Code? Open in Web Editor NEW

85.0 6.0 18.0 1.28 MB

Python 3 library with good support for both reading and writing VCF

License: MIT License

Makefile 0.09% Python 99.74% TeX 0.17%

bioinformatics vcf parsing writing file-format

vcfpy's People

Contributors

Stargazers

Watchers

Forkers

davidrajones robertopreste tneymanov bapleliu jhl667 yeemey vadim-kor k8iechen farnazmdi hylkedonker khurrammaqbool 1511878618 genostack likhitha-surapaneni vivianleung effigies nakib103

vcfpy's Issues

Couldn't parse compressed vcf file (vcf.bgz)

vcfpy version: 0.13.4
Python version: 3.10.0
Operating System: Mac OS

Description

I was trying to parse a vcf.bgz file but got a UnicodeDecodeError.

What I Did

vcfpy.Reader.from_path(path)

Traceback (most recent call last):
  File "/Users/sai/Desktop/Elucidata/scripts/vcf_support/parse_vcfpy.py", line 38, in <module>
    parse_vcf(path)
  File "/Users/sai/Desktop/Elucidata/scripts/vcf_support/parse_vcfpy.py", line 19, in parse_vcf
    vcf_reader = vcfpy.Reader.from_path(path)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/vcfpy/reader.py", line 94, in from_path
    return klass.from_stream(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/vcfpy/reader.py", line 60, in from_stream
    return Reader(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/vcfpy/reader.py", line 119, in __init__
    self.parser = parser.Parser(stream, self.path, self.record_checks)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/vcfpy/parser.py", line 701, in __init__
    self._line = stream.readline()  # trailing '\n'
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Add eq functions where applicable

Otherwise, we just have object identity which is not enough.

Installation Issue

vcfpy version: N/A
Python version: 3.6.0
Operating System: Windows 10
Anaconda Version: 4.3.1 (64 bit)

I am unable to install vcfpy onto multiple systems, is there an issue with anaconda or windows and this package?

What I Did

I have attempted installation across multiple systems, all which culminate into this issue.

Does add_format function work with multisample vcf file ?

I'm using vcfpy to add an extra field in FORMAT of each record of a multisamples vcf file. From the document, i think that the function add_format would help me to do that. However, it keeps adding the same values for every sample. How can i add different value for each sample?

FORMAT: list or tuple?

vcfpy version: 0.13.2
Python version: 3.6
Operating System: Centos 7

Description

When record.FORMAT is empty it is a tuple but when a vcf containing calls and FORMAT columns is imported FORMAT it turns into a list.

An empty FORMAT object cannot be filled using record.add_format(). An exception is thrown: AttributeError: 'tuple' object has not attribute 'append'

if a key is to be added to an empty FORMAT using record.add_format(), it must first be converted into a list:

record.FORMAT = []
record.add_format('GT')

write_record() error: TypeError: 'int' object is not iterable

vcfpy version: '0.13.3'
Python version: 3.8.2
Operating System: macOS Catalina, Version 10.15.7

Hi,

I added vcf records from multiple vcfs into a list and when I try to write each record I get the following error:

for record in all_records:
    print(record)
    writer.write_record(record)

Traceback (most recent call last):
  File "<input>", line 3, in <module>
  File "/Users/okcb/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/vcfpy/writer.py", line 130, in write_record
    self._serialize_record(record)
  File "/Users/okcb/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/vcfpy/writer.py", line 144, in _serialize_record
    row.append(f(self._serialize_info(record)))
  File "/Users/okcb/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/vcfpy/writer.py", line 161, in _serialize_info
    result.append("{}={}".format(key, format_value(info, value, "INFO")))
  File "/Users/okcb/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/vcfpy/writer.py", line 48, in format_value
    return ",".join(map(lambda x: format_atomic(x, section), value))
TypeError: 'int' object is not iterable

a record in all_records looks like this:

Record('1', 2488153, [], 'A', [Substitution(type_='SNV', value='G')], None, ['m2'], {'m2_DP': 1379, 'm2_ECNT': 3, 'm2_POP_AF': [5e-08], 'm2_TLOD': [2137.76]}, ['m2_GT', 'm2_AD', 'm2_AF', 'm2_DP', 'm2_F1R2', 'm2_F2R1', 'm2_MBQ', 'm2_MFRL', 'm2_MMQ', 'm2_MPOS', 'm2_ORIGINAL_CONTIG_MISMATCH', 'm2_SA_MAP_AF', 'm2_SA_POST_PROB'], [Call('20KD-085P0001-1', {'m2_GT': '0/1', 'm2_AD': [672, 656], 'm2_AF': [0.492], 'm2_DP': 1328, 'm2_F1R2': [250, 244], 'm2_F2R1': [422, 412], 'm2_MBQ': [36, 36], 'm2_MFRL': [184, 181], 'm2_MMQ': [60], 'm2_MPOS': [25], 'm2_ORIGINAL_CONTIG_MISMATCH': 0, 'm2_SA_MAP_AF': [0.495, 0.475, 0.494], 'm2_SA_POST_PROB': [0.015, 0.009728, 0.975]})])

Any ideas on why this is happening?

Thanks,
O.K

Add support for writing bgzf files

How fast is this compared to other libraries?

Not a real issue but was curious: How fast is this compared to other libraries like PyVCF, cyvcf2?

Add indices for header line access in VCFHeader

This will speed up the writing.

We will need methods for adding headers to implement this properly.

( never mind, please delete, sorry! )

ModuleNotFoundError: No module named 'pysam'

vcfpy version: 0.13.4
Python version: 3.10.4
Operating System: Linux (Ubuntu 22.04)

Description

Try to install and use vcfpy.
Install successful,
use failed.

I think requirements are not processed correctly when using pip

What I Did

$ pip install vcfpy
Collecting vcfpy
  Downloading vcfpy-0.13.4.tar.gz (1.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 7.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: vcfpy
  Building wheel for vcfpy (setup.py) ... done
  Created wheel for vcfpy: filename=vcfpy-0.13.4-py2.py3-none-any.whl size=34590 sha256=409354257971197c313d3e7f84b9c8f1e23bda33d78121f4b3efd1662d40488e
  Stored in directory: /home/victor/.cache/pip/wheels/fb/f8/c3/4f2a584f46d0a628bd41c17311e38496eea7b3f8c5ef4309d2
Successfully built vcfpy
Installing collected packages: vcfpy
Successfully installed vcfpy-0.13.4

But when using vcfpy

$ python
Python 3.10.4 (main, Jun 29 2022, 12:14:53) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import vcfpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/victor/Documents/python/algemeen/venv/lib/python3.10/site-packages/vcfpy/__init__.py", line 46, in <module>
    from .reader import Reader
  File "/home/victor/Documents/python/algemeen/venv/lib/python3.10/site-packages/vcfpy/reader.py", line 8, in <module>
    import pysam
ModuleNotFoundError: No module named 'pysam'

$ pip list | grep pysam
does indeed show pysam is not installed.

Checking dependency
$ python -m pip check
does not show pysam is missing

In ' base/requirement.txt'
pysam is mentioned
I think pip is not using it.

Workaround:
manually install pysam
$ pip install pysam

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.

Benchmark code and tweak parsing and writing

Allow jumping by cordinate string

vcfpy fails searching FT field in header

vcfpy version: latest dev branch
Python version: 3.4.3
Operating System: centos7

Description

vcfpy claims that the flag set in FT field can't be found in the header though it exists. Obviously vcfpy searches for single letters instead of the whole word.

Error message:

vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: M; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: i; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: n; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: G; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: q; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: a; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: x; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: A; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: f; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: H; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: e; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: t; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: o; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: m; problem in FORMAT/FT column of sample X
  UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: l; problem in FORMAT/FT column of sample X
  UnknownFilter)

Set flags in FT in this file:

MaxAafHet
MaxAafHet;MinGq
MinAafHomAlt
MinAafHomAlt;MinGq
MinGq

Taking just the letters in a set and printing them:

>>> s = 'MaxAafHetMaxAafHetMinGqMinAafHomAltMinAafHomAltMinGq'
>>> print(''.join(set(s)))
aAeGfiHMmonqtxl

Printing the letters from the error message:

>>> t = 'MinGqaxAfHetoml'
>>> print(''.join(set(t)))
aAeGfiHMmonqtxl

These are exactly the same.

What I Did

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.

Record.fetch opens and closes TabixFile on each call, which slows down fetch calls drastically

For big vcf files it makes each Record.fetch call very slow, comparing to pysam.

I suggest to not close TabixFile if it's already open.

Implement stricter parsing

I'll add to the list while I find more shortcomings

all FORMAT fields must be there
no superflous FORMAT fields allowed

Add better support for interpreting VCF header

Add %-escaping of forbidden characters

add FORMAT and calls to vcf?

vcfpy version: 0.13.2
Python version: 3.6
Operating System: Centos 7

Question

I would like to modify a vcf (from Lofreq) that does not contain FORMAT or call columns in order for it to be accepted by Picard within the bcbio-nextgen framework.

I have tried like this but it does not work. Nothing comes out:

reader = vcfpy.Reader.from_path("/path/to/sample.vcf.gz")
reader.header.samples = vcfpy.SamplesInfos(names=['D20-XXX'], name_to_idx={'D20-XXX': 0})

writer = vcfpy.Writer.from_path('/dev/stdout', reader.header)

for rec in reader:
    data = {'GT': '1/1'}
    rec.calls.append(vcfpy.Call(sample=sample, data=data))
    rec.FORMAT = []
    rec.add_format('GT')
    writer.write_record(rec)

Is there some inherent limitation that prevents adding this info? I noticed a caveat that the FORMAT field cannot be altered after modifying the SamplesInfos object.

Incidently, I have also noticed that data objects in the call instances are dicts and not ordered dicts, as the documentation says. Perhaps a remnant from earlier times?

Complete support for writing simple VCF records

Add examples to documentation

Add more examples

reading records with hard-filtering
reading records with adding attributes, in particular INFO and FORMAT
include reasoning for soft-filtering

in _serialize_record row = [record.CHROM, record.POS] AttributeError: 'str' object has no attribute 'CHROM'

vcfpy version: 0.13.6
Python version:3.7.6
Operating System:Linux

Description

I'm trying to use vcfpy yo convert a CSV file into VCF the script : import pandas as pd
import vcfpy
import sys

def csv_to_vcf(csv_file, vcf_file):

# read CSV file using pandas
df = pd.read_csv(csv_file)
# read vcf file
reader = vcfpy.Reader.from_path(vcf_file)
# create a new VCF file
vcf_writer = vcfpy.Writer.from_path(vcf_file,reader.header)

# add the header information to the VCF file


# loop through rows in the CSV file
for i, row in df.iterrows():
    # create a new record for the VCF file
    record = vcfpy.Record(
        CHROM=row['#CHROM'],
        POS=row['POS'],
        ID=".",
        REF=row['REF'],
        ALT=".",
        QUAL=".",
        FILTER=".",
        INFO=".",
        FORMAT=[])
    
    chr = row['#CHROM']
    pos = int(row['POS'])
    ref = row['REF']
    alt = row['ALT']

    chr='chr'+str(chr)
    if chr == 'chrMT':
        chr = 'chrM'
    try:
        fetched = reader.fetch(record.CHROM,pos-5,pos+5)
    
    except:
        fetched = '.|.'
    for fet in fetched:
    # add the record to the VCF file
        vcf_writer.write_record(fet)
            



# close the VCF file
vcf_writer.close()

if name == "main":
csv_file = sys.argv[1]
vcf_file = sys.argv[2]
csv_to_vcf(csv_file,vcf_file)

What I Did

python csv_to_vcf.py Cmpd_Hetro_Bathc1_morethan10test.csv results_table.DVD.vcf

where the Cmpd is the CSV file and the VCF file already has a header.

If there was a crash, please include the traceback here.

Error : 
/vcfpy/writer.py", line 135, in _serialize_record
    row = [record.CHROM, record.POS]
AttributeError: 'str' object has no attribute 'CHROM'

Suggestion: Add process_alt() to init.py and documentation

I often use vcfpy to create VCF files from non-vcf data sources, such as a spreadsheet. One difficulty with doing this has been creating AltRecords for alternative alleles. Looking through the code I found the process_alt function, which I find very helpful for this use-case. I think it would help other people looking to create VCFs without a starting VCF if this was in the documentation.

Complete support for parsing VCF

Doesn't understand PASS in filter column

vcfpy version: 0.12.0
Python version: 3.6.5
Operating System: Ubuntu Bionic

Description

PASS is a valid value for the filter column and shouldn't be required in header:

http://www.internationalgenome.org/wiki/Analysis/vcf4.0/

What I Did

UnknownFilter: Filter not found in header: PASS; problem in FILTER column
  UnknownFilter,

Thanks

Allow configuration for switching between warning and errors for format errors

Unable to install in Python 3.12

vcfpy version: 0.13.6 (latest)
Python version: 3.12
Operating System: MacOS 13.6

Description

I'm unable to install on a new 3.12 environment. The Pip error message refers to a module attribute, SafeConfigParser, that has been deprecated in 3.12. At a glance, it appears to be rooted in the Versioneer build management tool that you're using here, but they claim to have fixed this.

I know that you don't claim to support 3.12 in the setup.py file but I figured I'd flag it for you anyway in case it's an easy fix.

What I Did

[⚙ venv]~/code/retraction % pip install vcfpy     
Collecting vcfpy
  Using cached vcfpy-0.13.6.tar.gz (1.0 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [31 lines of output]
      /private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-install-xqxoql9o/vcfpy_a939bfba6aed45e0910fdbaf1c40343f/versioneer.py:432: SyntaxWarning: invalid escape sequence '\s'
        ] = '''
      Traceback (most recent call last):
        File "/Users/jss009/code/retraction/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/Users/jss009/code/retraction/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/jss009/code/retraction/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-build-env-gn2x9bq0/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-build-env-gn2x9bq0/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 325, in _get_build_requires
          self.run_setup()
        File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-build-env-gn2x9bq0/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 507, in run_setup
          super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
        File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-build-env-gn2x9bq0/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 341, in run_setup
          exec(code, locals())
        File "<string>", line 48, in <module>
        File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-install-xqxoql9o/vcfpy_a939bfba6aed45e0910fdbaf1c40343f/versioneer.py", line 1505, in get_version
          return get_versions()["version"]
                 ^^^^^^^^^^^^^^
        File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-install-xqxoql9o/vcfpy_a939bfba6aed45e0910fdbaf1c40343f/versioneer.py", line 1434, in get_versions
          cfg = get_config_from_root(root)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-install-xqxoql9o/vcfpy_a939bfba6aed45e0910fdbaf1c40343f/versioneer.py", line 346, in get_config_from_root
          parser = configparser.SafeConfigParser()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      AttributeError: module 'configparser' has no attribute 'SafeConfigParser'. Did you mean: 'RawConfigParser'?
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Separately, I tried in a fresh 3.11 environment and it worked.

Pysam UnicodeDecodeError when loading with tabixed VCF

vcfpy version: 0.13.2
Python version: 3.6.9 64bit [GCC 8.4.0]
Operating System: Linux 4.15.0 1093 oem x86_64 with Ubuntu 18.04 bionic

Description

When I fetch variants by contig ID I get the following UnicodeDecodeError demosntrating some issues when parsing the tabix file. Maybe the issue comes from pysam, but I would like to know if you have had previous reports based on this issue.

What I Did

Tabix VCF file

$ tabix -p vcf <vcf_filepath>

reader = vcfpy.Reader.from_path(path=DATA['annot_vcf'], tabix_path=DATA['annot_vcf']+'.tbi')
for record in reader.fetch('chr1'):
    [...]

Traceback Error

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-38-046818a3e579> in <module>
      3 variant_records = []
      4 sample_records = []
----> 5 for record in reader.fetch('chr1'):
      6     if record.CHROM in wanted_chroms:
      7         ALT = record.ALT[0].value

/usr/local/lib/python3.6/dist-packages/vcfpy/reader.py in __next__(self)
    171         """
    172         if self.tabix_iter:
--> 173             return self.parser.parse_line(str(next(self.tabix_iter)))
    174         else:
    175             result = self.parser.parse_next_record()

pysam/libctabix.pyx in pysam.libctabix.TabixIterator.__next__()

pysam/libcutils.pyx in pysam.libcutils.charptr_to_str()

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2821: ordinal not in range(128)

Complete support for breakends and single breakends

Unknown Filter `PASS`

vcfpy version: 0.11.2
Python version: 3.7
Operating System: Ubuntu 20.04

Description

Parsing a file.

What I Did

Copy/pasted the code from https://vcfpy.readthedocs.io/en/stable/examples.html#reading-vcf-files

What I get

.../vcfpy/parser.py:495: UnknownFilter: Filter not found in header: PASS; problem in FILTER column

I think that PASS is simply the VCF way of saying that all filters passed, so I think this should not be considered an unknown filter.

Use context managers to making reader and writer close-able

Escape characters

vcfpy version: 0.13.3
Python version: 3.6.1
Operating System: Linux (Ubuntu)

Description

I'm having a hard time understanding why there is a mapping of escape characters, because when I write a vcf if I have some % in my fields, this gets changed to %25 instead of remaining as % like in the original. And I believe this can cause conflicts when using tools on the VCF with the field formatted like %25.

What I Did

I ran this custom function to split the VCF in samples.

def split_vcf(vcf_path):
    reader_main = vcfpy.Reader.from_path(vcf_path)
    samples = list(set([x.replace('NORMAL.', '').replace('TUMOR.', '') for x in reader_main.header.samples.names]))
    reader_main.close()
    for sample in samples:
        reader_1 = vcfpy.Reader.from_path(vcf_path)
        reader_2 = vcfpy.Reader.from_path(vcf_path)
        header = reader_2.header
        header.samples.names = [x for x in header.samples.names if x in [f'NORMAL.{sample}', f'TUMOR.{sample}']]
        header.samples.name_to_idx = {x: header.samples.name_to_idx[x] for x in [f'NORMAL.{sample}', f'TUMOR.{sample}']}
        writer = vcfpy.Writer.from_path('./{}.vcf'.format(sample), header)
        for record in reader_1:
            if any(sample == s for s in record.INFO['set'].split('-')):
                writer.write_record(record)
        writer.close()

Add support for jumping in tabix-indexed VCF files

VCF without FORMAT and sample columns causes error

vcfpy version: 0.11.2
Python version: 3.5.0
Operating System: Mac OSX Yosemite 10.10.5

Description

According to the VCF specs (https://samtools.github.io/hts-specs/VCFv4.2.pdf) it is valid to have a VCF without a FORMAT and sample columns. However, this causes an error when parsing the individual variant entries.

What I Did

$ cat tests/test_data/input.no_sample.vcf 
##fileformat=VCFv4.2
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
22	18644673	.	C	T	.	.	.

$python
>>> import vcfpy
>>> vcf_reader = vcfpy.Reader.from_path('tests/test_data/input.no_sample.vcf')
>>> next(vcf_reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ssiebert/miniconda2/envs/pvactools/lib/python3.5/site-packages/vcfpy/reader.py", line 170, in __next__
    result = self.parser.parse_next_record()
  File "/Users/ssiebert/miniconda2/envs/pvactools/lib/python3.5/site-packages/vcfpy/parser.py", line 792, in parse_next_record
    return self.parse_line(self._read_next_line())
  File "/Users/ssiebert/miniconda2/envs/pvactools/lib/python3.5/site-packages/vcfpy/parser.py", line 783, in parse_line
    return self._record_parser.parse_line(line)
  File "/Users/ssiebert/miniconda2/envs/pvactools/lib/python3.5/site-packages/vcfpy/parser.py", line 457, in parse_line
    format_ = arr[8].split(':')
IndexError: list index out of range

Inconsistent GATK entries causing vcfpy.Reader to break

vcfpy version: 0.11.0
Python version: 3.5.4
Operating System: Ubuntu 16.04 LTS

Description

I'm trying to read a vcf file that was generated by GATK 3.8. GATK is not annotating according to VCF format specifications. It leaves empty fields in the FORMAT column empty and does not write out separators, e.g.:

GT:AD:DP:FT:GQ:PL       ./.:.:.:MinAafHomAlt;MinCovHomAlt;MinGq

What I Did

The above mentioned line triggers the following error:

Traceback (most recent call last):
  File "ped_into_vcf.py", line 136, in <module>
    sys.exit(main())
  File "ped_into_vcf.py", line 132, in main
    run(args)
  File "ped_into_vcf.py", line 116, in run
    extend_vcf(args.vcf_file, pedigree)
  File "ped_into_vcf.py", line 108, in extend_vcf
    writer.write_record(record)
  File "/vol/local/data/miniconda3/lib/python3.5/site-packages/vcfpy/writer.py", line 131, in write_record
    self._serialize_record(record)
  File "/vol/local/data/miniconda3/lib/python3.5/site-packages/vcfpy/writer.py", line 148, in _serialize_record
    for s in self.header.samples.names]
  File "/vol/local/data/miniconda3/lib/python3.5/site-packages/vcfpy/writer.py", line 148, in <listcomp>
    for s in self.header.samples.names]
  File "/vol/local/data/miniconda3/lib/python3.5/site-packages/vcfpy/writer.py", line 170, in _serialize_call
    for key in format_]
  File "/vol/local/data/miniconda3/lib/python3.5/site-packages/vcfpy/writer.py", line 170, in <listcomp>
    for key in format_]
KeyError: 'GQ'

Bug in is_variant

    @property
    def is_variant(self):
        """Return ``True`` for non-hom-ref calls"""
        return self.gt_type != HOM_REF

should be

    @property
    def is_variant(self):
        """Return ``True`` for non-hom-ref calls"""
        return not gt_type

vcfpy writer.write_record write string with "comma (,)" into "%2C"

vcfpy version: vcfpy 0.13.6
Python version: 3.8.5
Operating System: Linux

Description

I want to add self defined Filter into my vcf column (such as LowQual etc)

What I Did

## read a vcf 
vcf_reader= vcfpy.Reader.from_path(vcfInput)

## add Filter-Header

## vcf output
vcf_writer= vcfpy.Writer.from_path(newOutput+'/'+ name+'.filtered.vcf',vcf_reader.header)  

## add filter in each variant e.g if QUAL < 30: filter= "LowQual"
## write variant in vcf output
vcf_writer.write_record(record)

Solution:
maybe instead of:

",".join(...)

use
','.join(...)

Make default header Type and Number known to vcfpy

Add Type and Number constants known for standardized and common header line types.

PyVCF can be used for inspiration.

Rename identifiers

vcfpy.Reader.from_file => vcfpy.Reader.from_stream
vcfpy.Writer.from_file => vcfpy.Writer.from_stream
reorder identifiers in vcfpy.Writer.from_path, make path go first

[BUG] INFO flag raises TypeError

vcfpy version: master branch
Python version: 3.9.7
Operating System: Ubuntu 21.10

Description

I have VCF files that generate a

TypeError: argument of type 'bool' is not iterable

when reading the lines. Closer inspection reveals that the problem is in the INFO field. More specifically, when there is a flag in the INFO field, but the header specifies otherwise, a TypeError is generated in convert_field_value(type_, value).

What I Did

Here is a minimum reproducing example:

import io
import vcfpy

string_buffer = io.StringIO(r"""##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=MH,Number=1,Type=String,Description="Microhomology">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	FOO	BAR
1	4798729	.	CT	C	621	PASS	MH	GT	0/0:25	0/1:64
""")

with vcfpy.Reader.from_stream(string_buffer) as reader:
    for line in reader:
        print(line)

Solution

I think a quick solution would be to adjust parse_field_value to detect that a flag, instead of key-value pair, is passed and override behaviour specified in the header.
I can make a pull request.

Adding of lines to header

Proper, tested adding of lines to header, must update index data structures

Add feature for parsing only genotypes for some individuals

Do it in a similar fashion as htslib and cyvcf2 do.

Problem with colons when writing INFO field value strings

vcfpy version: 0.12.2
Python version: 3.7
Operating System: Ubuntu bionic

Description

When writing the VCF record, the colon [:] is escaped to %3A in the INFO field value strings which is causing a problem for us. According to the VCF specification, colons are not excluded eg for v4.3,
8. INFO — additional information: (String, no semi-colons or equals-signs permitted; commas are permitted only as delimiters for lists of values; characters with special meaning can be encoded using the percent encoding, see Section 1.2; space characters are allowed)

Help with resolving this issue would be very much appreciated.

Help subsetting samples

vcfpy version:
0.12.1
Python version:
3.6.3
Operating System:
ubuntu 18.04

Description

I need to subset samples from a large VCF with approximately 100.000 samples. As the extraction is the only part of the workflow outside python it would be very nice to be able to use vcfpy for this. I read the docs and also tried to deduce how the package works from the source. However, I am coming up short and would very much appreciate some help understanding how the subsetting of samples works.

What I Did

import vcfpy
reader = vcfpy.Reader.from_path()

subset_samples = sel = ['sample123', 'sample124']
reader = vcfpy.Reader.from_path('myfile.vcf.gz', parsed_samples=subset_samples)

reader.parsed_samples returns the samples chosen to be parsed in the input:
['sample123', 'sample124']

When iterating over the records in the reader object:
for record in reader: record

I get:

<vcfpy.record.UnparsedCall object at 0x7f065fd30f98>

So I am assuming that the parser respects my list of samples to be parsed. However I am struggling with getting the calls for only these samples.

My goal is to extract all markers, but only for a few samples.

Any help would be very much appreciated!

VCF Writer seems not to write header

vcfpy version: 0.12.1
Python version: 3.7
Operating System: macOS

When attempting to do a roundtrip with a reader and a writer, I run into an IncorrectVCFFormat exception complaining about the header line.

Here is where I create the writer (I'm writing to a buffer rather than a file):

with io.StringIO() as vcf_stream:
    with vcfpy.Reader.from_path(self.template_vcf) as template_vcf_reader:
        vcf_writer = vcfpy.Writer.from_stream(vcf_stream, header=template_vcf_reader.header)

    ...

Then I do some subsequent write_record calls (which do not come from the template; the template VCF just contains headers).

Finally, I read the buffer with (effectively) vcfpy.Reader.from_stream(vcf_stream)

Unfortunately, this results in the aforementioned exception:

vcfpy.exceptions.IncorrectVCFFormat: Missing line starting with "#CHROM"

Add checking whether filters etc. are present in the VCF file on parsing

Add more comprehensive warnings on format errors

Basically, each point where we overlook a problem with VCF, we could echo a warning.

We should take care that the printed warnings are stored somewhere so there are not millions of warnings.

Remove redudant parameter in writer construction

Passing samples infos is not necessary, is already in header.

Issue with GT data

vcfpy version: 0.12.1
Python version: 3.7
Operating System: Ubuntu 18.04.2, macOS 10.14.5

Description

I integrated vcfpy into an application (HmtNote), and I use it to parse VCF files and add custom annotations. I encountered an issue regarding GT data when processing a VCF file formatted as:

#CHROM  POS	ID	REF	ALT	QUAL	FILTER	INFO	        FORMAT	 TEST1
chrMT	 73	        .	A	G	.	        PASS	AC=2;AN=2	GT:HF	 0/1:0.994

What I Did

When traversing the file using vcfpy.Reader.from_path() and manipulating each record, the following error shows up:

 File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/hmtnote/classes.py", line 502, in annotate
    for record in self.reader:
  File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/vcfpy/reader.py", line 175, in __next__
    result = self.parser.parse_next_record()
  File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/vcfpy/parser.py", line 802, in parse_next_record
    return self.parse_line(self._read_next_line())
  File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/vcfpy/parser.py", line 793, in parse_line
    return self._record_parser.parse_line(line)
  File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/vcfpy/parser.py", line 467, in parse_line
    calls = self._handle_calls(alts, format_, arr[8], arr)
  File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/vcfpy/parser.py", line 479, in _handle_calls
    call = record.Call(sample, data)
  File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/vcfpy/record.py", line 234, in __init__
    for allele in ALLELE_DELIM.split(self.data["GT"]):
TypeError: expected string or bytes-like object

I found a temporary solution and it works for in my specific case, but maybe the author might find something more suitable. The temporary fix involves replacing lines 232-234 in record.py as such:

# original lines 232-234
if self.data.get("GT", None) is not None: 
    self.gt_alleles = []
    for allele in ALLELE_DELIM.split(self.data["GT"]):

# my replacement 
if self.data.get("GT", None) is not None and len(self.data.get("GT", None)) != 0:
    self.gt_alleles = []
    for allele in ALLELE_DELIM.split(self.data["GT"][0]):

bihealth / vcfpy Goto Github PK

vcfpy's People

Contributors

Stargazers

Watchers

Forkers

vcfpy's Issues

Description

What I Did

Description

Description

What I Did

Description

What I Did

Question

Description

What I Did

Description

What I Did

Description

What I Did

Description

What I Did

Traceback Error

Description

What I Did

What I get

Description

What I Did

Description

What I Did

Description

What I Did

Description

What I Did

Description

What I Did

Solution

Description

Description

What I Did

Description

What I Did

Recommend Projects

Recommend Topics

Recommend Org