bihealth / vcfpy Goto Github PK
View Code? Open in Web Editor NEWPython 3 library with good support for both reading and writing VCF
License: MIT License
Python 3 library with good support for both reading and writing VCF
License: MIT License
I was trying to parse a vcf.bgz file but got a UnicodeDecodeError.
vcfpy.Reader.from_path(path)
Traceback (most recent call last):
File "/Users/sai/Desktop/Elucidata/scripts/vcf_support/parse_vcfpy.py", line 38, in <module>
parse_vcf(path)
File "/Users/sai/Desktop/Elucidata/scripts/vcf_support/parse_vcfpy.py", line 19, in parse_vcf
vcf_reader = vcfpy.Reader.from_path(path)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/vcfpy/reader.py", line 94, in from_path
return klass.from_stream(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/vcfpy/reader.py", line 60, in from_stream
return Reader(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/vcfpy/reader.py", line 119, in __init__
self.parser = parser.Parser(stream, self.path, self.record_checks)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/vcfpy/parser.py", line 701, in __init__
self._line = stream.readline() # trailing '\n'
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Otherwise, we just have object identity which is not enough.
I am unable to install vcfpy onto multiple systems, is there an issue with anaconda or windows and this package?
What I Did
I have attempted installation across multiple systems, all which culminate into this issue.
I'm using vcfpy to add an extra field in FORMAT of each record of a multisamples vcf file. From the document, i think that the function add_format would help me to do that. However, it keeps adding the same values for every sample. How can i add different value for each sample?
When record.FORMAT is empty it is a tuple but when a vcf containing calls and FORMAT columns is imported FORMAT it turns into a list.
An empty FORMAT object cannot be filled using record.add_format()
. An exception is thrown: AttributeError: 'tuple' object has not attribute 'append'
if a key is to be added to an empty FORMAT using record.add_format(), it must first be converted into a list:
record.FORMAT = []
record.add_format('GT')
Hi,
I added vcf records from multiple vcfs into a list and when I try to write each record I get the following error:
for record in all_records:
print(record)
writer.write_record(record)
Traceback (most recent call last):
File "<input>", line 3, in <module>
File "/Users/okcb/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/vcfpy/writer.py", line 130, in write_record
self._serialize_record(record)
File "/Users/okcb/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/vcfpy/writer.py", line 144, in _serialize_record
row.append(f(self._serialize_info(record)))
File "/Users/okcb/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/vcfpy/writer.py", line 161, in _serialize_info
result.append("{}={}".format(key, format_value(info, value, "INFO")))
File "/Users/okcb/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/vcfpy/writer.py", line 48, in format_value
return ",".join(map(lambda x: format_atomic(x, section), value))
TypeError: 'int' object is not iterable
a record in all_records
looks like this:
Record('1', 2488153, [], 'A', [Substitution(type_='SNV', value='G')], None, ['m2'], {'m2_DP': 1379, 'm2_ECNT': 3, 'm2_POP_AF': [5e-08], 'm2_TLOD': [2137.76]}, ['m2_GT', 'm2_AD', 'm2_AF', 'm2_DP', 'm2_F1R2', 'm2_F2R1', 'm2_MBQ', 'm2_MFRL', 'm2_MMQ', 'm2_MPOS', 'm2_ORIGINAL_CONTIG_MISMATCH', 'm2_SA_MAP_AF', 'm2_SA_POST_PROB'], [Call('20KD-085P0001-1', {'m2_GT': '0/1', 'm2_AD': [672, 656], 'm2_AF': [0.492], 'm2_DP': 1328, 'm2_F1R2': [250, 244], 'm2_F2R1': [422, 412], 'm2_MBQ': [36, 36], 'm2_MFRL': [184, 181], 'm2_MMQ': [60], 'm2_MPOS': [25], 'm2_ORIGINAL_CONTIG_MISMATCH': 0, 'm2_SA_MAP_AF': [0.495, 0.475, 0.494], 'm2_SA_POST_PROB': [0.015, 0.009728, 0.975]})])
Any ideas on why this is happening?
Thanks,
O.K
Not a real issue but was curious: How fast is this compared to other libraries like PyVCF, cyvcf2?
This will speed up the writing.
We will need methods for adding headers to implement this properly.
Try to install and use vcfpy.
Install successful,
use failed.
I think requirements are not processed correctly when using pip
$ pip install vcfpy
Collecting vcfpy
Downloading vcfpy-0.13.4.tar.gz (1.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 7.8 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Building wheels for collected packages: vcfpy
Building wheel for vcfpy (setup.py) ... done
Created wheel for vcfpy: filename=vcfpy-0.13.4-py2.py3-none-any.whl size=34590 sha256=409354257971197c313d3e7f84b9c8f1e23bda33d78121f4b3efd1662d40488e
Stored in directory: /home/victor/.cache/pip/wheels/fb/f8/c3/4f2a584f46d0a628bd41c17311e38496eea7b3f8c5ef4309d2
Successfully built vcfpy
Installing collected packages: vcfpy
Successfully installed vcfpy-0.13.4
But when using vcfpy
$ python
Python 3.10.4 (main, Jun 29 2022, 12:14:53) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import vcfpy
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/victor/Documents/python/algemeen/venv/lib/python3.10/site-packages/vcfpy/__init__.py", line 46, in <module>
from .reader import Reader
File "/home/victor/Documents/python/algemeen/venv/lib/python3.10/site-packages/vcfpy/reader.py", line 8, in <module>
import pysam
ModuleNotFoundError: No module named 'pysam'
$ pip list | grep pysam
does indeed show pysam is not installed.
Checking dependency
$ python -m pip check
does not show pysam is missing
In ' base/requirement.txt'
pysam is mentioned
I think pip is not using it.
Workaround:
manually install pysam
$ pip install pysam
Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.
vcfpy claims that the flag set in FT field can't be found in the header though it exists. Obviously vcfpy searches for single letters instead of the whole word.
Error message:
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: M; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: i; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: n; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: G; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: q; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: a; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: x; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: A; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: f; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: H; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: e; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: t; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: o; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: m; problem in FORMAT/FT column of sample X
UnknownFilter)
vcfpy/vcfpy/parser.py:499: UnknownFilter: Filter not found in header: l; problem in FORMAT/FT column of sample X
UnknownFilter)
Set flags in FT in this file:
MaxAafHet
MaxAafHet;MinGq
MinAafHomAlt
MinAafHomAlt;MinGq
MinGq
Taking just the letters in a set and printing them:
>>> s = 'MaxAafHetMaxAafHetMinGqMinAafHomAltMinAafHomAltMinGq'
>>> print(''.join(set(s)))
aAeGfiHMmonqtxl
Printing the letters from the error message:
>>> t = 'MinGqaxAfHetoml'
>>> print(''.join(set(t)))
aAeGfiHMmonqtxl
These are exactly the same.
Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.
For big vcf files it makes each Record.fetch call very slow, comparing to pysam.
I suggest to not close TabixFile if it's already open.
I'll add to the list while I find more shortcomings
##ALT=
##assembly=
##PEDIGREE=
##META=
##SAMPLE=
##ALT=
vcfpy version: 0.13.2
Python version: 3.6
Operating System: Centos 7
I would like to modify a vcf (from Lofreq) that does not contain FORMAT or call columns in order for it to be accepted by Picard within the bcbio-nextgen framework.
I have tried like this but it does not work. Nothing comes out:
reader = vcfpy.Reader.from_path("/path/to/sample.vcf.gz")
reader.header.samples = vcfpy.SamplesInfos(names=['D20-XXX'], name_to_idx={'D20-XXX': 0})
writer = vcfpy.Writer.from_path('/dev/stdout', reader.header)
for rec in reader:
data = {'GT': '1/1'}
rec.calls.append(vcfpy.Call(sample=sample, data=data))
rec.FORMAT = []
rec.add_format('GT')
writer.write_record(rec)
Is there some inherent limitation that prevents adding this info? I noticed a caveat that the FORMAT field cannot be altered after modifying the SamplesInfos object.
Incidently, I have also noticed that data objects in the call instances are dicts and not ordered dicts, as the documentation says. Perhaps a remnant from earlier times?
Add more examples
I'm trying to use vcfpy yo convert a CSV file into VCF the script : import pandas as pd
import vcfpy
import sys
def csv_to_vcf(csv_file, vcf_file):
# read CSV file using pandas
df = pd.read_csv(csv_file)
# read vcf file
reader = vcfpy.Reader.from_path(vcf_file)
# create a new VCF file
vcf_writer = vcfpy.Writer.from_path(vcf_file,reader.header)
# add the header information to the VCF file
# loop through rows in the CSV file
for i, row in df.iterrows():
# create a new record for the VCF file
record = vcfpy.Record(
CHROM=row['#CHROM'],
POS=row['POS'],
ID=".",
REF=row['REF'],
ALT=".",
QUAL=".",
FILTER=".",
INFO=".",
FORMAT=[])
chr = row['#CHROM']
pos = int(row['POS'])
ref = row['REF']
alt = row['ALT']
chr='chr'+str(chr)
if chr == 'chrMT':
chr = 'chrM'
try:
fetched = reader.fetch(record.CHROM,pos-5,pos+5)
except:
fetched = '.|.'
for fet in fetched:
# add the record to the VCF file
vcf_writer.write_record(fet)
# close the VCF file
vcf_writer.close()
if name == "main":
csv_file = sys.argv[1]
vcf_file = sys.argv[2]
csv_to_vcf(csv_file,vcf_file)
python csv_to_vcf.py Cmpd_Hetro_Bathc1_morethan10test.csv results_table.DVD.vcf
where the Cmpd is the CSV file and the VCF file already has a header.
If there was a crash, please include the traceback here.
Error :
/vcfpy/writer.py", line 135, in _serialize_record
row = [record.CHROM, record.POS]
AttributeError: 'str' object has no attribute 'CHROM'
I often use vcfpy to create VCF files from non-vcf data sources, such as a spreadsheet. One difficulty with doing this has been creating AltRecords for alternative alleles. Looking through the code I found the process_alt function, which I find very helpful for this use-case. I think it would help other people looking to create VCFs without a starting VCF if this was in the documentation.
PASS is a valid value for the filter column and shouldn't be required in header:
http://www.internationalgenome.org/wiki/Analysis/vcf4.0/
UnknownFilter: Filter not found in header: PASS; problem in FILTER column
UnknownFilter,
Thanks
I'm unable to install on a new 3.12 environment. The Pip error message refers to a module attribute, SafeConfigParser
, that has been deprecated in 3.12. At a glance, it appears to be rooted in the Versioneer build management tool that you're using here, but they claim to have fixed this.
I know that you don't claim to support 3.12 in the setup.py
file but I figured I'd flag it for you anyway in case it's an easy fix.
[⚙ venv]~/code/retraction % pip install vcfpy
Collecting vcfpy
Using cached vcfpy-0.13.6.tar.gz (1.0 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [31 lines of output]
/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-install-xqxoql9o/vcfpy_a939bfba6aed45e0910fdbaf1c40343f/versioneer.py:432: SyntaxWarning: invalid escape sequence '\s'
] = '''
Traceback (most recent call last):
File "/Users/jss009/code/retraction/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/Users/jss009/code/retraction/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jss009/code/retraction/venv/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
^^^^^^^^^^^^^^^^^^^^^
File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-build-env-gn2x9bq0/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-build-env-gn2x9bq0/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 325, in _get_build_requires
self.run_setup()
File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-build-env-gn2x9bq0/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 507, in run_setup
super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-build-env-gn2x9bq0/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 341, in run_setup
exec(code, locals())
File "<string>", line 48, in <module>
File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-install-xqxoql9o/vcfpy_a939bfba6aed45e0910fdbaf1c40343f/versioneer.py", line 1505, in get_version
return get_versions()["version"]
^^^^^^^^^^^^^^
File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-install-xqxoql9o/vcfpy_a939bfba6aed45e0910fdbaf1c40343f/versioneer.py", line 1434, in get_versions
cfg = get_config_from_root(root)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/private/var/folders/wx/h1q5fy057t97w4md21k1ktjnlp06wn/T/pip-install-xqxoql9o/vcfpy_a939bfba6aed45e0910fdbaf1c40343f/versioneer.py", line 346, in get_config_from_root
parser = configparser.SafeConfigParser()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'configparser' has no attribute 'SafeConfigParser'. Did you mean: 'RawConfigParser'?
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
Separately, I tried in a fresh 3.11 environment and it worked.
When I fetch variants by contig ID I get the following UnicodeDecodeError
demosntrating some issues when parsing the tabix file. Maybe the issue comes from pysam
, but I would like to know if you have had previous reports based on this issue.
$ tabix -p vcf <vcf_filepath>
reader = vcfpy.Reader.from_path(path=DATA['annot_vcf'], tabix_path=DATA['annot_vcf']+'.tbi')
for record in reader.fetch('chr1'):
[...]
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-38-046818a3e579> in <module>
3 variant_records = []
4 sample_records = []
----> 5 for record in reader.fetch('chr1'):
6 if record.CHROM in wanted_chroms:
7 ALT = record.ALT[0].value
/usr/local/lib/python3.6/dist-packages/vcfpy/reader.py in __next__(self)
171 """
172 if self.tabix_iter:
--> 173 return self.parser.parse_line(str(next(self.tabix_iter)))
174 else:
175 result = self.parser.parse_next_record()
pysam/libctabix.pyx in pysam.libctabix.TabixIterator.__next__()
pysam/libcutils.pyx in pysam.libcutils.charptr_to_str()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2821: ordinal not in range(128)
Parsing a file.
Copy/pasted the code from https://vcfpy.readthedocs.io/en/stable/examples.html#reading-vcf-files
.../vcfpy/parser.py:495: UnknownFilter: Filter not found in header: PASS; problem in FILTER column
I think that PASS
is simply the VCF way of saying that all filters passed, so I think this should not be considered an unknown filter.
I'm having a hard time understanding why there is a mapping of escape characters, because when I write a vcf if I have some %
in my fields, this gets changed to %25
instead of remaining as %
like in the original. And I believe this can cause conflicts when using tools on the VCF with the field formatted like %25
.
I ran this custom function to split the VCF in samples.
def split_vcf(vcf_path):
reader_main = vcfpy.Reader.from_path(vcf_path)
samples = list(set([x.replace('NORMAL.', '').replace('TUMOR.', '') for x in reader_main.header.samples.names]))
reader_main.close()
for sample in samples:
reader_1 = vcfpy.Reader.from_path(vcf_path)
reader_2 = vcfpy.Reader.from_path(vcf_path)
header = reader_2.header
header.samples.names = [x for x in header.samples.names if x in [f'NORMAL.{sample}', f'TUMOR.{sample}']]
header.samples.name_to_idx = {x: header.samples.name_to_idx[x] for x in [f'NORMAL.{sample}', f'TUMOR.{sample}']}
writer = vcfpy.Writer.from_path('./{}.vcf'.format(sample), header)
for record in reader_1:
if any(sample == s for s in record.INFO['set'].split('-')):
writer.write_record(record)
writer.close()
According to the VCF specs (https://samtools.github.io/hts-specs/VCFv4.2.pdf) it is valid to have a VCF without a FORMAT and sample columns. However, this causes an error when parsing the individual variant entries.
$ cat tests/test_data/input.no_sample.vcf
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO
22 18644673 . C T . . .
$python
>>> import vcfpy
>>> vcf_reader = vcfpy.Reader.from_path('tests/test_data/input.no_sample.vcf')
>>> next(vcf_reader)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ssiebert/miniconda2/envs/pvactools/lib/python3.5/site-packages/vcfpy/reader.py", line 170, in __next__
result = self.parser.parse_next_record()
File "/Users/ssiebert/miniconda2/envs/pvactools/lib/python3.5/site-packages/vcfpy/parser.py", line 792, in parse_next_record
return self.parse_line(self._read_next_line())
File "/Users/ssiebert/miniconda2/envs/pvactools/lib/python3.5/site-packages/vcfpy/parser.py", line 783, in parse_line
return self._record_parser.parse_line(line)
File "/Users/ssiebert/miniconda2/envs/pvactools/lib/python3.5/site-packages/vcfpy/parser.py", line 457, in parse_line
format_ = arr[8].split(':')
IndexError: list index out of range
I'm trying to read a vcf file that was generated by GATK 3.8. GATK is not annotating according to VCF format specifications. It leaves empty fields in the FORMAT column empty and does not write out separators, e.g.:
GT:AD:DP:FT:GQ:PL ./.:.:.:MinAafHomAlt;MinCovHomAlt;MinGq
The above mentioned line triggers the following error:
Traceback (most recent call last):
File "ped_into_vcf.py", line 136, in <module>
sys.exit(main())
File "ped_into_vcf.py", line 132, in main
run(args)
File "ped_into_vcf.py", line 116, in run
extend_vcf(args.vcf_file, pedigree)
File "ped_into_vcf.py", line 108, in extend_vcf
writer.write_record(record)
File "/vol/local/data/miniconda3/lib/python3.5/site-packages/vcfpy/writer.py", line 131, in write_record
self._serialize_record(record)
File "/vol/local/data/miniconda3/lib/python3.5/site-packages/vcfpy/writer.py", line 148, in _serialize_record
for s in self.header.samples.names]
File "/vol/local/data/miniconda3/lib/python3.5/site-packages/vcfpy/writer.py", line 148, in <listcomp>
for s in self.header.samples.names]
File "/vol/local/data/miniconda3/lib/python3.5/site-packages/vcfpy/writer.py", line 170, in _serialize_call
for key in format_]
File "/vol/local/data/miniconda3/lib/python3.5/site-packages/vcfpy/writer.py", line 170, in <listcomp>
for key in format_]
KeyError: 'GQ'
@property
def is_variant(self):
"""Return ``True`` for non-hom-ref calls"""
return self.gt_type != HOM_REF
should be
@property
def is_variant(self):
"""Return ``True`` for non-hom-ref calls"""
return not gt_type
I want to add self defined Filter into my vcf column (such as LowQual etc)
## read a vcf
vcf_reader= vcfpy.Reader.from_path(vcfInput)
## add Filter-Header
## vcf output
vcf_writer= vcfpy.Writer.from_path(newOutput+'/'+ name+'.filtered.vcf',vcf_reader.header)
## add filter in each variant e.g if QUAL < 30: filter= "LowQual"
## write variant in vcf output
vcf_writer.write_record(record)
Solution:
maybe instead of:
",".join(...)
use
','.join(...)
Add Type and Number constants known for standardized and common header line types.
PyVCF can be used for inspiration.
vcfpy.Reader.from_file
=> vcfpy.Reader.from_stream
vcfpy.Writer.from_file
=> vcfpy.Writer.from_stream
vcfpy.Writer.from_path
, make path
go firstmaster
branchI have VCF files that generate a
TypeError: argument of type 'bool' is not iterable
when reading the lines. Closer inspection reveals that the problem is in the INFO
field. More specifically, when there is a flag in the INFO field, but the header specifies otherwise, a TypeError
is generated in convert_field_value(type_, value)
.
Here is a minimum reproducing example:
import io
import vcfpy
string_buffer = io.StringIO(r"""##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=MH,Number=1,Type=String,Description="Microhomology">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT FOO BAR
1 4798729 . CT C 621 PASS MH GT 0/0:25 0/1:64
""")
with vcfpy.Reader.from_stream(string_buffer) as reader:
for line in reader:
print(line)
I think a quick solution would be to adjust parse_field_value
to detect that a flag, instead of key-value pair, is passed and override behaviour specified in the header.
I can make a pull request.
Proper, tested adding of lines to header, must update index data structures
Do it in a similar fashion as htslib and cyvcf2 do.
When writing the VCF record, the colon [:] is escaped to %3A in the INFO field value strings which is causing a problem for us. According to the VCF specification, colons are not excluded eg for v4.3,
8. INFO — additional information: (String, no semi-colons or equals-signs permitted; commas are permitted only as delimiters for lists of values; characters with special meaning can be encoded using the percent encoding, see Section 1.2; space characters are allowed)
Help with resolving this issue would be very much appreciated.
I need to subset samples from a large VCF with approximately 100.000 samples. As the extraction is the only part of the workflow outside python it would be very nice to be able to use vcfpy for this. I read the docs and also tried to deduce how the package works from the source. However, I am coming up short and would very much appreciate some help understanding how the subsetting of samples works.
import vcfpy
reader = vcfpy.Reader.from_path()
subset_samples = sel = ['sample123', 'sample124']
reader = vcfpy.Reader.from_path('myfile.vcf.gz', parsed_samples=subset_samples)
reader.parsed_samples
returns the samples chosen to be parsed in the input:
['sample123', 'sample124']
When iterating over the records in the reader
object:
for record in reader: record
I get:
<vcfpy.record.UnparsedCall object at 0x7f065fd30f98>
So I am assuming that the parser respects my list of samples to be parsed. However I am struggling with getting the calls for only these samples.
My goal is to extract all markers, but only for a few samples.
Any help would be very much appreciated!
When attempting to do a roundtrip with a reader and a writer, I run into an IncorrectVCFFormat
exception complaining about the header line.
Here is where I create the writer (I'm writing to a buffer rather than a file):
with io.StringIO() as vcf_stream:
with vcfpy.Reader.from_path(self.template_vcf) as template_vcf_reader:
vcf_writer = vcfpy.Writer.from_stream(vcf_stream, header=template_vcf_reader.header)
...
Then I do some subsequent write_record
calls (which do not come from the template; the template VCF just contains headers).
Finally, I read the buffer with (effectively) vcfpy.Reader.from_stream(vcf_stream)
Unfortunately, this results in the aforementioned exception:
vcfpy.exceptions.IncorrectVCFFormat: Missing line starting with "#CHROM"
Basically, each point where we overlook a problem with VCF, we could echo a warning.
We should take care that the printed warnings are stored somewhere so there are not millions of warnings.
Passing samples infos is not necessary, is already in header.
I integrated vcfpy into an application (HmtNote), and I use it to parse VCF files and add custom annotations. I encountered an issue regarding GT data when processing a VCF file formatted as:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TEST1
chrMT 73 . A G . PASS AC=2;AN=2 GT:HF 0/1:0.994
When traversing the file using vcfpy.Reader.from_path()
and manipulating each record, the following error shows up:
File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/hmtnote/classes.py", line 502, in annotate
for record in self.reader:
File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/vcfpy/reader.py", line 175, in __next__
result = self.parser.parse_next_record()
File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/vcfpy/parser.py", line 802, in parse_next_record
return self.parse_line(self._read_next_line())
File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/vcfpy/parser.py", line 793, in parse_line
return self._record_parser.parse_line(line)
File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/vcfpy/parser.py", line 467, in parse_line
calls = self._handle_calls(alts, format_, arr[8], arr)
File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/vcfpy/parser.py", line 479, in _handle_calls
call = record.Call(sample, data)
File "/home/ubuntu/test/ang_vcf/venv/lib/python3.7/site-packages/vcfpy/record.py", line 234, in __init__
for allele in ALLELE_DELIM.split(self.data["GT"]):
TypeError: expected string or bytes-like object
I found a temporary solution and it works for in my specific case, but maybe the author might find something more suitable. The temporary fix involves replacing lines 232-234 in record.py
as such:
# original lines 232-234
if self.data.get("GT", None) is not None:
self.gt_alleles = []
for allele in ALLELE_DELIM.split(self.data["GT"]):
# my replacement
if self.data.get("GT", None) is not None and len(self.data.get("GT", None)) != 0:
self.gt_alleles = []
for allele in ALLELE_DELIM.split(self.data["GT"][0]):
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.