griffithlab / vatools Goto Github PK

View Code? Open in Web Editor NEW

25.0 17.0 12.0 39.75 MB

A set of tools to annotate VCF files with expression and readcount data

Home Page: http://www.vatools.org

License: MIT License

Python 100.00%

vatools's Issues

ref-transcript-mismatch-reporter ZeroDivisionError: float division by zero

Hi I am using griffithlab/vatools:latest docker image to annotate my vcf. However, I ran into this error:

2023-07-27T02:16:47.901908883Z Traceback (most recent call last):
2023-07-27T02:16:47.901930024Z   File “/usr/local/bin/ref-transcript-mismatch-reporter”, line 11, in <module>
2023-07-27T02:16:47.901936846Z     sys.exit(main())
2023-07-27T02:16:47.901939332Z   File “/usr/local/lib/python3.6/dist-packages/vatools/ref_transcript_mismatch_reporter.py”, line 215, in main
2023-07-27T02:16:47.901941899Z     “Percentage of processable transcripts with mismatched annotations: {}%\n”.format(round(float(filtered_transcript_count)/float(processable_transcript_count)*100, 2)) +
2023-07-27T02:16:47.901944955Z ZeroDivisionError: float division by zero

This is my command

ref-transcript-mismatch-reporter -f hard -o <output_file_name> <input.vcf.gz>

where the input for testing is input.vcf.gz

Seems like when there is no processable transcripts (processable_transcript_count = 0) it will throw this error.

Cheers,
Weilin

ref-transcript-mismatch-reporter does not work

hello,
With ref-transcript-mismatch-reporter (vatools 5.1.0) on my test_vep.vcf as below:

ref-transcript-mismatch-reporter test_vep.vcf --filter hard --output-vcf test.vcf

The error still existed, the variant is as below:
chr12 48238361 . G GCCTCAATGAGGAGCACTCCAAGCAGTACCGCTGCCTCTCCTTCCAGCC . clustered_events AS_FilterStatus=SITE;AS_SB_TABLE=101,4|1,6;DP=118;ECNT=5;GERMQ=93;MBQ=37,34;MFRL=52,194;MMQ=60,60;MPOS=49;NALOD=1.48;NLOD=8.75;POPAF=6;TLOD=19.3;CSQ=CCTCAATGAGGAGCACTCCAAGCAGTACCGCTGCCTCTCCTTCCAGCC|stop_gained&protein_altering_variant|HIGH|VDR|7421|Transcript|NM_001364085.1|protein_coding|10/10||NM_001364085.1:c.1451_1452insGGCTGGAAGGAGAGGCAGCGGTACTGCTTGGAGTGCTCCTCATTGAGG|NP_001351014.1:p.Asn484delinsLysAlaGlyArgArgGlySerGlyThrAlaTrpSerAlaProHisTer|1611-1612|1451-1452|484|N/KAGRRGSGTAWSAPH*G|aac/aaGGCTGGAAGGAGAGGCAGCGGTACTGCTTGGAGTGCTCCTCATTGAGGc|||-1||EntrezGene|||rseq_mrna_nonmatch&rseq_5p_mismatch||||OK|||||||||||||||MEAMAASTSLPDPGDFDRNVPRICGVCGDRATGFHFNAMTCEGCKGFFRRSMKRKALFTCPFNGDCRITKDNRRHCQACRLKRCVDIGMMKEFILTDEEVQRKREMILKRKEEEALKDSLRPKLSEEQQRIIAILLDAHHKTYDPTYSDFCQFRPPVRVNDGGGSHPSRPNSRHTPSFSGDSSSSCSDHCITSSDMMDSSSFSNLDLSEEDSDDPSVTLELSQLSMLPHLADLVSYSIQKVIGFAKMIPGFRDLTSEDQIVLLKSSAIEVIMLRSNESFTMDDMSWTCGNQDYKYRVSDVTKAGHSLELIEPLIKFQVGLKKLNLHEEEHVLLMAICIVSPDRPGVQDAALIEAIQDRLSNTLQTYIRCRHPPPGSHLLYAKMIQKLADLRSLNEEHSKQYRCLSFQPECSMKLTPLVLEVFGNEISLGQPVAVPGWGCSSRATCQARGWRLLSSPPHPVWGSAPPLPPPLSTQPILSPVQPNPFPAGFSPVP GT:AD:AF:DP:F1R2:F2R1:SB 0/0:29,0:0.0318:29:17,0:12,0:29,0,0,0 0/1:76,7:0.0936:83:57,1:19,0:72,4,1,6

It has not been filtered, please help.

Thanks!

Please consider adding support for multi-thread support

Being able to scatter/gather the work over multiple CPU cores would really help speed-up your script.

warn when pick flag is expected but not observed

I have encountered issues wherree VEP --flag_pick appears to be broken, it will not flag a mutation with pick if that mutation intersects with multiple gene/transcript features. It would be beneficial if vep-annotation-reporter warned when it encounters entries where a pick flag is expected but not seen. In these cases it appears to pick the last consequence item from the vcf

Cannot install `vatools` with pip

I created a new conda environment on my linux workstateion running Rocky Linux 9.3 (Blue Onyx) with python 3.12 and installed pip.

Running pip install vatools leads to

Collecting vatools
  Using cached vatools-5.1.1-py3-none-any.whl.metadata (813 bytes)
Collecting vcfpy==0.12.3 (from vatools)
  Using cached vcfpy-0.12.3.tar.gz (1.0 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      ERROR: Can not execute `setup.py` since setuptools is not available in the build environment.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details

Are there any hints on what I missied here? Installing other packages via pip works fine.

ValueError: invalid literal for int() with base 10

Hello, recently I've found the mismatch issue with my RNAseq data.
Basically I have both normal & tumor RNAseq data. I did the variant calling following the pipeline below:

STAR (index using GENCODE GRCh38.primary_assembly.genome.fa & gencode.v40.annotation.gtf)
GATK (GRCh38.primary_assembly.genome.fa provided as fasta)
2a. MarkDuplicatesSpark
2b. SplitNCigarReads
2c. BQSR
2d. Mutect2
VEP (version 106)

    vep --input_file vcf_files/${sample}_RNA_match_RNA_somatic.filtered.vcf.gz \
        --format vcf \
        --output_file vep/${sample}_RNA_match_RNA_somatic.filtered.vep.vcf \
        --vcf --symbol --terms SO --tsl --hgvs \
        --fasta reference/vep/homo_sapiens/106_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
        --offline --cache --dir_cache reference/vep/ \
        --plugin Frameshift --plugin Wildtype --dir_plugins reference/vep/plugins/ \
        --synonyms reference/vep/homo_sapiens/106_GRCh38/chr_synonyms.txt

pVACtools (version 3.0.1)

	pvacseq run --iedb-install-directory ~/.conda/miniconda/4.9/envs/neo/ \
              --binding-threshold 500 --n-threads 24 \
              --net-chop-method "cterm" --netmhc-stab --exclude-NAs \
              --normal-sample-name ${sample}_N \
              vep/${sample}_RNA_match_RNA_somatic.filtered.vep.vcf \
              ${sample} ${hla1},${hla2} \
              MHCflurry MHCnuggetsI NetMHC NetMHCpan \
              pvacseq/${sample}_RNA_match_RNA

The first issue is when running VEP, I got the follow message for a few lines, but it still generated a VEP annotated vcf.

Use of uninitialized value in numeric lt (<) at /hpctmp/renyi04/reference/vep/plugins/Frameshift.pm line 129, <$fh> line 5840.
Use of uninitialized value in numeric lt (<) at /hpctmp/renyi04/reference/vep/plugins/Frameshift.pm line 129, <$fh> line 5840.
Use of uninitialized value in addition (+) at /hpctmp/renyi04/reference/vep/plugins/Frameshift.pm line 129, <$fh> line 5840.

The second error is when I run pvacseq, I got warnings and errors below:

KI270733.1 135027 C CT
Variant doesn't have protein position information. Skipping.
KI270733.1 135027 C CCG
Variant doesn't have protein position information. Skipping.
KI270733.1 135028 CGCGCGCGCGGGAGGGCGCGTGCCCCG C
Variant doesn't have protein position information. Skipping.
KI270733.1 168056 T G ENST00000615130
Variant doesn't have protein position information. Skipping.
KI270733.1 177433 G GGGC ENST00000618998
Completed
Generating Variant Peptide FASTA and Key File
Warning: Amino acid change is not sane - contains multiple stops. Skipping entry 2208.EIF4E2.ENST00000258416.inframe_del.176-177QVSHP*ARLVSCVAFALLAAPDLL**FLLFLW*E/QE
Warning: Amino acid change is not sane - contains multiple stops. Skipping entry 2209.EIF4E2.ENST00000409098.inframe_del.176-177QVSHP*ARLVSCVAFALLAAPDLL**FLLFLW*E/QE
Warning: Amino acid change is not sane - contains multiple stops. Skipping entry 2210.EIF4E2.ENST00000409167.inframe_del.131-132QVSHP*ARLVSCVAFALLAAPDLL**FLLFLW*E/QE
Warning: Amino acid change is not sane - contains multiple stops. Skipping entry 2211.EIF4E2.ENST00000409322.inframe_del.131-132QVSHP*ARLVSCVAFALLAAPDLL**FLLFLW*E/QE
Warning: Amino acid change is not sane - contains multiple stops. Skipping entry 2212.EIF4E2.ENST00000409394.inframe_del.131-132QVSHP*ARLVSCVAFALLAAPDLL**FLLFLW*E/QE
Warning: Amino acid change is not sane - contains multiple stops. Skipping entry 2213.EIF4E2.ENST00000409495.inframe_del.176-177QVSHP*ARLVSCVAFALLAAPDLL**FLLFLW*E/QE
Warning: Amino acid change is not sane - contains multiple stops. Skipping entry 2214.EIF4E2.ENST00000409514.inframe_del.176-177QVSHP*ARLVSCVAFALLAAPDLL**FLLFLW*E/QE
Warning: Amino acid change is not sane - contains multiple stops. Skipping entry 2215.EIF4E2.ENST00000454501.inframe_del.171-172QVSHP*ARLVSCVAFALLAAPDLL**FLLFLW*E/QE
Warning: Amino acid change is not sane - contains multiple stops. Skipping entry 2216.EIF4E2.ENST00000687222.inframe_del.131-132QVSHP*ARLVSCVAFALLAAPDLL**FLLFLW*E/QE
Warning: Amino acid change is not sane - contains multiple stops. Skipping entry 2217.EIF4E2.ENST00000690794.inframe_del.76-77QVSHP*ARLVSCVAFALLAAPDLL**FLLFLW*E/QE
ERROR: There was a mismatch between the actual wildtype amino acid sequence (VHILGEPRPHLFGQMFVRLQLLRAVREVLHTGLAMLGL) and the expected amino acid sequence (VHILGVSTQQNGVGRAERYRVKAKEAEDETSRPFLFQE). Did you use the same reference build version for VEP that you used for creating the VCF?
{'chromosome_name': 'chr3', 'start': '49015706', 'stop': '49015817', 'reference': 'TCCTGAAAGAGAAAGGGCCTGCTGGTCTCATCCTCTGCTTCCTTTGCCTTTACCCTATACCTCTCTGCACGTCCCACCCCATTTTGCTGTGTGCTCACCCCCAGGATGTGTA', 'variant': 'T', 'gene_name': 'DALRD3', 'transcript_name': 'ENST00000313778', 'transcript_support_level': '2', 'amino_acid_change': 'VHILGVSTQQNGVGRAERYRVKAKEAEDETSRPFLFQE/E', 'codon_change': 'gTACACATCCTGGGGGTGAGCACACAGCAAAATGGGGTGGGACGTGCAGAGAGGTATAGGGTAAAGGCAAAGGAAGCAGAGGATGAGACCAGCAGGCCCTTTCTCTTTCAGGag/gag', 'ensembl_gene_id': 'ENSG00000178149', 'hgvsc': 'ENST00000313778.9:c.998_1012del', 'hgvsp': 'ENSP00000323265.5:p.Val333_Gln337del', 'wildtype_amino_acid_sequence': 'MLTFLQQLRVDWPAASERASSHTLRSHALEELTSANDGRTLSPGILGRLCLKELVEEQGRTAGYDPNLDNCLVTEDLLSVLAELQEALWHWPEDSHPGLAGASDTGTGGCLVVHVVSCEEEFQQQKLDLLWQKLVDKAPLRQKHLICGPVKVAGAPGTLMTAPEYYEFRHTQVCKASALKHGGDLAQDPAWTEIFGVLSVATIKFEMLSTAPQSQLFLALADSSISTKGTKSGTFVMYNCARLATLFESYKCSMEQGLYPTFPPVSSLDFSLLHDEGEWLLLFNSILPFPDLLSRTAVLDCTAPGLHIAVRTEMICKFLVQLSMDFSSYYNRVHILGEPRPHLFGQMFVRLQLLRAVREVLHTGLAMLGLPPLSHI', 'frameshift_amino_acid_sequence': '', 'fusion_amino_acid_sequence': '', 'variant_type': 'inframe_del', 'protein_position': '333-338', 'transcript_expression': 'NA', 'gene_expression': 'NA', 'normal_depth': 'NA', 'normal_vaf': 'NA', 'tdna_depth': '473', 'tdna_vaf': '0.016', 'trna_depth': 'NA', 'tr
[CV023_RNA_match_RNA_somatic.filtered.vep.vcf.zip](https://github.com/griffithlab/pVACtools/files/9018016/CV023_RNA_match_RNA_somatic.filtered.vep.vcf.zip)
na_vaf': 'NA', 'index': '2823.DALRD3.ENST00000313778.inframe_del.333-338VHILGVSTQQNGVGRAERYRVKAKEAEDETSRPFLFQE/E', 'protein_length_change': ''}

I tried to use VAtools (5.0.1) ref-transcript-mismatch-reporter to exclude the mismatch. But got the error after it generated some entries

Traceback (most recent call last):
  File "/home/svu/renyi04/.conda/miniconda/4.9/envs/neo/bin/ref-transcript-mismatch-reporter", line 8, in <module>
    sys.exit(main())
  File "/home/svu/renyi04/.conda/miniconda/4.9/envs/neo/lib/python3.8/site-packages/vatools/ref_transcript_mismatch_reporter.py", line 162, in main
    position = int(protein_position.split('-', 1)[0]) - 1
ValueError: invalid literal for int() with base 10: ''

Last, after I removed the mismatched entries mannually (about 4-7 in different samples), pVACtools worked.

I saw the same issue about VAtools but it should be solved in the lastest version. Any suggestion how I can solve the issue?

My vep.vcf is attached
CV023_RNA_match_RNA_somatic.filtered.vep.vcf.zip

Encoding issue?

Hi I'm running the vcf-readcount-annotator.py tool, and it seems that the resulting VCF files have some encoding issues, where the results have %2C, corresponding to ','; for example as follows,
GT:AD:AF:DP:DP4:F1R2:F2R1:FREQ:RD:SB
0/0:164,47:0.22275:211:25%2C61%2C0%2C0:.,.:.,.:0%25:86:.

I've looked at the code, and it seems when I check the entry variable, all is still correct; but when the function vcf_writer.write_record(entry) is called, the written file would start showing %2C. I'm not sure if this is some encoding that has to be specified, or this is an issue with the vcfpy library?

Add feature to vcf-genotype-annotator to allow for adding GT field to existing samples

Right now the vcf-genotype-annotator will throw a fatal error if the give sample already exists in the VCF. If the GT field does not yet exist in the VCF (no header and not a field in FORMAT) this new option would allow for the tool to add the GT field.

which version is on pip?

Hi there developers,

Which version number is installed when I install vatools using:

pip install vatools

https://vatools.readthedocs.io/en/latest/install.html

vcf-expression annotator "Expression entry"

Dear All,
I am running vcf-expression annotator via this command
vcf-expression-annotator $VCF simple_gene_counts.txt custom gene --id-column Geneid --expression-column gerald_C1TD1ACXX_8_ACAGTG.sorted.bam -s H_NJ-HCC1395-HCC1395-GatkSomaticIndel

on vep-annotated vcf with the vep arguments as below:
vep -i ~/Documents/neoantigen/DNA_data/hcc1395.build135650880.exome.snvs.vcf -o hcc1395.build135650880.exome.snvs.ann.vcf --format vcf --vcf --symbol --terms SO /
--tsl --hgvs --fasta ~/Documents/neoantigen/DNA_data/GRCh37-lite.fa --offline --cache --plugin Downstream --plugin Wildtype --verbose

but the software gives this error:
WARNING:root:891352 of 891352 transcripts did not have an expression entry for their gene id

Please help

Error running VCF Readcount Annotator

Hello!
I'm running the following command line:

vcf-readcount-annotator mysample_vep_output.vcf mysample_bam_readcount DNA -s mysample

This returns the following error:

Traceback (most recent call last):
File "/home/BIG/CMelo/miniconda3/envs/py35/bin/vcf-readcount-annotator", line 11, in
sys.exit(main())
File "/home/BIG/CMelo/miniconda3/envs/py35/lib/python3.5/site-packages/vcf_annotation_tools/vcf_readcount_annotator.py", line 213, in main
ads.append(brct[ref_base])
TypeError: 'NoneType' object is not subscriptable

I'm using different WGS samples and the output I get is truncated in different points for each one. Some stops on Chr 6 and other ones on Chr 1 but the error message is always the same. For few I've got the result for all Chromosomes with no errors. Any guess of what can it be?

vep-annotation-reporter needs to output values for multiple transcripts

Right now it just returns the first transcript or - if the VCF is annotated with flag pick - it returns the first picked transcript. We need to support output for multiple/all transcripts. This could either be a comma-separated list, a dictionary keyed on the transcript, or simply the string multiple.

vcf-readcount-annotator

HI
I've got a problem with annotation in using the latest vatools version. Input and output are attached: the test vcf is vep annotated and decomposed.

and the The counts are obtained using

docker run --rm -i -t --mount type=bind,source=/mnt/,target=/mnt/ --mount type=bind,source=/mnt/ATG/,target=/mnt/ATG/ mgibio/bam_readcount_helper-cwl:1.0.0 /usr/bin/python /usr/bin/bam_readcount_helper.py /mnt/test.decomp.vcf TUMOR /mnt//reference_material/human_g1k_v37_decoy.fasta /mnt/tumour.bam /mnt/

which makes TUMOR_bam_readcount_snv.tsv and TUMOR_bam_readcount_indel.tsv attached below...

error when using: ( I have the lasted vatools as of today March 3 2022

docker run --rm -i -t --mount type=bind,source=/mnt/,target=/mnt/ --mount type=bind,source=/mnt/ATG/,target=/mnt/ATG/ griffithlab/vatools:latest vcf-readcount-annotator /mnt/test.decomp.vcf /mnt/TUMOR_bam_readcount_snv.tsv DNA -s TUMOR -t snv -o /mnt/ATG/research/users/leop/research/scratch2/Grants/Riccardo/pvactools/vcf2022/test.decomp.DNA.vcf
Traceback (most recent call last):
File "/usr/local/bin/vcf-readcount-annotator", line 11, in
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/vatools/vcf_readcount_annotator.py", line 206, in main
for entry in vcf_reader:
File "/usr/local/lib/python3.6/dist-packages/vcfpy/reader.py", line 175, in next
result = self.parser.parse_next_record()
File "/usr/local/lib/python3.6/dist-packages/vcfpy/parser.py", line 802, in parse_next_record
return self.parse_line(self._read_next_line())
File "/usr/local/lib/python3.6/dist-packages/vcfpy/parser.py", line 793, in parse_line
return self._record_parser.parse_line(line)
File "/usr/local/lib/python3.6/dist-packages/vcfpy/parser.py", line 457, in parse_line
info = self._parse_info(arr[7], len(alts))
File "/usr/local/lib/python3.6/dist-packages/vcfpy/parser.py", line 535, in _parse_info
result[key] = parse_field_value(self.header.get_info_field_info(key), True)
File "/usr/local/lib/python3.6/dist-packages/vcfpy/parser.py", line 266, in parse_field_value
return convert_field_value(field_info.type, value)
File "/usr/local/lib/python3.6/dist-packages/vcfpy/parser.py", line 243, in convert_field_value
if "%" in value:
TypeError: argument of type 'bool' is not iterable

TUMOR_bam_readcounts.zip
test.decomp.vcfs.zip

Any ideas???

Thanks
Paul

ValueError: invalid literal for int() with base 10: '"0' when using vcf-expression-annotator

I got the following error when running vcf-expression-annotator

/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/header.py:413: FieldInfoNotFound: INFO "CONTQ not found using String/'.' instead warnings.warn( /home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/parser.py:251: CannotConvertValue: 19.4" cannot be converted to Float, keeping as string. warnings.warn( /home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/parser.py:251: CannotConvertValue: 0" cannot be converted to Integer, keeping as string. warnings.warn( Traceback (most recent call last): File "/home/bobojin/software/miniconda3/envs/vatools/bin/vcf-expression-annotator", line 8, in <module> sys.exit(main()) File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vatools/vcf_expression_annotator.py", line 202, in main for entry in vcf_reader: File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/reader.py", line 175, in __next__ result = self.parser.parse_next_record() File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/parser.py", line 802, in parse_next_record return self.parse_line(self._read_next_line()) File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/parser.py", line 793, in parse_line return self._record_parser.parse_line(line) File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/parser.py", line 467, in parse_line calls = self._handle_calls(alts, format_, arr[8], arr) File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/parser.py", line 479, in _handle_calls call = record.Call(sample, data) File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/record.py", line 238, in __init__ self.gt_alleles.append(int(allele)) ValueError: invalid literal for int() with base 10: '"0'

I have pip install upgrade vatools to 5.1.0 version as @susannasiebert said. And I've tried run ref-transcript-mismatch-reporter crc01_somatic_annotated.vcf -f hard but got almost the same error. So is it about my VEP-annotaed.vcf ? But unlike the error message, the input vcf is VEP-annotated and has the CONFQ info.

I would be appreciated to get some suggestions to solve the issue, thank you.

vcf-readcount-annotator: variant base is not present in the bam-readcount entry

Hi!
I have two separate vcf files for a tumor sample, one for indels and another for snvs. Both have been VEP annotated and decomposed. The indels vcf file has different variant callers:
VarscanSomatic, Strelka, Pindel, and GatkSomaticIndel. Both the reference and the RNA-Seq BAM files were indexed.
On running the vcf-readcount-annotator for indels.vcf, I received this warning for all samples in the vcf file of indels for all types of variant callers used in the vcf for almost all sites of indels:
Warning: variant base +T is not present in the bam-readcount entry for variant X 80457914. This might indicate that the bam-readcount file doesn't match the VCF.

Please note that I ran bam_readcount_helper.py script for indels, it gives this warning for almost all sites
Complex variant or MNP will be skipped: GL000205.1 119085 CAAA CAAAA

I am afraid that will impact the reliability of the resultant vcf annotated with tumor RNA coverage for indels. Please help

Support for GZ compressed annotation files

Please consider adding support for auto-detection/reading of compressed annotation files, not just compressed VCFs.

Support gene names as well as gene Ensembl IDs when using the `custom` option of the `vcf-expression-annotator`

Hi,
I'm using

vcf-expression-annotator *.vcf.gz *_geneNAME_count_header.tsv custom gene --id-column gene_name --expression-column expression

to annotate gene expression on VEP annotated VCF files using :

./vep
--input_file $f --output_file $VCF
--format vcf --vcf --everything --terms SO
--species mus_musculus
--symbol
--offline --fasta Mus_musculus.GRCm39.dna_rm.primary_assembly.fa --cache --merged
--plugin Frameshift --plugin Wildtype
--verbose \

After running the vcf-expression-annotator I get the error message :

WARNING:root:13124 of 13124 genes did not have an expression entry for their gene id.

which means it doesn't find any match between the symbols from the .vcf and .tsv.

I have read issue #28 and I have checked my .tsv for incorrect spacing and such, but it seems to be correct.
I can also find match manually between the gene_name from .tsv and the gene annotation of the .vcf.

Here is one of my .tsv (as .txt) as an example.
gene_count.txt

Do you have an idea of what could cause this problem?

Thank you very much.

Handle bam-readcount entries for same position in consolidated snv and indel input files

When an input VCF contains a SNV and an InDel at the same position, bam-readcount will create two entries for for them since it's run once in "normal" mode and once in insertion mode. Our current workflow has us combine the snv and indel bam-readcount files before running vcf-readcount-annotator. This results in two bam-readcount entries for the same position. In a previous iteration, we would silently drop the first entry encountered and only use the second entry, which would lead to incorrect readcounts if the bam-readcount output for the two were different.

Commit 7dc601f introduced a fatal error for this situation. We've now been provided with some test data that we can use to come up with a permanent solution (griffithlab/pmbio.org#15). Unfortunately, in this example it seems that the bam-readcount entries between SNV and InDel file are indeed different.

Here are a couple of ways we can resolve this:

Write DP field only
If the actual alt we're trying to process has discrepant readcounts, write DP field only.
Process SNV and InDels separately

I'm leaning toward option 2. This would allow us to process as much information as possible without fatal errors and without reworking our whole workflows.

If this needs to be addressed ASAP before I return from vacation, we need to be mindful that simply reverting the above commit can introduce incorrect readcounts annotated into our VCFs. A safer solution that will be very straightforward to implement would be to “skip” this position. This can be achieved by replacing line 60 with coverage[(chromosome,position,reference_base)] = None (untested). However, this would write 0 for DP/AD/AF instead which isn’t ideal either.

Installation via pip requires version previous than 21 because of Python version

As stated in documentation, installation via conda suggests the usage of pip. This is not gonna working with pip version higher than 21

Here's the message provided by pip version (< 20.2)

DEPRECATION: Python 3.5 reached the end of its life on September 13th, 2020. Please upgrade your Python as Python 3.5 is no longer maintained. pip 21.0 will drop support for Python 3.5 in January 2021. pip 21.0 will remove support for this functionality.

Enhancement: Adding allele counts per strand in vcf-readcount-annotator

The bam-readcount output contains allelic read counts on each strand; the readme file provides the following fields for each allele:

base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end

Where num_plus_strand and num_minus_strand refers to read counts for allele base on the plus strand and minus strand respectively.

I wonder if these two fields can be added to vcf-readcount-annotator's output? For standardization's sake, you can put those data to FORMAT fields ADF and ADR like in the official VCF specs, like the following FORMAT header lines in bcftools:

##FORMAT=<ID=ADF,Number=R,Type=Integer,Description="Allelic depths on the forward strand (high-quality bases)">
##FORMAT=<ID=ADR,Number=R,Type=Integer,Description="Allelic depths on the reverse strand (high-quality bases)">

Many pipelines put filters on allele-specific depths, and the removal of the SAC annotation from GATK4 causes quite a bit of hardship.

Thanks in advance!

When running the vcf-info-annotator support input TSVs with multiple value columns

This could easily be achieved by requiring the TSV to have a header and looking up the appropriate field using the info_field parameter (i.e. the header for the value column would need to match the INFO tag you want to use in your VCF).

Note, that this does not mean to add support for adding multiple INFO fields in one command run, just that the input TSV can have an arbitrary set of value columns.

getting error when The INFO field name to use

Hi!

Trying to use your useful tool, I get this error for the 3rd argument. Not pretty sure how use it. My vcf is OK and the tsv is ok also. It seems that the INFO filed is not detected in the vcf... Any advice.

By the way, I will be possible to add more columns in the tsv file to collapse into the vcf ??

I paste the command and the error

vcf-info-annotator -f 'String' --overwrite -o ./vatools_merge_test ./WES_tsv_rare.vcf ./rare_genes.txt INFO
Traceback (most recent call last):
File "/home/geopela/.local/bin/vcf-info-annotator", line 8, in
sys.exit(main())
File "/home/geopela/.local/lib/python3.8/site-packages/vatools/vcf_info_annotator.py", line 103, in main
raise Exception("the --description and --value_format arguments are required unless updating/overwriting an existing field (with flag --overwrite)")
Exception: the --description and --value_format arguments are required unless updating/overwriting an existing field (with flag --overwrite)

Annotation VCF

Can I annotate vcf file used for pvactools with another tools rather than VEP?

THANKS,
Nour

vep_annotation_reporter.py + AttributeError: 'bool' object has no attribute 'split'

Hi,

We have tried using VAtools tools installed with the command sudo pip3 install vatools to get a tsv file and stuck with the below python specific error. The input is a Annovar+VEP annotated VCF.

Could you help to fix the below error.
How do we get the binary installation to run as vep-annotation-reporter instead of vep_annotation_reporter.py

$ python3 /vatools/vep_annotation_reporter.py -o hg38_multianno.vep.tsv hg38_multianno.vep.vcf.gz Allele Consequence IMPACT SYMBOL Gene Feature_type Feature BIOTYPE EXON INTRON HGVSc HGVSp cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation DISTANCE STRAND FLAGS VARIANT_CLASS SYMBOL_SOURCE HGNC_ID CANONICAL MANE TSL CCDS ENSP SpliceAI_pred_DP_AG SpliceAI_pred_DP_AL SpliceAI_pred_DP_DG SpliceAI_pred_DP_DL SpliceAI_pred_DS_AG SpliceAI_pred_DS_AL SpliceAI_pred_DS_DG SpliceAI_pred_DS_DL SpliceAI_pred_SYMBOL
/usr/local/lib/python3.8/dist-packages/vcfpy/header.py:413: FieldInfoNotFound: INFO . not found using String/'.' instead
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/vatools/vep_annotation_reporter.py", line 202, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/vatools/vep_annotation_reporter.py", line 167, in main
    vep = extract_vep_fields(args)
  File "/usr/local/lib/python3.8/dist-packages/vatools/vep_annotation_reporter.py", line 114, in extract_vep_fields
    for variant in vcf_reader:
  File "/usr/local/lib/python3.8/dist-packages/vcfpy/reader.py", line 175, in __next__
    result = self.parser.parse_next_record()
  File "/usr/local/lib/python3.8/dist-packages/vcfpy/parser.py", line 802, in parse_next_record
    return self.parse_line(self._read_next_line())
  File "/usr/local/lib/python3.8/dist-packages/vcfpy/parser.py", line 793, in parse_line
    return self._record_parser.parse_line(line)
  File "/usr/local/lib/python3.8/dist-packages/vcfpy/parser.py", line 457, in parse_line
    info = self._parse_info(arr[7], len(alts))
  File "/usr/local/lib/python3.8/dist-packages/vcfpy/parser.py", line 535, in _parse_info
    result[key] = parse_field_value(self.header.get_info_field_info(key), True)
  File "/usr/local/lib/python3.8/dist-packages/vcfpy/parser.py", line 271, in parse_field_value
    return [convert_field_value(field_info.type, x) for x in value.split(",")]
AttributeError: 'bool' object has no attribute 'split'

vcf-readcount-annotator only annotates first variant for indel

Hello,
I am trying to annotate my indel VCF as part of preprocessing for pVAC.
I followed this guide and everything works without an error, but when examining the final indel VCF file, I realized that only the first variant is actually annotated with the DP, AF and AD.
I have attached the output VCF file and the readcount file from the bam-readout.

Interestingly, for the snv VCF file, everything worked perfectly and everything got annotated.

If somebody could point me to what I did wrong, I would be very happy!
Cheers!

indel.zip

ref_transcript_mismatch_reporter: `ValueError: invalid literal for int() with base 10: '71/98'`

I get the following error when running vatools

Traceback (most recent call last):
  File "/home/el/miniconda3/envs/vatools/bin/ref-transcript-mismatch-reporter", line 8, in <module>
    sys.exit(main())
  File "/home/el/miniconda3/envs/vatools/lib/python3.8/site-packages/vatools/ref_transcript_mismatch_reporter.py", line 148, in main
    position = int(protein_position) - 1
ValueError: invalid literal for int() with base 10: '71/98'

I ran pip show vatools to make sure it works and got the following output:

Name: vatools
Version: 5.0.0
Summary: A tool for annotating VCF files with expression and readcount data
Home-page: https://github.com/griffithlab/vatools
Author: Susanna Kiwala, Chris Miller
Author-email: [email protected]
License: MIT License
Location: /home/el/miniconda3/envs/vatools/lib/python3.8/site-packages
Requires: vcfpy, pysam, gtfparse, testfixtures, pandas
Required-by:

Could these issues be related?
regards
El

Originally posted by @iichelhadi in griffithlab/pVACtools#692 (comment)

Inaccurate warning message when running vcf-expression-annotator

very minor, but if a gene is not found when running vcf-expression-annotator the warning message still says:
WARNING:root:12 of 382 transcripts did not have an expression entry for their gene id.
from
vcf-expression-annotator SCLC20_R_LN.vatools_anno_6.vcf /storage1/fs1/rgovindan/Active/rnaAnalysis/phase0/expression/SCLC20_R_LN.kallisto_gene_abundance.tsv kallisto gene -s TUMOR -o SCLC20_R_LN.vatools_anno_7.vcf

Add a tool to validate VEP annotations

Check that the Downstream annotation is correct. Sometimes it seems to contain the transcript protein sequence
Check that the Amino_acid change wildtype amino acid matches the WildtypeProtein annotation amino acid at the protein change position

VEP entry for at CHR chr, POS , REF , ALT already exists

Greetings,

I am parsing a VEP annotated vcf file with the vep_annotation_reporter. My file may contain multiple rows of the same variant. The program stops when finds said repeated variant and prints:

VEP entry for at CHR chr*, POS *, REF *, ALT * already exists

Is there a way I can parse this file with repeated variants or should I remove such variants?
Thank you.

AttributeError: 'bool' object has no attribute 'split'

I've got a minimal VCF attached that reliably crashes vep-annotation-reporter. A half dozen VCF are failing with the same error, another dozen or so run just fine.

This command will reliably crash it from the most recent docker version: griffithlab/vatools:4.1.0

vep-annotation-reporter -t zz.tsv -o zz.anno.tsv zz.vep.vcf Consequence

First variant line alone works fine, add the second line, and it crashes with the following error:

  File "/usr/local/bin/vep-annotation-reporter", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/vatools/vep_annotation_reporter.py", line 167, in main
    vep = extract_vep_fields(args)
  File "/usr/local/lib/python3.6/dist-packages/vatools/vep_annotation_reporter.py", line 114, in extract_vep_fields
    for variant in vcf_reader:
  File "/usr/local/lib/python3.6/dist-packages/vcfpy/reader.py", line 175, in __next__
    result = self.parser.parse_next_record()
  File "/usr/local/lib/python3.6/dist-packages/vcfpy/parser.py", line 802, in parse_next_record
    return self.parse_line(self._read_next_line())
  File "/usr/local/lib/python3.6/dist-packages/vcfpy/parser.py", line 793, in parse_line
    return self._record_parser.parse_line(line)
  File "/usr/local/lib/python3.6/dist-packages/vcfpy/parser.py", line 457, in parse_line
    info = self._parse_info(arr[7], len(alts))
  File "/usr/local/lib/python3.6/dist-packages/vcfpy/parser.py", line 535, in _parse_info
    result[key] = parse_field_value(self.header.get_info_field_info(key), True)
  File "/usr/local/lib/python3.6/dist-packages/vcfpy/parser.py", line 271, in parse_field_value
    return [convert_field_value(field_info.type, x) for x in value.split(",")]
AttributeError: 'bool' object has no attribute 'split'

I've done some basic sanity checking on the VCF fields to try to figure out what the issue is, but haven't been able to nail it down. I think debugging from within the python tool is going to be necessary, unless I've missed something obvious!

vcf.tar.gz

Bug in "vcf-genotype-annotator" due to misspelling

Hello,

I tried running Command vcf-genotype-annotator strelka2.dna.VEP.Final.vcf Tumor2 "0/1" and got an error below "

Traceback (most recent call last):
  File "/home/miniconda3/envs/pvactools/bin/vcf-genotype-annotator", line 8, in <module>
    sys.exit(main())
  File "/home/miniconda3/envs/pvactools/lib/python3.7/site-packages/vatools/vcf_genotype_annotator.py", line 67, in main
    entry.FORMAT = entry.FORMAT.appened('GT')
AttributeError: 'list' object has no attribute 'appened'

Best,
Nihir

New feature to support writing VCF format meta-data headers to both VCF and TSV output files.

It would be nice to add a common, optional input(s) to this toolkit to write provided key/value meta-data fields to the output VCF and TSV files. I would assume that if the key already exists, then the existing value would be overwritten.

The VCF spec contains some information on what's acceptable for meta-data lines in the header. Most TSV readers and software such as Excel can be told to ignore or deal with pound commented lines, ex. ##. The relevant VCF spec:

1.4 Meta-information lines
File meta-information is included after the ## string and must be key=value pairs. Meta-information lines are optional, but if they are present then they must be completely well-formed. Note that BCF, the binary counterpart of VCF, requires that all entries are present. It is recommended to include meta-information lines describing the entries used in the body of the VCF file.
All structured lines that have their value enclosed within ”<>” require an ID which must be unique within their type. For all of the structured lines (##INFO, ##FORMAT, ##FILTER, etc.), extra fields can be included after the default fields. For example:
##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="description",Version="128">
In the above example, the extra fields of “Source” and “Version” are provided. Optional fields must be stored as strings even for numeric values.
It is recommended in VCF and required in BCF that the header includes tags describing the reference and contigs backing the data contained in the file. These tags are based on the SQ field from the SAM spec; all tags are optional (see the VCF example above).
Meta-information lines can be in any order with the exception of ‘fileformat‘ which must come first.

small special bug about logging

Hi，I found bam_readcount annotator has two small special bug in line such as the following, where "chromsome" should be "chromosome" :

source code >>> logging.warning("Running in snv variant type mode but VCF entry for chr {} pos {} ref {} alts {} contains both SNVs and InDels. Skipping.".format(chromsome, entry.POS, reference, alts))

pVACtools examples tutorials

Hello!
I am trying to find the tutorials to generate files and how to use them for run pvactools here:
http://www.genomedata.org/pvactools-examples/
The files: http://www.genomedata.org/pvactools-examples/annotated.expression.vcf.gz
http://www.genomedata.org/pvactools-examples/annotated.phased.vcf.gz
Could you please help?
Thank you

Add a tool that can write R fields to a TSV in a decomposed manner

GATK VariantsToTable will not split out R-number fields while a user might only be interested in the alt value (e.g. for the allele read counts). We should make a tool that can take any arbitrary R-number field and write the values as <FIELD>-REF and <FIELD>-ALT. This will only work for decomposed variants.

griffithlab / vatools Goto Github PK

vatools's Issues

Recommend Projects

Recommend Topics

Recommend Org