counsyl / hgvs

HGVS variant name parsing and generation

License: MIT License

The Human Genome Variation Society (HGVS) promotes the discovery and sharing of genetic variation in the human population. As part of facilitating variant sharing, the society has produced a series of recommendations for how to name and refer to variants within research publications and clinical settings. A compilation of these recommendations is available on their website.

This library provides a simple Python API for parsing, formatting, and normalizing HGVS names. Handling HGVS names involves many non-trivial steps, so there is a need for well-tested libraries that encapsulate them.

HGVS name example

In most next-generation sequencing applications, variants are first discovered and described in terms of their genomic coordinates such as chromosome 7, position 117,199,563 with reference allele G and alternative allele T. According to the HGVS standard, we can describe this variant as NC_000007.13:g.117199563G>T. The first part of the name is a RefSeq ID NC_000007.13 for chromosome 7 version 13. The g. denotes that this is a variant described in genomic (i.e. chromosomal) coordinates. Lastly, the chromosomal position, reference allele, and alternative allele are indicated. For simple single nucleotide changes the > character is used.
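The parts of a genomic name described above can be pulled apart with a small regular expression. The following is a minimal sketch covering only simple g. substitutions; the pattern and the function name split_genomic_snv are illustrative assumptions and are not the grammar this library uses.

```python
import re

# Simplified pattern for genomic substitutions only,
# e.g. NC_000007.13:g.117199563G>T. Illustrative, not the library's grammar.
GENOMIC_SNV = re.compile(
    r"^(?P<refseq>[A-Z]{2}_\d+\.\d+):g\."
    r"(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def split_genomic_snv(name):
    """Return (refseq_id, position, ref, alt) for a simple g. substitution."""
    match = GENOMIC_SNV.match(name)
    if match is None:
        raise ValueError("not a simple genomic substitution: %r" % name)
    return (match.group("refseq"), int(match.group("pos")),
            match.group("ref"), match.group("alt"))


print(split_genomic_snv("NC_000007.13:g.117199563G>T"))
```

Names in any other style (cDNA, protein, indels) are rejected by this sketch with a ValueError.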

More commonly, a variant will be described using a cDNA or protein style HGVS name. In the example above, the variant in cDNA style is named NM_000492.3:c.1438G>T. Here again, the first part of the name refers to a RefSeq sequence, this time mRNA transcript NM_000492 version 3. Optionally, the gene name can also be given, as in NM_000492.3(CFTR). The c. indicates that this is a cDNA name, and the coordinate indicates that this mutation occurs at position 1438 along the coding portion of the spliced transcript (i.e. position 1 is the first base of the ATG translation start codon). Briefly, the protein style of the variant name is NP_000483.3:p.Gly480Cys, which indicates the change in amino-acid coordinates (480) along an amino-acid sequence (NP_000483.3) and gives the reference and alternative amino-acid alleles (Gly and Cys, respectively).

The standard also specifies custom name formats for many mutation categories such as insertions (NM_000492.3:c.1438_1439insA), deletions (NM_000492.3:c.1438_1440delGGT), duplications (NM_000492.3:c.1438_1440dupGGT), and several other more complex genomic rearrangements.

While many of these names appear to be simple to parse or generate, there are many corner cases, especially with cDNA HGVS names. For example, variants before the start codon should have negative cDNA coordinates (NM_000492.3:c.-4G>C), and variants after the stop codon also have their own format (NM_000492.3:c.*33C>T). Variants within introns are indicated by the closest exonic base with an additional genomic offset, such as NM_000492.3:c.4243-20A>G (the variant is 20 bases in the 5' direction of the cDNA coordinate 4243). Lastly, all coordinates and alleles are specified on the strand of the transcript. This library properly handles all logic necessary to convert genomic coordinates to and from HGVS cDNA coordinates.
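The three cDNA coordinate forms mentioned above (5' UTR, 3' UTR, and intronic offset) can be told apart as in the sketch below. The regular expression, the function name parse_cdna_coord, and the "start"/"stop" landmark labels are assumptions made for illustration; the library's own parser may differ.

```python
import re

# Illustrative pattern for one cDNA coordinate: an optional "-" (5' UTR) or
# "*" (3' UTR) prefix, a base number, and an optional signed intronic offset.
CDNA_COORD = re.compile(
    r"^(?P<prefix>[-*]?)(?P<base>\d+)(?P<offset>[-+]\d+)?$"
)


def parse_cdna_coord(text):
    """Return (landmark, base, offset) for a coordinate such as '4243-20'."""
    match = CDNA_COORD.match(text)
    if match is None:
        raise ValueError("invalid cDNA coordinate: %r" % text)
    base = int(match.group("base"))
    if match.group("prefix") == "-":
        base = -base  # upstream of the translation start codon
    # "*" coordinates count from the stop codon; all others from the start.
    landmark = "stop" if match.group("prefix") == "*" else "start"
    offset = int(match.group("offset") or 0)  # 0 means an exonic position
    return landmark, base, offset


for example in ("-4", "*33", "4243-20", "1438"):
    print(example, parse_cdna_coord(example))
```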

Another important consideration of any library that handles HGVS names is variant normalization. The HGVS standard aims to provide a "uniform and unequivocal" description of variants. Namely, two people discovering a variant should be able to arrive at the same name for it. Such a property is very useful for checking whether a variant has been seen before and connecting all known relevant information. For SNPs, this property is fairly easy to achieve. However, for insertions and deletions (indels) near repetitive regions, many indels are equivalent (e.g. it doesn't matter which AT in a run of ATATATAT was deleted). The VCF file format has chosen to uniquely specify such indels by using the most left-aligned genomic coordinate. Therefore, compliant variant callers that output VCF will have applied this normalization. The HGVS standard also specifies a normalization for such indels. However, it states that indels should use the most 3' position in a transcript. For genes on the positive strand, this is the opposite of the direction specified by VCF. This library properly implements both kinds of variant normalization and allows easy conversion between HGVS and VCF style variants. It also handles many other cases of normalization (e.g. the HGVS standard recommends indicating an insertion with the dup notation instead of ins if it can be represented as a tandem duplication).
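To make the two normalization directions concrete, here is a minimal sketch that shifts a deletion through a repeat. It assumes 0-based positions on a positive-strand sequence, and the function names are hypothetical; it is not the library's implementation.

```python
def left_align_deletion(seq, start, length):
    """Shift a deletion to its leftmost equivalent position (VCF style)."""
    # Moving left by one is allowed when the base entering the deleted
    # window equals the base leaving it.
    while start > 0 and seq[start - 1] == seq[start + length - 1]:
        start -= 1
    return start


def right_align_deletion(seq, start, length):
    """Shift a deletion to its rightmost equivalent position (most 3')."""
    while start + length < len(seq) and seq[start] == seq[start + length]:
        start += 1
    return start


def apply_deletion(seq, start, length):
    """Return the sequence with seq[start:start + length] removed."""
    return seq[:start] + seq[start + length:]


seq = "GGATATATATCC"          # run of ATATATAT at positions 2..9
left = left_align_deletion(seq, 4, 2)
right = right_align_deletion(seq, 4, 2)
print(left, right)
# Both placements remove one AT unit and give the same resulting sequence.
print(apply_deletion(seq, left, 2) == apply_deletion(seq, right, 2))
```

For a transcript on the negative strand, the most 3' position in the transcript corresponds to the leftmost genomic position, so the two conventions would agree in that case.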

Example usage

Below is a minimal example of parsing and formatting HGVS names. In addition to the name itself, two other pieces of information are needed: the genome sequence (needed for normalization), and the transcript model or a callback for fetching the transcript model (needed for transcript coordinate calculations). This library makes as few assumptions as possible about how this external data is stored. In this example, the genome sequence is read using the pyfaidx library and transcripts are read from a RefSeqGenes flat-file using methods provided by hgvs.

import pyhgvs as hgvs
import pyhgvs.utils as hgvs_utils
from pyfaidx import Fasta

# Read genome sequence using pyfaidx.
genome = Fasta('hg19.fa')

# Read RefSeq transcripts into a python dict.
with open('pyhgvs/data/genes.refGene') as infile:
    transcripts = hgvs_utils.read_transcripts(infile)

# Provide a callback for fetching a transcript by its name.
def get_transcript(name):
    return transcripts.get(name)

# Parse the HGVS name into genomic coordinates and alleles.
chrom, offset, ref, alt = hgvs.parse_hgvs_name(
    'NM_000352.3:c.215A>G', genome, get_transcript=get_transcript)
# Returns variant in VCF style: ('chr11', 17496508, 'T', 'C')
# Notice that since the transcript is on the negative strand, the alleles
# are reverse complemented during conversion.

# Format an HGVS name.
chrom, offset, ref, alt = ('chr11', 17496508, 'T', 'C')
transcript = get_transcript('NM_000352.3')
hgvs_name = hgvs.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript)
# Returns 'NM_000352.3(ABCC8):c.215A>G'
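The strand handling noted in the example comments can be sketched as follows. This is an illustration of reverse complementation only; the helper names are hypothetical and the library performs this step internally.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def to_genomic_alleles(ref, alt, strand):
    """Convert transcript-strand alleles to forward genomic-strand alleles."""
    if strand == "-":
        return reverse_complement(ref), reverse_complement(alt)
    return ref, alt


# c.215A>G on a negative-strand transcript is T>C on the genome.
print(to_genomic_alleles("A", "G", "-"))
```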

The hgvs library can also perform just the parsing step and provide a parse tree of the HGVS name.

import pyhgvs as hgvs

hgvs_name = hgvs.HGVSName('NM_000352.3:c.215-10A>G')

# fields of the HGVS name are available as attributes:
#
# hgvs_name.transcript = 'NM_000352.3'
# hgvs_name.kind = 'c'
# hgvs_name.mutation_type = '>'
# hgvs_name.cdna_start = hgvs.CDNACoord(215, -10)
# hgvs_name.cdna_end = hgvs.CDNACoord(215, -10)
# hgvs_name.ref_allele = 'A'
# hgvs_name.alt_allele = 'G'

Install

This library can be installed using the setup.py file as follows:

python setup.py install

Tests

Run the test cases with:

python setup.py nosetests

Requirements

This library requires at least Python 2.6, but otherwise has no external dependencies.

The library does assume that the genome sequence is available through a pyfaidx-compatible Fasta object. For an example of writing a wrapper for a different genome sequence back-end, see pyhgvs.tests.genome.MockGenome.
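As a rough idea of what such a wrapper involves, the sketch below serves sequences from an in-memory dict. It assumes the only access pattern needed is genome[chrom][start:end] followed by str(), which is how pyfaidx objects are commonly used; the class name DictGenome is hypothetical, and MockGenome remains the reference for what the library actually expects.

```python
class DictGenome(object):
    """Hypothetical in-memory genome: chromosome name -> sequence string."""

    def __init__(self, sequences):
        self._sequences = dict(sequences)

    def __getitem__(self, chrom):
        # A plain str already supports slicing and str(), which is the
        # access pattern assumed in the lead-in above.
        return self._sequences[chrom]


genome = DictGenome({"chr1": "ACGTACGT"})
print(str(genome["chr1"][2:5]))
```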

hgvs's Issues

pyhgvs normalize is right, but gives the base number instead of the base itself

I am new to this package and want to know how to get the correct normalized result.
Thanks a lot.

Is there a way to use this to convert protein HGVS to genome space VCF coordinates?

I have been using Pierre Lindenbaum's tool BackLocate to accomplish this (http://lindenb.github.io/jvarkit/BackLocate.html) but if there's a smoother, more pythonic way using this tool it's not clear to me from the documentation.

How to check whether a p. value meets the HGVS specification

I'm trying to localize all variants of CIViC.

But I'm not sure whether some variants meet the HGVS standard.

This is an outstanding project, but I haven't seen an example of analyzing protein-level variation in the README.

I want to know whether the library can do this, and I would be grateful for any other suggestions.

README.md (import hgvs error)

I noticed that the code in the README didn't work for me; it looks like it was resolved in the examples1.py file. In the second line, use

import pyhgvs.utils as hgvs_utils
instead of
import hgvs.utils as hgvs_utils

No module read_transcripts in hgvs_utils

My code is an exact copy of the README.md file on your site. I can't get your package to work as directed.
>>> import pyhgvs as hgvs
>>> import hgvs.utils as hgvs_utils
>>> hgvs_utils.read_transcripts
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'read_transcripts'

I am trying to use Ensembl transcripts as well, and the documentation is rather sparse on that.

Add NC_ALLELE parse

Awesome work! Thanks!

There are some variants which have no mRNA or cDNA HGVS name,

e.g. rs716274, NC_000011.9:g.103418158A>G

NC_ALLELE is empty and is not being processed now.

enhancement - Compatibility with Ensembl genePred information

Since RefSeq (NCBI) is not the only source of annotation, it would also be useful to have compatibility with other gene set sources, such as Ensembl genePred information (easy to obtain from Ensembl GTF files).

Reference HGVS without reference base leads to wrong coordinates and reference allele

The current regex treats the last digit as a ref digit, i.e. it uses it to multiply "N" that many times. This makes the coordinate wrong as the last digit is cut off, e.g.:

In [6]: HGVSName("NC_000017.11:g.50199235=")
Out[6]: HGVSName('NC_000017.11:g.5019923NNNNN=')

In [7]: HGVSName("NM_018090.5:c.462=")
Out[7]: HGVSName('NM_018090.5:c.46NN=')

Unit test test_hgvs_names.py

# Copy pasted from BRCA1:c.101A= test with "A" removed

    ('BRCA1:c.101=', True,
     {
         'gene': 'BRCA1',
         'kind': 'c',
         'cdna_start': CDNACoord(101),
         'cdna_end': CDNACoord(101),
         'ref_allele': '',
         'alt_allele': '',
         'mutation_type': '=',
     }),

# Copy pasted from BRCA1:g.101A= test with "A" removed

    ('BRCA1:g.101=', True,
     {
         'gene': 'BRCA1',
         'kind': 'g',
         'start': 101,
         'end': 101,
         'ref_allele': '',
         'alt_allele': '',
         'mutation_type': '=',
     }),

Currently fails with:

AssertionError: CDNACoord(10, 0) != CDNACoord(101, 0)

The fix is to add a new regex just above the existing "No change" regexes, i.e. in HGVSRegex:

CDNA_ALLELE = [
    CDNA_START + EQUAL, 
    # old regexes
]

GENOMIC_ALLELE = [
    COORD_START + EQUAL,
    # old regexes
]

I am not sure whether protein HGVS names are affected, or whether the ref needs to be specified there, i.e. whether "p.1000=" is valid or not.

'dict' object has no attribute 'tx_position'

python hgvs-convert.py

DEBUG seqdb._create_seqLenDict: Building sequence length index...
Traceback (most recent call last):
  File "hgvs-convert.py", line 35, in <module>
    print(hgvs.parse_hgvs_name("NM_000352.3:c.215A>G",genome,transcripts))
  File "/home/josephv/Pythonmodules/lib/python2.7/site-packages/pyhgvs-0.9.4-py2.7.egg/pyhgvs/__init__.py", line 1365, in parse_hgvs_name
    chrom, start, end, ref, alt = get_vcf_allele(hgvs, genome, transcript)
  File "/home/josephv/Pythonmodules/lib/python2.7/site-packages/pyhgvs-0.9.4-py2.7.egg/pyhgvs/__init__.py", line 662, in get_vcf_allele
    chrom, start, end = hgvs.get_vcf_coords(transcript)
  File "/home/josephv/Pythonmodules/lib/python2.7/site-packages/pyhgvs-0.9.4-py2.7.egg/pyhgvs/__init__.py", line 1181, in get_vcf_coords
    chrom, start, end = self.get_coords(transcript)
  File "/home/josephv/Pythonmodules/lib/python2.7/site-packages/pyhgvs-0.9.4-py2.7.egg/pyhgvs/__init__.py", line 1142, in get_coords
    chrom = transcript.tx_position.chrom
AttributeError: 'dict' object has no attribute 'tx_position'

The script I am using is

import pyhgvs as hgvs
import pyhgvs.utils as hgvs_utils
from pygr.seqdb import SequenceFileDB

genome = SequenceFileDB('/ifs/e63data/offitlab/Human_Decoy_REF/hs37d5.fa')

with open('/ifs/e63data/offitlab/REFGENE/sorted.curated_geneTrack_wo_chr_sorted.refgene') as infile:
    transcripts = hgvs_utils.read_transcripts(infile)

def get_transcript(name):
    return transcripts.get(name)

print(hgvs.parse_hgvs_name("NM_000352.3:c.215A>G",genome,transcripts))

pip install for python3 fails (os x 10.11.3)

Incorrect translation when the HGVS string does not contain a reference or alt allele

I've come across this problem with strings such as NM_007294.3:c.1209dup - which IMHO should actually be NM_007294.3:c.1209dupT (which is how ClinVar represents the variant), but mutalyzer claims that NM_007294.3:c.1209dup is valid HGVS... When I parse its name with

chrom, offset, ref, alt = hgvs.parse_hgvs_name(variant, genome, get_transcript=get_transcript)

I get the results that ref and alt are both 'C', where alt should be 'CC'. If there's a way around this, please let me know!

Thanks!

Issue with installing in Ubuntu

I seem to have an issue installing HGVS. When running "python setup.py install", I encounter the following:

Traceback (most recent call last):
  File "setup.py", line 35, in <module>
    main()
  File "setup.py", line 30, in main
    parse_requirements('requirements-dev.txt')],
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1200, in parse_requirements
    skip_regex = options.skip_requirements_regex
AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'

how to create or find "genes.refGene" file for hg19 and hg38

How can I create or find the "genes.refGene" file for hg19 and hg38?
I got "genes.refGene" files from UCSC, but they do not work in my case.

The error shown is:

Traceback (most recent call last):
  File "first_py.py", line 38, in <module>
    hgvs_name, genome, get_transcript=get_transcript)
  File "build/bdist.linux-x86_64/egg/pyhgvs/__init__.py", line 1356, in parse_hgvs_name
ValueError: transcript is required

Announcing cdot - a way to load lots of transcripts fast

I've made a Python package that provides ~800k transcripts (both RefSeq and Ensembl) for PyHGVS:

https://github.com/SACGF/cdot

You can either download a JSON.gz file, or use a REST service. To use it:

import pyhgvs
from cdot.pyhgvs.pyhgvs_transcript import JSONPyHGVSTranscriptFactory, RESTPyHGVSTranscriptFactory

factory = RESTPyHGVSTranscriptFactory()
# factory = JSONPyHGVSTranscriptFactory(["./cdot-0.2.1.refseq.grch38.json.gz"])  # Uses local JSON file
# hgvs_c (an HGVS c. name) and genome (a pyfaidx-compatible object) are assumed to be defined already.
pyhgvs.parse_hgvs_name(hgvs_c, genome, get_transcript=factory.get_transcript_grch37)

Rename repository to pyhgvs?

This package was renamed from hgvs to pyhgvs a while ago, but the GitHub url still uses hgvs. Switching is actually pretty low-cost, since GH sets up redirects from the old name to the new name, so old links don't break. Even git pull/push still works (I've done this with a few repositories in the past).

dup longer than 100 bases converted back to delins (due to hardcoding of 100 in code)

Expected: Converting a long HGVS dup to variant coordinates and then back again produces a dup.
Actual: A long dup is converted to a delins:

from pyhgvs import parse_hgvs_name, variant_to_hgvs_name

g_hgvs_str = "NC_000001.10:g.235611675_235611994dup"
c_hgvs_str = "NM_003193.4(TBCE):c.1411_1501dup"


chrom, offset, ref, alt = parse_hgvs_name(g_hgvs_str, f, None)
g_hgvs_name = variant_to_hgvs_name(chrom, offset, ref, alt, f, None)

print(f"{g_hgvs_str=} => {g_hgvs_name=}")

chrom, offset, ref, alt = parse_hgvs_name(c_hgvs_str, f, transcript)
c_hgvs_name = variant_to_hgvs_name(chrom, offset, ref, alt, f, transcript)

print(f"{c_hgvs_str=} => {c_hgvs_name=}")

Output:

g_hgvs_str='NC_000001.10:g.235611675_235611994dup' => g_hgvs_name=HGVSName('g.235611773_235611774ins320')
c_hgvs_str='NM_003193.4(TBCE):c.1411_1501dup' => c_hgvs_name=HGVSName('NM_003193.4(TBCE):c.1491+18_1491+19ins320')

This is because hgvs_justify_indel only looks at a hardcoded 100 bases around the indel.

If you change the code to:

    size = max(len(ref), len(alt)) + 1
    start = max(offset - size, 0)
    end = offset + size

It keeps the dup:

g_hgvs_str='NC_000001.10:g.235611675_235611994dup' => g_hgvs_name=HGVSName('g.235611675_235611994dup320')
c_hgvs_str='NM_003193.4(TBCE):c.1411_1501dup' => c_hgvs_name=HGVSName('NM_003193.4(TBCE):c.1411_1501dup320')

AttributeError: 'module' object has no attribute 'read_transcripts'

Hello, I installed hgvs using:
pip install 'hgvs'
pip install 'pygr'

But I ran into an issue. How can I fix it?

[root@bio-x-2 hgvs]# python
Python 2.7.5 (default, Sep 15 2016, 22:37:39)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import pyhgvs as hgvs
>>> import hgvs.utils as hgvs_utils
>>> from pygr.seqdb import SequenceFileDB
>>> hgvs_utils.read_transcripts()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'read_transcripts'

HGVS output from oncotator MAF gives an error

I get the following error when reading data from the HGVS_coding_DNA_change column of oncotator MAF output (http://www.broadinstitute.org/oncotator/).

InvalidHGVSName: Invalid HGVS cDNA allele "5407-17T>-"

Not sure if this is an oncotator issue or a pyhgvs issue.

Python3 version

Any chance of making this happen? This library seems much better than biocommons hgvs, since that one requires a connection to UTA resources.

how to get coordinate of "AB026906.1:c.40_42del" by hgvs code

I have used genes.refGene (#26 (comment)) and hg19.fa.

genes.refGene does not have the "AB026906.1" transcript.

Error:
Traceback (most recent call last):
  File "first_py.py", line 38, in <module>
    hgvs_name, genome, get_transcript=get_transcript)
  File "build/bdist.linux-x86_64/egg/pyhgvs/__init__.py", line 1356, in parse_hgvs_name
ValueError: transcript is required

Support inversions (inv)

NM_007300.4:c.2902_2959inv currently fails

https://varnomen.hgvs.org/recommendations/DNA/variant/inversion/

how to get genes.refGene with version

add pip install

Having an option to pip install pyhgvs would make package management much easier.

single base pair insertion name comes up as slightly off

I am seeing a systematic issue: every cDNA name generated from VCF records is correct except for single-base-pair insertions.

Should be               Getting
CFTR:c.1006_1007insG    CFTR:c.1007insG
CFTR:c.1029_1030insG    CFTR:c.1030insG
CFTR:c.1660_1661insA    CFTR:c.1661insA
CFTR:c.3883_3884insG    CFTR:c.3884insG

So it's close, but it doesn't include the first coordinate. Multi-bp insertions are correct. Any idea why there is a difference?
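For reference, the expected names in the table follow the HGVS convention that an insertion is described by the two positions flanking it. A minimal sketch of that formatting, assuming plain exonic coordinates (no UTR prefixes or intronic offsets) and a hypothetical helper name:

```python
def format_insertion(prefix, position_before, inserted):
    """Name an insertion by the two bases that flank it.

    position_before is the coordinate of the base immediately 5' of the
    inserted sequence, so the flanking pair is always (n, n + 1),
    regardless of how many bases are inserted.
    """
    return "%s%d_%dins%s" % (
        prefix, position_before, position_before + 1, inserted)


print(format_insertion("CFTR:c.", 1006, "G"))
```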

Running UTA locally

Hi, I would prefer to run UTA locally, and I have downloaded and installed Docker and the PostgreSQL Docker image. But Docker is quite new to me, and I am not sure how to run the database. Could you help me with this? Thanks.

Catch invalid HGVS names like NC_000005.10:g.177421339_177421327delACTCGAGTGCTCC

NC_000005.10:g.177421339_177421327delACTCGAGTGCTCC appears in ClinVar, and is an invalid name (the genomic start/stop coords are not in increasing order). This causes parse_hgvs_name to raise an IndexError. It should raise InvalidHGVSName instead
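The requested behaviour could look roughly like the check below. The InvalidHGVSName class here is a local stand-in defined only so the sketch runs on its own, and check_genomic_range is a hypothetical helper, not an existing function of the library.

```python
class InvalidHGVSName(ValueError):
    """Local stand-in for the library's exception of the same name."""


def check_genomic_range(start, end, name):
    """Reject names whose start coordinate lies after the end coordinate."""
    if start > end:
        raise InvalidHGVSName(
            "start %d is greater than end %d in %r" % (start, end, name))
    return start, end


try:
    check_genomic_range(
        177421339, 177421327,
        "NC_000005.10:g.177421339_177421327delACTCGAGTGCTCC")
except InvalidHGVSName as error:
    print("rejected:", error)
```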

enhancement - add compatibility with RNA and non-coding RNA sequences

At the moment the hgvs module is only able to work with coding DNA, genomic, and protein sequences. It would be great if all sequence types could be accepted by the module. I would be very happy to contribute to this task, so please let me know how I could help.

License

Thanks for sharing this very useful library!

Would you mind adding a license for this software?

AttributeError: 'module' object has no attribute 'utils'

In your example, the line

transcripts = hgvs.utils.read_transcripts('genes.refGene')

is throwing the error:
transcripts = hgvs.utils.read_transcripts('genes.refGene')
AttributeError: 'module' object has no attribute 'utils'

Any thoughts?

Incorrect HGVS to VCF conversion for some genomic indels

Hi, genomic indels are often wrong because get_coords() adjustment of start/end is only done for indels if self.kind == 'c'

Testing against examples from the ClinGen allele registry:

http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/allele?hgvsOrDescriptor=NM_000492.3%3Ac.1155_1156dupTA

    'NM_000492.3:c.1155_1156dupTA' # correct resolves to ('chr7', 117182104, 'A', 'AAT')
    # Same as above but without optional trailing base - issue #32
    'NM_000492.3:c.1155_1156dup' # Error - resolves to ('chr7', 117182107, 'A', 'A')
    # Genomic coordinate of above
    "chr7:g.117182108_117182109dup" # Error - resolves to ('7', 117182109, 'A', 'A')

    # Genomic coordinate of above but shifted with optional base suffix
    "chr7:g.117182105_117182106dupAT" # Error - resolves to ('7', 117182106, 'T', 'T')

I would do a pull request but I've been working with existing pull request #25 and it doesn't look like this project is being updated anymore. If you merge #25 please ping this issue and I'll make a pull request.

The fix is to remove the test for if self.kind == 'c': in get_coords()

I've patched my fork: https://github.com/sacgf/hgvs

HGVS / genome coordinate conversion does not account for cDNA alignment gaps

RefSeq transcript sequences can differ from the reference genome sequence (even if they agree with one build, they can differ across builds). These sequences are aligned against the genome to produce exon coordinates in GFF releases.

This alignment can sometimes produce insertions / deletions (5-10% of transcripts), eg in the GFF file there is a “cDNA match” string that records the alignment, and has a “Gap” entry:

NC_000002.12    RefSeq  cDNA_match      73385758        73386192        431.411 +       .       ID=daa36283c6058f57b6347eb074291b21;Target=NM_015120.4 1 438 +;assembly_bases_aln=5003;assembly_bases_seq=5003;consensus_splices=44;exon_identity=0.999768;for_remapping=2;gap_count=1;identity=0.999768;idty=0.993151;matches=12925;num_ident=12925;num_mismatch=0;pct_coverage=99.9768;pct_coverage_hiqual=99.9768;pct_identity_gap=99.9768;pct_identity_ungap=100;product_coverage=1;rank=1;splices=44;weighted_identity=0.999771;Gap=M185 I3 M250

NM_015120.4 has cDNA_match Gap=M185 I3 M250, meaning 185 bases matched, 3 bases were inserted, and then matching resumed. You can see how this affects PyHGVS conversion downstream from the gaps:

2:73385942 A>T: NM_015120.4(ALMS1):c.74A>T (correct)
2:73385943 A>T: NM_015120.4(ALMS1):c.75A>T (off by 3, VEP gives NM_015120.4:c.78A>T)
2:73385944 G>C: NM_015120.4(ALMS1):c.76G>C (off by 3, VEP gives NM_015120.4:c.79G>C)
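The effect of the Gap entry can be reproduced with a short sketch. It assumes the GFF3 Gap semantics (M = match, I = bases present only in the transcript, D = bases present only in the genome) and a positive-strand alignment; the function names are hypothetical and this is not code from PyHGVS.

```python
import re


def parse_gap(gap):
    """Split a GFF3 Gap attribute such as 'M185 I3 M250' into (op, length)."""
    return [(op, int(n)) for op, n in re.findall(r"([MID])(\d+)", gap)]


def aligned_lengths(ops):
    """Return (genome_bases, transcript_bases) covered by the alignment."""
    genome = sum(n for op, n in ops if op in "MD")
    transcript = sum(n for op, n in ops if op in "MI")
    return genome, transcript


def transcript_offset(ops, genome_offset):
    """Map a 0-based genome offset within the block to a transcript offset.

    Returns None when the genome base falls in a D segment (absent from
    the transcript).
    """
    if genome_offset < 0:
        raise ValueError("offset must be non-negative")
    g = t = 0  # bases consumed so far on the genome and on the transcript
    for op, n in ops:
        if op == "M":
            if genome_offset < g + n:
                return t + (genome_offset - g)
            g += n
            t += n
        elif op == "I":
            t += n  # transcript-only bases shift everything downstream
        else:  # "D"
            if genome_offset < g + n:
                return None
            g += n
    raise ValueError("offset lies beyond the aligned block")


ops = parse_gap("M185 I3 M250")
print(aligned_lengths(ops))
print(transcript_offset(ops, 184), transcript_offset(ops, 185))
```

With the block starting at 73385758, genome offset 184 is position 73385942 (the last base before the gap) and offset 185 is 73385943, which is where the 3-base discrepancy listed above begins.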

Need updated version of genes.refGene

Hi, I have some variants in HGVS format that use the NM_004364.4 transcript.

This transcript is not in the pyhgvs/data/genes.refGene file.

Can you please tell me how I can get the updated file or add this transcript to the file?

Thank you

Regards

update of genes.refGene files

I need to use an updated version of RefSeq. Is there a script available to download the current version of the 'genes.refGene' file, or should I build it by hand? Thank you. Angela

hgvs/pyhgvs/data/genes.refGen file

Dear maintainers,

How do I create this file: hgvs/pyhgvs/data/genes.refGen? The file is out of date and I want to update it.

I want to use the latest transcripts.

get_transcripts()

I am running the sample script from GitHub but using my local version of refGene and the human genome reference.

import pyhgvs as hgvs
import pyhgvs.utils as hgvs_utils
from pygr.seqdb import SequenceFileDB

genome = SequenceFileDB('hs37d5.fa')

with open('sorted.curated_geneTrack_wo_chr_sorted.refgene') as infile:
    transcripts = hgvs_utils.read_transcripts(infile)

def get_transcript(name):
    return transcripts.get(name)

chrom, offset, ref, alt = hgvs.parse_hgvs_name('NM_000352.3:c.215A>G', genome, get_transcript=get_transcript)
print(chrom, offset, ref, alt)

I am encountering this error:

  File "hgvs-convert.py", line 34, in <module>
    chrom, offset, ref, alt = hgvs.parse_hgvs_name('NM_000352.3:c.215A>G', genome, get_transcript=get_transcript)
  File "build/bdist.linux-x86_64/egg/pyhgvs/__init__.py", line 1356, in parse_hgvs_name
ValueError: transcript is required

hgvs_utils not installing?

OS X 10.11.3
python 2.7.10

Or am I supposed to install this separately?

I git cloned hgvs and ran python setup.py install

>>> import hgvs.utils as hgvs_utils
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named hgvs.utils

parse_hgvs_name() crashes if start>end

I have trouble converting chr19:g.10291325_10291323dup (rs147441348) into chrom, pos, ref, alt using parse_hgvs_name(). The traceback is

Traceback (most recent call last):
  File "XXX", line 76, in main
    get_transcript=get_transcript)
  File "xxxx/pyhgvs/__init__.py", line 1360, in parse_hgvs_name
    chrom, start, end, ref, alt = get_vcf_allele(hgvs, genome, transcript)
  File "xxxx/pyhgvs/__init__.py", line 672, in get_vcf_allele
    alt = ref[0] + alt
IndexError: string index out of range

pyhgvs is unable to retrieve the ref bases, which is likely caused by get_genomic_sequence(), which in turn does not support end coordinates smaller than start coordinates. Now, I am not sure this is wrong. However, I can paste chr19:g.10291325_10291323dup into Alamut in my case and find the variant. Exchanging start and end seems to yield the correct result, too.

how to get pdot

Hello, I see from example usage how to get HGVS cdot from REF/ALT. Is there a built-in function to get the pdot? Thanks.

Unable to parse a HGVS variant in format that VEP accepts

pyhgvs.InvalidHGVSName: Invalid HGVS cDNA allele "3252delC+3263insC"

VEP's web interface was able to translate that just fine, so I'm assuming that is the correct HGVS format. I gave it the variant as such:

ENST00000333535:c.3252delC+3263insC

a format which worked for all of my other variants. Just a PSA unless there is some older/newer format version for this kind of variant of which I am unaware.

Syntax error when trying to parse valid R variant

import hgvs.parser
hp = hgvs.parser.Parser()
hp.parse_hgvs_variant("NM13423:r.831_832ins831+1_831+60")
...
ometa.runtime.ParseError:
NM13423:r.831_832ins831+1_831+60
^
Parse error at line 1, column 20: Syntax error. trail: [rna_iupac rna rna_ins rna_edit r_posedit r_variant hgvs_variant]
...
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
...
hgvs.exceptions.HGVSParseError: NM13423:r.831_832ins831+1_831+60: char 20: Syntax error

error naming CFTR:c.1521_1523delCTT

Using hg18.fa and the provided genes.refGene in the git repo. I don't think this is a problem but let me know if you think it is.

chrom, offset, ref, alt = ('chr7', 116986881, 'TCTT', 'T')
transcript = get_transcript('NM_000492.3')
hgvs_name = hgvs.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript)
print(hgvs_name)
#returns NM_000492.3(CFTR):c.-133267_-133265delCTT

However, I don't think this is correct. Shouldn't it be CFTR:c.1521_1523delCTT?
Good news: I tried an alternative form of FDel508 and got the same result.

#NM_000492.3 is the transcript for CFTR
chrom, offset, ref, alt = ('chr7', 11698688, 'ATCT', 'A')
transcript = get_transcript('NM_000492.3')
hgvs_name = hgvs.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript)
print(hgvs_name)
#returns NM_000492.3(CFTR):c.-133267_-133265delCTT

So I think the position it is counting from is possibly off. Any thoughts? Thanks! Let me know if I can help contribute!
