
myvariant.info's Introduction

Introduction

This is a project that came out of the 1st NoB Hackathon.

The scope of this project is to aggregate existing annotations for genetic variants. Variant annotation has drawn a huge amount of effort from researchers, which has made many annotation resources available, but also left them very scattered. Integrating all of them is hard, so we first want to create a simple way to pool them together, with high-performance programmatic access. That way, further integration (e.g. deduplication, deriving higher-level annotations, etc.) becomes much easier.

From the discussion at the hackathon, we settled on a strategy summarized below:

A very simple rule to aggregate variant annotations
  • each variant is represented as a JSON document
  • the only requirement is that the key of the JSON document (the "_id" field) follows HGVS nomenclature. For example:
     {
       "_id": "chr1:g.35366C>T",
       "allele1": "C",
       "allele2": "T",
       "chrom": "chr1",
       "chromEnd": 35367,
       "chromStart": 35366,
       "func": "unknown",
       "rsid": "rs71409357",
       "snpclass": "single",
       "strand": "-"
     }
  • that way, we can then merge multiple annotations for the same variant into a single JSON document, with each annotation resource under its own field. A minimal merged example is sketched below.
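
For illustration only, such a merged document could look like the sketch below, reusing the fields from the example above; "sourceA" and "sourceB" are placeholder names for real annotation resources, and the split of fields between them is made up:

     {
       "_id": "chr1:g.35366C>T",
       "sourceA": {
         "rsid": "rs71409357",
         "snpclass": "single",
         "strand": "-"
       },
       "sourceB": {
         "chrom": "chr1",
         "chromStart": 35366,
         "chromEnd": 35367,
         "func": "unknown"
       }
     }
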
A powerful query-engine to access/query aggregated annotations

The query engine we developed for MyGene.info can be easily adapted to provide a high-performance and flexible query interface for programmatic access. MyGene.info follows the same spirit, but for gene annotations; it currently serves ~3M requests per month.

User contributions of variant annotations

User contribution is vital, given the scale of available (and growing) resources. The simple rule defined above makes merging a new annotation resource very easy: it is essentially a matter of writing a JSON importer (see the sketch below). And the query engine we have already built saves users the effort of building their own infrastructure, which provides an incentive to contribute.

Also note that the importer does not have to be written by the data provider; anyone who finds a useful resource can write one as well (of course, check that the data release license allows it).
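
As a rough idea of what writing an importer involves, here is a minimal, illustrative sketch; it does not follow the exact plugin interface used in this repo, and the tab-delimited input format, column names, and the "my_resource" key are all placeholders:

  # Illustrative JSON importer sketch: read a tab-delimited source file and
  # yield one JSON document per variant, keyed by an HGVS-style _id.
  # Input columns and the "my_resource" field name are placeholders.
  import csv

  def load_data(input_file):
      with open(input_file) as fh:
          for row in csv.DictReader(fh, delimiter="\t"):
              # SNV-style HGVS id for simplicity; indels need other HGVS forms.
              hgvs_id = "chr{chrom}:g.{pos}{ref}>{alt}".format(**row)
              yield {
                  "_id": hgvs_id,     # the only required key: an HGVS-based id
                  "my_resource": {    # annotations live under the resource's own field
                      "score": row["score"],
                  },
              }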

See the guideline below for contributing a JSON importer.

How to contribute

See this How to contribute document.

myvariant.info's People

Contributors

a3leong, cyrus0824, dmcgoldrick, erikyao, everaldorodrigo, gerikson, gtsueng, jal347, jordeu, kevinxin90, mmayers12, newgene, puva, sirloon, stuppie, zcqian

myvariant.info's Issues

change example query

In the "Query Examples" of the myvariant.info home page, we currently show http://myvariant.info/v1/variant/chr1:g.35367G>A for annotation retrieval. But, that specific variant has a rather limited set of annotation sources. I'd suggest choosing another variant that better highlights as many of the annotation resources as possible.

CIViC auto upload

CIViC is loaded through API queries. The load should be triggered every month.

Generate and store list of _id in s3

Output file: a list of all _id values for each MyVariant assembly. This feature was deactivated in 2cf6144 after switching to the cold/hot collection design.

With cold/hot, since we never have the full merged collection in MongoDB, the only efficient way to generate such a list is to use the cache files, both cold and hot, then sort/uniq them (as some hot _ids are already in the cold collection) to create the output file.

Note: this file is used by the ClinGen team to generate CAIDs for MyVariant.
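
For reference, the sort/uniq step could be as simple as the sketch below (file names are placeholders and this is not the actual build code; for very large id lists an external sort may be preferable):

  # Sketch: merge the cold and hot _id cache files (one _id per line) into a
  # single sorted, de-duplicated output file. File names are placeholders.
  def merge_id_caches(cold_path, hot_path, out_path):
      ids = set()
      for path in (cold_path, hot_path):
          with open(path) as fh:
              ids.update(line.strip() for line in fh if line.strip())
      with open(out_path, "w") as out:
          out.writelines(_id + "\n" for _id in sorted(ids))  # i.e. sort | uniq

  merge_id_caches("cold_ids.txt", "hot_ids.txt", "myvariant_hg19_ids.txt")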

MyVariant.info release notes should have anchors for each release

MyVariant.info release notes are here:

http://docs.myvariant.info/en/latest/doc/release_changes.html

It would be handy to add the anchor (for the direct URL) to each release, something like this:

http://docs.myvariant.info/en/latest/doc/release_changes.html#release-20190226

and even deeper, into each of the hg19 and hg38 release notes:

http://docs.myvariant.info/en/latest/doc/release_changes.html#release-20190226-hg19
http://docs.myvariant.info/en/latest/doc/release_changes.html#release-20190226-hg38

When the hash exists, it should expand the specific release note content.

The rendering of the "anchor" can be made the same as the other anchors on this page, e.g. this one:

http://docs.myvariant.info/en/latest/doc/release_changes.html#myvariant-releases

(the anchor icon will show up on mouse-over)

The same changes can be applied to docs.mygene.info and docs.mychem.info as well.

facet query on cadd fails (unittest MyVariantTest.test_query_facets)

http://myvariant.info/v1/query?q=cadd.gene.gene_id:ENSG00000113368&facets=cadd.polyphen.cat&size=0

gives

{
"success": false,
"error": "Could not execute query due to the following exception(s): ['illegal_argument_exception Fielddata is disabled on text fields by default. Set fielddata=true on [cadd.polyphen.cat] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.']"
}

Need to update the CADD mapping (and rebuild the pre-merge/cold collection).
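
For reference, a sketch of the kind of mapping change implied by the error message, written as a Python-dict mapping fragment in the BioThings style; the actual CADD mapping in this repo may be structured differently:

  # Hypothetical mapping fragment: index cadd.polyphen.cat as a keyword field
  # so facets/aggregations work without enabling fielddata on a text field.
  cadd_mapping_fragment = {
      "cadd": {
          "properties": {
              "polyphen": {
                  "properties": {
                      "cat": {"type": "keyword"}
                  }
              }
          }
      }
  }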

Production stability

Hi,

Great work with the variant info project.
In fact, I was part of the hackathon where you came up with this.

I am wondering how stable this is now and what your future plans are.
Any plans to integrate with mygene.info, or to make it a more stable service on its own?

Thanks,
Nikhil

usage stats on front page not updating

minor note -- I just noticed that the usage stats on the front page of mygene.info are updating, but not those on myvariant.info (the last stamp is 2016-11-15...)

how to query a position with a POST

I would like to query the following variants using POST (i.e. on http://myvariant.info/v1/query):

q="chr1:54844G>A,chr1:61987A>G,chr1:61989G>C,chr1:86018C>G,chr1:86303G>T"

I've tried the above parameters, but it returns the following:

[
  {
    "query": "chr1:54844G>A",
    "notfound": true
  }
]

I understand that I also need to input a scope in order to make it work but I'm not sure what the scope should be in this case...

Thanks
Ismail
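
A minimal sketch of one way such a batch POST could look, assuming the goal is to match against the _id field; note that MyVariant _id values use HGVS genomic notation, which includes the "g." part (e.g. chr1:g.54844G>A):

  # Sketch: batch query via POST, matching the query terms against the _id
  # field. The ids below are the ones from the question, reformatted with the
  # "g." part; whether they exist in the index is not verified here.
  import requests

  hgvs_ids = ["chr1:g.54844G>A", "chr1:g.61987A>G", "chr1:g.61989G>C"]
  resp = requests.post(
      "http://myvariant.info/v1/query",
      data={
          "q": ",".join(hgvs_ids),   # comma-separated batch query terms
          "scopes": "_id",           # field(s) to match the terms against
          "fields": "dbsnp.rsid",    # optional: restrict the returned fields
      },
  )
  for hit in resp.json():
      print(hit.get("query"), "->", "not found" if hit.get("notfound") else hit.get("_id"))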

load data from ClinGen VCI database

Matt Wright and Jimmy Zhen from the ClinGen team seemed interested in this idea at the CIViC hackathon. Need to reach out to them for more info on logistics...

ExAC mapping

The mapping file for ExAC contains a small problem: the ac_hom field should be put under 'ac' rather than 'hom'.
Potential solutions:

  • change the mapping

  • add an additional field called 'ac_hom' under 'hom'

Better format when using both always_list and allow_null options?

In the recent release, I noticed that there are some handy new features, including the always_list and allow_null options. But when they are used in combination, the result is probably not in the nicest format: instead of returning an empty list [] when there is no data, it returns a list containing a null object, like so: [null].

This causes some confusion on the client side, since you usually check whether the returned list is empty, as opposed to checking whether each element in the list is null.

A sample request to reproduce this error would be:

https://myvariant.info/v1/query?q=rs12131234&fields=dbsnp&always_list=dbsnp.gene&allow_null=dbsnp.gene

I'm wondering if it's possible to change this behaviour? Thanks.
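
Until/unless the behaviour changes, a client-side workaround could be as simple as filtering out the null entries, e.g.:

  # Sketch of a client-side workaround: treat [null] the same as [].
  # Assumes dbsnp comes back as a single object per hit.
  import requests

  resp = requests.get(
      "https://myvariant.info/v1/query",
      params={
          "q": "rs12131234",
          "fields": "dbsnp",
          "always_list": "dbsnp.gene",
          "allow_null": "dbsnp.gene",
      },
  )
  for hit in resp.json().get("hits", []):
      genes = hit.get("dbsnp", {}).get("gene", []) or []
      genes = [g for g in genes if g is not None]   # [null] -> []
      print(hit["_id"], genes)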

Which config file?

When installing myvariant and testing, it asks for a BioThings config file. Which file should we use or how should we configure it? Thanks.

The logic of get_pos_start_end and _normalize_vcf is conflicting

Use case: normalize a VCF record before using the get_pos_start_end function.

Problem:
In the case of deletion: REF -> TTTCTTTTTCTTTTTCTTTTTCTTTCTT, ALT -> TG
_normalize_vcf would trim the first T from both REF and ALT

However, get_pos_start_end asserts that the first nucleotide of REF and ALT is the same;
see: https://github.com/biothings/myvariant.info/blob/master/src/utils/hgvs.py#L150

These two functions cannot be used together to handle deletion cases.
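
To illustrate the conflict, here are simplified stand-ins for the two functions; the exact behaviour of _normalize_vcf and get_pos_start_end in src/utils/hgvs.py may differ in detail:

  # Simplified stand-ins illustrating the conflict described above.
  def normalize_vcf_like(ref, alt):
      # _normalize_vcf trims the shared leading base in this deletion case.
      if ref[0] == alt[0]:
          return ref[1:], alt[1:]
      return ref, alt

  def get_pos_start_end_like(ref, alt):
      # get_pos_start_end expects REF and ALT to still share their first base.
      assert ref[0] == alt[0], "REF and ALT should share the first nucleotide"
      # ... position arithmetic would follow here ...

  ref, alt = "TTTCTTTTTCTTTTTCTTTTTCTTTCTT", "TG"
  ref2, alt2 = normalize_vcf_like(ref, alt)   # the leading T is trimmed from both
  get_pos_start_end_like(ref2, alt2)          # raises AssertionError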

query variants with genename

Hi,
One task I'd like to run with myvariant.info is to return all variants in a gene, for example TP53, so I tried
http://myvariant.info/v1/query?q=TP53&fields=_id
which returns a count of 5918.
I also tried querying with the Ensembl ID
http://myvariant.info/v1/query?q=ENSG00000141510&fields=_id
which returns nothing.
Then I tried
http://myvariant.info/v1/query?q=dbnsfp.ensembl.geneid:ENSG00000141510&fields=_id
http://myvariant.info/v1/query?q=cadd.gene.gene_id:ENSG00000141510&fields=_id
which return 3318 and 4539 results, respectively.

So the question I have is: when I just search for TP53, which fields are searched exactly? It seems the default query in Elasticsearch searches the _all field? And why can't I get any results back with just the Ensembl ID? Is a range query a better way to get all variants related to a gene, or what is the best way to do this task with the myvariant.info API?

Thank you very much
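
One possible approach, sketched with the myvariant Python client: query the gene-id fields explicitly (combined with OR) and page through all hits with fetch_all. The two field names below are the ones from the queries above; other sources may use different gene-id fields, so this is not guaranteed to be exhaustive:

  # Sketch: fetch all variant _ids annotated with a given Ensembl gene id in
  # the dbNSFP and CADD fields (other sources may need additional fields).
  import myvariant

  mv = myvariant.MyVariantInfo()
  gene_id = "ENSG00000141510"
  q = "dbnsfp.ensembl.geneid:{0} OR cadd.gene.gene_id:{0}".format(gene_id)

  # fetch_all=True returns a generator that pages through all matching hits.
  ids = [hit["_id"] for hit in mv.query(q, fields="_id", fetch_all=True)]
  print(len(ids))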

load VICC-harmonized data

From Alex Wagner: this link https://s3-us-west-2.amazonaws.com/g2p-0.10/index.html has the current release of the VICC-harmonized data (described in https://www.biorxiv.org/content/early/2018/07/11/366856). It is subject to change as that manuscript goes through peer review, but once that's done and the data set is finalized, it seems like a good source to import. (Obviously we already have CIViC data directly, but this resource will provide access to several other sources as well, in a standardized format.)

cc @ahwagner

Discrepancies in returned COSMIC ids

Hi!

We've noticed that there seem to be some inconsistencies with the COSMIC data being returned in the variant annotation service.

Here's an example query:

GET myvariant.info/v1/variant/chr4:g.55141036T>C?fields=cosmic,mutdb

And the response:

{
    "_id": "chr4:g.55141036T>C",
    "_version": 2,
    "cosmic": {
        "alt": "C",
        "chrom": "4",
        "cosmic_id": "COSM1430077",
        "hg19": {
            "end": 55141036,
            "start": 55141036
        },
        "mut_freq": 0.14,
        "mut_nt": "T>C",
        "ref": "T",
        "tumor_site": "large_intestine"
    },
    "mutdb": {
        "alt": "C",
        "chrom": "4",
        "cosmic_id": "85787",
        "hg19": {
            "end": 55141036,
            "start": 55141036
        },
        "mutpred_score": -1,
        "ref": "T",
        "rsid": null,
        "strand": "p"
    }
}

The cosmic id returned under the cosmic top-level key (body['cosmic']['cosmic_id']) doesn't match the cosmic id returned under the mutdb top-level key (body['mutdb']['cosmic_id']). Additionally, the cosmic id returned in the cosmic section isn't a valid cosmic id at all, while the one in the mutdb section appears to be the correct one for the variant in question.

I assume this is likely to come from discrepancies in the underlying data sources, but it was a little surprising to find a non-existent cosmic id in the cosmic section.

variant normalization

Hi,
We are wondering how variant normalization is done in myvariant.info. When you import variants from each database, do you do any sort of internal variant normalization, or do you take the chr, pos, ref, alt directly from the source?

Thanks

Unable to run clinvar_xml_parser dataloader

The clinvar_xml_parser.py data loader is referencing a clinvar or clinvar1 import that is not listed in the requirements:
https://github.com/SuLab/myvariant.info/blob/master/src/dataload/contrib/clinvar/clinvar_xml_parser.py#L5

It has changed from clinvar to clinvar1 -- is the clinvar library that provides the parseString() call available from you, or is it a separate third-party lib to be installed?

https://github.com/SuLab/myvariant.info/blob/master/src/dataload/contrib/clinvar/clinvar_xml_parser.py#L315
record_parsed = clinvar1.parseString(record, silence=1)

dbSNP download site change (Maybe?)

Currently, the newest release of dbSNP is v152. Our latest version in MyVariant.info is v151.

We download from: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/
The last update time for the file is: 4/22/2018 (v151)

v152 is stored in: ftp://ftp.ncbi.nih.gov/snp/latest_release/VCF

Also, from v152, dbSNP provides the JSON version of the data dump:
ftp://ftp.ncbi.nih.gov/snp/latest_release/JSON

Related post regarding the change from dbSNP: https://ncbiinsights.ncbi.nlm.nih.gov/2017/07/07/dbsnp-redesign-supports-future-data-expansion/

snpeff ann field is sometimes a list, sometimes an object

The format of the ann field, nested under snpeff, is a list for variants like chr1:g.35367G>A, and an object for variants like chr7:g.140453136A>T. When parsing the output, this complicates the mapping of keys and values. Was this intended?

Thanks!
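
Until/unless the output format is unified, a small client-side workaround sketch is to wrap a single object into a one-element list so downstream parsing can always assume a list:

  # Sketch: always get snpeff.ann back as a list, whether the API returned a
  # single object or a list of objects for the given variant.
  import requests

  def get_snpeff_ann(hgvs_id):
      doc = requests.get(
          "http://myvariant.info/v1/variant/" + hgvs_id,
          params={"fields": "snpeff.ann"},
      ).json()
      ann = doc.get("snpeff", {}).get("ann", [])
      return ann if isinstance(ann, list) else [ann]

  print(len(get_snpeff_ann("chr7:g.140453136A>T")))   # single object -> list of 1
  print(len(get_snpeff_ann("chr1:g.35367G>A")))       # already a list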

live query API does not work for some ClinVar RCVs

RCV000008604, RCV000008605, RCV000008606 and RCV000008607 share one variant (ClinVar variation 8131, also called the measureSet id and variant id in their XML file). The API works for RCV000008604 only, but not for any of the others. The input was given as mv.querymany(['RCV000008604'], scopes='clinvar.rcv_accession', fields='clinvar.clinvar_id').

clingen.caid should be indexed

We now have the ClinGen CA id loaded for the hg38 index; we should have this clingen.caid field indexed (as "string_lowercase").

Cosmic mutation frequency information seems limited/arbitrary

Thank you for this amazing resource!

We are in the process of adding selected relevant information from myvariant.info to CIViC (civicdb.org).

While considering options, we hoped to add COSMIC mutation frequency. But the mutation frequency available appears to be the frequency for a single tumor site, and this seems to be chosen arbitrarily from several possibilities?

Consider this example (which seems representative of other variants in myvariant.info):
http://myvariant.info/v1/variant/chr7:g.140453136A%3ET

This is BRAF V600E.
http://cancer.sanger.ac.uk/cosmic/mutation/overview?id=476

The mutation frequency information returned for COSMIC is:
"mut_freq": 2.83,
"tumor_site": "biliary_tract"

[Attached screenshot: myvariant example]

This seems odd. How is this being determined? Would it be possible to determine the overall mutation frequency across all tumor_sites, and then the frequency for each tumor_site, and perhaps return the top site(s) and their frequencies?

Our relevant CIViC github issues are:
griffithlab/civic-server#243
griffithlab/civic-server#38

For now we will move on without using the COSMIC info, but it would be great to have more options to select from here.
