
myvariant.info's Introduction

Introduction

This is a project that came out of the 1st NoB Hackathon.

The scope of this project is to aggregate existing annotations for genetic variants. Variant annotation has drawn a huge amount of effort from researchers, which has made many annotation resources available, but also left them very scattered. Integrating all of them is hard, so we first want to create a simple way to pool them together, with high-performance programmatic access. That way, further integration (e.g. deduplication, deriving higher-level annotations, etc.) becomes much easier.

From the discussion at the hackathon, we settled on a strategy summarized below:

A very simple rule to aggregate variant annotations
  • each variant is represented as a JSON document
  • the only requirement is that the key of the JSON document (the "_id" field) follows HGVS nomenclature. For example:
     {
       "_id": "chr1:g.35366C>T",
       "allele1": "C",
       "allele2": "T",
       "chrom": "chr1",
       "chromEnd": 35367,
       "chromStart": 35366,
       "func": "unknown",
       "rsid": "rs71409357",
       "snpclass": "single",
       "strand": "-"
     }
  • that way, we can then merge multiple annotations for the same variant into a single JSON document, with each annotation resource under its own field. A minimal merged example is sketched below.
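
For illustration only, such a merged document could look like the sketch below, reusing the fields from the example above; "sourceA" and "sourceB" are placeholder names for real annotation resources, and the split of fields between them is made up:

     {
       "_id": "chr1:g.35366C>T",
       "sourceA": {
         "rsid": "rs71409357",
         "snpclass": "single",
         "strand": "-"
       },
       "sourceB": {
         "chrom": "chr1",
         "chromStart": 35366,
         "chromEnd": 35367,
         "func": "unknown"
       }
     }
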
A powerful query-engine to access/query aggregated annotations

The query engine we developed for MyGene.info can be easily adapted to provide a high-performance and flexible query interface for programmatic access. MyGene.info follows the same spirit, but for gene annotations; it currently serves ~3M requests per month.

User contributions of variant annotations

User contribution is vital, given the scale of available (and growing) resources. The simple rule defined above makes merging a new annotation resource very easy: it is essentially a matter of writing a JSON importer (see the sketch below). And the query engine we have already built saves users the effort of building their own infrastructure, which provides an incentive to contribute.

Also note that the importer does not have to be written by the data provider; anyone who finds a useful resource can write one as well (of course, check that the data release license allows it).
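
As a rough idea of what writing an importer involves, here is a minimal, illustrative sketch; it does not follow the exact plugin interface used in this repo, and the tab-delimited input format, column names, and the "my_resource" key are all placeholders:

  # Illustrative JSON importer sketch: read a tab-delimited source file and
  # yield one JSON document per variant, keyed by an HGVS-style _id.
  # Input columns and the "my_resource" field name are placeholders.
  import csv

  def load_data(input_file):
      with open(input_file) as fh:
          for row in csv.DictReader(fh, delimiter="\t"):
              # SNV-style HGVS id for simplicity; indels need other HGVS forms.
              hgvs_id = "chr{chrom}:g.{pos}{ref}>{alt}".format(**row)
              yield {
                  "_id": hgvs_id,     # the only required key: an HGVS-based id
                  "my_resource": {    # annotations live under the resource's own field
                      "score": row["score"],
                  },
              }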

See the guideline below for contributing a JSON importer.

How to contribute

See this How to contribute document.

myvariant.info's People

Contributors

a3leong, cyrus0824, dmcgoldrick, erikyao, everaldorodrigo, gerikson, gtsueng, jal347, jordeu, kevinxin90, mmayers12, newgene, puva, sirloon, stuppie, zcqian

myvariant.info's Issues

change example query

In the "Query Examples" of the myvariant.info home page, we currently show http://myvariant.info/v1/variant/chr1:g.35367G>A for annotation retrieval. But, that specific variant has a rather limited set of annotation sources. I'd suggest choosing another variant that better highlights as many of the annotation resources as possible.

CIViC auto upload

CIViC is loaded through API queries. The load should be triggered every month.

Generate and store list of _id in s3

Output file: a list of all _id values for each MyVariant assembly. This feature was deactivated in 2cf6144 after switching to the cold/hot collection design.

With cold/hot, since we never have the full merged collection in MongoDB, the only efficient way to generate such a list is to use the cache files, both cold and hot, then sort/uniq them (as some hot _ids are already in the cold collection) to create the output file.

Note: this file is used by the ClinGen team to generate CAIDs for MyVariant.
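
For reference, the sort/uniq step could be as simple as the sketch below (file names are placeholders and this is not the actual build code; for very large id lists an external sort may be preferable):

  # Sketch: merge the cold and hot _id cache files (one _id per line) into a
  # single sorted, de-duplicated output file. File names are placeholders.
  def merge_id_caches(cold_path, hot_path, out_path):
      ids = set()
      for path in (cold_path, hot_path):
          with open(path) as fh:
              ids.update(line.strip() for line in fh if line.strip())
      with open(out_path, "w") as out:
          out.writelines(_id + "\n" for _id in sorted(ids))  # i.e. sort | uniq

  merge_id_caches("cold_ids.txt", "hot_ids.txt", "myvariant_hg19_ids.txt")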

MyVariant.info release notes should have anchors for each release

MyVariant.info release notes are here:

http://docs.myvariant.info/en/latest/doc/release_changes.html

It would be handy to add the anchor (for the direct URL) to each release, something like this:

http://docs.myvariant.info/en/latest/doc/release_changes.html#release-20190226

and even deeper, into each of the hg19 and hg38 release notes:

http://docs.myvariant.info/en/latest/doc/release_changes.html#release-20190226-hg19
http://docs.myvariant.info/en/latest/doc/release_changes.html#release-20190226-hg38

When the hash exists, it should expand the specific release note content.

The rendering of the "anchor" can be made the same as the other anchors on this page, e.g. this one:

http://docs.myvariant.info/en/latest/doc/release_changes.html#myvariant-releases

(the anchor icon will show up on mouse-over)

The same changes can be applied to docs.mygene.info and docs.mychem.info as well.

facet query on cadd fails (unittest MyVariantTest.test_query_facets)

http://myvariant.info/v1/query?q=cadd.gene.gene_id:ENSG00000113368&facets=cadd.polyphen.cat&size=0

gives

{
"success": false,
"error": "Could not execute query due to the following exception(s): ['illegal_argument_exception Fielddata is disabled on text fields by default. Set fielddata=true on [cadd.polyphen.cat] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.']"
}

Need to update the CADD mapping (and rebuild the pre-merge/cold collection).
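
For reference, a sketch of the kind of mapping change implied by the error message, written as a Python-dict mapping fragment in the BioThings style; the actual CADD mapping in this repo may be structured differently:

  # Hypothetical mapping fragment: index cadd.polyphen.cat as a keyword field
  # so facets/aggregations work without enabling fielddata on a text field.
  cadd_mapping_fragment = {
      "cadd": {
          "properties": {
              "polyphen": {
                  "properties": {
                      "cat": {"type": "keyword"}
                  }
              }
          }
      }
  }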

Production stability

Hi,

Great work with the variant info project.
In fact, I was part of the hackathon where you came up with this.

I am wondering how stable this is now and what your future plans are.
Any plans to integrate with mygene.info, or to make it a more stable service on its own?

Thanks,
Nikhil

usage stats on front page not updating

minor note -- I just noticed that the usage stats on the front page of mygene.info are updating, but not those on myvariant.info (the last stamp is 2016-11-15...)

how to query a position with a POST

I would like to query the following variants using POST (i.e. on http://myvariant.info/v1/query):

q="chr1:54844G>A,chr1:61987A>G,chr1:61989G>C,chr1:86018C>G,chr1:86303G>T"

I've tried the above parameters, but it returns the following:

[
  {
    "query": "chr1:54844G>A",
    "notfound": true
  }
]

I understand that I also need to input a scope in order to make it work but I'm not sure what the scope should be in this case...

Thanks
Ismail
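
A minimal sketch of one way such a batch POST could look, assuming the goal is to match against the _id field; note that MyVariant _id values use HGVS genomic notation, which includes the "g." part (e.g. chr1:g.54844G>A):

  # Sketch: batch query via POST, matching the query terms against the _id
  # field. The ids below are the ones from the question, reformatted with the
  # "g." part; whether they exist in the index is not verified here.
  import requests

  hgvs_ids = ["chr1:g.54844G>A", "chr1:g.61987A>G", "chr1:g.61989G>C"]
  resp = requests.post(
      "http://myvariant.info/v1/query",
      data={
          "q": ",".join(hgvs_ids),   # comma-separated batch query terms
          "scopes": "_id",           # field(s) to match the terms against
          "fields": "dbsnp.rsid",    # optional: restrict the returned fields
      },
  )
  for hit in resp.json():
      print(hit.get("query"), "->", "not found" if hit.get("notfound") else hit.get("_id"))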

load data from ClinGen VCI database

Matt Wright and Jimmy Zhen from the ClinGen team seemed interested in this idea at the CIViC hackathon. Need to reach out to them for more info on logistics...

ExAC mapping

The mapping file for ExAC contains a small problem: the ac_hom field should be put under 'ac' rather than 'hom'.
Potential solutions:

  • change the mapping

  • add an additional field called 'ac_hom' under 'hom'

Better format when using both always_list and allow_null options?

In the recent release, I noticed that there are some handy new features, including the always_list and allow_null options. But when they are used in combination, the result is probably not in the nicest format: instead of returning an empty list [] when there is no data, it returns a list containing a null object, like so: [null].

This causes some confusion on the client side, since you usually check whether the returned list is empty, as opposed to checking whether each element in the list is null.

A sample request to reproduce this error would be:

https://myvariant.info/v1/query?q=rs12131234&fields=dbsnp&always_list=dbsnp.gene&allow_null=dbsnp.gene

I'm wondering if it's possible to change this behaviour? Thanks.
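
Until/unless the behaviour changes, a client-side workaround could be as simple as filtering out the null entries, e.g.:

  # Sketch of a client-side workaround: treat [null] the same as [].
  # Assumes dbsnp comes back as a single object per hit.
  import requests

  resp = requests.get(
      "https://myvariant.info/v1/query",
      params={
          "q": "rs12131234",
          "fields": "dbsnp",
          "always_list": "dbsnp.gene",
          "allow_null": "dbsnp.gene",
      },
  )
  for hit in resp.json().get("hits", []):
      genes = hit.get("dbsnp", {}).get("gene", []) or []
      genes = [g for g in genes if g is not None]   # [null] -> []
      print(hit["_id"], genes)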

Which config file?

When installing myvariant and testing, it asks for a BioThings config file. Which file should we use or how should we configure it? Thanks.

The logic of get_pos_start_end and _normalize_vcf is conflicting

Use case: normalize a VCF record before using the get_pos_start_end function.

Problem:
In the case of deletion: REF -> TTTCTTTTTCTTTTTCTTTTTCTTTCTT, ALT -> TG
_normalize_vcf would trim the first T from both REF and ALT

However, get_pos_start_end asserts that the first nucleotide of REF and ALT is the same;
see: https://github.com/biothings/myvariant.info/blob/master/src/utils/hgvs.py#L150

These two functions cannot be used together to handle deletion cases.
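
To illustrate the conflict, here are simplified stand-ins for the two functions; the exact behaviour of _normalize_vcf and get_pos_start_end in src/utils/hgvs.py may differ in detail:

  # Simplified stand-ins illustrating the conflict described above.
  def normalize_vcf_like(ref, alt):
      # _normalize_vcf trims the shared leading base in this deletion case.
      if ref[0] == alt[0]:
          return ref[1:], alt[1:]
      return ref, alt

  def get_pos_start_end_like(ref, alt):
      # get_pos_start_end expects REF and ALT to still share their first base.
      assert ref[0] == alt[0], "REF and ALT should share the first nucleotide"
      # ... position arithmetic would follow here ...

  ref, alt = "TTTCTTTTTCTTTTTCTTTTTCTTTCTT", "TG"
  ref2, alt2 = normalize_vcf_like(ref, alt)   # the leading T is trimmed from both
  get_pos_start_end_like(ref2, alt2)          # raises AssertionError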

query variants with genename

Hi,
One task I'd like to run with myvariant.info is to return all variants in a gene, for example TP53, so I tried
http://myvariant.info/v1/query?q=TP53&fields=_id
which returns a count of 5918.
I also tried querying with the Ensembl ID
http://myvariant.info/v1/query?q=ENSG00000141510&fields=_id
which returns nothing.
Then I tried
http://myvariant.info/v1/query?q=dbnsfp.ensembl.geneid:ENSG00000141510&fields=_id
http://myvariant.info/v1/query?q=cadd.gene.gene_id:ENSG00000141510&fields=_id
which return 3318 and 4539 results, respectively.

So the question I have is: when I just search for TP53, which fields are searched exactly? It seems the default query in Elasticsearch searches the _all field? And why can't I get any results back with just the Ensembl ID? Is a range query a better way to get all variants related to a gene, or what is the best way to do this task with the myvariant.info API?

Thank you very much
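
One possible approach, sketched with the myvariant Python client: query the gene-id fields explicitly (combined with OR) and page through all hits with fetch_all. The two field names below are the ones from the queries above; other sources may use different gene-id fields, so this is not guaranteed to be exhaustive:

  # Sketch: fetch all variant _ids annotated with a given Ensembl gene id in
  # the dbNSFP and CADD fields (other sources may need additional fields).
  import myvariant

  mv = myvariant.MyVariantInfo()
  gene_id = "ENSG00000141510"
  q = "dbnsfp.ensembl.geneid:{0} OR cadd.gene.gene_id:{0}".format(gene_id)

  # fetch_all=True returns a generator that pages through all matching hits.
  ids = [hit["_id"] for hit in mv.query(q, fields="_id", fetch_all=True)]
  print(len(ids))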

load VICC-harmonized data

From Alex Wagner: this link https://s3-us-west-2.amazonaws.com/g2p-0.10/index.html has the current release of the VICC-harmonized data (described in https://www.biorxiv.org/content/early/2018/07/11/366856). It is subject to change as that manuscript goes through peer review, but once that's done and the data set is finalized, it seems like a good source to import. (Obviously we already have CIViC data directly, but this resource will provide access to several other sources as well, in a standardized format.)

cc @ahwagner

Discrepancies in returned COSMIC ids

Hi!

We've noticed that there seem to be some inconsistencies with the COSMIC data being returned in the variant annotation service.

Here's an example query:

GET myvariant.info/v1/variant/chr4:g.55141036T>C?fields=cosmic,mutdb

And the response:

{
    "_id": "chr4:g.55141036T>C",
    "_version": 2,
    "cosmic": {
        "alt": "C",
        "chrom": "4",
        "cosmic_id": "COSM1430077",
        "hg19": {
            "end": 55141036,
            "start": 55141036
        },
        "mut_freq": 0.14,
        "mut_nt": "T>C",
        "ref": "T",
        "tumor_site": "large_intestine"
    },
    "mutdb": {
        "alt": "C",
        "chrom": "4",
        "cosmic_id": "85787",
        "hg19": {
            "end": 55141036,
            "start": 55141036
        },
        "mutpred_score": -1,
        "ref": "T",
        "rsid": null,
        "strand": "p"
    }
}

The cosmic id returned under the cosmic top-level key (body['cosmic']['cosmic_id']) doesn't match the cosmic id returned under the mutdb top-level key (body['mutdb']['cosmic_id']). Additionally, the cosmic id returned in the cosmic section isn't a valid cosmic id at all, while the one in the mutdb section appears to be the correct one for the variant in question.

I assume this is likely to come from discrepancies in the underlying data sources, but it was a little surprising to find a non-existent cosmic id in the cosmic section.

variant normalization

Hi,
We are wondering how variant normalization is done in myvariant.info. When you import variants from each database, do you do any sort of internal variant normalization, or do you take the chr, pos, ref, alt directly from the source?

Thanks

Unable to run clinvar_xml_parser dataloader

The clinvar_xml_parser.py data loader is referencing a clinvar or clinvar1 import that is not listed in the requirements:
https://github.com/SuLab/myvariant.info/blob/master/src/dataload/contrib/clinvar/clinvar_xml_parser.py#L5

It has changed from clinvar to clinvar1 -- is the clinvar library that provides the parseString() call available from you, or is it a separate third-party lib to be installed?

https://github.com/SuLab/myvariant.info/blob/master/src/dataload/contrib/clinvar/clinvar_xml_parser.py#L315
record_parsed = clinvar1.parseString(record, silence=1)

dbSNP download site change (Maybe?)

Currently, the newest release of dbSNP is v152. Our latest version in MyVariant.info is v151.

We download from: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/
The last update time for the file is: 4/22/2018 (v151)

v152 is stored in: ftp://ftp.ncbi.nih.gov/snp/latest_release/VCF

Also, from v152, dbSNP provides the JSON version of the data dump:
ftp://ftp.ncbi.nih.gov/snp/latest_release/JSON

Related post regarding the change from dbSNP: https://ncbiinsights.ncbi.nlm.nih.gov/2017/07/07/dbsnp-redesign-supports-future-data-expansion/

snpeff ann field is sometimes a list, sometimes an object

The format of the ann field, nested under snpeff, is a list for variants like chr1:g.35367G>A, and an object for variants like chr7:g.140453136A>T. When parsing the output, this complicates the mapping of keys and values. Was this intended?

Thanks!
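
Until/unless the output format is unified, a small client-side workaround sketch is to wrap a single object into a one-element list so downstream parsing can always assume a list:

  # Sketch: always get snpeff.ann back as a list, whether the API returned a
  # single object or a list of objects for the given variant.
  import requests

  def get_snpeff_ann(hgvs_id):
      doc = requests.get(
          "http://myvariant.info/v1/variant/" + hgvs_id,
          params={"fields": "snpeff.ann"},
      ).json()
      ann = doc.get("snpeff", {}).get("ann", [])
      return ann if isinstance(ann, list) else [ann]

  print(len(get_snpeff_ann("chr7:g.140453136A>T")))   # single object -> list of 1
  print(len(get_snpeff_ann("chr1:g.35367G>A")))       # already a list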

live query API does not work for some ClinVar RCVs

RCV000008604, RCV000008605, RCV000008606 and RCV000008607 share one variant (ClinVar variation 8131, also called the measureSet id and variant id in their XML file). The API works for RCV000008604 only, but not for any of the others. The input was given as mv.querymany(['RCV000008604'], scopes='clinvar.rcv_accession', fields='clinvar.clinvar_id').

clingen.caid should be indexed

We now have the ClinGen CA id loaded for the hg38 index; we should have this clingen.caid field indexed (as "string_lowercase").

Cosmic mutation frequency information seems limited/arbitrary

Thank you for this amazing resource!

We are in the process of adding selected relevant information from myvariant.info to CIViC (civicdb.org).

While considering options, we hoped to add COSMIC mutation frequency. But the mutation frequency available appears to be the frequency for a single tumor site, and this seems to be chosen arbitrarily from several possibilities?

Consider this example (which seems representative of other variants in myvariant.info):
http://myvariant.info/v1/variant/chr7:g.140453136A%3ET

This is BRAF V600E.
http://cancer.sanger.ac.uk/cosmic/mutation/overview?id=476

The mutation frequency information returned for COSMIC is:
"mut_freq": 2.83,
"tumor_site": "biliary_tract"

[Attached screenshot: myvariant example]

This seems odd. How is this being determined? Would it be possible to determine the overall mutation frequency across all tumor_sites, and then the frequency for each tumor_site, and perhaps return the top site(s) and their frequencies?

Our relevant CIViC github issues are:
griffithlab/civic-server#243
griffithlab/civic-server#38

For now we will move on without using the COSMIC info, but it would be great to have more options to select from here.
