mychem.info's People

Contributors

ctrl-schaff, cyrus0824, dylanwelzel, erikyao, everaldorodrigo, greg-k-taylor, jadesara, kevinxin90, neuralflux, newgene, polyg314, ravila4, sirloon, stuppie, zcqian

mychem.info's Issues

mychem.info/v1/drug/ returns 10 records no matter how many inchi keys are supplied

Hi Biothings team. I am trying to retrieve records in large batches using multiple InChIKeys. The Swagger doc says that you can submit up to 1000 keys at a time in a POST request, but no matter how many I supply, the results include only 10 records.

Example POST I am making, with 17 unique InChIKeys that all resolve individually:

fields = 'drugbank.targets,drugbank.drugbank_id,unii.unii,drugcentral.drug_use,drugcentral.bioactivity'

ids = 'FQXGHZNSUOHCLO-IZLXSQMJSA-N,MKWYFZFMAMBPQK-UHFFFAOYSA-J,SXYZQZLHAIHKKY-GSTUPEFVSA-N,WMKGGPCROCCUDY-HEEUSZRZSA-N,XXKJBDTZZBQBCP-MRVPVSSYSA-N,WQVJHHACXVLGBL-UHFFFAOYSA-N,FLOSMHQXBMRNHR-UHFFFAOYSA-N,FLOSMHQXBMRNHR-QPJJXVBHSA-N,ZMJBYMUCKBYSCP-UHFFFAOYSA-N,IMONTRJLAWHYGT-REGVOWLASA-N,MSRILKIQRXUYCT-UHFFFAOYSA-M,ZJEFYLVGGFISGT-UHFFFAOYSA-L,ZJEFYLVGGFISGT-VRZXRVJBSA-L,UFUVLHLTWXBHGZ-KUWMELJBSA-N,UFUVLHLTWXBHGZ-MGZQPHGTSA-N,GGWBHVILAJZWKJ-KJEVSKRMSA-N,GGWBHVILAJZWKJ-CHHCPSLASA-N' 

url = 'http://mychem.info/v1/drug'
data = {
    'ids': ids,
    'fields': fields
}

Cheers
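
A minimal sketch of the batch request described above, assuming the endpoint accepts form-encoded `ids` and `fields` in a POST body and enforces the 1000-key limit quoted from the Swagger doc. The helper names (`chunk_ids`, `build_payload`, `post_batch`) are hypothetical. The sketch does not diagnose why only 10 records come back; it only makes it easy to compare the number of hits with the number of keys submitted.

```python
import json
import urllib.parse
import urllib.request

URL = "http://mychem.info/v1/drug"  # endpoint from the report above
MAX_IDS_PER_POST = 1000             # limit quoted from the Swagger doc


def chunk_ids(ids, size=MAX_IDS_PER_POST):
    """Split a list of IDs into lists of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def build_payload(ids, fields):
    """Form-encode one batch; both values become comma-separated strings."""
    return urllib.parse.urlencode(
        {"ids": ",".join(ids), "fields": ",".join(fields)}
    ).encode("utf-8")


def post_batch(ids, fields, url=URL):
    """POST one batch and return the decoded JSON (needs network access)."""
    request = urllib.request.Request(url, data=build_payload(ids, fields))
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

If the response is a JSON list with one entry per key (an assumption here), comparing `len(post_batch(chunk, fields))` with `len(chunk)` for each chunk shows whether records are being dropped.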

pip requirements files incorrect

First, the biothings.api commit point in requirements_web.txt is incorrect.

The commit point in the requirements_web.txt file is:
3933f82392042b1446456e13700f51dea9b4c975

However, the following exception is thrown at run time:
mychem.info | Traceback (most recent call last):
mychem.info |   File "bin/hub.py", line 62, in <module>
mychem.info |     from biothings.hub.databuild.syncer import ThrottledESJsonDiffSyncer, ThrottledESJsonDiffSelfContainedSyncer

I manually checked the biothings.api commit point and the class ThrottledESJsonDiffSyncer is missing in this commit. Which commit point should be used?

Further, the following packages are missing in the requirements file:
aiocron
IPython
pympler

Pubchem additional fields

The PubChem flat files located at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/XML/ don't have CAS numbers.
They are, however, accessible from PubChemRDF, along with lots of other identifiers.

Trouble linking aeolus compounds

If I do a simple query q=siltuximab, I get 5 results, with these identifiers and keys:

57894-421 ['_id', '_score', 'ndc']
57894-420 ['_id', '_score', 'ndc']
CHEMBL1743070 ['_id', '_score', 'chembl', 'drugcentral']
DB09036 ['_id', '_score', 'drugbank']
T4H8FMA7IM ['_id', '_score', 'aeolus', 'unii']

The way I actually want to query this data is by asking for compounds that have a particular aeolus outcome. So if I query for a particular outcome and it matches siltuximab, I will get back only aeolus and unii information. I won't get chembl or drugcentral, which makes it hard to give this compound an identifier that I can integrate other data with.

I don't know if this is a general feature or if I just found one, but it seemed in testing that I often didn't get either a chembl or chebi node when querying by aeolus.

Pharmgkb external_vocabulary should remove drug name

Current external vocab:

"external_vocabulary": {
    "atc": "N07XX02(riluzole)",
    "ndfrt": "N0000148421(RILUZOLE)",
    "rxnorm": "35623(Riluzole)",
    "umls": "C0073379(Riluzole)"
},

should be:

"external_vocabulary": {
    "atc": "N07XX02",
    "ndfrt": "N0000148421",
    "rxnorm": "35623",
    "umls": "C0073379"
},
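
The cleanup requested above could be sketched as follows, assuming every value has the form `ID(name)` and that the identifier itself never contains an opening parenthesis. The function name `strip_drug_names` is hypothetical and is not taken from the MyChem.info parser.

```python
def strip_drug_names(external_vocabulary):
    """Return a copy where each value keeps only the part before the first "("."""
    return {
        source: value.partition("(")[0].strip()
        for source, value in external_vocabulary.items()
    }


current = {
    "atc": "N07XX02(riluzole)",
    "ndfrt": "N0000148421(RILUZOLE)",
    "rxnorm": "35623(Riluzole)",
    "umls": "C0073379(Riluzole)",
}
cleaned = strip_drug_names(current)
```

Values that have no parenthesised name pass through unchanged.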

support https

mychem.info should support both HTTP and HTTPS, like MyGene.info and MyVariant.info.

Fields under Drugcentral should be split by '|', e.g. gene_name, uniprot_id, swissprot

"bioactivity": {
    "act_comment": "Mechanism of Action; CHEMBL2362997; PROTEIN COMPLEX GROUP",
    "act_source": "CHEMBL",
    "action_type": "ANTAGONIST",
    "gene_name": "CHRNA1|CHRNB1|CHRND|CHRNE|CHRNG",
    "moa": "1",
    "moa_source": "SCIENTIFIC LITERATURE",
    "swissprot": "ACHA_HUMAN|ACHB_HUMAN|ACHD_HUMAN|ACHE_HUMAN|ACHG_HUMAN",
    "target": "Muscle-type nicotinic acetylcholine receptor",
    "target_class": "Ion channel",
    "uniprot_id": "P02708|P07510|P11230|Q04844|Q07001"
},
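
One way to do the split, shown as a sketch. The function name `split_pipe_fields` is hypothetical, the field list is limited to the three fields named in the issue title, and values without a `|` are left as plain strings, which is an assumption about the desired output.

```python
PIPE_FIELDS = ("gene_name", "uniprot_id", "swissprot")  # fields named in the issue title


def split_pipe_fields(bioactivity, fields=PIPE_FIELDS):
    """Return a copy where pipe-delimited strings in `fields` become lists."""
    result = dict(bioactivity)
    for field in fields:
        value = result.get(field)
        if isinstance(value, str) and "|" in value:
            result[field] = value.split("|")
    return result


record = {
    "action_type": "ANTAGONIST",
    "gene_name": "CHRNA1|CHRNB1|CHRND|CHRNE|CHRNG",
    "uniprot_id": "P02708|P07510|P11230|Q04844|Q07001",
}
split_record = split_pipe_fields(record)
```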

chebi.pubchem_database_links should separate pubchem compound ID and pubchem substance ID

Query: http://mychem.info/v1/drug/CHEMBL503?fields=chebi.pubchem_database_links

There are two problems with this field:

  • The results should not be CURIEs (i.e. prefix:value); each should be just the value itself, which here is 53232.

  • PubChem Compound ID (CID) and PubChem Substance ID (SID) are two different IDs and should be put in two separate fields, so the JSON structure should instead be:
    { 'pubchem_database_link': { 'CID': 53232, 'SID': 26697338 } }
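
A sketch of the requested split. The issue does not show the current field value, so the input format used here (a list of strings such as "CID: 53232") is an assumption, and `split_pubchem_links` is a hypothetical name.

```python
def split_pubchem_links(links):
    """Map assumed "PREFIX: value" strings to {'CID': int, 'SID': int}."""
    result = {}
    for link in links:
        prefix, _, value = link.partition(":")
        key = prefix.strip().upper()
        value = value.strip()
        # Keep only the two PubChem ID types; if a type repeats, the last one wins.
        if key in ("CID", "SID") and value.isdigit():
            result[key] = int(value)
    return result


links = split_pubchem_links(["CID: 53232", "SID: 26697338"])
```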

Mychem next release (ES6, parsers, etc...)

  • drugcentral collection was updated, new structure => run inspector for new mapping (done in 501a9a1)
  • sider: use keylookup decorator to convert pubchem ID to InchiKey + rerun mapping generation (parser has changed ?) : mapping done in 67a449d (not keylookup)

import CMap/LINCS data

CMap should provide gene expression measurements associated with chemical compounds. In theory, that data should be accessible via https://clue.io/cmap, but at first glance it is unclear how one can actually access these data (via either downloadable files or an API). This needs someone to dig through the links (including the GEO records mentioned in https://clue.io/GEO-guide and this user guide).

Also, I believe CMap and LINCS are partially overlapping data sets. It would also be good to import LINCS; more info is probably available at http://www.lincsproject.org/.

This ticket is partially motivated by a request from the alpha translator team, so they should be notified when we have more info...

chembl.molecule_synonyms field should be restructured

Current structure:

[
  {
    "molecule_synonym": "DNDI996469",
    "syn_type": "OTHER",
    "synonyms": "DNDI996469"
  },
  {
    "molecule_synonym": "DNDI996469",
    "syn_type": "RESEARCH_CODE",
    "synonyms": "DNDI996469"
  }
]

  1. Should be grouped based on "syn_type"
  2. "syn_type" should serve as the key
  3. "molecule_synonym" and "synonyms" fields look redundant

Proposed structure:

{
  "research_code": "DNDI996469",
  "other": "DNDI996469"
}
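
The three points above can be sketched as a small grouping function. The name `group_synonyms` is hypothetical; lower-casing the keys follows the proposed structure, while returning a list when a type has several distinct synonyms is an assumption the issue does not address.

```python
def group_synonyms(molecule_synonyms):
    """Group synonyms by lower-cased syn_type, dropping the redundant field."""
    grouped = {}
    for entry in molecule_synonyms:
        names = grouped.setdefault(entry["syn_type"].lower(), [])
        if entry["molecule_synonym"] not in names:
            names.append(entry["molecule_synonym"])
    # A single synonym is stored as a plain string, several as a list.
    return {key: names[0] if len(names) == 1 else names
            for key, names in grouped.items()}


current = [
    {"molecule_synonym": "DNDI996469", "syn_type": "OTHER", "synonyms": "DNDI996469"},
    {"molecule_synonym": "DNDI996469", "syn_type": "RESEARCH_CODE", "synonyms": "DNDI996469"},
]
restructured = group_synonyms(current)
```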

DrugBank ID representation

In the current MyChem.info, the DrugBank ID is sometimes represented as "drugbank_id" and sometimes as "id".

Example of "drugbank_id": http://mychem.info/v1/drug/PMXMIIMHBWHSKN-UHFFFAOYSA-N?fields=drugbank.drugbank_id,drugbank.id

Example of "id": http://mychem.info/v1/query?q=_exists_:drugbank.id&fields=drugbank.id,drugbank.drugbank_id

We should make this uniform: all "drugbank_id" fields should be converted to "id".

Possible lines to look at in the code:

Issues with DrugBank

  1. Fields documented in the mapping file but returning zero hits:
    drugbank.pharmacology.snp_effects.effects.rs-id
    drugbank.pharmacology.snp_effects.effects.gene-symbol
    ...and many others

  2. Missing 'gene_name' field
    Take Maraviroc, a CCR5-targeted drug, as an example:
    https://www.drugbank.ca/drugs/DB04835#targets
    The link above points to the annotation for this drug on the DrugBank website.
    The targets table there has a field called 'Gene Name', but that field is missing in c.biothings.io.

drugcentral.drug_use.indication List or not?

For drugcentral.drug_use.indication and contraindication, if there is more than one indication, you get a list. But if there is only one, you don't get a list; you just get the item.

That's fine, but in terms of parsing the output it means that you have to do a little extra logic. From my perspective, it would be nicer to return a list of one item.
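
The "little extra logic" currently needed on the client side can be sketched as a small helper. The name `as_list` is hypothetical; it treats a missing value as an empty list, which is an assumption about what callers want.

```python
def as_list(value):
    """Wrap a single item in a list; pass lists through; map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


single = as_list({"concept_name": "example indication"})
several = as_list([{"concept_name": "a"}, {"concept_name": "b"}])
```

With a helper like this, code that iterates over indications does not need to branch on the type of the field.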

ChEBI additional fields

CAS numbers are missing from the SDF file but are present (along with many other xrefs) in the obo/owl files:
ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.obo.gz

[Term]
id: CHEBI:15578
name: (25S)-5beta-spirostan-3beta-ol
alt_id: CHEBI:178
alt_id: CHEBI:10854
alt_id: CHEBI:18537
subset: 3_STAR
synonym: "(3beta,5beta,25S)-spirostan-3-ol" RELATED [ChemIDplus]
...
xref: KNApSAcK:C00003590 
xref: Beilstein:91757 "Beilstein"
xref: KEGG:C03963 
xref: CAS:126-19-2 "KEGG COMPOUND"
xref: CAS:126-19-2 "ChemIDplus"
is_a: CHEBI:26606
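
A minimal sketch of pulling CAS numbers out of a stanza like the one above, using plain string handling rather than an OBO library. The function name `extract_xrefs` is hypothetical, and the sketch assumes each xref line has the form `xref: DB:value` with an optional quoted description.

```python
def extract_xrefs(stanza, prefix="CAS"):
    """Collect unique xref values with the given prefix from one [Term] stanza."""
    values = []
    for line in stanza.splitlines():
        line = line.strip()
        if not line.startswith("xref:"):
            continue
        parts = line[len("xref:"):].split()
        if not parts:
            continue
        # parts[0] is the "DB:value" token; any quoted description is ignored.
        db, _, value = parts[0].partition(":")
        if db == prefix and value and value not in values:
            values.append(value)
    return values


stanza = """[Term]
id: CHEBI:15578
name: (25S)-5beta-spirostan-3beta-ol
xref: KNApSAcK:C00003590
xref: KEGG:C03963
xref: CAS:126-19-2 "KEGG COMPOUND"
xref: CAS:126-19-2 "ChemIDplus"
is_a: CHEBI:26606
"""
cas_numbers = extract_xrefs(stanza)
```

The duplicate CAS entry, which differs only in its source annotation, is collapsed to a single value.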

ChEMBL cross_references should be restructured

  1. Data should be grouped based on xref_src
  2. CURIEs should be converted to values if it's the standard representation by the source
    Example:

    "cross_references": {
        "xref_id": 170466478,
        "xref_name": "SID: 170466478",
        "xref_src": "PubChem"
    }

    =>

    "cross_references": {
        "pubchem_sid": "170466478"
    }
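
A sketch of the restructuring shown above, under stated assumptions: the input is a list of xref dicts, the key is the lower-cased `xref_src`, and for PubChem the ID type is read from the `xref_name` prefix. The function name `restructure_cross_references` is hypothetical, and key naming for other sources is not specified in the issue.

```python
def restructure_cross_references(xrefs):
    """Group xref IDs by source, using keys such as "pubchem_sid"."""
    grouped = {}
    for xref in xrefs:
        key = xref["xref_src"].lower()
        name = str(xref.get("xref_name") or "")
        # PubChem entries carry the ID type in xref_name, e.g. "SID: 170466478".
        if key == "pubchem" and ":" in name:
            key = "pubchem_" + name.split(":", 1)[0].strip().lower()
        grouped.setdefault(key, []).append(str(xref["xref_id"]))
    # A single ID is stored as a plain string, several as a list.
    return {key: ids[0] if len(ids) == 1 else ids
            for key, ids in grouped.items()}


converted = restructure_cross_references([
    {"xref_id": 170466478, "xref_name": "SID: 170466478", "xref_src": "PubChem"},
])
```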

new data source: RHEA

The main page is at https://www.rhea-db.org/. It provides useful information on chemical reactions (mapped to ChEBI IDs).

Many possible formats on their download page, but will need some work to figure out which is the right one to use.

Terms of use of mychem.info service

Hi,

Really awesome service and concept.

I noticed the 'non-commercial' use flag on mychem.info that isn't present on mygene.info. Could you comment/blog on why the difference? Is there any way (perhaps an endpoint with limited datasets, or an alternative service) that it could be used in a commercial environment? I understand that there is a vast difference in the history of licensing between biology and chemistry data. As I am sure you are aware, non-commercial licenses cause a lot of complications for integration efforts.

Thanks,

Iain

Missing xrefs for AEOLUS data

There are 4245 drugs in the aeolus data set. All have a rxcui. 3043 can be matched to an InchiKey or UNII using the data from the FDA link.
Look into using Wikidata to do additional normalization.

'Indications' appearing as adverse drug reactions in AEOLUS data

Wanted to follow up on a question we discussed offline about the AEOLUS data, which describe adverse drug reactions (ADRs) and use MedDRA terms to codify the adverse outcomes. Specifically, I noted that the mydrug AEOLUS data include 'outcomes' that seem to be primary indications for a drug (i.e. what it is used to treat), rather than adverse drug reactions.

For example, in the mydrug results for imatinib, which is used to treat various leukemias, the outcomes include things like "Leukaemia", "Chronic lymphocytic leukaemia", "Acute leukaemia".

In your parsing of the AEOLUS data, was there any metadata that would tell you which MedDRA terms represent indications for a particular drug, so these could be filtered from the results, or at least tagged as being indications rather than adverse reactions to the drug? This would be a significant improvement to the dataset you generate.

MyChem query endpoint should return a subset of fields by default

This would be similar to how mygene.info/v3/query behaves: by default, the matching hits include only a subset of "essential fields" instead of all available fields. This makes the returned hits much smaller and faster to deliver.

Users can always get the specific fields they need by specifying the "fields" parameter, or by passing fields=all to return all available fields (the current default).

default query fields

I have a vague recollection of mentioning this issue before, but I forget the response. So just adding it here.

I was expecting this query:
http://mychem.info/v1/query?q=imatinib

to return a superset of records relative to this fielded query:
http://mychem.info/v1/query?q=drugbank.name%3Aimatinib

But the first query currently returns no results. I'm guessing this has to do with which fields are indexed by default. I'll leave it to you guys to decide whether this is the desired behavior or something to be changed....

Value consistency

The PubChem ID in mychem.info should have a unified name:
"pharmgkb.pubchem_compound": "132971",
"pubchem.cid": "CID132971"

Fields with Mixed Type

The following fields need to be fixed at the parser level so that they conform to a single data type:

  1. drugcentral.drug_use.indication.snomed_concept_id: int/float -> int (should be int, need to look at why it's float in some cases)
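
A sketch of the kind of coercion a parser could apply to such a field. The function name `to_int` is hypothetical; it converts whole-number floats to int and deliberately raises on anything else rather than truncating, so that unexpected values surface instead of being silently altered.

```python
def to_int(value):
    """Coerce an int or whole-number float to int; reject everything else."""
    if isinstance(value, bool):
        raise TypeError("bool is not accepted as an ID")
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise ValueError(f"cannot safely convert {value!r} to int")


coerced = [to_int(v) for v in (123, 123.0)]
```

Where the float values come from is not established in the issue, so this only normalizes the output and does not address the root cause.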

move src/data folder to /data?

This is very minor and low priority.

Many times I have found myself clicking into the data folder when I actually wanted to go to the hub folder, probably just because the name "data" confused me.

I don't know if that's just me or not. One suggestion would be to move /src/data to the top-level /data, considering this folder is not the application code itself, just some data relevant to the application.

And this could be applicable for other biothings API code bases.
