biothings / mychem.info

MyChem.info: A BioThings API for chemical/drug annotations

Home Page: http://mychem.info

License: Apache License 2.0

Python 96.94% Shell 0.61% Jupyter Notebook 2.44%

drug chemical api bioinformatics webservice biothings ncats-translator

mychem.info's Introduction

MyChem.info API

MyChem.info provides a high-performance API for up-to-date annotations/knowledge about individual chemicals and drugs. It's part of the BioThings API collection, together with MyGene.info, MyVariant.info and more.

mychem.info's People

stuppie sulab veleritas greg-k-taylor quiltomics polyg314 ravila4 erikyao shunsunsun r76941156 codingcane nghiencuuthuoc

mychem.info's Issues

sider.side_effect.name field should be indexed as a text string instead of a keyword

Ref to the line:

mychem.info/src/hub/dataload/sources/sider/sider_upload.py

Line 65 in 844b448

"analyzer":"string_lowercase"

mychem.info/v1/drug/ returns 10 records no matter how many inchi keys are supplied

Hi BioThings team. I am trying to retrieve records in large batches using multiple InChIKeys. The Swagger doc says that you can submit up to 1000 keys in a POST request at a time. It seems that no matter how many I supply, the results include only 10 records.

Example POST I am making with 17 unique InChIKeys that all resolve individually:

fields = 'drugbank.targets,drugbank.drugbank_id,unii.unii,drugcentral.drug_use,drugcentral.bioactivity'

ids = 'FQXGHZNSUOHCLO-IZLXSQMJSA-N,MKWYFZFMAMBPQK-UHFFFAOYSA-J,SXYZQZLHAIHKKY-GSTUPEFVSA-N,WMKGGPCROCCUDY-HEEUSZRZSA-N,XXKJBDTZZBQBCP-MRVPVSSYSA-N,WQVJHHACXVLGBL-UHFFFAOYSA-N,FLOSMHQXBMRNHR-UHFFFAOYSA-N,FLOSMHQXBMRNHR-QPJJXVBHSA-N,ZMJBYMUCKBYSCP-UHFFFAOYSA-N,IMONTRJLAWHYGT-REGVOWLASA-N,MSRILKIQRXUYCT-UHFFFAOYSA-M,ZJEFYLVGGFISGT-UHFFFAOYSA-L,ZJEFYLVGGFISGT-VRZXRVJBSA-L,UFUVLHLTWXBHGZ-KUWMELJBSA-N,UFUVLHLTWXBHGZ-MGZQPHGTSA-N,GGWBHVILAJZWKJ-KJEVSKRMSA-N,GGWBHVILAJZWKJ-CHHCPSLASA-N' 

url = 'http://mychem.info/v1/drug'
data = {
    'ids': ids,
    'fields': fields
}

Cheers
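
A minimal sketch of how such a batch request could be assembled, assuming the 1000-key limit quoted from the Swagger doc above. The helper names `chunk_ids` and `build_batch_request` are hypothetical, and the request is only constructed, not sent. It does not explain the 10-record behaviour reported in this issue.

```python
from urllib.parse import urlencode
from urllib.request import Request

MAX_IDS_PER_POST = 1000  # limit quoted from the Swagger doc in this issue


def chunk_ids(ids, size=MAX_IDS_PER_POST):
    """Split a list of IDs into consecutive chunks of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def build_batch_request(ids, fields, url="http://mychem.info/v1/drug"):
    """Build (but do not send) a form-encoded POST request for one chunk."""
    body = urlencode({"ids": ",".join(ids), "fields": ",".join(fields)})
    return Request(url, data=body.encode("utf-8"), method="POST")


keys = ["FQXGHZNSUOHCLO-IZLXSQMJSA-N", "MKWYFZFMAMBPQK-UHFFFAOYSA-J"]
req = build_batch_request(keys, ["drugbank.drugbank_id", "unii.unii"])
```

Each chunk returned by `chunk_ids` would be passed to `build_batch_request` in turn, so a list longer than 1000 keys becomes several POST requests.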

DrugCentral

http://drugcentral.org/download

add /metadata and /metadata/fields to the query examples section

Add this to the landing page query examples section at the bottom:

Get data source metadata:
GET http://mychem.info/metadata

Get the list of all fields:
GET http://mychem.info/metadata/fields

This is applicable to the mygene.info and myvariant.info APIs as well.

RxNorm

Load RxNorm by scraping the API. Example:
https://rxnav.nlm.nih.gov/REST/rxcui/595060/allProperties.xml?prop=all

new data source: dgidb

drug-gene interaction database

downloads page: http://www.dgidb.org/downloads

new data source: DSigDB

http://tanlab.ucdenver.edu/DSigDB/DSigDBv1.0/geneSearch.html

Add pharos id as xrefs

size parameter not working?

Tried: http://mychem.info/v1/query?q=drugbank.targets.uniprot:P24941&fields=_id&size=20

The response shows that the total number of hits is 137, but only 10 results are returned.

I tried different sizes. Smaller values, e.g. 2, work. However, any number larger than 10 does not.
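
As a possible workaround while `size` is capped, the hits could be fetched page by page. The sketch below assumes the query endpoint accepts a `from` offset alongside `size`, as other BioThings APIs do; the helper `page_urls` is hypothetical and only builds the URLs.

```python
from urllib.parse import urlencode


def page_urls(q, total, page_size=10, fields="_id",
              base="http://mychem.info/v1/query"):
    """Return one query URL per page needed to cover `total` hits."""
    urls = []
    for offset in range(0, total, page_size):
        params = {"q": q, "fields": fields, "size": page_size, "from": offset}
        urls.append(base + "?" + urlencode(params))
    return urls


# 137 is the total reported in this issue.
urls = page_urls("drugbank.targets.uniprot:P24941", total=137)
```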

pip requirements files incorrect

First, the biothings.api commit point in requirements_web.txt is incorrect.

The commit point in the requirements_web.txt file is:
3933f82392042b1446456e13700f51dea9b4c975

However, the following exception is thrown at run time:
mychem.info | Traceback (most recent call last):
mychem.info | File "bin/hub.py", line 62, in <module>
mychem.info | from biothings.hub.databuild.syncer import ThrottledESJsonDiffSyncer, ThrottledESJsonDiffSelfContainedSyncer

I manually checked the biothings.api commit point and the class ThrottledESJsonDiffSyncer is missing in this commit. Which commit point should be used?

Further, the following packages are missing in the requirements file:
aiocron
IPython
pympler

Extract Indications & MoA from ChEMBL

See: https://www.ebi.ac.uk/chembldb/index.php/compound/inspect/CHEMBL1431

http://mychem.info/v1/chem/XZWYZXLIPXDOLR-UHFFFAOYSA-N?fields=chembl

Pubchem additional fields

The PubChem flat files located at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/XML/ don't have CAS numbers.
But CAS numbers are accessible from PubChemRDF, along with lots of other identifiers.

Trouble linking aeolus compounds

If I do a simple query q=siltuximab, I get 5 results, with these identifiers and keys:

57894-421 ['_id', '_score', 'ndc']
57894-420 ['_id', '_score', 'ndc']
CHEMBL1743070 ['_id', '_score', 'chembl', 'drugcentral']
DB09036 ['_id', '_score', 'drugbank']
T4H8FMA7IM ['_id', '_score', 'aeolus', 'unii']

The way I actually want to query this data is by asking for compounds that have a particular aeolus outcome. So if I come in and query for a particular outcome, and it matches siltuximab, I will get back only aeolus and unii information. I won't get chembl or drugcentral, making it hard to give this compound an identifier that I can integrate other data with.

I don't know if this is a general feature or if I just found one, but it seemed in testing that I often didn't get either a chembl or chebi node when querying by aeolus.

Pharmgkb external_vocabulary should remove drug name

Current external vocab:

"external_vocabulary": {
"atc": "N07XX02(riluzole)",
"ndfrt": "N0000148421(RILUZOLE)",
"rxnorm": "35623(Riluzole)",
"umls": "C0073379(Riluzole)"
},

should be:

"external_vocabulary": {
"atc": "N07XX02",
"ndfrt": "N0000148421",
"rxnorm": "35623",
"umls": "C0073379"
},
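
A minimal sketch of the clean-up, assuming every value has the form `ID(name)` as in the example above. The function name `strip_drug_names` is hypothetical and not taken from the parser.

```python
import re

# Matches a trailing parenthesised label such as "(Riluzole)".
_TRAILING_NAME = re.compile(r"\s*\([^()]*\)\s*$")


def strip_drug_names(external_vocabulary):
    """Return a copy of the dict with the trailing "(name)" removed from each value."""
    return {key: _TRAILING_NAME.sub("", value)
            for key, value in external_vocabulary.items()}


current = {
    "atc": "N07XX02(riluzole)",
    "ndfrt": "N0000148421(RILUZOLE)",
    "rxnorm": "35623(Riluzole)",
    "umls": "C0073379(Riluzole)",
}
cleaned = strip_drug_names(current)
```

Names that themselves contain nested parentheses are not handled by this pattern and would need a stricter rule.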

add "total" fields in metadata stats

support https

mychem.info should support both http and https like MyGene.info and MyVariant.info

Fields under Drugcentral should be split by '|', e.g. gene_name, uniprot_id, swissprot

"bioactivity": {
"act_comment": "Mechanism of Action; CHEMBL2362997; PROTEIN COMPLEX GROUP",
"act_source": "CHEMBL",
"action_type": "ANTAGONIST",
"gene_name": "CHRNA1|CHRNB1|CHRND|CHRNE|CHRNG",
"moa": "1",
"moa_source": "SCIENTIFIC LITERATURE",
"swissprot": "ACHA_HUMAN|ACHB_HUMAN|ACHD_HUMAN|ACHE_HUMAN|ACHG_HUMAN",
"target": "Muscle-type nicotinic acetylcholine receptor",
"target_class": "Ion channel",
"uniprot_id": "P02708|P07510|P11230|Q04844|Q07001"
},
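
One way the split could look, as a sketch only. The field list comes from the issue title; the helper name `split_pipe_fields` is hypothetical.

```python
# Fields named in the issue title as pipe-delimited.
PIPE_FIELDS = ("gene_name", "swissprot", "uniprot_id")


def split_pipe_fields(bioactivity, fields=PIPE_FIELDS):
    """Return a copy where pipe-delimited string fields become lists."""
    result = dict(bioactivity)
    for name in fields:
        value = result.get(name)
        if isinstance(value, str) and "|" in value:
            result[name] = value.split("|")
    return result


record = {
    "gene_name": "CHRNA1|CHRNB1|CHRND|CHRNE|CHRNG",
    "uniprot_id": "P02708|P07510|P11230|Q04844|Q07001",
    "target_class": "Ion channel",
}
split = split_pipe_fields(record)
```

Values without a `|` are left as plain strings here; whether single values should also be wrapped in a list is a separate design decision.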

DrugCentral version number needs to be updated

website "docs" link points to myvariant.info

chebi.pubchem_database_links should separate pubchem compound ID and pubchem substance ID

Query: http://mychem.info/v1/drug/CHEMBL503?fields=chebi.pubchem_database_links

There are two problems with this field:

* The results should not be in CURIEs (e.g. prefix:value). Each should be just the value itself, which is 53232.
* PubChem Compound ID (CID) and PubChem Substance ID (SID) are two different IDs and should be put in two separate fields, so the JSON structure should be:
{ 'pubchem_database_link': { 'CID': 53232, 'SID': 26697338 } }
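
A sketch of the proposed restructuring. The input shape is an assumption: the issue only says the current values are CURIEs, so a list of `PREFIX:value` strings is used here for illustration. The function name `split_pubchem_links` is hypothetical.

```python
def split_pubchem_links(links):
    """Group 'PREFIX:value' strings by prefix and convert the values to int.

    If a prefix occurs more than once, the last value wins.
    """
    grouped = {}
    for link in links:
        prefix, _, value = link.partition(":")
        grouped[prefix.strip().upper()] = int(value.strip())
    return {"pubchem_database_link": grouped}


# Assumed input format; the IDs are the ones quoted in this issue.
restructured = split_pubchem_links(["CID:53232", "SID:26697338"])
```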

DrugCentral Xref -> Xrefs

http://mychem.info/v1/query?q=drugbank.name:riluzole&fields=drugcentral.xref

pharmgkb.type field should be indexed as a keyword

This line:

mychem.info/src/hub/dataload/sources/pharmgkb/pharmgkb_upload.py

Lines 74 to 76 in a4f8d50

"type": {
    "type": "text"
},

The values of the pharmgkb.type field appear to be just "Drug" and "Prodrug", so it would be better to index it as a keyword. Is it somehow recognized by the inspector as a "text" field?

New Data Source: HMDB

http://www.hmdb.ca/metabolites/HMDB0000012

Mychem next release (ES6, parsers, etc...)

drugcentral collection was updated, new structure => run inspector for new mapping (done in 501a9a1)
sider: use keylookup decorator to convert pubchem ID to InchiKey + rerun mapping generation (parser has changed ?) : mapping done in 67a449d (not keylookup)

check the list of fields added to "all"

For example, the drugbank.id field should be included in "all", but it does not seem to be:

Either http://mychem.info/v1/chem/DB01076 or http://mychem.info/v1/query?q=drugbank.id:DB01076 works, but not this:

http://mychem.info/v1/query?q=DB01076

We should double-check which fields need to be included in "all".

import CMap/LINCS data

CMap should provide gene expression measurements associated with chemical compounds. In theory, that data should be accessible via https://clue.io/cmap, but on first glance it is unclear how one can actually access these data (via either downloadable files or API). Needs someone to dig through the links (including the GEO records mentioned in https://clue.io/GEO-guide and this user guide).

Also, I believe CMap and LINCS are a partially overlapping set of data. Would also be good to import LINCS -- more info probably available at http://www.lincsproject.org/.

This ticket is partially motivated by a request from the alpha translator team, so they should be notified when we have more info...

aeolus.outcomes.meddra_code is not indexed properly

Sample data:

http://mychem.info/v1/query?q=_exists_:aeolus.outcomes&fields=aeolus.outcomes

The field "aeolus.outcomes.meddra_code" is labelled as "aeolus.outcomes.code" in the mapping:

aeolus/aeolus_upload.py#L74

The mapping should be changed to "meddra_code", so that this query works:

q=aeolus.outcomes.meddra_code:10028813

chembl.molecule_synonyms field should be restructured

Current structure:

[
{
  "molecule_synonym": "DNDI996469",
  "syn_type": "OTHER",
  "synonyms": "DNDI996469"
},
{
  "molecule_synonym": "DNDI996469",
  "syn_type": "RESEARCH_CODE",
  "synonyms": "DNDI996469"
}
]

Should be grouped based on "syn_type"
"syn_type" should serve as the key
"molecule_synonym" and "synonyms" fields look redundant

Proposed structure:

{
  "research_code": "DNDI996469",
  "other": "DNDI996469"
}
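
A sketch of the grouping described above, using "syn_type" (lower-cased) as the key and dropping the redundant "synonyms" field. The function name `group_synonyms` is hypothetical.

```python
def group_synonyms(entries):
    """Group ChEMBL synonym entries by lower-cased syn_type."""
    grouped = {}
    for entry in entries:
        key = entry["syn_type"].lower()
        names = grouped.setdefault(key, [])
        if entry["molecule_synonym"] not in names:
            names.append(entry["molecule_synonym"])
    # Collapse single-item lists to a scalar to match the proposed structure.
    return {k: v[0] if len(v) == 1 else v for k, v in grouped.items()}


current = [
    {"molecule_synonym": "DNDI996469", "syn_type": "OTHER",
     "synonyms": "DNDI996469"},
    {"molecule_synonym": "DNDI996469", "syn_type": "RESEARCH_CODE",
     "synonyms": "DNDI996469"},
]
proposed = group_synonyms(current)
```

A synonym type with several distinct names keeps them as a list, so consumers would still need to handle both shapes.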

DrugBank ID representation

In current MyChem.info, drugbank ID is sometimes represented as "drugbank_id" while sometimes represented as "id".

Example of "drugbank_id": http://mychem.info/v1/drug/PMXMIIMHBWHSKN-UHFFFAOYSA-N?fields=drugbank.drugbank_id,drugbank.id

Example of "id": http://mychem.info/v1/query?q=_exists_:drugbank.id&fields=drugbank.id,drugbank.drugbank_id

We should make it uniform, so all "drugbank_id" fields should be converted to "id".

Possible lines to look at in the code:

mychem.info/src/hub/dataload/sources/drugbank/drugbank_parser.py

Line 106 in 2b30a46

d1.update({key:id_list})

PharmGKB data license changed to CC-BY-SA

TODO:
* update the license text for pharmgkb data source
* see if any additional data we can include

Issues with DrugBank

Fields documented in the mapping file but returning zero hits:
* drugbank.pharmacology.snp_effects.effects.rs-id
* drugbank.pharmacology.snp_effects.effects.gene-symbol
* ...and many others

Missing 'gene_name' field:
Take Maraviroc, a CCR5-targeted drug, as an example:
https://www.drugbank.ca/drugs/DB04835#targets
The link above is the annotation page for this drug on the DrugBank website. The targets table there has a field called 'Gene Name', but that field is missing in c.biothings.io.

Extract Roles from ChEBI

The Roles and ChEBI ontology data are missing from the chebi response.

Example:
http://mychem.info/v1/chem/XZWYZXLIPXDOLR-UHFFFAOYSA-N?fields=chebi
http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI%3A6801

drugcentral.drug_use.indication List or not?

For drugcentral.drug_use.indication and contraindication, if there is more than one indication, you get a list. But if there is only one, you don't get a list, just the item itself.

That's fine, but in terms of parsing the output it means that you have to do a little extra logic. From my perspective, it would be nicer to return a list of one item.
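
Until the API returns a list in every case, the extra client-side logic can be kept in one small helper. This is a sketch; `as_list` is a hypothetical name and the sample record is made up for illustration.

```python
def as_list(value):
    """Wrap a single item in a list; leave lists as-is; map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Hypothetical response fragment with a single indication.
drug_use = {"indication": {"concept_name": "example indication"}}
indications = as_list(drug_use.get("indication"))
contraindications = as_list(drug_use.get("contraindication"))
```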

ChEBI additional fields

CAS is missing from SDF file but is present (along with many other xrefs) in the obo/owl files:
ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.obo.gz

[Term]
id: CHEBI:15578
name: (25S)-5beta-spirostan-3beta-ol
alt_id: CHEBI:178
alt_id: CHEBI:10854
alt_id: CHEBI:18537
subset: 3_STAR
synonym: "(3beta,5beta,25S)-spirostan-3-ol" RELATED [ChemIDplus]
...
xref: KNApSAcK:C00003590 
xref: Beilstein:91757 "Beilstein"
xref: KEGG:C03963 
xref: CAS:126-19-2 "KEGG COMPOUND"
xref: CAS:126-19-2 "ChemIDplus"
is_a: CHEBI:26606
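
A rough sketch of pulling the CAS numbers out of one `[Term]` stanza like the one above. It assumes the `xref: CAS:<number> "<source>"` line layout shown in the excerpt and is not a full OBO parser; `cas_numbers` is a hypothetical helper.

```python
def cas_numbers(stanza):
    """Return the unique CAS numbers found in the xref lines of one OBO stanza."""
    prefix = "xref: CAS:"
    found = []
    for line in stanza.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        # Drop the prefix, then keep the first token (the quoted source follows).
        parts = line[len(prefix):].split()
        if parts and parts[0] not in found:
            found.append(parts[0])
    return found


stanza = '''[Term]
id: CHEBI:15578
xref: KEGG:C03963
xref: CAS:126-19-2 "KEGG COMPOUND"
xref: CAS:126-19-2 "ChemIDplus"
is_a: CHEBI:26606
'''
cas = cas_numbers(stanza)
```

The same CAS number is listed once per source in the file, so the helper de-duplicates while preserving order.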

ChEMBL cross_references should be restructured

Data should be grouped based on xref_src
CURIEs should be converted to values if it's the standard representation by the source
Example (current structure first, then the proposed structure):
"cross_references": {
"xref_id": 170466478,
"xref_name": "SID: 170466478",
"xref_src": "PubChem"
}

"cross_references": {
"pubchem_sid": "170466478"
}
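
A sketch of one possible conversion. The key-naming rule (source name plus the prefix found in "xref_name", both lower-cased) is an assumption made to reproduce the "pubchem_sid" example above; `restructure_xrefs` is a hypothetical name.

```python
def restructure_xrefs(xrefs):
    """Group cross references by source, keeping only the bare ID values."""
    grouped = {}
    for xref in xrefs:
        source = xref["xref_src"].lower()
        name = str(xref["xref_name"])
        if ":" in name:
            # "SID: 170466478" contributes the key suffix "sid".
            suffix = name.split(":", 1)[0].strip().lower()
            key = f"{source}_{suffix}"
        else:
            key = source
        grouped.setdefault(key, []).append(str(xref["xref_id"]))
    # Collapse single-item lists to a scalar, as in the example above.
    return {k: v[0] if len(v) == 1 else v for k, v in grouped.items()}


current = [{"xref_id": 170466478, "xref_name": "SID: 170466478",
            "xref_src": "PubChem"}]
proposed = restructure_xrefs(current)
```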

new data source: chemistry dashboard from EPA

https://comptox.epa.gov/dashboard

Looks like they do some aggregation among other resources, but they also have new data of their own.

Downloadable files here: https://comptox.epa.gov/dashboard/downloads

extensive duplication in drugbank.mixtures

for example, see http://mychem.info/v1/chem/KTUFNOKKBVMGRW-UHFFFAOYSA-N?fields=drugbank,drugcentral

New data source: Guide to Pharmacology

http://www.guidetopharmacology.org/download.jsp

CC-BY-SA

new data source: RHEA

The main page is at https://www.rhea-db.org/. It provides useful information on chemical reactions (mapped to ChEBI IDs).

Many possible formats on their download page, but will need some work to figure out which is the right one to use.

Terms of use of mychem.info service

Hi,

Really awesome service and concept.

I noticed the 'non-commercial' use flag on mychem.info that isn't present on mygene.info. Could you comment/blog on why the difference? Is there any way (perhaps an endpoint with limited datasets, or an alternative service) that it could be used in a commercial environment? I understand that there is a vast difference in the history of licensing between biology and chemistry data. As I am sure you are aware, non-commercial licenses cause a lot of complications for integration efforts.

Thanks,

Iain

Missing xrefs for AEOLUS data

There are 4245 drugs in the aeolus data set. All have a rxcui. 3043 can be matched to an InchiKey or UNII using the data from the FDA link.
Look into using Wikidata to do additional normalization.

'Indications' appearing as adverse drug reactions in AEOLUS data

Wanted to follow up with a question we discussed offline about AEOLUS data, which describe adverse drug reactions (ADRs), using MedDRA terms to codify the adverse outcomes. Specifically, I noted the inclusion of 'outcomes' in the mydrug AEOLUS data that seem to be primary indications for a drug (i.e. what it is used to treat), rather than adverse drug reactions.

For example, in the mydrug results for imatinib, which is used to treat various leukemias, the outcomes include things like "Leukaemia", "Chronic lymphocytic leukaemia", "Acute leukaemia".

In your parsing of AEOLUS data, was there any metadata that would tell you which MedDRA terms represent indications for a particular drug, so these could be filtered from the results, or at least tagged as being indications rather than adverse reactions to the drug? This would be a significant improvement to the dataset you generate.

import UMLS id mapping to chemicals/drugs

UMLS provides IDs for many biological entity types, including chemicals and drugs:

https://www.nlm.nih.gov/research/umls/

Similar to this UMLS <-> gene mapping for MyGene.info:
https://github.com/biothings/mygene.info/tree/master/src/dataload/sources/umls

we can import the mapping of UMLS <-> chemicals/drugs into MyChem.info.

drugbank dumper can now be automated

Currently the dumper only reports that a new release is available, and the zip file has to be manually downloaded and put on the server. As of release 5.1.0, the website shows a curl command we can use to download it:

https://www.drugbank.ca/releases/5-1-0

curl -Lfv -o filename.zip -u EMAIL:PASSWORD https://www.drugbank.ca/releases/5-1-0/downloads/all-full-database

MyChem query endpoint should return a subset of field by default

This would be similar to the behavior of mygene.info/v3/query. By default, the matching hits would return a subset of "essential fields" instead of all available fields. This would make the returned hits much smaller and faster.

Users can always get the specific fields they need by specifying the "fields" parameter, or by passing fields=all to return all available fields (the current default).

default query fields

I have a vague recollection of mentioning this issue before, but I forget the response. So just adding it here.

I was expecting this query:
http://mychem.info/v1/query?q=imatinib

to return a superset of records relative to this fielded query:
http://mychem.info/v1/query?q=drugbank.name%3Aimatinib

But the first query currently returns no results. I'm guessing this has to do with what fields are indexed by default. I'll leave it to you guys to decide whether this is the desired behavior or something to be changed.

Current PubChem parser skips the record if no InChIKey is found in the source file

See: https://github.com/biothings/mychem.info/blob/master/src/hub/dataload/sources/pubchem/pubchem_parser.py#L21

We should consider removing this code and adding a decorator for the PubChem parser.

Value consistency

The PubChem ID in mychem.info should have a unified name and format:
"pharmgkb.pubchem_compound": "132971",
"pubchem.cid": "CID132971"
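
A sketch of normalizing both representations to the bare digit string. Which form should be the canonical one is not decided in this issue, so the choice here is an assumption; `normalize_cid` is a hypothetical helper.

```python
def normalize_cid(value):
    """Return a PubChem compound ID as a bare digit string."""
    text = str(value).strip()
    if text.upper().startswith("CID"):
        # Drop the "CID" prefix and any separator that follows it.
        text = text[3:].lstrip(": ")
    if not text.isdigit():
        raise ValueError(f"not a PubChem CID: {value!r}")
    return text


same = normalize_cid("CID132971") == normalize_cid("132971")
```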

Fields with Mixed Type

The following fields need to be fixed at the parser level so that each conforms to a single data type:

drugcentral.drug_use.indication.snomed_concept_id: int/float -> int (should be int, need to look at why it's float in some cases)
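
A sketch of the coercion for such a field, rejecting values that are not whole numbers instead of silently truncating them. `to_int` is a hypothetical helper and the sample values are made up.

```python
def to_int(value):
    """Convert an int, a whole-number float, or a digit string to int."""
    if isinstance(value, float):
        if not value.is_integer():
            raise ValueError(f"non-integral value: {value!r}")
        return int(value)
    return int(value)


# Made-up sample values in the three shapes a parser might encounter.
ids = [to_int(v) for v in (123456, 123456.0, "123456")]
```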

move src/data folder to /data?

This is very minor and low priority.

Many times I found myself clicking into the data folder when I actually wanted to go to the hub folder, probably just because the name "data" confused me.

I don't know if that's just me or not. One suggestion would be to move /src/data to the top-level /data, considering this folder is not the application code itself, just some data relevant to the application.

And this could be applicable for other biothings API code bases.