gnn4dr / drkg Goto Github PK
A knowledge graph and a set of tools for drug repurposing
License: Apache License 2.0
The DRKG is great work, no doubt about that. I just wanted to understand how the entities and relationships CSV files were generated. I understand that the data is taken from various databases such as DrugBank, Hetionet, GNBR, STRING, IntAct, and DGIdb, but how did you generate the entity and relationship CSV files from them?
Hi, I want to know what specific item corresponds to the number in an entity ID. For example, what does Gene::23368 mean? Is there an index file it can be linked to?
Thanks, Hope for your reply!
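For reference, DRKG entity strings follow a `<type>::<id>` convention, where the ID comes from the source vocabulary (the `Gene::` numbers appear to be NCBI Entrez gene IDs, which can be looked up on the NCBI Gene site). A minimal sketch for splitting an entity string into its type and raw ID:

```python
def parse_entity(entity: str):
    """Split a DRKG entity string like 'Gene::23368' into (type, id).

    Only the first '::' separates type from ID; IDs such as
    'MESH:C535932' may themselves contain a colon.
    """
    etype, _, eid = entity.partition("::")
    return etype, eid

print(parse_entity("Gene::23368"))           # ('Gene', '23368')
print(parse_entity("Disease::MESH:C535932")) # ('Disease', 'MESH:C535932')
```

Resolving the raw ID to a human-readable name then has to go through the corresponding source database (e.g. NCBI for genes, DrugBank for compounds).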
Hi!
Thank you for your work and for putting out this tremendous graph! I just have one question concerning the Covid-19 node. All I could find were 27 different SARS-CoV-2 nodes. These are categorized as diseases, but as far as I understand, they are actually virus proteins. I also looked for a node for the actual disease and couldn't find it via its DOIDs (DOID:0080600, and DOID:0080848 for long Covid).
For my research I'll just go ahead and use the SARS-CoV-2 Spike node as a proxy for Covid-19. Did you intend it that way, or did I miss something?
Thanks!
If I am not wrong, DRKG includes ~10K small-molecule drugs from DrugBank 4.0. I see that DrugBank 5.1.9 is now available, with more than 11K small-molecule drugs. Is there any plan to release a further version of DRKG?
Can you show some examples of using this model to predict drugs for certain diseases? What is the input? Thanks a lot.
Hi Dear Author,
I see that DRKG synthesizes many knowledge bases into one very big graph, but do any of these knowledge bases contain ICD codes for their diseases? If not, is there any way to map their disease codes to ICD?
Thanks!
This looks very similar to what has already been done by https://github.com/hetio/hetionet. Would you please add documentation explaining how this is different? Did you try using Hetionet for this task, since it is nominally set up for drug repositioning?
Since it looks like some edges are ingested from Hetionet, what added value does DRKG provide?
Thanks for all your effort. Great content!
Inside the file drkg.tsv, lines 4123208-4123214 contain genes without ID info. You might want to delete or fix them. You can find them with a regex search for Gene::\t
I also see 22 more genes with no ID. You can find them with a regex search for Gene::\n — for example, on lines 4122083, 4138986, 4145818, etc.
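Rather than deleting those lines by hand, the malformed rows can be filtered out programmatically. A sketch that drops any triplet whose head or tail entity has an empty ID (file paths are placeholders):

```python
import csv

def drop_empty_entities(in_path, out_path):
    """Copy a DRKG-style TSV, skipping triplets whose head or tail entity
    has an empty ID (i.e. ends in a bare '::', such as 'Gene::')."""
    kept = dropped = 0
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for row in reader:
            if len(row) == 3 and all(not ent.endswith("::") for ent in (row[0], row[2])):
                writer.writerow(row)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Running it on drkg.tsv should drop exactly the rows found by the regex searches above.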
For example, compare the edge types bioarx::HumGenHumGen:Gene:Gene and GNBR::Pr::Compound:Disease:
e1, e2 = keys[0], keys[40]
n1_d = node_dictionary[e1]
n2_d = node_dictionary[e2]
A number in n2_d may refer to a compound or a disease, while the same number in n1_d refers to a gene. The Jaccard score between these two sets is therefore meaningless.
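One way to avoid this type collision is to keep the full `Type::id` strings, rather than bare numbers, when building the sets, so that identical numbers from different vocabularies can never match. A sketch:

```python
def jaccard(a, b):
    """Jaccard similarity of two iterables of typed entity strings."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

genes = {"Gene::23368", "Gene::1234"}
mixed = {"Compound::23368", "Gene::1234"}
# 'Gene::23368' and 'Compound::23368' no longer collide; only
# 'Gene::1234' is shared, so the score is 1/3.
print(jaccard(genes, mixed))
```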
Hi!
I want to know whether I could retrain the DRKG embeddings with additional triplets related to certain diseases using Colab Pro, i.e. with a GPU. In particular, is that retraining task extremely time-consuming?
Thanks!
Hi -
When I try running this notebook, the 'utils' module is not found (I have the CWD set to the directory where I downloaded the DRKG).
Where should this 'utils' module be?
ModuleNotFoundError Traceback (most recent call last)
in
4 import sys
5 sys.path.insert(1, '../utils')
----> 6 from utils import download_and_extract
7 download_and_extract()
8 drkg_file = '../data/drkg/drkg.tsv'
ModuleNotFoundError: No module named 'utils'
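The `utils` folder ships in the root of the DRKG repo checkout, so the relative `'../utils'` in the notebook only resolves when the notebook runs from a subdirectory of that checkout; downloading the data alone is not enough. A sketch that locates the folder relative to a given directory instead of relying on a hard-coded relative path (the repo layout assumed here is `<repo>/utils/`):

```python
import os
import sys

def add_repo_utils(start_dir):
    """Find the DRKG repo's utils/ folder one level above start_dir and
    put it on sys.path, so `from utils import download_and_extract`
    resolves to the repo copy. Raises if the folder is missing."""
    repo_root = os.path.abspath(os.path.join(start_dir, os.pardir))
    utils_dir = os.path.join(repo_root, "utils")
    if not os.path.isdir(utils_dir):
        raise FileNotFoundError(
            f"expected {utils_dir}; clone the full DRKG repository, "
            "not just the data files")
    sys.path.insert(0, utils_dir)
    return utils_dir
```

From a notebook inside the repo you would call `add_repo_utils(os.getcwd())` before the `from utils import ...` line.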
Hi I am currently trying to replicate your methods using the notebooks that you have provided.
I tried running the Jaccard similarity score notebook and I stumbled upon the following errors. Is there any work around?
ImportError Traceback (most recent call last)
in <cell line: 6>()
4 import sys
5 sys.path.insert(1, '../utils')
----> 6 from utils import download_and_extract
7 download_and_extract()
8 drkg_file = '../data/drkg/drkg.tsv'
ImportError: cannot import name 'download_and_extract' from 'utils' (/usr/local/lib/python3.10/dist-packages/utils/__init__.py)
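Note the site-packages path in this ImportError: Python is picking up a pip-installed package that happens to be named `utils`, which shadows the repo's `utils/utils.py`. One general workaround for such name shadowing is to load the repo file directly by path; a sketch (the `'../utils/utils.py'` path is an assumption about the checkout layout):

```python
import importlib.util

def load_module_from_path(name, path):
    """Load a Python source file as a module under the given name,
    bypassing sys.path entirely (and thus any shadowing package)."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# drkg_utils = load_module_from_path("drkg_utils", "../utils/utils.py")
# drkg_utils.download_and_extract()
```

Uninstalling the conflicting pip package (`pip uninstall utils`) would also fix the clash.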
Hi
The graph available in the DRKG repository, containing 5,874,261 triplets, uses IDs to identify compounds and diseases.
Please find this sample from the main graph dataset.
Sample from graph:
Compound::DB01586 GNBR::T::Compound:Disease Disease::MESH:C535932
Compound::DB01586 GNBR::T::Compound:Disease Disease::MESH:D002779
Compound::DB01586 GNBR::Pa::Compound:Disease Disease::MESH:D007674
Ideal data format needed:
Ursodeoxycholic acid GNBR::T::Compound:Disease Cholestasis of pregnancy
Ursodeoxycholic acid GNBR::T::Compound:Disease Cholestasis
Ursodeoxycholic acid GNBR::Pa::Compound:Disease Kidney Diseases
Is there an automated way to get the names for these IDs? Any help in this direction will be appreciated.
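Assuming you have lookup tables from the source vocabularies (DrugBank ID → drug name, MeSH ID → disease name, etc.), rewriting the triplets is mechanical. A sketch with a hypothetical lookup dictionary; the mapping data itself is not part of DRKG and would have to be exported from the original databases:

```python
def rename_triplet(triplet, names):
    """Replace 'Type::id' head/tail entities with human-readable names
    where a name is known; otherwise keep the raw ID unchanged."""
    h, r, t = triplet
    return names.get(h, h), r, names.get(t, t)

names = {  # hypothetical excerpt of a lookup table
    "Compound::DB01586": "Ursodeoxycholic acid",
    "Disease::MESH:D002779": "Cholestasis",
}
triplet = ("Compound::DB01586", "GNBR::T::Compound:Disease", "Disease::MESH:D002779")
print(rename_triplet(triplet, names))
# ('Ursodeoxycholic acid', 'GNBR::T::Compound:Disease', 'Cholestasis')
```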
I am trying to reproduce the results and would like to understand a few things:
How do I get an entity's name? Can you provide the script for that?
Hello, I would like to ask a question. The results (entity embeddings and relation embeddings) we trained with the same parameter settings in the code are very different from the results you provided. The drugs and diseases recommended by our trained embeddings have no association. Do you have any suggestions?
Hi, when I try to run dglke_train
on my Windows 10 I get the following error:
Traceback (most recent call last):
File "C:\Users\u1123073\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\u1123073\Anaconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\u1123073\Anaconda3\Scripts\dglke_train.exe\__main__.py", line 7, in <module>
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dglke\train.py", line 144, in main
train_sampler_head = train_data.create_sampler(args.batch_size,
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dglke\dataloader\sampler.py", line 379, in create_sampler
return EdgeSampler(self.g,
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\contrib\sampling\sampler.py", line 662, in __init__
self._seed_edges = utils.toindex(self._seed_edges)
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\utils\internal.py", line 271, in toindex
return data if isinstance(data, Index) else Index(data, dtype)
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\utils\internal.py", line 26, in __init__
self._initialize_data(data)
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\utils\internal.py", line 33, in _initialize_data
self._dispatch(data)
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\utils\internal.py", line 58, in _dispatch
raise InconsistentDtypeException('Index data specified as %s, but got: %s' %
dgl.utils.internal.InconsistentDtypeException: DGL now requires the input tensor to have the same dtype as the
graph index's dtype(which you can get by g.idype). Index data specified as int64, but got: int32
Is maintenance still active for this repo? Could you please have a look at this?
The GNBR paper states that GNBR contains over 2 million edges, whereas the README for DRKG specifies that only 335,369 edges were derived from GNBR. Where does this discrepancy come from?
Thanks for your sharing! Did you deduplicate data from different sources?
When I try to run the train_embeddings.ipynb code, the command is not found. Here is my code:
print(os.system("DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train --data_files drkg_train.tsv drkg_valid.tsv drkg_test.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 "))
print(os.system('--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update'))
And the error reads:
sh: dglke_train: command not found
32512
sh: --: invalid option
Usage: sh [GNU long option] [option] ...
sh [GNU long option] [option] script-file ...
thanks
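Two things go wrong in the snippet above: `dglke_train` is not on PATH for the non-interactive shell (exit status 32512), and splitting one command across two `os.system` calls hands the second half, starting with `--neg_sample_size`, to `sh` as if it were a standalone command, hence "--: invalid option". A sketch that assembles the whole argument list once and runs it as a single process via `subprocess` (assumes dgl-ke is pip-installed into the active environment; hyperparameters are taken from the snippet above):

```python
import os
import subprocess

def build_dglke_cmd(overrides=None):
    """Assemble the full dglke_train argument list as ONE command,
    so no flag is ever handed to sh on its own."""
    args = {
        "--dataset": "DRKG",
        "--data_path": "./train",
        "--data_files": ["drkg_train.tsv", "drkg_valid.tsv", "drkg_test.tsv"],
        "--format": "raw_udd_hrt",
        "--model_name": "TransE_l2",
        "--batch_size": "2048",
        "--neg_sample_size": "256",
        "--hidden_dim": "400",
        "--gamma": "12.0",
        "--lr": "0.1",
        "--max_step": "100000",
    }
    args.update(overrides or {})
    cmd = ["dglke_train"]
    for flag, value in args.items():
        cmd.append(flag)
        cmd += value if isinstance(value, list) else [value]
    return cmd

# Run as one process; the backend goes into the environment rather than
# being prefixed as a shell variable inside the string:
# subprocess.run(build_dglke_cmd(),
#                env=dict(os.environ, DGLBACKEND="pytorch"), check=True)
```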
Hi,
Thanks for all the work. It looks amazing and I am looking forward to integrating my data with other diseases.
It's a pity that I cannot run the code for training DRKG on my machine, which only has CPUs.
The command is
"dglke_train --dataset DRKG --data_path ./train --data_files drkg_train.tsv drkg_valid.tsv drkg_test.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 64 --neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --num_proc 8 --neg_sample_size_eval 10000"
and the output is
Reading train triples....
Finished. Read 5286834 train triples.
Reading valid triples....
Finished. Read 293713 valid triples.
Reading test triples....
Finished. Read 293713 test triples.
|Train|: 5286834
random partition 5286834 edges into 8 parts
part 0 has 660855 edges
part 1 has 660855 edges
part 2 has 660855 edges
part 3 has 660855 edges
part 4 has 660855 edges
part 5 has 660855 edges
part 6 has 660855 edges
part 7 has 660849 edges
/opt/conda/lib/python3.7/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
warnings.warn(msg, warn_type)
|valid|: 293713
|test|: 293713
Bus error (core dumped)
The command works fine for other data.
"dglke_train --model_name TransE_l2 --dataset FB15k --batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 3000 --log_interval 100 --batch_size_eval 16 --test -adv --regularization_coef 1.00E-09 --num_thread 1 --num_proc 8" worked successfully.
Thanks again for your sharing.
Hi, thanks for sharing this great work. I want to ask how I can use entity2src to get the corresponding concepts. For example, for Gene::100129669 [Hetionet] Biomedical knowledge graph https://het.io/about/ [STRING] https://string-db.org/ — does this mean the identifier 100129669 is the same in Hetionet and STRING?
Hi Dear Author,
Your repurposing KG looks great and comprehensive. I'm curious about the use of this KG, especially for repurposing tasks. If my understanding is correct, the "Compound" entities refer to "Drug" in the collected sources, and repurposing means finding new targets/functions/effects of a known drug.
For example, if we know a drug is effective against some diseases, targets some genes, and has some side effects, we want to explore more diseases it may work on, more potential target genes, and even more side effects it may have... Are those the repurposing tasks?
Hi! And first of all, thanks a lot for putting this together 👍🏻
I'm working on some drug repurposing tasks, and I've bumped into several compound records that I can't seem to find in any MeSH database. I'm referring to records that start with "Compound::MESH:Cxxxx", for example "Compound::MESH:C403507".
Can you advise how I should look up these records in an external database to find the compound name?
What's more, is there any mapping between MeSH IDs and DrugBank?
Thanks!