gnn4dr / drkg Goto Github PK
A knowledge graph and a set of tools for drug repurposing
License: Apache License 2.0
The DRKG is great work, no doubt about that. I just wanted to understand how the entities and relationships CSV files were generated. I understand that the data is taken from various databases such as DrugBank, Hetionet, GNBR, STRING, IntAct, and DGIdb, but how did you generate the entity and relationship CSV files from them?
Hi, I want to know what specific item corresponds to the number in an entity ID. For example, what does Gene::23368 mean? Is there an index file it can be linked to?
Thanks, Hope for your reply!
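For reference, DRKG entity strings follow a `<type>::<id>` convention, where the ID comes from the source vocabulary (the `Gene::` numbers appear to be NCBI Entrez gene IDs, which can be looked up on the NCBI Gene site). A minimal sketch for splitting an entity string into its type and raw ID:

```python
def parse_entity(entity: str):
    """Split a DRKG entity string like 'Gene::23368' into (type, id).

    Only the first '::' separates type from ID; IDs such as
    'MESH:C535932' may themselves contain a colon.
    """
    etype, _, eid = entity.partition("::")
    return etype, eid

print(parse_entity("Gene::23368"))           # ('Gene', '23368')
print(parse_entity("Disease::MESH:C535932")) # ('Disease', 'MESH:C535932')
```

Resolving the raw ID to a human-readable name then has to go through the corresponding source database (e.g. NCBI for genes, DrugBank for compounds).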
Hi!
Thank you for your work and for putting out this tremendous graph! I just have one question concerning the Covid-19 node. All I could find were 27 different SARS-CoV-2 nodes. These are categorized as diseases, but as far as I understand, they are actually virus proteins. I also looked for a node for the actual disease and couldn't find it via its DOIDs (DOID:0080600, and DOID:0080848 for long Covid).
For my research I'll just go ahead and use the SARS-CoV-2 Spike node as a proxy for Covid-19. Did you intend it that way, or did I miss something?
Thanks!
If I am not wrong, DRKG includes ~10K small-molecule drugs from DrugBank 4.0. I see that DrugBank 5.1.9 is now available, with more than 11K small-molecule drugs. Is there any plan to release a further version of DRKG?
Can you show some examples of using this model to predict drugs for certain diseases? What is the input? Thanks a lot.
Hi Dear Author,
I see that DRKG synthesizes many knowledge bases into one very big graph, but do any of these knowledge bases contain ICD codes for their diseases? If not, is there any way to map their disease codes to ICD?
Thanks!
This looks very similar to what has already been done by https://github.com/hetio/hetionet. Would you please add documentation explaining how this is different? Did you try using Hetionet for this task, since it is nominally set up for drug repositioning?
Since it looks like some edges are ingested from Hetionet, what added value does DRKG provide?
Thanks for all your effort. Great content!
Inside the file drkg.tsv, lines 4123208-4123214 contain genes without ID info. You might want to delete or fix them. You can find them with a regex search for Gene::\t
I also see 22 more genes with no ID. You can find them with a regex search for Gene::\n — for example, on lines 4122083, 4138986, 4145818, etc.
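Rather than deleting those lines by hand, the malformed rows can be filtered out programmatically. A sketch that drops any triplet whose head or tail entity has an empty ID (file paths are placeholders):

```python
import csv

def drop_empty_entities(in_path, out_path):
    """Copy a DRKG-style TSV, skipping triplets whose head or tail entity
    has an empty ID (i.e. ends in a bare '::', such as 'Gene::')."""
    kept = dropped = 0
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for row in reader:
            if len(row) == 3 and all(not ent.endswith("::") for ent in (row[0], row[2])):
                writer.writerow(row)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Running it on drkg.tsv should drop exactly the rows found by the regex searches above.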
For example, compare the edge types bioarx::HumGenHumGen:Gene:Gene and GNBR::Pr::Compound:Disease:
e1, e2 = keys[0], keys[40]
n1_d = node_dictionary[e1]
n2_d = node_dictionary[e2]
A number in n2_d may refer to a compound or a disease, while the same number in n1_d refers to a gene. The Jaccard score between these two sets is therefore meaningless.
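One way to avoid this type collision is to keep the full `Type::id` strings, rather than bare numbers, when building the sets, so that identical numbers from different vocabularies can never match. A sketch:

```python
def jaccard(a, b):
    """Jaccard similarity of two iterables of typed entity strings."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

genes = {"Gene::23368", "Gene::1234"}
mixed = {"Compound::23368", "Gene::1234"}
# 'Gene::23368' and 'Compound::23368' no longer collide; only
# 'Gene::1234' is shared, so the score is 1/3.
print(jaccard(genes, mixed))
```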
Hi!
I want to know whether I could retrain the DRKG embeddings with additional triplets related to certain diseases using Colab Pro, i.e. with a GPU. In particular, is that retraining task extremely time-consuming?
Thanks!
Hi -
When I try running this notebook, the 'utils' module is not found (I have the CWD set to the directory where I downloaded the DRKG).
Where should this 'utils' module be?
ModuleNotFoundError Traceback (most recent call last)
in
4 import sys
5 sys.path.insert(1, '../utils')
----> 6 from utils import download_and_extract
7 download_and_extract()
8 drkg_file = '../data/drkg/drkg.tsv'
ModuleNotFoundError: No module named 'utils'
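The `utils` folder ships in the root of the DRKG repo checkout, so the relative `'../utils'` in the notebook only resolves when the notebook runs from a subdirectory of that checkout; downloading the data alone is not enough. A sketch that locates the folder relative to a given directory instead of relying on a hard-coded relative path (the repo layout assumed here is `<repo>/utils/`):

```python
import os
import sys

def add_repo_utils(start_dir):
    """Find the DRKG repo's utils/ folder one level above start_dir and
    put it on sys.path, so `from utils import download_and_extract`
    resolves to the repo copy. Raises if the folder is missing."""
    repo_root = os.path.abspath(os.path.join(start_dir, os.pardir))
    utils_dir = os.path.join(repo_root, "utils")
    if not os.path.isdir(utils_dir):
        raise FileNotFoundError(
            f"expected {utils_dir}; clone the full DRKG repository, "
            "not just the data files")
    sys.path.insert(0, utils_dir)
    return utils_dir
```

From a notebook inside the repo you would call `add_repo_utils(os.getcwd())` before the `from utils import ...` line.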
Hi I am currently trying to replicate your methods using the notebooks that you have provided.
I tried running the Jaccard similarity score notebook and I stumbled upon the following errors. Is there any work around?
ImportError Traceback (most recent call last)
in <cell line: 6>()
4 import sys
5 sys.path.insert(1, '../utils')
----> 6 from utils import download_and_extract
7 download_and_extract()
8 drkg_file = '../data/drkg/drkg.tsv'
ImportError: cannot import name 'download_and_extract' from 'utils' (/usr/local/lib/python3.10/dist-packages/utils/__init__.py)
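Note the site-packages path in this ImportError: Python is picking up a pip-installed package that happens to be named `utils`, which shadows the repo's `utils/utils.py`. One general workaround for such name shadowing is to load the repo file directly by path; a sketch (the `'../utils/utils.py'` path is an assumption about the checkout layout):

```python
import importlib.util

def load_module_from_path(name, path):
    """Load a Python source file as a module under the given name,
    bypassing sys.path entirely (and thus any shadowing package)."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# drkg_utils = load_module_from_path("drkg_utils", "../utils/utils.py")
# drkg_utils.download_and_extract()
```

Uninstalling the conflicting pip package (`pip uninstall utils`) would also fix the clash.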
Hi
The graph available in the DRKG repository, containing 5,874,261 triplets, uses IDs to identify compounds and diseases.
Please find this sample from the main graph dataset.
Sample from graph:
Compound::DB01586 GNBR::T::Compound:Disease Disease::MESH:C535932
Compound::DB01586 GNBR::T::Compound:Disease Disease::MESH:D002779
Compound::DB01586 GNBR::Pa::Compound:Disease Disease::MESH:D007674
Ideal data format needed:
Ursodeoxycholic acid GNBR::T::Compound:Disease Cholestasis of pregnancy
Ursodeoxycholic acid GNBR::T::Compound:Disease Cholestasis
Ursodeoxycholic acid GNBR::Pa::Compound:Disease Kidney Diseases
Is there an automated way to get the names for these IDs? Any help in this direction will be appreciated.
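Assuming you have lookup tables from the source vocabularies (DrugBank ID → drug name, MeSH ID → disease name, etc.), rewriting the triplets is mechanical. A sketch with a hypothetical lookup dictionary; the mapping data itself is not part of DRKG and would have to be exported from the original databases:

```python
def rename_triplet(triplet, names):
    """Replace 'Type::id' head/tail entities with human-readable names
    where a name is known; otherwise keep the raw ID unchanged."""
    h, r, t = triplet
    return names.get(h, h), r, names.get(t, t)

names = {  # hypothetical excerpt of a lookup table
    "Compound::DB01586": "Ursodeoxycholic acid",
    "Disease::MESH:D002779": "Cholestasis",
}
triplet = ("Compound::DB01586", "GNBR::T::Compound:Disease", "Disease::MESH:D002779")
print(rename_triplet(triplet, names))
# ('Ursodeoxycholic acid', 'GNBR::T::Compound:Disease', 'Cholestasis')
```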
I am trying to reproduce the results and would like to understand a few things:
How do I get an entity's name? Can you provide the script for that?
Hello, I would like to ask a question. The results (entity embeddings and relation embeddings) we trained with the same parameter settings in the code are very different from the results you provided. The drugs and diseases recommended by our trained embeddings have no association. Do you have any suggestions?
Hi, when I try to run dglke_train
on my Windows 10 I get the following error:
Traceback (most recent call last):
File "C:\Users\u1123073\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\u1123073\Anaconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\u1123073\Anaconda3\Scripts\dglke_train.exe\__main__.py", line 7, in <module>
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dglke\train.py", line 144, in main
train_sampler_head = train_data.create_sampler(args.batch_size,
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dglke\dataloader\sampler.py", line 379, in create_sampler
return EdgeSampler(self.g,
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\contrib\sampling\sampler.py", line 662, in __init__
self._seed_edges = utils.toindex(self._seed_edges)
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\utils\internal.py", line 271, in toindex
return data if isinstance(data, Index) else Index(data, dtype)
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\utils\internal.py", line 26, in __init__
self._initialize_data(data)
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\utils\internal.py", line 33, in _initialize_data
self._dispatch(data)
File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\utils\internal.py", line 58, in _dispatch
raise InconsistentDtypeException('Index data specified as %s, but got: %s' %
dgl.utils.internal.InconsistentDtypeException: DGL now requires the input tensor to have the same dtype as the
graph index's dtype(which you can get by g.idype). Index data specified as int64, but got: int32
Is maintenance still active for this repo? Could you please have a look at this?
The GNBR paper states that GNBR contains over 2 million edges, whereas the README for DRKG specifies that only 335,369 edges were derived from GNBR. Where does this discrepancy come from?
Thanks for your sharing! Did you deduplicate data from different sources?
When I try to run the train_embeddings.ipynb code, the command is not found. Here is my code:
print(os.system("DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train --data_files drkg_train.tsv drkg_valid.tsv drkg_test.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 "))
print(os.system('--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update'))
And the error reads:
sh: dglke_train: command not found
32512
sh: --: invalid option
Usage: sh [GNU long option] [option] ...
sh [GNU long option] [option] script-file ...
thanks
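Two things go wrong in the snippet above: `dglke_train` is not on PATH for the non-interactive shell (exit status 32512), and splitting one command across two `os.system` calls hands the second half, starting with `--neg_sample_size`, to `sh` as if it were a standalone command, hence "--: invalid option". A sketch that assembles the whole argument list once and runs it as a single process via `subprocess` (assumes dgl-ke is pip-installed into the active environment; hyperparameters are taken from the snippet above):

```python
import os
import subprocess

def build_dglke_cmd(overrides=None):
    """Assemble the full dglke_train argument list as ONE command,
    so no flag is ever handed to sh on its own."""
    args = {
        "--dataset": "DRKG",
        "--data_path": "./train",
        "--data_files": ["drkg_train.tsv", "drkg_valid.tsv", "drkg_test.tsv"],
        "--format": "raw_udd_hrt",
        "--model_name": "TransE_l2",
        "--batch_size": "2048",
        "--neg_sample_size": "256",
        "--hidden_dim": "400",
        "--gamma": "12.0",
        "--lr": "0.1",
        "--max_step": "100000",
    }
    args.update(overrides or {})
    cmd = ["dglke_train"]
    for flag, value in args.items():
        cmd.append(flag)
        cmd += value if isinstance(value, list) else [value]
    return cmd

# Run as one process; the backend goes into the environment rather than
# being prefixed as a shell variable inside the string:
# subprocess.run(build_dglke_cmd(),
#                env=dict(os.environ, DGLBACKEND="pytorch"), check=True)
```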
Hi,
Thanks for all the work. It looks amazing and I am looking forward to integrating my data with other diseases.
It's a pity that I cannot run the code for training DRKG on my machine, which only has CPUs.
The command is
"dglke_train --dataset DRKG --data_path ./train --data_files drkg_train.tsv drkg_valid.tsv drkg_test.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 64 --neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --num_proc 8 --neg_sample_size_eval 10000"
and the output is
Reading train triples....
Finished. Read 5286834 train triples.
Reading valid triples....
Finished. Read 293713 valid triples.
Reading test triples....
Finished. Read 293713 test triples.
|Train|: 5286834
random partition 5286834 edges into 8 parts
part 0 has 660855 edges
part 1 has 660855 edges
part 2 has 660855 edges
part 3 has 660855 edges
part 4 has 660855 edges
part 5 has 660855 edges
part 6 has 660855 edges
part 7 has 660849 edges
/opt/conda/lib/python3.7/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
warnings.warn(msg, warn_type)
|valid|: 293713
|test|: 293713
Bus error (core dumped)
The command works fine for other data.
"dglke_train --model_name TransE_l2 --dataset FB15k --batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 3000 --log_interval 100 --batch_size_eval 16 --test -adv --regularization_coef 1.00E-09 --num_thread 1 --num_proc 8" worked successfully.
Thanks again for your sharing.
Hi, thanks for sharing this great work. I want to ask how I can use entity2src to get the corresponding concepts. For example, for Gene::100129669 [Hetionet] Biomedical knowledge graph https://het.io/about/ [STRING] https://string-db.org/ — does this mean the identifier 100129669 is the same in Hetionet and STRING?
Hi Dear Author,
Your repurposing KG looks great and comprehensive. I'm curious about the use of this KG, especially for repurposing tasks. If my understanding is correct, the "Compound" entities refer to "Drug" in the collected sources, and repurposing means finding new targets/functions/effects of a known drug.
For example, if we know a drug is effective against some diseases, targets some genes, and has some side effects, we want to explore more diseases it may work on, more potential target genes, and even more side effects it may have... Are those the repurposing tasks?
Hi! And first of all, thanks a lot for putting this together 👍🏻
I'm working on some drug repurposing tasks, and I've bumped into several compound records that I can't seem to find in any MeSH database. I'm referring to records that start with "Compound::MESH:Cxxxx", for example "Compound::MESH:C403507".
Can you advise how I should look up these records in an external database to find the compound name?
What's more, is there any mapping between MeSH IDs and DrugBank?
Thanks!