drkg's People

Contributors

bioannidis, classicsong, gurdaspuriya, ishaan-mehta, mufeili, zheng-da

drkg's Issues

I have a doubt about entities.csv and relationship.csv

DRKG is great work, no doubt about that. I just wanted to understand how entities.csv and relationship.csv are generated.
I understand that the data are taken from various databases such as DrugBank, Hetionet, GNBR, STRING, IntAct and DGIdb, but I wanted to know how you generated the entity and relationship CSV files from them.

How to get more information about entities?

Hi, I want to know which specific item each entity number corresponds to. For example, what
does Gene::23368 mean? Is there an index file that the IDs can be linked to?

Thanks, Hope for your reply!
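
A minimal sketch of how such an identifier can be taken apart, assuming the `Type::identifier` convention visible in the samples on this page (for example `Gene::23368` or `Disease::MESH:D002779`). The function name `split_entity` is hypothetical, and this page does not state which external database a numeric gene identifier refers to, so the lookup itself is left open.

```python
from typing import Tuple


def split_entity(entity: str) -> Tuple[str, str]:
    """Split a DRKG-style entity string into (entity_type, identifier)."""
    # Only the first "::" separates type from identifier; the identifier
    # itself may contain single colons, as in "MESH:D002779".
    entity_type, sep, identifier = entity.partition("::")
    if not sep:
        raise ValueError(f"not a 'Type::identifier' string: {entity!r}")
    return entity_type, identifier


print(split_entity("Gene::23368"))
print(split_entity("Disease::MESH:D002779"))
```

The identifier part is what would then be searched in whichever source database the entity type points to.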

Can't find the Covid-19 Disease Node

Hi!

Thank you for your work and for putting out this tremendous graph! I just have one question concerning the Covid-19 node. All I could find was 27 different SARS-CoV-2 nodes. These are categorized as diseases, but as far as I understand, they are actually virus proteins. I also looked for a node for the actual disease and couldn't find it via its DOIDs (DOID:0080600, and DOID:0080848 for long Covid).

For my research I'll just go ahead and use the SARS-CoV-2 Spike node as a proxy for Covid-19. Did you intend it that way, or did I miss something?

Thanks!

Any plan to release a further version of DRKG?

If I am not wrong, DRKG includes ~10K small-molecule drugs from DrugBank 4.0. I see that DrugBank 5.1.9 is now available, with more than 11K small-molecule drugs. Is there any plan to release a further version of DRKG?

How to use it

Can you show some examples of using this model to predict drugs for certain diseases? What is the input? Thanks a lot.

Does DRKG contain ICD code?

Hi Dear Author,

I see that DRKG combines many knowledge bases into one very big graph, but do any of these knowledge bases contain ICD codes for their diseases? If not, is there any way to map their disease codes to ICD?

Thanks!

Reference to himmelstein's hetionet?

This looks very similar to what has already been done by https://github.com/hetio/hetionet. Would you please add documentation explaining how this is different? Did you try using hetionet for this task, since it is nominally set up for drug repositioning?

Since it looks like some edges are ingested from hetionet, what added value does this add?

Gene with no id inside the file "drkg.tsv"

Thanks for all your effort. Great content!

Inside the file drkg.tsv, lines 4123208-4123214 contain genes without ID info. You might want to delete or fix them. You can find them with a regex search for Gene::\t

Also, I see 22 more genes with no ID. You can find them with a regex search for Gene::\n. For example, you can see them on lines 4122083, 4138986, 4145818, etc.
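
The two searches above can be combined into one pass, sketched here under the assumption that each line is a tab-separated head, relation, tail triplet. The helper name `find_empty_gene_lines` is hypothetical and the sample lines are made up for illustration, not copied from drkg.tsv.

```python
import re

# "Gene::" followed directly by a tab or by the end of the line,
# i.e. a gene entity with nothing after the "::" separator.
EMPTY_GENE = re.compile(r"Gene::(?=\t|$)")


def find_empty_gene_lines(lines):
    """Return 1-based numbers of lines whose head or tail is an id-less gene."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if EMPTY_GENE.search(line.rstrip("\r\n"))
    ]


# Illustrative lines only (the gene ids here are invented).
sample = [
    "Gene::23368\tbioarx::HumGenHumGen:Gene:Gene\tGene::1234\n",
    "Gene::\tbioarx::HumGenHumGen:Gene:Gene\tGene::1234\n",
    "Gene::23368\tbioarx::HumGenHumGen:Gene:Gene\tGene::\n",
]
print(find_empty_gene_lines(sample))
```

On the real file the same function can be given an open file handle, for example `with open("drkg.tsv") as f: find_empty_gene_lines(f)`.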

Unreasonable to compute jaccard_scores in this way

For example, compare the edges bioarx::HumGenHumGen:Gene:Gene and GNBR::Pr::Compound:Disease:

e1, e2 = keys[0], keys[40]
n1_d = node_dictionary[e1]
n2_d = node_dictionary[e2]

A number in n2_d may refer to a compound or a disease, while the same number in n1_d refers to a gene, so the Jaccard score between these two sets is meaningless.
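
To illustrate the concern with a toy case: the sketch below assumes, hypothetically, that node sets hold bare integer indices assigned separately per entity type, so the same integer can denote a gene in one set and a compound in another. It is not the notebook's implementation; the sets and the `jaccard` helper are made up for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|, taken as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical bare indices: 7 and 9 appear in both sets by coincidence.
gene_nodes = {3, 7, 9}
compound_nodes = {7, 9, 12}
print(jaccard(gene_nodes, compound_nodes))

# Qualifying each index with its entity type removes the accidental overlap.
typed_genes = {("Gene", i) for i in gene_nodes}
typed_compounds = {("Compound", i) for i in compound_nodes}
print(jaccard(typed_genes, typed_compounds))
```

With bare indices the two unrelated sets score 0.5; once each index carries its type, the score is 0.0.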

Retraining DRKG with additional triplets

Hi!
I want to know if I could retrain the embeddings of the DRKG with additional triplets related to certain diseases using Colab Pro, i.e. with a GPU. Particularly, I want to know whether that retraining task is extremely time-consuming or not.

Thanks!

ModuleNotFoundError: No module named 'utils'

Hi -
When I try running this notebook, the 'utils' module is not found. (I have the CWD set to the directory where I downloaded the DRKG)
Where should this 'utils' module be located?


ModuleNotFoundError Traceback (most recent call last)
in
4 import sys
5 sys.path.insert(1, '../utils')
----> 6 from utils import download_and_extract
7 download_and_extract()
8 drkg_file = '../data/drkg/drkg.tsv'

ModuleNotFoundError: No module named 'utils'
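
Because `'../utils'` in the traceback is a relative path, it is resolved against the current working directory. A small check like the following can show where Python is actually looking. This is a sketch: the helper name `add_utils_dir` is hypothetical, and the assumption that a `utils` directory sits one level above the notebook's directory is inferred only from the path in the traceback.

```python
import sys
from pathlib import Path


def add_utils_dir(relative="../utils"):
    """Put the resolved utils directory on sys.path, or explain why it is missing."""
    utils_dir = Path(relative).resolve()
    if not utils_dir.is_dir():
        raise FileNotFoundError(
            f"{utils_dir} does not exist; current directory is {Path.cwd()}"
        )
    if str(utils_dir) not in sys.path:
        sys.path.insert(1, str(utils_dir))
    return utils_dir
```

If this raises, the working directory is probably not the notebook's own folder, and the error message shows both paths for comparison.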

ImportError: cannot import name 'download_and_extract' from 'utils' (/usr/local/lib/python3.10/dist-packages/utils/__init__.py)

Hi, I am currently trying to replicate your methods using the notebooks that you have provided.
I tried running the Jaccard similarity score notebook and ran into the following error. Is there any workaround?

ImportError Traceback (most recent call last)
in <cell line: 6>()
4 import sys
5 sys.path.insert(1, '../utils')
----> 6 from utils import download_and_extract
7 download_and_extract()
8 drkg_file = '../data/drkg/drkg.tsv'

ImportError: cannot import name 'download_and_extract' from 'utils' (/usr/local/lib/python3.10/dist-packages/utils/__init__.py)
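
The path in this error points at an installed `utils` package under dist-packages, which suggests that a different module with the same name is being imported instead of the repository's own. One way around such a name clash is to load the file by its explicit path, sketched below with the standard `importlib` machinery. The helper name `load_module_from_path` is hypothetical.

```python
import importlib.util


def load_module_from_path(name, path):
    """Import the Python file at `path` under the module name `name`."""
    spec = importlib.util.spec_from_file_location(name, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load a module from {path}")
    module = importlib.util.module_from_spec(spec)
    # Executes the file's top-level code inside the new module object.
    spec.loader.exec_module(module)
    return module
```

Assuming the helper lives in a file such as `../utils/utils.py` (a guess based on the `sys.path.insert` line above, not confirmed here), it could then be called as `load_module_from_path("drkg_utils", "../utils/utils.py").download_and_extract()`.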

Need entity names for the ids in the DRKG graph

Hi

The graph available in the DRKG repository containing 5,874,261 triplets has ids for identifying compounds and diseases.
Please find this sample from the main graph dataset.

Sample from graph:
Compound::DB01586 GNBR::T::Compound:Disease Disease::MESH:C535932
Compound::DB01586 GNBR::T::Compound:Disease Disease::MESH:D002779
Compound::DB01586 GNBR::Pa::Compound:Disease Disease::MESH:D007674

Ideal data format needed:
Ursodeoxycholic acid GNBR::T::Compound:Disease Cholestasis of pregnancy
Ursodeoxycholic acid GNBR::T::Compound:Disease Cholestasis
Ursodeoxycholic acid GNBR::Pa::Compound:Disease Kidney Diseases

Is there any automated way to get the names for these IDs?
Any help in this direction will be appreciated.
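
Once an id-to-name table is available, the substitution itself is a dictionary lookup. The sketch below assumes tab-separated triplets and builds its table only from the id/name pairs in the sample above, paired by position. A real table would have to be built from the source databases, and `name_triplet` is a hypothetical helper name.

```python
# Pairs taken from the sample above, matched line by line.
NAMES = {
    "Compound::DB01586": "Ursodeoxycholic acid",
    "Disease::MESH:C535932": "Cholestasis of pregnancy",
    "Disease::MESH:D002779": "Cholestasis",
    "Disease::MESH:D007674": "Kidney Diseases",
}


def name_triplet(line, names=NAMES):
    """Replace head and tail ids with names; unknown ids are kept unchanged."""
    head, relation, tail = line.rstrip("\n").split("\t")
    return names.get(head, head), relation, names.get(tail, tail)


print(name_triplet("Compound::DB01586\tGNBR::T::Compound:Disease\tDisease::MESH:D002779"))
```

Keeping unknown ids as they are makes gaps in the lookup table easy to spot in the output.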

How to generate my own Knowledge Graph?

I am trying to reproduce the results and would like to understand a few things:

  1. How are the relationships (links) between entities established within the DRKG dataset (to then allow for the knowledge graph generation)?
  2. How are the triplets and edge types produced?

Parameter Setting

Hello, I would like to ask a question. The entity and relation embeddings we trained with the same parameter settings as in the code are very different from the ones you provided. The drugs and diseases recommended by our trained embeddings have no association. Do you have any suggestions?

Windows support

Hi, when I try to run dglke_train on my Windows 10 I get the following error:

Traceback (most recent call last):
  File "C:\Users\u1123073\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\u1123073\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\u1123073\Anaconda3\Scripts\dglke_train.exe\__main__.py", line 7, in <module>
  File "C:\Users\u1123073\Anaconda3\lib\site-packages\dglke\train.py", line 144, in main
    train_sampler_head = train_data.create_sampler(args.batch_size,
  File "C:\Users\u1123073\Anaconda3\lib\site-packages\dglke\dataloader\sampler.py", line 379, in create_sampler
    return EdgeSampler(self.g,
  File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\contrib\sampling\sampler.py", line 662, in __init__
    self._seed_edges = utils.toindex(self._seed_edges)
  File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\utils\internal.py", line 271, in toindex
    return data if isinstance(data, Index) else Index(data, dtype)
  File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\utils\internal.py", line 26, in __init__
    self._initialize_data(data)
  File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\utils\internal.py", line 33, in _initialize_data
    self._dispatch(data)
  File "C:\Users\u1123073\Anaconda3\lib\site-packages\dgl\utils\internal.py", line 58, in _dispatch
    raise InconsistentDtypeException('Index data specified as %s, but got: %s' %
dgl.utils.internal.InconsistentDtypeException: DGL now requires the input tensor to have the same dtype as the
 graph index's dtype(which you can get by g.idype). Index data specified as int64, but got: int32

Is maintenance still active for this repo? Could you please have a look at this?

Thanks for sharing the great knowledge, but I ran into a problem

When I try to run the train_embeddings.ipynb code, the command is not found. Here is my code:

print(os.system("DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train --data_files drkg_train.tsv drkg_valid.tsv drkg_test.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 "))

print(os.system('--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update'))

And the error output is:

sh: dglke_train: command not found
32512
sh: --: invalid option
Usage: sh [GNU long option] [option] ...
sh [GNU long option] [option] script-file ...

thanks
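
Each `os.system` call starts its own shell, so the second string above is run as a separate command that begins with `--neg_sample_size`, which matches the `sh: --: invalid option` line. The `command not found` line indicates that `dglke_train` is not on the PATH of that shell. A minimal sketch of passing everything in a single call, with the flags copied from the two strings above; the names `ARGS` and `run_training` are hypothetical.

```python
import os
import shutil
import subprocess

# One argument list, so every flag reaches the same dglke_train process.
ARGS = [
    "dglke_train", "--dataset", "DRKG", "--data_path", "./train",
    "--data_files", "drkg_train.tsv", "drkg_valid.tsv", "drkg_test.tsv",
    "--format", "raw_udd_hrt", "--model_name", "TransE_l2",
    "--batch_size", "2048", "--neg_sample_size", "256", "--hidden_dim", "400",
    "--gamma", "12.0", "--lr", "0.1", "--max_step", "100000",
    "--log_interval", "1000", "--batch_size_eval", "16", "-adv",
    "--regularization_coef", "1.00E-07", "--test", "--num_thread", "1",
    "--gpu", "0", "1", "2", "3", "4", "5", "6", "7", "--num_proc", "8",
    "--neg_sample_size_eval", "10000", "--async_update",
]


def run_training(args=ARGS):
    """Run the whole command in one process and return its exit code."""
    if shutil.which(args[0]) is None:
        raise FileNotFoundError(f"{args[0]} is not on PATH in this environment")
    # Same effect as the DGLBACKEND=pytorch prefix in the shell command.
    env = dict(os.environ, DGLBACKEND="pytorch")
    return subprocess.run(args, env=env, check=False).returncode
```

If `run_training()` raises `FileNotFoundError`, the executable is not visible from the notebook's environment, and that has to be resolved before any of the flags matter.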

Failed to run for the dataset DRKG

Hi,

Thanks for all the work. It looks amazing and I am looking forward to integrating my data with other diseases.
It's a pity that I cannot run the code for training DRKG on my machine, which only has CPUs.

The command is
"dglke_train --dataset DRKG --data_path ./train --data_files drkg_train.tsv drkg_valid.tsv drkg_test.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 64 --neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --num_proc 8 --neg_sample_size_eval 10000"

and the output is
Reading train triples....
Finished. Read 5286834 train triples.
Reading valid triples....
Finished. Read 293713 valid triples.
Reading test triples....
Finished. Read 293713 test triples.
|Train|: 5286834
random partition 5286834 edges into 8 parts
part 0 has 660855 edges
part 1 has 660855 edges
part 2 has 660855 edges
part 3 has 660855 edges
part 4 has 660855 edges
part 5 has 660855 edges
part 6 has 660855 edges
part 7 has 660849 edges
/opt/conda/lib/python3.7/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
warnings.warn(msg, warn_type)
|valid|: 293713
|test|: 293713
Bus error (core dumped)

The command works fine for other data.
"dglke_train --model_name TransE_l2 --dataset FB15k --batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 3000 --log_interval 100 --batch_size_eval 16 --test -adv --regularization_coef 1.00E-09 --num_thread 1 --num_proc 8" worked successfully.

Thanks again for sharing.

Interests about repurposing tasks

Hi Dear Author,

Your repurposing KG looks great and comprehensive. I'm curious about the use of this KG, especially for repurposing tasks. If my understanding is correct, the "Compound" entities refer to "Drug" in the collected sources, and repurposing means finding new targets, functions, or effects of a known drug.

For example, suppose we know that a drug is effective against some diseases, targets some genes, and has some side effects. We then want to explore more diseases it may work on, further genes it may target, and even more side effects it may have. Are those the repurposing tasks?

MESH IDs

Hi! And first of all, thanks a lot for putting this together 👍🏻
I'm working on some drug repurposing tasks, and I've bumped into several compound records that I can't seem to find in any MESH database. I'm referring to records that start with "Compound::MESH:Cxxxx", for example "Compound::MESH:C403507".
Can you tell me how I should look up these records in an external database to find the compound name?
What's more, is there any mapping between MESH IDs and DrugBank?

Thanks!
