dhimmel / integrate

Scripts and resources to create Hetionet v1.0, a heterogeneous network for drug repurposing

Home Page: https://doi.org/10.15363/thinklab.4

Python 0.34% Jupyter Notebook 99.58% Shell 0.08%

rephetio neo4j data-integration network hetnet drug-repurposing hetionet

integrate's Introduction

Building hetionet: data integration, hetnet permutation, and Neo4j import

Hetnets are networks with multiple types of nodes and edges. This repository creates Hetionet v1.0, a hetnet encoding biology, disease, and pharmacology. We created Hetionet v1.0 for Project Rephetio, a study to systematically evaluate why drugs work and to predict new therapeutic uses for existing drugs. The study describing Project Rephetio and Hetionet v1.0 is:

Systematic integration of biomedical knowledge prioritizes drugs for repurposing
Daniel S Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio E Baranzini
eLife (2017-09-22) DOI: 10.7554/eLife.26726

Note: this repository is for building Hetionet v1.0. We recommend that users interested in downloading and using the completed hetnet do so from the dhimmel/hetionet repository.

Execution

precompile.sh executes notebooks which combine multiple resources into a single type of edge. See the contents of compile for more information.
build.sh builds the hetnet, creates permuted derivatives, and exports the hetnet to Neo4j.

Notebooks

integrate.ipynb creates the hetnet by integrating data that is stored either in compile or elsewhere on GitHub. All GitHub links use commit hashes to be version-specific. The JSON-formatted hetnet is exported to data/hetnet.json.bz2.
permute.ipynb loads the created hetnet and creates permuted derivatives that preserve node degree but destroy edge specificity. The permuted hetnets are written to data/permuted, but are not uploaded due to file size.
neo4j-import.ipynb imports the hetnet and its permutations into separate Neo4j instances. These Neo4j instances are not uploaded due to file size and licensing issues. Currently, neo4j-community-2.3.3 is used.
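To illustrate what "preserve node degree but destroy edge specificity" means, here is a minimal sketch of a generic degree-preserving edge swap. It is not the code used in permute.ipynb; the function name permute_edges, its parameters, and the example identifiers are hypothetical, and it assumes a single edge type whose edges are unique (source, target) pairs.

```python
import random
from collections import Counter


def permute_edges(edges, n_swaps, seed=0):
    """Randomize (source, target) pairs while keeping every node's degree."""
    rng = random.Random(seed)
    edges = list(edges)
    edge_set = set(edges)
    for _ in range(n_swaps):
        # Pick two distinct edges and propose exchanging their targets
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new_1, new_2 = (a, d), (c, b)
        # Skip swaps that would duplicate an existing edge (this also
        # covers the cases where the two edges share a source or target)
        if new_1 in edge_set or new_2 in edge_set:
            continue
        edge_set -= {edges[i], edges[j]}
        edge_set |= {new_1, new_2}
        edges[i], edges[j] = new_1, new_2
    return edges


# Hypothetical Compound-Gene edges, for illustration only
example_edges = [('C1', 'G1'), ('C1', 'G2'), ('C2', 'G2'), ('C3', 'G3'), ('C3', 'G1')]
permuted = permute_edges(example_edges, n_swaps=100, seed=1)

# Every node keeps its degree even though its partners may change
assert Counter(s for s, _ in permuted) == Counter(s for s, _ in example_edges)
assert Counter(t for _, t in permuted) == Counter(t for _, t in example_edges)
```

Because each accepted swap only exchanges the targets of two edges, every source keeps its out-degree and every target keeps its in-degree, while the specific pairings are shuffled.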

Components

data: the directory containing saved versions of the network.
data/summary: the directory with tables of network statistics. See the summaries of metanodes and metaedges.
viz: the directory containing network visualizations. Includes a holistic network view as well as node degree distributions.

Environment

The dependencies are listed in environment.yml, which can be installed on Linux using:

conda env create --file=environment.yml

Activate the environment with source activate integrate.

License

All original content in this repository is released as CC0. However, the hetnet integrates data from many resources and users should consider the licensing of each source. We apply a license attribute on a per node and per edge basis for sources with defined licenses. However, some resources don't provide any license, so for those we've requested permission. More information is available on Thinklab. See licenses/README.md for a table of all resources and their licensing.

integrate's Issues

Add Ceritinib to Hetionet?

Thanks for the work setting up Hetionet!

Is it possible to add the cancer drug Ceritinib (https://www.drugbank.ca/drugs/DB09063)?

A similar compound Crizotinib (https://www.drugbank.ca/drugs/DB08865) is in the database.

They both target the same protein for cancer treatment and it would be great to see how they differentiate within Hetionet.

Add data to Entrez Gene add_node

Also add url to data. Possible URL formats include:

RAM Requirements

How much RAM is required to run the neo4j-import.ipynb script? On a 32 GB RAM AWS instance, the RAM gets maxed out and the cell is terminated. Based on this post, it seems that your machine had 256 GB of RAM, and I was wondering if it needed that much, or if there was a way to limit RAM usage.

Remove self-loop edges for LINCS L1000 genetic perturbations

Knockdown and overexpression LINCS perturbations should by definition down/up-regulate the manipulated gene. Thus the self-loop edges here are superfluous.

Implement by ensuring that source_id != target_id.
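A minimal sketch of that check, assuming edges are held as dictionaries with source_id and target_id keys. The helper name without_self_loops and the example edges are hypothetical, not taken from the notebook.

```python
def without_self_loops(edges):
    """Keep only edges whose source and target identifiers differ."""
    return [edge for edge in edges if edge['source_id'] != edge['target_id']]


# Hypothetical Gene-regulates-Gene edges keyed by Entrez Gene ID
edges = [
    {'source_id': ('Gene', 1), 'target_id': ('Gene', 1), 'kind': 'downregulates'},
    {'source_id': ('Gene', 1), 'target_id': ('Gene', 2), 'kind': 'upregulates'},
]
filtered = without_self_loops(edges)
```

The same comparison could equally be applied as a row filter on the perturbation table before any edges are added.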

Managing dependencies

How are we supposed to install the hetio package, which is one of the repo's dependencies? The environment.yml file has a hard-coded path (/home/dhimmels/Documents/github/hetio)==0.1.0.

I tried downloading version 0.1.0 of the hetio package from https://github.com/dhimmel/hetio/releases and changing the hard-coded path to the local path instead, but received the following error when trying to run conda env create -f environment.yml:

Extracting packages ...
[      COMPLETE      ]|###################################################| 100%
Linking packages ...
[      COMPLETE      ]|###################################################| 100%
Invalid requirement: 'hetio (/home/ubuntu/hetio)==0.1.0'
It looks like a path. Does it exist ?
CondaValueError: Value error: pip returned an error.

Any ideas? I have also tried installing directly from GitHub instead of using a local path, but then I get a dependency error with pypandoc (i.e. with the following line in environment.yml: - "--editable=git+https://github.com/dhimmel/hetio.git#egg=hetio"); see this issue.

Bias in Anatomy–downregulates–Gene and Anatomy–upregulates–Gene edges

Hi Daniel,

If I understand it correctly, you used the information about over-/under-expression from Bgee to create your Anatomy–downregulates–Gene and Anatomy–upregulates–Gene edges [1]. The values provided by Bgee refer to over-/under-expression across anatomy but also across life stages [2]. However, you are using only data from adults and therefore do not provide the life stage in your anatomy nodes. Could this, in your opinion, bias your network or otherwise affect it or its predictions?

Questions about hetionet: metabolomics / side effects versus diseases

Hi Daniel,

For Hetionet, I have two brief questions and would like to hear your insights:

On the metabolomics side, why didn't you use the HMDB database for linking metabolites, diseases, variants, genes, etc.?
Why are sepsis and chronic fatigue categorized as side effects rather than diseases?

Finalize PPI sources

Choose which PPI sources to include in the network. Also, partition PPIs into biased and unbiased subsets.

Presence misspelled as presense

As a metaedge kind

Add AEOLUS to hetnet

AEOLUS provides a curated and standardized version of FAERS. It:

removed duplicate case records
applied standardized vocabularies with drug names mapped to RxNorm concepts and outcomes mapped to SNOMED-CT concepts
pre-computed summary statistics about drug-outcome relationships for general consumption

The paper can be accessed at https://doi.org/10.1038/sdata.2016.26

Duplicate edges in Hetionet

Hi Daniel,

Just wanted to note that there are still duplicate edges in hetionet in the newest integrate.ipynb. Specifically, the following two types of relationships give duplicate edge errors when the notebook is run:

Disease-gene differential expression edges

commit = '1a11633b5e0095454453335be82012a9f0f482e4'
url = rawgit('dhimmel', 'stargeo', commit, 'data/diffex.tsv')
stargeo_df = pandas.read_table(url)
# Filter to at most 250 up and 250 down-regulated genes per disease
stargeo_df = stargeo_df.groupby(['slim_id', 'direction']).apply(
    lambda df: df.nsmallest(250, 'p_adjusted')).reset_index(drop=True)
stargeo_df.head(2)

for row in stargeo_df.itertuples():
    source_id = 'Disease', row.slim_id
    target_id = 'Gene', row.entrez_gene_id
    kind = row.direction + 'regulates'
    data = {
        'source': 'STARGEO',
        'log2_fold_change': round(row.log2_fold_change, 5),
        'unbiased': True,
        'license': 'CC0 1.0'
    }
    graph.add_edge(source_id, target_id, kind, 'both', data)

LINCS Compound-gene dysregulation edges

url = rawgit('dhimmel', 'lincs', commit, 'data/consensi/signif/dysreg-drugbank.tsv')
l1000_df = pandas.read_table(url)
l1000_df = l1000_df.query("perturbagen in @compound_df.drugbank_id and entrez_gene_id in @coding_genes")
l1000_df = filter_l1000_df(l1000_df, n=125)
l1000_df.tail(2)

mapper = {'up': 'upregulates', 'down': 'downregulates'}
for row in l1000_df.itertuples():
    source_id = 'Compound', row.perturbagen
    target_id = 'Gene', row.entrez_gene_id
    data = {
        'source': 'LINCS L1000',
        'z_score': round(row.z_score, 3),
        'method': row.status,
        'unbiased': True,
    }
    kind = mapper[row.direction]
    graph.add_edge(source_id, target_id, kind, 'both', data)

Also, is the metapath generation supposed to be exponential with the number of metaedges in the network? I noticed that if I don't include these types of metaedges in the network, but include everything else, then the number of metapaths drops from 1200 to only 130:

['Compound', 'Disease', 'palliates', 'both']
['Compound', 'Gene', 'downregulates', 'both']
['Compound', 'Gene', 'upregulates', 'both']
['Disease', 'Gene', 'downregulates', 'both']
['Disease', 'Gene', 'upregulates', 'both']

The four regulation metaedges were not included due to the edge import errors, and the palliates one because I excluded it for testing purposes.
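One way to locate the rows behind such duplicate-edge errors is to count each (source_id, target_id, kind) combination before calling graph.add_edge. The following is a minimal sketch under that assumption; find_duplicate_edges and the example identifiers are hypothetical and are not part of hetio or the notebook.

```python
from collections import Counter


def find_duplicate_edges(edges):
    """Return the (source_id, target_id, kind) keys that occur more than once."""
    counts = Counter((source_id, target_id, kind) for source_id, target_id, kind in edges)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical edges; the identifiers are placeholders, not real records
edges = [
    (('Disease', 'DOID:1'), ('Gene', 10), 'upregulates'),
    (('Disease', 'DOID:1'), ('Gene', 10), 'upregulates'),
    (('Disease', 'DOID:1'), ('Gene', 11), 'downregulates'),
]
duplicates = find_duplicate_edges(edges)
```

Any key reported here corresponds to rows that would need to be collapsed or dropped in the source table before the import loop runs.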

Remove keys with nan values in the data attribute

These are creating problems for the Neo4j conversion.
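A minimal sketch of such a cleanup, assuming the data attribute is a plain dictionary. The helper name drop_nan_values is hypothetical, and the example values are illustrative only.

```python
import math


def drop_nan_values(data):
    """Return a copy of data without keys whose value is a float NaN."""
    return {
        key: value for key, value in data.items()
        if not (isinstance(value, float) and math.isnan(value))
    }


# Hypothetical edge data with a missing z_score
data = {'source': 'LINCS L1000', 'z_score': float('nan'), 'unbiased': True}
cleaned = drop_nan_values(data)
```

The isinstance check keeps non-numeric values such as strings out of math.isnan, which would otherwise raise a TypeError.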

Incorrect pathway licensing

All pathways are currently labeled CC-BY 3.0; however, only WikiPathways should be CC-BY 3.0.

Relabel disease-symptom relationship name to presence

Currently the metaedge is disease - causation - symptom. Do diseases cause symptoms or vice versa? Presence is a more neutral term.

Connection refused when using hetio/py2neo to export to neo4j

Hi Daniel,

Do you remember what you needed to do to increase the number of concurrent open files allowable by neo4j? At the moment I get a warning that says:

WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Starting Neo4j Server...WARNING: not changing user
process [6305]... waiting for server to be ready..... OK.
http://localhost:7500/ is ready.

Also, I think the way that neo4j is started with a subprocess call in Python is giving me issues, since trying to start neo4j 3.1.1 in the same manner causes the service to shut down immediately, before any Python code can be executed.
