integrate's Introduction

Building hetionet: data integration, hetnet permutation, and Neo4j import

Hetnets are networks with multiple types of nodes and edges. This repository creates Hetionet v1.0, a hetnet encoding biology, disease, and pharmacology. We created Hetionet v1.0 for Project Rephetio, a study to systematically evaluate why drugs work and to predict new therapeutic uses for existing drugs. The study describing Project Rephetio and Hetionet v1.0 is:

Systematic integration of biomedical knowledge prioritizes drugs for repurposing
Daniel S Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio E Baranzini
eLife (2017-09-22) DOI: 10.7554/eLife.26726

Note: this repository is for building Hetionet v1.0. Users who only want to download and use the completed hetnet should get it from the dhimmel/hetionet repository.

Execution

  1. precompile.sh executes notebooks which combine multiple resources into a single type of edge. See the contents of compile for more information.

  2. build.sh builds the hetnet, creates permuted derivatives, and exports the hetnet to Neo4j.

Notebooks

  1. integrate.ipynb creates the hetnet by integrating data stored either in compile or elsewhere on GitHub. All GitHub links use commit hashes to be version specific. The JSON-formatted hetnet is exported to data/hetnet.json.bz2.
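
As a minimal sketch of inspecting the exported file, the bzip2-compressed JSON can be read with the Python standard library alone. The function name load_hetnet_json is hypothetical, and nothing is assumed here about the keys inside the JSON document; the hetio package may offer its own reader, which this sketch does not use.

```python
import bz2
import json


def load_hetnet_json(path):
    """Read a bzip2-compressed JSON file and return the parsed object.

    Hypothetical helper: it only decompresses and parses the file, and
    makes no assumption about the structure of the hetnet inside it.
    """
    with bz2.open(path, 'rt', encoding='utf-8') as read_file:
        return json.load(read_file)
```

For example, load_hetnet_json('data/hetnet.json.bz2') would return the parsed hetnet once integrate.ipynb has written that file.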
  2. permute.ipynb loads the created hetnet and creates permuted derivatives that preserve node degree but destroy edge specificity. The permuted hetnets are written to data/permuted, but are not uploaded due to file size.
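
To illustrate what "preserve node degree but destroy edge specificity" means, here is a generic edge-swap sketch. It is not taken from permute.ipynb or hetio, and the function name, the multiplier parameter, and the restriction to a single edge type whose source and target node types differ are all assumptions made for illustration.

```python
import random


def permute_edges(edges, multiplier=10, seed=0):
    """Shuffle (source, target) pairs while keeping every node's degree.

    Illustrative sketch only: repeatedly picks two edges (a, b) and (c, d)
    and rewires them to (a, d) and (c, b), skipping any swap that would
    recreate an existing edge.
    """
    rng = random.Random(seed)
    edges = list(edges)
    if len(edges) < 2:
        return edges
    edge_set = set(edges)
    for _ in range(multiplier * len(edges)):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new_1, new_2 = (a, d), (c, b)
        # A swap is rejected if either new edge already exists, which also
        # covers the no-op cases where the edges share a source or a target.
        if new_1 in edge_set or new_2 in edge_set:
            continue
        edge_set -= {edges[i], edges[j]}
        edge_set |= {new_1, new_2}
        edges[i], edges[j] = new_1, new_2
    return edges
```

Each accepted swap leaves the number of edges per source and per target unchanged, so degree is preserved while the specific pairings are randomized.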
  3. neo4j-import.ipynb imports the hetnet and its permutations into separate neo4j instances. These neo4j instances are not uploaded due to file size and licensing issues. Currently, neo4j-community-2.3.3 is used.

Components

Environment

The dependencies are listed in environment.yml, which can be installed on Linux using:

conda env create --file=environment.yml

Activate the environment with source activate integrate.

License

All original content in this repository is released as CC0. However, the hetnet integrates data from many resources and users should consider the licensing of each source. We apply a license attribute on a per node and per edge basis for sources with defined licenses. However, some resources don't provide any license, so for those we've requested permission. More information is available on Thinklab. See licenses/README.md for a table of all resources and their licensing.
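
As a rough sketch of how the per-edge license attribute could be audited, the snippet below tallies the license key across edge data dictionaries like the ones passed to graph.add_edge in integrate.ipynb. The function name and the '(none)' placeholder are hypothetical, and how the data dictionaries are pulled out of the graph is left open.

```python
from collections import Counter


def count_licenses(data_dicts, missing='(none)'):
    """Tally the 'license' value over an iterable of edge/node data dicts.

    Hypothetical helper: entries without a 'license' key are counted
    under the `missing` placeholder.
    """
    return Counter(data.get('license', missing) for data in data_dicts)
```

Entries counted under the placeholder would be the ones to check against licenses/README.md.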

integrate's Issues

RAM Requirements

How much RAM is required to run the neo4j-import.ipynb script? On a 32 GB RAM AWS instance, the RAM gets maxed out and the cell is terminated. Based on this post, it seems that your machine had 256 GB of RAM, and I was wondering if it needed that much, or if there was a way to limit RAM usage.

Managing dependencies

How are we supposed to install the hetio package which is part of the repo's dependencies? The environment.yml file has a hard coded path (/home/dhimmels/Documents/github/hetio)==0.1.0.

I tried downloading version 0.1.0 of the hetio package from https://github.com/dhimmel/hetio/releases and changing the hard coded path to the local path instead, but received the following error when trying to run conda env create -f environment.yml:

Extracting packages ...
[      COMPLETE      ]|###################################################| 100%
Linking packages ...
[      COMPLETE      ]|###################################################| 100%
Invalid requirement: 'hetio (/home/ubuntu/hetio)==0.1.0'
It looks like a path. Does it exist ?
CondaValueError: Value error: pip returned an error.

Any ideas? I have also tried installing directly from GitHub instead of using a local path (i.e. with the following line in environment.yml: - "--editable=git+https://github.com/dhimmel/hetio.git#egg=hetio"), but then I get a dependency error with pypandoc instead; see this issue.

Bias in Anatomy–downregulates–Gene and Anatomy–upregulates–Gene edges

Hi Daniel,

If I understand correctly, you used the over-/under-expression information from Bgee to create your Anatomy–downregulates–Gene and Anatomy–upregulates–Gene edges [1]. The values provided by Bgee refer to over-/under-expression across anatomy but also across life stages [2]. However, you use only data from adults and therefore do not record the life stage in your anatomy nodes. In your opinion, could this bias the network, or otherwise affect it or the predictions made from it?

Add AEOLUS to hetnet

AEOLUS provides a curated and standardized version of FAERS. It:

  • removed duplicate case records

  • applied standardized vocabularies with drug names mapped to RxNorm concepts and outcomes mapped to SNOMED-CT concepts

  • pre-computed summary statistics about drug-outcome relationships for general consumption

The paper can be accessed at https://doi.org/10.1038/sdata.2016.26

Duplicate edges in Hetionet

Hi Daniel,

Just wanted to note that there are still duplicate edges in hetionet in the newest integrate.ipynb. Specifically, the following two types of relationships give duplicate edge errors when the notebook is run:

Disease-gene differential expression edges

commit = '1a11633b5e0095454453335be82012a9f0f482e4'
url = rawgit('dhimmel', 'stargeo', commit, 'data/diffex.tsv')
stargeo_df = pandas.read_table(url)
# Filter to at most 250 up and 250 down-regulated genes per disease
stargeo_df = stargeo_df.groupby(['slim_id', 'direction']).apply(
    lambda df: df.nsmallest(250, 'p_adjusted')).reset_index(drop=True)
stargeo_df.head(2)

for row in stargeo_df.itertuples():
    source_id = 'Disease', row.slim_id
    target_id = 'Gene', row.entrez_gene_id
    kind = row.direction + 'regulates'
    data = {
        'source': 'STARGEO',
        'log2_fold_change': round(row.log2_fold_change, 5),
        'unbiased': True,
        'license': 'CC0 1.0'
    }
    graph.add_edge(source_id, target_id, kind, 'both', data)

LINCS Compound-gene dysregulation edges

url = rawgit('dhimmel', 'lincs', commit, 'data/consensi/signif/dysreg-drugbank.tsv')
l1000_df = pandas.read_table(url)
l1000_df = l1000_df.query("perturbagen in @compound_df.drugbank_id and entrez_gene_id in @coding_genes")
l1000_df = filter_l1000_df(l1000_df, n=125)
l1000_df.tail(2)

mapper = {'up': 'upregulates', 'down': 'downregulates'}
for row in l1000_df.itertuples():
    source_id = 'Compound', row.perturbagen
    target_id = 'Gene', row.entrez_gene_id
    data = {
        'source': 'LINCS L1000',
        'z_score': round(row.z_score, 3),
        'method': row.status,
        'unbiased': True,
    }
    kind = mapper[row.direction]
    graph.add_edge(source_id, target_id, kind, 'both', data)
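
One way to locate the offending rows before calling graph.add_edge is to count how often each (source_id, target_id, kind) key occurs. This is a minimal sketch with a hypothetical function name, and it assumes an edge counts as a duplicate when all three parts of the key match; the exact rule hetio applies is not shown in this issue.

```python
from collections import Counter


def find_duplicate_edges(edge_keys):
    """Return {edge_key: count} for every key that appears more than once.

    `edge_keys` is an iterable of hashable keys, for example
    ((source_type, source_id), (target_type, target_id), kind) tuples
    built the same way as in the loops above.
    """
    counts = Counter(edge_keys)
    return {key: n for key, n in counts.items() if n > 1}
```

For the LINCS loop, the keys could be built as (('Compound', row.perturbagen), ('Gene', row.entrez_gene_id), mapper[row.direction]) for each row in l1000_df.itertuples().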

Also, is metapath generation supposed to grow exponentially with the number of metaedges in the network? I noticed that if I leave out the following metaedges but include everything else, the number of metapaths drops from 1200 to only 130:

['Compound', 'Disease', 'palliates', 'both']
['Compound', 'Gene', 'downregulates', 'both']
['Compound', 'Gene', 'upregulates', 'both']
['Disease', 'Gene', 'downregulates', 'both']
['Disease', 'Gene', 'upregulates', 'both']

The four regulation metaedges were not included due to the edge import errors, and the palliates one because I excluded it for testing purposes.
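
To see why the count can change so sharply, the sketch below counts sequences of metaedges that lead from one node type to another, up to a maximum length. It is an illustration only: the function name is hypothetical, every metaedge is treated as traversable in both directions, and hetio's own metapath enumeration may apply rules that this sketch ignores.

```python
from collections import defaultdict


def count_metapaths(metaedges, source, target, max_length):
    """Count metaedge sequences from `source` to `target` of length 1..max_length.

    `metaedges` is a list of (node_type_a, node_type_b, kind) tuples, each
    usable in both directions. Illustrative sketch, not hetio's method.
    """
    neighbors = defaultdict(list)  # node type -> one entry per usable metaedge
    for a, b, _kind in metaedges:
        neighbors[a].append(b)
        if a != b:
            neighbors[b].append(a)
    walks = {source: 1}  # number of sequences ending at each node type
    total = 0
    for _ in range(max_length):
        step = defaultdict(int)
        for node_type, n in walks.items():
            for nxt in neighbors[node_type]:
                step[nxt] += n
        walks = step
        total += walks.get(target, 0)
    return total
```

Because each extra metaedge multiplies the number of ways to extend every partial sequence passing through its node types, the total grows combinatorially with both the number of metaedges and the maximum length.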

Connection refused when using hetio/py2neo to export to neo4j

Hi Daniel,

Do you remember what you needed to do to increase the number of concurrent open files allowable by neo4j? At the moment I get a warning that says:

WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Starting Neo4j Server...WARNING: not changing user
process [6305]... waiting for server to be ready..... OK.
http://localhost:7500/ is ready.

Also, I think the way that neo4j is started with a subprocess call in Python is giving me issues, since trying to start neo4j 3.1.1 in the same manner causes the service to shut down immediately, before any Python code can be executed.
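
Regarding the open-files warning, one possible approach on Unix is to raise the soft limit of the Python process before it launches neo4j, since child processes inherit resource limits from their parent. This is a sketch using the standard resource module; the function name is hypothetical, the 40000 target is taken from the warning text, and whether this resolves the problem for a given neo4j version is not confirmed here.

```python
import resource

RECOMMENDED_OPEN_FILES = 40000  # minimum quoted in the Neo4j warning


def raise_open_file_limit(target=RECOMMENDED_OPEN_FILES):
    """Try to raise this process's soft open-file limit toward `target`.

    The soft limit cannot exceed the hard limit without extra privileges,
    so the target is capped at the hard limit. Returns the (soft, hard)
    limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard  # already high enough
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        pass  # the platform refused the new limit; keep the old one
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

If the hard limit itself is below 40000, it has to be raised at the system level (for example by an administrator), which is outside what this sketch can do.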