
connectivity-search-analyses's Introduction

Hetnet connectivity search prototyping and data repository


Connectivity Search (formerly called Hetmech for hetnet mechanisms) is a project to extract mechanistic connections between nodes in hetnets. The project aims to identify the relevant network connections between query nodes. The method is designed to operate on hetnets (networks with multiple node or relationship types).

Note: the hetmech python package has been renamed to hetmatpy and relocated to hetio/hetmatpy. This repository is now used as a historical archive, as well as a dataset storage, method prototyping, and exploratory data analysis repository.

Many findings from this repository are described in the Connectivity Search Manuscript. The manuscript source code is available in greenelab/connectivity-search-manuscript.

Environment

This repository uses conda to manage its environment as specified in environment.yml. Install the environment with:

# install new hetmech environment
conda env create --file=environment.yml

# update existing hetmech environment
conda env update --file=environment.yml

Then use conda activate hetmech and conda deactivate to activate or deactivate the environment.

Note that the environment is tested with the conda channel_priority strict configuration. Locally, you can run the following commands to configure conda (as per https://conda-forge.org docs), but note that it affects your conda config beyond this environment:

conda config --add channels conda-forge
conda config --set channel_priority strict

Another option is to install conda with miniforge.

Acknowledgments

This work was supported through a research collaboration with Pfizer Worldwide Research and Development. This work was funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grants GBMF4552 to Casey Greene and GBMF4560 to Blair Sullivan.

connectivity-search-analyses's People

Contributors

ben-heil, dhimmel, gwaybio, kkloste, naglem, yhao-compbio, zietzm


connectivity-search-analyses's Issues

Alternative graph data structures

We are considering creating another base representation of hetnets. One of the main goals is to enable faster network loading, which currently takes over a minute and a half.

The following are under consideration:

  • Mambo, which relies on SNAP.
  • GraphFrames, a graph package for Apache Spark. Advantages of this framework are that metadata and multiple edge types are supported in a logical way, that Spark scales well should we eventually make hetmech a back-end server application, and that the data structure is widely used. However, we would have to re-implement the DWPC in a way that either computes exact path-level results (each path and its corresponding DWPC) more slowly than the current matrix implementation or computes path-level DWPC only.
  • HetMat, which would be an internally developed, matrix-first representation of the network. This option would allow us to store the entire Hetionet v1.0 and five permutations in about 30 MB as scipy sparse matrices in .npz format. Moreover, unlike the other options, we would likely not have to change more than a few functions in order to load on-disk adjacency matrices.

Matrixfy normalization operations

Currently, we're doing column-by-column and row-by-row math in dual_normalize.

Should we do more vectorized operations and use something like nan_to_num to deal with division by zero?

From my perspective, this decision should be primarily based on memory and computational efficiency.
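
For illustration, here is a minimal sketch of what a fully vectorized version could look like, assuming a single damping exponent applied to both row and column degrees (the function name and signature are hypothetical, not the repository's dual_normalize):

import numpy

def dual_normalize_vectorized(matrix, damping=0.5):
    # Vectorized sketch: divide each entry by (row degree ** damping) *
    # (column degree ** damping), then zero out the 0/0 cases with nan_to_num.
    # Illustration only; not the repository's dual_normalize implementation.
    matrix = numpy.asarray(matrix, dtype=float)
    row_degree = matrix.sum(axis=1, keepdims=True)
    col_degree = matrix.sum(axis=0, keepdims=True)
    with numpy.errstate(divide='ignore', invalid='ignore'):
        normalized = matrix / (row_degree ** damping * col_degree ** damping)
    return numpy.nan_to_num(normalized)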

Is numpy.linalg.norm useful for the normalization step?

@kkloste I came across this function: numpy.linalg.norm, whose doc states:

This function is able to return one of eight different matrix norms, or one of an infinite number of vector norms (described below), depending on the value of the ord parameter.

Not sure whether this is helpful. Seems related to sklearn's normalize we had been using. Close this issue if it's not relevant for us.

Disk storage of DWPC matrices

Refs #74

We would like to cache the DWPC matrices on disk, with the ability to read them back rapidly. I looked into several different storage methods. HDF5 has a (somewhat) nice Python wrapper called PyTables, which seems useful. In addition, there is a native NumPy/SciPy saving format, .npy/.npz. In a quick notebook example I did earlier today, these two methods read back matrices at comparable speeds (~4.3 ms). With this in mind, I will do some more rigorous speed tests in the next few days, and I will use this issue to organize updates.
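
As a sketch of the .npz option, here is a hypothetical round trip for one sparse DWPC matrix (the shape, density, and filename below are made up for illustration):

import scipy.sparse

# make up a sparse matrix, save it, and read it back
dwpc = scipy.sparse.random(1552, 137, density=0.05, format='csc')
scipy.sparse.save_npz('CbGaD.sparse.npz', dwpc)
loaded = scipy.sparse.load_npz('CbGaD.sparse.npz')
assert (dwpc != loaded).nnz == 0  # round trip preserves every entry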

Are informative DWPCs greater than permuted DWPCs?

We've had the hypothesis that DWPCs measuring truly connected nodes will be higher than P-DWPCs (mean permuted DWPCs). However, @zietzm and I were looking yesterday at P-DWPC and Z-DWPC distributions for the real network and for a permuted network (i.e. permuted compared to permuted), and the distributions looked incredibly similar.

Some discussions from Project Rephetio that may have wisdom from the past:

One issue is that we don't have a great understanding of DWPCs and permuted DWPCs. I think we should start by exploring DWPCs and how they compare to permuted DWPCs. In what ways are DWPCs from a permuted network different from those of the unmodified network?

Caching and bulk computation of metapaths

If we are doing bulk computations of metapaths, we will want to take full advantage of our developing caching functionality. During the computation of most DWPCs, the metapath will be segmented and the sub-problems computed first. We do not want to cache metapaths or metapath segments that appear only once in the full list of segments. Excluding these single-appearance segments removes 78% of all segments (regardless of whether we look at max_length=4 or max_length=5).

Another thing to think about could be retrieval vs transpose time for some segments. For example, the first and sixth most common segments are GdD and DdG. If transposing is much slower than retrieving, maybe we want to cache both the GdD and DdG matrices. But for sparse matrices, transposing may be quite quick, in which case we would prefer storing one direction only.

I have rank-ordered all segments by their frequency of appearance. An interesting thing to note is that nearly one-third of the segments that would be included (meaning they appear more than once) appear fewer than ten times, while only ten percent appear more than one hundred times. Basically, the distribution of frequencies is extremely biased toward small numbers with an incredibly long tail. The most frequent segments occur nearly 30,000 times, while only 1% occur even 1,000 times. Thus, even small per-segment performance gains will be quite noticeable for the most frequent segments, while gains for the rest may be much less visible.
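
A toy sketch of how single-appearance segments could be excluded, assuming a hypothetical mapping from metapaths to their segments (the segment strings below are made up):

import collections
import itertools

# count how often each segment appears across all metapaths, then keep only
# segments that appear more than once as candidates for caching
segments_by_metapath = {
    'CbGaD': ['CbG', 'GaD'],
    'CbGiGaD': ['CbG', 'GiG', 'GaD'],
    'CrCbGaD': ['CrC', 'CbG', 'GaD'],
}
counts = collections.Counter(
    itertools.chain.from_iterable(segments_by_metapath.values())
)
cacheable = {segment for segment, n in counts.items() if n > 1}
# cacheable == {'CbG', 'GaD'}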

Fix error when importing hetmech.matrix

@gwaygenomics made the following observation:

Importing hetmech.matrix without previously importing hetmech.hetmat gives an ImportError

In the following image, I ran the first cell, then restarted the kernel and ran only the second cell. As can be seen below, importing hetmech.matrix alone or before hetmech.hetmat gives this error.

(screenshot: ImportError traceback when importing hetmech.matrix first)

Hetnet permutation via adjacency matrices

In the past, we've permuted hetnets via the hetio package. We have also explored a Cypher implementation. The method we use is called XSwap, as described in:

Randomization Techniques for Graphs
Sami Hanhijärvi, Gemma C. Garriga, Kai Puolamäki (2009) Proceedings of the 2009 SIAM International Conference on Data Mining. doi:10.1137/1.9781611972795.67

XSwap randomly selects pairs of edges and attempts to switch their endpoints. The result is a network where degree is preserved but relationships are randomized. To extend this concept to hetnets, we permute each relationship type separately. So I think we could implement hetnet permutation on the adjacency matrices. This may be a bit faster than the hetio implementation.

I believe this paper describes the edge swap technique for matrices:

Rao AR, Jana R, Bandyopadhyay S (1996) A Markov Chain Monte Carlo Method for Generating Random (0, 1)-Matrices with Given Marginals. Sankhyā: The Indian Journal of Statistics, Series A 58: 225–242. https://www.jstor.org/stable/25051102

Anyways, @kkloste do you think there's an easy matrix implementation of XSwap?
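
In the meantime, here is a minimal edge-list sketch of the XSwap idea (an illustration, not the hetio.permute implementation):

import random

def xswap_edges(edges, n_attempts, seed=0):
    # Each attempt picks two edges (a, b) and (c, d) and rewires them to
    # (a, d) and (c, b); swaps that would duplicate an existing edge are
    # rejected, so node degrees are preserved while endpoints are shuffled.
    # Self-loop checks would also be needed for metaedges whose source and
    # target node types are the same (e.g. GiG).
    rng = random.Random(seed)
    edges = list(edges)
    edge_set = set(edges)
    for _ in range(n_attempts):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if a == c or b == d:
            continue  # swap would leave the edge multiset unchanged
        if (a, d) in edge_set or (c, b) in edge_set:
            continue  # swap would create a duplicate edge
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

# toy usage with made-up compound-gene edges
permuted = xswap_edges([('C1', 'G1'), ('C1', 'G2'), ('C2', 'G2'), ('C3', 'G3')], n_attempts=100)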

See also (related literature):

Memory leak in bulk computation of permuted DWPCs

In #140 / b882476, we specified computing degree-grouped permutation stats for 200 permuted hetnets. However, the computation died on the 99th iteration without any error message, so I suspected the process was killed due to excessive memory consumption. I reran the bulk notebook while supervising memory usage, and within a day or two the process was consuming 50 GB of RAM and counting.

Our cache sizes for path count matrices are set at 16 GB, so max memory usage shouldn't exceed 20 GB (4 GB is a generous estimate for the other objects that must be stored). Hence, it seems that garbage collection is not working as expected, or that we are not properly clearing references to discarded files.

I stopped the notebook with the growing leak, with its objects still in memory, and then ran:

from pympler import muppy, summary
all_objects = muppy.get_objects()

Running these commands caused memory consumption to drop:

(screenshot: memory consumption dropping after running the pympler commands)

Still not sure what to make of this clue.

Are our adjacency matrices transposed?

Currently, our adjacency matrices for metaedges encode source nodes as columns and target nodes as rows. I'm starting to think this is backwards. Two observations:

  1. To access a specific value, we must use matrix[target_pos, source_pos]. It's more natural to specify the source and then the target.

  2. In the proposed DWWC implementation, we have to do dwwc_matrix = adj_mat @ dwwc_matrix rather than dwwc_matrix = dwwc_matrix @ adj_mat. The latter is preferable in that we could use functools.reduce (see the sketch below).
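
A toy sketch of the source-rows orientation (the shapes below are made up, and degree weighting is omitted, so this computes plain walk counts rather than DWWCs):

import functools
import numpy

# With matrix[source_pos, target_pos], a walk count along a metapath is just a
# left-to-right chain of matrix products.
CbG = numpy.random.rand(5, 8)   # Compound x Gene
GiG = numpy.random.rand(8, 8)   # Gene x Gene
GaD = numpy.random.rand(8, 3)   # Gene x Disease
walk_counts = functools.reduce(numpy.matmul, [CbG, GiG, GaD])  # Compound x Disease
# With the current target-rows orientation, the chain must instead be
# accumulated as dwwc_matrix = adj_mat @ dwwc_matrix at each step, which
# functools.reduce cannot express directly.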

@kkloste what do you think?

Graph algorithms for HetNets

Hi all,

Since you are working with heterogeneous information networks, I think it is worth pointing out a couple of works that might help you in this regard.

  • Some work on metapaths, which are abstract descriptions of the paths surrounding a specific node.
    They are useful for comparing nodes in the graph.

Meng, C., Cheng, R., Maniu, S., Senellart, P., & Zhang, W. (2015). Discovering Meta-Paths in Large Heterogeneous Information Networks (pp. 754–764). Presented at the 24th International Conference, New York, New York, USA: ACM Press. http://doi.org/10.1145/2736277.2741123

Sun, Y., Han, J., Yan, X., Yu, P. S., & Wu, T. (2011). PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks. PVLDB.

  • Graph embeddings: project the graph into a multi-dimensional space and perform similarity operations in that space (e.g., k-means):

Chen, Y., & Wang, C. (2017). HINE: Heterogeneous Information Network Embedding. In Database Systems for Advanced Applications (Vol. 10177, pp. 180–195). Cham: Springer International Publishing. http://doi.org/10.1007/978-3-319-55753-3_12

metapath2vec: https://www.youtube.com/watch?v=t9yujaCpX_E

  • Some work on node classification:

http://www.cs.cmu.edu/~deswaran/papers/vldb17-zoobp.pdf

These are just a few, but if you tell me what you need there might be more customized ideas out there!

Regards,
Davide

two-repeated-nodes can be deleted

I'm still figuring out how to handle multiple forks, branches, setting upstream, etc.
The branch two-repeated-nodes was intended to live on my fork, not here.

I've got all the changes in kkloste:multi-duplicate, so greenelab:two-repeated-nodes can be deleted.
@dhimmel

Archival deposits for HetMat archives and bulk DWPCs

We've now completed computing DWPCs and their corresponding DGP null distributions for all metapaths up to length 3 in Hetionet v1.0. We're working on building a database with this information in https://github.com/greenelab/hetmech-backend, so now is a good time to think about archival locations for our datasets.

HetMat-formatted hetnets

We invented the hetmat format for storing hetnets as on-disk matrices. I opened a PR to add these to the Hetionet GitHub repo at hetio/hetionet#11. This repo is where we've stored Hetionet in the past, so it's the obvious place to put additional network files.

DWPC files

The DWPC files created in #142 are large (slightly under 200 GB in total). After we contacted Zenodo, they were willing to increase our quota for this upload on the condition that we cite it in a peer-reviewed publication. I will post a draft of the Zenodo upload in the next comment.

Cypher queries to explore candidate node pairs

@kkloste has been generating node pairs with manageable numbers of paths for a tailored follow-up. For example, here are gene-compound pairs with one path.

Here are some Cypher queries to investigate candidate pairs. You can run them in the Hetionet Browser at https://neo4j.het.io

Find all paths up to length 2 between the specified compound and gene

MATCH path = (source:Compound)-[*..2]-(target:Gene)
WHERE source.identifier = 'DB00736'
AND target.identifier = 100130958
RETURN path

Find all paths as above but report the number of paths by type

MATCH path = (source:Compound)-[*..2]-(target:Gene)
WHERE source.identifier = 'DB00736'
AND target.identifier = 100130958
WITH
  extract(node in nodes(path)| head(labels(node))) AS node_types,
  extract(rel in relationships(path)| type(rel)) AS rel_types
RETURN node_types, rel_types, count(*) AS count

I'll update this discussion with additional queries that may be useful.

Analyses that depend on path counts now filtered from the database

In #173, I began touching up some of the rephetio epilepsy predictions for the manuscript. Specifically, I'm most interested in including visualizations from:

Those notebooks connected to a legacy database we had hosted on a Penn workstation that is either no longer online or firewalled. In 60f4826, I switched over to using the production database.

However, the production database does not include all nonzero path counts, since it filters based on a p-value threshold to save database storage (see Prioritizing enriched metapaths for database storage in the draft manuscript). I believe the notebooks above were developed by @ben-heil against a database that did not filter any rows from the path_counts table (a database populated before greenelab/connectivity-search-backend#41). @ben-heil does that sound correct?

Problem with environment installing neo4j, hetio

I created the environment via conda, as per the instructions in the README, but this failed to install neo4j-driver for some reason:

(screenshot: conda error while installing neo4j-driver)

so I manually installed neo4j-driver via pip install neo4j-driver, but then discovered that hetio was not properly installed:
(screenshot: ImportError for hetio)

These error messages occurred when I was definitely using the hetmech environment (i.e. via source activate hetmech, as per the README).

Archive creation does not support pathlib paths

When feeding paths into the HetMat archiving functions, it would be very helpful if pathlib paths were supported inputs, as opposed to just strings.

all_paths_1 = hetmat.metagraph.extract_all_metapaths(1)
paths_1 = []
for path in all_paths_1:
    path = pathlib.Path(f'path-counts/dwpc-0.5/{path}.sparse.npz')
    if path.exists():
        paths_1.append(path)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-31-49cc11dbbf57> in <module>()
      1 hetmech.hetmat.archive.create_archive_by_globs(
----> 2     'dwpc-0.5-len-2.zip', '.', include_globs=paths_2)

~/Documents/hetmech/hetmech/hetmat/archive.py in create_archive_by_globs(destination_path, root_directory, include_globs, exclude_globs, include_paths, **kwargs)
     35     source_paths = set()
     36     for glob in include_globs:
---> 37         source_paths |= set(root_directory.glob(glob))
     38     for glob in exclude_globs:
     39         source_paths -= set(root_directory.glob(glob))

~/.conda/envs/hetmech/lib/python3.6/pathlib.py in glob(self, pattern)
   1072             raise ValueError("Unacceptable pattern: {!r}".format(pattern))
   1073         pattern = self._flavour.casefold(pattern)
-> 1074         drv, root, pattern_parts = self._flavour.parse_parts((pattern,))
   1075         if drv or root:
   1076             raise NotImplementedError("Non-relative patterns are unsupported")

~/.conda/envs/hetmech/lib/python3.6/pathlib.py in parse_parts(self, parts)
     60             if altsep:
     61                 part = part.replace(altsep, sep)
---> 62             drv, root, rel = self.splitroot(part)
     63             if sep in rel:
     64                 for x in reversed(rel.split(sep)):

~/.conda/envs/hetmech/lib/python3.6/pathlib.py in splitroot(self, part, sep)
    281 
    282     def splitroot(self, part, sep=sep):
--> 283         if part and part[0] == sep:
    284             stripped_part = part.lstrip(sep)
    285             # According to POSIX path resolution:

TypeError: 'PosixPath' object does not support indexing

Whereas simply changing the append to be a string works correctly.

all_paths_1 = hetmat.metagraph.extract_all_metapaths(1)
paths_1 = []
for path in all_paths_1:
    path = pathlib.Path(f'path-counts/dwpc-0.5/{path}.sparse.npz')
    if path.exists():
        paths_1.append(str(path))
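
A sketch of a possible library-side fix, hypothetical rather than the actual patch: coerce each pattern to str before calling Path.glob(), so both strings and pathlib.Path inputs work.

import pathlib

def collect_source_paths(root_directory, include_globs, exclude_globs=()):
    # Path.glob() only accepts string patterns, so convert each pattern to str,
    # letting callers pass either strings or pathlib.Path objects.
    root_directory = pathlib.Path(root_directory)
    source_paths = set()
    for glob in include_globs:
        source_paths |= set(root_directory.glob(str(glob)))
    for glob in exclude_globs:
        source_paths -= set(root_directory.glob(str(glob)))
    return source_paths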

Formatting the hetnet small-path information

@dhimmel I am trying to determine the most useful format for the output of the path-counting process.
We can quickly compute the number of length-k paths between any two nodes for k = 2, 3 (even for all pairs), but I'm not sure how best to organize the output so that you can use it.

Do you want a pair of dictionaries where the keys are node pairs and the values are the number of length-k paths, with one dictionary for k = 2 and the other for k = 3?

Or do you simply want a list of the 100 pairs of nodes with the smallest nonzero number of length-3 paths between them?
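
A sketch of the first option; the node identifiers reuse the compound/gene examples from the Cypher queries above, and the counts are made up:

# one dictionary per path length, keyed by (source, target) node pair
length_2_path_counts = {('DB00736', 100130958): 1, ('DB00736', 5468): 4}
length_3_path_counts = {('DB00736', 100130958): 7, ('DB00736', 5468): 31}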

Simultaneous query of multiple nodes

Add functionality in hetmech to query a set of nodes in order to return a ranked list of connection predictions along with the corresponding metapaths.

For example, if a set of genes were queried, we would want output of the form:

Predictions

Rank  End Node    Metapath   DWPC  p-DWPC  r-DWPC
1     Compound A  CbGpPpG    ...   ...     ...
2     Disease B   DtCcCcCbG  ...   ...     ...
3     Anatomy C   AlDtC<rG   ...   ...     ...
...   ...         ...        ...   ...     ...

In order to reduce computation time, it may be useful to cache DWPC matrices so that a set of query nodes (given as a vector) can be queried almost instantly (order ~ 100 microseconds).

It should be noted that after work done in #54 and #43 to add sparse matrix functionality, the computation time for DWPC over all 752 compatible Rephetio metapaths has been reduced from a total of 6.5 hours to 48 minutes! In fact, the longest computation time for DWPC over a single metapath is now around 35 seconds, while the average time is about 3.9 seconds (see below).

Caching the DWPC matrices for the 752 Rephetio metapaths using scipy.io.savemat saves 752 sparse matrices as a .mat file. The file size for all these matrices is 461 MB.


The histogram below shows the distribution of DWPC times over the 752 metapaths.
(histogram: DWPC computation times over the 752 metapaths)

Add pqtl_entrez_id column to pqtl.tsv

Follows up on #81 by @naglem.

IIRC, some of the in-progress versions of pqtl.tsv from #81 had a column titled pqtl_entrez_id, which contained the genes from pqtl_gene converted to Entrez GeneIDs. It looks like this didn't make it into the merged pqtl.tsv.

@naglem would you be able to re-add the pqtl_entrez_id column? Would make the analysis a tad bit easier.

Switch to sparse matrices

@kkloste's calculations suggest that our large matrices are sparse.

Switching to a sparse encoding could reduce the Gene by Biological Process matrix from ~1000 MB in RAM to ~15 MB in RAM.
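
A toy illustration of the memory argument (the shape and density below are made up, not the actual Gene by Biological Process dimensions):

import numpy
import scipy.sparse

# build a small, mostly zero 0/1 matrix and compare memory footprints
dense = (numpy.random.rand(2000, 1100) < 0.002).astype(float)
sparse = scipy.sparse.csr_matrix(dense)
dense_mb = dense.nbytes / 1e6
sparse_mb = (sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes) / 1e6
print(f'dense: {dense_mb:.1f} MB, sparse: {sparse_mb:.2f} MB')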

P_values of zero in the hetmech database

SELECT * FROM dj_hetmech_app_pathcount WHERE source_id = 14421 AND target_id = 14792;

id path_count dwpc p_value metapath_id source_id target_id dgp_id
14927 2 4.22700018268129 0.0378205049612811 CrCtD 14421 14792 20495
31357 90 3.33906031212536 0.0125763977488916 CrCbGaD 14421 14792 163888
59342 1 2.67809412135379 0.0166159667835124 CbGaDrD 14421 14792 40803
84589 8 3.6438294986658 0.048865029398301 CbGaD 14421 14792 2841
127899 33 3.43243143754024 0.0119735568763205 CbGiGaD 14421 14792 55155
316012 1294 3.50943383023676 0 CcSEcCtD 14421 14792 82209

SELECT * FROM dj_hetmech_app_degreegroupedpermutation WHERE id = 82209;

id source_degree target_degree n_dwpcs n_nonzero_dwpcs nonzero_mean nonzero_sd metapath_id
82209 115 25 800 800 2.29988868478163 0.125316938712694 CcSEcCtD

We could do the math to calculate the log of the p-value from the start, or we could use the decimal module built into Python to get arbitrary-precision floats and then take the log of that.

In playing around with built-in floats and Decimal floats, the built-in floats underflow to zero somewhere around 1/2^1070, while Decimal was able to represent anything, though it slowed down around 1/2^500000.
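
A minimal illustration of the underflow behavior and the Decimal workaround:

from decimal import Decimal

print(2.0 ** -1080)                 # 0.0 -- underflows past the smallest subnormal float
print(Decimal(2) ** -1080)          # a tiny but nonzero Decimal
print((Decimal(2) ** -1080).ln())   # the natural log is still representable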

@zietzm @dhimmel do you guys have a preference?

Next Steps

@dhimmel I have been thinking about what we discussed regarding the unsupervised approaches we could use in the future. If I understand roughly correctly, we want to implement a system that determines which metapaths show significant differences from random (permuted) networks, doing this for all metapaths (of a certain length) in the graph.

It seems like this should be a two-step process, starting with computing DWPC matrices for all the metapaths of a certain length, and doing this a certain number of times, on various permuted networks. All this information would need to be stored for the next step.

This is where I reach the edge of my current knowledge. We will have several matrices for each metapath, with one as the actual values and the others as "controls". I'm not sure how we would go about comparing these matrices with one another. We want to do this without having the actual data, as was done with Rephetio.

Where should I start? Do you think that hetmech is the best place for this work, or should we make a new repository for the implementation part of this project? Should the matrices be stored on GitHub?

Introduce functionality for sparse matrices in diffusion and degree_weight

Sub-issue of #13.
Add support for scipy.sparse matrices in diffusion.py and degree_weight.py. For now, the matrix type will be selected via a function parameter.
matrix.normalize can now output a sparse matrix, meaning that the only dense intermediates should be one-dimensional. Support allowing all matrices to be sparse.

Optimize matrix chain multiplication

@kkloste brought up that the order in which chained matrix multiplications are executed affects efficiency: #49 (comment). It turns out there's a Wikipedia article on this topic.

Numpy has an implementation in numpy.linalg.multi_dot. If we dig through the source code, there's a function _multi_dot_matrix_chain_order, which we could use to compute the optimal ordering.

We have to contend with the presence of mixed numpy and scipy.sparse arrays. Also, we may want to generate the ordering just from the shapes of the matrices, so we don't need all matrices in memory at the same time.
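
For reference, the chain ordering can be computed from shapes alone with the classic dynamic program; below is a sketch of that idea (not a call into numpy's private helper), with example shapes chosen for illustration:

def matrix_chain_order(shapes):
    # shapes: list of (rows, cols) for each matrix in the chain.
    # Returns the split table, where split[i][j] is the index at which to
    # divide the product of matrices i..j for minimal scalar multiplications.
    dims = [shapes[0][0]] + [cols for _, cols in shapes]
    n = len(shapes)
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float('inf')
            for k in range(i, j):
                q = cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                if q < cost[i][j]:
                    cost[i][j] = q
                    split[i][j] = k
    return split

order = matrix_chain_order([(1552, 20945), (20945, 20945), (20945, 137)])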

DWPC degree effect analyses for manuscript

Some analyses that would be helpful for the manuscript:

  1. Show how DWPC measures are affected by source/target node degree.
  2. Show that the null distributions can remove these effects.
  3. Show that corrected DWPCs (either p-values or some other metric) provide superior prediction in certain domains (where avoiding recapitulating study bias is critical).

Using hetmech outside repository

Hi @dhimmel - thanks for putting this repo and analysis together.

I am looking to repurpose 8.gene-set-search.ipynb for alternative analyses. This notebook requires functions defined in modules inside the hetmech folder. How would you recommend using these methods?

For instance,

from hetmech.degree_weight import dwpc
from hetmech.matrix import get_node_to_position

fails anywhere outside of this repository.
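
One workaround, assuming you have a local clone of this repository (the path below is hypothetical): add the repository root to sys.path before importing.

import sys

# point this at the cloned repo root so the hetmech package resolves
sys.path.insert(0, '/path/to/hetmech')

from hetmech.degree_weight import dwpc
from hetmech.matrix import get_node_to_position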

Install xlrd >= 0.9.0 for Excel support

Encountered in 8.gene-set-search

bp_df = (
    pandas.read_excel(url, skiprows=[0, 2])
    .rename(columns={
        'EntrezGeneID_FHS': 'entrez_gene_id',
    })
    .dropna(subset=['entrez_gene_id'])
    .drop_duplicates(subset=['entrez_gene_id'])
    .query("BP_sixCohort_meta_p < 0.001")
    [['entrez_gene_id', 'BP_sixCohort_meta_TE', 'BP_sixCohort_meta_p']]
)

May want to update environment.yml

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~/anaconda3/envs/hetmech/lib/python3.6/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    260         try:
--> 261             import xlrd
    262         except ImportError:

ModuleNotFoundError: No module named 'xlrd'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-6-f805b7efd0cc> in <module>()
      2 url = 'https://doi.org/10.1371/journal.pgen.1005035.s006'
      3 bp_df = (
----> 4     pandas.read_excel(url, skiprows=[0, 2])
      5     .rename(columns={
      6         'EntrezGeneID_FHS': 'entrez_gene_id',

~/anaconda3/envs/hetmech/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~/anaconda3/envs/hetmech/lib/python3.6/site-packages/pandas/io/excel.py in read_excel(io, sheet_name, header, skiprows, skip_footer, index_col, names, usecols, parse_dates, date_parser, na_values, thousands, convert_float, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    228 
    229     if not isinstance(io, ExcelFile):
--> 230         io = ExcelFile(io, engine=engine)
    231 
    232     return io._parse_excel(

~/anaconda3/envs/hetmech/lib/python3.6/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    261             import xlrd
    262         except ImportError:
--> 263             raise ImportError(err_msg)
    264         else:
    265             ver = tuple(map(int, xlrd.__VERSION__.split(".")[:2]))

Alternatives to XSwap for hetnet permutation

We had a hetmech conference call today with @bdsullivan @cgreene @zietzm @kkloste and Drew van der Poel. @bdsullivan brought up the possibility of using methods other than XSwap to create permuted hetnets. In the past, we've discussed implementations of XSwap in #46, but we have not evaluated completely different methods of generating permuted networks.

Here is the description of our current method from the Project Rephetio manuscript:

From Hetionet, we derived five permuted hetnets (Himmelstein, 2016b). The permutations preserve node degree but eliminate edge specificity by employing an algorithm called XSwap to randomly swap edges (Hanhijärvi et al., 2009). To extend XSwap to hetnets (Himmelstein and Baranzini, 2015a), we permuted each metaedge separately, so that edges were only swapped with other edges of the same type. We adopted a Markov chain approach, whereby the first permuted hetnet was generated from Hetionet v1.0, the second permuted hetnet was generated from the first, and so on. For each metaedge, we assessed the percent of edges unchanged as the algorithm progressed to ensure that a sufficient number of swaps had been performed to randomize the network (Himmelstein, 2016b). Permuted hetnets are useful for computing the baseline performance of meaningless edges while preserving node degree (Himmelstein, 2015l).

Currently, our HetMat data structure has a function for generating permuted hetnets using XSwap, which delegates to hetmech.matrix.permute_matrix, which delegates to hetio.permute.permute_pair_list. We've now generated 200 permuted hetnets derived from Hetionet v1.0.

@bdsullivan mentioned some other methods for generating permuted graphs. My notes include the random-graph model, the configuration model, and Chung-Lu. So if anyone wants to suggest an alternative and explain how its output would differ from XSwap's, that would be a good starting point.

Gamma Hurdle DWPC model

DWPCs across multiple permutations for a single source, target node combination have a zero-inflated distribution.
(figure: zero-inflated distribution of permuted DWPCs)

Meanwhile, the nonzero values follow roughly a gamma distribution
(figure: roughly gamma-shaped distribution of the nonzero DWPCs)

The distributions of DWPC values are similar for a single source, target degree combination along a single metapath over all permutations. Below, orange is the distribution of nonzero values, blue is the distribution of values including zeros.

(figures: DWPC distributions with and without zeros for a single source, target degree combination)

Full pdf

In the past, we have modeled the distribution of permuted DWPC values for a single source, target combination across permutations as a simple normal distribution with the mean and variance calculated from the various permuted values. For example, in the plot above, we show the distribution of permuted DWPC values for the source, target degree combination (448, 6) across 25 permutations.

In general, the gamma hurdle model will more precisely fit the nonzero data than a cut-off normal distribution.
(figure: gamma hurdle fit versus a cut-off normal fit to the nonzero values)
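
A sketch of what a gamma hurdle fit could look like, using simulated values rather than real permuted DWPCs (an illustration, not the repository's implementation):

import numpy
import scipy.stats

# `dwpcs` stands in for the permuted DWPC values of one source-degree,
# target-degree group; here they are simulated, zero-inflated gamma values.
rng = numpy.random.RandomState(0)
dwpcs = numpy.concatenate([numpy.zeros(300), rng.gamma(shape=5, scale=0.5, size=700)])

p_nonzero = (dwpcs > 0).mean()
shape, loc, scale = scipy.stats.gamma.fit(dwpcs[dwpcs > 0], floc=0)

def hurdle_pvalue(observed_dwpc):
    # P(permuted DWPC >= observed) for observed > 0 under the hurdle model:
    # probability of a nonzero value times the gamma survival function.
    return p_nonzero * scipy.stats.gamma.sf(observed_dwpc, shape, loc=loc, scale=scale)

print(hurdle_pvalue(4.0))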

Coordinating matrix DWPC implementation with Scripps folks

Several individuals in the Su Lab at Scripps Research Institute, specifically @veleritas, @NuriaQueralt and @mmayers12, are working on extensions to Rephetio and Hetionet. We're going to catch up on a call Wednesday July 12, 2017 from 1-2 PM. As an aside, @cgreene @zietzm @danich1 @kkloste, feel free to join this call, even if just as background audio. Since we want to focus the call on more high level planning, @mmayers12 suggested we take a look at his work on implementing matrix DWPC beforehand.

You can see the current progress in the mmayers12/hetnet-ml repo and specifically in matrix_tools.py. It looks like @mmayers12 has arrived at several of the same decisions and algorithms we've implemented here, for example the DWWC and DWPC distinction.

For reference, the work in hetmech currently consists of three primary contributors:

  • @kkloste, an algebraic mathematician from North Carolina State University. Kyle helps us with figuring out how to do things using matrices.
  • @zietzm, who's interning in the Greene Lab this summer, does much of the Python implementation. He's now the most knowledgeable about matrices in Python, and has a good handle on the scipy.sparse API.
  • @dhimmel (me), whose main involvement is with the application of these methods to hetnets and strategic directions.

It would be great if @mmayers12, @kkloste, @zietzm and I could all get on the same page regarding how the mmayers12/hetnet-ml and greenelab/hetmech implementation differ. My hope is that @mmayers12 will be able to contribute his advances to this repo. Specifically, we should compare our matrix-DWPC algorithms which are discussed specifically in #52, #47, #45 (merged), and #20 (outdated). Our current implementation is in degree_weight.py. In short, @kkloste has figured out how to exclude duplicate-node paths for some metapaths but not all.

There is also the larger question of whether DWPC is even needed, or whether DWWC is sufficient as comparison to permuted-DWWCs could correct for duplicate-node paths. Finally, Thinklab is shutting down and transitioning to a read-only state 😞, hence the new URL of https://think-lab.github.io.

Assessing path and intermediate node contributions to a node-pair search

@ben-heil and I are meeting presently to discuss potential projects for his rotation in the @greenelab. One thing that will be important for the search engine we're building, where users select a node pair and we identify paths that occur more frequently than is expected by chance, is to identify not just metapaths, but also specific paths and intermediate nodes that are relevant.

For example, the hetmech-backend database, which is still populated, returns the following most-significant metapaths between the gene FTO and disease obesity:

metapath_id  path_count      dwpc       p_value  source_degree  target_degree  n_dwpcs  n_nonzero_dwpcs  nonzero_mean    nonzero_sd
  DaGpBPpG         435  2.814122  3.932283e-08            373             32    29000            29000      2.095634  1.214048e-01
   DaGeAeG        6204  2.002286  7.400247e-08            373             28    53000            53000      1.868725  2.485515e-02
   DpSpDaG          25  4.434438  1.337052e-04             17              6   101000           100994      2.438776  4.514481e-01
     DrDaG           3  5.138905  2.442112e-03              5              6   181800            32414      3.920578  5.135883e-01
   DlAlDaG          42  3.744022  7.010120e-03             33              6    20200            20200      2.726263  3.786078e-01

The challenge is to further decompose these DWPCs into individual path scores. Once that is accomplished, and path scores can be compared across metapaths, we can even aggregate scores by intermediate nodes, as we have briefly explored previously in Decomposing the DWPC to assess intermediate node or edge contributions.

So the main tasks here would seem to be:

  1. Going from DWPC to path score, which can be accomplished using a neo4j cypher query and is largely already implemented in some form or another.
  2. Finding a method to assign an overall weight to a metapath, such that paths from different metapaths can be ranked according to a common score.
  3. Identifying how this approach can fit within the hetmech search engine, which likely needs an implementation that is near immediate in human time.

Add sparse matrix option in metaedge_to_adjacency_matrix

@kkloste meet @zietzm (undergrad at Penn starting to work in the Greene Lab). From now until the summer, I'm planning to deploy @zietzm on several small (potentially random) tasks to build his development skills.

In #13 we discussed strategically using scipy.sparse matrices for efficiency. A good first step will be enhancing metaedge_to_adjacency_matrix to optionally return a scipy.sparse matrix.

I think this will be a good task for @zietzm. We probably want to allow the user to choose which matrix backend to use, e.g. numpy.array, numpy.matrix (potentially), or any of the scipy.sparse matrices.

@zietzm, it would be nice if you added tests to the currently non-existent test_matrix.py.
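
A sketch of the kind of interface being proposed; the helper name and the matrix_type parameter are hypothetical, not the eventual API:

import numpy
import scipy.sparse

def to_backend(adjacency, matrix_type=numpy.array):
    # Let the caller choose the matrix backend: numpy.array, numpy.matrix,
    # or any scipy.sparse matrix class.
    return matrix_type(adjacency)

adjacency = numpy.array([[0, 1, 0], [1, 0, 1]])
sparse_adj = to_backend(adjacency, scipy.sparse.csc_matrix)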
