mklab-iti / pygrank
Recommendation algorithms for large graphs
License: Apache License 2.0
Current helper methods for GNNs are centered on tensorflow and keras. Create backend operations that abstract them so that they can also be implemented through torch. This requires adding a new test to make sure everything works.
Related tests: tests.test_gnn.test_appnp
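A backend-dispatch layer could look roughly like the following pure-Python sketch (all names here are illustrative, not pygrank's actual API): each backend registers implementations of a few primitives, and GNN helpers only ever call the abstract operation.

```python
# Hypothetical backend-dispatch sketch; real backends would delegate the
# primitives to tensorflow/keras or torch tensors instead of lists.

_BACKENDS = {}
_active = "numpy_like"

def register_backend(name, ops):
    """Register a dict of primitive implementations under a backend name."""
    _BACKENDS[name] = ops

# A pure-Python stand-in backend for demonstration.
register_backend("numpy_like", {
    "dot": lambda xs, ys: sum(x * y for x, y in zip(xs, ys)),
    "scale": lambda xs, a: [a * x for x in xs],
})

def backend_op(op, *args):
    """Dispatch a primitive to the active backend."""
    return _BACKENDS[_active][op](*args)

def propagate(signal, alpha):
    """Example GNN-style helper written only against backend primitives."""
    return backend_op("scale", signal, alpha)
```

Adding torch support would then mean registering one more dict of primitives rather than touching the helpers.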
Since node ranking algorithms can comprise multiple components, some of which are implicitly determined (e.g. through default instantiation or specific argument parameters), create methods that can summarize the components used and provide citations.
Usefulness: This can help streamline citation practices.
Related tests: None
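A possible shape for such summarization (all class and method names here are hypothetical): each component reports its own citation, and a composite algorithm walks its parts while deduplicating shared defaults.

```python
# Hypothetical sketch of citation summarization; not pygrank's actual API.

class Component:
    def __init__(self, name, citation):
        self.name = name
        self.citation = citation

    def cite(self):
        return self.citation

class CompositeAlgorithm:
    def __init__(self, *components):
        self.components = list(components)

    def cite(self):
        # Deduplicate while preserving order, since implicitly determined
        # components may share default sub-parts.
        seen = []
        for component in self.components:
            if component.cite() not in seen:
                seen.append(component.cite())
        return "\n".join(seen)

algo = CompositeAlgorithm(
    Component("filter", "Page et al., 1999"),
    Component("normalizer", "Doe et al., 2020"),
    Component("filter2", "Page et al., 1999"),
)
```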
During the review process of the library's paper, a reviewer pointed out that the following error occurs in their local system with TensorFlow 3.9.2 and Python 3.10.6.
TypeError: Sequential.call() got multiple values for argument 'training'
This occurs when running the code of the APPNP example. The issue lies entirely with the example and not with any library functionality, so it will not motivate a hotfix.
Investigate whether this issue is unique to this version of TensorFlow or whether the latter has yet again updated something that will break the example in all future versions. At the very least, this error should not occur in GitHub Actions.
If this is not the case, investigate whether this issue is platform-dependent.
The optimization_dict argument to the ClosedFormGraphFilter class does not seem to produce an improvement in running time. This could indicate either a bug or bottlenecks in other parts of the pipeline, e.g. in graph signal instantiation.
Version: run with version 2.3, adjusted to run experiments 50 times when measuring time
Demonstration:
>>> import pygrank as pg
>>> optimization_dict = dict()
>>> pg.benchmark_print(pg.benchmark({"HK": pg.HeatKernel(optimization_dict=optimization_dict)}, pg.load_datasets_all_communities(["bigraph"]), metric="time"))
HK
bigraph0 3.06
bigraph1 3.36
>>> pg.benchmark_print(pg.benchmark({"HK": pg.HeatKernel()}, pg.load_datasets_all_communities(["bigraph"]), metric="time"))
HK
bigraph0 2.98
bigraph1 2.96
Related tests: None
Implement Krylov space analysis for the tensorflow backend. This could require defining additional backend operations.
Related tests: tests.test_filter_optimization.test_lanczos_speedup
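For reference, a minimal Lanczos iteration in plain NumPy illustrates the primitives (matrix-vector products, dot products, norms) a tensorflow backend would need to expose; this is a sketch, not pygrank's implementation.

```python
import numpy as np

def lanczos(A, v0, k):
    """Build a k-step Krylov basis V and tridiagonal T with A ~ V T V^T."""
    n = len(v0)
    V = np.zeros((n, k))
    T = np.zeros((k, k))
    v = v0 / np.linalg.norm(v0)
    beta, v_prev = 0.0, np.zeros(n)
    for j in range(k):
        V[:, j] = v
        w = A @ v - beta * v_prev          # matrix-vector product
        alpha = w @ v                      # dot product
        w -= alpha * v
        T[j, j] = alpha
        if j + 1 < k:
            beta = np.linalg.norm(w)       # norm
            T[j, j + 1] = T[j + 1, j] = beta
            v_prev, v = v, w / beta
    return V, T
```

Each of the three commented operations already has a tensorflow counterpart, so porting is mostly a matter of routing them through the backend.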
Recommend how to cite graph filters and tuners through their cite() method.
Related tests: test_autorefs.test_autorefs, test_autorefs.test_postprocessor_citations
Investigate why filter tuning can be exceptionally slow for backends other than numpy.
Related tests: None (change backends in the current compare_filter_tuning.py in the playground)
Fairwalk does not achieve the same level of fairness (as high a pRule) as other fairness-aware heuristics during tests.
This could arise from an erroneous implementation. If the implementation is found to be correct, separate its tests from those of the other heuristics to account for the lower expected improvement.
Related tests: tests.test_fairness.test_fair_heuristics
Currently, graph signal .np attributes need to be manually converted between backends. Switch to a @property getter interface that automatically performs the needed conversions, removing the burden of checking for backend compliance from developers.
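A sketch of such a getter interface (hypothetical minimal class, not pygrank's code): the signal remembers which backend produced its values and converts lazily on access.

```python
# Hypothetical sketch: lazy backend conversion behind a @property getter.

_current_backend = "numpy"

def _convert(values, backend):
    # Stand-in conversion; real backends would wrap torch/tensorflow tensors.
    if backend == "numpy":
        return list(values)
    raise ValueError("unknown backend")

class GraphSignal:
    def __init__(self, values):
        self._values = values
        self._values_backend = _current_backend

    @property
    def np(self):
        # Convert only when the stored representation is stale, so callers
        # never need to check backend compliance themselves.
        if self._values_backend != _current_backend:
            self._values = _convert(self._values, _current_backend)
            self._values_backend = _current_backend
        return self._values
```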
I encountered this issue while running a benchmark.
~/.local/lib/python3.10/site-packages/pygrank/algorithms/autotune/tuning.py in rank(self, graph, personalization, *args, **kwargs)
...
---> M = G.to_scipy_sparse_array() if isinstance(G, fastgraph.Graph) else nx.to_scipy_sparse_matrix(G, weight=weight, dtype=float)
renormalize = float(renormalize)
left_reduction = reduction #(lambda x: backend.degrees(x)) if reduction == "sum" else reduction
AttributeError: module 'networkx' has no attribute 'to_scipy_sparse_matrix'
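For context, networkx 3.0 removed to_scipy_sparse_matrix in favour of to_scipy_sparse_array, which is what the traceback above trips over. A version-tolerant helper (a sketch, not pygrank's actual fix) could pick whichever function the installed networkx provides:

```python
# Sketch of a networkx-version-tolerant adjacency extraction. The networkx
# module is passed in explicitly here only to keep the sketch self-contained.

def sparse_adjacency(nx_module, graph, **kwargs):
    """Return the sparse adjacency of `graph` across networkx versions."""
    fn = getattr(nx_module, "to_scipy_sparse_array", None)
    if fn is None:
        # Older networkx (< 2.7) only has the now-removed name.
        fn = nx_module.to_scipy_sparse_matrix
    return fn(graph, **kwargs)
```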
I'm trying to use pygrank with larger graphs: 100k-1m nodes, hundreds of millions of edges. My graphs are in sparse matrix format. So far I've just converted to networkx and used that:
g = nx.from_scipy_sparse_array(A, create_using=nx.DiGraph)
signal_dict = {i: 1.0 for i in seeds}
signal = pg.to_signal(g, signal_dict)
# normalize signal
signal.np /= signal.np.sum()
result = algorithm(signal).np
Is there a more performant option available?
Currently, code needs to clearly distinguish between graph signal objects and the extraction of their .np fields.
Reframe the code so that, when operations are applied to signals, their respective .np fields are used in their place.
This can help write comprehensible high-level code.
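A minimal sketch of that reframing (hypothetical class, not pygrank's code): arithmetic on a signal transparently operates on its .np field, so high-level code never extracts it by hand.

```python
# Hypothetical sketch: signal operations delegate to the .np field.

class Signal:
    def __init__(self, np_values):
        self.np = list(np_values)

    def __truediv__(self, other):
        # Division produces a new signal built from the .np values.
        return Signal(x / other for x in self.np)

    def sum(self):
        return sum(self.np)

signal = Signal([1.0, 3.0])
normalized = signal / signal.sum()   # instead of signal.np /= signal.np.sum()
```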
In the following code, the two algorithms should be close to equivalent, yet there is a significant Mabs error between them.
>>> import pygrank as pg
>>> graph = next(pg.load_datasets_graph(["graph9"]))
>>> ranks1 = pg.Normalize(pg.PageRank(0.85, tol=1.E-12, max_iters=1000)).rank(graph, {"A": 1})
>>> ranks2 = pg.Normalize(pg.GenericGraphFilter([0.85**i for i in range(20)], tol=1.E-12)).rank(graph, {"A": 1})
>>> print(pg.Mabs(ranks1)(ranks2))
0.025585056372903574
Related tests: tests.test_filters.test_custom_runs
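One plausible contributor is simple truncation: the GenericGraphFilter keeps only 20 powers, and for ratio 0.85 the discarded tail is about 3.9% of the geometric series total, the same order as the reported Mabs error. A quick check of that arithmetic (not a confirmed diagnosis):

```python
# Fraction of the geometric series sum(0.85**i, i >= 0) discarded by a
# 20-term truncation: sum_{i>=20} 0.85**i / sum_{i>=0} 0.85**i = 0.85**20.
alpha, terms = 0.85, 20
tail_fraction = alpha ** terms   # roughly 0.039
```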
Recommend how to cite graph filters and tuners through their cite() method.
Related tests: test_autorefs.test_autorefs, test_autorefs.test_filter_citations
Is it possible to run the tuners with non-seed nodes? For example, if I have a seed_set and a target_set, can I run the tuner diffusions with the signal from the former but optimize for metrics defined with respect to the latter? In this case I have a desired ranking of the nodes in the target_set.
Automatically display verbose progress (e.g. for dataset downloading) that disappears once tasks are complete (e.g. to resume normal benchmarking prints).
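One common pattern for such transient output (a sketch, not a proposed pygrank API) overwrites the current line with carriage returns and blanks it when done, so subsequent benchmark prints start clean.

```python
import sys

# Sketch: transient progress line that is erased when the task completes.

def show_progress(step, total, stream=sys.stderr):
    # '\r' returns to the start of the line so each update overwrites
    # the previous one instead of scrolling.
    stream.write("\rdownloading %d/%d" % (step, total))
    stream.flush()

def clear_progress(width=40, stream=sys.stderr):
    # Blank the line out, then return the cursor for normal printing.
    stream.write("\r" + " " * width + "\r")
    stream.flush()
```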
I saw that you have your own sparse matrix library called matvec which parallelizes sparse-dense multiplication. There is an existing Python library called sparse_dot which does the same but with scipy csr/csc matrices: https://github.com/flatironinstitute/sparse_dot.
I benchmarked the two with a matrix of size
<6357204x6357204 sparse matrix of type '<class 'numpy.float32'>'
with 3614017927 stored elements in Compressed Sparse Column format>
With the 32-core/64-thread server CPU I'm testing on, the times for 10 matrix-vector multiplications on the right and left are:
matvec
right-vec 25.15
left-vec 19.47
sparse_dot csc
right-vec 40.17
left-vec 14.91
sparse_dot csr
right-vec 10.38
left-vec 28.53
The times look competitive. I'm not sure if matvec has some other advantages I'm not considering here, but the fact that sparse_dot works with the existing scipy types would be a huge benefit (for my use case, at least). sparse_dot does require installing the MKL library, and for giant matrices like the above it requires the environment variable MKL_INTERFACE_LAYER=ILP64.
Spearman and Pearson correlations need to be supported by the tensorflow backend.
Related tests: tests.test_measures.test_correlation_compliance
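For reference, both measures reduce to a handful of backend primitives: means, element-wise products, square roots, and an argsort for ranks. A NumPy sketch of the formulas a tensorflow backend would need to reproduce (e.g. via tf.reduce_mean and tf.argsort):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation from centered sums of products."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc * xc).sum() * (yc * yc).sum())

def spearman(x, y):
    """Spearman is Pearson computed on ranks; argsort of argsort yields
    ranks (ties are ignored in this sketch)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return pearson(rx, ry)
```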
Implement a high-level way of summarizing convergence analysis, for example to help measure running time and iterations when algorithms are wrapped by postprocessors (including iterative schemes).
For example, a list of all convergence manager run outcomes could be obtained. Perhaps this could be achieved with some combination of dependent algorithm discovery and keeping convergence manager history on restart.
Related tests: tests.test_filters.test_convergence_string_conversion
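One possible shape for this (entirely hypothetical names): a convergence manager that appends one outcome record per run instead of overwriting, so that a postprocessor-wrapped algorithm can later report every inner run.

```python
# Hypothetical sketch of keeping convergence history across restarts.

class ConvergenceManager:
    def __init__(self, tol=1.e-6, max_iters=100):
        self.tol, self.max_iters = tol, max_iters
        self.history = []     # one outcome record per completed run
        self.iteration = 0

    def start(self):
        # Reset the counter but keep history from previous runs.
        self.iteration = 0

    def has_converged(self, error):
        self.iteration += 1
        done = error < self.tol or self.iteration >= self.max_iters
        if done:
            self.history.append({"iterations": self.iteration, "error": error})
        return done
```

Dependent-algorithm discovery would then only need to collect these history lists from every manager in the wrapped pipeline.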
Recommend how to cite evaluation methodologies through a cite() method.
Related tests: None (to be added in test_autorefs.py)