mklab-iti / pygrank
Recommendation algorithms for large graphs
License: Apache License 2.0
Current helper methods for GNNs are centered on tensorflow and keras. Create backend operations that abstract them so that they can also be implemented through torch. This requires adding a new test to make sure everything works.
Related tests: tests.test_gnn.test_appnp
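A backend-dispatch layer could look roughly like the following pure-Python sketch (all names here are illustrative, not pygrank's actual API): each backend registers implementations of a few primitives, and GNN helpers only ever call the abstract operation.

```python
# Hypothetical backend-dispatch sketch; real backends would delegate the
# primitives to tensorflow/keras or torch tensors instead of lists.

_BACKENDS = {}
_active = "numpy_like"

def register_backend(name, ops):
    """Register a dict of primitive implementations under a backend name."""
    _BACKENDS[name] = ops

# A pure-Python stand-in backend for demonstration.
register_backend("numpy_like", {
    "dot": lambda xs, ys: sum(x * y for x, y in zip(xs, ys)),
    "scale": lambda xs, a: [a * x for x in xs],
})

def backend_op(op, *args):
    """Dispatch a primitive to the active backend."""
    return _BACKENDS[_active][op](*args)

def propagate(signal, alpha):
    """Example GNN-style helper written only against backend primitives."""
    return backend_op("scale", signal, alpha)
```

Adding torch support would then mean registering one more dict of primitives rather than touching the helpers.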
Since node ranking algorithms can comprise multiple components, some of which are implicitly determined (e.g. through default instantiation or specific argument parameters), create methods that can summarize the components used and provide citations.
Usefulness: This can help streamline citation practices.
Related tests: None
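A possible shape for such summarization (all class and method names here are hypothetical): each component reports its own citation, and a composite algorithm walks its parts while deduplicating shared defaults.

```python
# Hypothetical sketch of citation summarization; not pygrank's actual API.

class Component:
    def __init__(self, name, citation):
        self.name = name
        self.citation = citation

    def cite(self):
        return self.citation

class CompositeAlgorithm:
    def __init__(self, *components):
        self.components = list(components)

    def cite(self):
        # Deduplicate while preserving order, since implicitly determined
        # components may share default sub-parts.
        seen = []
        for component in self.components:
            if component.cite() not in seen:
                seen.append(component.cite())
        return "\n".join(seen)

algo = CompositeAlgorithm(
    Component("filter", "Page et al., 1999"),
    Component("normalizer", "Doe et al., 2020"),
    Component("filter2", "Page et al., 1999"),
)
```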
During the review process of the library's paper, a reviewer pointed out that the following error occurs in their local system with TensorFlow 3.9.2 and Python 3.10.6.
TypeError: Sequential.call() got multiple values for argument 'training'
This occurs when running the code of the APPNP example. The issue lies entirely with the example and not with any library functionality, so it will not motivate a hotfix.
Investigate whether this issue is unique to this version of TensorFlow or whether the latter has yet again updated something that will break the example in all future versions. At the very least, this error should not occur in GitHub Actions.
If this is not the case, investigate whether this issue is platform-dependent.
The optimization_dict argument to the ClosedFormGraphFilter class does not seem to produce an improvement in running time. This could indicate either a bug or bottlenecks in other parts of the pipeline, e.g. in graph signal instantiation.
Version: run with version 2.3, adjusted to run experiments 50 times when measuring time
Demonstration:
>>> import pygrank as pg
>>> optimization_dict = dict()
>>> pg.benchmark_print(pg.benchmark({"HK": pg.HeatKernel(optimization_dict=optimization_dict)}, pg.load_datasets_all_communities(["bigraph"]), metric="time"))
HK
bigraph0 3.06
bigraph1 3.36
>>> pg.benchmark_print(pg.benchmark({"HK": pg.HeatKernel()}, pg.load_datasets_all_communities(["bigraph"]), metric="time"))
HK
bigraph0 2.98
bigraph1 2.96
Related tests: None
Implement Krylov space analysis for the tensorflow backend. This could require defining additional backend operations.
Related tests: tests.test_filter_optimization.test_lanczos_speedup
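For reference, a minimal Lanczos iteration in plain NumPy illustrates the primitives (matrix-vector products, dot products, norms) a tensorflow backend would need to expose; this is a sketch, not pygrank's implementation.

```python
import numpy as np

def lanczos(A, v0, k):
    """Build a k-step Krylov basis V and tridiagonal T with A ~ V T V^T."""
    n = len(v0)
    V = np.zeros((n, k))
    T = np.zeros((k, k))
    v = v0 / np.linalg.norm(v0)
    beta, v_prev = 0.0, np.zeros(n)
    for j in range(k):
        V[:, j] = v
        w = A @ v - beta * v_prev          # matrix-vector product
        alpha = w @ v                      # dot product
        w -= alpha * v
        T[j, j] = alpha
        if j + 1 < k:
            beta = np.linalg.norm(w)       # norm
            T[j, j + 1] = T[j + 1, j] = beta
            v_prev, v = v, w / beta
    return V, T
```

Each of the three commented operations already has a tensorflow counterpart, so porting is mostly a matter of routing them through the backend.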
Recommend how to cite graph filters and tuners through their cite() method.
Related tests: test_autorefs.test_autorefs, test_autorefs.test_postprocessor_citations
Investigate why filter tuning can be exceptionally slow for backends other than numpy.
Related tests: None (change backends in the current compare_filter_tuning.py in the playground)
Fairwalk does not achieve the same level of fairness (as high a pRule) as other fairness-aware heuristics during tests.
This could arise from an erroneous implementation. If the implementation is found to be correct, separate its tests from those of the other heuristics to account for the lower expected improvement.
Related tests: tests.test_fairness.test_fair_heuristics
Currently, graph signal .np attributes need to be manually converted between backends. Switch to a @property getter interface that automatically performs the needed conversions, removing the burden of checking for backend compliance from developers.
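A sketch of such a getter interface (hypothetical minimal class, not pygrank's code): the signal remembers which backend produced its values and converts lazily on access.

```python
# Hypothetical sketch: lazy backend conversion behind a @property getter.

_current_backend = "numpy"

def _convert(values, backend):
    # Stand-in conversion; real backends would wrap torch/tensorflow tensors.
    if backend == "numpy":
        return list(values)
    raise ValueError("unknown backend")

class GraphSignal:
    def __init__(self, values):
        self._values = values
        self._values_backend = _current_backend

    @property
    def np(self):
        # Convert only when the stored representation is stale, so callers
        # never need to check backend compliance themselves.
        if self._values_backend != _current_backend:
            self._values = _convert(self._values, _current_backend)
            self._values_backend = _current_backend
        return self._values
```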
I encountered this issue while running a benchmark.
~/.local/lib/python3.10/site-packages/pygrank/algorithms/autotune/tuning.py in rank(self, graph, personalization, *args, **kwargs)
...
---> M = G.to_scipy_sparse_array() if isinstance(G, fastgraph.Graph) else nx.to_scipy_sparse_matrix(G, weight=weight, dtype=float)
renormalize = float(renormalize)
left_reduction = reduction #(lambda x: backend.degrees(x)) if reduction == "sum" else reduction
AttributeError: module 'networkx' has no attribute 'to_scipy_sparse_matrix'
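For context, networkx 3.0 removed to_scipy_sparse_matrix in favour of to_scipy_sparse_array, which is what the traceback above trips over. A version-tolerant helper (a sketch, not pygrank's actual fix) could pick whichever function the installed networkx provides:

```python
# Sketch of a networkx-version-tolerant adjacency extraction. The networkx
# module is passed in explicitly here only to keep the sketch self-contained.

def sparse_adjacency(nx_module, graph, **kwargs):
    """Return the sparse adjacency of `graph` across networkx versions."""
    fn = getattr(nx_module, "to_scipy_sparse_array", None)
    if fn is None:
        # Older networkx (< 2.7) only has the now-removed name.
        fn = nx_module.to_scipy_sparse_matrix
    return fn(graph, **kwargs)
```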
I'm trying to use pygrank with larger graphs: 100k-1m nodes, hundreds of millions of edges. My graphs are in sparse matrix format. So far I've just converted to networkx and used that:
g = nx.from_scipy_sparse_array(A, create_using=nx.DiGraph)
signal_dict = {i: 1.0 for i in seeds}
signal = pg.to_signal(g, signal_dict)
# normalize signal
signal.np /= signal.np.sum()
result = algorithm(signal).np
Is there a more performant option available?
Currently, code needs to clearly distinguish between graph signal objects and the extraction of their .np fields.
Reframe the code so that, when operations are applied to signals, their respective .np fields are used in their place.
This can help write comprehensible high-level code.
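A minimal sketch of that reframing (hypothetical class, not pygrank's code): arithmetic on a signal transparently operates on its .np field, so high-level code never extracts it by hand.

```python
# Hypothetical sketch: signal operations delegate to the .np field.

class Signal:
    def __init__(self, np_values):
        self.np = list(np_values)

    def __truediv__(self, other):
        # Division produces a new signal built from the .np values.
        return Signal(x / other for x in self.np)

    def sum(self):
        return sum(self.np)

signal = Signal([1.0, 3.0])
normalized = signal / signal.sum()   # instead of signal.np /= signal.np.sum()
```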
In the following code, the two algorithms should be close to equivalent, yet there is a significant Mabs error between them.
>>> import pygrank as pg
>>> graph = next(pg.load_datasets_graph(["graph9"]))
>>> ranks1 = pg.Normalize(pg.PageRank(0.85, tol=1.E-12, max_iters=1000)).rank(graph, {"A": 1})
>>> ranks2 = pg.Normalize(pg.GenericGraphFilter([0.85**i for i in range(20)], tol=1.E-12)).rank(graph, {"A": 1})
>>> print(pg.Mabs(ranks1)(ranks2))
0.025585056372903574
Related tests: tests.test_filters.test_custom_runs
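One plausible contributor is simple truncation: the GenericGraphFilter keeps only 20 powers, and for ratio 0.85 the discarded tail is about 3.9% of the geometric series total, the same order as the reported Mabs error. A quick check of that arithmetic (not a confirmed diagnosis):

```python
# Fraction of the geometric series sum(0.85**i, i >= 0) discarded by a
# 20-term truncation: sum_{i>=20} 0.85**i / sum_{i>=0} 0.85**i = 0.85**20.
alpha, terms = 0.85, 20
tail_fraction = alpha ** terms   # roughly 0.039
```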
Recommend how to cite graph filters and tuners through their cite() method.
Related tests: test_autorefs.test_autorefs, test_autorefs.test_filter_citations
Is it possible to run the tuners with non-seed nodes? For example, if I have a seed_set and a target_set, can I run the tuner diffusions with the signal from the former but optimize for metrics defined with respect to the latter? In this case I have a desired ranking of the nodes in the target_set.
Automatically display verbose progress (e.g. for dataset downloading) that disappears once tasks are complete (e.g. to resume normal benchmarking prints).
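One common pattern for such transient output (a sketch, not a proposed pygrank API) overwrites the current line with carriage returns and blanks it when done, so subsequent benchmark prints start clean.

```python
import sys

# Sketch: transient progress line that is erased when the task completes.

def show_progress(step, total, stream=sys.stderr):
    # '\r' returns to the start of the line so each update overwrites
    # the previous one instead of scrolling.
    stream.write("\rdownloading %d/%d" % (step, total))
    stream.flush()

def clear_progress(width=40, stream=sys.stderr):
    # Blank the line out, then return the cursor for normal printing.
    stream.write("\r" + " " * width + "\r")
    stream.flush()
```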
I saw that you have your own sparse matrix library called matvec which parallelizes sparse-dense multiplication. There is an existing Python library called sparse_dot which does the same but with scipy csr/csc matrices: https://github.com/flatironinstitute/sparse_dot.
I benchmarked the two with a matrix of size
<6357204x6357204 sparse matrix of type '<class 'numpy.float32'>'
with 3614017927 stored elements in Compressed Sparse Column format>
With the 32-core/64-thread server CPU I'm testing on, the times for 10 matrix-vector multiplications on the right and left are:
matvec
right-vec 25.15
left-vec 19.47
sparse_dot csc
right-vec 40.17
left-vec 14.91
sparse_dot csr
right-vec 10.38
left-vec 28.53
The times look competitive. I'm not sure if matvec has some other advantages I'm not considering here, but the fact that sparse_dot works with the existing scipy types would be a huge benefit (for my use case, at least). sparse_dot does require installing the MKL library, and for giant matrices like the above it requires the environment variable MKL_INTERFACE_LAYER=ILP64.
Spearman and Pearson correlations need to be supported by the tensorflow backend.
Related tests: tests.test_measures.test_correlation_compliance
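For reference, both measures reduce to a handful of backend primitives: means, element-wise products, square roots, and an argsort for ranks. A NumPy sketch of the formulas a tensorflow backend would need to reproduce (e.g. via tf.reduce_mean and tf.argsort):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation from centered sums of products."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc * xc).sum() * (yc * yc).sum())

def spearman(x, y):
    """Spearman is Pearson computed on ranks; argsort of argsort yields
    ranks (ties are ignored in this sketch)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return pearson(rx, ry)
```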
Implement a high-level way of summarizing convergence analysis, for example to help measure running time and iterations when algorithms are wrapped by postprocessors (including iterative schemes).
For example, a list of all convergence manager run outcomes could be obtained. Perhaps this could be achieved with some combination of dependent algorithm discovery and keeping convergence manager history on restart.
Related tests: tests.test_filters.test_convergence_string_conversion
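One possible shape for this (entirely hypothetical names): a convergence manager that appends one outcome record per run instead of overwriting, so that a postprocessor-wrapped algorithm can later report every inner run.

```python
# Hypothetical sketch of keeping convergence history across restarts.

class ConvergenceManager:
    def __init__(self, tol=1.e-6, max_iters=100):
        self.tol, self.max_iters = tol, max_iters
        self.history = []     # one outcome record per completed run
        self.iteration = 0

    def start(self):
        # Reset the counter but keep history from previous runs.
        self.iteration = 0

    def has_converged(self, error):
        self.iteration += 1
        done = error < self.tol or self.iteration >= self.max_iters
        if done:
            self.history.append({"iterations": self.iteration, "error": error})
        return done
```

Dependent-algorithm discovery would then only need to collect these history lists from every manager in the wrapped pipeline.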
Recommend how to cite evaluation methodologies through a cite() method.
Related tests: None (to be added in test_autorefs.py)