Comments (24)

flomlo commented on June 3, 2024

OK, on a 50% reconstruction of the huge connectome (it's still a 700 MB .h5 file, so I'll refrain from posting it here) the error is as follows:
Out of bounds, tried to add an edge involving vertex 50038, but there are only 49926 vertices.

I'm posting that additional bit of information because I was previously somewhat suspicious about the 34902/34316 numbers, as they are pretty close to 2^15 (resulting in .flag files with lines of length 2^16 because of the whitespace). But this does not seem to be the problem.

ulupo commented on June 3, 2024

Hey @flomlo, thanks for the report!

Unless size really is a problem, I suspect foul play from

def _extract_unweighted_graph(adjacency_matrix):

somehow. Perhaps we could see what vertices and edges are and whether there are some inconsistencies with them?

Is this a directed graph? And is it in sparse format?

flomlo commented on June 3, 2024

Hi,

I've checked it all and now I'm absolutely certain it is due to vertex_index_t being set to an unsigned short in https://github.com/luetge/flagser/blob/292bcf15202eaa27a81612acfb746235bed56480/include/definitions.h#L12.

I've tried another graph with 115462 vertices (the 50% reconstruction mentioned above), and flagser complains that there are only 49926 vertices. But 115462 mod 2^16 is exactly 49926, so this makes me absolutely sure that this bug is due to the 2^16 limit being too small :D
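
The arithmetic is easy to check in Python (assuming the index wraps modulo 2^16, i.e. a 16-bit unsigned vertex_index_t):

n_vertices = 115462            # actual number of vertices in the graph
wrapped = n_vertices % 2**16   # what a 16-bit unsigned vertex index stores
print(wrapped)                 # prints 49926, exactly the count in the error message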

I've now compiled flagser for myself with MANY_VERTICES enabled, but it seems to be a problem if HDF5 is enabled at the same time. So I'm currently digging through this.

ulupo commented on June 3, 2024

Excellent debugging! I'm glad this is clear. I'm also summoning @MonkeyBreaker so he is aware of this conversation as it just turned low-level.

flomlo commented on June 3, 2024

Assuming that I manage to compile all that properly, do you have a test suite to check for performance regressions?

I suspect that just adding the MANY_VERTICES flag to https://github.com/giotto-ai/pyflagser/blob/master/CMakeLists.txt (and maybe preventing HDF5 support from being compiled alongside, because it is not used anyway(?)) will fix my bug here.

But it will result in performance regressions for smaller graphs: maybe not in computation time (that is hard to predict), but certainly in memory usage (probably twice as much? That is hard to tell as well).

I will leave the decision of whether the regressions are acceptable to you; maybe I can just write a small section describing what to do in the case of HUGE graphs somewhere in the README. I think one can expect a user to compile their own version of pyflagser if they encounter this kind of problem.

MonkeyBreaker commented on June 3, 2024

Hi!

Nice digging! About the issue: on the Python side we should at least check the number of vertices and print a warning about the current limitation of the C++ backend.
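
Something like this minimal sketch (the function name and warning text are just placeholders; the 2**16 limit mirrors the 16-bit vertex_index_t in the C++ backend):

import warnings

_MAX_VERTICES = 2**16  # limit implied by a 16-bit vertex index in the backend

def _check_vertex_count(n_vertices):
    # Warn before handing the graph to the C++ binding, where indices would
    # silently wrap around and produce the confusing out-of-bounds error.
    if n_vertices >= _MAX_VERTICES:
        warnings.warn(
            f"Graph has {n_vertices} vertices, but the flagser backend was "
            f"compiled with 16-bit vertex indices (at most {_MAX_VERTICES - 1}); "
            "compile with MANY_VERTICES to handle graphs of this size."
        )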

@flomlo Looking at MANY_VERTICES, it only changes the size of vertex_index_t and index_t. In my opinion, and from personal experience, I wouldn't expect a big drop in performance; I would rather expect a gain, due to how current CPU architectures handle the wider types. But it will consume more memory, that's for sure.

In my opinion, we could enable MANY_VERTICES by default and, of course, check that it doesn't break anything thanks to our test pipeline 😎.

@flomlo if you are ready to create a PR, I'd go in this direction (@ulupo, what do you think?):

  1. Write a regression test with a big enough dataset (should be generated on the fly)
  2. Fix the issue by enabling MANY_VERTICES
  3. Document current limitation
  4. Enjoy new Pyflagser

Of course, I'm more than glad to review any work. I'm just currently too busy with other tasks to create the PR myself.

Best,
Julián

flomlo commented on June 3, 2024

Hi,

yeah, I'll try modifying the CMakeLists.txt of pyflagser locally and see how it goes. This will take some time, I'm a total noob with CMake.

I've confirmed in the meantime that flagser-count with MANY_VERTICES (and disabled HDF5 support) performs correctly.

About the speed regression test: I'd use pyflagser on the BBP connectome (32k vertices, 8M edges, topologically complex) because I have it lying around anyway. This needs about 1 h of computation time and, IIRC, ~8 GB of RAM (without MANY_VERTICES), which is a proper test, I think.
Then I'll test it again with MANY_VERTICES enabled and inform you about the results.

Yes, that is not an autogenerated dataset, but getting a huge random network to have a computation time of 1m < t < 1h is surprisingly nasty.

It will be done sometime this afternoon or over the weekend.

Best,

Florian

ulupo commented on June 3, 2024

Thanks both! Just curious: if the memory increase is significant, couldn't we have both versions available and pick the appropriate one at runtime?

About

Yes, that is not an autogenerated dataset, but getting a huge random network to have a computation time of 1m < t < 1h is surprisingly nasty.

Since we would just be checking for index failures, why not create a sparse graph consisting of disconnected points, save for a single edge between two vertices with very large indices?
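
Something along these lines, for instance (a sketch only; the exact size and the choice of pyflagser entry point are placeholders):

from scipy.sparse import coo_matrix
import pyflagser

n = 2**16 + 10  # just past the 16-bit vertex-index limit
# One edge between two vertices with very large indices; every other vertex
# is an isolated point, so the computation itself stays tiny.
adj = coo_matrix(([1.0], ([n - 2], [n - 1])), shape=(n, n))
# Before the fix this triggered the out-of-bounds error; afterwards it should just run.
pyflagser.flagser_unweighted(adj)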

flomlo commented on June 3, 2024

Re compiling both and choosing at runtime: yes, that's actually quite feasible. Not beautiful, but it should combine the best of both worlds.

Re testing: I thought the testing was to check for memory/time regressions? If it is just for checking that graphs with >65k vertices are working, then yes, that is a trivial test.

ulupo commented on June 3, 2024

At the moment, my prejudice is that true memory/performance tests for very large datasets cannot realistically become part of the CI test suite because we are limited in both resources and total time (60 min total including installing environments, compiling, installing python dependencies and running all tests). But I'll wait to see what comes of your local tests, @flomlo !

Note that if we compile both versions there is no need for such tests because we would have no choice as to which version to use beyond a certain size.

MonkeyBreaker commented on June 3, 2024

yeah, I'll try modifying the CMakeLists.txt of pyflagser locally and see how it goes. This will take some time, I'm a total noob with CMake.

I would add something like target_compile_definitions(flagser_count_pybind PRIVATE MANY_VERTICES=1) after line 42 in CMakeLists.txt; this should do the trick.

compiling both and choosing at runtime: Yes, that's actually quite feasible. Not beautiful, but should combine the best of both worlds.

I'm not a big fan of compiling both; we already do this with USE_COEFFICIENTS. For MANY_VERTICES, I think we should be able to add a template argument to the C++ backend to choose the size of vertex_index_t and index_t at runtime instead of generating multiple binaries.

MonkeyBreaker commented on June 3, 2024

At the moment, my prejudice is that true memory/performance tests for very large datasets cannot realistically become part of the CI test suite because we are limited in both resources and total time (60 min total including installing environments, compiling, installing python dependencies and running all tests). But I'll wait to see what comes of your local tests, @flomlo !

@ulupo I think that for large datasets, benchmarking should be done locally and not on the CI. Maybe we should add to the documentation how users who would like to contribute to the performance side of the library can measure performance on their machine with public datasets, in order to benchmark their version against the current main (main is now the name of the default master branch on GitHub) of pyflagser.

flomlo commented on June 3, 2024

I'm not a big fan of compiling both; we already do this with USE_COEFFICIENTS. For MANY_VERTICES, I think we should be able to add a template argument to the C++ backend to choose the size of vertex_index_t and index_t at runtime instead of generating multiple binaries.

I'm also not a fan, for reasons of elegance, but if I understand it correctly, MANY_VERTICES is a flag used at compile time in order to build more efficient code?

Either I'm not aware of some advanced programming tricks, or getting the performance benefits of both a slim version and a MANY_VERTICES version absolutely requires that both of them be compiled separately.

MonkeyBreaker commented on June 3, 2024

Either I'm not aware of some advanced programming tricks, or getting the performance benefits of both a slim version and a MANY_VERTICES version absolutely requires that both of them be compiled separately.

Right now, as you say, it's a compile-time definition, but with meta-programming we could easily make it a runtime condition. If you look at my PR in flagser, luetge/flagser#36, I make KEEP_FLAG_IN_MEMORY a runtime condition instead of a compile-time definition.

flomlo commented on June 3, 2024

Oh, then this is definitely outside the scope of my programming skills :D

For now at least ...

flomlo commented on June 3, 2024

OK, here are the test results of flagser-memory --min-dim 4 with and without MANY_VERTICES on the BBP connectome data, computed on my personal Ryzen 3600X with 64 GB of RAM. To make the test results diagnostically a little less conclusive, there were a few other tasks (Firefox, Thunderbird, ...) running in the background of the machine as well.

Without MANY_VERTICES: 4m03s, maximum RAM: 17.49GB
With MANY_VERTICES: 3m55s, maximum RAM: 18.72GB

See for full output:
time_on_flagser_with_MANY_VERTICES.txt
time_on_flagser_without_MANY_VERTICES.txt

Result:
I would ignore the additional 7% of memory usage (as well as the 3% faster execution, which is probably just measurement error) and just add MANY_VERTICES to the compile flags in pyflagser's CMakeLists.txt.
I do not see the need for a version without MANY_VERTICES enabled. If someone is that low on RAM, they should not be using Python anyway :D

Do you agree?

MonkeyBreaker commented on June 3, 2024

For me, yes, we could enable MANY_VERTICES by default. I should be able to come up with a templated solution in flagser in the following days (but in the meantime, we could integrate this quick fix). What do you think @ulupo?

Thank you for the measurements :) !

flomlo commented on June 3, 2024

If you think that the templated solution is that quick to implement, then I'm happy just to use the MANY_VERTICES-version locally for the next few weeks. Works for me™

[EDIT: Use it locally]

MonkeyBreaker commented on June 3, 2024

I think it's doable, but right now I cannot spend the time required, and the merge on flagser could also take some time.
I think that merging a temporary fix is a good approach if we include a quick test and update the documentation.

flomlo commented on June 3, 2024

I've patched flagser (see luetge/flagser#40); now there is an error in the case of too many vertices, asking the user to compile with MANY_VERTICES enabled.

I think this error should be raised from pyflagser as well, so that if there is an accidental regression again, the user gets a meaningful warning instead of the hard-to-interpret error I got.

danielaegassan commented on June 3, 2024

It seems like this issue is closed, but I am actually still having the same problem. That is, I do not get a meaningful warning about the graph being too large. What am I missing? Here is an example on a simple but large graph.

Steps/code to reproduce

import scipy.sparse
import numpy as np
import pyflagser

n = 100000
data = np.random.random(100)
row = np.random.choice(n, 100, replace=False)
col = np.random.choice(n, 100, replace=False)
adj = scipy.sparse.coo_matrix((data, (row, col)), shape=(n, n))
# Checking for self loops
print(np.count_nonzero(adj.diagonal(0) != 0))
# Checking matrix shape
print(adj.get_shape())
pyflagser.flagser_weighted(adj)

Expected results

pyflagser runs without errors or gives a meaningful message that the graph is too large

Actual results


RuntimeError Traceback (most recent call last)
in
13 #Checking matrix shape
14 print(adj.get_shape())
---> 15 pyflagser.flagser_weighted(adj)

/gpfs/bbp.cscs.ch/project/proj102/venv/p102/lib/python3.8/site-packages/pyflagser/flagser.py in flagser_weighted(adjacency_matrix, max_edge_weight, min_dimension, max_dimension, directed, filtration, coeff, approximation)
246
247 # Call flagser binding
--> 248 homology = _compute_homology(vertices, edges, min_dimension,
249 _max_dimension, directed, coeff,
250 _approximation, filtration)[0]

RuntimeError: Out of bounds, tried to add an edge involving vertex 39854, but there are only 34464 vertices.

ulupo commented on June 3, 2024

Hi @danielaegassan! Looking at this whole thread and the related PR #68, I think this issue was just temporarily abandoned due to me being distracted by other projects at the time, but if @flomlo and @MonkeyBreaker are happy with this I can just merge #68 and fix the issue.

MonkeyBreaker commented on June 3, 2024

Hi @danielaegassan,

As @ulupo said, the issue was temporarily abandoned; thank you for taking the time to look into why we were encountering it.
We will merge #68, but I don't know the timeline for publishing a release of pyflagser with the fix.

We will keep you updated when the new version is published.

ulupo commented on June 3, 2024

Just merged #68, which closes this issue. @danielaegassan indeed you can wait for the next pyflagser release, but I suspect you'd rather not have to. Perhaps you would be happy to pull the merged changes in a clone of the repo and compile pyflagser from sources?
