Comments (24)
ok, on a 50% reconstruction of the huge connectome (it's still a 700mb .h5 file, so I'll refrain from posting it here) the error is as follows:
Out of bounds, tried to add an edge involving vertex 50038, but there are only 49926 vertices.
I'm posting that additional bit of information because before I was kinda suspicious about the 34902/34316 numbers, as they are pretty close to 2^15 (resulting in .flag files of 2^16 line length cause of whitespaces). But this does not seem to be the problem.
from pyflagser.
Hey @flomlo, thanks for the report!
Unless size really is a problem, I suspect foul play from
Line 8 in a4f933a
somehow. Perhaps we could see what
vertices
and edges
are and whether there are some inconsistencies with them?
Is this a directed graph? And is it in sparse format?
from pyflagser.
Hi,
I've checked it all and now I'm absolutely certain it is due to vertex_index_t
being set to a unsigned short
in https://github.com/luetge/flagser/blob/292bcf15202eaa27a81612acfb746235bed56480/include/definitions.h#L12.
I've tried another Graph with 115462 vertices (the 50% reconstruction mentioned above), which complains that it only has 49926 vertices. But 115462 mod 2^16 is exactly 49926. So this makes me absolutely sure that this bug is due to 2^16 being too small :D
I'm currently compiled flagser
for myself with MANY_VERTICES
enabled, but it seems to be a problem if HDF5 is enabled at the same time. So I'm currently digging through this.
from pyflagser.
Excellent debugging! I'm glad this is clear. I'm also summoning @MonkeyBreaker so he is aware of this conversation as it just turned low-level.
from pyflagser.
Assuming that I manage to compile all that properly, do you have a testsuite to test for performance regressions?
I suspect just adding the MANY_VERTICES flag to https://github.com/giotto-ai/pyflagser/blob/master/CMakeLists.txt (and maybe preventing HDF5 support to be compiled alongside, because it is not used, anyway(?)) will remove my bug here.
But it will result in performance regressions for smaller graphs. Maybe not in computation time (this is hard to predict), but certainly in memory usage (probably twice as much? But hard to tell, as well).
I will leave the decision if the regressions are acceptable to you, maybe I can just write a small extension describing what to do in case of HUGE graphs somewhere in the README. I think one can expect a user to compile their own version of pyflagser
if they encounter this kind of problems.
from pyflagser.
HI !
Nice digging, about the issue, we should in Python at least check for the number of vertices, and at least print a warning about the current limitations from the C++ backend.
@flomlo Looking about MANY_VERTICES
, it only defines the size of vertex_index_t
and index_t
. In my opinion and from personal experience, I wouldn't expect a big drop in performances, I would more expect a gain in performances, due to the CPU architectures. But, it will consume more memory, that's for sure.
In my opinion, we could enable by default MANY_VERTICES
, of course check if it doesn't break anything thank to our test pipeline
@flomlo if you are ready to create a PR, I'll go in this direction (@ulupo what do you think ?):
- Write a regression test with a big enough dataset (should be generated on the fly)
- Fix the issue by enabling
MANY_VERTICES
- Document current limitation
- Enjoy new Pyflagser
Of course, I'm more than glad to review any work. I'm just currently too busy with other tasks to create myself the PR.
Best,
Julián
from pyflagser.
Hi,
yeah, I'll try modifying the CMakeLists.txt of pyflagser
locally and see how it goes. This will take some time, I'm a total noob with CMake.
I've confirmed in the meantime that flagser-count
with MANY_VERTICES (and disabled HDF5 support) performs correctly.
About the speed regression test: I'ld use pyflagser
on the BBP connectome (32k vertices, 8m edges, topologically complex) cause I have it lying around anyways. This needs about 1h of computation time and iirc ~8GB or RAM (without MANY_VERTICES), which is a proper test, I think.
Then I'll test it again with MANY_VERTICES enabled and inform you about the results.
Yes, that is not an autogenerated dataset, but getting a huge random network to have a computation time of 1m < t < 1h is surprisingly nasty.
It will be done sometime afternoon or over the weekend.
Best,
Florian
from pyflagser.
Thanks both! Just curious: if the memory increase is significant, couldn't we have both versions available and pick the appropriate one at runtime?
About
Yes, that is not an autogenerated dataset, but getting a huge random network to have a computation time of 1m < t < 1h is surprisingly nasty.
Since we would just be checking for index failures, why not just create a sparse graph which consists just of disconnected points save for a single edge between two vertices with very large index?
from pyflagser.
@ compiling both and choosing at runtime: Yes, that's actually quite feasible. Not beautiful, but should combine the best of both worlds.
@ testing: I though the testing was to check for memory/time regression? If it is just for checking it >65k-vertices-graphs are working, then yes, that is a trivial test.
from pyflagser.
At the moment, my prejudice is that true memory/performance tests for very large datasets cannot realistically become part of the CI test suite because we are limited in both resources and total time (60 min total including installing environments, compiling, installing python dependencies and running all tests). But I'll wait to see what comes of your local tests, @flomlo !
Note that if we compile both versions there is no need for such tests because we would have no choice as to which version to use beyond a certain size.
from pyflagser.
yeah, I'll try modifying the CMakeLists.txt of pyflagser locally and see how it goes. This will take some time, I'm a total noob with CMake.
I'll would add something like target_compile_definitions(flagser_count_pybind PRIVATE MANY_VERTICES=1)
after line 42 in CMakeLists.txt
this should do the work.
compiling both and choosing at runtime: Yes, that's actually quite feasible. Not beautiful, but should combine the best of both worlds.
I'm not a big fan of compiling both, we already do this with USE_COEFFICIENTS
, for MANY_VERTICES
I think we should be able to add to the C++ backend a template argument to choose the size of vertex_index_t
and index_t
at runtime instead of generating multiples binaries.
from pyflagser.
At the moment, my prejudice is that true memory/performance tests for very large datasets cannot realistically become part of the CI test suite because we are limited in both resources and total time (60 min total including installing environments, compiling, installing python dependencies and running all tests). But I'll wait to see what comes of your local tests, @flomlo !
@ulupo I think that for large datasets, benchmark should be done locally and not on the CI. Maybe we should add to the documentation how users that would like to contribute to the performances side of the library, to measure on their machine the performances with public datasets in order to benchmark their version with the current main
(main is now master branch name on github) of pyflagser
.
from pyflagser.
I'm not a big fan of compiling both, we already do this with
USE_COEFFICIENTS
, forMANY_VERTICES
I think we should be able to add to the C++ backend a template argument to choose the size ofvertex_index_t
andindex_t
at runtime instead of generating multiples binaries.
I'm also not a fan for reasons of elegance, but if I understand it correctly, MANY_VERTICES
is a flag used at compiletime in order to build more efficient code?
Either I'm not aware of some advanced programming tricks, or having the performance-benefits of a slim version and a MANY_VERTICES
-version absolutely requires that both of them have been compiled separately.
from pyflagser.
Either I'm not aware of some advanced programming tricks, or having the performance-benefits of a slim version and a MANY_VERTICES-version absolutely requires that both of them have been compiled separately.
Right now, as you say, it's a compile time definition, but with meta-programming, we could easily make it a runtime condition. If you look at my PR in flagser
luetge/flagser#36, I make KEEP_FLAG_IN_MEMORY
a runtime condition instead of a compile time definition.
from pyflagser.
Oh, then this definitely is out of my programming skills scope :D
For now at least ...
from pyflagser.
Ok, here are the test results of flagser-memory --min-dim 4
with and w/o MANY_VERTICES
on the BBP connectomic data, computed on my personal Ryzen 3600X with 64gb of RAM. To make the test results diagnostically less conclusive, there were a few other tasks (firefox, thunderbird ...) running in the background of the machine as well.
Without MANY_VERTICES
: 4m03s, maximum RAM: 17.49GB
With MANY_VERTICES
: 3m55s, maximum RAM: 18.72GB
See for full output:
time_on_flagser_with_MANY_VERTICES.txt
time_on_flagser_without_MANY_VERTICES.txt
Result:
I would ignore the additional 7% of memory usage (as well as the 3% faster execution, which is probably just measuring error) and just add MANY_VERTICES to the compile flags of pyflagser
s CMakeList.txt.
I do not see the need for version without MANY_VERTICES enabled. If someone is this low on RAM, he should not use python anyways :D
Do you agree?
from pyflagser.
For me, yes we could enable by default MANY_VERTICES
, I should be able to come for a templated solution in flagser
in the following days (but on the meantime, we could integrate this quick fix). Waht do you think @ulupo ?
Thank you for the measurements :) !
from pyflagser.
If you think that the templated solution is that quick to implement, then I'm happy just to use the MANY_VERTICES
-version locally for the next few weeks. Works for me™
[EDIT: Use it locally]
from pyflagser.
I think it's doable to do it, but right now I cannot spend the time require, and also, the merge on flagser
could also take some time.
I think that merging a temporary fix is a good approach if we include a quick test and update the documentation.
from pyflagser.
I've patched flagser
(see luetge/flagser#40), now there is an error in case of too many vertices, requesting the user to compile with MANY_VERYTICES enabled.
I think this error should occur from pyflagser
as well, so if there is an accidental regression again, the user will get a meaningful warning instead of the hard-to-interpret error I've gotten.
from pyflagser.
It seems like this issues is closed, but I actually still having the same issue. That is, I do not get a meaningful warning about the graph being too large. What am I missing? Here an example on a simple but large graph
Steps/code to reproduce
import scipy.sparse
import numpy as np
import pyflagser
n=100000
data=np.random.random(100)
row=np.random.choice(n,100,replace=False)
col=np.random.choice(n,100,replace=False)
adj=scipy.sparse.coo_matrix((data,(row,col)),[n,n])
#Checking for self loops
print(np.count_nonzero(adj.diagonal(0)!=0))
#Checking matrix shape
print(adj.get_shape())
pyflagser.flagser_weighted(adj)
Expected results
pyflagser runs without errors or gives a meaningful message that the graph is too large
Actual results
RuntimeError Traceback (most recent call last)
in
13 #Checking matrix shape
14 print(adj.get_shape())
---> 15 pyflagser.flagser_weighted(adj)
/gpfs/bbp.cscs.ch/project/proj102/venv/p102/lib/python3.8/site-packages/pyflagser/flagser.py in flagser_weighted(adjacency_matrix, max_edge_weight, min_dimension, max_dimension, directed, filtration, coeff, approximation)
246
247 # Call flagser binding
--> 248 homology = _compute_homology(vertices, edges, min_dimension,
249 _max_dimension, directed, coeff,
250 _approximation, filtration)[0]
RuntimeError: Out of bounds, tried to add an edge involving vertex 39854, but there are only 34464 vertices.
from pyflagser.
Hi @danielaegassan! Looking at this whole thread and the related PR #68, I think this issue was just temporarily abandoned due to me being distracted by other projects at the time, but if @flomlo and @MonkeyBreaker are happy with this I can just merge #68 and fix the issue.
from pyflagser.
Hi @danielaegassan,
As @ulupo said, the issue was temporarily abandoned, thank you for taking the time into looking why we were encountering the issue.
We will merge #68 and I don't know the timeline to publish a release of pyflagser
with the fix.
Will keep you updated when the new version will be published.
from pyflagser.
Just merged #68, which closes this issue. @danielaegassan indeed you can wait for the next pyflagser
release, but I suspect you'd rather not have to. Perhaps you would be happy to pull the merged changes in a clone of the repo and compile pyflagser
from sources?
from pyflagser.
Related Issues (20)
- No PKG_INFO error when uploading tar file to PyPI HOT 11
- Computations following an interrupted call fail HOT 5
- Support for filtration algos HOT 2
- Computations take a long time HOT 2
- Filtration algorithm vertex_degree produces segmentation fault HOT 2
- Zero values outside of the diagonal HOT 10
- Incorrect outputs of the flagser bindings.
- One dimensional output for empty persistence diagram HOT 3
- flagser_weighted does not terminate/takes forever when coeff > 2 HOT 5
- Relax condition on square shape on sparse input HOT 1
- Warnings and crashes for FlagserPersistence with n_jobs>1 HOT 2
- Feature request: Allow (read only) access to the directed flag complex via python HOT 16
- `save_weighted_flag` crashes (and is uncovered by tests) HOT 3
- flagser_memory support HOT 13
- Drop support for Python 3.6 HOT 6
- ERROR pyflagser - ModuleNotFoundError: No module named 'pyflagser.modules' HOT 4
- error 'flagser/src/flagser-count.cpp' file not found HOT 1
- feature request: custom filtration HOT 4
- Error installing pyflagser in macOS 12.4 FileNotFoundError
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pyflagser.