Comments (10)
I am happy to help draft a PR for this if you want to proceed.
Thanks for the detailed issue and your willingness to help with a PR. Please feel free to create a PR and then we can iterate on the details there.
from pynwb.
In my tests, I have also seen a 5-10X speedup using remfile over fsspec.
We have a draft PR here where we would list remfile as a third option on the streaming docs page: #1761
We could move that to the second method or even the first method recommended, with the noted caveat that local caching options are limited.
I created some benchmark scripts for comparing the performance of fsspec, ros3, and remfile. I even had the idea of running this as a gh action and providing an auto-generated comparison table. I thought this would be a good thing to point to from the pynwb docs. However, I found that the performance fluctuated a lot and depended on all kinds of factors with the network. It's not as straightforward as CPU-benchmarking. In fact, weirdly, I found that fsspec usually performed comparably to remfile on the gh actions server, whereas on my laptop, I always get around an order of magnitude difference. I don't have an explanation for it. Ros3 seems to perform comparably to remfile in the tests I ran both on gh actions and on my laptop. The annoying thing about ros3 is that it requires a specially-built version of h5py. Are there other downsides of ros3?
Bottom line, I don't think my benchmark tests are reliable at this point. Probably the best way to do this is to somehow measure the total data download required, rather than timing, but I'm not sure how to do that (haven't done any research yet).
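One way to measure data downloaded rather than wall-clock time is to put a byte-counting wrapper between h5py and the remote file-like object (h5py accepts any file-like object). This is a sketch, not code from the thread; the `CountingFile` name is hypothetical, and an in-memory buffer stands in for the remote file:

```python
import io


class CountingFile:
    """Wrap a file-like object and tally the bytes returned by read().

    In a real benchmark this would wrap the fsspec/remfile object passed
    to h5py.File, so the total transfer is measured instead of timing.
    """

    def __init__(self, inner):
        self._inner = inner
        self.bytes_read = 0
        self.num_reads = 0

    def read(self, size=-1):
        data = self._inner.read(size)
        self.bytes_read += len(data)
        self.num_reads += 1
        return data

    def seek(self, offset, whence=0):
        return self._inner.seek(offset, whence)

    def tell(self):
        return self._inner.tell()


# Demo with an in-memory buffer standing in for a remote file
f = CountingFile(io.BytesIO(b"x" * 1000))
f.read(100)
f.seek(500)
f.read(200)
print(f.bytes_read, f.num_reads)  # 300 2
```

Unlike timing, the counts are deterministic for a given access pattern, so they should be reproducible across machines and networks.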
I'm happy if remfile is listed third, considering it is a new tool, as in that PR. But it would be nice to provide a warning that fsspec (if listed first) can be very slow, and to suggest remfile as an alternative.
> I even had the idea of running this as a gh action and providing an auto-generated comparison table.
>
> However, I found that the performance fluctuated a lot and depended on all kinds of factors with the network. It's not as straightforward as CPU-benchmarking.
>
> Bottom line, I don't think my benchmark tests are reliable at this point. Probably the best way to do this is to somehow measure the total data download required, rather than timing, but I'm not sure how to do that (haven't done any research yet).
I highly recommend setting up Airspeed Velocity for this. I was going to do it once the cloud supplement kicks in but feel free to get ahead of me on that.
See how sklearn uses it for their own benchmarking.
You can then even deploy it on all kinds of specialized computational infrastructure (e.g., via dendro on AWS) and see how performance varies based on things like IOPS, volumes, and instance type, and compile/compare results across each architecture.
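For reference, an Airspeed Velocity benchmark module is just a Python class: methods named `time_*` are timed, and methods named `track_*` record an arbitrary returned value (which could hold a byte count). A minimal skeleton, with hypothetical names and an in-memory payload standing in for a remote NWB read:

```python
class StreamingSuite:
    """Skeleton ASV benchmark suite.

    A real suite would open a remote NWB file with fsspec, ros3, or
    remfile inside each method; here a local byte payload stands in.
    """

    def setup(self):
        # ASV calls setup() before each benchmark method
        self.payload = bytes(1024) * 64  # 64 KiB stand-in dataset

    def time_read_payload(self):
        # ASV times methods named time_*
        _ = self.payload[:]

    def track_bytes_downloaded(self):
        # ASV records the return value of methods named track_*
        return len(self.payload)
```

Combining a `time_*` and a `track_*` method per backend would let one suite report both wall-clock time and transfer volume.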
> Are there other downsides of ros3?
No automatic retries. Was/is a big pain for the NWB Inspector (it's been on my to-do list to swap to remfile over there).
Also, packaging. Packaging is painful due to reliance on conda-forge, which usually lags behind PyPI releases.
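Since ros3 offers no automatic retries, one caller-side workaround is a small retry-with-backoff helper around the read call. This is a hypothetical sketch (h5py surfaces failed ros3 requests as `OSError`, which is what it catches); a flaky stand-in function is used for the demo:

```python
import time


def read_with_retries(read_fn, max_attempts=4, base_delay=0.1):
    """Call read_fn(), retrying with exponential backoff on OSError."""
    for attempt in range(max_attempts):
        try:
            return read_fn()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; propagate the error
            time.sleep(base_delay * 2 ** attempt)


# Demo with a flaky stand-in that fails twice, then succeeds
calls = {"n": 0}

def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient network error")
    return b"data"

print(read_with_retries(flaky_read, base_delay=0.001))  # b'data'
```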
> I highly recommend setting up Airspeed Velocity for this. I was going to do it once the cloud supplement kicks in but feel free to get ahead of me on that.
Thanks @CodyCBakerPhD, I'll take a look. Although I don't think this would solve the unreliability of network speed tests, would it? Like I said, I'd like to somehow measure data downloaded rather than download time. I believe that should be consistent across different settings.
> > Are there other downsides of ros3?
>
> No automatic retries. Was/is a big pain for the NWB Inspector (been on my to do list to swap to remfile over there)
I think this should be mentioned in the docs for helping users decide.
> Bottom line, I don't think my benchmark tests are reliable at this point. Probably the best way to do this is to somehow measure the total data download required, rather than timing, but I'm not sure how to do that (haven't done any research yet).
>
> weirdly, I found that fsspec usually performed comparably to remfile on the gh actions server, whereas on my laptop, I always get around an order of magnitude difference
I assume variability in latency and network speed are main factors. Presumably gh actions has high-bandwidth network which may smooth out some of the factors that drive performance issues on home networks/wifi. I think it would be useful to quantify the variability itself, which presumably increases "the further away" (in terms of network transfer rate and latency) we are from the source data.
I think the "total data download" may not be sufficient. With latency as a likely driver of performance, the number and sizes of I/O requests are probably more informative. I.e., you could have roughly the same amount of data being transferred via many more network requests.
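Following that reasoning, an instrumented file object could log each request's offset and size rather than only the total. The sketch below (hypothetical names, in-memory buffer in place of a remote file) shows two access patterns that transfer the same number of bytes with very different request counts:

```python
import io


class RequestLogger:
    """File-like wrapper that records (offset, nbytes) for every read,
    so the request count and size distribution can be compared across
    backends, not just the total bytes transferred."""

    def __init__(self, inner):
        self._inner = inner
        self.requests = []  # list of (offset, nbytes)

    def read(self, size=-1):
        offset = self._inner.tell()
        data = self._inner.read(size)
        self.requests.append((offset, len(data)))
        return data

    def seek(self, offset, whence=0):
        return self._inner.seek(offset, whence)

    def tell(self):
        return self._inner.tell()


def run_pattern(reads):
    """Replay a list of (offset, nbytes) reads; return (count, total)."""
    buf = RequestLogger(io.BytesIO(b"\0" * 4096))
    for off, n in reads:
        buf.seek(off)
        buf.read(n)
    total = sum(n for _, n in buf.requests)
    return len(buf.requests), total


# Same total transfer, very different request counts
print(run_pattern([(0, 1024)]))                        # (1, 1024)
print(run_pattern([(i * 64, 64) for i in range(16)]))  # (16, 1024)
```

On a high-latency link, the second pattern would be far slower despite moving the same 1024 bytes, which is exactly why request count matters alongside total download.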
> I think the "total data download" may not be sufficient. With latency as a likely driver of performance, the number and sizes of I/O requests are probably more informative. I.e., you could have roughly the same amount of data being transferred via many more network requests.
Good point!
> Although I don't think this would solve the unreliability of network speed tests, would it?
That depends entirely on how you set up and configure particular benchmark tests (just as different pytest suites can have different setup conditions). My point is that it gives a standard platform for others (and for remote instances with other configurations, such as an AWS instance in a region on the other side of the world) to run the same benchmarks in the same way, but on their own architecture, so you get a source of data that can be used to model that variability.
It would be great to have benchmarks run over time, but I think it may be best to do this in a separate place. Maybe https://github.com/NeurodataWithoutBorders/nwb-project-analytics would be a good place to create an ASV setup for NWB repos?
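For anyone setting that up, bootstrapping ASV in a repo is only a few commands (a setup sketch; `asv quickstart`, `asv machine`, `asv run`, and `asv publish` are the standard ASV entry points):

```shell
pip install asv
asv quickstart       # interactively writes asv.conf.json for the project
asv machine --yes    # record machine info with defaults (asked on first run)
asv run              # run the benchmark suite against the current commit
asv publish          # render accumulated results to static HTML
asv preview          # serve the rendered results locally
```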