
Comments (2)

hakonanes commented on August 15, 2024

Thank you for taking the time to raise these concerns, @CSSFrancis. I gave a brief summary of my own use of lazy computations with kikuchipy here (based on your good questions): hyperspy/hyperspy#3107 (comment). All your points are reasonable, but I have different views on how they should be addressed.

> I would suggest having a warning that pops up with every function that uses overlap to the effect of

I find a warning raised every time a bit too drastic. But adding this information to the docstring Notes of relevant methods would be highly welcome! map_overlap() is used in EBSD methods average_neighbour_patterns() and get_average_neighbour_dot_product_map(). These methods are not optimized (i.e. we haven't updated them after we initially wrote them), but still, I haven't experienced any major issues in terms of RAM. They run fine on my laptop for larger datasets.
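To make the memory behaviour concrete, here is a minimal sketch of the `map_overlap()` pattern these methods rely on, written against plain `dask.array` rather than kikuchipy's actual implementation (the array shape and the averaging function are hypothetical stand-ins):

```python
import dask.array as da
import numpy as np

# Hypothetical stand-in for an EBSD dataset: a (4, 4) navigation grid
# of (8, 8) patterns; real data would be far larger.
data = da.random.random((4, 4, 8, 8), chunks=(2, 2, 8, 8))

def average_neighbour_patterns(block):
    # Average each pattern with its nearest navigation-grid neighbours.
    out = np.empty_like(block)
    ny, nx = block.shape[:2]
    for i in range(ny):
        for j in range(nx):
            out[i, j] = block[max(i - 1, 0):i + 2,
                              max(j - 1, 0):j + 2].mean(axis=(0, 1))
    return out

# depth=1 along the navigation axes means every chunk is materialized
# together with a one-pattern halo copied from its neighbour chunks,
# which is where the extra per-worker memory use comes from.
averaged = da.map_overlap(
    average_neighbour_patterns, data,
    depth={0: 1, 1: 1}, boundary="reflect",
)
result = averaged.compute()  # same shape as the input
```

With a one-pattern halo the overhead is modest, but it scales with chunk size and the number of chunks held in memory at once, i.e. with the number of workers.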

> I would highly recommend adding a zarr format to kikuchipy

I use HyperSpy's zarr myself by "casting" EBSD -> Signal2D and using HyperSpy's save() method. I then reload the EBSD signal with hs.load("data.zspy", signal_type="EBSD", lazy=True). We should add this workflow to the user guide. This workflow basically gives kikuchipy a zarr format, and should be sufficient for now. The next step would be to interface better with HyperSpy's load() function by allowing writing to the ZSPY format from EBSD.save().

An aside: I'm reluctant to add our own zarr format because I'm not that big of a fan of our h5ebsd (HDF5) format. Maintaining an always-consistent file format specification is a real hassle when our needs change (e.g. when new "states" are added to a signal). For example, our h5ebsd format was written so that any dataset saved could be read by EMsoft using its EMEBSD format, which has a flattened navigation dimension. We initially added some metadata as well, but this metadata has since changed a lot to accommodate added functionality (mainly handling of detector-sample geometry via EBSDDetector in the EBSD.detector attribute and orientation data via CrystalMap in the EBSD.xmap attribute). Because of these changes, and in anticipation of more changes to the format in the future, I decided to remove our specification of the format from the docs...

> It would be convenient to not call the compute function in lazy computations.

Sorry, could you explain what you mean here? Do you mean we shouldn't call compute()?

> The idea here is that you could create a lazy Orix CrystalMap object that you could slice and compute only part of.

I agree that this would be convenient. However, users already have this option by selecting a part of their dataset for indexing with HyperSpy's slicing, right?

Dictionary indexing with kikuchipy is arguably not very fast, so having a definite start to the computation, e.g. when calling EBSD.dictionary_indexing(), is a help for the user, I think, since they are forced to make sure all input parameters are appropriate before running.

When it comes to refinement (pattern matching) we can pass a navigation mask to control which patterns are indexed. I see this as a powerful slicing operation, and can effectively be used instead of HyperSpy's slicing (although I haven't tested this in a workflow).
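As a toy illustration of the masking idea (the shapes and the True-means-skip convention are assumptions here; check the relevant method's docstring):

```python
import numpy as np

# Hypothetical (5, 5) navigation grid of (8, 8) patterns.
patterns = np.random.random((5, 5, 8, 8))

# Navigation mask: True marks points to SKIP (an assumption for this
# sketch; conventions differ between packages and methods).
nav_mask = np.ones((5, 5), dtype=bool)
nav_mask[1:4, 1:4] = False  # only refine the central 3x3 region

to_refine = patterns[~nav_mask]  # flattened (9, 8, 8) stack of patterns
```

Boolean masking like this can pick out any region, not just rectangular slices, which is what makes it more flexible than axis-based slicing.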

> [creating a lazy orix CrystalMap] might be difficult but could potentially cause pretty large improvements.

As an orix user, arrays in my crystal maps fit comfortably in memory. What I'd like to see though is more operations using Dask on-the-fly. Many crystallographic computations involve finding the lowest disorientation angle after applying many symmetrically equivalent operations to each point (rotation or vector), which can be memory intensive. I think this is more important than allowing a lazy crystal map.
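A sketch of what that on-the-fly Dask usage could look like for the smallest-angle computation, with made-up quaternions and without using orix's actual API. For unit quaternions, the scalar part of q1 * q2⁻¹ equals their dot product, so the smallest equivalent angle is 2·arccos of the largest absolute dot product:

```python
import dask.array as da
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical unit quaternions: 10 000 map points and 24 symmetry
# operations (e.g. the proper rotations of the cubic point group).
q = rng.normal(size=(10_000, 4))
q /= np.linalg.norm(q, axis=1, keepdims=True)
sym = rng.normal(size=(24, 4))
sym /= np.linalg.norm(sym, axis=1, keepdims=True)

def min_angle_block(q_block):
    # |q . s| is cos(angle / 2) between q and symmetry operation s; the
    # (n, 24) intermediate only ever exists for one chunk at a time.
    cos_half = np.abs(q_block @ sym.T).clip(0.0, 1.0)
    return 2.0 * np.arccos(cos_half.max(axis=1))

q_lazy = da.from_array(q, chunks=(1_000, 4))
angles = da.map_blocks(min_angle_block, q_lazy, dtype=float, drop_axis=1)
result = angles.compute()  # shape (10_000,), one angle per map point
```

Chunking over map points caps the size of the intermediate at (chunk size, n_symmetry) instead of (n_points, n_symmetry), which is the memory saving at stake.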

from kikuchipy.

CSSFrancis commented on August 15, 2024

> I find a warning raised every time a bit too drastic. But adding this information to the docstring Notes of relevant methods would be highly welcome! map_overlap() is used in EBSD methods average_neighbour_patterns() and get_average_neighbour_dot_product_map(). These methods are not optimized (i.e. we haven't updated them after we initially wrote them), but still, I haven't experienced any major issues in terms of RAM. They run fine on my laptop for larger datasets.

Something in the notes is probably good. I imagine you don't have major issues with RAM because the datasets are still pretty small, and running on your laptop you are likely not using very many cores. The problems grow when running larger datasets with more cores. If most people's workflows are similar to yours, then a warning every time is most likely not worth it.

> I use HyperSpy's zarr myself by "casting" EBSD -> Signal2D and using HyperSpy's save() method. I then reload the EBSD signal with hs.load("data.zspy", signal_type="EBSD", lazy=True). We should add this workflow to the user guide. This workflow basically gives kikuchipy a zarr format, and should be sufficient for now. The next step would be to interface better with HyperSpy's load() function by allowing writing to the ZSPY format from EBSD.save().

Yeah, that should be easy to do!

> An aside: I'm reluctant to add our own zarr format because I'm not that big of a fan of our h5ebsd (HDF5) format. Maintaining an always-consistent file format specification is a real hassle when our needs change (e.g. when new "states" are added to a signal). For example, our h5ebsd format was written so that any dataset saved could be read by EMsoft using its EMEBSD format, which has a flattened navigation dimension. We initially added some metadata as well, but this metadata has since changed a lot to accommodate added functionality (mainly handling of detector-sample geometry via EBSDDetector in the EBSD.detector attribute and orientation data via CrystalMap in the EBSD.xmap attribute). Because of these changes, and in anticipation of more changes to the format in the future, I decided to remove our specification of the format from the docs...

Makes sense.

> It would be convenient to not call the compute function in lazy computations.

> Sorry, could you explain what you mean here? Do you mean we shouldn't call compute()?

It depends on the situation, but calling compute() inside a function takes away some potential workflows: it removes much of a user's ability to interact with the dataset in its lazy state. For example, they cannot rechunk, or call persist() if they don't want the data transferred back into RAM. You also have to wait for the whole result to compute before saving it, rather than writing it chunk by chunk in an embarrassingly parallel way. HyperSpy's map() function always uses Dask to run operations, and with distributed computing in particular you want to limit the number of transfers from the cluster back to local RAM. Keeping computation lazy also cleans up the code, makes it more flexible, and makes it easier to respond to API changes in Dask.

Basically you just want to make sure that the compute function is the last thing you do with the data. Maybe I have too strict of ideas about this...

> I agree that this would be convenient. However, users already have this option by selecting a part of their dataset for indexing with HyperSpy's slicing, right?

This is true, but it goes with the point above: it just cleans up the code and makes it more consistent.

> Dictionary indexing with kikuchipy is arguably not very fast, so having a definite start to the computation, e.g. when calling EBSD.dictionary_indexing(), is a help for the user, I think, since they are forced to make sure all input parameters are appropriate before running.

This is kind of why lazy operations are nice: you can get a lazy result, check that it looks good on a small region, then test another small region before doing the larger computation. You can also call persist() and keep working without waiting for some code to finish running. It makes the whole workflow more seamless and, I think, speeds up iteration.
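For instance (the shapes and the operation are placeholders), slicing a lazy result only executes the tasks needed for that region:

```python
import dask.array as da

data = da.random.random((100, 100, 16, 16), chunks=(10, 10, 16, 16))
processed = da.sqrt(data)  # lazy graph; nothing computed yet

# Materialize a small test region first; only the tasks covering
# these 25 patterns actually run.
preview = processed[:5, :5].compute()

# Once it looks good, trigger the full computation with
# processed.compute(), or keep working while it runs in the
# background on a cluster: processed = processed.persist()
```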

> When it comes to refinement (pattern matching) we can pass a navigation mask to control which patterns are indexed. I see this as a powerful slicing operation, and can effectively be used instead of HyperSpy's slicing (although I haven't tested this in a workflow).

> [creating a lazy orix CrystalMap] might be difficult but could potentially cause pretty large improvements.

> As an orix user, arrays in my crystal maps fit comfortably in memory. What I'd like to see though is more operations using Dask on-the-fly. Many crystallographic computations involve finding the lowest disorientation angle after applying many symmetrically equivalent operations to each point (rotation or vector), which can be memory intensive. I think this is more important than allowing a lazy crystal map.

I'll have to look more into this...

