Comments (2)
Thank you for taking the time to raise these concerns, @CSSFrancis. I gave a brief summary of my own use of lazy computations with kikuchipy here (based on your good questions): hyperspy/hyperspy#3107 (comment). All your points are reasonable, but I have different views on how they should be addressed.
> I would suggest having a warning that pops up with every function that uses overlap to the effect of
I find a warning raised every time a bit too drastic. But adding this information to the docstring Notes of relevant methods would be highly welcome!
`map_overlap()` is used in the `EBSD` methods `average_neighbour_patterns()` and `get_average_neighbour_dot_product_map()`. These methods are not optimized (i.e. we haven't updated them since we initially wrote them), but still, I haven't experienced any major issues in terms of RAM. They run fine on my laptop for larger datasets.
> I would highly recommend adding a zarr format to kikuchipy
I use HyperSpy's zarr myself by "casting" `EBSD` -> `Signal2D` and using HyperSpy's `save()` method. I then reload the `EBSD` signal with `hs.load("data.zspy", signal_type="EBSD", lazy=True)`. We should add this workflow to the user guide. This workflow basically gives kikuchipy a zarr format, and should be sufficient for now. The next step would be to interface better with HyperSpy's `load()` function by allowing writing to the ZSPY format from `EBSD.save()`.
An aside: I'm reluctant to add our own zarr format because I'm not that big a fan of our h5ebsd (HDF5) format. Maintaining an always consistent file format specification is a real hassle when our needs change (e.g. when new "states" are added to a signal). E.g., our h5ebsd format was written so that any dataset saved could be read by EMsoft using its EMEBSD format, which has a flattened navigation dimension. We initially added some metadata as well, but this metadata has since changed a lot to accommodate added functionality (mainly handling of detector-sample geometry via `EBSDDetector` in the `EBSD.detector` attribute and orientation data via `CrystalMap` in the `EBSD.xmap` attribute). Because of these changes and the anticipation of more changes to the format in the future, I decided to remove our specification of the format from the docs...
> It would be convenient to not call the compute function in lazy computations.
Sorry, could you explain what you mean here? Do you mean we shouldn't call `compute()`?
> The idea here is that you could create a lazy orix CrystalMap object that you could slice and compute only part of.
I agree that this would be convenient. However, users already have this option by selecting a part of their dataset for indexing with HyperSpy's slicing, right?
Dictionary indexing with kikuchipy is arguably not very fast, so having a definite start to the computation, e.g. when calling `EBSD.dictionary_indexing()`, is a help for the user, I think, since they are forced to make sure all input parameters are appropriate before running.
When it comes to refinement (pattern matching) we can pass a navigation mask to control which patterns are indexed. I see this as a powerful slicing operation, and can effectively be used instead of HyperSpy's slicing (although I haven't tested this in a workflow).
> [creating a lazy orix CrystalMap] might be difficult but could potentially cause pretty large improvements.
As an orix user, arrays in my crystal maps fit comfortably in memory. What I'd like to see though is more operations using Dask on-the-fly. Many crystallographic computations involve finding the lowest disorientation angle after applying many symmetrically equivalent operations to each point (rotation or vector), which can be memory intensive. I think this is more important than allowing a lazy crystal map.
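As a rough sketch of what "using Dask on-the-fly" could look like here (this is not orix code; the function name, the choice of a four-fold rotation axis as the symmetry group, and the chunk size are all assumptions made for illustration): the angle of every symmetrically equivalent rotation is computed per chunk, so the large (points x symmetry operations) intermediate array only ever exists for one chunk at a time.

```python
import dask.array as da
import numpy as np


def min_angle_block(q_block, sym):
    """Smallest rotation angle (radians) of s * q over all symmetry operations s.

    Quaternions are unit quaternions stored as (w, x, y, z). The scalar part of
    the product s * q is s_w * q_w - s_x * q_x - s_y * q_y - s_z * q_z.
    """
    w = q_block @ (sym * np.array([1, -1, -1, -1])).T  # shape (n_points, n_sym)
    w_max = np.clip(np.abs(w).max(axis=1), 0, 1)
    return 2 * np.arccos(w_max)


# Hypothetical symmetry group: rotations of 0, 90, 180 and 270 degrees about z
sym_angles = np.radians([0, 90, 180, 270])
zeros = np.zeros_like(sym_angles)
sym = np.stack([np.cos(sym_angles / 2), zeros, zeros, np.sin(sym_angles / 2)], axis=1)

# Three test rotations about z by 0, 100 and 200 degrees
theta = np.radians([0, 100, 200])
q = np.stack([np.cos(theta / 2), 0 * theta, 0 * theta, np.sin(theta / 2)], axis=1)

# Chunk along the points axis only; each chunk is reduced independently
q_lazy = da.from_array(q, chunks=(2, 4))
angles = q_lazy.map_blocks(min_angle_block, sym=sym, drop_axis=1, dtype=float)
angles_deg = np.degrees(angles.compute())
```

With real data the chunk size along the first axis would be chosen so that one (chunk x symmetry operations) array fits comfortably in memory.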
from kikuchipy.
> I find a warning raised every time a bit too drastic. But adding this information to the docstring Notes of relevant methods would be highly welcome!
> `map_overlap()` is used in the `EBSD` methods `average_neighbour_patterns()` and `get_average_neighbour_dot_product_map()`. These methods are not optimized (i.e. we haven't updated them since we initially wrote them), but still, I haven't experienced any major issues in terms of RAM. They run fine on my laptop for larger datasets.
Something in the notes is probably good. I imagine that you don't have major issues in terms of RAM because the datasets are still pretty small and, running on your laptop, you are likely not using very many cores. The problems become larger when running larger datasets with more cores. If most people's workflows are similar to yours, then it most likely isn't worth it to have a warning every time.
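One way to see where some of the extra memory can go, as a minimal sketch with plain Dask (the array shape, chunking and depth below are made-up stand-ins, not values used by kikuchipy): overlapping pads every chunk with data from its neighbours, so each chunk that a worker holds gets larger.

```python
import dask.array as da
import numpy as np

# Hypothetical stand-in for a 4D EBSD dataset: (nav_y, nav_x, sig_y, sig_x)
patterns = da.zeros((16, 16, 6, 6), chunks=(8, 8, 6, 6), dtype="uint8")

# Pad each chunk with one neighbouring pattern on every side in the
# navigation axes, reflecting at the outer edges of the map
padded = da.overlap.overlap(
    patterns, depth={0: 1, 1: 1}, boundary={0: "reflect", 1: "reflect"}
)

chunk_before = patterns.chunksize
chunk_after = padded.chunksize

# Relative growth in the number of elements per chunk
growth = np.prod(chunk_after) / np.prod(chunk_before)
```

Here every (8, 8) navigation chunk becomes (10, 10), roughly 56% more data per chunk. How much this matters in practice depends on chunk size, overlap depth and how many chunks are processed at the same time.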
> I use HyperSpy's zarr myself by "casting" `EBSD` -> `Signal2D` and using HyperSpy's `save()` method. I then reload the `EBSD` signal with `hs.load("data.zspy", signal_type="EBSD", lazy=True)`. We should add this workflow to the user guide. This workflow basically gives kikuchipy a zarr format, and should be sufficient for now. The next step would be to interface better with HyperSpy's `load()` function by allowing writing to the ZSPY format from `EBSD.save()`.
Yeah, that should be easy to do!
> An aside: I'm reluctant to add our own zarr format because I'm not that big a fan of our h5ebsd (HDF5) format. Maintaining an always consistent file format specification is a real hassle when our needs change (e.g. when new "states" are added to a signal). E.g., our h5ebsd format was written so that any dataset saved could be read by EMsoft using its EMEBSD format, which has a flattened navigation dimension. We initially added some metadata as well, but this metadata has since changed a lot to accommodate added functionality (mainly handling of detector-sample geometry via `EBSDDetector` in the `EBSD.detector` attribute and orientation data via `CrystalMap` in the `EBSD.xmap` attribute). Because of these changes and the anticipation of more changes to the format in the future, I decided to remove our specification of the format from the docs...
Makes sense.
> > It would be convenient to not call the compute function in lazy computations.
> Sorry, could you explain what you mean here? Do you mean we shouldn't call `compute()`?
It depends on the case, but calling `compute()` inside a function takes away much of a user's ability to interact with a dataset in its lazy state, and with it some potential workflows. For example, they cannot rechunk or call `persist` if they don't want the data to be transferred back to RAM. You also have to wait for the data to compute before saving it, rather than saving things chunk by chunk in an embarrassingly parallel way. The `map` function in HyperSpy always uses Dask to run operations, so when using distributed computing in particular you want to limit the number of transfers from the distributed workers to local RAM. Leaving `compute()` out also cleans up the code and makes it more flexible and able to respond to API changes in Dask.
Basically you just want to make sure that the compute function is the last thing you do with the data. Maybe I have too strict of ideas about this...
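A minimal sketch of the pattern, using plain Dask arrays rather than kikuchipy or HyperSpy signals (the function name `normalize_patterns` and the array sizes are hypothetical): the function only builds the task graph and returns a lazy array, and the caller decides when, and on how much of the data, to compute.

```python
import dask.array as da


def normalize_patterns(patterns):
    """Return a lazy, normalized copy of `patterns` without computing it."""
    mean = patterns.mean(axis=(-2, -1), keepdims=True)
    std = patterns.std(axis=(-2, -1), keepdims=True)
    return (patterns - mean) / std


# Random stand-in for a 4D dataset: (nav_y, nav_x, sig_y, sig_x)
patterns = da.random.random((20, 20, 30, 30), chunks=(5, 5, 30, 30))

result = normalize_patterns(patterns)  # still lazy, nothing is computed yet

# Check a small region first; only the chunks it touches are computed
preview = result[:2, :2].compute()

# The caller can still rechunk (or persist, or save) before computing everything
rechunked = result.rechunk((10, 10, 30, 30))
```

Had `normalize_patterns()` called `compute()` itself, the preview and the rechunking would only be possible after the whole array had been computed and pulled into memory.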
> I agree that this would be convenient. However, users already have this option by selecting a part of their dataset for indexing with HyperSpy's slicing, right?
This is true but goes with the point above. It just cleans up the code and makes it more consistent.
> Dictionary indexing with kikuchipy is arguably not very fast, so having a definite start to the computation, e.g. when calling `EBSD.dictionary_indexing()`, is a help for the user, I think, since they are forced to make sure all input parameters are appropriate before running.
This is kind of why lazy operations are nice. You can get a lazy result, test a small region to see if it looks good, and then test another small region before doing the larger computation. You can also call the `persist` function and keep working without waiting for some code to finish running. It makes the whole workflow more seamless and, I think, helps speed up iteration.
> When it comes to refinement (pattern matching) we can pass a navigation mask to control which patterns are indexed. I see this as a powerful slicing operation, and can effectively be used instead of HyperSpy's slicing (although I haven't tested this in a workflow).
> > [creating a lazy orix CrystalMap] might be difficult but could potentially cause pretty large improvements.
> As an orix user, arrays in my crystal maps fit comfortably in memory. What I'd like to see though is more operations using Dask on-the-fly. Many crystallographic computations involve finding the lowest disorientation angle after applying many symmetrically equivalent operations to each point (rotation or vector), which can be memory intensive. I think this is more important than allowing a lazy crystal map.
I'll have to look more into this...
from kikuchipy.