Comments (15)
I'm gradually working on this issue. If anyone else is working on it or wants to, let me know and we can figure out how to combine efforts.
from matminer.
(maybe something for @Qi-max to contribute?)
Hi Logan, I am not working on it at the moment. Please let me know if there is anything you would like me to work on.
Thanks, @Qi-max! I'm about halfway done with the features from the PRB article, which will cover most of the Sci Rep features. So, it is probably easiest for me to finish those two papers myself.
Would you want to start on the Nat. Comm features?
Also, the areas that I haven't figured out yet:
- Getting 2nd-nearest neighbors from the NearNeighbor pymatgen classes, which are needed for some of the PRB attributes. I was thinking of building an additional method onto the NearNeighbor classes for this, but am not sure if that is already done or if there is an easier way.
- Whether we can/should combine these features with your existing VoronoiFingerprint class. I am in favor of keeping them as separate featurizers to avoid having classes with too many options.
- How to cache the Voronoi tessellation, if it is used by multiple featurizers. When implementing the PRB features in Magpie, I found that running the tessellation comprised a significant amount of the featurization time. While I'm not doing anything about caching now, I'm worried that making many different featurizers (see 2) will exacerbate any performance issues.
Any thoughts about these issues?
Hi Logan,
Thanks for the reply!
For 1, I think adding an additional method to the NearNeighbor classes to get 2nd-NN is a good way to go. I am only a bit concerned about the performance issue, as it may include some repeated calculation of 1st-NN in real applications. E.g., if we want to get the 2nd-NN of each atom in the structure and we have already calculated the 1st-NN of all atoms (which is the more common case, I think), we can just combine the 1st-NN of each atom's 1st-NN atoms to get its 2nd-NN atoms, without further calculation. This requires solving the caching issues. But still, I think adding a standard way to get 2nd-NN is quite useful.
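The union-of-neighbors idea can be sketched as follows. This is a pure-Python illustration over a precomputed adjacency map (it ignores periodic images, which a real pymatgen implementation would have to track); `second_nearest_neighbors` and the data layout are hypothetical, not matminer/pymatgen API:

```python
from typing import Dict, Set

def second_nearest_neighbors(first_nn: Dict[int, Set[int]]) -> Dict[int, Set[int]]:
    """Combine cached 1st-NN sets to get 2nd-NN without recomputing.

    first_nn maps each site index to the set of its first-nearest
    neighbors (e.g., from a single Voronoi pass over all sites).
    """
    second_nn = {}
    for site, neighbors in first_nn.items():
        candidates = set()
        for n in neighbors:
            candidates |= first_nn[n]  # neighbors of the neighbors
        # Exclude the site itself and its own first shell
        second_nn[site] = candidates - neighbors - {site}
    return second_nn

# Toy linear chain 0-1-2-3: site 0's only 2nd-NN is site 2
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(second_nearest_neighbors(chain))  # {0: {2}, 1: {3}, 2: {0}, 3: {1}}
```

The key point is that no neighbor-finding algorithm runs a second time; only set unions over already-cached results.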
Besides, I have considered adding a new site featurizer that allows "coarse-graining" (e.g., taking the 'mean', 'std_dev', 'minimum' or 'maximum' of) any site fingerprint over the neighbors, to serve as site features. This featurizer could also take the 2nd-NN neighboring effect into account. The user would give the featurizer a type of site fingerprint and a neighbor-finding algorithm (including the Voronoi method), and the featurizer would return a coarse-grained version of that site fingerprint. This works similarly to SiteStatsFingerprint, except that it produces site features rather than structure features. But I am also quite concerned about the performance issue, especially if the neighbor finding is computationally expensive, as you have mentioned. I have also tested some matminer featurizers on disordered systems containing tens/hundreds of thousands of atoms and found that they indeed take quite a long time to featurize.
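The coarse-graining idea might look something like this. A minimal sketch with scalar fingerprints and stdlib statistics only; `coarse_grain` and its argument layout are illustrative names, not the proposed matminer interface:

```python
from statistics import mean, stdev

def coarse_grain(site_fps, neighbor_lists, stats=("mean", "std_dev")):
    """Aggregate a per-site fingerprint over each site's neighbors.

    Analogous to SiteStatsFingerprint, but the result is one feature
    vector per *site* rather than per structure.

    site_fps: one scalar fingerprint value per site
    neighbor_lists: one list of neighbor indices per site
    """
    funcs = {"mean": mean, "std_dev": stdev, "minimum": min, "maximum": max}
    out = []
    for neighbors in neighbor_lists:
        vals = [site_fps[i] for i in neighbors]
        out.append([funcs[s](vals) for s in stats])
    return out

fps = [1.0, 2.0, 3.0, 4.0]
nbrs = [[1, 2], [0, 2], [1, 3], [2, 1]]
print(coarse_grain(fps, nbrs, stats=("mean",)))  # [[2.5], [2.0], [3.0], [2.5]]
```

A real version would accept a site-fingerprint object and a NearNeighbors instance and compute `site_fps`/`neighbor_lists` itself, which is exactly where the shared caching discussed below matters.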
For 2 and 3 (and 1), I totally agree that it would be great if we could also share results among featurizers, in addition to within a single featurizer. This would solve many issues and save a lot of computation time. Some steps, e.g., getting atoms within a cutoff, getting the neighbor list, etc., are indeed frequently called by a number of featurizers. I think perhaps methods like joblib.Memory (which caches a function's return values each time it is called with the same input arguments; it is also used in sklearn) might be useful? Or we could even create some tmp db files in a standard way (probably less preferable). How do you feel about this?
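As a rough illustration of the memoization idea, here is a stdlib analogue using `functools.lru_cache`; joblib.Memory follows the same call-pattern but can persist results to disk. The function name and its return value are hypothetical stand-ins for an expensive step like a Voronoi tessellation:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def expensive_neighbor_list(structure_key: str) -> tuple:
    """Stand-in for a costly step such as a Voronoi tessellation.

    structure_key is a hashable identifier for the structure; real
    Structure objects would need __hash__/__eq__ to be used directly,
    which is part of the discussion below.
    """
    return ("nn-data-for", structure_key)  # placeholder result

expensive_neighbor_list("mp-1234")  # computed on first call
expensive_neighbor_list("mp-1234")  # served from the cache
print(expensive_neighbor_list.cache_info().hits)  # 1
```

The same decorator-style wrapping is what joblib.Memory's `memory.cache(func)` provides, with the added option of an on-disk store shared across processes.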
2nd NN determination: I think you make a good point about avoiding repeatedly calculating the first nearest neighbors. I'll start working on that code soon, and I'll tag you on the PR in case you want to critique it.
Neighboring Site Featurizer: Something like SiteStatsFeaturizer could work for that kind of feature. I'll keep that use case in mind when writing the 2nd NN code, so that you could use it effectively.
Caching: I like your idea. After thinking some more, here are the options I'm considering.
- Introducing a utility function to matminer that takes the NearNeighbors class, structure, and site number as input and uses `lru_cache`/`joblib.Memory` to serve repeated calls. We would call that rather than the NearNeighbor operations directly.
  - Multiple featurizers would call the same utility function. And, if someone uses `MultiFeaturizer`, the access pattern will result in plenty of cache hits.
  - Problem: Not sure how burdensome calling the utility function rather than pymatgen directly would be.
  - Problem: `lru_cache` will not natively recognize two different instantiations of the same NearNeighbor class as the same. To do so, we'd need to implement `__eq__` and `__hash__` for each NearNeighbor class.
- Storing the NN information as an attribute of the Structure object; our featurizers would check for the presence of that data before re-computing it. (I'd also write a utility function for each of these.)
  - This is how I cache tessellations in Magpie.
  - Problem: Could run into memory problems for larger datasets (I did with Magpie). We would need to write code that clears out the NN attributes after they are no longer useful.

I lean toward option 1. I don't think using a utility operation is a big deal and, while I recognize that implementing `__eq__` and `__hash__` introduces a vector for difficult-to-debug errors, I do not see it as too burdensome to police. Any thoughts/other ideas?
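A minimal sketch of option 1, showing why `__eq__`/`__hash__` matter for the cache. `MockNearNeighbors`, `get_nn_info`, and `structure_key` are illustrative names, not pymatgen/matminer API:

```python
from functools import lru_cache

class MockNearNeighbors:
    """Hypothetical stand-in for a pymatgen NearNeighbors subclass."""

    def __init__(self, tol: float = 0.5):
        self.tol = tol

    # Without these, two identically-configured instances would hash
    # differently and lru_cache would always miss.
    def __eq__(self, other):
        return type(other) is type(self) and other.tol == self.tol

    def __hash__(self):
        return hash((type(self).__name__, self.tol))

@lru_cache(maxsize=16)
def get_nn_info(nn_method, structure_key, site_index):
    """Utility wrapper featurizers would call instead of nn_method directly.

    structure_key stands in for a hashable Structure; every argument
    must be hashable for lru_cache to build its key.
    """
    return (structure_key, site_index, nn_method.tol)  # placeholder NN data

a, b = MockNearNeighbors(0.5), MockNearNeighbors(0.5)
get_nn_info(a, "mp-1234", 0)
get_nn_info(b, "mp-1234", 0)          # same cache key thanks to __eq__/__hash__
print(get_nn_info.cache_info().hits)  # 1
```

The `__hash__` here keys on the class name plus its parameters, so equal configurations collapse to one cache entry regardless of which instance a featurizer happens to hold.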
Hi Logan,
I like your implementation of Nth nearest neighbors and the additional Voronoi weighting methods! For caching, I also prefer option 1, and I think your points about a new utility function and implementing `__eq__` and `__hash__` in the related classes are quite reasonable. I think it is worth doing, despite introducing some extra complexity, but we should at least leave an option (maybe defaulting to False) for users to cache some computationally expensive calls, which can be beneficial when they are dealing with quite large datasets.
Thanks, @Qi-max !
I did some performance testing of this implementation, and found it takes about 30s to compute the features of 10 structures. Turns out that about half of the featurization time is spent computing the tessellation, so I implemented the LRU caching mechanism (Option 1). I automatically cache the NN for all atoms in the structure, so we might want to institute an option to avoid caching (as you suggested). If curious: https://github.com/WardLT/matminer/tree/faster_prb
After adding in the caching, we are down to about 22s for the same 10 structures. The average over 100 structures is about 1.8s/structure. We likely have some room to improve performance further, as I can do about 0.1s/structure for these features with Magpie.
My current plan is to try to optimize the Voronoi/near-neighbor-finding routines in pymatgen to improve performance further. If I can get performance faster than 1s per structure, I'd be pretty happy.
Well, I got it below 1s per structure (latest is 0.8s). I'll open a PR if/when my changes to pymatgen get approved. We can now featurize the FLLA dataset in ~50 minutes.
All, sorry I haven't had a chance to review this thread in detail (and likely might not be able to for a bit, so feel free to continue w/o me).
My suggestion, based just on skimming, is to look into some of the "memoize" decorators in Python to see if you can avoid recomputing. Not sure how this compares with the current solution.
Also an LRI cache might(?) be faster than an LRU one and be similar for our use case. See:
http://boltons.readthedocs.io/en/latest/cacheutils.html
The LRI caches do seem like a good option. Especially for the site featurizers, which will call `get_neighbors` frequently, we might benefit from a faster caching mechanism.
For our current use case - structure featurizers - the LRU cache is sufficient. In fact, the access pattern for structure-based featurizers is such that a cache size of 1 will achieve the minimum number of cache misses. I'll ask around for a site featurizer use case to benchmark different caching strategies.
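For reference, the LRU/LRI distinction is that an LRI cache evicts the least-recently-*inserted* entry and skips the reordering work an LRU cache does on every hit. A minimal sketch of the idea (a toy class for illustration, not the boltons implementation):

```python
from collections import OrderedDict

class MiniLRI:
    """Least-recently-inserted cache: evicts the oldest inserted entry.

    Unlike an LRU cache, lookups do not reorder entries, so hits are
    cheaper -- the appeal for hot paths like repeated neighbor lookups.
    """

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key, default=None):
        return self._data.get(key, default)  # no move-to-end on a hit

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self.max_size:
            self._data.popitem(last=False)  # drop the oldest insertion
        self._data[key] = value

cache = MiniLRI(max_size=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # a hit does NOT protect "a" from eviction
cache.put("c", 3)  # evicts "a", the oldest insertion
print(list(cache._data))  # ['b', 'c']
```

With an LRU policy, the `get("a")` above would have promoted "a" and evicted "b" instead; for the structure-featurizer access pattern described here the two behave the same, which is why the cheaper LRI is attractive.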
@WardLT @Qi-max should this issue be considered closed?
[Sorry for the slow response]
I don't think this issue is quite closed. There is still a little work to go for the de Jong features (e.g., adding the variance of neighbor properties, building presets and examples), and a bit for the Isayev features (we might need to extend the bond fragment work).
superseded by #635