Comments (15)

WardLT avatar WardLT commented on June 18, 2024 1

I'm gradually working on this issue. If anyone else is working on it or wants to, let me know and we can figure out how to combine efforts.

from matminer.

computron avatar computron commented on June 18, 2024

(maybe something for @Qi-max to contribute?)

Qi-max avatar Qi-max commented on June 18, 2024

Hi Logan, I am not working on it now. Please tell me if there is anything you want me to work on.

WardLT avatar WardLT commented on June 18, 2024

Thanks, @Qi-max! I'm about halfway done with the features from the PRB article, which will cover most of the Sci Rep features. So, it is probably easiest for me to finish those two papers myself.

Would you want to start on the Nat. Comm features?

Also, the areas that I haven't figured out yet:

  1. Getting 2nd-nearest neighbors from the NearNeighbor pymatgen classes, which are needed for some of the PRB attributes. I was thinking of building an additional method on to the NearNeighbor classes for this, but am not sure if that is already done or if there is an easier way.
  2. Whether we can/should combine these features with your existing VoronoiFingerprint class. I am in favor of keeping them as separate featurizers to avoid having classes with too many options.
  3. How to cache the Voronoi tessellation, if it is used by multiple featurizers. When implementing the PRB features in Magpie, I found that running the tessellation comprised a significant amount of the featurization time. While I'm not doing anything about caching now, I'm worried that making many different featurizers (see 2) will exacerbate any performance issues.

Any thoughts about these issues?

Qi-max avatar Qi-max commented on June 18, 2024

Hi Logan,

Thanks for the reply!

For 1, I think adding an additional method to the NearNeighbor classes to get the 2nd-NN is a good way to go. I am only a bit concerned about the performance issue, as it may include some repeated calculation of the 1st-NN in real applications. E.g., if we want to get the 2nd-NN of each atom in the structure and we have already calculated the 1st-NN of all atoms (which I think is the more common case), we can just combine the 1st-NN of each atom's 1st-NN atoms to get its 2nd-NN atoms, without further calculation. This requires solving the caching issues. But still, I think adding a standard way to get the 2nd-NN is quite useful.
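The combine-the-first-shells idea can be sketched in a few lines. Here `first_nn` is a hypothetical precomputed mapping from atom index to its set of 1st-NN indices; a real implementation would obtain these from pymatgen's NearNeighbor classes, and would have to handle periodic images as well.

```python
def second_nn_from_first(first_nn, i):
    """Derive the 2nd-NN set of atom i from cached 1st-NN sets:
    the union of the neighbors' 1st shells, minus atom i itself
    and its own 1st shell."""
    shell1 = first_nn[i]
    shell2 = set()
    for j in shell1:
        shell2 |= first_nn[j]
    return shell2 - shell1 - {i}

# Toy 1-D chain 0-1-2-3 (made-up example data, not a real structure):
first_nn = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
```

With the chain above, the 2nd shell of atom 0 is just `{2}`, found without any further neighbor search.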

Besides, I have considered adding a new site featurizer that "coarse-grains" (e.g., takes the 'mean', 'std_dev', 'minimum', or 'maximum' of) any site fingerprint over a site's neighbors to serve as site features. This featurizer could also take the 2nd-NN neighboring effect into account. The user would give the featurizer a type of site fingerprint and a neighbor-finding algorithm (including the Voronoi method), and the featurizer would return a coarse-grained version of that site fingerprint. This works similarly to SiteStatsFingerprint, except that it serves as site features rather than structure features. But I am also quite concerned about the performance issue, especially if the neighbor finding is computationally expensive, as you have mentioned. I have also tested some matminer featurizers on disordered systems containing tens or hundreds of thousands of atoms and found that they indeed take quite a long time to featurize.
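A minimal sketch of that coarse-graining idea, with hypothetical inputs: `fingerprints` maps a site index to a scalar fingerprint value and `neighbors_of` maps a site index to its neighbor indices. A real featurizer would obtain both from matminer/pymatgen objects and would typically handle vector-valued fingerprints too.

```python
import statistics

def coarse_grain(fingerprints, neighbors_of, i):
    """Summarize a site fingerprint over the neighbors of site i,
    returning the four statistics mentioned above."""
    values = [fingerprints[j] for j in neighbors_of[i]]
    return {
        "mean": statistics.mean(values),
        "std_dev": statistics.pstdev(values),
        "minimum": min(values),
        "maximum": max(values),
    }
```

For example, a site whose two neighbors carry fingerprint values 2.0 and 4.0 would get a coarse-grained mean of 3.0 and std_dev of 1.0.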

For 2 and 3 (and 1), I totally agree that it would be great if we could also share results among featurizers, in addition to within a single featurizer. This would solve many issues and save a lot of computation time. Some steps, e.g., getting the atoms within a cutoff, getting the neighbor list, etc., are indeed frequently called by a number of featurizers. I think perhaps something like joblib.Memory (which caches a function's return values each time it is called with the same input arguments, and is also used in sklearn) might be useful? Or we could even create some tmp db files in a standard way (probably less preferable). How do you feel about this?

WardLT avatar WardLT commented on June 18, 2024

2nd NN determination: I think you make a good point about avoiding repeatedly calculating the first nearest neighbors. I'll start working on that code soon, and I'll tag you on the PR in case you want to critique it.

Neighboring Site Featurizer: Something like SiteStatsFeaturizer could work for that kind of feature. I'll keep that use case in mind when writing the 2nd NN code, so that you could use it effectively.

Caching: I like your idea. After thinking some more, here are the options I'm considering.

  1. Introduce a utility function in matminer that takes a NearNeighbors class, structure, and site number as input and uses lru_cache/joblib.Memory to serve repeated calls. We would call that rather than the NearNeighbor operations directly.
  • Multiple featurizers would call the same utility function. And, if someone uses MultiFeaturizer, the access pattern will result in plenty of cache hits.
  • Problem: Not sure how burdensome calling the utility function rather than pymatgen directly would be.
  • Problem: lru_cache will not natively recognize two different instantiations of the same NearNeighbor class as the same. To do so, we'd need to implement __eq__ and __hash__ for each NearNeighbor class.
  2. Store the NN information as an attribute of the Structure object, and have our featurizers check for the presence of that data before re-computing it. (I'd also write a utility function for each of these.)
  • This is how I cache tessellations in Magpie.
  • Problem: Could run into memory problems for larger datasets (I did with Magpie). We would need to write code that clears out the NN attributes once they are no longer useful.

I lean toward option 1. I don't think using a utility operation is a big deal and, while I recognize that implementing __eq__ and __hash__ introduces a vector for difficult-to-debug errors, I do not see it as too burdensome to police. Any thoughts/other ideas?

Qi-max avatar Qi-max commented on June 18, 2024

Hi Logan,

I like your implementation of Nth nearest neighbors and the additional Voronoi weighting methods! For caching, I also prefer option 1, and I think your points about a new utility function and implementing __eq__ and __hash__ in the related classes are quite reasonable. I think it is worth doing despite introducing some extra complexity, but we should at least leave an option (maybe defaulting to False) for users to cache some computationally expensive calls, which can be beneficial when they are dealing with quite large datasets.

WardLT avatar WardLT commented on June 18, 2024

Thanks, @Qi-max !

I did some performance testing of this implementation and found it takes about 30 s to compute the features of 10 structures. It turns out that about half of the featurization time is spent computing the tessellation, so I implemented the LRU caching mechanism (option 1). I automatically cache the NN for all atoms in the structure, so we might want to add an option to avoid caching (as you suggested). If curious: https://github.com/WardLT/matminer/tree/faster_prb

After adding in the caching, we are down to about 22s for the same 10 structures. The average over 100 structures is about 1.8s/structure. We likely have some room to improve performance further, as I can do about 0.1s/structure for these features with Magpie.

My current plan is to try to optimize the Voronoi/near-neighbor-finding routines in pymatgen to improve performance further. If I can get performance faster than 1 s per structure, I'd be pretty happy.

WardLT avatar WardLT commented on June 18, 2024

Well, I got it below 1s per structure (latest is 0.8s). I'll open a PR if/when my changes to pymatgen get approved. We can now featurize the FLLA dataset in ~50 minutes.

computron avatar computron commented on June 18, 2024

All, sorry I haven't had a chance to review this thread in detail (and likely might not be able to for a bit, so feel free to continue w/o me).

My suggestion, based just on skimming, is to look into some of the "memoize" decorators in Python to see if you can avoid recomputing. Not sure how this compares with the current solution.

computron avatar computron commented on June 18, 2024

Also an LRI cache might(?) be faster than an LRU one and be similar for our use case. See:

http://boltons.readthedocs.io/en/latest/cacheutils.html

WardLT avatar WardLT commented on June 18, 2024

The LRI caches do seem like a good option. Especially for the site featurizers which will call get_neighbors frequently, we might benefit from a faster caching mechanism.

For our current use case - structure featurizers - the LRU cache is sufficient. In fact, the access pattern for structure-based featurizers is such that a cache size of 1 will achieve the minimum number of cache misses. I'll ask around for a site featurizer use case to benchmark different caching strategies.
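The cache-size-of-1 claim can be seen with a toy example: a structure featurizer makes all of its calls for one structure before moving on to the next, so `maxsize=1` misses only once per distinct structure, exactly as an unbounded cache would for this access pattern (`tessellate` is illustrative, not matminer's actual code).

```python
from functools import lru_cache

n_tessellations = 0  # counts actual tessellation computations

@lru_cache(maxsize=1)
def tessellate(structure_id):
    global n_tessellations
    n_tessellations += 1
    return f"voronoi-for-{structure_id}"  # placeholder

# Structure-featurizer access pattern: repeated calls for one
# structure, then on to the next -- never interleaved.
for sid in ["A", "A", "A", "B", "B", "C"]:
    tessellate(sid)
```

Three distinct structures cost three tessellations; the repeated calls within each structure are all cache hits despite the size-1 cache.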

computron avatar computron commented on June 18, 2024

@WardLT @Qi-max should this issue be considered closed?

WardLT avatar WardLT commented on June 18, 2024

[Sorry for the slow response]

I don't think this issue is quite closed. There is still a little work to go for the de Jong features (e.g., adding the variance of neighbor properties, building presets and examples), and a bit for the Isayev features (we might need to extend the bond fragment work).

ardunn avatar ardunn commented on June 18, 2024

superseded by #635
