
matminer's People

Contributors

ada110, albalu, ardunn, atan14, blokhin, computron, dependabot-preview[bot], dependabot[bot], doppe1g4nger, dyllamt, jacksund, janosh, jfchen3, jhfrost314, kmu, kylebystrom, lauri-codes, matk86, ml-evs, montoyjh, nawagner, nisse3000, qi-max, rkingsbury, samysspace, saurabh02, shyuep, tschaume, utf, wardlt

matminer's Issues

Data retrieval and descriptor tools

Some of the examples do not run. In particular (as of Jan 8, and despite supplying an api_key), in the bulk modulus example https://hackingmaterials.github.io/matminer/example_bulkmod.html the line:

from matminer.featurizers.data import PymatgenData

returns

ModuleNotFoundError: No module named 'matminer.featurizers.data'

Also:

from matminer.descriptors.composition_features import get_pymatgen_descriptor

returns

ModuleNotFoundError: No module named 'matminer.descriptors'

Minor Bug in CrystalSiteFingerprint.featurize()

I tried to use the featurize method of the CrystalSiteFingerprint class, but hit this bug:
TypeError: __init__() got an unexpected keyword argument 'target'

Possible solution:
On line 484 of featurizer/site.py, use vnn = VoronoiNN(cutoff=self.cutoff_radius, targets=target), because VoronoiNN has no 'target' parameter.
On line 486 of featurizer/site.py, use n_w = vnn.get_voronoi_polyhedra(struct, idx),
because the VoronoiNN.get_voronoi_polyhedra method has no 'use_weights' parameter.

better (and working) tutorials

Better tutorials are needed, e.g.: show off featurize_dataframe, make better use of featurizers, include structure featurizers, separate the material into “getting data”, “featurizing data”, and “full data mining workflow”, make sure they work with Py3 (no print ‘x’), use built-in example datasets for tutorials, etc.

BaseFeaturizer() doesn't support certain types of featurizers well

Many featurizers don't have a set number or type of features that they return; the features depend on the input. For example, the BagofBonds featurizer (currently a PR) will return one value for each potential bond combination in the material. For Li2O this would mean Li-Li, Li-O, and O-O. But for Li (metal) the only feature is Li-Li. Or for Na2S it is Na-Na, Na-S, S-S. So the features are different for each entry. There are multiple featurizers like this, such as ChemicalSRO.

There are two issues here:

  1. You get this clunky thing where feature_labels() only really works well after you've called featurize(), and is only really pertinent to the most recent featurize() call. This makes the Featurizer a stateful object which is less preferable and more prone to breakage.
  2. Sometimes, you need to override featurize_dataframe() to get this to work.

It would be better if someone could design an improvement to BaseFeaturizer() to support these kinds of features. I should say that text mining handles this kind of use case all the time. For example, the CountVectorizer in scikit-learn will return one "feature" for every distinct word in a block of text, and the features will be different for each text block. The way to do this is to use the fit_transform() method in scikit-learn.

I am wondering if something similar is needed / helpful for some of our featurizers.
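To make the fit/transform idea concrete, here is a minimal sketch using only the standard library. The class name `BondFractionFeaturizer` and the bond lists are made up for illustration; this is not matminer's API, only one way the label set could be fixed in a separate fit step so that `feature_labels()` no longer depends on the most recent `featurize()` call.

```python
from collections import Counter


class BondFractionFeaturizer:
    """Hypothetical featurizer whose labels are learned in fit(), not in featurize()."""

    def fit(self, bond_lists):
        # Collect every bond type seen across all training entries, in sorted
        # order, so the label set is fixed before any features are computed.
        self.labels_ = sorted({bond for bonds in bond_lists for bond in bonds})
        return self

    def feature_labels(self):
        return list(self.labels_)

    def featurize(self, bonds):
        # Fraction of each known bond type; bond types unseen in fit() are ignored.
        counts = Counter(b for b in bonds if b in self.labels_)
        total = sum(counts.values())
        return [counts[label] / total if total else 0.0 for label in self.labels_]


# Made-up bond lists standing in for Li2O, Li metal, and Na2S.
entries = [
    ["Li-Li", "Li-O", "Li-O", "O-O"],
    ["Li-Li", "Li-Li"],
    ["Na-Na", "Na-S", "S-S", "Na-S"],
]
featurizer = BondFractionFeaturizer().fit(entries)
labels = featurizer.feature_labels()
rows = [featurizer.featurize(bonds) for bonds in entries]
```

Every row has the same length and column order, so the rows can go straight into a table without overriding any dataframe logic.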

Piezo and elastic data sets

@kylebystrom

I made some changes to the "datasets" package of matminer. Some require your attention:

  • Note that git-lfs seemed to be a big hassle and so I decided to just remove it altogether. Hopefully a normal "pull" operation will just fix everything on your end, but if not you may need to re-clone. Let's skip git-lfs from now on unless it really seems necessary, sorry for the hassle but now we know.
  • Note that I added tests to the package
  • Note that I added a note of the original reference both in the CSV and method
  • Needs your attention: The elastic_tensor dataset is nice in that it contains both formula and structure columns that can be used to generate descriptors. However, the "piezo" and "dielectric_constant" data sets don't have these columns (they are embedded in "meta"). Can you pull these columns out of meta, along with anything else that looks useful to have as its own column? You'll need to update the unit tests afterward. Make sure to preserve the comment line at top that gives the reference.

Py2: AGNIFingerprints test failing

@WardLT The AGNIFingerPrints tests fail on Py2 with the error listed below. I tried a few fixes but they still don't seem to work. Note that everything is OK in Py3.

======================================================================
ERROR: test_off_center_cscl (test_site.FingerprintTests)

Traceback (most recent call last):
File "/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/tests/test_site.py", line 52, in test_off_center_cscl
site1 = agni.featurize(self.cscl, 0)
File "/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/site.py", line 103, in featurize
raise Exception('Unrecognized direction')
Exception: Unrecognized direction

======================================================================
ERROR: test_simple_cubic (test_site.FingerprintTests)
Test with an easy structure

Traceback (most recent call last):
File "/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/tests/test_site.py", line 31, in test_simple_cubic
features = agni.featurize(self.sc, 0)
File "/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/site.py", line 103, in featurize
raise Exception('Unrecognized direction')
Exception: Unrecognized direction


some cleanups to sample dataframes

@kylebystrom can you help clean up some of the sample dataframes?

  • Don't need an "idx" column (there is already a material_id column that serves as an index)
  • Actually make the "material_id" the index column of the dataframe - see for example the index_col parameter of pd.read_csv (you'll notice there is no extraneous index column after doing this).
  • if volume is volume per site, rename to volume_per_site. If it's not per site, probably remove it.
  • reorder the columns to be in the following order
  1. The index column ("material_id")
  2. The formula/composition
  3. Other parameters that succinctly describe the input structure (nsites, volume, space_group, etc)
  4. The actual structure column itself
  5. Simple (float) output data - band gap, K_VRH, etc.
  6. Complex (vector, tensor) output data - elastic tensor, piezoelectric tensor, etc.
  7. Any extra metadata (cif, poscar, etc) that is essentially redundant info with structure

Tiny bugs in get_pymatgen_descriptor()?

Hello!
Thanks for your excellent program. I ran into trouble with get_pymatgen_descriptor('Com','ionic_radii'). It reports this error: "TypeError: float() argument must be a string or a number, not 'dict'". Maybe you can fix this bug. Thanks!

add a distance_metrics package

(internal note for myself)

many ML algorithms (e.g., clustering, kernel regression) require only distance between points, not explicit features. Add a distance_metrics package to help with this

Note that it's possible to code a generalized distance function that takes in a featurizer, args1 for data point1, args2 for data point2, and a choice of distance metric (angle or vector distance). Such a feature would work for any existing featurizer.
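A rough sketch of such a generalized distance function, assuming the featurizer is any callable that returns a flat list of numbers. The names `feature_distance` and `toy_featurizer` are hypothetical and only serve the illustration.

```python
import math


def feature_distance(featurize, args1, args2, metric="euclidean"):
    """Distance between two data points in the feature space of `featurize`."""
    v1 = featurize(*args1)
    v2 = featurize(*args2)
    if len(v1) != len(v2):
        raise ValueError("feature vectors must have the same length")
    if metric == "euclidean":
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
    if metric == "angle":
        dot = sum(a * b for a, b in zip(v1, v2))
        norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
        if norm == 0:
            raise ValueError("angle is undefined for a zero feature vector")
        # Clamp to [-1, 1] to guard against floating-point round-off.
        return math.acos(max(-1.0, min(1.0, dot / norm)))
    raise ValueError(f"unknown metric: {metric!r}")


# Hypothetical featurizer: maps an (x, y) pair to a 2-element feature vector.
def toy_featurizer(x, y):
    return [float(x), float(y)]


d = feature_distance(toy_featurizer, (0, 0), (3, 4))
theta = feature_distance(toy_featurizer, (1, 0), (0, 1), metric="angle")
```

Because the featurizer is passed in as an argument, the same function would work with any featurizer that returns fixed-length vectors.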

Other distance functions could be there if you didn't use explicit featurization to get distance.

LFS File Issue (IPython notebooks): Missing Objects

It appears that something is wrong with the MPDS and Citrine IPython notebooks.* I get the following output when I try to pull from the skeleton:

$ git lfs pull skeleton master
Git LFS: (0 of 2 files) 0 B / 191.73 KB
Username for 'https://github.com': kylebystrom
Password for 'https://[email protected]':
Git LFS: (0 of 0 files, 2 skipped) 0 B / 0 B, 191.73 KB skipped [1f60dd7f710b54cf410b4215a740abc9b2e6eba0f92aadd5fd25046112c60094] Object does not exist on the server: [404] Object does not exist on the server
[13656b91a004cd765c2ad641c0efd741407b02aa382fce71ccbae96de51b48cc] Object does not exist on the server: [404] Object does not exist on the server
error: failed to fetch some objects from 'https://github.com/hackingmaterials/matminer.git/info/lfs'

*I verified that the two missing files are in fact the MPDS and Citrine notebooks.

I don't need the notebooks right now, but this issue also prevents me pushing any updates to my fork, which is problematic. Does anyone else have this issue or know a fix?

Also, I noticed that not all of the notebooks are stored with git lfs. Is this intentional?

citrination-client and plotly should be optional requirements

@saurabh02

Commit 948fe5e makes citrination-client and plotly required libraries of matminer. This should not be the case. We can think of a typical use case of matminer as someone that has a spreadsheet of data and wants to add descriptors (composition or structure) to it and then run a data mining model. Neither plotly nor citrination-client is needed for that (although pymatgen and pandas are). If a library is extremely lightweight or robust then it's also usually OK to put it in requirements.

As far as I can tell you added the requirements to get a unit test to pass. But the way to do it is to fix the unit test, not require additional libraries. e.g., pymatgen contains examples on how to write unittests with optional libraries (you raise a SkipTest if the library is not installed). Can you try it this way?
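For reference, a minimal sketch of a test module that tolerates a missing optional library. It uses `unittest.skipUnless`, which has the same effect as raising `SkipTest` inside the test; the test class and its contents are made up for illustration.

```python
import importlib.util
import unittest

# True only if the optional package can be found in this environment.
HAS_PLOTLY = importlib.util.find_spec("plotly") is not None


class OptionalDependencyTest(unittest.TestCase):
    @unittest.skipUnless(HAS_PLOTLY, "plotly is not installed")
    def test_needs_plotly(self):
        import plotly  # imported lazily so this module loads without plotly

        self.assertIsNotNone(plotly)

    def test_core_feature(self):
        # Tests that need no optional package always run.
        self.assertEqual(sum([1, 2, 3]), 6)
```

With this pattern the suite passes whether or not plotly is installed, so the package can stay out of the hard requirements.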

Note also that if you do add requirements, you should also pin the version as per the other requirements.

normalization for standard deviation in PropertyStats

There are two different (common) definitions of standard deviation. One of them has the number of samples in the denominator, n, and the other one has one less (n-1), which is known as Bessel's correction.

The questions are:

  • which definition should be used by default
  • should we support both as an option

My vote is in favor of n-1 by default. I expect most of the time we will not have a full population in matminer. I expect most matminer users will be trying to build models for the purposes of applying that model on unseen data that is not in the data set, which would go along with the n-1 definition. But this should be discussed to make sure we get the right solution.
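To show how much the two definitions differ on a small sample, here is a quick comparison using Python's standard `statistics` module; the numbers are arbitrary.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population definition: divide the sum of squared deviations by n.
population_std = statistics.pstdev(values)

# Sample definition: divide by n - 1 (Bessel's correction).
sample_std = statistics.stdev(values)
```

For these eight values the sum of squared deviations is 32, so the population value is sqrt(32/8) = 2.0 and the sample value is sqrt(32/7), roughly 2.14. The gap shrinks as n grows, so the choice matters most for compounds with only a few elements.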

More sources of data

Thanks for this project! This is not an issue, but more like a feature request: would you be interested in adding more sources of data (e.g. Nomad, Materials Cloud etc.)?

fix pip install

Apparently pip install of matminer does not copy data tables - which results in much of the core code breaking. Someone needs to investigate and fix. See:

#122 (comment)

Add more descriptive comments to citations for CohesiveEnergy

@saurabh02

See this todo:

# TODO: @sbajaj unclear whether cohesive energies are taken from first ref, second ref, or combination of both

matminer.featurizers.composition.CohesiveEnergy#citations

You can add some code comments that explain what is going on. This could be part of the main doc for the CohesiveEnergy featurizer, e.g.

        Class to get cohesive energy per atom of a compound by adding known
        elemental cohesive energies from the formation energy of the
        compound. <<SOME MORE DESCRIPTION HERE ABOUT KNOWN COHESIVE ENERGIES SOURCE>>

oxidation states of elements in Composition descriptors

@JFChen3 @WardLT Many of the descriptors in composition seem to need knowledge of the oxidation state of an element. As far as I can tell, this is read off some kind of table. But many elements can take more than one oxidation state (e.g., P3- in AlP vs P5+ in LiFePO4).

I would suggest using pymatgen Composition's oxi_state_guesses function which I recently developed to guess the oxidation state instead. This should correctly figure out whether P is -, +, or neutral (e.g. P element).

Let me know if you have any issues with the function. If it is slow for large systems let me know because I think I have ideas to speed it up (have an option to reduce the Composition first)

ElectronegativityDiff failures

See

# TODO: this featurizer should fail gracefully for compounds with no clear anions (e.g., metals where all elements have zero oxidation) - returning either NaN or zero.

matminer.featurizers.composition.ElectronegativityDiff

Add a BagofBonds structure featurizer

This would take in a structure and return a list that contains the number of bonds of each type (probably as a fraction). For example, if a structure had 2 Li-O bonds and 3 Li-P bonds, it could return [0.4, 0.6] as the featurizer value and ["Li-O", "Li-P"] as the feature labels.
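The counting step in that example could look roughly like this. It assumes the bonds have already been detected and are given as a list of label strings; `bond_fractions` is a hypothetical helper, not an existing matminer function.

```python
from collections import Counter


def bond_fractions(bonds):
    """Return (labels, fractions) for a list of bond-type labels."""
    counts = Counter(bonds)
    total = sum(counts.values())
    labels = sorted(counts)
    return labels, [counts[label] / total for label in labels]


# 2 Li-O bonds and 3 Li-P bonds, as in the example above.
labels, values = bond_fractions(["Li-O"] * 2 + ["Li-P"] * 3)
```

Here `labels` is `["Li-O", "Li-P"]` and `values` is `[0.4, 0.6]`.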

Some aspects to note:

  • You probably want to allow any pymatgen NearNeighbor class to be used as the bond detection method. Some presets might help with usage
  • If you are featurizing an entire data frame, you need the feature labels to match and be a concatenation of all the bonds in all the various structures in the data frame. Also some complications if you fit a model and then predict a structure for which there are no bonds in the training set. Some hashing tricks from text mining might help with this. Talk to AJ if you are implementing and this is confusing to you.
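One reading of the "hashing tricks" mentioned above is feature hashing: each bond label is hashed into a fixed number of buckets, so the vector length never changes even when a new structure contains bonds that were absent from the training set. The sketch below is an assumption about how that could look, with a made-up helper name and bucket count.

```python
import hashlib


def hashed_bond_vector(bond_fractions, n_buckets=8):
    """Map {bond label: fraction} to a fixed-length vector via feature hashing."""
    vector = [0.0] * n_buckets
    for label, fraction in bond_fractions.items():
        # hashlib gives a digest that is stable across runs, unlike built-in hash().
        digest = hashlib.md5(label.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % n_buckets] += fraction
    return vector


train_vec = hashed_bond_vector({"Li-O": 0.4, "Li-P": 0.6})
unseen_vec = hashed_bond_vector({"Na-S": 1.0})  # bond type not seen in training
```

The trade-off is that two different bond types can land in the same bucket, and the columns no longer have readable labels.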

get_pymatgen_descriptor not available

A very interesting project. However when I follow the tutorial and try to:

from matminer.descriptors.composition_features import get_pymatgen_descriptor
avg_mass = np.mean(get_pymatgen_descriptor('LiFePO4', 'atomic_mass'))
    ...

this yields a ModuleNotFoundError: No module named 'matminer.descriptors' under v0.1.1.

Am I missing something?

pymatgen preset in ElementProperty

The features chosen for the "pymatgen" preset are somewhat random (I chose them on a whim, based on no data, while coding in a rush)

Someone can try to do a better job with this...

matminer can include capability to functionalize features

Part 1: This would be a tool that takes in a feature (or list of features) and returns some functional outputs like:

  • x
  • 1/x
  • x^0.5
  • x^-0.5
  • x^2
  • x^-2
  • x^3
  • x^-3
  • ln(x)
  • 1/ln(x)
  • exp(x)
  • exp(-x)

where x is the original feature value. Note that the list of features above is from "Machine Learning and Materials Informatics: Recent Applications and Prospects" by Ramprasad et al.
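A minimal sketch of Part 1, assuming a single numeric feature value as input. Returning NaN wherever a function is undefined (for example ln(x) for x <= 0) is a design choice made here for illustration, not something the issue specifies.

```python
import math

# The twelve functional forms listed above, keyed by a readable label.
TRANSFORMS = {
    "x": lambda x: x,
    "1/x": lambda x: 1 / x,
    "x^0.5": lambda x: math.sqrt(x),
    "x^-0.5": lambda x: 1 / math.sqrt(x),
    "x^2": lambda x: x ** 2,
    "x^-2": lambda x: x ** -2,
    "x^3": lambda x: x ** 3,
    "x^-3": lambda x: x ** -3,
    "ln(x)": lambda x: math.log(x),
    "1/ln(x)": lambda x: 1 / math.log(x),
    "exp(x)": lambda x: math.exp(x),
    "exp(-x)": lambda x: math.exp(-x),
}


def functionalize(x):
    """Apply every transform to x, using NaN where the result is undefined."""
    out = {}
    for name, func in TRANSFORMS.items():
        try:
            out[name] = func(x)
        except (ValueError, ZeroDivisionError, OverflowError):
            out[name] = float("nan")
    return out


expanded = functionalize(4.0)
```

`math.sqrt` is used instead of `x ** 0.5` so that negative inputs raise an error (and become NaN) rather than silently producing complex numbers.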

Part 2: This would take in a list of features (either raw features or perhaps after applying the functions above) and combine them, i.e., give you x1*x2 for all pairs of features. Note that if you applied the functions above, you will also have features like x1, x2, 1/x1, 1/x2, etc., so by multiplying all feature combinations you will end up with things like x1*x2, x1/x2, x2/x1, etc. Unfortunately, some of these will just be 1 (x1 * 1/x1) or redundant (x1 * x1^2 is the same as x1^3).
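Part 2 could be sketched with `itertools.combinations`, assuming the features arrive as a name-to-value mapping. The helper name and the numbers are made up, and the sketch deliberately does not filter out the trivial or redundant products.

```python
from itertools import combinations


def pairwise_products(features):
    """Return {"a*b": value} for every unordered pair of distinct features."""
    return {
        f"{name_a}*{name_b}": value_a * value_b
        for (name_a, value_a), (name_b, value_b) in combinations(features.items(), 2)
    }


features = {"x1": 2.0, "x2": 4.0, "1/x1": 0.5, "1/x2": 0.25}
products = pairwise_products(features)
```

Four input features give six products, two of which (`x1*1/x1` and `x2*1/x2`) are identically 1. A real implementation would need some way to detect and drop those.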

featurize_dataframe mishandles features that are equally sized numpy arrays

When BaseFeaturizer.featurize_dataframe() is called for a featurizer which returns equal-dimension
numpy arrays (such as OrbitalFieldMatrix and the in-dev ManyBodyTensor), the code exits in error,
as shown in the Python code and output attached in testfeat.txt. The reason for this is that featurize_dataframe joins the features into a numpy array, which is interpreted as a multidimensional array if and only if each element of the array has the same length. If the features are single values, the array is 1D, and if they are variable-length arrays, the array is interpreted as a 1D collection of other objects. The assign call in featurize_dataframe does not handle multidimensional arrays.

A possible fix is shown in testfeat.py, in which a dataframe is featurized by adding a Series initialized
from a list of features. As shown in testnew.txt, there does not seem to be a significant speed difference between these methods. I apologize if the fix is underthought; I am not familiar with the reasoning for the original design, so I may well be missing something.
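The behaviour described above can be reproduced in a few lines, assuming numpy and pandas are installed. The 2x2 arrays are stand-ins for real featurizer output, and the column name is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"formula": ["NaCl", "MgO", "LiF"]})

# Stand-in features: one equally sized 2x2 array per row.
features = [np.full((2, 2), float(i)) for i in range(len(df))]

# np.array() stacks equally shaped arrays into a single 3-D array,
# which cannot be assigned as one dataframe column.
stacked = np.array(features)

# A Series built from the list keeps one array object per row instead.
df["matrix"] = pd.Series(features, index=df.index)
```

Each cell of the new column then holds its own array, regardless of whether the arrays happen to share a shape.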

testfeat.txt
testnew.txt

Can't clone matminer

Error message is

Downloading matminer/datasets/diel_ref.csv (3.2 MB)
Error downloading object: matminer/datasets/diel_ref.csv (7f3c427): Smudge error: Error downloading matminer/datasets/diel_ref.csv (7f3c4276d159e56eaf13231ad7580824f0512a9aba51e6d7fdad0de513691fad): [7f3c4276d159e56eaf13231ad7580824f0512a9aba51e6d7fdad0de513691fad] Object does not exist on the server: [404] Object does not exist on the server

Errors logged to /Users/shyuepingong/repos/matminer/.git/lfs/objects/logs/20170824T115944.174205621.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: matminer/datasets/diel_ref.csv: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.

some remaining items for DosFeaturizer

@dyllamt @albalu

  • move to featurizers/dos.py (instead of bandstructure.py)
  • the xbm_location_i should probably return fractional, not Cartesian coords. Can't think of any good reason for it to be Cartesian
  • the coordination variable in get_cbm_vbm_scores is way too hacky. It only works for tet vs oct, uses a custom function, etc. Remove this var. If someone wants this function they should code it well (so that it works for multiple environments, etc.), or keep it inside their own test code. Note that there is another issue open for a featurizer that takes in a site and returns a string label for the coordination number of that site.
  • get_cbm_vbm_scores() should explain in the docs what it's doing

Capitalization consistency in feature names

Hi all

I'd like to make the capitalization of feature names in matminer consistent. We can do:

  • everything lower case (e.g. "band center") except for acronyms (e.g. "formation energy FERE")
  • first-word capitalization ("Formation Energy FERE")

I'm leaning towards the former, it's just easier to remember and type. The latter leads to some people deciding to do "Formation energy FERE" which is not the same ...

@WardLT @JFChen3 @nisse3000

Thoughts, comments, etc?

too many presets

Some of the featurizers have a large number of presets, all of which do essentially the same thing. Here's one from a recent PR, although there were many before this of the same form (some committed by me).

+        if preset == "VoronoiNN":
+            return ChemicalSRO(VoronoiNN())
+        elif preset == "JMolNN":
+            return ChemicalSRO(JMolNN())
+        elif preset == "MinimumDistanceNN":
+            return ChemicalSRO(MinimumDistanceNN())
+        elif preset == "MinimumOKeeffeNN":
+            return ChemicalSRO(MinimumOKeeffeNN())
+        elif preset == "MinimumVIRENN":
+            return ChemicalSRO(MinimumVIRENN())
+        else:
+            raise RuntimeError('Unknown preset.')

We should:

  1. Clean up the code to shorten it, using something like this:
    https://stackoverflow.com/questions/4821104/python-dynamic-instantiation-from-string-name-of-a-class-in-dynamically-imported

    It will also allow us to automatically support new NearNeighbor algorithms, etc.

  2. Consider giving more guidance on presets, e.g. instead of just a billion presets, do more like what the ElementProperty presets do and just give a few "vetted" suggestions.
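The dynamic-instantiation approach from the first item could look roughly like this. The demonstration uses a standard-library class as a stand-in; in matminer the module would presumably be the one that holds the NearNeighbor classes (pymatgen.analysis.local_env, as far as I know), but that is an assumption and is not exercised here. The helper name `instantiate` is hypothetical.

```python
import importlib


def instantiate(module_name, class_name, *args, **kwargs):
    """Import `module_name` and return an instance of its `class_name`."""
    module = importlib.import_module(module_name)
    try:
        cls = getattr(module, class_name)
    except AttributeError:
        raise RuntimeError(f"Unknown preset: {class_name!r}") from None
    return cls(*args, **kwargs)


# Stand-in demonstration with the standard library instead of pymatgen.
counter = instantiate("collections", "Counter", "aab")
```

With a helper like this, the whole if/elif chain collapses to a single call, and any class added to the module later is picked up without code changes.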

figure out what to do with distance_metrics.site

matminer/matminer/distance_metrics/site.py has a note:

THIS IS CURRENTLY JUST HOSTING LEGACY CODE.

It will soon be refactored / changed based on some discussions.

- Anubhav (12/7/17)

@nisse3000 can you figure out the plan? Think this hosts your legacy method

Potential Issue with RDF Method

I think there are a few issues with the RDF calculation method (link for convenience)

  1. RDFs are typically normalized by number density. This function appears to normalize by mass density
  2. In the loop over pairs of neighbors, pair distances are used as keys for the dist_rdf dict. In the loop when normalizing dist_rdf, it seems like the keys are treated as bin indices (based on the shell thickness calculation)
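For comparison, here is a minimal sketch of an RDF that addresses both points: distances are converted to bin indices before counting, and the counts are normalized by number density and shell volume. It is not the matminer implementation. The function name and argument layout are made up, and `distances` is assumed to list every neighbor distance once per central atom.

```python
import math


def radial_distribution(distances, n_atoms, volume, cutoff, bin_size):
    """Return g(r) per bin, normalized by number density (atoms per volume)."""
    n_bins = math.ceil(cutoff / bin_size)
    counts = [0] * n_bins
    for d in distances:
        index = int(d / bin_size)  # bin index, not the raw distance
        if index < n_bins:
            counts[index] += 1

    number_density = n_atoms / volume
    rdf = []
    for i, count in enumerate(counts):
        r_inner = i * bin_size
        r_outer = r_inner + bin_size
        shell_volume = 4.0 / 3.0 * math.pi * (r_outer ** 3 - r_inner ** 3)
        # Expected neighbors in this shell for an ideal gas at the same density.
        rdf.append(count / (n_atoms * number_density * shell_volume))
    return rdf


# Simple cubic cell with a = 1: one atom, six nearest neighbors at distance 1.
g = radial_distribution([1.0] * 6, n_atoms=1, volume=1.0, cutoff=2.0, bin_size=0.5)
```

Only the bin covering distances from 1.0 to 1.5 is non-zero in this example.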

If you agree these are problems, I can make these changes. As you have tests for this function already, I figured it worth checking in before altering both the code and the tests.

support for grouped DataFrames in featurizers

There is a way to group columns in DataFrames which makes things look much prettier and more organized. See, for example, this screenshot from Patrick Huck in MP:

(screenshot of a DataFrame with grouped columns, 2018-01-23; image not preserved)

It would be nice if:

  • sample dataframes included grouped columns (e.g., an "outputs" group, a "metadata" group, an "inputs" group) ( @kylebystrom )
  • the featurize_dataframe() method could featurize into a group so that all features for a particular featurizer (or multifeaturizer) were grouped together.

Improving Performance of Composition Features

I recently compared the performance of matminer to Magpie for computing composition-based features. For ~250k entries, matminer requires 24 minutes to evaluate the same features that take Magpie 4 seconds.

From a quick profiling run, it seems like a fair amount of the time (25%) is spent retrieving elemental properties and, surprisingly, computing the mode of a list. Digging further, the slow part of retrieving elemental properties is sorting the properties by atomic number and calling Composition.get_el_amt_dict(). My plan is to adjust the AbstractData API such that it is possible to avoid performing either of these operations.

I have opened this issue to see if anyone else has ideas for speeding up the code, and to let you know I am overhauling the AbstractData API. Let me know if I should make a public branch if you want to work on this together and we can avoid merge conflicts.

support multiprocessing in featurize_dataframe

Add a multiprocessing option to featurize_dataframe to speed it up. It should detect the number of processors (straightforward with Python) and use the multiprocessing package to parallelize.
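A rough sketch of what such an option could look like, using only the standard library. The function `featurize_many`, its `n_jobs` parameter, and the toy featurizer are hypothetical names, not part of matminer.

```python
import os
from multiprocessing import Pool


def featurize_many(featurize, entries, n_jobs=None):
    """Apply `featurize` to every entry, in parallel when n_jobs > 1."""
    if n_jobs is None:
        n_jobs = os.cpu_count() or 1  # cpu_count() may return None
    if n_jobs == 1:
        return [featurize(entry) for entry in entries]
    # `featurize` must be picklable, e.g. a module-level function.
    with Pool(processes=n_jobs) as pool:
        return pool.map(featurize, entries)


def toy_featurize(formula):
    # Hypothetical featurizer: a single feature, the length of the formula string.
    return [len(formula)]


# Serial path shown here; leave n_jobs unset to use every detected processor.
serial = featurize_many(toy_featurize, ["LiFePO4", "NaCl", "MgO"], n_jobs=1)
```

On platforms that start worker processes with "spawn", the parallel call should sit under an `if __name__ == "__main__":` guard. `pool.map` preserves input order, so the results line up with the dataframe rows.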
