hackingmaterials / automatminer
An automatic engine for predicting materials properties.
License: Other
They should not return strings, because strings (as far as I know) are not usable with most matminer composition featurizers. The same goes for loading structures: the load functions should return the Structure object, not the dictionary.
As per @Doppe1g4nger and @Qi-max's initial suggestion, our main classes should have sklearn-style .fit and .predict (or .transform) methods. They do not necessarily need to subclass BaseEstimator, but each should still implement these methods, so that we can easily (1) tell what each class is doing and (2) apply the same transformation to other datasets.
This applies to the following classes:
Example 1:
# Running a featurization
f = Featurization()
f.fit(df) # We understand all features are being fitted to this df
df = f.transform(df) # transform the same df
df2 = f.transform(df2) # apply the transform to another dataset
Example 2:
# Running a data cleaning
pp = Preprocessing()
pp.fit(df, target_key) # determines which features to keep
df = pp.transform(df) # remove na samples, remove features
df2 = pp.transform(df2) # apply the same set of cleaning/feature reduction steps to df2
Example 3:
# Running an entire matbench pipeline in this fashion
pipe = PredictionPipeline()
pipe.fit(df, target_key) # does featurization, preprocessing, and tpot fitting
pipe.top_model # gives back the best tpot model
predictions = pipe.predict(df2) # runs featurization, preprocessing, and tpot prediction on df2
I think this is a problem with only one of the formulas, but when I do:
df = load_expt_gap()
df['composition'] = [Composition(f) for f in df['formula']]
I get the error:
CompositionError: ,65 is an invalid formula!
As Anubhav said, we should add more featurizers based on combinations of the input parameters. Mainly I am referring to the site featurizers applied through SiteStatsFingerprint, but also the magpie preset of ElementProperty, ElementProperty with more stats (e.g., quartiles), BondFractions with approximations turned on/off, BagofBonds with SineCoulombMatrix instead of CoulombMatrix, etc.
I think it would be worth looking into adding the option to run an outlier detection algorithm (sklearn has some good ones) during the preprocessing stage. Based on the results we could throw out outliers that might affect performance or dynamically change the tpot accuracy metric to one that's more outlier resistant.
I thought of this because one of the datasets I'm working with has a few outliers and I think they are causing tpot to try really hard to find a model that improves performance drastically on those few when it should instead be finding a marginally better fit for the vast majority of the data.
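A minimal sketch of what this could look like with sklearn's IsolationForest (the toy data and the contamination value are placeholders, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# 100 well-behaved samples plus two obvious outliers
X = np.concatenate([rng.normal(0, 1, size=(100, 2)),
                    np.array([[8.0, 8.0], [9.0, -9.0]])])

# contamination = expected fraction of outliers (a guess here)
clf = IsolationForest(contamination=0.02, random_state=0)
labels = clf.fit_predict(X)  # -1 for outliers, 1 for inliers

# drop flagged rows before handing the data to tpot
X_clean = X[labels == 1]
print(len(X), len(X_clean))
```

The alternative mentioned above (keeping the outliers but switching tpot's scoring to something robust like median absolute error) would avoid discarding data at the cost of a less standard metric.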
We might consider having an ability to ensemble predictions from the top tpot models (top models not just being best hyperparam combo for each model type, but whatever the top hyperparameter + model combinations are)
This can maybe wait until later though
The AutoML segment of the pipeline is called AutoML to indicate that it does automatic model selection, but it also uses the automl toolkit as one of the tools for that task (in automl_utils.py). In addition, it seems we've settled on tpot as the main model selection tool rather than automl.
I think the current naming is confusing: it makes it hard to understand that the tool being used isn't necessarily automl, and it's not immediately clear that automl_utils.py holds utilities for the automl package, not for our overall model selection part of the pipeline.
A better naming convention for the submodule would reflect that the model selection isn't tied to any particular package but is meant for automatic model selection in general. So something like matbench.model_selection or matbench.auto_model_selector rather than matbench.automl.
Dataset test set is the following (tentatively):
Piezoelectric dataset - really hard, 900 samples predicting the max piezoelectric tensor (regression) [structure + composition]
Alireza's expt. gaps dataset - classification of metal/nonmetal [composition]
MP elastic dataset - predicting K_VRH and G_VRH (regression) [structure + composition]
Metallic glass - classification of glass formation or not [composition]
Boltztrap - regression on effective mass for n and p [structure + composition]
These will be the datasets for comparing different pipeline configurations.
@ardunn did you set load_castelli_perovskites up? If yes, can you see where the problem is?
Nested CV is a technique often used to decrease model generalization error and provide better comparisons of models. However it is more computationally expensive -- especially when searching over many models and hyperparameter combinations (i.e., in automl).
A short discussion on the topic in tpot is here:
EpistasisLab/tpot#554
It would be good to investigate this, at least at a cursory level.
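For reference, a minimal nested-CV sketch with sklearn (toy data and a made-up grid): the inner GridSearchCV does the model/hyperparameter selection, while the outer cross_val_score estimates the generalization error of the whole selection procedure.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=60, n_features=5, random_state=0)

# inner loop: hyperparameter search over a tiny placeholder grid
inner = GridSearchCV(RandomForestRegressor(random_state=0),
                     param_grid={"n_estimators": [10, 30]}, cv=3)

# outer loop: score the *entire* search procedure, not a single fitted model
scores = cross_val_score(inner, X, y, cv=3)
print(scores.mean())
```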
@albalu might be interested in this
http://pubman.mpdl.mpg.de/pubman/item/escidoc:2616210/component/escidoc:2618812/SISSO_Ouyang_etal_2018.pdf
I'd like to look at some (preferably csv) file at the end that summarizes the results. It could even be an Excel file with different sheets, such as a featurize profile/summary, a preprocess profile/summary, and a machine learning profile/summary. Then I will know, for example, that at the end 100 cross-correlated features were removed in preprocessing, such and such algorithms were tried, these are their scores, and these features turned out to be the most important.
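A sketch of how the per-stage summaries could be dumped with pandas (the stage names and columns are made up for illustration):

```python
import pandas as pd

# hypothetical summaries collected from each pipeline stage
summaries = {
    "preprocess": pd.DataFrame({"step": ["remove_na", "cross_corr_prune"],
                                "features_removed": [4, 100]}),
    "ml": pd.DataFrame({"algorithm": ["ExtraTreesRegressor", "Lasso"],
                        "cv_score": [0.81, 0.64]}),
}

# one csv per stage; an Excel file with one sheet per stage
# (pd.ExcelWriter) would work the same way
for stage, df in summaries.items():
    df.to_csv(f"summary_{stage}.csv", index=False)
```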
Due to the way TestAllFeaturizers imports featurizers, the test breaks whenever a new featurizer is added to matminer.
Not sure if this is intended, but it does not seem like a maintainable pattern for testing.
Maybe we should have a separate util which detects all the featurizers present in the current version of matminer, compares them with the featurizers present in matbench, and lists the differences in an easily readable manner. This way we could periodically check and update the matbench featurizer sets with the newest featurizers from matminer without breaking the tests all the time.
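A sketch of such a util: introspect a module for its classes and diff against our curated names. The demo below uses a stand-in module; in practice you'd pass e.g. matminer.featurizers.structure and the names in the matbench featurizer sets.

```python
import inspect
import types

def list_classes(module):
    """Return the names of all classes reachable as attributes of `module`."""
    return {name for name, obj in inspect.getmembers(module, inspect.isclass)}

def featurizer_diff(matminer_module, known_names):
    """Names present in matminer but missing from our curated featurizer sets."""
    return sorted(list_classes(matminer_module) - set(known_names))

# demo with a stand-in module instead of matminer.featurizers.structure
fake = types.ModuleType("fake_featurizers")
class A: pass
class B: pass
fake.A, fake.B = A, B
print(featurizer_diff(fake, ["A"]))  # ['B']
```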
The tests are taking a very long time and are causing us to go over our CircleCI quota. The runtime of the tests should be reduced.
Encoding and decoding structures and compositions to/from csv causes problems with oxidation states and is also slow. Is anyone open to converting the matbench dataset format to json or pickle for less hassle when loading?
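A quick demonstration of the hassle, using a dict-valued column as a stand-in for Structure.as_dict() output: a csv round trip stringifies the dict, while a json round trip preserves it.

```python
import io
import json
import pandas as pd

# the dict stands in for a pymatgen Structure.as_dict() payload
df = pd.DataFrame({"structure": [{"lattice": [[1, 0, 0]], "species": ["Si"]}],
                   "gap": [1.1]})

# csv round trip: the dict column comes back as a plain string
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_csv = pd.read_csv(buf)
print(type(df_csv["structure"].iloc[0]))   # str

# json round trip: the nested dict survives intact
records = json.loads(df.to_json(orient="records"))
df_json = pd.DataFrame(records)
print(type(df_json["structure"].iloc[0]))  # dict
```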
Need top level class methods for the following:
This might have to come later, after we have the constituent classes done
The current tpot version is 0.9.5, newer than our current requirement of 0.9.3. If you run test_tpot with 0.9.5, most of the tests fail, because the fitted scores of the top_models differ. The tests may need to be modified to avoid these maintenance difficulties.
Find a way to obtain the feature_cols list and the target col easily, as they are essential inputs to machine learning, i.e., df[feature_cols] as X and df[target_col] as y.
The feature_cols list includes:
This is intended to make the chain "load-featurize-(preprocess)-machine learning" smoother.
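A minimal sketch of the split (the column and target names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"feat_a": [1.0, 2.0], "feat_b": [0.5, 0.1],
                   "gap": [1.1, 3.3]})
target_col = "gap"  # hypothetical target name

# everything that is not the target is a feature column
feature_cols = [c for c in df.columns if c != target_col]
X, y = df[feature_cols], df[target_col]  # ready for model.fit(X, y)
print(feature_cols)  # ['feat_a', 'feat_b']
```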
We should add coverage reports to make sure (1) we have enough tests and (2) the tests we do have are useful
including e_form, etc., some more of the stuff that is in the MAPI docs
I'll do this
We should have 2 separate classes, per Daniel's recommendation (which I agree with): DataCleaning and FeatureReduction.
I will do this
Featurizer sets should include more featurizers. The "best" sets should, for example, include multiple instances of the same class (e.g. StructureFeaturizers.best should probably have multiple SiteStatsFingerprints)
MatPipes need to be pickleable or serializable in some way
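A sketch of the round trip we'd want, using a stand-in class; the real requirement is that a fitted MatPipe hold no unpicklable attributes (open file handles, lambdas, etc.).

```python
import pickle

class MatPipeStub:
    """Stand-in for a fitted MatPipe (the attribute below is hypothetical)."""
    def __init__(self, top_model):
        self.top_model = top_model

pipe = MatPipeStub("ExtraTreesRegressor")
blob = pickle.dumps(pipe)       # serialize the whole fitted pipeline
restored = pickle.loads(blob)   # reload it in a later session
print(restored.top_model)
```

If pickle turns out to be too brittle across versions, a to_dict/from_dict (json) scheme would be the other option.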
After the model is returned, we need some interesting graphics and such for interpreting the model. Skater is a library, initially suggested by Anubhav, that gives you nice graphs on model interpretability:
https://github.com/datascienceinc/Skater
I think @Doppe1g4nger was going to work on this after finishing data cleaning?
The tpot tests are testing things besides tpot (e.g., preprocessing changes break the tpot tests).
The tpot tests need pre-formatted data so that tpot problems are localized.
Not an issue per se, but some thoughts I had that may be useful.
A paper to look at is On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation (link). I think it goes over most of the "best practices" relevant to matbench.
The main topic of the paper is the types of bias and variance that arise in model selection and evaluation. An important case is how over-fitting during model selection can cause the final model to be either under- or over-fit. The more models / hyper-parameters searched over, the more likely it is for the best model to be chosen due to variance, which can result in a worse final model.
The paper notes that minimizing bias during model selection is less important than minimizing variance, as (assuming some uniform bias) the best model will still be chosen. The paper gives a few options to deal with over-fitting during model selection, such as regularization and stopping criteria. A practical choice would be to use 5-fold CV instead of 10-fold CV, as it will likely have somewhat higher bias but lower variance, and is less computationally expensive.
This also highlights why it is so important to have a final hold-out test set to evaluate the final chosen model, as the cross validation scores can be heavily biased. Another option for small data (it is expensive) is nested cross validation.
This quote gives the main conclusion:
model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to prevent selection bias and because it reflects best practice in operational use.
In practice this means there must be a train / test split, with the training set then used as train / validation either by a split or cross validation, with the entirety of model selection (including most types of feature selection, hyper-parameter optimization, etc) internal to the cross validation. The TPOT code is a good example of this, where "pipelines" are compared to each other as a whole using cross validation, followed by a final evaluation of the best pipeline on a test set to estimate the true generalization error. A useful idea in model comparison would be using variance of the estimates of generalization error (e.g. var of CV error) to see if differences are statistically significant, but this is very hard to do correctly.
(Sidenote) With a goal of simply comparing many models (that is without the goal of choosing a best one at the end), the bias-variance trade-off is tricky. Theoretically either we can choose biased results to minimize variance, and hope that this gives accurate ranking of performance but may not reflect the true generalization error, or we can choose unbiased results that may reflect the generalization error but may not give an accurate ranking. In practice we hope to be somewhere in the middle that gives reasonable results. The hold out test set does not necessarily help here (though it may be better than nothing?), as using it to evaluate multiple models will lead to the same type of bias as in model selection. For example, after years of competitions training NN to do well on CIFAR-10, the selected models may just be over-fitting to the CIFAR-10 competition test set, and would do worse than previous models on new test data, even if they are currently winning by some small percent increase in classification accuracy. This is just a thought, I'm not sure how to deal with this issue, I haven't read much ML literature on large scale model comparison.
Let me know if you have any questions, want clarification, more references, or anything else.
Title sums it up, adding the module in now so I can start reworking the examples but tests need to be added.
Was having trouble implementing this so I am putting it off till later
Basically, fitting (with scaling enabled) using DataCleaner should define a scaler_obj on the class. This scaler object can then be used to transform all numerical columns (excluding the target and one-hot or label columns) on other dataframes while preserving the scaling from the fitted scaler.
In other words, the scaler should not be refit during .transform, only during .fit.
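A sketch of the intended semantics using sklearn's StandardScaler as the scaler_obj (toy arrays; the real DataCleaner would wrap this and skip target/one-hot columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [10.0]])  # mean 5, std 5
X_test = np.array([[5.0]])

scaler = StandardScaler()
scaler.fit(X_train)                # .fit learns the statistics once

# .transform reuses the train statistics; the scaler is NOT refit here,
# so the test point lands exactly at the train mean
print(scaler.transform(X_test))    # [[0.]]
```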
This structure featurizer raises the following error, which is not forgiven by ignore_errors=True. You can reproduce it with the example below. The reason I put it here is that I can reproduce it with our dataset, which is not available in matminer; I just put it here so we don't forget about it...
import matminer.featurizers.structure as sf
from matbench.data.load import load_jdft2d
from matminer.utils.conversions import structure_to_oxidstructure
from pymatgen import Structure
df_init = load_jdft2d()[:2000]
# df_init = df_init[df_init['formula'].isin(['Nb3IrSe8', 'Bi2PbSe4'])]
df_init['structure'] = df_init['structure'].apply(Structure.from_dict)
structure_to_oxidstructure(df_init['structure'], inplace=True)
print(df_init)
featzer = sf.MaximumPackingEfficiency()
df = featzer.fit_featurize_dataframe(df_init, col_id='structure', ignore_errors=True)
print(df)
the error:
Traceback (most recent call last):
File "/Users/alirezafaghaninia/Documents/py3/py3_codes/matbench/matbench/scratch.py", line 13, in <module>
df = featzer.fit_featurize_dataframe(df_init, col_id='structure', ignore_errors=True)
File "/Users/alirezafaghaninia/Documents/py3/py3_codes/matminer/matminer/featurizers/base.py", line 157, in fit_featurize_dataframe
**kwargs)
File "/Users/alirezafaghaninia/Documents/py3/py3_codes/matminer/matminer/featurizers/base.py", line 224, in featurize_dataframe
res = pd.DataFrame(features, index=df.index, columns=labels)
File "/Users/alirezafaghaninia/Documents/py3/lib/python3.6/site-packages/pandas/core/frame.py", line 387, in __init__
arrays, columns = _to_arrays(data, columns, dtype=dtype)
File "/Users/alirezafaghaninia/Documents/py3/lib/python3.6/site-packages/pandas/core/frame.py", line 7475, in _to_arrays
dtype=dtype)
File "/Users/alirezafaghaninia/Documents/py3/lib/python3.6/site-packages/pandas/core/frame.py", line 7552, in _list_to_arrays
content = list(lib.to_object_array(data).T)
File "pandas/_libs/src/inference.pyx", line 1517, in pandas._libs.lib.to_object_array
TypeError: object of type 'numpy.float64' has no len()
One logger should be instantiated in the top-level class.
The constituent steps (Featurization, Preprocessing, etc.) should all write to this same logger.
According to @utf there is an easy way to do this without passing the logger object around?
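I'm assuming the "easy way" is just stdlib logging: logging.getLogger returns the same instance for the same name, so each constituent class can call it independently without any logger object being passed around. A sketch:

```python
import logging

def get_logger():
    # same name -> same logger instance, from anywhere in the package
    logger = logging.getLogger("matbench")
    if not logger.handlers:  # configure only once
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(name)s %(levelname)s: %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

# Featurization and Preprocessing would each call this independently
# and still write to the one shared logger
print(get_logger() is get_logger())  # True
```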
or any other kind of id but I think only mpid is available (e.g. there is no citrine id)
In the same spirit as our TpotAutoML class now, we could include classes for other backends (e.g., autosklearn).
Given a PredictionPipeline object (or just a tpot model and feature list until I get the top level classes working better), analysis should give back a nice html (or other format?) containing:
@Doppe1g4nger @ADA110 any other ideas on cool things to include here?
We need a FeaturizerSelection class which given a df, returns either (a) a set of featurizer (objects) to use for featurization or (b) a set of featurizers to exclude
I think @Qi-max is close to being done with this already
As far as implementing this in the pipeline, I think we have two choices:
fs = FeaturizerSelection()
fs.fit(df) # determines the featurizer sets to use, sets .featurizers
featobjs = fs.featurizers # the list of featurizers to be passed to Featurization
df = fs.transform(df) # doesn't do anything to df?
We already have FunctionFeaturizer in matminer; it should be relatively easy to use it during featurization. We can also test whether having these features actually increases performance or not...
Some load funcs return columns that appear numeric when you print them but are actually strings. Columns which are floats or ints should be converted with pd.to_numeric to avoid (very confusing) issues later on.
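A quick demonstration of the problem and the fix (toy column names):

```python
import pandas as pd

df = pd.DataFrame({"formula": ["Si", "GaAs"], "gap": ["1.1", "1.4"]})

# the column prints like numbers but actually holds strings
print(df["gap"].dtype)  # object

# convert in the load func so downstream code gets real floats
df["gap"] = pd.to_numeric(df["gap"])
print(df["gap"].dtype)  # float64
```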
For the future use of the pipeline, the classes in matbench.preprocess can also take the fit + transform form, like the classes in sklearn.preprocessing such as Imputer and LabelEncoder, etc.
The default tpot arguments and models should be looked at in closer detail. There might be an underlying reason why ExtraTreesRegressor is a recurring favorite of tpot (i.e., the default hyperparameter grids for ETR may be better than the others).
Also, it may be good to include a Keras or PyTorch NN (implementing the sklearn BaseEstimator methods needed to work with tpot, e.g., fit) within tpot's model search space.
This is a problem our incoming undergrad (Abhinav) can maybe look at?
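For reference, tpot's search space is controlled by a config_dict mapping estimator import paths to hyperparameter grids, so inspecting or changing the ETR defaults would mean editing a structure like the sketch below (the grids here are made-up placeholders, not tpot's actual defaults):

```python
import numpy as np

# hypothetical custom search space in tpot's config_dict format:
# "import.path.of.Estimator" -> {param: candidate values}
tpot_config = {
    "sklearn.ensemble.ExtraTreesRegressor": {
        "n_estimators": [100, 500],
        "max_features": list(np.arange(0.1, 1.01, 0.1)),
        "min_samples_leaf": list(range(1, 11)),
    },
    "sklearn.linear_model.ElasticNetCV": {
        "l1_ratio": list(np.arange(0.0, 1.01, 0.25)),
    },
}
# would be passed as TPOTRegressor(config_dict=tpot_config, ...)
print(sorted(tpot_config))
```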
In addition to the issue in #87 I think it may also be worth looking at the way the training data generation and handling process is structured. E.g. what are currently the top level modules metalearning, featurization, and preprocessing.
I think it was agreed that metalearning isn't a very accurate descriptor of what that submodule does. I think a better name would be something like "heuristic featurizer selection" or maybe just "featurizer selection".
A bigger sell I'd also like to put forward is that preprocessing is a bit of a misnomer for what that submodule currently does. Really all three of the above submodules are part of a preprocessing pipeline. What is currently preprocessing is more accurately a combined data cleaning and feature pruning step. It would make sense to bundle each of these under a comprehensive "preprocessing" or "data handling" module and split out the functionality of each submodule into its constituent steps. That way every top level module produces some complete product for the user and submodules then represent steps to completing that product. See the below diagram on what I think the better structure would look like:
Add timing to the AutoFeaturizer class so we know how long each task took.
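One lightweight option (just a sketch, not necessarily how AutoFeaturizer should do it) is a decorator that logs each task's duration:

```python
import functools
import logging
import time

def log_timing(func):
    """Log how long the wrapped task took."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logging.getLogger("matbench").info("%s took %.2f s",
                                           func.__name__, elapsed)
        return result
    return wrapper

@log_timing
def featurize(x):  # stand-in for an AutoFeaturizer task
    return x * 2

print(featurize(21))  # 42
```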
Typically, a matbench model would be considered an "output" of the study.
However, let's say you are trying to create a model that relates crystal structure to band gap. You could of course have the standard matminer structure/composition features and use that as your candidate feature set.
But another option is to use a different matbench (or other ML) model as a feature as well. For example, let's say you previously made a matbench model to relate structure/composition->bulk modulus. Now, when you are predicting band gap, you can use that previous matbench model to get a value for bulk modulus which becomes a new feature for band gap prediction.
Related ideas include the papers:
[1] M.L. Hutchinson, E. Antono, B.M. Gibbons, S. Paradiso, J. Ling, B. Meredig, Overcoming data scarcity with transfer learning, ArXiv. (2017).
[2] Y. Zhang, C. Ling, A strategy to apply machine learning to small datasets in materials science, Npj Comput. Mater. 25 (2018) 28–33.
and the goal of the propnet project in our group is to do similar things as well.
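A toy sketch of the idea with stand-in sklearn models (the "bulk modulus" and "band gap" data here are synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(50, 3)                                  # base features
bulk_modulus = X @ np.array([2.0, 1.0, 0.5])         # toy related property
band_gap = 0.3 * bulk_modulus + rng.rand(50) * 0.1   # toy target

# step 1: a previously trained model for the related property
bm_model = LinearRegression().fit(X, bulk_modulus)

# step 2: its predictions become one extra feature for the new task
X_aug = np.column_stack([X, bm_model.predict(X)])
gap_model = LinearRegression().fit(X_aug, band_gap)
print(X_aug.shape)  # (50, 4)
```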
[given df, return a set of featurizers]
[given dataframe, give reports and final models] (+ generating final reports)
[given df and set of featurizers, featurize a df robustly]
[given featurized df, return an ML ready df]
Coming up with good methods for preprocessing and feature reduction
[given a request for data, return a nice dataframe, citations, etc.]
Moving the datasets to matminer, adding to figshare, making sure all columns are numeric, making sure correct citation data is present, having all in json format
[given an ML ready df (or just X and y), return a model]
Checking defaults, adding or testing Neural network?
[given a model and dataframe, return cool informative stats (and graphs)]