yoshida-lab / xenonpy

XenonPy is a Python Software for Materials Informatics

Home Page: http://xenonpy.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 5.34% Shell 0.01% Makefile 0.01% Dockerfile 0.02% Jupyter Notebook 94.64%

machine-learning material material-development python-library

xenonpy's Issues

Error fix for proposal

When a generated SMILES string cannot be converted to a Mol,
the current proposal class simply skips it.
Instead, we should either fill in None or fall back to the old SMILES string,
and emit a warning.
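
The behaviour requested above could look roughly like the sketch below. It is not the proposal class itself: `keep_or_fallback` and its `fallback` parameter are hypothetical names, and the converter is passed in as `to_mol` so that any function returning None on failure can be used (RDKit's `Chem.MolFromSmiles` behaves that way for an unparsable string).

```python
import warnings
from typing import Callable, List, Optional


def keep_or_fallback(
    old_smiles: List[str],
    new_smiles: List[str],
    to_mol: Callable[[str], Optional[object]],
    fallback: str = "old",
) -> List[Optional[str]]:
    """Validate proposed SMILES without silently dropping any row.

    Hypothetical helper: `to_mol` must return None when conversion fails.
    """
    if fallback not in ("old", "none"):
        raise ValueError("fallback must be 'old' or 'none'")
    result: List[Optional[str]] = []
    for old, new in zip(old_smiles, new_smiles):
        if to_mol(new) is None:
            # Conversion failed: warn, then keep the output aligned with the input.
            warnings.warn(
                f"cannot convert {new!r} to Mol; using fallback={fallback!r}",
                RuntimeWarning,
            )
            result.append(old if fallback == "old" else None)
        else:
            result.append(new)
    return result
```

Because the output always has the same length as the input, downstream code can keep using positional indices.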

data elements not exist

I installed XenonPy in development mode by following this page:
https://xenonpy.readthedocs.io/en/latest/installation.html

Then I tried to check preset.elements as described on this page:
https://xenonpy.readthedocs.io/en/latest/tutorial/1-dataset.html

and got the following error message.

>>> preset.elements
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hagawa/codes/xenonpy/XenonPy/xenonpy/datatools/preset.py", line 230, in elements
    self._check('elements')
  File "/home/hagawa/codes/xenonpy/XenonPy/xenonpy/datatools/preset.py", line 192, in _check
    "data {0} not exist, please run <Preset.sync('{0}')> to download from the repository".format(data))
RuntimeError: data elements not exist, please run <Preset.sync('elements')> to download from the repository

I cannot calculate descriptors without the preset.elements data.
How can I get it?

>>> from xenonpy.datatools import preset

does not raise any error.
I'm using Ubuntu 18.04 and Anaconda.

[WIP] Improve docs

HTTPError message does not shown when response return 504

Hello. Thank you for the awesome package.
I have found a minor problem with how an exception is raised.

I would like to check which models are available. Could you provide a list of the trained models for my research?
In addition, I ran a simple query to get all model data.
When the server returns a 504 status, the raised message is not the one you intended.

mdl = MDL()
models = mdl("_", save_to=False)

C:\Anaconda3\lib\json\decoder.py in raw_decode(self, s, idx)
    355             obj, end = self.scan_once(s, idx)
    356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
    358         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

On a 504 error, ret.json() fails because the response body is not JSON, so the
JSONDecodeError replaces the intended HTTPError.

    def query(self, query, variables):
        payload = json.dumps({'query': query, 'variables': variables})
        ret = requests.post(url=self._url, headers=self._headers, data=payload)
        if ret.status_code != 200:
            raise HTTPError('status_code: %s, %s' %
                            (ret.status_code, ret.json()))
        ret = ret.json()['data']
        return ret

ipdb> ret
<Response [504]>
ipdb> ret.content
b'<html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body bgcolor="white">\r\n<center><h1>504 Gateway Time-out</h1></center>\r\n<hr><center>nginx/1.14.2</center>\r\n</body>\r\n</html>\r\n'
ipdb> ret.json()
*** json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I look forward to your reply. Thank you very much.
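
One way to keep the HTTPError visible is to fall back to the raw text when the body is not JSON. The sketch below is not the MDL implementation: `HTTPError` here is a stand-in class, and `error_detail` and `check_response` are hypothetical helpers that only assume a response object with `status_code`, `text` and `json()`, as a `requests` response has.

```python
class HTTPError(Exception):
    """Stand-in for the HTTPError raised by the query method."""


def error_detail(response):
    """Return the parsed JSON body if possible, otherwise the raw text."""
    try:
        return response.json()
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return response.text


def check_response(response):
    """Raise HTTPError on a non-200 status, otherwise return the 'data' field."""
    if response.status_code != 200:
        raise HTTPError(
            "status_code: %s, %s" % (response.status_code, error_detail(response))
        )
    return response.json()["data"]
```

With this shape, a 504 page from nginx ends up inside the HTTPError message instead of triggering a JSONDecodeError.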

Add error handling to NGram fit function

When training an NGram, some SMILES may fail.
Instead of raising an error immediately, it is important to record the SMILES that caused the error and continue.
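
A minimal sketch of that idea follows. It does not use the real NGram class: `fit_collecting_errors` is a hypothetical wrapper, and `update` stands for whatever per-SMILES training step the model exposes.

```python
from typing import Callable, Iterable, List, Tuple


def fit_collecting_errors(
    smiles: Iterable[str],
    update: Callable[[str], None],
) -> List[Tuple[str, str]]:
    """Apply `update` to every SMILES and return the ones that failed.

    Each failure is recorded as (smiles, repr(exception)) so training can
    continue and the caller can inspect the problem strings afterwards.
    """
    failed: List[Tuple[str, str]] = []
    for smi in smiles:
        try:
            update(smi)
        except Exception as exc:  # deliberately broad: one bad string must not stop the loop
            failed.append((smi, repr(exc)))
    return failed
```

Returning the failures rather than only logging them lets the caller decide whether to retry, report, or drop them.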

[WIP] Implement Molecular Fingerprint descriptors.

Implement rdkit fingerprint.

Add rdkit to the dependencies
Implementation of xenonpy.descriptor.Fingerprints class

Error on handling single bond directly followed by a ring number

[RFC] too complex to install

I failed to install xenonpy with anaconda3 on Linux, and I also failed to use the xenonpy Docker image on Linux. (It looks like the Docker image does not work because of a proxy problem.)
I am afraid that xenonpy is now too tightly integrated to install easily. How about separating it into layers, e.g. preset features, feature calculators, MDL, and so on, depending on users' purposes?

[RFC][WIP] Add more query functions into MDL

We need to add more query functions to the MDL class. Here I want to collect suggestions on which functions should be given priority.

list_properties; return a list of all property names.
query_properties; return a list of matching properties.
list_mode_sets; return a list of all model set names.
query_mode_sets; return a list of matching model sets.
pull_mode_set; return the details of a model set and download the models in it.
list_descriptors; return a list of all descriptor names.
query_descriptors; return a list of matching descriptors.

Sparse vector/matrix treatment for descriptor

Update the descriptor class to support sparse vectors/matrices and their use during model training and prediction (with a converter?).

[WIP] Implement Neural Message Passing interface

https://arxiv.org/abs/1704.01212

get_prob error causes interruption

The modifier needs a try/except so that a single failed molecule modification does not stop the iteration (the failure is caused by get_prob not finding a substring in the N-gram table).

Descriptor calculation using "elements"

Instead of raising an error when an element is not found, you should have an option to fill in with NaN.
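
Such an option might look like the sketch below. It is independent of XenonPy's own descriptor classes: `element_features` and `on_missing` are hypothetical names, and the table is a plain dict from element symbol to feature list.

```python
import math
from typing import Dict, List, Sequence


def element_features(
    symbols: Sequence[str],
    table: Dict[str, List[float]],
    n_features: int,
    on_missing: str = "raise",
) -> List[List[float]]:
    """Look up one feature row per element symbol.

    on_missing="raise" keeps the strict behaviour; on_missing="nan" fills
    the row of an unknown element with NaN instead.
    """
    if on_missing not in ("raise", "nan"):
        raise ValueError("on_missing must be 'raise' or 'nan'")
    rows: List[List[float]] = []
    for symbol in symbols:
        if symbol in table:
            rows.append(list(table[symbol]))
        elif on_missing == "nan":
            rows.append([math.nan] * n_features)
        else:
            raise KeyError(f"element {symbol!r} not found")
    return rows
```

Filling with NaN keeps the row count equal to the input length, so the missing entries can be dropped or imputed later in one place.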

[RFC] Consistency checking after removing error descriptors

In XenonPy/xenonpy/inverse/iqspr/estimator.py, there is a bug.

XenonPy/xenonpy/inverse/iqspr/estimator.py

Lines 72 to 77 in fefc5cc

    # remove NaN fromm X
    desc = self._descriptor.transform(smiles)
    desc = pd.DataFrame(data=desc).reset_index(drop=True)
    y = y.reset_index(drop=True)
    desc.dropna(inplace=True)
    y = y.loc[desc.index]

'desc' and 'y' are combined even if their lengths differ.
The previous sample/iQSPR.ipynb had a bug related to this problem in cell In[11] (already fixed).

I think you should check that the SMILES list and the property DataFrame have the same length.
My suggestion is the following.

if len(smiles) != y.shape[0]:
    raise RuntimeError('len(smiles) != y.shape[0]')

Thank you.
And I apologize for my poor English.

[WIP] Refactor NN model runner

This refactor will introduce RNN training and automatic saving of training losses.

[WIP] Add option for user specific clustering index into "DescriptorHeatmap".

[WIP] Add Fine Tuning

Add one-hot-vector encoding for elements in Compositional descriptor.

Remove batched dataset to avoid the potential copyright infringement

Materials Project and OQMD have no terms for redistribution.

Add default save function to Dataset class

[WIP] Add iqspr

[RFC] Remove python3.7 support

The official conda channel of rdkit does not yet support Python 3.7 on OSX. Just remove it temporarily.

Rename fingerprint feature and add "RdkitFP" featurizer

Rename xxxFingerprint to xxxFP
Rename "MorganFingerprintWithFeature" to FCFP
Rename "MorganFingerprint" to ECFP
Add "RdkitFP" feature

Add functions to help the sample dataset building

We need to add some helper functions to help users build sample datasets.

Add sample code to the samples.
Add building functions to the Preset class.

[WIP] Let fingerprint featurizers transform SMILES directly

Fingerprint featurizers will have an init parameter to control the input type.

Add flexibility to handle multiple descriptors for multiple properties

Set descriptor default to None.
Estimator now accepts a dictionary of tuples; the first element must be a BaseDescriptor and the second a BaseEstimator.

Change default file extension in Dataset

File extension will be changed.

before	after	description
`pkl.pd_`	`pd(.*)`	pandas.DataFrame object file
`csv`	`csv`	csv file
`xlsx`	`xlsx` or `xls`	excel file
`pkl.z_`	`pkl(.*)`	common pickled files

Class Splitter add cv method

This will also remove Splitter.index property.

Merge OFM codes

Ngram update functionality

Allow weighted combination of multiple Ngram, trained separately.
(Need to watch out for how to track the number of trained data in each case)
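
One possible reading of "weighted combination" is sketched below; it is an assumption, not the NGram class's design. Each count table is first turned into relative frequencies, so a table trained on more data does not dominate by size alone, and the frequencies are then averaged with normalised weights. `combine_ngram_tables` is a hypothetical name; the per-table training sizes would still have to be stored separately.

```python
from collections import Counter
from typing import Dict, Sequence


def combine_ngram_tables(
    tables: Sequence[Counter],
    weights: Sequence[float],
) -> Dict[str, float]:
    """Weighted average of the relative frequencies of several count tables."""
    if not tables or len(tables) != len(weights):
        raise ValueError("need one weight per table")
    total_weight = float(sum(weights))
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    combined: Dict[str, float] = {}
    for table, weight in zip(tables, weights):
        n_counts = sum(table.values())
        if n_counts == 0:
            continue  # an empty table contributes nothing
        for key, count in table.items():
            share = (weight / total_weight) * (count / n_counts)
            combined[key] = combined.get(key, 0.0) + share
    return combined
```

If every table is non-empty, the combined values sum to 1 and can be read as a mixture distribution over the keys.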

bug in set_para for BaseDescriptor

set_para does not work properly in BaseDescriptor because the assigned values are not passed to the BaseFeaturizers inside.
E.g., when setting on_errors with .set_para in BaseDescriptor, on_errors is not actually passed to the BaseFeaturizers defined inside.
A new definition of set_para in BaseDescriptor is needed.
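
The fix described here amounts to forwarding the parameters to every contained featurizer. The sketch below uses toy stand-ins (`ToyFeaturizer`, `ToyDescriptor`) rather than the real BaseFeaturizer and BaseDescriptor, whose internals are not shown in this issue.

```python
class ToyFeaturizer:
    """Stand-in for a featurizer with an on_errors setting."""

    def __init__(self, on_errors="raise"):
        self.on_errors = on_errors

    def set_para(self, **params):
        for name, value in params.items():
            if not hasattr(self, name):
                raise ValueError(f"unknown parameter {name!r}")
            setattr(self, name, value)
        return self


class ToyDescriptor:
    """Stand-in container that forwards set_para to its featurizers."""

    def __init__(self, *featurizers):
        self._featurizers = list(featurizers)

    def set_para(self, **params):
        # Forward to every inner featurizer so the setting actually takes effect.
        for featurizer in self._featurizers:
            featurizer.set_para(**params)
        return self
```

Usage would be `ToyDescriptor(f1, f2).set_para(on_errors="nan")`, after which both `f1.on_errors` and `f2.on_errors` are "nan".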

Support of multiple BaseLikelihood and annealing schedule

Convert the current log_likelihood calculation to support multiple BaseLikelihood and annealing schedule (beta).
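
As a rough sketch of what this could mean, assume the likelihoods are independent so their log values add, and that the annealing parameter beta scales the summed log-likelihood from 0 to 1. Both function names below are hypothetical and the linear schedule is only one choice.

```python
from typing import Callable, List, Sequence


def annealed_log_likelihood(
    x,
    log_likelihoods: Sequence[Callable[[object], float]],
    beta: float,
) -> float:
    """Sum several log-likelihoods of x and temper the sum with beta."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    return beta * sum(ll(x) for ll in log_likelihoods)


def linear_beta_schedule(n_steps: int) -> List[float]:
    """Beta values rising linearly from 0 to 1 over n_steps steps."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    if n_steps == 1:
        return [1.0]
    return [i / (n_steps - 1) for i in range(n_steps)]
```

At beta = 0 the likelihood has no influence, and at beta = 1 the full summed log-likelihood is used.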

Add feature selection for "BaseDescriptor" class

We designed BaseDescriptor as a container of BaseFeaturizer calculators. With it, users can bundle many featurizers as a preset for pipelining. This proposal adds a feature selection function to the BaseDescriptor class.

Proposal

Assume we have some class like this:

class BaseDescriptor:
    def __init__(self, *, featurizers='all'):
        ...

class NewDescriptor(BaseDescriptor):
    def __init__(self, *, featurizers='all'):
        super().__init__(featurizers=featurizers)
        ...

        # several featurizers are assigned to the same input name
        self.input = Featurizer1()
        self.input = Featurizer2()
        self.input = Featurizer3()

descriptor = NewDescriptor()

In this case, if the input has a column named input, the descriptor calculates all features associated with self.input and then concatenates them. This is exactly what the current version does.

In this proposal, the user can initialize NewDescriptor with a parameter called featurizers. This parameter contains the names of featurizers. Only featurizers whose names appear in featurizers will be activated.

In the following example, only 'Featurizer1' and 'Featurizer3' will be calculated.

descriptor = NewDescriptor(featurizers=['Featurizer1', 'Featurizer3'])
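
The selection rule in this proposal can be illustrated with a small stand-alone sketch. `SelectableDescriptor`, `register` and `active` are hypothetical names, and matching on the class name is an assumption about what "name" means here.

```python
class Featurizer1:
    pass


class Featurizer2:
    pass


class Featurizer3:
    pass


class SelectableDescriptor:
    """Toy container that activates only the featurizers named at init."""

    def __init__(self, *, featurizers="all"):
        self._selected = featurizers
        self._registry = []

    def register(self, featurizer):
        self._registry.append(featurizer)
        return self

    def active(self):
        """Return the featurizers that would be calculated."""
        if self._selected == "all":
            return list(self._registry)
        return [f for f in self._registry if type(f).__name__ in self._selected]
```

With the default `featurizers="all"` the behaviour matches the current version, so existing code would not need to change.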

[Doc] Update the docs of sync built-in data using preset

Parallelize NGram training

data source and new data

Could you add Prof. Oguchi's atomic data from his first-principles calculations? I think you have them.
Is it possible to show both the original feature values with NaN and the interpolated ones? How about switching between them with a flag?
Can you show the data sources of the features? Sometimes their names are too vague to tell what the descriptor is. You can't correct the values if you don't know exactly what they are.

Cannot download pre-trained models

I found that I can download the models for "stable inorganic compounds for materials project" but cannot download those for "QM9 Dataset" or "PolymerGenome Dataset".

When I tried mdl.pull(urls), it responded as follows:

FileNotFoundError: [Errno 2] No such file or directory: '~\S3\organic.nonpolymer.mu_debye\rcdk.fp.fingerprint\mxnet.nn.neural_network\shotgun_mu_Debye_randFP4975_corr-0.7528_mxnet_294-101-28-1_2018-06-13\model-175724d\shotgun_mu_Debye_randFP4975_corr-0.7528_mxnet_294-101-28-1_2018-06-13-045255-symbol.json'

Does anyone have any idea about this problem?

BayesianRidgeEstimator.fit dropna issue

When the property NaN rows and the descriptor NaN rows do not match (neither is a subset of the other), could the rows become mismatched and cause an error in fitting?
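
A way to avoid that mismatch is to build one joint mask and apply it to both sides. The sketch below uses plain lists so the logic is visible; it is not the estimator's code, and `drop_incomplete_rows` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple


def drop_incomplete_rows(
    desc_rows: Sequence[Sequence[float]],
    y_values: Sequence[float],
) -> Tuple[List[List[float]], List[float], List[int]]:
    """Keep only rows where neither the descriptor nor the property is NaN.

    Returns the filtered descriptors, the filtered properties and the
    original positions of the kept rows.
    """
    if len(desc_rows) != len(y_values):
        raise ValueError("descriptor and property must have the same length")
    keep = [
        i
        for i, (row, y) in enumerate(zip(desc_rows, y_values))
        if not math.isnan(y) and not any(math.isnan(v) for v in row)
    ]
    return [list(desc_rows[i]) for i in keep], [y_values[i] for i in keep], keep
```

Because both outputs come from the same index list, they stay aligned even when the NaN rows of the two inputs only partly overlap.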

Add migration function

Because of #20, users have to move their own data from ~/.xenonpy/dataset to the userdata location specified in ~/.xenonpy/conf.yml.

We should add a migration function to help with that.

$ python -m xenonpy migrate

Add "output_layer" parameter to NN model generator classes

[RFC] Remove python3.5 support

New versions of rdkit, pandas and pymatgen no longer officially support Python 3.5. It's time for us to end Python 3.5 support.

[WIP] Allow 'BaseDescriptor' class to use anonymous/renamed input

Until now, the BaseDescriptor class has used a fixed name to get input data from a pd.DataFrame object, like:

class YouDescriptor(BaseDescriptor):
    def __init__(self, n_jobs=-1):
        self.descriptor1 = Feature1(n_jobs)

Here descriptor1 must be the column name of the input. Using this class looks something like:

descriptor = YouDescriptor()
input = pd.Series(<'list of samples'>, name='descriptor1')
output = descriptor.fit_transform(input)

That's not flexible in the following 2 cases:

only one input is needed.
multiple descriptors use the same input.

This issue is a proposal to improve the user experience.

allow a list as input when there is a single descriptor type.
use name mapping: inner_name: user_given_name.

[WIP] Implement MDL features

Simple search
Download

How to use R model?

In the last part of XenonPy/samples/mdl.ipynb, it is announced that the R tutorials will be released.
I want to use the R model, since the Python model is available only for inorganic crystals.
Please disclose how to calculate the fingerprint values that are input to the R model.
Some reference code would be enough.

Add appveyor CI for window testing

Implement "FrozenFeaturizer" class

[WIP] Add error handling to "BaseFeaturizer"

minor bugs found when using frozen featurizer in iQSPR

Add functionality to allow automatic label detection in BaseFeaturizer when we use n_job = 0.
BaseEstimator bug: passing a pd.Series (one column of a pandas DataFrame) as the property input to "fit" raises an error.

Add progress bar for ngram training

Allow to load a user determined dataset version

We specify the db version in the conf.yml file to sync the user's dataset with ours, but this is not flexible.

This is a proposal to refactor the data loader to allow loading a user-determined version of the dataset. It can be used like this:

from xenonpy.datatools import Preset
with Preset(ver='0.1.1') as preset:
    data = preset.mp_structure

refactor class Preset
