sparks-baird / xtal2png

Encode/decode a crystal structure to/from a grayscale PNG image for direct use with image-based machine learning models such as Palette.

Home Page: https://xtal2png.readthedocs.io/

License: MIT License

Dockerfile 1.64% Python 72.35% Mathematica 3.95% TeX 22.05%
crystallography image-processing machine-learning materials-informatics python materials-science

xtal2png's Introduction

⚠️ Manuscript and results using a generative model coming soon ⚠️

xtal2png

Encode/decode a crystal structure to/from a grayscale PNG image for direct use with image-based machine learning models such as Google's Imagen.

The latest advances in machine learning are often in natural language processing (e.g. LSTMs and transformers) or image processing (e.g. GANs, VAEs, and guided diffusion models). Encoding/decoding crystal structures via grayscale PNG images is akin to making/reading a QR code for crystal structures. This allows you, as a materials informatics practitioner, to get streamlined results for new state-of-the-art image-based machine learning models applied to crystal structures. Take Google's text-to-image diffusion model, Imagen (unofficial), which can also be used as an image-to-image model. Rather than spending hours, days, or weeks digging into the code, modifying, debugging, and playing GitHub phone tag with the developers before you can (maybe) get preliminary results, xtal2png lets you get those results using the default instructions on the repository.

After getting preliminary results, you get to decide whether it's worth it to you to take on the higher-cost/higher-expertise task of modifying the codebase and using a more customized approach. Or, you can stick with the results of xtal2png. It's up to you!

Getting Started

Installation

conda create -n xtal2png -c conda-forge xtal2png m3gnet
conda activate xtal2png

NOTE: m3gnet is an optional dependency that performs surrogate DFT relaxation.

Example

Here, we use the top-level XtalConverter class with and without optional relaxation via m3gnet.

# example_structures is a list of `pymatgen.core.structure.Structure` objects
>>> from xtal2png import XtalConverter, example_structures
>>>
>>> xc = XtalConverter(relax_on_decode=False)
>>> data = xc.xtal2png(example_structures, show=True, save=True)
>>> decoded_structures = xc.png2xtal(data, save=False)
>>> len(decoded_structures)
2

>>> xc = XtalConverter(relax_on_decode=True)
>>> data = xc.xtal2png(example_structures, show=True, save=True)
>>> relaxed_decoded_structures = xc.png2xtal(data, save=False)
>>> len(relaxed_decoded_structures)
2

Output

print(example_structures[0], decoded_structures[0], relaxed_decoded_structures[0])
Original
Structure Summary
Lattice
    abc : 5.033788 11.523021 10.74117
 angles : 90.0 90.0 90.0
 volume : 623.0356027127609
      A : 5.033788 0.0 3.0823061808931787e-16
      B : 1.8530431062799525e-15 11.523021 7.055815392078867e-16
      C : 0.0 0.0 10.74117
PeriodicSite: Zn2+ (0.9120, 5.7699, 9.1255) [0.1812, 0.5007, 0.8496]
PeriodicSite: Zn2+ (4.1218, 5.7531, 1.6156) [0.8188, 0.4993, 0.1504]
...
Decoded
Structure Summary
Lattice
    abc : 5.0250980392156865 11.533333333333331 10.8
 angles : 90.0 90.0 90.0
 volume : 625.9262117647058
      A : 5.0250980392156865 0.0 0.0
      B : 0.0 11.533333333333331 0.0
      C : 0.0 0.0 10.8
PeriodicSite: Zn (0.9016, 5.7780, 3.8012) [0.1794, 0.5010, 0.3520]
PeriodicSite: Zn (4.1235, 5.7554, 6.9988) [0.8206, 0.4990, 0.6480]
...
Relaxed Decoded
Structure Summary
Lattice
    abc : 5.026834307381214 11.578854613685237 10.724087971087924
 angles : 90.0 90.0 90.0
 volume : 624.1953646135236
      A : 5.026834307381214 0.0 0.0
      B : 0.0 11.578854613685237 0.0
      C : 0.0 0.0 10.724087971087924
PeriodicSite: Zn (0.9050, 5.7978, 3.7547) [0.1800, 0.5007, 0.3501]
PeriodicSite: Zn (4.1218, 5.7810, 6.9693) [0.8200, 0.4993, 0.6499]
...

The before and after structures match within an expected tolerance; note the round-off error due to encoding numerical data as grayscale PNG images, which have a coarse resolution of approximately 1/255 ≈ 0.00392. Note also that the decoded version lacks charge states. The QR-code-like intermediate PNG image is also provided in original size and a scaled version for a better viewing experience:

[Figure: encoded PNG of Zn8B8Pb4O24,volume=623,uid=bc2d, shown at original size (64x64 pixels) and scaled for better viewing (tool credit), with a legend]

Additional examples can be found in the docs.

Limitations and Design Considerations

There are some limitations and design considerations for xtal2png. Here, we cover round-off error, image dimensions, contextual features, and customization.

Round-off

While the round-off error is a necessary evil for encoding to a PNG file format, the unrounded NumPy arrays can be used directly instead, via structures_to_arrays and arrays_to_structures, if the image model of interest supports them.
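To get a feel for the size of this round-off error, here is a minimal NumPy sketch of an 8-bit round trip. The `quantize` and `dequantize` helpers are illustrative stand-ins written for this example, not xtal2png functions, and the fractional-coordinate range of 0 to 1 is assumed for the demonstration.

```python
import numpy as np


def quantize(x, lo, hi):
    """Map values in [lo, hi] to integers 0..255 (stand-in for PNG encoding)."""
    scaled = (np.asarray(x, dtype=float) - lo) / (hi - lo) * 255
    return np.round(scaled).astype(np.uint8)


def dequantize(q, lo, hi):
    """Map integers 0..255 back to the original value range."""
    return np.asarray(q, dtype=float) / 255 * (hi - lo) + lo


# Random fractional coordinates in [0, 1) as toy data
rng = np.random.default_rng(0)
frac = rng.random(1000)

decoded = dequantize(quantize(frac, 0.0, 1.0), 0.0, 1.0)
max_error = np.max(np.abs(frac - decoded))

# Rounding to the nearest of 256 levels loses at most half a step
half_step = 0.5 / 255
print(max_error, half_step)
```

For a feature with a wider range, the same half-step bound scales with the width of that range, so coarser features lose more absolute precision per round trip.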

Image dimensions

We choose a $64\times64$ representation by default, which supports up to 52 sites within a unit cell. The maximum number of sites, max_sites, can be adjusted, which changes the size of the representation. A square representation is used for greater compatibility, since many image-based models support only square image arrays. The choice of the default side length as a power of two (i.e. $2^6$) reflects common conventions for low-resolution images in image-based machine learning tasks.

Contextual features

While the distance matrix does not directly contribute to the reconstruction in the current implementation of xtal2png, it serves a number of purposes. First, similar to the unit cell volume and space group information, it can provide additional guidance to the algorithm. A corresponding example would be the role of background vs. foreground in classification of wolves vs. huskies; oftentimes classification algorithms will pay attention to the background (such as presence of snow) in predicting the animal class. Likewise, providing contextual information such as volume, space group, and a distance matrix is additional information that can help the models to capture the essence of particular crystal structures. In a future implementation, we plan to reconstruct Euclidean coordinates from the distance matrices and homogenize (e.g. via weighted averaging) the explicit fractional coordinates with the reconstructed coordinates.

Customization

See the docs for the full list of customizable parameters that XtalConverter takes.

Installation

PyPI (pip) installation

Create and activate a new conda environment named xtal2png (-n) with python==3.9.* or your preferred Python version, then install xtal2png via pip.

conda create -n xtal2png python==3.9.*
conda activate xtal2png
pip install xtal2png

Editable installation

In order to set up the necessary environment:

  1. clone and enter the repository via:

    git clone https://github.com/sparks-baird/xtal2png.git
    cd xtal2png
  2. create and activate a new conda environment (optional, but recommended)

    conda create --name xtal2png python==3.9.*
    conda activate xtal2png
  3. perform an editable (-e) installation in the current directory (.):

    pip install -e .

NOTE: Some changes, e.g. in setup.cfg, might require you to run pip install -e . again.

Optional and needed only once after git clone:

  1. install several pre-commit git hooks with:

    pre-commit install
    # You might also want to run `pre-commit autoupdate`

    and check out the configuration under .pre-commit-config.yaml. The -n, --no-verify flag of git commit can be used to deactivate pre-commit hooks temporarily.

  2. install nbstripout git hooks to remove the output cells of committed notebooks with:

    nbstripout --install --attributes notebooks/.gitattributes

    This is useful to avoid large diffs due to plots in your notebooks. A simple nbstripout --uninstall will revert these changes.

Then take a look into the scripts and notebooks folders.

Command Line Interface (CLI)

Make sure to install the package first per the installation instructions above. Here is how to access the help for the CLI and a few examples to get you started.

Help

You can see the usage information of the xtal2png CLI script via:

xtal2png --help
Usage: xtal2png [OPTIONS]

 xtal2png command line interface.

Options:
 --version                 Show version.
 -p, --path PATH           Crystallographic information file (CIF) filepath
                           (extension must be .cif or .CIF) or path to
                           directory containing .cif files or processed PNG
                           filepath or path to directory containing processed
                           .png files (extension must be .png or .PNG).
                           Assumes CIFs if --encode flag is used. Assumes
                           PNGs if --decode flag is used.
 -s, --save-dir PATH       Encode CIF files as PNG images.
 --encode                  Encode CIF files as PNG images.
 --decode                  Decode PNG images to CIF files.
 -v, --verbose TEXT        Set loglevel to INFO.
 -vv, --very-verbose TEXT  Set loglevel to INFO.
 --help                    Show this message and exit.

Examples

To encode a single CIF file located at src/xtal2png/utils/Zn2B2PbO6.cif as a PNG and save the PNG to the tmp directory:

xtal2png --encode --path src/xtal2png/utils/Zn2B2PbO6.cif --save-dir tmp

To encode all CIF files contained in the src/xtal2png/utils directory as PNGs and save the corresponding PNGs to the tmp directory:

xtal2png --encode --path src/xtal2png/utils --save-dir tmp

To decode a single structure-encoded PNG file located at data/preprocessed/Zn8B8Pb4O24,volume=623,uid=b62a.png as a CIF file and save the CIF file to the tmp directory:

xtal2png --decode --path data/preprocessed/Zn8B8Pb4O24,volume=623,uid=b62a.png --save-dir tmp

To decode all structure-encoded PNG files contained in the data/preprocessed directory as CIFs and save the CIFs to the tmp directory:

xtal2png --decode --path data/preprocessed --save-dir tmp

Note that the save directory (e.g. tmp) including any parents (e.g. ab/cd/tmp) will be created automatically if the directory does not already exist.

Project Organization

├── AUTHORS.md              <- List of developers and maintainers.
├── CHANGELOG.md            <- Changelog to keep track of new features and fixes.
├── CONTRIBUTING.md         <- Guidelines for contributing to this project.
├── Dockerfile              <- Build a docker container with `docker build .`.
├── LICENSE.txt             <- License as chosen on the command-line.
├── README.md               <- The top-level README for developers.
├── configs                 <- Directory for configurations of model & application.
├── data
│   ├── external            <- Data from third party sources.
│   ├── interim             <- Intermediate data that has been transformed.
│   ├── preprocessed        <- The final, canonical data sets for modeling.
│   └── raw                 <- The original, immutable data dump.
├── docs                    <- Directory for Sphinx documentation in rst or md.
├── environment.yml         <- The conda environment file for reproducibility.
├── models                  <- Trained and serialized models, model predictions,
│                              or model summaries.
├── notebooks               <- Jupyter notebooks. Naming convention is a number (for
│                              ordering), the creator's initials and a description,
│                              e.g. `1.0-fw-initial-data-exploration`.
├── pyproject.toml          <- Build configuration. Don't change! Use `pip install -e .`
│                              to install for development or to build `tox -e build`.
├── references              <- Data dictionaries, manuals, and all other materials.
├── reports                 <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures             <- Generated plots and figures for reports.
├── scripts                 <- Analysis and production scripts which import the
│                              actual PYTHON_PKG, e.g. train_model.
├── setup.cfg               <- Declarative configuration of your project.
├── setup.py                <- [DEPRECATED] Use `python setup.py develop` to install for
│                              development or `python setup.py bdist_wheel` to build.
├── src
│   └── xtal2png            <- Actual Python package where the main functionality goes.
├── tests                   <- Unit tests which can be run with `pytest`.
├── .coveragerc             <- Configuration for coverage reports of unit tests.
├── .isort.cfg              <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.

Note on PyScaffold

This project has been set up using PyScaffold 4.2.1 and the dsproject extension 0.7.1.

To create the same starting point for this repository, as of 2022-06-01 on Windows you will need the development versions of PyScaffold and extensions. This will no longer be necessary once certain bugfixes have been introduced in the next stable releases:

pip install git+https://github.com/pyscaffold/pyscaffold.git git+https://github.com/pyscaffold/pyscaffoldext-dsproject.git git+https://github.com/pyscaffold/pyscaffoldext-markdown.git

The following pyscaffold command creates a starting point for this repository:

putup xtal2png --github-actions --markdown --dsproj

Alternatively, you can edit a file interactively and update and uncomment relevant lines, which saves some of the additional setup:

putup --interactive xtal2png

Attributions

  • @michaeldalverson for iterating through various representations during extensive work with crystal GANs. The base representation for xtal2png (see #output) closely follows a recent iteration (2022-06-13), taking the first layer ($1\times64\times64$) of the $4\times64\times64$ representation and replacing a buffer column/row of zeros with unit cell volume.

xtal2png's People

Contributors

dependabot[bot], github-actions[bot], hasan-sayeed, kjappelbaum, sgbaird

xtal2png's Issues

Get default feature ranges based on conventional cells

Follow-up to #36. The idea is to add to https://github.com/sparks-baird/xtal2png/blob/main/notebooks/2.0-materials-project-feature-ranges.ipynb or make a copy and modify it so the structures are reduced to their primitive conventional representations.

from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

spa = SpacegroupAnalyzer(structure, symprec=0.1, angle_tolerance=5.0)
# structure = spa.get_primitive_standard_structure()
structure = spa.get_conventional_standard_structure()

@hasan-sayeed you OK with tackling this? Should be pretty straightforward, just splicing the above code into a copy of the linked notebook.

Permutation invariance - site ordering and data augmentation

xref: sparks-baird/matbench-genmetrics#77

If the sites aren't already sorted, best to sort. Perhaps using s.copy(sanitize=True). Can add as a hyperparameter. Shouldn't affect the xtal2png encoding and decoding process to swap the order of sites.

Data augmentation is something I've considered, but with 52 sites, the combinatorial space is enormous and probably intractable. In the worst case with 52 sites, and if I'm thinking about this correctly, that's nPr = $^{52}P_{52} = 52! \approx 8.07 \times 10^{67}$. Could maybe do partial data augmentation where sites sharing the same element undergo permutation data augmentation locally, but even that might be intractable.
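As a quick check of that count, and of how much the "partial" idea would shrink it, here is a small standard-library sketch. The composition Zn8B8Pb4O24 is borrowed from the README example purely as an illustration; counting permutations only within groups of identical elements is one possible reading of the partial augmentation described above.

```python
import math
from collections import Counter

# Worst case: every ordering of 52 distinct sites
n_sites = 52
full = math.perm(n_sites, n_sites)  # nPr with r = n, i.e. 52!
print(f"{full:.2e}")

# Partial augmentation (assumed reading): permute sites only among
# sites that share the same element, illustrated with Zn8B8Pb4O24
elements = ["Zn"] * 8 + ["B"] * 8 + ["Pb"] * 4 + ["O"] * 24
counts = Counter(elements)
partial = math.prod(math.factorial(c) for c in counts.values())
print(f"{partial:.2e}")
```

Even the per-element count is far too large to enumerate, which is consistent with the concern that partial augmentation might also be intractable; random sampling of permutations would be the more realistic option.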

Consider using Wyckoff positions instead of atomic sites

Almost always fewer than the total number of sites, e.g. mp-19841, mp-500, mp-19770. Would cut down the dimensionality by a significant amount, though it also introduces the possibility of representations that aren't able to be reconstructed (disregarding whether the structure would be realistic or not). See e.g. pyxtal: checking compatibility. There's also the question of how to numerically encode the Wyckoff position types. Might be worth digging more into Wren's representation https://github.com/CompRhys/aviary.

Maybe use `get_sites_in_sphere` to calculate pairwise distance matrix, so always 52 sites

Would be via get_all_neighbors(...) or probably preferably get_sites_in_sphere(...)

Pairwise distances could be calculated using get_distance(...) or by digging into how distance_matrix is calculated, which probably uses get_distance(...).

Better method would probably be to extract the Cartesian coordinates of each site from get_sites_in_sphere(...) and just use all_distance per what happens with distance_matrix.

`xtal2png` without arguments throws `ValueError`

When I execute xtal2png without any command line arguments, I get a ValueError:

(xtal2png) kraus@dorje:~::xtal2png 
Traceback (most recent call last):
  File "/home/kraus/Applications/miniconda3/envs/xtal2png/bin/xtal2png", line 10, in <module>
    sys.exit(run())
  File "/home/kraus/Applications/miniconda3/envs/xtal2png/lib/python3.10/site-packages/xtal2png/core.py", line 1141, in run
    main(sys.argv[1:])
  File "/home/kraus/Applications/miniconda3/envs/xtal2png/lib/python3.10/site-packages/xtal2png/core.py", line 1112, in main
    raise ValueError("Specify at least one of --encode or --decode")
ValueError: Specify at least one of --encode or --decode

It would be much cleaner to print the usage message, or a different prompt telling the user what to do, rather than crashing on the user.
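One way the suggested behavior could look is sketched below with the standard-library argparse module. This is a stand-in written for illustration and is not the actual xtal2png CLI code; the program name `xtal2png-sketch` and the return values are hypothetical. `parser.error` prints the usage line plus the message to stderr and exits with status 2 instead of raising a traceback.

```python
import argparse


def build_parser():
    """Build a toy parser with the two mode flags from the issue."""
    parser = argparse.ArgumentParser(prog="xtal2png-sketch")
    parser.add_argument("--encode", action="store_true", help="Encode CIF files.")
    parser.add_argument("--decode", action="store_true", help="Decode PNG images.")
    return parser


def main(args):
    parser = build_parser()
    namespace = parser.parse_args(args)
    if not (namespace.encode or namespace.decode):
        # Prints usage and the message, then exits with status 2
        parser.error("Specify at least one of --encode or --decode")
    return "encode" if namespace.encode else "decode"


if __name__ == "__main__":
    import sys

    print(main(sys.argv[1:]))
```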

Why so many generated structures with hydrogen and/or helium?

'Ca2He2H1,volume=23,uid=3533.cif'
'Cr1Ag1H2He1,volume=31,uid=3d40.cif'
'Cs1Li1Ca1H2,volume=62,uid=8e9d.cif'
'Cs1Rb1La1Sn1Sb1Pd1H3I1Ne1He1,volume=177,uid=5495.cif'
'Cu1H3I1He1,volume=37,uid=3870.cif'
'K1Ca1Ac1Mg1Ti1Mn1Al1Cr1In2Ga1Co1Tc1Cu1Ag1Hg1Ge1Te2As1Pd1H3Rh1Se1C1Xe1I1Kr1He1,volume=1358,uid=eccb.cif'
'Li1Ti1H2Xe1Cl1,volume=37,uid=070b.cif'
'Mo1H1C1Kr1He1,volume=24,uid=879d.cif'
'Na1Li1H1Kr1,volume=18,uid=5905.cif'
'Na1Sr1Sc1H1C1Br1He3,volume=102,uid=7fcc.cif'
'Rb1Na1Li1Sc1Ti1Ni1Ru2H2Cl1,volume=123,uid=5fa0.cif'
'Rb1Sc1He1,volume=18,uid=90e7.cif'
'Rb1Ti1V1Ga1Ni1Ge1H2,volume=83,uid=cf25.cif'
'Sc1Cr1F1Ne1H1He1,volume=47,uid=5b5d.cif'
'Sr1Gd1Dy1Y1Ni1Pd1He1H2,volume=108,uid=cf05.cif'
'Zr1Tc1Cu1H4Se1Br1F1,volume=69,uid=4e8d.cif'

See also #79

JOSS paper review - Software paper

Hello @sgbaird,
In this issue, I will collate points that relate to the software paper itself. Link to review is: openjournals/joss-reviews#4528

  • the reference to the software repo seems unnecessary here (and throughout the paper). The link to the repo is right there in the front-matter!

    diffusion models. Using `xtal2png` [@xtal2png] to encode/decode crystal structures via grayscale PNG images (see

  • the summary is not clear about what is being encoded into what. My understanding is that xtal2png can take a coordinate file and create a fingerprint / QR-code-like representation of that coordinate file. If that's the case, it should be more explicit in the summary.

    e.g. \autoref{fig:64-bit}) is akin to making/reading a QR code for crystal structures.

  • streamlined results? Is the point here that it feeds well into image-based pipelines?

    This allows you, as a materials informatics practitioner, to get streamlined results for

  • The references in this section on domain transfer make it hard to read. Please rewrite - perhaps put the citations behind the dates.

    Another example of state-of-the-art algorithm domain transfer is refactoring image-processing models for crystal structure applications, with
    introduction [@kipfSemisupervisedClassificationGraph2016], domain transfer (preprint)
    [@xieCrystalGraphConvolutional2017], and peer-reviewed domain transferred
    [@xieCrystalGraphConvolutional2018] publication dates of Sep 2016, Oct 2017, and Apr

  • Here, rather than to-from, I'd say between coordinate file and png representation (or something similar):

    is a Python package that allows you to encode/decode a crystal structure to/from a
    grayscale PNG image for direct use with image-based machine learning models. Let's take

  • I am not sure I understand the relevance of Palette. Please elaborate.

  • It would be helpful to discuss the technical limitations somewhere. Perhaps it's in the docs, but in my view it could be commented on in the paper:

    the representation are given in \autoref{fig:example-and-legend}. Due to the encoding of
    numerical values as grayscale PNG images (allowable values are integers between 0 and
    255), a small round-off error is present during a single round of encoding and decoding.
    An example comparing an original vs. decoded structure is given in

  • First, is there a reason 64x64 pixel image has to be used? I understand it's possible to encode at most 52 atoms within the unit cell. Just by reading the paper, it's not obvious why the cell parameters (lattice constants, angles, volume, point group) occupy rows - seems like a waste of space. If that information was moved into the upper-left zero sector, there would be more space for the pairwise distance matrix.

  • Second, the issue of 255 possible values is a limitation of the grayscale requirement. If this was lifted, the precision could go up dramatically.

  • Third, I'm sure there's a good reason for it, but what's the distance matrix for? It can be calculated from the fractional coordinates. If all we want to do is represent the CIF file, there's no reason to include it.

  • I would rewrite this section, as a "potential" time saving is only potential.

    xtal2png/reports/paper.md

    Lines 105 to 106 in cc644f3

    dataset types, potentially saving days or weeks during the process of obtaining
    preliminary results on a newly released model.

  • There is no mention of any other software that provides similar functionality. If there truly is not any encoder/decoder for crystal structure to image data, perhaps a short section discussing existing software for molecules and/or textual representation could be included.

consider reconstructing fractional coordinates from distance matrix (via MDS or neural network)

i.e. using multi-dimensional scaling (MDS), or could be a trained network. If a trained network, an interesting approach might be to map the representation (including redundant information) to the directly used non-redundant inputs Structure(lattice, elements, coords).

If using MDS, the hyperparameter would probably be the weighted average of the direct fractional coordinates and the reconstructed fractional coordinates, so a scalar hyperparameter between 0 and 1.
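To make the MDS route concrete, here is a minimal NumPy sketch of classical (Torgerson) MDS on a toy set of points. It assumes a plain Euclidean distance matrix; a periodic distance matrix from a crystal may not satisfy that assumption exactly, so this is only an illustration of the technique, not a drop-in for xtal2png. The function names are made up for this example.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix for an (n, d) coordinate array."""
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def classical_mds(D, n_dims=3):
    """Recover coordinates (up to rigid motion/reflection) from distances."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-center the squared distances to obtain a Gram matrix
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D**2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    # Keep the n_dims largest eigenpairs; clip tiny negatives from round-off
    idx = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[idx], 0.0, None))
    return eigvecs[:, idx] * scale


# Toy Cartesian coordinates for six sites
rng = np.random.default_rng(0)
coords = rng.random((6, 3)) * 5.0
D = pairwise_distances(coords)

recovered = classical_mds(D)
D_recovered = pairwise_distances(recovered)
print(np.max(np.abs(D - D_recovered)))
```

Because the recovered coordinates are only defined up to rotation, reflection, and translation, they would need to be aligned to the explicit coordinates before any weighted averaging of the two.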

See also:

Anand, N.; Eguchi, R.; Huang, P.-S. Fully Differentiable Full-Atom Protein Backbone Generation. 2019. https://openreview.net/forum?id=SJxnVL8YOV

which uses pairwise distance matrix reconstruction.

Good discussion and examples about reconstructing directly vs. using a neural network in:

(1) Ovchinnikov, S.; Huang, P.-S. Structure-Based Protein Design with Deep Learning. Current Opinion in Chemical Biology 2021, 65, 136–144. https://doi.org/10.1016/j.cbpa.2021.08.004.

Should we set the lower bound of the `xtal2png` ranges as 0 or as the minimum across all MP compounds?

Implementing for now as minimum across all MP compounds, but this is another hyperparameter choice (technically, the hyperparameter might be a scaling factor between 0 and 1 of the minimum). This would be an alternative to leaving each of the parameter ranges as tunable parameters themselves, and might be a reasonable reduction to the search space (i.e. only leave the upper bound limits as individual tunable parameters).

Matbench task model accuracy for `xtal2png` representation

The task is to use a CNN model for a Matbench submission on regressing formation energy using the xtal2png representation (as an image and/or as an array would be fine). This will help with knowing how "good" the xtal2png representation is from a model accuracy perspective, though I don't expect this to set new benchmarks necessarily.

This might look like using skorch with some type of pytorch CNN module (e.g. ResNetUNet, Net) and an MSE loss function. This skorch tutorial looks like it might help with loading images, though this SO answer is probably better for making the actual dataset to pass to skorch.

If regression is too much of a pain (CNNs aren't used as often for property regression in the image-processing domain), an easy fallback is to do the mp_is_metal binary classification task instead of the e_form regression task.

Related:

Maybe Faris is interested in working on this, given that he'll be doing some image processing.

Suggestion: Add CLI parameter for `max_sites`

When running xtal2png with a large crystal, this (expected) error appears:

(xtal2png) kraus@dorje:~/xtal2png::xtal2png --encode -p test-01.cif
/home/kraus/Applications/miniconda3/envs/xtal2png/lib/python3.10/site-packages/pymatgen/io/cif.py:1155: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
  0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/kraus/Applications/miniconda3/envs/xtal2png/bin/xtal2png", line 10, in <module>
    sys.exit(run())
  File "/home/kraus/Applications/miniconda3/envs/xtal2png/lib/python3.10/site-packages/xtal2png/core.py", line 1141, in run
    main(sys.argv[1:])
  File "/home/kraus/Applications/miniconda3/envs/xtal2png/lib/python3.10/site-packages/xtal2png/core.py", line 1129, in main
    xc.xtal2png(fpaths, save=True)
  File "/home/kraus/Applications/miniconda3/envs/xtal2png/lib/python3.10/site-packages/xtal2png/core.py", line 265, in xtal2png
    self.data, self.id_data, self.id_mapper = self.structures_to_arrays(S)
  File "/home/kraus/Applications/miniconda3/envs/xtal2png/lib/python3.10/site-packages/xtal2png/core.py", line 573, in structures_to_arrays
    raise ValueError(
ValueError: crystal supplied with 240 sites, which is more than 52 sites. Remove crystal or increase `max_sites`.

It would be helpful to have a CLI option to increase max_sites with, rather than having to dig into the API.

Back-ping to openjournals/joss-reviews#4528.

OOM on `matbench_mp_e_form` conversion

from internal discussion with @faris-k

Also, I've noticed that when converting larger datasets to PNG via XtalConverter's xtal2png, like the matbench_mp_e_form dataset, you may run into memory issues even on high-RAM runtimes on Colab. The workaround is pretty simple: just convert structures to PNGs individually and then convert each Image object to a tensor or array, since you'll be doing that anyway. A minor note in that regard: it looks like xtal2png only works on sequences of structure objects (i.e. you can't use it on a single structure unless you make it into a list with one structure object).

Is it an issue with holding too many structures at once? Or too many PIL Images?

The memory error is thrown when you'd do something like data = xc.xtal2png(X_train) where X_train is like 90k or 100k structures.

Does a batch approach like the following work? (I haven't verified)

data = []
for x in np.array_split(X_train, 10):
    data.append(xc.xtal2png(x))
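A variation on the batch idea is to yield results lazily, so each batch can be turned into arrays or written to disk before the next one is converted. The sketch below is generic: `convert_in_batches` is a hypothetical helper, and the `convert` callable is a stand-in for something like `xc.xtal2png`. Whether this actually avoids the out-of-memory error is unverified.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def convert_in_batches(
    items: Sequence[T],
    convert: Callable[[Sequence[T]], List[R]],
    batch_size: int,
) -> Iterator[R]:
    """Convert `items` one slice at a time and yield results one by one."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        batch = items[start : start + batch_size]
        # Only one batch of converted results is held at a time
        yield from convert(batch)


# Stand-in converter: doubles each number instead of encoding a structure
doubled = list(convert_in_batches(list(range(5)), lambda b: [x * 2 for x in b], 2))
print(doubled)
```

Since the converter always receives a sequence, this also sidesteps the note above about xtal2png expecting a list rather than a single structure.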

consider implementing `fit` method

Many matminer featurizers have a fit method that sets some hyperparameters of the featurizer given a dataset (see, for example, the SOAP featurizer).

How could it be of use here? If you have a collection of CIFs and then call .fit, it would automatically set the relevant atom types, and perhaps also the maximum number of sites.
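A rough sketch of what such a fit step might compute is shown below. `RangeFitter` is a hypothetical helper, not part of xtal2png or matminer, and each "structure" is represented only by a list of atomic numbers so that the example stays self-contained.

```python
from typing import Iterable, List, Optional, Tuple


class RangeFitter:
    """Hypothetical helper that infers settings from a dataset."""

    def __init__(self) -> None:
        self.max_sites: Optional[int] = None
        self.atom_range: Optional[Tuple[int, int]] = None

    def fit(self, structures: Iterable[List[int]]) -> "RangeFitter":
        """Set max_sites and atom_range from lists of atomic numbers."""
        structures = list(structures)
        if not structures or any(len(s) == 0 for s in structures):
            raise ValueError("need at least one non-empty structure")
        # Largest number of sites seen in any structure
        self.max_sites = max(len(s) for s in structures)
        # Smallest and largest atomic number seen in the dataset
        all_z = [z for s in structures for z in s]
        self.atom_range = (min(all_z), max(all_z))
        return self


# Toy dataset: two "structures" given as atomic numbers per site
toy = [[30, 30, 5, 8], [82, 8, 8]]
fitter = RangeFitter().fit(toy)
print(fitter.max_sites, fitter.atom_range)
```

Note that fitting ranges to a dataset trades away the dataset-independent encoding that fixed ranges provide, so the fitted values would need to be shared along with the images.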

Clarification on conversion of data to RGB values, `rgb_scaler` and `rgb_unscaler` functions

@faris-k, each pixel in the image takes on a value between 0 and 255. I ended up writing my own scaler and unscaler functions as I had trouble getting sklearn functions (e.g. MinMaxScaler) to respect the conversion to a user-specified range (i.e. hard lower and upper limits) and other reason(s) that are a bit fuzzy in my memory. The reason for wanting this behavior is so that two people can independently run xtal2png on different datasets, swap their data, and decode to the same crystal structures. I.e. the PNG representation doesn't depend on the collection of data you give it, it only depends on the data_range-s that are specified in the XtalConverter class:

atom_range: Tuple[int, int] = (0, 117),
frac_range: Tuple[float, float] = (0.0, 1.0),
abc_range: Tuple[float, float] = (0.0, 15.0),
angles_range: Tuple[float, float] = (0.0, 180.0),
volume_range: Tuple[float, float] = (0.0, 1000.0),
space_group_range: Tuple[int, int] = (1, 230),
distance_range: Tuple[float, float] = (0.0, 25.0),

You can have a look at the relevant functions:

def rgb_scaler(
    X: ArrayLike,
    data_range: Optional[Sequence] = None,
):
    """Scale parameters according to RGB scale (0 to 255).

    ``feature_range`` is fixed to [0, 255], ``data_range`` is either specified

    See Also
    --------
    sklearn.preprocessing.MinMaxScaler : Scale each feature to a given range.

    Parameters
    ----------
    X : ArrayLike
        Features to be scaled element-wise.
    data_range : Optional[Sequence]
        Range to use in place of np.min(X) and np.max(X) as in ``MinMaxScaler``.

    Returns
    -------
    X_scaled
        Element-wise scaled values.

    Examples
    --------
    >>> rgb_scaler([[1, 2], [3, 4]], data_range=[0, 8])
    array([[ 32, 64],
    [ 96, 128]], dtype=uint8)
    """
    rgb_range = [0, 255]
    X_scaled = element_wise_scaler(X, data_range=data_range, feature_range=rgb_range)
    X_scaled = np.round(X_scaled).astype(int)
    return X_scaled

def rgb_unscaler(
    X: ArrayLike,
    data_range: Sequence,
):
    """Unscale parameters from their RGB scale (0 to 255).

    ``feature_range`` is fixed to [0, 255], ``data_range`` is either specified or
    calculated based on min and max.

    See Also
    --------
    sklearn.preprocessing.MinMaxScaler : Scale each feature to a given range.

    Parameters
    ----------
    X : ArrayLike
        Element-wise scaled values.
    data_range : Optional[Sequence]
        Range to use in place of np.min(X) and np.max(X) as in ``class:MinMaxScaler``.

    Returns
    -------
    X
        Unscaled features.

    Examples
    --------
    >>> rgb_unscaler([[32, 64], [96, 128]], data_range=[0, 8])
    array([[1, 2],
    [3, 4]])
    """
    rgb_range = [0, 255]
    X_scaled = element_wise_unscaler(X, data_range=data_range, feature_range=rgb_range)
    return X_scaled

and where they're used:

atom_scaled = rgb_scaler(atomic_numbers, data_range=self.atom_range)
frac_scaled = rgb_scaler(frac_coords, data_range=self.frac_range)
abc_scaled = rgb_scaler(abc, data_range=self.abc_range)
angles_scaled = rgb_scaler(angles, data_range=self.angles_range)
volume_scaled = rgb_scaler(volume, data_range=self.volume_range)
space_group_scaled = rgb_scaler(space_group, data_range=self.space_group_range)
# NOTE: max_distance could be added as another (repeated) value/row to infer
# NOTE: kind of like frac_distance_matrix, not sure if would be effective
# NOTE: Or could normalize distance_matrix by cell volume
# NOTE: and possibly include cell volume as a (repeated) value/row to infer
# NOTE: It's possible extra info like this isn't so bad, instilling the physics
# NOTE: but it could also just be extraneous work to predict/infer
distance_scaled = rgb_scaler(distance_matrix, data_range=self.distance_range)

atomic_numbers = rgb_unscaler(atom_scaled, data_range=self.atom_range)
frac_coords = rgb_unscaler(frac_scaled, data_range=self.frac_range)
abc = rgb_unscaler(abc_scaled, data_range=self.abc_range)
angles = rgb_unscaler(angles_scaled, data_range=self.angles_range)
# volume, space_group, distance_matrix unnecessary for making Structure
volume = rgb_unscaler(volume_scaled, data_range=self.volume_range)
space_group = rgb_unscaler(
    space_group_scaled, data_range=self.space_group_range
)
distance_matrix = rgb_unscaler(distance_scaled, data_range=self.distance_range)

The default data_range values are based on https://github.com/sparks-baird/xtal2png/blob/main/notebooks/2.0-materials-project-feature-ranges.ipynb as of v0.3.0.
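To make the effect of this scaling on precision concrete, here is a minimal sketch of the round trip. It assumes the underlying scaler is a plain linear min-max map between data_range and [0, 255]; `scale_to_rgb` and `unscale_from_rgb` are illustrative stand-ins, not the package's functions.

```python
def scale_to_rgb(x, data_range):
    """Linearly map x from data_range onto the integers 0..255 (assumed behavior)."""
    lo, hi = data_range
    return round((x - lo) / (hi - lo) * 255)


def unscale_from_rgb(v, data_range):
    """Invert the linear map; the rounding error from scaling is not recoverable."""
    lo, hi = data_range
    return v / 255 * (hi - lo) + lo


data_range = [0, 8]
values = [1, 2, 3, 4]
encoded = [scale_to_rgb(x, data_range) for x in values]
decoded = [unscale_from_rgb(v, data_range) for v in encoded]

# Worst-case round-trip error is half a gray level
max_error = (data_range[1] - data_range[0]) / 255 / 2
print(encoded)
print([round(d, 4) for d in decoded])
```

Under this assumption the decoded values differ from the inputs by at most `(hi - lo) / 510`, so a wider data_range covers more structures at the cost of coarser resolution per gray level.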

Generated, decoded structures all seem to be of P(1) symmetry

e.g.

Structure Summary
Lattice
    abc : 3.5100414781297133 3.515196078431373 3.5448717948717947
 angles : 59.64705882352941 59.959276018099544 59.93891402714934
 volume : 30.820712660051388
      A : 3.0385369088434473 0.0 1.7571808762296304
      B : 1.0068399538873045 2.861393772226697 1.7763171049499542
      C : 0.0 0.0 3.5448717948717947
PeriodicSite: Ag (0.4705, 0.1010, 1.2526) [0.1431, 0.0353, 0.2647]
PeriodicSite: Cr (1.0788, 0.7294, 1.6928) [0.2706, 0.2549, 0.2157]
PeriodicSite: H (0.1329, 0.0898, 0.1213) [0.0333, 0.0314, 0.0020]
PeriodicSite: H (0.0654, 0.0842, 0.2814) [0.0118, 0.0294, 0.0588]
PeriodicSite: He (0.0278, 0.0112, 0.0416) [0.0078, 0.0039, 0.0059]

Tried with a looser symprec tolerance, which threw an error at 0.2 and higher.

structures[0].get_space_group_info(symprec=0.2)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\sterg\Miniconda3\envs\xtal2png-docs\lib\site-packages\pymatgen\core\structure.py", line 1013, in get_space_group_info
    return a.get_space_group_symbol(), a.get_space_group_number()
  File "C:\Users\sterg\Miniconda3\envs\xtal2png-docs\lib\site-packages\pymatgen\symmetry\analyzer.py", line 97, in get_space_group_symbol
    return self._space_group_data["international"]
TypeError: 'NoneType' object is not subscriptable
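A small helper can make this kind of tolerance sweep less brittle by recording failures instead of stopping at the first one. This is only a sketch: `sweep_symprec` and `fake_get_info` are hypothetical names, and the stub merely imitates the behavior reported above (a TypeError at 0.2 and higher).

```python
def sweep_symprec(get_info, tolerances):
    """Call get_info(symprec=tol) for each tolerance; record None on TypeError."""
    results = {}
    for tol in tolerances:
        try:
            results[tol] = get_info(symprec=tol)
        except TypeError:
            # Matches the traceback above, where no space-group data came back
            results[tol] = None
    return results


def fake_get_info(symprec):
    """Stand-in for structures[0].get_space_group_info; fails at 0.2 and above."""
    if symprec >= 0.2:
        raise TypeError("'NoneType' object is not subscriptable")
    return ("P1", 1)


results = sweep_symprec(fake_get_info, [0.01, 0.1, 0.2, 0.5])
print(results)
```

With pymatgen installed, the same helper could be called as `sweep_symprec(structures[0].get_space_group_info, [0.01, 0.1, 0.2])`.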

What is our notion of best-fit for generation, prediction, and relaxation?

EDIT: see also issues with the "notion of best" label

Relaxation is probably the most straightforward - use some crystal distance. Prediction can be about checking against known allotropes, where we take the lowest crystal distance among the allotropes. Generation is the least straightforward. Perhaps a Pareto hypervolume metric via a fictitious adaptive design campaign (e.g. bulk modulus vs. energy above hull)? Perform hyperparameter optimization and then do DFT as the final validation.
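For the prediction case, taking the lowest crystal distance among the allotropes might look roughly like the following. The crystal distance itself is not specified here, so `distance` is a placeholder callable, and the toy "structures" are just tuples of lattice lengths compared with a Euclidean distance.

```python
import math


def best_allotrope_match(predicted, allotropes, distance):
    """Return (index, distance) of the reference closest to the prediction."""
    distances = [distance(predicted, ref) for ref in allotropes]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


# Toy stand-ins for illustration only
predicted = (3.5, 3.5, 3.5)
allotropes = [(4.0, 4.0, 4.0), (3.6, 3.5, 3.5), (2.5, 2.5, 4.1)]
idx, d = best_allotrope_match(predicted, allotropes, math.dist)
print(idx, round(d, 3))
```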

Add hardcoded reference image test once API is stable (i.e. in conjunction with results manuscript, `v1.0.0`)

Something along the lines of the following (unfinished):

def test_png2xtal_from_loaded_images():
    # TODO: implement this once API is stable (i.e. 1.0.0+)
    # Assumes Image (PIL), XtalConverter, example_structures, and
    # assert_structures_approximate_match are imported at module level.
    paths = [
        "tests/V4Ni2Se8,volume=243,uid=8b92.png",
        "tests/Zn8B8Pb4O24,volume=623,uid=b62a.png",
    ]
    imgs = []
    for path in paths:
        with Image.open(path) as img:
            # Copy so the pixel data outlives the context manager, which closes img
            imgs.append(img.copy())

    xc = XtalConverter()
    decoded_structures = xc.png2xtal(imgs)
    assert_structures_approximate_match(
        example_structures, decoded_structures, tol_multiplier=2.0
    )
    return decoded_structures

See #146 (comment)

JOSS paper review - Documentation

Here I will collate issues/suggestions I found with documentation. See openjournals/joss-reviews#4528 for the full JOSS review.

  • Statement of need - the landing page on GitHub has a nice README.md, but the index page of the docs could be expanded with a short blurb. Feel free to reuse something from the paper and don't be afraid of being more technical.
  • Functionality documentation - the API docs are built, which is great, but I'm missing a "functionality" page that would link to classes/functions in the API docs. It's up to you if you put it as a subsection of the "Overview" or make a new section. I think such a page should discuss the decisions behind the default output format, link to the key classes/functions in the API docs, and also mention it's possible to customise the output PNGs. Then, reference this overview from the paper to avoid having to write the same thing twice and risking the paper might end up out of date.
  • Automated tests - CI is enabled and tests pass. However, it's difficult to figure out what's being tested where. I'd suggest splitting this test file into multiple modules, separating the functionality tests from the validity tests. Also, I think a very useful category of tests is missing - comparing image data directly with a reference png. You can use the instructions here to get a test working. We have written a similar test you can copy, here.

Decide default feature ranges based on Materials Project characteristics

  • Min and max for abc lattice parameter lengths (abc_range)
  • Min and max for pairwise distances (distance_range)
  • Min and max for unit cell volume (volume_range)

That is, take all Materials Project entries with at most 52 sites and determine the minimum and maximum lattice parameter lengths across all a, b, and c values. Do the same for distances and volume.
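The filtering and min/max step could be sketched as follows. The entry format (dicts with `num_sites`, `abc`, and `volume` keys) and the numbers are made up for illustration and are not Materials Project data; pairwise distances would follow the same pattern.

```python
def feature_ranges(entries, max_sites=52):
    """Min/max of lattice lengths and volume over entries with <= max_sites sites."""
    kept = [e for e in entries if e["num_sites"] <= max_sites]
    lengths = [length for e in kept for length in e["abc"]]
    volumes = [e["volume"] for e in kept]
    return {
        "abc_range": (min(lengths), max(lengths)),
        "volume_range": (min(volumes), max(volumes)),
    }


# Made-up entries; the third one exceeds max_sites and is excluded
entries = [
    {"num_sites": 5, "abc": (3.5, 3.5, 3.6), "volume": 30.8},
    {"num_sites": 40, "abc": (5.0, 7.2, 12.1), "volume": 435.6},
    {"num_sites": 80, "abc": (20.0, 21.0, 22.0), "volume": 9240.0},
]
ranges = feature_ranges(entries)
print(ranges)
```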

Write function for adding sensible perturbations to the array representation

Hyperparameter would be level of perturbation per type of perturbation, e.g. perturbation for atomic numbers, perturbation for distance matrix, fractional coordinates, etc. Might be reasonable to simply perturb the final grayscale image; however, this will not preserve the Euclidean nature of the distance matrix.
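One way to keep a perturbed distance matrix Euclidean is to perturb the coordinates and recompute the distances, rather than adding noise to the matrix or image directly. The sketch below assumes Gaussian noise and plain Cartesian (non-periodic) distances; `perturb_coords` and `distance_matrix` are illustrative helpers, not part of xtal2png.

```python
import math
import random


def distance_matrix(coords):
    """Pairwise Euclidean distances between Cartesian sites (no periodicity)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def perturb_coords(coords, scale, rng):
    """Add Gaussian noise with standard deviation `scale` to every coordinate."""
    return [[x + rng.gauss(0.0, scale) for x in site] for site in coords]


rng = random.Random(0)
coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 2.0, 0.5]]
noisy = perturb_coords(coords, scale=0.05, rng=rng)

# Recomputing from perturbed coordinates keeps the matrix symmetric, with a
# zero diagonal and the triangle inequality intact
dm = distance_matrix(noisy)
```

Here `scale` plays the role of the per-type perturbation level; separate levels for atomic numbers or lattice parameters would need their own noise models.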

Whether to shuffle the data or not

Either in mp-time-split (when using .get_train_and_val_data() etc., of course only shuffling after splitting into train and val) or in the generative model. Might cause issues if really similar compounds are in the same batch.

Probably better to just shuffle the data by default rather than include it as a tunable hyperparameter during hyperparameter optimization. Might leave as a kwarg with default set to True. Of course, if the generative model uses data from a directory without shuffling that data, it could be problematic. Will check on denoising_diffusion_pytorch to see if the data gets shuffled by default.
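A rough sketch of shuffling only after the split, with a `shuffle` kwarg defaulting to True, is shown below. `split_then_shuffle` is a hypothetical helper rather than the mp-time-split API, and it shuffles only the training portion, which is one possible choice.

```python
import random


def split_then_shuffle(records, n_val, shuffle=True, seed=0):
    """Split ordered records into train/val first, then shuffle within train."""
    if n_val <= 0 or n_val >= len(records):
        raise ValueError("n_val must be between 1 and len(records) - 1")
    train, val = list(records[:-n_val]), list(records[-n_val:])
    if shuffle:
        # Seeded generator so the shuffled order is reproducible
        random.Random(seed).shuffle(train)
    return train, val


records = list(range(10))  # stand-in for time-ordered structures
train, val = split_then_shuffle(records, n_val=3)
print(train, val)
```

Because the split happens first, no validation record can leak into the training set regardless of the shuffle order.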

Consider normalizing "the lengths of lattice vectors with [N^(1/3)]" (inspired by CDVAE)

Related methods from CDVAE (emphasis added):

Since the lattice matrix L is not rotation invariant, we instead predict the 6 lattice parameters, i.e. the lengths of the 3 lattice vectors and the angles between them. We normalize the lengths of lattice vectors with [N^(1/3)], where N is the number of atoms, to ensure that the lengths for materials of different sizes are at the same scale.

See get_reduced_structure()
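The normalization described in the quote amounts to dividing each lattice length by N^(1/3) and multiplying it back when decoding. A minimal sketch, with illustrative function names and made-up lattice lengths:

```python
def normalize_lengths(abc, num_atoms):
    """Divide lattice lengths by N^(1/3), where N is the number of atoms."""
    scale = num_atoms ** (1 / 3)
    return tuple(length / scale for length in abc)


def denormalize_lengths(abc_norm, num_atoms):
    """Undo normalize_lengths; the atom count must be known when decoding."""
    scale = num_atoms ** (1 / 3)
    return tuple(length * scale for length in abc_norm)


abc = (4.0, 6.0, 10.0)  # made-up lattice lengths
abc_norm = normalize_lengths(abc, num_atoms=8)
abc_back = denormalize_lengths(abc_norm, num_atoms=8)
print(abc_norm)
```

Decoding needs the atom count, so it would have to be recoverable from the image, for example from the number of occupied sites.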

Originally posted by @sgbaird in #94 (comment)
