gemdat's Issues

Dashboard plots that need fixing

  • plot_collective_jumps:
  File "/home/vikko/local_projects/GEMDAT/src/plots/jumps.py", line 112, in plot_collective_jumps
    ticks = range(len(sites.jump_names))
  File "/home/vikko/local_projects/GEMDAT/src/sites.py", line 153, in jump_names
    return ['->'.join(key) for key in self.rates]
  File "/home/vikko/local_projects/GEMDAT/src/sites.py", line 153, in <listcomp>
    return ['->'.join(key) for key in self.rates]
TypeError: sequence item 0: expected str instance, NoneType found
  • plot_jumps_3d
  File "/home/vikko/local_projects/GEMDAT/src/plots/jumps.py", line 189, in plot_jumps_3d
    plotter.plot_labels(site_labels,
  File "/home/vikko/local_projects/pymatgen/pymatgen/electronic_structure/plotter.py", line 4268, in plot_labels
    if k.startswith("\\") or k.find("_") != -1:
AttributeError: 'NoneType' object has no attribute 'startswith'

Currently these plots are disabled by not including them in plots.__all__.

Calculate sites data from materials input

This issue tracks processing the 'known materials' data and calculating sites data.

Features

  • Read known materials from standard crystallographic input (Wyckoff symbol, cif file, or otherwise?) -> #18

Variables

  • #19

    • transitions
    • all_transitions
    • succes
    • occupancy
    • occupancy_parts
    • atoms (atom_sites)
  • #36

    • sites_occupancy
    • sites_occupancy_parts
    • atom_locations
    • atom_locations_parts
  • #43

    • jump_names
    • nr_jumps
    • rates
    • e_act
  • #56

    • collective
    • collective_jumps
    • collective_matrix
    • multi_collective
    • solo_jumps
    • solo_frac
  • #48

    • jump_diffusivity
    • correlation_factor

Explore `lxml` for faster loading of `vasprun.xml`

Loading a large (~2 GB) vasprun.xml using pymatgen takes a couple of minutes.

It's using the standard library xml module to parse the element tree (see here).

See if we can use a faster library like lxml (source) to speed this up. It claims to follow the standard library ElementTree API.
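
Since lxml claims ElementTree API compatibility, the swap could be as simple as the sketch below. The streaming iterparse approach is an additional idea, not what pymatgen currently does; the 'calculation' tag corresponds to vasprun.xml's per-ionic-step blocks.

try:
    from lxml import etree  # fast C implementation
except ImportError:
    from xml.etree import ElementTree as etree  # stdlib fallback

# iterparse streams the file instead of building the whole tree at once,
# which matters for a ~2 GB vasprun.xml.
for _, elem in etree.iterparse('vasprun.xml', events=('end',)):
    if elem.tag == 'calculation':
        ...  # process one ionic step
        elem.clear()  # release the element to keep memory bounded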

Refactor ported code

As we ported more code, we realized that some structures could be better represented. This issue is to track that.

  • Refactor all calculate functions into "property" functions, which are calculated on demand (see the sketch after this list)
    • Also add a toggle to calculate on creation of the object.
  • Move diffusive element displacement code into Trajectories
  • Remove calculate folder and put calculations in corresponding files
  • Refactor all plots to use the Trajectory / Sites / Jumps classes where possible, so that we do not have to pass each property independently
  • Math (or TrajectoryStatistics) class for statistics concerning Trajectories
  • Remove diffusive_element specifiers on Trajectory functions, and instead allow a view on a trajectory, like:
li_trajectory = trajectory.where(element='Li', equilibration_steps=1250)
  • Better way to include all plots than using __all__
  • Write trajectory.precompute() and sites.precompute()
  • Fix the RDF functions
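
A minimal sketch of the property refactor in the first item, using functools.cached_property (the class and method names are illustrative):

from functools import cached_property


class Sites:
    def __init__(self, structure, *, precompute: bool = False):
        self.structure = structure
        if precompute:
            # Toggle: touching the property triggers the calculation on
            # creation of the object; otherwise it runs on first access.
            self.transitions

    @cached_property
    def transitions(self):
        # Expensive calculation; the result is cached after the first call.
        return self._calculate_transitions()

    def _calculate_transitions(self):
        ...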

Add RDF plots to dashboard

RDFS are still missing from the dashboard. Example snippet to generate RDF plots:

from gemdat import SimulationData, SitesData
from gemdat.io import load_known_material

equilibration_steps = 1250
diffusing_element = 'Li'
diffusion_dimensions = 3
z_ion = 1

VASP_XML = '/home/stef/md-analysis-matlab-example-short/vasprun.xml' 

data = SimulationData.from_vasprun(VASP_XML)

extras = data.calculate_all(
    equilibration_steps=equilibration_steps,
    diffusing_element=diffusing_element,
    z_ion=z_ion,
    diffusion_dimensions=diffusion_dimensions,
)

structure = load_known_material('argyrodite', supercell=(2,1,1))

sites = SitesData(structure)
sites.calculate_all(data=data, extras=extras)

from gemdat.rdf import calculate_rdfs, plot_rdf

rdfs = calculate_rdfs(
    data=data, 
    sites=sites, 
    diff_coords=extras.diff_coords, 
    n_steps=extras.n_steps, 
    equilibration_steps=extras.equilibration_steps,
    max_dist=10,
    resolution=0.1,
)

for state, rdf in rdfs.items():
    plot_rdf(rdf, name=state)

Make it possible to calculate required arrays for plots on the fly from the data

  • Currently this has to be done manually.

One option might be to add sensible computable defaults to all plots, as most arrays can be calculated from the Data arrays.
Another might be to make those arrays optionally computable on the Data object somehow.

The nice thing here would be to have this work transparently: if a user provides an array, that array is used; otherwise a default is calculated from the provided Data if possible.
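
A minimal sketch of that transparent behaviour (the plot and helper names are hypothetical):

def plot_displacement_per_element(data, *, displacements=None, **kwargs):
    # Use the user-provided array if there is one; otherwise fall back
    # to a sensible default computed from the Data object.
    if displacements is None:
        displacements = calculate_displacements(data)  # hypothetical helper
    ...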

Consider using xarray as a data store

Some of the data we are generating for the timesteps are well suited for storing in an xarray. Most of the data we are working with have some form of (time step, atom index) shape.

As dimensions we can use:

  • Time
  • Labels of the sites or (diffusing) atoms
  • Then we can use the parameters per atom/site (occupancy, transition state, speed, etc.) as columns in the array (see the sketch below)
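
A minimal sketch of what that could look like (shapes, labels, and timestep are illustrative):

import numpy as np
import xarray as xr

n_steps, n_atoms = 1000, 48
positions = np.random.random((n_steps, n_atoms, 3))  # stand-in data

da = xr.DataArray(
    positions,
    dims=('time', 'atom', 'xyz'),
    coords={
        'time': np.arange(n_steps) * 2.0,  # e.g. time in fs
        'atom': [f'Li{i}' for i in range(n_atoms)],  # site/atom labels
        'xyz': ['x', 'y', 'z'],
    },
)

li0 = da.sel(atom='Li0')  # one atom's trajectory, selected by label
production = da.sel(time=slice(2500, None))  # drop equilibration by time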

Implement sites that are dynamic over time

One of the stretch goals of the project would be to work with dynamic site locations.

At present, the sites are defined from a cif file and static with time. In a real scenario, the atomic clusters are oscillating/moving. This affects the jumps calculations.

Load sites from density

As a researcher working with MD data,
I want to load site locations from a density file,
so that I can have higher-accuracy analysis.

Loading the sites from a cif file works, but is not ideal, because they have to be defined manually and the positions are static.

As an alternative, we can generate the sites from the trajectory directly. The trajectory can be used to generate an electron density; in turn, we can use peaks in the electron density to define the positions of the sites.

For example,

  1. Take trajectory
  2. Make a voxel array with ~4 voxels / Angstrom from trajectory.get_lattice()?
  3. Squash over time axis and assign to voxels (bins)
  4. Find peaks in voxel array
  5. Convert peaks coordinates to structure in pymatgen

Alternative to steps 2-4: squash along the time axis and use cluster analysis to find the best n sites.
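
A rough sketch of steps 2-5, assuming frac_coords is an array with shape (time, atom, 3) in fractional coordinates and lattice is a pymatgen Lattice (the peak threshold is arbitrary):

import numpy as np
from scipy import ndimage

resolution = 0.25  # Angstrom per voxel, i.e. ~4 voxels / Angstrom
shape = tuple(int(round(length / resolution)) for length in lattice.lengths)

# Steps 2-3: squash over the time axis by binning all positions into voxels.
flat = frac_coords.reshape(-1, 3) % 1
density, _ = np.histogramdd(flat, bins=shape, range=[(0, 1)] * 3)

# Step 4: peaks are voxels that are local maxima (with periodic wrapping)
# and clearly above the background.
local_max = ndimage.maximum_filter(density, size=3, mode='wrap') == density
threshold = density.mean() + 3 * density.std()
peaks = np.argwhere(local_max & (density > threshold))

# Step 5: convert voxel indices to fractional coordinates for pymatgen.
site_frac_coords = (peaks + 0.5) / np.array(shape)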

Update pin for pymatgen

Pymatgen is currently pinned to my fork.

The label fixes were merged yesterday. Once a new release of pymatgen becomes available, we should update our pin to the latest version and make a new release on PyPI.

Add known materials from matlab code

Find/generate structures in CIF format for the known materials in the matlab code. These are the ones that are available:

  • argyrodite, supercell: (1 1 1)
  • latp, supercell: (1 1 1)
  • na3ps4, supercell: (2 2 2)
  • lisnps, supercell: (1 1 1)
  • li3ps4_beta, supercell: (1 1 2)
  • mno2_lambda, supercell: (1 1 1)
  • lagp, supercell: (1 2 2)

Trajectory class

I think we should set up a Trajectory class, which will make it easier to handle the simulation data.

Most of what we care about is the Trajectory anyway. The pymatgen trajectory class is somewhat limited for our use case.

This could also take over some of the methods in the vibration/displacements modules (speed, displacements, etc.).

Just writing some ideas here:

class Trajectory:
    _coords: np.ndarray  # shape: (time, site, xyz)
    structure: pymatgen.core.Structure

    @property
    def n_steps(self):
        # Return number of steps after equilibration time
        ...

    @property
    def coords(self):
        # Keep only the steps after equilibration
        return self._coords[-self.n_steps:, ...]

    def displacements(self, element: list[str] | str | None = None):
        # Return displacements for all / selected elements
        ...

    def speed(self, element: list[str] | str | None = None):
        # Return speed for all / selected elements
        ...

    def set_equilibration_time(self, equilibration_time):
        # Sets the starting point for the data
        ...

    def get_coords_for_element(self, label):
        # Replaces diff_coords, which is somewhat poorly named
        ...

Todo

  • Add a Trajectory.metadata attribute to track global simulation parameters like temperature
  • Replace test trajectories by fixtures
  • What makes sense to move from calculate_all() to GemdatTrajectory?
  • Add intuitive method to Trajectory to easily get coordinates for diffusing atom
  • Add Trajectory.from_PymatgenTrajectory
  • Update readme.md with new API

Pymatgen does not mod coordinates back to origin cell

When converting between positions and displacements on a trajectory, pymatgen does not completely convert back to the origin cell.

For example, a position at [0, 0, 0.001] may end up at [0, 0, 1.001].

>>> trajectory = Trajectory.from_vasprun(vasp_xml)
>>> coords1 = trajectory.filter('Li').coords
>>>
>>> trajectory.to_displacements()
>>> trajectory.to_positions()
>>> 
>>> coords2 = trajectory.filter('Li').coords
>>> 
>>> np.testing.assert_allclose(coords1, coords2)
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 33264 / 540000 (6.16%)
Max absolute difference: 1.
Max relative difference: 4761904.77031164
 x: array([[[0.13401 , 0.36404 , 0.028937],
        [0.09121 , 0.712146, 0.508927],
        [0.289686, 0.675079, 0.220109],...
 y: array([[[ 0.13401 ,  0.36404 ,  0.028937],
        [ 0.09121 ,  0.712146,  0.508927],
        [ 0.289686,  0.675079,  0.220109],...
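
A possible workaround, as a sketch: wrap both coordinate arrays back into the origin cell before comparing. Note that values sitting exactly on a cell boundary can still wrap to opposite sides, so some tolerance handling near 0/1 remains necessary.

import numpy as np

# 1.001 mod 1 == 0.001, so the example above passes after wrapping.
np.testing.assert_allclose(np.mod(coords1, 1), np.mod(coords2, 1))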

Load known materials from crystallographic information format

CIF files are the standard file format for storing crystallographic information.

Crystal structures for the materials we work with are available, e.g. for argyrodite.

In the matlab code these are coded by hand in known_materials.m. I would like to have these available in standard CIF format, so that:

  1. Any crystallographic software can be used to visualize/inspect/modify the crystal structures.
  2. Any known crystal structure can be used as a basis for the analysis without having to modify the code. This makes our tool more accessible.
  3. Errors are reduced by having a fixed crystallographic format.

Radial distribution functions

These are calculated in calc_rdfs.m. Data can be verified against rdf.mat.

  • rdf.distributions
  • rdf.integrated
  • rdf.rdf_names
  • rdf.elements
  • rdf.max_dist
  • rdf.resolution
  • rdf.total

Create a structure for the GEMDAT tool

It might look something like this:

option 1

  • GEMDAT
    • plot_all
    • plot_1
    • plot_2
    • plot_3
from GEMDAT import plot_1, plot_all
plot_1(<data_and_config>)
plot_all(<data_and_config>)

Or something like this:

option 2

  • GEMDAT
    • plot(<data_and_config>, <list of plots>)
    • plots
      • plot_1(<data_and_config>)
      • plot_2(<data_and_config>)
from GEMDAT import plot
plot(<data_and_config>, ['diffusivity', 'MSD'])  
from GEMDAT.plots import plot_diffusivity
plot_diffusivity(<data_and_config>)

data_and_config:
There will probably be a need to adjust the plots with some configuration (splitting into a number of smaller simulations, or cutting away the first few timesteps). I think it would be okay to just pass those through **kwargs and let all plot functions accept **kwargs, so they can extract the keywords they will use (see the sketch below).
All the possible keywords should be listed in the plot (or plot_all for option 1) function, but how they are implemented can be explained in the specific plot_xxx function.
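
A minimal sketch of the **kwargs passing for option 2 (function and keyword names are illustrative):

def plot_diffusivity(data, **kwargs):
    # Each plot extracts only the keywords it knows about.
    equilibration_steps = kwargs.get('equilibration_steps', 0)
    ...


def plot_msd(data, **kwargs):
    ...


PLOTS = {'diffusivity': plot_diffusivity, 'MSD': plot_msd}


def plot(data, names, **kwargs):
    # Dispatch to the selected plots, forwarding the shared configuration.
    for name in names:
        PLOTS[name](data, **kwargs)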


I prefer option two; what are your thoughts about this, @stefsmeets?

Make the plot_3d_jumps plot rotatable in the browser

I think plotly can be a good candidate for this, but the matplotlib to plotly conversion function seems to be broken for this figure at the moment.

  File "/home/vikko/local_projects/GEMDAT/.venv/lib/python3.11/site-packages/plotly/matplotlylib/mplexporter/exporter.py", line 289, in draw_collection
    offset_order = offset_dict[collection.get_offset_position()]

The issue is well known and is caused by a deprecation in matplotlib; this very ugly fix works:
mpld3/mpld3#477
But then again, the converter does not seem to understand more than 2 dimensions, so this is not the way to go.

If we want to do this it is probably best to re-implement it fully in plotly (see also comment below)

Improve nearest site finding algorithm

This function:

pdist = lattice.get_all_distances(atom_coords, site_coords)

already takes 15 s, and we expect it to become a bottleneck for larger datasets, so we should have a look at it.

Periodic boundaries should be taken into account.

Possible paths to explore:
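
One candidate is scipy's periodic k-d tree. A sketch, assuming fractional coordinate arrays site_frac_coords and atom_frac_coords: boxsize=1 handles the periodic boundaries, but fractional distances only map directly onto Cartesian ones for cubic cells, so this may work better as a fast pre-filter than a drop-in replacement.

import numpy as np
from scipy.spatial import cKDTree

# Build the tree once over the (static) site positions.
tree = cKDTree(site_frac_coords % 1, boxsize=1.0)

# Query the nearest site for every atom position in one vectorized call.
dist, nearest_site = tree.query(atom_frac_coords % 1)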

Refactor integration test.

  • Can we somehow isolate and group the different stages of the calculations? E.g. test_jumps, test_transitions_matrix, test_all_transition
  • See if it is feasible to test for more 'actionable' numbers (e.g. not the sum of a matrix)

issues raised by stefsmeets: #71 (comment)

Add support for `SitesData` in dashboard

#48 adds the first plot (plots.jumps.plot_jumps_vs_distance) that uses the SitesData class. We need a selector in the dashboard to:

  1. Set the known material (either cif or from the internal database in /src/data)
  2. Set the supercell (default (1,1,1))
  3. Initialize SitesData and pass this to the plots

Calculate transition energy between sites

With the transition energy we can determine the energy threshold that has to be crossed for a successful jump. If this energy is too low, we could inform the user that the sites are not well-defined and should probably be merged.
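
One possible estimate, as a sketch only (assumption: Boltzmann inversion of the relative occupancies of a site and the region between sites; this is not necessarily how it will be implemented):

import numpy as np

k_B = 8.617333e-5  # Boltzmann constant in eV/K


def barrier_estimate(occ_site, occ_between, temperature):
    # If the region between two sites is nearly as occupied as the sites
    # themselves, the estimated barrier is low and the sites should
    # probably be merged.
    return -k_B * temperature * np.log(occ_between / occ_site)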

Implement conditionals for a jump.

There are two main properties that can control this (assuming the sites are defined correctly).
This should be a post-processing step after defining the Sites, and it should probably be specified via parameters for calculating the jumps. A sketch follows after this list.

  • Time condition: only count a jump if the atom remains at the new site for a certain amount of time.
  • Decrease the size of sites: this decreases the likelihood of an oscillation between the sites generating fake jumps.
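
A sketch of the time condition as such a post-processing filter (the transitions layout, one row per event with start/stop step columns, is an assumption):

import numpy as np


def filter_jumps(transitions: np.ndarray, min_residence_steps: int = 10):
    # transitions: rows of (atom, from_site, to_site, start_step, stop_step).
    # Keep only jumps where the atom stayed at the new site long enough.
    residence = transitions[:, 4] - transitions[:, 3]
    return transitions[residence >= min_residence_steps]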

Revise how `SitesData.*_parts` data are being defined

To calculate statistics, we are calculating dedicated *_parts attributes on SitesData. The inputs for these analyses are no different from the full data set, just computed on a smaller subset.

So, instead of having dedicated parts variables like $VARIABLE_parts, consider feeding SitesData with the parent data already split (probably atom_sites or all_transitions), and use these as a basis for subsequent calculations.

Basically, instead of:

parent_data -> SitesData -> derived_data -> split -> derived_data_parts

(one full SitesData instance for the entire time series, with attributes containing lists of parts data), do:

parent_data -> SitesData -> derived_data
parent_data -> split -> SitesData -> derived_data_parts

(one SitesData instance for the entire time series, plus a list of SitesData instances, one for each part).

Dashboard

This tool lends itself very well to an interactive dashboard.

I would like to have something where you can load your project in the sidebar, set some parameters (like the diffusing element, equilibration time, etc), and have a selector for which plots to generate.

Plots and visualizations

This issue has a list of plots and visualizations that should be implemented.

  • Figure 1. Displacement of diffusing element -> #9
  • Figure 2. Histogram of displacement of diffusing element -> #9
  • Figure 3. Displacement per element -> #9
  • Figure 4. Density of diffusing element (3D) -> #63
  • Figure 5. Jumps between sites (3D) -> #61
  • Figure 6. Jumps vs distance -> #48
  • Figure 7. Histogram of vibrational amplitudes with fitted Gaussian -> #11
  • Figure 8. Occurrence vs frequency -> #11
  • Figure 9. Histogram of jumps vs time -> #61
  • Figure 10. Number of cooperative jumps per jump-type combination -> #61
  • Jumps movie -> #70

RDF (#57)

  • Figure 11. 48h Density vs Distance -> #66
  • Figure 12. Transition 48h -> #66
  • Figure 13. Transition 48h-48h -> #66

Calculate Sites from density instead of defining them

There are two ways to go about this:

  • Get X number of sites from a rasterized domain where the density is highest
  • Do a cluster analysis and get the peaks which correspond to sites.

A site should have a position and a size; the size might differ per dimension, so that ellipsoidal sites could also be defined.

Check if known structure lattice matches lattice from simulation

The matlab code has no check for whether the known materials structure matches the vasp/lammps data. A mismatch can happen if the cell orientation, supercell, or symmetry does not match.

We can add a basic check on the lattice parameters (within some tolerance, e.g. 0.5 Angstrom / 1 degree) to prevent potential errors.

This check should probably be implemented in SitesData.calculate_all (compare SimulationData.structure with SitesData.structure).
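
A sketch of that check, using pymatgen's lattice parameters (a, b, c in Angstrom and alpha, beta, gamma in degrees):

import numpy as np


def check_lattices_match(data_structure, sites_structure,
                         length_tol=0.5, angle_tol=1.0):
    # lattice.parameters is (a, b, c, alpha, beta, gamma)
    p1 = np.array(data_structure.lattice.parameters)
    p2 = np.array(sites_structure.lattice.parameters)
    if not np.allclose(p1[:3], p2[:3], atol=length_tol):
        raise ValueError(f'Lattice lengths do not match: {p1[:3]} vs {p2[:3]}')
    if not np.allclose(p1[3:], p2[3:], atol=angle_tol):
        raise ValueError(f'Lattice angles do not match: {p1[3:]} vs {p2[3:]}')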

Fix x-axis for RDF plots

The RDF plots have incorrect labels on the x-axis. Currently they just use the bin number; these should be converted to the distance in Angstrom using the bin resolution.
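
A minimal sketch of the fix, assuming the resolution used to build the RDF (0.1 Angstrom in the dashboard snippet above) is available alongside the counts:

import numpy as np

# Distance axis in Angstrom instead of raw bin numbers; shift by half a
# bin to label bin centers rather than edges.
x = (np.arange(len(rdf)) + 0.5) * resolution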
