gemdat's Issues

Dashboard plots that need fixing

  • plot_collective_jumps:
  File "/home/vikko/local_projects/GEMDAT/src/plots/jumps.py", line 112, in plot_collective_jumps
    ticks = range(len(sites.jump_names))
  File "/home/vikko/local_projects/GEMDAT/src/sites.py", line 153, in jump_names
    return ['->'.join(key) for key in self.rates]
  File "/home/vikko/local_projects/GEMDAT/src/sites.py", line 153, in <listcomp>
    return ['->'.join(key) for key in self.rates]
TypeError: sequence item 0: expected str instance, NoneType found
  • plot_jumps_3d
  File "/home/vikko/local_projects/GEMDAT/src/plots/jumps.py", line 189, in plot_jumps_3d
    plotter.plot_labels(site_labels,
  File "/home/vikko/local_projects/pymatgen/pymatgen/electronic_structure/plotter.py", line 4268, in plot_labels
    if k.startswith("\\") or k.find("_") != -1:
AttributeError: 'NoneType' object has no attribute 'startswith'

Currently these plots are disabled by not including them in plots.__all__.

Calculate sites data from materials input

This issue tracks processing the 'known materials' data and calculating sites data.

Features

  • Read known materials from standard crystallographic input (Wyckoff symbol, cif file, or otherwise?) -> #18

Variables

  • #19

    • transitions
    • all_transitions
    • succes
    • occupancy
    • occupancy_parts
    • atoms (atom_sites)
  • #36

    • sites_occupancy
    • sites_occupancy_parts
    • atom_locations
    • atom_locations_parts
  • #43

    • jump_names
    • nr_jumps
    • rates
    • e_act
  • #56

    • collective
    • collective_jumps
    • collective_matrix
    • multi_collective
    • solo_jumps
    • solo_frac
  • #48

    • jump_diffusivity
    • correlation_factor

Explore `lxml` for faster loading of `vasprun.xml`

Loading a large (~2 GB) vasprun.xml using pymatgen takes a couple of minutes.

It's using the standard library xml module to parse the element tree (see here).

See if we can use a faster library like lxml (source) to speed this up. It claims to follow the standard library ElementTree API.
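
Since lxml claims ElementTree API compatibility, the swap could be as simple as the sketch below. The streaming iterparse approach is an additional idea, not what pymatgen currently does; the 'calculation' tag corresponds to vasprun.xml's per-ionic-step blocks.

try:
    from lxml import etree  # fast C implementation
except ImportError:
    from xml.etree import ElementTree as etree  # stdlib fallback

# iterparse streams the file instead of building the whole tree at once,
# which matters for a ~2 GB vasprun.xml.
for _, elem in etree.iterparse('vasprun.xml', events=('end',)):
    if elem.tag == 'calculation':
        ...  # process one ionic step
        elem.clear()  # release the element to keep memory bounded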

Refactor ported code

As we ported more code, we realized that some structures could be better represented. This issue is to track that.

  • Refactor all calculate functions into "property" functions, which are calculated on demand (see the sketch after this list)
    • Also add a toggle to calculate on creation of the object.
  • Move diffusive element displacement code into Trajectories
  • Remove calculate folder and put calculations in corresponding files
  • Refactor all plots to use the Trajectory / Sites / Jumps classes where possible, so that we do not have to pass each property independently
  • Math (or TrajectoryStatistics) class for statistics concerning Trajectories
  • Remove diffusive_element specifiers on Trajectory functions, and instead allow a view on a trajectory, like:
li_trajectory = trajectory.where(element='Li', equilibration_steps=1250)
  • Better way to include all plots than using __all__
  • Write trajectory.precompute() and sites.precompute()
  • Fix the RDF functions
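
A minimal sketch of the property refactor in the first item, using functools.cached_property (the class and method names are illustrative):

from functools import cached_property


class Sites:
    def __init__(self, structure, *, precompute: bool = False):
        self.structure = structure
        if precompute:
            # Toggle: touching the property triggers the calculation on
            # creation of the object; otherwise it runs on first access.
            self.transitions

    @cached_property
    def transitions(self):
        # Expensive calculation; the result is cached after the first call.
        return self._calculate_transitions()

    def _calculate_transitions(self):
        ...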

Add RDF plots to dashboard

RDFS are still missing from the dashboard. Example snippet to generate RDF plots:

from gemdat import SimulationData, SitesData
from gemdat.io import load_known_material

equilibration_steps = 1250
diffusing_element = 'Li'
diffusion_dimensions = 3
z_ion = 1

VASP_XML = '/home/stef/md-analysis-matlab-example-short/vasprun.xml' 

data = SimulationData.from_vasprun(VASP_XML)

extras = data.calculate_all(
    equilibration_steps=equilibration_steps,
    diffusing_element=diffusing_element,
    z_ion=z_ion,
    diffusion_dimensions=diffusion_dimensions,
)

structure = load_known_material('argyrodite', supercell=(2,1,1))

sites = SitesData(structure)
sites.calculate_all(data=data, extras=extras)

from gemdat.rdf import calculate_rdfs, plot_rdf

rdfs = calculate_rdfs(
    data=data, 
    sites=sites, 
    diff_coords=extras.diff_coords, 
    n_steps=extras.n_steps, 
    equilibration_steps=extras.equilibration_steps,
    max_dist=10,
    resolution=0.1,
)

for state, rdf in rdfs.items():
    plot_rdf(rdf, name=state)

Make it possible to calculate required arrays for plots on the fly from the data

  • Currently this has to be done manually.

One option might be to add sensible computable defaults to all plots, as most arrays can be calculated from the Data arrays.
Another might be to make those arrays optionally computable on the Data object somehow.

The nice thing here would be to have this work transparently: if a user provides an array, that array is used; otherwise a default is calculated from the provided Data if possible.
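
A minimal sketch of that transparent behaviour (the plot and helper names are hypothetical):

def plot_displacement_per_element(data, *, displacements=None, **kwargs):
    # Use the user-provided array if there is one; otherwise fall back
    # to a sensible default computed from the Data object.
    if displacements is None:
        displacements = calculate_displacements(data)  # hypothetical helper
    ...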

Consider using xarray as a data store

Some of the data we are generating for the timesteps are well suited for storing in an xarray. Most of the data we are working with have some form of (time step, atom index) shape.

As dimensions we can use:

  • Time
  • Labels of the sites or (diffusing) atoms
  • Then we can use the parameters per atom/site (occupancy, transition state, speed, etc.) as columns in the array (see the sketch below)
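
A minimal sketch of what that could look like (shapes, labels, and timestep are illustrative):

import numpy as np
import xarray as xr

n_steps, n_atoms = 1000, 48
positions = np.random.random((n_steps, n_atoms, 3))  # stand-in data

da = xr.DataArray(
    positions,
    dims=('time', 'atom', 'xyz'),
    coords={
        'time': np.arange(n_steps) * 2.0,  # e.g. time in fs
        'atom': [f'Li{i}' for i in range(n_atoms)],  # site/atom labels
        'xyz': ['x', 'y', 'z'],
    },
)

li0 = da.sel(atom='Li0')  # one atom's trajectory, selected by label
production = da.sel(time=slice(2500, None))  # drop equilibration by time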

Implement sites that are dynamic over time

One of the stretch goals of the project would be to work with dynamic site locations.

At present, the sites are defined from a cif file and static with time. In a real scenario, the atomic clusters are oscillating/moving. This affects the jumps calculations.

Load sites from density

As a researcher working with MD data,
I want to load site locations from a density file,
so that I can have higher-accuracy analysis.

Loading the sites from a cif file works, but is not ideal, because they have to be defined manually and the positions are static.

As an alternative, we can generate the sites from the trajectory directly. The trajectory can be used to generate an electron density; in turn, we can use peaks in the electron density to define the positions of the sites.

For example,

  1. Take trajectory
  2. Make a voxel array with ~4 voxels / Angstrom from trajectory.get_lattice()?
  3. Squash over time axis and assign to voxels (bins)
  4. Find peaks in voxel array
  5. Convert peaks coordinates to structure in pymatgen

Alternative to steps 2-4: squash along the time axis and use cluster analysis to find the best n sites.
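
A rough sketch of steps 2-5, assuming frac_coords is an array with shape (time, atom, 3) in fractional coordinates and lattice is a pymatgen Lattice (the peak threshold is arbitrary):

import numpy as np
from scipy import ndimage

resolution = 0.25  # Angstrom per voxel, i.e. ~4 voxels / Angstrom
shape = tuple(int(round(length / resolution)) for length in lattice.lengths)

# Steps 2-3: squash over the time axis by binning all positions into voxels.
flat = frac_coords.reshape(-1, 3) % 1
density, _ = np.histogramdd(flat, bins=shape, range=[(0, 1)] * 3)

# Step 4: peaks are voxels that are local maxima (with periodic wrapping)
# and clearly above the background.
local_max = ndimage.maximum_filter(density, size=3, mode='wrap') == density
threshold = density.mean() + 3 * density.std()
peaks = np.argwhere(local_max & (density > threshold))

# Step 5: convert voxel indices to fractional coordinates for pymatgen.
site_frac_coords = (peaks + 0.5) / np.array(shape)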

Update pin for pymatgen

Pymatgen is currently pinned to my fork.

The label fixes were merged yesterday. Once a new release of pymatgen becomes available, we should update our pin to the latest version and make a new release on PyPI.

Add known materials from matlab code

Find/generate structures in CIF format for the known materials in the matlab code. These are the ones that are available:

  • argyrodite, supercell: (1 1 1)
  • latp, supercell: (1 1 1)
  • na3ps4, supercell: (2 2 2)
  • lisnps, supercell: (1 1 1)
  • li3ps4_beta, supercell: (1 1 2)
  • mno2_lambda, supercell: (1 1 1)
  • lagp, supercell: (1 2 2)

Trajectory class

I think we should set up a Trajectory class, which will make it easier to handle the simulation data.

Most of what we care about is the Trajectory anyway. The pymatgen trajectory class is somewhat limited for our use case.

This could also take over some of the methods in the vibration/displacements modules (speed, displacements, etc.).

Just writing some ideas here:

class Trajectory:
    _coords: np.ndarray  # shape: (time, site, xyz)
    structure: pymatgen.core.Structure

    @property
    def n_steps(self):
        # Return number of steps after equilibration time
        ...

    @property
    def coords(self):
        # Keep only the steps after equilibration
        return self._coords[-self.n_steps:, ...]

    def displacements(self, element: list[str] | str | None = None):
        # Return displacements for all / selected elements
        ...

    def speed(self, element: list[str] | str | None = None):
        # Return speed for all / selected elements
        ...

    def set_equilibration_time(self, equilibration_time):
        # Sets the starting point for the data
        ...

    def get_coords_for_element(self, label):
        # Replaces diff_coords, which is somewhat poorly named
        ...

Todo

  • Add a Trajectory.metadata attribute to track global simulation parameters like temperature
  • Replace test trajectories by fixtures
  • What makes sense to move from calculate_all() to GemdatTrajectory?
  • Add intuitive method to Trajectory to easily get coordinates for diffusing atom
  • Add Trajectory.from_PymatgenTrajectory
  • Update readme.md with new API

Pymatgen does not mod coordinates back to origin cell

When converting between positions and displacements on a trajectory, pymatgen does not completely convert back to the origin cell.

For example, a position at [0, 0, 0.001] may end up at [0, 0, 1.001].

>>> trajectory = Trajectory.from_vasprun(vasp_xml)
>>> coords1 = trajectory.filter('Li').coords
>>>
>>> trajectory.to_displacements()
>>> trajectory.to_positions()
>>> 
>>> coords2 = trajectory.filter('Li').coords
>>> 
>>> np.testing.assert_allclose(coords1, coords2)
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 33264 / 540000 (6.16%)
Max absolute difference: 1.
Max relative difference: 4761904.77031164
 x: array([[[0.13401 , 0.36404 , 0.028937],
        [0.09121 , 0.712146, 0.508927],
        [0.289686, 0.675079, 0.220109],...
 y: array([[[ 0.13401 ,  0.36404 ,  0.028937],
        [ 0.09121 ,  0.712146,  0.508927],
        [ 0.289686,  0.675079,  0.220109],...
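
A possible workaround, as a sketch: wrap both coordinate arrays back into the origin cell before comparing. Note that values sitting exactly on a cell boundary can still wrap to opposite sides, so some tolerance handling near 0/1 remains necessary.

import numpy as np

# 1.001 mod 1 == 0.001, so the example above passes after wrapping.
np.testing.assert_allclose(np.mod(coords1, 1), np.mod(coords2, 1))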

Load known materials from crystallographic information format

CIF files are the standard file format for storing crystallographic information.

Crystal structures for the materials we work with are available, e.g. for argyrodite.

In the matlab code these are coded by hand in known_materials.m. I would like to have these available in standard CIF format, so that:

  1. Any crystallographic software can be used to visualize/inspect/modify the crystal structures.
  2. Any known crystal structure can be used as a basis for the analysis without having to modify the code. This makes our tool more accessible.
  3. Errors are reduced by having a fixed crystallographic format.

Radial distribution functions

These are calculated in calc_rdfs.m. Data can be verified against rdf.mat.

  • rdf.distributions
  • rdf.integrated
  • rdf.rdf_names
  • rdf.elements
  • rdf.max_dist
  • rdf.resolution
  • rdf.total

Create a structure for the GEMDAT tool

It might look something like this:

option 1

  • GEMDAT
    • plot_all
    • plot_1
    • plot_2
    • plot_3
from GEMDAT import plot_1, plot_all
plot_1(<data_and_config>)
plot_all(<data_and_config>)

Or something like this:

option 2

  • GEMDAT
    • plot(<data_and_config>, <list of plots>)
    • plots
      • plot_1(<data_and_config>)
      • plot_2(<data_and_config>)
from GEMDAT import plot
plot(<data_and_config>, ['diffusivity', 'MSD'])  
from GEMDAT.plots import plot_diffusivity
plot_diffusivity(<data_and_config>)

data_and_config:
There will probably be a need to adjust the plots with some configuration (splitting into a number of smaller simulations, or cutting away the first few timesteps). I think it would be okay to just pass those through **kwargs and let all plot functions accept **kwargs, so they can extract the keywords they will use (see the sketch below).
All the possible keywords should be listed in the plot (or plot_all for option 1) function, but how they are implemented can be explained in the specific plot_xxx function.
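
A minimal sketch of the **kwargs passing for option 2 (function and keyword names are illustrative):

def plot_diffusivity(data, **kwargs):
    # Each plot extracts only the keywords it knows about.
    equilibration_steps = kwargs.get('equilibration_steps', 0)
    ...


def plot_msd(data, **kwargs):
    ...


PLOTS = {'diffusivity': plot_diffusivity, 'MSD': plot_msd}


def plot(data, names, **kwargs):
    # Dispatch to the selected plots, forwarding the shared configuration.
    for name in names:
        PLOTS[name](data, **kwargs)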


I prefer option two; what are your thoughts about this, @stefsmeets?

Make the plot_3d_jumps plot rotatable in the browser

I think plotly can be a good candidate for this, but the matplotlib to plotly conversion function seems to be broken for this figure at the moment.

  File "/home/vikko/local_projects/GEMDAT/.venv/lib/python3.11/site-packages/plotly/matplotlylib/mplexporter/exporter.py", line 289, in draw_collection
    offset_order = offset_dict[collection.get_offset_position()]

The issue is well known and is caused by a deprecation in matplotlib; this very ugly fix works:
mpld3/mpld3#477
But then again, the converter does not seem to understand more than 2 dimensions, so this is not the way to go.

If we want to do this it is probably best to re-implement it fully in plotly (see also comment below)

Improve nearest site finding algorithm

This function:

pdist = lattice.get_all_distances(atom_coords, site_coords)

already takes 15 s, and we expect it to become a bottleneck for larger datasets, so we should have a look at it.

Periodic boundaries should be taken into account.

Possible paths to explore:
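
One candidate is scipy's periodic k-d tree. A sketch, assuming fractional coordinate arrays site_frac_coords and atom_frac_coords: boxsize=1 handles the periodic boundaries, but fractional distances only map directly onto Cartesian ones for cubic cells, so this may work better as a fast pre-filter than a drop-in replacement.

import numpy as np
from scipy.spatial import cKDTree

# Build the tree once over the (static) site positions.
tree = cKDTree(site_frac_coords % 1, boxsize=1.0)

# Query the nearest site for every atom position in one vectorized call.
dist, nearest_site = tree.query(atom_frac_coords % 1)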

Refactor integration test.

  • Can we somehow isolate and group the different stages of the calculations? E.g. test_jumps, test_transitions_matrix, test_all_transition
  • See if it is feasible to test for more 'actionable' numbers (e.g. not the sum of a matrix)

issues raised by stefsmeets: #71 (comment)

Add support for `SitesData` in dashboard

#48 adds the first plot (plots.jumps.plot_jumps_vs_distance) that uses the SitesData class. We need a selector in the dashboard to:

  1. Set the known material (either cif or from the internal database in /src/data)
  2. Set the supercell (default (1,1,1))
  3. Initialize SitesData and pass this to the plots

Calculate transition energy between sites

With the transition energy we can determine the energy threshold that has to be crossed for a successful jump. If this energy is too low, we could inform the user that the sites are not well-defined and should probably be merged.
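
One possible estimate, as a sketch only (assumption: Boltzmann inversion of the relative occupancies of a site and the region between sites; this is not necessarily how it will be implemented):

import numpy as np

k_B = 8.617333e-5  # Boltzmann constant in eV/K


def barrier_estimate(occ_site, occ_between, temperature):
    # If the region between two sites is nearly as occupied as the sites
    # themselves, the estimated barrier is low and the sites should
    # probably be merged.
    return -k_B * temperature * np.log(occ_between / occ_site)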

Implement conditionals for a jump.

There are two main properties that can control this (assuming the sites are defined correctly).
This should be a post-processing step after defining the Sites, and it should probably be specified via parameters for calculating the jumps. A sketch follows after this list.

  • Time condition: only count a jump if the atom remains at the new site for a certain amount of time.
  • Decrease the size of sites: this decreases the likelihood of an oscillation between the sites generating fake jumps.
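
A sketch of the time condition as such a post-processing filter (the transitions layout, one row per event with start/stop step columns, is an assumption):

import numpy as np


def filter_jumps(transitions: np.ndarray, min_residence_steps: int = 10):
    # transitions: rows of (atom, from_site, to_site, start_step, stop_step).
    # Keep only jumps where the atom stayed at the new site long enough.
    residence = transitions[:, 4] - transitions[:, 3]
    return transitions[residence >= min_residence_steps]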

Revise how `SitesData.*_parts` data are being defined

To calculate statistics, we are calculating dedicated *_parts attributes on SitesData. The inputs for these analyses are no different from the full data set, just computed on a smaller subset.

So, instead of having dedicated parts variables like $VARIABLE_parts, consider feeding SitesData with the parent data already split (probably atom_sites or all_transitions), and use these as a basis for subsequent calculations.

Basically, instead of:

parent_data -> SitesData -> derived_data -> split -> derived_data_parts

(one full SitesData instance for the entire time series, with attributes containing lists of parts data), do:

parent_data -> SitesData -> derived_data
parent_data -> split -> SitesData -> derived_data_parts

(one SitesData instance for the entire time series, plus a list of SitesData instances, one for each part).

Dashboard

This tool lends itself very well to an interactive dashboard.

I would like to have something where you can load your project in the sidebar, set some parameters (like the diffusing element, equilibration time, etc), and have a selector for which plots to generate.

Plots and visualizations

This issue has a list of plots and visualizations that should be implemented.

  • Figure 1. Displacement of diffusing element -> #9
  • Figure 2. Histogram of displacement of diffusing element -> #9
  • Figure 3. Displacement per element -> #9
  • Figure 4. Density of diffusing element (3D) -> #63
  • Figure 5. Jumps between sites (3D) -> #61
  • Figure 6. Jumps vs distance -> #48
  • Figure 7. Histogram of vibrational amplitudes with fitted Gaussian -> #11
  • Figure 8. Occurrence vs frequency -> #11
  • Figure 9. Histogram of jumps vs time -> #61
  • Figure 10. Number of cooperative jumps per jump-type combination -> #61
  • Jumps movie -> #70

RDF (#57)

  • Figure 11. 48h Density vs Distance -> #66
  • Figure 12. Transition 48h -> #66
  • Figure 13. Transition 48h-48h -> #66

Calculate Sites from density instead of defining them

There are two ways to go about this:

  • Get X number of sites from a rasterized domain where the density is highest
  • Do a cluster analysis and get the peaks which correspond to sites.

A site should have a position and a size; the size might differ per dimension, so that ellipsoidal sites could also be defined.

Check if known structure lattice matches lattice from simulation

The matlab code has no check for whether the known materials structure matches the vasp/lammps data. A mismatch can happen if the cell orientation, supercell, or symmetry does not match.

We can add a basic check on the lattice parameters (within some tolerance, e.g. 0.5 Angstrom / 1 degree) to prevent potential errors.

This check should probably be implemented in SitesData.calculate_all (compare SimulationData.structure with SitesData.structure).
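
A sketch of that check, using pymatgen's lattice parameters (a, b, c in Angstrom and alpha, beta, gamma in degrees):

import numpy as np


def check_lattices_match(data_structure, sites_structure,
                         length_tol=0.5, angle_tol=1.0):
    # lattice.parameters is (a, b, c, alpha, beta, gamma)
    p1 = np.array(data_structure.lattice.parameters)
    p2 = np.array(sites_structure.lattice.parameters)
    if not np.allclose(p1[:3], p2[:3], atol=length_tol):
        raise ValueError(f'Lattice lengths do not match: {p1[:3]} vs {p2[:3]}')
    if not np.allclose(p1[3:], p2[3:], atol=angle_tol):
        raise ValueError(f'Lattice angles do not match: {p1[3:]} vs {p2[3:]}')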

Fix x-axis for RDF plots

The RDF plots have incorrect labels on the x-axis. Currently they just use the bin number; these should be converted to the distance in Angstrom using the bin resolution.
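
A minimal sketch of the fix, assuming the resolution used to build the RDF (0.1 Angstrom in the dashboard snippet above) is available alongside the counts:

import numpy as np

# Distance axis in Angstrom instead of raw bin numbers; shift by half a
# bin to label bin centers rather than edges.
x = (np.arange(len(rdf)) + 0.5) * resolution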
