
mdsuite's People

Contributors

christophlohrmann, dependabot[bot], fratorhe, kaiszuttor, marcbrueck, psomers3, pythonfz, samtov, tobiasmerkt


mdsuite's Issues

Angular Distribution Function

Is your feature request related to a problem? Please describe.
When studying molecular or crystal systems, it is often important to perform some bonding analysis. This is best done with angular information about the atoms or particles. For this reason, it would be helpful to have access to angular distribution functions.

Describe the solution you'd like
Implement an angular distribution function calculation.
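
A minimal sketch of such a calculation, assuming an (n_atoms, 3) NumPy array of positions; the cutoff and bin count are illustrative, and periodic boundaries are ignored for brevity.

    import numpy as np
    from itertools import combinations

    def angular_distribution(positions, cutoff=3.0, bins=90):
        """Histogram the angles formed at each atom by pairs of neighbours within a cutoff."""
        # naive O(N^2) distance matrix; no periodic boundary handling here
        distances = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
        angles = []
        for i in range(len(positions)):
            neighbours = np.where((distances[i] < cutoff) & (distances[i] > 0.0))[0]
            for j, k in combinations(neighbours, 2):
                v1, v2 = positions[j] - positions[i], positions[k] - positions[i]
                cos_theta = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
                angles.append(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))
        return np.histogram(angles, bins=bins, range=(0.0, 180.0))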

Describe alternatives you've considered
None

Additional context
None

GK Diffusion with sampling is too slow

The GK calculation is itself not slow. However, since we introduced the error calculation over correlation lengths, it has become too slow. This should be rectified.

Add method for the addition of data to a database, with class state updates.

When running simulations, it is often only possible to run parts of the simulation at any one time. It would be beneficial to be able to add data to a database rather than having to build the database again. I would suggest a class method which can add data to a pre-existing database, so that analysis can be performed at various stages of the simulation.

Add a common method for adding data to the hdf5 database

It will probably be easier to have a shared class method for adding data to the HDF5 database, which can also take care of checking whether the data should go into a new dataset or whether a previous dataset should be expanded and the new data appended.
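
A minimal sketch of what such a shared method could look like, assuming h5py, NumPy arrays, and datasets created with an unlimited first axis; all names here are illustrative.

    import h5py
    import numpy as np

    def add_or_extend(database_path, group, name, data):
        """Create the dataset if absent; otherwise grow it and append the new data."""
        data = np.asarray(data)
        with h5py.File(database_path, "a") as db:
            grp = db.require_group(group)
            if name not in grp:
                # unlimited first axis so the dataset can be expanded later
                grp.create_dataset(name, data=data, maxshape=(None,) + data.shape[1:])
            else:
                dataset = grp[name]
                old_size = dataset.shape[0]
                dataset.resize(old_size + data.shape[0], axis=0)
                dataset[old_size:] = data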

Cannot find GPU

Describe the bug
File "analyse_simulation.py", line 17, in
NaCl_1400K.add_data(trajectory_file='dump.lammpstrj')
File "/tikhome/mbrueckner/.local/lib/python3.8/site-packages/mdsuite/experiment/experiment.py", line 316, in add_data
self._build_new_database()
File "/tikhome/mbrueckner/.local/lib/python3.8/site-packages/mdsuite/experiment/experiment.py", line 294, in _build_new_database
trajectory_reader.process_trajectory_file() # get properties of the trajectory and update the class
File "/tikhome/mbrueckner/.local/lib/python3.8/site-packages/mdsuite/file_io/lammps_trajectory_files.py", line 68, in process_trajectory_file
batch_size = optimize_batch_size(self.project.trajectory_file, number_of_configurations)
File "/tikhome/mbrueckner/.local/lib/python3.8/site-packages/mdsuite/utils/meta_functions.py", line 96, in optimize_batch_size
computer_statistics = get_machine_properties() # Get computer statistics
File "/tikhome/mbrueckner/.local/lib/python3.8/site-packages/mdsuite/utils/meta_functions.py", line 69, in get_machine_properties
total_gpu_devices = GPUtil.getGPUs() # get information on all the gpu's
File "/tikhome/mbrueckner/.local/lib/python3.8/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
deviceIds = int(vals[i])
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Driver/library version mismatch'
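
One possible guard on the MDSuite side, while the driver/library mismatch itself has to be fixed on the machine: treat any failure of GPUtil as "no usable GPU" instead of letting the ValueError propagate out of get_machine_properties(). This is a sketch, not the project's actual fix.

    import GPUtil

    try:
        total_gpu_devices = GPUtil.getGPUs()  # get information on all the GPUs
    except ValueError:
        # nvidia-smi returned an error string (e.g. "Driver/library version
        # mismatch") instead of device data; fall back to CPU-only statistics
        total_gpu_devices = []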

Add functionality for unwrapping added data

Currently, if you add more data to a dataset and then try to unwrap the coordinates, the new data simply won't be unwrapped. The database reports that the group already exists, so the code decides the calculation is unnecessary and ends it.

We can handle this in two ways, as I see it:

1.) If the number of configurations in the current unwrapped coordinates does not match the number in the positions file, delete the unwrapped coordinates and just re-run it.

2.) Same check but work out the best way to add the data.

The second is of course better, especially for huge simulations, but we will see. Laziness will out.
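
A sketch of the first option, assuming h5py groups laid out as <species>/Positions and <species>/Unwrapped_Positions; the group names are illustrative.

    import h5py

    def unwrapped_is_stale(database_path, species):
        """Delete the unwrapped group when added data has made it inconsistent."""
        with h5py.File(database_path, "a") as db:
            positions = db[f"{species}/Positions"]
            key = f"{species}/Unwrapped_Positions"
            if key in db and db[key].shape != positions.shape:
                del db[key]  # configuration counts no longer match: drop the stale group
                return True  # caller should re-run the unwrapping
            return key not in db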

Implement Neighbour list and triple neighbour list calculations in PyTorch

In order to calculate the RDFs and ADFs, we need to generate neighbour lists. If done in a tensor-centric manner, this can be deployed on a GPU, possibly saving hours or days of analysis time on big systems. In the case of ADFs, we need information about angles, and therefore a triple-neighbour analysis must be performed.
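
A minimal sketch of the tensor-centric pair list, assuming an (n_atoms, 3) positions tensor; triples for the ADF could then be built by grouping the pairs on their first index.

    import torch

    def pair_neighbour_list(positions: torch.Tensor, cutoff: float) -> torch.Tensor:
        """Return an (n_pairs, 2) tensor of index pairs with separation below the cutoff."""
        distances = torch.cdist(positions, positions)  # all pairwise distances
        mask = (distances < cutoff).triu(diagonal=1)   # keep each pair once, drop self-pairs
        return mask.nonzero()                          # runs on the GPU if positions live there

    # e.g. pairs = pair_neighbour_list(positions.to("cuda"), cutoff=3.0)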

Error GK diffusion

I get an error when computing the GK diffusion. It seems to be related to the multiprocessing map function.
On Windows it starts looping and gets quite crazy. On Linux the error is more understandable:
multiprocessing.pool.RemoteTraceback:

"""
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/mnt/c/Users/User/python/MDSuite/mdsuite/analysis/green_kubo_diffusion_coefficients.py", line 92, in _singular_diffusion_calculation
mode='full', method='fft') +
IndexError: too many indices for array
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "conductivity_gk.py", line 19, in
argon.green_kubo_diffusion_coefficients(plot=True, data_range=50)
File "/mnt/c/Users/User/python/MDSuite/mdsuite/experiment/experiment.py", line 422, in green_kubo_diffusion_coefficients
calculation_gkd.run_analysis() # run the analysis
File "/mnt/c/Users/User/python/MDSuite/mdsuite/analysis/green_kubo_diffusion_coefficients.py", line 206, in run_analysis
self._singular_diffusion_coefficients() # calculate the singular diffusion coefficients
File "/mnt/c/Users/User/python/MDSuite/mdsuite/analysis/green_kubo_diffusion_coefficients.py", line 118, in _singular_diffusion_coefficients
result = p.map(self._singular_diffusion_calculation, zip(data, self.species))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
IndexError: too many indices for array

Add database generation for unwrapping via indices

If the box images (ix, iy, and iz) are given in the trajectory files, the database generator should simply multiply these image flags by the box lengths and add the result to the coordinates, applying the transformation at database-generation run time.
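
A sketch of the transformation, assuming per-configuration NumPy arrays for positions and image flags and the usual LAMMPS convention x_unwrapped = x + ix * L_x; the names are illustrative.

    import numpy as np

    def unwrap_coordinates(positions, image_flags, box_lengths):
        """Apply x_unwrapped = x + ix * L_x componentwise for every atom."""
        return positions + image_flags * np.asarray(box_lengths)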

Merge experiment files

Is your feature request related to a problem? Please describe.
As discussed, it would be better to bring all the experiment classes together, since we now have the run_computation method.

Describe the solution you'd like
Put it all together such that either per-atom or flux properties can be computed.

Describe alternatives you've considered
Keep it as is.

Parallelization of database generation

We must implement parallelization of the database generation, either with OpenMP, MPI, or a Cython implementation. The code is in the Project.py module (formerly Classes.py), starting from line 160. We should parallelise the for _ in tqdm(...) loop below, which reads in each batch of configurations and processes it. A sketch of one approach follows the snippet.

    with open(self.filename) as f:
        counter = 0
        for _ in tqdm(range(int(self.number_of_configurations / self.batch_size))):
            batch = self.read_configurations(self.batch_size, f)   # read one batch from the file
            self.process_configurations(batch, database, counter)  # write it to the database
            counter += self.batch_size
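
One hedged way to parallelise this: keep the file reads serial (file access stays sequential) and the HDF5 writes in the main process, and hand only the CPU-bound parsing of each batch to a worker pool. parse_batch, read_batch, and write_batch are hypothetical stand-ins for the existing helpers.

    from multiprocessing import Pool

    def parse_batch(raw_batch):
        """Stand-in for the CPU-bound work done on one batch of configurations."""
        return raw_batch  # e.g. split lines, cast to floats, group by species

    def build_database(filename, n_batches, batch_size, read_batch, write_batch):
        with open(filename) as f, Pool() as pool:
            raw_batches = (read_batch(batch_size, f) for _ in range(n_batches))   # serial I/O
            for index, parsed in enumerate(pool.imap(parse_batch, raw_batches)):  # parallel CPU work
                write_batch(parsed, index * batch_size)  # HDF5 writes stay in the main process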

Sorting atoms in database generation

Problem

When the atoms in a configuration are read in, they are assumed to be dumped in the same order at each time step. If one runs LAMMPS with sorted dump output (dump_modify sort id), this is no problem. However, this is not always the case, and so MDSuite will fail under some circumstances.

Solutions

  • Store the atom IDs and check whether an on-the-fly .index() lookup over the list of IDs is a viable way to build the array in line 199 of file_read.py.
  • Sort the atoms on the fly and then read them into the database.
  • Run an external C++ program which reformats the trajectory file.

I am in favour of the first option, but this will come down to speed.
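
For reference, the first option amounts to an argsort over the dumped IDs; a minimal sketch, with illustrative names:

    import numpy as np

    def sort_configuration(atom_ids, configuration):
        """Reorder the rows of one configuration so the atom IDs are ascending."""
        return configuration[np.argsort(atom_ids)]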

TypeError in Classes.py

Hi Sam,

After your commit "Fix hard coded conductivity charge assignment and add periodic table functionality" (fa8862a), there is a TypeError: 'int' object is not subscriptable in:

    diffusion_array.append(_diffusion_coefficients["Singular"][element] * abs(self.species[element]['charge'][0]) * (len(self.species[element]['indices']) / self.number_of_atoms))

Structure Factors

Implement a structure factor calculation using the existing RDF calculation. Use a database to determine the element-specific multiplication factors, and add this property to the species dict so that it can be adjusted during an analysis.
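
A sketch of the calculation from an existing RDF, using the standard Fourier relation S(q) = 1 + 4 pi rho * integral of r^2 (g(r) - 1) sin(qr)/(qr) dr; the element-specific factors from the species dict would weight the partial terms.

    import numpy as np

    def structure_factor(r, g_r, rho, q_values):
        """Compute S(q) from a radial distribution function g(r) at number density rho."""
        s_q = []
        for q in q_values:
            # np.sinc(x) = sin(pi x)/(pi x), so np.sinc(q r / pi) = sin(q r)/(q r)
            integrand = r**2 * (g_r - 1.0) * np.sinc(q * r / np.pi)
            s_q.append(1.0 + 4.0 * np.pi * rho * np.trapz(integrand, r))
        return np.array(s_q)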

Add command line arguments for selected operations.

For the addition of data to a database, and occasionally for the construction of a new database, it would be helpful to have a command line interface which allows this to be automated. For example, I can run a simulation on a cluster where I schedule a re-run after 2 days. It would be nice for the cluster script to be able to build an MDSuite database after the first run and then update the database after subsequent runs. This would be best done through the command line.
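
A minimal sketch of such an entry point using argparse; the subcommand names and the mdsuite calls they would dispatch to are illustrative.

    import argparse

    def main():
        parser = argparse.ArgumentParser(description="Build or update an MDSuite database.")
        sub = parser.add_subparsers(dest="command", required=True)

        build = sub.add_parser("build", help="construct a new database")
        build.add_argument("trajectory")

        update = sub.add_parser("update", help="append new data to an existing database")
        update.add_argument("trajectory")

        args = parser.parse_args()
        # dispatch to the database builder or to experiment.add_data(args.trajectory) here

    if __name__ == "__main__":
        main()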

Parallelization

Is your feature request related to a problem? Please describe.
Speed is always a problem and we would like to improve it.

Describe the solution you'd like
We would like to implement all of the database construction in parallel so that we can reduce the time taken for the initial building.

Using mpi4py, parallelise the database construction and the analysis modules.

In the case of the database construction, find the best place for parallelisation. Possibilities include:
1.) Over species and properties storage
2.) Over batches.
- In this case, a quick calculation will need to be made on whether it is faster to make the batch size bigger, or to split into two memory safe processes.

In the analysis modules, the parallelisation should take place over species. The batch size attribute of the project class already contains memory information for parallel processes. It should, however, be a keyword, as on smaller systems it is the difference between a 10 s and a 5 s calculation. For huge systems, this can make a big difference.

Describe alternatives you've considered
A serial implementation is completely fine for most analyses, but there is only so much optimisation can do for the database construction. We could also consider moving the HDF5 database building to a C++ function, as this is probably faster.

Additional context
The database construction is performed in the process_configurations method in the FileProcessor parent class. You will find this in the file_io directory.

The database itself is opened as an object in the _fill_database private method in the experiment class. This method loads a batch of data and passes it to the process_configurations method for loading into the database. This is done for all batches.

Add custom data loading to the add_data method

We would like to be able to add custom groups to the HDF5 databases for things like thermal flux, and maybe a viscosity calculation or two. In these cases, it would be good to have a keyword argument on the add_data and load_matrix methods to force them not to look at the species information but rather at a custom group.

Improve units management

Is your feature request related to a problem? Please describe.
There are several systems of units that are very often used in MD. For the moment, we have implemented the method units_to_si(units_system) in the class ProjectMethods (experiment_methods.py; maybe we should rename this file as well...), and we also have a constants.py file with a bunch of constants in SI.

However, sometimes one may need the Boltzmann constant in "real" or "metal" units, and then we may need to add conversions here and there, which may make the code less readable. Since the units are already loaded in the Project, it would be easier to simply call something like self.units['boltz'].

Describe the solution you'd like
I would put all the units in a file called systems_units.py, in a dict of dicts, as is currently done in units_to_si.
However, we can also include any constants or other units which one may need.
The main benefit would be to have everything in one place.
The method units_to_si can still be there, but instead of creating the dict there, it would load it from the new file.

Many of the changes needed are available in the LAMMPS src/update.cpp file.
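
A sketch of the proposed layout; the numerical values are taken from the LAMMPS documentation and should be checked against src/update.cpp, and the keys are illustrative.

    # systems_units.py (proposed)
    standard_units = {
        "real": {
            "time": 1e-15,               # fs -> s
            "length": 1e-10,             # Angstrom -> m
            "boltzmann": 0.0019872067,   # kcal/(mol K), i.e. already in "real" units
        },
        "metal": {
            "time": 1e-12,               # ps -> s
            "length": 1e-10,             # Angstrom -> m
            "boltzmann": 8.617333e-5,    # eV/K, i.e. already in "metal" units
        },
    }

    def units_to_si(units_system):
        return standard_units[units_system]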

Describe alternatives you've considered
We can keep it as it is.
Another option would be to have a file for each system of units, and only read that or something similar, but I don't think the gain would be worth it.

Additional context
I think units are quite a fundamental part of keeping the calculations consistent, and they need to be easy to reach so that users can add their own systems of units if needed.

If this is OK with you, I will take care of the implementation.

Move lammps_properties_dict

Is your feature request related to a problem? Please describe.
Due to the changes in the units, the file constants.py will be left with only a dictionary called lammps_properties_dict. This dictionary is used to map the LAMMPS keywords (vx, vy, vz) to the database structure (velocity: {x, y, z}).

Describe the solution you'd like
We should move this dict somewhere else, possibly inside lammps_trajectory_files.py.
Like this, every other processor for a different code would have the same structure.
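
For reference, the kind of mapping the dictionary holds (a subset, with illustrative keys):

    lammps_properties_dict = {
        "Positions": ["x", "y", "z"],
        "Velocities": ["vx", "vy", "vz"],
        "Forces": ["fx", "fy", "fz"],
    }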

Describe alternatives you've considered
Another alternative is to have a file in the file_io folder with a structure similar to units.py. But with this option, for a new software the user would have to add a SOFTWARETrajectoryFile class and also the mapping dictionary software_properties_dict in two different files, which I find a bit confusing.

Additional context
Not of the highest priority, but it will help to clean up the code.

Restructure project for easier enhancement

Move all analysis into its own class and into a subdirectory. Adjust the main class such that it can instantiate these private classes and perform operations. This will allow for several stages of analysis to be performed when a function is called.

For example, upon calling einstein_diffusion_coefficients, the code could:

  • Run position autocorrelation to get the correlation time
  • Calculate ensemble averages of the msd over ensemble time to ensure uncorrelated data
  • Perform a comprehensive curve fit analysis on the msd to ensure an accurate error in the fitting process

This is especially important in the case of Green-Kubo analysis, where the integral should be taken over several time lengths to ensure convergence of the observable.

Garbage collection for used variables.

After a specific analysis is completed, the variables generated to perform the analysis remain in memory. These should be deleted and flushed to avoid memory usage building up. Relevant data is stored to the class and saved, so nothing should be left behind after an analysis is complete.
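
A sketch of the kind of cleanup step meant here; the attribute names are illustrative stand-ins for analysis intermediates.

    import gc

    def cleanup(calculator, attributes=("velocity_matrix", "correlation_data")):
        """Drop large intermediates once the results are stored on the class."""
        for name in attributes:
            if hasattr(calculator, name):
                delattr(calculator, name)
        gc.collect()  # prompt the interpreter to release the freed memory now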

Implement units in other way

Sam,

I would suggest implementing the different unit systems of LAMMPS as input to the code. That is, the class constructor would take an argument unit_type = "metal", "real", etc., and internally create self.time_unit, self.length_unit, etc. accordingly. If required, we can leave an option "other" so the user can introduce the units in a dictionary and pass it via **kwargs, or pass it as a file (unit_type="file_unit") to be read from a JSON file in the same directory.

Benefits:

  • More standardized class
  • Fewer inputs
  • Less prone to unit errors
  • Should not be a big change to the code.

Drawbacks:

  • Focused on LAMMPS.
  • Some code re-writing.

If you like the idea, I can take care of the implementation.

Add checks for the hdf5 functions

If an HDF5 database is to be generated and it already exists, errors are thrown. This should be replaced by simply performing the calculation with the available data and leaving a message.
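
A sketch of the desired behaviour, assuming h5py; the message wording is illustrative.

    import os
    import h5py

    def open_database(path):
        """Reuse an existing database instead of raising on re-creation."""
        if os.path.exists(path):
            print(f"{path} already exists -- continuing with the available data.")
            return h5py.File(path, "r+")  # read/write, file must already exist
        return h5py.File(path, "w")       # create a new database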

Write_XYZ uses pre-defined species

Hello Sam,

I was reading the code, and in file Meta_functions, line 24, the species are pre-defined, which I guess can cause problems in the future.
I am not sure whether the species should be an input to the function, or whether the function can retrieve them from the array.
It is a very minor issue, but better to get it fixed now.

Regards,
Francisco

Test JAX for correlation computation

Is your feature request related to a problem? Please describe.
JAX has a correlation function equivalent to scipy's. It would be nice to test how fast it is.
The only problem is that it is not compatible with Windows.

Describe the solution you'd like
If it is faster, we could use it as the main computation package, leaving scipy only for Windows users.
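
A sketch of the correctness side of the comparison, assuming jax.numpy.correlate as the drop-in counterpart of the scipy call; timing would be done separately (e.g. with time.perf_counter).

    import numpy as np
    from scipy import signal
    import jax
    import jax.numpy as jnp

    jax.config.update("jax_enable_x64", True)  # JAX defaults to float32 otherwise

    x = np.random.rand(10_000)
    reference = signal.correlate(x, x, mode="full", method="fft")           # current approach
    candidate = jnp.correlate(jnp.asarray(x), jnp.asarray(x), mode="full")  # JAX equivalent

    assert np.allclose(reference, np.asarray(candidate))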

Put together properties of experiment

In experiment.py, lines 179-190, there are the properties of the experiment (computed or not).
I think we could put all of them in a single dictionary.

Benefits:

  • Easy dump of properties in a json format
  • Easier reading/writing of property values with the new structure and the single run_computation method.
  • IMO less clutter.
  • If needed, we could keep the current property names as aliases to the new dictionary structure (not sure how to do it, but it should be easy).

If this seems good, I will implement it along with the single run method.
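
A sketch of the dictionary; the property names are illustrative, and computed values would replace the None placeholders, so the whole thing can be dumped with json.dump as-is.

    experiment_properties = {
        "einstein_diffusion_coefficients": None,
        "green_kubo_diffusion_coefficients": None,
        "ionic_conductivity": None,
        "thermal_conductivity": None,
    }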

Improve computation calls

It feels a bit uncomfortable that every time a new calculator needs to be implemented, it has to be added in two different places:

  • In the Experiment class: create a new function for each calculator
  • In the analysis methods: create the actual class that performs the calculation

I was thinking that we could have a method in Experiment called, for example, calculate, such that:
Experiment.calculate('radial_distribution_function', **kwargs)
Then this calculate method would take care of finding the calculator (maybe from the filename? or with a dictionary), instantiating the actual computation class, and getting the computation done.
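
A sketch of the dispatch idea; the calculator class and its run_analysis() interface are stand-ins for the real analysis classes.

    class RadialDistributionFunction:
        """Stand-in for an analysis class from the analysis modules."""
        def __init__(self, experiment, **kwargs):
            self.experiment, self.kwargs = experiment, kwargs

        def run_analysis(self):
            ...  # the actual computation lives here

    calculators = {"radial_distribution_function": RadialDistributionFunction}

    class Experiment:
        def calculate(self, name, **kwargs):
            if name not in calculators:
                raise ValueError(f"unknown calculator {name!r}; available: {sorted(calculators)}")
            return calculators[name](self, **kwargs).run_analysis()

    # usage: Experiment().calculate("radial_distribution_function", plot=True)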

Benefits:

  • Having only one compute method (instead of many small compute methods)
  • User could list the available computation methods easily.
  • A more standardized way of calling the calculators.
  • We would only need to add the computation class, and then maybe its name in the "calculators" dispatch dictionary.
  • If this dispatch dict is somewhere accessible, we could call a "compute all possible properties", which could be useful in some cases.

Drawbacks:

  • May be tricky if the calling conventions of the different calculators differ a lot.

Create a plotting class

We could really use a nice plotting class for the different types of analysis. I would suggest the following functionality.

  • Colour blind safe. Try to use shapes and linestyles before moving to colour as a defining feature.
  • Options for a split axis. In cases where we would like to plot two functions which depend on different y or x axes on top of each other, there should be options to handle this.
  • The plots should only be saved as eps or svg files so that users can come back and move elements around later.
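
A minimal sketch with matplotlib, following the wishlist above: styles cycle through linestyles and markers before colour, and figures are saved as vector graphics. A split y axis would use ax.twinx(). All names are illustrative.

    import matplotlib.pyplot as plt
    from cycler import cycler

    class Plotter:
        """Colour-blind-safe plotting: vary linestyle and marker before colour."""
        style_cycle = cycler(linestyle=["-", "--", ":", "-."]) + cycler(marker=["o", "s", "^", "d"])

        def plot(self, curves, filename="figure.svg"):
            fig, ax = plt.subplots()
            ax.set_prop_cycle(self.style_cycle)
            for x, y, label in curves:
                ax.plot(x, y, label=label, color="black")  # shapes, not colour, distinguish curves
            ax.legend()
            fig.savefig(filename)  # svg/eps keep elements editable later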

Dump results to a json file

I think it would also be nice to have a function in the parent class that allows dumping the results from the HDF5 to JSON files.
Reasons:

  • JSON files are very easy to read/write (better than a simple print)
  • they give a quick overview of the results without the need for plotting or opening the database
  • we can then use several JSON files from different projects to easily plot results, for example if we want a property as a function of temperature
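
A sketch of the dump helper, assuming the results have already been collected into a dictionary on the class; the attribute and file names are illustrative.

    import json

    def dump_results_to_json(self, path="results.json"):
        """Write the computed properties to a human-readable JSON file."""
        with open(path, "w") as f:
            json.dump(self.results, f, indent=2)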

Invalid Input needs to be checked

Describe the bug
The user can give invalid input, and there is no functionality that checks for it.
For example, the Einstein diffusion coefficient has data_range as a keyword argument.
The user can choose this argument to be greater than the number of configurations, which does not make sense.

Suggested fix
Implement assertions or explicit checks wherever the user can provide input.
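
A sketch of the kind of check meant here, for the data_range example above; the message wording is illustrative.

    def validate_data_range(data_range, number_of_configurations):
        if not 0 < data_range <= number_of_configurations:
            raise ValueError(
                f"data_range={data_range} must be between 1 and the number of "
                f"configurations ({number_of_configurations})"
            )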
