theochem / atomdb
An Extended Periodic Table of Neutral and Charged Atomic Species
Home Page: http://atomdb.qcdevs.org/
License: GNU General Public License v3.0
The GitHub Actions workflow that builds the package with Python 3.7 on Ubuntu now fails. This happened after updating the link to the IOData dependency in the .toml file (commit f637b2d).
The change only affects the developer version of AtomDB (IOData is listed under the optional dev dependencies in the toml), but it breaks the GitHub Actions workflow when tests that need these dependencies are run.
Currently we have the following raw data files under atomdb/data
slater_atom.tar.xz (137 K)
database_beta_1.3.0.h5 (12M)
c6cp04533b1.csv (135K)
HCI data is in Cedar, ~200GB uncompressed
Instead of an Atom class, make an Element class, since the information in the elements table is only defined for the neutral species of each atom.
With this change one does not need to specify the charge as input.
Additionally, the multiplicity column should be removed, since this information can be parsed from a separate table of multiplicities.
We should have a table of which property is available for each dataset, both in the code, and in the published documentation.
In the library, we should also gracefully handle error cases where the user attempts to access a property that is unavailable in the current dataset.
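The graceful handling described above could look like the following minimal sketch. The class shape and method name (`get_property`) are assumptions for illustration, not the actual AtomDB API; the point is only that a missing field should raise a clear, dataset-aware error instead of silently returning `None`.

```python
# Sketch (hypothetical names): guard property access so that a field missing
# from the loaded dataset raises a clear error instead of returning None.
class Species:
    def __init__(self, dataset, fields):
        self.dataset = dataset
        self._fields = fields  # properties compiled for this dataset

    def get_property(self, name):
        value = self._fields.get(name)
        if value is None:
            raise ValueError(
                f"Property '{name}' is not available in dataset '{self.dataset}'."
            )
        return value
```

A lookup table of which dataset provides which property (as proposed above) could then drive both this check and the documentation table.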
@gabrielasd and @msricher do we have something to control for the fact that the log of the atomic density/orbital energy/orbital densities may fail when the argument of the log is too close to zero? Something like:
https://github.com/theochem/denspart/blob/e21de342078035669832b693cb49b269870a6e02/denspart/vh.py#L434
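One possible safeguard, in the spirit of the denspart snippet linked above, is to clip the argument of the log to a small positive floor before evaluating it. This is a sketch, not the denspart implementation; the cutoff value is an assumption that would need tuning to the precision of the tabulated densities.

```python
import numpy as np

# Hypothetical cutoff; tune to the precision of the stored density data.
LOG_FLOOR = 1e-300

def safe_log(values):
    """Elementwise log with the argument clipped away from zero,
    so near-zero densities do not produce -inf or NaN."""
    values = np.asarray(values, dtype=float)
    return np.log(np.clip(values, LOG_FLOOR, None))
```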
Originally posted by @PaulWAyers in #1
This issue contributes to the completion of issue #8
To compile this database one needs to place the files:
c6cp04533b1.csv
database_beta_1.3.0.h5
(currently in atomdb/data) under a folder raw in atomdb/datasets/nist.
(Important: the raw files and their data must not be added to source control; only developers need access to these.)
More generally, to generate the data files with the tools in the api module, the program expects the following folder structure:
MYDATAPATH/DATASET/raw
where DATASET is the folder for the specific database (the nist folder here), raw is the folder containing the initial files that will be processed to create the standardized database information (what SpeciesData defines), and MYDATAPATH is some path leading to DATASET (basically the path set by the keyword argument datapath shown below).
Then the serialized data can be generated with the compile function from the API:
atomdb.compile(atnum, charge, mult, 0, database, datapath=mydatapath)
The database argument refers to the specific source of raw data, in this case the nist dataset, and the optional datapath argument sets the path to this dataset folder.
If you placed the raw data inside the atomdb package (as suggested above) there is no need to specify this argument; it will take the value defined by the environment variable DEFAULT_DATAPATH defined in the API. However, it allows specifying a custom path in which to look for the raw files.
For example, to create the MessagePack file for the neutral beryllium atom from the nist raw data (placed in the default path), do:
atnum = 4
charge = 0
mult = 1
database = "nist"
atomdb.compile(atnum, charge, mult, 0, database)
make_promolecule fails for small fractional charges. See the code example.
from atomdb import make_promolecule
import numpy as np

atnums = [7]
charges = [0.1]
atcoords = np.array([[0.0, 0.0, 0.0]])

# Build a promolecule
promol = make_promolecule(
    atnums,
    atcoords,
    charges=charges,
    units="bohr",
    dataset="gaussian",
)
The cause is in lines 620-623 of promolecules.py
# Handle default multiplicity parameters
if mults is None:
    # Force non-int charge to be integer here; will be overwritten below.
    mults = [MULTIPLICITIES[(atnum, charge)] for (atnum, charge) in zip(atnums, charges)]
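The lookup fails because MULTIPLICITIES is keyed by integer charges, while the user may pass a fractional charge such as 0.1. One possible fix, sketched below under the assumption that fractional species are treated as combinations of the two bracketing integer species (consistent with the convex-hull treatment used elsewhere in AtomDB), is to look up the multiplicities at floor and ceiling of the charge. The function name and return convention are hypothetical.

```python
import math

def default_mults(atnums, charges, multiplicities):
    """Sketch: default multiplicities that tolerate fractional charges.

    `multiplicities` is assumed to be keyed by (atnum, integer_charge)."""
    mults = []
    for atnum, charge in zip(atnums, charges):
        lo, hi = math.floor(charge), math.ceil(charge)
        if lo == hi:
            mults.append(multiplicities[(atnum, lo)])
        else:
            # One multiplicity per bracketing integer species; the caller can
            # combine them with the same weights used for the densities.
            mults.append((multiplicities[(atnum, lo)], multiplicities[(atnum, hi)]))
    return mults
```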
The function that retrieves the data from the numeric Hartree-Fock computations only requires the atomic number and number of electrons as parameters. This assumes that the atomic species is the most stable one for the given atom and charge, but there is no specification for the multiplicity, so data with an incorrect multiplicity may be retrieved.
Use the NIST data to assign the reference mult value based on the ground state for the element/number of electrons.
Highly charged species may require taking the ground state of the corresponding isoelectronic species.
Hi, everyone. I'm trying to help out with improving/finalizing the API, and find it useful to start with the Jupyter Notebooks, ensuring that these work before moving forward. I've made some fixes, but also discovered where things don't really work. I took some time this afternoon to do this, and have made a branch in my forked version of the repo to easily show where I've had to adjust the repo's code to force things to work for me. This branch is located at: https://github.com/maximilianvz/AtomDB/tree/fix_notebooks
I have three commits thus far. The first one regarding the API changes should be self-explanatory and primarily fixes improper attribute naming in the notebooks. The other two are a bit more involved, and I think they are a consequence of a lack of uniformity when compiling datasets, so I'll explain things briefly below:
In the Species class here, the atmass attribute is intended to be stored as a dictionary with two keys. However, in the Promolecule_Tools.ipynb notebook, the beryllium atoms that get loaded to form the promolecule only have floats stored in the atmass attribute. You can test this by running promol.atoms[0].atmass in the second cell. If I don't include the change to the code in this commit, I get the error "TypeError: 'float' object is not subscriptable." I'm not quite sure yet why the atmass values for beryllium are floats while, for example, that of the chlorine atom in the Getting_Started.ipynb notebook is a dictionary.
In species.py, the change to the datafile function involving repodata.txt fixes a simple problem where an underscore was accidentally included in the filename, which doesn't have an underscore as stored on GitHub. The extra logic I added to the assignment of the nexc variable was necessary because file names in the uhf_augccpvdz dataset store the nexc variable as a single digit, not 3 digits - this is probably a problem with how the uhf_augccpvdz dataset was compiled. The same problem would occur for charge and mult, but these got passed as ellipses to the datafile function in the last cell of Promolecule_Tools.ipynb, so they didn't raise an error for that cell in the notebook (hence, I didn't make the same change for those variables). The changes I made to the fields dictionary in the load function were needed because initializing a Species object with a dictionary containing the keys I excluded raises an error: self._data = SpeciesData(**fields) isn't equipped to deal with those arguments (specifically, dataset is redundant, and dispersion_C6 isn't set up). I don't know why the dictionary obtained via unpackb(f.read(), object_hook=decode) includes a dataset key for these uhf_augccpvdz files, because the slater dataset files, for example, don't (otherwise, the first cell in Promolecule_Tools.ipynb would break).
Hopefully, my notes above make sense. The changes I made to the code itself are certainly not good ways of going about fixing things. I should also note that an error owing to infinite values is raised in Promolecule_NCI.ipynb, which I didn't have time to look at yet and may not possess the knowledge to deal with (others will probably be better able to do so). The first two notebooks seem to work now with the changes I made.
I ran into the following problems when trying to generate the sphinx documentation for atomdb with different Python versions.
with Python 3.7.12
After running:
pip install -e .[doc]
cd docs && make html
I get:
Running Sphinx v5.3.0
Extension error:
Could not import extension sphinx.builders.epub3 (exception: cannot import name 'RemovedInSphinx40Warning' from 'sphinx.deprecation' (MYPATH/miniconda3/envs/qcdevs/lib/python3.7/site-packages/sphinx/deprecation.py))
make: *** [Makefile:20: html] Error 2
with Python 3.9.13:
After running:
make html
I get:
Running Sphinx v5.3.0
Extension error:
Could not import extension sphinxcontrib.bibtex (exception: No module named 'sphinxcontrib.bibtex')
make: *** [Makefile:20: html] Error 2
For this latter case I can get the documentation to compile by commenting out the line for sphinxcontrib.bibtex in source/conf.py.
This is to keep a record of the cases that failed during compilation of the Slater database so that we can fix them later on.
FIXME:
ValueError: Both Anion & Cation Slater File for element Cs does not exist.
ValueError: Multiplicity 7 is not available for V with charge -1
We need to remove the old compiled dataset versions here to avoid confusion/errors (e.g. #70).
Updated DB versions are being stored in a separate repository, AtomDBdata, and there is no reason to keep duplicated information.
This affects the datasets that rely on methods using Gaussian basis sets: the UHF (gaussian) and HCI datasets.
Currently the doc attribute of the Species class holds the documentation of the loaded dataset. It gets assigned by calling the function get_docstring.
1- The docstrings are mere placeholders for now, and it's likely that the get_docstring function itself needs revision. Maybe instead of typing the information directly in the body of the function we should have it in a separate file(s)?
2- It would also be nice to add some functionality that appends a table with the actual properties available for the loaded species.
The get_docstring function should grab the variable atomdb.{dataset}.DOCSTRING and return that. This allows us to keep all the information about each dataset in the same file, {dataset}/__init__.py.
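The proposal above could be as small as the following sketch. The exact module path (atomdb.{dataset}) and the DOCSTRING attribute name come from the suggestion in this issue; whether they match the final package layout is an assumption.

```python
import importlib

def get_docstring(dataset):
    """Sketch: fetch the DOCSTRING defined in atomdb.{dataset}/__init__.py
    instead of hard-coding the text in this function."""
    module = importlib.import_module(f"atomdb.{dataset}")
    # Fall back to an empty string if a dataset forgot to define DOCSTRING.
    return getattr(module, "DOCSTRING", "")
```

This keeps each dataset's documentation next to its compile/run code, and adding a new dataset requires no change to get_docstring itself.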
A) The "run" functions for each dataset (esp. HCI) should be checked and made up to date, and one should be able to run it as a script on ComputeCanada. The "compile" functions should be also kept up to date, with all of the available properties computed from the raw data.
B) Finally, after the API and list of properties is finalized, and before release, all of the currently available datasets should be run and compiled, and the .msg files included in the Github repo, and in the library itself.
The documentation of some of the methods in the species module needs a bit more explanation and consistency in the format of the descriptions.
Tasks:
Revise the docstrings of load, dump, datafile, and raw_datafile; e.g. take the datafile one as a reference.

If you look at the "Actions" tab, you can see that the workflow for automatically running the test scripts upon opening a PR or pushing to master isn't quite working yet. The reason for this is the periodic.py module. Specifically, the dictionaries num2Ar and sym2Ar are referenced in __all__ but aren't defined anywhere else. This causes test_gaussian.py to throw an error, and thus the whole GitHub Actions workflow fails. @gabrielasd, maybe you'd know more about how to handle this. Is it safe to delete num2Ar and sym2Ar here? I checked in my fork, and this allows the workflow to run successfully. However, you might still want to add these dictionaries to this module, so that may not be the best course of action.
The values for atomic mass we currently have are taken from the following source: https://doi.org/10.1515/pac-2019-0603
However, in this reference the values for Tc, Pm, Po-Ac and Np-Og are missing.
Some docstrings of the attributes and methods of the Promolecule module are incomplete or have inconsistent styles.
For example, the functions that evaluate density properties on a grid: the density, gradient, hessian, laplacian, and KED functions. All of these should have Parameters and Returns sections.
Task:
Check that the docstrings of the methods on the promolecule module are complete and consistent.
There are three notebooks here, but two of them are redundant since the hello_atomdb one covers both the atomdb API and the promolecule tool.
The raw data for this database is under the directory atomdb/data as a *.tar.gz file (this one density.out.tar.gz)
The compiled data is being stored in this repository AtomDBdata, using the directory structure: {dataset}/db
Corresponding values/properties between raw datasets may be stored in different units (e.g. conceptual DFT properties here are in eV vs. a.u. when computed from the HCI/gaussian datasets). We should be consistent and convert them to an internal unit convention, as was done in the old repo.
We are using the units module from there when parsing the mass and atomic radii parameters, so the compilation scripts could also import it.
The units script should be updated to use scipy.constants:
https://docs.scipy.org/doc/scipy/reference/constants.html
(Likely this mainly affects the nist dataset.)
It's useful to have key quantities related to the separated and united atom limits. These include dispersion coefficients (C6, C8, C10, etc.) for homonuclear diatomics and (C9) triatomics. Data isn't available universally, but the Grimme D2 parameters for C6 are available in Psi4.
https://github.com/psi4/psi4/blob/2cd33eda01b7018a23739d00c1cdd51ca87faa64/psi4/src/psi4/libdisp/dispersion_defines.h#L227
Beyond these values, I can find data for several diatomics, and good benchmark data for hydrogen, but limited "universal" data. Over time, we should try to put benchmark C6, C8, C10, and C9 coefficients into the database where possible.
@gabrielasd if you can set up the fields for these quantities, then @SakshiTak can add the data as she accumulates it.
For the united-atom limit, it's helpful to have the electron density at the nucleus provided as a parameter.
I think the data should also be available in the xtb package, but I can't find it right now. It should be possible to generate the data by running xtb for all stretched diatomic molecules; that would give a (sensible) first pass at these values.
A common issue across the compiled datasets with density properties (gaussian, numeric, slater) is that the tests for the 1st and 2nd derivatives of the density fail:
AtomDB/atomdb/test/test_gaussian.py
Line 78 in 85031c5
AtomDB/atomdb/test/test_numeric.py
Line 143 in 85031c5
AtomDB/atomdb/test/test_wfn_slater.py
Line 128 in 85031c5
The wrong derivatives are computed through the interpolated (cubic) splines of the density. E.g., evaluation of the 1st derivative of the density in the Slater dataset tests:
AtomDB/atomdb/test/test_wfn_slater.py
Lines 137 to 138 in 85031c5
The reference data used for comparison comes from the raw data (usually the 1st derivatives were stored), e.g. in the case of the Slater dataset:
https://github.com/theochem/AtomDB/blob/85031c5fc08b7c920a093c84ab4f6458b5d48825/atomdb/test/test_wfn_slater.py#L139-142
or by using NumPy's gradient.
One problem may be that the log keyword argument should be set to True, although I haven't seen that fix the problem before. It may also be that the values that make the tests crash are derivatives of the density close to the atomic center.
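For reference, the log-based approach mentioned above amounts to splining log(rho) and recovering the density derivative by the chain rule, which tends to behave much better when the density decays over many orders of magnitude. This is a standalone sketch (not the AtomDB spline code); the function name is hypothetical.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def density_spline(points, rho):
    """Sketch: spline log(rho) on a radial grid and return a function that
    evaluates d(rho)/dr via the chain rule, d(rho)/dr = rho * d(log rho)/dr."""
    log_spline = CubicSpline(points, np.log(rho))

    def deriv(r):
        r = np.asarray(r, dtype=float)
        # exp(log_spline(r)) recovers rho; log_spline(r, 1) is its log-derivative.
        return np.exp(log_spline(r)) * log_spline(r, 1)

    return deriv
```

For an exponentially decaying density this is exact up to rounding, since log(rho) is then (piecewise) polynomial, whereas a cubic spline of rho itself oscillates near the nucleus.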
From the raw set of files for the gaussian database (UHF jobs), there are a few species, specifically cations of the row-4 transition metals (Z=22-30), that I am not sure how to add to the database.
It seems that for these cations the electronic configuration that was solved for corresponds to that of the isoelectronic species.
For example:
These are some known cases that fail when compiling the NIST DB because we do not have information on the electronic configuration (there is no entry for these in the MULTIPLICITIES dict):
cations:
anions:
This was originally issue 36 in the QuantumElephant repo. I'm replicating it here to keep track of things, but part of this initial request has already been implemented by @msricher in the promolecule module.
We need utility functions for creating promolecules simply. These will involve creating the proper linear combinations of atomic species and then loading them into the Promolecule.
Two simple methods for the coefficients.
Example: O atom with charge -0.4
charge 0.0 has coefficient 0.6
charge -1 has coefficient 0.4
Example: O atom with charge -1.2
charge -1 should have weight 9.2/9.0
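The first example corresponds to simple linear interpolation between the two integer charges bracketing the fractional charge, which can be sketched as below. This covers only that first scheme; the second (electron-count-weighted) scheme would need a different formula. The function name and return shape are hypothetical.

```python
import math

def charge_coeffs(charge):
    """Sketch: linear-interpolation coefficients over the two nearest
    integer charges, e.g. charge -0.4 -> {0: 0.6, -1: 0.4}."""
    lo, hi = math.floor(charge), math.ceil(charge)
    if lo == hi:
        return {lo: 1.0}
    w_hi = charge - lo  # weight of the higher (less negative) charge state
    return {lo: 1.0 - w_hi, hi: w_hi}
```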
Originally posted by @PaulWAyers in https://github.com/QuantumElephant/atomdb/issues/2#issuecomment-1119991254
We should make a new class based on the Species class called MixedSpecies, which acts as a linear combination of different charge/multiplicity states of the same element at the same coordinates in space, to ensure that properties like energy and mass remain correct when accessed through the promolecule species.
Then, we should have different utility functions for creating the promolecules:
In the actual code, when one of the datasets is imported, the corresponding __init__.py is loaded, and several of them import iodata, gbasis, etc. These are used in the compilation scripts (which are not needed by the common user of AtomDB). The problem is that, because of these imports, AtomDB cannot be used without these dependencies.
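One standard way to address this is to make the compilation-only imports optional, deferring the failure to the point where the dependency is actually needed. The dependency name below is a placeholder (standing in for iodata/gbasis), and the run signature is a sketch, not the actual AtomDB one.

```python
# Sketch: guard the heavy compile-time dependency so `import atomdb` works
# without the optional dev dependencies installed.
try:
    import nonexistent_compile_dep  # placeholder for iodata/gbasis etc.
except ImportError:
    nonexistent_compile_dep = None

def run(*args, **kwargs):
    """Dataset compilation entry point (sketch)."""
    if nonexistent_compile_dep is None:
        raise ImportError(
            "Compiling this dataset requires the optional dev dependencies."
        )
```

Ordinary users who only load compiled data never hit the raise; developers running the compile scripts get a clear message instead of an import-time crash.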
Windows builds on Github Actions fail, I think because you can't install PySCF on Windows.
Proposed solution: Remove the Windows builds from Github Actions and don't support Windows. Most people use WSL for this kind of stuff now anyway.
Let me know what you all think and I'll update the .github/ folder.
Fill out Sphinx documentation
The code to compile the Slater dataset was ported from the oldmaster branch by picking the pertinent commits.
It has the following function and class:
load_slater_wfn function
AtomicDensity class
which build a Slater wavefunction and evaluate/retrieve several properties (e.g. density functions) for a specified neutral/charged species.
However, there is an updated version of this code (with potential bug fixes) in BFit, so we should use the newest version instead.
We could add BFit as a dependency or borrow the code from there.
Changes made to the two items above may break the run compilation function that uses AtomicDensity.
This change should also fix the problem found in the tests:
AtomDB/atomdb/test/test_wfn_slater.py
Line 37 in 85031c5
We should make sure we have the atomic radii and other properties in
https://github.com/QC-Devs/gopt/blob/main/gopt/periodic/data/elements.csv
This file lists the properties in AtomDB, together with their format. For any of these properties, a "promolecular" value should be possible to define. It will generally be either a mean (perhaps a geometric mean or an l-infinity mean) or a sum of the atomic properties.
Given the atomic number, charge (default to zero), multiplicity (default to the ground state of the species with that charge), and excitation level (default to the lowest-energy state for a given charge and multiplicity). When a fractional charge/multiplicity is provided, the correct thing to do is to define the value from the convex hull. Fractional excitation is not something I know how to support, except perhaps when the fractional occupation is less than one.
It isn't really part of AtomDB per se, but when multiple entries exist for a given datum, we should default to experiment or the highest-accuracy HCI (or similar) calculation we have. That way a user doesn't necessarily have to request a specific piece of data.
Originally posted by @PaulWAyers in https://github.com/QuantumElephant/atomdb/issues/2#issuecomment-812965031
Update the AtomDB API to use a better (de)serialization method based on a Python database library, such as ZODB.
AtomDB is a database of chemical and physical properties for atomic and ionic species. It includes a Python library for submitting computations to generate database entries, accessing entries, and interpolating their properties at points in space. AtomDB currently uses MsgPack for (de)serializing database entries (instances of dataclasses), but the deserialization is slow, complicated, and uses poor Python practices. This project will involve updating the AtomDB API to use a better (de)serialization method based on a proper database library, such as ZODB, which has seamless interoperability with Python classes and objects. This is a key milestone on AtomDB release schedule.
You will update the AtomDB API to replace the MsgPack-based (de)serialization functions for database entry files with the ZODB database library. You will port the atomic/ionic species class to be a standalone class (instead of dataclass + wrapper), which will provide transparent (de)serialization with ZODB. Finally, you will port the existing AtomDB entry files to the new database, and modify the build files (pyproject.toml) so that the new database is included with user installations of AtomDB.
Required skills | Python, OOP, Linux
Preferred skills | Database experience
Project size | 175 hours, Medium
Difficulty | Medium
Michelle Richer | richer.m_at_queensu_dot_ca | @msricher |
Gabriela Sánchez-Díaz | sanchezg_at_mcmaster_dot_ca | @gabrielasd |
Farnaz Heidar-Zadeh | farnaz.heidarzadeh_at_queensu_dot_ca | @FarnazH |