theochem / atomdb
An Extended Periodic Table of Neutral and Charged Atomic Species
Home Page: http://atomdb.qcdevs.org/
License: GNU General Public License v3.0
The GitHub Actions workflow that builds the package with Python 3.7 on Ubuntu now fails. This happened after updating the link to the IOData dependency in the .toml file (commit f637b2d).
The change only affects the developer version of AtomDB (IOData is listed under the optional dev dependencies in the toml), but it breaks the GitHub Actions workflow when tests that need these dependencies are run.
Currently we have the following raw data files under atomdb/data
slater_atom.tar.xz (137 K)
database_beta_1.3.0.h5 (12M)
c6cp04533b1.csv (135K)
HCI data is in Cedar, ~200GB uncompressed
Instead of an Atom class, make an Element class, since the information in the elements table is only defined for the neutral species of each atom.
With this change one does not need to specify the charge as input.
Additionally, the multiplicity column should be removed, since this information can be parsed from a separate table of multiplicities.
We should have a table of which property is available for each dataset, both in the code, and in the published documentation.
In the library, we should also gracefully handle error cases where the user attempts to access a property that is unavailable in the current dataset.
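The graceful handling described above could look like the following minimal sketch. The class shape and method name (`get_property`) are assumptions for illustration, not the actual AtomDB API; the point is only that a missing field should raise a clear, dataset-aware error instead of silently returning `None`.

```python
# Sketch (hypothetical names): guard property access so that a field missing
# from the loaded dataset raises a clear error instead of returning None.
class Species:
    def __init__(self, dataset, fields):
        self.dataset = dataset
        self._fields = fields  # properties compiled for this dataset

    def get_property(self, name):
        value = self._fields.get(name)
        if value is None:
            raise ValueError(
                f"Property '{name}' is not available in dataset '{self.dataset}'."
            )
        return value
```

A lookup table of which dataset provides which property (as proposed above) could then drive both this check and the documentation table.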
@gabrielasd and @msricher do we have something to control for the fact that the log of the atomic density/orbital energy/orbital densities may fail when the argument of the log is too close to zero? Something like:
https://github.com/theochem/denspart/blob/e21de342078035669832b693cb49b269870a6e02/denspart/vh.py#L434
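One possible safeguard, in the spirit of the denspart snippet linked above, is to clip the argument of the log to a small positive floor before evaluating it. This is a sketch, not the denspart implementation; the cutoff value is an assumption that would need tuning to the precision of the tabulated densities.

```python
import numpy as np

# Hypothetical cutoff; tune to the precision of the stored density data.
LOG_FLOOR = 1e-300

def safe_log(values):
    """Elementwise log with the argument clipped away from zero,
    so near-zero densities do not produce -inf or NaN."""
    values = np.asarray(values, dtype=float)
    return np.log(np.clip(values, LOG_FLOOR, None))
```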
Originally posted by @PaulWAyers in #1
This issue contributes to the completion of issue #8
To compile this database one needs to place the files:
c6cp04533b1.csv
database_beta_1.3.0.h5
(currently in atomdb/data) under a folder raw in atomdb/datasets/nist.
(Important: the raw files and their data must not be added to source control; only developers need access to these.)
More generally, to generate the data files with the tools in the api module, the program expects the following folder structure:
MYDATAPATH/DATASET/raw
where DATASET is the folder for the specific database (the nist folder here), raw is the folder containing the initial files that will be processed to create the standardized database information (what SpeciesData defines), and MYDATAPATH is some path leading to DATASET (basically the path set by the keyword argument datapath shown below).
Then the serialized data can be generated with the compile function from the API:
atomdb.compile(atnum, charge, mult, 0, database, datapath=mydatapath)
The database argument refers to the specific source of raw data, in this case the nist dataset, and the optional datapath argument sets the path to this dataset folder.
If you placed the raw data inside the atomdb package (as suggested above) there is no need to specify this argument; it will take the value defined by the environment variable DEFAULT_DATAPATH defined in the API. However, it allows specifying a custom path in which to look for the raw files.
For example, to create the MessagePack file for the neutral beryllium atom from the nist raw data (placed in the default path), do:
atnum = 4
charge = 0
mult = 1
database = "nist"
atomdb.compile(atnum, charge, mult, 0, database)
make_promolecule fails for small fractional charges. See the code example.
from atomdb import make_promolecule
import numpy as np

atnums = [7]
charges = [0.1]
atcoords = np.array([[0.0, 0.0, 0.0]])

# Build a promolecule
promol = make_promolecule(
    atnums,
    atcoords,
    charges=charges,
    units="bohr",
    dataset="gaussian",
)
The cause is in lines 620-623 of promolecules.py
# Handle default multiplicity parameters
if mults is None:
    # Force non-int charge to be integer here; will be overwritten below.
    mults = [MULTIPLICITIES[(atnum, charge)] for (atnum, charge) in zip(atnums, charges)]
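The lookup fails because MULTIPLICITIES is keyed by integer charges, while the user may pass a fractional charge such as 0.1. One possible fix, sketched below under the assumption that fractional species are treated as combinations of the two bracketing integer species (consistent with the convex-hull treatment used elsewhere in AtomDB), is to look up the multiplicities at floor and ceiling of the charge. The function name and return convention are hypothetical.

```python
import math

def default_mults(atnums, charges, multiplicities):
    """Sketch: default multiplicities that tolerate fractional charges.

    `multiplicities` is assumed to be keyed by (atnum, integer_charge)."""
    mults = []
    for atnum, charge in zip(atnums, charges):
        lo, hi = math.floor(charge), math.ceil(charge)
        if lo == hi:
            mults.append(multiplicities[(atnum, lo)])
        else:
            # One multiplicity per bracketing integer species; the caller can
            # combine them with the same weights used for the densities.
            mults.append((multiplicities[(atnum, lo)], multiplicities[(atnum, hi)]))
    return mults
```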
The function that retrieves the data from the numeric Hartree-Fock computations only requires the atomic number and number of electrons as parameters. This assumes that the atomic species is the most stable one for the given atom and charge, but there is no specification for the multiplicity, so data with an incorrect multiplicity may be retrieved.
Use the NIST data to assign the reference mult value based on the ground state for the element/number of electrons.
Highly charged species may require taking the ground state of the corresponding isoelectronic species.
Hi, everyone. I'm trying to help out with improving/finalizing the API, and find it useful to start with the Jupyter Notebooks, ensuring that these work before moving forward. I've made some fixes, but also discovered where things don't really work. I took some time this afternoon to do this, and have made a branch in my forked version of the repo to easily show where I've had to adjust the repo's code to force things to work for me. This branch is located at: https://github.com/maximilianvz/AtomDB/tree/fix_notebooks
I have three commits thus far. The first one regarding the API changes should be self-explanatory and primarily fixes improper attribute naming in the notebooks. The other two are a bit more involved, and I think they are a consequence of a lack of uniformity when compiling datasets, so I'll explain things briefly below:
In the Species class here, the atmass attribute is intended to be stored as a dictionary with two keys. However, in the Promolecule_Tools.ipynb notebook, the beryllium atoms that get loaded to form the promolecule only have floats stored in the atmass attribute. You can test this by running promol.atoms[0].atmass in the second cell. If I don't include the change to the code in this commit, I get the error "TypeError: 'float' object is not subscriptable." I'm not quite sure yet why the atmass values for beryllium are floats while, for example, that of the chlorine atom in the Getting_Started.ipynb notebook is a dictionary.
In species.py, the change to the datafile function involving repodata.txt fixes a simple problem where an underscore was accidentally included in the filename, which doesn't have an underscore as stored on GitHub. The extra logic I added to the assignment of the nexc variable was necessary because file names in the uhf_augccpvdz dataset store the nexc variable as a single digit, not 3 digits - this is probably a problem with how the uhf_augccpvdz dataset was compiled. The same problem would occur for charge and mult, but these got passed as ellipses to the datafile function in the last cell of Promolecule_Tools.ipynb, so they didn't raise an error for that cell in the notebook (hence, I didn't make the same change for those variables). The changes I made to the fields dictionary in the load function were needed because initializing a Species object with a dictionary containing the keys I excluded raises an error: self._data = SpeciesData(**fields) isn't equipped to deal with those arguments (specifically, dataset is redundant, and dispersion_C6 isn't set up). I don't know why the dictionary obtained via unpackb(f.read(), object_hook=decode) includes a dataset key for these uhf_augccpvdz files, because the slater dataset files, for example, don't (otherwise, the first cell in Promolecule_Tools.ipynb would break).
Hopefully, my notes above make sense. The changes I made to the code itself are certainly not good ways of going about fixing things. I should also note that an error owing to infinite values is raised in Promolecule_NCI.ipynb, which I didn't have time to look at yet and may not possess the knowledge to deal with (others will probably be better able to do so). The first two notebooks seem to work now with the changes I made.
I ran into the following problems when trying to generate the sphinx documentation for atomdb with different Python versions.
with Python 3.7.12
After running:
pip install -e .[doc]
cd docs && make html
I get:
Running Sphinx v5.3.0
Extension error:
Could not import extension sphinx.builders.epub3 (exception: cannot import name 'RemovedInSphinx40Warning' from 'sphinx.deprecation' (MYPATH/miniconda3/envs/qcdevs/lib/python3.7/site-packages/sphinx/deprecation.py))
make: *** [Makefile:20: html] Error 2
with Python 3.9.13:
After running:
make html
I get:
Running Sphinx v5.3.0
Extension error:
Could not import extension sphinxcontrib.bibtex (exception: No module named 'sphinxcontrib.bibtex')
make: *** [Makefile:20: html] Error 2
For this latter case I can get the documentation to compile by commenting out the line for sphinxcontrib.bibtex in source/conf.py.
This is to keep a record of the cases that failed during compilation of the Slater database so that we can fix them later on.
FIXME:
ValueError: Both Anion & Cation Slater File for element Cs does not exist.
ValueError: Multiplicity 7 is not available for V with charge -1
We need to remove the old compiled dataset versions here to avoid confusion/errors (e.g. #70).
Updated DB versions are being stored in a separate repository, AtomDBdata, and there is no reason to keep duplicated information.
This affects the datasets that rely on methods using Gaussian basis sets: the UHF (gaussian) and HCI datasets.
Currently the doc attribute of the Species class holds the documentation of the loaded dataset. It gets assigned by calling the function get_docstring.
1- The docstrings are mere placeholders for now, and it's likely that the get_docstring function itself needs revision. Maybe instead of typing the information directly in the body of the function we should have it in a separate file(s)?
2- It would also be nice to add some functionality that appends a table with the actual properties available for the loaded species.
The get_docstring function should grab the variable atomdb.{dataset}.DOCSTRING and return that. This allows us to keep all the information about each dataset in the same file, {dataset}/__init__.py.
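The proposal above could be as small as the following sketch. The exact module path (atomdb.{dataset}) and the DOCSTRING attribute name come from the suggestion in this issue; whether they match the final package layout is an assumption.

```python
import importlib

def get_docstring(dataset):
    """Sketch: fetch the DOCSTRING defined in atomdb.{dataset}/__init__.py
    instead of hard-coding the text in this function."""
    module = importlib.import_module(f"atomdb.{dataset}")
    # Fall back to an empty string if a dataset forgot to define DOCSTRING.
    return getattr(module, "DOCSTRING", "")
```

This keeps each dataset's documentation next to its compile/run code, and adding a new dataset requires no change to get_docstring itself.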
A) The "run" functions for each dataset (esp. HCI) should be checked and made up to date, and one should be able to run it as a script on ComputeCanada. The "compile" functions should be also kept up to date, with all of the available properties computed from the raw data.
B) Finally, after the API and list of properties is finalized, and before release, all of the currently available datasets should be run and compiled, and the .msg files included in the Github repo, and in the library itself.
The documentation of some of the methods in the species module needs a bit more explanation and consistency in the format of the descriptions.
Tasks:
Revise the docstrings of load, dump, datafile, and raw_datafile; e.g. take the datafile one as a reference.

If you look at the "Actions" tab, you can see that the workflow for automatically running the test scripts upon opening a PR or pushing to master isn't quite working yet. The reason for this is the periodic.py module. Specifically, the dictionaries num2Ar and sym2Ar are referenced in __all__ but aren't defined anywhere else. This causes test_gaussian.py to throw an error, and thus the whole GitHub Actions workflow fails. @gabrielasd, maybe you'd know more about how to handle this. Is it safe to delete num2Ar and sym2Ar here? I checked in my fork, and this allows the workflow to run successfully. However, you might still want to add these dictionaries to this module, so that may not be the best course of action.
The values for atomic mass we currently have are taken from the following source: https://doi.org/10.1515/pac-2019-0603
However, in this reference the values for Tc, Pm, Po-Ac and Np-Og are missing.
Some docstrings of the attributes and methods of the Promolecule module are incomplete or have inconsistent styles.
For example, the functions that evaluate density properties on a grid: the density, gradient, hessian, laplacian, and KED functions. All of these should have Parameters and Returns sections.
Task:
Check that the docstrings of the methods on the promolecule module are complete and consistent.
There are three notebooks here, but two of them are redundant since the hello_atomdb one covers both the atomdb API and the promolecule tool.
The raw data for this database is under the directory atomdb/data as a *.tar.gz file (this one density.out.tar.gz)
The compiled data is being stored in this repository AtomDBdata, using the directory structure: {dataset}/db
Corresponding values/properties between raw datasets may be stored in different units (e.g. conceptual DFT properties here are in eV vs. a.u. when computed from the HCI/gaussian datasets). We should be consistent and convert them to an internal unit convention, as was done in the old repo.
We are using the units module from there when parsing the mass and atomic radii parameters, so the compilation scripts could also import it.
The units script should be updated to use scipy.constants:
https://docs.scipy.org/doc/scipy/reference/constants.html
(Likely this mainly affects the nist dataset.)
It's useful to have key quantities related to the separated and united atom limits. These include dispersion coefficients (C6, C8, C10, etc.) for homonuclear diatomics and (C9) triatomics. Data isn't available universally, but the Grimme D2 parameters for C6 are available in Psi4.
https://github.com/psi4/psi4/blob/2cd33eda01b7018a23739d00c1cdd51ca87faa64/psi4/src/psi4/libdisp/dispersion_defines.h#L227
Beyond these values, I can find data for several diatomics, and good benchmark data for hydrogen, but limited "universal" data. Over time, we should try to put benchmark C6, C8, C10, and C9 coefficients into the database where possible.
@gabrielasd if you can set up the fields for these quantities, then @SakshiTak can add the data as she accumulates it.
For the united-atom limit, it's helpful to have the electron density at the nucleus provided as a parameter.
I think the data should also be available in the xtb package, but I can't find it right now. It should be possible to generate the data by running xtb for all stretched diatomic molecules; that would give a (sensible) first pass at these values.
A common issue across the compiled datasets with density properties (gaussian, numeric, slater) is that the tests for the 1st and 2nd derivatives of the density fail:
AtomDB/atomdb/test/test_gaussian.py
Line 78 in 85031c5
AtomDB/atomdb/test/test_numeric.py
Line 143 in 85031c5
AtomDB/atomdb/test/test_wfn_slater.py
Line 128 in 85031c5
The wrong derivatives are computed through the interpolated (cubic) splines of the density. E.g., evaluation of the 1st derivative of the density in the Slater dataset tests:
AtomDB/atomdb/test/test_wfn_slater.py
Lines 137 to 138 in 85031c5
The reference data used for comparison comes from the raw data (usually the 1st derivatives were stored), e.g. in the case of the Slater dataset:
https://github.com/theochem/AtomDB/blob/85031c5fc08b7c920a093c84ab4f6458b5d48825/atomdb/test/test_wfn_slater.py#L139-142
or by using NumPy's gradient.
One problem may be that the log keyword argument should be set to True, although I haven't seen that fix the problem before. It may also be that the values that make the tests crash are derivatives of the density close to the atomic center.
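For reference, the log-based approach mentioned above amounts to splining log(rho) and recovering the density derivative by the chain rule, which tends to behave much better when the density decays over many orders of magnitude. This is a standalone sketch (not the AtomDB spline code); the function name is hypothetical.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def density_spline(points, rho):
    """Sketch: spline log(rho) on a radial grid and return a function that
    evaluates d(rho)/dr via the chain rule, d(rho)/dr = rho * d(log rho)/dr."""
    log_spline = CubicSpline(points, np.log(rho))

    def deriv(r):
        r = np.asarray(r, dtype=float)
        # exp(log_spline(r)) recovers rho; log_spline(r, 1) is its log-derivative.
        return np.exp(log_spline(r)) * log_spline(r, 1)

    return deriv
```

For an exponentially decaying density this is exact up to rounding, since log(rho) is then (piecewise) polynomial, whereas a cubic spline of rho itself oscillates near the nucleus.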
From the raw set of files for the gaussian database (UHF jobs), there are a few species, specifically cations of the row-4 transition metals (Z=22-30), that I am not sure how to add to the database.
It seems that for these cations the electronic configuration that was solved for corresponds to that of the isoelectronic species.
For example:
These are some known cases that fail when compiling the NIST DB because we do not have information on the electronic configuration (there is no entry for these in the MULTIPLICITIES dict):
cations:
anions:
This was originally issue 36 in the QuantumElephant repo. I'm replicating it here to keep track of things, but part of this initial request has already been implemented by @msricher in the promolecule module.
We need utility functions for creating promolecules simply. These will involve creating the proper linear combinations of atomic species and then loading them into the Promolecule.
Two simple methods for the coefficients.
Example: O atom with charge -0.4
charge 0.0 has coefficient 0.6
charge -1 has coefficient 0.4
Example: O atom with charge -1.2
charge -1 should have weight 9.2/9.0
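The first example corresponds to simple linear interpolation between the two integer charges bracketing the fractional charge, which can be sketched as below. This covers only that first scheme; the second (electron-count-weighted) scheme would need a different formula. The function name and return shape are hypothetical.

```python
import math

def charge_coeffs(charge):
    """Sketch: linear-interpolation coefficients over the two nearest
    integer charges, e.g. charge -0.4 -> {0: 0.6, -1: 0.4}."""
    lo, hi = math.floor(charge), math.ceil(charge)
    if lo == hi:
        return {lo: 1.0}
    w_hi = charge - lo  # weight of the higher (less negative) charge state
    return {lo: 1.0 - w_hi, hi: w_hi}
```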
Originally posted by @PaulWAyers in https://github.com/QuantumElephant/atomdb/issues/2#issuecomment-1119991254
We should make a new class based on the Species class called MixedSpecies, which acts as a linear combination of different charge/multiplicity states of the same element at the same coordinates in space, to ensure that properties like energy and mass remain correct when accessed through the promolecule species.
Then, we should have different utility functions for creating the promolecules:
In the actual code, when one of the datasets is imported, the corresponding __init__.py is loaded, and several of them import iodata, gbasis, etc. These are used in the compilation scripts (which are not needed by the common user of AtomDB). The problem is that, because of these imports, AtomDB cannot be used without these dependencies.
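One standard way to address this is to make the compilation-only imports optional, deferring the failure to the point where the dependency is actually needed. The dependency name below is a placeholder (standing in for iodata/gbasis), and the run signature is a sketch, not the actual AtomDB one.

```python
# Sketch: guard the heavy compile-time dependency so `import atomdb` works
# without the optional dev dependencies installed.
try:
    import nonexistent_compile_dep  # placeholder for iodata/gbasis etc.
except ImportError:
    nonexistent_compile_dep = None

def run(*args, **kwargs):
    """Dataset compilation entry point (sketch)."""
    if nonexistent_compile_dep is None:
        raise ImportError(
            "Compiling this dataset requires the optional dev dependencies."
        )
```

Ordinary users who only load compiled data never hit the raise; developers running the compile scripts get a clear message instead of an import-time crash.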
Windows builds on Github Actions fail, I think because you can't install PySCF on Windows.
Proposed solution: Remove the Windows builds from Github Actions and don't support Windows. Most people use WSL for this kind of stuff now anyway.
Let me know what you all think and I'll update the .github/ folder.
Fill out Sphinx documentation
The code to compile the Slater dataset was ported from the oldmaster branch by picking the pertinent commits.
It has the following function and class:
load_slater_wfn function
AtomicDensity class
which build a Slater wavefunction and evaluate/retrieve several properties (e.g. density functions) for a specified neutral/charged species.
However, there is an updated version of this code (with potential bug fixes) in BFit, so we should use the newest version instead.
We could add BFit as a dependency or borrow the code from there.
Changes made to the two items above may break the run compilation function that uses AtomicDensity.
This change should also fix the problem found in the tests:
AtomDB/atomdb/test/test_wfn_slater.py
Line 37 in 85031c5
We should make sure we have the atomic radii and other properties in
https://github.com/QC-Devs/gopt/blob/main/gopt/periodic/data/elements.csv
This file lists the properties in AtomDB, together with their format. For any of these properties, a "promolecular" value should be possible to define. It will generally be either a mean (perhaps a geometric mean or an l-infinity mean) or a sum of the atomic properties.
Given the atomic number, charge (default to zero), multiplicity (default to the ground state of the species with that charge), and excitation level (default to the lowest-energy state for a given charge and multiplicity). When a fractional charge/multiplicity is provided, the correct thing to do is to define the value from the convex hull. Fractional excitation is not something I know how to support, except perhaps when the fractional occupation is less than one.
It isn't really part of AtomDB per se, but when multiple entries exist for a given datum, we should default to experiment or the highest-accuracy HCI (or similar) calculation we have. That way a user doesn't necessarily have to request a specific piece of data.
Originally posted by @PaulWAyers in https://github.com/QuantumElephant/atomdb/issues/2#issuecomment-812965031
Update the AtomDB API to use a better (de)serialization method based on a Python database library, such as ZODB.
AtomDB is a database of chemical and physical properties for atomic and ionic species. It includes a Python library for submitting computations to generate database entries, accessing entries, and interpolating their properties at points in space. AtomDB currently uses MsgPack for (de)serializing database entries (instances of dataclasses), but the deserialization is slow, complicated, and uses poor Python practices. This project will involve updating the AtomDB API to use a better (de)serialization method based on a proper database library, such as ZODB, which has seamless interoperability with Python classes and objects. This is a key milestone on AtomDB release schedule.
You will update the AtomDB API to replace the MsgPack-based (de)serialization functions for database entry files with the ZODB database library. You will port the atomic/ionic species class to be a standalone class (instead of dataclass + wrapper), which will provide transparent (de)serialization with ZODB. Finally, you will port the existing AtomDB entry files to the new database, and modify the build files (pyproject.toml) so that the new database is included with user installations of AtomDB.
Required skills | Python, OOP, Linux
Preferred skills | Database experience
Project size | 175 hours, Medium
Difficulty | Medium
Michelle Richer | richer.m_at_queensu_dot_ca | @msricher |
Gabriela Sánchez-Díaz | sanchezg_at_mcmaster_dot_ca | @gabrielasd |
Farnaz Heidar-Zadeh | farnaz.heidarzadeh_at_queensu_dot_ca | @FarnazH |