isayev / ani1_dataset
A data set of 20 million calculated off-equilibrium conformations for organic molecules
License: MIT License
Where can I obtain the equilibrium geometries?
Does the first entry for each molecule correspond to the equilibrium structure?
What are the units of the energies for this dataset?
Is there a dataset available for classical force fields calculated on ANI? Alternatively, is there a way to interface with a library like LAMMPS to calculate the classical force fields directly? Thanks for the help.
Is it possible for you guys to provide the gdb-10 dataset for our validation work?
Hello! Could you compute and publish the sha256 hash of your ani-1_dataset.tar.gz file and include it in your README? This will help users verify that the data they download has not been manipulated by a third party.
You can easily compute a hash using:
from hashlib import sha256

def hash_check(fname, hash_fn=sha256):
    """Reads in data from disk and returns hash

    Parameters
    ----------
    fname : str | Path
    hash_fn : Callable[[], Hash], optional (default=hashlib.sha256)

    Examples
    --------
    Checking sha256 hash..
    >>> from hashlib import sha256
    >>> hash_check('./text.txt', sha256)
    'a4337bc45a8fc544c03f52dc550cd6e1e87021bc896588bd79e901e2'
    """
    hash_fn = hash_fn()
    # Read the file in 4 KiB chunks so large archives are never held in memory at once
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_fn.update(chunk)
    return hash_fn.hexdigest()
Thanks!
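Once a digest is published, checking a download against it could look like the sketch below. The names `file_sha256` and `matches_published_hash` are illustrative and not part of the repository, and no official hash for the archive is assumed here.

```python
import hashlib
import hmac


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare a file's SHA-256 digest with a published hex string.

    Whitespace and letter case in the published string are normalized first.
    """
    actual = file_sha256(path)
    expected = published_hex.strip().lower()
    # A plain == comparison would also work; compare_digest is used out of habit.
    return hmac.compare_digest(actual, expected)
```

A caller would pass the path of the downloaded archive and the hex string copied from wherever the maintainers publish it.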
Thanks for making this fantastic resource available!
Is there a way you could make some description of the molecular topology (e.g. mol2 files) available? While the QM energies are clearly only dependent on the atomic positions, your initial RDKit representation likely contains a mapping from a molecular topology (which includes bonds and bond orders) that allows atom indices to be uniquely identified within the molecular topology. It would be great if this topology information could be provided as well---perhaps as a compressed multimolecule mol2 file?
We've been trying to do some DFT calculations to replicate the energies of a subset of the conformers in the data files, and while we get close (with the same potential), we haven't been able to exactly replicate the numbers. For example, for the lowest-energy conformer of gdb11_s08-1279, the energy in the file is -345.269812 (hartree; rounded), and we obtain -345.269936 with a coarse grid, and -345.269864 with the finest grid.
It's also worth noting that we obtain different self-interaction energies than stated in the README. For example, we're off by at least 0.005 hartree for oxygen (-75.041 vs the stated -75.036).
Could you provide more information on the exact version of the software you used, as well as any parameters that might be causing the discrepancy?
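To put the size of these discrepancies in more familiar units, the short sketch below converts the differences quoted above from hartree to kcal/mol. The conversion factor is the standard 1 hartree ≈ 627.5095 kcal/mol; the function name is only illustrative.

```python
HARTREE_TO_KCAL_PER_MOL = 627.5095  # standard conversion factor


def hartree_diff_in_kcal(e_a, e_b):
    """Absolute difference between two energies in hartree, in kcal/mol."""
    return abs(e_a - e_b) * HARTREE_TO_KCAL_PER_MOL


# Values quoted in the post above (hartree)
reported = -345.269812      # energy stored in the data file
coarse_grid = -345.269936   # recomputed, coarse grid
fine_grid = -345.269864     # recomputed, finest grid

print(hartree_diff_in_kcal(reported, coarse_grid))
print(hartree_diff_in_kcal(reported, fine_grid))
# Oxygen self-interaction energy: recomputed vs. README value
print(hartree_diff_in_kcal(-75.041, -75.036))
```

The conformer differences come out below 0.1 kcal/mol, while the oxygen atomic energy differs by roughly 3 kcal/mol.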
Hello,
I was trying to convert the ANI-1 dataset into a parquet format, and I ran into a potential mismatch between the coordinates and smiles string of at least one molecule (around 4k conformers).
I wrote a piece of sample code to try to isolate this first issue I ran into (Python 2.7.6 interpreter):
import os
import h5py
from pybel import readstring
import json
import numpy as np
import pandas as pd

ani_path = '.../ani'
shard3 = os.path.join(ani_path, 'ani_gdb_s03.h5')

with h5py.File(shard3, 'r') as f:
    data_dict = f['gdb11_s03/gdb11_s03-11']
    coords = data_dict['coordinates']
    elements = data_dict['species']
    energies = data_dict['energies']
    smi = ''.join(data_dict['smiles'])
    mol = readstring('smi', smi)
    # pymol_to_json is a helper of mine that is not shown here
    jmol = json.loads(pymol_to_json(mol))
    # Report the molecule if the SMILES atom count differs from the species array
    if len(jmol['atoms']) != len(elements[:]):
        print "shard: ", shard3
        print "\nmolecule: gdb11_s03/gdb11_s03-11"
        print "\nsmile: ", smi
        print "\nspecies:", elements[:]
        print "\npybel mol:", jmol
        print "\ncoordinates: ", coords.shape
with sample output:
shard: .../ani_gdb_s03.h5
molecule: gdb11_s03/gdb11_s03-11
smile: [H]C([H])=NN([H])[H]
species ['O' 'C' 'O' 'H' 'H']
pybel mol {u'atoms': [[1, 0], [6, 0], [1, 0], [7, 0], [7, 0], [1, 0], [1, 0]], u'bonds': [[1, 2, 1], [2, 3, 1], [2, 4, 2], [4, 5, 1], [5, 6, 1], [5, 7, 1]]}
coordinates: (4320, 5, 3)
Only the filepath should need to be edited back in for this to run. I also wrote a different parser than the example code because I was having trouble getting the iteration to perform consistently, so maybe I introduced an unintended error there.
I will filter my parquet files for similar mismatches and go ahead without them for now. If I have made an obvious mistake, or if this has already been identified, I'd still appreciate feedback.
Thanks!
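A check of this kind can also be done without a cheminformatics toolkit. The sketch below counts atom tokens directly in the SMILES string and compares the count with the length of the species array. It assumes that hydrogens are written explicitly as `[H]`, as in the sample output above, and that only H, C, N and O occur; it is not a general SMILES parser, and the function names are hypothetical.

```python
import re

# One atom token: a bracket atom such as [H], or a bare C/N/O
# (upper case aliphatic, lower case aromatic).
# Assumption: explicit hydrogens, elements limited to H, C, N, O.
_ATOM_TOKEN = re.compile(r"\[[^\]]+\]|[CNO]|[cno]")


def count_smiles_atoms(smiles):
    """Count atom tokens in a SMILES string with explicit hydrogens."""
    return len(_ATOM_TOKEN.findall(smiles))


def species_match_smiles(smiles, species):
    """True if the SMILES atom count equals the number of species entries."""
    return count_smiles_atoms(smiles) == len(species)


# Values taken from the sample output above
smiles = "[H]C([H])=NN([H])[H]"
species = ["O", "C", "O", "H", "H"]
print(count_smiles_atoms(smiles), len(species))
print(species_match_smiles(smiles, species))
```

For the record above, the SMILES string yields seven atoms against five species entries, which matches the mismatch reported by the pybel-based check.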
These energies appear in the README file:
Self-interaction atomic energies
H = -0.500607632585 C = -37.8302333826 N = -54.5680045287 O = -75.0362229210
Are these DFT energies computed at the ωB97X/6-31G* level? I am unable to reproduce them. Can you give details about how these numbers were produced?
Thank you.
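For context, one common use of per-atom reference energies like these is to subtract their sum from a molecule's total energy. The sketch below does that with the README values quoted above. How those values were produced is exactly what this question asks, so the code only illustrates the arithmetic; the function names are hypothetical.

```python
# Per-atom energies as quoted from the README (assumed to be in hartree)
SELF_ENERGIES = {
    "H": -0.500607632585,
    "C": -37.8302333826,
    "N": -54.5680045287,
    "O": -75.0362229210,
}


def sum_self_energies(species):
    """Sum the per-atom reference energies for a list of element symbols."""
    return sum(SELF_ENERGIES[s] for s in species)


def energy_minus_atoms(total_energy, species):
    """Total energy with the per-atom reference energies subtracted."""
    return total_energy - sum_self_energies(species)


# Reference sum for a species list like the one in the dataset, e.g. O, C, O, H, H
print(sum_self_energies(["O", "C", "O", "H", "H"]))
```

A different set of atomic reference energies shifts every result of `energy_minus_atoms` by a constant per composition, which is why reproducing the stated values matters.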
While the reader code has an MIT license, the dataset itself doesn't seem to have one. Is this perhaps just an oversight? Would it be possible to have a license file (e.g., MIT) bundled with the dataset itself?