isayev / ani1_dataset

A data set of 20 million calculated off-equilibrium conformations for organic molecules

License: MIT License

Python 100.00%

machine-learning opendata chemistry cheminformatics molecular-structures molecular-modeling

ani1_dataset's Issues

equilibrium geometries

Where can I obtain the equilibrium geometries?
Does the first entry for each molecule correspond to the equilibrium structure?

Units?

What are the units of the energies for this dataset?

Classical forcefields dataset for ANI-1

Is there a dataset available of classical forcefield energies calculated for the ANI-1 structures? Alternatively, is there a way to interface with a library like LAMMPS to calculate the classical forcefields directly? Thanks for the help.

GDB 10 dataset

Would it be possible for you to provide the GDB-10 dataset for our validation work?

Publish sha256 hashes for improved user safety

Hello! Could you compute and publish the sha256 hashes for your ani-1_dataset.tar.gz file and include them in your README? This will help users to ensure that the data that they download has not been manipulated by some third party.

You can easily compute a hash using:

from hashlib import sha256


def hash_check(fname, hash_fn=sha256):
    """Read a file from disk in chunks and return its hex digest.

    Parameters
    ----------
    fname : str | Path
        Path of the file to hash.
    hash_fn : Callable[[], Hash], optional (default=hashlib.sha256)
        Constructor for a hashlib hash object.

    Examples
    --------
    Checking the sha256 hash (the result is a 64-character hex string):

    >>> from hashlib import sha256
    >>> hash_check('./text.txt', sha256)  # doctest: +SKIP
    """
    hash_fn = hash_fn()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_fn.update(chunk)
    return hash_fn.hexdigest()

Thanks!
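Once a reference digest is published, checking a download against it could look like the sketch below. This is only an illustration: the names `file_sha256` and `verify_download` are hypothetical, and the expected digest would have to come from the maintainers' README, which does not list one at the time of this request.

```python
import hashlib
import hmac


def file_sha256(fname, chunk_size=4096):
    """Return the sha256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(fname, expected_hex):
    """Return True if the file's sha256 digest matches the published one.

    expected_hex is assumed to be the hex digest copied from a trusted
    source (hypothetical here, since no digest has been published yet).
    """
    actual = file_sha256(fname)
    # Normalize case and surrounding whitespace before comparing.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

A caller would pass the path of the downloaded archive and the published digest, and refuse to unpack the archive if `verify_download` returns `False`.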

mol2 files describing molecular topology?

Thanks for making this fantastic resource available!

Is there a way you could make some description of the molecular topology (e.g. mol2 files) available? While the QM energies are clearly only dependent on the atomic positions, your initial RDKit representation likely contains a mapping from a molecular topology (which includes bonds and bond orders) that allows atom indices to be uniquely identified within the molecular topology. It would be great if this topology information could be provided as well---perhaps as a compressed multimolecule mol2 file?

parameters used for DFT calculations?

We've been trying to do some DFT calculations to replicate the energies of a subset of the conformers in the data files, and while we get close (with the same potential), we haven't been able to exactly replicate the numbers. For example, for the lowest-energy conformer of gdb11_s08-1279, the energy in the file is -345.269812 (hartree; rounded), and we obtain -345.269936 with a coarse grid, and -345.269864 with the finest grid.

It's also worth noting that we obtain different self-interaction energies than stated in the README. For example, we're off by at least 0.005 hartree for oxygen (-75.041 vs the stated -75.036).

Could you provide more information on the exact version of the software you used, as well as any parameters that might be causing the discrepancy?
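To put the size of these discrepancies in more familiar units, the differences can be converted from hartree to kcal/mol. The sketch below uses only the numbers quoted above and the standard conversion factor of about 627.5095 kcal/mol per hartree; the helper name `hartree_diff_kcal` is made up for illustration.

```python
# Standard conversion factor: 1 hartree is about 627.509474 kcal/mol.
HARTREE_TO_KCAL_PER_MOL = 627.509474


def hartree_diff_kcal(reference, computed):
    """Return (computed - reference) converted from hartree to kcal/mol."""
    return (computed - reference) * HARTREE_TO_KCAL_PER_MOL


# Energies for the lowest-energy conformer of gdb11_s08-1279, as quoted above.
file_energy = -345.269812
coarse_grid = -345.269936
finest_grid = -345.269864

# Oxygen self-interaction energy: README value vs. the recomputed value.
oxygen_readme = -75.036
oxygen_recomputed = -75.041

print("coarse grid:", hartree_diff_kcal(file_energy, coarse_grid))
print("finest grid:", hartree_diff_kcal(file_energy, finest_grid))
print("oxygen atom:", hartree_diff_kcal(oxygen_readme, oxygen_recomputed))
```

The conformer differences come out at roughly 0.08 and 0.03 kcal/mol in magnitude, while the oxygen difference is about 3.1 kcal/mol, so the atomic mismatch is the larger of the two effects.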

Potential SMILES/coordinate discrepancy

Hello,

I was trying to convert the ANI-1 dataset into Parquet format, and I ran into a potential mismatch between the coordinates and the SMILES string of at least one molecule (around 4k conformers).

I wrote a piece of sample code to try to isolate this first issue I ran into (Python 2.7.6 interpreter):

import os
import h5py
from pybel import readstring
import json
import numpy as np
import pandas as pd

ani_path = '.../ani'
shard3 = os.path.join(ani_path, 'ani_gdb_s03.h5')

with h5py.File(shard3, 'r') as f:
    data_dict = f['gdb11_s03/gdb11_s03-11']

    coords     = data_dict['coordinates']
    elements   = data_dict['species']
    energies   = data_dict['energies']
    # Join the stored SMILES pieces into a single string.
    smi        = ''.join(data_dict['smiles'])
    
    mol = readstring('smi', smi)
    # pymol_to_json is a helper that is not defined in this snippet.
    jmol = json.loads(pymol_to_json(mol))

    # Report the molecule if the SMILES atom count differs from the stored species.
    if len(jmol['atoms']) != len(elements[:]):
        print "shard: ", shard3
        print "\nmolecule: gdb11_s03/gdb11_s03-11"
        print "\nsmile: ", smi
        print "\nspecies:", elements[:]
        print "\npybel mol:", jmol
        print "\ncoordinates: ", coords.shape

with sample output:

shard:  .../ani_gdb_s03.h5
molecule: gdb11_s03/gdb11_s03-11
smile:  [H]C([H])=NN([H])[H]
species ['O' 'C' 'O' 'H' 'H']
pybel mol {u'atoms': [[1, 0], [6, 0], [1, 0], [7, 0], [7, 0], [1, 0], [1, 0]], u'bonds': [[1, 2, 1], [2, 3, 1], [2, 4, 2], [4, 5, 1], [5, 6, 1], [5, 7, 1]]}
coordinates:  (4320, 5, 3)

Only the filepath should need to be edited back in for this to run. I also wrote a different parser than the example code because I was having trouble getting the iteration to perform consistently, so maybe I introduced an unintended error there.

I will filter my Parquet files for similar mismatches and go ahead without them for now. If I have made an obvious mistake, or if this has already been identified, I'd still appreciate feedback.

Thanks!
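A dependency-free way to screen for this kind of mismatch is sketched below. It rests on a strong assumption: that every SMILES string in the dataset writes all hydrogens explicitly as `[H]` and otherwise contains only bare C, N and O atoms, as in the example above. The function names are hypothetical, and a real SMILES parser (such as the pybel call in the snippet above) remains the safer choice.

```python
import re
from collections import Counter

# Matches an explicit hydrogen "[H]" or a bare C/N/O atom (aromatic lowercase included).
# Assumption: no other bracket atoms and no implicit hydrogens occur in the input.
_ATOM_PATTERN = re.compile(r"\[H\]|[CNOcno]")


def count_atoms_explicit_h(smiles):
    """Count atoms per element in a SMILES string with explicit hydrogens."""
    tokens = _ATOM_PATTERN.findall(smiles)
    return Counter("H" if t == "[H]" else t.upper() for t in tokens)


def species_match_smiles(smiles, species):
    """Return True if the element counts from the SMILES equal those in species.

    species is assumed to be a sequence of element symbols as plain strings.
    """
    return count_atoms_explicit_h(smiles) == Counter(species)


smi = "[H]C([H])=NN([H])[H]"
species = ["O", "C", "O", "H", "H"]
print(count_atoms_explicit_h(smi))
print(species_match_smiles(smi, species))
```

For the molecule reported above, the SMILES string yields seven atoms (four H, one C, two N) while the species array holds five, so the check flags it.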

How are the "self interaction energies" computed?

These energies appear in the README file:

Self-interaction atomic energies

H = -0.500607632585
C = -37.8302333826
N = -54.5680045287
O = -75.0362229210

Are these DFT energies at wB97x/6-31g*? I am unable to reproduce them. Can you give details about how these numbers were produced?

Thank you.
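For context, one common use of per-atom reference energies like these is to subtract them from a molecule's total energy, leaving a much smaller relative energy. The sketch below shows that arithmetic with the README values, assuming they are in hartree. Whether the dataset authors used the values this way is an assumption, and the function name `subtract_self_energies` is hypothetical.

```python
# Self-interaction atomic energies as listed in the README (assumed to be in hartree).
SELF_ENERGIES = {
    "H": -0.500607632585,
    "C": -37.8302333826,
    "N": -54.5680045287,
    "O": -75.0362229210,
}


def subtract_self_energies(total_energy, species):
    """Return total_energy minus the summed atomic reference energies.

    species is a sequence of element symbols, one per atom in the molecule.
    """
    return total_energy - sum(SELF_ENERGIES[s] for s in species)
```

Reproducing the four atomic values independently would show directly how much of a total-energy mismatch comes from the reference atoms rather than from the molecular calculation.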

Data license?

While the reader code has an MIT license, the dataset itself doesn't seem to have one. Is this perhaps just an oversight? Would it be possible to have a license file (e.g., MIT) bundled with the dataset itself?
