
ani1_dataset's People

Contributors

collvey, isayev, jussmith01

ani1_dataset's Issues

Equilibrium geometries

Where can I obtain the equilibrium geometries?
Does the first entry for each molecule correspond to the equilibrium structure?

Units?

What are the units of the energies for this dataset?

Classical force-field dataset for ANI-1

Is there a dataset available with classical force-field results calculated for ANI? Alternatively, is there a way to interface with a library like LAMMPS to calculate the classical force fields directly? Thanks for the help.

GDB-10 dataset

Would it be possible for you to provide the GDB-10 dataset for our validation work?

Publish sha256 hashes for improved user safety

Hello! Could you compute and publish the sha256 hashes for your ani-1_dataset.tar.gz file and include them in your README? This will help users to ensure that the data that they download has not been manipulated by some third party.

You can easily compute a hash using:

from hashlib import sha256


def hash_check(fname, hash_fn=sha256):
    """Read a file from disk in chunks and return its hex digest.

    Parameters
    ----------
    fname : str | Path
        Path of the file to hash.
    hash_fn : Callable[[], Hash], optional (default=hashlib.sha256)
        Constructor for the hash object, e.g. ``hashlib.sha256``.

    Examples
    --------
    Checking the sha256 hash of a file:

    >>> from hashlib import sha256
    >>> digest = hash_check('./text.txt', sha256)
    >>> len(digest)  # a sha256 hex digest is always 64 characters
    64
    """
    hasher = hash_fn()  # avoid rebinding the constructor argument
    with open(fname, "rb") as f:
        # Read 4096-byte chunks until read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

Thanks!
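Once a hash is published, checking a download against it could look like the following. This is a minimal sketch using only the standard library; the function names `sha256_of_file` and `matches_published_hash` are hypothetical, and the published digest would have to come from the README.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_hash(path, published_hex):
    """Return True if the file's digest equals the published hex digest.

    The published value is stripped and lower-cased first, so that
    whitespace or upper-case hex copied from a README does not matter.
    """
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, published_hex.strip().lower())
```

A caller would pass the path of the downloaded archive and the digest string copied from the README, and refuse to unpack the archive if the function returns False.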

mol2 files describing molecular topology?

Thanks for making this fantastic resource available!

Is there a way you could make some description of the molecular topology (e.g. mol2 files) available? While the QM energies are clearly only dependent on the atomic positions, your initial RDKit representation likely contains a mapping from a molecular topology (which includes bonds and bond orders) that allows atom indices to be uniquely identified within the molecular topology. It would be great if this topology information could be provided as well---perhaps as a compressed multimolecule mol2 file?

parameters used for DFT calculations?

We've been trying to do some DFT calculations to replicate the energies of a subset of the conformers in the data files, and while we get close (with the same potential), we haven't been able to exactly replicate the numbers. For example, for the lowest-energy conformer of gdb11_s08-1279, the energy in the file is -345.269812 (hartree; rounded), and we obtain -345.269936 with a coarse grid, and -345.269864 with the finest grid.

It's also worth noting that we obtain different self-interaction energies than stated in the README. For example, we're off by at least 0.005 hartree for oxygen (-75.041 vs the stated -75.036).

Could you provide more information on the exact version of the software you used, as well as any parameters that might be causing the discrepancy?
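To put the size of the discrepancy in more familiar units, the three energies quoted above can be converted from hartree to kcal/mol. This is a small arithmetic sketch; the conversion factor is the commonly used approximate value, and the variable names are illustrative only.

```python
HARTREE_TO_KCAL_PER_MOL = 627.5095  # approximate conversion factor

file_energy = -345.269812   # value in the data file (hartree, rounded)
coarse_grid = -345.269936   # our result with a coarse grid
finest_grid = -345.269864   # our result with the finest grid


def difference_kcal_per_mol(computed, reference):
    """Return (computed - reference) converted from hartree to kcal/mol."""
    return (computed - reference) * HARTREE_TO_KCAL_PER_MOL


coarse_diff = difference_kcal_per_mol(coarse_grid, file_energy)
finest_diff = difference_kcal_per_mol(finest_grid, file_energy)
print("coarse grid: %.3f kcal/mol" % coarse_diff)
print("finest grid: %.3f kcal/mol" % finest_diff)
```

Both differences come out below 0.1 kcal/mol in magnitude (roughly 0.08 and 0.03), so the mismatch is small but systematic.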

Potential SMILES/coordinate discrepancy

Hello,

I was trying to convert the ANI-1 dataset into Parquet format, and I ran into a potential mismatch between the coordinates and the SMILES string of at least one molecule (around 4k conformers).

I wrote a piece of sample code to try to isolate this first issue I ran into (Python 2.7.6 interpreter):

import os
import json

import h5py
from pybel import readstring

ani_path = '.../ani'
shard3 = os.path.join(ani_path, 'ani_gdb_s03.h5')

with h5py.File(shard3, 'r') as f:
    data_dict = f['gdb11_s03/gdb11_s03-11']

    coords     = data_dict['coordinates']
    elements   = data_dict['species']
    energies   = data_dict['energies']
    smi        = ''.join(data_dict['smiles'])

    mol = readstring('smi', smi)
    # pymol_to_json is a helper defined elsewhere (not shown here).
    jmol = json.loads(pymol_to_json(mol))

    # Report the molecule if the SMILES atom count differs from the species array.
    if len(jmol['atoms']) != len(elements[:]):
        print "shard: ", shard3
        print "\nmolecule: gdb11_s03/gdb11_s03-11"
        print "\nsmile: ", smi
        print "\nspecies:", elements[:]
        print "\npybel mol:", jmol
        print "\ncoordinates: ", coords.shape

with sample output:

shard:  .../ani_gdb_s03.h5
molecule: gdb11_s03/gdb11_s03-11
smile:  [H]C([H])=NN([H])[H]
species ['O' 'C' 'O' 'H' 'H']
pybel mol {u'atoms': [[1, 0], [6, 0], [1, 0], [7, 0], [7, 0], [1, 0], [1, 0]], u'bonds': [[1, 2, 1], [2, 3, 1], [2, 4, 2], [4, 5, 1], [5, 6, 1], [5, 7, 1]]}
coordinates:  (4320, 5, 3)

Only the filepath should need to be edited back in for this to run. I also wrote a different parser than the example code because I was having trouble getting the iteration to perform consistently, so maybe I introduced an unintended error there.

I will filter my Parquet files for similar mismatches and go ahead without them for now. If I have made an obvious mistake, or if this has already been identified, I'd still appreciate feedback.

Thanks!
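A dependency-free version of the mismatch filter could look like the sketch below. It assumes SMILES in the style shown above, with every hydrogen written explicitly as `[H]` and only H, C, N and O present; it is not a general SMILES parser, it ignores hydrogens written inside a bracket atom such as `[NH2]`, and the function names are hypothetical.

```python
import re
from collections import Counter

# Either a bracket atom such as [H] (element symbol captured in group 1)
# or a bare C, N or O atom, upper or lower case (group 2).
_ATOM = re.compile(r"\[([A-Z][a-z]?)[^\]]*\]|([CNOcno])")


def smiles_elements(smiles):
    """Count element symbols in a SMILES string with explicit hydrogens."""
    counts = Counter()
    for match in _ATOM.finditer(smiles):
        symbol = match.group(1) or match.group(2).upper()
        counts[symbol] += 1
    return counts


def is_consistent(smiles, species):
    """Return True if the SMILES and the species list hold the same atoms."""
    return smiles_elements(smiles) == Counter(species)


# The entry reported above: 7 atoms in the SMILES, 5 in the species array.
print(is_consistent("[H]C([H])=NN([H])[H]", ["O", "C", "O", "H", "H"]))
```

Comparing element counts rather than only the number of atoms also catches entries where the totals agree by coincidence but the elements differ.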

How are the "self interaction energies" computed?

These energies appear in the README file:

Self-interaction atomic energies

H = -0.500607632585
C = -37.8302333826
N = -54.5680045287
O = -75.0362229210

Are these DFT energies at wB97x/6-31g*? I am unable to reproduce them. Can you give details about how these numbers were produced?

Thank you.
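For context, here is how I would expect such per-atom values to be used: summed over a molecule's species and subtracted from its total energy. This is only a sketch under that assumption (and the assumption that the values are in hartree); the README does not confirm it, and the function names are hypothetical.

```python
# Values copied from the README (assumed to be in hartree).
SELF_ENERGIES = {
    "H": -0.500607632585,
    "C": -37.8302333826,
    "N": -54.5680045287,
    "O": -75.0362229210,
}


def atomic_reference_energy(species):
    """Sum the per-atom self energies for a list of element symbols."""
    return sum(SELF_ENERGIES[symbol] for symbol in species)


def relative_energy(total_energy, species):
    """Subtract the summed self energies from a molecule's total energy."""
    return total_energy - atomic_reference_energy(species)


# Reference energy for a water-like species list: one O and two H.
print(atomic_reference_energy(["O", "H", "H"]))
```

If the self energies were instead obtained by a fit rather than from isolated-atom calculations, that would explain why they cannot be reproduced directly with DFT.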

Data license?

While the reader code has an MIT license, the dataset itself doesn't seem to have one. Is this perhaps just an oversight? Would it be possible to have a license file (e.g., MIT) bundled with the dataset itself?
