FORMED_ML

Machine learning models for the FORMED database and downstream tasks, together with a cross-coupling tool.

All the raw data associated with this project can be found in the corresponding Materials Cloud record, which also includes an interactive visualization. Notably, all labels and the xyz files containing the 3D molecular structures are available there.

Installation

We provide a conda environment file, environment.yml, that installs all requirements into a conda environment called FORMED. The FORMED environment can be used to run all scripts and notebooks provided in this repository. This approach has been tested on several recent releases of Ubuntu (18-22) with Python versions 3.7-3.9, and the process should take a few minutes. We also provide a requirements.txt file for installation into a virtual environment.

Use the environment file by running conda env create -f environment.yml and activate the environment with conda activate FORMED.

Usage note

We do not provide the SLATM representations of the molecules, which are required to run the ML models, because the resulting arrays are very large. Instead, we provide generate_slatm.py scripts to produce them from the xyz files containing the 3D structures.
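As a rough illustration of the input side of this step, the sketch below reads a molecule in the standard xyz layout (atom count, comment line, then one "symbol x y z" line per atom) and returns the nuclear charges and coordinates that a representation generator typically consumes. It does not compute SLATM and is not taken from generate_slatm.py; the function name parse_xyz and the small element table are assumptions made for this example.

```python
from typing import List, Tuple

# Illustrative subset of the periodic table; extend it for other elements.
ATOMIC_NUMBERS = {
    "H": 1, "B": 5, "C": 6, "N": 7, "O": 8, "F": 9,
    "Si": 14, "P": 15, "S": 16, "Cl": 17, "Br": 35, "I": 53,
}

Coordinate = Tuple[float, float, float]


def parse_xyz(text: str) -> Tuple[List[int], List[Coordinate]]:
    """Parse the contents of an xyz file (hypothetical helper).

    Returns the nuclear charges and Cartesian coordinates, in file order.
    """
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])   # line 1: number of atoms
    atom_lines = lines[2:2 + n_atoms]    # line 2 is a free-form comment
    if len(atom_lines) != n_atoms:
        raise ValueError(
            f"expected {n_atoms} atom lines, found {len(atom_lines)}"
        )

    charges: List[int] = []
    coords: List[Coordinate] = []
    for line in atom_lines:
        symbol, x, y, z = line.split()[:4]
        charges.append(ATOMIC_NUMBERS[symbol])
        coords.append((float(x), float(y), float(z)))
    return charges, coords
```

For a file on disk, the contents can be passed in with parse_xyz(open(path).read()); an unknown element symbol raises a KeyError rather than being silently skipped.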

To re-train the models, we recommend that you re-generate the representations and labels from the raw data in the Materials Cloud record to minimize the chance of mismatching molecules and properties. Inference can be run safely after generating the representations from the xyz files of the molecules to predict.
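One way to guard against such mismatches is to reorder the labels explicitly by molecule name before training, instead of relying on two arrays happening to share the same order. The sketch below assumes, purely for illustration, that labels are available as a mapping from molecule name to property value and that the representations were generated from a known list of names; the actual layout of the record may differ, and align_labels is a hypothetical helper.

```python
from typing import Dict, List, Sequence


def align_labels(names: Sequence[str], labels: Dict[str, float]) -> List[float]:
    """Return the labels in the same order as `names` (hypothetical helper).

    `names` is the order in which the representations were generated;
    `labels` maps a molecule name to its property value.
    """
    if len(set(names)) != len(names):
        raise ValueError("duplicate molecule names in the representation order")

    missing = [name for name in names if name not in labels]
    if missing:
        # Show only the first few names to keep the message readable.
        raise KeyError(f"no label found for {len(missing)} molecule(s): {missing[:5]}")

    return [labels[name] for name in names]
```

Failing loudly on a missing or duplicated name is deliberate: a silent misalignment would still train, but on the wrong molecule-property pairs.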

Content

  1. crosscoupler contains the source code and an example of the cross-coupling tool, which can find suitable unique sp2 carbons in molecules and generate coupling products. The code is given as a Jupyter notebook. To run the notebook, you need to register the FORMED conda environment (vide supra) as a kernel by running python -m ipykernel install --user --name=FORMED. After that, you should be able to run the notebook normally by selecting the FORMED environment as the kernel. Example inputs are provided and pre-filled; the expected output is detailed in the notebook, and the runtime should be almost instantaneous.

  2. cv contains 10-fold cross-validation scripts for the XGBoost ML models, as well as the outputs of the scripts. To run them, first execute generate_slatm.py, pointing it to the xyz files (vide infra), to generate the repr.npy file containing the SLATM representations.

  3. data contains raw data as numpy arrays, as extracted from the TD-DFT computations. It also contains the script that generates the SLATM representations from the xyz files available in the Materials Cloud record and saves them as repr.npy. The same data is also available in the record. We also provide the exact definition of the SMARTS keys used for substructure search.

  4. models contains the trained XGBoost models and learning curves. To re-run, first execute generate_slatm.py, pointing it to the xyz files (vide supra), to generate the repr.npy file containing the SLATM representations.

  5. predict contains the scripts for inference using the trained models. The SLATM representations of the dimer data can be generated with the given script from the xyz files available in the Materials Cloud record. The output of the predictions is also given. To run, first execute generate_slatm.py, pointing it to the xyz files of the molecules to predict, to generate the repr.npy file containing the SLATM representations.
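For readers unfamiliar with the 10-fold cross-validation used in the cv directory, the sketch below shows how the sample indices can be split so that every molecule is held out exactly once. It is a generic illustration using only the Python standard library, not the code in the cv scripts; the function name k_fold_indices, the shuffling, and the seed are assumptions for this example, and the model training itself is left out.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, n_folds: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")

    # Shuffle once with a fixed seed so the folds are reproducible.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into n_folds groups of near-equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]

    for k, test_idx in enumerate(folds):
        # Everything outside the held-out fold is used for training.
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx
```

Each pair can then be used to slice the rows of the representation array and the label array with the same indices, which keeps molecules and properties aligned within every fold.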

Contributors

rlaplaza
