ChemNLP project 🧪🚀

The ChemNLP project aims to

  1. create an extensive chemistry dataset and
  2. use it to train large language models (LLMs) that can leverage the data for a wide range of chemistry applications.

For more details see our information material section below.

Information material

Community

Feel free to join the #chemnlp channel on our OpenBioML Discord server to discuss the project in more detail.

Contributing

ChemNLP is an open-source project - your involvement is warmly welcome! If you're excited to join us, we recommend the following steps:

Over the past months ChemNLP has received many contributions and a lot of feedback. We appreciate every contribution from the community that helps ChemNLP thrive.

Note on the "ChemNLP" name

Our OpenBioML ChemNLP project is not affiliated with the ChemNLP library from NIST; we use "ChemNLP" as a general term to highlight our project focus. The datasets and models we create through our project will have a unique and recognizable name when we release them.

About OpenBioML.org

See https://openbioml.org, especially our approach and partners.

Installation and set-up

Create a new conda environment with Python 3.8:

conda create -n chemnlp python=3.8
conda activate chemnlp

To install the chemnlp package (and required dependencies):

pip install chemnlp

If you are developing the Python package:

pip install -e "chemnlp[dev]"  # to install development dependencies

When working on the cluster, it is important to conserve memory and reduce duplication. We therefore recommend pinning your Hugging Face cache directory to a shared folder on the cluster by running the command below (or adding it to your ~/.bashrc startup script). There is also a helper script at experiments/scripts/transfer_hf_cache.sh that transfers any existing cache from certain folders to the shared directory.

export HF_HOME="/fsx/proj-chemnlp/hf_cache"
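
For Python entry points that cannot rely on the shell environment, the same setting can be applied in code. The following is a minimal sketch, not part of chemnlp: the helper name `pin_hf_cache` is hypothetical, and it assumes the Hugging Face libraries read HF_HOME when they are first imported, so the helper has to run before those imports.

```python
import os

# Shared cache location taken from the export command above.
SHARED_HF_CACHE = "/fsx/proj-chemnlp/hf_cache"


def pin_hf_cache(path: str = SHARED_HF_CACHE) -> str:
    """Point HF_HOME at `path` unless the user has already set it.

    Hypothetical helper, not part of chemnlp. Call it before importing
    transformers or datasets, because they read HF_HOME at import time.
    """
    # setdefault keeps an existing value, e.g. one exported in ~/.bashrc.
    return os.environ.setdefault("HF_HOME", path)


if __name__ == "__main__":
    print("HF_HOME =", pin_hf_cache())
```

Using `setdefault` rather than a plain assignment means a value exported in the shell always wins, so the sketch never silently overrides a user's own cache location.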

If extra dependencies are required (e.g. for dataset creation) but are not needed for the main package, please add them to pyproject.toml under the dataset_creation variable and make sure the change is reflected in the conda.yml file.

Then, please run

pre-commit install

to install the pre-commit hooks. These will automatically format and lint your code on every commit. You might see some warnings, e.g. from flake8. If you struggle with them, do not hesitate to contact us.

Note

If you are working on model training, request access to the wandb project chemnlp and log in to wandb with your API key per here.

Adding a new dataset (to the model training pipeline)

We specify datasets by creating a new function here, named after the dataset on Hugging Face. At present the function must accept a tokenizer and return the tokenized train and validation datasets.
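
The contract described above (tokenizer in, tokenized train and validation datasets out) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the function name `my_chemistry_dataset`, the in-memory `_RAW_SPLITS` data standing in for a dataset loaded from Hugging Face, the toy tokenizer, and the use of plain lists as the return type. The actual functions in the repository may differ.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory stand-in for a dataset hosted on Hugging Face.
_RAW_SPLITS: Dict[str, List[str]] = {
    "train": ["CCO is ethanol", "C1=CC=CC=C1 is benzene"],
    "validation": ["O is water"],
}

Encoding = Dict[str, List[int]]
Tokenizer = Callable[[str], Encoding]


def my_chemistry_dataset(tokenizer: Tokenizer) -> Tuple[List[Encoding], List[Encoding]]:
    """Accept a tokenizer and return the tokenized (train, validation) splits.

    Hypothetical example; in the project the function would be named
    after the dataset it wraps.
    """
    train = [tokenizer(text) for text in _RAW_SPLITS["train"]]
    validation = [tokenizer(text) for text in _RAW_SPLITS["validation"]]
    return train, validation


def toy_tokenizer(text: str) -> Encoding:
    """Toy tokenizer for the demo: one id per whitespace token (its length)."""
    return {"input_ids": [len(token) for token in text.split()]}


if __name__ == "__main__":
    train, validation = my_chemistry_dataset(toy_tokenizer)
    print(len(train), "train examples;", len(validation), "validation example")
```

Passing the tokenizer in as an argument keeps the dataset function independent of any particular model, which is the point of the interface described above.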

Cloning submodules

In order to work on the git submodules (i.e. gpt-neox) you will need to ensure you have cloned them.

To do this at the same time as cloning ChemNLP:

# using ssh (if you have your ssh key on GitHub)
git clone --recurse-submodules git@github.com:OpenBioML/chemnlp.git

# using https (if you use a personal access token)
git clone --recurse-submodules https://github.com/OpenBioML/chemnlp.git

This will automatically initialize and update each submodule in the repository, including nested submodules if any of the submodules in the repository have submodules themselves.

If you've already cloned ChemNLP and don't have the submodules you can run:

git submodule update --init --recursive
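
To check whether that command is still needed, the sketch below lists submodules that look uninitialised. It is a heuristic under stated assumptions, not part of chemnlp: the function name `uninitialised_submodules` is hypothetical, and it treats a submodule as uninitialised when the path declared in .gitmodules is a missing or empty directory, which is what a clone without --recurse-submodules leaves behind.

```python
import configparser
from pathlib import Path
from typing import List


def uninitialised_submodules(repo_root: str) -> List[str]:
    """Return submodule paths from .gitmodules whose folders are missing or empty.

    Hypothetical helper, not part of chemnlp.
    """
    gitmodules = Path(repo_root) / ".gitmodules"
    if not gitmodules.exists():
        return []  # the repository declares no submodules

    # .gitmodules uses an INI-like syntax; disable interpolation so that
    # '%' characters in URLs are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(gitmodules)

    missing = []
    for section in parser.sections():
        path = parser.get(section, "path", fallback=None)
        if path is None:
            continue
        folder = Path(repo_root) / path
        # Heuristic: an absent or empty directory has not been checked out.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(path)
    return missing


if __name__ == "__main__":
    print(uninitialised_submodules("."))
```

An empty result suggests the submodules are already checked out; any listed path can be fetched with the git submodule update command above.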

See here for more information about contributing to submodules.

Experiments

Follow the guidelines here for more information about running experiments on the Stability AI cluster.

Contributors

bethanyconnolly, jackapbutler, kjappelbaum, micpie, phalem, adrianm0, maw501, pre-commit-ci[bot], othertea, n0w0f, mehradans92, adamoyoung, apoorvasrinivasan26, arkadiusz-czerwinski, hypnopump, ml-evs, pixelatory
