gpt3forchem's Introduction

GPT3 for molecular and materials design and discovery

Install

pip install gpt3forchem

You'll also need to install gpflow from the develop branch: pip install git+https://github.com/GPflow/GPflow.git@develop

How to use

  • the legacy directory contains code from initial exploration. The relevant parts have been migrated into notebooks.

  • the experiments directory contains code for the actual fine-tuning experiments

Before you can use it, you need to set up OpenAI API access (you might need to export your OPENAI_API_KEY).

Also, keep in mind that there are rate limits, which is why we added delays between requests (and typically also do not evaluate on the full datasets).
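A minimal sketch of both points, assuming the legacy openai Python client (< 1.0); the model name, prompt list, and delay are placeholders:

import os
import time

import openai

# the key is read from the environment, e.g. after `export OPENAI_API_KEY=...`
openai.api_key = os.environ["OPENAI_API_KEY"]


def complete_with_delay(prompts, model="ada", delay=5.0):
    """Query the completion endpoint one prompt at a time, sleeping between
    requests to stay below the rate limit."""
    completions = []
    for prompt in prompts:
        response = openai.Completion.create(model=model, prompt=prompt, max_tokens=10)
        completions.append(response["choices"][0]["text"])
        time.sleep(delay)  # crude rate limiting between consecutive requests
    return completions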

gpt3forchem's People

Contributors

pschwllr

gpt3forchem's Issues

how do we tie all threads together?

I feel we're now going a bit all over the place, and it will be hard to structure a coherent study. The initial "pitch" notes were structured around "forward classification" and "inverse design". Now we have also added regression.

In total, for the "forward" predictions we have:

  • polymers, MOFs, photovoltaics, and photoswitches as case studies. We do classification and regression for them. Where available, we also compare different representations.

For the "inverse" route we currently only have the polymers.

regression case study

Only for the polymer case study does the model seem to be able to produce numbers; we need to write utilities to produce parity plots.
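A minimal sketch of such a parity-plot utility, assuming matplotlib/numpy and arrays of true and predicted values (the function name is ours):

import matplotlib.pyplot as plt
import numpy as np


def parity_plot(y_true, y_pred, ax=None):
    """Scatter predicted vs. true values with a y = x reference line."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ax = ax or plt.gca()
    ax.scatter(y_true, y_pred, alpha=0.6)
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    ax.plot(lims, lims, "k--", linewidth=1)  # perfect-prediction line
    ax.set_xlabel("true value")
    ax.set_ylabel("predicted value")
    return ax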

add photoswitch case study

given that the labels come from TDDFT, one might also try the inverse route ...

In the paper (there is still an authorship-ordering dispute on arXiv), they focus mostly on the transition wavelength.

reaction yields

  • parse the CMLs into a more useful form
  • to start, remove the full sentence that is tagged as "yield" from the text (even information such as that an "oil" was formed might lead to issues)
  • similar to Philippe's work, split according to the scale (though this may not be needed if we leave the reagent amounts in the text; the model might figure it out itself)
  • fine-tune a GPT-3 embedding model
  • use a regressor with the rxnfps and the text embeddings (see the sketch after this list)
  • also extract the year for a time split
  • also smooth the yields
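As a rough sketch of the regressor step, assuming the reaction fingerprints and GPT-3 text embeddings have already been computed (the arrays below are random placeholders, not real data):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# placeholder data standing in for the real reaction fingerprints, text embeddings, and yields
rng = np.random.default_rng(0)
rxnfps = rng.normal(size=(500, 256))
text_embeddings = rng.normal(size=(500, 1024))
yields = rng.uniform(0, 100, size=500)

# concatenate the two feature sets and fit a simple baseline regressor
X = np.hstack([rxnfps, text_embeddings])
X_train, X_test, y_train, y_test = train_test_split(X, yields, test_size=0.2, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))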

fine-tuning hyperparameters

We've picked default hyperparameters that work well across a range of use cases. The only required parameter is the training file.

That said, tweaking the hyperparameters used for fine-tuning can often lead to a model that produces higher quality output. In particular, you may want to configure the following:

  • model: the name of the base model to fine-tune. You can select one of "ada", "babbage", "curie", or "davinci". To learn more about these models, see the [Models](https://beta.openai.com/docs/models) documentation.
  • n_epochs: defaults to 4. The number of epochs to train the model for. An epoch refers to one full cycle through the training dataset.
  • batch_size: defaults to ~0.2% of the number of examples in the training set, capped at 256. The batch size is the number of training examples used in a single forward and backward pass. In general, we've found that larger batch sizes tend to work better for larger datasets.
  • learning_rate_multiplier: defaults to 0.05, 0.1, or 0.2 depending on the final batch_size. The fine-tuning learning rate is the original learning rate used for pretraining multiplied by this multiplier. We recommend experimenting with values in the range 0.02 to 0.2 to see what produces the best results. Empirically, we've found that larger learning rates often perform better with larger batch sizes.
  • compute_classification_metrics: defaults to False. If True, for classification fine-tuning tasks, computes classification-specific metrics (accuracy, F1 score, etc.) on the validation set at the end of every epoch.
  • we tried davinci vs. ada, and davinci does not seem worth the huge extra cost (at least on the polymer forward predictions)
  • might try reducing the number of epochs (see the sketch below for how these hyperparameters are passed)
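For reference, a minimal sketch of passing these hyperparameters, assuming the legacy openai Python client (< 1.0); the training-file ID and the classification-specific values are placeholders:

import openai

# "file-XXXXXXXX" is a placeholder for the ID returned when uploading the prepared JSONL file
fine_tune = openai.FineTune.create(
    training_file="file-XXXXXXXX",
    model="ada",                          # davinci did not seem worth the extra cost here
    n_epochs=4,
    learning_rate_multiplier=0.1,
    compute_classification_metrics=True,  # only meaningful for classification runs
    classification_n_classes=5,           # placeholder; required for multiclass metrics
)
print(fine_tune["id"])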

cache the query calls

I find myself re-running this code sometimes. There is no need to pay for the same queries twice; we should just cache the results.
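A minimal sketch of a disk cache keyed on the request parameters, using only the standard library (the function and directory names are ours):

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("completion_cache")
CACHE_DIR.mkdir(exist_ok=True)


def cached_completion(create_fn, **request):
    """Return a cached response if this exact request was made before,
    otherwise call the API (via create_fn) and store the result on disk."""
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    response = create_fn(**request)
    cache_file.write_text(json.dumps(response))
    return response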

multitask learning

Go back to the MOF case study and try the multitask learning slightly more systematically.

duplicates in photoswitch data

I currently simply use .dropna(subset=["SMILES"]), but this is the most naive way of doing it. Not sure if there is one "higher quality" point we should preferentially retain.
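One possible alternative, sketched with pandas under the assumption that the dataset has some quality indicator column (here hypothetically called "uncertainty"; the real column name would depend on the dataset):

import pandas as pd


def deduplicate_photoswitches(df: pd.DataFrame, quality_col: str = "uncertainty") -> pd.DataFrame:
    """Keep one row per SMILES, preferring the lowest value of a (hypothetical) quality column."""
    df = df.dropna(subset=["SMILES"])
    return (
        df.sort_values(quality_col)                          # best measurement first
          .drop_duplicates(subset=["SMILES"], keep="first")  # keep only that one entry per SMILES
    )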

try using ensemble of fine-tuned models

The error bands we get with GPT-3 are much wider than for the baselines.
It would be nice if we could stabilize this with an ensemble. This might also give useful uncertainty estimates (a bit similar to Andrew White's recent Bayesian optimization paper).
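A minimal sketch of the aggregation step, where query_model is a hypothetical helper that returns a numeric prediction from one fine-tuned model:

import numpy as np


def ensemble_predict(model_ids, prompt, query_model):
    """Query every fine-tuned model in the ensemble for the same prompt and return the
    mean prediction together with the standard deviation as a rough uncertainty estimate."""
    predictions = np.array([query_model(model_id, prompt) for model_id in model_ids])
    return predictions.mean(), predictions.std()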

plot errorbands as envelopes?

https://stackoverflow.com/questions/34235530/how-to-get-high-and-low-envelope-of-a-signal

import numpy as np


def hl_envelopes_idx(s, dmin=1, dmax=1, split=False):
    """
    Input:
    s: 1d-array, data signal from which to extract the high and low envelopes
    dmin, dmax: int, optional, chunk sizes; use these if the input signal is very long
    split: bool, optional, if True, split the signal along its mean, which might help to generate the envelope in some cases
    Output:
    lmin, lmax: indices of the low and high envelopes of the input signal s
    """
    # indices of local minima
    lmin = (np.diff(np.sign(np.diff(s))) > 0).nonzero()[0] + 1
    # indices of local maxima
    lmax = (np.diff(np.sign(np.diff(s))) < 0).nonzero()[0] + 1

    if split:
        # s_mid is zero if s is centered around the x-axis, otherwise the mean of the signal
        s_mid = np.mean(s)
        # keep only local minima below the mean
        lmin = lmin[s[lmin] < s_mid]
        # keep only local maxima above the mean
        lmax = lmax[s[lmax] > s_mid]

    # global minimum of each dmin-sized chunk of local minima
    lmin = lmin[[i + np.argmin(s[lmin[i:i + dmin]]) for i in range(0, len(lmin), dmin)]]
    # global maximum of each dmax-sized chunk of local maxima
    lmax = lmax[[i + np.argmax(s[lmax[i:i + dmax]]) for i in range(0, len(lmax), dmax)]]

    return lmin, lmax
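A brief usage sketch, assuming matplotlib and a placeholder 1D signal, that fills the region between the two envelopes:

import matplotlib.pyplot as plt
import numpy as np

y = np.random.default_rng(0).normal(size=200).cumsum()  # placeholder signal
lmin, lmax = hl_envelopes_idx(y, dmin=5, dmax=5)

x = np.arange(len(y))
plt.plot(x, y, color="gray", alpha=0.5, label="signal")
# interpolate the sparse envelope indices back onto the full x grid before filling
low = np.interp(x, x[lmin], y[lmin])
high = np.interp(x, x[lmax], y[lmax])
plt.fill_between(x, low, high, alpha=0.3, label="envelope")
plt.legend()
plt.show()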

make the number of epochs adaptive to the training set size?

I think this is the reason for the "peaks" we sometimes see in the learning curve. We seem to be able to shift them by tuning the number of epochs we fine-tune for.

there are different options:

  • use a validation set to tune this for every training set size (expensive, as we'll need to do this multiple times due to the variance)
  • come up with some heuristic (a sketch of one option follows after this list). However, if we inform it based on the experiments we ran, that is a kind of data leakage.
  • use an ensemble of models tuned for 2, 4, 6, ... epochs
  • use deep ensembles (at a fixed number of epochs). However, since there is less randomness (always the same initialization), we probably need to (sub)sample.
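A sketch of what such a heuristic could look like, keeping the total number of gradient updates roughly constant across training set sizes (all constants here are purely illustrative assumptions, not values validated in our experiments):

def adaptive_n_epochs(n_train, target_updates=2000, batch_fraction=0.002,
                      max_batch_size=256, min_epochs=2, max_epochs=8):
    """Purely illustrative heuristic (not validated experimentally): keep the total
    number of optimizer updates roughly constant across training set sizes."""
    batch_size = max(1, min(int(batch_fraction * n_train), max_batch_size))
    updates_per_epoch = max(1, n_train // batch_size)
    n_epochs = round(target_updates / updates_per_epoch)
    return int(min(max(n_epochs, min_epochs), max_epochs))


# smaller training sets get more epochs, capped between min_epochs and max_epochs
for n in (50, 500, 5000, 50000):
    print(n, adaptive_n_epochs(n))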
