gpt3forchem's Introduction

GPT3 for molecular and materials design and discovery

Install

pip install gpt3forchem

You'll also need to install gpflow from the develop branch: pip install git+https://github.com/GPflow/GPflow.git@develop

How to use

  • the legacy directory contains code from initial exploration. The relevant parts have been migrated into notebooks.

  • the experiments directory contains code for the actual fine-tuning experiments

Before you can use it, you need to set up OpenAI API access (you might need to export your OPENAI_API_KEY).

Also, keep in mind that there are rate limits, which is why we added delays between requests (and typically also do not evaluate on the full datasets).
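A minimal sketch of both points, assuming the legacy openai Python client (< 1.0); the model name, prompt list, and delay are placeholders:

import os
import time

import openai

# the key is read from the environment, e.g. after `export OPENAI_API_KEY=...`
openai.api_key = os.environ["OPENAI_API_KEY"]


def complete_with_delay(prompts, model="ada", delay=5.0):
    """Query the completion endpoint one prompt at a time, sleeping between
    requests to stay below the rate limit."""
    completions = []
    for prompt in prompts:
        response = openai.Completion.create(model=model, prompt=prompt, max_tokens=10)
        completions.append(response["choices"][0]["text"])
        time.sleep(delay)  # crude rate limiting between consecutive requests
    return completions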

gpt3forchem's People

Contributors

pschwllr

gpt3forchem's Issues

how do we tie all threads together?

I feel we're now going a bit all over the place, and it will be hard to structure a coherent study. The initial "pitch" notes were structured around "forward classification" and "inverse design". Now we have also added regression.

In total, for the "forward" predictions we have:

  • polymers, MOFs, photovoltaics, and photoswitches as case studies. We do classification and regression for them. Where available, we also compare different representations.

For the "inverse" route we currently only have the polymers.

regression case study

Only for the polymer case study does the model seem to be able to produce numbers; we need to write utilities to produce parity plots.
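A minimal sketch of such a parity-plot utility, assuming matplotlib/numpy and arrays of true and predicted values (the function name is ours):

import matplotlib.pyplot as plt
import numpy as np


def parity_plot(y_true, y_pred, ax=None):
    """Scatter predicted vs. true values with a y = x reference line."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ax = ax or plt.gca()
    ax.scatter(y_true, y_pred, alpha=0.6)
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    ax.plot(lims, lims, "k--", linewidth=1)  # perfect-prediction line
    ax.set_xlabel("true value")
    ax.set_ylabel("predicted value")
    return ax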

add photoswitch case study

given that the labels come from TDDFT, one might also try the inverse route ...

In the paper (there is still an authorship-ordering dispute on arXiv), they focus mostly on the transition wavelength.

reaction yields

  • parse the CMLs into a more useful form
  • to start, remove the full sentence that is tagged as "yield" from the text (even information such as that an "oil" was formed might lead to issues)
  • similar to Philippe's work, split according to the scale (though this may not be needed if we leave the reagent amounts in the text; the model might figure it out itself)
  • fine-tune a GPT-3 embedding model
  • use a regressor with the rxnfps and the text embeddings (see the sketch after this list)
  • also extract the year for a time split
  • also smooth the yields
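As a rough sketch of the regressor step, assuming the reaction fingerprints and GPT-3 text embeddings have already been computed (the arrays below are random placeholders, not real data):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# placeholder data standing in for the real reaction fingerprints, text embeddings, and yields
rng = np.random.default_rng(0)
rxnfps = rng.normal(size=(500, 256))
text_embeddings = rng.normal(size=(500, 1024))
yields = rng.uniform(0, 100, size=500)

# concatenate the two feature sets and fit a simple baseline regressor
X = np.hstack([rxnfps, text_embeddings])
X_train, X_test, y_train, y_test = train_test_split(X, yields, test_size=0.2, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))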

fine-tuning hyperparameters

We've picked default hyperparameters that work well across a range of use cases. The only required parameter is the training file.

That said, tweaking the hyperparameters used for fine-tuning can often lead to a model that produces higher quality output. In particular, you may want to configure the following:

  • model: the name of the base model to fine-tune. You can select one of "ada", "babbage", "curie", or "davinci". To learn more about these models, see the [Models](https://beta.openai.com/docs/models) documentation.
  • n_epochs: defaults to 4. The number of epochs to train the model for. An epoch refers to one full cycle through the training dataset.
  • batch_size: defaults to ~0.2% of the number of examples in the training set, capped at 256. The batch size is the number of training examples used in a single forward and backward pass. In general, we've found that larger batch sizes tend to work better for larger datasets.
  • learning_rate_multiplier: defaults to 0.05, 0.1, or 0.2 depending on the final batch_size. The fine-tuning learning rate is the original learning rate used for pretraining multiplied by this multiplier. We recommend experimenting with values in the range 0.02 to 0.2 to see what produces the best results. Empirically, we've found that larger learning rates often perform better with larger batch sizes.
  • compute_classification_metrics: defaults to False. If True, for classification fine-tuning tasks, computes classification-specific metrics (accuracy, F1 score, etc.) on the validation set at the end of every epoch.
  • we tried davinci vs. ada, and davinci does not seem worth the huge extra cost (at least on the polymer forward predictions)
  • might try reducing the number of epochs (see the sketch below for how these hyperparameters are passed)
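For reference, a minimal sketch of passing these hyperparameters, assuming the legacy openai Python client (< 1.0); the training-file ID and the classification-specific values are placeholders:

import openai

# "file-XXXXXXXX" is a placeholder for the ID returned when uploading the prepared JSONL file
fine_tune = openai.FineTune.create(
    training_file="file-XXXXXXXX",
    model="ada",                          # davinci did not seem worth the extra cost here
    n_epochs=4,
    learning_rate_multiplier=0.1,
    compute_classification_metrics=True,  # only meaningful for classification runs
    classification_n_classes=5,           # placeholder; required for multiclass metrics
)
print(fine_tune["id"])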

cache the query calls

I find myself re-running this code sometimes. There is no need to pay for the same queries twice; we should just cache the results.
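A minimal sketch of a disk cache keyed on the request parameters, using only the standard library (the function and directory names are ours):

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("completion_cache")
CACHE_DIR.mkdir(exist_ok=True)


def cached_completion(create_fn, **request):
    """Return a cached response if this exact request was made before,
    otherwise call the API (via create_fn) and store the result on disk."""
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    response = create_fn(**request)
    cache_file.write_text(json.dumps(response))
    return response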

multitask learning

Go back to the MOF case study and try the multitask learning slightly more systematically.

duplicates in photoswitch data

I currently simply use .dropna(subset=["SMILES"]), but this is the most naive way of doing it. Not sure if there is one "higher quality" point we should preferentially retain.
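One possible alternative, sketched with pandas under the assumption that the dataset has some quality indicator column (here hypothetically called "uncertainty"; the real column name would depend on the dataset):

import pandas as pd


def deduplicate_photoswitches(df: pd.DataFrame, quality_col: str = "uncertainty") -> pd.DataFrame:
    """Keep one row per SMILES, preferring the lowest value of a (hypothetical) quality column."""
    df = df.dropna(subset=["SMILES"])
    return (
        df.sort_values(quality_col)                          # best measurement first
          .drop_duplicates(subset=["SMILES"], keep="first")  # keep only that one entry per SMILES
    )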

try using ensemble of fine-tuned models

The error bands we get with GPT-3 are much wider than for the baselines.
It would be nice if we could stabilize this with an ensemble. This might also give useful uncertainty estimates (a bit similar to Andrew White's recent Bayesian optimization paper).
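A minimal sketch of the aggregation step, where query_model is a hypothetical helper that returns a numeric prediction from one fine-tuned model:

import numpy as np


def ensemble_predict(model_ids, prompt, query_model):
    """Query every fine-tuned model in the ensemble for the same prompt and return the
    mean prediction together with the standard deviation as a rough uncertainty estimate."""
    predictions = np.array([query_model(model_id, prompt) for model_id in model_ids])
    return predictions.mean(), predictions.std()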

plot errorbands as envelopes?

https://stackoverflow.com/questions/34235530/how-to-get-high-and-low-envelope-of-a-signal

import numpy as np


def hl_envelopes_idx(s, dmin=1, dmax=1, split=False):
    """
    Input:
    s: 1d-array, data signal from which to extract the high and low envelopes
    dmin, dmax: int, optional, chunk sizes; use these if the input signal is very long
    split: bool, optional, if True, split the signal along its mean, which might help to generate the envelope in some cases
    Output:
    lmin, lmax: indices of the low and high envelopes of the input signal s
    """
    # indices of local minima
    lmin = (np.diff(np.sign(np.diff(s))) > 0).nonzero()[0] + 1
    # indices of local maxima
    lmax = (np.diff(np.sign(np.diff(s))) < 0).nonzero()[0] + 1

    if split:
        # s_mid is zero if s is centered around the x-axis, otherwise the mean of the signal
        s_mid = np.mean(s)
        # keep only local minima below the mean
        lmin = lmin[s[lmin] < s_mid]
        # keep only local maxima above the mean
        lmax = lmax[s[lmax] > s_mid]

    # global minimum of each dmin-sized chunk of local minima
    lmin = lmin[[i + np.argmin(s[lmin[i:i + dmin]]) for i in range(0, len(lmin), dmin)]]
    # global maximum of each dmax-sized chunk of local maxima
    lmax = lmax[[i + np.argmax(s[lmax[i:i + dmax]]) for i in range(0, len(lmax), dmax)]]

    return lmin, lmax
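A brief usage sketch, assuming matplotlib and a placeholder 1D signal, that fills the region between the two envelopes:

import matplotlib.pyplot as plt
import numpy as np

y = np.random.default_rng(0).normal(size=200).cumsum()  # placeholder signal
lmin, lmax = hl_envelopes_idx(y, dmin=5, dmax=5)

x = np.arange(len(y))
plt.plot(x, y, color="gray", alpha=0.5, label="signal")
# interpolate the sparse envelope indices back onto the full x grid before filling
low = np.interp(x, x[lmin], y[lmin])
high = np.interp(x, x[lmax], y[lmax])
plt.fill_between(x, low, high, alpha=0.3, label="envelope")
plt.legend()
plt.show()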

make the number of epochs adaptive to the training set size?

I think this is the reason for the "peaks" we sometimes see in the learning curve. We seem to be able to shift them by tuning the number of epochs we fine-tune for.

there are different options:

  • use a validation set to tune this for every training set size (expensive, as we'll need to do this multiple times due to the variance)
  • come up with some heuristic (a sketch of one option follows after this list). However, if we inform it based on the experiments we ran, that is a kind of data leakage.
  • use an ensemble of models tuned for 2, 4, 6, ... epochs
  • use deep ensembles (at a fixed number of epochs). However, since there is less randomness (always the same initialization), we probably need to (sub)sample.
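A sketch of what such a heuristic could look like, keeping the total number of gradient updates roughly constant across training set sizes (all constants here are purely illustrative assumptions, not values validated in our experiments):

def adaptive_n_epochs(n_train, target_updates=2000, batch_fraction=0.002,
                      max_batch_size=256, min_epochs=2, max_epochs=8):
    """Purely illustrative heuristic (not validated experimentally): keep the total
    number of optimizer updates roughly constant across training set sizes."""
    batch_size = max(1, min(int(batch_fraction * n_train), max_batch_size))
    updates_per_epoch = max(1, n_train // batch_size)
    n_epochs = round(target_updates / updates_per_epoch)
    return int(min(max(n_epochs, min_epochs), max_epochs))


# smaller training sets get more epochs, capped between min_epochs and max_epochs
for n in (50, 500, 5000, 50000):
    print(n, adaptive_n_epochs(n))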
