tijeco / berteome

A library to analyze and explore protein sequences using BERT models

License: Apache License 2.0

Jupyter Notebook 99.07% Python 0.90% Makefile 0.03%

berteome's Issues

Make dataframe object similar to seqlogo Ppm

I don't know whether seqlogo's Ppm is a generalized class, but it effectively achieves what I originally intended: a general-purpose class/object with properties (such as an alphabet) that otherwise behaves like a pandas dataframe. Then I could attach functions to it (such as calculate_nEffective from #8) that add columns to the object itself (ideally only if they don't already exist, with an option to overwrite). This would make #10 much easier to deal with too!
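
A minimal sketch of that idea, assuming a pandas DataFrame subclass is acceptable. The names PredictionFrame and add_column and the default alphabet are hypothetical and not part of berteome or seqlogo; _metadata and _constructor are the hooks pandas provides for subclassing.

```python
import pandas as pd

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed default alphabet


class PredictionFrame(pd.DataFrame):
    """A DataFrame that also carries an alphabet (hypothetical class)."""

    # Attributes listed here are carried over to frames derived from this one.
    _metadata = ["alphabet"]

    def __init__(self, *args, alphabet=AMINO_ACIDS, **kwargs):
        super().__init__(*args, **kwargs)
        self.alphabet = alphabet

    @property
    def _constructor(self):
        # Make pandas operations return PredictionFrame instead of DataFrame.
        return PredictionFrame

    def add_column(self, name, values, overwrite=False):
        """Add a column only if it is missing, unless overwrite is requested."""
        if name in self.columns and not overwrite:
            return self
        self[name] = values
        return self
```

With this shape, something like calculate_nEffective could call add_column("nEffective", values) and leave an existing column alone unless overwrite=True is passed.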

ESM hugging face

Previously, there wasn't a Hugging Face implementation of ESM, so I had to do some extra engineering to get it to behave the same way as protBERT, which is available through Hugging Face. It seems that ESM is now available through Hugging Face too, so I think it makes perfect sense to regroup and update the ESM implementation to use the Hugging Face API.

Add support for ESM

It seems like ESM could be supported in the same way as protBert is from Hugging Face.

berteome.unmasker() would have to be updated, since it borrows from Hugging Face and would need to be tweaked to give output from ESM.

Also, ESM handles masks differently (<mask> instead of [MASK]), and I don't think spaces matter as much either, so it would be a bit different. Anyway, I think it would be interesting to see how the two models compare.

models download upon loading of library

I'm not sure if I like it doing this or not. I think it would be best to download the models as they are being used, because right now if I want to load the library without using the models, I still have to wait for them to download!
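
One way to defer the download is to load on first use and cache the result. This is only a sketch: get_model and _download_and_load are hypothetical names, and the loader is a stand-in that does no real download.

```python
from functools import lru_cache

load_calls = []  # records which models were actually loaded


def _download_and_load(name):
    """Stand-in for an expensive download; a real loader would fetch weights here."""
    load_calls.append(name)
    return {"name": name}


@lru_cache(maxsize=None)
def get_model(name):
    """Load a model the first time it is requested and reuse it afterwards."""
    return _download_and_load(name)
```

Importing a module written this way costs nothing; the download only happens when get_model("prot_bert") is first called.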

reorganization strategy

So I think I would like each model to be made available as its own module, such as BERT.py and ESM.py. That way they can both have their own separate classes that run the model on a given sequence and return a pandas dataframe.

A third module, something like model.py, can be the glue between these two and hold whatever functionality is shared between the two models. I don't quite know what that would be, but I know that if I were to write a class for ESM and BERT to return a dataframe, there will likely be some redundancy, so ideally model.py would handle that redundancy.
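
A rough sketch of what the shared piece could look like, assuming the common part is the loop that masks each position in turn. MaskedModel, predict_masked, predict_all and the row keys are hypothetical names; UniformModel is a toy stand-in for a real BERT or ESM wrapper.

```python
from abc import ABC, abstractmethod

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


class MaskedModel(ABC):
    """Shared logic; each model module would subclass this (hypothetical)."""

    mask_token = "<mask>"

    @abstractmethod
    def predict_masked(self, tokens):
        """Return {residue: probability} for the single masked token."""

    def predict_all(self, seq):
        """Mask every position once and collect one row of scores per position."""
        rows = []
        for pos, wildtype in enumerate(seq):
            tokens = list(seq)
            tokens[pos] = self.mask_token
            scores = self.predict_masked(tokens)
            rows.append({"position": pos + 1, "wildtype": wildtype, **scores})
        return rows


class UniformModel(MaskedModel):
    """Toy model that scores every residue equally; stands in for BERT or ESM."""

    def predict_masked(self, tokens):
        return {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
```

The list of row dictionaries returned by predict_all can be handed straight to pandas.DataFrame, so only predict_masked has to differ between models.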

administrative tidbits

So things are moving along quite smoothly, and there are certainly lots of fun features I'd like to work on! But there is a bit of housekeeping that I think needs to be done (not necessarily before the fun stuff, but it still needs to be done).

Definitely not exactly fun. Maybe can happen alongside the fun stuff??

I think these types of things are the main things that need to be done to polish everything off, get the library into its most usable / stable form, and get it ready to submit to JOSS!

load model class

I think the load_model function needs to be wrapped up in some sort of class that is initialized with a dictionary / list of all the supported models. That way users can use this class to see what models are supported and then use that to load the model of interest.
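
A minimal sketch of such a class, assuming each supported model can be described by a name and a zero-argument loader. ModelRegistry and its methods are hypothetical, not the existing load_model API.

```python
class ModelRegistry:
    """Keeps track of supported models and loads them by name (hypothetical)."""

    def __init__(self):
        self._loaders = {}

    def register(self, name, loader):
        """Associate a model name with a zero-argument function that loads it."""
        self._loaders[name] = loader

    def supported(self):
        """Return the names of all registered models, sorted alphabetically."""
        return sorted(self._loaders)

    def load(self, name):
        """Load one model, failing with a helpful message for unknown names."""
        if name not in self._loaders:
            raise KeyError(
                f"Unsupported model {name!r}; choose from {self.supported()}"
            )
        return self._loaders[name]()
```

Users would call supported() to see what is available and load("prot_bert") to get a model; the loaders here would wrap whatever load_model currently does.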

generalized maskify function

I don't know precisely the best way to do this, but ESM and prot_bert need input to be in two somewhat different formats.

To mask the 3rd residue in the sequence MENDEL:

Prot_bert: "M E [MASK] D E L"
ESM: ['M', 'E', '<mask>', 'D', 'E', 'L']

So, I think it is actually easier to get to the ESM version first: turn the sequence into a list and replace the position with the mask token. That structure can then be used to build the prot_bert version, so it would just require a conditional based on the model, or maybe even on the mask token.

I just feel like this would be better than having two somewhat redundant functions.
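
A sketch of what a single function could look like, assuming the only differences between the two formats are the mask token and whether the tokens are joined with a separator. The name maskify and its parameters are illustrative, and the position is 0-based here.

```python
def maskify(seq, pos, mask_token="<mask>", sep=None):
    """Replace the residue at 0-based index `pos` with `mask_token`.

    With sep=None the result is a list of tokens (the ESM-style input);
    with a separator such as " " the tokens are joined into one string
    (the prot_bert-style input).
    """
    if not 0 <= pos < len(seq):
        raise IndexError(f"position {pos} is outside the sequence")
    tokens = list(seq)
    tokens[pos] = mask_token
    return tokens if sep is None else sep.join(tokens)


# Masking the 3rd residue (index 2) of MENDEL in both styles.
esm_input = maskify("MENDEL", 2)
bert_input = maskify("MENDEL", 2, mask_token="[MASK]", sep=" ")
```

Because the mask token is just an argument, the same function also covers models that use a different token.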

Insert residue at a given position

This is similar to the idea of making variants by substituting residues, but this time by inserting a residue at a given position!

For instance, if you have the sequence MENDEL, what would be the best residue to insert at a chosen position within it? The masked language models can make this a reality!

The main issue that I have in thinking about this is that I think I'd have to interface with the models separately, which sounds annoying! I have a nice centralized dataframe way of piping predictions for all residues, but really I just need the mask predictions, and that is what is really different between the models. I guess I can just have a separate insertion module that interfaces with all the models.

I don't know the best way to approach this, though. I definitely think it would be worthwhile to be able to take a given protein sequence and a position and return the top k amino acids that can be placed there. From a usability standpoint, I think that I would basically like to return output similar to that of the augment function: an iterable list of peptides annotated with which positions were inserted with which residue.
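
A sketch of the model-independent part, assuming the model is queried separately and returns a {residue: probability} mapping for the inserted mask. insert_mask and top_insertions are hypothetical names, positions are 0-based, and the scores in the usage line are made up for illustration.

```python
def insert_mask(seq, pos, mask_token="<mask>"):
    """Return the tokens of `seq` with a mask inserted before 0-based index `pos`."""
    tokens = list(seq)
    tokens.insert(pos, mask_token)
    return tokens


def top_insertions(seq, pos, scores, k=3):
    """Build the k most likely insertion variants at `pos`.

    `scores` maps residue -> probability for the masked slot. Each result is
    (peptide, position, inserted residue), best-scoring residue first.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [(seq[:pos] + aa + seq[pos:], pos, aa) for aa in ranked]


# Illustrative scores only; a real call would take them from the model output.
masked = insert_mask("MENDEL", 3)
variants = top_insertions("MENDEL", 3, {"A": 0.1, "G": 0.6, "S": 0.3}, k=2)
```

Only the step that turns the masked tokens into scores depends on the model, so the insertion module itself could stay model-agnostic.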

Allow for arbitrary mask token

This is relevant to #1, since ESM uses <MASK> or something like that, we would need to be able to do that in order to interface with that model.

github actions that don't break..

So I'd like the actions environment to be the same as what is specified in some sort of conda environment file.

I'm looking at this here (https://github.com/marketplace/actions/setup-conda), I think I can use that to activate some sort of conda environment??

They have a nice setup for matrix testing, which lets tests run on multiple versions. I don't necessarily have any desire to do that since it's a bit overkill, but I figured I'd put it here for future reference.

jobs:
  build:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macOS-latest]
        python-version: [3.6, 3.7, 3.8]
    name: Python ${{ matrix.python-version }} example
    steps:
      - uses: actions/checkout@v2
      - name: Setup conda
        uses: s-weigand/setup-conda@v1
        with:
          update-conda: true
          python-version: ${{ matrix.python-version }}
          conda-channels: anaconda, conda-forge
      - run: conda --version
      - run: which python

Fork of seqlogo

The seqlogo function works well with the output dataframes, but the colors and formatting (the way the logo is split up) can't be customized, so I think just modifying a version of this would be ideal.

fun stuff

So beyond administrative tidbits (#18), there is some fun stuff that I think we can dive into and have a little fun with!

seqlogo from a single sequence
position specific scoring matrix from a single sequence (probably related to seqlogo)
BLOSUM like 20x20 correlation matrix and maybe even a plot!
protein sequence augmentation
plots

What's fun about this is that these are all things that derive from the prediction dataframe, which we now have a pretty good handle on thanks to #15 and #17. So what a nice hurdle to be over!!

RAM issue on google colab

This is a weird thing that I've thus far only encountered on colab, but I think addressing it may help reveal potential implementation shortcomings.

Fortunately, it doesn't make berteome unusable in colab. So far, here is what I have done.

Load the prot_bert and esm1b modules. prot_bert works fine: it downloads the model and everything seems great. Doing the same for ESM downloads the model but then crashes when loading it into memory. If I restart the session, the model is already downloaded and it loads fine.

What I know: in a fresh session you can't load both the prot_bert and esm1b modules from scratch. I would like that to be possible.

What I don't know: maybe the issue is with loading both models into memory. Maybe one can be loaded but not both? I haven't tried loading esm1b first and then prot_bert.

sequence mutational variants

I think it would be a nice touch to be able to use the models to generate a list of the top k most likely variants for all possible residue positions.

This could be useful where there is a need to intelligently augment a dataset of peptide sequences, or in directed-evolution scenarios where mutational variants have to be generated. Maybe it would be useful to generate only the most likely variants?
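
A sketch of the selection step, assuming the predictions are available as one {residue: probability} mapping per position. top_k_variants is a hypothetical name, positions are 0-based, and the wildtype residue is excluded so every result is a real substitution.

```python
def top_k_variants(seq, predictions, k=2):
    """Return the k most likely single substitutions at every position.

    `predictions` holds one {residue: probability} dict per position of `seq`.
    Each result is (variant sequence, position, new residue, probability).
    """
    if len(seq) != len(predictions):
        raise ValueError("need exactly one prediction per residue")
    variants = []
    for pos, (wildtype, scores) in enumerate(zip(seq, predictions)):
        candidates = [aa for aa in scores if aa != wildtype]
        best = sorted(candidates, key=scores.get, reverse=True)[:k]
        for aa in best:
            variants.append((seq[:pos] + aa + seq[pos + 1:], pos, aa, scores[aa]))
    return variants
```

Keeping the position and residue next to each variant makes it easy to filter or rank the list afterwards, for example by probability.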

amino acid alphabet

This is definitely only a minor thing, but I think is worth addressing. I personally only really care about the standard 20 amino acids, so I have no interest in predictions for other amino acids that these models may have support for. For now, I have hard coded the 20 amino acids as a string.

This creates two issues. First, others may be interested in other amino acids that are supported by these models. Second, even if that's not the case, I think it would be better to have a centralized definition of the alphabet instead of hard coding it numerous times in other functions.

What I think would make the most sense is to make a utils module or core module and put things like this there, so that other modules just import the alphabet.
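
A sketch of what such a shared definition could look like. The constant name, the helper restrict_to_alphabet, and the choice to renormalize after filtering are all assumptions made for illustration.

```python
# Would live in a shared module (e.g. a utils or core module) and be imported elsewhere.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def restrict_to_alphabet(scores, alphabet=AMINO_ACIDS):
    """Keep only residues in `alphabet` and rescale so the scores sum to 1.

    Renormalizing is a design choice made here, not something the models require.
    """
    allowed = set(alphabet)
    kept = {aa: p for aa, p in scores.items() if aa in allowed}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no probability mass left inside the alphabet")
    return {aa: p / total for aa, p in kept.items()}
```

Anyone interested in other residues could pass a different alphabet instead of editing hard-coded strings in several functions.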

plot library

There are a few different types of plots that can be made using this library, so an easy to use format might be nice. For instance, plotting the wildtype score across all residues, or plot something like N-effective (a separate issue to be raised). I also think it would be cool to overlay that over a seqlogo, but maybe that would be too crazy!

calculate N-effective per site

I would like to be able to add a column to the prediction dataframe that holds the calculated N-effective for each site. I think this is more or less the entropy at a given position, expressed as the effective number of residues that could be present there. I think of it as a proxy for diversity: if a position has a very strong probability for one residue, diversity is low, but if it has lots of low probabilities spread across many residues, diversity is high!

There may be other metrics to look at, that are probably related to summarizing alignment columns.
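
One common way to turn entropy into an effective number of residues is to take the exponential of the Shannon entropy. Whether that is the exact definition wanted here is an assumption, and n_effective is a hypothetical name.

```python
import math


def n_effective(probs):
    """Effective number of residues at one site: exp of the Shannon entropy.

    `probs` holds the predicted probabilities for one position. They are
    rescaled to sum to 1 first, and zero entries contribute nothing.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    entropy = -sum((p / total) * math.log(p / total) for p in probs if p > 0)
    return math.exp(entropy)
```

Under this definition a site where one residue has all the probability gives 1, and a site that is uniform over n residues gives n, which matches the low-diversity versus high-diversity intuition.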
