sacdallago / bio_embeddings

Get protein embeddings from protein sequences

Home Page: http://docs.bioembeddings.com

License: MIT License

sequence-embeddings pipeline embedders bio-embeddings protein-sequences machine-learning language-model protein-structure protein-prediction

bio_embeddings's Introduction

Bio Embeddings

Resources to learn about bio_embeddings:

Project aims:

  • Facilitate the use of language model based biological sequence representations for transfer-learning by providing a single, consistent interface and close-to-zero-friction
  • Reproducible workflows
  • Depth of representation (different models from different labs trained on different datasets for different purposes)
  • Extensive examples, handle complexity for users (e.g. CUDA OOM abstraction) and well documented warnings and error messages.

The project includes:

  • General purpose python embedders based on open models trained on biological sequence representations (SeqVec, ProtTrans, UniRep,...)
  • A pipeline which:
    • embeds sequences into matrix-representations (per-amino-acid) or vector-representations (per-sequence) that can be used to train learning models or for analytical purposes
    • projects per-sequence embeddings into lower dimensional representations using UMAP or t-SNE (for lightweight data handling and visualizations)
    • visualizes low dimensional sets of per-sequence embeddings onto 2D and 3D interactive plots (with and without annotations)
    • extracts annotations from per-sequence and per-amino-acid embeddings using supervised (when available) and unsupervised approaches (e.g. by network analysis)
  • A webserver that wraps the pipeline into a distributed API for scalable and consistent workflows

Installation

You can install bio_embeddings via pip or use it via docker. Mind the additional dependencies for align.

Pip

Install the pipeline and all extras like so:

pip install bio-embeddings[all]

To install the unstable version, please install the pipeline like so:

pip install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git"

If you only need to run a specific model (e.g. an ESM or ProtTrans model) you can install bio-embeddings without dependencies and then install the model-specific dependency, e.g.:

pip install bio-embeddings
pip install bio-embeddings[prottrans]

The extras are:

  • seqvec
  • prottrans
    • prottrans_albert_bfd
    • prottrans_bert_bfd
    • prottrans_t5_bfd
    • prottrans_t5_uniref50
    • prottrans_t5_xl_u50
    • prottrans_xlnet_uniref100
  • esm
    • esm
    • esm1b
    • esm1v
  • unirep
  • cpcprot
  • plus
  • bepler
  • deepblast

Docker

We provide a docker image at ghcr.io/bioembeddings/bio_embeddings. Simple usage example:

docker run --rm --gpus all \
    -v "$(pwd)/examples/docker":/mnt \
    -v bio_embeddings_weights_cache:/root/.cache/bio_embeddings \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    ghcr.io/bioembeddings/bio_embeddings:v0.1.6 /mnt/config.yml

See the docker example in the examples folder for instructions. You can also use ghcr.io/bioembeddings/bio_embeddings:latest which is built from the latest commit.

Dependencies

To use the mmseqs_search protocol, or the mmseqs2 functions in align, you additionally need to have mmseqs2 in your path.

Installation notes

bio_embeddings was developed for unix machines with GPU capabilities and CUDA installed. If your setup diverges from this, you may encounter some inconsistencies (e.g. speed is significantly affected by the absence of a GPU and CUDA). For Windows users, we strongly recommend the use of Windows Subsystem for Linux.

What model is right for you?

Each model has its strengths and weaknesses (speed, specificity, memory footprint...). There isn't a "one-size-fits-all" model, and we encourage you to try at least two different models when starting a new exploratory project.

The models prottrans_t5_xl_u50, esm1b, esm, prottrans_bert_bfd, prottrans_albert_bfd, seqvec and prottrans_xlnet_uniref100 were all trained with the goal of systematic predictions. From this pool, we believe the optimal model to be prottrans_t5_xl_u50, followed by esm1b.

Usage and examples

We highly recommend checking out the examples folder for pipeline examples, and the notebooks folder for post-processing pipeline runs and general purpose use of the embedders.

After having installed the package, you can:

  1. Use the pipeline like:

    bio_embeddings config.yml

    A blueprint of the configuration file, and an example setup can be found in the examples directory of this repository.

  2. Use the general purpose embedder objects via python, e.g.:

    from bio_embeddings.embed import SeqVecEmbedder
    
    embedder = SeqVecEmbedder()
    
    embedding = embedder.embed("SEQVENCE")

    More examples can be found in the notebooks folder of this repository.

Cite

If you use bio_embeddings for your research, we would appreciate it if you could cite the following paper:

Dallago, C., Schütze, K., Heinzinger, M., Olenyi, T., Littmann, M., Lu, A. X., Yang, K. K., Min, S., Yoon, S., Morton, J. T., & Rost, B. (2021). Learned embeddings from deep learning to visualize and predict protein sets. Current Protocols, 1, e113. doi: 10.1002/cpz1.113

The corresponding bibtex:

@article{https://doi.org/10.1002/cpz1.113,
author = {Dallago, Christian and Schütze, Konstantin and Heinzinger, Michael and Olenyi, Tobias and Littmann, Maria and Lu, Amy X. and Yang, Kevin K. and Min, Seonwoo and Yoon, Sungroh and Morton, James T. and Rost, Burkhard},
title = {Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets},
journal = {Current Protocols},
volume = {1},
number = {5},
pages = {e113},
keywords = {deep learning embeddings, machine learning, protein annotation pipeline, protein representations, protein visualization},
doi = {https://doi.org/10.1002/cpz1.113},
url = {https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpz1.113},
eprint = {https://currentprotocols.onlinelibrary.wiley.com/doi/pdf/10.1002/cpz1.113},
year = {2021}
}

Additionally, we invite you to cite the work from others that was collected in `bio_embeddings` (see section _"Tools by category"_ below). We are working on an enhanced user guide which will include proper references to all citable work collected in `bio_embeddings`.

Contributors

  • Christian Dallago (lead)
  • Konstantin Schütze
  • Tobias Olenyi
  • Michael Heinzinger

Want to add your own model? See contributing for instructions.

Non-exhaustive list of tools available (see following section for more details):

Datasets

  • prottrans_t5_xl_u50 residue and sequence embeddings of the Human proteome at full precision + secondary structure predictions + sub-cellular localisation predictions: DOI
  • prottrans_t5_xl_u50 residue and sequence embeddings of the Fly proteome at full precision + secondary structure predictions + sub-cellular localisation predictions + conservation prediction + variation prediction: DOI

Tools by category

Pipeline
General purpose embedders

bio_embeddings's People

Contributors

hannesstark, konstin, kvetab, mheinzinger, sacdallago, saendigphilip, t03i

bio_embeddings's Issues

Add `extract` step

Allow an additional pipeline step that takes an already embedded dataset and some annotation of it (basically what visualize does but on raw embeddings) and uses some metric to annotate the current embeddings using the "background" information.

Use case: I have a large set of antibodies, some with a desired property. I have a new set of unannotated antibodies and want to figure out which one in this set is the closest to the one in the original set with the desired property.
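The use case above can be sketched as a nearest-neighbour lookup over per-sequence embeddings. This is only an illustration: the function names, the toy two-dimensional vectors and the choice of Euclidean distance are assumptions, not part of bio_embeddings.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_annotated(query, reference, annotations):
    """Return (reference_id, annotation) of the annotated reference embedding nearest to query."""
    best_id = min(
        (ref_id for ref_id in reference if ref_id in annotations),
        key=lambda ref_id: euclidean(query, reference[ref_id]),
    )
    return best_id, annotations[best_id]


# Toy per-sequence embeddings (hypothetical; real ones have many more dimensions).
reference = {"ab1": [0.0, 0.0], "ab2": [1.0, 1.0], "ab3": [5.0, 5.0]}
annotations = {"ab2": "binder", "ab3": "non-binder"}  # ab1 is unannotated
new_antibodies = {"q1": [0.9, 1.2], "q2": [4.0, 4.5]}

transferred = {
    query_id: closest_annotated(vector, reference, annotations)
    for query_id, vector in new_antibodies.items()
}
```

Any other metric (e.g. cosine distance) could be swapped in for `euclidean` without changing the rest of the sketch.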

Annotation transfer

Keeping this vague :P but, what @mariasche and I discussed today.

Notes here:

  • Annotation file will have multiple annotations for one target (!!)
  • Comparison function can be optimized, look at provided code.

FileSystemFileManager constructor gets values even though doesn't take any

The only existing file_manager, FileSystemFileManager, doesn't have an __init__ function, as it does not have any parameters, and FileManagerInterface doesn't specify any constructor. Thus setting any value in management other than the default empty dict raises an exception in file_manager(**management).

    def get_file_manager(**kwargs):
        management = kwargs.get('management', {})
        file_manager_type = management.get('file_manager')
        file_manager = FILE_MANAGERS.get(file_manager_type)
        return file_manager(**management)
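One possible way around this, sketched under assumptions: remove the selector key before forwarding the remaining options to the constructor. The stand-in class, the registry contents and the "filesystem" default key are hypothetical and only serve to make the example self-contained.

```python
class FileSystemFileManager:
    """Stand-in for the real class: its constructor takes no arguments."""


# Hypothetical registry; the real FILE_MANAGERS mapping may use other keys.
FILE_MANAGERS = {"filesystem": FileSystemFileManager}


def get_file_manager(**kwargs):
    # Copy so the caller's configuration dict is not modified.
    management = dict(kwargs.get("management", {}))
    # Pop the selector so it is not passed on as a constructor argument.
    file_manager_type = management.pop("file_manager", "filesystem")
    try:
        file_manager_class = FILE_MANAGERS[file_manager_type]
    except KeyError:
        raise ValueError(f"Unknown file_manager: {file_manager_type!r}") from None
    # Remaining options are still forwarded; a class without parameters
    # will reject them, which is then a genuine configuration error.
    return file_manager_class(**management)
```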

Using simple remapping fails

You can check out /mnt/project/bio_embeddings/runs/cath on rostssh.

stderr.log there will tell you:

Traceback (most recent call last):
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/bin/bio_embeddings", line 8, in <module>
    sys.exit(main())
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/cli.py", line 22, in main
    run(arguments.config_path[0], overwrite=arguments.overwrite)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 166, in run
    stage_output_parameters = stage_runnable(**stage_parameters)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 276, in run
    return PROTOCOLS[kwargs["protocol"]](**kwargs)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 196, in seqvec
    return embed_and_write_batched(embedder, file_manager, result_kwargs)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 166, in embed_and_write_batched
    reduced_embeddings_file.create_dataset(
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/h5py/_hl/group.py", line 139, in create_dataset
    self[name] = dset
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/h5py/_hl/group.py", line 370, in __setitem__
    name, lcpl = self._e(name, lcpl=True)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/h5py/_hl/base.py", line 137, in _e
    name = name.encode('ascii')
AttributeError: 'int' object has no attribute 'encode'

You probably have to cast the int to str when saving the embedding. I remember I fixed this in the past, but probably due to significant overwrites, it got lost...

Please, once you have fixed this issue, can you try re-running the job?

Instructions in: /mnt/project/bio_embeddings/README

Log storage estimate

At the beginning of any embed pipeline run, calculate (by means of the mapping file) the expected size of the embedding file and the reduced embedding file.

As formula:

per_amino_acid_size_in_bits = (embedding_dimension * layers) * 32
per_protein_size_in_bits = embedding_dimension * 32


total_number_of_proteins = len(mapping_file)
total_aa = mapping_file.sequence_length.sum()

embeddings_file_size = per_amino_acid_size_in_bits * total_aa
reduced_embeddings_file_size = per_protein_size_in_bits * total_number_of_proteins
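The formula can be turned into a small helper, sketched here under assumptions: values are stored as 32-bit floats without compression, the per-protein embedding has one vector of `embedding_dimension` values, and the function and parameter names are hypothetical.

```python
def estimate_storage_bytes(sequence_lengths, embedding_dimension, layers=1, bits_per_value=32):
    """Estimate uncompressed sizes in bytes of the embeddings and reduced embeddings files."""
    per_amino_acid_bits = embedding_dimension * layers * bits_per_value
    per_protein_bits = embedding_dimension * bits_per_value
    total_aa = sum(sequence_lengths)
    total_proteins = len(sequence_lengths)
    return {
        "embeddings_file": per_amino_acid_bits * total_aa // 8,
        "reduced_embeddings_file": per_protein_bits * total_proteins // 8,
    }


# Three proteins of length 120, 350 and 1024, embedded with 3 layers of 1024 values.
sizes = estimate_storage_bytes([120, 350, 1024], embedding_dimension=1024, layers=3)
# sizes["embeddings_file"] is 18358272 bytes (about 17.5 MiB),
# sizes["reduced_embeddings_file"] is 12288 bytes.
```

The sequence lengths would come from the mapping file; HDF5 metadata overhead is not included in the estimate.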

Webserver improvements

Once v0.1.4 is out, it makes sense to revamp the webserver and make it 100% pipeline compliant.

MVP:

  • Process multiple sequences, up to a max cap of AA per sequence file (currently the webserver is limited to one sequence, max 20k AA)
  • Perform: embed, project, visualize, extract (supervised) & extract (unsupervised)
  • Choose between Bert or SeqVec
  • Download results.

Good to have:

  • visualize some results online (e.g. sec struct, loc). For each visualization tool: create a separate frontend, isolated app. Use CellMap for loc viz. For sec struct IDK yet... For unsupervised extract (Go annotations), I could re-use the tree visualization from PP.

Add compression to embedding export

An easy improvement when storing embeddings_file and reduced_embeddings_file, supported out of the box, may impact speed (but that's acceptable).

https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline

Also, while at it: double check that the stored dataset uses the most fitting dtype.

P.S.: preference for gzip

P.P.S.: would be nice to run this as a test to see "how much it buys". Easy test: take an h5 file and copy all datasets into a new h5 file applying compression. Then we see if this is useful...

General purpose embedders not loading

I'm trying to use the embedders within a python script but they are not loading. I've tried just the example code from the README, but this is what I get

>>> from bio_embeddings.embed import SeqVecEmbedder
>>> embedder = SeqVecEmbedder()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/bio_embeddings/embed/seqvec_embedder.py", line 48, in __init__
    self._weights_file = self._options["weights_file"]
KeyError: 'weights_file'

I also tried BertEmbedder but I get a different error

>>> from bio_embeddings.embed import *
>>> embedder = BertEmbedder()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/bio_embeddings/embed/bert_embedder.py", line 29, in __init__
    self.model = BertModel.from_pretrained(self._model_directory)
  File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/transformers/modeling_utils.py", line 587, in from_pretrained
    **kwargs,
  File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/transformers/configuration_utils.py", line 201, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/transformers/configuration_utils.py", line 224, in get_config_dict
    if os.path.isdir(pretrained_model_name_or_path):
  File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/genericpath.py", line 42, in isdir
    st = os.stat(s)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

I installed the repo via github in a fresh miniconda environment two days ago.

CPU fallback for Bert

As mentioned here, there seems to be a problem with batching in Bert.

The set I've used for the tests is viruses_90 from the test sets on oculus (in case you want to reproduce the run :) ).

I think that enhancing with the SeqVec single sequence --> CPU --> fail strategy will solve this issue, too.

Inheritance over helper functions

Instead of this, I would rather you create another interface (TransformerEmbedderInterface) which extends EmbedderInterface and then defines embeds_batch at the level of the abstract class.

The rationale: I expect TransformerEmbedderInterface to have a class of shared methods which differentiate it slightly from other Embedders, but which need "unity". Specifically, the extraction of the attention activations from the embeddings (something you can't retroactively do).

Warning "Failed to create stage directory" when using overwrite

When running a pipeline the second time I get a warning, even though I'm using --overwrite:

WARNING: Failed to create stage directory my_develop/stage_1.

Depending on the desired behaviour, this can e.g. be solved by replacing os.mkdir(path) with os.makedirs(path, exist_ok=True) or by calling shutil.rmtree before if the directory exists.
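Both options mentioned above can be combined in one helper. This is a minimal sketch; the function name and the `overwrite` parameter are assumptions and do not mirror the pipeline's actual code.

```python
import os
import shutil
import tempfile


def create_stage_directory(path, overwrite=False):
    """Create a stage directory; with overwrite=True, discard a previous run's directory first."""
    if overwrite and os.path.isdir(path):
        shutil.rmtree(path)
    # exist_ok=True makes the call idempotent instead of raising FileExistsError.
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as prefix:
    stage = os.path.join(prefix, "my_develop", "stage_1")
    create_stage_directory(stage)  # first run
    open(os.path.join(stage, "old_output.txt"), "w").close()
    create_stage_directory(stage, overwrite=True)  # second run, no warning needed
    leftover_files = os.listdir(stage)
```

Whether stale files should be deleted or kept is the "desired behaviour" question; without `overwrite=True` the sketch keeps them.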

Decouple LM dependencies from package dependencies

@konstin proposed some new fancy thing that allows having different levels of dependencies (like modules in npm) in the same pip-installable package. This will be particularly useful once we have more LMs (and even now, with the gigantic allennlp dependency + transformers dependency).

Create basic Google Colab for embedding generation

@konstin & @t03i you might have more experience with this: I would like to make a very basic Google Colab (a notebook using Google's hardware: why not; free TPU/GPU) to put in the Readme (basically, the same as https://github.com/sacdallago/bio_embeddings/blob/master/notebooks/embed_fasta_sequences.ipynb , but including the pip install directive, and storing of the embeddings to a file).

Unfortunately, I can't seem to be able to install bio_embeddings. I tried via pip install bio_embeddings and directly from git (see Colab linked below). I have a hunch this might be because currently the requirement is python >= 3.7, but Colab is on 3.6.9. Do you see any reason why we cannot support 3.6.9? Can one of you maybe give it a try?

Here's the link to the Colab that I started: https://colab.research.google.com/drive/1h5izTF07GjHMkekmGNUj32Sbb1gccJxd?usp=sharing

Assorted improvements in embed

Reading through current develop, things that I noticed:

  • (for general purpose users) we have to decide if we want to make it from bio_embeddings import SeqVecEmbedder or from bio_embeddings.embed import SeqVecEmbedder; now it's inconsistent: https://github.com/sacdallago/bio_embeddings/blob/14f1de5754221452c27d2e2c5420f191bb2ecc00/bio_embeddings/__init__.py

    Up to you @konstin

  • Once that has been decided, all notebooks in examples need to be revised and updated!

  • Speaking of notebooks, this one serves as an example of what to expect when the model files are not provided (in the case of SeqVec). After the introduction of your with_download method, I think this will need re-writing.

  • Finally: once you are done with decisions and improvements, please make sure all relevant notebooks run and are up to date. E.g.: this one still uses the "constrained albert" (see warning message), but that should not happen anymore ;). Not relevant notebooks: project_visualize_custom_embedings and project_visualize_pipeline_embeddings

Refactor cuda device assignation

Currently, in one form or another, in various parts of the pipeline, this is used:

"cuda:0" if torch.cuda.is_available() and not self._use_cpu else "cpu"

The problem here is that cuda:0 will always refer to card 0. In systems hosting multiple cards, this will be painful. Workarounds are:


Examples where it's used:

Add/extend test-cases

Extend current tests (pre-computed embeddings are compared to output of pipeline to ensure LM weights are loaded correctly) for testing feature extractors (e.g. compare pre-computed secondary structure predictions to pipeline output for a few sequences).

MD5 Clashes among unique sequences

I'm hitting the MD5Clash when I have only unique sequences in my dataset. I edited the code to remove the exception but there should be an option to ignore this.

This most likely indicates there are multiple identical sequences in your FASTA file. 
MD5 hashes are used to remap sequence identifiers from the input FASTA.
This error exists to prevent wasting resources (computing the same embedding twice).
There's a (very) low probability of this indicating a real MD5 clash.

If you are sure there are no identical sequences in your set, please open an issue at https://github.com/sacdallago/bio_embeddings/issues . Otherwise, use cd-hit to reduce your input FASTA to exclude identical sequences!
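To tell identical sequences apart from a genuine hash clash before opening an issue, a check along these lines may help. It is a sketch using only the standard library; the function names and the `(identifier, sequence)` record format are assumptions.

```python
import hashlib
from collections import defaultdict


def md5_of(sequence):
    """Hex MD5 digest of a sequence string."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def classify_md5_collisions(records):
    """records: iterable of (identifier, sequence).

    Returns (duplicates, clashes): digests shared by identical sequences
    versus digests shared by different sequences (true MD5 clashes).
    """
    by_hash = defaultdict(list)
    for identifier, sequence in records:
        by_hash[md5_of(sequence)].append((identifier, sequence))
    duplicates, clashes = {}, {}
    for digest, entries in by_hash.items():
        if len(entries) < 2:
            continue
        distinct = {sequence for _, sequence in entries}
        target = duplicates if len(distinct) == 1 else clashes
        target[digest] = [identifier for identifier, _ in entries]
    return duplicates, clashes


records = [("a", "SEQWENCE"), ("b", "PROTEIN"), ("c", "SEQWENCE")]
duplicates, clashes = classify_md5_collisions(records)
```

Here `duplicates` groups the identifiers "a" and "c", which carry the same sequence under different headers, while `clashes` stays empty.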

Pass per-aa layer reducer as parameter in config

In order to address the problem of RAW embeddings occupying too much storage in case of big transformers, e.g. bert, and to have the least amount of "default destruction" imposed by the pipeline: allow users to define an "on the fly layer reducer" function, aka. a lambda function via the config.

Cons: lambdas cannot be expressed in YAML, so it needs to be a string (which then gets evaled in Python). This will inevitably lead to some problems for less expert users, but that's something I'm willing to accept IFF there are enough examples provided (aka: I have to provide enough examples!). It may also pose a security threat IFF this configuration option ever gets exposed on a webserver.

String can be parsed using something like:

layer_reducer_function = eval(kwargs.get('layer_reducer_function', "lambda x: x"))

The challenge is making the name quite significant and non-redundant with reduce, which is instead used for the concept of reducing variable embeddings to fixed size (aka: per-aa to per-sequence).
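Given the security concern raised for eval, one alternative (not what this issue proposes) is to let the config name a reducer from a fixed registry. The sketch below assumes hypothetical names (`layer_reducer`, `LAYER_REDUCERS`) and uses nested lists of shape layers x residues x dimension in place of real arrays.

```python
def keep_all(layers):
    """Default: leave the raw embedding untouched."""
    return layers


def last_layer(layers):
    """Keep only the final layer."""
    return layers[-1]


def sum_layers(layers):
    """Element-wise sum across layers; each layer is a list of per-residue vectors."""
    return [
        [sum(values) for values in zip(*residue_vectors)]
        for residue_vectors in zip(*layers)
    ]


LAYER_REDUCERS = {"all": keep_all, "last": last_layer, "sum": sum_layers}


def get_layer_reducer(**kwargs):
    """Look up the reducer named in the (hypothetical) layer_reducer config option."""
    name = kwargs.get("layer_reducer", "all")
    try:
        return LAYER_REDUCERS[name]
    except KeyError:
        raise ValueError(
            f"Unknown layer_reducer {name!r}; choose from {sorted(LAYER_REDUCERS)}"
        ) from None


# Two layers, two residues, embedding dimension 2.
raw = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
reduced = get_layer_reducer(layer_reducer="sum")(raw)
```

This trades the flexibility of arbitrary lambdas for a plain YAML string that is safe to expose and easy to document.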

[XLNet] Unable to open file (file signature not found)

Hey,

I'm trying to use XLNetEmbedder with the newest XLNet weights and options files (another source found in the dropbox cloud).

from bio_embeddings.embed.xlnet_embedder import XLNetEmbedder

embedder = XLNetEmbedder(
    model_directory='models/xlnet/'
)

After that I received the OSError exception:

lib/python3.6/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    171         if swmr and swmr_support:
    172             flags |= h5f.ACC_SWMR_READ
--> 173         fid = h5f.open(name, flags, fapl=fapl)
    174     elif mode == 'r+':
    175         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

OSError: Unable to open file (file signature not found)

Thanks for your time for looking into this issue.

Error in BasicAnnotationExtractor

In develop and on Google Colab, the BasicAnnotationExtractor class is not loading weights correctly, or maybe it's that the uploaded weights are not correct.

RuntimeError Traceback (most recent call last)
in ()
19 annotations_extractor = BasicAnnotationExtractor(
20 secondary_structure_checkpoint_file="secstruct_checkpoint",
---> 21 subcellular_location_checkpoint_file="subcell_checkpoint"
22 )

1 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
845 if len(error_msgs) > 0:
846 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 847 self.__class__.__name__, "\n\t".join(error_msgs)))
848 return _IncompatibleKeys(missing_keys, unexpected_keys)
849

RuntimeError: Error(s) in loading state_dict for SUBCELL_FNN:
Missing key(s) in state_dict: "layer.3.weight", "layer.3.bias", "layer.3.running_mean", "layer.3.running_var".

Installation from the PyPI with Python 3.6+ failed

Hey,

I discovered that bio-embeddings installation from the PyPI with Python 3.6+ doesn't work anymore.
But it's working fine directly from the GitHub repository.

pip install bio-embeddings[all]

ERROR: Could not find a version that satisfies the requirement bio-embeddings[all] (from versions: none)
ERROR: No matching distribution found for bio-embeddings[all]

or

ERROR: Could not find a version that satisfies the requirement bio-embeddings==0.1.3 (from versions: none)

I tried with Python 3.6.8 on macOS and with Python 3.6.9 on Ubuntu. The same exception was raised.
I think it is related to the setup.py configuration:

'python_requires': '>=3.7,<4.0'

Could you tell why Python 3.6 is excluded? Thanks!

Set number of threads

Good day,

Quick question: Is there any way to set the number of threads in the bio_embeddings pipeline - either from within the config file or from the command line?

Thank you.

Remove embedders from GPU memory post-embed phase in pipeline

I once had a situation in which the pipeline was in a "visualize" stage, but the GPU was still occupied by the embedder (SeqVec).

I had assumed that the embedder is destroyed after the embed stage (the stages are written in a way which should make Python's automatic garbage collection easy). But apparently I was wrong.

Maybe it makes sense to explicitly del embedder at the end of the embed stage.

It's worth looking into this. The visualize stage is sometimes slow (it can take up to 2 days for big plots), and occupying GPU resources for no good reason is a waste in those cases. In the future (e.g. with extract), GPU RAM will be needed.

Check alphabets

Different embedders may be trained on different alphabets, which may differ from those provided by the user.

  • SeqVec has a 25 letter alphabet ("20 standard and 2 rare amino acids (U and O) plus 3 special cases describing either ambiguous (B, Z) or unknown amino acids (X)")
  • The transformers remove the rare amino acids (re.sub(r"[UZOB]", "X", sequence))
  • UniRep has a 26 letter alphabet (https://github.com/ElArkk/jax-unirep/blob/81843c034941cfeb4e45c1808364b5f996771382/tests/test_layers.py#L78)
  • PLUS has 21 letters ("20 proteinogenic and 1 unspecified amino acids")
  • provis uses tape, defaulting to the iupac vocabulary. Tape has the following code, which is likely good inspiration for what we want to do:
class TAPETokenizer():
    r"""TAPE Tokenizer. Can use different vocabs depending on the model.
    """

    def __init__(self, vocab: str = 'iupac'):
        if vocab == 'iupac':
            self.vocab = IUPAC_VOCAB
        elif vocab == 'unirep':
            self.vocab = UNIREP_VOCAB
        self.tokens = list(self.vocab.keys())
        self._vocab_type = vocab
        assert self.start_token in self.vocab and self.stop_token in self.vocab

We should ensure that whatever alphabet the input fasta uses all embedders work, likely by doing something similar to the transformer, and test that behaviour.
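A minimal sketch of such a normalization step follows, generalizing the transformers' re.sub approach to any character outside a supported alphabet. The function name, its parameters and the choice of X as the fallback residue are assumptions.

```python
import re

# The 20 standard amino acids.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")


def normalize_sequence(sequence, supported_extra="X", unknown="X"):
    """Upper-case a sequence and map every residue outside the supported alphabet to `unknown`."""
    supported = STANDARD | set(supported_extra)
    return "".join(
        residue if residue in supported else unknown
        for residue in sequence.upper()
    )


# For U, Z, O and B this matches the transformers' substitution ...
assert normalize_sequence("MKUZOB") == re.sub(r"[UZOB]", "X", "MKUZOB")
# ... but it also covers other unexpected characters such as J or *.
cleaned = normalize_sequence("mkj*")
# An embedder that supports rare residues can declare them explicitly.
kept_rare = normalize_sequence("MKUJ", supported_extra="XOUZB")
```

Each embedder could pass its own `supported_extra`, so the same test suite can exercise all of them with one awkward FASTA file.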

EDIT: Current state on the non-standard amino acids:

  • SeqVec: X, O, U, Z, B
  • transformer: X, O, U, Z, B (O, U, Z, B -> X internally)
  • esm: X, O, U, Z, B
  • unirep: X, O, U, Z, B, J
  • CPCProt (tape): X, O, U, Z, B
  • PLUS: X, O, U, Z, B

Improvement in embedding estimation

@mheinzinger proposes that instead of tqdm counting proteins, it counts amino acids while embedding (definitely a better measure!).

I think: great idea, but VERY low priority. Especially if this requires a lot of coding.

Add timestamp to logger

@konstin you were completely right: logging with timestamp is much better than just logging. I now have a pipeline run that has been running for at least 12h for which generating a plot is taking ages (this is expected). Nevertheless, would be nice to know when it started :)

For now, I only see when UMAP finished (which is still a good indicator, but, yeah... :D)

Sat Jul 11 02:14:16 2020 Finished Nearest Neighbor Search
Sat Jul 11 02:14:25 2020 Construct embedding
        completed  0  /  200 epochs
        completed  20  /  200 epochs
        completed  40  /  200 epochs
        completed  60  /  200 epochs
        completed  80  /  200 epochs
        completed  100  /  200 epochs
        completed  120  /  200 epochs
        completed  140  /  200 epochs
        completed  160  /  200 epochs
        completed  180  /  200 epochs
Sat Jul 11 02:33:37 2020 Finished embedding
INFO Created the file go_embeddings/umap_projection/projected_embeddings_file.csv
INFO Created the file go_embeddings/umap_projection/ouput_parameters_file.yml
INFO Created the stage directory go_embeddings/plotly_2D
INFO Created the file go_embeddings/plotly_2D/input_parameters_file.yml
INFO Created the file go_embeddings/plotly_2D/input_annotation_file.csv
INFO Created the file go_embeddings/plotly_2D/merged_annotation_file.csv
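Adding timestamps to the INFO lines above needs only a format string in the standard logging module. The sketch below is an assumption about how it could be wired up; `configure_logging` and the two constants are hypothetical names.

```python
import logging

LOG_FORMAT = "%(asctime)s %(levelname)s %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def configure_logging(level=logging.INFO):
    """Configure the root logger so every record starts with a timestamp."""
    logging.basicConfig(
        level=level,
        format=LOG_FORMAT,
        datefmt=DATE_FORMAT,
        force=True,  # Python 3.8+: replace handlers installed earlier
    )


configure_logging()
logging.getLogger(__name__).info("Created the stage directory go_embeddings/plotly_2D")
```

Each line then starts with the date and time, followed by the level and the message, so the start of a long-running stage can be read directly from the log.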

Fix embed module when the extras aren't selected

Currently using from bio_embeddings.embed import ... will lead to import errors when not using the all extra. This should be fixed by using try-except-ImportError-blocks and should be documented.
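The try-except-ImportError pattern could look roughly like this. It is a sketch, not the package's implementation: `optional_import` and the placeholder class are hypothetical, and the standard json module stands in for an installed dependency.

```python
import importlib


def optional_import(module_name, extra):
    """Import module_name, or return a placeholder that raises a helpful error when used."""
    try:
        return importlib.import_module(module_name)
    except ImportError:

        class _MissingExtra:
            def __getattr__(self, attribute):
                raise ImportError(
                    f"{module_name} is required for this feature; "
                    f"install it with: pip install bio-embeddings[{extra}]"
                )

        return _MissingExtra()


# An installed module is returned as-is.
json_module = optional_import("json", extra="all")
# A missing module only fails once something is accessed on it.
missing = optional_import("surely_not_installed_xyz", extra="seqvec")
```

Deferring the error to first use keeps `from bio_embeddings.embed import ...` working for users who installed only one extra, while still pointing them to the right pip command.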

[SeqVec2] Unable to open object (object 'char_embed' doesn't exist)

Hey,

I'm trying to use SeqVecEmbedder with the newest SeqVec v2 weights and options files.

from bio_embeddings import SeqVecEmbedder

embedder = SeqVecEmbedder(
    weights_file='models/seqvec2/weights.hdf5',
    options_file='models/seqvec2/options.json'
)

After that I received the KeyError exception:

/lib/python3.6/site-packages/h5py/_hl/group.py in __getitem__(self, name)
    262                 raise ValueError("Invalid HDF5 object reference")
    263         else:
--> 264             oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
    265 
    266         otype = h5i.get_type(oid)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5o.pyx in h5py.h5o.open()

KeyError: "Unable to open object (object 'char_embed' doesn't exist)"

I found similar problems in the AllenNLP issues [1] and [2].

Thanks for your time for looking into this issue.

Better handling of input FASTA and processing of sequences in pipeline

For LSTM based models (e.g. SeqVec) the longer protein sequences go, the more computational resources are needed. Especially for long sequences, resorting to GPU computing might result in RAM exceptions, halting the computation. While there is a mechanism in place to fall back on CPU for longer sequences based on a maximum count of AA in the sequence or a sequence set, this could still result in an exception (since there's no direct AA-RAM conversion formula), which would lead to the pipeline halting at the end of embedding a dataset. This is obviously not ideal! Additionally, for fixed-length models, it is quite useless to go through the embedding phase for any sequence > max sequence length the model accepts. Therefore:

  1. Sort the input FASTA by sequence length in descending order. This forces the hardest samples to be processed first, so any exception arises soon after computation starts, rather than when 90% of the data has been processed, maybe after hours/days of computation

  2. Split data a priori (using mapping file) into CPU/GPU embeddable (e.g. for LSTMs) and non-embeddable / GPU (for fixed size embedders). This will reduce exceptions, logging and improve speed.
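The two steps above can be sketched as a planning function that runs before any embedding starts. The thresholds, the function name and the three-way split are assumptions chosen for illustration.

```python
def plan_embedding(sequences, max_gpu_length, max_model_length=None):
    """sequences: dict of identifier -> sequence.

    Returns (gpu_ids, cpu_ids, skipped_ids), each ordered longest first.
    """
    ordered = sorted(
        sequences, key=lambda identifier: len(sequences[identifier]), reverse=True
    )
    gpu_ids, cpu_ids, skipped_ids = [], [], []
    for identifier in ordered:
        length = len(sequences[identifier])
        if max_model_length is not None and length > max_model_length:
            skipped_ids.append(identifier)  # a fixed-length model cannot embed it
        elif length > max_gpu_length:
            cpu_ids.append(identifier)  # likely too large for GPU memory
        else:
            gpu_ids.append(identifier)
    return gpu_ids, cpu_ids, skipped_ids


sequences = {
    "short": "M" * 50,
    "medium": "M" * 400,
    "long": "M" * 3000,
    "huge": "M" * 20000,
}
gpu_ids, cpu_ids, skipped_ids = plan_embedding(
    sequences, max_gpu_length=1000, max_model_length=10000
)
```

Because there is no direct AA-to-RAM conversion, `max_gpu_length` remains a heuristic; processing the longest sequences first only makes a wrong guess fail early.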

Catch embedding exception, close embedding file

Right now, embeddings are written iteratively to an open embeddings file (either whole or reduced). If the embedding step fails (e.g. by running out of memory), the files are closed abruptly (without close()), resulting in corrupted files.

Ideally, we want to close the files correctly no matter what. This might require a bigger re-engineering of the FileSystem file handler, which just gives read/write pointers to files instead of paths, and on sigterm or alike tries to close all handles.

The quick solution now would be encapsulating the embedding calls in try / finally clauses where the error isn't caught (aka: allowed to propagate) but the files do get closed no matter if successful or not.
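The quick solution can be sketched as follows, with a plain text file standing in for the embeddings file and a deliberately failing embed function. All names here are hypothetical.

```python
import os
import tempfile


def embed_and_write(sequences, embed, output_path):
    """Write one line per embedded sequence; the file is closed even if embed() raises."""
    output_file = open(output_path, "w")
    try:
        for identifier, sequence in sequences:
            output_file.write(f"{identifier}\t{embed(sequence)}\n")
    finally:
        # Runs on success and on failure; the exception, if any, keeps propagating.
        output_file.close()


def failing_embed(sequence):
    """Toy embedder that simulates running out of memory on long sequences."""
    if len(sequence) > 5:
        raise MemoryError("simulated out-of-memory")
    return len(sequence)


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "embeddings.tsv")
    try:
        embed_and_write([("a", "SEQ"), ("b", "SEQWENCE")], failing_embed, path)
    except MemoryError as error:
        propagated = str(error)
    with open(path) as handle:
        written = handle.read()
```

The first record survives in a properly closed file while the error still reaches the caller. A `with` block gives the same guarantee for any object that supports the context manager protocol.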

Short typo(?)

Dear contributors,

thank you for your library to embed protein sequences. I guess that the following elif should be
elif self._version == 2.

Many thanks!
Damianos

Add resource use information

Breaking down #15

information about CPU/GPU/RAM consumption

Low priority + GPU utilization monitoring is complex in multi-GPU environments.
