sacdallago / bio_embeddings
Get protein embeddings from protein sequences
Home Page: http://docs.bioembeddings.com
License: MIT License
This Colab: https://colab.research.google.com/drive/1msZVwcCT2b768HnbRK3SrnmRqtVjvsgg?usp=sharing
has been set up to use GPU acceleration; however, when embedding, it does so on CPU (a warning shows up at the bottom saying "you are connected to a GPU runtime but aren't using the GPU...", and the speed at which it embeds is, well, not ideal!). This is weird. Could it be #26? @konstin?
Related to Rostlab/SeqVec#10
For LSTM-based models (e.g. SeqVec), the longer a protein sequence is, the more computational resources are needed. Especially for long sequences, resorting to GPU computing might result in RAM exceptions, halting the computation. While there is a mechanism in place to fall back on CPU for longer sequences, based on a maximum count of AA in the sequence or a sequence set, this could still result in an exception (since there's no direct AA-to-RAM conversion formula), which would lead to the pipeline halting at the end of embedding a dataset. This is obviously not ideal! Additionally, for fixed-length models, it is quite useless to go through the embedding phase for any sequence longer than the maximum sequence length the model accepts. Therefore:
Sort the input FASTA in descending order of sequence length (see the sketch after this list). This will force processing of the hard samples first, so if an exception arises, it arises soon after computation starts, rather than when 90% of the data has been processed, maybe after hours/days of computation.
Split the data a priori (using the mapping file) into CPU/GPU-embeddable (e.g. for LSTMs) and non-embeddable/GPU (for fixed-size embedders). This will reduce exceptions and logging, and improve speed.
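A minimal sketch of the sorting step, assuming Biopython is available (file names are placeholders):
from Bio import SeqIO

# Read all records, sort by descending sequence length, write back out,
# so the hardest (longest) samples are embedded first.
records = list(SeqIO.parse("input.fasta", "fasta"))
records.sort(key=lambda record: len(record.seq), reverse=True)
SeqIO.write(records, "sorted.fasta", "fasta")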
Instead of this, I would rather you create another interface (TransformerEmbedderInterface) which extends EmbedderInterface and then defines embeds_batch at the level of the abstract class.
The rationale: I expect TransformerEmbedderInterface to have a set of shared methods which differentiate it slightly from other embedders, but which need "unity". Specifically, the extraction of the attention activations from the embeddings (something you can't do retroactively).
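To illustrate, a minimal sketch of the proposed hierarchy; everything except the names EmbedderInterface, TransformerEmbedderInterface and embeds_batch is an assumption:
from abc import ABC, abstractmethod
from typing import Iterable, List

from numpy import ndarray

class EmbedderInterface(ABC):
    @abstractmethod
    def embed(self, sequence: str) -> ndarray:
        ...

class TransformerEmbedderInterface(EmbedderInterface):
    # Shared batching logic for all transformer embedders lives here, so
    # e.g. attention activations can be extracted at embedding time.
    def embeds_batch(self, batch: Iterable[str]) -> List[ndarray]:
        return [self.embed(sequence) for sequence in batch]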
Once v0.1.4 is out, it makes sense to revamp the webserver and make it 100% pipeline compliant.
MVP:
Good to have:
The only existing file_manager, FileSystemFileManager, doesn't have an __init__ function, as it does not have any parameters, and FileManagerInterface doesn't specify any constructor. Thus, setting any value in management other than the default empty dict raises an exception in file_manager(**management).
bio_embeddings/bio_embeddings/utilities/filemanagers/__init__.py
Lines 10 to 15 in 9a07822
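For illustration, a minimal sketch of the failure mode (the management key is hypothetical): a class without a declared __init__ accepts no keyword arguments.
class FileSystemFileManager:
    pass

management = {"base_path": "/tmp/run"}  # anything other than {} ...
FileSystemFileManager(**management)     # ... raises TypeError: takes no arguments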
Hey,
I'm trying to use XLNetEmbedder with the newest XLNet weights and options files (from another source found in the Dropbox cloud).
from bio_embeddings.embed.xlnet_embedder import XLNetEmbedder
embedder = XLNetEmbedder(
model_directory='models/xlnet/'
)
After that I received the OSError exception:
lib/python3.6/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
171 if swmr and swmr_support:
172 flags |= h5f.ACC_SWMR_READ
--> 173 fid = h5f.open(name, flags, fapl=fapl)
174 elif mode == 'r+':
175 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.open()
OSError: Unable to open file (file signature not found)
Thanks for taking the time to look into this issue.
Extend the current tests (pre-computed embeddings are compared to the output of the pipeline to ensure LM weights are loaded correctly) to also cover feature extractors (e.g. compare pre-computed secondary structure predictions to pipeline output for a few sequences).
Remove the dependency on ruamel_yaml and use pyyaml instead.
You can check out /mnt/project/bio_embeddings/runs/cath on rostssh.
stderr.log there will tell you:
Traceback (most recent call last):
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/bin/bio_embeddings", line 8, in <module>
sys.exit(main())
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/cli.py", line 22, in main
run(arguments.config_path[0], overwrite=arguments.overwrite)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 166, in run
stage_output_parameters = stage_runnable(**stage_parameters)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 276, in run
return PROTOCOLS[kwargs["protocol"]](**kwargs)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 196, in seqvec
return embed_and_write_batched(embedder, file_manager, result_kwargs)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 166, in embed_and_write_batched
reduced_embeddings_file.create_dataset(
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/h5py/_hl/group.py", line 139, in create_dataset
self[name] = dset
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/h5py/_hl/group.py", line 370, in __setitem__
name, lcpl = self._e(name, lcpl=True)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/h5py/_hl/base.py", line 137, in _e
name = name.encode('ascii')
AttributeError: 'int' object has no attribute 'encode'
You probably have to cast the int to str when saving the embedding. I remember I fixed this in the past, but probably due to significant overwrites, it got lost...
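A minimal sketch of the fix, with hypothetical variable names; h5py dataset names must be strings, so the integer id gets cast before writing:
import h5py
import numpy as np

with h5py.File("reduced_embeddings_file.h5", "w") as handle:
    sequence_id, embedding = 0, np.zeros(1024)
    handle.create_dataset(str(sequence_id), data=embedding)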
Please, once you've fixed this issue, can you try re-running the job?
Instructions in: /mnt/project/bio_embeddings/README
I once had a situation in which the pipeline was in a "visualize" stage, but the GPU was still occupied by the embedder (SeqVec).
I had assumed that the embedder is destroyed after the embed stage (the stages are written in a way which should make Python's automatic garbage collection easy). But apparently I was wrong.
Maybe it makes sense to explicitly del embedder at the end of the embed stage.
It's worth looking into this. The visualize stage is sometimes slow (it can take up to 2 days for big plots)... Occupying GPU resources for no good reason is a waste in those cases. In the future (e.g. with extract), GPU RAM will be needed.
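A minimal sketch of the explicit cleanup, assuming embedder holds the last reference to the model; forcing collection and emptying torch's cache on top of del is my addition:
import gc
import torch

# At the end of the embed stage:
del embedder                  # drop the last reference to the model
gc.collect()                  # make sure it is actually collected
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # return cached GPU memory to the driver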
For the general-purpose embedder classes (e.g. this one): instead of storing the downloaded files in tmp (which means re-downloading the file at every re-run), store the file in a cache.
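A hypothetical caching scheme (cache location and helper name are assumptions): keep downloaded weights under ~/.cache instead of a temporary directory, so re-runs skip the download.
from pathlib import Path
from urllib.request import urlretrieve

def cached_download(url: str, filename: str) -> Path:
    cache_dir = Path.home() / ".cache" / "bio_embeddings"
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename
    if not target.exists():
        urlretrieve(url, str(target))  # only download on a cache miss
    return target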
I'm trying to use the embedders within a Python script, but they are not loading. I've tried just the example code from the README, and this is what I get:
>>> from bio_embeddings.embed import SeqVecEmbedder
>>> embedder = SeqVecEmbedder()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/bio_embeddings/embed/seqvec_embedder.py", line 48, in __init__
self._weights_file = self._options["weights_file"]
KeyError: 'weights_file'
I also tried BertEmbedder, but I get a different error:
>>> from bio_embeddings.embed import *
>>> embedder = BertEmbedder()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/bio_embeddings/embed/bert_embedder.py", line 29, in __init__
self.model = BertModel.from_pretrained(self._model_directory)
File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/transformers/modeling_utils.py", line 587, in from_pretrained
**kwargs,
File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/transformers/configuration_utils.py", line 201, in from_pretrained
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/transformers/configuration_utils.py", line 224, in get_config_dict
if os.path.isdir(pretrained_model_name_or_path):
File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/genericpath.py", line 42, in isdir
st = os.stat(s)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
I installed the repo via GitHub in a fresh miniconda environment two days ago.
Allow an additional pipeline step that takes an already-embedded dataset and some annotation of it (basically what visualize does, but on raw embeddings) and uses some metric to annotate the current embeddings using the "background" information.
Use case: I have a large set of antibodies, some with a desired property. I have a new set of unannotated antibodies and want to figure out which one in this set is the closest to the ones in the original set with the desired property.
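A minimal sketch of such a step, assuming per-protein embeddings stored as numpy arrays; scikit-learn's NearestNeighbors is my choice here, not necessarily the pipeline's:
import numpy as np
from sklearn.neighbors import NearestNeighbors

background = np.load("annotated_embeddings.npy")  # shape: (n_annotated, d)
queries = np.load("unannotated_embeddings.npy")   # shape: (n_queries, d)

knn = NearestNeighbors(n_neighbors=1).fit(background)
distances, indices = knn.kneighbors(queries)
# indices[i, 0] is the annotated protein closest to query i;
# transfer its annotation to the query.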
In order to address the problem of RAW embeddings occupying too much storage in the case of big transformers (e.g. bert), and to have the least amount of "default destruction" imposed by the pipeline: allow users to define an "on the fly layer reducer" function, aka a lambda function, via the config.
Cons: lambdas cannot be expressed in YAML, so this needs to be a string (which then gets eval'ed in Python). This will inevitably lead to some problems for less expert users, but that's something I'm willing to accept IFF there are enough examples provided (aka: I have to provide enough examples!). It may also pose a security threat IFF this configuration option ever gets exposed on a webserver.
The string can be parsed using something like:
layer_reducer_function = eval(kwargs.get('layer_reducer_function', "lambda x: x"))
The challenge is making the name significant and non-redundant with reduce, which is instead used for the concept of reducing variable-length embeddings to a fixed size (aka: per-AA to per-sequence).
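For illustration, a hypothetical reducer string a user could put in the config, applied to a SeqVec-shaped (layers, length, dimensions) array:
import numpy as np

layer_reducer_function = eval("lambda x: x[-1]")  # keep only the last LM layer
raw_embedding = np.zeros((3, 100, 1024))          # e.g. SeqVec: 3 layers
reduced = layer_reducer_function(raw_embedding)   # shape: (100, 1024)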
Different embedders may be trained on different alphabets, which may differ from those provided by the user. The transformer, for instance, does re.sub(r"[UZOB]", "X", sequence), while TAPE uses an iupac vocabulary. TAPE has the following code, i.e. it is likely good inspiration for what we want to do:
class TAPETokenizer():
r"""TAPE Tokenizer. Can use different vocabs depending on the model.
"""
def __init__(self, vocab: str = 'iupac'):
if vocab == 'iupac':
self.vocab = IUPAC_VOCAB
elif vocab == 'unirep':
self.vocab = UNIREP_VOCAB
self.tokens = list(self.vocab.keys())
self._vocab_type = vocab
assert self.start_token in self.vocab and self.stop_token in self.vocab
We should ensure that all embedders work with whatever alphabet the input FASTA uses, likely by doing something similar to the transformer (see the sketch below), and test that behaviour.
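A minimal sketch of transformer-style input sanitization (the function name is hypothetical): map non-standard amino acids to X before tokenization.
import re

def sanitize(sequence: str) -> str:
    # U, Z, O and B are replaced by the unknown amino acid X.
    return re.sub(r"[UZOB]", "X", sequence.upper())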
EDIT: Current state on the non-standard amino acids:
@mheinzinger proposes that instead of tqdm counting proteins, it counts amino acids while embedding (definitely a better measure!).
I think: great idea, but VERY low priority. Especially if this requires a lot of coding.
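A minimal sketch of the proposal, with a hypothetical embedder and list of sequences: drive the progress bar by amino acids rather than by proteins.
from tqdm import tqdm

with tqdm(total=sum(len(s) for s in sequences), unit="AA") as progress:
    for sequence in sequences:
        embedder.embed(sequence)
        progress.update(len(sequence))  # advance by the protein's length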
https://www.biorxiv.org/content/10.1101/2020.08.07.242347v1?rss=1
Don't know if it would fit, but it's certainly interesting. I'll give it thought after my holidays are over.
Write a file at the end of the execution (maybe also at the beginning) with information about CPU/GPU/RAM consumption (optional) and creation + termination time. The idea came from a PP call...
When running a pipeline a second time, I get a warning even though I'm using --overwrite:
WARNING: Failed to create stage directory my_develop/stage_1.
Depending on the desired behaviour, this can e.g. be solved by replacing os.mkdir(path) with os.makedirs(path, exist_ok=True), or by calling shutil.rmtree beforehand if the directory exists.
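A minimal sketch of the second option, assuming a path and an overwrite flag: wipe a pre-existing stage directory before recreating it.
import os
import shutil

if overwrite and os.path.isdir(path):
    shutil.rmtree(path)               # remove stale stage output
os.makedirs(path, exist_ok=True)      # never fails on an existing directory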
I could not find any documentation on which stages need per-residue and which need per-protein embeddings, and thereby whether it's possible to use reduce: true with these stages or not.
Dear contributors,
thank you for your library for embedding protein sequences. I guess that the following elif should be elif self._version == 2.
Many thanks!
Damianos
@konstin & @t03i you might have more experience with this: I would like to make a very basic google collab (a notebook using Google's hardware: why not; free TPU/GPU) to put in the Readme (basically, the same as https://github.com/sacdallago/bio_embeddings/blob/master/notebooks/embed_fasta_sequences.ipynb , but including the pip install
directive, and storing of the embeddings to a file).
Unfortunately, I can't seem to be able to install bio_embeddings. I tried via pip install bio_embeddings
and directly from git (see collab linked below). I have a hunch this might be because currently the requirement is python > 3.7, but collab is on 3.6.9 . Do you see any reason why we cannot support 3.6.9 ? Can one of you maybe give it a try?
Here's the link to the collab that I started: https://colab.research.google.com/drive/1h5izTF07GjHMkekmGNUj32Sbb1gccJxd?usp=sharing
@konstin proposed some new fancy thing that allows having different levels of dependencies (like modules in npm) in the same pip-installable package. This will be particularly useful once we have more LMs (and even now, with the gigantic allennlp dependency + the transformers dependency).
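In pip terms these are "extras"; a minimal setup.py sketch with hypothetical group names:
from setuptools import setup

setup(
    name="bio_embeddings",
    extras_require={
        "seqvec": ["allennlp"],               # only for SeqVec users
        "transformers": ["transformers"],     # only for transformer LMs
        "all": ["allennlp", "transformers"],  # everything
    },
)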
Pretty much the title. This may just be a design decision, but currently it is necessary to load both the secondary structure and the subcellular location checkpoint files for Bert.
https://www.biorxiv.org/content/10.1101/589333v1
Notice that UniRep has been re-engineered. Look for "jax-unirep":
From an ISMB comment:
It might be of interest to check out jax-unirep, which is a re-implementation of the model that is much easier to work with than the tensorflow model.
Currently, using from bio_embeddings.embed import ... will lead to import errors when not using the all extra. This should be fixed by using try-except-ImportError blocks, and should be documented.
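A minimal sketch of the proposed guard; treating a missing optional dependency as "embedder unavailable" is my assumption, not documented behaviour:
try:
    from bio_embeddings.embed.seqvec_embedder import SeqVecEmbedder
except ImportError:
    SeqVecEmbedder = None  # the extra providing allennlp is not installed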
In order to address the problem of evaluating LMs "seamlessly", there should be a way of hooking up TAPE: https://github.com/songlab-cal/tape/
Reading through current develop, things that I noticed:
(for general-purpose users) we have to decide if we want to make it from bio_embeddings import SeqVecEmbedder or from bio_embeddings.embed import SeqVecEmbedder; right now it's inconsistent: https://github.com/sacdallago/bio_embeddings/blob/14f1de5754221452c27d2e2c5420f191bb2ecc00/bio_embeddings/__init__.py
Up to you @konstin
Once that has been decided, all notebooks in examples need to be revised and updated!
Speaking of notebooks, this one serves as an example of what to expect when the model files are not provided (in the case of SeqVec). After the introduction of your with_download method, I think this will need re-writing.
Finally: once you are done with decisions and improvements, please make sure all relevant notebooks run and are up to date. E.g.: this one still uses the "constrained albert" (see warning message), but that should not happen anymore ;). Not relevant notebooks: project_visualize_custom_embedings and project_visualize_pipeline_embeddings
Hey,
I'm trying to use SeqVecEmbedder with the newest SeqVec v2 weights and options files.
from bio_embeddings import SeqVecEmbedder
embedder = SeqVecEmbedder(
weights_file='models/seqvec2/weights.hdf5',
options_file='models/seqvec2/options.json'
)
After that I received the KeyError exception:
/lib/python3.6/site-packages/h5py/_hl/group.py in __getitem__(self, name)
262 raise ValueError("Invalid HDF5 object reference")
263 else:
--> 264 oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
265
266 otype = h5i.get_type(oid)
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5o.pyx in h5py.h5o.open()
KeyError: "Unable to open object (object 'char_embed' doesn't exist)"
I discovered a similar problem in the AllenNLP issues [1] and [2].
Thanks for taking the time to look into this issue.
@konstin you were completely right: logging with timestamps is much better than just logging. I now have a pipeline run that has been going for at least 12h, for which generating a plot is taking ages (this is expected). Nevertheless, it would be nice to know when it started :)
For now, I only see when UMAP finished (which is still a good indicator, but, yeah... :D)
Sat Jul 11 02:14:16 2020 Finished Nearest Neighbor Search
Sat Jul 11 02:14:25 2020 Construct embedding
completed 0 / 200 epochs
completed 20 / 200 epochs
completed 40 / 200 epochs
completed 60 / 200 epochs
completed 80 / 200 epochs
completed 100 / 200 epochs
completed 120 / 200 epochs
completed 140 / 200 epochs
completed 160 / 200 epochs
completed 180 / 200 epochs
Sat Jul 11 02:33:37 2020 Finished embedding
INFO Created the file go_embeddings/umap_projection/projected_embeddings_file.csv
INFO Created the file go_embeddings/umap_projection/ouput_parameters_file.yml
INFO Created the stage directory go_embeddings/plotly_2D
INFO Created the file go_embeddings/plotly_2D/input_parameters_file.yml
INFO Created the file go_embeddings/plotly_2D/input_annotation_file.csv
INFO Created the file go_embeddings/plotly_2D/merged_annotation_file.csv
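A minimal sketch of timestamped logging via the standard library, so every pipeline message carries a time like the UMAP lines above:
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s",
    level=logging.INFO,
)
logging.info("Created the stage directory go_embeddings/plotly_2D")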
Keeping this vague :P but, what @mariasche and I discussed today.
Notes here:
As of now, this seems to be the only major "lacking" thing.
As mentioned here, there seems to be a problem with batching in Bert.
The set I've used for the tests is viruses_90 from the test sets on oculus (in case you want to reproduce the run :) ).
I think that enhancing with the SeqVec "single sequence --> CPU --> fail" strategy will solve this issue, too.
In develop and on Google Colab, the BasicAnnotationExtractor class is not loading weights correctly, or maybe it's that the uploaded weights are not correct.
RuntimeError Traceback (most recent call last)
in ()
19 annotations_extractor = BasicAnnotationExtractor(
20 secondary_structure_checkpoint_file="secstruct_checkpoint",
---> 21 subcellular_location_checkpoint_file="subcell_checkpoint"
22 )
1 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
845 if len(error_msgs) > 0:
846 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 847 self.__class__.__name__, "\n\t".join(error_msgs)))
848 return _IncompatibleKeys(missing_keys, unexpected_keys)
849
RuntimeError: Error(s) in loading state_dict for SUBCELL_FNN:
Missing key(s) in state_dict: "layer.3.weight", "layer.3.bias", "layer.3.running_mean", "layer.3.running_var".
Right now, embeddings are written iteratively to an open embeddings file (either whole or reduced). If the embeddings step fails (e.g. by running out of memory), the files will be abruptly closed (without close()), resulting in corrupted files.
Ideally, we want to close the files correctly no matter what. This might require a bigger re-engineering of the FileSystem file handler, which would hand out read/write pointers to files instead of paths, and on sigterm or the like try to close all handles.
The quick solution for now would be encapsulating the embedding calls in try/finally clauses where the error isn't caught (aka: allowed to propagate), but the files do get closed whether successful or not.
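A minimal sketch of the quick solution, with a hypothetical embed_all generator: the error still propagates, but the file is always closed.
import h5py

embeddings_file = h5py.File("reduced_embeddings_file.h5", "w")
try:
    for identifier, embedding in embed_all(sequences):
        embeddings_file.create_dataset(str(identifier), data=embedding)
finally:
    embeddings_file.close()  # runs even on e.g. a CUDA out-of-memory error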
At the beginning of any embed pipeline run, calculate (by means of the mapping file) the expected size of the embedding file and the reduced embedding file.
As formula:
per_amino_acid_size_in_bits = (embedding_dimension * layers) * 32
per_protein_size_in_bits = embedding_dimension * 32
total_number_of_proteins = len(mapping_file)
total_aa = mapping_file.sequence_length.sum()
embeddings_file_size = per_amino_acid_size_in_bits * total_aa
reduced_embeddings_file_size = per_protein_size_in_bits * total_number_of_proteins
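A worked example with SeqVec-like numbers (1024 dimensions, 3 layers); the protein and AA counts are illustrative:
embedding_dimension, layers = 1024, 3
total_number_of_proteins, total_aa = 10_000, 3_500_000

per_amino_acid_size_in_bits = embedding_dimension * layers * 32
per_protein_size_in_bits = embedding_dimension * 32

# Convert bits to GiB: divide by 8 (bytes), then by 1024**3.
embeddings_file_size = per_amino_acid_size_in_bits * total_aa / 8 / 1024**3
reduced_embeddings_file_size = (
    per_protein_size_in_bits * total_number_of_proteins / 8 / 1024**3
)
print(f"{embeddings_file_size:.1f} GiB raw, {reduced_embeddings_file_size:.3f} GiB reduced")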
Hey,
I discovered that the bio-embeddings installation from PyPI with Python 3.6+ doesn't work anymore.
But it's working fine directly from the GitHub repository.
pip install bio-embeddings[all]
ERROR: Could not find a version that satisfies the requirement bio-embeddings[all] (from versions: none)
ERROR: No matching distribution found for bio-embeddings[all]
or
ERROR: Could not find a version that satisfies the requirement bio-embeddings==0.1.3 (from versions: none)
I tried with Python 3.6.8 on Mac OS and with Python 3.6.9 on Ubuntu. The same exception was raised.
I think it is related to the setup.py configuration:
'python_requires': '>=3.7,<4.0'
Could you tell us why Python 3.6 is excluded? Thanks!
Good day,
Quick question: is there any way to set the number of threads in the bio_embeddings pipeline, either from within the config file or from the command line?
Thank you.
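I'm not aware of a dedicated pipeline option; a possible workaround (an assumption, not documented behaviour) is capping torch's thread pool before invoking the pipeline, or exporting OMP_NUM_THREADS in the shell:
import torch

torch.set_num_threads(4)  # limit intra-op parallelism to 4 CPU threads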
Breaking down #15
information about CPU/GPU/RAM consumption
Low priority + GPU utilization monitoring is complex in multi-GPU environments.
Currently, in one form or another, in various parts of the pipeline, this is used:
"cuda:0" if torch.cuda.is_available() and not self._use_cpu else "cpu"
The problem here is that cuda:0 will always refer to the 0th card. In systems hosting multiple cards, this will be painful. Workarounds are:
NOT defining the device number, aka just cuda (see: https://pytorch.org/docs/stable/tensor_attributes.html#torch.torch.device), or
allowing the device number to be passed via a parameter (see the sketch below), e.g.:
"cuda:{cuda_device}" if torch.cuda.is_available() and not self._use_cpu else "cpu"
where cuda_device is an integer that defaults to 0.
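A minimal sketch of the second workaround, with a hypothetical cuda_device parameter:
import torch

def get_device(use_cpu: bool = False, cuda_device: int = 0) -> torch.device:
    if torch.cuda.is_available() and not use_cpu:
        return torch.device(f"cuda:{cuda_device}")  # pick a specific card
    return torch.device("cpu")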
Examples where it's used:
Currently, if CUDA runs out of memory, the samples won't be processed and #11 happens. In addition to fixing that for unknown problems, collect sequences that spawn known problems for processing on CPU. If even that's not possible, then throw an error!
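A minimal sketch of the proposed fallback, with hypothetical embedder objects: retry a known-problematic sequence on CPU, and raise only if that also fails.
import torch

try:
    embedding = gpu_embedder.embed(sequence)
except RuntimeError as error:
    if "out of memory" not in str(error):
        raise                          # unrelated error: propagate
    torch.cuda.empty_cache()           # free what the failed attempt cached
    embedding = cpu_embedder.embed(sequence)  # may still raise, as proposed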
An easy improvement when storing embeddings_file and reduced_embeddings_file: compression is supported out of the box and may impact speed (but that's acceptable).
https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline
Also, while at it: double-check that the stored datasets use the most fitting dtype.
P.S.: preference for gzip
P.P.S.: it would be nice to run this as a test to see "how much it buys". Easy test: take an h5 file and copy all datasets into a new h5 file applying compression. Then we see if this is useful...
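A minimal sketch of that test: copy every dataset of an existing h5 file into a new one with gzip compression, then compare file sizes (file names are placeholders).
import h5py

with h5py.File("embeddings.h5", "r") as source, \
     h5py.File("embeddings_gzip.h5", "w") as target:
    for name, dataset in source.items():
        target.create_dataset(name, data=dataset[()], compression="gzip")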
I'm hitting the MD5Clash exception even though I have only unique sequences in my dataset. I edited the code to remove the exception, but there should be an option to ignore this.
This most likely indicates there are multiple identical sequences in your FASTA file.
MD5 hashes are used to remap sequence identifiers from the input FASTA.
This error exists to prevent wasting resources (computing the same embedding twice).
There's a (very) low probability of this indicating a real MD5 clash.
If you are sure there are no identical sequences in your set, please open an issue at https://github.com/sacdallago/bio_embeddings/issues . Otherwise, use cd-hit to reduce your input FASTA to exclude identical sequences!
As per the title.
Important: use of volumes / output folder / prefix!