sacdallago / bio_embeddings
Get protein embeddings from protein sequences
Home Page: http://docs.bioembeddings.com
License: MIT License
This Colab: https://colab.research.google.com/drive/1msZVwcCT2b768HnbRK3SrnmRqtVjvsgg?usp=sharing
has been set up to use GPU acceleration; however, when embedding, it does so on CPU (a warning shows up at the bottom saying "you are connected to a GPU runtime but aren't using the GPU...", and the speed at which it embeds is, well, not ideal!). This is weird. Could it be #26? @konstin?
Related to Rostlab/SeqVec#10
For LSTM-based models (e.g. SeqVec), the longer a protein sequence is, the more computational resources are needed. Especially for long sequences, resorting to GPU computing might result in RAM exceptions, halting the computation. While there is a mechanism in place to fall back on CPU for longer sequences, based on a maximum count of AA in the sequence or a sequence set, this could still result in an exception (since there's no direct AA-to-RAM conversion formula), which would lead to the pipeline halting at the end of embedding a dataset. This is obviously not ideal! Additionally, for fixed-length models, it is quite useless to go through the embedding phase for any sequence longer than the maximum sequence length the model accepts. Therefore:
Sort the input FASTA in descending order of sequence length (see the sketch after this list). This will force processing of the hard samples first, so if an exception arises, it arises soon after computation starts, rather than when 90% of the data has been processed, maybe after hours/days of computation.
Split the data a priori (using the mapping file) into CPU/GPU-embeddable (e.g. for LSTMs) and non-embeddable/GPU (for fixed-size embedders). This will reduce exceptions and logging, and improve speed.
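A minimal sketch of the sorting step, assuming Biopython is available (file names are placeholders):
from Bio import SeqIO

# Read all records, sort by descending sequence length, write back out,
# so the hardest (longest) samples are embedded first.
records = list(SeqIO.parse("input.fasta", "fasta"))
records.sort(key=lambda record: len(record.seq), reverse=True)
SeqIO.write(records, "sorted.fasta", "fasta")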
Instead of this, I would rather you create another interface (TransformerEmbedderInterface) which extends EmbedderInterface and then defines embeds_batch at the level of the abstract class.
The rationale: I expect TransformerEmbedderInterface to have a set of shared methods which differentiate it slightly from other embedders, but which need "unity". Specifically, the extraction of the attention activations from the embeddings (something you can't do retroactively).
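To illustrate, a minimal sketch of the proposed hierarchy; everything except the names EmbedderInterface, TransformerEmbedderInterface and embeds_batch is an assumption:
from abc import ABC, abstractmethod
from typing import Iterable, List

from numpy import ndarray

class EmbedderInterface(ABC):
    @abstractmethod
    def embed(self, sequence: str) -> ndarray:
        ...

class TransformerEmbedderInterface(EmbedderInterface):
    # Shared batching logic for all transformer embedders lives here, so
    # e.g. attention activations can be extracted at embedding time.
    def embeds_batch(self, batch: Iterable[str]) -> List[ndarray]:
        return [self.embed(sequence) for sequence in batch]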
Once v0.1.4 is out, it makes sense to revamp the webserver and make it 100% pipeline compliant.
MVP:
Good to have:
The only existing file_manager, FileSystemFileManager, doesn't have an __init__ function, as it does not have any parameters, and FileManagerInterface doesn't specify any constructor. Thus, setting any value in management other than the default empty dict raises an exception in file_manager(**management).
bio_embeddings/bio_embeddings/utilities/filemanagers/__init__.py
Lines 10 to 15 in 9a07822
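For illustration, a minimal sketch of the failure mode (the management key is hypothetical): a class without a declared __init__ accepts no keyword arguments.
class FileSystemFileManager:
    pass

management = {"base_path": "/tmp/run"}  # anything other than {} ...
FileSystemFileManager(**management)     # ... raises TypeError: takes no arguments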
Hey,
I'm trying to use XLNetEmbedder with the newest XLNet weights and options files (from another source found in the Dropbox cloud).
from bio_embeddings.embed.xlnet_embedder import XLNetEmbedder
embedder = XLNetEmbedder(
model_directory='models/xlnet/'
)
After that I received the OSError exception:
lib/python3.6/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
171 if swmr and swmr_support:
172 flags |= h5f.ACC_SWMR_READ
--> 173 fid = h5f.open(name, flags, fapl=fapl)
174 elif mode == 'r+':
175 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.open()
OSError: Unable to open file (file signature not found)
Thanks for taking the time to look into this issue.
Extend the current tests (pre-computed embeddings are compared to the output of the pipeline to ensure LM weights are loaded correctly) to also cover feature extractors (e.g. compare pre-computed secondary structure predictions to pipeline output for a few sequences).
Remove the dependency on ruamel_yaml and use pyyaml instead.
You can check out /mnt/project/bio_embeddings/runs/cath on rostssh.
stderr.log there will tell you:
Traceback (most recent call last):
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/bin/bio_embeddings", line 8, in <module>
sys.exit(main())
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/cli.py", line 22, in main
run(arguments.config_path[0], overwrite=arguments.overwrite)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 166, in run
stage_output_parameters = stage_runnable(**stage_parameters)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 276, in run
return PROTOCOLS[kwargs["protocol"]](**kwargs)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 196, in seqvec
return embed_and_write_batched(embedder, file_manager, result_kwargs)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 166, in embed_and_write_batched
reduced_embeddings_file.create_dataset(
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/h5py/_hl/group.py", line 139, in create_dataset
self[name] = dset
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/h5py/_hl/group.py", line 370, in __setitem__
name, lcpl = self._e(name, lcpl=True)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/h5py/_hl/base.py", line 137, in _e
name = name.encode('ascii')
AttributeError: 'int' object has no attribute 'encode'
You probably have to cast the int to str when saving the embedding. I remember I fixed this in the past, but probably due to significant overwrites, it got lost...
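A minimal sketch of the fix, with hypothetical variable names; h5py dataset names must be strings, so the integer id gets cast before writing:
import h5py
import numpy as np

with h5py.File("reduced_embeddings_file.h5", "w") as handle:
    sequence_id, embedding = 0, np.zeros(1024)
    handle.create_dataset(str(sequence_id), data=embedding)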
Please, once you've fixed this issue, can you try re-running the job?
Instructions in: /mnt/project/bio_embeddings/README
I once had a situation in which the pipeline was in a "visualize" stage, but the GPU was still occupied by the embedder (SeqVec).
I had assumed that the embedder is destroyed after the embed stage (the stages are written in a way which should make Python's automatic garbage collection easy). But apparently I was wrong.
Maybe it makes sense to explicitly del embedder at the end of the embed stage.
It's worth looking into this. The visualize stage is sometimes slow (it can take up to 2 days for big plots)... Occupying GPU resources for no good reason is a waste in those cases. In the future (e.g. with extract), GPU RAM will be needed.
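A minimal sketch of the explicit cleanup, assuming embedder holds the last reference to the model; forcing collection and emptying torch's cache on top of del is my addition:
import gc
import torch

# At the end of the embed stage:
del embedder                  # drop the last reference to the model
gc.collect()                  # make sure it is actually collected
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # return cached GPU memory to the driver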
For the general-purpose embedder classes (e.g. this one): instead of storing the downloaded files in tmp (which means re-downloading the file at every re-run), store the file in a cache.
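A hypothetical caching scheme (cache location and helper name are assumptions): keep downloaded weights under ~/.cache instead of a temporary directory, so re-runs skip the download.
from pathlib import Path
from urllib.request import urlretrieve

def cached_download(url: str, filename: str) -> Path:
    cache_dir = Path.home() / ".cache" / "bio_embeddings"
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename
    if not target.exists():
        urlretrieve(url, str(target))  # only download on a cache miss
    return target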
I'm trying to use the embedders within a Python script, but they are not loading. I've tried just the example code from the README, and this is what I get:
>>> from bio_embeddings.embed import SeqVecEmbedder
>>> embedder = SeqVecEmbedder()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/bio_embeddings/embed/seqvec_embedder.py", line 48, in __init__
self._weights_file = self._options["weights_file"]
KeyError: 'weights_file'
I also tried BertEmbedder, but I get a different error:
>>> from bio_embeddings.embed import *
>>> embedder = BertEmbedder()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/bio_embeddings/embed/bert_embedder.py", line 29, in __init__
self.model = BertModel.from_pretrained(self._model_directory)
File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/transformers/modeling_utils.py", line 587, in from_pretrained
**kwargs,
File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/transformers/configuration_utils.py", line 201, in from_pretrained
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/site-packages/transformers/configuration_utils.py", line 224, in get_config_dict
if os.path.isdir(pretrained_model_name_or_path):
File "/share/PI/rbaltman/gmcinnes/bin/miniconda3/envs/bioembeddings/lib/python3.6/genericpath.py", line 42, in isdir
st = os.stat(s)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
I installed the repo via GitHub in a fresh miniconda environment two days ago.
Allow an additional pipeline step that takes an already-embedded dataset and some annotation of it (basically what visualize does, but on raw embeddings) and uses some metric to annotate the current embeddings using the "background" information.
Use case: I have a large set of antibodies, some with a desired property. I have a new set of unannotated antibodies and want to figure out which one in this set is the closest to the ones in the original set with the desired property.
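A minimal sketch of such a step, assuming per-protein embeddings stored as numpy arrays; scikit-learn's NearestNeighbors is my choice here, not necessarily the pipeline's:
import numpy as np
from sklearn.neighbors import NearestNeighbors

background = np.load("annotated_embeddings.npy")  # shape: (n_annotated, d)
queries = np.load("unannotated_embeddings.npy")   # shape: (n_queries, d)

knn = NearestNeighbors(n_neighbors=1).fit(background)
distances, indices = knn.kneighbors(queries)
# indices[i, 0] is the annotated protein closest to query i;
# transfer its annotation to the query.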
In order to address the problem of RAW embeddings occupying too much storage in the case of big transformers (e.g. bert), and to have the least amount of "default destruction" imposed by the pipeline: allow users to define an "on the fly layer reducer" function, aka a lambda function, via the config.
Cons: lambdas cannot be expressed in YAML, so this needs to be a string (which then gets eval'ed in Python). This will inevitably lead to some problems for less expert users, but that's something I'm willing to accept IFF there are enough examples provided (aka: I have to provide enough examples!). It may also pose a security threat IFF this configuration option ever gets exposed on a webserver.
The string can be parsed using something like:
layer_reducer_function = eval(kwargs.get('layer_reducer_function', "lambda x: x"))
The challenge is making the name significant and non-redundant with reduce, which is instead used for the concept of reducing variable-length embeddings to a fixed size (aka: per-AA to per-sequence).
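For illustration, a hypothetical reducer string a user could put in the config, applied to a SeqVec-shaped (layers, length, dimensions) array:
import numpy as np

layer_reducer_function = eval("lambda x: x[-1]")  # keep only the last LM layer
raw_embedding = np.zeros((3, 100, 1024))          # e.g. SeqVec: 3 layers
reduced = layer_reducer_function(raw_embedding)   # shape: (100, 1024)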
Different embedders may be trained on different alphabets, which may differ from those provided by the user. The transformer, for instance, does re.sub(r"[UZOB]", "X", sequence), while TAPE uses an iupac vocabulary. TAPE has the following code, i.e. it is likely good inspiration for what we want to do:
class TAPETokenizer():
r"""TAPE Tokenizer. Can use different vocabs depending on the model.
"""
def __init__(self, vocab: str = 'iupac'):
if vocab == 'iupac':
self.vocab = IUPAC_VOCAB
elif vocab == 'unirep':
self.vocab = UNIREP_VOCAB
self.tokens = list(self.vocab.keys())
self._vocab_type = vocab
assert self.start_token in self.vocab and self.stop_token in self.vocab
We should ensure that all embedders work with whatever alphabet the input FASTA uses, likely by doing something similar to the transformer (see the sketch below), and test that behaviour.
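A minimal sketch of transformer-style input sanitization (the function name is hypothetical): map non-standard amino acids to X before tokenization.
import re

def sanitize(sequence: str) -> str:
    # U, Z, O and B are replaced by the unknown amino acid X.
    return re.sub(r"[UZOB]", "X", sequence.upper())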
EDIT: Current state on the non-standard amino acids:
@mheinzinger proposes that instead of tqdm counting proteins, it counts amino acids while embedding (definitely a better measure!).
I think: great idea, but VERY low priority. Especially if this requires a lot of coding.
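A minimal sketch of the proposal, with a hypothetical embedder and list of sequences: drive the progress bar by amino acids rather than by proteins.
from tqdm import tqdm

with tqdm(total=sum(len(s) for s in sequences), unit="AA") as progress:
    for sequence in sequences:
        embedder.embed(sequence)
        progress.update(len(sequence))  # advance by the protein's length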
https://www.biorxiv.org/content/10.1101/2020.08.07.242347v1?rss=1
Don't know if it would fit, but it's certainly interesting. I'll give it thought after my holidays are over.
Write a file at the end of the execution (maybe also at the beginning) with information about CPU/GPU/RAM consumption (optional) and creation + termination time. The idea came from a PP call...
When running a pipeline a second time, I get a warning even though I'm using --overwrite:
WARNING: Failed to create stage directory my_develop/stage_1.
Depending on the desired behaviour, this can e.g. be solved by replacing os.mkdir(path) with os.makedirs(path, exist_ok=True), or by calling shutil.rmtree beforehand if the directory exists.
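A minimal sketch of the second option, assuming a path and an overwrite flag: wipe a pre-existing stage directory before recreating it.
import os
import shutil

if overwrite and os.path.isdir(path):
    shutil.rmtree(path)               # remove stale stage output
os.makedirs(path, exist_ok=True)      # never fails on an existing directory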
I could not find any documentation on which stages need per-residue and which need per-protein embeddings, and thereby whether it's possible to use reduce: true with these stages or not.
Dear contributors,
thank you for your library for embedding protein sequences. I guess that the following elif should be elif self._version == 2.
Many thanks!
Damianos
@konstin & @t03i you might have more experience with this: I would like to make a very basic google collab (a notebook using Google's hardware: why not; free TPU/GPU) to put in the Readme (basically, the same as https://github.com/sacdallago/bio_embeddings/blob/master/notebooks/embed_fasta_sequences.ipynb , but including the pip install
directive, and storing of the embeddings to a file).
Unfortunately, I can't seem to be able to install bio_embeddings. I tried via pip install bio_embeddings
and directly from git (see collab linked below). I have a hunch this might be because currently the requirement is python > 3.7, but collab is on 3.6.9 . Do you see any reason why we cannot support 3.6.9 ? Can one of you maybe give it a try?
Here's the link to the collab that I started: https://colab.research.google.com/drive/1h5izTF07GjHMkekmGNUj32Sbb1gccJxd?usp=sharing
@konstin proposed some new fancy thing that allows having different levels of dependencies (like modules in npm) in the same pip-installable package. This will be particularly useful once we have more LMs (and even now, with the gigantic allennlp dependency + the transformers dependency).
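In pip terms these are "extras"; a minimal setup.py sketch with hypothetical group names:
from setuptools import setup

setup(
    name="bio_embeddings",
    extras_require={
        "seqvec": ["allennlp"],               # only for SeqVec users
        "transformers": ["transformers"],     # only for transformer LMs
        "all": ["allennlp", "transformers"],  # everything
    },
)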
Pretty much the title. This may just be a design decision, but currently it is necessary to load both the secondary structure and the subcellular location checkpoint files for Bert.
https://www.biorxiv.org/content/10.1101/589333v1
Notice that UniRep has been re-engineered. Look for "jax-unirep":
From an ISMB comment:
It might be of interest to check out jax-unirep, which is a re-implementation of the model that is much easier to work with than the tensorflow model.
Currently, using from bio_embeddings.embed import ... will lead to import errors when not using the all extra. This should be fixed by using try-except-ImportError blocks, and should be documented.
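A minimal sketch of the proposed guard; treating a missing optional dependency as "embedder unavailable" is my assumption, not documented behaviour:
try:
    from bio_embeddings.embed.seqvec_embedder import SeqVecEmbedder
except ImportError:
    SeqVecEmbedder = None  # the extra providing allennlp is not installed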
In order to address the problem of evaluating LMs "seamlessly", there should be a way of hooking up TAPE: https://github.com/songlab-cal/tape/
Reading through current develop, things that I noticed:
(for general-purpose users) we have to decide if we want to make it from bio_embeddings import SeqVecEmbedder or from bio_embeddings.embed import SeqVecEmbedder; right now it's inconsistent: https://github.com/sacdallago/bio_embeddings/blob/14f1de5754221452c27d2e2c5420f191bb2ecc00/bio_embeddings/__init__.py
Up to you @konstin
Once that has been decided, all notebooks in examples need to be revised and updated!
Speaking of notebooks, this one serves as an example of what to expect when the model files are not provided (in the case of SeqVec). After the introduction of your with_download method, I think this will need re-writing.
Finally: once you are done with decisions and improvements, please make sure all relevant notebooks run and are up to date. E.g.: this one still uses the "constrained albert" (see warning message), but that should not happen anymore ;). Not relevant notebooks: project_visualize_custom_embedings and project_visualize_pipeline_embeddings
Hey,
I'm trying to use SeqVecEmbedder with the newest SeqVec v2 weights and options files.
from bio_embeddings import SeqVecEmbedder
embedder = SeqVecEmbedder(
weights_file='models/seqvec2/weights.hdf5',
options_file='models/seqvec2/options.json'
)
After that I received the KeyError exception:
/lib/python3.6/site-packages/h5py/_hl/group.py in __getitem__(self, name)
262 raise ValueError("Invalid HDF5 object reference")
263 else:
--> 264 oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
265
266 otype = h5i.get_type(oid)
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5o.pyx in h5py.h5o.open()
KeyError: "Unable to open object (object 'char_embed' doesn't exist)"
I discovered a similar problem in the AllenNLP issues [1] and [2].
Thanks for taking the time to look into this issue.
@konstin you were completely right: logging with timestamps is much better than just logging. I now have a pipeline run that has been going for at least 12h, for which generating a plot is taking ages (this is expected). Nevertheless, it would be nice to know when it started :)
For now, I only see when UMAP finished (which is still a good indicator, but, yeah... :D)
Sat Jul 11 02:14:16 2020 Finished Nearest Neighbor Search
Sat Jul 11 02:14:25 2020 Construct embedding
completed 0 / 200 epochs
completed 20 / 200 epochs
completed 40 / 200 epochs
completed 60 / 200 epochs
completed 80 / 200 epochs
completed 100 / 200 epochs
completed 120 / 200 epochs
completed 140 / 200 epochs
completed 160 / 200 epochs
completed 180 / 200 epochs
Sat Jul 11 02:33:37 2020 Finished embedding
INFO Created the file go_embeddings/umap_projection/projected_embeddings_file.csv
INFO Created the file go_embeddings/umap_projection/ouput_parameters_file.yml
INFO Created the stage directory go_embeddings/plotly_2D
INFO Created the file go_embeddings/plotly_2D/input_parameters_file.yml
INFO Created the file go_embeddings/plotly_2D/input_annotation_file.csv
INFO Created the file go_embeddings/plotly_2D/merged_annotation_file.csv
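A minimal sketch of timestamped logging via the standard library, so every pipeline message carries a time like the UMAP lines above:
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s",
    level=logging.INFO,
)
logging.info("Created the stage directory go_embeddings/plotly_2D")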
Keeping this vague :P but, what @mariasche and I discussed today.
Notes here:
As of now, this seems to be the only major "lacking" thing.
As mentioned here, there seems to be a problem with batching in Bert.
The set I've used for the tests is viruses_90 from the test sets on oculus (in case you want to reproduce the run :) ).
I think that enhancing with the SeqVec "single sequence --> CPU --> fail" strategy will solve this issue, too.
In develop and on Google Colab, the BasicAnnotationExtractor class is not loading weights correctly, or maybe it's that the uploaded weights are not correct.
RuntimeError Traceback (most recent call last)
in ()
19 annotations_extractor = BasicAnnotationExtractor(
20 secondary_structure_checkpoint_file="secstruct_checkpoint",
---> 21 subcellular_location_checkpoint_file="subcell_checkpoint"
22 )
1 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
845 if len(error_msgs) > 0:
846 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 847 self.__class__.__name__, "\n\t".join(error_msgs)))
848 return _IncompatibleKeys(missing_keys, unexpected_keys)
849
RuntimeError: Error(s) in loading state_dict for SUBCELL_FNN:
Missing key(s) in state_dict: "layer.3.weight", "layer.3.bias", "layer.3.running_mean", "layer.3.running_var".
Right now, embeddings are written iteratively to an open embeddings file (either whole or reduced). If the embeddings step fails (e.g. by running out of memory), the files will be abruptly closed (without close()), resulting in corrupted files.
Ideally, we want to close the files correctly no matter what. This might require a bigger re-engineering of the FileSystem file handler, which would hand out read/write pointers to files instead of paths, and on sigterm or the like try to close all handles.
The quick solution for now would be encapsulating the embedding calls in try/finally clauses where the error isn't caught (aka: allowed to propagate), but the files do get closed whether successful or not.
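A minimal sketch of the quick solution, with a hypothetical embed_all generator: the error still propagates, but the file is always closed.
import h5py

embeddings_file = h5py.File("reduced_embeddings_file.h5", "w")
try:
    for identifier, embedding in embed_all(sequences):
        embeddings_file.create_dataset(str(identifier), data=embedding)
finally:
    embeddings_file.close()  # runs even on e.g. a CUDA out-of-memory error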
At the beginning of any embed pipeline run, calculate (by means of the mapping file) the expected size of the embedding file and the reduced embedding file.
As formula:
per_amino_acid_size_in_bits = (embedding_dimension * layers) * 32
per_protein_size_in_bits = embedding_dimension * 32
total_number_of_proteins = len(mapping_file)
total_aa = mapping_file.sequence_length.sum()
embeddings_file_size = per_amino_acid_size_in_bits * total_aa
reduced_embeddings_file_size = per_protein_size_in_bits * total_number_of_proteins
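A worked example with SeqVec-like numbers (1024 dimensions, 3 layers); the protein and AA counts are illustrative:
embedding_dimension, layers = 1024, 3
total_number_of_proteins, total_aa = 10_000, 3_500_000

per_amino_acid_size_in_bits = embedding_dimension * layers * 32
per_protein_size_in_bits = embedding_dimension * 32

# Convert bits to GiB: divide by 8 (bytes), then by 1024**3.
embeddings_file_size = per_amino_acid_size_in_bits * total_aa / 8 / 1024**3
reduced_embeddings_file_size = (
    per_protein_size_in_bits * total_number_of_proteins / 8 / 1024**3
)
print(f"{embeddings_file_size:.1f} GiB raw, {reduced_embeddings_file_size:.3f} GiB reduced")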
Hey,
I discovered that the bio-embeddings installation from PyPI with Python 3.6+ doesn't work anymore.
But it's working fine directly from the GitHub repository.
pip install bio-embeddings[all]
ERROR: Could not find a version that satisfies the requirement bio-embeddings[all] (from versions: none)
ERROR: No matching distribution found for bio-embeddings[all]
or
ERROR: Could not find a version that satisfies the requirement bio-embeddings==0.1.3 (from versions: none)
I tried with Python 3.6.8 on Mac OS and with Python 3.6.9 on Ubuntu. The same exception was raised.
I think it is related to the setup.py configuration:
'python_requires': '>=3.7,<4.0'
Could you tell us why Python 3.6 is excluded? Thanks!
Good day,
Quick question: is there any way to set the number of threads in the bio_embeddings pipeline, either from within the config file or from the command line?
Thank you.
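I'm not aware of a dedicated pipeline option; a possible workaround (an assumption, not documented behaviour) is capping torch's thread pool before invoking the pipeline, or exporting OMP_NUM_THREADS in the shell:
import torch

torch.set_num_threads(4)  # limit intra-op parallelism to 4 CPU threads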
Breaking down #15
information about CPU/GPU/RAM consumption
Low priority + GPU utilization monitoring is complex in multi-GPU environments.
Currently, in one form or another, in various parts of the pipeline, this is used:
"cuda:0" if torch.cuda.is_available() and not self._use_cpu else "cpu"
The problem here is that cuda:0 will always refer to the 0th card. In systems hosting multiple cards, this will be painful. Workarounds are:
NOT defining the device number, aka just cuda (see: https://pytorch.org/docs/stable/tensor_attributes.html#torch.torch.device), or
allowing the device number to be passed via a parameter (see the sketch below), e.g.:
"cuda:{cuda_device}" if torch.cuda.is_available() and not self._use_cpu else "cpu"
where cuda_device is an integer that defaults to 0.
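A minimal sketch of the second workaround, with a hypothetical cuda_device parameter:
import torch

def get_device(use_cpu: bool = False, cuda_device: int = 0) -> torch.device:
    if torch.cuda.is_available() and not use_cpu:
        return torch.device(f"cuda:{cuda_device}")  # pick a specific card
    return torch.device("cpu")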
Examples where it's used:
Currently, if CUDA runs out of memory, the samples won't be processed and #11 happens. In addition to fixing that for unknown problems, collect sequences that spawn known problems for processing on CPU. If even that's not possible, then throw an error!
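A minimal sketch of the proposed fallback, with hypothetical embedder objects: retry a known-problematic sequence on CPU, and raise only if that also fails.
import torch

try:
    embedding = gpu_embedder.embed(sequence)
except RuntimeError as error:
    if "out of memory" not in str(error):
        raise                          # unrelated error: propagate
    torch.cuda.empty_cache()           # free what the failed attempt cached
    embedding = cpu_embedder.embed(sequence)  # may still raise, as proposed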
An easy improvement when storing embeddings_file and reduced_embeddings_file: compression is supported out of the box and may impact speed (but that's acceptable).
https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline
Also, while at it: double-check that the stored datasets use the most fitting dtype.
P.S.: preference for gzip
P.P.S.: it would be nice to run this as a test to see "how much it buys". Easy test: take an h5 file and copy all datasets into a new h5 file applying compression. Then we see if this is useful...
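A minimal sketch of that test: copy every dataset of an existing h5 file into a new one with gzip compression, then compare file sizes (file names are placeholders).
import h5py

with h5py.File("embeddings.h5", "r") as source, \
     h5py.File("embeddings_gzip.h5", "w") as target:
    for name, dataset in source.items():
        target.create_dataset(name, data=dataset[()], compression="gzip")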
I'm hitting the MD5Clash exception even though I have only unique sequences in my dataset. I edited the code to remove the exception, but there should be an option to ignore this.
This most likely indicates there are multiple identical sequences in your FASTA file.
MD5 hashes are used to remap sequence identifiers from the input FASTA.
This error exists to prevent wasting resources (computing the same embedding twice).
There's a (very) low probability of this indicating a real MD5 clash.
If you are sure there are no identical sequences in your set, please open an issue at https://github.com/sacdallago/bio_embeddings/issues . Otherwise, use cd-hit to reduce your input FASTA to exclude identical sequences!
As per the title.
Important: use of volumes / output folder / prefix!