Comments (5)
Initial check with the use case four FASTA, lengths [300, 544, 184, 1584, 518]: the file goes from 37MB to 26MB. That's significantly more than I expected and really good for this small sample. The reduced embeddings go from 24KB to 32KB, but at that size it's not even a real benchmark.
What I don't understand from the documentation is whether the compression is applied across datasets or to each dataset individually. That will most likely decide whether it is useful at all for the reduced embeddings.
Code used:
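import sys

import h5py
from tqdm import tqdm

lengths = []
# Usage: python <script> <uncompressed.h5> <compressed.h5>
with h5py.File(sys.argv[1], "r") as uncompressed, h5py.File(sys.argv[2], "w") as compressed:
    for key, value in tqdm(uncompressed.items()):
        # Record the second dimension (sequence length) of 3-dimensional datasets
        if len(value.shape) == 3:
            lengths.append(value.shape[1])
        # Copy every dataset into the new file with gzip compression
        compressed.create_dataset(key, data=value, compression="gzip")
print(lengths)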
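On the per-dataset question: as far as I understand HDF5, a filter such as gzip belongs to a single dataset's storage settings and runs on that dataset's chunks, so nothing is shared across datasets. A minimal sketch to check this on a throwaway file, assuming random arrays as stand-ins for real embeddings (the dataset names and shapes are made up for illustration):

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for one larger and one reduced embedding.
per_residue = rng.standard_normal((300, 1024), dtype=np.float32)
reduced = rng.standard_normal(1024, dtype=np.float32)

report = {}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")
    with h5py.File(path, "w") as f:
        f.create_dataset("per_residue", data=per_residue, compression="gzip")
        f.create_dataset("reduced", data=reduced, compression="gzip")

    with h5py.File(path, "r") as f:
        for name, dset in f.items():
            report[name] = {
                "compression": dset.compression,  # filter attached to this dataset
                "chunks": dset.chunks,            # chunk shape the filter runs on
                "raw_bytes": int(dset.size) * dset.dtype.itemsize,
                "stored_bytes": dset.id.get_storage_size(),
            }

for name, info in report.items():
    print(name, info)
```

Comparing `raw_bytes` with `stored_bytes` per dataset shows how much each one gains on its own. It does not include the per-dataset metadata in the file, which may matter more for very small datasets.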
This already sounds exceptionally good. These are actually the biggest files we produce, so if we can compress the embeddings directly, we don't even have to worry about having the pipeline zip the results after the run to save space.
The only other big file worth considering is the pairwise distance matrix, which takes up upwards of 10GB in CSV form on an internal SwissProt vs. human test, so we might want to save it as h5 soon.
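A rough sketch of what storing the distance matrix as h5 instead of CSV could look like, assuming a small random matrix as a stand-in; the file names and the `distances` dataset name are hypothetical and not what the pipeline uses:

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 200 x 200 distance matrix; the real one is far larger.
distances = rng.random((200, 200), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "pairwise_distances.csv")
    h5_path = os.path.join(tmp, "pairwise_distances.h5")

    # Text form: every value is written out as a decimal string.
    np.savetxt(csv_path, distances, delimiter=",")

    # Binary form: one gzip-compressed dataset ("distances" is an illustrative name).
    with h5py.File(h5_path, "w") as f:
        f.create_dataset("distances", data=distances, compression="gzip")

    csv_bytes = os.path.getsize(csv_path)
    h5_bytes = os.path.getsize(h5_path)

    # Read the matrix back to confirm nothing was lost.
    with h5py.File(h5_path, "r") as f:
        restored = f["distances"][:]

print(f"CSV: {csv_bytes} bytes, HDF5: {h5_bytes} bytes")
```

The saving here comes mostly from storing binary floats rather than text; how much gzip adds on top depends on the actual values.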
New numbers! Reduced embeddings: 148M -> 207M (yes, this file apparently grew with compression). Normal embeddings: 19G -> 17G. Simply gzipping the whole file also gives 17G, and that was faster. Both took a couple of minutes.
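For reference, "just gzipping the whole file" can be reproduced with the standard library alone. This is only a sketch: the file name is a placeholder and the content is dummy bytes, so the sizes say nothing about real embeddings.

```python
import gzip
import os
import shutil
import tempfile


def gzip_whole_file(src_path, dst_path):
    """Stream src_path through gzip into dst_path; return (raw, compressed) sizes."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return os.path.getsize(src_path), os.path.getsize(dst_path)


payload = b"placeholder bytes " * 10_000  # dummy content, not real embeddings

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "embeddings.h5")  # placeholder name, any file works
    with open(src, "wb") as f:
        f.write(payload)

    raw_bytes, gz_bytes = gzip_whole_file(src, src + ".gz")

    # Decompress again to confirm the round trip is lossless.
    with gzip.open(src + ".gz", "rb") as f:
        round_trip_ok = f.read() == payload

print(raw_bytes, gz_bytes, round_trip_ok)
```

The trade-off is that a gzipped file has to be decompressed as a whole before h5py can open it, whereas datasets compressed inside the file stay directly readable.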
Do you still want this now that we have half_precision?
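To put half precision next to the compression numbers: a small sketch, assuming that half precision here means the embeddings end up as float16 instead of float32 (I have not checked how the option is implemented). The array is random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical embedding of a 300-residue protein with 1024 features.
full = rng.standard_normal((300, 1024), dtype=np.float32)

# Casting to float16 halves the bytes per value at the cost of precision.
half = full.astype(np.float16)

# Largest absolute deviation introduced by the cast.
max_abs_error = float(np.max(np.abs(full - half.astype(np.float32))))

print(f"float32: {full.nbytes} bytes, float16: {half.nbytes} bytes")
print(f"max absolute rounding error: {max_abs_error}")
```

Unlike gzip, this reduction is lossy but guaranteed: the size is always cut in half regardless of the values.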
I guess that we can zip files, too. Closing for now.