Comments (5)

konstin commented on May 24, 2024

Initial check with the use case four FASTA (sequence lengths [300, 544, 184, 1584, 518]): the file goes from 37MB to 26MB. That's significantly more than I expected and really good for such a small sample. The reduced embeddings go from 24KB to 32KB, but at that size it's not a meaningful benchmark.

What I can't tell from the documentation is whether the compression is applied across datasets or to each dataset individually. That will most likely decide whether it is useful at all for the reduced embeddings.

Code used:

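import sys

import h5py
from tqdm import tqdm

# Usage: pass the uncompressed input file and the compressed output file as arguments
lengths = []
with h5py.File(sys.argv[1], "r") as uncompressed, h5py.File(sys.argv[2], "w") as compressed:
    for key, value in tqdm(uncompressed.items()):
        # For 3-dimensional datasets, record the size of the second axis (the sequence length)
        if len(value.shape) == 3:
            lengths.append(value.shape[1])
        # Copy each dataset into the new file with gzip compression enabled
        compressed.create_dataset(key, data=value, compression="gzip")

print(lengths)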
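
On the question of scope: in HDF5, filters such as gzip are configured per dataset and applied chunk by chunk, not across the whole file. The sketch below is only an illustration of how to check this with h5py; the dataset names `per_residue` and `reduced` and their shapes are made up for the example and are not taken from the pipeline's output.

```python
import os
import tempfile

import h5py
import numpy as np


def write_and_inspect(path):
    """Write two datasets with different filter settings and report them."""
    rng = np.random.default_rng(0)
    with h5py.File(path, "w") as f:
        # gzip is requested for this dataset only (hypothetical name and shape)
        f.create_dataset(
            "per_residue",
            data=rng.random((300, 1024), dtype=np.float32),
            compression="gzip",
        )
        # No filter requested: this dataset is stored uncompressed
        f.create_dataset("reduced", data=rng.random(1024, dtype=np.float32))
    with h5py.File(path, "r") as f:
        # Each dataset carries its own compression and chunk settings
        return {name: (ds.compression, ds.chunks) for name, ds in f.items()}


with tempfile.TemporaryDirectory() as tmp:
    settings = write_and_inspect(os.path.join(tmp, "example.h5"))

for name, (compression, chunks) in settings.items():
    print(name, compression, chunks)
```

Because compression forces a dataset into chunked storage, every compressed dataset also pays for its own chunk index, which matters more the smaller the dataset is.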

from bio_embeddings.

sacdallago commented on May 24, 2024

This already sounds exceptionally good. These are actually the biggest files we produce, so if we can compress the embeddings directly, we don't even have to worry about zipping results in the pipeline after the run to save space.

The only other big file worth considering is the pairwise distance matrix, which in an internal SwissProt vs. human test takes upwards of 10GB in CSV form, so we might want to save it as h5 soon.


konstin commented on May 24, 2024

New numbers! Reduced embeddings: 148M -> 207M (yes, this file apparently grew under compression). Normal embeddings: 19G -> 17G. Just gzipping the whole file also gives 17G, and that was faster. Both took a couple of minutes.
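
One possible way to keep small datasets from growing, sketched here as an assumption and not as something the project does: compress only datasets above a size threshold and store the rest unfiltered. The helper name `copy_with_selective_gzip`, the 64 KiB threshold, and the dataset names are hypothetical, and the code assumes every top-level entry in the file is a dataset that fits in memory.

```python
import os
import tempfile

import h5py
import numpy as np

MIN_BYTES_TO_COMPRESS = 64 * 1024  # hypothetical threshold, tune on real data


def copy_with_selective_gzip(src_path, dst_path, min_bytes=MIN_BYTES_TO_COMPRESS):
    """Copy top-level datasets, enabling gzip only for the larger ones."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        for key, value in src.items():
            data = value[()]  # read the whole dataset into memory
            if data.nbytes >= min_bytes:
                dst.create_dataset(key, data=data, compression="gzip")
            else:
                # Small datasets are written without a filter
                dst.create_dataset(key, data=data)


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "uncompressed.h5")
    dst_path = os.path.join(tmp, "selective.h5")
    rng = np.random.default_rng(0)
    with h5py.File(src_path, "w") as f:
        f.create_dataset("small", data=rng.random(1024, dtype=np.float32))
        f.create_dataset("large", data=rng.random((300, 1024), dtype=np.float32))
    copy_with_selective_gzip(src_path, dst_path)
    with h5py.File(dst_path, "r") as f:
        filters = {key: ds.compression for key, ds in f.items()}
    print(filters)
    print(os.path.getsize(src_path), os.path.getsize(dst_path))
```

Whether this helps for the reduced embeddings would have to be measured on the real files; the random data here says nothing about actual compression ratios.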


konstin commented on May 24, 2024

Do you still want this now that we have half_precision?
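
For comparison, a rough sketch of the general idea behind storing embeddings in half precision: casting float32 values to float16 halves the storage size at the cost of some rounding error. This is a generic NumPy illustration with made-up data and shape; it does not show how the `half_precision` option is implemented in the pipeline.

```python
import numpy as np

# Hypothetical embedding: 300 residues x 1024 dimensions, values in [0, 1)
embedding = np.random.default_rng(0).random((300, 1024), dtype=np.float32)

# Casting to float16 halves the number of bytes per value
half = embedding.astype(np.float16)
print(embedding.nbytes, half.nbytes)

# Rounding error introduced by the cast, measured after converting back
max_abs_error = float(np.abs(embedding - half.astype(np.float32)).max())
print(max_abs_error)
```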


sacdallago commented on May 24, 2024

I guess we can zip files, too. Closing for now.

