Comments (7)

assaron commented on June 12, 2024

Actually, the 503 errors are probably caused by our IT department stress-testing exposed services. But the problem of the long uploads remains.

I wonder, could the chunk size be the problem? Here is the h5dump output for the largest dataset in the file:

$ h5dump -H -p -d '/data/expression' mouse_gene_v2.3.h5
Mon Mar 25 04:38:33 PM UTC 2024
HDF5 "mouse_gene_v2.3.h5" {
DATASET "/data/expression" {
   DATATYPE  H5T_STD_U32LE
   DATASPACE  SIMPLE { ( 53511, 932405 ) / ( 53511, H5S_UNLIMITED ) }
   STORAGE_LAYOUT {
      CHUNKED ( 2000, 1 )
      SIZE 31440200353 (6.348:1 COMPRESSION)
   }
   FILTERS {
      COMPRESSION DEFLATE { LEVEL 4 }
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_ALLOC
      VALUE  H5D_FILL_VALUE_DEFAULT
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_INCR
   }
}
}

jreadey commented on June 12, 2024

Yes, the chunk size could be problematic... hsload will iterate over each chunk in the source dataset, and there are more than 24 million chunks in this case.
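
For a rough check of that number, here is the chunk count implied by the h5dump output above, assuming the chunk grid simply tiles the dataspace (a quick illustrative calculation, not part of the original discussion):

import math

# Chunk count implied by the h5dump output above:
# a (53511, 932405) dataset stored with (2000, 1) chunks.
dset_shape = (53511, 932405)
chunk_shape = (2000, 1)

chunks_per_axis = [math.ceil(d / c) for d, c in zip(dset_shape, chunk_shape)]
total_chunks = math.prod(chunks_per_axis)

print(chunks_per_axis)   # [27, 932405]
print(total_chunks)      # 25174935 -> about 25 million chunks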

If you can do an aws s3 cp and pre-stage the file on a local POSIX disk, that will speed things up greatly.

Another option would be to use the --nodata option to create the scaffold HSDS domain, and then write a custom Python script to copy in the data. If you can set up an n-way partitioning of the data to be loaded, you can run n scripts in parallel. Since the latency is in fetching the data from S3, you should be able to use a fairly large number for n without overloading HSDS. You can use docker stats to judge how busy the HSDS containers are. If the CPU % stays over 90 for long stretches, run HSDS with more containers or use a smaller value for n.
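
A minimal sketch of what such a partitioned copy script could look like, assuming the target domain has already been created with hsload --nodata and that h5pyd is configured (via .hscfg or environment variables) to reach the HSDS endpoint. The file, domain, and dataset names are taken from the comments above; the block sizes, the command-line arguments, and the script itself are illustrative, not part of hsload or h5pyd:

# Hypothetical partitioned copy script: run N copies with part = 0 .. N-1.
# Assumes the target domain already exists (created with hsload --nodata).
import sys

import h5py    # reads the local source file
import h5pyd   # writes to the HSDS domain

SRC_FILE = "mouse_gene_v2.3.h5"
DST_DOMAIN = "/counts/archs4/mouse_gene_v2.3.h5"
DSET_PATH = "/data/expression"
ROW_BLOCK = 2000   # matches the source chunk height, so each read is chunk-aligned
COL_BLOCK = 500    # 2000 x 500 x 4 bytes = ~4 MB per write request

def copy_partition(part, nparts):
    with h5py.File(SRC_FILE, "r") as src, h5pyd.File(DST_DOMAIN, "a") as dst:
        src_dset, dst_dset = src[DSET_PATH], dst[DSET_PATH]
        nrows, ncols = src_dset.shape
        # Partition the columns: worker `part` takes every nparts-th column block
        for col in range(part * COL_BLOCK, ncols, nparts * COL_BLOCK):
            cstop = min(col + COL_BLOCK, ncols)
            for row in range(0, nrows, ROW_BLOCK):
                rstop = min(row + ROW_BLOCK, nrows)
                dst_dset[row:rstop, col:cstop] = src_dset[row:rstop, col:cstop]
            print(f"part {part}: columns {col}:{cstop} done")

if __name__ == "__main__":
    part, nparts = int(sys.argv[1]), int(sys.argv[2])
    copy_partition(part, nparts)

The n workers could then be launched as, e.g., python copy_part.py 0 8 through python copy_part.py 7 8 (script name hypothetical), while watching docker stats as suggested above.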

Let us know if either of these approaches helps!

assaron commented on June 12, 2024

@jreadey What do you mean by pre-stage here? I've downloaded the file locally and run hsload ./mouse_gene_v2.3.h5 /counts/archs4/mouse_gene_v2.3.h5

I'm currently trying to change the chunking with h5repack, but apparently that will also take a while: only about a tenth of the file has been processed in ~1 hour...

jreadey commented on June 12, 2024

@assaron - Yes, by pre-stage I meant downloading the file locally. You can use hsload with an s3 source file, but that would be even slower in your case.

h5repack is a reasonable idea, but as you note it takes some time to run as well.

How do you feel about the partitioning idea?

assaron commented on June 12, 2024

@jreadey Yeah, I have a few files like this, so I can process them in parallel. I can run the repack in parallel as well.

It still feels a bit weird that the repack speed is about 100 MB per minute and is limited by CPU. Apparently compression plays a role: when I use GZIP=1 as the repack filter (instead of the level 4 that was there), I get a twofold improvement in speed (from 60 MB per minute to 120 MB per minute). Removing compression altogether makes it even faster, but the file size increases dramatically.

On the other hand, maybe it's relatively reasonable. Repack needs to unpack and repack the data, and the unpacked size is several times larger (6 to 1 in the example above). So the speed is actually 500-600 MB per minute of unpacked data, which is slower than just running gzip -1 on that data, but by a couple of times, not by an order of magnitude.
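
As a quick back-of-the-envelope check of that estimate (the 6.348:1 ratio comes from the h5dump output above; the throughput figure is the approximate one quoted above, not a measurement):

# Rough check of the effective (uncompressed) repack throughput.
compression_ratio = 6.348        # from the h5dump STORAGE_LAYOUT output
compressed_mb_per_min = 100      # approximate h5repack speed on compressed data
uncompressed_mb_per_min = compressed_mb_per_min * compression_ratio
print(round(uncompressed_mb_per_min))   # ~635 MB per minute of uncompressed data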

@jreadey, thanks for your help. I'm closing the issue, as it's not really related to the HSDS server. But I wonder whether repack-style filters could be added to h5pyd. I imagine that in this situation, changing the layout on the fly would decrease the number of HSDS API calls. Also, when I do a repack first, the data has to be compressed again, but for uploading to the HSDS server it could be sent uncompressed (if the network speed is high enough), which would also save some time.

jreadey commented on June 12, 2024

Ok, thanks.
FYI - with hsload the data flow will be: client uncompress -> binary transfer of uncompressed data -> compress by HSDS.
Potentially it would be faster to just send the compressed data to HSDS, but hsload is also taking the opportunity to re-chunk to a larger chunk size... by default HSDS will scale chunks up to hit a 2-8 MB chunk size.
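
To put rough numbers on that re-chunking (illustrative size arithmetic only, not HSDS's actual chunk-layout algorithm):

# Illustrative arithmetic only, not HSDS's chunk-layout algorithm.
itemsize = 4                              # H5T_STD_U32LE is 4 bytes per element
src_chunk_bytes = 2000 * 1 * itemsize     # source chunk (2000, 1) -> 8000 bytes
target_bytes = 4 * 1024 * 1024            # middle of the 2-8 MB target range
scale = target_bytes / src_chunk_bytes
print(src_chunk_bytes, round(scale))      # 8000, ~524: each HSDS chunk would cover
                                          # on the order of 500 source chunks
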
Feel free to reopen if you have more questions. Also you may find posting to the HDF Forum useful.

assaron commented on June 12, 2024

Oh, so the HSDS server also changes the chunk size. In that case it would make even more sense to combine multiple chunks in hsload to decrease the number of requests (actually, in the beginning I assumed it already did this).

Also, I realized that hsload can add compression for uncompressed chunks with the -z option, so I can repack to an uncompressed file first and then add compression with hsload. Not sure whether it increases the speed, though: repacking to an uncompressed file still runs at 800 MB per minute (of uncompressed data, so that's still only around 100-200 MB per minute of compressed data).

I'll play with it a bit more and will probably create a new issue for h5pyd then.
