Comments (7)

assaron commented on June 12, 2024

Actually, the 503 errors are probably caused by our IT department stress-testing exposed services. But the problem of the long uploads remains.

I wonder, could the chunk size be the problem? Here is the h5dump output for the largest dataset in the file:

$ h5dump -H -p -d '/data/expression' mouse_gene_v2.3.h5
Mon Mar 25 04:38:33 PM UTC 2024
HDF5 "mouse_gene_v2.3.h5" {
DATASET "/data/expression" {
   DATATYPE  H5T_STD_U32LE
   DATASPACE  SIMPLE { ( 53511, 932405 ) / ( 53511, H5S_UNLIMITED ) }
   STORAGE_LAYOUT {
      CHUNKED ( 2000, 1 )
      SIZE 31440200353 (6.348:1 COMPRESSION)
   }
   FILTERS {
      COMPRESSION DEFLATE { LEVEL 4 }
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_ALLOC
      VALUE  H5D_FILL_VALUE_DEFAULT
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_INCR
   }
}
}

jreadey commented on June 12, 2024

Yes, the chunk size could be problematic... hsload will iterate over each chunk in the source dataset, and there are more than 24 million chunks in this case.
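
For a rough check of that number, here is the chunk count implied by the h5dump output above, assuming the chunk grid simply tiles the dataspace (a quick illustrative calculation, not part of the original discussion):

import math

# Chunk count implied by the h5dump output above:
# a (53511, 932405) dataset stored with (2000, 1) chunks.
dset_shape = (53511, 932405)
chunk_shape = (2000, 1)

chunks_per_axis = [math.ceil(d / c) for d, c in zip(dset_shape, chunk_shape)]
total_chunks = math.prod(chunks_per_axis)

print(chunks_per_axis)   # [27, 932405]
print(total_chunks)      # 25174935 -> about 25 million chunks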

If you can do an aws s3 cp and pre-stage the file on a local POSIX disk, that will speed things up greatly.

Another option would be to use the --nodata option to create the scaffold HSDS domain, and then write a custom Python script to copy in the data. If you can set up an n-way partitioning of the data to be loaded, you can run n scripts in parallel. Since the latency is in fetching the data from S3, you should be able to use a fairly large number for n without overloading HSDS. You can use docker stats to judge how busy the HSDS containers are. If the CPU % stays over 90 for long stretches, run HSDS with more containers or use a smaller value for n.
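
A minimal sketch of what such a partitioned copy script could look like, assuming the target domain has already been created with hsload --nodata and that h5pyd is configured (via .hscfg or environment variables) to reach the HSDS endpoint. The file, domain, and dataset names are taken from the comments above; the block sizes, the command-line arguments, and the script itself are illustrative, not part of hsload or h5pyd:

# Hypothetical partitioned copy script: run N copies with part = 0 .. N-1.
# Assumes the target domain already exists (created with hsload --nodata).
import sys

import h5py    # reads the local source file
import h5pyd   # writes to the HSDS domain

SRC_FILE = "mouse_gene_v2.3.h5"
DST_DOMAIN = "/counts/archs4/mouse_gene_v2.3.h5"
DSET_PATH = "/data/expression"
ROW_BLOCK = 2000   # matches the source chunk height, so each read is chunk-aligned
COL_BLOCK = 500    # 2000 x 500 x 4 bytes = ~4 MB per write request

def copy_partition(part, nparts):
    with h5py.File(SRC_FILE, "r") as src, h5pyd.File(DST_DOMAIN, "a") as dst:
        src_dset, dst_dset = src[DSET_PATH], dst[DSET_PATH]
        nrows, ncols = src_dset.shape
        # Partition the columns: worker `part` takes every nparts-th column block
        for col in range(part * COL_BLOCK, ncols, nparts * COL_BLOCK):
            cstop = min(col + COL_BLOCK, ncols)
            for row in range(0, nrows, ROW_BLOCK):
                rstop = min(row + ROW_BLOCK, nrows)
                dst_dset[row:rstop, col:cstop] = src_dset[row:rstop, col:cstop]
            print(f"part {part}: columns {col}:{cstop} done")

if __name__ == "__main__":
    part, nparts = int(sys.argv[1]), int(sys.argv[2])
    copy_partition(part, nparts)

The n workers could then be launched as, e.g., python copy_part.py 0 8 through python copy_part.py 7 8 (script name hypothetical), while watching docker stats as suggested above.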

Let us know if either of these approaches helps!

assaron commented on June 12, 2024

@jreadey What do you mean by pre-stage here? I've downloaded the file locally and run hsload ./mouse_gene_v2.3.h5 /counts/archs4/mouse_gene_v2.3.h5

I'm currently trying to change the chunking with h5repack, but apparently that will also take a while: only about a tenth of the file has been processed in ~1 hour...

jreadey commented on June 12, 2024

@assaron - Yes, by pre-stage I meant downloading the file locally. You can use hsload with an s3 source file, but that would be even slower in your case.

h5repack is a reasonable idea, but as you note it takes some time to run as well.

How do you feel about the partitioning idea?

assaron commented on June 12, 2024

@jreadey Yeah, I have a few files like this, so I can process them in parallel. I can run the repack in parallel as well.

It still feels a bit weird that the repack speed is about 100 MB per minute and is limited by CPU. Apparently compression plays a role: when I use GZIP=1 as the repack filter (instead of the level 4 that was there), I get a twofold improvement in speed (from 60 MB per minute to 120 MB per minute). Removing compression altogether makes it even faster, but the file size increases dramatically.

On the other hand, maybe it's relatively reasonable. Repack needs to unpack and repack the data, and the unpacked size is several times larger (6 to 1 in the example above). So the speed is actually 500-600 MB per minute of unpacked data, which is slower than just running gzip -1 on that data, but by a couple of times, not by an order of magnitude.
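
As a quick back-of-the-envelope check of that estimate (the 6.348:1 ratio comes from the h5dump output above; the throughput figure is the approximate one quoted above, not a measurement):

# Rough check of the effective (uncompressed) repack throughput.
compression_ratio = 6.348        # from the h5dump STORAGE_LAYOUT output
compressed_mb_per_min = 100      # approximate h5repack speed on compressed data
uncompressed_mb_per_min = compressed_mb_per_min * compression_ratio
print(round(uncompressed_mb_per_min))   # ~635 MB per minute of uncompressed data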

@jreadey, thanks for your help. I'm closing the issue, as it's not really related to the HSDS server. But I wonder whether repack-style filters could be added to h5pyd. I imagine that in this situation, changing the layout on the fly would decrease the number of HSDS API calls. Also, when I do a repack first, the data has to be compressed again, but for uploading to the HSDS server it could be sent uncompressed (if the network speed is high enough), which would also save some time.

jreadey commented on June 12, 2024

Ok, thanks.
FYI - with hsload the data flow will be: client uncompress -> binary transfer of uncompressed data -> compress by HSDS.
Potentially it would be faster to just send the compressed data to HSDS, but hsload is also taking the opportunity to re-chunk to a larger chunk size... by default HSDS will scale chunks up to hit a 2-8 MB chunk size.
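
To put rough numbers on that re-chunking (illustrative size arithmetic only, not HSDS's actual chunk-layout algorithm):

# Illustrative arithmetic only, not HSDS's chunk-layout algorithm.
itemsize = 4                              # H5T_STD_U32LE is 4 bytes per element
src_chunk_bytes = 2000 * 1 * itemsize     # source chunk (2000, 1) -> 8000 bytes
target_bytes = 4 * 1024 * 1024            # middle of the 2-8 MB target range
scale = target_bytes / src_chunk_bytes
print(src_chunk_bytes, round(scale))      # 8000, ~524: each HSDS chunk would cover
                                          # on the order of 500 source chunks
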
Feel free to reopen if you have more questions. Also you may find posting to the HDF Forum useful.

assaron commented on June 12, 2024

Oh, so the HSDS server also changes the chunk size. In that case it would make even more sense to combine multiple chunks in hsload to decrease the number of requests (actually, in the beginning I assumed it already did this).

Also, I realized that hsload can add compression for uncompressed chunks with the -z option, so I can repack to an uncompressed file first and then add compression with hsload. Not sure whether it increases the speed, though: repacking to an uncompressed file still runs at 800 MB per minute (of uncompressed data, so that's still only around 100-200 MB per minute of compressed data).

I'll play with it a bit more and will probably create a new issue for h5pyd then.
