Comments (7)
Actually, the 503 errors were probably caused by our IT department stress-testing exposed services. But the problem of the long uploads remains.
I wonder, could the chunk size be the problem? Here is the h5dump output for the largest dataset:
$ h5dump -H -p -d '/data/expression' mouse_gene_v2.3.h5
HDF5 "mouse_gene_v2.3.h5" {
DATASET "/data/expression" {
DATATYPE H5T_STD_U32LE
DATASPACE SIMPLE { ( 53511, 932405 ) / ( 53511, H5S_UNLIMITED ) }
STORAGE_LAYOUT {
CHUNKED ( 2000, 1 )
SIZE 31440200353 (6.348:1 COMPRESSION)
}
FILTERS {
COMPRESSION DEFLATE { LEVEL 4 }
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_ALLOC
VALUE H5D_FILL_VALUE_DEFAULT
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_INCR
}
}
}
Yes, the chunk size could be problematic... hsload will iterate over each chunk in the source dataset, and there are more than 24 million chunks in this case.
If you can do an aws s3 cp and pre-stage the file on a local POSIX disk, that will speed things up greatly.
Another option would be to use the --nodata option to create the scaffold HSDS domain, and then write a custom Python script to copy in the data (see the sketch below). If you can set up an n-way partitioning of the data to be loaded, you can run n scripts in parallel. Since the latency is in fetching the data from S3, you should be able to use a fairly large value for n without overloading HSDS. You can use docker stats to judge how busy the HSDS containers are. If the CPU % is over 90 for long stretches, run HSDS with more containers or use a smaller value for n.
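Here's a minimal sketch of what one such worker could look like; the domain and dataset paths are taken from your example, and the partition count and block width are just starting points to tune:

import sys

import h5py    # reads the local source file
import h5pyd   # talks to HSDS; mirrors the h5py API

SRC_FILE = "./mouse_gene_v2.3.h5"              # pre-staged local copy
DOMAIN = "/counts/archs4/mouse_gene_v2.3.h5"   # scaffold made by hsload --nodata
DSET = "/data/expression"

# Usage: python copy_part.py <part> <nparts>, one process per part.
part, nparts = int(sys.argv[1]), int(sys.argv[2])

with h5py.File(SRC_FILE, "r") as src, h5pyd.File(DOMAIN, "a") as dst:
    src_dset = src[DSET]
    dst_dset = dst[DSET]
    ncols = src_dset.shape[1]
    # Give each worker a contiguous band of columns; with (2000, 1)
    # source chunks, a column band covers whole chunks in the source.
    start = part * ncols // nparts
    stop = (part + 1) * ncols // nparts
    step = 100  # columns per write: ~21 MB of uncompressed u32 here
    for i in range(start, stop, step):
        j = min(i + step, stop)
        dst_dset[:, i:j] = src_dset[:, i:j]

Running n copies of this with part = 0..n-1 gives you the n-way split; back off n if docker stats shows the containers pegged.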
Let us know if either of these approaches helps!
@jreadey What do you mean by pre-stage here? I've downloaded the file locally and run hsload ./mouse_gene_v2.3.h5 /counts/archs4/mouse_gene_v2.3.h5
I'm currently trying to change the chunking with h5repack, but apparently that will also take a while: only about a tenth of the file has been processed in ~1 hour...
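For reference, the invocation I'm experimenting with looks roughly like this (the target chunk shape is just a guess; h5repack keeps the existing DEFLATE filter unless a -f option overrides it):

$ h5repack -l /data/expression:CHUNK=2000x1000 \
      mouse_gene_v2.3.h5 mouse_gene_v2.3.rechunked.h5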
@assaron - Yes, by pre-stage I meant download the file locally. You can use hsload with an S3 source file, but that would be even slower in your case.
h5repack is a reasonable idea, but as you note it takes some time to run as well.
How do you feel about the partitioning idea?
@jreadey Yeah, I have a few files like this, so I can parallelize across them. I can parallelize the repacking as well.
It still feels a bit weird that the repack speed is about 100 MB per minute and is limited by CPU. Apparently compression plays a role: when I set GZIP=1 in the repack filter (instead of the level 4 that was there), I get a twofold improvement in speed (from 60 MB per minute to 120 MB per minute). Removing the compression altogether makes it even faster, but the file size increases dramatically.
On the other hand, maybe it's reasonably fast after all. Repack needs to unpack and re-pack the data, and the unpacked size is several times larger (6 to 1 in the example above), so the speed is actually 500-600 MB per minute of unpacked data. That's slower than just running gzip -1 on that data, but by a couple of times, not an order of magnitude.
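Concretely, the filter override I mean is just an -f flag (dataset path as in the h5dump output above):

$ h5repack -f /data/expression:GZIP=1 \
      mouse_gene_v2.3.h5 mouse_gene_v2.3.gz1.h5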
@jreadey, thanks for your help. I'm closing the issue, as it's not really HSDS server related. But I wonder whether repack-style filters could be added to h5pyd. I imagine that if I could change the layout on the fly in this situation, it would decrease the number of HSDS API calls. Also, when I do a repack first, the data has to be compressed again, but for uploading to the HSDS server it could be sent uncompressed (if the network speed is high enough), which would also save some time.
Ok, thanks.
FYI - with hsload the data flow is: client uncompresses -> binary transfer of the uncompressed data -> re-compression by HSDS.
Potentially it would be faster to just send the compressed data to HSDS, but hsload also takes the opportunity to re-chunk to a larger chunk size... by default HSDS will scale up chunks to hit 2-8 MB per chunk.
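If you want to see what layout HSDS actually chose, a quick check from Python would look something like this (h5pyd mirrors the h5py API; the domain path is the one from your hsload command):

import h5pyd

# Open the loaded domain read-only and inspect the server-side layout.
with h5pyd.File("/counts/archs4/mouse_gene_v2.3.h5", "r") as f:
    dset = f["/data/expression"]
    print("shape:", dset.shape)
    print("chunks:", dset.chunks)            # chunk layout HSDS chose
    print("compression:", dset.compression)  # filter applied server-side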
Feel free to reopen if you have more questions. Also you may find posting to the HDF Forum useful.
Oh, so the HSDS server also changes the chunk size. In that case it would make even more sense to combine multiple chunks in hsload to decrease the number of requests (actually, in the beginning I assumed it already did this).
Also, I realized that hsload can add compression for uncompressed chunks with the -z option, so I can repack to an uncompressed file first and then add compression with hsload. Not sure it increases the speed though: repacking to an uncompressed file still runs at 800 MB per minute (of uncompressed data, so it's still around 100-200 MB of compressed data).
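For later reference, that two-step variant would be roughly the following (the compression level is a placeholder; -f NONE strips the existing filter):

$ h5repack -f /data/expression:NONE \
      mouse_gene_v2.3.h5 mouse_gene_v2.3.raw.h5
$ hsload -z 4 ./mouse_gene_v2.3.raw.h5 /counts/archs4/mouse_gene_v2.3.h5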
I'll play around a bit more and will probably create a new issue for h5pyd then.