Comments (5)
There are an awful lot of GET calls in there that are totally unnecessary for a whole-array-write operation. Even if the array already exists, a single read of all the metadata pieces should suffice. I wonder where we can provide a better directory-listing caching experience around this. Separately, there is talk of using a transactional in-memory cache specifically for zarr metadata files (uploaded when finished), which would help a lot too. It is already possible to provide separate metadata and data storage backends in zarr.
I mention this because, while I don't know what the specific problem is, I can only assume that the total number of requests/coroutines is implicated in something deep within asyncio.
One lever you could pull is the fsspec config setting conf["nofiles_gather_batch_size"] (default given by fsspec.asyn._NOFILES_DEFAULT_BATCH_SIZE = 1280); try setting it to a smaller value.
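A minimal sketch of tuning that setting (assuming fsspec's global config dict in fsspec.config; the batch size of 128 is just an illustrative value):

```python
from fsspec.config import conf

# Lower the number of coroutines fsspec gathers per batch for "no-file"
# operations (metadata calls, listings, etc.); the built-in default is
# fsspec.asyn._NOFILES_DEFAULT_BATCH_SIZE (1280).
conf["nofiles_gather_batch_size"] = 128  # illustrative value, not a recommendation
```

Note this must be set before the relevant fsspec calls are made, since the value is read when batches are dispatched.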
If requests really are being made with zero data, we should be able to find out where that's happening and go from there. Perhaps there is a race condition where all the data of a chunk is sent successfully, but the sending function subsequently errors. This would be in gcsfs.core.simple_upload.
from gcsfs.
"every batch of 10 chunks"
Is this the number of zarr chunks in a dask partition, or where else does this number come from?
from gcsfs.
Thanks for the fast reply! This number is the number of images you want to hold in memory before writing them to the bucket, so it's really user-defined.
WRT the large number of GET calls for every batch upload - while trying to debug the race condition, I tried reconnecting to the zarr store on the bucket every time I wrote a set of 10 images, to see if the error was related to the connection going stale (idk - I'm a scientist, not a networking guy). Removing that re-connection call removes the stack of GET calls.
Thanks for the pointer on gcsfs.core.simple_upload - I'll see if I can explore it to do some debugging for my weird case. If it'll help, I'll try to make a minimal reproducible example this weekend.
from gcsfs.
I was able to manually trace it back to _request on line 412 in gcsfs.core. It seems like the
async with self.session.request( ... ) as r:
command on line 416 may be where it fails before getting into the race condition. The data object going into that command prior to failure is not empty (<gcsfs.core.UnclosableBytesIO object at 0x0000021B6ACB7BF0>, with a non-zero size from getvalue), so I'm guessing it's something in self.session.request? That said, self.session.request comes from another package (aiohttp?), so I ran out of steam and stopped pursuing it. Given that this is the first you're seeing of this, it may be specific to my situation, but maybe this issue thread can help if someone else hits it. I chatted with the lab and we'll pursue a different, slower uploading scheme to get around this. Thanks again for your help!
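Not gcsfs code, but a stdlib-only sketch of one way a non-empty buffer can still produce an empty request: aiohttp reads a file-like body to EOF, so if the same buffer is reused for a retry without rewinding, the second attempt sends zero bytes even though getvalue() is non-empty (as observed above). The buffer name and flow here are hypothetical, for illustration only:

```python
import io

# Hypothetical illustration of a "non-empty object, empty request" scenario.
body = io.BytesIO(b"chunk-data")

first = body.read()    # a first attempt consumes the buffer to EOF
second = body.read()   # a retry without seek(0) gets nothing

print(len(first), len(second), len(body.getvalue()))  # 10 0 10
# getvalue() is still non-empty even though read() now returns b"" -
# matching the observation that the data object "is not empty".

body.seek(0)           # rewinding restores the full payload for a resend
print(len(body.read()))  # 10
```

This is consistent with why gcsfs wraps upload bodies (UnclosableBytesIO keeps aiohttp from closing the buffer between attempts), but whether a missing rewind is actually the bug here would need confirming in a debugger.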
from gcsfs.
I hope you are right, but it's good to have this information recorded for others anyway.
from gcsfs.