Comments (7)
from xarray-beam.
I’d be interested to hear @SpencerC’s thoughts.
I'd personally like to see a more modularized approach where xpartition and xarray-beam remain independent and shared functionality is up-streamed. I see distinct use cases for each (although I'm partial to beam workflows myself).
from xarray-beam.
I'm not opposed to either pushing components upstream or even adding xpartition as an xarray-beam dependency, but it's easier to develop in a single repository until we know what the right reusable abstractions are :)
For now I've tried to keep the abstractions pretty light-weight, but even one of the few abstractions we have already (`ChunkKey`) needs an overhaul: #9. So maybe this is worth coming back to in a few months.
The logic around partitioning xarray Datasets into smaller chunks does feel very generic, so in the long term I do like the idea of trying to split it out, assuming that there are actual concrete non-Beam use-cases. Xarray-Beam could just be a thin layer with `beam.PTransform` objects wrapping upstream helper functions (e.g., from xpartition).
from xarray-beam.
it's easier to develop in a single repository until we know what the right reusable abstractions are :)
Makes sense. I'd suggest working out of the xpartition repo, since it has a more permissive license and a code coverage step in the CI, and adding it as a submodule here, unless you've got a google3/copybara workflow that this would overcomplicate, slowing you down along other project paths?
I do like the idea of trying to split it out, assuming that there are actual concrete non-Beam use-cases
As amazing as beam is, I've encountered at least one instance in geodata processing where the right call was to break out of beam and roll our own distributed workflow system for one step. If you can get it upstreamed, there are also definitely benefits to maintaining functionality in a heavily used, actively managed project with lots of contributors (assuming it's well administrated).
from xarray-beam.
Makes sense. I'd suggest working out of the xpartition repo, since it has a more permissive license and a code coverage step in the CI, and adding it as a submodule here, unless you've got a google3/copybara workflow that this would overcomplicate, slowing you down along other project paths?
Hah, @SpencerC are you an (ex-)Googler or are we just notorious? :)
I do have a convenient sync setup for xarray-beam/google3 -- and we have a handful of internal uses that serve as integration tests -- but that's not really a good reason not to do open development, if others want to get involved! It is easy enough for me to mirror other external projects.
It does seem more plausible that xpartition could/should be the shared foundation rather than xarray-beam, which will require a Beam dependency. I would rather not push this all the way to xarray (yet), until we really figure out what we're doing.
(IMO the differences between MIT/Apache licenses and a code-coverage step in CI are not really material factors)
As amazing as beam is, I've encountered at least one instance in geodata processing where the right call was to break out of beam and roll our own distributed workflow system for one step
I agree, there are for sure cases where Beam does not make sense. One obvious one is saving outputs from a numerical model. ML inference on specialized hardware might be another.
That said -- what is the specific shared functionality? Some possibilities:
- a shared way to build/represent a "dataset schema with chunks" that doesn't require using dask
- utilities for reading/writing/splitting/combining partitioned "chunks" of larger datasets
- shared data structures for keeping track of said chunks (e.g., `xarray_beam.ChunkKey`) and perhaps chunking schemes
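To make the first and third bullets a bit more concrete, here is a minimal pure-Python sketch of tracking chunks without dask. The names `ChunkOffsets` and `iter_chunks` are hypothetical, made up for illustration; they are not the API of xarray-beam or xpartition, and the real `ChunkKey` may look quite different.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple


@dataclass(frozen=True)
class ChunkOffsets:
    """Hypothetical chunk identifier: sorted (dimension, start offset) pairs."""

    offsets: Tuple[Tuple[str, int], ...]


def iter_chunks(
    sizes: Dict[str, int], chunks: Dict[str, int]
) -> Iterator[Tuple[ChunkOffsets, Dict[str, slice]]]:
    """Yield (key, slices) for every chunk of a dataset with the given sizes.

    Dimensions missing from `chunks` are kept whole (assumed convention).
    """
    dims = sorted(sizes)
    steps = {d: chunks.get(d, sizes[d]) for d in dims}
    if any(sizes[d] <= 0 or steps[d] <= 0 for d in dims):
        raise ValueError("dimension sizes and chunk sizes must be positive")
    # One range of start offsets per dimension; their product enumerates chunks.
    starts_per_dim = [range(0, sizes[d], steps[d]) for d in dims]
    for starts in itertools.product(*starts_per_dim):
        key = ChunkOffsets(tuple(zip(dims, starts)))
        # The last chunk along a dimension may be shorter than the step.
        slices = {
            d: slice(s, min(s + steps[d], sizes[d])) for d, s in zip(dims, starts)
        }
        yield key, slices
```

Each yielded `slices` dictionary has the shape that xarray's `Dataset.isel` accepts as indexers, so the same enumeration could back either a Beam transform or a non-Beam writer.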
If you can get it upstreamed, there are also definitely benefits to maintaining functionality in a heavily used, actively managed project with lots of contributors (assuming it's well administrated).
Sure, but who do you think is going to be on the hook for maintaining said upstream project? ;)
from xarray-beam.
I think we probably have enough experience that we could make a decent design, so I wouldn't be too worried about needing to track xpartition too closely as a dependency. In this case, the abstraction is basically set theory (e.g. union, intersection, partitions of sets, and ways to map between partitions). I always find designs work pretty well when the code abstractions line up with math abstractions.
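As a rough illustration of that set-theory framing, the sketch below treats a chunking of one dimension as a partition of the index range, represented by its sorted boundaries. The function names are hypothetical and are not taken from xpartition or xarray-beam; this is only one way the idea could be written down.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, stop)


def boundaries(size: int, chunk: int) -> List[int]:
    """Boundaries of a regular partition of range(size) into pieces of `chunk`."""
    return list(range(0, size, chunk)) + [size]


def intervals(bounds: List[int]) -> List[Interval]:
    """Turn sorted boundaries into consecutive half-open intervals."""
    return list(zip(bounds[:-1], bounds[1:]))


def common_refinement(a: List[int], b: List[int]) -> List[int]:
    """Coarsest partition that refines both inputs: the union of their boundaries."""
    return sorted(set(a) | set(b))


def overlaps(source: List[int], target: List[int]) -> Dict[Interval, List[Interval]]:
    """Map each target interval to the source intervals that intersect it."""
    src = intervals(source)
    return {
        t: [s for s in src if s[0] < t[1] and t[0] < s[1]]
        for t in intervals(target)
    }
```

With a dimension of length 12, chunks of 4 and chunks of 6 refine to pieces bounded at 0, 4, 6, 8 and 12, and `overlaps` says which source chunks are needed to assemble each target chunk, which is the kind of mapping between partitions a rechunking step relies on.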
from xarray-beam.
I'm also not convinced this simple functionality would be more maintainable in xarray, although it could replace the map_blocks function, which I assume is not particularly easy to maintain...
from xarray-beam.
Related Issues (20)
- Support for striding / rolling windows HOT 1
- Consider adding ZarrToChunks() and/or an open_zarr() helper function
- Support pangeo_forge_recipes.patterns.MergeDim in `FilePatternToChunks`
- Some notes on alignment with pangeo-forge-recipes around "keys" HOT 1
- Better documentation of behaviour with irregular source chunks HOT 5
- Support opening datasets with file-like objects in a Beam pipeline
- Add `split_vars` to FilePatternToChunks transform.
- Document how to add `weather-dl` clients & manifests in the Contributing guide. HOT 1
- Running pipelines on AWS? HOT 1
- Consider omitting unchunked dimensions from Key objects created with DatasetToChunks HOT 1
- Adding source distribution for xarray-beam on PyPI HOT 2
- Help with opening netcdf4 files via HTTP HOT 19
- Support missing chunks in ConsolidateChunks
- Require using make_template() if providing a template to ChunksToZarr? HOT 2
- Simultaneously read multiple Datasets into an Xarray-Beam pipeline HOT 2
- Add CI to check that code has been formatted with pyink formatter.
- Error in ChunksToZarr appears on docs page
- Register Beam coder(s) to avoid "Using fallback deterministic coder" warnings
- Race condition in ChunksToZarr when template is not supplied explicitly as an xarray.Dataset
- Dimension order produced by Rechunk is opaque and not controllable; mismatch can cause errors from subsequent ChunksToZarr.