Comments (9)
I just came across this and it lines up with some old experiments that I recently wrote up for grabbing FCs in parallel with Dask. I used the same from_map approach outlined in the design doc, although my method for grabbing chunks was a little more crude - relying on geopandas to parse features returned from getInfo. I didn't try to tackle the question of optimal chunking, but I did look at some optimizations like column projection and eagerly grabbing metadata that might be of interest.
from xee.
Great, I'm happy working with your existing repo. I'll make a PR there once I have an MVP together.
What is the use case for a dataframe of IC metadata? What types of queries would folks do on this?
It's not something I've ever needed to do, but I could see it being helpful for granule-level analysis of cloud cover or data coverage. Definitely not a high priority feature, though.
Consider making the optional dependency dask-geopandas: https://dask-geopandas.readthedocs.io/en/stable/
Yes, this looks perfect.
from xee.
As discussed in this issue, the integration now exists: https://github.com/alxmrs/dask-ee
from xee.
Amazing! That we’ve identified the same solution means it probably is most fitting for the problem. Thanks so much Aaron.
Would you like to take your approach and add it to Xee? Or, start a new package? I’d be happy to lend a hand. It seems like you’ve figured out the nitty gritty.
from xee.
Yeah, I'd be happy to write up an implementation for Xee (maybe xee.dataframe.read_fc?). Your design doc will be a big help, but there are a few specific design questions from there I'd love your thoughts on to get started:
computeFeatures or getInfo
The deferreds will make the underlying calls to ee.data.computeFeatures() that will produce Pandas Dataframes.
I haven't done any rigorous testing, but empirically I've found getInfo runs consistently faster than computeFeatures, regardless of page size. The main advantage I see of computeFeatures is decoupling the IO chunk size (i.e. page size) from the Dask chunk size. Do you think that's critical, or are there other strong reasons to stick with computeFeatures?
Optimizing chunk size
The tricky part, IIUC, will be calculating the appropriate FC shards.
I'm open to any ideas here. My inclination would be to put a hard limit at 5000 features but otherwise leave this up to the user to tweak since optimal chunk size will be so dataset dependent. In terms of avoiding data limits, I don't see any reliable way to estimate bytes per chunk without grabbing features, so again I'd lean towards making that a user responsibility. If you see an opportunity to automatically optimize though, I'm 100% in favor.
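To make the proposal concrete, here is a minimal sketch of user-tunable chunking with a hard cap. The names partition_bounds and MAX_FEATURES_PER_CHUNK are hypothetical, and the sketch assumes the total feature count is already known; it does not call Earth Engine.

```python
MAX_FEATURES_PER_CHUNK = 5000  # hard cap proposed above; hypothetical constant name


def partition_bounds(n_features, chunksize=MAX_FEATURES_PER_CHUNK):
    """Return (offset, count) pairs covering n_features in chunks of at most chunksize."""
    if not 0 < chunksize <= MAX_FEATURES_PER_CHUNK:
        raise ValueError(
            f"chunksize must be between 1 and {MAX_FEATURES_PER_CHUNK}, got {chunksize}"
        )
    return [
        (offset, min(chunksize, n_features - offset))
        for offset in range(0, n_features, chunksize)
    ]


# 12,000 features with the default cap gives three chunks.
print(partition_bounds(12_000))  # [(0, 5000), (5000, 5000), (10000, 2000)]
```

Each (offset, count) pair could then be handed to one mapped task, for example through the from_map approach mentioned earlier.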
ee.Initialize
To actually use a dask cluster, the EE Team needs to make ee.Initialize() pickleable. Right now, they are working on eliminating the need for this call altogether.
My hacky solution to initializing workers was to shove ee.Initialize into the mapped function. I didn't run into any pickling issues there, so maybe stick with that for the time being?
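A rough sketch of that pattern, kept independent of Earth Engine so it runs anywhere: the initializer is called inside the task itself, so it executes on whichever worker picks the task up. The decorator name initialize_in_task and the stand-in fake_initialize are hypothetical; in practice the initializer would be ee.Initialize.

```python
from functools import wraps


def initialize_in_task(init):
    """Decorator that calls init() inside the task, on the worker that runs it."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            init()  # e.g. ee.Initialize; called on every task invocation
            return fn(*args, **kwargs)
        return wrapper
    return decorator


calls = []


def fake_initialize():
    # Stand-in for ee.Initialize so the sketch runs without credentials.
    calls.append("init")


@initialize_in_task(fake_initialize)
def load_chunk(offset, count):
    # Stand-in for the real fetch; returns the feature indices it would load.
    return list(range(offset, offset + count))


print(load_chunk(10, 3))  # [10, 11, 12]
```

This sketch calls the initializer on every task rather than once per worker, which is the simplest form of the idea.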
from xee.
I'd be happy to write up an implementation for Xee
Wow, thank you so so much! I know at least one project where this would be a game changer: https://github.com/wildlife-dynamics/ecoscope.
cc: @walljcg
maybe xee.dataframe.read_fc?
Ah, the hardest part of our jobs — naming. I have a few thoughts here:
- As per the Zen of Python (“flat is better than nested”), I think exposing the function at the top is best. xee.df.read_*** makes sense so long as users can call xee.read_***.
- For the function name, I’d like to mimic Dask conventions as closely as possible. I see two established examples: the core IO methods like read_csv, read_parquet, etc.; and extensions like BQ (read_gbq): pypi.org/project/dask-bigquery. What is it that we’re reading, then? read_fc or read_ee?
  - Isn’t a FC, on a spiritual level, a dataframe? A table by any other name still scans as sweet? In this understanding, would read_fc be redundant?
  - (On the other hand, could the same argument be applied to parquet? Is EE or FC the better name for the container?)
  - To mimic BQ, doesn’t it make sense to read_ee? EE is the container for a table just as BQ is. We would leave it as a matter of types and asserts for the user to know that FCs can only be mapped to dfs.
  - I once read an article claiming that putting types into the name of methods, variables, etc. was a code smell, at least in functional languages. The typing adds this information to the reader; putting it in the name is redundant. Borrowing from the meta language roots of Python, this appeals to me a lot.
- All this aside, if you write it, that means you get to name it. I trust your judgment.
Do you think that's critical, or are there other strong reasons to stick with computeFeatures?
No, this is not critical inherently. To the end user, this implementation detail will be hidden. Only the performance characteristics will be noticed. Thus, I wouldn’t sweat the decision too much — it seems reversible. It would be prudent to get input from an EE platform engineer to see if they know something we don’t wrt performance, given they have capacity.
FWIW, I haven’t used the FC API too much. My work with EE primarily involves rasters. Your experience trumps my doc’s suppositions.
The main advantage I see of computeFeatures is decoupling the IO chunk size (i.e. page size) from the Dask chunk size
Hmm. In my experience with rasters, we do want to separate these two concepts. IO is limited by the EE API, whereas users determine the characteristics of the dask scheduler (i.e. CPU-bound computation). The most performant system for a lot of use cases, in my estimation, will be getting the max IO chunks EE will allow as well as even bigger Dask chunks (given memory optimized VMs and a lot of RAM per worker).
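As an illustration of that separation, here is a small sketch in which one Dask-sized partition is assembled from several smaller IO requests. The function load_partition and the stand-in fake_fetch_page are hypothetical names; a real fetcher would call Earth Engine for each page.

```python
def load_partition(offset, count, page_size, fetch_page):
    """Assemble one Dask-sized partition from several smaller IO requests."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    rows = []
    end = offset + count
    for start in range(offset, end, page_size):
        # Each request is bounded by the IO page size, not the partition size.
        rows.extend(fetch_page(start, min(page_size, end - start)))
    return rows


requests_made = []


def fake_fetch_page(start, n):
    # Stand-in for a paged Earth Engine request; records what was asked for.
    requests_made.append((start, n))
    return list(range(start, start + n))


partition = load_partition(offset=0, count=12_000, page_size=5_000, fetch_page=fake_fetch_page)
print(len(partition))   # 12000
print(requests_made)    # [(0, 5000), (5000, 5000), (10000, 2000)]
```

Here the partition holds 12,000 rows while no single request exceeds 5,000 features, so the two sizes can be tuned independently.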
Let’s see how this shapes out in your PR. We can discuss these details there.
My inclination would be to put a hard limit at 5000 features but otherwise leave this up to the user to tweak
That sounds perfect. I totally agree.
I don't see any reliable way to estimate bytes per chunk without grabbing features
If we can get something like a PyArrow schema on the entire collection (via getInfo()), then I bet we could produce a “good enough” estimate of the bytes per page. This is worthy of investigation after the MVP. For now, I agree, we can delegate this to the user. Later, it would be nice to calculate this up-front, to make an educated guess as a default. (I took the same approach with the Xarray extension).
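A back-of-the-envelope version of that estimate might look like the sketch below. Everything here is an assumption for illustration: the column mapping stands in for whatever schema the collection metadata provides, the per-type byte widths are guesses rather than figures published by Earth Engine, and the byte limit is a made-up number.

```python
# Assumed average encoded size per value, in bytes (illustrative guesses only).
ASSUMED_BYTES = {"Integer": 8, "Long": 8, "Float": 8, "String": 32}
DEFAULT_BYTES = 64  # fallback for types not listed above, e.g. geometries


def estimate_bytes_per_feature(columns):
    """Sum the assumed width of every column in a {name: type_name} mapping."""
    return sum(ASSUMED_BYTES.get(type_name, DEFAULT_BYTES) for type_name in columns.values())


def features_per_page(columns, byte_limit, max_features=5000):
    """Largest page size that fits the byte limit, capped at max_features."""
    per_feature = max(1, estimate_bytes_per_feature(columns))
    return max(1, min(max_features, byte_limit // per_feature))


columns = {"name": "String", "area": "Float", "count": "Integer"}  # hypothetical schema
print(estimate_bytes_per_feature(columns))             # 48
print(features_per_page(columns, byte_limit=96_000))   # 2000
```

An estimate like this would only be a default; the user-supplied chunk size would still take precedence.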
My hacky solution to initializing workers was to shove ee.Initialize into the mapped function.
I think this is a great solution. In fact, @KMarkert just used this tactic in Xee proper. It was off my radar when I wrote the doc, and I think we should adopt the pattern for dataframes as well. Let’s use Kel’s idioms in the function. As I understand it, both the pickle solution and this “hack” are short term solutions since the EE team intends to get rid of ee.Initialize() altogether.
Thanks again, Aaron.
from xee.
my method for grabbing chunks was a little more crude - relying on geopandas to parse features returned from getInfo
Is this crude? I thought it was elegant. Though, I don't know if we want to add that dependency.
from xee.
Thanks for the detailed feedback, Alex! This is all very helpful. Some responses below, but I also wanted to revisit the question of having this be integral to Xee vs. packaged separately.
I'm now leaning towards a separate package, both in the interest of modularity (e.g. using read_ee without the required geospatial dependencies of Xee) and reducing any maintenance burden on the Earth Engine team. If you're still open to that option, how do you feel about using dask-ee for the package name, as you discussed in the design doc? Your logic of following the dask-bigquery example seems the most straightforward to me, but I don't want to take the name you suggested and claim it on PyPI unless you're fully on board with that. I'm happy to consider other names if you'd rather keep that open. In any case, I would still be excited for you to engage with the package at whatever level you're available and interested.
To mimic BQ, doesn’t it make sense to read_ee? EE is the container for a table just as BQ is.
I'm convinced - read_ee makes sense and is consistent with precedent. I could potentially see loading Image Collection metadata as a dataframe through the same API, so keeping the function type-agnostic and dispatching internally sounds like a good approach.
In fact, @KMarkert just used this tactic in Xee proper. It was off my radar when I wrote the doc, and I think we should adopt the pattern for dataframes as well. Let’s use Kel’s idioms in the function.
Sounds great. I wasn't sure how the approach would scale, so the fact that it's working internally in Xee is reassuring.
Though, I don't know if we want to add that dependency.
Agreed! Parsing the GeoJSON with Pandas shouldn't be a problem. I'm imagining geopandas as an optional dependency hidden behind an as_geodataframe (or similar) flag to read_ee, but I'm open to other ideas.
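For the geopandas-free path, a minimal sketch of flattening a GeoJSON FeatureCollection into plain records could look like this. The helper name features_to_records and the sample data are hypothetical; the input is assumed to be the kind of GeoJSON-style dict described above.

```python
def features_to_records(feature_collection):
    """Flatten a GeoJSON FeatureCollection dict into one flat dict per feature."""
    records = []
    for feature in feature_collection.get("features", []):
        record = dict(feature.get("properties") or {})
        # Note: these two keys overwrite any property with the same name.
        record["id"] = feature.get("id")
        record["geometry"] = feature.get("geometry")  # left as a raw GeoJSON dict
        records.append(record)
    return records


sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "id": "0",
            "geometry": {"type": "Point", "coordinates": [1.0, 2.0]},
            "properties": {"name": "a", "area": 3.5},
        },
        {"type": "Feature", "id": "1", "geometry": None, "properties": {"name": "b", "area": 4.0}},
    ],
}

records = features_to_records(sample)
print(records[0]["name"], records[1]["geometry"])  # a None
```

The resulting list can be passed to pandas.DataFrame(records), and an optional flag could later convert the geometry column when geopandas is installed.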
from xee.
I'm now leaning towards a separate package, both in the interest of modularity (e.g. using read_ee without the required geospatial dependencies of Xee) and reducing any maintenance burden on the Earth Engine team.
That makes a lot of sense to me. I started a repo over here for this purpose, but it is just boilerplate right now.
I’d be happy to change the license/authorship to make us jointly own this repo.
how do you feel about using dask-ee for the package name, as you discussed in the design doc?
I think dask-ee is a better name for the project than dee, especially given that I can’t publish dee on PyPI due to name similarities. If you wanted to use the above repo, I can change the name. Otherwise, feel free to create something new!
I could potentially see loading Image Collection metadata as a dataframe through the same API, so keeping the function type-agnostic and dispatching internally sounds like a good approach.
This is interesting! This hasn’t occurred to me before. What is the use case for a dataframe of IC metadata? What types of queries would folks do on this?
I'm imagining geopandas as an optional dependency hidden behind an as_geodataframe (or similar) flag to read_ee, but I'm open to other ideas.
Consider making the optional dependency dask-geopandas: https://dask-geopandas.readthedocs.io/en/stable/
Cheers, Aaron.
from xee.