eerepr's People

Contributors

aazuspan

eerepr's Issues

Simplify packaging and drop 3.7 support

Python 3.7 support was added for Colab (#15), but Colab now supports modern Python versions so we can drop that.

At the same time, we can simplify the packaging by dropping the redundant setup.py and setup.cfg for a pure pyproject.toml solution.

Shrink reprs

HTML reprs get big fast and can bloat notebook sizes. I tried minifying in #10 and found it wasn't worth the processing time, but there were some free optimizations that could have comparable effects.

Try shortening CSS class names and simplifying frequently repeated HTML elements to drop repr sizes.

Benchmark:

obj = ee.ImageCollection("COPERNICUS/S2_SR").limit(1000)
rep = obj._repr_html_()
mb = len(rep) / 1e6
mb 
# 68.84692

Test with JSON instead of server data

Currently reprs are tested against data pulled from EE. The upside is that it ensures tests will fail if there's a breaking change server-side, but it also makes for very slow tests and requires internet access and authentication to run them.

I think breaking changes from EE are unlikely enough that I should just test against local JSON data instead and add a script to generate test data from Earth Engine.
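A minimal sketch of what a JSON-backed loader could look like. The `load_info` signature, the `fetch` callable, and the cache layout here are assumptions for illustration, not the project's actual test helper. Only a cache miss calls `fetch`, so only generating new test data would need Earth Engine access.

```python
import json
from pathlib import Path
from typing import Any, Callable


def load_info(key: str, fetch: Callable[[], Any], cache_path: Path) -> Any:
    """Return cached data for `key`, calling `fetch` only on a cache miss."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if key not in cache:
        # Only this branch would need network access and authentication.
        cache[key] = fetch()
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(cache, indent=2, sort_keys=True))
    return cache[key]
```

In a test, `fetch` could be something like `lambda: obj.getInfo()`, while routine test runs would read everything from the committed JSON file.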

Options for python_requires directive?

First off, this is a much needed module that will keep many of us from jumping back and forth between Python and JS for the ease of object inspection. Thank you for your efforts.

I am currently working out of a Jupyter Lab setup built off a modified Docker image. The base structure is complicated enough that I don't want to tinker with it much, but my issue is that the image has Python 3.7.1. The eerepr setup.cfg file has a python_requires directive of >=3.8, which made installing with pip impossible. I manually cloned the repo from GitHub, modified the file to accept 3.7.1, and ran an install from within my computing environment. It seems to work fine under light casual testing. So my question: is the >=3.8 directive in place because of known issues with earlier versions of Python, or is it possible (though perhaps untested) that your module works fine with <3.8?

Improve performance

Generating HTML for large collections can take a significant amount of time. Not surprising considering that we're iteratively building a string from potentially hundreds of thousands of nested properties, and Python is famously slow at iterating. There are a few different routes I could take to try and reduce processing time, and some of them can be combined.

Options

  1. Micro-optimizations to individual functions that get called repeatedly. Shaving a few microseconds off of a function that gets called half a million times will add up, so it's worth digging into the hot areas for any savings.
  2. Refactor to avoid string concatenation. String concatenation is frequently considered a slow operation in Python because strings are immutable and must be cloned each time they're modified, but with modern CPython optimizations it looks like that's no longer the case. Still, it would be worth experimenting with alternatives to string concatenation like iteratively adding to a StringIO buffer or building a huge list of substrings and joining to see if there are any gains there.
  3. Cache pieces of objects. Currently we cache based on the Earth Engine object, but two different Sentinel-2 images will have identical band information, so caching within objects could have big performance gains, especially when dealing with homogeneous image collections that contain a lot of duplication. This isn't a simple change since objects are represented by nested dictionaries and are usually unhashable. We could hash them by dumping the JSON to a string, but obviously that will have an overhead. The best option might be to add some logic to try and cache things that we expect to be repeated, e.g. band information.
  4. Parallelization. This would be a nightmare to implement effectively since we're dealing with deeply nested properties that need to be built in order and the overhead of parallelizing small properties would almost definitely outweigh the benefit. Worth considering, but almost definitely not the way to go.
  5. Rust. I could rewrite the html module in Rust and use PyO3 to provide Python bindings. This would add a build dependency on Maturin and generally complicate builds, but it should ensure the fastest possible solution.
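As a rough illustration of option 3, here is a sketch of caching converted fragments by their JSON dump. `make_cached_converter` and `bands_to_html` are hypothetical names, not part of eerepr, and whether the `json.dumps` overhead pays off would need to be measured on real data.

```python
import json
from typing import Any, Callable, Dict


def make_cached_converter(convert: Callable[[Any], str]) -> Callable[[Any], str]:
    """Wrap `convert` so identical JSON-like inputs are converted only once."""
    cache: Dict[str, str] = {}

    def cached(obj: Any) -> str:
        # Dicts and lists are unhashable, so use a canonical JSON dump as the key.
        key = json.dumps(obj, sort_keys=True)
        if key not in cache:
            cache[key] = convert(obj)
        return cache[key]

    return cached


calls = []


def bands_to_html(bands: list) -> str:
    """Hypothetical converter standing in for the real band formatting."""
    calls.append(1)
    return "<ul>" + "".join(f"<li>{b['id']}</li>" for b in bands) + "</ul>"


cached_bands = make_cached_converter(bands_to_html)
first = cached_bands([{"id": "B1"}, {"id": "B2"}])
second = cached_bands([{"id": "B1"}, {"id": "B2"}])  # equal content, new object
```

The second call returns the stored string without running the converter again, which is the behavior that would help with collections of images sharing the same band information.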

Benchmark

Here's a rough performance benchmark from my laptop to serve as a baseline.

import ee
import eerepr

ee.Initialize()

obj = ee.ImageCollection("COPERNICUS/S2_SR").limit(1000)
info = obj.getInfo()

%timeit -r 5 -n 5 eerepr.html.convert_to_html(info)
# 1.35 s ± 45.4 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
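The string-building alternatives from option 2 can also be compared in isolation, without Earth Engine data. This is a sketch with made-up input (`ITEMS`); the three functions produce identical output, and the timings will depend on the Python version and machine, so no results are claimed here.

```python
import io
import timeit

# Synthetic stand-in for the many small HTML fragments in a large repr.
ITEMS = [f"<li>{i}</li>" for i in range(10_000)]


def build_concat(items):
    """Build the output with repeated string concatenation."""
    out = ""
    for item in items:
        out += item
    return out


def build_join(items):
    """Collect substrings in a list and join once at the end."""
    parts = []
    for item in items:
        parts.append(item)
    return "".join(parts)


def build_stringio(items):
    """Write substrings into an in-memory text buffer."""
    buf = io.StringIO()
    for item in items:
        buf.write(item)
    return buf.getvalue()


if __name__ == "__main__":
    for fn in (build_concat, build_join, build_stringio):
        seconds = timeit.timeit(lambda: fn(ITEMS), number=20)
        print(f"{fn.__name__}: {seconds:.4f} s")
```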

Fix `ee.List.shuffle`

ee.List.shuffle(seed=False) returns non-deterministic results from the same invocation, which means that equality checks on server-side objects incorrectly pass, causing false-positive cache hits. You can easily confirm this with:

l = ee.List.sequence(1, 10)

l.shuffle(seed=False) == l.shuffle(seed=False) # True
l.shuffle(seed=False).getInfo() == l.shuffle(seed=False).getInfo() # False

Repeated calls will generate different results but hit the same cache, displaying the wrong data. This will affect lists and anything generated from shuffled lists, e.g. a FeatureCollection generated from a list of coordinates.

To fix this, I'll need to add a function that parses each Earth Engine object's invocation string (just get the string repr for the object) and prevents caching if List.shuffle is found.

I've tested every other Earth Engine method that takes a random seed, and this is the only one that acts non-deterministically.
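A minimal sketch of the invocation check, written against a plain string so it runs without Earth Engine. The function name and the tuple of call names are assumptions. In practice the string might come from the object's string repr or from `obj.serialize()`, and it would need to be confirmed that the method name appears there as `List.shuffle`.

```python
from typing import Tuple

# Calls known to return different results from the same invocation.
NONDETERMINISTIC_CALLS: Tuple[str, ...] = ("List.shuffle",)


def invocation_is_nondeterministic(invocation: str) -> bool:
    """Return True if the invocation string contains a nondeterministic call."""
    return any(call in invocation for call in NONDETERMINISTIC_CALLS)
```

Objects flagged this way would skip the cache, so every display triggers a fresh request to the server.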

Ideas for limits

Hi Aaron, per our discussion, just wanting to document that testing limits might be useful at some point. I suppose the goal would be to find thresholds that optimize mostly for user experience and a bit for EE servers (limit unnecessarily large or unmanageable requests), e.g.,

  • Not having to wait too long
  • Not stressing the browser
  • Fast fail for unreasonable requests (for user and server)

Here are the ideas we'd listed:

  • Limit size of HTML (I think you already do this, IIRC 100 MB)
  • Limit size of network request
  • Max elements
  • Max recursion

No action needed yet, just wanting to document and have a place to add conversation.
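For discussion, here is one way the "max elements" and "max recursion" ideas could look when applied to data that has already been retrieved. The `truncate` function, its defaults, and the placeholder strings are all hypothetical, and this only reduces rendering cost; it would not limit the size of the network request.

```python
from typing import Any


def truncate(obj: Any, max_depth: int = 5, max_elements: int = 100, _depth: int = 0) -> Any:
    """Return a copy of a JSON-like object clipped by depth and container size."""
    if isinstance(obj, (dict, list)) and _depth >= max_depth:
        # Too deep: replace the whole container with a placeholder.
        return "..."
    if isinstance(obj, dict):
        items = list(obj.items())
        clipped = {
            k: truncate(v, max_depth, max_elements, _depth + 1)
            for k, v in items[:max_elements]
        }
        if len(items) > max_elements:
            clipped["..."] = f"{len(items) - max_elements} more"
        return clipped
    if isinstance(obj, list):
        clipped_list = [
            truncate(v, max_depth, max_elements, _depth + 1)
            for v in obj[:max_elements]
        ]
        if len(obj) > max_elements:
            clipped_list.append(f"... {len(obj) - max_elements} more")
        return clipped_list
    return obj
```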

Support for Python 3.7

I am trying to integrate eerepr into geemap. Just noticed that eerepr requires Python >=3.8. Any reason why Python 3.7 is not supported? Colab still uses Python 3.7, and although I can install it with pip install eerepr --ignore-requires-python, that is not ideal for a package that other packages depend on.

Try minifying

Notebook file sizes can get huge if you print HTML reprs for a few big collections. I should experiment with minifying the HTML repr to see how that affects performance and file size. The minify-html project looks promising since it supports minifying HTML, CSS, and JS with no external dependencies.

A minor decrease in file size won't be worth adding a dependency, so I could make this a configurable setting with an optional dependency.
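A sketch of the optional-dependency pattern, assuming the minify-html Python package exposes `minify_html.minify` with `minify_css` and `minify_js` keyword arguments (worth checking against the installed version). `maybe_minify` is a hypothetical name. If the package is missing, the HTML is returned unchanged.

```python
def maybe_minify(html: str) -> str:
    """Minify HTML if the optional minify-html package is installed."""
    try:
        import minify_html
    except ImportError:
        # Optional dependency not installed: leave the repr as it is.
        return html
    return minify_html.minify(html, minify_css=True, minify_js=True)


page = "<div>\n  <style> .ee { color: red; } </style>\n  <ul>\n    <li> a </li>\n  </ul>\n</div>"
smaller = maybe_minify(page)
print(len(page), len(smaller))
```

Comparing the two lengths on a large collection repr, along with the time spent minifying, would show whether the size reduction justifies the dependency.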

Try using async widgets

ipywidgets supports async widgets via threading. Rather than waiting for data, turning it into HTML, and directly displaying the HTML, I could return a loading HTML widget and use threading to update the widget contents once data is retrieved from the server and formatted. This would make the experience more similar to the code editor by not blocking the entire kernel.

The main downside would be adding a dependency on ipywidgets, but if most users are using this alongside geemap then that's not a big issue. Other considerations would be:

  • Can I throw all the JS and CSS into the HTML widget like I currently do, or do I need to handle that differently?
  • Would reprs inside an HTML widget display correctly when rendered statically like they currently do? That's not a dealbreaker, but it would be nice.
  • Are there performance drawbacks in rendering time?
  • What will the ipywidgets compatibility be? I've had issues in the past with ipywidgets>7, especially in Jupyter Lab, so ideally this would work in version 7 or 8.
  • Calling _repr_html_ should return an HTML string, not a widget, so I think I would need to use _ipython_display_ or _repr_mimebundle_ instead and return the corresponding method from the associated widget.

Here's a rough implementation idea:

import threading

import ee
import ipywidgets

def _ipython_display_(obj: ee.Element):
  """Display an Earth Engine object in an async HTML widget"""
  html = ipywidgets.HTML("<span>Loading...</span>")

  # Fetch and format in a background thread so the kernel isn't blocked
  threading.Thread(target=_build_repr, args=(obj, html)).start()
  return html._ipython_display_()

def _build_repr(obj: ee.Element, html: ipywidgets.HTML) -> None:
  """Format an HTML repr string and add it to an HTML widget"""
  info = obj.getInfo()
  rep = _format_repr(info)  # placeholder for the existing HTML formatting step
  html.value = rep

Add `MAX_REPR_MB` config

It's easy to accidentally print the repr for a huge collection which can crash a notebook. I should add some kind of configurable setting that prevents displaying reprs above a certain size. Instead, it should fall back to the string repr and give a user warning about the repr size with instructions to adjust the setting.
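A minimal sketch of the size check and fallback. `size_limited_repr`, its parameters, and the default limit are hypothetical; the size calculation follows the `len(rep) / 1e6` convention used in the benchmarks.

```python
import warnings
from typing import Callable

MAX_REPR_MB = 100.0  # hypothetical default, meant to be user-configurable


def size_limited_repr(
    build_html: Callable[[], str], fallback: str, max_mb: float = MAX_REPR_MB
) -> str:
    """Return the HTML repr, or `fallback` with a warning if it is too large."""
    html = build_html()
    mb = len(html) / 1e6
    if mb > max_mb:
        warnings.warn(
            f"HTML repr size ({mb:.1f} MB) exceeds the limit of {max_mb} MB, "
            "so the string repr is shown instead. Raise the limit to display it.",
            stacklevel=2,
        )
        return fallback
    return html
```

Note that this still builds the full HTML before checking its size, so it protects the notebook from rendering a huge repr but not from the time spent generating it.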

Property order

The property order displayed by eerepr doesn't match the Code Editor.

For example, the Code Editor repr for ee.ImageCollection("COPERNICUS/S2_SR").first() is in the order [type, id, version, bands, properties] whereas eerepr displays it as [type, bands, id, version, properties]. This isn't a huge deal, but it seems like Code Editor usually puts scalars before lists/objects which is easier to read. Some objects in the Code Editor are sorted alpha (like an image's properties), but some definitely aren't. It may just be that the order gets mangled when it's passed from the server to Python.

I should try to match the Code Editor order if it's a simple task, and otherwise I should just implement a logical sorting method--probably alpha with scalars before lists and dicts.
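The fallback sorting rule (alphabetical, with scalars before lists and dicts) could be sketched like this. `sort_dict` is an illustrative name and the sample dictionary is made up.

```python
from typing import Any, Dict, Tuple


def sort_dict(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Sort keys alphabetically, with scalar values ahead of lists and dicts."""

    def sort_key(item: Tuple[str, Any]) -> Tuple[bool, str]:
        key, value = item
        # False sorts before True, so scalars come first.
        return (isinstance(value, (list, dict)), key)

    return dict(sorted(obj.items(), key=sort_key))


image = {"type": "Image", "bands": [], "id": "X", "version": 1, "properties": {}}
print(list(sort_dict(image)))
# ['id', 'type', 'version', 'bands', 'properties']
```

This puts id before type, so it gets close to the Code Editor order of [type, id, version, bands, properties] without matching it exactly.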

Add pre-commit hooks

Add black and ruff pre-commit hooks. In the process, we can simplify the dev dependencies to just pre-commit, as everything will run through there.

Collapsing identical reprs broken by caching

Collapsing behavior is based on pseudo-random UUIDs generated for each <ul> in an object's repr. Because the repr is cached, printing the same object twice will return exactly the same repr with the same UUIDs. When you try to collapse the second repr, the first repr will (probably) get collapsed instead.

For some reason, collapsing works correctly in Jupyter classic, Colab, and VS Code, but is broken in Jupyter Lab (including Binder) or when a notebook is rendered statically (nbviewer). I'm guessing this is an implementation detail since the UUID issue isn't platform-dependent.

The quick solution is to cache the Earth Engine data by wrapping getInfo instead of caching the repr, but regenerating the repr may be a non-negligible performance hit for huge collections. If performance is too slow I'll need to think about another solution, e.g. caching the repr but re-generating each pair of UUIDs.
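A sketch of the second idea, keeping the cached repr but swapping in new UUIDs each time it is displayed. It assumes the IDs appear in the HTML as standard lowercase hyphenated UUID strings; `refresh_uuids` and `UUID_PATTERN` are hypothetical names.

```python
import re
import uuid

UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def refresh_uuids(cached_html: str) -> str:
    """Replace each distinct UUID with a new one, keeping paired uses in sync."""
    replacements = {}

    def swap(match: "re.Match[str]") -> str:
        old = match.group(0)
        if old not in replacements:
            replacements[old] = str(uuid.uuid4())
        return replacements[old]

    return UUID_PATTERN.sub(swap, cached_html)
```

One caveat is that any UUID-shaped value in the displayed data itself, such as an image property, would also be rewritten, so the pattern may need to be anchored to the attributes that hold the IDs.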

Check for unused cached test data

Whenever a new Earth Engine object is retrieved by tests.test_html.load_info, it is saved in the tests/data/data.json cache. If an old test is removed or an unused object is accidentally committed, the cache could get out of sync with the tests.

It would be easy enough to fix this just by resetting the cache, but the ideal solution for long term maintenance would be to check that each object is actually used by the tests and either warn the user or delete any objects that aren't used. This may be tricky, especially in cases where a user runs a subset of tests, but it's worth poking around to see if there's a simple solution for this.
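One simple approach might be to record which cache keys are read during a test session and compare them against the stored keys afterwards. `TrackedCache` is a hypothetical helper, not existing project code.

```python
from typing import Any, Dict, Set


class TrackedCache:
    """Wrap cached test data and remember which keys were read."""

    def __init__(self, data: Dict[str, Any]) -> None:
        self._data = dict(data)
        self.used: Set[str] = set()

    def get(self, key: str) -> Any:
        self.used.add(key)
        return self._data[key]

    def unused(self) -> Set[str]:
        """Keys present in the cache that no test has read so far."""
        return set(self._data) - self.used
```

The result of `unused()` is only meaningful after a full test run, so the check would probably need to be skipped or downgraded to a warning when only a subset of tests is selected.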

Null geometry repr bug

Map.draw_features[0]

AttributeError                            Traceback (most recent call last)
File D:\Code_base\anaconda\envs\GEE\lib\site-packages\IPython\core\formatters.py:342, in BaseFormatter.__call__(self, obj)
    340     method = get_real_method(obj, self.print_method)
    341     if method is not None:
--> 342         return method()
    343     return None
    344 else:

File D:\Code_base\anaconda\envs\GEE\lib\site-packages\eerepr\repr.py:82, in _ee_repr(obj)
     77 if _is_nondeterministic(obj):
     78     # We don't want to cache nondeterministic objects, so we'll add add a unique attribute
     79     # that causes ee.ComputedObject.__eq__ to return False, preventing a cache hit.
     80     setattr(obj, "_eerepr_id", uuid.uuid4())
---> 82 rep = _repr_html_(obj)
     83 mbs = len(rep) / 1e6
     84 if mbs > options.max_repr_mbs:

File D:\Code_base\anaconda\envs\GEE\lib\site-packages\eerepr\repr.py:62, in _repr_html_(obj)
     60 css = _load_css()
     61 js = _load_js()
---> 62 body = convert_to_html(info)
     64 return (
     65     "<div>"
     66     f"<style>{css}</style>"
   (...)
     71     "</div>"
     72 )

File D:\Code_base\anaconda\envs\GEE\lib\site-packages\eerepr\html.py:32, in convert_to_html(obj, key)
     30     return list_to_html(obj, key)
     31 elif isinstance(obj, dict):
---> 32     return dict_to_html(obj, key)
     34 key_html = f"<span class='ee-k'>{key}:</span>" if key is not None else ""
     35 return (
     36     "<li>"
     37     f"{key_html}"
     38     f"<span class='ee-v'>{obj}</span>"
     39     "</li>"
     40 )

File D:\Code_base\anaconda\envs\GEE\lib\site-packages\eerepr\html.py:58, in dict_to_html(obj, key)
     56 """Convert a Python dictionary to an HTML <li> element."""
     57 obj = _sort_dict(obj)
---> 58 label = _build_label(obj)
     60 header = f"{key}: " if key is not None else ""
     61 header += label

File D:\Code_base\anaconda\envs\GEE\lib\site-packages\eerepr\html.py:249, in _build_label(obj)
    246 if obj_type not in labelers:
    247     obj_type = "_Typed"
--> 249 return labelers[obj_type](obj)

File D:\Code_base\anaconda\envs\GEE\lib\site-packages\eerepr\html.py:113, in _build_feature_label(obj)
    111 def _build_feature_label(obj: dict) -> str:
    112     n = len(obj.get("properties", []))
--> 113     geom_type = obj.get("geometry", {}).get("type")
    114     type_label = f"{geom_type}, " if geom_type is not None else ""
    115     noun = "property" if n == 1 else "properties"

AttributeError: 'NoneType' object has no attribute 'get'

<ee.feature.Feature at 0x23a08257cd0>
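The traceback points at obj.get("geometry", {}): the default only applies when the key is missing, so a feature whose geometry is present but null yields None, and the chained .get fails. A possible fix is sketched below. The returned label format is a guess, since the traceback cuts off before the return statement.

```python
def build_feature_label(obj: dict) -> str:
    """Build a short label for a Feature, tolerating a null geometry."""
    n = len(obj.get("properties") or {})
    # `or {}` covers both a missing key and an explicit None value.
    geom_type = (obj.get("geometry") or {}).get("type")
    type_label = f"{geom_type}, " if geom_type is not None else ""
    noun = "property" if n == 1 else "properties"
    # Assumed label format, for illustration only.
    return f"Feature ({type_label}{n} {noun})"


null_geom = {"type": "Feature", "geometry": None, "properties": {"a": 1}}
print(build_feature_label(null_geom))
# Feature (1 property)
```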
