elapid's Issues

Convert transform keyword to **kwargs to support other sklearn model applications

The elapid.geo.apply_model_to_rasters() function currently accepts a transform keyword, which is passed through to model.predict(x, transform=transform) when the model is applied to raster data. This keyword is specific to the MaxentModel() object and determines which type of model output to produce.

Since this keyword is specific to MaxentModel() objects, the apply_model_to_rasters() function can only be used with these models. It would be cool, however, to be able to use this function with other sklearn estimators. These models (like RandomForestRegressor()) all share the same model.fit() and model.predict() structure, so we could use this function to apply more model types and compare outputs.

It would also be useful specifically for this package if we were to create new model forms (like niche envelope models) that don't use the transform keyword.
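A minimal sketch of the pass-through idea, using a simplified stand-in for the prediction step (the raster reading code is omitted, and the function name is illustrative): any extra keyword arguments are forwarded to model.predict(), so estimators that don't accept transform still work.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def predict_with_kwargs(model, covariates, **kwargs):
    # forward any estimator-specific keywords (e.g. transform="logistic"
    # for MaxentModel) straight through to the predict call
    return model.predict(covariates, **kwargs)

# works unchanged with a generic sklearn estimator that takes no extra keywords
x = np.random.rand(100, 4)
y = np.random.rand(100)
rf = RandomForestRegressor(n_estimators=10).fit(x, y)
predictions = predict_with_kwargs(rf, x)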

add background reading section to README

There are a ton of great papers on how maxent works (and how it doesn't). This issue thread will serve as a way to curate a list of relevant papers and links to include in the README. This could either be a sub-heading under Background or a separate heading like Additional resources.

Create a niche envelope model class

This is a presence-only empirical model (no absence/pseudoabsence required). It would find the min/max range of all covariates in the training data and use boolean logic to flag all predictions as inside (True) or outside (False) the range of suitability for each covariate. Rows that match all conditions (every logic check passing, i.e. np.all()) get flagged as 1 (inside the envelope of suitability).

Some features to include:

  • allow setting a percentile instead of the min/max to filter outliers.
  • allow the user to pass a set of mins or maxs, which would kinda defeat the purpose of the 'model' but allow the rapid application of model logic.
  • compute the intersection of suitability instead of the union, returning a continuous 0.0-1.0 range of suitability based on the product of all covariates that are in range.
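A rough sketch of the envelope logic described above, with the percentile option included; the class and method names are illustrative, not a proposed API:

import numpy as np

class SimpleNicheEnvelope:
    """Flag samples as inside (1) or outside (0) the training covariate ranges."""

    def __init__(self, percentile_range=(2.5, 97.5)):
        self.percentile_range = percentile_range

    def fit(self, x, y):
        # compute per-covariate bounds from the presence records only
        presence = np.asarray(x)[np.asarray(y) == 1]
        self.mins_ = np.percentile(presence, self.percentile_range[0], axis=0)
        self.maxs_ = np.percentile(presence, self.percentile_range[1], axis=0)
        return self

    def predict(self, x):
        # a row is suitable only if every covariate falls within its bounds
        in_range = (np.asarray(x) >= self.mins_) & (np.asarray(x) <= self.maxs_)
        return np.all(in_range, axis=1).astype(int)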

Update to google-style docstrings

We'll want to handle auto-generation of code documentation via mkdocs. The mkdocstrings plugin handles this well, but it requires google-style docstrings.

Set up a branch to update the docstrings for the whole repo, add pages for each set of modules to document, and update the mkdocs.yml file.
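For reference, a short example of the google-style docstring format that mkdocstrings parses; the function and its arguments are placeholders, not elapid code:

def annotate(points, raster_path):
    """Extract raster values at a set of point locations.

    Args:
        points: a GeoSeries or GeoDataFrame of point geometries.
        raster_path: path to the raster file to sample.

    Returns:
        A DataFrame of covariate values, one row per point.

    Raises:
        FileNotFoundError: if raster_path does not exist.
    """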

Handle 2d `y` variables

Sometimes slicing a single column from a data frame (e.g. y = df[['presence']]) returns a 2d array of shape (nrows, 1). This breaks some feature fitting routines, which assume a (nrows,) vector. We could probably add a method like _format_covariate_data() to check input shapes and flatten arrays accordingly.
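A minimal sketch of the shape check; the helper name is illustrative, following the _format_covariate_data() suggestion above:

import numpy as np
import pandas as pd

def format_labels(y):
    """Coerce y to a 1-d (nrows,) array, flattening (nrows, 1) inputs."""
    y = np.asarray(y)
    if y.ndim == 2 and y.shape[1] == 1:
        y = y.ravel()
    return y

# a single-column frame slice comes back 2-d; the helper flattens it
df = pd.DataFrame({"presence": [0, 1, 1, 0]})
print(format_labels(df[["presence"]]).shape)  # (4,)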

[JOSS review]: unexpected keyword argument in envelope suitability

Similar to #75, I also get an unexpected keyword argument error with the overlay options to envelope.predict in the Niche Envelope vignette.

The code I ran:

import elapid as ela

x, y = ela.load_sample_data()
envelope = ela.NicheEnvelopeModel(percentile_range=[2.5, 97.5])
envelope = ela.NicheEnvelopeModel(percentile_range=[2.5, 97.5])
envelope.fit(x, y)

envelope.predict(x, overlay='intersection')
envelope.predict(x, overlay='union')
envelope.predict(x, overlay='average')

And the output:

Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import elapid as ela

In [2]: x, y = ela.load_sample_data()

In [3]: envelope = ela.NicheEnvelopeModel(percentile_range=[2.5, 97.5])

In [4]: envelope = ela.NicheEnvelopeModel(percentile_range=[2.5, 97.5])

In [5]: envelope.fit(x, y)
Out[5]: NicheEnvelopeModel()

In [6]: envelope.predict(x, overlay='intersection')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [6], line 1
----> 1 envelope.predict(x, overlay='intersection')

TypeError: NicheEnvelopeModel.predict() got an unexpected keyword argument 'overlay'

In [7]: envelope.predict(x, overlay='union')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [7], line 1
----> 1 envelope.predict(x, overlay='union')

TypeError: NicheEnvelopeModel.predict() got an unexpected keyword argument 'overlay'

In [8]: envelope.predict(x, overlay='average')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [8], line 1
----> 1 envelope.predict(x, overlay='average')

TypeError: NicheEnvelopeModel.predict() got an unexpected keyword argument 'overlay'

speed up apply_model_to_raster via smarter no-data handling

Currently, in multi-raster/multi-band reads, every band is read before the no-data pixels are evaluated, and model application is only skipped once a tile turns out to be all nodata. This is prohibitively slow in cases where a) there is resampling involved and b) there are a lot of no-data tiles.

Instead, this could be simplified by checking the no-data values as each band is read and moving on to the next tile as soon as one band is all no-data.
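A sketch of the per-band short-circuit, assuming rasterio windowed reads; the function name and structure are illustrative:

import numpy as np
import rasterio

def window_is_all_nodata(raster_paths, window):
    """Return True as soon as any band read for this window is entirely nodata."""
    for path in raster_paths:
        with rasterio.open(path) as src:
            for band in range(1, src.count + 1):
                data = src.read(band, window=window)
                if src.nodata is not None and np.all(data == src.nodata):
                    # bail out without reading the remaining bands for this tile
                    return True
    return False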

package install challenges from `conda-forge`

@chrisborges reported the following issues when installing from conda:

The WorkingWithGeospatialData notebook is great. I was able to reproduce almost everything and enjoyed the thorough explanation. Unfortunately, I was not able to reproduce the histogram graph.
Some things that threw errors: TypeError: zonal_stats() got an unexpected keyword argument 'quiet'
In both "WorkingWithGeospatialData" and "A simple maxent model" I was unable to run the merged command:
merged = ela.stack_geodataframes(presence, background, add_class_label=True)
Issue: AttributeError: module 'elapid' has no attribute 'stack_geodataframes'

These appear to be version issues that should be resolved in 0.3.20. But they persisted even after running conda update elapid.

Unfortunately I have not been able to reproduce these errors. I was able to run each demo notebook end-to-end after installing elapid in a clean conda environment.

To diagnose, @chrisborges, would you be able to run the following?

import elapid
print(elapid.__version__)

If the version isn't 0.3.20 then we'll want to make sure you get the latest version, e.g. by updating the package from conda-forge or reinstalling with pip.

If you do have the most recent version and you're still experiencing issues then I'll have to go back to the drawing board.

`progress_apply` breaks in recent pandas packages

I don't recall the exact version number, but at some point in recent pandas releases the tqdm progress_apply() function, which we use to track the progress of point value extractions from raster data, breaks. We should explicitly set a maximum version for the pandas requirement.

Simplify apply_model_to_raster windowed read logic

Right now there are separate for loops for the windowed=True and windowed=False keywords. This could be simplified by setting the window size to the full extent of the image, creating an iterator of length 1, and keeping all the logic in a single windowed-read for loop.
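One possible shape for that iterator, sketched with rasterio; the filename and function name are placeholders:

import rasterio
from rasterio.windows import Window

def iter_windows(src, windowed=True):
    """Yield read windows: the raster's native blocks, or one full-extent window."""
    if windowed:
        for _, window in src.block_windows():
            yield window
    else:
        # a length-1 iterator covering the whole image
        yield Window(0, 0, src.width, src.height)

# the same read loop then handles both cases
with rasterio.open("covariates.tif") as src:  # placeholder path
    for window in iter_windows(src, windowed=False):
        data = src.read(window=window)
        # ... apply the model to this window's data here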

[JOSS review]: data to reproduce the examples in the documentation

Would it be possible to have access to data files to reproduce the examples in the documentation? For instance, the "A Simple Maxent Model" vignette uses a .gpkg and a .tif file. The "Working with Geospatial data" vignette also uses a .shp file.

I think having completely reproducible examples would really help users understand how to use this package correctly and would help you show how useful it can be. I often find it hard to know which kind of data to use without a working example, even if it's only a basic one.

I tried using the data in the tests folder as a replacement but I'm running into CRS errors. These are probably easily fixable, but I'm sure users would appreciate having a working example while learning how to use the package.

Report progress for raster annotation

The raster_values_from_geoseries() function has iterables for reading data 1) for each raster input and 2) for each point feature. We should add some way to track the progress for each of these with tqdm.

Since one of the iterations is via df.apply(), we could use tqdm's progress_apply() integration for that step.
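A minimal sketch of that integration (the extraction function here is a placeholder): registering tqdm's pandas hook exposes progress_apply(), and the per-raster loop can be wrapped in tqdm() directly.

import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers DataFrame.progress_apply / Series.progress_apply

df = pd.DataFrame({"x": range(1000)})
# swap the placeholder lambda for the per-point raster value extraction
values = df.progress_apply(lambda row: row["x"] * 2, axis=1)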

GDAL_DATA path missing if gdal conda package isn't installed

In testing pip install elapid from a clean conda environment (conda create --name test python==3.7), the following error message is displayed:

ERROR 4: Unable to open EPSG support file gcs.csv.  Try setting the GDAL_DATA environment variable to point to the directory containing EPSG csv files.

I believe this occurs because elapid only requires rasterio, which uses the minimal libgdal requirement and not the full gdal/gdal-bin install. This greatly simplifies build requirements but introduces this error.

Rasterio has documented the issue.

fiona packages its own gdal_data directory, which is here on my conda install: $CONDA_ENV_DIR/lib/python3.7/site-packages/fiona/gdal_data/.

I suspect the GDAL_DATA variable could be set with os.environ.update(GDAL_DATA='/path/to/fiona/gdal_data/'). But any fix will have to check that the variable isn't already referenced by other packages and that it doesn't overwrite a user default.
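A hedged sketch of that check, assuming fiona is importable; this only sets GDAL_DATA when the environment hasn't already defined it:

import os

def set_gdal_data_fallback():
    """Point GDAL_DATA at fiona's packaged directory if the user hasn't set it."""
    if "GDAL_DATA" in os.environ:
        return  # respect an existing user or system default
    try:
        import fiona
    except ImportError:
        return
    gdal_data = os.path.join(os.path.dirname(fiona.__file__), "gdal_data")
    if os.path.isdir(gdal_data):
        os.environ["GDAL_DATA"] = gdal_data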

There's no testing suite

This means it's very easy to break code. The current version should be tested before making big changes (like potentially rolling back the required versions of library dependencies).
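A minimal pytest sketch to seed a suite with, assuming the load_sample_data() interface shown elsewhere in these issues:

import elapid

def test_load_sample_data_shapes():
    # sample covariates and labels should load and have matching lengths
    x, y = elapid.load_sample_data()
    assert len(x) > 0
    assert len(x) == len(y)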

[JOSS review] Cannot determine CRS in Maxent vignette

Hi @earth-chris! I'm running into a CRS error when trying to reproduce the Maxent example (the geospatial data one runs fine, and thank you for making it so detailed, by the way!).

Everything runs fine until this line in the second cell:

>>> merged = ela.stack_geodataframes(presence, background, add_class_label=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gabriel/github/elapid/elapid/geo.py", line 101, in stack_geodataframes
    merged = pd.concat((presence[matching], background[matching]), axis=0, ignore_index=True)
  File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 381, in concat
    return op.get_result()
  File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 616, in get_result
    new_data = concatenate_managers(
  File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/core/internals/concat.py", line 226, in concatenate_managers
    values = concat_compat(vals, axis=1)
  File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/core/dtypes/concat.py", line 133, in concat_compat
    return cls._concat_same_type(to_concat)
  File "/home/gabriel/.local/lib/python3.10/site-packages/geopandas/array.py", line 1323, in _concat_same_type
    return GeometryArray(data, crs=_get_common_crs(to_concat))
  File "/home/gabriel/.local/lib/python3.10/site-packages/geopandas/array.py", line 1413, in _get_common_crs
    raise ValueError(
ValueError: Cannot determine common CRS for concatenation inputs, got ['WGS 84 / UTM zone 10N', 'WGS 84 / UTM zone 10N']. Use `to_crs()` to transform geometries to the same CRS before merging.

Probably just an easy thing to update, I hope.
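Until the CRS handling in stack_geodataframes() is sorted out, one possible workaround (sketched with placeholder file paths standing in for the vignette's presence/background frames) is to re-project one frame onto the other's CRS before merging, as the error message suggests. This may sidestep rather than fix the underlying comparison bug.

import elapid as ela
import geopandas as gpd

# placeholder paths standing in for the vignette's data files
presence = gpd.read_file("presence.gpkg")
background = gpd.read_file("background.gpkg")

# put both frames on one common CRS before merging
background = background.to_crs(presence.crs)
merged = ela.stack_geodataframes(presence, background, add_class_label=True)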

Citing elapid library

Hi, I want to cite the elapid library in my paper. Is your library planned to be published as a paper, or is there a proper way to cite elapid?

[JOSS review]: Sample data missing at installation

Hi @earth-chris! I'm opening a few issues here for my JOSS review. These will be mostly about bugs or issues when trying to reproduce the examples in your documentation.

I'm having trouble loading the sample data with elapid.load_sample_data(). The files appear to be missing from my local installation (as you can see in the output below). I double-checked at the path given in the error message (/home/gabriel/.local/lib/python3.10/site-packages/elapid/) and I simply do not have a data folder or the bradypus.csv.gz file.

I managed to debug it by downloading the data from this repo and moving it to a data folder in the right path. After that elapid.load_sample_data() works as expected.

I installed elapid using pip install elapid as explained in the documentation. I'm using elapid v0.3.18 and Python 3.10.6 on Ubuntu 22.04.

Here is the detailed output:

Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import elapid
>>> x, y = elapid.load_sample_data()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gabriel/.local/lib/python3.10/site-packages/elapid/utils.py", line 54, in load_sample_data
    df = pd.read_csv(file_path, compression="gzip").astype("int64")
  File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1735, in _make_engine
    self.handles = get_handle(
  File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/common.py", line 750, in get_handle
    handle = gzip.GzipFile(  # type: ignore[assignment]
  File "/usr/lib/python3.10/gzip.py", line 174, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/gabriel/.local/lib/python3.10/site-packages/elapid/data/bradypus.csv.gz'
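On the packaging side, the likely fix is to make sure the gzipped sample file ships with the distribution; a hedged sketch assuming a setuptools build (the project's actual build configuration may differ):

# setup.py (sketch)
from setuptools import setup, find_packages

setup(
    name="elapid",
    packages=find_packages(),
    include_package_data=True,
    package_data={"elapid": ["data/*.csv.gz"]},
)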

pseudoabsence_from_bias_file not properly accounting for bias

Currently we set the probability of drawing a sample to be even with the frequency at which its value occurs in the raster population. Instead, it should adjust for that frequency, drawing more samples in areas with low frequencies. Weighting by 1 / probabilities (or something similar) would probably handle this.
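A sketch of the 1 / probabilities adjustment; the function name and the synthetic bias values are illustrative:

import numpy as np

def inverse_frequency_weights(bias_values):
    """Weight candidates by 1 / value frequency so rare areas are sampled more often."""
    values, counts = np.unique(bias_values, return_counts=True)
    freq = dict(zip(values, counts / counts.sum()))
    weights = np.array([1.0 / freq[v] for v in bias_values])
    return weights / weights.sum()

# draw pseudoabsence indices with the adjusted probabilities
bias_values = np.random.randint(0, 3, size=1000)
probabilities = inverse_frequency_weights(bias_values)
idx = np.random.choice(len(bias_values), size=100, replace=False, p=probabilities)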

[JOSS review]: unexpected keyword argument with feature transformations

Running the Maxent vignette example (after the workaround I described in #74 to access the sample data), using the is_features=True flag in model.fit after the feature transformations returns the following error:
TypeError: MaxentModel.fit() got an unexpected keyword argument 'is_features'.

I get this error after running the code described in the vignette.

import elapid

x, y = elapid.load_sample_data()
model = elapid.MaxentModel()
model.fit(x, y)

model.predict(x) # add

# Feature Transformations
features = elapid.MaxentFeatureTransformer()
z = features.fit_transform(x)
model.fit(z, y, is_features=True)

Here is the detailed output from IPython:

Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import elapid

In [2]: x, y = elapid.load_sample_data()

In [3]: model = elapid.MaxentModel()

In [4]: model.fit(x, y)
Out[4]: MaxentModel()

In [5]: features = elapid.MaxentFeatureTransformer()

In [6]: z = features.fit_transform(x)

In [7]: model.fit(z, y, is_features=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [7], line 1
----> 1 model.fit(z, y, is_features=True)

TypeError: MaxentModel.fit() got an unexpected keyword argument 'is_features'

handle dropped NA values when annotating a `GeoDataFrame`

When annotating an existing geoseries, the geometry is passed to the annotation function, then pandas is used to concatenate the original and the annotated data.

This leads to issues when drop_na=True and the two dfs don't share the same number of rows: the annotated data no longer lines up with the original df, and the difference in length means NaN is appended to the missing rows.

This should probably be handled by running drop_na after the dfs are concatenated to ensure the indices match before dropping.
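A minimal sketch of that reordering (names illustrative): concatenate the frames first so the indices line up, then drop incomplete rows.

import pandas as pd

def merge_annotations(original, annotated, drop_na=True):
    """Join annotated covariates back onto the original frame before dropping NaNs."""
    merged = pd.concat(
        [original.reset_index(drop=True), annotated.reset_index(drop=True)], axis=1
    )
    if drop_na:
        merged = merged.dropna()
    return merged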

CI for mkdocstrings is broken

This is because it's running the documentation generator from a simple ubuntu container: running pip install elapid from there is no good (too many other dependencies).

Add a Dockerfile and a Makefile to build and deploy a container that has the conda environment installed. Then update the run commands to be something like:

  • run: conda activate elapid
  • run: pip install [docs dependencies]
  • run: mkdocs gh-deploy
