earth-chris / elapid
Species distribution modeling tools, including a python implementation of Maxent
Home Page: https://elapid.org
License: MIT License
The `elapid.geo.apply_model_to_rasters()` function currently accepts a `transform` keyword, which is passed to `model.predict(x, transform=transform)` when applied to raster data. This keyword is specific to the `MaxentModel()` object, and determines which type of model output to produce.
Since this keyword is specific to `MaxentModel()` objects, the `apply_model_to_rasters()` function can only be used with these models. It would be cool, however, to be able to use this function in conjunction with other `sklearn` model estimators. These models (like `RandomForestRegressor()`) all have the same `model.fit()` and `model.predict()` structure, so we could use this function to apply more model types and compare outputs.
It would also be useful specifically for this package if we were to create new model forms (like niche envelope models) that don't use the `transform` keyword.
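One way to support generic estimators would be to forward `transform` only when the model's `predict()` actually accepts it. A minimal sketch, assuming nothing about elapid's internals (the helper name is hypothetical):

```python
import inspect

def predict_maybe_transform(model, x, transform=None):
    """Call model.predict(), forwarding `transform` only if the model
    accepts that keyword (as MaxentModel does)."""
    accepts = "transform" in inspect.signature(model.predict).parameters
    if transform is not None and accepts:
        return model.predict(x, transform=transform)
    return model.predict(x)
```

This keeps the raster-application logic model-agnostic: plain sklearn estimators take the `predict(x)` path, while Maxent-style models still receive the keyword.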
Covariates with all zeros lead to divide-by-zero errors during hinge feature calculations. These should either be handled by dropping those covariates or by using something like `np.divide(this, that, where=that > 0)`.
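The `where` approach might look like the following sketch (the variable names are illustrative, not elapid's):

```python
import numpy as np

x = np.zeros(4)  # an all-zero covariate column
span = x.max() - x.min()  # 0.0, which would normally divide by zero

# only divide where the denominator is positive; elsewhere keep the
# pre-filled zeros instead of producing inf/nan or raising a warning
hinge = np.divide(x - x.min(), span, out=np.zeros_like(x), where=span > 0)
```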
I had to install a fortran compiler (windows x64) before I could install `glmnet`. I followed these directions: https://people.eng.unimelb.edu.au/ammoffat/teaching/20005/Install-MinGW.pdf
There are a ton of great papers on how maxent works (and how it doesn't). This issue thread will serve as a way to curate a list of relevant papers and links to include in the README. This could either be a sub-heading under Background or a separate heading like Additional resources.
Methods for labeling point data (`raster_values_from_geoseries()`, for example) return columns of data type `object`. This should be explicitly set by the method, which can pull the dtype information from the rasterio source it's reading from.
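A sketch of the proposed fix, with a hard-coded dtype tuple standing in for what rasterio's `src.dtypes` would report (one dtype string per band):

```python
import numpy as np
import pandas as pd

# extracted point values often come back as dtype "object"
df = pd.DataFrame({"band_1": [1.5, 2.0, 3.25]}, dtype=object)

# rasterio exposes the source dtypes via src.dtypes, e.g. ("float32",);
# hard-coded here so the sketch is self-contained
src_dtypes = ("float32",)
df["band_1"] = df["band_1"].astype(src_dtypes[0])
```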
This is a presence-only empirical model (no absence/pseudoabsence required). It would find the min/max range of all covariates in the training data, and use boolean logic to flag all predictions as inside (`True`) or outside (`False`) the range of suitability for each covariate. Rows that match all conditions (the union of all logic checks, or `np.all()`) get flagged as `1` (inside the envelope of suitability).
Some features to include:
- `mins` or `maxs`, which would kinda defeat the purpose of the 'model' but allow the rapid application of model logic.
- the `intersection` of suitability instead of the union, returning a continuous 0.0-1.0 range of suitability based on the product of all covariates that are in range.
We'll want to handle auto-generation of code documentation via `mkdocs`. The `mkdocstrings` plugin handles this well, but it requires google-style docstrings.
Set up a branch to update the docstrings for the whole repo, add pages for each set of modules to document, and update the `mkdocs.yml` file.
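For reference, the google-style docstrings that `mkdocstrings` parses look like this; the function and its arguments are purely illustrative:

```python
def annotate(points, raster_path):
    """Extract raster values at point locations.

    Args:
        points: a GeoSeries of point geometries.
        raster_path: path to the raster file to sample.

    Returns:
        A DataFrame of sampled values, one column per raster band.
    """
```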
Sometimes slicing a single column from a data frame (e.g. `y = df['presence']`) returns a 2d array of shape `(nrows, 1)`. This breaks some feature fitting routines because the code assumes a `(nrows,)` vector. Could probably add a method like `_format_covariate_data()` to check input shapes and flatten arrays accordingly.
We should trigger python package tests on all PRs.
Similar to #75, I also get an unexpected keyword argument error with the `overlay` options to `envelope.predict` in the Niche Envelope vignette.
The code I ran:
import elapid as ela
x, y = ela.load_sample_data()
envelope = ela.NicheEnvelopeModel(percentile_range=[2.5, 97.5])
envelope = ela.NicheEnvelopeModel(percentile_range=[2.5, 97.5])
envelope.fit(x, y)
envelope.predict(x, overlay='intersection')
envelope.predict(x, overlay='union')
envelope.predict(x, overlay='average')
And the output:
Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import elapid as ela
In [2]: x, y = ela.load_sample_data()
In [3]: envelope = ela.NicheEnvelopeModel(percentile_range=[2.5, 97.5])
In [4]: envelope = ela.NicheEnvelopeModel(percentile_range=[2.5, 97.5])
In [5]: envelope.fit(x, y)
Out[5]: NicheEnvelopeModel()
In [6]: envelope.predict(x, overlay='intersection')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [6], line 1
----> 1 envelope.predict(x, overlay='intersection')
TypeError: NicheEnvelopeModel.predict() got an unexpected keyword argument 'overlay'
In [7]: envelope.predict(x, overlay='union')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [7], line 1
----> 1 envelope.predict(x, overlay='union')
TypeError: NicheEnvelopeModel.predict() got an unexpected keyword argument 'overlay'
In [8]: envelope.predict(x, overlay='average')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [8], line 1
----> 1 envelope.predict(x, overlay='average')
TypeError: NicheEnvelopeModel.predict() got an unexpected keyword argument 'overlay'
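For context, a minimal sketch of what the `overlay` options described in the vignette could compute (hypothetical, not the elapid implementation):

```python
import numpy as np

def envelope_overlay(in_range, overlay="union"):
    """Combine per-covariate envelope checks.

    in_range: boolean array of shape (nrows, ncovariates) flagging
    whether each covariate falls within the training envelope.
    """
    if overlay == "intersection":
        return np.all(in_range, axis=1).astype(float)
    if overlay == "union":
        return np.any(in_range, axis=1).astype(float)
    if overlay == "average":
        return in_range.mean(axis=1)
    raise ValueError(f"unrecognized overlay option: {overlay}")
```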
Currently, in multi-raster/multi-band reads, all bands are read before the no-data pixels are evaluated, skipping model application if a tile is all nodata. This is prohibitively slow in cases where a) there is resampling involved and b) there are a lot of no-data tiles.
Instead, this could be simplified by checking the no-data values as each band is read and moving on to the next tile if it's all no-data.
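A sketch of the short-circuit read, assuming rasterio-style sources with a `read()` method (the function name and signature are illustrative):

```python
import numpy as np

def read_tile_skip_nodata(sources, window, nodata):
    """Read bands one at a time, bailing out as soon as any band is
    entirely nodata, instead of reading everything before checking."""
    bands = []
    for src in sources:
        band = src.read(1, window=window)
        if np.all(band == nodata):
            return None  # skip this tile without reading remaining bands
        bands.append(band)
    return np.stack(bands)
```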
This will be nice for minimizing logs during batch processing.
@chrisborges reported the following issues when installing from conda:
The WorkingWithGeospatialData notebook is great. I was able to reproduce almost everything and enjoyed the thorough explanation. Unfortunately, I was not able to reproduce the histogram graph.
Some things that threw errors: `TypeError: zonal_stats() got an unexpected keyword argument 'quiet'`
In both "WorkingWithGeospatialData" and "A simple maxent model" I was unable to run the merged command:
merged = ela.stack_geodataframes(presence, background, add_class_label=True)
Issue: AttributeError: module 'elapid' has no attribute 'stack_geodataframes'
These appear to be version issues that should be resolved in `0.3.20`. But they persisted even after running `conda update elapid`.
Unfortunately I have not been able to reproduce these errors. I was able to run each demo notebook end-to-end after installing elapid in a clean conda environment.
To diagnose, @chrisborges, would you be able to run the following?
import elapid
print(elapid.__version__)
If the version isn't `0.3.20` then we'll want to make sure you get the latest version. To do that, you could try the following:
conda install elapid=0.3.20
If you do have the most recent version and you're still experiencing issues then I'll have to go back to the drawing board.
Automated docs builds should install the dependencies during CI instead of in the package build. These should be declared in `requirements-dev.txt` instead, as well as the pre-commit utils.
I don't recall the exact version number, but at some point a pandas update breaks the tqdm `progress_apply()` function, which we use to track the progress of point value extractions from raster data. We should explicitly set a maximum version number for `pandas`.
Right now there are different `for` loops for the `windowed=True` and `windowed=False` keywords. This could be simplified by setting the window size to the full extent of the image, creating an iterator of length 1, and keeping all the logic in a single windowed-read `for` loop.
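A sketch of the single-loop idea, with plain tuples standing in for `rasterio.windows.Window` objects (the function and window format are illustrative):

```python
def iter_windows(width, height, block_shape=(256, 256), windowed=True):
    """Yield (row_off, col_off, win_height, win_width) tuples. With
    windowed=False, yield one window covering the full image so the
    read logic can live in a single loop."""
    if not windowed:
        yield (0, 0, height, width)
        return
    block_height, block_width = block_shape
    for row in range(0, height, block_height):
        for col in range(0, width, block_width):
            yield (row, col,
                   min(block_height, height - row),
                   min(block_width, width - col))
```

Either way, the caller iterates the same generator, so the two code paths collapse into one loop.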
Would it be possible to have access to the data files needed to reproduce the examples in the documentation? For instance, the "A Simple Maxent Model" vignette uses a .gpkg and a .tif file, and the "Working with Geospatial Data" vignette also uses a .shp file.
I think having completely reproducible examples would really help users understand how to use this package correctly and would help you show how useful it can be. I often find it hard to know which kind of data to use without a working example, even if it's only a basic one.
I tried using the data in the `tests` folder as a replacement, but I'm running into CRS errors. These are probably easily fixable, but I'm sure users would appreciate having a working example while learning how to use the package.
The Java implementation of Maxent includes tools for creating partial dependence plots and quantifying permutation importance. These haven't been ported to `elapid` yet but should be.
This can be implemented using native sklearn tools, including `permutation_importance` and `partial_dependence`.
Maybe we can add these as class methods for all elapid models?
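For instance, `sklearn.inspection.permutation_importance` works with any fitted estimator exposing `fit`/`predict`, so it should apply to elapid models too. Here it's shown on a stock sklearn classifier with synthetic data (everything below is illustrative, not the elapid API):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# synthetic stand-in for covariate/presence data
x, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(x, y)

# shuffle each covariate in turn and measure the drop in model score
result = permutation_importance(model, x, y, n_repeats=5, random_state=0)
print(result.importances_mean)  # one importance score per covariate
```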
Maxent uses a `tau` parameter to scale relative suitability predictions into detection probability scores. This is not currently implemented, as the translation from the R maxnet predict function didn't include this parameter, effectively fixing `tau = 0.5`.
We should look at the underlying math and apply tau scaling in the `elapid.MaxentModel.predict()` function (link here).
You can read more about `tau` in the maxent for ecologists paper and the practical guide to maxent paper.
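As a sketch of the underlying math (following the tau-scaled logistic transform described in the maxent for ecologists paper; not the elapid implementation), the fix would generalize the logistic output so it reduces to the current behavior at `tau = 0.5`:

```python
import numpy as np

def logistic_with_tau(raw, entropy, tau=0.5):
    """Tau-scaled logistic output. With tau = 0.5 this reduces to the
    standard transform exp(raw + entropy) / (1 + exp(raw + entropy))."""
    link = np.exp(raw + entropy)
    return tau * link / (1 - tau + tau * link)
```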
The `raster_values_from_geoseries()` function has iterables for reading data 1) for each raster input and 2) for each point feature. We should add some way to track the progress of each of these with `tqdm`.
Since one of the iterations is via `df.apply()`, we could use this feature.
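The feature referred to is presumably tqdm's pandas integration, which registers a `progress_apply()` drop-in replacement for `apply()`:

```python
import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers Series/DataFrame .progress_apply()

df = pd.DataFrame({"x": range(100)})
# identical to df["x"].apply(...), but renders a progress bar
result = df["x"].progress_apply(lambda v: v * 2)
```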
In testing `pip install elapid` from a clean conda environment (`conda create --name test python==3.7`), the following error message is displayed:
ERROR 4: Unable to open EPSG support file gcs.csv. Try setting the GDAL_DATA environment variable to point to the directory containing EPSG csv files.
I believe this occurs because `elapid` only requires `rasterio`, which uses the minimal `libgdal` requirement and not the full `gdal`/`gdal-bin` install. This greatly simplifies build requirements but introduces this error.
Rasterio has documented the issue.
`fiona` packages its own `gdal_data` directory, which is here on my conda install: `$CONDA_ENV_DIR/lib/python3.7/site-packages/fiona/gdal_data/`.
I suspect the `GDAL_DATA` variable could be set with `os.environ.update(GDAL_DATA='/path/to/fiona/gdal_data/')`. But this will have to check that it's not referenced by other packages and that it's not overwriting a user default.
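A sketch of setting the variable defensively, only touching `GDAL_DATA` when no user or system default exists (the helper name is made up):

```python
import os

def ensure_gdal_data():
    """Point GDAL_DATA at fiona's bundled gdal_data directory, but only
    when the variable isn't already set."""
    if "GDAL_DATA" in os.environ:
        return  # respect an existing user/system default
    try:
        import fiona
    except ImportError:
        return
    gdal_data = os.path.join(os.path.dirname(fiona.__file__), "gdal_data")
    if os.path.isdir(gdal_data):
        os.environ["GDAL_DATA"] = gdal_data
```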
This means it's very easy to break code. The current version should be tested before making big changes (like potentially rolling back the required versions of library dependencies).
Hi @earth-chris! I'm running into a CRS error when trying to reproduce the Maxent example (the geospatial data one runs fine, and thank you for making it so detailed, by the way!).
Everything runs fine until this line in the second cell:
>>> merged = ela.stack_geodataframes(presence, background, add_class_label=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/gabriel/github/elapid/elapid/geo.py", line 101, in stack_geodataframes
merged = pd.concat((presence[matching], background[matching]), axis=0, ignore_index=True)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 381, in concat
return op.get_result()
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 616, in get_result
new_data = concatenate_managers(
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/core/internals/concat.py", line 226, in concatenate_managers
values = concat_compat(vals, axis=1)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/core/dtypes/concat.py", line 133, in concat_compat
return cls._concat_same_type(to_concat)
File "/home/gabriel/.local/lib/python3.10/site-packages/geopandas/array.py", line 1323, in _concat_same_type
return GeometryArray(data, crs=_get_common_crs(to_concat))
File "/home/gabriel/.local/lib/python3.10/site-packages/geopandas/array.py", line 1413, in _get_common_crs
raise ValueError(
ValueError: Cannot determine common CRS for concatenation inputs, got ['WGS 84 / UTM zone 10N', 'WGS 84 / UTM zone 10N']. Use `to_crs()` to transform geometries to the same CRS before merging.
Probably just an easy thing to update, I hope!
Several CI steps should only be run under certain conditions. For example, we should only build documentation after we're sure the package passes all tests. This can be set up with `workflow_run` commands (https://stackoverflow.com/questions/58457140/dependencies-between-workflows-on-github-actions).
Hi, I want to cite the elapid library in my paper. Do you plan to publish the library as a paper, or is there a proper way to cite elapid?
Functions like `raster_values_from_geoseries()` could be named more simply, like `annotate_geoseries()`, which is consistent with the current documentation and with a term that I expect will become more common for describing this process.
Multiple `model.fit()` calls produce invalid predictions because some state variables (probably the alpha or entropy scores) aren't cleared.
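A common fix for this pattern (sketched here with hypothetical attribute names, not elapid's actual state) is to reset fitted state at the top of `fit()`:

```python
class ModelSketch:
    """Illustrative only: clear fitted state on each fit() call so
    repeated fits don't reuse stale alpha/entropy values."""

    def __init__(self):
        self.alpha_ = None
        self.entropy_ = None

    def fit(self, x, y):
        # reset state from any previous fit before training
        self.alpha_ = None
        self.entropy_ = None
        ...  # training logic would follow
        return self
```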
It should be more than just a big README!
Hi @earth-chris! I'm opening a few issues here for my JOSS review. These will be mostly about bugs or issues when trying to reproduce the examples in your documentation.
I'm having trouble loading the sample data with `elapid.load_sample_data()`. The files appear to be missing from my local installation (as you can see in the output below). I double-checked the path given in the error message (`/home/gabriel/.local/lib/python3.10/site-packages/elapid/`) and I simply do not have a `data` folder or the `bradypus.csv.gz` file.
I managed to debug it by downloading the data from this repo and moving it to a `data` folder in the right path. After that, `elapid.load_sample_data()` works as expected.
I installed elapid using `pip install elapid` as explained in the documentation. I'm using elapid v0.3.18 and Python 3.10.6 on Ubuntu 22.04.
Here is the detailed output:
Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import elapid
>>> x, y = elapid.load_sample_data()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/gabriel/.local/lib/python3.10/site-packages/elapid/utils.py", line 54, in load_sample_data
df = pd.read_csv(file_path, compression="gzip").astype("int64")
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
self._engine = self._make_engine(f, self.engine)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1735, in _make_engine
self.handles = get_handle(
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/common.py", line 750, in get_handle
handle = gzip.GzipFile( # type: ignore[assignment]
File "/usr/lib/python3.10/gzip.py", line 174, in __init__
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/gabriel/.local/lib/python3.10/site-packages/elapid/data/bradypus.csv.gz'
There's a bunch of vestigial stuff in the `Makefile` now that could be removed, like all of the pip/conda/docker update stuff.
Conda skeleton creates templates for building and deploying conda packages automatically with GitHub Actions: https://conda-forge.org/docs/user/ci-skeleton.html (first implemented here: https://github.com/conda-forge/conda-smithy/pull/1461/files)
Currently we set the probability of drawing a sample equal to the frequency with which it occurs in the raster population. However, it should adjust for that frequency, drawing more samples in areas with low frequencies. Weighting by `1 / probabilities` or something similar would probably handle this.
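A sketch of the inverse-frequency weighting idea on a toy 1-d "raster" (everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy population: value 0 occurs 90% of the time, value 1 only 10%
values = np.concatenate([np.zeros(900, int), np.ones(100, int)])

counts = np.bincount(values)
freq = counts[values] / values.size  # per-pixel frequency

# inverse-frequency weights: rare values get drawn more often
weights = 1.0 / freq
weights /= weights.sum()
sample = rng.choice(values, size=300, p=weights)
```

With these weights, each class contributes roughly equally to the sample despite the 9:1 imbalance in the population.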
Running the Maxent vignette example (after the workaround I described in #74 to access the sample data), using the `is_features=True` flag in `model.fit` after the feature transformations returns the following error: `TypeError: MaxentModel.fit() got an unexpected keyword argument 'is_features'`.
I get this error after running the code described in the vignette.
import elapid
x, y = elapid.load_sample_data()
model = elapid.MaxentModel()
model.fit(x, y)
model.predict(x) # add
# Feature Transformations
features = elapid.MaxentFeatureTransformer()
z = features.fit_transform(x)
model.fit(z, y, is_features=True)
Here is the detailed output from IPython:
Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import elapid
In [2]: x, y = elapid.load_sample_data()
In [3]: model = elapid.MaxentModel()
In [4]: model.fit(x, y)
Out[4]: MaxentModel()
In [5]: features = elapid.MaxentFeatureTransformer()
In [6]: z = features.fit_transform(x)
In [7]: model.fit(z, y, is_features=True)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [7], line 1
----> 1 model.fit(z, y, is_features=True)
TypeError: MaxentModel.fit() got an unexpected keyword argument 'is_features'
When annotating an existing geoseries, the geometry is passed to the annotation function, then `pandas` is used to concatenate the original and the annotated data.
This leads to issues when `drop_na=True` and the two dataframes don't share the same number of rows: the annotated data then mismatches the original dataframe, and the difference in length means NaN is appended to the missing rows.
This should probably be handled by running `drop_na` after the dataframes are concatenated, to ensure the indices match before dropping.
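The concat-then-drop order can be illustrated with plain pandas (toy data, not elapid's internals):

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"id": [0, 1, 2, 3]})
annotated = pd.DataFrame({"band_1": [1.0, np.nan, 3.0, 4.0]})

# concatenate first so row indices stay aligned, then drop NaN rows;
# dropping rows before the concat would misalign the two frames and
# leave NaN appended to the shorter one
merged = pd.concat([original, annotated], axis=1).dropna()
```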
This is because it's running the documentation generator from a simple ubuntu container: running `pip install elapid` from there is no good (too many other dependencies).
Add a `Dockerfile` and a `Makefile` to build and deploy a container that has the conda environment installed. Then update the run commands to be something like:
conda activate elapid
pip install [docs dependencies]
mkdocs gh-deploy