earth-chris / elapid
Species distribution modeling tools, including a python implementation of Maxent
Home Page: https://elapid.org
License: MIT License
The `elapid.geo.apply_model_to_rasters()` function currently accepts a `transform` keyword, which is passed to `model.predict(x, transform=transform)` when applied to raster data. This keyword is specific to the `MaxentModel()` object, and determines which type of model output to produce.
Since this keyword is specific to `MaxentModel()` objects, the `apply_model_to_rasters()` function can only be used with these models. It would be cool, however, to be able to use this function in conjunction with other `sklearn` model estimators. These models (like `RandomForestRegressor()`) all have the same `model.fit()` and `model.predict()` structure, so we could use this function to apply more model types and compare outputs.
It would also be useful specifically for this package if we were to create new model forms (like niche envelope models) that don't use the `transform` keyword.
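One way to support generic estimators would be to forward `transform` only when the model's `predict()` actually accepts it. A minimal sketch, assuming nothing about elapid's internals (the helper name is hypothetical):

```python
import inspect

def predict_maybe_transform(model, x, transform=None):
    """Call model.predict(), forwarding `transform` only if the model
    accepts that keyword (as MaxentModel does)."""
    accepts = "transform" in inspect.signature(model.predict).parameters
    if transform is not None and accepts:
        return model.predict(x, transform=transform)
    return model.predict(x)
```

This keeps the raster-application logic model-agnostic: plain sklearn estimators take the `predict(x)` path, while Maxent-style models still receive the keyword.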
Covariates with all zeros lead to divide-by-zero errors during hinge feature calculations. These should either be handled by dropping those covariates or by using something like `np.divide(this, that, where=that > 0)`.
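The `where` approach might look like the following sketch (the variable names are illustrative, not elapid's):

```python
import numpy as np

x = np.zeros(4)  # an all-zero covariate column
span = x.max() - x.min()  # 0.0, which would normally divide by zero

# only divide where the denominator is positive; elsewhere keep the
# pre-filled zeros instead of producing inf/nan or raising a warning
hinge = np.divide(x - x.min(), span, out=np.zeros_like(x), where=span > 0)
```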
I had to install a fortran compiler (windows x64) before I could install `glmnet`. I followed these directions: https://people.eng.unimelb.edu.au/ammoffat/teaching/20005/Install-MinGW.pdf
There are a ton of great papers on how maxent works (and how it doesn't). This issue thread will serve as a way to curate a list of relevant papers and links to include in the README. This could either be a sub-heading under Background or a separate heading like Additional resources.
Methods for labeling point data (`raster_values_from_geoseries()`, for example) return columns of data type `object`. This should be explicitly set by the method, which can pull the dtype information from the rasterio source it's reading from.
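A sketch of the proposed fix, with a hard-coded dtype tuple standing in for what rasterio's `src.dtypes` would report (one dtype string per band):

```python
import numpy as np
import pandas as pd

# extracted point values often come back as dtype "object"
df = pd.DataFrame({"band_1": [1.5, 2.0, 3.25]}, dtype=object)

# rasterio exposes the source dtypes via src.dtypes, e.g. ("float32",);
# hard-coded here so the sketch is self-contained
src_dtypes = ("float32",)
df["band_1"] = df["band_1"].astype(src_dtypes[0])
```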
This is a presence-only empirical model (no absence/pseudoabsence required). It would find the min/max range of all covariates in the training data, and use boolean logic to flag all predictions as inside (`True`) or outside (`False`) the range of suitability for each covariate. Rows that match all conditions (the union of all logic checks, or `np.all()`) get flagged as `1` (inside the envelope of suitability).
Some features to include:
- `mins` or `maxs`, which would kinda defeat the purpose of the 'model' but allow the rapid application of model logic.
- the `intersection` of suitability instead of the union, returning a continuous 0.0-1.0 range of suitability based on the product of all covariates that are in range.
We'll want to handle auto-generation of code documentation via `mkdocs`. The `mkdocstrings` plugin handles this well, but it requires google-style docstrings.
Set up a branch to update the docstrings for the whole repo, add pages for each set of modules to document, and update the `mkdocs.yml` file.
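For reference, the google-style docstrings that `mkdocstrings` parses look like this; the function and its arguments are purely illustrative:

```python
def annotate(points, raster_path):
    """Extract raster values at point locations.

    Args:
        points: a GeoSeries of point geometries.
        raster_path: path to the raster file to sample.

    Returns:
        A DataFrame of sampled values, one column per raster band.
    """
```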
Sometimes slicing a single column from a data frame (e.g. `y = df['presence']`) returns a 2d array of shape `(nrows, 1)`. This breaks some feature fitting routines because the code assumes a `(nrows,)` vector. Could probably add a method like `_format_covariate_data()` to check input shapes and flatten arrays accordingly.
We should trigger python package tests on all PRs.
Similar to #75, I also get an unexpected keyword argument error with the `overlay` options to `envelope.predict` in the Niche Envelope vignette.
The code I ran:
import elapid as ela
x, y = ela.load_sample_data()
envelope = ela.NicheEnvelopeModel(percentile_range=[2.5, 97.5])
envelope = ela.NicheEnvelopeModel(percentile_range=[2.5, 97.5])
envelope.fit(x, y)
envelope.predict(x, overlay='intersection')
envelope.predict(x, overlay='union')
envelope.predict(x, overlay='average')
And the output:
Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import elapid as ela
In [2]: x, y = ela.load_sample_data()
In [3]: envelope = ela.NicheEnvelopeModel(percentile_range=[2.5, 97.5])
In [4]: envelope = ela.NicheEnvelopeModel(percentile_range=[2.5, 97.5])
In [5]: envelope.fit(x, y)
Out[5]: NicheEnvelopeModel()
In [6]: envelope.predict(x, overlay='intersection')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [6], line 1
----> 1 envelope.predict(x, overlay='intersection')
TypeError: NicheEnvelopeModel.predict() got an unexpected keyword argument 'overlay'
In [7]: envelope.predict(x, overlay='union')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [7], line 1
----> 1 envelope.predict(x, overlay='union')
TypeError: NicheEnvelopeModel.predict() got an unexpected keyword argument 'overlay'
In [8]: envelope.predict(x, overlay='average')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [8], line 1
----> 1 envelope.predict(x, overlay='average')
TypeError: NicheEnvelopeModel.predict() got an unexpected keyword argument 'overlay'
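For context, a minimal sketch of what the `overlay` options described in the vignette could compute (hypothetical, not the elapid implementation):

```python
import numpy as np

def envelope_overlay(in_range, overlay="union"):
    """Combine per-covariate envelope checks.

    in_range: boolean array of shape (nrows, ncovariates) flagging
    whether each covariate falls within the training envelope.
    """
    if overlay == "intersection":
        return np.all(in_range, axis=1).astype(float)
    if overlay == "union":
        return np.any(in_range, axis=1).astype(float)
    if overlay == "average":
        return in_range.mean(axis=1)
    raise ValueError(f"unrecognized overlay option: {overlay}")
```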
Currently, in multi-raster/multi-band reads, all bands are read before the no-data pixels are evaluated, skipping model application if a tile is all nodata. This is prohibitively slow in cases where a) there is resampling involved and b) there are a lot of no-data tiles.
Instead, this could be simplified by checking the no-data values as each band is read and moving on to the next tile if it's all no-data.
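A sketch of the short-circuit read, assuming rasterio-style sources with a `read()` method (the function name and signature are illustrative):

```python
import numpy as np

def read_tile_skip_nodata(sources, window, nodata):
    """Read bands one at a time, bailing out as soon as any band is
    entirely nodata, instead of reading everything before checking."""
    bands = []
    for src in sources:
        band = src.read(1, window=window)
        if np.all(band == nodata):
            return None  # skip this tile without reading remaining bands
        bands.append(band)
    return np.stack(bands)
```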
This will be nice for minimizing logs during batch processing.
@chrisborges reported the following issues when installing from conda:
The WorkingWithGeospatialData notebook is great. I was able to reproduce almost everything and enjoyed the thorough explanation. Unfortunately, I was not able to reproduce the histogram graph.
Some things that threw errors: `TypeError: zonal_stats() got an unexpected keyword argument 'quiet'`
In both "WorkingWithGeospatialData" and "A simple maxent model" I was unable to run the merged command:
merged = ela.stack_geodataframes(presence, background, add_class_label=True)
Issue: AttributeError: module 'elapid' has no attribute 'stack_geodataframes'
These appear to be version issues that should be resolved in `0.3.20`. But they persisted even after running `conda update elapid`.
Unfortunately I have not been able to reproduce these errors. I was able to run each demo notebook end-to-end after installing elapid in a clean conda environment.
To diagnose, @chrisborges, would you be able to run the following?
import elapid
print(elapid.__version__)
If the version isn't `0.3.20` then we'll want to make sure you get the latest version. To do that, you could try the following:
conda install elapid=0.3.20
If you do have the most recent version and you're still experiencing issues then I'll have to go back to the drawing board.
Automated docs builds should install the dependencies during CI instead of in the package build. These should be declared in `requirements-dev.txt` instead, as well as the pre-commit utils.
I don't recall the exact version number, but at some point a pandas update breaks the tqdm `progress_apply()` function, which we use to track the progress of point value extractions from raster data. We should explicitly set a maximum version number for `pandas`.
Right now there are different `for` loops for the `windowed=True` and `windowed=False` keywords. This could be simplified by setting the window size to the full extent of the image, creating an iterator of length 1, and keeping all the logic in a single windowed-read `for` loop.
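A sketch of the single-loop idea, with plain tuples standing in for `rasterio.windows.Window` objects (the function and window format are illustrative):

```python
def iter_windows(width, height, block_shape=(256, 256), windowed=True):
    """Yield (row_off, col_off, win_height, win_width) tuples. With
    windowed=False, yield one window covering the full image so the
    read logic can live in a single loop."""
    if not windowed:
        yield (0, 0, height, width)
        return
    block_height, block_width = block_shape
    for row in range(0, height, block_height):
        for col in range(0, width, block_width):
            yield (row, col,
                   min(block_height, height - row),
                   min(block_width, width - col))
```

Either way, the caller iterates the same generator, so the two code paths collapse into one loop.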
Would it be possible to have access to the data files needed to reproduce the examples in the documentation? For instance, the "A Simple Maxent Model" vignette uses a .gpkg and a .tif file, and the "Working with Geospatial Data" vignette also uses a .shp file.
I think having completely reproducible examples would really help users understand how to use this package correctly and would help you show how useful it can be. I often find it hard to know which kind of data to use without a working example, even if it's only a basic one.
I tried using the data in the `tests` folder as a replacement, but I'm running into CRS errors. These are probably easily fixable, but I'm sure users would appreciate having a working example while learning how to use the package.
The Java implementation of Maxent includes tools for creating partial dependence plots and quantifying permutation importance. These haven't been ported to `elapid` yet but should be.
This can be implemented using native sklearn tools, including `permutation_importance` and `partial_dependence`.
Maybe we can add these as class methods for all elapid models?
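For instance, `sklearn.inspection.permutation_importance` works with any fitted estimator exposing `fit`/`predict`, so it should apply to elapid models too. Here it's shown on a stock sklearn classifier with synthetic data (everything below is illustrative, not the elapid API):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# synthetic stand-in for covariate/presence data
x, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(x, y)

# shuffle each covariate in turn and measure the drop in model score
result = permutation_importance(model, x, y, n_repeats=5, random_state=0)
print(result.importances_mean)  # one importance score per covariate
```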
Maxent uses a `tau` parameter to scale relative suitability predictions into detection probability scores. This is not currently implemented, as the translation from the R maxnet predict function didn't include this parameter, effectively fixing `tau = 0.5`.
We should look at the underlying math and apply tau scaling in the `elapid.MaxentModel.predict()` function (link here).
You can read more about `tau` in the maxent for ecologists paper and the practical guide to maxent paper.
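As a sketch of the underlying math (following the tau-scaled logistic transform described in the maxent for ecologists paper; not the elapid implementation), the fix would generalize the logistic output so it reduces to the current behavior at `tau = 0.5`:

```python
import numpy as np

def logistic_with_tau(raw, entropy, tau=0.5):
    """Tau-scaled logistic output. With tau = 0.5 this reduces to the
    standard transform exp(raw + entropy) / (1 + exp(raw + entropy))."""
    link = np.exp(raw + entropy)
    return tau * link / (1 - tau + tau * link)
```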
The `raster_values_from_geoseries()` function has iterables for reading data 1) for each raster input and 2) for each point feature. We should add some way to track the progress of each of these with `tqdm`.
Since one of the iterations is via `df.apply()`, we could use this feature.
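The feature referred to is presumably tqdm's pandas integration, which registers a `progress_apply()` drop-in replacement for `apply()`:

```python
import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers Series/DataFrame .progress_apply()

df = pd.DataFrame({"x": range(100)})
# identical to df["x"].apply(...), but renders a progress bar
result = df["x"].progress_apply(lambda v: v * 2)
```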
In testing `pip install elapid` from a clean conda environment (`conda create --name test python==3.7`), the following error message is displayed:
ERROR 4: Unable to open EPSG support file gcs.csv. Try setting the GDAL_DATA environment variable to point to the directory containing EPSG csv files.
I believe this occurs because `elapid` only requires `rasterio`, which uses the minimal `libgdal` requirement and not the full `gdal`/`gdal-bin` install. This greatly simplifies build requirements but introduces this error.
Rasterio has documented the issue.
`fiona` packages its own `gdal_data` directory, which is here on my conda install: `$CONDA_ENV_DIR/lib/python3.7/site-packages/fiona/gdal_data/`.
I suspect the `GDAL_DATA` variable could be set with `os.environ.update(GDAL_DATA='/path/to/fiona/gdal_data/')`. But this will have to check that it's not referenced by other packages and that it's not overwriting a user default.
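A sketch of setting the variable defensively, only touching `GDAL_DATA` when no user or system default exists (the helper name is made up):

```python
import os

def ensure_gdal_data():
    """Point GDAL_DATA at fiona's bundled gdal_data directory, but only
    when the variable isn't already set."""
    if "GDAL_DATA" in os.environ:
        return  # respect an existing user/system default
    try:
        import fiona
    except ImportError:
        return
    gdal_data = os.path.join(os.path.dirname(fiona.__file__), "gdal_data")
    if os.path.isdir(gdal_data):
        os.environ["GDAL_DATA"] = gdal_data
```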
This means it's very easy to break code. The current version should be tested before making big changes (like potentially rolling back the required versions of library dependencies).
Hi @earth-chris! I'm running into a CRS error when trying to reproduce the Maxent example (the geospatial data one runs fine, and thank you for making it so detailed, by the way!).
Everything runs fine until this line in the second cell:
>>> merged = ela.stack_geodataframes(presence, background, add_class_label=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/gabriel/github/elapid/elapid/geo.py", line 101, in stack_geodataframes
merged = pd.concat((presence[matching], background[matching]), axis=0, ignore_index=True)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 381, in concat
return op.get_result()
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 616, in get_result
new_data = concatenate_managers(
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/core/internals/concat.py", line 226, in concatenate_managers
values = concat_compat(vals, axis=1)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/core/dtypes/concat.py", line 133, in concat_compat
return cls._concat_same_type(to_concat)
File "/home/gabriel/.local/lib/python3.10/site-packages/geopandas/array.py", line 1323, in _concat_same_type
return GeometryArray(data, crs=_get_common_crs(to_concat))
File "/home/gabriel/.local/lib/python3.10/site-packages/geopandas/array.py", line 1413, in _get_common_crs
raise ValueError(
ValueError: Cannot determine common CRS for concatenation inputs, got ['WGS 84 / UTM zone 10N', 'WGS 84 / UTM zone 10N']. Use `to_crs()` to transform geometries to the same CRS before merging.
Probably just an easy thing to update, I hope!
Several CI steps should only be run under certain conditions. For example, we should only build documentation after we're sure the package passes all tests. This can be set up with `workflow_run` commands (https://stackoverflow.com/questions/58457140/dependencies-between-workflows-on-github-actions).
Hi, I want to cite the elapid library in my paper. Do you plan to publish the library as a paper, or is there a proper way to cite elapid?
Functions like `raster_values_from_geoseries()` could be named more simply, like `annotate_geoseries()`, which is consistent with the current documentation and with a term that I expect will become more common for describing this process.
Multiple `model.fit()` calls produce invalid predictions because some state variables (probably the alpha or entropy scores) aren't cleared.
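A common fix for this pattern (sketched here with hypothetical attribute names, not elapid's actual state) is to reset fitted state at the top of `fit()`:

```python
class ModelSketch:
    """Illustrative only: clear fitted state on each fit() call so
    repeated fits don't reuse stale alpha/entropy values."""

    def __init__(self):
        self.alpha_ = None
        self.entropy_ = None

    def fit(self, x, y):
        # reset state from any previous fit before training
        self.alpha_ = None
        self.entropy_ = None
        ...  # training logic would follow
        return self
```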
It should be more than just a big README!
Hi @earth-chris! I'm opening a few issues here for my JOSS review. These will be mostly about bugs or issues when trying to reproduce the examples in your documentation.
I'm having trouble loading the sample data with `elapid.load_sample_data()`. The files appear to be missing from my local installation (as you can see in the output below). I double-checked the path given in the error message (`/home/gabriel/.local/lib/python3.10/site-packages/elapid/`) and I simply do not have a `data` folder or the `bradypus.csv.gz` file.
I managed to debug it by downloading the data from this repo and moving it to a `data` folder in the right path. After that, `elapid.load_sample_data()` works as expected.
I installed elapid using `pip install elapid` as explained in the documentation. I'm using elapid v0.3.18 and Python 3.10.6 on Ubuntu 22.04.
Here is the detailed output:
Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import elapid
>>> x, y = elapid.load_sample_data()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/gabriel/.local/lib/python3.10/site-packages/elapid/utils.py", line 54, in load_sample_data
df = pd.read_csv(file_path, compression="gzip").astype("int64")
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
self._engine = self._make_engine(f, self.engine)
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1735, in _make_engine
self.handles = get_handle(
File "/home/gabriel/.local/lib/python3.10/site-packages/pandas/io/common.py", line 750, in get_handle
handle = gzip.GzipFile( # type: ignore[assignment]
File "/usr/lib/python3.10/gzip.py", line 174, in __init__
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/gabriel/.local/lib/python3.10/site-packages/elapid/data/bradypus.csv.gz'
There's a bunch of vestigial stuff in the `Makefile` now that could be removed, like all of the pip/conda/docker update stuff.
Conda skeleton creates templates for building and deploying conda packages automatically with GitHub Actions: https://conda-forge.org/docs/user/ci-skeleton.html (first implemented here: https://github.com/conda-forge/conda-smithy/pull/1461/files)
Currently we set the probability of drawing a sample equal to the frequency with which it occurs in the raster population. However, it should adjust for that frequency, drawing more samples in areas with low frequencies. Weighting by `1 / probabilities` or something similar would probably handle this.
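A sketch of the inverse-frequency weighting idea on a toy 1-d "raster" (everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy population: value 0 occurs 90% of the time, value 1 only 10%
values = np.concatenate([np.zeros(900, int), np.ones(100, int)])

counts = np.bincount(values)
freq = counts[values] / values.size  # per-pixel frequency

# inverse-frequency weights: rare values get drawn more often
weights = 1.0 / freq
weights /= weights.sum()
sample = rng.choice(values, size=300, p=weights)
```

With these weights, each class contributes roughly equally to the sample despite the 9:1 imbalance in the population.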
Running the Maxent vignette example (after the workaround I described in #74 to access the sample data), using the `is_features=True` flag in `model.fit` after the feature transformations returns the following error: `TypeError: MaxentModel.fit() got an unexpected keyword argument 'is_features'`.
I get this error after running the code described in the vignette.
import elapid
x, y = elapid.load_sample_data()
model = elapid.MaxentModel()
model.fit(x, y)
model.predict(x) # add
# Feature Transformations
features = elapid.MaxentFeatureTransformer()
z = features.fit_transform(x)
model.fit(z, y, is_features=True)
Here is the detailed output from IPython:
Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import elapid
In [2]: x, y = elapid.load_sample_data()
In [3]: model = elapid.MaxentModel()
In [4]: model.fit(x, y)
Out[4]: MaxentModel()
In [5]: features = elapid.MaxentFeatureTransformer()
In [6]: z = features.fit_transform(x)
In [7]: model.fit(z, y, is_features=True)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [7], line 1
----> 1 model.fit(z, y, is_features=True)
TypeError: MaxentModel.fit() got an unexpected keyword argument 'is_features'
When annotating an existing geoseries, the geometry is passed to the annotation function, then `pandas` is used to concatenate the original and the annotated data.
This leads to issues when `drop_na=True` and the two dataframes don't share the same number of rows: the annotated data then mismatches the original dataframe, and the difference in length means NaN is appended to the missing rows.
This should probably be handled by running `drop_na` after the dataframes are concatenated, to ensure the indices match before dropping.
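The concat-then-drop order can be illustrated with plain pandas (toy data, not elapid's internals):

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"id": [0, 1, 2, 3]})
annotated = pd.DataFrame({"band_1": [1.0, np.nan, 3.0, 4.0]})

# concatenate first so row indices stay aligned, then drop NaN rows;
# dropping rows before the concat would misalign the two frames and
# leave NaN appended to the shorter one
merged = pd.concat([original, annotated], axis=1).dropna()
```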
This is because it's running the documentation generator from a simple ubuntu container: running `pip install elapid` from there is no good (too many other dependencies).
Add a `Dockerfile` and a `Makefile` to build and deploy a container that has the conda environment installed. Then update the run commands to be something like:
conda activate elapid
pip install [docs dependencies]
mkdocs gh-deploy