pastas / pastas Goto Github PK

View Code? Open in Web Editor NEW

351.0 16.0 70.0 141.1 MB

Pastas is an open-source Python framework for the analysis of groundwater time series.

Home Page: https://pastas.readthedocs.io

License: MIT License

Python 100.00%

hydrology groundwater python timeseries analysis pastas

pastas's Introduction

Pastas: Analysis of Groundwater Time Series

Important

As of Pastas 1.5, noisemodels are not added to the Pastas models by default anymore. Read more about this change here.

Pastas: what is it?

Pastas is an open source python package for processing, simulating and analyzing groundwater time series. The object oriented structure allows for the quick implementation of new model components. Time series models can be created, calibrated, and analysed with just a few lines of python code with the built-in optimization, visualisation, and statistical analysis tools.

Documentation & Examples

Documentation is provided on the dedicated website pastas.dev
Examples can be found on the examples directory on the documentation website
View and edit a working example notebook of a Pastas model in MyBinder
A list of publications that use Pastas is available in a dedicated Zotero group

Get in Touch

Questions on Pastas can be asked and answered on Github Discussions.
Bugs, feature requests and other improvements can be posted as Github Issues.
Pull requests will only be accepted on the development branch (dev) of this repository. Please take a look at the developers section on the documentation website for more information on how to contribute to Pastas.

Quick installation guide

To install Pastas, a working version of Python 3.9, 3.10, 3.11, or 3.12 has to be installed on your computer. We recommend using the Anaconda Distribution as it includes most of the python package dependencies and the Jupyter Notebook software to run the notebooks. However, you are free to install any Python distribution you want.

Stable version

To get the latest stable version, use:

pip install pastas

Update

To update pastas, use:

pip install pastas --upgrade

Developers

To get the latest development version, use:

pip install git+https://github.com/pastas/pastas.git@dev#egg=pastas

Related packages

Pastastore is a Python package for managing multiple timeseries and pastas models
Metran is a Python package to perform multivariate timeseries analysis using a technique called dynamic factor modelling.
Hydropandas can be used to obtain Dutch timeseries (KNMI, Dinoloket, ..)
PyEt can be used to compute potential evaporation from meteorological variables.

Dependencies

Pastas depends on a number of Python packages, of which all of the necessary are automatically installed when using the pip install manager. To summarize, the dependencies necessary for a minimal function installation of Pastas

numpy>=1.7
matplotlib>=3.1
pandas>=1.1
scipy>=1.8
numba>=0.51

To install the most important optional dependencies (solver LmFit and function visualisation Latexify) at the same time with Pastas use:

pip install pastas[full]

or for the development version use:

pip install git+https://github.com/pastas/pastas.git@dev#egg=pastas[full]

How to Cite Pastas?

If you use Pastas in one of your studies, please cite the Pastas article in Groundwater:

Collenteur, R.A., Bakker, M., Caljé, R., Klop, S.A., Schaars, F. (2019) Pastas: open source software for the analysis of groundwater time series. Groundwater. doi: 10.1111/gwat.12925.

To cite a specific version of Pastas, you can use the DOI provided for each official release (>0.9.7) through Zenodo. Click on the link to get a specific version and DOI, depending on the Pastas version.

Collenteur, R., Bakker, M., Caljé, R. & Schaars, F. (XXXX). Pastas: open-source software for time series analysis in hydrology (Version X.X.X). Zenodo. http://doi.org/10.5281/zenodo.1465866

pastas's People

Contributors

Stargazers

Watchers

pastas's Issues

Solve with other frequency

I was trying to solve a model with a different frequency, but found out this does not work yet/anymore. However, the simulate, residuals etc all support this already. Would be a nice feature to be able to solve with different frequencies.

pr.get_distances() ignores kind kwarg when stresses kwarg is passed

When both the kind kwarg and stresses kwarg are passed to project.get_distances the kind argument is ignored. My current case is that I have multiple sources of meteorological data. I have a list of stresses from KNMI, and a list of stresses names from another source. I could split up my KNMI stresses into a precipitation list and an evaporation list and just use those, but it would be nice if get_distances allows you to still pass the kind kwarg to a subset of stresses.

From project.py get_distances:

if stresses is None and kind is None:
    stresses = self.stresses.index
elif stresses is None:
    stresses = self.stresses[self.stresses.kind == kind].index

I propose adding the following elif to allow both kind and stresses to be passed.

if stresses is None and kind is None:
    stresses = self.stresses.index
elif stresses is None:
    stresses = self.stresses[self.stresses.kind == kind].index
elif stresses is not None and kind is not None:
    stresses = self.stresses.loc[stresses].loc[self.stresses.kind == kind].index

warmup passed differently than tmin, tmax in solve

when the warmup is specified when doing a solve, the value is passed to the residuals function differently than tmin and tmax. In fact, I don't quite know how it is passed, but a different value of the warmup does give a different solution, so it gets passed in some fashion. but it should probably be passed in a similar fashion to tmin and tmax.

simulate function does not work without solving first

I am trying to implement the simulation function such that it can also be used when the model is not yet solved. It then uses the initial values.

When you run the example.py without solving, and then try the simulate you get the following error:

  File "<ipython-input-5-5a455af4e32c>", line 1, in <module>
    ml.simulate()
  File "c:\python\pastas\pastas\model.py", line 191, in simulate
    self.set_tmin_tmax()
  File "c:\python\pastas\pastas\model.py", line 415, in set_tmin_tmax
    tmin = tmin - self.get_time_offset(tmin, self.freq) + self.time_offset
  File "c:\python\pastas\pastas\model.py", line 515, in get_time_offset
    freq = freq.split("-", 1)[0]
AttributeError: 'NoneType' object has no attribute 'split'

Now, when you first have run ml.check_frequency() and ml.simulate() you do get the series. I'll try to fix this tomorrow.

see commit f252845

Bugs in tests of GXG values with pandas 0.20

Most of the tests are now failing in Travis with pandas 0.20. @tomvansteijn, can youfigure out how to fix this? E.g. https://travis-ci.org/pastas/pastas/jobs/233972587

Streamline Tseries (on behalf of the GUI)

To make PASTA more general, and to make it easier to generate the GUI and to be able to expand the GUI, a more general way to define Tseries would be nice (this is probably what Frans meant last week). For example, each Tseries should contain an attribute which states how many timeseries it needs (0 for Constant, 1 for Tseries1 and 2 for Recharge), and the series should be a list of this size, not seperate inputs like in the Recharge-class. With this information it is much easier to make an all-purpose import-dialog for Tseries, but it is also more logical for people who use a script.

folders not copied during install of pasta

When I install pasta using python setup.py install the 'read' and 'recharge' folders are not copied to site-packages\pasta-0.01-py2.7.egg\pasta\

When I copy the folders manually everything works fine.

initial value and vary can not be set on constant

You can not set the initial value or vary (didn't try pmin and pmax) voor the constant, while this can be done for the noise model (which is also automatically generated). Example code:

import pandas as pd
import pastas as ps
dates = pd.date_range('1990', '1991')
ho = pd.Series(data=1, index=dates)
ml = ps.Model(ho)
ml.parameters
ml.set_initial('noise_alpha', 77)
ml.set_initial('constant_d', 33)
ml.set_vary('noise_alpha', False)
ml.set_vary('constant_d', False)
print(ml.parameters)

Note that for the parameter of the noise model the values have been changed, but not for the constant.

Logging config file not in pypi distribution

Log config file is not in pypi distribution. This causes trouble with the log level. In model.py log_level="error" should be "ERROR" i think.

fix future warning from pandas

/Users/mark/git/pastas/pastas/stressmodels.py:298: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
h = h.loc[tindex]

tmin, tmax in ml.plot

tmin and tmax can be specified in ml.plot, but tmin is ignored and the plot is drawn including the warmup period. It would be nice to have an easy option to plot the results for the tmin and tmax specified for ml.solve. Now you have to specify them for the solve and then again for the plot. Some kind of option: 'use tmin and tmax from the solution' (which are stored anyway, right?) would be nice.

Same holds for stats.evp. It would be nice to have an option to compute evp for the period used in solving.

Stressmodel2 and different indexes of the two stresses

For determination of the tmin and tmax are for this stressmodel, the indices are compared. This goes wrong when the two series both have a daily frequency, but measured at another hour.

Possible solution:

solve this in TimeSeries objects, rounding the index to the desired frequency.
Stop comparing indices in the stressmodel, the TimeSeries will extend themselves if necessary anyways. But that leaves weird results i suppose when time series have different lengths and solve tmin/tmax covers the longest.
Change method of comparison in stressmodel2

Wrap lines in code cells of example notebooks for documentation

It would be nice if we can automatically wrap the (code) cells of the jupyter notebook when creating the docs on readthedocs. Making it easier to read the notebooks online..

calculate quantile GXG stats by year

The current timeseries statistics functions q_ghg, q_gvg and q_glg take a quantile of the whole series. The classic definition is better approximated by taking the average of quantiles per year. Proposing to implement this using Pandas resample.

Change index of stress-series to PeriodIndex

Right now the index of stresses consists of Timestamps. It is unclear whether this is the beginning or the end of the period which the timestamp represents. For example, the menyanthes-import gives data at the end of each period. So the monthly well discharge with a timestamp at the 1st of february (0:00) represents the extraction in january. For the knmi-data on the other hand, the precipitation with a timestamp on january 1st (0:00) represents the precipitation on january 1st, and so the index of the data is at the beginning of the period it represents. I would propose to always define data at the end of the period it represents, as this is also the moment the amount is registered. So the precipitation of january 1st would have an index of january 2nd. The choice also has implications for the simulation methods, which right now assume the data is defined at the end of each period (I think).

A better approach would be to use Pandas Periods instead of Pandas Timestamps. By defining periods, it is clear form the definition of the index which period the data represents. We need to figure out if all our methods work with Periods as well however.

quantile GXG functions do not use tmin and tmax arguments

Statistics functions q_ghg, q_glg, q_gvg do not use the optional arguments tmin and tmax.

Add indexing using .loc?

Interpolation of oseries

Figuring out how we deal with high-frequency oseries, meaning that the oseries have a higher frequency than the simulated series. Do we only compare the oseries on the indices wherethere are also simulations? Or do we interpolate the simulation to the indices of the oseries? I thought the last option..

Then, I think I would like an option where one can only compare to the oseries where there's also a simulation. Or be able to change the frequency of the oseries. Fitting high frequency observations is often difficult and suffers from high auto correlations..

model with simulation time step freq="H"

A pastas Model is initialized with a hard-coded daily frequency. This makes pastas resample my hourly stresses to daily values when creating the model. Then when I solve on an hourly frequency, pastas uses the original hourly series to eventually solve the model. The downsampling step is quite unnecessary in this case so how can I force pastas to skip it?

One solution would be to allow Models to be initialized with a frequency. But this probably adds to the (or just my) confusion about pastas settings.

This might lead to a whole other discussion, but currently there are three levels (that i can think of now) at which settings can be provided in pastas:

Timeseries settings, these define resampling/filling/extending logic for timeseries.
Stressmodel settings, these should be derived from Timeseries settings. (not sure what happens when they're provided to Stressmodel).
Model settings: hard-coded at creation, but can be changed when solve() is called with kwargs or by directly editing ml.settings.

I've heard there was a reason to not include a settings kwarg in Model(), but in my case, I wouldn't mind if the option existed... So what would be the best way to avoid pastas doing extra resampling work? I'm curious to hear your thoughts!

Model.simulate doesn't respect tmin

When supplying tmin to Model.simulate, it returns values before tmin for the entire warmup period.

Optimal parameters are gone when del_stressmodel is run twice

When del_transform is ran twice, for example

ml.del_transform('recharge')
ml.del_transform('well 1')

you lose the optimal parameters. After the first time the initial parameters are set to the optimal parameters, and after the second time the initial parameters are overwitten by the default initial parameters.

Fix plot_results method of model class

Implement the new tseries parameter and stress methods in this plotting function.

Current error message:
innovations = self.noisemodel.simulate(residuals, self.odelt) TypeError: simulate() takes at least 4 arguments (3 given)

BUG in plots decomposition when tseries is outside time range

When one of the tseries is outside the time range for which the plot is drawn, NAN's are returned to height ratio (LINE 168):

        fig, ax = plt.subplots(1 + len(self.ml.tseriesdict), sharex=True,
                               gridspec_kw={'height_ratios': height_ratios})

This causes an error.

can't set (initial) parameters if no noisemodel is present

Sanitize names of models and stress models

User provided names could be checked for unwanted characters that cause troubles in other methods.

E.g. name="Test/456" causes troubles when using this name for writing files (pandas read_csv) and could be changed to:

name="Test_456".

utils.get_time_offset fails when freq is None

Issue becomes apparent for example when a series contains only NaN. In this case freq cannot be inferred and is set to None.

freq_is_None_series = pd.Series(index=pd.to_datetime(['2009-01-01', '2010-01-01']))

add classical GXG methods

Sampling at 14 and 28th day of the month, using forward fill or linear interpolation.

I will try to implement these.

Changes in Pastas related to Pandas 1.0

Just read this interesting blog post on Pandas 1.0 which seems to be coming up next year.
https://www.dataschool.io/future-of-pandas/

Most importantly for us the inplace argument will be removed so we should start removing those from the Pastas code. I think it would be good if we can release a stable Pastas version after the release of Pandas 1.0 as Pastas heavily depends on this package.

Is it an idea to start making a list of what we need to do for such a release?

TimeSeries object with no freq-original

When the frequency of a TimeSeries cannot be inferred and is not user-provided, unexpected things can happen when updating the settings / series. This becomes visible when changing the frequency, and resampling automatically switches to the "sample_weighted" method in the "change_frequency" method of the TimeSeries method.

Bottomline: it should be possible to change oseries with no freq_original from Hourly values to daily values and use the "drop" option for dropping nan-values.

fillnan='interpolate' options for Tseries does not work unconditionally

The following code produces an error in optimisation when the recharge series have nan-values at the end of the beginning.

ts1 = Tseries(recharge, Gamma(), name='recharge', fillnan='interpolate')

This is because 'interpolate' borrowed from pandas does not fill nan-values at the beginning and end.

This might be a solution?

Model and Stressmodel should accept DataFrame with one column

If series is entered as DataFrame with one column, pastas should change it to a series

consider using pytest instead of nosetests

My test file runs with pytest. Better and more up-to-date compared with nosetests, in my opinion. Up to you!

Move plot functions to separate class

I suggest to move the plot functions to a separate class to keep the Model class clean and succinct, similar to the way the Statistics class couples with the model in the dev branch.

Calibration period not working when noise model is used

There is an error when a certain calibration period is chosen in combination with a noise model. Without the noise model the model works fine. E.g.:
n = NoiseModel() ml.addnoisemodel(n) ml.solve(tmin='1965', tmax='1990')
Gives:
ValueError: operands could not be broadcast together with shapes (4953,) (616,)

model fit is not yet stored in pas-file

When a model is saved as a .pas file, the model fit is not yet stored. Due to this a couple of methods do not work, e.g. ml.fit_report()

It would be nice to store the fit, including the covariance matrix.

GXG functions grayed out

Methods for calculating Dutch groundwater statistics GHG and GLG are included in the Statistics class.
Why are these commented out? Can I add percentile based methods? I have forked the repo.

Passing array of parameters to residuals function

I want to pass an array of parameters to the residuals function. Right now it checks first if the method is 'lmfit', then whether the parameters are an array. I think that should be the other way around. Also, in the checking whether parameters is an array, it now also check whether it is None. I think that is impossible, as it won't get beyond the if statement.

notebook 3_recharge_series doesn't run

Frequency of dependent variable and the stresses

It is now possible to change the frequency of the observed (dependent) and the independent (stress) series. But the implementation is still very experimental.

Changing the frequency
Changing the frequency works fine for both series, but maybe a user option should be provided on how to resample. Now a forward fill method is standardly applied. Alternatively, a method could be written that uses only existing values and does not rely on interpolation.

The stress series now use the pandas .asfreq function, creating nan-values for each unobserved time index. These nan-values are later filled with user defined function.

Simulating with different frequencies
The simulation of the model works fine when only the frequency of the observed series is changed. When the frequency of the stress series is changed, the residuals result in occasional nan-values and optimisation fails.

TO DO

Write a new resample function for the observed series
Figure out how to the frequency of the stress series can be changed without having nan-values in the residuals

Order of columns of parameters dataframe changes

The order of the model parameters changes after creation, caused by pandas append method that order alphabetically. As a result, pmin is now shown after pmax, which I find confusing. Should be solved somewhere in the get_init_parameters method.

cannot print fit_report when optimized model is loaded from file

ml = ps.io.load("model.pas")
print(ml.fit_report)

returns:

ValueError: The model is not solved yet

This should be possible and relates to #82

The second code cell of the notebook 4_menyanthes_file gives error

AttributeError: 'TimeSeries' object has no attribute 'IN'

Make private method private with leading underscore

I think we should make all non-public methods known by adding a leading underscore to the method name. This is suggested in PEP8 (https://www.python.org/dev/peps/pep-0008/), and is followed by all major packages (E.g. Numpy, Scipy, Pandas, Flopy). E.g.

Model._get_odelt()

Also, we should not use private methods of other packages in Pastas, like the _base_and_stride taken from Pandas in utils.py. These methods can be dropped without notice, which can cause problems in future versions of Pastas.

This will make it more clear for the Pastas-users which methods they should use (They will pop up first on tab completion) and make maintenance of Pastas easier in the future.

Thoughts?

use Pandas methods in model

Hi,

I have created a branch in which some methods of the model class are partly replaced by existing pandas fuctions, see dev-model in my fork. I think it makes the code a bit more compact and (just slightly) faster.

If you see the changes and like them, I can make a pull request.

Double figures in Notebooks

It seems that the built-in plotting functionality of Pastas creates double figure instances when used in a IPython Notebook. A temporary fix can be to suppress output by adding a semi-colon, e.g.:

ml.plots.decomposition();
Or by storing the returned figure instance:

fig = ml.plots.decomposition()

The methods keyword argument show=False has no effect in Notebooks.

Anyone knows how to solve this issue for all plotting methods?

settings in Stressmodel should be easily accessible

Right now it is very difficult to find out what the pre-defined settings are for Stressmodels. They are defined in TimeSeries and can only be seen by looking at the code, as far as I know.

get_stress method return TimeSeries object

This is intended to be the pandas series of the stress

should logging.config be defined at package level

logging.config is now applied when a model is created. So when TimeSeries object is created before, and logging occurs, this information is not printed to the console..

Consistency throughout Pastas when changing frequency

It is now easy to to change the frequency, but the way we deal with this change of frequency is not yet consistent. For now, the gain is the gain/ml.settings['freq']. That means that changing the frequency from daily to weekly, the gain will be divided by 7.

This is not yet consistent with the get_block_response function and the plotting methods that depend in it. I think this would be a good issue to solve in the next Pastas release.

add polder function

Nice library! The polder function (Bruggeman) is not included in rfunc.py. It would be nice to have as well to be able model surface water stresses.

[ENHANCEMENT] move objective function out of solver class

Still very impressed with this library. I thought it would be nice, for a particular application, to be able to do some postprocessing on the residuals before feeding them back to lmfit.minimize. In general I think it is better if the objective function is independent of the solver. See the branch dev-obj-functions in my fork. If you like the changes, I can make a pull request.