
Shipston Flood Risk Project

1. Project Summary

Abstract

The purpose of this project is to investigate whether we can establish the effectiveness of natural flood management (NFM) interventions undertaken in the British town of Shipston-on-Stour during 2017 to 2020 from publicly available meteorological data and private data from the river gauge in Shipston.

Our analysis concludes that the available data (cf. the data sources below) are not sufficient to confidently assess the effectiveness of recent NFM interventions in Shipston with state-of-the-art rainfall-runoff LSTM models. We attribute this to three main factors:

  • Limited data on extreme events: The period of 1990 to 2020 contains fewer than 10 floods in Shipston (ca. 7 independent events if we define a flood by a threshold of 3.4m river stage). This means that while there is ample data on the average discharge in the catchment, there is little information about the extreme values of the river discharge. We see this reflected in the results of our model, as the model's predictions agree well with the ground truth on average discharge values in terms of Nash–Sutcliffe model efficiency (NSE), but carry errors of about 20-30% for extreme events.
  • Limited temporal resolution of available data: All publicly available, past meteorological data from the Met Office and NRFA is available on a daily basis. Yet, historical data shows that floods in Shipston depend on processes on hourly timescales. Flooding in Shipston strongly depends on whether the peak river stage at Shipston exceeds 3.4m (the height of the Shipston bridge arches) at any given time, which typically happens only for a few hours of a day, even during floods. NFM interventions help to flatten the curve and distribute the peak flow across a wider time range by slowing upstream flow rates. Since leakage through NFMs likely occurs on timescales of hours as well (many NFMs are leaky dams), data with daily temporal resolution is likely not enough to confidently assess the effect of NFM interventions. In other words: NFM interventions might flatten the hourly flow without strongly affecting the daily total flow rate, such that averaging over all hours in a day removes any information about the NFMs' effectiveness.
  • Limited meteorological data availability: While temperature, precipitation and river discharge data for 1990 to 2020 were readily available, we could not obtain other important meteorological data (notably humidity, windspeed, potential evapotranspiration, or solar irradiation) for this period for the Shipston catchment. This data is relevant for including information about the physical process of evapotranspiration in the model, and its absence means that our model predictions do not capture all relevant physical mechanisms. We assessed the effect of using only precipitation and temperature instead of all meteorological data via the publicly available CAMELS-GB dataset and found that we only lose about 2-4% in NSE performance. While this suggests that evapotranspiration plays a smaller role at high latitudes in Britain (Shipston: ~52° N), it is still a significant flux for drainage basins and should be taken into account from a hydrological point of view.
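The "ca. 7 independent events" estimate above comes from thresholding the stage record and counting upward crossings of the 3.4m level. A minimal sketch of that counting on toy numbers (the real analysis uses the Shipston stage series, which is not reproduced here):

```python
import pandas as pd

# Toy stage series (metres); not real Shipston data.
stage = pd.Series(
    [3.1, 3.5, 3.6, 3.2, 3.0, 3.45, 3.3],
    index=pd.date_range("2000-01-01", periods=7, freq="D"),
)
above = stage >= 3.4
# An independent event starts wherever the series crosses the threshold upwards.
n_events = (above & ~above.shift(1, fill_value=False)).sum()
print(n_events)  # 2 independent events in this toy series
```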

Despite the limits on available data, our runoff prediction model was built in a general way such that it can directly use additional features (e.g. humidity, windspeed, etc.) contained in the CAMELS-GB dataset when they become available for Shipston, and it can easily be extended to hourly instead of daily data. Therefore, the model developed in this project can be reused for the effectiveness assessment once the limits on data availability have been addressed. A detailed description of how to run the model on the Shipston data or the CAMELS-GB dataset is given in section 2 of this README.

1.1 Approach

Our approach was to build a model to predict the rainfall-runoff at Shipston in the "what-if" scenario of no NFM interventions being applied. To do this we trained our model on data before 2016 and used it to predict the river flow at Shipston from 2017 to 2020.
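The comparison described above can be sketched in a few lines; the numbers here are purely illustrative, not real Shipston flows:

```python
import pandas as pd

# Annual mean flows (m^3/s) for the intervention period; toy numbers only.
idx = pd.date_range("2017-01-01", periods=4, freq="YS")
observed = pd.Series([2.1, 1.8, 2.4, 2.0], index=idx)          # with NFMs in place
predicted_no_nfm = pd.Series([2.3, 1.9, 2.6, 2.1], index=idx)  # model "what-if"
# A positive residual would suggest the NFMs reduced flow relative to
# the no-intervention counterfactual.
residual = predicted_no_nfm - observed
print(round(residual.mean(), 2))  # 0.15
```

In practice the residual must be compared against the model's own error bars, which is exactly where the data limitations above bite.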

For an introduction to using neural networks in hydrology, we refer the interested reader to this excellent introduction.

Data sources

| Data Source | Type of Data | Temporal resolution | Spatial resolution | Description |
| --- | --- | --- | --- | --- |
| CAMELS-GB | Hydrological data (temperature, precipitation, radiation, humidity, discharge, static attributes, ...) for 671 catchments in the UK | daily [1970-2015] | lumped | Hydrologic attributes and historical forcing data for 671 catchments in the UK. All data in CAMELS-GB is lumped, meaning that the data is averaged over the catchment area. Unfortunately, the dataset does not include Shipston. We used this data for fine-tuning the model as well as for analysing the effect of neglecting evapotranspiration. |
| NRFA Data | Discharge data and static catchment attributes for all UK catchments | daily [2005-2019] | lumped | We used this data to compare to the CAMELS-GB and Wiski data and to obtain static catchment attributes for Shipston. |
| Wiski Data* | Precipitation, discharge and stage data for the Shipston catchment | hourly [1972-2020] | lumped | We used the hourly precipitation and discharge data for the final model for the Shipston catchment. |
| Asset Data* | NFM interventions (cost, location, size, type, build date, ...) | daily [2017-2020] | distributed | Data on the NFM assets installed. NFM installation started in 2017, hence the data covers the date range from 2017 to 2020. |
| FEH Data | Rainfall return periods | - | lumped | Not used further in our analysis. |
| CHESS-Met Data | Precipitation and temperature for the Shipston catchment | daily [1970-2020] | distributed | We used this precipitation and temperature data for the final model on the Shipston catchment. |
| Earth Engine Data | Elevation | static | distributed | We used the elevation data from Google Earth Engine to calculate the mean elevation in the catchment. |

*Starred data sources are private and could be obtained by direct requests to the respective agencies.

1.2 Results

The current model is not sufficient to assess the effectiveness of NFM interventions in this way. A notebook that generates an analysis of the model predictions can be found in predictions_analysis_notebook.ipynb in the notebooks folder, along with other notebooks analysing separate data from Shipston. This shows that there is no detectable effect from the NFM interventions in the difference between model predictions and ground truth flow at Shipston in the 2017-2020 period; however, this is most likely due to the data limitations described above.

Results for Shipston-only models

This table shows a comparison of the performance of different types of predictive models on the Shipston dataset - the LSTM is clearly the best performing, so this was chosen as the final model.

| Model | Validation NSE (2010-2016) |
| --- | --- |
| Tuned Vanilla LSTM** | 0.8175 |
| Vanilla 1D Conv model | 0.4309 |
| WaveNet | 0.6975 |
| FilterNet | 0.5978 |
| Autoregressive* WaveNet | 0.3359 |
| Autoregressive* FilterNet | 0.602 |
| Autoregressive* LSTM | 0.6925 |

*Autoregressive here refers to including the previous 365 days of discharge data as an additional feature. **Hyperparameters: 10 layers, 100 hidden units, dropout probability of 0.2, 200 epochs of training.

Temperature and precipitation were the baseline features used in all models. The training set consisted of the data from 1986-2010, and the validation set was 2010-2016.
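For reference, the NSE values reported here follow the standard Nash–Sutcliffe definition; a minimal implementation for illustration (the repository's own metric code may differ in detail):

```python
import numpy as np

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 means the model
    is no better than predicting the mean of the observations."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return 1.0 - np.sum((observed - simulated) ** 2) / np.sum(
        (observed - observed.mean()) ** 2
    )

print(round(nse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]), 3))  # 0.98
```

Note that NSE is dominated by squared errors around the mean flow, which is why a high NSE can coexist with 20-30% errors on the rare extreme events.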

1.3 Directions for future analysis

The most promising idea for further work is to re-train the model on hourly timeseries data at Shipston, since there is a strong possibility that the effect of the NFM interventions operates on hourly timescales. A larger project would be to fully investigate the transfer learning approach of training a model on CAMELS-GB including static basin attributes, then applying it to Shipston. This is a promising way of improving model performance further.

2. Runoff prediction model

This code trains an LSTM deep learning model to predict runoff using data from the Shipston river basin. By default, predictions go up to 2019-12-31 and are saved in the log directory in a subfolder with the run name.

Alternatively, this code can train a model on any number of river basins from the CAMELS-GB dataset of 671 river basins around the UK. See below for customisation options.

Code features:

  • Automatic real-time logging to the cloud with wandb of model loss, validation metrics, and a plot showing test results, using a publicly available dashboard.
  • Full parallelisation across multiple GPUs and multiple nodes.
  • Full command line interface to control all major model, dataset, and training options.
    • This uses hydra, a framework that allows for composable and type-checked configuration objects created from yaml files.
  • PyTorch Lightning model class, allowing for more modular code, less PyTorch boilerplate and easier customisation of the training process.
  • Dataset class capable of handling an arbitrary number of river basins, with data from any date range and any number of features.
  • Automatic saving of the best k checkpoints according to the validation metric.
  • Fully type-hinted and well documented codebase.

2.1 Setup and model training

Before running the model, run `conda env create -f environment.yml` to install all required packages (after installing conda). CUDA 10.1 is required to train on GPU with PyTorch 1.7.

The code is run from `main.py`; the only mandatory command line argument is `run_name`, a short string which describes the run. For example:

```shell
python main.py run_name=test-run
```

An example of a more complex command:

```shell
python main.py run_name=50-epochs-2005 gpus=4 dataset.basins_frac=0.5 dataset.train_test_split=2005 epochs=50 model.dropout_rate=0.2
```

Full argument list:

  • Main Options:
    • run_name - Mandatory string argument that describes the run.
    • cuda - Whether to use GPUs for training, defaults to True.
    • gpus - Number of GPUs to use, defaults to 1.
    • precision - Whether to use 32 bit or 16 bit floating points for the model. Warning: 16 bit is buggy. Defaults to 32.
    • seed - Random seed, defaults to 42.
    • parallel_engine - PyTorch parallelisation algorithm to use. Defaults to 'ddp', meaning DistributedDataParallel.
  • Dataset Options
    • dataset - Can be either 'camels-gb' or 'shipston' to choose the dataset. Default is 'shipston'.
    • dataset.features - Dictionary where the keys are feature types and the values are lists of string feature names to use for the model. Defaults to {'timeseries': ['precipitation', 'temperature']} for dataset=shipston to use the features for which we have the most data at Shipston. For dataset=camels-gb, defaults to {'timeseries': ['precipitation', 'temperature', 'windspeed', 'shortwave_rad']} - additional CAMELS-GB features can be added from the full list of features below. It is better to change this in the config files rather than on the command line.
    • dataset.seq_length - Number of previous days of meteorological data to use for one prediction, defaults to 365.
    • dataset.train_test_split - Split date to separate the data into the train and test sets, defaults to '2010' meaning 01-01-2010. You can pass a string in DD-MM-YYYY or YYYY formats.
    • dataset.num_workers - Number of subprocesses to use for data loading, defaults to 8.
    • dataset.basins_frac - Only available for dataset=camels-gb, this specifies the fraction of basins that will be combined to create the dataset. Defaults to 0.1 meaning 10% since the full dataset requires roughly 100 GB of memory.
    • dataset.test_end_date - Only available for dataset=shipston, this specifies the end date of the test dataset, can go up to '2020'. Override this value as a string. Defaults to '2020'.
  • Training Options
    • epochs - Number of training epochs, defaults to 200.
    • batch_size - Size of training batches, defaults to 256.
    • learning_rate - Learning rate for the Adam optimiser, defaults to 3e-3.
    • checkpoint_freq - How many epochs we should train for before checkpointing the model, defaults to 1.
    • val_interval - If this is a float, it is the proportion of the training set that should go between validation epochs. If this is an int, it denotes the number of batches in between validation epochs. Defaults to 0.25, meaning 4 validation epochs per training epoch.
    • log_steps - How many gradient updates between each log point, defaults to 20.
    • date_range - Custom date range for the training dataset to override the default range of 1970 to dataset.train_test_split, as a list of two strings (same formats as dataset.train_test_split).
    • mc_dropout - Boolean that decides whether or not to use MC Dropout to plot output uncertainty. Defaults to False.
    • mc_dropout_iters - Number of forward passes to use with MC dropout to get uncertainty, defaults to 20. Increase up to 100 to tighten uncertainty bounds but be aware this can produce memory errors.
  • Model Options
    • model - Can be either 'lstm' or 'conv' to choose between the LSTM or 1D convolutional model. Default is 'lstm'.
    • LSTM Model Options:
      • model.num_layers - Number of layers in the LSTM, defaults to 10.
      • model.hidden_units - Number of hidden units/LSTM cells per layer, defaults to 100.
      • model.dropout_rate - Dropout probability, where the dropout is applied to the dense layer after the LSTM. Defaults to 0.2.
    • Convolutional Model Options:
      • model.wavenet_kernel_size - Size of the convolution kernel in the WaveNet model, defaults to 3.
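To make `dataset.seq_length` concrete: each discharge prediction consumes a window of the previous `seq_length` days of forcing data, so the dataset is effectively a stack of sliding windows. A sketch with toy arrays (the names here are illustrative, not the repo's actual Dataset API):

```python
import numpy as np

seq_length = 5                           # repo default is 365
forcings = np.arange(20).reshape(10, 2)  # 10 days x 2 features (toy data)
# One training sample per prediction day: the preceding seq_length days.
windows = np.stack(
    [forcings[i:i + seq_length] for i in range(len(forcings) - seq_length)]
)
print(windows.shape)  # (5, 5, 2): samples x seq_length x features
```

This is also why memory grows quickly with `dataset.basins_frac` on CAMELS-GB: every basin contributes one window per prediction day.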

2.2 Dataset and features

Shipston dataset

The default dataset consists of average daily temperature, precipitation (mm/day), and catchment discharge (m^3/s) time series from 1985 to the end of 2019 for the Shipston river basin. By default the model trains on 1985-2010 and predicts daily average discharge volume for 2010-2019.

CAMELS-GB dataset

The CAMELS-GB dataset will also automatically download to src/data/CAMELS-GB/ using a Dropbox link if it is chosen using the command line option. Each basin in CAMELS-GB has data from 8 different meteorological timeseries, as well as many more static basin attributes that have a constant scalar value for the entire basin.

Below is a full list of features that can be included in the model with this dataset. These are the only features that can be included in the dataset.features config option. Full descriptions of all these features can be found in the CAMELS-GB supplementary material.

  • Timeseries Features (daily averages)
    • precipitation
    • pet
    • temperature
    • peti
    • humidity
    • shortwave_rad
    • longwave_rad
    • windspeed
  • Climatic Features
    • p_mean
    • pet_mean
    • aridity
    • p_seasonality
    • frac_snow
    • high_prec_freq
    • high_prec_dur
    • low_prec_freq
    • low_prec_dur
  • Human Influence Features
    • num_reservoir
    • reservoir_cap
  • Hydrogeology Features
    • inter_high_perc
    • inter_mod_perc
    • inter_low_perc
    • frac_high_perc
    • frac_mod_perc
    • frac_low_perc
    • no_gw_perc
    • low_nsig_perc
    • nsig_low_perc
  • Hydrologic Features
    • q_mean
    • runoff_ratio
    • stream_elas
    • baseflow_index
    • baseflow_index_ceh
    • hfd_mean
    • Q5
    • Q95
    • high_q_freq
    • high_q_dur
    • low_q_freq
    • low_q_dur
    • zero_q_freq
  • Land-cover Features
    • dwood_perc
    • ewood_perc
    • grass_perc
    • shrub_perc
    • crop_perc
    • urban_perc
    • inwater_perc
    • bares_perc
    • dom_land_cover
  • Topographic Features
    • gauge_lat
    • gauge_lon
    • gauge_easting
    • gauge_northing
    • gauge_elev
    • area
    • elev_min
    • elev_10
    • elev_50
    • elev_90
    • elev_max
    • dpsbar
    • elev_mean
  • Soil Features
    • sand_perc
    • sand_perc_missing
    • silt_perc
    • silt_perc_missing
    • clay_perc
    • clay_perc_missing
    • organic_perc
    • organic_perc_missing
    • bulkdens
    • bulkdens_missing
    • bulkdens_5
    • bulkdens_50
    • bulkdens_95
    • tawc
    • tawc_missing
    • tawc_5
    • tawc_50
    • tawc_95
    • porosity_cosby
    • porosity_cosby_missing
    • porosity_cosby_5
    • porosity_cosby_50
    • porosity_cosby_95
    • porosity_hypres
    • porosity_hypres_missing
    • porosity_hypres_5
    • porosity_hypres_50
    • porosity_hypres_95
    • conductivity_cosby
    • conductivity_cosby_missing
    • conductivity_cosby_5
    • conductivity_cosby_50
    • conductivity_cosby_95
    • conductivity_hypres
    • conductivity_hypres_missing
    • conductivity_hypres_5
    • conductivity_hypres_50
    • conductivity_hypres_95
    • root_depth
    • root_depth_missing
    • root_depth_5
    • root_depth_50
    • root_depth_95
    • soil_depth_pelletier
    • soil_depth_pelletier_missing
    • soil_depth_pelletier_5
    • soil_depth_pelletier_50
    • soil_depth_pelletier_95

flood_risk_shipston's People

Contributors

croydon-brixton, herbiebradley, ira-shokar, luke-scot, mataln, sdat2, shmh40


flood_risk_shipston's Issues

Identifying data sources for the inputs of LSTM model

https://hess.copernicus.org/articles/23/5089/2019/hess-23-5089-2019.pdf

Table 4 of this paper contains the input features ranked by importance. It would be great to use their model to predict the flow rate, etc., of our system, either by training their LSTM with a subsection of our data, or by using their LSTM already pretrained on American basins (question of transferability here - anyone with transfer learning experience?).

So we would like to identify data sources for these inputs.

Likely shared with Luke's issue...

Get extra data

  • Centre for Environmental Data Analysis (CEDA)
  • Satellite data (Google Earth Engine, ...)

Method of converting to datetime from date and time is slow

Current solution is:

```python
import datetime

date_time = []
for i in range(len(dataframe['Time'])):
    time = dataframe['Time'][i]
    date = datetime.datetime.utcfromtimestamp(
        dataframe['Date'].to_numpy()[i].tolist() / 1e9).date()
    date_time.append(datetime.datetime.combine(date, time))

dataframe = dataframe.assign(datetime=date_time)
dataframe = dataframe.set_index('datetime')
```

This is very slow because it converts each row individually in a Python loop.
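A vectorised alternative avoids the Python-level loop entirely. This is a sketch; it assumes (as the snippet above suggests) that `Date` is a datetime64 column and `Time` holds `datetime.time` objects:

```python
import pandas as pd

# Toy frame standing in for the real data.
dataframe = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "Time": pd.to_datetime(["06:30", "18:45"]).time,
})

# Convert the times to Timedeltas and add them to the dates in one
# vectorised operation instead of combining row by row.
offsets = pd.to_timedelta([t.isoformat() for t in dataframe["Time"]])
dataframe = dataframe.set_index(dataframe["Date"] + offsets).rename_axis("datetime")
print(dataframe.index[0])  # 2020-01-01 06:30:00
```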

Shipston static basin attributes required for LSTM

As well as forcing data, it would be good to have the following static (unchanging over the years) basin attributes.
We would like these to be averaged over the whole Shipston basin.
We have not yet prepared the LSTM for these, so this is lower priority atm! Plus, I suspect some of these will be very tricky to get hold of!

Potential to be added to the model

  • Mean daily precipitation.
  • aridity: Ratio of mean PET to mean precipitation.
  • elev_mean: Catchment mean elevation.
  • high_prec_dur: Average duration of high-precipitation events (number of consecutive days with ≥ 5 times mean daily precipitation).
  • frac_snow_daily: Fraction of precipitation falling on days with temperatures below 0 °C.
  • high_prec_freq: Frequency of high-precipitation days (≥ 5 times mean daily precipitation).
  • slope_mean: Catchment mean slope.
  • geol_permeability: Surface permeability (log10).
  • carb_rocks_frac: Fraction of the catchment area characterized as “Carbonate sedimentary rocks”.
  • clay_frac: Fraction of clay in the soil.
  • Mean daily potential evapotranspiration.
  • low_prec_freq: Frequency of dry days (< 1 mm d−1).
  • soil_depth_pelletier: Depth to bedrock (maximum 50 m).
  • p_seasonality: Seasonality and timing of precipitation. Estimated by representing annual precipitation and temperature as sine waves. Positive (negative) values indicate precipitation peaks during the summer (winter). Values of approx. 0 indicate uniform precipitation throughout the year.
  • forest_frac: Forest fraction.
  • sand_frac: Fraction of sand in the soil.
  • soil_conductivity: Saturated hydraulic conductivity.
  • low_prec_dur: Average duration of dry periods (number of consecutive days with precipitation < 1 mm d−1).
  • gvf_max: Maximum monthly mean of green vegetation fraction.
  • gvf_diff: Difference between the maximum and minimum monthly mean of the green vegetation fraction.
  • lai_diff: Difference between the maximum and minimum monthly mean of the leaf area index.
  • soil_porosity: Volumetric porosity.
  • soil_depth_statsgo: Soil depth (maximum 1.5 m).
  • lai_max: Maximum monthly mean of leaf area index.
  • silt_frac: Fraction of silt in the soil.
  • max_water_content: Maximum water content of the soil.
  • area_gages2: Catchment area.

Group Literature Review

A few papers have been passed around the team, but it would be good if we could synthesise this into either a wiki or webpage that tries to link the various papers together by theme, even if quite simplistically.

Shipston forcing data required for LSTM

We would like forcing data for Shipston for the following features.
Ideally we would have an average daily value for each feature over the whole Shipston basin, from ~1970 to 2020.

Currently in the model:

  • Total precipitation
  • Air temperature
  • Potential evaporation
  • Surface downward shortwave radiation
  • Specific humidity

Potential to be added to the model

  • Surface downward longwave radiation
  • Potential energy
  • Surface pressure
  • Convective fraction
  • u wind component (zonal wind, W-E)
  • v wind component (meridional wind, N-S)

Find (& request for) other data sources

  • CEDA (various climate/environment datasets)
  • LiDAR (topography)
  • Historic rainfall
  • Soil moisture
  • Land Use: UK gov land use registry/Aerial Photography
  • Building since beginning of measurements?
  • LSTM model data from US training (Seb)

Train LSTM model on all 671 basins in CAMELS-GB dataset

Train LSTM model on all basins in the dataset - then use it to predict discharge in our Shipston basin.

To predict for the Shipston basin we need the data specified in issues #17 and #18.

Then we can use this model to predict discharge of the river for 2019/20, and compare to the discharges observed (which will hopefully be smaller due to the natural flood defences!)
